## Data Collection

To build a model, I will need a large number of plot summaries. I have decided to use
Wikipedia as the source, since its film articles have lengthy plot summaries and its
category lists of films make them easy to search.

To begin with, I am testing individual queries to Wikipedia, to make sure I can pull
multiple entries from a list.

In [None]:
import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

TITLE = "Category:1961 films|Category:1965 films"

PARAMS = {
    'action': "query",
    'list': 'categorymembers',
    'cmtitle': TITLE,
    'cmlimit': '10',
    'format': "json",
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

In [None]:
# Inspect the API error message, if one was returned (None otherwise)
print(DATA.get('error'))
#[item['title'] for item in DATA['query']['categorymembers']]
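
As far as I can tell, `cmtitle` accepts a single category title, so a pipe-separated value such as the one above is not read as two categories. A minimal sketch of one way around this, assuming two hypothetical helpers (`category_params` and `member_titles`, neither of which is in the original notebook): build the parameters for one category at a time and check each response for an `error` entry before reading `query`.

```python
def category_params(title, limit=10):
    """Build categorymembers query parameters for a single category title."""
    return {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': title,
        'cmlimit': str(limit),
        'format': 'json',
    }


def member_titles(data):
    """Return member titles from a decoded response, or raise on an API error."""
    if 'error' in data:
        # API-level problems are reported inside the JSON body
        raise ValueError(data['error'].get('info', 'unknown API error'))
    return [item['title'] for item in data['query']['categorymembers']]


# Hypothetical usage with the session and URL defined earlier:
# for title in ("Category:1961 films", "Category:1965 films"):
#     print(member_titles(S.get(url=URL, params=category_params(title)).json()))
```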

In [None]:
years = []
URL = "https://en.wikipedia.org/w/api.php"

for y in range(1960,1965):
    TITLE = "Category:" + str(y) + " films"

    PARAMS = {
        'action': "query",
        'list': 'categorymembers',
        'cmtitle': TITLE,
        'cmlimit': '10',
        'format': "json",
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    # Keep the page ids, skipping the first two members returned for each category
    ids = [item['pageid'] for item in DATA['query']['categorymembers'][2:]]
    years.append(ids)
years

In [None]:
DATA['query']

In [None]:
DATA['parse']

It is difficult to extract plaintext from the page content that the standard API returns,
so I am testing out some wrappers for the API, which may provide extra functionality.
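
For comparison, a minimal sketch of requesting plaintext straight from the API, assuming the TextExtracts extension (`prop=extracts` with `explaintext`) is enabled on the wiki, as I believe it is on English Wikipedia. The helper names `extract_params` and `plaintext_from_response` are hypothetical and only illustrate the shape of the request and response.

```python
def extract_params(pageid):
    """Build parameters asking for a page's content as plain text."""
    return {
        'action': 'query',
        'prop': 'extracts',
        'explaintext': 1,          # plain text instead of limited HTML
        'pageids': str(pageid),
        'format': 'json',
    }


def plaintext_from_response(data, pageid):
    """Pull the extract for one page id out of a decoded JSON response."""
    # With format=json, pages are keyed by the page id as a string
    return data['query']['pages'][str(pageid)].get('extract', '')


# Hypothetical usage with the session and URL defined earlier:
# data = S.get(url=URL, params=extract_params(years[3][6])).json()
# fulltext = plaintext_from_response(data, years[3][6])
```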

In [None]:
from mediawiki import MediaWiki
import wikipedia

query = "Category:1961 films"

c = wikipedia.page(pageid=years[3][6])
c.title

In [None]:
fulltext = c.content

In [None]:
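import string
heading = '== Plot =='
ind1 = fulltext.find(heading)  # -1 if the page has no plot section
ind2 = fulltext.find('==', ind1 + len(heading))
# Take the text up to the next heading, or to the end of the page if there is none
end = ind2 if ind2 != -1 else len(fulltext)
plottext = fulltext[ind1 + len(heading):end] if ind1 != -1 else ''
plottext.replace('\n',' ').translate(str.maketrans('','',string.punctuation))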

The `wikipedia` package automatically returns a field called `content` that contains the page as plaintext,
so I am going to use that for the specific pages to make extracting the plot easier.
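
A minimal sketch of how the plot extraction could be wrapped in a reusable function, assuming section headings in `content` look like `== Plot ==` with possibly varying spacing. The name `extract_plot` is hypothetical, and unlike the cell above it reads up to the next top-level heading, so text under any `===` subheadings stays part of the plot.

```python
import re
import string

# Matches a top-level heading line such as "== Plot ==" but not "=== Sub ==="
HEADING = re.compile(r'^==[ \t]*([^=\n].*?)[ \t]*==[ \t]*$', re.MULTILINE)


def extract_plot(fulltext, section='Plot'):
    """Return the cleaned text of the named section, or None if it is absent."""
    headings = list(HEADING.finditer(fulltext))
    for i, match in enumerate(headings):
        if match.group(1).lower() != section.lower():
            continue
        start = match.end()
        # The section ends at the next top-level heading, or at the end of the page
        end = headings[i + 1].start() if i + 1 < len(headings) else len(fulltext)
        body = fulltext[start:end].replace('\n', ' ')
        body = body.translate(str.maketrans('', '', string.punctuation))
        return ' '.join(body.split())  # collapse repeated whitespace
    return None
```

Returning `None` for pages without a plot section makes it easy to drop them before building the dataset.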

In [44]:
import datetime
import time
from wikiparse_movies import WikiParser

In [34]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


I have created a `WikiParser` object with all the functionality I need to make calls to the Wikipedia API, and now I am running it to get plots from the years 1960-1965.

In [None]:
now = datetime.datetime.now()

wp = WikiParser()
yrs_1960_1970 = wp.get_years(1960,1965)

later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed) 
#year_1960 = movies
print(len(yrs_1960_1970))

In [None]:
import pandas as pd

yrs1_df = pd.DataFrame.from_dict(yrs_1960_1970)
yrs1_df.head()

In [None]:
yrs1_df = yrs1_df[~yrs1_df['title'].str.startswith('List')]

After some examination, there are a few problems with this data. The language used to describe foreign films is somewhat distinctive and tends to throw off the topic modeling, so it is necessary to refine the dataset by using only English-language films. Fortunately, this is a specific category on Wikipedia, so instead of searching year by year, I only need to search for English-language films.
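
A category this large cannot be fetched in one request, so the results have to be paged. The internals of `WikiParser` are not shown here; the following is only a sketch of the API's standard continuation pattern, assuming a hypothetical generator `iter_category_members` and a caller-supplied `fetch` function that performs the HTTP request.

```python
def iter_category_members(fetch, title, limit=500):
    """Yield every member of a category, following continuation tokens.

    `fetch` is any callable that takes a params dict and returns the decoded
    JSON response, e.g. lambda p: S.get(url=URL, params=p).json().
    """
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': title,
        'cmlimit': str(limit),
        'format': 'json',
    }
    while True:
        data = fetch(params)
        yield from data['query']['categorymembers']
        if 'continue' not in data:
            break  # no further pages
        # Merge the continuation tokens (such as cmcontinue) into the next request
        params = {**params, **data['continue']}
```

Passing `fetch` in as an argument keeps the paging logic separate from the network call, which also makes it possible to exercise the loop with canned responses.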

In [64]:
#430
#

now = datetime.datetime.now()

wp = WikiParser()
english_films_2 = wp.get_plots_from_year('English-language',start=431,skip=True)

later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed) 
#year_1960 = movies
print(len(english_films_2))

page: 431 parsing... . . . . . . . . .  
page: 432 parsing... . . . . . . . . .  
page: 433 parsing... . . . . . . . . .  
page: 434 parsing... . . . . . . . . .  
page: 435 parsing... . . . . . . . . .  
page: 436 parsing... . . . . . . . . .  
page: 437 parsing... . . . . . . . . .  
page: 438 parsing... . . . . . . . . .  
page: 439 parsing... . . . . . . . . .  
page: 440 parsing... . . . . . . . . .  
page: 441 parsing... . . . . . . . . .  
page: 442 parsing... . . . . . . . . .  
page: 443 parsing... . . . . . . . . .  
page: 444 parsing... . . . . . . . . .  
page: 445 parsing... . . . . . . . . .  
page: 446 parsing... . . . . . . . . .  
page: 447 parsing... . . . . . . . . .  
page: 448 parsing... . . . . . . . . .  
page: 449 parsing... . . . . . . . . .  
page: 450 parsing... . . . . . . . . .  
page: 451 parsing... . . . . . . . . .  
page: 452 parsing... . . . . . . . . .  
page: 453 parsing... . . . . . . . . .  
page: 454 parsing... . . . . . . . . .  
page: 455 parsin

In [65]:
yrs4_df = pd.DataFrame(english_films_2)
yrs4_df.head()

| | summary | title |
|---|---|---|
| 0 | nn and Raggedy Andy is a tworeel cartoon produ... | Raggedy Ann and Raggedy Andy (1941 film) |
| 1 | Nita a divorced mother of two boys works as a... | Raggedy Man |
| 2 | The film centres around the character of Tom ... | The Raggedy Rawney |
| 3 | In 1964 an aging overweight Italian American ... | Raging Bull |
| 4 | Bruce Pritchard Malcolm McDowell is a 24yearo... | The Raging Moon |


In [66]:
yrs4_df.shape

(15821, 2)

Now that I have the dataset compiled, I will save it as a JSON file for temporary storage until I can upload it to MongoDB.
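
Since the data is headed for a document store, one alternative layout is one JSON document per line, with each film as its own record, rather than the single column-oriented object that `DataFrame.to_json` writes by default. A minimal sketch using only the standard library; the helper names `write_jsonl` and `read_jsonl` are hypothetical, and the records are assumed to come from something like `yrs4_df.to_dict('records')`.

```python
import json


def write_jsonl(records, path):
    """Write one JSON document per line and return the number written."""
    count = 0
    with open(path, 'w', encoding='utf-8') as handle:
        for record in records:
            # ensure_ascii=False keeps accented film titles readable in the file
            handle.write(json.dumps(record, ensure_ascii=False) + '\n')
            count += 1
    return count


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as handle:
        return [json.loads(line) for line in handle if line.strip()]
```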

In [67]:
yrs4_df.to_json('data/eng_431_601.json')

In [None]:
movies1_df = pd.read_json('1960_1964.json')
movies2_df = pd.read_json('1965_1970.json')
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
movies_df = pd.concat([movies1_df, movies2_df])
print(movies_df.shape)
movies_df.head()