## Data Collection

To build a model, I need a large number of plot summaries. I have decided to use
Wikipedia as the source, since its film articles include lengthy plot summaries and its
lists of movies make them easy to search.

To begin with, I am testing individual queries to Wikipedia to make sure I can pull
multiple entries from a list.

In [None]:
import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

TITLE = "Category:1961 films|Category:1965 films"

PARAMS = {
    'action': "query",
    'list': 'categorymembers',
    'cmtitle': TITLE,
    'cmlimit': '10',
    'format': "json",
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

In [None]:
[item['title'] for item in DATA['query']['categorymembers']]
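
The `categorymembers` list returns at most `cmlimit` results per request, so larger categories have to be read page by page. Below is a minimal sketch of following the continuation token, assuming the standard MediaWiki `continue` / `cmcontinue` response fields. The names `iter_category_members`, `fetch`, and `fake_fetch` are hypothetical; `fetch` stands in for `S.get(url=URL, params=...).json()` so the loop can be tried offline.

```python
def iter_category_members(fetch, cmtitle, cmlimit=10):
    """Yield category members page by page, following the continuation token.

    `fetch` is a hypothetical callable that takes a params dict and returns
    the decoded JSON response, e.g. lambda p: S.get(url=URL, params=p).json().
    """
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': cmtitle,
        'cmlimit': str(cmlimit),
        'format': 'json',
    }
    while True:
        data = fetch(dict(params))
        yield from data['query']['categorymembers']
        # The API signals more results with a top-level 'continue' object.
        if 'continue' not in data:
            break
        params.update(data['continue'])


# Offline stand-in for the API: two pages of made-up results.
def fake_fetch(params):
    if 'cmcontinue' not in params:
        return {
            'query': {'categorymembers': [{'pageid': 1, 'title': 'Film A'}]},
            'continue': {'cmcontinue': 'page|2', 'continue': '-||'},
        }
    return {'query': {'categorymembers': [{'pageid': 2, 'title': 'Film B'}]}}


titles = [m['title'] for m in iter_category_members(fake_fetch, 'Category:1961 films')]
print(titles)
```

With the real session, the same generator could be called as `iter_category_members(lambda p: S.get(url=URL, params=p).json(), TITLE)`.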

In [121]:
years = []
URL = "https://en.wikipedia.org/w/api.php"

for y in range(1960,1965):
    TITLE = "Category:" + str(y) + " films"

    PARAMS = {
        'action': "query",
        'list': 'categorymembers',
        'cmtitle': TITLE,
        'cmlimit': '10',
        'format': "json",
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    ids = [item['pageid'] for ind,item in enumerate(DATA['query']['categorymembers']) if ind>1]
    years.append(ids)
years
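
Note that `range(1960,1965)` stops at 1964. If the final year is meant to be included, a small helper with an inclusive end makes that explicit. This is only a sketch; `category_titles` is a hypothetical name, not part of the notebook.

```python
def category_titles(first_year, last_year):
    """Build one category title per year, including last_year itself."""
    # range() excludes its stop value, so add 1 to cover the final year.
    return ["Category:" + str(y) + " films" for y in range(first_year, last_year + 1)]


titles = category_titles(1960, 1965)
print(titles[0], '...', titles[-1])
```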

In [122]:
len(sci_fi)

600

In [None]:
DATA['parse']

The API returns the page content in HTML format, which is a hassle to parse.
To get around this, I am testing some wrappers for the API, which may provide extra functionality.

In [None]:
from mediawiki import MediaWiki
import wikipedia

query = "Category:1961 films"

c = wikipedia.page(pageid=years[3][6])
c.title

In [None]:
fulltext = c.content

In [None]:
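import string

# The Plot section starts after its heading and ends at the next '==' heading.
heading = '== Plot =='
start = fulltext.find(heading)
# str.find returns -1 when the heading is missing; yield an empty plot in that case.
ind1 = start + len(heading) if start != -1 else len(fulltext)
ind2 = fulltext.find('==',ind1)
plottext = fulltext[ind1:ind2]
plottext.replace('\n',' ').translate(str.maketrans('','',string.punctuation))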

The `wikipedia` package automatically returns a field called `content` that contains the page as plaintext,
so I am going to use that for the specific pages to make extracting the plot easier.
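
The slicing approach can be wrapped in a function that also covers two edge cases. When a page has no `== Plot ==` heading, `str.find` returns -1, so adding 10 starts the slice at character 9 of the page; this may be why some summaries further down appear to start mid-word. When Plot is the last section, there is no following `==` to stop at. The sketch below is one way to handle both; `extract_plot` is a hypothetical name and the sample text is made up.

```python
import string


def extract_plot(fulltext, heading='== Plot =='):
    """Return the cleaned text of the Plot section, or None if it is absent."""
    start = fulltext.find(heading)
    if start == -1:
        # No Plot heading: report that instead of slicing from a bogus offset.
        return None
    start += len(heading)
    end = fulltext.find('==', start)
    if end == -1:
        # Plot is the last section, so read to the end of the page.
        end = len(fulltext)
    plot = fulltext[start:end]
    # Flatten newlines and remove punctuation, then trim surrounding spaces.
    plot = plot.replace('\n', ' ').translate(str.maketrans('', '', string.punctuation))
    return plot.strip()


sample = "Intro text.\n\n== Plot ==\nA hero, lost at sea, returns.\n\n== Cast ==\nSomeone"
print(extract_plot(sample))
print(extract_plot("A page with no plot section."))
```

Returning `None` for pages without a plot makes it easy to drop them later instead of storing a truncated introduction as the summary.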

In [44]:
import datetime
import time
from wikiparse_movies import WikiParser

In [34]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


I have created a `WikiParser` object with all the functionality I need to make calls to the Wikipedia API, and I am now running it to get plots from the years 1960-1965. This takes hours to run, so running this cell is not advisable.

In [None]:
now = datetime.datetime.now()

wp = WikiParser()
yrs_1960_1970 = wp.get_years(1960,1965)

later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed) 
#year_1960 = movies
print(len(yrs_1960_1970))

In [None]:
import pandas as pd

yrs1_df = pd.DataFrame.from_dict(yrs_1960_1970)
yrs1_df.head()

In [None]:
yrs1_df = yrs1_df[~yrs1_df['title'].str.startswith('List')]

After some examination, there are a few problems with this data. The language used to describe foreign films is somewhat unique and tends to throw off the topic modeling, so it is necessary to refine the dataset by using only English-language films. Fortunately, this is a specific category on Wikipedia, so instead of searching by year, I only need to search for English-language films. However, there are so many entries that, rather than risk the function breaking in the middle and losing hours of work, I am setting it to create a backup every 10 pages. I will also collect the data in two halves, so that each half completes faster.
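
The backup-every-10-pages idea can be sketched as follows. This is not the actual `WikiParser` code, whose internals are not shown here; `collect_with_backups`, `save_backup`, and `fake_parse` are hypothetical names, and the parser is replaced by a stand-in so the sketch runs offline.

```python
import json
import os
import tempfile


def collect_with_backups(pages, parse_page, backup_path, every=10):
    """Parse pages in order, writing all results so far to disk every `every` pages."""
    results = []
    for count, page in enumerate(pages, start=1):
        results.extend(parse_page(page))
        if count % every == 0:
            save_backup(results, backup_path)
    # Final save so a trailing partial batch is not lost.
    save_backup(results, backup_path)
    return results


def save_backup(results, backup_path):
    """Write to a temporary file first, then swap it in, so a crash mid-write
    cannot corrupt the previous backup."""
    tmp_path = backup_path + '.tmp'
    with open(tmp_path, 'w') as f:
        json.dump(results, f)
    os.replace(tmp_path, backup_path)


# Offline stand-in for a page parser: each "page" yields one record.
def fake_parse(page):
    return [{'title': 'Film %d' % page, 'summary': 'plot text'}]


backup_file = os.path.join(tempfile.mkdtemp(), 'backup.json')
films = collect_with_backups(range(25), fake_parse, backup_file, every=10)
print(len(films))
```

If a run fails partway, at most the last `every - 1` pages have to be fetched again, at the cost of rewriting the backup file periodically.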

In [None]:
now = datetime.datetime.now()

wp = WikiParser()
english_films_1 = wp.get_plots_from_year('English-language',start=0,end=430)

later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed) 
print(len(english_films_1))

In [64]:

now = datetime.datetime.now()

wp = WikiParser()
english_films_2 = wp.get_plots_from_year('English-language',start=431,skip=True)

later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed) 
print(len(english_films_2))

page: 431 parsing... . . . . . . . . .  
page: 432 parsing... . . . . . . . . .  
page: 433 parsing... . . . . . . . . .  
page: 434 parsing... . . . . . . . . .  
page: 435 parsing... . . . . . . . . .  
page: 436 parsing... . . . . . . . . .  
page: 437 parsing... . . . . . . . . .  
page: 438 parsing... . . . . . . . . .  
page: 439 parsing... . . . . . . . . .  
page: 440 parsing... . . . . . . . . .  
page: 441 parsing... . . . . . . . . .  
page: 442 parsing... . . . . . . . . .  
page: 443 parsing... . . . . . . . . .  
page: 444 parsing... . . . . . . . . .  
page: 445 parsing... . . . . . . . . .  
page: 446 parsing... . . . . . . . . .  
page: 447 parsing... . . . . . . . . .  
page: 448 parsing... . . . . . . . . .  
page: 449 parsing... . . . . . . . . .  
page: 450 parsing... . . . . . . . . .  
page: 451 parsing... . . . . . . . . .  
page: 452 parsing... . . . . . . . . .  
page: 453 parsing... . . . . . . . . .  
page: 454 parsing... . . . . . . . . .  
page: 455 parsin

In [65]:
eng1_df = pd.DataFrame(english_films_1)
eng2_df = pd.DataFrame(english_films_2)
eng2_df.head()

Unnamed: 0,summary,title
0,nn and Raggedy Andy is a tworeel cartoon produ...,Raggedy Ann and Raggedy Andy (1941 film)
1,Nita a divorced mother of two boys works as a...,Raggedy Man
2,The film centres around the character of Tom ...,The Raggedy Rawney
3,In 1964 an aging overweight Italian American ...,Raging Bull
4,Bruce Pritchard Malcolm McDowell is a 24yearo...,The Raging Moon


In [66]:
eng2_df.shape

(15821, 2)

Now that I have the dataset compiled, I will save it as a JSON file for temporary storage until I can upload it to MongoDB.
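
With its default orientation, `DataFrame.to_json` keys each column by the row index, and JSON object keys are always strings. An alternative layout is a plain list of records, which keeps row order without relying on index keys (pandas offers this as `orient='records'`). The sketch below shows that layout using only the standard library; `save_records` and `load_records` are hypothetical helpers and the records are placeholders.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Write a list of {'title': ..., 'summary': ...} dicts as a JSON array."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII titles readable in the file.
        json.dump(records, f, ensure_ascii=False)


def load_records(path):
    """Read the JSON array back; list order is preserved."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


records = [
    {'title': 'Raging Bull', 'summary': 'placeholder summary'},
    {'title': 'Léon: The Professional', 'summary': 'placeholder summary'},
]
path = os.path.join(tempfile.mkdtemp(), 'films.json')
save_records(records, path)
restored = load_records(path)
print(restored == records)
```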

In [67]:
# eng1_df.to_json('data/eng_0_430.json')
# eng2_df.to_json('data/eng_431_601.json')

In [137]:
movies1_df = pd.read_json('themeter/dev/data/eng_1_430.json')
movies2_df = pd.read_json('themeter/dev/data/eng_431_601.json')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
movies = pd.concat([movies1_df, movies2_df])
print(movies1_df.shape)
print(movies2_df.shape)
movies.head()

(40184, 2)
(15821, 2)


Unnamed: 0,summary,title
0,nown as 1 Life On The Limit is a 2013 document...,1 (2013 film)
1,While on the phone with his girlfriend Jill w...,+1 (film)
10,Gully Mercer Mick Rossi leads a group of prol...,2:22 (2008 film)
100,o Christmas is a Ghanaian drama movie about Re...,6 Hours To Christmas
1000,r is a 1995 American romantic drama television...,The Affair (1995 film)


## Storing with MongoDB

Now that I have the data backed up locally, I will attempt to store it remotely with Mongo.

In [130]:
from pymongo import MongoClient
from pprint import pprint
import json

with open('/Users/alexanderbailey/.secret/mongo_creds.json','r') as f:
    params = json.load(f)

url = 'mongodb+srv://zmbailey:' + params['password'] + '@cluster0-ykzgc.mongodb.net/test?retryWrites=true&w=majority'

client = MongoClient(url)
db=client.admin

serverStatusResult=db.command("serverStatus")
pprint(serverStatusResult)

{'$clusterTime': {'clusterTime': Timestamp(1562004936, 1),
                  'signature': {'hash': b'\x85Y\xca\x85\xbf\xa3\xb9\xa8'
                                        b'\x8c\x08\n\xe60 \x0fP\xe6vC\x7f',
                                'keyId': 6703621046206988289}},
 'connections': {'available': 98, 'current': 2, 'totalCreated': 26},
 'extra_info': {'note': 'fields vary by platform', 'page_faults': 0},
 'host': 'cluster0-shard-00-01-ykzgc.mongodb.net:27017',
 'localTime': datetime.datetime(2019, 7, 1, 18, 15, 38, 943000),
 'mem': {'bits': 64,
         'mapped': 0,
         'mappedWithJournal': 0,
         'resident': 0,
         'supported': True,
         'virtual': 0},
 'metrics': {'atlas': {'bytesInWrites': 0,
                       'connectionPool': {'totalCreated': 278}}},
 'network': {'bytesIn': 35770, 'bytesOut': 511897, 'numRequests': 188},
 'ok': 1.0,
 'opcounters': {'command': 183,
                'delete': 0,
                'getmore': 0,
                'insert': 0,
  

In [131]:
mongodb = client.movieplots
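
The connection string above concatenates the password directly into the URI. If the password contains reserved characters such as `@`, `:` or `/`, the URI will not parse as intended, so percent-escaping the credentials first is safer. Below is a minimal sketch using the standard library's `quote_plus`; `build_mongo_uri` is a hypothetical helper, and the user, password, and host are made-up placeholders.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, database='test'):
    """Assemble a mongodb+srv URI with percent-escaped credentials."""
    return 'mongodb+srv://{}:{}@{}/{}?retryWrites=true&w=majority'.format(
        quote_plus(user), quote_plus(password), host, database)


# Hypothetical credentials and host, for illustration only.
uri = build_mongo_uri('someuser', 'p@ss/word', 'cluster.example.net')
print(uri)
```

The resulting string can be passed to `MongoClient` in the same way as `url` above.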

In [134]:
#mon = mongodb.movies.insert_many(movies.to_dict('records'))
print(mon.inserted_ids[:10])

[ObjectId('5d1a4e861293e4051fb185ae'), ObjectId('5d1a4e861293e4051fb185af'), ObjectId('5d1a4e861293e4051fb185b0'), ObjectId('5d1a4e861293e4051fb185b1'), ObjectId('5d1a4e861293e4051fb185b2'), ObjectId('5d1a4e861293e4051fb185b3'), ObjectId('5d1a4e861293e4051fb185b4'), ObjectId('5d1a4e861293e4051fb185b5'), ObjectId('5d1a4e861293e4051fb185b6'), ObjectId('5d1a4e861293e4051fb185b7')]
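
With tens of thousands of documents, it may be more convenient to upload in batches, so that a failure partway through leaves a known number of complete batches behind. The sketch below assumes only that the collection exposes pymongo's `insert_many()` and that its result has `inserted_ids`; `batches`, `insert_in_batches`, and the fake collection are hypothetical names used so the example runs without a database.

```python
def batches(records, size=1000):
    """Yield consecutive slices of `records`, each at most `size` long."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def insert_in_batches(collection, records, size=1000):
    """Insert records batch by batch and return all inserted ids.

    `collection` is assumed to expose pymongo's insert_many(), e.g. mongodb.movies.
    """
    inserted = []
    for batch in batches(records, size):
        inserted.extend(collection.insert_many(batch).inserted_ids)
    return inserted


# Offline stand-ins for a collection and its insert result.
class FakeResult:
    def __init__(self, ids):
        self.inserted_ids = ids


class FakeCollection:
    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        first = len(self.docs)
        self.docs.extend(documents)
        return FakeResult(list(range(first, len(self.docs))))


fake = FakeCollection()
ids = insert_in_batches(fake, [{'title': 'Film %d' % i} for i in range(2500)], size=1000)
print(len(ids), len(fake.docs))
```

Against the real database, the call would look like `insert_in_batches(mongodb.movies, movies.to_dict('records'))`.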
