# Film Script Analyzer

## Data Collection/Preparation

- **Web scraping**
- **Natural Language Processing**
- **IMDB API**
- **IBM Watson API**

### Web scraping

All scripts were scraped from the following site.

- [SpringfieldSpringfield](http://www.springfieldspringfield.co.uk/)

The information is scraped from the pages and parsed into *UTF-8* encoded text.

The **'fetch'** function requests a webpage and returns the response. If the server answers with an HTTP error status, it prints the problem and still returns the response. It is used for every scraping request below.

In [None]:
import operator
import requests
import string
import sys

def fetch(address):
    res = requests.get(address)
    try:
        res.raise_for_status()
    except Exception as exc:
        print('Problem: %s'%(exc))
    return res

The source URL varies in two parameters: the page letter, which is the first letter of the script name, and the page number. First collect the number of pages available for each initial letter.

In [None]:
import bs4
import json

url       = ["http://www.springfieldspringfield.co.uk/movie_scripts.php?order=","&page="]
letters   = string.ascii_uppercase
charNum   = {}

for i in list('0'+letters):
    start = '1'
    main = bs4.BeautifulSoup(fetch(url[0]+i+url[1]+start).text,'lxml')
    temp = main.find_all('a')
    for j in temp[-1:]:
        num = str(j.contents[0]).encode('utf8')
        charNum.update({i:int(num)})


with open('data/charNum.json','w') as f:
    json.dump(charNum,f)
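
As a small illustration of the URL pattern described above, the sketch below builds a listing URL from a letter and a page number. The helper name `listing_url` is hypothetical and not part of the original notebook; it only reuses the base URL and the `order` and `page` parameters that appear in the code.

```python
import string

BASE = "http://www.springfieldspringfield.co.uk/movie_scripts.php"

def listing_url(letter, page):
    """Build the listing URL for one initial letter and one page number."""
    return "{}?order={}&page={}".format(BASE, letter, page)

# the notebook iterates over '0' in addition to the 26 uppercase letters
keys = ['0'] + list(string.ascii_uppercase)
print(len(keys))
print(listing_url('A', 1))
```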

Now that the letter/page-count pairs are available, scrape the links. To avoid overloading the server, randomly select one page number per letter on each pass, so that only one page per letter is visited at a time. The loop below makes four such passes with different seeds, which should give a reasonably random sample of the listing.

The URLs are returned without the root, which needs to be added. A dictionary holds the name of each movie and the link to the page that contains its script.

In [None]:
import pandas as pd
import numpy  as np

prefix = 'http://www.springfieldspringfield.co.uk'
pages  = {}

for i in [0,1,2,3]:
    np.random.seed(i)
    for k in charNum.keys():
        # choose a page between 1 and the last page, inclusive
        j = np.random.choice(range(1,charNum[k]+1))
        temp = bs4.BeautifulSoup(fetch(url[0]+k+url[1]+str(j)).text,'lxml')
        aLinks = temp.find_all('a',class_='script-list-item')
        for link in aLinks:
            pages.update({str(link.contents[0]):prefix+link.get('href')})
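
The sampling step can be sketched without touching the network. This version uses the standard library's `random.Random` in place of `np.random`, purely as a substitution for illustration. The page counts in `page_counts` are made-up values rather than real site data, and `sample_pages` is a hypothetical helper.

```python
import random

def sample_pages(page_counts, seeds):
    """For every seed, draw one page per letter, from 1 to the last page inclusive."""
    picks = []
    for seed in seeds:
        rng = random.Random(seed)  # seeded so the draw is reproducible
        for letter in sorted(page_counts):
            picks.append((letter, rng.randint(1, page_counts[letter])))
    return picks

page_counts = {'A': 12, 'B': 7, 'Z': 1}  # made-up counts
picks = sample_pages(page_counts, seeds=[0, 1, 2, 3])
print(len(picks))  # 4 seeds x 3 letters = 12 draws
```

Because different seeds can land on the same page, the number of distinct pages visited may be smaller than the number of draws.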

The **'springScrap'** function is defined to scrape the script text specifically from this site's pages.

In [None]:
def springScrap(raw):
    result = ' '
    soup = bs4.BeautifulSoup(raw.text,'lxml')
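    # drop <br> tags before extracting the text
    for e in soup.find_all('br'):
        e.extract()
    for text in soup.find_all('div',class_='scrolling-script-container'):
        # get_text() already returns text; under Python 3, adding bytes to a str fails
        result += text.get_text()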
    return result

Iteratively request the script for each entry in the pages dictionary.

In [None]:
script = []
for i in pages.values():
    script.append(springScrap(fetch(i)))

Store everything on disk, removing punctuation and other non-alphanumeric characters from the names before saving.

In [None]:
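# script was filled in the order of pages.values(), which matches pages.keys()
for name,text in zip(pages.keys(),script):
    fname = ''.join(e for e in name if e.isalnum() or e == ' ')
    with open('scrapped/'+fname+'.txt','w',encoding='utf-8') as f:
        f.write(text)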
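
The name cleaning applied before saving can be tried in isolation. In this sketch `safe_name` is a hypothetical helper, and the sample titles are assumed to follow a "Name (year)" pattern; they are illustrative inputs only.

```python
def safe_name(title):
    """Keep only letters, digits and spaces, then add the .txt extension."""
    return ''.join(ch for ch in title if ch.isalnum() or ch == ' ') + '.txt'

print(safe_name("Anegan (2015)"))
print(safe_name("What's Up, Doc? (1972)"))
```

Two titles that differ only in punctuation would map to the same file name, so the later one would overwrite the earlier one.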

A total of **2319** scripts are stored on disk.

### Natural Language Processing

Load the previously saved files to memory.

In [None]:
from os import listdir
from os.path import isfile

script = {}
onlyfiles = [f for f in listdir('scrapped/') if isfile('scrapped/'+f)]
for i in onlyfiles:
    with open('scrapped/'+i,encoding='utf-8') as k:
        script.update({i:[k.readlines(),[]]})

Data will be generated by applying NLP tools to the scripts and extracting valuable statistics from them using the NLTK library.

- **Words:** Total number of words in the script; a measure of the length of the script.
- **Diversity:** Total number of unique words / total number of words; a measure of diversity of language.
- **Length:** Mean word length in the script.
- **Parts of speech:** Counts of each part of speech divided by the total number of words: **Verb, Noun, Adp, Punc, Adj, Adv, Conj, Pron, Prt, Num, X.**

The function **'words'** is defined to generate these values.

In [None]:
import nltk

def words(script):
    tokens    = nltk.word_tokenize(script)
    nwords    = len(tokens)
    diversity = len(set(tokens))/float(nwords)
    tagger    = nltk.pos_tag(tokens,tagset='universal')
    
    wordL     = 0.
    for i in tokens:
        wordL += len(i)
    wordL     = wordL/nwords
    
    speech    = ['VERB','NOUN','ADP','.','ADJ','ADV','CONJ','PRON','PRT','NUM','X']
    counter   = [0.]*len(speech)
    for i in tagger:
        for j in range(len(speech)):
            if i[1] == speech[j]:
                counter[j] += 1.
    
    return [nwords,diversity,wordL,
            counter[0]/nwords,counter[1]/nwords,counter[2]/nwords,counter[3]/nwords,
            counter[4]/nwords,counter[5]/nwords,counter[6]/nwords,counter[7]/nwords,
            counter[8]/nwords,counter[9]/nwords,counter[10]/nwords]
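
To make the first three statistics concrete, here is a worked example on a toy sentence. It splits on whitespace instead of using `nltk.word_tokenize` (which also separates punctuation into its own tokens), so it is only an approximation of what **'words'** computes. `basic_stats` is a hypothetical name.

```python
def basic_stats(text):
    """Word count, lexical diversity and mean word length for whitespace tokens."""
    tokens = text.split()
    nwords = len(tokens)
    diversity = len(set(tokens)) / float(nwords)
    mean_length = sum(len(t) for t in tokens) / float(nwords)
    return nwords, diversity, mean_length

# 5 tokens, 4 of them unique, each 3 characters long
nwords, diversity, mean_length = basic_stats("the cat saw the dog")
print(nwords, diversity, mean_length)
```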

Create a dataframe to hold the dataset.

In [None]:
speech = ['WORDS','DIVERSITY','LENGTH','VERB','NOUN','ADP','.','ADJ','ADV','CONJ','PRON','PRT','NUM','X']

df = pd.DataFrame(index=script.keys(),columns=speech)

In [None]:
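for film in df.index:
    # words() is applied to the third line of the file (index 2),
    # so skip files with fewer than three lines
    if len(script[film][0])<3:
        continue
    temp = words(script[film][0][2])
    for i in range(len(temp)):
        df.loc[film,speech[i]] = temp[i]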

### IMDB API

[OMDB](https://www.omdbapi.com/) is an interface to IMDB data. It is queried with a film name or IMDB tag and answers with JSON data such as year, rating, actors, directors, IMDB rating, etc.

Define an **'omdb'** function to submit the queries.

In [None]:
def omdb(tag=None,name=None):
    # query by IMDB tag when given, otherwise by film name
    if tag:
        url = 'http://www.omdbapi.com/?i='+tag+'&plot=short&r=json'
    else:
        url = "http://www.omdbapi.com/?t="+name+"&y=&plot=short&r=json"
    raw = fetch(url)
    # the response body is JSON, so let requests decode it
    return raw.json()
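
Film names can contain characters such as `&` that have a special meaning inside a query string. The sketch below, assuming Python 3, shows one way to build the same kind of query URL with `urllib.parse.urlencode` so that such characters are escaped. `omdb_url` is a hypothetical helper, and it leaves out the empty `y=` parameter used above.

```python
from urllib.parse import urlencode

def omdb_url(tag=None, name=None):
    """Build the query URL by IMDB tag if given, otherwise by title."""
    params = {'i': tag} if tag else {'t': name}
    params.update({'plot': 'short', 'r': 'json'})
    return 'http://www.omdbapi.com/?' + urlencode(params)

# the ampersand and spaces in the title are escaped by urlencode
print(omdb_url(name='Fast & Furious'))
```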

Use the index of df, which holds the file names of the films, to request the data, but first remove the trailing year and '.txt' extension from each name.

In [None]:
imdb = {}
for key in df.index:
    name = ''.join(e for e in key if e.isalnum() or e == ' ')[:-8]
    imdb.update({key:omdb(name=name)})
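
The fixed slice `[:-8]` is easier to follow on a concrete file name. Once the dot is removed, a name such as `'Anegan 2015.txt'` ends in a space, four digits and `txt`, which is eight characters. `title_from_key` is a hypothetical wrapper around the same expression.

```python
def title_from_key(key):
    """Drop non-alphanumeric characters, then the trailing ' YYYYtxt' (8 characters)."""
    cleaned = ''.join(ch for ch in key if ch.isalnum() or ch == ' ')
    return cleaned[:-8]

print(title_from_key('Anegan 2015.txt'))
```

The slice assumes every file name ends with a four-digit year; a name without one would lose part of its title.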

Proceed to save the file:

In [None]:
with open('data/imbd.json','w') as f:
    json.dump(imdb,f)

Create a dataset.

In [None]:
columns = imdb['Anegan 2015.txt'].keys()
index   = df.index
tempDf  = pd.DataFrame(columns=columns,index=index)

for row in index:
    # only rows with a successful OMDB response carry film data
    if imdb[row]['Response']=='True':
        for col in columns:
            tempDf.loc[row,col] = imdb[row].get(col)

Concatenate it with the previous one.

In [None]:
df2 = pd.concat([df,tempDf],axis=1)

Remove scripts with fewer than 1000 words, scripts with a high rate of numbers (time indexed), and all entries that are not movies.

In [None]:
df2 = df2.loc[df2['WORDS']>1000]
df2 = df2.loc[df2['NUM']<0.1]
df2 = df2.loc[df2['Type']=='movie']
df2 = df2.drop('Type',axis=1)

Transform features into easier-to-handle values:
- **Language:** Leave only first language in list.
- **Genre:** Include first two genres listed in different columns.
- **Actors:** Include first two actors listed in different columns.
- **Year:** Set it as an integer.
- **Runtime:** Split minutes integer from 'min' string.

Define a **'split'** function for this.

In [None]:
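def split(x,y,n):
    # return the n-th item of x split by y, without surrounding spaces;
    # values that do not contain the separator are returned unchanged
    temp = str(x).split(y)
    if len(temp)>1:
        return temp[n].strip()
    return x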

In [None]:
df2['Runtime']  = df2['Runtime'].apply(split,args=(' ',0))
df2['Actors1']  = df2['Actors'].apply(split,args=(',',0))
df2['Actors2']  = df2['Actors'].apply(split,args=(',',1))
df2['Year']     = df2['Year'].apply(int)
df2['Genre1']   = df2['Genre'].apply(split,args=(',',0))
df2['Genre2']   = df2['Genre'].apply(split,args=(',',1))
df2['Language'] = df2['Language'].apply(split,args=(',',0))
df2             = df2.drop('Actors',axis=1)
df2             = df2.drop('Genre',axis=1)
df2             = df2.loc[df2['Language']=='English']
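
A worked example on made-up values shows what these transformations produce. `nth_item` is a hypothetical stand-in that mirrors the splitting logic and also trims surrounding spaces; the sample strings are illustrative and were not taken from the dataset.

```python
def nth_item(value, sep, n):
    """Return the n-th piece of value split by sep, or value itself if sep is absent."""
    parts = str(value).split(sep)
    if len(parts) > 1:
        return parts[n].strip()
    return value

print(nth_item('118 min', ' ', 0))                  # runtime without the unit
print(nth_item('Action, Drama, Thriller', ',', 1))  # second genre
print(nth_item('English', ',', 0))                  # single value, returned unchanged
```

When a field holds a single value, asking for the second item returns that same value, so the second actor or genre column repeats the first for such rows.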

### IBM Watson's API

"IBM Watson is a technology platform that uses natural language processing and machine learning to reveal insights from large amounts of unstructured data"

Watson's interface requires a [registered account](https://www.ibm.com/account/us-en/signup/register.html?a=MTBmNDg2NDktNDI2MC00&ctx=C001&trial=yes&catalogName=Master&quantity=1&partNumber=WA2PROTRIAL&source=mrsaas-trial-ibmid&pkg=ov49121&S_TACT=000000WB&S_OFF_CD=10000752&siteID=WA&watsonanalytics=true), which provides the username and password needed to authenticate.

The process is as follows:
- Authenticate with username/password.
- Submit the text to be analyzed.
- Receive the analysis of the text: 30 features that describe the data.

Such features are:
![features](pi_viz.jpg)

Assign variables with username / password.

In [None]:
from watson_developer_cloud import PersonalityInsightsV2 as PerIns

iusername = 'ABCDE'
ipassword = '12345'

Define the **'insight'** function to submit the queries.

In [None]:
def insight(text,user,password):
    connect = PerIns(username=user,password=password)
    return connect.profile(text)

Send each text to be analyzed and store the result in a dictionary.

In [None]:
insights = {}

for row in df2.index:
    # read the saved script and close the file before calling the API
    with open('scrapped/'+row,'r') as f:
        temp = f.read()
    insights.update({row:insight(temp,iusername,ipassword)})

Define the **'flatten'** function to extract data from the insights dictionary.

In [None]:
def flatten(orig):
    # walk four levels down the tree and keep the percentage of every
    # fourth-level node in the 'personality' category
    data = {}
    for c in orig['tree']['children']:
        if 'children' in c:
            for c2 in c['children']:
                if 'children' in c2:
                    for c3 in c2['children']:
                        if 'children' in c3:
                            for c4 in c3['children']:
                                if (c4['category'] == 'personality'):
                                    data[c4['id']] = c4['percentage']
    return data

In [None]:
finsights = {}

for key in insights.keys():
    finsights.update({key:flatten(insights[key])})
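
The flattening idea can be checked on a miniature tree. The sketch below is a recursive alternative, not the notebook's function: it collects every leaf node of a given category at any depth. The `sample` profile is made up and only imitates the keys used above (`tree`, `children`, `category`, `id`, `percentage`); `collect` is a hypothetical name.

```python
def collect(node, category='personality'):
    """Recursively gather id -> percentage for leaf nodes of the given category."""
    found = {}
    children = node.get('children', [])
    if not children and node.get('category') == category:
        found[node['id']] = node['percentage']
    for child in children:
        found.update(collect(child, category))
    return found

# Made-up miniature profile; real responses are much larger
sample = {'tree': {'id': 'r', 'children': [
    {'id': 'personality', 'children': [
        {'id': 'Openness_parent', 'category': 'personality', 'percentage': 0.9, 'children': [
            {'id': 'Openness', 'category': 'personality', 'percentage': 0.9, 'children': [
                {'id': 'Adventurousness', 'category': 'personality', 'percentage': 0.7},
                {'id': 'Imagination', 'category': 'personality', 'percentage': 0.4}]}]}]},
    {'id': 'needs', 'children': [
        {'id': 'Liberty_parent', 'category': 'needs', 'percentage': 0.2, 'children': [
            {'id': 'Liberty', 'category': 'needs', 'percentage': 0.2}]}]}]}}

print(collect(sample['tree']))
```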

Convert it to dataframe and concatenate with the previous dataset.

In [None]:
# use the features of the first entry as column names
columns = list(next(iter(finsights.values())).keys())
index   = df2.index
tempDf  = pd.DataFrame(index=index,columns=columns)

for row in tempDf.index:
    for feature in finsights[row].keys():
        tempDf.loc[row,feature] = finsights[row][feature]

In [None]:
df3 = pd.concat([tempDf,df2],axis=1)
df3.to_csv('data/dataset.csv',encoding='utf-8')

From the initial **2319** scraped scripts, after filtering and cleaning, **1364** scripts were left with **65** features.

Continue to [Data Visualization.](https://github.com/luisecastro/dataInc/blob/master/data_viz.ipynb)