# Film Script Analyzer, by Luis Castro

The aim of this project is to create a web application capable of processing a film script, that is, analyzing its syntactic and semantic characteristics along with additional metadata.

This is done by:
- Scraping the web for scripts to build, and keep growing, the available data.
- Requesting metadata from the OMDB API to obtain general information about the film each script belongs to. This information includes the IMDB rating, the target value to predict.
- Sending the text of the script to the Personality Insights application of Watson's API, which returns a vector of 30 floating-point numbers describing features of the text.
- Additionally, using NLTK, a natural language processing (NLP) library, to create further features that enrich the dataset.

When the dataset is ready:
- Data is preprocessed: it is checked for missing values and scaled, since many machine learning algorithms compute distances between instances.
- The most relevant features are chosen through feature selection and explained variance.
- Data is split into training and test sets.
- A set of supervised learning algorithms is tested individually and as an ensemble to get the best possible prediction.
- The prediction error of the model is measured using root mean square error (RMSE).
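
The scaling, splitting, and RMSE steps listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function names (`standardize`, `split_rows`, `rmse`) and the toy numbers are hypothetical, and the project itself may well rely on scikit-learn for these steps.

```python
import math
import random
import statistics


def standardize(values):
    """Rescale a feature column to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def split_rows(n_rows, test_fraction=0.25, seed=0):
    """Return shuffled (train, test) lists of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_rows * test_fraction)))
    return indices[n_test:], indices[:n_test]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted ratings."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: one feature column and made-up ratings.
word_counts = [12000.0, 18000.0, 9000.0, 21000.0]
ratings = [6.1, 7.4, 5.8, 8.0]

scaled = standardize(word_counts)
train_idx, test_idx = split_rows(len(ratings))

# Baseline model: always predict the mean rating of the training rows.
baseline = statistics.fmean([ratings[i] for i in train_idx])
error = rmse([ratings[i] for i in test_idx], [baseline] * len(test_idx))
```

A mean-rating baseline like this gives a reference RMSE that any trained regressor should beat.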

Once developed and refined, the application can be a valuable tool to quickly evaluate a film script, providing insight to aspiring and professional writers alike. It can also help producers and directors who need to read many scripts by giving them a baseline to guide their efforts.

In [None]:
# Libraries to be used.
# The helper library was developed for this project.
import helper as hp
import pandas as pd
import numpy  as np
import bs4
import string
from sklearn.linear_model import LinearRegression as lr

In [None]:
# For testing purposes, a random subset of film scripts is selected.
np.random.seed(0)
n_letters = 1
letters   = list(np.random.choice(list(string.ascii_uppercase),size=n_letters,replace=False))
url       = ["http://www.springfieldspringfield.co.uk/movie_scripts.php?order=","&page="]

# Fetch each listing page and extract the links from it;
# these links contain the droids we are looking for... I mean the scripts.
# Titles and script URLs are collected in the `pages` dictionary.
prefix  = 'http://www.springfieldspringfield.co.uk'
pages   = {}
n_pages = 3
for i in letters:
    for j in range(1,n_pages): # Visits pages 1 to n_pages-1; change later too.
        temp = bs4.BeautifulSoup(hp.fetch(url[0]+i+url[1]+str(j)).text,'lxml')
        temp = temp.find_all('a',class_='script-list-item')
        for link in temp:
            pages.update({str(link.contents[0]):prefix+link.get('href')})
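
To see what the scraping cell does without hitting the network, here is a minimal sketch that builds a listing URL and pulls the `script-list-item` links out of an HTML fragment. It uses the standard library's `html.parser` instead of BeautifulSoup; `listing_url`, `ScriptLinkParser`, and the sample HTML (including its `href` values) are hypothetical and only assume the page structure implied by the cell above.

```python
from html.parser import HTMLParser

PREFIX = "http://www.springfieldspringfield.co.uk"
LISTING = PREFIX + "/movie_scripts.php?order={letter}&page={page}"


def listing_url(letter, page):
    """Build the listing URL for one letter and one page number."""
    return LISTING.format(letter=letter, page=page)


class ScriptLinkParser(HTMLParser):
    """Collect {title: absolute URL} from <a class="script-list-item"> tags."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the matching anchor being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "script-list-item" in classes:
            self._href = attrs.get("href")

    def handle_data(self, data):
        # The first non-empty text inside a matching anchor is the title.
        if self._href is not None and data.strip():
            self.links[data.strip()] = PREFIX + self._href
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


# Hypothetical listing fragment, used here instead of a live request.
sample_html = """
<a class="script-list-item" href="/movie_script.php?movie=caddyshack">Caddyshack</a>
<a class="other" href="/about.php">About</a>
<a class="script-list-item" href="/movie_script.php?movie=cabaret">Cabaret</a>
"""

parser = ScriptLinkParser()
parser.feed(sample_html)
parser.close()
found = parser.links
```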

In [None]:
# Having the list of URLs, go to each one and scrape the script.
# springScrap was specifically designed to scrape information
# from that webpage.
script = []
for i in pages.values():
    script.append(hp.springScrap(hp.fetch(i)))

In [8]:
# Process scripts to remove all-caps words, which indicate actions,
# to see whether leaving or removing them creates a better model.
scriptRC = []
for i in range(len(script)):
    scriptRC.append(hp.removeCAPS(script[i]))
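
The implementation of `hp.removeCAPS` is not shown in this notebook. As a rough idea of what such a step could look like, the sketch below drops words written entirely in capitals using a regular expression. The name `remove_caps` and the exact pattern are assumptions, not the helper's actual code.

```python
import re

# A word written entirely in capitals (two letters minimum, so "I" and "A"
# survive), with optional trailing punctuation such as "INT." or "MARY:".
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b[.:]?")


def remove_caps(text):
    """Drop all-caps words and collapse the whitespace they leave behind."""
    stripped = ALL_CAPS.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()


cleaned = remove_caps("INT. KITCHEN John enters. MARY: I told you so.")
```

A simple pattern like this also removes shouted dialogue and acronyms, which is one reason to compare models trained with and without the step.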

In [None]:
# Save the text of the scraped and processed scripts to local disk.
names = list(pages.keys())  # dict views cannot be indexed in Python 3
for i in range(len(script)):
    with open('scrapped/'+names[i]+'.txt','w') as f:
        f.write(script[i])

for i in range(len(script)):
    with open('scrapped/'+names[i]+'RC.txt','w') as f:
        f.write(scriptRC[i])

In [None]:
# Username and password for Watson's API, obtained by
# creating an account with them (it comes with some free perks).
iusername = 'ABCDEF'
ipassword = '123456'

# Submit the scripts (in this case those with actions removed)
# to Watson's Personality Insights API, which returns a set of
# 30 parameters obtained by analysing the text supplied.
insights = []
for i in scriptRC:
    insights.append(hp.insight(i,iusername,ipassword))

In [None]:
# Associate names of movie scripts with insights returned.
nins = []
for name, insight in zip(pages.keys(), insights):
    nins.append([name, hp.dToL(hp.flatten(insight))[0]])
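
The helpers `hp.flatten` and `hp.dToL` are not shown here. Judging only from how they are used (index `[0]` holds the values, index `[1]` the feature names), they might behave roughly like the sketch below. Both function bodies and the sample response are hypothetical; the real response format of the service is not reproduced in this notebook.

```python
def flatten(nested, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in nested.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def dict_to_lists(flat):
    """Split a flat dict into (values, names), sorted by name for stability."""
    names = sorted(flat)
    return [flat[name] for name in names], names


# Hypothetical response fragment; the real service returns about 30 scores.
sample = {"personality": {"openness": 0.81, "agreeableness": 0.42},
          "needs": {"harmony": 0.13}}

values, names = dict_to_lists(flatten(sample))
```

Sorting the names keeps the column order identical across scripts, which matters when the values are later written into a dataframe by position.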

In [None]:
# Create a dataframe: the columns are the insight features and
# the rows are the movie scripts. It starts filled with zeros.
index   = list(pages.keys())
columns = hp.dToL(hp.flatten(insights[0]))[1]

df = pd.DataFrame(data=np.zeros(shape=(len(index),len(columns))),index=index,columns=columns)

In [None]:
# Fill the dataframe with the information from the insights.
# df.iloc is used because df.ix was removed from pandas.
for i in range(len(index)):
    for j in range(len(columns)):
        df.iloc[i, j] = nins[i][1][j]

In [None]:
# After receiving the data from Watson, proceed to use the OMDB API. This API
# returns key data about a movie given its name or its IMDB tag. The names
# of the movies were available from the previous scraping; however, names are
# sometimes spelled or written differently and cannot be found by the API,
# which is why they were checked by hand.

# Some curation of the titles needs to be done by hand; will automate (somehow) later.
titles = ['Ca$h','Caddyshack','Cabin Fever 3: Patient Zero','Caged','Cairo Time',
          'Caddyshack II','Cable Guy','Caffeine','Caligula','Calendar Girls',
          'Cabin in the sky','The White Horse','Cadaver','Cabin Fever','Call Girl',
          'C.r.a.z.y','Julius Caesar','Caesar and Cleopatra','Cake','Cabin Fever 2',
          'California Solo','Cabaret Desire','C.O.G.','Cafe Society','Cadillac Records',
          'Cabin in the woods','Cabaret','Illustrious Corpses','Cadillac Man','Cahill',
          'Cake Eaters','Calamity Jane','The Dark Knight','The Godfather','Fight Club',
          'The Lord of the Rings: The Fellowship of the Ring',
          'Star Wars: Episode V - The Empire Strikes Back','The Matrix',
          'The silence of the lambs']

In [None]:
# Set the dataframe index to the curated names, the pandas dataframe
# will have the film names as index and the features as columns.
df.index = titles

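# Proceed to request the information from the OMDB API and store it in a dictionary.
# Note: omdb() is not defined in the cells shown here; it is assumed to be
# available in scope (for example, through the helper library).
imdbDict= {}
for i in titles:
    imdbDict.update({i:omdb(name=i)})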
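
The `omdb` function itself is not shown. As a hedged sketch of what a title lookup involves, the code below builds an OMDb request URL with the `t` (title) and `apikey` query parameters and decodes a response body. `omdb_url`, `parse_omdb`, the placeholder key, and the sample body are all hypothetical; no live request is made.

```python
import json
from urllib.parse import urlencode

OMDB_ENDPOINT = "http://www.omdbapi.com/"


def omdb_url(title, api_key):
    """Build a title-lookup URL for the OMDb API."""
    return OMDB_ENDPOINT + "?" + urlencode({"t": title, "apikey": api_key})


def parse_omdb(body):
    """Decode a JSON body; return None when OMDb reports no match."""
    record = json.loads(body)
    if record.get("Response") == "False":
        return None
    return record


# Hypothetical, abbreviated response body used instead of a live request.
sample_body = ('{"Title": "Example Film", "Year": "2000", '
               '"Runtime": "100 min", "imdbRating": "7.0", "Response": "True"}')

record = parse_omdb(sample_body)
url = omdb_url("Cairo Time", "YOUR_KEY")
```

Returning `None` for unmatched titles makes it easy to list which curated names still need fixing by hand.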

In [None]:
# Our target is in sight: 'imdbRating' is what the app will
# predict through regression. Create the target and store it in memory.
target = pd.DataFrame(index=df.index,columns=['imdbRating'])
for i in df.index:
    target.loc[i,'imdbRating'] = imdbDict[i]['imdbRating']

In [None]:
# Check whether the rating was available; if not, remove that row, as
# there will be no way of evaluating that instance.
# The mask is True for rows that do have a rating.
has_rating = (target['imdbRating']!='N/A').values
target     = target.loc[has_rating]
df         = df.loc[has_rating]

In [None]:
# The NLTK library for Natural Language Processing makes it possible
# to obtain interesting statistics of the text that can give further
# power to the model. The model now uses the word count, the diversity
# of words (unique words / total words) and the verb/noun ratio.
nanalytics     = []
for name, text in zip(pages.keys(), scriptRC):
    nanalytics.append([name,words(text)])

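# Store these new analytics in the training df.
df['Words']      = 0.
df['DiversityW'] = 0.
df['Verb/Noun']  = 0.

# Assign by position through df.iloc to avoid chained indexing,
# which may silently write to a copy instead of df.
for i in range(len(nanalytics)):
    df.iloc[i, df.columns.get_loc('Words')]      = nanalytics[i][1][0]
    df.iloc[i, df.columns.get_loc('DiversityW')] = nanalytics[i][1][1]
    df.iloc[i, df.columns.get_loc('Verb/Noun')]  = nanalytics[i][1][2]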
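
The `words` function is not shown, so here is a minimal sketch of how the three statistics could be computed from part-of-speech tagged tokens. It assumes input in the shape returned by `nltk.pos_tag(nltk.word_tokenize(text))`, that is, `(word, tag)` pairs with Penn Treebank tags. The name `text_stats` and the hand-tagged sample are hypothetical, and the sample avoids downloading any NLTK models.

```python
def text_stats(tagged_tokens):
    """Return [word count, lexical diversity, verb/noun ratio].

    `tagged_tokens` is a list of (word, tag) pairs with Penn Treebank tags.
    Punctuation and other non-alphabetic tokens are ignored.
    """
    tagged = [(word.lower(), tag) for word, tag in tagged_tokens if word.isalpha()]
    if not tagged:
        return [0, 0.0, 0.0]
    words = [word for word, _ in tagged]
    diversity = len(set(words)) / len(words)
    verbs = sum(1 for _, tag in tagged if tag.startswith("VB"))
    nouns = sum(1 for _, tag in tagged if tag.startswith("NN"))
    ratio = verbs / nouns if nouns else 0.0
    return [len(words), diversity, ratio]


# Hand-tagged sample so the sketch runs without any NLTK data.
sample = [("The", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"),
          ("cat", "NN"), (".", ".")]
stats = text_stats(sample)
```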

In [None]:
# From the OMDB API information we use 2 features so far:
# the year and the runtime of the movie.
df['Year']    = 0
df['Runtime'] = 0

# Assumes values such as '1972' and '124 min'; df.loc avoids chained indexing.
for i in df.index:
    df.loc[i,'Year']    = int(imdbDict[i]['Year'])
    df.loc[i,'Runtime'] = int(imdbDict[i]['Runtime'].split()[0])
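
The cell above assumes every record has a clean year and runtime. A slightly more defensive parse could look like the sketch below, which returns `None` when a field cannot be read (for instance, if the API answers `'N/A'`, as it does for missing ratings). The names `parse_year` and `parse_runtime` are hypothetical.

```python
import re


def parse_year(value):
    """Return the first four-digit year in a 'Year' string, or None."""
    match = re.search(r"\d{4}", value or "")
    return int(match.group()) if match else None


def parse_runtime(value):
    """Return the minutes from a 'Runtime' string like '124 min', or None."""
    match = re.match(r"\s*(\d+)\s*min", value or "")
    return int(match.group(1)) if match else None


year = parse_year("1972")
minutes = parse_runtime("124 min")
```

Rows where either function returns `None` can then be dropped or imputed during preprocessing.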

In [None]:
# Finish by storing the dataset and the target on disk for further use.
df.to_csv('scrapped/final.csv')
target.to_csv('scrapped/target.csv')