# Film Script Analyzer

## Data Preparation

## Cleaning

Load the DataFrame saved in the previous section and run some checks to assess the validity of the data so far.

In [1]:
import pandas as pd

df = pd.read_csv('data/df0.csv',index_col=0)

Dimensions

In [2]:
df.shape

(12713, 67)

First row and columns

In [3]:
df.head(1)

| | Dutifulness | Cooperation | Self-consciousness | Orderliness | Achievement striving | Self-efficacy | Activity level | Self-discipline | Excitement-seeking | Cautiousness | ... | Language1 | Language2 | Actor1 | Actor2 | Writer1 | Writer2 | Country1 | Country2 | Genre1 | Genre2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mascots 2016.txt | 0.629298 | 0.930427 | 0.366518 | 0.294757 | 0.415045 | 0.047446 | 0.22242 | 0.523312 | 0.089939 | 0.924176 | ... | English | NaN | Zach Woods | Wayne Wilderson | Christopher Guest | Jim Piddock | USA | NaN | Comedy | NaN |


Statistical description; only numeric columns are included.

In [4]:
df.describe()

| | Dutifulness | Cooperation | Self-consciousness | Orderliness | Achievement striving | Self-efficacy | Activity level | Self-discipline | Excitement-seeking | Cautiousness | ... | ADV | CONJ | PRON | PRT | NUM | X | Metascore | imdbRating | Year | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | ... | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 12713.0 | 5010.0 | 12411.0 | 12713.0 | 12358.0 |
| mean | 0.488751 | 0.676519 | 0.307052 | 0.448979 | 0.415456 | 0.219902 | 0.268673 | 0.524163 | 0.251768 | 0.659258 | ... | 0.062001 | 0.015984 | 0.132272 | 0.028528 | 0.007952 | 0.008427 | 54.481238 | 6.302498 | 1997.961535 | 100.673734 |
| std | 0.192243 | 0.198775 | 0.179021 | 0.187396 | 0.143855 | 0.154371 | 0.155818 | 0.194001 | 0.145952 | 0.208545 | ... | 0.010048 | 0.005332 | 0.02188 | 0.004719 | 0.01887 | 0.004568 | 17.984226 | 1.185783 | 20.924082 | 25.973905 |
| min | 0.00013 | 0.003025 | 0.005664 | 0.008173 | 0.018783 | 0.00124 | 0.007049 | 0.000443 | 0.00518 | 0.000306 | ... | 0.001364 | 0.000142 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.3 | 1899.0 | 1.0 |
| 25% | 0.354514 | 0.554785 | 0.170399 | 0.305702 | 0.314035 | 0.101923 | 0.150682 | 0.391369 | 0.145917 | 0.530015 | ... | 0.057111 | 0.012731 | 0.12285 | 0.025831 | 0.004511 | 0.005136 | 41.0 | 5.6 | 1992.0 | 90.0 |
| 50% | 0.496516 | 0.710666 | 0.27464 | 0.438189 | 0.410223 | 0.185717 | 0.234533 | 0.536547 | 0.22475 | 0.697536 | ... | 0.062306 | 0.015116 | 0.134937 | 0.028646 | 0.005931 | 0.007789 | 55.0 | 6.5 | 2007.0 | 98.0 |
| 75% | 0.629807 | 0.833403 | 0.408637 | 0.583158 | 0.509952 | 0.301233 | 0.354964 | 0.668768 | 0.326739 | 0.82392 | ... | 0.067722 | 0.018051 | 0.145716 | 0.031402 | 0.007897 | 0.011094 | 68.0 | 7.2 | 2012.0 | 111.0 |
| max | 0.974698 | 0.997843 | 0.995562 | 0.985683 | 0.911754 | 0.954355 | 0.97439 | 0.986867 | 0.998324 | 0.997966 | ... | 0.11293 | 0.07133 | 0.228687 | 0.06916 | 0.321981 | 0.070501 | 100.0 | 9.6 | 2017.0 | 566.0 |


Check for missing ('NA') values by defining **'checkNAS'**, which prints the number of missing values for every column that has any. The information will be relevant later on.

As can be seen, all NAs come from the IMDb data.

In [5]:
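def checkNAS(df):
    # Count the missing values of every column
    NAS = []
    for i in df.columns:
        NAS.append([sum(pd.isnull(df[i])),i])
    # Print the columns that have missing values, largest count first
    NAS = sorted(NAS,reverse=True)
    for i in NAS:
        if i[0] !=0:
            print(i)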

In [6]:
checkNAS(df)

[9968, 'Country2']
[9679, 'Language2']
[7703, 'Metascore']
[4824, 'Awards']
[4760, 'Writer2']
[3526, 'Rated']
[2618, 'Genre2']
[1074, 'Plot']
[912, 'Poster']
[587, 'Writer1']
[388, 'Released']
[355, 'Runtime']
[303, 'imdbVotes']
[303, 'Actor2']
[302, 'imdbRating']
[142, 'Actor1']
[131, 'Director']
[105, 'Language1']
[54, 'Genre1']
[38, 'Country1']
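
As a cross-check, the same per-column counts can be obtained without an explicit loop. The sketch below is not part of the original notebook: it runs on a small made-up frame ('toy') and uses a hypothetical helper name, 'count_nas', built only on `DataFrame.isnull`, `sum` and `sort_values`.

```python
import pandas as pd

def count_nas(frame):
    """Return the NA count of every column that has at least one NA, largest first."""
    counts = frame.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Genre1': ['Comedy', 'Drama', None],
    'Genre2': [None, None, 'Crime'],
    'Year': [2016, 1993, 2007],
})

na_counts = count_nas(toy)
print(na_counts)
```

Unlike the printing version, this one returns a Series, so the counts can be reused in later steps.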


Continue by searching for outliers. Define **'getIQR'**, which flags the values lying more than 'mult' times the interquartile range below the first quartile or above the third.

In [7]:
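def getIQR(series,n1=.25,n2=.75,mult=1.5):
    # Quantiles that delimit the interquartile range (quartiles by default)
    q1 = series.quantile(n1)
    q3 = series.quantile(n2)
    # Distance beyond the quartiles at which a value counts as an outlier
    dist = (q3-q1)*mult
    # Boolean mask, True where the value is an outlier
    outliers = (series < q1-dist) | (series > q3+dist)
    return outliers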
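
To make the rule concrete, here is a small worked example on invented word counts. It is only a sketch: the function is restated under a hypothetical name, 'iqr_outlier_mask', so the block runs on its own, and the numbers are not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_mask(series, lower_q=0.25, upper_q=0.75, mult=1.5):
    """Boolean mask, True where a value lies beyond mult * IQR from the quartiles."""
    q1 = series.quantile(lower_q)
    q3 = series.quantile(upper_q)
    dist = (q3 - q1) * mult
    return (series < q1 - dist) | (series > q3 + dist)

# Toy word counts, invented for illustration only.
words = pd.Series([9000, 10000, 11000, 12000, 13000, 55000],
                  index=['a', 'b', 'c', 'd', 'e', 'f'])

mask = iqr_outlier_mask(words)
outliers = words[mask]
print(outliers)
```

With these values the quartiles are 10250 and 12750, so the accepted range is 6500 to 16500 and only the last entry is flagged.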

Check the outliers for every numeric column of the dataset.

In [11]:
# The first 44 columns hold the numeric features
IQR = {}
for feat in df.columns[:44]:
    # Keep only the outlying values of each feature
    IQR[feat] = df[feat].loc[getIQR(df[feat])]

IQR['WORDS'].sort_values(ascending=False)[:5]

Shoah 1985.txt                                 54934
Dangerous Days Making Blade Runner 2007.txt    45082
Gettysburg 1993.txt                            38580
Hamlet 1996.txt                                38122
Chastity Bites 2013.txt                        37956
Name: WORDS, dtype: int64

After looking at the files, the texts seem valid, so they are kept.

The main interest is in scripts originally written in English, so all non-English scripts (even if translated) will be removed. This is done with three filters:

- Remove scripts whose 'Language1' feature is not English.
- Check the 'X' feature, the rate of words that could not be recognized; a high rate may indicate that the script isn't written in English.
- Check the 'NUM' feature as well; some scripts carry time tags (like CC), which show up as a high rate of numbers, and these will be removed too.

The thresholds used below were selected after looking at the files.

In [26]:
df = df.loc[df['Language1']=='English']
df = df.loc[df['X']<0.030548]
df = df.loc[df['NUM']<0.05]
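
The three filters can also be combined into a single boolean mask, which makes it easy to inspect which rows each condition removes. The sketch below is not from the original notebook: it uses an invented frame ('toy') with the same column names and the same thresholds.

```python
import pandas as pd

# Toy rows, invented for illustration; the thresholds are the ones used above.
toy = pd.DataFrame({
    'Language1': ['English', 'French', 'English', 'English', None],
    'X': [0.008, 0.009, 0.045, 0.010, 0.007],
    'NUM': [0.006, 0.005, 0.007, 0.120, 0.004],
}, index=['kept.txt', 'french.txt', 'unrecognized.txt',
          'timetags.txt', 'unknown.txt'])

keep = (
    (toy['Language1'] == 'English')  # a missing language compares as False
    & (toy['X'] < 0.030548)          # low rate of unrecognized words
    & (toy['NUM'] < 0.05)            # low rate of numbers (no time tags)
)
filtered = toy.loc[keep]
print(filtered.index.tolist())
```

Note that rows with a missing 'Language1' are dropped too, since the comparison with 'English' is False for them.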

After looking at the data, duplicated rows were spotted. As shown next, there are two 'Doctor Strange' entries with different years but the same IMDb rating, so the duplicates should be removed.

In [28]:
df['imdbRating'].loc[df['Actor1']=='Benedict Cumberbatch']

Hawking 2004.txt                        7.7
Doctor Strange 2007.txt                 8.0
Van Gogh Painted With Words 2010.txt    8.1
The Imitation Game 2014.txt             8.1
Stuart A Life Backwards 2007.txt        8.0
Doctor Strange 2016.txt                 8.0
Name: imdbRating, dtype: float64

Here the duplicates are found and the real instance is kept by matching the year in the script's file name against the year provided by IMDb, allowing a difference of one year.

In [29]:
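index = []  # script titles, without the trailing ' YYYY.txt'
nyear = []  # year taken from the file name

for script in df.index:
    index.append(script[:-9])
    nyear.append(int(script[-8:-4]))

# Keep the positions of rows whose IMDb year is within one year of the file-name year
nindex = []
for count, (i, j) in enumerate(zip(df['Year'],nyear)):
    if abs(i - j) <= 1:
        nindex.append(count)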
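
The same year check can be written without explicit loops. This is a minimal sketch on invented rows, assuming every index entry ends in ' YYYY.txt' and that the index is unique; 'file_year' and 'year_matches' are names introduced here for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration; the index mimics '<title> <year>.txt'.
toy = pd.DataFrame(
    {'Year': [2016, 2016, 1993, 2005]},
    index=['Doctor Strange 2007.txt', 'Doctor Strange 2016.txt',
           'Gettysburg 1993.txt', 'Some Film 2004.txt'],
)

# Pull the four digits that precede '.txt' out of each file name.
names = pd.Series(toy.index, index=toy.index)
file_year = names.str.extract(r'(\d{4})\.txt$', expand=False).astype(int)

# Keep rows whose IMDb year is within one year of the file-name year.
year_matches = (toy['Year'] - file_year).abs() <= 1
deduplicated = toy.loc[year_matches]
print(deduplicated.index.tolist())
```

Here the 2007 'Doctor Strange' entry is dropped because its IMDb year is 2016, while the entry that is off by a single year is kept.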

Filter the dataset, save it and check the resulting dimensions.

In [30]:
df = df.iloc[nindex]
df.to_csv('data/dfrep0.csv',encoding='utf-8')

df.shape

(9575, 67)

Next section: [Preprocessing](https://github.com/luisecastro/film_script_analysis/blob/master/05_preprocessing.ipynb)