# Film Script Analyzer

## Data preparation

## IMDB

[OMDB](https://www.omdbapi.com/) is an API interface to IMDB data. It is queried with a film name or an IMDB tag and answers with *json* data such as: Year, Rating, Actors, Directors, IMDB rating, etc.

Create a list with the names of all the files downloaded previously; those names will be submitted to request the data.

In [None]:
from os import listdir
from os.path import isfile

onlyfiles = [f for f in listdir('scripts/') if isfile('scripts/'+f)]

Define an **'omdb'** function to submit the queries to the API.

In [None]:
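import json
from urllib.parse import urlencode
from urllib.request import urlopen

def omdb(tag=None,name=None):
    # Query by IMDB tag ('i') when one is given, otherwise by film name ('t').
    # Current versions of the OMDB API also require an 'apikey' parameter.
    if tag:
        params = [('i',tag),('plot','short'),('r','json')]
    else:
        params = [('t',name),('y',''),('plot','short'),('r','json')]
    # urlencode escapes spaces and special characters in the film name.
    url = 'http://www.omdbapi.com/?'+urlencode(params)
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))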
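
To make the two query forms concrete, the sketch below only builds the request URLs and never contacts the API. The helper name `omdb_url`, the sample title and the placeholder tag `tt0000000` are hypothetical; the parameters mirror those used by `omdb`, and the title is URL-encoded so that spaces and punctuation are transmitted safely.

```python
from urllib.parse import urlencode

def omdb_url(tag=None, name=None):
    """Return the OMDB request URL for an IMDB tag or a film name (hypothetical helper)."""
    if tag:
        params = [('i', tag), ('plot', 'short'), ('r', 'json')]
    else:
        params = [('t', name), ('y', ''), ('plot', 'short'), ('r', 'json')]
    # urlencode turns spaces into '+' and escapes other special characters.
    return 'http://www.omdbapi.com/?' + urlencode(params)

print(omdb_url(name='Blade Runner'))   # sample title, for illustration only
print(omdb_url(tag='tt0000000'))       # placeholder tag, not a real film
```

Inspecting the URL this way is a cheap check that a name will be sent as intended before spending a request on it.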

Use the list of file names to request the data for each film iteratively, but first remove the trailing ' (year)'+'.txt' part of each string.

In [None]:
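imdb = {}
for key in onlyfiles:
    # Keep letters, digits and spaces only; 'Title (1999).txt' then ends in
    # ' 1999txt', so dropping the last 8 characters leaves the title.
    name = ''.join(e for e in key if e.isalnum() or e == ' ')[:-8]
    imdb[key] = omdb(name=name)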
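
As a quick check of the name clean-up, the sketch below applies the same expression to two made-up file names. The helper name `clean_name` and the sample file names are assumptions for illustration; they are not taken from the scripts folder.

```python
def clean_name(filename):
    """Strip punctuation and the trailing ' (year).txt' from a script file name."""
    kept = ''.join(ch for ch in filename if ch.isalnum() or ch == ' ')
    # After filtering, 'Title (1982).txt' ends in ' 1982txt': 8 characters.
    return kept[:-8]

print(clean_name('Blade Runner (1982).txt'))
print(clean_name("Schindler's List (1993).txt"))
```

Note that punctuation inside the title, such as an apostrophe, is removed as well, which may be one reason some queries do not find a match.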

Save a json file with the names of the scripts as keys and the data as values.

In [None]:
with open('data/imbd.json','w') as f:
    json.dump(imdb,f)

Not all the queries succeeded, either because the name was incorrect or because the script wasn't in IMDB's database. Either way, filter out the instances where the response was False, as well as those where the number of values is different from 20 (which indicates a TV series). Keep in the index variable only the names of the scripts that pass both checks.

In [None]:
index = []
for key in imdb.keys():
    if imdb[key]['Response'] == 'True':
        if len(imdb[key]) == 20:
            index.append(key)
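
The filter can be tried on toy data without touching the API. In the sketch below, `successful_films`, `fake_response` and the three sample entries are hypothetical; the placeholder responses only reproduce the two properties the filter looks at, the `Response` value and the number of fields.

```python
def successful_films(responses, expected_fields=20):
    """Return the keys whose query succeeded and has the expected field count."""
    return [key for key, data in responses.items()
            if data.get('Response') == 'True' and len(data) == expected_fields]

def fake_response(n_fields, ok=True):
    """Build a placeholder response with n_fields keys (made-up field names)."""
    data = {'Response': 'True' if ok else 'False'}
    data.update({'field%d' % i: '' for i in range(n_fields - 1)})
    return data

responses = {
    'Film A (2000).txt': fake_response(20),              # film: kept
    'Series B (2001).txt': fake_response(21),            # other field count: dropped
    'Unknown C (2002).txt': fake_response(2, ok=False),  # failed query: dropped
}
print(successful_films(responses))
```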

Create the structure for the dataset that will be used from here on: a pandas dataframe whose columns are the fields of the IMDB query and whose index is the names of the scripts. Then save the data to disk.

In [None]:
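import pandas as pd

# Take the column names from a successful response; an arbitrary entry of
# imdb could be a failed query with fewer fields.
columns = list(imdb[index[0]].keys())

df = pd.DataFrame(index=index,columns=columns)

# .loc writes into df itself; chained indexing may write to a copy.
for row in df.index:
    for col in df.columns:
        df.loc[row,col] = imdb[row][col]

df.to_csv('data/imdb.csv',encoding='utf-8')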

Transform the features into values that are easier to handle:
- **Language:** Include first two languages listed in different columns.
- **Director:** Guarantee that only one director is listed.
- **Genre:** Include first two genres listed in different columns.
- **Actors:** Include first two actors listed in different columns.
- **Writers:** Include first two writers listed in different columns.
- **Country:** Include first two countries listed in different columns.
- **Year:** Set it as an integer.
- **Runtime:** Split minutes integer from 'min' string.

Define a **'split'** function for this: it splits the string by the specified character and returns the part at the selected position.

In [None]:
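def split(x,y,n):
    # Split x by the separator y and return the n-th part without
    # surrounding whitespace.
    temp = str(x).split(y)
    if len(temp)>1:
        return temp[n].strip()
    # Only one part: return the value unchanged.
    return x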
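
A few calls on made-up values show how the function behaves. The function body is repeated here, with the same logic, only so the example runs on its own; the sample strings are hypothetical.

```python
def split(x, y, n):
    """Return the n-th part of x split by y, or x itself if there is only one part."""
    temp = str(x).split(y)
    if len(temp) > 1:
        return temp[n].strip()
    return x

print(split('English, French', ',', 0))
print(split('English, French', ',', 1))
print(split('English', ',', 1))   # single value: returned unchanged
print(split('120 min', ' ', 0))
```

When a field holds a single value, asking for the second part returns that same value, so the first and second columns end up equal.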

After discarding entries of type 'game', use the function with apply, which calls it on each element of a column. Then drop (delete) all the initial features and, wherever the second column is equal to the first one, assign 'NaN' instead. Finally, save the dataset.

In [None]:
# .ix is no longer available in recent pandas; .loc does the same selection.
df              = df.loc[df['Type']!='game'].copy()

df['Language1'] = df['Language'].apply(split,args=(',',0))
df['Language2'] = df['Language'].apply(split,args=(',',1))
df['Runtime']   = df['Runtime'].apply(split,args=(' ',0))
df['Runtime']   = df['Runtime'].apply(int)
df['Actor1']    = df['Actors'].apply(split,args=(',',0))
df['Actor2']    = df['Actors'].apply(split,args=(',',1))
df['Year']      = df['Year'].apply(int)
df['Director']  = df['Director'].apply(split,args=(',',0))
df['Writer1']   = df['Writer'].apply(split,args=(',',0))
df['Writer2']   = df['Writer'].apply(split,args=(',',1))
df['Country1']  = df['Country'].apply(split,args=(',',0))
df['Country2']  = df['Country'].apply(split,args=(',',1))
df['Genre1']    = df['Genre'].apply(split,args=(',',0))
df['Genre2']    = df['Genre'].apply(split,args=(',',1))

# Drop the original features now that they have been split into new columns.
df = df.drop(columns=['Language','Actors','Country','Type','Writer','Response','Genre'])

# A single .loc call per column makes the assignment reach df itself.
df.loc[df['Writer1']==df['Writer2'],'Writer2']       = 'NaN'
df.loc[df['Country1']==df['Country2'],'Country2']    = 'NaN'
df.loc[df['Language1']==df['Language2'],'Language2'] = 'NaN'
df.loc[df['Genre1']==df['Genre2'],'Genre2']          = 'NaN'
df.loc[df['Actor1']==df['Actor2'],'Actor2']          = 'NaN'

df.to_csv('data/imdbP.csv',encoding='utf-8')