# Film Script Analyzer

## Recommendation

These systems work primarily in two ways.
- Analyze customers' behavior and look at similarities among them.
    - Ex. Cust1 likes (A,B). Cust2 likes (A,C). Cust1 may enjoy 'C'.
- Look at things that have similar characteristics or that are associated.
    - Ex. Cust1 likes A. A has (a,b,c). C has (a,b,d). Cust1 may enjoy 'C'.
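
The two examples above can be sketched with plain Python sets. This is only an illustration: the function names, the extra item `D`, and the `min_shared` threshold are assumptions made for the sketch, not part of the analyzer.

```python
# Toy data mirroring the two examples above (item "D" is added for contrast).
likes = {"Cust1": {"A", "B"}, "Cust2": {"A", "C"}}
features = {"A": {"a", "b", "c"}, "C": {"a", "b", "d"}, "D": {"x", "y", "z"}}

def recommend_from_customers(customer, likes):
    """Suggest items liked by other customers who share a liked item."""
    own = likes[customer]
    suggestions = set()
    for other, items in likes.items():
        if other != customer and own & items:
            suggestions |= items - own   # keep only items not already liked
    return suggestions

def recommend_from_features(liked_item, features, min_shared=2):
    """Suggest items sharing at least `min_shared` characteristics."""
    base = features[liked_item]
    return {item for item, feats in features.items()
            if item != liked_item and len(base & feats) >= min_shared}

print(recommend_from_customers("Cust1", likes))  # {'C'}
print(recommend_from_features("A", features))    # {'C'}
```

The rest of this notebook follows the second approach, with the characteristics taken from the script text.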
    
Two interesting questions arise: how do we know what customers like, and which characteristics are important for creating associations?

- How to know what customers like?
    - Primarily by votes (likes), ratings (4 stars) or browsed items.
- What characteristics are important to create associations? 
    - That requires more subject-matter expertise. Smurfs are blue; I like Smurfs, but I may not enjoy blue scarves. Fortunately, since the comparison here is based only on text, we stay within a single domain.

In [1]:
import pandas as pd 

df = pd.read_csv('data/dfreps0.csv',index_col=0)

Create a new dataset that will contain the recommendation matrix, relating each script (film) to every other one based on its text. To do this, the cosine similarity function is used: it computes the cosine of the angle between the feature vectors of two scripts, so a higher value means the scripts are more alike.

Only the first 30 features will be used (the ones generated by IBM Watson).
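
As a rough illustration of what cosine similarity measures, the sketch below computes it by hand with NumPy on a made-up 3×2 matrix. The helper `cosine_sim_matrix` and the toy values are assumptions for the example; they are not the Watson features.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms        # assumes no row is all zeros
    return unit @ unit.T    # dot products of unit-length rows

# Rows 0 and 1 point in the same direction; row 2 is orthogonal to both.
toy = np.array([[1.0, 0.0],
                [2.0, 0.0],
                [0.0, 3.0]])
sims = cosine_sim_matrix(toy)
print(sims)
```

Rows 0 and 1 get a similarity of 1 even though their lengths differ, which is why the measure suits feature vectors whose overall magnitude is not meaningful.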

In [3]:
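from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# pairwise cosine similarity between scripts, using the first 30 feature columns
similarities = cosine_similarity(df[df.columns[:30]].values)
# zero the diagonal so a script is never recommended to itself
np.fill_diagonal(similarities,0)
# label rows and columns with the script names
re           = pd.DataFrame(data=similarities,columns=df.index,index=df.index)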

In [5]:
re.to_csv('data/re.csv',encoding='utf-8')

Assign a size to the recommendation window, that is, the number of scripts that will show up for a query. Since it is preferable to recommend good scripts rather than bad ones, sort the selected (most similar) scripts by imdbRating. Keeping only the window for each script also produces a much smaller file to access.

In [6]:
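w   = 20   # recommendation window: number of scripts returned per query
rec = {}

for row in re.index:
    # the w scripts most similar to the current one
    temp = re.loc[row].sort_values(ascending=False)[:w].index
    # reorder those scripts from highest to lowest imdbRating
    rec.update({row:list(df.loc[temp].sort_values(by='imdbRating',ascending=False).index)})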
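
To make the two-step selection easier to follow, here is a minimal sketch on made-up data: first keep the `w` most similar scripts, then order them by rating. The script names, similarity values, ratings, and the `recommend` helper are all hypothetical.

```python
import pandas as pd

names = ["s1", "s2", "s3", "s4"]
# Hypothetical similarity matrix with a zeroed diagonal.
toy_sim = pd.DataFrame(
    [[0.0, 0.9, 0.8, 0.1],
     [0.9, 0.0, 0.7, 0.2],
     [0.8, 0.7, 0.0, 0.3],
     [0.1, 0.2, 0.3, 0.0]],
    index=names, columns=names)
# Hypothetical ratings standing in for imdbRating.
toy_ratings = pd.Series({"s1": 6.0, "s2": 5.5, "s3": 8.1, "s4": 9.0})

def recommend(script, sim, ratings, w=2):
    """Take the w most similar scripts, then sort them by rating."""
    nearest = sim.loc[script].nlargest(w).index
    return list(ratings.loc[nearest].sort_values(ascending=False).index)

print(recommend("s1", toy_sim, toy_ratings))  # ['s3', 's2']
```

Note that `s4` has the highest rating but is left out for `s1` because it is not among the most similar scripts: similarity decides membership, rating only decides order.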

Try it and save it; the names of the scripts will be formatted later.

In [7]:
rec['Eat Pray Love 2010.txt']

['The Shop Around the Corner 1940.txt',
 'Dodsworth 1936.txt',
 'Detachment 2011.txt',
 'Vanya On 42nd Street 1994.txt',
 'My Brilliant Career 1979.txt',
 'Under the Greenwood Tree 2005.txt',
 'The Letter Writer 2011.txt',
 'Stoker 2013.txt',
 'Loving Annabelle 2006.txt',
 'High Anxiety 1977.txt',
 'In The French Style 1963.txt',
 'Cracks 2009.txt',
 'Words and Pictures 2013.txt',
 'Pumpkin 2002.txt',
 'Peggy Sue Got Married 1986.txt',
 'Beyond the Black Rainbow 2010.txt',
 'To the Wonder 2012.txt',
 'Nina 2016.txt',
 'The Exorcism of Molly Hartley 2015.txt',
 'Cruel Intentions 2 2000.txt']

An interesting thing about this recommendation system is that it surfaces well-rated movies similar to the one queried, regardless of release year. In the previous example the results range from 1936 to 2016.

In [8]:
import json

with open('data/rec.json','w') as f:
    json.dump(rec,f)
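
A minimal sketch of reading the saved recommendations back and tidying the names for display. The `pretty_name` helper and the "Title (Year)" format are assumptions, since the final formatting is not shown here; the sample is an abbreviated entry written to a temporary file so the example does not touch `data/rec.json`.

```python
import json
import os
import tempfile

def pretty_name(filename):
    """Turn 'Title 2010.txt' into 'Title (2010)' (hypothetical format)."""
    stem, _ = os.path.splitext(filename)
    title, _, year = stem.rpartition(" ")
    if title and year.isdigit():
        return f"{title} ({year})"
    return stem   # fall back to the bare name if no trailing year is found

# Abbreviated sample in the same shape as `rec`.
sample = {"Eat Pray Love 2010.txt": ["Dodsworth 1936.txt", "Stoker 2013.txt"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rec.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    with open(path) as f:
        loaded = json.load(f)

formatted = [pretty_name(n) for n in loaded["Eat Pray Love 2010.txt"]]
print(formatted)  # ['Dodsworth (1936)', 'Stoker (2013)']
```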