# Sources
https://www.kaggle.com/rounakbanik/movie-recommender-systems

## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is extremely trivial. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre. 
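
As a rough illustration of the idea above, here is a minimal sketch on a toy dataframe. The data, the `top_movies` helper and its parameters are hypothetical and are not part of the notebook; the notebook's own chart uses a weighted rating rather than a plain sort.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie metadata.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genres': [['Drama'], ['Comedy'], ['Drama', 'Comedy'], ['Drama']],
    'vote_average': [8.1, 7.4, 6.9, 8.1],
    'popularity': [12.0, 30.5, 8.2, 20.0],
})

def top_movies(df, genre=None, n=3):
    """Return the titles of the n best movies, optionally for one genre."""
    if genre is not None:
        # Keep only the movies whose genre list contains the requested genre.
        df = df[df['genres'].apply(lambda g: genre in g)]
    # Sort by rating first and break ties by popularity.
    ranked = df.sort_values(['vote_average', 'popularity'], ascending=False)
    return ranked['title'].head(n).tolist()

overall = top_movies(movies)
drama = top_movies(movies, genre='Drama')
print(overall, drama)
```

Sorting on the raw average rating is the simplest possible choice; it ignores how many votes each rating is based on.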

In [46]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

import warnings; warnings.simplefilter('ignore')

In [47]:
md = pd.read_csv('../data/movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [48]:
md.shape

(45466, 24)

In [49]:
md = (md.drop(['budget'], axis=1)
         .join(md['budget'].apply(pd.to_numeric, errors='coerce')))

md = md[md['budget'].notnull()]

In [50]:


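# Treat a budget of 0 as missing and drop those rows. Only the budget column is
# touched: md.replace(0, 'NaN') would also overwrite zeros in every other column
# (vote counts, revenue, ...) with the string 'NaN'.
md['budget'] = pd.to_numeric(md['budget'], errors='coerce')
md = md[md['budget'].notnull() & (md['budget'] != 0)]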


In [51]:
md.shape[0]

8890

In [52]:
md = md[md['production_companies'] != '[]']
md.shape[0]

8158

In [53]:
md.head()

Unnamed: 0,adult,belongs_to_collection,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,budget
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,...,373554000.0,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,,7.7,5415,30000000.0
1,False,,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,...,262797000.0,104,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,,6.9,2413,65000000.0
3,False,,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.85949,...,81452200.0,127,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,,6.1,34,16000000.0
5,False,,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",17.9249,...,187437000.0,170,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,,7.7,1886,60000000.0
6,False,,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,11860,tt0114319,en,Sabrina,An ugly duckling having undergone a remarkable...,6.67728,...,,127,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,You are cordially invited to the most surprisi...,Sabrina,,6.2,141,58000000.0


In [56]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I use the TMDB Ratings to come up with our **Top Movies Chart.** I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the **95th percentile** of the vote counts as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.
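
To make the formula and the cutoff concrete, here is a small worked sketch. The vote counts, the value of `C` and the two example movies are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def weighted_rating(v, R, m, C):
    """Weighted rating: pulls R towards C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical vote counts for 100 movies: 1, 2, ..., 100.
vote_counts = pd.Series(range(1, 101))
m = vote_counts.quantile(0.95)  # minimum votes needed to enter the chart
C = 6.0                         # assumed mean rating across all movies

# Two movies with the same average rating but very different vote counts.
barely_qualified = weighted_rating(v=96, R=9.0, m=m, C=C)
widely_rated = weighted_rating(v=5000, R=9.0, m=m, C=C)
print(m, barely_qualified, widely_rated)
```

The movie with only a few more votes than the cutoff is pulled strongly towards the global mean, while the widely rated one keeps a score close to its own average.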

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [59]:
# Coerce to numeric first so that non-numeric placeholders (such as the string
# 'NaN') become missing values instead of raising on the integer cast.
vote_counts = pd.to_numeric(md['vote_count'], errors='coerce').dropna().astype('int')
vote_averages = pd.to_numeric(md['vote_average'], errors='coerce').dropna().astype('int')
C = vote_averages.mean()
C

In [None]:
m = vote_counts.quantile(0.95)
m

In [None]:
# x != np.nan is always True, so test for missing dates with pd.notnull instead.
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [None]:
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres','homepage','poster_path']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

In [None]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [None]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

In [None]:
qualified.head(15)

In [None]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
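
The cell above turns each list of genres into one row per movie and genre. As an aside, and assuming pandas 0.25 or later, `DataFrame.explode` gives a similar long format in one call. The toy dataframe below is hypothetical.

```python
import pandas as pd

# Hypothetical two-movie frame with list-valued genres.
toy = pd.DataFrame({
    'title': ['Toy Story', 'Heat'],
    'genres': [['Animation', 'Comedy'], ['Action']],
})

# One row per (movie, genre) pair; the original row index is repeated.
gen_toy = toy.explode('genres').rename(columns={'genres': 'genre'})
print(gen_toy)
```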

In [None]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity','homepage','poster_path']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

In [None]:
build_chart('Romance').head(15)

In [None]:
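# Collect the top 5 movies of each genre into a single dataframe.
# DataFrame.append was removed in pandas 2.0, so the parts are joined with pd.concat.
genreSet = list(set(s))
charts = []
for genre in genreSet[2:]:  # start at index 2, as the original loop did
    try:
        x = build_chart(genre).head(5)
        x['Genre'] = genre
        charts.append(x)
    except Exception:
        # Skip genres for which no chart can be built.
        continue
dfGenre = pd.concat(charts)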

In [None]:
dfGenre

In [None]:
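# Fall back to a Google search link when a movie has no homepage.
# Assigning the result avoids relying on inplace=True through attribute access.
dfGenre['homepage'] = dfGenre['homepage'].fillna('https://www.google.com/search?q=' + dfGenre['title'].str.replace(' ', '+'))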

In [None]:
dfGenre

In [None]:
dfGenre.to_csv("../data/top5Genre.csv")

## Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendation to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves *Dilwale Dulhania Le Jayenge*, *My Name is Khan* and *Kabhi Khushi Kabhi Gham*. One inference we can obtain is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.

In [60]:
links_small = pd.read_csv('../data/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [61]:
# Some of these labels may already have been filtered out by the earlier cells;
# errors='ignore' skips any label that is no longer present.
md = md.drop([19730, 29503, 35587], errors='ignore')

In [18]:
#Check EDA Notebook for how and why I got these indices.
md['id'] = md['id'].astype('int')

In [19]:
smd = md[md['id'].isin(links_small)]
smd.shape

(4582, 24)

We have **4582** movies available in our small movies metadata dataset, which is roughly 10 times smaller than our original dataset of 45,000 movies.

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [20]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [21]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')  # min_df=1 keeps every term; recent scikit-learn rejects an integer 0
tfidf_matrix = tf.fit_transform(smd['description'])

In [22]:
tfidf_matrix.shape

(4582, 142981)

#### Cosine Similarity

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since we have used the TF-IDF Vectorizer, which by default scales every row to unit length, calculating the Dot Product will directly give us the Cosine Similarity Score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity** since it is much faster.
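
The claim above can be checked on a tiny example. This is a minimal sketch with three made-up descriptions; it relies on `TfidfVectorizer` using its default `norm='l2'`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions, not taken from the dataset.
docs = [
    'a toy cowboy and a space ranger',
    'a space ranger fights an evil emperor',
    'two old men feud over a neighbour',
]
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# With the default norm='l2', every row of the matrix has length 1.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

# For unit-length rows the plain dot product equals the cosine similarity.
dot = linear_kernel(tfidf, tfidf)
cos = cosine_similarity(tfidf, tfidf)
same = np.allclose(dot, cos)
print(row_norms, same)
```

`cosine_similarity` normalises its inputs again before taking the dot product, which is redundant work when the rows already have unit length.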

In [23]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [24]:
cosine_sim[0]

array([ 1.       ,  0.0077797,  0.       , ...,  0.       ,  0.       ,  0.       ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [25]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [26]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
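
The same lookup can be sketched on a toy similarity matrix. The titles, the matrix and the `most_similar` helper are hypothetical. This variant uses `np.argsort` and removes the queried movie by index, rather than assuming it always sorts first.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a symmetric similarity matrix for four movies.
titles = pd.Series(['A', 'B', 'C', 'D'])
sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])
indices = pd.Series(titles.index, index=titles)

def most_similar(title, k=2):
    """Return the k titles with the highest similarity to the given title."""
    idx = indices[title]
    order = np.argsort(-sim[idx])    # positions sorted by descending score
    order = order[order != idx][:k]  # drop the movie itself, keep the top k
    return titles.iloc[order].tolist()

print(most_similar('A'), most_similar('D'))
```

Dropping the movie by index keeps the result correct even if another movie happens to have a similarity of exactly 1 with it.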

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [27]:
get_recommendations('The Godfather').head(10)

486      The Godfather: Part II
4207                 The Family
1847                       Made
850     The Godfather: Part III
2153                    8 Women
1883              Harlem Nights
1193           The Color Purple
4433              Run All Night
1742          Jaws: The Revenge
3738                    Machete
Name: title, dtype: object

We see that for **The Godfather**, our system picks up its sequels, **The Godfather: Part II** and **The Godfather: Part III**, while the remaining recommendations are driven purely by overlapping words in the descriptions. Unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. For example, someone who liked **The Dark Knight** probably likes it more because of Nolan and would hate **Batman Forever** and every other substandard movie in the Batman Franchise.

Therefore, we are going to use much more suggestive metadata than **Overview** and **Tagline**. In the next subsection, we will build a more sophisticated recommender that takes **genre**, **keywords**, **cast** and **crew** into consideration.

### Metadata Based Recommender

To build our standard metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.

In [40]:
credits = pd.read_csv('../data/credits.csv')
keywords = pd.read_csv('../data/keywords.csv')

In [41]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [42]:
md.shape

(8306, 27)

In [43]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [44]:
smd = md[md['id'].isin(links_small)]
smd = smd[:3000]
smd.shape

(3000, 30)

We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the following intuitions:

1. **Crew:** From the crew, we will only pick the director as our feature since the others don't contribute that much to the *feel* of the movie.
2. **Cast:** Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list. 

In [45]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

KeyError: 'cast'

In [34]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [35]:
smd['director'] = smd['crew'].apply(get_director)

In [39]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

TypeError: string indices must be integers, not str

In [37]:
smd['cast']

0                     [Tom Hanks, Tim Allen, Don Rickles]
1          [Robin Williams, Jonathan Hyde, Kirsten Dunst]
2       [Whitney Houston, Angela Bassett, Loretta Devine]
3                 [Al Pacino, Robert De Niro, Val Kilmer]
4             [Harrison Ford, Julia Ormond, Greg Kinnear]
5       [Jean-Claude Van Damme, Powers Boothe, Dorian ...
6          [Pierce Brosnan, Sean Bean, Izabella Scorupco]
7       [Michael Douglas, Annette Bening, Michael J. Fox]
8            [Anthony Hopkins, Joan Allen, Powers Boothe]
9           [Geena Davis, Matthew Modine, Frank Langella]
10              [Robert De Niro, Sharon Stone, Joe Pesci]
11              [Kate Winslet, Emma Thompson, Hugh Grant]
12           [Tim Roth, Antonio Banderas, Jennifer Beals]
13                [Jim Carrey, Ian McNeice, Simon Callow]
14       [Wesley Snipes, Woody Harrelson, Jennifer Lopez]
15              [John Travolta, Gene Hackman, Rene Russo]
16      [Sylvester Stallone, Antonio Banderas, Juliann...
17           [

In [457]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

My approach to building the recommender is going to be extremely *hacky*. What I plan on doing is creating a metadata dump for every movie which consists of **genres, director, main actors and keywords.** I then use a **Count Vectorizer** to create our count matrix as we did in the Description Recommender. The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return movies that are most similar.

These are the steps I follow in the preparation of my genres and credits data:
1. **Strip Spaces and Convert to Lowercase** from all our features. This way, our engine will not confuse **Johnny Depp** with **Johnny Galecki.** 
2. **Mention Director 3 times** to give it more weight relative to the entire cast.
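
The effect of both steps can be seen on a tiny example. This is a sketch under assumptions: the `squash` helper and the director name are made up for illustration, and the vectorizer uses its default settings rather than the ones used later in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

def squash(name):
    """Remove spaces and lowercase, e.g. 'Johnny Depp' becomes 'johnnydepp'."""
    return name.replace(' ', '').lower()

raw = ['Johnny Depp', 'Johnny Galecki']
squashed = [squash(n) for n in raw]

# Without squashing, both actors share the token 'johnny'.
raw_vocab = sorted(CountVectorizer().fit(raw).vocabulary_)
# After squashing, each actor is a single, distinct token.
squashed_vocab = sorted(CountVectorizer().fit(squashed).vocabulary_)

# Repeating the (hypothetical) director three times triples its count.
soup = ' '.join([squashed[0]] + [squash('Tim Burton')] * 3)
cv = CountVectorizer()
counts = cv.fit_transform([soup]).toarray()[0]
weights = {term: int(counts[i]) for term, i in cv.vocabulary_.items()}
print(raw_vocab, squashed_vocab, weights)
```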

In [458]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [459]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x, x])

#### Keywords

We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we calculate the frequency counts of every keyword that appears in the dataset.

In [460]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [461]:
s = s.value_counts()
s[:5]

independent film    230
woman director      172
murder              135
based on novel      116
suspense             94
Name: keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 230. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as *Dogs* and *Dog* are considered the same.

In [462]:
s = s[s > 1]

In [463]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

u'dog'

In [464]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
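
The cells shown here define the frequency filter and the stemmer but do not show them being applied to the keyword lists. One way the two steps could be combined is sketched below on toy data. The `clean` helper and the sample keywords are hypothetical, and the sketch assumes `nltk` is installed.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Hypothetical keyword lists for three movies.
keywords = pd.Series([['dogs', 'cats'], ['dogs', 'zebra'], ['cats']])

# Count how often each keyword occurs across all movies.
counts = keywords.explode().value_counts()
# Keep only the keywords that occur more than once.
frequent = set(counts[counts > 1].index)

stemmer = SnowballStemmer('english')

def clean(words):
    """Drop rare keywords, then stem, strip spaces and lowercase the rest."""
    kept = [w for w in words if w in frequent]
    return [stemmer.stem(w).replace(' ', '').lower() for w in kept]

cleaned = keywords.apply(clean).tolist()
print(cleaned)
```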

In [466]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [467]:
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')  # min_df=1 keeps every term; recent scikit-learn rejects an integer 0
count_matrix = count.fit_transform(smd['soup'])

In [468]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [469]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [471]:
get_recommendations('Mad Max').head(10)

2973          Mad Max 2: The Road Warrior
2974           Mad Max Beyond Thunderdome
1911                Babe: Pig in the City
2451                        The Omega Man
522            Terminator 2: Judgment Day
2045    Battle for the Planet of the Apes
983                    Return of the Jedi
2042                          Logan's Run
2885                    Battlefield Earth
31                         Twelve Monkeys
Name: title, dtype: object

In [None]:
idx = indices[title]
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:31]
movie_indices = [i[0] for i in sim_scores]
#return titles.iloc[movie_indices]

In [472]:
titleDF = pd.DataFrame(titles)


In [473]:
cosine_simDF = pd.DataFrame(cosine_sim)

In [395]:
cosine_simDF.to_csv("../data/cosine_sim.csv")

In [474]:
totalDF = pd.concat([titleDF, cosine_simDF],axis =1)

In [475]:
totalDF

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,Toy Story,1.000000,0.037980,0.045326,0.020847,0.019407,0.000000,0.020332,0.025055,0.000000,...,0.025055,0.000000,0.000000,0.018990,0.000000,0.000000,0.000000,0.020332,0.000000,0.000000
1,Jumanji,0.037980,1.000000,0.000000,0.023357,0.000000,0.000000,0.000000,0.056143,0.025392,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.045561,0.000000,0.000000
2,Grumpier Old Men,0.045326,0.000000,1.000000,0.055749,0.025950,0.000000,0.054373,0.000000,0.000000,...,0.033501,0.000000,0.028618,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,Waiting to Exhale,0.020847,0.023357,0.055749,1.000000,0.047741,0.018016,0.075023,0.030817,0.000000,...,0.030817,0.026325,0.078975,0.000000,0.000000,0.018490,0.025641,0.000000,0.025641,0.021592
4,Father of the Bride Part II,0.019407,0.000000,0.025950,0.047741,1.000000,0.000000,0.046562,0.000000,0.000000,...,0.028689,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.046562,0.000000,0.000000
5,Heat,0.000000,0.000000,0.000000,0.018016,0.000000,1.000000,0.000000,0.043305,0.039171,...,0.043305,0.036993,0.018496,0.000000,0.086432,0.077948,0.144127,0.070284,0.072063,0.151707
6,Sabrina,0.020332,0.000000,0.054373,0.075023,0.046562,0.000000,1.000000,0.000000,0.000000,...,0.030056,0.000000,0.025675,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,Tom and Huck,0.025055,0.056143,0.000000,0.030817,0.000000,0.043305,0.000000,1.000000,0.100504,...,0.037037,0.031639,0.031639,0.000000,0.024641,0.022222,0.030817,0.060111,0.030817,0.025950
8,Sudden Death,0.000000,0.025392,0.000000,0.000000,0.000000,0.039171,0.000000,0.100504,1.000000,...,0.067003,0.000000,0.000000,0.000000,0.022288,0.020101,0.027875,0.081559,0.000000,0.023473
9,GoldenEye,0.000000,0.016623,0.000000,0.000000,0.000000,0.025643,0.000000,0.043863,0.059514,...,0.043863,0.000000,0.000000,0.000000,0.014591,0.026318,0.018248,0.088988,0.000000,0.015366


In [476]:
totalDF.to_csv('../data/contentRec_3000.csv', index = False)

In [383]:
totalDF.shape[0]

9219

In [369]:
lastComplete = 0
for i in range(1,10):
    partialDF = totalDF[lastComplete:int(totalDF.shape[0]*(float(i)/9))]
    lastComplete = int(totalDF.shape[0]*(float(i)/9))
    partialDF.to_csv('../data/contentRec'+str(i) +'.csv')
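
The loop above writes the matrix in nine row-wise pieces. A self-contained sketch of the same split-and-restore round trip is shown below. The toy dataframe, the temporary directory and the file names are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame standing in for the title and similarity columns.
df = pd.DataFrame({'title': list('abcdefghij'), 'score': range(10)})

with tempfile.TemporaryDirectory() as tmp:
    n_parts = 3
    # Row boundaries of the chunks: 0, 3, 6, 10 for ten rows in three parts.
    bounds = np.linspace(0, len(df), n_parts + 1).astype(int)
    paths = []
    for i in range(n_parts):
        path = os.path.join(tmp, 'part' + str(i + 1) + '.csv')
        # index=False avoids an extra 'Unnamed: 0' column when reading back.
        df.iloc[bounds[i]:bounds[i + 1]].to_csv(path, index=False)
        paths.append(path)
    # Read every chunk and stack them back into a single dataframe.
    restored = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(restored.shape)
```

Writing with `index=False` and concatenating with `ignore_index=True` keeps the row positions aligned with the column positions of the similarity matrix.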


In [322]:
partialDF =totalDF[lastComplete:int(totalDF.shape[0]*(float(1)/10))]

In [323]:
partialDF

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,9209,9210,9211,9212,9213,9214,9215,9216,9217,9218
0,Toy Story,1.000000,0.037980,0.045326,0.019854,0.018990,0.000000,0.020332,0.025055,0.000000,...,0.000000,0.000000,0.018990,0.000000,0.016669,0.026038,0.000000,0.000000,0.000000,0.000000
1,Jumanji,0.037980,1.000000,0.000000,0.022244,0.000000,0.000000,0.000000,0.056143,0.023980,...,0.000000,0.000000,0.085106,0.000000,0.000000,0.000000,0.000000,0.027086,0.044488,0.000000
2,Grumpier Old Men,0.045326,0.000000,1.000000,0.053093,0.025392,0.000000,0.054373,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.022288,0.000000,0.033501,0.032325,0.000000,0.000000
3,Waiting to Exhale,0.019854,0.022244,0.053093,1.000000,0.044488,0.015813,0.071449,0.029348,0.000000,...,0.000000,0.017379,0.044488,0.030500,0.019525,0.030500,0.029348,0.056637,0.023256,0.000000
4,Father of the Bride Part II,0.018990,0.000000,0.025392,0.044488,1.000000,0.000000,0.045561,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.018676,0.000000,0.000000,0.000000,0.000000,0.000000
5,Heat,0.000000,0.000000,0.000000,0.015813,0.000000,1.000000,0.000000,0.039912,0.034095,...,0.000000,0.106354,0.045376,0.062217,0.000000,0.020739,0.019956,0.019256,0.031627,0.000000
6,Sabrina,0.020332,0.000000,0.054373,0.071449,0.045561,0.000000,1.000000,0.000000,0.000000,...,0.000000,0.035595,0.000000,0.000000,0.019996,0.000000,0.030056,0.029001,0.000000,0.000000
7,Tom and Huck,0.025055,0.056143,0.000000,0.029348,0.000000,0.039912,0.000000,1.000000,0.094916,...,0.000000,0.021932,0.112287,0.038490,0.000000,0.038490,0.000000,0.107211,0.146742,0.000000
8,Sudden Death,0.000000,0.023980,0.000000,0.000000,0.000000,0.034095,0.000000,0.094916,1.000000,...,0.000000,0.018735,0.047960,0.032880,0.000000,0.000000,0.031639,0.030528,0.075212,0.000000
9,GoldenEye,0.000000,0.016011,0.000000,0.000000,0.000000,0.022764,0.000000,0.042248,0.054135,...,0.000000,0.037526,0.032022,0.021953,0.000000,0.000000,0.021124,0.020383,0.033478,0.000000


In [370]:
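# Read the nine partial files back and stack them into one dataframe.
# DataFrame.append was removed in pandas 2.0, so the parts are joined with pd.concat.
parts = [pd.read_csv('../data/contentRec1.csv', index_col=0)]
for i in range(2, 10):
    parts.append(pd.read_csv('../data/contentRec' + str(i) + '.csv'))
fullDF = pd.concat(parts)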

In [349]:
fullDF.shape

(9219, 9221)

In [350]:
indices = pd.Series(fullDF.index, index=fullDF['title'])
titles = fullDF['title']
fullDF = fullDF.drop(['title','Unnamed: 0'],axis = 1)

print(indices)


In [357]:
fullDF.values

array([[ 1.        ,  0.03798001,  0.04066418, ...,  0.        ,
         0.        ,  0.02084691],
       [ 0.03798001,  1.        ,  0.02278028, ...,  0.01899   ,
         0.        ,  0.        ],
       [ 0.04532596,  0.        ,  0.05437272, ...,  0.02266298,
         0.        ,  0.05574947],
       ..., 
       [ 0.        ,  0.02708645,  0.05800148, ...,  0.0967019 ,
         0.06611395,  0.0594701 ],
       [ 0.        ,  0.04448841,  0.02381628, ...,  0.03970725,
         0.01809825,  0.02441931],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.04407596,  0.        ]])

In [384]:
def get_recommendations_manual(title):
    # Read the global totalDF without reassigning it; assigning to the same
    # name inside the function makes it local and raises UnboundLocalError.
    indices = pd.Series(totalDF.index, index=totalDF['title'])
    titles = totalDF['title']
    # Keep only the similarity columns; 'Unnamed: 0' is dropped when present.
    sim_df = totalDF.drop(['title', 'Unnamed: 0'], axis=1, errors='ignore')
    idx = indices[title]
    cosine_sim = sim_df.values
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [385]:
get_recommendations_manual("The Dark Knight")

In [448]:
totalDF = pd.read_csv('../data/contentRec.csv', memory_map=True)
totalDF = totalDF[:3000]
indices = pd.Series(totalDF.index, index=totalDF['title'])
titles = totalDF['title']
totalDF = totalDF.drop(['title'],axis = 1)    


In [449]:
idx = indices['Mad Max']

cosine_sim = totalDF.values
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:5]
movie_indices = [i[0] for i in sim_scores]
titles.iloc[movie_indices].tolist()

IndexError: positional indexers are out-of-bounds

In [450]:
movie_indices

[2974, 2973, 8864, 6905]

In [427]:
list(enumerate(cosine_sim))[6218]

array([ 0.01183536,  0.01326045,  0.        , ...,  0.01688139,
        0.02772701,  0.        ])

In [431]:
cosine_sim[idx]

array([ 0.        ,  0.01546166,  0.        , ...,  0.01968367,
        0.03232963,  0.        ])

In [584]:
totalDF = pd.read_csv('../data/contentRec_3000.csv', memory_map=True)

In [442]:
totalDF[:3000]

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,9209,9210,9211,9212,9213,9214,9215,9216,9217,9218
0,Toy Story,1.000000,0.037980,0.045326,0.019854,0.018990,0.000000,0.020332,0.025055,0.000000,...,0.000000,0.000000,0.018990,0.000000,0.016669,0.026038,0.000000,0.000000,0.000000,0.000000
1,Jumanji,0.037980,1.000000,0.000000,0.022244,0.000000,0.000000,0.000000,0.056143,0.023980,...,0.000000,0.000000,0.085106,0.000000,0.000000,0.000000,0.000000,0.027086,0.044488,0.000000
2,Grumpier Old Men,0.045326,0.000000,1.000000,0.053093,0.025392,0.000000,0.054373,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.022288,0.000000,0.033501,0.032325,0.000000,0.000000
3,Waiting to Exhale,0.019854,0.022244,0.053093,1.000000,0.044488,0.015813,0.071449,0.029348,0.000000,...,0.000000,0.017379,0.044488,0.030500,0.019525,0.030500,0.029348,0.056637,0.023256,0.000000
4,Father of the Bride Part II,0.018990,0.000000,0.025392,0.044488,1.000000,0.000000,0.045561,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.018676,0.000000,0.000000,0.000000,0.000000,0.000000
5,Heat,0.000000,0.000000,0.000000,0.015813,0.000000,1.000000,0.000000,0.039912,0.034095,...,0.000000,0.106354,0.045376,0.062217,0.000000,0.020739,0.019956,0.019256,0.031627,0.000000
6,Sabrina,0.020332,0.000000,0.054373,0.071449,0.045561,0.000000,1.000000,0.000000,0.000000,...,0.000000,0.035595,0.000000,0.000000,0.019996,0.000000,0.030056,0.029001,0.000000,0.000000
7,Tom and Huck,0.025055,0.056143,0.000000,0.029348,0.000000,0.039912,0.000000,1.000000,0.094916,...,0.000000,0.021932,0.112287,0.038490,0.000000,0.038490,0.000000,0.107211,0.146742,0.000000
8,Sudden Death,0.000000,0.023980,0.000000,0.000000,0.000000,0.034095,0.000000,0.094916,1.000000,...,0.000000,0.018735,0.047960,0.032880,0.000000,0.000000,0.031639,0.030528,0.075212,0.000000
9,GoldenEye,0.000000,0.016011,0.000000,0.000000,0.000000,0.022764,0.000000,0.042248,0.054135,...,0.000000,0.037526,0.032022,0.021953,0.000000,0.000000,0.021124,0.020383,0.033478,0.000000


In [516]:
totalDF

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,Toy Story,1.000000,0.037980,0.045326,0.020847,0.019407,0.000000,0.020332,0.025055,0.000000,...,0.025055,0.000000,0.000000,0.018990,0.000000,0.000000,0.000000,0.020332,0.000000,0.000000
1,Jumanji,0.037980,1.000000,0.000000,0.023357,0.000000,0.000000,0.000000,0.056143,0.025392,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.045561,0.000000,0.000000
2,Grumpier Old Men,0.045326,0.000000,1.000000,0.055749,0.025950,0.000000,0.054373,0.000000,0.000000,...,0.033501,0.000000,0.028618,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,Waiting to Exhale,0.020847,0.023357,0.055749,1.000000,0.047741,0.018016,0.075023,0.030817,0.000000,...,0.030817,0.026325,0.078975,0.000000,0.000000,0.018490,0.025641,0.000000,0.025641,0.021592
4,Father of the Bride Part II,0.019407,0.000000,0.025950,0.047741,1.000000,0.000000,0.046562,0.000000,0.000000,...,0.028689,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.046562,0.000000,0.000000
5,Heat,0.000000,0.000000,0.000000,0.018016,0.000000,1.000000,0.000000,0.043305,0.039171,...,0.043305,0.036993,0.018496,0.000000,0.086432,0.077948,0.144127,0.070284,0.072063,0.151707
6,Sabrina,0.020332,0.000000,0.054373,0.075023,0.046562,0.000000,1.000000,0.000000,0.000000,...,0.030056,0.000000,0.025675,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,Tom and Huck,0.025055,0.056143,0.000000,0.030817,0.000000,0.043305,0.000000,1.000000,0.100504,...,0.037037,0.031639,0.031639,0.000000,0.024641,0.022222,0.030817,0.060111,0.030817,0.025950
8,Sudden Death,0.000000,0.025392,0.000000,0.000000,0.000000,0.039171,0.000000,0.100504,1.000000,...,0.067003,0.000000,0.000000,0.000000,0.022288,0.020101,0.027875,0.081559,0.000000,0.023473
9,GoldenEye,0.000000,0.016623,0.000000,0.000000,0.000000,0.025643,0.000000,0.043863,0.059514,...,0.043863,0.000000,0.000000,0.000000,0.014591,0.026318,0.018248,0.088988,0.000000,0.015366


In [585]:
metaData = pd.read_csv('../data/movies_metadata.csv')

In [529]:
metaList = metaData[metaData["title"].isin(totalDF['title'].tolist())]

In [587]:
titleTemp = pd.DataFrame(totalDF['title'])

In [588]:
titleTemp

Unnamed: 0,title
0,Toy Story
1,Jumanji
2,Grumpier Old Men
3,Waiting to Exhale
4,Father of the Bride Part II
5,Heat
6,Sabrina
7,Tom and Huck
8,Sudden Death
9,GoldenEye


In [592]:
# how='right' keeps every metaList row; drop_duplicates then keeps the first row per title
titleTemp.merge(metaList, on='title', how='right').drop_duplicates(['title'])

Unnamed: 0,title,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count
0,Toy Story,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,False,7.7,5415.0
1,Jumanji,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,False,6.9,2413.0
2,Grumpier Old Men,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,False,6.5,92.0
3,Waiting to Exhale,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,False,6.1,34.0
4,Father of the Bride Part II,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,False,5.7,173.0
5,Heat,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,False,7.7,1886.0
8,Sabrina,False,,58000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,11860,tt0114319,en,Sabrina,...,"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",1995-12-15,0.0,127.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'i...",Released,You are cordially invited to the most surprisi...,False,6.2,141.0
12,Tom and Huck,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,45325,tt0112302,en,Tom and Huck,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,97.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Original Bad Boys.,False,5.4,45.0
13,Sudden Death,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,False,5.5,174.0
14,GoldenEye,False,"{'id': 645, 'name': 'James Bond Collection', '...",58000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.mgm.com/view/movie/757/Goldeneye/,710,tt0113189,en,GoldenEye,...,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",1995-11-16,352194034.0,130.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,No limits. No fears. No substitutes.,False,6.6,1194.0


In [500]:
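# Copy three metadata columns into totalDF one row at a time.
# .loc replaces the chained totalDF[col][index] assignment, which pandas may apply to a copy.
for index, row in metaList.iterrows():
    totalDF.loc[index, 'homepage'] = row['homepage']
    totalDF.loc[index, 'poster_path'] = row['poster_path']
    totalDF.loc[index, 'release_date'] = row['release_date']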

KeyboardInterrupt: 
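
The loop above was interrupted before it finished. Row-by-row assignment is slow at this scale, and it pairs rows by index label, which may not line up between `totalDF` and `metaList` because `metaList` keeps the index of the full metadata file. A hedged alternative is to look values up by title instead. The frames `total` and `meta` below are hypothetical stand-ins, and keeping the first metadata row per title is an assumption rather than something the notebook states.

```python
import pandas as pd

# Hypothetical stand-ins for totalDF and metaList, reduced to the relevant columns.
total = pd.DataFrame({"title": ["Toy Story", "Jumanji", "Heat"]})
meta = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Heat", "Heat"],
    "release_date": ["1995-10-30", "1995-12-15", "1995-12-15", "1972-10-06"],
})

# Keep one metadata row per title (the first), then look values up by title.
lookup = meta.drop_duplicates("title").set_index("title")
total["release_date"] = total["title"].map(lookup["release_date"])
print(total)
```

The same `map` call can be repeated for `homepage` and `poster_path`. The `merge` used later in the notebook is the other vectorised option.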

In [499]:
metaList.columns

Index([u'adult', u'belongs_to_collection', u'budget', u'genres', u'homepage',
       u'id', u'imdb_id', u'original_language', u'original_title', u'overview',
       u'popularity', u'poster_path', u'production_companies',
       u'production_countries', u'release_date', u'revenue', u'runtime',
       u'spoken_languages', u'status', u'tagline', u'title', u'video',
       u'vote_average', u'vote_count'],
      dtype='object')

In [565]:
result = totalDF.merge(metaList, on='title',how = 'left')


In [566]:
result[2500:3000]

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count
2500,Wilde,0.000000,0.000000,0.000000,0.046714,0.021744,0.016411,0.022780,0.028072,0.000000,...,"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",1997-09-01,0.0,118.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Loved for being unique. Hated for being differ...,False,6.7,62.0
2501,Outside Ozona,0.020332,0.000000,0.054373,0.125039,0.023281,0.035142,0.048780,0.030056,0.027186,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1998-12-18,0.0,100.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Somewhere down the road is the end of the line.,False,6.0,2.0
2502,Affliction,0.000000,0.000000,0.000000,0.034943,0.000000,0.024551,0.000000,0.041996,0.000000,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1997-08-28,6330054.0,114.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,False,5.8,56.0
2503,Another Day in Paradise,0.000000,0.000000,0.000000,0.027067,0.000000,0.133122,0.000000,0.032530,0.029424,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1998-12-30,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,We live at the abyss.,False,6.4,38.0
2504,The Hi-Lo Country,0.000000,0.000000,0.027186,0.050016,0.000000,0.035142,0.024390,0.060111,0.027186,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1998-12-30,0.0,114.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A woman like Mona can drive men to extremes.,False,4.7,9.0
2505,Hilary and Jackie,0.000000,0.000000,0.000000,0.050016,0.023281,0.017571,0.024390,0.030056,0.000000,...,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",1998-12-30,0.0,121.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Two sisters. Two lives. One Love...,False,6.5,23.0
2506,Playing by Heart,0.000000,0.000000,0.000000,0.050016,0.023281,0.017571,0.024390,0.030056,0.000000,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1998-12-30,3970078.0,121.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"If romance is a mystery, there's only one way ...",False,6.6,49.0
2507,At First Sight,0.000000,0.000000,0.034816,0.096077,0.000000,0.045004,0.031235,0.038490,0.000000,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1999-01-15,0.0,128.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Science gave him sight. She gave him vision.,False,5.9,49.0
2508,In Dreams,0.000000,0.021277,0.000000,0.093428,0.000000,0.065644,0.000000,0.028072,0.025392,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1999-01-15,0.0,100.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,You don't have to sleep to dream,False,5.4,45.0
2509,Varsity Blues,0.017244,0.019320,0.046114,0.106047,0.019745,0.014902,0.041371,0.025491,0.000000,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1999-01-15,0.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It takes a hero to know what's worth winning.,False,6.0,126.0


In [567]:
result.columns

Index([u'title', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8',
       ...
       u'production_countries', u'release_date', u'revenue', u'runtime',
       u'spoken_languages', u'status', u'tagline', u'video', u'vote_average',
       u'vote_count'],
      dtype='object', length=3024)

In [568]:
result = result.drop(['revenue','runtime','spoken_languages','status','tagline','video','vote_average','vote_count'], axis = 1)

In [569]:
result.columns

Index([u'title', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8',
       ...
       u'id', u'imdb_id', u'original_language', u'original_title', u'overview',
       u'popularity', u'poster_path', u'production_companies',
       u'production_countries', u'release_date'],
      dtype='object', length=3016)

In [570]:
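# 'production_countires' is misspelled (the column is 'production_countries'), so this call raises and result is left unchanged
result = result.drop(['imdb_id','original_language','original_title','overview','popularity','production_companies','production_countires'], axis = 1)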

ValueError: labels ['production_countires'] not contained in axis
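
The error comes from a single misspelled label, and because `drop` fails as a whole, none of the listed columns is removed. One way to make this kind of cell more forgiving is to check the labels first, or to pass `errors='ignore'`. This is a sketch on a hypothetical one-row frame. Newer pandas versions report the same problem as a `KeyError` rather than a `ValueError`.

```python
import pandas as pd

# Hypothetical frame with a few of the columns named in the cell above.
frame = pd.DataFrame({
    "title": ["Toy Story"],
    "imdb_id": ["tt0114709"],
    "original_language": ["en"],
})

wanted = ["imdb_id", "original_language", "production_countires"]  # last label is misspelled

# Report labels that do not exist instead of failing on them.
missing = [c for c in wanted if c not in frame.columns]
print("not found:", missing)

# errors='ignore' drops the labels that exist and silently skips the rest.
trimmed = frame.drop(columns=wanted, errors="ignore")
print(trimmed.columns.tolist())
```

Printing `missing` before dropping keeps the typo visible, which `errors='ignore'` on its own would hide.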

In [571]:
result.columns

Index([u'title', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8',
       ...
       u'id', u'imdb_id', u'original_language', u'original_title', u'overview',
       u'popularity', u'poster_path', u'production_companies',
       u'production_countries', u'release_date'],
      dtype='object', length=3016)

In [572]:
result = result.drop(['adult','belongs_to_collection','budget','genres','id','production_countries'],axis = 1)

In [573]:
result.columns

Index([u'title', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8',
       ...
       u'2999', u'homepage', u'imdb_id', u'original_language',
       u'original_title', u'overview', u'popularity', u'poster_path',
       u'production_companies', u'release_date'],
      dtype='object', length=3010)

In [574]:
result.shape

(3767, 3010)

In [575]:
result

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,2999,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,release_date
0,Toy Story,1.000000,0.037980,0.045326,0.020847,0.019407,0.000000,0.020332,0.025055,0.000000,...,0.000000,http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30
1,Jumanji,0.037980,1.000000,0.000000,0.023357,0.000000,0.000000,0.000000,0.056143,0.025392,...,0.000000,,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15
2,Grumpier Old Men,0.045326,0.000000,1.000000,0.055749,0.025950,0.000000,0.054373,0.000000,0.000000,...,0.000000,,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",1995-12-22
3,Waiting to Exhale,0.020847,0.023357,0.055749,1.000000,0.047741,0.018016,0.075023,0.030817,0.000000,...,0.021592,,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.85949,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,1995-12-22
4,Father of the Bride Part II,0.019407,0.000000,0.025950,0.047741,1.000000,0.000000,0.046562,0.000000,0.000000,...,0.000000,,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.38752,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...",1995-02-10
5,Heat,0.000000,0.000000,0.000000,0.018016,0.000000,1.000000,0.000000,0.043305,0.039171,...,0.151707,,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",17.9249,/zMyfPUelumio3tiDKPffaUpsQTD.jpg,"[{'name': 'Regency Enterprises', 'id': 508}, {...",1995-12-15
6,Heat,0.000000,0.000000,0.000000,0.018016,0.000000,1.000000,0.000000,0.043305,0.039171,...,0.151707,,tt0068688,en,Heat,"Former child star Joe Davis (Joe Dallesandro),...",0.466019,/l6xBVdKfjteFnQucRRJoyIRJOkm.jpg,"[{'name': 'Andy Warhol Productions', 'id': 100...",1972-10-06
7,Heat,0.000000,0.000000,0.000000,0.018016,0.000000,1.000000,0.000000,0.043305,0.039171,...,0.151707,,tt0093164,en,Heat,Reynolds plays an ex-soldier-of-fortunish char...,1.35223,/lRicKGyG3kjkfvlCvv5kzGlox35.jpg,"[{'name': 'New Century Productions', 'id': 866...",1986-01-01
8,Sabrina,0.020332,0.000000,0.054373,0.075023,0.046562,0.000000,1.000000,0.000000,0.000000,...,0.000000,,tt0114319,en,Sabrina,An ugly duckling having undergone a remarkable...,6.67728,/jQh15y5YB7bWz1NtffNZmRw0s9D.jpg,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...",1995-12-15
9,Sabrina,0.020332,0.000000,0.054373,0.075023,0.046562,0.000000,1.000000,0.000000,0.000000,...,0.000000,,tt0047437,en,Sabrina,Linus and David Larrabee are the two sons of a...,7.35974,/7ITDmatHa2yf5UTzjwaKAvf3Xr6.jpg,"[{'name': 'Paramount Pictures', 'id': 4}]",1954-09-28
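
The repeated `Heat` and `Sabrina` rows come from joining on `title`, which is not unique in the metadata: every film that shares a title matches the same similarity row. That is consistent with `result` having 3767 rows rather than 3000. Below is a sketch of one way to avoid the fan-out, using the `imdb_id` values shown above. Keeping the first metadata row per title is an assumption. A real fix would need a rule for choosing the intended film, or a join on `id` instead of `title`.

```python
import pandas as pd

left = pd.DataFrame({"title": ["Heat", "Sabrina"]})
right = pd.DataFrame({
    "title": ["Heat", "Heat", "Heat", "Sabrina", "Sabrina"],
    "imdb_id": ["tt0113277", "tt0068688", "tt0093164", "tt0114319", "tt0047437"],
})

# Joining on a non-unique key repeats each left row once per match.
naive = left.merge(right, on="title", how="left")

# Assumption: keeping the first metadata row per title is acceptable.
unique_right = right.drop_duplicates("title", keep="first")

# validate raises a MergeError if the right-hand key is still not unique.
deduped = left.merge(unique_right, on="title", how="left", validate="many_to_one")

print(len(naive), len(deduped))
```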


In [576]:
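# Fall back to a Google search URL when a movie has no homepage.
# Assigning the result back avoids relying on inplace=True through attribute access.
result['homepage'] = result['homepage'].fillna('https://www.google.com/search?q=' + result['title'].str.replace(' ', '+'))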

In [577]:
result.head()

Unnamed: 0,title,0,1,2,3,4,5,6,7,8,...,2999,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,release_date
0,Toy Story,1.0,0.03798,0.045326,0.020847,0.019407,0.0,0.020332,0.025055,0.0,...,0.0,http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30
1,Jumanji,0.03798,1.0,0.0,0.023357,0.0,0.0,0.0,0.056143,0.025392,...,0.0,https://www.google.com/search?q=Jumanji,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15
2,Grumpier Old Men,0.045326,0.0,1.0,0.055749,0.02595,0.0,0.054373,0.0,0.0,...,0.0,https://www.google.com/search?q=Grumpier+Old+Men,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",1995-12-22
3,Waiting to Exhale,0.020847,0.023357,0.055749,1.0,0.047741,0.018016,0.075023,0.030817,0.0,...,0.021592,https://www.google.com/search?q=Waiting+to+Exhale,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.85949,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,1995-12-22
4,Father of the Bride Part II,0.019407,0.0,0.02595,0.047741,1.0,0.0,0.046562,0.0,0.0,...,0.0,https://www.google.com/search?q=Father+of+the+...,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.38752,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...",1995-02-10
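
Replacing spaces with `+` works for the titles shown, but a title containing `&`, `?` or `#` would produce a broken query string. A possible refinement is to URL-encode the title with `urllib.parse.quote_plus` from the Python 3 standard library. The notebook appears to run on Python 2, where the same function lives in `urllib`. The frame below is hypothetical, and "Fast & Furious" is an illustrative title not taken from the output above.

```python
import pandas as pd
from urllib.parse import quote_plus

# Hypothetical frame: one row with a homepage, two without.
frame = pd.DataFrame({
    "title": ["Toy Story", "Tom and Huck", "Fast & Furious"],
    "homepage": ["http://toystory.disney.com/toy-story", None, None],
})

# quote_plus turns spaces into '+' and percent-encodes reserved characters.
fallback = "https://www.google.com/search?q=" + frame["title"].map(quote_plus)

# Only rows with a missing homepage receive the fallback URL.
frame["homepage"] = frame["homepage"].fillna(fallback)
print(frame["homepage"].tolist())
```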


In [580]:
result = result.drop(['imdb_id','original_language','original_title','overview','popularity','production_companies'],axis = 1)

In [581]:
result.columns

Index([u'title', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8',
       ...
       u'2993', u'2994', u'2995', u'2996', u'2997', u'2998', u'2999',
       u'homepage', u'poster_path', u'release_date'],
      dtype='object', length=3004)

In [583]:
result.to_csv("../data/contentRec_3000_meta.csv")
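
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. If the file is meant to be reloaded later (an assumption, since the notebook does not show that step), passing `index=False` keeps only the real columns. This sketch uses an in-memory buffer and a hypothetical two-row frame:

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "release_date": ["1995-10-30", "1995-12-15"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```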