**Simple movie recommender script**

---

This script uses an unsupervised k-nearest-neighbours method to recommend new movies to watch based on their similarity to a user-specified input movie.

This script is adapted from: https://www.kaggle.com/kkooijman/tmdb-means-per-genre

---

In [1]:
import pandas as pd
import json
from sklearn.preprocessing import MaxAbsScaler
from sklearn.neighbors import NearestNeighbors
import pickle

---

All machine learning depends on training data. To make a movie recommender we need a bunch of data about movies.

Information on roughly 5000 movies (4803 entries in this file) is contained in a CSV file. Here we define a function that reads the information from that file and puts it into a pandas data frame:

In [2]:
def load_tmdb_movies(path):
    
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
        
    return df

Some of the information in the CSV columns is stored as JSON. Here we define a function to flatten a parsed JSON list into a single pipe-separated string of names:

In [3]:
def pipe_flatten_names(keywords):
    
    return '|'.join([x['name'] for x in keywords])
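
As a small illustration of what the two steps do together (parse, then flatten), here is a sketch on a single made-up cell value. The string `raw_genres` is hypothetical sample data shaped like an entry in the "genres" column; it is not read from the CSV file.

```python
import json

# Hypothetical raw cell value, shaped like an entry in the "genres" column.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'


def pipe_flatten_names(keywords):
    # Same logic as the function above: keep only the "name" of each entry.
    return '|'.join([x['name'] for x in keywords])


parsed = json.loads(raw_genres)    # a list of dictionaries
flat = pipe_flatten_names(parsed)  # a single pipe-separated string
print(flat)
```

An empty list flattens to an empty string, so movies without any listed genre simply end up with no genre names.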

---

Now that we've defined the functions to parse the input data, we can take a look at it. First let's load it using the function above:

In [4]:
movies = load_tmdb_movies('tmdb_5000_movies.csv')

Let's take a look at the first few data entries:

In [5]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{u'id': 28, u'name': u'Action'}, {u'id': 12, ...",http://www.avatarmovie.com/,19995,"[{u'id': 1463, u'name': u'culture clash'}, {u'...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{u'name': u'Ingenious Film Partners', u'id': ...","[{u'iso_3166_1': u'US', u'name': u'United Stat...",2009-12-10,2787965087,162.0,"[{u'iso_639_1': u'en', u'name': u'English'}, {...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{u'id': 12, u'name': u'Adventure'}, {u'id': 1...",http://disney.go.com/disneypictures/pirates/,285,"[{u'id': 270, u'name': u'ocean'}, {u'id': 726,...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{u'name': u'Walt Disney Pictures', u'id': 2},...","[{u'iso_3166_1': u'US', u'name': u'United Stat...",2007-05-19,961000000,169.0,"[{u'iso_639_1': u'en', u'name': u'English'}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{u'id': 28, u'name': u'Action'}, {u'id': 12, ...",http://www.sonypictures.com/movies/spectre/,206647,"[{u'id': 470, u'name': u'spy'}, {u'id': 818, u...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{u'name': u'Columbia Pictures', u'id': 5}, {u...","[{u'iso_3166_1': u'GB', u'name': u'United King...",2015-10-26,880674609,148.0,"[{u'iso_639_1': u'fr', u'name': u'Français'}, ...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{u'id': 28, u'name': u'Action'}, {u'id': 80, ...",http://www.thedarkknightrises.com/,49026,"[{u'id': 849, u'name': u'dc comics'}, {u'id': ...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{u'name': u'Legendary Pictures', u'id': 923},...","[{u'iso_3166_1': u'US', u'name': u'United Stat...",2012-07-16,1084939099,165.0,"[{u'iso_639_1': u'en', u'name': u'English'}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{u'id': 28, u'name': u'Action'}, {u'id': 12, ...",http://movies.disney.com/john-carter,49529,"[{u'id': 818, u'name': u'based on novel'}, {u'...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{u'name': u'Walt Disney Pictures', u'id': 2}]","[{u'iso_3166_1': u'US', u'name': u'United Stat...",2012-03-07,284139100,132.0,"[{u'iso_639_1': u'en', u'name': u'English'}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


You can see from the information above that each "genres" entry is a list of JSON objects. Let's fix that by flattening the info using the function we defined earlier:

In [6]:
movies['genres'] = movies['genres'].apply(pipe_flatten_names)

This simple movie recommender doesn't use all of the available information in the CSV file, so we're going to extract what we will use and put that in its own new data frame:

In [7]:
movies_new_df = movies[['original_title', 'genres','popularity']]

Let's quickly check the data format of the new data frame:

In [8]:
movies_new_df.head()

Unnamed: 0,original_title,genres,popularity
0,Avatar,Action|Adventure|Fantasy|Science Fiction,150.437577
1,Pirates of the Caribbean: At World's End,Adventure|Fantasy|Action,139.082615
2,Spectre,Action|Adventure|Crime,107.376788
3,The Dark Knight Rises,Action|Crime|Drama|Thriller,112.31295
4,John Carter,Action|Adventure|Science Fiction,43.926995


...and let's save it to its own CSV file:

In [9]:
movies_new_df.to_csv('tmb_movies_clean.csv', index=False)

The simple movie recommender will use **genre** and **popularity** to make movie recommendations, so let's extract those into their own data frame:

In [10]:
df = movies[['genres','popularity']]

...and quickly take a look at it:

In [11]:
df.head()

Unnamed: 0,genres,popularity
0,Action|Adventure|Fantasy|Science Fiction,150.437577
1,Adventure|Fantasy|Action,139.082615
2,Action|Adventure|Crime,107.376788
3,Action|Crime|Drama|Thriller,112.31295
4,Action|Adventure|Science Fiction,43.926995


We can also look at the data in other ways. For example:

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 2 columns):
genres        4803 non-null object
popularity    4803 non-null float64
dtypes: float64(1), object(1)
memory usage: 75.1+ KB


In [13]:
df.isnull().sum().sort_values(ascending=False)/len(df)

popularity    0.0
genres        0.0
dtype: float64
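
The expression in the cell above gives the fraction of missing values in each column (all zeros here, so nothing needs to be dropped or filled in). A minimal sketch of the same calculation on a made-up data frame, where `toy` and its contents are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries.
toy = pd.DataFrame({
    'genres': ['Action', None, 'Drama', 'Comedy'],
    'popularity': [1.0, 2.0, np.nan, np.nan],
})

# Count the missing values per column, then divide by the number of rows.
missing_fraction = toy.isnull().sum().sort_values(ascending=False) / len(toy)
print(missing_fraction)
```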

Let's convert the categorical genre data into numerical values and concatenate that array of values with the popularity value into a single array of machine learning features:

In [14]:
features = pd.concat([df.genres.str.get_dummies(sep="|"),df.popularity],axis=1)

**features** is also a data frame and we can take a look at it:

In [15]:
features.head()

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,...,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,popularity
0,1,1,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,150.437577
1,1,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,139.082615
2,1,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,107.376788
3,1,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,112.31295
4,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,43.926995
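
To make the encoding step easier to follow, here is a sketch of the same operation on three made-up rows. The data frame `toy` is an assumption for illustration only; the calls are the same ones used above.

```python
import pandas as pd

# Hypothetical toy data in the same shape as our genre/popularity frame.
toy = pd.DataFrame({
    'genres': ['Action|Crime', 'Drama', 'Action|Drama'],
    'popularity': [10.0, 2.5, 7.0],
})

# One indicator column per genre: 1 if the movie has that genre, 0 otherwise.
dummies = toy.genres.str.get_dummies(sep='|')

# Append popularity as one more numerical feature column.
toy_features = pd.concat([dummies, toy.popularity], axis=1)
print(toy_features)
```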


---

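This is where the machine learning kicks in. First we are going to normalise the values of each feature using the MaxAbsScaler from scikit-learn, which divides each column by its maximum absolute value. Since none of our features are negative, every value ends up in the range 0 to 1: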

In [16]:
max_abs_scaler = MaxAbsScaler()
features = max_abs_scaler.fit_transform(features)
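
As a quick check of what the scaler does, the sketch below compares it with a manual division by each column's maximum absolute value. The array `X` is made-up data: one genre-like 0/1 column and one popularity-like column.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy features: a 0/1 indicator column and a popularity column.
X = np.array([[1.0, 150.0],
              [0.0,  75.0],
              [1.0,  30.0]])

scaled = MaxAbsScaler().fit_transform(X)

# The same result by hand: divide each column by its largest absolute value.
manual = X / np.abs(X).max(axis=0)
print(scaled)
```

Without this step the popularity column, with values above 100, would dominate the distance calculation over the 0/1 genre columns.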

Then we're going to build a machine learning model using the unsupervised version of K-Nearest Neighbors. Here I'm setting *k = 5*:

In [17]:
nn_model = NearestNeighbors(n_neighbors=5,algorithm='auto').fit(features)

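We can use this model to find the indices of, and distances to, the nearest "K" neighbours of each data point. Because we query the model with the same data it was fitted on, the closest neighbour of each movie is normally the movie itself: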

In [18]:
distances, indices = nn_model.kneighbors(features)
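
The shape of the returned arrays is easiest to see on a tiny example. This is a sketch on four made-up 2-D points with *k = 2*; the names `points`, `toy_distances` and `toy_indices` are only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: two tight pairs of points.
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [1.0, 1.0],
                   [1.0, 0.9]])

model = NearestNeighbors(n_neighbors=2).fit(points)

# One row per query point, one column per neighbour, closest first.
toy_distances, toy_indices = model.kneighbors(points)
print(toy_indices)
print(toy_distances)
```

In this example the first column of `toy_indices` is each point's own index (at distance zero) and the second column is its partner in the pair, which is why a recommender built this way skips the first neighbour.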

We can export these neighbour indices to a file so that we can use them again later:

In [19]:
with open('movieindices.pkl', 'wb') as fid:
    # protocol 2 is the newest pickle format that Python 2 can also read
    pickle.dump(indices, fid, 2)
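
Reading the file back later would look something like the sketch below. To keep it self-contained it writes a small made-up array to a temporary file first; the file name `toy_indices.pkl` and the array contents are assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for the real neighbour index array.
toy_indices = np.array([[0, 1], [1, 0]])

# Write to a temporary location so that no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'toy_indices.pkl')
with open(path, 'wb') as fid:
    pickle.dump(toy_indices, fid, 2)

# Later (for example in another script): load the array back from disk.
with open(path, 'rb') as fid:
    restored = pickle.load(fid)
print(restored)
```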

---

Now let's define our query function. **query** is the name of a movie you like; the function prints the movies that our model considers most similar to it.

In [20]:
def similar_movie_content(query):
    
    # first check that the query movie is in the database
    # (test against .values: `in` on a Series checks the index, not the titles):
    if query in movies['original_title'].values:
        
        # find the index of the movie in the database:
        N = movies[movies['original_title'] == query].index[0]
        
        # print the info for the k-nearest neighbour movies,
        # skipping the first neighbour, which is normally the query movie itself:
        print('Similar movies to "{}":'.format(query))
        for n in indices[N][1:]:
            print('Movie: {} \n Genre: {}; Average Popularity: {}'.format(movies.original_title[n],
                                                                      movies.genres[n],
                                                                      movies.popularity[n])) 
        
    else:
        
        # if the query isn't in the database then we can't make any recommendations:
        print('The movie {} does not exist in our database.'.format(query))
        
    return
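
One detail is worth illustrating for the membership check in this function: the `in` operator applied to a pandas Series tests the index labels, not the stored values. A minimal sketch on a made-up Series of titles (sample data only):

```python
import pandas as pd

# Hypothetical toy Series with the default integer index 0, 1, 2.
titles = pd.Series(['Avatar', 'Spectre', 'John Carter'])

by_label = 'Spectre' in titles         # looks for 'Spectre' among the index labels
by_value = 'Spectre' in titles.values  # looks for 'Spectre' among the stored titles
label_hit = 1 in titles                # 1 is an index label, so this one matches

print(by_label, by_value, label_hit)
```

Checking against `.values` (or using `.isin`) is therefore the safer way to ask whether a title is present in the column.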

Now let's call our recommender function. Here I'm testing it by asking for recommendations similar to the movie **Spectre**:

In [21]:
similar_movie_content('Spectre')

Similar movies to "Spectre":
Movie: Speed 
 Genre: Action|Adventure|Crime; Average Popularity: 49.526736
Movie: Kick-Ass 2 
 Genre: Action|Adventure|Crime; Average Popularity: 40.28635
Movie: The Art of War 
 Genre: Crime|Action|Adventure; Average Popularity: 7.832337
Movie: Quantum of Solace 
 Genre: Adventure|Action|Thriller|Crime; Average Popularity: 107.928811
