# Content-Based Recommendations
Discover how item attributes can be used to make recommendations. Create valuable comparisons between items with both categorical and text data. Generate profiles to recommend new items for users based on their past preferences.

In [1]:
import pandas as pd

In [2]:
movies_df = pd.read_csv('movies.csv')

In [3]:
movies_df.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
# Split the pipe-delimited genres and expand to one row per (movieId, genre) pair
movie_genre_df = movies_df[['movieId', 'genres']].copy()
movie_genre_df['genres'] = movie_genre_df['genres'].str.split('|')
movie_genre_df = movie_genre_df.explode('genres').reset_index(drop=True)

In [5]:
movie_genre_df.head()

,movieId,genres
0,1,Adventure
1,1,Animation
2,1,Children
3,1,Comedy
4,1,Fantasy


In [6]:
movie_genre_df = pd.merge(movie_genre_df, movies_df[['movieId','title']], how="inner", on=["movieId"])

In [7]:
movie_genre_df.head()

,movieId,genres,title
0,1,Adventure,Toy Story (1995)
1,1,Animation,Toy Story (1995)
2,1,Children,Toy Story (1995)
3,1,Comedy,Toy Story (1995)
4,1,Fantasy,Toy Story (1995)


## Creating content-based data
As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few exercises, you will explore your base data and work through how to format that data to be used for content-based recommendations.

As a reminder, the desired outcome is one row per movie, with each column indicating whether a genre applies to that movie. You will be looking at <code>movie_genre_df</code>, which contains these columns:
- movieId - ID of the movie
- genres - A genre that the movie has been labeled as
- title - Name of the movie

A movie may have multiple genres, and therefore multiple rows. In this exercise, you will particularly focus on one movie (Toy Story in this case) to be able to clearly see what is happening with the data.
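
To make the target shape concrete, here is a minimal sketch of the cross-tabulation step on a tiny, made-up table. The titles `Movie A` and `Movie B` and the name `toy_long_df` are hypothetical and only stand in for the real `movie_genre_df`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (title, genre) pair
toy_long_df = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B', 'Movie B'],
    'genres': ['Comedy', 'Romance', 'Comedy', 'Drama', 'Romance'],
})

# Count how often each genre occurs for each title.
# Each (title, genre) pair appears at most once here, so the counts are 0 or 1.
toy_cross_table = pd.crosstab(toy_long_df['title'], toy_long_df['genres'])
print(toy_cross_table)
```

Each title becomes a single row, and each distinct genre becomes a column, which is the layout the rest of this notebook relies on.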

In [8]:
# Select only the rows whose title column equals Toy Story (1995)
toy_story_genres = movie_genre_df[movie_genre_df['title'] == 'Toy Story (1995)']

# Create a cross-tabulated DataFrame from the title and genres columns
movie_cross_table = pd.crosstab(movie_genre_df['title'], movie_genre_df['genres'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story (1995)']
toy_story_genres_ct

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0


## Understanding the content-based data

You are now able to convert common attribute data to a DataFrame containing a row per movie, and each of its attributes as columns. You will now take a closer look at the full DataFrame you just created to see if you understand the information within.

A subset of the DataFrame you have created in the last exercise has been loaded as <code>movie_cross_table</code>. As a reminder, the genres are stored as individual columns and the movie names are stored as the index.

Inspect the rows corresponding to 'Toy Story (1995)' and 'Jumanji (1995)' in <code>movie_cross_table</code>. How many genres do they have in common?

In [9]:
movie_cross_table[movie_cross_table.index=='Toy Story (1995)']

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [10]:
movie_cross_table[movie_cross_table.index=='Jumanji (1995)']

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
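
Rather than counting by eye, the shared genres can be found with an element-wise comparison of the two rows. The sketch below rebuilds just the non-zero columns of the two rows shown above in a small stand-in table (`sample_table` is a hypothetical name), so it runs without the full dataset.

```python
import pandas as pd

# Stand-in for movie_cross_table, restricted to the genres that are set
# for at least one of the two movies in the outputs above
genre_columns = ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
sample_table = pd.DataFrame(
    [[1, 1, 1, 1, 1], [1, 0, 1, 0, 1]],
    index=['Toy Story (1995)', 'Jumanji (1995)'],
    columns=genre_columns,
)

toy_story_row = sample_table.loc['Toy Story (1995)']
jumanji_row = sample_table.loc['Jumanji (1995)']

# A genre is shared when both rows have a 1 in that column
shared_mask = (toy_story_row == 1) & (jumanji_row == 1)
shared_genres = shared_mask[shared_mask].index.tolist()
print(shared_genres, len(shared_genres))
```

With the real data, the same two `.loc` lookups on `movie_cross_table` should give the same answer, since all other genre columns are 0 for both movies.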


## Comparing individual movies with Jaccard similarity

In the last lesson, you built a DataFrame of movies, where each column represents a different genre. You can now use this DataFrame to compare movies by measuring the Jaccard similarity between rows. The higher the Jaccard similarity score, the more similar the two items are.

In this exercise, you will compare the movie <code>Jumanji</code> with the movie <code>Toy Story</code>, then <code>Jumanji</code> with <code>Shin Godzilla</code>, and compare the two scores.

The DataFrame <code>movie_cross_table</code> containing all the movies as rows and the genres as Boolean columns that you created in the last lesson has been loaded.
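
For intuition, the Jaccard similarity of two genre sets $A$ and $B$ is $J(A, B) = |A \cap B| / |A \cup B|$: the number of shared genres divided by the number of genres that apply to either movie. The following is a small sketch that computes it by hand with plain Python sets. The helper name `jaccard_from_sets` is made up for this illustration, and the genre sets are copied from the Toy Story and Jumanji rows shown earlier.

```python
def jaccard_from_sets(first, second):
    """Jaccard similarity of two collections, treated as sets."""
    first, second = set(first), set(second)
    union = first | second
    if not union:
        # Assumption for this sketch: two empty sets count as 0.0 similar
        return 0.0
    return len(first & second) / len(union)

toy_story_genres_set = {'Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'}
jumanji_genres_set = {'Adventure', 'Children', 'Fantasy'}

# 3 shared genres out of 5 distinct genres overall
manual_score = jaccard_from_sets(toy_story_genres_set, jumanji_genres_set)
print(manual_score)
```

On 0/1 genre rows, `jaccard_score` from scikit-learn computes the same ratio, so this hand calculation is a handy sanity check.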

In [11]:
# Import numpy and the Jaccard similarity function
import numpy as np
from sklearn.metrics import jaccard_score

In [12]:
# Extract the genre rows for Jumanji and Toy Story as NumPy arrays
jumanji_values = movie_cross_table.loc['Jumanji (1995)'].to_numpy()
toy_story_values = movie_cross_table.loc['Toy Story (1995)'].to_numpy()

# Find the similarity between Jumanji and Toy Story
print(jaccard_score(jumanji_values, toy_story_values))

# Repeat for Jumanji and Shin Godzilla
godzilla_values = movie_cross_table.loc['Shin Godzilla (2016)'].to_numpy()
print(jaccard_score(jumanji_values, godzilla_values))

0.6
0.4


## Comparing all your movies at once

Finding the Jaccard similarity between two individual movies works well for small-scale analyses, but computing it pair by pair every time you need a recommendation becomes slow on larger datasets.

In this exercise, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.

When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it's more efficient to use the <code>pdist()</code> (pairwise distance) function from <code>scipy</code>.

<code>pdist()</code> returns a condensed, one-dimensional array with one distance per pair of rows. It can be reshaped into the desired square matrix using <code>squareform()</code> from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.

<code>movie_cross_table</code> has once again been loaded for you.
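
To see what the condensed output and the square matrix look like, here is a small sketch on three made-up Boolean rows. The array `toy_rows` is hypothetical and merely stands in for `movie_cross_table.values`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical items described by four Boolean attributes
toy_rows = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
], dtype=bool)

# Condensed distances: one value per unordered pair, n * (n - 1) / 2 in total
condensed = pdist(toy_rows, metric='jaccard')

# Expand to a symmetric n x n matrix with zeros on the diagonal
square_distances = squareform(condensed)

# Convert distances to similarities
toy_similarity = 1 - square_distances
print(condensed)
print(toy_similarity)
```

The diagonal of the similarity matrix is 1 because every item is at distance 0 from itself.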

In [13]:
# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = pdist(movie_cross_table.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=movie_cross_table.index, columns=movie_cross_table.index)

# Print the top 5 rows of the DataFrame
jaccard_similarity_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.125,0.2,0.333333,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.4,0.2,0.2,0.2,0.4,0.4,0.4,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.125,1.0,0.0,0.0,0.0,0.0,0.2,0.0,0.142857,0.285714,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.142857,0.166667,0.166667
'Round Midnight (1986),0.2,0.0,1.0,0.2,0.333333,0.0,0.0,0.5,0.25,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333
'Salem's Lot (2004),0.333333,0.0,0.2,1.0,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.75,0.5,0.5,0.2,0.166667,0.166667,0.166667,0.0,0.0
'Til There Was You (1997),0.2,0.0,0.333333,0.2,1.0,0.5,0.0,0.5,0.666667,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0


## Making recommendations based on movie genres

Now that you have your data in a usable format and know how to compare two movies, the next step is to use this to generate recommendations. In this exercise, you will learn how to generate recommendations for any movie in your dataset. The similarity scores between all movies in the dataset that you calculated in the last exercise are available as the DataFrame <code>jaccard_similarity_df</code>, which is indexed by movie title on both axes. <code>movie_cross_table</code> containing the movies and their attributes is also available.

You will use this DataFrame to look up the movies most similar to a given title. Note that every movie has a similarity of 1.0 with itself, so it appears among its own top matches.
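
The lookup can also be wrapped in a small helper that drops the queried movie from its own results. This is only a sketch: `recommend_similar` is a hypothetical function name, and the three-movie similarity table is made up so the example runs on its own. It assumes every title appears exactly once in the index.

```python
import pandas as pd

def recommend_similar(similarity_df, title, n=5):
    """Return the n titles most similar to `title`, excluding the title itself."""
    # Take the row for the queried movie and remove its self-similarity entry
    scores = similarity_df.loc[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Hypothetical similarity matrix standing in for jaccard_similarity_df
toy_titles = ['Movie A', 'Movie B', 'Movie C']
toy_similarity_df = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.25],
     [0.0, 0.25, 1.0]],
    index=toy_titles,
    columns=toy_titles,
)

top_matches = recommend_similar(toy_similarity_df, 'Movie A', n=2)
print(top_matches)
```

With the real data, a call such as `recommend_similar(jaccard_similarity_df, 'Shin Godzilla (2016)', n=20)` would follow the same pattern. Movies with identical genre sets tie at 1.0, so their relative order is not meaningful.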

In [14]:
# Find the similarity values for the movie Shin Godzilla
jaccard_similarity_series = jaccard_similarity_df.loc['Shin Godzilla (2016)']

# Sort these values from highest to lowest
ordered_similarities = jaccard_similarity_series.sort_values(ascending=False)

# Print the results
ordered_similarities.head(20)

title
Wolverine, The (2013)                                                        1.0
Thunderbirds (2004)                                                          1.0
Krull (1983)                                                                 1.0
Rogue One: A Star Wars Story (2016)                                          1.0
Godzilla vs. Mothra (Mosura tai Gojira) (1964)                               1.0
Shin Godzilla (2016)                                                         1.0
Fantastic Four (2015)                                                        1.0
Trip to the Moon, A (Voyage dans la lune, Le) (1902)                         1.0
The Pumaman (1980)                                                           1.0
Marvel One-Shot: Agent Carter (2013)                                         1.0
Hellboy II: The Golden Army (2008)                                           1.0
Batman v Superman: Dawn of Justice (2016)                                    1.0
X-Men: Apocalypse (201

In [15]:
# Look up a single pair with one label-based access (row label, column label)
print(jaccard_similarity_df.loc['Red Dawn (1984)', 'Shin Godzilla (2016)'])

0.16666666666666663


In [16]:
movie_cross_table[movie_cross_table.index=='Wolverine, The (2013)']

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
"Wolverine, The (2013)",0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0


In [17]:
movie_cross_table[movie_cross_table.index=='Krull (1983)']

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Krull (1983),0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
