# Text-based similarities

## Instantiate the TF-IDF model

By default, TF-IDF generates a column for every distinct word across all of your documents (movie summaries in our case). This creates a huge and unintuitive dataset, because it contains both very common words that appear in nearly every document and words that appear so rarely that they provide no value in finding similarities between items.

In this exercise, you will work with the df_plots DataFrame. It contains movies' names in the Title column and their plots in the Plot column.

Using this DataFrame, you will generate the default TF-IDF scores and check whether low-value columns are present.

You will go on to rerun the TF-IDF calculations, this time limiting the number of columns using the min_df and max_df arguments and hopefully see the improvement.
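
To make the effect of min_df and max_df concrete, here is a minimal sketch on a made-up four-document corpus (the `toy_plots` list and the variable names are assumptions for illustration, not part of the exercise data). It compares the vocabulary learned with default settings against the one learned with the same limits used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the movie plots
toy_plots = [
    "a hero saves the city",
    "a hero saves the world",
    "a villain attacks the city",
    "a robot explores the ocean",
]

# Default settings: one column per distinct token
# (the default tokenizer ignores single-character tokens such as "a")
default_vectorizer = TfidfVectorizer()
default_vectorizer.fit(toy_plots)

# Keep only terms found in at least 2 documents and in at most 70% of them
limited_vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)
limited_vectorizer.fit(toy_plots)

default_vocab = sorted(default_vectorizer.vocabulary_)
limited_vocab = sorted(limited_vectorizer.vocabulary_)

print(len(default_vocab), default_vocab)
print(len(limited_vocab), limited_vocab)
```

In this toy corpus, "the" appears in every document and is dropped by max_df, while words that occur in a single document are dropped by min_df.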

In [222]:
import pandas as pd

In [223]:
movies_df = pd.read_csv('tmdb_5000_movies.csv')
movies_df = movies_df[:3000]
movies_df.head()

,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [224]:
df_plots = movies_df[['original_title', 'overview']]
df_plots.head()

,original_title,overview
0,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,Spectre,A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,Following the death of District Attorney Harve...
4,John Carter,"John Carter is a war-weary, former military ca..."


In [225]:
# Assign the renamed result instead of renaming in place; this avoids the
# SettingWithCopyWarning raised when a slice of movies_df is modified in place
df_plots = df_plots.rename(columns={'original_title': 'Title', 'overview': 'Plot'})


In [226]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'].values.astype('U'))

# Look at the features generated
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
vectorizer.get_feature_names()

['00',
 '000',
 '007',
 '10',
 '100',
 '10th',
 '11',
 '119',
 '11th',
 '12',
 '1200',
 '13',
 '14',
 '14th',
 '15',
 '150',
 '15th',
 '16',
 '16th',
 '17',
 '17th',
 '18',
 '1800',
 '1863',
 '18th',
 '19',
 '1900',
 '1914',
 '1919',
 '1920',
 '1920s',
 '1927',
 '1930s',
 '1933',
 '1937',
 '1939',
 '1940',
 '1940s',
 '1941',
 '1944',
 '1945',
 '1946',
 '1947',
 '1949',
 '1950',
 '1950s',
 '1951',
 '1952',
 '1955',
 '1958',
 '1959',
 '1960',
 '1960s',
 '1964',
 '1965',
 '1967',
 '1970s',
 '1971',
 '1972',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1994',
 '1995',
 '1999',
 '19th',
 '20',
 '200',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2009',
 '2010',
 '2013',
 '2019',
 '20th',
 '21',
 '21st',
 '22nd',
 '23',
 '24',
 '25',
 '250',
 '26',
 '28',
 '29',
 '2nd',
 '30',
 '300',
 '3000',
 '30s',
 '35',
 '3d',
 '3po',
 '40',
 '400',
 '47',
 '48',
 '4th',
 '50',
 '500',
 '50s',
 ...]

## Creating the TF-IDF DataFrame

Now that you have generated your TF-IDF features, you need to get them into a format that you can use to make recommendations. You will once again leverage pandas for this and wrap the array in a DataFrame. Because you will be filtering the data by movie title, you can assign the titles to the DataFrame's index.

The df_plots DataFrame has once again been loaded for you. It contains movies' names in the Title column and their plots in the Plot column.

In [227]:
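# Create a DataFrame from the TF-IDF array
# (on scikit-learn 1.2 or later, use get_feature_names_out() instead of get_feature_names())
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names())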

# Assign the movie titles to the index and inspect
tfidf_df.index = df_plots['Title']
tfidf_df.head()

Title,00,000,007,10,100,10th,11,119,11th,12,...,zhao,zion,zoe,zombie,zombies,zone,zoo,zooey,zookeeper,zorro
Avatar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Pirates of the Caribbean: At World's End,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Spectre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
John Carter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Comparing all your movies with TF-IDF

Now that you have put in the hard work of getting your TF-IDF data into a usable format, it's time to put it to work finding similarities and generating recommendations.

This time, because you are using TF-IDF scores (which are floats as opposed to Booleans), you will use the cosine similarity metric to find the similarities between items. In this exercise, you will generate a matrix of all of the movie cosine similarities and store them in a DataFrame for ease of lookup. This will allow you to compare movies and find recommendations quickly and easily.

The tfidf_df DataFrame you created in the last exercise containing a row for each movie has been loaded for you.
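
As a small illustration of what cosine similarity measures, the sketch below computes it by hand (dot product divided by the product of the vector norms) and checks the result against scikit-learn. The three vectors and the `manual_cosine` helper are hypothetical, chosen only to show the two extreme cases.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF rows for three items over four terms
item_a = np.array([0.5, 0.0, 0.5, 0.0])
item_b = np.array([1.0, 0.0, 1.0, 0.0])  # same direction as item_a, larger magnitude
item_c = np.array([0.0, 0.7, 0.0, 0.3])  # shares no terms with item_a


def manual_cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_ab = manual_cosine(item_a, item_b)
sim_ac = manual_cosine(item_a, item_c)

# scikit-learn computes the same quantity for every pair of rows at once
pairwise = cosine_similarity(np.vstack([item_a, item_b, item_c]))

print(sim_ab, sim_ac)
print(pairwise)
```

Items that use the same terms in the same proportions score close to 1 regardless of magnitude, and items with no terms in common score 0.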

In [228]:
tfidf_summary_df = tfidf_df

In [229]:
# Import cosine_similarity measure
from sklearn.metrics.pairwise import cosine_similarity

# Create the array of cosine similarity values
cosine_similarity_array = cosine_similarity(tfidf_summary_df)

# Wrap the array in a pandas DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_summary_df.index, 
                                    columns=tfidf_summary_df.index)

# Print the top 5 rows of the DataFrame
cosine_similarity_df.head()

Title,Avatar,Pirates of the Caribbean: At World's End,Spectre,The Dark Knight Rises,John Carter,Spider-Man 3,Tangled,Avengers: Age of Ultron,Harry Potter and the Half-Blood Prince,Batman v Superman: Dawn of Justice,...,Jeepers Creepers,Bill & Ted's Excellent Adventure,Oliver!,The Best Exotic Marigold Hotel,Recess: School's Out,Mad Max Beyond Thunderdome,Commando,The Boy,Devil,Friday After Next
Avatar,1.0,0.022558,0.006077,0.032716,0.04006,0.041562,0.01282,0.052983,0.007094,0.007403,...,0.064358,0.009685,0.007161,0.014525,0.025916,0.064795,0.014226,0.004047,0.014361,0.010606
Pirates of the Caribbean: At World's End,0.022558,1.0,0.020982,0.013456,0.056598,0.004523,0.009666,0.052721,0.018047,0.032644,...,0.004248,0.046534,0.092393,0.018046,0.060777,0.0,0.04526,0.024316,0.017781,0.017684
Spectre,0.006077,0.020982,1.0,0.006165,0.004512,0.012784,0.002887,0.045489,0.025347,0.016324,...,0.008725,0.010447,0.007237,0.0,0.051655,0.0,0.01056,0.0,0.0,0.014644
The Dark Knight Rises,0.032716,0.013456,0.006165,1.0,0.020639,0.014329,0.025678,0.038262,0.024734,0.131414,...,0.007736,0.008937,0.017342,0.009328,0.014584,0.0,0.030225,0.020017,0.011318,0.008608
John Carter,0.04006,0.056598,0.004512,0.020639,1.0,0.016891,0.030714,0.066306,0.018114,0.039837,...,0.014669,0.017542,0.011332,0.038065,0.057423,0.058931,0.072803,0.00601,0.103747,0.012997


## Making recommendations with TF-IDF

In the last exercise you pre-calculated the similarity ratings between all movies in the dataset based on their plots transformed by TF-IDF. Now you will put these similarity ratings in a DataFrame for ease of use. Then you will use this new DataFrame to suggest a movie recommendation.

The cosine_similarity_array containing a matrix of the similarity values between all movies that you created in the last exercise has been loaded for you. The tfidf_summary_df DataFrame containing the movies and their TF-IDF features is also available.

In [230]:
# Wrap the preloaded array in a DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_summary_df.index, 
                                    columns=tfidf_summary_df.index)

# Find the values for the movie Avatar
cosine_similarity_series = cosine_similarity_df.loc['Avatar']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

In [231]:
ordered_similarities.head(10)

Title
Avatar                          1.000000
The American                    0.172487
Tears of the Sun                0.159614
Obitaemyy Ostrov                0.146821
Blood and Chocolate             0.138074
The Book of Life                0.131167
The Matrix                      0.129570
The Adventures of Pluto Nash    0.121129
Semi-Pro                        0.115918
Beowulf                         0.113885
Name: Avatar, dtype: float64
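
Note that the top result is the movie itself, with a similarity of 1.0. One possible way to handle this is a small helper that drops the queried title before ranking. The sketch below is an assumption-laden illustration: `top_similar` and the toy matrix are hypothetical, and it assumes every title in the index is unique.

```python
import pandas as pd


def top_similar(similarity_df, title, n=5):
    """Return the n highest-scoring titles for `title`, excluding `title` itself."""
    scores = similarity_df.loc[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


# Hypothetical 3x3 similarity matrix standing in for cosine_similarity_df
titles = ["Movie A", "Movie B", "Movie C"]
toy_similarity_df = pd.DataFrame(
    [[1.0, 0.2, 0.6],
     [0.2, 1.0, 0.1],
     [0.6, 0.1, 1.0]],
    index=titles,
    columns=titles,
)

recommendations = top_similar(toy_similarity_df, "Movie A", n=2)
print(recommendations)
```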

## Build the user profiles

You are now able to generate suggestions for similar items based on their labeled features or based on their descriptions. But sometimes finding similar items might not be enough. In the next exercises, you will work through how one could create recommendations based on a user and all the items they liked, as opposed to a single item. You will first generate a profile for a user by aggregating all of the movies they have previously enjoyed.

The tfidf_summary_df you have been working on in the last few exercises has been loaded for you. This contains a row per movie with their titles as the index and a column for each feature containing their respective TF-IDF score.

In [232]:
list_of_movies_enjoyed = ['Captain America: The First Avenger', 'Green Lantern', 'The Avengers']

# Create a subset of only the movies the user has enjoyed
movies_enjoyed_df = tfidf_summary_df.reindex(list_of_movies_enjoyed)

# Generate the user profile by finding the average scores of movies they enjoyed
user_prof = movies_enjoyed_df.mean()

# Inspect the results
user_prof

00           0.0
000          0.0
007          0.0
10           0.0
100          0.0
            ... 
zone         0.0
zoo          0.0
zooey        0.0
zookeeper    0.0
zorro        0.0
Length: 8315, dtype: float64
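
One detail worth knowing about this approach: reindex does not fail when a title is absent from the index. It adds a row of NaN values instead, and mean() then skips that row. The sketch below shows this on a hypothetical three-movie table (all names and values are made up) and lists the unmatched titles before averaging.

```python
import pandas as pd

# Hypothetical TF-IDF table: three movies, three terms
toy_tfidf_df = pd.DataFrame(
    {"hero": [0.8, 0.4, 0.0], "space": [0.0, 0.6, 0.0], "love": [0.0, 0.0, 1.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# "Movie X" is not in the table, so reindex fills its row with NaN
enjoyed = ["Movie A", "Movie B", "Movie X"]
enjoyed_df = toy_tfidf_df.reindex(enjoyed)

# List the titles that could not be matched before averaging
missing_titles = enjoyed_df.index[enjoyed_df.isna().all(axis=1)].tolist()

# mean() skips NaN by default, so the profile averages only the matched movies
toy_profile = enjoyed_df.mean()

print(missing_titles)
print(toy_profile)
```

Checking for unmatched titles can help catch spelling differences between the list of enjoyed movies and the titles in the dataset.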

## User profile based recommendations

Now that you have built the user profile based on the aggregate of the individual movies they enjoyed, you can compare it to the larger tfidf_summary_df DataFrame that you have been working with to generate suggestions. As you would not want to suggest movies that the user has already watched, you will first find a subset of the tfidf_summary_df DataFrame that does not contain any of the previously watched movies.

The Series user_prof that you generated in the last exercise, which holds one averaged TF-IDF score per feature for the user, has been loaded for you. Similarly, the list_of_movies_enjoyed has been loaded so you can exclude them from the predictions.


In [233]:
# Find subset of tfidf_summary_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_summary_df.drop(list_of_movies_enjoyed, axis=0)

# Calculate the cosine_similarity and wrap it in a DataFrame
similarity_array = cosine_similarity(user_prof.values.reshape(1, -1), tfidf_subset_df)
similarity_df = pd.DataFrame(similarity_array.T, index=tfidf_subset_df.index, columns=["similarity_score"])

# Sort the values from high to low by the values in the similarity_score
sorted_similarity_df = similarity_df.sort_values(by="similarity_score", ascending=False)

# Inspect the most similar to the user preferences
sorted_similarity_df.head(10)

Title,similarity_score
Captain America: The Winter Soldier,0.199847
Avengers: Age of Ultron,0.163996
Ballistic: Ecks vs. Sever,0.150866
Enemy at the Gates,0.143811
Valkyrie,0.136726
The Thin Red Line,0.13323
The Art of War,0.128399
The Abyss,0.128183
Superman IV: The Quest for Peace,0.126407
The Losers,0.125889
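
The profile-building and ranking steps can be combined into a single reusable function. The following is only a sketch under stated assumptions: `recommend_for_user` and the toy table are hypothetical, titles are assumed to be unique, and at least one enjoyed title is assumed to be present in the index.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def recommend_for_user(tfidf_df, enjoyed_titles, n=5):
    """Rank unseen titles by cosine similarity to the mean TF-IDF vector of enjoyed titles."""
    # Ignore enjoyed titles that are not present in the table
    known = [t for t in enjoyed_titles if t in tfidf_df.index]
    profile = tfidf_df.loc[known].mean()
    # Exclude movies the user has already seen
    candidates = tfidf_df.drop(index=known)
    scores = cosine_similarity(profile.values.reshape(1, -1), candidates.values)
    ranked = pd.Series(scores.ravel(), index=candidates.index, name="similarity_score")
    return ranked.sort_values(ascending=False).head(n)


# Hypothetical TF-IDF table: four movies, three terms
toy_tfidf_df = pd.DataFrame(
    {"hero": [0.8, 0.6, 0.7, 0.0], "space": [0.0, 0.8, 0.1, 0.0], "love": [0.0, 0.0, 0.0, 1.0]},
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

toy_recommendations = recommend_for_user(toy_tfidf_df, ["Movie A", "Movie B"], n=2)
print(toy_recommendations)
```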
