# 📋 1. Content-based recommendation

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('movies.csv')
df.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [4]:
# Adding year column
df[['name', 'year']] = df.title.str.split(r'\(|\)', expand=True).iloc[:, [0,1]]
df.head(3)

Unnamed: 0,movieId,title,genres,name,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men,1995
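
Splitting on every parenthesis takes whatever sits in the first pair of brackets as the year, which could misfire on a title that carries an extra parenthesised alias. A minimal sketch of an alternative, assuming the year always appears as a trailing `(YYYY)`; the sample titles and the `sample` and `parts` names are illustrative and not taken from `movies.csv`:

```python
import pandas as pd

# Illustrative titles (not read from movies.csv)
sample = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Some Film (An Alternate Title) (2001)",
        "Untitled Project",
    ]
})

# Capture everything before a trailing "(YYYY)" as the name and the four digits as the year
parts = sample["title"].str.extract(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")

# Titles without a trailing year give NaN in both columns; fall back to the raw title for the name
parts["name"] = parts["name"].fillna(sample["title"]).str.strip()
print(parts)
```

Titles without a recognisable year keep a missing `year` value, so they can be inspected separately rather than silently mis-split.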


In [5]:
df.drop('title', axis=1, inplace=True)
df.head(3)

Unnamed: 0,movieId,genres,name,year
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Adventure|Children|Fantasy,Jumanji,1995
2,3,Comedy|Romance,Grumpier Old Men,1995


In [6]:
df['name'] = df['name'].str.strip()

In [7]:
movie_genre_df = df[['genres', 'name']]
movie_genre_df

Unnamed: 0,genres,name
0,Adventure|Animation|Children|Comedy|Fantasy,Toy Story
1,Adventure|Children|Fantasy,Jumanji
2,Comedy|Romance,Grumpier Old Men
3,Comedy|Drama|Romance,Waiting to Exhale
4,Comedy,Father of the Bride Part II
...,...,...
9737,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic
9738,Animation|Comedy|Fantasy,No Game No Life: Zero
9739,Drama,Flint
9740,Action|Animation,Bungo Stray Dogs: Dead Apple


In [8]:
movie_genre_df = movie_genre_df.apply(lambda x: x.str.split('|').explode()).reset_index()
movie_genre_df

Unnamed: 0,index,genres,name
0,0,Adventure,Toy Story
1,0,Animation,Toy Story
2,0,Children,Toy Story
3,0,Comedy,Toy Story
4,0,Fantasy,Toy Story
...,...,...,...
22079,9738,Fantasy,No Game No Life: Zero
22080,9739,Drama,Flint
22081,9740,Action,Bungo Stray Dogs: Dead Apple
22082,9740,Animation,Bungo Stray Dogs: Dead Apple
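
The split-and-explode step above can be hard to follow on the full dataset. A small sketch of the same idea on two illustrative rows, using `assign` plus `DataFrame.explode` on the `genres` column only (the `movies` and `long_df` names are placeholders for this example):

```python
import pandas as pd

# Illustrative rows in the same shape as movie_genre_df before exploding
movies = pd.DataFrame({
    "genres": ["Adventure|Children|Fantasy", "Comedy|Romance"],
    "name": ["Jumanji", "Grumpier Old Men"],
})

# Turn the pipe-delimited string into a list, then emit one row per list element
long_df = (
    movies.assign(genres=movies["genres"].str.split("|"))
          .explode("genres")
          .reset_index(drop=True)
)
print(long_df)
```

Passing `drop=True` to `reset_index` avoids creating the extra `index` column that otherwise has to be dropped in a separate step.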


In [9]:
movie_genre_df.drop('index', inplace=True, axis=1)

In [10]:
movie_genre_df

Unnamed: 0,genres,name
0,Adventure,Toy Story
1,Animation,Toy Story
2,Children,Toy Story
3,Comedy,Toy Story
4,Fantasy,Toy Story
...,...,...
22079,Fantasy,No Game No Life: Zero
22080,Drama,Flint
22081,Action,Bungo Stray Dogs: Dead Apple
22082,Animation,Bungo Stray Dogs: Dead Apple


In [11]:
movie_genre_df['genres'].unique()

array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'War', 'Musical', 'Documentary', 'IMAX',
       'Western', 'Film-Noir', '(no genres listed)'], dtype=object)

In [12]:
# Saving data frame into csv
movie_genre_df.to_csv('movie_genre_df.csv')

##  ✏️ Exercises

> ## 1. Creating content-based data

As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few exercises, you will explore your base data and work through how to format that data to be used for content-based recommendations.

As a reminder, the desired outcome is a row per movie with each column indicating whether a genre applies to the movie. You will be looking at `movie_genre_df`, which contains these columns:

- `name` - Name of the movie
- `genres` - A genre that the movie has been labeled as

A movie may have multiple genres, and therefore multiple rows. In this exercise, you will particularly focus on one movie (Toy Story in this case) to be able to clearly see what is happening with the data.

### How many different movies are contained in `movie_genre_df`?


In [13]:
movie_genre_df = pd.read_csv('movie_genre_df.csv', index_col=False)

In [14]:
len(movie_genre_df['name'].unique())

9412

### Get the rows in `movie_genre_df` which have a name equal to Toy Story and save this as `toy_story_genres`.

In [15]:
toy_story_genres = movie_genre_df[movie_genre_df['name'] == "Toy Story"]
toy_story_genres.head()

Unnamed: 0.1,Unnamed: 0,genres,name
0,0,Adventure,Toy Story
1,1,Animation,Toy Story
2,2,Children,Toy Story
3,3,Comedy,Toy Story
4,4,Fantasy,Toy Story


### Transform movie_genre_df to a table called movie_cross_table.

Assign the subset of `movie_cross_table` that contains Toy Story to the variable `toy_story_genres_ct` and inspect the results.

In [16]:
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df['name'] == 'Toy Story']

# Create cross-tabulated DataFrame from the name and genres columns
movie_cross_table = pd.crosstab(movie_genre_df['name'], movie_genre_df['genres'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story']
print(toy_story_genres_ct)

genres     (no genres listed)  Action  Adventure  Animation  Children  Comedy  \
name                                                                            
Toy Story                   0       0          1          1         1       1   

genres     Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  \
name                                                                     
Toy Story      0            0      0        1          0       0     0   

genres     Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western  
name                                                                  
Toy Story        0        0        0       0         0    0        0  
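
One thing worth keeping in mind is that `pd.crosstab` counts occurrences rather than flagging membership. If the same (name, genre) pair appears more than once, which could happen when two films share a name once the year is stripped, the cell holds a count above 1. A minimal sketch on illustrative data showing the effect and one possible way to get back to 0/1 indicators:

```python
import pandas as pd

# Illustrative long-format data: one row per (movie, genre) pair
long_df = pd.DataFrame({
    "name":   ["Toy Story", "Toy Story", "Yogi Bear", "Yogi Bear", "Yogi Bear"],
    "genres": ["Children", "Comedy", "Children", "Comedy", "Comedy"],  # last pair duplicated on purpose
})

# crosstab counts occurrences, so the duplicated pair shows up as 2
counts = pd.crosstab(long_df["name"], long_df["genres"])

# Clip to 0/1 so every cell is a plain membership indicator
indicators = counts.clip(upper=1)
print(indicators)
```

Whether clipping is appropriate depends on whether same-named films should really be merged; that is a modelling choice this sketch does not settle.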


### Understanding the content-based data

You are now able to convert common attribute data to a DataFrame containing a row per movie, and each of its attributes as columns. You will now take a closer look at the full DataFrame you just created to see if you understand the information within.

A subset of the DataFrame you have created in the last exercise has been loaded as movie_cross_table. As a reminder, the genres are stored as individual columns and the movie names are stored as the index.

Inspect the rows corresponding to 'Toy Story' and 'Yogi Bear' in movie_cross_table. How many genres do they have in common?


Possible Answers

- 0 genres in common

- 2 genres in common ✅ (*Children and Comedy*)

- 4 genres in common

- 6 genres in common



In [17]:
movie_cross_table

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
'71,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
'Hellboy': The Seeds of Creation,0,1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
'Round Midnight,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
'Salem's Lot,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0
'Til There Was You,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
xXx,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
xXx: State of the Union,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
¡Three Amigos!,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [18]:
selected_movies = ["Toy Story", "Yogi Bear"]
movie_cross_table[movie_cross_table.index.isin(selected_movies)]


genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Yogi Bear,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


> Correct! Yogi Bear and Toy Story both have the 'Children' and 'Comedy' attributes. The more genres that two movies have in common, the more likely it is that someone who liked one will like the other, so now we're going to apply this at a larger scale instead of just one pair of movies.
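
Rather than reading the shared genres off the table by eye, they can be computed. A short sketch, with the two indicator rows copied from the Toy Story and Yogi Bear output above and the all-zero columns left out for brevity (`table` is a stand-in for the relevant slice of `movie_cross_table`):

```python
import pandas as pd

genres = ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]
# Indicator rows for the two movies (columns that are zero for both are omitted)
table = pd.DataFrame(
    [[1, 1, 1, 1, 1],
     [0, 0, 1, 1, 0]],
    index=["Toy Story", "Yogi Bear"],
    columns=genres,
)

# A genre is shared when both rows hold a 1 in that column
shared_mask = (table.loc["Toy Story"] == 1) & (table.loc["Yogi Bear"] == 1)
shared_genres = shared_mask[shared_mask].index.tolist()
print(len(shared_genres), shared_genres)
```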

# 📋 2. Making content-based recommendations

- We need a way to calculate similarities between rows.
- We are going to use the **Jaccard similarity**

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

It ranges from zero to one: the higher the Jaccard similarity, the more similar the two items are.
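
As a worked example, the Jaccard similarity can be computed directly on sets of genre labels. This is a plain-Python sketch rather than the library routine used in the cells below; the `jaccard_similarity` helper is hypothetical, and treating two empty sets as similarity 0 is an assumption made here to avoid dividing by zero:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|; two empty sets are treated as having similarity 0."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

toy_story = {"Adventure", "Animation", "Children", "Comedy", "Fantasy"}
yogi_bear = {"Children", "Comedy"}

print(jaccard_similarity(toy_story, yogi_bear))  # 2 shared genres / 5 distinct genres = 0.4
```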

In [19]:
# Data 
movie_genre_df.head()

Unnamed: 0.1,Unnamed: 0,genres,name
0,0,Adventure,Toy Story
1,1,Animation,Toy Story
2,2,Children,Toy Story
3,3,Comedy,Toy Story
4,4,Fantasy,Toy Story


In [20]:
# Calculating Jaccard similarity between movies 
from sklearn.metrics import jaccard_score

toy_row = movie_cross_table.loc['Toy Story']
yogi_row = movie_cross_table.loc['Yogi Bear']

In [21]:
print(jaccard_score(toy_row, yogi_row))

0.4


## ✏️ Exercises

> # 1. Comparing individual movies with Jaccard similarity

In the last lesson, you built a DataFrame of movies, where each column represents a different genre. You can now use this DataFrame to compare movies by measuring the Jaccard similarity between rows. The higher the Jaccard similarity score, the more similar the two items are.

In this exercise, you will compare the movie GoldenEye with the movie Toy Story, then GoldenEye with Skyfall, and compare the two results.

The DataFrame movie_cross_table containing all the movies as rows and the genres as Boolean columns that you created in the last lesson has been loaded.

1. Import the Jaccard similarity score function from `sklearn.metrics`.
2. Convert the rows containing 'GoldenEye' and 'Toy Story' to numpy arrays and measure their similarity.
3. Convert the row containing Skyfall to a numpy array and measure its similarity to GoldenEye.

In [26]:
# Data
movie_cross_table.head()

genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
'71,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
'Hellboy': The Seeds of Creation,0,1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
'Round Midnight,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
'Salem's Lot,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0
'Til There Was You,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0


In [27]:
# Import numpy and the Jaccard similarity function
import numpy as np
from sklearn.metrics import jaccard_score

In [28]:
# Extract just the rows containing GoldenEye and Toy Story
goldeneye_values = movie_cross_table.loc['GoldenEye'].values
toy_story_values = movie_cross_table.loc['Toy Story'].values

# Find the similarity between GoldenEye and Toy Story
print(jaccard_score(goldeneye_values, toy_story_values))

0.14285714285714285


In [29]:
# Repeat for GoldenEye and Skyfall
skyfall_values = movie_cross_table.loc['Skyfall'].values
print(jaccard_score(goldeneye_values, skyfall_values))

0.75


> *As you can see, based on Jaccard similarity, GoldenEye and Skyfall (both James Bond movies) are more similar than GoldenEye and Toy Story (a spy movie and an animated kids movie).*

> # 2. Comparing all your movies at once

Finding the Jaccard similarity between two individual movies works well for small-scale analyses, but computing pairs one at a time can be slow when making recommendations on larger datasets.

In this exercise, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.

When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it's more efficient to use the `pdist()` (pairwise distance) function from `scipy`.

`pdist()` returns a condensed, one-dimensional array with one distance per pair of rows. This can be reshaped into the desired square matrix using `squareform()` from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.

1. Find the Jaccard distances between all movies and assign the results to `jaccard_distances`, then convert them to similarities in `jaccard_similarity_array`.
2. Create a DataFrame from the `jaccard_similarity_array` with `movie_cross_table.index` as its rows and columns.
3. Print the top 5 rows of the DataFrame and examine the similarity scores.
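
To make the shapes concrete, here is a small sketch on three illustrative Boolean genre vectors (the array `X` is made up for this example). It shows that `pdist` yields one value per unordered pair and that `1 - squareform(...)` produces a symmetric similarity matrix with ones on the diagonal:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative boolean genre vectors (rows = movies, columns = genres)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
], dtype=bool)

condensed = pdist(X, metric="jaccard")   # one distance per unordered pair: (0,1), (0,2), (1,2)
similarity = 1 - squareform(condensed)   # square, symmetric, with ones on the diagonal

print(condensed.shape)
print(similarity)
```

Rows 0 and 1 share one of two distinct genres, so their similarity is 0.5, while row 2 shares nothing with either and scores 0.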


In [30]:
# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = pdist(movie_cross_table.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=movie_cross_table.index, columns=movie_cross_table.index)

In [31]:
# Print the top 5 rows of the DataFrame
jaccard_similarity_df.head()

name,'71,'Hellboy': The Seeds of Creation,'Round Midnight,'Salem's Lot,'Til There Was You,'Tis the Season for Love,"'burbs, The",'night Mother,*batteries not included,...All the Marbles,...,Zulu,[REC],[REC]²,[REC]³ 3 Génesis,anohana: The Flower We Saw That Day - The Movie,eXistenZ,xXx,xXx: State of the Union,¡Three Amigos!,À nous la liberté
'71,1.0,0.125,0.2,0.333333,0.2,0.0,0.0,0.25,0.0,0.2,...,0.6,0.4,0.2,0.2,0.2,0.4,0.4,0.4,0.0,0.0
'Hellboy': The Seeds of Creation,0.125,1.0,0.0,0.0,0.0,0.0,0.2,0.0,0.285714,0.166667,...,0.111111,0.0,0.0,0.0,0.0,0.142857,0.142857,0.142857,0.166667,0.166667
'Round Midnight,0.2,0.0,1.0,0.2,0.333333,0.0,0.0,0.5,0.0,0.333333,...,0.0,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333
'Salem's Lot,0.333333,0.0,0.2,1.0,0.2,0.0,0.0,0.25,0.0,0.2,...,0.142857,0.75,0.5,0.5,0.2,0.166667,0.166667,0.166667,0.0,0.0
'Til There Was You,0.2,0.0,0.333333,0.2,1.0,0.5,0.0,0.5,0.0,0.333333,...,0.0,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0


> # 3. Making recommendations based on movie genres

Now that you have your data in a usable format and know how to compare two movies, the next step is to use this to generate recommendations. In this exercise, you will learn how to generate recommendations for any movie in your dataset. The similarity scores between all movies in the dataset that you calculated in the last exercise have been pre-loaded for you as `jaccard_similarity_array`. `movie_cross_table` containing the movies and their attributes is also available.

For ease of use, you will need to wrap the similarity scores in a DataFrame. Then you will use this new DataFrame to suggest a movie recommendation.


- Generate a DataFrame called jaccard_similarity_df from jaccard_similarity_array.
- Store the similarity values between Thor and all other movies as a Series.
- Sort these from largest to smallest in ordered_similarities.

In [32]:
# Wrap the preloaded array in a DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=movie_cross_table.index, columns=movie_cross_table.index)

# Find the values for the movie Thor
jaccard_similarity_series = jaccard_similarity_df.loc['Thor']

# Sort these values from highest to lowest
ordered_similarities = jaccard_similarity_series.sort_values(ascending=False)

# Print the results
print(ordered_similarities)

name
Thor                                            1.000000
Harry Potter and the Deathly Hallows: Part 2    0.833333
In the Name of the King III                     0.800000
Thor: The Dark World                            0.800000
Seeker: The Dark Is Rising, The                 0.800000
                                                  ...   
Runaway Bride                                   0.000000
Heidi Fleiss: Hollywood Madam                   0.000000
Hedgehog in the Fog                             0.000000
Heavyweights                                    0.000000
À nous la liberté                               0.000000
Name: Thor, Length: 9411, dtype: float64
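
The top entry in the sorted output is always the movie itself, with a similarity of 1. A hedged sketch of a small helper that drops the query title and keeps the top `n` matches; `top_similar` is a hypothetical name and the demo matrix is a small illustrative one, not the real `jaccard_similarity_df`:

```python
import pandas as pd

def top_similar(similarity_df: pd.DataFrame, title: str, n: int = 3) -> pd.Series:
    """Return the n highest-scoring titles for `title`, leaving out the title itself."""
    scores = similarity_df.loc[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Small illustrative similarity matrix
titles = ["Thor", "Thor: The Dark World", "Runaway Bride"]
demo_df = pd.DataFrame(
    [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]],
    index=titles, columns=titles,
)
result = top_similar(demo_df, "Thor", n=2)
print(result)
```

The helper assumes titles are unique in the index; with duplicated names, `.loc[title]` would return several rows and the function would need adjusting.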


# 📋 3. Text-based similarities

## 3.1. Term frequency inverse document frequency

$$
\text{TF-IDF} = \dfrac{\text{Count of word occurrences}}{\text{Total words in document}} \times \log \left( \dfrac{\text{Total number of docs}}{\text{Number of docs word is in}} \right)
$$

- It gives a higher weight to words that occur often in a document but appear in few documents overall.
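
A worked example may help here. The sketch below computes the textbook form, term frequency multiplied by the log of (total documents / documents containing the term), on three tiny illustrative documents. The helper names are hypothetical, and scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ from these:

```python
import math

# Three illustrative "documents", already tokenised
docs = [
    ["spy", "chases", "spy"],
    ["toys", "come", "alive"],
    ["spy", "meets", "toys"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(total documents / documents containing `term`); assumes the term occurs at least once."""
    containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("spy", docs[0], docs))     # frequent in doc 0, but present in 2 of 3 docs
print(tf_idf("chases", docs[0], docs))  # appears once in doc 0 and nowhere else
```

Even though "spy" appears twice in the first document, "chases" ends up with the larger score because it is unique to that document.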

## 3.2. Filtering the data 

```python 
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer(min_df=2, max_df=0.7)

# Vectorizing the data
vectorized_data = tfidfvec.fit_transform(book_summary_df['Descriptions'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(tfidfvec.get_feature_names_out())

print(vectorized_data.toarray())
```

### Formatting the data

```python 
tfidf_df = pd.DataFrame(vectorized_data.toarray(),
                       columns=tfidfvec.get_feature_names_out())
tfidf_df.index = book_summary_df['Book']
```

## 3.3. Cosine similarity

Cosine similarity:

$$
\cos(\theta) = \dfrac{A \cdot B}{\|A\| \, \|B\|}
$$


- Measures the angle between two document vectors in the high-dimensional feature space.
- For TF-IDF vectors, which have no negative entries, values range from 0 to 1, with 1 being a perfect match.

```python
from sklearn.metrics.pairwise import cosine_similarity

# Find similarity between all items 
cosine_similarity_array = cosine_similarity(tfidf_df)

# Find similarity between two items 
cosine_similarity(tfidf_df.loc['The Hobbit'].values.reshape(1, -1),
                  tfidf_df.loc['Macbeth'].values.reshape(1, -1))
```
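
To see what the library call computes, the formula can be written out with NumPy. This is a minimal sketch on illustrative vectors; `cosine_sim` is a hypothetical helper and it assumes neither vector is all zeros, since that would divide by zero:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative TF-IDF-like vectors (non-negative weights)
a = np.array([0.5, 0.0, 0.5])
b = np.array([1.0, 0.0, 1.0])   # same direction as a, different length
c = np.array([0.0, 2.0, 0.0])   # shares no non-zero component with a

print(cosine_sim(a, b))  # parallel vectors
print(cosine_sim(a, c))  # orthogonal vectors
```

Because only the direction matters, `a` and `b` score as a perfect match despite their different magnitudes, which is why document length has little influence on the result.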

## ✏️ Exercises

> ## 1. Instantiate the TF-IDF model

TF-IDF by default generates a column for every word in all of your documents (movie summaries in our case). This creates a huge and unintuitive dataset as it will contain both very common words that appear in every document, and words that appear so rarely they provide no value in finding similarities between items.

In this exercise, you will work with the `df_plots` DataFrame. It contains movies' names in the `Title` column and their plots in the `Plot` column.

Using this DataFrame, you will generate the default TF-IDF scores and see if non-valuable columns are present.

You will go on to rerun the TF-IDF calculations, this time limiting the number of columns using the `min_df` and `max_df` arguments and hopefully see the improvement.

- Create a `TfidfVectorizer` and call it `vectorizer`.
- Use `vectorizer` to transform the data in the `Plot` column of `df_plots` and assign the output to `vectorized_data`.
- Inspect the features that have been generated by the transformation.

In [33]:
# Data
df_plots = pd.read_csv('df_plots.csv')
df_plots.head()

Unnamed: 0,Title,Plot
0,Ace Ventura: When Nature Calls,"In the Himalayas, after a failed rescue missio..."
1,Dracula: Dead and Loving It,Solicitor Thomas Renfield travels all the way ...
2,Father of the Bride Part II,The film begins five years after the events of...
3,Four Rooms,"The film is set on New Year\'s Eve, and starts..."
4,Grumpier Old Men,The feud between Max (Walter Matthau) and John...


In [34]:
df_plots.shape

(12, 2)

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer()

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

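# Look at the features generated
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(vectorizer.get_feature_names_out().tolist())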





- Repeat the creation of the `TfidfVectorizer`, but this time set the minimum document frequency to 2 and the maximum document frequency to 0.7.
- Inspect the features that have been generated by the transformation.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

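# Look at the features generated
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(vectorizer.get_feature_names_out().tolist())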

['000', '100', 'abandoned', 'above', 'accidentally', 'accomplice', 'admits', 'adult', 'african', 'again', 'against', 'agent', 'agents', 'alive', 'all', 'allows', 'alone', 'also', 'although', 'animals', 'another', 'appears', 'approached', 'around', 'arrested', 'arrives', 'arriving', 'asks', 'assistant', 'assists', 'attack', 'attacked', 'attacks', 'attempting', 'attempts', 'attending', 'away', 'baby', 'back', 'ball', 'bank', 'bats', 'because', 'become', 'becomes', 'bed', 'been', 'before', 'begin', 'begins', 'being', 'between', 'blow', 'board', 'bond', 'boss', 'both', 'box', 'bride', 'bring', 'brings', 'britain', 'british', 'burns', 'business', 'call', 'called', 'calls', 'can', 'canadian', 'captured', 'captures', 'car', 'care', 'case', 'caves', 'chaos', 'chase', 'chest', 'child', 'children', 'christmas', 'cia', 'clock', 'closed', 'come', 'comes', 'containing', 'continue', 'control', 'convinces', 'country', 'couple', 'credits', 'crew', 'crime', 'dart', 'darts', 'daughter', 'day', 'dead', '

> ## 2. Creating the TF-IDF DataFrame

Now that you have generated your TF-IDF features, you will need to get them into a format that you can use to make recommendations. You will once again leverage `pandas` for this and wrap the array in a DataFrame. As you will be using the movie titles to filter the data, you can assign the titles to the DataFrame's index.

The `df_plots` DataFrame has once again been loaded for you. It contains movies' names in the `Title` column and their plots in the `Plot` column.

- Create a `TfidfVectorizer` and fit and transform it as you did in the previous exercise.
- Wrap the generated `vectorized_data` in a `DataFrame`. Use the names of the features generated during the fit and transform phase as its column names and assign your new DataFrame to `tfidf_df`.
- Assign the original movie titles to the index of the newly created `tfidf_df` DataFrame.

In [37]:
df_plots.head()

Unnamed: 0,Title,Plot
0,Ace Ventura: When Nature Calls,"In the Himalayas, after a failed rescue missio..."
1,Dracula: Dead and Loving It,Solicitor Thomas Renfield travels all the way ...
2,Father of the Bride Part II,The film begins five years after the events of...
3,Four Rooms,"The film is set on New Year\'s Eve, and starts..."
4,Grumpier Old Men,The feud between Max (Walter Matthau) and John...


In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object and transform the plot column 
vectorizer = TfidfVectorizer(max_df=0.7, min_df=2) 
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

# Create a DataFrame from the TF-IDF array
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names_out())

# Assign the movie titles to the index and inspect
tfidf_df.index = df_plots['Title']

tfidf_df.head()



Title,000,100,abandoned,above,accidentally,accomplice,admits,adult,african,again,...,work,working,world,worried,wounded,wrong,year,years,you,young
Ace Ventura: When Nature Calls,0.0,0.0,0.0,0.0,0.0,0.0,0.068283,0.0,0.068283,0.0,...,0.0,0.068283,0.0,0.0,0.0,0.0,0.0,0.044825,0.0,0.054141
Dracula: Dead and Loving It,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060408,0.0
Father of the Bride Part II,0.045557,0.045557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120728,...,0.045557,0.0,0.0,0.040243,0.0,0.045557,0.0,0.029906,0.0,0.072242
Four Rooms,0.039788,0.039788,0.0,0.079576,0.039788,0.0,0.0,0.039788,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.039788,0.079576,0.026119,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071803,...,0.0,0.0,0.0,0.071803,0.0,0.0,0.0,0.0,0.0,0.0


> ## 3. Comparing all your movies with TF-IDF

Now that you have put in the hard work of getting your TF-IDF data into a usable format, it's time to put it to work finding similarities and generating recommendations.

This time as you are using TF-IDF scores (which are floats as opposed to Booleans) you will use the cosine similarity metric to find the similarities between items. In this exercise, you will generate a matrix of all of the movie cosine similarities and store them in a DataFrame for ease of lookup. This will allow you to compare movies and find recommendations quickly and easily.

The `tfidf_df` DataFrame you created in the last exercise containing a row for each movie has been loaded for you.

- Find the cosine similarity measures between all movies and assign the results to `cosine_similarity_array`.
- Create a DataFrame from the `cosine_similarity_array` with `tfidf_summary_df.index` as its rows and columns.
- Print the top five rows of the DataFrame and examine the similarity scores.
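
The steps above can be sketched end to end on a tiny corpus before running them on the real data. The titles and plots below are invented for illustration, and `toy_plots`, `sim_df` and `best_match` are placeholder names:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative titles and plots (not taken from df_plots)
toy_plots = pd.DataFrame({
    "Title": ["Spy One", "Spy Two", "Toy Tale"],
    "Plot": [
        "a secret agent stops a stolen satellite",
        "a secret agent hunts a stolen list",
        "living toys rescue a lost friend",
    ],
})

tfidf = TfidfVectorizer().fit_transform(toy_plots["Plot"])
sim_df = pd.DataFrame(
    cosine_similarity(tfidf),            # accepts the sparse matrix directly
    index=toy_plots["Title"].values,
    columns=toy_plots["Title"].values,
)

# Most similar title to "Spy One", excluding itself
best_match = sim_df.loc["Spy One"].drop("Spy One").idxmax()
print(best_match)
```

The two spy plots share several words, so they score above zero against each other, while "Toy Tale" has no vocabulary in common with "Spy One" and scores 0.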

In [39]:
# Removing index name
tfidf_df.index.name = None
tfidf_summary_df = tfidf_df

In [41]:
tfidf_summary_df

Unnamed: 0,000,100,abandoned,above,accidentally,accomplice,admits,adult,african,again,...,work,working,world,worried,wounded,wrong,year,years,you,young
Ace Ventura: When Nature Calls,0.0,0.0,0.0,0.0,0.0,0.0,0.068283,0.0,0.068283,0.0,...,0.0,0.068283,0.0,0.0,0.0,0.0,0.0,0.044825,0.0,0.054141
Dracula: Dead and Loving It,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060408,0.0
Father of the Bride Part II,0.045557,0.045557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120728,...,0.045557,0.0,0.0,0.040243,0.0,0.045557,0.0,0.029906,0.0,0.072242
Four Rooms,0.039788,0.039788,0.0,0.079576,0.039788,0.0,0.0,0.039788,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.039788,0.079576,0.026119,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071803,...,0.0,0.0,0.0,0.071803,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.04456,0.0,0.0,0.0,0.04456,0.04456,0.0,0.0,...,0.0,0.04456,0.0,0.0,0.0,0.0,0.0,0.087755,0.0,0.035331
Sudden Death,0.0,0.0,0.0,0.07128,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.07128,0.0,0.0,0.0,0.0,0.056516
Tom and Huck,0.0,0.0,0.0,0.0,0.0,0.07458,0.0,0.0,0.0,0.06588,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toy Story,0.0,0.0,0.0,0.0,0.066623,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058852,0.058852,0.0,0.0,0.066623,0.0,0.0,0.0
Waiting to Exhale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071909,0.0,...,0.0,0.0,0.063521,0.0,0.0,0.0,0.0,0.047205,0.143818,0.0


In [42]:
# Import cosine_similarity measure
from sklearn.metrics.pairwise import cosine_similarity

# Create the array of cosine similarity values
cosine_similarity_array = cosine_similarity(tfidf_summary_df)

In [43]:
# Wrap the array in a pandas DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_summary_df.index, columns=tfidf_summary_df.index)

# Print the top 5 rows of the DataFrame
cosine_similarity_df.head()

Unnamed: 0,Ace Ventura: When Nature Calls,Dracula: Dead and Loving It,Father of the Bride Part II,Four Rooms,Grumpier Old Men,Jumanji,Sudden Death,Tom and Huck,Toy Story,Waiting to Exhale,GoldenEye,Skyfall
Ace Ventura: When Nature Calls,1.0,0.188175,0.188603,0.134598,0.124676,0.136359,0.180795,0.211408,0.191649,0.134086,0.086921,0.072335
Dracula: Dead and Loving It,0.188175,1.0,0.200218,0.252971,0.105283,0.173103,0.187161,0.185923,0.146967,0.09746,0.071844,0.075339
Father of the Bride Part II,0.188603,0.200218,1.0,0.176152,0.224413,0.187813,0.180852,0.203003,0.305868,0.18129,0.034139,0.065996
Four Rooms,0.134598,0.252971,0.176152,1.0,0.214116,0.161895,0.148883,0.128753,0.173899,0.138599,0.029987,0.045536
Grumpier Old Men,0.124676,0.105283,0.224413,0.214116,1.0,0.172372,0.114862,0.197313,0.223474,0.154554,0.027098,0.04279


In [44]:
cosine_similarity_df

Unnamed: 0,Ace Ventura: When Nature Calls,Dracula: Dead and Loving It,Father of the Bride Part II,Four Rooms,Grumpier Old Men,Jumanji,Sudden Death,Tom and Huck,Toy Story,Waiting to Exhale,GoldenEye,Skyfall
Ace Ventura: When Nature Calls,1.0,0.188175,0.188603,0.134598,0.124676,0.136359,0.180795,0.211408,0.191649,0.134086,0.086921,0.072335
Dracula: Dead and Loving It,0.188175,1.0,0.200218,0.252971,0.105283,0.173103,0.187161,0.185923,0.146967,0.09746,0.071844,0.075339
Father of the Bride Part II,0.188603,0.200218,1.0,0.176152,0.224413,0.187813,0.180852,0.203003,0.305868,0.18129,0.034139,0.065996
Four Rooms,0.134598,0.252971,0.176152,1.0,0.214116,0.161895,0.148883,0.128753,0.173899,0.138599,0.029987,0.045536
Grumpier Old Men,0.124676,0.105283,0.224413,0.214116,1.0,0.172372,0.114862,0.197313,0.223474,0.154554,0.027098,0.04279
Jumanji,0.136359,0.173103,0.187813,0.161895,0.172372,1.0,0.247923,0.145758,0.158545,0.086274,0.056943,0.043752
Sudden Death,0.180795,0.187161,0.180852,0.148883,0.114862,0.247923,1.0,0.176888,0.217872,0.150136,0.127008,0.06328
Tom and Huck,0.211408,0.185923,0.203003,0.128753,0.197313,0.145758,0.176888,1.0,0.179207,0.119998,0.065026,0.08042
Toy Story,0.191649,0.146967,0.305868,0.173899,0.223474,0.158545,0.217872,0.179207,1.0,0.124956,0.067811,0.089457
Waiting to Exhale,0.134086,0.09746,0.18129,0.138599,0.154554,0.086274,0.150136,0.119998,0.124956,1.0,0.038875,0.039488


In [45]:
cosine_similarity_df.index

Index(['Ace Ventura: When Nature Calls', 'Dracula: Dead and Loving It',
       'Father of the Bride Part II', 'Four Rooms', 'Grumpier Old Men',
       'Jumanji', 'Sudden Death', 'Tom and Huck', 'Toy Story',
       'Waiting to Exhale', 'GoldenEye', 'Skyfall'],
      dtype='object')

> ## 4. Making recommendations with TF-IDF

In the last exercise you pre-calculated the similarity ratings between all movies in the dataset based on their plots transformed by TF-IDF. Now you will put these similarity ratings in a DataFrame for ease of use. Then you will use this new DataFrame to suggest a movie recommendation.

The `cosine_similarity_array` containing a matrix of the similarity values between all movies that you created in the last exercise has been loaded for you. The `tfidf_summary_df` DataFrame containing the movies and their TF-IDF features is also available.

In [46]:
cosine_similarity_array

array([[1.        , 0.18817497, 0.1886031 , 0.13459848, 0.1246757 ,
        0.13635879, 0.18079475, 0.21140834, 0.19164876, 0.13408647,
        0.08692089, 0.07233452],
       [0.18817497, 1.        , 0.20021825, 0.25297079, 0.10528327,
        0.1731033 , 0.18716114, 0.18592271, 0.14696745, 0.09746021,
        0.07184429, 0.0753394 ],
       [0.1886031 , 0.20021825, 1.        , 0.17615201, 0.22441322,
        0.18781266, 0.18085204, 0.20300348, 0.30586829, 0.18128975,
        0.03413906, 0.0659957 ],
       [0.13459848, 0.25297079, 0.17615201, 1.        , 0.21411564,
        0.16189494, 0.14888271, 0.12875296, 0.17389911, 0.13859856,
        0.02998718, 0.04553642],
       [0.1246757 , 0.10528327, 0.22441322, 0.21411564, 1.        ,
        0.17237167, 0.11486183, 0.19731269, 0.22347362, 0.15455361,
        0.02709798, 0.04279006],
       [0.13635879, 0.1731033 , 0.18781266, 0.16189494, 0.17237167,
        1.        , 0.2479228 , 0.14575801, 0.1585449 , 0.08627399,
        0.0569434 ,

In [47]:
tfidf_summary_df

Unnamed: 0,000,100,abandoned,above,accidentally,accomplice,admits,adult,african,again,...,work,working,world,worried,wounded,wrong,year,years,you,young
Ace Ventura: When Nature Calls,0.0,0.0,0.0,0.0,0.0,0.0,0.068283,0.0,0.068283,0.0,...,0.0,0.068283,0.0,0.0,0.0,0.0,0.0,0.044825,0.0,0.054141
Dracula: Dead and Loving It,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060408,0.0
Father of the Bride Part II,0.045557,0.045557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120728,...,0.045557,0.0,0.0,0.040243,0.0,0.045557,0.0,0.029906,0.0,0.072242
Four Rooms,0.039788,0.039788,0.0,0.079576,0.039788,0.0,0.0,0.039788,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.039788,0.079576,0.026119,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071803,...,0.0,0.0,0.0,0.071803,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.04456,0.0,0.0,0.0,0.04456,0.04456,0.0,0.0,...,0.0,0.04456,0.0,0.0,0.0,0.0,0.0,0.087755,0.0,0.035331
Sudden Death,0.0,0.0,0.0,0.07128,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.07128,0.0,0.0,0.0,0.0,0.056516
Tom and Huck,0.0,0.0,0.0,0.0,0.0,0.07458,0.0,0.0,0.0,0.06588,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toy Story,0.0,0.0,0.0,0.0,0.066623,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058852,0.058852,0.0,0.0,0.066623,0.0,0.0,0.0
Waiting to Exhale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071909,0.0,...,0.0,0.0,0.063521,0.0,0.0,0.0,0.0,0.047205,0.143818,0.0


- Generate a DataFrame from `cosine_similarity_array`.
- Store the cosine similarity values between the movie Four Rooms and all other movies as a Series.
- Sort these from largest to smallest in `ordered_similarities` and print the ordered results.

In [48]:
# Wrap the preloaded array in a DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_summary_df.index, columns=tfidf_summary_df.index)

# Find the values for the movie Four Rooms
cosine_similarity_series = cosine_similarity_df.loc['Four Rooms']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

# Print the results
print(ordered_similarities)

Four Rooms                        1.000000
Dracula: Dead and Loving It       0.252971
Grumpier Old Men                  0.214116
Father of the Bride Part II       0.176152
Toy Story                         0.173899
Jumanji                           0.161895
Sudden Death                      0.148883
Waiting to Exhale                 0.138599
Ace Ventura: When Nature Calls    0.134598
Tom and Huck                      0.128753
Skyfall                           0.045536
GoldenEye                         0.029987
Name: Four Rooms, dtype: float64


Apart from **Four Rooms** itself (which always scores 1.0 against itself), **Dracula: Dead and Loving It** has the highest similarity with **Four Rooms**. This suggests that viewers who liked **Four Rooms** may also enjoy **Dracula: Dead and Loving It**.
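
Because the queried movie always ranks first against itself, it can help to drop it before reading off a recommendation. The sketch below does this with a hypothetical helper, `most_similar`, applied to a small `demo_df` whose three values are copied from the similarity array printed above (assuming its rows follow the index order shown).

```python
import pandas as pd

def most_similar(similarity_df, title, n=3):
    """Return the n titles most similar to `title`, excluding `title` itself."""
    scores = similarity_df.loc[title].drop(title)
    return scores.sort_values(ascending=False).head(n)

# Three-movie excerpt of the similarity matrix, values taken from the output above
titles = ["Four Rooms", "Dracula: Dead and Loving It", "Grumpier Old Men"]
demo_df = pd.DataFrame(
    [[1.0, 0.25297079, 0.21411564],
     [0.25297079, 1.0, 0.10528327],
     [0.21411564, 0.10528327, 1.0]],
    index=titles,
    columns=titles,
)

top_matches = most_similar(demo_df, "Four Rooms", n=2)
print(top_matches)
```

The same helper should work on the full `cosine_similarity_df`, since it only relies on the titles being used as both the index and the columns.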

# 📋  4. User profile recommendations

## ✏️ Exercises

> ## 1. Build the user profiles

You are now able to generate suggestions for similar items based on their labeled features or their descriptions. But finding items similar to a single item is not always enough. In the next exercises, you will work through how to create recommendations based on a user and all the items they liked, rather than on one item. You will first generate a profile for a user by aggregating all of the movies they have previously enjoyed.

The `tfidf_summary_df` you have been working on in the last few exercises has been loaded for you. This contains a row per movie with their titles as the index and a column for each feature containing their respective TF-IDF score.

- Create a subset of the `tfidf_summary_df` that contains only rows corresponding to the supplied `list_of_movies_enjoyed` list.

In [49]:
tfidf_summary_df

Unnamed: 0,000,100,abandoned,above,accidentally,accomplice,admits,adult,african,again,...,work,working,world,worried,wounded,wrong,year,years,you,young
Ace Ventura: When Nature Calls,0.0,0.0,0.0,0.0,0.0,0.0,0.068283,0.0,0.068283,0.0,...,0.0,0.068283,0.0,0.0,0.0,0.0,0.0,0.044825,0.0,0.054141
Dracula: Dead and Loving It,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060408,0.0
Father of the Bride Part II,0.045557,0.045557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120728,...,0.045557,0.0,0.0,0.040243,0.0,0.045557,0.0,0.029906,0.0,0.072242
Four Rooms,0.039788,0.039788,0.0,0.079576,0.039788,0.0,0.0,0.039788,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.039788,0.079576,0.026119,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071803,...,0.0,0.0,0.0,0.071803,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.04456,0.0,0.0,0.0,0.04456,0.04456,0.0,0.0,...,0.0,0.04456,0.0,0.0,0.0,0.0,0.0,0.087755,0.0,0.035331
Sudden Death,0.0,0.0,0.0,0.07128,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.07128,0.0,0.0,0.0,0.0,0.056516
Tom and Huck,0.0,0.0,0.0,0.0,0.0,0.07458,0.0,0.0,0.0,0.06588,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Toy Story,0.0,0.0,0.0,0.0,0.066623,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058852,0.058852,0.0,0.0,0.066623,0.0,0.0,0.0
Waiting to Exhale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071909,0.0,...,0.0,0.0,0.063521,0.0,0.0,0.0,0.0,0.047205,0.143818,0.0


In [56]:
list_of_movies_enjoyed = ['Ace Ventura: When Nature Calls', 'Grumpier Old Men', 'Father of the Bride Part II']

# Create a subset of only the movies the user has enjoyed
movies_enjoyed_df = tfidf_summary_df.reindex(list_of_movies_enjoyed)

# Inspect the DataFrame
movies_enjoyed_df

Unnamed: 0,000,100,abandoned,above,accidentally,accomplice,admits,adult,african,again,...,work,working,world,worried,wounded,wrong,year,years,you,young
Ace Ventura: When Nature Calls,0.0,0.0,0.0,0.0,0.0,0.0,0.068283,0.0,0.068283,0.0,...,0.0,0.068283,0.0,0.0,0.0,0.0,0.0,0.044825,0.0,0.054141
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071803,...,0.0,0.0,0.0,0.071803,0.0,0.0,0.0,0.0,0.0,0.0
Father of the Bride Part II,0.045557,0.045557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120728,...,0.045557,0.0,0.0,0.040243,0.0,0.045557,0.0,0.029906,0.0,0.072242


In [59]:
movies_enjoyed_df.dtypes

000             float64
100             float64
abandoned       float64
above           float64
accidentally    float64
                 ...   
wrong           float64
year            float64
years           float64
you             float64
young           float64
Length: 536, dtype: object

- Generate the user profile by finding the average TF-IDF scores of each of the features of the movies contained in `movies_enjoyed_df`.
- Inspect the results.

In [60]:
list_of_movies_enjoyed = ['Ace Ventura: When Nature Calls', 'Grumpier Old Men', 'Father of the Bride Part II']

# Create a subset of only the movies the user has enjoyed
movies_enjoyed_df = tfidf_summary_df.reindex(list_of_movies_enjoyed)

# Generate the user profile by finding the average scores of movies they enjoyed
user_prof = movies_enjoyed_df.mean(skipna=True)
user_prof

000             0.015186
100             0.015186
abandoned       0.000000
above           0.000000
accidentally    0.000000
                  ...   
wrong           0.015186
year            0.000000
years           0.024910
you             0.000000
young           0.042128
Length: 536, dtype: float64

Good work! By averaging the scores of the movies the user enjoyed, you have created a summary of the user's tastes that you can use to find new movies similar to what they usually enjoy.
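
To make the averaging step concrete, here is a small worked sketch on made-up numbers. The `toy_tfidf_df` values and movie names are hypothetical. It also shows one difference worth knowing: `.reindex` quietly returns a row of `NaN` for a title that is not in the index, and `mean` skips those values by default, whereas `.loc` with a list would raise a `KeyError` for the missing title.

```python
import pandas as pd

# Hypothetical TF-IDF scores: one row per movie, one column per word
toy_tfidf_df = pd.DataFrame(
    {"comedy": [0.6, 0.0, 0.3],
     "wedding": [0.0, 0.9, 0.0],
     "jungle": [0.3, 0.0, 0.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# The user profile is the column-wise mean of the rows the user liked
liked = ["Movie A", "Movie B"]
toy_profile = toy_tfidf_df.loc[liked].mean()
print(toy_profile)

# A misspelled or missing title becomes an all-NaN row under reindex,
# and mean() ignores it, so the profile is built from "Movie A" alone
with_missing = toy_tfidf_df.reindex(["Movie A", "Not In Index"])
profile_with_missing = with_missing.mean()
print(profile_with_missing)
```

If silently ignoring an unknown title is not what you want, selecting with `.loc` is the stricter choice.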

> ## 2. User profile based recommendations

Now that you have built the user profile based on the aggregate of the individual movies they enjoyed, you can compare it to the larger `tfidf_summary_df` DataFrame that you have been working with to generate suggestions. As you would not want to suggest movies that the user has already watched, you will first find a subset of the `tfidf_summary_df` DataFrame that does not contain any of the previously watched movies.

The Series `user_prof` that you generated in the last exercise, which holds one averaged TF-IDF score per feature for the user, has been loaded for you. Similarly, the `list_of_movies_enjoyed` has been loaded so you can exclude them from the predictions.

In [61]:
from sklearn.metrics.pairwise import cosine_similarity

# Find subset of tfidf_summary_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_summary_df.drop(list_of_movies_enjoyed, axis=0)

- Calculate the cosine_similarity between the user profile contained in `user_prof` and all the movie profiles in `tfidf_subset_df`.
- Wrap the `similarity_array` in a DataFrame, assigning it the same index as `tfidf_subset_df`.

In [62]:
from sklearn.metrics.pairwise import cosine_similarity

# Find subset of tfidf_summary_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_summary_df.drop(list_of_movies_enjoyed, axis=0)

# Calculate the cosine_similarity and wrap it in a DataFrame
similarity_array = cosine_similarity(user_prof.values.reshape(1, -1), tfidf_subset_df)
similarity_df = pd.DataFrame(similarity_array.T,
                             index=tfidf_subset_df.index, 
                             columns=["similarity_score"])
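
The `reshape(1, -1)` and `.T` in the cell above are about array shapes: `cosine_similarity` expects 2-D inputs, so the 1-D profile is reshaped into a single row, and the resulting `(1, n_movies)` array is transposed into a column that fits the DataFrame. The sketch below reproduces the same calculation by hand with NumPy on made-up vectors. `cosine_rows` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def cosine_rows(profile, matrix):
    """Cosine similarity between one 1-D profile and each row of a 2-D matrix.

    Assumes no vector is all zeros (that would divide by zero).
    """
    profile = np.asarray(profile, dtype=float).reshape(1, -1)  # (1, n_features)
    matrix = np.asarray(matrix, dtype=float)                   # (n_movies, n_features)
    dots = profile @ matrix.T                                  # (1, n_movies)
    norms = np.linalg.norm(profile) * np.linalg.norm(matrix, axis=1)
    return dots / norms

# Hypothetical profile and three hypothetical movie vectors
toy_profile = [1.0, 0.0]
toy_movies = [[2.0, 0.0],   # same direction as the profile
              [0.0, 3.0],   # orthogonal to the profile
              [1.0, 1.0]]   # 45 degrees from the profile

scores = cosine_rows(toy_profile, toy_movies)
print(scores.shape)    # one row, one column per movie
print(scores.T.shape)  # one row per movie, ready to use as a DataFrame column
```

Only the direction of each vector matters here, not its length, which is why the first movie scores a perfect match even though its vector is twice as long as the profile.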

- Sort the results from high to low and take a look at the movies most similar to the user's likes.

In [63]:
# Sort the values from high to low by the values in the similarity_score
sorted_similarity_df = similarity_df.sort_values(by="similarity_score", ascending=False)

# Inspect the most similar to the user preferences
print(sorted_similarity_df.head())

                             similarity_score
Toy Story                            0.357146
Tom and Huck                         0.303020
Four Rooms                           0.259995
Jumanji                              0.245965
Dracula: Dead and Loving It          0.244545
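
The steps from the last two exercises can be collected into a single function. The following is a sketch under stated assumptions: `recommend_for_user` is a hypothetical name, the toy scores are invented, and the input is assumed to have the same layout as `tfidf_summary_df` (titles as the index, one column per feature).

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_for_user(tfidf_df, liked_titles, n=5):
    """Rank movies the user has not seen by similarity to their averaged profile."""
    # 1. Build the user profile from the movies they enjoyed
    profile = tfidf_df.loc[liked_titles].mean()
    # 2. Keep only the movies the user has not already watched
    candidates = tfidf_df.drop(liked_titles, axis=0)
    # 3. Compare the profile (as a single row) with every remaining movie
    scores = cosine_similarity(profile.values.reshape(1, -1), candidates)
    result = pd.DataFrame(scores.T, index=candidates.index,
                          columns=["similarity_score"])
    # 4. Highest similarity first
    return result.sort_values(by="similarity_score", ascending=False).head(n)

# Hypothetical TF-IDF scores for four movies and two features
toy = pd.DataFrame(
    {"comedy": [0.9, 0.8, 0.7, 0.0],
     "spy": [0.0, 0.1, 0.0, 0.9]},
    index=["A", "B", "C", "D"],
)

recs = recommend_for_user(toy, ["A", "B"], n=2)
print(recs)
```

Using `.loc` instead of `.reindex` here is a deliberate choice for the sketch: a title missing from the index raises an error rather than being averaged away unnoticed.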
