# Content-Based Recommendations
  
Discover how item attributes can be used to make recommendations. Create valuable comparisons between items with both categorical and text data. Generate profiles to recommend new items for users based on their past preferences.

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>DataFrame.nunique()</td>
    <td>Count distinct elements in each column</td>
  </tr>
  <tr>
    <td>2</td>
    <td>DataFrame.hist()</td>
    <td>Create histograms for numeric columns</td>
  </tr>
  <tr>
    <td>3</td>
    <td>DataFrame.drop()</td>
    <td>Drop specified labels from rows or columns</td>
  </tr>
  <tr>
    <td>4</td>
    <td>DataFrame.value_counts()</td>
    <td>Count unique values in a Series</td>
  </tr>
  <tr>
    <td>5</td>
    <td>DataFrame.groupby()</td>
    <td>Group DataFrame using a mapper or by a Series of columns</td>
  </tr>
  <tr>
    <td>6</td>
    <td>DataFrame.sort_values()</td>
    <td>Sort DataFrame by one or more columns</td>
  </tr>
  <tr>
    <td>7</td>
    <td>DataFrame.index</td>
    <td>Get the index (row labels) of the DataFrame</td>
  </tr>
  <tr>
    <td>8</td>
    <td>DataFrame.isin()</td>
    <td>Check whether each element in the DataFrame is contained in a list-like object</td>
  </tr>
  <tr>
    <td>9</td>
    <td>itertools.permutations()</td>
    <td>Generate all permutations of an iterable</td>
  </tr>
  <tr>
    <td>10</td>
    <td>DataFrame.apply()</td>
    <td>Apply a function along an axis of the DataFrame</td>
  </tr>
  <tr>
    <td>11</td>
    <td>DataFrame.size</td>
    <td>Return the number of elements in the DataFrame</td>
  </tr>
  <tr>
    <td>12</td>
    <td>DataFrame.reset_index()</td>
    <td>Reset the index of the DataFrame</td>
  </tr>
  <tr>
    <td>13</td>
    <td>.to_frame()</td>
    <td>Convert a Series to a DataFrame</td>
  </tr>
  <tr>
    <td>14</td>
    <td>str.contains()</td>
    <td>Check if each element in a Series or DataFrame column contains a substring</td>
  </tr>
  <tr>
    <td>15</td>
    <td>DataFrame.empty</td>
    <td>Check if the DataFrame is empty (contains no rows or columns)</td>
  </tr>
</table>

  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Starts an interactive IPython shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs csv data pipeline in headless log.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of a user-defined function (here `test`); requires `import inspect`  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
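```python
import numpy as np
import matplotlib.pyplot as plt

# Sample curve to draw in every style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2

fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
n_rows = -(-len(available) // 3)  # ceiling division for a 3-column grid
for i, style in enumerate(available):
    # Apply each style only while its own subplot is being drawn
    with plt.style.context(style):
        ax = fig.add_subplot(n_rows, 3, i + 1)
        ax.plot(x, y)
        ax.set_title(style)
```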
  

In [2]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Setting a standard style
plt.style.use('ggplot')

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

## Intro to content-based recommendations
  
So far we have looked at making recommendations based solely on how the entire population feels about items. While these recommendations can be useful, they aren't personalized.
  
**What are content-based recommendations?**
  
In this chapter, we will move to more targeted models by recommending items based on their similarities to items a user has liked in the past. For example, if a user likes book A, and we calculate that book A and book B are similar, we believe the user will like book B. We will address how to calculate what items are similar and which ones are not. We can do so by comparing the attributes of our items. The recommendations made by finding items with similar attributes are called content-based recommendations.
  
<center><img src='../_images/intro-to-content-based-recommendations.png' alt='img' width='740'></center>
  
**Items' attributes or characteristics**
  
For example, if we were looking at a dataset describing books, the attributes could be the author of the book, its publishing date, its length, or its genre: really any descriptive information. A big advantage of using an item's attributes over user feedback is that you can make recommendations for any item you have attribute data on, including brand-new items that users have not seen yet. Content-based models require us to use the available attributes to build item profiles that can be compared mathematically. This allows us, for example, to find the most similar items and recommend them.
  
<center><img src='../_images/intro-to-content-based-recommendations1.png' alt='img' width='740'></center>
  
**Vectorizing your attributes**
  
This is best done by encoding each item as a vector. Here we can see an example with a vector for each item stored as a row and each feature as a column. Why this shape you might ask? It is extremely valuable to have your data in this format so the distance and similarities between items can be easily calculated, which is vital for generating recommendations. We'll discuss how to calculate distances and similarities between vectors later in the course. First, we will cover how to convert the most common data format for attributes to this shape. We will continue using the book dataset from chapter 1, but this time we introduce an additional book_genre table.
  
<center><img src='../_images/intro-to-content-based-recommendations2.png' alt='img' width='740'></center>
  
**One to many relationships**
  
This `book_genre` table, as seen here on the left, contains a one-to-many mapping of books to their genres. This type of one-to-many lookup is very common in relational databases. From this table, we want to create a new table that contains a single row per item, with each column encoding whether or not the item has that attribute, as you see here on the right.
  
<center><img src='../_images/intro-to-content-based-recommendations3.png' alt='img' width='740'></center>
  
**Crosstabulation**
  
To transform this data, we can use the `pandas.crosstab()` function. It generates the cross-tabulation of two (or more) factors; here we use it to cross-tabulate the book titles against the genres they have been labeled with.
  
We call `pandas.crosstab()`, passing in the book titles as the first argument, and the book genres as the second argument. The first argument will become the rows, and the second becomes the columns. Here we can see the desired result.
  
<center><img src='../_images/intro-to-content-based-recommendations4.png' alt='img' width='740'></center>
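
The transformation described above can be sketched roughly as follows. The `book_genre_df` table, its `title` and `genre` column names, and its rows are made up for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical one-to-many table: one row per (book, genre) pair
book_genre_df = pd.DataFrame({
    "title": ["The Hobbit", "The Hobbit", "The Great Gatsby",
              "A Game of Thrones", "A Game of Thrones"],
    "genre": ["Fantasy", "Adventure", "Classics", "Fantasy", "Drama"],
})

# First argument becomes the rows, second argument becomes the columns
book_cross_table = pd.crosstab(book_genre_df["title"], book_genre_df["genre"])
print(book_cross_table)
```

Note that `pd.crosstab()` counts occurrences, so the cells are 0 or 1 only as long as each (title, genre) pair appears at most once in the source table.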
  
**Let's practice!**
  
Great, now we have our data in a format that will allow us to calculate similarities and make recommendations. Time to try these data transformations yourself.

### Why use content-based models?
  
Imagine you are working for a large retailer that has a constantly changing product line, with new items being added every day. Why might content-based models be a good choice to make recommendations on your data?
  
---
  
Possible Answers
  
- [ ] You are always guaranteed better recommendations with content-based data.
  
- [ ] Content-based models always recommend the newest products; customers always like the newest products no matter what their past preferences were.
  
- [x] As the recommendations are based on the item attributes rather than user feedback, recommendations can be made on never-before-purchased products.
  
Correct! Content-based models are ideal for creating recommendations for products that have no user feedback data such as reviews or purchases.

### Creating content-based data
  
As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few exercises, you will explore your base data and work through how to format that data to be used for content-based recommendations.
  
As a reminder, the desired outcome is a row per movie with each column indicating whether a genre applies to the movie. You will be looking at `movie_genre_df`, which contains these columns:
  
- name - Name of movie
- genre_list - Genre that the movie has been labeled as
  
A movie may have multiple genres, and therefore multiple rows. In this exercise, you will particularly focus on one movie (Toy Story in this case) to be able to clearly see what is happening with the data.
  
---
  
1. How many different movies are contained in `movie_genre_df`?
  
Possible answers
  
- [ ] 50
- [x] 21
- [ ] 11
  
Solution
  
```python
print(movie_genre_df.name.nunique())
```
  
2. Get the rows in `movie_genre_df` which have a name equal to `Toy Story` and save this as `toy_story_genres`.
3. Transform `movie_genre_df` to a table called `movie_cross_table`.
4. Assign the subset of `movie_cross_table` that contains `Toy Story` to the variable `toy_story_genres_ct` and inspect the results.

In [3]:
# Load dataset
movie_genre_df = pd.read_csv('../_datasets/movie_genre_small.csv')
print(movie_genre_df.shape)
movie_genre_df.head()

(50, 2)


Unnamed: 0,name,genre_list
0,Toy Story,Adventure
1,Toy Story,Animation
2,Toy Story,Children
3,Toy Story,Comedy
4,Toy Story,Fantasy


In [4]:
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df.name == 'Toy Story']

# Inspect the subset
print(toy_story_genres)

        name genre_list
0  Toy Story  Adventure
1  Toy Story  Animation
2  Toy Story   Children
3  Toy Story     Comedy
4  Toy Story    Fantasy


In [6]:
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df['name'] == 'Toy Story']

# Create cross-tabulated DataFrame from name and genre_list columns
movie_cross_table = pd.crosstab(movie_genre_df['name'], movie_genre_df['genre_list'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story']
toy_story_genres_ct

name,Action,Adventure,Animation,Children,Comedy,Crime,Drama,Fantasy,Horror,Romance,Thriller
Toy Story,0,1,1,1,1,0,0,1,0,0,0


Good work! This newly formatted table with a vector contained in a row per movie and a column per feature will allow you to calculate distances and similarities between movies.

### Understanding the content-based data
  
You are now able to convert common attribute data to a DataFrame containing a row per movie, and each of its attributes as columns. You will now take a closer look at the full DataFrame you just created to see if you understand the information within.
  
A subset of the DataFrame you have created in the last exercise has been loaded as `movie_cross_table`. As a reminder, the genres are stored as individual columns and the movie names are stored as the index.
  
Inspect the rows corresponding to 'Toy Story' and 'Yogi Bear' in `movie_cross_table`. How many genres do they have in common?
  
---
  
Possible answers
  
- [ ] 0 genres in common
- [x] 2 genres in common
- [ ] 4 genres in common
- [ ] 6 genres in common
  
```python
movie_cross_table[movie_cross_table.index == 'Toy Story']
movie_cross_table[movie_cross_table.index == 'Yogi Bear']
```
  
Correct! Yogi Bear and Toy Story both have the 'Children' and 'Comedy' attributes. The more genres that two movies have in common, the more likely it is that someone who liked one will like the other, so now we're going to apply this at a larger scale instead of just one pair of movies.
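
Instead of comparing the two rows by eye, the shared genres can also be listed programmatically. This is only a sketch: the two-row `movie_cross_table` below is a hypothetical stand-in, and the Yogi Bear row in particular is assumed rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for movie_cross_table (values assumed for illustration)
movie_cross_table = pd.DataFrame(
    [[1, 1, 1, 1, 1],
     [0, 0, 1, 1, 0]],
    index=["Toy Story", "Yogi Bear"],
    columns=["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
)

# A genre is shared when both rows have a 1 in that column
in_both = (movie_cross_table.loc["Toy Story"] == 1) & (movie_cross_table.loc["Yogi Bear"] == 1)
shared_genres = in_both[in_both].index.tolist()
print(shared_genres, len(shared_genres))
```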

## Making content-based recommendations
  
With our data formatted, we can begin making comparisons and recommendations, but to do so, we will need a way of calculating similarity between rows.
  
**Introducing the Jaccard similarity**
  
The metric we will use to measure similarity between items in our newly encoded dataset is called the Jaccard similarity. The Jaccard similarity is the ratio of attributes that two items have in common, divided by the total number of their combined attributes. These are respectively shown by the two orange shaded areas in the Venn diagrams here. It will always be between 0 and 1 and the more attributes the two items have in common, the higher the score.
  
<center><img src='../_images/making-content-based-recommendations.png' alt='img' width='740'></center>
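
To make the definition concrete, here is a minimal sketch that computes the ratio by hand with NumPy. The two Boolean vectors are invented for illustration.

```python
import numpy as np

# Hypothetical Boolean attribute vectors for two items
item_a = np.array([1, 1, 0, 1, 0], dtype=bool)
item_b = np.array([1, 0, 0, 1, 1], dtype=bool)

shared = np.logical_and(item_a, item_b).sum()    # attributes both items have
combined = np.logical_or(item_a, item_b).sum()   # attributes at least one item has

jaccard_similarity = shared / combined
print(shared, combined, jaccard_similarity)
```

Here two attributes are shared out of four combined attributes, giving a similarity of 0.5.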
  
**Calculating Jaccard similarity between books**
  
We will continue working on the book genre DataFrame created in the last video, called `genres_array_df`. It contains one row for each item (books in this case) and a column for each genre.
  
<center><img src='../_images/making-content-based-recommendations1.png' alt='img' width='740'></center>
  
**Calculating Jaccard similarity between books**
  
To calculate the Jaccard similarity between the books in the DataFrame, we first need to import `jaccard_score` from the `sklearn.metrics` module. This function takes two vectors (rows in our case) and calculates the similarity value. So we can take the row for The Hobbit and the row for A Game of Thrones and find their Jaccard score. While this is valuable for looking up individual similarities, it is often more useful to have the similarities of all your items calculated at once in an easy-to-access DataFrame.
  
<center><img src='../_images/making-content-based-recommendations2.png' alt='img' width='740'></center>
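
A small sketch of the pairwise lookup described above. The `genres_array_df` built here is a hypothetical three-book stand-in with assumed genre values, so the printed score will not match the course data.

```python
import pandas as pd
from sklearn.metrics import jaccard_score

# Hypothetical stand-in for genres_array_df (genre values are assumed)
genres_array_df = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]],
    index=["The Hobbit", "A Game of Thrones", "The Great Gatsby"],
    columns=["Adventure", "Fantasy", "Drama", "Classics"],
)

# jaccard_score expects two 1D arrays, so pull out each row's values
hobbit_row = genres_array_df.loc["The Hobbit"].values
got_row = genres_array_df.loc["A Game of Thrones"].values

score = jaccard_score(hobbit_row, got_row)
print(score)
```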
  
**Finding the distance between all items**
  
To get all of these similarities at once for our data we will call upon two helpful functions from the `scipy` package. First `pdist` (short for pairwise distance) helps us find all the distances at once, using Jaccard as the metric argument. This returns a condensed matrix, which contains all the distances in a 1D array. We then use `squareform` to get this 1D data into the rectangular shape we need.
  
<center><img src='../_images/making-content-based-recommendations3.png' alt='img' width='740'></center>
  
**Finding the distance between all items**
  
Note that `pdist` calculates the Jaccard distance which is a measure of how different rows are from each other. As we want the complement of this, the similarity, we subtract the values from 1.
  
<center><img src='../_images/making-content-based-recommendations4.png' alt='img' width='740'></center>
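
The `pdist` and `squareform` steps can be sketched together as below. The three Boolean rows are invented for illustration; the point is the shape of each intermediate result.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical Boolean genre vectors, one row per item
vectors = np.array(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]],
    dtype=bool,
)

# Condensed 1D array with one distance per unique pair (3 items -> 3 pairs)
jaccard_distances = pdist(vectors, metric="jaccard")

# Square matrix of distances, then flipped into similarities
jaccard_similarity_array = 1 - squareform(jaccard_distances)
print(jaccard_distances.shape, jaccard_similarity_array.shape)
print(jaccard_similarity_array)
```

The diagonal of the result is 1 because every item has a distance of 0 to itself.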
  
**Creating a usable distance table**
  
We can now wrap this similarity array in a DataFrame for ease of use. We call `pandas.DataFrame()` with the newly generated `jaccard_similarity_array` as the main argument and set both the `index=` and `columns=` arguments to the book titles. Let's take a look at the resulting `distance_df` DataFrame (which, despite its name, holds similarity scores).
  
<center><img src='../_images/making-content-based-recommendations5.png' alt='img' width='740'></center>
  
**Comparing books**
  
This distance DataFrame can be used to look up any pairing of books to see how similar they are. Let's look up the similarity between The Hobbit and A Game of Thrones again by using book titles to filter the `distance_df` DataFrame. This returns 0.75, a reasonable score, as they are both fun, action-packed fantasy books. If we perform a similar comparison between The Hobbit and The Great Gatsby, we get a much lower score of 0.15. This is not a huge surprise, as The Great Gatsby has very little in common with The Hobbit.
  
<center><img src='../_images/making-content-based-recommendations6.png' alt='img' width='740'></center>
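
A rough sketch of wrapping the array and looking up a pair. The 0.75 and 0.15 entries echo the scores quoted above; the remaining off-diagonal value is a placeholder, and using `.loc` for the lookup is one option among several.

```python
import numpy as np
import pandas as pd

titles = ["The Hobbit", "A Game of Thrones", "The Great Gatsby"]

# Similarity matrix; the 0.20 entry is a made-up placeholder
jaccard_similarity_array = np.array([
    [1.00, 0.75, 0.15],
    [0.75, 1.00, 0.20],
    [0.15, 0.20, 1.00],
])

# Titles label both the rows and the columns
distance_df = pd.DataFrame(jaccard_similarity_array, index=titles, columns=titles)

# Look up one pairing by row label and column label
pair_score = distance_df.loc["The Hobbit", "A Game of Thrones"]
print(pair_score)
```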
  
**Finding the most similar books**
  
Finally, while comparing two books is useful, a similarity table is most valuable when you use it to find a new book that is similar to one you just read and enjoyed. For this, we select the column containing the book we want to compare with and then sort the results using `.sort_values()`. The `ascending=` argument must be set to `False` to show the highest ranked books first. Unsurprisingly, all the top recommendations are similar fantasy adventure books!
  
<center><img src='../_images/making-content-based-recommendations7.png' alt='img' width='740'></center>
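
The ranking step could be packaged as a small helper, sketched below. The function name `most_similar`, the fourth book, and all similarity values are assumptions made for illustration. The helper also drops the book itself, which would otherwise always rank first with a score of 1.

```python
import pandas as pd

def most_similar(similarity_df, title, n=3):
    """Return the n items most similar to `title`, excluding the item itself."""
    scores = similarity_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Hypothetical similarity table (values are placeholders)
titles = ["The Hobbit", "A Game of Thrones", "The Great Gatsby", "The Two Towers"]
similarity_df = pd.DataFrame(
    [[1.00, 0.75, 0.15, 0.90],
     [0.75, 1.00, 0.20, 0.70],
     [0.15, 0.20, 1.00, 0.10],
     [0.90, 0.70, 0.10, 1.00]],
    index=titles,
    columns=titles,
)

top = most_similar(similarity_df, "The Hobbit", n=2)
print(top)
```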
  
**Let's practice!**
  
This method of recommendation is valuable when you have good descriptive attributes for the items you want to compare. Let's generate recommendations using these techniques with the movie dataset from chapter one.

### Comparing individual movies with Jaccard similarity
  
In the last lesson, you built a DataFrame of movies, where each column represents a different genre. You can now use this DataFrame to compare movies by measuring the Jaccard similarity between rows. The higher the Jaccard similarity score, the more similar the two items are.
  
In this exercise, you will compare the movie `GoldenEye` with the movie `Toy Story`, and `GoldenEye` with `Cutthroat Island`, and compare the results.
  
The DataFrame `movie_cross_table` containing all the movies as rows and the genres as Boolean columns that you created in the last lesson has been loaded.
  
---
  
1. Import the Jaccard similarity score function from `sklearn.metrics`.
2. Convert the rows containing `'GoldenEye'` and `'Toy Story'` to `numpy` arrays and measure their similarity.
3. Convert the row containing `'Cutthroat Island'` to a `numpy` array and measure its similarity to `'GoldenEye'`.

In [9]:
# Import numpy and the Jaccard similarity function
import numpy as np
from sklearn.metrics import jaccard_score

In [10]:
# Extract just the rows containing GoldenEye and Toy Story
goldeneye_values = movie_cross_table.loc['GoldenEye'].values
toy_story_values = movie_cross_table.loc['Toy Story'].values

# Find the similarity between GoldenEye and Toy Story
print(jaccard_score(goldeneye_values, toy_story_values))

0.14285714285714285


In [13]:
# Repeat for GoldenEye and Cutthroat Island
cutthroat_island_values = movie_cross_table.loc['Cutthroat Island'].values
print(jaccard_score(goldeneye_values, cutthroat_island_values))

0.5


Great! As you can see, based on Jaccard similarity, GoldenEye and Cutthroat Island (0.5) are more similar than GoldenEye and Toy Story (about 0.14), a spy movie and an animated kids' movie.

### Comparing all your movies at once
  
While finding the Jaccard similarity between any two individual movies in your dataset is great for small-scale analyses, computing similarities one pair at a time can be too slow for making recommendations on larger datasets.
  
In this exercise, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.
  
When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it's more efficient to use the `pdist()` (pairwise distance) function from `scipy`.
  
`pdist()` returns a condensed 1D array of distances, which can be reshaped into the desired square matrix using `squareform()` from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.
  
`movie_cross_table` has once again been loaded for you.
  
---
  
1. Find the Jaccard distances between all movies, convert them to similarities, and assign the results to `jaccard_similarity_array`.
2. Create a DataFrame from the `jaccard_similarity_array` with `movie_cross_table.index` as its rows and columns.
3. Print the top 5 rows of the DataFrame and examine the similarity scores.

In [17]:
# Importing the large dataset
movies_large_df = pd.read_csv('../_datasets/movies.csv')
print(movies_large_df.shape)
movies_large_df.head()

(9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [23]:
# Split the "genres" column into multiple columns
genres_df = movies_large_df['genres'].str.get_dummies('|')

# Create a DataFrame with movie names and genre columns
result_df = pd.concat([movies_large_df['title'], genres_df], axis=1)

# Setting title as index
result_df = result_df.set_index('title')

# Display
result_df.head()

title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [24]:
# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = pdist(result_df.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=result_df.index, columns=result_df.index)

# Print the top 5 rows of the DataFrame
jaccard_similarity_df.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),"American President, The (1995)",Dracula: Dead and Loving It (1995),Balto (1995),Nixon (1995),Cutthroat Island (1995),Casino (1995),Sense and Sensibility (1995),Four Rooms (1995),Ace Ventura: When Nature Calls (1995),Money Train (1995),Get Shorty (1995),Copycat (1995),Assassins (1995),Powder (1995),Leaving Las Vegas (1995),...,The Man Who Killed Don Quixote (2018),Boundaries (2018),Spiral (2018),Mission: Impossible - Fallout (2018),SuperFly (2018),Iron Soldier (2010),BlacKkKlansman (2018),The Darkest Minds (2018),Tilt (2011),Jeff Ross Roasts the Border (2017),John From (2015),Liquid Truth (2017),Bunny (1998),Hommage à Zgougou (et salut à Sabine Mamou) (2002),Gintama (2017),Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
Toy Story (1995),1.0,0.6,0.166667,0.142857,0.2,0.0,0.166667,0.4,0.0,0.142857,0.142857,0.166667,0.6,0.0,0.142857,0.0,0.0,0.2,0.2,0.111111,0.142857,0.0,0.0,0.0,0.0,...,0.6,0.166667,0.0,0.142857,0.0,0.0,0.142857,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.285714,0.285714,0.166667,0.166667,0.2,0.0,0.5,0.6,0.0,0.166667,0.2
Jumanji (1995),0.6,1.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.2,0.0,0.0,0.5,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.2,0.0,0.0,0.0
Grumpier Old Men (1995),0.166667,0.0,1.0,0.666667,0.5,0.0,1.0,0.0,0.0,0.0,0.666667,0.333333,0.0,0.0,0.25,0.0,0.333333,0.5,0.5,0.166667,0.25,0.0,0.0,0.0,0.333333,...,0.25,0.333333,0.0,0.0,0.0,0.0,0.25,0.0,0.333333,0.5,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.333333,0.0,0.0,0.2,0.25,0.0,0.0,0.5
Waiting to Exhale (1995),0.142857,0.0,0.666667,1.0,0.333333,0.0,0.666667,0.0,0.0,0.0,1.0,0.25,0.0,0.333333,0.2,0.25,0.666667,0.333333,0.333333,0.333333,0.2,0.142857,0.0,0.25,0.666667,...,0.2,0.666667,0.0,0.0,0.0,0.0,0.5,0.0,0.666667,0.333333,0.333333,0.333333,0.0,0.0,0.166667,0.166667,0.25,0.666667,0.0,0.0,0.166667,0.2,0.333333,0.0,0.333333
Father of the Bride Part II (1995),0.2,0.0,0.5,0.333333,1.0,0.0,0.5,0.0,0.0,0.0,0.333333,0.5,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.2,0.333333,0.0,0.0,0.0,0.0,...,0.333333,0.5,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0,0.5,0.0,0.0,0.25,0.333333,0.0,0.0,1.0


Correct! As you can see, the table has the movies as both rows and columns, allowing you to quickly look up the similarity of any movie pairing.

### Making recommendations based on movie genres
  
Now that you have your data in a usable format and know how to compare two movies, the next step is to use this to generate recommendations. In this exercise, you will learn how to generate recommendations for any movie in your dataset. The similarity scores between all movies in the dataset that you calculated in the last exercise have been pre-loaded for you as `jaccard_similarity_array`. `movie_cross_table` containing the movies and their attributes is also available.
  
For ease of use, you will need to wrap the similarity scores in a DataFrame. Then you will use this new DataFrame to suggest a movie recommendation.
  
---
  
1. Generate a DataFrame called `jaccard_similarity_df` from `jaccard_similarity_array`.
2. Store the similarity values between `Thor` and all other movies as a Series.
3. Sort these from largest to smallest in `ordered_similarities`.

In [69]:
# Wrap the preloaded array in a DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=result_df.index, columns=result_df.index)

# Find the values for the movie 'Thor (2011)'
jaccard_similarity_series = jaccard_similarity_df.loc['Thor (2011)']

# Sort these values from highest to lowest
ordered_similarities = jaccard_similarity_series.sort_values(ascending=False)

# Print the results
ordered_similarities.head(25)


title
Thor (2011)                                              1.000000
Harry Potter and the Deathly Hallows: Part 2 (2011)      0.833333
Beowulf & Grendel (2005)                                 0.800000
The Huntsman Winter's War (2016)                         0.800000
In the Name of the King III (2014)                       0.800000
Thor: The Dark World (2013)                              0.800000
Oz the Great and Powerful (2013)                         0.800000
Harry Potter and the Deathly Hallows: Part 1 (2010)      0.800000
Harry Potter and the Order of the Phoenix (2007)         0.800000
Clash of the Titans (2010)                               0.800000
Seeker: The Dark Is Rising, The (2007)                   0.800000
Pirates of the Caribbean: On Stranger Tides (2011)       0.800000
Lord of the Rings: The Return of the King, The (2003)    0.800000
Wrath of the Titans (2012)                               0.800000
Star Wars: Episode VII - The Force Awakens (2015)        0.666667
Merl

Correct!

## Text-based similarities
  
You can now generate content-based recommendations when descriptive attributes are available.
  
**Working without clear attributes**
  
Unfortunately in the real world, this is often not the case as attribute labels such as book genres might not be available. Thankfully if there is text tied to an item then we may still be in luck. This could be a plot summary, an item description, or even the contents of a book itself. For this kind of data, we use "Term Frequency Inverse Document Frequency" or TF-IDF to transform the text into something usable.
  
<center><img src='../_images/text-based-similarities.png' alt='img' width='740'></center>
  
**Term frequency inverse document frequency**
  
TF-IDF divides the number of times a word occurs in a document by a measure of what proportion of all the documents the word occurs in. This has the effect of reducing the value of common words while increasing the weight of words that do not occur in many documents. For example, if you were comparing the script of this course against the scripts of all the courses on DataCamp, the term "DataFrame" might get a low score: although it occurs a lot, it is present in many DataCamp courses. The term "recommendation", on the other hand, would get a high score, as it is not as common in other courses' scripts.
  
<center><img src='../_images/text-based-similarities1.png' alt='img' width='740'></center>
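
As a toy illustration of the ratio just described, the sketch below divides a term's count in one document by the share of documents containing it. The three "course scripts" and the `simple_tfidf` helper are invented for this example, and this is a deliberately simplified version: common formulations, including the default in `sklearn`'s `TfidfVectorizer`, use a logarithmic inverse document frequency and normalize each row.

```python
from collections import Counter

# Hypothetical course scripts, already lowercased
docs = {
    "course_a": "dataframe recommendation recommendation dataframe",
    "course_b": "dataframe plot dataframe",
    "course_c": "dataframe model",
}

def simple_tfidf(term, doc_name):
    """Term count in one document divided by the share of documents containing the term."""
    term_count = Counter(docs[doc_name].split())[term]
    doc_share = sum(term in text.split() for text in docs.values()) / len(docs)
    return term_count / doc_share

dataframe_score = simple_tfidf("dataframe", "course_a")            # in every document
recommendation_score = simple_tfidf("recommendation", "course_a")  # in one document only
print(dataframe_score, recommendation_score)
```

Both terms occur twice in `course_a`, but "recommendation" scores higher because it appears in fewer documents.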
  
**Our data**
  
In this video, we will be working with a dataset of books and their descriptions as seen here.
  
<center><img src='../_images/text-based-similarities2.png' alt='img' width='740'></center>
  
**Instantiate the vectorizer**
  
To transform our data, we import `TfidfVectorizer` from `sklearn.feature_extraction.text` and instantiate it as a variable, `tfidfvec` in this case. By default, the vectorizer generates a feature for every distinct word across all documents, which is a lot of features. Thankfully, we can specify constraints on the features being generated.
  
<center><img src='../_images/text-based-similarities3.png' alt='img' width='740'></center>
  
**Filtering the data**
  
First, we set the `min_df=` argument to 2. This limits our features to only those that occur in at least two documents. This is useful because terms occurring in only one document are not valuable for finding similarities.
  
<center><img src='../_images/text-based-similarities4.png' alt='img' width='740'></center>
  
**Filtering the data**
  
We should also remove words that are too common using `max_df=`. By setting this to 0.7, words that occur in more than 70% of the descriptions will be excluded.
  
<center><img src='../_images/text-based-similarities5.png' alt='img' width='740'></center>
  
**Vectorizing the data**
  
Once the vectorizer is instantiated, we call its `.fit_transform()` method on the text column. The vectorizer's `.get_feature_names_out()` method shows the features that were generated. When converted to an array, `vectorized_data` has a row for each book and a column for each feature. Success! We have transformed unorganized text into usable features for our models.
  
<center><img src='../_images/text-based-similarities6.png' alt='img' width='740'></center>
  
**Formatting the data**
  
Let's wrap the array in a DataFrame, using the output of the `.get_feature_names_out()` method as the columns, and assign the titles from the original DataFrame as the index. The resulting DataFrame will look familiar to you from the previous exercises, with a row per item and a column per feature. The scores represent how prominent each word is in the text compared to other texts, a useful attribute. For example, the score for the term 'battle' is much higher for A Game of Thrones, which is understandable given its theme.
  
<center><img src='../_images/text-based-similarities7.png' alt='img' width='740'></center>
  
**Cosine similarity**
  
As we advance from Boolean features to continuous TF-IDF values, we will use a metric that is better at measuring similarity between items that have more variation in their data: cosine similarity. We won't go into it in depth here, but mathematically, it is the cosine of the angle between two documents in the high-dimensional feature space, as seen in this two-dimensional example. Because TF-IDF scores are never negative, all values here are between 0 and 1, where 1 is an exact match.
  
<center><img src='../_images/text-based-similarities8.png' alt='img' width='740'></center>
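
The calculation can be sketched by hand: the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The vectors and the helper name `cosine_sim` below are made up for illustration and only use two features so that they match the two-dimensional picture.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical TF-IDF scores for two features, e.g. ("battle", "school")
book_a = np.array([0.8, 0.1])
book_b = np.array([0.4, 0.05])  # same direction as book_a, half the length
book_c = np.array([0.0, 0.9])   # dominated by the other feature

print(cosine_sim(book_a, book_b))  # close to 1: same direction
print(cosine_sim(book_a, book_c))  # much lower: very different direction
```

Note that `book_b` scores close to 1 against `book_a` even though its values are half as large, because only the direction of the vectors matters, not their length.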
  
**Cosine similarity**
  
Thankfully, `sklearn` has a premade `cosine_similarity()` function that we can use to find the similarity between all rows by calling it on the DataFrame, or between two rows by reshaping their values as seen here.
  
<center><img src='../_images/text-based-similarities9.png' alt='img' width='740'></center>
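
Both ways of calling the function can be sketched as follows. The DataFrame `toy_tfidf_df` and its values are hypothetical stand-ins for the TF-IDF table. The key detail is that a single row must be reshaped into a two-dimensional array (one row, many columns) before it is passed in.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF DataFrame: one row per book, one column per feature
toy_tfidf_df = pd.DataFrame(
    {"battle": [0.8, 0.7, 0.0], "school": [0.1, 0.0, 0.9]},
    index=["Book A", "Book B", "Book C"],
)

# All pairs at once: returns a 3 x 3 array
all_pairs = cosine_similarity(toy_tfidf_df)

# Two specific rows: each must be reshaped to 2-D (1 row, n features)
row_a = toy_tfidf_df.loc["Book A"].values.reshape(1, -1)
row_b = toy_tfidf_df.loc["Book B"].values.reshape(1, -1)
a_vs_b = cosine_similarity(row_a, row_b)

print(all_pairs.shape)
print(a_vs_b)
```

The single-pair result is a 1 x 1 array holding the same value found at row 0, column 1 of the full matrix.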
  
**Let's practice!**
  
Now it's your turn to use these similarities to generate recommendations!

### Instantiate the TF-IDF model
  
TF-IDF by default generates a column for every word in all of your documents (movie summaries in our case). This creates a huge and unintuitive dataset as it will contain both very common words that appear in every document, and words that appear so rarely they provide no value in finding similarities between items.
  
In this exercise, you will work with the `df_plots` DataFrame. It contains movies' names in the `Title` column and their plots in the `Plot` column.
  
Using this DataFrame, you will generate the default TF-IDF scores and see if non-valuable columns are present.
  
You will go on to rerun the TF-IDF calculations, this time limiting the number of columns using the `min_df=` and `max_df=` arguments and hopefully see the improvement.
  
---
  
1. Create a `TfidfVectorizer` and call it `vectorizer`.
2. Use `vectorizer` to transform the data in the `Plot` column of `df_plots` and assign the output to `vectorized_data`.
3. Inspect the features that have been generated by the transformation.
4. Repeat the creation of the `TfidfVectorizer`, but this time, set the minimum document frequency to 2 and the maximum document frequency to 0.7.
5. Inspect the features that have been generated by the transformation.

In [70]:
# Loading the required dataset
df_plots = pd.read_csv('../_datasets/movies_plot.csv')
print(df_plots.shape)
df_plots.head()

(1000, 2)


Unnamed: 0,Title,Plot
0,The Ballad of Cable Hogue,"Cable Hogue is isolated in the desert, awaitin..."
1,Monsters vs. Aliens,"In the far reaches of space, a planet explodes..."
2,The Bandit Queen,Zarra Montalvo is the daughter of an American ...
3,Broken Arrow,Major Vic Deakins (John Travolta) and Captain ...
4,Dolemite,Dolemite is a pimp and nightclub owner who is ...


In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer()

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

# Look at the features generated
print(vectorizer.get_feature_names_out())

['00' '000' '007' ... 'émile' 'étoile' 'željko']


In [72]:
# Instantiate the vectorizer object to the vectorizer variable
vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)

# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(df_plots['Plot'])

# Look at the features generated
print(vectorizer.get_feature_names_out())

['00' '000' '04' ... 'zoological' 'zorro' 'zuckerman']


Great! You now have a way of transforming free bodies of text into structured arrays, with each relevant word being stored as a feature. This can be used to measure similarities between items and make recommendations, even for items that you have no structured attribute data for.

### Creating the TF-IDF DataFrame
  
Now that you have generated your TF-IDF features, you will need to get them into a format that you can use to make recommendations. You will once again leverage `pandas` for this and wrap the array in a DataFrame. As you will be using the movie titles to filter the data, you can assign the titles to the DataFrame's index.
  
The `df_plots` DataFrame has once again been loaded for you. It contains movies' names in the `Title` column and their plots in the `Plot` column.
  
---
  
1. Create a `TfidfVectorizer` and fit and transform it as you did in the previous exercise.
2. Wrap the generated `vectorized_data` in a DataFrame. Use the names of the features generated during the fit and transform phase as its column names and assign your new DataFrame to `tfidf_df`.
3. Assign the original movie titles to the index of the newly created `tfidf_df` DataFrame.

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object and transform the plot column
vectorizer = TfidfVectorizer(max_df=0.7, min_df=2)
vectorized_data = vectorizer.fit_transform(df_plots['Plot']) 

# Create a DataFrame from the TF-IDF array
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names_out())

# Assign the movie titles to the index and inspect
tfidf_df.index = df_plots['Title']
tfidf_df.head()

Title,00,000,04,10,100,1000,101st,10th,11,12,1200,13,14,15,15th,16,17,1795,18,1880s,1890s,1898,18th,19,1902,...,youngest,youngster,your,youth,youthful,yves,yvonne,zac,zach,zachary,zack,zander,zane,zeal,zellweger,zero,zoe,zola,zombie,zombies,zone,zoo,zoological,zorro,zuckerman
The Ballad of Cable Hogue,0.0,0.0,0.0,0.0,0.022794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Monsters vs. Aliens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Bandit Queen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.106494,0.0
Broken Arrow,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Dolemite,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Good work! You are now able to manipulate text data into DataFrames with each row representing an item and each column representing a word extracted from the texts. You will be able to use this in a similar way to the attribute DataFrames you generated previously to measure similarities between items and make recommendations.

### Comparing all your movies with TF-IDF
  
Now that you have put in the hard work of getting your TF-IDF data into a usable format, it's time to put it to work finding similarities and generating recommendations.
  
This time as you are using TF-IDF scores (which are floats as opposed to Booleans) you will use the cosine similarity metric to find the similarities between items. In this exercise, you will generate a matrix of all of the movie cosine similarities and store them in a DataFrame for ease of lookup. This will allow you to compare movies and find recommendations quickly and easily.
  
The `tfidf_df` DataFrame you created in the last exercise containing a row for each movie has been loaded for you.
  
---
  
1. Find the cosine similarity measures between all movies and assign the results to `cosine_similarity_array`.
2. Create a DataFrame from the `cosine_similarity_array` with `tfidf_df.index` as its rows and columns.
3. Print the top five rows of the DataFrame and examine the similarity scores.

In [76]:
# Import cosine_similarity measure
from sklearn.metrics.pairwise import cosine_similarity

# Create the array of cosine similarity values
cosine_similarity_array = cosine_similarity(tfidf_df)

# Wrap the array in a pandas DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_df.index, columns=tfidf_df.index)

# Print the top 5 rows of the DataFrame
cosine_similarity_df.head()

Title,The Ballad of Cable Hogue,Monsters vs. Aliens,The Bandit Queen,Broken Arrow,Dolemite,The Astounding She-Monster,The Wolf Song,Conan the Barbarian,Star Kid,Halloween,Seven Chances,The Gamma People,The War Wagon,Billy the Kid Trapped,Beauty for Sale,Mumford,The Jacket,Out Cold,Eagle Squadron,The Prince Who Was a Thief,No Time for Flowers,Big Money Rustlas,A Lady Takes a Chance,The Savage Is Loose,The Truth About Charlie,...,Excision,Snatched,W,The Young Rajah,Canyon River,A Summer Place,American Pie,The Fury,American Me,No Way to Treat a Lady,Mischief,Tumbleweeds,Lionheart,Cry Vengeance,Marry the Girl,The Sisterhood of the Traveling Pants 2,Shield for Murder,Two for the Seesaw,Tarzan and the Lost Safari,The Nut Job 2: Nutty by Nature,Unknown Island,Boss Nigger,Secret Command,The Monolith Monsters,Dick Tracy
The Ballad of Cable Hogue,1.0,0.028441,0.012453,0.023429,0.01298,0.038187,0.008889,0.020613,0.028757,0.026356,0.035464,0.01765,0.029383,0.026394,0.029426,0.017046,0.047255,0.022345,0.017433,0.023415,0.011487,0.013563,0.02052,0.013615,0.014208,...,0.019384,0.034609,0.020783,0.033388,0.007973,0.027585,0.032968,0.012901,0.032206,0.017803,0.023307,0.019234,0.073812,0.023139,0.009829,0.026715,0.022323,0.029898,0.011584,0.031396,0.040103,0.044568,0.010396,0.065232,0.007656
Monsters vs. Aliens,0.028441,1.0,0.017621,0.038809,0.023125,0.066625,0.014403,0.043604,0.05888,0.028349,0.035874,0.03399,0.026807,0.010083,0.047422,0.022813,0.06075,0.028169,0.025492,0.014027,0.02856,0.012498,0.015393,0.033812,0.036396,...,0.03566,0.043198,0.025233,0.02128,0.003872,0.031413,0.034258,0.027386,0.037837,0.019692,0.012672,0.011264,0.036958,0.029522,0.011561,0.044824,0.021015,0.041246,0.02507,0.045551,0.052202,0.025738,0.009485,0.074491,0.020735
The Bandit Queen,0.012453,0.017621,1.0,0.012407,0.004632,0.010637,0.052729,0.025402,0.018278,0.017238,0.023209,0.010935,0.061398,0.025012,0.021849,0.018857,0.022293,0.019919,0.023008,0.0,0.002382,0.012285,0.014697,0.020675,0.022153,...,0.019069,0.030147,0.00613,0.014235,0.002421,0.01978,0.047295,0.017839,0.04025,0.015988,0.00799,0.023531,0.01819,0.008032,0.004814,0.026803,0.015266,0.013925,0.00886,0.016744,0.0289,0.038336,0.009885,0.040089,0.001565
Broken Arrow,0.023429,0.038809,0.012407,1.0,0.013701,0.025679,0.01823,0.027254,0.032897,0.026099,0.020382,0.042388,0.040938,0.018508,0.020789,0.009461,0.033495,0.016841,0.03406,0.023628,0.017858,0.008115,0.007979,0.020298,0.007995,...,0.01653,0.042712,0.013538,0.01375,0.005537,0.025545,0.027832,0.015772,0.02564,0.01785,0.007843,0.013854,0.021258,0.032316,0.009062,0.037092,0.023815,0.020893,0.019569,0.064618,0.041855,0.021669,0.005233,0.034998,0.005826
Dolemite,0.01298,0.023125,0.004632,0.013701,1.0,0.006501,0.002061,0.030737,0.021017,0.012694,0.028146,0.007753,0.016767,0.023022,0.014686,0.018362,0.013776,0.010273,0.004911,0.006102,0.0,0.012048,0.0058,0.0,0.017002,...,0.008301,0.012008,0.0,0.019204,0.012241,0.008629,0.018012,0.008309,0.058915,0.010007,0.00215,0.029618,0.00739,0.026445,0.00506,0.01094,0.012211,0.027765,0.008315,0.018803,0.010829,0.01467,0.003753,0.007722,0.023323


Correct! As you can see in the table, each movie has its own row and its own column, so for example, the value in the cell where the `Toy Story` row meets the `Thor` column represents the cosine similarity between them. This allows you to look up the similarity of any movie pairing by filtering on the two axes.
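
A lookup of this kind can be sketched with `.loc`, passing the row label first and the column label second. The table below is a hypothetical three-movie stand-in for `cosine_similarity_df`; its titles and values are made up.

```python
import pandas as pd

# Hypothetical similarity table; titles and values are made up for illustration
titles = ["Movie A", "Movie B", "Movie C"]
toy_similarity_df = pd.DataFrame(
    [[1.0, 0.3, 0.1],
     [0.3, 1.0, 0.5],
     [0.1, 0.5, 1.0]],
    index=titles,
    columns=titles,
)

# Row label first, column label second
a_vs_c = toy_similarity_df.loc["Movie A", "Movie C"]

# The table is symmetric, so swapping the labels gives the same value
c_vs_a = toy_similarity_df.loc["Movie C", "Movie A"]

print(a_vs_c, c_vs_a)
```

If the index contains duplicate titles, `.loc` returns several values for that label rather than a single number.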

### Making recommendations with TF-IDF
  
In the last exercise you pre-calculated the similarity ratings between all movies in the dataset based on their plots transformed by TF-IDF. Now you will put these similarity ratings in a DataFrame for ease of use. Then you will use this new DataFrame to suggest a movie recommendation.
  
The `cosine_similarity_array` containing a matrix of the similarity values between all movies that you created in the last exercise has been loaded for you. The `tfidf_df` DataFrame containing the movies and their TF-IDF features is also available.
  
---
  
1. Generate a DataFrame from `cosine_similarity_array`.
2. Store the cosine similarity values between the movie `Iron Man` and all other movies as a Series.
3. Sort these from largest to smallest in `ordered_similarities` and print the ordered results.

In [90]:
# Wrap the preloaded array in a DataFrame
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_df.index, columns=tfidf_df.index)

# Find the values for the movie Iron Man
cosine_similarity_series = cosine_similarity_df.loc['Iron Man']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

# Print the results
ordered_similarities.head(25)

Title
Iron Man                      1.000000
Animal Crackers               0.119312
The Great White Hope          0.088461
A Free Soul                   0.087870
Sinner Take All               0.086986
The Raging Tide               0.084518
Steelyard Blues               0.060390
The Pick-up Artist            0.048532
The Savage Is Loose           0.047253
The Outcasts of Poker Flat    0.041358
Palooka                       0.041163
The Tiger Woman               0.039834
The Emperor Jones             0.035588
Rimfire                       0.034939
Canyon Passage                0.034000
Secret Command                0.033250
Mothers Cry                   0.032364
Loan Shark                    0.032260
Grand Central Murder          0.032060
The 11th Hour                 0.029519
Little Miss Marker            0.028965
The Boxtrolls                 0.028778
Homeboy                       0.028267
Red Tomahawk                  0.028198
12:01 PM                      0.028112
Name: Iron Man, dty

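Correct. `Animal Crackers` has the highest similarity value to `Iron Man`! Based on plot text alone, this suggests that viewers who liked `Iron Man` are more likely to enjoy `Animal Crackers` than any other movie in the dataset, although a score of about 0.12 indicates only a modest overlap. Since both are movies geared for a younger audience, this makes a lot of sense.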
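
Note that the movie itself always tops its own list with a score of 1.0. One possible way to leave it out before recommending, sketched here on a hypothetical Series rather than the real results, is to drop its label before sorting.

```python
import pandas as pd

# Hypothetical similarity scores for one movie against the whole catalogue
toy_similarities = pd.Series(
    {"Movie A": 1.0, "Movie B": 0.12, "Movie C": 0.09, "Movie D": 0.03},
    name="Movie A",
)

# Remove the movie's own entry, then keep the two best matches
top_matches = (
    toy_similarities.drop("Movie A").sort_values(ascending=False).head(2)
)

print(top_matches)
```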

## User profile recommendations
  
In this chapter, you have learned how to use items' attributes to generate content-based recommendations by finding items that are similar to each other.
  
**Item to item recommendations**
  
This has many uses such as suggesting obscure books that are similar to your favorite, proposing the next movie to watch that is like the one you just finished, or even finding alternative options when items are out of stock.
  
<center><img src='../_images/user-profile-recommendations.png' alt='img' width='740'></center>
  
**User profiles**
  
But people are not so one-dimensional that they only like one item. They may have read many different books and want to find one that is aligned with their wide array of tastes. For example, taking a look at `tfidf_summary_df` that we have used previously, we have a row per book, with a column for each TF-IDF feature. For user-based recommendations, we need vectors to represent individual items as well as vectors to represent a user's likes. This will allow us to compare a user's likes to various items to see which items might suit them best.
  
<center><img src='../_images/user-profile-recommendations1.png' alt='img' width='740'></center>
  
**Extract the user data**
  
Let's take an example of a user who has read a set of books. The most straightforward way of creating a user profile is to first get the vectors corresponding to the books they have read by slicing `tfidf_summary_df`, which contains all the books, using the `.reindex()` method as you see here. Remember that `tfidf_summary_df` contains TF-IDF features for all books in our dataset. This creates a DataFrame containing rows only for the books the user has read, along with their TF-IDF scores. This still has multiple rows, and we want a single vector for our user. To go from the full table to a summary of the user's tastes, we can simply find the average of each column, representing the average of the characteristics of the books the user liked.
  
<center><img src='../_images/user-profile-recommendations2.png' alt='img' width='740'></center>
  
**Build the user profile**
  
We find the average of each column by calling `.mean()` on the DataFrame. The average values in this Series represent the user profile or in other words a way of representing all of the user's preferences at once. For example, this user appears to enjoy books that have high values in the "ancient" TF-IDF feature. This implies that the word "ancient" is prominent in books they like. This profile can, with a bit of reshaping, be used as a vector to compare against other books.
  
<center><img src='../_images/user-profile-recommendations3.png' alt='img' width='740'></center>
  
**Finding recommendations for a user**
  
This user profile can then be used to find the most similar books that they have not yet read. We first must find the subset of books that have not been read by dropping those contained in the list of books already read (specifying the index axis by setting `axis=0`). We then calculate the cosine similarity as we did in the previous lesson, but this time between the user profile vector we just created and the DataFrame of all the books the user has not read yet. Then we wrap the output in a DataFrame and sort the results once again so we can access and order the data easily.
  
<center><img src='../_images/user-profile-recommendations4.png' alt='img' width='740'></center>
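
The whole sequence described above (subset, average, drop, compare, sort) can be sketched end to end. All names and values below, including `toy_tfidf_df`, `books_read`, and the `similarity_score` column, are assumptions for illustration and not the course's actual data.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF features; titles, columns, and values are made up
toy_tfidf_df = pd.DataFrame(
    {"ancient": [0.9, 0.8, 0.7, 0.0], "robot": [0.0, 0.1, 0.1, 0.9]},
    index=["Book A", "Book B", "Book C", "Book D"],
)
books_read = ["Book A", "Book B"]

# Average the feature vectors of the books the user has read
user_profile = toy_tfidf_df.reindex(books_read).mean()

# Keep only the books the user has not read yet
unread_df = toy_tfidf_df.drop(books_read, axis=0)

# Compare the profile (reshaped to one row) with every unread book
scores = cosine_similarity(user_profile.values.reshape(1, -1), unread_df)

# Label the scores with the book titles and sort from best to worst
recommendations = pd.DataFrame(
    scores.T, index=unread_df.index, columns=["similarity_score"]
).sort_values(by="similarity_score", ascending=False)

print(recommendations)
```

Because this user's profile is dominated by the "ancient" feature, the unread book that shares that feature ranks first.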
  
**Getting the top recommendations**
  
After sorting the recommendation scores you will now be able to recommend items based on a user's full history, not just based on individual items. These top values are the items that are the most similar to the interests of the user based on their full background of interests, making them good suggestions for the user to read next.
  
<center><img src='../_images/user-profile-recommendations5.png' alt='img' width='740'></center>
  
**Let's practice!**
  
Great, let's work with the movie dataset to build user profiles and create recommendations based on them.

### Build the user profiles
  
You are now able to generate suggestions for similar items based on their labeled features or based on their descriptions. But sometimes finding similar items might not be enough. In the next exercises, you will work through how one could create recommendations based on a user and all the items they liked as opposed to a singular item. You will first generate a profile for a user by aggregating all of the movies they have previously enjoyed.
  
The `tfidf_df` you have been working on in the last few exercises has been loaded for you. This contains a row per movie with their titles as the index and a column for each feature containing their respective TF-IDF score.
  
---
  
1. Create a subset of the `tfidf_df` that contains only rows corresponding to the supplied `list_of_movies_enjoyed` list.
2. Generate the user profile by finding the average TF-IDF scores of each of the features of the movies contained in `movies_enjoyed_df`.
3. Inspect the results.

In [129]:
list_of_movies_enjoyed = ['The Breakfast Club', 'Iron Man', 'Top Gun']

# Create a subset of only the movies the user has enjoyed
movies_enjoyed_df = tfidf_df.reindex(list_of_movies_enjoyed)

# Inspect the DataFrame
print(movies_enjoyed_df)

ValueError: cannot reindex on an axis with duplicate labels
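
The error message indicates that at least one title appears more than once in `tfidf_df.index`, and `.reindex()` requires unique labels. One possible workaround, assuming it is acceptable to keep only the first row for each repeated title, is sketched below on a small hypothetical DataFrame rather than the real `tfidf_df`.

```python
import pandas as pd

# Hypothetical TF-IDF table in which the title "Movie A" appears twice
toy_tfidf_df = pd.DataFrame(
    {"hero": [0.5, 0.4, 0.0], "school": [0.0, 0.1, 0.8]},
    index=["Movie A", "Movie A", "Movie B"],
)
movies_enjoyed = ["Movie A", "Movie B"]

# Keep only the first row for each title so that every label is unique
unique_tfidf_df = toy_tfidf_df[~toy_tfidf_df.index.duplicated(keep="first")]

# .reindex() now works because the index has no duplicate labels
toy_enjoyed_df = unique_tfidf_df.reindex(movies_enjoyed)

print(toy_enjoyed_df)
```

Keep in mind that `.reindex()` fills rows with `NaN` for any requested title that is not in the index, so it is worth checking that each title in the list actually exists in the dataset.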

In [128]:
for x in list(tfidf_df.index):
    print(x)

The Ballad of Cable Hogue
Monsters vs. Aliens
The Bandit Queen
Broken Arrow
Dolemite
The Astounding She-Monster
The Wolf Song
Conan the Barbarian
Star Kid
Halloween
Seven Chances
The Gamma People
The War Wagon
Billy the Kid Trapped
Beauty for Sale
Mumford
 The Jacket
Out Cold
Eagle Squadron
The Prince Who Was a Thief
No Time for Flowers
Big Money Rustlas
A Lady Takes a Chance
The Savage Is Loose
The Truth About Charlie
Flame of Calcutta
You Can't Have Everything
Pleasantville
A Shot in the Dark
Dragonslayer
The 4th Floor
Disturbing Behavior
True Believer
Don't Bet on Love
Arizona
The Twelve Chairs
Roar of the Press
That Kind of Woman
Act of Love
The Preacher's Wife
Kidnapped
The White Dawn
The Black Shield of Falworth
Eaten Alive
The Package
Oculus
The Desert Hawk
Highlander
Hoodwinked Too! Hood vs. Evil
Hulk
Killers from Space
The Bad Seed
Alien Nation: The Udara Legacy
Sniper
The Way We Were
Union Station
10:30 P.M. Summer
Book of Love
Rover Dangerfield
Run for Cover
Valerian and the