# Content-Based Recommendations
  
Discover how item attributes can be used to make recommendations. Create valuable comparisons between items with both categorical and text data. Generate profiles to recommend new items for users based on their past preferences.

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Function or attribute</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>DataFrame.nunique()</td>
    <td>Count distinct elements in each column</td>
  </tr>
  <tr>
    <td>2</td>
    <td>DataFrame.hist()</td>
    <td>Create histograms for numeric columns</td>
  </tr>
  <tr>
    <td>3</td>
    <td>DataFrame.drop()</td>
    <td>Drop specified labels from rows or columns</td>
  </tr>
  <tr>
    <td>4</td>
    <td>Series.value_counts()</td>
    <td>Count unique values in a Series</td>
  </tr>
  <tr>
    <td>5</td>
    <td>DataFrame.groupby()</td>
    <td>Group DataFrame using a mapper or by a Series of columns</td>
  </tr>
  <tr>
    <td>6</td>
    <td>DataFrame.sort_values()</td>
    <td>Sort DataFrame by one or more columns</td>
  </tr>
  <tr>
    <td>7</td>
    <td>DataFrame.index</td>
    <td>Get the index (row labels) of the DataFrame</td>
  </tr>
  <tr>
    <td>8</td>
    <td>DataFrame.isin()</td>
    <td>Check whether each element in the DataFrame is contained in a list-like object</td>
  </tr>
  <tr>
    <td>9</td>
    <td>itertools.permutations()</td>
    <td>Generate all permutations of an iterable</td>
  </tr>
  <tr>
    <td>10</td>
    <td>DataFrame.apply()</td>
    <td>Apply a function along an axis of the DataFrame</td>
  </tr>
  <tr>
    <td>11</td>
    <td>DataFrame.size</td>
    <td>Return the number of elements in the DataFrame</td>
  </tr>
  <tr>
    <td>12</td>
    <td>DataFrame.reset_index()</td>
    <td>Reset the index of the DataFrame</td>
  </tr>
  <tr>
    <td>13</td>
    <td>Series.to_frame()</td>
    <td>Convert a Series to a DataFrame</td>
  </tr>
  <tr>
    <td>14</td>
    <td>Series.str.contains()</td>
    <td>Check if each element in a Series or DataFrame column contains a substring</td>
  </tr>
  <tr>
    <td>15</td>
    <td>DataFrame.empty</td>
    <td>Check if the DataFrame is empty (contains no rows or columns)</td>
  </tr>
</table>

  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
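`python3.11 -m IPython` : Starts the IPython interactive shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background, detached from the terminal, and writes its standard output to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines (and starting line number) of the user-defined function `test`; requires `import inspect`.  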
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
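```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
n_rows = -(-len(available) // 3)  # Ceiling division: three plots per row
for i, style in enumerate(available):
    # Draw each subplot inside its own style context
    with plt.style.context(style):
        ax = fig.add_subplot(n_rows, 3, i + 1)
        ax.plot(x, y)
    ax.set_title(style)
```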
  

```python
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Setting a standard style
plt.style.use('ggplot')

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)
```

## Intro to content-based recommendations
  
So far we have looked at making recommendations based solely on how the entire population feels about items. While these recommendations can be useful, they aren't personalized.
  
**What are content-based recommendations?**
  
In this chapter, we move to more targeted models that recommend items based on their similarity to items a user has liked in the past. For example, if a user likes book A, and we calculate that book A and book B are similar, we infer that the user is likely to enjoy book B as well. To decide which items are similar and which are not, we compare the attributes of our items. Recommendations made by finding items with similar attributes are called content-based recommendations.
  
<center><img src='../_images/intro-to-content-based-recommendations.png' alt='img' width='740'></center>
  
**Items' attributes or characteristics**
  
For example, if we were looking at a dataset describing books, the attributes could be the author of the book, its publishing date, its length, or its genre: any descriptive information will do. A big advantage of using an item's attributes over user feedback is that you can make recommendations for any item you have attribute data on, including brand-new items that users have not seen yet. Content-based models require us to use the available attributes to build item profiles that can be compared mathematically. This lets us, for example, find the items most similar to a given item and recommend them.
  
<center><img src='../_images/intro-to-content-based-recommendations1.png' alt='img' width='740'></center>
  
**Vectorizing your attributes**
  
This is best done by encoding each item as a vector. Here we can see an example with each item's vector stored as a row and each feature as a column. Why this shape, you might ask? With the data in this format, the distances and similarities between items can be calculated easily, which is vital for generating recommendations. We'll discuss how to calculate distances and similarities between vectors later in the course. First, we will cover how to convert the most common data format for attributes into this shape. We will continue using the book dataset from chapter 1, but this time we introduce an additional `book_genre` table.
  
<center><img src='../_images/intro-to-content-based-recommendations2.png' alt='img' width='740'></center>
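  
As a rough illustration of the row-per-item, column-per-feature layout described above, here is a minimal sketch. The book titles, genre columns, and values are invented for this example, and counting shared attributes is only a stand-in for the distance and similarity measures covered later in the course.

```python
import pandas as pd

# Hypothetical items and attributes, invented for illustration
item_vectors = pd.DataFrame(
    {
        "Fantasy": [1, 1, 0],
        "Adventure": [1, 0, 0],
        "Romance": [0, 0, 1],
    },
    index=["Book A", "Book B", "Book C"],
)

# Each row is one item's attribute vector
book_a = item_vectors.loc["Book A"].to_numpy()
book_b = item_vectors.loc["Book B"].to_numpy()

# Number of attributes both items have (positions where both vectors hold a 1)
shared = int((book_a * book_b).sum())

print(item_vectors)
print("Book A vector:", book_a)
print("Attributes shared by Book A and Book B:", shared)
```

Because every item uses the same columns in the same order, any two rows can be compared element by element, which is what makes this shape convenient.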
  
**One-to-many relationships**
  
This `book_genre` table, as seen here on the left, contains a one-to-many mapping of books to their genres. This type of one-to-many lookup is very common in relational databases. From this table, we want to create a new table that contains a single row per item, with each column encoding whether or not the item has that attribute, as you see here on the right.
  
<center><img src='../_images/intro-to-content-based-recommendations3.png' alt='img' width='740'></center>
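  
To make the one-to-many layout concrete, the sketch below builds a small lookup table with one row per (book, genre) pair and inspects it. The frame name `book_genre_df`, its column names `title` and `genre`, and all values are assumptions made for this example, not the course's actual data.

```python
import pandas as pd

# Hypothetical one-to-many lookup: one row per (book, genre) pair
book_genre_df = pd.DataFrame(
    {
        "title": ["Book A", "Book A", "Book B", "Book C", "Book C", "Book C"],
        "genre": ["Fantasy", "Adventure", "Fantasy", "Romance", "Drama", "Fantasy"],
    }
)

# A book appears once for every genre it has been labeled with
genres_per_book = book_genre_df.groupby("title").size()

# Number of distinct books, regardless of how many rows each one occupies
n_books = book_genre_df["title"].nunique()

print(genres_per_book)
print("Distinct books:", n_books)
```

The table has six rows but only three distinct books, which is why it needs reshaping before each book can be treated as a single vector.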
  
**Crosstabulation**
  
To transform this data, we can use the `pandas.crosstab()` function. It computes a cross-tabulation of two (or more) factors, counting how often each combination of values occurs. Here we use it to cross-tabulate the book titles against the genres they have been labeled with.
  
We call `pandas.crosstab()`, passing in the book titles as the first argument and the book genres as the second. The first argument becomes the rows, and the second becomes the columns. Here we can see the desired result.
  
<center><img src='../_images/intro-to-content-based-recommendations4.png' alt='img' width='740'></center>
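  
The call described above might look like the following sketch. The frame `book_genre_df` and its `title` and `genre` columns are hypothetical stand-ins for the course's book data.

```python
import pandas as pd

# Hypothetical one-to-many lookup: one row per (book, genre) pair
book_genre_df = pd.DataFrame(
    {
        "title": ["Book A", "Book A", "Book B", "Book C", "Book C", "Book C"],
        "genre": ["Fantasy", "Adventure", "Fantasy", "Romance", "Drama", "Fantasy"],
    }
)

# First argument becomes the rows, second argument becomes the columns
book_cross_table = pd.crosstab(book_genre_df["title"], book_genre_df["genre"])

print(book_cross_table)
```

Note that `pandas.crosstab()` counts occurrences, so a (title, genre) pair that appears twice in the lookup table would produce a 2 rather than a 1. If strictly binary indicators are needed, dropping duplicate rows first is one way to guarantee them.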
  
**Let's practice!**
  
Great, now we have our data in a format that will allow us to calculate similarities and make recommendations. Time to try these data transformations yourself.

### Why use content-based models?
  
Imagine you are working for a large retailer that has a constantly changing product line, with new items being added every day. Why might content-based models be a good choice to make recommendations on your data?
  
---
  
Possible Answers
  
- [ ] You are always guaranteed better recommendations with content-based data.
  
- [ ] Content-based models always recommend the newest products; customers always like the newest products no matter what their past preferences were.
  
- [x] As the recommendations are based on the item attributes rather than user feedback, recommendations can be made on never-before-purchased products.
  
Correct! Content-based models are ideal for creating recommendations for products that have no user feedback data such as reviews or purchases.

### Creating content-based data
  
As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few exercises, you will explore your base data and work through how to format that data to be used for content-based recommendations.
  
As a reminder, the desired outcome is a row per movie with each column indicating whether a genre applies to the movie. You will be looking at `movie_genre_df`, which contains these columns:
  
- name - Name of movie
- genre_list - Genre that the movie has been labeled as
  
A movie may have multiple genres, and therefore multiple rows. In this exercise, you will particularly focus on one movie (Toy Story in this case) to be able to clearly see what is happening with the data.
  
---
  
1. How many different movies are contained in `movie_genre_df`?
  
Possible answers
  
- [ ] 50
- [x] 21
- [ ] 11
  
Solution
  
```python
print(movie_genre_df.name.nunique())
```
  
2. Get the rows in `movie_genre_df` which have a name equal to `Toy Story` and save this as `toy_story_genres`.
3. Transform `movie_genre_df` to a table called `movie_cross_table`.
4. Assign the subset of `movie_cross_table` that contains `Toy Story` to the variable `toy_story_genres_ct` and inspect the results.

```python
# Load dataset
movie_genre_df = pd.read_csv('../_datasets/movie_genre_small.csv')
print(movie_genre_df.shape)
movie_genre_df.head()
```

```
(50, 2)

        name genre_list
0  Toy Story  Adventure
1  Toy Story  Animation
2  Toy Story   Children
3  Toy Story     Comedy
4  Toy Story    Fantasy
```


```python
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df.name == 'Toy Story']

# Inspect the subset
print(toy_story_genres)
```

```
        name genre_list
0  Toy Story  Adventure
1  Toy Story  Animation
2  Toy Story   Children
3  Toy Story     Comedy
4  Toy Story    Fantasy
```


```python
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df['name'] == 'Toy Story']

# Create cross-tabulated DataFrame from name and genre_list columns
movie_cross_table = pd.crosstab(movie_genre_df['name'], movie_genre_df['genre_list'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story']
toy_story_genres_ct
```

```
genre_list  Action  Adventure  Animation  Children  Comedy  Crime  Drama  Fantasy  Horror  Romance  Thriller
name
Toy Story        0          1          1         1       1      0      0        1       0        0         0
```


Good work! This newly formatted table, with one row vector per movie and one column per feature, will allow you to calculate distances and similarities between movies.
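
A couple of quick sanity checks can confirm that the cross-tabulated table has the intended shape. The sketch below uses a small stand-in for `movie_genre_df`: the Toy Story rows match the output shown earlier, while "Example Movie" and its genres are hypothetical, added only so the table has more than one row.

```python
import pandas as pd

# Stand-in for movie_genre_df; "Example Movie" is a hypothetical second title
movie_genre_df = pd.DataFrame(
    {
        "name": ["Toy Story"] * 5 + ["Example Movie"] * 2,
        "genre_list": [
            "Adventure", "Animation", "Children", "Comedy", "Fantasy",
            "Action", "Comedy",
        ],
    }
)
movie_cross_table = pd.crosstab(movie_genre_df["name"], movie_genre_df["genre_list"])

# The cross table should hold exactly one row per distinct movie
n_movies = movie_genre_df["name"].nunique()
print(movie_cross_table.shape[0] == n_movies)

# Each row sum equals the number of genre rows that movie had originally
genres_per_movie = movie_cross_table.sum(axis=1)
print(genres_per_movie)

# Selecting with a list of labels keeps the result a one-row DataFrame
toy_story_genres_ct = movie_cross_table.loc[["Toy Story"]]
print(toy_story_genres_ct)
```

Selecting with `.loc[["Toy Story"]]` is an alternative to the boolean mask on the index used in the exercise; both return a one-row DataFrame.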