# Content-based recommendation

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('movies.csv')
df.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [4]:
# Split the title into name and year columns
df[['name', 'year']] = df.title.str.split(r'\(|\)', expand=True).iloc[:, [0, 1]]
df.head(3)

Unnamed: 0,movieId,title,genres,name,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men,1995
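
Splitting on every parenthesis works for titles like the three shown above, but a title that carries an extra parenthesised part before the year would put that part in the `year` column. A minimal sketch of an alternative, assuming the year always appears as a trailing `(YYYY)`; the sample titles and the fallback for titles without a year are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Illustrative sample titles (assumed); the second has an extra parenthesised part.
titles = pd.Series([
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Untitled Film",
])

# Capture everything before a trailing "(YYYY)" as name and the four digits as year.
parts = titles.str.extract(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")

# Titles without a trailing year do not match and yield NaN in both columns;
# fall back to the raw title for the name and leave the year missing.
parts["name"] = parts["name"].fillna(titles)

print(parts)
```

Note that `year` is still a string column here, as in the notebook's own split.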


In [5]:
df.drop('title', axis=1, inplace=True)
df.head(3)

Unnamed: 0,movieId,genres,name,year
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Adventure|Children|Fantasy,Jumanji,1995
2,3,Comedy|Romance,Grumpier Old Men,1995


In [43]:
df['name'] = df['name'].str.strip()

In [44]:
movie_genre_df = df[['genres', 'name']]
movie_genre_df

Unnamed: 0,genres,name
0,Adventure|Animation|Children|Comedy|Fantasy,Toy Story
1,Adventure|Children|Fantasy,Jumanji
2,Comedy|Romance,Grumpier Old Men
3,Comedy|Drama|Romance,Waiting to Exhale
4,Comedy,Father of the Bride Part II
...,...,...
9737,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic
9738,Animation|Comedy|Fantasy,No Game No Life: Zero
9739,Drama,Flint
9740,Action|Animation,Bungo Stray Dogs: Dead Apple


In [45]:
# Split each pipe-delimited genre string and give every genre its own row
movie_genre_df = movie_genre_df.apply(lambda x: x.str.split('|').explode()).reset_index()
movie_genre_df

Unnamed: 0,index,genres,name
0,0,Adventure,Toy Story
1,0,Animation,Toy Story
2,0,Children,Toy Story
3,0,Comedy,Toy Story
4,0,Fantasy,Toy Story
...,...,...,...
22079,9738,Fantasy,No Game No Life: Zero
22080,9739,Drama,Flint
22081,9740,Action,Bungo Stray Dogs: Dead Apple
22082,9740,Animation,Bungo Stray Dogs: Dead Apple


In [46]:
movie_genre_df.drop('index', inplace=True, axis=1)

In [47]:
movie_genre_df

Unnamed: 0,genres,name
0,Adventure,Toy Story
1,Animation,Toy Story
2,Children,Toy Story
3,Comedy,Toy Story
4,Fantasy,Toy Story
...,...,...
22079,Fantasy,No Game No Life: Zero
22080,Drama,Flint
22081,Action,Bungo Stray Dogs: Dead Apple
22082,Animation,Bungo Stray Dogs: Dead Apple


In [48]:
movie_genre_df['genres'].unique()

array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'War', 'Musical', 'Documentary', 'IMAX',
       'Western', 'Film-Noir', '(no genres listed)'], dtype=object)

In [49]:
# Save the DataFrame to CSV (the index is written as an unnamed first column)
movie_genre_df.to_csv('movie_genre_df.csv')

## ✏️ Creating content-based data

As much as you might want to jump right to finding similar items and making recommendations, you first need to get your data in a usable format. In the next few exercises, you will explore your base data and work through how to format that data to be used for content-based recommendations.

As a reminder, the desired outcome is a row per movie with each column indicating whether a genre applies to the movie. You will be looking at `movie_genre_df`, which contains these columns:

- `name` - Name of movie
- `genres` - Genre that the movie has been labeled with

A movie may have multiple genres, and therefore multiple rows. In this exercise, you will focus on one movie, Toy Story, to see clearly what is happening with the data.
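
To make the "multiple rows per movie" shape concrete, here is a small sketch on a hand-built frame with the same two columns. The variable names `long_df` and `genres_per_movie` are assumptions for illustration:

```python
import pandas as pd

# Long format: one row per (movie, genre) pair.
long_df = pd.DataFrame({
    "genres": ["Adventure", "Children", "Fantasy", "Comedy", "Romance"],
    "name": ["Jumanji", "Jumanji", "Jumanji", "Grumpier Old Men", "Grumpier Old Men"],
})

# Number of rows each movie occupies in the long table.
rows_per_movie = long_df["name"].value_counts()

# Collect the genres of each movie back into a single list.
genres_per_movie = long_df.groupby("name")["genres"].apply(list)

print(rows_per_movie)
print(genres_per_movie)
```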

### How many different movies are contained in `movie_genre_df`?


In [54]:
movie_genre_df = pd.read_csv('movie_genre_df.csv', index_col=False)

In [55]:
len(movie_genre_df['name'].unique())

9412

### Get the rows in `movie_genre_df` which have a name equal to Toy Story and save this as `toy_story_genres`.

In [60]:
toy_story_genres = movie_genre_df[movie_genre_df['name'] == "Toy Story"]
toy_story_genres.head()

Unnamed: 0.1,Unnamed: 0,genres,name
0,0,Adventure,Toy Story
1,1,Animation,Toy Story
2,2,Children,Toy Story
3,3,Comedy,Toy Story
4,4,Fantasy,Toy Story
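
The `Unnamed: 0` column in this output most likely comes from the index that `to_csv` writes by default and that `read_csv` then loads as an ordinary column. A minimal sketch of the round trip, using an in-memory buffer instead of a file so that it is self-contained (the buffer and variable names are assumptions):

```python
import io
import pandas as pd

long_df = pd.DataFrame({
    "genres": ["Adventure", "Animation"],
    "name": ["Toy Story", "Toy Story"],
})

# Default behaviour: the index is saved as an unnamed first column.
buf_with_index = io.StringIO()
long_df.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# With index=False only the real columns are written.
buf_without_index = io.StringIO()
long_df.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without_index = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` when saving, or `index_col=0` when loading, keeps the reloaded table limited to `genres` and `name`.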


### Transform `movie_genre_df` to a table called `movie_cross_table`.

Assign the subset of `movie_cross_table` that contains Toy Story to the variable `toy_story_genres_ct` and inspect the results.

In [61]:
# Select only the rows with values in the name column equal to Toy Story
toy_story_genres = movie_genre_df[movie_genre_df['name'] == 'Toy Story']

# Create cross-tabulated DataFrame from the name and genres columns
movie_cross_table = pd.crosstab(movie_genre_df['name'], movie_genre_df['genres'])

# Select only the rows with Toy Story as the index
toy_story_genres_ct = movie_cross_table[movie_cross_table.index == 'Toy Story']
print(toy_story_genres_ct)

genres     (no genres listed)  Action  Adventure  Animation  Children  Comedy  \
name                                                                            
Toy Story                   0       0          1          1         1       1   

genres     Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  \
name                                                                     
Toy Story      0            0      0        1          0       0     0   

genres     Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western  
name                                                                  
Toy Story        0        0        0       0         0    0        0  
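
For comparison, a similar 0/1 genre table can be sketched directly from the pipe-delimited `genres` strings with `Series.str.get_dummies`, skipping the long format entirely. This is an alternative route rather than the method used in this exercise, and the two-row sample is built from the outputs earlier in the notebook:

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Toy Story", "Jumanji"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
    ],
})

# One indicator column per genre, with the movie name as the index.
wide = sample.set_index("name")["genres"].str.get_dummies(sep="|")

print(wide)
```

One difference to keep in mind: `pd.crosstab` groups by `name`, so movies sharing a name are merged into one row, whereas `get_dummies` keeps one row per original row.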


## Understanding the content-based data

You are now able to convert common attribute data to a DataFrame containing a row per movie, and each of its attributes as columns. You will now take a closer look at the full DataFrame you just created to see if you understand the information within.

A subset of the DataFrame you created in the last exercise has been loaded as `movie_cross_table`. As a reminder, the genres are stored as individual columns and the movie names are stored as the index.

Inspect the rows corresponding to 'Toy Story' and 'Yogi Bear' in `movie_cross_table`. How many genres do they have in common?


Possible Answers

- 0 genres in common

- 2 genres in common ✅ (*Children and Comedy*)

- 4 genres in common

- 6 genres in common



In [62]:
movie_cross_table

name,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
'71,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
'Hellboy': The Seeds of Creation,0,1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
'Round Midnight,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
'Salem's Lot,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0
'Til There Was You,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
xXx,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
xXx: State of the Union,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
¡Three Amigos!,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [65]:
selected_movies = ["Toy Story", "Yogi Bear"]
movie_cross_table[movie_cross_table.index.isin(selected_movies)]


name,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Yogi Bear,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


> Correct! Yogi Bear and Toy Story both have the 'Children' and 'Comedy' attributes. The more genres two movies have in common, the more likely it is that someone who liked one will like the other. Next, you will apply this idea at a larger scale instead of to just one pair of movies.
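
As a rough sketch of what "at a larger scale" could look like, the number of shared genres for every pair of movies can be computed in one step by multiplying the 0/1 table by its own transpose. The small table below is hand-built from rows shown earlier in this notebook; the name `shared` and the choice of a dot product are assumptions for illustration, not necessarily the approach the following exercises take:

```python
import pandas as pd

# Hand-built 0/1 genre table for three movies (subset of genre columns).
ct = pd.DataFrame(
    [
        [1, 1, 1, 1, 1],  # Toy Story
        [0, 0, 1, 1, 0],  # Yogi Bear
        [0, 0, 0, 1, 0],  # Father of the Bride Part II
    ],
    index=["Toy Story", "Yogi Bear", "Father of the Bride Part II"],
    columns=["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
)

# Entry (a, b) is the number of genres that movies a and b have in common;
# the diagonal holds each movie's own genre count.
shared = ct.dot(ct.T)

print(shared)
```

On the real `movie_cross_table`, cells can exceed 1 when several movies share a name, so clipping with `movie_cross_table.clip(upper=1)` before the product would keep the counts interpretable.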