# Building your own recommender system

Today, we will start working on building our own recommender system. For this assignment, we will work with movie data.
Download the following datasets [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata):
- `tmdb_5000_credits.csv`
- `tmdb_5000_movies.csv`

Place the files in a `data/` folder.


## Task 1. Explore and preprocess the data.

### a.  Explore the data

- As a first step, explore the datasets. Inspect what data you have at hand, and decide which variables might be interesting and which are not. Make a selection of interesting columns.
  -  Keep in mind that ultimately, you want to build a knowledge-based and a content-based recommender system. Hence, look for columns that might be suitable to use later on.
- Combine (merge) both datasets. Can you identify a variable that can be used for matching?
Think about a good way to do this. Can you write a function that will return the merged data?

- Check whether the data is ready to use, or whether you need to transform or pre-process your data somehow.
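
Before merging, it can help to check whether a candidate key is unique and how many rows actually match. The sketch below uses two small toy dataframes as stand-ins for the credits and movies files (their values are invented for illustration), and relies on the `indicator` option of `merge` to report where each row came from.

```python
import pandas as pd

# Toy stand-ins for the credits and movies tables (values are illustrative only).
credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
movies = pd.DataFrame({"id": [1, 2, 4], "title": ["A", "B", "D"]})

# Is the candidate key unique in each table?
print(credits["movie_id"].is_unique, movies["id"].is_unique)

# An outer merge with indicator=True adds a `_merge` column that says
# whether a row was found in both tables or in only one of them.
merged = credits.merge(
    movies.rename(columns={"id": "movie_id"}),
    on=["movie_id", "title"],
    how="outer",
    indicator=True,
)
match_counts = merged["_merge"].value_counts().to_dict()
print(match_counts)
```

If many rows end up as `left_only` or `right_only`, the chosen key is probably not a good matching variable.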

In [69]:
import pandas as pd
from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity

#### Read the data

<br>
<div class="alert-block alert-warning">
For now, only keep movies with frequent votes (this will keep the dataset rather small and therefore computation is faster)
</div>

If you want to work with a larger dataset, delete the following line from the function `get_data(PATH)`:

```python
data = data[data['vote_count'] > 2000]
```


   

In [235]:
PATH = 'data/'

def get_data(PATH):

    data1 = pd.read_csv(f'{PATH}tmdb_5000_credits.csv')
    data2 = pd.read_csv(f'{PATH}tmdb_5000_movies.csv')
    data2.rename(columns={'id': 'movie_id'}, inplace=True)

    data = pd.merge(data1,data2,  on=['movie_id', 'title'])
    data["original_title"] = data["original_title"].str.lower()

    # For now, only keep movies with frequent votes; this keeps the dataset
    # small and the computation fast.
    data = data[data['vote_count'] > 2000]
    data = data.reset_index(drop=True)  # renumber the rows 0..n-1 after filtering
    return data

In [174]:
data = get_data(PATH)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124



### b.  Pre-processing and feature engineering

As a first step, some data wrangling techniques are needed to get the data into the right shape.
- Think about relevant attributes of movies that you want to use later on when designing a recommender system.
- Can you convert `release_date` to a yearly-level variable?
- Can you clean up the `genres` column?

In the next code block, we will convert `release_date` to a yearly-level variable called `release_year`. In addition, we will clean up the `genres` column.

In [175]:
data['release_year'] = pd.DatetimeIndex(data['release_date']).year

def get_genres(x):
    return " ".join( [e['name'] for e in literal_eval(x)] )

data['genres'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012


Next, we will explode the `genres` column so that each row holds a single genre. The knowledge-based recommender system in the next step filters on genre, so each movie-genre combination needs to be a single observation. We will therefore transform (explode) the data to a long format:

In [176]:
s = data.apply(lambda x: pd.Series(x['genres'].split()),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'single_genre'
data = data.join(s)

data[['single_genre', 'title', 'vote_average', 'vote_count', 'release_year']].head(10)

Unnamed: 0,single_genre,title,vote_average,vote_count,release_year
0,Action,Avatar,7.2,11800,2009
0,Adventure,Avatar,7.2,11800,2009
0,Fantasy,Avatar,7.2,11800,2009
0,Science,Avatar,7.2,11800,2009
0,Fiction,Avatar,7.2,11800,2009
1,Adventure,Pirates of the Caribbean: At World's End,6.9,4500,2007
1,Fantasy,Pirates of the Caribbean: At World's End,6.9,4500,2007
1,Action,Pirates of the Caribbean: At World's End,6.9,4500,2007
2,Action,Spectre,6.3,4466,2015
2,Adventure,Spectre,6.3,4466,2015
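
Note that in the output above, "Science Fiction" ends up as two separate genres (`Science` and `Fiction`), because the genre names were joined with spaces and then split on whitespace. A minimal sketch of one possible alternative is shown below: keep the genres as a list and let `DataFrame.explode` create the long format. The helper name `get_genre_list` and the toy row are assumptions made for this illustration.

```python
from ast import literal_eval

import pandas as pd

def get_genre_list(x):
    """Return the genre names as a list, so multi-word genres stay intact."""
    return [e["name"] for e in literal_eval(x)]

# One toy row in the same format as the raw `genres` column.
toy = pd.DataFrame({
    "title": ["Avatar"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'],
})

toy["genre_list"] = toy["genres"].apply(get_genre_list)

# explode() creates one row per list element and repeats the other columns.
long_format = toy.explode("genre_list").rename(columns={"genre_list": "single_genre"})
print(long_format[["title", "single_genre"]])
```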


## Task 2.   Create a knowledge-based recommender system

Below, you will find an example of a simple knowledge-based recommender system.

First, play around a bit with the `input()` function: 

In [177]:
print("Enter whatever:")
a_test = input()

Enter whatever:
hello world


In [179]:
print(a_test)

hello world


### Example of a simple knowledge-based recommender

In [180]:
def knowledge_based_recommender(data):

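    # Work on a copy so the caller's dataframe is not modified
    # (this also avoids pandas' SettingWithCopyWarning).
    data = data[data['single_genre'].notna()].copy()
    data['single_genre'] = data['single_genre'].str.lower()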

    print(f"What type of genre do you like? \n\nYou can choose from the following:\n\n{set(data['single_genre'])}")
    genre = input().lower()

    print("What is the minimum release year of movies you are interested in? (e.g., how 'old' may a movie be?)" )

    release_year = int(input())

    movies = data[(data['single_genre'] == genre) &
    (data['release_year'] >= release_year) ]

    recommend_movies = movies.sort_values('vote_average', ascending=False)

    return recommend_movies[['title', 'vote_average', 'genres']].head(5)

### Run the knowledge-based recommender...

In [181]:
knowledge_based_recommender(data)

What type of genre do you like? 

You can choose from the following:

{'horror', 'family', 'war', 'romance', 'mystery', 'western', 'adventure', 'crime', 'fiction', 'comedy', 'fantasy', 'music', 'history', 'drama', 'thriller', 'science', 'action', 'animation'}
horror
What is the minimum release year of movies you are interested in? (e.g., how 'old' may a movie be?)
1997


Unnamed: 0,title,vote_average,genres
3597,Shaun of the Dead,7.5,Horror Comedy
2096,The Conjuring,7.4,Horror Thriller
1988,Zombieland,7.2,Comedy Horror
4291,Saw,7.2,Horror Mystery Crime
116,I Am Legend,6.9,Drama Horror Action Thriller Science Fiction


<u>Question: can you improve the scoring algorithm?</u> 
<br>
<br>
<div class="alert-info">
We can further improve this algorithm by accounting for the fact that some movies have not been rated frequently. An improved scoring algorithm, of the kind typically employed by IMDB, is described <a href=https://www.datacamp.com/community/tutorials/recommender-systems-python>here</a>.
</div>
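
As a starting point, here is a minimal sketch of the weighted rating that is commonly attributed to IMDB: the movie's own average is pulled towards the overall mean when the movie has few votes. The function name, the parameter names, and the numbers in the example are assumptions chosen for illustration.

```python
def weighted_rating(vote_average, vote_count, min_votes, mean_vote):
    """Weighted rating: (v / (v + m)) * R + (m / (v + m)) * C.

    v = number of votes for the movie, R = its average rating,
    m = minimum number of votes considered reliable (a tuning choice),
    C = mean rating across all movies.
    """
    v, R, m, C = vote_count, vote_average, min_votes, mean_vote
    return (v / (v + m)) * R + (m / (v + m)) * C

# A highly rated movie with few votes is pulled towards the overall mean...
few_votes = weighted_rating(vote_average=9.0, vote_count=10, min_votes=1000, mean_vote=6.0)
# ...while a movie with many votes keeps a score close to its own average.
many_votes = weighted_rating(vote_average=8.0, vote_count=10000, min_votes=1000, mean_vote=6.0)
print(round(few_votes, 2), round(many_votes, 2))
```

Because the function only uses arithmetic, it should also work on whole columns, for example with `data['vote_average']` and `data['vote_count']` as inputs. The choice of `min_votes` is up to you; a quantile of `vote_count` is one option.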




<u>Question: can you improve this recommender?</u> 
<br>
<br>
<div class="alert-info">
Can you improve the instructions for users, or give hints for valid answers (such as the range of values available in the dataset)?
</div>
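
One possible direction is sketched below: a small helper that shows the valid range and keeps asking until the answer is usable. The helper `ask_int`, its `read` parameter (which makes it possible to simulate user input), and the range 1970 to 2016 are assumptions for this illustration; in the recommender you could take the range from `data['release_year'].min()` and `data['release_year'].max()`.

```python
def ask_int(prompt, low, high, read=input):
    """Keep asking until the user enters a whole number between low and high."""
    while True:
        answer = read(f"{prompt} ({low}-{high}): ")
        try:
            value = int(answer)
        except ValueError:
            print(f"'{answer}' is not a whole number, please try again.")
            continue
        if low <= value <= high:
            return value
        print(f"Please choose a value between {low} and {high}.")

# Simulated user answers, so the example runs without typing anything.
answers = iter(["abc", "1800", "1997"])
year = ask_int("Minimum release year", 1970, 2016, read=lambda _: next(answers))
print(year)
```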




<u>Question: can you make this recommender more elaborate?</u> 
<br>
<br>
<div class="alert-info">
Can you make this recommender more elaborate, taking in different types of user-generated input?
</div>


## Task 3. Create a content-based recommender system using cosine similarity

In [182]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


For this task, we go back to the dataset in its original format (that is, before exploding the data to a long format).

In [183]:
data = get_data(PATH)
data['release_year'] = pd.DatetimeIndex(data['release_date']).year
data['genres'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012


### a. Create a combined feature column.
Create a combined feature column. More specifically, combine the (textual) data of several columns into a single column that can be used later on.

<u>Question: Create your own combined feature column</u> 
<br>
<br>
<div class="alert-info">
Can you adjust this, to make your own combined feature column? Are there different columns that might be interesting to consider here? Do you need to preprocess them?
</div>
<br>
<div class="alert-block alert-warning">
Make sure you are making an  <b>informed decision</b> about merging these columns together. Why do you think this is a good idea, when designing a recommender system?
</div>



If you want to 'glue' several columns with textual data together in `pandas`, you can do something like this:

```python
data['combined'] = data[['genres', 'overview']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
```




In [184]:
def combine_features(data): 
    data['combined_features'] = data[['original_title', 'genres', 'overview', 'tagline']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return data

In [185]:
data = combine_features(data)
data.head()


Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year,combined_features
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009,"avatar,Action Adventure Fantasy Science Fictio..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007,"pirates of the caribbean: at world's end,Adven..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015,"spectre,Action Adventure Crime,A cryptic messa..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012,"the dark knight rises,Action Crime Drama Thril..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012,"john carter,Action Adventure Science Fiction,J..."


### b. Transform your data

Think about a strategy for transforming your combined data column, as designed in the previous step. More specifically, `fit_transform` the combined data column using `tfidf` or `count` vectorizer.


<u>Question: can you adjust settings of the vectorizer?</u> 
<br>
<br>
<div class="alert-info">
When initializing the vectorizer, think about some of the settings we've discussed in earlier weeks. Do you, for example, want to remove stopwords manually, or use pruning?
</div>
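
To make the question concrete, the sketch below shows a few `TfidfVectorizer` settings that control stopword removal and pruning. The toy documents and the specific threshold values are assumptions for illustration; suitable values depend on your data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the combined feature column.
docs = [
    "a hero saves the city",
    "a hero saves the world",
    "a spy saves the world",
    "a detective solves a crime in the city",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop common English function words
    min_df=2,              # prune terms that occur in fewer than 2 documents
    max_df=0.9,            # prune terms that occur in more than 90% of documents
    ngram_range=(1, 2),    # use single words and two-word phrases
)
matrix = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(matrix.shape)
```

Inspecting the vocabulary after fitting is a quick way to see which terms survive the pruning.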

In [186]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data['combined_features'])


In [207]:
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year,combined_features
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009,"avatar,Action Adventure Fantasy Science Fictio..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007,"pirates of the caribbean: at world's end,Adven..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015,"spectre,Action Adventure Crime,A cryptic messa..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012,"the dark knight rises,Action Crime Drama Thril..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012,"john carter,Action Adventure Science Fiction,J..."


### c. Calculate cosine similarity

Using the vectorized (sparse) matrix, calculate cosine similarity.
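
As a reminder of what this measure computes: cosine similarity is the dot product of two vectors divided by the product of their lengths. The small sketch below computes it by hand with `numpy` on toy vectors; the function name `cosine` and the vectors are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors that point in the same direction score (approximately) 1,
# regardless of their length.
same_direction = cosine([1, 2, 0], [2, 4, 0])
# Vectors that share no terms score 0.
orthogonal = cosine([1, 0, 0], [0, 1, 0])
print(same_direction, orthogonal)
```

This is also why every movie has a similarity of (approximately) 1 with itself in the matrix of cosine scores.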

In [188]:
cosine_sim = cosine_similarity(tfidf_matrix)


### d. Find the most similar movies

How can we, based on the matrix of cosine scores, find the most similar movie?

As an example, let's look up the movies that are most similar to `the dark knight rises`. This movie is at index 3 in our dataframe, which we know because we looked it up manually. We can get the similarity scores between this movie and all other movies in our dataset simply by printing:

In [189]:
cosine_sim[3]

array([0.02427704, 0.00282487, 0.00851024, 1.        , 0.0102412 ,
       0.00706375, 0.01776274, 0.02671467, 0.04557697, 0.13282578,
       0.00943453, 0.00278821, 0.00167139, 0.00275856, 0.01690524,
       0.001906  , 0.00869027, 0.0181771 , 0.00745184, 0.        ,
       0.02062451, 0.02553921, 0.02017573, 0.00441297, 0.01389112,
       0.02893029, 0.0233559 , 0.02935539, 0.01470082, 0.00733287,
       0.        , 0.00273384, 0.03670128, 0.        , 0.02978632,
       0.01141878, 0.        , 0.01942972, 0.0081568 , 0.00550505,
       0.00357409, 0.01453257, 0.00238474, 0.0137226 , 0.03369991,
       0.00271323, 0.03252951, 0.00627501, 0.05257956, 0.01920857,
       0.00131995, 0.04208555, 0.01169546, 0.00147419, 0.0025743 ,
       0.02931331, 0.00583151, 0.        , 0.32070174, 0.00566778,
       0.00277407, 0.00188154, 0.00774397, 0.00200139, 0.0215999 ,
       0.00377077, 0.00651166, 0.02654705, 0.01219126, 0.0101658 ,
       0.02874627, 0.        , 0.00250414, 0.00766731, 0.03888

There are, however, more systematic ways of retrieving an index value. 
For example, we can create a Series that maps each title to its index, and simply look our movie up:

In [190]:
indices = pd.Series(data.index, index = data['original_title'])
index = indices['the dark knight rises']
print(index)


3


Now that we have the index value, we need to sort the associated vector of cosine similarity scores to get the highest values.

In order to sort the cosine scores while keeping track of the entries to which these scores belong, we map the movie index to the cosine value. We can do so by simply enumerating the cosine scores:

In [191]:
sim_scores = list(enumerate(cosine_sim[index])) 
sim_scores

[(0, 0.024277042319228448),
 (1, 0.002824866577723068),
 (2, 0.008510237920055208),
 (3, 0.9999999999999999),
 (4, 0.01024120468586011),
 (5, 0.007063745060022694),
 (6, 0.01776274135961471),
 (7, 0.02671467486274916),
 (8, 0.04557697145861502),
 (9, 0.13282577950620048),
 (10, 0.009434525488526165),
 (11, 0.0027882139582398966),
 (12, 0.0016713925918207888),
 (13, 0.0027585552953611346),
 (14, 0.016905237662794823),
 (15, 0.0019060005201172792),
 (16, 0.008690269930241908),
 (17, 0.018177098280360432),
 (18, 0.007451840135341565),
 (19, 0.0),
 (20, 0.02062451301459884),
 (21, 0.025539208245505592),
 (22, 0.020175734274512164),
 (23, 0.004412974562837318),
 (24, 0.013891122271350824),
 (25, 0.028930290435817227),
 (26, 0.02335590096642317),
 (27, 0.029355392310509196),
 (28, 0.014700816920407577),
 (29, 0.007332867723416884),
 (30, 0.0),
 (31, 0.002733843092024661),
 (32, 0.03670127645541509),
 (33, 0.0),
 (34, 0.02978632117622636),
 (35, 0.011418778708855755),
 (36, 0.0),
 (37, 0.0194

Now, we can sort this list on the basis of the cosine scores, while keeping track of the indices.

In [192]:
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores

[(3, 0.9999999999999999),
 (58, 0.3207017376449523),
 (311, 0.2938272593271538),
 (95, 0.22080725229371226),
 (9, 0.13282577950620048),
 (300, 0.09778533939138273),
 (326, 0.07936901215666946),
 (220, 0.0719767027579663),
 (155, 0.06618894694863214),
 (266, 0.06582417539666552),
 (288, 0.06444986487286353),
 (318, 0.0634982782720183),
 (92, 0.06212708512955022),
 (431, 0.06048616783331664),
 (198, 0.05992973146218572),
 (114, 0.05584773654084899),
 (48, 0.05257956476386228),
 (171, 0.051820073436548514),
 (246, 0.049709004283587946),
 (141, 0.04933059148605068),
 (356, 0.0491197856376579),
 (128, 0.04893281330647687),
 (316, 0.048460329490873824),
 (410, 0.047195210850759395),
 (116, 0.04650408473994224),
 (244, 0.04634905630529929),
 (8, 0.04557697145861502),
 (262, 0.04364205052097604),
 (258, 0.043614071439744294),
 (106, 0.0431534618789252),
 (111, 0.04285066889531842),
 (427, 0.04271023641153203),
 (51, 0.04208555377100488),
 (325, 0.04173819263808157),
 (185, 0.0394782518944081),

We are only interested in the most similar ones, so we keep the top 10. We skip the first entry, because that is the movie itself:

In [193]:
sim_scores = sim_scores[1:11]
sim_scores

[(58, 0.3207017376449523),
 (311, 0.2938272593271538),
 (95, 0.22080725229371226),
 (9, 0.13282577950620048),
 (300, 0.09778533939138273),
 (326, 0.07936901215666946),
 (220, 0.0719767027579663),
 (155, 0.06618894694863214),
 (266, 0.06582417539666552),
 (288, 0.06444986487286353)]

To which movies do these scores belong? Let's map the index values back to the dataframe and find them.

In [194]:
movie_indices = [i[0] for i in sim_scores]
movie_indices

[58, 311, 95, 9, 300, 326, 220, 155, 266, 288]

Now we can simply look those movies up in our dataframe:

In [195]:
data.iloc[movie_indices]['title']

65                         The Dark Knight
1359                                Batman
119                          Batman Begins
9       Batman v Superman: Dawn of Justice
1177                              Sin City
1554                            District 9
449                        The Book of Eli
238           Teenage Mutant Ninja Turtles
790                        American Sniper
1051                             Prisoners
Name: title, dtype: object

### e. Putting the pieces together: Content-based recommender system using cosine similarity

In [196]:
def transform_data(data):

    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(data['combined_features'])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    return cosine_sim

def recommender(movie_title, data, transformed_data):

    indices = pd.Series(data.index, index = data['original_title'])
    index = indices[movie_title]

    # Pair each score with its row position and sort from most to least similar.
    sim_scores = list(enumerate(transformed_data[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # skip the first entry: the movie itself

    movie_indices = [i[0] for i in sim_scores]

    # Collect the id, title and genres of the recommended movies.
    recommendation = data.iloc[movie_indices][['movie_id', 'original_title', 'genres']]
    recommendation = recommendation.rename(columns={'movie_id': 'Id', 'original_title': 'title'})

    return recommendation

def results(insert_movie_name):
    insert_movie_name = insert_movie_name.lower()

    # Note: get_genres() is not applied here, so `genres` holds the raw JSON strings.
    data = get_data(PATH)
    data_with_combined_features = combine_features(data)
    transformed_data = transform_data(data_with_combined_features)

    if insert_movie_name not in data['original_title'].unique():
        return 'Movie not in Database'

    else:
        recommendations = recommender(insert_movie_name, data_with_combined_features, transformed_data)
        return recommendations.to_dict('records')


Now, try it out. You can get recommendations by simply calling the `results()` function:

In [197]:
results('Avatar')


[{'Id': 603,
  'title': 'the matrix',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 68721,
  'title': 'iron man 3',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 1726,
  'title': 'iron man',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 601,
  'title': 'e.t. the extra-terrestrial',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}, {"id": 10751, "name": "Family"}, {"id": 14, "name": "Fantasy"}]'},
 {'Id': 18,
  'title': 'the fifth element',
  'genres': '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 72190,
  'title': 'world war z',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}, {"id": 27

<u>Question: can you improve this recommender?</u> 
<br>
<br>
<div class="alert-info">
Think about alternative ways to find relevant movies. Consider, for example, which features you want to include.
</div>

<u>Question: adjust the (amount of) information returned to the user</u> 
<br>
<br>
<div class="alert-info">
What information do you want to return to the user?
</div>

## Task 4. Content-based recommender system using Soft Cosine Similarity

By now, you may wonder whether we can improve our recommender system by using a "better" similarity measure, such as soft cosine similarity. Let's try it out.

For the next steps, we largely rely on the code we've used for week 4 as well, so go back to the slides and exercises of that week if you are confused. 
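
As a brief refresher, the soft cosine measure extends cosine similarity with a term similarity matrix `S`, so that documents using different but related words can still be similar. The sketch below computes it by hand on a toy vocabulary; the function name `soft_cosine`, the vocabulary, and the similarity value 0.8 are assumptions for illustration. In the code that follows, word embeddings are used to fill such a matrix.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine: (a S b) / sqrt((a S a) * (b S b))."""
    a, b, S = (np.asarray(x, dtype=float) for x in (a, b, S))
    return float((a @ S @ b) / np.sqrt((a @ S @ a) * (b @ S @ b)))

# Toy vocabulary: ["film", "movie", "car"].
# We assume "film" and "movie" have a term similarity of 0.8.
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

doc_a = [1, 0, 0]  # contains only "film"
doc_b = [0, 1, 0]  # contains only "movie"

# With the identity matrix, soft cosine reduces to regular cosine similarity.
plain = soft_cosine(doc_a, doc_b, np.eye(3))
soft = soft_cosine(doc_a, doc_b, S)
print(plain, soft)
```

Regular cosine similarity treats the two documents as unrelated because they share no words, whereas the soft version gives them credit for the related terms.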

In [210]:
import gensim
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
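from gensim.models import TfidfModel
from gensim.models import WordEmbeddingSimilarityIndex  # used below; in gensim 4.x this class lives in gensim.similarities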
from gensim.similarities import SparseTermSimilarityMatrix
print(gensim.__version__)

from gensim.corpora import Dictionary
import numpy as np

3.8.3


In [129]:
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')

In [130]:
# Initialize a Dictionary. This step assigns a token_id to each word.
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in data['combined_features']])
# Convert each document in `data_small` into a bag-of-words vector
bag_of_words_vectors = [dictionary.doc2bow(simple_preprocess(doc)) for doc in data_small['combined_features']]

In [133]:
# This step also takes quite a while...
similarity_index = WordEmbeddingSimilarityIndex(fasttext_model300)
# Build a term similarity matrix, which is needed to compute the Soft Cosine Measure
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)

Next, let's create a matrix of soft-cosine similarity scores.

<div class="alert-danger">
Only run the following code on a dataset that is not too large, otherwise it will be very slow...
</div>


In [204]:
def create_soft_cossim_matrix(documents):
    # reference: https://www.machinelearningplus.com/nlp/cosine-similarity/
    n = len(documents)
    # Entry (i, j) is the soft cosine similarity between documents i and j, rounded to 2 decimals
    cossim_mat = pd.DataFrame(
        [[round(similarity_matrix.inner_product(documents[i], documents[j], normalized=True), 2)
          for j in range(n)]
         for i in range(n)]
    )
    return cossim_mat

In [212]:
transformed_data = create_soft_cossim_matrix(bag_of_words_vectors)

29it [12:25, 25.70s/it]
298it [12:18,  2.48s/it]  
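
If the computation above is too slow, one option is to compute each pair of documents only once. The sketch below assumes the similarity measure is symmetric, so that the score for (i, j) equals the score for (j, i). The function `symmetric_similarity_matrix` and the toy Jaccard similarity are hypothetical stand-ins for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def symmetric_similarity_matrix(documents, similarity):
    """Fill an n x n matrix by calling `similarity` once per unordered pair."""
    n = len(documents)
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # only the upper triangle, including the diagonal
            score = round(similarity(documents[i], documents[j]), 2)
            scores[i, j] = score
            scores[j, i] = score  # mirror into the lower triangle
    return pd.DataFrame(scores)

# Stand-in similarity for demonstration only: Jaccard overlap between token sets
def toy_similarity(a, b):
    return len(a & b) / len(a | b)

toy_docs = [{"space", "alien"}, {"space", "war"}, {"love", "war"}]
toy_matrix = symmetric_similarity_matrix(toy_docs, toy_similarity)
print(toy_matrix)
```

With the objects from this notebook, the `similarity` argument could be something like `lambda a, b: similarity_matrix.inner_product(a, b, normalized=True)`, which roughly halves the number of calls.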


Just like we did with the cosine similarity scores, enumerate the similarity scores, so we can sort them while maintaining information regarding their index (i.e., the movie they belong to).

In [216]:
sim_scores = list(enumerate(transformed_data[index].to_list() ))

The remainder of the code is the same as we've also used for cosine similarity: sort the values, keep the top 10 most similar ones, and retrieve the associated titles... 

In [218]:
#sim_scores = list(enumerate(cosine_sim[index])) 
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:11]
movie_indices = [i[0] for i in sim_scores]
data_small.iloc[movie_indices]['title']

65                                    The Dark Knight
119                                     Batman Begins
1359                                           Batman
160                        How to Train Your Dragon 2
329     The Lord of the Rings: The Return of the King
19          The Hobbit: The Battle of the Five Armies
48                              Jack the Giant Slayer
212                            The Day After Tomorrow
229      Star Wars: Episode III - Revenge of the Sith
274                                         Gladiator
Name: title, dtype: object
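
The enumerate, sort and slice steps used above can be followed more easily on a toy list. The scores below are made up for illustration; only the pattern matches the notebook code.

```python
# Hypothetical similarity scores of five movies to the movie at position 0
scores = [1.0, 0.2, 0.9, 0.5, 0.7]

# Pair every score with its position so the position survives sorting
sim_scores = list(enumerate(scores))
sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)

# Skip the first entry (the movie itself, with score 1.0) and keep the next three
top = sim_scores[1:4]
movie_indices = [position for position, _ in top]
print(movie_indices)
```

Note that skipping the first entry assumes the movie itself ends up in first place after sorting. If another movie also has a score of 1.0, that assumption may not hold.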

<u>Question: What do you think of the quality of these recommendations?</u> 
<br>
<br>
<div class="alert-info">
Create recommendations for different movies. How do these results compare to the recommendations based on regular cosine similarity? Which one performs better, do you think?
</div>

### Putting the pieces together: building a content-based recommender system based on soft-cosine similarity

In [236]:

def recommender(movie_title, data, transformed_data):

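    # Map each title to its row *position*: the rows and columns of `transformed_data`
    # are numbered 0..n-1, and the `.iloc` lookups below are positional as well
    indices = pd.Series(range(len(data)), index=data['original_title'])
    index = indices[movie_title]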

    sim_scores = list(enumerate(transformed_data[index].to_list() ))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]

    movie_indices = [i[0] for i in sim_scores]

    movie_id = data['movie_id'].iloc[movie_indices]

    movie_title = data['original_title'].iloc[movie_indices]
    movie_genres = data['genres'].iloc[movie_indices]

    recommendation = pd.DataFrame(columns=['Id','title', 'genres'])

    recommendation['Id'] = movie_id
    recommendation['title'] = movie_title
    recommendation['genres'] = movie_genres

    return recommendation

def results(insert_movie_name, transformed_data):
    insert_movie_name = insert_movie_name.lower()

    # Load the data and combine the selected features
    data = combine_features(get_data(PATH))
    
    if insert_movie_name not in data['original_title'].unique():
        return 'Movie not in Database'

    else:
        recommendations = recommender(insert_movie_name, data, transformed_data)
        return recommendations.to_dict('records')


# Recap of the steps needed to build `transformed_data`, in order of execution:
#fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
#dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in data['combined_features']])
#bag_of_words_vectors = [ dictionary.doc2bow(simple_preprocess(doc)) for doc in data['combined_features']]
#similarity_index = WordEmbeddingSimilarityIndex(fasttext_model300)
#similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)
#transformed_data = create_soft_cossim_matrix(bag_of_words_vectors)

In [237]:
results('Saw', transformed_data)

[{'Id': 70160,
  'title': 'the hunger games',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}]'},
 {'Id': 81005,
  'title': 'jack the giant slayer',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 10751, "name": "Family"}, {"id": 14, "name": "Fantasy"}]'},
 {'Id': 82702,
  'title': 'how to train your dragon 2',
  'genres': '[{"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}, {"id": 10751, "name": "Family"}]'},
 {'Id': 98,
  'title': 'gladiator',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 203801,
  'title': 'the man from u.n.c.l.e.',
  'genres': '[{"id": 35, "name": "Comedy"}, {"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 76341,
  'title': 'mad max: fury road',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 

<u>Question: Can you improve this recommender?</u> 
<br>
<br>
<div class="alert-info">
Can you adjust this recommender? Can you, for example, think of alternative ways to select and combine features?
</div>