<a href="https://colab.research.google.com/github/badrishdavey/datascience_lab/blob/master/Assignment_1_Recommendation_System_using_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendation System

## What is a Recommendation System?

Very simply put, as humans we can make decisions based on three mechanisms:

1. Explore everything
2. Get expert opinion
3. Innovate

Let's take a more specific example. Say we want to purchase something (clothes, household items, a movie, a car, etc.). We are left with two of the options above:

1. Explore all items available
2. Get someone to suggest items

In today's life, we rarely have the time, patience and energy to go out and try out all available products. Therefore the seller actively provides us with a few suggested products.

The process of suggesting a few products (possibly just one) from a larger list of products is called recommendation.

## How do we recommend?

We can divide the recommendation systems into 2 major categories:

1. Content Based Filters
2. Collaborative Filters

### Content Based Filtering

Content based filters are a more traditional approach to the recommendation problem. They describe each item by a set of properties. Some examples are:

1. Genre of a movie
2. Author of a book
3. RAM in a computer
4. Class of a car
5. Actor/Director/Story writer of a movie

These properties are used to filter the content for you. The user could select a genre and see movies in that genre, or a user can select a movie and other movies of the same genre will be made available.
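As a minimal sketch of this idea, assume a small hand-made catalogue. The titles, the `catalogue` frame and the `same_genre` helper are made up for illustration and are not part of the MovieLens data. Filtering by a shared property might then look like this:

```python
import pandas as pd

# Hypothetical toy catalogue; each item carries one property (its genre).
catalogue = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genre': ['Comedy', 'Drama', 'Comedy', 'Horror'],
})

def same_genre(df, title):
    """Return the other titles that share the genre of `title`."""
    genre = df.loc[df['title'] == title, 'genre'].iloc[0]
    matches = df[(df['genre'] == genre) & (df['title'] != title)]
    return matches['title'].tolist()

print(same_genre(catalogue, 'Movie A'))
```

The same pattern works for any property column (author, RAM, car class), as long as the property is stored per item.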

### Collaborative Filtering

This is the more 'data-age' solution of the recommendation problem. This treats the recommendation problem as a matching problem of a different type. There are 2 popular forms of collaborative filtering:

1. User-User Collaborative Filtering
2. Item-Item Collaborative Filtering

In Assignment 2 we will explore the User-User form of collaborative filtering. Often, when people say Collaborative Filtering, they simply mean the User-User type.

In this case, instead of matching items (or products), we match users. This simply means identifying people with similar tastes and recommending one person's favorites to the other, and vice versa. That's the core idea here.
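To make the user-matching idea concrete, here is a rough sketch on a made-up ratings matrix. The numbers, the `cosine_sim` helper and the choice of cosine similarity as the notion of "similar taste" are assumptions for illustration only; Assignment 2 may do this differently.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 0],   # user 0 (the one we recommend for)
    [5, 5, 4, 0],   # user 1 (similar taste to user 0)
    [0, 1, 0, 5],   # user 2 (different taste)
], dtype=float)

def cosine_sim(u, v):
    """Cosine of the angle between two rating vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0
others = [u for u in range(len(ratings)) if u != target]
# Pick the user whose rating vector points in the most similar direction.
nearest = max(others, key=lambda u: cosine_sim(ratings[target], ratings[u]))

# Suggest items the nearest user rated but the target has not seen yet.
unseen = np.where((ratings[target] == 0) & (ratings[nearest] > 0))[0]
print(nearest, unseen.tolist())
```

Here user 1 is the closest match for user 0, so the one item user 1 rated and user 0 has not seen becomes the suggestion.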

## Let's look at the data

The data we will use is the MovieLens dataset. It can be downloaded from [MovieLens Data](https://grouplens.org/datasets/movielens/latest/).

For the purpose of this assignment we will use the small dataset, but you can apply the same logic to the large dataset.

Let's read the [README](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html) of the dataset.

In [0]:
from IPython.display import IFrame
base_url = 'http://files.grouplens.org/datasets/movielens'
url = '{}/ml-latest-small-README.html'.format(base_url)
IFrame(url, width=600, height=400)

Let's download the dataset now:

In [0]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

--2019-07-22 02:34:30--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2019-07-22 02:34:30 (4.12 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



Let's check the files in the dataset and read the top few lines from them:

In [0]:
from zipfile import ZipFile
zipname = 'ml-latest-small.zip'
with ZipFile(zipname) as zf:
    for filename in zf.namelist():
        if filename[-1] != '/':
            print('\nFile Discovered: {}'.format(filename))
            file = zf.open(filename)
            for i in range(5):
                print(str(file.readline()))


File Discovered: ml-latest-small/links.csv
b'movieId,imdbId,tmdbId\r\n'
b'1,0114709,862\r\n'
b'2,0113497,8844\r\n'
b'3,0113228,15602\r\n'
b'4,0114885,31357\r\n'

File Discovered: ml-latest-small/tags.csv
b'userId,movieId,tag,timestamp\r\n'
b'2,60756,funny,1445714994\r\n'
b'2,60756,Highly quotable,1445714996\r\n'
b'2,60756,will ferrell,1445714992\r\n'
b'2,89774,Boxing story,1445715207\r\n'

File Discovered: ml-latest-small/ratings.csv
b'userId,movieId,rating,timestamp\r\n'
b'1,1,4.0,964982703\r\n'
b'1,3,4.0,964981247\r\n'
b'1,6,4.0,964982224\r\n'
b'1,47,5.0,964983815\r\n'

File Discovered: ml-latest-small/README.txt
b'Summary\n'
b'\n'
b'This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 

Time to build proper DataFrames. There are two ways to do this: extract the zip file and load the extracted files into pandas, or open the files inside the zip archive and pass them straight to pandas. We will use the second approach here. The links file serves as an example.
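For comparison, both ways of loading can be sketched side by side. To keep the sketch runnable without the real download, it builds a tiny stand-in archive in memory; the `demo/links.csv` member and its two rows are made up for illustration.

```python
import io
import tempfile
from pathlib import Path
from zipfile import ZipFile

import pandas as pd

# Build a tiny stand-in archive so the sketch runs without the real download.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('demo/links.csv', 'movieId,imdbId\n1,114709\n2,113497\n')
buffer.seek(0)

# Way 1: extract to disk first, then read the extracted file.
with tempfile.TemporaryDirectory() as tmp:
    with ZipFile(buffer) as zf:
        zf.extractall(tmp)
    extracted_df = pd.read_csv(Path(tmp) / 'demo' / 'links.csv')

# Way 2: hand pandas the file object from inside the archive.
buffer.seek(0)
with ZipFile(buffer) as zf:
    with zf.open('demo/links.csv') as member:
        direct_df = pd.read_csv(member)

print(extracted_df.equals(direct_df))
```

Both routes should yield the same DataFrame; reading directly from the archive simply avoids leaving extracted files on disk.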

In [0]:
import pandas as pd

In [0]:
links_df = None
with ZipFile(zipname) as zf:
    links_df = pd.read_csv(zf.open('ml-latest-small/links.csv'))
print(links_df.info())
links_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId    9742 non-null int64
imdbId     9742 non-null int64
tmdbId     9734 non-null float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB
None


   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


Now perform the same action to extract the data from the files:

- ml-latest-small/tags.csv
- ml-latest-small/ratings.csv
- ml-latest-small/movies.csv

into the variables:

- tags_df
- ratings_df
- movies_df

respectively.

<!--
tags_df = None
ratings_df = None
movies_df = None
with ZipFile(zipname) as zf:
    tags_df = pd.read_csv(zf.open('ml-latest-small/tags.csv'))
    ratings_df = pd.read_csv(zf.open('ml-latest-small/ratings.csv'))
    movies_df = pd.read_csv(zf.open('ml-latest-small/movies.csv'))
print('Tags Data\n=============')
print(tags_df.info())
print(tags_df.head())
print('Ratings Data\n=============')
print(ratings_df.info())
print(ratings_df.head())
print('Movies Data\n=============')
print(movies_df.info())
print(movies_df.head())
-->

Double click on this cell to see the solution.

Tags Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
userId       3683 non-null int64
movieId      3683 non-null int64
tag          3683 non-null object
timestamp    3683 non-null int64
dtypes: int64(3), object(1)
memory usage: 115.2+ KB
None
   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200
Ratings Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
userId       100836 non-null int64
movieId      100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
None
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1 

While we are at it, let's also read the README inside this zip file.

In [0]:
with ZipFile(zipname) as zf:
    readme = zf.open('ml-latest-small/README.txt').read()
    print(readme.decode('utf-8'))
    ## decode('utf-8') is used because the original string
    ## is a byte string. decode converts it into a Python string.
    ## Amongst other things, this allows Python to display
    ## \n and \r as new line characters for proper
    ## formatting.

Let's first build a simple content based recommendation. Since we already have genre data, let's just `wrangle` it into a usable format.

In [0]:
movies_df.head()

Let's split the genres data into multiple columns, one per genre. We must first split the data by the character `|` and then convert the result into multiple columns, one for each genre.

We can then put a 1 for each genre the movie belongs to and a 0 for every other genre. It should look like this:

In [0]:
pd.DataFrame({
    'movieId': 1,
    'title': 'Toy Story (1995)',
    'genre_adventure': 1,
    'genre_animation': 1,
    'genre_action': 0,
    'genre_children': 1,
    'genre_comedy': 1,
    'genre_drama': 0,
    'genre_fantasy': 1,
    'genre_romance': 0
}, index=[1])

First create a column `genre_list` which is the `genres` column after splitting by `|`.

<!--
movies_df['genre_list'] = movies_df['genres'].str.split('|')
movies_df.head()
-->

Let's find the list of all genres. To do that, we convert the column into a list of lists. We will use the [values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html) property of the DataFrame or of the [column within the DataFrame](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.Series.values.html).

<!--
movies_df.genre_list.values
-->

Since this will return an array, we will wrap this in a `list` function.

<!--
list(movies_df.genre_list.values)
-->

Now that we know how to extract this as a list, let's make a set with all the genre values. We will initialise the set by using the function `set()`:

In [0]:
genres = set()

This creates an empty set. Now, taking one movie at a time (hint: use a for loop), we will find the [union](https://docs.python.org/2/library/sets.html#set-objects) of all the genre_lists.
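The union-in-a-loop pattern can be tried out on unrelated toy data first. The `tag_lists` values below are invented for illustration and have nothing to do with the movie genres.

```python
# Hypothetical lists of tags, one list per item.
tag_lists = [['red', 'round'], ['round', 'sweet'], ['red']]

all_tags = set()
for tags in tag_lists:
    # union() returns a new set, so the result has to be assigned back.
    all_tags = all_tags.union(tags)

print(sorted(all_tags))
```

Duplicates collapse automatically because a set keeps each value only once.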

<!--
for genre in list(movies_df.genre_list.values):
    genres = genres.union(genre)
genres
-->

Now that we have the list of genres, let's add these as columns to the DataFrame with a default value of 0. Here's how you do it:

In [0]:
movies_df['New Column'] = 0
column_name = 'Another New Column'
movies_df[column_name] = 0
movies_df.head()

Now you should loop over all the genres and create a column name as: `genre_{}` where you replace the `{}` with the lowercase of the genre name.

<!--
for col in genres:
    column_name = 'genre_{}'.format(col.lower())
    movies_df[column_name] = 0
movies_df.head()
-->

Great work! Now let's encode the genre_list as a `1` for each genre our movie is a part of. This time we will use the `apply` function. But before we can use the `apply` function, we must write a function that handles a single row:

<!--
def fill_genres_as_1_into_columns(row):
    genre_list = row['genre_list']
    for genre in genre_list:
        col = 'genre_{}'.format(genre.lower())
        row[col] = 1
    return(row)
movies_df = movies_df.apply(fill_genres_as_1_into_columns, axis=1)
-->

In [0]:
def fill_genres_as_1_into_columns(row):
    # Write code to first extract the genre_list column into genre_list variable
    genre_list = 
    print(genre_list) ## Remove this once you are sure you are able to extract the correct values
    # Now loop over all the genres. Use the looping variable as genre and the list as genre_list
    for genre in genre_list:
        # Now create the column name you want to update into variable col
        col = 
        print(col)  # Remove this once you have validated you are getting correct values
        # finally update the column in row as 1
        
    return(row)
movies_df = movies_df.apply(fill_genres_as_1_into_columns, axis=1)


In [0]:
movies_df.head()

Let's do a little bit of cleanup: remove columns we no longer need and free up some display space along with some memory.

In [0]:
movies_df = movies_df.drop(columns=['genres', 'genre_list', 'New Column', 'Another New Column', 'genre_(no genres listed)'])

This way of providing one column per value of a categorical variable is called `one-hot encoding`. This `encoding` is used very frequently in machine learning.
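For reference, pandas can also produce this kind of encoding in a single call with `Series.str.get_dummies`. The sketch below uses a made-up two-row frame (`toy`) rather than the real `movies_df`, and is an alternative to the step-by-step approach above, not a replacement for the exercise.

```python
import pandas as pd

# Hypothetical two-row frame using the same '|' separator as movies.csv.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'genres': ['Comedy|Romance', 'Comedy'],
})

# str.get_dummies splits on the separator and one-hot encodes in one call.
encoded = toy['genres'].str.get_dummies(sep='|')
encoded.columns = ['genre_{}'.format(c.lower()) for c in encoded.columns]
toy_encoded = pd.concat([toy[['title']], encoded], axis=1)
print(toy_encoded)
```

Each distinct genre becomes its own 0/1 column, which matches the layout built by hand in the cells above.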

Now let's select a movie and find movies similar to it.

In [0]:
selected_movie = 'Jumanji (1995)'

Let's find the row corresponding to this movie. Use the `where`-style filtering structure we have seen before.

<!--
movie = movies_df[movies_df.title == selected_movie]
movie
-->

Now let's `apply` a function over each row of movies and measure how much each movie differs from the selected one. The simplest way is to count the

- number of genres both movies belong to or both don't belong to
- number of genres the selected movie belongs to but the other movie does not
- number of genres the other movie belongs to but the selected movie does not

Create a function to return the id, name, n_both, n_selected, n_other.
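The three counts can be checked by hand on a tiny example. The two genre vectors below are invented for illustration; they use the same subtraction convention as the exercise (other movie minus selected movie).

```python
import numpy as np

# Hypothetical one-hot genre vectors over the same four genres.
selected = np.array([1, 1, 0, 0])  # the movie the user picked
other = np.array([1, 0, 1, 0])     # a candidate movie

diff = other - selected
n_both = int((diff == 0).sum())       # same value in both (shared or both absent)
n_selected = int((diff == -1).sum())  # only the selected movie has the genre
n_other = int((diff == 1).sum())      # only the candidate movie has the genre
print(n_both, n_selected, n_other)
```

The three counts always add up to the number of genre columns, which is a quick sanity check for your own function.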

<!--
cols = [col for col in movies_df.columns if col[:6] == 'genre_']
def find_similar_genre_count(row):
    diff = row[cols] - movie[cols]
    n_both = (diff == 0).values.sum()
    n_selected = (diff == -1).values.sum() 
    n_other = (diff == 1).values.sum()
    return(pd.Series({
        'movieId': row['movieId'],
        'title': row['title'],
        'both': n_both,
        'selected': n_selected,
        'other': n_other
    }))
similarity_df = movies_df.apply(find_similar_genre_count, axis=1)
-->

In [0]:
# select the genre columns (column names start with 'genre_')
cols = 
def find_similar_genre_count(row):
    # Subtracting the values for the present and the selected movies
    diff = row[cols] - movie[cols]
    # Count how many 0s: This is the number of cases where both have the same value
    # This means either both movies belong to the genre or neither does
    # (diff is a one-row DataFrame, so use .values to count over all cells)
    n_both = (diff == 0).values.sum()
    # Now hunt for -1: This is the genres selected movie belongs to but not the other movie
    n_selected = 
    # You know what to do (hint: 1 = present movie but not selected movie)
    n_other = 
    # Return a Series so that apply builds a DataFrame with one column per key
    return(pd.Series({
        'movieId': row['movieId'],
        'title': row['title'],
        'both': n_both,
        'selected': n_selected,
        'other': n_other
    }))
similarity_df = movies_df.apply(find_similar_genre_count, axis=1)

In [0]:
similarity_df.head()

Now we want to present the recommendations by sorting this list by:

- both sorted in descending order (maximum match)
- other sorted in ascending order (minimize the new genres)
- selected sorted in ascending order (minimize the missing genres)

Brain Exercise: Think of the order and why such an order!

Let's use the [sort_values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method.
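As a small illustration of sorting by several keys with mixed directions, consider the made-up frame below (only two of the three count columns are used, and the values are invented).

```python
import pandas as pd

# Hypothetical similarity counts for three candidate movies.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'both': [3, 4, 4],
    'other': [0, 2, 1],
})

# Earlier keys win; later keys only break ties among equal earlier keys.
ranked = toy.sort_values(by=['both', 'other'], ascending=[False, True])
print(ranked['title'].tolist())
```

Movies B and C tie on `both`, so `other` decides between them; movie A comes last even though its `other` count is the lowest. This is why the order of the keys matters.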

<!--
sorted_df = similarity_df.sort_values(by=['both', 'selected', 'other'], ascending=[False, True, True])
-->

In [0]:
sorted_df = 

In [0]:
print(sorted_df.title[:5])  # Printing top 5 recommendations

# Congratulations!! That's a Recommendation Engine Right There!!

That was the simple way of doing things. The old-fashioned way, as I call it. Content-based, if you remember.

Now let's move on to slightly more serious business.

## Better Content Based Filtering Approach

Collect the data

In [0]:
zipname = 'ml-latest-small.zip'
links_df = None
tags_df = None
ratings_df = None
movies_df = None
with ZipFile(zipname) as zf:
    links_df = pd.read_csv(zf.open('ml-latest-small/links.csv'))
    tags_df = pd.read_csv(zf.open('ml-latest-small/tags.csv'))
    ratings_df = pd.read_csv(zf.open('ml-latest-small/ratings.csv'))
    movies_df = pd.read_csv(zf.open('ml-latest-small/movies.csv'))
print('Links Data\n=============')
print(links_df.info())
print(links_df.head())
print('Tags Data\n=============')
print(tags_df.info())
print(tags_df.head())
print('Ratings Data\n=============')
print(ratings_df.info())
print(ratings_df.head())
print('Movies Data\n=============')
print(movies_df.info())
print(movies_df.head())

Links Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId    9742 non-null int64
imdbId     9742 non-null int64
tmdbId     9734 non-null float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB
None
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0
Tags Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
userId       3683 non-null int64
movieId      3683 non-null int64
tag          3683 non-null object
timestamp    3683 non-null int64
dtypes: int64(3), object(1)
memory usage: 115.2+ KB
None
   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    8977

Build the genre columns for movies_df.

In [0]:
## Series.append was removed in pandas 2.0, so build a single dict per row instead
movies_df = movies_df.apply(lambda x: pd.Series({
    'movieId': x['movieId'],
    'title': x['title'],
    **{'genre_{}'.format(g.lower()): 1
       for g in x['genres'].split('|')
       if g not in ['(no genres listed)']
      }
}), axis=1)

## Same thing as what we did before. Just crystallised into 1 statement.
## Looks a bit dirty but saves a lot of memory and speed for large datasets.
## We have na instead of 0.

{'genre_adventure': 1, 'genre_animation': 1, 'genre_children': 1, 'genre_comedy': 1, 'genre_fantasy': 1}
{'genre_adventure': 1, 'genre_animation': 1, 'genre_children': 1, 'genre_comedy': 1, 'genre_fantasy': 1}
{'genre_adventure': 1, 'genre_children': 1, 'genre_fantasy': 1}
{'genre_comedy': 1, 'genre_romance': 1}
{'genre_comedy': 1, 'genre_drama': 1, 'genre_romance': 1}
{'genre_comedy': 1}
{'genre_action': 1, 'genre_crime': 1, 'genre_thriller': 1}
{'genre_comedy': 1, 'genre_romance': 1}
{'genre_adventure': 1, 'genre_children': 1}
{'genre_action': 1}
{'genre_action': 1, 'genre_adventure': 1, 'genre_thriller': 1}
{'genre_comedy': 1, 'genre_drama': 1, 'genre_romance': 1}
{'genre_comedy': 1, 'genre_horror': 1}
{'genre_adventure': 1, 'genre_animation': 1, 'genre_children': 1}
{'genre_drama': 1}
{'genre_action': 1, 'genre_adventure': 1, 'genre_romance': 1}
{'genre_crime': 1, 'genre_drama': 1}
{'genre_drama': 1, 'genre_romance': 1}
{'genre_comedy': 1}
{'genre_comedy': 1}
{'genre_action': 1, 'g

In [0]:
movies_df.fillna(0)

Unnamed: 0,genre_action,genre_adventure,genre_animation,genre_children,genre_comedy,genre_crime,genre_documentary,genre_drama,genre_fantasy,genre_film-noir,genre_horror,genre_imax,genre_musical,genre_mystery,genre_romance,genre_sci-fi,genre_thriller,genre_war,genre_western,movieId,title
0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Toy Story (1995)
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,Jumanji (1995)
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3,Grumpier Old Men (1995)
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4,Waiting to Exhale (1995)
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,Father of the Bride Part II (1995)
5,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,6,Heat (1995)
6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,7,Sabrina (1995)
7,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8,Tom and Huck (1995)
8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9,Sudden Death (1995)
9,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,10,GoldenEye (1995)


Let's create a DataFrame that contains the user id, movie id, movie title, rating and all genre variables. It's a simple join of movies_df and ratings_df (optionally delete the timestamp data).
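To see what such a join produces, here is a sketch on miniature stand-ins for the two frames. The `movies` and `ratings` frames and their values are made up; only the column names mirror the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
movies = pd.DataFrame({'movieId': [1, 2], 'title': ['Movie A', 'Movie B']})
ratings = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.0, 5.0],
    'timestamp': [100, 101, 102],
})

# Being explicit about the key and join type avoids relying on defaults.
joined = movies.merge(ratings.drop(columns=['timestamp']), on='movieId', how='inner')
print(joined)
```

The result has one row per rating, with the movie's title repeated for every user who rated it.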

Fill in the blanks here to achieve this.

<!--
user_movie_genre_review = movies_df.merge(ratings_df)
-->

In [0]:
user_movie_genre_review = 

In [0]:
user_movie_genre_review.head()

Now let's identify the preferences of each user as a rating-weighted profile over genres.

Simplest way to do this is to

1. multiply the rating with the genres
2. find average rating of the genres per user (0 for cases when genre was not associated with the movie)
3. find the closest (`distance`) other movies and display them
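The first two steps can be worked through on a made-up frame small enough to verify by hand. The `toy` data and the `genre_cols` name are assumptions for illustration; the real frame has many more rows and genre columns.

```python
import pandas as pd

# Hypothetical joined frame: one row per (user, movie) rating with genre flags.
toy = pd.DataFrame({
    'userId': [1, 1, 2],
    'rating': [4.0, 2.0, 5.0],
    'genre_comedy': [1, 0, 1],
    'genre_drama': [1, 1, 0],
})
genre_cols = ['genre_comedy', 'genre_drama']

# Step 1: scale each genre flag by the rating given in that row.
weighted = toy.copy()
weighted[genre_cols] = toy[genre_cols].multiply(toy['rating'], axis='index')

# Step 2: average the weighted flags per user to get a preference profile.
profile = weighted.groupby('userId')[genre_cols].mean()
print(profile)
```

User 1 rated a comedy-drama 4 and a drama 2, so the comedy score is (4 + 0) / 2 = 2 and the drama score is (4 + 2) / 2 = 3. Note that a 0 flag pulls the average down in this toy frame, which is the "0 for cases when genre was not associated with the movie" behaviour described in step 2.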

So step 1:

[Multiply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.multiply.html)

<!--
user_movie_genre_review_weighted = user_movie_genre_review.copy()
user_movie_genre_review_weighted[cols] = user_movie_genre_review[cols].multiply(user_movie_genre_review['rating'], axis="index")
-->

In [0]:
## find the columns for genre
cols = [col for col in movies_df.columns if col[:6] == 'genre_']
# Multiply the columns by rating
## Make a copy
user_movie_genre_review_weighted = 
## Use multiply


Find the mean of all genres by user. We will use the [Split-Apply-Combine](https://pandas.pydata.org/pandas-docs/stable/groupby.html) Strategy. This simply means we will `group` the data (split), then `apply` the function or action. Finally pandas will `combine` the results for us.

In [0]:
## Select the genre columns before taking the mean so the text column (title) is not averaged
user_preferences = user_movie_genre_review_weighted.groupby('userId')[cols].mean().fillna(0)

In [0]:
user_preferences.head()

Now let's do the same for movies. Group by movieId and find the average review for each movie!

<!--
movie_profile = user_movie_genre_review_weighted.groupby('movieId')[cols].mean().fillna(0)
-->

In [0]:
movie_profile = 

In [0]:
movie_profile.head()

Now let's select a user, say user 1. We must now find the movies that most closely match the user's preferences.

Here I will introduce you to the concept of distance. Distance is a way of identifying how much a value is different from another.

The simplest form of distance we are familiar with is called `Euclidean Distance`.

For the points $\left(1, 0\right)$ and $\left(1, 1\right)$, the Euclidean distance is $\sqrt{\left(1 - 1\right)^2 + \left(1 - 0\right)^2} = \sqrt{0^2 + 1^2} = \sqrt{1} = 1$.

```
import numpy as np
a = np.array([1, 0])
b = np.array([1, 1])
distance = np.sqrt(np.sum(np.power(a-b, 2)))
```

In [0]:
import numpy as np
a = np.array([1, 0])
b = np.array([1, 1])
distance = np.sqrt(np.sum(np.power(a-b, 2)))
distance

In [0]:
a = np.array([1, 0, 0, 1])
b = np.array([1, 1, 1, 0])
distance = np.sqrt(np.sum(np.power(a-b, 2)))
distance

Another related concept is `cosine similarity`, together with its counterpart `cosine distance` (defined as 1 minus the similarity). The similarity is calculated as

$$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$$

This is inspired from the fact that

$$\mathbf{A}\cdot\mathbf{B}
=\left\|\mathbf{A}\right\|\left\|\mathbf{B}\right\|\cos\theta$$

So cosine similarity is the cosine of the angle between the two vectors: it is 1 when they point in the same direction and 0 when they are perpendicular.
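The formula translates almost term by term into NumPy. The helper name `cosine_similarity_1d` is made up for this sketch; the vectors are the same four-element `a` and `b` used in the Euclidean example.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper name)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 0, 0, 1])
b = np.array([1, 1, 1, 0])

similarity = cosine_similarity_1d(a, b)   # 1 / (sqrt(2) * sqrt(3))
cosine_distance = 1 - similarity          # the value a "cosine distance" reports
print(similarity, cosine_distance)
```

Here the dot product is 1 and the norms are $\sqrt{2}$ and $\sqrt{3}$, so the similarity is $1/\sqrt{6} \approx 0.408$ and the corresponding distance is about $0.592$.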

We are going to use the `scipy.spatial` module for this. Note that `distance.cosine` returns the cosine distance, that is, 1 minus the cosine similarity.

In [0]:
from scipy.spatial import distance as d
d.cosine(a, b)

If I wanted to apply this to a movie and a user, I can use the following code:

In [0]:
user = user_preferences.iloc[0]
movie = movie_profile.iloc[0]
d.cosine(user, movie)

Doing this pair by pair for a large number of users and movies is not efficient. We will use the built-in function [sklearn.metrics.pairwise.cosine_similarity](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), which computes all pairs at once.
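To get a feel for what an all-pairs computation does, here is a NumPy-only sketch. The `pairwise_cosine` helper and the toy profiles are assumptions for illustration; it is not the scikit-learn implementation, and unlike a library routine it assumes no profile row is all zeros.

```python
import numpy as np

def pairwise_cosine(X, Y):
    """Cosine similarity between every row of X and every row of Y.

    Assumes no row is all zeros (that would divide by zero).
    """
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y_unit = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return X_unit @ Y_unit.T

# Hypothetical profiles: 2 users and 3 movies described by 2 genres.
users = np.array([[1.0, 0.0], [1.0, 1.0]])
movies = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

sims = pairwise_cosine(users, movies)
print(sims.shape)
```

The result has one row per user and one column per movie, so `sims[i, j]` is the similarity between user `i` and movie `j`. Normalising the rows once and using a single matrix product is what makes this faster than looping over pairs.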

<!--
from sklearn.metrics.pairwise import cosine_similarity
user_preferences ## No action needed
movie_profile ## No action needed
similarities = cosine_similarity(user_preferences, movie_profile)
-->

In [0]:
from sklearn.metrics.pairwise import cosine_similarity
## First matrix is users
user_preferences ## No action needed
## Second is the movies
movie_profile ## No action needed
## Now compute the similarity matrix
similarities = 
## Check all shapes make sense. I expect the output (610, 9724) (610, 19) (9724, 19)
print(similarities.shape, user_preferences.shape, movie_profile.shape)

Now let's ask the user for a user id and provide the top 5 movies for that user.
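One detail worth checking before picking a "top 5" is the sort direction: `np.argsort` sorts in ascending order, so the best matches sit at the end. A quick check on invented scores:

```python
import numpy as np

# Hypothetical similarity scores of one user against five movies.
scores = np.array([0.2, 0.9, 0.1, 0.7, 0.5])

# argsort is ascending, so reverse it to get the best matches first.
top3 = np.argsort(scores)[::-1][:3]
print(top3.tolist())
```

The returned values are positions in the `scores` array, not movie ids, so they still need to be mapped back to the movies they stand for.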

<!--
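user = int(input('Which user are we looking at?'))
## Rows of similarities follow user_preferences.index, so find the row position of this userId
user_row = similarities[user_preferences.index.get_loc(user), :]
## argsort is ascending, so reverse it to put the most similar movies first
sorted_row = np.argsort(user_row)[::-1]
## Columns follow movie_profile.index, so map the positions back to movieIds
movieIds = movie_profile.index[sorted_row[:5]]
movies_df.set_index('movieId').loc[movieIds, 'title']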
-->

In [0]:
user = int(input('Which user are we looking at?'))
## Get row for the user (rows follow user_preferences.index, so row 0 holds userId 1)
user_row = 
## Get sort order, best match first (np.argsort sorts ascending: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html)
sorted_row = 
## Get top 5 movie Ids
movieIds = 
## Get movie names


# Congratulations! That's another Recommendation Engine!!