# Homework 1 Part 2

In [1]:
# import libraries and magics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

# Question 1 (45 points)

**Build a content-based recommender system using association rules on the MovieLens (small) dataset.**

**Start by downloading the [data](https://www.kaggle.com/datasets/shubhammehta21/movie-lens-small-latest-dataset/download?datasetVersionNumber=1). There you will find 5 csv files. For this problem, you will only need _movies.csv_ and _ratings.csv_.**

In [2]:
# Import datasets
moviesRaw = pd.read_csv('MovieLens 10k/movies.csv')
ratingsRaw = pd.read_csv('MovieLens 10k/ratings.csv')

In [3]:
# Merge movies and ratings
all_movies = pd.merge(moviesRaw, ratingsRaw, on='movieId', how='inner')
all_movies_grouped = all_movies.groupby('userId')
all_movies

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545
100833,193585,Flint (2017),Drama,184,3.5,1537109805
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021


1. **Build a routine to compute the *support* and *confidence* of the association rules $\text{Movie 1} \rightarrow \text{Movie 2}$.**

In [4]:
def association_rule(df, m1, m2):
    '''Compute the support and confidence of the
    association rule movie m1 --> movie m2.'''
    
    # Users that watched movie m1
    users_m1 = df.loc[df['title'] == m1, 'userId'].unique()
    # Users that watched movie m2
    users_m2 = df.loc[df['title'] == m2, 'userId'].unique()
    
    # Number of users that watched both m1 and m2
    n_both = len(np.intersect1d(users_m1, users_m2))
    
    # Support of rule m1 --> m2
    # number of users that watched m1 and m2 / total number of users
    support = n_both / df['userId'].nunique()
    
    # Confidence of rule m1 --> m2
    # number of users that watched m1 and m2 / number of users that watched m1
    confidence = n_both / len(users_m1)
    
    return support, confidence

2. **If a particular *user* watched *movie $x$*, recommend movies with a confidence of at least 50%.**

In [5]:
def recommender_system(df, x):
    '''This function implements a simple association rule for movie recommendations.
    If the user watched movie x, the system recommends every movie y whose
    rule x --> y has a confidence of at least 50%.'''
    
    # List of all (unique) movies
    movies_list = df['title'].unique()
    
    recommendations, confidence, support = [], [], []
    for y in np.setdiff1d(movies_list, x): # for each movie in the movies list, excluding x
        # compute support and confidence of the rule x --> y
        supp, conf = association_rule(df, x, y)
        if conf >= 0.5:
            recommendations.append(y)
            confidence.append(conf * 100)  # in %
            support.append(supp * 100)     # in %
                
    return recommendations, confidence, support

In [6]:
x = 'Misérables, Les (1995)'

y, conf, supp = recommender_system(all_movies, x)

# Build the table from typed columns, then sort by support and confidence
cols = ['support (in %)', 'confidence (in %)', 'recommended movie']
(pd.DataFrame({cols[0]: supp, cols[1]: conf, cols[2]: y})
   .sort_values(cols, ascending=False)
   .reset_index(drop=True)
   .round(2))

Unnamed: 0,support (in %),confidence (in %),recommended movie
0,1.8,84.62,"Shawshank Redemption, The (1994)"
1,1.64,76.92,Toy Story (1995)
2,1.64,76.92,Jurassic Park (1993)
3,1.64,76.92,Forrest Gump (1994)
4,1.48,69.23,Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
5,1.48,69.23,Schindler's List (1993)
6,1.48,69.23,Pulp Fiction (1994)
7,1.48,69.23,Jumanji (1995)
8,1.48,69.23,Beauty and the Beast (1991)
9,1.48,69.23,Apollo 13 (1995)


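Note that, in addition to a large confidence, one wants the association rule $X \rightarrow Y$ to have a large support, which indicates that the two movies are frequently watched together. A rule with high confidence but low support rests on only a handful of users and is therefore less reliable.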
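To illustrate why support matters alongside confidence, here is a minimal sketch on a made-up watch log. The data, the titles `A` to `D`, and the helper `rule_stats` are all hypothetical, and support is taken to be the fraction of all users who watched both movies.

```python
import pandas as pd

# Toy watch log (hypothetical data): one row per (user, movie) pair
toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3, 4, 5, 5, 5],
    'title':  ['A', 'B', 'A', 'B', 'B', 'B', 'C', 'D', 'B'],
})

def rule_stats(df, m1, m2):
    """Return (support, confidence) of the rule m1 --> m2."""
    users_m1 = set(df.loc[df['title'] == m1, 'userId'])
    users_m2 = set(df.loc[df['title'] == m2, 'userId'])
    n_both = len(users_m1 & users_m2)
    support = n_both / df['userId'].nunique()   # fraction of all users
    confidence = n_both / len(users_m1)         # fraction of m1 watchers
    return support, confidence

supp_ab, conf_ab = rule_stats(toy, 'A', 'B')  # A was watched by users 1 and 2
supp_cd, conf_cd = rule_stats(toy, 'C', 'D')  # C was watched by user 5 only
print(f"A --> B: support={supp_ab:.2f}, confidence={conf_ab:.2f}")
print(f"C --> D: support={supp_cd:.2f}, confidence={conf_cd:.2f}")
```

Both rules reach a confidence of 100%, but `C --> D` is backed by a single user, which shows up as the lower support.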

3. **Discuss:**
    * **How would you generalize to two or more movies?**
    * **How can you use the movie rating information to decide which movie to propose to a customer?**

We can extend the association rule to two or more watched movies, $\{X_1, \dots, X_k\} \rightarrow Y$, using the equations
* Support: 

$$P(X_1\cap X_2 \cap \dots \cap X_k \cap Y) = \frac{\#\{\text{customers who watched } X_1 \text{ and } \dots \text{ and } X_k \text{ and } Y\}}{\#\{\text{customers}\}}$$

* Confidence:

$$P(Y|X_1 \cap X_2 \cap \dots \cap X_k) = \frac{P(X_1\cap X_2\cap \dots\cap X_k\cap Y)}{P(X_1\cap X_2\cap \dots\cap X_k)} = \frac{\#\{\text{customers who watched } X_1 \text{ and } \dots \text{ and } X_k \text{ and } Y\}}{\#\{\text{customers who watched } X_1 \text{ and } \dots \text{ and } X_k\}}$$

where $X_1, \dots, X_k$ are the movies the user has already watched and $Y$ is the candidate movie to recommend.
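The multi-movie rule could be computed along the following lines. This is only a sketch: the function name `association_rule_multi` and the toy data are hypothetical, and it assumes a DataFrame with one row per rating and `userId` and `title` columns, as in `all_movies`.

```python
import pandas as pd

def association_rule_multi(df, antecedents, y):
    """Support and confidence of the rule {x_1, ..., x_k} --> y."""
    # Set of titles watched by each user
    watched = df.groupby('userId')['title'].agg(set)
    antecedents = set(antecedents)

    # Users whose watch set contains every antecedent movie
    has_all_x = watched.apply(antecedents.issubset)
    # Among those, users who also watched y
    has_all_x_and_y = has_all_x & watched.apply(lambda titles: y in titles)

    n_x = int(has_all_x.sum())
    n_xy = int(has_all_x_and_y.sum())
    support = n_xy / len(watched)                  # divide by number of users
    confidence = n_xy / n_x if n_x else 0.0        # guard against no matches
    return support, confidence

# Toy watch log (hypothetical data)
toy = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3, 3, 4, 4],
    'title':  ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'B', 'C'],
})
supp, conf = association_rule_multi(toy, ['A', 'B'], 'C')
print(supp, conf)
```

In the toy log, users 1 and 2 watched both `A` and `B`, and only user 1 also watched `C`, so the rule is supported by one of four users and holds for one of the two users who match the antecedent.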

The rating information can be used to shrink the set of recommendations to movies that viewers actually liked. Namely, the support and confidence can be modified to count only the users who watched $Y$ and rated it at least 3 stars.
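One way this rating-aware variant might look is sketched below. The function name `rating_aware_rule`, the `min_rating` parameter, and the toy data are assumptions made for illustration; the 3-star threshold follows the text above.

```python
import pandas as pd

def rating_aware_rule(df, m1, m2, min_rating=3.0):
    """Support and confidence of m1 --> m2, counting m2 only when the
    user rated it at least min_rating."""
    n_users = df['userId'].nunique()
    # Users that watched m1 (any rating)
    users_m1 = set(df.loc[df['title'] == m1, 'userId'])
    # Users that watched m2 AND rated it at least min_rating
    liked = df[df['rating'] >= min_rating]
    users_liked_m2 = set(liked.loc[liked['title'] == m2, 'userId'])

    n_both = len(users_m1 & users_liked_m2)
    support = n_both / n_users
    confidence = n_both / len(users_m1) if users_m1 else 0.0
    return support, confidence

# Toy ratings (hypothetical data)
toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3, 4],
    'title':  ['A', 'B', 'A', 'B', 'A', 'B'],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.5, 4.0],
})
supp, conf = rating_aware_rule(toy, 'A', 'B')
print(supp, conf)
```

Here user 2 watched both movies but rated `B` only 2 stars, so that pair no longer counts toward the rule `A --> B`.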

# On-Time (5 points)

Submit your assignment before the deadline.

___

# Submit Your Solution

Confirm that you've successfully completed the assignment.

Along with the Notebook, include a PDF of the notebook with your solutions.

```add``` and ```commit``` the final version of your work, and ```push``` your code to your GitHub repository.

Submit the URL of your GitHub Repository as your assignment submission on Canvas.

___