### *Assignment 1*: User-based Collaborative Filtering Recommendations
---

In [1]:
from user_recommendation import UserRecommendation
from dataset import Dataset
import pandas

ratings_df = pandas.read_csv(Dataset.get_dataset_path() + 'ratings.csv')
ds = Dataset(ratings_df)

user_recommendation = UserRecommendation(ds)

**(a)** Download the MovieLens 100K rating dataset from https://grouplens.org/datasets/movielens/ 
(the small dataset recommended for education and development). Read the
dataset, display the first few rows to understand it, and display the count of ratings (rows)
in the dataset to be sure that you download it correctly

In [2]:
print('There are ', len(ds.movies_df), ' movies and ', len(ds.ratings_df), ' ratings')

There are  9742  movies and  100836  ratings


In [3]:
ds.movies_df.head(10)

Unnamed: 0,movieId,title,genres,avg_rating
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",3.92093
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",3.431818
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",3.259615
4,5,Father of the Bride Part II (1995),[Comedy],2.357143
5,6,Heat (1995),"[Action, Crime, Thriller]",3.071429
6,7,Sabrina (1995),"[Comedy, Romance]",3.946078
7,8,Tom and Huck (1995),"[Adventure, Children]",3.185185
8,9,Sudden Death (1995),[Action],2.875
9,10,GoldenEye (1995),"[Action, Adventure, Thriller]",3.125


In [4]:
ds.ratings_df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,datetime
0,1,1,4.0,964982703,30-07-2000
1,1,3,4.0,964981247,30-07-2000
2,1,6,4.0,964982224,30-07-2000
3,1,47,5.0,964983815,30-07-2000
4,1,50,5.0,964982931,30-07-2000
5,1,70,3.0,964982400,30-07-2000
6,1,101,5.0,964980868,30-07-2000
7,1,110,4.0,964982176,30-07-2000
8,1,151,5.0,964984041,30-07-2000
9,1,157,5.0,964984100,30-07-2000


---

**(b)** Implement the user-based collaborative filtering approach, using the Pearson
correlation function for computing similarities between users, and

**(c)** the prediction function presented in class for predicting movies scores.

In [5]:
userA = 1
userB = 10
movieP = 1

In [6]:
sim = user_recommendation.sim_pcc(userA, userB)
print(f'Pearson correlation coefficient between user {userA} and {userB} is {sim}')

Pearson correlation coefficient between user 1 and 10 is -0.15121755406386428


$$sim^{PCC}(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r_a})(r_{b,p} - \bar{r_b})}{\sqrt{\sum_{p \in P}(r_{a,p} - \bar{r_a})^2}\sqrt{\sum_{p \in P}(r_{b,p} - \bar{r_b})^2}}$$

In [7]:
neighbors = user_recommendation.top_n_similar_users(50)
prediction = user_recommendation.prediction_from_neighbors(userA, movieP, neighbors)
print(f'Prediction for the rating of movie {movieP} rating by user {userA} is {prediction}')

Prediction for the rating of movie 1 rating by user 1 is 4.502742946708464


$$pred(a,p)=\bar{r_a} + \frac{\sum_{b \in N}sim(a,b)\cdot(r_{b,p}-\bar{r_b})}{\sum_{b \in N}sim(a,b)}$$

---

**(d)** Select a user from the dataset, and for this user, show the 10 most similar users and
the 10 most relevant movies that the recommender suggests.

In [8]:
similar_users = user_recommendation.top_n_similar_users(userA, n=10)

sim_users_common_movies = []
for tuple in similar_users:
    sim_users_common_movies.append((tuple[0], tuple[1], len(ds.get_common_movies(userA, tuple[0]))))
sim_users_df = pandas.DataFrame(sim_users_common_movies, columns=['userId', 'similarity', '#common_movies'])

print(f'Top 10 similar users to user {userA} using PCC are:')
sim_users_df.head(10)

Top 10 similar users to user 1 using PCC are:


Unnamed: 0,userId,similarity,#common_movies
0,77,1.0,6
1,12,1.0,2
2,85,1.0,1
3,253,1.0,1
4,291,1.0,1
5,358,1.0,1
6,388,1.0,1
7,2,1.0,2
8,146,0.99905,2
9,278,0.971061,3


In [9]:
top_rec = user_recommendation.top_n_recommendations(userA)
top_rec_df = pandas.DataFrame(top_rec, columns=['movieId', 'prediction'])

print(f'Top 10 recommendations for user {userA} are:')
top_rec_df.head(10)

Top 10 recommendations for user 1 are:


Unnamed: 0,movieId,prediction
0,319,6.769157
1,3567,6.726379
2,555,6.641379
3,913,6.252743
4,55276,6.252743
5,30803,6.225754
6,3972,6.223522
7,27611,6.223522
8,5066,6.110282
9,42728,6.110282


---

**(e)** Design and implement a new similarity function for computing similarities between
users. Explain why this similarity function is useful for the collaborative filtering approach.

We decided to apply a weight after calculating the PCC. The weight is calculated by dividing the number of common movies between the users by the total number of ratings of the other user.

$$ sim^{WPCC}(a,b)=sim^{PCC}(a,b) \cdot \dfrac{|common\_ movies(a,b)|}{|ratings(b)|} $$

The similarity score is penalized when the two users have very few rated movies in common. Dividing by the number of ratings of user $b$ helps to avoid giving too much weight to the number of common movies rated by both users.
Furthermore, it helps to avoid the problem of having a similarity score of 1 when the two users have only one movie in common.
Additionally, it penalizes the similarity score when user $b$ has rated a very large number of movies compared to
the movies rated in common with user $a$.

In [10]:
similar_users = user_recommendation.top_n_similar_users(userA, user_recommendation.sim_wpcc_jaccard)

# Creating the table to show
sim_users_common_movies = []
for user2, similarity in similar_users:
    number_common_movies = len(ds.get_common_movies(userA, user2))
    number_rated_movies_user2 = len(ds.get_movies_rated_by_user(user_id=user2))
    
    sim_users_common_movies.append((user2, 
                                    similarity, 
                                    user_recommendation.sim_pcc(userA, user2),
                                    user_recommendation.sim_jaccard(userA, user2)
                                    ))

sim_users_df = pandas.DataFrame(sim_users_common_movies, 
                                columns=['User', 'PCC * Jaccard', 'PCC', 
                                         'Jaccard'])

print(f'Top 10 similar users to user {userA} using Weighted PCC are:')
sim_users_df.head(10)

Top 10 similar users to user 1 using Weighted PCC are:


Unnamed: 0,User,PCC * Jaccard,PCC,Jaccard
0,266,0.075094,0.38668,0.194203
1,597,0.073614,0.443985,0.165803
2,57,0.066334,0.35299,0.187919
3,577,0.058652,0.313126,0.187311
4,135,0.056401,0.303862,0.185615
5,198,0.055574,0.365331,0.15212
6,434,0.054316,0.340325,0.159601
7,469,0.05383,0.287256,0.187394
8,477,0.053458,0.40039,0.133515
9,369,0.052815,0.628122,0.084084
