## 1 Preliminary data exploration

In [1]:
import pandas as pd

titles = ['links','movies','ratings','tags']
path_csv = lambda title: f'/Users/G/WBS Bootcamp/8. Recommender Systems/Data/{title}.csv'

links = pd.read_csv(path_csv(titles[0]))
movies = pd.read_csv(path_csv(titles[1]))
ratings = pd.read_csv(path_csv(titles[2]))
tags = pd.read_csv(path_csv(titles[3]))

### 1.1 Dataframes and Features description

* `links.csv`: Identifiers that can be used to link to other sources of movie data. Each line of this file after the header row represents one movie.
    * `imdbId` is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

    * `tmdbId` is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

* `ratings.csv`: Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

* `tags.csv`: Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

* `Timestamps`: represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
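
As a minimal sketch of the timestamp convention, the snippet below converts epoch seconds into UTC datetimes with `pd.to_datetime`. The two values are toy inputs chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy epoch values (hypothetical, not taken from ratings.csv or tags.csv)
toy = pd.DataFrame({"timestamp": [0, 86400]})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00:00 UTC
toy["datetime"] = pd.to_datetime(toy["timestamp"], unit="s", utc=True)
print(toy)
```

The first value maps to midnight on January 1, 1970 and the second to exactly one day later.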



No data need to be imputed or converted to a more appropriate datatype.
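
A statement like this can be checked with a couple of one-liners. The sketch below uses a small hypothetical frame in place of the real CSVs; the same two calls can be applied to `links`, `movies`, `ratings` and `tags` to inspect missing values and column types.

```python
import pandas as pd

# Hypothetical stand-in for one of the loaded dataframes
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})

missing_per_column = toy.isna().sum()   # number of NaNs in each column
column_types = toy.dtypes               # dtype inferred by pandas for each column

print(missing_per_column)
print(column_types)
```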

## 2 Making Recommendations Based on Popularity
A popularity-based, non-personalised recommender system that takes as an input the ratings and movies datasets and outputs the “best” movies. How you define “best” is up to you. Those movies will appear as the top row of the WBSFLIX site.

In [2]:
#introduce the average rating and the rating count per movie in a single aggregation
popularity = ratings.groupby(by='movieId').agg(avg_rating=("rating","mean"), rating_count=("rating","count"))

In [3]:
#ordering by avg_rating
popularity.sort_values(by='avg_rating',ascending = False).head(2)

movieId,avg_rating,rating_count
88448,5.0,1
100556,5.0,1


In [4]:
#ordering by counts
popularity.sort_values(by='rating_count',ascending = False).head(2)

movieId,avg_rating,rating_count
356,4.164134,329
318,4.429022,317


### 2.1 Introducing hybrid metrics

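* Weighted average
$$ w_i = \frac{ c_i \cdot r_i}{\sum_j c_j} $$
where $w_i$ is the new hybrid measure, and $c_i$ and $r_i$ are the rating count and the average rating of the $i$-th movie.

* Linear combination: we assign different weights to counts and ratings and then sum

$$ \ell_i = a c_i + b r_i$$

In `linear_hybrid` below, both quantities are first min-max scaled to $[0, 1]$, with $a$ given by `weight_counts` and $b = 1 - a$.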

In [5]:
def weight_hybrid(n,df):
    
    #this function adds a new column with the weights and returns the "heaviest" n movies
    
    df2 = df.copy() 
    df2['weight'] = (df['rating_count'] * df['avg_rating']) / (df['rating_count'].sum())
    
    return df2.sort_values(by="weight", ascending = False).head(n)

#weight_hybrid(10,popularity)
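
To see how the weighted measure behaves, here is a small sketch on made-up numbers (the three movies and their values are hypothetical). It applies the same formula as `weight_hybrid` inline, so it runs without the real data.

```python
import pandas as pd

# Hypothetical popularity table: movie 1 has a perfect score from one user,
# movie 2 a good score from many users, movie 3 sits in between.
toy = pd.DataFrame(
    {"avg_rating": [5.0, 4.0, 3.0], "rating_count": [1, 300, 99]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# w_i = c_i * r_i / sum_j c_j
toy["weight"] = toy["rating_count"] * toy["avg_rating"] / toy["rating_count"].sum()
ranked = toy.sort_values(by="weight", ascending=False)
print(ranked)
```

The denominator is the same for every movie, so the ranking is driven by the product of count and rating: the movie with a single 5-star rating drops to the bottom.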

In [6]:
def linear_hybrid(n, df, weight_counts):
    #This function linearly combines ratings and counts with appropriate weights
    
    #Reject weights outside [0, 1] instead of continuing with an invalid value
    if weight_counts < 0 or weight_counts > 1:
        raise ValueError("Weight must be in [0, 1]")
    
    #Scaling of the data
    from sklearn.preprocessing import MinMaxScaler
    my_scaler = MinMaxScaler().set_output(transform="pandas")
    my_scaler.fit(df)
    df1 = my_scaler.transform(df)
    
    
    col_name = f"lin. {weight_counts*100}%"
    df1[col_name] = weight_counts * df1['rating_count'] + (1 - weight_counts) * df1['avg_rating']
    
    return df1.sort_values(by=col_name, ascending=False).head(n)
#linear_hybrid(10,popularity, 0.7)
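
The effect of `weight_counts` can be sketched on made-up numbers. The snippet below is a simplified stand-in for `linear_hybrid`, under stated assumptions: it performs the min-max scaling directly in pandas instead of calling `MinMaxScaler`, the helper `linear_score` is a hypothetical name, and the toy table is invented.

```python
import pandas as pd

# Hypothetical popularity table
toy = pd.DataFrame(
    {"avg_rating": [5.0, 4.0, 3.0], "rating_count": [1, 300, 99]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Min-max scaling: each column is mapped linearly onto [0, 1]
scaled = (toy - toy.min()) / (toy.max() - toy.min())

def linear_score(scaled_df, weight_counts):
    # a * scaled count + (1 - a) * scaled rating
    return (weight_counts * scaled_df["rating_count"]
            + (1 - weight_counts) * scaled_df["avg_rating"])

score_counts = linear_score(scaled, 0.7)   # favours the number of ratings
score_rating = linear_score(scaled, 0.1)   # favours the average rating
print(score_counts.idxmax(), score_rating.idxmax())
```

With a weight of 0.7 the heavily rated movie 2 comes out on top, while a weight of 0.1 lets the single 5-star movie 1 win, which shows how sensitive the ranking is to this parameter.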

## 3 Making Recommendations Based on Correlation

### 3.1 Item-based collaborative filtering

A similarity-based, semi-personalised recommender system that takes a movie as input and outputs a list of movies that are “similar” to it, based on rating correlations from the user-item matrix. In production the input will be a movie that the user has watched recently or rated highly; for now it is entered manually. Those movies will appear as the second row of the WBSFLIX site.

* Create a pivot table of userId vs. movieId for the ratings
* Pick one movieId and calculate its similarity to the others
* Sort the data
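
The three steps above can be sketched on a tiny, hypothetical ratings table (four users, three movies; all values are made up so the correlations are easy to read):

```python
import pandas as pd

# Hypothetical ratings in long format (userId, movieId, rating)
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "movieId": [10, 10, 10, 10, 20, 20, 20, 20, 30, 30, 30, 30],
    "rating":  [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5, 4.0, 3.0, 2.0, 1.0],
})

# Step 1: users as rows, movies as columns
toy_pivot = pd.pivot_table(toy_ratings, values="rating", index="userId", columns="movieId")

# Step 2: Pearson correlation of every movie column with movie 10
similar_to_10 = toy_pivot.corrwith(toy_pivot[10])

# Step 3: sort, leaving out the reference movie itself
ranking = similar_to_10.drop(10).sort_values(ascending=False)
print(ranking)
```

Movie 20 is rated in the same order as movie 10 by every user and ends up first; movie 30 is rated in the opposite order and ends up last.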

In [7]:
ratings_pivot = pd.pivot_table(data = ratings, values='rating', index='userId', columns='movieId')

#### 3.1.1 Similarities for a specific movie

Based on the previous analysis (linear method) we know that the most popular movie has `movieId=356` (Forrest Gump (1994)).
We calculate the correlations with the method `.corrwith()`.

In [8]:
ratings_ForrestGump = ratings_pivot[356]
similar_to_ForrestGump = ratings_pivot.corrwith(ratings_ForrestGump)

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


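We get warnings because of the NaNs in the pivot table: for movies with too few ratings in common with Forrest Gump, the correlation is undefined.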
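
A minimal sketch of where the NaNs come from, on hypothetical data: when a movie has at most one rater in common with the reference movie, the Pearson correlation is undefined and `corrwith` returns NaN. The warnings are silenced here only to keep the output short.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical user-item table; NaN means "not rated"
toy_pivot = pd.DataFrame(
    {
        10: [1.0, 2.0, 3.0, 4.0],
        20: [2.0, 3.0, np.nan, 5.0],            # three raters in common with movie 10
        30: [np.nan, np.nan, 4.0, np.nan],      # a single rater in common
        40: [np.nan, np.nan, np.nan, np.nan],   # nobody rated it
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the runtime warnings for this demo
    corr = toy_pivot.corrwith(toy_pivot[10])

print(corr)            # movies 30 and 40 come back as NaN
print(corr.dropna())   # only movies 10 and 20 remain
```

With very few common raters, extreme values such as 1.0 also appear easily, which is one reason for combining the correlation with a popularity threshold as done further below.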

In [9]:
#create a pandas dataframe
corr_ForrestGump = pd.DataFrame(similar_to_ForrestGump, columns = ['Pearson'])
len0 = len(corr_ForrestGump)
#drop the NaNs
corr_ForrestGump.dropna(inplace = True)
print(f'# of rows before and after dropping NaNs: {len0} -> {len(corr_ForrestGump)}\n\n')
corr_ForrestGump.sample(5)

# of rows before and after dropping NaNs: 9724 -> 5460




movieId,Pearson
72998,0.225503
6663,0.636364
2427,-0.055205
1707,0.075307
7646,1.0


In [10]:
#Now we want to construct a dataframe of the form (movies) VS (Pearson, popularity_metric)
#Notice: we use the previously introduced function linear_hybrid()

mixed_ForrestGump = linear_hybrid(len(popularity),popularity, 0.7)[['lin. 70.0%']].join(corr_ForrestGump['Pearson'], how='left')
mixed_ForrestGump.drop(356, inplace=True) # drop Forrest Gump itself

#The 'lin. 70.0%' column ranges from 0 to ~1.
#We keep only the rows above a threshold of 0.7 and then the first 10 movies in terms of similarity to Forrest Gump
mixed_ForrestGump.loc[mixed_ForrestGump['lin. 70.0%'] > 0.7].sort_values(by='Pearson',ascending=False).head(10)

movieId,lin. 70.0%,Pearson
110,0.739102,0.416976
318,0.936325,0.297438
527,0.715711,0.291108
480,0.722459,0.290114
2571,0.837322,0.280199
593,0.837379,0.221777
2959,0.714639,0.188095
589,0.707313,0.180805
260,0.782275,0.108355
296,0.89952,0.077001


#### 3.1.2 Similarities for a generic movie

In [11]:
# now we condense all the steps above into a single function
# arguments: {movie_name: movie title, n: number of most similar movies to return}

def item_based_collaborative_filtering(movie_name,n):
    
    #map the movie_name into movieId
    movieID = movies.loc[movies['title'] == movie_name,'movieId'].values[0]

    #pivot table
    ratings_pivot = pd.pivot_table(data = ratings, values='rating', index='userId', columns='movieId')
    
    #create a pandas df with the correlations of the other movies
    similar_to_movieID = ratings_pivot.corrwith(ratings_pivot[movieID])
    corr_movieID = pd.DataFrame(similar_to_movieID, columns = ['Pearson'])
    corr_movieID.dropna(inplace = True) #drop the NaNs

    #Construct a df of (movies) VS (Pearson, popularity_metric)
    mixed_movieID = linear_hybrid(len(popularity),popularity, 0.5)[['lin. 50.0%']].join(corr_movieID['Pearson'], how='left')
    #Drop movieID
    mixed_movieID.drop(movieID, inplace=True)
    #Drop the NaNs as well
    mixed_movieID.dropna(inplace = True)
    #Filter out all rows at or below the 0.5 threshold and then keep only the first n movies in terms of similarity to movieID
    return mixed_movieID.loc[mixed_movieID['lin. 50.0%'] > 0.5].sort_values(by='Pearson',ascending=False).head(n)


item_based_collaborative_filtering("Father of the Bride Part II (1995)",10)



movieId,lin. 50.0%,Pearson
6016,0.51799,0.797241
1201,0.513324,0.685419
500,0.538975,0.670881
367,0.536106,0.647512
1276,0.504469,0.645497
2959,0.750008,0.635759
2028,0.690203,0.610365
5989,0.553974,0.583042
2396,0.50285,0.552881
597,0.535956,0.537787


### 3.2 User-based collaborative filtering

To create a user-based collaborative recommender we are going to go through a very similar process as we did with the item-based recommender. This time though we’re going to calculate the cosine similarity between users, instead of between movies.

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

#create a users-items table
user_item = pd.pivot_table(data=ratings, values='rating', index='userId', columns='movieId')

#replace NaNs with zeros
user_item.fillna(0,inplace=True)

#cosine similarities
cos_sim = pd.DataFrame(data=cosine_similarity(user_item), index=user_item.index, columns=user_item.index)
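
For intuition, cosine similarity is the dot product of two rating vectors divided by the product of their norms. The sketch below computes it with NumPy on hypothetical users; the helper `cosine` is an illustrative name, and the snippet is meant to show the formula rather than replace scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical rating vectors over four movies (0 = not rated)
user_a = np.array([5.0, 0.0, 3.0, 0.0])
user_b = np.array([2.5, 0.0, 1.5, 0.0])   # same taste as user_a, lower scale
user_c = np.array([0.0, 4.0, 0.0, 2.0])   # no movie in common with user_a

sim_ab = cosine(user_a, user_b)
sim_ac = cosine(user_a, user_c)
print(sim_ab, sim_ac)
```

Users who rate the same movies in the same proportions get a similarity close to 1 regardless of scale, while users with no rated movie in common get 0.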

Let us now focus on one user, `uID=300`.

The goal is to estimate the ratings of the movies that `uID` has not rated. So, first of all, we have to identify the movies whose rating is 0 in the user-item table. As a result we get an array of `movieId` which we call missing_movies.

For any movie in missing_movies we calculate the estimated rating $r_{\text{u}_\text{ID}}$ as

$$r_{\text{u}_\text{ID}}= \sum_{i\neq\text{u}_\text{ID}} w_i r_i$$

where $r_i$ is the true rating given by the $i$-th user and $w_i$ is the similarity weight defined as

$$w_i = \frac{c_i}{\sum_{j\neq\text{u}_\text{ID}}c_j} $$

where $c_i$ is the cosine similarity between the $i$-th user and `uID`. Since missing ratings were replaced by zeros, users who did not rate the movie contribute $r_i = 0$ to the sum.
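
A worked example of the two formulas, with hypothetical similarities and ratings for a single unrated movie:

```python
import numpy as np

# Hypothetical cosine similarities between the target user and three other users
c = np.array([1.0, 0.6, 0.4])

# Ratings those users gave to one movie the target user has not rated
# (0 = not rated, following the zero-filling used above)
r = np.array([4.0, 0.0, 5.0])

w = c / c.sum()            # similarity weights, which sum to 1
predicted = float(w @ r)   # weighted sum over the other users
print(w, predicted)
```

Here the weights are 0.5, 0.3 and 0.2, so the estimate is 0.5 × 4 + 0.3 × 0 + 0.2 × 5 = 3.0. The second user's zero pulls the estimate below both actual ratings, so these scores are better read as a ranking signal than as star values.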

In [131]:
uID = 300

#find the unrated movies and the ratings of the other users
unseen_rating_uID = user_item.loc[user_item.index!=uID,user_item.loc[uID,:]==0]

#calculate weights
weights_uID = cos_sim.query('userId!=@uID')[uID]/sum(cos_sim.query('userId!=@uID')[uID])

#construct the predicted_rating by means of the dot product
predicted_uID = pd.DataFrame(unseen_rating_uID.T.dot(weights_uID), columns = ["predicted_rate"]).sort_values(by="predicted_rate",ascending=False)

In [132]:
#to find the top 5 UNRATED movies we have to merge our findings with the original table
recommendations = predicted_uID.merge(movies, left_index=True, right_on="movieId")
recommendations.sort_values("predicted_rate", ascending=False).head(5)

Unnamed: 0,predicted_rate,movieId,title,genres
257,2.688146,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
224,1.907312,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
3638,1.887349,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
46,1.878196,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
4800,1.846094,7153,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy


#### 3.2.1 The function

In [133]:
def special_for_you(uID,n):
    
    #find the unrated movies and the ratings of the other users
    unseen_rating_uID = user_item.loc[user_item.index!=uID,user_item.loc[uID,:]==0]
    
    #calculate weights
    weights_uID = cos_sim.query('userId!=@uID')[uID]/sum(cos_sim.query('userId!=@uID')[uID])
    
    #construct the predicted_rating by means of the dot product
    predicted_uID = pd.DataFrame(unseen_rating_uID.T.dot(weights_uID), columns = ["predicted_rate"]).sort_values(by="predicted_rate",ascending=False)
    
    #to find the top n UNRATED movies we have to merge our findings with the original table
    recommendations = predicted_uID.merge(movies, left_index=True, right_on="movieId")
    
    return recommendations.sort_values("predicted_rate", ascending=False).head(n)

In [134]:
special_for_you(47,5)

Unnamed: 0,predicted_rate,movieId,title,genres
257,2.730911,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
1939,2.395105,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
461,2.115291,527,Schindler's List (1993),Drama|War
97,2.009459,110,Braveheart (1995),Action|Drama|War
224,1.981588,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
