## Importing useful libraries

First, we import the libraries that will help us recommend movies to our users. pandas is used to load and transform the data, numpy helps us create and operate on arrays, and cosine_similarity (from scikit-learn) is used to measure how similar the different users are to each other.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
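
As a quick illustration of what cosine_similarity measures, the sketch below compares it with the textbook formula (dot product divided by the product of the norms). The three vectors are made-up values, not data from this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical centered ratings of three movies by three users.
a = np.array([[1.0, -1.0, 0.5]])
b = np.array([[2.0, -2.0, 1.0]])    # same direction as a, twice the magnitude
c = np.array([[-1.0, 1.0, -0.5]])   # exactly opposite to a

# Manual formula: dot product divided by the product of the Euclidean norms.
manual_ab = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

sim_ab = cosine_similarity(a, b)[0, 0]
sim_ac = cosine_similarity(a, c)[0, 0]
print(manual_ab, sim_ab, sim_ac)
```

Only the direction of the vectors matters, not their length, so a and b count as identical in taste while a and c count as opposite.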

## Importing the data

Here we load the two dataframes needed to provide the personalized recommendations. The first one holds the movie ratings (userId, movieId and rating). The second one is the movies dataframe, from which I keep only the first two columns (movieId and title), since they are the only ones relevant for this exercise.

In [2]:
ratings_df = pd.read_csv('u.data', sep='\t', index_col=False, names=['userId', 'movieId', 'rating'])
movies_df = pd.read_csv('u.item', sep='|', usecols=[0,1], index_col=False, encoding='ISO-8859-1', names=["movieId", "title"])
print(movies_df)
print(ratings_df)

      movieId                                      title
0           1                           Toy Story (1995)
1           2                           GoldenEye (1995)
2           3                          Four Rooms (1995)
3           4                          Get Shorty (1995)
4           5                             Copycat (1995)
...       ...                                        ...
1677     1678                          Mat' i syn (1997)
1678     1679                           B. Monkey (1998)
1679     1680                       Sliding Doors (1998)
1680     1681                        You So Crazy (1994)
1681     1682  Scream of Stone (Schrei aus Stein) (1991)

[1682 rows x 2 columns]
       userId  movieId  rating
0         196      242       3
1         186      302       3
2          22      377       1
3         244       51       2
4         166      346       1
...       ...      ...     ...
99995     880      476       3
99996     716      204       5
99997     27

## Calculating average rating per user

Here we first group the ratings dataframe by userId, keeping only the userId and rating columns, to obtain the average rating of each user. We then merge this average back into the ratings dataframe on userId and subtract it from each rating, so that all the information needed to compare users (the rating, the user's average and the centered rating) is in one dataframe.
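
To make the centering step concrete, here is a minimal sketch on a made-up dataframe (the values are hypothetical). It uses groupby(...).transform("mean") instead of a separate merge; this is an alternative way to get the same columns, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the MovieLens files.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 30],
    "rating":  [5, 3, 4, 2, 4],
})

# transform("mean") returns one value per row, aligned with the original
# index, so the per-user average can be assigned directly as a new column.
toy["average"] = toy.groupby("userId")["rating"].transform("mean")
toy["rating_centered"] = toy["rating"] - toy["average"]
print(toy)
```

After centering, a positive value means the user liked the movie more than they usually do, which makes strict and generous raters comparable.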

In [3]:
# Add a column with the centered ratings
average_df = ratings_df[['userId', 'rating']].groupby(['userId'], as_index=False).mean().rename(columns={'rating': 'average'})
print(average_df)
ratings_df = pd.merge(ratings_df, average_df, on='userId', how='left')
ratings_df['rating_centered'] = ratings_df['rating'] - ratings_df['average']
print(ratings_df)
# The remaining lines are sanity checks only; they are not needed for the recommendations.
print(ratings_df["movieId"].nunique())
ratings_df.info()
print(ratings_df[ratings_df["userId"]==1]["userId"].sum())
print(ratings_df[ratings_df["userId"]==2])

     userId   average
0         1  3.610294
1         2  3.709677
2         3  2.796296
3         4  4.333333
4         5  2.874286
..      ...       ...
938     939  4.265306
939     940  3.457944
940     941  4.045455
941     942  4.265823
942     943  3.410714

[943 rows x 2 columns]
       userId  movieId  rating   average  rating_centered
0         196      242       3  3.615385        -0.615385
1         186      302       3  3.413043        -0.413043
2          22      377       1  3.351562        -2.351562
3         244       51       2  3.651261        -1.651261
4         166      346       1  3.550000        -2.550000
...       ...      ...     ...       ...              ...
99995     880      476       3  3.426630        -0.426630
99996     716      204       5  3.888476         1.111524
99997     276     1090       1  3.465251        -2.465251
99998      13      225       2  3.097484        -1.097484
99999      12      203       3  4.392157        -1.392157

[100000 rows x 

## Pivot table creation

In this step I create two pivot tables with the userIds as rows and the movieIds as columns. The first one holds the rating each user gave to each movie, whereas the second one holds the centered ratings (the user's rating of a particular movie minus that user's average rating). Movies a user has not rated appear as NaN.
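
The sketch below shows the same pivot on a tiny made-up dataframe, and measures how many cells end up empty. In the real data there are 100,000 ratings spread over 943 users and 1,682 movies, so roughly 94% of the cells are NaN.

```python
import pandas as pd

# Hypothetical toy ratings: three users, three movies, four ratings.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [5, 3, 4, 2],
})

# Rows become users, columns become movies; unrated pairs are NaN.
matrix = toy.pivot_table(values="rating", index="userId", columns="movieId")
print(matrix)

# Fraction of user/movie pairs without a rating.
sparsity = matrix.isna().to_numpy().mean()
print(sparsity)
```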

In [4]:
users_ratings_matrix = pd.pivot_table(ratings_df,values='rating',index='userId',columns='movieId')
users_ratings_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
userId
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


In [5]:
adjusted_matrix = pd.pivot_table(ratings_df,values='rating_centered',index='userId',columns='movieId')
adjusted_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
userId
1,1.389706,-0.610294,0.389706,-0.610294,-0.610294,1.389706,0.389706,-2.610294,1.389706,-0.610294,...,,,,,,,,,,
2,0.290323,,,,,,,,,-1.709677,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,1.125714,0.125714,,,,,,,,,...,,,,,,,,,,


## Filling null values

Here I try two ways of filling the null values in the pivot table, although only one of them is used when recommending movies. The first one fills each null value with the movie average (of the centered ratings): since what we are recommending are movies, the movie average is more representative than the user average, so this is the pivot table we will use for the recommendations. The second one fills each null value with the user average.
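
A minimal sketch of both fills on a made-up 3x3 matrix of centered ratings (hypothetical values). It also shows why the user-average fill yields values that are essentially zero: each row is already centered on that user's mean, so the mean of the observed entries in a row is 0 up to floating-point error.

```python
import numpy as np
import pandas as pd

# Hypothetical centered ratings: rows are users, columns are movies.
centered = pd.DataFrame(
    [[1.0, -1.0, np.nan],
     [np.nan, 0.5, -0.5],
     [-2.0, np.nan, 2.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Movie (column) average: fillna with a Series aligns on the column labels.
by_movie = centered.fillna(centered.mean(axis=0))

# User (row) average: transpose so the same column-wise fill applies to rows.
by_user = centered.T.fillna(centered.mean(axis=1)).T

print(by_movie)
print(by_user)
```

The transpose trick is just one way to fill row-wise; the row-by-row apply used in the next cells gives the same result.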

In [6]:
# Replacing NaN by Movie Average
final_movie = adjusted_matrix.fillna(adjusted_matrix.mean(axis=0))
final_movie.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
userId
1,1.389706,-0.610294,0.389706,-0.610294,-0.610294,1.389706,0.389706,-2.610294,1.389706,-0.610294,...,-1.147059,-0.137056,-0.45933,-1.45933,-0.211982,-2.121495,-0.121495,-1.121495,0.019337,-0.365931
2,0.290323,-0.253455,-0.406476,-0.02917,-0.206708,0.099592,0.241369,0.370904,0.316282,-1.709677,...,-1.147059,-0.137056,-0.45933,-1.45933,-0.211982,-2.121495,-0.121495,-1.121495,0.019337,-0.365931
3,0.299264,-0.253455,-0.406476,-0.02917,-0.206708,0.099592,0.241369,0.370904,0.316282,0.251461,...,-1.147059,-0.137056,-0.45933,-1.45933,-0.211982,-2.121495,-0.121495,-1.121495,0.019337,-0.365931
4,0.299264,-0.253455,-0.406476,-0.02917,-0.206708,0.099592,0.241369,0.370904,0.316282,0.251461,...,-1.147059,-0.137056,-0.45933,-1.45933,-0.211982,-2.121495,-0.121495,-1.121495,0.019337,-0.365931
5,1.125714,0.125714,-0.406476,-0.02917,-0.206708,0.099592,0.241369,0.370904,0.316282,0.251461,...,-1.147059,-0.137056,-0.45933,-1.45933,-0.211982,-2.121495,-0.121495,-1.121495,0.019337,-0.365931


In [7]:
# Replacing NaN by user Average
adjusted_matrix_filled_user = adjusted_matrix.apply(lambda row: row.fillna(row.mean()), axis=1)
adjusted_matrix_filled_user.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
userId
1,1.389706,-0.6102941,0.3897059,-0.6102941,-0.6102941,1.389706,0.3897059,-2.610294,1.389706,-0.6102941,...,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16,2.579636e-16
2,0.2903226,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,-1.709677,...,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16,4.655774e-16
3,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,...,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16,1.151342e-16
4,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,...,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16,2.960595e-16
5,1.125714,0.1257143,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,...,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16,2.131628e-16


## User similarity

What is interesting about the following two matrices (or pivot tables, whichever you prefer to call them) is that they hold the similarity between users, which is crucial when recommending a movie to someone: people whose likes and dislikes are similar to yours will probably like or dislike the same movies as you.
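
The following sketch builds a user-by-user similarity matrix for three made-up users (hypothetical values) and looks up each user's closest neighbour. It assumes the matrix has already been filled, since cosine_similarity does not accept NaN.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical filled, centered ratings: one row per user.
filled = pd.DataFrame(
    [[1.0, -1.0, 0.0],
     [0.9, -1.1, 0.1],
     [1.0, 0.0, 1.0]],
    index=pd.Index([1, 2, 3], name="userId"),
)

sim = cosine_similarity(filled)      # shape (n_users, n_users)
np.fill_diagonal(sim, 0)             # ignore each user's similarity to themselves
sim_df = pd.DataFrame(sim, index=filled.index, columns=filled.index)

# Label of the most similar other user, per row.
most_similar = sim_df.idxmax(axis=1)
print(sim_df.round(3))
print(most_similar)
```

One caveat of zeroing the diagonal: if all of a user's similarities were negative, the zero on the diagonal would be the row maximum, so idxmax would return the user themselves.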

In [8]:
# User similarity after replacing NaN by the user average.
# Note: despite its name, `distances` holds cosine similarities (higher means more alike).
distances = cosine_similarity(adjusted_matrix_filled_user)
# Fill the diagonal with zeros, since we are not interested in the similarity of a user to themselves.
np.fill_diagonal(distances, 0)
similarity_with_user = pd.DataFrame(distances,index=adjusted_matrix_filled_user.index, columns=adjusted_matrix_filled_user.index)
similarity_with_user.head()

userId,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
userId
1,0.0,0.043411,0.011051,0.059303,0.134514,0.103373,0.110556,0.180891,0.012253,-0.000621,...,0.025835,-0.047952,0.087224,0.007718,0.074378,0.078714,0.067433,0.02879,-0.03127,0.032123
2,0.043411,0.0,0.013658,-0.017016,0.03577,0.094503,0.089408,0.05564,0.027294,0.097846,...,0.012853,-0.028798,0.056659,0.197835,0.090009,0.032505,0.015053,-0.017344,0.012068,0.039173
3,0.011051,0.013658,0.0,-0.059638,0.016037,-0.017158,0.016141,0.041177,-0.010093,0.023856,...,0.001615,0.000658,-0.006888,0.036157,-0.018513,-0.00624,-0.023907,0.034414,-0.009187,0.001489
4,0.059303,-0.017016,-0.059638,0.0,0.007373,-0.053929,-0.025604,0.136046,0.016082,-0.013588,...,0.011895,0.002174,-0.028,-0.025021,0.022882,-0.00596,0.279818,0.258594,0.064504,-0.019222
5,0.134514,0.03577,0.016037,0.007373,0.0,0.038484,0.067874,0.140106,0.010195,0.014335,...,0.070014,-0.070821,0.024278,0.038672,0.093567,0.051782,0.02954,0.036234,0.043318,0.099324


In [9]:
# User similarity after replacing NaN by the item (movie) average.
# The array is not called cosine_distances, to avoid clashing with the sklearn function of that name.
movie_fill_similarity = cosine_similarity(final_movie)
# As with the previous matrix, fill the diagonal with zeros
np.fill_diagonal(movie_fill_similarity, 0)
similarity_with_movie = pd.DataFrame(movie_fill_similarity, index=final_movie.index, columns=final_movie.index)
similarity_with_movie.head()

userId,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
userId
1,0.0,0.843356,0.826035,0.857827,0.764312,0.779988,0.713977,0.853533,0.855259,0.816118,...,0.780533,0.83732,0.818577,0.83578,0.794697,0.851909,0.817304,0.859819,0.831436,0.745902
2,0.843356,0.0,0.927383,0.956761,0.843712,0.872662,0.804099,0.941021,0.956912,0.933291,...,0.884964,0.946499,0.906733,0.951505,0.885671,0.952297,0.909712,0.961835,0.947336,0.857147
3,0.826035,0.927383,0.0,0.93998,0.82725,0.852937,0.779676,0.923743,0.939207,0.917426,...,0.867559,0.930667,0.889823,0.926949,0.868076,0.934073,0.892905,0.94632,0.924428,0.839377
4,0.857827,0.956761,0.93998,0.0,0.855949,0.879266,0.8018,0.959257,0.974333,0.94694,...,0.898505,0.964633,0.919858,0.953728,0.900577,0.967555,0.939123,0.98232,0.963425,0.868326
5,0.764312,0.843712,0.82725,0.855949,0.0,0.768636,0.706424,0.844057,0.854108,0.82936,...,0.791333,0.837326,0.806724,0.840056,0.797082,0.851854,0.811623,0.858221,0.843267,0.771807


## Calculation of neighbors

To keep the program computationally efficient, I decided to use only the 40 users most similar to the one we pick when recommending a movie to them, instead of going through every user each time we want to recommend a movie, which would increase the runtime considerably.

In [10]:
def find_n_neighbours(df, n):
    # For each user (row), sort the similarities in descending order and keep
    # the userIds of the n most similar users as columns top1..topn.
    df = df.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:n].index,
                                      index=['top{}'.format(i) for i in range(1, n+1)]), axis=1)
    return df

In [11]:
similar_users_40_m = find_n_neighbours(similarity_with_movie,40)
similar_users_40_m.head()

userId,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10,...,top31,top32,top33,top34,top35,top36,top37,top38,top39,top40
1,225,549,895,266,105,800,594,926,384,769,...,247,171,477,17,33,191,754,701,355,685
2,384,33,849,888,800,171,252,482,651,728,...,926,355,513,646,571,905,583,631,485,34
3,33,810,687,47,191,266,284,155,512,369,...,220,359,819,688,278,895,384,431,594,827
4,849,888,431,827,631,33,800,384,876,941,...,920,29,105,140,740,220,47,855,477,132
5,584,369,728,384,800,571,319,849,565,549,...,191,594,414,441,855,876,400,482,140,558
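
find_n_neighbours sorts one row at a time with apply, which can be slow on a 943x943 matrix. Below is a vectorised sketch of the same idea using np.argsort. The name top_n_neighbours and the toy similarity values are hypothetical, and ties may be ordered differently than with sort_values.

```python
import numpy as np
import pandas as pd

def top_n_neighbours(sim_df, n):
    """Return, for each row, the column labels of the n largest values."""
    # Sorting the negated values puts each row in descending order.
    order = np.argsort(-sim_df.to_numpy(), axis=1)[:, :n]
    labels = sim_df.columns.to_numpy()[order]
    return pd.DataFrame(labels, index=sim_df.index,
                        columns=[f"top{i}" for i in range(1, n + 1)])

# Hypothetical similarity matrix for four users (diagonal already zeroed).
toy_sim = pd.DataFrame(
    [[0.0, 0.9, 0.5, 0.1],
     [0.9, 0.0, 0.2, 0.4],
     [0.5, 0.2, 0.0, 0.7],
     [0.1, 0.4, 0.7, 0.0]],
    index=[1, 2, 3, 4], columns=[1, 2, 3, 4],
)
neighbours = top_n_neighbours(toy_sim, 2)
print(neighbours)
```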


## Getting to know which movies each user saw

Here we group the ratings dataframe by userId and build, for each user, a comma-separated string with the ids of the movies that user has seen (rated).

In [12]:
ratings_df = ratings_df.astype({"movieId": str})
Movies_per_user = ratings_df.groupby(by = 'userId')['movieId'].apply(lambda x:','.join(x))
Movies_per_user.head(10)

userId
1     61,189,33,160,20,202,171,265,155,117,47,222,25...
2     292,251,50,314,297,290,312,281,13,280,303,308,...
3     335,245,337,343,323,331,294,332,328,334,350,34...
4     264,303,361,357,260,356,294,288,50,354,271,300...
5     2,17,439,225,110,454,424,1,363,98,102,211,382,...
6     86,14,98,463,301,258,69,517,23,492,478,508,469...
7     32,479,455,382,163,430,497,492,661,648,378,200...
8     338,550,22,50,182,79,294,457,385,89,190,686,30...
9     298,691,521,487,286,6,479,340,527,507,276,615,...
10    16,486,175,611,7,100,461,488,285,504,289,340,5...
Name: movieId, dtype: object
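
The comma-separated strings are later split again to find the movies a user has not seen yet. As an alternative sketch (not what this notebook does), the same information can be kept as one Python set per user, which makes the "seen by a neighbour but not by me" step a plain set difference. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy ratings: which user rated which movie.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 30, 40],
})

# One set of movieIds per user.
seen = toy.groupby("userId")["movieId"].apply(set)

# Movies user 2 has seen that user 1 has not.
candidates = seen.loc[2] - seen.loc[1]
print(candidates)
```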

## Function that combines everything we previously calculated and provides recommendations

Each step is explained in the function.

In [13]:
def User_recommendations(user):
    # Getting the different movies seen by the user (specified on the function)
    Movie_seen_by_user = users_ratings_matrix.columns[users_ratings_matrix[users_ratings_matrix.index==user].notna().any()].tolist()
    # Getting the values of the userIds of the top 40 users similar to the one we are looking for
    userIds_similar = similar_users_40_m[similar_users_40_m.index==user].values
    list_with_similar_userIds = userIds_similar.squeeze().tolist()
    # Getting the different movies seen by the users in list_with_similar_userIds
    d = Movies_per_user[Movies_per_user.index.isin(list_with_similar_userIds)]
    list_of_movies_separated_by_commas = ','.join(d.values)
    Movies_seen_by_similar_users = list_of_movies_separated_by_commas.split(',')
    # Here we are just calculating the movies which we need to consider. They are the ones seen by similar users minus the ones the user already has seen.
    Movies_to_consider = list(set(Movies_seen_by_similar_users)-set(list(map(str, Movie_seen_by_user))))
    Movies_to_consider = list(map(int, Movies_to_consider))
    
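    # We initialize a list called score, which will store the different scores for the movies under consideration.
    score = []
    
    # For each candidate movie, this for loop does the following:
        # First it takes the adjusted (centered) scores of the similar users for that movie. Since final_movie has already been
        # filled with the movie averages, the notnull() filter keeps every neighbour, whether or not they rated the movie.
        # Then it looks up the similarity (called correlation here) between each of those users and the target user.
        # Finally, the similarity-weighted average of the adjusted scores is added to the target user's average rating to obtain the final score for that movie.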
    
    for item in Movies_to_consider:
        c = final_movie.loc[:,item]
        d = c[c.index.isin(list_with_similar_userIds)]
        f = d[d.notnull()]
        avg_user = average_df.loc[average_df['userId'] == user,'average'].values[0]
        index = f.index.values.squeeze().tolist()
        correlation = similarity_with_movie.loc[user,index]
        fin = pd.concat([f, correlation], axis=1)
        fin.columns = ['adg_score','correlation']
        fin['score']=fin.apply(lambda x:x['adg_score'] * x['correlation'],axis=1)
        nume = fin['score'].sum()
        deno = fin['correlation'].sum()
        final_score = avg_user + (nume/deno)
        score.append(final_score)
    
    # After calculating the score of each movie, we create a dataframe with the movies and their scores, sort it in descending order and pick the top 10.
    data = pd.DataFrame({'movieId':Movies_to_consider,'score':score})
    top_10_recommendation = data.sort_values(by='score',ascending=False).head(10)
    Movie_Name = top_10_recommendation.merge(movies_df, how='inner', on='movieId')
    Movie_Names = Movie_Name.title.values.tolist()
    
    return  Movie_Names
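
The score computed inside the loop of User_recommendations is the user's average rating plus a similarity-weighted average of the neighbours' centered ratings. The sketch below works through that arithmetic for one movie with three neighbours; all numbers are made up for illustration.

```python
import numpy as np

# Hypothetical inputs for one target user and one candidate movie.
avg_user = 3.6                                     # target user's average rating
neighbour_centered = np.array([1.0, -0.5, 0.5])    # neighbours' centered ratings
neighbour_similarity = np.array([0.9, 0.8, 0.7])   # similarity to the target user

numerator = np.sum(neighbour_centered * neighbour_similarity)   # 0.85
denominator = np.sum(neighbour_similarity)                      # 2.4
predicted = avg_user + numerator / denominator
print(round(predicted, 4))
```

A common variant divides by the sum of the absolute similarities instead, which keeps the result well behaved when some similarities are negative. The function above uses the plain sum.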

In [14]:
print("The top-10 recommendations for user", 1, "are :", User_recommendations(1))

The top-10 recommendations for user 1 are : ['Close Shave, A (1995)', "Schindler's List (1993)", 'Casablanca (1942)', 'Titanic (1997)', 'As Good As It Gets (1997)', 'Boot, Das (1981)', 'Rear Window (1954)', 'L.A. Confidential (1997)', 'Secrets & Lies (1996)', "One Flew Over the Cuckoo's Nest (1975)"]


In [15]:
print("The top-10 recommendations for user", 196, "are :", User_recommendations(196))

The top-10 recommendations for user 196 are : ['Close Shave, A (1995)', 'Wrong Trousers, The (1993)', "Schindler's List (1993)", 'Shawshank Redemption, The (1994)', 'Usual Suspects, The (1995)', 'Wallace & Gromit: The Best of Aardman Animation (1996)', 'Good Will Hunting (1997)', 'Casablanca (1942)', 'Star Wars (1977)', 'As Good As It Gets (1997)']


In [16]:
print("The top-10 recommendations for user", 880, "are :", User_recommendations(880))

The top-10 recommendations for user 880 are : ['Wallace & Gromit: The Best of Aardman Animation (1996)', 'Titanic (1997)', 'Boot, Das (1981)', 'Secrets & Lies (1996)', 'Sling Blade (1996)', 'To Kill a Mockingbird (1962)', 'Alien (1979)', 'Kolya (1996)', 'Cinema Paradiso (1988)', 'Three Colors: Red (1994)']


# Using ALS to solve the problem

Here I use the ALS code we have seen in class to find the users most similar to the one we are looking for, and then run again a variant of the recommendation function defined above. The only difference from the code we had in class is that, since we want recommendations based on user-based rather than product-based collaborative filtering, the cosine distance is calculated on X (the user latent factors). This gives us the users most similar to the one we want. If we used Y.T, as in the example seen in class, we would be comparing the movies instead of the users.
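
To show what one ALS iteration does, here is a minimal sketch on a small random matrix (hypothetical data; the names R, lam and history are not from the notebook). Like the cell below, it treats every cell of the matrix as observed, which is a simplification compared with ALS variants that fit only the rated entries. With Y fixed, the best X is the ridge solution X = R Yᵀ (Y Yᵀ + λI)⁻¹, and symmetrically for Y with X fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 5))      # hypothetical dense "ratings" matrix
k, lam = 2, 0.1                  # latent factors and regularisation strength
X = rng.normal(size=(6, k))      # user factors
Y = rng.normal(size=(k, 5))      # item factors

def loss(R, X, Y, lam):
    # Squared reconstruction error plus L2 penalty on both factor matrices.
    return np.sum((R - X @ Y) ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))

history = [loss(R, X, Y, lam)]
for _ in range(10):
    # Fix Y and solve for all user rows at once: X = R Y^T (Y Y^T + lam I)^-1
    X = np.linalg.solve(Y @ Y.T + lam * np.eye(k), Y @ R.T).T
    # Fix X and solve for all item columns at once: Y = (X^T X + lam I)^-1 X^T R
    Y = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ R)
    history.append(loss(R, X, Y, lam))

print(history[0], history[-1])
```

Each half-step minimises the loss exactly with the other factor held fixed, so the loss cannot increase from one iteration to the next. That is what makes a ratio-based stopping rule like the one in the cell below workable.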

In [17]:
from sklearn.metrics.pairwise import cosine_distances
nUsers = ratings_df.userId.unique().size
nMovies = ratings_df.movieId.unique().size

ratingsM = np.zeros((nUsers+1, nMovies+1))
for index, rating in ratings_df.iterrows():
    ratingsM[int(rating.userId), int(rating.movieId)] = rating.rating_centered
    
#print(ratingsM)

# Metaparameters
k = 100        # number of latent factors
l = 0.1        # lambda. The same value for x and y
accuracy = 0.999

# X and Y initialization
np.random.seed(42)
X = np.random.normal(size=(ratingsM.shape[0], k))
Y = np.random.normal(size=(k, ratingsM.shape[1]))

converged = False
pL = np.inf    # previous loss (the np.Inf alias was removed in NumPy 2.0)
while not converged:
    y = Y.T
    inv = np.linalg.inv(y.T.dot(y) + l*np.eye(k))
    for u in range(0, X.shape[0]):
        X[u] = ratingsM[u,:].dot(y).dot(inv)
    
    inv = np.linalg.inv(X.T.dot(X) + l*np.eye(k))    
    for i in range(0, Y.shape[1]):
        Y[:,i] = ratingsM[:,i].dot(X).dot(inv)
        
    L = np.square(ratingsM - X.dot(Y)).sum()
    L = L + l * (np.square(np.linalg.norm(X)) + np.square(np.linalg.norm(Y)))
                     
    # Improvement stop criteria
    converged = (L / pL) > accuracy
    
    pL = L
 
# Let's make predictions
# Get the distance matrix from the users' latent factors (X)
myuser = 196
print("My user is: ", ratings_df[ratings_df.userId == myuser].userId.iloc[0])
print()

distances = cosine_distances(X)

# Print the 10 users closest to my user (position 0 is normally the user itself, at distance 0, so it is skipped)
distancesSortedIx = np.argsort(distances[myuser])
similar_users = []
for i in range(1, 11):
    userId = distancesSortedIx[i]
    similar_users.append(userId)
print(similar_users)

def User_recommendations_2(user, list_of_users):
    Movie_seen_by_user = users_ratings_matrix.columns[users_ratings_matrix[users_ratings_matrix.index==user].notna().any()].tolist()
    d = Movies_per_user[Movies_per_user.index.isin(list_of_users)]
    l = ','.join(d.values)
    Movies_seen_by_similar_users = l.split(',')
    Movies_under_consideration = list(set(Movies_seen_by_similar_users)-set(list(map(str, Movie_seen_by_user))))
    Movies_under_consideration = list(map(int, Movies_under_consideration))
    
    score = []
    for item in Movies_under_consideration:
        c = final_movie.loc[:,item]
        d = c[c.index.isin(list_of_users)]
        f = d[d.notnull()]
        avg_user = average_df.loc[average_df['userId'] == user,'average'].values[0]
        index = f.index.values.squeeze().tolist()
        corr = similarity_with_movie.loc[user,index]
        fin = pd.concat([f, corr], axis=1)
        fin.columns = ['adg_score','correlation']
        fin['score']=fin.apply(lambda x:x['adg_score'] * x['correlation'],axis=1)
        nume = fin['score'].sum()
        deno = fin['correlation'].sum()
        final_score = avg_user + (nume/deno)
        score.append(final_score)
    data = pd.DataFrame({'movieId':Movies_under_consideration,'score':score})
    top_10_recommendation = data.sort_values(by='score',ascending=False).head(10)
    Movie_Name = top_10_recommendation.merge(movies_df, how='inner', on='movieId')
    Movie_Names = Movie_Name.title.values.tolist()
    
    return  Movie_Names

print("The top-10 recommendations for user", 196, "are :", User_recommendations_2(196, similar_users))


My user is:  196

[120, 308, 97, 845, 720, 147, 228, 420, 906, 98]
The top-10 recommendations for user 196 are : ['Wrong Trousers, The (1993)', 'Usual Suspects, The (1995)', "Schindler's List (1993)", 'Shawshank Redemption, The (1994)', 'Star Wars (1977)', 'Close Shave, A (1995)', 'Fresh (1994)', 'Good Will Hunting (1997)', '12 Angry Men (1957)', 'Citizen Kane (1941)']


In [18]:
myuser = 1
print("My user is: ", ratings_df[ratings_df.userId == myuser].userId.iloc[0])
print()

distances = cosine_distances(X)

# Print the 10 users closest to my user
distancesSortedIx = np.argsort(distances[myuser])
similar_users = []
for i in range(1, 11):
    userId = distancesSortedIx[i]
    similar_users.append(userId)
print(similar_users)

# User_recommendations_2 was already defined in cell In [17] and is reused here unchanged.

print("The top-10 recommendations for user", 1, "are :", User_recommendations_2(1, similar_users))


My user is:  1

[691, 628, 566, 803, 550, 315, 598, 340, 754, 331]
The top-10 recommendations for user 1 are : ['Titanic (1997)', "Schindler's List (1993)", 'Casablanca (1942)', 'Rear Window (1954)', 'L.A. Confidential (1997)', 'Secrets & Lies (1996)', 'Lawrence of Arabia (1962)', 'Third Man, The (1949)', 'North by Northwest (1959)', 'Before the Rain (Pred dozhdot) (1994)']


In [19]:
myuser = 880
print("My user is: ", ratings_df[ratings_df.userId == myuser].userId.iloc[0])
print()

distances = cosine_distances(X)

# Print the 10 users closest to my user
distancesSortedIx = np.argsort(distances[myuser])
similar_users = []
for i in range(1, 11):
    userId = distancesSortedIx[i]
    similar_users.append(userId)
print(similar_users)

# User_recommendations_2 was already defined in cell In [17] and is reused here unchanged.

print("The top-10 recommendations for user", 880, "are :", User_recommendations_2(880, similar_users))

My user is:  880

[43, 715, 910, 94, 160, 379, 903, 548, 243, 766]
The top-10 recommendations for user 880 are : ['Close Shave, A (1995)', 'Wallace & Gromit: The Best of Aardman Animation (1996)', 'Wrong Trousers, The (1993)', 'Titanic (1997)', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)', 'North by Northwest (1959)', 'Casablanca (1942)', '12 Angry Men (1957)', 'Patton (1970)', 'Chinatown (1974)']


## Conclusion

As can be seen, both algorithms lead to fairly similar recommendations, except for user 880: the two top-10 lists share 6 titles for user 1 and 7 for user 196, but only 2 for user 880. One of the reasons might be that the number of similar users taken into account differs between the two approaches (40 in the first one, 10 in the ALS one). This is what I expected; I mainly wanted to see how different the movie scores would be when fewer similar users are taken into account.
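
One simple way to quantify how similar two recommendation lists are is to count the titles they share. The sketch below does this for user 880, using the two lists printed above; the helper name overlap is hypothetical.

```python
def overlap(first, second):
    """Number of titles that appear in both recommendation lists."""
    return len(set(first) & set(second))

# Top-10 for user 880 from the neighbourhood approach (cell In [16]).
neighbourhood_880 = [
    'Wallace & Gromit: The Best of Aardman Animation (1996)', 'Titanic (1997)',
    'Boot, Das (1981)', 'Secrets & Lies (1996)', 'Sling Blade (1996)',
    'To Kill a Mockingbird (1962)', 'Alien (1979)', 'Kolya (1996)',
    'Cinema Paradiso (1988)', 'Three Colors: Red (1994)',
]
# Top-10 for user 880 from the ALS approach (cell In [19]).
als_880 = [
    'Close Shave, A (1995)', 'Wallace & Gromit: The Best of Aardman Animation (1996)',
    'Wrong Trousers, The (1993)', 'Titanic (1997)',
    'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
    'North by Northwest (1959)', 'Casablanca (1942)', '12 Angry Men (1957)',
    'Patton (1970)', 'Chinatown (1974)',
]

shared = overlap(neighbourhood_880, als_880)
print(shared, "of 10 titles are shared")
```

This only compares which titles appear, not their order or scores, so it is a rough measure of agreement.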