# Recommendation System Notebook
- User based recommendation
- User based prediction & evaluation
- Item based recommendation
- Item based prediction & evaluation

Different approaches to building a recommendation system:

1. Demographic-based Recommendation System

2. Content Based Recommendation System

3. Collaborative filtering Recommendation System

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Read the MovieLens ratings file from GitHub.
ratings = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/ratings_final.csv' , encoding='latin-1')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [3]:
ratings['userId'].nunique()

2071

In [4]:
ratings['movieId'].nunique()

14657

In [5]:
ratings.shape

(300126, 4)

## Dividing the dataset into train and test

In [6]:
# Test and Train split of the dataset.
from sklearn.model_selection import train_test_split
train, test = train_test_split(ratings, test_size=0.30, random_state=31)

In [7]:
print(train.shape)
print(test.shape)

(210088, 4)
(90038, 4)


In [6]:
# Pivot the train ratings into a user-movie matrix: rows are user IDs, columns are movie IDs.
df_pivot = train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)

df_pivot.head(3)

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
df_pivot.shape

(2071, 12911)

### Creating dummy train & dummy test dataset
These datasets are used as masks for prediction and evaluation.
- Dummy train is used later to predict ratings for movies the user has not rated. Movies the user has already rated are marked 0 so that they are ignored during prediction; movies the user has not rated are marked 1.

- Dummy test is used for evaluation. Here predictions are kept only for movies the user has rated, so those are marked 1 and everything else 0. This is the opposite of dummy_train.
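The two masks described above can be sketched on a tiny example. The `toy` frame and the `toy_dummy_*` names are hypothetical and only illustrate the idea; in the notebook, dummy_train is built from the train split and dummy_test from the test split, whereas here both come from the same frame purely to show that they are complements.

```python
import pandas as pd

# Toy ratings (hypothetical data, not taken from the MovieLens file).
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# True where the user has rated the movie, False otherwise.
rated = toy.pivot(index="userId", columns="movieId", values="rating").notna()

# Prediction mask: 1 where the user has NOT rated the movie, 0 where they have.
toy_dummy_train = (~rated).astype(int)

# Evaluation mask: 1 where the user HAS rated the movie, 0 elsewhere.
toy_dummy_test = rated.astype(int)

print(toy_dummy_train)
print(toy_dummy_test)
```

Multiplying a score matrix element-wise by one of these masks keeps only the entries of interest and zeroes out the rest.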

In [72]:
train.head()

Unnamed: 0,userId,movieId,rating,timestamp
285695,1986,5618,5.0,1468556487
216175,1525,1357,3.0,860937440
223915,1591,2571,5.0,1446621026
200924,1418,2060,3.0,1034922803
217719,1536,1704,4.0,1225328446


In [73]:
train['rating'].unique()

array([5. , 3. , 4. , 2. , 2.5, 3.5, 1.5, 0.5, 4.5, 1. ])

In [75]:
train['rating'].isnull().sum()

0

In [19]:
# Copy the train dataset into dummy_train
dummy_train = train.copy()

In [20]:
# Mark every rated movie as 0; unrated movies become 1 after the pivot below (fillna(1)).
dummy_train['rating'] = 0

In [21]:
# Convert the dummy train dataset into matrix format.
dummy_train = dummy_train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(1)

In [22]:
dummy_train.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Cosine Similarity**

Cosine similarity measures how similar two vectors are (here, two users' rating vectors) as the cosine of the angle between them.
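As a minimal sketch, cosine similarity is the dot product of the two vectors divided by the product of their norms. The helper `cosine_similarity` and the vectors `a` and `b` below are hypothetical and written only for illustration; the notebook itself uses `pairwise_distances` from scikit-learn.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # A user with no ratings has an all-zero vector; treat the similarity as 0.
    return float(u @ v / denom) if denom else 0.0

# Hypothetical rating vectors over three movies (0 = not rated).
a = [5.0, 3.0, 0.0]
b = [4.0, 0.0, 2.0]
print(cosine_similarity(a, b))
```

Since ratings are non-negative, the plain cosine similarity of two raw rating vectors lies between 0 and 1.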

**Adjusted Cosine**

Adjusted cosine similarity modifies cosine similarity to account for the fact that users rate on different scales: some users rate most items highly, while others tend to give lower ratings. To remove this bias, we subtract each user's average rating from that user's ratings before computing the similarity.
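The effect of mean-centering can be seen in a small sketch. The matrix `toy` and the helper `cosine` are hypothetical; users 1 and 2 follow the same preference pattern but rate on different scales.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 3-movie matrix; NaN marks "not rated".
toy = pd.DataFrame(
    [[5.0, 4.0, np.nan],
     [2.0, 1.0, np.nan],
     [np.nan, 3.0, 5.0]],
    index=[1, 2, 3], columns=[10, 20, 30],
)

# Subtract each user's mean, computed only over the movies they rated
# (DataFrame.mean skips NaN by default).
centered = toy.sub(toy.mean(axis=1), axis=0)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

raw = toy.fillna(0).to_numpy()
adjusted = centered.fillna(0).to_numpy()
raw_sim = cosine(raw[0], raw[1])
adjusted_sim = cosine(adjusted[0], adjusted[1])
print("raw cosine, users 1 and 2:     ", raw_sim)
print("adjusted cosine, users 1 and 2:", adjusted_sim)
```

After centering, both users have the same deviation pattern, so their adjusted similarity is higher than the raw one even though one of them rates everything lower.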



# User Similarity Matrix

## Using Cosine Similarity

In [15]:
from sklearn.metrics.pairwise import pairwise_distances

# Create the user similarity matrix using the pairwise_distances function.
user_correlation = 1 - pairwise_distances(df_pivot, metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

[[1.         0.02834514 0.04260006 ... 0.02010714 0.         0.02016493]
 [0.02834514 1.         0.12915063 ... 0.16690495 0.         0.11004906]
 [0.04260006 0.12915063 1.         ... 0.17826473 0.         0.04473138]
 ...
 [0.02010714 0.16690495 0.17826473 ... 1.         0.00286873 0.09695713]
 [0.         0.         0.         ... 0.00286873 1.         0.01475374]
 [0.02016493 0.11004906 0.04473138 ... 0.09695713 0.01475374 1.        ]]


In [16]:
user_correlation.shape

(2071, 2071)

## Using adjusted Cosine 

### Here, NaN values are kept so that each user's mean is computed only over the movies that user rated

In [17]:
# Create a user-movie matrix.
df_pivot = train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
)

In [18]:
df_pivot.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,3.5,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,3.0,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [19]:
df_pivot.shape

(2071, 12911)

### Normalising each user's ratings around a mean of 0

In [20]:
mean = np.nanmean(df_pivot, axis=1)
mean.shape

(2071,)

In [21]:
df_subtracted = (df_pivot.T-mean).T

In [22]:
df_subtracted.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,-0.169355,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,-0.368421,,,,,,,,,,...,,,,,,,,,,
5,0.105263,,,,,,,,,,...,,,,,,,,,,


### Finding cosine similarity

In [23]:
from sklearn.metrics.pairwise import pairwise_distances

In [24]:
# Create the user similarity matrix using the pairwise_distances function.
user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

[[ 1.00000000e+00  1.89942155e-03  1.65714482e-02 ...  2.02835872e-02
   0.00000000e+00  4.45393263e-03]
 [ 1.89942155e-03  1.00000000e+00  2.43906478e-02 ...  3.46154488e-02
   0.00000000e+00 -6.29642259e-04]
 [ 1.65714482e-02  2.43906478e-02  1.00000000e+00 ...  1.12606845e-01
   0.00000000e+00  1.72034212e-02]
 ...
 [ 2.02835872e-02  3.46154488e-02  1.12606845e-01 ...  1.00000000e+00
   1.50194969e-02  4.05549027e-03]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  1.50194969e-02
   1.00000000e+00 -3.89401922e-02]
 [ 4.45393263e-03 -6.29642259e-04  1.72034212e-02 ...  4.05549027e-03
  -3.89401922e-02  1.00000000e+00]]


## Prediction - User User

We predict using only the users who are positively correlated with the current user, since those are the users most similar to them. Negative correlations are therefore set to 0.
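A small sketch of this step on a hypothetical similarity matrix `sim`. `np.clip` is used here as an out-of-place alternative to boolean-mask assignment, so the original matrix stays untouched.

```python
import numpy as np

# Hypothetical similarity matrix with one negatively correlated pair.
sim = np.array([
    [ 1.0,  0.3, -0.2],
    [ 0.3,  1.0,  0.1],
    [-0.2,  0.1,  1.0],
])

# Set every negative similarity to 0, leaving `sim` itself unchanged.
sim_clipped = np.clip(sim, 0, None)
print(sim_clipped)
```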

In [25]:
user_correlation[user_correlation<0]=0
user_correlation

array([[1.        , 0.00189942, 0.01657145, ..., 0.02028359, 0.        ,
        0.00445393],
       [0.00189942, 1.        , 0.02439065, ..., 0.03461545, 0.        ,
        0.        ],
       [0.01657145, 0.02439065, 1.        , ..., 0.11260685, 0.        ,
        0.01720342],
       ...,
       [0.02028359, 0.03461545, 0.11260685, ..., 1.        , 0.0150195 ,
        0.00405549],
       [0.        , 0.        , 0.        , ..., 0.0150195 , 1.        ,
        0.        ],
       [0.00445393, 0.        , 0.01720342, ..., 0.00405549, 0.        ,
        1.        ]])

In [27]:
user_correlation.shape

(2071, 2071)

In [28]:
df_pivot.shape

(2071, 12911)

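The predicted score of a user for a movie (whether rated or not) is the sum of all users' ratings for that movie, weighted by their correlation with the user. Because the sum is not divided by the total correlation, the result is a ranking score rather than a value on the rating scale.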
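The weighted sum can be traced by hand on a tiny example. All arrays below are hypothetical. The normalised variant at the end is an assumption added for comparison and is not what the notebook computes; it divides by the total similarity of the users who rated each movie, which keeps the scores on the rating scale.

```python
import numpy as np

# Hypothetical data: 3 users x 3 users similarity, 3 users x 2 movies ratings (0 = not rated).
sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])
ratings_filled = np.array([
    [4.0, 0.0],
    [0.0, 2.0],
    [5.0, 3.0],
])

# Raw weighted sum, as in the notebook: score[u, m] = sum over v of sim[u, v] * r[v, m].
scores = sim @ ratings_filled

# Variant (an assumption, not in the notebook): divide by the summed similarity
# of the users who actually rated each movie.
weights = sim @ (ratings_filled > 0).astype(float)
normalized = np.divide(scores, weights, out=np.zeros_like(scores), where=weights > 0)

print(scores)
print(normalized)
```

The raw scores are sufficient for ranking movies per user, which is all the top-5 recommendation needs; the normalised form matters only when the score is compared directly with true ratings.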

In [29]:
user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
user_predicted_ratings

array([[9.55212958e+00, 3.29449494e+00, 1.47095436e+00, ...,
        0.00000000e+00, 0.00000000e+00, 7.45759814e-02],
       [4.27943869e+01, 1.49257424e+01, 7.10816723e+00, ...,
        1.13239244e-01, 1.13239244e-01, 5.83545467e-02],
       [6.61412412e+01, 2.27700115e+01, 1.08069083e+01, ...,
        3.65238676e-01, 3.65238676e-01, 2.34784706e-01],
       ...,
       [7.09804143e+01, 2.56071809e+01, 1.05493425e+01, ...,
        1.01014810e-01, 1.01014810e-01, 1.07246400e-01],
       [1.12478540e+01, 3.37613436e+00, 2.67961234e+00, ...,
        0.00000000e+00, 0.00000000e+00, 3.20864676e-03],
       [3.75005766e+01, 1.31940683e+01, 8.47280930e+00, ...,
        3.98994887e-02, 3.98994887e-02, 9.16359183e-02]])

In [30]:
user_predicted_ratings.shape

(2071, 12911)

Since we are interested only in the movies the user has not rated, we zero out the scores of the movies the user has already rated by multiplying with dummy_train.
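A minimal sketch of the masking step on hypothetical data, followed by a top-N lookup with `Series.nlargest`. The frames `scores` and `mask` stand in for the predicted ratings and for dummy_train.

```python
import pandas as pd

# Hypothetical predicted scores and prediction mask (1 = not yet rated).
scores = pd.DataFrame(
    [[9.5, 3.2, 1.4], [4.2, 7.1, 0.8]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)
mask = pd.DataFrame(
    [[0, 1, 1], [1, 0, 1]], index=scores.index, columns=scores.columns
)

# The element-wise product zeroes out movies the user has already rated.
final = scores * mask

# Top-2 unrated movies for user 1.
top = final.loc[1].nlargest(2)
print(top)
```

Movie 10 has the highest raw score for user 1 but is excluded because that user has already rated it.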

In [31]:
user_final_rating = np.multiply(user_predicted_ratings,dummy_train)
user_final_rating.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,9.55213,3.294495,1.470954,0.040189,1.020561,5.201704,1.659968,0.054285,0.349317,5.031108,...,0.0,0.003304,0.008125,0.0,0.033861,0.0,0.0,0.0,0.0,0.074576
2,0.0,14.925742,7.108167,0.908528,4.979097,14.034879,7.099211,0.785965,1.635084,19.88008,...,0.02464,0.061503,0.013683,0.0,0.054489,0.0,0.0176,0.113239,0.113239,0.058355
3,66.141241,22.770011,10.806908,1.206505,6.68577,34.141321,9.858038,0.669255,2.676321,32.180909,...,0.171026,0.100944,0.124693,0.080384,0.133438,0.317882,0.122161,0.365239,0.365239,0.234785
4,0.0,10.226084,2.658416,0.299828,2.207704,12.222046,3.007363,0.123162,0.998604,12.896807,...,0.324958,0.172648,0.064566,0.009314,0.211999,0.144026,0.232113,0.346767,0.346767,0.200223
5,0.0,19.867664,17.681491,2.762963,12.826641,32.774931,16.06281,1.801925,3.842096,35.779545,...,0.073628,0.0,0.025125,0.025513,0.044435,0.156123,0.052592,0.060123,0.060123,0.082602


### Finding the top 5 recommendations for the *user*

In [31]:
# Take the user ID as input.
user_input = int(input("Enter your user name"))
print(user_input)

Enter your user name3
3


In [32]:
user_final_rating.head(2)

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,9.55213,3.294495,1.470954,0.040189,1.020561,5.201704,1.659968,0.054285,0.349317,5.031108,...,0.0,0.003304,0.008125,0.0,0.033861,0.0,0.0,0.0,0.0,0.074576
2,0.0,14.925742,7.108167,0.908528,4.979097,14.034879,7.099211,0.785965,1.635084,19.88008,...,0.02464,0.061503,0.013683,0.0,0.054489,0.0,0.0176,0.113239,0.113239,0.058355


In [33]:
d = user_final_rating.loc[user_input].sort_values(ascending=False)[0:5]
d

movieId
1196    90.231482
47      82.637700
2858    81.097697
1198    78.595074
589     78.530892
Name: 3, dtype: float64

In [3]:
# Map movie IDs to titles and genres
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv')
movie_mapping.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [34]:
d = pd.merge(d,movie_mapping,left_on='movieId',right_on='movieId', how = 'left')
d.head()

Unnamed: 0,movieId,3,title,genres
0,1196,90.231482,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi
1,47,82.6377,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
2,2858,81.097697,American Beauty (1999),Drama|Romance
3,1198,78.595074,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure
4,589,78.530892,Terminator 2: Judgment Day (1991),Action|Sci-Fi


# Evaluation - User User 

Evaluation works the same way as the prediction above. The only difference is that we evaluate on movies the user has already rated instead of predicting for movies the user has not rated.

In [44]:
common_not = test[~test.userId.isin(train.userId)]
common_not.shape

(0, 4)

In [45]:
test.shape

(90038, 4)

In [46]:
# Find out the common users of test and train dataset.
common = test[test.userId.isin(train.userId)]
common.shape

(90038, 4)

In [47]:
common.head()

Unnamed: 0,userId,movieId,rating,timestamp
29643,226,3156,1.0,1059516139
152649,1074,2194,3.0,906133915
123175,886,4886,3.5,1168350634
23712,185,1101,4.0,1191923488
99726,757,908,3.5,1184016903


In [48]:
# convert into the user-movie matrix.
common_user_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating')
common_user_based_matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,204704,204780,205054,205106,205156,205383,205499,206499,206805,207309
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2067,,,,,,,,,,,...,,,,,,,,,,
2068,,,,,,,,,,,...,,,,,,,,,,
2069,,,,,,,,,,,...,,,,,,,,,,
2070,,,,,,,,,,,...,,,,,,,,,,


In [49]:
common_user_based_matrix.shape

(2071, 9529)

In [50]:
# Convert the user_correlation matrix into dataframe.
user_correlation_df = pd.DataFrame(user_correlation)

In [51]:
user_correlation_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2061,2062,2063,2064,2065,2066,2067,2068,2069,2070
0,1.0,0.001899,0.016571,0.0,0.0,0.0,0.039761,0.0,0.003991,0.0,...,0.0,0.0,0.000829,0.0,0.053598,0.015322,0.0,0.020284,0.0,0.004454
1,0.001899,1.0,0.024391,0.014429,0.009537,0.034863,0.0,0.0,0.0,0.003106,...,0.0,0.051286,0.036513,0.029803,0.0,0.045524,0.010638,0.034615,0.0,0.0
2,0.016571,0.024391,1.0,0.062999,0.040416,0.009671,0.0,0.017017,0.0,0.036063,...,0.0,0.061911,0.01747,0.083151,0.038958,0.060932,0.028651,0.112607,0.0,0.017203
3,0.0,0.014429,0.062999,1.0,0.0,0.011082,0.0,0.0,0.0,0.02775,...,0.0,0.013295,0.0,0.003963,0.0,0.022046,0.020043,0.045028,0.0,0.004144
4,0.0,0.009537,0.040416,0.0,1.0,0.051212,0.045804,0.088584,0.079276,0.133326,...,0.0,0.033921,0.103921,0.007723,0.0,0.081722,0.106743,0.014523,0.03769,0.009734


In [52]:
df_subtracted.head(5)

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,-0.169355,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,-0.368421,,,,,,,,,,...,,,,,,,,,,
5,0.105263,,,,,,,,,,...,,,,,,,,,,


In [53]:
user_correlation_df['userId'] = df_subtracted.index
user_correlation_df.set_index('userId',inplace=True)

user_correlation_df.columns = df_subtracted.index.tolist()
user_correlation_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,2062,2063,2064,2065,2066,2067,2068,2069,2070,2071
1,1.0,0.001899,0.016571,0.0,0.0,0.0,0.039761,0.0,0.003991,0.0,...,0.0,0.0,0.000829,0.0,0.053598,0.015322,0.0,0.020284,0.0,0.004454
2,0.001899,1.0,0.024391,0.014429,0.009537,0.034863,0.0,0.0,0.0,0.003106,...,0.0,0.051286,0.036513,0.029803,0.0,0.045524,0.010638,0.034615,0.0,0.0
3,0.016571,0.024391,1.0,0.062999,0.040416,0.009671,0.0,0.017017,0.0,0.036063,...,0.0,0.061911,0.01747,0.083151,0.038958,0.060932,0.028651,0.112607,0.0,0.017203
4,0.0,0.014429,0.062999,1.0,0.0,0.011082,0.0,0.0,0.0,0.02775,...,0.0,0.013295,0.0,0.003963,0.0,0.022046,0.020043,0.045028,0.0,0.004144
5,0.0,0.009537,0.040416,0.0,1.0,0.051212,0.045804,0.088584,0.079276,0.133326,...,0.0,0.033921,0.103921,0.007723,0.0,0.081722,0.106743,0.014523,0.03769,0.009734


In [55]:
user_correlation_df.shape

(2071, 2071)

In [56]:
common.head(1)

Unnamed: 0,userId,movieId,rating,timestamp
29643,226,3156,1.0,1059516139


In [57]:
list_name = common.userId.tolist()

user_correlation_df_1 =  user_correlation_df[user_correlation_df.index.isin(list_name)]

In [58]:
user_correlation_df_1.shape

(2071, 2071)

In [59]:
user_correlation_df_2 = user_correlation_df_1.T[user_correlation_df_1.T.index.isin(list_name)]

In [60]:
user_correlation_df_3 = user_correlation_df_2.T

In [61]:
user_correlation_df_3.head()

userId,1,2,3,4,5,6,7,8,9,10,...,2062,2063,2064,2065,2066,2067,2068,2069,2070,2071
1,1.0,0.001899,0.016571,0.0,0.0,0.0,0.039761,0.0,0.003991,0.0,...,0.0,0.0,0.000829,0.0,0.053598,0.015322,0.0,0.020284,0.0,0.004454
2,0.001899,1.0,0.024391,0.014429,0.009537,0.034863,0.0,0.0,0.0,0.003106,...,0.0,0.051286,0.036513,0.029803,0.0,0.045524,0.010638,0.034615,0.0,0.0
3,0.016571,0.024391,1.0,0.062999,0.040416,0.009671,0.0,0.017017,0.0,0.036063,...,0.0,0.061911,0.01747,0.083151,0.038958,0.060932,0.028651,0.112607,0.0,0.017203
4,0.0,0.014429,0.062999,1.0,0.0,0.011082,0.0,0.0,0.0,0.02775,...,0.0,0.013295,0.0,0.003963,0.0,0.022046,0.020043,0.045028,0.0,0.004144
5,0.0,0.009537,0.040416,0.0,1.0,0.051212,0.045804,0.088584,0.079276,0.133326,...,0.0,0.033921,0.103921,0.007723,0.0,0.081722,0.106743,0.014523,0.03769,0.009734


In [62]:
user_correlation_df_3[1]

userId
1       1.000000
2       0.001899
3       0.016571
4       0.000000
5       0.000000
          ...   
2067    0.015322
2068    0.000000
2069    0.020284
2070    0.000000
2071    0.004454
Name: 1, Length: 2071, dtype: float64

In [63]:
user_correlation_df_3.shape

(2071, 2071)

In [64]:
user_correlation_df_3[user_correlation_df_3<0]=0

common_user_predicted_ratings = np.dot(user_correlation_df_3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings

array([[5.10460224e+00, 1.74539764e+00, 6.96831596e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.02224987e+01, 6.74419414e+00, 2.26162307e+00, ...,
        1.99325288e-01, 0.00000000e+00, 0.00000000e+00],
       [3.57707385e+01, 1.15467646e+01, 2.97886475e+00, ...,
        4.02202154e-01, 1.42784707e-01, 1.42784707e-01],
       ...,
       [3.42388737e+01, 1.00466459e+01, 4.12806648e+00, ...,
        2.66442361e-01, 3.56811942e-02, 3.56811942e-02],
       [3.66282688e+00, 1.23265539e+00, 1.14883522e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.83078030e+01, 7.19569850e+00, 2.61970557e+00, ...,
        6.05070641e-02, 7.60002285e-02, 7.60002285e-02]])

In [65]:
dummy_test = common.copy()

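# Mark rated movies as 1. Note: any rating of 0.5 fails the x>=1 test and is marked 0,
# so it is left out of the evaluation below.
dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)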

dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

In [56]:
dummy_test.shape

(2071, 9529)

In [57]:
common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)

In [58]:
common_user_predicted_ratings.head(2)

movieId,1,2,3,4,5,6,7,8,9,10,...,204704,204780,205054,205106,205156,205383,205499,206499,206805,207309
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Calculating the RMSE only for the movies rated by the user. Before computing the RMSE, the predicted scores are rescaled to the (1, 5) range.
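The rescale-then-RMSE step can be sketched with NumPy alone. The arrays `pred` and `true` are hypothetical. The scaling below is done per column (per movie) while ignoring NaN, which mirrors how `MinMaxScaler(feature_range=(1, 5))` treats its input.

```python
import numpy as np

# Hypothetical raw scores and true ratings; NaN = no rating in the test set.
pred = np.array([[12.0, np.nan], [30.0, 8.0], [np.nan, 20.0]])
true = np.array([[3.0, np.nan], [5.0, 2.0], [np.nan, 4.0]])

# Column-wise min-max scaling to the range [1, 5], ignoring NaN.
# (A column with a single distinct value would divide by zero; the toy data avoids that.)
col_min = np.nanmin(pred, axis=0)
col_max = np.nanmax(pred, axis=0)
scaled = 1 + (pred - col_min) * (5 - 1) / (col_max - col_min)

# RMSE over the entries that have both a prediction and a true rating.
valid = ~np.isnan(scaled) & ~np.isnan(true)
rmse = np.sqrt(np.mean((scaled[valid] - true[valid]) ** 2))
print(rmse)
```

Because each movie column is scaled independently, the lowest-scored user for a movie always maps to 1 and the highest to 5, regardless of how the raw scores compare across movies.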

In [59]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = common_user_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)


MinMaxScaler(copy=True, feature_range=(1, 5))
[[     nan      nan      nan ...      nan      nan      nan]
 [     nan      nan      nan ...      nan      nan      nan]
 [3.563153      nan      nan ...      nan      nan      nan]
 ...
 [     nan      nan      nan ...      nan      nan      nan]
 [     nan      nan      nan ...      nan      nan      nan]
 [     nan      nan      nan ...      nan      nan      nan]]


In [60]:
common_ = common.pivot_table(index='userId', columns='movieId', values='rating')

In [62]:
common_

movieId,1,2,3,4,5,6,7,8,9,10,...,204704,204780,205054,205106,205156,205383,205499,206499,206805,207309
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2067,,,,,,,,,,,...,,,,,,,,,,
2068,,,,,,,,,,,...,,,,,,,,,,
2069,,,,,,,,,,,...,,,,,,,,,,
2070,,,,,,,,,,,...,,,,,,,,,,


In [63]:
# Count the non-NaN values
total_non_nan = np.count_nonzero(~np.isnan(y))

In [64]:
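# Sum the squared errors over non-NaN entries; np.nansum avoids relying on `from numpy import *`.
rmse = (np.nansum(((common_ - y) ** 2).values) / total_non_nan) ** 0.5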
print(rmse)

1.4350691659022683


# Trial: using the full ratings dataset for all users in one go

In [69]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [85]:
ratings['userId'].nunique()

2071

In [86]:
ratings['movieId'].nunique()

14657

In [89]:
# convert into the user-movie matrix.
all_user_based_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')
all_user_based_matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,206272,206293,206499,206523,206805,206845,206861,207309,208002,208793
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,3.5,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,3.0,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2067,4.0,,3.0,,,,,,,,...,,,,,,,,,,
2068,,,,,,,3.0,,,,...,,,,,,,,,,
2069,,2.5,,,,,,,,2.5,...,,,,,,,,,,
2070,,,,,,,,,,,...,,,,,,,,,,


In [87]:
all_user_based_matrix.shape

(2071, 14657)

In [88]:
# Mark rated movies as 1 and all other movies as 0 for evaluation.
dummy_all= ratings.copy()
dummy_all['rating']=1

# convert into the user-movie matrix.
all_user_dummy_matrix = dummy_all.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
all_user_dummy_matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,206272,206293,206499,206523,206805,206845,206861,207309,208002,208793
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2067,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2068,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2069,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2070,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [90]:
mean = np.nanmean(all_user_based_matrix, axis=1)
mean.shape

(2071,)

In [91]:
df_all_subtracted = (all_user_based_matrix.T-mean).T

In [92]:
df_all_subtracted.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,206272,206293,206499,206523,206805,206845,206861,207309,208002,208793
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,-0.130435,,,,,,,,,,...,,,,,,,,,,
3,0.302591,,,,,,,,,,...,,,,,,,,,,
4,-0.378099,,,,,,,,,,...,,,,,,,,,,
5,0.247525,,,,,,,,,,...,,,,,,,,,,


In [93]:
# Create the user similarity matrix using the pairwise_distances function.
user_correlation_all = 1 - pairwise_distances(df_all_subtracted.fillna(0), metric='cosine')
user_correlation_all[np.isnan(user_correlation_all)] = 0
print(user_correlation_all)

[[ 1.00000000e+00  3.28496960e-04  2.80700803e-02 ...  5.42645326e-02
   0.00000000e+00 -1.10693366e-02]
 [ 3.28496960e-04  1.00000000e+00  6.93481082e-02 ...  8.87546018e-02
   0.00000000e+00  4.17649631e-02]
 [ 2.80700803e-02  6.93481082e-02  1.00000000e+00 ...  1.37564678e-01
   0.00000000e+00  3.64683158e-02]
 ...
 [ 5.42645326e-02  8.87546018e-02  1.37564678e-01 ...  1.00000000e+00
   1.84948897e-02  4.13824555e-03]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  1.84948897e-02
   1.00000000e+00 -5.62671470e-02]
 [-1.10693366e-02  4.17649631e-02  3.64683158e-02 ...  4.13824555e-03
  -5.62671470e-02  1.00000000e+00]]


In [94]:
user_correlation_all[user_correlation_all<0]=0
user_correlation_all

array([[1.00000000e+00, 3.28496960e-04, 2.80700803e-02, ...,
        5.42645326e-02, 0.00000000e+00, 0.00000000e+00],
       [3.28496960e-04, 1.00000000e+00, 6.93481082e-02, ...,
        8.87546018e-02, 0.00000000e+00, 4.17649631e-02],
       [2.80700803e-02, 6.93481082e-02, 1.00000000e+00, ...,
        1.37564678e-01, 0.00000000e+00, 3.64683158e-02],
       ...,
       [5.42645326e-02, 8.87546018e-02, 1.37564678e-01, ...,
        1.00000000e+00, 1.84948897e-02, 4.13824555e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.84948897e-02, 1.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 4.17649631e-02, 3.64683158e-02, ...,
        4.13824555e-03, 0.00000000e+00, 1.00000000e+00]])

In [95]:
user_predicted_ratings_all = np.dot(user_correlation_all, all_user_based_matrix.fillna(0))
user_predicted_ratings_all

array([[3.28545378e+01, 1.02256602e+01, 3.95987787e+00, ...,
        4.70969146e-02, 3.95589566e-02, 4.94843947e-02],
       [8.70768585e+01, 2.93107438e+01, 1.32318318e+01, ...,
        1.58519212e-01, 1.58519212e-01, 1.34394705e-01],
       [1.27603484e+02, 4.17790760e+01, 1.58710367e+01, ...,
        8.72666104e-01, 6.07147590e-01, 3.19710892e-01],
       ...,
       [1.39824470e+02, 4.84441070e+01, 1.96582526e+01, ...,
        2.33717177e-01, 2.07003296e-01, 1.88747110e-01],
       [1.52188519e+01, 5.65875361e+00, 4.84929444e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.39298756e-02],
       [8.79624715e+01, 3.07773067e+01, 1.78407409e+01, ...,
        2.28015083e-01, 7.06659784e-02, 1.15214799e-01]])

In [96]:
user_predicted_ratings_all.shape

(2071, 14657)
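The dot product above is an unnormalized weighted sum, which is why scores such as 127.6 land far outside the 1 to 5 rating scale. The sketch below reproduces that effect on a toy matrix and shows one common alternative: dividing by the total similarity of the users who actually rated each movie. The names `ratings_toy`, `sim`, and all values are made up for illustration; the notebook itself does not normalize this way and rescales with `MinMaxScaler` instead.

```python
import numpy as np

# Toy data (hypothetical values): 3 users x 2 movies, 0 means "not rated".
ratings_toy = np.array([
    [5.0, 0.0],
    [4.0, 2.0],
    [0.0, 3.0],
])
# Toy user-user similarity, already clipped to be non-negative.
sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

# Unnormalized weighted sum, as in the cell above: not bounded by 5.
raw_scores = sim @ ratings_toy

# Normalized variant: divide by the summed similarity of users who rated the movie.
rated_mask = (ratings_toy > 0).astype(float)
weight_sum = sim @ rated_mask
normalized = np.divide(raw_scores, weight_sum,
                       out=np.zeros_like(raw_scores), where=weight_sum > 0)

print(raw_scores)
print(normalized)
```

With the normalized variant every score is a weighted average of real ratings, so it stays inside the original rating range without a separate rescaling step.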

In [97]:
user_final_rating_all = np.multiply(user_predicted_ratings_all,all_user_dummy_matrix)
user_final_rating_all.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,206272,206293,206499,206523,206805,206845,206861,207309,208002,208793
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,87.076859,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,127.603484,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,67.929839,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,112.589198,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = user_final_rating_all.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
z = (scaler.transform(X))

print(z)

MinMaxScaler(copy=True, feature_range=(1, 5))
[[       nan        nan        nan ...        nan        nan        nan]
 [2.44439462        nan        nan ...        nan        nan        nan]
 [3.29527043        nan        nan ...        nan        nan        nan]
 ...
 [       nan 3.2637267         nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 [2.46298849        nan        nan ...        nan        nan        nan]]
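`MinMaxScaler` rescales each column independently, and the NaN-preserving output above suggests it skips missing cells when fitting. The NumPy sketch below mimics that per-column behaviour on toy data so the effect is easy to inspect. The helper `minmax_per_column` and the values are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def minmax_per_column(x, low=1.0, high=5.0):
    """Rescale each column to [low, high], ignoring NaN cells."""
    col_min = np.nanmin(x, axis=0)
    col_max = np.nanmax(x, axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0  # constant columns map to `low`
    return (x - col_min) / span * (high - low) + low

# Toy score matrix (hypothetical values); NaN marks cells that were filtered out.
scores = np.array([
    [np.nan, 10.0],
    [20.0,   30.0],
    [60.0,   np.nan],
    [100.0,  50.0],
])
scaled = minmax_per_column(scores)
print(scaled)
```

Because each column has its own minimum and maximum, a raw score of 20 in the first column and 10 in the second both map to 1. In this cell the columns are movies, so the rescaling is relative within each movie rather than across the whole matrix.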


In [102]:
all_user_based_matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,206272,206293,206499,206523,206805,206845,206861,207309,208002,208793
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,3.5,,,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,3.0,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2067,4.0,,3.0,,,,,,,,...,,,,,,,,,,
2068,,,,,,,3.0,,,,...,,,,,,,,,,
2069,,2.5,,,,,,,,2.5,...,,,,,,,,,,
2070,,,,,,,,,,,...,,,,,,,,,,


In [101]:
# Count the non-NaN entries in the scaled predictions
total_non_nan = np.count_nonzero(~np.isnan(z))
total_non_nan

300098

In [103]:
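# np.nansum skips NaN cells explicitly instead of relying on the star-imported numpy `sum`.
rmse = (np.nansum(((all_user_based_matrix - z) ** 2).to_numpy()) / total_non_nan) ** 0.5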
print(rmse)

1.5493432842929666
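The RMSE above is taken only over cells where a comparison is possible. A minimal sketch of that idea as a reusable helper follows. `masked_rmse` is a hypothetical name, and unlike the cell above it divides by the number of cells where both the actual and the predicted value are present, which is an assumption about the intended denominator.

```python
import numpy as np

def masked_rmse(actual, predicted):
    """RMSE over the cells where both inputs hold a value (NaN = missing)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    both = ~np.isnan(actual) & ~np.isnan(predicted)
    if not both.any():
        return float("nan")
    diff = actual[both] - predicted[both]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy matrices (hypothetical values).
actual_toy = np.array([[4.0, np.nan], [np.nan, 2.0], [5.0, 3.0]])
predicted_toy = np.array([[3.0, 1.0], [np.nan, 4.0], [5.0, np.nan]])
print(masked_rmse(actual_toy, predicted_toy))
```

Three cells overlap here, with errors 1, -2 and 0, so the result is the square root of 5/3.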


## Using Item similarity

# Item Based Similarity

The rating matrix is transposed so that each row is a movie, which lets us normalize the ratings around each movie's mean. In the user-based approach, the mean was taken per user rather than per movie.

In [8]:
df_pivot = train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).T

df_pivot.head()

userId,1,2,3,4,5,6,7,8,9,10,...,2062,2063,2064,2065,2066,2067,2068,2069,2070,2071
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,3.5,,3.0,4.0,,,4.0,,3.5,...,,,4.0,,,4.0,,,,3.0
2,,,,,,,,,5.0,,...,,,,,,,,2.5,,
3,,,,,,,,4.0,,,...,5.0,,,,,3.0,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,3.0,,,,,,,,,


Normalizing each movie's ratings around that movie's mean, for use with the adjusted cosine similarity

In [9]:
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

In [10]:
df_subtracted.head()

userId,1,2,3,4,5,6,7,8,9,10,...,2062,2063,2064,2065,2066,2067,2068,2069,2070,2071
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,-0.45501,,-0.95501,0.04499,,,0.04499,,-0.45501,...,,,0.04499,,,0.04499,,,,-0.95501
2,,,,,,,,,1.610811,,...,,,,,,,,-0.889189,,
3,,,,,,,,0.816667,,,...,1.816667,,,,,-0.183333,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,-0.133663,,,,,,,,,
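The mean-centering step can be checked on a small example. The sketch below uses a toy movie-by-user frame with made-up ratings and computes the same per-movie centering with `DataFrame.sub`, which avoids the double transpose. The names `toy`, `movie_mean`, and `centered` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy movie x user matrix (hypothetical values); NaN means "not rated".
toy = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 5.0, 3.0]],
    index=pd.Index([10, 20], name="movieId"),
    columns=pd.Index([1, 2, 3], name="userId"),
)

# Per-movie mean (pandas skips NaN by default), subtracted row by row.
movie_mean = toy.mean(axis=1)
centered = toy.sub(movie_mean, axis=0)
print(centered)
```

Missing ratings stay NaN after centering; they only become 0 later, when `fillna(0)` is applied before the similarity computation.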


Computing the cosine similarity with the pairwise distances approach

In [11]:
from sklearn.metrics.pairwise import pairwise_distances

In [12]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)

[[ 1.          0.10291474 -0.00527634 ...  0.          0.
   0.        ]
 [ 0.10291474  1.          0.03170014 ...  0.          0.
   0.        ]
 [-0.00527634  0.03170014  1.         ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]
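To make the similarity step concrete, here is a small NumPy sketch of cosine similarity between rows, written out by hand rather than through `pairwise_distances`. The helper `cosine_similarity_rows` is hypothetical, and treating all-zero rows as having similarity 0 is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_rows(m):
    """Cosine similarity between rows of m; all-zero rows get similarity 0."""
    norms = np.linalg.norm(m, axis=1)
    safe = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    unit = m / safe[:, None]
    return unit @ unit.T

# Mean-centered toy ratings with missing cells already filled with 0.
centered_toy = np.array([
    [1.0, 0.0, -1.0],
    [0.0, 1.0, -1.0],
    [0.0, 0.0,  0.0],  # a movie with no usable ratings
])
sim_toy = cosine_similarity_rows(centered_toy)
print(np.round(sim_toy, 3))
```

The first two movies share one co-rated user with the same deviation, giving a similarity of 0.5, while the empty row contributes nothing to any prediction.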


In [46]:
item_correlation.shape

(12911, 12911)

In [47]:
df_pivot.shape

(12911, 2071)

Keeping only the positive correlations: negative values are set to 0.

In [52]:
item_correlation[item_correlation<0]=0
item_correlation

array([[1.        , 0.10291474, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.10291474, 1.        , 0.03170014, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.03170014, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

# Prediction - Item Item

In [16]:
item_predicted_ratings = np.dot((df_pivot.fillna(0).T),item_correlation)
item_predicted_ratings

array([[ 5.83679879,  2.41464654,  1.94850435, ...,  0.        ,
         0.        ,  0.        ],
       [27.32838057, 18.73037026,  6.83694139, ...,  0.        ,
         0.        ,  0.        ],
       [71.07549452, 50.83081163, 16.86416369, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [21.92022103, 20.39922913,  7.56019237, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.34159079,  0.84726105,  0.66059729, ...,  0.        ,
         0.        ,  0.        ],
       [16.64338891, 12.69091959,  5.81227628, ...,  0.        ,
         0.        ,  0.        ]])

In [17]:
item_predicted_ratings.shape

(2071, 12911)

In [23]:
dummy_train.shape

(2071, 12911)

### Keeping the predicted ratings only for movies the user has not rated, for recommendation

In [24]:
item_final_rating = np.multiply(item_predicted_ratings,dummy_train)
item_final_rating.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.836799,2.414647,1.948504,1.053313,1.664116,2.900967,1.464508,0.875935,1.481628,3.313284,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,18.73037,6.836941,9.363118,8.222216,13.292342,9.611464,3.705273,9.897551,21.484258,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,71.075495,50.830812,16.864164,36.129112,17.996283,36.056161,19.455536,26.062272,32.047686,42.683568,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,16.289441,6.163784,6.882548,5.744073,13.938653,5.300006,8.632429,8.484791,15.86887,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,14.295191,7.947143,6.407635,7.038309,7.613087,5.866075,5.89383,8.269718,14.117642,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
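The masking step relies on `dummy_train` holding 1 where the user has not rated a movie in the training set and 0 where a rating exists; that is an assumption here, since the mask is built earlier in the notebook. A minimal sketch of building and applying such a mask on toy data, with hypothetical names and values:

```python
import numpy as np
import pandas as pd

# Toy user x movie training ratings; NaN means "not rated".
train_toy = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [np.nan, 4.0, np.nan]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)
# Toy predicted scores with the same labels.
scores_toy = pd.DataFrame(
    [[7.0, 2.5, 6.0],
     [1.5, 8.0, 0.5]],
    index=train_toy.index, columns=train_toy.columns,
)

# 1 where the user has NOT rated the movie, 0 where a rating already exists.
unrated_mask = train_toy.isna().astype(int)

# Already-rated movies drop to 0, so they never surface as recommendations.
candidates = scores_toy * unrated_mask
print(candidates)
```

Multiplying two labelled frames aligns them on `userId` and `movieId`, which guards against silently mixing up rows or columns.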


### Finding the top 5 recommendations for the *user*



In [25]:
# Take the user ID as input
user_input = int(input("Enter your user ID: "))
print(user_input)

Enter your user ID: 2
2


In [26]:
# Recommend the top 5 movies to the user.
d = item_final_rating.loc[user_input].sort_values(ascending=False).head(5)
d

movieId
6377    33.970166
1270    32.979010
1704    32.001494
8636    31.896838
8644    31.439675
Name: 2, dtype: float64
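The top-N lookup can be wrapped in a small helper so it is easy to reuse for any user and any list length. This is only a sketch: `top_n_for_user` and the toy score frame are hypothetical, and `Series.nlargest` is used in place of sorting the whole row.

```python
import pandas as pd

def top_n_for_user(score_frame, user_id, n=5):
    """Return the n highest-scoring movieIds for one user as a Series."""
    return score_frame.loc[user_id].nlargest(n)

# Toy user x movie scores (hypothetical values); 0 marks already-rated movies.
toy_scores = pd.DataFrame(
    [[0.0, 2.5, 0.0, 4.0],
     [1.5, 0.0, 0.5, 3.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)
print(top_n_for_user(toy_scores, user_id=2, n=2))
```

An unknown `user_id` raises a `KeyError` from `.loc`, so a caller taking input from a prompt may want to check `user_id in score_frame.index` first.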

In [28]:
# Map movie IDs to movie titles and genres
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv')

URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

In [None]:
d = pd.merge(d, movie_mapping, on='movieId', how='left')
d.head()

In [None]:
train_new = pd.merge(train, movie_mapping, on='movieId', how='left')
train_new[train_new.userId == 1].head()
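Since the download of `movies.csv` failed in this run, the merge cells have no output. The sketch below shows the intended join on a toy stand-in: the scores and the `title` and `genres` values are placeholders, not real MovieLens data, and `movies_toy` is a hypothetical substitute for `movie_mapping`.

```python
import pandas as pd

# Toy top-N result: a Series indexed by movieId (scores are made up).
top_scores = pd.Series(
    [33.9, 32.9],
    index=pd.Index([6377, 1270], name="movieId"),
    name="score",
)
# Hypothetical stand-in for movies.csv; titles and genres are placeholders.
movies_toy = pd.DataFrame({
    "movieId": [1270, 6377],
    "title": ["Title A", "Title B"],
    "genres": ["Genre A", "Genre B"],
})

# Turn the index into a column, then left-join so every recommendation is kept.
labeled = top_scores.reset_index().merge(movies_toy, on="movieId", how="left")
print(labeled)
```

A left join keeps the ranking order of the recommendations and leaves NaN in `title` for any movie missing from the mapping file.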

# Evaluation - Item Item

Evaluation works the same way as the prediction above. The only difference is that you evaluate on movies the user has already rated, instead of predicting for movies the user has not rated.

In [29]:
test.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

In [30]:
common =  test[test.movieId.isin(train.movieId)]
common.shape

(88035, 4)

In [31]:
common.head(4)

Unnamed: 0,userId,movieId,rating,timestamp
29643,226,3156,1.0,1059516139
152649,1074,2194,3.0,906133915
123175,886,4886,3.5,1168350634
23712,185,1101,4.0,1191923488


In [32]:
common_item_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating').T

In [33]:
common_item_based_matrix.shape

(7783, 2071)

In [34]:
item_correlation_df = pd.DataFrame(item_correlation)

In [35]:
item_correlation_df.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12901,12902,12903,12904,12905,12906,12907,12908,12909,12910
0,1.0,0.102915,0.0,0.00434,0.095113,0.066226,0.0,0.009237,0.02017,0.110567,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
item_correlation_df['movieId'] = df_subtracted.index
item_correlation_df.set_index('movieId',inplace=True)
item_correlation_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,12901,12902,12903,12904,12905,12906,12907,12908,12909,12910
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.102915,0.0,0.00434,0.095113,0.066226,0.0,0.009237,0.02017,0.110567,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.102915,1.0,0.0317,0.0,0.060859,0.046665,0.030885,0.025288,0.022028,0.157213,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0317,1.0,0.004382,0.069432,0.011788,0.054688,0.060193,0.100534,0.011718,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00434,0.0,0.004382,1.0,0.038519,0.029985,0.019847,0.0,0.06789,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.095113,0.060859,0.069432,0.038519,1.0,0.063813,0.0,0.0,0.043664,0.091824,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
list_name = common.movieId.tolist()

In [38]:
item_correlation_df.columns = df_subtracted.index.tolist()

item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]

In [39]:
item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]

item_correlation_df_3 = item_correlation_df_2.T

In [40]:
item_correlation_df_3.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,201773,201811,202429,202439,203222,203519,204698,205383,206499,207309
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.102915,0.0,0.00434,0.095113,0.066226,0.0,0.009237,0.02017,0.110567,...,0.0,0.0,0.0,0.0,0.0,0.0,0.003606,0.0,0.0,0.0
2,0.102915,1.0,0.0317,0.0,0.060859,0.046665,0.030885,0.025288,0.022028,0.157213,...,0.0,0.035445,0.05432,0.020464,0.0,0.0,0.0,0.10963,0.0,0.0
3,0.0,0.0317,1.0,0.004382,0.069432,0.011788,0.054688,0.060193,0.100534,0.011718,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00434,0.0,0.004382,1.0,0.038519,0.029985,0.019847,0.0,0.06789,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.095113,0.060859,0.069432,0.038519,1.0,0.063813,0.0,0.0,0.043664,0.091824,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
item_correlation_df_3.shape

(7783, 7783)
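The row and column filtering of the similarity matrix can also be written as a single label-based selection. A minimal sketch on a toy matrix, where `sim_df`, `test_movie_ids`, and the values are hypothetical:

```python
import pandas as pd

# Toy item-item similarity labelled by movieId on both axes.
sim_df = pd.DataFrame(
    [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.4],
     [0.0, 0.4, 1.0]],
    index=pd.Index([10, 20, 30], name="movieId"),
    columns=[10, 20, 30],
)
test_movie_ids = [30, 10, 30, 99]  # may repeat and may contain unseen movies

# Keep movies present in both, preserving the similarity matrix's own order.
keep = sim_df.index[sim_df.index.isin(test_movie_ids)]
sim_common = sim_df.loc[keep, keep]
print(sim_common)
```

Selecting with the same `keep` index on both axes guarantees a square result whose rows and columns are in the same order, which the later dot product depends on.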

In [45]:
common_item_based_matrix.shape

(7783, 2071)

In [41]:
item_correlation_df_3[item_correlation_df_3<0]=0

common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))
common_item_predicted_ratings


array([[ 1.10153262, 12.83252349, 35.8647666 , ...,  9.1484491 ,
         0.51093827,  9.15792271],
       [ 0.39077366,  9.74879386, 22.62663044, ...,  7.05031014,
         0.92552995,  6.28294116],
       [ 0.11700293,  4.30626816,  8.2908921 , ...,  3.77831808,
         0.26364204,  3.77408128],
       ...,
       [ 0.43354838,  3.97316391, 22.48683777, ...,  6.770097  ,
         0.1689317 ,  4.1952958 ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [42]:
common_item_predicted_ratings.shape

(7783, 2071)

`dummy_test` is used for evaluation. Predictions are evaluated only on the movies the user has actually rated, so those cells are marked 1 and all others 0. This is the opposite of `dummy_train`.



In [53]:
dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>0 else 0)

dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').T.fillna(0)

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)
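Because `np.dot` returns a plain array, the multiplication above pairs cells by position. The sketch below shows a label-aware variant on toy data: the raw scores are wrapped in a DataFrame first so the mask is applied by `movieId` and `userId`. All names and values are hypothetical, and the mask is built with `notna()` on the assumption that every real rating is positive.

```python
import numpy as np
import pandas as pd

# Toy test ratings in long format (hypothetical values).
test_toy = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 30, 20],
    "rating":  [4.0, 2.0, 5.0],
})
actual = test_toy.pivot_table(index="userId", columns="movieId", values="rating").T

# 1 where the user rated the movie in the test set, 0 elsewhere.
rated_mask = actual.notna().astype(int)

# Raw scores as a plain movie x user array, as np.dot would return them.
raw = np.array([[3.0, 1.0],
                [2.0, 6.0],
                [4.0, 0.5]])

# Re-attach labels so the mask is applied by label, not by position.
scored = pd.DataFrame(raw, index=actual.index, columns=actual.columns)
evaluated = scored * rated_mask
print(evaluated)
```

Only the three cells with a test rating keep their score; everything else becomes 0 and is later dropped by the `X[X>0]` filter.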

Movies the user has not rated are marked 0 for evaluation. Next, pivot the actual test ratings into the same movie-by-user layout for comparison.


In [54]:
common_ = common.pivot_table(index='userId', columns='movieId', values='rating').T

In [55]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = common_item_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

MinMaxScaler(copy=True, feature_range=(1, 5))
[[       nan        nan 2.52545034 ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 ...
 [       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]]


In [56]:
# Count the non-NaN entries in the scaled predictions
total_non_nan = np.count_nonzero(~np.isnan(y))

In [57]:
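# np.nansum skips NaN cells explicitly instead of relying on the star-imported numpy `sum`.
rmse = (np.nansum(((common_ - y) ** 2).to_numpy()) / total_non_nan) ** 0.5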
print(rmse)

1.317090534462963
