# Problem Statement

Consider the ratings dataset below, which contains four fields: UserID, MovieID, Rating and Timestamp. Each line of this file represents one rating of one movie by one user, and has the following format:
UserID::MovieID::Rating::Timestamp

Ratings are made on a 5-star scale with half-star increments.
UserID is the ID of the user, MovieID is the ID of the movie, and Timestamp is the number of seconds since midnight Coordinated Universal Time (UTC) on January 1, 1970.
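
As a minimal sketch of how lines in this `::`-delimited format could be read, the snippet below parses two sample lines (the first two rows shown later by `df.head()`, rewritten in the raw format) and converts the Unix timestamp to a date. The column names and the `rated_at` column are assumptions made for illustration; the notebook itself loads an already converted CSV.

```python
import io
import pandas as pd

# Two sample lines in the UserID::MovieID::Rating::Timestamp format
raw = io.StringIO("186::302::3::891717742\n22::377::1::878887116\n")

# A multi-character separator requires the Python parsing engine
ratings = pd.read_csv(
    raw,
    sep="::",
    engine="python",
    names=["user_id", "movie_id", "rating", "timestamp"],
)

# Timestamps are seconds since 1970-01-01 00:00 UTC
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
print(ratings)
```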

In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv('Datasets/Recommend.csv')

In [6]:
df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806


In [22]:
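from sklearn.model_selection import train_test_split

# Count distinct users and movies; these set the rating-matrix dimensions
n_users = df.user_id.unique().shape[0]
n_movies = df.movie_id.unique().shape[0]

# Hold out 25% of the ratings for testing (pass random_state for a reproducible split)
train_data, test_data = train_test_split(df, test_size=0.25)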

In [23]:
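# Rows are users, columns are movies; 0 means "not rated in this split"
train_data_matrix = np.zeros((n_users, n_movies))
for line in train_data.itertuples():
    # Access fields by name: positions shift when the CSV carries an extra index column.
    # Subtracting 1 assumes IDs run from 1 to n without gaps.
    train_data_matrix[line.user_id - 1, line.movie_id - 1] = line.rating
train_data_matrix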

array([[5., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])

In [24]:
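# Same layout as the training matrix, filled from the held-out ratings
test_data_matrix = np.zeros((n_users, n_movies))
for line in test_data.itertuples():
    # Access fields by name: positions shift when the CSV carries an extra index column
    test_data_matrix[line.user_id - 1, line.movie_id - 1] = line.rating
test_data_matrix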

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
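
The loops above use each ID minus one as a matrix index, which only works when user and movie IDs run from 1 to n without gaps. If that is not guaranteed, one option is to map each ID to a row or column position explicitly. The sketch below shows this with a hypothetical helper, `build_rating_matrix`, which is not part of the notebook; the sample data are the five rows displayed by `df.head()`.

```python
import numpy as np
import pandas as pd

def build_rating_matrix(ratings, user_index, movie_index):
    """Fill a users x movies matrix; 0 marks 'no rating' (hypothetical helper)."""
    matrix = np.zeros((len(user_index), len(movie_index)))
    rows = user_index.get_indexer(ratings["user_id"])
    cols = movie_index.get_indexer(ratings["movie_id"])
    matrix[rows, cols] = ratings["rating"].to_numpy()
    return matrix

# Sample data: the five rows displayed by df.head()
sample = pd.DataFrame({
    "user_id": [186, 22, 244, 166, 298],
    "movie_id": [302, 377, 51, 346, 474],
    "rating": [3, 1, 2, 1, 4],
})

# Build the ID -> position lookups once, from the full dataset,
# so that train and test matrices end up with the same shape
user_index = pd.Index(np.sort(sample["user_id"].unique()))
movie_index = pd.Index(np.sort(sample["movie_id"].unique()))

sample_matrix = build_rating_matrix(sample, user_index, movie_index)
print(sample_matrix)
```

With the real data, the two lookups would be built from `df` and the helper called once for `train_data` and once for `test_data`.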

In [31]:
from sklearn.metrics import pairwise_distances

# Note: metric='cosine' returns cosine distance (1 - cosine similarity), not similarity
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
movie_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

# Per-user mean over all movies, unrated zeros included
mean_user_rating = train_data_matrix.mean(axis=1)[:, np.newaxis]
ratings_diff = train_data_matrix - mean_user_rating

# Mean-centred weighted sum, normalised by each user's total absolute weight
user_pred = mean_user_rating + user_similarity.dot(ratings_diff) / np.abs(user_similarity).sum(axis=1)[:, np.newaxis]

user_pred



array([[ 1.58164746,  0.58237709,  0.45590077, ...,  0.27326925,
         0.27576026,  0.2755086 ],
       [ 1.35560001,  0.34439134,  0.16257725, ..., -0.04878103,
        -0.04543124, -0.0453361 ],
       [ 1.35568451,  0.28693442,  0.11488228, ..., -0.10002496,
        -0.0968305 , -0.09669711],
       ...,
       [ 1.2046634 ,  0.24468641,  0.07031219, ..., -0.12875854,
        -0.12565362, -0.12558123],
       [ 1.37981417,  0.33770748,  0.19378923, ..., -0.01486955,
        -0.0122441 , -0.01176152],
       [ 1.45161353,  0.42158608,  0.30346566, ...,  0.11919125,
         0.12164975,  0.12173321]])
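
One caveat about the cell above: `pairwise_distances` with `metric='cosine'` returns cosine distance, which is 1 minus cosine similarity, so larger values mean less similar users. The sketch below applies the same mean-centred prediction formula with cosine similarity as the weight instead. It runs on a toy matrix with made-up ratings, the helper names are hypothetical, and its predictions would differ from the output shown above.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows (1.0 = same direction)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T

def predict_user_based(ratings, similarity):
    """Each user's mean plus the similarity-weighted average of mean-centred ratings."""
    mean_user_rating = ratings.mean(axis=1, keepdims=True)
    ratings_diff = ratings - mean_user_rating
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    return mean_user_rating + similarity @ ratings_diff / weights

# Toy ratings (hypothetical values): 3 users x 4 movies, 0 = unrated
toy = np.array([[5., 3., 0., 1.],
                [4., 0., 0., 1.],
                [1., 1., 0., 5.]])

sim = cosine_similarity_rows(toy)   # equals 1 - cosine distance
toy_pred = predict_user_based(toy, sim)
print(np.round(sim, 3))
print(np.round(toy_pred, 3))
```

Users 0 and 1 rate the first and last movies alike, so their similarity is higher than that of users 0 and 2, and user 1 therefore contributes more to user 0's predictions.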

# Quick Recap

1 - Import Libraries and the dataset

2 - Identify total number of users and movies

3 - Split the data into training and testing sets

4 - Populate the train and test matrices with the ratings from each split

5 - Compute pairwise cosine matrices for users and movies (note that `pairwise_distances` returns cosine distance, i.e. 1 - similarity)

6 - Perform predictions
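
The steps above build a test matrix but stop before using it. One common way to score predictions, sketched here as an assumption rather than something the notebook does, is the root mean squared error (RMSE) over only those entries that carry a rating in the test matrix. The function name `rmse_on_rated` and the toy arrays are hypothetical.

```python
import numpy as np

def rmse_on_rated(prediction, ground_truth):
    """RMSE over the entries where ground_truth holds a rating (non-zero)."""
    mask = ground_truth.nonzero()
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy example: only two entries are rated, so only those two are scored
truth = np.array([[4., 0.],
                  [0., 2.]])
pred = np.array([[3., 9.],
                 [9., 2.]])

score = rmse_on_rated(pred, truth)
print(score)
```

With the notebook's variables, the call would be `rmse_on_rated(user_pred, test_data_matrix)`.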