# **Recommendation system**
Here we build a recommender system in PyTorch using model-based collaborative filtering, and apply a Bayesian average to handle the cold-start problem.
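
For intuition about the cold-start part, here is a minimal sketch of a Bayesian average on a toy ratings table. It is only an illustration: the notebook's own logic lives in `src.recommender`, which is not shown here. The toy data and the choice of priors (`C` as the mean number of ratings per movie, `m` as the mean of the per-movie mean ratings) are assumptions reflecting one common formulation.

```python
import pandas as pd

# Toy data (hypothetical): movie 10 has four ratings, movie 20 a single 5.0.
toy = pd.DataFrame({
    "movieId": [10, 10, 10, 10, 20],
    "rating":  [4.0, 4.0, 3.0, 5.0, 5.0],
})

# Per-movie rating count, mean, and sum.
stats = toy.groupby("movieId")["rating"].agg(["count", "mean", "sum"])

# Assumed priors: C = typical number of ratings per movie,
# m = average of the per-movie mean ratings.
C = stats["count"].mean()
m = stats["mean"].mean()

# Bayesian average: (C * m + sum of ratings) / (C + number of ratings).
stats["bayesian_avg"] = (C * m + stats["sum"]) / (C + stats["count"])
print(stats)
```

The movie with a single 5.0 rating is pulled toward the prior mean more strongly than the movie with four ratings, which is the behaviour that makes this score useful for items with little data.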

## Step 01: **Import Libraries**

In [2]:
import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from src.recommender import recommend
from src.model import MF, MF_bias, train_epochs

from scipy.sparse import csr_matrix

# np.random.seed(seed=33)  # for reproducibility

## Step 02: **Load Data**

In [4]:
movies = pd.read_csv("data/ml-latest-small/movies.csv")
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [5]:
ratings = pd.read_csv("data/ml-latest-small/ratings.csv")
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


## Step 03: **Data Pre-processing**

#### **Removing unrated movies from the movies dataset**

In [7]:
ex1 = np.setdiff1d(ratings.movieId.unique(), movies.movieId.unique()).tolist()
print(f"No. of movies in ratings.csv but not in movies.csv: {len(ex1)}")
ex2 = np.setdiff1d(movies.movieId.unique(), ratings.movieId.unique()).tolist()
print(f"No. of movies in movies.csv but not in ratings.csv: {len(ex2)}")

# remove the extra movies from movies.csv which are not in ratings.csv
movies = movies[~movies.movieId.isin(ex2)]
movies.reset_index(drop=True, inplace=True) # reset the index after removing some rows
print("removed extra movies from movies.csv ----------")

ex1 = np.setdiff1d(ratings.movieId.unique(), movies.movieId.unique()).tolist()
print(f"No. of movies in ratings.csv but not in movies.csv: {len(ex1)}")
ex2 = np.setdiff1d(movies.movieId.unique(), ratings.movieId.unique()).tolist()
print(f"No. of movies in movies.csv but not in ratings.csv: {len(ex2)}")

No. of movies in ratings.csv but not in movies.csv: 0
No. of movies in movies.csv but not in ratings.csv: 18
removed extra movies from movies.csv ----------
No. of movies in ratings.csv but not in movies.csv: 0
No. of movies in movies.csv but not in ratings.csv: 0


#### **Encoding**
We encode the data so that users and movies have contiguous IDs. You can think of this as a categorical encoding of our two categorical variables, `userId` and `movieId`.

In [8]:
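lbl_user = LabelEncoder()
lbl_movie = LabelEncoder()
lbl2_movie = LabelEncoder()
# map raw user IDs to contiguous integers starting at 0
ratings.userId = lbl_user.fit_transform(ratings.userId.values)
# movies and ratings now contain the same set of movie IDs (see the check above),
# so the two movie encoders learn the same mapping
movies.movieId = lbl_movie.fit_transform(movies.movieId.values)
ratings.movieId = lbl2_movie.fit_transform(ratings.movieId.values)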
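
To see what the encoding does, and why the unrated movies had to be removed first, the sketch below mimics `LabelEncoder` with `np.unique(..., return_inverse=True)`, which likewise maps each value to its position among the sorted unique values. The toy IDs and variable names are hypothetical.

```python
import numpy as np

# Hypothetical raw movie IDs with gaps, as found in MovieLens.
movie_ids_in_movies = np.array([1, 3, 6, 47])
movie_ids_in_ratings = np.array([1, 3, 6, 47, 3, 1])

# classes_*: sorted unique raw IDs; codes_*: contiguous indices 0..n-1.
classes_m, codes_m = np.unique(movie_ids_in_movies, return_inverse=True)
classes_r, codes_r = np.unique(movie_ids_in_ratings, return_inverse=True)

print(codes_m)  # [0 1 2 3]
print(codes_r)  # [0 1 2 3 1 0]

# Both tables share the same set of IDs, so the two mappings agree.
same_mapping = np.array_equal(classes_m, classes_r)

# If movies still held an unrated ID (say 5), the mappings would diverge:
# raw ID 6 would be encoded as 3 in movies but as 2 in ratings.
classes_x = np.unique(np.array([1, 3, 5, 6, 47]))
code_6_movies = int(np.searchsorted(classes_x, 6))
code_6_ratings = int(np.searchsorted(classes_r, 6))

# Decoding back to raw IDs is an index into the classes array
# (the analogue of LabelEncoder.inverse_transform).
decoded = classes_r[codes_r]
```

Contiguous codes matter because they can be used directly as row indices into a table with one row per user or per movie, with no unused rows left for the gaps in the raw IDs.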

#### **Retrieve Movie Genres**

In the movies dataset, `genres` is expressed as a string with a pipe `|` separating each genre. We will manipulate this string into a list, which will make it much easier to analyze.

In [9]:
movies['genres'] = movies['genres'].apply(lambda x: x.split("|"))
movies.head(3)

Unnamed: 0,movieId,title,genres
0,0,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1,1,Jumanji (1995),"[Adventure, Children, Fantasy]"
2,2,Grumpier Old Men (1995),"[Comedy, Romance]"
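
As an example of the analysis the list form enables, the sketch below counts how many movies fall under each genre using `explode` and `value_counts`, on the three rows shown above. It is an illustration, not a cell from the original notebook, and the names `sample` and `genre_counts` are hypothetical.

```python
import pandas as pd

# The first three movies, with genres still in pipe-separated form.
sample = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Same transformation as above, via the vectorised string accessor.
sample["genres"] = sample["genres"].str.split("|")

# One row per (movie, genre) pair, then count movies per genre.
genre_counts = sample.explode("genres")["genres"].value_counts()
print(genre_counts)
```

With genres stored as lists, per-genre statistics like this need no further string parsing.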
