## Introduction

Recommender systems generate a list of items (or people) to recommend to users.
Historically, recommender systems were treated as part of data mining and information filtering. In the 1990s, the field came to be recognized as a research area in its own right. Today, major companies such as Amazon, Netflix, Launch, Google, YouTube and Facebook rely heavily on recommender systems to sell their products and services by recommending the most relevant items to users, which leads to a significant increase in revenue. <br />
Because of this economic role, Netflix announced the Netflix Grand Prize (1 million US dollars), an open competition for the best collaborative recommender system to predict film ratings from users’ preferences. BellKor’s Pragmatic Chaos won the Grand Prize in 2009 by improving on the accuracy of Netflix’s own rating predictions by 10.06%.
For further reading on recommender systems, the following paper describes various recommendation systems in detail and helped me in my study: https://www.researchgate.net/publication/317904909_New_Insights_Towards_Developing_Recommender_Systems

## Types of Recommendation Systems
Recommendation systems can be broadly classified into three types:

1. **Collaborative Filtering**
2. **Content-Based Filtering**
3. **Hybrid Recommendation Systems**

## Collaborative filtering

Collaborative techniques are the most widely known and used in recommender systems. This approach recommends to a user the items that other users with similar tastes liked in the past.

The two main families of algorithms are:

* **Memory-based structure, which uses neighborhood methods**
* **Model-based structure, which uses latent factor models**


## Memory-Based Collaborative Filtering

Memory-based algorithms are essentially heuristics that predict ratings from the entire collection of items previously rated by users. There are two variants:

* **User-based filtering**
* **Item-based filtering**

The item-based approach is usually preferred over the user-based approach.<br />
The user-based approach is often harder to scale because of the dynamic nature of users, whereas items usually don’t change much, so item-based similarities can often be computed offline and served without constant re-training.

To implement **item-based collaborative filtering**, **KNN** is a perfect go-to model and also a very good baseline for recommender system development.

## User-Based Collaborative Filtering
User-based collaborative filtering (UB-CF) applies the logic described above: it recommends items by finding users similar to the active user (the user for whom we are trying to recommend a movie). A specific application of this is the user-based nearest neighbor algorithm.
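
As a rough illustration of the user-based idea, the sketch below predicts one missing rating for an active user from the ratings of other users, weighted by cosine similarity. The ratings matrix, the `cosine_similarity` helper and the weighted-average prediction rule are assumptions made for this example; they are not taken from the notebook that follows.

```python
import numpy as np

# Toy ratings: rows are users, columns are movies, 0 means "not rated".
# All values are made up for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # active user (has not rated movie 2)
    [5.0, 5.0, 4.0, 1.0],   # similar taste to the active user
    [1.0, 2.0, 1.0, 5.0],   # rather different taste
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

active = ratings[0]
others = ratings[1:]

# Similarity of the active user to every other user.
sims = np.array([cosine_similarity(active, other) for other in others])

# Predict the active user's rating for movie 2 as a similarity-weighted
# average of the other users' ratings for that movie.
target_movie = 2
predicted = float(sims @ others[:, target_movie] / sims.sum())

# +1 because row 0 of `ratings` is the active user.
most_similar_user = int(np.argmax(sims)) + 1
print(most_similar_user, round(predicted, 2))
```

The prediction lands closer to the rating given by the more similar user, which is the behaviour the user-based approach relies on.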

### K-Nearest Neighbours (KNN)

KNN is a **non-parametric, lazy** learning method. It keeps a database of data points, grouped into several clusters, and uses it to make inferences for new samples.
* KNN does not make any assumptions about the underlying data distribution; instead, it relies on item feature similarity.
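
To make "lazy" concrete, here is a minimal brute-force sketch in which fitting only stores the data and all distance computation happens at query time. The `LazyKNN` class is a hypothetical illustration and uses Euclidean distance for simplicity; it is not the scikit-learn implementation used later.

```python
import numpy as np

class LazyKNN:
    """Minimal brute-force nearest-neighbor lookup (illustrative only)."""

    def fit(self, points):
        # "Training" only stores the data; no parameters are learned.
        self.points = np.asarray(points, dtype=float)
        return self

    def kneighbors(self, query, k):
        # All the work happens at query time: compute every distance, then sort.
        diff = self.points - np.asarray(query, dtype=float)
        distances = np.linalg.norm(diff, axis=1)
        order = np.argsort(distances)[:k]
        return distances[order], order

points = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
knn = LazyKNN().fit(points)
distances, indices = knn.kneighbors([0.0, 0.1], k=2)
print(indices, distances)
```
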
### Selecting Neighborhoods
* All the neighbors
* Threshold similarity or distance
* Random neighbors
* Top-N neighbors by similarity or distance
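
The four selection strategies listed above can be sketched on a single array of distances. The distance values, the 0.35 cutoff and the neighborhood sizes below are arbitrary assumptions chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical cosine distances from one query item to five candidate items.
distances = np.array([0.12, 0.55, 0.31, 0.90, 0.27])

# All the neighbors: every candidate is kept.
all_neighbors = np.arange(len(distances))

# Threshold: keep candidates closer than a chosen cutoff (0.35 is arbitrary).
threshold_neighbors = np.flatnonzero(distances < 0.35)

# Random neighbors: sample a fixed number without looking at the distances.
rng = np.random.default_rng(seed=0)
random_neighbors = rng.choice(len(distances), size=2, replace=False)

# Top-N: keep the N closest candidates, nearest first.
top_n_neighbors = np.argsort(distances)[:3]

print(threshold_neighbors, top_n_neighbors)
```

The KNN model built later in this document corresponds to the Top-N strategy, with the number of neighbors passed as `n_neighbors`.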


### Prepare the data

#### Attribute Information of MovieLens dataset
To build a movie recommender, I chose the MovieLens dataset. It contains 27,753,444 ratings and 1,108,997 tag applications across 58,098 movies. The files loaded below appear to come from a smaller release: the merged dataframe has 20,000,797 rows and 27,262 distinct titles.

Two files from the dataset are used in this study: rating.csv and movie.csv.

* **rating.csv** contains the ratings that users gave to movies:

  * userId
  * movieId
  * rating
  * timestamp
* **movie.csv** contains movie information:

  * movieId
  * title
  * genres

In [1]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors



In [5]:
movie = pd.read_csv(r'C:\Users\ankit\Downloads\Movie Recomender System\dataset\movie.csv')
rating = pd.read_csv(r'C:\Users\ankit\Downloads\Movie Recomender System\dataset\rating.csv')
df = movie.merge(rating, how="left", on="movieId")
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41


In [3]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [5]:
print('Number of Users: {}, Number of movies:{}'.format(len(df.userId.unique()), len(df.title.unique())))  # unique() also counts NaN as one value

Number of Users: 138494, Number of movies:27262


In [6]:
df.isnull().sum()

movieId        0
title          0
genres         0
userId       534
rating       534
timestamp    534
dtype: int64
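
The 534 missing values in `userId`, `rating` and `timestamp` are consistent with the left merge used earlier, which keeps movies even when nobody has rated them. The sketch below reproduces that effect on tiny made-up tables; the frames and values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Tiny made-up stand-ins for movie.csv and rating.csv.
movie = pd.DataFrame({"movieId": [1, 2, 3],
                      "title": ["A (2000)", "B (2001)", "C (2002)"]})
rating = pd.DataFrame({"userId": [10, 11, 10],
                       "movieId": [1, 1, 2],
                       "rating": [4.0, 5.0, 3.5]})

# A left merge keeps every movie, even those with no ratings.
df = movie.merge(rating, how="left", on="movieId")

# Movie 3 has no ratings, so its userId and rating are NaN, and the integer
# userId column is upcast to float64 to hold the missing value.
unrated = df[df["rating"].isna()]

# Dropping those rows (or merging with how="inner") leaves only real ratings.
rated_only = df.dropna(subset=["rating"])

print(df.dtypes["userId"], len(unrated), len(rated_only))
```

This would also explain why `userId` appears as `float64` in the `df.info()` output.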

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000797 entries, 0 to 20000796
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   movieId    int64  
 1   title      object 
 2   genres     object 
 3   userId     float64
 4   rating     float64
 5   timestamp  object 
dtypes: float64(2), int64(1), object(3)
memory usage: 1.0+ GB


First, we need to transform the dataframe of ratings into a format that a KNN model can consume. We want the data in an m x n array, where m is the number of movies and n is the number of users. To reshape the dataframe of ratings, we’ll pivot it to wide format with movies as rows and users as columns. Then we’ll fill the missing observations with 0s, since we’re going to perform linear algebra operations (calculating distances between vectors).
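
A small sketch of that reshaping, on made-up data, shows what the wide format looks like. The frame `ratings_long` and its values are assumptions for illustration; the real pivot on the full dataset appears later in this document.

```python
import pandas as pd

# Made-up long-format ratings: one row per (movie, user) pair.
ratings_long = pd.DataFrame({
    "title":  ["A (2000)", "A (2000)", "B (2001)", "C (2002)"],
    "userId": [10, 11, 10, 12],
    "rating": [4.0, 5.0, 3.5, 2.0],
})

# Wide format: one row per movie, one column per user, 0 where no rating exists.
wide = ratings_long.pivot_table(index="title", columns="userId",
                                values="rating").fillna(0)

# wide.shape is (number of movies, number of users).
print(wide)
```

One design consequence worth keeping in mind: after `fillna(0)`, "not rated" is indistinguishable from a rating of 0, which is a simplification rather than a statement about user preference.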

In [8]:
pd.DataFrame(df.title.value_counts())

Unnamed: 0,title
Pulp Fiction (1994),67310
Forrest Gump (1994),66172
"Shawshank Redemption, The (1994)",63366
"Silence of the Lambs, The (1991)",63299
Jurassic Park (1993),59715
...,...
Rapture (Arrebato) (1980),1
"Education of Mohammad Hussein, The (2013)",1
Satanas (2007),1
Psychosis (2010),1


### Data Cleaning
Remove the movies that have 10,000 or fewer ratings, which is the `<= 10000` threshold applied in the filtering code below.

In [9]:
gf =pd.DataFrame(df['title'].value_counts())
gf

Unnamed: 0,title
Pulp Fiction (1994),67310
Forrest Gump (1994),66172
"Shawshank Redemption, The (1994)",63366
"Silence of the Lambs, The (1991)",63299
Jurassic Park (1993),59715
...,...
Rapture (Arrebato) (1980),1
"Education of Mohammad Hussein, The (2013)",1
Satanas (2007),1
Psychosis (2010),1


In [10]:
df.shape

(20000797, 6)

In [11]:
movie_rating_avg = df.groupby('title')['rating'].mean().sort_values(ascending=False).reset_index().rename(columns={'rating':'Average_rating'})
movie_rating_avg.head()

Unnamed: 0,title,Average_rating
0,Small Roads (2011),5.0
1,Divorce (1945),5.0
2,The Beautiful Story (1992),5.0
3,Into the Middle of Nowhere (2010),5.0
4,The Sea That Thinks (2000),5.0


In [12]:
movie_rating_count =df.groupby('title')['rating'].count().sort_values(ascending =False).reset_index().rename(columns={'rating':'Total_RatingCount'})
movie_rating_count

Unnamed: 0,title,Total_RatingCount
0,Pulp Fiction (1994),67310
1,Forrest Gump (1994),66172
2,"Shawshank Redemption, The (1994)",63366
3,"Silence of the Lambs, The (1991)",63299
4,Jurassic Park (1993),59715
...,...,...
27257,"North and South, Book I (1985)",0
27258,"North and South, Book II (1986)",0
27259,The Happy Road (1957),0
27260,Angele (1934),0


In [13]:
movie_rate_count_avg = movie_rating_count.merge(movie_rating_avg,on='title')
movie_rate_count_avg.head()

Unnamed: 0,title,Total_RatingCount,Average_rating
0,Pulp Fiction (1994),67310,4.174231
1,Forrest Gump (1994),66172,4.029
2,"Shawshank Redemption, The (1994)",63366,4.44699
3,"Silence of the Lambs, The (1991)",63299,4.177057
4,Jurassic Park (1993),59715,3.664741


In [14]:

rare_movie = gf[gf['title'] <= 10000].index  # titles with 10,000 or fewer ratings
movies_df = df[~df['title'].isin(rare_movie)]  # keep only the frequently rated titles
movies_df

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41
...,...,...,...,...,...,...
19985698,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,28195.0,4.0,2014-09-22 20:52:18
19985699,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,51334.0,3.0,2014-09-23 15:53:39
19985700,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,120575.0,2.5,2014-10-08 14:23:39
19985701,114240,Aladdin (1992),Adventure|Animation|Children|Comedy|Fantasy,124998.0,2.5,2014-09-20 22:16:14


## Creating Pivot table
In the pivot table, the index will be **'title'**, the columns will be **'userId'**, and the values will be the **'rating'** that each user gave to each movie.

In [15]:
movieRating_feature= movies_df.pivot_table(index='title',columns='userId',values='rating').fillna(0)


## Create KNN model for prediction
The pivot table, with its empty values already filled with zeros, is converted to a sparse matrix.
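
As a minimal sketch of why a sparse format suits this data, the example below converts a mostly-zero array to `csr_matrix` and counts the explicitly stored entries. The array is made up; the real pivot table is far larger and, presumably, far sparser.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up movie-by-user matrix in which most entries are 0 (no rating).
dense = np.array([
    [4.0, 0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 2.5, 0.0, 0.0, 0.0],
])

sparse = csr_matrix(dense)

# Only the non-zero ratings are stored explicitly.
stored = sparse.nnz
density = stored / (dense.shape[0] * dense.shape[1])

print(stored, round(density, 3))
```
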

To find the **nearest neighbors**, we use the **unsupervised NearestNeighbors learner from sklearn.neighbors**. <br />The algorithm we use to compute the nearest neighbors is **“brute”**, and we specify **“metric=cosine”** so that the algorithm calculates the cosine distance (1 minus the cosine similarity) between rating vectors. Finally, we fit the model.
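
To check what `metric='cosine'` returns, the sketch below fits `NearestNeighbors` on three made-up rating vectors and recomputes one distance by hand as 1 minus the cosine similarity. The matrix `X` is an assumption for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up rating vectors: rows are movies, columns are users.
X = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
])

model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(X)

# Neighbors of movie 0; the closest "neighbor" is movie 0 itself at distance 0.
distances, indices = model.kneighbors(X[0].reshape(1, -1), n_neighbors=2)

# Recompute the distance to the nearest other movie by hand.
a, b = X[0], X[indices[0, 1]]
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
manual_distance = 1.0 - cosine_similarity

print(distances[0, 1], manual_distance)
```

Smaller distances therefore mean more similar rating patterns, which is how the distances printed in the results further down should be read.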


In [16]:
def recommendedMovie(movieRating_feature):
    mat_movie_features_array = csr_matrix(movieRating_feature.values)
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute',n_jobs=-1)
    model_knn.fit(mat_movie_features_array)
    query_index =np.random.choice(movieRating_feature.shape[0])
    print(query_index)
    distances,indices =model_knn.kneighbors(movieRating_feature.iloc[query_index,:].values.reshape(1,-1), n_neighbors=6)
    print("Distances -->",distances," Indices -->",indices)
 
    print(distances.flatten())
    print(len(distances.flatten()))
 
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:\n'.format(movieRating_feature.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, movieRating_feature.index[indices.flatten()[i]], distances.flatten()[i]))

In [17]:
recommendedMovie(movieRating_feature)

33
Distances --> [[0.         0.56519351 0.57897557 0.5927973  0.59798754 0.60009349]]  Indices --> [[ 33  17  15 402 408 167]]
[0.         0.56519351 0.57897557 0.5927973  0.59798754 0.60009349]
6
Recommendations for Army of Darkness (1993):

1: Aliens (1986), with distance of 0.5651935147849823:
2: Alien (1979), with distance of 0.5789755682630586:
3: Starship Troopers (1997), with distance of 0.5927973030540457:
4: Terminator, The (1984), with distance of 0.597987539751613:
5: From Dusk Till Dawn (1996), with distance of 0.6000934894278684:


In [18]:
def recommendedMovie(movieRating_feature,favoriteMovie):
    mat_movie_features_array = csr_matrix(movieRating_feature.values)
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute',n_jobs=-1)
    model_knn.fit(mat_movie_features_array)
    query_index = movieRating_feature.index.get_loc(favoriteMovie)
    print(query_index)
    distances,indices =model_knn.kneighbors(movieRating_feature.iloc[query_index,:].values.reshape(1,-1), n_neighbors=6)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:\n'.format(movieRating_feature.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, movieRating_feature.index[indices.flatten()[i]], distances.flatten()[i]))

In [19]:
recommendedMovie(movieRating_feature,'Star Wars: Episode V - The Empire Strikes Back (1980)')

399
Recommendations for Star Wars: Episode V - The Empire Strikes Back (1980):

1: Star Wars: Episode IV - A New Hope (1977), with distance of 0.20741613801828163:
2: Star Wars: Episode VI - Return of the Jedi (1983), with distance of 0.2156239425493356:
3: Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981), with distance of 0.266592703428847:
4: Indiana Jones and the Last Crusade (1989), with distance of 0.3287503456378781:
5: Terminator, The (1984), with distance of 0.32956188888246263:
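
Both versions of `recommendedMovie` refit the KNN model on every call. One possible refinement, sketched here under assumptions, is to fit once and reuse the model for many queries. The names `fit_knn` and `recommend` and the toy table are hypothetical; the sketch also assumes that titles in the index are unique.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def fit_knn(feature_table):
    """Fit a cosine-distance KNN model once on a movie-by-user table."""
    model = NearestNeighbors(metric="cosine", algorithm="brute")
    model.fit(csr_matrix(feature_table.values))
    return model

def recommend(model, feature_table, favorite_movie, n_recommendations=5):
    """Return (title, distance) pairs for the movies nearest to favorite_movie."""
    query_index = feature_table.index.get_loc(favorite_movie)
    query = feature_table.iloc[[query_index]].values
    # Ask for one extra neighbor because the query movie matches itself.
    distances, indices = model.kneighbors(query, n_neighbors=n_recommendations + 1)
    results = []
    for dist, idx in zip(distances.flatten(), indices.flatten()):
        if idx != query_index:
            results.append((feature_table.index[idx], float(dist)))
    return results[:n_recommendations]

# Made-up stand-in for movieRating_feature (titles as index, users as columns).
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 5.0, 4.0],
     [0.0, 1.0, 4.0, 5.0]],
    index=["Space A", "Space B", "Drama A", "Drama B"],
    columns=[1, 2, 3, 4],
)

model = fit_knn(toy)
recommendations = recommend(model, toy, "Space A", n_recommendations=2)
for title, dist in recommendations:
    print(title, round(dist, 3))
```

Returning a list instead of printing makes the result easier to reuse, for example to display it in an application or to compare recommendations across movies.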
