# Recommender System for Collaborative Filtering

In this notebook, we build a basic recommender system that suggests the items most similar to a given item.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

### Get the data

In [2]:
cols = ["user_id", "item_id", "rating", "timestamp"]
df = pd.read_csv('ml-100k/u.data', sep='\t', names=cols)
df.head()

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
movies = pd.read_csv('ml-100k/u.item', sep='|', encoding="ISO-8859-1", header=None)
movies.head()

,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [4]:
# Rename on the selection directly so no in-place change is made on a slice
movie_titles = movies[[0, 1]].rename(columns={0: 'item_id', 1: 'item_title'})
movie_titles.head()

,item_id,item_title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [5]:
df = pd.merge(df, movie_titles, on='item_id')
df.head()

,user_id,item_id,rating,timestamp,item_title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


### Data Analysis

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns

We compute the mean rating of each movie, then sort the movies by that mean.

In [7]:
df.groupby('item_title')['rating'].mean()

item_title
'Til There Was You (1997)                2.333333
1-900 (1994)                             2.600000
101 Dalmatians (1996)                    2.908257
12 Angry Men (1957)                      4.344000
187 (1997)                               3.024390
                                           ...   
Young Guns II (1990)                     2.772727
Young Poisoner's Handbook, The (1995)    3.341463
Zeus and Roxanne (1997)                  2.166667
unknown                                  3.444444
Á köldum klaka (Cold Fever) (1994)       3.000000
Name: rating, Length: 1664, dtype: float64

In [8]:
df.groupby('item_title')['rating'].mean().sort_values(ascending=False).head(20)

item_title
Marlene Dietrich: Shadow and Light (1996)                 5.000000
Prefontaine (1997)                                        5.000000
Santa with Muscles (1996)                                 5.000000
Star Kid (1997)                                           5.000000
Someone Else's America (1995)                             5.000000
Entertaining Angels: The Dorothy Day Story (1996)         5.000000
Saint of Fort Washington, The (1993)                      5.000000
Great Day in Harlem, A (1994)                             5.000000
They Made Me a Criminal (1939)                            5.000000
Aiqing wansui (1994)                                      5.000000
Pather Panchali (1955)                                    4.625000
Anna (1996)                                               4.500000
Everest (1998)                                            4.500000
Maya Lin: A Strong Clear Vision (1994)                    4.500000
Some Mother's Son (1996)                           

Notice that the list is topped by obscure titles with perfect scores, while well-known movies are missing. This happens when only a few users rated a movie and all of them gave it the top rating, so the mean rating alone is not a good basis for recommendation. Instead, we look at the total number of ratings each movie received.

In [9]:
df.groupby('item_title')['rating'].count().sort_values(ascending=False).head(20)

item_title
Star Wars (1977)                    583
Contact (1997)                      509
Fargo (1996)                        508
Return of the Jedi (1983)           507
Liar Liar (1997)                    485
English Patient, The (1996)         481
Scream (1996)                       478
Toy Story (1995)                    452
Air Force One (1997)                431
Independence Day (ID4) (1996)       429
Raiders of the Lost Ark (1981)      420
Godfather, The (1972)               413
Pulp Fiction (1994)                 394
Twelve Monkeys (1995)               392
Silence of the Lambs, The (1991)    390
Jerry Maguire (1996)                384
Chasing Amy (1997)                  379
Rock, The (1996)                    378
Empire Strikes Back, The (1980)     367
Star Trek: First Contact (1996)     365
Name: rating, dtype: int64

We can compare the top 20 movies by mean rating with the top 20 by number of ratings. The second list is made up of widely watched movies, so the rating count is a better indicator of popularity than the mean rating alone.
The difference arises because a popular movie is rated by many users with varied tastes, so its mean rarely reaches 5.0, whereas a good but little-known movie with only a handful of ratings can easily average a perfect score.
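
The contrast between the two rankings can be reproduced on a tiny made-up table. This is only an illustrative sketch: the titles and ratings below are invented and are not part of the MovieLens data.

```python
import pandas as pd

# Made-up ratings: "Niche Film" has a single 5-star rating,
# "Blockbuster" has several ratings with mixed scores.
toy = pd.DataFrame({
    "item_title": ["Niche Film"] + ["Blockbuster"] * 5,
    "rating": [5, 5, 4, 4, 3, 4],
})

# Mean rating and number of ratings per title in one pass.
summary = toy.groupby("item_title")["rating"].agg(["mean", "count"])

top_by_mean = summary["mean"].idxmax()    # title with the highest mean rating
top_by_count = summary["count"].idxmax()  # title with the most ratings
print(summary)
print(top_by_mean, "|", top_by_count)
```

The title with a single rating wins on mean, while the widely rated title wins on count, which is the same pattern seen in the two top-20 lists.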

Now we create a DataFrame that holds the mean rating and the number of ratings for each movie.

In [10]:
rating = pd.DataFrame(df.groupby('item_title')['rating'].mean())
rating.head()

item_title,rating
'Til There Was You (1997),2.333333
1-900 (1994),2.6
101 Dalmatians (1996),2.908257
12 Angry Men (1957),4.344
187 (1997),3.02439


In [11]:
rating['no of rating'] = df.groupby('item_title')['rating'].count()
rating.head()

item_title,rating,no of rating
'Til There Was You (1997),2.333333,9
1-900 (1994),2.6,5
101 Dalmatians (1996),2.908257,109
12 Angry Men (1957),4.344,125
187 (1997),3.02439,41


### Creating the User-Movie Rating Matrix

In [12]:
movie_mat = pd.pivot_table(data=df, index='user_id', columns='item_title', values='rating')
movie_mat.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,


In [13]:
rating.sort_values('no of rating', ascending=False).head(10)

item_title,rating,no of rating
Star Wars (1977),4.358491,583
Contact (1997),3.803536,509
Fargo (1996),4.155512,508
Return of the Jedi (1983),4.00789,507
Liar Liar (1997),3.156701,485
"English Patient, The (1996)",3.656965,481
Scream (1996),3.441423,478
Toy Story (1995),3.878319,452
Air Force One (1997),3.63109,431
Independence Day (ID4) (1996),3.438228,429


In [14]:
starwars_user_ratings = movie_mat['Star Wars (1977)']
starwars_user_ratings.head()

user_id
1    5.0
2    5.0
3    NaN
4    5.0
5    4.0
Name: Star Wars (1977), dtype: float64

### Finding Similar Movies

In [15]:
similar_to_starwars = movie_mat.corrwith(starwars_user_ratings)
similar_to_starwars.sort_values(ascending=False).head(10)

item_title
No Escape (1994)                        1.0
Man of the Year (1995)                  1.0
Hollow Reed (1996)                      1.0
Commandments (1997)                     1.0
Cosi (1996)                             1.0
Stripes (1981)                          1.0
Golden Earrings (1947)                  1.0
Mondo (1996)                            1.0
Line King: Al Hirschfeld, The (1996)    1.0
Outlaw, The (1943)                      1.0
dtype: float64
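
Correlations of exactly 1.0 for little-known titles usually mean that very few users rated both movies. The sketch below illustrates this on an invented user-by-movie matrix; the column names and ratings are assumptions made up for the example, not MovieLens data.

```python
import numpy as np
import pandas as pd

# Made-up user-by-movie matrix; NaN means the user did not rate the movie.
toy_mat = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0, 3.0],
        "Well Rated": [4.0, 5.0, 1.0, 2.0, 3.0],              # rated by all five users
        "Barely Rated": [np.nan, 4.0, np.nan, 1.0, np.nan],   # only two raters
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# For each column, corrwith uses only the users who rated both that
# movie and the target movie.
corr = toy_mat.corrwith(toy_mat["Target"])

# Number of users each movie shares with the target.
overlap = toy_mat[toy_mat["Target"].notna()].count()

print(pd.DataFrame({"Correlation": corr, "overlap": overlap}))
```

With only two shared raters, any pair of differing ratings lies on a straight line, so the correlation is perfect regardless of how similar the movies really are. This is why the rating count is worth bringing in before trusting the ranking.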

In [16]:
starwars_corr = pd.DataFrame(similar_to_starwars, columns=['Correlation'])
final_matrix = starwars_corr.join(rating['no of rating'])
final_matrix.head()

item_title,Correlation,no of rating
'Til There Was You (1997),0.872872,9
1-900 (1994),-0.645497,5
101 Dalmatians (1996),0.211132,109
12 Angry Men (1957),0.184289,125
187 (1997),0.027398,41


In [17]:
# keep only movies with more than 100 ratings
result = final_matrix[final_matrix['no of rating'] > 100]
result.head()

item_title,Correlation,no of rating
101 Dalmatians (1996),0.211132,109
12 Angry Men (1957),0.184289,125
2001: A Space Odyssey (1968),0.230884,259
Absolute Power (1997),0.08544,127
"Abyss, The (1989)",0.203709,151


### Final Recommendation

In [18]:
result.sort_values('Correlation', ascending=False).head(10)

item_title,Correlation,no of rating
Star Wars (1977),1.0,583
"Empire Strikes Back, The (1980)",0.747981,367
Return of the Jedi (1983),0.672556,507
Raiders of the Lost Ark (1981),0.536117,420
Austin Powers: International Man of Mystery (1997),0.377433,130
"Sting, The (1973)",0.367538,241
Indiana Jones and the Last Crusade (1989),0.350107,331
Pinocchio (1940),0.347868,101
"Frighteners, The (1996)",0.332729,115
L.A. Confidential (1997),0.319065,297
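
The steps above (correlate, attach the rating count, filter, sort) can be wrapped in a single function. The following is a minimal sketch, not part of the original notebook: `recommend_similar` and its parameters `min_ratings` and `top_n` are hypothetical names, and the demo matrix is invented. Unlike cell 18, this sketch drops the query movie itself from the result.

```python
import numpy as np
import pandas as pd


def recommend_similar(movie_mat, rating_counts, title, min_ratings=100, top_n=10):
    """Rank movies by correlation with `title`, skipping sparsely rated ones.

    Hypothetical helper: `movie_mat` is a user-by-title rating matrix and
    `rating_counts` is a Series with the number of ratings per title.
    """
    # Pairwise correlation of every movie with the chosen one.
    corr = movie_mat.corrwith(movie_mat[title]).rename("Correlation")
    table = pd.DataFrame(corr).join(rating_counts.rename("no of rating"))
    # Keep only movies with more than `min_ratings` ratings.
    table = table[table["no of rating"] > min_ratings]
    # The query movie always correlates perfectly with itself, so drop it.
    table = table.drop(index=title, errors="ignore")
    return table.sort_values("Correlation", ascending=False).head(top_n)


# Invented demo data: four users, four movies.
toy_mat = pd.DataFrame(
    {
        "A": [5.0, 4.0, 2.0, 1.0],
        "B": [4.0, 5.0, 1.0, 2.0],
        "C": [1.0, 2.0, 5.0, 4.0],
        "D": [np.nan, 4.0, np.nan, 1.0],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)
toy_counts = toy_mat.count()

result = recommend_similar(toy_mat, toy_counts, "A", min_ratings=2, top_n=5)
print(result)
```

With the notebook's own objects, a call would presumably look like `recommend_similar(movie_mat, rating['no of rating'], 'Star Wars (1977)')`. The threshold of 100 mirrors the filter used in cell 17 and is a tunable choice rather than a fixed rule.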
