# Item-Based Collaborative Filtering

As before, we'll start by importing the MovieLens 100K data set into a pandas DataFrame:

In [1]:
import pandas as pd

import numpy as np

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

ratings = pd.merge(movies, ratings)
ratings.head()

# Count how many ratings each user submitted
rating_counts = ratings.groupby('user_id')['rating'].size()
median = rating_counts.median()
std = rating_counts.std(ddof=0)  # population std, matching np.std

# Flag users whose rating count is more than 2 standard deviations from the median
too_few = rating_counts < median - 2 * std
too_many = rating_counts > median + 2 * std
outlier_ids = rating_counts[too_few | too_many].index
print(rating_counts[too_many].index)

# Drop the outlier users, then index and sort the ratings by user_id
ratings = ratings[~ratings['user_id'].isin(outlier_ids)]
ratings = ratings.set_index('user_id').sort_index()

ratings.head()



Int64Index([  1,   7,  13,  18,  59,  85,  90,  92,  94,  95, 130, 145, 151,
            178, 181, 194, 201, 222, 234, 268, 269, 271, 276, 279, 286, 291,
            293, 299, 301, 303, 308, 311, 327, 328, 334, 363, 374, 378, 385,
            387, 389, 393, 399, 405, 406, 416, 417, 429, 435, 450, 457, 474,
            497, 524, 532, 537, 551, 561, 592, 642, 648, 650, 653, 655, 682,
            716, 727, 747, 749, 758, 796, 804, 805, 833, 846, 864, 870, 880,
            883, 889, 896, 916],
           dtype='int64', name='user_id')




user_id,movie_id,title,rating
0,50,Star Wars (1977),5
0,133,Gone with the Wind (1939),1
0,172,"Empire Strikes Back, The (1980)",5
2,269,"Full Monty, The (1997)",4
2,25,"Birdcage, The (1996)",4


Now we'll pivot this table to construct a matrix with one row per user and one column per movie. NaN indicates missing data, that is, movies that a given user did not rate:

In [2]:
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head(288)

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,"World of Apu, The (Apur Sansar) (1959)","Wrong Trousers, The (1993)",Wyatt Earp (1994),Year of the Horse (1997),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown
0,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,5.0,,,4.0,,,,,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,,,,4.0,,,,,,,...,,,,,,,,,,
316,,,,,,,,,,,...,,,,,,,,,,
317,,,,,,,,,,,...,,,,,,,,,,
318,,,,,,,,,,,...,,,,,4.0,,,,,
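
To make the pivot step concrete, here is a minimal sketch on a tiny made-up table. The `toy` data and the movie names `A`, `B`, `C` are assumptions for illustration only, not part of MovieLens:

```python
import pandas as pd

# Toy long-format ratings: one row per (user, movie) rating (made-up data)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['A', 'B', 'A', 'B', 'C'],
    'rating':  [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; pairs with no rating become NaN
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)
```

Each user who did not rate a movie gets NaN in that column, which is the same shape of matrix as userRatings above, just much smaller.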


Pandas has a built-in corr() method that will compute a correlation score for every column pair in the matrix! This gives us a correlation score between every pair of movies (where too few users rated both movies, NaN shows up instead). That's amazing!

In [3]:
corrMatrix = userRatings.corr()
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,"World of Apu, The (Apur Sansar) (1959)","Wrong Trousers, The (1993)",Wyatt Earp (1994),Year of the Horse (1997),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown
'Til There Was You (1997),1.0,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,0.02882,1.0,0.191741,0.075593,0.054997,,0.258199,...,,-0.117267,,,0.157143,0.376845,0.738549,0.5,0.852803,
12 Angry Men (1957),,,0.02882,1.0,,0.062017,0.196553,0.223255,,0.3849,...,1.0,0.158666,-0.342997,,0.224506,-0.111359,-0.375,,1.0,
187 (1997),,,1.0,,1.0,0.258199,,-0.426401,,,...,,,,,0.866025,,,,,
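
As a rough illustration of what corr() does with missing data, the sketch below uses a small made-up matrix (`toy_matrix` and its values are assumptions). Each pair of columns is correlated using only the users who rated both movies:

```python
import numpy as np
import pandas as pd

# Made-up user x movie matrix; NaN means the user did not rate that movie
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, np.nan],
    'B': [4.0, 5.0, 2.0, 3.0],
    'C': [np.nan, np.nan, 5.0, 1.0],
}, index=pd.Index([1, 2, 3, 4], name='user_id'))

# Pairwise Pearson correlation over the users each pair has in common
toy_corr = toy_matrix.corr()
print(toy_corr)

# A and C share a single rater, so no correlation can be computed (NaN).
# B and C share only two raters, which forces a "perfect" correlation of -1.
```

Scores of exactly 1.0 or -1.0 built from two raters are the kind of result that looks strong but carries almost no evidence.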


However, we want to avoid spurious results that come from just a handful of users who happened to rate the same pair of movies. To restrict our results to movies that lots of people rated together, and also to get more popular results that are more easily recognizable, we'll use the min_periods argument to throw out results where fewer than 100 users rated a given movie pair:

In [4]:
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,"World of Apu, The (Apur Sansar) (1959)","Wrong Trousers, The (1993)",Wyatt Earp (1994),Year of the Horse (1997),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,
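
The effect of min_periods can be seen on a small made-up matrix (the data and the threshold of 3 are assumptions chosen to keep the example tiny). Pairs with fewer common raters than the threshold are replaced by NaN, including a movie's correlation with itself:

```python
import numpy as np
import pandas as pd

# Made-up user x movie matrix; NaN means the user did not rate that movie
toy_matrix = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, np.nan],
    'B': [4.0, 5.0, 2.0, 3.0],
    'C': [np.nan, np.nan, 5.0, 1.0],
}, index=pd.Index([1, 2, 3, 4], name='user_id'))

# Require at least 3 users in common before reporting a correlation
toy_corr_min3 = toy_matrix.corr(method='pearson', min_periods=3)
print(toy_corr_min3)

# A/B (3 common raters) survives; B/C (2 common raters) becomes NaN,
# and so does C/C, because only 2 users rated C at all.
```

This is why the first rows of the real matrix above are entirely NaN: those movies have fewer than 100 ratings.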


Now let's produce some movie recommendations for user ID 0, who I manually added to the data set as a test case. I'll extract this user's ratings from the userRatings DataFrame and use dropna() to get rid of missing data, leaving only a Series of the movies I actually rated:

In [5]:
myRatings = userRatings.loc[0].dropna()
myRatings

title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64

In [6]:
# User 0 rated only 3 movies:
# Empire Strikes Back, The (1980) with 5.0
# Gone with the Wind (1939) with 1.0
# Star Wars (1977) with 5.0

Now, let's go through each movie I rated one at a time, and build up a list of possible recommendations based on the movies similar to the ones I rated.

In [7]:
simCandidates = []
for title, rating in myRatings.items():
    print("Adding sims for " + title + "...")
    # Retrieve similar movies to this one that I rated
    sims = corrMatrix[title].dropna()
    # Now scale its similarity by how well I rated this movie
    sims = sims * rating
    # Add the scores to the list of similarity candidates
    simCandidates.append(sims)
# Series.append was removed in pandas 2.0, so combine everything with concat
simCandidates = pd.concat(simCandidates)

# Glance at our results so far:
print("sorting...")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(10))

Adding sims for Empire Strikes Back, The (1980)...
Adding sims for Gone with the Wind (1939)...
Adding sims for Star Wars (1977)...
sorting...
Empire Strikes Back, The (1980)               5.000000
Star Wars (1977)                              5.000000
Empire Strikes Back, The (1980)               3.711522
Star Wars (1977)                              3.711522
Return of the Jedi (1983)                     3.612693
Return of the Jedi (1983)                     3.274085
Raiders of the Lost Ark (1981)                2.667206
Raiders of the Lost Ark (1981)                2.544421
Star Trek III: The Search for Spock (1984)    2.263380
Men in Black (1997)                           2.151404
dtype: float64

In [8]:
# It is normal to find the same movie more than once, because it can be similar to more than one of the movies rated by user 0.

We'll use groupby() to add together the scores from movies that show up more than once, so they'll count more:

In [9]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()

In [10]:
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

Star Wars (1977)                              8.764495
Empire Strikes Back, The (1980)               8.711522
Return of the Jedi (1983)                     6.886778
Raiders of the Lost Ark (1981)                5.351456
Star Trek III: The Search for Spock (1984)    4.057361
Sting, The (1973)                             3.614919
Star Trek: The Wrath of Khan (1982)           3.558104
Wizard of Oz, The (1939)                      3.466549
Lion King, The (1994)                         3.295743
Men in Black (1997)                           3.257072
dtype: float64
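
The groupby() step can be checked on a tiny made-up Series (the movie names and scores are assumptions). Entries that share an index label are summed into one score:

```python
import pandas as pd

# Made-up candidate scores; 'Movie Y' appears twice because it is
# similar to two different movies the user rated
toy_candidates = pd.Series(
    [3.7, 3.6, 3.3, 2.7],
    index=['Movie X', 'Movie Y', 'Movie Y', 'Movie Z'],
)

# Sum the scores of duplicate titles, then rank from best to worst
toy_totals = toy_candidates.groupby(toy_candidates.index).sum()
toy_totals = toy_totals.sort_values(ascending=False)
print(toy_totals)
```

A movie that is moderately similar to several rated movies can therefore outrank one that is strongly similar to a single rated movie.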

The last thing we have to do is filter out movies I've already rated:

In [11]:
filteredSims = simCandidates.drop(myRatings.index)
filteredSims


Return of the Jedi (1983)                     6.886778
Raiders of the Lost Ark (1981)                5.351456
Star Trek III: The Search for Spock (1984)    4.057361
Sting, The (1973)                             3.614919
Star Trek: The Wrath of Khan (1982)           3.558104
                                                ...   
This Is Spinal Tap (1984)                    -0.186743
Dead Man Walking (1995)                      -0.286047
Heat (1995)                                  -0.322224
Annie Hall (1977)                            -0.388719
Brazil (1985)                                -0.532340
Length: 155, dtype: float64

In [12]:
# We have to remove the movies the user has already rated; otherwise they would appear at the top of the suggestion list.


## Exercise

Can you improve on these results? Perhaps a different method or min_periods value on the correlation computation would produce more interesting results.
##### We need to change the method and min_periods and see whether the recommendation results get better or not.
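
As one possible starting point for this experiment, the sketch below compares Pearson and Spearman correlation on made-up data (the values are assumptions). Spearman works on ranks, so it only asks whether two movies are ordered the same way by users, not whether the relationship is linear:

```python
import pandas as pd

# Made-up ratings from four users for two movies
toy_matrix = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 5.0],
    'B': [1.0, 2.0, 3.0, 4.0],
})

pearson = toy_matrix.corr(method='pearson', min_periods=3)
spearman = toy_matrix.corr(method='spearman', min_periods=3)

# Users rank A and B identically, so the rank correlation is perfect,
# while the linear (Pearson) correlation is slightly below 1.
print(pearson.loc['A', 'B'], spearman.loc['A', 'B'])
```

On the real data, the same two arguments can be varied in the `userRatings.corr(...)` call and the resulting recommendation lists compared by eye.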

Also, it looks like some movies similar to Gone with the Wind made it through to the final list of recommendations. Perhaps movies similar to ones the user rated poorly should actually be penalized, instead of just scaled down?
##### We need to avoid putting in the recommendation list movies that resemble a movie the user gave a low rating.
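
One way to penalize such movies, sketched here under assumptions, is to weight each similarity by the rating minus a neutral value instead of by the raw rating. The helper `score_candidates`, the neutral value of 3.0 (the midpoint of the 1 to 5 scale), and the toy correlation matrix are all hypothetical:

```python
import pandas as pd

def score_candidates(corr_matrix, my_ratings, neutral=3.0):
    """Hypothetical helper: weight similarities by (rating - neutral)."""
    parts = []
    for title, rating in my_ratings.items():
        sims = corr_matrix[title].dropna()
        # Ratings below `neutral` give a negative weight, i.e. a penalty
        parts.append(sims * (rating - neutral))
    scores = pd.concat(parts)
    scores = scores.groupby(scores.index).sum()
    # Remove movies the user already rated, then rank the rest
    scores = scores.drop(my_ratings.index, errors='ignore')
    return scores.sort_values(ascending=False)

# Made-up symmetric correlation matrix for four movies
titles = ['Liked', 'Disliked', 'Near liked', 'Near disliked']
toy_corr = pd.DataFrame(
    [[1.0, 0.1, 0.8, 0.2],
     [0.1, 1.0, 0.1, 0.9],
     [0.8, 0.1, 1.0, 0.0],
     [0.2, 0.9, 0.0, 1.0]],
    index=titles, columns=titles,
)
toy_ratings = pd.Series({'Liked': 5.0, 'Disliked': 1.0})

scores = score_candidates(toy_corr, toy_ratings)
print(scores)
```

With this weighting, a movie close to a 1-star rating ends up with a negative score rather than a small positive one.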

There are also probably some outliers in the user rating data set - some users may have rated a huge number of movies and have a disproportionate effect on the results. Go back to earlier courses to learn how to identify these outliers, and see if removing them improves things.
##### We (l kantri and I) did not understand this one; we could not work out what the outliers are in our case.
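
In this setting, one reasonable reading of "outlier" is a user whose number of ratings is far from typical. The sketch below flags such users on made-up counts; the data and the threshold of 2 standard deviations above the median are assumptions, not a rule from the course:

```python
import pandas as pd

# Made-up number of ratings per user; user 6 rated far more than the others
rating_counts = pd.Series(
    [20, 25, 30, 22, 28, 400],
    index=pd.Index([1, 2, 3, 4, 5, 6], name='user_id'),
)

median = rating_counts.median()
std = rating_counts.std(ddof=0)  # population standard deviation

# Users whose rating count is more than 2 standard deviations above the median
heavy_users = rating_counts[rating_counts > median + 2 * std].index
print(list(heavy_users))
```

Dropping the rows of those users from the ratings table before pivoting keeps a few very active users from dominating the correlations.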

For an even bigger project: we're evaluating the result qualitatively here, but we could actually apply train/test and measure our ability to predict user ratings for movies they've already watched. Whether that's actually a measure of a "good" recommendation is debatable though.
##### For a user we know watched a movie and liked it, we will remove his rating for that movie and see whether our model is able to recommend that movie, which he actually likes.
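
The hold-out idea can be sketched on made-up data. The predictor below (`predict_rating`, a similarity-weighted mean of the user's other ratings) is a hypothetical choice for illustration, not the method used earlier in this notebook, and the correlation values are assumptions:

```python
import numpy as np
import pandas as pd

def predict_rating(corr_matrix, user_ratings, target):
    """Hypothetical predictor: similarity-weighted mean of the user's other ratings."""
    sims = corr_matrix[target].reindex(user_ratings.index).dropna()
    sims = sims[sims > 0]  # keep only positively correlated movies
    if sims.empty:
        return np.nan
    return float((sims * user_ratings[sims.index]).sum() / sims.sum())

# Made-up correlation matrix and one user's known ratings
titles = ['A', 'B', 'C']
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=titles, columns=titles,
)
known = pd.Series({'A': 5.0, 'B': 4.0, 'C': 2.0})

# Hide the rating for 'B', predict it from the rest, and measure the error
held_out = 'B'
train = known.drop(held_out)
predicted = predict_rating(toy_corr, train, held_out)
error = abs(predicted - known[held_out])
print(predicted, error)
```

Repeating this over many held-out ratings and averaging the error would give a single number to compare different method and min_periods settings. On real data, the correlation matrix should also be computed without the held-out ratings so the test stays fair.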