# Recommendation system - User Based Matrix Factorization

## Overview
Recommendation systems are algorithms that suggest items to users based on information about those users and their past behavior. They have become ubiquitous and are commonly seen in online stores, movie databases, and job finders. In this notebook, we will explore recommendation systems based on collaborative filtering and implement simple versions of them using Python and the pandas library.

## Problem Statement
The dataset we will use is the MovieLens Dataset. It contains 100k movie ratings from 943 users and a selection of 1682 movies. We will use Python and the Pandas library to implement two simple recommendation systems: one based on the mean of the user's ratings, and the other based on user-user collaborative filtering.


## Obvious Applications
1. Amazon - Product Recommendations
2. Netflix - Movie Recommendations
3. Pandora - Music Recommendations
4. Yelp - Recommendations for restaurants, businesses, etc.
5. Goodreads - Book Recommendations
6. Facebook - Friend Recommendations
7. LinkedIn - Job Recommendations
8. YouTube - Video Recommendations
9. Twitter - Who to Follow Recommendations
10. Instagram - Who to Follow Recommendations
11. Spotify - Music Recommendations
12. Google - Search Recommendations
13. Airbnb - Travel Recommendations
14. Uber - Ride Recommendations
15. eBay - Product Recommendations
16. Pinterest - Image Recommendations
17. Reddit - Post Recommendations

## Collaborative Filtering
Collaborative filtering is a technique used by recommendation systems to predict a user's interests by collecting preferences from many users. The underlying assumption is that if user A shares user B's opinion on one issue, A is more likely to share B's opinion on a different issue than that of a randomly chosen user. There are two main types of collaborative filtering: user-based and item-based. In either case, the system holds the preferences of a group of users for a set of items and uses this information to recommend items to users.

Collaborative filtering is the workhorse of many recommender engines. It can also learn useful features on its own, without hand-crafted descriptions of the users or items. Approaches are commonly divided into memory-based and model-based collaborative filtering. In this notebook, we will implement model-based CF using singular value decomposition (SVD) and memory-based CF by computing cosine similarity.
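
As a rough illustration of the memory-based idea, the sketch below computes cosine similarity between users on a tiny, made-up ratings matrix. The matrix values, the name `toy_ratings`, and the helper `cosine_similarity_matrix` are assumptions for illustration only, and unrated items are encoded as 0 here purely to keep the example short.

```python
import numpy as np

# Toy ratings: rows are users, columns are items; 0 marks "not rated".
# The values are made up purely for illustration.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity_matrix(matrix):
    """Return the pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / norms          # assumes no all-zero rows
    return normalized @ normalized.T

user_similarity = cosine_similarity_matrix(toy_ratings)
print(np.round(user_similarity, 3))
```

In this toy data, users 0 and 1 rate the first two items highly and come out far more similar to each other than either is to user 2, which is the signal a user-based recommender would exploit.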

## Starting Point: Matrix Factorization
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices. The user-item interaction matrix is a matrix of size m x n, where m is the number of users and n is the number of items. Each cell in the matrix represents the rating given by a user to an item. The goal of matrix factorization is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items. Matrix factorization can be done through Singular Value Decomposition (SVD) or Alternating Least Squares (ALS). In this notebook, we will use Singular Value Decomposition.
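
To make the decomposition concrete, here is a minimal sketch using `numpy.linalg.svd` on a small, fully observed toy matrix. The data and the helper name `low_rank_factors` are hypothetical. Note that a plain SVD needs every cell filled in, so a real ratings matrix with missing entries would first have to be imputed or handled with a different solver.

```python
import numpy as np

# Made-up, fully observed toy ratings (4 users x 4 items).
toy_ratings = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def low_rank_factors(matrix, k):
    """Split `matrix` into user and item factors of rank k via truncated SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
    item_factors = Vt[:k, :]          # shape (k, n_items)
    return user_factors, item_factors

user_factors, item_factors = low_rank_factors(toy_ratings, k=2)
approximation = user_factors @ item_factors   # predicted rating = dot product
print(np.round(approximation, 2))
```

Each row of `user_factors` is a user's latent preference vector and each column of `item_factors` is an item's latent attribute vector; a larger `k` reproduces the original matrix more closely.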


## Defining our error
In ML, defining the error (or loss, or cost) is often the core of defining the objective. Once we define the error, we can usually plug it into a canned solver that minimizes it. Defining the error can be obvious or very subtle, and there may be several acceptable choices.

### Clustering
For k-means, we simply used the distance from each point to its centroid as the error. This is a very common approach.

### Image Recognition
If our algorithm tags a picture of a cat as a dog, is that a larger error than if it tags it as a horse? Or a car? How would you quantify that? This is a hard problem, and the error function is not at all obvious.

### Regression
Do you want to penalize an occasional large error more than a lot of medium errors? Then you might use the sum of the squares of the errors, which is the squared L2 norm of the error vector.
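
To see how squaring shifts the balance between many medium errors and one large error, the sketch below compares the sum of absolute errors with the sum of squared errors on two made-up error vectors. The vectors and helper names are illustrative assumptions.

```python
import numpy as np

many_medium = np.array([2.0, 2.0, 2.0, 2.0])   # four errors of size 2
one_large = np.array([0.0, 0.0, 0.0, 8.0])     # a single error of size 8

def sum_abs(errors):
    """Sum of absolute errors (L1 norm)."""
    return float(np.sum(np.abs(errors)))

def sum_sq(errors):
    """Sum of squared errors (squared L2 norm)."""
    return float(np.sum(errors ** 2))

print(sum_abs(many_medium), sum_abs(one_large))   # both totals are equal
print(sum_sq(many_medium), sum_sq(one_large))     # the single large error dominates
```

Both vectors have the same total absolute error, but the squared error treats the single large miss as four times worse.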

### Recommender
We will take the mean squared error between our given matrix and our approximation as a starting point.
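
A minimal sketch of that error, assuming missing ratings are stored as NaN and should be skipped rather than counted as zeros. The function name `masked_mse` and the tiny matrices are hypothetical.

```python
import numpy as np

def masked_mse(actual, predicted):
    """Mean squared error over the cells of `actual` that hold a rating."""
    observed = ~np.isnan(actual)                 # True where a rating exists
    diff = actual[observed] - predicted[observed]
    return float(np.mean(diff ** 2))

# Made-up example: two known ratings, two unknown ones.
actual = np.array([[5.0, np.nan],
                   [np.nan, 1.0]])
predicted = np.array([[4.0, 3.0],
                      [2.0, 1.0]])

print(masked_mse(actual, predicted))
```

Only the two observed cells contribute, so predictions for unrated movies are neither rewarded nor punished.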

## Roadmap
1. Load the dataset and explore it.
2. Create an ALS model.
3. Train it with varying ranks (k) to find reasonable hyperparameters.
4. Add a new user
5. Get top recommendations for a user

## Dataset 
We will use the MovieLens dataset, which is one of the most common datasets used when implementing and testing recommendation engines. It contains 100k movie ratings from 943 users and a selection of 1682 movies. You can download the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k.zip). We will use u.data and u.item files from the dataset.
References: https://grouplens.org/datasets/movielens/

ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its world wide trial of an automated collaborative filtering
system for Usenet news in 1996.  The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:
        
        http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

        http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                gunzip ml-data.tar.gz
                tar xvf ml-data.tar
                mku.sh

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a pipe ('|') separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a pipe ('|')
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test       are 80%/20% splits of the u data into training and test data.
u2.base       Each of u1, ..., u5 have disjoint test sets; this is for
u2.test       5 fold cross validation (where you repeat your experiment
u3.base       with each training and test set and average the results).
u3.test       These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test       split the u data into a training set and a test set with
ub.base       exactly 10 ratings per user in the test set.  The sets
ub.test       ua.test and ub.test are disjoint.  These data sets can
              be generated from u.data by mku.sh.

allbut.pl  -- The script that generates training and test sets where
              all but n of a user's ratings are in the training data.

mku.sh     -- A shell script to generate all the u data sets from u.data.
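
The `u.data` layout described above (tab-separated user id, item id, rating, and Unix timestamp) can be read and the timestamps converted roughly as follows. The three rows below are made up for illustration and are not taken from the dataset; the column names are an assumption chosen to match the rest of this notebook.

```python
import io
import pandas as pd

# Three made-up rows in the u.data layout: user id, item id, rating, timestamp.
sample = "1\t10\t4\t880000000\n2\t10\t5\t880003600\n1\t20\t3\t880086400\n"

columns = ['user_id', 'movie_id', 'rating', 'timestamp']
sample_ratings = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)

# Timestamps are Unix seconds since 1970-01-01 UTC, so unit='s' converts them.
sample_ratings['rated_at'] = pd.to_datetime(sample_ratings['timestamp'], unit='s', utc=True)
print(sample_ratings)
```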

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns



## Load Ratings

In [2]:
r_columns = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', header=None, usecols=range(3), names=r_columns)
ratings

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1
...,...,...,...
99995,880,476,3
99996,716,204,5
99997,276,1090,1
99998,13,225,2



## Load Movies

In [3]:
r_columns=['movie_id', 'title']
movies = pd.read_csv('ml-100k/u.item', encoding='iso-8859-1', sep='|', header=None, names=r_columns, usecols=range(2))
movies

,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)
...,...,...
1677,1678,Mat' i syn (1997)
1678,1679,B. Monkey (1998)
1679,1680,Sliding Doors (1998)
1680,1681,You So Crazy (1994)


## Load Users

In [4]:
r_columns = ['user_id', 'age', 'gender', 'profession', 'zipcode']
users = pd.read_table('ml-100k/u.user', sep='|', header=None, names=r_columns)
users

,user_id,age,gender,profession,zipcode
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
...,...,...,...,...,...
938,939,26,F,student,33319
939,940,32,M,administrator,02215
940,941,20,M,student,97229
941,942,48,F,librarian,78209


In [5]:
users.iloc[196]

user_id              197
age                   55
gender                 M
profession    technician
zipcode            75094
Name: 196, dtype: object

In [6]:
movies.iloc[241]

movie_id             242
title       Kolya (1996)
Name: 241, dtype: object

## Join Data

In [7]:
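# Join on the shared key explicitly so the merge does not depend on column-name inference
ratings = pd.merge(movies, ratings, on='movie_id')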
ratings

,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3
...,...,...,...,...
99995,1678,Mat' i syn (1997),863,1
99996,1679,B. Monkey (1998),863,3
99997,1680,Sliding Doors (1998),863,2
99998,1681,You So Crazy (1994),896,3


## Data Visualisation

In [9]:
ratings.title.value_counts()

title
Star Wars (1977)                                583
Contact (1997)                                  509
Fargo (1996)                                    508
Return of the Jedi (1983)                       507
Liar Liar (1997)                                485
                                               ... 
Tigrero: A Film That Was Never Made (1994)        1
Eye of Vichy, The (Oeil de Vichy, L') (1993)      1
Promise, The (Versprechen, Das) (1994)            1
To Cross the Rubicon (1991)                       1
Scream of Stone (Schrei aus Stein) (1991)         1
Name: count, Length: 1664, dtype: int64

In [10]:
ratings.movie_id.nunique()

1682
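
The two counts differ: there are 1664 distinct titles but 1682 distinct `movie_id` values, which suggests that some titles are shared by more than one id. A quick way to look for such cases is sketched below on a made-up frame; the titles and the name `toy_movies` are hypothetical, and the same grouping could be applied to `movies`.

```python
import pandas as pd

# Made-up movie list in which one title appears under two different ids.
toy_movies = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'title': ['Movie A (1990)', 'Movie B (1991)', 'Movie A (1990)', 'Movie C (1992)'],
})

# Count how many distinct ids each title maps to, then keep the repeated ones.
ids_per_title = toy_movies.groupby('title')['movie_id'].nunique()
duplicated_titles = ids_per_title[ids_per_title > 1]
print(duplicated_titles)
```

This matters later because pivoting on `title` folds all ids with the same title into a single column.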

In [11]:
np.size, np.sum, np.mean

(<function size at 0x7f79b83b2d30>,
 <function sum at 0x7f79b83b1530>,
 <function mean at 0x7f79b83b3370>)

In [12]:
np.size.__doc__

'\n    Return the number of elements along a given axis.\n\n    Parameters\n    ----------\n    a : array_like\n        Input data.\n    axis : int, optional\n        Axis along which the elements are counted.  By default, give\n        the total number of elements.\n\n    Returns\n    -------\n    element_count : int\n        Number of elements along the specified axis.\n\n    See Also\n    --------\n    shape : dimensions of array\n    ndarray.shape : dimensions of array\n    ndarray.size : number of elements in array\n\n    Examples\n    --------\n    >>> a = np.array([[1,2,3],[4,5,6]])\n    >>> np.size(a)\n    6\n    >>> np.size(a,1)\n    3\n    >>> np.size(a,0)\n    2\n\n    '

## Matrix Factorization

In [13]:
movie_ratings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')
movie_ratings

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,,,...,,,,,,,,,,
940,,,,,,,,,,,...,,,,,,,,,,
941,,,,,,,,,,,...,,,,,,,,,,
942,,,,,,,,3.0,,3.0,...,,,,,,,,,,
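
Most cells in this user-by-title matrix are empty: with 100,000 ratings spread over 943 users and 1682 movies, only about 6% of user-movie pairs carry a rating. The sketch below shows the same pivot and a sparsity measurement on a made-up long-format frame; the names `toy` and `toy_matrix` are hypothetical.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, title) pair.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title': ['A', 'B', 'A', 'B', 'C'],
    'rating': [5, 3, 4, 2, 1],
})

# Same pivot as above: users as rows, titles as columns, NaN where unrated.
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')

# Fraction of cells with no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(toy_matrix)
print(f"sparsity: {sparsity:.2f}")
```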


In [32]:
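# Column labels are full titles including the release year (e.g. 'Star Wars (1977)'),
# so a label that matches no column raises a KeyError
movie_ratings['The Matrix']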


KeyError: 'The Matrix'

In [15]:
starwars_ratings = movie_ratings['Star Wars (1977)']
starwars_ratings.value_counts()

Star Wars (1977)
5.0    325
4.0    176
3.0     57
2.0     16
1.0      9
Name: count, dtype: int64

### Correlation with Other Movies

In [16]:
movie_ratings[['101 Dalmatians (1996)', 'Star Wars (1977)']].corr()

title,101 Dalmatians (1996),Star Wars (1977)
101 Dalmatians (1996),1.0,0.211132
Star Wars (1977),0.211132,1.0


### Similarity


In [17]:
movie_ratings.corrwith(starwars_ratings)

title
'Til There Was You (1997)                0.872872
1-900 (1994)                            -0.645497
101 Dalmatians (1996)                    0.211132
12 Angry Men (1957)                      0.184289
187 (1997)                               0.027398
                                           ...   
Young Guns II (1990)                     0.228615
Young Poisoner's Handbook, The (1995)   -0.007374
Zeus and Roxanne (1997)                  0.818182
unknown                                  0.723123
Á köldum klaka (Cold Fever) (1994)            NaN
Length: 1664, dtype: float64

In [18]:
similar_movies = movie_ratings.corrwith(starwars_ratings).sort_values(ascending=False)
# NaN correlations sort to the end; chain .dropna() before .sort_values() to discard them
similar_movies

title
Hollow Reed (1996)                         1.0
Commandments (1997)                        1.0
Cosi (1996)                                1.0
No Escape (1994)                           1.0
Stripes (1981)                             1.0
                                          ... 
Wonderland (1997)                          NaN
Wooden Man's Bride, The (Wu Kui) (1994)    NaN
Yankee Zulu (1994)                         NaN
You So Crazy (1994)                        NaN
Á köldum klaka (Cold Fever) (1994)         NaN
Length: 1664, dtype: float64
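
Several titles above show a correlation of exactly 1.0. A plausible explanation is that only a handful of users rated both that movie and Star Wars, so the correlation rests on very few points. One possible guard, sketched here on made-up data, is to require a minimum number of common raters through the `min_periods` argument of `Series.corr`. The helper `correlate_with` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up user-by-title matrix; NaN means the user did not rate the title.
toy_matrix = pd.DataFrame({
    'Target':  [5, 4, 1, 2, 5, np.nan],
    'Popular': [4, 5, 2, 1, 4, 3],
    'Obscure': [np.nan, np.nan, 1, 2, np.nan, np.nan],
})

def correlate_with(matrix, target, min_common=1):
    """Pearson correlation of each column with `target`, keeping only
    columns that share at least `min_common` raters with it."""
    return matrix.apply(lambda col: col.corr(matrix[target], min_periods=min_common))

loose = correlate_with(toy_matrix, 'Target')
strict = correlate_with(toy_matrix, 'Target', min_common=3)
print(loose)
print(strict)
```

Without the threshold, `Obscure` looks perfectly correlated on the strength of two shared raters; with `min_common=3` it becomes NaN and can be dropped.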

## Popularity

In [19]:
# String aliases give the same size/sum/mean columns without passing NumPy functions to agg
movie_stats = ratings.groupby('title').agg({'rating': ['size', 'sum', 'mean']})


In [20]:
movie_stats.sort_values([('rating', 'mean')], ascending=False)

,rating,rating,rating
title,size,sum,mean
They Made Me a Criminal (1939),1,5,5.0
Marlene Dietrich: Shadow and Light (1996),1,5,5.0
"Saint of Fort Washington, The (1993)",2,10,5.0
Someone Else's America (1995),1,5,5.0
Star Kid (1997),3,15,5.0
...,...,...,...
"Eye of Vichy, The (Oeil de Vichy, L') (1993)",1,1,1.0
King of New York (1990),1,1,1.0
Touki Bouki (Journey of the Hyena) (1973),1,1,1.0
"Bloody Child, The (1996)",1,1,1.0


In [21]:
# change column names to size, mean, and sum
# movie_stats.columns = movie_stats.columns.droplevel()
# movie_stats.columns = ['size', 'sum', 'mean']

In [22]:
popular_movies = movie_stats[movie_stats[('rating', 'size')] > 100]
popular_movies.sort_values([('rating', 'mean')], ascending=False)

,rating,rating,rating
title,size,sum,mean
"Close Shave, A (1995)",112,503,4.491071
Schindler's List (1993),298,1331,4.466443
"Wrong Trousers, The (1993)",118,527,4.466102
Casablanca (1942),243,1083,4.456790
"Shawshank Redemption, The (1994)",283,1258,4.445230
...,...,...,...
Spawn (1997),143,374,2.615385
Event Horizon (1997),127,327,2.574803
Crash (1996),128,326,2.546875
Jungle2Jungle (1997),132,322,2.439394


In [23]:
similar_movies

title
Hollow Reed (1996)                         1.0
Commandments (1997)                        1.0
Cosi (1996)                                1.0
No Escape (1994)                           1.0
Stripes (1981)                             1.0
                                          ... 
Wonderland (1997)                          NaN
Wooden Man's Bride, The (Wu Kui) (1994)    NaN
Yankee Zulu (1994)                         NaN
You So Crazy (1994)                        NaN
Á köldum klaka (Cold Fever) (1994)         NaN
Length: 1664, dtype: float64

In [24]:
popular_movies

,rating,rating,rating
title,size,sum,mean
101 Dalmatians (1996),109,317,2.908257
12 Angry Men (1957),125,543,4.344000
2001: A Space Odyssey (1968),259,1028,3.969112
Absolute Power (1997),127,428,3.370079
"Abyss, The (1989)",151,542,3.589404
...,...,...,...
Willy Wonka and the Chocolate Factory (1971),326,1184,3.631902
"Wizard of Oz, The (1939)",246,1003,4.077236
"Wrong Trousers, The (1993)",118,527,4.466102
Young Frankenstein (1974),200,789,3.945000


In [25]:
popular_movies.columns = popular_movies.columns.get_level_values(0)
popular_movies

title,rating,rating,rating
101 Dalmatians (1996),109,317,2.908257
12 Angry Men (1957),125,543,4.344000
2001: A Space Odyssey (1968),259,1028,3.969112
Absolute Power (1997),127,428,3.370079
"Abyss, The (1989)",151,542,3.589404
...,...,...,...
Willy Wonka and the Chocolate Factory (1971),326,1184,3.631902
"Wizard of Oz, The (1939)",246,1003,4.077236
"Wrong Trousers, The (1993)",118,527,4.466102
Young Frankenstein (1974),200,789,3.945000
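
Keeping level 0 of the column index leaves all three columns named `rating`, as the output above shows. If distinct names are preferred, one option is to keep the second level instead, which carries `size`, `sum`, and `mean`. The sketch below uses a made-up frame and works on a copy so the original statistics stay untouched; `toy`, `toy_stats`, and `flat_stats` are hypothetical names.

```python
import pandas as pd

# Made-up ratings for two titles.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B'],
    'rating': [5, 3, 4, 2, 3],
})

# Same aggregation shape as movie_stats: two-level columns (rating, size/sum/mean).
toy_stats = toy.groupby('title').agg({'rating': ['size', 'sum', 'mean']})

# Keep the second level so each column has its own name.
flat_stats = toy_stats.copy()
flat_stats.columns = flat_stats.columns.get_level_values(1)
print(flat_stats)
```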


In [26]:
similar_movies_df = pd.DataFrame(similar_movies, columns=['similarity'])

In [27]:
# Reuse the frame built in the previous cell instead of constructing it again
df = popular_movies.join(similar_movies_df)
df

title,rating,rating,rating,similarity
101 Dalmatians (1996),109,317,2.908257,0.211132
12 Angry Men (1957),125,543,4.344000,0.184289
2001: A Space Odyssey (1968),259,1028,3.969112,0.230884
Absolute Power (1997),127,428,3.370079,0.085440
"Abyss, The (1989)",151,542,3.589404,0.203709
...,...,...,...,...
Willy Wonka and the Chocolate Factory (1971),326,1184,3.631902,0.221902
"Wizard of Oz, The (1939)",246,1003,4.077236,0.266335
"Wrong Trousers, The (1993)",118,527,4.466102,0.216204
Young Frankenstein (1974),200,789,3.945000,0.192589


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 334 entries, 101 Dalmatians (1996) to Young Guns (1988)
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   rating      334 non-null    int64  
 1   rating      334 non-null    int64  
 2   rating      334 non-null    float64
 3   similarity  334 non-null    float64
dtypes: float64(2), int64(2)
memory usage: 21.1+ KB


In [29]:
df.sort_values(['similarity'], ascending=False)

title,rating,rating,rating,similarity
Star Wars (1977),583,2541,4.358491,1.000000
"Empire Strikes Back, The (1980)",367,1543,4.204360,0.747981
Return of the Jedi (1983),507,2032,4.007890,0.672556
Raiders of the Lost Ark (1981),420,1786,4.252381,0.536117
Austin Powers: International Man of Mystery (1997),130,422,3.246154,0.377433
...,...,...,...,...
"Edge, The (1997)",113,400,3.539823,-0.127167
As Good As It Gets (1997),112,470,4.196429,-0.130466
Crash (1996),128,326,2.546875,-0.148507
G.I. Jane (1997),175,588,3.360000,-0.176734
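
The steps in this section (pivot, correlate with a target title, filter by popularity, sort) could be collected into one helper along the following lines. This is only a sketch under assumptions: the function name `similar_titles`, its parameters, and the toy data are hypothetical, and the input is assumed to be a long-format frame with `user_id`, `title`, and `rating` columns like the merged `ratings` frame.

```python
import pandas as pd

def similar_titles(ratings, target, min_ratings=100):
    """Rank titles by rating correlation with `target`, keeping only titles
    that received more than `min_ratings` ratings."""
    matrix = ratings.pivot_table(index='user_id', columns='title', values='rating')
    similarity = matrix.corrwith(matrix[target]).rename('similarity')
    counts = ratings.groupby('title')['rating'].size().rename('size')
    result = pd.concat([counts, similarity], axis=1)
    result = result[result['size'] > min_ratings]
    return result.sort_values('similarity', ascending=False)

# Made-up ratings: Y tracks X, Z is its mirror image, W has only two ratings.
rows = [
    (1, 'X', 5), (2, 'X', 4), (3, 'X', 1), (4, 'X', 2), (5, 'X', 5),
    (1, 'Y', 4), (2, 'Y', 5), (3, 'Y', 2), (4, 'Y', 1), (5, 'Y', 4),
    (1, 'Z', 1), (2, 'Z', 2), (3, 'Z', 5), (4, 'Z', 4), (5, 'Z', 1),
    (3, 'W', 1), (4, 'W', 2),
]
toy = pd.DataFrame(rows, columns=['user_id', 'title', 'rating'])

result = similar_titles(toy, 'X', min_ratings=2)
print(result)
```

On the real data, a call such as `similar_titles(ratings, 'Star Wars (1977)')` would be expected to follow the same steps as the cells above, with the popularity threshold exposed as a parameter.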
