# Recommmendation systems

Recommendations system are nowadays widely used in many domains and basically are systems that allow people to get advice on what to use,buy, read, watch etc. In some domains people behaviour are well studied and huge databases are available (like MovieLens for film recommendation domain). In others - not so much. Here, I'll concentrate only on film recommendation domain and will present only a few ideas as well as some challenges that may come with recommendation systems. Nevertheless, there are a lot of resources/books/papers that may be helpful when approaching the topic. Here I'd recommend too books that inspired me into diving deeper. Toby Segaran's "Programming collective intelligence" is a classic book for Machine Learning which presents some basic concepts about collaborative filtering (along with other classical technics for maschine learning). For me it's the best book to grasp basic understandment of the topic. The other is "Recommender Systems - an Introduction" by D. Jannah, M. Zanker, A. Felfernig, G. Friedrich, which presents a great overview of recommender systems and challenges that may appear.

Types of recommender systems:
- Collaborative (to be described in more details)
- Content-based
- Knowledge-based
- Hybrid

The type of recommender system to use is dependendant on the domain. For example for film domain where we have a lot of users and a lot of items collaborative filtering and content-based approaches seems to be appropriate approaches. I'll concentrate here on colaborative filtering approach and present some of the approaches from mentioned earlier "Recommender Systems Introduction" book.

Collaborative filtering uses experiences (here ratings) of other users to find sth to recommend. There are many approaches which can be taken to exploit those experiences.
First to be described here is __"User-based nearest neighbour recommendation"__ where you find other users that are similar to you and recommend sth they liked. There are a few challanges with recommendation system that are to be found here (although most of those are more general). What does it mean "nearest neighbour"? what metrics to choose to measure this? People tend to watch different films and usually only some from many. Also one needs to be carefull with some methods as there is relatively "small number" of ratings in comparison to film and users (In case of described MovieLens datasets we have about 270tho users, 45tho films and 2.6mln ratings. That means that if we wanted to create users x films matrix with ratings in cells only around 0.2% of it would be filled in) and calculating all the users similarity may very expensive computationally (due to size of users x films matrix) and calculating similarity between users may require user x user matrix which is even bigger (on the other hand when calculating recommendations for the specific user (which seems as a common task) we only need to calculate similarieties of this user to other users (without calculating similarieties between other users). Other problem known as "cold start" is when we have a new user that has not yet rated any movies. In such case it may be difficult to assess similarity to other users.

Basing on Movielens dataset I'll try to use _user-based nearest neighbour_ approach to find prediction for randomly chosen 100 ratings (Note, that recommendation problem can be sometimes treated as rating prediction problem or classification problem). This way I'm cutting a bit the computational issue and allow me to do some evaluation on achieved scores. On the other hand some technics used here cannot be directly applied to calculating scores for all users/movies. As a similarity measure I'm going to use pearson correlation using up to 40 nearest neighbours with positive correlation. I'll try to _fill in_ the database one example at a time.

In [1]:
import numpy as np
import pandas as pd

In [4]:
movies = pd.read_csv('../movies.csv')
movies.set_index('movieId',inplace=True)

In [8]:
ratings = pd.read_csv('../ratings.csv')
ratings.drop(['timestamp'],axis=1,inplace=True)
#ratings.set_index(['userId','movieId'],inplace=True)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,110,1.0
1,1,147,4.5
2,1,858,5.0
3,1,1221,5.0
4,1,1246,5.0


In [15]:
#randomly choosing 100 elements
import random
random100 = [int(ratings.shape[0]*random.random()) for i in range(100)]
org100 = ratings.loc[random100,:]

(100, 3)

In [19]:
ratings.drop(random100,axis=0,inplace=True)
ratings.shape

(26024189, 3)

In [17]:
ratings.shape

(26024289, 3)

In [65]:
def calc_cor(df,user,movie):
    #helper function for selecting base for the calculation of separate example.
    #I need base of ratings of users that rated the movie and only this movie and movies rated by user 
    #(as other will be removed when calculating correlation anyway)
    users = df[df.movieId==movie].userId.unique()
    df.set_index('movieId')
    dfu=ratings[ratings.userId==user]
    korelacja = []
    ile = []
    i=0
    progres=0
    for u in users:
        if i/len(users)>=progres:
            print(progres)
            progres+=0.2
        i+=1
        s=dfu.merge(ratings[ratings.userId==user],how='left',left_index=True,right_index=True)[['rating_x','rating_y']]
        korelacja.append(s.corr().iloc[0,1])
        ile.append(sum(s.rating_y.notna()))
        
    return pd.DataFrame(data={'korelacje':korelacja,'ile':ile}, index=users)

In [58]:
xx = org100.head(1)
aa = calc_cor(ratings,int(xx.userId),int(xx.movieId))
aa.head()


0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399
0/36399


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



0/36399
0/36399
0/36399
Traceback (most recent call last):
  File "/home/ab/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-58-d9855faf211b>", line 2, in <module>
    aa = calc_cor(ratings,int(xx.userId),int(xx.movieId))
  File "<ipython-input-57-0822a4a3d4da>", line 13, in calc_cor
    s=dfu.merge(ratings[ratings.userId==user],how='left',left_index=True,right_index=True)[['rating_x','rating_y']]
  File "/home/ab/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 879, in wrapper
    res = na_op(values, other)
  File "/home/ab/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 816, in na_op
    result = getattr(x, name)(y)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ab/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1

KeyboardInterrupt: 

In [64]:
org100.head(1)

Unnamed: 0,userId,movieId,rating
8125655,83784,316,3.0


In [45]:
aa.set_index('movieId')

Unnamed: 0_level_0,userId,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
32,16,4.5
111,16,5.0
198,16,4.5
260,16,3.5
316,16,3.5
480,16,4.5
541,16,5.0
589,16,5.0
593,16,4.5
1080,16,5.0


In [47]:
xx

Unnamed: 0,userId,movieId,rating
8125655,83784,316,3.0


[back](part4.ipynb) [index](index.ipynb)