# IS 643 Project 1 -- Xingjia Wu

### - Describe the recommender system from a business perspective

A toy dataset from the class survey was used for this project. It contains ratings from 16 students for 6 movies. The business application is to recommend movies to users based on their personal preferences.

### - Load csv file as pandas dataframe

In [1]:
import pandas as pd

In [2]:
cr = pd.read_csv("https://raw.githubusercontent.com/ddsmile/DATA643/master/MovieRatings.csv")

In [3]:
cr.head()

Unnamed: 0,Critic,CaptainAmerica,Deadpool,Frozen,JungleBook,PitchPerfect2,StarWarsForce
0,Burton,,,,4.0,,4.0
1,Charley,4.0,5.0,4.0,3.0,2.0,3.0
2,Dan,,5.0,,,,5.0
3,Dieudonne,5.0,4.0,,,,5.0
4,Matt,4.0,,2.0,,2.0,5.0


Reshape the dataframe so that each user is a column and each movie is a row. Since cosine similarity will be used, NaN values are filled with 0 so that the arithmetic can run. A 0 contributes nothing to the dot product or to either vector's norm, and because real ratings range from 1 to 5, a 0 always marks a movie the user has not rated.

In [4]:
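cr_T = cr.T # transpose so that critics become columns
cr_T.columns = cr_T.loc["Critic"] # use the critic names as column labels
cr_T = cr_T.drop('Critic') # drop the Critic row, which is now redundant
cr_all = cr_T.fillna(0).astype(float) # mark unrated movies with 0 and keep ratings numeric
cr_all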

Critic,Burton,Charley,Dan,Dieudonne,Matt,Mauricio,Max,Nathan,Param,Parshu,Prashanth,Shipra,Sreejaya,Steve,Vuthy,Xingjia
CaptainAmerica,0.0,4.0,0.0,5.0,4.0,4.0,4.0,0.0,4.0,4.0,5.0,0.0,5.0,4.0,4.0,0.0
Deadpool,0.0,5.0,5.0,4.0,0.0,0.0,4.0,0.0,4.0,3.0,5.0,0.0,5.0,0.0,5.0,0.0
Frozen,0.0,4.0,0.0,0.0,2.0,3.0,4.0,0.0,1.0,5.0,5.0,4.0,5.0,0.0,3.0,5.0
JungleBook,4.0,3.0,0.0,0.0,0.0,3.0,2.0,0.0,0.0,5.0,5.0,5.0,4.0,0.0,3.0,5.0
PitchPerfect2,0.0,2.0,0.0,0.0,2.0,4.0,2.0,0.0,0.0,2.0,0.0,0.0,4.0,0.0,3.0,0.0
StarWarsForce,4.0,3.0,5.0,5.0,5.0,0.0,4.0,4.0,5.0,3.0,4.0,3.0,5.0,4.0,0.0,0.0
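As a quick check of the zero-fill step above, the sketch below uses two made-up rating vectors (not taken from the survey) to show that a position where both users have 0 adds nothing to the dot product or to either norm. The helper name `cosine_parts` is hypothetical and exists only for this illustration.

```python
from math import sqrt

def cosine_parts(x, y):
    """Return (dot product, norm of x, norm of y) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot, norm_x, norm_y

# Made-up ratings for illustration only; 0 stands for "not rated".
u = [0, 4, 0, 5]
v = [3, 4, 0, 0]

# Same vectors with the third position (0 for both users) removed.
u_trim = [0, 4, 5]
v_trim = [3, 4, 0]

full = cosine_parts(u, v)
trimmed = cosine_parts(u_trim, v_trim)
print(full == trimmed)  # the shared zero changed none of the three terms
```

Note that a movie rated by only one of the two users still counts toward that user's norm, so the similarity is lower when two users have few rated movies in common.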


### - Use collaborative filtering

- First, create a similarity function based on cosine similarity


In [5]:
# cosine similarity function
from math import sqrt

def cosine_sim(x, y):
    # dot product of the two rating vectors
    dotprod = sum(a * b for a, b in zip(x, y))
    # product of the two vector norms
    norms = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dotprod / norms

cosine_sim(cr_all["Burton"], cr_all["Charley"]) # find similarity between Burton and Charley

0.47733437050543803
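The value above can be traced by hand. The sketch below copies Burton's and Charley's columns from `cr_all` (movie order: CaptainAmerica, Deadpool, Frozen, JungleBook, PitchPerfect2, StarWarsForce) and uses NumPy, which is an assumption here since the notebook itself only uses `math`.

```python
import numpy as np

# Ratings copied from the cr_all table; 0 means "not rated".
burton = np.array([0, 0, 0, 4, 0, 4], dtype=float)
charley = np.array([4, 5, 4, 3, 2, 3], dtype=float)

dot = burton @ charley                                     # 4*3 + 4*3 = 24
norms = np.linalg.norm(burton) * np.linalg.norm(charley)   # sqrt(32) * sqrt(79)
sim = dot / norms

print(dot, round(sim, 6))
```

Only JungleBook and StarWarsForce contribute to the dot product, because those are the only movies both users rated.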

- Second, use the cosine distance function in scipy, which returns 1 minus the similarity, and compare the results

In [6]:
# scipy's cosine() returns a distance, so subtract it from 1 to get the similarity
from scipy.spatial.distance import cosine
1 - cosine(cr_all["Burton"], cr_all["Charley"])

0.47733437050543803

Both functions return the same result.

- Find the most similar person based on cosine similarity

In [7]:
# Function that returns the n most similar people
def topSim(person, df, n):
    score = [(cosine_sim(df[person], df[i]), i) for i in df.columns if i != person]
    score.sort(reverse=True)  # highest similarity first
    return score[0:n]

topSim("Burton", cr_all, 3)    # find the three people most similar to Burton

[(0.79999999999999982, 'Shipra'),
 (0.70710678118654746, 'Nathan'),
 (0.60302268915552715, 'Parshu')]

The people most similar to Burton are Shipra, Nathan, and then Parshu.
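The same ranking can be computed for all critics at once with matrix operations. This is a minimal sketch under stated assumptions: it uses NumPy rather than the notebook's loop, it hard-codes only five of the sixteen critics (values copied from `cr_all`), and the function name `top_sim_vectorized` is hypothetical.

```python
import numpy as np

critics = ["Burton", "Charley", "Nathan", "Parshu", "Shipra"]
# Rows are the six movies in the order shown in cr_all; columns follow `critics`.
ratings = np.array([
    [0, 4, 0, 4, 0],   # CaptainAmerica
    [0, 5, 0, 3, 0],   # Deadpool
    [0, 4, 0, 5, 4],   # Frozen
    [4, 3, 0, 5, 5],   # JungleBook
    [0, 2, 0, 2, 0],   # PitchPerfect2
    [4, 3, 4, 3, 3],   # StarWarsForce
], dtype=float)

def top_sim_vectorized(person, n):
    u = critics.index(person)
    norms = np.linalg.norm(ratings, axis=0)              # one norm per critic
    sims = (ratings[:, u] @ ratings) / (norms[u] * norms)
    order = [i for i in np.argsort(-sims) if i != u]     # highest first, skip self
    return [(float(sims[i]), critics[i]) for i in order[:n]]

top3 = top_sim_vectorized("Burton", 3)
print(top3)
```

On this subset the order is Shipra, Nathan, Parshu, which agrees with the output of `topSim` on the full table.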

- Make recommendations based on a similarity-weighted score

In [8]:
def getRecommendations(df, person):
    totals = {}
    simSums = {}
    for other in df.columns:
        # Skip the comparison to self
        if other == person:
            continue
        sim = cosine_sim(df[person], df[other])

        # Ignore similarities of zero or lower
        if sim <= 0:
            continue
        for item in df.index:
            # Only score movies that the person hasn't rated (0) but the other user has
            if df.at[item, other] != 0 and df.at[item, person] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                totals[item] += df.at[item, other] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]

    # Return the list sorted from highest to lowest score
    rankings.sort(reverse=True)
    return rankings

In [9]:
getRecommendations(cr_all, "Burton")

[(4.4053902124673447, 'Deadpool'),
 (4.304610359062635, 'CaptainAmerica'),
 (3.871696813249442, 'Frozen'),
 (2.6147393009638979, 'PitchPerfect2')]

The top recommended movies for Burton are Deadpool, CaptainAmerica, then Frozen. 
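Each score above is a weighted average: for a movie Burton has not rated, sum similarity times rating over the users who rated it, then divide by the sum of those similarities. The sketch below restates that calculation with NumPy matrix products instead of nested loops. It is an illustrative variant, not the notebook's code: the rating matrix is copied from `cr_all`, and the name `weighted_scores` is hypothetical. It assumes no critic has an all-zero column, which would make a norm zero.

```python
import numpy as np

movies = ["CaptainAmerica", "Deadpool", "Frozen",
          "JungleBook", "PitchPerfect2", "StarWarsForce"]
critics = ["Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauricio", "Max",
           "Nathan", "Param", "Parshu", "Prashanth", "Shipra", "Sreejaya",
           "Steve", "Vuthy", "Xingjia"]
# Rows follow `movies`, columns follow `critics`; 0 means "not rated".
R = np.array([
    [0, 4, 0, 5, 4, 4, 4, 0, 4, 4, 5, 0, 5, 4, 4, 0],
    [0, 5, 5, 4, 0, 0, 4, 0, 4, 3, 5, 0, 5, 0, 5, 0],
    [0, 4, 0, 0, 2, 3, 4, 0, 1, 5, 5, 4, 5, 0, 3, 5],
    [4, 3, 0, 0, 0, 3, 2, 0, 0, 5, 5, 5, 4, 0, 3, 5],
    [0, 2, 0, 0, 2, 4, 2, 0, 0, 2, 0, 0, 4, 0, 3, 0],
    [4, 3, 5, 5, 5, 0, 4, 4, 5, 3, 4, 3, 5, 4, 0, 0],
], dtype=float)

def weighted_scores(person):
    u = critics.index(person)
    norms = np.linalg.norm(R, axis=0)
    sims = (R[:, u] @ R) / (norms[u] * norms)   # cosine similarity to every critic
    sims[u] = 0.0                               # exclude the person themselves
    sims = np.where(sims > 0, sims, 0.0)        # ignore non-positive similarities
    totals = R @ sims                           # sum of similarity * rating per movie
    sim_sums = (R != 0).astype(float) @ sims    # similarities summed over raters only
    scores = []
    for i, movie in enumerate(movies):
        # keep movies the person has not rated but a similar critic has
        if R[i, u] == 0 and sim_sums[i] > 0:
            scores.append((float(totals[i] / sim_sums[i]), movie))
    return sorted(scores, reverse=True)

recs = weighted_scores("Burton")
print(recs)
```

Because unrated entries are stored as 0, the product `R @ sims` automatically skips them in the numerator, while the `(R != 0)` mask keeps them out of the denominator.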

### - Recommender system with GraphLab Create

Reshape the original dataset into long format for GraphLab Create.

In [10]:
df_long = pd.melt(cr, id_vars="Critic").dropna()
df_long = df_long.sort_values(["Critic"]).reset_index(drop = True)
df_long.head()

Unnamed: 0,Critic,variable,value
0,Burton,StarWarsForce,4.0
1,Burton,JungleBook,4.0
2,Charley,CaptainAmerica,4.0
3,Charley,StarWarsForce,3.0
4,Charley,PitchPerfect2,2.0
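To make the reshaping step concrete, here is a small sketch on a two-critic frame. The ratings for Burton and Dan are copied from the head of the table, but only four of the six movie columns are included to keep the example short. Sorting by both `Critic` and `variable` is an addition made here so that the row order is deterministic.

```python
import numpy as np
import pandas as pd

# Two critics from the ratings table; NaN marks "not rated".
wide = pd.DataFrame({
    "Critic": ["Burton", "Dan"],
    "CaptainAmerica": [np.nan, np.nan],
    "Deadpool": [np.nan, 5.0],
    "JungleBook": [4.0, np.nan],
    "StarWarsForce": [4.0, 5.0],
})

long = pd.melt(wide, id_vars="Critic")   # one row per (critic, movie) pair: 2 * 4 = 8 rows
long = long.dropna()                     # keep only pairs with an actual rating
long = long.sort_values(["Critic", "variable"]).reset_index(drop=True)
print(long)
```

Four of the eight pairs survive `dropna()`, one row for each rating that was actually given.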


In [11]:
import graphlab
from graphlab import SFrame
sf = SFrame(data = df_long)
m = graphlab.recommender.create(sf, user_id="Critic", item_id="variable", target = "value")
rec = m.recommend()
print(rec)

[INFO] graphlab.cython.cy_server: GraphLab Create v1.10.1 started. Logging: C:\Users\ddsmile\AppData\Local\Temp\graphlab_server_1466303209.log.0


+-----------+----------------+---------------+------+
|   Critic  |    variable    |     score     | rank |
+-----------+----------------+---------------+------+
|   Burton  | CaptainAmerica | 3.61883964768 |  1   |
|   Burton  |     Frozen     | 3.59268916777 |  2   |
|   Burton  |    Deadpool    | 3.08599166206 |  3   |
|   Burton  | PitchPerfect2  | 2.45758728317 |  4   |
|    Dan    | CaptainAmerica | 4.46461896977 |  1   |
|    Dan    | PitchPerfect2  | 2.71280913434 |  2   |
|    Dan    |     Frozen     | 2.69984333596 |  3   |
|    Dan    |   JungleBook   | 1.46254186711 |  4   |
| Dieudonne | PitchPerfect2  | 2.74535439542 |  1   |
| Dieudonne |     Frozen     | 2.19848845533 |  2   |
+-----------+----------------+---------------+------+
[35 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


The top recommended movies for Burton are CaptainAmerica, Frozen, then Deadpool.

### - Handle missing values

The missing values in this dataset are ratings that users did not provide. For the similarity function, missing values were replaced with 0. Because ratings range from 1 to 5, a 0 cannot be mistaken for a real rating, and it adds nothing to the dot product or to the vector norms. For the GraphLab Create module, the dataset was reshaped into user_id, item_id and rating columns. Any row with NaN was removed, so the final dataset contains no missing values.

### - Comparison between the two recommender systems

My recommender is based on a similarity-weighted score. The GraphLab Create recommender toolkit uses its default Factorization recommender. To compare the two systems, each was used to recommend movies to one user, Burton. Although PitchPerfect2 was the last recommended movie for both methods, the order of the top three differs: the weighted-score method ranks Deadpool, CaptainAmerica, Frozen, while GraphLab Create ranks CaptainAmerica, Frozen, Deadpool. The difference is probably due to the different algorithms.
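One simple way to quantify how far the two rankings for Burton agree is to count the movie pairs that both methods place in the same relative order. The sketch below is only an illustration: the two lists are copied from the outputs shown earlier, and the helper `concordant_pairs` is a hypothetical name, not part of either system.

```python
from itertools import combinations

# Recommendation order for Burton from each method, highest score first.
weighted = ["Deadpool", "CaptainAmerica", "Frozen", "PitchPerfect2"]
graphlab_order = ["CaptainAmerica", "Frozen", "Deadpool", "PitchPerfect2"]

def concordant_pairs(a, b):
    """Count the pairs of items that lists a and b rank in the same relative order."""
    pos_a = {m: i for i, m in enumerate(a)}
    pos_b = {m: i for i, m in enumerate(b)}
    pairs = list(combinations(a, 2))
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree, len(pairs)

agree, total = concordant_pairs(weighted, graphlab_order)
print(agree, "of", total, "movie pairs are ordered the same way")
```

The two disagreeing pairs both involve Deadpool, which the weighted-score method places first and GraphLab Create places third.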

##### Ref: Toby Segaran. Programming Collective Intelligence. (2007) Chapter 2 Making Recommendations