# Recommender System - CF

This notebook introduces a user-based collaborative filtering (CF) pipeline:<br>
1. First, we compute a user-user similarity matrix from the rating history.<br>
2. Then, we recommend items to a user based on the most similar users.<br>

**Note: computing the similarity matrix is expensive in both time and memory, so this notebook works on a small sample of the data.**

Dataset info<br>
./data/train.txt includes 4 columns:<br>
uid: user ID<br>
mid: movie ID<br>
timeStamp: date the record was created<br>
star: the user's rating of the movie (scale 1.0 - 5.0)<br>

In [226]:
import pandas as pd
from pandas import DataFrame

PATH = "./train.txt"
# Tab-separated file with a header row, encoded as GBK
user_record = pd.read_csv(PATH, sep='\t', header=0, encoding='gbk')
user_record.star = user_record.star.astype(float)
print(user_record.star.value_counts())
user_record['count'] = 1.  # flag column: 1.0 means the user rated this movie
user_record.head()

4.0    419
3.0    315
5.0    266
Name: star, dtype: int64


Unnamed: 0,uid,mid,timeStamp,star,count
0,1722994,1306505,2007-08-22,5.0,1.0
1,1405477,10574468,2013-04-24,4.0,1.0
2,15849871,4910186,2011-12-20,5.0,1.0
3,1068524,1304643,2006-02-07,3.0,1.0
4,1307041,1851857,2008-09-22,4.0,1.0


## Create the user-product matrix from the history records

In [227]:
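# Rows are users (uid), columns are movies (mid); a cell holds the rating, or 0 if the user has not rated the movie
sell_pivot = user_record.pivot_table(values='star', index='uid', columns='mid', aggfunc='sum', fill_value=0)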
print(sell_pivot.shape)
sell_pivot.head(2)

(921, 876)


mid,1291546,1291548,1291549,1291552,1291557,1291560,1291561,1291566,1291569,1291571,...,11587489,11620863,13939691,19955821,19961360,19962285,20378817,20395646,20451283,23090008
uid
1000226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000232,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
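
To see what the pivot step does on a scale that fits on screen, here is a minimal sketch on made-up ratings. The `toy` frame and its values are assumptions for illustration only; they are not taken from train.txt.

```python
import pandas as pd

# Toy ratings in the same uid / mid / star layout as the dataset (values are made up).
toy = pd.DataFrame({
    "uid":  [1, 1, 2, 3, 3],
    "mid":  [10, 20, 10, 20, 30],
    "star": [5.0, 3.0, 4.0, 4.0, 5.0],
})

# Rows become users, columns become movies; unrated cells are filled with 0.
toy_pivot = toy.pivot_table(values="star", index="uid", columns="mid",
                            aggfunc="sum", fill_value=0)
print(toy_pivot)
```

Because ratings start at 1.0, a 0 in the matrix can only mean "not rated", which is what the later steps rely on when they convert cells to booleans.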


In [228]:
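# Similarity functions used to compare two users' rating vectors
import numpy as np

def euclidea_sim(x, y):
    """Euclidean similarity: 1 / (1 + distance), so identical vectors score 1."""
    assert len(x) == len(y)
    dis = np.linalg.norm(x - y)
    return 1 / (1 + dis)

def jaccard_sim(x, y):
    """Jaccard similarity of the rated / not-rated patterns."""
    assert len(x) == len(y)
    x, y = np.array(x).astype(bool), np.array(y).astype(bool)
    # |intersection| / |union|
    return np.logical_and(x, y).sum() / np.logical_or(x, y).sum()

def cosine_sim(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    assert len(x) == len(y)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))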
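
As a quick sanity check of the three measures, the following sketch evaluates each formula directly with NumPy on two made-up rating vectors. The vectors `a` and `b` are hypothetical users, not rows from the dataset.

```python
import numpy as np

# Two hypothetical users' ratings over four movies (0 means "not rated").
a = np.array([5.0, 3.0, 0.0, 0.0])
b = np.array([4.0, 0.0, 0.0, 2.0])

# Euclidean similarity: 1 / (1 + distance).
euclid = 1.0 / (1.0 + np.linalg.norm(a - b))

# Jaccard similarity on the rated / not-rated pattern: |intersection| / |union|.
ra, rb = a.astype(bool), b.astype(bool)
jaccard = np.logical_and(ra, rb).sum() / np.logical_or(ra, rb).sum()

# Cosine similarity: dot product divided by the product of the norms.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclid, jaccard, cosine)
```

Here the two users share one rated movie out of three rated by either, so the Jaccard value is 1/3. Note that Jaccard ignores the rating values, while the Euclidean and cosine measures use them.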

In [229]:
# Compute the user-user similarity matrix (n_users x n_users)
def sim_mat(sell_group, sim=euclidea_sim):
    n_users = sell_group.shape[0]
    sim_matrix = DataFrame(np.zeros((n_users, n_users), dtype=float),
                           index=sell_group.index, columns=sell_group.index)
    print(sim_matrix.shape)
    # Compare every pair of users; this O(n^2) double loop is the slow part
    for index in sell_group.index:
        for column in sell_group.index:
            sim_matrix.loc[index, column] = sim(sell_group.loc[index, :], sell_group.loc[column, :])
    return sim_matrix

# Recommend items to a user based on user similarity.
def recommendation(sim_mat, customer, n_sim_customer, n_product, sell_record):
    '''
    Parameters:
    sim_mat: user-user similarity matrix
    customer: ID of the user to recommend items to
    n_sim_customer: number of most similar users to consider
    n_product: maximum number of products to recommend
    sell_record: user-product matrix (rows: users, columns: products);
                 a non-zero value means the user has bought (rated) the product, 0 means not
    '''
    try:
        # Rank all users by similarity to the customer and keep the top n_sim_customer.
        # The customer's own row (similarity 1 under euclidea_sim) is part of this ranking.
        k_similar = sim_mat.sort_values(customer, axis=0, ascending=False)[:n_sim_customer]
    except KeyError:
        # The customer is not in the similarity matrix, i.e. has no history
        print('This user never purchases the item, we can introduce a hot one.')
        return

    # Count, for each product, how many of the k similar users bought it
    recom_product = sell_record.loc[k_similar.index, :].astype(bool).sum(axis=0)
    recom_product = recom_product[recom_product > 0].sort_values(axis=0, ascending=False).index
    recom_list = []
    for i in recom_product:
        # Skip products the customer has already bought
        if sell_record.loc[customer, i] > 0:
            continue
        recom_list.append(i)
        if len(recom_list) >= n_product:
            break
    if len(recom_list) > 0:
        print("The recommended items are：", "/".join([str(r) for r in recom_list]))
    else:
        print('There is no product to be recommended, we can introduce a hot one.')

## Calculate the user-similarity matrix

In [214]:
sim = sim_mat(sell_pivot)
print(sim.shape)

(921, 921)
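
Since the pairwise loop is the expensive part, one possible speed-up is to compute all Euclidean similarities with a single matrix product. The sketch below is an assumption about how this could be done, not the notebook's method; `sim_mat_vectorized` is a hypothetical name, and it covers only the Euclidean measure.

```python
import numpy as np
import pandas as pd

def sim_mat_vectorized(ratings: pd.DataFrame) -> pd.DataFrame:
    """Euclidean similarity 1 / (1 + distance) for all user pairs at once."""
    X = ratings.to_numpy(dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for every pair via one matrix product.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can leave tiny negative values; clip them before the square root.
    dist = np.sqrt(np.clip(sq_dist, 0.0, None))
    return pd.DataFrame(1.0 / (1.0 + dist), index=ratings.index, columns=ratings.index)

# Small hypothetical user-by-movie matrix for a quick check.
toy = pd.DataFrame([[5, 3, 0], [4, 0, 0], [0, 4, 5]], index=[1, 2, 3], columns=[10, 20, 30])
toy_sim = sim_mat_vectorized(toy)
print(toy_sim.round(3))
```

The result has the same layout as the loop-based matrix (user IDs on both axes), so it could be passed to the recommendation step in the same way. The trade-off is memory: the full n_users x n_users array is held at once.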


## Recommendation using the user-similarity matrix

In [221]:
recommendation(sim,1000226,2,2,sell_pivot)
recommendation(sim,19556493,2,2,sell_pivot)
recommendation(sim,1291552,2,2,sell_pivot)  # 1291552 is a mid, not a uid in the matrix, so the fallback message is printed

There is no product to be recommended, we can introduce a hot one.
The recommended items are： 1295280
This user never purchases the item, we can introduce a hot one.
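
Both fallback messages suggest introducing a "hot" item. A minimal sketch of such a popularity fallback is shown below, assuming "hot" means "rated by the most users"; `hot_items` is a hypothetical helper that is not part of the notebook, and the toy matrix is made up.

```python
import pandas as pd

def hot_items(sell_record: pd.DataFrame, n_product: int, customer=None) -> list:
    """Return the n_product most-rated items, skipping ones the customer already rated."""
    # Count how many users rated each item (a non-zero cell means "rated").
    popularity = sell_record.astype(bool).sum(axis=0).sort_values(ascending=False)
    if customer is not None and customer in sell_record.index:
        # Drop items this customer has already rated.
        seen_items = sell_record.columns[sell_record.loc[customer] > 0]
        popularity = popularity[~popularity.index.isin(seen_items)]
    return popularity.index[:n_product].tolist()

# Hypothetical user-by-movie matrix: movie 10 has 3 ratings, 20 has 2, 30 has 1.
toy = pd.DataFrame([[5, 3, 0], [4, 0, 0], [3, 4, 5]], index=[1, 2, 3], columns=[10, 20, 30])
print(hot_items(toy, 2))              # most-rated items overall
print(hot_items(toy, 2, customer=2))  # excludes what user 2 has already rated
```

An unknown `customer` (one with no row in the matrix) simply gets the overall most-rated items, which matches the cold-start case handled by the first fallback message.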
