## Recommendation system (Collaborative Filtering) - Book Genre Purchase History


<a id='intro'></a>
### 1. Introduction
The goal is to recommend book genres to customers based on their genre purchase history. Recommendations come from finding the most similar customers, using cosine similarity and the Pearson correlation coefficient, with Apple's Turi Create package.

Data Profile:
33,347 records of individual customer purchase data for book genres. Each record gives the purchase frequency, that is, the number of times the customer bought a book from each of 30 genres.

Genres:
fiction, classics, cartoons, legends, philosophy, religion, psychology, linguistics, art, music, facsimile, history, contemporary history, economy, politics, science, computer science, railroads, maps, travel guides, health, cooking, learning, GamesRiddles, sports, hobby, nature, encyclopaedia, videos, nonbooks
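
The two similarity measures named in the introduction can be illustrated on a toy example. This is a minimal NumPy sketch, not the Turi Create implementation: the function names, the three customers, and their counts are hypothetical, and returning 0.0 for a zero-length vector is an assumed convention.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two purchase-count vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Assumed convention: a zero vector is treated as similarity 0.0.
    return float(u @ v / denom) if denom else 0.0

def pearson_similarity(u, v):
    """Pearson correlation, i.e. cosine similarity of the mean-centered vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return cosine_similarity(u - u.mean(), v - v.mean())

# Hypothetical purchase counts for three customers over four genres.
alice = [5, 0, 3, 1]
bob = [10, 0, 6, 2]   # same taste as alice, twice the volume
carol = [0, 4, 0, 5]  # mostly different genres

print(cosine_similarity(alice, bob), pearson_similarity(alice, bob))
print(cosine_similarity(alice, carol), pearson_similarity(alice, carol))
```

Both measures ignore overall purchase volume, so `alice` and `bob` score as essentially identical. Pearson also subtracts each customer's mean count, which is why it turns negative for `alice` and `carol` while cosine stays slightly positive.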

In [17]:
import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

#from sklearn.preprocessing import MinMaxScaler

<a id='preprocess'></a>
### 2. Preprocessing

In [3]:
df_books = pd.read_csv('df_rfm.csv')
# df_books.info()

In [4]:
df_books.head(2)

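| | id | Ffiction1 | Fclassics3 | Fcartoons5 | Flegends6 | Fphilosophy7 | Freligion8 | Fpsychology9 | Flinguistics10 | Fart12 | ... | Fhealth35 | Fcooking36 | Flearning37 | FGamesRiddles38 | Fsports39 | Fhobby40 | Fnature41 | Fencyclopaedia44 | Fvideos50 | Fnonbooks99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2901870 | 63 | 16 | 49 | 7 | 29 | 101 | 2 | 8 | 13 | ... | 60 | 54 | 24 | 2 | 0 | 50 | 21 | 50 | 7 | 8 |
| 1 | 3145166 | 11 | 8 | 4 | 0 | 2 | 29 | 0 | 9 | 30 | ... | 40 | 21 | 19 | 0 | 0 | 46 | 18 | 17 | 10 | 3 |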


In [4]:
# df_rfm.columns

In [5]:
# Melt the data from wide form (one column per genre) to long form (one row per customer-genre pair)
data = pd.melt(df_books, id_vars=['id'], value_vars=['Ffiction1', 'Fclassics3', 'Fcartoons5', 'Flegends6',
       'Fphilosophy7', 'Freligion8', 'Fpsychology9', 'Flinguistics10',
       'Fart12', 'Fmusic14', 'Ffacsimile17', 'Fhistory19', 'Fconthist20',
       'Feconomy21', 'Fpolitics22', 'Fscience23', 'Fcompsci26', 'Frailroads27',
       'Fmaps30', 'Ftravelguides31', 'Fhealth35', 'Fcooking36', 'Flearning37',
       'FGamesRiddles38', 'Fsports39', 'Fhobby40', 'Fnature41',
       'Fencyclopaedia44', 'Fvideos50', 'Fnonbooks99'], var_name='genre', value_name='count')
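
The wide-to-long reshaping done by `pd.melt` is easier to see on a small frame. The sketch below uses a hypothetical two-customer, two-genre table. It builds `value_vars` from the column names instead of typing the list out, which assumes every column other than `id` is a genre column.

```python
import pandas as pd

# Hypothetical wide table: one row per customer, one column per genre.
wide = pd.DataFrame({
    "id": [101, 102],
    "Ffiction1": [3, 0],
    "Fclassics3": [0, 2],
})

# Assumption: every column except the key is a genre column.
genre_cols = [c for c in wide.columns if c != "id"]

# One output row per (customer, genre) pair.
long = pd.melt(wide, id_vars=["id"], value_vars=genre_cols,
               var_name="genre", value_name="count")
print(long)
```

Each customer contributes one row per genre, so the two-row, two-genre input becomes four rows, including the zero counts.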

In [6]:
data.head(3)

| | id | genre | count |
|---|---|---|---|
| 0 | 2901870 | Ffiction1 | 63 |
| 1 | 3145166 | Ffiction1 | 11 |
| 2 | 3511502 | Ffiction1 | 1 |


In [7]:
# Remove zero-count rows, which indicate the customer did not purchase that item/genre
data = data[data['count'] != 0]

In [8]:
data.shape

(208450, 3)

#### User-Item Matrix

In [11]:
df_matrix = pd.pivot_table(data, values='count',index='id', columns='genre')

In [12]:
df_matrix.head(3)

| id | FGamesRiddles38 | Fart12 | Fcartoons5 | Fclassics3 | Fcompsci26 | Fconthist20 | Fcooking36 | Feconomy21 | Fencyclopaedia44 | Ffacsimile17 | ... | Fnonbooks99 | Fphilosophy7 | Fpolitics22 | Fpsychology9 | Frailroads27 | Freligion8 | Fscience23 | Fsports39 | Ftravelguides31 | Fvideos50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 914 | NaN | 1.0 | 1.0 | NaN | NaN | 17.0 | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN |
| 957 | NaN | NaN | 2.0 | NaN | 1.0 | 4.0 | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1406 | NaN | NaN | NaN | NaN | NaN | 34.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
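
The user-item matrix above can be reproduced in miniature. The sketch below uses hypothetical data. It shows that customer-genre pairs missing from the long table come back as `NaN`, and that the share of such cells measures how sparse the matrix is. `pd.pivot_table` aggregates with the mean by default, which only matters if a customer-genre pair appears more than once.

```python
import pandas as pd

# Hypothetical long table with zero-count rows already removed.
long = pd.DataFrame({
    "id": [101, 101, 102],
    "genre": ["Ffiction1", "Fclassics3", "Fclassics3"],
    "count": [3, 1, 2],
})

# Rows are customers, columns are genres; missing pairs become NaN.
matrix = pd.pivot_table(long, values="count", index="id", columns="genre")

# Share of (customer, genre) cells with no purchase on record.
sparsity = matrix.isna().to_numpy().mean()
print(matrix)
print(sparsity)
```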



#### Creating Features

In [9]:
# Creating a new feature: purchase yes/no dummy variable
# The dummy is 1 to indicate the customer made a purchase at least once
# Copy first so that the new column is not also added to `data`
data_dummy = data.copy()
data_dummy['purchase_dummy'] = 1

In [10]:
data_dummy.head(3)

| | id | genre | count | purchase_dummy |
|---|---|---|---|---|
| 0 | 2901870 | Ffiction1 | 63 | 1 |
| 1 | 3145166 | Ffiction1 | 11 | 1 |
| 2 | 3511502 | Ffiction1 | 1 | 1 |
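
When a feature column is added to a derived frame, it matters whether that frame is a copy. Plain assignment binds a second name to the same DataFrame, while `.copy()` produces an independent one. The sketch below shows the difference on hypothetical two-row frames. All variable names are illustrative.

```python
import pandas as pd

# Plain assignment: both names refer to the same DataFrame object.
purchases = pd.DataFrame({"id": [101, 102], "count": [3, 2]})
alias = purchases
alias["purchase_dummy"] = 1
alias_leaked = "purchase_dummy" in purchases.columns   # column shows up on the original

# Explicit copy: the original stays untouched.
clean = pd.DataFrame({"id": [101, 102], "count": [3, 2]})
dummy = clean.copy()
dummy["purchase_dummy"] = 1
copy_leaked = "purchase_dummy" in clean.columns

print(alias_leaked, copy_leaked)
```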


In [18]:
# data_dummy.describe()

In [13]:
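# Creating the second feature: 'scaled_purchase_freq', the count normalized within each genre
# Min-max scaling per genre column: (count - min) / (max - min), giving values in [0, 1]
# Note: the smallest count in a genre maps to 0.0 even though a purchase was made
df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
df_matrix_norm.head()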

| id | FGamesRiddles38 | Fart12 | Fcartoons5 | Fclassics3 | Fcompsci26 | Fconthist20 | Fcooking36 | Feconomy21 | Fencyclopaedia44 | Ffacsimile17 | ... | Fnonbooks99 | Fphilosophy7 | Fpolitics22 | Fpsychology9 | Frailroads27 | Freligion8 | Fscience23 | Fsports39 | Ftravelguides31 | Fvideos50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 914 | NaN | 0.0 | 0.0 | NaN | NaN | 0.075472 | 0.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.006135 | NaN |
| 957 | NaN | NaN | 0.020833 | NaN | 0.0 | 0.014151 | NaN | NaN | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1406 | NaN | NaN | NaN | NaN | NaN | 0.15566 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1414 | NaN | NaN | NaN | NaN | NaN | 0.028302 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1546 | NaN | 0.0 | 0.0 | NaN | NaN | 0.028302 | NaN | 0.0 | NaN | NaN | ... | NaN | 0.058824 | NaN | NaN | 0.016667 | 0.0 | 0.018519 | NaN | 0.0 | 0.037037 |
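
The scaling applied to the matrix is column-wise min-max normalization: each genre column is shifted by its minimum and divided by its range. A small sketch with hypothetical counts shows the behaviour, including how missing cells are carried through.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix; NaN marks "no purchase on record".
matrix = pd.DataFrame(
    {"Ffiction1": [1.0, 3.0, 5.0], "Fclassics3": [2.0, np.nan, 10.0]},
    index=pd.Index([101, 102, 103], name="id"),
)

# min() and max() skip NaN and work per column, so each genre is scaled on its own.
scaled = (matrix - matrix.min()) / (matrix.max() - matrix.min())
print(scaled)
```

The lowest observed count in each genre becomes 0.0 and the highest becomes 1.0, while `NaN` cells stay `NaN`. A genre in which every buyer has the same count would produce 0/0, which is also `NaN`.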


In [14]:
df_temp = df_matrix_norm.reset_index()

In [15]:
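# This only names the row index; the melt below ignores the index and takes its value column name from value_name
df_temp.index.names = ['scaled_purchase_freq']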

In [16]:
data_norm = pd.melt(df_temp, id_vars=['id'],value_name='scaled_purchase_freq').dropna()

In [17]:
data_norm.head()

| | id | genre | scaled_purchase_freq |
|---|---|---|---|
| 6 | 2046 | FGamesRiddles38 | 0.0 |
| 16 | 4693 | FGamesRiddles38 | 0.0 |
| 38 | 9520 | FGamesRiddles38 | 0.0 |
| 52 | 14745 | FGamesRiddles38 | 0.0 |
| 80 | 21768 | FGamesRiddles38 | 0.0 |
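
Converting the scaled matrix back to long form combines three steps: `reset_index` turns `id` into a column, `melt` stacks the genre columns, and `dropna` discards customer-genre pairs with no purchase. The sketch below chains them on a hypothetical 2x2 matrix. Passing `var_name` explicitly is a choice made here for clarity; without it, pandas falls back to the name of the column axis.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled matrix, indexed by customer id.
scaled = pd.DataFrame(
    {"Fclassics3": [0.0, np.nan], "Ffiction1": [0.5, 1.0]},
    index=pd.Index([101, 102], name="id"),
)

long_norm = (
    scaled.reset_index()  # move id from the index into a column
          .melt(id_vars=["id"], var_name="genre",
                value_name="scaled_purchase_freq")
          .dropna()       # drop pairs that were NaN in the matrix
)
print(long_norm)
```

The surviving rows keep their original positions as index labels, which is why the index in the output above has gaps.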


In [18]:
def split_data(data):
    '''
    Split a pandas DataFrame into training (80%) and test (20%) sets
    and return both as Turi Create SFrames.
    '''
    train, test = train_test_split(data, test_size=0.2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [20]:
train_data, test_data = split_data(data)
train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)
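
`train_test_split` shuffles the rows at random, so each call to `split_data` produces a different partition unless a seed is fixed. The sketch below shows a reproducible variant on hypothetical data. The seed value 42 is arbitrary, and the `SFrame` conversion is left out so the example runs without Turi Create.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical long table with ten customer-genre rows.
long = pd.DataFrame({
    "id": list(range(10)),
    "genre": ["Ffiction1"] * 10,
    "count": list(range(1, 11)),
})

# Fixing random_state makes the shuffle, and so the split, repeatable.
train_a, test_a = train_test_split(long, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(long, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(test_a.index.tolist() == test_b.index.tolist())
```

Reusing the same seed across frames with the same number of rows in the same order also puts the same row positions in each test set, which may help when comparing models trained on different features.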