### Recommender Systems dengan menggunakan Collaborative Filtering

Model yang akan saya gunakan adalah Collaborative Filtering dengan AlternatingLeastSquares packages. Dari jurnal penjelasan ALS, dikatakan bahwa ALS digunakan pada data yang bersifat implicit feedback serta bersifat memory-based. Implicit feedback secara tidak langsung adalah opini dari hasil observasi user behaviour, seperti purchase history, browsing history, etc. Referensi dari sebuah project di Kaggle, seseorang menggunakan data sejenis (purchase history) dengan saya dan mengartikan implicit untuk total pembelian produk dari setiap user. Artinya adalah user dikatakan menyukai atau tidak menyukai produk dapat dilihat dari total pembelian tersebut.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("clean.csv")
df = df.drop(columns = "Unnamed: 0")

In [3]:
df.head()

Unnamed: 0,Unnamed: 0.1,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Outliers
0,0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom,0
1,1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,0
2,2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom,0
3,3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,0
4,4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,0


In [4]:
new = df[['StockCode', 'CustomerID', 'Quantity']].groupby(by = ['StockCode', 'CustomerID']).count().reset_index()
new = new.rename(columns = {'Quantity': 'Stock_count'})
new.sort_values(by = 'Stock_count', ascending = False).head()

Unnamed: 0,StockCode,CustomerID,Stock_count
16419,22632,17850.0,17
27128,85123A,17850.0,17
24316,84029E,17850.0,17
16477,22633,17850.0,17
6353,21730,17850.0,17


In [5]:
from sklearn.model_selection import train_test_split
import implicit
import scipy.sparse as sparse

In [6]:
new_stock_id = []
id = 1

for i in range(len(new['StockCode'].unique())):
    new_stock_id.append(id)
    id += 1

In [7]:
new_stock = pd.DataFrame()
new_stock['StockCode'] = new['StockCode'].unique()
new_stock['StockID'] = new_stock_id

In [8]:
new_stockid = pd.merge(new, new_stock, how = 'left', on = 'StockCode')
new_stockid

Unnamed: 0,StockCode,CustomerID,Stock_count,StockID
0,10002,12583.0,1,1
1,10002,12682.0,2,1
2,10002,12731.0,1,1
3,10002,12867.0,1,1
4,10002,12872.0,1,1
...,...,...,...,...
27938,POST,14646.0,1,2501
27939,POST,15107.0,1,2501
27940,POST,15602.0,1,2501
27941,POST,15694.0,1,2501


### Train Test 80:20

In [9]:
train, test = train_test_split(new_stockid, test_size = 0.2, random_state = 1)

In [10]:
train.shape

(22354, 4)

In [11]:
test.shape

(5589, 4)

In [12]:
user_items = sparse.csr_matrix((train['Stock_count'].astype(float), (train['CustomerID'].astype(int), train['StockID'].astype(int))))

In [13]:
item_users = sparse.csr_matrix((train['Stock_count'].astype(float), (train['StockID'], train['CustomerID'].astype(int))))

In [14]:
print(item_users)

  (1, 12682)	2.0
  (1, 12867)	1.0
  (1, 12872)	1.0
  (1, 13069)	1.0
  (1, 14258)	1.0
  (1, 14713)	2.0
  (1, 14911)	1.0
  (1, 15529)	1.0
  (1, 16098)	1.0
  (1, 16795)	1.0
  (1, 17677)	1.0
  (1, 17967)	1.0
  (2, 12748)	1.0
  (2, 15021)	1.0
  (2, 17198)	1.0
  (3, 17967)	1.0
  (4, 16710)	1.0
  (5, 16916)	1.0
  (6, 12735)	1.0
  (6, 13069)	1.0
  (6, 15750)	1.0
  (6, 16145)	1.0
  (6, 16722)	1.0
  (6, 16898)	1.0
  (6, 17511)	1.0
  :	:
  (2501, 12682)	3.0
  (2501, 12686)	1.0
  (2501, 12691)	1.0
  (2501, 12705)	1.0
  (2501, 12708)	1.0
  (2501, 12709)	1.0
  (2501, 12720)	3.0
  (2501, 12725)	1.0
  (2501, 12726)	1.0
  (2501, 12731)	1.0
  (2501, 12735)	1.0
  (2501, 12738)	1.0
  (2501, 12766)	1.0
  (2501, 12785)	1.0
  (2501, 12791)	1.0
  (2501, 12797)	1.0
  (2501, 12808)	1.0
  (2501, 12971)	1.0
  (2501, 13520)	1.0
  (2501, 13817)	1.0
  (2501, 14646)	1.0
  (2501, 15107)	1.0
  (2501, 15602)	1.0
  (2501, 15694)	1.0
  (2501, 16861)	1.0


In [15]:
import os
os.environ['MKL_NUM_THREADS'] = '1' 
os.environ['OPENBLAS_NUM_THREADS'] = '1'

Metode ALS mempelajari data berdasarkan matrix item-user dan menggunakan matrix user-item sebagai masukan sebagai preferensi apakah user akan menyukai atau tidak produk yang diberikan dan disertai dengan estimasi skor dari confidence level.

In [16]:
model = implicit.als.AlternatingLeastSquares(factors = 500, iterations = 30)
model.fit(item_users)

HBox(children=(IntProgress(value=0, max=30), HTML(value='')))




In [17]:
custID = 14911
rekomendasi = model.recommend(custID, user_items, N = 5, filter_already_liked_items = True)

#### Total Rekomendasi pada Dashboard sesuai User

In [18]:
rekomendasi

[(1310, 0.02404292),
 (2329, 0.022595033),
 (434, 0.022293271),
 (575, 0.021644741),
 (1080, 0.019152567)]

In [19]:
df_rec = pd.DataFrame(columns = ['CustomerID', 'Recommendation'])
df_rec

Unnamed: 0,CustomerID,Recommendation


In [20]:
cust = []
rec = []

for i in range(len(rekomendasi)):
    cust.append(rekomendasi[i][0])
    rec.append(rekomendasi[i][1])
    
df_rec['CustomerID'], df_rec['Recommendation'] = cust, rec

df_rec

# print('Rekomendasi untuk ID {}:'.format(custID))
# for i in range(len(rekomendasi)):
#     print('{}. Stock ID {} dengan skor {}'.format(i+1, rekomendasi[i][0], rekomendasi[i][1]))

Unnamed: 0,CustomerID,Recommendation
0,1310,0.024043
1,2329,0.022595
2,434,0.022293
3,575,0.021645
4,1080,0.019153


### Evaluation Metric  

**Mean Average Recall at K dan Mean Average Precision at K**

Dari jurnal mengenai Implicit Feedback Datasets dikatakan bahwa metode evaluasi yang cocok diaplikasikan adalah Recall dibandingkan dengan Precision. Alasannya adalah karena tidak adanya variabel yang pantas untuk diukur, dalam kasus saya adalah total pembelian per produk. Semakin seseorang melakukan transaksi terhadap sebuah produk, maka diindakasi menyukai produk tersebut. Padahal dalam kenyataannya seseorang yang pernah membeli produk belum tentu menyukainya, bisa saja hanya untuk kado, barang titipan atau yang lainnya.<br>

Dari beberapa referensi yang saya lihat, recall@k dapat dihitung dengan total dari banyaknya rekomendasi benar dibagi dengan total aktual. Sedangkan precision@k merupakan total rekomendasi benar dibagi dengan total rekomendasi yang diberikan.

In [21]:
import ml_metrics
import ast
import recmetrics as rec

In [22]:
id = []
rec_list = []
for i in list(train['CustomerID'].astype(int).unique()):
    rekomendasi = model.recommend(i, user_items, N = 5, filter_already_liked_items = True)
    
    temp_list = []
    for j in rekomendasi:
        temp_list.append(j[0])
        
    id.append(i)
    rec_list.append(temp_list)
    
predicted_train = pd.DataFrame()
predicted_train['CustID'] = id
predicted_train['Predicted_train'] = rec_list

predicted_train['Predicted_train'] = predicted_train['Predicted_train'].apply(lambda x: ("""{}""".format(x))) 
predic_train = [ast.literal_eval(a) for a in list(predicted_train['Predicted_train'])]

predicted_train.head()

Unnamed: 0,CustID,Predicted_train
0,17284,"[682, 499, 601, 1057, 65]"
1,16385,"[1637, 1634, 422, 532, 421]"
2,12585,"[658, 928, 1645, 779, 349]"
3,14911,"[1310, 2329, 434, 575, 1080]"
4,14733,"[1551, 402, 477, 1515, 2235]"


In [23]:
train_id = []
train_stock = []

for i in train['CustomerID'].unique():
    train_id.append(i.astype(int))
    train_stock.append(list(train[train['CustomerID'] == i]['StockID'].unique()))
    
actual_train = pd.DataFrame()
actual_train['CustomerID'] = train_id
actual_train['Actual_train'] = train_stock
actual_train.head()

Unnamed: 0,CustomerID,Actual_train
0,17284,"[1452, 1291, 1033, 420, 2196, 426, 1056, 1210,..."
1,16385,"[341, 565, 1636, 1585, 1655, 879, 2197, 343, 1..."
2,12585,"[367, 1497, 1288, 1337, 807, 1502, 2168, 341, ..."
3,14911,"[1876, 393, 1354, 1943, 1325, 1067, 1302, 2498..."
4,14733,"[676, 1059, 181, 1375, 2291, 944, 1939, 974, 1..."


In [24]:
predic_train

[[682, 499, 601, 1057, 65],
 [1637, 1634, 422, 532, 421],
 [658, 928, 1645, 779, 349],
 [1310, 2329, 434, 575, 1080],
 [1551, 402, 477, 1515, 2235],
 [1753, 956, 1542, 1218, 1995],
 [2260, 57, 1248, 2121, 2236],
 [1398, 1247, 1333, 850, 851],
 [979, 1423, 934, 2031, 925],
 [340, 1308, 266, 941, 1216],
 [1272, 521, 698, 274, 1470],
 [130, 786, 791, 1116, 427],
 [2108, 599, 2058, 1594, 1686],
 [2251, 961, 1647, 1321, 1159],
 [2249, 1167, 2250, 736, 2367],
 [703, 883, 1550, 1988, 111],
 [1479, 791, 1124, 1986, 130],
 [181, 1542, 1644, 1387, 611],
 [181, 1523, 1105, 1300, 1431],
 [911, 1320, 1321, 1310, 1327],
 [129, 531, 837, 1223, 942],
 [1526, 1820, 1651, 1523, 705],
 [1485, 1321, 2170, 669, 1325],
 [1847, 1967, 1002, 1331, 1430],
 [886, 2289, 569, 1421, 1725],
 [2198, 382, 383, 2200, 2201],
 [2481, 351, 143, 2337, 759],
 [1877, 1242, 1307, 1106, 1300],
 [856, 1521, 140, 1497, 324],
 [1454, 1491, 1659, 960, 1935],
 [143, 35, 1837, 532, 251],
 [175, 521, 663, 1517, 1871],
 [600, 1242, 13

In [25]:
id = []
rec_list = []
for i in list(test['CustomerID'].astype(int).unique()):
    rekomendasi = model.recommend(i, user_items, N = 5, filter_already_liked_items = True)
    
    temp_list = []
    for j in rekomendasi:
        temp_list.append(j[0])
        
    id.append(i)
    rec_list.append(temp_list)
    
predicted_test = pd.DataFrame()
predicted_test['CustID'] = id
predicted_test['Predicted_test'] = rec_list

predicted_test['Predicted_test'] = predicted_test['Predicted_test'].apply(lambda x: ("""{}""".format(x))) 
predic_test = [ast.literal_eval(a) for a in list(predicted_test['Predicted_test'])]

predicted_test.head()

Unnamed: 0,CustID,Predicted_test
0,18283,"[784, 1683, 312, 1677, 1311]"
1,14298,"[2330, 1904, 1573, 1247, 748]"
2,14527,"[1414, 1243, 2498, 1412, 1405]"
3,16931,"[627, 426, 1227, 1283, 521]"
4,17211,"[76, 1710, 1491, 250, 2331]"


In [26]:
test_id = []
test_stock = []

for i in test['CustomerID'].unique():
    test_id.append(i.astype(int))
    test_stock.append(list(test[test['CustomerID'] == i]['StockID'].unique()))
    
actual_test = pd.DataFrame()
actual_test['CustomerID'] = test_id
actual_test['Actual_test'] = test_stock
actual_test.head()

Unnamed: 0,CustomerID,Actual_test
0,18283,"[741, 1446, 1137, 1574, 787, 1136, 2279, 1122,..."
1,14298,"[1675, 2363, 589, 301, 1428, 2069, 1128, 1602,..."
2,14527,"[1404, 1338, 1109, 1555, 1415, 2180, 787, 1557..."
3,16931,"[1074, 1298, 1460, 1324, 644, 1472, 1309, 141,..."
4,17211,"[1832, 1556, 1522, 1471, 1627, 1462, 1181, 145..."


**Recall**

In [27]:
rec.mark(actual_train['Actual_train'], predic_train, k = 5)

0.0

In [28]:
rec.mark(actual_test['Actual_test'], predic_test, k = 5)

0.03843518702467566

**Precision**

Terjadi kenaikan skor pada data test yang cukup signifikan. Namun, angka tersebut dapat dibilang tergolong rendah dan jika melihat rumusnya, hal tersebut bisa saja dipengaruhi oleh orang yang aktualnya membeli produk jauh lebih banyak dari total rekomendasi yang disediakan oleh model yang dibuat sehingga pembagiannya jadi tidak sama rata. Oleh karena itu, penggunaan Precision bisa membantu acuan akurasi dari model lebih lanjut.

In [29]:
ml_metrics.mapk(actual_train['Actual_train'], predic_train, k = 5)

0.0

In [30]:
ml_metrics.mapk(actual_test['Actual_test'], predic_test, k = 5)

0.044163825757575756

Nilai recall dan precision saat train = 0, itu terjadi karena memang tidak ada satupun prediksi yang tepat. Namun, terjadi kenaikan 6% dan 8% saat recall dan precision test. Nilai recall bisa lebih kecil dibandingkan precision karena perbandingan recall antara prediksi benar dengan total aktual yang jumlahnya lebih banyak dari total prediksi, sedangkan precision perbandingannya antara prediksi benar dengan total prediksi. Dapat disimpulkan bahwa prediksi rekomendasi benar lebih banyak di 0. Padahal jika dilihat dari EDA yang telah dibuat, consumer yang melakukan pembelian lebih dari 5 produk melebihi orang yang hanya membeli 1-5 produk. Ada kemungkinan orang yang membeli > 5 produk hanya melakukan sekali transaksi di setiap produk yang mengakibatkan nilai korelasi didalam model menjadi kurang maksimal.

In [31]:
x = 0
tepat = []
for i in range(len(predic_train)):
    for j in range(len(predic_train[i])):
        if predic_train[i][j] in actual_train['Actual_train'][i]:
            x += 1
            
    tepat.append(x)
    x = 0
train_tepat = pd.DataFrame(tepat)

train_tepat[0].value_counts()

0    1025
Name: 0, dtype: int64

In [32]:
x = 0
tepat = []
for i in range(len(predic_test)):
    for j in range(len(predic_test[i])):
        if predic_test[i][j] in actual_test['Actual_test'][i]:
            x += 1
            
    tepat.append(x)
    x = 0
test_tepat = pd.DataFrame(tepat)

test_tepat[0].value_counts()

0    714
1    142
2     22
3      2
Name: 0, dtype: int64

### Train Test 90:10

In [33]:
train2, test2 = train_test_split(new_stockid, test_size = 0.1, random_state = 1)

In [34]:
user_items2 = sparse.csr_matrix((train2['Stock_count'].astype(float), (train2['CustomerID'].astype(int), train2['StockID'].astype(int))))
item_users2 = sparse.csr_matrix((train2['Stock_count'].astype(float), (train2['StockID'], train2['CustomerID'].astype(int))))

In [35]:
model2 = implicit.als.AlternatingLeastSquares(factors = 500, iterations = 30)
model2.fit(item_users2)

HBox(children=(IntProgress(value=0, max=30), HTML(value='')))




In [36]:
custID = 14911
rekomendasi2 = model2.recommend(custID, user_items2, N = 5, filter_already_liked_items = True)

In [37]:
df_rec2 = pd.DataFrame(columns = ['CustomerID', 'Recommendation'])
df_rec2

cust = []
rec = []

for i in range(len(rekomendasi)):
    cust.append(rekomendasi[i][0])
    rec.append(rekomendasi[i][1])
    
df_rec2['CustomerID'], df_rec2['Recommendation'] = cust, rec

df_rec2

Unnamed: 0,CustomerID,Recommendation
0,838,0.0
1,835,0.0
2,832,0.0
3,833,0.0
4,2501,0.0


In [38]:
# ----- Predicted Train -----

id = []
rec_list = []
for i in list(train2['CustomerID'].astype(int).unique()):
    rekomendasi = model2.recommend(i, user_items2, N = 5, filter_already_liked_items = True)
    
    temp_list = []
    for j in rekomendasi:
        temp_list.append(j[0])
        
    id.append(i)
    rec_list.append(temp_list)
    
predicted_train2 = pd.DataFrame()
predicted_train2['CustID'] = id
predicted_train2['Predicted_train'] = rec_list

predicted_train2['Predicted_train'] = predicted_train2['Predicted_train'].apply(lambda x: ("""{}""".format(x))) 
predic_train2 = [ast.literal_eval(a) for a in list(predicted_train2['Predicted_train'])]

# ----- Actual Train -----

train_id = []
train_stock = []

for i in train2['CustomerID'].unique():
    train_id.append(i.astype(int))
    train_stock.append(list(train2[train2['CustomerID'] == i]['StockID'].unique()))
    
actual_train2 = pd.DataFrame()
actual_train2['CustomerID'] = train_id
actual_train2['Actual_train'] = train_stock
actual_train2.head()

Unnamed: 0,CustomerID,Actual_train
0,17059,"[2201, 905, 903, 583, 424, 669, 2277, 1457, 22..."
1,12480,"[374, 562, 1108, 1625, 135, 1381, 791, 1684, 1..."
2,15867,"[1362, 1193, 1473, 903, 1195, 1298, 905, 1447,..."
3,15601,"[1942, 285, 1837, 2291, 615, 1490, 438, 1832, ..."
4,12415,"[1350, 874, 1651, 1650, 383, 111, 1576, 382, 1..."


In [39]:
# ----- Train Test -----

id = []
rec_list = []
for i in list(test2['CustomerID'].astype(int).unique()):
    rekomendasi = model2.recommend(i, user_items2, N = 5, filter_already_liked_items = True)
    
    temp_list = []
    for j in rekomendasi:
        temp_list.append(j[0])
        
    id.append(i)
    rec_list.append(temp_list)
    
predicted_test2 = pd.DataFrame()
predicted_test2['CustID'] = id
predicted_test2['Predicted_test'] = rec_list

predicted_test2['Predicted_test'] = predicted_test2['Predicted_test'].apply(lambda x: ("""{}""".format(x))) 
predic_test2 = [ast.literal_eval(a) for a in list(predicted_test2['Predicted_test'])]

# ----- Actual Test -----

test_id = []
test_stock = []

for i in test2['CustomerID'].unique():
    test_id.append(i.astype(int))
    test_stock.append(list(test[test['CustomerID'] == i]['StockID'].unique()))
    
actual_test2 = pd.DataFrame()
actual_test2['CustomerID'] = test_id
actual_test2['Actual_test'] = test_stock
actual_test2.head()

Unnamed: 0,CustomerID,Actual_test
0,18283,"[741, 1446, 1137, 1574, 787, 1136, 2279, 1122,..."
1,14298,"[1675, 2363, 589, 301, 1428, 2069, 1128, 1602,..."
2,14527,"[1404, 1338, 1109, 1555, 1415, 2180, 787, 1557..."
3,16931,"[1074, 1298, 1460, 1324, 644, 1472, 1309, 141,..."
4,17211,"[1832, 1556, 1522, 1471, 1627, 1462, 1181, 145..."


**Recall 2**

In [40]:
import recmetrics as rec

In [41]:
rec.mark(actual_train2['Actual_train'], predic_train2, k = 5)

0.0

In [42]:
rec.mark(actual_test2['Actual_test'], predic_test2, k = 5)

0.027499334636385025

**Precision 2**

In [43]:
ml_metrics.mapk(actual_train2['Actual_train'], predic_train2, k = 5)

0.0

In [44]:
ml_metrics.mapk(actual_test2['Actual_test'], predic_test2, k = 5)

0.031344343039258295

In [45]:
x = 0
tepat = []
for i in range(len(predic_train2)):
    for j in range(len(predic_train2[i])):
        if predic_train2[i][j] in actual_train2['Actual_train'][i]:
            x += 1
            
    tepat.append(x)
    x = 0
train_tepat2 = pd.DataFrame(tepat)

train_tepat2[0].value_counts()

0    1032
Name: 0, dtype: int64

In [46]:
x = 0
tepat = []
for i in range(len(predic_test2)):
    for j in range(len(predic_test2[i])):
        if predic_test2[i][j] in actual_test2['Actual_test'][i]:
            x += 1
            
    tepat.append(x)
    x = 0
test_tepat2 = pd.DataFrame(tepat)

test_tepat2[0].value_counts()

0    652
1    102
2     13
Name: 0, dtype: int64

Hasilnya adalah lebih baik saat menggunakan train test 80:20, meskipun perbedaan pada recall dan precision hanya sedikit. Pada model kedua tidak ada yang benar memprediksi lebih dari 2 rekomendasi pada 1 customer. Untuk train recall dan precision tetap mendapat nilai 0.

Untuk Tuning saya belum mendapatkan cara untuk diterapkan pada ALS karena inputannya berupa sparse matrix bukan berupa list angka.