# Recommending products to existing users

Based on other users and items

First we need to reshape the raw data into a user-item matrix that we can work with.

We do it by selecting only the VL_ columns and treating each row as a user (the index holds their IDs).

**Note: this is just a prototype. Real recommender systems need much more data than simply user preferences.**

As a result, this system is prone to cold-start problems with both new users and new products.

It also does not scale, does not account for bias, and does not handle sequential data.

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import pandas as pd
import numpy as np

from reco.recommender import SVDRecommender

First we load the data and set the index to something readable: the user names

In [2]:
raw_data = pd.read_csv("../../data/datasets/raw.csv", sep="|")
names = pd.read_csv("../../data/datasets/names.csv")
raw_data = raw_data.set_index(names['name'])
data = raw_data



In [3]:
data[list(filter(lambda x: "VL" in x, data.columns.values))].columns.values

array(['DSESTCVL', 'VL_RENDA_FATANUAL', 'VL_RENDA_PFPJ', 'VL_DOMICBANC',
       'VL_PLANOCOTAS', 'VL_PLANOPOUPPROG', 'VL_LMTCARTAOCREDITO',
       'VL_LMTCHESPECIAL', 'VL_EMP_FINAN', 'VL_APLICACAO',
       'VL_COTASCAPITAL', 'VL_POUPPROG', 'VL_CONSORCIO', 'VL_RECEB_FOLHA',
       'VL_LMTTRANSACAO', 'VL_LMTDESCTTIT', 'VL_LMTDESCTCHEQ',
       'VL_COBBANC', 'VL_SEG_VIDA', 'VL_SEG_RES', 'VL_SEG_AUTO',
       'VL_CONV_FOLHA_PAGTO', 'VL_DEBITOAUT'], dtype=object)

Now we select the item columns, convert them to numbers and set zeros to NaN

In [4]:
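# Keep only the product columns (names starting with "VL")
products = [col for col in data.columns if col.startswith("VL")]
data = data[products].copy()
data = data.fillna(0)
for column in data.columns:
    # Text columns may contain "," separators; numeric columns have no .str accessor
    try:
        data[column] = data[column].str.replace(',', '')
    except AttributeError:
        pass
    data[column] = pd.to_numeric(data[column], errors='coerce')

# Zeros become NaN so that they count as missing entries
data = data.replace(0, np.nan)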

Let's see our new dataset

In [5]:
data

name,VL_RENDA_FATANUAL,VL_RENDA_PFPJ,VL_DOMICBANC,VL_PLANOCOTAS,VL_PLANOPOUPPROG,VL_LMTCARTAOCREDITO,VL_LMTCHESPECIAL,VL_EMP_FINAN,VL_APLICACAO,VL_COTASCAPITAL,...,VL_RECEB_FOLHA,VL_LMTTRANSACAO,VL_LMTDESCTTIT,VL_LMTDESCTCHEQ,VL_COBBANC,VL_SEG_VIDA,VL_SEG_RES,VL_SEG_AUTO,VL_CONV_FOLHA_PAGTO,VL_DEBITOAUT
Eric Modafferi,1.745000e+04,2980.0,,10.0,,,,2306983.0,,637.0,...,,,,5000.0,,,,,,
Jay Robinson,2.000000e+04,51307.0,,20.0,25.0,1.0,2300.0,799141.0,501828.0,309338.0,...,,11000.0,,,,,,,,
William Dickhoff,1.933774e+07,178111.0,,50.0,,,5000.0,,,421398.0,...,,55000.0,,,,,,,,
Ernesto Jackson,6.501000e+04,54175.0,,25.0,,2000.0,,477693.0,92427.0,115768.0,...,,18000.0,,,1168221.0,,,,,
Jack Barnett,3.752502e+07,6359236.0,,55.0,,5000.0,5000.0,3552662.0,10849.0,549289.0,...,,90000.0,,,,,,,,
Jamie Hamilton,2.919665e+06,423845.0,,5437.0,,2000.0,2500.0,116493.0,2498.0,159671.0,...,,26000.0,,,,,,,,
Emelina Harrell,2.547475e+07,23940.0,,3269.0,300.0,3000.0,10000.0,,,102944.0,...,,12000.0,,,3250521.0,,,,,
Eric Moore,1.000000e+00,3800.0,,,,,,580517.0,,2138.0,...,,,,,,,,,,
Eric Fitzgerald,1.765406e+06,13404.0,,2175.0,20.0,,1000.0,901532.0,,57979.0,...,,,,,,,,,,
Gloria Taylor,1.200000e+05,1401543.0,,5394.0,,,,,3259284.0,140448.0,...,,30000.0,,,1639708.0,,,,,2.0


Sparse, but how sparse?

In [6]:
# Fraction of cells that hold a value (DataFrame.to_sparse was removed in pandas 1.0)
data.notna().to_numpy().mean()

0.35969844617075786

Not that sparse, actually: about 36% of the cells hold a value. The variance differs hugely from column to column, though. Since we'll normalize and scale the data, this should not be a problem, but I've never done this with this kind of dataset, so let's hope for the best
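How `SVDRecommender` normalizes the data internally is not shown in this notebook. As a rough sketch of the idea, assuming a plain per-column z-score that ignores missing values (the helper `standardize_columns` and the toy data are made up for illustration):

```python
import numpy as np
import pandas as pd


def standardize_columns(df):
    """Z-score each column using only its observed (non-NaN) values."""
    means = df.mean(skipna=True)
    # Population std; a constant column would give 0, so swap that for 1
    stds = df.std(skipna=True, ddof=0).replace(0, 1.0)
    return (df - means) / stds, means, stds


toy = pd.DataFrame(
    {
        "VL_A": [10.0, 20.0, np.nan, 30.0],      # small-valued product
        "VL_B": [1.0e6, np.nan, 3.0e6, 5.0e6],   # large-valued product
    },
    index=["u1", "u2", "u3", "u4"],
)

scaled, means, stds = standardize_columns(toy)
print(scaled.round(3))
```

After this step both columns live on the same scale, and the missing cells are still NaN. Keeping `means` and `stds` around lets you map predictions back to the original units.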

Now we'll import our model and get on with it

Quick Q&A on the number of features below:
- **How many features?** Not known in advance; the 14 used below was not tuned
- **How do you choose the number of features?** Grid search
- **Did I do it?** No
- **More features == better?** Not necessarily. Too many features increase variance and model complexity: bias goes down, but the model overfits. Too few features make the model too simple and too biased to be useful as a general model
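Since the grid search was not done here, this is only a sketch of what it could look like. It does not use `SVDRecommender`; it assumes a plain truncated SVD on a mean-filled matrix, run on synthetic data with a known rank of 3, and scores each candidate number of features on held-out entries. All names (`reconstruct`, `validation_rmse`) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a true rank of 3 plus a little noise (made-up numbers)
n_users, n_items, true_rank = 200, 20, 3
full = rng.normal(size=(n_users, true_rank)) @ rng.normal(size=(true_rank, n_items))
full += 0.1 * rng.normal(size=full.shape)

# Hide roughly 20% of the entries and use them as a validation set
held_out = rng.random(full.shape) < 0.2
train = np.where(held_out, np.nan, full)


def reconstruct(matrix, k):
    """Fill NaNs with column means, then keep the top-k singular components."""
    col_means = np.nanmean(matrix, axis=0)
    filled = np.where(np.isnan(matrix), col_means, matrix)
    u, s, vt = np.linalg.svd(filled - col_means, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k] + col_means


def validation_rmse(k):
    """RMSE of the rank-k reconstruction on the hidden entries only."""
    pred = reconstruct(train, k)
    return float(np.sqrt(np.mean((pred[held_out] - full[held_out]) ** 2)))


scores = {k: validation_rmse(k) for k in range(1, 15)}
best_k = min(scores, key=scores.get)
for k, rmse in scores.items():
    print(f"n_features={k:2d}  validation RMSE={rmse:.3f}")
print("best n_features:", best_k)
```

On data like this I would expect the validation error to bottom out near the true rank and rise again as extra features start fitting the filled-in values. On the real dataset the curve may look quite different, which is the whole reason to run the search.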

In [7]:
model = SVDRecommender(n_features=14)

model.fit(data)

SVDRecommender. features: 14, method:default

In [8]:
test_users = ['Eric Modafferi', 'Tina Roberts']
test_items = ['VL_POUPPROG', 'VL_LMTCARTAOCREDITO']


# Recommend 3 undiscovered items for each user
print("\nUser recommendations", model.recommend(test_users, N=3))

# Recommend 3 users to buy certain items and show the predicted values
print("\nItem recommendations", model.recommend(test_items, content='item', N=3, values=True))

# Output the 3 most similar items to VL_DOMICBANC
# What makes a similar item?
print("\nMost similar", model.topN_similar(x='VL_DOMICBANC', N=3, column='item'))


User recommendations [['VL_COBBANC', 'VL_CONV_FOLHA_PAGTO', 'VL_APLICACAO'], ['VL_COBBANC', 'VL_CONV_FOLHA_PAGTO', 'VL_APLICACAO']]

Item recommendations [[('Debra Propst', 76895.469567145788), ('James Payne', 76496.779612962768), ('Dorothy Powell', 76406.044309660851)], [('Dorothy Lujan', 54753.889388363728), ('Kandy Ford', 48631.013085703984), ('Heather Coleman', 41067.195125058832)]]

Most similar [('VL_LMTCARTAOCREDITO', 13694.805049625806), ('VL_PLANOCOTAS', 13695.272958898822), ('VL_DEBITOAUT', 13695.757727444545)]
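So what makes a similar item? The internals of `topN_similar` are not shown here, but the scores above are sorted in ascending order, which suggests a distance (smaller means closer). One plausible sketch, assuming items are compared by Euclidean distance between their latent vectors from a truncated SVD (`top_n_similar_items` and the toy matrix are illustrative, not the library's code):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "VL_A": [1.0, 2.0, 3.0, 4.0],
        "VL_B": [1.1, 2.1, 2.9, 4.2],   # behaves almost like VL_A
        "VL_C": [4.0, 1.0, 3.5, 0.5],   # behaves differently
    },
    index=["u1", "u2", "u3", "u4"],
)


def top_n_similar_items(df, item, n=2, k=2):
    """Rank the other items by distance to `item` in a k-dimensional latent space."""
    centered = df - df.mean()
    u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
    item_vectors = vt[:k].T * s[:k]              # one k-dimensional row per item
    target = item_vectors[df.columns.get_loc(item)]
    dists = np.linalg.norm(item_vectors - target, axis=1)
    ranked = pd.Series(dists, index=df.columns).drop(item).sort_values()
    return list(ranked.head(n).items())


similar = top_n_similar_items(toy, "VL_A")
print(similar)
```

Under this assumption two items are similar when users tend to hold them in similar amounts, not because the products have anything in common by nature.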


As you can see, Eric's recommendations are identical to Tina's. Let's see whether the two also come out as similar users

In [9]:
model.topN_similar(x='Eric Modafferi', N=5, column='user')

[('Ashlee Irvin', 2.629715661965943),
 ('Kimberly Colin', 3.0407430086878495),
 ('Stacey Byrd', 3.0431025129539995),
 ('Dawn Mason', 3.5870264131794696),
 ('Gloria Cardoso', 3.8856505688515326)]

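In the output shown above, Tina Roberts is not among Eric's five closest users.

Note that recommendations do not correlate completely with similarity: two users can receive the same recommendations without being nearest neighbours. The match here is just a coincidence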

Let's take a look at their vectors

In [10]:
data[(data.index == 'Eric Modafferi') | (data.index == 'Tina Roberts')]

name,VL_RENDA_FATANUAL,VL_RENDA_PFPJ,VL_DOMICBANC,VL_PLANOCOTAS,VL_PLANOPOUPPROG,VL_LMTCARTAOCREDITO,VL_LMTCHESPECIAL,VL_EMP_FINAN,VL_APLICACAO,VL_COTASCAPITAL,...,VL_RECEB_FOLHA,VL_LMTTRANSACAO,VL_LMTDESCTTIT,VL_LMTDESCTCHEQ,VL_COBBANC,VL_SEG_VIDA,VL_SEG_RES,VL_SEG_AUTO,VL_CONV_FOLHA_PAGTO,VL_DEBITOAUT
Eric Modafferi,17450.0,2980.0,,10.0,,,,2306983.0,,637.0,...,,,,5000.0,,,,,,
Tina Roberts,15000.0,4530.0,,150.0,,1.0,,4423174.0,,440921.0,...,,300.0,,,,,,,,2.0


We can also predict values for any client

In [11]:
query = [['Eric Modafferi', 'VL_RENDA_PFPJ'], ['Eric Modafferi', 'VL_DOMICBANC']]
query = pd.DataFrame(query)

query

Unnamed: 0,0,1
0,Eric Modafferi,VL_RENDA_PFPJ
1,Eric Modafferi,VL_DOMICBANC


Our client already has VL_RENDA_PFPJ and its value is 2980.0, so we hope that our prediction is close to that

In [12]:
model.predict(query)

[2979.9996505621821, 531150.32808172004]

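Indeed, 2979.99 is pretty much 2980. Not too shabby. Keep in mind that this value was part of the data the model was fitted on, so it shows that the model reconstructs known entries well, not that it generalizes to unseen ones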
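To give an idea of where such predictions can come from, here is a minimal sketch that assumes a low-rank SVD reconstruction of the mean-filled matrix, with one lookup per (user, item) row of the query. It is not `SVDRecommender`'s code; `fit_low_rank`, `predict` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "VL_A": [10.0, 20.0, np.nan, 40.0],
        "VL_B": [1.0, 2.0, 3.0, np.nan],
        "VL_C": [5.0, np.nan, 15.0, 20.0],
    },
    index=["u1", "u2", "u3", "u4"],
)


def fit_low_rank(df, k):
    """Return a dense table of predictions from a rank-k SVD of the mean-filled data."""
    means = df.mean()
    filled = df.fillna(means)
    u, s, vt = np.linalg.svd((filled - means).to_numpy(), full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k] + means.to_numpy()
    return pd.DataFrame(approx, index=df.index, columns=df.columns)


def predict(pred_table, query):
    """Look up one prediction per (user, item) row of `query`."""
    return [pred_table.at[user, item] for user, item in query.itertuples(index=False)]


pred_table = fit_low_rank(toy, k=2)
query = pd.DataFrame([["u1", "VL_A"], ["u3", "VL_A"]])  # one known cell, one missing cell
preds = predict(pred_table, query)
print(preds)
```

With this approach a known cell comes back close to its original value because it was part of the matrix being approximated, while a missing cell gets whatever the low-rank structure implies for it.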

Finally, let's predict the values of every user for the products VL_LMTCARTAOCREDITO and VL_LMTCHESPECIAL

In [13]:
# One (user, item) query row per user for the first product
new = pd.DataFrame()
new['user'] = data.index
new['item'] = ['VL_LMTCARTAOCREDITO'] * len(data)
new['VL_LMTCARTAOCREDITO'] = model.predict(new)

# Same query for the second product, then copy its predictions into `new`
new_2 = pd.DataFrame()
new_2['user'] = data.index
new_2['item'] = ['VL_LMTCHESPECIAL'] * len(data)
new_2['pred'] = model.predict(new_2)
new['VL_LMTCHESPECIAL'] = new_2['pred']

new.head(15)

Unnamed: 0,user,item,VL_LMTCARTAOCREDITO,VL_LMTCHESPECIAL
0,Eric Modafferi,VL_LMTCARTAOCREDITO,2121.508846,6267.639448
1,Jay Robinson,VL_LMTCARTAOCREDITO,2504.177017,2160.929938
2,William Dickhoff,VL_LMTCARTAOCREDITO,2837.417817,4988.817359
3,Ernesto Jackson,VL_LMTCARTAOCREDITO,2553.567414,6201.736091
4,Jack Barnett,VL_LMTCARTAOCREDITO,2919.957317,5099.407992
5,Jamie Hamilton,VL_LMTCARTAOCREDITO,2437.967964,2503.849817
6,Emelina Harrell,VL_LMTCARTAOCREDITO,2893.18635,10018.479394
7,Eric Moore,VL_LMTCARTAOCREDITO,2553.721586,6251.965182
8,Eric Fitzgerald,VL_LMTCARTAOCREDITO,2306.415776,1032.46556
9,Gloria Taylor,VL_LMTCARTAOCREDITO,2703.245454,6269.026256


In [14]:
(new.set_index(names['name'])['VL_LMTCHESPECIAL'] - data['VL_LMTCHESPECIAL'].fillna(0))[~pd.isnull(data['VL_LMTCHESPECIAL'])].describe()

count     5179.000000
mean        11.844479
std        357.871128
min      -3285.705701
25%       -115.268543
50%        -25.487257
75%         26.185173
max      12633.725639
Name: VL_LMTCHESPECIAL, dtype: float64

In [15]:
(new.set_index(names['name'])['VL_LMTCARTAOCREDITO'] - data['VL_LMTCARTAOCREDITO'].fillna(0))[~pd.isnull(data['VL_LMTCARTAOCREDITO'])].describe()

count    4.080000e+04
mean    -4.415780e+01
std      9.852801e+03
min     -1.291017e+06
25%     -1.617784e+02
50%      1.548923e+03
75%      2.575811e+03
max      6.042688e+04
Name: VL_LMTCARTAOCREDITO, dtype: float64
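The last two cells subtract the known values from the predictions and keep only the users who already hold the product, so `describe()` is summarising residuals. A sketch of the same check on made-up numbers, with MAE and RMSE added as single-number summaries (the series below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Known values (NaN = user does not hold the product) and model predictions
actual = pd.Series([1000.0, np.nan, 2500.0, 5000.0, np.nan],
                   index=["u1", "u2", "u3", "u4", "u5"], name="VL_X")
predicted = pd.Series([1100.0, 900.0, 2400.0, 5300.0, 700.0],
                      index=actual.index, name="VL_X")

# Residual = prediction minus actual, only where the actual value is known
known = actual.notna()
residuals = (predicted - actual)[known]

mae = residuals.abs().mean()
rmse = float(np.sqrt((residuals ** 2).mean()))

print(residuals.describe())
print("MAE :", mae)
print("RMSE:", rmse)
```

A mean residual near zero with a large standard deviation, as in the outputs above, would mean the model is roughly unbiased on average but can be far off for individual users.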