In [2]:
import pandas as pd

In [11]:
df = pd.read_csv("train_ver2.csv", nrows = 100000)

In [5]:
df.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
0,2015-01-28,1375586,N,ES,H,35,2015-01-12,0.0,6,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,V,23,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
2,2015-01-28,1050612,N,ES,V,23,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
3,2015-01-28,1050613,N,ES,H,22,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
4,2015-01-28,1050614,N,ES,V,23,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


In [6]:
df.shape

(100000, 48)

## Our first goal is formatting the data as events.
It's already listed by date, but we need to add binary indicators for the person and the product.

In [12]:
df = pd.get_dummies(df)

In [10]:
df.shape

(100000, 1152)
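
The jump from 48 to 1152 columns is what you would expect if `pd.get_dummies` turned every distinct value of each text column (dates, country codes, and so on) into its own indicator column. Here is a minimal sketch of that behavior on a made-up three-row frame. The column names are borrowed from the dataset, but the frame itself is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two text columns.
toy = pd.DataFrame({
    "age": [35, 23, 23],
    "sexo": ["H", "V", "V"],
    "fecha_alta": ["2015-01-12", "2012-08-10", "2012-08-10"],
})

# By default only text/categorical columns are expanded; numeric columns pass through.
encoded = pd.get_dummies(toy)

print(encoded.shape)
print(list(encoded.columns))
```

Note that numeric columns such as `ncodpers` are left untouched by default, so if you want a per-person indicator you would probably need to cast the ID to a string or categorical type first.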

## Now, let's see if that fits into a SageMaker model right away
Once we know the mechanics are working, we can iterate on feeding new sets of features and hyperparameters into the model to find the best one.

SageMaker's built-in recommender handles only binary classification and regression tasks, which means we'll need to either create a binary classifier for each product, or convert the product indicators to a continuous outcome and then map that continuous outcome back to a discrete product.

This isn't exactly ideal; what we'd like is a multi-class classifier. Start with the regression approach for Factorization Machines, because the algorithm does well with sparse datasets at scale. Later, move on to exploring XGBoost and Linear Learner, treating the problem as multi-class classification.

If you really want to challenge yourself, you can select another open source recommender algorithm and load it into a Docker container for training. Using a scikit-learn estimator should be somewhat easier, because you can use the AWS-managed container rather than building your own.

## Convert product bins into continuous numbers to see the task as regression

Let's give each product its own numeric code, spaced 10 apart, so the gap between products is large enough for the model to capture. First we'll need to get a list of all 24 products.

In [5]:
headers = list(df)
products = [h for h in headers if "_ult1" in h]
print(len(products))

24


In [6]:
products

['ind_ahor_fin_ult1',
 'ind_aval_fin_ult1',
 'ind_cco_fin_ult1',
 'ind_cder_fin_ult1',
 'ind_cno_fin_ult1',
 'ind_ctju_fin_ult1',
 'ind_ctma_fin_ult1',
 'ind_ctop_fin_ult1',
 'ind_ctpp_fin_ult1',
 'ind_deco_fin_ult1',
 'ind_deme_fin_ult1',
 'ind_dela_fin_ult1',
 'ind_ecue_fin_ult1',
 'ind_fond_fin_ult1',
 'ind_hip_fin_ult1',
 'ind_plan_fin_ult1',
 'ind_pres_fin_ult1',
 'ind_reca_fin_ult1',
 'ind_tjcr_fin_ult1',
 'ind_valo_fin_ult1',
 'ind_viv_fin_ult1',
 'ind_nomina_ult1',
 'ind_nom_pens_ult1',
 'ind_recibo_ult1']

In [7]:
product_dict = {}

for idx, product in enumerate(products):
    value = idx * 10
    product_dict[product] = value

In [8]:
product_dict

{'ind_ahor_fin_ult1': 0,
 'ind_aval_fin_ult1': 10,
 'ind_cco_fin_ult1': 20,
 'ind_cder_fin_ult1': 30,
 'ind_cno_fin_ult1': 40,
 'ind_ctju_fin_ult1': 50,
 'ind_ctma_fin_ult1': 60,
 'ind_ctop_fin_ult1': 70,
 'ind_ctpp_fin_ult1': 80,
 'ind_deco_fin_ult1': 90,
 'ind_deme_fin_ult1': 100,
 'ind_dela_fin_ult1': 110,
 'ind_ecue_fin_ult1': 120,
 'ind_fond_fin_ult1': 130,
 'ind_hip_fin_ult1': 140,
 'ind_plan_fin_ult1': 150,
 'ind_pres_fin_ult1': 160,
 'ind_reca_fin_ult1': 170,
 'ind_tjcr_fin_ult1': 180,
 'ind_valo_fin_ult1': 190,
 'ind_viv_fin_ult1': 200,
 'ind_nomina_ult1': 210,
 'ind_nom_pens_ult1': 220,
 'ind_recibo_ult1': 230}
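
Since a regression model will return arbitrary floats rather than exact multiples of 10, we will eventually need to map a prediction back to a product. One simple option, sketched here as an assumption rather than a fixed part of the workflow, is to pick the product whose code is nearest to the prediction. The helper name `nearest_product` and the three-entry dictionary are illustrative only.

```python
def nearest_product(prediction, product_dict):
    """Return the product whose numeric code is closest to a regression output."""
    return min(product_dict, key=lambda name: abs(product_dict[name] - prediction))

# Illustrative subset of the product codes built above.
toy_codes = {
    "ind_ahor_fin_ult1": 0,
    "ind_aval_fin_ult1": 10,
    "ind_cco_fin_ult1": 20,
}

decoded = nearest_product(17.3, toy_codes)
print(decoded)  # ind_cco_fin_ult1
```

Keep in mind that nearest-code decoding treats neighboring codes as "similar" products, which is an artifact of the column order and not something the data tells us.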

In [13]:
def grab_row_header(idx, start, headers):
    return headers[idx+start]

headers = list(df)

df["outcome"] = 0.0

for i in df.index:
    
    start = 9
    
    end = start + 24
    
    row = df.loc[i][start:end]
    
    for idx, each in enumerate(row):
        if each == 1.0:
            
            # grab the row header
            h = grab_row_header(idx, start, headers)
            
            # grab the discrete value
            val = product_dict[h]
            
            # set the discrete value in the dataframe
            # (df.at replaces df.set_value, which was removed in pandas 1.0)
            df.at[i, "outcome"] = val
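
The row-by-row loop works, but it is slow on 100,000 rows and it has two quirks worth knowing about. When a customer holds several products, only the last one in column order is kept. A code of 0 is also ambiguous, since it can mean either `ind_ahor_fin_ult1` or no product at all. Below is a hedged sketch of a vectorized equivalent that looks columns up by name instead of by position. The function name `encode_outcome` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def encode_outcome(frame, product_dict):
    """Return one numeric outcome per row: the code of the last owned product."""
    cols = list(product_dict)                      # column order defines "last"
    owned = frame[cols].to_numpy() == 1            # boolean matrix: rows x products
    codes = np.array([product_dict[c] for c in cols], dtype=float)
    # argmax on the column-reversed matrix finds the last True in each row
    last_owned = owned.shape[1] - 1 - np.argmax(owned[:, ::-1], axis=1)
    # rows with no product keep the default of 0.0, as in the loop
    values = np.where(owned.any(axis=1), codes[last_owned], 0.0)
    return pd.Series(values, index=frame.index, name="outcome")

# Made-up toy frame with three of the product columns.
toy = pd.DataFrame({
    "ind_ahor_fin_ult1": [0, 0, 1],
    "ind_aval_fin_ult1": [0, 1, 0],
    "ind_cco_fin_ult1":  [0, 1, 0],
})
toy_codes = {"ind_ahor_fin_ult1": 0, "ind_aval_fin_ult1": 10, "ind_cco_fin_ult1": 20}

toy_outcome = encode_outcome(toy, toy_codes)
print(toy_outcome.tolist())  # [0.0, 20.0, 0.0]
```

On the real data this would be called as `df["outcome"] = encode_outcome(df, product_dict)`. Starting the codes at 10 instead of 0 would be one way to separate "no product" from the first product.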




In [15]:
df["outcome"].value_counts()

20.0     79197
230.0     9776
120.0     4020
220.0     1040
110.0      955
60.0       942
50.0       932
170.0      769
40.0       694
0.0        612
180.0      480
190.0      283
130.0      228
150.0       33
90.0        15
100.0        8
160.0        5
200.0        4
80.0         3
70.0         2
140.0        1
30.0         1
Name: outcome, dtype: int64

It appears there is strong class imbalance here: code 20 (`ind_cco_fin_ult1`) accounts for 79,197 of the 100,000 rows. Can you determine down the road whether this is correlated with any of the covariates?
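
To put a number on the skew, here is a small sketch that converts the counts printed above into shares and finds the majority class. The helper name `class_shares` is made up for illustration. In pandas, `df["outcome"].value_counts(normalize=True)` should give the same proportions directly.

```python
def class_shares(counts):
    """Convert raw class counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Counts copied from the value_counts() output above.
outcome_counts = {
    20.0: 79197, 230.0: 9776, 120.0: 4020, 220.0: 1040, 110.0: 955,
    60.0: 942, 50.0: 932, 170.0: 769, 40.0: 694, 0.0: 612, 180.0: 480,
    190.0: 283, 130.0: 228, 150.0: 33, 90.0: 15, 100.0: 8, 160.0: 5,
    200.0: 4, 80.0: 3, 70.0: 2, 140.0: 1, 30.0: 1,
}

shares = class_shares(outcome_counts)
majority_label = max(shares, key=shares.get)
print(majority_label, round(shares[majority_label], 3))  # 20.0 0.792
```

A model that always predicts code 20 would already be right about 79% of the time, so any accuracy-style metric needs to be read against that baseline.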

## Can you train a SageMaker regression Factorization Machines model with this dataset?