# First contact

The purpose of this analysis is to find out which products to recommend to new users.

This is done by training a model that learns the relationship between users' profiles and their preference for a given product.

**The main use of this model is to provide a warm start for our collaborative recommender.**
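
As a rough illustration of the idea (not the notebook's actual model), the sketch below fits one binary classifier per product on toy profile data and ranks products for a new user by predicted probability. The data, the product names and the `recommend` helper are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 users, 3 binary profile features, 2 products
profiles = rng.integers(0, 2, size=(200, 3))
owns = {
    "product_a": profiles[:, 0],       # owned exactly when feature 0 is set
    "product_b": 1 - profiles[:, 0],   # the opposite preference
}

# One binary classifier per product
models = {name: LogisticRegression().fit(profiles, y) for name, y in owns.items()}


def recommend(profile, models):
    """Rank products by predicted probability of ownership for one profile."""
    row = np.asarray(profile).reshape(1, -1)
    scores = {name: m.predict_proba(row)[0, 1] for name, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


ranking = recommend([1, 0, 1], models)
print(ranking)
```

A brand-new user has no purchase history, so a ranking like this can seed the collaborative recommender until real interactions accumulate.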

# Import things

In [2]:
%matplotlib inline
import pandas as pd
import pickle
import numpy as np
from scipy.stats import pearsonr
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectFromModel

from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import TruncatedSVD # Plays nicely with sparse data
from sklearn.base import BaseEstimator, TransformerMixin

import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from data.data import clean_data


disable_grid_search = True

# Generate the data

The data is generated by extracting the profile features from the dataset, with some feature mining and engineering. A second table is then created to hold each user's product preferences; it is used as the target for our model.

In [3]:
# Get raw data
raw_data = pd.read_csv("../../data/datasets/raw.csv", sep="|")
# Get list of names
names = pd.read_csv("../../data/datasets/names.csv")
# Get list of product column names
product_names = list(filter(lambda x: x.startswith("QTDE"), raw_data.columns.values))

# Extract label columns
labels_raw = raw_data[product_names]

# Drop label columns that hold a single distinct value (nothing to learn from them)
labels = labels_raw.loc[:, labels_raw.nunique() != 1]

# Remove labels from feature table
features_raw = raw_data.drop(product_names, axis=1)


In [4]:
# Clean features
features = clean_data(features_raw)

# Note that every column in features is of type np.integer

# A deep look at the data
Let's look at our features first. What are they?

In [5]:
features_raw.columns.values

array(['COOP_NMRESCOP', 'COOP_SKCOOPERATIVA', 'COOP_NRENDCOP',
       'AGENCIA_NMRESAGE', 'AGENCIA_DSENDCOP', 'AGENCIA_NMBAIRRO',
       'AGENCIA_NMCIDADE', 'DSDRISGP', 'DSENDERE', 'NRENDERE', 'NMBAIRRO',
       'NMCIDADE', 'DSESCOLA', 'DSESTCVL', 'DSGRUPORENDAFAT', 'DSINADIMPL',
       'DSINPESSOA', 'DSMOTIDEM', 'DSPROFTL', 'DSRESTR', 'DSSEXO',
       'DSSITDCT', 'DSSITDTL', 'DSSPCOUTRASINST', 'DSTIPCTA', 'FLGCIDATU',
       'FLGRENDAAUT', 'VL_RENDA_FATANUAL', 'QTFUNCIO', 'CDDSCNAE',
       'VL_RENDA_PFPJ', 'CARTAOCRED', 'COBRANCARIA', 'CONSORCIO',
       'CONVFOLHAPAGTO', 'EMPRESFINAN', 'SEGRESIDENCIA', 'CESSINTERNET',
       'DEBITOAUT', 'LMTDSTTIT', 'LMTTRANSACAO', 'LMTCHESPECIAL',
       'SEGVIDA', 'DDA', 'LMTDSRCHEQ', 'POUPPROG', 'SEGAUTO',
       'DOMICBANCARIO', 'APLICACAO', 'RRECBFOLHAPAGTO', 'PLANOCOTAS',
       'UTILCOBRANCA', 'CONTRATOS_PREJUIZO', 'VL_DOMICBANC',
       'VL_PLANOCOTAS', 'VL_PLANOPOUPPROG', 'VL_LMTCARTAOCREDITO',
       'VL_LMTCHESPECIAL', 'VL_EMP_FINAN', 'V

From those, keep in mind that some of them are going to be removed by our `clean_data` function.

We are going to remove features that represent just a single value, or that likely do not matter, like `AGENCIA_NMBAIRRO`.

We are also going to remove the ones that start with `VL`, since they represent the amount of each product purchased and are therefore analogous to our labels.
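
The source of `clean_data` is not shown here, so the following is only a sketch of the kind of filtering described above. The `drop_uninformative` helper and the toy values are hypothetical; only the column names are borrowed from the list above.

```python
import pandas as pd


def drop_uninformative(frame, prefixes=("VL",), extra=()):
    """Drop constant columns, columns with a leaky prefix, and any listed by name."""
    constant = [c for c in frame.columns if frame[c].nunique(dropna=False) <= 1]
    leaky = [c for c in frame.columns if c.startswith(tuple(prefixes))]
    to_drop = set(constant) | set(leaky) | set(extra)
    return frame.drop(columns=[c for c in frame.columns if c in to_drop])


# Made-up values, for illustration only
toy = pd.DataFrame({
    "DSSEXO": ["F", "M", "F"],
    "DSINPESSOA": ["PF", "PF", "PF"],       # single value
    "VL_RENDA_PFPJ": [10.0, 20.0, 30.0],    # amount column
    "AGENCIA_NMBAIRRO": ["x", "y", "z"],    # judged irrelevant
})
kept = drop_uninformative(toy, extra=["AGENCIA_NMBAIRRO"])
print(list(kept.columns))
```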

We end up with the following features

In [6]:
clean_data(features_raw, dummify_categorical=False).columns.values

array(['COOP_NMRESCOP', 'COOP_SKCOOPERATIVA', 'COOP_NRENDCOP', 'DSDRISGP',
       'NMCIDADE', 'DSGRUPORENDAFAT', 'DSINADIMPL', 'DSMOTIDEM', 'DSRESTR',
       'DSSITDCT', 'DSSITDTL', 'DSSPCOUTRASINST', 'DSTIPCTA', 'FLGCIDATU',
       'CDDSCNAE', 'CONTRATOS_PREJUIZO', 'DSTIPOVINCULACAO'], dtype=object)

Now looking at our labels, we have

In [7]:
labels_raw.columns.values

array(['QTDE_CONTAS', 'QTDE_PLANOCOTAS', 'QTDE_APLICACAO',
       'QTDE_LMTCHESPECIAL', 'QTDE_POUPPROG', 'QTDE_EMPRES_FINAN',
       'QTDE_LMTDSTCHEQ', 'QTDE_LMTDSTTIT', 'QTDE_ACESSINTERNET',
       'QTDE_LMTTRANSACAO', 'QTDE_CARTAOCRED', 'QTDE_DEBITOAUT',
       'QTDE_DDA', 'QTDE_COBBANCARIA', 'QTDE_SEGVIDA',
       'QTDE_SEGRESIDENCIA', 'QTDE_SEGAUTO', 'QTDE_CONSORCIO',
       'QTDE_CONVFOLHA', 'QTDE_RECEBFOLHA', 'QTDE_DOMICBANC',
       'QTDE_UTILCOBRANCA'], dtype=object)

But most of those features are categorical, so we dummified them. Let's take a look

In [8]:
features.columns.values

array(['DSINADIMPL', 'DSRESTR', 'DSSPCOUTRASINST', ...,
       'DSTIPOVINCULACAO_Baixíssima', 'DSTIPOVINCULACAO_Boa',
       'DSTIPOVINCULACAO_Média'], dtype=object)

In [9]:
# Number of final features
features.shape

(61777, 1458)
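
The jump to 1458 columns comes from dummification: every distinct value of a categorical column becomes its own 0/1 column. Assuming `clean_data` does something similar to `pandas.get_dummies` (its source is not shown), a toy version looks like this; the city codes are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "NMCIDADE": ["A", "B", "A", "C"],   # hypothetical city codes
    "FLGCIDATU": [1, 0, 1, 1],
})

# One indicator column per distinct city; FLGCIDATU is left untouched
encoded = pd.get_dummies(toy, columns=["NMCIDADE"])
print(list(encoded.columns))
```

A column such as `NMCIDADE`, with many distinct cities, therefore accounts for a large share of the final width.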

# Model the model
Now we need to create our model

Let's start by defining some helpful constants

In [10]:
# These columns are the ones dummified by our clean_data function
d_columns = ['COOP_NMRESCOP',
 'COOP_SKCOOPERATIVA',
 'COOP_NRENDCOP',
 'DSDRISGP',
 'NMCIDADE',
 'DSGRUPORENDAFAT',
 'DSMOTIDEM',
 'DSSITDCT',
 'DSSITDTL',
 'DSTIPCTA',
 'CDDSCNAE',
 'QTD_TOTAL_PROD',
 'DSTIPOVINCULACAO']

# Let's now select every dummified column
dummy_columns = []
for d in d_columns:
    for c in features.columns.values:
        if c.startswith(d):
            dummy_columns.append(c)

# And the remaining columns
not_dummy_columns = [i for i in features.columns.values if i not in dummy_columns]

In [11]:
not_dummy_columns

['DSINADIMPL', 'DSRESTR', 'DSSPCOUTRASINST', 'FLGCIDATU', 'CONTRATOS_PREJUIZO']

Since our dummy_columns list represents the columns created from the "dummyfication" method, they probably represent a very sparse matrix. We can test it out:

In [10]:
dummy_columns_df = features[dummy_columns].to_sparse(fill_value=0)
dummy_columns_df.density

0.00820828586659646

As we can see, it is pretty sparse.

Let's compare it against our `not_dummy_columns`

In [11]:
not_dummy_columns_df = features[not_dummy_columns].to_sparse(fill_value=0)
not_dummy_columns_df.density

0.9282937015394078

Not sparse at all.
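
`DataFrame.to_sparse` was removed in pandas 1.0, so the two density cells above only run on older versions. A minimal sketch of the same measurement on a recent pandas, using a toy frame and a hypothetical `density` helper, could be:

```python
import numpy as np
import pandas as pd


def density(frame, fill_value=0):
    """Fraction of cells that differ from fill_value."""
    return float((frame.to_numpy() != fill_value).mean())


toy = pd.DataFrame({"a": [0, 0, 1, 0], "b": [0, 0, 0, 0], "c": [1, 0, 0, 0]})

# Plain NumPy count of non-zero cells
toy_density = density(toy)

# Same figure through the pandas sparse accessor
sparse_density = toy.astype(pd.SparseDtype("int64", 0)).sparse.density

print(toy_density, sparse_density)
```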

Since the dataset is this heterogeneous, let's treat it that way and split the features into two groups: a sparse, dummified set and a dense, non-dummified set.

To exploit that, let's create a custom transformer that selects columns from the dataset

In [12]:
class ColumnsSelector(BaseEstimator, TransformerMixin):
    """Selects the given list of columns from the DataFrame passed in"""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, dataframe):
        return dataframe[self.columns]
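
To see how such a selector behaves, here is a small self-contained sketch on a toy frame. The class is repeated so the block runs on its own, and the column names `dense_a`, `dummy_x` and `dummy_y` are made up. It scales the dense group and passes the dummy group through, then concatenates both column-wise.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnsSelector(BaseEstimator, TransformerMixin):
    """Selects the given list of columns from a DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, dataframe):
        return dataframe[self.columns]


toy = pd.DataFrame({
    "dense_a": [1.0, 2.0, 3.0],
    "dummy_x": [0, 1, 0],
    "dummy_y": [1, 0, 0],
})

union = FeatureUnion([
    # Dense group: select, then standardise
    ("dense", Pipeline([
        ("selector", ColumnsSelector(["dense_a"])),
        ("scaler", StandardScaler()),
    ])),
    # Dummy group: select only, values stay 0/1
    ("dummy", ColumnsSelector(["dummy_x", "dummy_y"])),
])

combined = union.fit_transform(toy)
print(combined.shape)
```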

Now let's create some pipelines. We'll create some simple ones and then a big, complex one. Let's see how they compare

Let's start by creating some naive classifiers that take raw data

I also included some hyperparameter options for our grid search later

In [17]:
pipes = {
    "nb_pipe": {
        'pipe': Pipeline([
            ('classifier', BernoulliNB(alpha=1, fit_prior=True))
        ]),
        'params':
            {
                'classifier__alpha': [0, 0.5, 0.3, 0.7, 1, 1.5],
                'classifier__fit_prior': [True]}
    },
    
    "rf_pipe": {
        'pipe': Pipeline([
            ('classifier', RandomForestClassifier(criterion='entropy', n_estimators=100, oob_score=False))
        ]),
        'params':
        {
            'classifier__n_estimators': [5, 10, 20, 100],
            'classifier__criterion': ['gini', 'entropy'],
            'classifier__oob_score': [True, False]}
    },
    
    "sv_pipe": {
        'pipe': Pipeline([
            ('classifier', LinearSVC(C=1, fit_intercept=True, intercept_scaling=1e-05, tol=0.01))
        ]),
        'params':
        {
            'classifier__tol': [1e-4, 1e-2],
            'classifier__C': [1, 5, 100],
            'classifier__fit_intercept': [True, False],
            'classifier__intercept_scaling': [1, 0.1, 1e-5]}
    },

    "lr_pipe": {
        'pipe': Pipeline([
            ('classifier', LogisticRegression(C=1, fit_intercept=True, intercept_scaling=1, solver='saga', tol=0.01))
        ]),
        'params':
        {
            'classifier__tol': [1e-4, 1e-2],
            'classifier__C': [1, 1e-5, 5, 100],
            'classifier__fit_intercept': [True, False],
            'classifier__intercept_scaling': [1, 0.1, 1e-5],
            'classifier__solver': ['liblinear', 'saga', 'sag']}
    },

    "mlp_pipe": {
        'pipe': Pipeline([
            ('classifier', MLPClassifier(early_stopping=True, hidden_layer_sizes=(40, 30, 5), learning_rate='adaptive'))
        ]),
        'params':
        {
            'classifier__hidden_layer_sizes': [(40, 30, 5), (40, 10, 5), (40, 20, 5), (40, 20, 10)]}
    }
}

And try them out

In [18]:
for k, v in pipes.items():
    print()
    print("Testing", k)
    %time print("Score:", cross_val_score(v['pipe'], features, labels['QTDE_EMPRES_FINAN'], cv=10, n_jobs=-1).mean())
    


Testing nb_pipe
Score: 0.739013414489
CPU times: user 1.76 s, sys: 141 ms, total: 1.9 s
Wall time: 23.9 s

Testing lr_pipe
Score: 0.810253381112
CPU times: user 1.79 s, sys: 148 ms, total: 1.94 s
Wall time: 38.7 s

Testing sv_pipe
Score: 0.809508863862
CPU times: user 1.77 s, sys: 139 ms, total: 1.91 s
Wall time: 15 s

Testing rf_pipe
Score: 0.797546086369
CPU times: user 2.06 s, sys: 183 ms, total: 2.24 s
Wall time: 4min 24s

Testing mlp_pipe
Score: 0.811856035393
CPU times: user 1.86 s, sys: 174 ms, total: 2.03 s
Wall time: 55.2 s


We see that the best ones are the Multi-layer Perceptron (0.812), Logistic Regression (0.810) and the linear Support Vector classifier (0.810), in that order, although all three are less than 0.003 apart
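
One caveat when reading these numbers: `cross_val_score` reports accuracy by default for classifiers, and accuracy can look high simply because one class dominates. A possible sanity check, sketched here on synthetic data rather than on the notebook's dataset, is to compare against a majority-class baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced toy problem: the label depends on the first feature only
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0.5).astype(int)

# Always predicts the most frequent class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
model = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print("baseline:", baseline, "model:", model)
```

A model is only informative to the extent that it beats this baseline.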

We can do a grid search to find the best parameters. I've done it already and filled out the best ones above, but you can try the code for yourself if you want

In [None]:
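# The search only runs when disable_grid_search is set to False
if not disable_grid_search:
    # Assumption: the cell that defined X_train, y_train and results is not shown,
    # so a simple hold-out split on one product and an empty result cache are used here
    X_train, X_test, y_train, y_test = train_test_split(features, labels['QTDE_EMPRES_FINAN'])
    results = {}
    for k, v in pipes.items():
        if k not in results:
            print("Fitting {}".format(k))
            clf = GridSearchCV(v['pipe'], v['params'], cv=3, n_jobs=1, verbose=10)
            %time clf.fit(X_train, y_train)
            print("{} best score: {}\nbest params: {}".format(k, clf.best_score_, clf.best_params_))
            results[k] = pd.DataFrame(clf.cv_results_)
            print()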
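
The `classifier__alpha` style of key used in the parameter grids follows scikit-learn's `<step name>__<parameter>` convention for pipelines. A small self-contained sketch on random toy data (not the notebook's features) shows the mechanics:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Toy binary features; the target simply copies the first column
X = rng.integers(0, 2, size=(120, 5))
y = X[:, 0]

pipe = Pipeline([("classifier", BernoulliNB())])

# "classifier" is the step name, "alpha" the BernoulliNB parameter
grid = {"classifier__alpha": [0.3, 0.5, 1.0]}

search = GridSearchCV(pipe, grid, cv=3).fit(X, y)
best_alpha = search.best_params_["classifier__alpha"]
print(best_alpha, search.best_score_)
```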

Now let's take those and add some feature engineering, such as scaling, selection and dimensionality reduction

We'll be using FeatureUnion to append engineered features

Finally, we'll wrap everything using an ensemble method

In [19]:
pipeline = Pipeline([
    # Use FeatureUnion to combine the processed feature groups
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for scaling and preprocessing not dummified features
            ('not_dummy', Pipeline([
                ('selector', ColumnsSelector(columns=not_dummy_columns)),
                ('scaler', StandardScaler()),
            ])),

            # Pipeline for scaling and preprocessing dummified features
            ('dummy', Pipeline([
                ('selector',  ColumnsSelector(columns=dummy_columns)),
                ('scaler', StandardScaler(with_mean=False, with_std=False)),
            ])),
        ],

        # weight components in FeatureUnion
        transformer_weights={
            'not_dummy': 1,
            'dummy': 1,
        },
    )),
    # Apply a voting classifier as ensemble method with 3 classifiers    
#     ('lr', LogisticRegression(C=1, fit_intercept=True, intercept_scaling=1, solver='saga', tol=0.01))
    ('ensemble', VotingClassifier(
        estimators=[
            ('lr', LogisticRegression(C=1, fit_intercept=True, intercept_scaling=1, solver='saga', tol=0.01)),
            ('mlp', MLPClassifier(early_stopping=True, hidden_layer_sizes=(40, 30, 5), learning_rate='adaptive')),
            ('lsv', LinearSVC(C=1, fit_intercept=True, intercept_scaling=1e-05, tol=0.01)),
        ],
        voting='hard'
    ))
])

That should be robust enough for our purposes.
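
With `voting='hard'`, each classifier casts one vote per sample and the majority label wins. This also suits `LinearSVC`, which has no `predict_proba` and so could not take part in soft voting. The rule itself can be sketched in a few lines of NumPy; the `hard_vote` helper and the example predictions are hypothetical.

```python
import numpy as np


def hard_vote(predictions):
    """Majority label per sample; predictions has shape (n_models, n_samples)."""
    predictions = np.asarray(predictions)
    voted = []
    for column in predictions.T:
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(voted)


votes = hard_vote([
    [1, 0, 1, 0],   # e.g. logistic regression
    [1, 1, 0, 0],   # e.g. multi-layer perceptron
    [0, 1, 1, 0],   # e.g. linear SVM
])
print(votes)
```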

We also should do some more feature engineering and filtering, but let's leave that for another day

# Evaluate it

To check our pipeline, we should run it against all of our products (this may take a while)

In [20]:
for product in labels.columns.values:
    print("Evaluating", product)
    label = labels[product]
    print()
    %time print("Score:", cross_val_score(pipeline, features, label, cv=10, n_jobs=-1).mean())
    print()

Evaluating QTDE_PLANOCOTAS

Score: 0.938731382607
CPU times: user 1.87 s, sys: 259 ms, total: 2.13 s
Wall time: 2min 9s

Evaluating QTDE_APLICACAO

Score: 0.744160937884
CPU times: user 1.87 s, sys: 158 ms, total: 2.03 s
Wall time: 2min 23s

Evaluating QTDE_LMTCHESPECIAL

Score: 0.998397450554
CPU times: user 1.86 s, sys: 146 ms, total: 2.01 s
Wall time: 1min 51s

Evaluating QTDE_POUPPROG

Score: 0.795603955405
CPU times: user 1.87 s, sys: 150 ms, total: 2.02 s
Wall time: 2min 29s

Evaluating QTDE_EMPRES_FINAN

Score: 0.811758877271
CPU times: user 1.89 s, sys: 153 ms, total: 2.04 s
Wall time: 2min 28s

Evaluating QTDE_LMTDSTCHEQ

Score: 0.85127157221
CPU times: user 1.9 s, sys: 156 ms, total: 2.06 s
Wall time: 2min 57s

Evaluating QTDE_LMTDSTTIT

Score: 0.9126698052
CPU times: user 1.88 s, sys: 155 ms, total: 2.04 s
Wall time: 2min 39s

Evaluating QTDE_ACESSINTERNET

Score: 0.950062539175
CPU times: user 1.86 s, sys: 132 ms, total: 1.99 s
Wall time: 2min 14s

Evaluating QTDE_LMTTRANSA