# Case Study: Santander Value Prediction

Help Santander identify the value of transactions for each potential customer. This is a first step that Santander needs to get right in order to personalize its services at scale.
According to a survey by Epsilon, 80% of customers are more likely to do business with a company again if it delivers a personalized service.

<br>

## Link to the data and the challenge:

https://www.kaggle.com/c/santander-value-prediction-challenge/data

The case can be broken down into the following 6 parts:

    Identify the problem
        What type of problem is it (classification, regression, clustering)?
    Is there a need to apply transformations?
        E.g. imputing null values, encoding string columns, etc.
    Split the training and test sets
    Baseline
        Find a baseline, a first model to serve as a reference
    Choose the metric
    Improve the result
        Feature engineering, model optimization, hyperparameters, etc.
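
For the "choose the metric" step, it helps to see what a log-based error measures. The target values in this dataset span several orders of magnitude, and to my knowledge the Kaggle challenge is scored with root mean squared logarithmic error (RMSLE). The sketch below is a plain NumPy version of that metric; the helper name `rmsle` and the example values are assumptions made for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which stays defined when x is 0
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A prediction that is 10x too large costs about the same at either scale,
# because the metric compares ratios rather than absolute differences.
small = rmsle([1_000.0], [10_000.0])
large = rmsle([1_000_000.0], [10_000_000.0])
print(small, large)
```

Both values come out close to `log(10)`, which is why this kind of metric is less dominated by the largest targets than MAE is.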



In [3]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_log_error
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

In [17]:
df = pd.read_csv('train.csv')
display(df)

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4454,ff85154c8,1065000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4455,ffb6b3f4f,48000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,80000.0,0,0,0,0,0,0,0
4456,ffcf61eb6,2800000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4457,ffea67e98,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [18]:
# Split the dataset into 80% for training and 20% for testing, keeping only the feature columns
y = df['target']
X = df.drop(['ID', 'target'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [10]:
# Check the MAE without using SelectKBest
reg = LinearRegression().fit(X_train, y_train)

y_train_pred = reg.predict(X_train)
y_pred = reg.predict(X_test)
print('Model without selecting the K best features')
print(mean_absolute_error(y_train, y_train_pred))  # MAE on the training set
print(mean_absolute_error(y_test, y_pred))  # MAE on the test set

Model without selecting the K best features
13368.388182472147
241887457916084.44
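
The two numbers above are the training and test MAE. The gap between them suggests the unregularised linear model fits the training rows almost exactly and generalises very poorly, so it is a weak reference point. A simpler baseline that is sometimes used is a constant prediction. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data; `X_demo`, `y_demo` and the lognormal target are assumptions for illustration, not the Santander data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic, heavily skewed target standing in for transaction values (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.lognormal(mean=14, sigma=1.5, size=100)

# Always predict the training median, ignoring the features entirely
dummy = DummyRegressor(strategy="median").fit(X_demo, y_demo)
baseline_mae = mean_absolute_error(y_demo, dummy.predict(X_demo))

# For comparison: always predicting the mean
mean_mae = mean_absolute_error(y_demo, np.full_like(y_demo, y_demo.mean()))
print(baseline_mae, mean_mae)
```

Any real model should beat this constant baseline on the test set before it is worth tuning.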


In [11]:
# Use SelectKBest to keep the 45 features with the highest f_regression scores
sel_kbest = SelectKBest(f_regression, k=45).fit(X_train, y_train)

X_train_sel = sel_kbest.transform(X_train)
X_test_sel = sel_kbest.transform(X_test)

print('Shape of X_train')
print(X_train_sel.shape)

Shape of X_train
(3567, 45)


  correlation_coefficient /= X_norms
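
The `correlation_coefficient /= X_norms` line appears to be the tail of a NumPy runtime warning raised inside `f_regression`. One common cause is a constant column, whose centred norm is zero. Assuming that is what happens here, a possible fix is to drop zero-variance columns before scoring. The sketch below uses synthetic data; the column names and `X_demo`/`y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 3 varying columns plus 2 constant ones (assumption)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
X_demo["const_zero"] = 0.0
X_demo["const_one"] = 1.0
y_demo = 3.0 * X_demo["a"] + rng.normal(scale=0.1, size=100)

# VarianceThreshold() with its default threshold of 0.0 removes zero-variance
# columns, so f_regression is never asked to score a constant feature.
selector = make_pipeline(VarianceThreshold(), SelectKBest(f_regression, k=2))
X_demo_sel = selector.fit_transform(X_demo, y_demo)

kept = selector.named_steps["variancethreshold"].get_support()
print(X_demo_sel.shape)
print(kept)
```

Fitting the pipeline on the training split only, then calling `selector.transform` on the test split, keeps the selection free of test-set information.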


In [12]:
# Check the MAE using SelectKBest
reg = LinearRegression().fit(X_train_sel, y_train)
y_train_pred = reg.predict(X_train_sel)
y_pred = reg.predict(X_test_sel)

print('Model using the K best features')
print(mean_absolute_error(y_train, y_train_pred))  # MAE on the training set
print(mean_absolute_error(y_test, y_pred))  # MAE on the test set

Model using the K best features
5322548.588998505
5230263.099091365


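In [13]:
# mean_squared_log_error rejects negative values, so score only the rows with positive predictions.
# Boolean masks are used instead of overwriting y_train and y_test, which later cells still need.
y_train_pred = reg.predict(X_train_sel)
y_pred = reg.predict(X_test_sel)
train_mask = y_train_pred > 0
test_mask = y_pred > 0

print(mean_squared_log_error(y_train[train_mask], y_train_pred[train_mask]))
print(mean_squared_log_error(y_test[test_mask], y_pred[test_mask]))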

3.911277869992664
3.7452270716462417
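
Dropping rows with non-positive predictions makes the score optimistic, because the worst predictions are excluded. One alternative, sketched here under the assumption that the target is strictly positive and skewed, is to train on `log1p(target)` and clip predictions at zero so every row is scored. The data below is synthetic and `X_demo`/`y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

# Synthetic skewed, strictly positive target (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = np.exp(10 + X_demo[:, 0] + 0.1 * rng.normal(size=200))

# Fit the linear model on log1p(y) and map predictions back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_demo, y_demo)

# expm1 can still return values slightly below zero, so clip before scoring
y_demo_pred = np.clip(model.predict(X_demo), 0, None)
rmsle = np.sqrt(mean_squared_log_error(y_demo, y_demo_pred))
print(rmsle)
```

Taking the square root turns scikit-learn's MSLE into RMSLE, which is on the same scale as the log of the target.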


In [20]:
# Improve the result
sel_kbest = SelectKBest(f_regression, k=45).fit(X_train, y_train)
X_train_sel = sel_kbest.transform(X_train)
X_test_sel = sel_kbest.transform(X_test)

regr = RandomForestRegressor(max_depth=6, random_state=0)
regr.fit(X_train_sel, y_train)

  correlation_coefficient /= X_norms


RandomForestRegressor(max_depth=6, random_state=0)
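
The notebook stops after fitting the forest, and `GridSearchCV` is imported but never used. A possible next step is to search a small hyperparameter grid and then report train and test MAE. The sketch below runs on synthetic data; the grid values, `cv=3`, and the names `X_demo`, `y_demo`, `X_tr`, `X_te` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic non-negative target with a non-linear dependence on the features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
y_demo = np.abs(5.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# Small illustrative grid; scoring is negated MAE because GridSearchCV maximises
param_grid = {"max_depth": [3, 6], "n_estimators": [20, 50]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, search.predict(X_tr))
test_mae = mean_absolute_error(y_te, search.predict(X_te))
print(search.best_params_)
print(train_mae, test_mae)
```

In the notebook itself, `X_train_sel`, `y_train`, `X_test_sel` and `y_test` would take the place of the synthetic arrays.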