## ITA 2021

Data dictionary:

* n: number of agents
* p: fraction of traders
* f: traders' degree of interest
* x, y, z: dimensions of the airspace
* a1, a2: mean and standard deviation of the fundamental price coefficient for consumers
* a3, a4: same for traders
* b1, b2: mean and standard deviation of the market price coefficient for consumers
* b3, b4: same for traders
* c1, c2: mean and standard deviation of the random price coefficient for consumers
* c3, c4: same for traders
* g1, g2: mean and standard deviation of the degree of aggressiveness of consumers
* l1, l2: mean and standard deviation of the devaluation coefficient for consumers
* l3, l4: same for traders
* e1, e2: variability in the fundamental price of consumers and traders, respectively
* cent_price_cor: correlation between the final price and the centrality of the flight permits
* cent_trans_cor: correlation between the number of transactions and the centrality of the flight permits

In [1]:
# Import basic tools
import pandas                  as pd
import matplotlib.pyplot       as plt
import numpy                   as np
import os
from   datetime            import datetime

In [2]:
# Import preprocessing tools
from sklearn.decomposition    import PCA
from sklearn.preprocessing    import StandardScaler
from sklearn.pipeline         import make_pipeline, Pipeline

In [3]:
# Import modelling tools
from sklearn.svm              import SVR
from xgboost                  import XGBRegressor
from sklearn.model_selection  import train_test_split
from sklearn.model_selection  import GridSearchCV, RandomizedSearchCV
from sklearn.metrics          import accuracy_score, mean_absolute_error
from sklearn.linear_model     import LinearRegression, LogisticRegression, Lasso
from sklearn.base             import BaseEstimator

In [4]:
# Load the data
train = pd.read_csv('./../Dados/train.csv')
test = pd.read_csv('./../Dados/test.csv')

In [5]:
# Create features
dataframes = [train, test]

for df in dataframes:
    df['volume']  = df.x * df.y * df.z
    df['densidade'] = df.volume / df.n

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.isna().sum()/len(train)

In [None]:
test.isna().sum()/len(test)

In [None]:
train.cent_price_cor.describe()

In [None]:
train.cent_trans_cor.describe()

In [None]:
train.corr()["cent_price_cor"].abs().sort_values(ascending = True)

In [None]:
train.corr()["cent_trans_cor"].abs().sort_values(ascending = True)

In [None]:
X = train.drop(columns = ['cent_price_cor', 'cent_trans_cor'], axis = 1)
y_1 = train.cent_price_cor

X_train, X_test, y_1_train, y_1_test = train_test_split(X,y_1,
                                                    test_size = 0.25,
                                                    random_state = 0)


#regr = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
#regr.fit(X, y)

pipe_1 = Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svr', SVR(epsilon=0.2))])

pipe_1.fit(X_train,y_1_train)

#pipe.score(X_test, y_test)

In [None]:
y_1_pred = pipe_1.predict(X_test)

In [None]:
mean_absolute_error(y_1_test, y_1_pred)

In [None]:
X = train.drop(columns = ['cent_price_cor', 'cent_trans_cor'], axis = 1)
y_2 = train.cent_trans_cor

X_train, X_test, y_2_train, y_2_test = train_test_split(X,y_2,
                                                    test_size = 0.25,
                                                    random_state = 0)


#regr = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
#regr.fit(X, y)

pipe_2 = Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svr', SVR(epsilon=0.2))])

pipe_2.fit(X_train,y_2_train)

y_2_pred = pipe_2.predict(X_test)

mean_absolute_error(y_2_test, y_2_pred)
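
An MAE value is easier to judge next to a trivial baseline. The sketch below runs on synthetic data, not the ITA set: `X_demo`, `y_demo` and the tanh-shaped target are assumptions made only for illustration. It compares the scaled SVR pipeline with a `DummyRegressor` that always predicts the training median.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: a bounded target in roughly [-1, 1], like a correlation
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = np.tanh(X_demo[:, 0] - 0.5 * X_demo[:, 1]) + rng.normal(scale=0.05, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# Baseline: always predict the median of the training target
baseline = DummyRegressor(strategy='median').fit(X_tr, y_tr)

# Same pipeline shape as in the cells above
svr_pipe = make_pipeline(StandardScaler(), SVR(epsilon=0.2)).fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_svr = mean_absolute_error(y_te, svr_pipe.predict(X_te))
print(mae_baseline, mae_svr)
```

If a model's MAE is not clearly below the baseline's, the features are adding little for that target.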

## PCA

In [None]:
X = train.drop(columns = ['cent_price_cor', 'cent_trans_cor'], axis = 1)

scaler = StandardScaler()

transf_X_train = scaler.fit_transform(X_train)
# Reuse the statistics learned on the training split; do not refit on test
transf_X_test = scaler.transform(X_test)
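
Scaling statistics are normally learned from the training split only and then reused on the test split, so that no information from the test rows leaks into preprocessing. A minimal sketch on synthetic data (`X_demo` is a hypothetical stand-in for the feature matrix):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix, not the ITA data
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean and std learned from train only
X_te_scaled = scaler_demo.transform(X_te)      # the same mean and std applied to test

# Train columns are centred exactly; test columns are only approximately centred
train_means = X_tr_scaled.mean(axis=0)
test_means = X_te_scaled.mean(axis=0)
print(train_means.round(6), test_means.round(3))
```

The small non-zero test means are expected: the test split is expressed in the training split's units, which is what a fitted model will see at prediction time.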

In [None]:
y_price = train.cent_price_cor
y_trans = train.cent_trans_cor

X_train, X_test, y_price_train, y_price_test = train_test_split(X,y_price,
                                                    test_size = 0.25,
                                                    random_state = 0)

X_train, X_test, y_trans_train, y_trans_test = train_test_split(X,y_trans,
                                                    test_size = 0.25,
                                                    random_state = 0)

In [None]:
models = {'Linear Regression': LinearRegression(n_jobs = -1),
          'SVR': SVR(epsilon=0.2),
          'Lasso': Lasso(),
          'XGBoostRegressor': XGBRegressor()}

In [None]:
def fit_score_PCA(models,X_train,y_train,X_test,y_test,components):

    # Accept a single value (e.g. 0.95) as well as a list of values
    if not isinstance(components, (list, tuple)):
        components = [components]

    # Make a dict to keep model scores
    model_scores = {}
    
    for i in components:
        
        # A float in (0, 1) keeps enough components to explain that share of
        # the variance; an int keeps exactly that many components
        pca = PCA(n_components = i)
        X_train_PCA = pca.fit_transform(X_train)
        X_test_PCA = pca.transform(X_test)

        # Loop through models
        for name, model in models.items():

            # Fit the model to the data
            model.fit(X_train_PCA,y_train)
        
            y_pred = model.predict(X_test_PCA)

            # Evaluate the model and store its MAE in model_scores
            model_scores[name + '_' + str(i)] = mean_absolute_error(y_test, y_pred)

    return model_scores
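
The same comparison can be arranged with a `Pipeline`, so that scaling and PCA are always fitted on the training split together with the model. The sketch below is one possible arrangement, not the notebook's method: it uses synthetic data from `make_regression`, the helper name `score_with_pca` is hypothetical, and XGBoost is left out to keep the example dependent on scikit-learn only. `clone` gives every fit a fresh, unfitted copy of the estimator.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical regression data, not the ITA set
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

demo_models = {'Linear Regression': LinearRegression(),
               'Lasso': Lasso(),
               'SVR': SVR(epsilon=0.2)}

def score_with_pca(models, X_tr, y_tr, X_te, y_te, components):
    """Return the test MAE for every (model, n_components) pair."""
    scores = {}
    for n in components:
        for name, model in models.items():
            # Scaling and PCA are fitted on the training split inside the pipeline
            pipe = make_pipeline(StandardScaler(), PCA(n_components=n), clone(model))
            pipe.fit(X_tr, y_tr)
            scores[f'{name}_{n}'] = mean_absolute_error(y_te, pipe.predict(X_te))
    return scores

demo_scores = score_with_pca(demo_models, X_tr, y_tr, X_te, y_te, [0.8, 0.95])

# Sort from best (lowest MAE) to worst
for key, value in sorted(demo_scores.items(), key=lambda item: item[1]):
    print(key, round(value, 3))
```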

In [None]:
model_scores_trans = fit_score_PCA(models,X_train,y_trans_train,X_test,y_trans_test, 0.95)
model_scores_price = fit_score_PCA(models,X_train,y_price_train,X_test,y_price_test, 0.95)

In [None]:
model_scores_trans

In [None]:
model_scores_price

In [None]:
# Best scores for 0.95 (without scaling)
0.0941312117033256 + 0.090252152275057

In [None]:
model_scores_trans = fit_score_PCA(models,transf_X_train,y_trans_train,transf_X_test,y_trans_test, 1)
model_scores_price = fit_score_PCA(models,transf_X_train,y_price_train,transf_X_test,y_price_test, 1)

In [None]:
model_scores_trans

In [None]:
model_scores_price

In [None]:
# Best scores for 0.95 (with scaling)
0.0941305091243686 + 0.09025552474334281

In [None]:
pca = PCA(n_components = 0.95)
X_train_PCA = pca.fit_transform(X_train)
X_train_new = pca.inverse_transform(X_train_PCA)

In [None]:
X_train_PCA.shape

In [None]:
X_train_new

In [None]:
model_scores_trans = fit_score_PCA(models,transf_X_train,y_trans_train,transf_X_test,y_trans_test, 1)
model_scores_price = fit_score_PCA(models,transf_X_train,y_price_train,transf_X_test,y_price_test, 1)

In [None]:
model_scores_trans

In [None]:
model_scores_price

In [None]:
pca.explained_variance_ratio_
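
The `n_components` argument of `PCA` behaves differently for floats and integers: a float in (0, 1) keeps the smallest number of components whose cumulative explained variance reaches that share, while an integer keeps exactly that many components. A minimal sketch on synthetic data (the array `X_demo` and its column scales are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: five independent columns with decreasing spread
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5)) * np.array([10.0, 4.0, 1.0, 0.5, 0.2])

pca_ratio = PCA(n_components=0.95).fit(X_demo)  # float: variance share to retain
pca_one = PCA(n_components=1).fit(X_demo)       # int: exactly one component

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_ratio.explained_variance_ratio_)

print(pca_ratio.n_components_, cumulative)
print(pca_one.n_components_, pca_one.explained_variance_ratio_)
```

Passing the integer `1` therefore does not mean "keep 100% of the variance"; it keeps a single principal component.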

In [None]:
model_scores_trans = fit_score_PCA(models,transf_X_train,y_trans_train,transf_X_test,y_trans_test, [0.91,0.95,1])
model_scores_price = fit_score_PCA(models,transf_X_train,y_price_train,transf_X_test,y_price_test, [0.91,0.95,1])

In [None]:
#'Linear Regression_0.95': 0.09020619481829613

sorted(model_scores_trans, key = model_scores_trans.get)

In [None]:
# 'Lasso_0.91': 0.09411110306791731,
sorted(model_scores_price, key = model_scores_price.get)

In [None]:
model_scores_trans = fit_score_PCA(models,transf_X_train,y_trans_train,transf_X_test,y_trans_test, [0.8,0.85,0.9,0.95])
model_scores_price = fit_score_PCA(models,transf_X_train,y_price_train,transf_X_test,y_price_test, [0.8,0.85,0.9,0.95])

In [None]:
{k: v for k, v in sorted(model_scores_price.items(), key=lambda item: item[1])}

In [None]:
{k: v for k, v in sorted(model_scores_trans.items(), key=lambda item: item[1])}

## GridSearch

In [7]:
X = train.drop(columns = ['cent_price_cor', 'cent_trans_cor'], axis = 1)

y_price = train.cent_price_cor
y_trans = train.cent_trans_cor

X_train, X_test, y_price_train, y_price_test = train_test_split(X,y_price,
                                                    test_size = 0.25,
                                                    random_state = 0)

X_train, X_test, y_trans_train, y_trans_test = train_test_split(X,y_trans,
                                                    test_size = 0.25,
                                                    random_state = 0)

In [8]:
params_grid = [

# Linear Regression
# (`normalize` was removed from LinearRegression in scikit-learn 1.2;
#  drop that key on newer versions)
{'normalize': [True, False],
'fit_intercept': [True, False]},
    
# SVR RBF (`degree` and `coef0` are ignored by the rbf kernel)
{'kernel': ['rbf'],
'C':[0.1, 0.5, 1, 5, 10],
'degree': [3,8],
'coef0': [0.01,10,0.5],
'gamma': ('auto','scale'),
'epsilon': [0.1,0.2]},
    
# SVR POLY
{'kernel': ['poly'],
'C':[0.1, 0.5, 1, 5, 10],
'degree': [3,8],
'coef0': [0.01,10,0.5],
'gamma': ('auto','scale'),
'epsilon': [0.1,0.2]},
    
# Lasso
# (`fit_alpha` is not a Lasso parameter, so it is commented out here)
{'alpha':[0.02, 0.024, 0.025, 0.026, 0.03],
# 'fit_alpha':[0.005, 0.02, 0.03, 0.05, 0.06],
},
    
# XGBoost
{'nthread':[4], # with hyperthreading, xgboost may become slower
'objective':['reg:squarederror'], # 'reg:linear' is the deprecated name
'learning_rate': [.03, 0.05, .07], # so called `eta` value
'max_depth': [5, 6, 7],
'min_child_weight': [4],
'verbosity': [0], # replaces the deprecated `silent` flag
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [500]}]

In [11]:
def prever(X_train, X_test, y_train, y_test):
    
    lista_scores = []
    lista_params = []
    lista_PCA = []
    lista_model = []
    
    components = [0.8,0.85,0.9,0.95]
    
    # One estimator per entry of params_grid, in the same order
    models = [LinearRegression(),
              SVR(),
              SVR(),
              Lasso(),
              XGBRegressor()]
            
    for i, model in enumerate(models):
                    
        for n in components:

            pca = PCA(n_components = n)
            X_train_PCA = pca.fit_transform(X_train)
            X_test_PCA = pca.transform(X_test)

            clf = GridSearchCV(model, param_grid = params_grid[i],
                               scoring = 'neg_mean_absolute_error', # the requested metric
                               n_jobs=2, refit=True, cv=5, verbose=True,
                               pre_dispatch='2*n_jobs', error_score='raise', 
                               return_train_score=True)

            # Search on the PCA-reduced features
            clf.fit(X_train_PCA, y_train)

            # refit=True: best_estimator_ is already refit on the full training split
            y_pred = clf.best_estimator_.predict(X_test_PCA)
            score = mean_absolute_error(y_test, y_pred)
            
            lista_model.append(type(model).__name__)
            lista_params.append(clf.best_params_)
            lista_scores.append(score)
            lista_PCA.append(n)

    df_scores = pd.DataFrame({'params': lista_params,
                              'Model': lista_model,
                              'PCA': lista_PCA,
                              'MAE': lista_scores})

    # Make sure the output folder exists before writing
    os.makedirs("./Resultados", exist_ok=True)
    df_scores.to_csv("./Resultados/scores"+"{}.csv".format(datetime.now().strftime("%d-%m-%Y_%Hh%Mm%Ss")))
            
    return df_scores
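
Another possible arrangement is to place scaling, PCA and the model in a single `Pipeline` and let `GridSearchCV` tune the number of components together with the model's hyperparameters. PCA is then refitted inside every cross-validation fold. The sketch below is an illustration on synthetic data with a single model; the step names `scaler`, `pca` and `model` and the grid values in `demo_grid` are assumptions, not values from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical regression data, not the ITA set
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('pca', PCA()),
                       ('model', Lasso())])

# Parameters are addressed as <step name>__<parameter name>
demo_grid = {'pca__n_components': [0.8, 0.9, 0.95],
             'model__alpha': [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, param_grid=demo_grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X_tr, y_tr)

# With the default refit=True, `search` predicts with the best pipeline
test_mae = mean_absolute_error(y_te, search.predict(X_te))
print(search.best_params_, -search.best_score_, test_mae)
```

`best_score_` is the negated MAE averaged over the folds, which is why its sign is flipped before printing.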

In [12]:
prever(X_train, X_test, y_price_train, y_price_test)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  20 out of  20 | elapsed:    0.4s finished


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  20 out of  20 | elapsed:    0.3s finished


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  20 out of  20 | elapsed:    0.4s finished


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  20 out of  20 | elapsed:    0.3s finished


Fitting 5 folds for each of 120 candidates, totalling 600 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


KeyboardInterrupt: 