# <b> Adult remuneration

Using the Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult), we train a Random Forest classifier to predict whether an adult earns more than 50k per year or not.

SOURCES:
- Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Michael Greenacre, Jörg Blasius (2006). Multiple Correspondence Analysis and Related Methods, CRC Press. ISBN 1584886285.

##### Team members #####
    Alex Lan                                
    Amanda Maria Martins Funabashi                  
    Samyr Abrahão Moises                            
    Waldyr Lourenço de Freitas Junior               

## <b> Road Map
1. Full Dataset
2. Multiple Correspondence Analysis (MCA)

In [1]:
import pandas as pd
import numpy as np
import time

from sklearn.ensemble import RandomForestClassifier

#### k-folds Method

In [2]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [11]:
def kFold(ds_orig, k):
    # ANSI escape codes used to print the header in bold
    bold = "\033[1m"
    reset = "\033[0;0m"
    print(bold + f'Número de K: {k}' + reset)
    print('\n')
    data_dict = {}

    # Split the original dataset into two datasets, one per class (the label is the last column)
    label = ds_orig.iloc[:, -1]
    classes = label.unique()
    ds_class1 = ds_orig[label == classes[0]]
    ds_class2 = ds_orig[label == classes[1]]

    # Integer number of instances of each class per fold
    n_inst1 = len(ds_class1) // k
    n_inst2 = len(ds_class2) // k

    # Remainder of the divisions above, spread over the first folds so that
    # fold sizes differ by at most one instance per class
    n_inst_err1 = len(ds_class1) % k
    n_inst_err2 = len(ds_class2) % k

    # Build the folds
    range_fim1 = 0
    range_fim2 = 0
    for itr in range(k):

        # Row range of this fold within the class 1 dataset
        if n_inst_err1 != 0:
            range_ini1 = range_fim1
            range_fim1 = range_fim1 + n_inst1 + 1
            n_inst_err1 = n_inst_err1 - 1
        else:
            range_ini1 = range_fim1
            range_fim1 = range_fim1 + n_inst1

        # Row range of this fold within the class 2 dataset
        if n_inst_err2 != 0:
            range_ini2 = range_fim2
            range_fim2 = range_fim2 + n_inst2 + 1
            n_inst_err2 = n_inst_err2 - 1
        else:
            range_ini2 = range_fim2
            range_fim2 = range_fim2 + n_inst2

        # Slice each class dataset according to the ranges above
        data_temp1 = ds_class1.iloc[range_ini1:range_fim1]
        data_temp2 = ds_class2.iloc[range_ini2:range_fim2]

        # Concatenate the slices of both classes into a single fold
        data_write = pd.concat([data_temp1, data_temp2])

        # Write the fold to disk (temporarily commented out, as it belongs to another activity)
        #filename = 'subDataSet' + str(itr) + '.csv'
        #data_write.to_csv(f'C:\\{filename}',header=False)

        # Store the fold in a dictionary keyed by fold index
        data_dict[itr] = data_write
    return data_dict
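
The same per-class partitioning can be sketched more compactly with `np.array_split`, which also gives the first `n % k` chunks one extra element. This is an illustrative equivalent and not the notebook's code: `stratified_folds` and the toy frame are hypothetical names, and unlike `kFold` the sketch accepts any number of classes.

```python
import numpy as np
import pandas as pd

def stratified_folds(df, k):
    """Split df into k folds, keeping the class proportions of the last column."""
    label = df.columns[-1]
    parts = {i: [] for i in range(k)}
    # sort=False keeps the classes in order of first appearance, like unique()
    for _, group in df.groupby(label, sort=False):
        # Split the row positions of this class into k nearly equal chunks
        for i, idx in enumerate(np.array_split(np.arange(len(group)), k)):
            parts[i].append(group.iloc[idx])
    return {i: pd.concat(chunks) for i, chunks in parts.items()}

# Toy data (assumption): 7 rows of class 0 and 5 rows of class 1
toy = pd.DataFrame({"x": range(12), "sal": [0] * 7 + [1] * 5})
folds = stratified_folds(toy, 3)
sizes = [len(folds[i]) for i in range(3)]
print(sizes)
```

As in `kFold`, the rows are not shuffled, so each fold holds a contiguous slice of each class.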
        


In [4]:
def classification(data_dict, classifier):
    # Each fold is used once as the test set; the remaining folds are
    # concatenated to form the training set
    start_time = time.time()
    precisao = 0
    recall = 0
    acuracia = 0
    k = len(data_dict)
    for itr2 in range(k):
        test = data_dict[itr2]

        train = pd.concat([data_dict[itr3] for itr3 in range(k) if itr3 != itr2])
        # Fit the classifier and predict the test fold (the label is the last column)
        classifier.fit(train.iloc[:,:-1], train.iloc[:,-1])
        y_pred = classifier.predict(test.iloc[:,:-1])

        # Accumulate precision, recall and accuracy over the folds
        precisao = precisao + precision_score(test.iloc[:,-1], y_pred)
        recall = recall + recall_score(test.iloc[:,-1], y_pred)
        acuracia = acuracia + accuracy_score(test.iloc[:,-1], y_pred)

    # Report and return the averages over the k folds
    print("Precision:", round(precisao/k,5))
    print("Recall:", round(recall/k,5))
    print("Accuracy:", round(acuracia/k,5))
    print("--- %s time ---" % (time.time() - start_time))
    return precisao/k, recall/k, acuracia/k
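
For comparison, the same three averaged metrics can be obtained from scikit-learn's own cross-validation helpers. The sketch below is an alternative to `classification`, not the method used in this notebook. It runs on synthetic data from `make_classification` as a stand-in for the Adult data, and the sample size and class weights are assumptions chosen only to mimic an imbalanced binary problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the Adult data (assumption, not the real dataset)
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.75, 0.25], random_state=0)

model = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
cv = StratifiedKFold(n_splits=3)  # 3 stratified folds, no shuffling

# One score per fold and per metric, stored under the keys "test_<metric>"
scores = cross_validate(model, X, y, cv=cv,
                        scoring=("precision", "recall", "accuracy"))

for name in ("precision", "recall", "accuracy"):
    print(name, round(scores[f"test_{name}"].mean(), 5))
```

Fixing `random_state` makes repeated runs comparable, which the grid search in this notebook does not do.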

## <b> Parameters

In [5]:
# Note: for MCA we changed some parameters, because the number of features considered in that case is only 3.
# Number of trees in the random forest.
n_estimators = [5,10,50,100,500]

# Maximum number of features to consider at each split.
max_features = [1,2,3]

# Maximum depth of the trees.
max_depth = [4,6,8,10]
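
The three lists above define a full grid. A minimal sketch with `itertools.product` shows how many combinations that is and in what order nested loops over `n_estimators`, `max_depth` and `max_features` visit them. The name `grid` is introduced here for illustration only.

```python
from itertools import product

n_estimators = [5, 10, 50, 100, 500]
max_features = [1, 2, 3]
max_depth = [4, 6, 8, 10]

# Same order as three nested loops: n_estimators (outer), max_depth, max_features (inner)
grid = list(product(n_estimators, max_depth, max_features))

print(len(grid))   # 5 * 4 * 3 combinations
print(grid[:3])    # first three (n_estimators, max_depth, max_features) tuples
```

Each combination is evaluated with k-fold cross-validation, so the number of model fits is the grid size times k.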

## <b> Full dataset

In [6]:
# Name of the input dataset
ds_nome = 'adult_train_33.1'

# The notebook 01-pre-proc_tratamento_base.ipynb prepares the data beforehand
ds_pre_proc = '_Label_Encoded'
ds_tipo = '.csv'
ds_full_name = ds_nome + ds_pre_proc + ds_tipo

print("Nome Arquivo Origem: {0}".format(ds_full_name))

Nome Arquivo Origem: adult_train_33.1_Label_Encoded.csv


In [7]:
train = pd.read_csv(ds_full_name, index_col= 0)

In [8]:
train.shape

(10034, 11)

In [9]:
train.iloc[:, -1].value_counts()

0    7542
1    2492
Name: sal, dtype: int64
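
The counts above show an imbalanced target, which is useful context when reading the accuracy figures later on. As a small worked calculation using only those counts, the share of each class gives the accuracy a trivial classifier would reach by always predicting class 0. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
counts = {0: 7542, 1: 2492}
total = sum(counts.values())

# Accuracy of always predicting the majority class
baseline_accuracy = max(counts.values()) / total
# Share of the positive class (label 1)
positive_share = counts[1] / total

print(total)
print(round(baseline_accuracy, 3))
print(round(positive_share, 3))
```

Any accuracy reported below should be compared with this baseline, and recall on class 1 indicates how many of the positive cases the model actually finds.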

In [10]:
kfold_dicts = kFold(train, 3)

Número de K: 3


In [30]:
%%time

columns = ['n_estimators', 'max_depth', 'max_features', 'precision', 'recall', 'acuracia']
res = pd.DataFrame(columns = columns)
best_precision = {'score': 0, 'model': None}
best_recall = {'score': 0, 'model': None}
best_accuracy = {'score': 0, 'model': None}

for ne in n_estimators:
    for md in max_depth:
        for mf in max_features:
            print("n_estimators: ", ne)
            print("max_depth: ", md)
            print("max_features: ", mf)
            model = RandomForestClassifier(n_estimators=ne, max_depth=md, max_features=mf)

            precision, recall, acuracia = classification(kfold_dicts, model)
            # Keep the best model seen so far for each metric
            if precision > best_precision['score']:
                best_precision['score'] = precision
                best_precision['model'] = model
            if recall > best_recall['score']:
                best_recall['score'] = recall
                best_recall['model'] = model
            if acuracia > best_accuracy['score']:
                best_accuracy['score'] = acuracia
                best_accuracy['model'] = model
            # Append one result row per parameter combination
            res = pd.concat([res, pd.DataFrame([[ne, md, mf, precision, recall, acuracia]],
                                               columns = columns)])

n_estimators:  5
max_depth:  4
max_features:  1
Precision: 0.777
Recall: 0.18417
Accuracy: 0.78423
--- 0.2035219669342041 time ---
n_estimators:  5
max_depth:  4
max_features:  2
Precision: 0.72969
Recall: 0.36878
Accuracy: 0.80895
--- 0.15745139122009277 time ---
n_estimators:  5
max_depth:  4
max_features:  3
Precision: 0.71492
Recall: 0.4045
Accuracy: 0.81174
--- 0.14402461051940918 time ---
n_estimators:  5
max_depth:  6
max_features:  1
Precision: 0.73218
Recall: 0.34712
Accuracy: 0.80596
--- 0.12166857719421387 time ---
n_estimators:  5
max_depth:  6
max_features:  2
Precision: 0.72059
Recall: 0.41613
Accuracy: 0.81493
--- 0.14560365676879883 time ---
n_estimators:  5
max_depth:  6
max_features:  3
Precision: 0.66829
Recall: 0.50042
Accuracy: 0.81294
--- 0.1456453800201416 time ---
n_estimators:  5
max_depth:  8
max_features:  1
Precision: 0.72132
Recall: 0.41776
Accuracy: 0.81533
--- 0.12726545333862305 time ---
n_estimators:  5
max_depth:  8
max_features:  2
Precision: 0.70438


In [31]:
res.head(1000)

|   | n_estimators | max_depth | max_features | precision | recall | acuracia |
|---|---|---|---|---|---|---|
| 0 | 5 | 4 | 1 | 0.776998 | 0.184174 | 0.784233 |
| 0 | 5 | 4 | 2 | 0.729691 | 0.368779 | 0.80895 |
| 0 | 5 | 4 | 3 | 0.714924 | 0.404498 | 0.81174 |
| 0 | 5 | 6 | 1 | 0.732175 | 0.347122 | 0.80596 |
| 0 | 5 | 6 | 2 | 0.720594 | 0.416132 | 0.814929 |
| 0 | 5 | 6 | 3 | 0.668289 | 0.500424 | 0.812936 |
| 0 | 5 | 8 | 1 | 0.721316 | 0.417763 | 0.815329 |
| 0 | 5 | 8 | 2 | 0.704378 | 0.474325 | 0.820012 |
| 0 | 5 | 8 | 3 | 0.70961 | 0.452638 | 0.818018 |
| 0 | 5 | 10 | 1 | 0.706889 | 0.473925 | 0.820411 |
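
Precision and recall move in opposite directions across these rows, so ranking by either one alone can be misleading. One common way to combine them is the F1 score, $F_1 = 2PR/(P+R)$. The sketch below applies it to the ten rows shown above. F1 is not computed anywhere in this notebook, and applying it to fold-averaged precision and recall only approximates a per-fold F1, so treat the ranking as indicative. The names `res_head` and `best` are illustrative.

```python
import pandas as pd

# (n_estimators, max_depth, max_features, precision, recall) copied from the table above
rows = [
    (5, 4, 1, 0.776998, 0.184174),
    (5, 4, 2, 0.729691, 0.368779),
    (5, 4, 3, 0.714924, 0.404498),
    (5, 6, 1, 0.732175, 0.347122),
    (5, 6, 2, 0.720594, 0.416132),
    (5, 6, 3, 0.668289, 0.500424),
    (5, 8, 1, 0.721316, 0.417763),
    (5, 8, 2, 0.704378, 0.474325),
    (5, 8, 3, 0.709610, 0.452638),
    (5, 10, 1, 0.706889, 0.473925),
]
res_head = pd.DataFrame(
    rows, columns=["n_estimators", "max_depth", "max_features", "precision", "recall"]
)

# Harmonic mean of precision and recall
p, r = res_head["precision"], res_head["recall"]
res_head["f1"] = 2 * p * r / (p + r)

# Row with the highest F1 among the rows shown
best = res_head.loc[res_head["f1"].idxmax()]
print(res_head.sort_values("f1", ascending=False).head(3))
```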


## <b> Multiple Correspondence Analysis (MCA)

In [32]:
# Parameters

# Number of trees in the random forest.
n_estimators = [5,10,50,100,500]

# Maximum depth of the trees.
max_depth = [4,6,8,10]

# Maximum number of features to consider at each split.
max_features = [1,2]

In [6]:
# Name of the input dataset
ds_nome = 'adult_train_33.1'

# The notebook 01-pre-proc_tratamento_base.ipynb prepares the data beforehand
ds_pre_proc = '_tratado_MCA'
ds_tipo = '.csv'
ds_full_name = ds_nome + ds_pre_proc + ds_tipo

print("Nome Arquivo Origem: {0}".format(ds_full_name))

Nome Arquivo Origem: adult_train_33.1_tratado_MCA.csv


In [7]:
train = pd.read_csv(ds_full_name, index_col=0)

In [8]:
train.shape

(10034, 11)

In [18]:
kfold_dicts = kFold(train, 3)

Número de K: 3


In [19]:
kfold_dicts[2].shape

(3344, 11)
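
The size of the last fold can be checked by hand from the class counts. The sketch below assumes the MCA file carries the same labels as the label-encoded file, that is 7542 rows of class 0 and 2492 rows of class 1, and mirrors the rule used by `kFold`: each class is split into `k` parts, and the first `n % k` parts receive one extra row. The helper `per_fold` is a hypothetical name.

```python
# Class counts from the label-encoded dataset (assumed identical for the MCA file)
class_counts = {0: 7542, 1: 2492}
k = 3

def per_fold(n, k):
    """Sizes of k nearly equal parts of n rows; the first n % k parts get one extra row."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

# Add up the per-class parts to get the size of each fold
fold_sizes = [sum(parts) for parts in zip(*(per_fold(n, k) for n in class_counts.values()))]
print(fold_sizes)  # [3345, 3345, 3344]
```

The last entry matches the 3344 rows reported for `kfold_dicts[2]`.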

In [20]:
train.head()

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | sal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.082125 | 0.104588 | 0.069364 | -0.074746 | -0.006659 | 0.070368 | 0.003401 | 0.011989 | -0.043017 | 0.016424 | 0 |
| 1 | 0.272706 | -0.000564 | 0.025023 | -0.045254 | 0.043857 | -0.062299 | -0.006852 | -0.024369 | -0.005436 | -0.01991 | 0 |
| 2 | -0.06257 | -0.012166 | -0.077619 | 0.01214 | -0.048345 | 0.046914 | -0.050294 | 0.051538 | 0.006365 | 0.019667 | 0 |
| 3 | 0.042881 | -0.145118 | -0.043249 | 0.047566 | -0.073234 | 0.03648 | 0.059404 | 0.031942 | 0.061853 | 0.000136 | 0 |
| 4 | -0.093636 | 0.14588 | 0.004478 | 0.043283 | 0.050437 | 0.020226 | 0.087173 | -0.135152 | 0.020913 | 0.056551 | 0 |
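
Columns `1` to `10` above appear to be coordinates of each row on ten MCA components, produced by the preprocessing notebook. How that notebook computes them is not shown here, so the following is only a generic sketch of MCA as correspondence analysis of the one-hot indicator matrix, written with NumPy. The function name `mca_row_coordinates`, the toy columns `workclass` and `sex`, and their values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def mca_row_coordinates(df_cat, n_components=2):
    """Principal row coordinates from correspondence analysis of the indicator matrix."""
    Z = pd.get_dummies(df_cat).to_numpy(dtype=float)  # one-hot indicator matrix
    P = Z / Z.sum()                                   # correspondence matrix
    r = P.sum(axis=1)                                 # row masses
    c = P.sum(axis=0)                                 # column masses
    # Standardized residuals from the independence model r c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    F = (U * s) / np.sqrt(r)[:, None]                 # principal row coordinates
    return F[:, :n_components]

# Toy categorical data (assumption, not taken from the Adult files)
toy = pd.DataFrame({
    "workclass": ["private", "private", "gov", "self", "gov", "private"],
    "sex": ["m", "f", "f", "m", "m", "f"],
})
coords = mca_row_coordinates(toy, n_components=2)
print(coords.shape)
```

Because every row has the same mass, each component is centered at zero, which is consistent with the small positive and negative values in the table above.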


In [21]:
%%time

columns = ['n_estimators', 'max_depth', 'max_features', 'precision', 'recall', 'acuracia']
res_mca = pd.DataFrame(columns = columns)
best_precision = {'score': 0, 'model': None}
best_recall = {'score': 0, 'model': None}
best_accuracy = {'score': 0, 'model': None}

for ne in n_estimators:
    for md in max_depth:
        for mf in max_features:
            print("n_estimators: ", ne)
            print("max_depth: ", md)
            print("max_features: ", mf)
            model = RandomForestClassifier(n_estimators=ne, max_depth=md, max_features=mf)

            # kfold_dicts holds the folds built from the MCA dataset in the cell above
            precision, recall, acuracia = classification(kfold_dicts, model)
            # Keep the best model seen so far for each metric
            if precision > best_precision['score']:
                best_precision['score'] = precision
                best_precision['model'] = model
            if recall > best_recall['score']:
                best_recall['score'] = recall
                best_recall['model'] = model
            if acuracia > best_accuracy['score']:
                best_accuracy['score'] = acuracia
                best_accuracy['model'] = model
            # Append one result row per parameter combination
            res_mca = pd.concat([res_mca, pd.DataFrame([[ne, md, mf, precision, recall, acuracia]],
                                                       columns = columns)])

n_estimators:  5
max_depth:  4
max_features:  1
Precision: 0.79951
Recall: 0.15766
Accuracy: 0.77785
--- 0.9413411617279053 time ---
n_estimators:  5
max_depth:  4
max_features:  2
Precision: 0.78683
Recall: 0.27602
Accuracy: 0.79958
--- 0.5552263259887695 time ---
n_estimators:  5
max_depth:  4
max_features:  3
Precision: 0.70169
Recall: 0.45507
Accuracy: 0.81623
--- 0.606414794921875 time ---
n_estimators:  5
max_depth:  6
max_features:  1
Precision: 0.76036
Recall: 0.24314
Accuracy: 0.79181
--- 0.8269524574279785 time ---
n_estimators:  5
max_depth:  6
max_features:  2
Precision: 0.7041
Recall: 0.4715
Accuracy: 0.81951
--- 0.5416123867034912 time ---
n_estimators:  5
max_depth:  6
max_features:  3
Precision: 0.69984
Recall: 0.51043
Accuracy: 0.8239
--- 0.4689769744873047 time ---
n_estimators:  5
max_depth:  8
max_features:  1
Precision: 0.72327
Recall: 0.41935
Accuracy: 0.81583
--- 0.7874016761779785 time ---
n_estimators:  5
max_depth:  8
max_features:  2
Precision: 0.68228
Recall

In [23]:
res_mca.head(1000)

|   | n_estimators | max_depth | max_features | precision | recall | acuracia |
|---|---|---|---|---|---|---|
| 0 | 5 | 4 | 1 | 0.799509 | 0.157659 | 0.777854 |
| 0 | 5 | 4 | 2 | 0.786834 | 0.276024 | 0.799579 |
| 0 | 5 | 4 | 3 | 0.701695 | 0.455069 | 0.816225 |
| 0 | 5 | 6 | 1 | 0.760359 | 0.243143 | 0.791807 |
| 0 | 5 | 6 | 2 | 0.704101 | 0.471499 | 0.819513 |
| 0 | 5 | 6 | 3 | 0.69984 | 0.510427 | 0.823899 |
| 0 | 5 | 8 | 1 | 0.723271 | 0.419349 | 0.815826 |
| 0 | 5 | 8 | 2 | 0.682277 | 0.514453 | 0.819813 |
| 0 | 5 | 8 | 3 | 0.67833 | 0.530903 | 0.820909 |
| 0 | 5 | 10 | 1 | 0.679571 | 0.446633 | 0.810046 |
