# Classification activity

In this activity, you will implement two classification models on the [adult income dataset](https://www.kaggle.com/wenruliu/adult-income-dataset). The dataset has 15 variables covering education, age, sex, and other attributes. **Your goal is to predict whether a person earns more than 50 thousand dollars per year** (`income`) from their characteristics.

To complete the activity, follow these steps:

* Split the dataset into training and test sets in an 80%/20% proportion, respectively.
* Use dummy variables to represent categorical variables.
* Train two of the classification models covered so far on the training set.
* Test the models on the test set.
* From the predictions, obtain the confusion matrices and report how well the models perform using the following metrics: accuracy, precision, recall, informedness, and markedness.
* Analyze the results for both models and state which one predicted the outcomes better.
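
For reference, the five metrics can be written in terms of the confusion-matrix counts (TP, FP, FN, TN), taking `>50K` as the positive class. These are the standard definitions, and they match the helper functions defined in this notebook.

```latex
\begin{aligned}
\text{accuracy}     &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision}    &= \frac{TP}{TP + FP} \\
\text{recall}       &= \frac{TP}{TP + FN} \\
\text{informedness} &= \frac{TP}{TP + FN} - \frac{FP}{FP + TN} \\
\text{markedness}   &= \frac{TP}{TP + FP} - \frac{FN}{FN + TN}
\end{aligned}
```

Informedness is the true positive rate minus the false positive rate, and markedness is its counterpart on the prediction side: precision minus the share of negative predictions that are wrong.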


In [0]:
from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

trabalhadores = pd.read_csv("https://orionwinter.github.io/datasets/adult.csv")

trabalhadores.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [0]:
# Actual positives
def rp(tp, fn):
    return tp + fn

# Actual negatives
def rn(fp, tn):
    return fp + tn

# Predicted positives
def pp(tp, fp):
    return tp + fp

# Predicted negatives
def pn(fn, tn):
    return fn + tn

# Accuracy: share of all predictions that are correct
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fp + fn)

# Precision: true positives over predicted positives
def precision(tp, fp):
    return tp / pp(tp, fp)

# Recall: true positives over actual positives
def recall(tp, fn):
    return tp / rp(tp, fn)

# F-measure: harmonic mean of precision and recall
def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Informedness: true positive rate minus false positive rate
def informedness(tp, fp, fn, tn):
    return tp / rp(tp, fn) - fp / rn(fp, tn)

# Markedness: precision minus the share of negative predictions that are wrong
def markdness(tp, fp, fn, tn):
    return tp / pp(tp, fp) - fn / pn(fn, tn)

# Standardize features to zero mean and unit variance
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)

In [0]:
# One-hot encode every categorical column, dropping the first level of each
categorical_cols = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'gender', 'native-country']
trabalhadores = pd.get_dummies(trabalhadores, columns=categorical_cols, drop_first=True)
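
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame. The frame and its values are hypothetical, made up for this example rather than taken from the dataset. Each categorical column becomes one indicator column per level, except the first level in sorted order, which is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not rows from the dataset.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "gender": ["Male", "Female", "Male"],
    "race": ["Black", "White", "White"],
})

# One indicator column per category level, minus the first (sorted) level.
encoded = pd.get_dummies(toy, columns=["gender", "race"], drop_first=True)

# "gender_Female" and "race_Black" are dropped and act as the baseline levels.
print(list(encoded.columns))
```

Dropping one level per variable avoids perfectly collinear indicator columns, at the cost of making the baseline level implicit.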

In [0]:
trabalhadores.head()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_11th,education_12th,education_1st-4th,education_5th-6th,education_7th-8th,education_9th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,occupation_Adm-clerical,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,...,native-country_Canada,native-country_China,native-country_Columbia,native-country_Cuba,native-country_Dominican-Republic,native-country_Ecuador,native-country_El-Salvador,native-country_England,native-country_France,native-country_Germany,native-country_Greece,native-country_Guatemala,native-country_Haiti,native-country_Holand-Netherlands,native-country_Honduras,native-country_Hong,native-country_Hungary,native-country_India,native-country_Iran,native-country_Ireland,native-country_Italy,native-country_Jamaica,native-country_Japan,native-country_Laos,native-country_Mexico,native-country_Nicaragua,native-country_Outlying-US(Guam-USVI-etc),native-country_Peru,native-country_Philippines,native-country_Poland,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,226802,7,0,0,40,<=50K,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,38,89814,9,0,0,50,<=50K,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,28,336951,12,0,0,40,>50K,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,44,160323,10,7688,0,40,>50K,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,18,103497,10,0,0,30,<=50K,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [0]:
# Features: every column except the target
X = trabalhadores.drop(columns=['income']).values
# Target as a 1-D array, the shape scikit-learn estimators expect
y = trabalhadores['income'].values

# Split the data into 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [0]:
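# Note: feature_scaling fits a new StandardScaler on each call, so the test set
# is standardized with its own mean and variance, not the training statistics
X_train = feature_scaling(X_train)
X_test = feature_scaling(X_test)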
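
The cell above fits a separate scaler on each split. A common alternative is to fit the scaler on the training set only and reuse its statistics for the test set. The sketch below shows this on toy arrays (`X_train_toy` and `X_test_toy` are hypothetical stand-ins, not the notebook's data). Metrics obtained with this variant could differ from the results reported in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays standing in for X_train / X_test.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train_toy)
# Apply the same training statistics to the test data (no refitting).
X_test_scaled = scaler.transform(X_test_toy)

print(X_test_scaled)
```

Here the first test row equals the training mean, so it maps to zeros. With a scaler refit on the test set it would not.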

In [0]:
estimators = {'KNN': KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2),
              'SVC': SVC(kernel = 'rbf', random_state = 0)}

In [0]:
df_results = pd.DataFrame(columns=['clf', 'acc', 'prec', 'rec', 'inf','mar', 'tn','fp','fn','tp'], index=None)

In [0]:
for name, estim in estimators.items():

    # print("Training estimator {0}: ".format(name))

    # Train the classifier on the training set
    estim.fit(X_train, y_train)

    # Predict on the test set
    y_pred = estim.predict(X_test)

    # With labels sorted as ['<=50K', '>50K'], ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    # Store each classifier's metrics in the results dataframe
    df_results.loc[len(df_results), :] = [name, accuracy(tp, fp, fn, tn), precision(tp, fp),
                   recall(tp, fn), informedness(tp, fp, fn, tn), markdness(tp, fp, fn, tn), tn, fp, fn, tp]

  
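
The unpacking `tn, fp, fn, tp = confusion_matrix(...).ravel()` relies on scikit-learn sorting the string labels so that `'<=50K'` comes before `'>50K'`. The sketch below, on hypothetical toy labels, makes that order explicit with the `labels` argument and checks the counts against scikit-learn's own `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels, not predictions from the notebook's models.
y_true = ["<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"]
y_hat = ["<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"]

# Passing labels explicitly fixes the row/column order: negative first, positive second.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=["<=50K", ">50K"]).ravel()

# The built-in scorers need to be told which label is the positive class.
prec = precision_score(y_true, y_hat, pos_label=">50K")
rec = recall_score(y_true, y_hat, pos_label=">50K")

print(tn, fp, fn, tp, prec, rec)
```

Setting `labels` and `pos_label` explicitly keeps the meaning of "positive" independent of how the label strings happen to sort.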


In [0]:
df_results

Unnamed: 0,clf,acc,prec,rec,inf,mar,tn,fp,fn,tp
0,KNN,0.829256,0.649807,0.589083,0.491877,0.527488,6752,727,941,1349
1,SVC,0.855768,0.746227,0.582969,0.522266,0.626553,7025,454,955,1335
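
As a sanity check, the metrics in the table can be recomputed from the confusion-matrix counts alone. The sketch below copies `tn`, `fp`, `fn`, and `tp` from the table; the `summarize` helper is a name introduced here for illustration and applies the same formulas as the notebook's metric functions.

```python
# Confusion-matrix counts copied from the results table.
counts = {
    "KNN": {"tn": 6752, "fp": 727, "fn": 941, "tp": 1349},
    "SVC": {"tn": 7025, "fp": 454, "fn": 955, "tp": 1335},
}

def summarize(tn, fp, fn, tp):
    """Return the five metrics for one confusion matrix (illustrative helper)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "prec": precision,
        "rec": recall,
        "inf": recall - fp / (fp + tn),      # TPR minus FPR
        "mar": precision - fn / (fn + tn),   # PPV minus false omission rate
    }

metrics = {name: summarize(**c) for name, c in counts.items()}
for name, m in metrics.items():
    print(name, {k: round(v, 6) for k, v in m.items()})
```

The recomputed values agree with the table to six decimal places, which confirms that the reported metrics are consistent with the stored counts.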


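According to the results above, the SVC model predicted better overall: it has higher accuracy (0.856 vs. 0.829), precision (0.746 vs. 0.650), informedness (0.522 vs. 0.492), and markedness (0.627 vs. 0.527) than KNN. KNN has only a marginally higher recall (0.589 vs. 0.583).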