# Census Dataset
**Predict whether income exceeds US$50K/yr based on census data. Also known as the "Census Income" dataset.**

The extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

The prediction task is to determine whether a person makes more than 50K a year.

https://archive.ics.uci.edu/ml/datasets/adult

# #2 - Data preprocessing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install category_encoders

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer # Applies transformers to columns of an array or pandas DataFrame.
from category_encoders import TargetEncoder
from sklearn.preprocessing import OneHotEncoder # Encodes categorical features as a one-hot numeric array.
from sklearn.model_selection import train_test_split


import pickle as pkl
import pandas as pd
import numpy as np
import math
path_datasets = '/content/drive/MyDrive/Machine Learning e Data Science com Python/Machine Learning e Data Science com Python de A à Z/Bases de dados/'

In [4]:
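# Use the fully qualified option names; the short forms are ambiguous in recent pandas versions.
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 150)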
df_census = pd.read_csv(path_datasets+'census.csv')
df_census.head()

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Converting the target to binary

In [5]:
df_census.loc[df_census.income == ' <=50K','income'] = 0
df_census.loc[df_census.income == ' >50K','income'] = 1
df_census['income']

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: income, Length: 32561, dtype: object
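
The output above still reports `dtype: object`, since the 0/1 values were written into a column that originally held strings. A minimal sketch of one way to end up with a true integer column, assuming the labels carry a leading space as in the cell above; the small `demo` frame and `label_map` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the income column (labels keep the leading space).
demo = pd.DataFrame({'income': [' <=50K', ' >50K', ' <=50K']})

# Strip whitespace, map the labels to 0/1, and cast to an integer dtype.
label_map = {'<=50K': 0, '>50K': 1}
demo['income'] = demo['income'].str.strip().map(label_map).astype('int64')

print(demo['income'].tolist())
print(demo['income'].dtype)
```

An integer target avoids surprises in estimators or encoders that inspect the dtype of `y`.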

## Separating features from the target

In [6]:
X_census, y_census = df_census.iloc[:, 0:14], df_census.iloc[:, 14]

## Removing the **education** and **relationship** features
The dataset has an ordinal feature called **education-num**; we will use only that one to avoid redundant information. In addition, the **relationship** feature is somewhat redundant with **marital-status**, so we will remove it as well.

In [7]:
def remove_features(df):
  # Drop the two redundant columns and return a new DataFrame.
  return df.drop(columns=['education', 'relationship'])

In [8]:
X_census = remove_features(X_census)
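
The redundancy claim for **education** can be checked rather than assumed. The sketch below tests whether each label maps to exactly one code and vice versa. The `demo` frame is a hypothetical sample built from values shown in `df_census.head()`; the real check would run on `df_census` itself.

```python
import pandas as pd

# Hypothetical sample; on the real data, replace `demo` with df_census.
demo = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', '11th', 'Bachelors', 'HS-grad'],
    'education-num': [13, 9, 7, 13, 9],
})

# If every label has exactly one code and every code exactly one label,
# the two columns carry the same information.
codes_per_label = demo.groupby('education')['education-num'].nunique()
labels_per_code = demo.groupby('education-num')['education'].nunique()
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())

print(is_one_to_one)
```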

# Splitting into training and test sets

In [9]:
seed = 14
X_train, X_test, y_train, y_test = train_test_split(X_census, y_census, test_size = .30, random_state = seed)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((22792, 12), (22792,), (9769, 12), (9769,))

# Converting categorical features to numeric with TargetEncoder
We apply TargetEncoder to high-cardinality features to avoid the increase in dimensionality that One Hot Encoder would cause. The alternative of grouping categories, for example replacing every country other than the United States with "Other", would discard information that is valuable for classification.

We will apply TargetEncoder to attributes with more than 5 distinct values.

In [10]:
cat_more_than_5_features_columns = []

for column in X_census.select_dtypes(include=['object']).columns:
  if X_census[column].nunique() > 5:
    cat_more_than_5_features_columns.append(column)
cat_more_than_5_features_columns

['workclass', 'marital-status', 'occupation', 'native-country']

In [11]:
Target_transformer = Pipeline(steps=[('target encoder', TargetEncoder())])
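
To make the idea concrete, here is a simplified sketch of target encoding done by hand with pandas: each category is replaced by the mean of the target observed for it in the training data, so one categorical column stays one numeric column. This is not what the library computes exactly; `TargetEncoder` from `category_encoders` also blends each category mean with the global mean (smoothing), so its numbers will differ. The `train` frame and the unseen category are hypothetical.

```python
import pandas as pd

# Hypothetical training sample: one categorical feature and a binary target.
train = pd.DataFrame({
    'workclass': ['Private', 'Private', 'State-gov', 'Private', 'State-gov'],
    'income':    [0, 1, 0, 1, 1],
})

# Plain (unsmoothed) target encoding: category -> mean target in the training data.
category_means = train.groupby('workclass')['income'].mean()
global_mean = train['income'].mean()

# Categories unseen during fitting fall back to the global mean.
new_values = pd.Series(['Private', 'State-gov', 'Never-worked'])
encoded = new_values.map(category_means).fillna(global_mean)

print(encoded.tolist())
```

Because the encoding depends on the target, it has to be learned from the training split only and then applied to the test split.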



## Normalizing numeric features
As seen during data exploration, the dataset has many outliers and does not follow a normal distribution. We therefore apply MinMaxScaler: it does not assume normally distributed data, and it preserves the influence of the outliers, which we do not want to eliminate.

In [14]:
num_columns = X_census.select_dtypes(include=['int64']).columns
num_columns

Index(['age', 'final-weight', 'education-num', 'capital-gain', 'capital-loos',
       'hour-per-week'],
      dtype='object')

In [15]:
Num_transformer = Pipeline(steps = [('min-max-scaler',  MinMaxScaler())])
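
Min-max scaling maps each value to $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below computes it by hand and compares it with `MinMaxScaler`; the `ages` array is a hypothetical sample with one extreme value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages with one extreme value.
ages = np.array([[17.0], [30.0], [45.0], [90.0]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same computation through scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The extreme value is kept and mapped to 1.0, which also compresses the remaining values toward the lower end of the range.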

# Converting categorical features to dummies

In [12]:
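cat_columns = list(X_census.select_dtypes(include=['object']).columns)
# Drop the high-cardinality columns already handled by the target encoder.
for c in cat_more_than_5_features_columns:
  try:
    cat_columns.remove(c)
  except ValueError:  # column is not in the list
    pass
cat_columns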

['race', 'sex']

In [13]:
Cat_transformer = Pipeline(steps=[('one-hot encoder', OneHotEncoder())])
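
A small sketch of what `OneHotEncoder` produces, on a hypothetical `train` sample. It also passes `handle_unknown='ignore'`, an option the notebook does not set, which encodes categories unseen during `fit` as all zeros instead of raising an error at transform time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of a low-cardinality column.
train = pd.DataFrame({'sex': ['Male', 'Female', 'Male']})

# handle_unknown='ignore' is an optional safeguard, not used in the notebook.
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(train).toarray()

print(encoder.categories_)  # learned categories, sorted; one output column each
print(encoded)
```

The output columns follow the sorted order of the categories, which matters when column names are assigned afterwards.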

## Composing the preprocessors

In [16]:
Preprocessor = ColumnTransformer(transformers=[
    ('num_tgt', Target_transformer, cat_more_than_5_features_columns),
    ('num', Num_transformer, num_columns),
    ('cat', Cat_transformer, cat_columns)
])

# Fitting and transforming the training set

In [27]:
# Flat list of output column names, in the order the ColumnTransformer emits them:
# target-encoded columns, then scaled numeric columns, then one-hot columns.
columns = ['workclass', 'marital-status', 'occupation', 'native-country',
           'age', 'final-weight', 'education-num', 'capital-gain',
           'capital-loos', 'hour-per-week', 'race_1', 'race_2',
           'race_3', 'race_4', 'race_5', 'sex_1', 'sex_2']
columns

['workclass',
 'marital-status',
 'occupation',
 'native-country',
 'age',
 'final-weight',
 'education-num',
 'capital-gain',
 'capital-loos',
 'hour-per-week',
 'race_1',
 'race_2',
 'race_3',
 'race_4',
 'race_5',
 'sex_1',
 'sex_2']

In [28]:
X_train_transformed = Preprocessor.fit_transform(X_train, y_train)
X_train_transformed = pd.DataFrame(X_train_transformed, columns = columns)
X_train_transformed



Unnamed: 0,workclass,marital-status,occupation,native-country,age,final-weight,education-num,capital-gain,capital-loos,hour-per-week,race_1,race_2,race_3,race_4,race_5,sex_1,sex_2
0,0.218703,0.045307,0.062436,0.247893,0.054795,0.291932,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.297915,0.099617,0.139279,0.247893,0.520548,0.104189,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,0.218703,0.045307,0.062436,0.247893,0.136986,0.142369,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.218703,0.451339,0.225772,0.247893,0.465753,0.164305,0.800000,0.0,0.0,0.193878,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,0.218703,0.045307,0.265051,0.247893,0.013699,0.172506,0.466667,0.0,0.0,0.346939,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22787,0.218703,0.045307,0.207965,0.247893,0.082192,0.068611,0.600000,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
22788,0.218703,0.451339,0.225772,0.247893,0.164384,0.134617,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
22789,0.100541,0.451339,0.100231,0.247893,0.424658,0.172540,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
22790,0.297915,0.070845,0.040763,0.247893,0.178082,0.060485,0.533333,0.0,0.0,0.255102,0.0,0.0,0.0,0.0,1.0,1.0,0.0


# Transforming the test set

In [32]:
X_test_transformed = Preprocessor.transform(X_test)
X_test_transformed = pd.DataFrame(X_test_transformed, columns = columns)
X_test_transformed

Unnamed: 0,workclass,marital-status,occupation,native-country,age,final-weight,education-num,capital-gain,capital-loos,hour-per-week,race_1,race_2,race_3,race_4,race_5,sex_1,sex_2
0,0.218703,0.451339,0.040763,0.241206,0.191781,0.162572,0.600000,0.00000,0.340909,0.551020,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,0.218703,0.045307,0.458447,0.247893,0.054795,0.135715,0.600000,0.00000,0.000000,0.071429,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.218703,0.045307,0.265051,0.247893,0.013699,0.179470,0.466667,0.00000,0.000000,0.142857,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.218703,0.451339,0.207965,0.247893,0.219178,0.079961,0.533333,0.00000,0.000000,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,0.218703,0.099617,0.207965,0.247893,0.232877,0.109486,0.533333,0.00000,0.000000,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9764,0.379971,0.099617,0.139279,0.247893,0.287671,0.200431,0.600000,0.00000,0.000000,0.397959,0.0,0.0,0.0,0.0,1.0,1.0,0.0
9765,0.218703,0.451339,0.225772,0.247893,0.205479,0.107124,0.533333,0.00000,0.000000,0.500000,0.0,0.0,0.0,0.0,1.0,0.0,1.0
9766,0.100541,0.451339,0.100231,0.247893,0.684932,0.122676,0.533333,0.03818,0.000000,0.102041,0.0,0.0,0.0,0.0,1.0,0.0,1.0
9767,0.218703,0.451339,0.207965,0.247893,0.493151,0.096851,0.266667,0.00000,0.000000,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
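
The test set goes through `transform` only, so it is scaled and encoded with statistics learned from the training set. One consequence, sketched below with hypothetical values, is that a test value outside the training range is scaled outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical weekly hours; the test set contains a value above the training maximum.
train_hours = np.array([[10.0], [40.0], [60.0]])
test_hours = np.array([[40.0], [80.0]])

# Fit on the training data only, then reuse the learned min and max on the test data.
scaler = MinMaxScaler().fit(train_hours)
test_scaled = scaler.transform(test_hours)

print(test_scaled.ravel())
```

This is expected behavior; refitting on the test set would leak test information into the preprocessing.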


# Saving the variables

In [33]:
with open('census_data.pkl', 'wb') as f:
  pkl.dump([X_train, X_test, y_train, y_test], f)

In [34]:
with open('census_transform.pkl', 'wb') as f:
  pkl.dump([Preprocessor, remove_features], f)
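
A sketch of the round trip, using a temporary file and a small fitted scaler as a hypothetical stand-in for `Preprocessor`. Note that `pickle` stores functions such as `remove_features` by reference (module and name), so the session that loads `census_transform.pkl` needs a function with that name defined or importable.

```python
import os
import pickle as pkl
import tempfile

from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the fitted Preprocessor.
fitted_scaler = MinMaxScaler().fit([[0.0], [10.0]])

# Save to a temporary location (the notebook writes to the working directory).
path = os.path.join(tempfile.mkdtemp(), 'demo_transform.pkl')
with open(path, 'wb') as f:
    pkl.dump([fitted_scaler], f)

# Load it back and reuse the fitted state without refitting.
with open(path, 'rb') as f:
    (restored_scaler,) = pkl.load(f)

print(restored_scaler.transform([[5.0]]))
```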