# Dataframe treatments for Machine Learning

This script first works through, one at a time, several steps that matter for training Machine Learning models well, so that the model receives the best possible version of the dataframe. The following items are handled:

- Missing Values: entries that are absent from a column;
- Standard Scaler: standardization of the data to zero mean and unit standard deviation;
- MinMax Scaler: rescaling of the data to the range between zero and one;
- One Hot Encoding: conversion of categorical data into numeric indicator columns;
- Outliers: values that fall outside the bulk of the data according to a statistical rule based on the quartiles.

Finally, all the topics above are wrapped into a single function for preprocessing dataframes.

# Initial assessment

Importing the libraries and creating the dataframe:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("dataframe_exercicio_modulo_7.csv")
df

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,nome,dívida
0,39.0,13,Never-married,White,Male,40,<=50K,,
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K,,
2,38.0,9,Divorced,White,Male,40,<=50K,,
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K,,
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K,,
...,...,...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K,,
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K,,
29167,58.0,9,Widowed,White,Female,40,<=50K,,
29168,22.0,9,Never-married,White,Male,20,<=50K,,


Data type of each column:

In [3]:
df.dtypes

idade               float64
tempo_educacao        int64
estado_civil         object
cor                  object
sexo                 object
horas_por_semana      int64
salario_anual        object
nome                float64
dívida              float64
dtype: object

Percentage of missing values per column:

In [4]:
((df.isnull().sum() / df.shape[0]) * 100).round(2)

idade                 0.34
tempo_educacao        0.00
estado_civil          0.00
cor                   0.00
sexo                  0.00
horas_por_semana      0.00
salario_anual         0.00
nome                100.00
dívida              100.00
dtype: float64

Listing the values present in the estado_civil, cor and sexo columns:

In [5]:
df.estado_civil.unique()

array(['Never-married', 'Married-civ-spouse', 'Divorced', 'Separated',
       'Married-AF-spouse', 'Widowed', 'Married-spouse-absent'],
      dtype=object)

In [6]:
df.cor.unique()

array(['White', 'Black', 'Other', 'Asian-Pac-Islander',
       'Amer-Indian-Eskimo'], dtype=object)

In [7]:
df.sexo.unique()

array(['Male', 'Female', '?'], dtype=object)

**Note**: The sexo column contains a "?" value, which can influence the analysis. For the ML analyses, the data will need to be converted to numbers, either with OHE or with another mapping from strings to numeric variables.
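
One possible way to deal with the "?" marker is to convert it to NaN so that a later missing-value step can impute it like any other gap. The notebook itself keeps "?" as a category, so this is only a sketch under that assumption; `exemplo` and `moda` are illustrative names and the data is a hypothetical toy column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the sexo column
exemplo = pd.DataFrame({"sexo": ["Male", "Female", "?", "Male"]})

# Treat the "?" placeholder as a true missing value
exemplo["sexo"] = exemplo["sexo"].replace("?", np.nan)

# Impute with the most frequent category (the mode)
moda = exemplo["sexo"].mode()[0]
exemplo["sexo"] = exemplo["sexo"].fillna(moda)

print(exemplo["sexo"].tolist())  # ['Male', 'Female', 'Male', 'Male']
```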

# Missing Values

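Creating a function to handle missing values. It drops every column with at least 80% of its values missing, fills the remaining numeric gaps with the column mean, and fills the categorical gaps with the most frequent value: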

In [8]:
def preprocessamento(df, cols_numericas, cols_categoricas):
    dff = df.copy()
    # Drop every column in which at least 80% of the values are missing
    temp = (df.isnull().sum() / df.shape[0]) >= 0.8
    variaveis_80percFaltantes = temp.loc[temp].index.tolist()
    dff = dff.drop(columns = variaveis_80percFaltantes)

    # Keep only the listed columns that survived the drop
    cols_numericas = [c for c in cols_numericas if c in dff.columns]
    cols_categoricas = [c for c in cols_categoricas if c in dff.columns]

    # Numeric columns: fill gaps with the column mean
    for v in cols_numericas:
        dff[v] = dff[v].fillna(dff[v].mean())

    # Categorical columns: fill gaps with the most frequent value (mode)
    for v in cols_categoricas:
        mode_v = dff[v].value_counts().idxmax()
        dff[v] = dff[v].fillna(mode_v)
    return dff

Running the preprocessing function on the dataframe to handle missing values:

In [9]:
df_limpo = preprocessamento(df = df, 
                            cols_numericas = ['idade', 'tempo_educacao', 'horas_por_semana'], 
                            cols_categoricas = ['estado_civil', 'cor', 'sexo', 'salario_anual'])

df_limpo

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual
0,39.0,13,Never-married,White,Male,40,<=50K
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K
2,38.0,9,Divorced,White,Male,40,<=50K
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K
...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K
29167,58.0,9,Widowed,White,Female,40,<=50K
29168,22.0,9,Never-married,White,Male,20,<=50K


Checking the missing values per column:

In [10]:
((df_limpo.isnull().sum() / df_limpo.shape[0]) * 100).round(2)

idade               0.0
tempo_educacao      0.0
estado_civil        0.0
cor                 0.0
sexo                0.0
horas_por_semana    0.0
salario_anual       0.0
dtype: float64
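
The same mean and mode imputation can also be expressed with scikit-learn's `SimpleImputer`. This is an alternative sketch, not the approach used in the notebook; the `toy` frame and the `imp_num` / `imp_cat` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one numeric and one categorical gap
toy = pd.DataFrame({
    "idade": [20.0, np.nan, 40.0],
    "sexo": pd.Series(["Male", np.nan, "Male"], dtype=object),
})

imp_num = SimpleImputer(strategy="mean")           # numeric: column mean
imp_cat = SimpleImputer(strategy="most_frequent")  # categorical: mode

# fit_transform expects 2-D input, so select with a list and flatten the result
toy["idade"] = imp_num.fit_transform(toy[["idade"]]).ravel()
toy["sexo"] = imp_cat.fit_transform(toy[["sexo"]]).ravel()

print(toy)
```

Keeping the fitted imputers around would make it possible to apply the same fill values to new data later.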

# Standard Scaler and MinMax Scaler

Rescaling the data is fundamental for many Machine Learning models, because some algorithms assume that the features they receive are already on comparable scales. StandardScaler is the most commonly used option: it centers each feature on its mean and divides by its standard deviation. MinMaxScaler, which maps the values to the interval between zero and one, can be an alternative, depending on the dataset and on the models to be applied in the project.
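
To make the two transformations concrete, the sketch below computes them by hand on a small hypothetical array and compares the result with scikit-learn. The array `x` and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature, shaped as a column
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardization: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
mm_manual = (x - x.min()) / (x.max() - x.min())

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))  # True True
```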

Importing the MinMaxScaler and StandardScaler classes:

In [11]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
mm_scaler = MinMaxScaler()
ss_scaler = StandardScaler()

Passing the tempo_educacao column to a two-dimensional array with reshape:

In [12]:
X = df_limpo.tempo_educacao.values.reshape(-1,1)
X

array([[13],
       [13],
       [ 9],
       ...,
       [ 9],
       [ 9],
       [ 9]], dtype=int64)

Computing the minimum and maximum that MinMaxScaler will use to scale the array X:

In [13]:
mm_scaler.fit(X)

MinMaxScaler()

Passing the horas_por_semana column to a two-dimensional array with reshape:

In [14]:
Y = df_limpo.horas_por_semana.values.reshape(-1,1)
Y

array([[40],
       [13],
       [40],
       ...,
       [40],
       [20],
       [40]], dtype=int64)

Computing the mean and standard deviation that StandardScaler will use on the array Y:

In [15]:
ss_scaler.fit(Y)

StandardScaler()

Applying the min-max scaling to X with the transform method:

In [16]:
mm_scaler.transform(X)

array([[0.8       ],
       [0.8       ],
       [0.53333333],
       ...,
       [0.53333333],
       [0.53333333],
       [0.53333333]])

Applying the standardization, which centers and scales the values, to Y with the transform method:

In [17]:
ss_scaler.transform(Y)

array([[-0.03605983],
       [-2.21049975],
       [-0.03605983],
       ...,
       [-0.03605983],
       [-1.64675606],
       [-0.03605983]])

Placing the transformed data in a dataframe:

In [18]:
pd.DataFrame(np.c_[mm_scaler.transform(X), ss_scaler.transform(Y)]).round(2)

Unnamed: 0,0,1
0,0.80,-0.04
1,0.80,-2.21
2,0.53,-0.04
3,0.40,-0.04
4,0.87,-0.04
...,...,...
29165,0.73,-0.20
29166,0.53,-0.04
29167,0.53,-0.04
29168,0.53,-1.65
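
The first column can be checked by hand. The outputs 13 → 0.8 and 9 → 0.5333 imply a minimum of 1 and a maximum of 16 for tempo_educacao; these two values are inferred from the outputs shown above, not read from the data. A short sketch of the arithmetic:

```python
# Minimum and maximum inferred from the scaled outputs above (an assumption)
minimo, maximo = 1, 16

# tempo_educacao values of rows 0, 2, 3 and 4 in the table above
valores = [13, 9, 7, 14]
escalados = [round((v - minimo) / (maximo - minimo), 2) for v in valores]

print(escalados)  # [0.8, 0.53, 0.4, 0.87]
```

These match the rounded values in column 0 for the same rows.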


# OHE - One Hot Encoding

Since categorical columns cannot be fed directly to most Machine Learning models, the OneHotEncoder class converts these qualitative variables into numeric ones without imposing an artificial order or weight among the categories. It creates one column for each value found in the original column: a row receives 1 in the column of the category it contains and 0 in all the others.
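
A minimal illustration of this encoding on a hypothetical four-row column (the names `cores`, `ohe_demo` and `matriz` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, shaped as (n_samples, 1)
cores = np.array([["White"], ["Black"], ["White"], ["Other"]])

ohe_demo = OneHotEncoder()
matriz = ohe_demo.fit_transform(cores).toarray()

# Categories are sorted, and each row has a single 1 in its category's column
print(ohe_demo.categories_[0].tolist())  # ['Black', 'Other', 'White']
print(matriz)
```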

Importing the OneHotEncoder class:

In [19]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

Passing the sexo column to a two-dimensional array with reshape, then fitting the OHE to it:

In [20]:
variavel_ohe_sx = df_limpo.sexo.values.reshape(-1,1)

ohe.fit(variavel_ohe_sx)

OneHotEncoder()

Applying the OHE transformation to the array:

In [21]:
ohe.transform(variavel_ohe_sx).toarray()

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

Placing the transformed data in a dataframe:

In [22]:
dataframe_OHE_sx = pd.DataFrame(ohe.transform(variavel_ohe_sx).toarray(), 
                             columns = ohe.categories_[0].tolist())
dataframe_OHE_sx

Unnamed: 0,?,Female,Male
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,1.0,0.0
...,...,...,...
29165,0.0,1.0,0.0
29166,0.0,0.0,1.0
29167,0.0,1.0,0.0
29168,0.0,0.0,1.0


Joining the OHE data with the base dataframe:

In [23]:
pd.concat([df_limpo, dataframe_OHE_sx], axis = 1)

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,?,Female,Male
0,39.0,13,Never-married,White,Male,40,<=50K,0.0,0.0,1.0
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K,0.0,0.0,1.0
2,38.0,9,Divorced,White,Male,40,<=50K,0.0,0.0,1.0
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K,0.0,0.0,1.0
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K,0.0,1.0,0.0
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K,0.0,0.0,1.0
29167,58.0,9,Widowed,White,Female,40,<=50K,0.0,1.0,0.0
29168,22.0,9,Never-married,White,Male,20,<=50K,0.0,0.0,1.0


Passing the cor column to a two-dimensional array with reshape, then fitting the OHE to it:

In [24]:
variavel_ohe_cr = df_limpo.cor.values.reshape(-1,1)

ohe.fit(variavel_ohe_cr)

OneHotEncoder()

Applying the OHE transformation to the array:

In [26]:
ohe.transform(variavel_ohe_cr).toarray()

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])

Placing the transformed data in a dataframe:

In [27]:
dataframe_OHE_cr = pd.DataFrame(ohe.transform(variavel_ohe_cr).toarray(), 
                             columns = ohe.categories_[0].tolist())
dataframe_OHE_cr

Unnamed: 0,Amer-Indian-Eskimo,Asian-Pac-Islander,Black,Other,White
0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
29165,0.0,0.0,0.0,0.0,1.0
29166,0.0,0.0,0.0,0.0,1.0
29167,0.0,0.0,0.0,0.0,1.0
29168,0.0,0.0,0.0,0.0,1.0


Joining the OHE data with the base dataframe:

In [28]:
pd.concat([df_limpo, dataframe_OHE_cr], axis = 1)

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,Amer-Indian-Eskimo,Asian-Pac-Islander,Black,Other,White
0,39.0,13,Never-married,White,Male,40,<=50K,0.0,0.0,0.0,0.0,1.0
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K,0.0,0.0,0.0,0.0,1.0
2,38.0,9,Divorced,White,Male,40,<=50K,0.0,0.0,0.0,0.0,1.0
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K,0.0,0.0,1.0,0.0,0.0
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K,0.0,0.0,0.0,0.0,1.0
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K,0.0,0.0,0.0,0.0,1.0
29167,58.0,9,Widowed,White,Female,40,<=50K,0.0,0.0,0.0,0.0,1.0
29168,22.0,9,Never-married,White,Male,20,<=50K,0.0,0.0,0.0,0.0,1.0


Passing the estado_civil column to a two-dimensional array with reshape, then fitting the OHE to it:

In [29]:
variavel_ohe_es = df_limpo.estado_civil.values.reshape(-1,1)

ohe.fit(variavel_ohe_es)

OneHotEncoder()

Applying the OHE transformation to the array:

In [31]:
ohe.transform(variavel_ohe_es).toarray()

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

Placing the transformed data in a dataframe:

In [32]:
dataframe_OHE_es = pd.DataFrame(ohe.transform(variavel_ohe_es).toarray(), 
                             columns = ohe.categories_[0].tolist())
dataframe_OHE_es

Unnamed: 0,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,Widowed
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
29165,0.0,0.0,1.0,0.0,0.0,0.0,0.0
29166,0.0,0.0,1.0,0.0,0.0,0.0,0.0
29167,0.0,0.0,0.0,0.0,0.0,0.0,1.0
29168,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Joining the OHE data with the base dataframe:

In [33]:
pd.concat([df_limpo, dataframe_OHE_es], axis = 1)

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,Widowed
0,39.0,13,Never-married,White,Male,40,<=50K,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,38.0,9,Divorced,White,Male,40,<=50K,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K,0.0,0.0,1.0,0.0,0.0,0.0,0.0
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K,0.0,0.0,1.0,0.0,0.0,0.0,0.0
29167,58.0,9,Widowed,White,Female,40,<=50K,0.0,0.0,0.0,0.0,0.0,0.0,1.0
29168,22.0,9,Never-married,White,Male,20,<=50K,0.0,0.0,0.0,0.0,1.0,0.0,0.0


# Outliers

#### Creating a function to detect outliers

Outliers are values that deviate markedly from the rest of the dataset they belong to. A good practice is to use business knowledge of the subject to assess whether these points can really be discarded or should stay in the sample. The function below assigns a flag to each observation: using the lower and upper limits computed from the column's quartiles, it marks the value with 1 if it is a statistical outlier and with 0 otherwise.

In [35]:
def detecta_outlier(x, limite = 1.5):
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    limite_inf = np.maximum(np.percentile(x, 25) - limite * iqr, np.min(x))
    limite_sup = np.minimum(np.percentile(x, 75) + limite * iqr, np.max(x))
    return np.where((x < limite_inf) | (x > limite_sup), 1, 0)
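
A worked example of the same quartile rule on a small hypothetical array may help. The sketch below uses the plain fences Q1 - 1.5·IQR and Q3 + 1.5·IQR, without the clipping to the minimum and maximum done in detecta_outlier; that clipping does not change which points are flagged, since no point lies below the minimum or above the maximum.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
x = np.array([1, 2, 3, 4, 5, 100])

q1, q3 = np.percentile(x, [25, 75])  # 2.25 and 4.75
iqr = q3 - q1                        # 2.5
limite_inf = q1 - 1.5 * iqr          # -1.5
limite_sup = q3 + 1.5 * iqr          # 8.5

# 1 marks an outlier, 0 marks a regular observation
flags = np.where((x < limite_inf) | (x > limite_sup), 1, 0)
print(flags)  # [0 0 0 0 0 1]
```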

Applying the detecta_outlier function:

In [36]:
df_limpo['outlier_idade'] = detecta_outlier(df_limpo.idade)
df_limpo

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,outlier_idade
0,39.0,13,Never-married,White,Male,40,<=50K,0
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K,0
2,38.0,9,Divorced,White,Male,40,<=50K,0
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K,0
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K,0
...,...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K,0
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K,0
29167,58.0,9,Widowed,White,Female,40,<=50K,0
29168,22.0,9,Never-married,White,Male,20,<=50K,0


Removing, from the original df, the rows that the function flagged as outliers:

In [38]:
df = df.drop(df[df_limpo.outlier_idade == 1].index)
df

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,nome,dívida
0,39.0,13,Never-married,White,Male,40,<=50K,,
1,50.0,13,Married-civ-spouse,White,Male,13,<=50K,,
2,38.0,9,Divorced,White,Male,40,<=50K,,
3,53.0,7,Married-civ-spouse,Black,Male,40,<=50K,,
4,37.0,14,Married-civ-spouse,White,Female,40,<=50K,,
...,...,...,...,...,...,...,...,...,...
29165,27.0,12,Married-civ-spouse,White,Female,38,<=50K,,
29166,40.0,9,Married-civ-spouse,White,Male,40,>50K,,
29167,58.0,9,Widowed,White,Female,40,<=50K,,
29168,22.0,9,Never-married,White,Male,20,<=50K,,
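
Dropping rows leaves gaps in the index. A minimal sketch of the same removal with a boolean mask, followed by an index reset, is shown below on hypothetical toy data (`toy` and `sem_outliers` are illustrative names); whether to reset the index depends on what the later steps expect.

```python
import pandas as pd

# Hypothetical frame with an outlier flag column
toy = pd.DataFrame({"idade": [30.0, 95.0, 41.0], "outlier_idade": [0, 1, 0]})

# Keep only the non-outlier rows and renumber them from zero
sem_outliers = toy[toy["outlier_idade"] == 0].reset_index(drop=True)

print(sem_outliers.index.tolist())  # [0, 1]
```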


# Encapsulation

Creating a function that wraps all the topics covered above:
- Missing Values;
- Standard Scaler;
- MinMax Scaler;
- One Hot Encoding;
- Outliers.

In [39]:
def processamento(df, series_quant, series_quali, series_minmax, series_stdscaler, series_ohe, series_outliers):
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

    # Missing values: drop columns with at least 80% missing, then impute the rest
    temp = (df.isnull().sum() / df.shape[0]) >= 0.8
    variaveis_80percFaltantes = temp.loc[temp].index.tolist()
    df = df.drop(columns = variaveis_80percFaltantes)
    series_quant = [c for c in series_quant if c in df.columns]
    series_quali = [c for c in series_quali if c in df.columns]
    for v in series_quant:
        df[v] = df[v].fillna(df[v].mean())
    for v in series_quali:
        mode_v = df[v].value_counts().idxmax()
        df[v] = df[v].fillna(mode_v)

    # Note: pd.concat(axis = 1) aligns rows by index label, and the frames built
    # below get a fresh 0..n-1 index, so df is expected to have a default RangeIndex.

    # Standard Scaler
    ss_scaler = StandardScaler()
    Y = df[series_stdscaler].values.reshape(-1,1)
    df = pd.concat([df, pd.DataFrame(ss_scaler.fit_transform(Y)).round(1)], axis = 1)
    df.rename(columns = {0:"normalizacao"}, inplace= True)

    # MinMax Scaler
    mm_scaler = MinMaxScaler()
    X = df[series_minmax].values.reshape(-1,1)
    df = pd.concat([df, pd.DataFrame(mm_scaler.fit_transform(X)).round(1)], axis = 1)
    df.rename(columns = {0:"min_max"}, inplace= True)

    # OHE: one indicator column per category of series_ohe
    ohe = OneHotEncoder()
    variavel_ohe = df[series_ohe].values.reshape(-1,1)
    dataframe_OHE = pd.DataFrame(ohe.fit_transform(variavel_ohe).toarray(),
                                 columns = ohe.categories_[0].tolist())
    df = pd.concat([df, dataframe_OHE], axis = 1)

    # Outliers: flag with the quartile rule, then drop the flagged rows
    def detecta_outlier(x, limite = 1.5):
        iqr = np.percentile(x, 75) - np.percentile(x, 25)
        limite_inf = np.maximum(np.percentile(x, 25) - limite * iqr, np.min(x))
        limite_sup = np.minimum(np.percentile(x, 75) + limite * iqr, np.max(x))
        return np.where((x < limite_inf) | (x > limite_sup), 1, 0)
    df['outlier'] = detecta_outlier(df[series_outliers])
    df = df.drop(df[df.outlier == 1].index)
    return df

In [40]:
df = processamento(df, ['idade', 'tempo_educacao', 'horas_por_semana'], ['estado_civil', 'cor', 'sexo', 'salario_anual'], 'tempo_educacao', 'horas_por_semana', 'estado_civil', 'idade')
df

Unnamed: 0,idade,tempo_educacao,estado_civil,cor,sexo,horas_por_semana,salario_anual,normalizacao,min_max,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,Widowed,NaN,outlier
0,39.0,13.0,Never-married,White,Male,40.0,<=50K,-0.0,0.8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
1,50.0,13.0,Married-civ-spouse,White,Male,13.0,<=50K,-2.2,0.8,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
2,38.0,9.0,Divorced,White,Male,40.0,<=50K,-0.0,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,53.0,7.0,Married-civ-spouse,Black,Male,40.0,<=50K,-0.0,0.4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
4,37.0,14.0,Married-civ-spouse,White,Female,40.0,<=50K,-0.0,0.9,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29165,27.0,12.0,Married-civ-spouse,White,Female,38.0,<=50K,,0.5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
29166,40.0,9.0,Married-civ-spouse,White,Male,40.0,>50K,,0.5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
29167,58.0,9.0,Widowed,White,Female,40.0,<=50K,,0.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
29168,22.0,9.0,Never-married,White,Male,20.0,<=50K,,0.5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
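
The empty normalizacao entries in the last rows and the extra NaN column in the output above are consistent with an index mismatch: df had rows removed in In [38], so its index has gaps, while each frame built inside processamento starts from a fresh 0..n-1 index, and pd.concat aligns rows by index label. This is a likely explanation rather than something the notebook states. The sketch below reproduces the effect on hypothetical toy data and shows one way to avoid it, by reusing the original index (`toy`, `escalado`, `desalinhado` and `alinhado` are illustrative names).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame whose index has a gap, as happens after rows are dropped
toy = pd.DataFrame({"horas_por_semana": [40, 13, 20]}, index=[0, 1, 3])
escalado = StandardScaler().fit_transform(toy[["horas_por_semana"]])

# Default index (0, 1, 2): concat aligns on labels and introduces NaN
desalinhado = pd.concat(
    [toy, pd.DataFrame(escalado, columns=["normalizacao"])], axis=1
)

# Reusing the original index keeps every value on its own row
alinhado = pd.concat(
    [toy, pd.DataFrame(escalado, columns=["normalizacao"], index=toy.index)],
    axis=1,
)

print(len(desalinhado), int(desalinhado.isnull().sum().sum()))  # 4 2
print(len(alinhado), int(alinhado.isnull().sum().sum()))        # 3 0
```

Calling `reset_index(drop=True)` on the input before the concatenations would be another way to keep the rows aligned.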
