# Dataset Processing

Basic preprocessing code for the UCI Adult classification dataset.

## Step 0: Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

## Step 1: Load the data from https://archive.ics.uci.edu/ml/datasets/Adult into a dataframe.

In [None]:
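# The raw file has no header row; header=None keeps the first record as data
# instead of consuming it as column names.
df = pd.read_csv("../data/adult_data.csv", header=None)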
df.head()

The file ships without column names, so we assign the names defined by the data publishers.

In [None]:
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'result']

In [None]:
df.head()
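Why the header handling matters can be shown on a few made-up records. This is a minimal sketch, assuming the raw file has no header line; the three-column layout and the values in `raw` are illustrative and not the real file.

```python
import io
import pandas as pd

# Three illustrative records in the same comma-plus-space layout (assumed).
raw = "39, State-gov, <=50K\n50, Self-emp-not-inc, <=50K\n38, Private, >50K\n"
names = ["age", "workclass", "result"]

# Default behaviour: the first record is consumed as the header row.
default_read = pd.read_csv(io.StringIO(raw))

# Passing names= tells pandas there is no header, so every record is kept.
named_read = pd.read_csv(io.StringIO(raw), names=names)

print(len(default_read), len(named_read))
```

If the first record is read as a header and the columns are renamed afterwards, that record is silently lost.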

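Each value in the raw file is preceded by a space. Strip leading and trailing whitespace from the string values.

In [None]:
# Anchored pattern: trims only the edges and leaves any spaces inside a value intact.
df.replace(r'^\s+|\s+$', '', regex=True, inplace=True)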
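As a minimal sketch on made-up values (the `note` column is hypothetical and not part of the Adult data), the following compares deleting every space with trimming only the leading and trailing ones.

```python
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "workclass": [" State-gov", " Private"],
    "note": [" two words", " one"],   # hypothetical column with an inner space
})

# Pattern ' ' matches every space, including those inside a value.
all_removed = toy.replace({" ": ""}, regex=True)

# str.strip() removes whitespace only at the start and end of each value.
edges_only = toy.apply(lambda col: col.str.strip())

print(all_removed["note"].tolist())
print(edges_only["note"].tolist())
```

For the Adult columns both approaches should give the same result as long as no category contains an inner space, but trimming the edges is the safer default.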

Reference for selecting columns by dtype: https://stackoverflow.com/questions/21720022/find-all-columns-of-dataframe-in-pandas-whose-type-is-float-or-a-particular-typ

## Step 2: Create dataframe of object type columns.

In [None]:
objectColumns = df.loc[:, df.dtypes == object]
objectNames = objectColumns.columns
objectColumns.head()
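An alternative way to split columns by type is `DataFrame.select_dtypes`. The sketch below uses a toy frame with made-up values and is not the notebook's own method. It selects the numeric columns and treats everything else as categorical, which avoids depending on the exact dtype pandas assigns to strings.

```python
import pandas as pd

# Toy frame standing in for a few Adult columns (values are illustrative).
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hours-per-week": [40, 13],
})

numeric = toy.select_dtypes(include="number")      # integer and float columns
categorical = toy.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(categorical.columns))
```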

In [None]:
enc = preprocessing.LabelEncoder()

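Using the imported LabelEncoder, map each distinct value in every object column to an integer code. The encoder is refit for each column, so the codes are independent from one column to the next.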

In [None]:
df_object = pd.DataFrame()
for feature in objectNames:
    df_object[feature] = enc.fit_transform(df[feature])

In [None]:
df_object.head()
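To make the encoding concrete, here is a minimal sketch on toy values. The `encoders` dictionary is a hypothetical addition, not part of the notebook: it keeps one fitted encoder per column so the integer codes can be mapped back to labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two object columns (values are illustrative, not real rows).
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "result": ["<=50K", ">50K", "<=50K", "<=50K"],
})

encoders = {}              # hypothetical helper: one fitted encoder per column
encoded = pd.DataFrame()
for feature in toy.columns:
    encoders[feature] = LabelEncoder()
    encoded[feature] = encoders[feature].fit_transform(toy[feature])

# Classes are sorted, so "Female" -> 0 and "Male" -> 1.
sex_codes = encoded["sex"].tolist()
sex_classes = list(encoders["sex"].classes_)

# inverse_transform recovers the original labels from the codes.
decoded = list(encoders["sex"].inverse_transform(encoded["sex"]))
```

Note that these codes impose an arbitrary order on the categories. Whether that matters depends on the model that consumes them.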

## Step 3: Create dataframe of int type columns.

In [None]:
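# select_dtypes avoids comparing against the built-in int, whose NumPy width
# is platform-dependent and may not match pandas' int64 columns.
intColumns = df.select_dtypes(include='number')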
intNames = intColumns.columns
intColumns.head()

In [None]:
scaler = preprocessing.StandardScaler()

Using the scaler, standardize each column to zero mean and unit variance. This rescales the values but does not change the shape of their distribution.

In [None]:
df_int = pd.DataFrame()
for feature in intNames:
    # StandardScaler expects a 2-D array, so reshape to a single column and flatten the result.
    # https://stackoverflow.com/questions/18200052/how-to-convert-ndarray-to-array
    df_int[feature] = np.ravel(scaler.fit_transform(df[feature].values.reshape(-1, 1)))

In [None]:
df_int.head()
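What the scaler computes can be checked by hand. The sketch below uses a handful of made-up values (not taken from the dataset) and compares `StandardScaler` with the formula z = (x - mean) / std, where the standard deviation is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([25.0, 38.0, 28.0, 44.0, 18.0])  # illustrative values

# StandardScaler works on 2-D input, hence the reshape and ravel.
scaled = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

# Manual standardization; np.std uses ddof=0 by default, matching the scaler.
manual = (ages - ages.mean()) / ages.std()

print(np.allclose(scaled, manual))
```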

## Step 4: Concatenate the int and object dataframes into our final_df

In [None]:
final_df = pd.concat([df_int, df_object], axis=1)

In [None]:
final_df.head()
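Concatenating with `axis=1` aligns rows by index label rather than by position. The following sketch, on toy frames with made-up values, shows the expected result when the indexes match and what happens when they do not.

```python
import pandas as pd

left = pd.DataFrame({"age": [0.1, -1.2, 1.1]})   # index 0, 1, 2
right = pd.DataFrame({"sex": [1, 0, 0]})          # index 0, 1, 2

# Matching indexes: columns are placed side by side, row for row.
combined = pd.concat([left, right], axis=1)

# Deliberately mismatched index to show the alignment behaviour.
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

print(combined.shape, misaligned.shape)
```

Both frames in this notebook are built from scratch with a default index, so they should line up. A quick shape or `isna()` check after concatenating is a cheap safeguard.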