# Neural Networks: Part 2 - Data Prep

In [None]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Dataset Preparation

So far, we have only dealt with pre-processed datasets. Preparing data for machine learning is usually more time-consuming than actually developing and training models, so let's start with a slightly rougher dataset. Load in the wages dataset, which describes characteristics of a group of individuals. We will build a model that takes in demographic info and predicts whether an individual has health insurance.

In [None]:
import pandas as pd

df = pd.read_csv('data/wages.csv')
print(len(df))
df.head(5)

It's apparent that we have a mix of categorical and numeric data. Each type of data requires a different method of preparation.

- Categorical: use *one hot encoding*
- Numeric: if it is heavily right-skewed (roughly log-normally distributed), take the log of the data, then standardize to a mean of 0 and standard deviation of 1

However, before processing the data, we have to ask ourselves an important question.

Do we, as humans, expect all these features to be relevant for predicting health insurance? What are the ethical considerations of including this data?

Because we don't want to introduce a racial bias to our model, we will drop race as an input feature. The other features seem potentially relevant for our task at hand.

In [None]:
# drop the race column, then check the type of data in each remaining column
df = df.drop('race', axis=1)
print(df.dtypes)

### Numeric data

We will examine the distribution of each feature and determine the best way to transform it into a "numerically stable" range, that is, centered about 0 with a standard deviation of roughly 1. This puts all features on a comparable scale, so the network treats them similarly, and promotes numerical stability during training.
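The effect of these transforms can also be checked numerically. The sketch below uses a synthetic log-normal sample as a stand-in for a wage-like feature (it is not the real dataset); the helper names `zscore` and `skewness` are illustrative, not part of the notebook. It applies z = (x - mean) / std with and without a preceding log, then compares the skewness of the two results.

```python
import numpy as np

# Synthetic, right-skewed sample standing in for a wage-like feature (not the real data)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=4.5, sigma=0.5, size=5_000)


def zscore(values: np.ndarray) -> np.ndarray:
    """Center on the mean and scale by the (population) standard deviation."""
    return (values - values.mean()) / values.std()


def skewness(values: np.ndarray) -> float:
    """Third standardized moment: near 0 for symmetric data, positive for a long right tail."""
    return float(np.mean(zscore(values) ** 3))


plain = zscore(raw)           # standardized only
logged = zscore(np.log(raw))  # log first, then standardized

print(f"logged: mean={logged.mean():.3f}, std={logged.std():.3f}")
print(f"skewness without log: {skewness(plain):.2f}, with log: {skewness(logged):.2f}")
```

Standardizing alone fixes the location and scale but leaves the long right tail in place; taking the log first is what makes the distribution roughly symmetric.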

In [None]:
# use that information to visualize the distribution of data for each numeric column
numeric_columns = ["year","age","wage"]
fig, axs = plt.subplots(3,1, figsize=(10,10))
for i in range(3):
    axs[i].hist(df[numeric_columns[i]], bins = 25)
    axs[i].set_xlabel(numeric_columns[i], fontsize = 12)

In [None]:
# let's write a function that standardizes numeric data that we can apply to our dataframe
def standardize_numeric(series: pd.Series, use_log: bool = False) -> pd.Series:
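    # one possible implementation: optionally take the log of the input series, then standardize it
    if use_log:
        series = np.log(series)  # assumes strictly positive values
    return (series - series.mean()) / series.std()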

df['year_st'] = standardize_numeric(df['year'], False)
df['age_st'] = standardize_numeric(df['age'], False)
df['wage_st'] = standardize_numeric(df['wage'], True)

In [None]:
# visualize transformed numeric data
numeric_columns = ["year_st","age_st","wage_st"]
fig, axs = plt.subplots(3,1, figsize=(10,10))
for i in range(3):
    axs[i].hist(df[numeric_columns[i]], bins = 25)
    axs[i].set_xlabel(numeric_columns[i], fontsize = 12)

### Categorical data

We will identify the possible classes for each column, and then one-hot encode them. One-hot encoding represents each class as a binary vector with one element per possible class. For instance, given 3 possible labels of [ `Healthy`, `Sick`, `Unknown` ], say we have a data sample where an individual is labeled as 'Sick'. We can encode it like this:

        [0, 1, 0]

Each position in the vector corresponds to one class, and a '1' means that class is true for that entry. Technically, an encoding can contain multiple 1s, which means the task is *multi-label*, i.e. the classes are not mutually exclusive. In the scenario above, our classes are mutually exclusive, so we expect exactly one 1 per encoding.
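As a minimal sketch of the example above, the snippet below one-hot encodes a few made-up samples with `pd.get_dummies`. The sample values are invented for illustration; the class order is fixed through a `pd.Categorical` so the columns appear as `Healthy`, `Sick`, `Unknown`.

```python
import pandas as pd

# Fix the class order explicitly so the columns come out as Healthy, Sick, Unknown
classes = ["Healthy", "Sick", "Unknown"]
labels = pd.Series(
    pd.Categorical(["Sick", "Healthy", "Unknown", "Sick"], categories=classes)
)

# One column per class, holding 1 where the sample belongs to that class and 0 elsewhere
one_hot = pd.get_dummies(labels, dtype=int)
print(one_hot)

# The first sample is 'Sick', so only the middle position of its row is set
first_row = one_hot.iloc[0].tolist()
print(first_row)
```

Because the classes are mutually exclusive, every row sums to exactly 1.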

In [None]:
# check the type of data in each column
print(df.dtypes)

In [None]:
# look at all labels for each column, and the number of appearances of each
categoric_columns = ['sex','maritl','education','region','jobclass','health','health_ins']
for i in range(len(categoric_columns)):
    print(f"\nColumn: {categoric_columns[i]}")
    counts = df[categoric_columns[i]].value_counts()
    for label, count in counts.items():
        print(f"Label: '{label}' | Frequency: {count}")

Should we keep all these columns? Why or why not?

Are any of these features more **ordinal** in nature? That is, could you place the labels in a meaningful order along some axis and assign a value to each?

Are any of our labels particularly imbalanced? Why is that potentially a bad thing and what would you do to fix it?

In [None]:
# drop any columns you decide are not good to keep
# use df.drop to drop any columns you don't want

Now, one-hot encode the categorical data.

In [None]:
# this code converts the labels in each column into columns of their own, then populates them with 1s and 0s accordingly
# keep_categoric_columns = # populate here
for col in keep_categoric_columns:
    df = df.join(pd.get_dummies(df[col], dtype = 'int'), how = 'outer')
df.head(10)
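One thing to watch for with this pattern: the new columns are named after the labels themselves, so two categorical columns that happen to share a label would produce clashing column names. A hedged variation, shown here on a toy frame with made-up labels (not the wages data), uses the `prefix` argument of `pd.get_dummies` to keep the names unique.

```python
import pandas as pd

# Toy frame with hypothetical labels; both columns deliberately share the label "Other"
toy = pd.DataFrame({
    "jobclass": ["Industrial", "Other", "Industrial"],
    "region": ["South", "Other", "West"],
})

encoded = toy.copy()
for col in ["jobclass", "region"]:
    # prefix=col yields names such as 'region_South', so shared labels cannot collide
    dummies = pd.get_dummies(toy[col], prefix=col, dtype=int)
    encoded = pd.concat([encoded, dummies], axis=1)

print(encoded.columns.tolist())
```

If you adopt prefixed names, remember that your feature list later on has to use the prefixed column names as well.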

### Conversion

The dataframe is processed, and now must simply be converted into a tensor for our ML model.

In [None]:
# list out all your finalized INPUT feature columns.
# features = # populate here

# write down your TARGET columns
# Because our task is a binary classification task (has insurance, or doesn't), we can actually just represent the target 
# as a single vector of 0s and 1s, where 0 indicates no insurance, and 1 indicates possession of insurance.
# target = # populate here

# create a new dataframe that holds only your input and target columns
train_df = df[features + target]

Now, we will split our data. However, due to the imbalance of our target class, we will do *stratified* splitting. This keeps the class proportions of the target roughly the same in every split, so we won't accidentally over-represent a certain class in one of our splits due to random chance.

In [None]:
from sklearn.model_selection import train_test_split

# split off 60% for training; the remaining 40% is temporarily held in the val variables
x_train, x_val, y_train, y_val = train_test_split(train_df[features], train_df[target], train_size=0.6, stratify=train_df[target])

# split again to get a test set
x_val, x_test, y_val, y_test = train_test_split(x_val, y_val, train_size=0.5, stratify=y_val)

print("x train: ",x_train.shape, "y train:", y_train.shape)
print("x val: ",x_val.shape, "y val:", y_val.shape)
print("x test: ",x_test.shape, "y test:", y_test.shape)
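As a sanity check on stratified splitting, the sketch below repeats the same 60/20/20 scheme on a synthetic, imbalanced target and prints the positive fraction in each split. The toy frame and its column names `x` and `y` are assumptions for illustration, and `random_state` is fixed only so the sketch is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (70% ones) standing in for the insurance label
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=1_000),
    "y": np.repeat([1, 0], [700, 300]),
})

# First split: 60% train, 40% held back; stratify on the target
x_tr, x_rest, y_tr, y_rest = train_test_split(
    toy[["x"]], toy["y"], train_size=0.6, stratify=toy["y"], random_state=0
)
# Second split: divide the held-back 40% evenly into validation and test
x_va, x_te, y_va, y_te = train_test_split(
    x_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=0
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(part)}, positive fraction={part.mean():.3f}")
```

The same check can be run on your real splits by calling `.mean()` on each target split, assuming the target is coded as 0s and 1s.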

And finally, let's convert our data into tensors, and save it to disk so we don't have to do this again!

In [None]:
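# dataframes -> numpy arrays -> tensors
# one possible conversion (float32 for both inputs and targets is an assumption; adjust to your loss function)
x_train, x_val, x_test = (torch.tensor(d.to_numpy(), dtype=torch.float32) for d in (x_train, x_val, x_test))
y_train, y_val, y_test = (torch.tensor(d.to_numpy(), dtype=torch.float32) for d in (y_train, y_val, y_test))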

# store it in a dict that we can save out as a single file
data_dict = {'x_train':x_train, 'x_val':x_val, 'x_test':x_test, 'y_train':y_train, 'y_val':y_val, 'y_test':y_test}

# save it to local data directory
torch.save(data_dict, 'data/wages_processed.pt')

In [None]:
# and confirm it works by loading it back in as a tensor
data_dict = torch.load('data/wages_processed.pt')
print(data_dict.keys())

And we have completed data preparation! It can be a tedious procedure at times, but probably 80% of the time, errors during training can be traced back to data preparation, so it is absolutely necessary to know exactly how your data was gathered out in the wild, how it was filtered and prepared, and how it was transformed for your ML model.