# Cleaning and Preprocessing Data for AI/ML

In [7]:
import warnings
warnings.simplefilter('ignore')

# %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [8]:
# Step 1: Read the dataset into a Pandas Dataframe

brain = pd.read_csv("resources/brain_categorical.csv")
brain.head()

Unnamed: 0,gender,age,size,weight
0,Male,20-46,4512,1530
1,Male,20-46,3738,1297
2,Male,20-46,4261,1335
3,Male,20-46,3777,1282
4,Male,20-46,4177,1590


In [9]:
# Step 2: Identify your X and y

X = brain[["gender", "age", "size"]]
y = brain["weight"].values.reshape(-1, 1)
print(X.shape, y.shape)

(237, 3) (237, 1)


### Categorical Data

Machine learning algorithms are based on math, so their inputs must be numerical. Categorical data stored as strings therefore has to be converted into meaningful numbers first.  

Some methods include <strong>Integer</strong>, <strong>One-Hot</strong>, and <strong>Binary Encoding</strong>.  

Sklearn provides a <strong>preprocessing</strong> module that covers the methods mentioned above.  

<strong>Pandas</strong> also includes a function called <strong>get_dummies</strong> that returns binary encoded data from a DataFrame.
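
As a rough illustration of the sklearn route, the sketch below applies integer and one-hot encoding to a tiny, made-up `gender` column. The `toy` DataFrame is an assumption for demonstration only and is not the brain dataset; `OrdinalEncoder` stands in here for integer encoding of a feature column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, not taken from brain_categorical.csv
toy = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# Integer encoding: each category becomes one number.
# Categories are sorted, so Female -> 0.0 and Male -> 1.0.
ordinal = OrdinalEncoder()
integer_codes = ordinal.fit_transform(toy[["gender"]])

# One-hot encoding: each category becomes its own 0/1 column.
# fit_transform returns a sparse matrix by default, so convert it to a dense array.
onehot = OneHotEncoder()
onehot_codes = onehot.fit_transform(toy[["gender"]]).toarray()

print(ordinal.categories_[0])  # order of the categories
print(integer_codes.ravel())   # one integer code per row
print(onehot_codes)            # one column per category
```

Integer encoding implies an order between categories, which may mislead some models; one-hot encoding avoids that at the cost of extra columns.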

### Dummy Encoding (Binary Encoded Data)

<strong>Dummy Encoding</strong> transforms each categorical feature into one new column per category, filled with binary values (1 == True and 0 == False). In other words, a column holds 1 if that label is present in the original row and 0 otherwise.

In [10]:
# Note that the original data labels "gender" as Male or Female

# copy the data
data = X.copy()

# Use get dummies to binary encode the categorical label of male or female
data_encoded = pd.get_dummies(data, columns=["gender"])
data_encoded.head()

# Now gender_Male is 1 for males and gender_Female is 1 for females

Unnamed: 0,age,size,gender_Female,gender_Male
0,20-46,4512,0,1
1,20-46,3738,0,1
2,20-46,4261,0,1
3,20-46,3777,0,1
4,20-46,4177,0,1
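
Depending on the installed pandas version, `get_dummies` may return `True`/`False` columns rather than the 0/1 values shown above (boolean output is the default from pandas 2.0 onward). A minimal sketch, using a made-up `toy` DataFrame rather than the brain dataset, that requests integer output explicitly:

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "size": [4512, 3738, 4261],
})

# dtype=int asks for 0/1 integers regardless of the pandas default
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded)
```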


In [12]:
# Without the columns argument, get_dummies encodes every string column

data = X.copy()
data_encoded = pd.get_dummies(data)
data_encoded.head()


# Notice how all string-related labels are converted to binary

Unnamed: 0,size,gender_Female,gender_Male,age_20-46,age_46+
0,4512,0,1,1,0
1,3738,0,1,1,0
2,4261,0,1,1,0
3,3777,0,1,1,0
4,4177,0,1,1,0


### Scaling and Normalization

After encoding any categorical data, the next step is usually to normalize/scale the dataset. Scaling matters most for models trained with gradient descent; Linear Regression models tend not to show a difference. Putting the features on a comparable scale helps gradient-based algorithms converge on a local optimum more quickly and reliably.

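Sklearn offers a variety of scaling and normalization options. The two most common are <strong>MinMax</strong> and <strong>Standard Scaler</strong>. MinMax rescales each feature to a fixed range (0 to 1 by default), while Standard Scaler centers each feature on a mean of 0 with unit variance. Standard Scaler is best used when you don't know anything about your data. 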
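
To make the difference between the two scalers concrete, here is a minimal sketch on three made-up size values. The `sizes` array is an assumption for illustration and does not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column with three evenly spaced values
sizes = np.array([[3000.0], [3500.0], [4000.0]])

# MinMaxScaler maps the smallest value to 0 and the largest to 1
minmax_scaled = MinMaxScaler().fit_transform(sizes)

# StandardScaler subtracts the mean and divides by the standard deviation
standard_scaled = StandardScaler().fit_transform(sizes)

print(minmax_scaled.ravel())    # values between 0 and 1
print(standard_scaled.ravel())  # values centered on 0
```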

In [13]:
# Let's see how scaling and normalization work

# Step 1: import train_test_split
from sklearn.model_selection import train_test_split

In [14]:
# Step 2: use get_dummies to encode the categorical data, then split it into training and testing sets

X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

Unnamed: 0,size,gender_Female,gender_Male,age_20-46,age_46+
125,3782,0,1,0,1
119,3937,0,1,0,1
66,3415,0,1,0,1
216,3246,1,0,0,1
67,3372,0,1,0,1
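
A common next step, sketched here under stated assumptions, is to fit the scaler on the training split only and then apply it to both splits, so that no information from the test set leaks into the scaling parameters. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the brain data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (40 rows, 1 feature); not the brain dataset
rng = np.random.default_rng(42)
X_demo = rng.uniform(3000, 4500, size=(40, 1))
y_demo = rng.uniform(1000, 1600, size=(40, 1))

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, random_state=42
)

# Learn the mean and standard deviation from the training data only
X_scaler = StandardScaler().fit(X_train_demo)

# Apply the same transformation to both splits
X_train_scaled = X_scaler.transform(X_train_demo)
X_test_scaled = X_scaler.transform(X_test_demo)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The scaled training features have a mean of 0 by construction; the scaled test features generally will not, because they reuse the training statistics.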
