# Data Preprocessing    
**Day 3 - Friday, January 13**



- Preprocessing and Exploratory Data Analysis (EDA)
1. Obtain and Clean the data
2. Wrangle the data (do you have to combine or join data sets? Rearrange items?)
3. EDA
    a. Look at statistical calculations
    b. Graph the data
4. Draw conclusions and formulate hypotheses, looking for relationships that we might use in the model
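
Steps 3 and 4 might look something like the sketch below. It assumes a small DataFrame built from the first five rows of the exercise data used later in these notes; the name `eda` and the choice of columns are only illustrative.

```python
import pandas as pd

# First five rows of the exercise data set (three numeric columns only)
eda = pd.DataFrame({
    'Calories': [2520.0, 1850.0, 1925.0, 1790.0, 2120.0],
    'Length of Exercise (min)': [10, 10, 30, 20, 10],
    'Weight Lost': [0.0, 1.5, 1.2, 4.8, -2.0],
})

# 3a. Statistical calculations: count, mean, std, min, quartiles, max
summary = eda.describe()
print(summary)

# Count missing values per column
print(eda.isna().sum())

# 4. Look for relationships: pairwise Pearson correlations with the target
corr = eda.corr()
print(corr['Weight Lost'])

# 3b. Graph the data (requires matplotlib), for example:
# eda.plot.scatter(x='Calories', y='Weight Lost')
```

A strong positive or negative correlation with the target is a candidate relationship to test in the model, not a conclusion in itself.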


## Goal of EDA  
- Derive insights
- Generate Hypotheses  

## Preparing the data  
1. Null values and missing data
2. Remove outliers
3. Encoding categorical data (one-hot encoding for categories with no natural order and ordinal encoding for categories that have an order, such as Low/Medium/High)
    - One-hot encoding: a table of ones and zeros. Each column is one category. If the row belongs to that category, give it a one. If it doesn't, give it a zero. (Make sure to include a "None" column for NA values.)
4. Drop unrelated features
5. Feature scaling (for example, lot size, house price, and number of bedrooms have very different ranges, so you scale down the house price or scale up the number of bedrooms)
    - Put variables on similar scales
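
As a small illustration of step 3, here is a sketch of one-hot encoding a column that contains a missing value. The series below is made up for the example; `pd.get_dummies` is one of several ways to build the table.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
exercise = pd.Series(['Running', 'Stairs', np.nan, 'Swimming'], name='Exercise Type')

# Treat missing entries as their own "None" category before encoding
filled = exercise.fillna('None')

# One column per category: 1 where the row belongs to it, 0 otherwise
encoded = pd.get_dummies(filled).astype(int)
print(encoded)
```

Every row ends up with exactly one 1, and the row that was missing gets its 1 in the "None" column.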

## 3 ways to accomplish this  
1. Use basic functions
2. Use classes and objects 
3. Use pre-built classes in scikit-learn  







In [57]:
import numpy as np
import pandas as pd

data = pd.read_csv('Data/exercise.csv')
display(data)

Unnamed: 0,Date,Calories,Exercise Type,Length of Exercise (min),Quality of Exercise,Next-day Weight,Weight Lost
0,1-Jan,2520.0,Running,10,Low,194.5,0.0
1,2-Jan,1850.0,Stairs,10,High,193.0,1.5
2,3-Jan,1925.0,Swimming,30,Medium,191.8,1.2
3,4-Jan,1790.0,Running,20,High,187.0,4.8
4,5-Jan,2120.0,Stairs,10,Medium,189.0,-2.0
5,6-Jan,1910.0,Swimming,35,Medium,186.0,3.0
6,7-Jan,1845.0,Running,20,Low,186.0,0.0
7,8-Jan,2343.0,Stairs,15,Low,189.0,-3.0
8,9-Jan,1886.0,Swimming,30,High,188.0,1.0
9,10-Jan,2149.0,Running,15,High,190.0,-2.0


# Preprocessing  

1. Find and fill missing values
2. Ordinal encoding for quality of exercise
3. One-hot encoding for exercise type
4. Feature scaling for calories, length of exercise, and next-day weight so they are on similar scales
5. Drop the date column, since we don't really need it


# 1. Using Functions  


In [58]:
# Function to drop the date column
def drop_col(x, col):
    x.drop(col, axis=1, inplace=True)


# Function to fill missing values with the column mean (modifies x in place)
def fill_avg(x, col):
    x[col] = x[col].fillna(x[col].mean())


# One-hot encode
# data, type column
# Returns a new DataFrame, so assign the result: data = one_hot(data, col)
def one_hot(x, col):
    dummies = pd.get_dummies(x[col]).astype(int)
    return x.join(dummies).drop(col, axis=1)


# Ordinal Encoding
# data, quality of exercise column
def ordinal_enc(x, col):
    order = {
        'None': 0,
        np.nan: 0,
        'Low': 1,
        'Medium': 2,
        'High': 3
    }
    x[col] = x[col].map(order)


In [59]:
drop_col(data, 'Date')
data

Unnamed: 0,Calories,Exercise Type,Length of Exercise (min),Quality of Exercise,Next-day Weight,Weight Lost
0,2520.0,Running,10,Low,194.5,0.0
1,1850.0,Stairs,10,High,193.0,1.5
2,1925.0,Swimming,30,Medium,191.8,1.2
3,1790.0,Running,20,High,187.0,4.8
4,2120.0,Stairs,10,Medium,189.0,-2.0
5,1910.0,Swimming,35,Medium,186.0,3.0
6,1845.0,Running,20,Low,186.0,0.0
7,2343.0,Stairs,15,Low,189.0,-3.0
8,1886.0,Swimming,30,High,188.0,1.0
9,2149.0,Running,15,High,190.0,-2.0


# Feature Scaling  
Two ways to scale a variable:  
1. Normalization
    - (aka min-max scaling)
    - Scales all values to between 0 and 1 (the min maps to 0 and the max maps to 1)

$$
x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}
$$

2. Standardization
    - (aka z-score scaling)
    - Scales each value to a z-score based on the mean and standard deviation (SD)
    - Subtract the mean from the value, then divide by the SD. A z-score of 1 is 1 SD above the mean.
    
$$
    x_{scaled} = \frac{x - \bar{x}}{sd}
$$



In [60]:
# Feature Scaling Functions
# def standard_scale(x, col):

# get sd 
# x[col].std(ddof=1) gets the sample standard deviation (ddof=1 divides by n - 1 for sample data; ddof=0 divides by n for a population)
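
The two formulas above can be sketched as plain functions. This is one possible version, not necessarily the one intended for the stub: the names `min_max_scale` and `z_score_scale` are assumptions, and the three-row DataFrame is illustrative.

```python
import pandas as pd

def min_max_scale(x, col):
    """Return the column rescaled so its min maps to 0 and its max to 1."""
    col_min, col_max = x[col].min(), x[col].max()
    return (x[col] - col_min) / (col_max - col_min)

def z_score_scale(x, col, ddof=1):
    """Return the column as z-scores (ddof=1 uses the sample SD)."""
    return (x[col] - x[col].mean()) / x[col].std(ddof=ddof)

# Illustrative values, not the full data set
df = pd.DataFrame({'Length of Exercise (min)': [10, 20, 30]})
mm = min_max_scale(df, 'Length of Exercise (min)')
zs = z_score_scale(df, 'Length of Exercise (min)')
print(mm.tolist())  # [0.0, 0.5, 1.0]
print(zs.tolist())  # [-1.0, 0.0, 1.0]
```

Both functions return a new Series instead of modifying the DataFrame, so the original values stay available for comparison.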




**Note**: any feature scaling applied to the training data must also be applied to any new or real data that comes in. Use the SAME scaling parameters. For example, if your training data has a min of 10 and a max of 17, scale the test data with that same min of 10 and max of 17 instead of recomputing them from the test data. 
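
A short sketch of this rule, using scikit-learn's `MinMaxScaler` with made-up numbers that match the example (training min 10, max 17). The scaler is fitted once on the training data and then reused unchanged on the new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [12.0], [17.0]])  # training min 10, max 17
new = np.array([[10.0], [17.0], [24.0]])    # unseen data

scaler = MinMaxScaler()
scaler.fit(train)                      # learns min and max from the training data only
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)     # reuses the training min and max

# Approximately 0, 1, 2: a value above the training max scales to more than 1
print(new_scaled.ravel())
```

Calling `fit` again on the new data would silently change the scale, so the same raw value would mean something different to the model.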


## Piping  
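
A minimal sketch of chaining preprocessing functions, assuming "piping" here refers to pandas' `DataFrame.pipe`. The helper versions below return a new DataFrame instead of modifying it in place, which is what makes them chainable; the data values are made up.

```python
import numpy as np
import pandas as pd

def drop_col(x, col):
    """Return a copy of x without the given column."""
    return x.drop(columns=col)

def fill_avg(x, col):
    """Return a copy of x with missing values in col replaced by the mean."""
    return x.assign(**{col: x[col].fillna(x[col].mean())})

# Hypothetical three-row data set
raw = pd.DataFrame({
    'Date': ['1-Jan', '2-Jan', '3-Jan'],
    'Calories': [2000.0, np.nan, 1800.0],
})

# Each step receives the output of the previous one
clean = raw.pipe(drop_col, 'Date').pipe(fill_avg, 'Calories')
print(clean)
```
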


# 2. Classes and Objects

Define a class that has, for example, a "clean_data" method. 

Create an object of that class, and call the method.
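
One way this could look is sketched below. The class name `Preprocessor`, the method arguments, and the data are all hypothetical; the point is only the pattern of creating the object and calling its method.

```python
import numpy as np
import pandas as pd

class Preprocessor:
    """Hypothetical wrapper that bundles the cleaning steps for one data set."""

    def __init__(self, data):
        self.data = data.copy()  # leave the caller's DataFrame untouched

    def clean_data(self, drop=(), fill_mean=()):
        """Drop unneeded columns, then fill missing numeric values with the mean."""
        self.data = self.data.drop(columns=list(drop))
        for col in fill_mean:
            self.data[col] = self.data[col].fillna(self.data[col].mean())
        return self.data

raw = pd.DataFrame({
    'Date': ['1-Jan', '2-Jan', '3-Jan'],
    'Calories': [2000.0, np.nan, 1800.0],
})

prep = Preprocessor(raw)                                          # create the object
cleaned = prep.clean_data(drop=['Date'], fill_mean=['Calories'])  # call the method
print(cleaned)
```
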

# 3. Using the *scikit-learn* package



In [61]:
data = pd.read_csv('Data/exercise.csv')

X = data.drop(['Date', 'Weight Lost'], axis=1).values # .values converts the DataFrame to a NumPy array

# Replace missing "Quality of Exercise" values (column 3) with the string 'None'
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]

print(X)

y = np.array(data['Weight Lost'])
print(y)

[[2520.0 'Running' 10 'Low' 194.5]
 [1850.0 'Stairs' 10 'High' 193.0]
 [1925.0 'Swimming' 30 'Medium' 191.8]
 [1790.0 'Running' 20 'High' 187.0]
 [2120.0 'Stairs' 10 'Medium' 189.0]
 [1910.0 'Swimming' 35 'Medium' 186.0]
 [1845.0 'Running' 20 'Low' 186.0]
 [2343.0 'Stairs' 15 'Low' 189.0]
 [1886.0 'Swimming' 30 'High' 188.0]
 [2149.0 'Running' 15 'High' 190.0]
 [1797.0 'Stairs' 10 'High' 187.0]
 [nan 'Swimming' 25 'Medium' 186.0]
 [1934.0 'Running' 10 'High' 184.0]
 [2129.0 'Stairs' 5 'Low' 186.0]
 [1872.0 'Swimming' 0 'None' 185.0]
 [1957.0 'Running' 15 'Medium' 183.5]
 [1790.0 'Stairs' 10 'Medium' 181.0]
 [nan 'Swimming' 30 'High' 180.0]
 [1842.0 'Running' 25 'High' 178.0]
 [2173.0 'Stairs' 15 'Medium' 178.3]]
[ 0.   1.5  1.2  4.8 -2.   3.   0.  -3.   1.  -2.   3.   1.   2.  -2.
  1.   1.5  2.5  1.   2.  -0.3]


In [62]:
# Fill in missing values
# calories - col 0
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:,0:1])
X[:, 0:1] = imputer.transform(X[:, 0:1])
X

array([[2520.0, 'Running', 10, 'Low', 194.5],
       [1850.0, 'Stairs', 10, 'High', 193.0],
       [1925.0, 'Swimming', 30, 'Medium', 191.8],
       [1790.0, 'Running', 20, 'High', 187.0],
       [2120.0, 'Stairs', 10, 'Medium', 189.0],
       [1910.0, 'Swimming', 35, 'Medium', 186.0],
       [1845.0, 'Running', 20, 'Low', 186.0],
       [2343.0, 'Stairs', 15, 'Low', 189.0],
       [1886.0, 'Swimming', 30, 'High', 188.0],
       [2149.0, 'Running', 15, 'High', 190.0],
       [1797.0, 'Stairs', 10, 'High', 187.0],
       [1990.6666666666667, 'Swimming', 25, 'Medium', 186.0],
       [1934.0, 'Running', 10, 'High', 184.0],
       [2129.0, 'Stairs', 5, 'Low', 186.0],
       [1872.0, 'Swimming', 0, 'None', 185.0],
       [1957.0, 'Running', 15, 'Medium', 183.5],
       [1790.0, 'Stairs', 10, 'Medium', 181.0],
       [1990.6666666666667, 'Swimming', 30, 'High', 180.0],
       [1842.0, 'Running', 25, 'High', 178.0],
       [2173.0, 'Stairs', 15, 'Medium', 178.3]], dtype=object)

In [63]:
# One-hot encode
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()

onehot.fit_transform(X[:, 1:2]).toarray()

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [64]:
# ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['None', 'Low', 'Medium', 'High']])
oe.fit_transform(X[:,3].reshape(-1,1))

array([[1.],
       [3.],
       [2.],
       [3.],
       [2.],
       [2.],
       [1.],
       [1.],
       [3.],
       [3.],
       [3.],
       [2.],
       [3.],
       [1.],
       [0.],
       [2.],
       [2.],
       [3.],
       [3.],
       [2.]])
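
The individual steps above could also be combined into a single transformer. The following is a sketch, assuming scikit-learn's `ColumnTransformer` and a few illustrative rows in place of the full exercise.csv. `StandardScaler` is added here for the feature-scaling step, which the cells above do not cover.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# A few illustrative rows in the same layout as the exercise data
df = pd.DataFrame({
    'Calories': [2520.0, np.nan, 1925.0, 1790.0],
    'Exercise Type': ['Running', 'Stairs', 'Swimming', 'Running'],
    'Length of Exercise (min)': [10, 10, 30, 20],
    'Quality of Exercise': ['Low', 'High', 'None', 'Medium'],
    'Next-day Weight': [194.5, 193.0, 191.8, 187.0],
})

numeric = ['Calories', 'Length of Exercise (min)', 'Next-day Weight']

preprocess = ColumnTransformer([
    # Fill missing numeric values with the column mean, then z-score scale
    # (StandardScaler divides by the population SD, i.e. ddof=0)
    ('num', make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()), numeric),
    # One 0/1 column per exercise type
    ('type', OneHotEncoder(), ['Exercise Type']),
    # Ordered categories mapped to 0, 1, 2, 3
    ('quality', OrdinalEncoder(categories=[['None', 'Low', 'Medium', 'High']]),
     ['Quality of Exercise']),
], sparse_threshold=0)  # always return a dense array

X_ready = preprocess.fit_transform(df)
print(X_ready.shape)  # (4, 7): 3 scaled + 3 one-hot + 1 ordinal
```

Because the fitted `preprocess` object stores the training means, SDs, and categories, calling `preprocess.transform` on new data applies the same scaling and encoding.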

# Cross Validation