# Lecture 6 Pre-Processing
__MATH 3480__ - Dr. Michael Olson

Reading:
* Geron, Chapter 2, pp. 62-75

To get data ready for modeling, we first have to pre-process it. Pre-processing involves a few steps, some of which we have already seen:
1. Handling missing data
2. Encoding categorical data
3. Splitting the data (cross-validation)
4. Feature scaling

We're going to look at this three ways:
1. Using functions as we have seen in our courses so far
   * Additionally, how to execute these in one command (piping)
2. Using classes and objects
3. Using pre-built classes in *scikit-learn*
   * Additionally, how to execute these in one command (piping)
   
-----

We will use the following dataset on weight loss in each case.

In [None]:
import numpy as np
import pandas as pd

exercise = pd.read_csv('Data/exercise.csv')
display(exercise)

Looking at the data, here is what we need to do before we can use it in a model:
* Drop the *Date* column
* Fill in the missing values in the *Calories* column
    * Let's replace them with the column mean
* *Exercise Type* is a nominal variable and needs to become numerical
    * Being a nominal variable, we don't want to just turn the categories into numbers as we don't want to unintentionally indicate an order
    * Let's use One-hot encoding (also known as dummy variables)
* *Quality of Exercise* is an ordinal variable and needs to become numerical
    * Since there is an order to the categories, we can merely replace each category with a numerical value

## Using functions

In [None]:
# Drop the date column
def drop_col(x,col):
    x.drop(col, axis=1, inplace=True)
    return x

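# Function to fill in missing values
def fill_avg(x,col):
    # Assign back to the column; calling replace(..., inplace=True) on x[col]
    # may act on a temporary copy in recent versions of pandas
    x[col] = x[col].fillna(x[col].mean())
    return x

# One-hot encode
def one_hot(x,col):
    # Use the DataFrame passed in (x), not the global `exercise`
    x = x.join(pd.get_dummies(x[col]).astype(int)).drop(col, axis=1)
    return x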

# Ordinal Encode
def ordinal_encode(x,col):
    order = {
        'None':0,
        np.nan:0,
        'Low':1,
        'Medium':2,
        'High':3
    }
    x[col] = x[col].map(order)
    return x

In [None]:
drop_col(exercise,'Date')

In [None]:
fill_avg(exercise,'Calories')

In [None]:
exercise = one_hot(exercise,'Exercise Type')
exercise

In [None]:
exercise = ordinal_encode(exercise,'Quality of Exercise')
exercise

Now, our data is 100% numerical, and ready to be put into a model.
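
One way to check the "100% numerical" claim rather than eyeball it is sketched below. The helper `is_model_ready` and the two toy DataFrames are hypothetical and use made-up values; they are not part of the lecture's dataset.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_model_ready(df):
    """Hypothetical helper: True if every column is numeric and nothing is missing."""
    all_numeric = all(is_numeric_dtype(df[c]) for c in df.columns)
    no_missing = not df.isna().any().any()
    return all_numeric and no_missing

# Toy frames with made-up values, standing in for the exercise data
raw = pd.DataFrame({'Calories': [300.0, np.nan],
                    'Quality of Exercise': ['Low', 'High']})
clean = pd.DataFrame({'Calories': [300.0, 250.0],
                      'Quality of Exercise': [1, 3]})

print(is_model_ready(raw))    # text column and a missing value
print(is_model_ready(clean))  # all numeric, nothing missing
```

Running the same check on `exercise` after the four steps above should tell you whether anything was missed.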

#### Piping functions into one command

We can also run all of these functions in one command by using the output of one function as the input to the next. The most direct way is to nest the calls, but the result is messy.

In [None]:
exercise = pd.read_csv('Data/exercise.csv')
exercise = ordinal_encode(one_hot(fill_avg(drop_col(exercise,'Date'),'Calories'),'Exercise Type'),'Quality of Exercise')
exercise

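However, this code is very difficult to read. So we use __piping__ instead: `DataFrame.pipe` calls each function in turn, passing the DataFrame returned by the previous step as the first argument, so the steps read top to bottom in the order they run.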

In [None]:
exercise = pd.read_csv('Data/exercise.csv')
exercise = (exercise.pipe(drop_col,'Date')
                    .pipe(fill_avg,'Calories')
                    .pipe(one_hot,'Exercise Type')
                    .pipe(ordinal_encode, 'Quality of Exercise')
            )
exercise
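
To see that piping and nesting give the same result, here is a small sketch on a toy DataFrame. The functions `add_constant` and `scale` are hypothetical stand-ins for the pre-processing functions above, chosen only because their output is easy to check by hand.

```python
import pandas as pd

def add_constant(df, col, value):
    """Hypothetical step: return a copy of df with `value` added to column `col`."""
    out = df.copy()
    out[col] = out[col] + value
    return out

def scale(df, col, factor):
    """Hypothetical step: return a copy of df with column `col` multiplied by `factor`."""
    out = df.copy()
    out[col] = out[col] * factor
    return out

toy = pd.DataFrame({'a': [1, 2, 3]})

# Nested calls: read from the inside out
nested = scale(add_constant(toy, 'a', 10), 'a', 2)

# Piped calls: read from top to bottom
piped = (toy.pipe(add_constant, 'a', 10)
            .pipe(scale, 'a', 2))

print(piped.equals(nested))
```

Each `.pipe(f, args)` is just `f(df, args)`, so any function that takes a DataFrame as its first argument and returns a DataFrame can be chained this way.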

## Using classes and objects
(Working on this section)

## Using *scikit-learn*

*Scikit-learn* has a number of classes for these preprocessing tasks. They offer more features and handle the job more cleanly than our self-made functions, so they are usually the better option.

In [None]:
# Set up variables
exercise = pd.read_csv('Data/exercise.csv')
X = exercise.drop(['Date','Weight Lost'], axis=1).values

# Ordinal Encoder won't like nan values. Change to 'None'
# This fits with data since there was 0 activity for that day
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]

print(X)

In [None]:
y = np.array(exercise['Weight Lost'])
print(y)

In [None]:
# Fill Missing Values
## Calories = Column 0

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:,0:1])
X[:,0:1] = imputer.transform(X[:,0:1])

print(X)
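
The fit/transform pattern above can be tried on a tiny array where the mean is easy to verify by hand. This is a sketch with made-up calorie values, not the lecture's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (made-up numbers)
calories = np.array([[200.0], [np.nan], [400.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(calories)  # fit learns the mean, transform fills it in

print(imputer.statistics_)  # the learned mean of each column
print(filled.ravel())
```

The mean is learned in `fit` and stored in `statistics_`, so the same fitted imputer can later fill new data with the value learned here.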

In [None]:
# One-hot Encode Nominal Variables
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()
onehot.fit_transform(X[:,1:2]).toarray()

# Columns are in Alphabetical Order
# 1st Column = Running
# 2nd Column = Stairs
# 3rd Column = Swimming
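
Rather than relying on the alphabetical ordering noted in the comments, the fitted encoder can report its column order through the `categories_` attribute. A minimal sketch on a toy column, assuming the same three exercise types:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column of exercise types (made-up rows)
types = np.array([['Swimming'], ['Running'], ['Stairs'], ['Running']], dtype=object)

onehot = OneHotEncoder()
encoded = onehot.fit_transform(types).toarray()

print(onehot.categories_[0])  # the category behind each output column, in order
print(encoded)
```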

In [None]:
# Ordinal Encode Ordinal Variables
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['None','Low','Medium','High']])
oe.fit_transform(X[:,3].reshape(-1,1))
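
The `categories` argument matters because, left to its defaults, `OrdinalEncoder` sorts string categories alphabetically. The sketch below compares the two on a toy column holding the four quality levels:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column listing each quality level once
quality = np.array([['None'], ['Low'], ['Medium'], ['High']], dtype=object)

# Default: categories are sorted alphabetically (High, Low, Medium, None)
default_codes = OrdinalEncoder().fit_transform(quality).ravel()

# Explicit: we supply the order from lowest to highest
explicit = OrdinalEncoder(categories=[['None', 'Low', 'Medium', 'High']])
explicit_codes = explicit.fit_transform(quality).ravel()

print(default_codes)
print(explicit_codes)
```

With the default ordering, *High* receives the smallest code and *None* the largest, which does not reflect the real order of the categories.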

#### Piping functions in one command

In [None]:
# One-hot Encode nominal variables and Ordinal Encode
# ordinal variables but keep all variables

# Reload Data and set up variables
exercise = pd.read_csv('Data/exercise.csv')
X = exercise.drop(['Date','Weight Lost'], axis=1).values

# Ordinal Encoder won't like nan values. Change to 'None'
# This fits with data since there was 0 activity for that day
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# When putting in the columns in each imputer/encoder, indicate the column
# of the original matrix
  # [0]: Calories
  # [1]: Exercise Type
  # [3]: Quality of Exercise

ct = ColumnTransformer(transformers=[
      ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean'), [0]),  # This is placed first in X
      ('onehot', OneHotEncoder(), [1]),                                         # This is placed second in X
      ('oe', OrdinalEncoder(categories=[['None','Low','Medium','High']]), [3])  # This is placed third in X
    ], remainder='passthrough')                     # Remaining columns placed in order after the last encoder



X = np.array(ct.fit_transform(X))
X
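
Steps 3 and 4 from the list at the top of this lecture (splitting the data and feature scaling) are not demonstrated above. The following is a minimal sketch of one common approach, not necessarily the one used in this course: a single hold-out split with `train_test_split` (simpler than full cross-validation), followed by `StandardScaler`. The arrays `X_toy` and `y_toy` are randomly generated stand-ins for the encoded data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data: 20 rows, 3 numeric features
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(20, 3))
y_toy = rng.normal(size=20)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Learn the mean and standard deviation from the training rows only,
# then apply that same scaling to the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print(X_train_scaled.mean(axis=0).round(6))
```

Fitting the scaler on the training rows alone keeps information about the test rows from leaking into the pre-processing.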