# Lecture 3 Preprocessing
__MATH 3480__ - Dr. Michael Olson

Reading:
* Geron, Chapter 2, pp. 62-75

In Exploratory Data Analysis, we need to follow these steps:
1. Obtain and Clean the Data
2. Wrangle the Data
3. Look at statistical calculations
4. Graph the data 
5. Draw conclusions and make hypotheses from (3) and (4), looking for relationships that we might use

|              | Quantitative Data | Categorical Data |
| :----------- | :---------------- | :--------------- |
| Calculations | Mean, Mode<br>5-summary Statistics<br>Distributions (count, standard deviation/variance) | Probabilities<br>Expected Values<br>Probability/Binomial/etc. Distributions |
| Graphs       | Histogram/KDE (kernel density estimator)<br>Boxplot/Violinplot<br>Scatterplot<br>Timeseries<br>Heatmap | Barplot<br>Pie Chart<br>Venn Diagram<br>Tree Diagram |

The goal of EDA:
* Derive Insights
* Generate Hypotheses

In order to have data ready for modeling, we have to pre-process the data. For the pre-processing, we have a few steps, some of which we have seen:

1. Take care of missing data
2. Encoding categorical data
3. Splitting the Data (Cross Validation)
4. Feature Scaling

We're going to look at this in three ways:

1. Using functions as we have seen in our courses so far
   * Additionally, how to execute these in one command (piping)
2. Using classes and objects (Still building this part of the lecture)
3. Using pre-built classes in *scikit-learn*
   * Additionally, how to execute these in one command (piping)

-----

To add to the lecture:
* leave-one-out Cross Validation
* k-fold Cross Validation

-----

We will use the following dataset on weight loss in each case.
> For a reminder on Obtaining and Loading data, look at the [MATH 3080 Notes](https://github.com/drolsonmi/math3080)

In [None]:
import numpy as np
import pandas as pd

exercise = pd.read_csv('Data/exercise.csv')
display(exercise)

Looking at the data, here is what we need to do before we can use it in a model:
* Drop the *Date* column
* Fill in the missing values in the *Calories* column
    * Let's replace them with the mean value
* *Exercise Type* is a nominal variable and needs to become numerical
    * Being a nominal variable, we don't want to just turn the categories into numbers as we don't want to unintentionally indicate an order
    * Let's use One-hot encoding (also known as dummy variables)
* *Quality of Exercise* is an ordinal variable and needs to become numerical
    * Since there is an order to the categories, we can merely replace each category with a numerical value

## Using functions

In [12]:
# Drop the date column
def drop_col(x,col):
    x.drop(col, axis=1, inplace=True)
    return x

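# Function to fill in missing values
def fill_avg(x,col):
    # Assign the result back to the column; replace(..., inplace=True) on x[col]
    # can act on a temporary copy and leave x unchanged
    x[col] = x[col].fillna(x[col].mean())
    return x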

# One-hot encode
def one_hot(x,col):
    x = x.join(pd.get_dummies(x[col]).astype(int)).drop(col, axis=1)
    return x

# Ordinal Encode
def ordinal_encode(x,col):
    order = {
        'None':0,
        np.nan:0,
        'Low':1,
        'Medium':2,
        'High':3
    }
    x[col] = x[col].map(order)
    return x

In [None]:
drop_col(exercise,'Date')

In [None]:
fill_avg(exercise,'Calories')

In [None]:
exercise = one_hot(exercise,'Exercise Type')
exercise

In [None]:
exercise = ordinal_encode(exercise,'Quality of Exercise')
exercise

Now, our data is 100% numerical, and ready to be put into a model.

### Feature Scaling
Even though the data is all numerical and ready for the model, it could still potentially cause problems. For example, let's say we are looking at the housing market and want to compare the price of the house and the number of bedrooms to the square footage. The scale of the house price (around $500,000) is very different from the scale of the number of bedrooms (2-7). Since the scale for the house price is so much larger, the variation of prices is larger, and this may weigh more heavily in a model than the number of bedrooms, when the number of bedrooms may be a better indicator.

To put the variables on the same scale, we apply __feature scaling__, where we rescale all features so that they are on similar scales. There are two common ways to do this:
1. __Min-Max Scaling__
* Scales all values to a range of [0,1], 0 representing the minimum, 1 the maximum
$$x_{scaled} = \frac{x-x_{min}}{x_{max}-x_{min}}$$

2. __Standardization__
* Scales all values based on the mean and standard deviation
$$x_{scaled} = \frac{x-\bar{x}}{s}$$

In [10]:
def min_max_scale(x, col):
    return (x[col] - x[col].min()) / (x[col].max() - x[col].min())

def standard_scale(x, col):
    return (x[col] - x[col].mean()) / x[col].std(ddof=1)

In [None]:
for c in exercise.columns:
    exercise[c] = min_max_scale(exercise, c)

display(exercise.head())

Note that any test data and any new data sent to the model has to go through the same preprocessing (one-hot encoding, ordinal encoding, feature scaling) that the training data did. For this reason, it is nice to simplify and automate the process, which is what we will discuss next.
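
One way to make sure new data goes through the same scaling is to record the scaling parameters from the training data and reuse them. The sketch below does this for min-max scaling. The helper names `fit_min_max` and `apply_min_max` and all numbers are hypothetical, chosen only for illustration.

```python
import pandas as pd

def fit_min_max(train, cols):
    """Record the minimum and maximum of each column in the training data."""
    return {c: (train[c].min(), train[c].max()) for c in cols}

def apply_min_max(data, params):
    """Scale a copy of `data` using previously recorded (min, max) pairs."""
    out = data.copy()
    for c, (lo, hi) in params.items():
        out[c] = (out[c] - lo) / (hi - lo)
    return out

# Made-up numbers for illustration only
train = pd.DataFrame({'Calories': [100.0, 300.0, 500.0]})
new = pd.DataFrame({'Calories': [200.0, 600.0]})

params = fit_min_max(train, ['Calories'])    # learned from training data only
train_scaled = apply_min_max(train, params)
new_scaled = apply_min_max(new, params)      # same parameters, no refitting
print(new_scaled)
```

Because the minimum and maximum come from the training data, a new value outside the training range (600 here) lands outside [0,1]. That is expected and keeps the new data on the same scale the model was trained on.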

### Piping functions into one command

We can also run all of these functions in one command by taking the output of one function and using it as the input for the next. The most direct way is to nest the calls, although the result is messy.

In [None]:
exercise = pd.read_csv('Data/exercise.csv')
exercise = ordinal_encode(one_hot(fill_avg(drop_col(exercise,'Date'),'Calories'),'Exercise Type'),'Quality of Exercise')
exercise

However, this code is very difficult to read. So, we use __piping__ instead, which sends a dataset into a function, whose result is sent to another function, whose result is sent to another function, etc.

In [None]:
exercise = pd.read_csv('Data/exercise.csv')
exercise = (exercise.pipe(drop_col,'Date')
                    .pipe(fill_avg,'Calories')
                    .pipe(one_hot,'Exercise Type')
                    .pipe(ordinal_encode, 'Quality of Exercise')
            )
exercise

## Using classes and objects
(Working on this section)

In [None]:
import numpy as np
import pandas as pd

class LoadExercise():
    def __init__(self, url):
        self.data = pd.read_csv(url)

    def clean_data(self):
        # Drop column
        self.data.drop('Date', axis=1, inplace=True) 
        # Fill in missing values with the column mean (assign the result back to the column)
        self.data['Calories'] = self.data['Calories'].fillna(self.data['Calories'].mean())
        # One-hot encode categorical data
        self.data = self.data.join(pd.get_dummies(self.data['Exercise Type']).astype(int)).drop('Exercise Type', axis=1) 
        # Ordinal encode categorical data
        order = {
            'None':0,
            np.nan:0,
            'Low':1,
            'Medium':2,
            'High':3
        }
        self.data['Quality of Exercise'] = self.data['Quality of Exercise'].map(order)

exercise = LoadExercise('Data/exercise.csv')
display(exercise.data.head())

exercise.clean_data()
display(exercise.data.head())

## Using *scikit-learn*

*Scikit-learn* has a number of classes for these preprocessing tasks. They offer many features and do the job more effectively and cleanly, so they are a better option than our self-made functions.

In [None]:
# Set up variables
exercise = pd.read_csv('Data/exercise.csv')
X = exercise.drop(['Date','Weight Lost'], axis=1).values

# Ordinal Encoder won't like nan values. Change to 'None'
# This fits with data since there was 0 activity for that day
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]  # pd.isna catches any NaN, not only the np.nan object itself

print(X)

In [None]:
y = np.array(exercise['Weight Lost'])
print(y)

In [None]:
# Fill Missing Values
## Calories = Column 0

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:,0:1])
X[:,0:1] = imputer.transform(X[:,0:1])

print(X)

In [None]:
# One-hot Encode Nominal Variables
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()
onehot.fit_transform(X[:,1:2]).toarray()

# Columns are in Alphabetical Order
# 1st Column = Running
# 2nd Column = Stairs
# 3rd Column = Swimming

In [None]:
# Ordinal Encode Ordinal Variables
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['None','Low','Medium','High']])
oe.fit_transform(X[:,3].reshape(-1,1))

### Piping functions into one command

In [None]:
# One-hot Encode nominal variables and Ordinal Encode
# ordinal variables but keep all variables

# Reload Data and set up variables
exercise = pd.read_csv('Data/exercise.csv')
X = exercise.drop(['Date','Weight Lost'], axis=1).values

# Ordinal Encoder won't like nan values. Change to 'None'
# This fits with data since there was 0 activity for that day
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]  # pd.isna catches any NaN, not only the np.nan object itself

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# When putting in the columns in each imputer/encoder, indicate the column
# of the original matrix
  # [0]: Calories - fill missing values
  # [1]: Exercise Type - One-hot encoding
  # [3]: Quality of Exercise - Ordinal encoding

ct = ColumnTransformer(transformers=[
      ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean'), [0]),  # This is placed first in X
      ('onehot', OneHotEncoder(), [1]),                                         # This is placed second in X
      ('oe', OrdinalEncoder(categories=[['None','Low','Medium','High']]), [3])  # This is placed third in X
    ], remainder='passthrough')                     # Remaining columns placed in order after the last encoder



X = np.array(ct.fit_transform(X))
X
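
The `ColumnTransformer` can itself be one step in a scikit-learn `Pipeline`, so that encoding and scaling are fitted once and then reused on new data. The following is a minimal sketch under stated assumptions, not the notebook's own workflow: it uses a made-up frame (only the column names come from the notes), selects columns by name rather than by position, and picks `MinMaxScaler` as the scaling step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Made-up rows; only the column names come from the notes
train = pd.DataFrame({
    'Calories': [250.0, np.nan, 400.0, 100.0],
    'Exercise Type': ['Running', 'Stairs', 'Swimming', 'Running'],
    'Quality of Exercise': ['Low', 'High', 'Medium', 'None'],
})
new = pd.DataFrame({
    'Calories': [np.nan],
    'Exercise Type': ['Stairs'],
    'Quality of Exercise': ['Medium'],
})

ct = ColumnTransformer(transformers=[
    ('imputer', SimpleImputer(strategy='mean'), ['Calories']),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['Exercise Type']),
    ('oe', OrdinalEncoder(categories=[['None', 'Low', 'Medium', 'High']]),
     ['Quality of Exercise']),
], sparse_threshold=0)  # sparse_threshold=0 always returns a dense array

prep = Pipeline(steps=[('encode', ct), ('scale', MinMaxScaler())])

X_train = prep.fit_transform(train)  # learn the mean, categories, and ranges
X_new = prep.transform(new)          # reuse what was learned; no refitting
print(X_train.shape)
print(X_new)
```

The missing *Calories* value in `new` is filled with the mean learned from `train`, which is the behavior we want for test data and future data.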

## Cross Validation

In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=22)

In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test
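
The call to `train_test_split` above makes a single hold-out split. k-fold and leave-one-out cross validation repeat the split so that every row is used for testing exactly once. The sketch below shows how the index splits could be generated with scikit-learn's `KFold` and `LeaveOneOut`. The array `X_demo` is made up for illustration, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)

# k-fold: split the rows into k groups; each group serves once as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=22)
kfold_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in kf.split(X_demo)]
print(kfold_sizes)

# Leave-one-out: each row is the test set once (k equals the number of rows)
loo = LeaveOneOut()
loo_splits = list(loo.split(X_demo))
print(len(loo_splits))
```

Each pair of index arrays can be used to slice `X` and `y` into a training and a testing part, in the same way the single split above is used.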

## Feature Scaling
The scales of variables can have a very large impact on the results of the model. For instance, consider this example of employee salaries:


In [None]:
import pandas as pd

salaries = {
    'ID':['01','02','03'],
    'Salary':[70000,60000,52000],
    'Years of Experience':[5,4,1]
}

df = pd.DataFrame(salaries, index=salaries['ID']).drop('ID', axis=1)
df

We want to group employees together. Employees 1 and 3 are definitely in different groups. But how would we group Employee 2? Employee 2 is closer to Employee 3 in salary, but closer to Employee 1 in experience.

The scale is throwing us off, so we look at __feature scaling__. There are three common methods of feature scaling:
1. Standardization
$$\hat{x} = \frac{x-\bar{x}}{s}$$
2. Min-Max Scaling
$$\hat{x} = \frac{x-x_{min}}{x_{max}-x_{min}}$$
3. Normalization
$$\hat{x} = \frac{x-\bar{x}}{x_{max}-x_{min}}$$

Standardization will generally give a number in the range [-3,3] (outliers will be more extreme than that). Min-max scaling always gives a result in [0,1], while normalization always gives a result in [-1,1], since it subtracts the mean instead of the minimum.

Let's see how the first two methods affect the data.

In [None]:
# Standardize
def standardize_df(x):
    return (x-x.mean())/(x.std(ddof=1))

standardize_df(df)

In [None]:
# Min-max scale
def min_max_df(x):
    return (x-x.min())/(x.max()-x.min())

min_max_df(df)
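
Scikit-learn provides `MinMaxScaler` and `StandardScaler` for the min-max and mean/standard-deviation formulas. The sketch below applies both to the salary data and compares the results with the formulas computed by hand. One detail to be aware of: `StandardScaler` divides by the population standard deviation (`ddof=0`), so its output differs slightly from a hand calculation that uses `ddof=1`. The name `df_demo` is introduced here only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df_demo = pd.DataFrame(
    {'Salary': [70000, 60000, 52000], 'Years of Experience': [5, 4, 1]},
    index=['01', '02', '03'],
)

# Min-max scaling: (x - min) / (max - min)
minmax = MinMaxScaler().fit_transform(df_demo)

# Standardization: (x - mean) / std, where scikit-learn uses ddof=0
zscores = StandardScaler().fit_transform(df_demo)

# The same results computed by hand with pandas
manual_minmax = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())
manual_z = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

print(np.allclose(minmax, manual_minmax), np.allclose(zscores, manual_z))
```

As with the other scikit-learn transformers, calling `fit` on the training data and `transform` on the test data keeps both on the same scale.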

What do we see? In the original data, the salary gap between Employees 1 and 2 (10,000) is larger than the gap between Employees 2 and 3 (8,000), so on the raw numbers we'd say that Employee 2 is closer to Employee 3. But in the scaled data, the salary of Employee 2 is very nearly in the middle (close to 0 when standardized, close to 0.5 when min-max scaled), so salary alone does not tell us which group Employee 2 belongs to. Looking at the Years of Experience, we see Employee 2 is much closer to Employee 1. So, it is more likely for Employee 2 to be grouped with Employee 1.
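
This comparison can be checked numerically. The sketch below measures how close two employees are using Euclidean distance over both features, before and after min-max scaling. Euclidean distance is an assumption made here for illustration, since the notes do not name a distance, and `df_demo` and `distance` are hypothetical names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'Salary': [70000, 60000, 52000], 'Years of Experience': [5, 4, 1]},
    index=['01', '02', '03'],
)

def distance(frame, a, b):
    """Euclidean distance between rows `a` and `b` of `frame`."""
    return float(np.sqrt(((frame.loc[a] - frame.loc[b]) ** 2).sum()))

# Min-max scale every column to [0, 1]
scaled = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

for name, frame in [('raw', df_demo), ('min-max', scaled)]:
    d12 = distance(frame, '01', '02')
    d23 = distance(frame, '02', '03')
    print(f'{name}: d(1,2) = {d12:.3f}, d(2,3) = {d23:.3f}')
```

On the raw data the salary column dominates the distance, so Employee 2 looks closer to Employee 3. After scaling, both features contribute comparably and Employee 2 is closer to Employee 1.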

-----
# Class Project 1
A survey asked young people a few questions about their preferences. Here is a quick explanation of the dataset:
* The ['Music', 'Techno', 'Movies', 'History', 'Mathematics', 'Pets', 'Spiders'] columns indicate how much the person likes or dislikes each category on a scale of 1-5
* The ['Loneliness'] column indicates how lonely a person feels on a scale of 1-5
* The ['Parents Advice'] column indicates how much the person appreciates advice from parents on a scale of 1-5
* The ['Internet usage'] column indicates how much time is spent online on a scale of 1-5
* The ['Finances'] column indicates how stable the person is financially on a scale of 1-5
* The ['Age'] column is the person's age
* The ['Siblings'] column is the number of siblings the person has
* The ['Gender'] column is male/female
* The ['Village - town'] column indicates whether the person lives in the city (urban living) or in a village (rural living)

We are going to use the dataset to create a model to predict whether a person is likely to be lonely or not. Your job in this project is to complete the entire data preprocessing for the data.

* Load the *young-people-survey-responses* dataset
  * Located on the [github page](https://github.com/drolsonmi/math3480)
* Perform Data Preprocessing on this data
  * What variables are not needed? Drop them
  * Handle missing values 
    * If more than 10% of the values in a given row/column are missing, remove them
    * If fewer than 10% of the values in a given row/column are missing, fill them with min, max, mean, or median - whatever will best deal with each variable
  * Encode categorical variables using either one-hot encoding, ordinal encoding, or label encoding
* Divide the data into Training and Testing Groups
* Scale the features by using standardized scaling
  * Do all features need to be scaled? Consider each variable carefully.
* Apply the data to a logistic regression model to see if you can model loneliness
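
For the 10% rule on missing values, the fraction of missing entries per column can be computed directly. The sketch below uses a made-up frame with a few of the survey's column names. The values are assumptions for illustration only, and the same idea applies to rows by passing `axis=1` to `mean`.

```python
import numpy as np
import pandas as pd

# Made-up survey-like frame with 20 rows; the real dataset is larger
survey = pd.DataFrame({
    'Music': [5, np.nan] + [4] * 18,     # 1 of 20 missing (5%)
    'Age': [np.nan] * 4 + [20] * 16,     # 4 of 20 missing (20%)
    'Siblings': [1] * 20,                # nothing missing
})

# Fraction of missing values in each column
missing_frac = survey.isna().mean()
print(missing_frac)

threshold = 0.10
to_drop = missing_frac[missing_frac > threshold].index.tolist()
to_fill = missing_frac[(missing_frac > 0) & (missing_frac <= threshold)].index.tolist()
print(to_drop, to_fill)
```

Which statistic to use for the columns in `to_fill` (min, max, mean, or median) still has to be decided variable by variable.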