# Scaling Variables

In this notebook we will demonstrate a number of scaling techniques to ensure consistency across features.

### Import Basic Packages

In [None]:
# Basics
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt

### Import Data

We are working with features that describe retail stores, the products they have available and whether the store is recommended or not. This is a classification scenario and recommended is the target variable.


In [2]:
# Import store sales dataset
df_stores = pd.read_csv('stores.csv')
df_stores

Unnamed: 0,store_area,items_available,daily_cust_count,store_sales,city,recommended
0,2157,1961,530,66490,Vancouver,0
1,1928,2278,210,39820,Surrey,0
2,2090,1609,936,70213,Burnaby,1
3,2942,1923,744,59103,Vancouver,0
4,3037,2111,450,46620,Surrey,0
...,...,...,...,...,...,...
659,1619,1366,1340,62940,Langley,0
660,2167,2020,980,66070,Vancouver,1
661,1884,1892,630,43190,Surrey,0
662,1211,1447,1110,40730,Langley,0


In [3]:
# Import Testing Data
df_stores_test = pd.read_csv('stores_test.csv')

### MinMax Scaling (Normalization)

One of the simplest methods to scale our data is to use minmax scaling.

This method focuses on the absolute range of each feature and rescales all values into the range 0 to 1.

We will be using scikit-learn's **MinMaxScaler**:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
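
As a rough illustration of what MinMaxScaler computes, the sketch below scales a small made-up array and compares the result with the formula `(x - min) / (max - min)` applied column by column. The array `X` and the name `X_manual` are assumptions for this example and are not part of the stores data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up): two features on very different scales
X = np.array([[1000.0, 10.0],
              [2000.0, 20.0],
              [3000.0, 40.0]])

scaler = MinMaxScaler()              # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)   # learn each column's min and max, then rescale

# The same rescaling written out by hand, column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, regardless of the original units.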

In [None]:
# Import store sales dataset
df_stores = pd.read_csv('stores.csv')
df_stores

In [None]:
numeric_cols = df_stores.drop('recommended', axis = 1).select_dtypes('number').columns

In [None]:
# plot each of the numeric features

for feature in numeric_cols:
    fig, (ax) = plt.subplots(1,1,figsize=(6,1))
    sns.histplot(df_stores[feature], ax = ax)
    plt.show()

In [None]:
# View the descriptive stats for each numeric column in the dataframe
df_stores.describe()

In [None]:
# Import MinMaxScaler and fit and transform to the training data 


By exploring the describe function we can now see that each scaled feature has a minimum of 0 and a maximum of 1.

In [None]:
#Plot a scatter chart of the Store_Sales column and Store_Area column before and after minmax scaling


Our features now occupy a similar feature space on each axis. This can improve the performance of many machine learning models, particularly those that rely on distances or gradient descent.
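
Fitting on the training data and then reusing the fitted scaler on the testing data might look like the following sketch. The two small frames, their values, and the names `train`, `test`, `cols`, `train_scaled` and `test_scaled` are made up for illustration; only the column names mirror the stores data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up training and testing frames (hypothetical values)
train = pd.DataFrame({'store_area': [1200.0, 2000.0, 3000.0],
                      'store_sales': [40000.0, 55000.0, 70000.0]})
test = pd.DataFrame({'store_area': [1000.0, 2500.0],
                     'store_sales': [47500.0, 76000.0]})

cols = ['store_area', 'store_sales']
scaler = MinMaxScaler().fit(train[cols])   # learn min and max from the training data only

# Work on copies so the unscaled frames stay available for plotting
train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[cols] = scaler.transform(train[cols])
test_scaled[cols] = scaler.transform(test[cols])

print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training data, testing values that lie outside the training range are mapped below 0 or above 1. That is expected behaviour rather than a bug.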

In [None]:
# Import Testing Data
df_stores_test = pd.read_csv('stores_test.csv')

In [None]:
#Apply the fitted scaler to the test data


In [None]:
#Plot a scatter chart of the Store_Sales column and Store_Area column before and after minmax scaling
fig, (before, after) = plt.subplots(1, 2,figsize=(12,4))

before = df_stores_test.plot.scatter(ax = before, x='store_sales', y='store_area', label = 'Before Scaling (testing)', color = 'orange')
after = df_test_minmax.plot.scatter(ax = after, x='store_sales', y='store_area', label = 'After Scaling (testing)', color= 'orange')

### Using StandardScaler to Standardize Normally Distributed Variables

When our features are normally distributed, we typically prefer a standardization technique to bring the distributions of the variables into comparable ranges.

This method focuses on the distribution of the data more than the absolute range, which can be helpful for distributions with tails like the normal distribution.

We will be using scikit-learn's **StandardScaler**:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
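
To make the idea concrete, here is a minimal sketch on a made-up, roughly normal feature. It compares StandardScaler's output with the z-score formula `(x - mean) / std`. The generated data and the names `rng`, `X`, `X_std` and `X_manual` are assumptions for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature drawn from a normal distribution (mean 2000, std 300)
rng = np.random.default_rng(0)
X = rng.normal(loc=2000, scale=300, size=(500, 1))

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # subtract the column mean, divide by the column std

# The same z-score by hand; NumPy's default std is the population std (ddof=0),
# which is what StandardScaler uses
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(), X_std.std())
print(np.allclose(X_std, X_manual))
```

Note that pandas' `describe` reports the sample standard deviation (ddof=1), so the value it shows for a standardized column can be very slightly above 1.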

In [None]:
# Import both the normally distributed stores data, and the testing dataset
df_stores_norm = pd.read_csv('stores_norm_dist.csv')
df_stores_norm_test = pd.read_csv('stores_norm_dist_test.csv')

# We are working with features that describe retail stores,
# the products they have available and whether the store is recommended or not.
# This is a classification scenario and recommended is the target variable.

In [None]:
numeric_cols = df_stores_norm.drop('recommended', axis = 1).select_dtypes('number').columns

In [None]:
# plot each of the numeric features

for feature in numeric_cols:
    fig, (ax) = plt.subplots(1,1,figsize=(6,1))
    sns.histplot(df_stores_norm[feature], ax = ax)
    plt.show()

In [60]:
# Import StandardScaler, then fit and transform the training data


By exploring the describe function we can now see that our features have a mean of zero and a stdev of 1.

In [None]:
#Plot a scatter chart of the Store_Sales column and Store_Area column before and after standardization
fig, (before, after) = plt.subplots(1, 2,figsize=(12,4))
before = df_stores_norm.plot.scatter(ax = before, x='store_sales', y='store_area', label = 'Before Standardization', color = 'darkblue')
after = df_stdscale.plot.scatter(ax = after, x='store_sales', y='store_area', label = 'After Standardization', color = 'darkblue')

In [61]:
#Apply the fitted scaler to the test data


In [None]:
#Plot a scatter chart of the Store_Sales column and Store_Area column before and after standardization (testing data)
fig, (before, after) = plt.subplots(1, 2,figsize=(12,4))

before = df_stores_norm_test.plot.scatter(ax = before, x='store_sales', y='store_area', label = 'Before Scaling (testing)', color = 'orange')
after = df_stdscale_test.plot.scatter(ax = after, x='store_sales', y='store_area', label = 'After Scaling (testing)', color= 'orange')

### Using Robust Scaler to Scale Features

The robust scaler is less affected by outliers because it centers each feature on its median and scales by the interquartile range, rather than using the mean and standard deviation.

To use the robust scaler, you can use a similar syntax to the other methods from scikit-learn:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
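
As a small sketch of what scaling by the interquartile range means, the example below applies RobustScaler with its default settings to a made-up feature containing one extreme value, and reproduces the result by hand from the median and the 25th and 75th percentiles. The data and the names `X`, `X_robust` and `X_manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up feature: nine ordinary values plus one extreme outlier
X = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 500], dtype=float).reshape(-1, 1)

scaler = RobustScaler()   # defaults: center on the median, scale by the 25th-75th percentile range
X_robust = scaler.fit_transform(X)

# The same transformation written out by hand
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q3 - q1)

print(scaler.center_, scaler.scale_)   # fitted median and interquartile range
print(np.allclose(X_robust, X_manual))
```

The outlier barely moves the median or the quartiles, so the ordinary values keep a usable spread after scaling.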

In [None]:
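from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()

# Same API as the other scalers. The frames and numeric_cols come from the
# StandardScaler section above; train_robust and test_robust are illustrative names.
robust_scaler.fit(df_stores_norm[numeric_cols])  # learn the median and IQR from the training data
train_robust = robust_scaler.fit_transform(df_stores_norm[numeric_cols])  # fit and transform in one step
test_robust = robust_scaler.transform(df_stores_norm_test[numeric_cols])  # reuse the fitted statistics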

### Which Scaling method to use?

In general, there are no hard rules. The main goal is to transform variables into a comparable range to allow models to balance features fairly.

- This means that all our numeric features should be scaled using the same method.
- If our features are normally distributed, we might favour the standard scaler.
- With outliers, the robust scaler may perform better.

However, the ultimate goal is improved model performance, so our choices should be guided by those metrics.
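
The effect of an outlier on each method can be sketched with a made-up feature. The example measures how much spread the ordinary values retain after each scaler is applied. The data and the names `X`, `spreads` and `inliers` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Made-up feature: nine ordinary values plus one extreme outlier
X = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 500], dtype=float).reshape(-1, 1)

spreads = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    X_scaled = scaler.fit_transform(X)
    inliers = X_scaled[:-1, 0]   # the nine ordinary values after scaling
    spreads[type(scaler).__name__] = inliers.max() - inliers.min()

for name, spread in spreads.items():
    print(f"{name:<15} spread of ordinary values: {spread:.3f}")
```

With this data the min-max scaler squeezes the ordinary values into a very narrow band, while the robust scaler keeps them well separated. On real data the comparison should still be judged by model metrics.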

### Exercise 1 (Basic): Plot the distributions of the numeric variables in the customer experience dataset
Using the dataset below, plot each distribution so that you can see the different scales of each numeric variable. Note any potential outliers that may affect our scaling.

In [62]:
#import dataset
df_cx = pd.read_csv('cx_survey_data.csv')
df_cx

Unnamed: 0,caseid,date,inquiry,wait_time,case_duration,sat_score,solved
0,1,2021-11-28,Bug,680,3129,2,0
1,2,2021-12-03,Bug,745,246,6,1
2,3,2021-12-20,Bug,1199,2686,1,0
3,4,2021-10-02,Bug,205,591,4,1
4,5,2021-11-20,Bug,24,2327,2,0
...,...,...,...,...,...,...,...
164,165,2021-11-28,Discovery,989,1347,6,1
165,166,2021-09-06,Discovery,806,2512,10,0
166,167,2021-12-07,Discovery,1058,154,9,1
167,168,2021-11-26,Discovery,268,155,8,1


In [None]:
#define numeric columns to scale


In [None]:
# plot a distribution of each numeric column


### Exercise 2 (Advanced): Apply Min Max Scaling to Training & Testing Data

In this exercise you're going to explore what impact an outlier in the training data might have on the results of minmax scaling.

In [None]:
#fit and transform the min max scaler to the numeric columns


In [None]:
#Plot a scatter chart of the wait_time and case_duration columns before and after scaling on the training data
fig, (before, after) = plt.subplots(1, 2,figsize=(12,4))
before = df_cx.plot.scatter(ax = before, x='wait_time', y='case_duration', label = 'Before Scaling (training)', color = 'darkblue')
after = df_cx_scaled.plot.scatter(ax = after, x='wait_time', y='case_duration', label = 'After Scaling (training)', color= 'darkblue')

In [None]:
#import the cx_testing dataset and identify the numeric columns
df_cx_test = pd.read_csv('cx_survey_data_test.csv')

In [None]:
#make a copy of the testing data and transform the numeric columns using the minmax_scaler


In [None]:
#Plot a scatter chart of the wait_time and case_duration columns before and after scaling on the testing data
fig, (before, after) = plt.subplots(1, 2,figsize=(12,4))
before = df_cx_test.plot.scatter(ax = before, x='wait_time', y='case_duration', label = 'Before Scaling (testing)', color = 'orange')
after = df_cx_test_scaled.plot.scatter(ax = after, x='wait_time', y='case_duration', label = 'After Scaling (testing)', color= 'orange')