In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Feature Engineering for Numeric Variables
In this notebook we will cover scaling, transformations, and interaction features. This is a companion workbook for the 365 Data Science course on ML Process. The in-depth explanations, theory, and pros/cons for each of these techniques can be found there. 

## Feature Scaling
Feature scaling is important when we are using models that rely on a distance metric. If our features are on different scales, the features with the largest ranges can dominate the distance calculation and be overweighted by the model. 
- Absolute Max Scaling
- MinMax Scaling
- Z-Score Normalization (Standard Scaler)
- Robust Scaler 
## Transformations 
- Logarithmic 
- Square Root 
- Exponential
- Box-Cox
## Interaction Features
- Arithmetic Interaction
- Binning
- Creative Features 

This is a companion notebook for the **365 Data Science Course "Machine Learning Process A-Z"**. In the course, there is a video walkthrough of this notebook as well as theory and definitions of each of the techniques. We've designed this notebook to be a standalone learning tool, but if you're interested in the additional features of the paid course, you can access it at a discount here: https://365datascience.com/learn-machine-learning-process-a-z/

**Check out our 3 course bundle for additional learning (limited time discount 68% off!)** --> [The Machine Learning A-Z Bundle](https://bit.ly/3NAZ5oP)

In [None]:
df = pd.read_csv('/kaggle/input/craigslist-carstrucks-data/vehicles.csv')

#let's add a column for car age that will help us later on: 
df['car_age'] = df['year'].max() - df['year']

In [None]:
df.columns

In [None]:
df.describe()
# Columns we may want to normalize 
# Price, Year, Odometer

In [None]:
df.isnull().sum()

In [None]:
#let's just use a few features to create an example model and remove nulls. Learn more about different imputation techniques in this other companion notebook. 
#pd.get_dummies() creates dummy variables for the categorical features (see this notebook for more on that)
#notebook: https://www.kaggle.com/code/kenjee/categorical-feature-engineering-section-7-1
df_example = pd.get_dummies(df.loc[:,['price','car_age','odometer','manufacturer','condition']].dropna())


In [None]:
df_example

In [None]:
#see section on train test split: https://www.kaggle.com/code/kenjee/cross-validation-foundations-section-8
from sklearn.model_selection import train_test_split

X = df_example.drop('price',axis =1 )
y = df_example[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
df.price.plot.box()

In [None]:
#df.price.plot.box()
df.car_age.plot.box()
#df.odometer.plot.box()

In [None]:
df.odometer.plot.box()

# Feature Scaling
Feature scaling is important when we are using models that rely on a distance metric. If our features are on different scales, the features with the largest ranges can dominate the distance calculation and be overweighted by the model. 
- Absolute Max Scaling
- MinMax Scaling
- Z-Score Normalization (Standard Scaler)
- Robust Scaler 


## Absolute Maximum Scaling
Absolute maximum scaling divides each value in a feature by that feature's maximum absolute value, so the scaled values fall between -1 and 1 (between 0 and 1 for non-negative features).

Absolute max scaling works best if our data doesn't have massive outliers, since a single extreme value compresses everything else toward zero. In this case, we would likely want to remove outliers from price and odometer. Because every value is divided by the same constant, the shape of the distribution stays the same. Below, we apply absolute maximum scaling to every feature in X_train. 
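
To make the arithmetic concrete, here is a minimal sketch on made-up odometer readings (the values are hypothetical, not taken from the dataset). It compares a manual division by the maximum absolute value with scikit-learn's `MaxAbsScaler`.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy odometer readings, including one large outlier
odometer = np.array([[10_000.0], [50_000.0], [100_000.0], [1_000_000.0]])

# Manual absolute max scaling: divide by the largest absolute value
manual = odometer / np.abs(odometer).max()

# The same calculation using scikit-learn
scaled = MaxAbsScaler().fit_transform(odometer)

# The outlier maps to 1.0 and squeezes the other values toward 0
print(manual.ravel())
```

Notice how the single large reading pushes the other three values into the bottom tenth of the range.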

In [None]:
from sklearn.preprocessing import MaxAbsScaler

#Scale data 
df_am = MaxAbsScaler().fit_transform(X_train)

#convert to dataframe to see table
df_am = pd.DataFrame(df_am, columns = X_train.columns)

#obvious problems with outliers in odometer: a few extreme values squeeze most scaled values toward 0 

# Min Max Scaling
Another simple form of scaling is min-max scaling, which rescales all of our data points to the range 0 to 1. We subtract the feature's minimum from each raw value and then divide by the range (the max minus the min): $x' = (x - x_{min}) / (x_{max} - x_{min})$. 

Again, this approach is not robust to outliers.
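
As a quick check of the formula, the sketch below applies it by hand to a few made-up car ages (hypothetical values) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy car ages in years
car_age = np.array([[2.0], [5.0], [8.0], [12.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (car_age - car_age.min()) / (car_age.max() - car_age.min())

# The same calculation using scikit-learn
scaled = MinMaxScaler().fit_transform(car_age)

print(manual.ravel())
```

The smallest value always maps to 0 and the largest to 1, which is why a single extreme value changes the scaled position of every other point.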

In [None]:
from sklearn.preprocessing import MinMaxScaler
df_min_max = MinMaxScaler().fit_transform(X_train)
df_min_max = pd.DataFrame(df_min_max, columns = X_train.columns)

# Z Score Normalization (Standard Scaling)

Another approach is standardization, which transforms the data into z-scores so that each feature has a mean of zero and a standard deviation of 1.

This approach is more robust to outliers because the output isn't bounded by the extreme values, but it can still have issues if outliers cause massive changes to the mean and standard deviation. It also works best when the feature is roughly normally distributed, which is inaccurate for some of our data (for example, car age).
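
Here is a minimal sketch of the z-score calculation on made-up odometer readings (hypothetical values). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy odometer readings
odometer = np.array([[20_000.0], [60_000.0], [90_000.0], [150_000.0]])

# Manual z-score: subtract the mean, divide by the standard deviation
# np.std defaults to ddof=0 (population standard deviation)
manual = (odometer - odometer.mean()) / odometer.std()

# The same calculation using scikit-learn
scaled = StandardScaler().fit_transform(odometer)

print(manual.ravel())
```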

In [None]:
from sklearn.preprocessing import StandardScaler
df_std = X_train.copy()
#only scale numeric variables in this case rather than the dummy variables for categories 
df_std.loc[:,['car_age','odometer']] = StandardScaler().fit_transform(df_std.loc[:, ['car_age','odometer']])
df_std

# Robust Scaler
With Robust Scaler, we’re subtracting the median and then scaling the column by the IQR.

This is the approach most robust to outliers that we will cover.
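
The sketch below works through the median and IQR calculation by hand on made-up odometer readings (hypothetical values) and compares it with `RobustScaler` at its default settings (25th to 75th percentile).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy odometer readings, including one large outlier
odometer = np.array([[10_000.0], [50_000.0], [100_000.0], [1_000_000.0]])

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(odometer)
q1, q3 = np.percentile(odometer, [25, 75])
manual = (odometer - median) / (q3 - q1)

# The same calculation using scikit-learn
scaled = RobustScaler().fit_transform(odometer)

print(manual.ravel())
```

Because the median and IQR barely move when one value becomes extreme, the non-outlier points keep a usable spread after scaling.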

In [None]:
from sklearn.preprocessing import RobustScaler
df_rob = X_train.copy()
#only scale numeric variables in this case rather than the dummy variables for categories 
df_rob.loc[:,['car_age','odometer']] = RobustScaler().fit_transform(df_rob.loc[:, ['car_age','odometer']])
df_rob

In [None]:
#let's do a simple example where we compare results with the different feature scaling techniques. We will remove the categorical data for this. 

#the model we will be using is K Nearest Neighbors which can use euclidean distance. 

#we will use car age and odometer to predict price 

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, RobustScaler

num_cols = ['car_age','odometer']

def knn_predict(scaler=None):
    #fit the scaler on the training data only, then apply the same scaling to the test data
    #(predicting on unscaled test data with a model trained on scaled data gives meaningless distances)
    train, test = X_train.loc[:, num_cols], X_test.loc[:, num_cols]
    if scaler is not None:
        train = scaler.fit_transform(train)
        test = scaler.transform(test)
    neigh = KNeighborsRegressor(n_neighbors=3)
    neigh.fit(train, y_train)
    return neigh.predict(test)

pred = knn_predict()                     #no scaling
am_pred = knn_predict(MaxAbsScaler())    #absolute max
mm_pred = knn_predict(MinMaxScaler())    #min max (close to absolute max when the feature minimums are near 0)
std_pred = knn_predict(StandardScaler()) #standard (z score)
rob_pred = knn_predict(RobustScaler())   #robust scaler 



In [None]:
print('No Scaling: %.3f' % mean_absolute_error(y_test,pred))
print('Absolute Max Score: %.3f' % mean_absolute_error(y_test,am_pred))
print('Min Max Score: %.3f' % mean_absolute_error(y_test,mm_pred))
print('Standard Scaling Score: %.3f' % mean_absolute_error(y_test,std_pred))
print('Robust Scaler Score: %.3f' % mean_absolute_error(y_test,rob_pred))
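
When comparing scalers, the scaler should learn its parameters from the training data and then apply the same parameters to the test data. One way to guarantee this is scikit-learn's `make_pipeline`. The sketch below uses synthetic data (the features, price formula, and `_demo` names are all made up for illustration) rather than the car listings.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for car_age (years) and odometer (miles)
age = rng.uniform(0, 20, size=500)
miles = rng.uniform(0, 200_000, size=500)
X_demo = np.column_stack([age, miles])

# Hypothetical price that depends on both features plus noise
y_demo = 30_000 - 1_000 * age - 0.05 * miles + rng.normal(0, 500, size=500)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling from Xtr only; predict() reuses it on Xte
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
pipe.fit(Xtr, ytr)

demo_pred = pipe.predict(Xte)
mae_demo = mean_absolute_error(yte, demo_pred)
print('Pipeline MAE: %.3f' % mae_demo)
```

Swapping `StandardScaler()` for any of the other scalers in this notebook only requires changing that one argument.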


# Transformations 
A data transformation is the process of using a math expression to change the structure of our data. As we mentioned before, some models need data to fit a specific type of distribution for them to produce optimal results. Unfortunately, the data we get in the real world doesn't always fit the distributions our models call for. 

Let's look at the shape of our data and whether it has any outliers before we do our transforms.

In [None]:
# visual of the distribution of the odometer without any outlier removal (see boxplots above)
#data is clearly impacted heavily by outliers 
print("max odometer: " + str(df_example['odometer'].max()))
print("median odometer: " + str(df_example['odometer'].median()))

df_example['odometer'].hist(bins=50)


In [None]:
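#shape of the data after very basic outlier removal (kept only data < 99th percentile)
#clear right skew in data 
#the mask is built from df_example itself so its index lines up with the rows being filtered
df_example.loc[df_example['odometer'] < df_example['odometer'].quantile(.99), 'odometer'].hist(bins=50)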

In [None]:
# visual of the distribution of the price without any outlier removal (see boxplots above)
#data is clearly impacted heavily by outliers 
print("max price: " + str(df_example['price'].max()))
print("median price: " + str(df_example['price'].median()))

df_example['price'].hist(bins=50)

In [None]:
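#shape of the data after very basic outlier removal (kept only data < 99th percentile)
#clear right skew in data 
#the mask is built from df_example itself so its index lines up with the rows being filtered
df_example.loc[df_example['price'] < df_example['price'].quantile(.99), 'price'].hist(bins=50)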

In [None]:
# car_age was already engineered at the start of the notebook (newest model year minus each car's year)
# Let's look at how old the cars are

print("max age: " + str(df_example['car_age'].max()))
print("median age: " + str(df_example['car_age'].median()))

X_train['car_age'].hist(bins=50)

# Logarithmic Transformation
A very common type of transformation is the log transformation. Log transformations fall under the family of power transformations. Typically, we apply logarithmic transformations to our variables when they are heavily right skewed, driven by a few outliers. 

Let's see how these transformations impact some of our skewed data (Odometer & Price).
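
To see the effect on skew in isolation, here is a small sketch on synthetic right-skewed data (lognormal draws standing in for prices; the parameters are arbitrary). It uses `np.log1p`, which computes log(x + 1), and `scipy.stats.skew` to measure skewness before and after.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (lognormal), not the real listings
prices = rng.lognormal(mean=9.5, sigma=0.8, size=5_000)

# log1p(x) = log(x + 1), so zeros are handled safely
logged = np.log1p(prices)

skew_before = skew(prices)
skew_after = skew(logged)
print('skew before: %.2f' % skew_before)
print('skew after:  %.2f' % skew_after)
```

A skewness near 0 indicates a roughly symmetric distribution, while large positive values indicate a long right tail.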

### Transforms we will cover
- Logarithmic 
- Exponential
- Square Root 
- Box-Cox

In [None]:
from sklearn.preprocessing import FunctionTransformer

def log_transform(x):
    return np.log(x + 1)

transformer_log = FunctionTransformer(log_transform)
transformed_log = transformer_log.fit_transform(X_train)

In [None]:
transformer_logp = FunctionTransformer(log_transform)
transformed_logp = transformer_logp.fit_transform(y_train)

In [None]:
#as you can see, using log transform in this case overcorrects and actually creates some left skew. 
#It does, however, pull the extreme outliers almost completely back in line with the rest of the data

transformed_log['odometer'].hist(bins = 100)
#X_train['odometer'].hist(bins = 100)

In [None]:
transformed_log['car_age'].hist(bins = 20)
#X_train['car_age'].hist(bins = 20)

In [None]:
#as you can see, using log transform in this case overcorrects and actually creates some left skew. 
#It does, however, pull the extreme outliers almost completely back in line with the rest of the data

transformed_logp.hist(bins =100)

# Square Root Transform
Square root transformations compress the spread of your larger values and spread out your lower values (squaring does the opposite). Log transformations have a similar compressing effect but are much more aggressive.
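
The difference in aggressiveness is easy to see on a few made-up values spanning six orders of magnitude. This sketch uses a base-10 log purely for readability; the natural log used elsewhere in this notebook differs only by a constant factor.

```python
import numpy as np

# Hypothetical values spanning six orders of magnitude
values = np.array([1.0, 100.0, 10_000.0, 1_000_000.0])

sqrt_vals = np.sqrt(values)   # still spans three orders of magnitude
log_vals = np.log10(values)   # collapses to evenly spaced single digits

print('sqrt:', sqrt_vals)
print('log10:', log_vals)
```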

In [None]:
def sqrt_transform(x):
    return np.sqrt(x)

transformer_sqrt = FunctionTransformer(sqrt_transform)
transformed_sqrt = transformer_sqrt.fit_transform(X_train)

In [None]:
transformer_sqrtp = FunctionTransformer(sqrt_transform)
transformed_sqrtp = transformer_sqrtp.fit_transform(y_train)

In [None]:
transformed_sqrt['odometer'].hist(bins = 100)

In [None]:
transformed_sqrt['car_age'].hist(bins = 20)

In [None]:
transformer_sqrtp = FunctionTransformer(sqrt_transform)
transformed_sqrtp = transformer_sqrtp.fit_transform(y_train[y_train['price'] < y_train['price'].quantile(.99)])

In [None]:
transformed_sqrtp.hist(bins=50) 

# Exponential Transformation
A close cousin of the log transform is the exponential transformation. There are many instances where you'd use an exponential transform:
- Anytime you apply a log transform to your target variable, you can apply an exponential transformation to the predictions to revert them back to the original scale.
- Log and exponential transformations are the inverse of each other. Which one you apply, and to which variable, depends on whether you want a log-linear or linear-log model.
- Use exponential transformations when you want to magnify small differences.
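
The first bullet can be sketched as a round trip: `np.expm1` undoes `np.log1p`. scikit-learn's `TransformedTargetRegressor` can also apply the pair automatically around a model. The data and the choice of `LinearRegression` below are assumptions made for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Round trip on hypothetical prices: expm1 is the inverse of log1p
prices = np.array([500.0, 4_000.0, 12_000.0, 35_000.0])
restored = np.expm1(np.log1p(prices))

# Synthetic data where the log of the target is linear in car age
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 20, size=(200, 1))
y_demo = np.exp(10 - 0.1 * X_demo[:, 0] + rng.normal(0, 0.1, size=200))

# The wrapper fits on log1p(y) and returns predictions on the original scale
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_demo, y_demo)
preds = model.predict(X_demo[:5])
print(preds)
```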

In [None]:
def exp_transform(x):
    return np.exp(x)

transformer_exp = FunctionTransformer(exp_transform)

## In our dataset, car age may be something we want to magnify
transformed_exp = X_train.copy()

transformed_exp['car_age'] = transformer_exp.fit_transform(transformed_exp['car_age'])

In [None]:
plt.hist(X_train['car_age'])

In [None]:
## X and y-scale here are much larger
plt.hist(transformed_exp['car_age'])

In [None]:
plt.hist(transformed_exp['odometer'])

# Box-Cox Transformation
The Box-Cox transformation helps your dataset follow a normal distribution. Typically, we use it when our dataset is not normal, but close to being normal. When we want to either run tests or generate significance from our dataset, Box-Cox is a good option to transform our target variable so it resembles a normal distribution.

Box-Cox combines a family of power transformations into a single transformer, and you use the parameter lambda to adjust the transformation. Lambda is typically searched over the range -5 to 5. If we set lambda equal to zero, it becomes simply a log transformation. Note that Box-Cox only works on strictly positive values. 
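
For a fixed lambda, Box-Cox computes (x^lambda - 1) / lambda, and log(x) when lambda is 0. The sketch below checks both cases on a few made-up positive values and shows how `scipy.special.inv_boxcox` reverses the transform.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Hypothetical strictly positive values
x = np.array([1.0, 10.0, 100.0, 1_000.0])

# lambda = 0 reduces to the natural log
bc_log = boxcox(x, lmbda=0)

# lambda = 0.5 is a shifted and rescaled square root: (sqrt(x) - 1) / 0.5
bc_sqrt = boxcox(x, lmbda=0.5)

# inv_boxcox recovers the original values for a known lambda
x_back = inv_boxcox(bc_sqrt, 0.5)

print(bc_log)
print(bc_sqrt)
```

When `lmbda` is supplied, `boxcox` returns only the transformed array; when it is left as `None`, it also returns the fitted lambda.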

In [None]:
## Redo the pipeline for this example
from sklearn.model_selection import train_test_split

## Remove outliers (keep only prices below the 95th percentile)
df_example = df_example[df_example['price'] < np.percentile(df_example['price'], 95)]

## Box-Cox requires strictly positive values, so remove prices that are 0
df_example = df_example[df_example['price'] > 0].copy()

X = df_example.drop('price',axis =1 )
y = df_example[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
plt.hist(y_train['price'])

We'll apply a Box-Cox transformation to make this dataset a bit more normal. Within scipy.stats, we can set lmbda = None and the boxcox function will find the lambda value that maximizes the log-likelihood function, returning both the transformed data and that lambda:

In [None]:
from scipy.stats import boxcox

boxcox_y_train = boxcox(y_train['price'], lmbda = None)

plt.hist(boxcox_y_train[0])

In [None]:
print("Lambda Parameter {0}".format(boxcox_y_train[1]))

# Feature Interactions 
Like a chef remixing their ingredients, as a data scientist, we have a ton of different ways we can engineer features with our variables. Here are a few common methods:

- Arithmetic Interaction (addition, subtraction, division, or multiplication of variables)
- Binning (grouping variables in ranges)
- Creative Features (alternative metrics for evaluation)

## Arithmetic Interaction
We actually already did some arithmetic interaction at the beginning of our analysis. One of the earliest things we did was get the car age by taking the difference between the newest year and the year of each vehicle. While this only involves a single variable, we can also take differences, ratios, and products of two or more variables. Let's try a few: 

In [None]:
#first example of getting car's age:
df['car_age'] = df['year'].max() - df['year']

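#let's look at price per mile. This could be a good way to normalize across different car brands 
#odometer readings of 0 are replaced with NaN to avoid dividing by zero (which would give inf)
df['price_per_mile'] = df['price'] / df['odometer'].replace(0, np.nan)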

#We can try these newly created features in our model to see if they produce better results.
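
Here is a small sketch of a ratio and a product built from two variables, on a made-up table. The `miles_per_year` and `age_x_miles` features are hypothetical examples, not features used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    'price': [5_000, 12_000, 30_000],
    'odometer': [150_000, 60_000, 0],
    'car_age': [15, 6, 0],
})

# Ratio: average miles driven per year (age 0 becomes NaN to avoid dividing by zero)
toy['miles_per_year'] = toy['odometer'] / toy['car_age'].replace(0, np.nan)

# Product: a combined wear measure that grows with both age and mileage
toy['age_x_miles'] = toy['car_age'] * toy['odometer']

print(toy)
```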

## Binning 
Binning allows us to group specific variables in a range. This can be useful if we know something specific or non-linear about the data at hand. For example, if most cars go out of warranty at 50,000 miles or after 5 years, we can create a variable bin based on that. 

We can also split data into multiple ranges if we would like to. 

In [None]:
#create warranty flags: 1 if the car is likely still under warranty (<= 50,000 miles, <= 5 years), otherwise 0
# we are using a lambda function here. This lets us write a function without naming it
# we are also using a ternary operator which is an if, else statement in a single line. (explained in the full video)

df['warranty_miles'] = df['odometer'].apply(lambda x: 0 if (x > 50000 or np.isnan(x)) else 1)
df['warranty_age'] = df['car_age'].apply(lambda x: 0 if (x > 5 or np.isnan(x)) else 1)


# We can also combine these together in a single statement by defining a function.
def warranty(miles, age):
    # missing values count as out of warranty, matching the two flags above
    if (np.isnan(miles) or np.isnan(age) or miles > 50000 or age > 5):
        return 0
    else:
        return 1
    
df['warranty'] = df.apply(lambda x: warranty(x.odometer, x.car_age), axis=1)


In [None]:
df.loc[:,['odometer','car_age','warranty_miles','warranty_age','warranty']].dropna().head()

In [None]:
#We might also know that cars lose value non-linearly after 50k miles or 100k miles. 
#In this case, we may want to create buckets for <50k, 50k-100k, and 100k+ miles.
#we did something similar in the eda notebook located here: https://www.kaggle.com/code/kenjee/basic-eda-example
#closed='left' makes each bucket include its lower edge, so an odometer of 0 lands in the first bucket
bins = pd.IntervalIndex.from_tuples([(0, 50000), (50000, 100000), (100000,float("inf"))], closed='left')
df['mile_groups'] = pd.cut(df['odometer'],bins)
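
If readable bucket names are preferred over interval objects, `pd.cut` also accepts bin edges together with `labels`. The sketch below uses made-up mileages and label names chosen for illustration. `right=False` makes each bucket include its lower edge.

```python
import numpy as np
import pandas as pd

# Hypothetical odometer readings, including a missing value
miles = pd.Series([0, 25_000, 50_000, 75_000, 100_000, 250_000, np.nan])

# Buckets: [0, 50k), [50k, 100k), [100k, inf)
groups = pd.cut(
    miles,
    bins=[0, 50_000, 100_000, np.inf],
    labels=['<50k', '50k-100k', '100k+'],
    right=False,
)

print(groups)
```

Missing odometer readings stay missing after binning, so they still need to be handled separately.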

## Creative features 
Often, we have a good subject area understanding of our data domain. We might want to create features based on our understanding of the specific problem or domain. For cars, maybe we could create our own classification of imports or US manufactured cars that could help us predict pricing better. Maybe there is a car desirability metric that you could create based on the other factors. Or maybe there is a way to look at the number of similar cars close by to approximate demand. These are some potential ideas for you to implement yourself! 

Another way to get creative features is to find more data and add it to your dataset. We could find a car sales website and scrape the average price of the cars in the market based on the make and model. 
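
As one possible starting point for the "similar cars close by" idea, the sketch below counts listings that share a region and manufacturer and averages their price. The table, column values, and new feature names are made up for illustration. Since the average is built from the target, in a real model it should be computed from the training data only to avoid leaking test prices.

```python
import pandas as pd

# Hypothetical toy listings
listings = pd.DataFrame({
    'region': ['austin', 'austin', 'austin', 'boise', 'boise'],
    'manufacturer': ['ford', 'ford', 'toyota', 'ford', 'toyota'],
    'price': [10_000, 14_000, 9_000, 8_000, 11_000],
})

grouped = listings.groupby(['region', 'manufacturer'])['price']

# Number of similar listings nearby (a rough proxy for local supply)
listings['similar_listings'] = grouped.transform('count')

# Average price of similar listings nearby
listings['avg_similar_price'] = grouped.transform('mean')

print(listings)
```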

# Summary
In this notebook, we covered the basics of feature scaling, transformations and interaction features. Working on these techniques should help you to improve your models significantly! 
## Feature Scaling
- Absolute Max Scaling
- MinMax Scaling
- Z-Score Normalization (Standard Scaler)
- Robust Scaler 
## Transformations 
- Logarithmic 
- Square Root 
- Exponential
- Box-Cox
## Interaction Features
- Arithmetic Interaction
- Binning
- Creative Features 

## Additional Resources
- [About Feature Scaling and Normalization by Sebastian Raschka](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html)
- [Feature Scaling Techniques in Python – A Complete Guide by Eddie_4072](https://www.analyticsvidhya.com/blog/2021/05/feature-scaling-techniques-in-python-a-complete-guide/)
- [Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization by Aniruddha Bhandari](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/#:~:text=Normalization%20is%20a%20scaling%20technique,known%20as%20Min%2DMax%20scaling.&text=Here%2C%20Xmax%20and%20Xmin%20are,values%20of%20the%20feature%20respectively.)
- [Robust Scaler - Sklearn Docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)
- [Log Transformation: Purpose and Interpretation by Kyaw Saw Htoon](https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9)
- [Best exponential transformation to linearize your data with Scipy](https://towardsdatascience.com/best-exponential-transformation-to-linearize-your-data-with-scipy-cca6110313a6)
- [Exponentially scaling your data in order to zoom in on small differences](https://rikunert.com/exponential_scaler)
- [Box Cox Transformation by Ted Hessing](https://sixsigmastudyguide.com/box-cox-transformation/)
- [Box-Cox Transformation and Target Variable: Explained](https://builtin.com/data-science/box-cox-transformation-target-variable)
- [Additional Kaggle Example](https://www.kaggle.com/code/mysarahmadbhat/all-about-feature-scaling)

## Related Course Workbooks - Machine Learning Process A-Z
- [**Dealing with Missing Values - Section 5.1**](https://www.kaggle.com/code/kenjee/dealing-with-missing-values-section-5-1)
- [**Dealing with Outliers - Section 5.2**](https://www.kaggle.com/code/kenjee/dealing-with-outliers-section-5-2)
- [**Basic EDA Example - Section 6**](https://www.kaggle.com/code/kenjee/basic-eda-example-section-6)
- [**Categorical Feature Engineering - Section 7.1**](https://www.kaggle.com/code/kenjee/categorical-feature-engineering-section-7-1)
- [**Numeric Feature Engineering - Section 7.2**](https://www.kaggle.com/kenjee/numeric-feature-engineering-section-7-2)
- [**Cross Validation Foundations - Section 8**](https://www.kaggle.com/code/kenjee/cross-validation-foundations-section-8)
- [**Feature Selection - Section 9**](https://www.kaggle.com/code/kenjee/feature-selection-section-9)
- [**Dealing with Imbalanced Data - Section 10**](https://www.kaggle.com/code/kenjee/dealing-with-imbalanced-data-section-10)
- [**Model Building Example - Section 11**](https://www.kaggle.com/code/kenjee/model-building-example-section-11)
- [**Model Evaluation (Classification) - Section 11**](https://www.kaggle.com/code/kenjee/model-evaluation-classification-section-12)
- [**Model Evaluation (Regression) - Section 11**](https://www.kaggle.com/code/kenjee/model-evaluation-regression-12)