# **Data Preprocessing Walkthrough**
##### **Description**: In this Notebook, we'll discuss the purpose of data preprocessing, how it differs between algorithms, and how to optimize the process.

### **1. What is Data Preprocessing?**
##### ***Data preprocessing*** is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning. (MonkeyLearn.com)
##### When you properly preprocess and clean your data, you’ll set yourself up for much more accurate downstream processes. We often hear about the importance of “data-driven decision making,” but if these decisions are driven by bad data, they’re simply bad decisions.

### **2. What are the Different Types of Data Preprocessing?**
##### - ***Label Encoding*** is a technique used in machine learning and data analysis to convert categorical features into numerical format. For example, if you have a dataset with a column named "state" with the following values ("UT", "CA", "WA"), you'd convert them to numerical values (e.g., 1, 2, 3). Each categorical value is simply mapped to an integer. Keep in mind that the integers imply an ordering (1 < 2 < 3) that the original categories may not actually have.
##### - ***One-Hot Encoding*** is the process of converting a categorical feature into a set of numeric dummy variables, one per distinct value. For example, if you have a dataset with a column named "gender" with the following values ("Male", "Female", "Other"), the following ***dummy variables*** could be created: "is_male", "is_female", "is_other". These would then be filled with numeric Boolean values (1 = True, 0 = False). ***As a rule of thumb, use this technique only when the categorical variable has a small number of distinct values (for example, fewer than 10), since every distinct value adds a new column.***
##### - ***Scalers*** are used to put the values of each feature on a comparable range. For example, if you have a feature named "size" with a range of 1000 to 5000 and another feature named "num_tenants" with a range of 0 to 10, the larger feature could dominate the smaller one depending on the model. A ***MinMaxScaler*** (used in the K-Means example in Section 4) fixes this by mapping the smallest value in the feature to 0 and the largest to 1, with everything in between expressed as a fraction of the range (e.g., if the original value of "num_tenants" was 7, this would now be 0.7, since 7 sits 70% of the way from 0 to 10). A ***StandardScaler*** instead subtracts the feature's mean and divides by its standard deviation, so each feature ends up with a mean of 0 and a standard deviation of 1.
##### - ***Imputation*** is the process of filling NULL/missing values with a substitute value. This can be done in a variety of ways, including filling with a default value, filling with the mean, or using a supervised model to predict the missing value from the other features in the row.
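##### The four techniques above can be sketched on a tiny table. This is a minimal illustration using plain pandas rather than scikit-learn; the rows and the derived column names (`state_label`, `tenants_minmax`, `tenants_standard`) are invented for the example and are not part of the notebook's dataset.

```python
import pandas as pd

# Toy data invented for illustration; it is not the notebook's dataset.
df = pd.DataFrame({
    "state": ["UT", "CA", "WA", "CA"],
    "size": [1000.0, 3000.0, None, 5000.0],
    "num_tenants": [0, 7, 10, 3],
})

# Label encoding: categories are sorted alphabetically and mapped to 0, 1, 2, ...
df["state_label"] = df["state"].astype("category").cat.codes

# One-hot encoding: one 0/1 dummy column per distinct value
one_hot = pd.get_dummies(df["state"], prefix="is", dtype=int)

# Imputation: fill the missing "size" with the mean of the observed values
df["size"] = df["size"].fillna(df["size"].mean())

# Min-max scaling: (x - min) / (max - min), which maps the feature onto [0, 1]
tenants = df["num_tenants"]
df["tenants_minmax"] = (tenants - tenants.min()) / (tenants.max() - tenants.min())

# Standardization: (x - mean) / std, which gives mean 0 and standard deviation 1
df["tenants_standard"] = (tenants - tenants.mean()) / tenants.std(ddof=0)

print(df)
print(one_hot)
```

##### Note that the label codes start at 0 here, whereas the prose example above counts from 1; either convention works as long as it is applied consistently.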

### **3. How Does Data Preprocessing Differ Between Algorithms?**
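##### ***Linear/Logistic Regression***: These models multiply each feature by a learned coefficient, so every input must be numeric and raw text categories simply won't work. Label encoding is a poor fit here because the model would treat the arbitrary codes as an ordered quantity. One-hot encoding is the usual way to include a categorical feature, and dropping the categorical columns (as in the Linear Regression example in Section 4) is the simplest option. These models also can't accept NULL/missing values, since most libraries raise an error during fitting, so imputation (or removing the affected rows) is required.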
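##### As a rough sketch of how imputation and one-hot encoding can feed a linear model, the snippet below chains them in a scikit-learn `Pipeline`. The column names echo the car listings used later in this notebook, but the rows are invented and the step names (`"num"`, `"cat"`, `"preprocess"`, `"regress"`) are arbitrary labels chosen for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented rows for illustration only
toy = pd.DataFrame({
    "mileage": [110000.0, 130000.0, None, 95000.0, 125000.0, 60000.0],
    "usage_type": ["Personal", "Personal", "Fleet", "Fleet", "Personal", "Fleet"],
    "price": [14000, 12000, 15000, 19000, 6500, 25000],
})
X = toy[["mileage", "usage_type"]]
y = toy["price"]

preprocess = ColumnTransformer([
    # Numeric column: replace the missing mileage with the column mean
    ("num", SimpleImputer(strategy="mean"), ["mileage"]),
    # Categorical column: expand into one 0/1 column per category
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["usage_type"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

##### Keeping the preprocessing inside the pipeline means the same imputation and encoding are applied automatically whenever the model is used to predict on new rows.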
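##### ***Decision Trees/Random Forests***: Tree-based models CAN accept categorical variables that have been label-encoded or one-hot encoded, because they split on thresholds rather than multiplying features by coefficients. Whether imputation is required depends on the implementation more than on the task (regression or classification): some tree libraries handle missing values natively, while others raise an error, so check the library and version you are using before skipping imputation.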
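##### A small sketch of a tree model trained on an integer-encoded categorical feature is shown below. It uses scikit-learn's `OrdinalEncoder`, which is the feature-oriented counterpart of `LabelEncoder`; the rows and the `max_depth` setting are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Invented rows for illustration only
toy = pd.DataFrame({
    "exterior_color": ["White", "Blue", "Gray", "Gray", "Red", "Blue"],
    "num_owners": [5, 3, 2, 5, 3, 1],
    "price": [13000, 20000, 21000, 19000, 6500, 24000],
})

X = toy[["exterior_color", "num_owners"]].copy()

# Map each color to an integer code (categories are sorted alphabetically)
encoder = OrdinalEncoder()
X["exterior_color"] = encoder.fit_transform(X[["exterior_color"]]).ravel()

# The tree splits on thresholds over the codes, so numeric codes are acceptable input
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, toy["price"])
print(tree.predict(X))
```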
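##### ***K-Means Clustering***: Although not strictly required, it's highly recommended to scale your numeric features (for example with a MinMaxScaler or StandardScaler). K-Means groups rows by Euclidean distance, so an unscaled feature with a large range would dominate the clustering. K-Means also can't work with categorical features directly, since distances between label-encoded category codes aren't meaningful, which is why the K-Means example in Section 4 drops these columns.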
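##### To see why scaling matters for a distance-based algorithm, the sketch below compares Euclidean distances before and after min-max scaling. The three rows are hypothetical and reuse the "size" (1000 to 5000) and "num_tenants" (0 to 10) ranges from Section 2.

```python
import math

# Hypothetical rows: (size, num_tenants)
a = (1000.0, 0.0)
b = (1050.0, 10.0)  # almost the same size as a, but the opposite extreme for tenants
c = (2000.0, 0.0)   # a larger size than a, but the same number of tenants

def min_max(value, low, high):
    """Map value from the range [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

def scale(row):
    size, tenants = row
    return (min_max(size, 1000.0, 5000.0), min_max(tenants, 0.0, 10.0))

# Unscaled: "size" dominates, so b looks much closer to a than c does
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Scaled: both features contribute equally, so b is now the farther point
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

##### Before scaling, the full swing in "num_tenants" barely registers next to a modest difference in "size"; after scaling, it counts as the largest possible difference on that feature.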

### **4. Data Preprocessing in Action**

##### **Linear Regression Example**

```python
# Linear Regression Example
import pandas as pd

training_data = pd.read_parquet('clean_car_listings.parquet')

# Before
print('Before Preprocessing')
print(training_data.head())

# Preprocessing: drop the categorical (text) columns so only numeric features remain
categorical_features = ['make', 'model', 'trim', 'exterior_color', 'interior_color', 'usage_type', 'city', 'state']
training_data2 = training_data.drop(columns=categorical_features)

# After
print('After Preprocessing')
training_data2.head()
```


```text
Before Preprocessing
   price  model_year make model                trim  mileage exterior_color   
0  13895        2006  BMW    Z4      Roadster 3.0si   114889          White  \
1  19888        2008  BMW    M5               Sedan   129195           Blue   
2  19999        2008  BMW    M6               Coupe    93700           Gray   
3  18995        2009  BMW    Z4  Roadster sDrive30i    95185           Gray   
4   6500        2010  BMW    X3       xDrive30i AWD   126832            Red   

  interior_color  num_accidents  num_owners usage_type       city state  
0        Unknown              0           5   Personal      Tempe    AZ  
1          Black              0           3   Personal      Tempe    AZ  
2          Black              0           2      Fleet  West Park    FL  
3          Black              1           5      Fleet  Englewood    CO  
4          Beige              0           3   Personal  Bountiful    UT  
After Preprocessing
```

|   | price | model_year | mileage | num_accidents | num_owners |
|---|-------|------------|---------|---------------|------------|
| 0 | 13895 | 2006 | 114889 | 0 | 5 |
| 1 | 19888 | 2008 | 129195 | 0 | 3 |
| 2 | 19999 | 2008 | 93700 | 0 | 2 |
| 3 | 18995 | 2009 | 95185 | 1 | 5 |
| 4 | 6500 | 2010 | 126832 | 0 | 3 |


##### **Decision Tree Example**

```python
# Decision Tree Example
from sklearn.preprocessing import LabelEncoder

# Before
print('Before Preprocessing')
print(training_data.head())

# Preprocessing: label encode each categorical feature
categorical_features = ['make', 'model', 'trim', 'exterior_color', 'interior_color', 'usage_type', 'city', 'state']
training_data2 = training_data.copy()
encoders = {}  # one fitted encoder per column, so the codes can be mapped back to labels later
for feature in categorical_features:
    encoders[feature] = LabelEncoder()
    training_data2[feature] = encoders[feature].fit_transform(training_data2[feature])

# After
print('After Preprocessing')
training_data2.head()
```

```text
Before Preprocessing
   price  model_year make model                trim  mileage exterior_color   
0  13895        2006  BMW    Z4      Roadster 3.0si   114889          White  \
1  19888        2008  BMW    M5               Sedan   129195           Blue   
2  19999        2008  BMW    M6               Coupe    93700           Gray   
3  18995        2009  BMW    Z4  Roadster sDrive30i    95185           Gray   
4   6500        2010  BMW    X3       xDrive30i AWD   126832            Red   

  interior_color  num_accidents  num_owners usage_type       city state  
0        Unknown              0           5   Personal      Tempe    AZ  
1          Black              0           3   Personal      Tempe    AZ  
2          Black              0           2      Fleet  West Park    FL  
3          Black              1           5      Fleet  Englewood    CO  
4          Beige              0           3   Personal  Bountiful    UT  
After Preprocessing
```

|   | price | model_year | make | model | trim | mileage | exterior_color | interior_color | num_accidents | num_owners | usage_type | city | state |
|---|-------|------------|------|-------|------|---------|----------------|----------------|---------------|------------|------------|------|-------|
| 0 | 13895 | 2006 | 2 | 512 | 3181 | 114889 | 8 | 10 | 0 | 5 | 1 | 482 | 2 |
| 1 | 19888 | 2008 | 2 | 257 | 3905 | 129195 | 1 | 1 | 0 | 3 | 1 | 482 | 2 |
| 2 | 19999 | 2008 | 2 | 258 | 1109 | 93700 | 3 | 1 | 0 | 2 | 0 | 529 | 7 |
| 3 | 18995 | 2009 | 2 | 512 | 3182 | 95185 | 3 | 1 | 1 | 5 | 0 | 131 | 4 |
| 4 | 6500 | 2010 | 2 | 486 | 4955 | 126832 | 5 | 0 | 0 | 3 | 1 | 46 | 37 |


##### **K-Means Example**

```python
# K-Means Example
from sklearn.preprocessing import MinMaxScaler

# Before
print('Before Preprocessing')
print(training_data.head())

# Preprocessing: keep only the numeric features (all categorical columns are excluded)
features = ['price', 'model_year', 'mileage', 'num_accidents', 'num_owners']
training_data2 = training_data[features]

# Rescale every feature to the [0, 1] range; fit_transform returns a NumPy array
scaled_values = MinMaxScaler().fit_transform(training_data2)

# After
print('After Preprocessing')
pd.DataFrame(scaled_values, columns=features)
```

```text
Before Preprocessing
   price  model_year make model                trim  mileage exterior_color   
0  13895        2006  BMW    Z4      Roadster 3.0si   114889          White  \
1  19888        2008  BMW    M5               Sedan   129195           Blue   
2  19999        2008  BMW    M6               Coupe    93700           Gray   
3  18995        2009  BMW    Z4  Roadster sDrive30i    95185           Gray   
4   6500        2010  BMW    X3       xDrive30i AWD   126832            Red   

  interior_color  num_accidents  num_owners usage_type       city state  
0        Unknown              0           5   Personal      Tempe    AZ  
1          Black              0           3   Personal      Tempe    AZ  
2          Black              0           2      Fleet  West Park    FL  
3          Black              1           5      Fleet  Englewood    CO  
4          Beige              0           3   Personal  Bountiful    UT  
After Preprocessing
```

|   | price | model_year | mileage | num_accidents | num_owners |
|---|-------|------------|---------|---------------|------------|
| 0 | 0.116883 | 0.052632 | 0.675808 | 0.000000 | 0.555556 |
| 1 | 0.178351 | 0.157895 | 0.759964 | 0.000000 | 0.333333 |
| 2 | 0.179489 | 0.157895 | 0.551163 | 0.000000 | 0.222222 |
| 3 | 0.169191 | 0.210526 | 0.559899 | 0.142857 | 0.555556 |
| 4 | 0.041036 | 0.263158 | 0.746063 | 0.000000 | 0.333333 |
| ... | ... | ... | ... | ... | ... |
| 74590 | 0.104493 | 0.473684 | 0.596923 | 0.000000 | 0.555556 |
| 74591 | 0.214464 | 0.578947 | 0.209400 | 0.000000 | 0.555556 |
| 74592 | 0.058380 | 0.368421 | 0.765658 | 0.000000 | 0.555556 |
| 74593 | 0.148709 | 0.473684 | 0.344687 | 0.000000 | 0.555556 |
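
##### Once the features are scaled, the clustering step itself might look like the sketch below. It runs on synthetic data rather than the car listings, and the choice of two clusters, `n_init=10`, and the random seeds are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic (price, mileage) rows forming two well-separated groups
cheap_high_miles = np.column_stack([
    rng.normal(8000, 500, 20),
    rng.normal(150000, 5000, 20),
])
pricey_low_miles = np.column_stack([
    rng.normal(30000, 500, 20),
    rng.normal(20000, 5000, 20),
])
X = np.vstack([cheap_high_miles, pricey_low_miles])

# Scale both features to [0, 1] so neither dominates the distance computation
scaled = MinMaxScaler().fit_transform(X)

# Fit K-Means on the scaled features and read back the cluster label of each row
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_
print(labels)
```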
