<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Preprocessing and Pipelines Overview

In this chapter, we're going to learn how to overcome common problems with Machine Learning datasets. 

For example, you'll recall that in order to perform Machine Learning the dataset must: 

1. be numerical 
2. contain no missing data

For most models, these rules are hard and fast: training fails if either one is broken.
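Both rules can be checked directly with pandas. The snippet below is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are hypothetical, not the chapter's dataset). It lists the columns that break each rule.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) that breaks both rules.
toy_df = pd.DataFrame({
    "engine_size": [130, 152, 109],
    "fuel_type": ["gas", "diesel", "gas"],   # non-numerical column
    "horsepower": [111.0, np.nan, 102.0],    # contains a missing value
})

# Rule 1: which columns are not numerical?
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Rule 2: which columns contain at least one missing value?
missing_cols = toy_df.columns[toy_df.isna().any()].tolist()

print("Non-numerical columns:", non_numeric_cols)
print("Columns with missing data:", missing_cols)
```

If both lists come back empty, the data satisfies the two rules.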


**Preprocessing**

In this chapter, we'll look at 4 ways we can preprocess our data so it's ready for training a model.

1. Reshaping (or transforming) the data to make it easier to model. 

2. Strategies for dealing with Missing Data. 

3. Strategies for dealing with non-numerical features. 

4. Strategies for Scaling your data.

**Pipelines** 

5. Finally, we can build a pipeline that chains together all these steps and fits the data to a model. 
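To give a feel for where the steps listed above are heading, here is a rough sketch of a pipeline. It assumes scikit-learn is the library in use (the text above does not name one). The toy data, the imputation strategies and the choice of `LinearRegression` are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a real dataset.
X = pd.DataFrame({
    "engine_size": [130.0, 152.0, np.nan, 136.0],
    "fuel_type": ["gas", "gas", "diesel", np.nan],
})
y = [13495.0, 16500.0, 13950.0, 17450.0]

# Numerical columns: fill missing values, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Non-numerical columns: fill missing values, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing steps.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["engine_size"]),
    ("cat", categorical_steps, ["fuel_type"]),
])

# Chain the preprocessing and the model into a single object.
model = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

The appeal of this design is that the same preprocessing is applied automatically whenever the pipeline is fitted or asked to predict.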



# The Automobiles Dataset

In this chapter, we'll use the Automobiles dataset. The dataset has a range of issues that need to be fixed. We'll address these issues by preprocessing the data before training a model.


In [1]:
import pandas as pd

auto_df = pd.read_csv('../../Data/automobiles.csv')
auto_df

Unnamed: 0,symboling,normalised_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0


This dataset has a large number of columns. Notice that some columns are hidden in the output above (between wheel_base and engine_size).
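If you would rather see every column, pandas lets you lift the display limit. The sketch below uses a made-up 30-column DataFrame (`wide_df` is hypothetical) and `pd.option_context`, which changes the setting only inside the `with` block.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns.
wide_df = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Temporarily remove the limit on the number of displayed columns.
with pd.option_context("display.max_columns", None):
    full_text = repr(wide_df)

print(full_text)
```

In a notebook you could call `display(auto_df)` inside the same `with` block instead of building the text yourself.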

In [2]:
auto_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalised_losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_of_doors       203 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  cylinders          205 non-null    object 
 16  engine_size        205 non

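You'll notice that the dataset has columns with missing data. For example, normalised_losses has only 164 non-null values out of 205 entries, and num_of_doors has 203.

You should also notice that we're not seeing data from columns 20-23 inclusive, because there's a limit to how much output a Jupyter notebook will display.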

We can break the output into two groups like this: 


In [3]:
auto_df.iloc[:, 0:20].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalised_losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_of_doors       203 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  cylinders          205 non-null    object 
 16  engine_size        205 non

In [4]:
auto_df.iloc[:, 20:].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   compression_ratio  205 non-null    float64
 1   horsepower         203 non-null    float64
 2   peak_rpm           203 non-null    float64
 3   city_mpg           205 non-null    int64  
 4   highway_mpg        205 non-null    int64  
 5   price              201 non-null    float64
dtypes: float64(4), int64(2)
memory usage: 9.7 KB
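The missing values visible in the info() listings above can also be summarised in one small table. This is a minimal sketch on made-up rows (`sample_df` is a hypothetical stand-in for auto_df). It counts the missing values per column and keeps only the columns that have any.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for auto_df with a few gaps.
sample_df = pd.DataFrame({
    "num_of_doors": ["two", None, "four", "four"],
    "horsepower": [111.0, np.nan, np.nan, 102.0],
    "city_mpg": [21, 19, 24, 18],
})

# Count missing values in every column.
missing_counts = sample_df.isna().sum()

# Build a summary with counts and percentages.
missing_summary = pd.DataFrame({
    "missing": missing_counts,
    "percent": (missing_counts / len(sample_df) * 100).round(1),
})

# Keep only the columns that actually have missing data.
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```

Because the result is one short table, it does not run into the notebook's output limit the way a long listing can.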


# Familiarise yourself with the data

Take a few moments to familiarise yourself with the data. Open it in Excel and make sure you understand what each column contains.

We're going to use various features about a car to try and predict the price of the car. 

The symboling and normalised_losses columns relate to evaluating insurance risk. We won't need them for this particular challenge.  
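As a rough sketch of what this could look like in code, the snippet below drops the two insurance columns and separates the features from the price. The rows in `sample_df` are made up for illustration. Dropping rows with a missing price is an assumption here, not a step taken from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic a few columns of the Automobiles data.
sample_df = pd.DataFrame({
    "symboling": [3, 1, 2, -1],
    "normalised_losses": [np.nan, np.nan, 164.0, 95.0],
    "engine_size": [130, 152, 109, 141],
    "price": [13495.0, 16500.0, np.nan, 16845.0],
})

# Drop the insurance-related columns we won't use.
cars = sample_df.drop(columns=["symboling", "normalised_losses"])

# Assumption: a row without a price cannot serve as a training example.
cars = cars.dropna(subset=["price"])

X = cars.drop(columns=["price"])  # features used to make the prediction
y = cars["price"]                 # target we want to predict

print(X)
print(y)
```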