### Module 03 - Assignment

***
#### Environment
`conda activate sklearn-env`
***
#### Goals
   
- [Load the data sets from the UCI website](#Dataset-load-from-CSV-located-on-UCI-website)
- [Print statistics about the data](#Basic-statistical-properties)
- [Display total count of missing values](#Display-total-count-of-missing-values)
- [Use `IterativeImputer` to compute missing values](#Use-IterativeImputer-to-compute-missing-values)
- [Use `OneHotEncoder` to encode `Cylinders` and `Origin` fields](#Use-OneHotEncoder-to-encode-Cylinders-and-Origin-fields)
- [Rescale `Displacement`, `Horsepower`, `Weight`, `Acceleration` fields using `RobustScaler` estimator](#Rescale-Displacement,-Horsepower,-Weight,-Acceleration-fields-using-RobustScaler)
- [Bucketize `Model year` field in 4 different bins to reduce the number of distinct values used in it](#Bucketize-Model-year-field-in-4-different-bins-to-reduce-the-number-of-distinct-values-used-in-it)
- [Run `LinearRegression` estimator over the transformed data and print predicted values along with label values](#Run-LinearRegression-estimator-over-the-transformed-data-and-print-predicted-values-along-with-label-values)

- [Optional](#Optional) *
- [Apply the same transformations (`imp_mean`, `encoder`, `scaler`, `bucketer` and `reg`) on test datasets](#Apply-the-same-transformations-(imp_mean,-encoder,-scaler,-bucketer-and-reg-)-on-test-datasets)
  - [Apply imputer (`imp_mean` object)](#Apply-imputer-(imp_mean-object))
  - [Apply category encoder (`encoder` object)](#Apply-category-encoder-(encoder-object))
  - [Apply scaler (`scaler` object)](#Apply-scaller-(scaler-object))
  - [Apply binning (`bucketer` object)](#Apply-binning-(bucketer-object))
  - [Run linear regression and compute the model $R^2$ score (`reg` object)](#Run-LinearRegression-estimator-on-test-data)

#### Basic Python imports for pandas (dataframe) and seaborn (visualization) packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

#### Dataset load from CSV located on UCI website.

http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data  
If the URL does not work, the dataset can be loaded from the data folder: `./data/auto-mpg.data`.

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)
dataset = raw_dataset.copy()
dataset.tail(2)

#### Dataset meta information

In [None]:
#print dataset information
<INSERT YOUR CODE HERE>

#### Basic statistical properties

In [None]:
#print statistical properties of the dataset
<INSERT YOUR CODE HERE>

#### Display total count of missing values

Notice the missing values in one of the fields.
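The three inspection steps (meta information, summary statistics, missing-value counts) can be sketched on a small made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not the auto-mpg data; the same pandas calls should apply to `dataset`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with two missing entries in 'Horsepower'
toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 31.0, 22.0],
    'Horsepower': [130.0, np.nan, 65.0, np.nan],
})

# Column names, dtypes and non-null counts
toy.info()

# Summary statistics; 'count' ignores missing cells
stats = toy.describe()
print(stats)

# isna() marks each missing cell as True; sum() counts them per column
na_counts = toy.isna().sum()
print(na_counts)
```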

In [None]:
#display na statistics
<INSERT YOUR CODE HERE>

#### Data preparation

Split the data into `training` and `test` datasets.

In [None]:
from sklearn.model_selection import train_test_split

train_dataset, test_dataset = train_test_split(dataset, test_size=0.2)
train_dataset.reset_index(drop=True,inplace=True)
test_dataset.reset_index(drop=True,inplace=True)

train_features = train_dataset.drop('MPG', axis='columns', inplace=False)
test_features = test_dataset.drop('MPG', axis='columns', inplace=False)

train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

### Use `IterativeImputer` to compute missing values

https://scikit-learn.org/stable//modules/generated/sklearn.impute.IterativeImputer.html

This imputer estimates the replacement for each missing value from the other fields. For this reason, we pass all feature columns to the `fit` and `transform` calls, not only the ones that have missing elements.
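As a rough illustration of how the imputer uses the other columns, here is a minimal sketch on a made-up array in which the second column is exactly twice the first. The array `X` and the `random_state` value are assumptions chosen for the example only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer

# Made-up data: column 1 equals 2 * column 0, with one value missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [5.0, 10.0]])

imputer = IterativeImputer(random_state=0)

# fit learns a regression of each incomplete column on the others;
# transform fills only the missing cells and leaves observed values untouched
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The filled value should land close to 8, because the imputer regresses the incomplete column on the complete one.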

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

#create an IterativeImputer and fit it on the train features
imp_mean = <INSERT YOUR CODE HERE>
<INSERT YOUR CODE HERE>

train_features[['Cylinders',
                'Displacement',
                'Horsepower', 
                'Weight', 
                'Acceleration', 
                'Model Year', 
                'Origin' ]] = imp_mean.transform(train_features)

train_features.head()


### Use `OneHotEncoder` to encode `Cylinders` and `Origin` fields

https://scikit-learn.org/stable//modules/generated/sklearn.preprocessing.OneHotEncoder.html
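A minimal sketch of one-hot encoding on a made-up frame follows. The `toy` values are assumptions for illustration. By default `transform` returns a sparse matrix, so the sketch converts it with `.toarray()` before inspecting it; a dense array is also what a DataFrame column assignment expects.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical columns (not the real dataset)
toy = pd.DataFrame({'Cylinders': [4, 8, 4, 6], 'Origin': [1, 1, 3, 2]})

encoder = OneHotEncoder()
encoder.fit(toy)

# One sorted array of categories per input column
print(encoder.categories_)

# Sparse result converted to a dense array: one 0/1 column per category
encoded = encoder.transform(toy).toarray()
print(encoded)
```

Each row contains exactly one `1` per original column, so every row of `encoded` sums to 2 here.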

In [None]:
from sklearn.preprocessing import OneHotEncoder

#create a OneHotEncoder and fit it on the Cylinders and Origin features (the assignment below needs a dense array, not a sparse matrix)
encoder = <INSERT YOUR CODE HERE>

display("OHE categories for the Cylinders and Origin columns: " + str(encoder.categories_))

train_features[['Cylinders_3',
              'Cylinders_4',
              'Cylinders_5',
              'Cylinders_6',
              'Cylinders_8',
              'Origin_USA',
              'Origin_Europe',
              'Origin_Japan']] = encoder.transform(train_features[['Cylinders', 'Origin']])

train_features.drop(['Cylinders', 'Origin'], axis=1, inplace=True)

train_features.head()

### Rescale `Displacement`, `Horsepower`, `Weight`, `Acceleration` fields using `RobustScaler`

https://scikit-learn.org/stable//modules/generated/sklearn.preprocessing.RobustScaler.html
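With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, which makes it less sensitive to outliers than mean/variance scaling. The sketch below uses a made-up column (`weights`, an assumed name) containing one extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single feature with one outlier (100.0)
weights = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler().fit(weights)
print(scaler.center_)  # median of the column
print(scaler.scale_)   # interquartile range (75th minus 25th percentile)

# (x - median) / IQR; the outlier does not distort the scale of the other rows
scaled = scaler.transform(weights)
print(scaled.ravel())
```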


In [None]:
from sklearn.preprocessing import RobustScaler

#use RobustScaler for 'Displacement', 'Horsepower', 'Weight' and 'Acceleration'
scaler = <INSERT YOUR CODE HERE>

train_features[['Displacement', 'Horsepower', 'Weight', 'Acceleration']] = scaler.transform(train_features[['Displacement', 'Horsepower', 'Weight', 'Acceleration']])
train_features.head()

### Bucketize `Model year` field in 4 different bins to reduce the number of distinct values used in it

https://scikit-learn.org/stable//modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
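Binning can be sketched on a made-up column of years. The parameter choices here (`encode='ordinal'`, `strategy='uniform'`) are assumptions for the illustration; ordinal encoding returns a single column of bin indices, whereas the default one-hot encoding returns one column per bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up model years (two-digit, as in the dataset)
years = np.array([[70.0], [71.0], [74.0], [75.0], [77.0], [78.0], [81.0], [82.0]])

# Four equal-width bins between the column's min and max
bucketer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
bins = bucketer.fit_transform(years)

print(bucketer.bin_edges_[0])  # edges learned during fit
print(bins.ravel())            # bin index (0..3) for each row
```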

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

#use KBinsDiscretizer with 4 bins for the 'Model Year' feature (ordinal encoding keeps a single output column)
bucketer = <INSERT YOUR CODE HERE>

train_features[['Model Year']] = bucketer.transform(train_features[['Model Year']])
train_features.head()

### Run `LinearRegression` estimator over the transformed data and print predicted values along with label values

https://scikit-learn.org/stable//modules/generated/sklearn.linear_model.LinearRegression.html
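The fit-then-predict pattern looks roughly like this on made-up data that follows `y = 32 - 2x` exactly. The arrays are assumptions for illustration; in the assignment the inputs would be the transformed `train_features` and `train_labels`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up feature and label values lying on the line y = 32 - 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([30.0, 28.0, 26.0, 24.0])

# fit estimates the slope (coef_) and the intercept (intercept_)
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)

# predict returns one value per input row
pred = reg.predict(X)
print(pred)
```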

In [None]:
from sklearn.linear_model import LinearRegression

#create a LinearRegression estimator and fit it on the train features and labels
reg = <INSERT YOUR CODE HERE>

train_features['Predicted_MPG'] = <INSERT YOUR CODE HERE>
pd.concat([train_features, train_labels], axis=1).head()


### Optional

### Apply the same transformations (`imp_mean`, `encoder`, `scaler`, `bucketer` and `reg` ) on test datasets


Note: do not refit these estimators on the held-out test data (do not call the `fit` method); only call `transform` or `predict`.
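The rule can be sketched with a scaler and a regressor on made-up data: everything is fitted on the training split only, and the test split is passed through `transform`, `predict` and `score`. The arrays and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Made-up train/test splits following y = 2x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = 2.0 * X_train.ravel() + 1.0
X_test = np.array([[6.0], [7.0]])
y_test = 2.0 * X_test.ravel() + 1.0

# Fit both estimators on the training data only
scaler = RobustScaler().fit(X_train)
reg = LinearRegression().fit(scaler.transform(X_train), y_train)

# Test data: reuse the fitted objects, no call to fit
X_test_scaled = scaler.transform(X_test)
test_pred = reg.predict(X_test_scaled)

# score returns the coefficient of determination R^2 on the given data
r2 = reg.score(X_test_scaled, y_test)
print(test_pred, r2)
```

Refitting on the test split would let the median, quantiles and regression weights adapt to data the model is supposed to be evaluated on, which would make the score overly optimistic.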

#### Apply imputer (`imp_mean` object)

In [None]:
test_features[['Cylinders',
                'Displacement',
                'Horsepower', 
                'Weight', 
                'Acceleration', 
                'Model Year', 
                'Origin' ]] = <INSERT YOUR CODE HERE>

test_features.head()

#### Apply category encoder (`encoder` object)

In [None]:
test_features[['Cylinders_3',
              'Cylinders_4',
              'Cylinders_5',
              'Cylinders_6',
              'Cylinders_8',
              'Origin_USA',
              'Origin_Europe',
              'Origin_Japan']] = <INSERT YOUR CODE HERE>

test_features.drop(['Cylinders', 'Origin'], axis=1, inplace=True)

test_features.head()

#### Apply scaller (`scaler` object)

In [None]:
test_features[['Displacement', 'Horsepower', 'Weight', 'Acceleration']] = <INSERT YOUR CODE HERE>
test_features.head()

#### Apply binning (`bucketer` object)

In [None]:
test_features[['Model Year']] = <INSERT YOUR CODE HERE>
test_features.head()

### Run `LinearRegression` estimator on test data

In [None]:
# run the trained linear regressor and append the 'Predicted MPG' and test labels to the test features dataset

#### Print a random sample of 10 records to observe prediction accuracy

In [None]:
# print a random sample of 10 elements