### Feature Correlation
Basic data analysis with sklearn on the auto-mpg dataset

***
#### Environment
`conda activate sklearn-env`

***
#### Goals
   
- Load data in a pandas dataframe
- Remove records with missing values
- Display statistical information about the data
- Display the correlation matrix of the dataset features
- Visualise the correlation matrix as a heatmap

#### Basic Python imports for the pandas (dataframe) and seaborn (visualization) packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#### Load the dataset from the CSV file hosted on the UCI website

http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data  
If the URL does not work the dataset can be loaded from the data folder `./data/auto-mpg.data`.

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)
raw_dataset.sample(10)
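
The fallback to the local copy mentioned above can be automated. The following is a minimal sketch, not part of the original notebook: `load_auto_mpg` is a hypothetical helper that tries each source in order, and the three inline records (illustrative values in the auto-mpg layout) stand in for the real file so that the example runs offline.

```python
import io
import pandas as pd

COLUMN_NAMES = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

def load_auto_mpg(sources):
    """Hypothetical helper: return the first source that pandas can read."""
    last_error = None
    for source in sources:
        try:
            return pd.read_csv(source, names=COLUMN_NAMES, na_values='?',
                               comment='\t', sep=' ', skipinitialspace=True)
        except (OSError, ValueError) as error:
            # Unreachable URL, missing file or unparsable content: try the next source.
            last_error = error
    raise RuntimeError('no source could be read') from last_error

# Illustrative records only; the car name after the tab is dropped by comment='\t'.
SAMPLE = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
    '24.0   4   113.0      95.00      2372.      15.0   70  3\t"car c"\n'
)

# The first path is assumed not to exist, so the in-memory sample is used.
sample_dataset = load_auto_mpg(['./missing-folder/auto-mpg.data', io.StringIO(SAMPLE)])
print(sample_dataset)
```

In the notebook itself the call would presumably be `load_auto_mpg([url, './data/auto-mpg.data'])`, so the local file is read only when the download fails.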

In [None]:
# Side example: load scikit-learn's built-in iris dataset and print its description
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
print(data.DESCR)

#### Keep the original dataset immutable and copy its content into a new dataset for further changes

In [None]:
dataset = raw_dataset.copy()
dataset.sample(10)
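
Why the copy matters can be shown on a toy dataframe. This is a small sketch with made-up values (the names `raw` and `working` are illustrative): `DataFrame.copy()` makes a deep copy by default, so edits to the copy do not reach the original.

```python
import pandas as pd

raw = pd.DataFrame({'MPG': [18.0, 25.0], 'Cylinders': [8, 4]})
working = raw.copy()          # deep copy by default

working.loc[0, 'MPG'] = 0.0   # change the working copy only

print(raw.loc[0, 'MPG'])      # the original still holds 18.0
print(working.loc[0, 'MPG'])  # the copy holds the new value
```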

#### Display dataset info

In [None]:
dataset.info()

#### Display the count of missing values per column

In [None]:
dataset.isna().sum()

#### Eliminate records with missing values from dataset

In [None]:
dataset = dataset.dropna()
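
The effect of the two steps above can be checked on a toy dataframe. This sketch uses made-up values, not the auto-mpg records: `isna().sum()` counts missing values per column, and `dropna()` removes every row that contains at least one of them.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 24.0, 31.0],
    'Horsepower': [130.0, np.nan, 95.0, np.nan],
})

missing_per_column = toy.isna().sum()   # one count per column
cleaned = toy.dropna()                  # keeps only complete rows
rows_removed = len(toy) - len(cleaned)

print(missing_per_column)
print(f'rows removed: {rows_removed}')
```

Note that `dropna()` keeps the original index labels, so the cleaned frame may have gaps in its index; `reset_index(drop=True)` renumbers it if that matters downstream.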

#### Compute basic statistics

In [None]:
stats = dataset.describe().transpose()
stats

#### Compute correlation matrix on this dataset

In [None]:
corr = dataset.corr()
corr
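
By default `DataFrame.corr()` computes the Pearson correlation coefficient for every pair of columns. As a sanity check, the sketch below computes the coefficient by hand for one pair and compares it with the pandas result. The values are illustrative and `pearson` is a helper written for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'MPG': [18.0, 15.0, 26.0, 31.0, 24.0],
    'Weight': [3504.0, 3693.0, 2130.0, 1985.0, 2372.0],
})

def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

manual = pearson(toy['MPG'].to_numpy(), toy['Weight'].to_numpy())
from_pandas = toy.corr().loc['MPG', 'Weight']
print(manual, from_pandas)   # the two values agree up to rounding
```

A value close to -1 means the two features move in opposite directions, close to +1 means they move together, and close to 0 means there is no linear relationship.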

#### Visualize correlation matrix using seaborn heatmap plot

https://seaborn.pydata.org/examples/many_pairwise_correlations.html

In [None]:
# Generate a mask for the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', mask=mask,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap="Greens")
plt.title("Correlation Heatmap")
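
What the mask hides is easier to see on a small matrix. The sketch below uses a made-up 3×3 correlation matrix: `np.triu` marks the upper triangle and the diagonal as `True`, and seaborn does not draw cells where the mask is `True`, so only the values below the diagonal remain visible.

```python
import numpy as np

# Illustrative symmetric matrix standing in for `corr`.
corr_values = np.array([[1.0, -0.8, 0.4],
                        [-0.8, 1.0, -0.5],
                        [0.4, -0.5, 1.0]])

mask = np.triu(np.ones_like(corr_values, dtype=bool))
print(mask)
# row 0: True  True  True
# row 1: False True  True
# row 2: False False True

# Cells that stay visible in the heatmap: the strictly lower triangle.
visible = corr_values[~mask]
print(visible)   # the three values -0.8, 0.4 and -0.5
```

Because a correlation matrix is symmetric and its diagonal is always 1, the masked cells carry no additional information.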

#### Visualize pairs of fields

https://seaborn.pydata.org/examples/scatterplot_matrix.html

In [None]:
sns.pairplot(dataset[['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                      'Weight', 'Acceleration']],
             diag_kind='kde', corner=True)

#### Linear Regression

In [None]:
sns.pairplot(dataset[['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                      'Weight', 'Acceleration']],
             diag_kind='kde', corner=True, height=5, aspect=.8, kind="reg")

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 6))
sns.regplot(ax=axes[0], x='Displacement', y='MPG', data=dataset, order=1, ci=None, line_kws={'color': 'red'});
sns.regplot(ax=axes[1], x='Displacement', y='MPG', data=dataset, order=4, ci=None, line_kws={'color': 'red'});
sns.regplot(ax=axes[2], x='Displacement', y='MPG', data=dataset, order=15, ci=None, line_kws={'color': 'red'});
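
The three panels differ only in the polynomial order of the fitted curve; for `order` greater than 1, `regplot` estimates a polynomial regression with `numpy.polyfit`. The sketch below reproduces the comparison numerically on synthetic data (an assumed decaying curve plus noise, not the auto-mpg records). It uses `numpy.polynomial.Polynomial.fit` instead of `polyfit`, because it rescales `x` internally and stays numerically stable at order 15.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for Displacement (x) and MPG (y); values are assumptions.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(70.0, 450.0, size=80))
y = 45.0 * np.exp(-x / 250.0) + rng.normal(0.0, 2.0, size=80)

training_mse = {}
for order in (1, 4, 15):
    fit = Polynomial.fit(x, y, deg=order)   # least-squares polynomial fit
    residuals = y - fit(x)
    training_mse[order] = float(np.mean(residuals ** 2))
    print(f'order={order:2d}  training MSE={training_mse[order]:.3f}')
```

The training error can only go down as the order grows, because each higher-order polynomial contains the lower-order ones as special cases. A lower training error does not mean a better model: the order-15 curve tends to follow the noise, which is the wiggling visible in the third panel.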