# Machine Learning (Pandas & Sklearn)

* [Pandas](https://pandas.pydata.org/): Python Data Analysis Library
    * Pandas is usually used for data reading and preprocessing
    * `pip install pandas`
    * Tutorial: <https://pandas.pydata.org/docs/getting_started/index.html>
    * Cheat sheet: <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>
* [Scikit-learn](http://scikit-learn.org/) (sklearn)
    * Scikit-learn provides a wide range of machine learning algorithms, from classification and regression to clustering
    * `pip install scikit-learn`
    * Tutorial: <https://scikit-learn.org/stable/getting_started.html>
    * API Reference: <https://scikit-learn.org/stable/modules/classes.html>
    * Cheat sheet: <https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf>
    
In this demo, we use a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php), specifically the [Automobile](https://archive.ics.uci.edu/ml/datasets/Automobile) dataset. Each record contains three types of information: (a) the specification of a car in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The task is to predict the price of each car from these attributes.

## Download dataset

In [None]:
import os
import urllib.request

print('Begin downloading automobile dataset...')

# We use UCI Machine Learning dataset - Automobile here
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
description = 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names'
if not os.path.isfile("automobile.data"):
    urllib.request.urlretrieve(data_url, 'automobile.data')
    urllib.request.urlretrieve(description, 'automobile.names')

### Attribute Information

|     Attribute             |  Attribute Range |
| :---: | :---: |
|  1. symboling             |  -3, -2, -1, 0, 1, 2, 3 |
|  2. normalized-losses     |  continuous from 65 to 256 |
|  3. make                  |  alfa-romero, audi, bmw, chevrolet, dodge, honda |
|  ...                       |  isuzu, jaguar, mazda, mercedes-benz, mercury |
|  ...                       |  mitsubishi, nissan, peugot, plymouth, porsche |
|  ...                       |  renault, saab, subaru, toyota, volkswagen, volvo |
|  4. fuel-type             |  diesel, gas |
|  5. aspiration            |  std, turbo |
|  6. num-of-doors          |  four, two |
|  7. body-style            |  hardtop, wagon, sedan, hatchback, convertible |
|  8. drive-wheels          |  4wd, fwd, rwd |
|  9. engine-location       |  front, rear |
| 10. wheel-base            |  continuous from 86.6 to 120.9 |
| 11. length                |  continuous from 141.1 to 208.1 |
| 12. width                 |  continuous from 60.3 to 72.3 |
| 13. height                |  continuous from 47.8 to 59.8 |
| 14. curb-weight           |  continuous from 1488 to 4066 |
| 15. engine-type           |  dohc, dohcv, l, ohc, ohcf, ohcv, rotor |
| 16. num-of-cylinders      |  eight, five, four, six, three, twelve, two |
| 17. engine-size           |  continuous from 61 to 326 |
| 18. fuel-system           |  1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi |
| 19. bore                  |  continuous from 2.54 to 3.94 |
| 20. stroke                |  continuous from 2.07 to 4.17 |
| 21. compression-ratio     |  continuous from 7 to 23 |
| 22. horsepower            |  continuous from 48 to 288 |
| 23. peak-rpm              |  continuous from 4150 to 6600 |
| 24. city-mpg              |  continuous from 13 to 49 |
| 25. highway-mpg           |  continuous from 16 to 54 |
| 26. price                 |  continuous from 5118 to 45400 |

## Data Processing

In [None]:
import numpy as np
import pandas as pd

### Read dataset
Commonly in `csv` format (i.e. items separated by `,`)

See [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for more usage

In [None]:
attr = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
data = pd.read_csv("automobile.data",names=attr)
data.head()

### Different types of data
Ref: <https://towardsdatascience.com/data-types-in-statistics-347e152e8bee>

* Categorical
    * Nominal: No order, e.g. sex
    * Ordinal: With order, e.g. education

* Numerical
    * Discrete: Can't be measured but can be counted, e.g. the number of times something is done
    * Continuous: Can't be counted but can be measured, e.g. temperature
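
As a rough illustration of the four types above, the sketch below represents each one in pandas. The values and variable names (`sex`, `education`, `visits`, `temperature`) are made up for this example and do not come from the Automobile dataset.

```python
import pandas as pd

# Nominal: categories without any order.
sex = pd.Categorical(["male", "female", "female"])

# Ordinal: categories with an explicit order (lowest to highest).
education = pd.Categorical(
    ["bachelor", "high school", "master"],
    categories=["high school", "bachelor", "master"],
    ordered=True,
)

# Discrete: counted values, stored as integers.
visits = pd.Series([0, 2, 5])

# Continuous: measured values, stored as floats.
temperature = pd.Series([36.6, 37.2, 38.1])

print(sex.ordered, education.ordered)  # whether an order is defined
print(education.codes)                 # integer position of each value in the order
print(visits.dtype, temperature.dtype)
```

Only the ordered categorical supports comparisons such as `education.max()`, which is one practical reason to tell nominal and ordinal attributes apart.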

### Missing data
The dataset contains many `?` entries, which mark missing values
* Use [`data.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) to find NaN values
* But first, `?` must be replaced with NaN

In [None]:
data.replace("?", np.nan, inplace=True)

In [None]:
data.isna().sum()
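
An alternative to replacing `?` after loading is to let `pd.read_csv` treat it as missing through its `na_values` argument. The sketch below uses two made-up rows and only four columns for brevity; it is not the real file. One side effect worth noting is that numeric columns are then parsed as floats, whereas a column that contains `?` strings is read as text.

```python
import io

import pandas as pd

# Two made-up rows in the same comma-separated, header-less layout
# (only four columns are used here, for brevity).
sample = io.StringIO("3,?,alfa-romero,13495\n1,164,audi,?\n")
cols = ["symboling", "normalized-losses", "make", "price"]

# na_values="?" turns the placeholder into NaN while parsing.
df = pd.read_csv(sample, names=cols, na_values="?")

print(df.dtypes)
print(df.isna().sum())
```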

To deal with missing data, we can
* Delete rows with missing items (this may discard most of the data)
* Fill with **means / modes / maximums / other meaningful metrics**

The following only gives a naive method.

In practice, you **should** use different metrics for different types of attributes!

In [None]:
modes = data.mode().iloc[0]
data.fillna(modes,inplace=True)

In [None]:
data.isna().sum()
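
As a slightly less naive variant of the mode filling above, the sketch below fills numeric columns with their median and categorical columns with their mode. The DataFrame `toy` and its values are made up for illustration. It also assumes the numeric columns already have a numeric dtype; columns read as strings would first need a conversion such as `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "num-of-doors": ["two", "four", np.nan, "four"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)

# Numeric attributes: fill with the column median.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical attributes: fill with the most frequent value (mode).
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```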

### Separate data and label
We use all the other features (X) to predict the price (y, the last column)

In [None]:
y = data["price"]
data.drop(["price"],axis=1,inplace=True)

### Change text data (categorical) into number
* Use numbers to denote different categories
* Change categorical features into [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)

In [None]:
cat_attr = ["make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","engine-type","num-of-cylinders","fuel-system"]
for a in cat_attr:
    data[a] = pd.Categorical(data[a]) # change column type to categorical
    dummies = pd.get_dummies(data[a],prefix="{}_category".format(a))
    data = pd.concat([data,dummies],axis=1)
data.drop(cat_attr,axis=1,inplace=True)
data.head()
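
The cell above applies one-hot encoding to every categorical attribute. The first option in the list (numbers for categories) can make sense when the text has a natural numeric meaning. The sketch below shows both on a made-up DataFrame `cars`; the mapping `word_to_int` is an assumption written for this example. It also uses the `columns` argument of `pd.get_dummies`, which encodes the listed columns in a single call.

```python
import pandas as pd

# Made-up rows using two attribute names from the dataset.
cars = pd.DataFrame({
    "num-of-cylinders": ["four", "six", "four", "eight"],
    "fuel-type": ["gas", "diesel", "gas", "gas"],
})

# The words carry a numeric meaning, so map them to integers.
word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}
cars["num-of-cylinders"] = cars["num-of-cylinders"].map(word_to_int)

# Nominal attribute: one-hot encode; new columns are named "<column>_<value>".
cars = pd.get_dummies(cars, columns=["fuel-type"])

print(cars)
```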

### Feature selection
* Variance
* Pearson correlation $R$
* $\chi^2$ test
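
A minimal sketch of the first two criteria on synthetic data follows. The feature names and the target are made up for illustration. The $\chi^2$ test is left out because scikit-learn's `chi2` scorer is meant for classification targets and non-negative features, while the price here is continuous.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 100

# Synthetic features: one informative, one pure noise, one constant.
X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
target = 3.0 * X["informative"] + 0.1 * rng.normal(size=n)

# Variance: drop features whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)
kept = X.columns[selector.get_support()]

# Pearson correlation between each remaining feature and the target.
corr = X[kept].corrwith(target)

print(list(kept))
print(corr)
```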

### Dimensionality reduction
* Principal Component Analysis (PCA)
* Linear Discriminant Analysis (LDA)

For more methods, please see <https://www.zhihu.com/question/29316149/answer/110159647>
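
As an illustration of PCA, the sketch below builds synthetic 5-dimensional data that really varies along only 2 directions and projects it onto 2 components. The data is made up for this example. LDA is not shown because it is supervised and needs class labels, which a regression target such as price does not provide.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 samples generated from 2 latent factors, mixed into 5 features,
# plus a small amount of noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
# Fraction of the total variance kept by the 2 components.
print(pca.explained_variance_ratio_.sum())
```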

### Separate train and test data
Since no test/validation data are available, we manually separate the data into train data and test data

In [None]:
train_size = int(len(data) * 0.8)
X_train = data[:train_size]
y_train = y[:train_size].to_numpy().astype(np.float64)
X_test = data[train_size:]
y_test = y[train_size:].to_numpy().astype(np.float64)
print("Train size: {}".format(len(X_train)))
print("Test size: {}".format(len(X_test)))
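
The split above takes the first 80% of the rows in file order. If the rows happen to be ordered (for example, grouped by make), a shuffled split may give a more representative test set. The sketch below shows this on toy arrays with `train_test_split` from `sklearn.model_selection`; the arrays and the `test_size` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle, then hold out 20% of the samples for testing.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

print(len(X_tr), len(X_te))
```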

In [None]:
"""
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y, random_state=3)
"""

### Scaling
* Standardization: $x'=\frac{x-\bar{x}}{\sigma}$
* MinMaxScaling: $x'=\frac{x-\min}{\max-\min}$

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
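
The two formulas can be checked by hand on a tiny made-up column. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which matches NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# x' = (x - mean) / sigma
standardized = StandardScaler().fit_transform(x)
manual_std = (x - x.mean(axis=0)) / x.std(axis=0)

# x' = (x - min) / (max - min)
min_max = MinMaxScaler().fit_transform(x)
manual_mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(standardized, manual_std))
print(np.allclose(min_max, manual_mm))
```

As in the cell above, a scaler should be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into preprocessing.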

## Training & Evaluation

In [None]:
from sklearn.linear_model import LinearRegression

# create model
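# The features were already standardized above, and the `normalize` argument
# was removed in scikit-learn 1.2, so the default constructor is used.
lr = LinearRegression()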

In [None]:
# model fitting
lr.fit(X_train, y_train)

In [None]:
# prediction
y_pred = lr.predict(X_test)

In [None]:
# evaluate
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)
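
The mean squared error is expressed in squared price units, which can be hard to interpret. As a sketch on made-up numbers, the root mean squared error (same unit as the price) and the $R^2$ score can be reported alongside it; the arrays `y_true` and `y_hat` are hypothetical and not results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices.
y_true = np.array([10000.0, 15000.0, 20000.0])
y_hat = np.array([11000.0, 14000.0, 20000.0])

mse = mean_squared_error(y_true, y_hat)  # average squared error
rmse = np.sqrt(mse)                      # back in price units
r2 = r2_score(y_true, y_hat)             # 1.0 means a perfect fit

print(mse, rmse, r2)
```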