
In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn: Machine Learning in Python

Numerous machine learning algorithms and models are implemented in [scikit-learn](https://scikit-learn.org/stable/index.html).

The six main categories are:
* Regression
* Classification
* Clustering
* Dimensionality reduction
* Model selection
* Preprocessing

## Estimators

In scikit-learn, a machine learning model is called an **Estimator**.

Each **Estimator** is a Python `class` with roughly the following form (recall the structure of a `class`).

```python
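class Estimator:
    def __init__(self, **hyperparameters):
        # store settings only; the data is passed to fit()
        self.hyperparameters = hyperparameters
    def fit(self, X, y=None):
        # do some calculations with X and y and store the results on self
        return self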
```

So, to estimate coefficients or learn patterns in data, simply:

1. Initialize an estimator.
2. Fit the estimator to the data of interest.
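
The two-step pattern above can be illustrated with a toy estimator. This is only a minimal sketch: `MeanPredictor` is a hypothetical class written for this example, not part of scikit-learn, and it merely mimics the `fit`/`predict` convention.

```python
import numpy as np

class MeanPredictor:
    """Toy estimator (hypothetical) that always predicts the training mean of y."""

    def __init__(self):
        self.mean_ = None  # learned in fit(); the trailing underscore mimics scikit-learn naming

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self  # returning self allows call chaining

    def predict(self, X):
        # one prediction per row of X
        return np.full(len(X), self.mean_)

est = MeanPredictor()                      # 1. initialize
est.fit([[1], [2], [3]], [2.0, 4.0, 6.0])  # 2. fit
pred = est.predict([[10], [20]])
print(pred)
```

Real estimators follow the same shape: settings go to the constructor, data goes to `fit`, and learned values are stored as attributes.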

## Regression : Linear Regression - Ordinary Least Squares (OLS)

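Ordinary least squares models the response as a linear combination of the features:

$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p$

Estimate the weights $w$ that minimize the residual sum of squares $\sum_i{(y_i-\hat{y}_i)^2}$.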

In [0]:
# initialize a linear model estimator
from sklearn import linear_model
lm = linear_model.LinearRegression()

The `fit` method of `LinearRegression` accepts two inputs, `X` and `y`.

They should be organized as below.

X = $[[x_{11}, x_{12}], [x_{21}, x_{22}], [x_{31},x_{32}]]$

y = $[y_{1}, y_{2}, y_{3}]$

If each observation has only one feature, let $X = [[x_{11}], [x_{21}], [x_{31}]]$.

So, to estimate a linear model for the data $x=[1,2,3,4,5]$ and $y=[0,2,4,1,4]$, write X as $[[1], [2], [3], [4], [5]]$.
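
When the values already sit in a flat list or array, the conversion to the two-dimensional layout can be done with NumPy's `reshape`. A small sketch (the variable names `x` and `X` are chosen here for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])  # 1-D array, shape (5,)
X = x.reshape(-1, 1)           # 2-D array, shape (5, 1): one row per sample, one column per feature
print(x.shape, X.shape)
```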

In [0]:
X = [[1], [2], [3], [4], [5]]
y = [0,2,4,1,4]
plt.scatter(X, y)

Now fit the estimator `lm`.

In [0]:
lm.fit(X, y)

Estimated intercept $w_0$ and coefficients $w_i$ are stored in `lm.intercept_` and `lm.coef_` respectively.

In [0]:
print('Estimated intercept is', lm.intercept_, 'and estimated coefficient is', lm.coef_[0])

The estimated linear model is $\hat{y} = 0.1 + 0.7x$.
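
As a cross-check, the same coefficients can be obtained by solving the least-squares problem directly with NumPy. This is a sketch for comparison only; the design matrix `A` with a column of ones for the intercept is a construction introduced here, not something scikit-learn exposes.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0, 2, 4, 1, 4], dtype=float)

# Design matrix with a leading column of ones for the intercept w_0
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution of A w = y
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # intercept and slope, approximately 0.1 and 0.7
```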

Do you want to predict a value for $x=10$? &rarr; use `lm.predict()`

In [0]:
lm.predict([[10]]) # X must be a two-dimensional array, even for a single sample

Lastly, using `lm.predict()`, we can draw the estimated line with data points.

In [0]:
y_pred = lm.predict(X)
plt.scatter(X, y)
plt.plot(X, y_pred, color='gray', linestyle='--')

Remember the rules of scikit-learn. Initialize and fit.


## Classification : Support Vector Machine (SVM)

A support vector machine (SVM) is a supervised learning model that assigns data points to given labels.
An SVM finds the hyperplane that separates the labels with the largest margin and uses it as the classifier.

* For $p$-dimensional vectors, a hyperplane of $(p-1)$ dimensions can separate the vectors into labels.
* For example, if each observation has two values, the hyperplane that divides the observations is a line (1-dim).
* If each observation has three values, the hyperplane is a plane (2-dim).

A hyperplane is described by a normal vector $\overrightarrow{w}$ and an offset $b$: it is the set of points $\overrightarrow{x}$ satisfying $\overrightarrow{w}\cdot\overrightarrow{x}-b = 0$. A point is assigned to one label when $\overrightarrow{w}\cdot\overrightarrow{x}-b > 0$ and to the other label otherwise.
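
To make the roles of $\overrightarrow{w}$ and $b$ concrete, the sketch below classifies a few points by the sign of $\overrightarrow{w}\cdot\overrightarrow{x}-b$, assuming the boundary is where this score equals zero. The values of `w` and `b` are made up for illustration and are not learned from any data.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration
w = np.array([1.0, 1.0])
b = 3.0

points = np.array([[0.5, 0.0], [1.5, 3.0], [3.0, 2.0]])

# Signed score of each point; the hyperplane itself is where the score is 0
scores = points @ w - b
labels = (scores > 0).astype(int)  # 1 on the positive side, 0 on the other side
print(scores, labels)
```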

### Example

Randomly generate samples from two multivariate normal distributions

In [0]:
np.random.seed(1)
dat1 = np.random.multivariate_normal(mean=[1,1], cov=[[0.3, 0], [0, 0.3]], size=50)
dat2 = np.random.multivariate_normal(mean=[2,1.5], cov=[[0.3, 0], [0, 0.3]], size=50)

In [0]:
plt.scatter(dat1[:,0], dat1[:,1])
plt.scatter(dat2[:,0], dat2[:,1])

1. Initialize an estimator

In [0]:
from sklearn import svm
clf = svm.LinearSVC() # Linear support vector machine

2. Fit the estimator to the data. We will use the two coordinates of each generated point as features and the distribution it came from as the label.

In [0]:
X = np.concatenate((dat1, dat2))
y = [0]*50 + [1]*50
clf.fit(X, y)

In [0]:
print(clf.intercept_, clf.coef_)

Thus, the learned decision function is $f(x) = -2.87 + 1.30x_{1}+0.67x_{2}$. Points with $f(x) > 0$ are assigned label 1 and the remaining points label 0.

Predict labels of other data points.

In [0]:
X_pred = np.array([[0.5, 0], [1.5, 3], [3, 2]])
y_pred = clf.predict(X_pred)
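
The link between the fitted coefficients and the predicted labels can be checked by hand. The sketch below refits the same synthetic data and compares a manual score, `coef_` times the point plus `intercept_`, with the model's own output. The arguments `random_state=0` and `max_iter=10000` are additions made here for reproducibility and convergence, so the exact coefficients may differ from those printed above.

```python
import numpy as np
from sklearn import svm

# Same synthetic data as in the example
np.random.seed(1)
dat1 = np.random.multivariate_normal(mean=[1, 1], cov=[[0.3, 0], [0, 0.3]], size=50)
dat2 = np.random.multivariate_normal(mean=[2, 1.5], cov=[[0.3, 0], [0, 0.3]], size=50)
X = np.concatenate((dat1, dat2))
y = [0] * 50 + [1] * 50

clf = svm.LinearSVC(random_state=0, max_iter=10000).fit(X, y)

X_pred = np.array([[0.5, 0], [1.5, 3], [3, 2]])
# Manual decision score: w . x + intercept
manual_scores = X_pred @ clf.coef_[0] + clf.intercept_[0]
manual_labels = (manual_scores > 0).astype(int)  # positive score -> label 1
print(manual_scores, clf.predict(X_pred))
```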

#### Check the results on a plot

In [0]:
plt.scatter(dat1[:,0], dat1[:,1], alpha=0.2)
plt.scatter(dat2[:,0], dat2[:,1], alpha=0.2)

X_tmp = np.arange(0.5, 2.5, 0.1) 
SVM_line = 1/clf.coef_[0][1]*(-clf.intercept_[0] - clf.coef_[0][0]*X_tmp)
plt.plot(X_tmp, SVM_line, color='gray', linestyle='--', label='SVM')


plt.scatter(X_pred[:,0], X_pred[:,1], marker='s', s=100, 
            color = ['tab:blue' if x==0 else 'tab:orange' for x in y_pred], label='Predicted')

plt.legend()

### Q. Can we build a classifier that predicts who survived the Titanic? 

In [0]:
import seaborn as sns
titanic=sns.load_dataset('titanic') # load data
titanic.head()

In [0]:
titanic.sex = [0 if x=='male' else 1 for x in titanic.sex] # male = 0, female = 1
X = titanic[['pclass', 'sex', 'age', 'fare']] # use four columns
not_na = ~pd.isna(X.age)
X = X[not_na]
y = titanic['survived'][not_na] # survived is the label

In [0]:
X.head()

In [0]:
from sklearn import svm
clf = svm.SVC(gamma='auto', random_state=0) # unlike LinearSVC, SVC also supports nonlinear kernels (the default is RBF)
clf.fit(X, y)

In [0]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X)
print('{:.2%}\n'.format(accuracy_score(y, y_pred)))
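
The difference between a linear and a nonlinear kernel is easier to see on data that no straight line can separate. The sketch below uses `make_circles`, a synthetic data generator from `sklearn.datasets` that is not part of this lecture's data, and compares training accuracy for the two kernels.

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Two concentric rings: no straight line separates them
X_c, y_c = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Training accuracy of a linear kernel versus an RBF kernel
linear_acc = svm.SVC(kernel='linear').fit(X_c, y_c).score(X_c, y_c)
rbf_acc = svm.SVC(kernel='rbf', gamma='auto').fit(X_c, y_c).score(X_c, y_c)
print(linear_acc, rbf_acc)
```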

## Model selection : How to improve and evaluate the learned model?

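Until now, we used the entire data set both to fit a model and to measure its accuracy. This overestimates performance, because the model has already seen every point it is scored on. We need to test the learned model on new data.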

&rarr; split data into train, validation, and test sets.

* Train set: a subset of data to train a model
* Validation set: a subset of data to tune hyperparameters
* Test set: a subset of data to evaluate the learned model

Since our current goal is not to tune hyperparameters, we will skip the validation set and use 70\% of the data as the train set and 30\% as the test set.

Fortunately, `sklearn` provides an easy way to split data into train and test sets: the `train_test_split` function in `sklearn.model_selection`.

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

Now fit an SVM model to the train set.

In [0]:
from sklearn import svm
clf = svm.SVC(gamma='auto', random_state=0) # unlike LinearSVC, SVC also supports nonlinear kernels (the default is RBF)
clf.fit(X_train, y_train)

Then predict labels for the test set

In [0]:
y_pred = clf.predict(X_test)

In [0]:
from sklearn.metrics import accuracy_score
print('{:.2%}\n'.format(accuracy_score(y_test, y_pred)))
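
When hyperparameters do need tuning, the three-way split described at the start of this section can be built from two calls to `train_test_split`. The sketch below is one possible way to do it; the 60/20/20 proportions and the placeholder arrays `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration: 100 samples, 4 features
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100) % 2

# First hold out 20% of the data as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1)

# Then split the remaining 80% into train (60% of total) and validation (20% of total)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

The validation set is then used to compare hyperparameter settings, and the test set is touched only once at the end.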

## Clustering : K-means

What if no labels are given in data? 

We then need to group the data by its underlying patterns, and methods that do this job are called **unsupervised learning**. K-means clustering is one of the most popular unsupervised learning methods.

Simple idea: assume that the data points are separated into $K$ clusters. Find $K$ centroids that minimize the within-cluster variances.

Let $(x_1, x_2, ..., x_n)$ be the observations and $(\mu_1, \mu_2, ..., \mu_K)$ be the centroids, where $\mu_i$ is the mean of the points in cluster $C_i$.

Minimize $\sum_{i=1}^{K}\sum_{x\in C_{i}}\|x-\mu_i\|^2$.

<span style="color:red"> **The number of clusters $K$ must be chosen in advance.**</span>

In [0]:
np.random.seed(1)
dat1 = np.random.multivariate_normal(mean=[1,1], cov=[[0.3, 0], [0, 0.3]], size=50)
dat2 = np.random.multivariate_normal(mean=[2,1.5], cov=[[0.3, 0], [0, 0.3]], size=50)
dat = np.concatenate((dat1, dat2))

In [0]:
plt.scatter(dat[:,0], dat[:,1], color='gray') # the points come from two different distributions, but the labels are hidden here

### K=2

In [0]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(dat)

In [0]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(dat1[:,0], dat1[:,1])
plt.scatter(dat2[:,0], dat2[:,1])
plt.title('Original data')

plt.subplot(1,2,2)
labels = kmeans.labels_
plt.scatter(dat[:,0], dat[:,1], color=['tab:blue' if x==0 else 'tab:orange' for x in labels])
plt.title('K-means clustering: K=2')
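
The quantity that K-means minimizes, the within-cluster sum of squared distances, is stored by scikit-learn in the `inertia_` attribute. The sketch below recomputes it by hand from `labels_` and `cluster_centers_` for the same synthetic data. The argument `n_init=10` is set explicitly here because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(1)
dat1 = np.random.multivariate_normal(mean=[1, 1], cov=[[0.3, 0], [0, 0.3]], size=50)
dat2 = np.random.multivariate_normal(mean=[2, 1.5], cov=[[0.3, 0], [0, 0.3]], size=50)
dat = np.concatenate((dat1, dat2))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dat)

# Sum over clusters of squared distances from each point to its cluster centroid
wcss = sum(
    np.sum((dat[kmeans.labels_ == i] - center) ** 2)
    for i, center in enumerate(kmeans.cluster_centers_)
)
print(wcss, kmeans.inertia_)
```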

### K=4

In [0]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(dat)

In [0]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(dat1[:,0], dat1[:,1])
plt.scatter(dat2[:,0], dat2[:,1])
plt.title('Original data')

plt.subplot(1,2,2)
labels = kmeans.labels_
cmap = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']
plt.scatter(dat[:,0], dat[:,1], color=[cmap[x] for x in labels])
plt.title('K-means clustering: K=4')

### Example: wine data

This data set gives the chemical composition and the type of each wine.

In [0]:
from sklearn.datasets import load_wine
wine = load_wine() 
X = pd.DataFrame(wine.data, columns = wine.feature_names)
y = wine.target # there are three types

In [0]:
X.head()

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

In [0]:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_train)

In [0]:
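y_pred = kmeans.predict(X_test)
# caution: cluster ids are arbitrary, so this score is meaningful only if cluster i happens to match wine type i
print('{:.2%}\n'.format(accuracy_score(y_test, y_pred)))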

Not bad, but there is room for improvement. How?
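
One caveat when scoring clusters with `accuracy_score`: K-means numbers its clusters arbitrarily, so a perfect grouping can still receive a low accuracy. A permutation-invariant alternative is `adjusted_rand_score` from `sklearn.metrics`. The sketch below uses small made-up label lists to show the difference.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
# Same grouping as true_labels, but with the cluster ids permuted
cluster_ids = [1, 1, 2, 2, 0, 0]

acc = accuracy_score(true_labels, cluster_ids)      # compares ids position by position
ari = adjusted_rand_score(true_labels, cluster_ids) # compares groupings, ignoring the ids
print(acc, ari)
```

Here the accuracy is zero although the grouping is identical, while the adjusted Rand index reports a perfect match.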

## Preprocessing : Standardization

Standardization is needed for data like the wine data because the columns have very different means and variances. K-means relies on Euclidean distances, so columns with large values dominate the clustering. Standardizing each column to zero mean and unit variance resolves this issue to some extent. To standardize data, we will use the `sklearn.preprocessing` module.

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # calculate mean and standard deviation of train set

In [0]:
scaler.mean_

In [0]:
scaler.scale_
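
These two attributes are all the scaler needs: `transform` subtracts `mean_` and divides by `scale_`, column by column. The sketch below checks this on a small made-up matrix (`data` and `scaler_demo` are names introduced here for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: two features on very different scales
data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaler_demo = StandardScaler().fit(data)

# Manual standardization with the population standard deviation (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(scaler_demo.transform(data))
```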

In [0]:
X_train_scaled = scaler.transform(X_train) # the same fitted scaler is applied to the test set later

In [0]:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_train_scaled)

In [0]:
X_test_scaled = scaler.transform(X_test)
y_pred = kmeans.predict(X_test_scaled)
print('{:.2%}\n'.format(accuracy_score(y_test, y_pred)))

## Dimensionality reduction : Principal Component Analysis (PCA)

The wine data has 13 features, which makes it hard to visualize and understand. Dimensionality reduction offers the following advantages (https://en.wikipedia.org/wiki/Dimensionality_reduction):
* It reduces the time and storage space required.
* Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model.
* It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
* It avoids the curse of dimensionality.

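What is PCA?

PCA finds orthogonal directions, called principal components, along which the data varies the most, and projects the data onto the first few of them. Mathematically, the principal components are the eigenvectors of the covariance matrix of the data, and the corresponding eigenvalues give the variance captured by each component.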

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=0) # 13 dimensions to 2 dimensions
pca.fit(X_train_scaled) # find principal components

In [0]:
pca.explained_variance_ratio_

About 37\% of the variance is explained by the first principal component and about 19\% by the second. This means that the first two components capture more than half of the total variance.

So, projecting the wine data onto the first two principal components can give a good overview of the data.

In [0]:
pca_transformed = pca.transform(X_train_scaled) # project the data onto principal components

In [0]:
cmap = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']
plt.scatter(pca_transformed[:,0], pca_transformed[:,1], color = [cmap[x] for x in y_train])

The three wine types are separated well by the first two principal components.
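
The explained-variance ratios reported by `PCA` can be reproduced from the eigenvalues of the covariance matrix. The sketch below does this with NumPy. For brevity it standardizes the full wine data rather than the train split used above, so the numbers differ slightly from the ones quoted.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Eigenvalues of the covariance matrix (rows are samples, so rowvar=False), largest first
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]
ratio_manual = eigvals / eigvals.sum()

pca_demo = PCA(n_components=2).fit(X_std)
print(ratio_manual[:2], pca_demo.explained_variance_ratio_)
```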

## Pipelines : chaining pre-processors and estimators


We can run all of these steps at once with the `sklearn.pipeline` module!

In [0]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine() 
X = pd.DataFrame(wine.data, columns = wine.feature_names)
Y = wine.target # there are three types
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=0)
)

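pipe.fit(X_train, Y_train) # Y_train is ignored by the unsupervised steps

# use the fitted pipeline itself: it scales X_test and then assigns clusters
from sklearn.metrics import accuracy_score
Y_pred = pipe.predict(X_test)
print('{:.2%}\n'.format(accuracy_score(Y_test, Y_pred)))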
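
To see what the pipeline does when predicting, its fitted steps can be pulled out through `named_steps` and applied by hand. This is a sketch under the assumption that `make_pipeline` names each step after its lowercased class name. The names `pipe_demo`, `scaler_step` and `kmeans_step` are introduced here, and `n_init=10` is set explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.30, random_state=0)

# Pipeline: scaling and clustering in one object
pipe_demo = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
pipe_demo.fit(X_tr)

# The fitted steps, looked up by their automatically generated names
scaler_step = pipe_demo.named_steps['standardscaler']
kmeans_step = pipe_demo.named_steps['kmeans']

# Applying the steps by hand gives the same labels as pipe_demo.predict
manual = kmeans_step.predict(scaler_step.transform(X_te))
same = np.array_equal(pipe_demo.predict(X_te), manual)
print(same)
```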