<div style="width:image width px; font-size:75%; text-align:right;">
    <img src="img/pexels-brandon-montrone-1179229.jpg" width="width" height="height" style="padding-bottom:0.2em;" />
    <figcaption>Photo by Kaique Rocha on Pexels</figcaption>
</div>

# Machine Learning 2

**Applied Programming - Summer term 2022 - FOM Hochschule für Oekonomie und Management - Cologne**

**Lecture 08 - May 12, 2022**

*Dennis Gluesenkamp*

## Table of contents
* [Dataset recap](#datarecap)
* [Regression](#regression)
    * [Linear regression](#regression_linear)
    * [Lasso](#regression_lasso)
    * [Random forest regressor](#regression_rf)
* [Clustering](#clustering)
    * [kMeans](#clustering_kmeans)
    * [Alternative](#clustering_alternative)

## Dataset recap<a class="anchor" id="datarecap"></a>

In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

%matplotlib inline

np.random.seed(42)

In [None]:
# Artificial classification problem:
X_c, y_c = make_classification(n_samples            = 1000,
                               n_features           = 10,
                               n_redundant          = 2,
                               n_informative        = 5,
                               n_classes            = 3,
                               n_clusters_per_class = 3,
                               flip_y               = 0.05,
                               shift                = None,
                               random_state         = 42)

# Artificial regression problem:
X_r, y_r = make_regression(n_samples     = 1000,
                           n_features    = 10,
                           n_informative = 5,
                           noise         = 0.05,
                           random_state  = 42)

# Scikit-learn toy dataset for classification:
X_b, y_b = load_breast_cancer(return_X_y = True)

# Scikit-learn toy dataset for regression:
X_d, y_d = load_diabetes(return_X_y = True)

# Kaggle dataset for regression:
# Medical Cost Personal Datasets - Insurance Forecast by using Linear Regression
# https://www.kaggle.com/mirichoi0218/insurance
df_i = pd.read_csv('dat/insurance.csv')

# Preprocessing
df_i_enc = pd.get_dummies(df_i, drop_first = True)

X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(X_c, y_c,
                                                            test_size = 0.25,
                                                            random_state = 42)

X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(X_r, y_r,
                                                            test_size = 0.25,
                                                            random_state = 42)

X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(X_b, y_b,
                                                            test_size = 0.3,
                                                            random_state = 42)

X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(X_d, y_d,
                                                            test_size = 0.3,
                                                            random_state = 42)

df_i_train, df_i_test = train_test_split(df_i_enc,
                                         test_size = 0.05,
                                         random_state = 42)

st_scaler = StandardScaler()
df_i_train_scaled = pd.DataFrame(st_scaler.fit_transform(df_i_train),
                                 columns = df_i_train.columns)
df_i_test_scaled  = pd.DataFrame(st_scaler.transform(df_i_test),
                                 columns = df_i_test.columns)

## Regression<a class="anchor" id="regression"></a>
In this second part we deal with regression, i.e. the prediction of continuous values, in contrast to classification, which predicts discrete classes. It follows directly that the target must be numerical. A typical example of a regression problem is the prediction of prices.

### Linear regression<a class="anchor" id="regression_linear"></a>
The classic and simplest regression approach is linear regression. It represents the target variable as a linear combination of the independent variables:
\begin{equation*}
y\left(c,x\right) = c_0 + c_1 x_1 + \dots + c_n x_n
\end{equation*}
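Here, $c = \left(c_0, c_1, \dots, c_n\right)$ is the coefficient vector, with $c_0$ as the intercept. The model chooses the coefficients that minimize the sum of squared differences between the observed and the predicted targets, which corresponds to the minimization problem
\begin{equation*}
\min_{c} \| Xc - Y \|^2,
\end{equation*}
where $X$ is the feature matrix (extended by a column of ones for the intercept) and $Y$ is the vector of observed targets.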
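
The minimization problem can also be solved directly with NumPy. The following is a minimal sketch, assuming a small synthetic dataset; the sizes and the names `X_design`, `c` and `c_sklearn` are chosen for illustration only and are not part of the lecture code. It compares the least-squares solution with the coefficients found by `LinearRegression`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small synthetic problem (sizes are illustrative assumptions)
X, y = make_regression(n_samples=200, n_features=3, noise=0.05, random_state=42)

# Prepend a column of ones so that c_0 acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve min_c ||X_design c - y||^2 directly
c, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# scikit-learn solves the same problem: intercept_ is c_0, coef_ is (c_1, ..., c_n)
model = LinearRegression().fit(X, y)
c_sklearn = np.concatenate([[model.intercept_], model.coef_])

print(c)
print(c_sklearn)
```

Both printed vectors should agree up to numerical precision.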

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()

# Artificial regression problem
reg.fit(X_r_train, y_r_train)

In [None]:
# Calculate the coefficient of determination for test dataset
reg.score(X_r_test, y_r_test)
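
For regressors, `score` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$. As a minimal sketch, the value can be recomputed by hand. The block rebuilds the artificial regression problem so that it runs on its own; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same settings as the artificial regression problem above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residual sum of squares and total sum of squares on the test set
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_from_score = model.score(X_test, y_test)

print(r2_manual, r2_from_score)
```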

In [None]:
print(reg.coef_)
print(reg.intercept_)

In [None]:
# Index of feature
i = 0
plt.figure(figsize = (15,8))
plt.scatter(X_r_test[:, i], y_r_test,
            s = 50, marker = '^', color = 'blue', label = 'test', alpha = 0.6)
plt.scatter(X_r_test[:, i], reg.predict(X_r_test),
            s = 70, marker = 'v', color = 'orange', label = 'pred', alpha = 0.5)
plt.xlabel('Objects, X, feature ' + str(i))
plt.ylabel("Target, y")
plt.title("Linear regression")
plt.legend()
plt.draw()

In [None]:
# Diabetes dataset
reg.fit(X_d_train, y_d_train)
reg.score(X_d_test, y_d_test)

In [None]:
print(reg.coef_)
print(reg.intercept_)

In [None]:
# Index of variable
i = 2
plt.figure(figsize = (15,8))
plt.scatter(X_d_test[:, i], y_d_test,
            s = 50, marker = '^', color = 'blue', label = 'test', alpha = 0.6)
plt.scatter(X_d_test[:, i], reg.predict(X_d_test),
            s = 70, marker = 'v', color = 'orange', label = 'pred', alpha = 0.5)
plt.xlabel('Objects, X, feature ' + str(i))
plt.ylabel('Target, y')
plt.title('Linear regression')
plt.legend()
plt.draw()

### Random forest regressor<a class="anchor" id="regression_rf"></a>
We have already learned about Random Forests as a classification algorithm. However, they can also be used for regression problems.
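
In scikit-learn's `RandomForestRegressor`, the prediction of the forest is the average of the predictions of its individual trees. The following sketch checks this on the diabetes dataset; the hyperparameter values are arbitrary assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

# Small forest, values chosen only for illustration
forest = RandomForestRegressor(n_estimators=20, max_depth=4,
                               random_state=42).fit(X, y)

# Predict with every tree separately, then average over the trees
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
mean_of_trees = tree_preds.mean(axis=0)

forest_pred = forest.predict(X)
print(np.allclose(mean_of_trees, forest_pred))
```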

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf     = RandomForestRegressor(random_state = 42)
params = {'n_estimators': [50, 100, 200],
          'criterion':    ['squared_error', 'absolute_error'],  # called 'mse'/'mae' before scikit-learn 1.0
          'max_depth':    [2, 4, 6, 8]}
k      = 3

reg = GridSearchCV(rf, params, cv = k)
reg.fit(X_d_train, y_d_train)

In [None]:
reg.score(X_d_test, y_d_test)

In [None]:
pd.DataFrame(reg.cv_results_).sort_values(by = ['rank_test_score'])

In [None]:
# Index of variable
i = 2
plt.figure(figsize = (15,8))
plt.scatter(X_d_test[:, i], y_d_test,
            s = 50, marker = '^', color = 'blue', label = 'test', alpha = 0.6)
plt.scatter(X_d_test[:, i], reg.predict(X_d_test),
            s = 70, marker = 'v', color = 'orange', label = 'pred', alpha = 0.5)
plt.xlabel('Objects, X, feature ' + str(i))
plt.ylabel('Target, y')
plt.title('Random forest regression')
plt.legend()
plt.draw()

#### Exercises
Train a random forest regressor and one other regression model of your choice on the insurance dataset.

## Clustering<a class="anchor" id="clustering"></a>
We conclude the two lectures on machine learning with a method of unsupervised learning, namely clustering. Clustering discovers structure in a data set by grouping objects that are similar to one another; the resulting groups are called clusters. Since no labels are given, clustering aims at findings that were not previously available. For methods such as k-means, the fitted model can also be applied to previously unseen data in order to assign it to one of the clusters.

### kMeans<a class="anchor" id="clustering_kmeans"></a>
The k-means algorithm can be applied to data of arbitrary dimension. The method first chooses $k$ centers as starting points. Each object in the data set is then assigned to its nearest center, based on the distance between the object and the centers. Subsequently, each cluster center is recalculated as the mean of the points assigned to it. Assignment and recalculation are repeated until the assignments no longer change.
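
The steps described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization from data points and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=42):
    """Hypothetical helper: plain k-means iteration for illustration."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups in two dimensions
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(5.0, 0.5, size=(50, 2))])

labels, centers = kmeans_sketch(X_toy, k=2)
print(centers)
```

The result depends on the initial centers, which is why `KMeans` in scikit-learn offers the `n_init` parameter to repeat the procedure with different starting points.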

In [None]:
plt.figure(figsize = (15, 8))
plt.scatter(X_c[:, 0], X_c[:, 4], s = 25, c = y_c, alpha = 0.6)
plt.xlabel('X_0')
plt.ylabel('X_4')
plt.title('Artificial classification problem')
plt.draw()

In [None]:
from sklearn.cluster import KMeans
y_c_pred = KMeans(n_clusters = 3, n_init = 100, max_iter = 1000, random_state = 42).fit_predict(X_c)
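
A fitted k-means model can assign previously unseen objects to the existing clusters with `predict`. The sketch below assumes a small blob dataset generated with `make_blobs`, which is not part of the lecture data, and splits it into a "known" and a "new" part. It also shows that `predict` corresponds to picking the nearest cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: three blobs, split into known and new objects
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_known, X_new = X[:250], X[250:]

# Fit on the known objects only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_known)

# Assign the new objects to the clusters found during fitting
labels_new = kmeans.predict(X_new)

# Manual check: index of the nearest cluster center for every new object
dists = np.linalg.norm(X_new[:, None, :] - kmeans.cluster_centers_[None, :, :],
                       axis=2)
nearest = dists.argmin(axis=1)

print(labels_new[:10])
print(nearest[:10])
```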

In [None]:
df_c = pd.concat([pd.DataFrame(X_c), pd.DataFrame(y_c), pd.DataFrame(y_c_pred)], axis = 1)
df_c.columns = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'y', 'y_pred']
df_c.sample(5)

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 3, figsize = (15, 8))
ax[0].scatter(df_c[df_c['y'] == 0]['X0'], df_c[df_c['y'] == 0]['X4'],
              s = 25, c = df_c[df_c['y'] == 0]['y_pred'], alpha = 0.7)
ax[0].set_title('Class 0')
ax[1].scatter(df_c[df_c['y'] == 1]['X0'], df_c[df_c['y'] == 1]['X4'],
              s = 25, c = df_c[df_c['y'] == 1]['y_pred'], alpha = 0.7)
ax[1].set_title('Class 1')
ax[2].scatter(df_c[df_c['y'] == 2]['X0'], df_c[df_c['y'] == 2]['X4'],
              s = 25, c = df_c[df_c['y'] == 2]['y_pred'], alpha = 0.7)
ax[2].set_title('Class 2')
plt.draw()
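
The plots compare clusters and true classes visually. As a complementary sketch, the agreement can also be summarized numerically with a contingency table and the adjusted Rand index. The use of `pd.crosstab` and `adjusted_rand_score` here is a suggestion and not part of the original lecture code; the block regenerates the artificial classification problem so that it runs on its own.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score

# Same settings as the artificial classification problem above
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=2,
                           n_informative=5, n_classes=3,
                           n_clusters_per_class=3, flip_y=0.05,
                           shift=None, random_state=42)

y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Rows: true classes, columns: clusters found by k-means
table = pd.crosstab(pd.Series(y, name='class'),
                    pd.Series(y_pred, name='cluster'))
print(table)

# 1.0 means identical groupings, values near 0 mean agreement at chance level
ari = adjusted_rand_score(y, y_pred)
print(ari)
```

Cluster numbers are arbitrary, so a cluster labelled 0 does not have to correspond to class 0; the adjusted Rand index is invariant to such relabeling.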

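### Alternative<a class="anchor" id="clustering_alternative"></a>
As an alternative to k-means, agglomerative (hierarchical) clustering starts with every object as its own cluster and repeatedly merges the two closest clusters, according to a linkage criterion, until the requested number of clusters remains.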

In [None]:
from sklearn.cluster import AgglomerativeClustering

y_c_pred = AgglomerativeClustering(n_clusters = 3).fit_predict(X_c)

df_c = pd.concat([pd.DataFrame(X_c), pd.DataFrame(y_c), pd.DataFrame(y_c_pred)], axis = 1)
df_c.columns = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'y', 'y_pred']

fig, ax = plt.subplots(nrows = 1, ncols = 3, figsize = (15, 8))
ax[0].scatter(df_c[df_c['y'] == 0]['X0'], df_c[df_c['y'] == 0]['X4'],
              s = 25, c = df_c[df_c['y'] == 0]['y_pred'], alpha = 0.7)
ax[0].set_title('Class 0')
ax[1].scatter(df_c[df_c['y'] == 1]['X0'], df_c[df_c['y'] == 1]['X4'],
              s = 25, c = df_c[df_c['y'] == 1]['y_pred'], alpha = 0.7)
ax[1].set_title('Class 1')
ax[2].scatter(df_c[df_c['y'] == 2]['X0'], df_c[df_c['y'] == 2]['X4'],
              s = 25, c = df_c[df_c['y'] == 2]['y_pred'], alpha = 0.7)
ax[2].set_title('Class 2')
plt.draw()

<hr>