<div style="width:image width px; font-size:75%; text-align:right;">
    <img src="img/danielle-macinnes-IuLgi9PWETU-unsplash.jpg" width="width" height="height" style="padding-bottom:0.2em;" />
    <figcaption>Photo by Danielle MacInnes on Unsplash</figcaption>
</div>

# Introduction to Machine Learning

**Applied Programming - Summer term 2022 - FOM Hochschule für Oekonomie und Management - Cologne**

**Lecture 06 - April 29, 2022**

*Dennis Gluesenkamp*

## Table of contents
* [Introduction](#introduction)
* [Dataset creation](#datasetcreation)
    * [Classification](#datasetcreation_classification)
    * [Regression](#datasetcreation_regression)
* [The decision tree as the first reference model](#decisiontree)
    * [Classification](#decisiontree_classification)
    * [Regression](#decisiontree_regression)
* [References](#references)

## Introduction<a class="anchor" id="introduction"></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

np.random.seed(42)

## Dataset creation<a class="anchor" id="datasetcreation"></a>
To build a step-by-step understanding of how Machine Learning is applied and how it works, we first use artificially generated datasets for simple initial models. This lets us get to know the modeling workflow in scikit-learn and forms the basis for later, more complex applications. The generated datasets target either classification or regression problems. Their features do not represent real quantities of any particular magnitude or characteristic; they are simply numerical columns without (physical) units.

### Classification<a class="anchor" id="datasetcreation_classification"></a>
First, we generate a dataset for a classification problem. To do this, we call the ``make_classification()`` function from the ``datasets`` module of scikit-learn. [[1]](#sklearn2021a)
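As a minimal sketch of what the generator returns, assuming its documented defaults (100 samples, 20 features, 2 classes): the call yields a tuple consisting of a feature matrix and a label vector. The names `X_demo` and `y_demo` and the fixed `random_state` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Default call: returns a tuple (features, labels)
X_demo, y_demo = make_classification(random_state=0)

print(X_demo.shape)       # feature matrix: one row per sample, one column per feature
print(y_demo.shape)       # label vector: one integer class label per sample
print(np.unique(y_demo))  # the distinct class labels
```

Unpacking the tuple directly into two variables is the pattern used for all generated datasets in this notebook.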

In [None]:
from sklearn.datasets import make_classification

In [None]:
# Let's check the output of the function directly
make_classification()

In [None]:
pd.DataFrame(make_classification()[0]).describe()

In [None]:
pd.DataFrame(make_classification()[1]).describe()

In [None]:
X0, y0 = make_classification(n_features           = 2,
                             n_informative        = 2,
                             n_redundant          = 0,
                             n_repeated           = 0,
                             n_clusters_per_class = 2)

In [None]:
print(X0.shape)
print(y0.shape)

In [None]:
plt.scatter(X0[:, 0], X0[:, 1],
            marker = 'o',
            c      = y0)
plt.draw()

In [None]:
Xc, yc = make_classification(n_features           = 6,
                             n_informative        = 4,
                             n_redundant          = 0,
                             n_repeated           = 0,
                             n_clusters_per_class = 1,
                             random_state         = 123)

# Scatter plots for all pairs of the first four features
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

plt.figure(figsize = (12, 8))
plt.subplots_adjust(wspace = .5, hspace = .5)
for k, (i, j) in enumerate(pairs, start = 1):
    plt.subplot(2, 3, k)
    plt.scatter(Xc[:, i], Xc[:, j],
                marker = 'o',
                c      = yc)
    plt.title(f'X_{i} vs. X_{j}')

plt.draw()

In [None]:
dfc = pd.DataFrame(np.c_[Xc, yc])
dfc

In [None]:
sns.pairplot(dfc, hue = 6, markers=['o', 's'])
plt.draw()

### Regression<a class="anchor" id="datasetcreation_regression"></a>
In the second step, we create a randomly generated regression problem. Here we use the function ``make_regression()``, which works analogously to its classification counterpart [[1]](#sklearn2021a). We keep most of the defaults. The only exception is the number of features, which we reduce so that we can plot the distributions of the dataset.
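A short sketch of the generator's output, assuming its documented defaults (100 samples and 100 features). It illustrates why the number of features is reduced before plotting: a pairplot over 100 features would contain on the order of 10,000 panels. The variable names are illustrative only.

```python
from sklearn.datasets import make_regression

# Default call: 100 samples with 100 features each
X_default, y_default = make_regression(random_state=0)
print(X_default.shape, y_default.shape)

# Reduced to 5 features, which keeps a pairplot readable
X_small, y_small = make_regression(n_features=5, random_state=0)
print(X_small.shape, y_small.shape)
```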

In [None]:
from sklearn.datasets import make_regression
make_regression()

In [None]:
Xr, yr = make_regression(n_features   = 5,
                         random_state = 123)

dfr = pd.DataFrame(np.c_[Xr, yr])
sns.pairplot(dfr, hue = 5, palette = 'icefire', diag_kind = None)
plt.draw()

### Exercises<a class="anchor" id="datasetcreation_exercises"></a>
1. Go to the chapter "Toy Datasets designed for training" in the User Guide of scikit-learn [[2]](#sklearn2021b). Select one dataset each for regression and classification from the datasets offered there, familiarize yourself with the parameters ``return_X_y`` and (if available) ``as_frame``, and use them appropriately.
2. Import the necessary modules for the datasets you selected and create the data for each, as shown above for the generators. Choose appropriate names that differ from the names used for the generator datasets above.
3. Create a pairplot for both the regression and the classification dataset and evaluate the results. **Be careful:** Using a dataset with many features can lead to a long calculation time! As an alternative, you can plot only some of the features.

## The decision tree as the first reference model<a class="anchor" id="decisiontree"></a>
Up to this point, we have only created sample data; the modeling itself is still pending. To understand the basic principle of machine learning algorithms and to get to know their basic implementation in Python with scikit-learn, we consider the decision tree as an example. [[3]](#sklearn2021c)

In [None]:
from sklearn import tree

The implementation of decision trees in scikit-learn supports both classification and regression. This gives us a tool for solving the problems created above, although in many cases it may not lead to the best solution. Nevertheless, decision trees have high practical value because they are easy to visualize and straightforward to implement.
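To make the point about illustrativeness concrete, here is a hedged sketch that fits a deliberately shallow tree and prints its learned rules as text with `tree.export_text()`. The small dataset, the `max_depth = 2` limit, and the names `X_demo`, `y_demo` and `demo_clf` are assumptions made for this illustration; they are not the models built in the following sections.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Small illustrative dataset with two informative features
X_demo, y_demo = make_classification(n_features=2, n_informative=2,
                                     n_redundant=0, n_repeated=0,
                                     random_state=0)

# A shallow tree keeps the printed rule set short
demo_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
demo_clf.fit(X_demo, y_demo)

# Each line is either a threshold test on one feature or a leaf with the predicted class
rules = tree.export_text(demo_clf, feature_names=['X_0', 'X_1'])
print(rules)
```

Reading the output from top to bottom reproduces the sequence of threshold comparisons the tree applies to a sample before it assigns a class.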

### Classification<a class="anchor" id="decisiontree_classification"></a>

In [None]:
# Split train and test set first
from sklearn.model_selection import train_test_split
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc,
                                                        test_size    = 0.2,
                                                        random_state = 42)

In [None]:
print(Xc_train.shape)
print(Xc_test.shape)
print(yc_train.shape)
print(yc_test.shape)

In [None]:
clf = tree.DecisionTreeClassifier()
clf

In [None]:
clf = clf.fit(Xc_train, yc_train)
clf

In [None]:
plt.figure(figsize = (20,15))
tree.plot_tree(clf)
plt.show()

In [None]:
yc_pred = clf.predict(Xc_test)
print(yc_pred)
print(yc_test)

In [None]:
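# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay (available since 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, Xc_test, yc_test)
plt.draw()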

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(yc_test, yc_pred)
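The confusion matrix and the accuracy are closely related: the accuracy is the share of samples that lie on the diagonal of the matrix. The following sketch checks this on a few hand-made labels; the arrays `y_true_demo` and `y_pred_demo` are invented for illustration and are not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only
y_true_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Correct predictions sit on the diagonal
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm, accuracy_score(y_true_demo, y_pred_demo))
```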

### Regression<a class="anchor" id="decisiontree_regression"></a>

In [None]:
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr,
                                                        test_size    = 0.2,
                                                        random_state = 42)

In [None]:
print(Xr_train.shape)
print(Xr_test.shape)
print(yr_train.shape)
print(yr_test.shape)

In [None]:
regr = tree.DecisionTreeRegressor()
regr.fit(Xr_train, yr_train)

In [None]:
yr_pred = regr.predict(Xr_test)
print(yr_pred)
print(yr_test)

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(yr_test, yr_pred)
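For reference, the mean squared error is the average of the squared differences between the true and the predicted values. A minimal sketch with invented numbers (the arrays are illustrative assumptions, not model output) shows that a direct NumPy computation agrees with `mean_squared_error()`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented values for illustration only
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the squared residuals
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)
print(mse_manual, mse_sklearn)
```

Because the residuals are squared, the value is expressed in squared units of the target, and large individual errors weigh more heavily than small ones.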

In [None]:
plt.figure(figsize = (15,8))
plt.scatter(Xr_test[:, 1], yr_test,
            s = 50, color = 'blue', label = "test")
plt.scatter(Xr_test[:, 1], yr_pred,
            s = 50, color = 'orange', label = 'pred')
plt.xlabel('Feature X_1')
plt.ylabel("Target, y")
plt.title("Decision Tree Regression")
plt.legend()
plt.draw()

### Exercises<a class="anchor" id="decisiontree_exercises"></a>
1. Create a decision tree for each of the regression and classification problems you selected earlier from the Toy Datasets and interpret the results. Use only the default parameters of the tree.
2. Try to improve your result from exercise 1 manually by adjusting the parameters of the decision tree.
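As a possible starting point for the second exercise, the following sketch compares an unrestricted tree with a depth-limited one on a generated dataset. The value `max_depth = 3`, the dataset parameters and all variable names are arbitrary assumptions for illustration; whether limiting the depth improves the test score depends on the data.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generated data for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

scores = {}
for depth in (None, 3):  # None means the tree grows without a depth limit
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # score() returns the accuracy for classifiers: (train, test)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, scores[depth])
```

An unrestricted tree typically fits the training data almost perfectly, so the comparison of training and test accuracy is the more informative quantity when tuning parameters such as `max_depth` or `min_samples_leaf`.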

## References<a class="anchor" id="references"></a>

[1]<a class="anchor" id="sklearn2021a"></a> The scikit-learn developers (2021). 7.3. Generated datasets. Retrieved 2021-04-18 from https://scikit-learn.org/stable/datasets/sample_generators.html

[2]<a class="anchor" id="sklearn2021b"></a> The scikit-learn developers (2021). 7.1. Toy datasets. Retrieved 2021-04-18 from https://scikit-learn.org/stable/datasets/toy_dataset.html

[3]<a class="anchor" id="sklearn2021c"></a> The scikit-learn developers (2021). 1.10. Decision Trees. Retrieved 2021-04-18 from https://scikit-learn.org/stable/modules/tree.html#decision-trees