<div style="width:image width px; font-size:75%; text-align:right;">
    <img src="img/data_ev_unsplash.jpg" width="width" height="height" style="padding-bottom:0.2em;" />
    <figcaption>Photo by ev on Unsplash</figcaption>
</div>

# Machine learning overview with Python and scikit-learn

**Applied Programming - Summer term 2020 - FOM Hochschule für Oekonomie und Management - Cologne**

**Lecture 08 - May 28-29, 2020**

*A few general remarks on this notebook in the context of the lecture "Applied Programming" and the exam for this module:*
* *This notebook presents the exam-relevant machine learning content with scikit-learn in compact form. The exam is not drawn exclusively from this notebook, however: it also covers CRISP-DM, GitHub, SQL, and the Python topics and complementary packages pandas, NumPy and Matplotlib that were discussed in previous lectures.*
* *scikit-learn covers far more than is shown here; for the exam, only the classes and methods used in this notebook are relevant. For example, scikit-learn implements elastic nets as a supervised learning algorithm, but they are not used in this notebook and are therefore not part of the exam. Decision trees, by contrast, are discussed here and may therefore appear in the exam.*
* *The algorithms are not discussed mathematically in detail, since this lecture refers to applied programming. Therefore, the goal of the lecture is to convey the basic principles of analytically oriented programming in Python. The focus is on questions like: How are machine learning algorithms implemented in general? Which typical functions are used? Which data processing steps are implemented with scikit-learn and how? Accordingly, the notebook at hand shows a number of typical implementations that students can also face in their professional environment - i.e. in the application. They should be able to quickly orient themselves on the basis of the knowledge taught here. This goal is not compatible with the time required to teach the mathematical details or specifics of the algorithms and parameters.*

## Table of contents
* [Libraries and data sets](#libraries)
* [Preprocessing](#preprocessing)
    * [Scaling](#scaling)
    * [Encoding](#encoding)
    * [Imputation](#imputation)
    * [Exercises on preprocessing](#preprocessing_exercises)
* [Unsupervised learning](#unsupervised)
    * [Principal component analysis](#pca)
    * [k-means clustering](#kmeans)
    * [Exercises on unsupervised learning](#unsupervised_exercises)
* [References](#references)

## Libraries and data sets<a class="anchor" id="libraries"></a>
First of all, we import the general packages and modules that we will need throughout this notebook. The specific scikit-learn modules will be imported at the appropriate place later on.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mc
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

# Show visualizations in the notebook
%matplotlib inline

# Set random seed for comparability/reproducibility
np.random.seed(42)

scikit-learn comes with a set of sample data sets, which are of small [[1]](#sklearn2020a) or larger size [[2]](#sklearn2020b). These can be used for your own experiments with scikit-learn. However, we will use the generator functions for data sets [[3]](#sklearn2020c). These do not generate features with meaning and names but only numerical examples. Therefore, there is no business question to analyze, and we can focus purely on programming and applying the machine learning algorithms. Generators exist for both classification and regression problems.

In [None]:
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression

In [None]:
# Create features and labels for classification and regression problem
X_cla, y_cla = make_classification(n_samples = 1000,      # No. of samples (rows)
                                   n_features = 5,        # No. of features (columns)
                                   n_informative = 3,     # No. of informative features
                                   n_redundant = 1,       # No. of redundant features (linear combinations)
                                   n_repeated = 1,        # No. of duplicated features
                                   n_classes = 2,         # No. of classes/labels
                                   random_state = 42)     # Random seed for comparability/reproducibility

X_reg, y_reg = make_regression(n_samples = 1000,          # No. of samples (rows)
                               n_features = 6,            # No. of features (columns)
                               n_informative = 2,         # No. of informative features
                               n_targets = 1,             # No. of regression targets, dimension of output vector
                               noise = 0.05,              # Standard dev. of noise applied to output vector
                               random_state = 42)         # Random seed for comparability/reproducibility

Now we visualize the data. For the classification problem we plot the features against each other and color the points by class. For the regression problem we plot each feature against the output.

In [None]:
# Classification data set
owncmap = mc.ListedColormap(['#EF476F', '#118AB2'])
fig, ax = plt.subplots(nrows = 4, ncols = 4, figsize = (15, 15))
ax[0, 0].scatter(X_cla[:, 0], X_cla[:, 1], c = y_cla, marker = '.', cmap = owncmap)
ax[0, 0].set_title('0th vs. 1st')
ax[0, 1].scatter(X_cla[:, 0], X_cla[:, 2], c = y_cla, marker = '.', cmap = owncmap)
ax[0, 1].set_title('0th vs. 2nd')
ax[0, 2].scatter(X_cla[:, 0], X_cla[:, 3], c = y_cla, marker = '.', cmap = owncmap)
ax[0, 2].set_title('0th vs. 3rd')
ax[0, 3].scatter(X_cla[:, 0], X_cla[:, 4], c = y_cla, marker = '.', cmap = owncmap)
ax[0, 3].set_title('0th vs. 4th')
ax[1, 1].scatter(X_cla[:, 1], X_cla[:, 2], c = y_cla, marker = '.', cmap = owncmap)
ax[1, 1].set_title('1st vs. 2nd')
ax[1, 2].scatter(X_cla[:, 1], X_cla[:, 3], c = y_cla, marker = '.', cmap = owncmap)
ax[1, 2].set_title('1st vs. 3rd')
ax[1, 3].scatter(X_cla[:, 1], X_cla[:, 4], c = y_cla, marker = '.', cmap = owncmap)
ax[1, 3].set_title('1st vs. 4th')
ax[2, 2].scatter(X_cla[:, 2], X_cla[:, 3], c = y_cla, marker = '.', cmap = owncmap)
ax[2, 2].set_title('2nd vs. 3rd')
ax[2, 3].scatter(X_cla[:, 2], X_cla[:, 4], c = y_cla, marker = '.', cmap = owncmap)
ax[2, 3].set_title('2nd vs. 4th')
ax[3, 3].scatter(X_cla[:, 3], X_cla[:, 4], c = y_cla, marker = '.', cmap = owncmap)
ax[3, 3].set_title('3rd vs. 4th')
plt.draw()

In [None]:
# Regression data set
col = ['#EF476F', '#FFD166', '#06D6A0', '#118AB2', '#F19143', '#073B4C']
fig, ax = plt.subplots(nrows = 2, ncols = 3, figsize = (20, 15))
ax[0, 0].plot(X_reg[:, 0], y_reg, color = col[0], marker = '.', linestyle = 'none')
ax[0, 0].set_title('0th feature')
ax[0, 1].plot(X_reg[:, 1], y_reg, color = col[1], marker = '.', linestyle = 'none')
ax[0, 1].set_title('1st feature')
ax[0, 2].plot(X_reg[:, 2], y_reg, color = col[2], marker = '.', linestyle = 'none')
ax[0, 2].set_title('2nd feature')
ax[1, 0].plot(X_reg[:, 3], y_reg, color = col[3], marker = '.', linestyle = 'none')
ax[1, 0].set_title('3rd feature')
ax[1, 1].plot(X_reg[:, 4], y_reg, color = col[4], marker = '.', linestyle = 'none')
ax[1, 1].set_title('4th feature')
ax[1, 2].plot(X_reg[:, 5], y_reg, color = col[5], marker = '.', linestyle = 'none')
ax[1, 2].set_title('5th feature')
plt.draw()

In order to be able to consider encoding and the imputation of missing values in the further course, we make a few more changes to the data sets.

In [None]:
# Create missing values at random rows in specific columns
for i in np.random.choice(len(X_cla), 35, replace = False):
    X_cla[i, 1] = np.nan

for i in np.random.choice(len(X_reg), 25, replace = False):
    X_reg[i, 1] = np.nan
    
for i in np.random.choice(len(X_reg), 40, replace = False):
    X_reg[i, 4] = np.nan

In [None]:
# Transform one of the continuous features to a nominal feature
from sklearn import preprocessing

kbin_discrete = preprocessing.KBinsDiscretizer(n_bins = [5], encode = 'ordinal')
kbin_discrete.fit(X_cla[:, 4].reshape(-1, 1))
X_cla_nominal = kbin_discrete.transform(X_cla[:, 4].reshape(-1, 1))
X_cla[:, 4] = X_cla_nominal.reshape(1, -1)

kbin_discrete = preprocessing.KBinsDiscretizer(n_bins = [3], encode = 'ordinal')
kbin_discrete.fit(X_reg[:, 5].reshape(-1, 1))
X_reg_nominal = kbin_discrete.transform(X_reg[:, 5].reshape(-1, 1))
X_reg[:, 5] = np.add(X_reg_nominal.reshape(1, -1), 100)

These two data sets are now intended to be used as basic data for further programming. If necessary, we will generate further data sets at a later stage without further investigation or discussion.

## Preprocessing<a class="anchor" id="preprocessing"></a>
Before modelling with machine learning approaches, the data must be preprocessed to suit the algorithms (estimators) used. scikit-learn offers a number of classes and functions for this purpose [[4]](#sklearn2020d).

### Scaling<a class="anchor" id="scaling"></a>
Data often falls on a certain scale. This scale can vary between attributes or measurements. In addition, machine learning algorithms do not know physical units. Moreover, attributes with a larger order of magnitude are generally weighted higher than features with a smaller order of magnitude. This problem is solved by transforming the data to unified scales, in other words scaling. In the context of data analysis, normalization and standardization are particularly relevant. Normalization means that the data is scaled to the closed interval [0, 1]. With standardization, the data is transformed to mean value zero and standard deviation one. In general it can be said that the data should be normalized or standardized if the algorithm uses a (Euclidean) distance or assumes normality.

For normalization, scikit-learn offers the class ``MinMaxScaler()``, whose default parameters are already set to the interval [0, 1]. However, this interval can also be adjusted to individual intervals such as [-1, 1]. For standardization, the class ``StandardScaler()`` transforms the data to mean value zero and variance one.
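
To make the two transformations concrete, the following minimal sketch computes them by hand and compares the result with the scikit-learn classes. The array ``x`` is a made-up example and not part of the lecture data; the sketch assumes the default settings of both scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small illustrative feature column (values are arbitrary)
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Normalization by hand: (x - min) / (max - min)
x_norm_manual = (x - x.min()) / (x.max() - x.min())
x_norm_sklearn = MinMaxScaler().fit_transform(x)

# Standardization by hand: (x - mean) / std, using the population standard deviation
x_std_manual = (x - x.mean()) / x.std()
x_std_sklearn = StandardScaler().fit_transform(x)

print(x_norm_manual.ravel(), x_norm_sklearn.ravel())
print(x_std_manual.ravel(), x_std_sklearn.ravel())
```

Both pairs of arrays should agree, which shows that the scalers apply exactly these column-wise formulas.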

In [None]:
from sklearn import preprocessing

mm_scale = preprocessing.MinMaxScaler(feature_range = (-1, 1))       # Instantiate scaler object
X_reg_mm_scaled = mm_scale.fit_transform(X_reg[:, 0].reshape(-1, 1)) # Fit scaler to data and transform afterwards
                                                                     # The methods fit() and transform() can
                                                                     # also be called separately
st_scale = preprocessing.StandardScaler()
X_reg_st_scaled = st_scale.fit_transform(X_reg[:, 4].reshape(-1, 1)) # Result is nearly unchanged, because the
                                                                     # feature is drawn from a standard normal

# Visualize, observe and check x-axis
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))
ax[0].plot(X_reg_mm_scaled, y_reg, color = col[0], marker = '.', linestyle = 'none')
ax[0].set_title('Scaled 0th feature to [-1, 1]')
ax[1].plot(X_reg_st_scaled, y_reg, color = col[4], marker = '.', linestyle = 'none')
ax[1].set_title('Standardized 4th feature')
plt.draw()

The scaler object can also be applied to previously unseen data in order to prepare it in the same way as the training data. For this purpose, simply apply the ``transform()`` method to the unseen data ``X_new``.
```python
X_new_scaled = mm_scale.transform(X_new)
```
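
As a self-contained sketch of this idea, the example below fits a scaler on a small training array and then transforms new values without refitting. ``X_train`` and ``X_new`` are hypothetical arrays chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])   # hypothetical training data
X_new = np.array([[2.5], [12.0]])            # hypothetical unseen data

scaler = MinMaxScaler(feature_range = (-1, 1))
scaler.fit(X_train)                          # learns min = 0 and max = 10 from the training data only

X_new_scaled = scaler.transform(X_new)       # reuses the learned min and max
print(scaler.data_min_, scaler.data_max_)
print(X_new_scaled.ravel())
```

Because the minimum and maximum come from the training data, unseen values outside the training range (here 12.0) end up outside the target interval.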

#### Summary and key aspects
* For machine learning applications, features should be scaled so that the attributes are not preferred or disadvantaged due to their size.
* scikit-learn offers different ways of scaling with the preprocessing module.
* For normalization (scaling to the interval [0, 1]) use the ``MinMaxScaler()``.
* The parameters of ``MinMaxScaler()`` can be changed to any other interval.
* Standardization (mean value zero, variance one) can be achieved by using a ``StandardScaler()`` object.
* Scaling instances are fitted to the data with the ``fit()`` method and the data is transformed via ``transform()``.
* The two steps above can be executed in combination with the ``fit_transform()`` method.
* Scaling instances can be applied to unseen data of the same origin.

### Encoding<a class="anchor" id="encoding"></a>
Not all data sets contain exclusively numerical, continuous features; categorical attributes occur very frequently. Examples are gender, place of residence or the operating system used. This non-numerical information can usually not be processed directly by machine learning algorithms, so the features have to be transformed into numerical values by encoding. It is particularly important whether the categorical data has a rank/order or not. If not, as with colors or film genres, the data is called nominal. If there is a ranking, as with hierarchical ranks in an organization or version names, the data is ordinal. Different types of data call for different types of encoding. In the following, we consider one-hot encoding for nominal data and ordinal encoding for ordinal data.

In scikit-learn we can use the classes ``OneHotEncoder()`` and ``OrdinalEncoder()``. ``OneHotEncoder()`` creates a separate column for every value that occurs in an attribute. Each object, i.e. each row, is assigned 1 in the column that corresponds to its categorical value; all other columns are set to 0. We thus get as many binary columns as the categorical feature has distinct values. ``OrdinalEncoder()`` maps the values of the attribute to integers. This results in a single column with as many different integer values as there are categories in the categorical variable.
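
The difference between the two encoders is easiest to see on a tiny example with string categories. The feature ``colors`` below is invented for illustration. The sketch converts the default sparse output with ``toarray()`` so that it does not depend on the name of the sparse parameter in a particular scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal feature with three distinct categories
colors = np.array([['red'], ['green'], ['blue'], ['green']])

oh_demo = OneHotEncoder()
colors_onehot = oh_demo.fit_transform(colors).toarray()   # default output is sparse, so convert it
print(oh_demo.categories_)                                # categories found, in sorted order
print(colors_onehot)                                      # one column per category

or_demo = OrdinalEncoder()
colors_ordinal = or_demo.fit_transform(colors)
print(colors_ordinal.ravel())                             # one integer per category
```

The one-hot result has one row per object and one column per category, with exactly one 1 per row, while the ordinal result is a single column of integer codes.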

In [None]:
oh_enc = preprocessing.OneHotEncoder(sparse = False, categories = 'auto')
# Parameter 'sparse' returns a dense array directly if set to False
# (in scikit-learn >= 1.2 the parameter is called 'sparse_output'; 'sparse' was removed in 1.4)
# Parameter 'categories' determines the categories automatically from the unique values if set to 'auto',
# which is the default

X_cla_enc = oh_enc.fit_transform(X_cla[:, 4].reshape(-1, 1))
print(X_cla_enc)

In [None]:
or_enc = preprocessing.OrdinalEncoder(categories = 'auto')
X_reg_enc = or_enc.fit_transform(X_reg[:, 5].reshape(-1, 1))
print(X_reg_enc.flatten())                                     # flatten() is only used for compact displaying

#### Summary and key aspects
* Encoding enables the processing of categorical features by machine learning algorithms through transforming the attributes into numerical values.
* It must be considered whether the categorical data are nominal or ordinal.
* For nominal data, one-hot encoding can be implemented using the ``OneHotEncoder()`` class.
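* For ordinal data, the ``OrdinalEncoder()`` is used. Note, however, that with ``categories = 'auto'`` the integers follow the sorted order of the values, which does not necessarily match the intended ranking. The ranking can be passed explicitly as a list via the ``categories`` parameter; other packages such as ``category-encoders`` [[5]](#mcginnis2016) offer further options.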
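
The following sketch illustrates how the default integer assignment of ``OrdinalEncoder()`` can differ from an intended ranking, and how an explicit order can be supplied through the ``categories`` parameter. The feature ``sizes`` and its ranking low < medium < high are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the ranking low < medium < high is an assumption of this example
sizes = np.array([['medium'], ['low'], ['high'], ['low']])

# Default: categories are sorted alphabetically (high, low, medium)
auto_codes = OrdinalEncoder(categories = 'auto').fit_transform(sizes)

# Explicit ranking: one list of categories per feature, in the desired order
ranked_enc = OrdinalEncoder(categories = [['low', 'medium', 'high']])
ranked_codes = ranked_enc.fit_transform(sizes)

print(auto_codes.ravel())
print(ranked_codes.ravel())
```

With the explicit list, the integer codes increase with the rank of the category instead of following the alphabetical order.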

### Imputation<a class="anchor" id="imputation"></a>
A major challenge in data preprocessing is the handling of missing values. As with encoding, there are a number of possible strategies here. In each individual case, it must be evaluated and decided which strategy is the most effective choice for the case or even for the individual attribute. In this lecture we will limit ourselves to the basic imputation methods implemented in the scikit-learn class ``SimpleImputer()``.

Without going into detail here, you should know that missing values can also be filled by a dedicated estimator. In this case, the values are calculated as the output of a regression on the basis of the other columns. Please find below an excerpt from the documentation of scikit-learn:
> "A more sophisticated approach is to use the ``IterativeImputer`` class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output ``y`` and the other feature columns are treated as inputs ``X``. A regressor is fit on ``(X, y)`` for known ``y``. Then, the regressor is used to predict the missing values of ``y``. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned." [[6]](#sklearn2020e)

First, we look graphically at the location of missing values in the two data sets for classification and regression.

In [None]:
f,(ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))
g1 = sns.heatmap(pd.DataFrame(X_cla).isnull(), cmap = ['#073B4C', '#EF476F'], cbar = False, ax = ax1)
g1.set_title('Missing values in classification data set')
g2 = sns.heatmap(pd.DataFrame(X_reg).isnull(), cmap = ['#073B4C', '#EF476F'], cbar = False, ax = ax2)
g2.set_title('Missing values in regression data set')
plt.draw()

In this way we can now compare after each imputation whether the red gaps have been closed.
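
In addition to the graphical view, the gaps can be counted numerically. The sketch below does this for a small made-up array ``X_gaps``; the same calls could be applied to ``X_cla`` or ``X_reg``.

```python
import numpy as np
import pandas as pd

# Hypothetical array with a few gaps
X_gaps = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan],
                   [7.0, np.nan, 9.0]])

# Number of missing values per column and in total
missing_per_column = pd.DataFrame(X_gaps).isnull().sum()
missing_total = int(np.isnan(X_gaps).sum())

print(missing_per_column)
print('Total missing values:', missing_total)
```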

We check the different imputation strategies for the ``SimpleImputer()`` class in the following short example.

In [None]:
from sklearn.impute import SimpleImputer

# Some examples to be grasped at first glance
X = np.array([[np.nan, 2, 3, 4], [10, np.nan, 10, 10], [0.1, 0.4, np.nan, 0.2], [11, 2, 10, np.nan]])
print('Array with missing values:')
print(X)
print()
print('- - - - -')
print()

simple_imputer = SimpleImputer(strategy = 'mean')
X_new = simple_imputer.fit_transform(X)
print('Imputed with means:')
print(X_new)
print()
print('- - - - -')
print()

simple_imputer = SimpleImputer(strategy = 'median')
X_new = simple_imputer.fit_transform(X)
print('Imputed with medians:')
print(X_new)
print()
print('- - - - -')
print()

simple_imputer = SimpleImputer(strategy = 'most_frequent')
X_new = simple_imputer.fit_transform(X)
print('Imputed with most frequent values:')
print(X_new)
print()
print('- - - - -')
print()

simple_imputer = SimpleImputer(strategy = 'constant', fill_value = 0, add_indicator = True)
# With add_indicator parameter, the information of where missings are imputed is not lost. It can be used for
# statistical purposes and as a new feature as well.
X_new = simple_imputer.fit_transform(X)
print('Imputed with constant value:')
print(X_new)

In [None]:
simple_imputer = SimpleImputer(strategy = 'mean')
X_cla_new = simple_imputer.fit_transform(X_cla)

simple_imputer = SimpleImputer(strategy = 'median')
X_reg_new = simple_imputer.fit_transform(X_reg)

f,(ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))
g1 = sns.heatmap(pd.DataFrame(X_cla_new).isnull(), cmap = ['#073B4C', '#EF476F'], cbar = False, ax = ax1)
g1.set_title('Classification data set')
g2 = sns.heatmap(pd.DataFrame(X_reg_new).isnull(), cmap = ['#073B4C', '#EF476F'], cbar = False, ax = ax2)
g2.set_title('Regression data set')
plt.draw()
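
Like the scalers, an imputer can be fitted once and then applied to other data. The sketch below calls ``fit()`` and ``transform()`` separately and inspects the learned fill values in the ``statistics_`` attribute. ``X_train`` and ``X_new`` are hypothetical arrays used only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])   # hypothetical training data
X_new = np.array([[np.nan, np.nan]])                               # hypothetical unseen data

mean_imputer = SimpleImputer(strategy = 'mean')
mean_imputer.fit(X_train)                  # learns one fill value per column
print(mean_imputer.statistics_)            # column means of the observed training values

X_new_imputed = mean_imputer.transform(X_new)
print(X_new_imputed)                       # gaps filled with the training means
```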

#### Summary and key aspects
* First, get a general overview of the information gaps in the data set. In addition to purely numerical statistics, this can also be supported graphically.
* The four fundamental imputation strategies are filling with the mean, median or most frequent value of the column as well as a constant value.
* The class provided by scikit-learn for this purpose is called ``SimpleImputer()``. The method ``fit_transform()`` fits it to the existing data and fills the gaps in one step; ``fit()`` and ``transform()`` can also be called separately.
* For multivariate imputation, i.e. replacement based on an estimator, scikit-learn provides the ``IterativeImputer()`` class.
* One method cannot generally and always be preferred to another - in other words, there is no "best" way of imputation per se. This must be examined in each individual case.
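
For completeness, here is a minimal sketch of multivariate imputation with ``IterativeImputer()``. It is not needed for the rest of the lecture. The class is still marked as experimental in scikit-learn and has to be enabled with an explicit import first. The array ``X_demo`` is made up so that the second column is roughly twice the first.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates the experimental class
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is roughly twice the first
X_demo = np.array([[1.0, 2.0],
                   [2.0, 4.1],
                   [3.0, np.nan],
                   [4.0, 8.0],
                   [np.nan, 10.2]])

iter_imputer = IterativeImputer(max_iter = 10, random_state = 42)
X_demo_imputed = iter_imputer.fit_transform(X_demo)
print(X_demo_imputed)
```

In contrast to ``SimpleImputer()``, the filled values depend on the other column of the same row rather than on a single statistic of the column itself.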

### Exercises on preprocessing<a class="anchor" id="preprocessing_exercises"></a>

Use the following NumPy Array ``E``.

In [None]:
E = np.array([[45, np.nan, 10, 13, 0], [42, np.nan, 8, 12, 1], [44, np.nan, 7, 9, 0]])
print(E)

0. Import the ``preprocessing`` module of scikit-learn.
1. Using an instance of the ``SimpleImputer()`` class with name ``ex_imputer`` from scikit-learn, replace the missing values of E (NaN) with the constant value 42. Store the imputed array in a new variable named ``E_imputed``. Print this result.
2. In a real situation, which imputation strategy would you choose, or which procedure would you recommend, for the present example ``E``?
3. Create an instance of the ``OneHotEncoder()`` class from scikit-learn, set the ``sparse`` parameter (``sparse_output`` in scikit-learn 1.2 and later) to ``False`` and name the instance ``ex_encoder``. Then use it to perform one-hot encoding on the entire ``E_imputed`` array from task 1. Save the result in a new variable named ``E_encoded``.
4. Which dimensions (number of rows and columns) does ``E_encoded`` have? Which command can be used for this?
5. Finally, create an instance of the scikit-learn class ``MinMaxScaler()`` with name ``ex_scaler`` and set the parameters so that scaling to the interval [0, 100] is performed. Then use this instance to scale ``E_imputed``, the result array from task 1, and print the result.

## Unsupervised learning<a class="anchor" id="unsupervised"></a>
Unsupervised learning involves the machine-based recognition of patterns in unlabelled data, i.e. without the use of a target value. Typical applications are dimensionality reduction and segmentation of the data. For these purposes, two algorithms from the scikit-learn package are presented in the following.

### Principal component analysis<a class="anchor" id="pca"></a>
The purpose of principal component analysis (PCA) is to reduce the dimensionality of a data set. Large data sets with many dimensions are reduced to a smaller number of meaningful, orthogonal linear combinations of these dimensions. Ideally, little or no information is lost, while correlated features are summarized. From a mathematical point of view, PCA is performed according to the principal axis theorem.

The scikit-learn package implements the ``PCA()`` class in the ``decomposition`` module.
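
The link to the principal axis theorem can be checked numerically: the variances reported by ``PCA()`` correspond to the eigenvalues of the sample covariance matrix, and the principal axes are orthonormal. The following sketch uses a made-up array ``X_demo`` whose third feature is nearly a linear combination of the first two.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: three features, the third is nearly a linear combination of the first two
A = rng.normal(size = (200, 2))
X_demo = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1] + 0.01 * rng.normal(size = 200)])

pca_demo = PCA(n_components = 3).fit(X_demo)

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigenvalues = np.linalg.eigvalsh(np.cov(X_demo, rowvar = False))[::-1]

print(pca_demo.explained_variance_)
print(eigenvalues)

# The principal axes are orthonormal: components_ @ components_.T is the identity matrix
print(np.round(pca_demo.components_ @ pca_demo.components_.T, 6))
```

The third eigenvalue is close to zero here, which reflects that the third feature adds almost no information beyond the first two.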

In [None]:
# PCA with three dimensions on classification data set
from sklearn.decomposition import PCA
X_pca = PCA(n_components = 3)
X_cla_pca = X_pca.fit_transform(X_cla_new)                          # take the imputed array

# Plot the result
owncmap = mc.ListedColormap(['#EF476F', '#FFD166', '#06D6A0', '#118AB2', '#F19143', '#073B4C'])
fig = plt.figure(1, figsize = (10, 10))
ax = fig.add_subplot(111, projection = '3d', elev = -140, azim = 50)  # also works with recent Matplotlib versions
ax.scatter(X_cla_pca[:, 0], X_cla_pca[:, 1], X_cla_pca[:, 2],       # first three dimensions of PCA
           c = y_cla,                                               # color-coded by label
           cmap = owncmap, edgecolor = '#073B4C', s = 50)
ax.set_title("Top-3 PCA dimensions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
plt.draw()

Each of the orthogonal principal axes explains the variance in the data to some degree. This information can be retrieved as an attribute of the PCA instance and can be displayed as absolute or percentage values.

In [None]:
print(X_pca.explained_variance_)                                    # absolute values
print(X_pca.explained_variance_ratio_)                              # share of total variance (between 0 and 1)

#### Summary and key aspects
* The PCA can be regarded as an unsupervised learning procedure, since it calculates a simplified dimensional structure for data without a target value.
* PCA can therefore be used to reduce the dimension of the data.
* The instance of the class ``PCA()`` is fitted to the data with ``fit_transform()``. Using the attribute ``explained_variance_ratio_``, the percentage of variance explained can be displayed for each of the eigenvectors.
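
One common way to use ``explained_variance_ratio_`` is to accumulate it and read off how many components are needed for a given share of the variance. The sketch below does this on a freshly generated data set; the 95 % threshold is an arbitrary choice for illustration and not a recommendation from the lecture.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, _ = make_classification(n_samples = 200, n_features = 5, n_informative = 3,
                                n_redundant = 1, n_repeated = 1, random_state = 42)

pca_full = PCA().fit(X_demo)               # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest number of components that explains at least 95 % of the variance (arbitrary threshold)
n_needed = int(np.argmax(cumulative >= 0.95) + 1)
print(n_needed)
```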

### k-means clustering<a class="anchor" id="kmeans"></a>
A cluster analysis, which can form customer segments, for instance, can be realized using the k-means algorithm. The objects in the data set are grouped in such a way that they belong to a cluster from a previously determined number of clusters. Within this cluster the objects are "similar". Mathematically, this means that the algorithm minimizes the sum of the squared deviations from the cluster centroids. An approximate algorithm is normally used for this purpose. Due to these calculations, k-means can only operate on numerical data, since a mean value cannot be calculated meaningfully for categorical data.

scikit-learn implements the class ``KMeans()`` in the ``cluster`` module. During instantiation, the number of clusters is set using the parameter ``n_clusters``. Calling the ``fit()`` method with the data then determines the clusters/segments, which can be retrieved using the ``labels_`` attribute.
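
The quantity that k-means minimizes can be reproduced by hand from the fitted object. The following sketch uses ``make_blobs()`` to generate three artificial groups (an assumption of this example, not part of the lecture data) and compares the manually computed sum of squared deviations with the ``inertia_`` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three groups in two dimensions
X_demo, _ = make_blobs(n_samples = 150, centers = 3, cluster_std = 0.5, random_state = 42)

km_demo = KMeans(n_clusters = 3, n_init = 10, random_state = 42).fit(X_demo)

# Sum of squared distances of every point to the centroid of its assigned cluster
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
sse_manual = np.sum((X_demo - assigned_centroids) ** 2)

print(sse_manual, km_demo.inertia_)        # both values should agree up to rounding
print(np.bincount(km_demo.labels_))        # number of objects per cluster
```

``n_init`` is set explicitly because its default has changed between scikit-learn versions.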

In [None]:
from sklearn.cluster import KMeans
X_cla_kmeans = KMeans(n_clusters = 2, random_state = 42).fit(X_cla_new)   # take the imputed array

In [None]:
fig, ax = plt.subplots(nrows = 4, ncols = 4, figsize = (15, 15))
ax[0, 0].scatter(X_cla[:, 0], X_cla[:, 1], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[0, 0].set_title('0th vs. 1st')
ax[0, 1].scatter(X_cla[:, 0], X_cla[:, 2], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[0, 1].set_title('0th vs. 2nd')
ax[0, 2].scatter(X_cla[:, 0], X_cla[:, 3], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[0, 2].set_title('0th vs. 3rd')
ax[0, 3].scatter(X_cla[:, 0], X_cla[:, 4], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[0, 3].set_title('0th vs. 4th')
ax[1, 1].scatter(X_cla[:, 1], X_cla[:, 2], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[1, 1].set_title('1st vs. 2nd')
ax[1, 2].scatter(X_cla[:, 1], X_cla[:, 3], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[1, 2].set_title('1st vs. 3rd')
ax[1, 3].scatter(X_cla[:, 1], X_cla[:, 4], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[1, 3].set_title('1st vs. 4th')
ax[2, 2].scatter(X_cla[:, 2], X_cla[:, 3], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[2, 2].set_title('2nd vs. 3rd')
ax[2, 3].scatter(X_cla[:, 2], X_cla[:, 4], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[2, 3].set_title('2nd vs. 4th')
ax[3, 3].scatter(X_cla[:, 3], X_cla[:, 4], c = X_cla_kmeans.labels_, marker = '.', cmap = owncmap)
ax[3, 3].set_title('3rd vs. 4th')
plt.draw()

In [None]:
# Check the calculated clusters in PCA-generated eigenvector space
fig = plt.figure(1, figsize = (10, 10))
ax = fig.add_subplot(111, projection = '3d', elev = -140, azim = 50)  # also works with recent Matplotlib versions
ax.scatter(X_cla_pca[:, 0], X_cla_pca[:, 1], X_cla_pca[:, 2],
           c = X_cla_kmeans.labels_,                                # color-coded by k-means-cluster
           cmap = owncmap, edgecolor = '#073B4C', s = 50)
ax.set_title("Top-3 PCA dimensions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
plt.draw()

#### Summary and key aspects
* k-means is a clustering algorithm, which groups objects in a cluster by minimizing the sum of the squared deviations from the cluster centroids.
* k-means can only be used with numerical data.
* The scikit-learn class is called ``KMeans()`` and it is located in the ``cluster`` module.
* The number of clusters has to be specified upfront.
* The ``labels_`` attribute contains the information about the clusters for each object in the data set.

### Exercises on unsupervised learning<a class="anchor" id="unsupervised_exercises"></a>
0. Create a data set ``(X_ex, y_ex)`` with ``make_regression()`` consisting of 100 samples and 15 features, of which 7 are informative, and one regression target. Add noise with a standard deviation of 0.15 and set the random seed to 7.
1. Run a principal component analysis and give the percentage of variance explained for the first five components.
2. Perform a k-means cluster analysis on ``X_ex``. Determine the segment assignments for 3, 4, 5, and 6 clusters, setting the random seed to 77. Convert ``X_ex`` to a pandas DataFrame and join the respective cluster allocations to it.
3. Draw four histograms of segment allocations concerning ``X_ex`` with seaborn's ``distplot()`` function (deprecated since seaborn 0.11 in favor of ``histplot()``) and check which segment has the most objects in each of the four clustering cases. Ideally, you display the four charts side by side in one plot.

## References<a class="anchor" id="references"></a>

[1]<a class="anchor" id="sklearn2020a"></a> The scikit-learn developers (2020). Toy datasets. Retrieved 2020-05-18 from https://scikit-learn.org/stable/datasets/index.html#toy-datasets

[2]<a class="anchor" id="sklearn2020b"></a> The scikit-learn developers (2020). Real world datasets. Retrieved 2020-05-18 from https://scikit-learn.org/stable/datasets/index.html#real-world-datasets

[3]<a class="anchor" id="sklearn2020c"></a> The scikit-learn developers (2020). Generated datasets. Retrieved 2020-05-18 from https://scikit-learn.org/stable/datasets/index.html#generated-datasets

[4]<a class="anchor" id="sklearn2020d"></a> The scikit-learn developers (2020). Preprocessing data. Retrieved 2020-05-18 from https://scikit-learn.org/stable/modules/preprocessing.html#

[5]<a class="anchor" id="mcginnis2016"></a> Will McGinnis (2016). Category Encoders. Retrieved 2020-05-18 from http://contrib.scikit-learn.org/category_encoders/

[6]<a class="anchor" id="sklearn2020e"></a> The scikit-learn developers (2020). Multivariate feature imputation. Retrieved 2020-05-18 from https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation