# Introduction to Machine Learning

(The content of this notebook was inspired by my work for EmergentAlliance and Jason Brownlee's "Machine Learning Mastery with Python")

In this short intro course we will focus on predictive modeling. That means we want to use models to make predictions, e.g. of a system's future behaviour or of its response to specific inputs. The two corresponding problem types are classification and regression.

Of the various categories of machine learning, we will look at **supervised learning**, i.e. we train a model on labelled training data. For example, to train an image recognition model that distinguishes cats from dogs, you first need to label a lot of pictures.
![](Bereiche-des-Machine-Learnings.png)

The other categories are **unsupervised learning**, e.g. clustering, and **reinforcement learning**, e.g. DeepMind's AlphaGo.

![Alt Text](deepmind_parkour.0.gif.mp4)

## Datasets:
We will look at two different datasets:
1. Iris Flower Dataset
2. Boston Housing Prices

Both are so-called toy datasets: well-known machine learning examples that are bundled with the Python machine learning library scikit-learn (https://scikit-learn.org/stable/datasets/toy_dataset.html). Note that the Boston dataset is only available in scikit-learn versions before 1.2. The Iris Flower dataset is an example of a classification problem, whereas the Boston Housing Prices dataset is a regression example.

## What does an ML project always look like?
* Idea --> Problem Definition / Hypothesis formulation
* Analyze and Visualize your data
    - Understand your data (dimensions, data types, class distributions (bias!), data summary, correlations, skewness)
    - Visualize your data (box and whisker / violin / distribution / scatter matrix)
* Data Preprocessing including data cleansing, data wrangling, data compilation
* Apply algorithms and make predictions
* Improve, validate and present results

## Let's get started
Load some libraries

In [None]:
import pandas as pd # data analysis
import numpy as np # math operations on arrays and vectors
import matplotlib.pyplot as plt # plotting
# display plots directly in the notebook
%matplotlib inline 
import sklearn

## Example 1: Iris flower dataset
https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset
4 numeric, predictive attributes (sepal length in cm, sepal width in cm, petal length in cm, petal width in cm) and the class (Iris-Setosa, Iris-Versicolour, Iris-Virginica)

**Hypothesis:** One can predict the class of an Iris flower based on its attributes.

Here this is just one sentence, but formulating this hypothesis is a non-trivial, iterative task, which is the basis for data and feature selection and extremely important for the overall success!

### 1. Load the data

In [None]:
# check here again with autocompletion --> then you can see all available datasets
# https://scikit-learn.org/stable/datasets/toy_dataset.html
from sklearn.datasets import load_iris

In [None]:
data, target = load_iris(return_X_y=True, as_frame=True)

In [None]:
data

In [None]:
target

We will combine this now into one dataframe and check the classes

In [None]:
data["class"]=target

In [None]:
data

### 2. Understand your data

In [None]:
data.describe()

This is a classification problem, so we will check the class distribution. This is important to avoid bias due to over- or underrepresentation of classes. Well-known examples of this problem are predictive maintenance (very few failures compared to normal runs) and Amazon's hiring AI (https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G).

In [None]:
class_counts = data.groupby('class').size()
class_counts
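To make the class balance easier to judge, the counts can also be expressed as relative shares. The following is a minimal sketch that reloads the Iris labels on its own; the names `iris_target`, `class_share` and `imbalance_ratio` are made up for this illustration.

```python
from sklearn.datasets import load_iris

# Load only the labels as a pandas Series
_, iris_target = load_iris(return_X_y=True, as_frame=True)

# Relative class frequencies; values far from 1 / n_classes hint at imbalance
class_share = iris_target.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the rarest and the most frequent class (1.0 = perfectly balanced)
imbalance_ratio = class_share.min() / class_share.max()
print("imbalance ratio:", imbalance_ratio)
```

For Iris the three classes are equally represented, so no rebalancing is needed here.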

Now let's check for correlations
Correlation refers to the relationship between two variables and how they may or may not change together.
There are different methods available (--> check with ?data.corr)

In [None]:
correlations = data.corr(method='pearson')

In [None]:
correlations
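As mentioned above, `corr` supports several methods. The sketch below compares the three options pandas offers for one pair of Iris features; it reloads the data so it runs on its own, and the variable names are chosen for this illustration only.

```python
from sklearn.datasets import load_iris

iris_features, _ = load_iris(return_X_y=True, as_frame=True)

# Pearson measures linear association; Spearman and Kendall are rank-based
corr_by_method = {
    method: iris_features.corr(method=method)
    for method in ("pearson", "spearman", "kendall")
}

row, col = "petal length (cm)", "petal width (cm)"
for method, corr in corr_by_method.items():
    print(f"{method:>8}: {corr.loc[row, col]:.3f}")
```

Rank-based methods are less sensitive to outliers and also capture monotonic relationships that are not linear.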

Let's do a heatmap plot for the correlation matrix (pandas built-in)

In [None]:
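# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it (pandas >= 1.3)
correlations.style.background_gradient(cmap='coolwarm').format(precision=2)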

Now we will also check the skewness of the distributions, i.e. how asymmetric they are compared to a symmetric distribution such as the Gaussian (normal) distribution.
The sign of the result shows a positive (right) or negative (left) skew. Values closer to zero show less skew.

In [None]:
skew=data.skew()
skew
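To get a feeling for the sign of the skew, here is a small sketch on synthetic data (not the Iris data). The column names and the random seed are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

samples = pd.DataFrame({
    "symmetric": rng.normal(size=10_000),          # Gaussian: skew close to 0
    "right_skewed": rng.exponential(size=10_000),  # long tail to the right
})
# Mirroring a right-skewed sample gives a left-skewed one
samples["left_skewed"] = -samples["right_skewed"]

skew_demo = samples.skew()
print(skew_demo)
```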

## 2. Visualize your data
- Histogram
- Pairplot
- Density

In [None]:
data.hist()

In [None]:
data.plot(kind="density", subplots=True, layout=(3,2),sharex=False)

Another nice plot is the box and whisker plot

In [None]:
data.plot(kind="box", subplots=True, layout=(3,2),sharex=False)

Another option is the seaborn violin plot, which gives a more intuitive feeling for the distribution of values

In [None]:
import seaborn as sns
sns.violinplot(data=data,x="class", y="sepal length (cm)")
#sns.violinplot(data["sepal width (cm)"])

And last but not least a scatterplot matrix, similar to the pairplot we already did in the last session. This should also give insights into correlations.

In [None]:
sns.pairplot(data)

## 3. Data Preprocessing
For this dataset we can skip some steps that are usually necessary, such as merging multiple data sources into one table (including aligning formats and granularities) and handling missing values or NaNs. But preprocessing also includes:
- Rescaling
- Normalization

The goal of these transformations is to bring the data into the format that is most beneficial for the algorithms applied later. For example, optimization algorithms for multivariate problems perform better when all attributes / parameters have the same scale. Other methods assume that the input variables have a Gaussian distribution, so it is better to transform the input parameters to meet these requirements.

At first we look at rescaling. This is done to bring all attributes (parameters) into the same range, most often the range [0,1].

To apply these preprocessing steps we first need to transform the dataframe into an array and split the array into input and output values, here the descriptive parameters and the class.

In [None]:
# transform into array
array = data.values
array

In [None]:
# separate array into input and output components
X = array[:,0:4]
Y = array[:,4]

In [None]:
# Now we apply the MinMaxScaler with a range of [0,1], so that afterwards all columns have a min of 0 and a max of 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
rescaledX
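What the MinMaxScaler does per column is x' = (x - min) / (max - min). As a minimal sketch, the manual formula can be compared with the scaler's output; the data is reloaded so the snippet runs on its own, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X_iris, _ = load_iris(return_X_y=True)

# Manual min-max rescaling, column by column
col_min = X_iris.min(axis=0)
col_max = X_iris.max(axis=0)
manual_minmax = (X_iris - col_min) / (col_max - col_min)

# Same operation with scikit-learn
scaled_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_iris)

print("identical:", np.allclose(manual_minmax, scaled_minmax))
```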

Now we will apply the StandardScaler, which transforms each column (each attribute / parameter) so that afterwards it has mean = 0 and standard deviation = 1. Note that this only shifts and rescales the values; it does not change the shape of the distribution.
For each value, the mean of its column is subtracted and the result is divided by the standard deviation of that column.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX
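The standardization can again be checked by hand with z = (x - mean) / std per column. This is a small sketch under the assumption that the population standard deviation (`ddof=0`, NumPy's default) is used, which is what StandardScaler does; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris, _ = load_iris(return_X_y=True)

# Manual z-score per column, using the population standard deviation (ddof=0)
manual_z = (X_iris - X_iris.mean(axis=0)) / X_iris.std(axis=0)
scaled_z = StandardScaler().fit_transform(X_iris)

print("identical:", np.allclose(manual_z, scaled_z))
print("column means:", scaled_z.mean(axis=0).round(6))
print("column std devs:", scaled_z.std(axis=0).round(6))
```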

## 4. Feature Selection (Parameter Sensitivity)
Now we come to an extremely interesting part: finding out which parameters really have an impact on the outputs. This is the first time we can validate our assumptions, and we get both a qualitative and a quantitative answer to the question of which parameters are important. This matters because irrelevant features in your data can decrease the accuracy of many models and increase the training time.

In [None]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# feature extraction
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X, Y)
# summarize scores
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

Here we can see the scores of the features: the higher the score, the stronger the statistical dependence between the feature and the class. As we chose k=3, the output shows the values of the three highest-scoring features (sepal length (cm), petal length (cm), petal width (cm)); sepal width (cm) has the lowest score and is dropped. This result also makes sense when remembering the correlation heatmap...
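Since the array-based output does not show column names, the following sketch maps the scores back to the feature names using the selector's `get_support()` mask. It works on the dataframe version of the Iris data; `features_df`, `labels` and `selected_names` are names chosen for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

features_df, labels = load_iris(return_X_y=True, as_frame=True)

selector = SelectKBest(score_func=chi2, k=3).fit(features_df, labels)

# get_support() returns a boolean mask marking the selected columns
for name, score, kept in zip(features_df.columns, selector.scores_, selector.get_support()):
    print(f"{name:<20} score={score:8.2f} selected={kept}")

selected_names = list(features_df.columns[selector.get_support()])
print(selected_names)
```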

Another very interesting transformation that also reduces the data is Principal Component Analysis (PCA). In contrast to feature selection, it does not pick a subset of the original columns but transforms the complete dataset into a smaller set of new variables, the principal components (you set their number). Internally, a Singular Value Decomposition of the data is performed to project it to a lower dimensional space.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
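To see how much information the reduced dataset retains, one can look at the cumulative explained variance and at the shape of the transformed data. A minimal, self-contained sketch (variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)

pca_demo = PCA(n_components=3)
# fit_transform projects the 4 original columns onto 3 principal components
X_reduced = pca_demo.fit_transform(X_iris)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("shape of reduced data:", X_reduced.shape)
print("cumulative explained variance:", cumulative.round(3))
```

If the first one or two components already explain most of the variance, the remaining ones can often be dropped with little loss.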

Of course there are even more possibilities, especially when you consider that the application of ML algorithms itself will give the feature importance. So there are also built-in methods in sklearn.

## 5. Apply ML algorithms
- The first step is to split our data into **training and testing data**. We need a separate testing dataset, which was not used for training, to validate the performance and accuracy of our trained model.
- **Which algorithm to take?** There is no simple answer to that. Depending on your problem (classification vs regression) there are different classes of algorithms, but you cannot know beforehand which algorithm will perform best on your data. So it is always a good idea to try different algorithms and compare their performance.
- How to evaluate the performance? There are different metrics available to check the **performance of an ML model**

In [None]:
# specifying the size of the testing data set
# seed: reproducible random split --> especially important when comparing different algorithms with each other.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size,
random_state=seed)
model = LogisticRegression(solver='liblinear') 
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test) 
print("Accuracy: %.3f%%" % (result*100.0))

In [None]:
# Let's compare the accuracy, when we use the same data for training and testing
model = LogisticRegression(solver='liblinear') 
model.fit(X, Y)
result = model.score(X, Y) 
print("Accuracy: %.3f%%" % (result*100.0))

In [None]:
# get importance
model = LogisticRegression(solver='liblinear') 
model.fit(X_train, Y_train)
# coef_ has one row per class (one-vs-rest); row 0 holds the coefficients for class 0
importance = model.coef_[0]
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
#    print("Feature: "+str(i)+", Score: "+str(v))
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)

In [None]:
# decision tree for feature importance
# (Iris is a classification problem, so we use the classifier variant of the tree)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
# fit the model
model.fit(X_train, Y_train)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)

### Test-Train-Splits
Performing just one test-train-split and checking the performance or feature importance might not be good enough, as the result could be very good or very bad by coincidence due to this specific split. So the easiest solution is to repeat this process several times and check the averaged accuracy, or to use some of the ready-to-use built-in tools in scikit-learn, like `KFold`, `cross_val_score`, `LeaveOneOut` and `ShuffleSplit`.
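As a sketch of the "repeat the split several times" idea, `ShuffleSplit` can generate many random train/test splits that `cross_val_score` then evaluates. The number of splits is an arbitrary choice, and the classifier uses scikit-learn's default solver with a raised `max_iter`, which is an assumption that differs from the `liblinear` setting used in the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 20 independent random splits, each holding out 33% of the data for testing
splitter = ShuffleSplit(n_splits=20, test_size=0.33, random_state=7)
split_scores = cross_val_score(clf, X_iris, y_iris, cv=splitter, scoring="accuracy")

print("accuracy per split:", split_scores.round(3))
print("mean: %.3f, std: %.3f" % (split_scores.mean(), split_scores.std()))
```

The standard deviation gives an impression of how much a single split can mislead.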

### Which ML model to use?
Here is just a tiny overview of some models one can use for classification and regression problems. For more models that are built into scikit-learn, please refer to https://scikit-learn.org/stable/index.html and https://machinelearningmastery.com

- Logistic / Linear Regression
- k-nearest neighbour
- Classification and Regression Trees
- Support Vector Machines
- Neural Networks

In the following we will just use logistic regression (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for our classification example and linear regression (https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression) for our regression example.
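Since it is not known beforehand which model works best, a quick comparison under identical cross-validation folds is often a good start. The sketch below does this for four of the model families listed above on the Iris data; all models use default settings (except `max_iter` and the seeds), which is a simplification for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

candidate_models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbour": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=7),
    "support vector machine": SVC(),
}

# Same folds for every model, so the scores are comparable
folds = KFold(n_splits=10, shuffle=True, random_state=7)

mean_accuracy = {}
for name, candidate in candidate_models.items():
    fold_scores = cross_val_score(candidate, X_iris, y_iris, cv=folds, scoring="accuracy")
    mean_accuracy[name] = fold_scores.mean()
    print("%-24s %.3f (%.3f)" % (name, fold_scores.mean(), fold_scores.std()))
```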


### ML model evaluation
For evaluating the model performance, there are different metrics available, depending on your type of problem (classification vs regression)

For classification, there are for example:
- Classification accuracy
- Logistic Loss
- Confusion Matrix
- ...

For regression, there are for example:
- Mean Absolute Error
- (Root) Mean Squared Error, (R)MSE
- R^2 
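The regression metrics can be illustrated with a tiny hand-made example. The four true and predicted values below are invented for this sketch and have no relation to the datasets in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values, only for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error^2
rmse = np.sqrt(mse)                        # same unit as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print("MAE:  %.3f" % mae)
print("MSE:  %.3f" % mse)
print("RMSE: %.3f" % rmse)
print("R^2:  %.3f" % r2)
```

Here the errors are 0.5, -0.5, 0 and -1, so the MAE is 0.5 and the MSE is 0.375.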


So the accuracy alone is far from telling you the whole story; you need to check other metrics as well!

The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis (columns) and true outcomes on the y-axis (rows), so the diagonal holds the correct predictions and the off-diagonal entries the misclassifications (for two classes: false negatives and false positives).
https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

#Lets have a look at our classification problem:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')

# Classification accuracy:
scoring = 'accuracy'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring) 
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

# Logistic Loss
# (scikit-learn reports the negated log loss, so values closer to 0 are better)
scoring = 'neg_log_loss'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

# Confusion Matrix
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)
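To read the confusion matrix, it helps to derive the familiar numbers from it. The following self-contained sketch recomputes the accuracy and the per-class recall from the matrix; the split parameters mirror the ones used above, while the classifier settings (default solver, `max_iter=1000`) are an assumption of this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.33, random_state=7)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))

# Rows = true classes, columns = predicted classes
print(cm)

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()
# Recall per class: correct predictions divided by the row sum (all true members)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print("accuracy:", round(accuracy_from_cm, 3))
print("recall per class:", recall_per_class.round(3))
```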

## Example 2: Boston housing dataset (regression)

In [None]:
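import sklearn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
from sklearn.datasets import load_boston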

In [None]:
data = load_boston(return_X_y=False)

In [None]:
print(data.DESCR)

In [None]:
df=pd.DataFrame(data.data)

In [None]:
df.columns=data.feature_names

In [None]:
df

In [None]:
df["MEDV"]=data.target

In [None]:
df

Now we start again with our procedure:
* Hypothesis
* Understand and visualize the data 
* Preprocessing
* Feature Selection
* Apply Model
* Evaluate Results

Our **Hypothesis** here is, that we can actually predict the price of a house based on attributes of the geographic area, population and the property.

In [None]:
df.describe()

In [None]:
sns.pairplot(df[["DIS","RM","CRIM","LSTAT","MEDV"]])

In [None]:
from sklearn.linear_model import LinearRegression

# Now we do the 
# preprocessing
# feature selection
# training-test-split
# ML model application
# evaluation
array = df.values
X = array[:,0:13]
Y = array[:,13]

# preprocessing
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

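# feature selection
# (the default score function f_classif is meant for classification;
#  for a continuous target we need f_regression)
from sklearn.feature_selection import f_regression
test = SelectKBest(score_func=f_regression, k=6)
fit = test.fit(rescaledX, Y)
# transform the same (rescaled) data the selector was fitted on
features = fit.transform(rescaledX)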

# train-test-split
X_train, X_test, Y_train, Y_test = train_test_split(features, Y, test_size=0.3,
random_state=5)

# build model
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
model.fit(X_train,Y_train)
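r2 = model.score(X_test, Y_test)  # for regressors, score() returns R^2, not an accuracy

# evaluate model with 10-fold cross-validation on the selected features
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, features, Y, cv=kfold, scoring=scoring)

print("R^2: %.3f" % r2)
# scikit-learn reports the negated MSE, so flip the sign for display
print("MSE: %.3f (%.3f)" % (-results.mean(), results.std()))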


# And now: make predictions on new data
# model.predict(new_data)
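The whole procedure (preprocessing, feature selection, model, prediction) can also be chained in a scikit-learn `Pipeline`, which ensures that the scaler and the feature selector are fitted on the training data only. The sketch below is not the notebook's Boston workflow: because `load_boston` is no longer available in current scikit-learn versions, it uses synthetic data from `make_regression` as a stand-in, and all sizes and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 13 features, 6 of them informative
X_syn, y_syn = make_regression(n_samples=500, n_features=13, n_informative=6,
                               noise=10.0, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=5)

# Scaling -> feature selection -> linear regression in one estimator
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=6),
    LinearRegression(),
)
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # R^2 on the held-out data
print("R^2 on test data: %.3f" % test_r2)

# "New" data: here simply the first three held-out rows
new_data = X_te[:3]
predictions = pipe.predict(new_data)
print(predictions)
```

Calling `pipe.predict` applies the fitted scaling and feature selection automatically before the regression model is evaluated.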

### What comes next?
---> Hyperparameter optimization.
More advanced ML algorithms have options and settings (hyperparameters) that you have to choose yourself. These also have an impact on your model's performance and accuracy. You can perform a so-called grid search to find the best settings for your dataset.

**GridSearchCV**
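A minimal sketch of such a grid search, using a k-nearest-neighbour classifier on the Iris data. The choice of model and the parameter grid are arbitrary assumptions for this illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Every combination of these settings is evaluated with 5-fold cross-validation
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_iris, y_iris)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: %.3f" % search.best_score_)
```

After fitting, `search.best_estimator_` holds a model retrained on the full data with the best settings found.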

## What does a typical project look like:
* Data engineering -  **A LOT**
* Applying actual ML algorithms - 5% of the time. 
(Once your dataset is ready for applying algorithms, you have already done most of the work. Of course afterwards you still need to validate and present your results)

![](HealthRiskIndex.png)

### Example: Emergent Alliance - Health Risk Index for Europe
https://emergentalliance.org
What we wanted to do: Predict the risk of getting infected, when travelling to a specific region.

We actually spent weeks formulating and reformulating our hypothesis to (re-)consider influencing attributes, trying to distinguish between causes and effects.

In the end we spent most of the time on data engineering for:
Population density, intensive care units, mobility, case numbers, sentiment, acceptance of government orders.
The biggest amount of time was spent on checking data sources, getting the data, reading data dictionaries and understanding the data, creating automatic downloads and data pipelines, data preprocessing, and bringing the preprocessed data into a database. We had to fight lots of issues with data quality and data granularity (time and geographic) for different countries.

Also afterwards the visual and textual processing and presentation took quite some time (writing blogs, building dashboards, cleaning up databases, ...)

## Image Recognition
It is actually quite easy to build a simple classification model (cats vs dogs), so when you are interested in applying something like this maybe to your experimental data (bubble column pictures or postprocessing contour plots), here are some links to get started:
https://medium.com/@nina95dan/simple-image-classification-with-resnet-50-334366e7311a
https://medium.com/abraia/getting-started-with-image-recognition-and-convolutional-neural-networks-in-5-minutes-28c1dfdd401