# Lesson 3: Learning from data

Let us first import the required Python packages:

In [2]:
import numpy as np # numpy comes from "numeric python" and it is a very popular library for numerical operations
import pandas as pd # pandas provides more advanced data structures and methods for manipulating data
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt # a widely used visualization library
import cartopy.crs as ccrs # a geographic library which we use for plotting
import cartopy

import requests # for querying the data from internet
import io # for io operations
import urllib # for building the query string
import urllib.parse # --||--

import plotting_utils # A self made "plotting library" to hide the less important parts of code

import sklearn.preprocessing # sklearn is a good library for doing basic machine learning,
#                             in addition to that, it contains some neat preprocessing stuff

from sklearn.linear_model import LinearRegression
import pickle

And then the preprocessed data:

In [7]:
with open('datasets/water_quality.pkl', 'rb') as handle:
    data_dict_water_quality = pickle.load(handle)
data_dict_water_quality['features'] = ['LATITUDE', 'LONGITUDE', 'OBSDEP', 'YEAR', 'SINMONTHFRACTION', 
                                       'COSMONTHFRACTION', 'SINTIMEOFDAYFRACTION', 'COSTIMEOFDAYFRACTION']
data_dict_water_quality['target'] = 'TEMP' 

In [None]:
with open('datasets/ice_thickness.pkl', 'rb') as handle:
    data_dict_ice_thickness = pickle.load(handle)
data_dict_ice_thickness['features'] = ['LATITUDE', 'LONGITUDE', 'YEAR', 'SINMONTHFRACTION',
                                       'COSMONTHFRACTION', 'SINTIMEOFDAYFRACTION', 'COSTIMEOFDAYFRACTION']
data_dict_ice_thickness['target'] = 'THICKNESS' 

In [9]:
with open('datasets/mammographic.pkl', 'rb') as handle:
    data_dict_mammographic = pickle.load(handle)
data_dict_mammographic['features'] = ['BI-RADS', 'LOGAGE', 'SHAPE', 'MARGIN', 'DENSITY']
data_dict_mammographic['target'] = 'SEVERITY' 

## Supervised learning

There is a multitude of ML methods for both classification and regression tasks. In this course, we present a few central, widely used methods.

### Linear regression

Linear regression is familiar to many of us from basic statistics classes. It strives to find the best linear model to explain the numerical value of the target variable based on the values of the explaining variable(s). Linear regression tells us how strong the (linear) relationship between the variables is. The linear model simply searches for an equation

y = a + b * x 

in which the y values predicted by this equation are as close as possible to the true values in the data. What is "as close as possible" is usually evaluated through *mean squared error* (see chapter on performance metrics). 

<img src="img/linear_regression.png"/>
Linear regression from the iris data set (Figure: LU).
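
The fitting idea described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the course data: the values of `x` and `y` and the "true" coefficients 2.0 and 0.5 are made up for the example.

```python
import numpy as np

# Hypothetical toy data: y depends roughly linearly on x (a = 2.0, b = 0.5 plus noise)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Least-squares fit of y = a + b * x; np.polyfit returns the highest power first
b, a = np.polyfit(x, y, deg=1)

# Mean squared error of the fitted line on the same data
y_hat = a + b * x
mse = np.mean((y - y_hat) ** 2)
print(f"a = {a:.2f}, b = {b:.2f}, MSE = {mse:.3f}")
```

The fitted line has, by construction, the smallest mean squared error of all straight lines on these data, so it is never worse than the line that was used to generate them.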

Linear regression can have as many explanatory factors as desired. In that case, the equation becomes

y = a + b1 * x1 + b2 * x2 + ... + bn * xn

(The regular linear model assumes that the effects of the explaining factors are independent of each other, for example that the effect of sunny days on a plant's growth is the same regardless of the soil's moisture content, and vice versa. However, it may be true that sun and moisture together boost the plant's growth more than the sum of the effects of each of them alone. Linear models can be extended to take into account such potential interactions between explaining variables.)
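
One common way to include an interaction is to add the product of two explaining variables as an extra column. The sketch below uses made-up data (`sun`, `moisture`, `growth` and all coefficients are hypothetical) and compares a purely additive model with one that has the product term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: growth depends on sun, moisture AND their product
rng = np.random.default_rng(1)
sun = rng.uniform(0, 1, 200)
moisture = rng.uniform(0, 1, 200)
growth = (1.0 + 2.0 * sun + 1.0 * moisture + 3.0 * sun * moisture
          + rng.normal(scale=0.1, size=200))

# Additive model: y = a + b1 * x1 + b2 * x2
X_additive = np.column_stack([sun, moisture])
# Model with interaction: y = a + b1 * x1 + b2 * x2 + b3 * x1 * x2
X_interaction = np.column_stack([sun, moisture, sun * moisture])

r2_additive = LinearRegression().fit(X_additive, growth).score(X_additive, growth)
r2_interaction = LinearRegression().fit(X_interaction, growth).score(X_interaction, growth)
print(f"R^2 additive: {r2_additive:.3f}, with interaction: {r2_interaction:.3f}")
```

The model is still linear in its coefficients, which is why ordinary linear regression can fit it.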

In [None]:
X = data_frame_normalized[regression_columns].values
y = data_frame_normalized[['TEMP']].values.reshape((-1,1))

#split to training and validation at random:
indices = np.random.permutation(X.shape[0])
train_indices = indices[:int(X.shape[0]*0.8)]
validation_indices = indices[int(X.shape[0]*0.8):]
X_train, y_train = X[train_indices,:], y[train_indices,:]
X_validation, y_validation = X[validation_indices,:], y[validation_indices,:]

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

y_pred = reg.predict(X_validation).reshape(-1)
print(regression_columns)
print(reg.coef_)
print(reg.score(X_validation, y_validation))

plotting_utils.scatterplot_in_map(data_frame_numeric.iloc[validation_indices]['LONGITUDE'], 
                                  data_frame_numeric.iloc[validation_indices]['LATITUDE'], 
                                  bounding_box=bounding_box, 
                                  bounding_box_context=plotting_utils.BOUNDS_NORTHERN_BALTIC_SEA,
                                  c=y_pred-y_validation.reshape(-1),cmap='bwr', s=10, 
                                  vmin=-2.5, vmax=2.5, stock_img=False)
cbar = plt.colorbar(fraction=0.03, pad=0.1)
cbar.set_label('Error', rotation=270)


data_ = data_frame_numeric.iloc[validation_indices].copy() # copy, so that adding a column does not touch a slice of the original frame
data_['res'] = y_pred - y_validation.reshape(-1)
plotting_utils.plot_scatter(data_,
                            columns_x=[('OBSDEP', "Observation depths (in meters)"),
                                       ('YEAR', "Year"),
                                       ('MONTHFRACTION', "Month / 12"),
                                       ('TIMEOFDAYFRACTION', "Time of day / 24")], 
                            columns_y=[('res','Residual')], 
                            c='k', alpha=0.02, ncols=2)


plotting_utils.plot_effects(data_frame_numeric.iloc[validation_indices], 
                            reg, regression_columns,
                            normalizer, normalized_columns, 
                            plotted_columns=[('OBSDEP', "Observation depths (in meters)"),
                                             ('YEAR', "Year"),
                                             ('MONTHFRACTION', "Month / 12"),
                                             ('TIMEOFDAYFRACTION', "Time of day / 24")],
                            periodic_columns=['MONTHFRACTION', 'TIMEOFDAYFRACTION'], ncols=2,
                            coordinates=['LATITUDE', 'LONGITUDE'], bounding_box=bounding_box)

### Logistic regression

Logistic regression (also called the logit model) is, despite the name, a classification method for binary variables based on continuous explaining variables. The concept of logistic regression is illustrated in the figure below. In the figure, the x axis shows how many hours the students have studied for their exam, and the y axis has two possible values, pass and fail. We want to predict, based on the study hours, whether the student is going to pass the exam or not. The logistic curve assumes that the likelihood of passing the exam increases with the study time, but that the increase is not linear: it follows a sigmoid (S-shaped) function. For binary prediction (pass or fail) we can take the point where the curve crosses the halfway point between pass and fail, and see how many hours of study this corresponds to (in this case, something like 2 hours and 45 minutes). Below this value, we predict fail, and above it, pass.

<img src="img/Exam_pass_logistic_curve.jpeg"/>
Image from Wikipedia, by Michaelg2015 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=42442194
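
A minimal sketch of the same idea with scikit-learn follows. The study hours and pass/fail values are hypothetical numbers chosen to resemble the figure, and the large `C` is an assumption made here to switch off most of the regularization that `LogisticRegression` applies by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied and whether the exam was passed (1) or failed (0)
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# scikit-learn expects a 2D feature matrix, hence the reshape
model = LogisticRegression(C=1e6, max_iter=1000).fit(hours.reshape(-1, 1), passed)

# The predicted probability is 0.5 where intercept + coef * hours = 0
threshold = -model.intercept_[0] / model.coef_[0, 0]
print(f"Predicted pass above roughly {threshold:.2f} hours of study")

# Probability of passing for a student who studied 1 hour and one who studied 5 hours
print(model.predict_proba([[1.0], [5.0]])[:, 1])
```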


### Support vector machine

Support vector machines (SVM) are popular and robust binary classifiers (although multiclass and regression extensions exist). The basic idea is to separate the two classes as well as possible using a line (or, in multidimensional cases, a plane or hyperplane). In the image below, the green line H1 does not separate the two classes. Lines H2 (blue) and H3 (red) do, and H3 does so with the biggest margin.

<img src="img/SVM.svg" />
Figure from Wikipedia, by User:ZackWeinberg, based on PNG version by User:Cyc - This file was derived from:  Svm separating hyperplanes.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22877598

The line with the biggest margin is very sensitive to individual observations: adding just one observation might move the line considerably, as illustrated in the figure below: the right hand side plot has one additional blue dot, which shifts the maximal margin line considerably.

<img src="img/Support_Vector_ISLR9.5.png" />
Figure source: James et al. Introduction to statistical Learning. Springer. Doi:10.1007/978-1-4614-7138-7.

Naturally, it is also not always possible to find a line/plane that separates the classes perfectly. SVM fitting tries to find a separating line (often called the *decision boundary*, as it is the boundary at which the decision about which class to predict changes) that is somewhat robust to individual observations and does not overfit to the training set. This may mean that some of the training data observations are misclassified (on the wrong side of the line), and others lie very close to the decision boundary. SVM fitting includes a parameter that can be understood as the width of the margin around the decision boundary, and it regulates the robustness of the model fit. A large margin means that many observations lie within the margin, meaning that there are many support vectors, i.e. many observations determining the decision boundary. A smaller margin means that there are fewer support vectors and the model reacts more strongly to the features of the training set, possibly overfitting to it. The figure below illustrates the different margins on a small data set.

<img src="img/Support_Vector_ISLR9.7.png" />
Figure source: James et al. Introduction to statistical Learning. Springer. Doi:10.1007/978-1-4614-7138-7.


An interesting feature of SVMs is that only the observations that lie either on the margin or on the wrong side of the margin affect the decision boundary. Changing an observation that is correctly classified and outside the margin will not affect the model fit at all. The observations that do affect the result, i.e. those that are on the margin or on the wrong side of it, are called *support vectors*. Intuitively, it can be thought that the decision boundary "leans" on these observations.
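
The link between the margin width and the number of support vectors can be tried out with a small sketch. The two overlapping point clouds are synthetic, and the two `C` values are arbitrary choices for illustration. Note that in scikit-learn's `SVC` the parameter `C` is a penalty for margin violations, so a small `C` corresponds to a wide margin and a large `C` to a narrow one.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two overlapping clouds of points, 100 observations per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(100, 2)),
               rng.normal(loc=[1, 1], scale=1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Small C: violations are cheap, so the margin is wide
wide = SVC(kernel="linear", C=0.01).fit(X, y)
# Large C: violations are expensive, so the margin is narrow
narrow = SVC(kernel="linear", C=100).fit(X, y)

n_sv_wide = wide.support_vectors_.shape[0]
n_sv_narrow = narrow.support_vectors_.shape[0]
print(f"Support vectors with wide margin: {n_sv_wide}, with narrow margin: {n_sv_narrow}")
```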

Sometimes the decision boundary is not a straight line, but a curve or a circle. SVMs can be adapted to deal with this kind of data as well through the use of *kernels*. The details of kernels are too technical to discuss here, but essentially they make transformations to the data that allow us to fit polynomial and radial (round) separating lines. The figure below illustrates this with a radial kernel: adding an extra dimension (z axis, "up") and mapping the points so that the further they are from the centre of the original 2-dimensional space, the higher up they are on the z axis, allows the separation of the two classes using a flat plane.

<img src="img/Kernel_trick_idea.svg" />
Figure source: By Shiyu Ji - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=60458994
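
The lifting idea in the figure can be imitated by hand. In this sketch the data (an inner disc surrounded by a ring) are synthetic, and the extra coordinate `z` is computed explicitly as the squared distance from the centre; a real kernel SVM does the equivalent implicitly, without building the new coordinates.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: class 0 forms an inner disc, class 1 a ring around it
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

# A straight line in the original 2D space cannot separate the classes
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Add a third coordinate: the squared distance from the centre
z = (X ** 2).sum(axis=1)
X_lifted = np.column_stack([X, z])
acc_3d = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# The radial (RBF) kernel handles the original 2D data directly
acc_rbf = SVC(kernel="rbf").fit(X, y).score(X, y)
print(f"linear 2D: {acc_2d:.2f}, linear on lifted data: {acc_3d:.2f}, RBF kernel: {acc_rbf:.2f}")
```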


Further reading:
- [A nice, longer explanation of SVMs.](https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496)
- [Another nice explanation of SVMs, including support vector regression.](https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/)

### Tree-based methods: basics and random forest classification and regression

Decision trees are based on the idea of consecutively splitting the data (*prediction space*) into regions that are more uniform than the whole data or the previous split. Splitting is continued until the predictions don't improve. This splitting can be presented as a tree, hence the name. The figure below illustrates this concept. 

<img src="img/decision_tree.png">
Figure: LU.

In this figure, we are predicting the class (shape) based on the *mode* i.e. most common class in each region. This is classification. If the target variable was numeric, we could predict a numeric value by taking the average of all observations in the region. This would be regression.
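
A small sketch of a classification tree follows. The data and the rule that generates the classes are made up for the example; `max_depth=2` is chosen because two consecutive splits are enough for this particular rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: the class depends on two consecutive axis-aligned splits
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.where(X[:, 0] <= 0.5, 0, np.where(X[:, 1] <= 0.5, 1, 2))

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
train_accuracy = tree.score(X, y)

# Print the learned splits as text; each leaf predicts the most common class in its region
print(export_text(tree, feature_names=["x1", "x2"]))
print(f"Training accuracy: {train_accuracy:.2f}")
```

For a numeric target, `DecisionTreeRegressor` from the same module works the same way but predicts the average of the observations in each region.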

Trees are easy to interpret, but not competitive with the best machine learning methods. The performance of tree-based methods can be improved by using a consensus result of multiple trees through methods such as *bagging, boosting,* and *random forests*. We will take a look at random forests, as they are a commonly used, powerful method for classification and regression.

**Random forest**

Random forest uses multiple trees (hence *forest*) to create a classifier / regression model that is more robust to the variance in the training set than individual trees are. They use two tricks for this: 
- *Bagging* (bootstrap aggregating), that is, taking repeated samples with replacement from the training set to produce multiple training sets. The model is taught using each of these bootstrapped training sets separately, and finally, an average of these predictions is used as the final model.
- Using only a random subset (hence *random*) of the predictor variables at each split. The size of this subset is usually close to the square root of the number of predictor variables.

Bagging improves the robustness of the method, i.e. different training sets result in more similar models than without bagging. Selecting only a subset of the predictors for each split further improves the robustness, and also helps when there is a large number of correlated predictors in the data.

The average (when predicting numeric value) or majority (when predicting class) prediction of all the trees will be used as the prediction of the random forest model.

There is a nice added bonus to bagging: we get model validation for free as a side effect. Each bootstrapped tree uses on average 2/3 of the observations. The remaining 1/3 are called *out-of-bag (OOB)* observations. We can validate the model by using these OOB observations, predicting each observation using the trees that have not used this particular observation in learning. With a large number of bootstrapped samples, this OOB validation is as good as leave-one-out cross-validation.
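
The sketch below shows both points on synthetic data: the share of distinct observations in one bootstrap sample (on average about 63 %, i.e. roughly the 2/3 mentioned above), and the out-of-bag accuracy that scikit-learn reports when `oob_score=True`. The data, the number of trees and the random seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# One bootstrap sample: draw n observations with replacement and count the distinct ones
n = 1000
bootstrap_sample = rng.integers(0, n, size=n)
unique_fraction = np.unique(bootstrap_sample).size / n
print(f"Share of observations used by one bootstrapped tree: {unique_fraction:.2f}")

# Hypothetical data: 9 predictors, of which only the first two determine the class
X = rng.normal(size=(500, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_features="sqrt": consider sqrt(9) = 3 randomly chosen predictors at each split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
oob_accuracy = forest.oob_score_
print(f"Out-of-bag accuracy: {oob_accuracy:.2f}")
```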

[A longer explanation of random forest.](https://towardsdatascience.com/random-forest-explained-6b4849d56a2f)


### K nearest neighbours

K nearest neighbors (KNN) is a classification and regression method. In classification, the object is assigned to the class that is most common among the k closest training examples in the data set; in regression, the predicted value is the average of the values of the k nearest neighbors. The neighbors can also be weighted according to their distance from the point that is being predicted.
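
Because the method is so simple, it can be written out directly. The function `knn_predict` and the toy data below are hypothetical and use unweighted neighbors with Euclidean distance; in practice one would typically use `KNeighborsClassifier` or `KNeighborsRegressor` from scikit-learn.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point from its k nearest training examples (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the k nearest neighbors
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: average of the k nearest target values
    return y_train[nearest].mean()

# Hypothetical toy data: two groups of three points
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.2, 1.1, 3.0, 3.2, 3.1])

predicted_class = knn_predict(X_train, y_class, np.array([0.15, 0.15]), k=3)
predicted_value = knn_predict(X_train, y_value, np.array([0.95, 1.0]), k=3, task="regression")
print(predicted_class, predicted_value)
```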



## Unsupervised learning: Clustering, association and dimensionality reduction

Unsupervised methods are often used as part of exploratory data analysis, to reveal patterns in the data. As we don't know the true answer, there is also no way to check how well the algorithm is doing. That makes unsupervised methods difficult to evaluate. However, they may be important in exploring and understanding the data.


### Clustering

Clustering means grouping observations into groups whose members are similar to each other - for example, finding species that seem to share features such as habitat or food preferences, body morphology, etc. Finding clusters can be intuitively thought of as identifying "groups" of observations: observations that are "close" to each other in the feature space, i.e. as measured by the variables that have been recorded, form a cluster. This naturally requires some way to measure the distance, or (dis)similarity, between the objects. For this purpose, it might be reasonable to normalize the data so that all numeric variables have the same mean and variance - that way, all the variables will have equal weight in the distance measure. If the variables are not normalized and one variable has a scale of 0-1000 meters and another 0-0.1 meters, the differences in the first variable will dominate the distance metric and the second variable will be virtually meaningless. There are many different distance measures. The Euclidean distance is perhaps the most common. You can read more about distance measures [here](https://machinelearningmastery.com/distance-measures-for-machine-learning/).
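
The effect of normalization on the distances can be seen with three made-up observations. The two columns below are hypothetical measurements on the scales mentioned above (roughly 0-1000 m and 0-0.1 m), and `euclidean` is a small helper written for this example.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return np.sqrt(np.sum((a - b) ** 2))

# Hypothetical observations: column 0 on a 0-1000 m scale, column 1 on a 0-0.1 m scale
X = np.array([[100.0, 0.01],
              [110.0, 0.09],
              [900.0, 0.02]])

# Raw distances: only the first column matters in practice
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# Standardize each column to mean 0 and variance 1, then measure again
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = euclidean(X_std[0], X_std[1])
std_02 = euclidean(X_std[0], X_std[2])

print(f"raw: d(0,1) = {raw_01:.2f}, d(0,2) = {raw_02:.2f}")
print(f"standardized: d(0,1) = {std_01:.2f}, d(0,2) = {std_02:.2f}")
```

Before standardization, observation 1 looks much closer to observation 0 than observation 2 does. After it, the large difference in the second variable counts as much as the difference in the first one, and the two distances become comparable.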

There are many different types of clustering algorithms. [Watch this 9-minute video](https://www.youtube.com/watch?v=Se28XHI2_xE), which gives a very nice illustration of 4 different types of clustering algorithms. 

**Partitioning clustering** (or centroid clustering) divides the data into k clusters, where k is provided by the user. (There are also ways to evaluate what the optimal k would be.) **K-means** clustering is perhaps the most common of these. It starts from k random observations and iteratively calculates clusters, trying to minimize the variation within clusters. As the starting points are random, and they affect the algorithm's results, several runs are needed and the best one is picked.
[This YouTube video illustrates k-means clustering.](https://www.youtube.com/watch?v=4b5d3muPQmA)
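
A minimal k-means sketch with scikit-learn is shown below. The three point clouds are synthetic and well separated on purpose; `n_init=10` asks for ten runs from different random starting points, of which the one with the smallest within-cluster variation (`inertia_`) is kept.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated clouds of 50 points each
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centres])

# k is given by the user; 10 random restarts, the best run is kept
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print("Cluster centres:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares:", kmeans.inertia_)
```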

**Hierarchical clustering** (or connectivity clustering) shows the similarity or distance between any two observations. The methods either start from individual observations and sequentially connect the closest observations or already identified groups to each other until all observations are connected, or start from a single group of all observations and split it consecutively until each observation is in its own group.
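
The bottom-up (agglomerative) variant can be sketched with SciPy. The six points are made up, and average linkage and the cut into three clusters are arbitrary choices for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: a group of three points, a group of two points and one lone point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.0],
              [9.0, 0.0]])

# Agglomerative clustering: repeatedly merge the two closest groups.
# Each row of Z records one merge, so n observations give n - 1 rows.
Z = linkage(X, method="average")

# Cut the hierarchy so that at most three clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

The merge history in `Z` can be drawn as a tree with `scipy.cluster.hierarchy.dendrogram`.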

### Association 

Association rule finding means finding relations between variables in data sets. This means, for example, finding that lakes that have a high abundance of roach also tend to have a high abundance of sander, while lakes that have vendace also tend to have whitefish. These rules can also be more complex, such as that if a forest has a high percentage of aspen trees, and the average age of trees is above 30 years, it is likelier to host flying squirrels than other forests.


### Dimensionality reduction

Dimensionality reduction means the transformation of a high-dimensional data set (i.e. one with many variables) into fewer dimensions while retaining as much of its meaningful properties as possible. It can be done through *feature selection*, i.e. leaving out variables, or *feature extraction*, i.e. constructing new variables based on the original ones so that fewer new variables replace the original ones and include most of their information content. Principal component analysis is an example of the latter.
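
As a sketch of feature extraction, the example below applies principal component analysis to synthetic data. The six observed variables are constructed from two hidden ones plus a little noise (an assumption made so that two components are known to be enough).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 observed variables that are mixtures of 2 hidden variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 6))

# Replace the 6 original variables with 2 new ones (principal components)
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)

explained = pca.explained_variance_ratio_.sum()
print(f"Shape before: {X.shape}, after: {X_reduced.shape}")
print(f"Share of variance retained: {explained:.3f}")
```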


## Performance metrics

When we are training and evaluating supervised models, we are trying to teach them to predict as well as possible. To do that, we need some metrics to evaluate this - i.e. we need to give a numeric value to the goodness of the model performance in order to say which model performs best. The algorithms are designed to find the parameters that will lead to the best possible metric value using the train and test sets.

For this purpose, there are a number of metrics, and it is important to understand their differences in order to select the one that fits your purpose. It may also be a good idea to use more than one metric.

### Classification
To understand the performance metrics, let's first consider a binary classification case (we want to predict a yes/no answer, such as "is this a potential flying squirrel habitat?"). In this case, there are four possible outcomes: We classify a real positive value to be positive; we classify the real positive value to be negative; we classify the real negative value to be negative, or we classify the real negative value to be positive. 

- The values that are predicted as positive and are actually positive are *true positives*.
- The values that are predicted as positive but are actually negative are *false positives*.
- The values that are predicted as negative and are actually negative are *true negatives*.
- The values that are predicted as negative but are actually positive are *false negatives*. 

In this terminology, positive/negative refers to the classifier result (*what we think*), and true/false to whether that was correct or not.

By calculating ratios between these four classes, we get the different performance metrics. 
- Precision: What proportion of the positives given by our model are actually positives? 
- Recall (also called sensitivity or true positive rate): What proportion of the real positives are correctly classified as positive?
- Specificity (or true negative rate): What proportion of the real negatives are correctly classified as negative?
- Accuracy: What proportion of all values are classified correctly?
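
Written as formulas, the four metrics are simple ratios of the four counts. The counts below are made-up numbers for 100 classified observations.

```python
# Hypothetical counts from classifying 100 observations
tp, fp, tn, fn = 40, 10, 45, 5

precision = tp / (tp + fp)                   # share of predicted positives that are real positives
recall = tp / (tp + fn)                      # share of real positives that were found
specificity = tn / (tn + fp)                 # share of real negatives that were found
accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all observations classified correctly

print(f"precision = {precision:.2f}, recall = {recall:.2f}, "
      f"specificity = {specificity:.2f}, accuracy = {accuracy:.2f}")
```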


<img src="img/TP-TN-FP-FN-1.png"/>

There are a number of other metrics, such as the **F1 score** (or **F measure**), which seeks to find a balance between precision and recall. It is computed as (2 * Precision * Recall) / (Precision + Recall). The F1 score is popular for imbalanced data sets. If the model gives a score or probability of class membership, and different cut-off points can be defined (e.g. do we call this positive if there's a 50 % probability for it to be positive, or only after 70 %?), ROC curves can be used to evaluate the performance and seek the best cutoff point. If you wish, you can read more about performance metrics [here](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226) and [here](https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/).
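
The effect of the cut-off point can be illustrated with ten made-up observations and their predicted probabilities. The helper `evaluate` is hypothetical; it returns the recall (true positive rate), the false positive rate and the F1 score for a given cut-off. Each cut-off gives one point of the ROC curve.

```python
import numpy as np

# Hypothetical true classes and predicted probabilities of being positive
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.65, 0.80, 0.90, 0.95])

def evaluate(cutoff):
    """Return (recall, false positive rate, F1) when scores >= cutoff count as positive."""
    y_pred = scores >= cutoff
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, fp / (fp + tn), f1

recall_50, fpr_50, f1_50 = evaluate(0.5)
recall_70, fpr_70, f1_70 = evaluate(0.7)
print(f"cut-off 0.5: recall = {recall_50:.2f}, FPR = {fpr_50:.2f}, F1 = {f1_50:.2f}")
print(f"cut-off 0.7: recall = {recall_70:.2f}, FPR = {fpr_70:.2f}, F1 = {f1_70:.2f}")
```

Raising the cut-off removes the false positive in this example, but at the price of missing one more real positive.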

The correct performance metric depends on the purpose of the model. Often it is more important to catch all positives than to reach a high overall accuracy, even if we get a higher number of false positives as a side effect. For example, we want to be sure that we have identified all meadows that may host an endangered species, so we can protect them - we don't mind terribly if we protect some other meadows as well. In other cases, it may be essential to avoid false positives: when we are classifying mushrooms as edible (positive) or non-edible, i.e. poisonous (negative), we want to be very sure not to classify a non-edible mushroom as edible, even if that means that some edible ones are erroneously classified as non-edible. In these cases, recall and specificity may be much better metrics than precision or accuracy.

The distribution of positives and negatives also needs to be taken into account when selecting the performance measure. For example, if the positive and negative cases are very imbalanced, for example only 5 % of the studied habitats are suitable for the flying squirrel, we would reach 95 % accuracy simply by predicting that all habitats are unsuitable - we would be right 95 % of the time only because of the data distribution! In this case, recall might be a much better metric.
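
The flying squirrel example can be checked with a few lines. The data are made up to match the text: 5 suitable habitats out of 100, and a "model" that always predicts unsuitable.

```python
import numpy as np

# Hypothetical data: 5 suitable habitats (1) and 95 unsuitable ones (0)
y_true = np.array([1] * 5 + [0] * 95)
# A useless "model" that predicts unsuitable for every habitat
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_true == y_pred)

tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks excellent, while the recall shows that not a single suitable habitat was found.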

It's important to notice that - depending on the data and the classifier model - there is only a certain accuracy that can be reached. Some observations will be misclassified. However, we can build a model that optimizes the performance metric that is important to us, e.g. in the case of the edible mushroom example, a high true negative rate (non-edible mushrooms are classified as non-edible with high reliability). This usually means that we will also get more false negatives, i.e. edible mushrooms classified as non-edible. The desired balance between the different types of errors depends on the purpose of the model.

Multi-class classifier performance metrics are usually variants of these binary metrics.


### Regression

Regression model performance is evaluated through the difference between the real value and the predicted value - the smaller the difference, the better the model. This is exactly the same as we do when computing the best fit for a linear model for example. 

The most popular regression model performance metrics are
- Mean Squared Error (MSE): The average of the squared difference between the real value and the predicted value.
- Root Mean Squared Error (RMSE): Root of the MSE. It's in the same unit as the original variable and therefore easier to interpret. 
- Mean Absolute Error (MAE): The average of the absolute error (i.e. all errors computed as positive values, regardless of whether the prediction is too small or too big).
- R^2: Total variance explained by the model divided by the total variance.

MSE and RMSE penalize large errors more severely than smaller ones, i.e. an error of 10 is worse than two errors of 5. MAE penalizes all errors in proportion to their magnitude. R^2 is not considered a true error metric, as it measures the share of variance explained rather than the size of the errors, but it is also often used as one.
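
The four metrics can be computed by hand with NumPy. The true and predicted values below are made up; R^2 is computed here in the usual way as 1 minus the ratio of the residual sum of squares to the total sum of squares. The last lines check the claim about one error of 10 versus two errors of 5.

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}, R^2 = {r2:.2f}")

# One error of 10 versus two errors of 5: the same MAE, but a different MSE
one_big = np.array([10.0, 0.0])
two_small = np.array([5.0, 5.0])
mse_big, mse_small = np.mean(one_big ** 2), np.mean(two_small ** 2)
mae_big, mae_small = np.mean(np.abs(one_big)), np.mean(np.abs(two_small))
print(f"MSE: {mse_big} vs {mse_small}, MAE: {mae_big} vs {mae_small}")
```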

You can read more [here](https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914). 


# Further reading


## Neural networks and deep learning

Artificial neural networks (ANN or NN) are a fast-developing, flexible family of machine learning methods. They consist of units called artificial neurons, and links (edges) between these units. The neurons are typically arranged in layers: an input layer, an output layer, and one or more hidden layers. The neurons are linked to each other, receive input and produce output that is passed on to other neurons.

<img src="img/Colored_neural_network.svg">
Figure by Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461

A neuron's *activation function* determines its output based on the inputs it receives. There are a number of different activation functions with different mathematical properties, but the important thing to realize is that nonlinear activation functions (such as the logistic (sigmoid) function) allow the ANN to learn nonlinear responses.
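
Why the nonlinearity matters can be seen in a small NumPy sketch. The inputs and the weights `W1` and `W2` are random numbers, not a trained network: without an activation function, two layers collapse into a single linear model, whereas a sigmoid between the layers prevents this.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation function: squeezes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical network: 3 inputs -> 4 hidden neurons -> 1 output, random weights
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))      # 5 observations, 3 input variables
W1 = rng.normal(size=(3, 4))     # weights from the input layer to the hidden layer
W2 = rng.normal(size=(4, 1))     # weights from the hidden layer to the output

# Without an activation function, two layers equal one linear layer with weights W1 @ W2
linear_out = (x @ W1) @ W2
collapsed = x @ (W1 @ W2)

# With a sigmoid in the hidden layer, the output is a nonlinear function of the inputs
hidden = sigmoid(x @ W1)
nonlinear_out = hidden @ W2

print(np.allclose(linear_out, collapsed))
print(nonlinear_out.ravel())
```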

There are also many different ANN architectures, i.e. ways that the neurons are connected to each other. The figure above describes the simplest, "basic" ANN. Recurrent neural networks (RNN) are suited for the analysis of text, audio, and time series, as they process the data step by step and can therefore capture sequential information in the data. Convolutional neural networks (CNN), on the other hand, are highly useful for image analysis, since they capture spatial features (such as the positions of eyes, mouth, and nose in a portrait) from an image.

Deep learning, or deep neural networks (DNN), basically means neural networks with multiple layers. There is no clear definition for deep learning, but DNNs can be, for example, large RNNs and CNNs. DNNs are powerful with large data sets, and excel with data that are not in the form of classical data tables, but e.g. images, text documents, or audio.

ANNs are often complex, and it is very difficult to diagnose how they come to the conclusions they do (remember, for example, the tree-based classification in which it was easy to see how the conclusions are drawn). This means that if the data is biased, the model result may be biased without this being easy to notice. For example, if the cat images often have a brown background and the dog images a green background, the algorithm may learn to classify the backgrounds and not the animal species.


## How to handle uncertainty: Bayesian machine learning


Classical ML methods tend to give their predictions without indicating how certain or uncertain the answers may be - the habitat is classified as suitable or non-suitable, and there's no way to tell whether the two habitats are suitable with equal certainty, or if one is more likely to be suitable than the other. 


# What does this course cover?
- Introduction to ML & the basic concepts: 
    * Data preparation 
    * Why to split your dataset: Training, testing and validation datasets 
- Regression: Methods of predicting numeric value 
- Classification: Methods of prediction of qualitative class (species, forest type, …)  
- Clustering: Methods for dividing data to groups  
- Making sure the model does what it is supposed to: Model validation, validation metrics & their interpretation 
- Modern machine learning methods: ANN, deep learning 
- How to handle uncertainty: Bayesian machine learning 
- Exercises that provide hands on experience and useful scripts to later rely on 