<!--NAVIGATION-->
<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>< [Quiz](05.03-Quiz.ipynb) | [Contents](00.00-Index.ipynb) | [Tensorflow + Keras and PyTorch](06.02-Tensorflow.ipynb) ></span>

<a href="https://colab.research.google.com/github/eurostat/e-learning/blob/main/python-official-statistics/06.01-Scikit.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

<a id='top'></a>

# Scikit-learn
## Content  
- [Categories of Machine Learning](#categories)
- [Scikit-learn API](#API)
- [Simple linear regression - Supervised](#regression)
- [Iris classification - Supervised](#classification)
- [Iris dimensionality reduction - Unsupervised](#dimensionality)
- [Iris clustering - Unsupervised](#clustering)
- [Deep learning](#neuralnet)

Machine learning (``ML``) is a type of artificial intelligence (``AI``) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.  
Deep learning (``DL``) is a subset of ML based on neural networks.

<span style=''><img style='background: rgb(128, 128, 128, .15); align: left; display: inline-block; padding: 20px' src='img/ai.webp'/></span>

<a id='categories'></a>

## Categories of Machine Learning
### Supervised learning
Involves modeling the relationship between measured features of data and some label associated with the data; once this model is determined, it can be used to apply labels to new, unknown data.
This is subdivided into:
- Classification: The labels are discrete categories.
- Regression: The labels are continuous quantities.  

### Unsupervised learning
Involves modeling the features of a dataset without reference to any label, and is often described as "letting the dataset speak for itself."
These models include:
- Clustering: Algorithms to identify distinct groups of data.
- Dimensionality reduction: Searching for more succinct representations of the data.  

### Semi-supervised learning methods
They fall somewhere between supervised learning and unsupervised learning.
Semi-supervised learning methods are often useful when only incomplete labels are available.
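
As a minimal sketch of the semi-supervised setting, the example below hides most of the Iris labels and lets ``LabelSpreading`` infer them from the few that remain. The 70% hiding rate, the ``knn`` kernel and the variable names are illustrative assumptions, not part of the course material.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X_semi, y_true = load_iris(return_X_y=True)

# Hide about 70% of the labels; scikit-learn marks unlabeled samples with -1
rng_semi = np.random.RandomState(0)
hidden = rng_semi.rand(len(y_true)) < 0.7
y_partial = y_true.copy()
y_partial[hidden] = -1

# Propagate the known labels through a nearest-neighbour graph
semi_model = LabelSpreading(kernel="knn", n_neighbors=7)
semi_model.fit(X_semi, y_partial)

# transduction_ holds the label inferred for every training sample
recovered = semi_model.transduction_[hidden]
print(f"Accuracy on the hidden labels: {(recovered == y_true[hidden]).mean():.2f}")
```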


<a id='API'></a>

## Scikit-learn API
Scikit-learn is an increasingly popular machine learning library. It is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts.  
In practice, this design makes Scikit-Learn very easy to use once the basic principles are understood.
Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.  

### Basics of the API
Most commonly, the steps in using the Scikit-Learn estimator API are as follows:
- Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
- Choose model hyperparameters by instantiating this class with desired values.
- Prepare training data as per model requirements.
- Train (fit) the model with your data by calling the ``fit()`` method of the model instance.
- Apply the model to new data:
   - For supervised learning, predictions are made with the ``predict()`` method.
   - For unsupervised learning, use the ``transform()`` or ``predict()`` method, depending on the model.
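
The steps above can be sketched on a toy dataset. The choice of ``KNeighborsClassifier`` and ``StandardScaler``, the six hand-made samples and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # step 1: choose a model class
from sklearn.preprocessing import StandardScaler

# Step 3: toy training data with 6 samples, 2 features and binary labels
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                  [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = KNeighborsClassifier(n_neighbors=3)        # step 2: set hyperparameters
clf_toy.fit(X_toy, y_toy)                            # step 4: fit the model
toy_pred = clf_toy.predict([[0.1, 0.1], [1.0, 1.0]]) # step 5: predict on new data
print(toy_pred)

# A transformer learns from the features alone: fit() takes no y,
# and transform() returns a new representation of the data
scaler_toy = StandardScaler()
scaler_toy.fit(X_toy)
X_toy_scaled = scaler_toy.transform(X_toy)
print(X_toy_scaled.std(axis=0))
```

The same ``fit()`` call pattern applies to every estimator; only the method used afterwards (``predict()`` or ``transform()``) changes.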

We will now step through several examples of applying supervised and unsupervised learning methods.

<a id='regression'></a>

## Supervised learning example: Simple linear regression

As an example of this process, let's consider a simple linear regression—that is, the common case of fitting a line to $(x, y)$ data.
We will use some random data for our regression example:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# slope
a = 2
# intercept
b = -1
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = a * x + b + rng.randn(50)
plt.scatter(x, y)

With this data in place, we can use the process shown earlier:
### - Choose a class of model

In Scikit-Learn, every class of model is represented by a Python class.
So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class (Scikit-Learn implements many other [``linear models``](https://scikit-learn.org/stable/modules/linear_model.html)):

In [None]:
from sklearn.linear_model import LinearRegression

### - Choose model hyperparameters
Once we have decided on our model class, there are still some options open to us.
Depending on the model class we are working with, we need to answer some questions to configure the model.
These choices are often represented as *hyperparameters*, or parameters that must be set before the model is fit to data.
In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation. For our example we will use the default values; for real problems, we recommend studying the hyperparameters that each model exposes to configure its training:

In [None]:
model = LinearRegression()
model.get_params()
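
To illustrate a non-default choice, the short sketch below passes a hyperparameter at instantiation and then changes it with ``set_params()``. The variable name ``no_intercept`` is chosen here so that the notebook's ``model`` is left untouched.

```python
from sklearn.linear_model import LinearRegression

# Hyperparameters are passed as keyword arguments at instantiation
no_intercept = LinearRegression(fit_intercept=False)
print(no_intercept.get_params()["fit_intercept"])

# They can also be changed afterwards, before fitting, with set_params()
no_intercept.set_params(fit_intercept=True)
print(no_intercept.get_params()["fit_intercept"])
```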

### - Prepare training data
Arrange data into a ``features matrix`` and ``target vector``.  
#### Features matrix
The features matrix is an array of ``samples`` of the same type of object, where each sample is described quantitatively by a set of ``features``.  
Features are generally real values, but may be Boolean or discrete-valued in some cases.
#### Target array
In addition to the feature matrix, we also generally work with a *label* or *target* array.
The target array is usually one dimensional.  
The target array may have continuous numerical values, or discrete classes/labels.

The Scikit-Learn data representation therefore requires a two-dimensional features matrix, usually called ``X``, and a one-dimensional target array, usually called ``y``.  

![](img/features.png)

Our target variable ``y`` is already in the correct form (a length-``n_samples`` array), but we need to reshape the data ``x`` to make it a matrix of size ``[n_samples, n_features]``.
In our case ``n_features = 1``:

In [None]:
# x shape before transformation 
x.shape

In [None]:
# X (capital) with a new dimension added: now a matrix
X = x[:, np.newaxis]
X.shape, y.shape
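
An equivalent way to add the feature dimension is ``reshape(-1, 1)``, where ``-1`` asks NumPy to infer the number of rows. The small array below is a stand-in used only for this sketch.

```python
import numpy as np

x_demo = np.arange(5.0)              # one-dimensional, shape (5,)
X_newaxis = x_demo[:, np.newaxis]    # the indexing form used in this notebook
X_reshape = x_demo.reshape(-1, 1)    # the reshape form
print(X_newaxis.shape, X_reshape.shape)
print(np.array_equal(X_newaxis, X_reshape))
```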

### - Fit the model to your data
Now it is time to apply our model to data.
This can be done with the ``fit()`` method of the model:

In [None]:
model.fit(X, y)

This ``fit()`` command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore.
In Scikit-Learn, by convention all model parameters that were learned during the ``fit()`` process have trailing underscores; for example in this linear model, we have the following:

In [None]:
model.coef_, model.intercept_

These two parameters represent the slope and intercept of the simple linear fit to the data.
Comparing them with the values used to generate the data, we see that they are very close to the input slope of 2 and intercept of -1.
### - Predict labels for unknown data
Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set.
In Scikit-Learn, this can be done using the ``predict()`` method.
For the sake of this example, our "new data" will be a grid of *x* values, and we will ask what *y* values the model predicts:

In [None]:
xfit = np.linspace(-1, 11, 8)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

Finally, let's visualize the results by plotting the raw (training) data and the model fit (estimates):

In [None]:
plt.scatter(x, y)
plt.plot(xfit, yfit, 'ro')
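
A plot gives a visual impression of the fit, while a held-out test set gives a number. The sketch below regenerates the same synthetic data, keeps a quarter of it aside, and reports two common regression metrics. The split and the ``_demo`` names are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data as above: slope 2, intercept -1, unit Gaussian noise
rng_demo = np.random.RandomState(42)
x_demo = 10 * rng_demo.rand(50)
y_demo = 2 * x_demo - 1 + rng_demo.randn(50)
X_demo = x_demo[:, np.newaxis]

# Hold out 25% of the samples (the default) for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
reg_demo = LinearRegression().fit(X_tr, y_tr)
y_te_pred = reg_demo.predict(X_te)

r2_demo = r2_score(y_te, y_te_pred)
mse_demo = mean_squared_error(y_te, y_te_pred)
print(f"R^2 on held-out data: {r2_demo:.3f}")
print(f"MSE on held-out data: {mse_demo:.3f}")
```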

<a id='classification'></a>

## Supervised learning example: Iris classification

Let's take a look at another example of this process, using the Iris dataset we discussed earlier.
Our question will be this: given a model trained on a portion of the Iris data, how well can we predict the remaining labels?

For this task, we will use an extremely simple model known as [Gaussian naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB), which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution.
Because it is so fast and has essentially no hyperparameters to tune, Gaussian naive Bayes is often a good baseline classifier to try before exploring whether improvements can be found through more sophisticated models.
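
To make the "axis-aligned Gaussian" assumption concrete, here is a from-scratch sketch in NumPy: each class gets one mean and one variance per feature, and a sample is assigned to the class with the highest log-posterior. This is an illustrative re-implementation under simplifying assumptions (it omits the small variance smoothing that ``GaussianNB`` applies), and it uses scikit-learn's bundled copy of Iris.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X_nb, y_nb = load_iris(return_X_y=True)
classes = np.unique(y_nb)

# One mean and one variance per class and per feature, plus the class priors
means = np.array([X_nb[y_nb == c].mean(axis=0) for c in classes])
variances = np.array([X_nb[y_nb == c].var(axis=0) for c in classes])
priors = np.array([(y_nb == c).mean() for c in classes])

# log p(c) + sum over features of log N(x_j | mean_cj, var_cj)
diff = X_nb[:, np.newaxis, :] - means   # shape (n_samples, n_classes, n_features)
log_lik = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(axis=2)
manual_pred = classes[np.argmax(np.log(priors) + log_lik, axis=1)]

# Compare with scikit-learn's implementation fitted on the same data
sk_pred = GaussianNB().fit(X_nb, y_nb).predict(X_nb)
agreement = (manual_pred == sk_pred).mean()
print(f"Agreement with GaussianNB: {agreement:.3f}")
```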

We would like to evaluate the model on data it has not seen before, and so we will split the data into a *training set* and a *testing set*.
This could be done by hand, but it is more convenient to use the ``train_test_split`` utility function:

In [None]:
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.sample(5))
print('\noriginal data shape:', iris.shape)
X_iris = iris.drop('species', axis=1)
y_iris = iris['species']

from sklearn.model_selection import train_test_split
# With train_size and test_size both left unset, 75% of the data is used for training and 25% for testing.
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)
print('\n', Xtest.sample(3), '\n')
print(ytest.sample(3))
Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape

With the data prepared, we can follow our process to predict the labels:

In [None]:
# Choose model class
from sklearn.naive_bayes import GaussianNB
# Instantiate model (no hyperparameters)
model = GaussianNB()
# Fit model to data
model.fit(Xtrain, ytrain)
# Predict on new data
y_model = model.predict(Xtest)
print(ytest.sample(3), '\n')
model.get_params()

In the end, we can use the ``accuracy_score`` utility to see the fraction of predicted labels that match their correct value:

In [None]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of the predictions: {100*accuracy_score(ytest, y_model):5.2f}%')
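
A single accuracy figure does not show which species get confused with each other. A confusion matrix does: rows are true classes and columns are predicted classes. The sketch below is self-contained and assumes scikit-learn's bundled copy of Iris instead of the Seaborn one.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_cm, y_cm = load_iris(return_X_y=True)
X_cm_tr, X_cm_te, y_cm_tr, y_cm_te = train_test_split(X_cm, y_cm, random_state=1)

clf_cm = GaussianNB().fit(X_cm_tr, y_cm_tr)
cm = confusion_matrix(y_cm_te, clf_cm.predict(X_cm_te))
print(cm)

# The diagonal counts the correct predictions, so trace / total is the accuracy
print(f"Accuracy from the matrix: {cm.trace() / cm.sum():.3f}")
```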

<a id='dimensionality'></a>

## Unsupervised learning example: Iris dimensionality
As an example of an unsupervised learning problem, let's take a look at reducing the dimensionality of the Iris data (from 4 to 2) so as to more easily visualize it.

The task of dimensionality reduction is to ask whether there is a suitable lower-dimensional representation that retains the essential features of the data.
Often dimensionality reduction is used as an aid to visualizing data: after all, it is much easier to plot data in two dimensions than in four dimensions or higher!

Here we will use principal component analysis (PCA), which is a fast linear dimensionality reduction technique.
We will ask the model to return two components—that is, a two-dimensional representation of the data.

Following the sequence of steps outlined earlier, we have:

In [None]:
# Choose the model class
from sklearn.decomposition import PCA
# Instantiate the model with hyperparameters
model = PCA(n_components=2)
# Fit to data. Notice y is not specified!
model.fit(X_iris)
# Transform the data to two dimensions
X_2D = model.transform(X_iris)
model.get_params()
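
One way to judge how much information the two components retain is the ``explained_variance_ratio_`` attribute of a fitted PCA. The sketch below is self-contained and assumes scikit-learn's bundled copy of Iris.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_pca, _ = load_iris(return_X_y=True)
pca_demo = PCA(n_components=2).fit(X_pca)

# Fraction of the total variance captured by each component
ratios = pca_demo.explained_variance_ratio_
print(ratios)
print(f"Variance retained by two components: {ratios.sum():.3f}")
```

A value close to 1 suggests that little is lost by plotting the data in two dimensions.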

Now let's plot the results:

In [None]:
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False)

Even in the two-dimensional representation, the species remain fairly well separated, although the PCA algorithm had no knowledge of the species labels.

<a id='clustering'></a>

## Unsupervised learning example: Iris clustering

Let's now try to apply clustering to the Iris data.
A clustering algorithm attempts to find distinct groups of data without reference to any labels.
Here we will use a powerful clustering method called a [Gaussian mixture model](https://scikit-learn.org/stable/modules/mixture.html#gaussian-mixture) (GMM).
A GMM attempts to model the data as a collection of Gaussian blobs.

We can fit the Gaussian mixture model as follows:

In [None]:
# Choose the model class
from sklearn.mixture import GaussianMixture
# Instantiate the model with hyperparameters
model = GaussianMixture(n_components=3, covariance_type='full')
# Fit to data. Notice y is not specified!
model.fit(X_iris)
# Determine cluster labels
y_gmm = model.predict(X_iris)
model.get_params()

Finally, we use Seaborn to plot the results:

In [None]:
iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species', col='cluster', fit_reg=False)

This type of algorithm might give experts in the field clues as to the relationship between the samples they are observing.
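
Besides plotting, the agreement between clusters and known species can be summarised in one number. The sketch below uses the adjusted Rand index, which is 1 for a perfect match and close to 0 for a random assignment. It assumes scikit-learn's bundled copy of Iris and a fixed ``random_state``, neither of which is part of the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X_gmm, y_species = load_iris(return_X_y=True)

# The species labels are used only for scoring, never for fitting
gmm_demo = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
cluster_labels = gmm_demo.fit_predict(X_gmm)

ari = adjusted_rand_score(y_species, cluster_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```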

<a id='neuralnet'></a>

## Deep learning
Deep learning is a subfield of machine learning concerned with algorithms, called artificial neural networks, that are inspired by the structure and function of the brain.  
In practice, DL models can address the same kinds of problems as other ML models: regression, classification, and unsupervised tasks such as clustering.  
So why treat it separately? There are several reasons, and most of them come down to economics:
### Large amount of training data
For most ``classical ML models``, performance ``plateaus`` once a certain amount of training data is reached, whereas ``DL models`` tend to ``keep getting better`` as more data is added.
### Large models
With neural networks, in general, the more layers and ``neurons`` you add, the more performance you can obtain. This enables so-called ``hierarchical feature learning``: DL models extract features automatically from raw data, so they learn without depending completely on human-crafted features, and the design of the model becomes more important than a detailed understanding of the data.  
### More computation
As a result of both large training data and large models, training takes longer and becomes considerably more expensive.
This is sometimes referred to as the ``scalability of neural networks``: results get better with more data and larger models, which in turn require more computation to train.
### GPT-3: One of the Largest Neural Networks at Its Release
As an example of the magnitudes reached by DL models, let's explore GPT-3. OpenAI's GPT-3 architecture represented a seminal shift in AI research and use. When it was released in 2020, it was one of the largest neural networks ever developed, and it promised significant improvements in ``natural language`` tools and applications.
- 96 layers and 175 billion parameters: In machine learning parlance, parameters are the values the model learns during training. Loosely speaking, the more parameters a model has, the more knowledge it can store. GPT-3 is about 100x bigger than GPT-2.
- 45 terabytes of data: Training data from CommonCrawl, WebText, Wikipedia, and a corpus of books.
- Cost and time: This information is not public. According to one estimate, training the 175-billion-parameter neural network requires 3.114E23 FLOP (floating-point operations), which would theoretically take 355 years on a single V100 GPU server with 28 TFLOPS capacity and would cost $4.6 million at $1.5 per hour.
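
The order of magnitude of the time and cost estimate can be checked with a few lines of arithmetic. The sketch assumes a single server running continuously at the full 28 TFLOPS and a 365-day year. It lands slightly below the quoted 355 years, which presumably reflects rounding in the original estimate.

```python
total_operations = 3.114e23   # floating-point operations needed (from the text)
throughput = 28e12            # 28 TFLOPS = 28e12 operations per second
price_per_hour = 1.5          # US dollars per hour (from the text)

seconds = total_operations / throughput
hours = seconds / 3600
years = hours / (24 * 365)
cost = hours * price_per_hour

print(f"{years:.0f} years, ${cost / 1e6:.1f} million")
# prints "353 years, $4.6 million"
```
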
### When And Where To Apply Deep Learning?
There’s no denying that deep learning is a hot topic right now. But what does it really mean, and how should it be applied in practice?  
- ``When not to use deep learning``  
For one thing, deep learning really ``needs Big Data`` to make accurate decisions. So if you haven’t got an extremely large dataset to learn from, a regular machine learning algorithm is likely to deliver more accurate results.
It’s also ``more expensive`` to implement because it takes ``a lot of computing power`` to run a deep learning network. While services and tools like IBM’s Watson are helping to lower the barrier to entry for deep learning, remember that deep learning is still at the very cutting edge.

- ``Where best to apply deep learning``  
Deep learning is ideal for predicting outcomes whenever you have a lot of data to learn from – ‘a lot’ being a huge dataset with hundreds of thousands or, better, ``millions of data points``. Where you have a huge volume of data like this, the system has what it needs to train itself.
It’s also best when applied to ``complex problems`` and things that would be vastly expensive to solve with human decision making. ``Image processing`` is a great example of this. So, rather than YouTube paying an army of human workers to trawl through millions of videos and tag the ones with cats for our viewing pleasure, it makes much more sense to apply deep learning. It’s the same with ``translation`` and ``speech recognition``.
And last but not least, deep learning is only appropriate if you have the high-end computing power to make it work, or are partnering with an analytics provider who has the infrastructure and skills that might be lacking in-house.

### Sklearn and DNN
``Sklearn`` does not offer much support for ``Deep Neural Networks``, but it can be a good starting point for exploring the realm of DL, because it makes creating and training a model very easy. It is best suited to experimentation: the implementation is not intended for large-scale applications and, in particular, scikit-learn offers no GPU support.

### Example: Handwritten Digit Recognition
- The dataset was constructed from a number of scanned document datasets available from the _National Institute of Standards and Technology (NIST)_. This is where the name for the dataset comes from, as the _Modified NIST_ or [_MNIST_ dataset](http://yann.lecun.com/exdb/mnist/).
- Images of digits were taken from a variety of scanned documents, normalized in size and centered. This makes it an excellent dataset for evaluating models, allowing the developer to focus on the machine learning with very little data cleaning or preparation required.
- Each image is a 28 x 28 pixel square (784 pixels total). A standard split of the dataset is used to evaluate and compare models, where 60,000 images are used to train a model and a separate set of 10,000 images is used to test it.
- It is a digit recognition task. As such there are 10 digits (0 to 9) or 10 classes to predict.  

<img src="img/mnist.png"
alt="drawing" width="700"/>  

To make the example run faster, we use very few hidden units, and train only for a very short time. Training longer would result in weights with a much smoother spatial appearance.  

- First, let's fetch and partition the data:

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Load data from https://www.openml.org/d/554
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
# Converting data from (0-255) integer values to (0-1) float, a rescaling
X = X / 255.0

# Split data into train partition and test partition
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.5)
X_train.shape, X_test.shape

- Some visualization and analysis for training data 

In [None]:
print(f'Number: {y[0]}')
plt.imshow(X[0].reshape(28,28), cmap='gray')
plt.show()
X[0].min(), X[0].mean(), X[0].max()

The multi-layer perceptron classifier (``MLP``), implemented as ``MLPClassifier``, is the supervised neural network model that Sklearn offers for classification.

In [None]:
# Create the model
mlp = MLPClassifier(
    hidden_layer_sizes=(40,),
    max_iter=30,
    alpha=1e-4,
    solver="sgd",
    verbose=10,
    random_state=1,
    learning_rate_init=0.2,
    early_stopping=True,
    validation_fraction=.2
)
# Train the model
mlp.fit(X_train, y_train)
# print(mlp.predict(X_test))
print(f"\nTraining set score: {mlp.score(X_train, y_train):.3f}")
print(f"Test set score: {mlp.score(X_test, y_test):.3f}")

mlp.get_params()
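
Beyond the overall score, a fitted ``MLPClassifier`` can report how confident it is for a single sample through ``predict_proba()``. To keep the sketch quick and self-contained, it assumes the small 8x8 ``load_digits`` dataset bundled with scikit-learn instead of MNIST, and default training settings apart from the ones shown.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 1,797 images of 8x8 pixels, with values from 0 to 16
X_digits, y_digits = load_digits(return_X_y=True)
X_digits = X_digits / 16.0   # rescale to the 0-1 range
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_digits, y_digits, random_state=0)

small_mlp = MLPClassifier(hidden_layer_sizes=(40,), max_iter=300, random_state=1)
small_mlp.fit(Xd_train, yd_train)

# One probability per class (digits 0 to 9) for the first test image
proba = small_mlp.predict_proba(Xd_test[:1])
print("Predicted:", small_mlp.predict(Xd_test[:1])[0], "| true:", yd_test[0])
print("Class probabilities:", proba.round(2))
digits_score = small_mlp.score(Xd_test, yd_test)
print(f"Test set score: {digits_score:.3f}")
```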

In [None]:
fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()
ax_left.set_ylabel('training loss', color='green')
ax_left.plot(mlp.loss_curve_, color='green')
ax_right.set_ylabel('validation score', color='orange')
ax_right.plot(mlp.validation_scores_, color='orange')

In [None]:

plt.rcParams["figure.figsize"] = (20,10)
fig, axes = plt.subplots(4, 10)
# use global min / max to ensure all weights are shown on the same scale
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=0.5 * vmin, vmax=0.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()

Here we can see the 40 groups of weights that feed the units of the hidden layer, one group per unit. Shown as 28x28 images, they tell us how, after training, each ``artificial neuron`` filters the input image to make sense of the data.  
Even if this representation seems to make graphical sense, as if the model were trying to learn horizontal, vertical or diagonal patterns, don't forget that for the model the input is just a row of 784 floating-point values between 0 and 1.
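
The relation between the flat row the model receives and the image we plot is a plain reshape, as the sketch below shows on a synthetic 28x28 array (a stand-in for a real digit).

```python
import numpy as np

# Synthetic 28x28 "image" with values between 0 and 1
image = np.arange(28 * 28, dtype=float).reshape(28, 28) / (28 * 28 - 1)

row = image.reshape(-1)          # what the model receives: 784 values in a row
restored = row.reshape(28, 28)   # what we reconstruct for plotting

print(row.shape, row.min(), row.max())
print(np.array_equal(image, restored))
```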


<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>This is the Jupyter notebook version of the __Python for Official Statistics__ produced by Eurostat; the content is available [on GitHub](https://github.com/eurostat/e-learning/tree/main/python-official-statistics).
<br>The text and code are released under the [EUPL-1.2 license](https://github.com/eurostat/e-learning/blob/main/LICENSE).</span>