# Module 2: Unsupervised Learning

Welcome to the second module of this series! In this module you will first take a deeper look at what unsupervised learning is and in which scenarios it can be used. Then you will explore two types of problems that unsupervised learning solves: dimensionality reduction and clustering. As you will see, these are not necessarily independent and can be treated as parts of the same pipeline.

**Module Overview**
1. [What is Unsupervised Learning?](#what-is-unsupervised)
2. [Hyperparameters](#hyperparameters)
3. [Dimensionality Reduction](#dim-red)

**Dataset**

In this module we will work with the already preprocessed Swiss Food Composition dataset from Module 1: Introduction to Machine Learning and Data Preprocessing for Food Sciences. You can find the preprocessed dataset at `data/swiss_food_composition_proc.csv`.

As a quick recap, this is the resulting dataset after:
- removing the samples and features with more than 20% missing values,
- splitting the dataset into train and test sets,
- imputing missing values,
- standardizing the remaining data.

Note that in this module we will not need the train and test splits: in unsupervised learning we do not use any labels or target variables, and thus we do not predict any category or value.
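As a reminder of what those recap steps look like in code, here is a minimal sketch on toy data. It is not the actual Module 1 code: the toy values, the column names `f0` to `f7`, the order in which features and samples are filtered, the 80/20 split, and the choice of mean imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the raw table (hypothetical values and column names).
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(50, 8)), columns=[f"f{i}" for i in range(8)])
raw.loc[:19, "f7"] = np.nan  # 40% missing: this feature will be dropped
raw.loc[0, "f0"] = np.nan    # one isolated missing value: it will be imputed

# 1. Remove features, then samples, with more than 20% missing values
#    (the order of these two filters is an assumption).
kept = raw.loc[:, raw.isna().mean(axis=0) <= 0.2]
kept = kept.loc[kept.isna().mean(axis=1) <= 0.2]

# 2. Split into train and test sets (the 80/20 ratio is an assumption).
train, test = train_test_split(kept, test_size=0.2, random_state=0)

# 3. Impute missing values (mean imputation is an assumption), fitted on train only.
imputer = SimpleImputer(strategy="mean")
train_imp = imputer.fit_transform(train)
test_imp = imputer.transform(test)

# 4. Standardize to zero mean and unit variance, again fitted on train only.
scaler = StandardScaler()
train_std = scaler.fit_transform(train_imp)
test_std = scaler.transform(test_imp)

print(train_std.shape, test_std.shape)
```

Fitting the imputer and the scaler on the training split only is the usual way to avoid leaking information from the test set into the preprocessing.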

<a id='what-is-unsupervised'></a>
## What is unsupervised learning?

In unsupervised learning, the data does not come with any values or categories that we can learn and later predict. Instead, the models try to find structure in the data or learn the patterns present in it. Some use cases of such models are clustering, dimensionality reduction, data generation, and anomaly detection.

In clustering, we try to find groups within the data, so that similar samples are grouped together. In dimensionality reduction, we move from data with many features to compressed data with far fewer features. In data generation, as the name suggests, the model uses the unlabelled data to learn its structure or underlying properties and, based on this, generates similar samples. In anomaly detection, we use machine learning models to find outliers, that is, points that do not resemble the majority of the points in the dataset.

[Fig. 1](#unsup_learn) illustrates the machine learning pipeline for unsupervised learning. The models still produce an output, which reflects what they have learned from the data. In the case of clustering, the output is a cluster number indicating the group of samples to which a specific sample is most similar. In the case of dimensionality reduction, the output is the same sample but with fewer features.
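To make these two kinds of output concrete, here is a small sketch on synthetic data. The data from `make_blobs`, the use of k-means as the clustering algorithm, and the choice of three clusters and two components are assumptions made for illustration only; they are not taken from the Swiss Food Composition dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data: 300 samples with 10 features. The generated blob labels
# are discarded, so the models only ever see unlabelled data.
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Clustering: the output is one cluster number per sample.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids[:10])

# Dimensionality reduction: the output is the same samples with fewer features.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```

Note that neither model is given a target variable: both `fit_predict` and `fit_transform` receive only `X`.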

Something to notice is the missing train-test split step. Since here we do not have any labels or target variables, the train-test split is not of any use.

<center>
    <a id="unsup_learn"></a>
    <img src="images/part2_unsupervised/unsupervised_learning__clustering_dimred.jpg" alt="Standardization" width="90%">
    <center><figcaption><em>Figure 1: Unsupervised Learning</em></figcaption></center>
</center>


<a id='hyperparameters'></a>
## Parameters vs Hyperparameters

In machine learning, parameters and hyperparameters play different roles. The parameters are values that the machine learning model learns from the data. At the end of the learning process, the data will be described by a mathematical equation. The main goal of the learning process is to find the parameters of this mathematical equation that would best describe the data. For example, suppose that you have some points scattered in a 2D coordinate system. Your aim is to find the line with an equation of the form: $$y = ax + b$$. 

In this case, `a` and `b` would be the parameters that the model would learn from the points so that the line would represent them in the best possible way.

On the other hand, hyperparameters control the learning process itself and how the parameters will be computed. Hyperparameters are set by the data scientists or analysts; they are not learned by the model. You can think of them as settings or configurations that tune the learning process. People usually rely on intuition, trial and error, and more systematic techniques such as cross-validation to pick hyperparameters that make the learning process faster and the results more accurate. Returning to the line example, a hyperparameter would determine *how* complex the fitted equation is allowed to be, for instance whether we fit a straight line or a higher-degree polynomial.
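The distinction can be illustrated with a short sketch of the line example. The noisy points around $y = 2x + 1$ are made up for illustration, and the polynomial degree is used here as one possible example of a hyperparameter.

```python
import numpy as np

# Made-up points scattered around the line y = 2x + 1.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Hyperparameter: the polynomial degree, chosen by us before fitting.
degree = 1

# Parameters: the coefficients learned from the data.
# For degree 1 these are the slope a and the intercept b of y = ax + b.
a, b = np.polyfit(x, y, deg=degree)
print(f"a = {a:.2f}, b = {b:.2f}")

# A different hyperparameter value changes what is learned:
# a degree-3 polynomial has four parameters instead of two.
coeffs_deg3 = np.polyfit(x, y, deg=3)
print(len(coeffs_deg3))
```

The values of `a` and `b` come out of the fit, whereas `degree` has to be decided before the fit can run.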

All in all, parameters determine the model output, while hyperparameters determine how the parameters are learned. You can read more about the distinction between parameters and hyperparameters [in this blog post](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac).

<a id='dim-red'></a>
## Dimensionality Reduction

### Why is it used?

Dimensionality reduction is used to reduce the complexity of a dataset, capture its most important information, and make it possible to visualize. Rather than simply choosing which of the original features to keep, the techniques in this module build a small number of new features that summarize the original ones, so that redundant information carried by highly correlated features is compressed. Dimensionality reduction can also improve the performance of ML models because it reduces the effects of *the curse of dimensionality*: the more features a dataset has, the more samples are needed for ML models to learn from it. Therefore, one of the main goals of dimensionality reduction is to reduce the number of features of the dataset while retaining the most important information.
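One way to get a feeling for the curse of dimensionality is to keep the number of samples fixed and increase the number of features. The sketch below uses uniformly random points, which is an assumption made purely for illustration, and measures how far each point is from its nearest neighbour. As the number of features grows, the same number of samples covers the space more and more sparsely.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_samples = 500  # kept fixed while the number of features grows
mean_nn_dist = {}

for n_features in (2, 10, 100):
    # Random points in the unit hypercube (illustrative data only).
    X = rng.uniform(size=(n_samples, n_features))
    # n_neighbors=2 because the closest point to each sample is the sample itself.
    distances, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    mean_nn_dist[n_features] = distances[:, 1].mean()
    print(n_features, round(mean_nn_dist[n_features], 3))
```

The mean nearest-neighbour distance increases with the number of features, so more samples would be needed to keep the data equally dense.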

The dimensionality reduction techniques covered in this module belong to the unsupervised learning group of algorithms because they do not use any labels. They are principal component analysis (PCA), t-SNE (t-distributed stochastic neighbor embedding) and UMAP (uniform manifold approximation and projection). They offer several advantages, especially making results easier to visualize and explain to audiences, removing noise from the dataset, and making model training faster.

Below we will explore PCA, t-SNE and UMAP on the processed Swiss Food Composition dataset. First, however, we will import the necessary libraries and read the dataset.

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import pandas as pd

In [None]:
# we will work with the preprocessed dataset
dataset = pd.read_csv('data/swiss_food_composition_proc.csv')

### PCA

Principal Component Analysis (PCA) is a linear dimensionality reduction technique. It finds the directions of highest variance in the data, on the assumption that these carry the most important information, and projects the data points onto them. The data is thus projected into a new subspace with the same number of features as before or fewer. The new axes (the principal components) are orthogonal to each other and are ordered by the amount of variance they capture.

[Fig. 2](#pca) gives an illustration of PCA:
<center>
    <a id="pca"></a>
    <img src="images/part2_unsupervised/PCA.jpg" alt="Standardization" width="90%">
    <center><figcaption><em>Figure 2: PCA components</em></figcaption></center>
</center>

The red axes depict the directions of highest variance. The data points will be projected onto these two directions. In this case there is no dimensionality reduction: the projected data points still have two dimensions, but they reside in a new coordinate system defined by the directions of highest variance. Dimensionality is reduced only when we keep fewer components than there are original features.
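The situation in Fig. 2 can be sketched with scikit-learn's `PCA` on synthetic 2D data. The data below (a stretched and rotated point cloud) is made up for illustration and is not the data shown in the figure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2D point cloud: stretched along one axis, then rotated by 30 degrees.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
angle = np.pi / 6
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X = X @ rotation.T

# Keep both components: no reduction, only a change of coordinate system.
pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)

# Each row of components_ is one direction of highest variance.
print(pca.components_)
# Share of the total variance captured by each direction.
print(pca.explained_variance_ratio_)
# The directions are orthogonal to each other and have unit length.
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))

# Keep only the first component: this is the actual dimensionality reduction.
X_1d = PCA(n_components=1).fit_transform(X)
print(X.shape, "->", X_1d.shape)
```

With `n_components=2` the data keeps its two dimensions, as in the figure. Setting `n_components=1` keeps only the direction that captures most of the variance.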

Besides dimensionality reduction, PCA is extensively used in bioinformatics for the analysis of gene expression levels.

### t-SNE

### UMAP

**References:**

- "Machine Learning with Pytorch and Scikit-Learn" - Sebastian Raschka, Yuxi Liu, Vahid Mirjalili, Dmytro Dzhulgakov.