# What is Dimensionality Reduction?

Dimensionality reduction algorithms are a class of unsupervised machine learning algorithms. 
The aim of a dimensionality reduction algorithm is to reduce the number of features in a dataset, without losing too much important information. 

Humans are poor at visualising data in more than two or three dimensions. 
However, humans are very good at identifying trends in data. 
Therefore, it can be desirable to reduce the number of features in a given dataset to facilitate visualisation. 
Often this is followed by further analysis, such as trend identification (e.g., linear regression) or clustering. 

```{figure} ../images/dimensionality-reduction.png
---
name: dr
width: 100%
---
A visual representation of dimensionality reduction, from three to one dimensions.
```
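
As a rough illustration of the idea in the figure, the sketch below reduces some synthetic three-dimensional points to a single dimension. 
The data, the random seed, and all of the variable names are assumptions made for this example; they are not part of the datasets used later. 
The projection uses a singular value decomposition to find the direction of greatest variance, which is the idea underlying PCA, and is only one of many possible ways to reduce the dimensionality. 

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 points scattered
# closely around a straight line in three dimensions.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
points = np.outer(t, direction) + 0.05 * rng.normal(size=(200, 3))

# Centre the data, then find the direction of greatest variance.
centred = points - points.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

# Project every point onto that single direction: three features become one.
reduced = centred @ vt[:1].T

# Fraction of the total variance that survives the reduction.
retained = singular_values[0] ** 2 / np.sum(singular_values ** 2)

print(points.shape, '->', reduced.shape)
print(f'Fraction of variance retained: {retained:.3f}')
```

Because the synthetic points lie close to a line, almost all of the variance is kept after the reduction; real datasets will typically lose more information. 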

There are many dimensionality reduction algorithms, but we will look in detail at just two: principal components analysis (PCA) and t-distributed stochastic neighbour embedding (*t*-SNE). 

````{margin}
```{note}
An [abalone](https://en.wikipedia.org/wiki/Abalone) is a marine mollusc that is commonly eaten by humans. 
```
````
To show the different dimensionality reduction algorithms, we will use some openly available datasets. 
Let's look at these datasets in detail before we get started. 

## Abalone Dataset
This data was sourced from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone), and we will use it to demonstrate the different dimensionality reduction algorithms. 
The file has been modified to include the names of the features, and can be downloaded [here](./../data/abalone.csv).

Let's have a look at this dataset. 

In [None]:
import pandas as pd

data = pd.read_csv('./../data/abalone.csv')
data

We can see that the data has 4177 entries, each with 9 features.
These 9 features describe the sex and size of the abalone samples (length, diameter, weight, etc.) and the number of rings in their shells. 
This final feature is a descriptor of the age of the abalone.
The number of rings is not straightforward to measure, so it is desirable to estimate the age from the other features. 

Before we continue, it is important that we check for any missing data.
Missing data would typically be stored as a null-value. 

In [None]:
pd.isnull(data).sum()

We can see that there are no features with missing data (i.e., no null-values). 
If null-values were present, depending on the algorithm being used, it may be necessary to remove these datapoints. 
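
As a minimal sketch of what removing such datapoints could look like, the example below uses a small made-up table (the values and the `toy` and `cleaned` names are assumptions for illustration, not part of the abalone data) and drops any row that contains a null-value with `dropna`. 
Dropping rows is only one option; whether it is appropriate depends on the data and the algorithm. 

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing length and one missing ring count.
toy = pd.DataFrame({
    'length': [0.455, np.nan, 0.530, 0.440],
    'rings': [15, 7, np.nan, 10],
})

# Count the null-values in each column.
null_counts = pd.isnull(toy).sum()
print(null_counts)

# Drop every row that contains at least one null-value.
cleaned = toy.dropna()
print(cleaned)
```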

In addition to checking for missing data, we should also consider the nature of some of the data present. 
For example, the `sex` data is not numerical; each sample is labelled as male, female, or infant.
This data is referred to as **categorical**, as it takes one of a fixed set of categories. 
Similar to missing data, this may not be compatible with the algorithm we apply. 
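
Two common ways of dealing with a categorical feature are to leave it out, or to convert it to numerical columns by one-hot encoding. 
The sketch below shows both on a small made-up table; the values, the `M`, `F` and `I` labels, and the variable names are assumptions for illustration, and neither option is necessarily what we will do later. 

```python
import pandas as pd

# Toy data (hypothetical values) with one categorical and one numerical feature.
toy = pd.DataFrame({
    'sex': ['M', 'F', 'I', 'M'],
    'length': [0.455, 0.530, 0.330, 0.440],
})

# Option 1: drop the categorical feature and keep only numerical data.
numerical_only = toy.drop(columns=['sex'])

# Option 2: one-hot encode, giving one indicator column per category.
encoded = pd.get_dummies(toy, columns=['sex'])

print(numerical_only.columns.tolist())
print(encoded.columns.tolist())
```

One-hot encoding keeps the information in the feature, at the cost of adding one column per category. 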

## Breast Cancer

Another dataset that we will look at is the Wisconsin Breast Cancer dataset, also sourced from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). 
This dataset contains information about the size, shape and texture of breast cancer tumours, and has been tagged with whether each tumour was found to be benign (non-cancerous) or malignant (cancerous, and able to invade surrounding tissue). 
This dataset has been reduced to better suit the pedagogical purposes of this work, and can be downloaded [here](./../data/breast-cancer.csv). 
Let's look at the dataset. 

In [None]:
data = pd.read_csv('./../data/breast-cancer.csv')
data

As with the abalone dataset, this data contains no null-values; any that were present have already been stripped. 

In [None]:
pd.isnull(data).sum()

## Handwritten Digits Dataset

A popular dataset used for looking at machine learning algorithms is the MNIST handwritten digits dataset. 
This data contains a series of images of digits that have been handwritten. 
Let's load the data in and have a look at the structure. 

In [None]:
data = pd.read_csv('./../data/mnist.csv')

data

We can see that this dataset has an integer value for each of 784 pixels (a 28 × 28 image) as well as a label, where the label indicates the true value of the digit that has been written. 
We can visualise some of the images by reshaping the data appropriately. 

In [None]:
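import matplotlib.pyplot as plt
from skimage.util import montage

# Take the 784 pixel columns of the first 16 rows and reshape each row to a 28 x 28 image.
pixel_columns = [f'pixel{i+1}' for i in range(784)]
images = data[pixel_columns].iloc[:16].values.reshape(16, 28, 28)

fig, ax = plt.subplots(1, 1, figsize=(6, 6))

# montage tiles the 16 images into a single 4 x 4 grid.
ax.imshow(montage(images))
ax.set_aspect('equal')
plt.show()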

This is a large dataset, with a lot of features for us to train our algorithms against. 