# What is Dimensionality Reduction?

Dimensionality reduction algorithms are a class of unsupervised machine learning algorithms. 
The aim of a dimensionality reduction algorithm is to reduce the number of features in a dataset, without losing too much important information. 

Humans are poor at visualising data in more than two dimensions. 
However, we are very good at identifying trends in data. 
Therefore, it can be desirable to reduce the number of features in a given dataset to make visualisation easier. 
This is often followed by further analysis, such as trend identification (e.g., linear regression) or clustering. 

```{figure} ../images/dimensionality-reduction.png
---
name: transpose
width: 100%
---
A visual representation of dimensionality reduction, from three to one dimensions.
```
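
To make the idea concrete, the following is a minimal sketch of reducing three features to one. It assumes synthetic data (the names `points`, `direction` and `reduced` are purely illustrative) and uses a singular value decomposition from NumPy to find the direction along which the data varies most; it is not the dataset or the workflow used in the rest of this section.

```python
import numpy as np

# Synthetic, illustrative data: 200 points lying close to a single line in 3D.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
direction = np.array([1.0, 2.0, -1.0])  # hypothetical direction of the line
points = np.outer(t, direction) + 0.05 * rng.normal(size=(200, 3))

# Centre the data, then find the direction of greatest variance.
centred = points - points.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

# Project every 3D point onto that single direction: three features become one.
reduced = centred @ vt[0]

# Fraction of the total variance that survives the reduction.
retained = singular_values[0] ** 2 / np.sum(singular_values ** 2)
print(points.shape, reduced.shape)
print(f"fraction of variance kept: {retained:.3f}")
```

Because the synthetic points were built to lie near a line, almost all of the variance is kept by the single remaining feature; real data will generally lose more information.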

There are many dimensionality reduction algorithms, but we will look in detail at just two: principal components analysis (PCA) and *t*-distributed stochastic neighbour embedding (*t*-SNE). 

````{margin}
```{note}
An [abalone](https://en.wikipedia.org/wiki/Abalone) is a marine mollusc that is commonly eaten by humans. 
```
````
To demonstrate the different dimensionality reduction algorithms, we will use an openly available dataset. 
Specifically, we will use a dataset containing measurements of different abalones. 
This data was sourced from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone). 
The file has been modified to include the names of the features, and can be downloaded [here](./../data/abalone.csv).

Let's have a look at this dataset. 

```python
import pandas as pd

data = pd.read_csv('./../data/abalone.csv')
data
```

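|      | sex | length/mm | diameter/mm | height/mm | whole_weight/g | shucked_weight/g | viscera_weight/g | shell_weight/g | rings |
|------|-----|-----------|-------------|-----------|----------------|------------------|------------------|----------------|-------|
| 0    | M   | 0.455     | 0.365       | 0.095     | 0.5140         | 0.2245           | 0.1010           | 0.1500         | 15    |
| 1    | M   | 0.350     | 0.265       | 0.090     | 0.2255         | 0.0995           | 0.0485           | 0.0700         | 7     |
| 2    | F   | 0.530     | 0.420       | 0.135     | 0.6770         | 0.2565           | 0.1415           | 0.2100         | 9     |
| 3    | M   | 0.440     | 0.365       | 0.125     | 0.5160         | 0.2155           | 0.1140           | 0.1550         | 10    |
| 4    | I   | 0.330     | 0.255       | 0.080     | 0.2050         | 0.0895           | 0.0395           | 0.0550         | 7     |
| ...  | ... | ...       | ...         | ...       | ...            | ...              | ...              | ...            | ...   |
| 4172 | F   | 0.565     | 0.450       | 0.165     | 0.8870         | 0.3700           | 0.2390           | 0.2490         | 11    |
| 4173 | M   | 0.590     | 0.440       | 0.135     | 0.9660         | 0.4390           | 0.2145           | 0.2605         | 10    |
| 4174 | M   | 0.600     | 0.475       | 0.205     | 1.1760         | 0.5255           | 0.2875           | 0.3080         | 9     |
| 4175 | F   | 0.625     | 0.485       | 0.150     | 1.0945         | 0.5310           | 0.2610           | 0.2960         | 10    |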


We can see that the data has 4177 entries, each with 9 features. 
Before we continue, it is important that we check for any missing data.
Missing data would typically be stored as a null-value. 

```python
pd.isnull(data).sum()
```

```
sex                 0
length/mm           0
diameter/mm         0
height/mm           0
whole_weight/g      0
shucked_weight/g    0
viscera_weight/g    0
shell_weight/g      0
rings               0
dtype: int64
```

We can see that there are no features with missing data (i.e., no null-values). 
If null-values were present, depending on the algorithm being used, it may be necessary to remove these datapoints. 
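
As an illustration of what removing such datapoints could look like, here is a minimal sketch using pandas. It assumes a small stand-in table (`sample`) built from three of the rows shown earlier, with one value deliberately replaced by a null-value; the abalone data itself has no missing entries, so this step is not needed for it.

```python
import numpy as np
import pandas as pd

# Stand-in for the abalone data: three rows from the table shown earlier,
# with one length deliberately replaced by NaN purely for illustration.
sample = pd.DataFrame({
    'sex': ['M', 'M', 'F'],
    'length/mm': [0.455, np.nan, 0.530],
    'rings': [15, 7, 9],
})

# Count the null-values in each column, as was done for the full dataset.
print(sample.isnull().sum())

# Drop every row (datapoint) that contains at least one null-value.
cleaned = sample.dropna()
print(cleaned.shape)
```

Dropping rows is only one option; whether it is appropriate depends on how much data would be lost and on the algorithm being applied.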

In addition to checking for missing data, we should also consider the nature of some of the data present. 
For example, the `sex` data is not numerical; each entry is male (`M`), female (`F`), or infant (`I`).
This data is referred to as **categorical**, as it falls into categories. 
Similar to missing data, this may not be compatible with the algorithm we apply. 
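
Two common ways of dealing with a categorical column are sketched below, under the assumption that the algorithm requires purely numerical input. The `sample` table is a stand-in built from the first five rows shown earlier, and neither option is presented as the approach this text goes on to use: the first simply drops the column, while the second replaces it with one indicator column per category (often called one-hot encoding).

```python
import pandas as pd

# Stand-in for the abalone data: a few columns from the first five rows.
sample = pd.DataFrame({
    'sex': ['M', 'M', 'F', 'M', 'I'],
    'length/mm': [0.455, 0.350, 0.530, 0.440, 0.330],
    'rings': [15, 7, 9, 10, 7],
})

# Inspect which categories are present in the column.
print(sample['sex'].unique())

# Option 1: drop the categorical column and keep only numerical features.
numerical = sample.drop(columns='sex')

# Option 2: replace the column with one indicator column per category.
encoded = pd.get_dummies(sample, columns=['sex'])
print(encoded.columns.tolist())
```

Dropping the column discards information, whereas encoding keeps it at the cost of adding features, which runs against the aim of reducing dimensionality; which trade-off is preferable depends on the analysis.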