# Datasets 

Let's look in detail at the openly available datasets we will use. 
Many of these will be used for both unsupervised and supervised approaches. 

````{margin}
```{note}
An [abalone](https://en.wikipedia.org/wiki/Abalone) is a marine mollusc commonly eaten by humans. 
```
````
## Abalone Dataset

This data was sourced from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone). 
The file has been modified to include the names of the features and can be downloaded [here](./../data/abalone.csv).

Let's have a look at this dataset. 

In [None]:
import pandas as pd

data = pd.read_csv('./../data/abalone.csv')
data

We can see that the data has 4177 entries, each with nine features.
These nine features describe the size of the abalone samples (length, diameter, weight, etc.) and the number of rings in their shells. 
This final feature is a proxy for the age of the abalone.
Counting the rings is not straightforward, so estimating the age from the other, more easily measured features is desirable. 
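
To make the link between rings and age concrete: the UCI documentation for this dataset notes that adding 1.5 to the number of rings gives the age in years. 
The sketch below applies that rule to a small made-up table; the column name `rings` and the values are assumptions for illustration and are not read from the file. 

```python
import pandas as pd

# Hypothetical ring counts standing in for the real data
toy = pd.DataFrame({'rings': [15, 7, 9]})

# UCI rule of thumb: age in years = number of rings + 1.5
toy['age_years'] = toy['rings'] + 1.5
print(toy)
```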

Before we continue, we must check for any missing data.
Missing data would typically be stored as a null value. 

In [None]:
pd.isnull(data).sum()

We can see that no feature has missing data (i.e., there are no null values). 
If null values were present, depending on the algorithm being used, it may be necessary to remove these data points. 
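
As a minimal sketch of what removing such data points might look like, the example below builds a small made-up table containing null values and drops the affected rows with `dropna`. 
The column names and values are assumptions for illustration only. 

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a dataset with gaps
toy = pd.DataFrame({
    'length': [0.455, np.nan, 0.530],
    'diameter': [0.365, 0.265, np.nan],
    'rings': [15, 7, 9],
})

# Count the null values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one null value
cleaned = toy.dropna()
print(cleaned)
```

Dropping rows is the simplest option, but it discards information; whether it is appropriate depends on how much data is missing and on the algorithm being used. 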

In addition to checking for missing data, we should also consider the nature of some of the data present. 
For example, the `sex` data is not numerical; each entry is male, female, or infant.
This data is referred to as **categorical**, as it has categories. 
As with missing data, categorical data may not be compatible with the algorithms we apply. 
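
One common way to make categorical data usable by numerical algorithms is one-hot encoding, where each category becomes its own indicator column. 
The sketch below uses `pd.get_dummies` on a small made-up table; the category codes `M`, `F`, and `I` and the other values are assumptions for illustration. 

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({
    'sex': ['M', 'F', 'I', 'M'],
    'rings': [15, 7, 9, 10],
})

# Replace the `sex` column with one indicator column per category
encoded = pd.get_dummies(toy, columns=['sex'])
print(encoded)
```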

## Breast Cancer

Another dataset that we will look at is the Wisconsin Breast Cancer dataset, also sourced from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). 
This dataset contains information about the size, shape, and texture of breast cancer tumours and has been tagged with information about whether the tumour was found to be benign (not harmful in effect) or malignant (harmful). 
This dataset has been reduced to better suit the pedagogical purposes of this work and can be downloaded [here](./../data/breast-cancer.csv). 
Let's look at the dataset. 

In [None]:
data = pd.read_csv('./../data/breast-cancer.csv')
data

As with the abalone dataset, this data contains no null values. 

In [None]:
pd.isnull(data).sum()
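
Because each tumour is tagged as benign or malignant, it can also be useful to check how many samples fall into each class and to turn the tag into a number. 
The sketch below does this on a small made-up table; the column names `diagnosis` and `radius` and the codes `B` and `M` are assumptions and may not match the reduced file. 

```python
import pandas as pd

# Hypothetical toy data: 'B' for benign, 'M' for malignant
toy = pd.DataFrame({
    'diagnosis': ['B', 'M', 'B', 'B', 'M'],
    'radius': [12.3, 20.1, 11.8, 13.0, 18.7],
})

# Number of samples in each class
counts = toy['diagnosis'].value_counts()
print(counts)

# Map the tag to an integer target: 0 for benign, 1 for malignant
toy['target'] = toy['diagnosis'].map({'B': 0, 'M': 1})
print(toy)
```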

## Handwritten Digits Dataset

A popular dataset for looking at machine learning algorithms is the MNIST handwritten digits dataset. 
This dataset contains a series of images of handwritten digits. 
Let's load the data in and have a look at the structure. 

In [None]:
data = pd.read_csv('./../data/mnist.csv')

data

We can see that each row of this dataset has an integer value for each of the 784 pixels (a 28 × 28 image flattened into a single row) and a label, where the label gives the digit that was written. 
We can visualise some of the images by reshaping the data appropriately. 

In [None]:
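import matplotlib.pyplot as plt
from skimage.util import montage

# Select the 784 pixel columns for the first 16 rows
# (with the default integer index, `.loc[:15]` includes row 15)
pixel_columns = [f'pixel{i+1}' for i in range(784)]
images = data[pixel_columns].loc[:15].values.reshape(16, 28, 28)

# Tile the 16 images into a single 4 x 4 grid and display it
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.imshow(montage(images))
ax.set_aspect('equal')
plt.show()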
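
The reshaping step used above can be illustrated on its own with synthetic data. 
The sketch below is not the MNIST data: it fills four rows of 784 values with random integers (assumed here to lie in the range 0 to 255) and shows how each flat row maps onto a 28 × 28 image. 

```python
import numpy as np

rng = np.random.default_rng(0)

# Four synthetic "images", each stored as a flat row of 784 integers
flat = rng.integers(0, 256, size=(4, 784))

# Reshape each row into a 28 x 28 image
images = flat.reshape(4, 28, 28)
print(images.shape)

# Rows are filled in order, so flat pixel j of image i
# lands at images[i, j // 28, j % 28]
j = 3 * 28 + 5
print(images[2, 3, 5] == flat[2, j])
```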

This is a large dataset with many features, which we can use to train and test algorithms. 