In [1]:
from src.data import datasets

Before running this notebook, make sure you have activated the appropriate environment and created the requisite datasets:
```
conda activate dimension_reduction
make data
```

# Natural Datasets
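Load each available dataset and print its shape and description. The list includes some synthetic datasets (for example `ball` and `broken-swiss-roll`) alongside the natural ones. Loading a dataset fetches, processes, and caches it for use in other notebooks.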

In [2]:
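# Separator line printed before and after each dataset header
div = '=' * 60 + '\n'
for dataset_name in datasets.available_datasets():
    # Loading fetches, processes, and caches the dataset
    dset = datasets.load_dataset(dataset_name)
    # Print the name, the array shapes, and the dataset's description text
    print(f"{div}Dataset: {dataset_name.upper()}\n"
          f"Shape: data={dset.data.shape}, target={dset.target.shape}\n{div}\n"
          f"{dset.DESCR}")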

Dataset: BALL
Shape: data=(1000, 3), target=(1000,)

Synthetic data produced by: src.data.synthetic.sample_ball

>>> sample_ball(1000, random_state=6502)

>>> help(sample_ball)

Sample from a unit ball

    Use rejection sampling on the unit cube
    
Dataset: BROKEN-SWISS-ROLL
Shape: data=(1000, 3), target=(1000,)

Synthetic data produced by: src.data.synthetic.synthetic_data

>>> synthetic_data(kind='broken_swiss_roll', n_points=1000, noise=0.05, random_state=6502)

>>> help(synthetic_data)

Make a synthetic dataset

    A sample dataset generators in the style of sklearn's
    `sample_generators`. This adds other functions found in the Matlab
    toolkit for Dimensionality Reduction

    Parameters
    ----------
    kind: {'unit_cube', 'broken_swiss_roll', 'twinpeaks', 'difficult'}
        The type of synthetic dataset
    n_points : int, optional (default=1000)
        The total number of points generated.
    noise : double or None (default=None)
        Standard deviation of Gau

Dataset: F-MNIST
Shape: data=(60000, 784), target=(60000,)

MNIST
=====

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 728
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: LeCun et al
    :Date: 1998

This is a copy of the well-known MNIST database of handwritten digits,
available from http://yann.lecun.com/exdb/mnist/

The MNIST database of handwritten digits consists of a training set of
60,000 examples, and a test set of 10,000 examples. It is a subset of
a larger set available from NIST. The digits have been size-normalized
and centered in a fixed-size image.

The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the

Dataset: MNIST
Shape: data=(60000, 784), target=(60000,)

MNIST
=====

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 728
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: LeCun et al
    :Date: 1998

This is a copy of the well-known MNIST database of handwritten digits,
available from http://yann.lecun.com/exdb/mnist/

The MNIST database of handwritten digits consists of a training set of
60,000 examples, and a test set of 10,000 examples. It is a subset of
a larger set available from NIST. The digits have been size-normalized
and centered in a fixed-size image.

The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the c

.Z files are only supported on systems that ship with gzip. Trying...


Dataset: ORL-FACES
Shape: data=(400, 10304), target=(400,)

The ORL face database
---------------------

This directory contains a set of faces taken between April 1992 and
April 1994 at the Olivetti Research Laboratory in Cambridge, UK.

There are 10 different images of 40 distinct subjects. For some of the
subjects, the images were taken at different times, varying lighting
slightly, facial expressions (open/closed eyes, smiling/non-smiling)
and facial details (glasses/no-glasses).  All the images are taken
against a dark homogeneous background and the subjects are in
up-right, frontal position (with tolerance for some side movement).

The files are in PGM format and can be conveniently viewed using the 'xv'
program. The size of each image is 92x112, 8-bit grey levels. The images
are organised in 40 directories (one for each subject) named as:

		sX

where X indicates the subject number (between 1 and 40). In each directory
there are 10 different images of the selected subject named 

In [3]:
datasets.available_datasets()

['ball',
 'broken-swiss-roll',
 'coil-100',
 'coil-20',
 'difficult',
 'f-mnist',
 'frey-faces',
 'gaussian-blobs',
 'helix',
 'hiva',
 'lvq-pak',
 'mnist',
 'orl-faces',
 's-curve',
 'shuttle-statlog',
 'sphere',
 'swiss-roll',
 'twinpeaks']
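
The `ball` description above says the points come from rejection sampling on a cube. The following is a minimal sketch of that idea, not the project's `sample_ball` implementation: the function name `sample_unit_ball`, its parameters, and the choice of the enclosing cube `[-1, 1]^dim` as the proposal region are all assumptions made for illustration. It uses only the standard library and returns a list of lists rather than an array.

```python
import math
import random


def sample_unit_ball(n_points, dim=3, seed=None):
    """Draw n_points uniformly from the unit ball in `dim` dimensions.

    Hypothetical stand-in for illustration; not the project's sample_ball.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n_points:
        # Propose a point uniformly from the cube [-1, 1]^dim (assumed).
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Accept the proposal only if it lies inside the unit ball.
        if math.fsum(x * x for x in candidate) <= 1.0:
            points.append(candidate)
    return points


points = sample_unit_ball(1000, dim=3, seed=6502)
print(len(points), len(points[0]))  # prints: 1000 3
```

In three dimensions roughly 52% of proposals are accepted (the ratio of the ball's volume to the cube's, π/6). That ratio shrinks quickly as the dimension grows, so this approach is only practical in low dimensions.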
