# FuzzyART in Parts

In the other `FuzzyART` notebook, we implemented a `FuzzyART` module as a class with all of the requisite methods for running it as a standalone module.
Here, we will look a little further down the rabbit hole to understand the moving parts of `FuzzyART` in finer detail.
In programming terms, this resembles the "functional programming" style: we examine each moving part on its own, as a standalone function.

## Dependencies

First, we load all of the dependencies for the notebook:

In [1]:
# For manipulating local paths in an object-oriented way
from pathlib import Path
# Dataclass for a structured way of passing around a dataset
from dataclasses import dataclass

# The PyTorch library containing neural network utilities and the Tensor datatype
import torch
# A convenient import of Tensor so that we don't have to write torch.Tensor every time
from torch import Tensor
# Pandas for loading and manipulating data as a DataFrame
import pandas as pd
# Numpy for handling numpy arrays (i.e., matplotlib doesn't understand Tensor types, but it does know numpy.ndarray)
import numpy as np

# A sklearn utility for handling normalization of data automatically
from sklearn.preprocessing import MinMaxScaler
# From scikit-learn, for projecting the data to 2D for visualization.
# This is not how the data actually looks in 4D, but the best that we can do is to project it to 2D such that relative distances are mostly maintained.
from sklearn.manifold import TSNE
# For loading the iris dataset as an example
from sklearn.datasets import load_iris

# The most common way of importing matplotlib for plotting in Python
from matplotlib import pyplot as plt
# For manipulating axis tick locations
from matplotlib import ticker

## Data

Next, we load our dataset!
Here, we will use the UCI Iris dataset again as a relatively simple example.
This time, we will modularize the preprocessing code a little more!
First, we have the function to load the data:

In [2]:
def load_data() -> pd.DataFrame:
    # Load the iris dataset as a dictionary-like object holding pandas data
    iris = load_iris(as_frame=True)
    # Extract the DataFrame (features plus a 'target' column) that the loader provides
    data = iris['frame']
    # Return the DataFrame; the labels are separated later, during preprocessing
    return data

# Load the data
data = load_data()
# Print the first several rows to get an idea of what it looks like
data.head()

Then we have the function to preprocess the data.
In this example, we do this on the full batch rather than incrementally for the following reason:

`FuzzyART` uses complement coding, which maps $x \rightarrow [x, 1-x]$ and requires every element of the result to lie in $[0, 1]$.
For that to hold, the original $x$ must also be bounded inside $[0, 1]$.
Most real data is not neatly normalized and bounded, so we need to do it ourselves at some point.
However, that requires knowing the bounds of the full data in advance!
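
To make the bound requirement concrete, here is a minimal sketch of complement coding on a tiny hand-written batch. The helper name `complement_code` and the example values are our own assumptions for illustration; they are not part of the `FuzzyART` class from the other notebook.

```python
import torch
from torch import Tensor


def complement_code(x: Tensor) -> Tensor:
    """Map x -> [x, 1 - x] along the last (feature) dimension."""
    # Complement coding only makes sense if x already lies in [0, 1]
    if x.min() < 0 or x.max() > 1:
        raise ValueError("complement coding expects values in [0, 1]")
    return torch.cat((x, 1 - x), dim=-1)


# Two illustrative 2-feature samples that are already normalized
x = torch.tensor([[0.2, 0.7], [0.0, 1.0]])
x_cc = complement_code(x)

# Every complement-coded row sums to the number of original features
row_sums = x_cc.sum(dim=-1)
```

One consequence worth noticing is that each complement-coded sample has the same L1 norm (here, 2), regardless of the original values.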

Even though we are working with an incremental algorithm, we have the luxury of having all of the Iris dataset up-front, and the dataset surely isn't going to change any time soon.
This is not always the case, especially if you are dealing with streaming datasets, where you are provided samples one at a time!
In those cases, you have two options:

1. Know the statistics of the dataset in advance (e.g., the upper and lower bounds) and preprocess each sample incrementally off of that.
2. Use a normalization scheme that forces the data into $[0, 1]$, for example by passing it through a limiting function such as the sigmoid function
$\sigma(x) = \dfrac{1}{1+e^{-x}}$
or some other hard limiting function.
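
Both options can be sketched per-sample as follows. This is only an illustration under stated assumptions: the function names, the `center` and `spread` parameters, and the bound values are hypothetical and are not taken from the Iris data or from any `FuzzyART` implementation.

```python
import torch
from torch import Tensor


def scale_with_known_bounds(x: Tensor, lower: Tensor, upper: Tensor) -> Tensor:
    """Option 1: min-max scale one sample using bounds known in advance."""
    scaled = (x - lower) / (upper - lower)
    # Clamp in case a streamed sample falls outside the assumed bounds
    return scaled.clamp(0.0, 1.0)


def scale_with_sigmoid(x: Tensor, center: Tensor, spread: Tensor) -> Tensor:
    """Option 2: squash one sample into (0, 1) with the sigmoid function."""
    # Centering and rescaling first keeps typical values away from saturation
    return torch.sigmoid((x - center) / spread)


# Hypothetical per-feature bounds for a 2-feature stream (illustrative values)
lower = torch.tensor([4.0, 2.0])
upper = torch.tensor([8.0, 4.0])
# One incoming sample; its second feature exceeds the assumed upper bound
sample = torch.tensor([5.0, 4.5])

a = scale_with_known_bounds(sample, lower, upper)
b = scale_with_sigmoid(sample, center=(lower + upper) / 2, spread=(upper - lower) / 4)
```

The first option keeps the scaling linear but has to clamp anything outside the assumed bounds, while the sigmoid never leaves $(0, 1)$ at the cost of compressing values far from the center.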

In [None]:
@dataclass
class IrisData:
    x: Tensor
    y: pd.Series

def preprocess(
    data: pd.DataFrame,
    shuffle: bool = True,
    random_seed: int = 12345,
) -> IrisData:
    # Shuffle the data if requested
    if shuffle:
        # Seed numpy's global generator so that the shuffle is reproducible
        np.random.seed(random_seed)
        data = data.sample(frac=1).reset_index(drop=True)
    else:
        # Copy so that popping the labels below does not modify the caller's DataFrame
        data = data.copy()
    # Whether shuffled or not, separate the labels
    labels = data.pop('target')

    # Initialize the scaler and normalize each feature to [0, 1]
    scaler = MinMaxScaler()
    data = pd.DataFrame(scaler.fit_transform(data))

    # Complement code the data by pushing it into a float Tensor
    data_cc = torch.tensor(data.values, dtype=torch.float32)
    # and appending the vector [1-x] along the feature dimension
    data_cc = torch.cat((data_cc, 1 - data_cc), dim=1)
    # What we get is a tensor of 8-dimensional samples, one per row
    return IrisData(data_cc, labels)

data_cc = preprocess(data)
data_cc

IrisData(x=tensor([[0.3611, 0.2083, 0.4915,  ..., 0.7917, 0.5085, 0.5833],
        [0.0278, 0.5000, 0.0508,  ..., 0.5000, 0.9492, 0.9583],
        [0.5556, 0.5417, 0.6271,  ..., 0.4583, 0.3729, 0.3750],
        ...,
        [0.5278, 0.3333, 0.6441,  ..., 0.6667, 0.3559, 0.2917],
        [0.8056, 0.4167, 0.8136,  ..., 0.5833, 0.1864, 0.3750],
        [0.1111, 0.5000, 0.1017,  ..., 0.5000, 0.8983, 0.9583]]), y=0      1
1      0
2      1
3      0
4      0
      ..
145    0
146    2
147    2
148    2
149    0
Name: target, Length: 150, dtype: int64)

## TODO

This notebook is a work in progress!
If you see this, it means that there is more to come for this notebook.