# Homework 4 - Problem 1

**Part A [20 points]** *Note this is not a Collaborative Problem*

Using the Gaussian kernel, develop pseudocode to create a Parzen windowing system to
accomplish the following steps:

+ Develop the ability to read in data xn with n observations and D dimensions (number of features).
+ Develop the ability to randomly remove 20% of the observations per class and assign the observations as test data with the remaining 80% of the observations as training data.
+ Using the Gaussian kernel in Eq. 30 of the Machine Learning I document, develop an algorithm to process an input observation and compare it with the training observations.
+ Expand the development to handle multiple classes.

**Part B [10 points]** *Note this is not a Collaborative Problem*
+ Calculate the running time of the system above in O-notation.
+ Calculate the total running time of the above system as T(n) with each line of pseudocode or code accounted for.
+ How does the total running time T(n) compare to the running time in O-notation?

**Part C [20 points]** *Note this is not a Collaborative Problem*
+ Using all observations and the petal length from the Iris data replicate the subfigures in Figure 1.
+ Using all observations, the petal length and the petal width from the Iris data replicate the subfigures in Figure 2.

## Data Load

In every operation that reads Iris data for this course, I have used the `load()` method of the `IrisReader` class in my [`Reader`](https://github.com/choct155/en685_621/blob/master/algorithms/iris/Reader.py) module. The method relies on the fact that scikit-learn already ships the Iris dataset, so for our purposes the load occurs in constant time. The remaining work simply involves splitting the data for downstream use.

```python
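import numpy as np
from sklearn import datasets


class IrisReader:

    def __init__(self):
        super().__init__()

    def load(self):
        # load_iris() returns a dict-like Bunch; its "data" entry is a
        # (150, 4) array with 50 consecutive rows per species.
        iris_in = datasets.load_iris()
        self.data = dict(
            setosa = iris_in["data"][:50],
            versicolor = iris_in["data"][50:100],
            virginica = iris_in["data"][100:]
        )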
```

In effect, this function consists of two constant time operations for our purposes (the load and the three fixed-position slices), yielding a total cost of $2T(1)$ and $O(1)$ asymptotics.

In [8]:
import sklearn as skl
from algorithms.iris.Reader import IrisReader
from algorithms.iris.IrisOps import IrisOps
from algorithms.classifiers.neighbors import Parzen
from typing import Dict, Tuple, Callable, List
import numpy as np
from functools import reduce
import plotly.figure_factory as ff
import plotly.graph_objects as go

iris_reader: IrisReader = IrisReader()
iris_reader.load()
raw_data: Dict[str, np.array] = iris_reader.data

raw_data

{'setosa': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],


## Split into Test (20%) - Train (80%)

While **`scikit-learn`** does have its own train-test-split function, the shape of my data after the load does not play nicely with it. For this reason, I have another `test_train_split()` function defined within my [`IrisOps`](https://github.com/choct155/en685_621/blob/master/algorithms/iris/IrisOps.py) class. It combines the output of `IrisReader.load()` across classes, randomly permutes the data, and then yields a tuple containing the train and test sets, respectively. It also labels the data with an index value corresponding to each class.

```python
def test_train_split(raw_data: Dict[str, np.array], labels: List[str], train_prop: float) -> Tuple[np.array, np.array]:
    
    def process_label_group(data: np.array, idx: int) -> np.array:
        n: int = len(data)
        lab_idx: np.array = np.repeat(idx, n).reshape(n, 1)
        return np.concatenate([lab_idx, data], axis=1)

    print("Label Mapping: ", list(enumerate(labels)))
    data: np.array = np.concatenate(list(
        map(lambda lab: process_label_group(raw_data[lab[1]], lab[0]), enumerate(labels))
    ))
    permuted: np.array = np.random.permutation(data)

    train_n: int = int(len(permuted) * train_prop)
    return (permuted[:train_n], permuted[train_n:])
```

Assigning a label (see the helper `process_label_group()`) involves an assignment ($T(1)$), the instantiation of an array (\~ $T(n)$), and the columnwise concatenation of two equivalently long arrays (\~$T(n)$). We can ignore the reshaping of the array, which is effectively a metadata exercise. The helper function has a total cost of $2T(n) + 1$ if $n$ denotes the size of the array that is labeled. 

However, we should note that `process_label_group()` only operates on a third of the data at a time in the Iris case, but since it still must process all three groups, the cost should hold. Once all three species have been labeled, they must be concatenated. When I have written concatenation operations in the past, I have tended to use a fold with the first array as the starting value, appending additional values with each subsequent recursive call. So, a cost of $T(\frac{2n}{3})$ seems fitting here.

The next step is a permutation, which I will assume involves sampling (indices) without replacement and then a sort. Let us assume we can deplete a collection in linear time and assume a sort reminiscent of merge-sort ($2T(\frac{n}{2}) + O(n)$) for a total run time of $2T(\frac{n}{2}) + 2T(n)$. 
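
For comparison with the sort-based assumption above, a permutation can also be produced in linear time with a Fisher-Yates shuffle. The sketch below is illustrative only: `fisher_yates` is a hypothetical helper, not part of `IrisOps`, and it performs $n - 1$ swaps with no sort.

```python
import numpy as np


def fisher_yates(data: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a row-permuted copy of `data` using n - 1 swaps (linear time)."""
    out = data.copy()
    n = len(out)
    for i in range(n - 1, 0, -1):
        j = rng.integers(0, i + 1)  # uniform index in [0, i]
        out[[i, j]] = out[[j, i]]   # swap rows i and j
    return out


rng = np.random.default_rng(0)
rows = np.arange(10).reshape(5, 2)
shuffled = fisher_yates(rows, rng)
```

If the permutation step behaves like this, its cost is $T(n)$ and the sort term drops out of the analysis.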

Since we are dealing with arrays (which we will assume aren't Lists, or Vectors, or something under the hood), we should have constant time indexing which I suspect supports constant time splits. So, our last operation, splitting the data into the test and train sets adds $T(1)$.

Altogether, the helper function, concatenation, permutation, and split have the following cost:

\begin{align}
    T(n) &= 2T(n) + 1 + T(\frac{2n}{3}) + 2T(\frac{n}{2}) + 2T(n) + 1 \\
    &= 4T(n) + T(\frac{2n}{3}) + 2T(\frac{n}{2}) + 2
\end{align}

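Since none of these terms interact with each other in a nested loop, the linear terms contribute only $O(n)$. Under the merge-sort assumption, however, the $2T(\frac{n}{2}) + O(n)$ sort resolves to $O(n\text{log}n)$, which dominates, so the split as analyzed here is $O(n\text{log}n)$. It would only be capped at $O(n)$ if the permutation were a linear-time shuffle with no sort.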
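
One caveat on the split itself: permuting the pooled data gives an 80/20 split overall, but not necessarily within each class. A minimal sketch of a per-class split follows, assuming the same dict-of-arrays input. `stratified_split` and the synthetic `demo` data are hypothetical and are not part of `IrisOps`.

```python
from typing import Dict, List, Tuple

import numpy as np


def stratified_split(
    raw_data: Dict[str, np.ndarray],
    labels: List[str],
    train_prop: float,
    rng: np.random.Generator,
) -> Tuple[np.ndarray, np.ndarray]:
    """Split each class separately so every class keeps the same train share."""
    train_parts, test_parts = [], []
    for idx, lab in enumerate(labels):
        block = raw_data[lab]
        # Prepend the class index as the first column, as test_train_split() does.
        labeled = np.column_stack([np.full(len(block), idx), block])
        permuted = rng.permutation(labeled)  # shuffles rows only
        cut = int(round(len(permuted) * train_prop))
        train_parts.append(permuted[:cut])
        test_parts.append(permuted[cut:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


# Synthetic stand-in for the Iris dict: 3 classes, 50 rows, 4 features each.
rng = np.random.default_rng(0)
demo = {name: rng.normal(size=(50, 4)) for name in ["a", "b", "c"]}
train, test = stratified_split(demo, ["a", "b", "c"], 0.8, rng)
```

The rows come back grouped by class, which does not matter for the classifier here because each test observation is scored independently.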

In [9]:
train, test = IrisOps.test_train_split(raw_data, ["setosa", "versicolor", "virginica"], train_prop = .8)

print(train[:5])
print(test[:5])

Label Mapping:  [(0, 'setosa'), (1, 'versicolor'), (2, 'virginica')]
[[1.  5.4 3.  4.5 1.5]
 [2.  6.2 2.8 4.8 1.8]
 [0.  5.1 3.8 1.6 0.2]
 [1.  6.2 2.9 4.3 1.3]
 [1.  5.  2.  3.5 1. ]]
[[2.  6.4 3.1 5.5 1.8]
 [2.  7.9 3.8 6.4 2. ]
 [0.  5.  3.4 1.6 0.4]
 [0.  4.8 3.  1.4 0.3]
 [1.  6.1 3.  4.6 1.4]]


## Develop a Parzen Window Classifier with a Gaussian Kernel

As opposed to a $k$-Nearest Neighbor (KNN) algorithm, which fixes the number of neighbors and lets the enclosing volume vary, Parzen Window (PZ) classifiers fix the volume and let the number of captured neighbors vary. The most straightforward implementation uses a fixed radius around each test observation and assigns the label that is most common among the captured training observations. In this case, we will use a Gaussian kernel, which in effect provides a version of a likelihood for each label. The procedure will be as follows:

1. Split the train data into groups by class.
2. For each class of train observations:
    + Compare the test observation to each train observation in the class, using the Gaussian kernel.
    + Sum over the output of comparisons within the class to determine the score.
3. Assign the label of the class with the highest score for the test observation.
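
Written out, the steps above amount to the following. The kernel is read off the implementation in this notebook, and I am assuming it matches Eq. 30 of the course document, which is not reproduced here. The symbols $h$ (bandwidth, called `spread` in the code), $D$ (number of features), $s_c$ (class score), and $\hat{y}$ (assigned label) are notation introduced for this summary.

\begin{align}
    g(x_0, x_n) &= \frac{1}{(\sqrt{2\pi}\,h)^{D}} \exp\left(-\frac{\lVert x_0 - x_n \rVert^2}{2h^2}\right) \\
    s_c(x_0) &= \sum_{x_n \in \text{train}_c} g(x_0, x_n) \\
    \hat{y}(x_0) &= \arg\max_c \; s_c(x_0)
\end{align}

Because $s_c$ is a sum and not an average, a class with more training rows receives a proportionally larger score, which acts like weighting each class density by its share of the training data.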

The estimator used in this notebook is provided by the `Parzen` class in my [`neighbors`](https://github.com/choct155/en685_621/blob/master/algorithms/classifiers/neighbors.py) module. The heart of this approach is the kernel:

```python
def gaussian_kernel(obs: np.array, data: np.array, spread: float) -> np.array:
    obs_rows, obs_cols = obs.shape
    data_rows, data_cols = data.shape
    out: np.array = np.zeros((obs_rows, data_rows))

    def g(x_0: np.array, x_n: np.array) -> float:
        normalization: float = 1 / ((np.sqrt(2*np.pi)*spread)**data_cols)
        distance: float = (x_0-x_n).dot((x_0-x_n).T)
        exponential: float = np.exp((-0.5/spread**2) * distance)
        return normalization * exponential

    for i, o in enumerate(obs):
        for j, d in enumerate(data):
            out[i, j] = g(o, d)

    return out
```

While this implementation is generally deployed in a way that takes one observation and compares it to an entire array of training data, the function is capable of ingesting an array of test data as well. Assuming a test size of $m$ and a train size of $n$, the population of `out` will take $mn$ assignments. Assuming constant time for the assignments at the beginning of the function, the running time will be $T(mn) + 3$ with an asymptotic cost of $O(mn)$.
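
The same $m \times n$ matrix can be filled without Python-level loops. The sketch below is one possible vectorized form under the same formula. `gaussian_kernel_vec` is a hypothetical name, not a method of `Parzen`, and the random inputs are for illustration only. The amount of arithmetic is unchanged, so the $O(mn)$ count of kernel evaluations still applies.

```python
import numpy as np


def gaussian_kernel_vec(obs: np.ndarray, data: np.ndarray, spread: float) -> np.ndarray:
    """Return an (m, n) matrix of Gaussian kernel values via broadcasting."""
    dims = data.shape[1]
    diff = obs[:, None, :] - data[None, :, :]        # shape (m, n, D)
    sq_dist = np.einsum("mnd,mnd->mn", diff, diff)    # squared distances, (m, n)
    norm = 1.0 / ((np.sqrt(2 * np.pi) * spread) ** dims)
    return norm * np.exp(-0.5 * sq_dist / spread**2)


rng = np.random.default_rng(0)
obs = rng.normal(size=(3, 4))    # m = 3 test rows, D = 4 features
data = rng.normal(size=(5, 4))   # n = 5 train rows
kernel = gaussian_kernel_vec(obs, data, 1.0)
```

The intermediate `diff` array holds $m \cdot n \cdot D$ values, so this trades memory for speed.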

The nice thing about this approach is that it already accommodates multiple classes, since we are just taking the maximum score. The following three functions leverage the kernel to execute labeling of test data:

```python
@staticmethod
def score_class(test_obs: np.array, train: np.array, class_idx: int, spread: float) -> float:
    tobs_2d: np.array = test_obs.reshape(1,-1)[:, 1:]
    class_train: np.array = train[train[:, 0] == class_idx]
    kernel_mat: np.array = Parzen.gaussian_kernel(tobs_2d, class_train[:, 1:], spread)
    return kernel_mat.sum()
```

`score_class()` assigns a kernel-based score for a given observation to one of the classes in the train data. It involves a filter, which requires a scan of the train array $T(n)$ and then applies the kernel function on a single test observation and the filtered train array. Since we have constrained the inputs to the kernel, our cost is reduced to $T(1 \cdot \frac{n}{3}) + 3$. The total cost of `score_class()` is $T(n) + T(\frac{n}{3}) + 3$.

```python
@staticmethod
def label_obs(test_obs: np.array, train: np.array, spread: float) -> int:
    labels: np.array = np.unique(train[:, 0])
    scores: List[Tuple[int, np.array]] = list(map(
        lambda class_idx: (class_idx, Parzen.score_class(test_obs, train, class_idx, spread)),
        labels
    ))
    max_score: Tuple[int, np.array] = reduce(lambda f, s: f if f[1] >= s[1] else s, scores)
    return max_score[0]
```

`label_obs()` identifies all class labels, which effectively means building a set from an array. I have previously developed a [`BinaryTree`](https://github.com/choct155/en685_621/blob/master/algorithms/data_structures/BinaryTree.py) data structure which I think I convinced myself could insert $n$ items in $T(n\text{log}n)$, so I'll lean on that here for the call to `np.unique()`. The method also scores all classes, so in the Iris case we should triple the cost of `score_class()` to get $3T(n) + 3T(\frac{n}{3}) + 9$. Finally, the method returns the maximum score by way of reduction, which relies on $p-1$ comparisons given $p$ classes. The total running cost is $T(n\text{log}n) + 3T(n) + 3T(\frac{n}{3}) + 9 + T(p-1)$.
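
A small illustration of the two selection steps in `label_obs()`: `np.unique()` returns the sorted unique values, and the reduction keeps the earlier entry on ties, which is the same tie-breaking rule as `np.argmax()`. The label and score values below are made up for the example.

```python
from functools import reduce

import numpy as np

train_labels = np.array([2.0, 0.0, 1.0, 2.0, 0.0, 1.0])
labels = np.unique(train_labels)       # sorted unique values

scores = np.array([0.4, 1.7, 1.7])     # made-up scores, one per label
# Same reduction as label_obs(): keep the first entry unless a later one is larger.
by_reduce = reduce(lambda f, s: f if f[1] >= s[1] else s, zip(labels, scores))[0]
by_argmax = labels[np.argmax(scores)]  # first maximum wins on ties
```

Both selections pick label `1.0` here, the first of the two tied classes.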

```python
def fit(self) -> np.array:
    truth: np.array = self.test[:, 0].reshape(len(self.test), 1)
    pred: np.array = np.array(list(
        map(lambda obs: Parzen.label_obs(obs, self.train, self.spread), self.test)
    )).reshape(len(self.test), 1)
    return np.concatenate([truth, pred], axis=1)
```

`fit()` is where the real cost shows up, insofar as it calls `label_obs()` on all $m$ observations in the test data. Consequently, all costs in `label_obs()` are scaled by $m$: $T(mn\text{log}n) + 3T(mn) + 3T(m\frac{n}{3}) + T(m(p-1)) + 9T(m)$. **The total asymptotic cost associated with this running time is $O(mn\text{log}n)$.**
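
To make the $mn$ scaling concrete, the sketch below counts kernel evaluations for the 120/30 split used here. `count_kernel_evals` is a hypothetical helper, and the label counts are assumed for illustration. A pooled random split will not generally give exactly 40 rows per class, but the class groups always sum to $n$, so the total is still $m \cdot n$.

```python
import numpy as np


def count_kernel_evals(train_labels: np.ndarray, m: int) -> int:
    """Count kernel evaluations for m test rows, one pass per class group."""
    total = 0
    for _ in range(m):
        for c in np.unique(train_labels):
            total += int((train_labels == c).sum())  # kernel calls for class c
    return total


train_labels = np.repeat([0, 1, 2], 40)   # assumed: n = 120, 40 rows per class
evals = count_kernel_evals(train_labels, m=30)
```

With $m = 30$ and $n = 120$, this comes to $30 \cdot 120 = 3600$ kernel evaluations.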

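The total running time $T(n)$ keeps every constant and lower-order term (for example $3T(mn)$, $T(m(p-1))$, and $9T(m)$), whereas the $O$-notation bound retains only the dominant $mn\text{log}n$ term. The two therefore grow at the same rate, with the total running time exceeding the dominant term by amounts that become negligible as $n$ grows.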

In [10]:
p = Parzen(raw_data, 0.3)
out = p.fit()
print("Results:\n", out)
print("Predictive Accuracy: ", Parzen.accuracy(out))

Label Mapping:  [(0, 'setosa'), (1, 'versicolor'), (2, 'virginica')]
Results:
 [[0. 0.]
 [2. 2.]
 [0. 0.]
 [2. 2.]
 [1. 1.]
 [1. 1.]
 [2. 2.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [0. 0.]
 [0. 0.]
 [2. 2.]
 [0. 0.]
 [0. 0.]
 [2. 2.]
 [1. 1.]
 [0. 0.]
 [0. 0.]
 [2. 2.]
 [2. 2.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [1. 1.]
 [2. 2.]
 [0. 0.]]
Predictive Accuracy:  1.0
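
`Parzen.accuracy()` is not shown in this notebook. A minimal sketch of what such a function could compute, assuming the two-column (truth, prediction) layout returned by `fit()`, is below. The `accuracy` function and the `demo` array are hypothetical.

```python
import numpy as np


def accuracy(results: np.ndarray) -> float:
    """Share of rows where column 0 (truth) equals column 1 (prediction)."""
    return float(np.mean(results[:, 0] == results[:, 1]))


demo = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 2.0], [1.0, 1.0]])
acc = accuracy(demo)   # 3 of 4 rows match
```

With only 30 test observations, an accuracy from a single random split is a noisy estimate, so repeating the split would give a better sense of performance.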


## Plot Univariate Distributions at Different Bandwidths

In [11]:
species = ["setosa", "versicolor", "virginica"]
color_map: Dict[str, str] = dict(
    setosa = "#e41a1c",
    versicolor = "#377eb8",
    virginica = "#4daf4a"
) 
f10 = Parzen.plot1D(raw_data, np.linspace(-1, 3, 100), 3, 0.1, species, color_map)
f10.update_layout(title = "Petal Width Distribution By Class (h = 0.1)")
f10.show()

In [12]:
f25 = Parzen.plot1D(raw_data, np.linspace(-1, 3, 100), 3, 0.25, species, color_map)
f25.update_layout(title = "Petal Width Distribution By Class (h = 0.25)")
f25.show()

In [13]:
f50 = Parzen.plot1D(raw_data, np.linspace(-1, 3, 100), 3, 0.50, species, color_map)
f50.update_layout(title = "Petal Width Distribution By Class (h = 0.50)")
f50.show()
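
`Parzen.plot1D()` is not reproduced here. The sketch below shows the kind of curve such a plot presumably draws for one class: the average of Gaussian kernels centered on the sample, evaluated on a grid. `parzen_density_1d` is a hypothetical name and the `sample` values are illustrative, not taken from the Iris data.

```python
import numpy as np


def parzen_density_1d(grid: np.ndarray, sample: np.ndarray, h: float) -> np.ndarray:
    """Average Gaussian kernels centered on `sample`, evaluated on `grid`."""
    diff = grid[:, None] - sample[None, :]                     # shape (G, N)
    k = np.exp(-0.5 * (diff / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    return k.mean(axis=1)


grid = np.linspace(-1, 3, 400)
sample = np.array([0.2, 0.2, 0.3, 0.4, 0.1])   # illustrative values only
densities = {h: parzen_density_1d(grid, sample, h) for h in (0.1, 0.25, 0.5)}
# Riemann-sum check that each curve integrates to roughly 1 over the grid.
areas = {h: float(np.sum(d) * (grid[1] - grid[0])) for h, d in densities.items()}
```

Smaller bandwidths give taller, narrower peaks around the sample points, and larger bandwidths smooth them out.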

## Plot Bivariate Distributions at Different Bandwidths

I'm afraid I wrestled with this one a good deal and could not find a way to get plotly to allow me to plot multiple kernel density plots with differing bandwidths at the same time. I should have used another library.

In [14]:
a10 = Parzen.plot2D(raw_data, np.linspace(-1, 8, 50),  np.linspace(-2, 6, 50), [2,3], 0.50, species, color_map)
a10.show()