# Homework 4 - Problem 1

**Part A [20 points]** *Note this is not a Collaborative Problem*

Using the Gaussian kernel, develop pseudocode to create a Parzen windowing system to
accomplish the following steps:

+ Develop the ability to read in data xn with n observations and D dimensions (number of features).
+ Develop the ability to randomly remove 20% of the observations per class and assign the observations as test data with the remaining 80% of the observations as training data.
+ Use the Gaussian kernel in Eq. 30 of the Machine Learning I document to develop an algorithm that processes an input observation and compares it with the training observations.
+ Expand the development to handle multiple classes.

**Part B [10 points]** *Note this is not a Collaborative Problem*
+ Calculate the running time of the system above in O-notation.
+ Calculate the total running time of the above system as T(n) with each line of pseudocode or code accounted for.
+ How does the total running time T(n) compare to the running time in O-notation?

**Part C [20 points]** *Note this is not a Collaborative Problem*
+ Using all observations and the petal length from the Iris data, replicate the subfigures in Figure 1.
+ Using all observations, the petal length, and the petal width from the Iris data, replicate the subfigures in Figure 2.

## Data Load

In [1]:
import sklearn as skl
from algorithms.iris.Reader import IrisReader
from algorithms.iris.IrisOps import IrisOps
from typing import Dict, Tuple, Callable, List
import numpy as np
from functools import reduce
import plotly.figure_factory as ff
import plotly.graph_objects as go

iris_reader: IrisReader = IrisReader()
iris_reader.load()
raw_data: Dict[str, np.array] = iris_reader.data

raw_data

{'setosa': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],


## Split into Test (20%) - Train (80%)

In [2]:
train, test = IrisOps.test_train_split(raw_data, ["setosa", "versicolor", "virginica"], train_prop = .8)

print(train[:5])
print(test[:5])

Label Mapping:  [(0, 'setosa'), (1, 'versicolor'), (2, 'virginica')]
[[1.  5.7 2.6 3.5 1. ]
 [0.  5.  3.2 1.2 0.2]
 [1.  7.  3.2 4.7 1.4]
 [1.  6.5 2.8 4.6 1.5]
 [2.  6.4 2.7 5.3 1.9]]
[[2.  6.9 3.1 5.1 2.3]
 [2.  6.5 3.  5.2 2. ]
 [1.  5.6 2.5 3.9 1.1]
 [0.  5.4 3.4 1.7 0.2]
 [2.  6.9 3.1 5.4 2.1]]
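
The source of `IrisOps.test_train_split` is not shown here, so the following is only a minimal sketch of a per-class (stratified) 80/20 split that produces the same layout as the output above: an integer class index in column 0 followed by the features. The function name `stratified_split`, the `seed` parameter, and the toy data are assumptions made for illustration, not the original implementation.

```python
from typing import Dict, List, Tuple
import numpy as np

def stratified_split(
    data: Dict[str, np.ndarray],
    labels: List[str],
    train_prop: float = 0.8,
    seed: int = 0,
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical stand-in for IrisOps.test_train_split (not the original code)."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for idx, lab in enumerate(labels):
        obs = data[lab]
        # Shuffle row indices within the class so the split is random per class.
        order = rng.permutation(len(obs))
        n_train = int(round(train_prop * len(obs)))
        # Prepend the integer class index as column 0.
        labelled = np.column_stack([np.full(len(obs), idx, dtype=float), obs])
        train_parts.append(labelled[order[:n_train]])
        test_parts.append(labelled[order[n_train:]])
    train = np.vstack(train_parts)
    test = np.vstack(test_parts)
    # Shuffle across classes so rows are not grouped by label.
    return train[rng.permutation(len(train))], test[rng.permutation(len(test))]

# Toy data: three classes, 10 observations each, 4 features.
toy = {name: np.random.default_rng(i).normal(i, 1.0, size=(10, 4))
       for i, name in enumerate(["a", "b", "c"])}
toy_train, toy_test = stratified_split(toy, ["a", "b", "c"], train_prop=0.8)
print(toy_train.shape, toy_test.shape)
```

Splitting inside each class, rather than over the pooled data, keeps the class proportions identical in the train and test sets.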


## Develop a Parzen Window Classifier with a Gaussian Kernel

In contrast to a $k$-Nearest Neighbor (KNN) classifier, which fixes the number of neighbors and lets the enclosing volume vary, a Parzen Window (PZ) classifier fixes the volume and lets the number of captured observations vary. The most straightforward implementation uses a fixed radius around each test observation and assigns the label held by the largest share of the training observations captured inside it. Here we instead use a Gaussian kernel, which weights every training observation by its distance from the test observation and in effect yields a likelihood-like score for each label. The procedure is as follows:

1. Normalize the data (z-score).
2. Split the train data into groups by class.
3. For each class of train observations:
    + Compare the test observation to each train observation in the class, using the Gaussian kernel.
    + Sum over the output of comparisons within the class to determine the score.
4. Assign the label of the class with the highest score for the test observation.
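
Step 1 can be sketched as follows. This is a minimal illustration, assuming the label sits in column 0 (as in the split output) and assuming the test set is scaled with the mean and standard deviation estimated from the training set; the helper names `zscore_fit` and `zscore_apply` are hypothetical.

```python
import numpy as np

def zscore_fit(train: np.ndarray):
    """Column 0 is the label; compute per-feature mean and std from the rest."""
    feats = train[:, 1:]
    return feats.mean(axis=0), feats.std(axis=0)

def zscore_apply(data: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Standardize the feature columns and leave the label column untouched."""
    out = data.astype(float).copy()
    out[:, 1:] = (out[:, 1:] - mean) / std
    return out

# Toy example: two features on very different scales.
train = np.array([[0., 1., 10.], [0., 2., 20.], [1., 3., 30.], [1., 4., 40.]])
test = np.array([[1., 2.5, 25.]])
mean, std = zscore_fit(train)
train_z = zscore_apply(train, mean, std)
test_z = zscore_apply(test, mean, std)
print(test_z)
```

Because the kernel uses a single spread for all features, putting the features on a common scale keeps any one of them from dominating the distance.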

The heart of this approach is the kernel:

```python
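def gaussian_kernel(obs: np.array, data: np.array, spread: float) -> np.array:
    """Evaluate the Gaussian kernel between every row of `obs` and every row of `data`.

    Both inputs are 2-D, with one observation per row and one feature per column.
    Returns an array of shape (n_obs, n_data).
    """
    obs_rows, _ = obs.shape
    data_rows, data_cols = data.shape
    out: np.array = np.zeros((obs_rows, data_rows))

    def g(x_0: np.array, x_n: np.array) -> float:
        # Normalizing constant of an isotropic D-dimensional Gaussian (D = data_cols).
        normalization: float = 1 / ((np.sqrt(2*np.pi)*spread)**data_cols)
        # Squared Euclidean distance between the two observations.
        distance: float = (x_0-x_n).dot((x_0-x_n))
        exponential: float = np.exp((-0.5/spread**2) * distance)
        return normalization * exponential

    for i, o in enumerate(obs):
        for j, d in enumerate(data):
            out[i, j] = g(o, d)

    return out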
```
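
Written out, the quantity computed by `g` and the per-class score from the steps above are as follows, with $h$ denoting `spread`, $D$ the number of features, and $\mathcal{C}_c$ the set of training observations in class $c$. These symbols are notation chosen here, and the formula is transcribed from the code, so its agreement with Eq. 30 is assumed rather than checked:

$$
k(\mathbf{x}, \mathbf{x}_n) = \frac{1}{(\sqrt{2\pi}\,h)^{D}} \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^{2}}{2h^{2}}\right), \qquad
s_c(\mathbf{x}) = \sum_{n \in \mathcal{C}_c} k(\mathbf{x}, \mathbf{x}_n), \qquad
\hat{y}(\mathbf{x}) = \arg\max_{c}\, s_c(\mathbf{x}).
$$

Since $s_c$ is not divided by the class size, it equals the class-conditional density estimate multiplied by the number of training observations in the class, which acts like an implicit class prior.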

The nice thing about this approach is that it already accommodates multiple classes, since we are just taking the maximum score.

In [227]:
class Parzen:
    
    def __init__(
        self, 
        data: Dict[str, np.array], 
        spread: float, 
        train_prop: float = 0.8, 
        labels: List[str] = ["setosa", "versicolor", "virginica"]
    ) -> None:
        self.data = data
        self.spread = spread
        self.train_prop = train_prop
        self.labels = labels
        self.train, self.test = self.split()
        
    def split(self) -> Tuple[np.array, np.array]:
        return IrisOps.test_train_split(self.data, self.labels, self.train_prop)
        
    def preprocess(self, data: np.array) -> np.array:
        # Column 0 holds the class label; slice with [0] to keep it 2-D for concatenation.
        y: np.array = data[:, [0]]
        x_in: np.array = data[:, 1:]
        # z-score each feature column.
        X: np.array = (x_in - x_in.mean(axis=0)) / x_in.std(axis=0)
        return np.concatenate([y, X], axis=1)
    
    @staticmethod
    def gaussian_kernel(obs: np.array, data: np.array, spread: float) -> np.array:
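        # One observation per row, one feature per column, for both inputs.
        obs_rows, obs_cols = obs.shape
        data_rows, data_cols = data.shape
        # One output row per observation in `obs`, one column per row of `data`.
        out: np.array = np.zeros((obs_rows, data_rows))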
    
        def g(x_0: np.array, x_n: np.array) -> float:
            normalization: float = 1 / ((np.sqrt(2*np.pi)*spread)**data_cols)
            distance: float = (x_0-x_n).dot((x_0-x_n))
            exponential: float = np.exp((-0.5/spread**2) * distance)
            return normalization * exponential

        for i, o in enumerate(obs):
            for j, d in enumerate(data):
                out[i, j] = g(o, d)
            

        return out
    
    @staticmethod
    def score_class(test_obs: np.array, train: np.array, class_idx: int, spread: float) -> float:
        tobs_2d: np.array = test_obs.reshape(1,-1)[:, 1:]
        class_train: np.array = train[train[:, 0] == class_idx]
        kernel_mat: np.array = Parzen.gaussian_kernel(tobs_2d, class_train[:, 1:], spread)
        return kernel_mat.sum()
    
    @staticmethod
    def label_obs(test_obs: np.array, train: np.array, spread: float) -> int:
        labels: np.array = np.unique(train[:, 0])
        scores: List[Tuple[int, np.array]] = list(map(
            lambda class_idx: (class_idx, Parzen.score_class(test_obs, train, class_idx, spread)),
            labels
        ))
        max_score: Tuple[int, np.array] = reduce(lambda f, s: f if f[1] >= s[1] else s, scores)
        return max_score[0]
    
    def fit(self) -> np.array:
        truth: np.array = self.test[:, 0].reshape(len(self.test), 1)
        pred: np.array = np.array(list(
            map(lambda obs: Parzen.label_obs(obs, self.train, self.spread), self.test)
        )).reshape(len(self.test), 1)
        return np.concatenate([truth, pred], axis=1)
        
    @staticmethod
    def accuracy(label_pairs: np.array) -> float:
        matches: int = len(label_pairs[label_pairs[:, 0] == label_pairs[:, 1]])
        total: int = len(label_pairs)
        return matches / total
    
    @staticmethod
    def plot1D(
        raw_data: Dict[str,np.array], 
        support: np.array, 
        feature: int, 
        spread: float,
        labels: List[str],
        color_map: Dict[str, str]
    ) -> go.Figure:
        
        input_data: List[np.array] = [raw_data[lab][:,feature] for lab in labels]
        
        pgk: Callable[[float], float] = lambda obs: Parzen.gaussian_kernel(
            np.array([obs]).reshape(1,1), 
            support.reshape(len(support), 1), 
            spread
        )
        def class_density(feature_arr: np.array) -> np.array:
            arrs: List[np.array] = list(map(lambda obs: pgk(obs), feature_arr))
            sum_array: np.array = reduce(lambda f, s: f + s, arrs)
            return sum_array / len(feature_arr)
    
        hist_data: List[np.array] = [class_density(arr)[0] for arr in input_data]
        colors: List[str] = [color_map[lab] for lab in labels]
        
        fig: go.Figure = go.Figure()
        for i, lab in enumerate(labels):
            fig.add_trace(go.Scatter(
                x = support,
                y = hist_data[i],
                line=dict(color=colors[i]),
                name=f"{lab} density"
            ))
            fig.add_trace(go.Scatter(
                x = input_data[i],
                y = np.random.uniform(-0.05, 0.05, size = len(input_data[i])),
                mode = "markers",
                opacity = 0.7,
                marker = dict(
                    color="#ffffff", 
                    line = dict(color=colors[i], width=1)
                ),
                name = f"{lab} data"
            ))
        fig.update_layout(template="plotly_white")
        return fig
    
    @staticmethod
    def plot2D(
        raw_data: Dict[str,np.array], 
        support: np.array, 
        features: Tuple[int, int], 
        spread: float,
        labels: List[str],
        color_map: Dict[str, str]
    ) -> go.Figure:
        
        # Select the two requested feature columns (they need not be adjacent).
        input_data: List[np.array] = [raw_data[lab][:, list(features)] for lab in labels]

        # Build a square evaluation grid from the 1-D support; this assumes the
        # same support range is appropriate for both features.
        grid_x, grid_y = np.meshgrid(support, support)
        grid: np.array = np.column_stack([grid_x.ravel(), grid_y.ravel()])

        pgk: Callable[[np.array], np.array] = lambda obs: Parzen.gaussian_kernel(
            obs.reshape(1, -1), 
            grid, 
            spread
        )
        def class_density(feature_arr: np.array) -> np.array:
            # Average the kernel over the class observations at every grid point.
            arrs: List[np.array] = list(map(lambda obs: pgk(obs), feature_arr))
            sum_array: np.array = reduce(lambda f, s: f + s, arrs)
            return sum_array / len(feature_arr)
    
        hist_data: List[np.array] = [
            class_density(arr)[0].reshape(grid_x.shape) for arr in input_data
        ]
        colors: List[str] = [color_map[lab] for lab in labels]
        
        fig: go.Figure = go.Figure()
        for i, lab in enumerate(labels):
            # Density contours, drawn as lines in the class color.
            fig.add_trace(go.Contour(
                x = support,
                y = support,
                z = hist_data[i],
                contours = dict(coloring="lines"),
                colorscale = [[0, colors[i]], [1, colors[i]]],
                showscale = False,
                name = f"{lab} density"
            ))
            # Raw observations in the plane of the two features.
            fig.add_trace(go.Scatter(
                x = input_data[i][:, 0],
                y = input_data[i][:, 1],
                mode = "markers",
                opacity = 0.7,
                marker = dict(
                    color="#ffffff", 
                    line = dict(color=colors[i], width=1)
                ),
                name = f"{lab} data"
            ))
        fig.update_layout(template="plotly_white")
        return fig
    
    
p = Parzen(raw_data, 0.3)
out = p.fit()
Parzen.accuracy(out)

Label Mapping:  [(0, 'setosa'), (1, 'versicolor'), (2, 'virginica')]


1.0
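
A single accuracy figure hides which classes get confused with one another. As a small sketch, the `[truth, pred]` pairs returned by `fit` can be tabulated into a confusion matrix; the helper `confusion_matrix` and the example pairs below are hypothetical and are not output from this notebook.

```python
import numpy as np

def confusion_matrix(label_pairs: np.ndarray, n_classes: int) -> np.ndarray:
    """Rows index the true class, columns the predicted class."""
    mat = np.zeros((n_classes, n_classes), dtype=int)
    for truth, pred in label_pairs.astype(int):
        mat[truth, pred] += 1
    return mat

# Made-up pairs: one class-1 observation is mislabelled as class 2.
pairs = np.array([[0., 0.], [1., 1.], [1., 2.], [2., 2.]])
cm = confusion_matrix(pairs, 3)
print(cm)
print(np.trace(cm) / cm.sum())  # accuracy is the diagonal share: 0.75
```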

In [242]:
np.arange(20).reshape(5,4)[:, 2:4]

array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15],
       [18, 19]])

In [232]:
species = ["setosa", "versicolor", "virginica"]
color_map: Dict[str, str] = dict(
    setosa = "#e41a1c",
    versicolor = "#377eb8",
    virginica = "#4daf4a"
) 
f10 = Parzen.plot1D(raw_data, np.linspace(-1, 3, 100), 3, 0.1, species, color_map)
f10.update_layout(title = "Petal Width Distribution By Class (h = 0.1)")
f10.show()

In [233]:
f25 = Parzen.plot1D(raw_data, np.linspace(-1, 3, 100), 3, 0.25, species, color_map)
f25.update_layout(title = "Petal Width Distribution By Class (h = 0.25)")
f25.show()

In [234]:
f50 = Parzen.plot1D(raw_data, np.linspace(-1, 3, 100), 3, 0.50, species, color_map)
f50.update_layout(title = "Petal Width Distribution By Class (h = 0.50)")
f50.show()

In [19]:
gs = list(map(lambda obs: gaussian_kernel(obs.reshape(1, 4), data, 0.3), data))
reduce(lambda f, s: f + s, gs)

array([[2.77880369e-008, 2.95838882e-103, 0.00000000e+000,
        0.00000000e+000],
       [1.57678845e-053, 1.38166182e-010, 8.07278475e-083,
        0.00000000e+000],
       [0.00000000e+000, 6.92838206e-293, 5.55780721e+003,
        1.95955041e-064],
       [0.00000000e+000, 0.00000000e+000, 9.66093687e-088,
        1.44815888e+006]])

Realization: there is no training step. Just like KNN, the stored training data is the estimator. For each test observation, evaluate the Gaussian kernel against every training observation of a given species; the resulting values act as unnormalized likelihood contributions. Sum those values to get the score for that species, then choose the species with the largest score.
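
The same idea can be sketched in vectorized form with NumPy broadcasting in place of the nested Python loops. This is an illustrative alternative, not the `Parzen` class above: the function names `parzen_scores` and `parzen_predict` are hypothetical, and features and labels are assumed to be passed as separate arrays. The work is still one kernel evaluation per (test, train) pair, each costing time proportional to the number of features, and the intermediate `diff` array uses memory of the same order.

```python
import numpy as np

def parzen_scores(test_x: np.ndarray, train_x: np.ndarray, train_y: np.ndarray,
                  spread: float) -> np.ndarray:
    """Return an (n_test, n_classes) matrix of summed Gaussian-kernel scores."""
    n_features = train_x.shape[1]
    # Squared Euclidean distance between every test and train row: (n_test, n_train).
    diff = test_x[:, None, :] - train_x[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    norm = 1.0 / (np.sqrt(2 * np.pi) * spread) ** n_features
    kernel = norm * np.exp(-0.5 * sq_dist / spread ** 2)
    classes = np.unique(train_y)
    # Sum kernel values over the training rows belonging to each class.
    return np.stack([kernel[:, train_y == c].sum(axis=1) for c in classes], axis=1)

def parzen_predict(test_x: np.ndarray, train_x: np.ndarray, train_y: np.ndarray,
                   spread: float) -> np.ndarray:
    """Assign each test row the class with the largest summed kernel score."""
    classes = np.unique(train_y)
    scores = parzen_scores(test_x, train_x, train_y, spread)
    return classes[np.argmax(scores, axis=1)]

# Two well-separated toy clusters.
train_x = np.array([[0.0, 0.0], [0.1, -0.1], [3.0, 3.0], [2.9, 3.1]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[0.05, 0.0], [3.0, 2.9]])
print(parzen_predict(test_x, train_x, train_y, spread=0.3))
```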