# RBF Networks with Iris Data


Review from Monday:

Training an RBF network consists of:
* Finding prototypes
* Selecting the activation function for the hidden nodes
* Selecting the activation function for the output nodes
* Setting the weights for the edges and biases

## Load the Data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def typeConverter(x):
    values = ['setosa', 'versicolor', 'virginica']
    return float(values.index(x))


columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = np.array(np.genfromtxt('../data/iris.csv', delimiter=',', converters={4: typeConverter}, skip_header=2, dtype=float, encoding='utf-8'))  
print(iris)

# Look at the Data

In [None]:
def getSummaryStatistics(data):
    "Get the max, min, mean, var for each variable in the data."
    return pd.DataFrame(np.array([data.max(axis=0), data.min(axis=0), data.mean(axis=0), data.var(axis=0)]))

def getShapeType(data):
    "Get the shape and type of the data."
    return (data.shape, data.dtype)

print(getSummaryStatistics(iris))
getShapeType(iris)

df = pd.DataFrame(iris, columns=columns)

sns.pairplot(df, y_vars = ["class"], kind = "scatter")

## What kind of analysis are we going to do?

Regression, clustering, classification?

If supervised, which is our dependent variable?

If we have a dependent variable, how many possible values does it have? What will this number correspond to in the RBF network?

In [None]:
# Why are we doing this?
np.random.shuffle(iris)

# Why are we doing this?
train_data, dev_data, test_data = np.split(iris, [int(.8 * len(iris)), int(.9 * len(iris))])
print("train", "\n", getSummaryStatistics(train_data), np.unique(train_data[:, -1]))
print("dev", "\n", getSummaryStatistics(dev_data), np.unique(dev_data[:, -1]))
print("test", "\n", getSummaryStatistics(test_data), np.unique(test_data[:, -1]))
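The two "why" questions above can be explored on a toy array. This is a minimal sketch: it assumes the rows start out sorted by class (as they are in this toy array) and uses a seeded generator, which the cell above does not.

```python
import numpy as np

# Toy stand-in for the iris array: 150 rows, last column is the class label (assumed layout)
rng = np.random.default_rng(0)  # fixed seed so the shuffle is reproducible
toy = np.column_stack([rng.normal(size=(150, 4)), np.repeat([0.0, 1.0, 2.0], 50)])

# Shuffle the rows in place; without this, the last 20% of this toy array would hold only class 2
rng.shuffle(toy)

# np.split cuts at the given row indices: [0, 120), [120, 135), [135, 150)
train, dev, test = np.split(toy, [int(.8 * len(toy)), int(.9 * len(toy))])
print(train.shape, dev.shape, test.shape)
print(np.unique(train[:, -1]), np.unique(dev[:, -1]), np.unique(test[:, -1]))
```

The second `print` is the same sanity check as in the cell above: every class should show up in the training split.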

In [None]:
# Let's split off the y variables

train_data, train_y = train_data[:, :-1], train_data[:, -1]
dev_data, dev_y = dev_data[:, :-1], dev_data[:, -1]
test_data, test_y = test_data[:, :-1], test_data[:, -1]

## Does the data need to be cleaned?

Are there missing or erroneous values? 

Do we need to fix the types of some of the variables?

## Does it need to be normalized?

Is the range of one or more values clearly out of line with the rest?

## Consider transformation

Would PCA help?
* if we had a thousand independent variables, probably, but in this case no
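To see what "help" would look like, here is a rough sketch that measures how much variance each principal component carries. The data is synthetic (a third column that nearly duplicates the first), not iris, and the SVD route is just one way to compute this.

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=(200, 2))
# Third column is almost a copy of the first, so it adds very little new information
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.01 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                              # PCA works on centered data
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values ** 2 / (singular_values ** 2).sum()
print(explained.round(4))                            # share of variance per component
```

When the last shares are close to zero, dropping those components loses little. With only four iris features there is not much to gain.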

In [None]:
def homogenizeData(data):
    "Append a column of ones so that translations can be written as matrix multiplications."
    return np.append(data, np.array([np.ones(data.shape[0], dtype=float)]).T, axis=1)

def zScore(data, translateTransform=None, scaleTransform=None):
    "Z-score the data with homogeneous transforms; pass in the training transforms to reuse them on dev/test data."
    homogenizedData = homogenizeData(data)
    nCols = homogenizedData.shape[1]
    if translateTransform is None:
        translateTransform = np.eye(nCols)
        # Skip the last (homogeneous) column so that it stays equal to 1
        for i in range(nCols - 1):
            translateTransform[i, nCols - 1] = -homogenizedData[:, i].mean()
    if scaleTransform is None:
        diagonal = [1 / homogenizedData[:, i].std() if homogenizedData[:, i].std() != 0 else 1 for i in range(nCols)]
        scaleTransform = np.eye(nCols, dtype=float) * diagonal
    data = (scaleTransform@translateTransform@homogenizedData.T).T
    return translateTransform, scaleTransform, data[:, :data.shape[1]-1]

translateTransform, scaleTransform, train_data_transformed = zScore(train_data)
print(getSummaryStatistics(train_data_transformed))
getShapeType(train_data_transformed)
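As a sanity check on the idea that z-scoring is a translation followed by a scaling, the sketch below builds the two matrices by hand for toy two-feature data and compares the result with plain broadcasting. The names `T`, `S`, and `Xh` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=[5.0, 3.0], scale=[2.0, 0.5], size=(100, 2))  # toy data, 2 features

mean, std = X.mean(axis=0), X.std(axis=0)

# Homogeneous coordinates: a column of ones turns "subtract the mean" into a matrix product
Xh = np.hstack([X, np.ones((X.shape[0], 1))])

T = np.eye(3)
T[:2, 2] = -mean                             # translation: subtract each feature's mean
S = np.diag([1 / std[0], 1 / std[1], 1.0])   # scaling: divide by each feature's std

Z_matrix = (S @ T @ Xh.T).T[:, :2]           # drop the homogeneous column again
Z_plain = (X - mean) / std                   # the same operation with broadcasting

print(np.allclose(Z_matrix, Z_plain))
print(Z_matrix.mean(axis=0).round(6), Z_matrix.std(axis=0).round(6))
```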

# Find Prototypes

To do this, we use k-means. I am going to use the scikit-learn implementation; you should use your own for the project.

Why would we not just have the number of prototypes be equal to the number of classes?

In [None]:
from sklearn.cluster import KMeans

inertia_by_k = []

for k in range(2, 17):
    print(k)
    # Cluster in the z-scored space, since that is what the hidden layer will see
    km = KMeans(n_clusters=k, random_state=0).fit(train_data_transformed)
    inertia_by_k.append([k, km.inertia_])

inertia_by_k = np.array(inertia_by_k)
print(inertia_by_k)
fig = plt.figure(figsize=(6,4))
ax1 = fig.add_subplot(111)
ax1.plot(inertia_by_k[:, 0], inertia_by_k[:, 1])
ax1.set_xlabel('k')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Plot')
plt.show()

So, what value will we choose for k? What will this number correspond to in the RBF network?

In [None]:
# Let's get the centroid for each hidden node - each prototype
k = 11

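# Fit on the z-scored training data so the prototypes live in the same space as the encoded inputs
km = KMeans(n_clusters=k, random_state=0).fit(train_data_transformed)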

print(km.cluster_centers_)

# Define the Activation Function for the Hidden Nodes

Recall that a typical activation function for the hidden nodes is the Gaussian, so something like $\exp \left( - \frac{\|\vec{d}-\vec{\mu}_j\|^2}{2\delta_j^2 + \epsilon} \right)$, where $\vec{d}$ is the data point, $\vec{\mu}_j$ is the prototype, $\delta_j$ is the hidden unit's standard deviation, $\epsilon$ is a small constant and $\|\cdot\|^2$ is the squared Euclidean distance.

Let's take a good look at this activation function. 
* What is in the numerator? Why look, it's the (squared) distance! Why would we not just use the distance itself as the activation function? 
* What is the function of $\delta_j$?
* Why do we have $\epsilon$?
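One way to get a feel for these questions is to plug a few numbers into the formula for a single data point and a single prototype. This is a scalar sketch only, not the matrix version needed for the project; the function name `gaussian_activation` and the default `eps=1e-8` are assumptions made for illustration.

```python
import math

def gaussian_activation(sq_dist, delta, eps=1e-8):
    """exp(-||d - mu||^2 / (2 * delta^2 + eps)) for one point and one prototype."""
    return math.exp(-sq_dist / (2 * delta ** 2 + eps))

# Rows: different delta values; columns: distances 0, 1, 2 from the prototype
for delta in (0.5, 1.0, 2.0):
    row = [round(gaussian_activation(dist ** 2, delta), 3) for dist in (0.0, 1.0, 2.0)]
    print(delta, row)

# With delta == 0 the denominator would be zero without eps; with eps this call still returns a number
print(gaussian_activation(1.0, 0.0))
```

Compare the rows: the activation is 1 at distance 0 and falls toward 0 as the distance grows, and a larger `delta` makes it fall more slowly.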

# What Will We Do When We Get a New Data Point?

At this point, we have defined:
* The input layer (ish)
* The hidden layer

For a new data point, we will:
1. encode it using the same z-scoring we did on train, not a new z-scoring; i.e., use the mean and stdev from the *training data*
2. send it to each of the hidden layer nodes (so the weights from the input layer to the hidden layer are all 1)
3. each hidden layer node will calculate its activation for this data point

On Monday we will define the output layer, and explain how it relates to another analysis method we already know well, linear regression. We will then show how we can *also use RBF networks for regression*!

# Let's Process the Dev Data Through the Hidden Layer with Matrix Math

So to process a set of data points, e.g. the dev data, I'm going to:
1. "encode" - input layer - subtract mean of training data and divide by stdev of training data. Take a look at zScore above; it *already does all this with matrix multiplication*! Remember, z-scoring is just a translation and scaling.
2. calculate activations of hidden layer nodes. Take a look at the activation function. Inside the exponent, it has a numerator and a denominator. The denominator operates as a scaling, and you know how to do that. The numerator includes a translation (see the minus?) and then takes the squared length of the result (and you know how to do that!). And the exponentiation is applied elementwise to the whole matrix at once.

Because you should implement the activation function above yourselves, I'm instead going to implement this stupid activation function just to show you the matrix math:
$\exp \left( - \frac{\|\vec{d}-\vec{\mu}_j\|}{3} \right)$

In [None]:
# The other thing you need, for each hidden node, is the activation function
# I am going to implement a _stupid activation function_ so that you can implement the right one yourselves for project 7
def calculateActivations(data, centroids):
    "I repeat, do not use this activation function directly. This one is exp(-distance / 3); yours is exp(-distance^2 / (2*delta_j^2 + epsilon))"
    # You can easily fiddle with this numerator to make it calculate the square of the distance
    numerator = -np.linalg.norm(data-centroids[:,np.newaxis], axis = 2).T
    # The construction of your denominator will be a little more complex than this; the diagonals will be centroid/prototype-specific
    denominator = np.eye(centroids.shape[0], dtype=float) * 1/3
    print(numerator.shape, denominator.shape)
    return np.exp((denominator@numerator.T).T)

train_calcs = calculateActivations(train_data_transformed, km.cluster_centers_)
print(train_calcs.shape)
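The broadcasting inside `calculateActivations` is easy to get wrong, so here is a small check of the distance step against an explicit double loop. The array sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(6, 4))        # 6 toy points, 4 features
centroids = rng.normal(size=(3, 4))   # 3 toy prototypes

# (6, 4) minus (3, 1, 4) broadcasts to (3, 6, 4): one difference vector per (prototype, point) pair
diffs = data - centroids[:, np.newaxis]
dist_broadcast = np.linalg.norm(diffs, axis=2).T   # (6, 3): rows are points, columns are prototypes

# The same distances, one pair at a time
dist_loop = np.zeros((6, 3))
for i, point in enumerate(data):
    for j, mu in enumerate(centroids):
        dist_loop[i, j] = np.sqrt(((point - mu) ** 2).sum())

print(diffs.shape, dist_broadcast.shape, np.allclose(dist_broadcast, dist_loop))
```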

In [None]:
# First we need to normalize the dev data using the scale and translate from the train data
_, _, dev_data_transformed = zScore(dev_data, translateTransform, scaleTransform)
print(dev_data_transformed.shape)
dev_calcs = calculateActivations(dev_data_transformed, km.cluster_centers_)

In [None]:
import pickle as pkl

# Save the hidden-layer activations and labels; `with` closes each file once it is written
for name, obj in [("hidden_node_weights_train", train_calcs), ("labels_train", train_y),
                  ("hidden_node_weights_dev", dev_calcs), ("labels_dev", dev_y)]:
    with open(f"../{name}.pkl", "wb") as f:
        pkl.dump(obj, f)
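To pick these files up again later, `pkl.load` reverses `pkl.dump`. The sketch below does a round trip in a temporary directory with a toy array, so the path and the array are stand-ins rather than the real files written above.

```python
import os
import pickle as pkl
import tempfile

import numpy as np

activations = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy stand-in for train_calcs

# Write to a throwaway directory rather than the notebook's real output location
path = os.path.join(tempfile.mkdtemp(), "hidden_node_weights_train.pkl")
with open(path, "wb") as f:
    pkl.dump(activations, f)

with open(path, "rb") as f:
    restored = pkl.load(f)

print(np.array_equal(activations, restored))
```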