# Assignment #3 - Neural Networks

<font color="blue"> Anirudh Narayanan </font>

# I. Overview (Objective and Approach)

The key idea of this assignment is to achieve non-linearity when classifying or fitting the training data. There are many ways of achieving non-linearity. One is to add higher-order polynomial terms of the inputs, as in polynomial regression. The problem with choosing those terms by trial and error is that the number of possible combinations grows very quickly, so trying all of them is computationally expensive. To avoid this problem, we focus on neural networks, which build the non-linear function in stages (one stage per layer) and learn a suitable transformation of the inputs during training, so that the right classification or regression can be performed without overfitting.

By the universal approximation theorem, a neural net with a hidden layer and a non-linear activation function (ReLU, tanh, sigmoid) can approximate any continuous function on a bounded input range arbitrarily well, given enough hidden units. In practice this means it can fit almost any curve through the training data. Overdoing this, however, leads to overfitting.


 
# Data Description

The data represents different weather features in Australia, such as temperature, humidity, rainfall and wind speed. This data is useful for regression, because past observations can be used to predict future weather conditions. It is also useful for classification, because locations or years can be told apart based on their measurements. A good example is the humidity vs. rainfall plot of Melbourne Airport vs. Portland, which shows clear demarcations and can be used to classify new rainfall and humidity observations from either location.

# Column Pre-Processing

In this step, each column was evaluated with respect to its null values. Removing rows with None values first causes problems, because some columns hold too little data and have a large percentage of None entries, so most rows would be dropped on their account. The percentage of null values in each column was therefore computed before any row processing. If more than 70% of a column's values were null, the entire column was dropped for lack of sufficient information.
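
The column filter described above can be sketched with pandas as follows. This is a minimal illustration rather than the exact preprocessing script: the helper name `drop_sparse_columns` and the toy frame are made up for the example, and only the 70% threshold comes from the text.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_null_fraction=0.7):
    """Drop columns whose fraction of null values exceeds max_null_fraction."""
    null_fraction = frame.isna().mean()          # per-column share of nulls
    keep = null_fraction[null_fraction <= max_null_fraction].index
    return frame[keep]

# Tiny made-up frame: "Evaporation" is 80% null, the others are mostly filled.
toy = pd.DataFrame({
    "MaxTemp": [25.0, 27.1, np.nan, 30.2, 22.4],
    "Rainfall": [0.0, 1.2, 0.4, 0.0, 5.6],
    "Evaporation": [np.nan, np.nan, np.nan, 4.0, np.nan],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Only the sparse column is removed; the single missing `MaxTemp` value is left for the row step.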
​
# Row Pre-Processing

Next, the rows that still contained null values were removed by iterating through the entire data. Because the sparse columns had already been dropped, not too many rows were lost just because some columns had bad information. This was done with a nested loop: for each row, every column was checked. The nested loop was needed because pandas indexes a DataFrame by column first, rather than by row or by (row, column) pair.
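
As a rough sketch of this row step (not the original script), the nested loop can be written as below and compared with the vectorized `DataFrame.dropna` call that pandas provides. The toy frame and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

toy_rows = pd.DataFrame({
    "MaxTemp": [25.0, np.nan, 30.2, 22.4],
    "Humidity9am": [60.0, 71.0, np.nan, 55.0],
    "Rainfall": [0.0, 1.2, 0.4, 5.6],
})

# Explicit nested loop: for each row, check every column for a null.
rows_to_keep = []
for row_label, row in toy_rows.iterrows():
    has_null = False
    for column in toy_rows.columns:
        if pd.isna(row[column]):
            has_null = True
            break
    if not has_null:
        rows_to_keep.append(row_label)
looped = toy_rows.loc[rows_to_keep]

# Vectorized equivalent provided by pandas.
vectorized = toy_rows.dropna(axis=0, how="any")

print(looped.equals(vectorized))
```

On a dataset with tens of thousands of rows the vectorized call is usually much faster than the explicit loop.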

# II. Data

Introduce your data and visualize them. Describe your observations about the data.
You can reuse the data that you examined in Assignment #1 (of course for regression). 

# PLOTS FOR UNDERSTANDING OR ANALYSIS

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from mpl_toolkits import mplot3d
import copy


df = pd.read_csv("ausweather_preprocessed.csv",sep="\t")

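# Mean max temperature and 9am humidity per month and location.
# The first 7 characters of Date are taken as the month key (assuming "YYYY-MM-DD" strings).
# Columns are selected with a list; a bare tuple of column names is rejected by newer pandas.
grouped_by_month_rainfall = df.groupby([df.Date.str[:7],"Location"])[["MaxTemp","Humidity9am"]].mean().reset_index()

# Keep only the two locations being compared
grouped_2013 = grouped_by_month_rainfall[ (grouped_by_month_rainfall["Location"]=="Katherine")  | (grouped_by_month_rainfall["Location"]=="Bendigo") ]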

#print grouped_2013



plt.figure(figsize=(15,8))
plt.title("Humidity vs Max Temperature of 2 Australian Locations")
plt.xlabel('Humidity', fontsize=18)
plt.ylabel('Max Temperature', fontsize=16)
#EXAMPLE OF A SIMPLE TWO DIVISION CLUSTER. DATA FROM EITHER ONE OF THE SOURCES CAN BE CLASSIFIED TO EITHER ONE BASED ON SOME CLASSIFICATION ALGORITHM
for name,group in grouped_2013.groupby("Location"):
    plt.scatter(group["Humidity9am"],group["MaxTemp"],label=name)
    plt.legend()
 


# Select the numeric columns with a list; "Location" comes back through reset_index()
grouped3d = df.groupby([df.Date.str[:12],"Location"])[["MaxTemp","Humidity9am","Rainfall"]].mean().reset_index()

groupwithout = copy.deepcopy(grouped3d)
grouped3d = grouped3d[grouped3d["Location"]=="Canberra"]



#print grouped3d


In [None]:
#DATA HERE CAN BE CLASSIFIED BETWEEN 2008 and 2010

#print grouped3d
# 3D figure for the temperature / humidity / rainfall scatter
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111, projection='3d')
# Separate 2D figure, used by the humidity vs rainfall plot further down in this cell
plt.figure(figsize=(15,8))

ax.set_title("Max Temperature, Humidity, Rainfall of 2008 vs 2010 (Classifiable)")
ax.set_xlabel('Max Temperature', fontsize=18)
ax.set_ylabel('Humidity 9 am', fontsize=16)
ax.set_zlabel('Rainfall', fontsize=16)
#for name,group in grouped3d.groupby(grouped3d.Date.str[:4]):
ax.scatter(groupwithout["MaxTemp"],groupwithout["Humidity9am"],groupwithout["Rainfall"])
#ax.plot(X[:,1],answers)
     #ax.legend()


# numeric_only=True skips text columns, which newer pandas versions refuse to average
canberra_rainfall_df = df.groupby([df.Date.str[:7],"Location"]).mean(numeric_only=True).reset_index()[["Humidity3pm","Rainfall","Location"]]
canberra_df_humidity = df.groupby([df.Date.str[:7]]).mean(numeric_only=True).reset_index()[["Date","Humidity3pm"]]
canberra_df_clouds = df.groupby([df.Date.str[:7]]).mean(numeric_only=True).reset_index()[["Date","Cloud3pm"]]

#print canberra_rainfall_df[(canberra_rainfall_df["Location"]=="Katherine")  | (canberra_rainfall_df["Location"]=="Bendigo")]
#print canberra_rainfall_df[(canberra_rainfall_df["Location"]=="Portland")  | (canberra_rainfall_df["Location"]=="MelbourneAirport")]
#canberra_rainfall_df = canberra_rainfall_df[(canberra_rainfall_df["Location"]=="Katherine")  | (canberra_rainfall_df["Location"]=="Bendigo")]

canberra_rainfall_df = canberra_rainfall_df[(canberra_rainfall_df["Location"]=="Portland")  | (canberra_rainfall_df["Location"]=="MelbourneAirport") | (canberra_rainfall_df["Location"]=="PerthAirport") ]


plt.title("Humidity vs Rainfall of 3 Australian Locations (Clustered, with many outliers)")
plt.xlabel('Humidity 3pm', fontsize=18)
plt.ylabel('Rainfall', fontsize=16)
for name,group in canberra_rainfall_df.groupby("Location"):
    plt.scatter(group["Humidity3pm"],group["Rainfall"],label=name)
    plt.legend()

    
    
plt.show()

    

In [None]:
plt.figure(figsize=(15,8))
canberra_rainfall_df = canberra_rainfall_df[(canberra_rainfall_df["Location"]=="Portland")  | (canberra_rainfall_df["Location"]=="MelbourneAirport") ]

plt.scatter(canberra_rainfall_df["Humidity3pm"],canberra_rainfall_df["Rainfall"])

plt.figure(figsize=(15,8))
plt.title("Humidity vs Rainfall of 2 Australian Locations (Clustered, with few outliers)")
plt.xlabel('Humidity 3pm', fontsize=18)
plt.ylabel('Rainfall', fontsize=16)
for name,group in canberra_rainfall_df.groupby("Location"):
    plt.scatter(group["Humidity3pm"],group["Rainfall"],label=name)
    plt.legend()

    

In [None]:
# Data NORMALIZATION
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from mpl_toolkits import mplot3d
import copy


df = pd.read_csv("ausweather_preprocessed.csv",sep="\t")


# Group by date string and location; columns are selected with a list
grouped_by_month_rainfall = df.groupby([df.Date.str[:11],"Location"])[["MaxTemp","Humidity9am"]].mean().reset_index()

# Keep only the two locations being compared
grouped_2013 = grouped_by_month_rainfall[ (grouped_by_month_rainfall["Location"]=="Katherine")  | (grouped_by_month_rainfall["Location"]=="Bendigo") ]

#print grouped_2013



#plt.figure(figsize=(15,8))
#plt.title("Humidity vs Max Temperature of Bendigo")
#plt.xlabel('Humidity', fontsize=18)
#plt.ylabel('Max Temperature', fontsize=16)
#EXAMPLE OF A SIMPLE TWO DIVISION CLUSTER. DATA FROM EITHER ONE OF THE SOURCES CAN BE CLASSIFIED TO EITHER ONE BASED ON SOME CLASSIFICATION ALGORITHM
k = 0

for name,group in grouped_2013.groupby("Location"):
    if(k>0):  # stop after the first location's group
        break
    humidity_points = group["Humidity9am"]
    maxtemp_points = group["MaxTemp"]
    #plt.scatter(group["Humidity9am"],group["MaxTemp"])
    k+=1


humid = np.array(humidity_points)
temper = np.array(maxtemp_points)

humid_normal = humid
temper_normal = temper

humid_max = humid.max()
humid_min = humid.min()
humid_avg = humid.mean()

temper_max = temper.max()
temper_min = temper.min()
temper_avg = temper.mean()


# Mean normalization: subtract the mean and divide by the range (done in place)
humid[:] = (humid - humid_avg)/(humid_max - humid_min)
temper[:] = (temper - temper_avg)/(temper_max - temper_min)

In [None]:
print("done")
humid.shape

In [None]:
# DATA VISUALIZATION
plt.figure(figsize=(15,8))
plt.scatter(humid,temper)    

# READING AND ANALYZING THE PLOTS

We know from the above plots that much of the data can be used for regression and for classification. The best ways to use the data are as follows:

- Prediction of rainfall using humidity and temperature
- Prediction of whether or not it rains the next day, using the conditions of the current day
- Prediction of humidity from other features
- Prediction of location using all the features. This would be a natural multi-class classification problem.

The plots show that rainfall depends on temperature and humidity. We can infer this not only from domain knowledge but also from the data. We also notice that different places show different patterns over the course of a year, and also from year to year. We consider almost 80k rows of data collected over the years, which is enough to make the problem tractable, and we can apply classification and regression to extract useful insights from this data.

The data comprises both large and small subsets, and we can use them in the right proportions to tune our neural network and get the best out of the data.

## Regression Plots

The regression plots are of different kinds: some contain large amounts of data and some much less. The data can be fitted by linear regression, but that would not be the best fit; a non-linear approach suits the data better.

We don't use polynomial regression for this; instead we use neural networks to obtain the best fit. Some of the datasets are also very large, such as the humidity vs. temperature data above, and can be fit in many complex ways during training.

The classification data (as shown in the 3D plot above) has many classes and cannot be perfectly separated by a linear model; it benefits from a non-linear or polynomial fit. The best way to do this is to use a neural net with a hidden layer and an activation function.

## Classification Plots

The classification is done between different locations in Australia. One plot shows the differences in humidity and max temperature between Bendigo and Katherine. Another shows the differences in humidity and rainfall between Portland and Melbourne Airport. Although these datasets can be classified with a linear decision boundary, that would not give the best results. The best way to classify this data is to use neural nets with non-linear decision boundaries, which fit the data better.

## III.B Nonlinear Regression 

- Nonlinear regression is a regression in which the dependent or criterion variables are modeled as a non-linear function of model parameters and one or more independent variables.

In a nonlinear regression model we consider not only simple low-order terms but also higher-order terms that are far more complex. This lets the model fit the data much better than a linear fit (the equivalent of a single layer after the input layer). By the universal approximation theorem, non-linear data can be approximated arbitrarily well by a suitable neural net: given enough hidden units, a multi-layer network can represent any continuous curve through its weights. This generally requires a hidden layer with an activation function, which is what gives the output its non-linearity.

If a multilayer perceptron is used without an activation function, its output is a linear function of its input. Hence we use activation functions such as tanh, ReLU or leaky ReLU, each of which introduces a different kind of non-linearity.
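
The effect of leaving out the activation can be checked numerically. The sketch below uses small random matrices (made up for illustration, with the bias terms omitted for brevity) to show that two stacked linear layers equal a single linear map, while a tanh hidden layer does not.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 samples, 3 features
W1 = rng.normal(size=(3, 4))     # input -> hidden
W2 = rng.normal(size=(4, 2))     # hidden -> output

# Without an activation, two layers collapse into one linear map W1 @ W2.
two_linear_layers = (X @ W1) @ W2
one_linear_layer = X @ (W1 @ W2)
linear_collapses = np.allclose(two_linear_layers, one_linear_layer)

# With tanh in the hidden layer the output is no longer that linear map.
with_tanh = np.tanh(X @ W1) @ W2
tanh_differs = not np.allclose(with_tanh, one_linear_layer)

print(linear_collapses, tanh_differs)
```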

\begin{equation}
E = \frac{1}{N} \frac{1}{K}\sum_{n=1}^{N} \sum_{k=1}^{K} (t_{nk} - y_{nk})^2
\end{equation}
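
As a small worked example, the error $E$ can be computed directly with NumPy. The target and output values below are made up; the regression class in this notebook minimizes half of this quantity plus a weight penalty.

```python
import numpy as np

def mean_squared_error(T, Y):
    """E = 1/(N*K) * sum_n sum_k (t_nk - y_nk)^2 for N samples and K outputs."""
    N, K = T.shape
    return np.sum((T - Y) ** 2) / (N * K)

T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # targets, N=3, K=2
Y = np.array([[0.5, 0.0], [0.0, 0.5], [1.0, 0.0]])   # network outputs
E = mean_squared_error(T, Y)
print(E)   # squared errors sum to 1.5, divided by N*K = 6
```

The same value is returned by `np.mean(np.square(T - Y))`, which is the form used in the class code.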

The above error function measures the mean squared error between the targets $t_{nk}$ and the network outputs $y_{nk}$ over all $N$ samples and $K$ output units. Backpropagation propagates this output error back through the layers to obtain the gradient with respect to every weight, which the optimizer then uses to update the weights. In our code, the optimizer takes the parameter niter, which controls the maximum number of iterations. The following hyperparameters can be controlled:

1) Learning Rate <br>
2) Regularization Factor <br>
3) Neural Net Neurons in Layers and Activation Functions. <br>
4) Controlling the number of iterations/epochs. <br>

When we use non-linear regression to fit the data, the network has a single output neuron, rather than a set of output neurons where each one indicates the activation for a particular class. The single output neuron gives the predicted value for each input sample. We control the complexity of the non-linear function with the regularization hyperparameter, which helps us fit the data well without much overfitting.
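
A minimal sketch of such a regularization term is shown below, assuming (as in the class code of this notebook) that the first row of each weight matrix holds the bias weights and is excluded from the penalty. The function name and the weight values are made up for illustration.

```python
import numpy as np

def weight_penalty(weight_matrices, lam):
    """L2 penalty lam * sum(w^2) over all non-bias weights.

    Row 0 of each matrix is assumed to hold the bias weights and is skipped.
    """
    return lam * sum(np.sum(W[1:, :] ** 2) for W in weight_matrices)

W_hidden = np.array([[0.5, -0.5],    # bias row (not penalized)
                     [1.0, 2.0],
                     [0.0, -1.0]])
W_output = np.array([[0.1],          # bias row (not penalized)
                     [2.0],
                     [-2.0]])

penalty = weight_penalty([W_hidden, W_output], lam=0.1)
print(penalty)
```

A larger `lam` makes large weights more expensive, which pushes the fit towards a smoother function.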

The learning rate must be controlled so that the optimizer does not overshoot the minimum. A good learning rate is neither too high nor too low: too high and the updates oscillate or diverge, too low and training converges very slowly. The number of iterations (epochs) is also essential to control.
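
The effect of the learning rate can be illustrated on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, whose minimum is at 0; the function and the three learning rates are chosen purely for illustration and are not the optimizer settings used in this notebook.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.3)    # converges towards the minimum at 0
too_large = gradient_descent(1.1)     # overshoots further on every step

print(abs(too_small), abs(reasonable), abs(too_large))
```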

Each iteration is a batch optimization step in which a full forward and backward pass takes place over the entire data. Too many iterations can cause overfitting, and too few can cause underfitting. A good balance between the above hyperparameters is essential.

In short, non-linear regression here is about controlling the hyperparameters and changing only the hidden layer (not the output layer, which always has exactly one unit); this process achieves the right amount of non-linearity.

## CODE EXPLANATION

- The code is split into the following parts 
    - initialization
    - add_ones
    - get_nlayers
    - set_hunit
    - pack
    - unpack
    - cp_weight
    - RBF 
    - forward
    - backward
    - _errorf
    - _objectf
    - train 
    - use
    
- MODEL
    - All of the neural networks in this assignment use a single hidden layer. The idea is to increase the number of hidden neurons as needed and map them to the correct number of output units.
    - Each training iteration of the model is a two-step process: forward and backward propagation.
    - FORWARD PROP: Forward propagation uses the current weights to compute the hypothesis, that is, the predicted output. It passes the input through all the layers, not just the final one, and keeps each layer's input for the backward pass.
    - BACKWARD PROP: Backward propagation computes the gradient of the error with respect to the weights of each layer, starting from the output error and working back layer by layer. The optimizer uses this gradient to update the weights in each iteration.
    - These two processes alternate for multiple iterations/epochs, working with the optimizer to improve performance.
    - The initialization step sets up the weights for the input, hidden and output layers.
    - add_ones prepends a column of ones to its input, which acts as the bias unit.
    - The _objectf function returns the value of the objective. For non-linear regression this is half the mean squared error plus the weight penalty. The optimizer uses its gradient to update the weights so that the error decreases gradually with each iteration.
    - The train function standardizes the data and runs the optimizer (scaled conjugate gradient by default, or steepest descent), which calls the forward and backward passes for up to niter iterations.
    - The lambda parameter controls how strongly the complexity of the fitted function is penalized.
    - If lambda is low, the network may use as complex a function as its architecture allows, which risks overfitting; a higher lambda penalizes large weights and so guards against overfitting.
    - The use function applies the trained weights to produce a hypothesis for a given set of inputs.
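
The initialization, pack and unpack steps rely on one flat weight vector whose slices are viewed as per-layer matrices. The sketch below illustrates that idea under assumed layer sizes; it uses `reshape` on the slices, whereas the class code resizes them in place, and the variable names are made up for the example.

```python
import numpy as np

layer_sizes = [2, 3, 1]   # hypothetical: 2 inputs, 3 hidden units, 1 output

# Each layer's matrix has one extra row for the bias weights.
shapes = [(layer_sizes[i] + 1, layer_sizes[i + 1])
          for i in range(len(layer_sizes) - 1)]
total = sum(rows * cols for rows, cols in shapes)

flat_weights = np.zeros(total)
matrices = []
start = 0
for rows, cols in shapes:
    end = start + rows * cols
    # reshape on a contiguous slice returns a view, so no data is copied
    matrices.append(flat_weights[start:end].reshape(rows, cols))
    start = end

# "unpack": writing into the flat vector updates every matrix view.
flat_weights[:] = np.arange(total, dtype=float)

# "pack": flatten the matrices back into one vector.
packed = np.hstack([m.ravel() for m in matrices])

print([m.shape for m in matrices], np.array_equal(packed, flat_weights))
```

Keeping a single flat vector is convenient because a general-purpose optimizer only needs to see one array of parameters.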



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from grad import scg, steepest
import copy
from util import Standardizer


class NeuralNet:
    def __init__(self, nunits):
        self._nLayers=len(nunits)-1
        self.rho = [1] * self._nLayers
        self._W = []
        wdims = []
        lenweights = 0
        for i in range(self._nLayers):
            nwr = nunits[i] + 1
            nwc = nunits[i+1]
            wdims.append((nwr, nwc))
            lenweights = lenweights + nwr * nwc

        self._weights = np.random.uniform(-0.1,0.1, lenweights) 
        start = 0  # fixed index error 20110107
        for i in range(self._nLayers):
            end = start + wdims[i][0] * wdims[i][1] 
            self._W.append(self._weights[start:end])
            self._W[i].resize(wdims[i])
            start = end

        self.stdX = None
        self.stdT = None
        self.stdTarget = True

    def add_ones(self, w):
        return np.hstack((np.ones((w.shape[0], 1)), w))

    def get_nlayers(self):
        return self._nLayers

    def set_hunit(self, w):
        for i in range(self._nLayers-1):
            if w[i].shape != self._W[i].shape:
                print("set_hunit: shapes do not match!")
                break
            else:
                self._W[i][:] = w[i][:]

    def pack(self, w):
        # np.hstack needs a sequence; in Python 3, map() returns an iterator
        return np.hstack([np.ravel(m) for m in w])

    def unpack(self, weights):
        self._weights[:] = weights[:]  # unpack

    def cp_weight(self):
        return copy.copy(self._weights)

    def RBF(self, X, m=None,s=None):
        if m is None: m = np.mean(X)
        if s is None: s = 2 #np.std(X)
        r = 1. / (np.sqrt(2*np.pi)* s)  
        return r * np.exp(-(X - m) ** 2 / (2 * s ** 2))

    def forward(self,X):
        t = X 
        Z = []

        for i in range(self._nLayers):
            Z.append(t) 
            if i == self._nLayers - 1:
                t = np.dot(self.add_ones(t), self._W[i])
            else:
                t = np.tanh(np.dot(self.add_ones(t), self._W[i]))
                #t = self.RBF(np.dot(np.hstack((np.ones((t.shape[0],1)),t)),self._W[i]))
        return (t, Z)
        
    def backward(self, error, Z, T, lmb=0):
        delta = error
        N = T.size
        dws = []
        for i in range(self._nLayers - 1, -1, -1):
            rh = float(self.rho[i]) / N
            if i==0:
                lmbterm = 0
            else:
                lmbterm = lmb * np.vstack((np.zeros((1, self._W[i].shape[1])),
                            self._W[i][1:,]))
            dws.insert(0,(-rh * np.dot(self.add_ones(Z[i]).T, delta) + lmbterm))
            if i != 0:
                #print(delta)
                #print("p2")
                #print(Z)
                #print("p3")
                #print(self._W[i][1:, :].T)
                delta = np.dot(delta, self._W[i][1:, :].T) * (1 - Z[i]**2)
        return self.pack(dws)

    def _errorf(self, T, Y):
        return T - Y
        
    def _objectf(self, T, Y, wpenalty):
        return 0.5 * np.mean(np.square(T - Y)) + wpenalty

    def train(self, X, T,**params):
        verbose = params.pop('verbose', False)
        # training parameters
        _lambda = params.pop('Lambda', 0)

        #parameters for scg
        niter = params.pop('niter', 1000)
        wprecision = params.pop('wprecision', 1e-10)
        fprecision = params.pop('fprecision', 1e-10)
        wtracep = params.pop('wtracep', False)
        ftracep = params.pop('ftracep', False)

        # optimization
        optim = params.pop('optim', 'scg')

        # Use "is None": "==" on a Standardizer object may not behave as expected
        if self.stdX is None:
            explore = params.pop('explore', False)
            self.stdX = Standardizer(X, explore)
        Xs = self.stdX.standardize(X)
        if self.stdT is None and self.stdTarget:
            self.stdT = Standardizer(T)
            T = self.stdT.standardize(T)
        
        def gradientf(weights):
            self.unpack(weights)
            Y,Z = self.forward(Xs)
            error = self._errorf(T, Y)
            return self.backward(error, Z, T, _lambda)
            
        def optimtargetf(weights):
            """ optimization target function : MSE 
            """
            self.unpack(weights)
            #self._weights[:] = weights[:]  # unpack
            Y,_ = self.forward(Xs)
            Wnb=np.array([])
            for i in range(self._nLayers):
                if len(Wnb)==0: Wnb=self._W[i][1:,].reshape(self._W[i].size-self._W[i][0,].size,1)
                else: Wnb = np.vstack((Wnb,self._W[i][1:,].reshape(self._W[i].size-self._W[i][0,].size,1)))
            wpenalty = _lambda * np.dot(Wnb.flat ,Wnb.flat)
            return self._objectf(T, Y, wpenalty)

        if optim == 'scg':
            result = scg(self.cp_weight(), gradientf, optimtargetf,
                                        wPrecision=wprecision, fPrecision=fprecision, 
                                        nIterations=niter,
                                        wtracep=wtracep, ftracep=ftracep,
                                        verbose=False)
            self.unpack(result['w'][:])
            self.f = result['f']
        elif optim == 'steepest':
            result = steepest(self.cp_weight(), gradientf, optimtargetf,
                                nIterations=niter,
                                xPrecision=wprecision, fPrecision=fprecision,
                                xtracep=wtracep, ftracep=ftracep )
            self.unpack(result['w'][:])
        if ftracep:
            self.ftrace = result['ftrace']
        if 'reason' in result.keys() and verbose:
            print(result['reason'])

        return result

    def use(self, X, retZ=False):
        if self.stdX is not None:
            Xs = self.stdX.standardize(X)
        else:
            Xs = X
        Y, Z = self.forward(Xs)
        if self.stdT is not None:
            Y = self.stdT.unstandardize(Y)
        if retZ:
            return Y, Z
        return Y



## III.C Nonlinear Logistic Regression

### Explanation
- Nonlinear logistic regression is a classification method in which the class probabilities are modeled as a non-linear function of the model parameters and one or more independent variables.
- The core idea behind any non-linear method is to attain better accuracy, which is possible only with decision boundaries of higher than linear order; this is what we leverage to perform our task.
- We could choose the right higher-order polynomial by hand to distinguish between our classes, but this would require a deep understanding not only of the data but also of the way the features interact.
- A multi-layer perceptron does this task for us, and we can vary the number of neurons in each layer to find a good solution for our problem.
- One way to do so is to add hidden-layer neurons up to the point where we fit the data but do not overfit.
- The core difference in the "logistic" non-linear regression approach is that the final layer of the network does not output the raw hypothesis; its output is passed through an activation function.
- ACTIVATION FUNCTION choices: We have many choices of activation at this stage. <br>
    - Sigmoid
    - ReLU
    - Softmax
    - Leaky ReLU
    <br>
    
- The sigmoid function squashes a single output into the range (0, 1), giving the probability of one class against ONE other class.
- ReLU passes positive values through unchanged but outputs exactly zero when its input is below zero.
- Softmax is used when there are more than two output classes. Instead of a HARSH (hardmax) winner-takes-all output, it assigns each class a graded probability, and the probabilities sum to one.
- Leaky ReLU is a variant that does not completely zero out negative inputs; it keeps a small slope there, so the unit can still change during training.
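
The element-wise activations listed above can be written in a few lines of NumPy. This is a sketch for illustration only; the function names and the leaky slope of 0.01 are assumptions, not values taken from the assignment code.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):
    """Leaky ReLU: a small slope (hypothetical default) on the negative side."""
    return np.where(a > 0, a, slope * a)

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a))
print(relu(a))
print(leaky_relu(a))
```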


Using the knowledge we have, we pick an activation function for the output layer, and based on this we classify the data.

In our case, we use non-linear logistic regression with softmax, because there are multiple classes (34 classes in fact). Softmax distributes the probability mass across the classes, giving for each input the probability that it belongs to each class.


\begin{equation}
P(y=j) = \frac{e^{\vec w_j \cdot \vec x}}{\sum\limits_{i=1}^{K}e^{\vec w_i \cdot \vec x}}
\end{equation}
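
A numerically stable version of this formula can be sketched as follows, assuming the scores $\vec w_j \cdot \vec x$ are arranged with one row per sample and one column per class. Subtracting the row maximum before exponentiating leaves the result unchanged but avoids overflow; the function name and inputs are made up for the example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax for an (N samples, K classes) array of scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])   # large values would overflow a naive exp
probs = softmax(scores)
print(probs.sum(axis=1))      # each row sums to 1
print(probs.argmax(axis=1))   # index of the most probable class per row
```

Note that the normalization runs over the class axis (`axis=1`), so every sample gets its own probability distribution.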


## CODE EXPLANATION

- The code is split into the following parts 
    - initialization
    - add_ones
    - get_nlayers
    - set_hunit
    - pack
    - unpack
    - cp_weight
    - RBF 
    - forward
    - backward
    - _errorf
    - _objectf
    - train 
    - use
    
- MODEL
    - All of the neural networks in this assignment use a single hidden layer. The idea is to increase the number of hidden neurons as needed and map them to the correct number of output units.
    - Each training iteration of the model is a two-step process: forward and backward propagation.
    - FORWARD PROP: Forward propagation uses the current weights to compute the hypothesis, that is, the predicted output. It passes the input through all the layers, not just the final one, and keeps each layer's input for the backward pass.
    - BACKWARD PROP: Backward propagation computes the gradient of the error with respect to the weights of each layer, starting from the output error and working back layer by layer. The optimizer uses this gradient to update the weights in each iteration.
    - These two processes alternate for multiple iterations/epochs, working with the optimizer to improve performance.
    - The initialization step sets up the weights for the input, hidden and output layers.
    - add_ones prepends a column of ones to its input, which acts as the bias unit.
    - The _objectf function returns the value of the objective. For non-linear logistic regression this is the log (cross-entropy) loss plus the weight penalty. The optimizer uses its gradient to update the weights so that the loss decreases gradually with each iteration. The log loss is the standard choice for logistic regression with sigmoid or softmax outputs.
    - The train function standardizes the inputs (the class-indicator targets are left as they are) and runs the optimizer (scaled conjugate gradient by default, or steepest descent), which calls the forward and backward passes for up to niter iterations.
    - The lambda parameter controls how strongly the complexity of the fitted function is penalized.
    - If lambda is low, the network may use as complex a function as its architecture allows, which risks overfitting; a higher lambda penalizes large weights and so guards against overfitting.
    - The use function applies the trained weights to produce a hypothesis for a given set of inputs.
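
To see why the log loss pairs well with a softmax output, the sketch below checks numerically that the gradient of $-\sum T \log Y$ with respect to the output scores is simply $Y - T$, the negative of the $T - Y$ error term used in the class code. The random scores, the one-hot targets and the helper names are assumptions made for this example.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(T, scores):
    """Log loss -sum(T * log(Y)) with Y = softmax(scores) and one-hot targets T."""
    return -np.sum(T * np.log(softmax(scores)))

rng = np.random.default_rng(1)
scores = rng.normal(size=(4, 3))            # 4 samples, 3 classes
T = np.eye(3)[[0, 2, 1, 2]]                 # one-hot (indicator) targets

# Analytic gradient of the loss with respect to the scores.
analytic = softmax(scores) - T

# Central finite differences as an independent check.
numeric = np.zeros_like(scores)
eps = 1e-6
for idx in np.ndindex(*scores.shape):
    bumped_up = scores.copy()
    bumped_down = scores.copy()
    bumped_up[idx] += eps
    bumped_down[idx] -= eps
    numeric[idx] = (cross_entropy(T, bumped_up)
                    - cross_entropy(T, bumped_down)) / (2 * eps)

gradient_matches = np.allclose(analytic, numeric, atol=1e-6)
print(gradient_matches)
```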


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from grad import scg, steepest
import copy
from util import Standardizer


class NeuralNetLog:
    def __init__(self, nunits):
        self._nLayers=len(nunits)-1
        self.rho = [1] * self._nLayers
        self._W = []
        wdims = []
        lenweights = 0
        for i in range(self._nLayers):
            nwr = nunits[i] + 1
            nwc = nunits[i+1]
            wdims.append((nwr, nwc))
            lenweights = lenweights + nwr * nwc

        self._weights = np.random.uniform(-0.1,0.1, lenweights) 
        start = 0  # fixed index error 20110107
        for i in range(self._nLayers):
            end = start + wdims[i][0] * wdims[i][1] 
            self._W.append(self._weights[start:end])
            self._W[i].resize(wdims[i])
            start = end

        self.stdX = None
        self.stdT = None
        self.stdTarget = True

    def add_ones(self, w):
        return np.hstack((np.ones((w.shape[0], 1)), w))

    def get_nlayers(self):
        return self._nLayers

    def set_hunit(self, w):
        for i in range(self._nLayers-1):
            if w[i].shape != self._W[i].shape:
                print("set_hunit: shapes do not match!")
                break
            else:
                self._W[i][:] = w[i][:]

    def pack(self, w):
        # np.hstack needs a sequence; in Python 3, map() returns an iterator
        return np.hstack([np.ravel(m) for m in w])

    def unpack(self, weights):
        self._weights[:] = weights[:]  # unpack

    def cp_weight(self):
        return copy.copy(self._weights)

    def RBF(self, X, m=None,s=None):
        if m is None: m = np.mean(X)
        if s is None: s = 2 #np.std(X)
        r = 1. / (np.sqrt(2*np.pi)* s)  
        return r * np.exp(-(X - m) ** 2 / (2 * s ** 2))

    def forward(self,X):
        t = X 
        Z = []

        for i in range(self._nLayers):
            Z.append(t) 
            if i == self._nLayers - 1:
                # Output layer: element-wise logistic sigmoid, so each output lies in (0, 1).
                # (A softmax output would instead divide np.exp of this linear output
                # by its sum over the classes, i.e. over axis=1.)
                t = 1/(1+np.exp(-np.dot(self.add_ones(t), self._W[i])))
            else:
                t = np.tanh(np.dot(self.add_ones(t), self._W[i]))
                #t = self.RBF(np.dot(np.hstack((np.ones((t.shape[0],1)),t)),self._W[i]))
        return (t, Z)
        
    def backward(self, error, Z, T, lmb=0):
        delta = error
        N = T.size
        dws = []
        for i in range(self._nLayers - 1, -1, -1):
            rh = float(self.rho[i]) / N
            if i==0:
                lmbterm = 0
            else:
                lmbterm = lmb * np.vstack((np.zeros((1, self._W[i].shape[1])),
                            self._W[i][1:,]))
            dws.insert(0,(-rh * np.dot(self.add_ones(Z[i]).T, delta) + lmbterm))
            if i != 0:
                #print(delta)
                #print("p2")
                #print(Z)
                #print("p3")
                #print(self._W[i][1:, :].T)
                delta = np.dot(delta, self._W[i][1:, :].T) * (1 - Z[i]**2)
        return self.pack(dws)

    def _errorf(self, T, Y):
        return T - Y
        
    def _objectf(self, T, Y, wpenalty):
        return -(np.sum( np.sum((T * np.log(Y)) , axis=1), axis=0)) + wpenalty

    def train(self, X, T,**params):
        verbose = params.pop('verbose', False)
        # training parameters
        _lambda = params.pop('Lambda', 0)

        #parameters for scg
        niter = params.pop('niter', 1000)
        wprecision = params.pop('wprecision', 1e-10)
        fprecision = params.pop('fprecision', 1e-10)
        wtracep = params.pop('wtracep', False)
        ftracep = params.pop('ftracep', False)

        # optimization
        optim = params.pop('optim', 'scg')

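        if self.stdX is None:
            explore = params.pop('explore', False)
            self.stdX = Standardizer(X, explore)
        Xs = self.stdX.standardize(X)
        # Targets are class indicators, so they are deliberately left unstandardized
        # (the trailing "and False" disables this branch)
        if self.stdT is None and self.stdTarget and False:
            self.stdT = Standardizer(T)
            T = self.stdT.standardize(T)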
        
        def gradientf(weights):
            self.unpack(weights)
            Y,Z = self.forward(Xs)
            error = self._errorf(T, Y)
            return self.backward(error, Z, T, _lambda)
            
        def optimtargetf(weights):
            """ optimization target function : cross-entropy plus weight penalty
            """
            self.unpack(weights)
            #self._weights[:] = weights[:]  # unpack
            Y,_ = self.forward(Xs)
            Wnb=np.array([])
            for i in range(self._nLayers):
                if len(Wnb)==0: Wnb=self._W[i][1:,].reshape(self._W[i].size-self._W[i][0,].size,1)
                else: Wnb = np.vstack((Wnb,self._W[i][1:,].reshape(self._W[i].size-self._W[i][0,].size,1)))
            wpenalty = _lambda * np.dot(Wnb.flat ,Wnb.flat)
            return self._objectf(T, Y, wpenalty)

        if optim == 'scg':
            result = scg(self.cp_weight(), gradientf, optimtargetf,
                                        wPrecision=wprecision, fPrecision=fprecision, 
                                        nIterations=niter,
                                        wtracep=wtracep, ftracep=ftracep,
                                        verbose=False)
            self.unpack(result['w'][:])
            self.f = result['f']
        elif optim == 'steepest':
            result = steepest(self.cp_weight(), gradientf, optimtargetf,
                                nIterations=niter,
                                xPrecision=wprecision, fPrecision=fprecision,
                                xtracep=wtracep, ftracep=ftracep )
            self.unpack(result['w'][:])
        if ftracep:
            self.ftrace = result['ftrace']
        if 'reason' in result.keys() and verbose:
            print(result['reason'])

        return result

    def use(self, X, retZ=False):
        if self.stdX:
            Xs = self.stdX.standardize(X)
        else:
            Xs = X
        Y, Z = self.forward(Xs)
        if self.stdT is not None:
            Y = self.stdT.unstandardize(Y)
        if retZ:
            return Y, Z
        return Y

# Examination of correct implementation (NonlinearLogReg) with toy data.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from mpl_toolkits import mplot3d
import copy


df = pd.read_csv("data.csv",sep=",")

In [None]:
print("done")
df

In [None]:
#from scipy.stats import itemfreq
#np.asmatrix(humid).shape[0]
#np.asmatrix(temper).shape[0]
#copy.copy([1,2,3])
print(df.values.shape)
# .ix was removed from pandas; .iloc selects the same columns by position
groupwithout.iloc[:,[2,3,4]]
#groupwithout.iloc[:,[0:3]]
X = df.iloc[:,[1,3,4,5,6,7,8,9,10]].values
X = X[:35000,:]
T = df.iloc[:,[2]].values

T = T[:35000,:]
print(type(T))


classes = np.unique(T)
numclasses = len(np.unique(T))
print("numclasses")
print(numclasses)
base_class = np.array([0 for i in range(numclasses)])

new_T = np.array([base_class]*T.shape[0])

print(new_T)
print(new_T.shape)

for i in range(X.shape[0]):
    #print(T[i])
    #print(np.where(classes==T[i]))
    new_T[i][np.where(classes==T[i])] = 1
    

 

print(new_T)
print(new_T.shape)
trainnet = NeuralNetLog([X.shape[1],20,numclasses])
trainnet.train(X,new_T,ftracep=True)
#tranans,z = trainnet.use(X,retZ=True)

In [None]:
print("done")

In [None]:
from sklearn.datasets import make_circles
from sklearn.preprocessing import OneHotEncoder

X, T = make_circles(n_samples=800, noise=0.07, factor=0.4)
new_T = np.array([[0,0]]*X.shape[0]) #one hot T
for i in range(len(T)):
    new_T[i][T[i]] = 1
    
print(X)
print(new_T)

plt.scatter(X[:,0],X[:,1])

trainnet = NeuralNetLog([X.shape[1],11,2])
trainnet.train(X,new_T,ftracep=True)
tranans,z = trainnet.use(X,retZ=True)

#cc = CrossValid(X,new_T)
#cc.five_fold_classify()

#new_T = np.
print(tranans)

In [None]:
Xans = np.append(X,tranans.argmax(axis=1).reshape(len(tranans),1),1)
#Xans = np.append(X,tranans,1)
print(T.shape)
Xtrain = np.append(X,T.reshape(len(T),1),1)

print(X)
print(X.shape)
n = np.unique(Xtrain[:,2])
groupednp = np.array( [ list(Xtrain[Xtrain[:,2]==i,0]) for i in n])
groupednp2 = np.array( [ list(Xtrain[Xtrain[:,2]==i,1]) for i in n])

n = np.unique(Xans[:,2])
groupednpans = np.array( [ list(Xans[Xans[:,2]==i,0]) for i in n])
groupednpans2 = np.array( [ list(Xans[Xans[:,2]==i,1]) for i in n])

print(groupednp.shape)
print(groupednpans.shape)
print(groupednpans2.shape)

i=1


colors = "bgrcmyk"
fig = plt.figure(figsize=(15,8))
plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
ax1 = fig.add_subplot(111)

#color = colors[T.argmax(0)]
for i in range(len(groupednp)):
    groupednp[i] = (np.array(groupednp[i]) - np.array(groupednp[i]).min())/(np.array(groupednp[i]).max() - np.array(groupednp[i]).min())
    groupednp2[i] = (np.array(groupednp2[i]) - np.array(groupednp2[i]).min())/(np.array(groupednp2[i]).max() - np.array(groupednp2[i]).min())
    ax1.scatter(groupednp[i],groupednp2[i])
    
#fig = plt.figure(figsize=(15,8))
plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
ax1 = fig.add_subplot(111)

#color = colors[T.argmax(0)]
for i in range(len(groupednpans)):
    groupednpans[i] = (np.array(groupednpans[i]) - np.array(groupednpans[i]).min())/(np.array(groupednpans[i]).max() - np.array(groupednpans[i]).min())
    groupednpans2[i] = (np.array(groupednpans2[i]) - np.array(groupednpans2[i]).min())/(np.array(groupednpans2[i]).max() - np.array(groupednpans2[i]).min())
    
    #ax1.scatter(groupednpans[i],groupednpans2[i])
    

    
#plt.plot(humid,Ytest,color="orange")
print("DONE")

The classification runs on the toy data but misclassifies some points, because the hyperparameters are tuned for the large weather dataset used in this assignment rather than for this toy problem.
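
The one-hot target matrices built with explicit loops in the cells above can also be produced in one step. The following is a minimal sketch, assuming the labels fit in a 1-D array; the helper name `one_hot` and the sample labels are illustrative and not part of the original notebook.

```python
import numpy as np

def one_hot(labels):
    """Return (encoded, classes): one row per sample, one column per class."""
    labels = np.asarray(labels).ravel()
    # classes is sorted; idx maps every sample to its class position
    classes, idx = np.unique(labels, return_inverse=True)
    encoded = np.zeros((labels.size, classes.size), dtype=int)
    encoded[np.arange(labels.size), idx] = 1
    return encoded, classes

# hypothetical labels, only for illustration
encoded, classes = one_hot(["No", "Yes", "No", "Yes", "Yes"])
print(classes)
print(encoded)
```

Taking `encoded.argmax(axis=1)` recovers the class index of each sample, which is the same operation applied to the network output when plotting predictions.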

In [None]:
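#from scipy.stats import itemfreq  # unused here; itemfreq was removed from recent SciPy releases
#np.asmatrix(humid).shape[0]
#np.asmatrix(temper).shape[0]
#copy.copy([1,2,3])
print(df.values.shape)
# .ix was removed from pandas; .iloc selects the same columns by position
groupwithout.iloc[:,[2,3,4]]
#groupwithout.iloc[:,[0:3]]
X = df.iloc[:,[3,4,5,7,10,11,12,13,14,15,16,17,18,19,21]].values
X = X[:35000,:]
T = df.iloc[:,[20]].values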

T = T[:35000,:]
print(type(T))


classes = np.unique(T)
numclasses = len(np.unique(T))

base_class = np.array([0 for i in range(numclasses)])

new_T = np.array([base_class]*T.shape[0])

print(new_T)
print(new_T.shape)

for i in range(X.shape[0]):
    #print(T[i])
    #print(np.where(classes==T[i]))
    new_T[i][np.where(classes==T[i])] = 1
    

 

print(new_T)
print(new_T.shape)
trainnet = NeuralNetLog([X.shape[1],20,numclasses])
trainnet.train(X,new_T,ftracep=True)
tranans,z = trainnet.use(X,retZ=True)

In [None]:
print("Done")
print(tranans[12141])
print(len(np.unique(tranans,axis=1)))
print(np.unique(tranans.argmax(axis=1)))
print(tranans.shape)
print(tranans.argmax(axis=1).shape)

In [None]:
Xans = np.append(X,tranans.argmax(axis=1).reshape(len(tranans),1),1)
#Xans = np.append(X,tranans,1)
Xtrain = np.append(X,T,1)

print(X)
print(X.shape)
n = np.unique(Xtrain[:,15])
groupednp = np.array( [ list(Xtrain[Xtrain[:,15]==i,1]) for i in n])
groupednp2 = np.array( [ list(Xtrain[Xtrain[:,15]==i,2]) for i in n])

n = np.unique(Xans[:,15])
groupednpans = np.array( [ list(Xans[Xans[:,15]==i,1]) for i in n])
groupednpans2 = np.array( [ list(Xans[Xans[:,15]==i,2]) for i in n])

print(groupednp.shape)
print(groupednpans.shape)
print(groupednpans2.shape)

i=1

In [None]:
import matplotlib.pyplot as plt

colors = "bgrcmyk"
fig = plt.figure(figsize=(15,8))
plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
ax1 = fig.add_subplot(111)

#color = colors[T.argmax(0)]
for i in range(len(groupednp)):
    groupednp[i] = (np.array(groupednp[i]) - np.array(groupednp[i]).min())/(np.array(groupednp[i]).max() - np.array(groupednp[i]).min())
    groupednp2[i] = (np.array(groupednp2[i]) - np.array(groupednp2[i]).min())/(np.array(groupednp2[i]).max() - np.array(groupednp2[i]).min())
    ax1.scatter(groupednp[i],groupednp2[i])
    
fig = plt.figure(figsize=(15,8))
plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
ax1 = fig.add_subplot(111)

#color = colors[T.argmax(0)]
for i in range(len(groupednpans)):
    groupednpans[i] = (np.array(groupednpans[i]) - np.array(groupednpans[i]).min())/(np.array(groupednpans[i]).max() - np.array(groupednpans[i]).min())
    groupednpans2[i] = (np.array(groupednpans2[i]) - np.array(groupednpans2[i]).min())/(np.array(groupednpans2[i]).max() - np.array(groupednpans2[i]).min())
    
    ax1.scatter(groupednpans[i],groupednpans2[i])
    

    
#plt.plot(humid,Ytest,color="orange")
print("DONE")

In [None]:
print(len(groupednpans))

In [None]:
Xans = np.append(X,tranans.argmax(axis=1).reshape(len(tranans),1),1)
#Xans = np.append(X,tranans,1)
Xtrain = np.append(X,T,1)

print(X)
print(X.shape)
n = np.unique(Xtrain[:,15])
groupednp = np.array( [ list(Xtrain[Xtrain[:,15]==i,2]) for i in n])
groupednp2 = np.array( [ list(Xtrain[Xtrain[:,15]==i,3]) for i in n])

n = np.unique(Xans[:,15])
groupednpans = np.array( [ list(Xans[Xans[:,15]==i,2]) for i in n])
groupednpans2 = np.array( [ list(Xans[Xans[:,15]==i,3]) for i in n])

print(groupednp.shape)
print(groupednpans.shape)
print(groupednpans2.shape)

i=1

In [None]:
import matplotlib.pyplot as plt

colors = "bgrcmyk"
fig = plt.figure(figsize=(15,8))
ax1 = fig.add_subplot(111)
plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
#color = colors[T.argmax(0)]
for i in range(len(groupednp)):
    groupednp[i] = (np.array(groupednp[i]) - np.array(groupednp[i]).min())/(np.array(groupednp[i]).max() - np.array(groupednp[i]).min())
    groupednp2[i] = (np.array(groupednp2[i]) - np.array(groupednp2[i]).min())/(np.array(groupednp2[i]).max() - np.array(groupednp2[i]).min())
    ax1.scatter(groupednp[i],groupednp2[i])
    
fig = plt.figure(figsize=(15,8))
ax1 = fig.add_subplot(111)
plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
#color = colors[T.argmax(0)]
for i in range(len(groupednpans)):
    groupednpans[i] = (np.array(groupednpans[i]) - np.array(groupednpans[i]).min())/(np.array(groupednpans[i]).max() - np.array(groupednpans[i]).min())
    groupednpans2[i] = (np.array(groupednpans2[i]) - np.array(groupednpans2[i]).min())/(np.array(groupednpans2[i]).max() - np.array(groupednpans2[i]).min())
    
    ax1.scatter(groupednpans[i],groupednpans2[i])
    

    
#plt.plot(humid,Ytest,color="orange")
print("DONE")


## III.A 5-fold Cross Validation

- Explain and use 5-fold cross validation to find a good neural network parameters including the structure to report the CV accuracies. 
Five-fold cross validation is used during both the training phase and the testing phase. The data is split into five parts. While a model is trained on some of the parts, it is validated on a held-out part, so that each candidate network structure yields a set of trained weights together with a validation score. <br>

The key ideas behind using 5-fold cross validation are as follows: <br><br>
1) Train in chunks and evaluate the success rate at the same time, so that good hyperparameters can be identified while training proceeds. Validation also helps to catch overfitting and to adjust the number of epochs if the model starts to overfit after a certain point. <br>

2) Split the training and test data to check whether the algorithm and its weights fit the use case in general, or only the given training data. The goal is a good fit to the data rather than a set of weights that matches every training point (overfitting). Underfitting can be addressed in the same process by increasing the number of epochs or adjusting the learning rate, based on the performance on the validation set. <br>

3) Adjust the hyperparameters. Tuning is often essential when a neural net is used to classify large-scale data, and cross validation gives a way to assess progress and decide how to modify the hyperparameters to find the right fit. One common hyperparameter that can be identified this way is the regularization parameter, which penalizes large weights and therefore overly complex fits. <br>

4) The five-fold cross validation is implemented in a class that calculates the RMSE for regression and the prediction accuracy for classification.
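
For comparison, the textbook form of k-fold cross validation can be sketched as below. This is a generic illustration and not the `CrossValid` class used in this notebook: the names `k_fold_indices`, `cross_validate`, `fit` and `score` are assumptions, and a least-squares line stands in for the neural network so that the example runs on its own.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx); every fold serves as validation set exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_validate(fit, score, X, T, k=5, seed=0):
    """fit(X, T) returns a model; score(model, X, T) returns a number (both hypothetical)."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k, seed):
        model = fit(X[train_idx], T[train_idx])
        scores.append(score(model, X[val_idx], T[val_idx]))
    return np.array(scores)

def fit(X, T):
    # stand-in model: least-squares line with a bias column
    A = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(A, T, rcond=None)
    return w

def score(w, X, T):
    # RMSE of the stand-in model on held-out data
    A = np.hstack([np.ones((len(X), 1)), X])
    return float(np.sqrt(np.mean((A @ w - T) ** 2)))

X_demo = np.linspace(0, 1, 50).reshape(-1, 1)
T_demo = 3.0 * X_demo + 1.0
cv_scores = cross_validate(fit, score, X_demo, T_demo)
print(cv_scores)
```

To compare network structures, `fit` would train one candidate structure and the mean of `cv_scores` would be compared across candidates.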

## CODE EXPLANATION

The code below is built around a 5-fold split in which three parts are meant for training, one part for side-by-side validation and one for testing. The training data is processed in chunks, with 10000 rows as the threshold that decides whether the data is chunked. Each chunk trains one of the candidate structures in `models`, which is then scored on a randomly chosen validation part. <br><br>

The code also has one error function for regression and one for classification. For regression, the error function is the root mean squared error (RMSE), which measures the overall error of a given prediction and is reported for each chunk. For classification, it is the percentage of samples whose predicted class matches the target class.

The type of problem, regression or classification, determines which five-fold routine and which error display are applied.
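
The two error measures described above can be sketched as standalone functions. This is a minimal illustration assuming single-column regression targets and one-hot classification targets; the function names `rmse` and `accuracy_percent` and the sample arrays are made up for this example.

```python
import numpy as np

def rmse(T, Y):
    """Root mean squared error between targets T and predictions Y."""
    return float(np.sqrt(np.mean((T - Y) ** 2)))

def accuracy_percent(T_onehot, Y_scores):
    """Percentage of rows where the highest-scoring output matches the target class."""
    hits = T_onehot.argmax(axis=1) == Y_scores.argmax(axis=1)
    return float(100.0 * np.mean(hits))

# regression: one prediction is off by 2, so RMSE = sqrt(4 / 3)
T_reg = np.array([[1.0], [2.0], [3.0]])
Y_reg = np.array([[1.0], [2.0], [5.0]])
rmse_value = rmse(T_reg, Y_reg)

# classification: three of four rows are predicted correctly
T_cls = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
Y_cls = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
acc_value = accuracy_percent(T_cls, Y_cls)

print(rmse_value, acc_value)
```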

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
import random

class CrossValid:
    def __init__(self,X,Y,typ="reg"):
        self.X = X
        self.T = Y
        self.typ = typ
    
    def error_func_regress(self,myanswer,Y):
        return (np.sum(((Y-myanswer)**2)/Y.shape[0]))**0.5
    
    def error_func_classify(self,T,Y):
        return float(100 - ((np.count_nonzero(np.abs(T.argmax(axis=1) - Y.argmax(axis=1)))))*100/T.shape[0])
            
        
    def five_fold_regress(self):
        print(self.X.shape)
        part = int(self.X.shape[0]/5)
        train = self.X[:5*part+1,:]
        trainanswers = self.T[:5*part+1,:]
        validate = self.X[3*part+1:4*part+1,:]
        validateanswers = self.T[3*part+1:4*part+1:,:]
        test = self.X[4*part+1:5*part:,:]
        testanswers = self.T[4*part+1:5*part:,:]
        trainnet = NeuralNet([self.X.shape[1],3,1])
        models = [[self.X.shape[1],2,1],[self.X.shape[1],3,1],[self.X.shape[1],11,1],[self.X.shape[1],7,1],[self.X.shape[1],12,1],[self.X.shape[1],3,1]]
        
        #trainnet.train(train[0:9999,:],trainanswers[0:9999,:])
        
        if part > 10000:
            fold = part
        else:
            fold = self.X.shape[0]
            part = fold
        
        for i in range(0,train.shape[0],fold):
            trainnet = NeuralNet(models[int(i/fold)])
            rn = random.randint(0,4)
            validate = self.X[(rn*part)+1:((rn+1)*part)+1,:]
            validateanswers = self.T[(rn*part)+1:((rn+1)*part)+1,:]
            if train.shape[0] < i + fold-1:
                trainnet.train(train[i:i+train.shape[0]-1,:],trainanswers[i:i+train.shape[0]-1,:],ftracep=True)
                
            else:
                trainnet.train(train[i:i+fold-1,:],trainanswers[i:i+fold-1,:],ftracep=True)
            
            myans,z = trainnet.use(validate,retZ=True)
            print("\n -------------------------- \n")
            print("RMSE IN PART " + str(int(i/fold) + 1) + " IS: ")
            print(self.error_func_regress(myans,validateanswers))
            print("\n --------------------------\n")
        myans,z = trainnet.use(test,retZ=True)
        print("RMSE IN TEST IS: ")
        print(self.error_func_regress(myans,testanswers))
        tryans,z = trainnet.use(self.X,retZ=True)
        print(self.T.shape)
        print(self.X.shape)
        if self.X.shape[1] == 1:
            plt.figure(figsize=(15,8))
            plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
            plt.scatter(self.X,self.T)
            xs, ys = zip(*sorted(zip(self.X, tryans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(self.X,tryans,color="orange")
            plt.figure(figsize=(15,8))
            plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
            plt.scatter(test,testanswers)  
            xs, ys = zip(*sorted(zip(test, myans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(test,myans,color="orange")
            
        else:
            #print(self.X[:,0].shape)
            plt.figure(figsize=(15,8))
            plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
            plt.scatter(self.X[:,0],self.T)  
            xs, ys = zip(*sorted(zip(self.X[:,0], tryans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(self.X[:,0],tryans,color="orange")
            plt.figure(figsize=(15,8))
            plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
            plt.scatter(test[:,0],testanswers)  
            xs, ys = zip(*sorted(zip(test[:,0], myans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(test[:,0],myans,color="orange")
            #print(self.X[:,2].shape)
            plt.figure(figsize=(15,8))
            plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
            plt.scatter(self.X[:,2],self.T)
            xs, ys = zip(*sorted(zip(self.X[:,2], tryans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(self.X[:,2],tryans,color="orange")
            plt.figure(figsize=(15,8))
            plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
            
            plt.scatter(test[:,2],testanswers)  
            xs, ys = zip(*sorted(zip(test[:,2], myans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(test[:,2],myans,color="orange")
            #print(self.X[:,2].shape)
            plt.figure(figsize=(15,8))
            plt.scatter(self.X[:,5],self.T)  
            xs, ys = zip(*sorted(zip(self.X[:,5], tryans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(self.X[:,5],tryans,color="orange")
            plt.figure(figsize=(15,8))
            plt.scatter(test[:,5],testanswers)  
            xs, ys = zip(*sorted(zip(test[:,5], myans)))

            plt.plot(xs, ys,color="orange")
            #plt.plot(test[:,5],myans,color="orange")
            
    def five_fold_classify(self,numclasses):
        print(self.X.shape)
        part = int(self.X.shape[0]/5)
        train = self.X[:5*part+1,:]
        trainanswers = self.T[:5*part+1,:]
        validate = self.X[3*part+1:4*part+1,:]
        validateanswers = self.T[3*part+1:4*part+1:,:]
        test = self.X[4*part+1:5*part:,:]
        testanswers = self.T[4*part+1:5*part:,:]
        #trainnet = NeuralNetLog([self.X.shape[1],5,2])
        #trainnet.train(self.X,self.T)
        #return
        #trainnet.train(train[0:9999,:],trainanswers[0:9999,:])
        
        models = [[self.X.shape[1],25,numclasses],[self.X.shape[1],11,numclasses],[self.X.shape[1],3,numclasses],[self.X.shape[1],10,numclasses],[self.X.shape[1],20,numclasses]]
        fold = part
        for i in range(0,train.shape[0],fold):
            trainnet = NeuralNet(models[int(i/fold)])
            rn = random.randint(0,4)
            validate = self.X[(rn*part)+1:((rn+1)*part)+1,:]
            validateanswers = self.T[(rn*part)+1:((rn+1)*part)+1,:]
            if train.shape[0] < i + fold -1:
                trainnet.train(train[i:i+train.shape[0]-1,:],trainanswers[i:i+train.shape[0]-1,:],ftracep=True)
                
            else:
                trainnet.train(train[i:i+fold-1,:],trainanswers[i:i+fold-1,:],ftracep=True)
            
            myans,z = trainnet.use(validate,retZ=True)
            print("\n -------------------------- \n")
            print("ACCURACY IN PART " + str(int(i/fold) + 1) + " IS: ")
            print(self.error_func_classify(myans,validateanswers))
            print("\n --------------------------\n")
        myans,z = trainnet.use(test,retZ=True)
        print("ACCURACY IN TEST IS: ")
        print(self.error_func_classify(myans,testanswers))
        tryans,z = trainnet.use(self.X,retZ=True)
        print(self.T.shape)
        print(self.X.shape)
        #print(self.X[:,0].shape)
        #Xans = np.append(test,myans,1)
        Xans = np.append(test,myans.argmax(axis=1).reshape(len(myans),1),1)
        # append the class index, not the one-hot rows, so the label is a single column
        Xtrain = np.append(test,testanswers.argmax(axis=1).reshape(len(testanswers),1),1)

        print(test)
        print(test.shape)
        label_col = test.shape[1]  # position of the appended class-label column
        n = np.unique(Xtrain[:,label_col])
        # plain lists of arrays, because the class groups may differ in size
        groupednp = [Xtrain[Xtrain[:,label_col]==i,1] for i in n]
        groupednp2 = [Xtrain[Xtrain[:,label_col]==i,2] for i in n]

        n = np.unique(Xans[:,label_col])
        groupednpans = [Xans[Xans[:,label_col]==i,1] for i in n]
        groupednpans2 = [Xans[Xans[:,label_col]==i,2] for i in n]
        
        fig = plt.figure(figsize=(15,8))
        ax1 = fig.add_subplot(111)
        plt.title("ACTUAL MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
        #color = colors[T.argmax(0)]
        for i in range(len(groupednp)):
            groupednp[i] = (np.array(groupednp[i]) - np.array(groupednp[i]).min())/(np.array(groupednp[i]).max() - np.array(groupednp[i]).min())
            groupednp2[i] = (np.array(groupednp2[i]) - np.array(groupednp2[i]).min())/(np.array(groupednp2[i]).max() - np.array(groupednp2[i]).min())
            ax1.scatter(groupednp[i],groupednp2[i])
    
        fig = plt.figure(figsize=(15,8))
        ax1 = fig.add_subplot(111)
        plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
        #color = colors[T.argmax(0)]
        for i in range(len(groupednpans)):
            groupednpans[i] = (np.array(groupednpans[i]) - np.array(groupednpans[i]).min())/(np.array(groupednpans[i]).max() - np.array(groupednpans[i]).min())
            groupednpans2[i] = (np.array(groupednpans2[i]) - np.array(groupednpans2[i]).min())/(np.array(groupednpans2[i]).max() - np.array(groupednpans2[i]).min())
    
            ax1.scatter(groupednpans[i],groupednpans2[i])
        
        """
            
        
        
        
        plt.figure(figsize=(15,8))
        plt.scatter(self.X[:,0],self.T)  
        plt.plot(self.X[:,0],tryans,color="orange")
        plt.figure(figsize=(15,8))
        plt.scatter(test[:,0],testanswers)  
        plt.plot(test[:,0],myans,color="orange")
        #print(self.X[:,2].shape)
        plt.figure(figsize=(15,8))
        plt.scatter(self.X[:,2],self.T)  
        plt.plot(self.X[:,2],tryans,color="orange")
        plt.figure(figsize=(15,8))
        plt.scatter(test[:,2],testanswers)  
        plt.plot(test[:,2],myans,color="orange")
        #print(self.X[:,2].shape)
        plt.figure(figsize=(15,8))
        plt.scatter(self.X[:,5],self.T)  
        plt.plot(self.X[:,5],tryans,color="orange")
        plt.figure(figsize=(15,8))
        plt.scatter(test[:,5],testanswers)  
        plt.plot(test[:,5],myans,color="orange")
        """
        

In [None]:
#from scipy.stats import itemfreq
#np.asmatrix(humid).shape[0]
#np.asmatrix(temper).shape[0]
#copy.copy([1,2,3])
print(df.values.shape)
# .ix was removed from pandas; .iloc selects the same columns by position
groupwithout.iloc[:,[2,3,4]]
#groupwithout.iloc[:,[0:3]]
X = df.iloc[:,[1,3,4,5,6,7,8,9,10]].values
X = X[:,:]
T = df.iloc[:,[2]].values

T = T[:,:]
print(type(T))


classes = np.unique(T)
numclasses = len(np.unique(T))
print("numclasses")
print(numclasses)
base_class = np.array([0 for i in range(numclasses)])

new_T = np.array([base_class]*T.shape[0])

print(new_T)
print(new_T.shape)

for i in range(X.shape[0]):
    #print(T[i])
    #print(np.where(classes==T[i]))
    new_T[i][np.where(classes==T[i])] = 1
    

 

print(new_T)
print(X.shape)
print(new_T.shape)


#trainnet = NeuralNetLog([X.shape[1],20,numclasses])
#trainnet.train(X,new_T,ftracep=True)
cc = CrossValid(X,new_T)
cc.five_fold_classify(numclasses)

In [None]:
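#from scipy.stats import itemfreq  # unused here; itemfreq was removed from recent SciPy releases
#np.asmatrix(humid).shape[0]
#np.asmatrix(temper).shape[0]
#copy.copy([1,2,3])
print(df.values.shape)
# .ix was removed from pandas; .iloc selects the same columns by position
groupwithout.iloc[:,[2,3,4]]
#groupwithout.iloc[:,[0:3]]
X = df.iloc[:,[3,4,5,7,10,11,12,13,14,15,16,17,18,19,21]].values
X = X[:35000,:]
T = df.iloc[:,[20]].values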

T = T[:35000,:]
print(type(T))


classes = np.unique(T)
numclasses = len(np.unique(T))

base_class = np.array([0 for i in range(numclasses)])

new_T = np.array([base_class]*T.shape[0])

print(new_T)
print(new_T.shape)

for i in range(X.shape[0]):
    #print(T[i])
    #print(np.where(classes==T[i]))
    new_T[i][np.where(classes==T[i])] = 1
    

#trainnet = NeuralNetLog([X.shape[1],20,numclasses])
#trainnet.train(X,new_T,ftracep=True)
 

print(new_T)
print(new_T.shape)

In [None]:
print("Done")



In [None]:
cc = CrossValid(X,new_T)
cc.five_fold_classify(numclasses)

In [None]:
trainnet = NeuralNet([1,3,1])
humid = humid.reshape(humid.shape[0],1)

temper = temper.reshape(temper.shape[0],1)
print(temper.shape)
print(humid.shape)
result = trainnet.train(humid, temper, ftracep=True)
#print(result["ftrace"])
Ytest, Z = trainnet.use(humid, retZ=True)

#print(Ytest)
#print("done")
humid.shape
plt.figure(figsize=(15,8))
plt.title("PREDICTED MULTI DIMENSION ACTUALIZED IN TWO DIMENSIONS")
plt.scatter(humid,temper)  
plt.plot(humid,Ytest,color="orange")

In [None]:
#result = trainnet.train(humid, temper, ftracep=True)
#print(result["ftrace"])
#Ytest, Z = trainnet.use(humid, retZ=True)
#print(temper.shape)

cc = CrossValid(humid,temper)
cc.five_fold_regress()


In [None]:
# Data NORMALIZATION
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from mpl_toolkits import mplot3d
import copy


df = pd.read_csv("ausweather_preprocessed.csv",sep="\t")


# select the numeric columns with a list; "Location" comes back through reset_index()
grouped_by_month_rainfall = df.groupby([df.Date.str[:11],"Location"])[["MaxTemp","Humidity9am"]].mean().reset_index()

#print grouped_by_month_rainfall
grouped_2013 = grouped_by_month_rainfall
grouped_2013 = grouped_by_month_rainfall[ (grouped_2013["Location"]=="Katherine")  | (grouped_2013["Location"]=="Bendigo") ]

#print grouped_2013



#plt.figure(figsize=(15,8))
#plt.title("Humidity vs Max Temperature of Bendigo")
#plt.xlabel('Humidity', fontsize=18)
#plt.ylabel('Max Temperature', fontsize=16)
#EXAMPLE OF A SIMPLE TWO DIVISION CLUSTER. DATA FROM EITHER ONE OF THE SOURCES CAN BE CLASSIFIED TO EITHER ONE BASED ON SOME CLASSIFICATION ALGORITHM
k = 0

for name,group in grouped_2013.groupby("Location"):
    if(k>0):
        break
    humidity_points = group["Humidity9am"]
    maxtemp_points = group["MaxTemp"]
    #plt.scatter(group["Humidity9am"],group["MaxTemp"])
    k+=1


humid = np.array(df["Humidity9am"])
temper = np.array(df["MaxTemp"])

humid_normal = humid
temper_normal = temper

humid_max = humid.max()
humid_min = humid.min()
humid_avg = humid.mean()

temper_max = temper.max()
temper_min = temper.min()
temper_avg = temper.mean()


for i in range(len(humid)):
    humid[i] = (humid[i] - humid_avg)/(humid_max - humid_min)
    
for i in range(len(temper)):
    temper[i] = (temper[i] - temper_avg)/(temper_max - temper_min)
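
The element-by-element loops above apply mean normalization, (x - mean) / (max - min). The same scaling can be written as one vectorized expression. The sketch below is a hedged alternative, not the code used in this notebook: the helper name `mean_normalize` and the sample values are illustrative, and the zero-range guard is an added assumption.

```python
import numpy as np

def mean_normalize(values):
    """Return (x - mean) / (max - min) as a new float array; the input is left unchanged."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # constant input: every value equals the mean
        return np.zeros_like(values)
    return (values - values.mean()) / span

sample = np.array([10.0, 20.0, 30.0, 40.0])
normalized = mean_normalize(sample)
print(normalized)
```

Because the function returns a new array, the raw values stay available, which a plain assignment such as `humid_normal = humid` does not guarantee when `humid` is later modified in place.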

In [None]:
print("done")
humid.shape

In [None]:
trainnet = NeuralNet([1,1,1])
humid = humid.reshape(humid.shape[0],1)

temper = temper.reshape(temper.shape[0],1)
print(temper.shape)
print(humid.shape)
result = trainnet.train(humid, temper, ftracep=True)
print(result["ftrace"])
Ytest, Z = trainnet.use(humid, retZ=True)

print(Ytest)
plt.figure(figsize=(15,8))
plt.scatter(humid,temper) 
xs, ys = zip(*sorted(zip(humid, Ytest)))

plt.plot(xs, ys,color="orange")
#plt.plot(humid,Ytest,color="orange")

In [None]:
cc = CrossValid(humid,temper)
cc.five_fold_regress()

In [None]:
#np.asmatrix(humid).shape[0]
#np.asmatrix(temper).shape[0]
#copy.copy([1,2,3])
print(df.values.shape)
# .ix was removed from pandas; .iloc selects the same columns by position
groupwithout.iloc[:,[2,3,4]]
#groupwithout.iloc[:,[0:3]]
X = df.iloc[:,[3,4,7,10,11,12,13,14,15,16,17,18,19,21]].values

T = df.iloc[:,[5]].values

X_max = X.max(axis=0)
X_min = X.min(axis=0)
X_avg = X.mean(axis=0)

T_max = T.max()
T_min = T.min()
T_avg = T.mean()


print(X_max.shape)
for i in range(X.shape[0]):
    #print(X[i,:].shape)
    #print(X_avg.shape)
    #print(X_min.shape)
    #print(X_max.shape)
    X[i,:] = (X[i,:] - X_avg)/(X_max - X_min)
print(X.shape)
    
for i in range(len(T)):
    T[i] = (T[i] - T_avg)/(T_max - T_min)
    
print(X.shape)
print(T.shape)
#groupwithout["MaxTemp"],groupwithout["Humidity9am"],groupwithout["Rainfall"]
#trainnet = NeuralNet([X.shape[1],10,1])
#trainnet.train(X,T,ftracep=True)


#answers,z = trainnet.use(X, retZ=True)
cc = CrossValid(X,T)
cc.five_fold_regress()

In [None]:
trainnet = NeuralNet([X.shape[1],3,1])
trainnet.train(X,T,ftracep=True)


answers,z = trainnet.use(X, retZ=True)

In [None]:
print("Done")

plt.figure(figsize=(15,8))
print(X[:,0].shape)
plt.scatter(X[:,7],T)

xs, ys = zip(*sorted(zip(X[:,7], answers)))

plt.plot(xs, ys,color="orange")
#plt.plot(X[:,7],answers,color="orange")

plt.figure(figsize=(15,8))
print(X[:,0].shape)
plt.scatter(X[:,1],T)

xs, ys = zip(*sorted(zip(X[:,1], answers)))

plt.plot(xs, ys,color="orange")
#plt.plot(X[:,1],answers,color="orange")

plt.figure(figsize=(15,8))
print(X[:,0].shape)
plt.scatter(X[:,2],T)

xs, ys = zip(*sorted(zip(X[:,2], answers)))

plt.plot(xs, ys,color="orange")
#plt.plot(X[:,2],answers,color="orange")


In [None]:
print("test")

In [None]:
trainnet = NeuralNet([1,2,1])
humid = humid.reshape(humid.shape[0],1)

temper = temper.reshape(temper.shape[0],1)
print(humid.shape)
print(temper.shape)
cc = CrossValid(humid,temper)
cc.five_fold_regress()

trainnet.train(humid,temper,ftracep=True)
myans,z = trainnet.use(humid,retZ=True)
plt.figure(figsize=(15,8))
plt.scatter(humid,temper)  
plt.plot(humid,myans,color="orange")



In [None]:
#np.asmatrix(humid).shape[0]
#np.asmatrix(temper).shape[0]
#copy.copy([1,2,3])
# .ix was removed from pandas; .iloc selects the same columns by position
groupwithout.iloc[:,[2,3,4]]
#groupwithout.iloc[:,[0:3]]
X = groupwithout.iloc[:,[2,3]].values
T = groupwithout.iloc[:,[4]].values
print(X.shape)
print(T.shape)
#groupwithout["MaxTemp"],groupwithout["Humidity9am"],groupwithout["Rainfall"]
trainnet = NeuralNet([X.shape[1],2,1])
trainnet.train(X,T,ftracep=True)
answers,z = trainnet.use(X, retZ=True)

In [None]:
trainnet = NeuralNet([1,20,1])
humid = humid.reshape(humid.shape[0],1)

temper = temper.reshape(temper.shape[0],1)
#print(temper.shape)
#print(humid.shape)
result = trainnet.train(humid, temper, ftracep=True)
#print(result["ftrace"])
Ytest, Z = trainnet.use(humid, retZ=True)

#print(Ytest)
plt.figure(figsize=(15,8))
plt.scatter(humid,temper)  
plt.plot(humid,Ytest,color="orange")

# Parameter/network structure choice

The network structure and the hyperparameters differ between the experiments. To make clear what was chosen, the hyperparameters are listed first:

- Learning Rate
- Number of iterations
- Lambda (Regularization penalty factor)
- Number of hidden layer neurons
- Number of output layer neurons

Since this is a neural network assignment rather than a deep learning assignment, the choice was to use a single hidden layer and vary the number of neurons in it. In each case the models are chosen such that, in general, the number of hidden layer neurons is smaller than the number of input features, and the number of output neurons is smaller than the number of hidden layer neurons.

In each case, a list of candidate models with different numbers of hidden neurons is defined up front and used in the cross validation. The core idea is to fix the network structures before training begins.

For the regression, we pick different numbers of hidden layer neurons: 2, 3, 7, 11 and 12 (the `models` list in `five_fold_regress`). Each hidden neuron applies a tanh activation, which gives the model its non-linearity. The right amount of complexity is necessary to fit the training data, so the best model is picked from the candidates listed here. All of them map to a single output neuron whose output is not passed through an activation: it is the weighted sum of the hidden outputs plus the bias.
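
A forward pass through such a regression structure can be sketched as follows. This is a minimal illustration assuming one hidden tanh layer and a linear output with the bias stored in the first weight row, as in the `add_ones` helper of the class; the function name `forward_regression`, the layer sizes and the random weights are made up for this example.

```python
import numpy as np

def add_ones(A):
    """Prepend a column of ones so the first weight row acts as the bias."""
    return np.hstack((np.ones((A.shape[0], 1)), A))

def forward_regression(X, W_hidden, W_out):
    """Hidden layer uses tanh; the output layer is linear (no activation)."""
    Z = np.tanh(add_ones(X) @ W_hidden)
    Y = add_ones(Z) @ W_out
    return Y, Z

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))                   # 5 samples, 3 features
W_hidden = rng.normal(scale=0.1, size=(3 + 1, 7))  # 7 hidden neurons plus bias row
W_out = rng.normal(scale=0.1, size=(7 + 1, 1))     # 1 output neuron plus bias row
Y_demo, Z_demo = forward_regression(X_demo, W_hidden, W_out)
print(Y_demo.shape, Z_demo.shape)
```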

For the classification, we try the same hidden layer sizes: 3, 5, 7, and 11 neurons. Each hidden neuron applies a tanh activation, which gives the model its non-linearity. The right amount of complexity is needed to fit the training data, so the final model is selected from the candidates above. All of them feed the output layer, where the weighted sum of the hidden activations plus the bias is passed through an activation: sigmoid for binary classification, or softmax for multi-class classification.
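
To make the two output activations concrete, here is a small sketch of sigmoid and softmax applied to output-layer sums. The input values are toy numbers chosen for illustration, and the 0.5 decision threshold for the binary case is an assumption, not something taken from the notebook.

```python
import numpy as np

def sigmoid(a):
    """Squash each value into (0, 1); used for binary classification."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Turn each row into class probabilities that sum to 1."""
    a = a - a.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Binary case: one output value per sample
binary_sums = np.array([[-2.0], [0.0], [3.0]])
binary_probs = sigmoid(binary_sums)
binary_labels = (binary_probs >= 0.5).astype(int)  # assumed threshold of 0.5

# Multi-class case: one output value per class
multi_sums = np.array([[1.0, 2.0, 0.5],
                       [0.1, 0.1, 3.0]])
multi_probs = softmax(multi_sums)
multi_labels = multi_probs.argmax(axis=1)  # most probable class per sample
```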

We use a Lambda greater than 0 for classification, since we want the model to generalize rather than heavily overfit the training classes. In our experiments this served the purpose.
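
One common way Lambda enters the training objective is as an L2 penalty on the weights. The sketch below assumes that form, with a mean squared error term and bias weights (the first row of each weight matrix) left unpenalized; the notebook's own error function may differ, and `penalized_error` is a hypothetical name.

```python
import numpy as np

def penalized_error(Y, T, weights, lam):
    """Mean squared error plus lam times the sum of squared non-bias weights."""
    mse = np.mean((T - Y) ** 2)
    # Skip row 0 of each matrix: assumed to hold the bias weights
    penalty = lam * sum(np.sum(w[1:] ** 2) for w in weights)
    return mse + penalty

Y = np.array([[1.0], [2.0]])   # toy predictions
T = np.array([[1.0], [4.0]])   # toy targets
V = np.ones((3, 2))            # toy hidden layer weights
W = np.ones((3, 1))            # toy output layer weights

unregularized = penalized_error(Y, T, [V, W], lam=0.0)
regularized = penalized_error(Y, T, [V, W], lam=0.1)
```

With Lambda equal to 0 the penalty vanishes and only the fit to the data matters; a larger Lambda makes large weights more costly, which pushes the model toward smoother decision boundaries.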

We use a standard learning rate, kept essentially fixed, and a single hidden layer with a varying number of neurons. Each model is trained for 1000 iterations.



# PREDICTION RESULTS

Results for the two algorithms we implemented:

- For non-linear regression, the data set is large and spread over a wide range, so it is not easy to fit. The examples above show that the model does fit the data well, given the right number of neurons in the hidden layer.
- No regularization was applied for the regression. This let us visualize a highly non-linear fitted curve easily, which was the goal of the experiment.
- For non-linear logistic regression, we achieved an accuracy of 97% to 99% in some cases.
- This accuracy held on the test data and not just on the training data, which we attribute to cross validation and to well-chosen hyperparameters.
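
For reference, the two kinds of scores discussed above can be computed as in the following sketch. The function names and the arrays are toy illustrations only, not results from the experiments.

```python
import numpy as np

def percent_correct(predicted, actual):
    """Classification accuracy as a percentage of matching labels."""
    predicted = np.asarray(predicted).ravel()
    actual = np.asarray(actual).ravel()
    return 100.0 * np.mean(predicted == actual)

def rmse(Y, T):
    """Root mean squared error between regression outputs Y and targets T."""
    Y = np.asarray(Y, dtype=float)
    T = np.asarray(T, dtype=float)
    return float(np.sqrt(np.mean((T - Y) ** 2)))

# Toy labels: 4 of the 5 predictions match
toy_accuracy = percent_correct([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])

# Toy regression outputs: errors of 0 and 2
toy_rmse = rmse([[1.0], [2.0]], [[1.0], [4.0]])
```

Reporting these on held-out folds rather than on the training data is what makes the comparison between models meaningful.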


# Conclusions

We learned the basics of how a neural network is trained and used. This gives us an intuition for how to take raw data and perform classification or regression, fitting highly non-linear functions with just one hidden layer of neurons. It also gives us an intuition for how regularization and the learning rate can be used to get the desired fit to the data.

# Extra Credit

In this section you test various **activation functions**. Use the best neural network structure found above and explore 3 different activation functions of your choice (one should be the *tanh* used in the previous sections).
Use cross validation to discover the best model (including its activation function).


One extra credit is assigned when you finish the work completely. 
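
As a starting point for this comparison, the sketch below defines three candidate activations together with their derivatives, which backpropagation would need. The choice of sigmoid and ReLU alongside tanh is only one possible selection, and the `ACTIVATIONS` dictionary is a hypothetical structure, not part of the existing `NeuralNet` class. Each derivative is written in terms of the activation's output `z`.

```python
import numpy as np

def tanh(a):
    return np.tanh(a)

def tanh_grad(z):
    return 1.0 - z ** 2            # z = tanh(a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(z):
    return z * (1.0 - z)           # z = sigmoid(a)

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(z):
    return (z > 0).astype(float)   # z = relu(a); gradient taken as 0 at a = 0

# Name -> (activation, derivative expressed in terms of the activation's output)
ACTIVATIONS = {
    "tanh": (tanh, tanh_grad),
    "sigmoid": (sigmoid, sigmoid_grad),
    "relu": (relu, relu_grad),
}

a = np.array([-1.5, -0.3, 0.4, 2.0])
outputs = {name: f(a) for name, (f, _) in ACTIVATIONS.items()}
```

Each entry could then be plugged into the hidden layer in turn, with cross validation used to compare the resulting models.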
