# History of Neural Networks
* First conceptualized in 1943
* The first models were perceptrons, introduced in 1958 by Frank Rosenblatt
    * There was a belief that artificial neural networks (ANNs) would soon be able to translate languages on the fly.
* However, the technology of the time was not able to make this work. This led to a period known as the "AI Winter".

**ImageNet 2012**  
Large Scale Visual Recognition Challenge.  
The results renewed interest in neural networks.

$$
y = w_0 + \sum{w_i x_i}
$$
Or, in vector form (with $x_0 = 1$ so that $w_0$ is included in $\vec{w}$):
$$
y = \vec{w} \cdot \vec{x}
$$

For multiple outputs $y$:
$$
\bar{Y} = \bar{W} \bar{X}
$$
Solving for the weights, where $\bar{X}^{+}$ is the pseudoinverse of $\bar{X}$:
$$
\bar{W} = \bar{Y}\bar{X}^{+}
$$
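As a rough illustration of solving for the weights with the pseudoinverse, here is a minimal sketch using `np.linalg.pinv`. The synthetic data, the variable names, and the layout (each column of `X` is one sample) are assumptions made for this example, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: each column of X is one sample, each row one input feature.
n_features, n_outputs, n_samples = 3, 2, 50
X = rng.normal(size=(n_features, n_samples))

# Hypothetical "true" weights, used only to generate example targets.
W_true = rng.normal(size=(n_outputs, n_features))
Y = W_true @ X

# Solve for the weights with the Moore-Penrose pseudoinverse: W = Y X^+
W = Y @ np.linalg.pinv(X)

print(np.allclose(W, W_true))
```

With noise-free data and more samples than features, the recovered `W` should match `W_true` up to floating-point error. A bias term could be handled by appending a row of ones to `X`.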

In [1]:
import numpy as np
from sklearn.datasets import load_iris

#### Load Data
iris = load_iris()
X = iris.data
y = iris.target

#### Train/test split (hold out 20% of the data for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

In [2]:
#### Perceptron Model
from sklearn.linear_model import Perceptron

per_clf = Perceptron()
per_clf.fit(X_train, y_train)

#### Test the perceptron
y_pred = per_clf.predict(X_test)

#### Evaluate the predictions on the test data
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.80      1.00      0.89         8
           1       0.82      0.82      0.82        11
           2       1.00      0.82      0.90        11

    accuracy                           0.87        30
   macro avg       0.87      0.88      0.87        30
weighted avg       0.88      0.87      0.87        30

[[8 0 0]
 [2 9 0]
 [0 2 9]]


The perceptron above reaches 87% accuracy on the test set. We can do better.

## Activation Functions
The __activation__ is the computed value of a neuron (or node).

For example,
$$
y_1 = w_0 + \sum{w_{1i} x_i}
$$

An __activation function__ transforms the computed value of the neuron to produce its output.  
$$
y_1 = \sigma \left( w_0 + \sum{w_{1i} x_i} \right)
$$

Activation functions are generally non-linear.
Sometimes the output of the activation function is not restricted to a fixed interval (linear activation function, or ReLU). These are great for regression networks.  

Let $z = w_0 + \sum{w_{1i} x_i}$

## Good for Regression
#### 1. Linear Activation Function
A linear transformation of each value of $z$. It takes the linear regression line and adjusts the slope by $\alpha$.
$$y = \sigma(z) = \alpha z$$

#### 2. Rectified Linear Unit (ReLU)  
The max keeps the output from going below zero: negative values of $z$ become zero, and once $z$ is above zero the output follows the linear regression line.  
$$y = \sigma(z) = \max\{0, z\}$$
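The two regression-friendly activations above can be sketched in a few lines of NumPy. The function names `linear` and `relu` and the sample values of `z` are chosen here for illustration only.

```python
import numpy as np

def linear(z, alpha=1.0):
    """Linear activation: scale z by alpha."""
    return alpha * np.asarray(z, dtype=float)

def relu(z):
    """ReLU: zero for negative z, z itself otherwise."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(linear(z, alpha=0.5))  # every value is halved
print(relu(z))               # negative values are clamped to zero
```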


## Good for Classifications
Sometimes the output of the activation function is a value between 0 and 1. This essentially gives a probability, so it's good for classification models. 
#### 1. Step Function (aka Threshold Logic Unit (TLU))  
Gives a straight jump from 0 to 1.
$$ y = \sigma(z) = \begin{cases}1 & \text{if } z \ge \text{threshold} \\ 0 & \text{if } z < \text{threshold} \end{cases} $$

#### 2. Sigmoid Function (aka Logistic Function)  
Gives a gradual transition from 0 to 1.
$$
    y = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

#### 3. Hyperbolic Tangent  
Acts like the sigmoid, but between -1 and 1 instead of 0 and 1.
$$
    y = \sigma(z) = \tanh{z} = \frac{e^{z} - e^{-z}}{e^z + e^{-z}}
$$
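A minimal NumPy sketch of the three classification-oriented activations follows. The helper names `step` and `sigmoid`, the default threshold of 0, and the sample inputs are assumptions for illustration; `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def step(z, threshold=0.0):
    """Step function: 1 where z >= threshold, else 0."""
    return np.where(np.asarray(z) >= threshold, 1.0, 0.0)

def sigmoid(z):
    """Sigmoid (logistic) function: smooth transition from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

z = np.array([-5.0, 0.0, 5.0])
print(step(z))      # abrupt jump at the threshold
print(sigmoid(z))   # values strictly between 0 and 1
print(np.tanh(z))   # values strictly between -1 and 1
```

The sigmoid and tanh are closely related: $\tanh(z) = 2\sigma(2z) - 1$, which is why tanh looks like a rescaled sigmoid.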


# Perceptron (Picture in notebook)
A perceptron has multiple inputs and multiple outputs, with a separate weight for each connection between an input and an output.

## Types of Layers
1. Input layer
    - petal length, petal width, sepal length
    - color of a pixel
    - weight, height
2. Hidden layer
    - very complicated and not interpretable
3. Output layer
    - classification of flower
    - category of image
    - diagnosis of cancer or not

**Width** is the size of a layer: the number of *neurons* in that layer.  
**Depth** is the number of *hidden layers*.

## Types of Neural Networks  

**Artificial Neural Network (ANN)** is when you have *one or more* hidden layers in the perceptron.  
**Deep Neural Network (DNN)** is when you have *two or more* hidden layers in the perceptron.  
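To make width and depth concrete, here is a small sketch of a forward pass through a fully connected network. The architecture (4 inputs, two hidden layers of width 20, 3 outputs), the random weights, and the choice of ReLU on the hidden layers are all assumptions for illustration; this network is untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical architecture: 4 inputs, two hidden layers of width 20, 3 outputs.
layer_widths = [4, 20, 20, 3]
depth = len(layer_widths) - 2  # number of hidden layers

# One weight matrix and one bias vector per pair of adjacent layers.
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:])]
biases = [np.zeros(n_out) for n_out in layer_widths[1:]]

def forward(x):
    """Pass one sample through the network, using ReLU on hidden layers."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        is_output_layer = i == len(weights) - 1
        a = z if is_output_layer else np.maximum(0.0, z)
    return a

x = np.array([5.1, 3.5, 1.4, 0.2])  # one sample with four features
print(depth, forward(x).shape)
```

With two hidden layers, this sketch would count as a DNN under the definition above.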

## Finding Weights
**1. Very Simplistic Way**  

If you find that your output is terrible, you can adjust the weights little by little. 

To measure how good your predictions are, you can use mean squared error, absolute error, or other loss functions.  

A simple correction based on subtraction looks like this: $- \eta (\hat{y_i} - y_i)x_j$, where $\eta$ is the learning rate.

Minimizing the loss function then tells you how much you need to change each weight by:

$$
w_{ij}' = w_{ij} - \eta (\hat{y_i} - y_i)x_j
$$
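The update rule above can be sketched for a single sample as follows. This assumes a linear output (no activation function) when computing $\hat{y}$; the function name `update_weights` and the sample values are made up for the example.

```python
import numpy as np

def update_weights(W, x, y, eta=0.1):
    """Apply w_ij' = w_ij - eta * (y_hat_i - y_i) * x_j for one sample."""
    y_hat = W @ x  # prediction with the current weights (linear output assumed)
    return W - eta * np.outer(y_hat - y, x)

W = np.zeros((2, 3))             # 2 outputs, 3 inputs
x = np.array([1.0, 2.0, -1.0])   # one input sample
y = np.array([1.0, 0.0])         # its target output

error_before = np.abs(W @ x - y).sum()
W = update_weights(W, x, y)
error_after = np.abs(W @ x - y).sum()
print(error_before, error_after)
```

A single small step moves the prediction toward the target without reaching it, which is the "little by little" adjustment described above.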

**2. Steps for Gradient Descent**  
1. Run the dataset through the network; each sample suggests a change to the weights
2. Average the suggested changes
3. Apply the average change
4. Repeat for 30-40 epochs, applying the average change each time, until the average change is very small
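The steps above can be sketched as a short loop. This is a minimal illustration on synthetic data with a single linear output; the data, learning rate, and stopping tolerance are assumptions chosen so the example converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 100 samples, 3 features.
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])  # hypothetical weights that generated y
y = X @ w_true

w = np.zeros(3)
eta = 0.5          # learning rate
n_epochs = 40      # upper bound on the number of passes over the data
tolerance = 1e-6   # stop early once the average change is very small

for epoch in range(n_epochs):
    y_hat = X @ w
    # Each sample suggests a change; average the suggestions over the dataset.
    avg_change = -eta * ((y_hat - y)[:, None] * X).mean(axis=0)
    w += avg_change
    if np.abs(avg_change).max() < tolerance:
        break

print(w)
```

Averaging over the whole dataset before each update is one common choice; other variants apply the change after each sample or after small batches.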

# Cautions
1. Easy to overfit
2. Not good for generalizing (you might not be able to use a model you make on some other data set).
    - Reason: hidden layers are not interpretable
3. Incorporating physics 

# Train Neural Networks using Backpropagation  



# Some Questions to Ask When Using NNs

- How many input nodes?  
- How many output nodes?  
- How many hidden layers? 
    - _one hidden layer is generally enough_
    - _but..._
        - _more layers tend to give a higher parameter efficiency_
        - _sometimes, two layers of 20 do more than one layer of 100. Since there are fewer nodes, it also creates a faster model_

    - Deep networks can use exponentially fewer nodes than shallow networks. 
    - _Lower layers_ (closer to the input layer) tend to model low-level structures (like line segments, color, and other general information).
    - _Intermediate layers_ (in the middle) tend to model intermediate-level structures (like going from line segments to general shapes, or from a color to color patterns).
    - _Higher layers_ (closer to the output layer) tend to model high-level structures (like going from a shape to an apple, or from color patterns to color themes: specific information).  


    Input -> Middle -> Output  
    General ---------> Specific  


    - _Transfer learning_ is when lower layers are reused in other models. For example, you can transfer some layers to model similar, but different, data.  


- What is the width of each hidden layer?
    - Doesn't really have to follow a pyramid shape. 
    - Equally sized layers tend to perform better than a pyramid shape.
    - The greater the width, the more calculations the network has to do. 
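One rough way to compare architectures when answering these questions is to count trainable parameters. The sketch below assumes a fully connected network with one bias per neuron, and an iris-like setup of 4 input nodes and 3 output nodes; the helper `count_parameters` is hypothetical.

```python
def count_parameters(layer_widths):
    """Count weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))

# Assumed setup: 4 input nodes and 3 output nodes.
one_wide_layer = count_parameters([4, 100, 3])        # one hidden layer of 100
two_narrow_layers = count_parameters([4, 20, 20, 3])  # two hidden layers of 20
print(one_wide_layer, two_narrow_layers)
```

Under these assumptions, the two narrow layers need fewer parameters than the single wide layer, which fits the note above about parameter efficiency. Parameter count says nothing about accuracy, so it is only one input to the decision.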




# fashion_mnist  