<h1>Neural Networks - Universal Approximators</h1>
<ul>
    <li>A Brief dive into the Human Brain</li>
    <li>Artificial Neural Networks</li>
    <li>Model Building</li>
    <li>Pros and Cons</li>    
</ul>

<h2>1. The Human Brain </h2><br/>

<img src="Media/brain.jpg" width="300px"/><br/>
<p style="text-align:center">[1] Source: https://nobaproject.com/modules/the-brain-and-nervous-system</p>

The Human Brain is a very powerful biological entity responsible for coordinating the human body and for cognition (reasoning).
Some elements of cognition are:
<ul>
    <li>Analysis</li>
    <li>Problem solving</li>
    <li>Retention</li>
    <li>Skill acquisition or learning</li>
    <li>etc.</li>
</ul>


The Human Brain is essentially composed of a large network of neurons, which handle the overall information processing and all forms of cognition in the brain.

<img src="Media/bnn.jpg" width="500px"/>
<p style="text-align:center">[2] Source: https://towardsdatascience.com/meet-artificial-neural-networks-ae5939b1dd3a</p>

<img src="Media/neural_net.png"/>
<p style="text-align:center">[3] Layout of a neural network connection</p>

<img src="Media/neuron.png">
<p style="text-align:center">[4] Source: https://www.upgrad.com/blog/biological-neural-network/</p>

The Biological Neural Network works with <b>electrochemistry</b>. Information is transmitted by means of <b>electrical signals</b> generated by chemical processes.

<h2>2. Biological Neuron vs Artificial Neuron</h2>

<img src="Media/neuron_vs_artificial_neuron.png"/>

$$
  y = f\left(\sum_{i=1}^{n} w_i x_i\right), \quad f \equiv \text{activation function}
$$
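
The neuron equation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the input values, the weights and the choice of a sigmoid as $f$ are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, used here as an illustrative choice for f."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, activation=sigmoid):
    """Artificial neuron: weighted sum of the inputs passed through f."""
    return activation(np.dot(w, x))

# Illustrative inputs and weights (assumed values, not from the notebook)
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 0.0])

y = neuron_output(x, w)  # weighted sum is 0.5 - 0.5 + 0.0 = 0.0
print(y)                 # sigmoid(0) = 0.5
```

Changing a weight changes how strongly the corresponding input contributes to the output, which is the artificial counterpart of a synapse strengthening or weakening.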

<table>
    <tr><td></td><td>Biological Neuron</td><td>Artificial Neuron</td></tr>
    <tr><td>1.</td><td>Dendrites</td><td>Input Variables</td></tr>
    <tr><td>2.</td><td>Synapses</td><td>Weights</td></tr>
    <tr><td>3.</td><td>Axon</td><td>Summed Output</td></tr>
    <tr><td>4.</td><td>Action Potential</td><td>Activation function</td></tr>
</table>

<b style="color:blue">Synaptic plasticity</b> is the ability of the connections between neurons, called synapses, to strengthen or weaken over time in response to increases or decreases in their activity. This phenomenon is fundamental to learning: it is what allows a neural network to modify its configuration when learning a task.

<h2>3. Artificial Neural Networks</h2>
<img src="media/ann_net.png"/>

Artificial Neural Networks are structures of interconnected computational neurons for model identification. They form the neural network model.

Artificial Neural Networks exist in many architectures, each customised for different purposes:

<ol style="color:blue">
    <li><b>Feedforward neural networks</b></li>    
    <li>Recurrent neural networks</li>
    <li>Convolutional neural networks</li>
    <li>Radial basis neural networks</li>
    <li>Transformer neural networks</li>
    <li>etc.</li>
</ol>

Additional resources: https://www.cloudflare.com/en-gb/learning/ai/what-is-neural-network/

<h3>4. Feedforward Neural Networks</h3>

<img src="media/ann_net.png"/>

Feedforward neural networks are ANNs in which neurons pass information only forward, to the nodes of later layers, and never backward. They are among the most commonly used neural network architectures.

<img src="Media/activation_function.png"/>
<p style="text-align:center">[5] Activation functions</p>
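
As a rough companion to the figure, the sketch below implements a few widely used activation functions in NumPy. Which functions the figure actually shows is not stated in the text, so this selection (sigmoid, tanh, ReLU, softmax) is an assumption.

```python
import numpy as np

def sigmoid(z):
    """Squashes values into (0, 1); common for binary outputs."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: passes positives, zeroes out negatives."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turns a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])  # illustrative pre-activation values
print('sigmoid:', sigmoid(z))
print('tanh   :', np.tanh(z))   # NumPy already provides tanh
print('relu   :', relu(z))
print('softmax:', softmax(z))
```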

<h3>4.1. Training a Neural Network</h3>

<img src="media/ann_net.png"/>

Neural networks are multiparametric models built by composing linear functions with activation functions, layer by layer:
    
$$
z_i^{(l)} = \theta_{0,i}^{(l)} + \sum_{j=1}^{N^{(l-1)}}\theta_{j,i}^{(l)}x_j^{(l-1)}
$$
$$
 x_i^{(l)} = f(z_i^{(l)},\theta_i^{(l)})
$$
 where $\theta_i^{(l)}$ are the weights and bias for the current neuron, $x_j^{(l-1)}$ are the inputs to the current neuron incoming from the previous layer $(l-1)$, $f$ is the activation function for the neuron and $x_i^{(l)}$, the output signal of the neuron in the $l^{th}$ layer.
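
The two layer equations above can be sketched as a forward pass in NumPy. The layer sizes, parameter values and activation choices below are illustrative assumptions; the array `Theta[j, i]` plays the role of $\theta_{j,i}^{(l)}$ and `theta_0[i]` the role of the bias $\theta_{0,i}^{(l)}$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Propagate an input vector through a list of layers.

    Each layer is a tuple (theta_0, Theta, f):
      theta_0 : bias vector, shape (n_out,)
      Theta   : weight matrix, shape (n_in, n_out)
      f       : activation function
    """
    for theta_0, Theta, f in layers:
        z = theta_0 + x @ Theta  # z_i = theta_0i + sum_j theta_ji * x_j
        x = f(z)                 # output of this layer feeds the next one
    return x

# Illustrative 2-2-1 network (assumed values)
layers = [
    (np.array([0.0, 1.0]), np.array([[1.0, -1.0], [0.5, 2.0]]), relu),
    (np.array([0.5]), np.array([[1.0], [-1.0]]), identity),
]

out = forward(np.array([2.0, -1.0]), layers)
print(out)  # hidden z = [1.5, -3.0] -> relu -> [1.5, 0.0] -> output 2.0
```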

The estimation of the parameters $\theta$ comes down to minimising the sum of squared errors between the model output predictions and the 
observed target values:

$$
    E (\theta) = \sum_{k=1}^{N}(x_k-\hat{x_k}(\theta))^2
$$

This estimation is typically performed by gradient descent, with the gradients computed by the <b>backpropagation method</b>:
$$
\theta_j^{i+1}=\theta_j^{i}-\eta\frac{\partial E}{\partial \theta_j}(\theta^{i}), \quad \eta \equiv\text{learning rate}
$$
where $i$ is the iteration index. Backpropagation applies the chain rule recursively, layer by layer from the output back to the input, which makes it a form of automatic differentiation.
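
A minimal sketch of the update rule above, assuming a small network with one tanh hidden layer, a linear output and the sum-of-squares cost $E(\theta)$. The data, layer sizes and learning rate are illustrative assumptions; the gradients are written out by hand with the chain rule, which is what backpropagation automates.

```python
import numpy as np

# Illustrative data (assumption): 20 samples, 3 inputs, 1 target each
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = np.sin(X.sum(axis=1, keepdims=True))

# Parameters theta of a 3-4-1 network (sizes are illustrative)
W1 = rng.normal(scale=0.5, size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)

def forward(W1, b1, W2, b2):
    h = np.tanh(X @ W1 + b1)  # hidden layer output
    y = h @ W2 + b2           # linear output layer
    return h, y

def cost(W1, b1, W2, b2):
    _, y = forward(W1, b1, W2, b2)
    return np.sum((t - y) ** 2)  # E(theta)

def backprop(W1, b1, W2, b2):
    """Gradients of E w.r.t. every parameter, from output back to input."""
    h, y = forward(W1, b1, W2, b2)
    dy = -2.0 * (t - y)                # dE/dy
    gW2 = h.T @ dy                     # dE/dW2
    gb2 = dy.sum(axis=0)               # dE/db2
    dz = (dy @ W2.T) * (1.0 - h ** 2)  # chain rule through tanh
    gW1 = X.T @ dz                     # dE/dW1
    gb1 = dz.sum(axis=0)               # dE/db1
    return gW1, gb1, gW2, gb2

eta = 0.002  # learning rate (assumed value)
E_start = cost(W1, b1, W2, b2)
for i in range(300):
    gW1, gb1, gW2, gb2 = backprop(W1, b1, W2, b2)
    W1, b1 = W1 - eta * gW1, b1 - eta * gb1  # theta <- theta - eta * dE/dtheta
    W2, b2 = W2 - eta * gW2, b2 - eta * gb2
E_end = cost(W1, b1, W2, b2)
print(E_start, E_end)
```

Libraries such as scikit-learn perform these gradient computations internally, so this hand-written version is only meant to make the mechanics visible.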

Additional resources: https://mlinsightscentral.com/index.php/gradient-descent-algorithm/
<br/>
https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd
    
    
    

An <b>epoch</b> in training a neural network refers to one complete pass over the training data, during which the weights of the neural network are updated towards minimising the cost function $E(\theta)$.
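
When training proceeds in mini-batches, one epoch contains several weight updates. The sketch below only counts them; the sample count, batch size and number of epochs are assumed values, and the actual gradient step is left as a comment.

```python
import numpy as np

n_samples, batch_size, n_epochs = 10, 4, 3  # illustrative values
rng = np.random.default_rng(0)

n_updates = 0
for epoch in range(n_epochs):
    order = rng.permutation(n_samples)  # reshuffle the data every epoch
    for start in range(0, n_samples, batch_size):
        batch = order[start:start + batch_size]
        # a real trainer would compute the gradient on `batch`
        # and update the weights here
        n_updates += 1

print(n_updates)  # 3 batches per epoch (4 + 4 + 2 samples) x 3 epochs = 9
```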

Training a feedforward neural network will thus require the selection of 
<ol style="color:blue">
    <li>The number of hidden layers</li>
    <li>The number of neurons per hidden layer</li>
    <li>The type of activation functions (hidden layer and output layer)</li>
</ol>

The selection of the first two is typically a <b>work of art</b> or <b>finetuned by hyperparameter tuning</b>. The choice of activation functions is most often dictated by the type of problem at hand: classification or regression.

<table>
    <tr><td></td><td>Regression </td><td>Classification</td></tr>
    <tr><td>Hidden layers</td><td>-</td><td>-</td></tr>
    <tr><td>Neurons per hidden layer</td><td> decreasing from left to right</td><td>decreasing from left to right</td></tr>
    <tr><td>Hidden layer-activation function </td><td> ReLU </td><td>ReLU</td></tr>
    <tr><td>Output layer-activation function </td><td>Linear, sigmoid</td><td>sigmoid, softmax</td></tr>
</table>
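
In scikit-learn the output-layer activation is not a user setting: the estimator picks it from the task, and the fitted model exposes it as `out_activation_`. The sketch below checks this on small synthetic data; the data and network sizes are assumptions made for the illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Synthetic data (assumption): 60 samples, 4 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_reg = X_demo[:, 0] - 2.0 * X_demo[:, 1]  # continuous target
y_bin = (y_reg > 0).astype(int)            # two classes
y_multi = np.arange(60) % 3                # three classes

common = dict(hidden_layer_sizes=(8, 4), activation='relu',
              solver='lbfgs', max_iter=50, random_state=0)

reg = MLPRegressor(**common).fit(X_demo, y_reg)
clf_bin = MLPClassifier(**common).fit(X_demo, y_bin)
clf_multi = MLPClassifier(**common).fit(X_demo, y_multi)

# hidden layers use ReLU in all three; only the output layer differs
print(reg.out_activation_, clf_bin.out_activation_, clf_multi.out_activation_)
# identity logistic softmax
```

Here `identity` corresponds to the linear output in the table and `logistic` to the sigmoid.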

<h3>4.2. Case-study - Classification</h3>

In [16]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore') #ignore warnings

df = pd.read_csv('../../datasets/Healthcare-Diabetes.csv')
df.head()

Unnamed: 0,Id,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,1,6,148,72,35,0,33.6,0.627,50,1
1,2,1,85,66,29,0,26.6,0.351,31,0
2,3,8,183,64,0,0,23.3,0.672,32,1
3,4,1,89,66,23,94,28.1,0.167,21,0
4,5,0,137,40,35,168,43.1,2.288,33,1


In [2]:
X = df.iloc[:, 1:9]#get features
y = df.iloc[:, -1]  # get target variable as a 1-D Series, the shape scikit-learn expects

<b style="color:blue">Feature Scaling</b>  - <b>Feature scaling can be the difference between success and failure in neural networks</b>.


In [17]:
from sklearn.preprocessing import StandardScaler
Sc = StandardScaler()
Sc.fit(X)
X_d = Sc.transform(X)

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train,y_test = train_test_split(X_d,y, test_size=0.2, random_state=1234)
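
The cells above fit the scaler on the full dataset before splitting. A commonly recommended variant, sketched here on synthetic data, splits first and fits the scaler on the training portion only, so that no statistics from the test set influence preprocessing. The `_demo` arrays are stand-ins, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=[1, 5, 10, 20, 30, 40, 50, 60], size=(200, 8))
y_demo = rng.integers(0, 2, size=200)

# 1. split first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=1234)

# 2. learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_tr)

# 3. apply the same transformation to both sets
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(3))  # approximately 0 for every feature
print(X_tr_s.std(axis=0).round(3))   # approximately 1 for every feature
```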

In [19]:
from sklearn.neural_network import MLPClassifier #Multi layer Perceptron

#--configuration: 2 hidden layers - 30, 20 neurons respectively
max_epoch = 300
ffn_clf = MLPClassifier(hidden_layer_sizes=(30,20),max_iter=max_epoch,
                  early_stopping=False,activation='relu',solver='lbfgs')#feedforward neural network classifier

ffn_clf.fit(X_train,y_train)

#For Regression: Use MLPRegressor 

In [20]:
y_pred_train = ffn_clf.predict(X_train)  # predict on training data

In [21]:
from sklearn.metrics import classification_report

targets = ['no-diabetes','has-diabetes']

print('Training Set:\n',classification_report(y_train,y_pred_train,target_names=targets))

Training Set:
               precision    recall  f1-score   support

 no-diabetes       1.00      1.00      1.00      1467
has-diabetes       1.00      1.00      1.00       747

    accuracy                           1.00      2214
   macro avg       1.00      1.00      1.00      2214
weighted avg       1.00      1.00      1.00      2214



In [22]:
y_pred_test = ffn_clf.predict(X_test)  # predict on test data

In [23]:
targets = ['no-diabetes','has-diabetes']

print('Test Set:\n',classification_report(y_test,y_pred_test,target_names=targets))

Test Set:
               precision    recall  f1-score   support

 no-diabetes       0.99      1.00      1.00       349
has-diabetes       1.00      0.99      0.99       205

    accuracy                           0.99       554
   macro avg       0.99      0.99      0.99       554
weighted avg       0.99      0.99      0.99       554
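
To make the columns of the report concrete, the sketch below recomputes precision, recall and F1 for the positive class from a confusion matrix. The ten labels are made-up values chosen for easy arithmetic, not taken from the diabetes data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels (assumption): 1 = has-diabetes, 0 = no-diabetes
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# for binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)         # 4 1 1 4
print(precision, recall, f1)  # each is approximately 0.8

# the same numbers from scikit-learn's helpers
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```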



<h4>Hyperparameter tuning</h4>

In [10]:
from sklearn.model_selection import GridSearchCV

ffn_clf_hyp = MLPClassifier(max_iter=max_epoch,
                  early_stopping=False)

parameters = {'hidden_layer_sizes':[(10,), (20,), (30,20)],
              'activation':('tanh','relu'),
              'solver':['sgd','lbfgs'],
              'learning_rate':['constant','adaptive']}

clf_grd = GridSearchCV(ffn_clf_hyp,parameters,cv=5)
clf_grd.fit(X_train,y_train)

In [13]:
clf_grd.best_params_

{'activation': 'relu',
 'hidden_layer_sizes': (30, 20),
 'learning_rate': 'adaptive',
 'solver': 'lbfgs'}
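
Beyond `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and the per-combination results in `cv_results_`, which help judge how much the winning configuration actually differs from the rest. The sketch below uses a reduced grid on synthetic data (both are assumptions) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption), 8 features like the case study
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

grid = {'hidden_layer_sizes': [(10,), (20,)],
        'activation': ['tanh', 'relu']}

search = GridSearchCV(MLPClassifier(solver='lbfgs', max_iter=200, random_state=0),
                      grid, cv=3)
search.fit(X_demo, y_demo)

# one row per parameter combination, best first
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score')

print(search.best_params_, round(search.best_score_, 3))
print(results)
```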

In [11]:
y_pred_test_hyp = clf_grd.predict(X_test)

In [12]:

print('Test Set-Hyper:\n',classification_report(y_test,y_pred_test_hyp,target_names=targets))

Test Set-Hyper:
               precision    recall  f1-score   support

 no-diabetes       0.99      0.99      0.99       349
has-diabetes       0.99      0.99      0.99       205

    accuracy                           0.99       554
   macro avg       0.99      0.99      0.99       554
weighted avg       0.99      0.99      0.99       554



<b style="color:blue">For more sophisticated and adaptable neural network models, you can use libraries such as TensorFlow, PyTorch and Keras</b>

<h3>5. Pros and Cons - Neural Networks</h3>



<b>Pros</b><br/>
<ul>
<li>Very good at identifying nonlinearities or complex patterns</li>
<li>Can outperform other learning models in many problem scenarios (e.g. with deep learning)</li>
<li>Scalability - can scale to large datasets using distributed computing resources</li>
</ul>

<b>Cons</b><br/>
<ul>
<li>Black-Box model - Hard to interpret predictions</li>
<li>Often require large volumes of data to extract meaningful patterns</li>
<li>Can be computationally intensive</li>
<li>Run the risk of overfitting - (Typical mitigation - validation set and early stopping)</li>
<li>Have many hyperparameters - the optimal set of hyperparameters may be challenging to find.</li>
</ul>
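
The early-stopping mitigation mentioned in the list can be sketched with scikit-learn's `MLPClassifier`. In scikit-learn, `early_stopping` only takes effect with the `'sgd'` or `'adam'` solvers, so this example uses `'adam'` rather than `'lbfgs'`. The synthetic data and the specific settings are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(30, 20),
                    solver='adam',
                    early_stopping=True,      # hold out part of the training data
                    validation_fraction=0.1,  # 10% used as validation set
                    n_iter_no_change=10,      # patience, in epochs
                    max_iter=300,
                    random_state=0)
clf.fit(X_demo, y_demo)

# training stops once the validation score has not improved for 10 epochs
print('epochs run:', clf.n_iter_)
print('best validation accuracy:', clf.best_validation_score_)
```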

Terminology of the century: <b style="color:blue">Deep learning = Neural networks with many hidden layers</b>

