
# Activation functions  
1. Binary sigmoidal function: This activation function squashes its input into the range (0, 1).\
    $\sigma(x) = sigm(x)= \frac{1}{1+e^{-x}}$
2. Bipolar sigmoidal function: This activation function squashes its input into the range (-1, 1), so its output can be positive or negative.\
    $\sigma(x) = \frac{1 - e^{-x}}{1 + e^{-x}}$
    
3. Threshold function: The threshold (step) function is used when you don’t want to worry about the uncertainty in the middle: the output is exactly 0 or 1.\
    $\sigma(x) = \begin{cases}
        1, & \text{if } x \ge 0 \\
        0, & \text{if } x < 0
    \end{cases}$
4. ReLU (rectified linear unit) function: ReLU returns the input unchanged when it is positive and returns 0 when it is negative, so its output starts at 0 and has no upper bound. The ReLU function is most commonly used these days.\
    $\sigma(x) = \max(0,x)$
5. Hyperbolic Tangent Function : The hyperbolic tangent function is similar to the sigmoid function but has a range of -1 to 1. \
    $\sigma(x) = \frac{1-e^{-2x}}{1+e^{-2x}}$
6. Softmax function: Softmax is an activation function that is commonly used in the output layer of neural networks, particularly for multi-class classification problems. It turns the raw outputs into positive values that sum to 1.\
    $\sigma(z)_i =\frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}}$ where $z$ is the vector of raw outputs from the neural network
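
As a quick numerical check of the ranges described above, the sketch below evaluates several of these functions with NumPy. The sample inputs are hypothetical, and the helper definitions are repeated here only so the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
    # Binary sigmoid: 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    # Bipolar sigmoid: (1 - e^-x) / (1 + e^-x)
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged
    ez = np.exp(z - np.max(z))
    return ez / np.sum(ez)

x = np.linspace(-5.0, 5.0, 11)  # hypothetical sample inputs

# Binary sigmoid stays strictly inside (0, 1)
print(sigmoid(x).min() > 0, sigmoid(x).max() < 1)
# Bipolar sigmoid stays inside (-1, 1) and coincides with tanh(x / 2)
print(np.allclose(bipolar_sigmoid(x), np.tanh(x / 2)))
# ReLU zeroes the negatives and passes positives through
print(np.maximum(0, x))
# Softmax outputs are positive and sum to 1
probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs, probs.sum())
```

The identity with `np.tanh(x / 2)` also shows why the bipolar sigmoid and the hyperbolic tangent share the same range: they differ only by a rescaling of the input.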


# Chain Rule:
  _ Description: The chain rule of calculus is a foundational concept that allows us to find the derivative of a composite function.\
  _ Formula: $\frac{d}{dx}f(g(x))=\frac{df(g)}{dg}\frac{dg}{dx}$\
  _ Example: Residual = Observed - Predicted, where Predicted = Intercept + (1 * weight)\
  $\Rightarrow \frac{dResidual}{dIntercept} = 0 + (-1) = -1$\
  $\Rightarrow y = Residual^2 = (Observed - Predicted)^2$\
  $\Rightarrow \frac{dResidual^2}{dIntercept} = \frac{dResidual^2}{dResidual}\frac{dResidual}{dIntercept} = -2(Observed - (Intercept + (1 * weight)))$
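
The residual example can be checked numerically. The sketch below compares the chain-rule derivative with a central finite difference; the function names and the sample values for Observed, weight, and Intercept are hypothetical.

```python
def squared_residual(intercept, observed, weight):
    # Residual^2 = (Observed - Predicted)^2 with Predicted = Intercept + 1 * weight
    predicted = intercept + 1 * weight
    return (observed - predicted) ** 2

def d_squared_residual(intercept, observed, weight):
    # Chain rule: 2 * Residual * dResidual/dIntercept, where dResidual/dIntercept = -1
    return -2 * (observed - (intercept + 1 * weight))

observed, weight, intercept = 3.0, 1.5, 0.5  # hypothetical values
h = 1e-6
numeric = (squared_residual(intercept + h, observed, weight)
           - squared_residual(intercept - h, observed, weight)) / (2 * h)
analytic = d_squared_residual(intercept, observed, weight)
print(analytic, numeric)
```

If the two printed numbers agree to several decimal places, the analytic derivative is very likely correct.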
   

# Gradient descent:
Gradient descent is an optimization algorithm used in machine learning to minimize the cost function by iteratively adjusting parameters in the direction of the negative gradient, aiming to find the optimal set of parameters.
+ Step1: Initialize the parameters (weights and biases) of the model with random or predefined values.
+ Step2: Compute the gradient of the cost function with respect to the parameters. The gradient is a vector that points in the direction of the steepest increase of the function.
+ Step3: Update each parameter in the opposite direction of the gradient to minimize the loss function. The update rule for each parameter $\theta$ is: $\theta = \theta - \alpha\nabla_{\theta}J(\theta)$ 
Where: . $\alpha$ is the learning rate, which determines the step size of each update,\
. $J(\theta)$ is the cost function,\
. $\nabla_{\theta}J(\theta)$ is the gradient of the cost function with respect to the parameters.\
+ Step4: Choose the type of gradient descent:\
  _ Batch Gradient Descent: computes the gradient over the full training set at each step, which makes it very slow on very large training sets.
    . $\theta_j = \theta_j - \alpha\frac{\partial}{\partial{\theta_j}}J(\theta)$, \
    . $\frac{\partial}{\partial{\theta_j}}J(\theta) = \frac{1}{m}\sum_{i=1}^m(\hat{y}_i - y_i)x_j^{(i)}$ for the squared-error cost $J(\theta) = \frac{1}{2m}\sum_{i=1}^m(\hat{y}_i - y_i)^2$ of a linear model\
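  _ Stochastic Gradient Descent: updates the parameters after each individual training sample instead of after a full pass over the training set. For one sample:
    . $\text{Sum of squared residuals} = (\text{Height} - (\text{intercept} + \text{slope} * \text{Weight}))^2$
    . $\frac{d}{d\,\text{intercept}}\text{Sum of squared residuals} = -2(\text{Height} - (\text{intercept} + \text{slope} * \text{Weight}))$
    . $\frac{d}{d\,\text{slope}}\text{Sum of squared residuals} = -2 * \text{Weight}(\text{Height} - (\text{intercept} + \text{slope} * \text{Weight}))$
    . $\text{Step Size}_{\text{intercept}} = \frac{d}{d\,\text{intercept}}\text{Sum of squared residuals} * \text{Learning Rate}$
    . $\text{Step Size}_{\text{slope}} = \frac{d}{d\,\text{slope}}\text{Sum of squared residuals} * \text{Learning Rate}$
      Where: Weight is the input, Height is the observed output.
    . $\text{New intercept} = \text{Old intercept} - \text{Step Size}_{\text{intercept}}$
    . $\text{New slope} = \text{Old slope} - \text{Step Size}_{\text{slope}}$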
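
The step-size updates above can be sketched as a small stochastic gradient descent loop. This is a minimal illustration, not a tuned implementation: it reuses the sample `x`/`y` values from the notebook's last cell as Weight and Height (an assumption), and the function name `sgd`, the learning rate, the epoch count, and the seed are hypothetical choices.

```python
import random

# Sample data from the notebook's final cell, treated here as (Weight, Height) pairs
weights = [1, 2, 3, 4, 5, 6, 7]
heights = [2.2, 4.4, 6.3, 8.2, 10.5, 11.8, 14.3]

def sgd(weights, heights, learning_rate=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    intercept, slope = 0.0, 0.0
    indices = list(range(len(weights)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order each epoch
        for i in indices:
            residual = heights[i] - (intercept + slope * weights[i])
            # Derivatives of the squared residual for this single sample
            d_intercept = -2 * residual
            d_slope = -2 * weights[i] * residual
            # Step size = derivative * learning rate; new value = old value - step size
            intercept -= d_intercept * learning_rate
            slope -= d_slope * learning_rate
    return intercept, slope

intercept, slope = sgd(weights, heights)
print(intercept, slope)
```

Because each update uses a single sample, the result fluctuates slightly around the least-squares line rather than settling on it exactly; a smaller learning rate reduces that fluctuation at the cost of slower progress.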
     

# Feedforward Propagation:  
_ A feedforward neural network is the simplest type of neural network. It is called feedforward because information flows forward from inputs -> hidden layers -> outputs. There are no feedback connections. Each layer feeds its output to the next layer and keeps moving forward.\
_ Formula: $\sigma(z) = \sigma(b + \sum_{i=1}^N w_ix_i )$, where $x_i$ are the inputs, $w_i$ the weights, and $b$ the bias
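
A minimal sketch of the formula above, for a single neuron and then for a stack of layers. The sigmoid activation, the layer sizes, and all input and weight values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # sigma(b + sum_i w_i * x_i) for one neuron
    return sigmoid(b + np.dot(w, x))

def feedforward(x, layers):
    # layers is a list of (W, b) pairs; each layer's output feeds the next layer
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

x = np.array([0.5, -1.0, 2.0])   # hypothetical inputs
w = np.array([0.1, 0.2, 0.3])    # hypothetical weights
single = neuron(x, w, b=0.0)
print(single)

# Hypothetical 3 -> 2 -> 1 network with random weights and zero biases
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(2, 3)), np.zeros(2)),
          (rng.normal(size=(1, 2)), np.zeros(1))]
output = feedforward(x, layers)
print(output)
```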

# Back Propagation:
+ Step1: We first select an error or cost function that measures the error between the predicted and actual values: $\mathbf{Cost = (Y_{pred} - Y_{act})^2}$
+ Step2: Compute the partial derivatives with the chain rule: $\frac{\partial{Cost}}{\partial{w_1}}= \frac{\partial{Cost}}{\partial{Y_{pred}}}*\frac{\partial{Y_{pred}}}{\partial{g_1}}*\frac{\partial{g_1}}{\partial{z_1}}*\frac{\partial{z_1}}{\partial{w_1}}$\
_ $\frac{\partial{Cost}}{\partial{Y_{pred}}} = 2(Y_{pred} - Y_{act})$\
_ $\frac{\partial{Y_{pred}}}{\partial{g_1}} = w_7 , \hspace{0.3cm} \text{since } Y_{pred} = w_7*g_1 + w_8*g_2 + b_3 $ \
_ $\frac{\partial{g_1}}{\partial{z_1}} = g_1*(1-g_1)=(\frac{1}{1+e^{-z_1}})*(1-\frac{1}{1+e^{-z_1}})$, since $g_1$ is the sigmoid of $z_1$\
_ $\frac{\partial{z_1}}{\partial{w_1}}$ is the input that $w_1$ multiplies, since $z_1$ is a weighted sum of the inputs plus a bias
+ Step3: Update weights: $w_n^+ = w_n - \eta \frac{\partial{Cost}}{\partial{w_n}}$, where $\eta$ is the learning rate \
_ $w_1^+ = w_1 - \eta \frac{\partial{Cost}}{\partial{w_1}}$\
_ $w_2^+ = w_2 - \eta \frac{\partial{Cost}}{\partial{w_2}}$\
_ $b_1^+ = b_1 - \eta \frac{\partial{Cost}}{\partial{b_1}}$
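
The three steps can be sketched for one training sample as follows. The notes do not spell out the network layout, so this sketch assumes three inputs and two sigmoid hidden units (weights $w_1$ to $w_6$, biases $b_1$, $b_2$) feeding a linear output ($w_7$, $w_8$, $b_3$); all numeric values and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_h, b_h, w_o, b_o):
    z = W_h @ x + b_h        # z_1, z_2: weighted sums of the inputs
    g = sigmoid(z)           # g_1, g_2: hidden activations
    y_pred = w_o @ g + b_o   # Y_pred = w_7*g_1 + w_8*g_2 + b_3
    return g, y_pred

def cost_fn(x, y_act, W_h, b_h, w_o, b_o):
    _, y_pred = forward(x, W_h, b_h, w_o, b_o)
    return (y_pred - y_act) ** 2

def backward(x, y_act, W_h, b_h, w_o, b_o):
    g, y_pred = forward(x, W_h, b_h, w_o, b_o)
    dcost_dy = 2 * (y_pred - y_act)   # dCost/dY_pred
    dg_dz = g * (1 - g)               # dg_k/dz_k for the sigmoid
    delta = dcost_dy * w_o * dg_dz    # dCost/dz_k via the chain rule
    grad_W_h = np.outer(delta, x)     # dz_k/dw is the input each weight multiplies
    return grad_W_h, delta, dcost_dy * g, dcost_dy

# Hypothetical sample, initial parameters, and learning rate
x = np.array([1.0, 0.5, -1.0])
y_act = 1.0
W_h = np.array([[0.1, 0.2, 0.3],      # w_1, w_2, w_3
                [0.4, 0.5, 0.6]])     # w_4, w_5, w_6
b_h = np.zeros(2)                     # b_1, b_2
w_o = np.array([0.7, 0.8])            # w_7, w_8
b_o = 0.0                             # b_3
eta = 0.1

cost_before = cost_fn(x, y_act, W_h, b_h, w_o, b_o)
gW_h, gb_h, gw_o, gb_o = backward(x, y_act, W_h, b_h, w_o, b_o)

# Step 3: w+ = w - eta * dCost/dw for every parameter
W_h_new = W_h - eta * gW_h
b_h_new = b_h - eta * gb_h
w_o_new = w_o - eta * gw_o
b_o_new = b_o - eta * gb_o

cost_after = cost_fn(x, y_act, W_h_new, b_h_new, w_o_new, b_o_new)
print(cost_before, cost_after)
```

With a small enough learning rate, the cost after the update should be lower than before; repeating the forward pass, backward pass, and update over many samples is what trains the network.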

In [1]:
import numpy as np


In [2]:
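# Activation functions

def sigmoid(x):
    # Binary sigmoid: maps any real input into (0, 1)
    return 1/(1+np.exp(-x))

def bipolar_sigmoid(x):
    # Bipolar sigmoid: maps any real input into (-1, 1)
    return (1-np.exp(-x))/(1+np.exp(-x))

def threshold(x):
    # Step function: 1 if x >= 0, otherwise 0 (works on scalars and arrays)
    return np.where(np.asarray(x) >= 0, 1, 0)

def Relu(x):
    # ReLU: element-wise max(0, x) (works on scalars and arrays)
    return np.maximum(0, x)

def hyperbolic_tangent(x):
    # tanh written with exponentials; equivalent to np.tanh(x)
    return (1-np.exp(-2*x))/(1+np.exp(-2*x))

def softmax(y):
    # Subtract the max before exponentiating for numerical stability
    ex = np.exp(y - np.max(y))
    return ex/np.sum(ex)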


$y_{pred} = sigmoid(w2*sigmoid(w1*x+b1)+b2)$\
Taking the error as $j = y_{pred} - y_{actual}$:\
$\frac{\partial{j}}{\partial{w}}= \frac{\partial{y_{pred}}}{\partial{w}} - \frac{\partial{y_{actual}}}{\partial{w}}$\
Where: $y_{actual}$ is a constant value, so $\frac{\partial{y_{actual}}}{\partial{w}}=0$\
$\frac{\partial{j(w)}}{\partial{w}}=\frac{\partial{y_{pred}}}{\partial{w}}$\
$\frac{\partial{y_{pred}}}{\partial{w1}}=\frac{\partial{outer}}{\partial{inner}}*\frac{\partial{inner}}{\partial{w1}}$, and likewise for $w2$\
Where (with the biases b1 and b2 dropped for simplicity): outer = sigmoid(inner) and inner = w2 * sigmoid(w1 * x)\
$\frac{\partial{outer}}{\partial{inner}} = \frac{\partial{sigmoid(inner)}}{\partial{inner}}=sigmoid(inner)*(1-sigmoid(inner))$\
$\frac{\partial{inner}}{\partial{w1}}=\frac{\partial{(w2*sigmoid(w1*x))}}{\partial{w1}}=w2*x*sigmoid(w1*x)*(1-sigmoid(w1*x))$\
$\frac{\partial{inner}}{\partial{w2}}=\frac{\partial{(w2*sigmoid(w1*x))}}{\partial{w2}}=sigmoid(w1*x)$
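
These derivative formulas are easy to get subtly wrong, for example by evaluating the outer sigmoid at the wrong argument. The sketch below compares them against central finite differences for the bias-free model; the function names and the values of `x`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def y_pred(x, w1, w2):
    # Bias-free model: outer = sigmoid(inner), inner = w2 * sigmoid(w1 * x)
    return sigmoid(w2 * sigmoid(w1 * x))

def grad_y_pred(x, w1, w2):
    hidden = sigmoid(w1 * x)
    inner = w2 * hidden
    # d outer / d inner, evaluated at inner (not at the target value)
    douter_dinner = sigmoid(inner) * (1 - sigmoid(inner))
    dinner_dw1 = w2 * x * hidden * (1 - hidden)
    dinner_dw2 = hidden
    return douter_dinner * dinner_dw1, douter_dinner * dinner_dw2

x, w1, w2 = 1.5, 0.3, -0.7  # hypothetical values
h = 1e-6
num_dw1 = (y_pred(x, w1 + h, w2) - y_pred(x, w1 - h, w2)) / (2 * h)
num_dw2 = (y_pred(x, w1, w2 + h) - y_pred(x, w1, w2 - h)) / (2 * h)
ana_dw1, ana_dw2 = grad_y_pred(x, w1, w2)
print(ana_dw1, num_dw1)
print(ana_dw2, num_dw2)
```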

In [3]:
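# Chain Rule
def chain_rule(x,y,w1,w2):
    # Gradients of y_pred = sigmoid(w2*sigmoid(w1*x)) with respect to w1 and w2,
    # averaged over all samples. y is unused: y_actual is a constant, so it
    # drops out of the derivative of j = y_pred - y_actual.
    dypred_w1 = 0.0
    dypred_w2 = 0.0
    for i in range(len(x)):
        hidden = sigmoid(w1*x[i])
        y_predic = sigmoid(w2*hidden)
        # d outer / d inner, evaluated at inner = w2*sigmoid(w1*x[i])
        douter_inner = y_predic*(1-y_predic)
        dinner_w1 = w2*x[i]*hidden*(1-hidden)
        dinner_w2 = hidden
        dypred_w1 += douter_inner * dinner_w1
        dypred_w2 += douter_inner * dinner_w2
    return dypred_w1/len(x), dypred_w2/len(x)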

In [4]:
# Gradient descent 
def gradient_decent(x, y):
    # Initial weights and bias
    w1 = 0
    w2 = 0
    b_curr = 0  # only referenced by the commented-out linear-regression update below
    # Number of iterations and learning rate (step size)
    iteration = 100
    n = len(x)
    learning_rate = 0.01

    for i in range(iteration):
        # Note: this follows the gradient of y_pred (the signed error j), not a squared-error cost
        dypred_w1, dypred_w2 = chain_rule(x,y,w1,w2)
        w1 = w1 - learning_rate*dypred_w1
        w2 = w2 - learning_rate*dypred_w2
#         cost = (1/n)* sum([val**2 for val in (y-y_predicted)])
#         md = -(2/n)*sum(x*(y-y_predicted))
#         bd = -(2/n)*sum(y - y_predicted)
#         m_curr = m_curr - learning_rate * md
#         b_curr = b_curr - learning_rate * bd
        print ("w1 {}, w2 {}, dypred_w1 {}, iteration {}".format(w1,w2,dypred_w1, i))
    return w1, w2

In [6]:
# x = [1,2,3,4,5,6,7]
# y = [2.2,4.4,6.3,8.2,10.5,11.8,14.3]
# gradient_decent(x,y)
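
As a variation on the cell above, the sketch below runs gradient descent on the squared-error cost from the Back Propagation section, so the update is driven by $2(Y_{pred} - Y_{act})$ times the chain-rule derivatives. It is a minimal illustration under stated assumptions: the model is the bias-free nested sigmoid, the targets are hypothetical values generated inside the sigmoid's output range (the sample `y` values above lie outside (0, 1), so this model could not fit them), and the names `cost_and_grads` and `gradient_descent`, the starting weights, and the learning rate are all illustrative choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost_and_grads(x, y, w1, w2):
    hidden = sigmoid(w1 * x)
    y_pred = sigmoid(w2 * hidden)
    error = y_pred - y
    cost = np.mean(error ** 2)
    dcost_dypred = 2 * error              # dCost/dY_pred per sample
    douter = y_pred * (1 - y_pred)        # d outer / d inner
    dw1 = np.mean(dcost_dypred * douter * w2 * x * hidden * (1 - hidden))
    dw2 = np.mean(dcost_dypred * douter * hidden)
    return cost, dw1, dw2

def gradient_descent(x, y, learning_rate=0.5, iterations=500):
    w1, w2 = 0.5, 0.5                     # hypothetical starting weights
    history = []
    for _ in range(iterations):
        cost, dw1, dw2 = cost_and_grads(x, y, w1, w2)
        history.append(cost)
        w1 -= learning_rate * dw1
        w2 -= learning_rate * dw2
    return w1, w2, history

# Hypothetical data: targets produced by the same model with w1 = 2.0, w2 = 1.5
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = sigmoid(1.5 * sigmoid(2.0 * x))

w1, w2, history = gradient_descent(x, y)
print(history[0], history[-1])
print(w1, w2)
```

Tracking the cost in `history` makes it easy to see whether the learning rate is reasonable: the values should shrink over the iterations rather than oscillate or grow.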

In [None]:
# Back Propagation