### 1. K-means
- Find K-clusters from 'unlabeled' data

- Applications:
    - Segment user purchasing behavior
    - Cluster physicians together
    
    
- Working:
    - Initialize the cluster centroids; different initializations can lead to different results
    - Assign each data point to its closest centroid, compute the mean of each cluster, and move each centroid to that mean
    - Repeat until convergence, i.e. until the centroids and label assignments stabilize
    - In practice, exit the loop once the total movement of the centroids falls below a set threshold
    
    
- Complexity:
    - O(K * N * I) Time
    - O(N) Space
    - K: clusters, N: data points, I: iterations
  
- Functions:
    - Initialize centroids
    - Get labels using distance
    - Update centroids
    - Stopping function
    - Runner function

In [1]:
import random  ## Imported at module level so that random_sample() can use it too

def main(data, k):
    ## Initialize the k centroids in the data space. We are considering 2-D data here
    centroids = initialize_centroids(data, k)
    
    while True:
        old_centroids = centroids
        ## Find the label for each data-point 
        labels = get_labels(data, centroids)
        ## Find the new centroids based on the labels detected for each data point, keep updating the centroids and labels until convergence
        centroids = update_centroids(data, labels, k)
        
        if should_stop(old_centroids, centroids):
            break
    return labels

In [6]:
#####  Initialize centroids
### Idea is to randomly initialize the centroids within the data range
#

def initialize_centroids(data, k):
        ## Start the minimums at +inf and the maximums at -inf,
        ## so that the first data point compared against them always replaces them
        x_min = y_min = float('inf')
        x_max = y_max = float('-inf')
        
        ## let's iterate through the data to find the range of the data
        for point in data:
            x_min = min(point[0], x_min)
            x_max = max(point[0], x_max)
            y_min = min(point[1], y_min)
            y_max = max(point[1], y_max)
        
        centroids = []
        
        for i in range(k):
            centroids.append([random_sample(x_min, x_max), random_sample(y_min, y_max)])
        
        return centroids

### Returns a number uniformly distributed between low and high
def random_sample(low, high):
    return low+(high-low)*random.random()

In [9]:
### Get labels for the data points
##
#

def get_labels(data, centroids):
    labels = []
    for point in data:
        min_dist = float('inf')
        label = None
        
        for i, centroid in enumerate(centroids):
            new_dist = get_distance(point, centroid)
            if new_dist<min_dist:
                min_dist = new_dist
                label = i
                
        labels.append(label)
    return labels


### Get distance between two points
## 

def get_distance(point_A, point_B):
    return ((point_B[1]-point_A[1])**2 + (point_B[0]-point_A[0])**2)**0.5

In [10]:
#### Update the centroids
##
def update_centroids(data, labels, k):
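    ### Accumulate per-cluster coordinate sums, starting from the origin
    new_centroids = [[0, 0] for _ in range(k)]
    counts = [0]*k
    
    ## Iterating through the data points and their respective labels
    for point, label in zip(data, labels):
        new_centroids[label][0] += point[0]
        new_centroids[label][1] += point[1]
        counts[label] += 1
        
    for i, (x, y) in enumerate(new_centroids):
        ## Guard against empty clusters, which would otherwise divide by zero.
        ## An empty cluster stays at the origin here; re-seeding it at a random data point is a common alternative
        if counts[i] > 0:
            new_centroids[i] = (x/counts[i], y/counts[i])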
        
    return new_centroids

In [12]:
### stopping criteria
##

def should_stop(old_centroids, new_centroids, threshold = 1e-5):
    ## Using a very small threshold to stop the loop
    total_movement = 0
    
    for old_point, new_point in zip(old_centroids, new_centroids):
        total_movement+=get_distance(old_point, new_point)
    return total_movement<threshold
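
For comparison, the same loop can be sketched in vectorized NumPy. This is only a sketch under stated assumptions, not the implementation above: `kmeans_numpy` is a hypothetical name, the centroids are seeded from k distinct data points (a different choice from the uniform-in-range initialization used above), and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans_numpy(data, k, max_iter=100, threshold=1e-5, seed=0):
    """Hypothetical vectorized k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Assumption: seed the centroids from k distinct data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Pairwise distances, shape (n_points, k)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Same stopping rule as should_stop: total centroid movement
        moved = np.linalg.norm(new_centroids - centroids, axis=1).sum()
        centroids = new_centroids
        if moved < threshold:
            break
    return labels, centroids

points = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
          [10.0, 10.0], [10.4, 9.7], [9.8, 10.5]]
labels, centroids = kmeans_numpy(points, k=2)
print(labels, centroids)
```

The `max_iter` cap is an extra safeguard that the loop above does not have; it bounds the run time if the threshold is never reached.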

<br>

### 2. K-nearest neighbors
- Basics
    - You're determined by your closest neighbors
    - Doesn't need to learn parameters the way linear or logistic regression does; can be used for classification or regression
    - A prediction is made by finding the K closest neighbors (existing data points)
    - Common distance measures are Euclidean distance and cosine similarity
    - For regression, take the average of the neighbors' values; that is the predicted target for the new data point
    - For classification, predict the majority class among the neighbors

- Application:
    - Finding the price of a new apartment
    
    
- Implementation
    - Obtaining the data
    - Querying the nearest neighbors for prediction
    
    
- Complexity:
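    - O(M * D + M log(M)) Time per query: M distance computations over D features, then a sort
    - O(M) Space
    - M: number of stored training points, D: number of features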

- How to choose K:
    - Pre-determined (arbitrary)
    - Simple rule of thumb: K = (# data points)**0.5
    - Cross validation (use held-out parts of the training data to test hyperparameters)
        - Partition the data into n parts (say 10) and pick candidate values for the hyper-parameter K, e.g. [1, 4, 7]
        - For each K, hold out one part as the validation set and compute the validation error (MSE or classification error) [**Remember there is no training involved in K-nearest neighbors; the remaining parts are simply stored]
        - The K with the smallest validation error is your optimal K
        - A more robust approach is to use each of the n parts as the validation set in turn for every K. For 10 parts, each K gets 10 validation errors; take their mean, and the K with the smallest mean validation error is your optimal K
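
The robust variant above (every part takes a turn as the validation set) can be sketched as follows. This is a minimal illustration for classification only; `knn_predict` and `cross_validate_k` are hypothetical helper names, and the toy data is made up for the example.

```python
import random

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5, label)
        for p, label in zip(train_x, train_y)
    )
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def cross_validate_k(x, y, candidate_ks, n_folds=5, seed=0):
    """Return {k: mean classification error over the folds}."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    mean_errors = {}
    for k in candidate_ks:
        fold_errors = []
        for fold in folds:
            held_out = set(fold)
            # "Training" is just storing every point outside the held-out fold
            train_x = [x[i] for i in indices if i not in held_out]
            train_y = [y[i] for i in indices if i not in held_out]
            wrong = sum(knn_predict(train_x, train_y, x[i], k) != y[i] for i in fold)
            fold_errors.append(wrong / len(fold))
        mean_errors[k] = sum(fold_errors) / n_folds
    return mean_errors

# Toy data (made up): two well-separated groups of 10 points each
x = [[i * 0.1, i * 0.1] for i in range(10)] + [[5 + i * 0.1, 5 + i * 0.1] for i in range(10)]
y = [0] * 10 + [1] * 10
errors = cross_validate_k(x, y, candidate_ks=[1, 3, 5])
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

For regression, the same loop would apply with the vote replaced by an average and the error replaced by MSE.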

#### Functions

- Class KNN
- train
- predict
- distance
- classification

In [16]:
### X = [ [X_00, X_0n], 
#         [X_m0, Xmn] ] where X is a 2-D array, where rows is data-points and columns are features

class KNN:
    def __init__(self):
        self.x = None
        self.y = None
    
    def train(self, x, y):
        self.x = x
        self.y = y
        
    def distance(self, point_A, point_B):
        ## Euclidean distance over all the features, not just the first two
        return sum((a - b)**2 for a, b in zip(point_A, point_B))**0.5
    
    def classification(self, neighbors):
        ## For classification we need to find the most common label
        neighbor_labels = [label for dist, label in neighbors] ## Storing the labels for all the neighbors
        
        ## Counting each label while tracking the most common one
        labelDict = {}
        max_val = 0
        best_label = None
        
        for label in neighbor_labels:
            if label in labelDict:
                labelDict[label] += 1
            else:
                labelDict[label] = 1
            if labelDict[label] > max_val:
                max_val = labelDict[label]
                best_label = label
                
        ## Return the winning label itself, not its count
        return best_label
        
    ## Predict the value for a new data-point using k-neighbors    
    def predict(self, x, k, problem):
        
        ## Find the distance of the new data-point with all the data points, and store this distance as a tuple along with the label
        distance_label = [(self.distance(x, train_point), train_label) for train_point, train_label in zip(self.x, self.y)]       
        
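        ### Sort by distance in ascending order and keep the k nearest neighbors
        neighbors = sorted(distance_label, key=lambda pair: pair[0])[:k]
        
        ## problem is a string: 'reg' for regression, anything else means classification
        if problem == 'reg':
            ## Average the neighbors' y-values and return that as the prediction for the new data point
            return sum(label for _, label in neighbors)/len(neighbors)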
        else:
            return self.classification(neighbors)

<br>

### 3. Linear Regression

- Basics
    - Fitting a straight line through a set of data points
    - Assumes the relation between x and y is linear. This is an important assumption because a straight line cannot capture curvature, so the fit is systematically off when the true relation is non-linear
    - MSE is used in LR: the sum of squared differences b/w the observed and predicted values of y, divided by the total number of observations
        - Goal is to get betas to minimize this error, hence, it is an optimization problem
    - OLS: Ordinary Least Squares estimates the betas that give the smallest MSE for the data using a line.
    
    - From the fitted line we can calculate R2, the fraction of the variance in y explained by x. Large values imply a large effect
    - Calculate a p-value to determine if the R2 is statistically significant
    - And of course, use this line to predict y from x

- Equation
    - yHat = b0 + b1*x
    - x is the independent variable, yHat is the predicted value of the dependent variable y
    - b1 is the slope of the line, and b0 (b-naught) is the constant/intercept, as in the straight line equation y=mx+c
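
For a single feature, OLS has a closed-form solution, so the slope and intercept of this equation can be computed directly. The sketch below assumes that one-feature case; `fit_simple_ols` and `r_squared` are hypothetical helper names and the data is made up.

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS for one feature: returns (intercept b0, slope b1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

def r_squared(xs, ys, b0, b1):
    """Fraction of the variance in y explained by the fitted line."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = fit_simple_ols(xs, ys)
r2 = r_squared(xs, ys, b0, b1)
print(b0, b1, r2)
```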


- Gradient Descent
    - Start with a random guess of the betas
    - Compute MSE
    - Compute gradients (derivative of the error w.r.t. each parameter) and update the betas
    - Repeat until convergence, i.e. the error reaches a minimum (MSE with a linear model is convex, so this is the global minimum)
    - Learning rate determines the step size of gradient descent. Setting it too high makes it unstable, and setting it too low makes it really slow to converge
    - A partial derivative answers: if we change one input by a small value, say 0.01, while holding the others fixed, how does the output change? Here: how does the loss change with each weight/beta
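
The "nudge one parameter and watch the loss" reading of a partial derivative can be checked numerically. The sketch below compares the analytic MSE gradient with a finite-difference estimate for a one-feature model, then takes a single gradient descent step. The function names and the data are made up for illustration.

```python
def mse(xs, ys, b0, b1):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def analytic_gradient(xs, ys, b0, b1):
    """Partial derivatives of MSE w.r.t. b0 and b1."""
    m = len(xs)
    d_b0 = sum(-2 * (y - (b0 + b1 * x)) for x, y in zip(xs, ys)) / m
    d_b1 = sum(-2 * (y - (b0 + b1 * x)) * x for x, y in zip(xs, ys)) / m
    return d_b0, d_b1

def numeric_gradient(xs, ys, b0, b1, eps=1e-6):
    """Nudge each parameter by +/- eps and measure how the loss changes."""
    d_b0 = (mse(xs, ys, b0 + eps, b1) - mse(xs, ys, b0 - eps, b1)) / (2 * eps)
    d_b1 = (mse(xs, ys, b0, b1 + eps) - mse(xs, ys, b0, b1 - eps)) / (2 * eps)
    return d_b0, d_b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
b0, b1 = 0.0, 1.0
exact = analytic_gradient(xs, ys, b0, b1)
approx = numeric_gradient(xs, ys, b0, b1)

# One gradient descent step: move against the gradient
learning_rate = 0.01
new_b0 = b0 - learning_rate * exact[0]
new_b1 = b1 - learning_rate * exact[1]
print(exact, approx, mse(xs, ys, b0, b1), mse(xs, ys, new_b0, new_b1))
```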


- Application
    - Predict demand with Price  
    
- Complexity:
    - O(I * M * N) Time
    - O(N) Space
    - I: iterations, M: data points, N: features

#### Functions
- Main
- initialize params
- compute gradient
- update betas

In [27]:
##### Main function that will call helper functions
###
##

import random

def linear_regression(x, y, iterations=100, learning_rate=0.01):
    ## n is the total number of features (means n+1 betas) and m is the total # of records
    n, m = len(x[0]), len(x)
    ## Initializing the # parameters based on number of features
    beta_0, beta_other = initialize_params(n)
    
    for _ in range(iterations):
        gradient_beta_0, gradient_beta_other = compute_gradient(x, y, beta_0, beta_other, n, m)
        beta_0, beta_other = update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate)
        
    return beta_0, beta_other

In [28]:
#### Parameter initialization
##  
def initialize_params(dimensions):
    beta_0 = 0
    beta_other = [random.random() for _ in range(dimensions)]
    return beta_0, beta_other


#### Computing the gradient
##  
def compute_gradient(x, y, beta_0, beta_other, dimension, records):
    gradient_beta_0 = 0
    gradient_beta_other = [0]*dimension #dimension is the total number of features/columns in the data
    m = records
    
    ## Iterating through each record: the prediction is the dot product of the features with the current betas, plus beta_0
    for i in range(m):
        ## Calculating prediction for each data point i
        y_i_hat = sum(x[i][j] * beta_other[j] for j in range(dimension)) + beta_0
        ## Derivative of the squared error (y - y_hat)**2 w.r.t. y_hat, with the sign flipped.
        ## The factor 2 comes from differentiating the square; the flipped sign makes this the descent direction
        derror_dy = 2*(y[i]-y_i_hat)
        
        ## The intercept gradient is accumulated once per record, so it sits outside the feature loop
        gradient_beta_0 += derror_dy/m
        for j in range(dimension):
            ## Dividing the gradients by m, so that at the end the gradient computed is the average over all data points
            ## Partial derivative of the loss w.r.t. each beta (sign flipped, see above)
            gradient_beta_other[j] += derror_dy*x[i][j]/m
        
    return gradient_beta_0, gradient_beta_other

#### Update params
##  
def update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate):
    beta_0 += gradient_beta_0*learning_rate
    for i in range(0, len(beta_other)):
        beta_other[i] += gradient_beta_other[i]*learning_rate  ## +ve sign because compute_gradient builds its values from (y - y_hat), i.e. it already returns the negative gradient
    return beta_0, beta_other

<br>

### 4. Logistic Regression

- Basics
    - Used for binary classification problems
    - All it does is take a linear combination of all the features (the log odds) and pass it through a sigmoid function to calculate the probability of a data point being in class 1
    - 1-prob(class-1) is the probability of class 0
    - f(z) = 1/(1+e**-z): large positive z gets a probability close to 1, and large negative z gets a probability close to 0. Based on a threshold, we decide whether a data point should be classified as class 0 or 1.
    - yhat = 1 / (1 + exp(-(X * Beta))): that is how a predicted value is squashed into the 0-1 range by the sigmoid function. The output is interpreted as the probability of the class labeled 1 under a Bernoulli distribution (a Binomial with a single trial), if the two classes in the problem are labeled 0 and 1.
    - Application: Email spam or not
    

- Equation
    - p(x|beta) = P(y=1| x,beta) = 1-P(y=0|x,beta); beta is the parameter vector, and x is the vector of independent features. Given x and beta, what is the probability of class 1?
    - p(x|beta) = g(b0 + b1x1 + b2x2 + .. + bnxn): probability of class 1, where g is the sigmoid function
    - We use training data to estimate the betas. Say m data points, n features: n+1 betas
    - We make use of MLE (Maximum Likelihood Estimation) to get the betas. We use the betas, x and y to write down the likelihood of the observed classes, and then choose the betas that maximize that likelihood
    - What is the formula? L(beta) = PRODUCT(i=1 to m){ P(x(i))**y(i) * (1-P(x(i)))**(1-y(i)) }; change it to the log likelihood, SUM(i=1 to m){ y(i)*log(P(x(i))) + (1-y(i))*log(1-P(x(i))) }, and then choose betas to maximize it
    - m is the number of data points; P(x(i)) = p(y(i)=1| x(i), beta); p(y(i)=0| x(i), beta) = 1 - P(x(i))
    - x(i) is a data-point/record with n features
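
To make the likelihood concrete, the sketch below evaluates the log likelihood of a tiny made-up dataset under two candidate sets of betas. The helper names (`predict_proba`, `log_likelihood`) and the beta values are assumptions chosen for illustration; the betas that separate the classes better should score a higher (less negative) log likelihood.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(point, beta_0, beta_other):
    """P(y=1 | x, beta): sigmoid of the log odds."""
    z = beta_0 + sum(b * f for b, f in zip(beta_other, point))
    return sigmoid(z)

def log_likelihood(x, y, beta_0, beta_other):
    """Sum over records of y*log(p) + (1-y)*log(1-p)."""
    total = 0.0
    for point, label in zip(x, y):
        p = predict_proba(point, beta_0, beta_other)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return total

# Made-up data: one feature, class 1 for the larger values
x = [[0.5], [1.5], [3.0], [4.0]]
y = [0, 0, 1, 1]

poor = log_likelihood(x, y, beta_0=0.0, beta_other=[0.0])     # predicts 0.5 for every record
better = log_likelihood(x, y, beta_0=-4.5, beta_other=[2.0])  # decision boundary at x = 2.25
print(poor, better)
```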


- How to maximize the LogLikelihood function for beta?
    - We negate it to get the negative log-likelihood (log loss), which is to be minimized, and use gradient descent to get the betas
    - The gradient for a parameter is the partial derivative of the loss function w.r.t. that parameter
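
The gradients follow from differentiating the negative mean log-likelihood. A short derivation, using the symbols from the Equation list (g is the sigmoid, m data points, n features) and writing p_i for P(x(i)); the name J for the loss is introduced here only for convenience:

```latex
\begin{aligned}
J(\beta) &= -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log p_i + (1-y^{(i)}) \log(1-p_i) \Big],
\qquad p_i = g(z_i),\quad z_i = b_0 + \sum_{j=1}^{n} b_j x^{(i)}_j \\
g'(z) &= g(z)\,\big(1-g(z)\big) \\
\frac{\partial J}{\partial b_j} &= \frac{1}{m}\sum_{i=1}^{m} \big(p_i - y^{(i)}\big)\, x^{(i)}_j,
\qquad
\frac{\partial J}{\partial b_0} = \frac{1}{m}\sum_{i=1}^{m} \big(p_i - y^{(i)}\big)
\end{aligned}
```

Each term is the prediction error p_i - y(i) weighted by the feature value, which is why the gradient code only needs the error and the features.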

    
- Complexity:
    - O(I * M * N) Time
    - O(N) Space
    - I: iterations, M: data points, N: features
    
- Prep:
    - Probability vs. Likelihood
    - Meaning and generation of coefficients in case of continuous vs. discrete variables in Logistic regression
    - Changing the y-axis from 0-1 to Log(odds) graph for continuous and discrete variables and forming the equations
    - Log odds to sigmoid, to Likelihood calculation from sigmoid curve, and rotating the log-odds line to find the best fitting line using MLE
    
    

In [37]:
##### Main function for logistic regression
##

import math
import random

def logistic_regression(x, y, iterations=100, learning_rate=0.01):
    ## Number of records and # of features
    m, n = len(x), len(x[0])
    beta_0, beta_other = initialize_params(n)
    
    for _ in range(iterations):
        gradient_beta_0, gradient_beta_other = compute_gradients(x, y, beta_0, beta_other, m, n)
        beta_0, beta_other = update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate)
        
    return beta_0, beta_other

In [38]:
### Initialize the parameters 
##
def initialize_params(n_features):
    beta_0 = 0
    beta_other = [random.random() for _ in range(n_features)] 
    return beta_0, beta_other


### Compute the gradients
##
def compute_gradients(x, y, beta_0, beta_other, m, n):
    gradient_beta_0 = 0
    gradient_beta_other = [0]*n
    
    for i, point in enumerate(x):   #iterate through each record in the data and calculate predicted value using the params
        pred = logistic_function(point, beta_0, beta_other)
        error = pred-y[i]
        
        ## The intercept gradient is accumulated once per record, outside the feature loop
        gradient_beta_0 += error/m
        for j, feature in enumerate(point):  #iterate through each feature for each record and compute the gradient
            gradient_beta_other[j] += error*feature/m
            
    return gradient_beta_0, gradient_beta_other

### Update the parameters
##
def update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate):
    ## -ve sign: error is pred - y, so these are true gradients of the log loss and we step against them
    beta_0 -= gradient_beta_0*learning_rate
    for j in range(len(beta_other)):
        beta_other[j] -= gradient_beta_other[j]*learning_rate
        
    return beta_0, beta_other

### Logistic Function
##
def logistic_function(point, beta_0, beta_other):
    ## Linear combination of the features plus the intercept beta_0 gives the log odds
    log_odds = sum(beta_other[i]*point[i] for i in range(len(beta_other))) + beta_0
    prob = 1/(1+math.exp(-log_odds))
    return prob

- The above solution becomes slow for a large dataset
- A better approach is mini-batch gradient descent
    - Earlier we looped through the entire dataset for every single step towards the target, which is very inefficient for large datasets
    - Mini-batch takes a random subset of the dataset and calculates the gradient on it. This is faster per step, and because the batch is small it fits into memory
    - The downside is that the gradient is noisier, so an individual step may point in the wrong direction. Averaged over many iterations, the parameters still move towards the optimum

In [None]:
def compute_gradients_minibatch(x, y, beta_0, beta_other, m, n, batch_size):
    gradient_beta_0 = 0
    gradient_beta_other = [0]*n
    
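    for _ in range(batch_size):
        ## Sample a record index uniformly at random (with replacement)
        i = random.randint(0, m-1)
        point = x[i]
        pred = logistic_function(point, beta_0, beta_other)
        error = pred - y[i]
        
        ## The intercept gradient is accumulated once per sampled record
        gradient_beta_0 += error/batch_size
        for j in range(n):
            gradient_beta_other[j] += error*x[i][j]/batch_size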
    
    return gradient_beta_0, gradient_beta_other

<br>

### 5. Decision Tree


In [231]:
import numpy as np

class DecisionTree:
    """
    Decision tree for classification
    """

    def __init__(self):
        self.root_dict = None
        self.tree_dict = None

    def split_dataset(self, X, y, feature_idx, threshold):
        """
        Splits dataset X into two subsets, according to a given feature
        and feature threshold.

        Args:
            X: 2D numpy array with data samples
            y: 1D numpy array with labels
            feature_idx: int, index of feature used for splitting the data
            threshold: float, threshold used for splitting the data

        Returns:
            splits: dict containing the left and right subsets
            and their labels
        """

        left_idx = np.where(X[:, feature_idx] < threshold)
        right_idx = np.where(X[:, feature_idx] >= threshold)

        left_subset = X[left_idx]
        y_left = y[left_idx]

        right_subset = X[right_idx]
        y_right = y[right_idx]

        splits = {
           'left': left_subset,
           'y_left': y_left,
           'right': right_subset,
           'y_right': y_right,
        }

        return splits

    def gini_impurity(self, y_left, y_right, n_left, n_right):
        """
        Computes Gini impurity of a split.

        Args:
            y_left, y_right: target values of samples in left/right subset
            n_left, n_right: number of samples in left/right subset

        Returns:
            gini_left: float, Gini impurity of left subset
            gini_right: float, Gini impurity of right subset
        """

        n_total = n_left + n_right

        score_left, score_right = 0, 0
        gini_left, gini_right = 0, 0

        if n_left != 0:
            for c in range(self.n_classes):
                # For each class c, compute fraction of samples with class c
                p_left = len(np.where(y_left == c)[0]) / n_left
                score_left += p_left * p_left
            gini_left = 1 - score_left

        if n_right != 0:
            for c in range(self.n_classes):
                p_right = len(np.where(y_right == c)[0]) / n_right
                score_right += p_right * p_right
            gini_right = 1 - score_right

        return gini_left, gini_right

    def get_cost(self, splits):
        """
        Computes cost of a split given the Gini impurity of
        the left and right subset and the sizes of the subsets
        
        Args:
            splits: dict, containing params of current split
        """
        y_left = splits['y_left']
        y_right = splits['y_right']

        n_left = len(y_left)
        n_right = len(y_right)
        n_total = n_left + n_right

        gini_left, gini_right = self.gini_impurity(y_left, y_right, n_left, n_right)
        cost = (n_left / n_total) * gini_left + (n_right / n_total) * gini_right

        return cost

    def find_best_split(self, X, y):
        """
        Finds the best feature index and threshold to split dataset X into
        two groups. Checks every value of every attribute as a candidate
        split.

        Args:
            X: 2D numpy array with data samples
            y: 1D numpy array with labels

        Returns:
            best_split_params: dict, containing parameters of the best split
        """

        n_samples, n_features = X.shape

        best_feature_idx, best_threshold, best_cost, best_splits = np.inf, np.inf, np.inf, None

        for feature_idx in range(n_features):
            for i in range(n_samples):
                current_sample = X[i]
                threshold = current_sample[feature_idx]
                splits = self.split_dataset(X, y, feature_idx, threshold)
                cost = self.get_cost(splits)

                if cost < best_cost:
                    best_feature_idx = feature_idx
                    best_threshold = threshold
                    best_cost = cost
                    best_splits = splits

        best_split_params = {
            'feature_idx': best_feature_idx,
            'threshold': best_threshold,
            'cost': best_cost,
            'left': best_splits['left'],
            'y_left': best_splits['y_left'],
            'right': best_splits['right'],
            'y_right': best_splits['y_right'],
        }

        return best_split_params


    def build_tree(self, node_dict, depth, max_depth, min_samples):
        """
        Builds the decision tree in a recursive fashion.

        Args:
            node_dict: dict, representing the current node
            depth: int, depth of current node in the tree
            max_depth: int, maximum allowed tree depth
            min_samples: int, minimum number of samples needed to split a node further

        Returns:
            node_dict: dict, representing the full subtree originating from current node
        """
        left_samples = node_dict['left']
        right_samples = node_dict['right']
        y_left_samples = node_dict['y_left']
        y_right_samples = node_dict['y_right']

        ## If one side of the split is empty, both children become the same terminal node
        if len(y_left_samples) == 0 or len(y_right_samples) == 0:
            node_dict["left_child"] = node_dict["right_child"] = self.create_terminal_node(np.append(y_left_samples, y_right_samples))
            return node_dict

        ## At the maximum depth, both children become terminal nodes
        if depth >= max_depth:
            node_dict["left_child"] = self.create_terminal_node(y_left_samples)
            node_dict["right_child"] = self.create_terminal_node(y_right_samples)
            return node_dict

        if len(right_samples) < min_samples:
            node_dict["right_child"] = self.create_terminal_node(y_right_samples)
        else:
            node_dict["right_child"] = self.find_best_split(right_samples, y_right_samples)
            self.build_tree(node_dict["right_child"], depth+1, max_depth, min_samples)

        if len(left_samples) < min_samples:
            node_dict["left_child"] = self.create_terminal_node(y_left_samples)
        else:
            node_dict["left_child"] = self.find_best_split(left_samples, y_left_samples)
            self.build_tree(node_dict["left_child"], depth+1, max_depth, min_samples)

        return node_dict

    def create_terminal_node(self, y):
        """
        Creates a terminal node.
        Given a set of labels the most common label is computed and
        set as the classification value of the node.

        Args:
            y: 1D numpy array with labels
        Returns:
            classification: int, predicted class
        """
        classification = max(set(y), key=list(y).count)
        return classification

    def train(self, X, y, max_depth, min_samples):
        """
        Fits decision tree on a given dataset.

        Args:
            X: 2D numpy array with data samples
            y: 1D numpy array with labels
            max_depth: int, maximum allowed tree depth
            min_samples: int, minimum number of samples needed to split a node further
        """
        self.n_classes = len(set(y))
        self.root_dict = self.find_best_split(X, y)
        self.tree_dict = self.build_tree(self.root_dict, 1, max_depth, min_samples)

    def predict(self, X, node):
        """
        Predicts the class for a given input example X.

        Args:
            X: 1D numpy array, input example
            node: dict, representing trained decision tree

        Returns:
            prediction: int, predicted class
        """
        feature_idx = node['feature_idx']
        threshold = node['threshold']

        if X[feature_idx] < threshold:
            if isinstance(node['left_child'], (int, np.integer)):
                return node['left_child']
            else:
                prediction = self.predict(X, node['left_child'])
        elif X[feature_idx] >= threshold:
            if isinstance(node['right_child'], (int, np.integer)):
                return node['right_child']
            else:
                prediction = self.predict(X, node['right_child'])

        return prediction
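
The split cost used by `get_cost` can be checked on a tiny example. The standalone sketch below is not part of the class above; `gini` and `weighted_split_cost` are hypothetical helpers that compute the same quantity (size-weighted Gini impurity of the two subsets) for a perfectly pure split and for a mixed one.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class fractions (0 for an empty set)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    fractions = counts / labels.size
    return 1.0 - float(np.sum(fractions ** 2))

def weighted_split_cost(y_left, y_right):
    """Impurity of each side, weighted by the share of samples it holds."""
    n_left, n_right = len(y_left), len(y_right)
    n_total = n_left + n_right
    return (n_left / n_total) * gini(y_left) + (n_right / n_total) * gini(y_right)

pure_cost = weighted_split_cost([0, 0, 0], [1, 1, 1])   # every side holds one class
mixed_cost = weighted_split_cost([0, 1, 0], [1, 0, 1])  # each side is a 2:1 mix
print(pure_cost, mixed_cost)
```

A lower cost means purer subsets, which is why `find_best_split` keeps the candidate with the smallest cost.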

<br>

### 6. DNN

- PERCEPTRON

In [241]:
### BASICS

import numpy as np

print((np.array([0,0,1,0])).shape)  # 1-D vector
print((np.array([[0,0,1,0]])).shape)  # 2-D row matrix
print((np.array([[0],[0],[1],[0]])).shape)  # 2-D column matrix
print((np.array([[[0],[0],[1],[0]]])).shape)  # 3-D array

(4,)
(1, 4)
(4, 1)
(1, 4, 1)
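
These shapes matter when labels have to be a column vector. A few equivalent ways to turn a 1-D array into an (n, 1) column, shown as a small sketch using standard NumPy calls:

```python
import numpy as np

labels = np.array([0, 0, 1, 0])              # shape (4,)

col_a = np.array([[0, 0, 1, 0]]).T           # transpose a (1, 4) row
col_b = labels.reshape(-1, 1)                # -1 lets NumPy infer the row count
col_c = labels[:, np.newaxis]                # add a trailing axis by indexing
col_d = np.expand_dims(labels, axis=1)       # same thing via a function call

for col in (col_a, col_b, col_c, col_d):
    print(col.shape)
```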


In [42]:
### PERCEPTRON CODE

import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def sigmoid_derivative(x):
    ## x is expected to already be a sigmoid output s, so this returns s*(1-s)
    return x*(1-x)


training_inputs = np.array([[0,0,1],    ##Training is a 4x3 matrix, 4-records, 3-features
                            [1,1,1],
                            [1,0,1],
                            [0,1,1]])

training_outputs = np.array([[0,0,1,0]]).T  ## Transposed to make it a 4x1 matrix; np.array([0,0,1,0]).reshape(-1, 1) gives the same result

np.random.seed(1)

## np.random.random() generates a number in [0, 1). By passing a shape to this function, you can generate a matrix of
## uniformly distributed numbers in [0, 1)

### Scale and shift to get uniformly distributed weights in [-1, 1) with a mean of 0
synaptic_weights = 2*np.random.random((3,1)) - 1


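## Adjustment = input * error * sigmoid derivative

### Each pass: run the inputs forward through the sigmoid, measure the error, scale it by the sigmoid slope at the output
### (confident outputs near 0 or 1 get small updates), then project it back onto the inputs to get the per-weight updates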

for iteration in range(50000):
    input_layer = training_inputs 
    output = sigmoid(np.dot(input_layer, synaptic_weights))
    
    ### Calculate the error, and take the gradient, and update the weights
    error = training_outputs - output
    adjustments = error*sigmoid_derivative(output)
    synaptic_weights += np.dot(input_layer.T, adjustments)
    
print('Outputs for training: {}'.format(output))

Outputs for training: [[5.54116778e-03]
 [4.52027080e-03]
 [9.93599779e-01]
 [1.62978367e-07]]


In [97]:
2*np.random.random((3,1))-1

array([[0.81779661],
       [0.42362076],
       [0.97693232]])

<br>

- Usable Perceptron

In [43]:
#### Neural Net

import numpy as np

class NeuralNetwork():
    
    def __init__(self):
        '''Here you can declare the variables that later can be used'''
        np.random.seed(1)
        
        self.synaptic_weights = 2*np.random.random((3,1)) - 1
        
    def sigmoid(self, x):
        return 1/(1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        return x*(1 - x)   
        
    def train(self, training_inputs, training_outputs, training_iterations):
        
        for iterations in range(training_iterations):
            output = self.think(training_inputs)
            error = training_outputs - output
            
            adjustments = np.dot(training_inputs.T, error*self.sigmoid_derivative(output))
            self.synaptic_weights += adjustments
    
    def think(self,inputs):
        inputs = inputs.astype(float)
        return self.sigmoid(np.dot(inputs, self.synaptic_weights))

In [47]:
## Making it a usable command-line function
if __name__ == "__main__":
    dnn = NeuralNetwork()
    print('Random weights {}'.format(dnn.synaptic_weights))
    
    training_inputs = np.array([[0,0,1],    ##Training is a 4x3 matrix, 4-records, 3-features
                            [1,1,1],
                            [1,0,1],
                            [0,1,1]])
    training_outputs = np.array([[0,0,1,0]]).T
    
    dnn.train(training_inputs, training_outputs, 5000)
    print('\n output: {}'.format(dnn.think(training_inputs)))
    

Random weights [[-0.16595599]
 [ 0.44064899]
 [-0.99977125]]

 output: [[1.79815368e-02]
 [1.46383347e-02]
 [9.79216854e-01]
 [5.77341890e-06]]


<br>
<br>

####  Neural Net
- We want NumPy arrays because they support element-wise and matrix multiplication and addition, which plain Python lists do not
- Weights are initialized with random float values, so it's important that the inputs are floats as well
- We don't want weights or inputs to be too big as it may lead to exploding gradients; usually we keep them within [-1, 1]. Scale the input if needed
- Biases are usually initialized to 0. If the neurons do not activate in the first iteration, the network may be dead, as you will just be propagating 0's forward and backward; in that case a small non-zero initial bias can help.

- Defining the weights with shape (n_inputs, n_neurons) in the `__init__()` function makes the calculation easier, as we don't need a transpose every time we multiply inputs by weights




- Backpropagation (BP) is basically nudging the weights to reduce the cost function.

<br>

##### Questions
- Why are we taking the -ve of the gradient? The gradient tells us the direction of steepest ascent, so its negative is the direction of steepest descent of the cost.
- How many gradients do we take?
- How do things change from 1 perceptron to a network? We go from a single loss to the average loss over all inputs before calculating the gradient, and from one set of weights to gradients for all the weights in the network
- How can you tell which changes to weights matter the most/least? Look at the gradient! It tells you how sensitive the cost function is to each weight and bias
- Neurons that fire together, wire together; that is the intuition behind BP. Take a training example and calculate what changes should be applied to the previous layers to bring the predicted value closer to the actual value, and keep propagating that backwards. Now you know what changes need to be made to the 1000's of weights and biases to reduce the loss for that single training example. Do this for all the training examples, take the average, and apply those changes to the weights and biases.

<br>

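##### Sigmoid Vs. ReLU

- Sigmoid has a vanishing gradient issue: its slope is at most 0.25 and approaches 0 for large |x|, so gradients shrink as they pass back through many layers
- ReLU is fast due to its super simple calculation, max(0, x)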
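
The difference in gradients can be seen numerically. The sketch below compares the slope of sigmoid and ReLU at a few made-up inputs, and multiplies the best-case sigmoid slope across a hypothetical 10-layer chain to illustrate how quickly the product shrinks. It ignores the weights, which also enter the real chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for active units, 0 otherwise

z = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
sig_g = sigmoid_grad(z)
relu_g = relu_grad(z)

# Product of activation slopes through a hypothetical 10-layer chain
depth = 10
best_case_sigmoid_chain = 0.25 ** depth   # even at the sigmoid's steepest point
relu_chain_active = 1.0 ** depth          # active ReLU units pass the gradient unchanged
print(sig_g, relu_g, best_case_sigmoid_chain, relu_chain_active)
```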


<br>

##### Softmax for multi-class classification
- Generalization of the logistic function to multiple dimensions, used to predict a multinomial probability distribution.
- Softmax exponentiates the scores and so enlarges differences: it pushes the largest result closer to 1 and the others closer to 0
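
A minimal softmax sketch, applied row-wise to a batch of scores. Subtracting the row maximum before exponentiating is a standard trick that leaves the result unchanged but avoids overflow; the example scores are made up.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: exponentiate, then normalize each row to sum to 1."""
    logits = np.asarray(logits, dtype=float)
    # Shift by the row maximum for numerical stability (does not change the output)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 6.0]])
probs = softmax(scores)
print(probs)
```

In the second row the largest score stands further apart from the rest, so its probability ends up closer to 1 than in the first row.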


In [70]:
import numpy as np

## Shape 3,4
X = [[1, 2, 3, 2.5],
     [2.0, 5.0, -1.0, 2.0],
     [-1.5, 2.7, 3.3, -0.8]]

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):  ## Runs when the class object is created; receives the arguments passed to Layer_Dense(...)
        self.weights = 0.1*np.random.randn(n_inputs, n_neurons)  ## Gaussian/standard normal distribution and then scaling them
        self.biases = np.zeros((1, n_neurons))
     
    def sigmoid(self, x):
        return 1/(1 + np.exp(-x))
    
    def forward(self, inputs):
        self.output = self.sigmoid(np.dot(inputs, self.weights) + self.biases)
        # return self.output
        
        
class Activation_Relu:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)
        # return self.output 
        
## Layer-1 must have n_inputs = # of features in the input data
layer_1 = Layer_Dense(4, 5)
layer_2 = Layer_Dense(5, 10) ## Each subsequent layer must have n_inputs = # of neurons in the previous layer; the # of neurons
                            ### for this layer can be any arbitrary number
    
layer_1.forward(X)
# print(layer_1.output)
layer_2.forward(layer_1.output)
# print(layer_2.output)

layer_3 = Layer_Dense(10, 1)
layer_3.forward(layer_2.output)
print(layer_3.output)

[[0.51993858]
 [0.5201816 ]
 [0.52032996]]
