# Deep Learning and Neural Networks
### A neural network is an artificial intelligence model inspired by the way the human brain processes information. It underlies deep learning, a type of machine learning that uses interconnected nodes, or neurons, arranged in a layered structure that resembles the human brain.

## What makes a Neural Network
### A neural network consists mainly of three parts<br> <br>  1. Input Layer <br>  2. Hidden Layer <br>  3. Output Layer

![image-3.png](attachment:image-3.png)

## Input Layer 
### Input layer refers to the first layer of nodes in an artificial neural network. This layer receives input data from the outside world.
### In general, artificial neurons are likely to have a set of weighted inputs and function on the basis of those weighted inputs – however, in theory, an input layer can be composed of artificial neurons that do not have weighted inputs, or where weights are calculated differently. What is common in the neural network model is that the input layer sends the data to subsequent layers, in which the neurons do have weighted inputs.

## What is a Neuron in a Neural Network

### A neuron is the basic unit of a neural network. Each neuron is assigned weights and a bias; when input values are passed in, these parameters and the activation function determine whether, and how strongly, the neuron fires. Each neuron in a layer typically has its own bias, although frameworks store a layer's biases together as one vector.
### Neurons are the processing units of the network. Each neuron weighs and sums its inputs, adds its bias, and passes the result through an activation function. The activation function introduces non-linearity and shapes the value that is fed to the next layer. Choosing a different activation function changes how the neuron responds.
### A Perceptron is a neural network unit that performs computations to detect features or business intelligence in the input data. It is a function that maps its input "x", multiplied by the learned weight coefficients, to an output value "f(x)". A Perceptron is an artificial neuron and the simplest possible neural network.
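### The weighted sum and firing rule described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `perceptron` and the hand-picked weights (chosen so that the unit behaves like a logical AND) are assumptions, not part of the original notes.

```python
def perceptron(inputs, weights, bias):
    """Return 1 (fires) if the weighted sum plus bias is positive, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical hand-picked parameters: the unit only fires when both inputs are 1.
and_weights = [1.0, 1.0]
and_bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron([a, b], and_weights, and_bias))
```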
### The first trainable neural network, the Perceptron, was demonstrated by the Cornell University psychologist Frank Rosenblatt in 1957. The Perceptron’s design was much like that of the modern neural net, except that it had only one layer with adjustable weights and thresholds, sandwiched between input and output layers.

![image.png](attachment:image.png)

## Hidden Layer

### Each neuron in a dense layer receives the output of every neuron in the preceding layer, so the layer as a whole performs a matrix-vector multiplication. The outputs of the preceding layer form a vector, and the weights of the dense layer form a matrix with one row per neuron. The general rule of matrix-vector multiplication is that the matrix must have as many columns as the vector has entries.
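### A minimal sketch of the matrix-vector multiplication a dense layer performs, assuming one row of weights per neuron. The helper name `dense_forward` and the example numbers are illustrative assumptions; real frameworks do the same arithmetic with optimized tensor operations.

```python
def dense_forward(W, b, x, activation=lambda z: z):
    """Compute activation(W @ x + b) for a single input vector x."""
    if any(len(row) != len(x) for row in W):
        raise ValueError("each row of W needs exactly one weight per input value")
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]


# Two neurons, each connected to all three outputs of the preceding layer.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]
x = [2.0, 4.0, 6.0]

output = dense_forward(W, b, x)
print(output)  # [-4.0, 7.0]
```

### Note how a layer with two neurons turns a vector with three entries into a vector with two entries.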

### Networks with multiple hidden layers are called deep neural networks. The most common type of hidden layer is the fully-connected layer. Here, each neuron is connected to every neuron in the two adjacent layers (the previous one and the next one). It is not connected to the neurons in its own layer. Convolutional layers are another type of hidden layer, very prominent when dealing with images.


## Output Layer

### The output layer is the final layer with neurons. This is where the data comes out of your model. So the number of neurons needs to be exactly the number of outputs you want, i.e. the questions you want to answer. 

## TensorFlow 
### TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.

# Important concepts 

## Activation Function

### An activation function is the function you use to get the output of a node. It can be used to turn the output of a neural network into a decision such as yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1, depending on the function.

### There are different types of activation functions<br><br> 1. Sigmoid or Logistic<br> 2. Tanh or hyperbolic tangent<br> 3. ReLU (Rectified Linear Unit)  <br> 4. Leaky ReLU <br> 5. Softmax

## 1. Sigmoid or Logistic Activation Function

### The sigmoid function curve is S-shaped. The main reason we use the sigmoid function is that its output lies between 0 and 1. It is therefore especially useful for models that have to predict a probability as output: since a probability only takes values between 0 and 1, sigmoid is a natural choice.
 
### The logistic sigmoid function can cause a neural network to get stuck during training, because its gradient becomes very small for inputs of large magnitude. The softmax function is a generalization of the logistic function that is used for multiclass classification.
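### As a rough illustration of both functions, here is a sketch using only the standard library. The function names are assumptions made for this example; the last lines check that a two-class softmax with the second score fixed at 0 gives the same probability as the sigmoid, which is the sense in which softmax generalizes it.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any moderate real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))              # 0.5
print(softmax([1.0, 2.0, 3.0]))  # three probabilities, largest for the last score

z = 1.3
print(sigmoid(z), softmax([z, 0.0])[0])  # both values agree up to rounding
```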

## 2. Tanh or hyperbolic tangent Activation Function

### tanh is similar to the logistic sigmoid but is often preferred for hidden layers because its output is zero-centered. The range of the tanh function is -1 to 1, and its curve is also sigmoidal (S-shaped). The advantage is that negative inputs are mapped strongly negative and inputs near zero are mapped near zero in the tanh graph.
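### To make the comparison concrete, the sketch below prints both functions for a few inputs. The helper `sigmoid` is defined here only for the example; `math.tanh` is the standard library implementation. It also relies on the identity tanh(z) = 2·sigmoid(2z) - 1, which shows that tanh is a rescaled and shifted sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for z in (-2.0, 0.0, 2.0):
    # sigmoid stays in (0, 1); tanh is centered on zero and stays in (-1, 1)
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):+.3f}")

# tanh expressed through the sigmoid
z = 1.0
rescaled_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0
print(abs(math.tanh(z) - rescaled_sigmoid) < 1e-12)  # True
```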

![image.png](attachment:image.png)

## 3. ReLU (Rectified Linear Unit) Activation Function

### ReLU is currently one of the most widely used activation functions, as it appears in almost all convolutional neural networks and many other deep learning models. ReLU is half-rectified: its output is zero for negative inputs and equal to the input otherwise, so it ranges from zero to infinity. The drawback is that all negative values become zero immediately, which can decrease the ability of the model to fit or train from the data properly.

![image-2.png](attachment:image-2.png)

## 4. Leaky ReLU Activation Function

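### Leaky ReLU is an attempt to solve the dying ReLU problem. For negative inputs it outputs a·x instead of zero, and this leak extends the range of the function to negative values. Usually, the value of a is 0.01 or so. When a is instead sampled randomly during training, the function is called Randomized ReLU; when a is learned along with the weights, it is called Parametric ReLU.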
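### A small sketch contrasting ReLU with Leaky ReLU, assuming the common default slope a = 0.01 for negative inputs. The function names are illustrative.

```python
def relu(z):
    """Zero for negative inputs, identity otherwise."""
    return max(0.0, z)


def leaky_relu(z, a=0.01):
    """Like ReLU, but negative inputs keep a small slope a instead of becoming zero."""
    return z if z > 0 else a * z


for z in (-3.0, -0.5, 0.0, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.3f}  leaky_relu={leaky_relu(z):+.3f}")
```

### For negative inputs ReLU returns exactly zero, so no gradient flows through the neuron, while Leaky ReLU still passes a small signal.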

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## Why is differentiation used?

### When updating the curve during training, the slope tells us in which direction and by how much to change it. That is why differentiation is used in almost every part of machine learning and deep learning.
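### The idea can be illustrated with a numerical slope estimate. The loss function below, (w - 3)², is a made-up example with its minimum at w = 3, and `slope` is a hypothetical helper using a central difference; neither comes from the original notes.

```python
def slope(f, x, h=1e-6):
    """Estimate the derivative of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def loss(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2


for w in (1.0, 5.0):
    s = slope(loss, w)
    direction = "increase w" if s < 0 else "decrease w"
    print(f"w={w}: slope is about {s:+.2f}, so {direction}")
```

### A negative slope means the loss falls as w grows, a positive slope means the opposite, and the size of the slope indicates how steep the curve is at that point.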

![image.png](attachment:image.png)

## Flatten


### Flattening is used to convert all the resultant 2-Dimensional arrays from pooled feature maps into a single long continuous linear vector. The flattened matrix is fed as input to the fully connected layer to classify the image.
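### A minimal sketch of flattening, assuming the pooled feature maps are given as a list of 2-dimensional nested lists. The name `flatten` and the sample values are illustrative.

```python
def flatten(feature_maps):
    """Concatenate every row of every 2-D feature map into one long vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


pooled_maps = [
    [[1, 2],
     [3, 4]],
    [[5, 6],
     [7, 8]],
]

vector = flatten(pooled_maps)
print(vector)  # [1, 2, 3, 4, 5, 6, 7, 8]
```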

![image.png](attachment:image.png)

## Types of Layers 

## 1. Basic Layer
### The dense layer is the basic layer in deep learning. It takes an input, applies a linear transformation (weights and bias), and passes the result through its activation function. A dense layer is often used to change the dimensions of the tensor.

## 2. Convolution Layer
### The convolution layer is a bit more complex. It is mainly used for computer vision tasks, that is, to analyze images. A convolution analyzes an image (or other data) piece by piece, sliding a small filter over it.
![image.png](attachment:image.png)
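
### The piece-by-piece analysis can be sketched as a small filter sliding over the image. This is a simplified illustration (no padding, stride 1, a single channel); as in most deep learning libraries the filter is applied without flipping, which is strictly speaking a cross-correlation. The function name, the image, and the edge-detecting filter are assumptions made for the example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Dark left half, bright right half
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Filter that responds where brightness increases from left to right
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

### The output is largest exactly where the vertical edge sits, which is how a convolution layer detects local patterns.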

## 3. Pooling Layer 
### Pooling is used to reduce the size of the data. It compresses the input tensor to keep only the relevant information. Like the convolution layer, it works in pieces. However, the pooling layer keeps only the important information of each piece and discards the rest.
### There are different pooling types: <br> 1. Max Pooling, which takes the maximum value in a subset of data<br> 2. Average Pooling, which takes the average of the values in a subgroup of data
### Average pooling method smooths out the image and hence the sharp features may not be identified when this pooling method is used. Max pooling selects the brighter pixels from the image. It is useful when the background of the image is dark and we are interested in only the lighter pixels of the image.
### When classifying the MNIST digits dataset using a CNN, max pooling is used because the background in these images is black (which also reduces the computation cost), so the lighter pixels of the digits carry the relevant information.
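### Both pooling types can be sketched with one helper that applies a reducing function to each non-overlapping window. The names `pool2d` and `average`, the 2×2 window, and the sample image are illustrative assumptions.

```python
def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window of the image."""
    return [
        [
            reduce_fn([image[i + di][j + dj]
                       for di in range(size) for dj in range(size)])
            for j in range(0, len(image[0]) - size + 1, size)
        ]
        for i in range(0, len(image) - size + 1, size)
    ]


def average(values):
    return sum(values) / len(values)


image = [[1, 3, 2, 0],
         [5, 7, 1, 1],
         [0, 2, 9, 4],
         [4, 2, 3, 8]]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, average)
print(max_pooled)  # [[7, 2], [4, 9]]
print(avg_pooled)  # [[4.0, 1.0], [2.0, 6.0]]
```

### Max pooling keeps the brightest value of each window, while average pooling smooths the window into a single mean value.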

![image.png](attachment:image.png)


## Backpropagation

### Backpropagation is a process involved in training a neural network. It takes the error (loss) computed in a forward pass and feeds it backward through the neural network layers to fine-tune the weights. Backpropagation is the essence of neural net training.
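### A minimal sketch of the idea for a deliberately tiny network: one input, one sigmoid hidden neuron, one linear output, and a squared-error loss. The network shape, the parameter names `w1` and `w2`, and the learning rate are assumptions chosen to keep the chain rule readable; real networks apply the same steps to whole weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    h = sigmoid(w1 * x)  # hidden activation
    y_hat = w2 * h       # linear output
    return h, y_hat


def loss(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return (y_hat - y) ** 2


def backward(x, y, w1, w2):
    """Propagate the error from the output back to each weight (chain rule)."""
    h, y_hat = forward(x, w1, w2)
    d_loss_d_yhat = 2.0 * (y_hat - y)
    grad_w2 = d_loss_d_yhat * h
    d_loss_d_h = d_loss_d_yhat * w2
    grad_w1 = d_loss_d_h * h * (1.0 - h) * x  # sigmoid'(z) = h * (1 - h)
    return grad_w1, grad_w2


x, y = 1.0, 1.0
w1, w2 = 0.5, 0.5
learning_rate = 0.1

loss_before = loss(x, y, w1, w2)
grad_w1, grad_w2 = backward(x, y, w1, w2)
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
loss_after = loss(x, y, w1, w2)
print(loss_before, loss_after)  # the loss shrinks after one update
```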

![image.png](attachment:image.png)


## Optimizers

### Once the loss function is defined, an optimizer is used to adjust the model’s parameters to minimize the loss function. These optimizers can be fine-tuned through hyperparameters such as the learning rate, momentum, and decay rate.

### Optimizers can also be combined with techniques such as learning rate scheduling, which can help to further improve the model's performance. The most common optimizers include gradient descent, stochastic gradient descent (SGD), and Adam.

## Gradient Descent
### Gradient descent is one of the most widely used optimizers. It adjusts the model’s parameters by taking the derivative of the loss function with respect to the parameters and updating the parameters in the direction of the negative gradient.
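### The update rule can be sketched for a single parameter. The toy loss (w - 3)², whose gradient is 2(w - 3), as well as the function name, learning rate, and step count are assumptions for illustration.

```python
def gradient_descent(grad_fn, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad_fn(w)
    return w


# Gradient of the toy loss (w - 3)^2, which has its minimum at w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0)
print(w_min)  # very close to 3
```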

![image.png](attachment:image.png)

## Vanishing Gradient problem 
### Consider the graph of the sigmoid function and its derivative. Observe that for inputs of very large magnitude, the derivative of the sigmoid takes a very low value. If the neural network has many hidden layers, the gradients in the earlier layers become very small as we multiply the derivatives of each layer. As a result, learning in the earlier layers becomes very slow.

### On the other hand, the derivative of the ReLU function is either 0 or 1. Even if we multiply the derivatives for many layers, there is no degradation, unlike the case of the sigmoid function (assuming that ReLU is not operating in the dead region). However, if you look carefully at equation 2, the error signal also depends on the weights of the network. If the weights are consistently smaller than one in magnitude, the gradients will still diminish slowly.
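### A rough numerical illustration of this effect, under the simplifying assumption that every layer sees the same pre-activation value and that the weights are ignored. The function names are made up for the example.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(derivative, z, layers):
    """Multiply the same local derivative once per layer."""
    return derivative(z) ** layers


for layers in (1, 5, 10):
    print(layers,
          chained_gradient(sigmoid_derivative, 0.0, layers),
          chained_gradient(relu_derivative, 1.0, layers))
```

### Even at z = 0, where the sigmoid derivative is at its maximum of 0.25, ten layers shrink the factor to 0.25 to the power of 10, which is below one millionth, while the ReLU factor stays at 1 for active neurons.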

![image.png](attachment:image.png) ![image-2.png](attachment:image-2.png)

## Stochastic Gradient Descent (SGD)
### SGD is a variant of gradient descent. It updates the model’s parameters after each training sample, rather than after each epoch. This often makes it faster to converge, but it can also make the optimization process more unstable. Stochastic gradient descent is often used for problems with a large amount of data.
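### The per-sample update can be sketched by fitting a single weight w in the model y = w·x. The data (generated from y = 2x), the function name `sgd_fit`, and the hyperparameters are assumptions for illustration.

```python
import random


def sgd_fit(samples, learning_rate=0.05, epochs=50, seed=0):
    """Fit y = w * x, updating w after every single training sample."""
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2 for this sample
            w -= learning_rate * grad
    return w


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_fit = sgd_fit(samples)
print(w_fit)  # close to 2
```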

![image.png](attachment:image.png)

## Adaptive Moment Estimation (Adam)
### Adam is an optimizer that combines the advantages of momentum-based gradient descent and adaptive learning rate methods such as RMSProp. It uses estimates of the first and second moments of the gradients to adapt the learning rate for each parameter. Adam is generally considered to be one of the best default optimizers for deep learning.
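### A sketch of the Adam update for a single parameter, using the commonly quoted default values for beta1, beta2, and eps. The function name, the toy loss (w - 3)², the learning rate, and the step count are assumptions for illustration, not a reference implementation.

```python
import math


def adam_minimize(grad_fn, start, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    w, m, v = start, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g      # first moment: running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g  # second moment: running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)         # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), start=0.0)
print(w_adam)  # close to 3
```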


## Loss Functions
### A loss function, also known as a cost function, measures how far a model’s predictions are from the actual values. It calculates the difference between the predicted output and the actual output for each training sample.

## Mean squared error (MSE)
### MSE is a commonly used loss function for regression problems. It calculates the average squared difference between the predicted output and the actual output.

### This loss function is sensitive to outliers, which means that a small number of very large errors can have a large effect on the overall value of the loss function. However, MSE is also a popular choice because it is differentiable and computationally efficient.

## Mean Absolute Error (MAE)
### MAE is another commonly used loss function for regression problems. MAE measures the average absolute difference between the predicted and true values. It is less sensitive to outliers than MSE.
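### The difference in outlier sensitivity can be seen in a small sketch. The function names and the sample values (three perfect predictions and one that is off by 10) are made up for the example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]  # one prediction is off by 10

print(mse(y_true, y_pred))  # 25.0
print(mae(y_true, y_pred))  # 2.5
```

### The single large error is squared in MSE, so it dominates the result, while MAE grows only in proportion to the error.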

## Cross-entropy
### Cross-entropy loss is a widely used loss function for classification problems. It measures the dissimilarity between the predicted probability distribution and the actual probability distribution. Depending on the data, you can use binary cross-entropy or categorical cross-entropy.

### The only difference between sparse categorical cross-entropy and categorical cross-entropy is the format of the true labels. In a single-label, multi-class classification problem, the labels are mutually exclusive, meaning each data entry can belong to only one class. Categorical cross-entropy expects y_true as one-hot vectors, whereas sparse categorical cross-entropy expects y_true as integer class indices.
### The sparse format saves memory when the number of classes is very large, and it also saves time: instead of taking the log of every entry of y_pred and then taking the dot product with y_true, we can directly pick the probability from y_pred at the index provided by y_true.
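### The two label formats can be compared in a short sketch for a single sample. The function names are illustrative and the predicted probabilities are made up; the sketch assumes every predicted probability is strictly positive so that the logarithm is defined.

```python
import math


def categorical_cross_entropy(one_hot, probs):
    """y_true as a one-hot vector: log of every entry, then dot product with y_true."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def sparse_categorical_cross_entropy(class_index, probs):
    """y_true as an integer index: pick the matching probability directly."""
    return -math.log(probs[class_index])


y_pred = [0.1, 0.7, 0.2]  # predicted probabilities for three classes

dense_loss = categorical_cross_entropy([0, 1, 0], y_pred)
sparse_loss = sparse_categorical_cross_entropy(1, y_pred)
print(dense_loss, sparse_loss)  # both equal -log(0.7), up to rounding
```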

![image.png](attachment:image.png)

## Black Box
### Black box AI is any artificial intelligence system whose inputs and operations aren't visible to the user or another interested party. A black box, in a general sense, is an impenetrable system. Black box AI models arrive at conclusions or decisions without providing any explanations as to how they were reached.

## Decreasing / Increasing / Constant neurons per hidden layer

### When using the same overall number of neurons, the increasing layout is clearly the worst performing of the three, as having fewer neurons at the start means less input is passed on to the next hidden layer.
### The other two are close in terms of accuracy, so both should be tested. Using an equal number of neurons per layer has neither a clear advantage nor a clear disadvantage.

## YOLO

### Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High scoring regions of the image are considered detections.

### YOLO uses a totally different approach. We apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

## Underfitting and Overfitting 
### In a nutshell, underfitting refers to a model that can neither perform well on the training data nor generalize to new data.
### Overfitting, in turn, is a problem where a model performs well on the training data but much worse on unseen data.

## Image Processing using TensorFlow

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Rescale pixel values from the 0-255 range to 0-1 for training and test images
    train_datagen = ImageDataGenerator(rescale=1.0 / 255.0)
    test_datagen = ImageDataGenerator(rescale=1.0 / 255.0)