# PyTorch for Deep Learning & Machine Learning Notes!

### Why use machine learning (or deep learning)?



* Manual coding can be cumbersome; machine learning can automate this.
* Machine learning can automate complex processes with many steps that would be difficult for traditional programming.
    * The more "rules" there are to a problem, the more attractive machine learning is.
* Can adapt to continuously changing environments easily.
* ***Discovering insights within large collections of data*** (AI IS FLOURISHING)

### When not to use machine learning?



* Google's #1 rule for machine learning: "if you can build a simple rule-based system that doesn't require machine learning, do that".
    * Short, simple code should be prioritized for menial tasks, not complex machine learning.
* When you need explainability: deep learning models are typically hard for humans to interpret.
* When errors are unacceptable: machine learning outputs are probabilistic, so they are not always predictable.
* When you don't have much data: machine learning models typically require large amounts of data.

### When to use machine learning.



* When you have large amounts of STRUCTURED data (rows x columns), e.g. with algorithms such as XGBoost.
* Some algorithms include:
    * Random forest.
    * Gradient boosted machine.
    * Naive Bayes.
    * Nearest neighbor.
    * Support vector machine.
    * Many more.

### When to use deep learning.



* When you have large amounts of UNSTRUCTURED data (natural language, image data), e.g. with neural network algorithms.
* Some algorithms include:
    * Neural networks.
    * Fully connected neural networks.
    * Convolutional neural networks.
    * Transformers.
    * Many more.

### What are neural networks?



* Inputs (images/audio/natural language) -> numerical encoding (matrices or tensors) -> learns representation (patterns/features/weights) -> representation outputs (large numerical data set or more tensors) -> human-readable output (image of a cat / "Hey Siri..." / is this text a joke or serious?)
    * Use the appropriate neural network for your problem (e.g. a CNN for images or a transformer for language/audio).
* Anatomy of Neural Networks.
    * Input layer -> hidden layer(s) -> output layer

### Types of learning.



* Supervised Learning.
    * Lots of data plus examples of what the outputs should be, e.g. photos of cats and dogs with each one labeled (data and labels).
* Unsupervised & Self-supervised Learning.
    * Learns solely from the data given, e.g. only the photos of cats and dogs; it finds patterns between the two but cannot necessarily tell which is which.
* Transfer Learning.
    * Takes what one model has learned and transfers it to another, e.g. the patterns a supervised learning model discovered can be reused by another model, giving it a "head start".

### What is deep learning used for?



* YouTube recommendations, Google Translate, speech recognition, computer vision, natural language processing (NLP).
    * Translation and Speech Recognition == Sequence to Sequence (seq2seq), e.g. a sequence of letters or a sequence of audio waves.
    * Computer Vision and Natural Language Processing (NLP) == Classification/Regression.
        * Regression is predicting a number, e.g. the coordinates of a computer vision model's "attention".
        * Classification is predicting if something is one thing or another.

### What is PyTorch?
    


* One of the most popular deep learning frameworks for research.
* Write fast deep learning code in Python (able to run on one or more GPUs).
* Able to access many pre-built deep learning models (Torch Hub/torchvision.models).
* Whole stack: preprocess data, model data, deploy model in your application/cloud.
* Originally designed and used in-house by Facebook/Meta (now open-source).

### What is a Tensor?



* https://www.youtube.com/watch?v=f5liqUk0ZTw 

# What is a Classification problem?


* **Binary Classification** - Determining whether an email is spam or not spam. Either spam or not spam, 0 or 1.

* **Multiclass Classification** - More than one thing or another. Cat, dog, or bunny.

* **Multilabel Classification** - Multiple labels per sample. *Grey* *Female* *longhair* Cat, *brown* *50lbs* Dog, *white* Bunny.

Classification Inputs & Outputs
* Inputs
    * A photo can be broken down into three dimensions: width, height, & C (color channels: R, G, B), with one numerical value per pixel per channel.

* Outputs
    * Cat, Dog, Bunny `[[0.97, 0.00, 0.03]]`, with each value being how certain the model is of that class; the closer to 1.0, the more certain (e.g. this model is 97% certain of Cat).
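
As a rough illustration of where an output like `[[0.97, 0.00, 0.03]]` could come from, here is a minimal sketch in plain Python. It assumes the model produces one raw score per class and that a softmax turns those scores into probabilities; the class order and the score values are made up for this example.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["Cat", "Dog", "Bunny"]  # hypothetical class order
raw_scores = [4.0, -3.0, 0.5]          # made-up raw model outputs

probs = softmax(raw_scores)
predicted = class_names[probs.index(max(probs))]

print([round(p, 2) for p in probs], predicted)
```

The predicted class is simply the one with the highest probability.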

Input & Output Shapes
* Each photo input gets represented as a Tensor, `shape = [batch_size, color_channels, height, width]`, e.g. `shape = [None, 3, 224, 224]` (where `None` stands for a batch size that is not fixed yet).
    * batch_size - How many photos the model looks at at once. A higher batch_size needs more memory and computing power per step.
    * The multiclass classification example above (Cat, Dog, & Bunny) would produce an output of `shape = [3]` per photo, while a binary classification typically produces `shape = [1]` (a single probability), or `[2]` if the model scores both classes separately.
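
The shapes above can be checked directly in PyTorch. This is a minimal sketch that assumes the channels-first layout `[batch_size, color_channels, height, width]` and uses random numbers in place of real photos and real model outputs; the batch size of 8 is arbitrary.

```python
import torch

# A hypothetical batch of 8 RGB images, each 224x224 pixels (random values, not real photos).
images = torch.rand(8, 3, 224, 224)
print(images.shape)        # torch.Size([8, 3, 224, 224])

# Indexing one image out of the batch drops the batch dimension.
single_image = images[0]
print(single_image.shape)  # torch.Size([3, 224, 224])

# Made-up model outputs: one row of 3 class scores (Cat, Dog, Bunny) per image.
outputs = torch.rand(8, 3)
print(outputs[0].shape)    # torch.Size([3])
```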

Architecture of A Classification Model.
* Hyperparameters - Binary Classification
    * Input layer shape (in_features) - Same as the number of features (e.g. 4 for age, sex, height, smoking status), each encoded as a numerical value (e.g. male=0, female=1).
    * Hidden layer(s) - Problem specific, minimum = 1, maximum = unlimited.
    * Neurons per hidden layer - Problem specific, generally 10 to 512.
    * Output layer shape (out_features) - 1 (one class or the other). For multiclass classification it is one per class (e.g. 3 for cat, dog, or bunny).
    * Hidden layer activation - Usually `ReLU` (Rectified Linear Unit), but many others exist.
    * Output activation - Sigmoid (`torch.sigmoid` in PyTorch).
    * Loss function - Binary cross-entropy (`torch.nn.BCELoss` in PyTorch).
    * Optimizer - `torch.optim.SGD` (Stochastic Gradient Descent), `torch.optim.Adam`, or many more.
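
The hyperparameters above could be put together roughly as follows. This is only a sketch: the feature count, the 16 hidden neurons, the learning rate, and the random data are all assumptions chosen for illustration, and `nn.Sigmoid` is used as the layer form of `torch.sigmoid`.

```python
import torch
from torch import nn

torch.manual_seed(42)

IN_FEATURES = 4    # hypothetical: age, sex, height, smoking status
HIDDEN_UNITS = 16  # arbitrary choice within the usual 10 to 512 range

model = nn.Sequential(
    nn.Linear(IN_FEATURES, HIDDEN_UNITS),  # input layer -> hidden layer
    nn.ReLU(),                             # hidden layer activation
    nn.Linear(HIDDEN_UNITS, 1),            # hidden layer -> output layer (out_features=1)
    nn.Sigmoid(),                          # output activation
)

loss_fn = nn.BCELoss()                                   # binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # would drive a training loop (not shown here)

# Made-up batch of 8 samples with random 0/1 labels.
X = torch.rand(8, IN_FEATURES)
y = torch.randint(0, 2, (8, 1)).float()

probs = model(X)          # one probability per sample
loss = loss_fn(probs, y)
print(probs.shape, loss.item())
```

Because the last layer is a sigmoid, every output lands between 0 and 1 and can be read as the probability of the positive class.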

## What goes on in a training loop?

### What is gradient descent?

Gradient descent is, in essence, the process of repeatedly adjusting a model's parameters in small steps so that the cost/loss decreases toward a local minimum.
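
To make the idea concrete, here is a minimal sketch of gradient descent in plain Python on a toy one-parameter loss. The loss function, the starting point, and the learning rate are all assumptions picked for illustration, not something a real model would use.

```python
def loss(w):
    """Toy loss with a single minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size, chosen for illustration

history = [loss(w)]
for step in range(50):
    w -= learning_rate * grad(w)  # step in the direction that lowers the loss
    history.append(loss(w))

print(f"w = {w:.4f}, loss = {history[-1]:.6f}")
```

Each step moves `w` a little closer to 3, and the recorded loss shrinks on every iteration.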

### What is backpropagation?

Backpropagation is the algorithm that works out how much each `weight` and `bias` in a nn model contributed to the error in its predictions. The model's predicted outputs are compared against the training data using a loss function; backpropagation then computes the gradient of that loss with respect to every parameter, and an optimizer (think Adam or SGD) uses those gradients to reduce the error.

Backpropagation is the algorithm for determining how a single training example would like to nudge the `weights` & `bias`: **NOT** just which direction each one should move, but in what relative proportions the changes cause the most rapid decrease in the **total cost/loss**.
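
For a single neuron, those "relative proportions" can be worked out by hand with the chain rule. The sketch below assumes one sigmoid neuron, a squared-error cost, and a single made-up training example; all numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias, passed through sigmoid."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)

def cost(weights, bias, inputs, target):
    """Squared error for one training example."""
    return (forward(weights, bias, inputs) - target) ** 2

def gradients(weights, bias, inputs, target):
    """Chain rule: dC/dw_i = dC/da * da/dz * dz/dw_i."""
    a = forward(weights, bias, inputs)
    dC_da = 2.0 * (a - target)
    da_dz = a * (1.0 - a)                   # derivative of sigmoid
    delta = dC_da * da_dz
    grad_w = [delta * x for x in inputs]    # dz/dw_i is the i-th input activation
    grad_b = delta                          # dz/db is 1
    return grad_w, grad_b

# Made-up values for illustration.
inputs = [0.9, 0.1, 0.4]    # activations from the previous layer
weights = [0.5, -0.3, 0.8]
bias = 0.1
target = 1.0                # desired output for this example

grad_w, grad_b = gradients(weights, bias, inputs, target)
print(grad_w, grad_b)
```

The gradient for each weight is scaled by the activation feeding into it, so the weight attached to the most active input (0.9 here) gets the largest nudge.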

* **Cost/Loss** - (outputs of the nn - desired outputs)^2, summed over all outputs for a single training example.

* **Total cost/loss** - the average of the cost/loss over ALL training examples.

* The nn outputs a number of `out_features` that is specified as a hyperparameter of the model.

* Each output is given a "probability" of correctness, which the loss function compares to the training data. The optimizer then adjusts the weights & bias to nudge the outputs in the desired direction, as we want the predicted output to match the training data (e.g. 1.0 for the desired answer & 0.0 for the undesired ones).
    * We want the size of the adjustment to be proportional to the "distance" from the desired value.

* Focusing on a single feature/neuron, its activation is defined as a *weighted* sum of all the activations in the previous layer + `bias`, which is passed into a function such as `sigmoid()` or `ReLU()`.
    * There are 3 avenues here to increase the chances of the desired output: increase the `bias` (how often it's activated), increase the `weights` (how strongly it's activated), or change the activations in the previous layer.
        * Increasing the `bias` is simple as it is a single independent value.

        * Altering the `weights` is slightly different, as each one has a different amount of influence. We want to prioritize the `weights` that will have the greatest effect (those connected to the most active neurons in the previous layer), **NOT** change every `weight` that has any effect by the same amount.
            * "*Neurons which activate when they see a desired value, are more likely to activate when they think of said value*" - 3Blue1Brown.
        
        * We could also change the activations of features/neurons in the previous layer (again, **NOT** all neurons which have any effect equally, but mostly those with the **GREATEST** effect). If each neuron we want activated is activated more, & each neuron we don't want activated is activated less, then it's a win.
    
    * These changes are all relative to every other `out_feature` that may be present, each of which has its own wants in regards to `weight` & `bias` changes, and those wants are in turn balanced across all *n* data points in the training data.

* **Backpropagation** comes into play here, running through this process *backwards*, recursively, through each previous layer. This whole process is then repeated for each data point in the training data.

* Stochastic Gradient Descent (`SGD`) does this by randomly shuffling all the data into mini-batches, then computing a step for each mini-batch (with **backpropagation**). Each step is less accurate than a true step computed from all the training data at once, but many times more time efficient.
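
The mini-batch process can be sketched as a small PyTorch training loop. Everything specific here is an assumption for illustration: the data is a made-up noisy line `y = 2x + 1`, the model is a single `nn.Linear` layer, and the batch size, learning rate, and epoch count are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up data: y = 2x + 1 plus a little noise.
X = torch.rand(200, 1)
y = 2.0 * X + 1.0 + 0.05 * torch.randn(200, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

BATCH_SIZE = 20
epoch_losses = []

for epoch in range(50):
    permutation = torch.randperm(X.shape[0])  # shuffle the data every epoch
    running_loss = 0.0
    for start in range(0, X.shape[0], BATCH_SIZE):
        idx = permutation[start:start + BATCH_SIZE]
        X_batch, y_batch = X[idx], y[idx]

        y_pred = model(X_batch)           # 1. forward pass
        loss = loss_fn(y_pred, y_batch)   # 2. compute the loss for this mini-batch
        optimizer.zero_grad()             # 3. clear gradients from the previous step
        loss.backward()                   # 4. backpropagation
        optimizer.step()                  # 5. nudge the weight and bias

        running_loss += loss.item() * len(idx)
    epoch_losses.append(running_loss / X.shape[0])

print(f"first epoch loss: {epoch_losses[0]:.4f}, last epoch loss: {epoch_losses[-1]:.4f}")
print(f"weight: {model.weight.item():.2f}, bias: {model.bias.item():.2f}")
```

Each pass through the inner loop is one noisy step based on 20 samples rather than all 200, which is the trade-off between accuracy per step and speed described above.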