## Train-Dev Split:

Why do you need a train-dev set in addition to the train, validation & test sets?

### Motivation: 
After training your model on the training data, if you observe
that the performance of the model on the validation set is disappointing, you will not
know whether this is because your model has overfit the training set, or whether this
is just due to a mismatch between the training data and the validation data.

* The train-dev set is split from the training set and is not used to train the model
* One solution is to hold out some of the training pictures in yet
another set that Andrew Ng calls the train-dev set. After the model is trained (on the
training set, not on the train-dev set), you can evaluate it on the train-dev set.
* If it performs poorly on the train-dev set, the model has overfit the training set.
* If it performs well on the train-dev set but poorly on the validation set, the problem must be coming from the data mismatch.

## RMSE Vs MAE

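* RMSE is generally the preferred measure for regression, except when the data contains many outliers. In the presence of outliers, MAE gives a more representative estimate.
* Even a single outlier can blow up the RMSE, because errors are squared before averaging, while the MAE grows only in proportion to the size of the error.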

### Distance Norm
* In general, the $l_k$-norm of a vector $v$ is:
$$ \|v\|_k = (|v_0|^k + |v_1|^k + \dots + |v_n|^k)^{1/k} $$
* k = 2 gives the Euclidean distance, which corresponds to the RMSE
* k = 1 gives the Manhattan distance, which corresponds to the MAE
* The higher the value of k, the more the norm focuses on large values and neglects small ones. Hence, RMSE is more sensitive to outliers than MAE.
* When outliers are rare, RMSE performs very well and is generally preferred
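
To make the outlier sensitivity concrete, here is a minimal sketch in pure Python. The error values are made up for illustration; only the comparison between the two metrics matters.

```python
import math

def rmse(errors):
    """Root mean squared error: corresponds to the l2 norm of the errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: corresponds to the l1 norm of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors, without and with a single outlier
clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 50.0]

print("clean:       ", rmse(clean), mae(clean))
print("with outlier:", rmse(with_outlier), mae(with_outlier))
```

On the clean errors both metrics agree. With the outlier, the RMSE is roughly twice the MAE, because the squared term dominates the sum.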

## Correlations

### Points to remember:
* A correlation coefficient is not a slope value
* Correlation coefficients capture only linear relationships

### Consider the below Example
* The correlation coefficient only measures linear correlations (“if x
goes up, then y generally goes up/down”). 
* It may completely miss
out on nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”). 
* Note how all the plots of the bottom row have a correlation
coefficient equal to 0, despite the fact that their axes are
clearly not independent: these are examples of nonlinear relationships.
* Also, the second row shows examples where the correlation
coefficient is equal to 1 or –1; notice that this has nothing to do
with the slope. For example, your height in inches has a correlation
coefficient of 1 with your height in feet or in nanometers.
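
Both points can be checked numerically. The sketch below implements the Pearson correlation coefficient in pure Python; the data points are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Nonlinear relationship: y is fully determined by x, yet the correlation is 0
xs = [-3, -2, -1, 0, 1, 2, 3]
parabola = [x * x for x in xs]
r_parabola = pearson(xs, parabola)

# Same quantity in different units: correlation is 1 whatever the slope
heights_in = [60.0, 64.0, 68.0, 72.0]
heights_ft = [h / 12 for h in heights_in]
r_units = pearson(heights_in, heights_ft)

print(r_parabola, r_units)
```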


![<title>](images/end_to_end_project/Correlation.PNG)

## Normalization Vs Standardization

Normalization:
* aka min-max scaling : 0-1

* $\frac {x - min}{max - min} $

* through "feature_range" in MinMaxScaler, we can opt for range other than 0-1

Standardization:
* mean 0, std-dev 1

* $ \frac {x - mean}{\sigma} $

* Values are not bound to a specific range. The advantage is that standardization is much less affected by outliers
* Ex: if most values lie between 0 and 15 and one outlier is 100, min-max scaling crushes the 0-15 values into 0-0.15, whereas standardization affects them much less
* The lack of a fixed range can be a problem for some algorithms, e.g. neural networks, which often expect input values between 0 and 1
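
A small pure-Python sketch of both scalers, applied to the 0-15 values plus one outlier at 100 mentioned above. The helper names are illustrative; in practice sklearn's MinMaxScaler and StandardScaler do this per feature.

```python
import statistics

def min_max_scale(values):
    """Normalization: map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

data = [0, 3, 6, 9, 12, 15, 100]  # the last value is the outlier

scaled = min_max_scale(data)
standardized = standardize(data)

print([round(v, 3) for v in scaled])        # inliers end up within 0-0.15
print([round(v, 3) for v in standardized])  # no fixed range
```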

## Why does production model performance rot over time?

* Even a model trained to classify pictures of cats and dogs may need to be retrained regularly, not because cats and dogs will mutate overnight, but because cameras keep changing, along with image formats, sharpness, brightness, and size ratios. Moreover, people may love different breeds next year, or they may decide to dress their pets with tiny hats—who knows?
* Data keeps evolving, so the dataset needs to be updated and the model retrained regularly


## Confusion Matrix

### Precision & Recall

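* A trivial way to get perfect precision is to make one single +ve prediction and ensure it is correct (precision = 1/1 = 100%).
* Such a classifier is not useful, since it ignores all but one +ve instance. Hence precision is always paired with recall, the ratio of +ve instances that are correctly detected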

### F1_score
* If either precision or recall is low, the model is not a good one.
* F1_score is the harmonic mean of precision and recall. The harmonic mean gives more weight to low values, hence F1_score is high only when both precision & recall are high
* The arithmetic mean gives equal weight to precision and recall, so it can look acceptable even when one of them is very low

### When Precision & When Recall?
* Precision : when FP is costly. Ex: a classifier that flags videos as safe for kids. You don't want to flag a bad video as good and let kids see it
* Recall : when FN is costly. Ex: fraud detection. You don't want any fraud to go undetected

### Precision-Recall Trade-off
* Varying the decision threshold varies the precision and recall.
* In general, a higher threshold gives lower recall and higher precision. With a higher threshold, fewer data points are predicted +ve, and those that remain tend to be the confident ones, hence precision is good
* Plot precision vs recall: when the threshold is very low, almost everything is predicted +ve, so recall is close to 1 and precision is low. As you increase the threshold, recall decreases and precision generally increases (though it can occasionally dip)
* Choose the threshold that gives your required precision
* Prefer the precision-recall curve over ROC when working with imbalanced data, especially when the +ve class is rare
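
The trade-off can be seen on a toy example. The labels and scores below are invented, and the convention of returning precision 1.0 when nothing is predicted positive is an assumption of this sketch.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier scores, sorted for readability
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.0, 0.5, 0.85):
    p, r = precision_recall(labels, scores, threshold)
    print(threshold, p, r, round(f1(p, r), 3))
```

At the lowest threshold recall is 1.0 and precision is 0.5. At the highest threshold precision is 1.0 but recall drops to 0.25, and the F1 score (0.4) is pulled towards the lower of the two, unlike the arithmetic mean (0.625).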

### ROC curve
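* sklearn's roc_curve expects the true labels and the scores (or probabilities). It sweeps the threshold and computes the TPR (recall) and FPR at each value.
* Similar to the PR curve, except that it plots TPR against FPR instead of precision against recall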

## MultiClass Classification

### One Vs All (OVA)
* Build N number of binary classifiers one for each category. 
* Predict the class whose classifier outputs the highest score
* 0-detector, 1-detector and so-on. Test image sent to each classifier and choose which had highest score

### One Vs One (OVO)
* Train a binary classifier for every pair of classes, ie 0 vs 1, 0 vs 2, etc
* Number of classifiers : n*(n-1)/2
* Send a test image to all the classifiers & check which class wins the most duels
* Each classifier only needs to be trained on the sub-sample of the training set containing the two classes it must distinguish

### Summary
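* SVMs scale poorly with the size of the training set, so OVO is suited to SVMs: it is faster to train many classifiers on small subsets than a few on the full set
* For most other algorithms, OVA (aka one-vs-rest, OVR) is preferred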



## Test Data

* Do not snoop into the test data, even while performing EDA
* Otherwise you might build logic that is biased by patterns you have seen in the test data

## Linear Algebra

### Norm
The norm of a vector $\textbf{u}$, noted $\left \Vert \textbf{u} \right \|$, is a measure of the length (a.k.a. the magnitude) of $\textbf{u}$. There are multiple possible norms, but the most common one (and the only one we will discuss here) is the Euclidean norm, which is defined as:

$\left \Vert \textbf{u} \right \| = \sqrt{\sum_{i}{\textbf{u}_i}^2}$

We could implement this easily in pure python, recalling that $\sqrt x = x^{\frac{1}{2}}$
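
A minimal pure-Python version, representing a vector as a plain list of numbers (the function name is illustrative):

```python
def vector_norm(u):
    """Euclidean norm: square root of the sum of squared components."""
    return sum(x ** 2 for x in u) ** 0.5

u = [3, 4]
print(vector_norm(u))
```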

### normalized vectors
* The **normalized vector** of a non-null vector $\textbf{u}$, noted $\hat{\textbf{u}}$, is the unit vector that points in the same direction as $\textbf{u}$. It is equal to: $\hat{\textbf{u}} = \dfrac{\textbf{u}}{\left \Vert \textbf{u} \right \|}$

### Vector Addition & Scalar Multiplication
* vector addition results in a geometric translation
* vector multiplication by a scalar results in rescaling (zooming in or out, centered on the origin)
* vector dot product results in projecting a vector onto another vector, rescaling and measuring the resulting coordinate.

## Linear Regression

### Normal Equation with SVD
* Normal Eq : 
$$ \theta = (X^T X)^{-1} X^T y $$
* sklearn's LinearRegression does not invert $X^T X$ directly. It solves the least-squares problem through an SVD approach (lstsq, which computes the pseudo-inverse as pinv does)
* SVD decomposes the matrix into 3 components, so it handles edge cases where $X^T X$ is not invertible (m < n, or some features are redundant)
$$ X = U \Sigma V^T $$
* In such cases, the pseudo-inverse is still defined
* Computational Complexity (m : number of instances, n : number of features)
    * Features : the Normal Eq - O($n^{2.4}$) to O($n^3$); SVD - O($n^2$)
    * Instances : Both O(m)
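
As a worked instance of the Normal Equation, the sketch below handles the smallest interesting case: one feature plus a bias column, so $X^T X$ is a 2x2 matrix that can be inverted by hand. This is only an illustration in pure Python, not how sklearn computes the solution, and the data is synthetic.

```python
def normal_equation_1d(xs, ys):
    """Solve theta = (X^T X)^-1 X^T y for X = [1, x] (bias column + one feature)."""
    m = len(xs)
    # Entries of the 2x2 matrix X^T X = [[m, s_x], [s_x, s_xx]]
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    # Entries of the vector X^T y = [s_y, s_xy]
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    det = m * s_xx - s_x * s_x
    if det == 0:
        # X^T X is singular here; this is where a pseudo-inverse would be needed
        raise ValueError("X^T X is not invertible")
    theta0 = (s_xx * s_y - s_x * s_xy) / det
    theta1 = (m * s_xy - s_x * s_y) / det
    return theta0, theta1

# Noise-free synthetic data generated from y = 4 + 3x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [4.0 + 3.0 * x for x in xs]
theta0, theta1 = normal_equation_1d(xs, ys)
print(theta0, theta1)
```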

### Gradient Descent
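* Works well when the number of features n is large.
* Two major disadv :
    * Can get stuck in a local minimum (not an issue for the convex MSE cost of Linear Regression)
    * Need to know when to stop (number of iterations, tolerance)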

### Stochastic GD:
* Preferred when the data does not fit in memory (large dataset)
* Always shuffle the dataset (except for time-series). If the data is sorted by class (all of Class A followed by all of Class B), SGD would optimize for one class and then the other; shuffling avoids this.
* Do not divide by n in the gradient, as we are working with only one instance at a time
* A learning rate scheduler is crucial to end up at, or at least close to, the optimal point. Otherwise the parameters keep wandering around the minimum (stochastic)
* SGD estimators in sklearn **do not require a bias feature** to be added: they fit the intercept themselves
* SGDClassifier in sklearn can train logistic regression, linear SVM, etc. The model it trains depends on the loss you specify
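
The points about shuffling, the per-instance gradient and the learning rate schedule can be sketched as follows. The schedule `t0 / (step + t1)`, its constants, the seed and the noise-free data are all assumptions chosen for illustration.

```python
import random

def sgd_linear_regression(xs, ys, n_epochs=50, t0=5.0, t1=50.0, seed=42):
    """Fit y = theta0 + theta1 * x with stochastic gradient descent."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    step = 0
    for _ in range(n_epochs):
        rng.shuffle(indices)              # shuffle at every epoch
        for i in indices:
            eta = t0 / (step + t1)        # learning rate decays over time
            error = theta0 + theta1 * xs[i] - ys[i]
            # Gradient of the squared error of ONE instance (no division by n)
            theta0 -= eta * 2 * error
            theta1 -= eta * 2 * error * xs[i]
            step += 1
    return theta0, theta1

# Synthetic data from y = 4 + 3x, with x spread over [0, 2]
xs = [2 * i / 99 for i in range(100)]
ys = [4 + 3 * x for x in xs]
theta0, theta1 = sgd_linear_regression(xs, ys)
print(theta0, theta1)
```

The result should land close to (4, 3) rather than exactly on it, which is the expected behaviour of SGD with a decaying learning rate.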

### Polynomial Regression
* When there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something a plain Linear Regression model cannot do). 
* This is made possible by the fact that PolynomialFeatures also
adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2b$, and $ab^2$.
* The number of features can explode as a result. Hence, be careful with the degree before performing poly reg
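
To get a feel for the feature explosion, this sketch enumerates the polynomial terms with the standard library. It mimics the idea behind PolynomialFeatures without the bias term; it is not sklearn's implementation.

```python
from itertools import combinations_with_replacement

def polynomial_terms(feature_names, degree):
    """List every product of features with total degree 1..degree."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            terms.append("*".join(combo))
    return terms

terms = polynomial_terms(["a", "b"], degree=3)
print(terms)       # includes a*b, a*a*b and a*b*b
print(len(terms))  # 9 terms for 2 features at degree 3

# With 10 features the same degree already yields 285 terms
print(len(polynomial_terms(list("abcdefghij"), degree=3)))
```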

### Regularization
* Ridge is a good default
* If only a few features are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features’ weights down to zero
* In general, Elastic Net is preferred over Lasso because Lasso
may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
* Early Stopping is also a method to avoid overfitting

## Logistic Regression

### Cost Function
* It has to be low for correct predictions and high for incorrect predictions, for both classes
* cost
    * -log(p) if y = 1
    * -log(1 - p) if y = 0
* log(p) -> -inf as p -> 0, and log(1) = 0. Hence, with y = 1, if we predict p close to 1 the cost goes to zero; if we predict p close to 0 it goes to infinity.
* Combining both scenarios:
$$ cost = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat y_i) + (1 - y_i) \log(1 - \hat y_i)] $$
* Cross-entropy loss: Generic Case
    * $$ cost = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^i \log(\hat y_k^i) $$
    * K : number of classes, n : number of examples
    * for a confidently wrong prediction the inner sum goes to -inf (so the cost goes to +inf); for a correct one it goes to zero
    * when K = 2, this is equivalent to the cost equation above
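
A short numerical check of the two cost formulas, written with the leading minus sign so that the cost is non-negative. The probabilities are made up and must lie strictly between 0 and 1.

```python
import math

def binary_log_loss(y_true, p_pred):
    """Binary cross-entropy; p_pred holds the predicted probability of class 1."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))
    return -total / len(y_true)

def cross_entropy(y_onehot, p_pred):
    """Generic cross-entropy over K classes with one-hot encoded labels."""
    total = sum(y_k * math.log(p_k)
                for y_row, p_row in zip(y_onehot, p_pred)
                for y_k, p_k in zip(y_row, p_row)
                if y_k)  # terms with y_k = 0 contribute nothing
    return -total / len(y_onehot)

y = [1, 0, 1]
p = [0.9, 0.2, 0.6]

# Same predictions written as K = 2 class probabilities [P(class 0), P(class 1)]
y_onehot = [[0, 1], [1, 0], [0, 1]]
p_two_class = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]

print(binary_log_loss(y, p), cross_entropy(y_onehot, p_two_class))
```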

### Softmax Regression
* Use only when the classes are mutually exclusive. Otherwise build multiple logistic regressions
* It is aka multinomial logistic regression. 
* It is about multi-class prediction and not multi-output. 
* Ex : Predicting probabilities of every iris species.
* Find the score $s_k(x) = x^T \theta^{(k)}$ for each class k, using a separate $\theta^{(k)}$ vector per class
* Convert the scores to class prob using
    * $$ \hat p_k = \frac{\exp (s_k(x))} {\sum_{j=1}^{K} \exp (s_j(x))} $$
    * Predict the class having highest probability
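
A minimal softmax sketch. The scores are hypothetical per-class values of $x^T \theta$, and subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # one score per class
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs, predicted_class)
```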

## Support Vector Machines

* SVM : Large margin classifier. Among all separating lines, it picks the one with the widest margin (street) between the classes
* Support Vectors : The instances on the edge of the street, which determine (support) the decision boundary
* SVMs are sensitive to feature scales. If two features are on different scales, the widest possible street is narrow and close to horizontal, whereas after scaling the street is much wider and the classes look better separated.
* **SVC Vs SVM** : 
    * SVC are simple support vector classifiers based on max margin classifiers (hard & soft)
    * SVM involves : 
        1. Start with low dimension
        2. Move data to higher dimension
        3. Find the SVC that separates the data.
        4. The kernel systematically finds the higher dimension required at step 2.

* **Hard Margin Vs Soft Margin**:
    * Hard Margin : Strictly impose condition. Works only for linearly separable, sensitive to outliers (impossible to fit)
    * Soft Margin : Flexible, limit margin violations (instances in middle and wrong side)

* **Calculate Margin**:
    * Use cv to determine the margin: allowing some misclassifications on the training data gives better classification in the long run
    * Run cv across all the data points and check the resulting bias and variance.
    * Choose the margin which optimizes the bias-variance trade-off
    
* **Regularize Parameter C**:
    * Reduce C when your model is overfitting
    * Lower C -> more margin violations, wider street, but better generalization

* **Types of Data** :
    * Linearly separable : Linear SVC
    * Non-linear data : 
        * Linear SVC with polynomial features. With polynomial features, the total number of features grows and it can make the model slow
        * Kernel Trick : Through kernels, you don't generate extra features but get the same result as if you had added the polynomial features

* **Kernels** :
    * Kernels make it possible to get the same result as if many complex features had been added, without actually adding them.
    * Gaussian, Poly, etc : Which one to choose?
    * Start with LinearSVC (much faster than SVC with a linear kernel!). If the training set is not too large -> Gaussian RBF kernel. 
    * If compute is available, then try other kernels
    * Hyperparameters:
        * $\gamma$ : Lower values for reducing overfitting
        * C : Similar (low C -> high $\lambda$ -> heavy regularization )

* **Computational Complexity** :
    * m : number of instances, n : number of features
    * LinearSVC - O(m*n) - No kernel Trick
    * SVC - O($m^2$*n) to O($m^3$*n) - Kernel Trick
        * ideal for complex small-medium datasets
    * Both require scaling

## SVM Regression

* **Concept** : 
    * Instead of fitting the widest possible street between the classes (classification), regression tries to fit as many instances as possible within the street
    * Width of street : $\epsilon$ hyperparameter
    * The larger the $\epsilon$, the wider the street. 

* **Commonalities with classification**:
    * SVC = SVR (supports kernels)
    * LinearSVC = LinearSVR (simple, linear, no kernel)

## Polynomial SVM

* It is required when the data is not linearly separable in lower dimensions (a plain SVC fails)

![title](images/svm/SVM_Polyomial_Kernel_1.PNG)

* If the above data is mapped to squared dimensions, it becomes linearly separable.
* The kernel trick systematically computes the relationships between points in higher dimensions without transforming them, so we can use an SVC
* Poly Kernel : 
    * $(a * b + r)^d$
    * a, b are data points/instances;  r = poly coefficient; d = poly degree
    * r and d are chosen by cross-validation
* For above dataset, if r = 1/2, d = 2 -> 
$$ (a*b + 0.5)^2$$
$$ ab + (ab)^2 + 0.25 $$
$$ (a, a^2, 0.5) . (b, b^2, 0.5) $$
* The above dot product gives the coordinates of the data in higher dimensions (squared dimensions in our case), ie it adds the $a^2$ axis along which the data is separable.
* For above dataset, if r = 1, d = 2 -> we get the coordinates ($\sqrt{2} a$, $a^2$, 1)
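
The equivalence between the kernel value and a dot product in the higher-dimensional space can be verified numerically for one-dimensional points. The feature maps `phi_half` and `phi_one` are names introduced here for the two cases r = 1/2 and r = 1 with d = 2.

```python
import math

def poly_kernel(a, b, r, d):
    """Polynomial kernel (a*b + r)^d for scalar data points."""
    return (a * b + r) ** d

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def phi_half(x):
    """Explicit coordinates for r = 1/2, d = 2."""
    return (x, x ** 2, 0.5)

def phi_one(x):
    """Explicit coordinates for r = 1, d = 2."""
    return (math.sqrt(2) * x, x ** 2, 1.0)

a, b = 3.0, -2.0
print(poly_kernel(a, b, 0.5, 2), dot(phi_half(a), phi_half(b)))
print(poly_kernel(a, b, 1.0, 2), dot(phi_one(a), phi_one(b)))
```

The kernel only ever needs `a * b`, so the higher-dimensional coordinates never have to be computed explicitly.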



## Radial Kernel

* Works for data with overlapping classes or data that is not separable in lower dimensions
* Radial Kernel behaves like a weighted nearest neighbour model: closer neighbours get more weight in deciding the class
* Radial Kernel Formula:
    * $ e^{-\gamma (a - b)^2} $
    * a, b are data points; $\gamma$ : scales the influence; a higher value decreases the weight given to distant points.
* Consider 3 points : a, b, c in order of distance in the data
    * the kernel value will be higher between a, b than between a, c
    * thereby, the kernel output captures the relationship between points without transforming the data
* RBF is the sum of poly kernels with r = 0 and d going from 0 to infinity, ie it corresponds to infinitely many dimensions! It has too many coordinates to be visualized. 
* RBF coords : Roughly : (1, a, $a^2$, $a^3$, ...), (1, b, $b^2$, ...)
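
A quick sketch of the radial kernel for scalar points, using the standard definition $e^{-\gamma (a - b)^2}$. The three points and the $\gamma$ values are arbitrary.

```python
import math

def rbf_kernel(a, b, gamma):
    """Radial basis function kernel for scalar data points."""
    return math.exp(-gamma * (a - b) ** 2)

a, b, c = 1.0, 1.5, 4.0  # b is close to a, c is far from a

near = rbf_kernel(a, b, gamma=1.0)
far = rbf_kernel(a, c, gamma=1.0)
print(near, far)  # the close pair gets a much larger kernel value

# A higher gamma shrinks the influence of every neighbour
print(rbf_kernel(a, b, gamma=10.0))
```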

## Why computing angle useful over distance?

* The answer comes in the kind of invariance we expect data to have. 
* Consider an image, and a duplicate image, where every pixel
value is scaled to 10% of the original brightness. The values of the individual pixels are in general far
from the original values. Thus, if one computed the distance between the original image and the
darker one, the distance can be large.
* However, for most ML applications, the content is the same—it is still an image of a cat as far as a
cat/dog classifier is concerned. 
* However, if we consider the angle, it is not hard to see that for
any vector v, the angle between v and 0.1 · v is zero. This corresponds to the fact that scaling
vectors keeps the same direction and just changes the length. The angle considers the darker
image identical.
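
The brightness example can be reproduced with a tiny made-up "image" of four pixel values, treated as a vector:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def euclidean_distance(u, v):
    return norm([a - b for a, b in zip(u, v)])

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

image = [200.0, 150.0, 90.0, 30.0]   # hypothetical pixel intensities
darker = [0.1 * p for p in image]    # same content at 10% brightness

distance = euclidean_distance(image, darker)
similarity = cosine_similarity(image, darker)
# Clamp before acos to guard against rounding slightly above 1.0
angle = math.acos(min(1.0, similarity))

print(distance)           # large, although the content is identical
print(similarity, angle)  # similarity ~1, angle ~0
```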


## Decision Trees

* They require minimal data preparation. They don't require feature scaling or centering

## Dimensionality Reduction

* Curse of Dimensionality : Training is slow & it is harder to find an optimal solution. The curse of dimensionality refers to the fact that many problems that do not
exist in low-dimensional space arise in high-dimensional space. In Machine
Learning, one common manifestation is the fact that randomly sampled high-dimensional
vectors are generally very sparse, increasing the risk of overfitting
and making it very difficult to identify patterns in the data without having plenty
of training data.

* Why high dimensions based predictions are unreliable:
    
    * The distance between two random points increases a lot in high dimensions.
    * If you pick two points randomly in a unit
square, the distance between these two points will be, on average, roughly 0.52. If you
pick two random points in a unit 3D cube, the average distance will be roughly 0.66.
But what about two points picked randomly in a 1,000,000-dimensional hypercube?
The average distance, believe it or not, will be about 408.25 (roughly $\sqrt{1{,}000{,}000/6}$)!

    * This is counterintuitive: how can two points be so far apart when they both lie within
the same unit hypercube? Well, there’s just plenty of space in high dimensions. As a
result, high-dimensional datasets are at risk of being very sparse: most training
instances are likely to be far away from each other. 
    * This also means that a new
instance will likely be far away from any training instance, making predictions much
less reliable than in lower dimensions, since they will be based on much larger extrapolations.
In short, the more dimensions the training set has, the greater the risk of
overfitting it.

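* In theory, increasing the training set size until the training instances are dense enough avoids the curse of dimensionality.

* Unfortunately, that level of density is not achievable in practice: the number of instances required grows exponentially with the number of dimensions.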
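
The average distances quoted above can be approximated with a small Monte Carlo simulation. The number of sampled pairs and the seed are arbitrary choices, so the estimates are only accurate to about two decimal places.

```python
import math
import random

def average_distance(dim, n_pairs=20000, seed=0):
    """Estimate the mean distance between two random points in a unit hypercube."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        squared = sum((rng.random() - rng.random()) ** 2 for _ in range(dim))
        total += math.sqrt(squared)
    return total / n_pairs

print(average_distance(2))   # close to 0.52
print(average_distance(3))   # close to 0.66

# For very high dimensions the average is approximately sqrt(dim / 6)
print(math.sqrt(1_000_000 / 6))
```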

## Dimensionality Reduction

* Projection:
    * Data is not spread uniformly across all dimensions, ie many features are almost constant or highly correlated -> the data lies within (or close to) a lower-dimensional subspace
    * But projection is not always the best approach to dimensionality reduction.
    * Consider the swiss roll example:
        * The swiss roll needs to be unrolled, not projected, to get a better representation.
        * We want the lower-dimensional representation on the right (unrolling) and not the one on the left (projection)

![title](images\dimensionalty_reduction\Proj_1.PNG)
    
* Manifold :

    * Manifold assumption : Many high dimensional datasets lie close to a much lower dimensional manifold
    * The swiss roll is an example of a 2D manifold (a 2D shape that can be bent & twisted in a higher dimensional space)
    * The swiss roll locally resembles a 2D plane but it is rolled in the 3rd dimension
    * Unrolling the manifold does not always simplify the task: some datasets are easier to separate in the original higher dimensional space.
![title](images\dimensionalty_reduction\Proj_2.PNG)

* Although dimensionality reduction helps speed up training, it may not always lead to better results. It depends on the dataset.

## Perceptrons

* A Perceptron is a single-layer neural network

* It takes inputs -> weighted sum of inputs -> applies step function to that sum -> outputs results

* Output, h(x) = step(z), where z = $w^T x$ 

* Most common step function : Considering threshold at 0

\begin{equation}
  heaviside \ (z) =
    \begin{cases}
      0 \ \text{if } z < 0\\
      1 \ \text{if } z \geq 0 \\
    \end{cases}       
\end{equation}

* Here the step function is the activation function

* Train Perceptrons:
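  * Use a variant of Hebb's rule that takes the prediction error into account: strengthen the connections that help reduce the error

  * ie go one instance at a time -> for every output that was wrong, reinforce the weights from the inputs that would have contributed to the correct prediction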
  
  * $ w_{nextstep} = w + \eta (y - \hat y)x $
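
The update rule can be sketched in pure Python on the AND gate, which is linearly separable. The separate bias term `b` is an addition of this sketch (equivalent to a constant input of 1), and the learning rate and epoch count are arbitrary.

```python
def heaviside(z):
    """Step function with the threshold at 0."""
    return 1 if z >= 0 else 0

def predict(w, b, x):
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(samples, eta=1.0, n_epochs=20):
    """Apply w <- w + eta * (y - y_hat) * x, one instance at a time."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(n_epochs):
        for x, y in samples:
            error = y - predict(w, b, x)  # 0 when the prediction is correct
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]
            b += eta * error
    return w, b

# AND gate: output 1 only when both inputs are 1
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
predictions = [predict(w, b, x) for x, _ in and_gate]
print(w, b, predictions)
```

Weights only change when a prediction is wrong, so once every instance is classified correctly the parameters stop moving.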



## Why Random Weights Init

It is important to initialize all the hidden layers’ connection weights
randomly, or else training will fail. For example, if you initialize all
weights and biases to zero, then all neurons in a given layer will be
perfectly identical, and thus backpropagation will affect them in
exactly the same way, so they will remain identical. In other words,
despite having hundreds of neurons per layer, your model will act
as if it had only one neuron per layer: it won’t be too smart. If
instead you randomly initialize the weights, you break the symmetry
and allow backpropagation to train a diverse team of neurons.

## Why Activation Functions?

To train multi-layer Perceptrons with backpropagation, the step function is replaced with the logistic (sigmoid) function,
σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains
only flat segments, so there is no gradient to work with (Gradient Descent cannot
move on a flat surface), while the logistic function has a well-defined nonzero derivative
everywhere, allowing Gradient Descent to make some progress at every step.

Why are activation functions needed in the first place? If
you chain several linear transformations, all you get is a linear transformation. For
example, if f(x) = 2x + 3 and g(x) = 5x – 1, then chaining these two linear functions
gives you another linear function: f(g(x)) = 2(5x – 1) + 3 = 10x + 1. So if you don’t
have some nonlinearity between layers, then even a deep stack of layers is equivalent
to a single layer, and you can’t solve very complex problems with that. Conversely, a
large enough DNN with nonlinear activations can theoretically approximate any continuous
function.


* ReLU & Variants:
    * ReLU : max(0, z)
    * ReLU is faster to compute than sigmoid, whose gradient calculation is comparatively complex
    * Gradients die in the negative region (ie when the weighted sum of inputs is negative) -> no SGD update 

* Softplus : log(1 + exp(z))
    * Smooth version of ReLU (differentiable everywhere, including at z = 0)
    * close to 0 when z is very negative
    * close to z when z is very positive

* Leaky ReLU:
    * max($\alpha$ z, z)
    * $\alpha$ -> determines how much it leaks, ie the slope for z < 0, typically ~ 0.01.
    * Ensures neurons don't die out: they can go into a long coma but have a chance to wake up

* RReLU:
    * Randomized Leaky ReLU
    * $\alpha$ is picked randomly from a given range during training & fixed to an avg value during testing

* PReLU:
    * Parametric ReLU 
    * $\alpha$ is learned during training
    * Better with large datasets, poor with small datasets (overfitting)

* ELU:
    * Exponential Linear Unit. Decays exponentially towards -$\alpha$ when z is negative
    * $\alpha$(exp (z) - 1) if z < 0
    * z if z >= 0
    * Slower to compute than ReLU

* SELU:
    * Scaled ELU
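    * Self-normalizing : if all hidden layers use SELU -> the output of each layer tends to keep mean 0, std-dev 1 during training -> no vanishing/exploding gradient problem
    * Constraints:
        * Inputs must be standardized (mean 0, std-dev 1)
        * Weights -> LeCun normal initialization
        * Works only with a sequential architecture (a plain stack of layers, no skip connections or recurrence)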

* Summary :
    * In general : SELU > ELU > leaky ReLU > ReLU > tanh > sigmoid
    * If self-normalization is not possible, ELU may perform better than SELU
    * If run time latency matters, prefer leaky ReLU over SELU
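
For reference, the scalar definitions of the activation functions discussed above fit in a few lines of pure Python. The default `alpha` values are common choices, not requirements.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    return z if z >= 0 else alpha * (math.exp(z) - 1)

def softplus(z):
    return math.log(1 + math.exp(z))

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), relu(z), leaky_relu(z), elu(z), softplus(z))
```

For negative inputs ReLU outputs exactly 0 (no gradient), leaky ReLU keeps a small slope, and ELU saturates towards `-alpha`.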


    

## Loss Functions

![title](images\ann\loss.PNG)

* MSE:
    * Standard, but sensitive to outliers.
    * Don't use it if you have a lot of outliers: the weights tend to be biased towards fitting those outliers
    * Gradient descent converges more efficiently than with MAE: the slope shrinks near the optimum -> updates don't oscillate around it -> faster convergence

* MAE:
    * Robust to outliers
    * Not differentiable at zero error, and the constant slope elsewhere makes Gradient descent updates wander around the optimum instead of settling.

* Huber Loss:
    * Combines the strengths of MAE and MSE.
    * It is quadratic (like MSE) when the error is low and linear (like MAE) when the error is high (outliers)
    * Has an extra hyperparameter $\delta$: the error threshold below which the loss is quadratic
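
A minimal sketch of the Huber loss for a single error value. The threshold is called `delta` here, with a default of 1.0 chosen for illustration.

```python
def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

print(huber_loss(0.5))    # quadratic region: 0.125
print(huber_loss(3.0))    # linear region: 2.5 (the squared loss would give 4.5)
print(huber_loss(1.0))    # the two pieces meet at delta: 0.5
```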

![title](images\ann\regression_1.PNG)

## losses:

* Categorical Cross Entropy:
    * Multi class classification
    * Use when labels are one hot encoded ie [0, 0, 1] for class 3
    * Softmax activation
    
* Sparse Categorical Cross Entropy:
    * Multi class classification
    * Use when labels are just class indices like 0, 2, 4 etc. 
    * Efficient, as you need not store a big one-hot encoded matrix.
    * Softmax activation

* Binary Cross Entropy:
    * Binary classification
    * Sigmoid activation follows

### Sample Weights Vs Class Weights

* If the training set was very skewed, with some classes being overrepresented and others
underrepresented, it would be useful to set the class_weight argument when
calling the fit() method, which would give a larger weight to underrepresented
classes and a lower weight to overrepresented classes. These weights would be used by
Keras when computing the loss.

* If you need per-instance weights, set the sample_weight argument (if both class_weight and sample_weight are provided, Keras
multiplies them). Per-instance weights could be useful if some instances were labeled
by experts while others were labeled using a crowdsourcing platform: you might want
to give more weight to the former. You can also provide sample weights (but not class
weights) for the validation set by adding them as a third item in the validation_data
tuple.

## Functional, Sequential & Subclassing API

* Sequential API: the most common style, layers are written in sequence
    * Cannot express complex structures, like concatenated layers etc

Example:

    from tensorflow import keras

    # Plain stack of layers for 28x28 inputs and 10 output classes
    model = keras.models.Sequential([
        keras.layers.Flatten(input_shape = [28, 28]),
        keras.layers.Dense(300, activation = 'relu'),
        keras.layers.Dense(100, activation = 'relu'),
        keras.layers.Dense(10, activation = 'softmax')
    ])

* Functional API:
    * Can build complex models - wide and deep NN
    * Connects all or part of the inputs directly to the output layer
    * It can learn deep patterns (long path) & simple rules (short path) simultaneously
    * Can provide a different subset of inputs to each branch in the NN
    * It is called functional bcos each layer is called like a function, with the previous layer's output as its argument

Example:

    from tensorflow import keras

    # Assumes xtrain is a 2D array of shape (instances, features)
    input_ = keras.layers.Input(shape = xtrain.shape[1:])
    hidden1 = keras.layers.Dense(30, activation = 'relu')(input_)
    hidden2 = keras.layers.Dense(30, activation = 'relu')(hidden1)
    # Wide path (raw input) joined with the deep path (output of hidden2)
    concat = keras.layers.Concatenate()([input_, hidden2])
    output = keras.layers.Dense(10, activation = 'softmax')(concat)
    model = keras.Model(inputs = [input_], outputs = [output])
    
* Adv of above two:
    * Easy save, clone, shared
    * Structure can be displayed, analysed (can infer shapes)
    * Errors can be caught easily, easy debug

* Subclassing:
    * Main disadv of the above two : the whole model is a static graph of layers
    * Some scenarios need varying shapes, conditional branching, looping
    * Those dynamic behaviours can be achieved by subclassing keras.Model

### Model Summary 

    from tensorflow import keras

    # Assumes xtrain is a 2D array of shape (instances, features)
    input_ = keras.layers.Input(shape = xtrain.shape[1:])
    hidden1 = keras.layers.Dense(24, activation = 'relu')(input_)
    hidden2 = keras.layers.Dense(50, activation = 'relu')(hidden1)
    output = keras.layers.Dense(1)(hidden2)

    model = keras.models.Model(inputs = [input_], outputs = [output])

#### Assuming Batch_size of 32 and xtrain of features 8

![title](images\ann\NN_Summary.PNG)
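
The parameter counts in such a summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. The sketch below assumes 8 input features and the layer sizes 24, 50 and 1 used above; the batch size does not affect the parameter count, it only shows up as the first dimension of each output shape.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return n_inputs * n_units + n_units

layer_sizes = [8, 24, 50, 1]  # input features, hidden1, hidden2, output
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

batch_size = 32
output_shapes = [(batch_size, units) for units in layer_sizes[1:]]

print(per_layer)      # [216, 1250, 51]
print(total)          # 1517
print(output_shapes)  # [(32, 24), (32, 50), (32, 1)]
```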

## Hyperparameter Tuning

* There are a lot of evolutionary algorithms which perform automatic tuning

* In general you will get more bang for your buck by increasing the
number of layers instead of the number of neurons per layer.

* Try building a model with more layers & neurons than you actually need, then use early stopping and regularization to prevent it from overfitting

* One way to find a good learning rate is to train the model for a few hundred iterations,
starting with a very low learning rate (e.g., $10^{-5}$) and gradually increasing
it up to a very large value (e.g., 10). This is done by multiplying the learning rate
by a constant factor at each iteration (e.g., by exp(log($10^6$)/500) to go from $10^{-5}$ to
10 in 500 iterations).

* One strategy is to try a large batch size with learning rate warmup, and if training is unstable or the final performance is disappointing, then try a small batch size instead.
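
The constant factor for the learning rate sweep can be computed directly. This sketch only generates the sequence of learning rates; plugging it into an actual training loop is left out, and the function name is illustrative.

```python
import math

def exponential_lr_schedule(lr_start=1e-5, lr_end=10.0, n_iterations=500):
    """Return the per-iteration factor and the resulting learning rates."""
    factor = math.exp(math.log(lr_end / lr_start) / n_iterations)
    rates = [lr_start * factor ** i for i in range(n_iterations + 1)]
    return factor, rates

factor, rates = exponential_lr_schedule()
print(factor)               # about 1.028, ie +2.8% per iteration
print(rates[0], rates[-1])  # starts at 1e-5, ends at about 10
```

After the sweep, the loss is typically plotted against the learning rate, and a value somewhat below the point where the loss starts to climb is chosen.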