### Machine Learning


#### Supervised Learning
Supervised learning is a machine learning approach that uses labeled data to train algorithms to classify data or predict outcomes accurately.
> - Types of Supervised Learning algorithms

#### Unsupervised Learning
Unsupervised learning is a machine learning approach that uses algorithms to analyze and cluster unlabeled datasets.
> - Types of Unsupervised Learning algorithms

#### Reinforcement Learning
Reinforcement learning is a machine learning training method based on rewarding desired behaviors and penalizing undesired ones, so that an agent learns about its environment by trial and error.
> - Types of Reinforcement Learning algorithms

### Classification algorithms
Classification algorithms use labeled training data to predict the likelihood that subsequent data will fall into one of a set of predetermined categories.
> - Types of Classification algorithms 

#### Probabilistic 
Given an observation of an input, a probabilistic classification model predicts a probability distribution over a set of classes rather than only the most likely class.
> - Types of Probabilistic Classification algorithms
>   - Naive Bayes
>   - Logistic regression 
>   - Multilayer perceptrons 

<img src = "https://www.ismiletechnologies.com/wp-content/uploads/2021/10/image-15.png">
$\tiny{\text{www.ismiletechnologies.com}}$   

##### Naive Bayes
Bayesian classification finds the probability of a label given some observed features using Bayes' theorem, which describes the relationship between conditional probabilities of statistical quantities.
> - READ: https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
> - Types of Bayesian Classification
>   - Multinomial Naïve Bayes Classifier
>   - Bernoulli Naïve Bayes Classifier
>   - Gaussian Naïve Bayes Classifier
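
To make the Gaussian variant concrete, here is a minimal NumPy sketch. The function names `fit_gnb` and `predict_proba_gnb` and the toy data are assumptions made for illustration, not part of any library. It applies Bayes' theorem with the "naive" assumption that features are independent given the class.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero for constant features.
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_proba_gnb(model, X):
    """Posterior P(class | x) via Bayes' theorem with independent Gaussian features."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]            # (n_samples, n_classes, n_features)
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None], axis=2
    )
    log_post = np.log(priors)[None, :] + log_lik         # unnormalized log posterior
    log_post -= log_post.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Made-up one-feature toy data: class 0 clusters near 1, class 1 near 3.
X_train = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = fit_gnb(X_train, y_train)
proba = predict_proba_gnb(model, np.array([[1.1], [2.9]]))
labels = model[0][np.argmax(proba, axis=1)]
```

Each row of `proba` is a probability distribution over the classes, which is what makes this a probabilistic classifier.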


<img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/09/Bayes_rule-300x172-300x172.png">
$\tiny{\text{www.analyticsvidhya.com}}$   

#### Rule based
Rule-based classification classifies records using a collection of "if.. else.." rules. The rule set may or may not be mutually exclusive (each record triggers at most one rule) and may or may not be exhaustive (every record triggers at least one rule).
> - READ: http://jcsites.juniata.edu/faculty/rhodes/ml/rulebasedClass.htm
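
As a rough illustration, the sketch below uses an ordered rule list with a default class. The `classify` helper, the `temp_c` field and the thresholds are all hypothetical. Ordering resolves rules that are not mutually exclusive, and the default label covers a rule set that is not exhaustive.

```python
def classify(record, rules, default="unknown"):
    """Return the label of the first rule whose condition matches."""
    for condition, label in rules:
        if condition(record):
            return label
    return default  # fallback for records that no rule covers

# Not mutually exclusive: a 35-degree record satisfies both conditions,
# so the order of the rules decides the outcome.
# Not exhaustive: a 5-degree record satisfies neither condition.
rules = [
    (lambda r: r["temp_c"] >= 30, "hot"),
    (lambda r: r["temp_c"] >= 15, "mild"),
]

labels = [classify({"temp_c": t}, rules) for t in (35, 20, 5)]
```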


##### Decision Tree
A decision tree is a data mining/machine learning method that predicts the value of a target variable from several input variables. In a classification tree, each internal node is labeled with an input feature, and each leaf is labeled with a class or with a probability distribution over the classes.

> - Decision Tree Types
>   - CART - Classification and Regression Trees
>   - Ensemble methods, construct more than one decision tree
>     - Boosted trees 
>     - Bootstrap aggregated (or bagged/bagging) decision trees

<img src="https://miro.medium.com/max/1200/1*ZkQXt7mqI7MXuXhHrfvgtQ.png" width=400 height=400> 
$\tiny{\text{miro.medium.com}}$   

<img src="https://upload.wikimedia.org/wikipedia/commons/2/25/Cart_tree_kyphosis.png">
$\tiny{\text{Wikipedia}}$   

### Most common

#### Linear Regression
Linear regression helps us model the relationship between two variables by fitting a linear equation to observed data. The most common method for fitting a regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.

We assume here that $y|x; \theta \sim \mathcal{N}(\mu, \sigma^{2})$, where the mean is the linear prediction $\mu = \theta^{T}x$.

The closed form solution for the $\theta$ that minimizes the cost function is
> $$\theta = (X^{T}X)^{-1}X^{T}y$$
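
A minimal NumPy sketch of this closed form follows; the helper name `normal_equation` and the toy data are assumptions for illustration. It solves the linear system $(X^{T}X)\theta = X^{T}y$ rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for the least-squares parameters."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noiseless toy data on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])  # bias column plus one feature

theta = normal_equation(X, y)  # intercept and slope
```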

#### Logistic Regression
Logistic regression models the relationship between a set of independent variables and a categorical dependent variable. It predicts the probability of each class given the input features, using the sigmoid function to map a linear score to a probability; class labels are then assigned by thresholding that probability.

The logit function is defined as the logarithm of the odds (the log-odds)
$$ \text{logit}(p) = \ln\frac{p}{1-p}$$

The standard logistic sigmoid function is defined as
$$ \sigma(x) = \frac{1}{1+e^{-x}} $$

The linear part of the model predicts the log-odds for an example; the logistic sigmoid function converts these log-odds into a probability.

It tries to learn a function that approximates $P(Y|X)$ by assuming that $P(Y|X)$ can be written as a sigmoid function applied to a linear combination of the input features.  
$$ P(Y = 1|X = x) = \sigma(z) = \sigma(\theta^{T}x) $$
where $z = \theta_{0} + \sum\limits_{i=1}^{m}\theta_{i}x_{i}$, which equals $\theta^{T}x$ under the convention $x_{0} = 1$.

Similarly,
$$ P(Y = 0|X = x) = 1 - \sigma(\theta^{T}x) $$

Gradient descent uses the partial derivatives of the logistic cost function with respect to the weights to update the weights iteratively and minimize the cost (equivalently, to maximize the log-likelihood).
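
The update loop can be sketched in a few lines of NumPy. The function name `fit_logistic`, the learning rate, the iteration count and the toy data are assumptions chosen for illustration; the sketch uses the mean cost so that the step size does not depend on the dataset size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the mean logistic cost.

    X is expected to contain a leading column of ones for the bias term.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the mean cost: X^T (sigma(X theta) - y) / n
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad  # step against the gradient to reduce the cost
    return theta

# Made-up, slightly overlapping one-feature data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

theta = fit_logistic(X, y)
probabilities = sigmoid(X @ theta)  # estimated P(Y = 1 | X = x) per example
```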



#### Decision Tree
<a href="#Decision-Tree">Link</a>


#### SVM




#### Naive Bayes




#### kNN




#### K-Means




#### Random Forest




#### Dimensionality Reduction Algorithms




#### Gradient Boosting algorithms




##### GBM




##### XGBoost




##### LightGBM




##### CatBoost



[Microsoft - ML Reference](https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/component-reference)
<img src="https://docs.microsoft.com/en-us/azure/machine-learning/media/algorithm-cheat-sheet/machine-learning-algorithm-cheat-sheet.png#lightbox">

### Regression - Predict a value
#### Boosted Decision Tree Regression
#### Decision Forest Regression
#### Fast Forest Quantile Regression
#### Linear Regression
#### Local Linear Regression

#### Locally Weighted Linear Regression (LWR)
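Any parametric model can be made local if the fitting method accommodates observation weights. Locally weighted linear regression is a variant of linear regression in which the weight of each training example in the cost function is defined as follows, where $\tau$ is a bandwidth parameter that controls how quickly the weight decays with distance from the query point $x$: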
> $$ w^{(i)}(x) = \exp \left( -\frac{(x^{(i)} - x)^{2}}{2\tau^{2}} \right) $$
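
One possible way to turn these weights into a prediction is to solve a weighted least-squares problem for every query point. The sketch below assumes a single input feature; the helper name `lwr_predict` and the default `tau` are illustrative choices.

```python
import numpy as np

def lwr_predict(x_query, x_train, y_train, tau=0.5):
    """Predict y at x_query with locally weighted linear regression (1-D input)."""
    # Gaussian-shaped weights: training points near the query count more.
    w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
    X = np.column_stack([np.ones_like(x_train), x_train])
    # Weighted normal equation: (X^T W X) theta = X^T W y
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X, XtW @ y_train)
    return theta[0] + theta[1] * x_query

# Sanity check on exactly linear data, where any weighting recovers the same line.
x_train = np.linspace(0.0, 3.0, 7)
y_train = 1.0 + 2.0 * x_train
prediction = lwr_predict(1.5, x_train, y_train)
```

Because the parameters are refit for each query, the training set has to be kept around at prediction time, which is why LWR is usually described as a nonparametric method.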

#### Neural Network Regression
#### Poisson Regression
#### Quantile Regression

### Classification - Predict a class. Choose from binary (two-class) or multiclass algorithms.	
#### Multiclass Boosted Decision Tree
#### Multiclass Decision Forest

#### Multiclass Logistic Regression (Softmax Regression)
Softmax regression is a generalization of logistic regression that handles multiple classes $(y^{(i)} \in \{1, \dots, K\})$ instead of only two classes $(y^{(i)} \in \{0,1\})$.
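
A short sketch of the softmax function itself, with made-up scores, shows how class scores become a probability distribution and how the two-class case falls back to the sigmoid.

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # avoids overflow
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))  # three made-up class scores

# With two classes and scores [z, 0], the first probability is sigmoid(z).
z = 0.7
two_class = softmax(np.array([z, 0.0]))
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
```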

#### Multiclass Neural Network
#### One vs. All Multiclass
#### One vs. One Multiclass
#### Two-Class Averaged Perceptron
#### Two-Class Boosted Decision Tree
#### Two-Class Decision Forest

#### Two-Class Logistic Regression

<a href="#Logistic-Regression">Logistic Regression</a>

The hypothesis looks like 

$$ h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}$$ 
and the model parameters $\theta$ are trained to minimize the cost function
$$ J(\theta) = - \sum\limits_{i=0}^{n} \left[ y^{(i)}\log\sigma(\theta^{T}x^{(i)}) + (1-y^{(i)}) \log\left(1-\sigma(\theta^{T}x^{(i)})\right) \right] $$
$$ = - \sum\limits_{i=0}^{n} \left[ y^{(i)}\log h_{\theta}(x^{(i)}) + (1-y^{(i)}) \log\left(1- h_{\theta}(x^{(i)})\right) \right] $$
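
The cost can be evaluated directly from this definition. In the sketch below, the helper name `logistic_cost` and the clipping constant `eps` are assumptions; clipping is added only to avoid `log(0)` and is not part of the formula.

```python
import numpy as np

def logistic_cost(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta) summed over all examples."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # hypothesis h_theta(x)
    h = np.clip(h, eps, 1.0 - eps)          # guard against log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Made-up data; the first column of ones is the bias term.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With theta = 0 every prediction is 0.5, so the cost is n * ln(2).
cost_at_zero = logistic_cost(np.zeros(2), X, y)
```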


#### Two-Class Neural Network
#### Two-Class Support Vector Machine

### Supervised learning (Classification, Regression)
#### Decision trees
#### Ensembles
##### Bagging
##### Boosting
##### Random forest
#### k-NN
#### Linear regression
#### Naive Bayes
#### Artificial neural networks
#### Logistic regression
#### Perceptron
#### Relevance vector machine (RVM)
#### Support vector machine (SVM)

### Clustering
#### BIRCH
#### CURE
#### Hierarchical
#### k-means
#### Expectation–maximization (EM)
#### DBSCAN
#### OPTICS
#### Mean shift

### Dimensionality Reduction
#### Factor analysis
#### CCA
#### ICA
#### LDA
#### NMF
#### PCA
#### PGD
#### t-SNE

### Structured Prediction - Graphical models 
#### Bayes net
#### Conditional random field
#### Hidden Markov

### Anomaly detection
#### k-NN
#### Local outlier factor

### Artificial neural network
#### Autoencoder
#### Cognitive computing
#### Deep learning
#### DeepDream
#### Multilayer perceptron
#### RNN 
##### LSTM
##### GRU
##### ESN
#### Restricted Boltzmann machine
#### GAN
#### SOM
#### Convolutional neural network 
##### U-Net
#### Transformer Vision
#### Spiking neural network
#### Memtransistor
#### Electrochemical RAM (ECRAM)

### Reinforcement Learning
#### Q-learning
#### SARSA
#### Temporal difference (TD)

### Machine Learning Theory
#### Kernel machines
#### Bias–variance tradeoff
#### Computational learning theory
#### Empirical risk minimization
#### Occam learning
#### PAC learning
#### Statistical learning
#### VC theory

### Recommender System - Methods and challenges
#### Cold start
#### Collaborative filtering
#### Dimensionality reduction
#### Implicit data collection
#### Item-item collaborative filtering
#### Matrix factorization
#### Preference elicitation
#### Similarity search

### 10 most popular deep learning algorithms
[Link](https://www.simplilearn.com/tutorials/deep-learning-tutorial/deep-learning-algorithm)

#### Convolutional Neural Networks (CNNs)
#### Long Short Term Memory Networks (LSTMs)
#### Recurrent Neural Networks (RNNs)
#### Generative Adversarial Networks (GANs)
#### Radial Basis Function Networks (RBFNs)
#### Multilayer Perceptrons (MLPs)
#### Self Organizing Maps (SOMs)
#### Deep Belief Networks (DBNs)
#### Restricted Boltzmann Machines (RBMs)
#### Autoencoders

### Optimization algorithms - Deep Learning
- [Optimizing GD](https://ruder.io/optimizing-gradient-descent/)
- [Optimizer Visualization](https://github.com/Jaewan-Yun/optimizer-visualization)

#### ASGD
#### Adadelta
#### Adagrad
#### Adam
#### AdamW
#### Adamax
#### LBFGS
#### NAdam
#### RAdam
#### RMSprop
#### Rprop
#### SGD
#### SparseAdam

### Evaluation Metrics

#### Confusion Matrix
<img src="https://miro.medium.com/max/1400/1*-kFYBD6v7wv4qABsVtx9JA.png" width=300 height=300>

#### Log Loss or Cross Entropy Loss
<img src="https://miro.medium.com/max/1400/1*s5AmzAfKxh06ymdw1zkNkA.png" width=300 height=300>

#### Area under the curve (AUC)
<img src="https://www.ismiletechnologies.com/wp-content/uploads/2021/10/image-9.png" width=300 height=300>



### Generative Models

### Discriminative Models

### Types of classifiers

### Estimation methods

[Estimation Techniques](https://towardsdatascience.com/essential-parameter-estimation-techniques-in-machine-learning-and-signal-processing-d671c6607aa0)

#### Maximum Likelihood (ML) Estimation


#### Maximum a Posteriori (MAP) Estimation


#### Minimum Mean Square Error (MMSE) Estimation


#### Least Square (LS) Estimation
The least-squares best-fit line minimizes the sum of squared residuals.
- In the linear case, it has a closed-form solution

#### Linear Least squares
> Types of Linear least squares formulation:
> - Ordinary Least squares (OLS)
>   - The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated value of the unknown parameter vector
>   - It assumes that the error terms have finite variance and are homoscedastic
> - Weighted Least squares (WLS)
>   - It allows the error terms to be heteroscedastic (unequal variances) but assumes they are uncorrelated
> - Generalized Least squares (GLS)
>   - It allows the error terms to be heteroscedastic, correlated, or both
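
The three formulations can be seen as one estimator with different error covariance matrices. The sketch below assumes the standard GLS estimator $\theta = (X^{T}\Omega^{-1}X)^{-1}X^{T}\Omega^{-1}y$; the helper name `gls`, the observations and the variances are made up for illustration.

```python
import numpy as np

def gls(X, y, omega):
    """Generalized least squares: solve (X^T Omega^-1 X) theta = X^T Omega^-1 y."""
    omega_inv_X = np.linalg.solve(omega, X)  # Omega^-1 X without an explicit inverse
    omega_inv_y = np.linalg.solve(omega, y)  # Omega^-1 y
    return np.linalg.solve(X.T @ omega_inv_X, X.T @ omega_inv_y)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])      # made-up noisy observations
X = np.column_stack([np.ones_like(x), x])

# Identity covariance (homoscedastic, uncorrelated errors) reduces GLS to OLS.
theta_ols = gls(X, y, np.eye(len(y)))

# Diagonal covariance (heteroscedastic, uncorrelated errors) reduces GLS to WLS.
error_variances = np.array([1.0, 1.0, 4.0, 4.0, 9.0])
theta_wls = gls(X, y, np.diag(error_variances))
```

A full (non-diagonal) `omega` would cover the correlated-error case in the same way.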

#### Non-linear
The nonlinear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.

#### Ordinary

#### Weighted

#### Generalized

#### Partial

#### Total

#### Non-negative

#### Ridge regression

#### Regularized

#### Least absolute deviations

#### Iteratively reweighted

#### Bayesian

#### Bayesian multivariate

#### Bayes Estimator

#### Probit

#### Logit


### Frequentists vs Bayesian Approach
- Frequentists define probability as a relative frequency of an event in the long run
- Bayesians define probability as a measure of uncertainty and belief for any event.
> TODO: READ

### Parametric vs NonParametric Method
- Parametric methods make an assumption about the form of the function, for example $f(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p}$, where $f(X)$ is the unknown function to be estimated, the $\beta$'s are the coefficients to learn, the $X$'s are the corresponding inputs and $p$ is the number of independent variables.
  - These assumptions may or may not be correct
  - these methods are quite fast
  - they require significantly less data
  - they are more interpretable
  - Examples 
    - Linear Regression
    - Naive Bayes
    - Perceptron
- Nonparametric methods do not make any underlying assumption about the form of the function to be estimated.
  - they can be more accurate when enough data is available
  - they require lots of data
  - Examples:
    - Support Vector Machines
    - K-Nearest Neighbors

### Prediction vs Inference

### Mathematical functions

#### Logistic/Sigmoid function
$$ f(x) = \frac{L}{1+e^{-k(x-x_{0})}} $$
> L - the curve’s maximum value  
> $x_{0}$ - the sigmoid’s midpoint  
> k - the logistic growth rate or steepness of curve  

#### Standard logistic sigmoid function
$$ \sigma(x) = \frac{1}{1+e^{-x}} $$
> L = 1 - the curve’s maximum value  
> $x_{0}$ = 0 - the sigmoid’s midpoint  
> k = 1 - the logistic growth rate or steepness of curve  
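
The relationship between the general and the standard form can be checked numerically. The helper name `logistic` and the sample parameter values below are illustrative assumptions.

```python
import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic function; the defaults give the standard sigmoid."""
    return L / (1.0 + np.exp(-k * (x - x0)))

xs = np.linspace(-5.0, 5.0, 11)
standard = logistic(xs)                              # L = 1, k = 1, x0 = 0
at_midpoint = logistic(3.0, L=4.0, k=2.0, x0=3.0)    # value at the midpoint is L / 2
far_right = logistic(100.0, L=4.0, k=2.0, x0=3.0)    # approaches the maximum L
```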



#### Logit function
The logit function is the logarithm of the odds (the log-odds)
$$ \text{logit}(p) = \ln\frac{p}{1-p}$$
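
The logit is the inverse of the standard logistic sigmoid, which a short numerical check can confirm. The probability values below are arbitrary examples.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p = np.array([0.1, 0.25, 0.5, 0.9])
log_odds = logit(p)            # negative below 0.5, zero at 0.5, positive above
recovered = sigmoid(log_odds)  # applying the sigmoid returns the original p
```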

#### Probit function

#### Softplus function

#### Weighted sum
$$ \theta^{T}x = \sum\limits_{i=1}^{n}\theta_{i}x_{i} = \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} $$

#### Likelihood function
The likelihood function gives the joint probability of the observed data as a function of the parameters of the model. For logistic regression, the log-likelihood is
$$ LL(\theta) = \sum\limits_{i=0}^{n} \left[ y^{(i)}\log\sigma(\theta^{T}x^{(i)}) + (1-y^{(i)}) \log\left(1-\sigma(\theta^{T}x^{(i)})\right) \right] $$

#### Gradient of log likelihood function
The values of $\theta$ are chosen to maximize the log-likelihood; its partial derivative with respect to $\theta_{j}$ is
$$ \frac{\partial LL(\theta)}{\partial \theta_{j}} = \sum\limits_{i=0}^{n}\left[ y^{(i)} - \sigma(\theta^{T}x^{(i)}) \right]x_{j}^{(i)} $$
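
One way to gain confidence in this expression is to compare it with a finite-difference approximation of the log-likelihood. The helper names, the data and the test point `theta` in this sketch are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    """Analytic gradient: sum_i [y_i - sigma(theta^T x_i)] x_i."""
    return X.T @ (y - sigmoid(X @ theta))

def numerical_gradient(f, theta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

# Made-up data; the first column of ones is the bias term.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([0.3, -0.2])

analytic = gradient(theta, X, y)
numeric = numerical_gradient(lambda t: log_likelihood(t, X, y), theta)
```

Gradient ascent would then repeatedly move `theta` in the direction of `analytic` to increase the log-likelihood.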
