## 3. Modeling

## Part-One Frame business problems as ML problems

**Important Topics**
* Supervised learning
    * Regression and classification
* Unsupervised learning
    * Clustering
    * Anomaly detection
* Deep learning
    * Perceptron
    * Components of an artificial neuron

First, remember that ML is about identifying hidden patterns in data. It has the potential to leverage large amounts of data to train an ML model on that data’s patterns and structures to then make predictions. And the power of ML is that your model, in theory, gets progressively better at making these predictions as it’s trained.

ML is not appropriate when you can determine a target value by using simple rules or computations that can be programmed without needing any data-driven learning. 

### a. Supervised Learning
Supervised learning is a popular type of ML because it’s widely applicable and has several successful applications. Supervised algorithms learn patterns by seeing the relationships between variables and known outcomes. Take a simple image recognition example. You provide your model with training data that includes images of different animals and the corresponding labels, which essentially give the model the correct answer for each image. This one is a cat. This one is a dog. After training on this data, in theory, your model should be able to predict the type of animal it sees in a totally new picture that it encounters in production.
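As a rough illustration of learning from labeled examples, here is a minimal sketch of a 1-nearest-neighbor classifier. It is not an image model: the two numeric features (weight and ear length) and all the values are hypothetical stand-ins for what a real system would extract from pictures.

```python
import math

# Hypothetical labeled training data: (weight_kg, ear_length_cm) -> label.
training_data = [
    ((4.0, 6.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((5.0, 6.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.5), "dog"),
    ((25.0, 10.0), "dog"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(training_data, key=lambda item: math.dist(item[0], features))
    return nearest[1]

# Two unseen examples that the model never saw during "training".
print(predict((4.2, 6.8)))
print(predict((27.0, 11.5)))
```

The labels are what make this supervised: the prediction for a new example is taken directly from a known, correct answer in the training set.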

### b. Unsupervised Learning
By contrast, when your training data is not labeled and you don’t already understand how inputs may map to outputs, you might want to look to unsupervised learning as a solution.
A common type of unsupervised learning is called clustering. This kind of algorithm groups data points into different clusters based on similar features in order to better understand the attributes of a specific group or cluster. For instance, let’s say you sell office supplies to different companies. In analyzing your customer purchasing habits, unsupervised learning might be able to identify, let’s say, two different groups of customers without the need for specific labels. Maybe the model identifies that one cluster, centered around purchasing products like paper and pencils, consists mostly of smaller companies, whereas the cluster centered around products like conference tables and chairs is made up of larger companies. In this situation, clustering might help you realize that you need to come up with a different marketing strategy for different-sized companies.
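The office supplies example can be sketched with a small, plain-Python k-means. The customer spend figures and the choice of initial centroids are assumptions made up for illustration, and a real project would more likely use a library implementation.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical customers: (monthly spend on paper/pencils, monthly spend on furniture).
customers = [(90, 5), (80, 10), (85, 0), (10, 95), (5, 80), (15, 90)]

# Start from two of the data points as initial centroids (an arbitrary choice).
labels, centers = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)
print(centers)
```

No labels are supplied anywhere. The two groups emerge only from how close the customers are to each other in the feature space.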

## Part-Two Select the appropriate model(s) for an ML problem

**Important Topics**
* Linear learner
* XGBoost
* K-means
* Decision trees
* Random forest
* Image classification
* Object detection
* Semantic segmentation

**I am not going into the details of each of them, as I expect you already know what each one does and where to use it.**

## Part-Three Train ML models

**Important Topics**

* Amazon SageMaker workflow for training jobs
* Running a training job using containers
* Build your own containers
* Amazon EC2 P3 instances
* Components of an ML training job for deep learning

Split your data to ensure a proper division between training and evaluation, so that the model is always evaluated on data it did not see during training.

**Cross-validation**<br>
Use cross-validation methods to compare the performance of multiple models. The goal behind cross-validation is to help you choose the model that will eventually perform the best in production. 

**K-fold cross-validation**<br>
K-fold cross-validation is a common validation method. In k-fold cross-validation, you split the input data into k subsets of data (also known as folds). You train your models on all but one (k-1) of the subsets, and then evaluate them on the subset that was not used for training. This process is repeated k times, with a different subset reserved for evaluation (and excluded from training) each time. 
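The splitting procedure described above can be sketched as a small generator. This is a minimal version that assumes the data has already been shuffled. The function name `k_fold_indices` is made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, eval_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        eval_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, eval_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, eval_idx in folds:
    print("train:", train_idx, "eval:", eval_idx)
```

Each sample lands in the evaluation subset exactly once, and the k evaluation scores are typically averaged to compare models.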

**Stratified K-fold cross-validation**<br>
Stratified k-fold cross-validation works the same way as k-fold cross-validation, but it uses stratified sampling instead of random sampling, so each fold keeps roughly the same class proportions as the full dataset.
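One simple way to get stratified folds is to deal the samples of each class round-robin into the k folds. The sketch below assumes that approach and uses a made-up imbalanced label list. Library implementations usually also shuffle within each class.

```python
from collections import defaultdict

def stratified_fold_assignment(labels, k):
    """Assign each sample to a fold so that every class is spread evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    for indices in by_class.values():
        # Deal the samples of each class round-robin into the k folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % k
    return fold_of

# Hypothetical imbalanced labels: 8 negatives and 4 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
fold_of = stratified_fold_assignment(labels, k=4)
print(fold_of)
```

With these labels, every fold receives two negatives and one positive, which matches the 2:1 ratio of the full dataset.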

### a. Amazon EC2 P3
Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. These instances deliver up to one petaflop of mixed-precision performance per instance to significantly accelerate machine learning and high performance computing applications. Amazon EC2 P3 instances have been proven to reduce machine learning training times from days to minutes, as well as increase the number of simulations completed for high performance computing by 3-4x.

## Part-Four Perform hyperparameter optimization

**Important Topics**
* Amazon SageMaker hyperparameter tuning jobs
* Common hyperparameters to tune:
    * Momentum
    * Optimizers
    * Activation functions
    * Dropout
    * Learning rate
* Regularization:
    * Dropout
    * L1/L2

**What are hyperparameters?**<br>
Hyperparameters are the settings that can be tuned before running a training job to control the behavior of an ML algorithm. They can have a big impact on model training as it relates to training time, model convergence, and model accuracy. Unlike model parameters that are derived from the training job, the values of hyperparameters do not change during the training. 
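To make the distinction concrete, here is a minimal sketch of gradient descent on a one-weight linear model. The data and the chosen values are made up for illustration: `learning_rate` and `epochs` are hyperparameters set before training, while `w` is a model parameter learned during training.

```python
def train(xs, ys, learning_rate, epochs):
    """Fit y = w * x with gradient descent on mean squared error."""
    w = 0.0  # model parameter: updated on every step of training
    for _ in range(epochs):
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient  # learning_rate stays fixed throughout
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x

w = train(xs, ys, learning_rate=0.05, epochs=200)
print(round(w, 4))
```

On this toy data a learning rate of 0.05 converges toward w = 2, whereas a learning rate of 0.2 makes the same loop diverge, which is one way a hyperparameter affects convergence.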

Categories of hyperparameters:<br>
**1. Model hyperparameters**<br>
Model hyperparameters define the model itself. Examples include attributes of a neural network architecture such as filter size, pooling, stride, and padding.

**2. Optimizer hyperparameters**<br>
Optimizer hyperparameters relate to how the model learns patterns from the data and are used for neural network models. They include the choice of optimizer, such as gradient descent, stochastic gradient descent, or momentum-based optimizers like Adam, as well as the method for initializing the parameter weights, such as Xavier initialization or He initialization.

**3. Data hyperparameters**<br>
Data hyperparameters relate to the attributes of the data and are often used when you don’t have enough data or enough variation in the data. Examples include data augmentation techniques such as cropping and resizing.

**Grid search**<br>
Grid search is one method for tuning hyperparameters. With grid search, you set up a grid made up of hyperparameters and their different values. For each possible combination, a model is trained and a score is produced on the validation data. With this approach, every single combination of the given possible hyperparameter values is tried. This approach, while thorough, can be very inefficient, because the number of combinations multiplies with every hyperparameter you add.
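The exhaustive loop can be sketched as follows. The `validation_score` function is a hypothetical stand-in for training a model and scoring it on validation data, and the grid values are arbitrary.

```python
from itertools import product

def validation_score(learning_rate, dropout):
    """Hypothetical stand-in for 'train a model and score it on validation data'."""
    return -((learning_rate - 0.01) ** 2) - (dropout - 0.3) ** 2

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "dropout": [0.1, 0.3, 0.5],
}

results = []
# product() enumerates every combination of the listed values.
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((validation_score(**params), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(len(results), best_params)
```

Three values for each of two hyperparameters already means nine training runs. A third hyperparameter with three values would raise that to 27.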

**Random search**<br>
Random search is similar to grid search, but instead of training and scoring on each possible hyperparameter combination, random combinations are selected. You can set the number of search iterations based on time and resource constraints.
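A minimal sketch of random search, under the same assumption that `validation_score` is a hypothetical stand-in for a real training and validation run. The sampling ranges are arbitrary.

```python
import random

def validation_score(learning_rate, dropout):
    """Hypothetical stand-in for 'train a model and score it on validation data'."""
    return -((learning_rate - 0.01) ** 2) - (dropout - 0.3) ** 2

def random_search(n_iterations, seed=0):
    """Sample random hyperparameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iterations):
        params = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "dropout": rng.uniform(0.0, 0.6),
        }
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

best_score, best_params = random_search(n_iterations=20)
print(best_score, best_params)
```

The `n_iterations` argument is the budget knob: it caps the number of training runs regardless of how many hyperparameters are being searched.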

### a. Amazon SageMaker hyperparameter tuning jobs
Amazon SageMaker lets you perform automated hyperparameter tuning. Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

### b. Regularization 
Regularization is one of the most important concepts in machine learning. It is a technique that prevents the model from overfitting by adding extra information to it, typically a penalty on large weights.

**Ridge Regression or L2 regularization**<br>
Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2 regularization. In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. It is calculated by multiplying lambda by the sum of the squared weights of the individual features.
It also helps when there are more parameters than samples.

**Lasso Regression or Least Absolute Shrinkage and Selection Operator or L1 regularization**<br>
Lasso regression is similar to ridge regression, except that the penalty term uses the absolute values of the weights instead of their squares. Because it uses absolute values, it can shrink a weight all the way to 0, whereas ridge regression can only shrink it close to 0.
Lasso regression can therefore help reduce overfitting and also perform feature selection.
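The difference between the two penalties can be sketched in a few lines. The shrinkage functions below assume a simplified one-weight objective (stated in each docstring) rather than a full regression fit, and the example weights are made up.

```python
def l2_penalty(weights, lam):
    """Ridge penalty: lambda times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)

def l1_penalty(weights, lam):
    """Lasso penalty: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5 * (w - w_ols)**2 + 0.5 * lam * w**2."""
    return w_ols / (1 + lam)

def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5 * (w - w_ols)**2 + lam * abs(w) (soft thresholding)."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0

weights = [3.0, -0.5, 0.2]  # hypothetical unregularized weights
print(l2_penalty(weights, lam=0.1), l1_penalty(weights, lam=0.1))
print([ridge_shrink(w, lam=1.0) for w in weights])
print([lasso_shrink(w, lam=1.0) for w in weights])
```

In this simplified setting ridge scales every weight down but leaves all of them nonzero, while lasso sets the two small weights exactly to 0, which is the feature selection effect.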

## Part-Five Evaluate ML models

**Important Topics**
* Metrics for regression: sum of squared errors, RMSE
* Sensitivity
* Specificity
* Neural network functions like Softmax for the last layer

**For classification problems, a confusion matrix is the building block for your model evaluation.**

Metrics for classification problems:<br>
**1. Accuracy**<br>
Accuracy is the ratio of correct predictions to the total number of predictions.

**2. Precision**<br>
Precision is the proportion of positive predictions that are actually correct.

**3. Recall**<br>
Recall is the proportion of actual positives that are correctly identified as positive. It is also called sensitivity.

**4. F1 score**<br>
F1 = 2 × Precision × Recall / (Precision + Recall), which is the harmonic mean of precision and recall.

**5. Area Under Curve**<br>
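AUC-ROC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate across classification thresholds. A value of 1.0 means the model ranks every positive above every negative, and 0.5 is no better than random guessing.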
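The metrics above can all be derived from the four confusion matrix counts. The following is a minimal sketch for binary labels. The labels and scores are made up, and the AUC function uses the pairwise-ranking definition rather than integrating a curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Assumes at least one actual positive, one actual negative, and one predicted positive."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,  # also called sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical ground truth, hard predictions, and predicted scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(roc_auc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the threshold used to turn scores into predictions, whereas AUC is computed from the scores directly and summarizes ranking quality across all thresholds.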