Fundamentals of Machine Learning
4.1 Four Branches of Machine Learning
Supervised learning maps input data to known targets (also called responses). Besides the common classification and regression tasks, there are also:
- Sequence generation: Given a picture, predict a caption describing it. Sequence generation can sometimes be reformulated as a series of classification problems, such as repeatedly predicting the next word in a sequence.
- Syntax tree prediction: Given a sentence, predict its decomposition into a syntax tree.
- Object detection: Given a picture, draw a bounding box around objects in the picture.
- Image segmentation: Given a picture, draw a pixel-level mask on a specific object.

Unsupervised learning finds interesting transformations of the input data without the help of any targets. It is the bread and butter of data analytics and is often a necessary step in understanding a dataset before attempting to solve a supervised-learning problem.

Self-supervised learning is supervised learning without human-annotated labels. Labels are still involved, but they are generated from the input data itself, typically using a heuristic algorithm.
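
As a rough illustration of labels generated from the input itself, the sketch below builds next-word prediction pairs from raw text. The function name next_word_pairs and the context_size parameter are made up for this example; real pipelines use proper tokenisation rather than a whitespace split.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, target) pairs where the target is simply the next word."""
    words = text.split()
    pairs = []
    for i in range(len(words) - context_size):
        context = words[i:i + context_size]
        target = words[i + context_size]  # the label comes from the input itself
        pairs.append((context, target))
    return pairs


pairs = next_word_pairs("the cat sat on the mat", context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

No human annotation is needed here: every target is just a word that already appears in the text.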

In reinforcement learning, an agent receives information about its environment and learns to choose actions that will maximise some reward. For example, a neural network "looks" at a video game screen and outputs game actions in order to maximise its score. Currently, it is mostly a research area and hasn't had significant practical successes beyond games.
4.2 Model Evaluation
In machine learning, the goal is to achieve models that generalise, that is, models that perform well on never-before-seen data. When a model performs well on the training set but not on the validation set, we say that it is overfitting. To measure how well a model generalises, we split the data into training, validation and test sets, and train the model only on the training set.

Splitting into training, validation and test sets is the usual approach: train the model on the training set, evaluate it on the validation set, and do one final check on the test set.

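Why split three ways and not two? With only a training set and a test set, hyperparameters end up being tuned against the test set. Each round of tuning leaks some information about the test set into the model, so we gradually "overfit the test set", and its score stops being a reliable estimate of how well the model generalises to unseen data.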

To split the data, we usually use either hold-out validation (a simple train-test split) or k-fold cross-validation.
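
The two splitting methods can be sketched in plain Python as follows. This is a minimal illustration, not a library API: holdout_split and k_fold_splits are names invented here, and in this sketch the last fold simply absorbs any leftover samples.

```python
import random


def holdout_split(samples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (training, validation)


def k_fold_splits(samples, k=4):
    """Yield k (training, validation) pairs; each sample is validated on once."""
    fold_size = len(samples) // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold takes the remainder when len(samples) is not divisible by k.
        end = start + fold_size if fold < k - 1 else len(samples)
        validation = samples[start:end]
        training = samples[:start] + samples[end:]
        yield training, validation


data = list(range(10))
train, val = holdout_split(data, val_fraction=0.2)
folds = list(k_fold_splits(data, k=4))
```

With k-fold, a model is trained once per fold and the k validation scores are averaged, which helps when the dataset is too small for a single hold-out set to be reliable.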

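Beyond this, we can use iterated k-fold validation with shuffling: run k-fold cross-validation p times, shuffling the dataset before each run, and average all the scores. This trains and evaluates p * k models, so it can be expensive.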
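
A sketch of the repeated, shuffled k-fold idea is shown below. The names iterated_k_fold and mean_baseline_error are assumptions for this example, and the "model" is only a stand-in that predicts the training mean, so that the code runs without any ML library.

```python
import random
from statistics import mean


def iterated_k_fold(samples, evaluate, k=4, p=3, seed=0):
    """Run k-fold validation p times, reshuffling before each run."""
    rng = random.Random(seed)
    scores = []
    for _ in range(p):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        fold_size = len(shuffled) // k
        for fold in range(k):
            start = fold * fold_size
            end = start + fold_size if fold < k - 1 else len(shuffled)
            validation = shuffled[start:end]
            training = shuffled[:start] + shuffled[end:]
            scores.append(evaluate(training, validation))
    return mean(scores), len(scores)


def mean_baseline_error(training, validation):
    """Stand-in for train-and-evaluate: predict the training mean, return the MAE."""
    prediction = mean(training)
    return mean(abs(value - prediction) for value in validation)


avg_score, n_runs = iterated_k_fold(list(range(20)), mean_baseline_error, k=4, p=3)
```

Here n_runs is p * k, which is the number of models that would have to be trained in a real setting.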

Things to be aware of when doing the splits:
- The training set and test set should both be representative of the overall data. It is usually good to use stratified sampling based on the targets. If the original dataset has 2 labels split 40%-60%, then the training set and test set should both have this same split.
- For time series, where the goal is to predict the future from the past, we should not shuffle the data before splitting, and all test data should come after the training data.
- Ensure that the training and test sets are disjoint, so that no data point exists in both.
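
The first and last points above can be sketched together: a stratified split that keeps the label proportions, followed by a disjointness check. The function stratified_split is a hypothetical helper written for this example, and it returns indices rather than the samples themselves.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices so that each label keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = ["a"] * 40 + ["b"] * 60  # the 40%-60% example from the text
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)

# No index may appear in both sets.
assert not set(train_idx) & set(test_idx)
test_share_a = sum(labels[i] == "a" for i in test_idx) / len(test_idx)
train_share_a = sum(labels[i] == "a" for i in train_idx) / len(train_idx)
```

Note that this sketch shuffles within each label, so it is not suitable for time series, where the order must be preserved.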
4.3 Data Preprocessing, Feature Engineering, Feature Learning
Preprocessing
Vectorisation: Most of the time, data must be in tensor form before it is fed to a neural network. Converting it is called data vectorisation. It is usually easy for numerical features, but text and image data need some transformation first.
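
As one possible example of vectorising text, the sketch below turns sentences into fixed-length multi-hot vectors over a small vocabulary. The helper names build_vocabulary and multi_hot are invented here, and this is only one of several common encodings.

```python
def build_vocabulary(sentences):
    """Assign each distinct lowercase word an integer index, in order of appearance."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def multi_hot(sentence, vocabulary):
    """Encode a sentence as a vector with 1.0 at the index of every word it contains."""
    vector = [0.0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = 1.0
    return vector


sentences = ["The cat sat", "the dog sat down"]
vocabulary = build_vocabulary(sentences)
vectors = [multi_hot(sentence, vocabulary) for sentence in sentences]
```

Every sentence now has the same length (the vocabulary size), which is what allows the data to be stacked into a tensor.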

Normalisation: In most datasets, different features have different ranges, some much larger than others. Feeding such data to the network as-is can make training harder, so it is generally a good idea to normalise each feature first. Here, normalisation means rescaling each feature independently so that it has a mean of 0 and a standard deviation of 1.
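
A minimal sketch of this rescaling for a single feature is given below. The names fit_scaler and transform are made up for the example, and the numbers are arbitrary. One assumption worth stating: the mean and standard deviation are computed on the training data only and then reused for the test data, so that no information from the test set leaks into preprocessing.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Compute the mean and standard deviation from the training data only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid dividing by zero
    return mu, sigma


def transform(column, mu, sigma):
    """Rescale a feature using previously computed statistics."""
    return [(value - mu) / sigma for value in column]


train_feature = [10.0, 20.0, 30.0, 40.0]  # arbitrary example values
test_feature = [25.0, 50.0]

mu, sigma = fit_scaler(train_feature)
train_scaled = transform(train_feature, mu, sigma)
test_scaled = transform(test_feature, mu, sigma)  # reuse the training statistics
```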

Imputation: Sometimes a feature has missing values, which need to be handled before training. With neural networks it is generally safe to impute missing values with 0, provided 0 is not already a meaningful value for that feature. If you expect the test data to have missing values for a particular feature, the training data should contain such missing values too, so that the network can learn to ignore them.
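
The sketch below shows both ideas under simple assumptions: missing values are represented as None, rows are plain lists, and 0 is not a meaningful value for the feature. The helpers impute_zero and randomly_drop are hypothetical names; randomly_drop is one way to give the training data artificial missing entries when the original training set has none.

```python
import random


def impute_zero(rows):
    """Replace missing values (None) with 0.0."""
    return [[0.0 if value is None else value for value in row] for row in rows]


def randomly_drop(rows, column, drop_rate=0.2, seed=0):
    """Return copies of the rows where the given column is zeroed at random."""
    rng = random.Random(seed)
    augmented = []
    for row in rows:
        copy = list(row)
        if rng.random() < drop_rate:
            copy[column] = 0.0  # simulate a missing (imputed) value
        augmented.append(copy)
    return augmented


raw_rows = [[1.0, None], [None, 2.0], [3.0, 4.0]]
clean_rows = impute_zero(raw_rows)
augmented_rows = randomly_drop(clean_rows, column=1, drop_rate=0.5)
```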

Feature Engineering
Feature engineering is the process of using domain knowledge to apply transformations to the data that make learning easier (in this sense, that make patterns easier to find).
Good features let you solve problems more elegantly and with fewer resources.
Good features also let you solve a problem with much less data. Deep learning models rely on having plenty of training data to learn features on their own, so when few data points are available, good-quality features are critical.

Overfitting & Underfitting
Early in training, the training error and the test error are usually correlated: a lower training error corresponds to a lower test error. While both are still falling, the model is underfitting, and there is progress left to be made. After a certain number of epochs, however, the test error stops improving and starts to increase, even though the training error keeps falling. The model is now optimising for patterns specific to the training data and performs worse and worse on unseen data. This is when the model overfits. We want our models to generalise well to unseen data, so tuning for training error while neglecting test error isn't the best way to do training.
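
One simple way to act on this is to track the error on held-out data after every epoch and stop once it has not improved for a while. The sketch below assumes a list of per-epoch validation losses; the numbers are made up for illustration, and best_epoch and patience are names chosen for this example.

```python
def best_epoch(val_losses, patience=2):
    """Return the epoch (0-based) with the lowest loss, stopping after
    `patience` consecutive epochs without improvement."""
    best_index, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_index, best_loss = epoch, loss
        elif epoch - best_index >= patience:
            break  # the held-out loss has stopped improving
    return best_index, best_loss


# Made-up losses: they fall at first, then rise as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
stop_epoch, stop_loss = best_epoch(val_losses, patience=2)
```

The loss used for this decision should come from the validation set, not the test set, so that the test set stays untouched for the final check.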

Besides tuning the number of epochs, we can also tune the size of each layer. A larger model has more memorisation capacity and can identify more patterns, but it can also memorise the training data without generalising, while a smaller model has less capacity to memorise patterns / information from the input data. The idea is to find the right model size for the problem. Beyond that, we can also tune the number of layers and judge by the validation loss whether the model generalises well.

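Another way to mitigate overfitting is to regularise the model. Weight regularisation adds a penalty for large weights to the loss function, which pushes the weights in a layer towards small values. There are 2 types: L1 regularisation, where the penalty is proportional to the absolute values of the weights (the L1 norm), and L2 regularisation, where it is proportional to the squares of the weights (the squared L2 norm).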
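
The two penalties can be written out directly. In the sketch below, the weights, the regularisation rate of 0.01 and the data loss of 0.30 are all made-up values, and the function names are chosen for this example only.

```python
def l1_penalty(weights, rate):
    """L1 penalty: rate times the sum of absolute weight values."""
    return rate * sum(abs(w) for w in weights)


def l2_penalty(weights, rate):
    """L2 penalty: rate times the sum of squared weight values."""
    return rate * sum(w * w for w in weights)


weights = [0.5, -1.0, 2.0]  # made-up layer weights
data_loss = 0.30            # made-up loss on the training batch

# The penalty is added to the loss that training minimises.
total_loss_l1 = data_loss + l1_penalty(weights, rate=0.01)
total_loss_l2 = data_loss + l2_penalty(weights, rate=0.01)
```

Because the penalty grows with the size of the weights, minimising the total loss trades off fitting the training data against keeping the weights small.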

Another regularisation technique is dropout. Dropout, applied to a layer, randomly sets a number of the layer's output features to 0 during training. The dropout rate is the fraction of features that are zeroed. At test time, no features are dropped.
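
A plain-Python sketch of dropout at training time is shown below. It uses the "inverted" variant, which is an assumption of this example: the surviving features are scaled up by 1 / (1 - rate) during training so that nothing needs to be rescaled at test time. The function name dropout and the sample activations are made up, and the rate must be below 1.

```python
import random


def dropout(activations, rate, rng):
    """Zero each feature with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_probability = 1.0 - rate
    return [
        value / keep_probability if rng.random() >= rate else 0.0
        for value in activations
    ]


rng = random.Random(0)
layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]  # made-up output features of a layer
dropped = dropout(layer_output, rate=0.5, rng=rng)
```

Which features are zeroed changes on every call, so the layer cannot rely on any single feature being present.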


The machine learning workflow:
- Define the problem and collect the dataset. Identify the type of problem (regression, classification, etc.) and know the inputs and outputs.
- Choose a measure of success / performance metric. This could be RMSE for regression problems, or ROC AUC, precision / recall or cross-entropy loss for classification problems.
- Split your dataset with a train-test split or a k-fold split. Ensure that your splits are a good representation of the overall data, taking into account the distribution of the targets.
- Preprocess / prepare the data. This involves vectorisation and normalisation. Where necessary, imputation and feature engineering can be applied.
- Develop a baseline model. This could be a random guess or a simpler ML model (e.g. linear regression, random forest). This gives you a reference point that the neural network has to beat.
- Design the NN model and tune it. This is when you tune the number of layers, the size of the layers, and the number of training epochs. Here you can also use traditional ML methods such as feature selection. Monitor model performance using the metrics you defined in the earlier step.
- Finally, explore regularisation techniques to obtain the model that performs best on the validation set. Take care that, as tuning goes on, the model does not end up optimised only for the validation set and then fail to generalise to the test set.
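
The metric and baseline steps above can be sketched as follows. This is an illustration only: the labels are made up, the function names are chosen for this example, and the baseline is the simplest possible one, which always predicts the most common training label.

```python
import math
from collections import Counter


def rmse(y_true, y_pred):
    """Root mean squared error, a common metric for regression."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class of a classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def majority_class_baseline(train_labels, n_predictions):
    """Baseline that always predicts the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_predictions


train_labels = [0, 0, 0, 1, 1]  # made-up labels
val_labels = [0, 1, 0, 1]
baseline_pred = majority_class_baseline(train_labels, len(val_labels))
baseline_accuracy = sum(t == p for t, p in zip(val_labels, baseline_pred)) / len(val_labels)
baseline_precision, baseline_recall = precision_recall(val_labels, baseline_pred)
```

A baseline like this can reach a decent accuracy while never detecting the positive class, which is one reason to decide on the metric before comparing models.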

Once done, you can train the production model on all available data (training and validation) and then evaluate it one last time on the test set. This is when you want to check that the performance on the validation set and on the test set are close. Otherwise, you have probably been overfitting to the validation set.
