
gaoisbest/Machine-Learning-and-Deep-Learning-basic-concepts-and-sample-codes


Outline

Introduction

CS231n, Andrew Ng's new DL course, Andrew Ng's CS229, and Google's ML crash course all introduce basic concepts of ML and DL, so I combine them here. The material from huaxiaozhuan provides good tutorials on common machine learning algorithms.

Machine learning

1. Models

Model complexity: VC dimension

1.1 Support vector machine

1.1.1 Principle

  • Hyperplane: w^T * x + b = 0
  • Margin (i.e., distance between the positive and negative support vectors): 2 / ||w||.
  • Maximize the margin: max 2 / ||w|| s.t. y_i (w^T * x_i + b) >= 1 -> equivalently minimize: min 1/2 * ||w||^2 s.t. y_i (w^T * x_i + b) >= 1. This formula can be solved directly by quadratic programming (QP), but it can also be solved efficiently via Lagrange multipliers
  • Lagrange
  • L(w, b, a) = 1/2 * ||w||^2 + sum(a_i * (1 - y_i * (w^T * x_i + b))) [1]
  • Set the partial derivatives of L with respect to w and b equal to 0, then w = sum(a_i * y_i * x_i) and 0 = sum(a_i * y_i). Substituting back into [1] gives the dual: L = max sum(a_i) - 1/2 * sum(sum(a_i * a_j * y_i * y_j * x_i^T.dot(x_j))), s.t. sum(a_i * y_i) = 0 and a_i >= 0 [2], where a can be solved by Sequential Minimal Optimization (SMO) (select two variables a_i and a_j, fix the other a)
  • KKT: the solution of the primal problem == the solution of the dual problem

1.1.2 Kernel

  • Principle: map the low-dimensional, linearly non-separable space to a high-dimensional, linearly separable space using a mapping function phi(x)
  • However, computing x_i^T.dot(x_j) -> phi(x_i).dot(phi(x_j)) explicitly can be hard; a kernel performs the transformation and returns the inner product of the transformed vectors directly.
  • Categories
    • Linear kernel: x_i^Tx_j
    • Polynomial kernel: (x_i^Tx_j)^d
    • Gaussian (RBF) kernel: exp(- ||x_i-x_j||^2 / 2sigma^2)

1.1.3 Soft margin SVM

  • min 1/2 * ||w||^2 + C * sum(xi_i), s.t. y_i (w^T * x_i + b) >= 1 - xi_i and xi_i >= 0, where xi_i are slack variables
  • C is a regularization parameter
  • A small C allows constraints to be easily ignored → large margin
  • A large C makes constraints hard to ignore → narrow margin
  • Note: there is only one extra parameter, C

1.1.4 Hinge loss view

  • Definition: max(0, 1 - z), where z = y_i * (w^T * x_i + b)
  • SVM: min 1/2 * ||w||^2 + C * sum(max(0, 1 - y_i * (w^T * x_i + b)))
  • Relationship with NN
  • SGD with hinge loss
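
A minimal numpy sketch of the hinge-loss view: a linear SVM trained with per-sample SGD on the regularized hinge objective above. The toy data, learning rate, and epoch count are only illustrative.

import numpy as np

def svm_sgd(X, y, C=1.0, lr=0.01, epochs=100):
    # X: (n_samples, n_features), y: labels in {-1, +1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i].dot(w) + b)
            # subgradient of 1/2 * ||w||^2 + C * max(0, 1 - y_i * (w^T x_i + b))
            if margin < 1:
                w -= lr * (w - C * y[i] * X[i])
                b -= lr * (-C * y[i])
            else:
                w -= lr * w
    return w, b

X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, y)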

1.1.5 Multi-class SVM

1.1.6 Extensions

  • Support Vector Regression

    • Loss function: tube loss, max(0, |s - y| - epsilon)
  • When the prediction lies within epsilon of the target, the loss is zero

  • Ranking SVM

  • One-class SVM

1.2 Tree-based models

1.2.1 Decision tree

  • Principle
    • Recursive partitioning of the space with a greedy strategy (i.e., each partition is locally optimal); the decision boundaries are axis-parallel
    • Hyperparameter: max depth
    • Binary tree, easy to interpret (good for medical analysis)
    • Does not need standardization or normalization, and missing data has limited effect on the model
  • Categories
    • ID3
      • Criterion: maximize information gain. For one feature a with 10 distinct values, IG = Entropy(D) - sum(v in range(10): |D_v| / |D| * Entropy(D_v)) (a minimal sketch of these criteria appears at the end of this section)
      • Multi-branch tree
      • Drawback: prefers features that have many values
    • C4.5
      • Criterion: maximize information gain ratio = information gain / intrinsic value (feature)
      • Multi-branch tree
    • CART
      • Criterion: minimize the squared error for regression (min over splits of [min(sum((y_i - c_1)^2)) + min(sum((y_i - c_2)^2))], where c_1 and c_2 are the means of the two parts split by the feature value) and the Gini index for classification (1 - sum(p_k^2))
      • Binary tree: each split tests whether a feature takes a given value or not
      • Relationship between entropy and Gini index: take f(x) = -ln(x); its Taylor expansion at x = 1 gives f(x) ≈ 1 - x, so entropy = -sum(p_i * ln(p_i)) ≈ sum(p_i * (1 - p_i)) = Gini index
    • For regression, recursively partition by splitting on the feature value that minimizes the sum of squared errors (SSE) of the predictions. To avoid overfitting, a |tree size| penalty is added to the SSE
    • Multi-variate decision tree
      • Criterion: each node is a linear classifier instead of a univariate feature
  • Overfitting
    • Pre-pruning
      • prone to under-fitting
    • Post-pruning
      • For CART, add the |tree size| penalty to the SSE
  • Features
    • Continuous feature
      • Convert to a discrete feature via bi-partition (e.g., > 10 ?)
      • First sort the values, then use the midpoints of adjacent values as candidate split conditions
    • Categorical feature
      • Decision subsets. E.g., if X1 takes values ['A', 'B'], the candidate subsets are ['A'], ['B'], ['A', 'B']
  • Missing data
  • Split termination criterion
    • All y are the same: all samples belong to the same category
    • All X are the same: no features are left, or all samples have the same values on all remaining features, so no decision stump can be built
    • No samples available
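
A minimal numpy sketch of the split criteria above (entropy, information gain, Gini index), evaluated on a hypothetical label array and split mask.

import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y, mask):
    # mask: boolean array, True for samples going to the left child
    left, right = y[mask], y[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

y = np.array([0, 0, 1, 1, 1, 0])                               # hypothetical labels
mask = np.array([True, True, True, False, False, False])       # hypothetical split
print(information_gain(y, mask), gini(y))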

1.2.2 Random forest

  • Principle: resample the training samples with replacement (bootstrap), and resample the features at the same time
  • Hyperparameters: number of trees; number of features considered at each split: sqrt(# features)
  • Reduces overfitting and reduces variance
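
A minimal scikit-learn sketch; the dataset and hyperparameter values are only illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# n_estimators = number of trees; max_features='sqrt' samples sqrt(#features) per split
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
clf.fit(X, y)
print(clf.score(X, y))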

1.2.3 Gradient boosted decision trees

1.2.4 Tools

1.3 EM

Principle

  • Expectation Maximization performs MLE or MAP estimation in models with latent variables; it estimates the parameters iteratively
  • Two steps
    • E

      • The likelihood product(p(y|theta)) becomes the log-likelihood sum(ln p(y|theta)) = sum(ln sum_z p(y,z|theta)); the sum inside the ln means the derivative cannot be set to zero and solved directly
      • Q(theta, theta_i) = sum(p(z|y,theta_i) * ln p(y,z|theta)): the expectation of ln p(y,z|theta) with weights p(z|y,theta_i) (in a GMM, you can think of this as first choosing the cluster), where theta_i is known and theta is the parameter to be estimated
    • M

      • argmax_theta Q: find theta by setting the partial derivative of Q with respect to theta equal to 0
  • Applications
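
One common application is fitting a Gaussian mixture model; scikit-learn's GaussianMixture runs EM internally. A minimal sketch on toy 1-D data (the cluster parameters are only illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# two 1-D Gaussian clusters as toy data
data = np.concatenate([np.random.normal(0, 1, 300),
                       np.random.normal(5, 1, 300)]).reshape(-1, 1)
# E-step: compute p(z|y, theta_i); M-step: re-estimate means/covariances/weights
gmm = GaussianMixture(n_components=2, max_iter=100).fit(data)
print(gmm.means_.ravel(), gmm.weights_)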

1.4 MaxEnt

1.4.1 Entropy

  • Entropy
    • Information size: -logp(x)
  • Joint entropy
  • Conditional entropy
  • Mutual information
    • Measures the (possibly non-linear) relationship between two variables
    • Can be used for feature selection
  • Relative entropy (i.e., KL divergence)
    • Measure the difference between two distributions
    • D(p||q) != D(q||p)
    • D(p||q) >= 0
  • Cross entropy
    • Relationship with KL divergence: CE(p, q) = H(p) + D(p||q)
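
A minimal numpy check of the relationship CE(p, q) = H(p) + D(p||q) on two hypothetical discrete distributions:

import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log(p))        # entropy H(p)
D_pq = np.sum(p * np.log(p / q))     # KL divergence D(p||q) >= 0
CE   = -np.sum(p * np.log(q))        # cross entropy CE(p, q)

print(np.isclose(CE, H_p + D_pq))    # True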

1.5 Model selection

1.5.1 Under-fitting / Over-fitting

Instead of the learning curve, we usually look at the gap between training accuracy and testing accuracy to check whether the model is over-fitting.

  • Bagging
    • Random features and random samples
    • Reduces high variance
  • Boosting
    • Reduces high bias
    • Difference between GBDT and XGBoost: GBDT uses the first-order derivative of the loss, while XGBoost also uses the second-order derivative
  • Blending
    • Split data to first-step training data and second-step training data
    • Based on first-step training data, we can train different models [m_1, m_2, ..., m_n]
    • Based on the second-step training data, first get the predictions [y_1, y_2, ..., y_n] using [m_1, m_2, ..., m_n], then fit a new model (usually a simple one such as a linear model) on ([y_1, y_2, ..., y_n], true y)
  • Stacking
    • Basic idea of Stacking is same as Blending
    • Stacking uses cross-validation (the out-of-fold predictions are used to train the next layer), while Blending uses a holdout validation set (part of the training data is used in the first layer, part in the second); a minimal sketch of Blending follows this list
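
A minimal Blending sketch on toy scikit-learn data; the base models and the second-level linear model are only illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
# first-step / second-step split (holdout validation, as in Blending)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

base_models = [RandomForestClassifier(random_state=0), SVC(probability=True, random_state=0)]
for m in base_models:
    m.fit(X1, y1)                                         # train base models on first-step data

# base-model predictions on second-step data become the features of the blender
Z = np.column_stack([m.predict_proba(X2)[:, 1] for m in base_models])
blender = LogisticRegression().fit(Z, y2)                 # simple second-level model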

2. Feature engineering

2.1 Categories

  • Numerical
    • Scaling
      • MinMaxScaler, StandardScaler
    • Log
    • Statistics
      • The gap between the price and the average price
    • Discretizing (e.g., age)
      • pd.cut, pd.qcut
  • Categorical (enumerated)
    • Boolean (i.e., yes or no?)
      • For example, red, yellow and green are categorical values; an object that is both red and green can be represented as [1, 0, 1]
    • pd.get_dummies
    • Histogram
      • For example, given gender and hobby features, we can calculate the percentage of each hobby within the male and female categories respectively, and replace the hobby feature of each sample with this percentage
  • Time
    • Continuous
      • Duration (e.g., browsing duration)
      • Interval (e.g., the time between the last purchase and now)
    • Discrete
      • The quarter of the year (i.e., 0, 1, 2, 3)
      • The week of the year (i.e., 0, 1, ..., 52)
      • The day of the week (i.e., 0, 1, ..., 6)
      • The hour of the day (i.e., 0, 1, ..., 23)
      • Workday or weekend
  • Missing value
    • Additional boolean feature
    • For example, some samples do not have an age value, so an additional feature is_age_defined is added.
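
A minimal pandas sketch of the treatments above (scaling, log transform, discretizing, one-hot encoding, missing-value flag); the column names and bins are hypothetical.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'age': [23, 45, np.nan, 67],
                   'color': ['red', 'green', 'red', 'yellow'],
                   'price': [10.0, 200.0, 55.0, 3.0]})

df['price_scaled'] = MinMaxScaler().fit_transform(df[['price']])   # scaling
df['price_log'] = np.log1p(df['price'])                            # log transform
df['age_bucket'] = pd.cut(df['age'], bins=[0, 30, 60, 100])        # discretizing
df = pd.concat([df, pd.get_dummies(df['color'])], axis=1)          # one-hot encoding
df['is_age_defined'] = df['age'].notnull().astype(int)             # missing-value flag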

2.2 Feature selection

  • Univariate
    • Selecting the best features based on univariate statistical tests
    • For example, chi2, Pearson, mutual information
  • Recursive feature elimination
    • Selecting features by recursively considering smaller and smaller sets of features
  • SelectFromModel
    • Lasso with L1 regularization
    • Tree-based
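
Minimal scikit-learn sketches of the three approaches above; the dataset and parameter values are only illustrative.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = abs(X)   # chi2 requires non-negative features

X_uni   = SelectKBest(chi2, k=5).fit_transform(X, y)                                          # univariate
X_rfe   = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)  # recursive elimination
X_lasso = SelectFromModel(Lasso(alpha=0.01)).fit_transform(X, y)                              # L1-based
X_tree  = SelectFromModel(RandomForestClassifier(random_state=0)).fit_transform(X, y)         # tree-based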

2.3 Tools

3. Loss function

3.1 Categories

  • Mean square error (MSE)
    • Linear regression
    • L2 loss
  • Logarithmic loss
    • Logistic regression.
    • L(Y, f(x)) = -log f(x). The fast.ai wiki gives a detailed explanation of log loss.
  • Cross entropy
    • Closely related to KL divergence (relative entropy): minimizing the cross entropy is equivalent to minimizing D(p||q) when H(p) is fixed
  • Max-margin loss
    • J = max(0, 1 + s - s_c), where s_c is the score of the correct class and s the score of a competing class
  • Exponential loss
    • Adaboost
  • Hinge loss
    • SVM
  • Cosine similarity
    • Cosine curve (degree=0, cos_value=1; degree=90, cos_value=0; degree=180, cos_value=-1)

3.2 Empirical risk minimization (ERM) and Structural risk minimization (SRM)

ERM = minimize the loss; it may lead to over-fitting. For example, maximum likelihood estimation (MLE).
SRM = ERM + regularization. For example, maximum a posteriori (MAP) estimation.

3.2.1 MLE

  • Choose the value that maximizes the probability of the observed data, i.e., argmax_theta P(D|theta)

3.2.2 MAP

  • Choose the value that is most probable given the observed data and a prior belief, i.e., argmax_theta P(theta|D) = argmax_theta P(D|theta) * P(theta)
  • MLE is a special case of MAP where the prior is uniform

4. Evaluation metrics

4.1 Accuracy, precision, recall and F1

Consider predicting the gender of a user (i.e., male or female), a classical binary classification problem. Let predicted 1 (i.e., positive) indicate male and predicted 0 (i.e., negative) indicate female.

Confusion matrix:

|               | Predicted positive | Predicted negative |
| ------------- | ------------------ | ------------------ |
| Real positive | TP                 | FN                 |
| Real negative | FP                 | TN                 |

Multi-class confusion matrix: the element at row i and column j counts samples whose true class is i that are classified as class j.

Accuracy = (TP + TN) / (TP + FP + FN + TN); it suffers from the class imbalance problem.
Error = (FP + FN) / (TP + FP + FN + TN) = 1 - accuracy

Precision = TP / (TP + FP): what proportion of the predicted positives is correct?
Recall = TP / (TP + FN): what proportion of all positive observations is picked up?
F1 = 2 * Precision * Recall / (Precision + Recall).
Increasing the classification threshold generally increases precision and decreases recall; decreasing it does the opposite.

False Positive Rate = FP / (FP + TN)
True Positive Rate = TP / (TP + FN), i.e., Recall.

Which model is better?
Model 1:

|        | Predicted 1 | Predicted 0 |
| ------ | ----------- | ----------- |
| Real 1 | 4           | 2           |
| Real 0 | 3           | 1           |

Model 2:

|        | Predicted 1 | Predicted 0 |
| ------ | ----------- | ----------- |
| Real 1 | 6           | 0           |
| Real 0 | 4           | 0           |

Model 1: accuracy = (4 + 1) / 10 = 50%.
Model 2: accuracy = (6 + 0) / 10 = 60%.
Model 2 has higher accuracy than Model 1, but Model 2 is useless: it predicts every sample as positive. This is called the accuracy paradox, which means the model with higher accuracy may not have better generalization power.
In general, when TP < FP, the accuracy will always increase when we change the classifier to always output 'negative'. Conversely, when TN < FN, the same happens when we change the classifier to always output 'positive' [1].
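
A minimal scikit-learn check of these metrics on labels reconstructed from Model 1's confusion matrix:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# labels reconstructed from Model 1's table: TP=4, FN=2, FP=3, TN=1
y_true = np.array([1]*6 + [0]*4)
y_pred = np.array([1]*4 + [0]*2 + [1]*3 + [0]*1)

print(accuracy_score(y_true, y_pred))    # (4+1)/10 = 0.5
print(precision_score(y_true, y_pred))   # 4/(4+3) ≈ 0.571
print(recall_score(y_true, y_pred))      # 4/(4+2) ≈ 0.667
print(f1_score(y_true, y_pred))          # ≈ 0.615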

Recall@k

4.2 ROC, AUC

ROC (Receiver Operating Characteristic) curve: a curve of the true positive rate vs. the false positive rate at different classification thresholds. The x-axis is the false positive rate and the y-axis is the true positive rate [3].

The closer the curve is to the upper-left point (TPR = 1.0, FPR = 0.0), the better the model. On the diagonal line, TPR = FPR, which corresponds to random guessing.

AUC (Area under the ROC curve): aggregate measure of performance across all possible classification thresholds.
One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. [3]
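
A minimal scikit-learn sketch on hypothetical labels and scores:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])           # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # area under the curve
print(fpr, tpr, auc)                                # auc = 0.75 here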

4.3 BLEU

References:
[1] https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/
[2] https://www.zhihu.com/question/30643044
[3] https://developers.google.cn/machine-learning/crash-course/classification/true-false-positive-negative

5. Sample projects

6. Classical questions

6.1 Why is logistic regression a generalized linear model ?

Suppose that the logistic regression, p = sigmoid(z):

  • The input z, i.e., z = WX + b, is a linear function of x.
  • Log-odds, i.e., log (p/(1-p)) = WX, is a linear function of parameters W.
  • Decision boundary, i.e., WX = 0, is linear. This does not mean LR can only handle linear-separable data. Actually, we can convert low-dimensional data to high-dimensional data with the help of feature mapping. And the data are linear-separable in the high-dimensional space.
  • Both logistic regression and softmax regression can be modeled by exponential family distribution.

What's the difference between linear and non-linear regression ?

  • It is wrong to say that linear regression models produce straight lines and nonlinear regression models produce curvature. Actually, both linear and non-linear models can fit curves.
  • Linear means linear in the parameters W, not in X. For example, both f_1 = w_1*x_1 + w_2*x_2 + b and f_2 = w_1*x_1 + w_2*x_2 + w_3*x_2^{2} + b are linear models. The most common operation is adding polynomial terms (i.e., quadratic or cubic terms) to a linear model.
  • Example of a non-linear function: f_3 = w_1 * x^{w_2}
  • A log transform can convert the nonlinear function y = e^{b} * x_1^{w_1} * x_2^{w_2} into the linear function ln y = b + w_1 * ln x_1 + w_2 * ln x_2

The influence of the classification threshold

Changing the classification threshold trades precision against recall; the values below come from the corresponding figures (omitted here):

  • Precision = 0.8, Recall = 0.73 (original threshold).
  • Precision = 0.88, Recall = 0.64 (threshold raised).
  • Precision = 0.75, Recall = 0.82 (threshold lowered).

References:
https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
https://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier
https://stats.stackexchange.com/questions/88603/why-is-logistic-regression-a-linear-model
http://cs229.stanford.edu/notes/cs229-notes1.pdf
http://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/
http://statisticsbyjim.com/regression/curve-fitting-linear-nonlinear-regression/
https://developers.google.cn/machine-learning/crash-course/classification/precision-and-recall
Logistic regression application in Meituan:   https://tech.meituan.com/intro_to_logistic_regression.html

6.2 How to prepare train, dev and test dataset ?

  • Principle: make sure that the distributions of the validation set and the test set are the same.
  • The train set and the dev/test sets may come from slightly different distributions.
  • A learning curve shows the training and validation score as the number of training examples increases.

6.3 Error analysis

  • Manually check, say, 100 mis-classified validation samples, and count their error types.

Deep learning

1. Optimization algorithms

1.1 Exponential weighted averages

  • All advanced optimization algorithms are based on exponentially weighted averages (V_t = beta * V_{t-1} + (1 - beta) * theta_t), which approximately average over the last 1/(1 - beta) steps. An example: V_0 = 0, V_1 = 0.9 * V_0 + 0.1 * theta_1, V_2 = 0.9 * V_1 + 0.1 * theta_2, ..., therefore V_2 = 0.1 * theta_2 + (0.1 * 0.9) * theta_1; the weights of past theta values decay exponentially.
  • At early steps, bias correction is used, i.e., V_t = V_t / (1 - beta^t). When t is large, 1 - beta^t approaches 1 and the correction no longer has an effect.
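
A minimal numpy sketch of the exponentially weighted average with optional bias correction:

import numpy as np

def exponentially_weighted_average(theta, beta=0.9, correct_bias=True):
    # theta: sequence of observations; averages roughly the last 1/(1-beta) values
    v, out = 0.0, []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if correct_bias else v)
    return np.array(out)

print(exponentially_weighted_average([1.0, 2.0, 3.0, 4.0]))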

1.2 Momentum

  • V_dw = beta * V_dw + (1-beta) * dw
  • w = w - alpha * V_dw

1.3 RMSprop (Root mean square prop)

  • S_dw = beta * S_dw + (1-beta) * dw ** 2
  • w = w - alpha * dw / [S_dw^(1/2) + epsilon], where ** 2 denotes element-wise squaring and epsilon = 1e-8.

1.4 Adam (Adaptive moment estimation)

  • V_dw = beta_1 * V_dw + (1-beta_1) * dw; S_dw = beta_2 * S_dw + (1-beta_2) * dw ** 2;
  • V_dw_corr = V_dw / (1-beta_1^t); S_dw_corr = S_dw / (1-beta_2^t)
  • w = w - alpha * V_dw_corr / [S_dw_corr^(1/2) + epsilon]
  • beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8
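
A minimal numpy sketch of a single Adam update, following the formulas above (the toy gradient is only illustrative):

import numpy as np

def adam_step(w, dw, V_dw, S_dw, t, alpha=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    V_dw = beta_1 * V_dw + (1 - beta_1) * dw            # momentum term
    S_dw = beta_2 * S_dw + (1 - beta_2) * dw ** 2       # RMSprop term
    V_corr = V_dw / (1 - beta_1 ** t)                   # bias correction
    S_corr = S_dw / (1 - beta_2 ** t)
    w = w - alpha * V_corr / (np.sqrt(S_corr) + epsilon)
    return w, V_dw, S_dw

# toy usage on a single parameter vector and gradient
w, V, S = np.zeros(3), np.zeros(3), np.zeros(3)
w, V, S = adam_step(w, np.array([0.1, -0.2, 0.3]), V, S, t=1)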

1.5 Learning rate decay

  • Decay the learning rate after each epoch
  • learning_rate / (1.0 + num_epoch * decay_rate)
  • Exponential decay: learning_rate * 0.95^num_epoch

1.6 Saddle points

  • The first-order derivatives are zero.
  • Along one dimension the saddle point is a local maximum, while along another dimension it is a local minimum.

2. Exploding/vanishing gradients

2.1 Vanishing gradients [1, 3]:

  • Result: earlier layers converge more slowly than later layers.
  • Reason: sigmoid and tanh activations suffer from vanishing gradients; the ReLU activation does not have this problem.
  • Solutions:
    • Activation function: ReLU or Leaky ReLU. ReLU units can die (caused by, e.g., a large learning rate or a large negative bias), in which case both their outputs and gradients are zero. Leaky ReLU solves this problem. Variants of Leaky ReLU are the randomized leaky ReLU (RReLU) and the parametric leaky ReLU (PReLU); the exponential linear unit (ELU) is another alternative.
      ELU > Leaky ReLU > ReLU > tanh > sigmoid [5].
    • Weights initialization, e.g., Xavier initialization (two goals: the output variance of each layer equals the input variance; the gradient variance is the same before and after flowing through a layer) for sigmoid and tanh, and He initialization for ReLU and Leaky ReLU.
    • Batch Normalization. Address vanishing or exploding gradients problem during training [6].
  • Implementation [5]
# xavier
tf.contrib.layers.fully_connected(inputs, num_outputs, weights_initializer=tf.contrib.layers.xavier_initializer())

# He
he_init = tf.contrib.layers.variance_scaling_initializer() # default is He initialization that only consider fan-in
tf.contrib.layers.fully_connected(inputs, num_outputs, weights_initializer=he_init)

# elu
tf.contrib.layers.fully_connected(inputs, num_outputs, activation_fn=tf.nn.elu)

# leaky relu
def leaky_relu(z, name=None):
    return tf.maximum(0.01*z, z, name=name)
tf.contrib.layers.fully_connected(inputs, num_outputs, activation_fn=leaky_relu)

2.2 Exploding gradients [2, 3]:

  • Result: gradients and cost become NaN.
  • Reason: large weights multiplied with activation derivatives during back propagation. It particularly occurs in RNNs.
  • Solution: gradients clipping.
  • Implementation:
# pseudo code
if norm(gradients) > max_gradients_norm:
    gradients *= max_gradients_norm/norm(gradients)

# real code
variables = tf.trainable_variables()
gradients = tf.gradients(ys=cost, xs=variables)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_norm)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
optimize = optimizer.apply_gradients(grads_and_vars=zip(clipped_gradients, variables), global_step=self.global_step)

2.3 Why LSTM resistant to exploding and vanishing gradients?

  • If the forget gate is on and the input and output gates are off, it just passes the memory cell gradients through unmodified at each time step [4].
  • CEC (Constant Error Carrousel) mechanism with gate [7].

References:
[1] https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
[2] https://www.quora.com/Why-is-it-a-problem-to-have-exploding-gradients-in-a-neural-net-especially-in-an-RNN
[3] http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
[4] http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15%20Exploding%20and%20Vanishing%20Gradients.pdf
[5] Hands on machine learning with Scikit-Learn and TensorFlow p278, P281
[6] https://www.zhihu.com/question/38102762
[7] https://www.zhihu.com/question/34878706

3. Batch Normalization

  • Covariate shift means the data distribution changes; BN reduces the shift of the distributions of the neurons' inputs. No matter how z1 and z2 change, BN keeps their mean and std the same.
  • The dimension of beta and gamma is (n^l, 1), because BN scales the mean and variance of the input Z of each neuron.
  • He initialization and ELU reduce vanishing gradients at the beginning of training; BN addresses vanishing gradients during training.
  • Batch means evaluating the mean and standard deviation of the inputs over the current mini-batch.
  • BN reduces the vanishing gradients problem, is less sensitive to weight initialization, and reduces the need for other regularization techniques (such as dropout).
  • At test time, use the whole training set's mean and standard deviation, i.e., exponentially weighted averages of each mini-batch's mean and standard deviation.
  • Set bias=False in the preceding layer, since BN's beta already plays the role of the bias.
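
A minimal numpy sketch of the training-time BN forward pass; at test time, running (exponentially weighted) averages would replace the batch statistics.

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    # Z: (n_units, batch_size); gamma, beta: (n_units, 1)
    mu = Z.mean(axis=1, keepdims=True)          # mean over the current mini-batch
    var = Z.var(axis=1, keepdims=True)          # variance over the current mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta                # scale and shift

Z = np.random.randn(4, 32)                      # 4 units, batch of 32
out = batch_norm_forward(Z, gamma=np.ones((4, 1)), beta=np.zeros((4, 1)))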

4. Nerual Networks

4.1 Definition

  • Neural networks essentially learn feature crosses automatically, which is necessary for solving non-linear problems.
  • Each neuron computes f(weighted sum of the previous layer's outputs), where f is a non-linear function.

4.2 Back propagation

Chain rule with memorization.

4.3 How to choose the number of hidden layers ?

A DNN extracts features layer by layer and has a hierarchical architecture. For many problems, one or two hidden layers will work fine. For complex problems, you can gradually increase the number of hidden layers until overfitting occurs.

4.4 How to set the number of neurons per hidden layer ?

A common strategy is to size the layers to form a funnel (i.e., deeper layers have fewer neurons). The analogy is that many low-level features coalesce into fewer high-level features.
A simple approach is to pick a complex model and use early stopping to prevent overfitting.

References:
[1] Hands on machine learning with Scikit-Learn and TensorFlow p271

4.5 How does batch size influence training speed and model accuracy ?

Batch gradient descent

  • slow
  • may converge to local minimum, and yield worse performance
  • suffer from GPU memory limitation

Stochastic gradient descent

  • fast
  • not stable

Mini-batch gradient descent

  • as fast as SGD (matrix operations on GPU)
  • escapes local minima and is more stable

How to set mini-batch size

  • If the whole training set has fewer than 2,000 samples, use the whole training set
  • Otherwise, 64, 128, 256, or 512 are common choices

4.6 XOR solution

One hidden layer with two neurons; the activation in the original figure (omitted here) is a step function.
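
A minimal sketch of one such solution; the specific weights below are one possible (hypothetical) assignment, where the first hidden neuron acts as OR and the second as AND.

def step(z):
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)       # OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)       # AND(x1, x2)
    return step(h1 - h2 - 0.5)     # OR and not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))   # 0, 1, 1, 0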

5. Tips of training DL models

5.1 Validate the correctness of public repository

Feed in 10 training samples, turn off dropout and L2 regularization, and predict on the same 10 samples as the test set. If the cost is approximately 0, the code probably has no mistakes.

5.2 Hyperparameter tuning

  • Don't use grid search (it is only useful when the number of parameters is small); try random values. Say we tune parameters w1 and w2: grid search will loop over w1 = 5 with w2 = 1, 2, 3, 4, 5, so all five models share the same w1. With random values, the five models can have different w1
  • Coarse to fine strategy
  • Log scale for learning rate
    • Step 1: suppose that we want to choose the learning rate from the range [0.0001, 1]
    • Step 2: take the log scale (base 10); the range becomes [-4, 0]
    • Step 3: uniformly sample r = -4 * np.random.rand(), then alpha = 10^r
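
A minimal sketch of the log-scale sampling:

import numpy as np

# sample 5 learning rates uniformly on a log scale between 1e-4 and 1
for _ in range(5):
    r = -4 * np.random.rand()     # r uniform in [-4, 0]
    alpha = 10 ** r               # alpha in [1e-4, 1]
    print(alpha)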

5.3 Learning rate decay

5.4 Update parameters by moving average

At each step, update the parameters by their moving average.
References:
[1] Use it in tensorflow
[2] example

5.5 Dynamic batching

6. Regularization

6.1 L1 and L2

Why are the shapes of the L1 norm ball and the L2 norm (aka weight decay) ball diamond-like and circle-like, respectively?

See reference [1].

Why does L1 lead to sparse weights and L2 lead to small distributed weights?  

L0 norm is the number of non-zero elements, which has sparse property.
L1 norm is the best convex approximation to L0 norm. We can view L1 norm as a compromise between the L0 and L2 norms, inheriting the sparsity-inducing property from the former and convexity from the latter [2].
The L1 norm despite being convex, is not everywhere differentiable (unlike the L2 norm) [2], see picture from [3].

Traditional gradient descent cannot be used directly to optimize L1-regularized models, therefore more advanced techniques like proximal gradient methods or primal-dual methods like ADMM are required [2].
L1 regularization helps perform feature selection.
Since squaring a number punishes large values more than it punishes small values [2], L2 regularization leads to small distributed weights.

L1 or L2: which one is preferred in which scenario?

Both are used to reduce over-fitting.
In Bayesian view:
L1 regularization is equivalent to MAP estimation using Laplacian prior.
L2 regularization is equivalent to MAP estimation using Gaussian prior.
Always try L2 regularization first, since it will give you the best result [2].

L2 regularization Implementation

  • forward propagation computes cost
    # suppose a single-hidden-layer neural network
    # W1 and W2 are the parameters of the input layer and the hidden layer
    # since we add the regularization term, the cost can no longer reach zero
    L2_regularization_cost = 1.0 / m * lambd / 2.0 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    cost = cross_entropy_cost + L2_regularization_cost
    
  • back propagation computes gradients
    dZ2 = A2 - Y # the cross_entropy loss
    dW2 = 1./m * np.dot(dZ2, A1.T) + W2 * lambd / m
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0)) # relu activation
    dW1 = 1./m * np.dot(dZ1, X.T) + W1 * lambd / m
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    

L1 regularization Implementation

Lasso (linear regression with L1 regularization) and ridge regression (linear regression with L2 regularization).
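
A minimal scikit-learn sketch contrasting the two; the dataset and alpha values are only illustrative.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: many coefficients driven exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: small, spread-out coefficients

print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())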


Elastic net (combine L1 and L2)

# from http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

References:
[1] https://www.quora.com/Why-does-an-L1-norm-unit-ball-have-diamond-shaped-geometry#!n=12
[2] https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
[3] https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
[4] https://feature.engineering/regularization/

6.2 Dropout

  • Intuition: can't rely on any one feature.
  • At test time, do not use dropout.
  • Implementation:
d1 = np.random.rand(shape[0], shape[1]) < keep_prob
a1 = np.multiply(a1, d1)
a1 /= keep_prob # inverted dropout

6.3 Other techniques

  • Data augmentation: horizontally flip
  • Early stopping

7. CNNs

7.1 Convolution

  • Parameter sharing: a feature detector that is useful in one part of the image is probably useful in another part of the image
  • Sparsity of connections
  • Translation invariance

7.1.1 Filter size

  • Usually odd, e.g., 3*3 or 5*5, so that the filter has a central point
  • Filter kernel: width * height * number of input channels
  • Each filter has a bias
  • After convolution, apply a non-linearity, i.e., relu(convolution + bias)

7.1.2 Padding

  • Role
    • Prevent the image size from **shrinking** after convolution
    • Put more emphasis on the corner pixels (i.e., top-left, top-right, bottom-left, bottom-right); otherwise each corner pixel is covered by only one convolution window
  • Types
    • Valid: No padding
    • Same: output size is the same as the input size, p = (filter_size - 1) / 2

7.1.3 Stride

  • stride=2 means a stride of 2 in both the horizontal and vertical directions
  • Image size after convolution, padding and stride
    • Horizontal direction: floor((horizontal image size + 2 * padding - filter_size) / stride) + 1
    • Vertical direction: floor((vertical image size + 2 * padding - filter_size) / stride) + 1
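
A minimal helper implementing the output-size formula above:

import math

def conv_output_size(n, f, padding=0, stride=1):
    # n: input size along one direction, f: filter size
    return math.floor((n + 2 * padding - f) / stride) + 1

print(conv_output_size(7, 3, padding=0, stride=2))   # 3
print(conv_output_size(6, 3, padding=1, stride=1))   # 6 ('same' padding: p = (f - 1) / 2)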

7.2 Pooling

  • Hyperparameters: filter size and stride
  • A pooling layer has hyperparameters but no learnable parameters
  • Does not change the number of channels

7.3 AlexNet

  • Local Response Normalization (LRN): normalizes across channels

7.4 VGG16

  • Fixed convolution filter size (3 * 3, stride=1) and pooling size (2 * 2, stride=2)
  • VGG16 has almost the same performance as VGG19

7.5 ResNet

  • Residual block: a^(l+2) = g(z^(l+2) + a^(l))
  • In the extreme scenario with strong regularization, z^(l+2) equals 0 and a^(l+2) = g(0 + a^(l)); it turns out that the identity function is easy for a residual block to learn.
  • This helps mitigate the vanishing gradients problem, which is part of why ResNet works.

7.6 GoogLeNet

7.6.1 Inception module

  • Parallel branches: 1 * 1 CONV; 1 * 1 CONV -> 3 * 3 CONV; 1 * 1 CONV -> 5 * 5 CONV; MAXPOOL 3 * 3 (stride=1, same) -> 1 * 1 CONV. The branch outputs are concatenated along the channel dimension.

7.6.2 1 * 1 convolution

  • Works as a bottleneck layer to reduce the computational cost

8. RNNs

8.1 LSTM

See colah's great post on Understanding LSTM Networks.

8.2 Transfering learning and word embeddings

    1. Learn word embeddings from a large text corpus (or download pre-trained embeddings online).
    2. Transfer the embeddings to a new task with a smaller training set.
    3. Optional: continue to fine-tune the word embeddings with the new data.

9. Learning Types

9.1 Transfer learning

  • If your data size is small, just retrain the last layer; if your data size is large, you can retrain the last two or three layers, or all layers
  • Task A and task B share the same low-level features, so transfer learning works
  • You can preprocess the input images with the frozen layers and save the results to disk.
  • Data augmentation methods
    • Mirroring
    • Random cropping
    • Color shifting: PCA

9.2 Multi-task learning

  • A set of tasks share the same low level features

9.3 End-to-end learning

  • Needs large amount of data

10. Auto-Encoder

  • Data denoising
  • Dimension reduction
  • Usage: in fraud detection, use an Auto-Encoder to compress all data into dense vectors, then kNN is used to detect outliers

Reinforcement Learning

Definitions

  • Reinforcement learning (RL) is an area of machine learning that focuses on how an agent should act in an environment in order to maximize a given reward

  • Markov Decision Processes (MDPs)

    • Components
      • Agent: decision maker
      • Environment
      • State
      • Action
      • Reward
    • Sequential process
      • At each time step, given the environment's state, the agent selects an action to take. The environment then transitions into a new state, and the agent receives a reward as a consequence of the previous action
    • Goal
      • The goal of an agent in an MDP is to maximize its cumulative rewards (i.e., expected discounted (i.e., care more about the immediate reward over future rewards, discount rate γ) return of rewards)
  • Policy and Value function

    • Policy
      • A function that maps a given state to probabilities of selecting each possible action from that state, let symbol π denotes the policy
    • Value
      • State-value function v_π: gives us the value of a state under π
      • Action-value function q_π: gives us the value of an action under π. q_π is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. The letter 'Q' is used to represent the quality of taking a given action in a given state
    • RL Goal
      • Reinforcement learning algorithms seek to find a policy (i.e., optimal policy) that will yield more return to the agent than all other policies
  • Bellman optimality equation

    • For any state-action pair (s, a) at time t, the expected return is R_(t+1) (i.e., the expected reward from taking action a in state s) plus the maximum expected discounted return that can be achieved from any possible next state-action pair.
  • Q-function: takes a state-action pair as input and outputs the Q-value. The letter "Q" represents the quality of taking a given action in a given state.

Q-learning

  • It is used for learning the optimal policy by learning the optimal Q-values for each state-action pair in a Markov Decision Process
  • Q-table: store the Q-values for each state-action pair, the dimension is (#states * #actions)
    • Steps
      • Initialize all Q-values in the Q-table to 0
      • For each time-step in each episode:
        • Choose an action (using the exploration-exploitation trade-off, i.e., epsilon greedy strategy)
          • if random_num > epsilon: choose action via exploitation: choose the action with the highest Q-value for its current state
          • else: choose action via exploration: randomly choosing action
        • Update the Q-value: Q(s, a) := (1 - α) * Q(s, a) + α * (R_(t+1) + γ * max_a' Q(s', a'))
          • where α is the learning rate and γ is the discount rate
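
A minimal sketch of the epsilon-greedy action choice and the Q-value update above; the state/action space sizes and hyperparameters are hypothetical, and the environment interaction loop is omitted.

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))                 # Q-table initialized to 0
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def choose_action(state):
    # exploration-exploitation trade-off (epsilon greedy)
    if np.random.rand() > epsilon:
        return np.argmax(Q[state])                  # exploitation
    return np.random.randint(n_actions)             # exploration

def q_update(state, action, reward, next_state):
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target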

DQN

  • Use a neural network to approximate the Q-function
  • Input the state, output the Q-values for each action that can be taken from that state; the loss is the gap between the predicted Q-values (from the policy network) and the target Q-values (from the target network) given by the Bellman equation
  • Experience replay: store the agent's experiences at each time step in a replay memory data set, which keeps the last N experiences. At time t, the agent's experience is the tuple e_t = (s_t, a_t, r_(t+1), s_(t+1))
  • Train the network on randomly chosen samples from the replay memory to break the correlation between consecutive samples
  • Target network
    • A clone of the policy network; its weights are frozen copies of the policy network's weights, and every x time steps the policy network's weights are copied to the target network
    • The goal of the target network is to provide the value of the max term in the Bellman equation when calculating the target Q-value

Policy gradient

Materials

References
