# Regularisation
## Introduction
- General topic: model complexity
- Regularisation helps a model generalise by penalising complexity, which reduces overfitting
  - an overfitting model is too complex for the data it was trained on
  - 'channel your inner Ockham': prefer the simpler model
    - instead of minimising the loss alone, minimise the loss plus a penalty on complexity:
    - minimise(Loss(Data, Model) + complexity(Model))
- Components of model complexity
  - Model complexity as a function of the weights of all the features in the model.
    - Under this view, a feature weight with a high absolute value contributes more complexity than a feature weight with a low absolute value.
  - Model complexity as a function of the total number of features with nonzero weights. (A later module covers this approach.)
- A good way to understand regularization is to look at how it changes the loss function.
  - In mathematical optimization and decision theory, a [loss function](https://en.wikipedia.org/wiki/Loss_function) or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event.
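
The objective minimise(Loss(Data, Model) + complexity(Model)) can be sketched for a tiny linear model. This is only an illustration under stated assumptions, not something taken from the sources: the squared-error loss, the L2 complexity term, the penalty strength `lam`, the learning rate and the toy data are all made up for the example.

```python
# Minimal sketch: gradient descent on Loss(Data, Model) + lam * complexity(Model).
# Assumptions: squared-error loss, L2 complexity, hand-made toy data.

def predict(weights, x):
    """Linear model without bias: dot product of weights and features."""
    return sum(w * xi for w, xi in zip(weights, x))

def data_loss(weights, xs, ys):
    """Mean squared error over the data set."""
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def complexity(weights):
    """L2 complexity: sum of squared weights."""
    return sum(w * w for w in weights)

def objective(weights, xs, ys, lam):
    """Regularised objective: data loss plus weighted complexity."""
    return data_loss(weights, xs, ys) + lam * complexity(weights)

def fit(xs, ys, lam, lr=0.05, steps=2000):
    """Plain gradient descent on the regularised objective."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    for _ in range(steps):
        grads = [0.0] * n_features
        for x, y in zip(xs, ys):
            err = predict(weights, x) - y
            for j in range(n_features):
                grads[j] += 2 * err * x[j] / len(xs)
        # The gradient of lam * sum(w^2) is 2 * lam * w.
        weights = [w - lr * (g + 2 * lam * w) for w, g in zip(weights, grads)]
    return weights

# Toy data generated by y = 3*x1 - 2*x2 (no noise).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [3.0, -2.0, 1.0, 4.0]

w_plain = fit(xs, ys, lam=0.0)  # loss only
w_reg = fit(xs, ys, lam=1.0)    # loss + complexity penalty

print("no penalty:", w_plain, "complexity:", complexity(w_plain))
print("L2 penalty:", w_reg, "complexity:", complexity(w_reg))
```

With `lam=0.0` the fit recovers the generating weights; with `lam=1.0` the weights are pulled towards zero, trading some data loss for lower complexity.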


## Types of regularization techniques
- L2 regularisation: penalises the sum of the squared weights
- L1 regularisation: penalises the sum of the absolute values of the weights
- Elastic Net: combines the L1 and L2 penalties
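
The three penalties listed above can be written out in a few lines. This is a minimal sketch: the function names and the two example weight vectors are illustrative, and the `l1`/`l2` coefficients are assumed to play the same role as the arguments of `regularizers.l1_l2(l1=..., l2=...)` in Keras.

```python
# Minimal sketch of the three penalty terms (names are illustrative).

def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1=0.01, l2=0.01):
    """Weighted combination of the L1 and L2 penalties."""
    return l1 * l1_penalty(weights) + l2 * l2_penalty(weights)

# Two weight vectors with the same total absolute weight.
concentrated = [1.0, 0.0, 0.0, 0.0]   # all weight in a single slot
spread = [0.25, 0.25, 0.25, 0.25]     # weight spread evenly

print(l1_penalty(concentrated), l1_penalty(spread))  # 1.0 1.0
print(l2_penalty(concentrated), l2_penalty(spread))  # 1.0 0.25
print(elastic_net_penalty(concentrated), elastic_net_penalty(spread))
```

L1 scores both vectors the same, while L2 prefers the spread-out vector, which is the behaviour the comparison in these notes describes.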

| L2   | L1   |
|------|------|
| penalty uses the squared weights | penalty uses the absolute values of the weights |
| punishes large weights disproportionately | penalises every unit of weight equally, whatever the weight's size |
| 'really wants small values in the whole matrix' | 'doesn't care if we put all the large value in a single slot in the matrix' |
| spreads the weight throughout the weight matrix | produces a sparse weight matrix (some weights exactly zero, some relatively large) |
| keeps all components of the weight matrix in play (weights shrink but rarely reach exactly 0) | encourages many of the uninformative coefficients in the model to be exactly 0 |
|    | can save RAM and may reduce noise in the model |
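
Why L1 gives exact zeros while L2 only shrinks can be seen on a single weight. The sketch below is an illustration under assumptions, not part of the sources: `a` stands for the value the weight would take without any penalty, the loss is assumed to be the quadratic `0.5 * (w - a)**2`, and `lam` and the example values are made up. Under those assumptions both penalised problems have simple closed-form minimisers.

```python
# Minimal sketch: closed-form minimisers for one weight under L1 and L2.
# Assumption: the unpenalised loss is 0.5 * (w - a)**2 for each weight.

def l1_shrink(a, lam):
    """Minimiser of 0.5*(w - a)**2 + lam*abs(w) (soft-thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set to exactly zero

def l2_shrink(a, lam):
    """Minimiser of 0.5*(w - a)**2 + 0.5*lam*w**2 (proportional shrinkage)."""
    return a / (1.0 + lam)

unregularised = [3.0, -0.4, 0.2, -2.5, 0.05]  # hypothetical weights
lam = 0.5

l1_weights = [l1_shrink(a, lam) for a in unregularised]
l2_weights = [l2_shrink(a, lam) for a in unregularised]

print("L1:", l1_weights)  # [2.5, 0.0, 0.0, -2.0, 0.0]
print("L2:", l2_weights)
print("zeros under L1:", sum(w == 0.0 for w in l1_weights))
print("zeros under L2:", sum(w == 0.0 for w in l2_weights))
```

L1 subtracts a fixed amount from every weight and clips at zero, so small weights vanish; L2 scales every weight by the same factor, so none reaches exactly zero.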

- _The former case is sufficient and indeed suitable for a variety of statistical problems, but the latter is gaining traction through the field of compressive sensing. From a non-rigorous standpoint, compressive sensing assumes not that observations come from Gaussian-distributed sources about ground truth but rather that sparse or simple solutions to equations are preferable or more likely (the "Occam's Razor" approach)._ - from Justin Solomon's answer on Quora (here "the former" is L2 and "the latter" is L1)
- _For example, consider a housing data set that covers not just California but the entire globe. Bucketing global latitude at the minute level (60 minutes per degree) gives about 10,000 dimensions in a sparse encoding; global longitude at the minute level gives about 20,000 dimensions. A feature cross of these two features would result in roughly 200,000,000 dimensions. Many of those 200,000,000 dimensions represent areas of such limited residence (for example, the middle of the ocean) that it would be difficult to use that data to generalize effectively. It would be silly to pay the RAM cost of storing these unneeded dimensions. Therefore, it would be nice to encourage the weights for the meaningless dimensions to drop to exactly 0, which would allow us to avoid paying for the storage cost of these model coefficients at inference time._ - a good illustration of the motivation for L1 (from Google's Machine Learning Crash Course)
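
The dimension counts in the housing example can be checked with a few lines of arithmetic; the quoted figures are rounded. The storage estimates are assumptions added for illustration: 4 bytes per weight (float32), a sparse format that stores a 4-byte index next to each nonzero value, and a hypothetical 1% of weights surviving L1.

```python
# Minimal sketch: arithmetic behind the latitude x longitude feature cross.
MINUTES_PER_DEGREE = 60

latitude_buckets = 180 * MINUTES_PER_DEGREE     # 10,800 ("about 10,000")
longitude_buckets = 360 * MINUTES_PER_DEGREE    # 21,600 ("about 20,000")
crossed = latitude_buckets * longitude_buckets  # 233,280,000 ("roughly 200,000,000")

# Assumption: one float32 weight (4 bytes) per crossed dimension.
BYTES_PER_WEIGHT = 4
dense_bytes = crossed * BYTES_PER_WEIGHT        # 933,120,000 bytes

# Hypothetical: L1 leaves 1% of the weights nonzero, each stored as
# a 4-byte index plus a 4-byte value.
nonzero_weights = crossed // 100
sparse_bytes = nonzero_weights * (4 + BYTES_PER_WEIGHT)

print(f"crossed dimensions: {crossed:,}")
print(f"dense storage:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse storage: {sparse_bytes / 1e6:.1f} MB")
```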

## L1 and L2 comparison 
### Note - show when it is useful to use L1 vs L2

Source for the Keras example below: https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/

```shell
pip install tensorflow
```


```python
import tensorflow.keras
from extra_keras_datasets import emnist
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import regularizers
import matplotlib.pyplot as plt

# Model configuration
img_width, img_height, num_channels = 28, 28, 1
input_shape = (img_height, img_width, num_channels)
batch_size = 250
no_epochs = 25
no_classes = 47
validation_split = 0.2
verbosity = 1

# Load EMNIST dataset
(input_train, target_train), (input_test, target_test) = emnist.load_data()

# Add number of channels to EMNIST data
input_train = input_train.reshape((len(input_train), img_height, img_width, num_channels))
input_test = input_test.reshape((len(input_test), img_height, img_width, num_channels))

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize data to the range [0, 1]
input_train = input_train / 255
input_test = input_test / 255

# Convert target vectors to categorical (one-hot) targets
target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes)
target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes)

# Create the model
# Note: activity_regularizer penalises each layer's outputs (activations);
# use kernel_regularizer instead to penalise the weights themselves.
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)))
model.add(Dense(no_classes, activation='softmax', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)))

# Compile the model
model.compile(loss=tensorflow.keras.losses.categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Fit data to model
history = model.fit(input_train, target_train,
                    batch_size=batch_size,
                    epochs=no_epochs,
                    verbose=verbosity,
                    validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

# Plot history: Loss
plt.plot(history.history['loss'], label='Training data')
plt.plot(history.history['val_loss'], label='Validation data')
plt.title('L1/L2 Activity Loss')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()

# Plot history: Accuracy (Keras reports it as a fraction between 0 and 1)
plt.plot(history.history['accuracy'], label='Training data')
plt.plot(history.history['val_accuracy'], label='Validation data')
plt.title('L1/L2 Activity Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
```

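Running the cell above failed with `ModuleNotFoundError: No module named 'extra_keras_datasets'`: the EMNIST loader comes from a separate package that has to be installed before the example can run (assumed here to be `pip install extra-keras-datasets`).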

## References
- https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization
- https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization
- https://www.pyimagesearch.com/2016/09/19/understanding-regularization-for-image-classification-and-machine-learning/
- https://towardsdatascience.com/only-numpy-implementing-different-combination-of-l1-norm-l2-norm-l1-regularization-and-14b01a9773b
- https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
- https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/


## Data Science Questions
- What is regularization? Why do we need it? 👶
- Which regularization techniques do you know? 👩‍🎓
- What kind of regularization techniques are applicable to linear models? 👩‍🎓
- What does L2 regularization look like in a linear model? 👩‍🎓
- How do we select the right regularization parameters? 👶
- What’s the effect of L2 regularization on the weights of a linear model? 👩‍🎓
- What does L1 regularization look like in a linear model? 👩‍🎓
- What’s the difference between L2 and L1 regularization? 👩‍🎓
- Can we have both L1 and L2 regularization components in a linear model? 👩‍🎓
- How do we interpret weights in linear models? 👩‍🎓
- If a weight for one variable is higher than for another - can we say that this variable is more important? 👩‍🎓
- Can we use L1 regularization for feature selection? 👩‍🎓
- Can we use L2 regularization for feature selection? 👩‍🎓