# Machine Learning

## 7️⃣ Overfitting and Regularization

### Overfitting

Overfitting is the production of an analysis that corresponds **too closely or exactly** to a particular set of data, and may therefore **fail to fit additional data** or **predict future observations reliably**.
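
A minimal sketch of this behavior, assuming a synthetic noisy sine dataset and NumPy polynomial fits (neither comes from the original notes). A higher-degree polynomial always matches the training points at least as closely as a lower-degree one. Whether it also predicts new points worse depends on the data, so the code prints both errors instead of claiming a fixed result.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a noisy sine curve on [-1, 1]
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # "additional data" the model never saw

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}  train MSE={errors[degree][0]:.4f}  "
          f"test MSE={errors[degree][1]:.4f}")
```

A large gap between a low train error and a high test error is the usual symptom of overfitting.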

### Cross Validation

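**Cross validation** is one way to detect overfitting, and so helps to prevent it.
Cross validation evaluates how well a model has learned by dividing the data into **Train data**, **Validation data**, and **Test data**: the model is fitted on the train data, checked on the validation data during development, and evaluated on the test data at the end.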
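
As an illustration of such a three-way split, here is a minimal sketch that shuffles row indices and cuts them 60/20/20. The ratios, the seed, and the helper name `three_way_split` are assumptions made for this example.

```python
import numpy as np

def three_way_split(n_samples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Return shuffled index arrays for train, validation, and test data."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_ratio)
    n_val = int(n_samples * val_ratio)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

The three index arrays do not overlap, so no sample is used both for fitting and for evaluation.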

#### K-fold cross validation

K-fold cross validation divides the data into k (nearly) equal parts and trains the model k times, each time holding out a different part for validation.

<img src="./k-fold.png" width="300"/>

1. Choose K and divide the dataset into K pieces.
2. Use one of the K pieces as validation data and the other K-1 pieces as train data. Repeat until every piece has been the validation data once.
3. The average performance of the K models is taken as the performance of the model.
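
Steps 1 and 2 can be sketched with plain NumPy to show what happens to the indices. This is an illustration under assumptions: the helper `kfold_indices` is hypothetical, the folds are contiguous and unshuffled, and 404 samples are used only because that count does not divide evenly by 5. Model fitting and the averaging of step 3 are left out.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Step 1: divide the sample indices into k nearly equal pieces
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        # Step 2: piece i is validation data, the other k-1 pieces are train data
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

sizes = [(len(tr), len(va)) for tr, va in kfold_indices(404, 5)]
print(sizes)  # [(323, 81), (323, 81), (323, 81), (323, 81), (324, 80)]
```

When the sample count is not a multiple of K, the first folds get one extra sample, which is why the last fold is slightly smaller.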

```python
import numpy as np

# NOTE: load_boston was removed in scikit-learn 1.2,
# so this example needs an older scikit-learn version.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split


def load_data():
    """1. Load data from sklearn and split it into train data and test data."""
    X, y = load_boston(return_X_y=True)

    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=0.2, random_state=100
    )

    return train_X, test_X, train_y, test_y


def kfold_regression(train_X, train_y):
    """2. Train and validate a model using KFold cross validation.

    Step01. Define a KFold object that divides the data into 5 folds.

    Step02. Split the data into train data and validation data with
            kfold.split(). train_idx and val_idx are the row indices
            of each part.

    Step03. Train a model on the train data and score it on the
            validation data. Save each model's score in model_scores.
    """
    # List that collects each fold's score
    model_scores = []

    kfold = KFold(n_splits=5)

    for n_iter, (train_idx, val_idx) in enumerate(kfold.split(train_X)):
        X_train, X_val = train_X[train_idx], train_X[val_idx]
        y_train, y_val = train_y[train_idx], train_y[val_idx]

        # Create a new model for each fold so that folds do not influence each other
        model = LinearRegression()
        model.fit(X_train, y_train)

        # For a regressor, score() returns R^2, not classification accuracy
        score = model.score(X_val, y_val)

        # Sizes of the train data and validation data in this fold
        train_size = X_train.shape[0]
        val_size = X_val.shape[0]

        print("Iter : {0} Cross-Validation Accuracy : {1}, Size of Train Data : {2}, Size of Validation Data : {3}"
              .format(n_iter, score, train_size, val_size))

        model_scores.append(score)

    # model is the one fitted on the last fold
    return kfold, model, model_scores


def main():
    # Load data for training and test
    train_X, test_X, train_y, test_y = load_data()

    kfold, model, model_scores = kfold_regression(train_X, train_y)

    print("\n> Mean of scores : ", np.mean(model_scores))


if __name__ == "__main__":
    main()
```


```text
Iter : 0 Cross-Validation Accuracy : 0.622527754679733, Size of Train Data : 323, Size of Validation Data : 81
Iter : 1 Cross-Validation Accuracy : 0.7158099616179292, Size of Train Data : 323, Size of Validation Data : 81
Iter : 2 Cross-Validation Accuracy : 0.7986314390280332, Size of Train Data : 323, Size of Validation Data : 81
Iter : 3 Cross-Validation Accuracy : 0.6952286567450774, Size of Train Data : 323, Size of Validation Data : 81
Iter : 4 Cross-Validation Accuracy : 0.7006957536853015, Size of Train Data : 324, Size of Validation Data : 80

> Mean of scores :  0.7065787131512149
```
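
Step 3 of the procedure, averaging the fold scores, can be checked directly from the output above. The standard deviation is an addition made here, included as a rough indication of how much the score varies from fold to fold.

```python
import numpy as np

# Per-fold scores copied from the output above
fold_scores = np.array([
    0.622527754679733,
    0.7158099616179292,
    0.7986314390280332,
    0.6952286567450774,
    0.7006957536853015,
])

mean_score = fold_scores.mean()
std_score = fold_scores.std()

print(f"mean = {mean_score:.4f}, std = {std_score:.4f}")
```

A wide spread between folds suggests that the reported mean depends heavily on how the data happened to be split.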


### Regularization

Regularization is a way to obtain a model that generalizes better by reducing the complexity of the model, typically by penalizing large coefficients $\beta_i$.
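
One common way to write this down, assuming a linear model with squared-error loss (the symbols $\lambda$, $x_{ki}$, $n$, $p$ and $P$ are introduced here for illustration), is to add a penalty on the coefficients to the loss that is minimized:

$$
\hat{\beta} = \arg\min_{\beta} \sum_{k=1}^{n} \Big( y_k - \beta_0 - \sum_{i=1}^{p} \beta_i x_{ki} \Big)^2 + \lambda \, P(\beta),
\qquad
P(\beta) =
\begin{cases}
\sum_{i=1}^{p} |\beta_i| & \text{L1 (Lasso)} \\
\sum_{i=1}^{p} \beta_i^2 & \text{L2 (Ridge)}
\end{cases}
$$

A larger $\lambda \ge 0$ means a stronger penalty and therefore a simpler model. With $\lambda = 0$ the penalty disappears and ordinary least squares is recovered.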

- **L1 Regularization (Lasso)**

Drives the $\beta_i$ that correspond to unnecessary inputs to exactly 0, so those inputs drop out of the model.

- **L2 Regularization (Ridge)**

Shrinks every $\beta_i$ toward zero, with the strongest effect on coefficients of very large magnitude (positive or negative), but does not make them exactly 0.
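
The difference between the two penalties can be seen on a small synthetic problem. The sketch below is hand-written NumPy, not the scikit-learn implementation: the data, the penalty strengths `alpha`, and the helpers `ridge_fit` and `lasso_fit` are all assumptions made for illustration. Only the first two of the five inputs actually influence `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5
X = rng.normal(size=(n_samples, n_features))
# Only inputs 0 and 1 matter; inputs 2-4 are unnecessary
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

def ridge_fit(X, y, alpha):
    """Closed-form minimizer of ||y - X b||^2 + alpha * ||b||_2^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def lasso_fit(X, y, alpha, n_sweeps=100):
    """Coordinate descent for (1 / 2n) * ||y - X b||^2 + alpha * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with input j's current contribution removed
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            # Soft-thresholding: any rho within [-alpha, alpha] becomes exactly 0
            shrunk = np.sign(rho) * max(abs(rho) - alpha, 0.0)
            beta[j] = shrunk / (X[:, j] @ X[:, j] / n)
    return beta

ridge_coef = ridge_fit(X, y, alpha=10.0)
lasso_coef = lasso_fit(X, y, alpha=0.5)

print("ridge:", np.round(ridge_coef, 3))
print("lasso:", np.round(lasso_coef, 3))
```

With these settings the lasso coefficients for inputs 2-4 should come out as exactly 0, while ridge leaves every coefficient non-zero and only shrinks it. In practice, scikit-learn provides both models as `Lasso` and `Ridge` in `sklearn.linear_model`.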

