# Kaggle

#### Understanding the problem
* **Data type:** tabular data, time series, images, text, etc. structured vs. unstructured... a mix
* **Problem type:** classification, regression, ranking, etc.
* **Evaluation metric:** ROC AUC, F1-score, MAE, MSE, etc.

#### Metric definition
* Most metrics can be found in the `sklearn.metrics` module
* However, **there are some special competition metrics that are not available in scikit-learn**
    * In such cases, we have to implement the metric manually
    
### Kaggle Solution Workflow

<img src='data/solution_workflow.png' width="600" height="300" align="center"/>

```
import pandas as pd
import numpy as np
```

```
def rmsle(y_true, y_pred):
    diffs = np.log(y_true + 1) - np.log(y_pred + 1)
    squares = np.power(diffs, 2)
    err = np.sqrt(np.mean(squares))
    return err
```
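A hand-written metric is worth checking against a reference before relying on it. The sketch below is a minimal sanity check on made-up values: it assumes scikit-learn's `mean_squared_log_error` is available and compares its square root with the manual RMSLE. The variant here uses `np.log1p`, which computes `log(1 + x)` directly.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    # log1p(x) computes log(1 + x) and is more accurate for small x
    diffs = np.log1p(y_true) - np.log1p(y_pred)
    return np.sqrt(np.mean(np.power(diffs, 2)))


# Hypothetical toy values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

manual = rmsle(y_true, y_pred)
reference = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(manual, reference)
```

If the two numbers agree, the manual implementation can be used in the validation loop with more confidence.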

* Before building any models, we should perform some preliminary steps to understand the data and the problem we're facing. 

#### Goals of EDA
* Size of the data
* Properties of the target variable
    * high class imbalance in a classification problem?
    * skewed distribution in a regression problem?
* Properties of the features
* Generate ideas for feature engineering
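The goals above map onto a handful of pandas calls. The following is a rough sketch on a made-up DataFrame (the column names `price` and `interest_level` and all values are assumptions for illustration), not a complete EDA.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training set; real data would come from pd.read_csv(...)
train = pd.DataFrame({
    'id': range(8),
    'price': [100.0, 120.0, 110.0, 95.0, 105.0, 115.0, 900.0, np.nan],
    'interest_level': ['low', 'low', 'low', 'low', 'low', 'low', 'medium', 'high'],
})

# Size of the data
print('Shape:', train.shape)

# Target in a classification problem: share of each class
class_shares = train['interest_level'].value_counts(normalize=True)
print(class_shares)

# Target in a regression problem: skewness (large positive = long right tail)
price_skew = train['price'].skew()
print('Price skewness:', price_skew)

# Properties of the features: dtypes, missing values, summary statistics
print(train.dtypes)
missing_counts = train.isnull().sum()
print(missing_counts)
print(train.describe())
```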

### K-fold cross-validation

```
from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=123)
```

* Now we need to train `K` models, one for each cross-validation split.
* To obtain all the splits, we call the `split()` method of the KFold object with the `train` data as an argument.
* It returns a generator that yields a pair of arrays (training indices, testing indices) for each split
* The observations are given as integer row positions in the train data.
* These indices can be used inside the loop to select the training and testing folds for the corresponding cross-validation split
* For a pandas DataFrame, this can be done with the `iloc` indexer, for example.

```
# Loop through each cross-validation split
for train_index, test_index in kf.split(train):
    # Get training and testing data for the corresponding split
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
```
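To see what `split()` actually yields, here is a small self-contained sketch on a made-up 10-row DataFrame (the data and column names are assumptions). It prints the fold sizes and checks that every row is used for testing exactly once.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data with 10 rows
train = pd.DataFrame({'feature': np.arange(10), 'target': np.arange(10) * 2})

kf = KFold(n_splits=5, shuffle=True, random_state=123)

test_rows_seen = []
for fold, (train_index, test_index) in enumerate(kf.split(train)):
    # train_index and test_index are NumPy arrays of row positions
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold {}: train rows {}, test rows {}'.format(fold, len(cv_train), len(cv_test)))
    test_rows_seen.extend(test_index.tolist())

# Every row lands in the test fold exactly once
print(sorted(test_rows_seen))
```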

#### Stratified K-fold
* As demonstrated in the image, each fold has the same class distribution as the initial dataset.
* It is useful when we have a classification problem with high class imbalance in the target variable or our data size is very small.

<img src='data/stratified_kfold.png' width="600" height="300" align="center"/>

```
# Import StratifiedKFold
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Loop through each cross-validation split
for train_index, test_index in str_kf.split(train, train['target']):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
```

```
# Import KFold
from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in kf.split(train):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
    
# Import StratifiedKFold
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in str_kf.split(train, train['interest_level']):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
```
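The same comparison can be reproduced without the competition data. The sketch below builds a made-up target with a 10% minority class (an assumption for illustration) and records the minority share in the training part of each split, for both splitters. The helper `minority_shares` is hypothetical, not part of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced target: 90 rows of class 0 and 10 rows of class 1
train = pd.DataFrame({'feature': range(100), 'target': [0] * 90 + [1] * 10})


def minority_shares(splitter, df):
    """Share of class 1 in the training part of every split."""
    shares = []
    for train_index, _ in splitter.split(df, df['target']):
        shares.append(df.iloc[train_index]['target'].mean())
    return shares


kfold_shares = minority_shares(KFold(n_splits=5, shuffle=True, random_state=123), train)
strat_shares = minority_shares(StratifiedKFold(n_splits=5, shuffle=True, random_state=123), train)

print('KFold:          ', kfold_shares)
print('StratifiedKFold:', strat_shares)
```

With the stratified splitter, the minority share should stay at 10% in every split, while plain K-fold is free to drift around that value.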

#### Validation usage
* **Leakage** causes a model to seem accurate until we start making predictions in a real-world environment
* **Types of data leakage:**
    * Leak in **features:** using data that will not be available in the real (production) setting
        * Example: predicting sales in US dollars, while having exactly the same sales in UK pounds as a feature.
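    * Leak in **validation strategy:** the validation strategy differs from the real-world situation
        * Example: using ordinary K-fold on time series data, so that the model is validated on observations that precede some of its training data
        * Instead, time series K-fold should be done as demonstrated in the image below: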
        
<img src='data/time_series_kfold.png' width="600" height="300" align="center"/>        


* The underlying idea of time series K-fold cross-validation is to provide multiple splits in such a manner that we train only on past data while always predicting the future
* **Time K-fold cross-validation** is also available in `sklearn.model_selection`, as `TimeSeriesSplit`

```
# Import TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit

# Create a TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=5)
```

* **Note** that `TimeSeriesSplit` splits rows by position, so **before applying it to the data, we need to sort the train DataFrame by date.**
    * Then, as usual, iterate through each cross-validation split

```
# Sort train by date
train = train.sort_values('date')

# Loop through each cross-validation split
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
```
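The "train on the past, predict the future" property can be verified directly. The following sketch uses made-up daily data stored in shuffled order (dates and column names are assumptions) and checks, for each split, that the latest training date comes before the earliest test date.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily data, deliberately stored out of order
train = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=12, freq='D'),
    'sales': range(12),
}).sample(frac=1, random_state=0)

# Sort by date and reset the index so positions follow time order
train = train.sort_values('date').reset_index(drop=True)

time_kfold = TimeSeriesSplit(n_splits=3)
no_future_leak = []
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    # The latest training date must precede the earliest test date
    no_future_leak.append(cv_train['date'].max() < cv_test['date'].min())
    print('Train up to {}, test from {}'.format(
        cv_train['date'].max().date(), cv_test['date'].min().date()))

print(no_future_leak)
```

Skipping the sort would make the splitter cut the shuffled rows by position, and the check would no longer be guaranteed to pass.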

### Validation pipeline
* First, create an empty list where we will store the model's results
* Split the train data into folds
* For each cross-validation split, we perform the following steps:
    * Train a model using all folds except a single one
    * Make predictions on this single unseen fold
    * Calculate the competition metric and append it to the list of fold metrics
* As a result, we have a list of K numbers representing model quality for each fold

```
# Pseudocode: CV_STRATEGY, model and evaluate are placeholders for the
# chosen splitter, estimator and competition metric

# List for the results
fold_metrics = []
for train_index, test_index in CV_STRATEGY.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    # Train a model
    model.fit(cv_train)
    # Make predictions
    predictions = model.predict(cv_test)
    # Calculate the metric
    metric = evaluate(cv_test, predictions)
    fold_metrics.append(metric)
```
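One possible concrete version of this pipeline is sketched below. Everything specific in it is an assumption made for illustration: the synthetic data, the column names `x1`, `x2` and `target`, `LinearRegression` as the model, and MSE as the stand-in for the competition metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical synthetic data: target is a noisy linear function of two features
rng = np.random.RandomState(123)
train = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
train['target'] = 3 * train['x1'] - 2 * train['x2'] + rng.normal(scale=0.1, size=200)

features = ['x1', 'x2']
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=123)

fold_metrics = []
for train_index, test_index in cv_strategy.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    # Train a fresh model on all folds except the held-out one
    model = LinearRegression()
    model.fit(cv_train[features], cv_train['target'])
    # Predict the held-out fold and score it
    predictions = model.predict(cv_test[features])
    fold_metrics.append(mean_squared_error(cv_test['target'], predictions))

print(fold_metrics)
```

Creating the model inside the loop keeps each fold independent, so nothing learned on one split carries over to the next.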
* Now, we could train two different models and for each model get a list of K numbers

<img src='data/mod_comp.png' width="300" height="150" align="center"/>

* For example, above we have models A and B, each with a mean squared error for every one of four folds
* Our goal is to select the model with better quality
* The next step is to transform the K fold scores into a single overall validation score
* The simplest way to obtain a single number is to find the mean over all fold scores
    * However, the mean is not always a good choice, as it does *not* take into account **score deviation** from one fold to another
    * For example, we could get a very good score for a single fold, while the performance on the remaining K-1 folds is poor.

```
import numpy as np

# Simple mean over the folds
mean_score = np.mean(fold_metrics)
```
* **A more reliable overall validation score** takes a worst-case view: the mean fold score shifted by one standard deviation in the unfavorable direction
* **Note** that we **add** the standard deviation if the competition metric is being *minimized* and **subtract** it if the metric is being *maximized*.

```
# Overall validation score
overall_score_minimizing = np.mean(fold_metrics) + np.std(fold_metrics)

# Or
overall_score_maximizing = np.mean(fold_metrics) - np.std(fold_metrics)
```
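A small worked example shows how the two scores can disagree. The fold MSEs below are made up for illustration and are not the values from the figure: model A is excellent on three folds and poor on one, while model B is consistent.

```python
import numpy as np

# Hypothetical fold MSEs (lower is better), not the values from the figure
model_a = np.array([1.0, 1.0, 1.0, 5.0])   # good on three folds, poor on one
model_b = np.array([2.2, 2.1, 2.3, 2.2])   # consistent across folds

for name, scores in [('A', model_a), ('B', model_b)]:
    mean_score = np.mean(scores)
    # np.std uses the population standard deviation (ddof=0) by default
    overall_score = mean_score + np.std(scores)
    print('Model {}: mean={:.3f}, overall={:.3f}'.format(name, mean_score, overall_score))
```

By the plain mean, model A looks better (2.000 against 2.200). Once the fold-to-fold deviation is added, model B wins (roughly 2.271 against 3.732), because its quality is much more stable.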

#### Exercises: Time K-Fold

```
# Create TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=3)

# Sort train data by date
train = train.sort_values(by='date')

# Iterate through each split
fold = 0
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    
    print('Fold :', fold)
    print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))
    print('Test date range: from {} to {}\n'.format(cv_test.date.min(), cv_test.date.max()))
    fold += 1
```

#### Exercises: Overall Validation Score

```
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Sort train data by date
train = train.sort_values('date')

# Initialize 3-fold time cross-validation
kf = TimeSeriesSplit(n_splits=3)

# Get MSE scores for each cross-validation split
mse_scores = get_fold_mse(train, kf)

print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
print('MSE by fold: {}'.format(mse_scores))
print('Overall validation MSE: {:.5f}'.format(np.mean(mse_scores) + np.std(mse_scores)))
```

# $\star$ Chapter 3: Feature Engineering
You will now get exposure to different types of features. You will modify existing features and create new ones. Also, you will treat the missing data accordingly.

### Feature engineering

<img src='data/mod_process.png' width="600" height="300" align="center"/>

* **Important rule:** tweak only a single thing at a time, because changing multiple things at once makes it impossible to tell what actually works and what doesn't
* **Feature engineering** helps our ML models to get additional information and consequently to better predict the target variable
* The ideas for new features can come from prior experience working with similar data.
* Also, having looked at the data, we could potentially generate ideas for new valuable features
* One more source is domain knowledge of the problem we're solving

#### Feature types
* Numerical
* Categorical
* Datetime
* Coordinates
* Text
* Images

#### Creating features 
* In some situations we need to generate features for train and test independently, and separately for each validation split in K-fold cross-validation
* However, in the majority of cases features are created for the train and test sets simultaneously
    * For this purpose, we concatenate the train and test DataFrames from Kaggle into a single DataFrame using pandas
    
```
# Concatenate the train and test data
data = pd.concat([train, test])

# Generate new features for the full DataFrame

# Get the original train and test split back
train = data[data.id.isin(train.id)]
test = data[data.id.isin(test.id)]
```
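The round trip can be tried end to end on toy data. In this sketch the frames, the `price` target and the `rooms_number` feature are assumptions for illustration. It also shows one detail that is easy to miss: columns that exist only in train (typically the target) become NaN for the test rows after concatenation.

```python
import pandas as pd

# Hypothetical toy frames; the test set has no target column
train = pd.DataFrame({'id': [1, 2, 3], 'bedrooms': [1, 2, 3],
                      'bathrooms': [1, 1, 2], 'price': [1000, 1500, 2400]})
test = pd.DataFrame({'id': [4, 5], 'bedrooms': [2, 4], 'bathrooms': [1, 2]})

# Remember which ids belong to which set before overwriting the variables
train_ids, test_ids = train['id'], test['id']

# Concatenate; columns missing from test (here: price) are filled with NaN
data = pd.concat([train, test], ignore_index=True)

# Generate a new feature once for the full DataFrame
data['rooms_number'] = data['bedrooms'] + data['bathrooms']

# Get the original split back, dropping the target from the test part
train = data[data['id'].isin(train_ids)].copy()
test = data[data['id'].isin(test_ids)].drop(columns='price').copy()

print(train)
print(test)
```

This approach relies on `id` values being unique across train and test; if they are not, an explicit indicator column is a safer way to restore the split.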

#### Arithmetical features

```
# Arithmetical features
two_sigma['price_per_bedroom'] = two_sigma.price / two_sigma.bedrooms
two_sigma['rooms_number'] = two_sigma.bedrooms + two_sigma.bathrooms
```
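Ratio features need some care when the denominator can be zero, for example a studio listing with no bedrooms. The sketch below uses made-up listings to show the problem and one possible way around it (turning zero bedrooms into a missing value); whether NaN is the right choice depends on the model and is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; the second one is a studio with zero bedrooms
two_sigma = pd.DataFrame({
    'price': [3000, 2000, 4500],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 1.5],
})

# Plain division gives inf for the studio
naive = two_sigma['price'] / two_sigma['bedrooms']
print(naive.tolist())

# One option: treat zero bedrooms as missing so the ratio becomes NaN
bedrooms_nonzero = two_sigma['bedrooms'].where(two_sigma['bedrooms'] != 0)
two_sigma['price_per_bedroom'] = two_sigma['price'] / bedrooms_nonzero
two_sigma['rooms_number'] = two_sigma['bedrooms'] + two_sigma['bathrooms']
print(two_sigma)
```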

#### Datetime features 

```
# Convert date to the datetime object
dem['date'] = pd.to_datetime(dem['date'])

# Year features
dem['year'] = dem['date'].dt.year

# Month features
dem['month'] = dem['date'].dt.month

# Week features (dt.weekofyear was removed in pandas 2.0)
dem['week'] = dem['date'].dt.isocalendar().week

# Day features
dem['dayofyear'] = dem['date'].dt.dayofyear
dem['dayofmonth'] = dem['date'].dt.day
dem['dayofweek'] = dem['date'].dt.dayofweek
```
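A quick check on a few known dates helps when reading these features. The dates below are made up for illustration, and the sketch assumes pandas 1.1 or newer for `dt.isocalendar()`.

```python
import pandas as pd

# Hypothetical dates around a year boundary
dem = pd.DataFrame({'date': ['2017-01-01', '2017-01-02', '2017-07-15']})
dem['date'] = pd.to_datetime(dem['date'])

dem['year'] = dem['date'].dt.year
dem['month'] = dem['date'].dt.month
# ISO week number; requires pandas 1.1 or newer
dem['week'] = dem['date'].dt.isocalendar().week
dem['dayofweek'] = dem['date'].dt.dayofweek  # Monday=0, Sunday=6

print(dem)
```

Note that 1 January 2017 is a Sunday and falls into ISO week 52 of the previous ISO year, so `year` and `week` taken together can be misleading near year boundaries.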

#### Exercises: Arithmetical features

```
# Look at the initial RMSE
print('RMSE before feature engineering:', get_kfold_rmse(train))

# Find the total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['FirstFlrSF'] + train['SecondFlrSF']

# Look at the updated RMSE
print('RMSE with total area:', get_kfold_rmse(train))
```

<img src='data/course_datasets.png' width="600" height="300" align="center"/>