# Kaggle 30 Days of Machine Learning - Competition

## Introduction

### Author Info

- Author: [Zacks Shen](https://www.linkedin.com/in/zacks-shen/)
- Blog: [Zacks.One](https://zacks.one)

---

### Reference

> [A Brief Overview of Outlier Detection Techniques](https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561)

---

### Project Intro

For this competition, you will predict a continuous `target` from the feature columns given in the data. The feature columns `cat0` - `cat9` are categorical, and the feature columns `cont0` - `cont13` are continuous.

**Files**
- **train.csv** - the training data with the `target` column
- **test.csv** - the test set; you will be predicting the `target` for each row in this file
- **sample_submission.csv** - a sample submission file in the correct format

---

# Dependencies

In [1]:
# Statistics
import pandas as pd
import numpy as np
import math as mt

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Data Preprocessing - Standardization, Encoding, Imputation
from sklearn.preprocessing import StandardScaler # Standardization
from sklearn.preprocessing import Normalizer # Normalization
from sklearn.preprocessing import OneHotEncoder # One-hot Encoding
from sklearn.preprocessing import OrdinalEncoder # Ordinal Encoding
from category_encoders import MEstimateEncoder # Target Encoding
from sklearn.preprocessing import PolynomialFeatures # Create Polynomial Features
from sklearn.impute import SimpleImputer # Imputation

# Exploratory Data Analysis - Feature Engineering
from sklearn.feature_selection import mutual_info_regression
from sklearn.decomposition import PCA

# Modeling - ML Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# Modeling - Algorithms
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# ML - Evaluation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# ML - Tuning
import optuna
#from sklearn.model_selection import GridSearchCV

# Settings
# Settings for Seaborn
sns.set_theme(context='notebook', style='ticks', palette="bwr_r", font_scale=0.7, rc={"figure.dpi":240, 'savefig.dpi':240})

---

# Dataset

In [None]:
import os
kaggle_project = '30-days-of-ml'
# Import dataset from local directory './data' or from Kaggle
data_dir = ('./data' if os.path.exists('data') else f'/kaggle/input/{kaggle_project}')

# print all files in data_dir
for dirname, _, filenames in os.walk(data_dir):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_data = pd.read_csv(f'{data_dir}/train.csv')
test_data = pd.read_csv(f'{data_dir}/test.csv')

train_data.head()

---

# Exploratory Data Analysis

## Missing Values

Since the train dataset has no missing values, data imputation is not necessary.

In [None]:
# print number missing values from train_data and test_data
train_data.isna().sum().sum(), test_data.isna().sum().sum()

---

## Target Distribution

99.82% of the target values fall in the range from 6.5 (inclusive) to 10.5 (exclusive).

In [None]:
sns.histplot(x=train_data['target'], stat='density')
plt.title('Distribution of All Targets')

In [None]:
# Main target range
len(train_data[(6.5 <= train_data.target) & (train_data.target < 10.5)]) / len(train_data)

---

## Numerical Data

Since all of the numerical features are on the same scale, there is no need for scaling.


**Conclusions**
- None of the numerical features follows a standard parametric distribution (such as the normal distribution). Therefore, we should use non-parametric outlier detection.
- The distributions of the numerical features in the train and test datasets are similar.
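The similarity between train and test distributions can also be quantified instead of judged by eye. The following is a minimal sketch, assuming the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) is an acceptable measure. `ks_statistic` is a hypothetical helper, and the samples are synthetic stand-ins rather than the competition data:

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    # Empirical CDF of each sample evaluated at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
same_1 = rng.uniform(0, 1, 5000)     # stand-in for a train feature
same_2 = rng.uniform(0, 1, 5000)     # stand-in for the same feature in test
shifted = rng.uniform(0.3, 1.3, 5000)  # a deliberately shifted feature

ks_same = ks_statistic(same_1, same_2)
ks_shifted = ks_statistic(same_1, shifted)
print(ks_same, ks_shifted)
```

A value close to 0 suggests the two samples come from similar distributions; a shifted feature yields a clearly larger value.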

In [None]:
# Define num_cols storing the columns names of numerical features
num_cols = list(train_data.select_dtypes(include=['int', 'float']).columns)
num_cols.remove('id')
num_cols.remove('target')
print(num_cols)

In [None]:
train_data[num_cols].describe()

In [None]:
test_data[num_cols].describe()

In [None]:
def plots(plot, features, train=None, test=None, target=None, ncols=4, figsize=(20, 16), **params):
    """Receive a Seaborn plotting function and one or two Pandas DataFrames, then draw one subplot per feature.
    
    Args:
        plot: A Seaborn plotting function, e.g. sns.kdeplot.
        features: A list of Pandas DataFrame column names.
        train: Train dataset.
        test: Test dataset.
        target: The target column name, default is None.
        ncols: Number of figures in a row, default is 4.
        figsize: Figure size, default is (20, 16).
        **params: Extra keyword arguments passed to the Seaborn function.
    
    Returns:
        None. The figure is displayed with plt.show().
    """
    # Set row, col, and figure id
    nfigs = len(features)
    nrows = mt.ceil(nfigs/ncols)  # number of rows
    fig_id = 1  # initialize plot counter

    fig = plt.figure(figsize=figsize)
    for feature in features:
        plt.subplot(nrows, ncols, fig_id)
        if isinstance(train, pd.DataFrame):
            plot(data=train, x=feature, y=target, label='train', 
                 **params)
        if isinstance(test, pd.DataFrame):  
            plot(data=test, x=feature, label='test', # Test dataset doesn't have target
                 **params)
            plt.legend() # Force legend
        fig_id += 1
    plt.show()

In [None]:
%%time
# Density plots of train and test
plots(plot=sns.kdeplot, train=train_data, test=test_data, features=num_cols)

In [None]:
%%time
# Scatter plots of train
plots(plot=sns.scatterplot, train=train_data, features=num_cols, target='target')

In [None]:
print("Correlation with target:\n")
pd.DataFrame(train_data[num_cols].corrwith(train_data.target), columns=['Correlation Coefficient']).style.bar(align='mid', color=['#d65f5f', '#5fba7d'])

---

## Categorical Data

Because the categorical features are imbalanced, target encoding is worth trying.
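To make the idea of target encoding concrete, here is a minimal hand-rolled sketch of m-estimate smoothing, in the spirit of the `MEstimateEncoder` imported above but not a reproduction of that library's code. The helper name `m_estimate_encode`, the smoothing weight `m`, and the toy data are assumptions for illustration. Rare categories are pulled toward the global mean, which matters when categories are imbalanced:

```python
import pandas as pd

def m_estimate_encode(categories, target, m=5.0):
    """Blend each category's mean target with the global mean.

    encoding = (category_sum + m * global_mean) / (category_count + m)
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    encoding = (stats["sum"] + m * global_mean) / (stats["count"] + m)
    return categories.map(encoding), encoding

toy = pd.DataFrame({
    "cat": ["A", "A", "A", "A", "B"],       # "B" is a rare category
    "target": [8.0, 8.0, 8.0, 8.0, 10.0],
})
encoded, mapping = m_estimate_encode(toy["cat"], toy["target"], m=1.0)
print(mapping)
```

In practice the mapping should be learned on training folds only and then applied to validation and test rows; otherwise the target leaks into the features.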

In [None]:
# Define Categorical Columns
cat_cols = list(train_data.select_dtypes(include=['object']).columns)
print(cat_cols)

In [None]:
%%time
# Countplots of train and test
plots(plot=sns.countplot, train=train_data, test=test_data, features=cat_cols)

In [None]:
%%time
# Violinplot of train
plots(plot=sns.violinplot, train=train_data, features=cat_cols, target='target')

---

## Outliers

In [None]:
def detect_outliers(data_frame, features):
    """Print the proportion of rows outside Tukey's fences for each feature."""
    df = data_frame.copy()
    for feature in features:
        Q1, Q3 = df[feature].quantile(0.25), df[feature].quantile(0.75)
        IQR = Q3 - Q1
        min_value, max_value = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        filter1, filter2 = min_value <= df[feature], df[feature] <= max_value
        proportion = 1 - len(df[filter1 & filter2]) / len(df)
        print(f"{feature}'s outliers proportion: {proportion}")

detect_outliers(train_data, num_cols)

In [None]:
def tukey_rule(data_frame, features):
    """Detect outliers for each feature according to Tukey's Rule.
    
    Args:
        data_frame: A Pandas DataFrame instance.
        features: A set of Pandas DataFrame columns.
    
    Returns:
        A Pandas DataFrame with no outliers.
    """
    df = data_frame.copy()
    for feature in features:
        Q1, Q3 = df[feature].quantile(0.25), df[feature].quantile(0.75)
        IQR = Q3 - Q1
        min_value, max_value = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        filter1, filter2 = min_value <= df[feature], df[feature] <= max_value
        df = df[filter1 & filter2] # Overwrite df; the surviving rows keep their original index labels.
    return df
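Tukey's rule can be checked by hand on a tiny example. The sketch below uses a hypothetical helper, `tukey_bounds`, and toy values; it is not part of the notebook's pipeline:

```python
import pandas as pd

def tukey_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences of a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([1, 2, 3, 4, 100])
lower, upper = tukey_bounds(values)  # Q1 = 2, Q3 = 4, IQR = 2
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())  # -1.0 7.0 [100]
```

Note that `tukey_rule` above filters feature by feature, so the quartiles of later features are computed on the rows that survived the earlier features.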

In [None]:
df_train = tukey_rule(train_data[num_cols], num_cols)
df_train = df_train.join(train_data[['id'] + cat_cols + ['target']], how='inner') # Join the remaining columns
df_train = df_train[train_data.columns] # Restore the original column order

In [None]:
df_train.to_csv('df_train.csv', index=False)

---

# Feature Selection

The goals of feature engineering:
- improve a model's predictive performance
- reduce computational or data needs
- improve interpretability of the results

The possible ways of feature engineering:
- Creating new features
- K-means clustering
- Principal Component Analysis

Metric for feature engineering:
- Mutual Information
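As an illustration of the K-means option listed above, the sketch below adds a cluster label as a new categorical feature. The column names `cont_a` and `cont_b`, the number of clusters, and the synthetic blobs are assumptions; the notebook itself does not run this step:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for two continuous features: two well-separated blobs
toy = pd.DataFrame(
    np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(1.0, 0.1, (50, 2))]),
    columns=["cont_a", "cont_b"],
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(toy[["cont_a", "cont_b"]])
# Cluster ids are labels, not quantities, so treat them as categorical
toy["cluster"] = toy["cluster"].astype("category")
print(toy["cluster"].value_counts())
```

Whether such a feature helps can then be judged with the mutual information score, like any other feature.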

---

## Mutual Information

In [None]:
def make_mi_scores(X, y):
    X = X.copy()
    # Mutual information requires categorical features to be integer-encoded
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize() # factorize() returns codes and uniques
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
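One reason to look at mutual information in addition to correlation is that it can pick up non-linear relationships. The sketch below illustrates this on synthetic data; the feature names `signal` and `noise` and the quadratic target are assumptions made only for the demonstration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": rng.uniform(-1, 1, 1000),  # drives the target non-linearly
    "noise": rng.uniform(-1, 1, 1000),   # unrelated to the target
})
target = toy["signal"] ** 2 + rng.normal(0, 0.01, 1000)

mi = pd.Series(
    mutual_info_regression(toy, target, random_state=0),
    index=toy.columns,
)
corr = toy.corrwith(target)
print(mi, corr, sep="\n")
```

Here the linear correlation of `signal` with the target is close to zero because the relationship is symmetric, while its MI score is clearly higher than that of `noise`.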

In [None]:
%%time
X = df_train.copy()
X.drop(columns=['id'], inplace=True)
y = X.pop("target")

# Review the MI score from all data
all_mi_scores = pd.DataFrame(make_mi_scores(X, y))

all_mi_scores.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])

---

### PCA

In [None]:
def apply_pca(X, standardize=True):
    # Standardize
    if standardize:
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Create principal components
    pca = PCA()
    X_pca = pca.fit_transform(X)
    # Convert to dataframe
    component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
    X_pca = pd.DataFrame(X_pca, columns=component_names)
    # Create loadings
    loadings = pd.DataFrame(
        pca.components_.T,  # transpose the matrix of loadings
        columns=component_names,  # so the columns are the principal components
        index=X.columns,  # and the rows are the original features
    )
    return pca, X_pca, loadings


def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure
    fig.set(figwidth=width, dpi=dpi)
    return axs
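The cumulative variance curve is often used to decide how many components to keep. The following sketch assumes a 95% threshold, which is a common but arbitrary choice, and uses synthetic data with two redundant features instead of the competition data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Four features, two of which are near-copies of the first two
X_toy = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + rng.normal(scale=0.05, size=500),
    base[:, 1] + rng.normal(scale=0.05, size=500),
])
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)  # standardize

pca_toy = PCA().fit(X_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_keep)
```

With two pairs of nearly duplicated features, two components are enough to carry almost all of the variance.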

In [None]:
print("Correlation with target:\n")
pd.DataFrame(X[num_cols].corrwith(y), columns=['Correlation Coefficient']).style.bar(align='mid', color=['#d65f5f', '#5fba7d'])

In [None]:
X = train_data.copy()
y = X.pop("target")

# `apply_pca`, defined above, reproduces the code from the tutorial
pca, X_pca, loadings = apply_pca(X[num_cols])
loadings

In [None]:
plot_variance(pca);

In [None]:
%%time
pca_mi_scores = pd.DataFrame(make_mi_scores(X_pca, y))
#pca_mi_scores = pd.DataFrame(mi_scores).style.bar(align='mid', color=['#d65f5f', '#5fba7d'])

pca_mi_scores.style.bar(align='mid', color=['#d65f5f', '#5fba7d'])

In [None]:
def test_pca(model, X):
    """Fit the model on an 80/20 split of X and return the validation RMSE.

    Uses the global variable y as the target.
    """
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                    random_state=0)

    # Fit the model
    model.fit(X_train, y_train)

    # Prediction
    pred = model.predict(X_valid)

    # Calculate RMSE
    return mt.sqrt(mean_squared_error(y_valid, pred))

In [None]:
# Model hyperparameters
xgb_params = {'n_estimators': 1000,
              'learning_rate': 0.1,
              #'subsample': 0.8,
              'colsample_bytree': 0.3,
              'max_depth': 3,
              'booster': 'gbtree', 
              #'reg_lambda': 45.1,
              #'reg_alpha': 34.9,
              'random_state': 0,
              'n_jobs': 4}

model = XGBRegressor(**xgb_params, tree_method = 'gpu_hist')

In [None]:
%%time

X = train_data.copy()
y = X.pop("target")

# numerical data only
# 0.7276467952234649
# 0.7228196639975967
#test_pca(model, X[num_cols])

# pca only
# 0.7456365639105215
# 0.7409662927889535
#test_pca(model, X_pca)

# pca (top 12 components by MI score)
# 0.744166905080158
# 0.7413420081098745
#test_pca(model, X_pca[pca_mi_scores.iloc[0:12, :].index])

# numerical data + pca
# 0.727961200766366
# 0.7235085989402649
test_pca(model, X[num_cols].join(X_pca))

# numerical data + pca (top 12 components by MI score)
# 0.7281481337015027
# 0.7234893735995077
# test_pca(model, X[num_cols].join(X_pca[pca_mi_scores.iloc[0:12, :].index]))

---

## Data Engineering

---

### Data Pipeline

In [None]:
X = df_train.copy()
X.drop(columns=['id'], inplace=True)
y = X.pop("target")

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
cat_cols_nunique = X[cat_cols].nunique()
onehot_cols = list(cat_cols_nunique[cat_cols_nunique < 10].keys())
ordinal_cols = list(cat_cols_nunique[cat_cols_nunique >= 10].keys())

# Keep selected columns only
# my_cols = list(num_cols) + list(onehot_cols)

assert len(num_cols) + len(onehot_cols) + len(ordinal_cols) == X.shape[1], 'feature selection error'

In [None]:
def data_pipelie(num_cols, onehot_cols, ordinal_cols):    
    # Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='median')

    # Preprocessing of One Hot Encoding
    onehot_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Preprocessing of Ordinal Encoding
    ordinal_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ordinal', OrdinalEncoder())
    ])

    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, num_cols),
            ('onehot', onehot_transformer, onehot_cols),
            ('ordinal', ordinal_transformer, ordinal_cols)
        ])

    '''
    # Bundle preprocessing and modeling code in a pipeline
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)
                         ])

    # Preprocessing of training data, fit model 
    clf.fit(X_train, y_train)

    # Preprocessing of validation data, get predictions
    preds = clf.predict(X_valid)
    '''
    return preprocessor
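To see what such a preprocessor produces, the sketch below runs a reduced `ColumnTransformer` on a toy frame. The column names `cat_low` and `cat_high` and the choice of `sparse_threshold=0` (to force a dense array for inspection) are assumptions for the example, and the imputers are left out for brevity:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({
    "cont0": [0.1, 0.5, 0.9, 0.3],
    "cat_low": ["A", "B", "A", "C"],   # low cardinality -> one-hot
    "cat_high": ["x", "y", "z", "x"],  # treated as high cardinality -> ordinal
})

preprocessor_toy = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["cont0"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat_low"]),
        ("ordinal", OrdinalEncoder(), ["cat_high"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)
encoded = preprocessor_toy.fit_transform(toy)
print(encoded.shape)  # (4, 5): 1 numeric + 3 one-hot + 1 ordinal
```

The output columns follow the order of the `transformers` list, so the ordinal column ends up last.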

---

# Modeling

- Algorithms
    - XGBoost
    - LightGBM

## Training + Cross Validation

In [None]:
def train(model, data_frame, num_cols, onehot_cols, ordinal_cols):
    """Print the RMSE of each of the 5 CV folds and their average for the selected model.
    """
    y = data_frame["target"]
    X = data_frame[num_cols + onehot_cols + ordinal_cols]

    # Define pipeline
    preprocessor = data_pipelie(num_cols, onehot_cols, ordinal_cols)
    reg = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    # Define cross-validation
    scores = -1 * cross_val_score(reg, X, y,
                                  cv=5,
                                  scoring='neg_root_mean_squared_error')

    for i, score in enumerate(scores):
        print(f'Fold {i}, RMSE: {score}')
    print(f'Average RMSE: {np.mean(scores)}')
    #return scores
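The sign flip in `-1 * cross_val_score(...)` is needed because scikit-learn scorers follow a "higher is better" convention, so RMSE is returned as a negative number. A minimal sketch on synthetic data, with `LinearRegression` standing in for the boosted model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn maximizes scores, so RMSE is reported as a negative number
neg_rmse = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    cv=5, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse
print(rmse_per_fold, rmse_per_fold.mean())
```

Because the whole pipeline is passed to `cross_val_score` in `train`, the preprocessing is refit on each training split, which avoids leaking validation rows into the encoders.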

In [None]:
# Model hyperparameters
xgb_params = {
    'n_estimators': 1000,
    'learning_rate': 0.1,
    #'subsample': 0.8,
    'colsample_bytree': 0.3,
    'max_depth': 3,
    'booster': 'gbtree', 
    #'reg_lambda': 45.1,
    #'reg_alpha': 34.9,
    'random_state': 0,
    'n_jobs': 4
}

model = XGBRegressor(
    **xgb_params,
    tree_method='gpu_hist', 
    gpu_id=0, 
    predictor='gpu_predictor'
)

In [None]:
%%time
_one_hot_cols = [col for col in onehot_cols if col not in {'cat2', 'cat4', 'cat6'}] # keeps a deterministic column order

train(model, df_train, num_cols, onehot_cols=_one_hot_cols, ordinal_cols=ordinal_cols)

In [None]:
%%time
_one_hot_cols = [col for col in onehot_cols if col not in {'cat2', 'cat4'}] # keeps a deterministic column order

train(model, df_train, num_cols, onehot_cols=_one_hot_cols, ordinal_cols=ordinal_cols)

In [None]:
%%time
_one_hot_cols = [col for col in onehot_cols if col not in {'cat2', 'cat4'}] # keeps a deterministic column order
_num_cols = [col for col in num_cols if col != 'cont13']

train(model, df_train, num_cols=_num_cols, onehot_cols=_one_hot_cols, ordinal_cols=ordinal_cols)

---

## Hyperparameter Tuning

### Generate KFold datasets

In [None]:
df_train = pd.read_csv('df_train.csv')

# Assign every row of the train dataset to one of 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X=df_train)):
    # In each iteration, we only mark the validation rows.
    # Therefore, after all iterations, we have 5 disjoint validation parts.
    #print(len(valid_idx))
    df_train.loc[valid_idx, "kfold"] = fold

# Export train dataset with kfold mark
df_train.to_csv("train_fold.csv", index=False)
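The fold-marking logic can be verified on a small frame. In this sketch the `kfold` column is pre-filled with `-1` so it keeps an integer dtype, which is a choice made here and not something the cell above does. It also assumes a default `RangeIndex`, because `KFold.split` yields positions while `.loc` selects by label:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

toy = pd.DataFrame({"value": np.arange(10)})
toy["kfold"] = -1  # integer placeholder, so the column keeps an integer dtype

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(kf_toy.split(toy)):
    # Positions equal labels here because the index is 0..n-1
    toy.loc[valid_idx, "kfold"] = fold

print(toy["kfold"].value_counts().sort_index())
```

Every row ends up in exactly one validation fold, and the folds have equal size when the number of rows is divisible by the number of splits.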

---

### Model tuning: Ordinal Encoding

In [None]:
def objective(trial):
    """Model tuning.
    """
    # Use fold 0 as the validation set and the remaining folds as the training set.
    fold = 0
    #df_train = pd.read_csv('train_fold.csv')
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    
    # Define dataset
    num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
    onehot_cols = ['cat0', 'cat1', 'cat3', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2', 'cat4' due to the low MI scores
    ordinal_cols = ['cat9']
    cat_cols = onehot_cols + ordinal_cols
    
    y_train = X_train.pop('target')
    X_train = X_train[num_cols + onehot_cols + ordinal_cols]
    y_valid = X_valid.pop('target')
    X_valid = X_valid[num_cols + onehot_cols + ordinal_cols]
    
    # Preprocessing - Ordinal Encoding
    # For this dataset, one-hot encoding leads to lower accuracy and lower performance.
    # Therefore, I implemented ordinal encoding for all categorical features.
    oe = OrdinalEncoder()
    X_train[cat_cols] = oe.fit_transform(X_train[cat_cols])
    X_valid[cat_cols] = oe.transform(X_valid[cat_cols])
    
    # Hyperparameters for XGBoost
    """
    xgb_params = {
        'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'gamma': trial.suggest_loguniform('gamma', 1e-3, 10.0),
        'reg_alpha': trial.suggest_loguniform("reg_alpha", 1e-8, 100.0),
        'reg_lambda': trial.suggest_loguniform("reg_lambda", 1e-8, 100.0),
        'colsample_bytree': trial.suggest_float("colsample_bytree", 0.1, 1.0),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
    }
    """
    xgb_params = {
        'reg_alpha': trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True),
        'reg_lambda': trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True),
        'colsample_bytree': trial.suggest_float("colsample_bytree", 0.1, 1.0),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int("max_depth", 2, 10),
        'random_state': trial.suggest_categorical('random_state', [1, 42, 2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
    }
    
    model = XGBRegressor(
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            **xgb_params)
    
    """
    # Hyperparameters for LightGBM
    lgb_params = {
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'num_iterations': trial.suggest_int('num_iterations', 100, 10000),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 7),
        'num_leaves': trial.suggest_int('num_leaves', 2, 100),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 100, 2000),
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.01, 0.99),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.01, 0.99),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    model = LGBMRegressor(
                device='gpu',
                gpu_platform_id=0,
                gpu_device_id=0,
                n_jobs=-1,
                metric='rmse',
                **lgb_params
    )
    """
    
    model.fit(
        X_train, y_train, 
        early_stopping_rounds=300,
        eval_set=[(X_valid, y_valid)],
        verbose=5000
    )
    
    valid_preds = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, valid_preds, squared=False)
    return rmse

In [None]:
%%time
#study = optuna.create_study(direction='minimize')
#study.optimize(objective, n_trials=200) # set n_trials

---

### Model tuning: Ordinal Encoding + One-hot Encoding

In [None]:
def objective(trial):
    """Model tuning.
    """
    # Use fold 0 as the validation set and the remaining folds as the training set.
    fold = 0
    #df_train = pd.read_csv('train_fold.csv')
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    
    # Define dataset
    num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
    onehot_cols = ['cat0', 'cat1', 'cat3', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2', 'cat4' due to the low MI scores
    ordinal_cols = ['cat9']
    cat_cols = onehot_cols + ordinal_cols
    
    y_train = X_train.pop('target')
    X_train = X_train[num_cols + onehot_cols + ordinal_cols]
    y_valid = X_valid.pop('target')
    X_valid = X_valid[num_cols + onehot_cols + ordinal_cols]
    
    # Preprocessing - Ordinal Encoding
    oe = OrdinalEncoder()
    X_train[ordinal_cols] = oe.fit_transform(X_train[ordinal_cols])
    X_valid[ordinal_cols] = oe.transform(X_valid[ordinal_cols])
    
    # Preprocessing - One-hot Encoding
    ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
    X_train_ohe = ohe.fit_transform(X_train[onehot_cols])
    X_valid_ohe = ohe.transform(X_valid[onehot_cols])
    #X_test_ohe = ohe.transform(X_test[onehot_cols]) # Q. The last transform
    
    X_train_ohe = pd.DataFrame(X_train_ohe, columns=[f"ohe_{i}" for i in range(X_train_ohe.shape[1])])
    X_valid_ohe = pd.DataFrame(X_valid_ohe, columns=[f"ohe_{i}" for i in range(X_valid_ohe.shape[1])])
    #X_test_ohe = pd.DataFrame(X_test_ohe, columns=[f"ohe_{i}" for i in range(X_test_ohe.shape[1])])
    
    X_train = pd.concat([X_train.drop(columns=onehot_cols), X_train_ohe], axis=1)
    X_valid = pd.concat([X_valid.drop(columns=onehot_cols), X_valid_ohe], axis=1)
    #X_test = pd.concat([X_test.drop(columns=onehot_cols), X_test_ohe], axis=1)

    # Hyperparameters for XGBoost
    xgb_params = {
        # `alpha` and `lambda` are XGBoost aliases of `reg_alpha` and `reg_lambda`,
        # so only the `reg_*` names are tuned to avoid passing the same parameter twice.
        'gamma': trial.suggest_float('gamma', 1e-3, 10.0, log=True),
        'reg_alpha': trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True),
        'reg_lambda': trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True),
        'colsample_bytree': trial.suggest_float("colsample_bytree", 0.1, 1.0),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
    }
    
    model = XGBRegressor(
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            **xgb_params)
    
    """
    # Hyperparameters for LightGBM
    lgb_params = {
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'num_iterations': trial.suggest_int('num_iterations', 100, 10000),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 7),
        'num_leaves': trial.suggest_int('num_leaves', 2, 100),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 100, 2000),
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.01, 0.99),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.01, 0.99),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    model = LGBMRegressor(
                device='gpu',
                gpu_platform_id=0,
                gpu_device_id=0,
                n_jobs=-1,
                metric='rmse',
                **lgb_params
    )
    """
    
    model.fit(
        X_train, y_train, 
        early_stopping_rounds=300,
        eval_set=[(X_valid, y_valid)],
        verbose=5000
    )

    valid_preds = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, valid_preds, squared=False)
    return rmse

In [None]:
%%time
#study = optuna.create_study(direction='minimize')
#study.optimize(objective, n_trials=200) # set n_trials

---

### Model tuning: Ordinal Encoding, Standardization

In [None]:
def objective(trial):
    """Model tuning with standardization.
    """
    # Use fold 0 as the validation set and the remaining folds as the training set.
    fold = 0
    #df_train = pd.read_csv('train_fold.csv')
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    
    # Define dataset
    num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
    onehot_cols = ['cat0', 'cat1', 'cat3', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2', 'cat4' due to the low MI scores
    ordinal_cols = ['cat9']
    cat_cols = onehot_cols + ordinal_cols
    
    y_train = X_train.pop('target')
    X_train = X_train[num_cols + onehot_cols + ordinal_cols]
    y_valid = X_valid.pop('target')
    X_valid = X_valid[num_cols + onehot_cols + ordinal_cols]
    
    # Preprocessing - Ordinal Encoding
    # For this dataset, one-hot encoding leads to lower accuracy and lower performance.
    # Therefore, I implemented ordinal encoding for all categorical features.
    oe = OrdinalEncoder()
    X_train[cat_cols] = oe.fit_transform(X_train[cat_cols])
    X_valid[cat_cols] = oe.transform(X_valid[cat_cols])
    
    # Preprocessing - Standardization
    scaler = StandardScaler()
    X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
    X_valid[num_cols] = scaler.transform(X_valid[num_cols])

    # Hyperparameters for XGBoost
    xgb_params = {
        # `alpha` and `lambda` are XGBoost aliases of `reg_alpha` and `reg_lambda`,
        # so only the `reg_*` names are tuned to avoid passing the same parameter twice.
        'gamma': trial.suggest_float('gamma', 1e-3, 10.0, log=True),
        'reg_alpha': trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True),
        'reg_lambda': trial.suggest_float("reg_lambda", 1e-8, 100.0, log=True),
        'colsample_bytree': trial.suggest_float("colsample_bytree", 0.1, 1.0),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
    }
    
    model = XGBRegressor(
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            **xgb_params)
    
    """
    # Hyperparameters for LightGBM
    lgb_params = {
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'num_iterations': trial.suggest_int('num_iterations', 100, 10000),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 7),
        'num_leaves': trial.suggest_int('num_leaves', 2, 100),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 100, 2000),
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.01, 0.99),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.01, 0.99),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    model = LGBMRegressor(
                device='gpu',
                gpu_platform_id=0,
                gpu_device_id=0,
                n_jobs=-1,
                metric='rmse',
                **lgb_params
    )
    """
    
    model.fit(
        X_train, y_train, 
        early_stopping_rounds=300,
        eval_set=[(X_valid, y_valid)],
        verbose=5000
    )
    
    valid_preds = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, valid_preds, squared=False)
    return rmse

In [None]:
%%time
#study = optuna.create_study(direction='minimize')
#study.optimize(objective, n_trials=200) # set n_trials
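Standardization is not expected to change much for tree-based models, because splits depend only on the ordering of feature values. The sketch below illustrates this with a single exact `DecisionTreeRegressor` on synthetic data; it is a stand-in, not the XGBoost model tuned above. Histogram-based boosters bin features before splitting, so their results should be very close but are not guaranteed to be bit-identical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = np.sin(4 * X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

X_scaled = StandardScaler().fit_transform(X_toy)

# Same tree settings, fitted once on raw and once on standardized features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y_toy)

same = np.allclose(tree_raw.predict(X_toy), tree_scaled.predict(X_scaled))
print(same)
```

This is consistent with the earlier observation that the numerical features already share a scale and need no scaling.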

---

### Model tuning: Ordinal Encoding, Log Transformation

In [None]:
def objective(trial):
    """Model tuning with log transformation.
    """
    # Use fold 0 as the validation set and the remaining folds as the training set.
    fold = 0
    #df_train = pd.read_csv('train_fold.csv')
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    
    # Define dataset
    num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
    onehot_cols = ['cat0', 'cat1', 'cat3', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2', 'cat4' due to the low MI scores
    ordinal_cols = ['cat9']
    cat_cols = onehot_cols + ordinal_cols
    
    y_train = X_train.pop('target')
    X_train = X_train[num_cols + onehot_cols + ordinal_cols]
    y_valid = X_valid.pop('target')
    X_valid = X_valid[num_cols + onehot_cols + ordinal_cols]
    
    # Preprocessing - Ordinal Encoding
    # For this dataset, one-hot encoding leads to lower accuracy and lower performance.
    # Therefore, I implemented ordinal encoding for all categorical features.
    oe = OrdinalEncoder()
    X_train[cat_cols] = oe.fit_transform(X_train[cat_cols])
    X_valid[cat_cols] = oe.transform(X_valid[cat_cols])

    # Preprocessing - Log transformation
    for col in num_cols:
        X_train[col] = np.log1p(X_train[col])
        X_valid[col] = np.log1p(X_valid[col])
        #test_data[col] = np.log1p(test_data[col])
    
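    # Hyperparameters for XGBoost
    # Note: in XGBoost, 'reg_alpha' is an alias of 'alpha' and 'reg_lambda' is an alias of 'lambda',
    # so each regularization term is sampled twice below.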
    xgb_params = {
        'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'gamma': trial.suggest_loguniform('gamma', 1e-3, 10.0),
        'reg_alpha': trial.suggest_loguniform("reg_alpha", 1e-8, 100.0),
        'reg_lambda': trial.suggest_loguniform("reg_lambda", 1e-8, 100.0),
        'colsample_bytree': trial.suggest_float("colsample_bytree", 0.1, 1.0),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
    }
    
    model = XGBRegressor(
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            **xgb_params)
    
    """
    # Hyperparameters for LightGBM
    lgb_params = {
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'num_iterations': trial.suggest_int('num_iterations', 100, 10000),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 7),
        'num_leaves': trial.suggest_int('num_leaves', 2, 100),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 100, 2000),
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.01, 0.99),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.01, 0.99),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    model = LGBMRegressor(
                device='gpu',
                gpu_platform_id=0,
                gpu_device_id=0,
                n_jobs=-1,
                metric='rmse',
                **lgb_params
    )
    """
    
    model.fit(
        X_train, y_train, 
        early_stopping_rounds=300,
        eval_set=[(X_valid, y_valid)],
        verbose=5000
    )
    
    valid_preds = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, valid_preds, squared=False)
    return rmse

In [None]:
%%time
#study = optuna.create_study(direction='minimize')
#study.optimize(objective, n_trials=200) # set n_trials

---

### Model tuning: Target Encoding
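
M-estimate target encoding replaces each category with a blend of that category's mean target and the global mean target, where `m` controls how strongly small categories are pulled toward the global mean. The sketch below is a pure-Python illustration assuming the usual formula `(sum + prior * m) / (count + m)`; the function name and the demo data are hypothetical, and the `MEstimateEncoder` implementation may differ in details.

```python
from collections import defaultdict

def m_estimate_encode(categories, targets, m):
    """Return {category: encoded value} using (sum + prior * m) / (count + m)."""
    prior = sum(targets) / len(targets)  # global target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: (sums[cat] + prior * m) / (counts[cat] + m) for cat in sums}

cats_demo = ["A", "A", "A", "B"]    # hypothetical category column
target_demo = [8.0, 8.0, 8.0, 4.0]  # hypothetical target

no_smoothing = m_estimate_encode(cats_demo, target_demo, m=0)
smoothed = m_estimate_encode(cats_demo, target_demo, m=8)
print(no_smoothing)
print(smoothed)
```

With `m=0` each category maps to its raw mean. With `m=8` the single-row category `B` moves most of the way toward the global mean of 7.0, while the better-populated `A` moves less. This is the trade-off the search over `m` in the next cell explores.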

In [None]:
%%time
# Find the best m for target encoding.
def t_enc_baseline(m_list):
    """Return the baseline RMSE for each value of m used in target encoding.
    """
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)

    # Define dataset
    fold = 0
    #df_train = pd.read_csv('train_fold.csv')
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    
    num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
    onehot_cols = ['cat0', 'cat1', 'cat3', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2', 'cat4' due to the low MI scores
    ordinal_cols = ['cat9']
    cat_cols = onehot_cols + ordinal_cols
    
    y_train = X_train.pop('target')
    X_train = X_train[num_cols + onehot_cols + ordinal_cols]
    y_valid = X_valid.pop('target')
    X_valid = X_valid[num_cols + onehot_cols + ordinal_cols]
    
    # Define model
    model = XGBRegressor(tree_method='gpu_hist', gpu_id=0, predictor='gpu_predictor')
    
    scores = {}
    for m in m_list:
        # Target Encoding
        # Encode into new variables so that every m starts from the raw categorical columns.
        te = MEstimateEncoder(cols=cat_cols, m=m)
        X_train_te = te.fit_transform(X_train, y_train)
        X_valid_te = te.transform(X_valid)

        # Modeling - Training
        model.fit(X_train_te, y_train)
        valid_preds = model.predict(X_valid_te)
        
        # Modeling - Evaluation
        rmse = mean_squared_error(y_valid, valid_preds, squared=False)
        
        scores[f'm={m}'] = rmse
    return scores

_scores = t_enc_baseline(range(50))
print(f'The m with the best RMSE: {min(_scores, key=_scores.get)}')
print(f'The best RMSE: {min(_scores.values())}')

In [None]:
def objective(trial):
    """Model tuning with target encoding.
    """
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    # Select fold == 0 as valid dataset, others as train dataset.
    fold = 0
    #df_train = pd.read_csv('train_fold.csv')
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
    
    # Define dataset
    num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
    onehot_cols = ['cat0', 'cat1', 'cat3', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2', 'cat4' due to the low MI scores
    ordinal_cols = ['cat9']
    cat_cols = onehot_cols + ordinal_cols
    
    y_train = X_train.pop('target')
    X_train = X_train[num_cols + onehot_cols + ordinal_cols]
    y_valid = X_valid.pop('target')
    X_valid = X_valid[num_cols + onehot_cols + ordinal_cols]
    
    # Preprocessing - Target Encoding
    te = MEstimateEncoder(cols=cat_cols, m=8) # m is from previous step
    X_train = te.fit_transform(X_train, y_train)
    X_valid = te.transform(X_valid)
    
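    # Hyperparameters for XGBoost
    # Note: in XGBoost, 'reg_alpha' is an alias of 'alpha' and 'reg_lambda' is an alias of 'lambda',
    # so each regularization term is sampled twice below.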
    xgb_params = {
        'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
        'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
        'gamma': trial.suggest_loguniform('gamma', 1e-3, 10.0),
        'reg_alpha': trial.suggest_loguniform("reg_alpha", 1e-8, 100.0),
        'reg_lambda': trial.suggest_loguniform("reg_lambda", 1e-8, 100.0),
        'colsample_bytree': trial.suggest_float("colsample_bytree", 0.1, 1.0),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
    }
    
    model = XGBRegressor(
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            **xgb_params)
    
    """
    # Hyperparameters for LightGBM
    lgb_params = {
        'random_state': trial.suggest_categorical('random_state', [0, 42, 2021]),
        'num_iterations': trial.suggest_int('num_iterations', 100, 10000),
        'learning_rate': trial.suggest_float("learning_rate", 1e-2, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 7),
        'num_leaves': trial.suggest_int('num_leaves', 2, 100),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 100, 2000),
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.01, 0.99),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.01, 0.99),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    model = LGBMRegressor(
                device='gpu',
                gpu_platform_id=0,
                gpu_device_id=0,
                n_jobs=-1,
                metric='rmse',
                **lgb_params
    )
    """
    
    model.fit(
        X_train, y_train, 
        early_stopping_rounds=300,
        eval_set=[(X_valid, y_valid)],
        verbose=5000
    )
    
    valid_preds = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, valid_preds, squared=False)
    return rmse

In [None]:
%%time
#study = optuna.create_study(direction='minimize')
#study.optimize(objective, n_trials=200) # set n_trials
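
Every `model.fit` call in the objectives passes `early_stopping_rounds=300` together with an `eval_set`, so training can stop well before the sampled `n_estimators` is reached. The sketch below illustrates that patience rule in plain Python for a lower-is-better metric. It is not the XGBoost or LightGBM implementation; the function name and the scores are hypothetical. The way early stopping is configured has also changed between library releases, so the installed version's documentation is the reference.

```python
def best_round_with_patience(valid_scores, patience):
    """Return (best_round, rounds_run) for a lower-is-better metric."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(valid_scores):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, round_idx + 1
    return best_round, len(valid_scores)

# Hypothetical validation RMSE per boosting round.
rmse_per_round = [0.80, 0.75, 0.73, 0.74, 0.74, 0.735, 0.72]
print(best_round_with_patience(rmse_per_round, patience=3))
```

With `patience=3` the loop stops after six rounds and never reaches the better score in the last round; a larger patience would find it. A generous value such as 300 trades longer training for a lower risk of stopping on a temporary plateau.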

In [None]:
study.best_params

---

## Model Blending
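
The blending step produces two kinds of predictions per preprocessing solution: out-of-fold predictions for every training row (each row is predicted by the model that did not see it) and test predictions averaged over the folds. The sketch below shows only that bookkeeping. It uses hypothetical rows and a stand-in "model" that predicts the mean of its training targets, so it is an illustration of the mechanics rather than the notebook's pipeline.

```python
from statistics import mean

# Hypothetical rows: (id, kfold, target). A real run would load train_folds.csv.
rows = [
    (0, 0, 8.0), (1, 1, 7.0), (2, 0, 9.0),
    (3, 1, 6.0), (4, 2, 8.5), (5, 2, 7.5),
]
test_ids = [100, 101]
n_folds = 3

oof_preds = {}            # id -> prediction made while that row was held out
test_preds_per_fold = []  # one list of test predictions per fold

for fold in range(n_folds):
    train_targets = [y for _, k, y in rows if k != fold]
    valid_ids = [i for i, k, _ in rows if k == fold]

    # Stand-in "model": always predicts the mean of its training targets.
    prediction = mean(train_targets)

    oof_preds.update({i: prediction for i in valid_ids})
    test_preds_per_fold.append([prediction for _ in test_ids])

# Average the per-fold test predictions row by row.
test_mean_preds = [mean(col) for col in zip(*test_preds_per_fold)]
print(oof_preds)
print(test_mean_preds)
```

Keying the out-of-fold predictions by `id` is what later allows them to be joined back onto the training data as features for stacking, without any row having been predicted by a model trained on it.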

In [None]:
class Modeling:
    def __init__(self, train_path):
        import warnings
        warnings.filterwarnings('ignore')

        # Import datasets
        self.df_train = pd.read_csv(train_path)
        self.df_test = pd.read_csv('../input/30-days-of-ml/test.csv')
        #self.df_test = pd.read_csv('data/test.csv')
        self.sample_submission = pd.read_csv('../input/30-days-of-ml/sample_submission.csv')
        
        # Define features
        self.num_cols = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
        self.onehot_cols = ['cat0', 'cat1', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8'] # remove 'cat2' due to the low MI scores
        self.ordinal_cols = ['cat9']
        self.cat_cols = self.onehot_cols + self.ordinal_cols
        self.useful_features = self.num_cols + self.cat_cols
        self.target = 'target'
    
    # Preprocessing solution 0: Ordinal Encoding
    def _ordinal_encoding(self, X_train, X_valid, X_test, params=True):
        # Preprocessing - Ordinal Encoding
        oe = OrdinalEncoder()
        X_train[self.cat_cols] = oe.fit_transform(X_train[self.cat_cols])
        X_valid[self.cat_cols] = oe.transform(X_valid[self.cat_cols])
        X_test[self.cat_cols] = oe.transform(X_test[self.cat_cols])
        
        """no_outliers
        # 0.7172987346930846
        # XGBoost params
        xgb_params = {
            'alpha': 7.128681031027614,
            'lambda': 0.40760576474680843,
            'gamma': 0.08704298132127238,
            'reg_alpha': 25.377502919374336,
            'reg_lambda': 0.003401041649454036,
            'colsample_bytree': 0.1355660282707954,
            'subsample': 0.6999406375783235,
            'learning_rate': 0.02338550339980208,
            'n_estimators': 9263,
            'max_depth': 6,
            'random_state': 2021,
            'min_child_weight': 138
        }

        # 0.7174088504920006
        # LightGBM params
        lgb_params = {
            'random_state': 0, 
            'num_iterations': 9530, 
            'learning_rate': 0.018509357813869098, 
            'max_depth': 6, 
            'num_leaves': 98, 
            'min_data_in_leaf': 1772, 
            'lambda_l1': 0.0010866230909549698, 
            'lambda_l2': 1.6105154171511057e-05, 
            'feature_fraction': 0.09911317646202211, 
            'bagging_fraction': 0.8840672050147438, 
            'bagging_freq': 6, 
            'min_child_samples': 35
        }
        """

        """Full dataset"""
        # 0.7168956185375995
        # XGBoost params
        xgb_params = {
            'alpha': 0.41478790863509063, 
            'lambda': 4.533806139098733, 
            'gamma': 0.006523052455552593, 
            'reg_alpha': 16.714567692323264, 
            'reg_lambda': 6.321580437513598e-06, 
            'colsample_bytree': 0.11544585116842096, 
            'subsample': 0.8448523684136955, 
            'learning_rate': 0.061677285578690844, 
            'n_estimators': 8676, 
            'max_depth': 3, 
            'random_state': 0, 
            'min_child_weight': 268
        }

        # 0.716864914502225
        # LightGBM params
        lgb_params = {
            'random_state': 42, 
            'num_iterations': 9718, 
            'learning_rate': 0.014607325219450438, 
            'max_depth': 6, 
            'num_leaves': 11, 
            'min_data_in_leaf': 704, 
            'lambda_l1': 1.1906873176743155e-07, 
            'lambda_l2': 1.0479700691744536, 
            'feature_fraction': 0.13449900909384252, 
            'bagging_fraction': 0.6680657144501343, 
            'bagging_freq': 1, 
            'min_child_samples': 66
        }

        
        if params:
            return X_train, X_valid, X_test, xgb_params, lgb_params
        else:
            return X_train, X_valid, X_test
    
    # Preprocessing solution 1: One-hot Encoding + Ordinal Encoding
    def _onehot_encoding(self, X_train, X_valid, X_test):
        # Preprocessing - One-hot Encoding
        ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
        X_train_ohe = ohe.fit_transform(X_train[self.onehot_cols])
        X_valid_ohe = ohe.transform(X_valid[self.onehot_cols])
        X_test_ohe = ohe.transform(X_test[self.onehot_cols])

        X_train_ohe = pd.DataFrame(X_train_ohe, columns=[f"ohe_{i}" for i in range(X_train_ohe.shape[1])])
        X_valid_ohe = pd.DataFrame(X_valid_ohe, columns=[f"ohe_{i}" for i in range(X_valid_ohe.shape[1])])
        X_test_ohe = pd.DataFrame(X_test_ohe, columns=[f"ohe_{i}" for i in range(X_test_ohe.shape[1])])

        X_train = pd.concat([X_train.drop(columns=self.onehot_cols), X_train_ohe], axis=1)
        X_valid = pd.concat([X_valid.drop(columns=self.onehot_cols), X_valid_ohe], axis=1)
        X_test = pd.concat([X_test.drop(columns=self.onehot_cols), X_test_ohe], axis=1)
        
        # Preprocessing - Ordinal Encoding
        oe = OrdinalEncoder()
        X_train[self.ordinal_cols] = oe.fit_transform(X_train[self.ordinal_cols])
        X_valid[self.ordinal_cols] = oe.transform(X_valid[self.ordinal_cols])
        X_test[self.ordinal_cols] = oe.transform(X_test[self.ordinal_cols])
    
        """No outliers
        # 0.7174931253475558
        # XGBoost params
        xgb_params = {
            'alpha': 3.046687193123841,
            'lambda': 0.7302844649944737,
            'gamma': 0.10108768743909796,
            'reg_alpha': 14.711350393993625,
            'reg_lambda': 1.6855306764481926e-07,
            'colsample_bytree': 0.15006790036326567,
            'subsample': 0.9761751211889541,
            'learning_rate': 0.02730958701307226,
            'n_estimators': 7897,
            'max_depth': 4,
            'random_state': 0,
            'min_child_weight': 203
        }
        
        # 0.7172624587909345
        # LightGBM params
        lgb_params = {
            'random_state': 42, 
            'num_iterations': 6969, 
            'learning_rate': 0.014404708757048168, 
            'max_depth': 7, 
            'num_leaves': 21, 
            'min_data_in_leaf': 1121, 
            'lambda_l1': 4.1636932334315094e-07, 
            'lambda_l2': 1.0975422991510602e-08, 
            'feature_fraction': 0.08082581387850206, 
            'bagging_fraction': 0.6804475225598854, 
            'bagging_freq': 2, 
            'min_child_samples': 32
        }
        """
 
        """Full dataset"""
        # 0.7169803941400036
        # XGBoost params
        xgb_params = {
            'alpha': 0.006431298298825313, 
            'lambda': 1.9946701540008387, 
            'gamma': 0.004468866854966971, 
            'reg_alpha': 1.7419809828857386e-08, 
            'reg_lambda': 0.25665090431195203, 
            'colsample_bytree': 0.10016560933147275, 
            'subsample': 0.770411681261352, 
            'learning_rate': 0.01350994278419047, 
            'n_estimators': 8378, 
            'max_depth': 5, 
            'random_state': 0, 
            'min_child_weight': 237
        }

        # 0.7168422881316736
        # LightGBM params
        lgb_params = {
            'random_state': 0, 
            'num_iterations': 9121, 
            'learning_rate': 0.03762205881915334, 
            'max_depth': 5, 
            'num_leaves': 12, 
            'min_data_in_leaf': 1331, 
            'lambda_l1': 1.4246377549177525e-08, 
            'lambda_l2': 0.00031572480246719195, 
            'feature_fraction': 0.07760290667911449, 
            'bagging_fraction': 0.9766045388889536, 
            'bagging_freq': 4, 
            'min_child_samples': 50
        }
        
        return X_train, X_valid, X_test, xgb_params, lgb_params

    # Preprocessing solution 2: Ordinal Encoding + Standardization
    def _standardization(self, X_train, X_valid, X_test):
        # Preprocessing - Standardization
        scaler = StandardScaler()
        X_train[self.num_cols] = scaler.fit_transform(X_train[self.num_cols])
        X_valid[self.num_cols] = scaler.transform(X_valid[self.num_cols])
        X_test[self.num_cols] = scaler.transform(X_test[self.num_cols])
    
        """No outliers
        # 0.7172152365762312
        # XGBoost params
        xgb_params = {
            'alpha': 0.029925179326119784,
            'lambda': 0.12530061860157662,
            'gamma': 0.5415753114227984,
            'reg_alpha': 14.992919845445886,
            'reg_lambda': 0.42076728548917974,
            'colsample_bytree': 0.10022710624560974,
            'subsample': 0.5596856445758918,
            'learning_rate': 0.020866717779139694,
            'n_estimators': 6852,
            'max_depth': 7,
            'random_state': 2021,
            'min_child_weight': 62
        }
        
        # 0.7173410652198884
        # LightGBM params
        lgb_params = {
            'random_state': 0,
            'num_iterations': 6439,
            'learning_rate': 0.03625416364918611,
            'max_depth': 6,
            'num_leaves': 11,
            'min_data_in_leaf': 745,
            'lambda_l1': 4.1932281223524115e-06,
            'lambda_l2': 0.043343249414638636,
            'feature_fraction': 0.08623933710228435,
            'bagging_fraction': 0.7934935001504152,
            'bagging_freq': 3,
            'min_child_samples': 23
        }
        """

        """Full dataset"""
        # 0.7169754780128185
        # XGBoost params
        xgb_params = {
            'alpha': 0.001954751535110173, 
            'lambda': 0.4889428995375083, 
            'gamma': 0.2639993537540112, 
            'reg_alpha': 4.613139964829376, 
            'reg_lambda': 2.5644939695394116e-06, 
            'colsample_bytree': 0.10081771115519686, 
            'subsample': 0.7241988095847515, 
            'learning_rate': 0.02383200558701416, 
            'n_estimators': 6995, 
            'max_depth': 4, 
            'random_state': 42, 
            'min_child_weight': 198
        }

        # 0.7168529799564377
        # LightGBM params
        lgb_params = {
            'random_state': 2021, 
            'num_iterations': 7614, 
            'learning_rate': 0.013334143201183738, 
            'max_depth': 7, 
            'num_leaves': 18, 
            'min_data_in_leaf': 1270, 
            'lambda_l1': 6.801685953700497e-05, 
            'lambda_l2': 6.178728858466813e-08, 
            'feature_fraction': 0.13830499588135214, 
            'bagging_fraction': 0.8060814091341896, 
            'bagging_freq': 3, 
            'min_child_samples': 81
        }
        
        return X_train, X_valid, X_test, xgb_params, lgb_params

    # Preprocessing solution 3: Ordinal Encoding + Log transformation
    def _log_transformation(self, X_train, X_valid, X_test):
        # Preprocessing - Log transformation
        for col in self.num_cols:
            X_train[col] = np.log1p(X_train[col])
            X_valid[col] = np.log1p(X_valid[col])
            X_test[col] = np.log1p(X_test[col])

        """No outliers
        # 0.7172539872780895
        # XGBoost params
        xgb_params = {
            'alpha': 0.08862033338686888, 
            'lambda': 0.003553846716302233, 
            'gamma': 0.4097695581309838, 
            'reg_alpha': 17.808150656220917, 
            'reg_lambda': 1.6112661145526217, 
            'colsample_bytree': 0.11935885763757494, 
            'subsample': 0.7326515814471944, 
            'learning_rate': 0.04006687786137418, 
            'n_estimators': 5239, 
            'max_depth': 5, 
            'random_state': 2021, 
            'min_child_weight': 258
        }

        # 0.7174737448879298
        # LightGBM params
        lgb_params = {
            'random_state': 0,
            'num_iterations': 7945,
            'learning_rate': 0.05205269244224801,
            'max_depth': 6,
            'num_leaves': 9,
            'min_data_in_leaf': 1070,
            'lambda_l1': 1.0744924634974802e-07,
            'lambda_l2': 1.1250360028635182,
            'feature_fraction': 0.10421484055936374,
            'bagging_fraction': 0.916143112009066,
            'bagging_freq': 6,
            'min_child_samples': 20
        }
        """

        """Full dataset"""
        # 0.7170384818940932
        # XGBoost params
        xgb_params = {
            'alpha': 0.017253367743182032, 
            'lambda': 1.2312523239198236, 
            'gamma': 0.8870836430062957, 
            'reg_alpha': 9.3294011692425e-08, 
            'reg_lambda': 0.080494664534471, 
            'colsample_bytree': 0.12723293204878566, 
            'subsample': 0.5562373818875186, 
            'learning_rate': 0.01759177927013953, 
            'n_estimators': 7480, 
            'max_depth': 5, 
            'random_state': 42, 
            'min_child_weight': 216
        }

        # 0.716867820750228
        # LightGBM params
        lgb_params = {
            'random_state': 2021, 
            'num_iterations': 9135, 
            'learning_rate': 0.05204126206296579, 
            'max_depth': 3, 
            'num_leaves': 32, 
            'min_data_in_leaf': 1196, 
            'lambda_l1': 0.07110967369867664, 
            'lambda_l2': 9.981527842388462e-08, 
            'feature_fraction': 0.13002087379571273, 
            'bagging_fraction': 0.6510683790039721, 
            'bagging_freq': 2, 
            'min_child_samples': 19
        }
        
        return X_train, X_valid, X_test, xgb_params, lgb_params

    # Preprocessing solution 4: Target Encoding
    def _target_encoding(self, X_train, X_valid, X_test, y_train):
        # Preprocessing - Target Encoding
        te = MEstimateEncoder(cols=self.cat_cols, m=8) # m is from previous step
        X_train = te.fit_transform(X_train, y_train)
        X_valid = te.transform(X_valid)
        X_test = te.transform(X_test)
    
        """No outliers
        # 0.7172617296722674
        # XGBoost params
        xgb_params = {
            'alpha': 0.012609024116174448,
            'lambda': 0.7990281671135536,
            'gamma': 0.16689280834519887,
            'reg_alpha': 16.48576968441873,
            'reg_lambda': 4.83082534682402e-08,
            'colsample_bytree': 0.1162304168345657,
            'subsample': 0.9126362948665406,
            'learning_rate': 0.05528416190414117,
            'n_estimators': 9670,
            'max_depth': 5,
            'random_state': 42,
            'min_child_weight': 280
         }

        # 0.7173917173794985
        # LightGBM params
        lgb_params = {
            'random_state': 2021, 
            'num_iterations': 7977, 
            'learning_rate': 0.01618931564625682, 
            'max_depth': 5, 
            'num_leaves': 50, 
            'min_data_in_leaf': 890, 
            'lambda_l1': 0.003233614433753064, 
            'lambda_l2': 2.0001872037801434e-06, 
            'feature_fraction': 0.13638848986185334, 
            'bagging_fraction': 0.7045068716734475, 
            'bagging_freq': 2, 
            'min_child_samples': 79
        }
        """
        
        """Full dataset"""
        # 0.7169975934181431
        # XGBoost params
        xgb_params = {
            'alpha': 0.1211523885965823,
            'lambda': 0.00452864739396485,
            'gamma': 0.03948208038791913,
            'reg_alpha': 12.908845680463497,
            'reg_lambda': 3.120894405337636e-07,
            'colsample_bytree': 0.10449109016850185,
            'subsample': 0.7633088122674517,
            'learning_rate': 0.03246721588939738,
            'n_estimators': 6866,
            'max_depth': 5,
            'random_state': 0,
            'min_child_weight': 131
        }

        # 0.7169507121946433
        # LightGBM params
        lgb_params = {
            'random_state': 2021, 
            'num_iterations': 5027, 
            'learning_rate': 0.05104410422762626, 
            'max_depth': 3, 
            'num_leaves': 77, 
            'min_data_in_leaf': 440, 
            'lambda_l1': 0.05579764755559036, 
            'lambda_l2': 4.375963929072086e-08, 
            'feature_fraction': 0.14611731889635768, 
            'bagging_fraction': 0.9005654268392156, 
            'bagging_freq': 1, 
            'min_child_samples': 23
        }
        
        return X_train, X_valid, X_test, xgb_params, lgb_params
    
    def _xgboost_reg(self, xgb_params):
        model = XGBRegressor(
                    tree_method='gpu_hist',
                    gpu_id=0,
                    predictor='gpu_predictor',
                    n_jobs=-1,
                    **xgb_params
                )
        return model
    
    def _lightgbm_reg(self, lgb_params):
        model = LGBMRegressor(
                    device='gpu',
                    gpu_platform_id=0,
                    gpu_device_id=0,
                    n_jobs=-1,
                    metric='rmse',
                    **lgb_params
                )
        return model
    
    def blending(self, model: str):
        '''Model blending. Generate 5 predictions according to 5 data preprocessing solutions.
        
        Args:
            model: One of xgboost or lightgbm
            
        Returns:
            None
        '''
        assert model in ['xgboost', 'lightgbm'], "ValueError: model must be one of ['xgboost', 'lightgbm']!"
        
        # Loop preprocessing solutions
        for preprocessing_solution in range(5):
            final_valid_predictions = {} # store final predictions of X_valid for each preprocessing_solution
            final_test_predictions = [] # store final predictions of X_test for each preprocessing_solution
            scores = [] # store RMSE scores for each preprocessing_solution
            print(f"Data Preprocessing Solution: {preprocessing_solution}, Model: {model}")
            print("Training ...")
            # Loop KFolds
            for fold in range(5):
                # Data Preprocessing
                X_train = self.df_train[self.df_train.kfold != fold].reset_index(drop=True)
                X_valid = self.df_train[self.df_train.kfold == fold].reset_index(drop=True)
                X_test = self.df_test.copy()
                
                # get X_valid id
                X_valid_ids = X_valid.id.values.tolist()
                
                y_train = X_train.pop(self.target)
                X_train = X_train[self.useful_features] # excludes id and cat2
                y_valid = X_valid.pop(self.target)
                X_valid = X_valid[self.useful_features] # excludes id and cat2
                X_test = X_test[self.useful_features]
                
                # Ordinal Encoding
                if preprocessing_solution == 0:
                    X_train, X_valid, X_test, xgb_params, lgb_params = self._ordinal_encoding(X_train, X_valid, X_test)
                # One-hot Encoding + Ordinal Encoding
                elif preprocessing_solution == 1:
                    X_train, X_valid, X_test, xgb_params, lgb_params = self._onehot_encoding(X_train, X_valid, X_test)
                # Ordinal Encoding + Standardization
                elif preprocessing_solution == 2:
                    X_train, X_valid, X_test = self._ordinal_encoding(X_train, X_valid, X_test, params=False)
                    X_train, X_valid, X_test, xgb_params, lgb_params = self._standardization(X_train, X_valid, X_test)
                # Ordinal Encoding + Log Transformation
                elif preprocessing_solution == 3:
                    X_train, X_valid, X_test = self._ordinal_encoding(X_train, X_valid, X_test, params=False)
                    X_train, X_valid, X_test, xgb_params, lgb_params = self._log_transformation(X_train, X_valid, X_test)
                # Target Encoding
                elif preprocessing_solution == 4:
                    X_train, X_valid, X_test, xgb_params, lgb_params = self._target_encoding(X_train, X_valid, X_test, y_train)
                
                # Define model
                if model == 'xgboost':
                    reg = self._xgboost_reg(xgb_params)
                elif model == 'lightgbm':
                    reg = self._lightgbm_reg(lgb_params)
                
                # Modeling - Training
                reg.fit(
                    X_train, y_train, 
                    early_stopping_rounds=300,
                    eval_set=[(X_valid, y_valid)],
                    verbose=False
                )
                
                # Modeling - Evaluation and Inference
                valid_preds = reg.predict(X_valid)
                test_preds = reg.predict(X_test)
                
                final_valid_predictions.update(dict(zip(X_valid_ids, valid_preds))) # each of the 5 folds contributes a disjoint set of validation ids
                final_test_predictions.append(test_preds) # one set of test predictions per fold; averaged row by row later

                rmse = mean_squared_error(y_valid, valid_preds, squared=False)
                scores.append(rmse)
                print(f'Data Preprocessing Solution: {preprocessing_solution}, Fold: {fold}, RMSE: {rmse}')
                
            # Export results
            final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
            final_valid_predictions.columns = ["id", f"{model}_{preprocessing_solution}_pred"]
            final_valid_predictions.to_csv(f"{model}_{preprocessing_solution}_valid_pred.csv", index=False)

            test_mean_preds = np.mean(np.column_stack(final_test_predictions), axis=1) # get the mean prediction for each row
            test_mean_preds = pd.DataFrame({'id': self.sample_submission.id, f"{model}_{preprocessing_solution}_pred": test_mean_preds})
            test_mean_preds.to_csv(f"{model}_{preprocessing_solution}_test_pred.csv", index=False)
            print(f'Average RMSE: {np.mean(scores)}, STD of RMSE: {np.std(scores)}')
            print('-----------------------------------------------------------------')
    
    def _stacking_0(self, X_train, y_train):
        # Linear Regression stacking
        model = LinearRegression()
        model.fit(X_train, y_train)
        return model
    
    def _stacking_1(self, X_train, X_valid, y_train, y_valid):
        # XGBoost stacking
        '''
        # 0.7163584561404919
        xgb_params = {
            'alpha': 0.13769710880276775,
            'lambda': 1.6419112442610602,
            'gamma': 1.9542744201254334,
            'reg_alpha': 4.040410591598369e-06,
            'reg_lambda': 0.0005095115877677602,
            'colsample_bytree': 0.15171196846336138,
            'subsample': 0.7792590379642004,
            'learning_rate': 0.02306515169610187,
            'n_estimators': 1403,
            'max_depth': 6,
            'random_state': 42,
            'min_child_weight': 161
        }
        '''
        xgb_params = {
            'random_state': 1, 
            'booster': 'gbtree',
            'n_estimators': 7000,
            'learning_rate': 0.03,
            'max_depth': 2
        }

        model = XGBRegressor(
            #tree_method='gpu_hist',
            #gpu_id=0,
            #predictor='gpu_predictor',
            n_jobs=-1,
            **xgb_params
        )
        
        model.fit(
            X_train, y_train, 
            early_stopping_rounds=300,
            eval_set=[(X_valid, y_valid)],
            verbose=False
        )
        
        return model

    def _stacking_2(self, X_train, X_valid, y_train, y_valid):
        # LightGBM stacking
        # 0.7163310756021445
        lgb_params = {
            'random_state': 2021,
            'num_iterations': 4817,
            'learning_rate': 0.07421712914298478,
            'max_depth': 4,
            'num_leaves': 37,
            'min_data_in_leaf': 1306,
            'lambda_l1': 0.01948754316174006,
            'lambda_l2': 0.8125852153169449,
            'feature_fraction': 0.03432830339337113,
            'bagging_fraction': 0.8110509749600237,
            'bagging_freq': 2,
            'min_child_samples': 22 # NOTE: alias of 'min_data_in_leaf' above; LightGBM uses only one of the two values
        }
        
        model = LGBMRegressor(
                    #device='gpu',
                    #gpu_platform_id=0,
                    #gpu_device_id=0,
                    n_jobs=-1,
                    metric='rmse',
                    **lgb_params
                )

        model.fit(
            X_train, y_train, 
            early_stopping_rounds=300,
            eval_set=[(X_valid, y_valid)],
            verbose=False
        )
        
        return model 
                
    def stacking(self, data_path: str):
        '''Model stacking
        
        Args:
            data_path: Path of the directory that stores the datasets produced by model blending.
            
        Returns:
            None
        '''
        # Define features
        xgb_features = [f"xgboost_{i}_pred" for i in range(5)]
        lgb_features = [f"lightgbm_{i}_pred" for i in range(5)]
        useful_features = xgb_features + lgb_features
        
        # Import datasets
        df_train = pd.read_csv(f'{data_path}/train_folds.csv')
        df_test = pd.read_csv('../input/30-days-of-ml/test.csv')
        #data_path = '../input/kaggle30daysofml/kaggle-30days-of-ml/full_dataset'
        
        # Join datasets on id
        for i in range(5):
            _df_valid = pd.read_csv(f'{data_path}/xgboost_{i}_valid_pred.csv')
            df_train = df_train.merge(_df_valid, on='id', how='left')
            _df_test = pd.read_csv(f'{data_path}/xgboost_{i}_test_pred.csv')
            df_test = df_test.merge(_df_test, on='id', how='left')

        for i in range(5):
            _df_valid = pd.read_csv(f'{data_path}/lightgbm_{i}_valid_pred.csv')
            df_train = df_train.merge(_df_valid, on='id', how='left')
            _df_test = pd.read_csv(f'{data_path}/lightgbm_{i}_test_pred.csv')
            df_test = df_test.merge(_df_test, on='id', how='left')

        n_stackings = 3
        # Loop stackings
        for stacking in range(n_stackings):
            final_valid_predictions = {} # store final predictions of X_valid for each stacking
            final_test_predictions = [] # store final predictions of X_test for each stacking
            scores = [] # store the RMSE score of each fold
            print(f"Model Stacking: {stacking}")
            print("Training ...")
            # Loop KFolds
            for fold in range(5):
                X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
                X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
                X_test = df_test.copy()
                
                # get X_valid id
                X_valid_ids = X_valid.id.values.tolist()

                y_train = X_train.target
                y_valid = X_valid.target

                X_train = X_train[useful_features]
                X_valid = X_valid[useful_features]
                X_test = X_test[useful_features]

                # Modeling - Training
                if stacking == 0:
                    model = self._stacking_0(X_train, y_train)
                elif stacking == 1:
                    model = self._stacking_1(X_train, X_valid, y_train, y_valid)
                elif stacking == 2:
                    model = self._stacking_2(X_train, X_valid, y_train, y_valid)

                # Modeling - Evaluation and Inference
                valid_preds = model.predict(X_valid)
                test_preds = model.predict(X_test)
                
                final_valid_predictions.update(dict(zip(X_valid_ids, valid_preds)))
                final_test_predictions.append(test_preds)

                rmse = mean_squared_error(y_valid, valid_preds, squared=False)
                scores.append(rmse)
                print(f'Model Stacking: {stacking}, Fold: {fold}, RMSE: {rmse}')
                
            # Export results
            final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
            final_valid_predictions.columns = ["id", f"stacking_pred_{stacking}"]
            final_valid_predictions.to_csv(f"stacking_{stacking}_valid_pred.csv", index=False)

            test_mean_preds = np.mean(np.column_stack(final_test_predictions), axis=1) # get the mean prediction for each row
            test_mean_preds = pd.DataFrame({'id': self.sample_submission.id, f"stacking_pred_{stacking}": test_mean_preds})
            test_mean_preds.to_csv(f"stacking_{stacking}_test_pred.csv", index=False)
            print(f'Average RMSE: {np.mean(scores)}, STD of RMSE: {np.std(scores)}')
            print('-----------------------------------------------------------------')
            
    def predict(self, data_path: str):
        '''Predict and generate submission.csv
            
        Args:
            data_path: Path of the directory that contains train_folds.csv.
        Returns:
            None
        '''
        n_stackings = 3
        # Define features
        useful_features = [f"stacking_pred_{i}" for i in range(n_stackings)]
        
        # Import datasets
        df_train = pd.read_csv(f'{data_path}/train_folds.csv')
        df_test = pd.read_csv('../input/30-days-of-ml/test.csv')
        
        # Join datasets on id
        for i in range(n_stackings):
            _df_valid = pd.read_csv(f'stacking_{i}_valid_pred.csv')
            df_train = df_train.merge(_df_valid, on='id', how='left')
            _df_test = pd.read_csv(f'stacking_{i}_test_pred.csv')
            df_test = df_test.merge(_df_test, on='id', how='left')

        # Modeling
        final_predictions = []
        scores = []
        for fold in range(5):
            X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
            X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
            X_test = df_test.copy()

            y_train = X_train.target
            y_valid = X_valid.target

            X_train = X_train[useful_features]
            X_valid = X_valid[useful_features]
            X_test = X_test[useful_features]

            # Modeling - Training
            model = LinearRegression()
            model.fit(X_train, y_train)

            # Modeling - Evaluation and Inference
            valid_preds = model.predict(X_valid)
            test_preds = model.predict(X_test)
            final_predictions.append(test_preds)
            rmse = mean_squared_error(y_valid, valid_preds, squared=False)
            print(f'Fold: {fold}, RMSE: {rmse}')
            scores.append(rmse)

        # Generate submission.csv
        preds = np.mean(np.column_stack(final_predictions), axis=1)
        preds = pd.DataFrame({'id': self.sample_submission.id, 'target': preds})
        preds.to_csv('submission.csv', index=False)
        
        print('submission.csv generated successfully!')
        print(f'Average RMSE: {np.mean(scores)}, STD of RMSE: {np.std(scores)}')
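
The `stacking` and `predict` methods above share one pattern: for each fold, fit on the other folds, store the held-out predictions keyed by `id`, predict the full test set, and finally average the per-fold test predictions row by row. The sketch below reproduces that pattern on synthetic data with NumPy only. The toy data, the fold assignment, and the `fit_linear` helper (a least-squares stand-in for `LinearRegression`) are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): two base-model prediction columns that are
# noisy copies of the target, plus a precomputed fold assignment per row.
n_train, n_test, n_folds = 200, 50, 5
y = rng.normal(size=n_train)
X = np.column_stack([y + rng.normal(scale=0.5, size=n_train),
                     y + rng.normal(scale=0.5, size=n_train)])
X_test = rng.normal(size=(n_test, 2))
kfold = np.arange(n_train) % n_folds
ids = np.arange(n_train)


def fit_linear(X_fit, y_fit):
    """Hypothetical helper: least-squares fit with intercept, returns a predict function."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


oof_preds = {}   # id -> out-of-fold prediction
test_preds = []  # one test prediction vector per fold
scores = []      # one RMSE per fold

for fold in range(n_folds):
    train_mask = kfold != fold
    valid_mask = kfold == fold
    predict = fit_linear(X[train_mask], y[train_mask])

    valid_pred = predict(X[valid_mask])
    oof_preds.update(zip(ids[valid_mask], valid_pred))
    test_preds.append(predict(X_test))
    scores.append(np.sqrt(np.mean((y[valid_mask] - valid_pred) ** 2)))

# Average the per-fold test predictions row by row
test_mean = np.mean(np.column_stack(test_preds), axis=1)
print(f"Average RMSE: {np.mean(scores):.4f}, STD of RMSE: {np.std(scores):.4f}")
```

Because every training row belongs to exactly one validation fold, `oof_preds` ends up with one prediction per `id`, and none of those predictions comes from a model that saw that row during fitting. That is what makes the exported `*_valid_pred.csv` columns safe to use as features for the next level.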
        

In [None]:
modeling = Modeling(train_path='../input/30days-folds/train_folds.csv')

In [None]:
# modeling.blending(model='xgboost')
# modeling.blending(model='lightgbm')
# modeling.stacking(data_path='../input/kaggle30/no_cat2')
modeling.stacking(data_path='../input/kaggle30/full_dataset')

In [None]:
%%time
modeling.predict(data_path='../input/kaggle30/no_cat2')

In [None]:
%%time
#study = optuna.create_study(direction='minimize')
#study.optimize(objective, n_trials=200) # set n_trials

In [None]:
def predict(model):
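    '''Predict and generate submission.csv

    Args:
        model: A regressor whose fit() accepts early_stopping_rounds and eval_set,
            e.g. XGBRegressor or LGBMRegressor.
    '''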
    n_stackings = 3
    # Define features
    useful_features = [f"pred_{i}" for i in range(1, n_stackings+1)]

    # Import datasets
    data_path = '../input/kaggle30-2'
    df_train = pd.read_csv('../input/30days-folds/train_folds.csv')
    df_test = pd.read_csv('../input/30-days-of-ml/test.csv')
    sample_submission = pd.read_csv('../input/30-days-of-ml/sample_submission.csv')
    
    # Join datasets on id
    for i in range(n_stackings):
        _df_valid = pd.read_csv(f'{data_path}/stacking_{i}_valid_pred.csv')
        df_train = df_train.merge(_df_valid, on='id', how='left')
        _df_test = pd.read_csv(f'{data_path}/stacking_{i}_test_pred.csv')
        df_test = df_test.merge(_df_test, on='id', how='left')
        
    # Modeling
    final_predictions = []
    scores = []
    for fold in range(5):
        X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
        X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
        X_test = df_test.copy()

        y_train = X_train.target
        y_valid = X_valid.target

        X_train = X_train[useful_features]
        X_valid = X_valid[useful_features]
        X_test = X_test[useful_features]

        # Modeling - Training
        model.fit(
            X_train, y_train, 
            early_stopping_rounds=300,
            eval_set=[(X_valid, y_valid)],
            verbose=False
        )

        # Modeling - Evaluation and Inference
        valid_preds = model.predict(X_valid)
        test_preds = model.predict(X_test)
        final_predictions.append(test_preds)
        rmse = mean_squared_error(y_valid, valid_preds, squared=False)
        print(f'Fold: {fold}, RMSE: {rmse}')
        scores.append(rmse)

    # Generate submission.csv
    preds = np.mean(np.column_stack(final_predictions), axis=1)
    preds = pd.DataFrame({'id': sample_submission.id, 'target': preds})
    preds.to_csv('submission.csv', index=False)

    print('submission.csv generated successfully!')
    print(f'Average RMSE: {np.mean(scores)}, STD of RMSE: {np.std(scores)}')
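
Every fold loop above scores predictions with `mean_squared_error(..., squared=False)`. As far as I know, recent scikit-learn releases deprecated the `squared` argument and later removed it, so that call may fail on a newer environment. The sketch below is a dependency-free RMSE helper that gives the same number; the name `rmse` is an assumption for illustration and does not appear in the notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("y_true and y_pred must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 0, 0 and 2, so the mean squared error is 4/3
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Inside the notebook, `np.sqrt(mean_squared_error(y_valid, valid_preds))` is an equivalent drop-in that does not depend on the `squared` argument.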

---

## Predictions Submission

In [None]:
# XGBoost Standardization (Single)
# 0.7170814875570916, 0.71952

# XGBoost Standardization
# 0.7177871347165812, 0.71939

# XGBoost Ordinal Encoding
# 0.7177328219011401, 0.71936

# XGBoost Ordinal Encoding, Full dataset
# 0.7176631057759594, 0.71909