Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [None]:
# Familiar imports
import numpy as np
import pandas as pd

# For ordinal encoding categorical variables, splitting data
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# For training random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)

In [None]:
# Load the training and test data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Preview the data
train.head()
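
To see what `index_col=0` changes without touching the competition files, the sketch below reads a tiny in-memory CSV twice. The three rows and the `csv_text` variable are made up for illustration; only the column layout mimics `train.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv
csv_text = "id,cat0,cont0,target\n1,A,0.5,8.1\n2,B,0.3,7.9\n4,A,0.9,8.4\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the id values become the row labels and id is not a feature column
print(list(with_index.columns))
print(list(with_index.index))

# Without it, pandas builds a default 0..n-1 index and keeps id as an ordinary column
print(list(without_index.columns))
print(list(without_index.index))
```

Keeping `id` out of the feature columns matters later, because otherwise the model would treat the row identifier as a predictor.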

The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [None]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head()

# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding and save our encoded features as new variables `X` and `X_test`.

In [None]:
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# ordinal-encode categorical columns
X = features.copy()
X_test = test.copy()
ordinal_encoder = OrdinalEncoder()
X[object_cols] = ordinal_encoder.fit_transform(features[object_cols])
X_test[object_cols] = ordinal_encoder.transform(test[object_cols])

# Preview the ordinal-encoded features
X.head()
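
As a small illustration of what the encoder does, the sketch below fits `OrdinalEncoder` on a made-up single-column frame (`train_demo` and `test_demo` are hypothetical) and reuses the fitted mapping on a second frame, mirroring the `fit_transform` / `transform` pattern used above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for the categorical columns
train_demo = pd.DataFrame({"cat0": ["B", "A", "C", "A"]})
test_demo = pd.DataFrame({"cat0": ["C", "B"]})

enc = OrdinalEncoder()
train_codes = enc.fit_transform(train_demo)  # learns the category -> integer mapping
test_codes = enc.transform(test_demo)        # reuses the mapping learned on the training frame

print(enc.categories_)      # categories are sorted, so A -> 0, B -> 1, C -> 2
print(train_codes.ravel())
print(test_codes.ravel())
```

With the default settings, `transform` raises an error if the second frame contains a category that was not seen during fitting, so this pattern assumes the test data uses the same category values as the training data.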

Next, we break off a validation set from the training data.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
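
The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. The sketch below checks this on a made-up 20-row array (`X_demo` and `y_demo` are illustrative names) and shows that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# No test_size given, so 25% of the rows go to the validation split
a_train, a_valid, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(len(a_train), len(a_valid))

# The same random_state reproduces exactly the same split
b_train, b_valid, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(a_valid, b_valid))
```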

# Step 4: Train a model

Now that the data is prepared, the next step is to train a model.  

If you took the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course, then you learned about **[Random Forests](https://www.kaggle.com/dansbecker/random-forests)**.  In the code cell below, we fit a random forest model to the data.

## **Baseline**

In [None]:
# Define the model 
model = RandomForestRegressor(random_state=1)

# Train the model (will take about 10 minutes to run)
model.fit(X_train, y_train)
preds_valid = model.predict(X_valid)
print(mean_squared_error(y_valid, preds_valid, squared=False))

In [None]:
# Use the model to generate predictions
predictions_raf = model.predict(X_test)

## Using Optuna to tune our models and then ensemble them

Optuna automates hyperparameter tuning, which is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value controls the learning process. By contrast, the values of other parameters (typically node weights) are learned from the data.

The following steps demonstrate this for three regression models: **LightGBM, XGBoost, and Random Forest**.
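
To make the distinction concrete, here is a minimal sketch of a tuning loop. It uses plain random search over a single `Ridge` hyperparameter on synthetic data as a stand-in; it is not Optuna's sampling algorithm, and every name in it (`X_toy`, `ridge_objective`, the search range) is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

def ridge_objective(alpha):
    """Return the validation RMSE for one hyperparameter value."""
    # alpha is a hyperparameter we choose; the coefficients are parameters learned by fit()
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_va, model.predict(X_va))))

# Random search: sample alpha log-uniformly from [1e-3, 1e2] and keep the best value
best_alpha, best_rmse = None, float("inf")
for _ in range(30):
    alpha = 10 ** rng.uniform(-3, 2)
    rmse = ridge_objective(alpha)
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

print(f"best alpha: {best_alpha:.4f}, validation RMSE: {best_rmse:.4f}")
```

The objective functions defined in the cells below follow the same shape: they receive a candidate set of hyperparameters, train a model, and return a validation score for the optimizer to minimize.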

In [None]:
def objective(trial, X, y, name='xgb'):
        
    params = {'max_depth':trial.suggest_int('max_depth', 5, 50),
              'n_estimators':200000,
              #'boosting':trial.suggest_categorical('boosting', ['gbdt', 'dart', 'goss']),
              'subsample': trial.suggest_uniform('subsample', 0.2, 1.0),
              'colsample_bytree':trial.suggest_uniform('colsample_bytree', 0.2, 1.0),
              'learning_rate':trial.suggest_uniform('learning_rate', 0.007, 0.02),
              'reg_lambda':trial.suggest_uniform('reg_lambda', 0.01, 50),
              'reg_alpha':trial.suggest_uniform('reg_alpha', 0.01, 50),
              'min_child_samples':trial.suggest_int('min_child_samples', 5, 100),
              'num_leaves':trial.suggest_int('num_leaves', 10, 200),
              'n_jobs' : -1,
              'metric':'rmse',
              'max_bin':trial.suggest_int('max_bin', 300, 1000),
              'cat_smooth':trial.suggest_int('cat_smooth', 5, 100),
              'cat_l2':trial.suggest_loguniform('cat_l2', 1e-3, 100)}

    model = LGBMRegressor(**params)
                  
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    

    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              eval_metric=['rmse'],
              early_stopping_rounds=250, 
              categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
              #callbacks=[optuna.integration.LightGBMPruningCallback(trial, metric='rmse')],
              verbose=0)

    train_score = np.round(np.sqrt(mean_squared_error(y_train, model.predict(X_train))), 5)
    test_score = np.round(np.sqrt(mean_squared_error(y_val, model.predict(X_val))), 5)
                  
    print(f'TRAIN RMSE : {train_score} || TEST RMSE : {test_score}')
                  
    return test_score

In [None]:
from lightgbm import LGBMRegressor

import optuna
from functools import partial

In [None]:
optimize = partial(objective, X=X_train, y=y_train)

study_lgbm = optuna.create_study(direction='minimize')
# study_lgbm.optimize(optimize, n_trials=300)

# The optimization run above is commented out to keep the notebook's execution time short.
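
The `partial` call in this cell pre-binds the data arguments so that the optimizer only has to supply the `trial`. A minimal sketch with a made-up function (`demo_objective` is hypothetical, and an integer stands in for the trial object):

```python
from functools import partial

def demo_objective(trial, X, y, name="demo"):
    # `trial` stands in for the object an optimizer would pass on each call
    return trial * sum(X) + sum(y)

# Fix X and y ahead of time; the resulting callable takes only `trial`
bound = partial(demo_objective, X=[1, 2, 3], y=[10, 20])

print(bound(2))                                     # 2 * 6 + 30 = 42
print(demo_objective(2, X=[1, 2, 3], y=[10, 20]))   # same call written out in full
```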

In [None]:
from xgboost import XGBRegressor

# Define the model
def objective2(trial, X, y, name='xgb'):
        
    param = {
            'tree_method':'gpu_hist',  # this parameter means using the GPU when training our model to speedup the training process
            'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
            'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
            'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
            'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
            'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.009,0.01,0.012,0.014,0.016,0.018, 0.02]),
            'n_estimators': 4000,
            'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17,20]),
            'random_state': trial.suggest_categorical('random_state', [24, 48,2020]),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
        }

    model = XGBRegressor(**param)
                  
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    

    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              eval_metric=['rmse'],
              early_stopping_rounds=250,
              verbose=0)

    train_score = np.round(np.sqrt(mean_squared_error(y_train, model.predict(X_train))), 5)
    test_score = np.round(np.sqrt(mean_squared_error(y_val, model.predict(X_val))), 5)
                  
    print(f'TRAIN RMSE : {train_score} || TEST RMSE : {test_score}')
                  
    return test_score

In [None]:
optimize = partial(objective2, X=X_train, y=y_train)

study_xgboost = optuna.create_study(direction='minimize')
# study_xgboost.optimize(optimize, n_trials=300)

# The optimization run above is commented out to keep the notebook's execution time short.

In [None]:
# Define the model
def objective3(trial, X, y, name='xgb'):
        
    params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
            'max_depth': trial.suggest_int('max_depth', 4, 50),
            # scikit-learn requires an integer min_samples_split of at least 2
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 150),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 60),
        }

    model = RandomForestRegressor(**params)

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model.fit(X_train, y_train)

    train_score = np.round(np.sqrt(mean_squared_error(y_train, model.predict(X_train))), 5)
    test_score = np.round(np.sqrt(mean_squared_error(y_val, model.predict(X_val))), 5)
                  
    print(f'TRAIN RMSE : {train_score} || TEST RMSE : {test_score}')
                  
    return test_score

In [None]:
optimize = partial(objective3, X=X_train, y=y_train)

study_rnforst = optuna.create_study(direction='minimize')
# study_rnforst.optimize(optimize, n_trials=300)

In [None]:
def Save(X, y, name='lgb'):
        
    params = {'max_depth':31,
              'n_estimators':200000,
              #'boosting':trial.suggest_categorical('boosting', ['gbdt', 'dart', 'goss']),
              'subsample': 0.3258744755198934,
              'colsample_bytree':0.21016860689504144,
              'learning_rate':0.01796483827009817,
              'reg_lambda':18.9086111285175,
              'reg_alpha':30.781820149384465,
              'min_child_samples':44,
              'num_leaves':200,
              'n_jobs' : -1,
              'metric':'rmse',
              'max_bin':788,
              'cat_smooth':34,
              'cat_l2':10.706502107212572}

    model = LGBMRegressor(**params)
    k = 4
    # Hold out 20% of the data; the returned validation predictions are for this split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    num_val_samples = len(X_train) // k
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_val = np.array(X_val)
    X_test_values = np.array(X_test)
    preds_valid = np.zeros(len(X_val))
    predictions = np.zeros(len(X_test_values))
    for i in range(k):
        print('processing fold #%d' % i)
        # Fold i is used only for early stopping
        fold_val_data = X_train[i * num_val_samples: (i + 1) * num_val_samples]
        fold_val_labels = y_train[i * num_val_samples: (i + 1) * num_val_samples]
        # The remaining folds form the training data
        partial_train_data = np.concatenate(
            [X_train[:i * num_val_samples],
             X_train[(i + 1) * num_val_samples:]],
            axis=0)
        partial_train_labels = np.concatenate(
            [y_train[:i * num_val_samples],
             y_train[(i + 1) * num_val_samples:]],
            axis=0)

        # fit() retrains the model from scratch for each fold
        model.fit(partial_train_data, partial_train_labels,
                  eval_set=[(fold_val_data, fold_val_labels)],
                  eval_metric=['rmse'],
                  early_stopping_rounds=250,
                  categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                  verbose=0)
        # Average the hold-out and test predictions over the k fold models
        preds_valid += model.predict(X_val) / k
        predictions += model.predict(X_test_values) / k

    return predictions, preds_valid
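
The manual fold slicing in `Save` can also be expressed with scikit-learn's `KFold`, which yields the row indices for each fold. This is a sketch on a made-up 12-row array (`X_demo` is illustrative), not a drop-in replacement for the function above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 12 rows, 2 features
X_demo = np.arange(24).reshape(12, 2)

kf = KFold(n_splits=4, shuffle=False)
fold_sizes = []
validated_rows = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    # Each fold holds out a different contiguous quarter of the rows
    print(f"fold {fold}: validation rows {val_idx.tolist()}")
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated_rows.extend(val_idx.tolist())

# Every row is used for validation exactly once across the folds
print(sorted(validated_rows) == list(range(12)))
```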

In [None]:
predictions_lgb_valid=Save(X, y, name='lgb')[1]
predictions_lgb_valid

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
test_score_lgb = np.round(np.sqrt(mean_squared_error(y_val, predictions_lgb_valid)), 5)
test_score_lgb

In [None]:
plot = []
plotname = []

In [None]:
plot.append(test_score_lgb)
plotname.append('test_score_lgb')

In [None]:
# Save the predictions to a CSV file
predictions_lgb=Save(X, y, name='lgb')[0]
output_lgb = pd.DataFrame({'Id': X_test.index,
                       'target': predictions_lgb})
output_lgb.to_csv('submission1.csv', index=False)

In [None]:
def Save(X, y, name='xgb'):
        
    params = {'lambda': 5.448403383226208,
              'alpha': 0.004852988858499958,
              'colsample_bytree': 0.4,
              'subsample': 0.7,
              'learning_rate': 0.018,
              'max_depth': 15,
              'random_state': 24,
              'min_child_weight': 291
             }
    
    model = XGBRegressor(**params)
    k = 4
    # Hold out 20% of the data; the returned validation predictions are for this split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    num_val_samples = len(X_train) // k
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_val = np.array(X_val)
    X_test_values = np.array(X_test)
    preds_valid = np.zeros(len(X_val))
    predictions = np.zeros(len(X_test_values))
    for i in range(k):
        print('processing fold #%d' % i)
        # Fold i is used only for early stopping
        fold_val_data = X_train[i * num_val_samples: (i + 1) * num_val_samples]
        fold_val_labels = y_train[i * num_val_samples: (i + 1) * num_val_samples]
        # The remaining folds form the training data
        partial_train_data = np.concatenate(
            [X_train[:i * num_val_samples],
             X_train[(i + 1) * num_val_samples:]],
            axis=0)
        partial_train_labels = np.concatenate(
            [y_train[:i * num_val_samples],
             y_train[(i + 1) * num_val_samples:]],
            axis=0)

        # Fit on the training folds only, so the early-stopping fold is not seen during training
        model.fit(partial_train_data, partial_train_labels,
                  eval_set=[(fold_val_data, fold_val_labels)],
                  eval_metric=['rmse'],
                  early_stopping_rounds=250,
                  verbose=0)
        # Average the hold-out and test predictions over the k fold models
        preds_valid += model.predict(X_val) / k
        predictions += model.predict(X_test_values) / k

    return predictions, preds_valid

In [None]:
predictions_xgb_valid=Save(X, y, name='xgb')[1]
predictions_xgb_valid

In [None]:
test_score_xgb = np.round(np.sqrt(mean_squared_error(y_val, predictions_xgb_valid)), 5)
test_score_xgb

In [None]:
plot.append(test_score_xgb)
plotname.append('test_score_xgb')

In [None]:
# Save the predictions to a CSV file
predictions_xgb=Save(X, y, name='xgb')[0]
output_xgb = pd.DataFrame({'Id': X_test.index,
                       'target': predictions_xgb})
output_xgb.to_csv('submission2.csv', index=False)

In [None]:
def Save(X, y, name='rnf'):
        
    params = {'n_estimators': 54,
              'max_depth': 29,
              'min_samples_split': 121,
              'min_samples_leaf': 38
             }
    model = RandomForestRegressor(**params)
    k = 4
    # Hold out 20% of the data; the returned validation predictions are for this split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    num_val_samples = len(X_train) // k
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_val = np.array(X_val)
    X_test_values = np.array(X_test)
    preds_valid = np.zeros(len(X_val))
    predictions = np.zeros(len(X_test_values))
    for i in range(k):
        print('processing fold #%d' % i)
        # Train on every fold except fold i
        partial_train_data = np.concatenate(
            [X_train[:i * num_val_samples],
             X_train[(i + 1) * num_val_samples:]],
            axis=0)
        partial_train_labels = np.concatenate(
            [y_train[:i * num_val_samples],
             y_train[(i + 1) * num_val_samples:]],
            axis=0)

        # fit() retrains the forest from scratch for each fold
        model.fit(partial_train_data, partial_train_labels)
        # Average the hold-out and test predictions over the k fold models
        preds_valid += model.predict(X_val) / k
        predictions += model.predict(X_test_values) / k

    return predictions, preds_valid

In [None]:
predictions_rnf_valid =Save(X, y, name='rnf')[1]
predictions_rnf_valid

In [None]:
test_score_RNF = np.round(np.sqrt(mean_squared_error(y_val, predictions_rnf_valid)), 5)
test_score_RNF

In [None]:
plot.append(test_score_RNF)
plotname.append('test_score_RNF')

In [None]:
# Save the predictions to a CSV file
predictions_rnf=Save(X, y, name='rnf')[0]
output_rnf = pd.DataFrame({'Id': X_test.index,
                       'target': predictions_rnf})
output_rnf.to_csv('submission3.csv', index=False)

### Ensemble

In [None]:
predictions_ensemble_valid=(predictions_lgb_valid+predictions_xgb_valid+predictions_rnf_valid)/3

In [None]:
predictions_ensemble = (predictions_lgb + predictions_xgb + predictions_rnf)/3

In [None]:
predictions_ensemble_valid

In [None]:
test_score_ensemble = np.round(np.sqrt(mean_squared_error(y_val, predictions_ensemble_valid)), 5)
test_score_ensemble
plot.append(test_score_ensemble)
plotname.append('test_score_ensemble')
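
Why can a simple average help? The sketch below uses synthetic targets and three hypothetical "models" whose errors are independent, which is an idealized assumption. Real models trained on the same data usually make correlated errors, so the gain in practice tends to be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)

# Three hypothetical models: the true target plus independent noise
preds = [y_true + rng.normal(scale=0.5, size=1000) for _ in range(3)]

def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

individual = [rmse(y_true, p) for p in preds]
blend = rmse(y_true, (preds[0] + preds[1] + preds[2]) / 3)

print("individual RMSEs:", [round(s, 4) for s in individual])
print("averaged RMSE:   ", round(blend, 4))
```

Averaging independent errors partly cancels them, which is why the blended score comes out lower than each individual score in this idealized setting.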

In [None]:
import matplotlib.pyplot as plt

## Visualization

In [None]:
plt.bar(plotname,plot)

In the baseline code cell in Step 4, we set `squared=False` to get the root mean squared error (RMSE) on the validation data.
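
RMSE is the square root of the mean squared error, which is also how the tuning cells in this notebook compute it with `np.sqrt`. The sketch below checks this on four made-up values. Note that newer scikit-learn releases deprecate the `squared` argument, so taking the square root explicitly is the more portable form.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are 1, 0, -2, -1, so the squared errors are 1, 0, 4, 1
mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                       # square root of 1.5

print(mse, rmse)
```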

# Step 5: Submit to the competition

We'll save the ensemble predictions to a CSV file, which can then be submitted to the competition.

In [None]:
# Save the predictions to a CSV file
output_ensemble = pd.DataFrame({'Id': X_test.index,
                       'target': predictions_ensemble})
output_ensemble.to_csv('submission_new.csv', index=False)

## Next up:
1. Add model blending with the current models.
2. Add target encoding.
3. Add DNNs and RNNs to the model blend.