Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [3]:
# Familiar imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats
from scipy.stats import norm, skew #for some statistics
from scipy.special import boxcox1p



# For ordinal encoding categorical variables, splitting data, pipeline, and so on
from sklearn.preprocessing import OrdinalEncoder, PowerTransformer, StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, KFold, cross_val_score 
from sklearn.pipeline import make_pipeline

# Models: linear, kernel ridge, tree ensembles, and gradient boosting
from sklearn.linear_model import ElasticNet, Lasso 
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# base
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone


from sklearn.metrics import mean_squared_error


In [4]:
pd.set_option("display.max_columns", 100)

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
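
As a small illustration of what `index_col=0` changes, the sketch below reads the same CSV text twice. The data is a made-up stand-in for `train.csv` (only the column names `id`, `cat0`, `cont0`, and `target` follow this competition), so treat it as an assumption-laden toy rather than the real file.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv; the values are invented for illustration only.
csv_text = "id,cat0,cont0,target\n1,B,0.20,8.1\n2,A,0.74,8.3\n4,B,0.43,7.9\n"

# Without index_col, pandas creates a default 0..n-1 index and keeps `id` as a column.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column (`id`) becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))  # `id` is listed among the columns
print(list(df_indexed.columns))  # `id` is no longer a column
print(list(df_indexed.index))    # the ids are now the row labels
```

Keeping `id` in the index means it is not accidentally treated as a feature when the model is trained.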

In [5]:
# Load the training data
train = pd.read_csv("train.csv", index_col=0)
test = pd.read_csv("test.csv", index_col=0)

# Preview the data
train.head()
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
cont0,300000.0,0.527335,0.230599,-0.118039,0.405965,0.497053,0.66806,1.058443
cont1,300000.0,0.460926,0.214003,-0.069309,0.310494,0.427903,0.615113,0.887253
cont2,300000.0,0.490498,0.253346,-0.056104,0.300604,0.502462,0.647512,1.034704
cont3,300000.0,0.496689,0.219199,0.130676,0.329783,0.465026,0.664451,1.03956
cont4,300000.0,0.491654,0.240074,0.255908,0.284188,0.39047,0.696599,1.055424
cont5,300000.0,0.510526,0.228232,0.045915,0.354141,0.488865,0.669625,1.067649
cont6,300000.0,0.467476,0.210331,-0.224689,0.342873,0.429383,0.573383,1.111552
cont7,300000.0,0.537119,0.21814,0.203763,0.355825,0.504661,0.703441,1.032837
cont8,300000.0,0.498456,0.23992,-0.260275,0.332486,0.439151,0.606056,1.040229
cont9,300000.0,0.474872,0.218007,0.117896,0.306874,0.43462,0.614333,0.982922


The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [6]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cont0,cont1,cont2,cont3,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
1,B,B,B,C,B,B,A,E,C,N,0.20147,-0.014822,0.669699,0.136278,0.610706,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985
2,B,B,A,A,B,D,A,F,A,O,0.743068,0.367411,1.021605,0.365798,0.276853,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083
3,A,A,A,C,B,D,A,D,A,F,0.742708,0.310383,-0.012673,0.576957,0.285074,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846
4,B,B,A,C,B,D,A,E,C,K,0.429551,0.620998,0.577942,0.28061,0.284667,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682
6,A,A,A,C,B,D,A,E,A,N,1.058291,0.367492,-0.052389,0.232407,0.287595,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823


# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding and save our encoded features as new variables `X` and `X_test`.
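
To make the idea concrete, here is a minimal sketch of ordinal encoding on a toy frame. The frames `toy_train` and `toy_test` and their values are hypothetical; only `OrdinalEncoder` itself is the class used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data, shaped like the cat0..cat9 columns.
toy_train = pd.DataFrame({"cat0": ["B", "A", "B"], "cat1": ["C", "A", "B"]})
toy_test = pd.DataFrame({"cat0": ["A", "B"], "cat1": ["B", "C"]})

encoder = OrdinalEncoder()

# Learn the category -> integer mapping on the training data only ...
train_encoded = encoder.fit_transform(toy_train)
# ... and reuse the same mapping for the test data.
test_encoded = encoder.transform(toy_test)

print(encoder.categories_)  # categories per column, in sorted order
print(train_encoded)
print(test_encoded)
```

Fitting on the training data and only transforming the test data keeps the two encodings consistent. With the default settings, a category that appears only in the test data raises an error at `transform` time.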

In [7]:
def add_poly_features(train, test, cols, concate=True, poly_degree=2):
    """Build polynomial features from `cols`, fitted on train and applied to test.

    If `concate` is True, the new columns are appended to the inputs;
    otherwise only the polynomial features are returned.
    """
    poly = PolynomialFeatures(poly_degree)
    poly_train = pd.DataFrame(poly.fit_transform(train[cols]), index=train.index)
    poly_test = pd.DataFrame(poly.transform(test[cols]), index=test.index)

    # Prefix the generated columns so they are easy to identify
    poly_train = poly_train.add_prefix('poly_')
    poly_test = poly_test.add_prefix('poly_')

    if concate:
        return (pd.concat([train, poly_train], axis=1),
                pd.concat([test, poly_test], axis=1))

    return poly_train, poly_test
    
    


        
# List of categorical columns
object_cols_long = [col for col in features.columns if 'cat' in col]
cont_cols = [col for col in features.columns if 'con' in col]


new_features = features 

In [8]:
# ordinal-encode categorical columns
X = new_features.copy()
X_test = test.copy()

#cols = ['cont0', 'cont3', 'cont9']
# add poly features
#X, X_test = add_poly_features(X, X_test, cols)


# encoding
ordinal_encoder = OrdinalEncoder()
X[object_cols_long] = ordinal_encoder.fit_transform(X[object_cols_long])
X_test[object_cols_long] = ordinal_encoder.transform(X_test[object_cols_long])


# scaling
# pr_transformer = PowerTransformer(standardize=True)
# cols = ['cont0', 'cont1', 'cont3', 'cont4']
# cols = cont_cols
# X[cols] = pr_transformer.fit_transform(X[cols].values)
# X_test[cols] = pr_transformer.transform(X_test[cols].values)

In [9]:
#new_features['target'] = train['target']

In [10]:
# _, ax = plt.subplots(figsize=(18, 12))
# sns.heatmap(new_features.corr(), annot=True, square=True,linewidths=.5, ax=ax);

In [11]:
# new_features.drop('target', inplace=True, axis=1)

In [12]:
new_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 1 to 499999
Data columns (total 24 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   cat0    300000 non-null  object 
 1   cat1    300000 non-null  object 
 2   cat2    300000 non-null  object 
 3   cat3    300000 non-null  object 
 4   cat4    300000 non-null  object 
 5   cat5    300000 non-null  object 
 6   cat6    300000 non-null  object 
 7   cat7    300000 non-null  object 
 8   cat8    300000 non-null  object 
 9   cat9    300000 non-null  object 
 10  cont0   300000 non-null  float64
 11  cont1   300000 non-null  float64
 12  cont2   300000 non-null  float64
 13  cont3   300000 non-null  float64
 14  cont4   300000 non-null  float64
 15  cont5   300000 non-null  float64
 16  cont6   300000 non-null  float64
 17  cont7   300000 non-null  float64
 18  cont8   300000 non-null  float64
 19  cont9   300000 non-null  float64
 20  cont10  300000 non-null  float64
 21  cont11  30

Next, we break off a validation set from the training data.

In [13]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

In [14]:
X_train.shape

(225000, 24)
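
The shape above follows from the default behavior of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out, which leaves 225,000 of the 300,000 training rows. A minimal sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical names) shows the same proportion:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy rows: row i holds the features [2*i, 2*i + 1] and the target i.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# No test_size given, so 25% of the rows go to the validation split.
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_va.shape)
```

Features and targets are shuffled together, so each row of `X_tr` still lines up with its entry in `y_tr`.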

# Step 4.1: Predict using Stacking



In [26]:
# helper functions
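def cv_score(model, X, y, scoring_method="neg_mean_squared_error", shuffle=True, k=5):
    # random_state may only be set when shuffle=True; KFold raises an error otherwise
    kfolds = KFold(n_splits=k,
                   shuffle=shuffle,
                   random_state=42 if shuffle else None)

    return cross_val_score(model,
                           X, y,
                           scoring=scoring_method,
                           cv=kfolds)

def rmse_score(model, X_train, y_train, X_valid, y_valid, **kwargs):
    """Fit `model` on the training split and return its RMSE on the validation split."""
    model.fit(X_train, y_train, **kwargs)
    predictions = model.predict(X_valid)

    # mean_squared_error expects (y_true, y_pred); squared=False returns the RMSE
    return mean_squared_error(y_valid, predictions, squared=False)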

   
def kfold_train_predict(model, X, y, X_test, model_name=None, k=10, shuffle=True, use_eval_set=False, multi=False, store=True, **kwargs):
    """Train `model` on k folds and return (test_predictions, oof_predictions).

    Test predictions are averaged over the k fold models. Out-of-fold (OOF)
    predictions hold, for each training row, the prediction made by the model
    of the fold in which that row was held out.
    """
    kfolds = KFold(n_splits=k,
                   shuffle=shuffle,
                   random_state=42 if shuffle else None)

    test_predictions = np.zeros(len(X_test))
    t_valid_predictions = np.zeros(len(y))
    valid_scores = []

    for fold, (train_index, valid_index) in enumerate(kfolds.split(X)):
        X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        # Use a fresh, unfitted copy for every fold. RegressorsMixer re-clones its
        # members in __init__, which does not satisfy sklearn's clone() check,
        # so its own clone() method is used when multi=True.
        model_ = model.clone() if multi else clone(model)

        if use_eval_set:
            model_.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], **kwargs)
        else:
            model_.fit(X_train, y_train, **kwargs)

        # validation score
        valid_predictions = model_.predict(X_valid)
        score = mean_squared_error(y_valid, valid_predictions, squared=False)
        valid_scores.append(score)

        t_valid_predictions[valid_index] = valid_predictions
        # test predictions, averaged over the folds
        test_predictions += model_.predict(X_test) / k

        print('Fold:{} score:{:.4f}'.format(fold + 1, score))

    print('average score:{:.4f} ({:.4f})'.format(np.mean(valid_scores), np.std(valid_scores)))

    # store the predictions so they can be reused for blending or stacking
    if store:
        if model_name is None:
            model_name = model.__class__.__name__

        test_df = pd.DataFrame({'Id': X_test.index,
                                'feature': test_predictions})
        valid_df = pd.DataFrame({'Id': X.index,
                                 'feature': t_valid_predictions})
        test_df.to_csv('{}_{}_test.csv'.format(model_name, k), index=True)
        valid_df.to_csv('{}_{}_valid.csv'.format(model_name, k), index=True)

    return test_predictions, t_valid_predictions

### Base models

In [16]:

# lasso
lasso = make_pipeline(StandardScaler(), Lasso(alpha =0.00005, random_state=1))

# Elastic net
e_net = make_pipeline(StandardScaler(), ElasticNet(alpha=0.00005, l1_ratio=.9, random_state=3))

# extra-tree
extree = make_pipeline(StandardScaler(), ExtraTreesRegressor(max_depth=5, n_jobs=-1))

# random forest
rfr = RandomForestRegressor(max_depth=5,  n_jobs=-1)

# kernel ridge regression
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=1, coef0=2.5)

# gradient boosting
gb = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, 
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

xgb_params = {'n_estimators': 10000,
              'learning_rate': 0.35,
              'subsample': 0.926,
              'colsample_bytree': 0.84,
              'max_depth': 2,
              'booster': 'gbtree', 
              'reg_lambda': 35.1,
              'reg_alpha': 34.9,
              'random_state': 42,
              'n_jobs': 4}

xgb =  XGBRegressor(**xgb_params)


catb = CatBoostRegressor(iterations=6800,
                  learning_rate=0.93,
                  loss_function="RMSE",
                  random_state=42,
                  verbose=0,
                  thread_count=4,
                  depth=1,
                  l2_leaf_reg=3.28)

SEED = 7770777
params_lgb = {
    "n_estimators": 10000,   
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.007899156646724397,
    "num_leaves": 77,
    "max_depth": 77,
    "feature_fraction": 0.2256038826485174,
    "bagging_fraction": 0.7705303688019942,
    "min_child_samples": 290,
    "reg_alpha": 9.562925363678952,
    "reg_lambda": 9.355810045480153,
    "max_bin": 772,
    "min_data_per_group": 177,
    "bagging_freq": 1,
    "cat_smooth": 96,
    "cat_l2": 17,
    "verbosity": -1,
    "bagging_seed": SEED,
    "feature_fraction_seed": SEED,
    "verbose_eval":1000,
    "seed": SEED
}

lgb = LGBMRegressor(**params_lgb)

#                    objective='regression',
#                    boosting_type="gbdt"
#                    learning_rate=0.01, 
#                    n_estimators=10000, 
#                    eval_set=[(X_valid, y_valid)],
#                    early_stopping_rounds=200,
#                    eval_metric='RMSE')


In [24]:
# baseline scores of each model
score_ = rmse_score(lasso, X_train.values, y_train.values, X_valid.values, y_valid.values)
print("\nLasso score: {:.4f} \n".format(score_))



Lasso score: 0.7416 



In [25]:
score_ = rmse_score(e_net, X_train, y_train, X_valid, y_valid)
print("ElasticNet score: {:.4f} \n".format(score_))


ElasticNet score: 0.7416 



In [34]:
score_ = rmse_score(xgb, X_train, y_train, X_valid, y_valid, early_stopping_rounds=200,
                   eval_set=[(X_valid, y_valid)], verbose=False)
print("Xgboost score: {:.4f} \n".format(score_))

Xgboost score: 0.7200 



In [35]:
score_ = rmse_score(lgb, X_train, y_train, X_valid, y_valid, 
                   eval_set=[(X_valid, y_valid)], 
                   early_stopping_rounds=200, verbose=False)
print("LGBM score: {:.4f} \n" .format(score_))

LGBM score: 0.7204 



In [39]:
score_ = rmse_score(catb, X_train, y_train, X_valid, y_valid)
print("CatBoostRegressor score: {:.4f} \n" .format(score_))

KeyboardInterrupt: 

### First approach: weighted averaging of predictions

In [59]:
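class RegressorsMixer(BaseEstimator, RegressorMixin):
    """Weighted blend of several regressors.

    models  : list of estimators (cloned on construction)
    kwargs  : list of dicts with per-model keyword arguments for fit()
    weights : list of blend weights, one per model
    """
    def __init__(self, models, kwargs, weights):
        self.models = [clone(model) for model in models]
        self.weights = weights
        self.kwargs = kwargs

    def fit(self, X, y):
        # Fit every base model on the same data, each with its own fit arguments
        for model, kwargs in zip(self.models, self.kwargs):
            model.fit(X, y, **kwargs)

        return self

    def predict(self, X):
        # One column per model, then a weighted sum across the columns
        predictions = np.column_stack([model.predict(X) for model in self.models])
        return np.sum(predictions * self.weights, axis=1)

    def clone(self):
        # Unfitted copy; __init__ re-clones the member models
        return RegressorsMixer(self.models, self.kwargs, self.weights)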

In [60]:
mix_regressors = RegressorsMixer(models=[lgb, xgb, catb],
                                 kwargs=[
                                         {'early_stopping_rounds':100,'eval_set':[(X_valid, y_valid)],'verbose':False},
                                         {'early_stopping_rounds':100,'eval_set':[(X_valid, y_valid)],'verbose':False},
                                           {}],
                                 weights=[0.35, 0.45, 0.20])

In [20]:
score_ = rmse_score(mix_regressors, X_train, y_train, X_valid, y_valid)
print("Averaged score from mix of models: {:.4f} \n" .format(score_))

Averaged score from mix of models: 0.7191 
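
The blending step inside `RegressorsMixer.predict` is just a weighted sum across the per-model predictions. A minimal sketch with made-up numbers (the three prediction arrays are hypothetical; the weights are the ones passed to `mix_regressors`):

```python
import numpy as np

# Hypothetical predictions from three models for four rows (made-up values).
pred_lgb = np.array([8.0, 8.2, 7.9, 8.5])
pred_xgb = np.array([8.1, 8.3, 7.8, 8.4])
pred_cat = np.array([7.9, 8.1, 8.0, 8.6])

# Same weights as mix_regressors: one weight per model, summing to 1.
weights = np.array([0.35, 0.45, 0.20])

# One column per model -> shape (4, 3).
stacked = np.column_stack([pred_lgb, pred_xgb, pred_cat])

# Broadcasting multiplies each column by its weight; summing over axis=1 blends the rows.
blended = np.sum(stacked * weights, axis=1)

print(blended)
```

Because the weights are non-negative and sum to 1, every blended value lies between the smallest and largest model prediction for that row.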



<hr>

### Second approach: simple averaging of predictions with folds on each model

In this approach, I run k-fold cross-validation for each model and average its test predictions across the folds.
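
The fold-averaging idea can be sketched on synthetic data. Everything below (the toy arrays, the choice of `LinearRegression`, and `k = 5`) is an assumption for illustration; the notebook's own helper is `kfold_train_predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X, y and X_test.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_new = rng.normal(size=(10, 3))

k = 5
kfolds = KFold(n_splits=k, shuffle=True, random_state=42)

oof_predictions = np.zeros(len(y_toy))   # one out-of-fold prediction per training row
test_predictions = np.zeros(len(X_new))  # running average over the k fold models

for train_idx, valid_idx in kfolds.split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    # Each row is predicted by the one model that never saw it during training.
    oof_predictions[valid_idx] = model.predict(X_toy[valid_idx])
    # Every fold model contributes 1/k of the final test prediction.
    test_predictions += model.predict(X_new) / k

oof_rmse = np.sqrt(np.mean((oof_predictions - y_toy) ** 2))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

The out-of-fold predictions cover the whole training set without leakage, which is what makes them usable later as inputs for blending weights or a second-level model.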


In [27]:
lasso_test_results, lasso_valid_results = kfold_train_predict(lasso, X, y, X_test)

Fold:1 score:0.7393
Fold:2 score:0.7376
Fold:3 score:0.7360
Fold:4 score:0.7408
Fold:5 score:0.7443
Fold:6 score:0.7353
Fold:7 score:0.7396
Fold:8 score:0.7392
Fold:9 score:0.7430
Fold:10 score:0.7369
average score:0.7392 (0.0028)


In [28]:
enet_test_results, enet_valid_results = kfold_train_predict(e_net, X, y, X_test)

Fold:1 score:0.7393
Fold:2 score:0.7376
Fold:3 score:0.7360
Fold:4 score:0.7408
Fold:5 score:0.7443
Fold:6 score:0.7353
Fold:7 score:0.7396
Fold:8 score:0.7392
Fold:9 score:0.7430
Fold:10 score:0.7369
average score:0.7392 (0.0028)


In [24]:
rfr_test_results, rfr_valid_results = kfold_train_predict(rfr, X, y, X_test)

Fold:1 score:0.7402
Fold:2 score:0.7392
Fold:3 score:0.7382


KeyboardInterrupt: 

In [29]:
extree_test_results, extree_valid_results = kfold_train_predict(extree, X, y, X_test)

Fold:1 score:0.7398
Fold:2 score:0.7386
Fold:3 score:0.7378
Fold:4 score:0.7412
Fold:5 score:0.7459
Fold:6 score:0.7362
Fold:7 score:0.7413
Fold:8 score:0.7408
Fold:9 score:0.7445
Fold:10 score:0.7381
average score:0.7404 (0.0028)


In [30]:
xgb_test_results, xgb_valid_results = kfold_train_predict(xgb, X, y, X_test, use_eval_set=True,
                   early_stopping_rounds=200,
                   verbose=False)

Fold:1 score:0.7164
Fold:2 score:0.7158
Fold:3 score:0.7153
Fold:4 score:0.7172
Fold:5 score:0.7218
Fold:6 score:0.7151
Fold:7 score:0.7174
Fold:8 score:0.7184
Fold:9 score:0.7200
Fold:10 score:0.7125
average score:0.7170 (0.0025)


In [31]:
lgb_test_results, lgb_valid_results = kfold_train_predict(lgb, X, y, X_test, use_eval_set=True,
                   early_stopping_rounds=200,
                   verbose=False)

Fold:1 score:0.7170
Fold:2 score:0.7161
Fold:3 score:0.7159
Fold:4 score:0.7179
Fold:5 score:0.7219
Fold:6 score:0.7151
Fold:7 score:0.7181
Fold:8 score:0.7186
Fold:9 score:0.7202
Fold:10 score:0.7133
average score:0.7174 (0.0024)


In [32]:
catb_test_results, catb_valid_results = kfold_train_predict(catb, X, y, X_test)

Fold:1 score:0.7174
Fold:2 score:0.7176
Fold:3 score:0.7162
Fold:4 score:0.7186
Fold:5 score:0.7221
Fold:6 score:0.7156
Fold:7 score:0.7189
Fold:8 score:0.7195
Fold:9 score:0.7207
Fold:10 score:0.7143
average score:0.7181 (0.0022)


In [None]:
#mix_test_results, mix_valid_results = kfold_train_predict(mix_regressors, X, y, X_test, multi=True)

In [98]:
# Find the blend weights that best fit the target, based on the validation (out-of-fold) predictions
import cvxpy as cp


def get_weights(A, y):
    """
    A is the matrix of validation predictions (one column per model). We look for
    the linear combination weights 'x' that best solve Ax = y.
    The objective is the sum of squares ||Ax - y||^2; minimizing it also minimizes the RMSE.
    """
    n = A.shape[1] # number of models
    x = cp.Variable(n) # one weight per model
    b = np.array(y)
    objective = cp.Minimize(cp.sum_squares(A@x - b))
    # each weight is bounded to [-1, 1] (negative weights are allowed) and the weights must sum to 1
    constraints = [-1 <= x, x <= 1, cp.sum(x) == 1] 
    prob = cp.Problem(objective, constraints)
    prob.solve()
    
    # Print result: convert the optimal sum of squares into an RMSE.
    print("\nThe optimal RMSE is", np.sqrt(prob.value/len(y)))
    print("A solution x is")
    print(x.value)
    return x.value
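
For intuition, the same least-squares problem can be solved in closed form if we keep only the sum-to-one constraint and drop the bounds on the individual weights. This is a simplification of `get_weights`, not a replacement for it: the helper name `sum_to_one_weights` and the toy data are hypothetical, and the result matches the constrained solver only when the bounds are not active.

```python
import numpy as np


def sum_to_one_weights(A, y):
    """Least-squares blend weights subject only to sum(x) == 1 (no bounds)."""
    n = A.shape[1]
    ones = np.ones(n)
    # KKT system of: minimize ||A x - y||^2 subject to 1^T x = 1
    #   [ 2 A^T A   1 ] [ x      ]   [ 2 A^T y ]
    #   [ 1^T       0 ] [ lambda ] = [ 1       ]
    kkt = np.block([[2 * A.T @ A, ones[:, None]],
                    [ones[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([2 * A.T @ y, [1.0]])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:n]  # the last entry is the Lagrange multiplier


# Toy check: build a target that is an exact blend of three made-up prediction columns.
rng = np.random.default_rng(0)
A_toy = rng.normal(loc=8.0, scale=0.5, size=(50, 3))
true_w = np.array([0.2, 0.5, 0.3])
y_toy = A_toy @ true_w

w = sum_to_one_weights(A_toy, y_toy)
print(w, w.sum())
```

On real out-of-fold predictions the columns are highly correlated, so the unconstrained weights can become large in magnitude; the bounds in `get_weights` guard against that.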


In [1]:
valid_stack = [#mix_valid_results,
               extree_valid_results, 
               lasso_valid_results,
               enet_valid_results,
               #xgb_valid_results, 
               #lgb_valid_results,
               #catb_valid_results
            ]

test_stack = [#mix_test_results,
              extree_test_results, 
              lasso_test_results,
              enet_test_results,
              #xgb_test_results,
              #lgb_test_results,
              #catb_test_results
            ]
                
#A = np.column_stack(valid_stack)

#w = get_weights(A, y)

#results = np.sum(np.column_stack(test_stack) * w, axis=1)
#results = lgb_test_results

NameError: name 'extree_valid_results' is not defined

In [205]:
X_train_ = pd.DataFrame(np.column_stack(valid_stack))
X_test_ =  pd.DataFrame(np.column_stack(test_stack))

In [None]:
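# Second-level (meta) model: a random forest trained on the stacked out-of-fold predictions.
# Note that its outputs overwrite lasso_test_results and lasso_valid_results.
lasso_test_results, lasso_valid_results = kfold_train_predict(RandomForestRegressor(), X_train_, y, X_test_, k=5)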

In [48]:
#results = np.exp((np.log(lasso_test_results) + np.log(xgb_test_results) + np.log(lgb_test_results) + np.log(catb_test_results)) / 4 )
# Geometric mean of the stacked test predictions; np.mean divides by the actual number of models
results = np.exp(np.mean(np.log(np.column_stack(test_stack)), axis=1))
#mean_squared_error(y, y3, squared=False)
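
The cell above blends with a geometric mean, that is, the exponential of the mean log-prediction, where the divisor should equal the number of stacked models. A minimal sketch with made-up, strictly positive predictions (the array `stack` is hypothetical):

```python
import numpy as np

# Hypothetical test predictions: two rows, three models (one column per model).
stack = np.column_stack([
    np.array([8.0, 2.0]),
    np.array([8.0, 4.0]),
    np.array([8.0, 8.0]),
])

# Geometric mean per row: exp(mean(log(p))). Requires strictly positive predictions.
geo_mean = np.exp(np.mean(np.log(stack), axis=1))

# Arithmetic mean for comparison.
arith_mean = stack.mean(axis=1)

print(geo_mean)
print(arith_mean)
```

The geometric mean never exceeds the arithmetic mean and the two agree when all models predict the same value, so the difference is small when the models are close to each other.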

In [207]:
results = lasso_test_results

<hr>

In the helper functions above (`rmse_score` and `kfold_train_predict`), we pass `squared=False` to `mean_squared_error` to get the root mean squared error (RMSE) on the validation data.
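
For reference, the RMSE that `squared=False` produces can also be written directly in NumPy. The helper name `rmse` and the numbers below are illustrative assumptions, not part of the competition code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are 0, 0.5 and -0.5, so the mean squared error is 0.5 / 3.
print(rmse([8.0, 8.5, 7.5], [8.0, 8.0, 8.0]))
```

RMSE is expressed in the same units as the target, which makes the scores printed in this notebook directly comparable to the spread of `target`.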

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [None]:
best_model = mix_regressors
# Use the model to generate predictions
#best_model.fit(X, y)

In [None]:
predictions = best_model.predict(X_test)

In [208]:
predictions = results
# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

In [209]:
output.head(10)

Unnamed: 0,Id,target
0,0,8.085849
1,5,8.37098
2,15,8.399496
3,16,8.509723
4,17,8.135707
5,19,8.311111
6,20,8.491843
7,21,7.823435
8,23,8.071818
9,29,8.314829


In [82]:
output.head(10)

Unnamed: 0,Id,target
0,0,8.084778
1,5,8.373437
2,15,8.400814
3,16,8.510105
4,17,8.13758
5,19,8.311714
6,20,8.495224
7,21,7.8255
8,23,8.072313
9,29,8.314436
