<a href="https://www.kaggle.com/evi125/a-beginners-approach-to-30-days-of-ml-top-15?scriptVersionId=88610181" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 30 Days of ML

Hi, I'm Evelin. I'm a beginner without any programming or machine learning background and this is the first data science competition I'm participating in (after the House Prices competition we practised on while doing the Machine Learning courses).

I made quite a few notebooks, each with many versions. As a beginner who didn't know good techniques, I tested most things by making small changes and comparing the cross-validation (CV) score and the public score. If both the CV score and the public score got better, I kept the change; if only one of them did, I didn't.

After checking for missing values, I tried a few things for the numerical and categorical columns: StandardScaler(), MinMaxScaler(), Normalizer(), OrdinalEncoder(), OneHotEncoder(), KBinsDiscretizer(), KMeans(), PCA and TargetEncoder(), as well as removing columns and using different scaling or encoding for different features. StandardScaler() for the numerical columns and OneHotEncoder() for the categorical columns gave the best results.
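
To make the "change one thing, compare the CV score" idea concrete, here is a minimal sketch that compares two encoders with cross-validation. It is not the code I ran for the competition: the toy data, the column names `cat0` and `cont0`, the helper `cv_rmse` and the choice of LinearRegression are all assumptions made just for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy data (hypothetical): one categorical and one numerical feature.
rng = np.random.default_rng(0)
n = 300
demo_X = pd.DataFrame({
    "cat0": rng.choice(["A", "B", "C"], size=n),
    "cont0": rng.normal(size=n),
})
# Category "B" shifts the target; "A" and "C" do not, so the effect is not ordered.
effect = demo_X["cat0"].map({"A": 0.0, "B": 2.0, "C": 0.0})
demo_y = effect + 0.5 * demo_X["cont0"] + rng.normal(scale=0.1, size=n)


def cv_rmse(encoder):
    """Return the mean 5-fold CV RMSE of a linear model using the given encoder."""
    preprocess = ColumnTransformer([
        ("cat", encoder, ["cat0"]),
        ("num", StandardScaler(), ["cont0"]),
    ])
    model = make_pipeline(preprocess, LinearRegression())
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(
        model, demo_X, demo_y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()  # scores are negated RMSE values, so flip the sign


rmse_onehot = cv_rmse(OneHotEncoder(handle_unknown="ignore"))
rmse_ordinal = cv_rmse(OrdinalEncoder())
print(f"One-hot RMSE: {rmse_onehot:.4f}, ordinal RMSE: {rmse_ordinal:.4f}")
```

Because the preprocessing sits inside the pipeline, it is refitted on the training part of every fold, so the comparison between the two variants stays fair.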

I followed the same approach for selecting the hyperparameters for XGBoost as well. First, I ran a GridSearch with a few parameters. Then I tried optimizing the parameters around the best score one-by-one or in groups of 2 or 3. E.g. if I got n_estimators=1000, learning_rate=0.08, max_depth=4 as the best parameters from the GridSearch, then in the next round I tried n_estimators=900, 1000, 1100. Then, if I got 1100, I tried 1050, 1100, 1200 next, and so on. Then I added more parameters as I learnt about them (e.g. alpha, lambda). I went through all the parameters I used in my models this way. I know it's very inefficient, but it helped me reach a public score of 0.71825 and I was about 400th on the public leaderboard at that time, so I was very happy with my results.
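
The coarse-then-refined search described above could be sketched roughly like this. This is only an illustration under assumptions: it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, synthetic data from `make_regression`, small candidate values so it runs quickly, and a hypothetical helper called `best_value`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical stand-in for the competition data).
demo_X, demo_y = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)


def best_value(param_name, candidates, fixed_params):
    """Grid-search a single parameter while keeping the others fixed."""
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0, **fixed_params),
        param_grid={param_name: candidates},
        scoring="neg_root_mean_squared_error",
        cv=3,
    )
    search.fit(demo_X, demo_y)
    return search.best_params_[param_name]


# Parameters kept fixed while n_estimators is tuned.
fixed = {"learning_rate": 0.08, "max_depth": 4}

# Round 1: a coarse grid.
coarse_best = best_value("n_estimators", [50, 100, 150], fixed)

# Round 2: a narrower grid centred on the best value from round 1.
step = 25
refined_grid = [coarse_best - step, coarse_best, coarse_best + step]
refined_best = best_value("n_estimators", refined_grid, fixed)

print(f"Coarse best: {coarse_best}, refined best: {refined_best}")
```

The same helper can then be called for the next parameter, with the value found so far added to `fixed`.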

After Abhishek's optimization and stacking videos came out, I learnt about Optuna and StackingRegressor(). Optuna often gave worse results than my tedious optimization technique, but stacking helped a lot in improving my score. Unfortunately, it took so much time to run all the models and the CV as well that I stopped using CV and relied only on my public score. That was a mistake, and I believe that's the reason I fell more than 200 places on the private leaderboard.

(I also experimented with using a Pipeline and I loved it, but I got a worse score using it, probably because I couldn't figure out how to use early_stopping for XGBoost with a pipeline, so I ditched it.)
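
For reference, one way to get early stopping inside a Pipeline is to use a model that holds out its own validation data. The sketch below is an assumption-heavy illustration, not what I used in the competition: it swaps XGBoost for scikit-learn's GradientBoostingRegressor, whose `validation_fraction` and `n_iter_no_change` parameters stop training when the score on an internal hold-out set stops improving. The toy data and column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): one categorical and one numerical feature.
rng = np.random.default_rng(0)
n = 400
demo_X = pd.DataFrame({
    "cat0": rng.choice(["A", "B", "C"], size=n),
    "cont0": rng.normal(size=n),
})
demo_y = (
    2.0 * (demo_X["cat0"] == "B")
    + demo_X["cont0"]
    + rng.normal(scale=0.1, size=n)
)

# Same preprocessing idea as in this notebook: one-hot encode + standard scale.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat0"]),
    ("num", StandardScaler(), ["cont0"]),
])

# The model sets aside 20% of the data it receives and stops adding trees
# once the score on that part has not improved for 5 iterations in a row.
model = GradientBoostingRegressor(
    n_estimators=2000,
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,
    n_iter_no_change=5,
    random_state=0,
)

pipe = Pipeline([("preprocess", preprocess), ("model", model)])
pipe.fit(demo_X, demo_y)

# Number of trees actually built before early stopping kicked in.
n_trees_used = pipe.named_steps["model"].n_estimators_
print(f"Trees used: {n_trees_used} of 2000")
```

Since the hold-out split happens inside the model, no extra `eval_set` has to be passed through the pipeline, which was the part I couldn't figure out with XGBoost.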


**Resources I've used for this notebook:**

Python, Intro to ML, Intermediate ML, Data Visualization mini courses

Documentation: Sklearn, XGBoost, LGBM, etc.

The [Getting Started with 30 Days of ML Competition](https://www.kaggle.com/alexisbcook/getting-started-with-30-days-of-ml-competition) notebook by [Alexis Cook](https://www.kaggle.com/alexisbcook)

I used [this notebook](https://www.kaggle.com/garylucn/top-9-house-price/notebook) by [Gary Lu](https://www.kaggle.com/garylucn) for Optuna and Stacking (I was struggling with using these techniques until I found this notebook, so thank you so much for sharing it!)

[Abhishek Thakur](https://www.kaggle.com/abhishek)'s [YouTube videos](https://www.youtube.com/watch?v=_55G24aghPY&list=PL98nY_tJQXZnP-k3qCDd1hljVSciDV9_N)

[Scikit-optimize for LightGBM Tutorial with Luca Massaron](https://www.youtube.com/watch?v=AFtjWuwqpSQ&list=PLqFaTIg4myu9uAPsqXBBZRr8kcj9IvAIf&index=3) video by [Luca Massaron](https://www.kaggle.com/lucamassaron)

I also reused my own House Prices notebooks, for which I learnt techniques and got inspiration from the following notebooks:

* [House Prices: Pipeline & Cross-Validation](https://www.kaggle.com/sergejnuss/house-prices-pipeline-cross-validation) by [Sergej Nuss](https://www.kaggle.com/sergejnuss) for Pipelines
* [Housing Prices Competition: Clear and Concise Exploratory Data Analysis](https://www.kaggle.com/korfanakis/housing-prices-clear-and-concise-eda) by [Orfanakis Konstantinos](https://www.kaggle.com/korfanakis) for Data Visualization, Data Analysis
* [Housing Prices: A Simple Approach to Top 2%](https://www.kaggle.com/korfanakis/housing-prices-a-simple-approach-to-top-2) by [Orfanakis Konstantinos](https://www.kaggle.com/korfanakis)
* [Data Science Workflow TOP 2% (with Tuning)](https://www.kaggle.com/angqx95/data-science-workflow-top-2-with-tuning) by [aqx](https://www.kaggle.com/angqx95)
* [Top 1% Approach: EDA, New Models and Stacking](https://www.kaggle.com/datafan07/top-1-approach-eda-new-models-and-stacking) by [Ertuğrul Demir](https://www.kaggle.com/datafan07)



The [StatQuest](https://www.youtube.com/user/joshstarmer) YouTube channel by Josh Starmer helped me understand a lot of statistical and machine learning concepts.


Also, [Attila Ambrus](https://www.kaggle.com/ambrusattila) was very kind and spent a lot of time answering my never-ending questions in the Hungarian 30 Days of ML Discord community. Köszönöm a sok segítséget!!!! (Thank you for all the help!)


Thank you everyone for sharing lots of valuable information and making my learning process a lot of fun!



**(Note: this notebook only includes my final work. It doesn't include everything I tried (e.g. checking for missing values, scaling, encoding, removing columns), nor the tuning with GridSearch and Optuna.)**

# Importing necessary libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

import xgboost as xgb
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor

from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.model_selection import KFold

# Loading and exploring the data

In [None]:
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

print(train.shape)
train.head()

In [None]:
train.describe()

### Numerical data

In [None]:
cont_data = train.select_dtypes(exclude='object')

fig = plt.figure(figsize = (20, 15))
for i, col in enumerate(cont_data):
    plt.subplot(3, 5, i + 1)
    sns.histplot(x= col, data= train)
    plt.xticks(rotation = 90)
fig.tight_layout()

### Categorical data

In [None]:
cat_data = train.select_dtypes(include='object')

fig = plt.figure(figsize = (20, 10))
for i, col in enumerate(cat_data):
    plt.subplot(2, 5, i + 1)
    sns.countplot(x= col, data= train)
    plt.xticks(rotation = 90)
fig.tight_layout()

### Correlation between numerical features and the target

In [None]:
correlations = train.select_dtypes(exclude=['object']).corr()
correlations = correlations[['target']].sort_values(by=['target'], ascending=False)
correlations

In [None]:
fig = plt.figure(figsize=(20,30))
for index, column in enumerate(cont_data.columns):
    plt.subplot(8,5, index+1)
    sns.regplot(x=column, y='target', data= train)
fig.tight_layout()

### Multicollinearity

In [None]:
plt.figure(figsize= (20,20))
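# Use only the numerical columns: newer pandas versions raise an error
# if .corr() is called on a DataFrame that still contains text columns
multicorr = train.select_dtypes(exclude='object').corr()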
sns.heatmap(multicorr, cmap="crest", annot=True, fmt= ".3f")

There is no strong correlation either between the features or between any of the features and the target. I wonder if that's why none of the scaling/encoding/feature engineering techniques worked for me.

# Preparing the data

In [None]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

object_cols = [col for col in features.columns if 'cat' in col]
number_cols = [col for col in features.columns if 'cont' in col]

X = features.copy()
X_test = test.copy()

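# Encoding

# Note: scikit-learn 1.2+ renamed the `sparse` argument to `sparse_output`
OH_encoder = OneHotEncoder(sparse=False)

OH_cols = pd.DataFrame(OH_encoder.fit_transform(X[object_cols]))
OH_test_cols = pd.DataFrame(OH_encoder.transform(X_test[object_cols]))

# Use string column names so they have the same type as the scaled columns' names
# (newer scikit-learn versions reject a mix of integer and string column names)
OH_cols.columns = OH_cols.columns.astype(str)
OH_test_cols.columns = OH_test_cols.columns.astype(str)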

# OH_cols.index = X[object_cols].index
# OH_test_cols.index = X_test[object_cols].index

num = X.drop(object_cols, axis=1)
num_t = X_test.drop(object_cols, axis=1)

# Scaling

scaler = StandardScaler()

scaled_cols = scaler.fit_transform(X[number_cols])
scaled_test_cols = scaler.transform(X_test[number_cols])

scaled_cols = pd.DataFrame(scaled_cols, columns=number_cols)
scaled_test_cols = pd.DataFrame(scaled_test_cols, columns=number_cols)

# Concat

preprocessed_X = pd.concat([OH_cols, scaled_cols], axis=1)
preprocessed_test_X = pd.concat([OH_test_cols, scaled_test_cols], axis=1)

# Preview the one-hot-encoded and scaled features
preprocessed_X.head()

# Models

I got the first 3 XGBoost models using GridSearch and tuning the parameters one-by-one or in groups of two.

For the 4th XGBoost model and the LGBM model I used Optuna, but the LGBM model made the stack perform worse, so I'm leaving it out.

The RandomForest and GradientBoostingRegressor models are from Abhishek's video.

(Note: I know preprocessed_X should be just X, but I don't want to risk missing one out and creating errors, so I'm leaving it this way for now. I'll be more careful naming variables next time.)

In [None]:
RANDOM_SEED = 1215

In [None]:
xgb1_params = {#'tree_method': 'gpu_hist',
              'n_estimators': 1100,
              'learning_rate': 0.08,
              'max_depth': 4,
              'min_child_weight': 1, 
              'subsample': 0.8, 
              'colsample_bytree': 0.5, 
              'reg_alpha': 5,
              'reg_lambda': 1,
              'gamma': 0,
             }

xgbr1 = xgb.XGBRegressor(random_state=RANDOM_SEED, **xgb1_params)
xgbr1.fit(preprocessed_X, y, early_stopping_rounds =5, eval_set=[(preprocessed_X, y)], verbose=500)

In [None]:
xgb2_params = {#'tree_method': 'gpu_hist',
              'n_estimators': 1500,
              'learning_rate': 0.08,
              'max_depth': 4,
              'min_child_weight': 1, 
              'subsample': 0.9, 
              'colsample_bytree': 0.2, 
              'reg_alpha': 8,
              'reg_lambda': 20,
              'gamma': 0,
             }

xgbr2 = xgb.XGBRegressor(random_state=RANDOM_SEED, **xgb2_params)
xgbr2.fit(preprocessed_X, y, early_stopping_rounds =5, eval_set=[(preprocessed_X, y)], verbose=500)

In [None]:
xgb3_params = {#'tree_method': 'gpu_hist',
              'n_estimators': 1700,
              'learning_rate': 0.08,
              'max_depth': 4,
              'min_child_weight': 1,
              'subsample': 0.9,
              'colsample_bytree': 0.1,
              'reg_alpha': 10,
              'reg_lambda': 20,
              'gamma': 0,
             }

xgbr3 = xgb.XGBRegressor(random_state=RANDOM_SEED, **xgb3_params)
xgbr3.fit(preprocessed_X, y, early_stopping_rounds =5, eval_set=[(preprocessed_X, y)], verbose=500)

In [None]:
xgb4_params = {#'tree_method': 'gpu_hist',
              'n_estimators': 4366,
              'max_depth': 3,
              'learning_rate': 0.05435500605120945,
              'gamma': 0.492373667901573,
              'min_child_weight': 5.02962746238382,
              'subsample': 0.48512472393136913,
              'colsample_bytree': 0.16115615922020954,
              'reg_alpha': 6.564178028196104,
              'reg_lambda': 8.21636933472606,
             }

xgbr4 = xgb.XGBRegressor(random_state=RANDOM_SEED, **xgb4_params)
xgbr4.fit(preprocessed_X, y, early_stopping_rounds =5, eval_set=[(preprocessed_X, y)], verbose=500)

In [None]:
# lgb_params = {'num_leaves': 38, 'n_estimators': 897, 'max_depth': 5, 
#               'learning_rate': 0.08612849100285823, 'min_child_weight': 1.5455034368368281, 
#               'subsample': 0.3757097543793422, 'colsample_bytree': 0.10598238005270319, 
#               'reg_alpha': 7.677631589902765, 'reg_lambda': 4.300612342722245}

# lgbr = lgb.LGBMRegressor(random_state=RANDOM_SEED, **lgb_params)
# lgbr.fit(preprocessed_X, y, early_stopping_rounds =5, eval_set=[(preprocessed_X, y)])

In [None]:
randomforest_params = {'n_estimators': 500,  'max_depth': 3}
rf = RandomForestRegressor(n_jobs=-1, random_state=RANDOM_SEED, **randomforest_params)
rf.fit(preprocessed_X, y)

In [None]:
gbr_params = {'n_estimators': 500,  'max_depth': 3}
gbr = GradientBoostingRegressor(random_state=RANDOM_SEED, **gbr_params)
gbr.fit(preprocessed_X, y)

# Stacking

In [None]:
stack = StackingRegressor(
    estimators=[
        ('xgbr1', xgbr1),
        ('xgbr2', xgbr2),
        ('xgbr3', xgbr3),
        ('xgbr4', xgbr4),
        #('lgbr', lgbr),
        ('rf', rf),
        ('gbr', gbr)
    ],
    cv=5)
stack.fit(preprocessed_X, y)

**As I mentioned earlier, I made the mistake of not using cross validation for the stacks, because it took a lot of time. I used cross_val_score with cv=5 for GridSearch and Optuna.**
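
For completeness, cross-validating a whole stack could look roughly like the sketch below. It is a small illustration under assumptions, not a rerun of my models: the data comes from `make_regression`, the base models are deliberately tiny so it finishes quickly, and the names `demo_stack` and `outer_folds` are made up. The `cv` inside StackingRegressor only controls how the predictions for the final estimator are produced; the outer `cross_val_score` is what gives an honest score for the stack itself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical stand-in for the competition data).
demo_X, demo_y = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# A small stack; the base models are kept tiny so the example runs fast.
demo_stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)),
        ("gbr", GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0)),
    ],
    cv=3,  # inner folds used to build the final estimator's training data
)

# Outer folds: each one refits the whole stack and scores it on unseen rows.
outer_folds = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    demo_stack, demo_X, demo_y, cv=outer_folds,
    scoring="neg_root_mean_squared_error",
)
stack_rmse = -scores.mean()  # scores are negated RMSE values
print(f"Stack CV RMSE: {stack_rmse:.3f}")
```

The cost is that the stack gets refitted once per outer fold, which is exactly why this took too long for me with the full models.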

# Submission

In [None]:
print('Predict submission')
submission = pd.read_csv("../input/30-days-of-ml/sample_submission.csv")

submission.iloc[:,1] = stack.predict(preprocessed_test_X)

submission.to_csv('my_submission.csv', index=False)