Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [None]:
# Familiar imports
import numpy as np
import pandas as pd

# For encoding categorical variables and splitting data
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

# For training tree-based models and scoring their predictions
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# For XGBoost (aliased as xgb, the name used later in the notebook) and grid search
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

In [None]:
# Shared settings
JOBS = 4  # number of parallel jobs used by the ML algorithms

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
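
To see the effect of `index_col=0` without touching the competition files, here is a small sketch that reads a made-up three-row CSV from memory. The column names and values are assumptions chosen only to resemble the real data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv
csv_text = "id,cat0,cont0,target\n1,A,0.5,7.1\n2,B,0.2,8.3\n4,A,0.9,6.4\n"

# With index_col=0 the first column (id) becomes the row labels
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, id stays an ordinary column and rows are labelled 0, 1, 2
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.index.tolist())
print(without_index.columns.tolist())
```

Keeping `id` in the index means it is never passed to the model as a feature, and it is still available later when building the submission file.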

In [None]:
# Load the training data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Preview the data that will be used for training
train.head()

In [None]:
#preview of the test data
test.head()

The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [None]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head()

# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll one-hot encode the categorical columns whose cardinality (number of distinct values) is at or below a chosen threshold, drop the remaining categorical columns, and save our encoded features as new variables `X` and `X_test`.
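
As a quick reminder of how the two encoders from that lesson differ, the sketch below applies both to a toy column. The data is made up for illustration; only the encoder classes come from the imports above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column with three distinct values (cardinality 3)
toy = pd.DataFrame({"cat0": ["A", "B", "A", "C"]})

# Ordinal encoding: one column, each category mapped to an integer code
ordinal = OrdinalEncoder().fit_transform(toy)

# One-hot encoding: one 0/1 column per category
# (.toarray() converts the default sparse result to a dense array)
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy).toarray()

print(ordinal.ravel())
print(one_hot)
```

One-hot encoding adds one column per distinct value, which is why it is usually reserved for low-cardinality columns.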

In [None]:
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# get the numeric columns also
num_cols = [col for col in features.columns if 'cat' not in col]

# Check the cardinality (number of distinct values) of each categorical column
col_cardinality = {col: features[col].nunique() for col in object_cols}
print("each column and its cardinality {}".format(col_cardinality))

# Distinct cardinality values, sorted so that their order is stable when we loop over them
cardinalities = sorted(set(col_cardinality.values()))
print("the distinct cardinalities {}".format(cardinalities))


We will loop over every distinct cardinality value, using each one as the one-hot encoding threshold, to see which gives the best result on the chosen metric (root mean squared error).

In [None]:
def get_input(col_cardinality, cardinality, num_cols):   
    '''
    Helper function. Returns the X and X_test datasets for a given cardinality threshold.
    '''
    # Keep only the categorical columns whose cardinality is at or below the threshold;
    # these will be one-hot encoded (higher-cardinality columns are dropped)
    sel_obj_cols = [key for key in col_cardinality.keys() if col_cardinality[key]<=cardinality]
    #print(sel_obj_cols)

    #get the final columns
    final_cols = sel_obj_cols + num_cols
    #print(final_cols)

    # select final columns
    subfeatures = features[final_cols]
    subtest = test[final_cols]

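    # One-hot encode the selected categorical columns.
    # .toarray() densifies the output without relying on the `sparse` /
    # `sparse_output` keyword, whose name differs between scikit-learn versions.
    ohe_encoder = OneHotEncoder(handle_unknown='ignore')
    OH_cols_train = pd.DataFrame(ohe_encoder.fit_transform(subfeatures[sel_obj_cols]).toarray())
    OH_cols_valid = pd.DataFrame(ohe_encoder.transform(subtest[sel_obj_cols]).toarray())

    # One-hot encoding removed the index; put it back
    OH_cols_train.index = subfeatures.index
    OH_cols_valid.index = subtest.index

    # Use string column names so they can be combined with the numeric columns' names
    OH_cols_train.columns = OH_cols_train.columns.astype(str)
    OH_cols_valid.columns = OH_cols_valid.columns.astype(str)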

    # Remove categorical columns (will replace with one-hot encoding)
    num_X_train = subfeatures.drop(sel_obj_cols, axis=1)
    num_X_valid = subtest.drop(sel_obj_cols, axis=1)

    # Add one-hot encoded columns to numerical features
    X = pd.concat([num_X_train, OH_cols_train], axis=1)
    X_test = pd.concat([num_X_valid, OH_cols_valid], axis=1)
    return X, X_test

# Step 4: Train a model


In [None]:
def fit_model(col_cardinality, cardinality, num_cols):
    '''
    Helper function. Returns the model and its validation RMSE for a dataset built with a given cardinality threshold.
    '''
    # Build X and X_test with the helper function defined above
    X, X_test = get_input(col_cardinality, cardinality, num_cols)
    #preview of the data
    #X.head()

    # Create the training and validation sets [80/20 split]
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)

    # Define the Random Forest model [set parameters]
    model = RandomForestRegressor(n_estimators=150, max_depth=10, n_jobs=JOBS, random_state=1)

    # Train the model
    model.fit(X_train, y_train)
    preds_valid = model.predict(X_valid)
    rmse = (mean_squared_error(y_valid, preds_valid, squared=False))
    return model, rmse

In [None]:
models = [] #placeholder for the different models
rmses = [] #placeholder for the different root mean squared errors calculated 
#loop over the distinct cardinality values
for cardinality in cardinalities:
    model, rmse = fit_model(col_cardinality, cardinality, num_cols)
    rmses.append(rmse)
    models.append(model)
    print("for cardinality {c}, the root mean squared error is {r}".format(c=cardinality, r=rmses[-1]))

In [None]:
# Get the minimum RMSE and the cardinality that produced it
print("min value of the RMSEs is : {}".format(min(rmses)))
best_cardinality = list(cardinalities)[rmses.index(min(rmses))]
print("the corresponding cardinality is : {}".format(best_cardinality))

#setting the best model 
#best_model = models[rmses.index(min(rmses))]

#set the input for the best cardinality
X, X_test = get_input(col_cardinality, best_cardinality, num_cols)
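
The lookup above works only while `rmses` and `cardinalities` are traversed in the same order. A minimal alternative sketch, assuming the scores are stored in a dictionary keyed by the threshold that produced them, avoids that coupling. The RMSE values below are made up for illustration and are not results from this notebook.

```python
# Hypothetical RMSE values, keyed by the cardinality threshold that produced them
rmse_by_cardinality = {2: 0.741, 4: 0.739, 15: 0.737}

# min() over the keys, compared by their RMSE, returns the best threshold directly
best_cardinality = min(rmse_by_cardinality, key=rmse_by_cardinality.get)
best_rmse = rmse_by_cardinality[best_cardinality]

print(best_cardinality, best_rmse)
```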

In [1]:
# Initialize the model with basic parameters
#xgb_model = xgb.XGBRegressor(n_jobs=JOBS, random_state=1, n_estimators=2000, max_depth=15, learning_rate=0.1)
xgb_model = xgb.XGBRegressor(n_jobs=JOBS, random_state=1)

# Specify the parameters to optimize
params = {
    'n_estimators' : [1000, 1500, 2000],
    'max_depth' : [10, 12],
    'subsample' : [0.8, 0.9],
    'learning_rate' : [0.05, 0.1, 0.2]
}

# This is a regression task, so score with (negated) RMSE rather than accuracy
grid_search = GridSearchCV(xgb_model, params, cv=5, n_jobs=JOBS, scoring='neg_root_mean_squared_error')

# Create the training and validation sets [80/20 split]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)

#preview the first rows of the training set
X_train.head()


In [None]:
grid_search.fit(X_train, y_train)

#define the best model
best_model = grid_search.best_estimator_

In [None]:
#get the best parameters that were evaluated
best_model_params = grid_search.best_params_
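
Scikit-learn scorers follow a "higher is better" convention, so RMSE is exposed as `neg_root_mean_squared_error` and `best_score_` comes back negative. The sketch below shows this on synthetic data; it swaps in a small `DecisionTreeRegressor` and a one-parameter grid (both assumptions made to keep the example fast and free of the XGBoost dependency).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [1, 3, 5]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the negated RMSE, so flip the sign to report it
best_rmse_cv = -demo_search.best_score_
print(demo_search.best_params_, best_rmse_cv)
```

The same sign flip applies to `grid_search.best_score_` whenever a `neg_*` scorer is used.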

In [None]:
#calculate the rmse for the best evaluated model
preds_valid = best_model.predict(X_valid)
rmse = (mean_squared_error(y_valid, preds_valid, squared=False))

print('rmse : {}'.format(rmse))

In the code cell above, we set `squared=False` to get the root mean squared error (RMSE) on the validation data.
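
RMSE is simply the square root of the mean squared error, so it can also be computed without the `squared` argument, which newer scikit-learn releases deprecate. A minimal sketch on three made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions: errors are 1, 0 and -2
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse_demo = float(np.sqrt(mse))

print(rmse_demo)
```

If your scikit-learn version rejects `squared=False`, taking `np.sqrt` of the plain `mean_squared_error` result gives the same number.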

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [None]:
# Use the model to generate predictions
predictions = best_model.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

Once you have run the code cell above, follow the instructions below to submit to the competition:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

# Step 6: Keep Learning!

If you're not sure what to do next, you can begin by trying out more model types!
1. If you took the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course, then you learned about **[XGBoost](https://www.kaggle.com/alexisbcook/xgboost)**.  The grid search above already uses XGBoost; try other values in the parameter grid to improve over the performance you got here.

2. Take the time to learn about **Light GBM (LGBM)**, which is similar to XGBoost, since they both use gradient boosting to iteratively add decision trees to an ensemble.  In case you're not sure how to get started, **[here's a notebook](https://www.kaggle.com/svyatoslavsokolov/tps-feb-2021-lgbm-simple-version)** that trains a model on a similar dataset.