Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [1]:
# Familiar imports
import numpy as np
import pandas as pd

# For encoding categorical variables and splitting data
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

# For training random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [2]:
# Shared settings
JOBS = 4  # number of parallel jobs used by the random forest (n_jobs)

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
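
As a quick illustration of what `index_col=0` changes, here is a minimal sketch on a made-up three-row CSV (the contents of `csv_text` are hypothetical and only mimic the first columns of `train.csv`). It reads the same text with and without the argument and compares the resulting row labels.

```python
import io

import pandas as pd

# Hypothetical three-row CSV that mimics the leading columns of train.csv
csv_text = "id,cat0,cont0\n1,B,0.4\n2,A,0.5\n4,B,0.6\n"

# With index_col=0 the first column (id) becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas creates a default 0..n-1 index and keeps id as a column
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.index.tolist())     # row labels taken from the id column
print(without_index.index.tolist())  # default integer row labels
print(list(with_index.columns), list(without_index.columns))
```

Keeping `id` in the index rather than as a column also keeps it out of the features passed to the model.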

In [3]:
# Load the training data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Preview the data that will be used for training
train.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,target
1,B,B,B,C,B,B,A,E,C,N,...,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985,8.113634
2,B,B,A,A,B,D,A,F,A,O,...,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083,8.481233
3,A,A,A,C,B,D,A,D,A,F,...,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846,8.364351
4,B,B,A,C,B,D,A,E,C,K,...,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682,8.049253
6,A,A,A,C,B,D,A,E,A,N,...,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823,7.97226


In [4]:
# Preview the test data
test.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
0,B,B,B,C,B,B,A,E,E,I,...,0.476739,0.37635,0.337884,0.321832,0.445212,0.290258,0.244476,0.087914,0.301831,0.845702
5,A,B,A,C,B,C,A,E,C,H,...,0.285509,0.860046,0.798712,0.835961,0.391657,0.288276,0.549568,0.905097,0.850684,0.69394
15,B,A,A,A,B,B,A,E,D,K,...,0.697272,0.6836,0.404089,0.879379,0.275549,0.427871,0.491667,0.384315,0.376689,0.508099
16,B,B,A,C,B,D,A,E,A,N,...,0.719306,0.77789,0.730954,0.644315,1.024017,0.39109,0.98834,0.411828,0.393585,0.461372
17,B,B,A,C,B,C,A,E,C,F,...,0.313032,0.431007,0.390992,0.408874,0.447887,0.390253,0.648932,0.385935,0.370401,0.900412


The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [5]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
1,B,B,B,C,B,B,A,E,C,N,...,0.610706,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985
2,B,B,A,A,B,D,A,F,A,O,...,0.276853,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083
3,A,A,A,C,B,D,A,D,A,F,...,0.285074,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846
4,B,B,A,C,B,D,A,E,C,K,...,0.284667,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682
6,A,A,A,C,B,D,A,E,A,N,...,0.287595,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823


# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use one-hot encoding, applied only to the categorical columns whose cardinality does not exceed a chosen threshold, and save our encoded features as new variables `X` and `X_test`.
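
To make the difference between the two encodings from that lesson concrete, here is a minimal sketch on a toy column (the `toy` DataFrame is made up for illustration and is not part of the competition data). It assumes only the scikit-learn encoders imported in Step 1.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with three distinct categories
toy = pd.DataFrame({"cat3": ["C", "A", "C", "B"]})

# Ordinal encoding: each category is mapped to one integer code
ordinal = OrdinalEncoder().fit_transform(toy)

# One-hot encoding: each category gets its own 0/1 column.
# The encoder returns a sparse matrix by default, so convert it for display.
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy).toarray()

print(ordinal.ravel())  # a single column of integer codes
print(one_hot)          # one column per category, exactly one 1 per row
```

One-hot encoding adds one column per distinct value, which is why the cardinality of each categorical column matters in the cells that follow.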

In [6]:
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# List of numeric columns
num_cols = [col for col in features.columns if 'cat' not in col]

# Check the cardinality (number of distinct values) of each categorical column
col_cardinality = {col: features[col].nunique() for col in object_cols}
print("each column and its cardinality {}".format(col_cardinality))

cardinalities = set(col_cardinality.values())
print("the distinct cardinalities {}".format(cardinalities))


each column and its cardinality {'cat0': 2, 'cat1': 2, 'cat2': 2, 'cat3': 4, 'cat4': 4, 'cat5': 4, 'cat6': 8, 'cat7': 8, 'cat8': 7, 'cat9': 15}
the distinct cardinalities {2, 4, 7, 8, 15}


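# Step 4: Train a model

We will loop over every distinct cardinality value and use each one as a threshold: categorical columns with at most that many distinct values are one-hot encoded, and the remaining categorical columns are dropped. We then keep the threshold that gives the lowest root mean squared error (RMSE) on the validation data.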
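
To see what each threshold means in practice, here is a small sketch that uses the cardinalities printed in Step 3 and lists which categorical columns survive each threshold. The name `kept_by_threshold` is introduced here for illustration only.

```python
# Cardinalities as printed in Step 3
col_cardinality = {'cat0': 2, 'cat1': 2, 'cat2': 2, 'cat3': 4, 'cat4': 4,
                   'cat5': 4, 'cat6': 8, 'cat7': 8, 'cat8': 7, 'cat9': 15}

# For each threshold, keep the columns whose cardinality does not exceed it
kept_by_threshold = {}
for threshold in sorted(set(col_cardinality.values())):
    kept_by_threshold[threshold] = [
        col for col, n in col_cardinality.items() if n <= threshold
    ]
    print(threshold, kept_by_threshold[threshold])
```

With a threshold of 2 only `cat0`, `cat1` and `cat2` are kept, while a threshold of 15 keeps all ten categorical columns.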

In [7]:
def get_input(col_cardinality, cardinality, num_cols):
    '''
    Helper function. Returns the X and X_test datasets for a given cardinality threshold.
    Reads the global `features` and `test` DataFrames.
    '''
    # Keep only the categorical columns whose cardinality is at most `cardinality`;
    # these are one-hot encoded below, and the other categorical columns are dropped
    sel_obj_cols = [key for key in col_cardinality.keys() if col_cardinality[key]<=cardinality]
    #print(sel_obj_cols)

    #get the final columns
    final_cols = sel_obj_cols + num_cols
    #print(final_cols)

    # select final columns
    subfeatures = features[final_cols]
    subtest = test[final_cols]

    # One-hot encode the selected categorical columns.
    # sparse=False returns a dense array (scikit-learn >= 1.2 renames this argument to sparse_output)
    ohe_encoder = OneHotEncoder(handle_unknown='ignore',sparse=False)
    OH_cols_train = pd.DataFrame(ohe_encoder.fit_transform(subfeatures[sel_obj_cols]))
    OH_cols_valid = pd.DataFrame(ohe_encoder.transform(subtest[sel_obj_cols]))

    # One-hot encoding removed index; put it back
    OH_cols_train.index = subfeatures.index
    OH_cols_valid.index = subtest.index

    # Remove categorical columns (will replace with one-hot encoding)
    num_X_train = subfeatures.drop(sel_obj_cols, axis=1)
    num_X_valid = subtest.drop(sel_obj_cols, axis=1)

    # Add one-hot encoded columns to numerical features
    X = pd.concat([num_X_train, OH_cols_train], axis=1)
    X_test = pd.concat([num_X_valid, OH_cols_valid], axis=1)
    return X, X_test


In [8]:
def fit_model(col_cardinality, cardinality, num_cols):
    '''
    Helper function. Trains a random forest on the dataset built for a given
    cardinality threshold and returns the model and its validation RMSE.
    '''
    # Build X and X_test with the helper function defined above
    X, X_test = get_input(col_cardinality, cardinality, num_cols)
    #preview of the data
    #X.head()

    # Create the training and validation sets [80%-20% split]
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)

    # Define the random forest model [set parameters]
    model = RandomForestRegressor(n_estimators=500, max_depth=15, n_jobs=JOBS, random_state=1)

    # Train the model, then score it on the validation set
    model.fit(X_train, y_train)
    preds_valid = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, preds_valid, squared=False)
    return model, rmse

In [9]:
models = []  # one fitted model per cardinality threshold
rmses = []   # validation RMSE of each model, in the same order
# Loop over the distinct cardinality values
for cardinality in cardinalities:
    model, rmse = fit_model(col_cardinality, cardinality, num_cols)
    rmses.append(rmse)
    models.append(model)
    print("for cardinality {c}, the root mean squared error is {r}".format(c=cardinality, r=rmses[-1]))

for cardinality 2, the root mean squared error is 0.7343482954343052
for cardinality 4, the root mean squared error is 0.7342782555939127
for cardinality 7, the root mean squared error is 0.7337638401071578
for cardinality 8, the root mean squared error is 0.733791045266522
for cardinality 15, the root mean squared error is 0.7337861426936235


In [10]:
# Find the lowest RMSE and its position in the results
best_idx = rmses.index(min(rmses))
print("min value of the RMSEs is : {}".format(rmses[best_idx]))
# `cardinalities` has not been modified since the loop above, so it iterates in the same order
best_cardinality = list(cardinalities)[best_idx]
print("the corresponding cardinality is : {}".format(best_cardinality))

# Set the best model
best_model = models[best_idx]

# Build the inputs for the best cardinality
X, X_test = get_input(col_cardinality, best_cardinality, num_cols)

min value of the RMSEs is : 0.7337638401071578
the corresponding cardinality is : 7


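In the `fit_model` function above, we set `squared=False` to get the root mean squared error (RMSE) on the validation data. The differences between thresholds are small here: every RMSE lies between about 0.7338 and 0.7343.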
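
For reference, RMSE is simply the square root of the mean squared error. The sketch below shows the relationship on four made-up values (`y_true` and `y_pred` are hypothetical). Computing it this way does not rely on the `squared` argument, which, as far as we know, is deprecated in recent scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, for illustration only
y_true = np.array([8.0, 8.5, 7.5, 8.2])
y_pred = np.array([8.1, 8.3, 7.9, 8.2])

# RMSE as the square root of scikit-learn's mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# The same quantity written out by hand
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse, rmse_manual)
```

Both expressions give the same number, so either can stand in for `mean_squared_error(..., squared=False)`.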

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [11]:
# Use the model to generate predictions
predictions = best_model.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

Once you have run the code cell above, follow the instructions below to submit to the competition:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

# Step 6: Keep Learning!

If you're not sure what to do next, you can begin by trying out more model types!
1. If you took the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course, then you learned about **[XGBoost](https://www.kaggle.com/alexisbcook/xgboost)**.  Try training a model with XGBoost, to improve over the performance you got here.

2. Take the time to learn about **LightGBM (LGBM)**, which is similar to XGBoost, since they both use gradient boosting to iteratively add decision trees to an ensemble.  In case you're not sure how to get started, **[here's a notebook](https://www.kaggle.com/svyatoslavsokolov/tps-feb-2021-lgbm-simple-version)** that trains a model on a similar dataset.
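
If you want a feel for gradient boosting before installing XGBoost or LightGBM, the sketch below uses scikit-learn's own `GradientBoostingRegressor` as a stand-in. It is not XGBoost or LightGBM, and it runs on synthetic data (`X_demo`, `y_demo`) rather than the competition data, so the score it prints says nothing about the leaderboard. The hyperparameters are arbitrary starting points.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(500, 5))
y_demo = 8 + X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Same 80%-20% split as in the notebook
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Gradient boosting adds shallow trees one at a time, each correcting the last
gbm = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)

# Validation RMSE, computed as the square root of the mean squared error
rmse_demo = np.sqrt(mean_squared_error(y_va, gbm.predict(X_va)))
print(rmse_demo)
```

To try this on the competition data, one option is to swap the model defined inside `fit_model` for a boosted model and compare the validation RMSE with the random forest results above.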