### Chapter 1: Kaggle competitions process

In this first chapter, you will be introduced to the Kaggle competition process. You will train a model and prepare a CSV file ready for submission. You will also learn the difference between the Public and Private test splits, and how to prevent overfitting.


- 1.1 Competitions overview
    - Explore train data
    - Explore test data  
    <br />
- 1.2 Prepare your first submission
    - Determine a problem type
    - Train a simple model
    - Prepare a submission  
    <br />    
- 1.3 Public vs Private leaderboard
    - Which model is overfitting?
    - Train XGBoost models
    - Explore overfitting XGBoost


#### 1.1 Overview
- Explore train and test data

In [1]:
import pandas as pd

# Read train data
taxi_train = pd.read_csv('datasets/taxi_train_chapter_4.csv')
taxi_train.columns.to_list()

['id',
 'fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [2]:
""" 'fare_amount' column is missing in test data because this is the column that we are predicting."""

# Read test data
taxi_test = pd.read_csv('datasets/taxi_test_chapter_4.csv')
taxi_test.columns.to_list()

['id',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [6]:
# read the sample submission file
taxi_sample_sub = pd.read_csv('datasets/taxi_sample_submission.csv')
taxi_sample_sub.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.35
1,2015-01-27 13:08:24.0000003,11.35
2,2011-10-08 11:53:44.0000002,11.35
3,2012-12-01 21:12:12.0000002,11.35
4,2012-12-01 21:12:12.0000003,11.35
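
A submission file follows the same layout as the sample: one identifier column and one prediction column. The sketch below is a minimal illustration, assuming the identifier column is called `key` as in the sample above; the two rows and their predicted values are hypothetical. The `Unnamed: 0` column in the sample is probably a row index that was written into the file, which `index=False` avoids.

```python
import io

import pandas as pd

# Hypothetical identifiers and predictions, for illustration only.
test_keys = ["2015-01-27 13:08:24.0000002", "2015-01-27 13:08:24.0000003"]
predictions = [11.35, 11.35]

# Build the submission with the same column names as the sample file.
submission = pd.DataFrame({"key": test_keys, "fare_amount": predictions})

# index=False keeps the row index out of the CSV, so no 'Unnamed: 0'
# column appears when the file is read back.
buffer = io.StringIO()  # stands in for a real file path
submission.to_csv(buffer, index=False)

# Read the CSV back to check the layout.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.to_list())
```

To write a real file, pass a path such as `'submission.csv'` to `to_csv()` instead of the in-memory buffer.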


#### Train XGBoost models


In [7]:
# Load the forecasting train data
df_full = pd.read_csv('datasets/train.csv')
# test = pd.read_csv('datasets/test.csv')

# Randomly sample 30,000 rows from the DataFrame
df = df_full.sample(n=30000, random_state=1)

print(df.info())


from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.5, random_state=42)


import xgboost as xgb

# Create DMatrix on train data
dtrain = xgb.DMatrix(data=train[['store', 'item']],
                     label=train['sales'])
# dtest = xgb.DMatrix(data=test[['store', 'item']])


# Define xgboost parameters for a shallow model
# ('reg:squarederror' is the current name of the deprecated 'reg:linear' objective)
params = {'objective': 'reg:squarederror',
          'max_depth': 2,
          'verbosity': 0}

# Train xgboost model with max_depth=2
xg_depth_2 = xgb.train(params=params, dtrain=dtrain)


# Same parameters, deeper trees
params = {'objective': 'reg:squarederror',
          'max_depth': 8,
          'verbosity': 0}

# Train xgboost model with max_depth=8
xg_depth_8 = xgb.train(params=params, dtrain=dtrain)

# Same parameters, deeper trees again
params = {'objective': 'reg:squarederror',
          'max_depth': 15,
          'verbosity': 0}

# Train xgboost model with max_depth=15
xg_depth_15 = xgb.train(params=params, dtrain=dtrain)

# Same parameters, very deep trees
params = {'objective': 'reg:squarederror',
          'max_depth': 30,
          'verbosity': 0}

# Train xgboost model with max_depth=30
xg_depth_30 = xgb.train(params=params, dtrain=dtrain)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 782526 to 499585
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    30000 non-null  object
 1   store   30000 non-null  int64 
 2   item    30000 non-null  int64 
 3   sales   30000 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.1+ MB
None




#### Explore overfitting XGBoost
Having trained 4 XGBoost models with different maximum depths, you will now evaluate their quality. For this purpose, you will measure the quality of each model on both the train data and the test data. The train data is the data the models were trained on. The test data here is the held-out half of the 30,000-row sample produced by train_test_split(), which the models have never seen.

The goal of this exercise is to determine whether any of the trained models is overfitting. To measure model quality you will use Mean Squared Error (MSE). It is available in sklearn.metrics as the mean_squared_error() function, which takes two arguments: the true values and the predicted values.
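
MSE is the average of the squared differences between the true and predicted values. The sketch below writes that definition out in plain Python to show what the metric computes; the helper name `mse` and the sample numbers are hypothetical and are not part of the exercise.

```python
def mse(y_true, y_pred):
    """Return the mean of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sales figures and predictions, for illustration only.
y_true = [10, 20, 30]
y_pred = [12, 20, 27]

# Squared errors are 4, 0 and 9, so the result is their average.
print(mse(y_true, y_pred))
```

Because errors are squared, a few large misses raise the score far more than many small ones, and a lower value means a better fit.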

The train and test DataFrames, together with the 4 trained models (xg_depth_2, xg_depth_8, xg_depth_15, xg_depth_30), are available in your workspace.

- Instructions

    -  Make predictions for each model on both the train and test data.
    -  Calculate the MSE between the true values and your predictions for both the train and test data.

In [8]:
from sklearn.metrics import mean_squared_error

dtrain_ = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])
print(len(train), len(test))

# For each of the 4 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15, xg_depth_30]:
    # Make predictions
    train_pred = model.predict(dtrain_)
    test_pred = model.predict(dtest)

    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))

15000 15000
MSE Train: 600.630. MSE Test: 601.486
MSE Train: 295.394. MSE Test: 301.276
MSE Train: 252.848. MSE Test: 263.355
MSE Train: 251.809. MSE Test: 262.937
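
One way to read these numbers is to look at the gap between test and train MSE for each depth. The sketch below copies the four result rows from the output above and computes that gap. The variable names are illustrative, and treating the gap as an overfitting signal is a common heuristic rather than a formal test.

```python
# (max_depth, train MSE, test MSE), copied from the output above.
results = [
    (2, 600.630, 601.486),
    (8, 295.394, 301.276),
    (15, 252.848, 263.355),
    (30, 251.809, 262.937),
]

rows = []
for depth, mse_train, mse_test in results:
    gap = mse_test - mse_train  # how much worse the model is on unseen data
    rows.append({"depth": depth, "gap": gap, "relative_gap": gap / mse_train})

for row in rows:
    print("max_depth={depth:>2}: gap={gap:.3f} "
          "({relative_gap:.1%} of train MSE)".format(**row))

# The depth with the lowest test MSE is the best performer on unseen data.
best = min(results, key=lambda r: r[2])
print("Lowest test MSE at max_depth =", best[0])
```

On this run the gap widens as depth increases, but test MSE keeps falling, so the deeper models fit the train data more closely without yet getting worse on the held-out half. With only two features (store and item) there is limited room to overfit; a clearer case of overfitting would show test MSE rising while train MSE continues to drop.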


