# Kaggle competitions process
  
Kaggle is the most famous platform for Data Science competitions. Taking part in such competitions lets you work with real-world datasets, explore various machine learning problems, compete with other participants, and gain invaluable hands-on experience. In this course, you will learn how to approach and structure any Data Science competition. You will be able to select the correct local validation scheme and avoid overfitting. Moreover, you will master advanced feature engineering together with model ensembling approaches. All these techniques will be practiced on Kaggle competition datasets.
  
In this first chapter, you will get exposure to the Kaggle competition process. You will train a model and prepare a CSV file ready for submission. You will learn the difference between the Public and Private test splits, and how to prevent overfitting.

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
</table>
  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Runs the IPython interactive shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background, immune to hangups, with its output redirected to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines (and the starting line number) of the user-defined function `test`.  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
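```python
import numpy as np
import matplotlib.pyplot as plt

# Sample curve drawn once per style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
# The 10 x 3 grid assumes at most 30 styles are available
for i, style in enumerate(available):
    # Apply the style only while this subplot is created, drawn, and titled
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
        ax.set_title(style)
```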
  

In [2]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

## Competitions overview
  
Hi all! Welcome to the course on Kaggle competitions! In this course, you will develop the overall pipeline for successful participation in Machine Learning competitions. Also, you will learn some practical tips and tricks that can be used in any Machine Learning project.
  
**Instructor**
  
I will be your instructor for this course. My name is Yauhen Babakhin. I have a Master’s Degree in Applied Data Analysis and over 5 years of working experience in Data Science. I'm also a Kaggle Competitions Grandmaster with gold medals in both classic Machine Learning and Deep Learning competitions.
  
**Kaggle**
  
First of all, let's discuss what Kaggle actually is. Kaggle is a web platform for Data Science and Machine Learning competitions. It allows us to solve Data Science challenges and compete with other participants in building the best predictive models.
  
**Kaggle benefits**
  
The list of Kaggle benefits is pretty long, and the platform is useful for everyone, from beginners in Data Science to experienced professionals. We can build practical skills on real-world datasets, develop our own pet projects, meet and grow with a great Kaggle community, gain experience in a new domain or model type, and keep up to date with the best-performing machine learning methods.
  
- Get practical experience on real-world data
- Develop portfolio projects
- Meet a great Data Science community
- Try a new domain or model type
- Keep up to date with the best-performing methods
  
**Competition process**
  
The general competition process consists of three major stages. First, Kaggle gives us a problem definition and the data needed to solve it.
  
Then, we develop a Machine Learning model and prepare a submission file, which we upload to Kaggle.
  
Finally, our submission appears on the so-called "Leaderboard", together with our position relative to other competitors.
  
<center><img src='../_images/kaggle-introduction.png' alt='img' width='740'></center>
  
**How to participate**
  
To start competing on Kaggle, we need three simple steps. First, go to the Kaggle website. Second, select any active competition we're interested in. Third, download the data available in the competition. That's it! Now we're ready to start exploring the data and building Machine Learning models.
  
**New York City Taxi Fare Prediction**
  
As an example, we will work with a past Kaggle playground competition called New York City Taxi Fare Prediction. The goal of this challenge is to predict the fare amount for a taxi ride in New York City given the pickup and dropoff locations.
  
<center><img src='../_images/kaggle-introduction1.png' alt='img' width='740'></center>
  
**Train and Test data**
  
The typical data structure in Kaggle competitions consists of two major parts: the train and test datasets. Our goal is to train a model on the train dataset, where the labels are given, and then make predictions on the test set. Let's read the train dataset from the New York taxi competition using the `pandas` library and look at the columns available there. The first column is an ID variable called 'key'. The 'fare_amount' column is the target variable we'd like to predict. The rest of the columns are features we can use to build the model. Now, let's move on to the test set. It has the same list of columns except for 'fare_amount', as this is the column we should predict.
  
<center><img src='../_images/kaggle-introduction2.png' alt='img' width='740'></center>
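
As a minimal sketch of the comparison described above, the snippet below reads two tiny in-memory stand-ins for the train and test files and checks which columns differ. Only 'key' and 'fare_amount' are named in the text; the feature column names and all values here are assumptions made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-ins for the competition files (values are made up).
# Feature column names are assumptions; the text only names 'key' and 'fare_amount'.
train_csv = io.StringIO(
    "key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude\n"
    "a1,4.5,-73.84,40.72,-73.84,40.71\n"
    "a2,16.9,-74.01,40.71,-73.97,40.78\n"
)
test_csv = io.StringIO(
    "key,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude\n"
    "b1,-73.97,40.76,-73.98,40.74\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Columns present in train but missing from test: this should be the target only.
target_only = [col for col in train.columns if col not in test.columns]
print("Train columns:", train.columns.tolist())
print("Test columns:", test.columns.tolist())
print("Only in train:", target_only)
```

With real competition files, the `io.StringIO` objects would be replaced by the paths to the downloaded CSV files.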
  
**Sample submission**
  
After we have built a model, we can make predictions on the test set and save them as a .csv file, which is then submitted to Kaggle. Every Kaggle competition provides a sample submission file that shows the correct format and structure of the submission. Let's take a look at the head of the sample submission in the taxi fare prediction challenge. As expected, it consists of two columns: the ID column and the 'fare_amount' we're predicting.
  
<center><img src='../_images/kaggle-introduction3.png' alt='img' width='740'></center>
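
The submission step could look roughly like the sketch below. It uses a hypothetical three-row test set, a made-up `passenger_count` feature, and a constant placeholder prediction instead of a real model; the only parts taken from the text are the two submission columns, 'key' and 'fare_amount'.

```python
import io
import pandas as pd

# Hypothetical sample submission: defines the expected columns and their order.
sample_submission = pd.DataFrame(
    {"key": ["t1", "t2", "t3"], "fare_amount": [0.0, 0.0, 0.0]}
)

# Hypothetical test set with one made-up feature column.
test = pd.DataFrame({"key": ["t1", "t2", "t3"], "passenger_count": [1, 2, 1]})

# Placeholder prediction: a constant fare instead of a real model's output.
test["fare_amount"] = 11.0

# Keep only the columns from the sample submission, in the same order.
submission = test[sample_submission.columns.tolist()]

# Write without the index so the file contains exactly the two expected columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Reusing the column list from the sample submission is one way to guard against a mismatched header; writing to a file path instead of the buffer would produce the file to upload.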
  
**Let's practice!**
  
All right, now let's explore train and test datasets from another Kaggle competition.

### Explore train data
  
You will work with another Kaggle competition called "Store Item Demand Forecasting Challenge". In this competition, you are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items in 10 different stores.
  
To begin, let's explore the train data for this competition. For faster performance, you will work with a subset of the train data containing only a single month of history.
  
Your initial goal is to read the input data and take a first look at it.
  
---
  
1. Import `pandas` as `pd`.
2. Read the train data using the `pandas` `read_csv()` function.
3. Print the head of the train data (using the `.head()` method) to see a sample of the data.

In [5]:
# Read train data
train = pd.read_csv('../_datasets/demand_forecasting_train_1_month.csv')

# Look at the shape of the data
print('Train shape:', train.shape)

# Look at the head() of the data
train.head()

Train shape: (15500, 5)


|   | id | date | store | item | sales |
|---|---|---|---|---|---|
| 0 | 100000 | 2017-12-01 | 1 | 1 | 19 |
| 1 | 100001 | 2017-12-02 | 1 | 1 | 16 |
| 2 | 100002 | 2017-12-03 | 1 | 1 | 31 |
| 3 | 100003 | 2017-12-04 | 1 | 1 | 7 |
| 4 | 100004 | 2017-12-05 | 1 | 1 | 20 |


In [16]:
train.describe()

|   | id | store | item | sales |
|---|---|---|---|---|
| count | 15500.0 | 15500.0 | 15500.0 | 15500.0 |
| mean | 107749.5 | 5.5 | 25.5 | 44.849677 |
| std | 4474.608921 | 2.872374 | 14.431335 | 22.617654 |
| min | 100000.0 | 1.0 | 1.0 | 3.0 |
| 25% | 103874.75 | 3.0 | 13.0 | 26.0 |
| 50% | 107749.5 | 5.5 | 25.5 | 42.0 |
| 75% | 111624.25 | 8.0 | 38.0 | 60.0 |
| max | 115499.0 | 10.0 | 50.0 | 129.0 |


In [17]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15500 entries, 0 to 15499
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      15500 non-null  int64 
 1   date    15500 non-null  object
 2   store   15500 non-null  int64 
 3   item    15500 non-null  int64 
 4   sales   15500 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 605.6+ KB
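
The `info()` output shows that `date` is stored with the generic `object` dtype, meaning the dates are plain strings. One possible next step, sketched here on the first three rows shown in the head above rather than on the full file, is to convert the column with `pd.to_datetime` so that date components become accessible. The `day_of_week` variable is a name introduced only for this illustration.

```python
import pandas as pd

# First three rows of the train data, copied from the head() output.
train = pd.DataFrame(
    {
        "date": ["2017-12-01", "2017-12-02", "2017-12-03"],
        "store": [1, 1, 1],
        "item": [1, 1, 1],
        "sales": [19, 16, 31],
    }
)

# Parse the strings into datetimes; the format matches the YYYY-MM-DD values above.
train["date"] = pd.to_datetime(train["date"], format="%Y-%m-%d")

# With a datetime column, components such as the weekday (Monday=0) are available.
day_of_week = train["date"].dt.dayofweek.tolist()
print(train["date"].dtype)
print(day_of_week)
```

Passing `parse_dates=['date']` to `pd.read_csv` is an alternative that performs the conversion while reading the file.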


In [18]:
unique_counts = train.nunique()
for col, count in unique_counts.items():
    print('Column "{}" has {} unique value(s).'.format(col, count))

Column "id" has 15500 unique value(s).
Column "date" has 31 unique value(s).
Column "store" has 10 unique value(s).
Column "item" has 50 unique value(s).
Column "sales" has 123 unique value(s).
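
These counts are consistent with the data holding exactly one row per (date, store, item) combination, since 31 × 10 × 50 = 15500, which matches the number of rows. The sketch below rebuilds that grid as a quick sanity check. It assumes the month is December 2017 (the head starts on 2017-12-01 and there are 31 unique dates) and that stores and items are numbered 1 to 10 and 1 to 50, as the min and max in `describe()` suggest.

```python
import pandas as pd

# Assumed ranges, inferred from the outputs above rather than read from the file.
dates = pd.date_range("2017-12-01", "2017-12-31", freq="D")
stores = range(1, 11)
items = range(1, 51)

# Every (date, store, item) combination.
grid = pd.MultiIndex.from_product(
    [dates, stores, items], names=["date", "store", "item"]
)

print("Days:", len(dates))
print("Combinations:", len(grid))
```

On the real data, comparing `len(grid)` with `train.shape[0]`, or checking `train.duplicated(['date', 'store', 'item']).any()`, would confirm whether any combination is missing or repeated.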
