# Sales Prediction Pair Programming Competition

In this notebook, we will tackle the sales prediction challenge by performing data exploration, preprocessing, model training, and finally making predictions. We'll walk through each step with detailed explanations and code comments.

## Step 1: Import Libraries

We begin by importing all the necessary libraries for data manipulation, visualization, and model training.

In [1]:
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For visualization
import seaborn as sns  # For visualization
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV  # For model training and tuning
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor  # For regression modeling
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score  # For model evaluation (root_mean_squared_error requires scikit-learn 1.4+)

## Step 2: Load the Dataset

We load the dataset from the provided file and inspect the first few rows to understand the structure of the data.

In [2]:
# Load the dataset
data = pd.read_csv('data/raw/sales.csv')

# Display the first few rows of the dataset
data.head()

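| | True_index | Store_ID | Day_of_week | Date | Nb_customers_on_day | Open | Promotion | State_holiday | School_holiday | Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 625 | 3 | 2013-11-06 | 641 | 1 | 1 | 0 | 0 | 7293 |
| 1 | 1 | 293 | 2 | 2013-07-16 | 877 | 1 | 1 | 0 | 1 | 7060 |
| 2 | 2 | 39 | 4 | 2014-01-23 | 561 | 1 | 1 | 0 | 0 | 4565 |
| 3 | 3 | 676 | 4 | 2013-09-26 | 1584 | 1 | 1 | 0 | 0 | 6380 |
| 4 | 4 | 709 | 3 | 2014-01-22 | 1477 | 1 | 1 | 0 | 0 | 11647 |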


## Step 3: Data Exploration

In this step, we explore the dataset to understand its structure, data types, and summary statistics. This helps us identify any potential issues such as missing values or incorrect data types.

In [3]:
# Display basic information about the dataset
data.info()

# Display summary statistics
data.describe()

# Check for missing values
data.isnull().sum()

# Convert column names to lowercase
data.columns = data.columns.str.lower()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640840 entries, 0 to 640839
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   True_index           640840 non-null  int64 
 1   Store_ID             640840 non-null  int64 
 2   Day_of_week          640840 non-null  int64 
 3   Date                 640840 non-null  object
 4   Nb_customers_on_day  640840 non-null  int64 
 5   Open                 640840 non-null  int64 
 6   Promotion            640840 non-null  int64 
 7   State_holiday        640840 non-null  object
 8   School_holiday       640840 non-null  int64 
 9   Sales                640840 non-null  int64 
dtypes: int64(8), object(2)
memory usage: 48.9+ MB
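
The dtype and missing-value checks above can also be collected into a single summary table. The sketch below does this on a made-up toy frame; the names `toy` and `summary` and all values are illustrative assumptions, not taken from the sales file. Printing the table explicitly also ensures it is shown even when it is not the last expression in a cell.

```python
import pandas as pd

# Toy frame standing in for the sales data (values are invented for illustration)
toy = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "date": ["2013-11-06", "2013-07-16", None, "2013-09-26"],
    "state_holiday": ["0", "a", "0", "0"],
    "sales": [7293, 7060, 4565, 6380],
})

# One row per column: dtype, number of missing values, number of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "n_unique": toy.nunique(),  # nunique() ignores missing values by default
})
print(summary)
```

On the real data, the same pattern would be applied to `data` instead of `toy`.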


## Step 4: Data Preprocessing

Here, we preprocess the data to make it suitable for modeling. This includes handling date features, encoding categorical variables, and creating new features.

### 4.1 Handle Date Feature

We split the `date` column into `year`, `month`, and `day` to extract more information from the date.

In [4]:
# Convert date column to datetime format
data['date'] = pd.to_datetime(data['date'])

# Extract year, month, and day from the date
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day

# Drop the original date column
data.drop('date', axis=1, inplace=True)
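
As a quick check of the date handling, the sketch below runs the same extraction on three dates from the preview in Step 2. The `week_of_year` column is an optional extra that this notebook does not use; it is included only to show that other calendar parts can be derived the same way.

```python
import pandas as pd

# Three dates taken from the preview rows in Step 2
toy = pd.DataFrame({"date": ["2013-11-06", "2013-07-16", "2014-01-23"]})
toy["date"] = pd.to_datetime(toy["date"])

# Same calendar parts as in the notebook
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day"] = toy["date"].dt.day

# Optional extra (an assumption, not part of the notebook): ISO week number
toy["week_of_year"] = toy["date"].dt.isocalendar().week.astype(int)

# Drop the original datetime column, as the notebook does
toy = toy.drop(columns="date")
print(toy)
```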

### 4.2 Encode Categorical Variables

We encode the `state_holiday` column, which contains categorical values, into numerical values to make it usable by machine learning models.

In [5]:
# Encode state_holiday as numerical
data['state_holiday'] = data['state_holiday'].map({'0': 0, 'a': 1, 'b': 2, 'c': 3})
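
One thing worth checking after a dictionary `map` is whether any value fell outside the mapping, because unmapped values silently become `NaN`. The sketch below illustrates this with a toy column that mixes the integer `0` with string codes. Whether the real `state_holiday` column contains such mixed types is an assumption, not something confirmed here.

```python
import pandas as pd

# Toy column mixing the integer 0 with string codes (hypothetical example)
raw = pd.Series([0, "0", "a", "b", "c"], dtype=object)

mapping = {"0": 0, "a": 1, "b": 2, "c": 3}

# Mapping directly: the integer 0 has no matching key and becomes NaN
naive = raw.map(mapping)

# Casting to str first: both 0 and "0" map to the same code
safe = raw.astype(str).map(mapping)

print(naive.isna().sum())  # 1
print(safe.isna().sum())   # 0
```

Running `data['state_holiday'].isna().sum()` after the encoding step is a cheap way to confirm that every value was covered.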

## Step 5: Feature Selection

We separate the features (`X`) and the target variable (`y`). The target variable in this case is `sales`.

In [6]:
# Split into features (X) and target (y)
X = data.drop(columns=['sales'])
y = data['sales']

## Step 6: Split the Dataset

We split the dataset into training and validation sets to evaluate the model's performance on unseen data.

In [7]:
# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## Step 7: Model Training and Comparison

We will try several models and evaluate their performance to determine the best model for our data.

### 7.1 Train and Evaluate Multiple Models

We will train three different models: Gradient Boosting Regressor, Random Forest Regressor, and XGBoost Regressor. We will evaluate each model using the validation set.

In [8]:
# Initialize models
models = {
    'GradientBoosting': GradientBoostingRegressor(random_state=30),
    'RandomForest': RandomForestRegressor(random_state=30, n_jobs=-1),
    'XGBoost': XGBRegressor(random_state=30, n_jobs=-1),
}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)
    rmse = root_mean_squared_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)
    print(f"{name} - MAE: {mae}, RMSE: {rmse}, R2: {r2}")

GradientBoosting - MAE: 869.8411068135325, RMSE: 1282.6054279282378, R2: 0.8907289976915952
RandomForest - MAE: 567.4649208070658, RMSE: 911.6463833914406, R2: 0.944795865342002
XGBoost - MAE: 681.3823411977924, RMSE: 1010.375723431068, R2: 0.9321914315223694
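
For reference, the three metrics reported above can be written out by hand. The sketch below computes MAE, RMSE, and R2 with NumPy on four made-up values (not taken from the dataset) so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up actual and predicted values for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_true - y_hat  # [-10, 10, -30, 20]

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))  # (10 + 10 + 30 + 20) / 4 = 17.5

# Root mean squared error: penalizes large errors more heavily
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(1500 / 4) = sqrt(375)

# R2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(errors ** 2)                     # 1500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 50000
r2 = 1 - ss_res / ss_tot                         # 0.97

print(f"MAE: {mae}, RMSE: {rmse:.2f}, R2: {r2}")
```

Lower MAE and RMSE are better, while R2 is better the closer it is to 1, which is why RandomForest ranks first on all three metrics above.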


### 7.2 Hyperparameter Tuning with RandomizedSearchCV

After evaluating the models, we choose the best-performing model and perform hyperparameter tuning using RandomizedSearchCV.

In [9]:
# Select the best model (RandomForest had the lowest MAE and RMSE and the highest R2 in Step 7.1)
model = RandomForestRegressor(random_state=30)

# Set up RandomizedSearchCV to find the best parameters
param_dist = {
    'n_estimators': [int(x) for x in np.linspace(start=10, stop=100, num=5)],
    'max_depth': [int(x) for x in np.linspace(start=2, stop=8, num=4)],
}

random_search = RandomizedSearchCV(
    model,
    param_distributions=param_dist,
    n_iter=2,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

# Get the best estimator
best_model = random_search.best_estimator_
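
To see how much of the search space is actually explored, the sketch below enumerates the candidate values that `param_dist` produces. With `n_iter=2`, the search samples only 2 of the 20 possible combinations. Every candidate also caps `max_depth` at 8 or less, whereas the RandomForest in Step 7.1 used the default unlimited depth, so the tuned model is not guaranteed to beat the untuned one.

```python
import itertools
import numpy as np

# Same expressions as in param_dist above
n_estimators = [int(x) for x in np.linspace(start=10, stop=100, num=5)]
max_depth = [int(x) for x in np.linspace(start=2, stop=8, num=4)]

# Every combination the randomized search could draw from
grid = list(itertools.product(n_estimators, max_depth))

print(n_estimators)  # [10, 32, 55, 77, 100]
print(max_depth)     # [2, 4, 6, 8]
print(len(grid))     # 20
```

Raising `n_iter` or widening the `max_depth` range would cover more of this space, at the cost of longer training time.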

## Step 8: Retrain the Best Model

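We retrain the best model on the full labeled dataset (`X`, `y`), which includes the validation rows.

In [10]:
# Train the best model on the full dataset (training and validation rows combined)
best_model.fit(X, y)

In [35]:
# Calculate best model metrics
# Note: X_val is part of the data the model was just fitted on, so these scores are not a held-out estimate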
y_pred = best_model.predict(X_val)
mae = mean_absolute_error(y_val, y_pred)
rmse = root_mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
print(f"Best Model - MAE: {mae}, RMSE: {rmse}, R2: {r2}")

Best Model - MAE: 873.4832535627821, RMSE: 1320.510756471748, R2: 0.8841749047257152


## Step 9: Load New Dataset and Make Predictions

We load the new dataset (without the target variable) and use the trained model to make predictions.

In [25]:
# Load the new dataset without target variable
new_data = pd.read_csv('data/raw/ironkaggle_notarget.csv')

# Convert column names to lowercase
new_data.columns = new_data.columns.str.lower()

# Preprocess the new dataset
new_data['date'] = pd.to_datetime(new_data['date'])
new_data['year'] = new_data['date'].dt.year
new_data['month'] = new_data['date'].dt.month
new_data['day'] = new_data['date'].dt.day
new_data.drop('date', axis=1, inplace=True)
new_data['state_holiday'] = new_data['state_holiday'].map({'0': 0, 'a': 1, 'b': 2, 'c': 3})

# Make predictions on the new dataset
predictions = best_model.predict(new_data)

# Save predictions to a CSV file
output = pd.DataFrame({'true_index': new_data['true_index'], 'predicted_sales': predictions})
output.to_csv('data/cleaned/sales_predictions.csv', index=False)
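
Because the new dataset must go through exactly the same preprocessing as the training data, one option is to wrap the steps in a single function and reuse it. The sketch below is a minimal version under stated assumptions: `preprocess` and `feature_cols` are hypothetical names that do not appear in the notebook, and the toy frames are invented. It also reorders the new data's columns to match the training features before prediction.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's preprocessing steps and return a new frame."""
    out = df.copy()
    out.columns = out.columns.str.lower()
    out["date"] = pd.to_datetime(out["date"])
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["day"] = out["date"].dt.day
    out = out.drop(columns="date")
    out["state_holiday"] = out["state_holiday"].map({"0": 0, "a": 1, "b": 2, "c": 3})
    return out

# Toy stand-ins for the labeled data and the new, unlabeled data
train_raw = pd.DataFrame({
    "Store_ID": [625, 293],
    "Date": ["2013-11-06", "2013-07-16"],
    "State_holiday": ["0", "a"],
    "Sales": [7293, 7060],
})
new_raw = pd.DataFrame({
    "State_holiday": ["0"],      # different column order, no Sales column
    "Date": ["2014-01-23"],
    "Store_ID": [39],
})

train = preprocess(train_raw)
feature_cols = [c for c in train.columns if c != "sales"]

# Select and order the new data's columns exactly like the training features
new_features = preprocess(new_raw)[feature_cols]
print(new_features)
```

Selecting by `feature_cols` raises a `KeyError` if the new file is missing a training column, which surfaces schema mismatches before `predict` is called.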

## Step 10: Compare Predictions with Real Data

We compare the predicted values with the actual values provided in the solution dataset.

In [37]:
# Load the real values dataset
solution_data = pd.read_csv('data/raw/ironkaggle_solutions.csv')

# Convert column names to lowercase
solution_data.columns = solution_data.columns.str.lower()

# Merge predictions with real values for comparison
comparison = pd.merge(output, solution_data, on='true_index', how='inner')

# Calculate evaluation metrics for the new dataset
mae_new = mean_absolute_error(comparison['sales'], comparison['predicted_sales'])
rmse_new = root_mean_squared_error(comparison['sales'], comparison['predicted_sales'])
r2_new = r2_score(comparison['sales'], comparison['predicted_sales'])

print(f"New Dataset MAE: {mae_new}")
print(f"New Dataset RMSE: {rmse_new}")
print(f"New Dataset R2: {r2_new}")

New Dataset MAE: 534.4308267697052
New Dataset RMSE: 856.5095800753584
New Dataset R2: 0.9473019658323503
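
An inner merge silently drops rows whose `true_index` appears in only one of the two files, so the metrics above are computed on the matched rows only. The sketch below, on made-up frames, shows one way to check this using the `validate` and `indicator` options of `pd.merge`. The names `preds`, `truth`, and `unmatched` are illustrative.

```python
import pandas as pd

# Made-up predictions and solutions; index 3 and index 4 have no counterpart
preds = pd.DataFrame({"true_index": [1, 2, 3], "predicted_sales": [100.0, 200.0, 300.0]})
truth = pd.DataFrame({"true_index": [1, 2, 4], "sales": [110, 190, 400]})

# validate raises if true_index is duplicated on either side;
# indicator adds a _merge column saying where each row came from
merged = pd.merge(
    preds, truth, on="true_index", how="outer",
    validate="one_to_one", indicator=True,
)

unmatched = merged[merged["_merge"] != "both"]
comparison = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(f"Matched rows: {len(comparison)}, unmatched rows: {len(unmatched)}")
```

If `unmatched` is empty on the real files, the inner merge in the cell above loses nothing.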


## Conclusion

In this notebook, we explored the dataset, preprocessed the data, trained multiple models, and compared their performance. We selected RandomForestRegressor, tuned it with RandomizedSearchCV, retrained it on the entire dataset, and made predictions on a new dataset. Finally, we evaluated the predictions against the actual sales values using MAE, RMSE, and R2. The final predictions have been saved to a CSV file for submission.

We conclude that the final model performs better on the new dataset than the first selected model did on the validation set. The two sets of metrics below come from different data, so the comparison is indicative rather than exact.

**Metrics of the first selected model (untuned RandomForestRegressor, validation set)**
*MAE: 567.4649208070658*
*RMSE: 911.6463833914406*
*R2: 0.944795865342002*

**Metrics of `best_model` on the new dataset (compared with actual sales)**
*MAE: 534.4308267697052*
*RMSE: 856.5095800753584*
*R2: 0.9473019658323503*