# Project: predicting orders for Glovo

Imagine you just joined Glovo. Glovo follows a slot-based system for the couriers to fulfil the orders that come in. For simplification, you can imagine those slots are non-overlapping hours, so that every city has 24 slots every day, one for each hour. Glovo needs to know the optimal number of couriers that are needed on every hour slot of every city. Too many couriers, and there will be many idle couriers not earning money. Too few couriers, and orders will have to wait to be processed, leading to higher delivery times.

At the moment, Operations decides manually how many couriers are needed, based on past demand. As the number of cities grows, this becomes unsustainable. They want to automate the process by which they decide how many courier-slots should be opened every hour. For simplification, we can assume that every Sunday at midnight, we need to know how many couriers we need for every hour of the week that is starting. That means that if today is Sunday, May 8th 23:59, they want us to know how many orders will be placed every hour of the week that goes from May 9th 00:00 to May 15th 23:00, both included. Every Sunday, you can use all data from that week to forecast the next one.

This problem has many steps, but we will keep this project to the order forecast for one city: we want to know, for one city and every Sunday, how many orders we're going to receive on every hour of the upcoming week.

Load the file data_BCN.csv

Explore the data and visualise it. Look for trends, cycles and seasonalities. Can you also find any outliers, i.e. days or hours that break those patterns?


In [39]:
#!pip install pandas-profiling


In [40]:
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


# Import Data

In [41]:
#from ydata_profiling import ProfileReport

data = pd.read_csv('./data_BCN.csv')


# Data Inspection

In [42]:
data.head(10)

Unnamed: 0,time,orders,city
0,2021-02-01 0:00:00,0.0,BCN
1,2021-02-01 1:00:00,0.0,BCN
2,2021-02-01 2:00:00,0.0,BCN
3,2021-02-01 3:00:00,0.0,BCN
4,2021-02-01 4:00:00,0.0,BCN
5,2021-02-01 5:00:00,0.0,BCN
6,2021-02-01 6:00:00,2.0,BCN
7,2021-02-01 7:00:00,3.0,BCN
8,2021-02-01 8:00:00,9.0,BCN
9,2021-02-01 9:00:00,33.0,BCN


city: All data points are from Barcelona (BCN). Since all entries are for one city, this column carries no information for modelling. orders: Stored as floats, so the values will need to be converted to integers.

In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8766 entries, 0 to 8765
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    8766 non-null   object 
 1   orders  8766 non-null   float64
 2   city    8766 non-null   object 
dtypes: float64(1), object(2)
memory usage: 205.6+ KB


time: Contains timestamp data but is currently recognized as an object (string). This needs to be converted to a datetime type.

In [44]:
data.describe()

Unnamed: 0,orders
count,8766.0
mean,73.145175
std,111.038384
min,0.0
25%,0.0
50%,30.0
75%,97.0
max,939.0


Hourly order counts range from 0 to 939, a wide variation that is typical of delivery data. There are 8,766 entries, roughly one year of hourly data (24 hours × 365 days = 8,760). The count is not a whole number of days (8,766 / 24 = 365.25), so the series is worth checking for missing or duplicated timestamps.
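A quick way to test whether the hourly index is complete is to compare the observed timestamps with a full hourly range. The sketch below is only an illustration: `find_missing_hours` is a hypothetical helper, and the toy timestamps are synthetic, not taken from `data_BCN.csv`.

```python
import pandas as pd

def find_missing_hours(timestamps: pd.Series) -> pd.DatetimeIndex:
    """Return the hourly timestamps absent between the first and last observation."""
    ts = pd.to_datetime(timestamps)
    # Every hour that should exist between the first and last timestamp
    full_range = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    return full_range.difference(pd.DatetimeIndex(ts))

# Synthetic example: 03:00 is deliberately left out
toy_times = pd.Series([
    "2021-02-01 00:00:00", "2021-02-01 01:00:00",
    "2021-02-01 02:00:00", "2021-02-01 04:00:00",
])
missing = find_missing_hours(toy_times)
n_duplicates = int(pd.to_datetime(toy_times).duplicated().sum())
print(missing, n_duplicates)
```

On the real data the call would be `find_missing_hours(data['time'])`; any gaps found would matter for lag-based features and for models that assume a regular hourly frequency.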

In [45]:
data.isnull().sum()

time      0
orders    0
city      0
dtype: int64

In [46]:
px.line(data, x='time', y='orders')

General Trend:

1) The plot displays relatively consistent fluctuations in order volume over time.
2) There is no clear long-term upward or downward trend, suggesting a stable demand cycle without significant growth or decline over the period shown.
3) Peaks and troughs recur at regular intervals, consistent with weekly cycles.
4) High spikes could be influenced by events or promotions.
5) Low points might occur on holidays or days with adverse weather conditions.
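The daily and weekly cycles suggested by the plot can be checked numerically with the autocorrelation at lags of 24 and 168 hours. This is a minimal sketch on a synthetic series; `cycle_strength` is a hypothetical helper, and it assumes the series has no missing hours, because lags are counted in rows.

```python
import numpy as np
import pandas as pd

def cycle_strength(orders: pd.Series, lags=(24, 168)) -> dict:
    """Autocorrelation of an hourly series at the given lags (24 = daily, 168 = weekly)."""
    return {lag: orders.autocorr(lag=lag) for lag in lags}

# Synthetic hourly series (4 weeks) with a built-in 24-hour cycle
hours = np.arange(24 * 28)
toy_orders = pd.Series(50 + 40 * np.sin(2 * np.pi * hours / 24))

strength = cycle_strength(toy_orders)
print(strength)
```

Values close to 1 at a given lag indicate a strong repeating pattern at that period; on the real data one would pass `data['orders']` after sorting by time.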


In [93]:
# Separate features and target; the chronological train/test split happens further below
X = data.drop('orders', axis=1)  # 'orders' is the target variable
y = data['orders'].astype(int)   # Orders are whole numbers, so store them as int

In [95]:
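# Convert 'time' to datetime on the full feature frame, before the split
X['time'] = pd.to_datetime(X['time'])
# Extract the hour of day; the outputs below show this 'hour' column
X['hour'] = X['time'].dt.hour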

In [96]:
# get max and min values of time
max_time = X['time'].max()
min_time = X['time'].min()

# Split data into training and testing sets based on time 
# (e.g. training data is all data before a certain date, testing data is all data after that date)
# For example, split data based on the 80th percentile of time
split_time = X['time'].quantile(0.8)
X_train = X[X['time'] <= split_time]
X_test = X[X['time'] > split_time]
y_train = y[X_train.index]
y_test = y[X_test.index]

In [97]:
# Example of how to check the first few rows of X_train to confirm changes
X_train.head(10)

Unnamed: 0,time,city,hour
0,2021-02-01 00:00:00,BCN,0
1,2021-02-01 01:00:00,BCN,1
2,2021-02-01 02:00:00,BCN,2
3,2021-02-01 03:00:00,BCN,3
4,2021-02-01 04:00:00,BCN,4
5,2021-02-01 05:00:00,BCN,5
6,2021-02-01 06:00:00,BCN,6
7,2021-02-01 07:00:00,BCN,7
8,2021-02-01 08:00:00,BCN,8
9,2021-02-01 09:00:00,BCN,9


In [98]:
X_test.head(10)

Unnamed: 0,time,city,hour
7013,2021-11-20 23:00:00,BCN,23
7014,2021-11-21 00:00:00,BCN,0
7015,2021-11-21 01:00:00,BCN,1
7016,2021-11-21 02:00:00,BCN,2
7017,2021-11-21 03:00:00,BCN,3
7018,2021-11-21 04:00:00,BCN,4
7019,2021-11-21 05:00:00,BCN,5
7020,2021-11-21 06:00:00,BCN,6
7021,2021-11-21 07:00:00,BCN,7
7022,2021-11-21 08:00:00,BCN,8


## EDA Train Data Set

In [99]:
# Join y_train back with X_train for EDA purposes
# Make sure to do this joining only for EDA to avoid leaking target info back into the features

eda_train = X_train.copy()
eda_train['orders'] = y_train
eda_train['orders'] = eda_train['orders'].astype(int)
eda_train.head(10)

Unnamed: 0,time,city,hour,orders
0,2021-02-01 00:00:00,BCN,0,0
1,2021-02-01 01:00:00,BCN,1,0
2,2021-02-01 02:00:00,BCN,2,0
3,2021-02-01 03:00:00,BCN,3,0
4,2021-02-01 04:00:00,BCN,4,0
5,2021-02-01 05:00:00,BCN,5,0
6,2021-02-01 06:00:00,BCN,6,2
7,2021-02-01 07:00:00,BCN,7,3
8,2021-02-01 08:00:00,BCN,8,9
9,2021-02-01 09:00:00,BCN,9,33


In [100]:
# Create eda_test similar to how eda_train was created
eda_test = X_test.copy()
eda_test['orders'] = y_test
eda_test['orders'] = eda_test['orders'].astype(int)



In [101]:
print(eda_train.describe())

                                time         hour       orders
count                           7013  7013.000000  7013.000000
mean   2021-06-27 11:49:21.414515968    11.521460    68.653215
min              2021-02-01 00:00:00     0.000000     0.000000
25%              2021-04-15 07:00:00     6.000000     0.000000
50%              2021-06-27 14:00:00    12.000000    28.000000
75%              2021-09-08 15:00:00    18.000000    91.000000
max              2021-11-20 22:00:00    23.000000   939.000000
std                              NaN     6.915099   103.773852


In [102]:
eda_train = eda_train.sort_values(by='time')
fig = px.line(eda_train, x='time', y='orders', title='Order Trends Over Time')
fig.show()


In [103]:
# find zero values

zero_values = eda_train[eda_train['orders'] == 0]
print(zero_values.shape[0] / eda_train.shape[0] * 100)

31.912163125623845
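Roughly a third of the training hours have no orders. The first rows of the data show zeros between 00:00 and 05:00, so a natural follow-up is to break the zero share down by hour of day. The sketch below uses a hypothetical helper, `zero_share_by_hour`, and synthetic rows.

```python
import pandas as pd

def zero_share_by_hour(df: pd.DataFrame) -> pd.Series:
    """Percentage of observations with zero orders, for each hour of the day."""
    is_zero = df["orders"].eq(0)
    return is_zero.groupby(df["time"].dt.hour).mean().mul(100)

# Synthetic rows: 03:00 is always empty, 21:00 is empty half of the time
toy = pd.DataFrame({
    "time": pd.to_datetime(["2021-02-01 03:00", "2021-02-02 03:00",
                            "2021-02-01 21:00", "2021-02-02 21:00"]),
    "orders": [0, 0, 120, 0],
})
share = zero_share_by_hour(toy)
print(share)
```

If the zeros turn out to sit almost entirely in night hours, they reflect closing times rather than weak demand, which would change how they should be treated in the model.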


In [104]:
# get season for each row
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'
    
    
# Ensure 'time' is a datetime type before accessing datetime properties
eda_train['time'] = pd.to_datetime(eda_train['time'])

# Now apply the datetime operations
eda_train['day_of_week'] = eda_train['time'].dt.day_name()
eda_train['month'] = eda_train['time'].dt.month
eda_train['season'] = eda_train['month'].apply(get_season)
eda_train['year'] = eda_train['time'].dt.year

# Ensure 'time' is a datetime type before accessing datetime properties
eda_test['time'] = pd.to_datetime(eda_test['time'])

# Now apply the datetime operations
eda_test['day_of_week'] = eda_test['time'].dt.day_name()
eda_test['month'] = eda_test['time'].dt.month
eda_test['season'] = eda_test['month'].apply(get_season)
eda_test['year'] = eda_test['time'].dt.year


eda_train.head(10)


Unnamed: 0,time,city,hour,orders,day_of_week,month,season,year
0,2021-02-01 00:00:00,BCN,0,0,Monday,2,Winter,2021
1,2021-02-01 01:00:00,BCN,1,0,Monday,2,Winter,2021
2,2021-02-01 02:00:00,BCN,2,0,Monday,2,Winter,2021
3,2021-02-01 03:00:00,BCN,3,0,Monday,2,Winter,2021
4,2021-02-01 04:00:00,BCN,4,0,Monday,2,Winter,2021
5,2021-02-01 05:00:00,BCN,5,0,Monday,2,Winter,2021
6,2021-02-01 06:00:00,BCN,6,2,Monday,2,Winter,2021
7,2021-02-01 07:00:00,BCN,7,3,Monday,2,Winter,2021
8,2021-02-01 08:00:00,BCN,8,9,Monday,2,Winter,2021
9,2021-02-01 09:00:00,BCN,9,33,Monday,2,Winter,2021


In [105]:
# Find the hourly slots with the top 5 highest orders
top_orders = eda_train.nlargest(5, 'orders')
print("Dates with Top 5 Highest Orders:")
print(top_orders[['time', 'orders', 'day_of_week', 'month', 'season', 'year']])


Dates with Top 5 Highest Orders:
                    time  orders day_of_week  month  season  year
5649 2021-09-24 21:00:00     939      Friday      9  Autumn  2021
4137 2021-07-23 21:00:00     873      Friday      7  Summer  2021
1287 2021-03-26 21:00:00     846      Friday      3  Spring  2021
7011 2021-11-20 21:00:00     701    Saturday     11  Autumn  2021
6483 2021-10-29 21:00:00     689      Friday     10  Autumn  2021


## Analysis of Peak Order Times and Dates

### Time Consistency:
- **Observation**: All top order instances occurred consistently at 21:00 (9 PM).
- **Implication**: Indicates a peak demand period for orders, likely related to dinner time preferences. This can assist in optimizing courier schedules and marketing strategies to meet high demand efficiently.

### Day Consistency:
- **Observation**: Four of the five highest-order hours fall on Fridays, and one falls on a Saturday.
- **Implication**: This trend suggests higher order volumes toward the end of the week, potentially due to weekend preparations or leisure activities. Operational strategies should consider increased resources during these times.

### Date-Specific Insights:
- **September 24, 2021 (Friday)**: Aligns with La Mercè celebrations, likely boosting orders due to city-wide festivities and increased social activities.
- **July 23, 2021 (Friday)**: Typical high demand during the summer, potentially enhanced by vacation season and leisure activities.
- **March 26, 2021 (Friday)**: Reflects regular end-of-week demand without correlation to specific public events.
- **November 20, 2021 (Saturday)**: The only Saturday in the top five. No specific public event is identified for this date.
- **October 29, 2021 (Friday)**: Falls just before the Halloween and All Saints' Day (November 1) long weekend, which may have contributed to higher than usual order volumes.

### Operational Recommendations:
- **Enhance Courier Availability**: Particularly on Friday evenings and notable holidays, ensure sufficient couriers are available to handle the surge in orders.
- **Marketing Initiatives**: Deploy targeted promotions during identified peak times to increase order volumes and customer engagement.
- **Adjust Operational Hours**: Consider extending operational hours during peak days or adjusting staff schedules to accommodate the increased demand.


In [106]:
# plot order count by day of week in a bar chart plotly
grouped = eda_train.groupby('day_of_week')['orders'].sum().reset_index()

# sort data 
grouped = grouped.sort_values('orders', ascending=False)

fig = px.bar(grouped, x='day_of_week', y='orders', title='Order count by day of week')
fig.show()

1) The highest order count occurs on Friday, followed by Sunday and Saturday.
2) The weekdays, from Monday to Thursday, show a relatively similar number of orders, with slight variations.

In [107]:
# plot order count by season in a bar chart plotly
grouped = eda_train.groupby('season')['orders'].sum().reset_index()

# sort data
grouped = grouped.sort_values('orders', ascending=False)

fig = px.bar(grouped, x='season', y='orders', title='Order count by season')
fig.show()

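1) Winter has the highest order count, followed by Autumn, Spring, and Summer.
2) This could indicate that more orders are placed in colder months. The bars are sums, however, and the training window (2021-02-01 to 2021-11-20) covers the seasons unevenly: Winter is represented by February only. Totals per season should therefore be read with caution.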
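Because seasons can cover different numbers of hours in the data, comparing the mean orders per hour is often fairer than comparing totals. The following sketch shows the idea on synthetic numbers; `orders_per_hour_by_group` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def orders_per_hour_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Total orders, number of hourly rows, and mean orders per hour for each group."""
    stats = df.groupby(group_col)["orders"].agg(
        total="sum", hours="count", mean_per_hour="mean"
    )
    return stats.sort_values("mean_per_hour", ascending=False)

# Synthetic example: Summer has more rows and a larger total,
# but Winter has the higher average per hour
toy = pd.DataFrame({
    "season": ["Winter", "Winter", "Summer", "Summer", "Summer", "Summer"],
    "orders": [100, 80, 50, 50, 50, 50],
})
stats = orders_per_hour_by_group(toy, "season")
print(stats)
```

On the notebook's data the call would be `orders_per_hour_by_group(eda_train, 'season')`, and the same helper works for `'month'` or `'day_of_week'`.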

In [108]:
# Map the month numbers to month names
month_names = {
    1: 'January', 2: 'February', 3: 'March', 4: 'April', 
    5: 'May', 6: 'June', 7: 'July', 8: 'August', 
    9: 'September', 10: 'October', 11: 'November', 12: 'December'
}
eda_train['month_name'] = eda_train['month'].map(month_names)
eda_test['month_name'] = eda_test['month'].map(month_names)

# Group by the new 'month_name' column
grouped = eda_train.groupby('month_name')['orders'].sum().reset_index()

# Plotly might not automatically sort the months correctly, so we'll sort them manually
ordered_months = ['January', 'February', 'March', 'April', 'May', 'June', 
                  'July', 'August', 'September', 'October', 'November', 'December']
grouped['month_name'] = pd.Categorical(grouped['month_name'], categories=ordered_months, ordered=True)
grouped = grouped.sort_values('month_name')

# Plot order count by month with month names
fig = px.bar(grouped, x='month_name', y='orders', title='Order count by month')
fig.show()

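1) January appears to have the highest number of orders, while the orders in February, March, and April are slightly lower but relatively consistent. Note that the training window (2021-02-01 to 2021-11-20) contains no January or December rows, so the January comparison only holds when the chart is drawn from the full dataset.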
2) There's a noticeable drop in orders in August (month 8), which could be due to various factors such as holidays or seasonal changes in customer behaviour.

In [109]:
# plot day of the week grouped by season in a grouped bar chart plotly 

grouped = eda_train.groupby(['day_of_week', 'season'])['orders'].sum().reset_index()

fig = px.bar(grouped, x='season', y='orders', color='day_of_week', title='Order count by day of week grouped by season', barmode='group')
fig.show()

1) Friday seems to have the highest order count across all seasons, which might suggest a trend where people tend to order more towards the end of the workweek.
2) The lowest order counts seem to be on Wednesday, though the patterns vary with the season.
3) The order counts are highest in Winter and lowest in Summer. 

In [110]:
# Copy relevant columns to a new DataFrame
df_hm = eda_train[['season', 'day_of_week', 'orders']].copy()

# Convert categorical variables to numeric codes
df_hm['season'] = df_hm['season'].astype('category').cat.codes
df_hm['day_of_week'] = df_hm['day_of_week'].astype('category').cat.codes

# Pivot the data
pivot_data = df_hm.pivot_table(index='season', columns='day_of_week', values='orders', aggfunc='sum')

# Create heatmap
fig = px.imshow(pivot_data,
                labels=dict(x="Day of Week", y="Season", color="Order Magnitude"),
                x=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                y=['Fall', 'Spring', 'Summer', 'Winter'],
                title='Order Magnitude Heatmap by Day of Week and Season')
fig.show()


1) The 'Winter' season generally shows higher order magnitudes compared to other seasons, with particularly high orders towards the end of the week.
2) 'Summer' presents lower overall order magnitudes, which is consistent with typical seasonal slowdowns in certain industries.
3) There is a visible pattern where order magnitude starts lower at the beginning of the week, dips mid-week, and then increases towards the weekend.
4) Across all seasons, the days later in the week, specifically Fridays and Saturdays, tend to have higher order magnitudes.
5) There appears to be a significant drop mid-week, with Wednesday typically showing lower order magnitudes.


### Zero order analysis

In [111]:
# get df with only zeros 

zero_values = eda_train[eda_train['orders'] == 0]

zero_values.head()

Unnamed: 0,time,city,hour,orders,day_of_week,month,season,year,month_name
0,2021-02-01 00:00:00,BCN,0,0,Monday,2,Winter,2021,February
1,2021-02-01 01:00:00,BCN,1,0,Monday,2,Winter,2021,February
2,2021-02-01 02:00:00,BCN,2,0,Monday,2,Winter,2021,February
3,2021-02-01 03:00:00,BCN,3,0,Monday,2,Winter,2021,February
4,2021-02-01 04:00:00,BCN,4,0,Monday,2,Winter,2021,February


In [112]:
# get_season is already defined above, so it is reused here rather than redefined.
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning.
zero_values = zero_values.copy()

zero_values['day_of_week'] = zero_values['time'].dt.day_name()
zero_values['month'] = zero_values['time'].dt.month
zero_values['season'] = zero_values['month'].apply(get_season)
zero_values['year'] = zero_values['time'].dt.year





In [113]:
grouped = zero_values.groupby('day_of_week')['city'].count().reset_index()

# sort data 
grouped = grouped.sort_values('city', ascending=False)

fig = px.bar(grouped, x='day_of_week', y='city', title='Count of Zero Orders by day of week')
fig.show()

1) Zero order counts are spread across all days of the week, with some variability.
2) Tuesday has the most instances of zero orders.
3) Friday follows as the second-highest.
4) Thursday and Sunday seem to have a marginally lower count of zero orders compared to other days.

In [114]:
# plot the count of zero-order hours by season in a bar chart plotly
grouped = zero_values.groupby('season')['orders'].count().reset_index()

# sort data
grouped = grouped.sort_values('orders', ascending=False)

fig = px.bar(grouped, x='season', y='orders', title='Zero order count by season')
fig.show()

1) Summer has the highest number of zero-order hours, followed closely by Spring.
2) Autumn and Winter have lower counts, with Winter having the fewest.
3) The distribution suggests a seasonal effect on zero-order hours. This could be due to several factors, including weather conditions, holiday periods, or consumer behaviour that changes with the seasons.


In [115]:
# Map the numeric months to month names
month_names = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 
               5: 'May', 6: 'June', 7: 'July', 8: 'August', 
               9: 'September', 10: 'October', 11: 'November', 12: 'December'}
zero_values['month_name'] = zero_values['month'].map(month_names)

# Group by the new 'month_name' column
grouped = zero_values.groupby('month_name')['orders'].count().reset_index()

# Ensure that the months are ordered correctly
grouped['month_name'] = pd.Categorical(grouped['month_name'], categories=month_names.values(), ordered=True)
grouped = grouped.sort_values('month_name')

# Plot zero-order count by month with month names
fig = px.line(grouped, x='month_name', y='orders', title='Zero Order Count by Month')
fig.show()






1) The lowest point occurs in February, the month with the fewest zero-order hours, and the highest peak is in August.
2) Increased variability during the summer months (June to August), with a sharp increase to the highest point in August followed by a sharp decrease in September.


In [116]:
"""
# plot order count by month in a bar chart plotly
grouped = zero_values.groupby('month')['orders'].count().reset_index()

fig = px.line(grouped, x='month', y='orders', title='Zero Order Count by month')
fig.show()
"""


In [117]:
# plot day of the week grouped by season in a grouped bar chart plotly 

grouped = zero_values.groupby(['day_of_week', 'season'])['orders'].count().reset_index()

fig = px.bar(grouped, x='season', y='orders', color='day_of_week', barmode='group' , title='Zero order count by day of week grouped by season')
fig.show()

1) The distribution of zero orders is relatively uniform across different seasons, suggesting that the lack of orders is not strongly influenced by seasonal changes.
2) There is no single day that consistently has the highest or lowest number of zero orders across all seasons, indicating that zero orders are not particularly tied to specific days of the week.
3) While the distributions are similar, there are slight variations in zero order counts between days and seasons, which could be due to natural business cycles or external factors not displayed on the chart.


In [118]:
# Copy relevant columns to a new DataFrame to avoid modifying the original data
df_hm2 = eda_train[['season', 'day_of_week', 'orders']].copy()

# Convert categorical variables to numeric codes
df_hm2['season'] = df_hm2['season'].astype('category').cat.codes
df_hm2['day_of_week'] = df_hm2['day_of_week'].astype('category').cat.codes

# Pivot the data to count zero orders by season and day of the week
pivot_data = df_hm2.pivot_table(index='season', columns='day_of_week', values='orders', aggfunc=lambda x: (x==0).sum())

# Create heatmap
fig = px.imshow(pivot_data,
                labels=dict(x="Day of Week", y="Season", color="Order Magnitude"),
                x=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                y=['Fall', 'Spring', 'Summer', 'Winter'],
                title='Zero Order Days Heatmap by Day of Week and Season')
fig.show()

# Save the pivot table to df_hm2
df_hm2 = pivot_data


1) There is a noticeable hotspot on Saturday during the Summer, where the color is visibly lighter, indicating that this combination has more zero-order hours than other days and seasons.
2) Otherwise the heatmap suggests a relatively even distribution of zero-order hours across the days of the week and seasons.

EDA

    - hourly 

    - shift

    - outliers
Preprocessing

    - outliers

    - data leakage
    
    - scaling
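For the outlier item in the plan above, one possible approach (an assumption, not something the notebook already does) is to compare each hour with the median of the same weekday and hour, using a robust z-score based on the median absolute deviation (MAD). `flag_outliers` and its `threshold` of 3.5 are hypothetical choices for illustration, and the toy data is synthetic.

```python
import numpy as np
import pandas as pd

def flag_outliers(df: pd.DataFrame, threshold: float = 3.5) -> pd.DataFrame:
    """Flag hours whose orders deviate strongly from the median of the same weekday and hour."""
    out = df.copy()
    out["dow"] = out["time"].dt.dayofweek
    out["hour"] = out["time"].dt.hour
    # Typical level for this weekday/hour slot
    median = out.groupby(["dow", "hour"])["orders"].transform("median")
    abs_dev = (out["orders"] - median).abs()
    # Median absolute deviation within the same slot
    mad = abs_dev.groupby([out["dow"], out["hour"]]).transform("median")
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    # Slots with MAD == 0 get a NaN score and are therefore never flagged.
    out["robust_z"] = 0.6745 * abs_dev / mad.replace(0, np.nan)
    out["is_outlier"] = out["robust_z"] > threshold
    return out

# Synthetic example: eight Fridays at 21:00, one of them far above the rest
toy = pd.DataFrame({
    "time": pd.date_range("2021-02-05 21:00", periods=8, freq=pd.Timedelta(days=7)),
    "orders": [300, 310, 290, 305, 295, 300, 315, 900],
})
flagged = flag_outliers(toy)
print(flagged.loc[flagged["is_outlier"], ["time", "orders", "robust_z"]])
```

To avoid data leakage, the medians and MADs should be computed on the training window only (for example `flag_outliers(eda_train)`), not on the full dataset.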

In [119]:
eda_train.head(5)

Unnamed: 0,time,city,hour,orders,day_of_week,month,season,year,month_name
0,2021-02-01 00:00:00,BCN,0,0,Monday,2,Winter,2021,February
1,2021-02-01 01:00:00,BCN,1,0,Monday,2,Winter,2021,February
2,2021-02-01 02:00:00,BCN,2,0,Monday,2,Winter,2021,February
3,2021-02-01 03:00:00,BCN,3,0,Monday,2,Winter,2021,February
4,2021-02-01 04:00:00,BCN,4,0,Monday,2,Winter,2021,February


In [120]:
eda_test.head(5)

Unnamed: 0,time,city,hour,orders,day_of_week,month,season,year,month_name
7013,2021-11-20 23:00:00,BCN,23,0,Saturday,11,Autumn,2021,November
7014,2021-11-21 00:00:00,BCN,0,0,Sunday,11,Autumn,2021,November
7015,2021-11-21 01:00:00,BCN,1,0,Sunday,11,Autumn,2021,November
7016,2021-11-21 02:00:00,BCN,2,0,Sunday,11,Autumn,2021,November
7017,2021-11-21 03:00:00,BCN,3,0,Sunday,11,Autumn,2021,November



## Modelling

Try different models. Validate each model in a way that imitates the real problem (every Sunday you forecast all of next week). Watch out for data leakage. Evaluate each model on MSE and SMAPE. Which one performs better?
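scikit-learn does not ship a SMAPE metric, so it has to be written by hand. The sketch below uses one common definition, the mean of |y − ŷ| / ((|y| + |ŷ|) / 2) expressed as a percentage. Counting hours where both the actual and the predicted value are zero as zero error is a convention chosen here (relevant because about a third of the hours have no orders), not something the task statement prescribes.

```python
import numpy as np

def smape(y_true, y_pred) -> float:
    """Symmetric mean absolute percentage error, in percent (range 0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where actual and prediction are both 0 the denominator is 0;
    # those terms are counted as 0 error instead of dividing by zero.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom > 0)
    return float(100.0 * ratio.mean())

example = smape([0, 100, 50], [0, 110, 50])
worst_case = smape([0], [10])
print(example, worst_case)
```

It can be used next to `mean_squared_error`, for example `smape(y_test, predictions)`.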


### 3) Random Forest

In [121]:
eda_train

Unnamed: 0,time,city,hour,orders,day_of_week,month,season,year,month_name
0,2021-02-01 00:00:00,BCN,0,0,Monday,2,Winter,2021,February
1,2021-02-01 01:00:00,BCN,1,0,Monday,2,Winter,2021,February
2,2021-02-01 02:00:00,BCN,2,0,Monday,2,Winter,2021,February
3,2021-02-01 03:00:00,BCN,3,0,Monday,2,Winter,2021,February
4,2021-02-01 04:00:00,BCN,4,0,Monday,2,Winter,2021,February
...,...,...,...,...,...,...,...,...,...
7008,2021-11-20 18:00:00,BCN,18,82,Saturday,11,Autumn,2021,November
7009,2021-11-20 19:00:00,BCN,19,135,Saturday,11,Autumn,2021,November
7010,2021-11-20 20:00:00,BCN,20,401,Saturday,11,Autumn,2021,November
7011,2021-11-20 21:00:00,BCN,21,701,Saturday,11,Autumn,2021,November


In [122]:
eda_test

Unnamed: 0,time,city,hour,orders,day_of_week,month,season,year,month_name
7013,2021-11-20 23:00:00,BCN,23,0,Saturday,11,Autumn,2021,November
7014,2021-11-21 00:00:00,BCN,0,0,Sunday,11,Autumn,2021,November
7015,2021-11-21 01:00:00,BCN,1,0,Sunday,11,Autumn,2021,November
7016,2021-11-21 02:00:00,BCN,2,0,Sunday,11,Autumn,2021,November
7017,2021-11-21 03:00:00,BCN,3,0,Sunday,11,Autumn,2021,November
...,...,...,...,...,...,...,...,...,...
8761,2022-02-01 19:00:00,BCN,19,101,Tuesday,2,Winter,2022,February
8762,2022-02-01 20:00:00,BCN,20,266,Tuesday,2,Winter,2022,February
8763,2022-02-01 21:00:00,BCN,21,298,Tuesday,2,Winter,2022,February
8764,2022-02-01 22:00:00,BCN,22,128,Tuesday,2,Winter,2022,February


In [123]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Drop unnecessary columns and prepare data for modeling
columns_to_drop = ['time', 'city', 'month_name']  # Adjust if there are more columns to drop
rf_train = eda_train.drop(columns_to_drop, axis=1, errors='ignore')
rf_test = eda_test.drop(columns_to_drop, axis=1, errors='ignore')

# Define the target and features
y_train = rf_train.pop('orders')
y_test = rf_test.pop('orders')

# Encoding categorical variables
categorical_features = ['day_of_week', 'season']
onehot_encoder = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'
)

# Apply one-hot encoding to both training and testing data
X_train_encoded = onehot_encoder.fit_transform(rf_train)
X_test_encoded = onehot_encoder.transform(rf_test)

# Random Forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_encoded, y_train)
predictions = rf.predict(X_test_encoded)

# Output predictions
print("Predictions:", predictions)

# Calculate and print evaluation metrics
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
print(f'Random Forest - MSE: {mse}, RMSE: {rmse}, MAE: {mae}')


Predictions: [  0.           0.           0.         ... 260.54829365 100.07988492
   0.        ]
Random Forest - MSE: 2651.943016923534, RMSE: 51.49701949553521, MAE: 24.35076382841029


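Random Forest has the best overall performance, which suggests that the time series has strong non-linear components. Note that these scores come from a single hold-out split and report MSE, RMSE and MAE; SMAPE and the weekly forecasting setup requested above are not yet covered.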
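To imitate the real setup, where a forecast for the whole coming week is made every Sunday at midnight, the evaluation can be run as a walk-forward backtest with weekly folds. The sketch below only builds the folds; `weekly_folds` and `min_train_weeks` are hypothetical names, and the timestamps are synthetic.

```python
import pandas as pd

def weekly_folds(times: pd.Series, min_train_weeks: int = 4):
    """Yield (cutoff, train_mask, test_mask); every cutoff is a Monday 00:00."""
    times = pd.to_datetime(times)
    # First Monday 00:00 at or after the date of the first observation
    start = times.min().normalize()
    start += pd.Timedelta(days=(7 - start.dayofweek) % 7)
    one_week = pd.Timedelta(weeks=1)
    cutoff = start + min_train_weeks * one_week
    # Only keep folds whose test week is fully covered by the data
    while cutoff + one_week <= times.max() + pd.Timedelta(hours=1):
        train_mask = times < cutoff                                   # everything known on Sunday night
        test_mask = (times >= cutoff) & (times < cutoff + one_week)   # the week being forecast
        yield cutoff, train_mask, test_mask
        cutoff += one_week

# Synthetic example: six full weeks of hourly timestamps starting on a Monday
toy_times = pd.Series(pd.date_range("2021-02-01 00:00", periods=24 * 7 * 6,
                                    freq=pd.Timedelta(hours=1)))
folds = list(weekly_folds(toy_times, min_train_weeks=4))
for cutoff, train_mask, test_mask in folds:
    print(cutoff, int(train_mask.sum()), int(test_mask.sum()))
```

Within each fold the model would be refitted on the rows selected by `train_mask` and scored on the rows selected by `test_mask`, and the per-fold MSE and SMAPE would then be averaged. Any feature derived from the target, such as lagged orders, must only use values from before the cutoff.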

### 1. Autoregressive Integrated Moving Average (ARIMA) & Seasonal ARIMA (SARIMA)

In [124]:
"""
# Ensure the ARIMA and SARIMA models receive the correct data type
# Convert index to a DatetimeIndex if necessary (if using dates and times)
# This step is crucial if your 'time' column affects the model, and needs DateTime indexing
X_train['time'] = pd.to_datetime(X_train['time'])
X_train.set_index('time', inplace=True)
X_test.set_index('time', inplace=True)

# ARIMA Model
arima_model = ARIMA(y_train, order=(1,1,1))
arima_fitted = arima_model.fit()
arima_forecast = arima_fitted.forecast(steps=len(y_test))

# SARIMA Model
sarima_model = SARIMAX(y_train, order=(1,1,1), seasonal_order=(1,1,1,24))
sarima_fitted = sarima_model.fit()
sarima_forecast = sarima_fitted.forecast(steps=len(y_test))

# Calculate and print errors
for forecast, model_name in [(arima_forecast, 'ARIMA'), (sarima_forecast, 'SARIMA')]:
    mse = mean_squared_error(y_test, forecast)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, forecast)
    print(f'{model_name} - MSE: {mse}, RMSE: {rmse}, MAE: {mae}')
"""


"\n# Ensure the ARIMA and SARIMA models receive the correct data type\n# Convert index to a DatetimeIndex if necessary (if using dates and times)\n# This step is crucial if your 'time' column affects the model, and needs DateTime indexing\nX_train['time'] = pd.to_datetime(X_train['time'])\nX_train.set_index('time', inplace=True)\nX_test.set_index('time', inplace=True)\n\n# ARIMA Model\narima_model = ARIMA(y_train, order=(1,1,1))\narima_fitted = arima_model.fit()\narima_forecast = arima_fitted.forecast(steps=len(y_test))\n\n# SARIMA Model\nsarima_model = SARIMAX(y_train, order=(1,1,1), seasonal_order=(1,1,1,24))\nsarima_fitted = sarima_model.fit()\nsarima_forecast = sarima_fitted.forecast(steps=len(y_test))\n\n# Calculate and print errors\nfor forecast, model_name in [(arima_forecast, 'ARIMA'), (sarima_forecast, 'SARIMA')]:\n    mse = mean_squared_error(y_test, forecast)\n    rmse = np.sqrt(mse)\n    mae = mean_absolute_error(y_test, forecast)\n    print(f'{model_name} - MSE: {mse}, RMS