# 

## **1. Preliminary Data Exploration** 📊🔍

Getting a feel for the data is an important first step: it lets us understand its structure, spot potential issues, and think about how to shape the data for modeling. For the assignment, this step also informs the choice of modeling approach 🤔💡.

In [None]:
# Import Libraries

import pandas as pd
import numpy as np

In [None]:
# Load Data
original_df = pd.read_csv('../data/orders_autumn_2020.csv')

# Create a copy of the original dataframe
df = original_df.copy()

In [None]:
# Check columns and data types
df.info()

**🔍 Observation:** There are 18706 data points in the dataset with 13 features. Some values also appear to be missing 🚫📉.

In [None]:
# Check missing values
df.isna().sum()

**🔍 Observation**: Some values for the weather 🌤️ features are missing. If we decide to use these features, we will have to handle the missing values (impute or drop them), depending on the model, since some models handle missing data natively.
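
The two options mentioned above (drop or impute) can be sketched on a small toy frame. The column names match the dataset, but the values are made up, and time-based interpolation is only one possible imputation choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two weather columns (values are made up)
idx = pd.to_datetime(
    [
        "2020-08-01 10:00", "2020-08-01 11:00", "2020-08-01 12:00",
        "2020-08-01 13:00", "2020-08-01 14:00", "2020-08-01 15:00",
    ]
)
weather = pd.DataFrame(
    {
        "TEMPERATURE": [15.0, np.nan, 17.0, np.nan, np.nan, 20.0],
        "WIND_SPEED": [np.nan, 3.0, 3.5, 4.0, np.nan, np.nan],
    },
    index=idx,
)

# Option 1: drop every row that has a missing weather value
dropped = weather.dropna()

# Option 2: interpolate along the time index, then fill any remaining edge gaps
imputed = weather.interpolate(method="time").ffill().bfill()

print(dropped.shape)
print(imputed.isna().sum().sum())
```

Dropping is simplest but discards orders. Interpolation keeps every row but assumes the weather changes smoothly between observations.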


In [None]:
# Check for duplicates
df.duplicated().sum()

**🔍 Observation**: There are no duplicate data points in the dataset.

In [None]:
# Get statistical summary of the dataset
df.describe().T

**🔍 Observations**: 
- ⏱️ The data indicates a generally efficient delivery system, with actual delivery times often being less than estimated.
- 🌦️ The weather conditions show notable variation, which could impact delivery times.
- 🗺️ The proximity of users to venues (latitude, longitude) suggests a densely populated or urban area, possibly leading to quicker deliveries.
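
The first observation can be checked directly by comparing the two delivery-time columns. This is a minimal sketch on a toy sample; the column names come from the dataset, but the values are made up.

```python
import pandas as pd

# Toy sample using the dataset's column names (values are made up)
sample = pd.DataFrame(
    {
        "ESTIMATED_DELIVERY_MINUTES": [30, 25, 40, 35],
        "ACTUAL_DELIVERY_MINUTES": [28, 27, 33, 35],
    }
)

# Positive error = later than estimated, negative error = earlier
error = sample["ACTUAL_DELIVERY_MINUTES"] - sample["ESTIMATED_DELIVERY_MINUTES"]

share_early = (error < 0).mean()  # fraction of orders delivered before the estimate
mean_error = error.mean()         # average deviation in minutes

print(f"early: {share_early:.0%}, mean error: {mean_error:+.2f} min")
```

Running the same two lines on `df` would quantify how often actual delivery beats the estimate.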


In [None]:
# Check what the dataset looks like
df.head(10)

## **2. Modeling Approach** 🤖📈

After getting a feel for the data, and taking into account what I have learnt about Wolt's business model, I chose the following modeling approach:

### **🔮 "What would be the order demand for the next day?"** 

### **Motivations for the modeling approach:**
Accurate forecasting of order demand helps Wolt's business model in the following ways:

- 💼 **Resource Management:** Ensures optimal allocation of delivery personnel and reduces operational costs. 

- 👩‍🍳 **Partner Restaurants Efficiency:** Helps restaurants prepare for demand, improving food quality and reducing waste. 

- 💰 **Dynamic Pricing:** Enables effective dynamic pricing and promotional strategies to manage demand. 

- 📈 **Financial Planning:** Essential for revenue forecasting and strategic decisions, like market expansion. 

## **3. Exploratory Data Analysis**

- Trend and Seasonality Analysis

In [None]:
# Import libraries

from haversine import haversine, Unit
from geopy.distance import geodesic
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tools.sm_exceptions import ConvergenceWarning

In [None]:
# Set seaborn style in Wolt brand colors
wolt_colors = ["#004C5C", "#FF007A", "#56C1E6"]  # Dark Blue, Pink, Light Blue

# Set the theme with a Wolt color palette
sns.set_theme(style="whitegrid", palette=sns.color_palette(wolt_colors))

# Further customize
plt.rcParams["axes.spines.top"] = False
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.left"] = True
plt.rcParams["axes.spines.bottom"] = True
plt.rcParams["grid.color"] = "#eeeeee"
plt.rcParams["grid.linestyle"] = "-"
plt.rcParams["grid.linewidth"] = 0.75
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.size"] = 12
plt.rcParams["axes.edgecolor"] = "#333333"

# set figure size
plt.rcParams["figure.figsize"] = (12, 8)

In [None]:
# Convert the TIMESTAMP column to datetime format
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])

# Make TIMESTAMP the index of the dataframe, as this makes it easier to group by date
df.set_index('TIMESTAMP', inplace=True)

In [None]:
# Create a function to calculate the distance between two latitude and longitude points
def calculate_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """
    Calculate the distance between latitude and longitude points

    Args:
        lat1 (float): Latitude of first point
        lon1 (float): Longitude of first point
        lat2 (float): Latitude of second point
        lon2 (float): Longitude of second point
    
    Returns:
        float: Distance between two points in kilometers
    """
    # return haversine((lat1, lon1), (lat2, lon2), unit=Unit.KILOMETERS)
    return geodesic((lat1, lon1), (lat2, lon2)).km

In [None]:
# Extract the distance from latitude and longitude of user and venue
df['DISTANCE'] = df.apply(lambda x: calculate_distance(x['USER_LAT'], x['USER_LONG'], x['VENUE_LAT'], x['VENUE_LONG']), axis=1)
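
Row-wise `apply` can be slow on larger datasets. As a possible alternative, the haversine (great-circle) formula can be vectorized with NumPy. The function below is a hypothetical helper, not part of the notebook. It assumes a spherical Earth with a mean radius of 6371 km, so its results differ slightly from the ellipsoidal `geodesic` distance used above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up coordinates: two user locations and two venue locations
user_lat = np.array([60.17, 60.20])
user_lon = np.array([24.94, 24.95])
venue_lat = np.array([60.17, 60.16])
venue_lon = np.array([24.94, 24.93])

distances = haversine_km(user_lat, user_lon, venue_lat, venue_lon)
print(distances)
```

With the dataframe above, the call would be `haversine_km(df['USER_LAT'], df['USER_LONG'], df['VENUE_LAT'], df['VENUE_LONG'])`, which computes all rows at once.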

In [None]:
# Check the distance column
df['DISTANCE'].describe()

In [None]:
def plot_heatmap(matrix: pd.DataFrame, title: str) -> None:
    """
    Plot a heatmap

    Args:
        matrix (pd.DataFrame): Matrix to plot (e.g. a correlation matrix)
        title (str): Title of the heatmap
    
    Returns:
        None
    """
    sns.heatmap(
        data=matrix,
        annot=True,
        linewidths=0.5,  # seaborn's documented name for the cell border width
        cmap=sns.light_palette(wolt_colors[2], as_cmap=True),
    )

    plt.title(title, fontsize=18, fontweight='bold')
    plt.show()

In [None]:
# Check the correlation between relevant columns
relevant_columns = ['ITEM_COUNT', 'ESTIMATED_DELIVERY_MINUTES', 'ACTUAL_DELIVERY_MINUTES', 
                   'CLOUD_COVERAGE', 'TEMPERATURE', 'WIND_SPEED', 'PRECIPITATION', 'DISTANCE']
plot_heatmap(df[relevant_columns].corr(), 'Correlation Heatmap')

## **4. Model Selection**

Given the nature of the data, time series forecasting models are a good choice. The data can be aggregated into a time series of daily order demand, and we want to predict the demand for the next day.

> ### **The following models are considered for the approach:**

- **📉 SARIMAX (Baseline):** It is a good baseline model for our data, as it accounts for seasonality and external factors. Its results are also easier to interpret and explain than those of black-box models.

- 🚀 **XGBoost:** XGBoost is a powerful ensemble machine learning model that can handle both regression and time series forecasting tasks. It can capture complex patterns and dependencies in the data. 

- **🧠 LSTM:** LSTM is a good choice for time series forecasting and can capture non-linearities in the data. 
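
Tree-based models such as XGBoost do not model temporal order on their own. One common way to use them for forecasting is to recast the series as a supervised table of lagged values. The sketch below uses only pandas on a synthetic series; `make_lag_features` is a hypothetical helper, and lags of 1 and 7 days are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily order counts (made-up values)
dates = pd.date_range("2020-08-01", periods=14, freq="D")
orders = pd.Series(np.arange(100, 114), index=dates, name="orders")


def make_lag_features(series: pd.Series, lags=(1, 7)) -> pd.DataFrame:
    """Build a table with the target `y` and its lagged values as features."""
    frame = pd.DataFrame({"y": series})
    for lag in lags:
        frame[f"lag_{lag}"] = series.shift(lag)
    # Rows without a full set of lags cannot be used for training
    return frame.dropna()


supervised = make_lag_features(orders)
X, y = supervised.drop(columns="y"), supervised["y"]

print(supervised.head())
```

`X` and `y` could then be passed to any regressor. When splitting into train and test sets, the split should respect time order so that the model never trains on future values.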


In [None]:
# Creating new DataFrame with daily frequency and number of orders
daily_orders = df.resample('D').size()

daily_orders.head()

In [None]:
def plot_line(series: pd.Series, title: str, xlabel: str, ylabel: str, size=None) -> None:
    """
    Plot a line chart

    Args:
        series (pd.Series): Time-indexed series to plot
        title (str): Title of the chart
        xlabel (str): Label of the x-axis
        ylabel (str): Label of the y-axis
        size (tuple): Size of the chart

    Returns:
        None
    """
    # Create the figure before plotting so the requested size takes effect
    if size:
        plt.figure(figsize=size)
    sns.lineplot(
        data=series, 
        color=wolt_colors[2]
    )
    plt.title(title, fontsize=18, fontweight='bold')
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    plt.show()

plot_line(daily_orders, title='Daily Orders', xlabel='Date', ylabel='Number of Orders', size=(15, 12))
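
Before fitting a seasonal model, it can help to check whether the daily series shows a weekly pattern. The following is a lightweight sketch using only pandas on synthetic data: a centred 7-day rolling mean as a rough trend and a per-weekday average as a rough seasonal profile. The weekend bump in the toy data is an assumption. The `seasonal_decompose` function imported earlier offers a fuller decomposition.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts with a weekend bump (made-up values)
dates = pd.date_range("2020-08-03", periods=28, freq="D")  # 2020-08-03 is a Monday
counts = np.where(dates.dayofweek >= 5, 150, 100)
daily = pd.Series(counts, index=dates, name="orders")

# Rough trend: a centred 7-day rolling mean averages out the weekly cycle
trend = daily.rolling(window=7, center=True).mean()

# Rough seasonality: average orders per weekday (0 = Monday, 6 = Sunday)
weekday_profile = daily.groupby(daily.index.dayofweek).mean()

print(weekday_profile)
```

Applied to `daily_orders`, a clearly uneven weekday profile would support using a seasonal period of 7 in the baseline model.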

## **5. Feature Engineering**

- hourly, daily and weekly decomposition
- calculate distance between venue and delivery location
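
The calendar part of the list above could be sketched as follows, assuming the dataframe has a `DatetimeIndex` as set earlier. The toy timestamps and the feature names (`hour`, `day_of_week`, `week`, `is_weekend`) are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Toy order timestamps (made-up values)
timestamps = pd.to_datetime(
    ["2020-08-01 11:30", "2020-08-01 18:05", "2020-08-03 12:15"]
)
orders = pd.DataFrame({"ITEM_COUNT": [2, 1, 4]}, index=timestamps)

features = orders.copy()
features["hour"] = features.index.hour
features["day_of_week"] = features.index.dayofweek  # 0 = Monday, 6 = Sunday
features["week"] = features.index.isocalendar().week.astype(int).to_numpy()  # ISO week
features["is_weekend"] = features["day_of_week"] >= 5

print(features)
```

For a daily forecast, these per-order features would be computed on the resampled daily index instead, where `hour` no longer applies.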

## **6. Modeling**

### **6.1 SARIMAX (Baseline)**

### **6.2 XGBoost**

### **6.3 LSTM**

## **7. Evaluation**

## **8. Further Development**