# 

## **1. Preliminary Data Exploration** 📊🔍

Getting a feel for the data is an important first step. This is where we can start to understand the data and identify any potential issues. We can also start to think about how we might want to structure our data for modeling. For the assignment, this step will help in choosing a modeling approach 🤔💡.

In [None]:
# Import Libraries

import pandas as pd
import numpy as np

In [None]:
# Load Data
original_df = pd.read_csv('../data/orders_autumn_2020.csv')

# Create a copy of the original dataframe
df = original_df.copy()

In [None]:
# Check columns and data types
df.info()

**🔍 Observation:** The dataset contains 18706 data points with 13 features. Some values also appear to be missing 🚫📉.

In [None]:
# Check missing values
df.isna().sum()

**🔍 Observation**: Some values for the weather 🌤️ features are missing. We will have to handle these missing values (impute, drop) if we decide to use these features depending on our model (some models handle missing data natively).
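
The two options mentioned above (drop or impute) could look roughly like this. This is a minimal sketch on a small synthetic frame: the column names mirror the weather features used later in this notebook, while the values and the choice of median imputation are assumptions for illustration, not a decision made for the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather columns (values are made up for illustration)
weather = pd.DataFrame({
    "TEMPERATURE": [15.0, np.nan, 16.1, 14.4],
    "WIND_SPEED": [3.5, 4.1, np.nan, 3.9],
    "PRECIPITATION": [0.0, 0.0, 0.2, np.nan],
})

# Option 1: drop every row that has at least one missing weather value
dropped = weather.dropna()

# Option 2: fill the gaps in each column with that column's median
imputed = weather.fillna(weather.median())

print(len(dropped), int(imputed.isna().sum().sum()))
```

Dropping is simplest but discards whole orders; imputation keeps them at the cost of adding artificial values, so the better choice depends on the model used.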


In [None]:
# Check for duplicates
df.duplicated().sum()

**🔍 Observation**: There are no duplicate data points in the dataset.

In [None]:
# Get statistical summary of the dataset
df.describe().T

**🔍 Observations**: 
- ⏱️ **Delivery Time Differences**: On average, actual deliveries are 1.20 minutes faster than estimated (mean difference: -1.20 minutes), with high variability (standard deviation: 8.98 minutes).
- 🍔 **Item Count**: Average order contains 2.69 items, ranging from 1 to 11 items.
- 🌐 **Location Data**: User and venue coordinates have small standard deviations, which suggests the dataset covers a concentrated geographic area.
- ⏳ **Estimated vs. Actual Delivery Minutes**: Average estimated delivery time is 33.81 minutes, actual is slightly lower at 32.61 minutes.
- ☁️ **Weather Conditions**: The weather features show wide variation, which may affect delivery times.

In [None]:
# Check what the dataset looks like
df.head(10)

## **2. Modeling Approach** 🤖📈

After getting a feel of the data, and taking into account what I have learnt about Wolt's business model, the following modeling approach is chosen:

### **🔮 "How much time will it take for the delivery?"**

The estimated delivery time should sit in the Goldilocks zone: neither too conservative (overestimating) nor too optimistic (underestimating). If the estimate is too conservative, it will deter customers from placing orders. If it is too optimistic, it will lead to late deliveries.

However, [studies](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3469015) show that customers punish late deliveries more than they reward early ones. It is therefore slightly better to overestimate delivery times than to underestimate them, which gives us an incentive to guard against underestimation. This asymmetry will shape our evaluation approach.
To act on it, we need to analyse late deliveries in detail and identify the factors that drive them.

> Note: We could verify this information by looking at Wolt's data on customer behavior. However, for the purpose of this assignment, we will assume this information is correct.
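
One way to encode this asymmetry in an evaluation metric is the pinball (quantile) loss, which weights underestimates and overestimates differently. The sketch below is an illustration only: `pinball_loss` is a hypothetical helper, and `quantile=0.7` is an arbitrary value chosen to show the effect, not a setting derived from Wolt's data or from the cited study.

```python
import numpy as np

def pinball_loss(actual, predicted, quantile=0.7):
    """Mean pinball (quantile) loss; a quantile above 0.5 penalizes underestimation more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = actual - predicted  # positive error: the delivery took longer than predicted
    return float(np.mean(np.maximum(quantile * error, (quantile - 1) * error)))

# Same absolute error of 5 minutes, in opposite directions
late = pinball_loss([35.0], [30.0])   # underestimate: the delivery arrived late
early = pinball_loss([25.0], [30.0])  # overestimate: the delivery arrived early

print(late > early)
```

With a quantile above 0.5, a late delivery costs more than an equally early one, which matches the incentive described above.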

> ### **Motivations for the modeling approach:**
Accurate forecasting of delivery time helps Wolt's business model in the following ways:

- 🌟 **Customer Experience**: Enhances satisfaction by reducing uncertainty about order arrival times.
- ⚙️ **Operational Efficiency**: Optimizes logistics and allocation of delivery personnel.
- 📊 **Demand Forecasting**: Helps manage order influx during peak times, ensuring quality service.
- 🏅 **Competitive Advantage**: Differentiates Wolt in the competitive food delivery market.

## **3. Exploratory Data Analysis**

In [None]:
# Import libraries

from geopy.distance import geodesic
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Set seaborn style in Wolt brand colors
wolt_colors = ["#004C5C", "#FF007A", "#56C1E6"]  # Dark Blue, Pink, Light Blue

# Set the theme with a Wolt color palette
sns.set_theme(style="whitegrid", palette=sns.color_palette(wolt_colors))

# Further customize
plt.rcParams["axes.spines.top"] = False
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.left"] = True
plt.rcParams["axes.spines.bottom"] = True
plt.rcParams["grid.color"] = "#eeeeee"
plt.rcParams["grid.linestyle"] = "-"
plt.rcParams["grid.linewidth"] = 0.75
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.size"] = 12
plt.rcParams["axes.edgecolor"] = "#333333"

# set figure size
plt.rcParams["figure.figsize"] = (12, 8)

In [None]:
# Convert the TIMESTAMP column to datetime format
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])

# Make TIMESTAMP the index of the dataframe, as this makes it easy to group by date
df.set_index('TIMESTAMP', inplace=True)

In [None]:
def plot_histogram(data, column, title, xlabel, ylabel):
    """
    Plot a histogram of given columns of a dataframe
    """
    # Plotting the distribution of the specified column
    plt.figure(figsize=(10, 6))
    sns.histplot(data[column], kde=True, color=wolt_colors[2], edgecolor='white', linewidth=1.5)
    plt.title(title, fontsize=16, fontweight='bold')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()

# Call the plot_histogram function
plot_histogram(df, 'ACTUAL_DELIVERY_MINUTES', 'Distribution of Actual Delivery Minutes', 'Actual Delivery Minutes', 'Frequency')
plot_histogram(df, 'ESTIMATED_DELIVERY_MINUTES', 'Distribution of Estimated Delivery Minutes', 'Estimated Delivery Minutes', 'Frequency')
plot_histogram(df, 'ACTUAL_DELIVERY_MINUTES - ESTIMATED_DELIVERY_MINUTES', 'Distribution of Difference between Actual and Estimated Delivery Minutes', 'Difference between Actual and Estimated Delivery Minutes', 'Frequency')

**🔍 Observations**: 
- The histogram shows a relatively normal distribution, with a peak around the mean value we observed earlier. So a Linear Regression model might be a good baseline to start with.
- This suggests that most deliveries fall within a standard range of times, with fewer occurrences of extremely short or long delivery times.

In [None]:
# Create a function to calculate the distance between two latitude and longitude points
def calculate_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """
    Calculate the distance between latitude and longitude points

    Args:
        lat1 (float): Latitude of first point
        lon1 (float): Longitude of first point
        lat2 (float): Latitude of second point
        lon2 (float): Longitude of second point
    
    Returns:
        float: Distance between two points in kilometers
    """
    # return haversine((lat1, lon1), (lat2, lon2), unit=Unit.KILOMETERS)
    return geodesic((lat1, lon1), (lat2, lon2)).km

In [None]:
# Extract the distance from latitude and longitude of user and venue
df['DISTANCE'] = df.apply(lambda x: calculate_distance(x['USER_LAT'], x['USER_LONG'], x['VENUE_LAT'], x['VENUE_LONG']), axis=1)

In [None]:
# Check the distance column
df['DISTANCE'].describe()
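
Applying `geodesic` row by row can be slow on larger frames. A vectorized haversine formula is a common alternative, sketched below under the assumption that a spherical Earth is accurate enough for city-scale distances. `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the results will differ slightly from `geodesic`, which uses an ellipsoidal model.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the spherical model is an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; accepts scalars or array-likes in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Sanity check: one degree of latitude is roughly 111 km on the sphere
one_degree = haversine_km(60.0, 24.9, 61.0, 24.9)
print(round(float(one_degree), 1))
```

On the notebook's dataframe this could be used as `haversine_km(df['USER_LAT'], df['USER_LONG'], df['VENUE_LAT'], df['VENUE_LONG'])`, which computes all rows in one call.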

In [None]:
def plot_heatmap(matrix: pd.DataFrame, title: str) -> None:
    """
    Plot a heatmap

    Args:
        matrix (pd.DataFrame): Matrix to plot, e.g. the output of DataFrame.corr()
        title (str): Title of the heatmap
    
    Returns:
        None
    """
    sns.heatmap(
        data=matrix,
        annot=True,
        linewidths=0.5,  # seaborn's documented name for the width of the cell dividers
        cmap=sns.light_palette(wolt_colors[2], as_cmap=True),
    )

    plt.title(title, fontsize=18, fontweight='bold')
    plt.show()

In [None]:
# Check the correlation between relevant columns
relevant_columns = ['ITEM_COUNT', 'ESTIMATED_DELIVERY_MINUTES', 'ACTUAL_DELIVERY_MINUTES', 
                   'CLOUD_COVERAGE', 'TEMPERATURE', 'WIND_SPEED', 'PRECIPITATION', 'DISTANCE']
plot_heatmap(df[relevant_columns].corr(), 'Correlation Heatmap')

In [None]:
# Creating new DataFrame with daily frequency and number of orders
daily_df = df.resample('D').size()

daily_df.head()

In [None]:
def plot_line(df: pd.DataFrame, title: str, xlabel: str, ylabel: str, size=None) -> None:
    """
    Plot a line chart

    Args:
        df (pd.DataFrame or pd.Series): Data to plot the line chart for
        title (str): Title of the chart
        xlabel (str): Label of the x-axis
        ylabel (str): Label of the y-axis
        size (tuple): Size of the chart

    Returns:
        None
    """
    # Create the figure before plotting so the requested size applies to this chart
    if size:
        plt.figure(figsize=size)
    # Plot the data passed in as an argument rather than the global daily_df
    sns.lineplot(
        data=df, 
        color=wolt_colors[2]
    )
    plt.title(title, fontsize=18, fontweight='bold')
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    plt.show()

For the assignment, we will assume the following, where Delivery Time Difference is the actual minus the estimated delivery time:
- 😃 **On Time Deliveries**: Delivery Time Difference <= 5 minutes.
- 😐 **Moderately Late Deliveries**: 5 minutes < Delivery Time Difference <= 15 minutes.
- 😡 **Extremely Late Deliveries**: Delivery Time Difference > 15 minutes.

In [None]:
# Define a function to categorize the deliveries
def categorize_deliveries(row: pd.Series) -> str:
    """
    Categorize a delivery based on the difference between actual and estimated delivery time

    Args:
        row (pd.Series): Row of the dataframe
    
    Returns:
        str: Category of the delivery
    """
    if row['ACTUAL_DELIVERY_MINUTES - ESTIMATED_DELIVERY_MINUTES'] <= 5:
        return "ON_TIME"
    elif 5 < row['ACTUAL_DELIVERY_MINUTES - ESTIMATED_DELIVERY_MINUTES'] <= 15:
        return "MODERATELY_LATE"
    else:
        return "EXTREMELY_LATE"
    
# Create a new column to categorize the deliveries
df['DELIVERY_TIME_CATEGORY'] = df.apply(categorize_deliveries, axis=1)

In [None]:
df['DELIVERY_TIME_CATEGORY'].value_counts()
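
The same thresholds can also be applied without a row-wise `apply`, using `pd.cut` with right-closed bins. The sketch below runs on a few synthetic differences (made-up values) rather than the real column. One behavioral difference to be aware of: `pd.cut` leaves missing values as `NaN`, whereas the function above would send them to the `else` branch.

```python
import numpy as np
import pandas as pd

# Synthetic differences (actual - estimated, in minutes), for illustration only
diff = pd.Series([-3.0, 5.0, 5.1, 15.0, 22.0])

# Right-closed bins: (-inf, 5], (5, 15], (15, inf)
categories = pd.cut(
    diff,
    bins=[-np.inf, 5, 15, np.inf],
    labels=["ON_TIME", "MODERATELY_LATE", "EXTREMELY_LATE"],
)

print(categories.tolist())
```

The boundary values 5 and 15 fall into the lower category in both versions, matching the `<=` comparisons in `categorize_deliveries`.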

## **4. Model Selection**

Given the nature of the data, timeseries forecasting models are a good choice. The data is a time series of item demand, and we want to predict the demand for the next day.

> ### **The following models are considered for the approach:**

- **📉 SARIMAX (Baseline):** A good baseline for our data, since it accounts for seasonality and external factors. Its results are also easier to interpret and explain than those of black-box models.

- 🚀 **XGBoost:** XGBoost is a powerful ensemble machine learning model that can handle both regression and timeseries forecasting tasks. It can capture complex patterns and dependencies in the data.

- **🧠 LSTM:** LSTM is a good choice for timeseries forecasting, and can be used to capture the non-linearities in the data.


## **5. Feature Engineering**

- hourly, daily and weekly decomposition
- calculate distance between venue and delivery location
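
As one possible reading of the first bullet, calendar features can be derived directly from a `DatetimeIndex` such as the `TIMESTAMP` index set earlier. This is a sketch on a tiny synthetic frame; the feature names `HOUR`, `DAY_OF_WEEK` and `IS_WEEKEND` are assumptions for illustration, not columns from the dataset.

```python
import pandas as pd

# Synthetic orders indexed by timestamp (made-up values)
orders = pd.DataFrame(
    {"ITEM_COUNT": [2, 1, 4]},
    index=pd.to_datetime(["2020-09-01 11:30", "2020-09-05 18:05", "2020-09-07 20:45"]),
)

features = orders.assign(
    HOUR=orders.index.hour,               # hour of day, 0-23
    DAY_OF_WEEK=orders.index.dayofweek,   # Monday=0 ... Sunday=6
    IS_WEEKEND=orders.index.dayofweek >= 5,
)

print(features)
```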

## **6. Modeling**

### **6.1 SARIMAX (Baseline)**

### **6.2 XGBoost**

### **6.3 LSTM**

## **7. Evaluation**

## **8. Further Development**