

```
# This is formatted as code
```

# Order Delivery Time Prediction

## Objectives
The objective of this assignment is to build a regression model that predicts the delivery time for orders placed through Porter. The model will use various features such as the items ordered, the restaurant location, the order protocol, and the availability of delivery partners.

The key goals are:
- Predict the delivery time for an order based on multiple input features
- Improve delivery time predictions to optimiae operational efficiency
- Understand the key factors influencing delivery time to enhance the model's accuracy

## Data Pipeline
The data pipeline for this assignment will involve the following steps:
1. **Data Loading**
2. **Data Preprocessing and Feature Engineering**
3. **Exploratory Data Analysis**
4. **Model Building**
5. **Model Inference**

## Data Understanding
The dataset contains information on orders placed through Porter, with the following columns:

| Field                     | Description                                                                                 |
|---------------------------|---------------------------------------------------------------------------------------------|
| market_id                 | Integer ID representing the market where the restaurant is located.                         |
| created_at                | Timestamp when the order was placed.                                                        |
| actual_delivery_time      | Timestamp when the order was delivered.                                                     |
| store_primary_category    | Category of the restaurant (e.g., fast food, dine-in).                                      |
| order_protocol            | Integer representing how the order was placed (e.g., via Porter, call to restaurant, etc.). |
| total_items               | Total number of items in the order.                                                         |
| subtotal                  | Final price of the order.                                                                   |
| num_distinct_items        | Number of distinct items in the order.                                                      |
| min_item_price            | Price of the cheapest item in the order.                                                    |
| max_item_price            | Price of the most expensive item in the order.                                              |
| total_onshift_dashers     | Number of delivery partners on duty when the order was placed.                              |
| total_busy_dashers        | Number of delivery partners already occupied with other orders.                             |
| total_outstanding_orders  | Number of orders pending fulfillment at the time of the order.                              |
| distance                  | Total distance from the restaurant to the customer.                                         |


## **Importing Necessary Libraries**

In [None]:
# Import essential libraries for data manipulation and analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels
import statsmodels.api as sm
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')



## **1. Loading the data**
Load 'porter_data_1.csv' as a DataFrame

In [None]:
# Importing the file porter_data_1.csv
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab
Delivery_time_Data=pd.read_csv('porter_data_1.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab


## **2. Data Preprocessing and Feature Engineering** <font color = red>[15 marks]</font> <br>

#### **2.1 Fixing the Datatypes**  <font color = red>[5 marks]</font> <br>
The current timestamps are in object format and need conversion to datetime format for easier handling and intended functionality

##### **2.1.1** <font color = red>[2 marks]</font> <br>
Convert date and time fields to appropriate data type

In [None]:
# Convert 'created_at' and 'actual_delivery_time' columns to datetime format

Delivery_time_Data.head()
 Delivery_time_Data['created_at_time_format'] = pd.to_datetime(Delivery_time_Data['created_at'])
 Delivery_time_Data['actual_delivery_time_format'] = pd.to_datetime(Delivery_time_Data['actual_delivery_time'])

Unnamed: 0,market_id,created_at,actual_delivery_time,store_primary_category,order_protocol,total_items,subtotal,num_distinct_items,min_item_price,max_item_price,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,distance
0,1.0,2015-02-06 22:24:17,2015-02-06 23:11:17,4,1.0,4,3441,4,557,1239,33.0,14.0,21.0,34.44
1,2.0,2015-02-10 21:49:25,2015-02-10 22:33:25,46,2.0,1,1900,1,1400,1400,1.0,2.0,2.0,27.6
2,2.0,2015-02-16 00:11:35,2015-02-16 01:06:35,36,3.0,4,4771,3,820,1604,8.0,6.0,18.0,11.56
3,1.0,2015-02-12 03:36:46,2015-02-12 04:35:46,38,1.0,1,1525,1,1525,1525,5.0,6.0,8.0,31.8
4,1.0,2015-01-27 02:12:36,2015-01-27 02:58:36,38,1.0,2,3620,2,1425,2195,5.0,5.0,7.0,8.2


##### **2.1.2**  <font color = red>[3 marks]</font> <br>
Convert categorical fields to appropriate data type

In [None]:
# Convert categorical features to category type
Delivery_time_Data.info()
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(Delivery_time_Data["created_at"])
# finding the unique classes
print(list(label_encoder.classes_))
print()
# values after transforming the categorical column.
print(label_encoder.transform(Delivery_time_Data["created_at"]))
Delivery_time_Data["created_at_Transform"] = label_encoder.transform(Delivery_time_Data["created_at"])


Delivery_time_Data.info()
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(Delivery_time_Data["actual_delivery_time"])
# finding the unique classes
print(list(label_encoder.classes_))
print()
# values after transforming the categorical column.
print(label_encoder.transform(Delivery_time_Data["actual_delivery_time"]))
Delivery_time_Data["actual_delivery_time_Transform"] = label_encoder.transform(Delivery_time_Data["actual_delivery_time"])


NameError: name 'Delivery_time_Data' is not defined

#### **2.2 Feature Engineering** <font color = red>[5 marks]</font> <br>
Calculate the time taken to execute the delivery as well as extract the hour and day at which the order was placed

##### **2.2.1** <font color = red>[2 marks]</font> <br>
Calculate the time taken using the features `actual_delivery_time` and `created_at`

In [None]:
# Calculate time taken in minutes
 Delivery_time_Data['created_at_time_format'] = pd.to_datetime(Delivery_time_Data['created_at'])
 Delivery_time_Data['actual_delivery_time_format'] = pd.to_datetime(Delivery_time_Data['actual_delivery_time'])
 Delivery_time_Data['time_taken'] = (Delivery_time_Data['actual_delivery_time_format'] - Delivery_time_Data['created_at_time_format']).dt.total_seconds() / 60
sns.distplot(Delivery_time_Data['time_taken'],kde=False)
plt.title('time_taken')
plt.show()

##### **2.2.2** <font color = red>[3 marks]</font> <br>
Extract the hour at which the order was placed and which day of the week it was. Drop the unnecessary columns.

In [None]:
# Extract the hour and day of week from the 'created_at' timestamp

def extract_date_features(data):
    data["created_at_time_format_day"] = data.created_at_time_format.dt.day
    data["created_at_time_format_hour"] = data.created_at_time_format.dt.hour
    data["created_at_time_format_day_name"] = data.created_at_time_format.dt.day_name()
    data["created_at_time_format_weekday"] = data.created_at_time_format.dt.weekday
    data['created_at_time_format_day_of_week'] = data.created_at_time_format.dt.day_of_week.astype(int)
    data['created_at_time_format_is_weekend'] = np.where(data['day_of_week'].isin([5,6]),1,0)

extract_date_features(Delivery_time_Data)
Delivery_time_Data.info()



def extract_date_features(data):
    data["actual_delivery_time_format_day"] = data.actual_delivery_time_format.dt.day
    data["actual_delivery_time_format_hour"] = data.actual_delivery_time_format.dt.hour
    data["actual_delivery_time_format_day_name"] = data.actual_delivery_time_format.dt.day_name()
    data["actual_delivery_time_format_weekday"] = data.actual_delivery_time_format.dt.weekday
    data['actual_delivery_time_format_day_of_week'] = data.actual_delivery_time_format.dt.day_of_week.astype(int)
    data['actual_delivery_time_format_is_weekend'] = np.where(data['day_of_week'].isin([5,6]),1,0)

extract_date_features(Delivery_time_Data)
Delivery_time_Data.info()


# Create a categorical feature 'isWeekend'

# Convert 'Category' column to categorical type
Delivery_time_Data['created_at_time_format_is_weekend'] = Delivery_time_Data['created_at_time_format_is_weekend'].astype('category')
# Convert 'Category' column to categorical type
Delivery_time_Data['actual_delivery_time_format_is_weekend'] = Delivery_time_Data['actual_delivery_time_format_is_weekend'].astype('category')

In [None]:
# Drop unnecessary columns
  Delivery_time_Data.drop(columns=['created_at_time_format_date', 'created_at_time_format_time','created_at_time_format_dayname','created_at_time_format_day_no','actual_delivery_time_format_day_of_week','actual_delivery_time_format_is_weekend'], axis=1, inplace=True)

#### **2.3 Creating training and validation sets** <font color = red>[5 marks]</font> <br>

##### **2.3.1** <font color = red>[2 marks]</font> <br>
 Define target and input features

In [None]:
# Define target variable (y) and features (X)
from sklearn.model_selection import train_test_split

# We specify this so that the train and test data set always have the same rows, respectively
np.random.seed(0)
df_train, df_test = train_test_split(Delivery_time_Data, train_size = 0.7, test_size = 0.3, random_state = 100)


##### **2.3.2** <font color = red>[3 marks]</font> <br>
 Split the data into training and test sets

In [None]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split

# We specify this so that the train and test data set always have the same rows, respectively
np.random.seed(0)
df_train, df_test = train_test_split(Delivery_time_Data, train_size = 0.7, test_size = 0.3, random_state = 100)


## **3. Exploratory Data Analysis on Training Data** <font color = red>[20 marks]</font> <br>
1. Analyzing the correlation between variables to identify patterns and relationships
2. Identifying and addressing outliers to ensure the integrity of the analysis
3. Exploring the relationships between variables and examining the distribution of the data for better insights

#### **3.1 Feature Distributions** <font color = red> [7 marks]</font> <br>


In [None]:
# Define numerical and categorical columns for easy EDA and data manipulation
cat_cols=Delivery_time_Data.select_dtypes(include=['object']).columns
num_cols = Delivery_time_Data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)


##### **3.1.1** <font color = red>[3 marks]</font> <br>
Plot distributions for numerical columns in the training set to understand their spread and any skewness

In [None]:
# Plot distributions for all numerical columns

sns.pairplot(Delivery_time_Data)
plt.show()
sns.distplot(Delivery_time_Data['num_distinct_items'],kde=False)
plt.title('num_distinct_items')
plt.show()
sns.distplot(Delivery_time_Data['total_busy_dashers'],kde=False)
plt.title('total_busy_dashers')
plt.show()

##### **3.1.2** <font color = red>[2 marks]</font> <br>
Check the distribution of categorical features

In [None]:
# Distribution of categorical columns
Delivery_time_Data['colour'].value_counts().plot(kind='bar')


##### **3.1.3** <font color = red>[2 mark]</font> <br>
Visualise the distribution of the target variable to understand its spread and any skewness

In [None]:
# Distribution of time_taken

sns.distplot(Delivery_time_Data['time_taken'],kde=False)
plt.title('time_taken')
plt.show()

#### **3.2 Relationships Between Features** <font color = red>[3 marks]</font> <br>

##### **3.2.1** <font color = red>[3 marks]</font> <br>
Scatter plots for important numerical and categorical features to observe how they relate to `time_taken`

In [None]:
# Scatter plot to visualise the relationship between time_taken and other features

sns.pairplot(Delivery_time_Data)
plt.show()


In [None]:
# Show the distribution of time_taken for different hours

   # Using displot for individual distributions
   sns.displot(Delivery_time_Data, x='time_taken', kde=True)
   sns.displot(Delivery_time_Data, x='actual_delivery_time_format_hour', kde=True)
   plt.show()

   # Using displot for joint distribution
   sns.displot(Delivery_time_Data, x='time_taken', y='actual_delivery_time_format_hour', kind='kde')
   plt.show()

      # Using displot for individual distributions
   sns.displot(Delivery_time_Data, x='time_taken', kde=True)
   sns.displot(Delivery_time_Data, x='actual_delivery_time_format_hour', kde=True)
   plt.show()

   # Using displot for joint distribution
   sns.displot(Delivery_time_Data, x='time_taken', y='actual_delivery_time_format_hour', kind='kde')
   plt.show()

   sns.distplot(Delivery_time_Data['created_at_time_format_hour'],kde=False)
plt.title('created_at_time_format_hour')
plt.show()

#### **3.3 Correlation Analysis** <font color = red>[5 marks]</font> <br>
Check correlations between numerical features to identify which variables are strongly related to `time_taken`

##### **3.3.1** <font color = red>[3 marks]</font> <br>
Plot a heatmap to display correlations

In [None]:
# Plot the heatmap of the correlation matrix

plt.figure(figsize = (16, 10))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

##### **3.3.2** <font color = red>[2 marks]</font> <br>
Drop the columns with weak correlations with the target variable

In [None]:
# Drop 3-5 weakly correlated columns from training dataset

  Delivery_time_Data.drop(columns=['created_at', 'actual_delivery_time','created_at_time_format','actual_delivery_time_format',], axis=1, inplace=True)

  cat_cols=Delivery_time_Data.select_dtypes(include=['object']).columns
num_cols = Delivery_time_Data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

#### **3.4 Handling the Outliers** <font color = red>[5 marks]</font> <br>



##### **3.4.1** <font color = red>[2 marks]</font> <br>
Visualise potential outliers for the target variable and other numerical features using boxplots

In [1]:
# Boxplot for time_taken
import pandas as pd
import matplotlib.pyplot as plt

# Assuming df is your DataFrame and 'column_name' is the column you want to plot
Delivery_time_Data.boxplot(column=['market_id', 'store_primary_category', 'order_protocol', 'total_items', 'num_distinct_items', 'total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders', 'distance', 'time_taken']
, figsize=(20,10))
plt.show()

NameError: name 'Delivery_time_Data' is not defined

##### **3.4.2** <font color = red>[3 marks]</font> <br>
Handle outliers present in all columns

In [None]:
# Handle outliers
Delivery_time_Data = Delivery_time_Data[Delivery_time_Data.total_items <=100]
Delivery_time_Data.describe()



## **4. Exploratory Data Analysis on Validation Data** <font color = red>[optional]</font> <br>
Optionally, perform EDA on test data to see if the distribution match with the training data

In [None]:
# Define numerical and categorical columns for easy EDA and data manipulation

cat_cols=Delivery_time_Data.select_dtypes(include=['object']).columns
num_cols = Delivery_time_Data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

#### **4.1 Feature Distributions**


##### **4.1.1**
Plot distributions for numerical columns in the validation set to understand their spread and any skewness

In [None]:
# Plot distributions for all numerical columns

# Let's visualise the data with a scatter plot and the fitted regression line
plt.scatter(X_train_lm.iloc[:, 1], y_train)
plt.plot(X_train_lm.iloc[:, 1], 0.127 + 0.462*X_train_lm.iloc[:, 1], 'r')
plt.show()

##### **4.1.2**
Check the distribution of categorical features

In [None]:
# Distribution of categorical columns

cat_cols=Delivery_time_Data.select_dtypes(include=['object']).columns
num_cols = Delivery_time_Data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

##### **4.1.3**
Visualise the distribution of the target variable to understand its spread and any skewness

In [None]:
# Distribution of time_taken
plt.figure(figsize = (16, 10))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()


#### **4.2 Relationships Between Features**
Scatter plots for numerical features to observe how they relate to each other, especially to `time_taken`

In [None]:
# Scatter plot to visualise the relationship between time_taken and other features

plt.figure(figsize = (16, 10))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

#### **4.3** Drop the columns with weak correlations with the target variable

In [None]:
# Drop the weakly correlated columns from training dataset
# Boxplot for time_taken
import pandas as pd
import matplotlib.pyplot as plt

# Assuming df is your DataFrame and 'column_name' is the column you want to plot
Delivery_time_Data.boxplot(column=['market_id', 'store_primary_category', 'order_protocol', 'total_items', 'num_distinct_items', 'total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders', 'distance', 'time_taken']
, figsize=(20,10))
plt.show()


## **5. Model Building** <font color = red>[15 marks]</font> <br>

#### **Import Necessary Libraries**

In [None]:
# Import libraries

import statsmodels
import statsmodels.api as sm
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')

#### **5.1 Feature Scaling** <font color = red>[3 marks]</font> <br>

In [None]:
# Apply scaling to the numerical columns

# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['market_id', 'store_primary_category', 'order_protocol', 'total_items', 'subtotal','num_distinct_items','min_item_price','max_item_price','total_onshift_dashers','total_outstanding_orders','distance','time_taken']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

Note that linear regression is agnostic to feature scaling. However, with feature scaling, we get the coefficients to be somewhat on the same scale so that it becomes easier to compare them.

#### **5.2 Build a linear regression model** <font color = red>[5 marks]</font> <br>

You can choose from the libraries *statsmodels* and *scikit-learn* to build the model.

In [None]:
# Create/Initialise the model
# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['market_id', 'store_primary_category', 'order_protocol', 'total_items', 'subtotal','num_distinct_items','min_item_price','max_item_price','total_onshift_dashers','total_outstanding_orders','distance','time_taken']

df_train[num_vars] = scaler.fit_transform(df_trai

In [None]:
# Train the model using the training data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_train.info()

In [None]:
# Make predictions
y_train = df_train.pop('time_taken')
X_train = df_train
import statsmodels.api as sm

# Add a constant
X_train_lm = sm.add_constant(X_train[['distance']])

# Create a first fitted model
lr = sm.OLS(y_train, X_train_lm).fit()
sns.distplot(Delivery_time_Data['time_taken'],kde=False)
plt.title('time_taken')
plt.show()

In [None]:
# Find results for evaluation metrics

scaler= MinMaxScaler()
num_vars =['market_id','store_primary_category','order_protocol','total_items','subtotal','num_distinct_items','min_item_price','max_item_price','total_onshift_dashers','total_busy_dashers','total_outstanding_orders','distance','time_taken']
df_train[num_vars]=scaler.fit_transform(df_train[num_vars])
df_train.head()
y_train_time_taken = lr_3.predict(X_train_lm)

# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_time_taken), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading
plt.xlabel('Errors', fontsize = 18)                         # X-label

num_vars =['market_id','store_primary_category','order_protocol','total_items','subtotal','num_distinct_items','min_item_price','max_item_price','total_onshift_dashers','total_busy_dashers','total_outstanding_orders','distance','time_taken']

df_test[num_vars] = scaler.transform(df_test[num_vars])
# Import LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder

def label_encoding(df):
    categorical_columns = df.select_dtypes(include='object').columns
    label_encoder = LabelEncoder()
    df[categorical_columns] = df[categorical_columns].apply(lambda col: label_encoder.fit_transform(col))

label_encoding(Delivery_time_Data)
Delivery_time_Data.head()

[link text](https://)Note that we have 12 (depending on how you select features) training features. However, not all of them would be useful. Let's say we want to take the most relevant 8 features.

We will use Recursive Feature Elimination (RFE) here.

For this, you can look at the coefficients / p-values of features from the model summary and perform feature elimination, or you can use the RFE module provided with *scikit-learn*.

#### **5.3 Build the model and fit RFE to select the most important features** <font color = red>[7 marks]</font> <br>

For RFE, we will start with all features and use
the RFE method to recursively reduce the number of features one-by-one.

After analysing the results of these iterations, we select the one that has a good balance between performance and number of features.

In [None]:
# Loop through the number of features and test the model

y_test = df_test.pop('time_taken')
X_test = df_test
# Adding constant variable to test dataframe
X_test_m4 = sm.add_constant(X_test)
# Creating X_test_m4 dataframe by dropping variables from X_test_m4
X_test_m4 = X_test_m4.drop('total_busy_dashers', axis = 1)
# Making predictions using the fourth model

y_pred_m4 = lr_3.predict(X_test_m4)
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_pred_m4)
fig.suptitle('y_test vs y_pred', fontsize = 20)              # Plot heading
plt.xlabel('y_test', fontsize = 18)                          # X-label
plt.ylabel('y_pred', fontsize = 16)

In [None]:
# Build the final model with selected number of features

 OLS Regression Results
==============================================================================
Dep. Variable:             time_taken   R-squared:                       0.834
Model:                            OLS   Adj. R-squared:                  0.834
Method:                 Least Squares   F-statistic:                 5.620e+04
Date:                Wed, 28 May 2025   Prob (F-statistic):               0.00
Time:                        08:52:15   Log-Likelihood:             1.8557e+05
No. Observations:              123043   AIC:                        -3.711e+05
Df Residuals:                  123031   BIC:                        -3.710e+05
Df Model:                          11
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                        0.0015      0.001      2.128      0.033       0.000       0.003
market_id                   -0.0468      0.001    -81.124      0.000      -0.048      -0.046
store_primary_category       0.0054      0.001     10.089      0.000       0.004       0.006
order_protocol              -0.0635      0.001   -102.564      0.000      -0.065      -0.062
subtotal                     0.4784      0.004    116.090      0.000       0.470       0.486
num_distinct_items           0.1723      0.003     53.402      0.000       0.166       0.179
min_item_price               0.0502      0.006      7.912      0.000       0.038       0.063
max_item_price               0.2208      0.006     37.187      0.000       0.209       0.232
total_onshift_dashers       -0.8421      0.003   -320.203      0.000      -0.847      -0.837
total_busy_dashers          -0.0021   1.59e-05   -129.354      0.000      -0.002      -0.002
total_outstanding_orders     1.4425      0.003    547.851      0.000       1.437       1.448
distance                     0.5613      0.001    385.174      0.000       0.558       0.564
==============================================================================
Omnibus:                    23126.342   Durbin-Watson:                   1.997
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            59158.221
Skew:                           1.036   Prob(JB):                         0.00
Kurtosis:                       5.693   Cond. No.                     2.58e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

## **6. Results and Inference** <font color = red>[5 marks]</font> <br>

#### **6.1 Perform Residual Analysis** <font color = red>[3 marks]</font> <br>

In [None]:
# Perform residual analysis using plots like residuals vs predicted values, Q-Q plot and residual histogram
# Calculate the VIFs again for the new model
vif = pd.DataFrame()
vif['Features'] = X_train_lm.columns
vif['VIF'] = [variance_inflation_factor(X_train_lm.values, i) for i in range(X_train_lm.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

	Features	VIF
0	const	20.19
8	total_onshift_dashers	11.71
9	total_busy_dashers	11.25
10	total_outstanding_orders	9.89
4	subtotal	3.40
5	num_distinct_items	3.27
7	max_item_price	2.22
6	min_item_price	2.15
3	order_protocol	1.05
2	store_primary_category	1.02
1	market_id	1.01
11	distance	1.00




[Your inferences here:]



#### **6.2 Perform Coefficient Analysis** <font color = red>[2 marks]</font> <br>

Perform coefficient analysis to find how changes in features affect the target.
Also, the features were scaled, so interpret the scaled and unscaled coefficients to understand the impact of feature changes on delivery time.


In [None]:
# Compare the scaled vs unscaled features used in the final model

	Features	VIF
0	const	20.19
8	total_onshift_dashers	11.71
9	total_busy_dashers	11.25
10	total_outstanding_orders	9.89
4	subtotal	3.40
5	num_distinct_items	3.27
7	max_item_price	2.22
6	min_item_price	2.15
3	order_protocol	1.05
2	store_primary_category	1.02
1	market_id	1.01
11	distance	1.00


Additionally, we can analyse the effect of a unit change in a feature. In other words, because we have scaled the features, a unit change in the features will not translate directly to the model. Use scaled and unscaled coefficients to find how will a unit change in a feature affect the target.

In [None]:
# Analyze the effect of a unit change in a feature, say 'total_items'

==================================================
Dep. Variable:             time_taken   R-squared:                       0.812
Model:                            OLS   Adj. R-squared:                  0.811
Method:                 Least Squares   F-statistic:                 4.815e+04
Date:                Wed, 28 May 2025   Prob (F-statistic):               0.00
Time:                        08:25:54   Log-Likelihood:             1.7775e+05
No. Observations:              123043   AIC:                        -3.555e+05
Df Residuals:                  123031   BIC:                        -3.554e+05
Df Model:                          11
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                        0.0019      0.001      2.571      0.010       0.000       0.003
market_id                   -0.0457      0.001    -74.389      0.000      -0.047      -0.045
store_primary_category       0.0049      0.001      8.613      0.000       0.004       0.006
order_protocol              -0.0670      0.001   -101.503      0.000      -0.068      -0.066
total_items                 -0.2487      0.037     -6.661      0.000      -0.322      -0.175
subtotal                     0.4825      0.005    107.191      0.000       0.474       0.491
num_distinct_items           0.1876      0.004     49.450      0.000       0.180       0.195
min_item_price               0.0482      0.007      7.127      0.000       0.035       0.061
max_item_price               0.2061      0.006     31.955      0.000       0.193       0.219
total_onshift_dashers       -1.0308      0.002   -441.856      0.000      -1.035      -1.026
total_outstanding_orders     1.2967      0.003    511.403      0.000       1.292       1.302
distance                     0.5596      0.002    360.369      0.000       0.557       0.563
==============================================================================
Omnibus:                    22808.828   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            59497.669
Skew:                           1.015   Prob(JB):                         0.00
Kurtosis:                       5.735   Cond. No.                         302.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note:
The coefficients on the original scale might differ greatly in magnitude from the scaled coefficients, but they both describe the same relationships between variables.

Interpretation is key: Focus on the direction and magnitude of the coefficients on the original scale to understand the impact of each variable on the response variable in the original units.

Include conclusions in your report document.

## Subjective Questions <font color = red>[20 marks]</font>

Answer the following questions only in the notebook. Include the visualisations/methodologies/insights/outcomes from all the above steps in your report.

#### Subjective Questions based on Assignment

##### **Question 1.** <font color = red>[2 marks]</font> <br>

Are there any categorical variables in the data? From your analysis of the categorical variables from the dataset, what could you infer about their effect on the dependent variable?



**Answer:**
>created_at and actual_delivery_time  are the categorial variable in the data which is the time format.
so these has to be converted to time object or time format to extract data and time for caluating the tme taken
 Delivery_time_Data['created_at_time_format'] = pd.to_datetime(Delivery_time_Data['created_at'])
 Delivery_time_Data['actual_delivery_time_format'] = pd.to_datetime(Delivery_time_Data['actual_delivery_time'])
 Delivery_time_Data['time_taken'] = (Delivery_time_Data['actual_delivery_time_format'] - Delivery_time_Data['created_at_time_format']).dt.total_seconds() / 60
Delivery_time_Data.info()



---



##### **Question 2.** <font color = red>[1 marks]</font> <br>
What does `test_size = 0.2` refer to during splitting the data into training and test sets?

**Answer:**
>In ML course , test_size=0.2 during data splitting indicates that 20% of the dataset will be used for testing the model, while the remaining 80% will be used for training. This is a common split ratio, though other ratios like 70/30 or 60/40 are also used. The test set helps evaluate the model's performance on unseen data, preventing overfitting.




---



##### **Question 3.** <font color = red>[1 marks]</font> <br>
Looking at the heatmap, which one has the highest correlation with the target variable?  

**Answer:**
>Highest correleaction based on heat map is distance



---



##### **Question 4.** <font color = red>[2 marks]</font> <br>
What was your approach to detect the outliers? How did you address them?

**Answer:**

>To identify and address outliers, we can use visual inspection and statistical methods. For visualization, I used box plots, histograms, and scatter plots to observe deviations from the expected data distribution. Statistically, I applied the interquartile range (IQR) method and z-score analysis to pinpoint extreme values. Outliers were then addressed by either trimming or removing them, quantile-based flooring and capping, or mean/median imputation, depending on the nature of the data and the analysis objectives.




---



##### **Question 5.** <font color = red>[2 marks]</font> <br>
Based on the final model, which are the top 3 features significantly affecting the delivery time?

**Answer:**
>Based on the final model the major one impactong the delivery time or time taken are total oustanding order distance and subtotal

const                        0.0015      0.001      2.128      0.033       0.000       0.003
market_id                   -0.0468      0.001    -81.124      0.000      -0.048      -0.046
store_primary_category       0.0054      0.001     10.089      0.000       0.004       0.006
order_protocol              -0.0635      0.001   -102.564      0.000      -0.065      -0.062
subtotal                     0.4784      0.004    116.090      0.000       0.470       0.486
num_distinct_items           0.1723      0.003     53.402      0.000       0.166       0.179
min_item_price               0.0502      0.006      7.912      0.000       0.038       0.063
max_item_price               0.2208      0.006     37.187      0.000       0.209       0.232
total_onshift_dashers       -0.8421      0.003   -320.203      0.000      -0.847      -0.837
total_busy_dashers          -0.0021   1.59e-05   -129.354      0.000      -0.002      -0.002
total_outstanding_orders     1.4425      0.003    547.851      0.000       1.437       1.448
distance                     0.5613      0.001    385.174      0.000       0.558       0.564



---



#### General Subjective Questions

##### **Question 6.** <font color = red>[3 marks]</font> <br>
Explain the linear regression algorithm in detail

**Answer:**
>Linear regression is a supervised machine learning algorithm that predicts a continuous numerical value by modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship, meaning the output changes at a constant rate as the input changes, and represents this relationship with a straight line or a hyperplane in multiple dimensions.
Here's a more detailed explanation:
1. Supervised Learning: Linear regression, like other regression algorithms, is a supervised learning method. This means it learns from labeled data, where both the input (independent variable) and the corresponding output (dependent variable) are known.
2. Predicting Continuous Values: It's specifically used for predicting continuous numerical values, such as predicting house prices, sales, or temperatures.
3. Linear Relationship Assumption: The core principle of linear regression is that it assumes a linear relationship between the input features (independent variables) and the target variable (dependent variable). This means the relationship can be represented by a straight line (in simple linear regression) or a hyperplane in multiple-dimensional spaces (in multiple linear regression).
4. Finding the Best Fit Line: The algorithm aims to find the line (or hyperplane) that best fits the data points, minimizing the distance between the predicted values and the actual values.
5. Cost Function and Minimization: The "best fit" line is determined by minimizing a cost function, which measures the difference between the predicted values and the actual values. A common cost function is the Mean Squared Error (MSE), which calculates the average of the squared differences.
6. Multiple Linear Regression: When there are multiple independent variables, it's called multiple linear regression. The model then becomes a linear combination of the input features, with each feature having a weight that represents its influence on the target.
7. Applications: Linear regression is widely used in various fields for:
Predicting: Forecasting sales, predicting house prices, estimating market trends, etc.
Understanding relationships: Identifying the relationship between variables and their effects on each other.
In summary, linear regression is a powerful and versatile algorithm for predicting continuous values by finding the best-fitting linear relationship between input features and the target variable.




---



##### **Question 7.** <font color = red>[2 marks]</font> <br>
Explain the difference between simple linear regression and multiple linear regression

**Answer:**
>Simple linear regression uses one independent variable to predict a dependent variable, while multiple linear regression uses two or more independent variables to predict the same dependent variable. Simple linear regression models a linear relationship between one predictor and an outcome, whereas multiple regression allows for more complex models of an outcome based on multiple factors.

Simple Linear Regression:

One Independent Variable:
It examines the relationship between a single predictor variable and the outcome variable.
Simpler Model:
Easier to interpret and implement, with a straightforward linear relationship.
Prediction Focused:
Useful for situations where a single factor is believed to strongly influence the outcome.

Multiple Linear Regression:
Two or More Independent Variables:
It considers multiple predictor variables simultaneously to understand their combined influence on the outcome.
More Complex Model:
Can capture more complex relationships between variables and improve predictive accuracy.
Can Control for Variables:
Can be used to account for the effects of multiple factors, isolating the impact of individual variables.




---



##### **Question 8.** <font color = red>[2 marks]</font> <br>
What is the role of the cost function in linear regression, and how is it minimized?

**Answer:**
>In linear regression, the cost function quantifies the difference between predicted and actual values, acting as a measure of model performance. Minimizing this cost function, often through algorithms like gradient descent, refines the model's parameters to produce more accurate predictions.


Role of the Cost Function:
Performance Metric:
The cost function provides a numerical score reflecting how well the model is performing, with lower values indicating better performance.

Error Indicator:
It measures the discrepancy between the model's predictions and the true values, highlighting areas where the model needs improvement.

Optimization Guide:
By minimizing the cost function, the model can adjust its parameters to reduce the error and improve its predictive ability.


Minimizing the Cost Function:
Gradient Descent:
This iterative optimization algorithm adjusts the model's parameters (e.g., slope and intercept in a linear regression) based on the direction of the cost function's gradient.


Learning Rate:
The learning rate controls the size of the steps taken during each iteration of gradient descent, balancing the need for quick convergence with the risk of overshooting the minimum.

Iteration:
The process of updating parameters is repeated until the cost function converges to a minimum, indicating that the model has found its optimal parameters.
Common Cost Function in Linear Regression: Mean Squared Error (MSE):
Calculation:
MSE calculates the average of the squared differences between predicted and actual values.
Interpretation:
It penalizes larger errors more significantly than smaller ones, ensuring that the model's predictions are as close as possible to the true values.




---



##### **Question 9.** <font color = red>[2 marks]</font> <br>
Explain the difference between overfitting and underfitting.



**Answer:**

>Overfitting and underfitting are two common challenges in machine learning model training. Overfitting occurs when a model learns the training data too well, including noise and outliers, leading to poor generalization on new, unseen data. Underfitting, on the other hand, occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and new data.
Overfitting:
Definition:
A model that is too complex and captures noise and irrelevant details in the training data, leading to high accuracy on training data but poor accuracy on unseen data.
Causes:
Using a model with too many parameters, training for too long, and having a small training dataset.
Characteristics:
High accuracy on the training data but low accuracy on new data, high variance, and low bias.
Example:
A model that memorizes the exact examples in the training data but fails to generalize to new data points that are slightly different.
Underfitting:
Definition: A model that is too simple and cannot capture the underlying patterns in the data, resulting in poor performance on both training and new data.
Causes: Using a model with too few parameters, training for too short a time, and not having enough features.
Characteristics: Low accuracy on both training and new data, low variance, and high bias.
Example: A linear regression model trying to fit a curved dataset.
How to address overfitting and underfitting:
Overfitting:
Increase the amount of training data: This can help the model generalize better by exposing it to a wider range of examples.
Use regularization techniques: These methods penalize complex models, encouraging them to find simpler solutions.
Simplify the model: Reduce the number of parameters or features used by the model.
Early stopping: Monitor the model's performance on a validation set and stop training when the performance starts to decline.
Underfitting:
Use a more complex model: Increase the number of parameters or features used by the model.
Train for a longer period: Give the model more time to learn the patterns in the data.
Add more features: Include more relevant features in the model.
Reduce regularization: Reduce the penalty on complex models, allowing them to fit the data more closely.






---



##### **Question 10.** <font color = red>[3 marks]</font> <br>
How do residual plots help in diagnosing a linear regression model?

**Answer:**
>Residual plots are crucial for diagnosing linear regression models because they visually represent the discrepancies between observed and predicted values, revealing potential issues with the model's fit. By examining the scatter of points on the residual plot, one can identify if the model is adequately specified, if it violates assumptions about error distribution, or if there are influential data points.
