# Order Delivery Time Prediction

## Objectives
The objective of this assignment is to build a regression model that predicts the delivery time for orders placed through Porter. The model will use various features such as the items ordered, the restaurant location, the order protocol, and the availability of delivery partners.

The key goals are:
- Predict the delivery time for an order based on multiple input features
- Improve delivery time predictions to optimise operational efficiency
- Understand the key factors influencing delivery time to enhance the model's accuracy

## Data Pipeline
The data pipeline for this assignment will involve the following steps:
1. **Data Loading**
2. **Data Preprocessing and Feature Engineering**
3. **Exploratory Data Analysis**
4. **Model Building**
5. **Model Inference**

## Data Understanding
The dataset contains information on orders placed through Porter, with the following columns:

| Field                     | Description                                                                                 |
|---------------------------|---------------------------------------------------------------------------------------------|
| market_id                 | Integer ID representing the market where the restaurant is located.                         |
| created_at                | Timestamp when the order was placed.                                                        |
| actual_delivery_time      | Timestamp when the order was delivered.                                                     |
| store_primary_category    | Category of the restaurant (e.g., fast food, dine-in).                                      |
| order_protocol            | Integer representing how the order was placed (e.g., via Porter, call to restaurant, etc.). |
| total_items               | Total number of items in the order.                                                         |
| subtotal                  | Final price of the order.                                                                   |
| num_distinct_items        | Number of distinct items in the order.                                                      |
| min_item_price            | Price of the cheapest item in the order.                                                    |
| max_item_price            | Price of the most expensive item in the order.                                              |
| total_onshift_dashers     | Number of delivery partners on duty when the order was placed.                              |
| total_busy_dashers        | Number of delivery partners already occupied with other orders.                             |
| total_outstanding_orders  | Number of orders pending fulfillment at the time of the order.                              |
| distance                  | Total distance from the restaurant to the customer.                                         |


## **Importing Necessary Libraries**

In [None]:
import numpy as np
import pandas as pd


## **1. Loading the data**
Load 'porter_data_1.csv' as a DataFrame

In [None]:
# Importing the file porter_data_1.csv
df = pd.read_csv('porter_data_1.csv')
print(df.info())

## **2. Data Preprocessing and Feature Engineering** <font color = red>[15 marks]</font> <br>

#### **2.1 Fixing the Datatypes**  <font color = red>[5 marks]</font> <br>
The current timestamps are in object format and need conversion to datetime format for easier handling and intended functionality

##### **2.1.1** <font color = red>[2 marks]</font> <br>
Convert date and time fields to appropriate data type

In [None]:
# Convert 'created_at' and 'actual_delivery_time' columns to datetime format
df['created_at'] = pd.to_datetime(df['created_at'])
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'])

##### **2.1.2**  <font color = red>[3 marks]</font> <br>
Convert categorical fields to appropriate data type

In [None]:
# Convert categorical features to category type
df['market_id'] = df['market_id'].astype('category')
df['store_primary_category'] = df['store_primary_category'].astype('category')
df['order_protocol'] = df['order_protocol'].astype('category')

#### **2.2 Feature Engineering** <font color = red>[5 marks]</font> <br>
Calculate the time taken to execute the delivery as well as extract the hour and day at which the order was placed

##### **2.2.1** <font color = red>[2 marks]</font> <br>
Calculate the time taken using the features `actual_delivery_time` and `created_at`

In [None]:

# Calculate time taken in minutes
df['time_taken'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds() / 60
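
As a minimal sketch of the calculation above, the following uses a hypothetical two-row sample (the timestamps are made up for illustration; only the column names come from the dataset description). Subtracting two datetime columns gives a timedelta, which is then converted to minutes.

```python
import pandas as pd

# Hypothetical sample rows; only the column names match the dataset
sample = pd.DataFrame({
    'created_at': ['2015-02-06 22:24:17', '2015-02-10 21:49:25'],
    'actual_delivery_time': ['2015-02-06 23:11:17', '2015-02-10 22:33:25'],
})

# Parse the string timestamps so that arithmetic on them is possible
for col in ['created_at', 'actual_delivery_time']:
    sample[col] = pd.to_datetime(sample[col])

# Timedelta -> seconds -> minutes
elapsed = sample['actual_delivery_time'] - sample['created_at']
sample['time_taken'] = elapsed.dt.total_seconds() / 60

print(sample['time_taken'].tolist())  # [47.0, 44.0]
```

Without the `pd.to_datetime` step the subtraction would fail, because the columns would still be plain strings.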

##### **2.2.2** <font color = red>[3 marks]</font> <br>
Extract the hour at which the order was placed and which day of the week it was. Drop the unnecessary columns.

In [None]:
# Extract the hour and day of week from the 'created_at' timestamp
df['hour'] = df['created_at'].dt.hour
df['day_of_week'] = df['created_at'].dt.day_name()

# Create a binary feature 'is_weekend' (1 for Saturday/Sunday, 0 otherwise)
df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)

In [None]:
# Drop unnecessary columns
df = df.drop(columns=['created_at', 'actual_delivery_time', 'day_of_week'])

#### **2.3 Creating training and validation sets** <font color = red>[5 marks]</font> <br>

##### **2.3.1** <font color = red>[2 marks]</font> <br>
 Define target and input features

In [None]:


# Define target variable (y) and features (X)
y = df['time_taken']
X = df.drop(columns=['time_taken'])

##### **2.3.2** <font color = red>[3 marks]</font> <br>
 Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
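
To make the effect of `test_size=0.2` concrete, here is a small sketch on synthetic arrays (the data and the names `X_demo`, `y_demo` are assumptions for illustration, not part of the assignment data). With 100 rows, 20% go to the test set and the remaining 80% to the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 feature columns
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same arguments as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))  # 80 20
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same split.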

## **3. Exploratory Data Analysis on Training Data** <font color = red>[20 marks]</font> <br>
1. Analyzing the correlation between variables to identify patterns and relationships
2. Identifying and addressing outliers to ensure the integrity of the analysis
3. Exploring the relationships between variables and examining the distribution of the data for better insights

#### **3.1 Feature Distributions** <font color = red> [7 marks]</font> <br>


In [None]:
# Define numerical and categorical columns for easy EDA and data manipulation
numerical_cols = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['category', 'object']).columns.tolist()

print("Numerical Columns:", numerical_cols)
print("Categorical Columns:", categorical_cols)


##### **3.1.1** <font color = red>[3 marks]</font> <br>
Plot distributions for numerical columns in the training set to understand their spread and any skewness

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot distributions for all numerical columns
plt.figure(figsize=(15, 20))
for i, col in enumerate(numerical_cols):
    plt.subplot(5, 4, i + 1)  # Assuming a 5x4 grid is sufficient
    sns.histplot(X_train[col], kde=True) # Plot histogram with KDE
    plt.title(col)
plt.tight_layout()
plt.show()


##### **3.1.2** <font color = red>[2 marks]</font> <br>
Check the distribution of categorical features

In [None]:
# Distribution of categorical columns
plt.figure(figsize=(15, 10))
for i, col in enumerate(categorical_cols):
    plt.subplot(2, 2, i + 1) # Assuming a 2x2 grid is sufficient
    sns.countplot(data=X_train, x=col)
    plt.title(col)
    plt.xticks(rotation=45) # Rotate x-axis labels for readability
plt.tight_layout()
plt.show()

##### **3.1.3** <font color = red>[2 marks]</font> <br>
Visualise the distribution of the target variable to understand its spread and any skewness

In [None]:
# Distribution of time_taken
plt.figure(figsize=(8, 6))
sns.histplot(y_train, kde=True)
plt.title('Distribution of time_taken')
plt.xlabel('Time Taken (minutes)')
plt.ylabel('Frequency')
plt.show()

#### **3.2 Relationships Between Features** <font color = red>[3 marks]</font> <br>

##### **3.2.1** <font color = red>[3 marks]</font> <br>
Scatter plots for important numerical and categorical features to observe how they relate to `time_taken`

In [None]:
# Scatter plots to visualise the relationship between time_taken and the numerical features
# (training data only; the target is stored separately in y_train)
plt.figure(figsize=(15, 20))
for i, col in enumerate(numerical_cols):
    plt.subplot(5, 4, i + 1)
    sns.scatterplot(x=X_train[col], y=y_train)
    plt.title(f'{col} vs. time_taken')
    plt.xlabel(col)
    plt.ylabel('time_taken')
plt.tight_layout()
plt.show()

In [None]:
# Show the distribution of time_taken for different hours (training data)
plt.figure(figsize=(12, 6))
sns.boxplot(x=X_train['hour'], y=y_train)
plt.title('Distribution of Delivery Time by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Delivery Time (minutes)')
plt.show()

#### **3.3 Correlation Analysis** <font color = red>[5 marks]</font> <br>
Check correlations between numerical features to identify which variables are strongly related to `time_taken`

##### **3.3.1** <font color = red>[3 marks]</font> <br>
Plot a heatmap to display correlations

In [None]:
# Plot the heatmap of the correlation matrix
# Correlations are computed on the training data, with the target added back as a column
train_numeric = X_train[numerical_cols].assign(time_taken=y_train)
correlation_matrix = train_numeric.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
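
A heatmap can be hard to read when the goal is only to rank features against the target. The sketch below, on a hypothetical toy frame (the values are invented; only the column names echo the dataset), pulls out the target column of the correlation matrix and sorts it by absolute value.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'distance': [1, 2, 3, 4, 5, 6],
    'total_items': [2, 1, 3, 1, 2, 3],
    'time_taken': [12, 19, 31, 42, 50, 61],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()['time_taken']
    .drop('time_taken')          # the target's correlation with itself is always 1
    .abs()
    .sort_values(ascending=False)
)

print(target_corr)
```

In this toy frame `distance` ranks first because `time_taken` rises almost linearly with it, while `total_items` shows only a weak relationship.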

##### **3.3.2** <font color = red>[2 marks]</font> <br>
Drop the columns with weak correlations with the target variable

In [None]:
# Drop 3-5 weakly correlated columns from training dataset

# 1. Identify weakly correlated columns
weak_correlation_threshold = 0.1  # You can adjust this threshold
columns_to_drop = []
for col in numerical_cols:
    if col != 'time_taken' and abs(correlation_matrix['time_taken'][col]) < weak_correlation_threshold:
        columns_to_drop.append(col)

print("Columns to drop:", columns_to_drop)

# 2. Drop the columns from both sets so that later steps use the reduced feature set
X_train = X_train.drop(columns=columns_to_drop)
X_test = X_test.drop(columns=columns_to_drop)  # Also drop from the test set!

# 3. Update numerical_cols (important for later)
numerical_cols = [col for col in numerical_cols if col not in columns_to_drop]

print("Updated Numerical Columns:", numerical_cols)

#### **3.4 Handling the Outliers** <font color = red>[5 marks]</font> <br>



##### **3.4.1** <font color = red>[2 marks]</font> <br>
Visualise potential outliers for the target variable and other numerical features using boxplots

In [None]:
# Boxplot for time_taken
# The target is stored in y_train (it was removed from X_train when the features were defined)
plt.figure(figsize=(8, 6))
sns.boxplot(y=y_train)  # Use y-axis for single variable boxplot
plt.title('Boxplot of time_taken')
plt.ylabel('Delivery Time (minutes)')
plt.show()

##### **3.4.2** <font color = red>[3 marks]</font> <br>
Handle outliers present in all columns

In [None]:
# Handle outliers

# Visualise potential outliers for all numerical features using boxplots
plt.figure(figsize=(15, 20))
for i, col in enumerate(numerical_cols):
    plt.subplot(5, 4, i + 1)
    sns.boxplot(y=X_train[col])
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)
plt.tight_layout()
plt.show()

# time_taken is the target, so it is held in y_train / y_test rather than X_train / X_test.
# IQR-based filtering: keep rows whose target lies within 1.5 * IQR of the quartiles.
Q1 = y_train.quantile(0.25)
Q3 = y_train.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
train_mask = y_train.between(lower_bound, upper_bound)
X_train_filtered = X_train[train_mask]
y_train_filtered = y_train[train_mask]

# Remember to apply the same logic (with the training-set bounds) to the test set if needed
test_mask = y_test.between(lower_bound, upper_bound)
X_test_filtered = X_test[test_mask]
y_test_filtered = y_test[test_mask]

# Capping: limit the target at the 95th percentile of the training data
upper_limit = y_train.quantile(0.95)
y_train = y_train.clip(upper=upper_limit)
y_test = y_test.clip(upper=upper_limit)
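
The two outlier treatments used above behave differently: IQR filtering removes rows, while capping keeps every row and only pulls extreme values in. The following sketch shows both on a hypothetical series of delivery times (the numbers and the name `times` are invented for illustration, and the IQR fence is reused as the cap purely to keep the example short).

```python
import pandas as pd

# Hypothetical delivery times in minutes, with one extreme value
times = pd.Series([30, 35, 40, 42, 45, 47, 50, 55, 60, 300], dtype=float)

# IQR fences
q1 = times.quantile(0.25)
q3 = times.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: filtering drops the rows outside the fences
kept = times[times.between(lower, upper)]

# Option 2: capping keeps all rows but limits values at the upper fence
capped = times.clip(upper=upper)

print("Rows after filtering:", len(kept))
print("Rows after capping:  ", len(capped))
print("Max after capping:   ", capped.max())
```

Filtering is the more aggressive choice because it discards information from the other columns of the removed rows; capping preserves those rows at the cost of distorting the extreme values.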

## **4. Exploratory Data Analysis on Validation Data** <font color = red>[optional]</font> <br>
Optionally, perform EDA on the test data to check whether its distributions match those of the training data

In [None]:
# Define numerical and categorical columns for easy EDA and data manipulation
numerical_cols = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['category', 'object']).columns.tolist()

print("Numerical Columns:", numerical_cols)
print("Categorical Columns:", categorical_cols)

#### **4.1 Feature Distributions**


##### **4.1.1**
Plot distributions for numerical columns in the validation set to understand their spread and any skewness

In [None]:
# Plot distributions for all numerical columns

plt.figure(figsize=(15, 20))
for i, col in enumerate(numerical_cols):
    plt.subplot(5, 4, i + 1)
    sns.histplot(X_test[col], kde=True)
    plt.title(f'Test Data - {col} Distribution')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### **4.1.2**
Check the distribution of categorical features

In [None]:
# Distribution of categorical columns

# Distribution of categorical columns in Test Set
plt.figure(figsize=(15, 10))
for i, col in enumerate(categorical_cols):
    plt.subplot(2, 2, i + 1)
    sns.countplot(data=X_test, x=col)
    plt.title(f'Test Data - {col} Distribution')
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### **4.1.3**
Visualise the distribution of the target variable to understand its spread and any skewness

In [None]:
# Distribution of time_taken

# Distribution of time_taken in Test Set
plt.figure(figsize=(8, 6))
sns.histplot(y_test, kde=True)
plt.title('Distribution of time_taken in Test Set')
plt.xlabel('Time Taken (minutes)')
plt.ylabel('Frequency')
plt.show()

#### **4.2 Relationships Between Features**
Scatter plots for numerical features to observe how they relate to each other, especially to `time_taken`

In [None]:
# Scatter plot to visualise the relationship between time_taken and other features

# Scatter plot to visualise the relationship between time_taken and other numerical features in Test Set
plt.figure(figsize=(15, 20))
for i, col in enumerate(numerical_cols):
    if col != 'time_taken':
        plt.subplot(5, 4, i + 1)
        sns.scatterplot(x=X_test[col], y=y_test)
        plt.title(f'Test Data - {col} vs. time_taken')
        plt.xlabel(col)
        plt.ylabel('time_taken')
plt.tight_layout()
plt.show()

#### **4.3** Drop the columns with weak correlations with the target variable

In [None]:
# Drop the weakly correlated columns from training dataset

# 1. Identify weakly correlated columns
weak_correlation_threshold = 0.1  # You can adjust this threshold
columns_to_drop = []
for col in numerical_cols:
    if col != 'time_taken' and abs(correlation_matrix['time_taken'][col]) < weak_correlation_threshold:
        columns_to_drop.append(col)

print("Columns to drop:", columns_to_drop)

# 2. Drop the columns from both sets so that model building uses the reduced feature set
X_train = X_train.drop(columns=columns_to_drop)
X_test = X_test.drop(columns=columns_to_drop)  # Also drop from the test set!

# 3. Update numerical_cols (important for later)
numerical_cols = [col for col in numerical_cols if col not in columns_to_drop]

print("Updated Numerical Columns:", numerical_cols)

## **5. Model Building** <font color = red>[15 marks]</font> <br>

#### **Import Necessary Libraries**

In [None]:
# Import libraries

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np # for numerical operations
import statsmodels.api as sm

#### **5.1 Feature Scaling** <font color = red>[3 marks]</font> <br>

In [None]:
# Apply scaling to the numerical columns

scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])  # Use 'transform' on test data!


Note that linear regression is agnostic to feature scaling. However, with feature scaling, we get the coefficients to be somewhat on the same scale so that it becomes easier to compare them.
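
The note above can be checked empirically. The sketch below fits ordinary least squares on synthetic data (generated here purely for illustration) once with raw features and once with standardised features; the predictions agree, while the coefficients differ in magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two features on very different scales
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[10, 1000], scale=[2, 300], size=(200, 2))
y_demo = 3.0 * X_raw[:, 0] + 0.01 * X_raw[:, 1] + rng.normal(scale=0.5, size=200)

X_scaled = StandardScaler().fit_transform(X_raw)

raw_model = LinearRegression().fit(X_raw, y_demo)
scaled_model = LinearRegression().fit(X_scaled, y_demo)

# Predictions are the same; only the coefficients change
same_predictions = np.allclose(raw_model.predict(X_raw), scaled_model.predict(X_scaled))
print("Predictions identical:", same_predictions)
print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.coef_)
```

This invariance holds for plain least squares; regularised models such as ridge or lasso do depend on feature scale.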

#### **5.2 Build a linear regression model** <font color = red>[5 marks]</font> <br>

You can choose from the libraries *statsmodels* and *scikit-learn* to build the model.

In [None]:
# Create/Initialise the model

model_sklearn = LinearRegression()

# Add a constant term to the features matrix
X_train_sm = sm.add_constant(X_train)

# Create/Initialise the Linear Regression model
model_statsmodels = sm.OLS(y_train, X_train_sm)

In [None]:
# Train the model using the training data
model_sklearn.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred_train = model_sklearn.predict(X_train)
y_pred_test = model_sklearn.predict(X_test)

In [None]:
# Find results for evaluation metrics
# Calculate Mean Squared Error (MSE)
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)

# Calculate Mean Absolute Error (MAE)
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)

# Calculate R-squared (R2)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

# Print the results
print("Training Set Metrics:")
print(f"MSE: {mse_train:.2f}")
print(f"MAE: {mae_train:.2f}")
print(f"R-squared: {r2_train:.2f}")

print("\nTest Set Metrics:")
print(f"MSE: {mse_test:.2f}")
print(f"MAE: {mae_test:.2f}")
print(f"R-squared: {r2_test:.2f}")


Note that we have 12 (depending on how you select features) training features. However, not all of them would be useful. Let's say we want to take the most relevant 8 features.

We will use Recursive Feature Elimination (RFE) here.

For this, you can look at the coefficients / p-values of features from the model summary and perform feature elimination, or you can use the RFE module provided with *scikit-learn*.

#### **5.3 Build the model and fit RFE to select the most important features** <font color = red>[7 marks]</font> <br>

For RFE, we will start with all features and use the RFE method to recursively reduce the number of features one-by-one.

After analysing the results of these iterations, we select the one that has a good balance between performance and number of features.
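
For reference, a minimal sketch of the RFE module from *scikit-learn* is shown below on synthetic data (the arrays and the choice of three features are assumptions for illustration, not results from this assignment). `RFE` repeatedly fits the estimator and removes the feature with the smallest coefficient magnitude until the requested number remains.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: six features, of which only columns 0, 2 and 4 drive the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
y_demo = (
    5.0 * X_demo[:, 0]
    - 3.0 * X_demo[:, 2]
    + 2.0 * X_demo[:, 4]
    + rng.normal(scale=0.1, size=300)
)

# Eliminate one feature per iteration until three remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.support_).tolist()
print("Selected column indices:", selected)
print("Ranking (1 = selected): ", selector.ranking_.tolist())
```

On a DataFrame, `X.columns[selector.support_]` gives the selected feature names. Because the ranking depends on coefficient magnitudes, the features should be on comparable scales before RFE is applied with a linear model.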

In [None]:
# Loop through the number of features and test the model
mse_train_list = []
mse_test_list = []
mae_train_list = []
mae_test_list = []
r2_train_list = []
r2_test_list = []

num_features_to_test = [5, 10, 15, 20, len(numerical_cols) + len(categorical_cols)]  # You can adjust this list

# Get feature importance from the model (if available)
if hasattr(model_sklearn, 'coef_'):  # LinearRegression has 'coef_'
    feature_importance = np.abs(model_sklearn.coef_)
else:
    # For models without coef_, you might need a different approach
    print("Model doesn't provide feature coefficients for ranking.")
    feature_importance = np.zeros(X_train.shape[1])  # Placeholder

# Sort features by importance
feature_ranking = np.argsort(feature_importance)[::-1]  # Descending order

for num_features in num_features_to_test:
    # Select top features
    top_features = X_train.columns[feature_ranking[:num_features]]

    # Select features for training and testing
    X_train_selected = X_train[top_features]
    X_test_selected = X_test[top_features]

    # Train the model
    model = LinearRegression()  # Create a new model for each feature set
    model.fit(X_train_selected, y_train)

    # Make predictions
    y_pred_train = model.predict(X_train_selected)
    y_pred_test = model.predict(X_test_selected)

    # Evaluate the model
    mse_train = mean_squared_error(y_train, y_pred_train)
    mse_test = mean_squared_error(y_test, y_pred_test)
    mae_train = mean_absolute_error(y_train, y_pred_train)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)

    # Store the results
    mse_train_list.append(mse_train)
    mse_test_list.append(mse_test)
    mae_train_list.append(mae_train)
    mae_test_list.append(mae_test)
    r2_train_list.append(r2_train)
    r2_test_list.append(r2_test)

# Print the results
print("Results for different number of features:")
for i, num_features in enumerate(num_features_to_test):
    print(f"\nNumber of Features: {num_features}")
    print(f"  Training Set - MSE: {mse_train_list[i] :.2f}, MAE: {mae_train_list[i] :.2f}, R-squared: {r2_train_list[i] :.2f}")
    print(f"  Test Set     - MSE: {mse_test_list[i] :.2f}, MAE: {mae_test_list[i] :.2f}, R-squared: {r2_test_list[i] :.2f}")


In [None]:
# Build the final model with selected number of features

# Determine the optimal number of features (you should replace this with your findings)
optimal_num_features = 15  # Replace with the best number of features from your previous analysis

# Select the top features based on the optimal number
top_features = X_train.columns[feature_ranking[:optimal_num_features]]
X_train_final = X_train[top_features]
X_test_final = X_test[top_features]

# Train the final model
final_model = LinearRegression()
final_model.fit(X_train_final, y_train)

# Make predictions
y_pred_train_final = final_model.predict(X_train_final)
y_pred_test_final = final_model.predict(X_test_final)

# Evaluate the final model
mse_train_final = mean_squared_error(y_train, y_pred_train_final)
mse_test_final = mean_squared_error(y_test, y_pred_test_final)
mae_train_final = mean_absolute_error(y_train, y_pred_train_final)
mae_test_final = mean_absolute_error(y_test, y_pred_test_final)
r2_train_final = r2_score(y_train, y_pred_train_final)
r2_test_final = r2_score(y_test, y_pred_test_final)

# Print the final model's performance
print("Final Model Performance:")
print(f"  Training Set - MSE: {mse_train_final:.2f}, MAE: {mae_train_final:.2f}, R-squared: {r2_train_final:.2f}")
print(f"  Test Set     - MSE: {mse_test_final:.2f}, MAE: {mae_test_final:.2f}, R-squared: {r2_test_final:.2f}")

## **6. Results and Inference** <font color = red>[5 marks]</font> <br>

#### **6.1 Perform Residual Analysis** <font color = red>[3 marks]</font> <br>

In [None]:
# Perform residual analysis using plots like residuals vs predicted values, Q-Q plot and residual histogram

import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats  # For the Q-Q plot

# Calculate residuals
residuals = y_test - y_pred_test_final

# 1. Residuals vs. Predicted Values Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred_test_final, y=residuals)
plt.title('Residuals vs. Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--')  # Add a horizontal line at y=0
plt.show()

# 2. Q-Q Plot (Quantile-Quantile Plot)
plt.figure(figsize=(10, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()

# 3. Residual Histogram
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)  # Plot histogram with KDE
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


[Your inferences here:]



#### **6.2 Perform Coefficient Analysis** <font color = red>[2 marks]</font> <br>

Perform coefficient analysis to find how changes in features affect the target.
Also, the features were scaled, so interpret the scaled and unscaled coefficients to understand the impact of feature changes on delivery time.


In [None]:
# Compare the scaled vs unscaled coefficients of the features used in the final model

# Scaled coefficients: the final model was fitted on standardised features
scaled_coef = pd.Series(final_model.coef_, index=X_train_final.columns)
scaled_intercept = final_model.intercept_

# Mean and standard deviation the scaler learned for each numerical column.
# Columns that were never scaled get mean 0 and standard deviation 1.
all_features = pd.Series(X_train.columns)
feature_mean = all_features.map(pd.Series(scaler.mean_, index=scaler.feature_names_in_)).fillna(0.0)
feature_std = all_features.map(pd.Series(scaler.scale_, index=scaler.feature_names_in_)).fillna(1.0)

# Create a DataFrame for comparison, matching coefficients to features by name
comparison_df = pd.DataFrame({'Original Feature': all_features})
comparison_df['Selected'] = comparison_df['Original Feature'].isin(scaled_coef.index)
comparison_df['Scaled Coefficient'] = comparison_df['Original Feature'].map(scaled_coef).fillna(0.0)

# Unscaled coefficient = scaled coefficient / standard deviation of the feature
comparison_df['Unscaled Coefficient'] = comparison_df['Scaled Coefficient'] / feature_std

# Unscaled intercept = scaled intercept - sum(unscaled coefficient * feature mean)
unscaled_intercept = scaled_intercept - (comparison_df['Unscaled Coefficient'] * feature_mean).sum()

print("Comparison of Scaled vs. Unscaled Coefficients:\n")
print(comparison_df)

print(f"\nUnscaled Intercept: {unscaled_intercept:.4f}")
print(f"Scaled Intercept:   {scaled_intercept:.4f}")

Additionally, we can analyse the effect of a unit change in a feature. Because the features were scaled, a coefficient of the fitted model describes the effect of a one-standard-deviation change rather than a one-unit change in the original feature. Use the scaled and unscaled coefficients to find how a unit change in a feature affects the target.
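
The relationship between the two sets of coefficients can be written out as a short derivation. The symbols are introduced here for illustration: $x_j$ is a feature in original units, $\mu_j$ and $\sigma_j$ are the mean and standard deviation used by the scaler, and $b_0, b_j$ are the intercept and coefficients of the model fitted on scaled features.

$$
\hat{y} = b_0 + \sum_j b_j \, \frac{x_j - \mu_j}{\sigma_j}
        = \Big( b_0 - \sum_j \frac{b_j \, \mu_j}{\sigma_j} \Big) + \sum_j \frac{b_j}{\sigma_j} \, x_j
$$

Under these assumptions, a one-unit increase in $x_j$ changes the prediction by $b_j / \sigma_j$, while a one-standard-deviation increase changes it by $b_j$.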

In [None]:
# Analyze the effect of a unit change in a feature, say 'total_items'


feature_of_interest = 'total_items'

# Find the unscaled coefficient for 'total_items'
unscaled_coef_total_items = comparison_df.loc[comparison_df['Original Feature'] == feature_of_interest, 'Unscaled Coefficient'].values[0]

# Find the mean and standard deviation of 'total_items' from the training data (used for scaling)
mean_total_items = X_train['total_items'].mean()
std_total_items = X_train['total_items'].std()

# Calculate the effect of a one-unit change in 'total_items'
effect_of_one_unit_change = unscaled_coef_total_items

# Calculate the effect of a one-standard-deviation change in 'total_items'
# This is essentially what the scaled coefficient represents
effect_of_one_std_change = comparison_df.loc[comparison_df['Original Feature'] == feature_of_interest, 'Scaled Coefficient'].values[0]

print(f"Feature: {feature_of_interest}\n")
print(f"Unscaled Coefficient: {unscaled_coef_total_items:.4f}")
print(f"Scaled Coefficient:   {effect_of_one_std_change:.4f}\n")

print(f"Effect of a one-unit increase in '{feature_of_interest}': {effect_of_one_unit_change:.4f} minutes")
print(f"Effect of a one-standard-deviation increase in '{feature_of_interest}': {effect_of_one_std_change:.4f} minutes")

print(f"\nMean of '{feature_of_interest}': {mean_total_items:.4f}")
print(f"Standard Deviation of '{feature_of_interest}': {std_total_items:.4f}")

Note:
The coefficients on the original scale might differ greatly in magnitude from the scaled coefficients, but they both describe the same relationships between variables.

Interpretation is key: Focus on the direction and magnitude of the coefficients on the original scale to understand the impact of each variable on the response variable in the original units.

Include conclusions in your report document.

## Subjective Questions <font color = red>[20 marks]</font>

Answer the following questions only in the notebook. Include the visualisations/methodologies/insights/outcomes from all the above steps in your report.

#### Subjective Questions based on Assignment

##### **Question 1.** <font color = red>[2 marks]</font> <br>

Are there any categorical variables in the data? From your analysis of the categorical variables from the dataset, what could you infer about their effect on the dependent variable?

**Answer:**
>
Yes, there are categorical variables in the data. Based on our data preprocessing and exploratory data analysis, the categorical variables are:

- `store_primary_category`
- `order_protocol`
- `day_of_week`
- `hour`
- `is_weekend`

Inferences about their effect on the dependent variable (`time_taken`):

- **`store_primary_category`:** We used a count plot to visualize the distribution of this variable. Different categories of restaurants (e.g., "fast food," "dine-in") might have different preparation times, order volumes, and delivery complexities, all of which could affect delivery time. Further analysis (e.g., box plots of `time_taken` by `store_primary_category`) would give more insight.
- **`order_protocol`:** Again, a count plot showed the distribution. Different order protocols (e.g., orders placed directly through the app vs. orders placed via phone) could influence processing and dispatching efficiency, thus impacting delivery time.
- **`day_of_week` and `hour`:** We specifically plotted the distribution of `time_taken` by hour using a boxplot. This visualization showed how delivery times vary throughout the day. We can infer that there are peak hours where delivery times tend to be higher. Similarly, different days of the week may exhibit different order patterns and traffic conditions, influencing delivery times.
- **`is_weekend`:** This is a binary categorical variable we created. Weekends often have higher order volumes and potentially different driver availability, which can affect delivery times.



---



##### **Question 2.** <font color = red>[1 mark]</font> <br>
What does `test_size = 0.2` refer to during splitting the data into training and test sets?

**Answer:**
>
`test_size = 0.2` means that when splitting the dataset into training and test sets, 20% of the data will be allocated to the test set, and the remaining 80% will be used for the training set. This is a common practice in machine learning to evaluate the model's performance on unseen data.



---



##### **Question 3.** <font color = red>[1 mark]</font> <br>
Looking at the heatmap, which feature has the highest correlation with the target variable?

**Answer:**
>



---



##### **Question 4.** <font color = red>[2 marks]</font> <br>
What was your approach to detect the outliers? How did you address them?

**Answer:**
>
My approach to detecting the outliers was as follows:

- **Visualization using box plots:** I used box plots to visualize the distribution of the target variable (`time_taken`) and other numerical features. Box plots are effective in identifying potential outliers as they display the median, quartiles, and the range of the data, with outliers shown as points outside the whiskers.

To address the outliers, I applied a combination of techniques:

- **IQR-based filtering:** For some columns, I employed the Interquartile Range (IQR) method to filter out outliers. This involved calculating the first quartile (Q1), third quartile (Q3), and IQR. Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR were considered outliers and removed.
- **Capping:** For some other columns, I used capping to handle outliers. This involved setting upper and lower limits based on percentiles (e.g., 95th percentile) and replacing outlier values with these limits. This method preserves the data points but reduces the impact of extreme values.

The specific choice of method (IQR filtering or capping) varied depending on the column and the observed distribution of outliers.



---



##### **Question 5.** <font color = red>[2 marks]</font> <br>
Based on the final model, which are the top 3 features significantly affecting the delivery time?

**Answer:**
>



---



#### General Subjective Questions

##### **Question 6.** <font color = red>[3 marks]</font> <br>
Explain the linear regression algorithm in detail

**Answer:**
>




---



##### **Question 7.** <font color = red>[2 marks]</font> <br>
Explain the difference between simple linear regression and multiple linear regression

**Answer:**
>



---



##### **Question 8.** <font color = red>[2 marks]</font> <br>
What is the role of the cost function in linear regression, and how is it minimized?

**Answer:**
>




---



##### **Question 9.** <font color = red>[2 marks]</font> <br>
Explain the difference between overfitting and underfitting.



**Answer:**

>



---



##### **Question 10.** <font color = red>[3 marks]</font> <br>
How do residual plots help in diagnosing a linear regression model?

**Answer:**
>