# **Build End-to-End ML Pipeline for Truck Delay Classification**


The project addresses a critical challenge faced by the logistics industry. Delayed truck shipments not only result in increased operational costs but also impact customer satisfaction. Timely delivery of goods is essential to meet customer expectations and maintain the competitiveness of logistics companies.
By accurately predicting truck delays, logistics companies can:
* Improve operational efficiency by allocating resources more effectively.
* Enhance customer satisfaction by providing more reliable delivery schedules.
* Optimize route planning to reduce delays caused by traffic or adverse weather conditions.
* Reduce costs associated with delayed shipments, such as penalties or compensation to customers.

In this initial phase, we will use PostgreSQL and MySQL databases hosted on AWS RDS to store the data, perform data retrieval, and conduct basic exploratory data analysis (EDA). With the Hopsworks feature store, we will build a pipeline that covers data processing and feature engineering, and prepares the data for model building.

![image.png](https://images.pexels.com/photos/2199293/pexels-photo-2199293.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)


## **Approach**


* Introduction to End-to-End Pipelines:
  * Understanding the fundamental concepts and importance of end-to-end pipelines


* Database Setup:
  * Creating AWS RDS instances for MySQL and PostgreSQL
  * Setting up MySQL Workbench and pgAdmin4 for database management


* Data Analysis:
  * Performing data analysis using SQL on MySQL Workbench and pgAdmin4


* AWS SageMaker Setup


* Exploratory Data Analysis (EDA):
  * Conducting exploratory data analysis to understand essential features and the dataset's characteristics


* Feature Store:
  * Understanding the concept of a feature store and its significance in machine learning projects
  * Understanding how Hopsworks works to facilitate project creation and feature group management


* Data Retrieval from Feature Stores:
  * Fetching data from feature stores for further analysis


* Data Preprocessing and Feature Engineering


* Data Storage:
  * Storing the final engineered features in the feature store for easy access and consistency



## **Data Fetching**

## **Postgres to Python Connector**

In [None]:
!pip install psycopg2==2.9.7

In [None]:
# Import the psycopg2 library for PostgreSQL connection
import psycopg2

# Import the pandas library for data manipulation
import pandas as pd

# Establish a connection to the PostgreSQL database
postgres_connection = psycopg2.connect(
    user="postgres",             # PostgreSQL username
    password="your_password",    # Password for the database
    host="host_id.rds.amazonaws.com",   # Host ID of the RDS instance
    database="DB",               # Name of the database
    port="5432"                  # Port number for PostgreSQL
)
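
Hard-coding credentials in a notebook makes them easy to leak. A minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names (`PG_USER`, `PG_PASSWORD`, and so on) and the helper `db_config_from_env` are hypothetical and not part of the original project:

```python
import os


def db_config_from_env(prefix, default_port):
    """Build connection keyword arguments from environment variables.

    Hypothetical helper: it expects variables named <prefix>_USER,
    <prefix>_PASSWORD, <prefix>_HOST and <prefix>_DATABASE to be set.
    """
    return {
        "user": os.environ[f"{prefix}_USER"],
        "password": os.environ[f"{prefix}_PASSWORD"],
        "host": os.environ[f"{prefix}_HOST"],
        "database": os.environ[f"{prefix}_DATABASE"],
        # The port is optional and falls back to the given default
        "port": int(os.environ.get(f"{prefix}_PORT", default_port)),
    }


# Usage (requires the environment variables above to be set):
# postgres_connection = psycopg2.connect(**db_config_from_env("PG", 5432))
```

The same dictionary shape also matches the keyword arguments passed to `pymysql.connect` later in this notebook, so one helper could serve both connections.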


### Retrieve Routes details data from database
*   Database - Postgres
*   Table name - routes_details

In [None]:
# Read data from the "routes_details" table in the PostgreSQL database
routes_df = pd.read_sql("Select * from routes_details", postgres_connection)

# Display the first few rows of the routes dataframe
routes_df.head()


### Retrieve Route weather data from database
*   Database - Postgres
*   Table name - routes_weather

In [None]:
# Read data from the "routes_weather" table in the PostgreSQL database
route_weather = pd.read_sql("Select * from routes_weather", postgres_connection)

# Display the first few rows of the route weather dataframe
route_weather.head()


In [None]:
# Rename the column for consistency
route_weather=route_weather.rename(columns={'Date':'date'})

## MySQL to Python Connector

In [None]:
!pip install pymysql==1.1.0

In [None]:
# Import the pymysql library for MySQL connection
import pymysql

# Import the numpy library and alias it as np
import numpy as np

# Establish a connection to the MySQL database
mysql_connection = pymysql.connect(
     host = "host_id.rds.amazonaws.com",  # Host ID of the RDS instance
     user = "admin",                       # MySQL username
     password = "your_password",           # Password for the database
     database = "DB"                       # Name of the database
)


### Retrieve Driver details data from database
*   Database - MySQL
*   Table name - drivers_details

In [None]:
# Read data from the "drivers_details" table in the MySQL database
drivers_df = pd.read_sql("Select * from drivers_details", mysql_connection)

# Display the first two rows of the drivers dataframe
drivers_df.head(2)


### Retrieve Truck details data from database
*   Database - MySQL
*   Table name - truck_details

In [None]:
# Read data from the "truck_details" table in the MySQL database
trucks_df = pd.read_sql("Select * from truck_details", mysql_connection)

# Display the first few rows of the trucks dataframe
trucks_df.head()


### Retrieve Traffic details data from database
*   Database - MySQL
*   Table name - traffic_details

In [None]:
# Read data from the "traffic_details" table in the MySQL database
traffic_df = pd.read_sql("Select * from traffic_details", mysql_connection)

# Display the first few rows of the traffic dataframe
traffic_df.head()


### Retrieve Truck schedule data from database
*   Database - MySQL
*   Table name - truck_schedule_data

In [None]:
# Read data from the "truck_schedule_data" table in the MySQL database
schedule_df = pd.read_sql("Select * from truck_schedule_data", mysql_connection)

# Display the first few rows of the schedule dataframe
schedule_df.head()


### Retrieve City weather data from database
*   Database - MySQL
*   Table name - city_weather

In [None]:
# Read data from the "city_weather" table in the MySQL database
weather_df = pd.read_sql("Select * from city_weather", mysql_connection)

# Display the first few rows of the weather dataframe
weather_df.head()


## **Exploratory Data Analysis**
Exploratory Data Analysis, commonly known as EDA, is a technique to analyze the data with visuals. It involves using statistics and visual techniques to identify particular trends in data.

It is used to understand data patterns, spot anomalies, check assumptions, etc. The main purpose of EDA is to help look into the data before making any hypothesis about it.


In [None]:
# Import Libraries

# !pip install matplotlib==3.7.1
# !pip install seaborn==0.12.2
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 500)

In [None]:
# Change dates to datetime
weather_df['date'] = pd.to_datetime(weather_df['date'])
route_weather['date'] = pd.to_datetime(route_weather['date'])
traffic_df['date'] = pd.to_datetime(traffic_df['date'])
schedule_df['departure_date'] = pd.to_datetime(schedule_df['departure_date'])
schedule_df['estimated_arrival'] = pd.to_datetime(schedule_df['estimated_arrival'])

### **Driver's Data Analysis**

In [None]:
# Driver's data
drivers_df.head(2)

In [None]:
# Driver's data info
drivers_df.info()

Gender and driving styles have some missing values

Dtypes seem to be in order

In [None]:
# statistics of various columns
drivers_df.describe(include='all')

### **Distribution Plots**

Distribution plots are graphical representations that show the distribution of a set of numerical data. These plots are used to gain insight into the characteristics of the data, such as the central tendency, spread, and skewness. There are several types of distribution plots, including histograms, density plots, box plots, and violin plots.

* A histogram is a bar graph that represents the frequency distribution of a set of data. It shows how many data points fall into each range of values or bin. The bars in the histogram represent the frequency of data points within a given range, and the height of each bar represents the number of data points in that bin.

* A density plot is a smoothed representation of the distribution of the data, typically computed with kernel density estimation (KDE) directly from the data points rather than from binned counts. It shows the shape of the distribution and the relative density of the data at different values.

* A box plot, also known as a box-and-whisker plot, summarizes a distribution through its median, quartiles, and outliers in a compact format. The box represents the interquartile range (IQR), the range between the first and third quartiles. The whiskers usually extend to the most extreme data points within 1.5 times the IQR of the box, and any points beyond the whiskers are plotted individually as outliers.

* A violin plot combines a density plot and a box plot: a mirrored density curve is drawn along the value axis, usually with a miniature box plot or quartile markers inside it. It shows the central tendency, spread, and skewness of the data in a compact format.

Distribution plots are an important tool for exploratory data analysis. They help in understanding how the data is distributed, identifying patterns and outliers, and spotting potential issues such as non-normality.
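
As a rough illustration of what a histogram computes, the sketch below bins a small synthetic sample with NumPy. The sample values are made up for illustration and are not taken from the project data:

```python
import numpy as np

# Synthetic sample (illustrative only, not project data)
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=5, size=1000)

# Raw counts per bin: this is what the bar heights of a histogram show
counts, edges = np.histogram(values, bins=30)

# With density=True the bar areas sum to 1, which makes the histogram
# comparable to a density curve drawn on the same axes
density, _ = np.histogram(values, bins=30, density=True)
bin_widths = np.diff(edges)

print(counts.sum())                  # number of data points
print((density * bin_widths).sum())  # approximately 1.0
```

Thirty bins produce thirty-one bin edges, which is why `np.diff(edges)` yields one width per bar.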

In [None]:
# List of numerical columns to visualize
drivers_num_cols = ['age', 'experience', 'ratings', 'average_speed_mph']

# Loop through each numerical column and create histograms with KDE
for col in drivers_num_cols:
    plt.figure(figsize=(10, 5))

    # Create a histogram with KDE using seaborn
    sns.histplot(drivers_df[col], bins=30, kde=True)
    # Set the title
    plt.title(f'{col} distribution')
    # Set the label for the x-axis
    plt.xlabel(f'{col}')
    plt.show()

#### Age:

The distribution of drivers' ages is approximately normal, with the majority of drivers falling in the range of 45 to 50 years.

Recommendation: While it's true that older drivers may have more experience, they may also be more susceptible to fatigue or health-related issues. Assign routes based on the driver's experience level and health conditions. Longer and more complex routes may be more demanding and tiring, which may lead to delays.

Implement appropriate rest break policies and adhere to regulations regarding maximum working hours. Ensuring sufficient rest periods during and between routes can help prevent driver fatigue.

Provide drivers with training on proper driving techniques.

Consider rotating drivers on different types of routes to avoid monotony and reduce the impact of repetitive tasks.

#### Experience:

The distribution of driver experience is right-skewed, with most drivers having between 5 and 50 years of experience.

Recommendation: The company could analyze the performance metrics of drivers with different experience levels to identify if there are any correlations between experience and driving efficiency. This analysis can help optimize driver assignment to different types of routes.


#### Ratings:

The significant count of ratings below 5 out of 10 suggests that there is a considerable proportion of drivers with lower ratings.

Lower ratings may indicate potential issues with driver performance or customer satisfaction.

Recommendation: The company should investigate the reasons behind lower ratings and take necessary steps to address driver performance, provide additional training, or offer incentives to improve driver ratings and enhance overall customer satisfaction.


#### Average Speed:

The bimodal distribution of average speeds, peaking around 45 and 60 mph, suggests the presence of two distinct groups of drivers with different driving styles.

This bimodal pattern may indicate a split between drivers who adopt a more cautious driving style (lower average speed) and those who adopt a more aggressive driving style (higher average speed). Neither extreme is recommended.

Recommendation: The company can consider categorizing drivers based on their average speed behavior and analyze how different driving styles impact ETA, fuel efficiency, and safety.

This can inform training and driving style guidelines for drivers, leading to an overall improvement in delivery times.
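
One way to act on this recommendation is to bin drivers by average speed and compare a metric across the bins. The sketch below uses `pd.cut` on made-up rows. The cut point of 52.5 mph (roughly midway between the two peaks) and the group labels are assumptions for illustration, not values from the project:

```python
import pandas as pd

# Made-up rows for illustration only
sample_drivers = pd.DataFrame({
    "driver_id": ["d1", "d2", "d3", "d4"],
    "average_speed_mph": [44.0, 46.5, 59.0, 61.2],
    "ratings": [6, 7, 4, 5],
})

# Hypothetical cut point between the two speed peaks
speed_bins = [0, 52.5, float("inf")]
speed_labels = ["lower_speed", "higher_speed"]

sample_drivers["speed_group"] = pd.cut(
    sample_drivers["average_speed_mph"], bins=speed_bins, labels=speed_labels
)

# Compare a metric of interest across the two groups
group_summary = sample_drivers.groupby("speed_group", observed=True)["ratings"].mean()
print(group_summary)
```

The same grouping could be applied to delay rates or fuel efficiency once those columns are joined to the driver data.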

In [None]:
# Counts of gender
drivers_df['gender'].value_counts()

In [None]:
# Value counts of driving style
drivers_df['driving_style'].value_counts()

In [None]:
# Setting figure size
plt.figure(figsize=(10, 5))
# plotting scatter plot between ratings and average speed
sns.scatterplot(x='ratings', y='average_speed_mph', data=drivers_df)
plt.title('Ratings vs. Average Speed')
plt.xlabel('Ratings (out of 10)')
plt.ylabel('Average Speed (mph)')
plt.show()

No significant relationship - Ratings may be because of some other factors.
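
The visual impression from a scatter plot can be backed by a correlation coefficient. A minimal sketch on made-up numbers (not project data), using the Pearson correlation that pandas computes by default:

```python
import pandas as pd

# Made-up values chosen so that the two columns are uncorrelated
sample = pd.DataFrame({
    "ratings": [1, 2, 3, 4, 5],
    "average_speed_mph": [52, 51, 53, 51, 52],
})

# Pearson correlation: values near 0 indicate no linear relationship,
# values near +1 or -1 indicate a strong one
corr = sample["ratings"].corr(sample["average_speed_mph"])
print(round(corr, 3))
```

On the real data the same call would be `drivers_df['ratings'].corr(drivers_df['average_speed_mph'])`.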

### **Boxplots**
Boxplots, also known as box-and-whisker plots, are a type of graphical representation used to display the distribution, spread, and central tendency of a dataset. They provide a concise summary of the data's key statistical properties.

Components of a Boxplot
A boxplot consists of the following elements:

* Box: The box in the plot represents the interquartile range (IQR), which contains the middle 50% of the data. The lower edge of the box represents the first quartile (Q1), and the upper edge represents the third quartile (Q3).

* Whiskers: The whiskers extend from the box to the smallest and largest values that still lie within a set distance of it, commonly 1.5 times the IQR beyond Q1 and Q3.

* Median Line: A line inside the box represents the median (Q2) of the dataset.

* Outliers: Individual data points that fall outside the whiskers are considered outliers. They are plotted as individual points beyond the whiskers.

Interpreting a Boxplot

* Central Tendency: The median line gives the central value of the data.

* Spread: The length of the box (IQR) indicates the spread of the central 50% of the data.

* Skewness: Asymmetry in the data can be observed by comparing the lengths of the whiskers. If one whisker is longer than the other, it suggests skewness.

* Outliers: Outliers are points outside the whiskers. They can provide insights into anomalies or extreme values in the data.

Use Cases of Boxplots

* Identifying Outliers: Boxplots are useful for identifying outliers in a dataset.

* Comparing Distributions: They allow for quick visual comparison of the distribution of multiple datasets.

* Summarizing Data

* Detecting Skewness: They can reveal whether a dataset is symmetric or skewed.
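
The quantities a boxplot draws can be computed directly. The sketch below does so with NumPy on a small made-up array, assuming the common 1.5 × IQR whisker rule:

```python
import numpy as np

# Made-up values with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range (the box)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high)
print(outliers)
```

Here the upper whisker stops at 9 rather than at the maximum, and 100 is reported as the only outlier.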



In [None]:
# Boxplot between gender and ratings
sns.boxplot(x='gender', y='ratings', data=drivers_df, palette='Set2')
plt.title('Driver Ratings by Gender')
plt.xlabel('Gender')
plt.ylabel('Ratings (out of 10)')
plt.show()

Median ratings are similar for both genders.

### **Truck's Data Analysis**

In [None]:
# Truck data head
trucks_df.head()

In [None]:
# Info
trucks_df.info()

Load capacity and fuel type have missing values.

In [None]:
# statistics of various columns
trucks_df.describe(include='all')

In [None]:
# Numerical cols in truck's dataset
truck_num_cols = ['truck_age', 'load_capacity_pounds', 'mileage_mpg']

# plotting histogram for each column
for col in truck_num_cols:
  plt.figure(figsize=(10, 5))
  sns.histplot(trucks_df[col], bins=30, kde=True)
  plt.title(f'{col} distribution')
  plt.xlabel(f'{col}')
  plt.show()

Analyze the distribution of truck ages and identify older trucks that might be approaching the end of their useful life.

Identify trucks with significantly lower fuel efficiency. Implement strategies to improve fuel efficiency, such as regular maintenance, driver training, and adopting fuel-saving technologies.

Consider replacing trucks that have low mileage (poor fuel efficiency) with more fuel-efficient models.

In [None]:
# Based on the histogram, treat 15 mpg or less as low mileage
low_mileage_threshold = 15

# Filter trucks with low mileage
low_mileage_trucks = trucks_df[trucks_df['mileage_mpg'] <= low_mileage_threshold]

In [None]:
# overview of data of low mileage trucks
low_mileage_trucks.head()

In [None]:
# Age distribution of low mileage trucks
plt.figure(figsize=(10, 5))
sns.histplot(low_mileage_trucks['truck_age'], bins=30, kde=True)
plt.title("Low Mileage Trucks' Age Distribution")
plt.xlabel("Age")
plt.show()

Trucks more than 8 years old tend to have low mileage.
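
To put a number on this observation, the share of low-mileage trucks can be compared across age groups. A minimal sketch on made-up rows (not project data), reusing the 15 mpg threshold and the 8-year split from the text. The `age_group` labels are hypothetical:

```python
import pandas as pd

# Made-up rows for illustration only
sample_trucks = pd.DataFrame({
    "truck_age": [3, 5, 7, 9, 10, 12],
    "mileage_mpg": [24, 22, 19, 15, 14, 13],
})

low_mileage_threshold = 15

# Flag low-mileage trucks and assign each truck to an age group
sample_trucks["low_mileage"] = sample_trucks["mileage_mpg"] <= low_mileage_threshold
sample_trucks["age_group"] = sample_trucks["truck_age"].map(
    lambda age: "over_8_years" if age > 8 else "8_years_or_less"
)

# The mean of a boolean column is the share of True values in each group
low_mileage_share = sample_trucks.groupby("age_group")["low_mileage"].mean()
print(low_mileage_share)
```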

### **Routes Data Analysis**

In [None]:
# Display the first rows
routes_df.head()

In [None]:
# Information on dataframe
routes_df.info()

### **Traffic Data Analysis**

In [None]:
# Traffic data head
traffic_df.head()

In [None]:
# Info
traffic_df.info()

In [None]:
# Sum of null values
traffic_df.isnull().sum()

There are 1152 null values

In [None]:
# statistical description
traffic_df.describe()

In [None]:
def categorize_time(hour):
    """
    Categorizes times of the day into time periods.

    Args:
    hour (int): Hour in 24-hour format encoded as HHMM (e.g., 600 for 6 AM, 1300 for 1 PM).

    Returns:
    str: Categorized time period.
    """
    if 300 <= hour < 600:
        return 'Early Morning'
    elif 600 <= hour < 1200:
        return 'Morning'
    elif 1200 <= hour < 1600:
        return 'Noon'
    elif 1600 <= hour < 2000:
        return 'Evening'
    # Night wraps around midnight: 2000-2359 and 0000-0259
    elif 2000 <= hour < 2400 or 0 <= hour < 300:
        return 'Night'

# Create a copy of traffic_df
traffic = traffic_df.copy()

# Apply the categorize_time function to create a new column 'time_category'
traffic['time_category'] = traffic['hour'].apply(categorize_time)

# Group by 'time_category' and calculate the mean of 'no_of_vehicles'
mean_vehicles_by_time = traffic.groupby('time_category')['no_of_vehicles'].mean()


In [None]:
# print
mean_vehicles_by_time

Evening experiences the highest traffic with an average of about 2006 vehicles, likely due to rush hour.

Noon follows closely with approximately 1995 vehicles, reflecting continued high traffic for various activities during the day.

Morning has around 1738 vehicles, indicating significant traffic during the morning commute.

Early Morning sees the least traffic with around 562 vehicles, as people are still at home or starting their day.

Nighttime has around 1263 vehicles, showing lighter traffic due to reduced commuting and increased resting hours.

Other features relevant to delays could include weather conditions, day of the week, holidays, and special events.
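
Day-of-week information can be derived from a datetime column with the pandas `.dt` accessor. A minimal sketch on made-up rows; the column names `day_of_week` and `is_weekend` are hypothetical additions, not columns from the project tables:

```python
import pandas as pd

# Made-up rows for illustration only
sample_traffic = pd.DataFrame({
    "route_id": ["R-1", "R-1"],
    "date": pd.to_datetime(["2023-08-23", "2023-08-26"]),
})

# Name of the weekday for each observation
sample_traffic["day_of_week"] = sample_traffic["date"].dt.day_name()

# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday
sample_traffic["is_weekend"] = sample_traffic["date"].dt.dayofweek >= 5

print(sample_traffic)
```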

## **Feature Store**

A feature store is a crucial component in the field of machine learning and data science. It serves as a centralized repository for storing, managing, and serving features used in machine learning models. Features, in this context, refer to the variables or attributes that are used to make predictions or classifications in a model.

Why a feature store is needed:

* Consistency and Reproducibility
* Collaboration and Knowledge Sharing
* Data Quality and Monitoring
* Time and Cost saving

Benefits of a feature store:
* Data Centralization and Organization: In complex organizations, data is often scattered across different teams, departments, and systems. A feature store centralizes the storage of features, making it easier to manage and access them.

* Feature Versioning and Lineage: Keeping track of different versions of features is essential for reproducibility and debugging in machine learning workflows. A feature store maintains a history of features, allowing teams to trace back to specific data points.

* Consistency Across Models: Different teams and models within an organization may use similar or overlapping sets of features. A feature store ensures that these features are consistently engineered and used across different projects, leading to more reliable results.

* Data Quality Assurance: Feature stores often include mechanisms to monitor and validate the quality of features. This ensures that features used for training models are of high quality, reducing the risk of erroneous predictions.

* Efficient Data Access: Feature stores are optimized for efficient access to features. This is particularly important when dealing with large datasets, as it reduces the time and resources required to retrieve relevant information for model training or prediction.

* Reduced Redundancy: Without a feature store, teams may duplicate efforts in feature engineering for different models or projects. A feature store reduces redundancy in data processing and engineering tasks, saving time and resources.

* Scalability and Performance: A well-designed feature store is capable of handling large volumes of data and serving features efficiently. This is crucial for organizations dealing with big data and requiring real-time or batch processing capabilities.

* Integration with ML Platforms: Feature stores seamlessly integrate with popular machine learning platforms and frameworks. This ensures that features can be easily incorporated into the end-to-end machine learning pipeline, from data preprocessing to model deployment.

* Metadata and Descriptive Information: Feature stores store metadata and descriptive information about features. This includes data types, units, and descriptions, which are crucial for understanding the meaning and context of each feature, especially in collaborative environments.
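
One idea behind keeping a history of features is that a model should only see feature values that were known at the time of each event. The sketch below imitates such a point-in-time lookup with plain pandas (`pd.merge_asof`). It is not how Hopsworks implements it, and all rows and the `avg_vehicles` column are made up for illustration:

```python
import pandas as pd

# Made-up feature history: one row per route and event time
features = pd.DataFrame({
    "route_id": ["R1", "R2", "R1"],
    "event_time": pd.to_datetime(["2023-08-01", "2023-08-05", "2023-08-10"]),
    "avg_vehicles": [1200, 800, 1500],
})

# Made-up events for which features are needed
trips = pd.DataFrame({
    "route_id": ["R1", "R2", "R1"],
    "departure_date": pd.to_datetime(["2023-08-05", "2023-08-06", "2023-08-12"]),
})

# For each trip, take the latest feature row of the same route whose
# event_time is not after the departure date
joined = pd.merge_asof(
    trips.sort_values("departure_date"),
    features.sort_values("event_time"),
    left_on="departure_date",
    right_on="event_time",
    by="route_id",
    direction="backward",
)
print(joined)
```

The first R1 trip picks up the value recorded on 2023-08-01, while the later R1 trip picks up the updated value from 2023-08-10.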


### **Hopsworks**

Description: Hopsworks is an open-source platform for data-intensive AI and machine learning. It includes a feature store component that allows users to store and manage features for their machine learning models.

Key Features:
* Supports both online and batch serving of features.
* Integration with popular machine learning platforms and tools.
* Versioning and lineage tracking of features.

For more information, check out: https://www.hopsworks.ai/

In [None]:
!pip install -U hopsworks==3.2.0

In [None]:
# Import the necessary library
import hopsworks

# Log in to the Hopsworks project
project = hopsworks.login()

# Get the feature store associated with the project
fs = project.get_feature_store()


### **Driver Data Feature Store**

In [None]:
# Display the first two rows of the drivers DataFrame
drivers_df.head(2)

In [None]:
# Display information about the drivers DataFrame (e.g., column names, data types)
drivers_df.info()

In [None]:
drivers_df['event_time'] = pd.to_datetime('2023-08-23')

In [None]:
drivers_df.isna().sum()

In [None]:
# Filling the null values with Unknown
drivers_df['driving_style']=drivers_df['driving_style'].fillna('Unknown')
drivers_df['gender']=drivers_df['gender'].fillna('Unknown')

In [None]:
drivers_df.columns

In [None]:
# Create feature group for drivers details
drivers_fg = fs.get_or_create_feature_group(
    name="drivers_details_fg",                # Name of the feature group
    version=1,                                # Version number
    description="Drivers data",               # Description of the feature group
    primary_key=['driver_id'],                # Primary key(s) for the feature group
    event_time='event_time',                  # Event time column
    online_enabled=False                      # Online feature store capability
)

# Insert the drivers DataFrame into the feature group
drivers_fg.insert(drivers_df)

In [None]:
# Sort values
drivers_df=drivers_df.sort_values(["event_time","driver_id"])

In [None]:
# List of feature descriptions for drivers
feature_descriptions_drivers = [

    {"name": "driver_id", "description": "unique identification for each driver"},
    {"name": "name", "description": "name of the truck driver"},
    {"name": "gender", "description": "gender of the truck driver"},
    {"name": "age", "description": "age of the truck driver"},
    {"name": "experience", "description": "experience of the truck driver in years"},
    {"name": "driving_style", "description": "driving style of the truck driver, conservative or proactive"},
    {"name": "ratings", "description": "average rating of the truck driver on a scale of 1 to 5"},
    {"name": "vehicle_no", "description": "the number of the driver’s truck"},
    {"name": "average_speed_mph", "description": "average speed of the truck driver in miles per hour"},
    {"name": "event_time", "description": "dummy event time"}

]

# Iterate through the feature descriptions and update them in the feature group
for desc in feature_descriptions_drivers:
    drivers_fg.update_feature_description(desc["name"], desc["description"])


In [None]:
# Configure statistics for the feature group
drivers_fg.statistics_config = {
    "enabled": True,        # Enable statistics calculation
    "histograms": True,     # Include histograms in the statistics
    "correlations": True    # Include correlations in the statistics
}

# Update the statistics configuration for the feature group
drivers_fg.update_statistics_config()

# Compute statistics for the feature group
drivers_fg.compute_statistics()
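
Once data has been inserted, a feature group can be read back into a DataFrame. A minimal sketch, assuming `fs` is the feature store handle obtained from `project.get_feature_store()`. The wrapper `read_feature_group` is a hypothetical convenience function, not part of the Hopsworks API:

```python
def read_feature_group(fs, name, version=1):
    """Fetch a feature group by name and version and load it as a DataFrame."""
    feature_group = fs.get_feature_group(name, version=version)
    return feature_group.read()


# Usage, assuming `fs` is the feature store handle created earlier:
# drivers_from_fs = read_feature_group(fs, "drivers_details_fg")
# drivers_from_fs.head()
```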


### **Truck Data Feature Store**

In [None]:
# Displaying head of the data
trucks_df.head()

In [None]:
# Displaying information
trucks_df.info()

In [None]:
# Sum of null values
trucks_df.isna().sum()

In [None]:
trucks_df['fuel_type'].unique()

In [None]:
# Replace empty-string fuel types with 'Unknown'
trucks_df['fuel_type']=trucks_df['fuel_type'].replace("",'Unknown')



In [None]:
trucks_df['fuel_type'].value_counts()

In [None]:
trucks_df['event_time'] = pd.to_datetime('2023-08-23')

trucks_df=trucks_df.sort_values(["event_time","truck_id"])

In [None]:
# Create a feature group for truck details
truck_fg = fs.get_or_create_feature_group(
    name="truck_details_fg",          # Name of the feature group
    version=1,                        # Version number
    description="Truck data",         # Description of the feature group
    primary_key=['truck_id'],         # Primary key(s) for the feature group
    event_time='event_time',          # Event time column
    online_enabled=False              # Online feature store capability (set to False)
)


In [None]:
truck_fg.insert(trucks_df)

In [None]:
# Add feature descriptions

feature_descriptions_trucks = [
    {"name":'truck_id',"description":"the unique identification number of the truck"},
    {"name":'truck_age',"description":"age of the truck in years"},
    {"name":'load_capacity_pounds',"description":"loading capacity of the truck in pounds"},
    {"name":'mileage_mpg',"description": "mileage of the truck in miles per gallon"},
    {"name":'fuel_type',"description":"fuel type of the truck"},
    {"name": "event_time", "description": "dummy event time"}

]

for desc in feature_descriptions_trucks:
    truck_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
truck_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

truck_fg.update_statistics_config()
truck_fg.compute_statistics()

### **Routes data Feature Store**

In [None]:
# Display the head
routes_df.head()

In [None]:
# Routes Information
routes_df.info()

In [None]:
# Sum of null values
routes_df.isna().sum()

In [None]:
routes_df['event_time'] = pd.to_datetime('2023-08-23')

routes_df=routes_df.sort_values(["event_time","route_id"])

In [None]:
# Create feature group for route details
routes_fg = fs.get_or_create_feature_group(
    name="routes_details_fg",         # Name of the feature group
    version=1,                        # Version number
    description="Routes data",        # Description of the feature group
    primary_key=['route_id'],         # Primary key(s) for the feature group
    event_time='event_time',          # Event time column
    online_enabled=False              # Online feature store capability (set to False)
)


In [None]:
routes_fg.insert(routes_df)

In [None]:
# Add feature descriptions

feature_descriptions_routes = [
    {"name": 'route_id', "description": "the unique identifier of the routes"},
    {"name": 'origin_id', "description": "the city identification number for the origin city"},
    {"name": 'destination_id', "description": "the city identification number for the destination"},
    {"name": 'distance', "description": "the distance between the origin and destination cities in miles"},
    {"name": 'average_hours', "description": "average time needed to travel from the origin to the destination in hours"},
    {"name": "event_time", "description": "dummy event time"}

]

for desc in feature_descriptions_routes:
    routes_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
routes_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

routes_fg.update_statistics_config()
routes_fg.compute_statistics()
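
The create, insert, describe, and statistics steps are repeated for every table. A minimal sketch of a helper that bundles them, using only the calls already shown in this notebook. The function name `publish_feature_group` and its parameters are hypothetical:

```python
def publish_feature_group(fs, df, name, description, primary_key,
                          event_time, feature_descriptions,
                          online_enabled=False, version=1):
    """Create (or fetch) a feature group, insert data, and document it."""
    feature_group = fs.get_or_create_feature_group(
        name=name,
        version=version,
        description=description,
        primary_key=primary_key,
        event_time=event_time,
        online_enabled=online_enabled,
    )

    # Upload the DataFrame to the feature group
    feature_group.insert(df)

    # Attach a human-readable description to each feature
    for desc in feature_descriptions:
        feature_group.update_feature_description(desc["name"], desc["description"])

    # Enable and compute statistics, as done for each group in this notebook
    feature_group.statistics_config = {
        "enabled": True,
        "histograms": True,
        "correlations": True,
    }
    feature_group.update_statistics_config()
    feature_group.compute_statistics()
    return feature_group


# Usage, assuming `fs`, `routes_df` and `feature_descriptions_routes` exist:
# routes_fg = publish_feature_group(
#     fs, routes_df, "routes_details_fg", "Routes data",
#     primary_key=["route_id"], event_time="event_time",
#     feature_descriptions=feature_descriptions_routes,
# )
```

Keeping the steps in one function makes it harder to forget a step, such as the feature descriptions, for one of the tables.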

### **Truck Schedule Data Feature Store**

In [None]:
# Display the head
schedule_df.head()

In [None]:
# Display data information
schedule_df.info()

In [None]:
# Sum of null values
schedule_df.isna().sum()

In [None]:
# sorting
schedule_df=schedule_df.sort_values(["estimated_arrival","truck_id"])

In [None]:
# Create  feature group for truck schedule details
truck_schedule_fg = fs.get_or_create_feature_group(
    name="truck_schedule_details_fg",  # Name of the feature group
    version=1,                          # Version number
    description="Truck Schedule data",  # Description of the feature group
    primary_key=['truck_id','route_id'], # Primary key(s) for the feature group
    event_time='estimated_arrival',     # Event time column
    online_enabled=True                  # Online feature store capability (set to True)
)


In [None]:
truck_schedule_fg.insert(schedule_df)

In [None]:
# Add feature descriptions
feature_descriptions_schedule = [
    {"name": 'truck_id', "description": "the unique identifier of the truck"},
    {"name": 'route_id', "description": "the unique identifier of the route"},
    {"name": 'departure_date', "description": "departure DateTime of the truck"},
    {"name": 'estimated_arrival', "description": "estimated arrival DateTime of the truck"},
    {"name": 'delay', "description": "binary variable if the truck’s arrival was delayed, 0 for on-time arrival and 1 for delayed arrival"},
]

for desc in feature_descriptions_schedule:
    truck_schedule_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
truck_schedule_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

truck_schedule_fg.update_statistics_config()
truck_schedule_fg.compute_statistics()

### **Traffic Feature Store**

In [None]:
traffic_df.head()

In [None]:
traffic_df.info()

In [None]:
traffic_df.isna().sum()

In [None]:
traffic_df=traffic_df.sort_values(['date','route_id','hour'])

In [None]:
traffic_fg = fs.get_or_create_feature_group(
    name="traffic_details_fg",
    version=1,
    description="Traffic data",
    primary_key=['route_id','hour'],
    event_time='date',
    online_enabled=True
)

In [None]:
traffic_fg.insert(traffic_df)

In [None]:
feature_descriptions_traffic = [
     {"name": 'route_id', "description": "the identification number of the route"},
     {"name": 'date', "description": "date of the traffic observation"},
     {"name": 'hour', "description": "the hour of the observation as a number in 24-hour format"},
     {"name": 'no_of_vehicles', "description": "the number of vehicles observed on the route"},
     {"name": 'accident', "description": "binary variable to denote if an accident was observed"}

]

for desc in feature_descriptions_traffic:
    traffic_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
traffic_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

traffic_fg.update_statistics_config()
traffic_fg.compute_statistics()

### **City Weather Feature Store**

In [None]:
weather_df.head()

In [None]:
weather_df.info()

In [None]:
weather_df.isna().sum()

In [None]:
weather_df=weather_df.sort_values(['date','city_id','hour'])

In [None]:
city_weather_fg = fs.get_or_create_feature_group(
    name="city_weather_details_fg",
    version=1,
    description="City Weather data",
    primary_key=['city_id','hour'],
    event_time='date',
    online_enabled=True
)

In [None]:
city_weather_fg.insert(weather_df)

In [None]:
feature_descriptions_weather = [
    {"name": 'city_id', "description":  'the unique identifier of the city'},
    {"name": 'date', "description":  'date of the observation'},
    {"name": 'hour', "description": 'the hour of the observation as a number in 24-hour format'},
    {"name": 'temp', "description":  'temperature in Fahrenheit'},
    {"name": 'wind_speed', "description":  'wind speed in miles per hour'},
    {"name": 'description', "description":  'description of the weather conditions such as Clear, Cloudy, etc'},
    {"name": 'precip', "description":  'precipitation in inches'},
    {"name": 'humidity', "description":  'humidity observed'},
    {"name": 'visibility', "description":  'visibility observed in miles'},
    {"name": 'pressure', "description":  'pressure observed in millibar'},
    {"name": 'chanceofrain', "description":  'chances of rain'},
    {"name": 'chanceoffog', "description":  'chances of fog'},
    {"name": 'chanceofsnow', "description":  'chances of snow'},
    {"name": 'chanceofthunder', "description":  'chances of thunder'}

]

for desc in feature_descriptions_weather:
    city_weather_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
city_weather_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

city_weather_fg.update_statistics_config()
city_weather_fg.compute_statistics()

### **Route Weather Feature Store**

In [None]:
route_weather.head()

In [None]:
route_weather.info()

In [None]:
route_weather.isna().sum()

In [None]:
route_weather=route_weather.sort_values(by=['date','route_id'])

In [None]:
route_weather_fg = fs.get_or_create_feature_group(
    name="route_weather_details_fg",
    version=1,
    description="Route Weather data",
    primary_key=['route_id'],
    event_time='date',
    online_enabled=True
)

In [None]:
route_weather_fg.insert(route_weather)

In [None]:
feature_descriptions_route_weather = [

    {"name": 'route_id', "description":  'the unique identifier of the route'},
    {"name": 'date', "description":  'date of the observation'},
    {"name": 'temp', "description":  'temperature in Fahrenheit'},
    {"name": 'wind_speed', "description":  'wind speed in miles per hour'},
    {"name": 'description', "description":  'description of the weather conditions such as Clear, Cloudy, etc'},
    {"name": 'precip', "description":  'precipitation in inches'},
    {"name": 'humidity', "description":  'humidity observed'},
    {"name": 'visibility', "description":  'visibility observed in miles'},
    {"name": 'pressure', "description":  'pressure observed in millibar'},
    {"name": 'chanceofrain', "description":  'chances of rain'},
    {"name": 'chanceoffog', "description":  'chances of fog'},
    {"name": 'chanceofsnow', "description":  'chances of snow'},
    {"name": 'chanceofthunder', "description":  'chances of thunder'}

]

for desc in feature_descriptions_route_weather:
    route_weather_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
route_weather_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

route_weather_fg.update_statistics_config()
route_weather_fg.compute_statistics()

## **Fetch data from Feature Store**

In [None]:
routes_df_fg = fs.get_feature_group('routes_details_fg', version=1)
query = routes_df_fg.select_all()
routes_df=query.read()

In [None]:
route_weather_fg = fs.get_feature_group('route_weather_details_fg', version=1)
query = route_weather_fg.select_all()
route_weather=query.read()

In [None]:
drivers_df_fg = fs.get_feature_group('drivers_details_fg', version=1)
query = drivers_df_fg.select_all()
drivers_df=query.read()

In [None]:
trucks_df_fg = fs.get_feature_group('truck_details_fg', version=1)
query = trucks_df_fg.select_all()
trucks_df=query.read()

In [None]:
traffic_df_fg = fs.get_feature_group('traffic_details_fg', version=1)
query = traffic_df_fg.select_all()
traffic_df=query.read()

In [None]:
schedule_df_fg = fs.get_feature_group('truck_schedule_details_fg', version=1)
query = schedule_df_fg.select_all()
schedule_df=query.read()

In [None]:
weather_df_fg = fs.get_feature_group('city_weather_details_fg', version=1)
query = weather_df_fg.select_all()
weather_df=query.read()

## **Data Preprocessing**

### **Data Preprocessing and Leakage**

Data leakage occurs when information from the test or prediction data is inadvertently used while training a machine learning model. The model then appears to perform better than it actually would on unseen data.

Leakage often arises during preprocessing, when statistics computed from the test or prediction data are used to transform the training data.

For example, consider imputing missing values. If they are filled with the mean or median of the entire dataset, including the test and prediction data, the imputed values in the training data are influenced by data the model should never see. The evaluation then becomes overly optimistic, and the model may generalize poorly.


To avoid data leakage, fit each preprocessing step (for example, compute the imputation value) on the training data only, and then apply that fitted step unchanged to the test and prediction data. This keeps the test and prediction data unseen during training and gives a more honest estimate of the model's performance.

In this project, we perform the preprocessing steps on the full dataset for the sake of simplicity, which could introduce data leakage. In real-world scenarios, split the data first and fit the preprocessing on the training portion only.
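
As a minimal sketch of the leakage-safe pattern described above, the snippet below splits a toy frame first and learns the imputation value from the training rows only. The toy values, the `temp` column, and the split point are hypothetical and are not taken from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the project tables are not used here.
toy = pd.DataFrame({"temp": [50.0, np.nan, 70.0, 90.0, np.nan, 10.0]})

# Split first, so that test rows cannot influence any statistic.
train = toy.iloc[:4].copy()
test = toy.iloc[4:].copy()

# Learn the imputation value from the training rows only.
train_median = train["temp"].median()

# For contrast: the median of the full dataset also "sees" the test rows.
leaky_median = toy["temp"].median()

# Apply the value learned on the training split to both splits.
train["temp"] = train["temp"].fillna(train_median)
test["temp"] = test["temp"].fillna(train_median)

print(train_median, leaky_median)
```

The two medians differ because the full-dataset statistic is pulled toward a value that only exists in the test split, which is exactly the influence we want to keep out of training.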

### **Missing Value Detection and Imputation**

Real-world datasets are rarely clean, and one of the most common problems is missing values.

Missing values can be imputed with a constant value or with a statistic (mean, median, or most frequent value) of the column in which they occur.
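
The strategies listed above can be sketched with plain pandas as follows. The column names mirror the trucks table, but the values are hypothetical toy data, and which strategy suits which column remains a modelling decision.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "load_capacity_pounds": [3000.0, np.nan, 3000.0, 10000.0],
    "fuel_type": ["diesel", None, "gas", "diesel"],
})

# Constant value: useful when "missing" is itself informative.
constant_filled = toy["fuel_type"].fillna("Unknown")

# Mean and median: for numeric columns (the median is less sensitive to outliers).
mean_filled = toy["load_capacity_pounds"].fillna(toy["load_capacity_pounds"].mean())
median_filled = toy["load_capacity_pounds"].fillna(toy["load_capacity_pounds"].median())

# Most frequent value: works for categorical columns as well.
mode_filled = toy["fuel_type"].fillna(toy["fuel_type"].mode().iloc[0])
```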

In [None]:
drivers_df.head(2)

In [None]:
drivers_df=drivers_df.drop(columns=['event_time'])

In [None]:
# Check the null values
drivers_df.isna().sum()

In [None]:
# Duplicates in drivers data
drivers_df[drivers_df.duplicated(subset=['driver_id'])]

In [None]:
# Trucks data
trucks_df.head(2)

In [None]:
trucks_df=trucks_df.drop(columns=['event_time'])

In [None]:
# Check null values
trucks_df.isna().sum()

In [None]:
# Checking the different load capacities
trucks_df['load_capacity_pounds'].unique()

In [None]:
# Most common value
trucks_df['load_capacity_pounds'].mode()

In [None]:
#check null values
trucks_df.isna().sum()

In [None]:
# Check for duplicates
trucks_df[trucks_df.duplicated(subset=['truck_id'])]

In [None]:
# Routes data
routes_df.head(2)

In [None]:
routes_df=routes_df.drop(columns=['event_time'])

In [None]:
# Sum of null values
routes_df.isna().sum()

In [None]:
# check duplicates
routes_df[routes_df.duplicated(subset=['route_id'])]

In [None]:
# check duplicates across origin and destination
routes_df[routes_df.duplicated(subset=['route_id','destination_id','origin_id'])]

In [None]:
schedule_df.head(2)

In [None]:
# sum of null values in schedule
schedule_df.isna().sum()

In [None]:
# check for duplicates
schedule_df[schedule_df.duplicated()]

In [None]:
weather_df.head(2)

In [None]:
# statistical description
weather_df.describe()

In [None]:
# check for duplicates
weather_df[weather_df.duplicated(subset=['city_id','date','hour'])]

In [None]:
# drop duplicates
weather_df=weather_df.drop_duplicates(subset=['city_id','date','hour'])

In [None]:
# drop unnecessary cols
weather_df=weather_df.drop(columns=['chanceofrain','chanceoffog','chanceofsnow','chanceofthunder'])

In [None]:
# Convert 'hour' to a 4-digit string format
weather_df['hour'] = weather_df['hour'].apply(lambda x: f'{x:04d}')

# Convert 'hour' to datetime format
weather_df['hour'] = pd.to_datetime(weather_df['hour'], format='%H%M').dt.time

# Combine 'date' and 'hour' to create a new datetime column 'custom_date' and insert it at index 1
weather_date_val = pd.to_datetime(weather_df['date'].astype(str) + ' ' + weather_df['hour'].astype(str))
weather_df.insert(1, 'custom_date', weather_date_val)
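
To make the conversion above concrete, here is a small sketch on toy data. It assumes, as the `%H%M` format suggests, that `hour` is stored as an integer such as 0, 100, or 2300; the dates and values are hypothetical. It builds the same kind of timestamp with timedelta arithmetic instead of string parsing, which is an alternative to the approach used in the notebook, not a description of it.

```python
import pandas as pd

# Hypothetical toy data; 'hour' is assumed to be encoded as HMM integers.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02"]),
    "hour": [0, 100, 2300],
})

# Zero-padding shows the intended reading of each value as HHMM.
padded = toy["hour"].apply(lambda x: f"{x:04d}")

# Split the integer into hours and minutes and add both to the date.
hours = toy["hour"] // 100
minutes = toy["hour"] % 100
toy["custom_date"] = (
    toy["date"]
    + pd.to_timedelta(hours, unit="h")
    + pd.to_timedelta(minutes, unit="min")
)
```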


In [None]:
weather_df.head(2)

In [None]:
weather_df.describe()

In [None]:
# Route weather data
route_weather.head(2)

In [None]:
route_weather.describe()

In [None]:
# check for duplicates
route_weather[route_weather.duplicated(subset=['route_id','date'])]

In [None]:
# Drop unnecessary cols
route_weather=route_weather.drop(columns=['chanceofrain','chanceoffog','chanceofsnow','chanceofthunder'])

In [None]:
route_weather.isna().sum()

In [None]:
traffic_df.head(2)

In [None]:
traffic_df[traffic_df.duplicated(subset=['route_id','date','hour'])]

In [None]:
traffic_df=traffic_df.drop_duplicates(subset=['route_id','date','hour'],keep='first')

In [None]:
traffic_df.isna().sum()

In [None]:
# Convert 'hour' to a 4-digit string format
traffic_df['hour'] = traffic_df['hour'].apply(lambda x: f'{x:04d}')

# Convert 'hour' to datetime format
traffic_df['hour'] = pd.to_datetime(traffic_df['hour'], format='%H%M').dt.time

# Combine 'date' and 'hour' to create a new datetime column 'custom_date' and insert it at index 1
traffic_custom_date = pd.to_datetime(traffic_df['date'].astype(str) + ' ' + traffic_df['hour'].astype(str))
traffic_df.insert(1, 'custom_date', traffic_custom_date)


In [None]:
traffic_df.head(5)

In [None]:
schedule_df.head(2)

In [None]:
schedule_df.isna().sum()

In [None]:
schedule_df.describe(include='all')

In [None]:
schedule_df[schedule_df.duplicated(subset=['truck_id','route_id','departure_date'])]

## **Feature Engineering**

Feature engineering is a crucial step in the machine learning pipeline where we transform and create new features from the existing data. This process aims to provide the machine learning model with the most relevant and informative input variables to make accurate predictions or classifications.

Merge Route Weather with Schedule Data

In [None]:
schedule_df.insert(0,'unique_id',np.arange(len(schedule_df)))

In [None]:
nearest_6h_schedule_df=schedule_df.copy()

In [None]:
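# Round arrival up and departure down to the nearest 6-hour boundary
# (lowercase "h" avoids the deprecation warning for "H" in recent pandas versions)
nearest_6h_schedule_df['estimated_arrival']=nearest_6h_schedule_df['estimated_arrival'].dt.ceil("6h")
nearest_6h_schedule_df['departure_date']=nearest_6h_schedule_df['departure_date'].dt.floor("6h")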

In [None]:
nearest_6h_schedule_df.head(2)

In [None]:

# Assign a new column 'date' using a list comprehension to generate date ranges between 'departure_date' and 'estimated_arrival' with a frequency of 6 hours
# This will create a list of date ranges for each row
# Explode the 'date' column to create separate rows for each date range

exploded_6h_scheduled_df=(nearest_6h_schedule_df.assign(date = [pd.date_range(start, end, freq='6h')
                      for start, end
                      in zip(nearest_6h_schedule_df['departure_date'], nearest_6h_schedule_df['estimated_arrival'])]).explode('date', ignore_index = True))
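
The floor/ceil and explode steps above can be followed on a two-trip toy example. The timestamps are hypothetical. The final cast back to datetime is an extra step added in this sketch, because `explode` returns an object-dtype column and a datetime merge key is usually easier to work with.

```python
import pandas as pd

# Hypothetical toy schedule with two trips.
toy = pd.DataFrame({
    "unique_id": [0, 1],
    "departure_date": pd.to_datetime(["2019-01-01 07:30", "2019-01-02 13:00"]),
    "estimated_arrival": pd.to_datetime(["2019-01-01 20:10", "2019-01-02 17:00"]),
})

# Widen each trip to the enclosing 6-hour boundaries.
start = toy["departure_date"].dt.floor("6h")
end = toy["estimated_arrival"].dt.ceil("6h")

# One list of 6-hourly timestamps per trip, then one row per timestamp.
exploded = (
    toy.assign(date=[pd.date_range(s, e, freq="6h") for s, e in zip(start, end)])
       .explode("date", ignore_index=True)
)

# explode() yields object dtype; cast back to datetime for the later merge.
exploded["date"] = pd.to_datetime(exploded["date"])

print(exploded[["unique_id", "date"]])
```

Each trip now has one row per 6-hour weather slot it overlaps, so a left merge on `route_id` and `date` attaches every relevant weather observation to the trip.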

In [None]:
exploded_6h_scheduled_df.head(2)

In [None]:
schduled_weather=exploded_6h_scheduled_df.merge(route_weather,on=['route_id','date'],how='left')

In [None]:
schduled_weather.head(4)

In [None]:
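# Define a custom function to calculate the mode (most frequent value)
def custom_mode(x):
    modes = x.mode()
    # mode() is empty when every value in the group is missing
    return modes.iloc[0] if not modes.empty else np.nan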

# Group by specified columns and aggregate
schedule_weather_grp = schduled_weather.groupby(['unique_id','truck_id','route_id'], as_index=False).agg(
    route_avg_temp=('temp','mean'),
    route_avg_wind_speed=('wind_speed','mean'),
    route_avg_precip=('precip','mean'),
    route_avg_humidity=('humidity','mean'),
    route_avg_visibility=('visibility','mean'),
    route_avg_pressure=('pressure','mean'),
    route_description=('description', custom_mode)
)


In [None]:
schedule_weather_grp.head(2)

In [None]:
schedule_weather_merge=schedule_df.merge(schedule_weather_grp,on=['unique_id','truck_id','route_id'],how='left')

In [None]:
schedule_weather_merge.shape

In [None]:
schedule_weather_merge.isna().sum()

Find Origin and Destination city Weather

In [None]:
weather_df.head(2)

In [None]:
#take hourly as weather data available hourly
nearest_hour_schedule_df=schedule_df.copy()
nearest_hour_schedule_df['estimated_arrival_nearest_hour']=nearest_hour_schedule_df['estimated_arrival'].dt.round("h")
nearest_hour_schedule_df['departure_date_nearest_hour']=nearest_hour_schedule_df['departure_date'].dt.round("h")
nearest_hour_schedule_route_df=pd.merge(nearest_hour_schedule_df, routes_df, on='route_id', how='left')

In [None]:
nearest_hour_schedule_route_df.shape

In [None]:
nearest_hour_schedule_route_df.dtypes

In [None]:
weather_df.dtypes

In [None]:
# Create a copy of the 'weather_df' DataFrame for manipulation
origin_weather_data = weather_df.copy()

# Drop the 'date' and 'hour' columns from 'origin_weather_data'
origin_weather_data = origin_weather_data.drop(columns=['date', 'hour'])

origin_weather_data.columns = ['origin_id','departure_date_nearest_hour', 'origin_temp', 'origin_wind_speed','origin_description', 'origin_precip',
       'origin_humidity', 'origin_visibility', 'origin_pressure']

# Create a copy of the 'weather_df' DataFrame for manipulation
destination_weather_data = weather_df.copy()

# Drop the 'date' and 'hour' columns from 'destination_weather_data'
destination_weather_data = destination_weather_data.drop(columns=['date', 'hour'])

destination_weather_data.columns = ['destination_id', 'estimated_arrival_nearest_hour','destination_temp', 'destination_wind_speed','destination_description', 'destination_precip',
       'destination_humidity', 'destination_visibility', 'destination_pressure' ]

# Merge 'nearest_hour_schedule_route_df' with 'origin_weather_data' based on specified columns
origin_weather_merge = pd.merge(nearest_hour_schedule_route_df, origin_weather_data, on=['origin_id','departure_date_nearest_hour'], how='left')

# Merge 'origin_weather_merge' with 'destination_weather_data' based on specified columns
origin_destination_weather = pd.merge(origin_weather_merge, destination_weather_data , on=['destination_id', 'estimated_arrival_nearest_hour'], how='left')
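
Assigning a full list to `.columns`, as above, depends on the exact column order of `weather_df`. The sketch below shows one way to rename by name instead, so the result does not change if the column order does. The helper `weather_for` and the toy values are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the city weather and schedule tables.
weather = pd.DataFrame({
    "city_id": ["C-1", "C-2"],
    "custom_date": pd.to_datetime(["2019-01-01 08:00", "2019-01-01 15:00"]),
    "temp": [55.0, 61.0],
})
trips = pd.DataFrame({
    "unique_id": [0],
    "origin_id": ["C-1"],
    "destination_id": ["C-2"],
    "departure_date_nearest_hour": pd.to_datetime(["2019-01-01 08:00"]),
    "estimated_arrival_nearest_hour": pd.to_datetime(["2019-01-01 15:00"]),
})


def weather_for(role, id_col, time_col):
    """Rename the join keys by name and prefix every measurement with `role`."""
    renamed = weather.rename(columns={"city_id": id_col, "custom_date": time_col})
    measures = [c for c in renamed.columns if c not in (id_col, time_col)]
    return renamed.rename(columns={c: f"{role}_{c}" for c in measures})


origin = weather_for("origin", "origin_id", "departure_date_nearest_hour")
destination = weather_for("destination", "destination_id", "estimated_arrival_nearest_hour")

# Attach origin weather at departure and destination weather at arrival.
merged = (
    trips.merge(origin, on=["origin_id", "departure_date_nearest_hour"], how="left")
         .merge(destination, on=["destination_id", "estimated_arrival_nearest_hour"], how="left")
)
```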


In [None]:
origin_destination_weather.head(2)

In [None]:
origin_destination_weather.shape

Traffic and Schedule Data Merge

In [None]:
traffic_df.head(5)

In [None]:
traffic_df.dtypes

In [None]:
schedule_df.head(5)

In [None]:
schedule_df.dtypes

In [None]:
# Create a copy of the schedule DataFrame for manipulation
nearest_hour_schedule_df = schedule_df.copy()

# Round 'estimated_arrival' times to the nearest hour
nearest_hour_schedule_df['estimated_arrival'] = nearest_hour_schedule_df['estimated_arrival'].dt.round("h")

# Round 'departure_date' times to the nearest hour
nearest_hour_schedule_df['departure_date'] = nearest_hour_schedule_df['departure_date'].dt.round("h")

In [None]:
nearest_hour_schedule_df.head(5)

In [None]:
hourly_exploded_scheduled_df=(nearest_hour_schedule_df.assign(custom_date = [pd.date_range(start, end, freq='h')  # Create custom date ranges
                      for start, end
                      in zip(nearest_hour_schedule_df['departure_date'], nearest_hour_schedule_df['estimated_arrival'])])  # Using departure and estimated arrival times
                      .explode('custom_date', ignore_index = True))  # Explode the DataFrame based on the custom date range

In [None]:
hourly_exploded_scheduled_df.head(10)

In [None]:
scheduled_traffic=hourly_exploded_scheduled_df.merge(traffic_df,on=['route_id','custom_date'],how='left')

In [None]:
# Define a custom aggregation function for accidents
def custom_agg(values):
    """
    Custom aggregation function to determine if any value in a group is 1 (indicating an accident).

    Args:
    values (iterable): Iterable of values in a group.

    Returns:
    int: 1 if any value is 1, else 0.
    """
    if any(values == 1):
        return 1
    else:
        return 0

# Group by 'unique_id', 'truck_id', and 'route_id', and apply custom aggregation
scheduled_route_traffic = scheduled_traffic.groupby(['unique_id', 'truck_id', 'route_id'], as_index=False).agg(
    avg_no_of_vehicles=('no_of_vehicles', 'mean'),
    accident=('accident', custom_agg)
)
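
The aggregation above can be checked on a toy frame. The values are hypothetical, and `any_accident` is a compact stand-in for `custom_agg`. The sketch also includes `"max"` as a possible built-in alternative for a 0/1 column, and shows where the two differ: for a trip with no matching traffic rows, the custom function returns 0 while `"max"` returns NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; trip 2 has no matching traffic observation.
toy = pd.DataFrame({
    "unique_id": [0, 0, 1, 1, 2],
    "no_of_vehicles": [100.0, 300.0, 50.0, 150.0, np.nan],
    "accident": [0.0, 1.0, 0.0, 0.0, np.nan],
})


def any_accident(values):
    """Return 1 if any observation in the group reports an accident, else 0."""
    return int((values == 1).any())


summary = toy.groupby("unique_id", as_index=False).agg(
    avg_no_of_vehicles=("no_of_vehicles", "mean"),
    accident=("accident", any_accident),
    accident_max=("accident", "max"),
)

print(summary)
```

Whether "no traffic data" should count as "no accident" (0) or stay missing (NaN) is a modelling choice; the notebook's `custom_agg` takes the first option.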


In [None]:
scheduled_route_traffic.head(5)

Merge all dataframes

In [None]:
origin_destination_weather_traffic_merge=origin_destination_weather.merge(scheduled_route_traffic,on=['unique_id','truck_id','route_id'],how='left')

In [None]:
origin_destination_weather_traffic_merge.head(5)

In [None]:
schedule_weather_merge.columns.intersection(origin_destination_weather_traffic_merge.columns)

In [None]:
merged_data_weather_traffic=pd.merge(schedule_weather_merge, origin_destination_weather_traffic_merge, on=['unique_id', 'truck_id', 'route_id', 'departure_date',
       'estimated_arrival', 'delay'], how='left')

In [None]:
merged_data_weather_traffic_trucks = pd.merge(merged_data_weather_traffic, trucks_df, on='truck_id', how='left')

# Merge with drivers data, matching 'truck_id' to the driver's 'vehicle_no' (Left Join)
final_merge = pd.merge(merged_data_weather_traffic_trucks, drivers_df, left_on='truck_id', right_on = 'vehicle_no', how='left')
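
A left join silently adds rows when the right-hand key is not unique, which is why we check `final_merge.shape` afterwards. As a sketch, pandas can also enforce this expectation through the `validate` argument of `merge`. The toy tables below are hypothetical; whether the project's drivers table actually contains repeated `vehicle_no` values is not known here.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy data: two drivers share truck 10.
trips = pd.DataFrame({"unique_id": [0, 1], "truck_id": [10, 11]})
drivers = pd.DataFrame({
    "driver_id": ["d1", "d2", "d3"],
    "vehicle_no": [10, 10, 11],
})

# Without validation the trip on truck 10 is duplicated: 2 trips become 3 rows.
plain = trips.merge(drivers, left_on="truck_id", right_on="vehicle_no", how="left")

# With validation pandas raises instead of silently duplicating rows.
try:
    trips.merge(
        drivers,
        left_on="truck_id",
        right_on="vehicle_no",
        how="left",
        validate="many_to_one",
    )
    duplicated_keys = False
except MergeError:
    duplicated_keys = True

print(len(plain), duplicated_keys)
```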

In [None]:
final_merge.shape

In [None]:
final_merge.head(5)

In [None]:
# Function to check whether the trip crosses midnight, i.e. departure and estimated arrival fall on different calendar dates
def has_midnight(start, end):
    return int(start.date() != end.date())


# Apply the function row-wise to create a binary 'is_midnight' column
final_merge['is_midnight'] = final_merge.apply(lambda row: has_midnight(row['departure_date'], row['estimated_arrival']), axis=1)
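
The row-wise `apply` above can also be written as a vectorized comparison, which tends to be faster on large frames. This is a sketch on hypothetical toy timestamps; it compares the calendar dates of both columns in the same way `has_midnight` does.

```python
import pandas as pd

# Hypothetical toy trips: the second one departs late and arrives the next day.
toy = pd.DataFrame({
    "departure_date": pd.to_datetime(["2019-01-01 08:00", "2019-01-01 22:00"]),
    "estimated_arrival": pd.to_datetime(["2019-01-01 18:00", "2019-01-02 03:00"]),
})

# normalize() resets the time to 00:00, so only the calendar date is compared.
toy["is_midnight"] = (
    toy["departure_date"].dt.normalize() != toy["estimated_arrival"].dt.normalize()
).astype(int)

print(toy["is_midnight"].tolist())
```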

In [None]:
final_merge[final_merge['is_midnight']==1]

## **Feature Store for Final Dataset**

In [None]:
fs_data = final_merge.sort_values(["estimated_arrival","unique_id"])

In [None]:
'''import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()'''

In [None]:
truck_eta_fg = fs.get_or_create_feature_group(
    name="final_data",
    version=1,
    description="Truck ETA Final Data",
    primary_key=['unique_id'],
    event_time='estimated_arrival',
    online_enabled=True,
)

In [None]:
fs_data.isna().sum()

In [None]:
fs_data.dtypes

In [None]:
fs_data['origin_description'] = fs_data['origin_description'].fillna("Unknown")

In [None]:
truck_eta_fg.insert(fs_data)

In [None]:
final_feature_descriptions = [
    {"name": 'unique_id', "description": "the unique identifier for each record"},
    {"name": 'truck_id', "description": "the unique identifier of the truck"},
    {"name": 'route_id', "description": "the unique identifier of the route"},
    {"name": 'departure_date', "description": "departure DateTime of the truck"},
    {"name": 'estimated_arrival', "description": "estimated arrival DateTime of the truck"},
    {"name": 'delay', "description": "binary variable if the truck’s arrival was delayed, 0 for on-time arrival and 1 for delayed arrival"},
    {"name": 'route_avg_temp', "description":  'Average temperature in Fahrenheit'},
    {"name": 'route_avg_wind_speed', "description":  'Average wind speed in miles per hour'},
    {"name": 'route_avg_precip', "description":  'Average precipitation in inches'},
    {"name": 'route_avg_humidity', "description":  'Average humidity observed'},
    {"name": 'route_avg_visibility', "description":  'Average visibility observed in miles'},
    {"name": 'route_avg_pressure', "description":  'Average pressure observed in millibar'},
    {"name": 'route_description', "description":  'description of the weather conditions such as Clear, Cloudy, etc'},
    {"name": 'estimated_arrival_nearest_hour', "description":  'estimated arrival DateTime of the truck, rounded to the nearest hour'},
    {"name": 'departure_date_nearest_hour', "description":  'departure DateTime of the truck, rounded to the nearest hour'},
    {"name": 'origin_id', "description": "the city identification number for the origin city"},
    {"name": 'destination_id', "description": "the city identification number for the destination"},
    {"name": 'distance', "description": "the distance between the origin and destination cities in miles"},
    {"name": 'average_hours', "description": "average time needed to travel from the origin to the destination in hours"},
    {"name": 'origin_temp', "description":  'temperature in Fahrenheit'},
    {"name": 'origin_wind_speed', "description":  'wind speed in miles per hour'},
    {"name": 'origin_description', "description":  'description of the weather conditions such as Clear, Cloudy, etc'},
    {"name": 'origin_precip', "description":  'precipitation in inches'},
    {"name": 'origin_humidity', "description":  'humidity observed'},
    {"name": 'origin_visibility', "description":  'visibility observed in miles'},
    {"name": 'origin_pressure', "description":  'pressure observed in millibar'},
    {"name": 'destination_temp', "description":  'temperature in Fahrenheit'},
    {"name": 'destination_wind_speed', "description":  'wind speed in miles per hour'},
    {"name": 'destination_description', "description":  'description of the weather conditions such as Clear, Cloudy, etc'},
    {"name": 'destination_precip', "description":  'precipitation in inches'},
    {"name": 'destination_humidity', "description":  'humidity observed'},
    {"name": 'destination_visibility', "description":  'visibility observed in miles'},
    {"name": 'destination_pressure', "description":  'pressure observed in millibar'},
    {"name": 'avg_no_of_vehicles', "description": "the average number of vehicles observed on the route"},
    {"name": 'accident', "description": "binary variable to denote if an accident was observed"},
    {"name":'truck_age',"description":"age of the truck in years"},
    {"name":'load_capacity_pounds',"description":"loading capacity of the truck in pounds"},
    {"name":'mileage_mpg',"description": "mileage of the truck in miles per gallon"},
    {"name":'fuel_type',"description":"fuel type of the truck"},
    {"name": "driver_id", "description": "unique identification for each driver"},
    {"name": "name", "description": "name of the truck driver"},
    {"name": "gender", "description": "gender of the truck driver"},
    {"name": "age", "description": "age of the truck driver"},
    {"name": "experience", "description": "experience of the truck driver in years"},
    {"name": "driving_style", "description": "driving style of the truck driver, conservative or proactive"},
    {"name": "ratings", "description": "average rating of the truck driver on a scale of 1 to 5"},
    {"name": "vehicle_no", "description": "the number of the driver’s truck"},
    {"name": "average_speed_mph", "description": "average speed of the truck driver in miles per hour"},
    {"name": 'is_midnight', "description": "binary variable to denote if the trip crosses midnight (departure and estimated arrival fall on different dates)"}

]

for desc in final_feature_descriptions:
    truck_eta_fg.update_feature_description(desc["name"], desc["description"])

In [None]:
truck_eta_fg = fs.get_or_create_feature_group("final_data", version=1)
truck_eta_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

truck_eta_fg.update_statistics_config()
truck_eta_fg.compute_statistics()

## **Conclusion**


We analyzed the data using SQL and carried out exploratory data analysis to understand it better. We also learned what feature stores are and how they support machine learning projects.

We fetched data from the feature store, engineered new features, and stored the final dataset back in the feature store.

In the next part, we will cover model building techniques and explore APIs and related topics. In the final part, we will construct a complete CI/CD pipeline for this project and learn how to trigger it.