# **Project Name**    -  UBER SUPPLY DEMAND GAP


##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### Team Member 1 -  Akshay Vadnala


# **Project Summary -**

This project focuses on analyzing the supply-demand imbalance faced by Uber, particularly in the route from the airport to the city. The primary objective was to identify when and why there is a gap between rider demand and cab availability. Through detailed examination of data patterns and rider behavior, several important insights and recommendations were developed.

The analysis clearly shows that Uber experiences a significant mismatch between supply and demand during the Night and Early Morning time slots. This issue is especially noticeable when passengers try to book rides from the airport heading toward the city. In these time periods, either very few cabs are available, or riders face high cancellation rates by drivers, creating a major inconvenience for customers.

One key factor contributing to this problem is the lack of cabs during the night hours. Many Uber drivers are not active during these late hours, which causes customers to frequently receive messages saying that no cabs are available. This results in a poor customer experience and lost business for Uber. On the other hand, during early morning and morning time slots, the issue is mainly due to driver cancellations. Even when cabs are technically available, many drivers choose to cancel rides, especially if the trip is not profitable or convenient for them. This further widens the supply-demand gap and affects rider trust in the platform.

# **GitHub Link -**

https://github.com/akshay24032002/UBER-SUPPLY-DEMAND-GAP.git

# **Problem Statement**


Uber faces a recurring supply-demand gap during night and early morning hours, particularly for trips from the airport to the city. During these time slots, riders often encounter unavailability of cabs or frequent cancellations by drivers. This mismatch results in poor customer experience, lost revenue opportunities, and reduced platform reliability. The primary causes include a lack of active drivers at night and high cancellation rates during early mornings. Identifying and addressing this issue is essential to improve service efficiency and rider satisfaction. The goal is to analyze the root causes and suggest practical solutions to bridge this gap effectively.

#### **Define Your Business Objective?**

The primary business objective is to identify and address the supply-demand gap faced by Uber during night and early morning hours, especially for airport-to-city trips. By analyzing ride availability and cancellation patterns, the aim is to uncover root causes and implement data-driven solutions. These may include driver incentives and shift adjustments to ensure better cab availability. Ultimately, the goal is to enhance customer satisfaction, improve operational efficiency, and increase revenue by reducing unfulfilled ride requests during critical time slots.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart is followed by answers to the questions below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/Uber Request Data.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.sample(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
percent_missing = df.isnull().sum() * 100 / len(df)
print(percent_missing)


In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The output shows that the dataset has 6745 rows and 6 columns, with no duplicate rows. Missing values occur in the 'Drop timestamp' and 'Driver id' columns: 'Drop timestamp' has 39.15% missing values and 'Driver id' has 38.85%. The heatmap visually confirms the missing values in these two columns. This indicates either that a significant number of requests were not completed (cancelled or no car available), so no drop timestamp or driver ID was recorded, or that there is a data recording issue. These missing values need to be taken into account during the data cleaning and preprocessing phase of the analysis.
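One way to test the explanation above is to count missing values within each request status. The snippet below is a minimal sketch on a tiny made-up sample that reuses the dataset's column names. The helper name `missing_by_status` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny made-up sample with the dataset's column names (values are illustrative only).
sample = pd.DataFrame({
    "Status": ["Trip Completed", "Cancelled", "No Cars Available", "Trip Completed"],
    "Driver id": [1.0, 2.0, None, 3.0],
    "Drop timestamp": ["11/7/2016 13:00", None, None, "12/7/2016 09:58"],
})


def missing_by_status(frame: pd.DataFrame, column: str) -> pd.Series:
    """Count missing values of `column` within each request status."""
    return frame[column].isnull().groupby(frame["Status"]).sum()


driver_missing = missing_by_status(sample, "Driver id")
drop_missing = missing_by_status(sample, "Drop timestamp")
print(driver_missing)
print(drop_missing)
```

Calling `missing_by_status(df, 'Driver id')` on the real data would show whether the missing driver IDs are concentrated in the unfulfilled statuses or spread across all of them, which would point to a recording issue instead.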

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

1.  **Request id**: A unique identifier for each Uber request. It is stored as a numerical variable.

2.  **Pickup point**: The location where the ride request originated (Airport or City).

3.  **Driver id**: The identifier of the driver who accepted or was assigned to the request. It is stored as a numerical variable.

4.  **Status**: The final status of the ride request (Trip Completed, Cancelled, or No Cars Available).

5.  **Request timestamp**: The date and time when the Uber request was made.

6.  **Drop timestamp**: The date and time when the ride was completed (if successful).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Treat identifiers as labels rather than numbers.
df['Driver id'] = df['Driver id'].astype('str')
df['Request id'] = df['Request id'].astype('str')

# Parse request timestamps (day-first, mixed formats); unparseable values become NaT.
df['Request timestamp'] = pd.to_datetime(df['Request timestamp'], format='mixed', dayfirst=True, errors='coerce')
df['Request_date'] = df['Request timestamp'].dt.date
df['Request_hour'] = df['Request timestamp'].dt.hour
df['Request_minute'] = df['Request timestamp'].dt.minute

# Parse drop timestamps with the same arguments so both columns are handled consistently.
df['Drop timestamp'] = pd.to_datetime(df['Drop timestamp'], format='mixed', dayfirst=True, errors='coerce')
df['Drop_date'] = df['Drop timestamp'].dt.date
df['Drop_hour'] = df['Drop timestamp'].dt.hour
df['Drop_minute'] = df['Drop timestamp'].dt.minute

# Trip duration from request to drop; NaN for requests without a drop timestamp.
trip_duration = df['Drop timestamp'] - df['Request timestamp']
df['Trip_hours'] = trip_duration.dt.total_seconds() / 3600
df['Trip_minutes'] = trip_duration.dt.total_seconds() / 60

def time_slot(hour):
    """Map an hour of the day (0-23) to a named time slot."""
    if 0 <= hour < 4:
        return 'Night'
    elif 4 <= hour < 10:
        return 'Early Morning'
    elif 10 <= hour < 17:
        return 'Daytime'
    elif 17 <= hour < 22:
        return 'Evening'
    else:
        return 'Late Evening'

# Assign a time slot to every request based on its request hour.
df['Time_Slot'] = df['Request_hour'].apply(time_slot)


### What all manipulations have you done and insights you found?

1.  **Handling Missing Values**:
    *   No rows are dropped. Missing 'Driver id' and 'Drop timestamp' values are kept because they appear to correspond to requests that were not completed (cancelled or no cars available), and these requests are central to the supply-demand analysis.

2.  **Data Type Conversion**:
    *   'Driver id' and 'Request id' are converted to string type.
    *   'Request timestamp' and 'Drop timestamp' are converted to datetime objects using pd.to_datetime. The dayfirst=True and errors='coerce' arguments read dates in day-first order and convert invalid dates to NaT (Not a Time), while format='mixed' lets pandas handle more than one date format within the same column.

3.  **Feature Engineering**:
    *   New columns are created by extracting date, hour, and minute components from both 'Request timestamp' and 'Drop timestamp' using .dt.date, .dt.hour, and .dt.minute. These new columns are 'Request\_date', 'Request\_hour', 'Request\_minute', 'Drop\_date', 'Drop\_hour', and 'Drop\_minute'.
    *   'Trip\_hours' and 'Trip\_minutes' hold the time between the request and the drop, in hours and minutes respectively.
    *   'Time\_Slot' groups each request hour into Night (0-3), Early Morning (4-9), Daytime (10-16), Evening (17-21), or Late Evening (22-23).
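The mixed-format parsing described above can also be made explicit by trying one layout at a time. The snippet below is a sketch only: the two layouts (`%d/%m/%Y %H:%M` and `%d-%m-%Y %H:%M:%S`) and the sample strings are assumptions made for illustration and should be checked against the actual file. `parse_two_formats` is a hypothetical helper name.

```python
import pandas as pd

# Made-up strings in two layouts; the layouts present in the real file are an assumption.
raw = pd.Series(["11/7/2016 11:51", "13-07-2016 08:33:16", "not a date"])


def parse_two_formats(values: pd.Series) -> pd.Series:
    """Try a slash-separated layout first, then fill the gaps with a dash-separated one."""
    slashed = pd.to_datetime(values, format="%d/%m/%Y %H:%M", errors="coerce")
    dashed = pd.to_datetime(values, format="%d-%m-%Y %H:%M:%S", errors="coerce")
    return slashed.fillna(dashed)


parsed = parse_two_formats(raw)
print(parsed)
print("Unparsed rows:", int(parsed.isna().sum()))
```

Counting the rows that remain NaT after parsing is a quick check that `errors='coerce'` has not silently discarded valid timestamps.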

**Insights :**

1. Timestamps converted to datetime objects, allowing for time-based analysis.
2. Date, hour, and minute extracted, enabling grouping and analysis by time components.
3. Time slots (Night, Early Morning, Daytime, Evening, Late Evening) created to categorize requests and analyze patterns across different periods of the day.
4. Trip duration calculated for completed trips, providing insight into ride length.
5. Supply and demand can be quantified by grouping requests and completed trips by time slot and pickup point.
6. The gap, defined as the number of requests minus the number of completed trips in each time slot and location, is the core problem the project aims to address.
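The gap described in the last two points can be computed with a single group-by. The snippet below is a minimal sketch on made-up rows that reuse the notebook's column names. The helper name `supply_demand_gap` and the output columns `Demand`, `Supply`, and `Gap` are illustrative choices.

```python
import pandas as pd

# Made-up requests using the notebook's column names (values are illustrative only).
sample = pd.DataFrame({
    "Time_Slot": ["Night", "Night", "Night", "Early Morning", "Early Morning", "Daytime"],
    "Pickup point": ["Airport", "Airport", "City", "City", "City", "Airport"],
    "Status": ["No Cars Available", "Trip Completed", "Trip Completed",
               "Cancelled", "Cancelled", "Trip Completed"],
})


def supply_demand_gap(frame: pd.DataFrame) -> pd.DataFrame:
    """Demand = all requests, Supply = completed trips, Gap = Demand - Supply."""
    completed = frame["Status"].eq("Trip Completed")
    summary = (
        frame.assign(Completed=completed)
        .groupby(["Time_Slot", "Pickup point"])
        .agg(Demand=("Status", "size"), Supply=("Completed", "sum"))
    )
    summary["Gap"] = summary["Demand"] - summary["Supply"]
    # Largest gap first, so the most problematic segments appear at the top.
    return summary.sort_values("Gap", ascending=False)


gap_table = supply_demand_gap(sample)
print(gap_table)
```

Applied to the notebook's `df` after the wrangling step, the same function would rank every time slot and pickup point combination by unfulfilled demand.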

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
status_counts = df['Status'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(status_counts, labels=status_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Request Statuses')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is suitable for visualizing the distribution of categorical data. In this case, it effectively shows the proportion of each request status (completed, cancelled, no cars available) out of the total number of requests.

##### 2. What is/are the insight(s) found from the chart?

The pie chart will clearly show the percentage breakdown of ride requests that were completed, cancelled by the driver, or resulted in no cars being available. This provides an immediate understanding of the most frequent outcomes of ride requests.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are crucial for creating a positive business impact. Understanding the proportion of cancelled and 'no cars available' requests highlights the magnitude of the supply-demand gap. This information helps in prioritizing efforts to reduce cancellations and increase driver availability, which directly impacts customer satisfaction and revenue.

**Negative:** The high proportion of 'Cancelled' and 'No Cars Available' statuses indicates significant issues in the Uber service. These represent unfulfilled demand and poor customer experience, which can lead to customer churn and negative perception of the brand. This directly contributes to negative growth by reducing the number of completed rides and potentially losing market share to competitors.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
status_by_hour = df.groupby('Request_hour')['Status'].value_counts().unstack().fillna(0)
status_by_hour[['Cancelled', 'No Cars Available', 'Trip Completed']].plot(kind='bar', stacked=False, figsize=(14, 7))
plt.title('Request Status Distribution by Hour')
plt.xlabel('Request Hour of the Day')
plt.ylabel('Number of Requests')
plt.xticks(rotation=0)
plt.legend(title='Status')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart is effective here because it allows us to compare the number of requests for each status ('Cancelled', 'No Cars Available', 'Trip Completed') side by side across different hours of the day. This visual representation makes it easy to identify which hours have a higher number of unfulfilled requests (cancelled or no cars available).


##### 2. What is/are the insight(s) found from the chart?

This chart clearly shows how the status of requests varies throughout the day. We can observe which hours experience the most cancellations and which hours have the highest incidence of 'No Cars Available'. This is crucial for pinpointing the peak times of the supply-demand gap. Specifically, it will likely highlight the night and early morning hours as periods with significant issues (either high cancellations or no cars available).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are highly valuable for positive business impact. By identifying the specific hours where the supply-demand gap is most severe, Uber can implement targeted strategies. For example, offering incentives to drivers during those peak hours, implementing surge pricing strategically, or better communication with riders about expected wait times could help bridge the gap. This leads to more completed trips, increased revenue, and improved customer satisfaction.

**Negative:** The visual representation of high cancellations or 'No Cars Available' during certain hours directly shows periods of significant operational inefficiency and poor customer experience. These unfulfilled requests represent lost revenue opportunities and can lead to customer frustration and defection. If these patterns are not addressed, they will continue to contribute to negative growth by hindering Uber's ability to serve demand effectively.
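Raw counts can hide how severe the problem is in low-volume hours, so the hourly comparison can also be expressed as shares. The snippet below is a sketch on made-up rows. The helper name `status_share_by_hour` and the derived `Unfulfilled` column are illustrative assumptions.

```python
import pandas as pd

# Made-up requests (illustrative only): four at 05:00 and two at 18:00.
sample = pd.DataFrame({
    "Request_hour": [5, 5, 5, 5, 18, 18],
    "Status": ["Cancelled", "Cancelled", "Trip Completed", "No Cars Available",
               "Trip Completed", "No Cars Available"],
})


def status_share_by_hour(frame: pd.DataFrame) -> pd.DataFrame:
    """Share of each status within every request hour (each row sums to 1)."""
    shares = pd.crosstab(frame["Request_hour"], frame["Status"], normalize="index")
    # Everything that did not end in a completed trip counts as unfulfilled.
    shares["Unfulfilled"] = 1 - shares.get("Trip Completed", 0)
    return shares


hourly_shares = status_share_by_hour(sample)
print(hourly_shares.round(2))
```

Sorting the result by `Unfulfilled` would list the hours in which a request is least likely to end in a completed trip, independent of how many requests arrive in that hour.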

#### Chart - 3

In [None]:
# Chart - 3 visualization code
request_by_hour_pickup = df.groupby(['Request_hour', 'Pickup point'])['Request id'].count().unstack().fillna(0)

# DataFrame.plot creates its own figure, so a separate plt.figure() call would only add an empty one.
request_by_hour_pickup.plot(kind='bar', stacked=False, figsize=(14, 7))
plt.title('Number of Requests by Hour and Pickup Point')
plt.xlabel('Request Hour of the Day')
plt.ylabel('Number of Requests')
plt.xticks(rotation=0)
plt.legend(title='Pickup Point')
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart is used to show the number of requests originating from each 'Pickup point' (Airport and City) across different hours of the day. This allows for a direct comparison of demand from each location at each hour.


##### 2. What is/are the insight(s) found from the chart?

This chart helps to understand the demand pattern from both the Airport and the City over the 24-hour period. It will likely show distinct peak hours for requests originating from the Airport and from the City, helping to identify where and when the demand is highest for each location. This is crucial for understanding if the supply-demand issue is uniform or specific to certain pickup points and times.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight is valuable for positive business impact. By understanding the specific demand patterns from the Airport and the City at different hours, Uber can better allocate drivers. For instance, if demand from the Airport peaks during early morning, Uber can incentivize drivers to be available near the Airport during those hours. This targeted approach improves supply matching demand, leading to more completed trips and higher satisfaction for riders from both locations.

**Negative:** This chart shows total requests only, so unserved demand has to be inferred by comparing it with completed trips for the same locations and hours. If that comparison shows significant unserved demand, especially during peak hours from a specific location like the Airport, it highlights missed business opportunities. A consistent inability to meet demand from key locations like the Airport at critical times can lead to a negative perception of reliability among frequent travelers, potentially leading to negative growth as they turn to alternative transportation.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
status_by_time_slot = df.groupby('Time_Slot')['Status'].value_counts().unstack().fillna(0)

status_by_time_slot[['Cancelled', 'No Cars Available', 'Trip Completed']].plot(kind='bar', stacked=False, figsize=(14, 7))
plt.title('Request Status Distribution by Time Slot')
plt.xlabel('Time Slot')
plt.ylabel('Number of Requests')
plt.xticks(rotation=0)
plt.legend(title='Status')
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart is used to compare the number of requests for each status ('Cancelled', 'No Cars Available', 'Trip Completed') across the defined time slots. Placing the three statuses side by side makes it easy to compare fulfilled demand (completed trips) with unfulfilled demand (cancellations and no cars available) in each period of the day.

##### 2. What is/are the insight(s) found from the chart?

This chart provides a clear insight into which time slots experience the largest supply-demand gap. It highlights the periods where Uber struggles the most to fulfill ride requests. This is crucial for confirming the initial hypothesis about night and early morning hours having the most significant issues and quantifying the magnitude of the problem in each time slot.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight is fundamental for creating a positive business impact. By precisely identifying and quantifying the gap in each time slot, Uber can focus its resources and strategies on the most problematic periods. For example, if the 'Night' or 'Early Morning' slots show the largest gap, Uber can implement targeted driver incentives, surge pricing adjustments, or communication campaigns specifically for those times. This leads to a more efficient allocation of supply and a reduction in unfulfilled demand, increasing completed trips and revenue.

**Negative:** The presence of a significant supply-demand gap in any time slot, visible as tall 'Cancelled' and 'No Cars Available' bars relative to 'Trip Completed', represents unserved demand and potentially lost revenue. If certain time slots consistently show a large gap, it indicates a failure to meet customer needs during those periods, leading to frustration and potentially driving customers to competitors. This inability to convert requests into completed trips directly contributes to negative growth by limiting the platform's overall volume and revenue.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Time_Slot', hue='Pickup point', palette='viridis')
plt.title('Number of Requests by Time Slot and Pickup Point')
plt.xlabel('Time Slot')
plt.ylabel('Number of Requests')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot (bar chart showing counts of observations in each categorical bin) is used to visualize the distribution of ride requests across different time slots, separated by their Pickup point (Airport vs. City). The use of hue=Pickup point creates side-by-side bars for each time slot, making it easy to compare the request volume originating from the Airport and the City within the same time frame.

##### 2. What is/are the insight(s) found from the chart?

This chart reveals the volume of requests from the Airport and the City during different time periods of the day. It will show when demand is highest from the Airport (likely early morning) and when it is highest from the City (likely evening and daytime). It also allows us to see if there's a mismatch in demand from these two locations across time slots. This helps understand the specific demand patterns Uber needs to cater to at different times and locations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the demand distribution by time slot and pickup point is crucial for positive business impact. By knowing when and where the demand originates, Uber can strategically position drivers or offer incentives to ensure better supply matches the specific demand patterns. For instance, if early mornings show high demand from the Airport, Uber can focus on getting more drivers there. This improves efficiency, reduces wait times, and increases the likelihood of fulfilling requests, leading to more completed trips and higher revenue.

**Negative:** If the chart shows consistently high demand from a specific pickup point during a certain time slot (e.g., Airport in the early morning) while other charts reveal a high rate of 'No Cars Available' or 'Cancelled' during that same time/location, it highlights a failure to serve existing demand. This inability to capitalize on peak demand periods from key locations represents lost revenue and could lead to customers seeking alternative transportation options, thus contributing to negative growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
request_status_by_pickup_time = df.groupby(['Pickup point', 'Time_Slot'])['Status'].value_counts().unstack().fillna(0)

request_status_by_pickup_time.plot(kind='bar', stacked=False, figsize=(14, 7))
plt.title('Request Status Distribution by Pickup Point and Time Slot')
plt.xlabel('Pickup Point, Time Slot')
plt.ylabel('Number of Requests')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Status')
plt.grid(axis='y')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart is used to show the number of requests for each status ('Cancelled', 'No Cars Available', 'Trip Completed') for every combination of pickup point and time slot. This allows a direct comparison of completed and unfulfilled requests across both locations and all periods of the day.

##### 2. What is/are the insight(s) found from the chart?

This chart is key to understanding the core problem. It will clearly show which pickup point and time slot combinations suffer the most from cancellations and 'no cars available'. For example, it's likely to reveal a high number of 'No Cars Available' for Airport requests during the 'Night' time slot and a high number of 'Cancelled' requests for Airport requests during the 'Early Morning' time slot, while City requests might have different patterns. This granularity helps pinpoint the exact segments experiencing the supply-demand gap.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

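Yes. Knowing which pickup point and time slot combinations have the most cancellations and 'No Cars Available' outcomes lets Uber direct measures such as driver incentives and shift adjustments at those specific segments instead of applying them uniformly.

**Negative:** Any combination in which unfulfilled requests outnumber completed trips represents demand that the platform is failing to serve. If this persists, riders in that segment are likely to lose trust in the service and turn to alternatives.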

#### Chart - 7

In [None]:
# Chart - 7 visualization code
trip_minutes_by_time_slot = df.groupby('Time_Slot')['Trip_minutes'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(data=trip_minutes_by_time_slot, x='Time_Slot', y='Trip_minutes', palette='viridis')
plt.title('Average Trip Duration by Time Slot')
plt.xlabel('Time Slot')
plt.ylabel('Average Trip Duration (minutes)')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used to visualize the average trip duration (in minutes) for each time slot. It allows for a clear comparison of how the average length of completed trips varies throughout the day.

##### 2. What is/are the insight(s) found from the chart?

This chart reveals if certain time slots typically involve longer or shorter trips on average. For instance, trips during peak commuting hours might be longer due to traffic, while late-night trips could be shorter. This insight can help understand if trip duration plays a role in driver behavior (e.g., drivers preferring longer or shorter trips during certain times).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding average trip duration by time slot can contribute to positive business impact. If certain time slots have significantly shorter average trip durations, it might be less appealing for drivers who prefer longer, more profitable rides. This could be a contributing factor to driver unavailability or cancellations in those slots. Uber could potentially use this information to adjust incentives or pricing strategies for shorter trips during specific times to make them more attractive to drivers.

**Negative:** If certain time slots show very short average trip durations, and these are also the time slots with high 'Cancelled' or 'No Cars Available' rates, it could indicate that drivers are less inclined to accept these shorter trips. This pattern suggests that the trip characteristics (duration/profitability) in those time slots are unattractive to drivers, directly contributing to the supply-demand gap and hindering growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Trip_minutes', bins=30, kde=True)
plt.title('Distribution of Trip Durations')
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is used to visualize the distribution of a single numerical variable, 'Trip_minutes'. It shows how frequently different trip durations occur within the dataset. Adding a Kernel Density Estimate (KDE) line helps to see the overall shape of the distribution.


##### 2. What is/are the insight(s) found from the chart?

This histogram provides insights into the typical duration of completed Uber trips. It will likely show a peak around a certain duration, indicating the most common trip lengths. It will also reveal the range of trip durations and if there are many very short or very long trips. This helps understand the nature of the trips being completed on the platform.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of trip durations can have a positive business impact. For example, if a significant portion of completed trips are very short, this might influence driver behavior and contribute to cancellations of short trips if not profitable. Uber could use this information to potentially adjust pricing for shorter trips or bundle requests to make them more appealing to drivers. It can also help in estimating arrival times and managing driver availability based on expected trip lengths.

**Negative:** If the histogram shows a large number of very short trips and this correlates with time slots or locations experiencing high cancellation rates (as seen in previous charts), it suggests that drivers are actively avoiding these short, potentially less profitable trips. This unwillingness of drivers to take frequent short trips directly contributes to the supply-demand gap and hinders the platform's ability to serve a segment of the market, leading to negative growth in that segment. Conversely, if there are many very long trips, it might reduce overall driver availability if drivers are tied up on long rides during peak hours.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
completed_trips_by_driver = df[df['Status'] == 'Trip Completed']['Driver id'].value_counts().reset_index()
completed_trips_by_driver.columns = ['Driver id', 'Completed Trips']

top_10_drivers = completed_trips_by_driver.sort_values(by='Completed Trips', ascending=False).head(10)

plt.figure(figsize=(12, 7))
sns.barplot(data=top_10_drivers, x='Driver id', y='Completed Trips', palette='viridis')
plt.title('Top 10 Drivers by Number of Completed Trips')
plt.xlabel('Driver ID')
plt.ylabel('Number of Completed Trips')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is used to display the number of completed trips for each of the top 10 drivers. This chart is effective because it allows for a clear comparison of the performance among the most active drivers.

##### 2. What is/are the insight(s) found from the chart?

This chart identifies the drivers who have completed the most trips. It shows the distribution of completed trips among the top drivers and can reveal if there are a few exceptionally active drivers or if completed trips are more evenly distributed among the top performers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight can contribute to positive business impact. Recognizing and potentially rewarding the most active drivers can encourage continued high performance. Understanding the volume of trips completed by top drivers can also provide a benchmark for other drivers and inform strategies to increase overall driver productivity and retention.

**Negative:** If a small number of drivers are responsible for a disproportionately large number of completed trips, it might indicate a potential dependency on a few high-performing individuals. If these key drivers become inactive, it could negatively impact overall supply and service reliability. It might also suggest that the platform struggles to retain or incentivize a broader base of drivers to achieve similar levels of activity.
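Whether completed trips really depend on a few drivers can be checked with a single number: the share of all completed trips handled by the top N drivers. The snippet below is a sketch on made-up rows, and `top_driver_share` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows (illustrative only): driver A completes 3 trips, B and C complete 1 each.
sample = pd.DataFrame({
    "Driver id": ["A", "A", "A", "B", "C", "C"],
    "Status": ["Trip Completed", "Trip Completed", "Trip Completed",
               "Trip Completed", "Trip Completed", "Cancelled"],
})


def top_driver_share(frame: pd.DataFrame, top_n: int = 10) -> float:
    """Fraction of completed trips handled by the `top_n` most active drivers."""
    completed = frame.loc[frame["Status"] == "Trip Completed", "Driver id"]
    counts = completed.value_counts()  # sorted from most to fewest completed trips
    if counts.empty:
        return float("nan")
    return float(counts.head(top_n).sum() / counts.sum())


share_top_1 = top_driver_share(sample, top_n=1)
print(f"Share of completed trips by the top driver: {share_top_1:.0%}")
```

A share close to the value expected from an even split across drivers would suggest there is little dependency on individual drivers, while a much larger share would support the concern raised above.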

#### Chart - 10

In [None]:
# Chart - 10 visualization code
cancelled_trips_by_driver = df[df['Status'] == 'Cancelled']['Driver id'].value_counts().reset_index()
cancelled_trips_by_driver.columns = ['Driver id', 'Cancelled Trips']

top_10_cancelled_drivers = cancelled_trips_by_driver.sort_values(by='Cancelled Trips', ascending=False).head(10)

plt.figure(figsize=(12, 7))
sns.barplot(data=top_10_cancelled_drivers, x='Driver id', y='Cancelled Trips', palette='viridis')
plt.title('Top 10 Drivers by Number of Cancelled Trips')
plt.xlabel('Driver ID')
plt.ylabel('Number of Cancelled Trips')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used to display the number of cancelled trips for each of the top 10 drivers who have the highest number of cancellations. This chart is effective for highlighting the drivers contributing most to the cancellation issue.

##### 2. What is/are the insight(s) found from the chart?

This chart identifies the drivers who are responsible for the most ride cancellations. It can reveal if a small number of drivers account for a large proportion of cancellations or if cancellations are more broadly distributed among drivers. This insight is critical for understanding the source of driver-side cancellations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight is extremely valuable for positive business impact. By identifying the drivers with high cancellation rates, Uber can take targeted actions. This could include investigating the reasons for their cancellations (e.g., trip distance, destination, time of day, low fare), providing additional training, implementing stricter policies, or adjusting incentives to discourage cancellations. Reducing cancellations directly improves service reliability and customer experience, leading to more completed trips and higher revenue.

**Negative:** The very existence of a list of top cancellers highlights a significant negative aspect of the service: driver behavior that directly harms the platform's ability to fulfill demand. Cancellations, whether concentrated in a few drivers or spread across many, widen the supply-demand gap, frustrate customers, and damage Uber's reputation for reliability. This hinders growth by reducing completed trip volume and potentially driving customers to competitors.
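The chart ranks drivers by the raw number of cancellations, which favours drivers who simply receive many requests. A cancellation rate per driver may be a fairer basis for targeted action. The sketch below is an illustration under assumptions, not the notebook's method: the `sample` data, the helper `cancellation_rate_by_driver`, its `min_requests` parameter, and the status labels other than `'Cancelled'` are all hypothetical.

```python
import pandas as pd

# Hypothetical sample; a missing 'Driver id' stands for a request with no driver assigned.
sample = pd.DataFrame({
    "Driver id": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, None],
    "Status": ["Cancelled", "Trip Completed", "Trip Completed", "Trip Completed",
               "Cancelled", "Cancelled", "Trip Completed", "No Cars Available"],
})

def cancellation_rate_by_driver(frame, min_requests=1):
    """Per-driver request count, cancellation count, and cancellation rate, highest rate first."""
    assigned = frame.dropna(subset=["Driver id"]).copy()
    assigned["is_cancelled"] = assigned["Status"].eq("Cancelled")
    summary = assigned.groupby("Driver id")["is_cancelled"].agg(["count", "sum", "mean"])
    summary.columns = ["total_requests", "cancelled", "cancel_rate"]
    # Ignore drivers with too few requests for the rate to be meaningful
    summary = summary[summary["total_requests"] >= min_requests]
    return summary.sort_values("cancel_rate", ascending=False)

rates = cancellation_rate_by_driver(sample)
print(rates)
```

Raising `min_requests` keeps a driver with one request and one cancellation from topping the list.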

#### Chart - 14 - Correlation Heatmap

# Correlation Heatmap visualization code

df_numeric = df.select_dtypes(include=np.number)
correlation_matrix = df_numeric.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an excellent choice for visualizing the correlation matrix of numerical variables. It uses color intensity to represent the strength and direction of correlations between pairs of variables, making it easy to quickly identify which variables are positively or negatively correlated and how strongly. The `annot=True` parameter adds the correlation values directly onto the heatmap, enhancing readability.


##### 2. What is/are the insight(s) found from the chart?

The heatmap shows the correlation coefficients between all pairs of numerical columns in the dataset (`Request_hour`, `Request_minute`, `Drop_hour`, `Drop_minute`, `Trip_hours`, `Trip_minutes`). The main points to look for are:
- A strong positive correlation between `Request_hour` and `Drop_hour`, which is expected if completed trips tend to end within the same hour bracket in which they were requested.
- A correlation of 1.00 between `Trip_hours` and `Trip_minutes` on this dataset (after the data wrangling steps), which is expected because one is simply a unit conversion of the other.
- Correlations (or the lack of them) between temporal features such as `Request_hour` and trip duration (`Trip_minutes`), which could indicate whether the time of day influences how long trips typically are.
- Other correlations between the time components, which might be weak or moderate depending on the distribution of trip start and end times.
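The perfect correlation between `Trip_hours` and `Trip_minutes` carries no information, so one of the two columns could be dropped before plotting. The sketch below illustrates why the value is exactly 1.00 and one way to filter such columns. It assumes `Trip_hours` is `Trip_minutes / 60`; the `trips` data, the helper `drop_redundant`, and its `threshold` parameter are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real columns come from the notebook's wrangling steps.
trips = pd.DataFrame({
    "Trip_minutes": [35.0, 52.0, 41.0, 68.0, 47.0],
    "Request_hour": [5, 9, 18, 21, 7],
})
trips["Trip_hours"] = trips["Trip_minutes"] / 60  # assumed conversion

# Pearson correlation is unchanged by positive rescaling, so this is 1.0 up to rounding
r = trips.corr().loc["Trip_minutes", "Trip_hours"]

def drop_redundant(frame, threshold=0.999):
    """Drop the later column of every pair whose absolute correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

reduced = drop_redundant(trips)
print(round(r, 2), list(reduced.columns))
```

Passing the reduced frame to `sns.heatmap` would leave more room for the correlations that are not fixed by construction.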


#### Chart - 15 - Pair Plot

# Pair Plot visualization code
sns.pairplot(df_numeric)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is used to visualize the relationships between all pairs of numerical variables in the dataset (`df_numeric`). It creates a grid of scatter plots for each pair of variables and a histogram (or KDE) on the diagonal for each individual variable. This allows for a quick visual exploration of potential correlations, distributions, and patterns among the numerical features.


##### 2. What is/are the insight(s) found from the chart?

The pair plot provides a matrix of visualizations:
- The diagonal histograms show the distribution of each numerical variable (`Request_hour`, `Request_minute`, `Drop_hour`, `Drop_minute`, `Trip_hours`, `Trip_minutes`), so we can see whether each is roughly normal, skewed, or has multiple peaks. For instance, `Request_hour` will likely show peaks corresponding to commuting times.
- The off-diagonal scatter plots show the relationship between each pair of variables, where we can look for linear relationships, clusters, or non-linear patterns. The plot of `Trip_hours` against `Trip_minutes` confirms their strong linear relationship, while `Request_hour` against `Trip_minutes` might show whether longer or shorter trips are more common at certain hours.
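The expected peaks in `Request_hour` can be confirmed with a simple count per hour instead of judging the histogram by eye. This is a minimal sketch using made-up values; `request_hours` is a hypothetical stand-in for `df['Request_hour']`, and the choice of the three busiest hours is arbitrary.

```python
import pandas as pd

# Hypothetical request hours; on the real data this would be df['Request_hour'].
request_hours = pd.Series([5, 6, 8, 8, 8, 13, 18, 18, 19, 19, 19, 19, 22],
                          name="Request_hour")

# Count requests per hour and include hours with no requests as zero
hourly_counts = request_hours.value_counts().reindex(range(24), fill_value=0)

# Hours with the most requests, busiest first
busiest_hours = hourly_counts.nlargest(3).index.tolist()
print(busiest_hours)
```

If the busiest hours on the real data cluster in the morning and evening, that supports the commuting-time reading of the histogram.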


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

# **Conclusion**

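The analysis shows that Uber's supply-demand gap is concentrated in the Night and Early Morning time slots, especially for trips from the airport to the city. At night, the gap comes mainly from a lack of active drivers, so riders are frequently told that no cabs are available. In the early morning and morning slots, it comes mainly from driver cancellations, even when cabs are technically available. Both patterns lead to poor customer experience, lost business, and reduced rider trust in the platform.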
