<a href="https://colab.research.google.com/github/arjun101sharma/Arjun-Sharma/blob/main/AB_Regression_Capstone_Project_NYC_Taxi_Trip_Time_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - NYC Taxi Trip Time Prediction.


#### **Project Type**    - Regression.
#### **Contribution**    - Individual.
#### **Team Member 1**-  $\color{red}{\text{Arjun Sharma}}$

# **Project Summary -**


**Data Preprocessing** :

1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Data Cleaning and Feature Engineering

**Exploratory data analysis(EDA) :**

1. The distribution of Number of Pickups and Dropoffs on each part of the day.

2. A line plot for the relationship between 'pickup_month' and 'trip_duration'.

3. Distribution of number of Pickups During 24 Hours.

4. Grouping the data in the 'taxi_df' DataFrame by "pickup_month" and "vendor_id".

5. Create a count plot using seaborn, using the "passenger_count" column from the dataset.

6. Group the data by the categorized trip durations and count the number of trips in each category.

7. Histograms and box plots for each numeric feature in the "taxi_df" DataFrame, to visually compare their distributions and central tendencies (mean and median).

8. **Correlation Heatmap** - Generates a heatmap to visualize the correlation matrix of the NYC Taxi Trip Time Prediction dataset.

9. **Pair Plot**- A pair plot is a grid of scatterplots, where variables are plotted against each other. If you have a dataset with multiple numeric variables, a pair plot allows you to visualize the relationships between these variables.



**Supervised Regression Machine learning algorithms and implementation :**

*   XGBoost
*   Random Forest
*   Decision Tree
*   Gradient Boosting
*   Linear Regression

**Project Summary:** Predicting NYC Taxi Trip Time Duration

The NYC Taxi Time Prediction project aimed to forecast the duration of taxi trips in New York City, leveraging a comprehensive dataset comprising factors like pickup/dropoff locations, time of day, and weather conditions. This regression-focused endeavor employed machine learning techniques to create a robust predictive model.

**Data and Features:**
The project utilized a diverse dataset encompassing over 1.5 million taxi trips. Various features were employed in the regression model, including distance, pickup and dropoff coordinates, pickup datetime, day of the week, and weather conditions like temperature, precipitation, and wind speed.

**Methodology:**
The dataset was randomly split into training and testing sets to facilitate model training and evaluation. Several machine learning algorithms were explored: Linear Regression, Decision Trees, Random Forest, Gradient Boosting, and XGBoost. Each model's performance was assessed using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R2 Score, and Adjusted R2 Score.
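
As a rough illustration of the four metrics named above, the following sketch computes them using only the Python standard library. The function name `regression_metrics` and the sample numbers are illustrative assumptions and are not taken from the project; `n_features` is the number of predictors used in the Adjusted R2 correction.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, R2 and Adjusted R2 for paired actual/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises R2 for the number of predictors.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": rmse, "r2": r2, "adjusted_r2": adjusted_r2}


# Hypothetical actual and predicted trip durations (minutes).
actual = [10.0, 12.5, 7.0, 20.0, 15.0]
predicted = [11.0, 12.0, 8.5, 18.0, 15.5]
print(regression_metrics(actual, predicted, n_features=2))
```

Adjusted R2 is always at most R2 and drops as more features are added without a matching gain in fit, which is why it is reported alongside R2.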

**Results:**
The regression model emerged as the top performer, surpassing other algorithms in accuracy. It achieved an impressive R2 score of 67%, indicating its strong predictive capabilities. The model's accuracy was further validated through comparisons with alternative machine learning approaches.

**Conclusion:**
The NYC Taxi Time Prediction project showcased the potential of regression models in accurately forecasting taxi trip durations. By leveraging a combination of location data, temporal information, and weather conditions, the model demonstrated its effectiveness in capturing the complexities of New York City taxi journeys. This project not only highlights the power of predictive analytics in the transportation sector but also underscores the significance of feature selection and algorithm choice in enhancing prediction accuracy.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Problem Statement:**

The New York City (NYC) Taxi Trip Time Duration prediction project faces several challenges and observations that require careful consideration and strategic solutions. The primary concerns include the presence of outliers in the dataset, potential overfitting of models, and the impact of data removal on model performance.

1. **Outlier Management:**
   The dataset exhibits a substantial presence of outliers, some of which are very close to zero. Attempting to remove these outliers resulted in significant data loss, impacting the overall dataset integrity. The challenge lies in effectively managing these outliers without compromising the dataset's size and quality.

2. **Model Overfitting:**
   Concerns have been raised about potential overfitting of the models. While fears were dispelled as the models consistently performed well on both training and test datasets, it is crucial to implement strategies to ensure the models' generalizability. Particularly, the XG Boost and Random Forest models exhibited remarkable alignment between actual and predicted values, indicating their potential. However, careful consideration is needed to prevent overfitting and ensure reliable predictions.

3. **Evaluation Metrics:**
   Notably, the R-squared (R2) scores were considerably high, signifying the models' ability to explain the variance in the data. Additionally, the Mean Squared Error (MSE) scores were low, meeting the criteria for a well-performing model. Despite these positive indications, there is a need to delve deeper into the nuances of these metrics and explore other relevant evaluation techniques to ensure the accuracy and reliability of the models.

4. **Data Integrity and Feature Engineering:**
   It was observed that removing data led to a significant loss of valuable information. Moreover, the introduction of a new column, even if highly correlated with existing features, yielded seemingly favorable results, potentially indicating pseudo-good results. There is a need to explore innovative methods of feature engineering and carefully assess the impact of new features on model performance, ensuring that they genuinely contribute to the predictive power of the models.

5. **Model Selection and Tuning:**
   The Random Forest model provided the best R2 score, indicating its potential for accurate predictions. However, to prevent overfitting, meticulous tuning of model parameters is necessary. The challenge lies in finding the optimal balance between model complexity and stability, ensuring that the model is robust and reliable in making predictions.

Addressing these challenges requires a comprehensive approach, involving advanced outlier detection methods, rigorous evaluation techniques, innovative feature engineering strategies, and meticulous model tuning. By overcoming these challenges, the project aims to build a robust and reliable predictive model for NYC Taxi Trip Time Duration, providing valuable insights for both passengers and service providers in optimizing travel experiences and operational efficiency.
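
One common way to limit the influence of outliers without dropping rows is to cap values at the interquartile-range (IQR) fences. The sketch below is only an illustration of that idea under stated assumptions: the helper name `iqr_cap`, the multiplier `k=1.5`, and the sample durations are hypothetical, and this is not necessarily the approach taken in this notebook.

```python
import statistics


def iqr_cap(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of removing them."""
    # method="inclusive" interpolates linearly between observed data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical trip durations in seconds, with one extreme value.
durations = [300, 420, 480, 600, 660, 720, 900, 86000]
capped = iqr_cap(durations)
print(len(capped) == len(durations))  # every row is kept
print(max(capped))                    # the extreme value is pulled in to the upper fence
```

Capping keeps the dataset size unchanged, which addresses the data-loss concern, but it does alter the affected values, so its effect on model performance still has to be checked.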

# **General Guidelines** : -  


1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For statistical data visualization
import math  # Standard-library math module (the numpy.math alias is deprecated and missing from newer NumPy releases)
from sklearn.preprocessing import StandardScaler  # For standardizing features by removing the mean and scaling to unit variance
from sklearn.preprocessing import MinMaxScaler  # For scaling features to a range
from sklearn.model_selection import train_test_split  # For splitting the dataset into training and testing sets
from sklearn.linear_model import LinearRegression  # For linear regression modeling
from sklearn.model_selection import GridSearchCV  # For performing grid search cross-validation
from sklearn.tree import DecisionTreeRegressor  # For decision tree regression modeling
from sklearn.ensemble import RandomForestRegressor  # For random forest regression modeling
from sklearn.metrics import r2_score  # For calculating R-squared score
from sklearn.metrics import mean_squared_error  # For calculating mean squared error
import warnings
warnings.filterwarnings("ignore")  # Ignore warnings during code execution


### Dataset Loading

In [None]:
# This code snippet demonstrates the use of the chardet library to detect the character encoding of a given CSV file.
file = "/content/NYC Taxi Data.csv"
import chardet

# The file path is specified, and the chardet library is used to analyze the first 100,000 bytes of the file's raw binary data.
with open(file, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# The detected character encoding information is then stored in the 'result' variable, which can be used to determine the appropriate encoding for reading the file.
result

In [None]:
# Load Dataset
taxi_df = pd.read_csv(file)

### Dataset First View

In [None]:

# Dataset First Look
# Dataset head Look
taxi_df.head() # Display the first 5 rows of the DataFrame.

In [None]:
# Dataset tail Look
taxi_df.tail()  # Display the last 5 rows of the DataFrame.

### Dataset Rows & Columns count

In [None]:
# Display the number of rows in the dataset using the 'shape' attribute
print(f'Number of rows in the data set is {taxi_df.shape[0]}')

# Display the number of columns in the dataset using the 'shape' attribute
print(f'Number of Columns in the data set is {taxi_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
taxi_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_counts = taxi_df.duplicated(keep=False).sum()

# Display the count of duplicate rows
print("Number of duplicate rows:", duplicate_counts)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing values in each column
missing_values_count = taxi_df.isnull().sum()
print(missing_values_count)

In [None]:
# Visualizing the missing values

# Setting the figure size for the visualization
plt.figure(figsize=(15, 5))

# Creating a heatmap to visualize missing values in the DataFrame
sns.heatmap(taxi_df.isnull(), cmap='plasma', annot=False, yticklabels=False)

# Adding a title to the visualization
plt.title("Visualizing Missing Values")

# Displaying the visualization
plt.show()

### What did you know about your dataset?

**Answer Here.**
#### The data set has 3390 rows and 18 columns.
#### The data types in our data set are: float64(9), int64(6), object(2).
#### There are zero duplicate rows.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
taxi_df.columns

In [None]:
# Dataset Describe
taxi_df.describe(include='all')

### Variables Description

**Answer Here.**

To create a predictive model for NYC taxi trip time, you need a set of variables (features) that can influence the duration of a taxi trip. Here are some potential variables you could consider for your prediction model:

1. **Pickup and Dropoff Locations:**
   - **Pickup Longitude:** Longitude of the pickup location.
   - **Pickup Latitude:** Latitude of the pickup location.
   - **Dropoff Longitude:** Longitude of the dropoff location.
   - **Dropoff Latitude:** Latitude of the dropoff location.
   - **Distance:** The distance between the pickup and dropoff points, in miles (derived from the coordinates during data wrangling).


2. **Time and Date:**
   - **Month:** The month in which the trip occurred.
   - **pickup_datetime:** The date and time when the trip started.
   - **dropoff_datetime:** The date and time when the trip ended.

3. **Taxi Specifics:**
   - **store_and_fwd_flag:** Indicates whether the trip record was held in the vehicle's memory before being sent to the vendor because the vehicle had no connection to the server. Y = store and forward; N = not a store-and-forward trip.

4. **Additional Features:**
   - **passenger_count:** The number of passengers in the taxi.



5. **Target Variable:**
   - **trip_duration:** Duration of the trip, in seconds.





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Display unique values for each variable
for column in taxi_df.columns:
    unique_values = taxi_df[column].nunique()
    print(f"Number of Unique values for {column}:\n{unique_values}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Import the required function from the geopy library for calculating distances between geographic points.
from geopy.distance import great_circle

# Calculate the distance for each row in the DataFrame and add it as a new column "Distance".

# Define a lambda function to calculate the great-circle distance between pickup and dropoff points.
# The lambda function takes a row from the DataFrame as input and returns the calculated distance.
# The `great_circle` function takes two pairs of (latitude, longitude) as input and returns the Distance.
taxi_df["Distance"] = taxi_df.apply(lambda row: great_circle((row["pickup_latitude"], row["pickup_longitude"]),(row["dropoff_latitude"], row["dropoff_longitude"])), axis=1)
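
For readers who want to see what a great-circle distance involves, it can be approximated with the haversine formula. The sketch below is an assumption-labelled illustration, not geopy's internal implementation: the function name `haversine_miles`, the mean Earth radius of 6371.009 km, and the sample coordinates are all illustrative choices, and results may differ slightly from `great_circle`.

```python
import math

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius
KM_PER_MILE = 1.609344


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    distance_km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
    return distance_km / KM_PER_MILE


# Hypothetical pickup and dropoff coordinates in Manhattan.
print(haversine_miles(40.7679, -73.9822, 40.7656, -73.9646))
```

Because the function works on plain floats, it returns a number directly, whereas geopy returns a distance object whose `.miles` attribute has to be read afterwards.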


In [None]:
# calculate trip duration in minute
taxi_df["trip_duration_in_minute"]=taxi_df["trip_duration"]/60

In [None]:
# Converting into proper data format
# These lines of code are crucial for transforming the original string-based datetime information into a format that can be easily manipulated and analyzed using Pandas and other libraries.
taxi_df["pickup_datetime"]=pd.to_datetime(taxi_df["pickup_datetime"])
taxi_df["dropoff_datetime"]=pd.to_datetime(taxi_df["dropoff_datetime"])

In [None]:
#finding pickup and drop month
# The code is a concise and efficient way to derive month information from datetime values and enhance the DataFrame with additional columns for subsequent analysis.
taxi_df["pickup_month"]=taxi_df["pickup_datetime"].dt.month
taxi_df["dropoff_month"]=taxi_df["dropoff_datetime"].dt.month

In [None]:
#creating pickup and dropoff day
# Extracting day of the month for both pickup and dropoff timestamps
taxi_df["pickup_day"] = taxi_df["pickup_datetime"].dt.day  # Extracts the day of the month for pickup
taxi_df["dropoff_day"] = taxi_df["dropoff_datetime"].dt.day  # Extracts the day of the month for dropoff


In [None]:
# finding pickup and dropoff weekday
# This code is enhancing the taxi_df DataFrame by including the day of the week on which each taxi ride was picked up and dropped off.
taxi_df["pickup_weekday"]=taxi_df["pickup_datetime"].dt.weekday
taxi_df["dropoff_weekday"]=taxi_df["dropoff_datetime"].dt.weekday

In [None]:
# creating pickup and dropoff hours
# Extracting the hour component from the pickup_datetime and dropoff_datetime columns
taxi_df["pickup_hour"] = taxi_df["pickup_datetime"].dt.hour
taxi_df["dropoff_hour"] = taxi_df["dropoff_datetime"].dt.hour
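
To make the extracted values concrete, the following standard-library sketch derives the same four components from a single timestamp. The helper name `datetime_features`, the timestamp string, and its `YYYY-MM-DD HH:MM:SS` format are assumptions made for illustration. Note that `weekday()` numbers Monday as 0 and Sunday as 6, the same convention as pandas' `.dt.weekday`.

```python
from datetime import datetime


def datetime_features(timestamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into month, day, weekday and hour."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "month": dt.month,
        "day": dt.day,
        "weekday": dt.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
    }


# Illustrative pickup timestamp (a Monday afternoon).
print(datetime_features("2016-03-14 17:24:55"))
```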

In [None]:
# make a function to check null values and unique values
def information() : # This line defines a function named information.
    x = pd.DataFrame(index=taxi_df.columns)  # Within the function, a new DataFrame named x is created.
    x["data type"] = taxi_df.dtypes
    x["null values"] = taxi_df.isnull().sum()
    x["unique"] = taxi_df.nunique()
    return x

In [None]:
# Convert 'store_and_fwd_flag' column values from categorical to numerical
# Replace 'Y' with 1 and 'N' with 0
taxi_df['store_and_fwd_flag'] = taxi_df['store_and_fwd_flag'].replace({'Y': 1, 'N': 0})

In [None]:
# Convert the 'distance' column to float
taxi_df['Distance'] = taxi_df['Distance'].apply(lambda x: float(x.miles))

In [None]:
# Cast the pickup month and pickup day columns to float.
taxi_df["pickup_month"] = taxi_df['pickup_month'].astype(float)
taxi_df["pickup_day"] = taxi_df["pickup_day"].astype(float)

In [None]:
# Defensive re-cast of "pickup_day" to float: report the error instead of raising if the conversion fails.
try:
    taxi_df["pickup_day"] = taxi_df["pickup_day"].astype(float)
except ValueError as e:
    print("Error:", e)
    print("Conversion to float failed.")

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **Chart 1**

# The distribution of Number of Pickups and Dropoffs on each part of the day.



In [None]:
# visualization code

# Divide the day into four parts based on the integer hour (0-23).
def determine_time_of_day(hour_input):
    # Morning: hours 6 through 10.
    if 6 <= hour_input <= 10:
        return "Morning"
    # Midday: hours 11 through 16.
    elif 11 <= hour_input <= 16:
        return "Midday"
    # Evening: hours 17 through 22.
    elif 17 <= hour_input <= 22:
        return "Evening"
    # Late night: hours 23 through 5.
    else:
        return "Late Night"

# Example usage:
hour = 14  # Replace this with the hour you want to determine the time of day for.
time_of_day = determine_time_of_day(hour)
print(f"The input hour {hour} corresponds to: {time_of_day}")


In [None]:
# Apply the determine_time_of_day function to create new columns for the pickup and dropoff part of the day.
taxi_df["pickup_time_of_day"] = taxi_df.pickup_hour.apply(determine_time_of_day)
taxi_df["dropoff_time_of_day"] = taxi_df.dropoff_hour.apply(determine_time_of_day)

In [None]:
# Distribution of the no of pickups and Dropoffs in a day

# Create a figure with two subplots side by side
figure, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))

# Create a count plot for the distribution of pickups by part of the day
pickup_plot = sns.countplot(x='pickup_time_of_day', data=taxi_df, ax=ax[0])
ax[0].set_title('Number of Pickups During Different Parts of the Day')
ax[0].set_xlabel('Time of Day')
ax[0].set_ylabel('Number of Pickups')

# Annotate bars with their counts for pickup plot
for p in pickup_plot.patches:
    pickup_plot.annotate(format(p.get_height(), '.0f'),
                         (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center', va='center',
                         xytext=(0, 9),
                         textcoords='offset points')

# Create a count plot for the distribution of dropoffs by part of the day
dropoff_plot = sns.countplot(x='dropoff_time_of_day', data=taxi_df, ax=ax[1])
ax[1].set_title('Number of Dropoffs During Different Parts of the Day')
ax[1].set_xlabel('Time of Day')
ax[1].set_ylabel('Number of Dropoffs')

# Annotate bars with their counts for dropoff plot
for p in dropoff_plot.patches:
    dropoff_plot.annotate(format(p.get_height(), '.0f'),
                          (p.get_x() + p.get_width() / 2., p.get_height()),
                          ha='center', va='center',
                          xytext=(0, 9),
                          textcoords='offset points')

# Display the plots
plt.tight_layout()
plt.show()

#### **1. Why did you pick the specific chart?**



#### **Answer Here.**

The charts used here are count plots (bar charts) showing the distribution of taxi pickups and dropoffs across different parts of the day. The code divides the day into four categories (morning, midday, evening, and late night) and then visualizes the number of pickups and dropoffs in each category.

**Understanding time-based patterns:** The objective of the visualization is to understand the distribution of taxi pickups and dropoffs across different parts of the day. Count plots display the frequency of categorical data, which makes them well suited to visualizing the number of pickups and dropoffs in each part of the day.
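
The bar heights in a count plot are simply category frequencies, so they can be cross-checked numerically. The sketch below is illustrative only: `part_of_day` is a compact stand-in that mirrors the effective behaviour of `determine_time_of_day` (the first matching branch wins, so hour 10 counts as Morning), and the list of hours is made up.

```python
from collections import Counter


def part_of_day(hour):
    """Map an integer hour (0-23) to one of four parts of the day."""
    if 6 <= hour <= 10:
        return "Morning"
    if 11 <= hour <= 16:
        return "Midday"
    if 17 <= hour <= 22:
        return "Evening"
    return "Late Night"


# Hypothetical pickup hours.
pickup_hours = [7, 9, 12, 15, 18, 19, 21, 23, 2]
counts = Counter(part_of_day(h) for h in pickup_hours)

# Each count corresponds to the height of one bar in the count plot.
for label, count in counts.most_common():
    print(label, count)
```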


#### **2. What is/are the insight(s) found from the chart?**



#### **Answer Here.**

**Insights from the Charts:**

**Pickup Distribution:**

*   The first chart shows the number of pickups during different parts of the day.
*   Most pickups occur during the evening period (4 PM to 10 PM), indicating high demand for taxis in the evening hours.
*   Pickup counts are relatively lower in the morning (6 AM to 10 AM) and midday (10 AM to 4 PM) periods.
*   There is moderate demand for taxis in the late night period (10 PM to 6 AM).


**Dropoff Distribution:**

*   The second chart represents the number of dropoffs during different times of the day.
*   Similar to pickups, the majority of dropoffs happen in the evening period, suggesting that people tend to travel more during the evening hours and need taxis to reach their destinations.
*   Dropoff counts are lower during the morning and midday periods.
*   The late night period also shows a moderate number of dropoffs, indicating that taxi services are still in demand during these hours, albeit less than in the evening.


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**
The analysis categorizes pickup and dropoff times in the NYC Taxi Trip Time Prediction dataset into four parts of the day (morning, midday, evening, late night) and visualizes the distribution of pickups and dropoffs in each.

**Positive Business Impact:**

**Operational Efficiency:** Understanding peak pickup and dropoff times can help taxi companies allocate their resources more efficiently. They can deploy more drivers during high-demand periods, reducing customer wait times and increasing overall customer satisfaction.

**Optimized Pricing Strategies:** By recognizing high-demand periods, taxi companies can implement dynamic pricing models. Higher prices during peak times can increase revenue, while lower prices during off-peak hours can attract more customers.

**Insights Leading to Negative Growth:**

**Overcrowding during Peak Hours:** If the analysis reveals consistent overcrowding during peak hours, it could lead to negative customer experiences. Passengers might find it difficult to get a taxi, leading to frustration and potential loss of customers.

**Increased Operational Costs:** If demand peaks are too high and consistent, taxi companies might need to invest in expanding their fleet. While this can cater to high demand, it also leads to increased operational costs, which need to be balanced with the increased revenue.



*   Evening is the busiest part of the whole day, likely because people take taxis home from the office or out for dinner and parties. Midday sees the next-highest number of rides, possibly because many people travel to the office after 10:00 AM.



## **Chart 2**

# A line plot for the relationship between 'pickup_month' and 'trip_duration'.



In [None]:
# visualization code
taxi_df["pickup_month"] = taxi_df['pickup_month'].astype(float)

In [None]:
# Set the style and context for the plot
sns.set(style="whitegrid", font_scale=1.2)

# Creating a line plot for the relationship between 'pickup_month' and 'trip_duration'
plt.figure(figsize=(10, 6))
sns.lineplot(x='pickup_month', y='trip_duration', data=taxi_df, color='green', linewidth=2, marker='o')

# Adding labels and title to the plot
plt.xlabel('Month', fontsize=14, labelpad=12)
plt.ylabel('Trip Duration (seconds)', fontsize=14, labelpad=12)
plt.title('Trip Duration Variation Over Months', fontsize=16, pad=20)

# Customize x-axis ticks (if needed)
# plt.xticks(ticks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Add grid for better readability
plt.grid(True, linestyle='--', alpha=0.7)

# Display the plot
plt.tight_layout()
plt.show()


#### **1. Why did you pick the specific chart?**





#### **Answer Here.**

The chosen chart in the provided code is a line plot representing the relationship between the 'pickup_month' (x-axis) and 'trip_duration' (y-axis) for the NYC Taxi Trip time Prediction dataset.

**The line plot is suitable for this scenario for several reasons:**

**Time Series Data Representation:** Line plots are well suited to visualizing data points over an ordered interval or time series. In this case, the x-axis represents months, which are discrete but ordered in time, so a line plot can show how the value changes from one month to the next.

**Sequential Data:** Line plots are used to display data points in sequence and to show trends over a period of time. In this plot, months are sequential and have a specific order, making it meaningful to connect them with lines to observe the trend in trip durations over the months.
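
When several rows share the same x value, seaborn's `lineplot` by default aggregates the y values with the mean and draws a confidence band around it, so each marker in this chart is the mean trip duration for that month. The sketch below reproduces only the mean aggregation with the standard library; the function name `mean_duration_by_month` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_duration_by_month(records):
    """Average trip duration per month from (month, duration_seconds) pairs."""
    grouped = defaultdict(list)
    for month, duration in records:
        grouped[month].append(duration)
    # Sort by month so the result follows calendar order, like the x-axis.
    return {month: mean(values) for month, values in sorted(grouped.items())}


# Hypothetical (pickup_month, trip_duration) pairs.
trips = [(1, 600), (1, 900), (2, 750), (3, 300), (3, 500), (3, 700)]
monthly_mean = mean_duration_by_month(trips)
print(monthly_mean)
```

Because the mean is sensitive to extreme trip durations, a month with a few very long trips can pull its marker upward even when typical trips are unchanged.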


#### **2. What is/are the insight(s) found from the chart?**





#### **Answer Here.**


The chart visualizes the relationship between the pickup month and the trip duration for the NYC Taxi Trip Time Prediction dataset. The following kinds of insight can be derived from it:

**Seasonal Patterns:** If the x-axis represents the months of the year, the chart can provide insights into the seasonal patterns of taxi trip durations. For instance, there might be longer trip durations during certain months, possibly indicating a busy tourist season or weather-related factors.

**Monthly Variations:** The chart can reveal any variations in trip durations from month to month. For example, there might be a consistent increase or decrease in trip durations over several months, suggesting a trend.


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**


**Positive Business Impact:**
The line plot visualizes how trip duration varies across different months of the year. Positive insights that can lead to a business impact include identifying patterns or trends in trip durations. For example:

**Seasonal Demand Understanding:** If the plot shows a clear pattern where trip durations increase during specific months (such as summer vacation or holiday seasons), the taxi service can anticipate increased demand during these periods. This insight can help in optimizing the fleet and staffing levels to efficiently handle the higher demand, leading to better customer satisfaction and potentially increased revenue.

**Promotion and Pricing Strategies:** If there are months with consistently shorter trip durations, taxi companies can introduce promotional offers or discounts during these periods to attract more customers. Additionally, understanding the variations can help in dynamic pricing strategies, ensuring that prices align with demand, maximizing profitability during peak times, and encouraging ridership during off-peak periods.


**Negative Business Impact:**
The chart alone does not establish a negative trend, but patterns such as the following would be a cause for concern:

**Unpredictable Spikes in Trip Duration:** If there are random and unpredictable spikes in trip durations across various months, it might indicate issues such as traffic congestion, road closures, or other factors causing delays. In this case, taxi services might face challenges in providing reliable and timely transportation, leading to customer dissatisfaction and potential loss of business.

**Consistently Long Trip Durations:** If the plot consistently shows long trip durations across all months, it might suggest inefficiencies in routes, traffic management, or service optimization. Prolonged trip durations could discourage customers from using the service, leading to decreased ridership and revenue.





*   From February, we can see trip duration rising every month.



# Chart-3

# Distribution of the number of pickups and dropoffs on each day of the week.



In [None]:
# visualization code

# Create a 1x2 grid of subplots with a combined figure size of 15x5 inches
figure, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))

# Create a countplot for the "pickup_day" column on the left subplot (ax[0])
pickup_plot = sns.countplot(x="pickup_day", data=taxi_df, ax=ax[0])

# Set a title for the left subplot
ax[0].set_title('No. of pickups done on each day')

# Annotate each bar with its count for the left subplot (cast to int to avoid labels like "1234.0")
for p in pickup_plot.patches:
    pickup_plot.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                         textcoords='offset points')

# Create a countplot for the "dropoff_day" column on the right subplot (ax[1])
dropoff_plot = sns.countplot(x='dropoff_day', data=taxi_df, ax=ax[1])

# Set a title for the right subplot
ax[1].set_title('No. of dropoff done on each day')

# Annotate each bar with its count for the right subplot (cast to int to avoid labels like "1234.0")
for p in dropoff_plot.patches:
    dropoff_plot.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                          ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                          textcoords='offset points')

# Display the complete figure with both subplots
plt.tight_layout()
plt.show()

#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

A countplot is a type of categorical plot provided by the seaborn library. It shows the counts of observations in each categorical bin using bars. In this case, the countplot is used to visualize the distribution of taxi pickups and dropoffs across different days of the week.

**Here's why the countplot was chosen for this scenario:**

**Categorical Data Representation:** The variables being plotted, i.e., "pickup_day" and "dropoff_day," are categorical in nature. Countplots are ideal for visualizing the distribution of categorical variables. Each bar in the countplot represents the count of occurrences of a specific category.

**Comparison between Categories:** The countplot allows for a clear visual comparison between the number of pickups and dropoffs on different days of the week. By placing the plots side by side, it's easy to compare the patterns between the two categories.


#### **2. What is/are the insight(s) found from the chart?**



#### **Answer Here.**

The provided code creates a side-by-side comparison of the number of pickups and dropoffs made on each day of the week using the NYC Taxi Trip Time Prediction dataset. Here are the insights that can be derived from the chart:

**Peak Pickup and Dropoff Days:**
By observing the left subplot (No. of pickups done on each day) and the right subplot (No. of dropoffs done on each day), you can identify the days with the highest number of taxi pickups and dropoffs. These days are likely to be the busiest for taxi services. For example, if Wednesday has the highest bars in both plots, it suggests that Wednesdays are the peak days for taxi activities.

**Comparison of Pickup and Dropoff Patterns:**
By comparing the heights of the bars in the left and right subplots for each day, you can gain insights into the balance between pickups and dropoffs on different days. For instance, if the pickups are significantly higher than dropoffs on weekends (Saturday and Sunday), it could indicate that people are using taxis to go out and socialize but not necessarily returning with taxis. On the other hand, if the dropoffs are higher than pickups on weekdays, it might mean people are using taxis to commute to work or other destinations and returning home with alternative transportation.
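
The comparison between pickup and dropoff counts can be made explicit with a small table. This is a sketch on toy data (made-up rows, not the real dataset); it assumes that `pickup_day` and `dropoff_day` hold weekday names, and `day_order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Hypothetical helper: weekday order used to sort the counts
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Toy stand-in for taxi_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "pickup_day":  ["Friday", "Friday", "Saturday", "Sunday", "Monday"],
    "dropoff_day": ["Friday", "Saturday", "Saturday", "Monday", "Monday"],
})

# Count trips per weekday; days with no trips get a count of 0
pickups = toy_df["pickup_day"].value_counts().reindex(day_order, fill_value=0)
dropoffs = toy_df["dropoff_day"].value_counts().reindex(day_order, fill_value=0)

# Side-by-side table with the pickup minus dropoff difference per day
day_counts = pd.DataFrame({"pickups": pickups, "dropoffs": dropoffs})
day_counts["difference"] = day_counts["pickups"] - day_counts["dropoffs"]

busiest_pickup_day = day_counts["pickups"].idxmax()
print(day_counts)
print("Busiest pickup day:", busiest_pickup_day)
```

A non-zero difference mostly reflects trips that start late on one day and end on the next, since every trip has exactly one pickup and one dropoff.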


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**


**Positive Business Impact:**

**Peak Days:** The visualization helps identify the days of the week with the highest demand for taxi services. This information can be used to allocate more resources (taxis and drivers) on those days, ensuring efficient service and higher customer satisfaction.

**Demand Patterns:** Recognizing patterns in pickup and dropoff days can allow the business to predict future demand accurately. This predictive ability helps in optimizing taxi dispatching and ensures that taxis are available where and when they are needed the most.

**Customer Behavior Analysis:** By understanding which days have the highest demand, the business can analyze the reasons behind these patterns. For example, if Fridays and Saturdays have significantly more pickups and dropoffs, it might indicate a high demand for nightlife transportation. This insight could lead to targeted marketing or special offers during these times to attract more customers.


**Negative Growth:**

**Inefficient Resource Allocation:** If the business does not analyze these patterns and fails to allocate resources efficiently, there could be periods of unmet demand (not enough taxis during peak times) or excess supply (too many taxis during low-demand times). This inefficiency can lead to increased operational costs and decreased customer satisfaction.

**Missed Revenue Opportunities:** Without understanding demand patterns, the business might miss opportunities to increase prices during high-demand periods (such as holidays or events). Failing to capitalize on these opportunities can result in missed revenue potential.

**Poor Customer Experience:** Inconsistencies in service availability can lead to poor customer experiences. If customers frequently find it difficult to get a taxi on the busiest days, they might switch to alternative transportation options, leading to customer loss and negative growth.

# Chart-4

# Grouping the data in the 'taxi_df' DataFrame by "pickup_month" and "vendor_id".



In [None]:
# visualization code

# Aggregate vendor id by pickup month
# Grouping the data in the 'taxi_df' DataFrame by "pickup_month" and "vendor_id"
monthly_pickup_by_vendor = taxi_df.groupby(["pickup_month", "vendor_id"]).size()

# Unstacking the grouped data to create a pivot table-like structure
monthly_pickup_by_vendor = monthly_pickup_by_vendor.unstack()

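# Plotting the data as a line chart; figsize is passed to DataFrame.plot because
# it creates its own figure, so a separate plt.figure() call would leave an empty figure behind
monthly_pickup_by_vendor.plot(kind='line', linewidth=2, marker='o', markersize=8,
                              color=['blue', 'green'], figsize=(10, 6))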

# Adding a title
plt.title('Monthly Pickup Count by Vendor', fontsize=16)

# Adding labels to the x and y axes
plt.xlabel('Months', fontsize=14)
plt.ylabel('Number of Trips', fontsize=14)

# Adding a legend to distinguish between vendors
plt.legend(['Vendor 1', 'Vendor 2'], fontsize=12)

# Adding grid lines for better readability
plt.grid(True, linestyle='--', alpha=0.7)

# Adding annotations for specific data points (optional)
# plt.annotate('Peak Month', xy=(peak_month, peak_value), xytext=(peak_month-2, peak_value+5000),
#              arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12)

# Display the plot
plt.show()

#### **1. Why did you pick the specific chart?**


#### **Answer Here.**


**Temporal Data Representation:** Line charts are particularly useful for displaying data points in a continuous manner, making them ideal for showing trends over time. In this case, the data is being analyzed on a monthly basis, so a line chart is appropriate for displaying how the number of trips varies over the months.

**Comparison between Categories:** The line chart allows for a clear comparison between the two vendors (Vendor 1 and Vendor 2) over the months. Each vendor is represented by a different colored line, making it easy for viewers to distinguish between them.


#### **2. What is/are the insight(s) found from the chart?**



#### **Answer Here.**

**Vendor Comparison:** The chart allows for a visual comparison between Vendor 1 and Vendor 2 in terms of their monthly pickup counts. By observing the relative heights of the lines, you can quickly identify which vendor consistently handles more pickups throughout the months.

**Seasonal Patterns:** If there are recurring peaks or troughs in the lines for both vendors, it indicates a seasonal pattern in the number of pickups. For instance, if there is a significant increase in pickups during summer months or holiday seasons, these patterns would be visible on the chart.

**Anomalies or Outliers:** Sudden spikes or drops in the number of pickups can signify anomalies or special events. For example, if there is an unusually high number of pickups in a specific month, it might indicate a local event, festival, or occurrence that led to increased demand for taxis.
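
The vendor comparison and the peak month can be read directly from the grouped table instead of the chart. Below is a minimal sketch on toy data (made-up rows, not the real dataset) that repeats the `groupby(...).size().unstack()` step used for the plot and then derives each vendor's peak month and share of trips.

```python
import pandas as pd

# Toy stand-in for taxi_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "pickup_month": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "vendor_id":    [1, 2, 2, 1, 2, 1, 1, 2, 2, 2],
})

# Rows are months, columns are vendors, values are trip counts
monthly = toy_df.groupby(["pickup_month", "vendor_id"]).size().unstack(fill_value=0)

# Month with the most trips for each vendor
peak_month = monthly.idxmax()

# Each vendor's share of the trips within a month (rows sum to 1)
vendor_share = monthly.div(monthly.sum(axis=1), axis=0)

print(monthly)
print(peak_month)
print(vendor_share.round(2))
```

Looking at shares as well as raw counts helps separate a seasonal change in overall demand from a shift of trips between the two vendors.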

#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason**



#### **Answer Here.**

This analysis groups the NYC Taxi Trip Time Prediction dataset by pickup month and vendor ID, then visualizes the monthly pickup counts for the two vendors (Vendor 1 and Vendor 2) using a line chart.


**Positive Business Impact:**


**Identifying Peak Months:** Understanding which months have the highest demand for taxi services (peak months) can be invaluable for taxi companies. They can allocate more resources (taxis and drivers) during these periods to meet the high demand, thereby improving customer satisfaction and potentially increasing revenue.

**Optimizing Vendor Performance:** By comparing the performance of different vendors, taxi companies can identify which vendor is performing better in terms of the number of trips. Positive insights, such as one vendor consistently outperforming the other, can lead to strategic decisions. For example, collaborating more with the efficient vendor or learning from their strategies to improve the overall service quality.

**Strategic Decision Making:** Insights derived from this analysis can help in making data-driven decisions. For instance, taxi companies can adjust their marketing strategies or promotional offers during specific months to attract more customers. They can also plan maintenance schedules and driver shifts more effectively based on anticipated demand fluctuations.

**Potential Negative Impact:**

**Overlooking Other Factors:** While the number of trips is important, focusing solely on this metric might lead to overlooking other crucial factors affecting business performance, such as customer satisfaction, trip duration, and profitability. Ignoring these factors could result in a negative impact in the long run, as customer experience is a key driver for the taxi service industry.

**Incomplete Insights:** Analyzing only the number of trips by vendors might not provide a comprehensive understanding of customer behavior and preferences. To create a positive impact, it's essential to complement this analysis with other data sources (like customer feedback, weather data, or events happening in the city) to gain a holistic view of the business landscape.



*  Trips for both vendors peak in March and are lowest in January, in February, and after June.



# Chart-5

# Create a count plot using seaborn, using the "passenger_count" column from the dataset.



In [None]:
# visualization code

# Passenger count
taxi_df.passenger_count.value_counts()

In [None]:
# Set the figure size for the plot
plt.figure(figsize=(10, 6))

# Create a count plot using seaborn, using the "passenger_count" column from the dataset
sns.countplot(x=taxi_df["passenger_count"], palette="viridis")  # You can change the palette to your preference

# Set labels for x and y axes
plt.xlabel('Number of Passengers', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Set the title for the plot
plt.title('Distribution of Passenger Count', fontsize=16)

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)

# Display the plot
plt.show()


#### **1. Why did you pick the specific chart?**





#### **Answer Here.**

A count plot is a type of categorical plot that shows the counts of observations in each category using bars. In this case, it displays the distribution of passenger counts in the NYC Taxi Trip Time Prediction dataset.

The choice of this chart is suitable for visualizing the distribution of discrete categorical data, which is the passenger count in this scenario. Here's why this specific chart was picked:

**Categorical Data Representation:** The "passenger_count" column contains discrete categorical data, representing the number of passengers in each taxi trip. Count plots are ideal for visualizing the distribution of such data by displaying the frequency of each category.

**Clarity and Simplicity:** Count plots are simple and easy to interpret. They provide a clear visual representation of the distribution without the complexities of other types of plots, making it accessible for a wide range of audiences.


#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The provided code creates a count plot showing the distribution of passenger counts in the NYC Taxi Trip Time Prediction dataset. The x-axis represents the number of passengers, and the y-axis represents the count of trips for each passenger count. The insights that can be gathered from this chart include:

**Most trips have a single passenger:** The chart shows a peak at a passenger count of 1, indicating that a large portion of the taxi trips in the dataset involve solo passengers. This is a common scenario in taxi services, where individuals often travel alone.

**Decreasing frequency with more passengers:** As the passenger count increases, the number of trips tends to decrease, because fewer trips involve multiple passengers. This insight is valuable for understanding the typical passenger group size in the dataset.
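
The share of solo trips can be quantified with `value_counts(normalize=True)`. The sketch below uses a toy Series (made-up values, not the real dataset) named like the `passenger_count` column.

```python
import pandas as pd

# Toy stand-in for taxi_df["passenger_count"] (hypothetical values, not the real data)
toy_counts = pd.Series([1, 1, 1, 1, 2, 2, 3, 5, 6, 1], name="passenger_count")

# Fraction of trips for each passenger count, ordered by passenger count
share = toy_counts.value_counts(normalize=True).sort_index()

# Fraction of trips with exactly one passenger
solo_share = share.loc[1]

print(share)
print(f"Solo trips: {solo_share:.0%}")
```

Applied to `taxi_df["passenger_count"]`, the same two lines would turn "most trips have a single passenger" into a concrete percentage.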


#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

Analyzing passenger count distribution can provide valuable insights that might help businesses optimize their services. Here's how the insights gained from this plot could potentially impact a taxi service business:

**Positive Business Impact:**

**Optimizing Fleet Size:** By understanding the distribution of passenger counts, taxi companies can optimize their fleet size. For instance, if a significant portion of the rides consists of single passengers, the company might consider having more compact cars in their fleet, which are more fuel-efficient and cost-effective.

**Targeted Marketing:** Knowing the common passenger counts allows for targeted marketing efforts. For example, if a large number of trips involve groups of 4 or more passengers, the company could create special offers or discounts for group rides, thereby attracting more customers and increasing revenue.

**Service Customization:** Companies can customize their services based on passenger count patterns. For instance, if there are many solo travelers during specific times, the company could introduce single-passenger ride packages at discounted rates during those hours, attracting more solo riders.

**Potential Negative Impact (if not analyzed and addressed):**

**Inefficient Resource Allocation:** If a taxi company does not pay attention to the distribution of passenger counts, they might misallocate their resources. For example, if they predominantly have larger vehicles in their fleet but most rides involve single passengers, it could lead to inefficiency and increased operational costs.

**Poor Customer Experience:** If the service does not align with the typical passenger count, customers might have a poor experience. For instance, if a customer books a taxi assuming it can accommodate a group of 6, but the majority of the fleet consists of smaller cars, it can lead to dissatisfaction and negative reviews.

* Most trips are booked by a single passenger, and comparatively few are booked by groups, which suggests that taxis are mostly used by solo travelers.

# Chart-6

# Group the data by the categorized trip durations and count the number of trips in each category.



In [None]:
# visualization code.

# Define the bin boundaries and labels for categorizing trip durations.
bins = [0, 1, 10, 30, 60, 1440, 1440*2, 50000]
labels = ['<1 min', '1-10 mins', '10-30 mins', '30-60 mins', '1-24 hrs', '1-2 days', '2+ days']

# Categorize the trip durations based on the defined bins and labels
taxi_df['trip_duration_category'] = pd.cut(taxi_df['trip_duration_in_minute'], bins=bins, labels=labels)

# Group the data by the categorized trip durations and count the number of trips in each category
trip_duration_counts = taxi_df['trip_duration_category'].value_counts()

# Create a bar plot using the grouped and counted data
plt.figure(figsize=[10, 5])
trip_duration_counts.sort_index().plot(kind='bar', color='skyblue', edgecolor='black')

# Add title and labels to the plot
plt.title('Distribution of Trip Durations')
plt.xlabel('Trip Duration')
plt.ylabel('Number of Trips')
plt.xticks(rotation=45)

# Annotate the bars with exact values
for i, v in enumerate(trip_duration_counts.sort_index()):
    plt.text(i, v + 50, str(v), ha='center', va='bottom')

# Display the plot
plt.show()

#### **1. Why did you pick the specific chart?**

#### **Answer Here.**

A bar chart was chosen to visualize the distribution of trip durations across categories. Here's why this choice makes sense:

**Categorical Data Representation:** The data is categorized into different duration ranges such as '<1 min', '1-10 mins', '10-30 mins' and so on. Bar charts are excellent for representing categorical data, where each category can be represented by a distinct bar.

**Counting Frequencies:** The chart displays the number of trips in each duration category. Bar charts are commonly used to represent frequencies or counts of categorical data, making it easy to understand the distribution of trips across various durations.

**Comparative Analysis:** Bar charts allow for easy comparison between different categories. In this case, it enables a clear visual comparison of the number of trips for each duration category.
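
How `pd.cut` assigns durations to these categories is easiest to see on a few hand-picked values. The sketch below reuses the bin edges and labels from the chart code; the sample durations are made up for illustration. By default the bins are closed on the right, so a value equal to an edge falls into the lower category and a duration of exactly 0 falls outside the first bin.

```python
import pandas as pd

# Same bin edges (in minutes) and labels as in the chart code
bins = [0, 1, 10, 30, 60, 1440, 1440 * 2, 50000]
labels = ['<1 min', '1-10 mins', '10-30 mins', '30-60 mins', '1-24 hrs', '1-2 days', '2+ days']

# Hypothetical sample durations in minutes, chosen to sit on or near bin edges
sample_minutes = pd.Series([0.0, 0.5, 1.0, 10.0, 10.5, 45.0, 2000.0])

# Default right=True gives intervals (0, 1], (1, 10], (10, 30], ...
categories = pd.cut(sample_minutes, bins=bins, labels=labels)

print(pd.DataFrame({"minutes": sample_minutes, "category": categories}))

# Counts per category in bin order; empty categories are kept with a count of 0
category_counts = categories.value_counts().sort_index()
print(category_counts)
```

If zero-minute trips should be counted in the first category, passing `include_lowest=True` to `pd.cut` makes the first interval include its left edge.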


#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

The chart categorizes the trip durations in the NYC Taxi Trip Time Prediction dataset into bins and displays the number of trips falling into each category.

**Most Trips are Short:** The bulk of trips fall within the '1-10 mins' and '10-30 mins' categories, so a typical taxi ride lasts under half an hour.

**Relatively Few Longer Trips:** The '30-60 mins' category contains far fewer trips, which suggests that longer rides are much less common than short ones.

**Negligible Long-Duration Trips:** The categories '1-24 hrs', '1-2 days', and '2+ days' show very few or negligible trips, indicating that extremely long-duration taxi rides are rare in the dataset.








#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**

#### **Answer Here.**

This analysis categorizes the trip durations from the NYC Taxi Trip Time Prediction dataset into specific bins (less than 1 minute, 1-10 minutes, 10-30 minutes, etc.) and visualizes the distribution of trips across these categories using a bar plot. It can provide insights that lead to positive business impact and help identify areas for improvement.


**Positive Business Impact:**

**Demand Analysis:** By understanding the distribution of trip durations, taxi companies can optimize their fleet management. For instance, if there is a high demand for short trips (less than 10 minutes), companies might consider introducing smaller vehicles that are more fuel-efficient, leading to cost savings.

**Pricing Strategy:** Companies can adjust their pricing strategies based on the trip durations. Short trips might have a different pricing model compared to longer trips, and understanding the distribution helps in setting competitive and profitable prices.

**Service Improvement:** Understanding the distribution can highlight areas where service improvement is needed. For instance, if there are a significant number of long-duration trips, companies might focus on providing amenities like in-car entertainment, comfortable seating, or Wi-Fi to enhance customer satisfaction during these journeys.




* From the chart above, most trips last between 10 and 30 minutes.
* A few trips last between 30 and 60 minutes.
* Trips in the '1-24 hrs' category are rare.

# Chart-7

# Visually comparing the distributions and central tendencies (mean and median) of the numeric features in the "taxi_df" DataFrame using histograms and box plots, iterating through each numeric feature.



# Distribution of different features


In [None]:
# visualization code

# Histplots and boxplots to determine the distribution of the data given below.
numeric_feature=['passenger_count','Distance','trip_duration_in_minute','pickup_hour','dropoff_hour']
numeric_feature

In [None]:
taxi_df["Distance"] = taxi_df["Distance"].astype(float)

In [None]:
# This code snippet is useful for visually comparing the distributions and central tendencies (mean and median) of different numeric features in the "taxi_df" DataFrame using histograms and box plots.
# Iterate through each numeric feature in the DataFrame.
for col in numeric_feature:
    # Create a figure with one row and two columns, representing two subplots side by side.
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

    # Plot a histogram for the current numeric feature in the first subplot (index 0).
    sns.histplot(data=taxi_df, x=col, ax=ax[0], kde=True, color='skyblue')
    # Add vertical lines for mean and median to the histogram plot.
    ax[0].axvline(taxi_df[col].mean(), color='magenta', linestyle='dashed', linewidth=2, label='Mean')
    ax[0].axvline(taxi_df[col].median(), color='cyan', linestyle='dashed', linewidth=2, label='Median')
    # Set plot title and labels.
    ax[0].set_title(f'Distribution of {col}')
    ax[0].set_xlabel(col)
    ax[0].legend()  # Show legend for mean and median lines.

    # Create a boxplot for the current numeric feature in the second subplot (index 1).
    sns.boxplot(data=taxi_df, x=col, ax=ax[1], color='lightgreen')
    # Add vertical lines for mean and median to the boxplot.
    ax[1].axvline(taxi_df[col].mean(), color='magenta', linestyle='dashed', linewidth=2, label='Mean')
    ax[1].axvline(taxi_df[col].median(), color='cyan', linestyle='dashed', linewidth=2, label='Median')
    # Set plot title and labels.
    ax[1].set_title(f'Boxplot of {col}')
    ax[1].set_xlabel(col)
    ax[1].legend()  # Show legend for mean and median lines.

    plt.tight_layout()  # Adjust layout for a better visualization spacing.

    # Display the plots for the current numeric feature.
    plt.show()


#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

 The left subplot (index 0) shows the distribution of the numeric feature using a histogram, while the right subplot (index 1) shows the same distribution using a box plot.


 The specific charts (histogram and box plot) were chosen for the following reasons:

**Histogram:**

**Purpose:** Histograms are useful for visualizing the distribution of a single variable. They show the frequency or count of data points within specific intervals (bins) along the numerical range.


**Use Case:** Histograms are effective for understanding the shape of the distribution, identifying patterns (such as normal or skewed distributions), and detecting outliers.


**Box Plot:**

**Purpose:** Box plots (box-and-whisker plots) are excellent for visualizing the summary statistics of a dataset, including the median, quartiles, and potential outliers.

**Use Case:** Box plots provide a clear summary of the central tendency (median) and the spread of the data. They are particularly useful for comparing the distributions of multiple variables or groups.


#### **2. What is/are the insight(s) found from the chart?**


#### **Answer Here.**

From the histograms, you can observe the shape of each distribution. Whether it is symmetric, skewed to the left, or skewed to the right points to different patterns in the data.

**Symmetric Distribution:** If the histogram is roughly symmetric, the values are balanced around the center and the mean and median lie close together.

**Skewed to the Left (Negatively Skewed):** If the left tail (where the smaller values are) is longer or fatter than the right tail, the data is skewed to the left. In the context of taxi trip time prediction, this might mean that most trips have longer durations.

**Skewed to the Right (Positively Skewed):** If the right tail (where the larger values are) is longer or fatter than the left tail, the data is skewed to the right. In the context of taxi trip time prediction, this might mean that most trips have shorter durations.

From the box plots, you can identify outliers and get a sense of the spread of the data. Outliers are data points that significantly differ from most other observations and can indicate interesting phenomena or errors in the data collection process.

The vertical lines in both the histograms and box plots represent the mean (magenta) and median (cyan) values. Comparing the mean and median can provide insights into the skewness of the data. If mean > median, the data is skewed to the right; if mean < median, the data is skewed to the left.
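
The mean-versus-median comparison can be checked numerically. The sketch below uses a small, made-up series of trip durations (not the real data) and pandas' `skew()`; treat the mean/median rule as a rule of thumb rather than a guarantee.

```python
import pandas as pd

# Hypothetical trip durations in minutes; two long trips create a right tail.
durations = pd.Series([5, 6, 7, 8, 8, 9, 10, 12, 45, 90])

mean_value = durations.mean()
median_value = durations.median()
skewness = durations.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_value:.1f}, median={median_value:.1f}, skew={skewness:.2f}")

# Rule of thumb: mean > median suggests right skew.
direction = "right" if mean_value > median_value else "left or symmetric"
print("skew direction:", direction)
```

Here the two long trips pull the mean well above the median, which is the pattern the magenta and cyan lines make visible in the plots.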




#### **3. Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**


#### **Answer Here.**

**Positive Business Impact:**


**Identifying Patterns:** By comparing histograms and box plots, you can identify patterns in the data. For instance, if the trip duration shows a consistent pattern (like most trips being short), businesses can strategize accordingly. Short trips might lead to more frequent turnovers, allowing for higher profits if managed efficiently.

**Optimizing Services:** Understanding peak usage times (via time-related features) can help in optimizing taxi services. If there's a visible peak during rush hours, businesses can allocate more resources (taxis and drivers) during those times, ensuring better service for customers and potentially increasing revenue.

**Customer Preferences:** If certain features like trip distance or fare amount have distinctive peaks or trends, businesses can tailor their marketing strategies. For example, if there are more long-distance trips during weekends, special weekend promotions could be introduced to attract more customers during those periods.

**Negative Growth:**

**Outliers and Anomalies:** If these visualizations reveal extreme outliers (for example, unusually high fares for short trips), it might indicate fraudulent activities or errors in the billing system. Addressing such issues promptly is crucial; failure to do so could lead to financial losses and damage the company's reputation.

**Inefficiencies:** If there are unexpected distributions or trends, it might indicate inefficiencies in the business process. For instance, if the trip duration is significantly higher than expected for certain routes, it could imply issues such as traffic congestion or inefficient route planning. Addressing these inefficiencies is vital to providing better services and optimizing costs.

# The histograms for distance and trip duration indicate significant skewness, while the box plots for these columns reveal the presence of numerous outliers.

# Chart-8
# **Heatmap**

# **Correlation Heatmap** - Generates a heatmap to visualize the correlation matrix of the NYC Taxi Trip Time Prediction dataset.


In [None]:
# visualization code
# Set the size of the figure
plt.figure(figsize=(14, 10))

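# Calculate the correlation matrix
# numeric_only=True skips non-numeric columns such as 'id' (needed on pandas 2.0+ when such columns are present)
correlation = taxi_df.corr(numeric_only=True)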

# Choose a color palette (for example, "coolwarm")
cmap = sns.color_palette("coolwarm", as_cmap=True)

# Create the heatmap with annotations, using a diverging color palette
sns.heatmap(correlation, annot=True, cmap=cmap, fmt=".2f", linewidths=1, vmin=-1, vmax=1)

# Set plot title and labels for axes
plt.title("Correlation Matrix Heatmap", fontsize=16)
plt.xlabel("Features", fontsize=14)
plt.ylabel("Features", fontsize=14)

# Increase font size of annotations for better readability
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Show the plot
plt.show()

#### **1. Why did you pick the specific chart?**


#### **Answer Here.**

This chart is a heatmap of the correlation matrix of the NYC Taxi Trip Time Prediction dataset.

**Correlation Analysis:** The purpose of this chart is to understand the relationships between different variables in the dataset. Heatmaps are excellent for displaying correlation matrices because they use color to represent the strength and direction of the relationships. With the "coolwarm" palette used here, positive correlations appear in warm shades of red and negative correlations in cool shades of blue. This makes it easy to identify patterns and relationships in the data.

#### **2. What is/are the insight(s) found from the chart?**

#### **Answer Here.**

In this heatmap, correlation coefficients between different pairs of features are represented by colors, where values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong negative correlation, and values close to 0 indicate a weak or no correlation.
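
To make the meaning of a coefficient concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with `np.corrcoef`. The distance and duration values are invented for illustration, and `pearson_r` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical distances and durations (minutes) for six trips.
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
duration = np.array([6.0, 9.0, 15.0, 18.0, 26.0, 28.0])

def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

r_manual = pearson_r(distance, duration)
r_numpy = float(np.corrcoef(distance, duration)[0, 1])
r_negative = pearson_r(distance, -duration)  # flipping one variable flips the sign

print(round(r_manual, 3), round(r_numpy, 3), round(r_negative, 3))
```

A value near 1 corresponds to the strongest warm cells of the heatmap, and the sign flip shows what a strong negative cell would represent.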

**Positive Correlations with Trip Time:**

**Distance:** Typically, the distance of the trip and the duration of the trip have a strong positive correlation. Longer distances usually mean longer trip times.


**Pickup/Dropoff Coordinates:** The longitude and latitude coordinates of the pickup and dropoff locations might be positively correlated with trip duration. For example, trips from downtown areas to airports might take longer.

* The heatmap above shows that pickup_month and dropoff_month are 100% correlated. In addition, pickup_hour, dropoff_hour, pickup_weekday, dropoff_weekday, and trip_duration_in_minute show high correlations.

# **Pair plot**

# **Pair Plot**- A pair plot is a grid of scatterplots, where variables are plotted against each other. If you have a dataset with multiple numeric variables, a pair plot allows you to visualize the relationships between these variables.

In [None]:
# visualization code

# Pair Plot
sns.pairplot(taxi_df, hue="trip_duration_in_minute")
plt.show()

#### **1. Why did you pick the specific chart?**





#### **Answer Here.**

A pair plot is a grid of scatterplots, where variables are plotted against each other. If you have a dataset with multiple numeric variables, a pair plot allows you to visualize the relationships between these variables.

**Multivariate Exploration:** Pair plots allow you to explore relationships between multiple variables at once. For instance, in the context of a taxi dataset, you might want to see how different variables like distance, passenger count, or time of day relate to the trip duration. A pair plot can help you quickly identify potential patterns or correlations.




#### **2. What is/are the insight(s) found from the chart?**

#### **Answer Here.**

A pair plot is a grid of scatterplots that allows you to visualize relationships between different pairs of variables. If the "trip_duration" variable is used as the hue, it could provide insights into how trip duration relates to other features in the dataset.

**Correlations:** You can identify positive or negative correlations between "trip_duration" and other variables. For instance, if you see a diagonal line sloping upwards from the bottom left to the top right, it indicates a positive correlation. Conversely, a downward slope indicates a negative correlation.

In [None]:
def correlated(dataset, threshold):
  # Compute the correlation matrix from the dataset that was passed in,
  # instead of relying on the global 'correlation' variable.
  correlation = dataset.corr(numeric_only=True)
  corr_column = set() # all the highly correlated columns
  for i in range(len(correlation.columns)):
    for j in range(i):
      if abs(correlation.iloc[i,j])>=threshold:  # We want absolute value
        column_name=correlation.columns[i]       # getting the name of the column
        corr_column.add(column_name)             # add the column name to the set
  return corr_column

In [None]:
# Calling the function with threshold value 0.90
highly_correlated_features = correlated(taxi_df,0.90)
print('total highly correlated features:',len(set(highly_correlated_features)))

In [None]:
highly_correlated_features

* Based on the analysis above, it can be concluded that there are four columns exhibiting a strong correlation of over 90%.

* Removing highly correlated features reduces redundancy (multicollinearity) among the predictors, which can improve the stability and performance of the models.
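
One common way to act on this is to drop the later column of each highly correlated pair. The sketch below is an assumption about how that could be done, not the notebook's own method; `drop_highly_correlated` and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.90):
    """Drop the later column of every pair whose absolute correlation meets the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: dropoff_month duplicates pickup_month, passenger_count does not.
toy_df = pd.DataFrame({
    "pickup_month": [1, 2, 3, 4, 5, 6],
    "dropoff_month": [1, 2, 3, 4, 5, 6],
    "passenger_count": [1, 3, 1, 2, 1, 4],
})

reduced_df, dropped = drop_highly_correlated(toy_df)
print("dropped:", dropped)
print("kept:", list(reduced_df.columns))
```

Keeping the first column of each pair is an arbitrary choice; in practice the column that is easier to interpret is usually the one to keep.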

## Examining the asymmetry of the target variable.

In [None]:
# KDE plot of trip duration.
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))

# Plotting the original trip duration
# (sns.distplot is deprecated; sns.kdeplot with fill=True draws the same shaded density curve)
sns.kdeplot(x=taxi_df['trip_duration'], color='red', ax=ax[0], fill=True, linewidth=2)
ax[0].set_title('Distribution before log transformation')

# Plotting the trip duration after applying log transformation
sns.kdeplot(x=np.log10(taxi_df['trip_duration']), color='green', ax=ax[1], fill=True, linewidth=2)

ax[1].set_title('Distribution after applying log transformation')
plt.show()


* The distribution above clearly shows that the target variable is highly right-skewed. To reduce the skewness we apply a log transformation, after which the target variable is approximately normally distributed.
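
The effect of the log transformation on skewness can be measured directly. The sketch below uses synthetic log-normal data rather than the taxi data, and it uses `np.log1p` / `np.expm1` (an assumption that differs from the `np.log10` call above) because that pair is defined at zero and easy to invert.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "durations" in seconds; not the real dataset.
rng = np.random.default_rng(42)
duration_seconds = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=5000))

log_duration = np.log1p(duration_seconds)   # log(1 + x), defined even when x is 0
recovered = np.expm1(log_duration)          # inverse transform, back to seconds

skew_before = duration_seconds.skew()
skew_after = log_duration.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

If a model is trained on the transformed target, its predictions have to be mapped back with the inverse transform before they are compared with durations in the original units.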

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# No Missing/Null Values in the Data Set

#### What all missing value imputation techniques have you used and why did you use those techniques?

No imputation technique was needed because the dataset contains no missing or null values.

### 2. Handling Outliers

## Eliminating exceptional data points (Quartile Method)
**The interquartile range (IQR) measures the spread of the middle half of our data.**

**Formula: IQR = Q3 - Q1**

**where Q1 is the first quartile and Q3 is the third quartile**

**Lower limit of the data is given by Q1 - 1.5 × IQR**

**Upper limit of the data is given by Q3 + 1.5 × IQR**

In [None]:
# Handling Outliers & Outlier treatments.

# Create a figure with a single row and three columns of subplots.
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))

# Create a box plot for the 'trip_duration' column and place it in the first subplot.
sns.boxplot(x=taxi_df['trip_duration'], ax=ax[0], color='skyblue', width=0.5)
ax[0].set_title('Trip Duration Distribution')  # Set title for the subplot.
ax[0].set_xlabel('Duration (seconds)')  # Set x-axis label.
ax[0].grid(True, linestyle='--', linewidth=0.5)  # Add grid lines.

# Create a box plot for the 'passenger_count' column and place it in the second subplot.
sns.boxplot(x=taxi_df['passenger_count'], ax=ax[1], color='lightgreen', width=0.5)
ax[1].set_title('Passenger Count Distribution')  # Set title for the subplot.
ax[1].set_xlabel('Number of Passengers')  # Set x-axis label.
ax[1].grid(True, linestyle='--', linewidth=0.5)  # Add grid lines.

# Create a box plot for the 'distance' column and place it in the third subplot.
sns.boxplot(x=taxi_df['Distance'], ax=ax[2], color='lightcoral', width=0.5)
ax[2].set_title('Distance Distribution')  # Set title for the subplot.
ax[2].set_xlabel('Distance (miles)')  # Set x-axis label.
ax[2].grid(True, linestyle='--', linewidth=0.5)  # Add grid lines.

# Rotate y-axis labels to horizontal for easier reading.
for axis in ax:
    axis.tick_params(axis='y', rotation=0)

# Add spacing between subplots and show the plot.
plt.tight_layout()
plt.show()


In [None]:
# Finding the quartiles of the trip_duration column.
trip_duration_Q1 = taxi_df['trip_duration'].quantile(0.25)
print('first quartile value ie 25th percentile of trip duration:',trip_duration_Q1)
trip_duration_Q2 = taxi_df['trip_duration'].quantile(0.50)
print('second quartile value ie 50th percentile of trip duration:',trip_duration_Q2)
trip_duration_Q3 = taxi_df['trip_duration'].quantile(0.75)
print('third quartile value ie 75th percentile of trip duration:',trip_duration_Q3)

In [None]:
# Calculate the interquartile range.
IQR=trip_duration_Q3-trip_duration_Q1
print('IQR:',IQR)
trip_duration_lower_limit=trip_duration_Q1-1.5*IQR
trip_duration_upper_limit=trip_duration_Q3+1.5*IQR
print('The lower limit of trip duration:',trip_duration_lower_limit)
print('The upper limit of trip duration:',trip_duration_upper_limit)

In [None]:
# Removing outliers in trip_duration features.
taxi_df=taxi_df[taxi_df['trip_duration']>0]
taxi_df=taxi_df[taxi_df['trip_duration']<trip_duration_upper_limit]

In [None]:
# Finding different quarters of passenger_count column
passenger_count_Q1=taxi_df['passenger_count'].quantile(0.25)
print('first quartile value ie 25th percentile of passenger count:',passenger_count_Q1)
passenger_count_Q2=taxi_df['passenger_count'].quantile(0.50)
print('second quartile value ie 50th percentile of passenger count:',passenger_count_Q2)
passenger_count_Q3=taxi_df['passenger_count'].quantile(0.75)
print('third quartile value ie 75th percentile of passenger count:',passenger_count_Q3)

In [None]:
# Calculate IQR
IQR=passenger_count_Q3 - passenger_count_Q1
print('IQR:',IQR)
passenger_count_lower_limit=passenger_count_Q1-1.5*IQR
passenger_count_upper_limit=passenger_count_Q3+1.5*IQR
print('The lower limit of passenger count:',passenger_count_lower_limit)
print('The upper limit of passenger count:',passenger_count_upper_limit)

In [None]:
# Filtering out rows with trip durations less than or equal to 0
taxi_df = taxi_df[taxi_df['trip_duration_in_minute'] > 0]

# Filtering out rows with a passenger count greater than the defined upper limit.
# The limit was computed from 'passenger_count', so it is applied to that column
# rather than to 'trip_duration_in_minute', which is measured in different units.
taxi_df = taxi_df[taxi_df['passenger_count'] < passenger_count_upper_limit]

In [None]:
# Finding the quartiles of the 'Distance' column
# (taxi_df must already be loaded before running this cell)

# Calculating the first quartile (25th percentile) of the 'distance' column
distance_count_Q1 = taxi_df['Distance'].quantile(0.25)
# Printing the calculated first quartile value
print('first quartile value ie 25th percentile of distance count:', distance_count_Q1)

# Calculating the second quartile (50th percentile/median) of the 'distance' column
distance_count_Q2 = taxi_df['Distance'].quantile(0.50)
# Printing the calculated second quartile value
print('second quartile value ie 50th percentile of distance count:', distance_count_Q2)

# Calculating the third quartile (75th percentile) of the 'distance' column
distance_count_Q3 = taxi_df['Distance'].quantile(0.75)
# Printing the calculated third quartile value
print('third quartile value ie 75th percentile of distance count:', distance_count_Q3)

In [None]:
# Calculate the Interquartile Range (IQR)
# IQR is the range between the 1st quartile (Q1) and the 3rd quartile (Q3)
IQR = distance_count_Q3 - distance_count_Q1
print('IQR:', IQR)  # Print the calculated IQR

# Calculate the lower and upper limits for outlier detection
# The lower limit is calculated as Q1 - 1.5 * IQR
distance_duration_lower_limit = distance_count_Q1 - 1.5 * IQR
# The upper limit is calculated as Q3 + 1.5 * IQR
distance_duration_upper_limit = distance_count_Q3 + 1.5 * IQR

# Print the calculated lower and upper limits for distance duration
print('The lower limit of distance duration:', distance_duration_lower_limit)
print('The upper limit of distance duration:', distance_duration_upper_limit)

In [None]:
# Removing outliers in the 'Distance' feature

# Filter out rows where trip duration is less than or equal to 0
taxi_df = taxi_df[taxi_df['trip_duration_in_minute'] > 0]

# Filter out rows where the distance exceeds 'distance_duration_upper_limit'.
# That limit was computed from 'Distance', so it is applied to 'Distance' rather than to the duration.
taxi_df = taxi_df[taxi_df['Distance'] < distance_duration_upper_limit]

# At this point, 'taxi_df' contains data with outliers removed from the 'Distance' feature

In [None]:
# Earlier we saw that distance and trip duration had highly skewed distributions. Let's check the distributions again.
# Create a figure with two subplots side by side
figure, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 5))

# Plotting the distribution of the 'Distance' column on the first subplot
# (sns.kdeplot replaces the deprecated sns.distplot)
sns.kdeplot(
    x=taxi_df['Distance'],                # Data from the 'Distance' column
    fill=True,                            # Shade the area under the density curve
    linewidth=2,                          # Line width of the density curve
    color='green',                        # Color of the plot
    ax=ax[0]                              # Place the plot on the first subplot
)
# Adding a title to the first subplot
ax[0].set_title('Distribution of Distance')

# Plotting the distribution of the 'trip_duration' column on the second subplot
sns.kdeplot(
    x=taxi_df['trip_duration'],           # Data from the 'trip_duration' column
    fill=True,                            # Shade the area under the density curve
    linewidth=2,                          # Line width of the density curve
    color='green',                        # Color of the plot
    ax=ax[1]                              # Place the plot on the second subplot
)
# Adding a title to the second subplot
ax[1].set_title('Distribution of Trip Duration')

# Display the final plot
plt.tight_layout()
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Answer Here.**

 I used the interquartile range (IQR) method to identify and remove outliers in the continuous columns of the dataset. I chose this technique because quartiles are barely affected by extreme values, which makes it a robust way to detect outliers. The IQR is the difference between the 75th and 25th percentiles of the data, and any value that falls below the 25th percentile minus 1.5 times the IQR, or above the 75th percentile plus 1.5 times the IQR, is considered an outlier. This method let me identify and remove outliers in a consistent and objective manner.
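
The IQR rule described above can be wrapped in a small helper so that each column is always filtered with limits computed from that same column. This is a sketch under assumptions: `iqr_bounds` and `remove_iqr_outliers` are hypothetical names, and the toy DataFrame stands in for `taxi_df`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value in `column` lies within that column's own IQR limits."""
    lower, upper = iqr_bounds(df[column], k)
    return df[df[column].between(lower, upper)]

# Toy data with one obvious outlier (100.0).
toy_df = pd.DataFrame({"Distance": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]})

lower, upper = iqr_bounds(toy_df["Distance"])
cleaned_df = remove_iqr_outliers(toy_df, "Distance")
print(lower, upper, len(cleaned_df))
```

Passing the column name once, for both the limits and the filter, avoids accidentally applying one feature's limit to a different feature.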

### 3. Textual Data Preprocessing
### ONE HOT ENCODING

In [None]:
#add dummy variable to convert textual data to numerical data through one hot encoding.
taxi_df=pd.get_dummies(taxi_df,columns=[ 'pickup_weekday', 'pickup_month'],drop_first=True)

In [None]:
# Encode your categorical columns

# Plotting the pie charts for binary categorical variables.
plt.figure(figsize=(18, 12))

rows = 5
cols = 3
count = 1
Features_list = ['vendor_id', 'store_and_fwd_flag', 'pickup_weekday_1', 'pickup_weekday_2', 'pickup_weekday_3',
                 'pickup_weekday_4', 'pickup_weekday_5', 'pickup_weekday_6',
                 'pickup_month_2.0', 'pickup_month_3.0', 'pickup_month_4.0', 'pickup_month_5.0', 'pickup_month_6.0']

LABELS = ['1','2']
labels = ['Not pickup %','Pickup %']
colors = ['#66b3ff', '#99ff99']

for var in Features_list:
    plt.subplot(rows, cols, count)
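    # sort_index() orders the counts by category value so they line up with the label lists above
    counts = taxi_df[var].value_counts().sort_index()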
    if var=='vendor_id':
      plt.pie(counts, labels=LABELS, autopct='%1.1f%%', colors=colors, startangle=90, wedgeprops={'edgecolor': 'white'})
    else:
      plt.pie(counts, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90, wedgeprops={'edgecolor': 'white'})
    plt.title(f'Distribution of {var.replace("_", " ").title()}', fontsize=14)
    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    count += 1

plt.tight_layout()
plt.show()


#### What all categorical encoding techniques have you used & why did you use those techniques?

**Answer Here.**

One-hot encoding is used to encode the 'pickup_weekday' and 'pickup_month' columns. All the remaining categorical columns are binary (0/1), so there is no need to encode them.
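
As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, the sketch below encodes a toy `pickup_weekday` column with three categories (the values are made up).

```python
import pandas as pd

toy_df = pd.DataFrame({"pickup_weekday": [0, 1, 2, 0, 2]})

# drop_first=True removes the first category (0 here). It is still recoverable:
# a row belongs to category 0 when all remaining dummy columns are 0.
encoded = pd.get_dummies(toy_df, columns=["pickup_weekday"], drop_first=True)

print(list(encoded.columns))
print(encoded.astype(int))
```

Dropping one category avoids a set of dummy columns that always sums to 1, which would be perfectly collinear with the intercept of a linear model.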

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Already done

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
corr=taxi_df.corr(numeric_only=True)  # numeric_only=True skips non-numeric columns such as 'id'
plt.figure(figsize=(25,10))
sns.heatmap(corr,annot=True, cmap=plt.cm.Accent_r)

In [None]:
#dropping excess columns
taxi_df.drop(columns=['id', 'trip_duration', 'pickup_datetime', 'dropoff_datetime', 'dropoff_day', 'pickup_day', 'pickup_hour', 'dropoff_hour', 'dropoff_weekday', 'dropoff_month'], inplace=True)

In [None]:
# Select your features wisely to avoid overfitting
corr=taxi_df.corr(numeric_only=True)  # numeric_only=True guards against any remaining non-numeric columns
plt.figure(figsize=(25,10))
sns.heatmap(corr,annot=True, cmap=plt.cm.Accent_r)

##### What all feature selection methods have you used and why?

**Answer Here.**

The method we've used for feature selection is correlation analysis.
By selecting features that are moderately to strongly correlated with the target variable, and dropping features that largely duplicate one another, we avoid including irrelevant or redundant features in our model and reduce the risk of overfitting.

##### Which all features you found important and why?

**Answer Here.**

'vendor_id', 'store_and_fwd_flag', 'passenger_count', 'Distance', 'trip_duration_in_minute' (the target), 'pickup_day', 'pickup_weekday_1', 'pickup_weekday_2', 'pickup_weekday_3', 'pickup_weekday_4', 'pickup_weekday_5', 'pickup_weekday_6', 'pickup_month_2.0', 'pickup_month_3.0', 'pickup_month_4.0', 'pickup_month_5.0', 'pickup_month_6.0'

I found the features above important because their pairwise correlations are low, meaning they are not strongly correlated with each other.

**Independence of Features:** If two features are uncorrelated, they have no linear relationship, so each contributes information the other does not. Low correlation among predictors matters most for linear models such as linear regression, where it keeps the coefficient estimates stable. Including diverse, non-redundant features can improve the model's ability to generalize to new, unseen data.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# creating the set of dependent and independent variables
X = taxi_df.drop(labels='trip_duration_in_minute', axis=1)
Y = taxi_df['trip_duration_in_minute']

# print the shape of X and Y
print(f"The Number of Rows and Columns in X is {X.shape} respectively.")
print(f"The Number of Rows and Columns in Y is {Y.shape} respectively.")

### 6. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 7. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Importing train test split
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

In [None]:
# Getting the shape of Train Test set.

print("Training Dataset Shape:--")
print("X_train shape ", X_train.shape)
print("Y_train shape ", Y_train.shape)
print("Testing Dataset Shape:--")
print("X_test shape ",X_test.shape)
print("Y_test shape ",Y_test.shape)

##### What data splitting ratio have you used and why?

**Answer Here.**

The data splitting ratio used is 80:20, meaning 80% of the data is used for training (X_train and Y_train) and 20% is used for testing (X_test and Y_test). This ratio is determined by the test_size parameter, which is set to 0.2 (20%). An 80:20 split is a common and reasonable choice, especially for moderate to large datasets. It leaves a substantial portion of the data for training the model while still retaining a separate portion for evaluating its performance.
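
Conceptually, a shuffled 80:20 split permutes the row indices once and cuts the permutation in two. The sketch below shows that idea with NumPy only; `shuffle_split_indices` is a hypothetical helper and does not reproduce the exact rows that `train_test_split` selects.

```python
import numpy as np

def shuffle_split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) taken from one random permutation of the row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = shuffle_split_indices(n_rows=1000)
print(len(train_idx), len(test_idx))

# No row may appear on both sides of the split.
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print("overlap:", len(overlap))
```

Fixing the seed plays the same role as `random_state=42` in the notebook: the split is random but repeatable across runs.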

### 8. Data Scaling

In [None]:
taxi_df.columns

In [None]:
X.columns

In [None]:
X_train

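##### Which method have you used to scale your data and why?

Not required. The tree-based models used here are insensitive to feature scale, and the predictions of ordinary least squares linear regression do not change when the features are rescaled.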
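
For the linear regression case, this can be checked on synthetic data: fitting ordinary least squares with an intercept on raw and on standardized features gives the same predictions. The sketch uses `np.linalg.lstsq` rather than scikit-learn, and all data and names in it are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales.
X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.1, size=200)

def ols_predict(X_fit, y_fit):
    """Fit least squares with an intercept and return the in-sample predictions."""
    design = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(design, y_fit, rcond=None)
    return design @ coef

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # standardization
pred_raw = ols_predict(X, y)
pred_scaled = ols_predict(X_scaled, y)

max_difference = float(np.max(np.abs(pred_raw - pred_scaled)))
print("largest prediction difference:", max_difference)
```

The coefficients themselves do change with scaling. Regularized or distance-based models would also behave differently, so scaling would need to be revisited if such models were added.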

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
#Not needed

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

In [None]:
taxi_df.shape

## ***7. ML Model Implementation***

## XG Boost.

## Random Forest.

## Decision Tree.

## Gradient Boost.

## Linear Regressor.

In [None]:

# define a  function to calculate evaluation metrics
def regression_evaluation_metrics (x_train,y_train,y_predicted):
  MSE=round(mean_squared_error(y_true=y_train, y_pred=y_predicted),4)
  RMSE=math.sqrt(MSE)
  R2_score=r2_score(y_true=y_train, y_pred=y_predicted)
  Adjusted_R2_score=1-((1-( R2_score))*(x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1))

  print("Mean Squared Error:",MSE,"Root Mean Squared Error:", RMSE)
  print("R2 Score :",R2_score,"Adjusted R2 Score :",Adjusted_R2_score)
  # Plotting actual and predicted values

  # Set the number of data points to visualize (for better clarity)
  num_data_points = 100

  # Plotting actual and predicted values for the first 100 data points
  plt.figure(figsize=(12, 6))

  # Plotting Predicted values in red
  plt.plot(range(num_data_points), y_predicted[:num_data_points], color='red', marker='o', linestyle='-', linewidth=2, markersize=6, label='Predicted')

  # Plotting Actual values in green
  plt.plot(range(num_data_points), np.array(y_train)[:num_data_points], color='green', marker='o', linestyle='-', linewidth=2, markersize=6, label='Actual')

  # Adding labels, title, legend, and grid for better interpretation
  plt.xlabel('Data Points', fontsize=12)
  plt.ylabel('Time Duration', fontsize=12)
  plt.title('Actual vs. Predicted Time Duration', fontsize=14)
  plt.legend(fontsize=10)
  plt.grid(True, linestyle='--', alpha=0.7)

  # Display only specific data points on the x-axis for better readability
  plt.xticks(range(0, num_data_points, 10))

  # Adding a horizontal line at y=0 for reference (assuming time duration cannot be negative)
  plt.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.7)

  # Annotate a specific point for emphasis (optional)
  # plt.annotate('Example Annotation', xy=(50, 10), xytext=(30, 20), arrowprops=dict(facecolor='black', shrink=0.05))

  # Show the plot
  plt.show()

  # Collect the metrics computed above for display
  metrics_names = ['MSE', 'RMSE', 'R2 Score', 'Adjusted R2 Score']
  metrics_values = [MSE, RMSE, R2_score, Adjusted_R2_score]

  # Creating a DataFrame for metrics
  metrics_df = pd.DataFrame(metrics_values, index=metrics_names, columns=['Metrics'])

  # Plotting the metrics heatmap
  plt.figure(figsize=(6, 4))
  sns.heatmap(data=metrics_df, annot=True, cmap='coolwarm', fmt=".2f")
  plt.title('Evaluation Metrics')
  plt.xlabel('Metrics')
  plt.ylabel('')
  plt.show()
  return(MSE,RMSE,R2_score,Adjusted_R2_score)
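
The adjusted R2 term inside `regression_evaluation_metrics` can be isolated as a small function to see how the penalty for the number of features works. This is a sketch: `adjusted_r2` is a hypothetical helper, and the numbers in the example are invented.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# With few rows, 14 features noticeably lower an R2 of 0.5.
example_small = adjusted_r2(r2=0.5, n_samples=101, n_features=14)
# With many rows, the same R2 is almost unchanged.
example_large = adjusted_r2(r2=0.5, n_samples=100001, n_features=14)

print(round(example_small, 4), round(example_large, 4))
```

On a dataset with a very large number of rows, R2 and adjusted R2 will therefore be nearly identical, so a visible gap between them would be unexpected.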

In [None]:
# List of variables selected for machine learning analysis.
ml_variables = [
    'vendor_id',              # ID of the taxi vendor
    'passenger_count',        # Number of passengers
    'Distance',               # Distance of the trip
    'pickup_longitude',       # Longitude of pickup location
    'pickup_latitude',        # Latitude of pickup location
    'dropoff_longitude',      # Longitude of drop-off location
    'dropoff_latitude',       # Latitude of drop-off location
    'store_and_fwd_flag',     # Flag indicating if the trip data was stored locally before sending to vendor
    # The day names below assume pandas' weekday convention (Monday=0), with weekday 0 dropped by drop_first=True.
    'pickup_weekday_1',       # Binary flag for weekday 1 (Tuesday under that convention)
    'pickup_weekday_2',       # Binary flag for weekday 2 (Wednesday under that convention)
    'pickup_weekday_3',       # Binary flag for weekday 3 (Thursday under that convention)
    'pickup_weekday_4',       # Binary flag for weekday 4 (Friday under that convention)
    'pickup_weekday_5',       # Binary flag for weekday 5 (Saturday under that convention)
    'pickup_weekday_6'        # Binary flag for weekday 6 (Sunday under that convention)
]

# These variables are selected based on their potential impact on predicting taxi trip duration.
# 'vendor_id': The taxi company might affect service quality and trip durations.
# 'passenger_count': More passengers could lead to longer trips.
# 'Distance': Longer distances generally result in increased trip durations.
# 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude':
#     Geographic coordinates influencing the distance and duration of the trip.
# 'store_and_fwd_flag': Indicates whether the trip record was held in vehicle memory before sending to the vendor.
# 'pickup_weekday_1' to 'pickup_weekday_6': One-hot flags for the pickup day of the week.


In [None]:
# Selecting specific columns (features) for machine learning from the training data
# ml_variables contains the list of column names to be included in the analysis

# Selecting relevant features from the training data (X_train)
X_train = X_train[ml_variables]

# Selecting the same set of relevant features from the test data (X_test)
X_test = X_test[ml_variables]

# Now, X_train and X_test only contain the columns specified in ml_variables
# These selected features will be used for training and evaluating machine learning models


In [None]:
# Create a score dataframes
Regression_Metrics_Score_train = pd.DataFrame(index = ['MSE', 'RMSE', 'R2 Score', 'Adjusted R2 Score'])
Regression_Metrics_Score_test = pd.DataFrame(index = ['MSE', 'RMSE', 'R2 Score', 'Adjusted R2 Score'])

## **ML Model - FIRST Linear Regressor.**

In [None]:
# ML Model - 1 Implementation
# Instantiate the Linear Regression Model
linear_regression_model = LinearRegression()

# Fit the Algorithm
# Train the Linear Regression Model
linear_regression_model.fit(X_train, Y_train)


# Predict on the model
# Calculate R-squared score for training data
training_r2_score = linear_regression_model.score(X_train, Y_train)

# Predictions on the training set
train_predictions = linear_regression_model.predict(X_train)

# Predictions on the test set
test_predictions = linear_regression_model.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluate the model on the training data
print("Evaluation Metrics for Training Data:")
Regression_Metrics_Score_train['Linear Regression'] = regression_evaluation_metrics(X_train, Y_train, train_predictions)

# Evaluate the model on the test data
print("Evaluation Metrics for Test Data:")
Regression_Metrics_Score_test['Linear Regression'] = regression_evaluation_metrics(X_test, Y_test, test_predictions)

In [None]:
Regression_Metrics_Score_train

In [None]:
Regression_Metrics_Score_test

## Conclusion - The low R2 score and high MSE indicate that this algorithm is not appropriate for our model.
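The four rows of the score dataframes (MSE, RMSE, R2 Score, Adjusted R2 Score) can be sketched in plain Python as below. This is only an illustration under stated assumptions: `regression_metrics_sketch` is a hypothetical name, not the notebook's own `regression_evaluation_metrics` helper (which is defined earlier and presumably takes the feature matrix in order to count the predictors), and the sample values are made up.

```python
import math

def regression_metrics_sketch(y_true, y_pred, n_features):
    """Hypothetical helper: MSE, RMSE, R2 and adjusted R2 for paired value lists."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises R2 for the number of predictors used
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {'MSE': mse, 'RMSE': rmse, 'R2 Score': r2, 'Adjusted R2 Score': adjusted_r2}

# Made-up values, purely for illustration
scores = regression_metrics_sketch([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 6.5, 9.5], n_features=1)
print(scores)
```

Because adjusted R2 shrinks as predictors are added without improving the fit, it is the fairer of the two R2 rows when models use different numbers of features.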

## **Model SECOND - Gradient Boost.**

In [None]:
# ML Model - 2 Implementation
from sklearn.ensemble import GradientBoostingRegressor
gradient_boost_model=GradientBoostingRegressor()

# Fit the Algorithm
gradient_boost_model.fit(X_train,Y_train)

# Predict on the model
y_preds_gradient_boost_test = gradient_boost_model.predict(X_test)
y_pred_gradient_boost_train=gradient_boost_model.predict(X_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Evaluation metrics for Train set
Regression_Metrics_Score_train['Gradient Boosting'] = regression_evaluation_metrics( X_train,Y_train,y_pred_gradient_boost_train)

#Evaluation metrics for Test set
Regression_Metrics_Score_test['Gradient Boosting'] =regression_evaluation_metrics(X_test,Y_test,y_preds_gradient_boost_test)

In [None]:
Regression_Metrics_Score_train

In [None]:
Regression_Metrics_Score_test

### The above algorithm has an accuracy score of 47% on train and 43% on test, which is higher than our previous algorithm (Linear Regressor).

## **ML Model - THIRD Decision Tree.**

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Maximum depth of trees
max_depth = [4,6,8,10,12]

# Minimum number of samples required to split a node
min_samples_split = [10,20,30]

# Minimum number of samples required at each leaf node
min_samples_leaf = [6,10,16,20]

# Hyperparameter Grid
param_decision_tree = {
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}
DTR = DecisionTreeRegressor()


# Set up the grid search (5-fold cross-validation, scored by R2)
decision_tree_grid = GridSearchCV(estimator=DTR,
                       param_grid = param_decision_tree,
                       cv = 5, verbose=2, scoring='r2')


# Fit the Algorithm
decision_tree_grid.fit(X_train,Y_train)

In [None]:
# Accessing the best estimator from the grid search results
best_decision_tree_model = decision_tree_grid.best_estimator_
# The variable 'best_decision_tree_model' now contains the best decision tree model obtained from grid search


In [None]:
# Printing the best score obtained from the Decision Tree Grid Search
print("Best Score (Decision Tree):", decision_tree_grid.best_score_)



In [None]:
# Get the best estimator (optimal model) from the grid search results for Decision Tree
decision_tree_optimal_model = decision_tree_grid.best_estimator_

# Use the optimal Decision Tree model to make predictions on the training data
y_predict_train_decision_tree = decision_tree_optimal_model.predict(X_train)

# Use the optimal Decision Tree model to make predictions on the test data
y_predict_test_decision_tree = decision_tree_optimal_model.predict(X_test)

# Now, y_predict_train_decision_tree contains the predicted values for the training data
# and y_predict_test_decision_tree contains the predicted values for the test data
# These predictions can be used for further evaluation or analysis


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# evaluation metrics for train data set
Regression_Metrics_Score_train['Decision Tree'] = regression_evaluation_metrics(X_train,Y_train,y_predict_train_decision_tree)

# evaluation metrics for test data set
Regression_Metrics_Score_test['Decision Tree'] = regression_evaluation_metrics(X_test,Y_test,y_predict_test_decision_tree)

In [None]:
Regression_Metrics_Score_train

In [None]:
Regression_Metrics_Score_test

Conclusion: Although this algorithm surpasses linear regression, the accuracy score remains unsatisfactory (44% train, 40% test).
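To get a feel for what the grid search costs, the number of candidate settings can be counted directly from the hyperparameter grid. The sketch below copies the Decision Tree grid values and the `cv = 5` setting from the cell above; `cv_folds`, `candidates`, `n_candidates` and `n_cv_fits` are names introduced here only for illustration.

```python
from itertools import product

# Same grid values as the Decision Tree search above
param_decision_tree = {
    'max_depth': [4, 6, 8, 10, 12],
    'min_samples_split': [10, 20, 30],
    'min_samples_leaf': [6, 10, 16, 20],
}
cv_folds = 5  # matches cv = 5 in the grid search

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_decision_tree, values))
              for values in product(*param_decision_tree.values())]
n_candidates = len(candidates)

# Each candidate is trained once per cross-validation fold
n_cv_fits = n_candidates * cv_folds
print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, so widening any of the lists multiplies the run time quickly.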

## **Model FOURTH - Random Forest.**

In [None]:
# Importing the necessary library for Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Creating a RandomForestRegressor instance
RFR = RandomForestRegressor()

# At this point, RFR is an instance of the RandomForestRegressor class,
# which can be further configured and trained using the fit() method.
# You can set hyperparameters, specify features, and target variables,
# and train the model using the fit() method to make predictions.


In [None]:
# number of trees in random forest
n_estimators = [20,22,24]
# number of feature to consider at every split
max_features=[0.6]
# maximum number of level in trees
max_depth = [10,16]

# number of samples
max_samples = [0.75,1.0]

# Hyperparameter Grid
param_grid={'n_estimators' : n_estimators,
            'max_depth': max_depth,
            'max_features': max_features,
            'max_samples': max_samples,
                     }
print(param_grid)

In [None]:

# Create a GridSearchCV object with Random Forest Regressor, parameter grid, 2-fold cross-validation, and verbose output
RF_grid = GridSearchCV(estimator=RFR, param_grid=param_grid, cv=2, verbose=2)

In [None]:
# Fit the GridSearchCV to the training data
RF_grid.fit(X_train, Y_train)

In [None]:
# Printing the best parameters found by the Random Forest Grid Search
print("Best Parameters: ", RF_grid.best_params_)


In [None]:
# Print the best cross-validation score found during the Grid Search process.
print(RF_grid.best_score_)

In [None]:
# Get the best estimator from the Random Forest grid search
Random_Forest_optimal_model = RF_grid.best_estimator_

# Use the trained Random Forest model to make predictions on the training data
y_predict_train_Random_Forest = Random_Forest_optimal_model.predict(X_train)

# Use the trained Random Forest model to make predictions on the test data
y_predict_test_Random_Forest = Random_Forest_optimal_model.predict(X_test)

# Here, X_train represents the feature matrix of the training data
# X_test represents the feature matrix of the test data

# 'Random_Forest_optimal_model' now contains the Random Forest model with the best hyperparameters
# 'y_predict_train_Random_Forest' contains the predicted values for the training data
# 'y_predict_test_Random_Forest' contains the predicted values for the test data


In [None]:
# evaluation metrics for train data set
Regression_Metrics_Score_train['Random Forest'] = regression_evaluation_metrics(X_train,Y_train,y_predict_train_Random_Forest)

In [None]:
# evaluation metrics for test data set
Regression_Metrics_Score_test['Random Forest'] = regression_evaluation_metrics(X_test,Y_test,y_predict_test_Random_Forest)

In [None]:
Regression_Metrics_Score_train

In [None]:
Regression_Metrics_Score_test

### This algorithm has performed a little better (accuracy score: 60% train, 43% test).

## **Model FIFTH - XG Boost**

In [None]:
# Number of trees
total_estimators = [50]

# Maximum depth of trees
max_depth_of_trees = [7, 9]

# learning rate = [0.1, 0.3, 0.5]

# Hyperparameter Grid
# XGBRegressor names these parameters 'n_estimators' and 'max_depth';
# 'min_samples_split' is a scikit-learn tree parameter that XGBoost does not use
param_xgboost = {'n_estimators': total_estimators,
                 'max_depth': max_depth_of_trees
                 }

In [None]:
# Instantiate XGBRegressor
import xgboost as xgb
xgboost_model=xgb.XGBRegressor()

# Grid search
xgboost_grid=GridSearchCV(estimator=xgboost_model,param_grid=param_xgboost,cv=5,verbose=2,scoring='r2')
xgboost_grid.fit(X_train,Y_train)

In [None]:
# Print the best cross-validation score achieved by the XGBoost model
print("Best Cross-Validation Score: {:.2f}".format(xgboost_grid.best_score_))


In [None]:
# Output the best hyperparameters found by the grid search for the XGBoost model
print(xgboost_grid.best_params_)


In [None]:
# Selecting the best XGBoost model obtained from the grid search
# xgboost_grid: GridSearchCV object containing the results of the hyperparameter tuning
# .best_estimator_: Attribute of GridSearchCV object, returns the best performing model

xgboost_optimal_model = xgboost_grid.best_estimator_


In [None]:
# Predicting the target variable (y) for the test dataset using the trained XGBoost model
y_pred_xgboost_test = xgboost_optimal_model.predict(X_test)

# Predicting the target variable (y) for the training dataset using the trained XGBoost model
y_pred_xgboost_train = xgboost_optimal_model.predict(X_train)

# y_pred_xgboost_test now contains the predicted values for the test data
# y_pred_xgboost_train now contains the predicted values for the training data
# These predictions can be further used for evaluation or analysis purposes


In [None]:
# Evaluation metrics for Train set
Regression_Metrics_Score_train['XG Boost'] = regression_evaluation_metrics(X_train,Y_train,y_pred_xgboost_train)

In [None]:
# Evaluation metrics for Test set
Regression_Metrics_Score_test['XG Boost'] = regression_evaluation_metrics(X_test,Y_test,y_pred_xgboost_test)

### This algorithm has given the best accuracy score so far (66% train, 62% test) with low MSE.

In [None]:
X_train.shape

In [None]:
importance=xgboost_optimal_model.feature_importances_
importance

In [None]:
imp_dict={'Feature' : list(X_train.columns),
          'Feature Importance' : importance}
importance_df = pd.DataFrame(imp_dict)

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False,inplace=True)
importance_df

In [None]:
# Feature importance plot
plt.figure(figsize=(16,6))
plt.title('Feature Importance')
sns.barplot(x='Feature',y='Feature Importance',data=importance_df[:10])
plt.xticks(rotation=45)
plt.show()

In [None]:
Regression_Metrics_Score_train

In [None]:
Regression_Metrics_Score_test

### The above algorithm has an accuracy score of 98% on train and 30% on test, which is lower than our previous algorithm (XG Boost).

### **STEP 9 - Comparing evaluation metrics of different models.**

In [None]:
X.columns

In [None]:
Regression_Metrics_Score_train = Regression_Metrics_Score_train.transpose()
Regression_Metrics_Score_train

In [None]:
Regression_Metrics_Score_test = Regression_Metrics_Score_test.transpose()
Regression_Metrics_Score_test

In [None]:
# For Training Data
#bar plot for R2 score
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (20, 5))
x_ = ['Linear Regression', 'Gradientboost', 'Decision Tree', 'Random Forest', 'Xgboost']
ax1.set_title('R2 Scores')
ax = sns.barplot(x = x_, y ='R2 Score', data = Regression_Metrics_Score_train , ax = ax1)
ax.set_xlabel('Models')
ax.set_ylabel('R2 scores')

# barplot for adjustedR2
ax = sns.barplot(x = x_, y='Adjusted R2 Score',  data = Regression_Metrics_Score_train, ax = ax2)
ax2.set_title('Adjusted R2 scores')
ax.set_xlabel('Models')
ax.set_ylabel('Adjusted R2 scores')
plt.show()


In [None]:
# For Testing Data
#bar plot for R2 score
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (20, 5))
x_ = ['Linear Regression', 'Gradientboost', 'Decision Tree', 'Random Forest', 'Xgboost']
ax1.set_title('R2 Scores')
ax = sns.barplot(x = x_, y ='R2 Score', data = Regression_Metrics_Score_test , ax = ax1)
ax.set_xlabel('Models')
ax.set_ylabel('R2 scores')

# barplot for adjustedR2
ax = sns.barplot(x = x_, y='Adjusted R2 Score',  data = Regression_Metrics_Score_test, ax = ax2)
ax2.set_title('Adjusted R2 scores')
ax.set_xlabel('Models')
ax.set_ylabel('Adjusted R2 scores')
plt.show()

### The above graph shows that Random Forest has the highest R2 and adjusted R2 scores, which suggests that it performs better than the other models.

In [None]:
# For Training Data
# Bar plot of MSE score
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(20,5))
x_ = ['Linear Regression', 'Gradientboost','Decision Tree','Random Forest','XG Boost']
ax1.set_title('MSE scores')
ax = sns.barplot(x=x_,y='MSE',data=Regression_Metrics_Score_train,ax=ax1)
ax.set_xlabel('Models')
ax.set_ylabel('MSE score')

# Bar plot for RMSE score
ax = sns.barplot(x=x_ , y='RMSE', data=Regression_Metrics_Score_train, ax=ax2)
ax2.set_title('RMSE score')
ax.set_xlabel('Models')
ax.set_ylabel('RMSE score')
plt.show()

In [None]:
# For Testing Data
# Bar plot of MSE score
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(20,5))
x_ = ['Linear Regression', 'Gradientboost','Decision Tree','Random Forest','XG Boost']
ax1.set_title('MSE scores')
ax = sns.barplot(x=x_,y='MSE',data=Regression_Metrics_Score_test,ax=ax1)
ax.set_xlabel('Models')
ax.set_ylabel('MSE score')

# Bar plot for RMSE score
ax = sns.barplot(x=x_ , y='RMSE', data=Regression_Metrics_Score_test, ax=ax2)
ax2.set_title('RMSE score')
ax.set_xlabel('Models')
ax.set_ylabel('RMSE score')
plt.show()

### Random Forest has the lowest errors, so it can be considered a good algorithm for training our model.


### **Conclusion for EDA:**


* Vendor 2 dominates the market, receiving a higher number of bookings compared to other vendors.

* Analysis of daily pickup and dropoff patterns reveals that taxi booking rates significantly increase during weekends (Friday and Saturday). This uptick suggests a higher demand for taxi services during weekends, possibly due to social events, celebrations, or personal activities.
* Observing monthly trends, it becomes evident that taxi reservations peak in March and April. Conversely, booking numbers decrease notably in January, February, and post-June.
* Vendors experience their busiest month in March, with a decline in January, February, and after June, as per the monthly trend analysis.
* Analyzing hourly pickup and dropoff patterns, it's apparent that taxis are frequently used for commuting to workplaces, especially after 10:00 AM. Additionally, demand surges in the late evening, particularly after 6:00 PM.
* The majority of bookings are made by individual travelers, indicating a preference for solo rides. This implies a lower inclination towards carpooling, suggesting that people tend to travel alone rather than in groups.


## **Conclusion for Model Training:**

* We observed a significant presence of outliers in our dataset, with some values closely approaching zero. Attempts to remove these outliers led to substantial data loss. During the model training process, we experimented with various algorithms and achieved an accuracy level of 60%.

* Concerns arose about potential overfitting of the model. Fortunately, our fears were dispelled as the model consistently produced similar results for both the training and test data across all algorithms tested. Notably, only two models, namely XG Boost and Random Forest, exhibited a remarkable alignment between the actual and predicted values, evidenced by their nearly overlapping lines in the graphs.

* Additionally, we noticed that the R-squared (R2) scores were notably high, indicating the models' ability to explain the variance in the data. Moreover, the Mean Squared Error (MSE) scores were low for these models, meeting the criteria for a well-performing model.

* Upon reflection, we realized that removing data led to a significant loss of valuable information. We also observed that introducing a new column, even if highly correlated with existing features, could yield seemingly favorable results (pseudo-good results). Furthermore, our Random Forest model provided the best R2 score. To prevent overfitting, we carefully tuned the model parameters, ensuring its stability and reliability in making predictions.
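One simple way to look at the overfitting question is to compare the train and test scores model by model. The sketch below uses only the rounded percentages quoted in the model sections above (no figure was quoted for Linear Regression, so it is left out); `quoted_scores` and `gaps` are names introduced here for illustration, and a large gap is a hint to look closer rather than proof of overfitting.

```python
# (train %, test %) as quoted in the model sections of this notebook
quoted_scores = {
    'Gradient Boosting': (47, 43),
    'Decision Tree': (44, 40),
    'Random Forest': (60, 43),
    'XG Boost': (66, 62),
}

# Train-minus-test gap in percentage points for each model
gaps = {model: train - test for model, (train, test) in quoted_scores.items()}

# List the models from the largest gap to the smallest
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: train-test gap = {gap} points")
```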

# 🚘**Thank you**