#  EXPLORATORY DATA ANALYSIS

INTRODUCTION


  Rapido is a quick, reliable, and affordable ride-hailing service that brings you seamless connectivity at your fingertips. Specializing in bike taxis and auto rides, Rapido allows you to skip traffic jams and reach your destination efficiently, all while keeping your budget in check. With a focus on convenience and user satisfaction, Rapido’s easy-to-use app ensures you can book a ride with just a few taps and track your driver in real time. Perfect for daily commutes or short trips around the city, Rapido makes urban travel simpler, faster, and more accessible for everyone.

Project Overview
 
  The main goal of this project is to analyze and derive insights from the Rapido dataset to understand user behavior, trip dynamics, and operational efficiencies. The insights will inform Rapido’s strategy for service improvements, user satisfaction, and operational optimization.

Introduction to Tools

  Python:

   Python is a powerful, high-level programming language known for its simplicity and versatility. Developed in the late 1980s by Guido van Rossum, Python has become one of the most popular languages for beginners and experts alike, thanks to its readable syntax and extensive library support. It is widely used across various fields, including data science, web development, machine learning, automation, and software development.

   Pandas:

   Pandas is a powerful and widely used Python library for data manipulation and analysis, particularly popular in data science and analytics. Created by Wes McKinney in 2008, Pandas provides data structures and functions that make it easy to clean, transform, and analyze structured data efficiently. The library’s two primary data structures are the Series, for one-dimensional data, and the DataFrame, for two-dimensional tabular data, similar to a spreadsheet or SQL table.
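
   As a minimal sketch of the two structures described above, the example below builds a Series and a DataFrame. The column names mirror ones used later in this notebook, but the values are invented for illustration and are not taken from the Rapido dataset.

```python
import pandas as pd

# A Series: one-dimensional labelled data (hypothetical fares)
fares = pd.Series([45.0, 60.5, 32.0], name="total_fare")

# A DataFrame: two-dimensional tabular data (hypothetical rows)
rides = pd.DataFrame({
    "services": ["bike", "auto", "bike"],
    "total_fare": [45.0, 60.5, 32.0],
})

# Mean of the Series, then mean fare per service from the DataFrame
print(fares.mean())
print(rides.groupby("services")["total_fare"].mean())
```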

   Matplotlib:

   Matplotlib is a popular data visualization library in Python, designed to create static, animated, and interactive plots. Initially developed by John D. Hunter in 2003, it has become one of the go-to tools for visualizing data, particularly in data science, engineering, and scientific research. Matplotlib provides a wide range of plotting capabilities, from simple line and scatter plots to complex multi-panel figures.

   Seaborn:

   Seaborn is a powerful Python library for statistical data visualization, built on top of Matplotlib. It simplifies the process of creating visually appealing and informative statistical graphics, especially for complex datasets. Developed by Michael Waskom, Seaborn provides an intuitive API that makes it easy to create a wide range of plot types, from simple bar charts and histograms to advanced visualizations like heatmaps, violin plots, and pair plots.

   NumPy:

   NumPy, short for "Numerical Python," is a foundational Python library for numerical computing, particularly valued for its efficient handling of large multi-dimensional arrays and matrices. Developed in 2006 by Travis Oliphant, NumPy provides essential data structures, such as ndarrays (N-dimensional arrays), that allow for fast and efficient array processing. It also includes a vast array of mathematical functions for performing operations on these arrays, making it a cornerstone for scientific and analytical computing in Python.
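
   A small sketch of an ndarray and the element-wise operations mentioned above. The distances are made-up values, used only to show the behavior.

```python
import numpy as np

# Hypothetical ride distances in kilometres, stored as a 2 x 2 ndarray
distances = np.array([[2.5, 4.0], [7.5, 1.0]])

print(distances.shape)    # dimensions of the array
print(distances.mean())   # mean over all elements
print(distances * 1000)   # element-wise conversion to metres
```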



#Importing modules

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

#Loading Data

In [3]:
df = pd.read_csv('C:\\Users\\Acer\\OneDrive\\Desktop\\Datascience\\Dataset\\Rdata.csv')

#Basic Operation

In [None]:
print(df)

In [None]:
print(df.head(7))

In [None]:
print(df.tail(8))

In [None]:
print(df.describe())

In [None]:
pd.options.display.max_rows = 10

#Data Cleaning

In [None]:
print(df.isnull())

In [None]:
print(df.isnull().sum())

In [8]:
df['time'] = df.time.apply(lambda x: x.split('.')[0])
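
The cleaning step above keeps only the part of each time string before the first dot. The sketch below shows the same idea with pandas string methods on a hypothetical sample. The exact format of the real `time` column is an assumption here (strings such as `HH:MM:SS.ffffff`).

```python
import pandas as pd

# Hypothetical sample; the real 'time' column format is assumed, not confirmed
sample = pd.DataFrame({"time": ["08:15:42.123456", "21:03:07.9", "12:00:00"]})

# Vectorized equivalent of apply(lambda x: x.split('.')[0])
sample["time"] = sample["time"].str.split(".").str[0]
print(sample["time"].tolist())
```

Unlike the `apply` version, the `.str` accessor passes missing values through as NaN instead of raising an error, which may matter if the column contains nulls.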

### DATA VISUALIZATION   

#How many rides were completed and how many were cancelled?

In [None]:
ride_status_counts = df['ride_status'].value_counts()

# Plotting the pie chart for ride status
plt.figure(figsize=(6, 6))
plt.pie(ride_status_counts, labels=ride_status_counts.index, autopct='%1.1f%%', startangle=90, colors=['green', 'red'])
plt.title('Completed vs. Cancelled Rides')
plt.show()
print("Total completed and cancelled rides:\n", ride_status_counts)

#Identifying the most used ride services?

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x='services')
plt.title('Distribution of Ride Services')
plt.show()
print("Most used services:\n",df['services'].value_counts())
print("Most used service:", df['services'].value_counts().idxmax())



##Here the most used service is Bike, likely because bikes move through traffic more easily and can take shortcut routes to the destination

#Which payment methods are used most?

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x='payment_method')
plt.title('Distribution of Payment Methods')
plt.show()
df['payment_method'].value_counts()


#Average Ride Duration for Completed Rides?

In [None]:
completed_rides = df[df['ride_status'] == 'completed']
duration = completed_rides['duration'].mean()
print("Average duration for completed rides:", duration, "minutes")
plt.figure(figsize=(6, 4))
plt.bar(['Completed Rides'], [duration], color='green')
plt.xlabel('Ride Status')
plt.ylabel('Average Duration (minutes)')
plt.title('Average Ride Duration for Completed Rides')
plt.ylim(0, max(duration + 10, 80))  # Setting a reasonable y-limit
plt.show()





#Are cancellations more common in specific areas? Count cancellations by source and destination

In [None]:
canceled_rides = df[df['ride_status'] == 'cancelled']
# Count cancellations by source and destination
source_cancellations = canceled_rides['source'].value_counts().head(10)  # Top 10 sources
destination_cancellations = canceled_rides['destination'].value_counts().head(10)  # Top 10 destinations

# Plot cancellations by top sources
plt.figure(figsize=(10, 6))
source_cancellations.plot(kind='bar', color='purple')
plt.title('Top 10 Locations for Ride Cancellations - Source')
plt.xlabel('Source Location')
plt.ylabel('Number of Cancellations')
plt.show()
# Print the full cancellation counts for every source
print(canceled_rides['source'].value_counts())
# Plot cancellations by top destinations
plt.figure(figsize=(10, 6))
destination_cancellations.plot(kind='bar', color='teal')
plt.title('Top 10 Locations for Ride Cancellations - Destination')
plt.xlabel('Destination Location')
plt.ylabel('Number of Cancellations')
plt.show()
# Print the full cancellation counts for every destination
print(canceled_rides['destination'].value_counts())


#The average cost per kilometer and average cost per minute?

In [None]:
# Filter only completed rides with non-null total fare values
# (.copy() avoids SettingWithCopyWarning when new columns are added below)
completed_rides = df[(df['ride_status'] == 'completed') & (df['total_fare'].notna())].copy()

# Calculate cost per kilometer and cost per minute
completed_rides['cost_per_km'] = completed_rides['total_fare'] / completed_rides['distance']
completed_rides['cost_per_min'] = completed_rides['total_fare'] / completed_rides['duration']

# Calculate the average cost per kilometer and per minute
average_cost_per_km = completed_rides['cost_per_km'].mean()
average_cost_per_min = completed_rides['cost_per_min'].mean()
print('Average Cost per KM:', average_cost_per_km)
print('Average Cost per Minute:', average_cost_per_min)

# Plotting the average cost per kilometer and per minute in a bar plot
plt.figure(figsize=(8, 6))
plt.bar(['Average Cost per KM', 'Average Cost per Minute'], 
        [average_cost_per_km, average_cost_per_min], 
        color=['green', 'blue'])
plt.title('Average Ride Cost per Kilometer and per Minute')
plt.ylabel('Average Cost ($)')
plt.show()
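
One caveat with the per-kilometer and per-minute ratios above: if any completed ride has a distance or duration of zero, the division produces `inf`, which distorts the mean. Whether the Rapido data contains such rows is not confirmed here, so the sketch below uses invented values to show one way to guard against it.

```python
import numpy as np
import pandas as pd

# Hypothetical completed rides; the second row has a zero distance
rides = pd.DataFrame({
    "total_fare": [50.0, 80.0, 30.0],
    "distance": [5.0, 0.0, 2.0],
})

# Division by zero yields inf; convert it to NaN so that mean() skips it
cost_per_km = (rides["total_fare"] / rides["distance"]).replace([np.inf, -np.inf], np.nan)
print(cost_per_km.mean())
```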


#Calculate the average ride_charge and misc_charge for each ride service type?

In [None]:
charge_by_service = df.groupby('services')[['ride_charge', 'misc_charge']].mean()
print(charge_by_service)
# Plot the variations by ride service type
charge_by_service.plot(kind='bar', figsize=(10, 6), color=['skyblue', 'red'])
plt.title('Average Ride and Misc Charges by Service Type')
plt.xlabel('Ride Service Type')
plt.ylabel('Average Charge ($)')
plt.show()

#Which routes (source to destination) are most frequently traveled?

In [None]:
# Group by 'source' and 'destination' and count the occurrences of each route
route_counts = df.groupby(['source', 'destination']).size().reset_index(name='count')
# Sort the routes by frequency in descending order and select the top 10 most frequent routes
top_routes = route_counts.sort_values(by='count', ascending=False).head(10)
print('Top 10 Most Frequently Traveled Routes:\n', top_routes)

# Plot the top 10 most frequently traveled routes
plt.figure(figsize=(12, 6))
plt.barh(top_routes.apply(lambda x: f"{x['source']} -> {x['destination']}", axis=1), 
         top_routes['count'], color='skyblue')
plt.xlabel('Number of Rides')
plt.ylabel('Route (Source -> Destination)')
plt.title('Top 10 Most Frequently Traveled Routes')
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

#Is there any correlation between the total fare and the chosen payment method?

In [None]:
# Calculate the average total fare for each payment method
average_fare_by_payment = df.groupby('payment_method')['total_fare'].mean().reset_index()
print(average_fare_by_payment)
# Plot the average total fare by payment method
plt.figure(figsize=(10, 6))
plt.bar(average_fare_by_payment['payment_method'], average_fare_by_payment['total_fare'], color='brown')
plt.title('Average Total Fare by Payment Method')
plt.xlabel('Payment Method')
plt.ylabel('Average Total Fare ($)')
plt.xticks(rotation=45)  # Rotate labels for readability
plt.show()

#Are cancellations more frequent at certain times of day (night, morning, afternoon, evening)?

In [None]:
# Filter for canceled rides (.copy() avoids SettingWithCopyWarning when adding columns below)
canceled_rides = df[df['ride_status'] == 'cancelled'].copy()

# Convert 'time' to datetime and extract hour
canceled_rides['hour'] = pd.to_datetime(canceled_rides['time']).dt.hour

# Define time intervals
time_bins = [0, 6, 12, 18, 24]
time_labels = ['Night', 'Morning', 'Afternoon', 'Evening']
canceled_rides['time_of_day'] = pd.cut(canceled_rides['hour'], bins=time_bins, labels=time_labels, right=False)

# Plot cancellation frequency by time of day
time_of_day_counts = canceled_rides['time_of_day'].value_counts().sort_index()
print(time_of_day_counts)
plt.figure(figsize=(8, 6))
time_of_day_counts.plot(kind='bar', color='orange')
plt.title('Cancellations by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Number of Cancellations')
plt.show()
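
To make the binning in the cell above concrete, the sketch below applies the same `time_bins` and `time_labels` to a handful of hand-picked hours (not taken from the dataset). With `right=False`, each interval includes its left edge and excludes its right edge, so hour 6 falls in Morning and hour 18 in Evening.

```python
import pandas as pd

# Hand-picked hours sitting on or next to each bin edge
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])
time_bins = [0, 6, 12, 18, 24]
time_labels = ['Night', 'Morning', 'Afternoon', 'Evening']

# right=False makes the bins [0, 6), [6, 12), [12, 18), [18, 24)
time_of_day = pd.cut(hours, bins=time_bins, labels=time_labels, right=False)
print(time_of_day.tolist())
```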



## Conclusion

Ride Trends: Rides are busiest during specific times, like rush hours, and on certain days, such as weekends or weekdays.

Service Usage: Different services have varying levels of popularity and fares, catering to different customer preferences.

Cancellations: Cancellations tend to happen more often at certain times or days, which may point to service issues or user needs.

Fare Patterns: Fare amounts depend on the ride's distance and duration, with some services having higher costs than others.

Improvement Opportunities: These insights can help better manage vehicles, improve services, and increase profits.