# **Project Name**    - Ford GoBike Analysis




##### **Project Type**    - EDA
##### **Contribution**    - Individual
Team member name: K. Sai Chandrahasa

# **Project Summary -**

This project analyses how the Ford GoBike bike-share service is used in the San Francisco Bay Area. Riders are either customers or subscribers, and they rent bikes for leisure rides or for commuting from one place to another. The dataset records each rental together with its duration. Using it, we analyse which user type (subscriber or customer) uses the service more, for how long, and how often. We also break usage down by rider gender and by day of the week to see when demand is highest, and we draw out insights on how the business can be improved.


# **GitHub Link -**

(https://github.com/chandrahasa0819/ford-gobike.git)

# **Problem Statement**


The Ford GoBike dataset contains information about bike-sharing usage patterns in San Francisco, including start and end time, duration, start and end station location with ID, member birth year and gender, and the user type. However, the key factors influencing bike usage, the trends in the data, and the actionable insights it holds have yet to be fully explored. The goal of this project is to perform an Exploratory Data Analysis (EDA) on the Ford GoBike data to uncover patterns in bike usage, understand user behavior, and highlight potential areas for service improvement.

#### **Define Your Business Objective?**

The business objective of this EDA is to provide insights that can help Ford GoBike enhance its service offering. By analyzing the data, the goal is to:

Identify peak usage times and locations to optimize bike distribution and station placement.

Understand user demographics and behaviors to tailor marketing efforts and improve user engagement.

Explore patterns in bike usage duration and trip types to optimize the fleet and improve operational efficiency.

Discover any potential bottlenecks or underutilized areas where the service can be adjusted to improve overall customer satisfaction and profitability.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/201801-fordgobike-tripdata.csv')
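
Since the guidelines ask for exception handling, a more defensive loader could look like the sketch below. The names `load_trips` and `REQUIRED_COLUMNS` are hypothetical, the column set is an assumption taken from the columns used later in this notebook, and the inline CSV is made-up sample data rather than rows from the real file.

```python
import io
import pandas as pd

# Columns the later cells rely on (assumption based on this notebook's code)
REQUIRED_COLUMNS = {"duration_sec", "start_time", "end_time", "user_type"}

def load_trips(source):
    """Read trip data from a path or file-like object and validate its columns."""
    try:
        trips = pd.read_csv(source)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"Trip data not found: {source}") from exc
    missing = REQUIRED_COLUMNS - set(trips.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return trips

# Tiny made-up sample standing in for the real CSV file
sample_csv = io.StringIO(
    "duration_sec,start_time,end_time,user_type\n"
    "600,2018-01-31 08:00:00,2018-01-31 08:10:00,Subscriber\n"
    "45,2018-01-31 09:00:00,2018-01-31 09:00:45,Customer\n"
)
sample = load_trips(sample_csv)
print(sample.shape)
```

Failing early with a clear message makes it easier to run the whole notebook in one go, because a wrong path or a renamed column is reported at load time instead of several cells later.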

### Dataset First View

In [None]:
# Dataset First Look
print("Shape of dataset:", df.shape)
print("\nColumn names:\n", df.columns)
print("\nData types:\n", df.dtypes)
print("\nFirst 5 rows:\n")
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing Values in each column:\n", missing_values[missing_values > 0])

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
missing_values.sort_values(ascending=False).plot(kind='bar', color='skyblue')
plt.title('Missing Values Count per Column', fontsize=16)
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()
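
Raw counts of missing values are easier to judge as a share of all rows. The sketch below shows one way to compute that share; the `demo` frame is made-up data, and which columns actually contain gaps in the real dataset should be read from the output above.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns
demo = pd.DataFrame({
    "duration_sec": [300, 620, 95, 1800],
    "member_birth_year": [1988.0, np.nan, 1975.0, np.nan],
    "member_gender": ["Male", None, "Female", "Other"],
})

# Share of missing values per column, as a percentage
missing_pct = demo.isnull().mean().mul(100).round(1)

# Keep only the columns that actually have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```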


### What did you know about your dataset?


The dataset records each GoBike rental: details about the rider, the start and end time of the rental period, the station where the bike was picked up, and the destination station.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description



*   Trip duration (`duration_sec`): total time of the trip in seconds.
*   Start and end time (`start_time`, `end_time`): timestamps of when the trip began and ended.
*   Start and end station name and ID: names and identifiers of the start and end locations.
*   Rider type (`user_type`): whether the rider is a customer or a subscriber.
*   Member gender and birth year (`member_gender`, `member_birth_year`): details of the rider.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.drop_duplicates(inplace=True)

df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

df['start_hour'] = df['start_time'].dt.hour
df['start_day'] = df['start_time'].dt.day_name()
df['start_month'] = df['start_time'].dt.month_name()

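# Remove trips shorter than 60 seconds; copy so later column assignments do not act on a slice
df = df[df['duration_sec'] >= 60].copy()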

print("Data wrangling completed.")

### What all manipulations have you done and insights you found?


Removed duplicate rows. Converted the start time and end time columns to datetime format. Created new columns (start hour, start day, start month) from the start time column. Also removed trips shorter than 60 seconds, since such short trips are likely errors.
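
The same steps can be wrapped in a function so they are easy to re-run and test. This is only a sketch: `prepare_trips` and its `min_duration_sec` parameter are hypothetical names, and the `raw` frame is made-up data.

```python
import pandas as pd

def prepare_trips(raw, min_duration_sec=60):
    """Apply the wrangling steps to a copy of the data and return the result."""
    trips = raw.drop_duplicates().copy()
    trips["start_time"] = pd.to_datetime(trips["start_time"])
    trips["end_time"] = pd.to_datetime(trips["end_time"])
    trips["start_hour"] = trips["start_time"].dt.hour
    trips["start_day"] = trips["start_time"].dt.day_name()
    trips["start_month"] = trips["start_time"].dt.month_name()
    # Keep only trips at or above the minimum duration
    return trips[trips["duration_sec"] >= min_duration_sec].reset_index(drop=True)

# Made-up input: one duplicated trip and one trip shorter than 60 seconds
raw = pd.DataFrame({
    "duration_sec": [600, 600, 45],
    "start_time": ["2018-01-31 08:00:00", "2018-01-31 08:00:00", "2018-01-31 09:00:00"],
    "end_time": ["2018-01-31 08:10:00", "2018-01-31 08:10:00", "2018-01-31 09:00:45"],
})
clean = prepare_trips(raw)
print(clean[["duration_sec", "start_hour", "start_day", "start_month"]])
```

Returning a new frame instead of modifying the input in place keeps the raw data available for comparison.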

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(data=df, x='user_type', y='duration_sec', estimator='mean', errorbar=None)
plt.title("Average Trip Duration by User Type")
plt.ylabel("Average Trip Duration (seconds)")
plt.xlabel("User Type")
plt.show()

##### 1. Why did you pick the specific chart?


I chose a bar chart to compare the average trip duration between subscribers and customers. Bar charts are effective for showing this type of comparison.

##### 2. What is/are the insight(s) found from the chart?


We can see that customers have a longer average trip duration than subscribers. A likely explanation is that customers use the bikes for leisure rides, whereas subscribers use them for commuting.
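
Because a few very long trips can pull the mean upward, it may help to look at the median next to it. The sketch below uses made-up durations, not values from the dataset.

```python
import pandas as pd

# Made-up trips: one very long customer trip inflates that group's mean
trips = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "duration_sec": [900, 1200, 7200, 480, 600, 720],
})

# Count, mean and median trip duration for each user type
summary = trips.groupby("user_type")["duration_sec"].agg(["count", "mean", "median"])
print(summary)
```

If the mean and median of a group differ a lot, the bar chart of means is being driven by outliers, and the median is the safer number to quote.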

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights can help create a positive business impact: they can be used for marketing, because as more people use the service, more people will get to know about it and try it.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Number of rides by day of the week
df['day_of_week'] = pd.to_datetime(df['start_time']).dt.day_name()

# Now plot
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='day_of_week', order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title("Ride Count by Day of the Week")
plt.xlabel("Day of the Week")
plt.ylabel("Number of Rides")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?


This chart shows bike usage by day of the week. Knowing which days see the most rides helps in promoting the business and in designing offers to attract new customers.

##### 2. What is/are the insight(s) found from the chart?


The chart shows that weekdays have more rides than weekends, which matches the weekday peaks seen in Charts 7 and 12.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. Increasing bike availability on weekdays lets more people ride when demand is highest, which supports growth of the business.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Average ride duration by user type
plt.figure(figsize=(8,5))
sns.barplot(data=df, x='user_type', y='duration_sec', estimator=np.mean, errorbar=None)
plt.title("Average Ride Duration by User Type")
plt.xlabel("User Type")
plt.ylabel("Average Ride Duration (in seconds)")  # It's seconds for now
plt.show()

##### 1. Why did you pick the specific chart?


I picked this chart to compare the typical ride duration of customers and subscribers.

##### 2. What is/are the insight(s) found from the chart?


The analysis shows that customers take longer rides on average than subscribers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. Understanding these patterns can help tailor services — for example, offering time-based promotions to casual users or loyalty benefits to encourage them to subscribe.


#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Ride frequency by day of the week
df['day_of_week'] = pd.to_datetime(df['start_time']).dt.day_name()
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='day_of_week', order=[
    'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title("Ride Frequency by Day of the Week")
plt.xlabel("Day of the Week")
plt.ylabel("Number of Rides")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?


I chose this chart to identify the weekly ride patterns and find out which days experience the highest or lowest demand. Understanding usage trends over weekdays vs weekends is essential for operations and marketing.


##### 2. What is/are the insight(s) found from the chart?


The chart shows higher ride counts on weekdays, suggesting commuter usage, while weekends have fewer rides, which are more likely leisure trips.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. These insights can help optimize bike availability and maintenance scheduling. Marketing campaigns and discounts can also be tailored for peak days to boost engagement and conversion.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Trip Duration Distribution
plt.figure(figsize=(10, 6))
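# Plot only trips up to 5000 seconds so the 100 bins cover the visible range
sns.histplot(df.loc[df['duration_sec'] <= 5000, 'duration_sec'], bins=100, kde=True)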
plt.title('Trip Duration Distribution')
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Frequency')
plt.xlim(0, 5000)  # focus on trips less than ~1.5 hours
plt.show()


##### 1. Why did you pick the specific chart?


I chose this chart to understand how trip durations are distributed and identify if most trips are short or long.

##### 2. What is/are the insight(s) found from the chart?


Most trips are short, with the majority completed in under 1500 seconds (25 minutes).
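
The share of short trips can also be checked numerically rather than read off the histogram. The sketch below uses a made-up `durations` series; on the real data the same two lines would be applied to `df['duration_sec']`.

```python
import pandas as pd

# Made-up trip durations in seconds
durations = pd.Series(
    [120, 300, 450, 600, 800, 1000, 1400, 2000, 3600, 9000],
    name="duration_sec",
)

# Fraction of trips shorter than 1500 seconds (25 minutes)
share_under_25_min = (durations < 1500).mean()

# Quartiles give a compact picture of the skewed distribution
quartiles = durations.quantile([0.25, 0.5, 0.75])

print(f"Share of trips under 1500 s: {share_under_25_min:.0%}")
print(quartiles)
```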

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, by focusing on short-term rentals, the company can optimize bike availability. No negative growth insights.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
# User Type Count
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='user_type', palette='Set2')
plt.title('Distribution of User Types')
plt.xlabel('User Type')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?


This chart helps in understanding the count of subscribers vs customers.

##### 2. What is/are the insight(s) found from the chart?


Here we found that subscribers account for more trips than customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the company can focus more on the subscription plans and offers to convert customers into subscribers.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Trips by Day of the Week
plt.figure(figsize=(10, 6))
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.countplot(data=df, x='day_of_week', order=order, palette='viridis')
plt.title('Number of Trips by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Trips')
plt.show()


##### 1. Why did you pick the specific chart?


To analyze peak days of bike usage and spot any weekday vs weekend trends.

##### 2. What is/are the insight(s) found from the chart?


Weekdays, especially Tuesday to Thursday, have the highest trip volumes.
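
The weekday-versus-weekend split can be confirmed with a small table. The sketch below uses made-up start times; `DAY_ORDER` is a hypothetical helper name.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up start times covering one week in January 2018
starts = pd.to_datetime(pd.Series([
    "2018-01-01 08:05", "2018-01-02 08:15", "2018-01-02 17:30",
    "2018-01-03 09:00", "2018-01-06 11:00", "2018-01-07 14:20",
]))

day_names = starts.dt.day_name()

# Rides per day in calendar order, with 0 for days that have no rides
rides_per_day = day_names.value_counts().reindex(DAY_ORDER, fill_value=0)

# Fraction of all rides that start on a Saturday or Sunday
weekend_share = day_names.isin(["Saturday", "Sunday"]).mean()

print(rides_per_day)
print(f"Weekend share of rides: {weekend_share:.1%}")
```

Note that there are five weekdays and only two weekend days, so comparing average rides per day is fairer than comparing the raw totals of the two groups.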

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, bike maintenance and rebalancing efforts can be prioritized on weekdays.
No major negative trend observed.



#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Average Trip Duration by User Type
plt.figure(figsize=(8, 6))
avg_duration = df.groupby('user_type')['duration_sec'].mean().reset_index()
sns.barplot(data=avg_duration, x='user_type', y='duration_sec', palette='coolwarm')
plt.title('Average Trip Duration by User Type')
plt.xlabel('User Type')
plt.ylabel('Average Trip Duration (seconds)')
plt.show()


##### 1. Why did you pick the specific chart?


To compare how much time each user type spends on trips on average.

##### 2. What is/are the insight(s) found from the chart?


Customers have longer average trip durations compared to subscribers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, pricing strategies can be customized, like offering short-trip discounts to customers.
Long trip durations by customers could cause slight bike shortages if not managed.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Trips by Start Hour
plt.figure(figsize=(10, 6))
df['start_hour'] = pd.to_datetime(df['start_time']).dt.hour
# discrete=True draws one bar centred on each whole hour
sns.histplot(df['start_hour'], discrete=True, color='skyblue')
plt.title('Trips by Start Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Trips')
plt.xticks(range(0, 24))
plt.show()

##### 1. Why did you pick the specific chart?


To identify peak hours for trip starting times and plan bike availability.

##### 2. What is/are the insight(s) found from the chart?


Morning (7-9 AM) and evening (4-6 PM) peaks suggest commuter usage patterns.
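
Peak hours can be listed directly from the counts instead of being estimated from the bars. The sketch below uses a made-up `start_hours` series.

```python
import pandas as pd

# Made-up start hours (0-23) for a handful of trips
start_hours = pd.Series([8, 8, 8, 9, 12, 17, 17, 17, 17, 18, 23], name="start_hour")

# Count trips for every hour of the day, including hours with no trips
trips_per_hour = start_hours.value_counts().reindex(range(24), fill_value=0)

# The two busiest start hours
top_hours = trips_per_hour.nlargest(2)
print(top_hours)
```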

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, strategic placement of bikes during peak commuter hours can boost satisfaction.
No negative growth trends found.




#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Gender Distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='member_gender', palette='pastel')
plt.title('Gender Distribution of Users')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?


To understand gender-wise usage of the bikes.

##### 2. What is/are the insight(s) found from the chart?

Male users dominate the rides compared to female and other categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, special marketing campaigns can be launched to attract underrepresented groups.
Skewed gender ratio might indicate untapped market potential.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Age Distribution of Users
plt.figure(figsize=(10, 6))
# Age in the year of the trip, so the result does not depend on when the notebook is run
df['member_age'] = df['start_time'].dt.year - df['member_birth_year']
sns.histplot(df['member_age'], bins=30, kde=True, color='coral')
plt.title('Age Distribution of Users')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.xlim(10, 80)
plt.show()


##### 1. Why did you pick the specific chart?

To understand the age demographics of users and focus marketing efforts.

##### 2. What is/are the insight(s) found from the chart?

Most users are between 25 to 40 years old, suggesting a young adult audience.
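
Age is most meaningful when measured in the year of the trip, and self-reported birth years can contain implausible values. The sketch below shows both ideas on made-up data; the bounds of 16 and 90 are hypothetical and should be adjusted to the service's actual age policy.

```python
import numpy as np
import pandas as pd

# Made-up riders: one implausible birth year and one missing value
riders = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2018-01-05 08:00", "2018-01-06 09:30",
        "2018-01-07 18:45", "2018-01-08 07:15",
    ]),
    "member_birth_year": [1988.0, 1900.0, np.nan, 1995.0],
})

# Age in the year of the trip, not in the year the notebook is run
riders["member_age"] = riders["start_time"].dt.year - riders["member_birth_year"]

# Keep only plausible ages (hypothetical bounds); missing ages are dropped too
plausible = riders[riders["member_age"].between(16, 90)]
print(plausible[["member_birth_year", "member_age"]])
```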

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, targeted promotions and partnerships can be made for the young professional audience.
Older age groups are relatively underrepresented, which might be an area for growth.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Heatmap of Trips by Day and Hour
df['start_day'] = pd.to_datetime(df['start_time']).dt.day_name()
pivot = df.pivot_table(index='start_day', columns='start_hour', values='bike_id', aggfunc='count')
pivot = pivot.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.figure(figsize=(15, 6))
sns.heatmap(pivot, cmap='YlGnBu')
plt.title('Heatmap of Number of Trips by Day and Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.show()

##### 1. Why did you pick the specific chart?

To visualize the density of rides across different days and hours.

##### 2. What is/are the insight(s) found from the chart?

High usage is seen during weekday mornings and evenings, confirming commuter behavior.
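
The busiest day and hour can be read from the underlying table as well as from the colours. The sketch below builds a day-by-hour grid from made-up start times; `DAY_ORDER` is a hypothetical helper name.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up start times
starts = pd.to_datetime(pd.Series([
    "2018-01-01 08:05", "2018-01-02 08:15", "2018-01-02 08:40",
    "2018-01-02 17:30", "2018-01-06 11:00",
]))

# Count trips per (day, hour) and fill in the combinations with no trips
grid = (
    pd.crosstab(starts.dt.day_name(), starts.dt.hour)
    .reindex(index=DAY_ORDER, columns=range(24), fill_value=0)
    .rename_axis(index="start_day", columns="start_hour")
)

# Find the day with the busiest single hour, then that hour within the day
busiest_day = grid.max(axis=1).idxmax()
busiest_hour = grid.loc[busiest_day].idxmax()
print(busiest_day, busiest_hour, grid.loc[busiest_day, busiest_hour])
```

Reindexing to all seven days and all 24 hours keeps the grid shape fixed, so empty slots show as zeros rather than disappearing from the table.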

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, helps optimize bike rebalancing and station placements during busy times.
Low weekend usage might indicate an opportunity for weekend-specific offers.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Box Plot of Trip Duration by User Type
plt.figure(figsize=(10, 6))
sns.boxplot(x='user_type', y='duration_sec', data=df, palette='muted')
plt.title('Trip Duration by User Type (Box Plot)')
plt.xlabel('User Type')
plt.ylabel('Trip Duration (seconds)')
plt.ylim(0, 5000)  # focus on trips < 5000 seconds (~1.5 hours)
plt.show()


##### 1. Why did you pick the specific chart?

To explore the spread and outliers of trip durations by different user types.

##### 2. What is/are the insight(s) found from the chart?

Customers show more variability and longer trips compared to subscribers.
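
The points a box plot draws as outliers are usually those beyond Q3 + 1.5 × IQR. The sketch below computes that limit per user type on made-up data; `upper_fence` is a hypothetical helper name.

```python
import pandas as pd

# Made-up trips with one very long trip in each group
trips = pd.DataFrame({
    "user_type": ["Customer"] * 5 + ["Subscriber"] * 5,
    "duration_sec": [600, 900, 1200, 1500, 20000, 300, 450, 600, 750, 5000],
})

def upper_fence(values):
    """Upper outlier limit of a standard box plot: Q3 + 1.5 * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# One limit per user type
fences = trips.groupby("user_type")["duration_sec"].apply(upper_fence)

# Flag trips that exceed the limit of their own group
trips["is_outlier"] = trips["duration_sec"] > trips["user_type"].map(fences)

print(fences)
print(trips[trips["is_outlier"]])
```

Flagged trips are candidates for review, not proof of misuse; a long duration can also come from a bike that was not docked correctly.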

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, better understanding of customer behavior can guide service improvements.
Trips with very long durations might require checking for bike hoarding or misuse.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code
# Build a numeric frame from columns that exist in this dataset
corr_df = pd.DataFrame({
    'duration_sec': df['duration_sec'],
    'start_hour': df['start_hour'],
    # Monday=0 ... Sunday=6; derived from the timestamp so 'start_day' keeps its day names
    'start_weekday': df['start_time'].dt.dayofweek,
    'member_birth_year': df['member_birth_year'],
})

# Ensure all selected columns are numeric (convert if necessary)
corr_df = corr_df.apply(pd.to_numeric, errors='coerce')

# Drop rows with missing values for a clean correlation matrix
df_corr = corr_df.dropna()

# Compute the correlation matrix
corr_matrix = df_corr.corr()

# Plotting the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Ford GoBike Dataset', fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

I selected the correlation heatmap because it provides a quick and intuitive understanding of how different numerical features are related to each other. It helps to easily spot strong positive or negative correlations, which is essential for identifying key drivers and potential multicollinearity in the Ford GoBike dataset.

##### 2. What is/are the insight(s) found from the chart?

Trip duration has a slight negative correlation with user birth year, indicating that older users tend to take longer trips compared to younger users.

Start hour and start day show very weak correlations with trip duration, suggesting that the time of day or day of the week does not heavily impact the trip length.

Most of the features are weakly correlated, implying that each variable provides relatively independent information, which is good for diverse analysis.
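
Trip duration is strongly right-skewed, and Pearson correlation is sensitive to a few extreme values. Comparing it with the rank-based Spearman correlation is one way to check how stable these readings are. The sketch below uses made-up values, not the dataset.

```python
import pandas as pd

# Made-up features: duration rises as birth year falls, with one extreme trip
features = pd.DataFrame({
    "duration_sec": [300, 420, 600, 900, 1500, 36000],
    "member_birth_year": [1995, 1992, 1988, 1985, 1979, 1970],
    "start_hour": [8, 17, 9, 18, 12, 14],
})

# Pearson measures linear association; Spearman measures monotonic association
pearson = features.corr(method="pearson")
spearman = features.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

When the two matrices disagree noticeably, the relationship is probably monotonic but not linear, or is being driven by outliers.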

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select key numerical columns plus user_type for the hue
pairplot_cols = ['duration_sec', 'start_hour', 'member_birth_year', 'user_type']

# Drop NA values for clean visuals
df_pair = df[pairplot_cols].dropna()

# Create pair plot with hue on user_type; plot_kws is passed to the off-diagonal scatter plots
sns.pairplot(df_pair, hue='user_type', palette='Set2', plot_kws={'alpha':0.5})
plt.suptitle('Pair Plot of Selected Features by User Type', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I chose the pair plot to visualize the interaction between multiple numerical variables in the dataset simultaneously, segmented by user type. It allows us to see potential clusters, patterns, and relationships more clearly across different feature combinations.

##### 2. What is/are the insight(s) found from the chart?

Subscribers tend to have shorter trip durations compared to customers.

Most users are concentrated in the 25–40 age range (based on birth year).

There is a higher density of subscribers taking trips during specific start hours (likely during commute hours).

Trip duration has a wider spread for customers than for subscribers, suggesting more varied usage patterns.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis, the client should target the highly active 25–40 age segment through corporate alliances and loyalty initiatives to build on the subscription model. Increasing bike availability during peak commute hours should raise customer satisfaction, while flexible plans and promotions for occasional users can boost weekend ridership. Strategically expanding docking stations near business centers and tourist spots, tracking trip lengths to monitor for misuse, and regularly using data insights to adjust operations will help the client drive growth, strengthen customer loyalty, and achieve long-term business success.

# **Conclusion**

Finally, the Ford GoBike study has given key insights into how users behave, when they ride most, and trip patterns. By using this information, the company can fine-tune operations, customize marketing efforts, and improve customer satisfaction. This information-driven strategy will enable improved decision-making and further sustainable growth and enhanced service delivery.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***