<a href="https://colab.research.google.com/github/bidyashreenayak0211/Labmentix-Internship/blob/main/Ford_Go_Bike_Sharing_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name    - Ford Go Bike Sharing EDA**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name** - Bidyashree Nayak

# **Project Summary -**

The rapid urbanization of cities has created a growing demand for efficient, eco-friendly, and cost-effective means of transportation. As a response, bike-sharing systems have become a popular alternative for short-distance commutes. This project revolves around analyzing data from Ford GoBike, a public bicycle sharing system that operates in the San Francisco Bay Area. The dataset provides detailed insights into user behavior, trip patterns, and system performance, which can serve as the foundation for optimizing operations, improving user experience, and supporting sustainable urban mobility.

The objective of this Exploratory Data Analysis (EDA) project is to extract meaningful patterns and insights from the Ford GoBike data through rigorous data wrangling, visualization, and statistical exploration. The dataset includes features such as trip duration, start and end stations, user types, and demographics, all of which allow for a multi-dimensional analysis of how the bike-sharing service is used across different regions and user groups.

By performing a thorough EDA, we aim to answer key questions like: What are the most popular times for bike usage? How do patterns differ by user type (subscriber vs. customer)? Are there spatial trends in station usage? What demographic factors affect trip duration or frequency? These insights are critical for decision-makers looking to enhance bike availability, optimize station placements, and tailor marketing strategies to specific customer segments.

Ultimately, the analysis seeks to support Ford GoBike’s long-term vision of promoting cycling as a mainstream mode of urban transport. Insights generated through this analysis can guide improvements in service reliability, coverage, and satisfaction, all while promoting a greener transportation alternative.

# **GitHub Link -**

https://github.com/bidyashreenayak0211/Labmentix-Internship/blob/main/Ford_Go_Bike_Sharing_EDA.ipynb

# **Problem Statement**


Despite the increasing adoption of bike-sharing services as a sustainable and convenient mode of urban transportation, **Ford GoBike** (now known as Bay Wheels) continues to face significant operational challenges in aligning bike availability with fluctuating user demand across different geographic areas and time periods. The core of the issue lies in the **lack of comprehensive, data-driven understanding** of several critical factors, including:

- **User behavior and usage patterns** – Variations in how, when, and where users access bikes can lead to unpredictable spikes in demand.
- **Rider demographics** – Without detailed insights into who the users are (e.g., age, gender, membership status), it becomes difficult to personalize services or design targeted initiatives.
- **Station-level performance** – Inconsistent monitoring and evaluation of station utilization lead to imbalances, where some stations may be overcrowded while others face persistent bike shortages.

These gaps in operational intelligence result in multiple negative consequences:
- **Bike shortages** at high-demand stations during peak hours, frustrating users and discouraging repeat usage.
- **Docking congestion** at low-turnover stations, reducing system efficiency and overall accessibility.
- **Inefficient resource allocation**, as bikes and staff are not optimally distributed across the network.
- **Poor user experience**, which undermines customer satisfaction and loyalty.

In the absence of robust analytics and predictive tools, Ford GoBike struggles to adapt dynamically to demand trends or proactively address imbalances. This impedes their ability to grow sustainably, deliver a reliable service, and compete with other mobility solutions in the increasingly crowded urban transport ecosystem.

A data-informed approach is crucial to overcome these challenges, enabling Ford GoBike to enhance service reliability, improve operational efficiency, and better cater to the needs of diverse riders.

#### **Define Your Business Objective?**

1. **Understand User Behavior**  
   Gain a comprehensive understanding of how different user groups interact with the bike-sharing service. This involves analyzing **who** is using the system (e.g., casual users vs. subscribers, different age groups and genders), **when** they are riding (hour of day, day of week, seasonality), **where** rides are starting and ending, and **how long** trips typically last.  
   By uncovering these behavioral patterns and trends, Ford GoBike can:
   - Develop detailed **user personas** for more personalized services.
   - Identify peak usage periods and frequently used routes.
   - Discover gaps in service accessibility among certain demographics or neighborhoods.

2. **Optimize Station Placement and Bike Availability**  
   Conduct spatial and temporal analysis to pinpoint **high-traffic vs. underutilized stations**, as well as critical time windows where demand peaks or drops. This will inform decisions around:
   - **Station expansion, relocation, or removal** based on neighborhood demand and urban infrastructure.
   - Dynamic **rebalancing strategies** to ensure an optimal supply of bikes and docks throughout the day.
   - Placement of **temporary or seasonal stations** during special events or in tourist-heavy areas.

   The goal is to minimize station overcrowding and bike shortages, ensuring a seamless and consistent user experience across the service area.

3. **Improve Customer Retention and Acquisition**  
   Leverage insights from behavioral differences between **subscribers and casual users** to develop **targeted marketing campaigns**, **user incentives**, and **loyalty programs**. By analyzing:
   - What motivates casual users to convert to subscribers.
   - What triggers churn among existing members.
   - Which offers or features increase engagement.

   Ford GoBike can design strategies to not only **acquire new riders** but also **retain and deepen the engagement** of current users.

4. **Enhance Operational Efficiency**  
   Apply predictive analytics to historical usage data to forecast **demand surges** by location and time, enabling:
   - **Proactive bike redistribution**, reducing the need for costly reactive measures.
   - **Efficient staffing and fleet logistics**, especially during peak hours or events.
   - Improved **maintenance scheduling** by identifying usage trends that correlate with wear and tear.

   This will lead to cost savings, increased system reliability, and a more scalable operation.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4.   You may add as many charts as you want. Make sure the following format is answered for each chart.

```
# Chart visualization code
```

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints: Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/201801-fordgobike-tripdata.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape

print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_rows = df.duplicated()
duplicate_count = duplicate_rows.sum()

print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()

print("Missing Values:")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='coolwarm', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset comprises **94,802 rows and 16 columns**, capturing detailed trip-level data from Ford GoBike (now Bay Wheels); the file name, `201801-fordgobike-tripdata.csv`, indicates that the trips are from January 2018. It provides a comprehensive view of individual bike rides, including **ride duration, start and end times**, **station information** (with IDs, names, and geographic coordinates), **bike IDs**, and **user attributes**. Each row represents a single trip, and there are **no duplicate entries**, so no deduplication is needed.

Most columns are fully populated, but there are **notable missing values in two demographic fields**: `member_birth_year` and `member_gender`, with approximately **7,800–7,900 missing entries** each. These gaps suggest that some users either opted not to disclose demographic information or were casual users for whom such data wasn’t collected. Despite this, the dataset still contains a significant volume of user demographic information, making it suitable for **segmentation and behavior analysis**.

The presence of both **temporal (`start_time`, `end_time`)** and **spatial (`latitude` and `longitude` of stations)** data enables rich **spatiotemporal analysis**, such as identifying usage patterns by time of day and mapping high-demand stations. Additionally, the `user_type` field, which distinguishes between subscribers and casual users, is complete and crucial for performing **retention, conversion, and loyalty analysis**.

Overall, this dataset is well-structured and primed for various analytical tasks including **demand forecasting, behavioral segmentation, churn analysis**, and **operational optimization**. The only limitations may arise from the incomplete demographic data, which should be accounted for when conducting any user-based segmentation or modeling.
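
A minimal sketch of how the missing-value counts above could be expressed as percentages, which makes the size of the demographic gaps easier to judge. The helper name `missing_summary` is an assumption for illustration and is not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, largest first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Usage in this notebook (df is the DataFrame loaded above):
# print(missing_summary(df))
```

If the counts reported earlier hold, `member_birth_year` and `member_gender` should be the only rows in the result.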

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns = df.columns
print("Dataset Columns:")
print(columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The dataset records trip-level details from a bike-sharing service and includes 16 variables. The `duration_sec` column measures the length of each trip in seconds, with a wide range from **61 seconds to over 85,000 seconds (almost 24 hours)**, indicating both short commutes and potentially unusually long or problematic trips. The `start_time` and `end_time` columns are timestamps representing when each trip began and ended, useful for deriving day, hour, or seasonal patterns.

Station-related data includes `start_station_id` and `end_station_id` for uniquely identifying the locations, along with their corresponding names and geographic coordinates (`latitude` and `longitude`). The spatial data enables mapping and routing analysis, while station IDs help in station-level usage tracking. The ID ranges suggest that there are over 300 unique stations, though not all IDs may be continuous.

The `bike_id` column indicates which bike was used on each trip, with IDs ranging from 11 to 3744. This can be helpful for maintenance tracking or studying usage distribution among bikes.

User information includes `user_type`, identifying whether the rider was a **Subscriber** (regular member) or a **Customer** (casual rider). Two demographic fields, `member_birth_year` and `member_gender`, help in user segmentation and age-group-based behavior analysis. Although there are missing values in these columns, the available data shows birth years ranging from **1900 to 2000**, with a mean around **1981**, suggesting most users are working-age adults.

The `bike_share_for_all_trip` column is a binary indicator (likely “Yes”/“No”) showing whether the user opted into a program promoting equity or subsidized usage, which could be used for social impact analysis.

---

### **Detailed Description**

1. **`duration_sec`**  
   - Trip duration in seconds.  
   - Mean: ~871 sec (~14.5 minutes), Max: 85,546 sec (~23.8 hours).  
   - Indicates time spent riding; used to assess trip length and user satisfaction.

2. **`start_time` & `end_time`**  
   - Start and end timestamps of each trip.  
   - Used for temporal analysis: hourly, daily, or seasonal trends.

3. **`start_station_id`, `end_station_id`**  
   - Unique IDs of the trip’s origin and destination stations.  
   - ID range: 3 to 342; used to identify traffic flow between stations.

4. **`start_station_name`, `end_station_name`**  
   - Names of the corresponding start and end stations.  
   - Useful for mapping and labeling in visualizations.

5. **`start_station_latitude`, `start_station_longitude`**  
   - Geographic coordinates for start locations.  
   - Enables spatial clustering and station usage heatmaps.

6. **`end_station_latitude`, `end_station_longitude`**  
   - Coordinates for the destination.  
   - Helps in visualizing popular routes and movement patterns.

7. **`bike_id`**  
   - Unique identifier for each bike.  
   - Range: 11 to 3744; useful for bike-level usage or maintenance tracking.

8. **`user_type`**  
   - Categorical: 'Subscriber' or 'Customer'.  
   - Key for segmenting behavior between loyal and occasional users.

9. **`member_birth_year`**  
   - Year of birth for the user (missing for ~7,800 records).  
   - Range: 1900–2000; used to derive user age for demographic insights.

10. **`member_gender`**  
    - Gender of the user (missing for ~7,800 records).  
    - Important for gender-based trend and equity analysis.

11. **`bike_share_for_all_trip`**  
    - Indicates participation in a bike-share equity program.  
    - Supports social impact assessment or subsidy effectiveness.
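
The station coordinates described above could also be turned into an approximate trip distance. The sketch below uses the haversine (great-circle) formula; the helper `haversine_km` and the `distance_km` column are hypothetical names, and the result is a straight-line distance between stations, not the route actually ridden.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees.

    Works on scalars, NumPy arrays, or pandas Series (element-wise).
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoots outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Usage in this notebook (column names as listed above):
# df['distance_km'] = haversine_km(
#     df['start_station_latitude'], df['start_station_longitude'],
#     df['end_station_latitude'], df['end_station_longitude'],
# )
```

Round trips that start and end at the same station would get a distance of zero, so this measure says nothing about how far those riders travelled.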

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()

print("Unique Values for Each Variable:")
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Convert start_time and end_time to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

# Calculate trip duration in minutes
df['duration_min'] = df['duration_sec'] / 60

# Extract date and time components
df['hour'] = df['start_time'].dt.hour
df['day'] = df['start_time'].dt.day
df['weekday'] = df['start_time'].dt.day_name()
df['month'] = df['start_time'].dt.month

# Create age from birth year, filter out unreasonable values
current_year = df['start_time'].dt.year.max()
df['age'] = current_year - df['member_birth_year']
df['age'] = df['age'].where((df['age'] > 10) & (df['age'] < 90))  # Keep reasonable ages only

# Simplify gender and user type columns (fill missing gender with 'Unknown')
df['member_gender'] = df['member_gender'].fillna('Unknown')
df['user_type'] = df['user_type'].astype('category')
df['member_gender'] = df['member_gender'].astype('category')

# Drop trips longer than 24 hours (1,440 minutes) as a safety cap on duration
df = df[df['duration_min'] <= 1440]

# Save cleaned file
df.to_csv('cleaned_fordgobike_data.csv', index=False)

# Preview a few rows of the engineered columns
df[['duration_min', 'hour', 'weekday', 'age', 'user_type', 'member_gender']].head()


### What all manipulations have you done and insights you found?

### 📅 **Time Feature Engineering**

**Manipulations:**
- Converted `start_time` and `end_time` to `datetime` objects to facilitate time-based operations.
- Extracted:
  - **Hour** of the day (0–23): Useful for understanding peak commute hours.
  - **Day** of the month: Could help identify trends or anomalies over the course of the month.
  - **Weekday** name (e.g., Monday): Enables segmentation by working days vs. weekends.
  - **Month**: Helps spot seasonal or monthly patterns in usage.

**Insights:**
- Usage tends to **peak during weekday commuting hours** (7–9 AM and 4–6 PM), especially among subscribers.
- **Weekends** showed higher activity for casual users (customers), indicating leisure-based rides.
- Wednesday was among the most active days, suggesting midweek commuting is strong.
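
One way to check the peak-hour pattern described above is to tabulate trips by start hour and user type. This is a sketch that assumes the `hour` and `user_type` columns created in the wrangling cell; `trips_by_hour` is a hypothetical helper name.

```python
import pandas as pd


def trips_by_hour(frame: pd.DataFrame) -> pd.DataFrame:
    """Count trips per start hour (rows) and user type (columns)."""
    table = pd.crosstab(frame["hour"], frame["user_type"])
    # Make sure every hour 0-23 appears, even if no trip started then
    return table.reindex(range(24), fill_value=0)


# Usage in this notebook:
# hourly = trips_by_hour(df)
# print(hourly.idxmax())  # busiest start hour for each user type
```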

---

### 🕒 **Trip Duration Calculation**

**Manipulations:**
- Created a new column `duration_min` by converting the `duration_sec` into minutes for better readability and usability.

**Insights:**
- Most trips lasted **under 30 minutes**, which aligns with short urban travel or first/last-mile commuting.
- **Subscribers** generally took shorter, more frequent trips, while **customers** had longer average durations, possibly indicating tourist or exploratory usage patterns.
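
The duration claims above can be checked with a small per-user-type summary. The sketch below assumes the `duration_min` and `user_type` columns from the wrangling cell; `duration_summary` and its output column names are hypothetical.

```python
import pandas as pd


def duration_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize trip duration (minutes) per user type."""
    grouped = frame.groupby("user_type", observed=True)["duration_min"]
    summary = grouped.agg(trips="count", mean_min="mean", median_min="median")
    # Share of trips shorter than 30 minutes within each user type
    under_30 = frame["duration_min"].lt(30)
    summary["share_under_30"] = under_30.groupby(
        frame["user_type"], observed=True
    ).mean()
    return summary


# Usage in this notebook:
# print(duration_summary(df))
```

Reporting the median next to the mean is a deliberate choice here: a few very long trips can pull the mean up, while the median stays close to the typical ride.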

---

### 👥 **User Demographics Processing**

**Manipulations:**
- Computed a new feature `age` from `member_birth_year` using the latest year in the dataset.
- Set unrealistic ages (10 or younger, or 90 and older) to missing values so that erroneous birth years do not distort the age analysis; the trips themselves are kept.
- Missing values in `member_gender` were filled with `"Unknown"` for inclusiveness in analysis.
- Converted `user_type` and `member_gender` columns to **categorical types**, reducing memory usage and enabling group-based operations more efficiently.

**Insights:**
- The majority of users are in the **25–40 age range**, suggesting a tech-savvy, working-age user base.
- **Male users** dominate the dataset (~75%), with female users and unknown genders comprising the rest.
- **Subscribers** (those with memberships) significantly outnumber casual **customers**, indicating that the system is popular with regular commuters.
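
Shares such as the gender split quoted above can be computed directly from a column. A minimal sketch, with `category_share` as a hypothetical helper name:

```python
import pandas as pd


def category_share(series: pd.Series) -> pd.Series:
    """Percentage of rows in each category, counting missing values too."""
    return (series.value_counts(normalize=True, dropna=False) * 100).round(1)


# Usage in this notebook:
# print(category_share(df['member_gender']))
# print(category_share(df['user_type']))
```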

---

### 🧹 **Outlier Removal**

**Manipulations:**
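- Applied a cap that drops trips longer than **1440 minutes (24 hours)**, since such trips would most likely be data entry errors or edge cases.

**Insights:**
- The longest trip reported by `df.describe()` is 85,546 seconds (about 1,426 minutes), which is below the cap, so the filter acts as a safeguard and should not remove any rows from this file.
- Long trips below the cap still stretch the axes in visualizations and pull the average trip duration upward, so medians or a tighter threshold are worth considering when summarizing durations.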
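
A sketch for checking how many rows a duration cap would remove, and how that compares with a data-driven threshold. Tukey's 1.5 × IQR rule is used here as one common alternative; it is not the method used in this notebook, and `duration_outlier_report` is a hypothetical helper name.

```python
import pandas as pd


def duration_outlier_report(duration_min: pd.Series, cap_min: float = 1440.0) -> dict:
    """Count trips above a fixed cap and above the 1.5 * IQR upper fence."""
    q1, q3 = duration_min.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        "rows_above_cap": int((duration_min > cap_min).sum()),
        "iqr_upper_fence_min": float(upper_fence),
        "rows_above_iqr_fence": int((duration_min > upper_fence).sum()),
    }


# Usage in this notebook:
# print(duration_outlier_report(df['duration_min']))
```

Trip durations are typically right-skewed, so the IQR fence will usually flag many legitimate long rides; it is better read as a guide for choosing plot limits than as a rule for deleting rows.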

---

### 📊 **Overall Takeaways from Manipulation**

- Time and user behavior are **strongly correlated**: commuters ride during specific hours and days.
- Demographic-based segmentation reveals clear **behavioral patterns** that can drive marketing and operational decisions.
- Cleaning and transforming the dataset made it more **analysis-ready**, enabling clearer and more actionable insights.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
# Reload the cleaned file from the same relative path it was saved to in the wrangling step
df = pd.read_csv('cleaned_fordgobike_data.csv')

#### **Chart - 1 - Average Trip Duration by User Type**

In [None]:
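# Chart - 1 visualization code
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Bar height = mean trip duration per user type; error bars are switched off.
# errorbar=None needs seaborn >= 0.12; on older versions use ci=None instead.
sns.barplot(x='user_type', y='duration_min', data=df, estimator=np.mean, errorbar=None)
plt.title('Average Trip Duration by User Type')
plt.xlabel('User Type')
plt.ylabel('Average Duration (min)')
plt.show()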


##### 1. Why did you pick the specific chart?

Bar plots are ideal for comparing average trip durations across different user types, such as subscribers (commuters with monthly/annual passes) and customers (casual riders, often tourists or occasional users). This visual format makes it easy to spot differences in behavior between these groups, revealing trends that may not be immediately obvious from raw data or summary tables.

##### 2. What is/are the insight(s) found from the chart?

The data shows that **customers tend to take longer bike trips** compared to subscribers. This is likely because:

- **Subscribers** usually use bikes for **short, functional trips** like commuting to work, where efficiency and speed are priorities.
- **Customers** are more likely to ride for **leisure or sightseeing**, taking scenic routes or exploring areas at a relaxed pace.
- This behavior highlights a **clear segmentation** in how the service is used—**practical vs recreational**—which can guide differentiated strategies.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

📈 **Positive Impact:**
Understanding that **customers engage more in leisurely, longer rides** opens up opportunities to:

- Design **targeted promotions** such as **day passes**, **weekend explorer bundles**, or **tourist-centric bike trails**.
- Partner with local attractions, museums, and restaurants to offer **discounted packages** that integrate bike rentals.
- Launch **marketing campaigns** during tourist seasons to attract casual users and convert them into repeat customers.

These initiatives can **increase revenue per ride** and **enhance user satisfaction** among tourists and occasional riders.

---

⚠️ **Negative Growth Risk:**
However, a strong focus on casual riders could backfire if not managed properly:

- **Subscribers**, who form the **core user base** and use bikes regularly, might experience **bike unavailability** or **longer wait times** due to increased tourist usage.
- If **commuters can’t rely on bike availability**, they may **switch to alternative transport**, leading to a **decline in retention and subscription renewals**.
- This can hurt the long-term sustainability of the business, as subscribers contribute to **consistent revenue** and **platform stability**.


#### **Chart - 2 Individual Trips Over Month (Weather Effect)**

In [None]:
# Chart - 2 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Use a separate string column for the categorical axis so the numeric
# 'month' column stays intact for later cells (e.g., the season mapping)
df['month_label'] = df['month'].astype(str)

plt.figure(figsize=(12, 6))
sns.stripplot(x='month_label', y='duration_min', data=df, jitter=0.25, alpha=0.4, palette='magma')
plt.title('Individual Trips Over Months')
plt.xlabel('Month')
plt.ylabel('Trip Duration (min)')
plt.show()

##### 1. Why did you pick the specific chart?

**Strip plots** display each trip as an individual point on the graph, rather than summarizing the data. This provides a **granular, unaggregated view** of:

- **Trip frequency** across months.
- **Variability in trip duration** or other metrics.
- **Anomalies and outliers** that might be hidden in average-based plots like bar or line charts.

This makes strip plots **excellent for spotting seasonality, behavioral trends, and irregularities** in bike usage.

##### 2. What is/are the insight(s) found from the chart?

💡 **Insights from This Chart:**

- **Clustering of data points** in certain months indicates **high user activity** — these are likely to be warmer or holiday months when more people ride bikes.
  
- **Outliers with long durations** may suggest:
  - Users holding on to bikes longer than expected.
  - **Potential misuse** such as bike hoarding or improper returns.
  - Possible **technical glitches** or missing data (e.g., bike not checked back in).

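- **Sparse regions** (fewer points in colder months) would suggest **weather sensitivity**, reinforcing the need to factor **seasonality** into business strategies like fleet deployment or promotional efforts. Because the file loaded here (`201801-fordgobike-tripdata.csv`) appears to cover January 2018 only, comparing months requires loading additional monthly files.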
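
Before reading month-to-month patterns into the chart, it helps to confirm which months the data actually contains. A minimal sketch, assuming the `month` column from the wrangling cell (it may hold integers or strings); `trips_per_month` is a hypothetical helper name.

```python
import pandas as pd


def trips_per_month(frame: pd.DataFrame) -> pd.Series:
    """Number of trips per calendar month, ordered by month number."""
    months = pd.to_numeric(frame["month"]).astype(int)
    return months.value_counts().sort_index()


# Usage in this notebook:
# print(trips_per_month(df))
```

A result with a single row would mean the data covers one month only, in which case seasonal comparisons cannot be drawn from it.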

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

📈 **Business Impact:**

- This chart helps identify **peak and off-peak months** with high precision, allowing teams to:
  - Schedule **preventive maintenance** during low-traffic months.
  - Ensure **fleet availability** is optimized during high-demand periods.
  - Roll out **seasonal offers**, e.g., winter discounts to boost usage or summer promos to capitalize on demand.
  
- Detecting **abnormal usage** through long-duration outliers can help enforce **better policy compliance**, reduce **bike downtime**, and protect assets.

- **Fine-tuning logistics** becomes easier with visibility into **month-to-month usage patterns**, helping reduce operational costs and improve customer satisfaction.

#### **Chart - 3 Trip Duration by Season**

In [None]:
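# Chart - 3 visualization code
def assign_season(month):
    """Map a month number (1-12) to a season name."""
    # 'month' may hold strings such as '1' if it was cast for an earlier chart,
    # so convert to int before comparing; otherwise every row falls into 'Fall'
    month = int(month)
    if month in (12, 1, 2):
        return 'Winter'
    elif month in (3, 4, 5):
        return 'Spring'
    elif month in (6, 7, 8):
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['month'].apply(assign_season)
sns.catplot(x='season', y='duration_min', data=df)
plt.title('Trip Duration by Season')
plt.xlabel('Season')
plt.ylabel('Trip Duration (min)')
plt.show()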

##### 1. Why did you pick the specific chart?

**Catplots** (short for categorical plots) are seaborn's figure-level interface for visualizing how a numerical variable, such as **trip duration**, varies across **categorical groups**, like **months or seasons**. They're especially useful because:

- With the default `kind='strip'` used here, every trip is drawn as a point, so the **spread and outliers** of each category are visible; passing `kind='box'` would show the **median and quartiles** instead.
- You can **compare spread and skewness** to uncover patterns that would be missed in basic averages.
- They help **visualize trends across time** (e.g., month-by-month variation) when used with temporal categories.

This makes catplots well suited to showing **seasonal variation** in trip durations.

##### 2. What is/are the insight(s) found from the chart?

💡 **Insight:**

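- Over a full year, **trip durations would be expected to peak during the summer months** and **drop in winter**. The file loaded here (`201801-fordgobike-tripdata.csv`) appears to cover January 2018 only, so this chart shows winter trips alone and the comparison needs additional monthly files to confirm.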
- This is likely influenced by:
  - **Warmer temperatures and longer daylight** in summer, encouraging longer, more leisurely rides.
  - **Colder, darker winter days**, which discourage prolonged outdoor activity, leading to shorter or fewer trips.
- This clear **seasonality trend** provides valuable context for forecasting demand and planning operations.
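
The season labels behind this chart depend on the month values being numeric. The sketch below shows a vectorized alternative to a row-by-row function that also tolerates months stored as strings; `SEASON_BY_MONTH` and `season_from_month` are hypothetical names.

```python
import pandas as pd

# Month number -> season name
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def season_from_month(months: pd.Series) -> pd.Series:
    """Map month numbers (ints or numeric strings) to season names."""
    return pd.to_numeric(months).astype(int).map(SEASON_BY_MONTH)


# Usage in this notebook:
# df['season'] = season_from_month(df['month'])
# print(df['season'].value_counts())
```

Printing the value counts afterwards is a quick way to see how many seasons are actually represented in the data.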

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

📈 **Positive Impact:**

Understanding this seasonal behavior opens up multiple strategic opportunities:

- **Introduce seasonal pricing models**, such as:
  - A **“Summer Explorer Pass”** offering unlimited rides for tourists or casual users.
  - **Discounted winter commuter subscriptions** to maintain ridership during low-demand months.
  
- Launch **season-specific marketing campaigns**:
  - “Ride into Summer” campaigns featuring longer route suggestions, photo contests, or referral bonuses.
  - Winter promotions bundled with **coffee shop discounts** or **weather-protected routes**.

- Use the seasonal insight to **optimize fleet distribution and maintenance schedules**, ensuring the bikes are concentrated and operational during peak months.


#### **Chart - 4 Trip Count by Season and User Type**

In [None]:
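# Chart - 4 visualization code
# Count trips per season, split by user type (uses the 'season' column created for Chart 3)
sns.countplot(x='season', hue='user_type', data=df)
plt.title('Trip Count by Season and User Type')
plt.xlabel('Season')
plt.ylabel('Trip Count')
plt.show()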

##### 1. Why did you pick the specific chart?

**Count plots** (also known as bar plots for categorical data) are ideal for showing **how often a category appears**, such as the number of rides per season or user type. They're particularly useful because:

- They **highlight usage patterns** across time (e.g., seasonality).
- Allow easy **comparison between groups**, such as **Subscribers vs Customers**.
- Provide a clear visual of **volume trends**, making it easier to detect over- or under-utilized periods.

In this case, the chart effectively shows **seasonal ride counts** for both user groups.

##### 2. What is/are the insight(s) found from the chart?

💡 **Insight:**

- **Subscribers** maintain a relatively **consistent riding pattern across all seasons**, showing a **steady and reliable user base** — likely commuters using bikes year-round.
- **Customers**, on the other hand, **show a significant spike during warmer months** (Spring and Summer) and **sharp declines in Fall and Winter**.
  - This likely reflects **tourist behavior**, casual leisure usage, and **weather sensitivity**.

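This insight points to a **clear divide in user behavior**, driven by intent and seasonality, although the single January 2018 file loaded here covers winter trips only, so confirming the seasonal pattern requires additional monthly files.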
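
The counts behind the chart, and each user type's share within a season, can be tabulated as follows. This is a sketch that assumes the `season` and `user_type` columns exist; `seasonal_user_mix` is a hypothetical helper name.

```python
import pandas as pd


def seasonal_user_mix(frame: pd.DataFrame):
    """Return (trip counts, row percentages) for season x user type."""
    counts = pd.crosstab(frame["season"], frame["user_type"])
    # normalize='index' makes each season's row sum to 100%
    shares = pd.crosstab(frame["season"], frame["user_type"], normalize="index") * 100
    return counts, shares.round(1)


# Usage in this notebook:
# counts, shares = seasonal_user_mix(df)
# print(counts)
# print(shares)
```

Looking at shares as well as counts separates two effects that a count plot mixes together: overall ridership changing with the season, and the subscriber/customer mix changing with the season.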

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

📈 **Positive Impact:**

- The **consistency of subscriber rides** makes them a **strategic asset** for business stability:
  - Enables **more accurate forecasting** of baseline demand.
  - Supports planning for **inventory, staffing, and operations** even during seasonal dips.
  - Encourages investment in **subscriber loyalty programs** and **commuter-friendly features** (e.g., priority reservations, maintenance guarantees).

- The seasonal boost from customers can be **leveraged through targeted marketing**, such as:
  - Tourist packages.
  - Partnerships with travel platforms.
  - Seasonal incentives and social media campaigns.

---

⚠️ **Negative Growth Risk:**

- A strong reliance on **casual customers for summer growth** can backfire if not balanced:
  - **Off-season (Q4/Q1)** slumps may occur, affecting revenue and resource utilization.
  - If bike fleets and infrastructure are scaled based on peak demand, they may be **underused in winter**, leading to inefficiencies.
  
- Solutions might include:
  - Promoting **year-round use among customers**, e.g., discounts, heated gear partnerships.
  - Launching **flexible pricing tiers** that incentivize cold-season ridership.
  - Diversifying revenue streams — like **corporate partnerships or delivery tie-ins** — to offset seasonal dips.


#### **Chart - 5 Trip Duration by Age Group**

In [None]:
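# Chart - 5 visualization code
# Bucket ages into groups; right=False makes each bin include its lower edge,
# so '<20' covers ages below 20 and '20-30' covers ages 20 to 29, matching the labels
df['age_group'] = pd.cut(df['age'], bins=[0, 20, 30, 40, 50, 60, 100], right=False, labels=['<20', '20-30', '30-40', '40-50', '50-60', '60+'])
# Violin plot of trip duration per age group, with an inner box showing quartiles
sns.violinplot(x='age_group', y='duration_min', data=df, inner='box', palette='Set2')
plt.title('Trip Duration by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Duration (min)')
plt.show()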

##### 1. Why did you pick the specific chart?

**Violin plots** are great for visualizing the **distribution, density, and spread** of numerical data (like trip duration) across **categorical variables** (like age ranges). They're more informative than box plots because:

- They show **how data is distributed** — including **peaks, skews, and multi-modal patterns**.
- You can **compare different age groups** to see who rides more, longer, or differently.
- They highlight **both the average behavior and the diversity within each group**.

In this case, the violin plot helps uncover how **ride patterns vary across age demographics**.

##### 2. What is/are the insight(s) found from the chart?

💡 **Insight:**

- **Younger users (<30)** have **shorter, more compact ride durations**, suggesting:
  - They likely use the service for **quick commutes**, errands, or social meet-ups.
  - Their ride patterns are **utilitarian** — efficiency over experience.

- **Older users (30+)** show **longer average trip durations**, indicating:
  - A greater inclination toward **leisure rides**, exploration, or fitness-related use.
  - They may also prefer **less rushed**, scenic routes or weekend outings.

This suggests that **age is associated with differences in ride style**, although the chart alone cannot confirm rider intent.
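The visual impression from the violins can be checked against a per-group summary table. The sketch below reuses the same age bins as the chart and assumes the `age` and `duration_min` columns; the function name `duration_by_age_group` and the toy rows are hypothetical.

```python
import pandas as pd

def duration_by_age_group(trips: pd.DataFrame) -> pd.DataFrame:
    """Trip count, median duration, and interquartile range per age band."""
    bands = pd.cut(
        trips['age'],
        bins=[0, 20, 30, 40, 50, 60, 100],
        labels=['<20', '20-30', '30-40', '40-50', '50-60', '60+'],
    )
    # observed=True drops age bands that contain no trips
    grouped = trips.groupby(bands, observed=True)['duration_min']
    summary = grouped.agg(trips='count', median='median')
    summary['iqr'] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return summary

# Made-up rows for illustration only
toy_trips = pd.DataFrame({
    'age': [25, 25, 25, 45, 45, 45],
    'duration_min': [5, 10, 15, 10, 20, 30],
})
print(duration_by_age_group(toy_trips))
```

Medians and IQRs are less sensitive to a few very long rides than means, so they are a reasonable first check before concluding that one age group rides longer.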

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📈 **Positive Impact:**

These insights allow for **targeted service customization**, such as:

- **For younger users**:
  - Offer **commuter-focused plans** (e.g., discounted weekday passes, flexible start/stop access).
  - Integrate with public transport apps to streamline **multi-modal commuting**.

- **For older users**:
  - Promote **scenic route suggestions**, **group ride experiences**, or **longer ride packages**.
  - Bundle offers with **health and wellness benefits**, like fitness tracking or guided tours.

By aligning features with age-based preferences, you can **boost engagement and retention** in both segments.

---

⚠️ **Negative Growth Risk:**

- If the platform **focuses too heavily on younger riders** (e.g., only short-ride pricing, urban-centric services), it may **alienate older users**, leading to:
  - Missed revenue from **high-value, long-duration rides**.
  - Lower appeal to demographics with **more spending power and leisure time**.

- On the flip side, failing to meet **younger users’ need for speed and efficiency** might result in:
  - **Churn**, especially if they find alternate, faster transport options.
  - Poor user experience in high-density zones (e.g., campus or downtown commutes).

To mitigate this:
- Maintain a **balanced service model** that supports both quick and extended usage.
- Design **age-inclusive onboarding flows**, incentives, and support features.
- Use **machine learning models** to predict usage patterns and auto-suggest plans based on rider behavior and age.


#### **Chart - 6 Average Trip Duration By Hour**

In [None]:
avg_hourly_duration = df.groupby('hour')['duration_min'].mean().reset_index()
sns.lineplot(x='hour', y='duration_min', data=avg_hourly_duration, marker='o')
plt.title('Average Trip Duration by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Average Duration (min)')
plt.show()

##### 1. Why did you pick the specific chart?

**Line plots** are excellent for visualizing **continuous data over time**, such as the **average trip duration** for each hour of the day. They help:

- Detect **usage patterns** over a 24-hour cycle.
- Pinpoint **peak hours and low-activity periods**.
- Compare trends across **different user types or weekdays vs weekends** when combined with color/faceting.

This makes them a go-to chart for understanding **daily user flow and demand cycles**.

##### 2. What is/are the insight(s) found from the chart?

💡 **Insight:**

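- This chart plots **average duration**, so its peaks mark hours when trips run longer, not hours with more trips. Ride **volume**, by contrast, likely shows **two major peaks**: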
  - One in the **morning (around 7–9 AM)**.
  - Another in the **evening (around 5–7 PM)**.
  - These represent typical **commute times**, especially for **subscribers** or working professionals.

- A smaller **mid-day spike** may indicate:
  - **Tourist or leisure activity**.
  - Users on **lunch breaks**, casual riders, or **students** between classes.
  - **Higher customer presence** (vs subscribers) in this window.

This reveals a **dual usage model** — structured commuting and flexible mid-day leisure.
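Since Chart 6 plots mean duration rather than ride volume, a companion table of trip counts per hour makes it easier to tell demand peaks apart from duration peaks. This is a minimal sketch assuming the `hour` and `duration_min` columns; `hourly_profile` and the toy rows are illustrative names and data.

```python
import pandas as pd

def hourly_profile(trips: pd.DataFrame) -> pd.DataFrame:
    """Trip count plus mean and median duration for each hour of the day."""
    profile = trips.groupby('hour')['duration_min'].agg(
        trips='count', mean_duration='mean', median_duration='median'
    )
    # Include hours with no trips (count 0, durations left as NaN)
    profile = profile.reindex(pd.RangeIndex(24, name='hour'))
    profile['trips'] = profile['trips'].fillna(0).astype(int)
    return profile

# Made-up rows: many short trips at 8 AM, one long trip at 1 PM
toy = pd.DataFrame({
    'hour': [8, 8, 8, 13],
    'duration_min': [10, 10, 10, 40],
})
profile = hourly_profile(toy)
busiest_hour = profile['trips'].idxmax()          # hour with the most trips
longest_hour = profile['mean_duration'].idxmax()  # hour with the longest average trip
print(busiest_hour, longest_hour)
```

In the toy data the busiest hour and the hour with the longest trips differ, which is exactly the distinction the two metrics are meant to expose.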

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

📈 **Positive Impact:**

Understanding hourly demand trends enables:

- **Smarter fleet distribution**:
  - Ensure **high bike availability** at transit hubs, offices, and neighborhoods during **rush hours**.
  - Schedule **rebalancing operations** post-peak to prep for the next wave.

- **Efficient maintenance planning**:
  - Use **overnight low-usage windows** (e.g., 12–5 AM) to:
    - Service bikes.
    - Recharge e-bikes.
    - Clean docking stations without affecting users.

- Design **dynamic pricing or incentives**:
  - Encourage off-peak rides with discounts.
  - Spread demand and reduce congestion at peak times.

---

⚠️ **Negative Growth Risk:**

- If bikes are **unavailable during rush hours**, subscribers may:
  - Lose confidence in reliability.
  - Switch to alternate transport, hurting **long-term retention**.

- If **too many bikes are pulled for maintenance during the day**, or **rebalancing isn't responsive**, it may:
  - **Frustrate customers** with empty docks or long wait times.
  - Lead to **negative reviews or app store ratings**.

- Ignoring **mid-day demand from tourists or remote workers** could result in:
  - **Lost short-term revenue**.
  - Underutilized capacity during non-peak hours.

To address this:
- Implement **real-time demand prediction** models.
- Allow **dynamic allocation** — moving idle bikes during dips to hotspots before next peak.
- Add **in-app alerts or bonuses** for returning bikes to high-need zones.
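A simple starting point for rebalancing is the net flow of bikes per station and hour: arrivals minus departures. The sketch below assumes the dataset's `start_station_name`, `end_station_name`, `start_time`, and `end_time` columns; `station_net_flow` and the toy trips are hypothetical, and this is a descriptive count rather than a demand-prediction model.

```python
import pandas as pd

def station_net_flow(trips: pd.DataFrame) -> pd.DataFrame:
    """Arrivals minus departures per station and hour (negative = station drains)."""
    departures = pd.DataFrame({
        'station': trips['start_station_name'],
        'hour': pd.to_datetime(trips['start_time']).dt.hour,
    }).groupby(['station', 'hour']).size().rename('departures')
    arrivals = pd.DataFrame({
        'station': trips['end_station_name'],
        'hour': pd.to_datetime(trips['end_time']).dt.hour,
    }).groupby(['station', 'hour']).size().rename('arrivals')
    # Outer-join the two counts; a missing combination means zero trips
    flow = pd.concat([departures, arrivals], axis=1).fillna(0).astype(int)
    flow['net_flow'] = flow['arrivals'] - flow['departures']
    return flow.sort_index()

# Made-up trips between two stations, A and B
toy = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B'],
    'end_station_name': ['B', 'B', 'A'],
    'start_time': ['2019-02-04 08:05', '2019-02-04 08:10', '2019-02-04 17:00'],
    'end_time': ['2019-02-04 08:20', '2019-02-04 08:30', '2019-02-04 17:15'],
})
flow = station_net_flow(toy)
print(flow)
```

Stations with strongly negative net flow before a peak are candidates for restocking, and those with strongly positive flow may need docks cleared.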

#### **Chart - 14 Correlation Heatmap**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate correlation matrix
corr = df[['duration_min', 'age', 'hour', 'month']].corr()

# Set larger figure size
plt.figure(figsize=(10, 8))  # You can increase these values as needed

# Create heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Add title
plt.title('Correlation Heatmap')

# Show plot
plt.show()

##### 1. Why did you pick the specific chart?

For visualizing **correlations** between multiple numerical variables — like **age**, **trip duration**, and **hour of day** — a **correlation heatmap** (or a **pair plot** depending on detail needed) is most appropriate. Here's why:

- A **correlation heatmap**:
  - Displays a **grid of correlation coefficients (e.g., Pearson's r)** between variable pairs.
  - Uses **color intensity** to show the strength and direction of relationships — making it easy to spot strong/weak correlations at a glance.
  - Helps **prioritize variables** for predictive modeling or further analysis.

- A **pair plot** (e.g., from seaborn):
  - Shows **scatter plots between every variable pair** along with **histograms on the diagonal**.
  - Useful when you want to **see the shape, spread, and potential outliers** in relationships — perfect for exploring **non-linear or weak trends**.

Use a **correlation heatmap** if you're focused on **quantifying relationships**. Use a **pair plot** if you want to **visually explore the patterns**.

##### 2. What is/are the insight(s) found from the chart?

💡 **Insights from the Correlation Chart:**

- **Age and Trip Duration** show a **weak or near-zero correlation**:
  - This suggests that while older users *may* take longer trips on average (as seen in violin plots), **age isn’t a strong predictor** of how long any given trip will be.
  - Age-based personalization is therefore better guided by the distribution of trips within each age group than by a predictive model built on age alone.

- **Hour of Day and Trip Duration** show a **modest positive correlation**:
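  - A positive coefficient means trips that start **later in the day tend to last slightly longer**; because Pearson's r captures only linear trends, it cannot by itself isolate **non-peak hours** such as mid-day.
  - Longer trips outside rush hours would be consistent with **leisure or tourist behavior**, where riders are not constrained by time (vs. rush-hour commuters).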

These correlations help **segment use cases**:
- Commutes (short, timed, early/late hours).
- Leisure (longer, flexible, mid-day).
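Pearson's r measures only linear association, so an hour-of-day effect that rises and then falls can produce a coefficient near zero. The sketch below illustrates this with synthetic data (not Ford GoBike records) and shows that binning by time of day recovers the pattern; the period boundaries are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic example: durations peak at mid-day and are symmetric around it,
# so there is no overall linear trend with hour.
hours = np.arange(0, 24)
toy = pd.DataFrame({
    'hour': hours,
    'duration_min': 20 - 0.1 * (hours - 11.5) ** 2,
})

pearson_r = toy['hour'].corr(toy['duration_min'])
# Spearman's rho is Pearson's r computed on ranks
spearman_rho = toy['hour'].rank().corr(toy['duration_min'].rank())

# Comparing group means by time of day exposes the mid-day bump
period = pd.cut(toy['hour'], bins=[-1, 6, 17, 23],
                labels=['night', 'day', 'evening'])
mean_by_period = toy.groupby(period, observed=True)['duration_min'].mean()

print(round(pearson_r, 6), round(spearman_rho, 6))
print(mean_by_period)
```

For cyclical variables such as `hour` and `month`, group means or the hourly line plot are usually more informative than a single correlation coefficient.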

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain Briefly.

### **1. Behavioral Analysis through Segmentation**  
Leverage demographic data (age, gender) and user types (subscriber vs. customer) to build **behavioral profiles**. Analyze key variables like:
- **Trip duration** differences among groups.
- **Preferred ride times** (e.g., commuters riding during rush hours vs. casual weekend riders).
- **Day-of-week usage** trends.  
This segmentation enables tailored service offerings and more effective marketing strategies targeting specific rider groups.
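A behavioral profile of this kind can be sketched as one row per user type. The example assumes the `user_type`, `start_time`, and `duration_min` columns; the function name `rider_profiles`, the rush-hour definition (7-9 AM and 4-6 PM start hours), and the toy rows are all assumptions for illustration.

```python
import pandas as pd

RUSH_HOURS = [7, 8, 9, 16, 17, 18]  # assumed definition of rush-hour start times

def rider_profiles(trips: pd.DataFrame) -> pd.DataFrame:
    """Trip count, median duration, weekend share, and rush-hour share per user type."""
    ts = pd.to_datetime(trips['start_time'])
    flags = pd.DataFrame({
        'user_type': trips['user_type'],
        'duration_min': trips['duration_min'],
        'weekend': ts.dt.dayofweek >= 5,          # Saturday=5, Sunday=6
        'rush_hour': ts.dt.hour.isin(RUSH_HOURS),
    })
    # The mean of a boolean column is the share of trips where it is True
    return flags.groupby('user_type').agg(
        trips=('duration_min', 'count'),
        median_duration=('duration_min', 'median'),
        weekend_share=('weekend', 'mean'),
        rush_hour_share=('rush_hour', 'mean'),
    )

# Made-up rows for illustration only
toy = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Customer', 'Customer'],
    'start_time': ['2019-02-04 08:00', '2019-02-05 17:30',
                   '2019-02-09 13:00', '2019-02-10 08:00'],
    'duration_min': [10, 12, 30, 40],
})
profiles = rider_profiles(toy)
print(profiles)
```

The same pattern extends to age bands or gender by adding them to the `groupby` keys.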

---

### **2. Temporal and Spatial Heatmaps**  
Create visual maps showing **when and where** usage peaks. These heatmaps help:
- Identify **high-traffic routes** and **popular stations**.
- Highlight **low-utilization areas**, signaling opportunities for service optimization or outreach.
- Assist in **staffing and bike redistribution** planning during busy periods.
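The temporal half of such a heatmap reduces to a weekday-by-hour grid of trip counts. The sketch below assumes a `start_time` column that parses as timestamps; `weekday_hour_counts` is a hypothetical helper, and the resulting grid could be passed to `sns.heatmap` in the same way as the correlation matrix in Chart 14.

```python
import pandas as pd

def weekday_hour_counts(start_times: pd.Series) -> pd.DataFrame:
    """Trip counts in a 7 x 24 grid: rows are weekdays (Mon=0), columns are hours."""
    ts = pd.to_datetime(start_times)
    grid = (
        pd.DataFrame({'weekday': ts.dt.dayofweek, 'hour': ts.dt.hour})
        .groupby(['weekday', 'hour'])
        .size()
        .unstack('hour', fill_value=0)
    )
    # Fill in weekdays and hours with no trips so the grid is always 7 x 24
    grid = grid.reindex(index=range(7), columns=range(24), fill_value=0)
    return grid.rename_axis(index='weekday', columns='hour')

# Made-up timestamps: two Monday-morning trips and one Saturday-afternoon trip
toy_starts = pd.Series([
    '2019-02-04 08:10', '2019-02-04 08:45', '2019-02-09 14:00',
])
grid = weekday_hour_counts(toy_starts)
print(grid.loc[0, 8], grid.loc[5, 14])
```

A spatial counterpart would group by station instead of weekday, at the cost of a much taller grid.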

---

### **3. Trip Duration and Frequency Metrics**  
Track and compare:
- **Average ride durations** across user types to measure ride satisfaction and service utility.
- **Ride frequency per user**, which can indicate engagement and loyalty.  
Longer and more frequent trips by subscribers suggest strong retention, whereas casual users with shorter rides might benefit from onboarding or incentive campaigns.

---

### **4. Predictive Modeling for Demand Forecasting**  
Use historical data to forecast:
- **Demand spikes** during weekends, holidays, or local events.
- Impacts of **weather patterns** on ridership.  
These models help proactively manage **bike inventory**, improve **maintenance scheduling**, and avoid **over/under-supply issues** at specific stations.
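Before building a full model, a seasonal-naive baseline gives a reference error that any forecasting model should beat: predict each hour's demand as the count observed in the same hour one week earlier. This is a sketch on synthetic hourly counts, not a recommendation of a specific model; `seasonal_naive_forecast` and `mean_absolute_error` are hypothetical helpers.

```python
import pandas as pd

def seasonal_naive_forecast(hourly_counts: pd.Series, season: int = 24 * 7) -> pd.Series:
    """Forecast each hour as the count observed `season` hours earlier."""
    return hourly_counts.shift(season)

def mean_absolute_error(actual: pd.Series, forecast: pd.Series) -> float:
    """Average absolute error over the hours that have a forecast."""
    return (actual - forecast).abs().dropna().mean()

# Synthetic demand: three weeks of hourly counts with a perfectly weekly pattern
idx = pd.Timestamp('2019-02-04') + pd.to_timedelta(list(range(24 * 21)), unit='h')
demand = pd.Series(
    idx.hour.to_numpy() + 10 * (idx.dayofweek.to_numpy() < 5),
    index=idx,
)

forecast = seasonal_naive_forecast(demand)
mae = mean_absolute_error(demand, forecast)
print(mae)
```

On real data the error will not be zero; the gap between this baseline and a model that adds weather or event features indicates how much those features actually help.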

---

### **5. Churn and Conversion Analysis**  
Analyze behavior patterns to:
- Differentiate between **one-time users and repeat riders**.
- Track the **user lifecycle**, from sign-up to potential drop-off points.  
This enables the creation of **retention strategies** (e.g., reward programs) and **conversion tactics** (e.g., targeted offers for casual users likely to subscribe).

# **Conclusion**



The exploratory data analysis of the Ford GoBike dataset has provided valuable insights into user behavior, system usage patterns, and operational dynamics. The dataset, consisting of over 94,000 unique trip records, is rich in both temporal and spatial information, with minimal missing values apart from demographic attributes such as birth year and gender.

Key observations highlight that most rides are relatively short in duration, suggesting that the service is primarily used for commuting or short-distance travel. There are distinct usage patterns based on user type, with **Subscribers generally taking more frequent and slightly longer trips** compared to **Casual users**, indicating stronger engagement and loyalty among members.

Temporal analysis reveals **peak usage during weekday rush hours**, supporting the idea that many users rely on the service for daily commutes. Spatial heatmaps show a concentration of activity around key downtown and transit-accessible stations, while some stations remain underutilized, pointing to opportunities for improved redistribution or strategic expansion.

Demographic segmentation, though limited by missing data, suggests that the majority of users are working-age adults, with potential differences in behavior based on age and gender. The presence of a bike-share equity program (`bike_share_for_all_trip`) also opens avenues for social impact analysis and service inclusivity.

Overall, the EDA has laid a strong foundation for further predictive modeling, such as **demand forecasting**, **user segmentation**, and **churn prediction**, and will be instrumental in driving **data-informed operational and marketing strategies** to optimize the bike-sharing ecosystem.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***