
# Ford GoBike System – Exploratory Data Analysis

## 1. Project Understanding

The objective of this analysis is to understand:
- User behavior patterns
- Demographic distribution
- Trip duration characteristics
- Geographic consistency of station data
- Outlier presence and data quality

This notebook integrates data cleaning, preprocessing, and analytical insights into one structured EDA report.


## 2. Import Libraries & Load Data

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("cleaned_fordgobike_data.csv")
df.head()


## 3. Dataset Overview

In [None]:

df.info()
df.describe().T



### Insight

The dataset contains both numerical and categorical features describing trip duration,
station coordinates, and user demographics.

The data types reported by `df.info()` appear consistent with a file that has already been cleaned and preprocessed.
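
To back up the statement about data types, a per-column summary can sit next to `df.info()`. The sketch below is a minimal illustration: `summarize_columns` is a hypothetical helper that is not part of the original notebook, and the small `sample` frame is made-up data standing in for `df`.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

# Tiny made-up sample standing in for df (values are illustrative only)
sample = pd.DataFrame({
    "duration_sec": [300, 620, 45000, 410],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber"],
    "member_gender": ["Male", None, "Female", "Male"],
})

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(df)` on the real data would show at a glance whether any column still carries missing values after cleaning.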


## 4. Trip Duration Distribution

In [None]:

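# Log-transform the durations so the plot matches its title and axis label;
# the raw values are too right-skewed to read on a linear scale.
plt.figure()
plt.hist(np.log10(df['duration_sec']), bins=50)
plt.title("Distribution of Trip Duration (Log Transformed)")
plt.xlabel("log10(Duration in Seconds)")
plt.ylabel("Frequency")
plt.show()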



### Insight

Trip durations are heavily right-skewed.  
Most rides are short, which is consistent with commuter-driven usage.  
Long-duration rides exist but are rare.
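
The skew can also be expressed in numbers rather than read off the histogram. The sketch below uses synthetic log-normal durations (an assumption made purely for illustration, not the real trip data) to show the typical signature of right skew: a mean above the median and a skewness that drops towards zero after a log transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, log-normally distributed durations standing in for df['duration_sec']
durations = pd.Series(rng.lognormal(mean=6.3, sigma=0.7, size=10_000),
                      name="duration_sec")

stats = {
    "mean": durations.mean(),
    "median": durations.median(),
    "p95": durations.quantile(0.95),
    "skew_raw": durations.skew(),                # sample skewness of raw values
    "skew_log10": np.log10(durations).skew(),    # skewness after log transform
}

for name, value in stats.items():
    print(f"{name:>10}: {value:,.2f}")
```

On the real data the same dictionary can be built by replacing `durations` with `df['duration_sec']`.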


## 5. User Type Distribution

In [None]:

user_counts = df['user_type'].value_counts()

plt.figure()
plt.bar(user_counts.index.astype(str), user_counts.values)
plt.title("User Type Distribution")
plt.xlabel("User Type")
plt.ylabel("Count")
plt.show()



### Insight

Subscribers account for most trips, which suggests high recurring engagement
and integration into daily mobility routines.
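
A bar chart of raw counts shows which group is larger but not by how much. As a sketch, the shares can be computed with `value_counts(normalize=True)`. The labels below are made up for illustration and do not reflect the real proportions.

```python
import pandas as pd

# Made-up labels standing in for df['user_type']; not the real counts
user_type = pd.Series(
    ["Subscriber", "Subscriber", "Customer", "Subscriber"],
    name="user_type",
)

# normalize=True returns fractions that sum to 1 instead of raw counts
shares = user_type.value_counts(normalize=True)
print((shares * 100).round(1))
```

Reporting the subscriber share as a percentage makes the claim of dominance checkable.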


## 6. Gender Distribution

In [None]:

gender_counts = df['member_gender'].value_counts()

plt.figure()
plt.bar(gender_counts.index.astype(str), gender_counts.values)
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()



### Insight

Male riders account for the majority of trips.  
This demographic imbalance points to potential for targeted growth strategies.
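
One caveat when reading the gender counts: `value_counts()` drops missing values by default, so the plotted shares describe only trips with a recorded gender. The sketch below, on made-up data, contrasts the two views. Whether the cleaned file still contains missing `member_gender` values is an assumption to verify.

```python
import pandas as pd

# Made-up sample with one missing value, standing in for df['member_gender']
gender = pd.Series(["Male", "Male", "Female", None], name="member_gender")

# Default behaviour: missing values are excluded before computing shares
shares_known = gender.value_counts(normalize=True)

# dropna=False keeps missing values as their own category
shares_all = gender.value_counts(normalize=True, dropna=False)

print(shares_known)
print(shares_all)
```

If the two results differ noticeably on the real data, the insight should state which denominator it uses.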


## 7. Age Distribution

In [None]:

if 'age' in df.columns:
    plt.figure()
    # Drop missing ages first; NaN values leave the histogram range undefined.
    plt.hist(df['age'].dropna(), bins=40)
    plt.title("Age Distribution")
    plt.xlabel("Age")
    plt.ylabel("Frequency")
    plt.show()



### Insight

Users cluster in young-to-middle adulthood, which is consistent with the commuting hypothesis.
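
The cell above only plots when an `age` column exists. If the file carries a birth year instead, age can be derived first. The sketch below assumes a column named `member_birth_year`, a reference year of 2019, and an upper age cutoff of 90. All three are illustrative assumptions, not values taken from this notebook.

```python
import pandas as pd

REFERENCE_YEAR = 2019    # assumption: year the trips were recorded
MAX_PLAUSIBLE_AGE = 90   # assumption: hypothetical cutoff for implausible ages

# Made-up birth years standing in for df['member_birth_year']
sample = pd.DataFrame(
    {"member_birth_year": [1985.0, 1992.0, 1900.0, None, 1978.0]}
)

sample["age"] = REFERENCE_YEAR - sample["member_birth_year"]

# Comparisons against NaN are False, so rows with a missing age are dropped too
plausible = sample[sample["age"] <= MAX_PLAUSIBLE_AGE]
print(plausible)
```

The cutoff should be chosen from the data and documented, since it directly shapes the age histogram.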


## 8. Numerical Feature Outlier Analysis (Boxplots)

In [None]:

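# Select every numeric column, whatever its exact integer or float width.
num_cols = df.select_dtypes(include='number').columns

for col in num_cols:
    plt.figure()
    # Drop NaN per column; boxplot statistics are undefined when NaN is present.
    plt.boxplot(df[col].dropna(), vert=False)
    plt.title(f"Boxplot of {col}")
    plt.show()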



### Insight

- **Duration:** Strong right-tail behavior confirms mostly short rides.
- **Coordinates:** Tight IQR suggests geographic clustering.
- **Birth Year:** Implausibly early birth years appear as outliers, which supports filtering them.

The preprocessing steps applied earlier are justified by these distributions.
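
The points that a boxplot draws beyond its whiskers can also be counted directly. The sketch below applies the 1.5 × IQR rule, which is the default whisker setting in matplotlib's `boxplot`. `iqr_outlier_mask` is a hypothetical helper and the durations are made-up values.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up durations standing in for df['duration_sec']
durations = pd.Series([300, 420, 510, 610, 700, 45000], name="duration_sec")

mask = iqr_outlier_mask(durations)
print(f"{mask.sum()} of {len(durations)} values flagged as outliers")
print(durations[mask])
```

Running the helper over each column in `num_cols` would turn the visual impressions above into outlier counts per feature.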



## 9. Overall Conclusion

The Ford GoBike system:

- Primarily serves subscribers
- Is commuter-oriented (short trip durations)
- Shows demographic concentration in working-age males
- Operates within tightly bounded geographic coordinates
- Contains natural but manageable outliers

This EDA establishes a strong foundation for predictive modeling, 
demand forecasting, or operational optimization analysis.
