# Capstone 1: Framing an Impactful Clustering Problem

**Goal**: Define the clustering problem, identify stakeholders, articulate impact, and establish hypotheses for urban bike-share trip behavior segmentation.

**Deliverables**:
- Clear problem statement and context
- Stakeholder analysis and beneficiaries
- Rationale for clustering approach
- Behavioral hypotheses (commuters, tourists, casual, last-mile)
- Ethical considerations and scope boundaries

---
## 1. Problem Definition & Context

### The Challenge
Urban bike-sharing systems like CitiBike NYC generate millions of trip records annually, yet rider behavior remains poorly understood and heterogeneous. Without clear segmentation, cities and operators struggle to:

- **Optimize infrastructure**: Where should new stations be placed? Which areas need more bike capacity?
- **Tailor services**: How should pricing differ for commuters vs tourists?
- **Measure sustainability impact**: Which rider segments contribute most to carbon reduction?
- **Inform policy**: What evidence supports investment in bike infrastructure?

### Why This Matters
Bike-sharing is a cornerstone of sustainable urban mobility:
- **Environmental**: Reduces car trips and emissions (roughly 0.5 kg CO₂ avoided per trip, assuming the trip replaces a car journey)
- **Social**: Provides affordable, healthy transport for diverse populations
- **Economic**: Creates jobs and boosts local commerce near stations
- **Urban Planning**: Eases congestion and reclaims street space for people

Understanding *who* uses bikes, *when*, and *how* unlocks data-driven decisions that amplify these benefits.

### Domain Context: CitiBike NYC
- **Largest bike-share in US**: 25,000+ bikes, 1,500+ stations across NYC
- **3–5 million trips/month** (peak summer season)
- **Diverse ridership**: Commuters (weekday peaks), tourists (weekend leisure), last-mile connectors (transit hubs)
- **Public data**: Trip records openly available → enables research and accountability

---
## 2. Stakeholders & Beneficiaries

### Primary Stakeholders (Who will use these insights?)

#### 1. City Planners & Transport Agencies (NYC DOT, MTA)
**Needs**:
- Station placement optimization (where to expand network)
- Infrastructure investment (bike lanes, docking capacity)
- Integration with public transit (last-mile connections)

**How clustering helps**:
- Identify commuter corridors → prioritize protected bike lanes
- Detect underserved neighborhoods → target station expansion
- Quantify transit integration → justify funding for bike-subway connectors

#### 2. Bike-Share Operators (Lyft/Motivate)
**Needs**:
- Service optimization (bike redistribution, maintenance)
- Pricing strategies (membership tiers, dynamic pricing)
- Marketing (target user segments)

**How clustering helps**:
- Predict demand by cluster (allocate bikes efficiently)
- Design tailored memberships (e.g., "Commuter Pass" vs "Weekend Explorer")
- Target ads (promote to casual riders → convert to members)

#### 3. Sustainability Advocates & NGOs (Transportation Alternatives, NRDC)
**Needs**:
- Evidence for bike-share impact on emissions and car trips
- Equity analysis (are low-income communities served?)
- Policy advocacy (support for green transport funding)

**How clustering helps**:
- Quantify carbon savings by cluster (e.g., "Commuters avoid 10,000 car trips/month")
- Identify equity gaps (clusters in affluent areas only?)
- Build data-driven narratives for policy campaigns

### Secondary Stakeholders
#### 4. Urban Commuters (Riders)
Benefit indirectly from improved service reliability, expanded networks, and safer bike infrastructure driven by insights.

#### 5. Researchers & Data Scientists
Methodology and findings contribute to urban mobility literature; reproducible pipeline enables replication in other cities.

#### 6. Policy Makers (City Council, State Legislators)
Use evidence to allocate budgets for green transport and regulate shared mobility.

### Impact Pathway
```
Clustering Insights
         ↓
Stakeholder Actions (infrastructure, pricing, policy)
         ↓
Improved bike-share system (more stations, better service)
         ↓
Increased ridership + reduced car trips
         ↓
Sustainability Outcomes (lower emissions, healthier cities)
```

---
## 3. Why Clustering Fits This Problem

### Clustering = Unsupervised Pattern Discovery
**What is clustering?**
- Groups similar trips based on behavior (duration, time, location, user type)
- No labeled data required ("commuter" or "tourist" not pre-defined)
- Discovers hidden patterns in millions of trips

**Why clustering (vs other ML methods)?**

| Alternative Approach | Why Not Suitable |
|---------------------|------------------|
| **Supervised classification** (e.g., "predict if trip is commute") | No labeled ground truth; would require manual annotation of millions of trips |
| **Regression** (e.g., "predict trip duration") | Doesn't reveal *types* of riders; just predicts one variable |
| **Rule-based segmentation** (e.g., "weekday AM = commuter") | Oversimplified; misses nuanced patterns (e.g., weekend commuters, tourist peak shifts) |
| **Time-series forecasting** | Focuses on *when*, not *who*; doesn't group rider behaviors |

**Clustering advantages**:
1. **Exploratory**: Uncover unexpected segments (e.g., "midnight commuters," "reverse commuters")
2. **Scalable**: Handles millions of trips efficiently (KMeans runs in O(n·k·i·d) time for n trips, k clusters, i iterations, and d features)
3. **Interpretable**: Cluster centroids = "average trip profile" → easy to explain to stakeholders
4. **Actionable**: Each cluster maps to policy/operational decisions
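
To make the "centroid = average trip profile" idea concrete, here is a minimal sketch of Lloyd's algorithm (the standard KMeans procedure) in plain Python. The toy trips, the two-feature layout `(duration_min, distance_km)`, and the helper names `assign` and `kmeans` are illustrative assumptions, not part of the project pipeline; at real scale a library implementation such as scikit-learn's `KMeans` would be the more likely choice.

```python
import math
import random

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm; each pass costs n * k distance computations over d features."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Update step: the centroid is the feature-wise mean of its members.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[j])
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Toy trips as (duration_min, distance_km); values are made up for illustration.
trips = [(8, 1.2), (9, 1.5), (7, 1.0), (10, 1.8),
         (42, 6.5), (50, 8.0), (38, 5.5), (46, 7.0)]

centroids, labels = kmeans(trips, k=2)
for duration, distance in sorted(centroids):
    print(f"average profile: {duration:.1f} min, {distance:.2f} km")
```

Each printed centroid is simply the mean duration and distance of the trips assigned to it, which is what makes the result easy to explain to stakeholders. Because the features are unscaled here, duration dominates the distance calculation; standardizing each feature before clustering is the usual remedy.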

### Alignment with Project Goals
- **SMART Goal**: Segment trips into 4–6 interpretable clusters (e.g., commuter, tourist, casual, last-mile)
- **Success Criteria**: Silhouette ≥ 0.35 (internal validity: cohesive, well-separated clusters) + clusters align with domain knowledge (qualitative validation)
- **Deliverable**: Actionable insights for each cluster (e.g., "Add stations near transit hubs for last-mile cluster")
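
The silhouette criterion can be checked with a few lines of code. The sketch below computes the mean silhouette coefficient from its definition, s = (b − a) / max(a, b), on made-up trips and compares it with the 0.35 target; the data, the label vectors, and the `TARGET` name are assumptions for illustration only.

```python
import math

def silhouette_score(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

TARGET = 0.35  # success threshold stated in the project goals

# Toy trips as (duration_min, distance_km); values are made up for illustration.
trips = [(8, 1.2), (9, 1.5), (7, 1.0), (10, 1.8),
         (42, 6.5), (50, 8.0), (38, 5.5), (46, 7.0)]
good = [0, 0, 0, 0, 1, 1, 1, 1]  # short trips vs long trips
bad = [0, 1, 0, 1, 0, 1, 0, 1]   # labels that ignore the structure

for name, labels in (("separated", good), ("interleaved", bad)):
    score = silhouette_score(trips, labels)
    print(f"{name}: silhouette={score:.2f}, meets target: {score >= TARGET}")
```

This direct computation needs all pairwise distances, so it scales quadratically with the number of trips. On millions of records the score would normally be estimated on a random sample, for example through the `sample_size` parameter of scikit-learn's `silhouette_score`.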

---
## 4. Behavioral Hypotheses

Based on domain knowledge and related work (see RELATED_WORK.md), we hypothesize the following trip clusters:

### Hypothesis 1: Weekday Commuters
**Behavior Profile**:
- **Timing**: Weekday AM (7–9 AM) and PM (5–7 PM) peaks
- **Duration**: Short (10–20 min)
- **Distance**: Medium (2–5 km)
- **User Type**: Mostly members (subscribers)
- **Trip Pattern**: One-way (home → work, work → home)

**Expected Size**: 40–50% of trips

**Policy Relevance**: Prioritize protected bike lanes on commuter corridors; expand stations near offices and transit hubs.

---

### Hypothesis 2: Weekend Leisure / Tourists
**Behavior Profile**:
- **Timing**: Weekend midday (11 AM–3 PM)
- **Duration**: Long (30–60 min)
- **Distance**: High (5–10 km), or near zero station-to-station for round trips in parks
- **User Type**: Mostly casual (pay-per-ride)
- **Trip Pattern**: Round trips or loops (return to origin)

**Expected Size**: 20–30% of trips

**Policy Relevance**: Add stations near tourist attractions (Central Park, Brooklyn Bridge); design scenic bike routes.

---

### Hypothesis 3: Last-Mile Connectors
**Behavior Profile**:
- **Timing**: Spread throughout day (no peak)
- **Duration**: Very short (<10 min)
- **Distance**: Low (1–2 km)
- **User Type**: Mix of members and casual
- **Trip Pattern**: One-way, near subway/bus stations

**Expected Size**: 15–20% of trips

**Policy Relevance**: Integrate with transit (bike racks at subway entrances); ensure station density near transit hubs.

---

### Hypothesis 4: Casual / Errand Riders
**Behavior Profile**:
- **Timing**: Off-peak (midday, early evening)
- **Duration**: Medium (15–30 min)
- **Distance**: Medium (3–5 km)
- **User Type**: Mix
- **Trip Pattern**: One-way or short round trips

**Expected Size**: 10–20% of trips

**Policy Relevance**: Ensure station coverage in residential/commercial areas; promote bike-share for daily errands.

---

### Validation Strategy
After clustering (Capstone 3), we will:
1. **Compare cluster profiles to hypotheses** (e.g., does Cluster 1 show weekday AM/PM peaks?)
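2. **Check statistical significance** (ANOVA or Kruskal-Wallis across clusters for duration and start hour, reported with effect sizes, since with millions of trips even trivial differences reach significance)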
3. **Visualize spatially** (map stations by dominant cluster)
4. **Seek domain expert review** (city planners, bike-share operators)
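
Step 1 can be partly automated. The sketch below encodes only the duration and distance ranges stated in the hypotheses above and reports which hypothesis a cluster's average profile satisfies in full. The dictionary layout, the function names, and the example profiles are hypothetical; timing and user-type checks are left out because the hypotheses describe them qualitatively.

```python
# Ranges copied from the hypothesis profiles; the leisure hypothesis has no
# single distance range ("high or low"), so only its duration is checked.
HYPOTHESES = {
    "weekday_commuter": {"duration_min": (10, 20), "distance_km": (2, 5)},
    "weekend_leisure": {"duration_min": (30, 60)},
    "last_mile": {"duration_min": (0, 10), "distance_km": (1, 2)},
    "casual_errand": {"duration_min": (15, 30), "distance_km": (3, 5)},
}

def match_score(profile, ranges):
    """Fraction of hypothesized ranges that the cluster profile falls inside."""
    hits = sum(lo <= profile[feature] <= hi for feature, (lo, hi) in ranges.items())
    return hits / len(ranges)

def best_hypothesis(profile):
    """Return (name, score); a cluster counts as matched only if every range holds."""
    scores = {name: match_score(profile, r) for name, r in HYPOTHESES.items()}
    name = max(scores, key=scores.get)  # ties resolve to the first hypothesis listed
    return (name, scores[name]) if scores[name] == 1.0 else ("unmatched", scores[name])

# Hypothetical cluster centroids, as they might come out of Capstone 3.
cluster_profiles = {
    "cluster_0": {"duration_min": 13.5, "distance_km": 3.1},
    "cluster_1": {"duration_min": 44.0, "distance_km": 0.4},
    "cluster_2": {"duration_min": 6.0, "distance_km": 1.4},
    "cluster_3": {"duration_min": 95.0, "distance_km": 12.0},
}

for cluster, profile in cluster_profiles.items():
    print(cluster, best_hypothesis(profile))
```

The commuter and casual ranges overlap (15–20 min, 3–5 km), so a profile in that region satisfies both and needs the timing features to be told apart. A cluster reported as `unmatched` is a candidate for one of the unexpected segments, not an error.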

**What if hypotheses are wrong?**
- Clusters may reveal unexpected patterns (e.g., "reverse commuters," "midnight delivery riders")
- This is valuable! Exploratory clustering uncovers blind spots in domain assumptions
- We will interpret and name clusters based on *data*, not prior beliefs

---
## 5. Scope & Ethical Considerations

### In Scope
- **Trip-level clustering** (anonymized trip records, no rider identifiers)
- **Features**: Duration, distance, start/end time, weekday, user type, station geography
- **Algorithms**: KMeans, Agglomerative Hierarchical, DBSCAN
- **Evaluation**: Silhouette, Davies-Bouldin, interpretability, spatial coverage
- **Dataset**: CitiBike NYC, Spring/Summer 2025 (Mar–Jun)
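
The features listed above can be derived from a single trip record roughly as follows. This is a sketch under assumptions: the column names (`started_at`, `ended_at`, `start_lat`, `start_lng`, `end_lat`, `end_lng`, `member_casual`, `start_station_id`, `end_station_id`) are assumed to match the public trip files and should be checked against the actual download, and the sample record, including its station IDs, is invented.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def trip_features(trip):
    """Turn one raw trip record (a dict of strings and floats) into numeric features."""
    start = datetime.fromisoformat(trip["started_at"])
    end = datetime.fromisoformat(trip["ended_at"])
    # Encode start hour on a circle so 23:00 and 01:00 end up close together.
    angle = 2 * math.pi * (start.hour + start.minute / 60) / 24
    return {
        "duration_min": (end - start).total_seconds() / 60,
        "distance_km": haversine_km(trip["start_lat"], trip["start_lng"],
                                    trip["end_lat"], trip["end_lng"]),
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        "is_weekend": int(start.weekday() >= 5),  # Saturday=5, Sunday=6
        "is_member": int(trip["member_casual"] == "member"),
        "is_round_trip": int(trip["start_station_id"] == trip["end_station_id"]),
    }

# Invented example record: a Monday-morning ride across Midtown.
sample_trip = {
    "started_at": "2025-05-05 08:15:00",
    "ended_at": "2025-05-05 08:29:00",
    "start_lat": 40.7506, "start_lng": -73.9935,
    "end_lat": 40.7527, "end_lng": -73.9772,
    "member_casual": "member",
    "start_station_id": "A", "end_station_id": "B",  # hypothetical IDs
}

features = trip_features(sample_trip)
print(features)
```

The distance here is the straight-line distance between the start and end stations, not the route ridden, so a round trip yields zero regardless of how far the rider actually went. That is why the `is_round_trip` flag is kept as a separate feature.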

### Out of Scope
- **Individual user tracking** (privacy preserved; no user IDs or trajectories)
- **Real-time prediction** (focus on descriptive clustering, not forecasting)
- **External data integration** (weather, events) unless trivial to add
- **Deployment to production** (academic project; proof of concept only)

---

### Ethical Considerations

#### 1. Privacy & Anonymization
- **Data Source**: Public, open data (no PII: names, emails, payment info)
- **Unit of analysis**: Individual trips, which carry no rider identifier (cannot link to individual riders)
- **Geolocation**: Station coords are public infrastructure (not home addresses)
- **Compliance**: Aligns with NYC Open Data policies and GDPR principles (anonymized, purpose-limited)

**Mitigation**: No additional data collection; use only publicly available, de-identified records.

---

#### 2. Equity & Bias
**Risk**: Clusters may reveal underserved neighborhoods (e.g., low-income areas lack stations)
- **Bias Amplification**: If recommendations prioritize already well-served areas (e.g., Manhattan), it worsens inequality
- **Geographic Skew**: 80% of stations in Manhattan/Brooklyn → outer boroughs underrepresented

**Mitigation**:
- **Explicitly analyze equity**: Flag clusters in underserved areas; recommend station expansion there
- **Avoid regressive recommendations**: Don't suggest pricing that penalizes low-income riders
- **Transparency**: Document biases in DATA_FITNESS_ASSESSMENT.md; acknowledge limitations in IMPACT_REPORT.md
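
The "explicitly analyze equity" step could start from something as simple as station share per borough. The sketch below flags boroughs whose share falls under a chosen threshold; the counts are toy numbers picked to reproduce the 80% Manhattan/Brooklyn skew mentioned above, not real station data, and the 10% threshold and function names are assumptions.

```python
def station_shares(counts):
    """Share of all stations located in each borough."""
    total = sum(counts.values())
    return {borough: n / total for borough, n in counts.items()}

def underrepresented(counts, min_share):
    """Boroughs whose station share is below min_share, in alphabetical order."""
    return sorted(b for b, share in station_shares(counts).items() if share < min_share)

# Toy station counts per borough (illustrative only, not real data).
toy_counts = {"Manhattan": 50, "Brooklyn": 30, "Queens": 15, "Bronx": 5}

shares = station_shares(toy_counts)
for borough, share in shares.items():
    print(f"{borough}: {share:.0%} of stations")

flagged = underrepresented(toy_counts, min_share=0.10)  # threshold is an assumption
print("flag for expansion review:", flagged)
```

A fuller analysis would compare station share against population or income data for each area, which falls under the external data integration listed as out of scope unless it is trivial to add.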

**Example Equitable Recommendation**:
> "Last-mile cluster is underrepresented in the Bronx (only 5% of stations). Add 20 stations near subway lines to improve access."

---

#### 3. Dual Use & Unintended Consequences
**Risk**: Insights could be misused (e.g., surge pricing during commuter peaks, reducing service in low-profit areas)

**Mitigation**:
- **Frame recommendations for public good**: Emphasize sustainability, equity, and accessibility
- **Engage stakeholders**: Share findings with advocacy groups (Transportation Alternatives) to counter profit-only motives
- **Open methodology**: Publish code and decision log → allow scrutiny and alternative interpretations

---

#### 4. Seasonal Bias & Generalizability
**Risk**: Spring/summer data may not reflect fall/winter patterns (fewer leisure trips, more die-hard commuters)

**Mitigation**:
- **Document limitation**: Clearly state "Spring/Summer 2025 (Mar–Jun)" in all reports
- **Recommend validation**: Suggest re-running clustering on fall/winter data
- **Avoid overgeneralization**: Don't claim clusters apply year-round without evidence

---

### Ethical Framework: Beneficence, Justice & Non-Maleficence
**Beneficence** (do good):
- Insights aim to improve bike-share system → benefits riders, environment, public health
- Prioritize recommendations that maximize social benefit (e.g., expand access, reduce emissions)

**Justice** (fairness):
- Ensure findings don't reinforce existing inequalities (geographic, economic)
- Advocate for equitable distribution of bike infrastructure

**Non-Maleficence** (do no harm):
- Protect privacy (no individual tracking)
- Avoid recommendations that harm vulnerable groups (e.g., pricing out low-income riders)

---

### Summary: Responsible Clustering
✅ **Privacy-preserving**: Anonymized trip-level data, no rider identifiers  
✅ **Equity-focused**: Identify and address underserved areas  
✅ **Transparent**: Open methodology and decision log  
✅ **Purpose-driven**: Align with sustainability and public good  
⚠️ **Acknowledge limits**: Seasonal bias, no individual behavior  

---
## Reflection: Why This Clustering Problem Matters

### Summary
Clustering urban bike-share trips is more than a technical exercise—it's a pathway to **sustainable, equitable cities**. By revealing distinct rider behaviors (commuters, tourists, last-mile connectors), we empower:

- **City planners** to build bike infrastructure where it's needed most
- **Operators** to deliver better service and grow ridership
- **Advocates** to make evidence-based cases for green transport funding
- **Riders** to benefit from expanded, reliable bike-share networks

### Key Takeaways (Capstone 1)
1. **Problem is well-defined**: Segmenting heterogeneous bike-share trips into actionable clusters
2. **Stakeholders are clear**: City planners, operators, advocates, riders
3. **Clustering is the right tool**: Unsupervised, scalable, interpretable
4. **Hypotheses are testable**: Commuters (weekday peaks), tourists (weekend leisure), last-mile (short trips), casual (mid-duration)
5. **Ethics are considered**: Privacy-preserving, equity-focused, transparent

### Next Steps
**Capstone 2**: Acquire and prepare CitiBike trip data (cleaning, feature engineering, EDA)  
**Capstone 3**: Apply clustering algorithms (KMeans, Agglomerative, DBSCAN) and select champion model  
**Capstone 4**: Evaluate clusters (silhouette, visualizations, interpretation)  
**Capstone 5**: Translate findings into impact reports for stakeholders  

---

**Deliverables Completed** ✅
- Problem statement and context
- Stakeholder analysis
- Clustering rationale
- Behavioral hypotheses
- Ethical considerations

**Supporting Documents Created**:
- `PROJECT_CHARTER.md` (SMART goals, success criteria)
- `RELATED_WORK.md` (case studies: CitiBike, DBSCAN, K-Means)
- `METHODS_PLAN.md` (features, algorithms, workflow)
- `EVALUATION_PLAN.md` (metrics, validation strategy)
- `DECISIONS_LOG.md` (initial framing decisions)

---

*Ready to move to Capstone 2: Dataset Preparation* 🚴‍♀️📊