# Bay Area Boba: Where Do Shops Win?

**Author:** *Your Name*  
**Repo:** `boba-eda-starter`  
**Last updated:** _today_


## 1) Why boba? Why this study?

Boba shops are a useful sandbox for local retail strategy: they are high-frequency, locally competitive businesses whose success is tightly linked to neighborhood context. The question operators and investors ask is simple: **where do boba shops tend to earn higher ratings, and why?** Ratings aren’t revenue, but they’re a quick, directional proxy for product quality, experience, and word of mouth.

**Guiding question:** *What patterns in location and competition relate to higher Yelp ratings across the Bay Area?*


## 2) Data & assumptions (what’s in / what’s not)

- **Unit of analysis:** individual boba shops in the Bay Area (n ≈ 603).
- **Columns used:** `name`, `city`, `lat`, `lon` (or `long`), `rating`.
- **Assumptions & limitations**
  - Ratings ∈ [1, 5] with rounding and subjectivity; not a perfect proxy for revenue.
  - No true price/sales/foot-traffic data in this version.
  - “Chain” is inferred by repeated normalized names (≥2 locations ⇒ chain).
  - “Urban” is defined by a city list (SF, SJ, Oakland, Berkeley, Palo Alto).
  - “Near campus” = within 2 miles of major campuses (Stanford, UCB, SJSU, SCU, CSU East Bay, USF).
  - All effects are **associational** (observational data).

> Full workflow, figures, and code live in the notebook (`notebooks/eda.ipynb`) and its HTML export.


## 3) What we built (features & checks)

- **Chain vs Independent (`is_chain`)**  
  Normalize `name` (lowercase, strip punctuation/extra spaces). If a normalized name appears **≥2** times, flag as chain.  
  *Why:* brand presence might correlate with consistency or—conversely—commoditization.

- **Urban vs Suburban (`urban`)**  
  Binary flag for the five most urban Bay cities in this dataset (SF, SJ, Oakland, Berkeley, Palo Alto).  
  *Why:* density and expectations differ across urban cores vs. suburbs.

- **Near campus (`near_campus`)**  
  Haversine distance to the nearest campus; flag if ≤2 miles.  
  *Why:* student demand could shift preferences (sweeter drinks, value focus).

- **Local competition (optional)**  
  Count neighbors within 0.5/1/2 miles using a Haversine `BallTree` (`nn_0p5mi`, `nn_1p0mi`, `nn_2p0mi`).  
  *Why:* quick proxy for competitive density.
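The `is_chain` and `urban` flags above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the helper names (`normalize_name`, `add_flags`), the city spellings in `URBAN_CITIES`, and the toy shop rows are all assumptions.

```python
import re
from collections import Counter

# Assumed spellings of the five "urban" cities; the dataset's `city` values may differ.
URBAN_CITIES = {"san francisco", "san jose", "oakland", "berkeley", "palo alto"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def add_flags(shops: list[dict]) -> list[dict]:
    """Return copies of the shop records with `is_chain` and `urban` added."""
    counts = Counter(normalize_name(s["name"]) for s in shops)
    flagged = []
    for s in shops:
        row = dict(s)
        # A normalized name seen at two or more locations is treated as a chain.
        row["is_chain"] = int(counts[normalize_name(s["name"])] >= 2)
        row["urban"] = int(s["city"].strip().lower() in URBAN_CITIES)
        flagged.append(row)
    return flagged


# Toy rows with made-up shop names, for illustration only.
shops = [
    {"name": "Pearl House", "city": "San Jose"},
    {"name": "pearl  house!", "city": "Fremont"},
    {"name": "Corner Tea Bar", "city": "Oakland"},
]
for row in add_flags(shops):
    print(row["name"], row["is_chain"], row["urban"])
```

Name matching of this kind will miss chains with a single Bay Area location and can merge unrelated shops that share a generic name, which is why the limitations section treats `is_chain` as a shortcut.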

Sanity checks included value counts, means/medians by group, and outlier scans.
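For the `near_campus` flag described in this section, a Haversine check might look like the sketch below. The campus coordinates are approximate and cover only two of the six campuses, and the function names are hypothetical; treat it as an illustration of the distance rule, not the notebook's implementation.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

# Approximate campus coordinates, for illustration only (not the notebook's full list).
CAMPUSES = {
    "Stanford": (37.4275, -122.1697),
    "UC Berkeley": (37.8719, -122.2585),
}


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def near_campus(lat, lon, radius_mi=2.0):
    """Return 1 if the nearest campus in CAMPUSES is within `radius_mi`, else 0."""
    nearest = min(
        haversine_mi(lat, lon, c_lat, c_lon) for c_lat, c_lon in CAMPUSES.values()
    )
    return int(nearest <= radius_mi)


# A point near downtown Palo Alto and a point in downtown San Jose (approximate).
print(near_campus(37.4443, -122.1598), near_campus(37.3382, -121.8863))
```

The 2-mile threshold is the `radius_mi` default, so a sensitivity check at 1 or 3 miles only requires changing that argument.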


## 4) Exploratory patterns (what the map says)

**Top cities**  
San Jose and San Francisco together account for ~31% of all shops, forming two distinct gravity centers (South Bay scale; SF urban density).

![Top 10 Bay Area Cities – Boba Shop Distribution](../assets/top_cities.png)

**Competitive clusters (why we clustered)**  
We ran K-Means on standardized `[lat, lon, rating]` to spot **geographic + quality** pockets at different zoom levels. We evaluated k with elbow and silhouette plots, and visualized **k=3, 5, 10** to tell three stories:

- **Strategic (k=3)** – The Bay separates into **East Bay suburbs (Fremont area)**, **Urban core (Oakland/SF corridor)**, and **North fringe (Berkeley edge)**. Average cluster ratings hint that the urban core trends slightly higher than the suburban mass.

- **Balanced (k=5)** – Fremont and Oakland **split into high-performing cores and weaker spillovers** (e.g., Fremont ~4.14 vs ~3.35; Oakland ~4.22 vs ~3.29).

- **Tactical (k=10)** – You start seeing **hot blocks vs. soft pockets**:
  - Hotspots: **SF (~4.19), SJ (~4.11), Palo Alto (~4.21)**
  - Risk pockets: **Santa Clara (~3.46)**, **Dublin pocket (~2.67)**, uneven **Berkeley fragments (3.27→4.21)**

![Elbow & Silhouette (k selection)](../assets/k_selection.png)

![Competitive Clusters — k=3 / k=5 / k=10](../assets/boba_clusters_k_3_5_10.png)

*So what:* site selection isn’t just “which city” but **“which block.”** Clustering surfaces pockets where expectations (and competition) differ meaningfully.
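The k-selection step behind these maps can be sketched as follows. It assumes scikit-learn is available and uses a synthetic stand-in for `[lat, lon, rating]`; the function name `evaluate_k` and the `n_init` and seed values are illustrative choices, so the printed scores will not match the report's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def evaluate_k(features, ks=(3, 5, 10), seed=0):
    """Fit K-Means for each k on standardized features; return inertia and silhouette."""
    scaled = StandardScaler().fit_transform(features)
    results = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
        results[k] = {
            "inertia": float(model.inertia_),  # within-cluster sum of squares (elbow plot)
            "silhouette": float(silhouette_score(scaled, model.labels_)),
        }
    return results


# Synthetic stand-in for [lat, lon, rating]; replace with the real columns.
rng = np.random.default_rng(0)
demo = np.column_stack([
    rng.uniform(37.2, 38.0, 300),
    rng.uniform(-122.5, -121.8, 300),
    rng.uniform(2.5, 5.0, 300),
])
for k, scores in evaluate_k(demo).items():
    print(k, round(scores["inertia"], 1), round(scores["silhouette"], 3))
```

Standardizing first matters because latitude and longitude span well under one degree here while ratings span several stars; without scaling, rating would dominate the distance metric.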


## 5) Hypothesis tests (what the numbers say)

We tested three practical levers with **Welch t-tests**, **bootstrap CIs**, and an **OLS with robust (HC3) SEs** to estimate adjusted differences.

### 5.1 Simple group differences (Welch)
- **Urban (1) vs Suburban (0)**  
  Δ̄ ≈ **+0.18–0.21 stars**, *p* < 0.001, **small** effect size.
- **Chain (1) vs Indie (0)**  
  Δ̄ ≈ **−0.11 stars**, *p* ≈ 0.009, **small** effect.
- **Near campus (1) vs Not (0)**  
  Δ̄ ≈ **+0.04 stars**, *p* ≈ 0.52, **negligible**.
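For readers who want to see what sits behind these comparisons, here is a standard-library sketch of the Welch statistic, Cohen's d, and a percentile bootstrap CI. The function names and the toy ratings are assumptions, and the sketch stops short of a p-value, which needs the t distribution (for example `scipy.stats.ttest_ind(a, b, equal_var=False)`).

```python
import math
import random
import statistics as st


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for mean(a) - mean(b)."""
    va, vb = st.variance(a) / len(a), st.variance(b) / len(b)
    t = (st.fmean(a) - st.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    )
    return (st.fmean(a) - st.fmean(b)) / pooled


def bootstrap_ci(a, b, reps=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group separately."""
    rng = random.Random(seed)
    diffs = sorted(
        st.fmean(rng.choices(a, k=len(a))) - st.fmean(rng.choices(b, k=len(b)))
        for _ in range(reps)
    )
    lo = diffs[int(reps * alpha / 2)]
    hi = diffs[int(reps * (1 - alpha / 2)) - 1]
    return lo, hi


# Toy ratings, not taken from the dataset.
urban = [4.0, 4.5, 5.0, 4.5]
suburban = [3.5, 4.0, 4.0, 3.5]
print(welch_t(urban, suburban), cohens_d(urban, suburban), bootstrap_ci(urban, suburban))
```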

### 5.2 Multiple regression (controls)
Model: `rating ~ urban + is_chain + near_campus` (HC3 SEs, n=603).

- **Urban:** **+0.21** stars (95% CI ~ **+0.12 to +0.31**)
- **Chain:** **−0.11** stars (95% CI ~ **−0.20 to −0.03**)
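- **Near campus:** about **−0.10** (95% CI crosses zero; **ns**). The sign flips relative to the unadjusted +0.04, but neither estimate is distinguishable from zero.
- Model R² ≈ **0.04**: the three flags explain about 4% of the variance in ratings, so the signals are modest even where they are statistically clear.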
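To make the HC3 adjustment concrete, the sketch below fits OLS with NumPy and computes HC3 standard errors from the hat-matrix diagonal. It runs on synthetic data whose coefficients are invented for the demo, and `ols_hc3` is a hypothetical helper rather than the notebook's code.

```python
import numpy as np


def ols_hc3(X, y):
    """OLS coefficients with HC3 standard errors; X must already include an intercept column."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)  # diagonal of the hat matrix
    weights = (resid / (1.0 - leverage)) ** 2           # HC3 rescaling of squared residuals
    cov = xtx_inv @ (X.T * weights) @ X @ xtx_inv       # sandwich estimator
    return beta, np.sqrt(np.diag(cov))


# Synthetic data: the generating coefficients are made up for the demo.
rng = np.random.default_rng(0)
n = 603
urban = rng.integers(0, 2, n)
is_chain = rng.integers(0, 2, n)
near_campus = rng.integers(0, 2, n)
rating = 4.0 + 0.2 * urban - 0.1 * is_chain + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), urban, is_chain, near_campus])
beta, se = ols_hc3(X, rating)
for name, b, s in zip(["const", "urban", "is_chain", "near_campus"], beta, se):
    print(f"{name:12s} {b:+.3f}  95% CI [{b - 1.96 * s:+.3f}, {b + 1.96 * s:+.3f}]")
```

In statsmodels the same covariance estimator is available through `fit(cov_type="HC3")`, which is the more convenient route once a formula interface is in use.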

### 5.3 Local competition (optional check)
Adding **neighbors within 0.5 mi** (`nn_0p5mi`) yields a tiny negative slope (~−0.008 stars per shop) and is **not statistically significant** at conventional levels. Directionally, **crowding doesn’t help**, but the average effect is small.
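The neighbor counts used in this check can be produced roughly as follows, assuming scikit-learn's `BallTree` with the haversine metric. The function name and the four test points are illustrative; the key detail is that both coordinates and radii must be in radians.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def neighbor_counts(lat, lon, radii_mi=(0.5, 1.0, 2.0)):
    """Count other shops within each radius; returns {radius_mi: array of counts}."""
    # The haversine metric expects [lat, lon] pairs in radians.
    coords = np.radians(np.column_stack([lat, lon]))
    tree = BallTree(coords, metric="haversine")
    out = {}
    for r in radii_mi:
        counts = tree.query_radius(coords, r=r / EARTH_RADIUS_MI, count_only=True)
        out[r] = counts - 1  # each shop is within radius of itself, so drop it
    return out


# Four made-up points: about 0.3 mi and 1.5 mi north of the first, plus one far away.
lat = [37.0, 37.00434, 37.02171, 38.0]
lon = [-122.0, -122.0, -122.0, -122.0]
for radius, counts in neighbor_counts(lat, lon).items():
    print(radius, counts.tolist())
```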

> Interpretation rule-of-thumb: **Urban shops rate higher. Independents rate a bit higher than chains. Campus proximity, on average, doesn’t move ratings much once you control for the other two.** All effects are small, which is plausible for ratings data.


## 6) Operator playbook (what to do with this)

1) **Favor urban cores or “urban-adjacent” pockets**  
   Urban shops rate about **+0.2 stars** higher after adjusting for chain status and campus proximity; this is an association, not a guaranteed lift. Within a city, aim for the **high-performing sub-clusters** highlighted by the tactical (k=10) view.

2) **Indie advantage is real but modest**  
   Independents run ~**0.1 stars** higher on average than chains, possibly reflecting product focus and local fit (not tested here). If you’re a chain, differentiate on **signature items** and **ops quality** to close the gap.

3) **Campus strategy: niche, not default**  
   Don’t over-index on being near campus for ratings alone. If you target students, plan to win on **value/format** (speed, group ordering, late hours) rather than expecting a ratings lift.

4) **Competition isn’t destiny**  
   More neighbors within ~0.5 mi does **not** predict a strong drop in ratings. Execution can win inside dense corridors—just mind your **signature & consistency**.


## 7) What would change my mind

- **Price & value**: add actual menu price points, discounts, bundle offers.
- **Foot traffic / co-tenants**: SafeGraph/Placer data or POI mixes (e.g., Asian grocers, campuses, gyms).
- **Store format**: seat count, hours, wait times—service friction hits ratings.
- **Brand strength**: true chain roster (not just repeated names) and national sentiment.
- **Demographics & spend**: incomes, age mix, student share by tract (ACS).
- **Text analytics**: topic/sentiment from reviews (e.g., “chewy pearls,” “wait time,” “too sweet”).


## 8) Limitations (read before acting)

- **Observational**: correlations ≠ causation.
- **Sampling**: Yelp coverage & review biases vary by neighborhood.
- **Feature shortcuts**: “urban” via city list; “chain” via name repetition; “near campus” threshold at 2 miles.
- **Clustering**: K-Means assumes roughly spherical clusters; we used it for **exploration**, not for causal claims.


## 9) How to reproduce

```bash
# 1) create env
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

# 2) install
pip install -r requirements.txt

# 3) run the notebook
jupyter lab
# open notebooks/eda.ipynb and Run All
```

Key figures (saved by the notebook):
- `assets/top_cities.png` — Top-10 cities by shop count  
- `assets/k_selection.png` — Elbow & silhouette (k selection)  
- `assets/boba_clusters_k_3_5_10.png` — Cluster maps (strategic / balanced / tactical)  
- *(optional)* `assets/adjusted_effects.png` — Adjusted means & 95% CIs (urban/chain/near-campus)


## 10) Appendix (notes on methods)

- **Welch t-tests** for unequal variances; **Cohen’s d / Hedges’ g** for effect sizes; **bootstrap CIs** (5,000 reps) for mean differences.
- **OLS with HC3** standard errors, so inference stays valid under heteroskedasticity (the coefficients themselves are ordinary OLS estimates); **marginal means** computed by setting a factor to 0/1 over the observed design and averaging predicted ratings.
- **Haversine** for great-circle distances; **BallTree** with the haversine metric for fast neighbor counts, with coordinates and search radii expressed in radians (miles ÷ Earth radius).
- **Clustering**: K-Means on standardized `[lat, lon, rating]`; centroids are transformed back to original units for interpretability; silhouette ~0.31–0.34 at the values of k we presented.
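
The marginal-means recipe in the notes above (set a factor to 0 or 1 for every row, predict, average) can be sketched with NumPy. The helper name and the tiny noiseless dataset are assumptions made for illustration.

```python
import numpy as np


def marginal_means(X, y, col):
    """Average predicted rating with column `col` forced to 0 and then to 1 for every row."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    means = {}
    for level in (0, 1):
        X_cf = X.copy()
        X_cf[:, col] = level  # counterfactual design: every shop at this level
        means[level] = float((X_cf @ beta).mean())
    return means


# Toy design: intercept, urban, is_chain; ratings generated without noise for clarity.
urban = np.array([0, 0, 1, 1])
is_chain = np.array([0, 1, 0, 1])
X = np.column_stack([np.ones(4), urban, is_chain])
y = 4.0 + 0.2 * urban - 0.1 * is_chain
print(marginal_means(X, y, col=1))
```

In a purely additive model like this one, the gap between the two marginal means equals the factor's coefficient; the averaging step only changes the picture once interactions are added.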
