# Museum Visitors vs City Population

This notebook uses the `wikiapp` package to:
1. Fetch museum data (Wikipedia API with offline fallback)
2. Store it in PostgreSQL (Docker) or SQLite (local)
3. Run a linear regression: **city population → museum visitors**
4. Visualize the results

In [None]:
import logging

from wikiapp import scraper, population, db, model

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

## 1. Data Ingestion

Fetch the museum list from Wikipedia (or from the cached fallback if the API is unavailable) and enrich each record with its city's population.
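The live-then-cache pattern described above can be sketched as follows. This is only an illustration, not how `wikiapp.scraper` is actually implemented: `fetch_with_fallback`, `failing_fetch`, the JSON cache file, and the record fields are all hypothetical names chosen for the example.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def fetch_with_fallback(fetch_live, cache_path):
    """Return fetch_live() if it succeeds, otherwise the records cached as JSON."""
    try:
        return fetch_live()
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("Live fetch failed (%s); using cache %s", exc, cache_path)
        return json.loads(Path(cache_path).read_text(encoding="utf-8"))


def failing_fetch():
    """Hypothetical stand-in for a Wikipedia request made while offline."""
    raise ConnectionError("network unavailable")


# Demonstration with a throwaway cache file and a placeholder record.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "museums.json"
    cache.write_text(
        json.dumps([{"museum": "Example Museum", "visitors": 1000}]),
        encoding="utf-8",
    )
    museums_demo = fetch_with_fallback(failing_fetch, cache)
```

Passing the fetch function as an argument keeps the fallback logic testable without network access.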

In [None]:
museums = scraper.fetch_museums_from_wikipedia()
museums = population.enrich_museums_with_population(museums)
print(f"Museums retrieved: {len(museums)}")
museums[:3]

## 2. Database

Connect to the database (PostgreSQL via the `DATABASE_URL` environment variable in Docker, or SQLite locally) and load the museum records.
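One plausible way to choose between the two backends is to read `DATABASE_URL` and fall back to a local SQLite URL when it is unset. The sketch below assumes that behaviour; `resolve_database_url` and the default file name `museums.db` are hypothetical and may differ from what `db.get_engine()` really does.

```python
import os


def resolve_database_url(env=None, default="sqlite:///museums.db"):
    """Return DATABASE_URL if set and non-empty, else a local SQLite URL.

    `env` is any mapping; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    return env.get("DATABASE_URL") or default


# Example: inside Docker the variable is set, locally it is not.
docker_url = resolve_database_url({"DATABASE_URL": "postgresql://user:pw@db:5432/museums"})
local_url = resolve_database_url({})
```

The resulting string is in the URL form that SQLAlchemy's `create_engine` accepts, if the package uses SQLAlchemy.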

In [None]:
engine = db.get_engine()
db.init_db(engine)
db.load_museums(museums, engine)

df = db.query_dataset(engine)
engine.dispose()

print(f"Rows with population data: {len(df)}")
df.head(10)

## 3. Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of visitors
axes[0].hist(df["visitors"] / 1e6, bins=12, edgecolor="black", alpha=0.7)
axes[0].set_xlabel("Annual Visitors (millions)")
axes[0].set_ylabel("Count")
axes[0].set_title("Distribution of Museum Visitors")

# Distribution of city populations
axes[1].hist(df["city_population"] / 1e6, bins=12, edgecolor="black", alpha=0.7, color="orange")
axes[1].set_xlabel("City Population (millions)")
axes[1].set_ylabel("Count")
axes[1].set_title("Distribution of City Populations")

plt.tight_layout()
plt.show()

## 4. Linear Regression

In [None]:
result = model.run_regression(df)
print(model.summary(result))

In [None]:
import numpy as np

fig, ax = plt.subplots(figsize=(10, 7))

# Scatter plot with labels
ax.scatter(df["city_population"] / 1e6, df["visitors"] / 1e6,
           s=80, alpha=0.7, edgecolors="black", linewidths=0.5, zorder=5)

# Label each point
for _, row in df.iterrows():
    ax.annotate(row["museum"], (row["city_population"] / 1e6, row["visitors"] / 1e6),
                fontsize=7, alpha=0.8, xytext=(5, 5), textcoords="offset points")

# Regression line
x_range = np.linspace(df["city_population"].min(), df["city_population"].max(), 100)
y_range = result.model.predict(x_range.reshape(-1, 1))
ax.plot(x_range / 1e6, y_range / 1e6, color="red", linewidth=2,
        label=f"y = {result.coef:.4f}x + {result.intercept:.0f}\nR\u00b2 = {result.r2:.4f}")

ax.set_xlabel("City Population (millions)", fontsize=12)
ax.set_ylabel("Annual Museum Visitors (millions)", fontsize=12)
ax.set_title("Museum Visitors vs City Population — Linear Regression", fontsize=14)
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

## 5. Residual Analysis

In [None]:
df["predicted"] = result.y_pred
df["residual"] = df["visitors"] - df["predicted"]

fig, ax = plt.subplots(figsize=(10, 5))
colors = ["green" if r > 0 else "red" for r in df["residual"]]
ax.barh(df["museum"], df["residual"] / 1e6, color=colors, alpha=0.7)
ax.set_xlabel("Residual (millions of visitors)")
ax.set_title("Residuals: Actual − Predicted Visitors")
ax.axvline(0, color="black", linewidth=0.8)
plt.tight_layout()
plt.show()

## Interpretation

- **Positive residuals** (green): museums that draw more visitors than their city's population predicts, likely because of international tourism, free admission, or iconic status (e.g., Louvre, British Museum).
- **Negative residuals** (red): museums that fall below the prediction, possibly because they are newer, less internationally known, or located in cities with many competing museums.
- The **low R²** is expected: city population alone is a weak predictor of museum attendance. Tourist infrastructure, museum type, pricing, and international reputation matter far more.
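For reference, R² can be recomputed directly from the actual and predicted values as one minus the ratio of residual to total sum of squares. The helper below is a minimal sketch; `r_squared` is a hypothetical name and not part of `wikiapp`.

```python
import numpy as np


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Predicting the mean for every point explains nothing, so R² is 0.
baseline = r_squared([1, 2, 3], [2, 2, 2])
# A perfect prediction gives R² of 1.
perfect = r_squared([1, 2, 3], [1, 2, 3])
```

In this notebook, `r_squared(df["visitors"], df["predicted"])` should agree with `result.r2`, assuming the model reports the standard in-sample R².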

### Next Steps
- Add features (GDP, tourism arrivals, museum type, free-admission flag)
- Try log-log regression or polynomial features
- Aggregate museums by city for a city-level analysis
- Use cross-validation with a larger dataset
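The log-log idea from the list above could be prototyped as follows. This is a sketch under assumptions: `fit_log_log` is a hypothetical helper, and the data is synthetic, generated from an exact power law purely to show that the fit recovers the exponent.

```python
import numpy as np


def fit_log_log(population, visitors):
    """Least-squares fit of log10(visitors) = slope * log10(population) + intercept."""
    x = np.log10(np.asarray(population, dtype=float))
    y = np.log10(np.asarray(visitors, dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


# Synthetic illustration only: visitors = 50 * population ** 0.7, no noise.
pop = np.array([1e5, 5e5, 1e6, 5e6, 1e7])
vis = 50.0 * pop ** 0.7
slope, intercept = fit_log_log(pop, vis)
```

The slope is an elasticity: a value of 0.7 means a 1% larger city is associated with roughly 0.7% more visitors. On the real data the call would be `fit_log_log(df["city_population"], df["visitors"])`, after dropping any rows with zero or missing values, since the logarithm is undefined there.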