# 🚶‍♂️ Pedestrian Traffic Prediction on Bahnhofstrasse (Zurich)

This project uses hourly foot traffic data from Zurich’s Bahnhofstrasse to explore pedestrian behavior and build a model that predicts hourly pedestrian counts from time, weather, and location.

📍 Dataset: [Opendata Zurich – Passantenfrequenzen](https://opendata.swiss/en/dataset/passantenfrequenzen-an-der-bahnhofstrasse-stundenwerte)  
🎯 Goal: Predict hourly pedestrian counts  
🔧 Tools: Python, pandas, Matplotlib, Seaborn, scikit-learn (Random Forest)  
📈 Bonus: Add lag features and a log-transformed target for smarter prediction

## 📥 Load Dataset and Check Basic Info

We load the CSV file, downloaded manually from opendata.swiss, and inspect its shape and first rows.

In [None]:
import pandas as pd

# Load your CSV (adjust path if needed)
data = pd.read_csv("data/foot_traffic.csv")

# Check structure
print(data.shape)
data.head()
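
Beyond `shape` and `head()`, a per-column summary can help spot missing values before cleaning. The sketch below is a minimal example on a made-up frame; `summarize` is a hypothetical helper, not part of the original notebook, and the toy values only stand in for the loaded CSV.

```python
import pandas as pd

def summarize(frame):
    """Return dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })

# Toy frame (made up) standing in for the loaded CSV
demo = pd.DataFrame({
    "location_name": ["A", "A", "B"],
    "pedestrians_count": [120, None, 80],
})
print(summarize(demo))
```

On the real data, `summarize(data)` would be the equivalent call.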

## 🧹 Clean the Data

We keep only entries from Bahnhofstrasse (excluding Lintheschergasse) and remove unused or incomplete columns: `collection_type` and every column whose name starts with `zone_99`. We also remove rows with no recorded pedestrian activity.

In [None]:
# Keep only Bahnhofstrasse locations
bahnhof_data = data[data['location_name'].str.contains('Bahnhofstrasse')].copy()

# Drop zone_99 and collection_type
zone_99_cols = [col for col in bahnhof_data.columns if col.startswith('zone_99')]
bahnhof_data.drop(columns=['collection_type'] + zone_99_cols, inplace=True)

# Drop rows with a pedestrian count of 0 (treated as invalid sensor readings)
bahnhof_data = bahnhof_data[bahnhof_data['pedestrians_count'] > 0]

# Reset index for safety
bahnhof_data.reset_index(drop=True, inplace=True)

# Check result
print("Cleaned shape:", bahnhof_data.shape)
bahnhof_data.head()

## 📆 Time-Based Feature Engineering

We extract features such as hour of day, day of week, weekend flag, month, and year from the timestamp. These help the model learn time-dependent patterns in foot traffic.

In [None]:
# Convert timestamp to datetime
bahnhof_data['timestamp'] = pd.to_datetime(bahnhof_data['timestamp'])

# Extract time-based features
bahnhof_data['hour'] = bahnhof_data['timestamp'].dt.hour
bahnhof_data['weekday'] = bahnhof_data['timestamp'].dt.weekday  # 0 = Monday
bahnhof_data['is_weekend'] = (bahnhof_data['weekday'] >= 5).astype(int)  # Saturday/Sunday -> 1
bahnhof_data['month'] = bahnhof_data['timestamp'].dt.month
bahnhof_data['year'] = bahnhof_data['timestamp'].dt.year

# Preview
bahnhof_data[['timestamp', 'hour', 'weekday', 'is_weekend', 'month', 'year']].head()
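
To see why these features matter, it can help to compare the average count per hour on weekdays and weekends. The following is a minimal sketch on made-up numbers (the counts are illustrative, not taken from the dataset); it assumes the same column names as the notebook.

```python
import pandas as pd

# Toy data (made up for illustration): a Friday and a Saturday, three hours each
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-06-02 08:00", "2023-06-02 12:00", "2023-06-02 18:00",  # Friday
        "2023-06-03 08:00", "2023-06-03 12:00", "2023-06-03 18:00",  # Saturday
    ]),
    "pedestrians_count": [400, 900, 1200, 150, 1100, 1000],
})

demo["hour"] = demo["timestamp"].dt.hour
demo["weekday"] = demo["timestamp"].dt.weekday           # 0 = Monday
demo["is_weekend"] = (demo["weekday"] >= 5).astype(int)

# Mean count per hour, with one column for weekdays (0) and one for weekends (1)
profile = demo.pivot_table(
    index="hour", columns="is_weekend",
    values="pedestrians_count", aggfunc="mean",
)
print(profile)
```

Large differences between the two columns at the same hour suggest that `is_weekend` and `hour` carry signal the model can use.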

## 🔁 Add Lag Features and Encode Categorical Data

To give the model context from previous time points, we add two lag features: the previous hour’s count and the count at the same hour on the previous day.  
We also convert the categorical columns `weather_condition` and `location_name` into numerical format using one-hot encoding.

In [None]:
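# Sort by location and time so that each lag stays within one sensor
bahnhof_data = bahnhof_data.sort_values(['location_name', 'timestamp'])

# Add lag features per location (a plain shift on time-sorted rows would mix sensors)
# Note: shift() counts rows, so this assumes one row per hour without gaps
counts_by_location = bahnhof_data.groupby('location_name')['pedestrians_count']
bahnhof_data['prev_hour_count'] = counts_by_location.shift(1)
bahnhof_data['prev_day_same_hour'] = counts_by_location.shift(24)

# Drop rows with NaN values caused by shifting
bahnhof_data.dropna(subset=['prev_hour_count', 'prev_day_same_hour'], inplace=True)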

# One-hot encode categorical features
data_model = pd.get_dummies(bahnhof_data, columns=['weather_condition', 'location_name'])

# Preview
data_model[['pedestrians_count', 'prev_hour_count', 'prev_day_same_hour']].head()
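
Because the data contains several locations, a lag computed with a plain `shift` on time-sorted rows can pick up a neighbouring sensor's count instead of the previous hour at the same sensor. The toy sketch below illustrates the difference; the values are made up and the location names `A` and `B` are hypothetical.

```python
import pandas as pd

# Toy data (made up): two hypothetical sensors, three consecutive hours each
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-06-02 08:00", "2023-06-02 09:00", "2023-06-02 10:00"] * 2
    ),
    "location_name": ["A"] * 3 + ["B"] * 3,
    "pedestrians_count": [100, 200, 300, 10, 20, 30],
})

# Global shift after sorting by time: neighbouring rows belong to different sensors
by_time = demo.sort_values("timestamp", kind="stable")
global_lag = by_time["pedestrians_count"].shift(1)

# Per-location shift: each row only sees its own sensor's previous hour
by_loc = demo.sort_values(["location_name", "timestamp"])
per_location_lag = by_loc.groupby("location_name")["pedestrians_count"].shift(1)

comparison = pd.DataFrame({"global": global_lag, "per_location": per_location_lag})
print(comparison.sort_index())
```

For sensor `A` at 09:00 (row 1), the global lag holds sensor `B`'s 08:00 count, while the per-location lag holds sensor `A`'s own 08:00 count.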

## 🤖 Model: Predict Pedestrian Count (with Log Target)

We use a Random Forest Regressor to predict foot traffic.  
The target variable is log-transformed to reduce the impact of extreme spikes.  
We also include lag features and encoded weather/location data.
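
As a small illustration of the log transform, `np.log1p` compresses large counts while staying defined at zero, and `np.expm1` reverses it. The counts below are made up.

```python
import numpy as np

# Made-up hourly counts with one extreme spike
counts = np.array([0, 50, 500, 20000])

log_counts = np.log1p(counts)     # log(1 + x), defined for x = 0
restored = np.expm1(log_counts)   # exp(x) - 1, the inverse of log1p

print(np.round(log_counts, 2))
print(np.round(restored))
```

On the log scale the spike is only a few units away from the typical values, so it pulls less on the squared-error criterion the forest uses when choosing splits.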

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

# Define input features
features = [
    'hour', 'weekday', 'is_weekend', 'month', 'temperature',
    'prev_hour_count', 'prev_day_same_hour'
] + [col for col in data_model.columns if col.startswith('weather_condition_')] \
  + [col for col in data_model.columns if col.startswith('location_name_')]

# Prepare inputs and log-transformed target
X = data_model[features]
y = np.log1p(data_model['pedestrians_count'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and revert log
y_pred_log = model.predict(X_test)
y_pred = np.expm1(y_pred_log)
y_test_actual = np.expm1(y_test)

# Evaluate
mae = mean_absolute_error(y_test_actual, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
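
The random split above shuffles hours, so a test row can sit directly between training rows whose lag features overlap with it. One alternative, not used in this notebook, is to hold out the most recent share of the data. The sketch below assumes a `timestamp` column; `chronological_split` is a hypothetical helper, and the MAE reported here may differ under such a split.

```python
import pandas as pd

def chronological_split(frame, time_col="timestamp", test_size=0.2):
    """Split a frame so that the most recent `test_size` share of rows is the test set."""
    ordered = frame.sort_values(time_col)
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy frame (made up): ten consecutive hours
demo = pd.DataFrame({
    "timestamp": pd.date_range("2023-06-02 00:00", periods=10, freq="h"),
    "pedestrians_count": range(10),
})

train, test = chronological_split(demo)
print(len(train), len(test))
```

Every training timestamp is earlier than every test timestamp, which mimics forecasting hours the model has not seen yet.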

## 📉 Prediction Results: Actual vs Predicted

We group the predictions into bins based on actual pedestrian counts  
and plot the average predicted value per bin.  
The closer the line is to the 45° reference line, the better the model’s accuracy.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Build DataFrame for comparison
results_df = pd.DataFrame({
    'actual': y_test_actual,
    'predicted': y_pred
})

# Bin actual values into ranges
results_df['bin'] = pd.cut(results_df['actual'], bins=10)
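# observed=True skips empty bins, which would otherwise appear as NaN rows
binned = results_df.groupby('bin', observed=True)[['actual', 'predicted']].mean()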

# Plot
plt.figure(figsize=(8, 5))
sns.lineplot(x=binned['actual'], y=binned['predicted'], marker='o', label='Model')
plt.plot(binned['actual'], binned['actual'], linestyle='--', color='gray', label='Perfect prediction (y = x)')
plt.title("Binned Actual vs Predicted Pedestrian Counts")
plt.xlabel("Average Actual Count (per bin)")
plt.ylabel("Average Predicted Count")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
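
The binning step can be checked on a tiny example without plotting. The sketch below uses made-up actual/predicted pairs and two bins instead of ten.

```python
import pandas as pd

# Made-up actual/predicted pairs
demo = pd.DataFrame({
    "actual":    [10, 20, 30, 110, 120, 130],
    "predicted": [15, 25, 20, 100, 125, 120],
})

# Two equal-width bins over the range of actual values
demo["bin"] = pd.cut(demo["actual"], bins=2)

# Average actual and predicted value within each bin
binned_demo = demo.groupby("bin", observed=True)[["actual", "predicted"]].mean()
print(binned_demo)
```

In the low bin the averages match, while in the high bin the average prediction falls below the average actual value, which is the kind of gap the plot makes visible.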

## 🧠 Final Results and Key Takeaways

✅ **Model**: Random Forest Regressor  
✅ **Features used**: time (hour, weekday, weekend flag, month), temperature, location, weather, and lagged traffic  
✅ **Target**: pedestrian count (log-transformed for better peak handling)

📉 **MAE** (Mean Absolute Error): ~175 pedestrians per hour  
📈 **Prediction accuracy**: close to real values across most traffic levels  
🔥 Big improvement compared to the earlier MAE of ~408 (without lag features and log target)
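
An MAE is easier to judge next to a simple reference. One common choice, not evaluated in this notebook, is a persistence baseline that predicts each hour with the previous hour's count. The sketch below uses made-up counts; `mean_absolute_error_np` is a hypothetical helper that mirrors the metric's definition.

```python
import numpy as np

def mean_absolute_error_np(actual, predicted):
    """Average absolute difference, in the same unit as the counts."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up counts for five consecutive hours
actual = [100, 300, 800, 600, 200]

# Persistence baseline: predict each hour with the previous hour's count
baseline_pred = actual[:-1]
baseline_mae = mean_absolute_error_np(actual[1:], baseline_pred)
print(baseline_mae)
```

On the real data, the same comparison could use `prev_hour_count` as the prediction for `pedestrians_count` on the test rows.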

---

### Next Steps / Ideas for Future Work

- Add support for holiday/event detection (e.g., public holidays, parades)
- Try a dedicated time series model like Prophet, or a gradient boosting model like XGBoost
- Build a Streamlit app to visualize and explore predictions interactively
- Generate automated daily forecasts or GenAI summaries
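
As a starting point for the holiday idea, a flag column could be derived from the timestamp. The sketch below is only an assumption about how this might look: `FIXED_HOLIDAYS`, `add_holiday_flag`, and `is_holiday` are hypothetical names, and the date list is a deliberately incomplete placeholder covering a few fixed-date public holidays.

```python
import pandas as pd

# Placeholder list: a few fixed-date public holidays as (month, day).
# Not complete: movable holidays and local events would need a proper calendar.
FIXED_HOLIDAYS = {(1, 1), (8, 1), (12, 25)}

def add_holiday_flag(frame, time_col="timestamp"):
    """Return a copy of `frame` with an `is_holiday` 0/1 column."""
    out = frame.copy()
    months = out[time_col].dt.month.tolist()
    days = out[time_col].dt.day.tolist()
    out["is_holiday"] = [int((m, d) in FIXED_HOLIDAYS) for m, d in zip(months, days)]
    return out

# Toy timestamps: 1 August (Swiss National Day) and the day after
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-08-01 10:00", "2023-08-02 10:00"])
})
flagged = add_holiday_flag(demo)
print(flagged)
```

The resulting column could be appended to the feature list alongside `is_weekend`.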

---

👉 This project shows how public open data can be turned into business insight using classic ML workflows.  
Great for urban planning, retail footfall forecasting, or smart city applications.