# LUKAS NextGen - PP5 Portfolio Project
## Predictive Analytics for Youth Engagement & Community Sustainability

**Student:** [Your Name]  
**Course:** Code Institute - Full Stack Development Diploma  
**Project:** PP5 - Predictive Analytics  
**Submission Date:** August 17, 2025  
**Project Theme:** www.wir-fuer-lukas.de - Innovative concepts for the Lukasgemeinde Karlsruhe

---

### 🎯 **Project Objectives**

This project develops a **predictive analytics solution** to optimize **youth engagement** and **financial sustainability** for the Lukasgemeinde Karlsruhe through data-driven insights and machine learning.

**Key Goals:**
1. **Predict youth engagement levels** and identify retention factors
2. **Forecast financial sustainability** for building maintenance & operations  
3. **Strengthen community bonding** through evidence-based programming
4. **Create actionable recommendations** for innovative funding concepts

### 📊 **Technical Stack**
- **Data Analysis:** Pandas, NumPy, Matplotlib, Seaborn
- **Machine Learning:** Scikit-learn, Optuna (hyperparameter tuning)
- **Visualization:** Plotly, Streamlit Dashboard
- **Development:** Jupyter Notebooks, Python 3.11+
- **Deployment:** Local Streamlit App with potential Heroku deployment

### 🗂️ **Data Sources Strategy**
The project combines **publicly available datasets** with synthetic or simulated data where no public source exists:
- **Karlsruhe Open Data Portal** (demographics, youth statistics)
- **Statistical Office Baden-Württemberg** (population trends)
- **Church attendance surveys** (synthetic/anonymized data)
- **Community event participation** (simulated realistic data)

---

## 1. Data Import and Setup

This section sets up the environment and imports the libraries needed for predictive analytics and data visualization.

In [None]:
# Core Data Science Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Additional Libraries
import warnings
import datetime as dt
from pathlib import Path
import sys

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8')
warnings.filterwarnings('ignore')

# Add project root to path for custom modules
sys.path.append('..')

print("📊 LUKAS NextGen - PP5 Development Environment Ready!")
print(f"🐍 Python version: {sys.version}")
print(f"📈 Pandas version: {pd.__version__}")
print(f"🧠 NumPy version: {np.__version__}")
print(f"📊 Last updated: {dt.datetime.now().strftime('%Y-%m-%d %H:%M')}")

In [None]:
# Set random seed for reproducible results
np.random.seed(42)

# Create realistic synthetic data for demonstration
# Note: In a real project, this would be actual data from Karlsruhe Open Data Portal

def generate_youth_engagement_data(n_samples=500):
    """Generate realistic youth engagement data for Lukasgemeinde Karlsruhe"""
    
    data = {
        'participant_id': range(1, n_samples + 1),
        'age': np.random.normal(19, 4, n_samples).astype(int).clip(13, 30),
        'gender': np.random.choice(['M', 'F', 'D'], n_samples, p=[0.48, 0.48, 0.04]),
        'district': np.random.choice([
            'Innenstadt-Ost', 'Innenstadt-West', 'Südstadt', 'Oststadt', 
            'Weststadt', 'Nordstadt', 'Mühlburg', 'Daxlanden'
        ], n_samples, p=[0.15, 0.12, 0.18, 0.15, 0.12, 0.08, 0.10, 0.10]),
        'education_level': np.random.choice([
            'Hauptschule', 'Realschule', 'Gymnasium', 'Studium', 'Ausbildung'
        ], n_samples, p=[0.15, 0.25, 0.30, 0.20, 0.10]),
        'family_church_background': np.random.choice([0, 1], n_samples, p=[0.4, 0.6]),
        'monthly_events_attended': np.random.poisson(2.5, n_samples),
        'volunteer_hours_per_month': np.random.exponential(3, n_samples).astype(int),
        'digital_engagement_score': np.random.normal(6.5, 2.0, n_samples).clip(1, 10),
        'peer_influence_score': np.random.normal(7.2, 1.8, n_samples).clip(1, 10),
        'event_satisfaction_avg': np.random.normal(7.8, 1.2, n_samples).clip(1, 10)
    }
    
    # Create target variable: high_engagement (binary), 1 for the top 40% by engagement_score
    engagement_score = (
        0.3 * data['monthly_events_attended'] +
        0.2 * data['volunteer_hours_per_month'] +
        0.2 * data['digital_engagement_score'] +
        0.15 * data['peer_influence_score'] +
        0.15 * data['event_satisfaction_avg'] +
        np.random.normal(0, 2, n_samples)  # Add noise
    )
    
    data['engagement_score'] = engagement_score
    data['high_engagement'] = (engagement_score > np.percentile(engagement_score, 60)).astype(int)
    
    return pd.DataFrame(data)

def generate_financial_data(n_months=36):
    """Generate realistic financial sustainability data"""
    
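    # 'MS' (month start) puts each row on the 1st of the month; the 'M' alias is deprecated since pandas 2.2
    dates = pd.date_range('2022-01-01', periods=n_months, freq='MS')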
    
    # Simulate seasonal patterns and trends
    base_donations = 8500
    seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * np.arange(n_months) / 12)
    trend_factor = 1 + 0.02 * np.arange(n_months) / 12  # Slight growth
    
    data = {
        'month': dates,
        'total_donations': (base_donations * seasonal_factor * trend_factor + 
                          np.random.normal(0, 1000, n_months)).astype(int),
        'youth_donations': np.random.normal(450, 150, n_months).astype(int),
        'building_maintenance_costs': np.random.normal(3200, 800, n_months).astype(int),
        'event_costs': np.random.normal(1200, 400, n_months).astype(int),
        'youth_program_costs': np.random.normal(800, 200, n_months).astype(int),
        'active_youth_members': 45 + np.random.poisson(5, n_months),
        'total_members': 280 + np.random.poisson(15, n_months),
        'youth_events_per_month': np.random.poisson(4, n_months)
    }
    
    df = pd.DataFrame(data)
    df['net_result'] = df['total_donations'] - df['building_maintenance_costs'] - df['event_costs'] - df['youth_program_costs']
    df['youth_engagement_rate'] = df['active_youth_members'] / df['total_members']
    
    return df

# Generate demonstration datasets
print("🔄 Generating realistic demonstration data...")
youth_data = generate_youth_engagement_data(500)
financial_data = generate_financial_data(36)

print(f"✅ Youth engagement dataset: {youth_data.shape}")
print(f"✅ Financial sustainability dataset: {financial_data.shape}")
print("\n📝 Note: In production, this would use real data from:")
print("   • Karlsruhe Open Data Portal")
print("   • Statistical Office Baden-Württemberg") 
print("   • Anonymized church attendance records")
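
Before any modelling, it can help to confirm that the generated table matches the assumptions built into the generator. The sketch below is one possible check and is not part of the original notebook: `check_youth_data` is a hypothetical helper, and the small demo frame only mimics a few of the columns above so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def check_youth_data(df: pd.DataFrame) -> dict:
    """Run basic range, uniqueness, and class-balance checks on a youth engagement table."""
    problems = []
    if not df["age"].between(13, 30).all():
        problems.append("age outside 13-30")
    for col in ("digital_engagement_score", "peer_influence_score", "event_satisfaction_avg"):
        # Only check score columns that are actually present.
        if col in df and not df[col].between(1, 10).all():
            problems.append(f"{col} outside 1-10")
    if df["participant_id"].duplicated().any():
        problems.append("duplicate participant_id")
    return {
        "n_rows": len(df),
        "positive_rate": float(df["high_engagement"].mean()),
        "problems": problems,
    }


# Tiny illustrative frame; in the notebook this would be `youth_data`.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "participant_id": range(1, 11),
    "age": rng.integers(13, 31, size=10),  # upper bound is exclusive
    "digital_engagement_score": rng.uniform(1, 10, size=10),
    "high_engagement": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
})
report = check_youth_data(demo)
print(report)
```

Because the target is defined by the 60th percentile of `engagement_score`, the positive rate on the full dataset should come out close to 0.40; a very different value would point to a change in the generator.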

---

## 📋 Development Status (August 12, 2025)

### ✅ **Completed Today:**
- [x] **Project structure** set up and functional
- [x] **Streamlit app** running without errors (http://localhost:8513)
- [x] **Import architecture** fixed and streamlined
- [x] **Jupyter notebook** started with comprehensive plan
- [x] **Realistic data generation** functions implemented
- [x] **Development environment** configured

### 🔄 **Next Steps (August 13-17):**

**Day 2 - EDA & Data Analysis:**
- [ ] Complete Exploratory Data Analysis
- [ ] Feature engineering and correlation analysis
- [ ] Data visualization and insights discovery
- [ ] Data preprocessing pipeline

**Day 3 - Machine Learning:**
- [ ] Youth engagement prediction model
- [ ] Financial sustainability forecasting
- [ ] Model evaluation and hyperparameter tuning
- [ ] Model persistence and loading
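
A plain baseline is a reasonable starting point for the engagement model. The sketch below is an assumption about how Day 3 could begin, not the project's final model: it trains a logistic regression inside a scikit-learn `Pipeline` and leaves out `engagement_score`, because `high_engagement` is derived directly from that column and keeping it as a feature would leak the target. The demo data is generated inline with fewer columns and less noise than the notebook's generator, purely to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for `youth_data` (fewer columns, same idea).
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "monthly_events_attended": rng.poisson(2.5, n),
    "volunteer_hours_per_month": rng.exponential(3, n).astype(int),
    "digital_engagement_score": rng.normal(6.5, 2.0, n).clip(1, 10),
})
score = (
    0.3 * demo["monthly_events_attended"]
    + 0.2 * demo["volunteer_hours_per_month"]
    + 0.2 * demo["digital_engagement_score"]
    + rng.normal(0, 0.5, n)
)
demo["engagement_score"] = score
demo["high_engagement"] = (score > np.percentile(score, 60)).astype(int)

# Drop the target and the score it was derived from to avoid leakage.
X = demo.drop(columns=["high_engagement", "engagement_score"])
y = demo["high_engagement"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is fitted on the training split only.
baseline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline test accuracy: {test_accuracy:.2f}")
```

Since about 60% of the rows belong to the negative class, any later model should be compared against both this baseline and the 0.60 accuracy of always predicting "not engaged".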

**Day 4 - Streamlit Dashboard:**
- [ ] Complete all 7 app pages with real functionality
- [ ] Interactive visualizations integration
- [ ] Model prediction interface
- [ ] User experience optimization

**Day 5 - Portfolio Finalization:**
- [ ] Documentation and README
- [ ] Code testing and validation
- [ ] Deployment preparation
- [ ] Final submission ready

### 🎯 **Key Deliverables:**
1. **Functional ML-powered Streamlit app**
2. **Comprehensive Jupyter notebook with analysis**
3. **Deployment-ready codebase**
4. **Portfolio documentation**

---

### 💡 **Technical Architecture Overview**

```
LUKAS NextGen PP5/
├── app/                    # Streamlit dashboard
│   ├── app.py             # Main application
│   └── app_pages/         # Individual page modules
├── src/                   # Core ML pipeline
│   ├── data_loaders.py    # Data loading utilities
│   ├── model.py           # ML model training
│   ├── config.py          # Project configuration
│   └── recommendations.py # Recommendation engine
├── notebooks/             # Analysis & development
│   └── PP5_LUKAS_NextGen_Development.ipynb
├── models/                # Trained model storage
├── data/                  # Dataset storage
└── tests/                 # Unit tests
```
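
The `models/` directory in the layout above is intended for trained models. One common way to store scikit-learn estimators is `joblib`; the sketch below assumes that approach, which the project has not confirmed, and uses a temporary directory and the hypothetical file name `engagement_baseline.joblib` instead of the real `models/` folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a tiny illustrative model on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Save the model to disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "engagement_baseline.joblib"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(f"Restored model reproduces predictions: {same_predictions}")
```

`joblib` files are pickle-based, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.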

**Status: Ready for Day 2 development! 🚀**