# Q5: Pattern Analysis

**Phase 6:** Pattern Analysis & Advanced Visualization  
**Points:** 6

**Focus:** Identify trends over time, analyze seasonal patterns, create correlation analysis.

**Lecture Reference:** Lecture 11, Notebook 3 ([`11/demo/03_pattern_analysis_modeling_prep.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/03_pattern_analysis_modeling_prep.ipynb)), Phase 6. Also see Lecture 08 (groupby) and Lecture 07 (visualization).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load feature-engineered data from Q4
df = pd.read_csv('output/q4_features.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q4_features.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with features")

Loaded 196,479 records with features


---

## Objective

Identify trends over time, analyze seasonal patterns, and create correlation analysis.

**Time Series Note:** Time series data has temporal patterns (trends, seasonality, cycles). Use time-based aggregations and visualizations to identify them. See **Lecture 09** for time series decomposition and pattern analysis. Use pandas `resample()` to aggregate by time period (e.g., `resample('ME')` for monthly in pandas 2.2+, where older versions use `'M'`; `resample('D')` for daily), and use `groupby()` with temporal features (hour, day_of_week, month) to identify recurring patterns.
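
As a rough illustration of the two aggregation styles named in the note, the sketch below builds a made-up hourly temperature series (not the assignment data) and summarizes it both ways. It uses the `'MS'` (month start) alias only because it behaves the same across pandas versions; the variable names `demo`, `monthly`, and `by_hour` are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures for one non-leap year (made-up data, not the assignment dataset)
idx = pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(365 * 24), unit="h")
day = idx.dayofyear.to_numpy()
hour = idx.hour.to_numpy()
seasonal = -12 * np.cos(2 * np.pi * (day - 15) / 365)  # coldest mid-January, warmest mid-July
diurnal = 4 * np.sin(2 * np.pi * (hour - 9) / 24)      # peak at 15:00, minimum at 03:00
demo = pd.DataFrame({"Air Temperature": 10 + seasonal + diurnal}, index=idx)

# Trend view: one mean per calendar month, labelled by month start
monthly = demo["Air Temperature"].resample("MS").mean()

# Seasonal view: average over all days for each hour of the day
by_hour = demo.groupby(demo.index.hour)["Air Temperature"].mean()

print(monthly.round(1))
print("Warmest hour:", by_hour.idxmax(), "| coolest hour:", by_hour.idxmin())
```

`resample()` keeps the calendar order, so it suits trend plots, while `groupby()` on an extracted feature collapses all years and days into one cycle, which suits seasonal profiles.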

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q5_correlations.csv`
**Format:** CSV file
**Content:** Correlation matrix (can be subset of key variables)
**Requirements:**
- Square matrix with variable names as both index and columns
- Values are correlation coefficients (between -1 and 1)
- Can be subset of key variables (e.g., top 10 most important variables)
- **Include index/column names** when saving: `corr_matrix.to_csv('output/q5_correlations.csv')`

**Example format:**
```csv
,Air Temperature,Water Temperature,Wind Speed,Humidity
Air Temperature,1.0,0.847,-0.234,-0.156
Water Temperature,0.847,1.0,0.123,0.089
Wind Speed,-0.234,0.123,1.0,0.456
Humidity,-0.156,0.089,0.456,1.0
```
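
One way to sanity-check the saved matrix against the requirements above is to write it, read it back, and test its shape, labels, and value range. The sketch below does this on random placeholder data and an in-memory buffer instead of the real file; `reloaded` and the check names are illustrative. It also shows that reading without `index_col=0` turns the label column into a column called `Unnamed: 0`.

```python
import io

import numpy as np
import pandas as pd

# Placeholder data: three random columns standing in for real measurements
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["Air Temperature", "Humidity", "Wind Speed"],
)
corr_matrix = demo.corr()

# Write to an in-memory buffer (a stand-in for output/q5_correlations.csv)
buffer = io.StringIO()
corr_matrix.to_csv(buffer)

# Read back with the first column as the index to recover the square matrix
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

is_square = reloaded.shape[0] == reloaded.shape[1]
labels_match = list(reloaded.index) == list(reloaded.columns)
in_range = bool((np.abs(reloaded.to_numpy()) <= 1 + 1e-9).all())
print(is_square, labels_match, in_range)

# Without index_col=0 the row labels become an ordinary column
buffer.seek(0)
first_column = pd.read_csv(buffer).columns[0]
print(first_column)
```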

### 2. `output/q5_patterns.png`
**Format:** PNG image file
**Content:** Advanced visualizations showing trends/seasonality
**Required visualizations (at least 2 of these):**
1. **Trend over time:** Line plot showing variable(s) over time (e.g., monthly averages)
2. **Seasonal pattern:** Bar plot or line plot showing patterns by month, day of week, or hour
3. **Correlation heatmap:** Heatmap of correlation matrix
4. **Multi-panel plot:** Multiple subplots showing different patterns

**Requirements:**
- Clear axis labels (xlabel, ylabel)
- Title for each subplot
- Overall figure title (optional but recommended)
- Legend if multiple series shown
- Saved as PNG with sufficient resolution (dpi=150 or higher)
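
The formatting requirements above can be sketched with a small two-panel figure. The values are made up, the file goes to a temporary directory rather than `output/`, and the `Agg` backend line is only there so the sketch runs without a display (it can be dropped inside a notebook).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up monthly and hourly averages (illustrative only)
months = np.arange(1, 13)
monthly_temp = 10 - 12 * np.cos(2 * np.pi * (months - 1) / 12)
hours = np.arange(24)
hourly_temp = 15 + 4 * np.sin(2 * np.pi * (hours - 9) / 24)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Panel 1: seasonal pattern by month, with labels, title, and legend
axes[0].plot(months, monthly_temp, marker="o", label="Air Temperature")
axes[0].set_title("Average by Month")
axes[0].set_xlabel("Month")
axes[0].set_ylabel("Temperature (°C)")
axes[0].legend()

# Panel 2: diurnal pattern by hour
axes[1].plot(hours, hourly_temp, color="tab:orange", label="Air Temperature")
axes[1].set_title("Average by Hour of Day")
axes[1].set_xlabel("Hour of Day")
axes[1].set_ylabel("Temperature (°C)")
axes[1].legend()

fig.suptitle("Example Pattern Figure (synthetic data)")
fig.tight_layout()

# Save at the required resolution, then release the figure
out_path = os.path.join(tempfile.mkdtemp(), "q5_patterns_demo.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
print("Saved", out_path)
```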

### 3. `output/q5_trend_summary.txt`
**Format:** Plain text file
**Content:** Brief text summary of key patterns identified
**Required information:**
- Temporal trends (increasing, decreasing, stable)
- Seasonal patterns (daily, weekly, monthly cycles)
- Key correlations (mention 2-3 strongest correlations)

**Example format:**
```
KEY PATTERNS IDENTIFIED
======================

TEMPORAL TRENDS:
- Air and water temperatures show clear seasonal patterns
- Higher temperatures in summer months (June-August)
- Lower temperatures in winter months (December-February)
- Monthly air temp range: 4.2°C to 25.8°C

DAILY PATTERNS:
- Temperature shows diurnal cycle (warmer during day, cooler at night)
- Peak air temp typically at hour 14-15 (2-3 PM)
- Minimum air temp typically at hour 5-6 (5-6 AM)

CORRELATIONS:
- Air Temp vs Water Temp: 0.847 (strong positive correlation)
- Air Temp vs Wind Speed: -0.234 (weak negative correlation)
- Wind Speed vs Wave Height: 0.612 (moderate positive correlation)
```

---

## Requirements Checklist

- [ ] Trends over time identified (increasing, decreasing, stable)
- [ ] Seasonal patterns analyzed (daily, weekly, monthly cycles)
- [ ] Correlation analysis completed
- [ ] Advanced visualizations created (multi-panel plots, grouped visualizations)
- [ ] Key patterns documented
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Identify trends** - Use `.resample()` to aggregate by time period and visualize long-term patterns
2. **Analyze seasonal patterns** - Use `.groupby()` with temporal features (hour, day_of_week, month)
3. **Create correlation analysis** - Compute correlation matrix for numeric columns
4. **Create visualizations** - Multi-panel plot showing trends, seasonal patterns, and correlations
5. **Document patterns** - Summarize key findings in text file

---

## Decision Points

- **Trend identification:** Is there a long-term trend? Is it increasing, decreasing, or stable? Use time series plots to visualize.
- **Seasonal patterns:** Are there daily patterns? Weekly? Monthly? Use aggregations and visualizations to identify.
- **Correlation analysis:** Which variables are correlated? Use correlation matrix and heatmaps. Focus on relationships that might be useful for modeling.
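
For the trend decision, one possible way to back up a visual impression is to fit a straight line through the monthly means and look at the sign of the slope. The helper `classify_trend` and its `tolerance` cut-off below are hypothetical, and the three series are made up; this is a sketch of the idea, not a required method.

```python
import numpy as np
import pandas as pd


def classify_trend(series, tolerance=0.01):
    """Return (label, slope) from a least-squares line through the series.

    `tolerance` is a hypothetical cut-off in units per time step; pick one
    that makes sense for the variable being analyzed.
    """
    values = series.dropna().to_numpy(dtype=float)
    steps = np.arange(len(values))
    slope = float(np.polyfit(steps, values, 1)[0])  # first coefficient is the slope
    if slope > tolerance:
        return "increasing", slope
    if slope < -tolerance:
        return "decreasing", slope
    return "stable", slope


# Made-up monthly means over three years
months = pd.date_range("2021-01-01", periods=36, freq="MS")
t = np.arange(36)
rising = pd.Series(2.0 + 0.5 * t, index=months)
falling = pd.Series(20.0 - 0.5 * t, index=months)
flat = pd.Series(10.0 + 0.1 * (-1.0) ** t, index=months)

for name, s in [("rising", rising), ("falling", falling), ("flat", flat)]:
    label, slope = classify_trend(s)
    print(f"{name}: {label} (slope {slope:+.3f} per month)")
```

Strong seasonal swings can bias a fitted slope, so comparing annual means or the same month across years may give a steadier answer on real data.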

---

## Checkpoint

After Q5, you should have:
- [ ] Trends identified
- [ ] Seasonal patterns analyzed
- [ ] Correlations calculated
- [ ] Pattern visualizations created
- [ ] All 3 artifacts saved: `q5_correlations.csv`, `q5_patterns.png`, `q5_trend_summary.txt`

---

**Next:** Continue to `q6_modeling_preparation.md` for Modeling Preparation.


In [7]:
# Key variables for the correlation matrix and pattern analysis
key_vars = [
    "Air Temperature",
    "Humidity",
    "Wind Speed",
    "Barometric Pressure",
    "Solar Radiation",
    "Total Rain"
]



In [8]:
corr_matrix = df[key_vars].corr()
corr_matrix.to_csv("output/q5_correlations.csv")

print("✅ output/q5_correlations.csv saved")
display(corr_matrix)


✅ output/q5_correlations.csv saved


,Air Temperature,Humidity,Wind Speed,Barometric Pressure,Solar Radiation,Total Rain
Air Temperature,1.0,0.008766,-0.22831,-0.243635,0.234073,0.422422
Humidity,0.008766,1.0,0.017027,-0.189546,-0.155521,0.100964
Wind Speed,-0.22831,0.017027,1.0,-0.081462,0.036858,-0.098539
Barometric Pressure,-0.243635,-0.189546,-0.081462,1.0,0.064277,0.008179
Solar Radiation,0.234073,-0.155521,0.036858,0.064277,1.0,0.053473
Total Rain,0.422422,0.100964,-0.098539,0.008179,0.053473,1.0


In [9]:
# =========================
# MULTI-VARIABLE MONTHLY TRENDS
# =========================
monthly_trends = df[key_vars].resample("ME").mean()

# =========================
# MULTI-VARIABLE HOURLY SEASONAL PATTERNS
# =========================
df["hour"] = df.index.hour
hourly_patterns = df.groupby("hour")[key_vars].mean()

print("✅ Multi-variable monthly & hourly patterns computed")



✅ Multi-variable monthly & hourly patterns computed


In [10]:
fig, axes = plt.subplots(1, 3, figsize=(22, 7))

# =========================
# Plot 1: MULTI-VARIABLE MONTHLY TRENDS
# =========================
for col in ["Air Temperature", "Humidity", "Wind Speed"]:
    axes[0].plot(monthly_trends[col], label=col)

axes[0].set_title("Monthly Trends (Multiple Variables)")
axes[0].set_xlabel("Date")
axes[0].set_ylabel("Value")
axes[0].legend()

# =========================
# Plot 2: MULTI-VARIABLE DIURNAL SEASONALITY
# =========================
for col in ["Air Temperature", "Humidity", "Wind Speed"]:
    axes[1].plot(hourly_patterns[col], label=col)

axes[1].set_title("Hourly Seasonal Patterns (Diurnal Cycles)")
axes[1].set_xlabel("Hour of Day")
axes[1].set_ylabel("Average Value")
axes[1].legend()

# =========================
# Plot 3: CORRELATION HEATMAP (ALL KEY VARIABLES)
# =========================
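# Fix the color scale to [-1, 1] so zero correlation maps to the neutral midpoint
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, center=0, ax=axes[2])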
axes[2].set_title("Correlation Heatmap (Key Variables)")

# =========================
# Overall Title & Save
# =========================
fig.suptitle("Chicago Beach Weather Patterns: Multi-Variable Trends, Seasonality & Correlations", fontsize=14)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.savefig("output/q5_patterns.png", dpi=150)
plt.close()

print("✅ output/q5_patterns.png saved")


✅ output/q5_patterns.png saved


In [12]:
# =========================
# LOWEST MONTH IDENTIFICATION
# =========================

# Group by month number (1–12) and compute mean
monthly_groups = df.groupby(df.index.month)[["Air Temperature", "Humidity"]].mean()

# Find lowest month for Air Temperature
lowest_air_temp_month = monthly_groups["Air Temperature"].idxmin()
lowest_air_temp_value = monthly_groups["Air Temperature"].min()

# Find lowest month for Humidity
lowest_humidity_month = monthly_groups["Humidity"].idxmin()
lowest_humidity_value = monthly_groups["Humidity"].min()

print("✅ Lowest months identified")


✅ Lowest months identified
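
The month numbers found above can be made more readable in the summary by mapping them to names. The sketch below uses the standard library's `calendar.month_name` on made-up monthly means; `monthly_means` and `line` are placeholder names, and the month names follow the active locale (English by default).

```python
import calendar

import pandas as pd

# Made-up monthly mean temperatures indexed by month number 1-12
monthly_means = pd.Series(
    [-2.1, -0.5, 4.0, 9.8, 15.2, 20.7, 23.4, 22.9, 18.6, 12.0, 5.3, 0.2],
    index=range(1, 13),
    name="Air Temperature",
)

lowest_month = int(monthly_means.idxmin())       # month number of the minimum
lowest_name = calendar.month_name[lowest_month]  # index 1 maps to the first month

line = (
    f"- Lowest average Air Temperature occurs in {lowest_name} "
    f"(month {lowest_month}) with mean value {monthly_means.min():.2f}"
)
print(line)
```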


In [13]:
# =========================
# STRONG CORRELATIONS (|r| > 0.20)
# =========================

corr_pairs = corr_matrix.unstack()

# Remove self-correlations
corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) != corr_pairs.index.get_level_values(1)]

# Keep only correlations with magnitude > 0.20
strong_corrs = corr_pairs[abs(corr_pairs) > 0.20]

# Drop duplicate mirrored pairs
strong_corrs = strong_corrs.groupby(
    strong_corrs.index.map(frozenset)
).first()

print("✅ Strong correlations filtered (|r| > 0.20)")
display(strong_corrs)


✅ Strong correlations filtered (|r| > 0.20)


(Air Temperature, Wind Speed)            -0.228310
(Barometric Pressure, Air Temperature)   -0.243635
(Solar Radiation, Air Temperature)        0.234073
(Total Rain, Air Temperature)             0.422422
dtype: float64
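
An alternative way to drop self-correlations and mirrored pairs in one step is to keep only the upper triangle of the matrix before stacking. The sketch below uses a small hand-written matrix rather than the real `corr_matrix`, and the names `mask`, `pairs`, and `strong` are illustrative.

```python
import numpy as np
import pandas as pd

names = ["Air Temperature", "Water Temperature", "Wind Speed"]
corr = pd.DataFrame(
    [[1.0, 0.85, -0.23],
     [0.85, 1.0, 0.12],
     [-0.23, 0.12, 1.0]],
    index=names,
    columns=names,
)  # hand-written example values

# True strictly above the diagonal: each unordered pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; dropna() removes them regardless of pandas version
pairs = corr.where(mask).stack().dropna()

# Keep pairs above the magnitude threshold, largest magnitude first
strong = pairs[pairs.abs() > 0.20].sort_values(key=np.abs, ascending=False)
print(strong)
```

Each remaining index entry is an ordered `(row, column)` tuple, so the pair unpacks in a predictable order when writing the summary.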

In [14]:
summary_lines = []

summary_lines.append("KEY PATTERNS IDENTIFIED")
summary_lines.append("======================\n")

# =========================
# TEMPORAL TRENDS (MULTI-VARIABLE)
# =========================
summary_lines.append("TEMPORAL TRENDS (MONTHLY):")

for col in key_vars:
    min_val = monthly_trends[col].min()
    max_val = monthly_trends[col].max()
    summary_lines.append(
        f"- {col}: Monthly average ranges from {min_val:.2f} to {max_val:.2f}"
    )

summary_lines.append("")

# =========================
# LOWEST MONTHS HIGHLIGHTED
# =========================
summary_lines.append("LOWEST MONTHS IDENTIFIED:")

summary_lines.append(
    f"- Lowest average Air Temperature occurs in month {lowest_air_temp_month} "
    f"with mean value {lowest_air_temp_value:.2f}"
)

summary_lines.append(
    f"- Lowest average Humidity occurs in month {lowest_humidity_month} "
    f"with mean value {lowest_humidity_value:.2f}"
)

summary_lines.append("")

# =========================
# DIURNAL (HOURLY) PATTERNS
# =========================
summary_lines.append("DAILY (DIURNAL) PATTERNS:")

for col in ["Air Temperature", "Humidity", "Wind Speed"]:
    peak_hr = hourly_patterns[col].idxmax()
    low_hr = hourly_patterns[col].idxmin()
    summary_lines.append(
        f"- {col}: Peak at hour {peak_hr}, minimum at hour {low_hr}"
    )

summary_lines.append("")

# =========================
# STRONG CORRELATIONS ONLY (|r| > 0.20)
# =========================
summary_lines.append("STRONG CORRELATIONS (|r| > 0.20):")

for (var1, var2), value in strong_corrs.items():
    summary_lines.append(f"- {var1} vs {var2}: {value:.3f}")

# =========================
# SAVE SUMMARY FILE
# =========================
with open("output/q5_trend_summary.txt", "w") as f:
    f.write("\n".join(summary_lines))

print("✅ output/q5_trend_summary.txt updated with lowest months & strong correlations")


✅ output/q5_trend_summary.txt updated with lowest months & strong correlations
