# Q1: Setup & Exploration

**Phase 1-2:** Project Setup, Data Exploration  
**Points:** 6

**Focus:** Load data, perform initial inspection, identify data quality issues.

**Lecture Reference:** Lecture 11, Notebook 1 ([`11/demo/01_setup_exploration_cleaning.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/01_setup_exploration_cleaning.ipynb)), Phases 1-2. Also see Lecture 04 (pandas I/O) and Lecture 07 (visualization).

---

## Setup

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import os

# Create output directory
os.makedirs("output", exist_ok=True)

# Set plotting style
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")
%matplotlib inline

# Display options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

---

## Objective

Load the Chicago Beach Weather Sensors dataset, perform initial inspection, and identify data quality issues.

**Note:** The datetime column in this dataset is named `Measurement Timestamp`.

**Time Series Note:** Unlike the lecture's NYC Taxi data (event-based), this dataset is **time-series data** with continuous sensor readings. Every row carries a timestamp, so after you parse `Measurement Timestamp` and set it as the index, you'll work with datetime-indexed DataFrames throughout. See **Lecture 09** for time series operations. For time series plots, consider using pandas `resample()` to aggregate the data (e.g., daily averages) so long-term trends are easier to see.
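
As a rough illustration of the `resample()` suggestion, the sketch below aggregates hourly readings to daily means. The `toy` frame is hypothetical stand-in data, not the real dataset, and `Air Temperature` is used only because it appears in the example output format.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data/beach_sensors.csv:
# hourly readings over three days, with one reading missing.
toy = pd.DataFrame(
    {
        "Measurement Timestamp": pd.date_range("2022-01-01", periods=72, freq="h"),
        "Air Temperature": np.tile(np.arange(24, dtype=float), 3),
    }
)
toy.loc[5, "Air Temperature"] = np.nan

# Set the parsed timestamp as a sorted index so resample() can bin by time.
toy_ts = toy.set_index("Measurement Timestamp").sort_index()

# Aggregate hourly readings to one mean per calendar day (NaN values are skipped).
daily_mean = toy_ts["Air Temperature"].resample("D").mean()
print(daily_mean)
```

Plotting `daily_mean` instead of the raw readings usually gives a smoother line; other frequencies such as `"W"` (weekly) work the same way.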

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q1_data_info.txt`
**Format:** Plain text file
**Content:** Dataset information including:
- Dataset shape (rows × columns)
- Column names (one per line or comma-separated)
- Data types for each column
- Date range (start date and end date) - **REQUIRED if temporal data**
- Missing value counts for each column (column name: count)

**Example format:**
```
Dataset Shape: 50000 rows × 10 columns

Column Names:
- Measurement Timestamp
- Beach
- Water Temperature
- Air Temperature
...

Data Types:
- Measurement Timestamp: datetime64[ns]
- Beach: object
- Water Temperature: float64
...

Date Range:
Start: 2022-01-01 00:00:00
End: 2027-09-15 07:00:00

Missing Values:
- Water Temperature: 2500 (5.0%)
- Air Temperature: 1500 (3.0%)
...
```

### 2. `output/q1_exploration.csv`
**Format:** CSV file
**Required Columns (exact names):** `column_name`, `mean`, `std`, `min`, `max`, `missing_count`
**Content:** One row per numeric column in the dataset
- `column_name`: Name of the numeric column
- `mean`: Mean value (float)
- `std`: Standard deviation (float)
- `min`: Minimum value (float)
- `max`: Maximum value (float)
- `missing_count`: Number of missing values (integer)

**Example:**
```csv
column_name,mean,std,min,max,missing_count
Water Temperature,15.23,5.12,0.5,28.7,2500
Air Temperature,18.45,8.23,-5.2,35.8,1500
Wind Speed,6.78,4.56,0.1,25.3,0
```
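
One possible way to build a table with these columns is a single `agg()` call followed by a transpose. This is only a sketch on a hypothetical `toy` frame; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real one comes from data/beach_sensors.csv.
toy = pd.DataFrame(
    {
        "Beach": ["A", "B", "A", "B"],
        "Air Temperature": [10.0, 12.0, np.nan, 14.0],
        "Humidity": [60.0, 70.0, 80.0, 90.0],
    }
)

# Keep only numeric columns; "Beach" is dropped here.
numeric = toy.select_dtypes(include=np.number)

# agg() yields one row per statistic; transpose so each column becomes a row.
summary = numeric.agg(["mean", "std", "min", "max"]).T
summary["missing_count"] = numeric.isna().sum().astype(int)

# Move the column names out of the index into a "column_name" column.
summary = summary.rename_axis("column_name").reset_index()
print(summary)
```

Writing `summary.to_csv(path, index=False)` would then produce a file with the header shown in the example above.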

### 3. `output/q1_visualizations.png`
**Format:** PNG image file
**Content:** At least 2 plots in a single figure (use subplots)
**Required plots:**
1. **Distribution plot:** Histogram or density plot of at least one numeric variable
2. **Time series plot:** Line plot showing a numeric variable over time (if temporal data)

**Requirements:**
- Clear axis labels (xlabel, ylabel)
- Title for each subplot
- Overall figure title (optional but recommended)
- Legend if multiple series shown
- Saved as PNG with sufficient resolution (dpi=150 or higher)

---

## Requirements Checklist

- [ ] Data loaded successfully from `data/beach_sensors.csv`
- [ ] Initial inspection completed (shape, info, head, describe)
- [ ] Missing values identified and counted
- [ ] Basic visualizations created (at least 2 plots: distribution + time series)
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Load and inspect the dataset** - Use standard pandas I/O and inspection methods
2. **Parse datetime** - Identify and convert datetime column(s)
3. **Identify missing values** - Count and calculate percentages per column
4. **Create visualizations** - Distribution plot + time series plot (use subplots)
5. **Save artifacts** - Write to the three required output files
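
For step 2, it can help to check how many values failed to parse, since `errors="coerce"` silently turns them into `NaT`. The sketch below uses hypothetical strings, and the `format` string is an assumption about how the timestamps are written; inspect the raw column before relying on it.

```python
import pandas as pd

# Hypothetical raw values; the third entry is deliberately malformed.
raw = pd.Series(
    ["01/15/2022 01:00:00 PM", "01/15/2022 02:00:00 PM", "not a date", None],
    name="Measurement Timestamp",
)

# The format string is an assumption for this toy data, not a documented fact
# about the real file.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Values that were present in the raw data but became NaT after parsing.
n_failed = int((parsed.isna() & raw.notna()).sum())
print(f"Unparseable timestamps: {n_failed}")
```

A nonzero count is worth recording as a data quality issue for Q2.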

---

## Decision Points

- **Visualization choices:** What types of plots best show your data? See Lecture 11 Notebook 1 for examples.
- **Data quality assessment:** What issues do you see? Missing data patterns? Outliers? Inconsistent formats? Document these for Q2.
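
A few quick checks can make the data quality assessment concrete. The sketch below counts duplicate rows, negative values, and IQR-based outliers on a hypothetical `toy` frame. The 1.5 × IQR rule is one common convention, not a course requirement.

```python
import pandas as pd

# Hypothetical toy data with one duplicate row, one negative value,
# and one extreme value.
toy = pd.DataFrame(
    {
        "Measurement Timestamp": pd.to_datetime(
            [
                "2022-01-01 00:00",
                "2022-01-01 01:00",
                "2022-01-01 01:00",
                "2022-01-01 02:00",
                "2022-01-01 03:00",
                "2022-01-01 04:00",
            ]
        ),
        "Interval Rain": [0.0, 0.2, 0.2, -0.9, 0.1, 50.0],
    }
)
col = "Interval Rain"

# Rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Rainfall cannot be negative, so negative values are suspect.
n_negative = int((toy[col] < 0).sum())

# Flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = toy[col].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy[col] < q1 - 1.5 * iqr) | (toy[col] > q3 + 1.5 * iqr)
n_outliers = int(outlier_mask.sum())

print(f"duplicates={n_duplicates}, negative={n_negative}, outliers={n_outliers}")
```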

---

## Checkpoint

After Q1, you should have:
- [ ] Data loaded successfully
- [ ] Basic statistics calculated
- [ ] Initial visualizations created (2+ plots)
- [ ] Data quality issues identified
- [ ] All 3 artifacts saved: `q1_data_info.txt`, `q1_exploration.csv`, `q1_visualizations.png`
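
The last checkbox can be verified with a small script. `check_artifacts` below is a hypothetical helper, not part of the course materials; it only checks that the three files exist, are non-empty, and that the CSV header contains the required column names.

```python
import csv
import os

REQUIRED_FILES = ["q1_data_info.txt", "q1_exploration.csv", "q1_visualizations.png"]
REQUIRED_COLUMNS = ["column_name", "mean", "std", "min", "max", "missing_count"]


def check_artifacts(output_dir="output"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name in REQUIRED_FILES:
        path = os.path.join(output_dir, name)
        if not os.path.isfile(path):
            problems.append(f"missing file: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"empty file: {path}")

    # Read only the header row of the CSV and compare it to the required names.
    csv_path = os.path.join(output_dir, "q1_exploration.csv")
    if os.path.isfile(csv_path):
        with open(csv_path, newline="", encoding="utf-8") as f:
            header = next(csv.reader(f), [])
        absent = [c for c in REQUIRED_COLUMNS if c not in header]
        if absent:
            problems.append(f"q1_exploration.csv lacks columns: {absent}")
    return problems


# Prints nothing when all artifacts are in place.
for problem in check_artifacts("output"):
    print(problem)
```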

---

**Next:** Continue to `q2_data_cleaning.md` for Data Cleaning.


## Load Data & Generate Artifact 1

In [None]:
# Load dataset
df = pd.read_csv("data/beach_sensors.csv")

# Convert timestamp column to datetime
df["Measurement Timestamp"] = pd.to_datetime(
    df["Measurement Timestamp"], errors="coerce"
)

# Quick preview (display() is needed because this is not the cell's last expression)
display(df.head())

# Dataset shape
n_rows, n_cols = df.shape

# Column names
colnames = df.columns.tolist()

# Data types
dtypes = df.dtypes

# Date range (temporal REQUIRED)
start_date = df["Measurement Timestamp"].min()
end_date = df["Measurement Timestamp"].max()

# Missing values
missing_counts = df.isna().sum()
missing_rates = (df.isna().mean() * 100).round(2)

# Write file output/q1_data_info.txt (UTF-8 so the "×" in the shape line is written reliably)
output_path = "output/q1_data_info.txt"
with open(output_path, "w", encoding="utf-8") as f:
    # Shape
    f.write(f"Dataset Shape: {n_rows} rows × {n_cols} columns\n\n")

    # Column Names
    f.write("Column Names:\n")
    for col in colnames:
        f.write(f"- {col}\n")
    f.write("\n")

    # Data types
    f.write("Data Types:\n")
    for col in colnames:
        f.write(f"- {col}: {dtypes[col]}\n")
    f.write("\n")

    # Date Range
    f.write("Date Range:\n")
    f.write(f"Start: {start_date}\n")
    f.write(f"End: {end_date}\n\n")

    # Missing values
    f.write("Missing Values:\n")
    for col in colnames:
        f.write(f"- {col}: {missing_counts[col]} ({missing_rates[col]}%)\n")

print("Created:", output_path)


## Explore Data & Generate Artifact 2

In [5]:
# Identify numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# Compute summary rows
exploration_rows = []
for col in numeric_cols:
    exploration_rows.append(
        {
            "column_name": col,
            "mean": df[col].mean(),
            "std": df[col].std(),
            "min": df[col].min(),
            "max": df[col].max(),
            "missing_count": df[col].isna().sum(),
        }
    )

# Convert to dataframe
exploration_df = pd.DataFrame(exploration_rows)

# Save
exploration_df.to_csv("output/q1_exploration.csv", index=False)
exploration_df.head()

column_name,mean,std,min,max,missing_count
Air Temperature,12.627176,10.433975,-29.78,37.6,75
Wet Bulb Temperature,10.276958,9.40292,-28.9,28.4,75930
Humidity,68.022183,15.634777,0.0,100.0,0
Rain Intensity,0.158949,1.794149,0.0,183.6,75930
Interval Rain,0.142393,1.097016,-0.9,63.42,0


## Visualize Data & Generate Artifact 3

In [6]:
# Choose at least one numeric column to plot
if len(numeric_cols) == 0:
    raise ValueError("No numeric columns found for plotting.")

example_col = numeric_cols[0]

# Create figure with 2 subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Subplot 1: Distribution (Histogram)
axes[0].hist(df[example_col].dropna(), bins=30, alpha=0.7)
axes[0].set_title(f"Distribution of {example_col}")
axes[0].set_xlabel(example_col)
axes[0].set_ylabel("Frequency")

# Subplot 2: Time Series
# Sort by time index
df_ts = df.set_index("Measurement Timestamp").sort_index()

axes[1].plot(df_ts[example_col], label=example_col)
axes[1].set_title(f"Time Series of {example_col}")
axes[1].set_xlabel("Time")
axes[1].set_ylabel(example_col)
axes[1].legend()

fig.suptitle("Q1 Visualizations", fontsize=16)

# Save figure
output_fig = "output/q1_visualizations.png"
plt.tight_layout()
plt.savefig(output_fig, dpi=150)
plt.close()

print("Created:", output_fig)

Created: output/q1_visualizations.png
