# Q1: Setup & Exploration

**Phase 1-2:** Project Setup, Data Exploration  
**Points:** 6

**Focus:** Load data, perform initial inspection, identify data quality issues.

**Lecture Reference:** See **Lecture 11, Notebook 1** (`11/demo/01_setup_exploration_cleaning.ipynb`), Phases 1-2 for examples of data loading, inspection, and initial visualizations.

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import os

# Create output directory
os.makedirs('output', exist_ok=True)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

---

## Objective

Load the Chicago Beach Weather Sensors dataset, perform initial inspection, and identify data quality issues.

**Time Series Note:** Unlike the lecture's NYC Taxi data (event-based), this dataset is **time-series data** with continuous sensor readings. Every row carries a `Measurement Timestamp`; once that column is parsed with `pd.to_datetime()` and set as the index, you'll work with datetime-indexed dataframes throughout. See **Lecture 09** for time series operations. For time series visualizations, consider pandas `resample()` to aggregate the data (e.g., daily averages) so long-term trends are easier to see.
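
As a rough illustration of the `resample()` idea mentioned above, the sketch below aggregates hourly readings into daily means. The data is synthetic and the variable names (`readings`, `daily_mean`) are placeholders chosen for this example; they are not part of the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings (made-up values, not the assignment data)
timestamps = pd.date_range("2018-09-27", periods=72, freq=pd.Timedelta(hours=1))
readings = pd.DataFrame(
    {"Air Temperature": np.arange(72, dtype=float)},
    index=timestamps,
)

# Aggregate the hourly series to one mean value per calendar day
daily_mean = readings["Air Temperature"].resample("D").mean()
print(daily_mean)
```

`resample()` needs a datetime index (or an `on=` column), which is why the timestamp column is parsed and set as the index first.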

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q1_data_info.txt`
**Format:** Plain text file  
**Content:** Dataset information including:
- Dataset shape (rows × columns)
- Column names (one per line or comma-separated)
- Data types for each column
- Date range (start date and end date) - **REQUIRED if temporal data**
- Missing value counts for each column (column name: count)

**Example format** (illustrative values; your column names, counts, and dates will differ):
```
Dataset Shape: 50000 rows × 10 columns

Column Names:
- Measurement Timestamp
- Beach
- Water Temperature
- Air Temperature
...

Data Types:
- Measurement Timestamp: datetime64[ns]
- Beach: object
- Water Temperature: float64
...

Date Range:
Start: 2022-01-01 00:00:00
End: 2027-09-15 07:00:00

Missing Values:
- Water Temperature: 2500 (5.0%)
- Air Temperature: 1500 (3.0%)
...
```

### 2. `output/q1_exploration.csv`
**Format:** CSV file  
**Required Columns (exact names):** `column_name`, `mean`, `std`, `min`, `max`, `missing_count`  
**Content:** One row per numeric column in the dataset
- `column_name`: Name of the numeric column
- `mean`: Mean value (float)
- `std`: Standard deviation (float)
- `min`: Minimum value (float)
- `max`: Maximum value (float)
- `missing_count`: Number of missing values (integer)

**Example** (illustrative values):
```csv
column_name,mean,std,min,max,missing_count
Water Temperature,15.23,5.12,0.5,28.7,2500
Air Temperature,18.45,8.23,-5.2,35.8,1500
Wind Speed,6.78,4.56,0.1,25.3,0
```
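
One possible way to build a table with exactly these columns is to aggregate the numeric columns and transpose the result. This is only a sketch on an invented toy frame (`toy`, `summary` are hypothetical names), not the required implementation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented)
toy = pd.DataFrame({
    "Station Name": ["A", "A", "B", "B"],
    "Air Temperature": [10.0, 12.0, np.nan, 14.0],
    "Humidity": [50, 60, 70, 80],
})

# Keep numeric columns only, then compute one statistic per row of the result
numeric = toy.select_dtypes(include="number")
summary = numeric.agg(["mean", "std", "min", "max"]).T

# Missing counts align on the column names in the index
summary["missing_count"] = numeric.isna().sum()

# Turn the index (column names) into the required `column_name` column
summary = summary.rename_axis("column_name").reset_index()
print(summary.to_csv(index=False))
```

Calling `summary.to_csv(path, index=False)` with a file path would write the same content to disk.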

### 3. `output/q1_visualizations.png`
**Format:** PNG image file  
**Content:** At least 2 plots in a single figure (use subplots)  
**Required plots:**
1. **Distribution plot:** Histogram or density plot of at least one numeric variable
2. **Time series plot:** Line plot showing a numeric variable over time (if temporal data)

**Requirements:**
- Clear axis labels (xlabel, ylabel)
- Title for each subplot
- Overall figure title (optional but recommended)
- Legend if multiple series shown
- Saved as PNG with sufficient resolution (dpi=150 or higher)
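
The requirements above could be met with a figure along the following lines. This is a minimal sketch on synthetic data: the series values and the temporary output path are assumptions made for the example. The `Agg` backend line is there only so the sketch runs as a plain script, and should be dropped in a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic hourly series (invented values, not the assignment data)
rng = np.random.default_rng(0)
timestamps = pd.date_range("2018-09-27", periods=240, freq=pd.Timedelta(hours=1))
temps = pd.Series(15 + rng.normal(0, 3, size=240), index=timestamps)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: distribution
axes[0].hist(temps, bins=20)
axes[0].set_title("Distribution of Air Temperature")
axes[0].set_xlabel("Air Temperature")
axes[0].set_ylabel("Count")

# Plot 2: time series, aggregated to daily means
daily = temps.resample("D").mean()
axes[1].plot(daily.index, daily.values, label="Daily mean")
axes[1].set_title("Air Temperature over Time")
axes[1].set_xlabel("Date")
axes[1].set_ylabel("Air Temperature")
axes[1].legend()

fig.suptitle("Q1 exploration (synthetic data)")
fig.tight_layout()

# Hypothetical output location for this sketch; the assignment expects output/q1_visualizations.png
out_path = os.path.join(tempfile.mkdtemp(), "q1_visualizations.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```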

---

## Requirements Checklist

- [ ] Data loaded successfully from `data/beach_sensors.csv`
- [ ] Initial inspection completed (shape, info, head, describe)
- [ ] Missing values identified and counted
- [ ] Basic visualizations created (at least 2 plots: distribution + time series)
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Load the dataset:**
   ```python
   df = pd.read_csv('data/beach_sensors.csv')
   ```

2. **Inspect the data:**
   - Check shape: `df.shape`
   - Check columns: `df.columns`
   - Check data types: `df.dtypes`
   - Check head: `df.head()`
   - Check summary: `df.describe()`

3. **Parse datetime (if applicable):**
   - Identify datetime column(s)
   - Parse using `pd.to_datetime()`
   - Check date range

4. **Identify missing values:**
   - Count missing values per column: `df.isnull().sum()`
   - Calculate percentages

5. **Create visualizations:**
   - Distribution plot (histogram or density)
   - Time series plot (if temporal data)

6. **Save artifacts:**
   - Write data info to `output/q1_data_info.txt`
   - Write exploration stats to `output/q1_exploration.csv`
   - Save figure to `output/q1_visualizations.png`
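
Steps 3 and 4 above can be sketched as follows. The two sample rows mimic the timestamp layout visible in the dataset preview (`MM/DD/YYYY hh:mm:ss AM/PM`); the explicit `format` string assumes every row follows that layout, which you should verify on the full file.

```python
import pandas as pd

# Two sample rows in the timestamp layout shown by df.head(); values are illustrative
sample = pd.DataFrame({
    "Measurement Timestamp": ["09/27/2018 10:00:00 AM", "09/27/2018 01:00:00 PM"],
    "Air Temperature": [16.4, None],
})

# An explicit format avoids slow or ambiguous inference (assumes a uniform layout)
sample["Measurement Timestamp"] = pd.to_datetime(
    sample["Measurement Timestamp"], format="%m/%d/%Y %I:%M:%S %p"
)
sample = sample.set_index("Measurement Timestamp").sort_index()

# Date range and missing-value percentages
start, end = sample.index.min(), sample.index.max()
missing_pct = sample.isna().mean() * 100
print(start, end)
print(missing_pct.round(1))
```

Taking the mean of the boolean `isna()` mask is equivalent to dividing the missing count by the number of rows.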

---

## Decision Points

- **Visualization choices:** What types of plots best show your data? See Lecture 11 Notebook 1 for examples.
- **Data quality assessment:** What issues do you see? Missing data patterns? Outliers? Inconsistent formats? Document these for Q2.
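
Two quick checks that may help with the data quality assessment are a per-station missing-value count and a simple IQR rule for outliers. The sketch below uses an invented toy frame, and the 1.5 × IQR threshold is a common convention rather than something the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Toy data (invented values) with one missing reading and one extreme reading
toy = pd.DataFrame({
    "Station Name": ["A", "A", "A", "B", "B", "B"],
    "Wind Speed": [2.0, 2.5, 3.0, 2.2, np.nan, 40.0],
})

# Missing values broken down by station: are gaps concentrated in one sensor?
missing_by_station = toy["Wind Speed"].isna().groupby(toy["Station Name"]).sum()

# IQR rule: flag values more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = toy["Wind Speed"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["Wind Speed"] < q1 - 1.5 * iqr) | (toy["Wind Speed"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(missing_by_station)
print(outliers)
```

Flagged rows are candidates for a closer look in Q2, not automatically errors.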

---

## Checkpoint

After Q1, you should have:
- [ ] Data loaded successfully
- [ ] Basic statistics calculated
- [ ] Initial visualizations created (2+ plots)
- [ ] Data quality issues identified
- [ ] All 3 artifacts saved: `q1_data_info.txt`, `q1_exploration.csv`, `q1_visualizations.png`

---

**Next:** Continue to `q2_data_cleaning.md` for Data Cleaning.


In [3]:
import pandas as pd

df_temp = pd.read_csv("data/beach_sensors.csv")
df_temp.head()

Unnamed: 0,Station Name,Measurement Timestamp,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,Wind Speed,Maximum Wind Speed,Barometric Pressure,Solar Radiation,Heading,Battery Life,Measurement Timestamp Label,Measurement ID
0,63rd Street Weather Station,09/27/2018 10:00:00 AM,16.4,12.2,61,0.0,0.0,260.3,0.0,231,2.5,4.7,996.3,484,356.0,11.9,09/27/2018 10:00 AM,63rdStreetWeatherStation201809271000
1,63rd Street Weather Station,09/27/2018 11:00:00 AM,17.1,11.5,51,0.0,0.0,260.3,0.0,244,3.6,5.7,995.4,468,356.0,11.9,09/27/2018 11:00 AM,63rdStreetWeatherStation201809271100
2,63rd Street Weather Station,09/27/2018 01:00:00 PM,18.2,12.4,51,0.0,0.0,260.3,0.0,248,3.1,5.3,994.8,377,355.0,11.9,09/27/2018 1:00 PM,63rdStreetWeatherStation201809271300
3,Foster Weather Station,09/27/2018 01:00:00 PM,17.89,,39,,0.0,,,249,1.4,2.3,993.6,0,,15.1,09/27/2018 1:00 PM,FosterWeatherStation201809271300
4,63rd Street Weather Station,09/27/2018 03:00:00 PM,19.5,13.0,47,0.0,0.0,260.3,0.0,249,3.1,5.7,992.9,461,355.0,11.9,09/27/2018 3:00 PM,63rdStreetWeatherStation201809271500


In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Setup
os.makedirs("output", exist_ok=True)

# Load dataset
df = pd.read_csv('data/beach_sensors.csv')

# Inspect the data
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("Data Types:\n", df.dtypes)
print("Head:\n", df.head())
print("Summary:\n", df.describe(include='all'))

# Parse datetime if applicable
datetime_col = None
for col in df.columns:
    if "timestamp" in col.lower():
        datetime_col = col
        break

if datetime_col:
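    # Explicit format matches the layout seen in df.head(), e.g. "09/27/2018 10:00:00 AM"
    df[datetime_col] = pd.to_datetime(df[datetime_col], format="%m/%d/%Y %I:%M:%S %p")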
    start_date = df[datetime_col].min()
    end_date = df[datetime_col].max()
else:
    start_date = end_date = None


# Identify missing values
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100

# Artifact 1: Dataset info
data_info_path = "output/q1_data_info.txt"
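# UTF-8 is set explicitly because the shape line contains the non-ASCII "×" character
with open(data_info_path, "w", encoding="utf-8") as f: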
    f.write(f"Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns\n\n")
    
    f.write("Column Names:\n")
    for col in df.columns:
        f.write(f"- {col}\n")
    
    f.write("\nData Types:\n")
    for col in df.columns:
        f.write(f"- {col}: {df[col].dtype}\n")
    
    if datetime_col:
        f.write("\nDate Range:\n")
        f.write(f"Start: {start_date}\n")
        f.write(f"End: {end_date}\n")
    
    f.write("\nMissing Values:\n")
    for col in df.columns:
        f.write(f"- {col}: {missing_counts[col]} ({missing_percent[col]:.1f}%)\n")

print(f"Q1 data info saved to {data_info_path}")


# Artifact 2: Numeric exploration
# "number" covers every integer and float dtype, not only int64/float64
numeric_cols = df.select_dtypes(include="number").columns.tolist()
exploration_path = "output/q1_exploration.csv"

exploration_rows = []
for col in numeric_cols:
    exploration_rows.append({
        "column_name": col,
        "mean": df[col].mean(),
        "std": df[col].std(),
        "min": df[col].min(),
        "max": df[col].max(),
        "missing_count": df[col].isna().sum()
    })

exploration_df = pd.DataFrame(exploration_rows)
exploration_df.to_csv(exploration_path, index=False)
print(f"Q1 numeric exploration saved to {exploration_path}")


# Artifact 3: Visualizations
# plt.subplots creates the figure itself; a separate plt.figure() call would leave an empty figure behind
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution plot of first numeric column
if numeric_cols:
    col_hist = numeric_cols[0]
    sns.histplot(df[col_hist], bins=30, kde=True, ax=axes[0], color='skyblue')
    axes[0].set_title(f'Distribution of {col_hist}')
    axes[0].set_xlabel(col_hist)
    axes[0].set_ylabel('Count')

# Time series plot of first numeric column (if datetime exists)
if datetime_col and numeric_cols:
    col_ts = numeric_cols[0]
    ts_data = df[[datetime_col, col_ts]].dropna()
    ts_data = ts_data.set_index(datetime_col).resample('D').mean()  # daily mean
    axes[1].plot(ts_data.index, ts_data[col_ts], color='darkorange')
    axes[1].set_title(f'Time Series of {col_ts} (Daily Avg)')
    axes[1].set_xlabel('Date')
    axes[1].set_ylabel(col_ts)

fig.tight_layout()
vis_path = "output/q1_visualizations.png"
fig.savefig(vis_path, dpi=150)  # save this figure explicitly rather than the "current" one
plt.close(fig)
print(f"Q1 visualizations saved to {vis_path}")


Shape: (195873, 18)
Columns: ['Station Name', 'Measurement Timestamp', 'Air Temperature', 'Wet Bulb Temperature', 'Humidity', 'Rain Intensity', 'Interval Rain', 'Total Rain', 'Precipitation Type', 'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Barometric Pressure', 'Solar Radiation', 'Heading', 'Battery Life', 'Measurement Timestamp Label', 'Measurement ID']
Data Types:
 Station Name                    object
Measurement Timestamp           object
Air Temperature                float64
Wet Bulb Temperature           float64
Humidity                         int64
Rain Intensity                 float64
Interval Rain                  float64
Total Rain                     float64
Precipitation Type             float64
Wind Direction                   int64
Wind Speed                     float64
Maximum Wind Speed             float64
Barometric Pressure            float64
Solar Radiation                  int64
Heading                        float64
Battery Life                   fl
