# Q1: Setup & Exploration

**Phase 1-2:** Project Setup, Data Exploration  
**Points:** 6

**Focus:** Load data, perform initial inspection, identify data quality issues.

**Lecture Reference:** Lecture 11, Notebook 1 ([`11/demo/01_setup_exploration_cleaning.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/01_setup_exploration_cleaning.ipynb)), Phases 1-2. Also see Lecture 04 (pandas I/O) and Lecture 07 (visualization).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import os

# Create output directory
os.makedirs('output', exist_ok=True)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

---

## Objective

Load the Chicago Beach Weather Sensors dataset, perform initial inspection, and identify data quality issues.

**Note:** The datetime column in this dataset is named `Measurement Timestamp`.

**Time Series Note:** Unlike the lecture's NYC Taxi data (event-based), this dataset is **time-series data** with continuous sensor readings. Every row carries a timestamp, so once you parse `Measurement Timestamp` and set it as the index you'll work with datetime-indexed dataframes throughout. See **Lecture 09** for time series operations. For time series visualizations, you may want to use pandas `resample()` to aggregate data (e.g., daily averages) so long-term trends are easier to see.
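
As a minimal sketch of the `resample()` idea, the snippet below aggregates hourly readings into daily averages. The data is synthetic and the frame name `demo` is hypothetical; only the column names mirror the dataset. It assumes the timestamp column has already been parsed to datetime.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings for three days (made-up values, for illustration only)
hours = np.arange(72)
timestamps = pd.Timestamp("2018-09-27") + pd.to_timedelta(hours, unit="h")
demo = pd.DataFrame({
    "Measurement Timestamp": timestamps,
    "Air Temperature": 15 + 5 * np.sin(hours * 2 * np.pi / 24),
})

# resample() needs a datetime index, so set and sort it first
demo_ts = demo.set_index("Measurement Timestamp").sort_index()

# Collapse the hourly readings into one mean value per calendar day
daily_mean = demo_ts["Air Temperature"].resample("D").mean()
print(daily_mean)
```

Plotting `daily_mean` instead of the raw hourly series usually gives a much less cluttered line when the data spans several years.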

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q1_data_info.txt`
**Format:** Plain text file
**Content:** Dataset information including:
- Dataset shape (rows × columns)
- Column names (one per line or comma-separated)
- Data types for each column
- Date range (start date and end date) - **REQUIRED if temporal data**
- Missing value counts for each column (column name: count)

**Example format:**
```
Dataset Shape: 50000 rows × 10 columns

Column Names:
- Measurement Timestamp
- Beach
- Water Temperature
- Air Temperature
...

Data Types:
- Measurement Timestamp: datetime64[ns]
- Beach: object
- Water Temperature: float64
...

Date Range:
Start: 2022-01-01 00:00:00
End: 2027-09-15 07:00:00

Missing Values:
- Water Temperature: 2500 (5.0%)
- Air Temperature: 1500 (3.0%)
...
```

### 2. `output/q1_exploration.csv`
**Format:** CSV file
**Required Columns (exact names):** `column_name`, `mean`, `std`, `min`, `max`, `missing_count`
**Content:** One row per numeric column in the dataset
- `column_name`: Name of the numeric column
- `mean`: Mean value (float)
- `std`: Standard deviation (float)
- `min`: Minimum value (float)
- `max`: Maximum value (float)
- `missing_count`: Number of missing values (integer)

**Example:**
```csv
column_name,mean,std,min,max,missing_count
Water Temperature,15.23,5.12,0.5,28.7,2500
Air Temperature,18.45,8.23,-5.2,35.8,1500
Wind Speed,6.78,4.56,0.1,25.3,0
```
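
One possible way to build a table with exactly these columns is to aggregate all numeric columns at once and transpose the result. This is only a sketch on a toy frame (the name `demo` and its values are made up), not the required approach.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset
demo = pd.DataFrame({
    "Station Name": ["A", "A", "B", "B"],
    "Air Temperature": [16.4, 17.1, np.nan, 19.5],
    "Humidity": [61, 51, 39, 47],
})

# Keep numeric columns only, then compute the four statistics in one call
numeric = demo.select_dtypes(include="number")
summary = numeric.agg(["mean", "std", "min", "max"]).T

# Add the missing-value count and turn the index into the column_name column
summary["missing_count"] = numeric.isna().sum().astype(int)
summary = summary.rename_axis("column_name").reset_index()
print(summary)
```

Calling `summary.to_csv(..., index=False)` on a frame shaped like this writes the header in the order listed above.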

### 3. `output/q1_visualizations.png`
**Format:** PNG image file
**Content:** At least 2 plots in a single figure (use subplots)
**Required plots:**
1. **Distribution plot:** Histogram or density plot of at least one numeric variable
2. **Time series plot:** Line plot showing a numeric variable over time (if temporal data)

**Requirements:**
- Clear axis labels (xlabel, ylabel)
- Title for each subplot
- Overall figure title (optional but recommended)
- Legend if multiple series shown
- Saved as PNG with sufficient resolution (dpi=150 or higher)

---

## Requirements Checklist

- [ ] Data loaded successfully from `data/beach_sensors.csv`
- [ ] Initial inspection completed (shape, info, head, describe)
- [ ] Missing values identified and counted
- [ ] Basic visualizations created (at least 2 plots: distribution + time series)
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Load and inspect the dataset** - Use standard pandas I/O and inspection methods
2. **Parse datetime** - Identify and convert datetime column(s)
3. **Identify missing values** - Count and calculate percentages per column
4. **Create visualizations** - Distribution plot + time series plot (use subplots)
5. **Save artifacts** - Write to the three required output files

---

## Decision Points

- **Visualization choices:** What types of plots best show your data? See Lecture 11 Notebook 1 for examples.
- **Data quality assessment:** What issues do you see? Missing data patterns? Outliers? Inconsistent formats? Document these for Q2.

---

## Checkpoint

After Q1, you should have:
- [ ] Data loaded successfully
- [ ] Basic statistics calculated
- [ ] Initial visualizations created (2+ plots)
- [ ] Data quality issues identified
- [ ] All 3 artifacts saved: `q1_data_info.txt`, `q1_exploration.csv`, `q1_visualizations.png`

---

**Next:** Continue to `q2_data_cleaning.md` for Data Cleaning.


In [2]:
#To create Data Info 
# Load the dataset
df = pd.read_csv("data/beach_sensors.csv")

# Display basic preview
display(df.head())

# Save shape
rows, cols = df.shape
print(f"Dataset Shape: {rows} rows × {cols} columns")


Unnamed: 0,Station Name,Measurement Timestamp,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,Wind Speed,Maximum Wind Speed,Barometric Pressure,Solar Radiation,Heading,Battery Life,Measurement Timestamp Label,Measurement ID
0,63rd Street Weather Station,09/27/2018 10:00:00 AM,16.4,12.2,61,0.0,0.0,260.3,0.0,231,2.5,4.7,996.3,484,356.0,11.9,09/27/2018 10:00 AM,63rdStreetWeatherStation201809271000
1,63rd Street Weather Station,09/27/2018 11:00:00 AM,17.1,11.5,51,0.0,0.0,260.3,0.0,244,3.6,5.7,995.4,468,356.0,11.9,09/27/2018 11:00 AM,63rdStreetWeatherStation201809271100
2,63rd Street Weather Station,09/27/2018 01:00:00 PM,18.2,12.4,51,0.0,0.0,260.3,0.0,248,3.1,5.3,994.8,377,355.0,11.9,09/27/2018 1:00 PM,63rdStreetWeatherStation201809271300
3,Foster Weather Station,09/27/2018 01:00:00 PM,17.89,,39,,0.0,,,249,1.4,2.3,993.6,0,,15.1,09/27/2018 1:00 PM,FosterWeatherStation201809271300
4,63rd Street Weather Station,09/27/2018 03:00:00 PM,19.5,13.0,47,0.0,0.0,260.3,0.0,249,3.1,5.7,992.9,461,355.0,11.9,09/27/2018 3:00 PM,63rdStreetWeatherStation201809271500


Dataset Shape: 196627 rows × 18 columns


In [3]:
#Inspection of types and missing values
# Data types
dtypes = df.dtypes

# Missing values
missing_counts = df.isna().sum()
missing_percent = (missing_counts / len(df)) * 100

# Display inspection
df.info()  # prints its report directly and returns None, so it needs no display() wrapper
display(missing_counts)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196627 entries, 0 to 196626
Data columns (total 18 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Station Name                 196627 non-null  object 
 1   Measurement Timestamp        196627 non-null  object 
 2   Air Temperature              196552 non-null  float64
 3   Wet Bulb Temperature         120522 non-null  float64
 4   Humidity                     196627 non-null  int64  
 5   Rain Intensity               120522 non-null  float64
 6   Interval Rain                196627 non-null  float64
 7   Total Rain                   120522 non-null  float64
 8   Precipitation Type           120522 non-null  float64
 9   Wind Direction               196627 non-null  int64  
 10  Wind Speed                   196627 non-null  float64
 11  Maximum Wind Speed           196627 non-null  float64
 12  Barometric Pressure          196481 non-null  float64
 13 

Station Name                       0
Measurement Timestamp              0
Air Temperature                   75
Wet Bulb Temperature           76105
Humidity                           0
Rain Intensity                 76105
Interval Rain                      0
Total Rain                     76105
Precipitation Type             76105
Wind Direction                     0
Wind Speed                         0
Maximum Wind Speed                 0
Barometric Pressure              146
Solar Radiation                    0
Heading                        76105
Battery Life                       0
Measurement Timestamp Label        0
Measurement ID                     0
dtype: int64
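
Five columns share the same missing count (76105), and the preview row for Foster Weather Station is blank in those columns, which hints that the gaps may be tied to particular stations. That is an assumption worth checking rather than a confirmed finding. A minimal sketch of the check, using a toy frame (`demo`, with made-up rows) in place of `df`:

```python
import numpy as np
import pandas as pd

# Toy rows loosely modeled on the preview; not real data
demo = pd.DataFrame({
    "Station Name": ["63rd Street Weather Station"] * 3 + ["Foster Weather Station"] * 2,
    "Wet Bulb Temperature": [12.2, 11.5, 12.4, np.nan, np.nan],
    "Air Temperature": [16.4, 17.1, 18.2, 17.89, np.nan],
})

# Percentage of missing values per column, computed separately for each station
missing_by_station = (
    demo.drop(columns="Station Name")
    .isna()
    .groupby(demo["Station Name"])
    .mean()
    .mul(100)
)
print(missing_by_station.round(1))
```

If one station shows close to 100% missing for a column while the others show 0%, the gap is structural (that station may not report the measurement) rather than random, which matters for the cleaning decisions in Q2.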

In [4]:
# Parse datetime
# Convert datetime column
df["Measurement Timestamp"] = pd.to_datetime(df["Measurement Timestamp"], errors="coerce")

# Get date range
start_date = df["Measurement Timestamp"].min()
end_date = df["Measurement Timestamp"].max()

print("Start Date:", start_date)
print("End Date:", end_date)


Start Date: 2015-04-25 09:00:00
End Date: 2025-12-10 04:00:00
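
Because `errors="coerce"` silently turns unparseable strings into `NaT`, it can be worth counting how many values failed. The sketch below also passes an explicit `format`, assuming every timestamp follows the `09/27/2018 10:00:00 AM` pattern seen in the preview; that assumption has not been verified against the full file. The series `raw` is made up.

```python
import pandas as pd

# Made-up strings: two in the pattern from the preview, one deliberately malformed
raw = pd.Series([
    "09/27/2018 10:00:00 AM",
    "09/27/2018 01:00:00 PM",
    "not a timestamp",
])

# An explicit format skips per-value inference and pins down month/day order
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Values that were present before parsing but are NaT afterwards failed to parse
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(parsed)
print("Failed to parse:", n_failed)
```

A non-zero count on the real column would be a data quality issue to note for Q2.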


In [5]:
#Saving output #1
info_text = []

info_text.append(f"Dataset Shape: {rows} rows × {cols} columns\n")

info_text.append("Column Names:")
for col in df.columns:
    info_text.append(f"- {col}")

info_text.append("\nData Types:")
for col, dtype in df.dtypes.items():  # re-read dtypes so the parsed timestamp is reported as datetime64[ns]
    info_text.append(f"- {col}: {dtype}")

info_text.append("\nDate Range:")
info_text.append(f"Start: {start_date}")
info_text.append(f"End: {end_date}")

info_text.append("\nMissing Values:")
for col in df.columns:
    info_text.append(f"- {col}: {missing_counts[col]} ({missing_percent[col]:.2f}%)")

# Write file
with open("output/q1_data_info.txt", "w", encoding="utf-8") as f:  # utf-8 so the "×" character writes on any platform
    f.write("\n".join(info_text))

print("✅ output/q1_data_info.txt created")


✅ output/q1_data_info.txt created


In [6]:
#Data Exploration 
# Select numeric columns only
numeric_df = df.select_dtypes(include=[np.number])

summary_data = []

for col in numeric_df.columns:
    summary_data.append({
        "column_name": col,
        "mean": numeric_df[col].mean(),
        "std": numeric_df[col].std(),
        "min": numeric_df[col].min(),
        "max": numeric_df[col].max(),
        "missing_count": df[col].isna().sum()
    })

summary_df = pd.DataFrame(summary_data)

# Save CSV
summary_df.to_csv("output/q1_exploration.csv", index=False)

display(summary_df)
print("✅ output/q1_exploration.csv created")


Unnamed: 0,column_name,mean,std,min,max,missing_count
0,Air Temperature,12.600545,10.444932,-29.78,37.6,75
1,Wet Bulb Temperature,10.257292,9.411305,-28.9,28.4,76105
2,Humidity,68.021569,15.626334,0.0,100.0,0
3,Rain Intensity,0.158826,1.792922,0.0,183.6,76105
4,Interval Rain,0.142235,1.096103,-0.9,63.42,0
5,Total Rain,141.382922,190.356828,0.0,1056.1,76105
6,Precipitation Type,4.272374,15.598411,0.0,70.0,76105
7,Wind Direction,140.74325,122.0388,0.0,359.0,0
8,Wind Speed,2.91874,5.337807,0.0,999.9,0
9,Maximum Wind Speed,3.554967,5.951615,0.0,999.9,0


✅ output/q1_exploration.csv created
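
The summary shows a maximum of 999.9 for both wind speed columns and a minimum of -0.9 for `Interval Rain`. A value like 999.9 looks like a sensor error code rather than a real reading, though that is an assumption. A minimal sketch for counting such values follows; the toy frame `demo` and the bounds in `plausible` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data containing one suspicious value per column
demo = pd.DataFrame({
    "Wind Speed": [2.5, 3.6, 999.9, 1.4],
    "Interval Rain": [0.0, -0.9, 0.0, 0.2],
})

# Hypothetical plausibility bounds (inclusive); adjust after inspecting the real data
plausible = {
    "Wind Speed": (0.0, 100.0),
    "Interval Rain": (0.0, np.inf),
}

# Count non-missing values that fall outside the bounds for each column
flags = {
    col: int((~demo[col].between(lo, hi) & demo[col].notna()).sum())
    for col, (lo, hi) in plausible.items()
}
print(flags)
```

Counts like these are worth recording alongside the missing-value totals, since they distort `mean`, `std`, and `max` in the exploration table.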


In [8]:
#Data Visualization 
# Ensure datetime is properly formatted
df["Measurement Timestamp"] = pd.to_datetime(df["Measurement Timestamp"], errors="coerce")
df_ts = df.set_index("Measurement Timestamp").sort_index()

# Create figure with 2 subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# =========================
# Subplot 1: Distribution (Humidity)
# =========================
axes[0].hist(df["Humidity"].dropna(), bins=40)
axes[0].set_title("Distribution of Humidity")
axes[0].set_xlabel("Humidity")
axes[0].set_ylabel("Frequency")

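# =========================
# Subplot 2: Time Series (Daily Rainfall)
# =========================
# "Total Rain" appears to be a running total (it stays at 260.3 across the preview
# rows while "Interval Rain" is 0.0), so summing it would count the same rain repeatedly.
# Sum the per-reading "Interval Rain" instead; note this pools all stations together.
daily_rain = df_ts["Interval Rain"].resample("D").sum()

axes[1].plot(daily_rain, label="Daily Rainfall (sum of Interval Rain)")
axes[1].set_title("Daily Rainfall Over Time")
axes[1].set_xlabel("Date")
axes[1].set_ylabel("Rainfall (sum of Interval Rain)")
axes[1].legend()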

# =========================
# Overall Figure Title
# =========================
fig.suptitle("Chicago Beach Sensor Data: Humidity Distribution & Rainfall Over Time", fontsize=14)

# =========================
# Save Figure
# =========================
plt.tight_layout(rect=[0, 0, 1, 0.95])  # leave space for suptitle
plt.savefig("output/q1_visualizations.png", dpi=150)
plt.close()

print("✅ output/q1_visualizations.png created with all required formatting")


✅ output/q1_visualizations.png created with all required formatting
