# Q2: Data Cleaning

**Phase 3:** Data Cleaning & Preprocessing  
**Points:** 9

**Focus:** Handle missing data, outliers, validate data types, remove duplicates.

**Lecture Reference:** Lecture 11, Notebook 1 ([`11/demo/01_setup_exploration_cleaning.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/01_setup_exploration_cleaning.ipynb)), Phase 3. Also see Lecture 05 (data cleaning).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load the original data from source
df = pd.read_csv('data/beach_sensors.csv')
# Loading a saved Q1 output (e.g. 'output/q1_exploration.csv') won't work here; use the original file.

In [2]:
# Check if missing values occur in the same rows
df_clean = df.copy()

cols_with_same_missing = ['Air Temperature', 'Wet Bulb Temperature', 'Rain Intensity', 'Total Rain', 'Precipitation Type', 'Barometric Pressure', 'Heading']
missing_mask = df_clean[cols_with_same_missing].isnull()
print(f"Rows where ALL are missing: {missing_mask.all(axis=1).sum()}")
print(f"Rows where ANY are missing: {missing_mask.any(axis=1).sum()}")

Rows where ALL are missing: 0
Rows where ANY are missing: 75962


In [4]:
# Creating output/q2_cleaned_data.csv

display(df_clean.head())

# Handle missing values
numeric_cols = df_clean.select_dtypes(include=np.number).columns.tolist()

stats_df_clean = df_clean.describe().T.round(2)
display(stats_df_clean)

# Median imputation for numeric columns. Assign the result back instead of
# calling fillna(inplace=True) on a column selection, which pandas warns
# will stop working in 3.0.
for col in numeric_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# The timestamp is still a string here, so this sort is lexicographic, not chronological
df_clean = df_clean.sort_values("Measurement Timestamp")
df_clean['Measurement Timestamp'] = df_clean['Measurement Timestamp'].ffill().bfill()

# Remaining text columns: fill with a placeholder category
object_cols = df_clean.select_dtypes(include='object').columns.tolist()
for col in object_cols:
    df_clean[col] = df_clean[col].fillna('Unknown')

# Handle outliers with domain-knowledge range limits
df_clean = df_clean[df_clean['Humidity'].between(0, 100)]
df_clean = df_clean[df_clean['Wind Direction'].between(0, 360)]
df_clean = df_clean[df_clean['Heading'].between(0, 360)]
df_clean = df_clean[df_clean['Battery Life'].between(0, 100)]

# Rain, wind, and solar readings cannot be negative
for col in ['Rain Intensity', 'Interval Rain', 'Total Rain', 'Wind Speed', 'Maximum Wind Speed', 'Solar Radiation']:
    df_clean = df_clean[df_clean[col] >= 0]

# Remove duplicates
df_clean = df_clean.drop_duplicates()

# Check row count and save (create the output directory if it does not exist)
print(f"Row count after cleaning: {len(df_clean)}")
os.makedirs('output', exist_ok=True)
df_clean.to_csv('output/q2_cleaned_data.csv', index=False)

display(df_clean.head())

stats_df_clean = df_clean.describe().T.round(2)
display(stats_df_clean)

display(df_clean.head(3))

| Unnamed: 0 | Station Name | Measurement Timestamp | Air Temperature | Wet Bulb Temperature | Humidity | Rain Intensity | Interval Rain | Total Rain | Precipitation Type | Wind Direction | Wind Speed | Maximum Wind Speed | Barometric Pressure | Solar Radiation | Heading | Battery Life | Measurement Timestamp Label | Measurement ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136356 | Oak Street Weather Station | 01/01/2016 01:00:00 AM | -3.2 | -4.8 | 67 | 0.0 | 0.0 | 6.3 | 0.0 | 286 | 1.5 | 5.6 | 1000.0 | 4 | 359.0 | 12.0 | 01/01/2016 1:00 AM | OakStreetWeatherStation201601010100 |
| 5085 | Foster Weather Station | 01/01/2016 01:00:00 AM | -4.56 | 11.6 | 63 | 0.0 | 0.0 | 55.5 | 0.0 | 290 | 5.9 | 6.6 | 999.3 | 0 | 354.0 | 14.8 | 01/01/2016 1:00 AM | FosterWeatherStation201601010100 |
| 35488 | 63rd Street Weather Station | 01/01/2016 01:00:00 AM | -3.4 | -4.8 | 72 | 0.0 | 0.0 | 6.7 | 0.0 | 273 | 6.4 | 9.4 | 999.9 | 5 | 353.0 | 11.9 | 01/01/2016 1:00 AM | 63rdStreetWeatherStation201601010100 |
| 5133 | Foster Weather Station | 01/01/2016 01:00:00 PM | -2.56 | 11.6 | 65 | 0.0 | 0.05 | 55.5 | 0.0 | 263 | 1.9 | 2.7 | 997.0 | 208 | 354.0 | 15.2 | 01/01/2016 1:00 PM | FosterWeatherStation201601011300 |
| 35500 | 63rd Street Weather Station | 01/01/2016 01:00:00 PM | -2.2 | -3.6 | 73 | 0.0 | 0.0 | 6.7 | 0.0 | 269 | 8.1 | 13.7 | 997.7 | 147 | 353.0 | 11.9 | 01/01/2016 1:00 PM | 63rdStreetWeatherStation201601011300 |


| Unnamed: 0 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Air Temperature | 182768.0 | 11.97 | 10.45 | -29.78 | 3.89 | 12.28 | 21.0 | 37.6 |
| Wet Bulb Temperature | 182768.0 | 10.3 | 7.35 | -28.9 | 7.0 | 11.6 | 12.3 | 28.4 |
| Humidity | 182768.0 | 68.03 | 15.69 | 0.0 | 57.0 | 69.0 | 80.0 | 100.0 |
| Rain Intensity | 182768.0 | 0.1 | 1.43 | 0.0 | 0.0 | 0.0 | 0.0 | 183.6 |
| Interval Rain | 182768.0 | 0.14 | 1.12 | 0.0 | 0.0 | 0.0 | 0.0 | 63.42 |
| Total Rain | 182768.0 | 106.69 | 158.28 | 0.0 | 35.0 | 55.5 | 70.03 | 1056.1 |
| Precipitation Type | 182768.0 | 2.49 | 12.11 | 0.0 | 0.0 | 0.0 | 0.0 | 70.0 |
| Wind Direction | 182768.0 | 139.48 | 122.16 | 0.0 | 8.0 | 114.0 | 258.0 | 359.0 |
| Wind Speed | 182768.0 | 3.02 | 4.99 | 0.0 | 1.7 | 3.1 | 3.4 | 999.9 |
| Maximum Wind Speed | 182768.0 | 3.59 | 5.69 | 0.0 | 1.1 | 3.1 | 5.3 | 999.9 |



Row count after cleaning: 182768



In [5]:

timestamp_cols = ['Measurement Timestamp']

# Valid ranges (domain knowledge) matching the filters applied during cleaning
range_limits = {'Humidity': (0, 100), 'Wind Direction': (0, 360),
                'Heading': (0, 360), 'Battery Life': (0, 100)}
non_negative_cols = ['Rain Intensity', 'Interval Rain', 'Total Rain',
                     'Wind Speed', 'Maximum Wind Speed', 'Solar Radiation']

# Report each column once, even if it appears in more than one list
report_cols = list(dict.fromkeys(numeric_cols + object_cols + timestamp_cols))

# Creating data cleaning report
with open('output/q2_cleaning_report.txt', 'w') as f:
    f.write("DATA CLEANING REPORT\n")
    f.write("====================\n\n")
    f.write(f"Rows before cleaning: {len(df)}\n\n")

    # Missing data handling (counts come from the original, uncleaned df)
    f.write("Missing Data Handling:\n")
    for col in report_cols:
        n_missing = df[col].isna().sum()
        percent_missing = round((n_missing / len(df)) * 100, 2)
        # Check the timestamp first: it is also an object column
        if col in timestamp_cols:
            method = "Forward-fill, then back-fill"
        elif col in numeric_cols:
            method = "Median imputation"
        else:
            method = "Filled with 'Unknown'"
        result = "All missing values filled" if n_missing > 0 else "No missing values"
        f.write(f"- {col}: {n_missing} missing values ({percent_missing}%)\n")
        f.write(f"  Method: {method}\n")
        f.write(f"  Result: {result}\n")

    # Outlier handling: one section header, one entry per column
    f.write("\nOutlier Handling:\n")
    for col, (low, high) in range_limits.items():
        n_outliers = ((df[col] < low) | (df[col] > high)).sum()
        f.write(f"- {col}: Out-of-range values detected (<{low} or >{high}): {n_outliers}\n")
        f.write(f"  Method: Removed\n  Result: {n_outliers} rows removed\n")
    for col in non_negative_cols:
        n_outliers = (df[col] < 0).sum()
        f.write(f"- {col}: Out-of-range values detected (<0): {n_outliers}\n")
        f.write(f"  Method: Removed\n  Result: {n_outliers} rows removed\n")

    # Duplicates
    n_duplicates = df.duplicated().sum()
    f.write(f"\nDuplicates Removed: {n_duplicates}\n")

    # Data type conversions (none were performed in this notebook)
    f.write("\nData Type Conversions:\n")
    f.write("- None (columns kept their original types)\n")

    # Rows after cleaning
    f.write(f"\nRows after cleaning: {len(df_clean)}\n")

display(df_clean.head())




In [7]:
# Generate output/q2_rows_cleaned.txt
with open('output/q2_rows_cleaned.txt', 'w') as f:
    f.write(str(len(df_clean)))


---

## Objective

Clean the dataset by handling missing data, outliers, validating data types, and removing duplicates.

**Time Series Note:** For time series data, forward-fill (`ffill()`) is often appropriate for missing values since sensor readings are continuous. However, you may choose other strategies based on your analysis.
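As an illustration of the forward-fill idea, here is a minimal sketch on a hypothetical toy frame (the two stations `A` and `B` and their values are made up; only the column names mirror the dataset). It assumes readings should be filled within each station rather than across stations, and it falls back to the station median when a gap has no earlier reading. This is one possible strategy, not the required one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Station Name": ["A", "A", "A", "B", "B", "B"],
    "Measurement Timestamp": pd.to_datetime([
        "2016-01-01 01:00", "2016-01-01 02:00", "2016-01-01 03:00",
        "2016-01-01 01:00", "2016-01-01 02:00", "2016-01-01 03:00",
    ]),
    "Air Temperature": [1.0, np.nan, 3.0, np.nan, 10.0, np.nan],
})

# Sort chronologically within each station so "previous reading" is meaningful.
toy = toy.sort_values(["Station Name", "Measurement Timestamp"])

# Forward-fill within each station, so station B never inherits station A's values.
filled = toy.groupby("Station Name")["Air Temperature"].ffill()

# A leading gap has no earlier reading; fall back to that station's median.
station_median = toy.groupby("Station Name")["Air Temperature"].transform("median")
toy["Air Temperature"] = filled.fillna(station_median)

print(toy)
```

Grouping before filling matters when rows from several stations are interleaved: a plain `ffill()` on the whole column would carry one station's reading into another station's gap.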

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q2_cleaned_data.csv`
**Format:** CSV file
**Content:** Cleaned dataset with same structure as original (same columns)
**Requirements:**
- Same columns as original dataset
- Missing values handled (filled, dropped, or imputed)
- Outliers handled (removed, capped, or transformed)
- Data types validated and converted
- Duplicates removed
- **Sanity check:** Dataset should retain most rows after cleaning (at least 1,000 rows). If you're removing more than 50% of data, reconsider your strategy—imputation is usually preferable to dropping rows for this dataset.
- **No index column** (save with `index=False`)
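
The requirements above can be checked before saving. The sketch below is a hypothetical helper (`check_cleaned` and its `min_rows` parameter are not part of the assignment) that compares a cleaned frame against the original and returns a list of problems. The toy data is made up for illustration.

```python
import pandas as pd

def check_cleaned(original: pd.DataFrame, cleaned: pd.DataFrame, min_rows: int = 1000) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if list(cleaned.columns) != list(original.columns):
        problems.append("columns differ from the original")
    if cleaned.isna().sum().sum() > 0:
        problems.append("missing values remain")
    if cleaned.duplicated().any():
        problems.append("duplicate rows remain")
    if len(cleaned) < min_rows:
        problems.append(f"fewer than {min_rows} rows remain")
    if len(cleaned) < 0.5 * len(original):
        problems.append("more than 50% of rows were removed")
    return problems

# Hypothetical toy data: one missing humidity value and one row that becomes a duplicate.
original = pd.DataFrame({
    "Humidity": [50.0, None, 70.0, 70.0],
    "Station Name": ["A", "A", "B", "B"],
})
cleaned = original.fillna({"Humidity": original["Humidity"].median()}).drop_duplicates()

# min_rows is lowered only because the toy frame is tiny.
print(check_cleaned(original, cleaned, min_rows=1))
```

With the real data, the same call with the default `min_rows=1000` could run just before `to_csv(..., index=False)`.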

### 2. `output/q2_cleaning_report.txt`
**Format:** Plain text file
**Content:** Detailed report of cleaning operations
**Required information:**
- Rows before cleaning: [number]
- Missing data handling method: [description]
  - Which columns had missing data
  - Method used (drop, forward-fill, impute, etc.)
  - Number of values handled
- Outlier handling: [description]
  - Detection method (IQR, z-scores, domain knowledge)
  - Which columns had outliers
  - Method used (remove, cap, transform)
  - Number of outliers handled
- Duplicates removed: [number]
- Data type conversions: [list any conversions]
- Rows after cleaning: [number]

**Example format:**
```
DATA CLEANING REPORT
====================

Rows before cleaning: 50000

Missing Data Handling:
- Water Temperature: 2500 missing values (5.0%)
  Method: Forward-fill (time series appropriate)
  Result: All missing values filled
  
- Air Temperature: 1500 missing values (3.0%)
  Method: Forward-fill, then median imputation for remaining
  Result: All missing values filled

Outlier Handling:
- Water Temperature: Detected 500 outliers using IQR method (3×IQR)
  Method: Capped at bounds [Q1 - 3×IQR, Q3 + 3×IQR]
  Bounds: [-5.2, 35.8]
  Result: 500 values capped

Duplicates Removed: 0

Data Type Conversions:
- Measurement Timestamp: Converted to datetime64[ns]

Rows after cleaning: 50000
```

### 3. `output/q2_rows_cleaned.txt`
**Format:** Plain text file
**Content:** Single integer number (total rows after cleaning)
**Requirements:**
- Only the number, no text, no labels
- No whitespace before or after
- Example: `50000`
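
A minimal sketch of writing and re-reading this file, assuming the helper names `write_row_count` and `read_row_count` (both hypothetical). It writes to a temporary directory so it can run anywhere; in the notebook the target would be `output/q2_rows_cleaned.txt`.

```python
from pathlib import Path
import tempfile

def write_row_count(path, n_rows: int) -> None:
    """Write only the integer, with no label and no trailing newline."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(str(int(n_rows)))

def read_row_count(path) -> int:
    """Read the file back and fail loudly if it holds anything but digits."""
    text = Path(path).read_text()
    if not text.isdigit():
        raise ValueError(f"expected only digits, got {text!r}")
    return int(text)

# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "q2_rows_cleaned.txt"
    write_row_count(target, 50000)
    raw_text = target.read_text()
    row_count = read_row_count(target)

print(repr(raw_text), row_count)
```

Using `write_text(str(...))` rather than `print(..., file=f)` avoids the trailing newline that `print` would add.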

---

## Requirements Checklist

- [ ] Missing data handling strategy chosen and implemented
- [ ] Outliers detected and handled (IQR method, z-scores, or domain knowledge)
- [ ] Data types validated and converted
- [ ] Duplicates identified and removed
- [ ] Cleaning decisions documented in report
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Handle missing data** - Choose appropriate strategy (drop, forward-fill, impute) based on data characteristics
2. **Detect and handle outliers** - Use IQR method or z-scores; decide whether to remove, cap, or transform
3. **Validate data types** - Ensure numeric and datetime columns are properly typed
4. **Remove duplicates**
5. **Document and save** - Write detailed cleaning report explaining your decisions
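
For step 2, here is a minimal sketch of IQR-based detection with capping. The helper `cap_outliers_iqr` and the sample values are hypothetical. The multiplier `k=3.0` follows the 3×IQR bounds used in the example report format, and the value `999.9` is included because a wind speed that large looks like a sensor sentinel rather than a real reading (an assumption, not something the dataset documents).

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 3.0):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR]; return (capped, bounds, count)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Count before capping so the number can go into the cleaning report.
    n_outliers = int(((series < lower) | (series > upper)).sum())
    capped = series.clip(lower=lower, upper=upper)
    return capped, (lower, upper), n_outliers

# Hypothetical wind speeds with one implausible value.
wind = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 999.9])
capped, bounds, n_capped = cap_outliers_iqr(wind)

print(f"Bounds: {bounds}, values capped: {n_capped}")
print(capped.tolist())
```

Capping keeps every row, which suits the sanity check on row retention. Removing rows is also acceptable as long as the report states which approach was used and how many values were affected.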

---

## Decision Points

- **Missing data:** Should you drop rows, impute values, or forward-fill? Consider: How much data is missing? Is it random or systematic? For time series, forward-fill is often appropriate.
- **Outliers:** Are they errors or valid extreme values? Use IQR method or z-scores to detect, then decide: remove, cap, or transform. Document your reasoning.
- **Data types:** Are numeric columns actually numeric? Are datetime columns properly formatted? Convert as needed.
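
The data type question can be explored with a short sketch like the one below. The toy frame is hypothetical, and the timestamp format string is an assumption inferred from values such as `01/01/2016 01:00:00 AM` in the sample rows. Passing `errors="coerce"` turns unparseable entries into `NaT`/`NaN`, so they can be counted and then handled like any other missing value.

```python
import pandas as pd

# Hypothetical raw values, all stored as strings, with one bad entry per column.
raw = pd.DataFrame({
    "Measurement Timestamp": ["01/01/2016 01:00:00 AM", "01/01/2016 01:00:00 PM", "not a date"],
    "Humidity": ["67", "65", "n/a"],
})

typed = raw.copy()

# Parse timestamps with an explicit 12-hour format; invalid strings become NaT.
typed["Measurement Timestamp"] = pd.to_datetime(
    typed["Measurement Timestamp"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Convert numeric text to numbers; invalid strings become NaN.
typed["Humidity"] = pd.to_numeric(typed["Humidity"], errors="coerce")

# Values that failed conversion show up as newly missing.
coerced = typed.isna().sum() - raw.isna().sum()

print(typed.dtypes)
print(coerced)
```

Converting the timestamp before sorting also matters: sorting the raw strings orders them lexicographically, so `01:00:00 PM` on one day sorts before `02:00:00 AM` on the same day.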

---

## Checkpoint

After Q2, you should have:
- [ ] Missing data handled
- [ ] Outliers addressed
- [ ] Data types validated
- [ ] Duplicates removed
- [ ] All 3 artifacts saved: `q2_cleaned_data.csv`, `q2_cleaning_report.txt`, `q2_rows_cleaned.txt`

---

**Next:** Continue to `q3_data_wrangling.md` for Data Wrangling.
