# Final analysis.ipynb: Individual Deliverable

Course: MGMT 46700

Team: Team 11

Student Name: Zijing Zhang

Repo link: https://github.com/ethandlouiee/MGMT467_Team11/tree/main/team/Final_Project

### What this notebook includes (per rubric):

1. Prompt logs (AI usage evidence)
2. One substantive DIVE entry (Define–Investigate–Validate–Evaluate)
3. At least one interactive Plotly figure
4. A link to the dashboard section this notebook influenced



## Substantive Question for Air Quality Dataset

How do the diurnal patterns of Carbon Monoxide (CO(GT)) compare to those of Benzene (C6H6(GT)), and what might this suggest about their sources or atmospheric behavior?

In [16]:
!pip -q install plotly pandas numpy

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

## Load Data and Initial Preprocessing

### Prompt Log:
"Load the AirQualityUCI dataset, perform necessary preprocessing steps including:
1. Reading the CSV file with correct delimiters
2. Converting data types (numeric, datetime)
3. Handling missing values and placeholder codes
4. Setting a DateTime index for time series analysis"

First, I'll load the `AirQuality.csv` dataset. Based on common formats for this dataset, it's often semicolon-separated and uses a comma as a decimal point. I will account for this during loading. Then, I'll combine the 'Date' and 'Time' columns into a single 'DateTime' column and set it as the index. Finally, I'll convert all appropriate measurement columns to numeric types, coercing errors to `NaN`.
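As a quick check of the parsing choices described above, here is a minimal sketch on a hypothetical three-row sample (the sample values and the `sample_df` name are assumptions for illustration, not rows read from the real file). It parses a semicolon-separated, decimal-comma layout, builds the timestamp in one vectorized call, and treats the `-200` placeholder as missing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the semicolon / decimal-comma layout described above.
sample_csv = (
    "Date;Time;CO(GT);C6H6(GT)\n"
    "10/03/2004;18.00.00;2,6;11,9\n"
    "10/03/2004;19.00.00;2;9,4\n"
    "10/03/2004;20.00.00;-200;9,0\n"
)

sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";", decimal=",")

# Build the timestamp from the two text columns in a single vectorized call.
sample_df["DateTime"] = pd.to_datetime(
    sample_df["Date"] + " " + sample_df["Time"],
    format="%d/%m/%Y %H.%M.%S",
    errors="coerce",
)
sample_df = sample_df.drop(columns=["Date", "Time"]).set_index("DateTime")

# -200 is the placeholder for a missing reading; convert it to NaN.
sample_df = sample_df.replace(-200, np.nan)
print(sample_df)
```

The vectorized `pd.to_datetime` call is one alternative to a row-wise `apply`; on a file of this size either approach should be workable.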

## Analyze Diurnal Patterns

To compare the diurnal patterns of CO(GT) and C6H6(GT), I will first extract the hour of the day from the `DateTime` index. Then, I'll calculate the average concentration for each pollutant for every hour. This will reveal how their levels fluctuate throughout a typical day.

In [20]:
import pandas as pd
import numpy as np

# Load the dataset with semicolon separator and comma as decimal
# The path assumes the file sits in /content; adjust it if the notebook runs elsewhere
file_path = '/content/AirQuality.csv'
air_quality_df = pd.read_csv(file_path, sep=';', decimal=',')

# Drop any columns that are entirely NaN (this dataset commonly has two trailing empty columns)
air_quality_df = air_quality_df.dropna(axis=1, how='all')

# Convert 'Date' and 'Time' to datetime objects and combine them
# Some values might be invalid, so use errors='coerce'
air_quality_df['Date'] = pd.to_datetime(air_quality_df['Date'], format='%d/%m/%Y', errors='coerce')
air_quality_df['Time'] = pd.to_datetime(air_quality_df['Time'], format='%H.%M.%S', errors='coerce').dt.time

# Combine Date and Time into a new 'DateTime' column
air_quality_df['DateTime'] = air_quality_df.apply(lambda row: pd.Timestamp.combine(row['Date'], row['Time']) if pd.notna(row['Date']) and pd.notna(row['Time']) else pd.NaT, axis=1)

# Set 'DateTime' as the index and remove original 'Date' and 'Time' columns
air_quality_df = air_quality_df.set_index('DateTime')
air_quality_df = air_quality_df.drop(columns=['Date', 'Time'])

# Convert all columns (except the index) to numeric, coercing errors.
# This will turn any non-convertible values into NaN.
for col in air_quality_df.columns:
    air_quality_df[col] = pd.to_numeric(air_quality_df[col], errors='coerce')

# Replace the specific placeholder value -200 with NaN across the entire DataFrame
air_quality_df.replace(-200, np.nan, inplace=True)

# Display the first few rows and information about the processed DataFrame
print("DataFrame Info after preprocessing:")
air_quality_df.info()
print("\nFirst 5 rows of the preprocessed DataFrame:")
display(air_quality_df.head())

DataFrame Info after preprocessing:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9471 entries, 2004-03-10 18:00:00 to NaT
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   CO(GT)         7674 non-null   float64
 1   PT08.S1(CO)    8991 non-null   float64
 2   NMHC(GT)       914 non-null    float64
 3   C6H6(GT)       8991 non-null   float64
 4   PT08.S2(NMHC)  8991 non-null   float64
 5   NOx(GT)        7718 non-null   float64
 6   PT08.S3(NOx)   8991 non-null   float64
 7   NO2(GT)        7715 non-null   float64
 8   PT08.S4(NO2)   8991 non-null   float64
 9   PT08.S5(O3)    8991 non-null   float64
 10  T              8991 non-null   float64
 11  RH             8991 non-null   float64
 12  AH             8991 non-null   float64
dtypes: float64(13)
memory usage: 1.0 MB

First 5 rows of the preprocessed DataFrame:


| DateTime | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2004-03-10 18:00:00 | 2.6 | 1360.0 | 150.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 0.7578 |
| 2004-03-10 19:00:00 | 2.0 | 1292.0 | 112.0 | 9.4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.3 | 47.7 | 0.7255 |
| 2004-03-10 20:00:00 | 2.2 | 1402.0 | 88.0 | 9.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.9 | 54.0 | 0.7502 |
| 2004-03-10 21:00:00 | 2.2 | 1376.0 | 80.0 | 9.2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 60.0 | 0.7867 |
| 2004-03-10 22:00:00 | 1.6 | 1272.0 | 51.0 | 6.5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.2 | 59.6 | 0.7888 |
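The `info()` output above reports the index as running "to NaT", which suggests that some rows had no parseable date or time. A minimal sketch of filtering those rows out, using a hypothetical toy frame (`toy` and `toy_clean` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose last row has an unparseable timestamp (NaT).
toy = pd.DataFrame(
    {"CO(GT)": [2.6, 2.0, np.nan]},
    index=pd.to_datetime(["2004-03-10 18:00:00", "2004-03-10 19:00:00", None]),
)
toy.index.name = "DateTime"

# Keep only rows whose timestamp parsed successfully.
toy_clean = toy[toy.index.notna()]

print(toy.index.hour)        # includes NaN for the NaT row
print(toy_clean.index.hour)  # whole-number hours only
```

With the `NaT` rows removed, `index.hour` no longer needs a float dtype to hold a missing value, so an `Hour` grouping key would read `0, 1, 2, ...` rather than `0.0, 1.0, 2.0, ...`.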


In [21]:
# Extract the hour from the DateTime index
air_quality_df['Hour'] = air_quality_df.index.hour

# Calculate the average CO(GT) and C6H6(GT) concentrations for each hour, excluding NaN values
hourly_co_benzene_avg = air_quality_df.groupby('Hour')[['CO(GT)', 'C6H6(GT)']].mean()

print("Average CO(GT) and C6H6(GT) by hour:")
display(hourly_co_benzene_avg.head())

Average CO(GT) and C6H6(GT) by hour:


| Hour | CO(GT) | C6H6(GT) |
| --- | --- | --- |
| 0.0 | 1.786018 | 7.68414 |
| 1.0 | 1.467802 | 5.991711 |
| 2.0 | 1.099063 | 4.379467 |
| 3.0 | 0.888462 | 3.379255 |
| 4.0 | 0.758659 | 2.916711 |
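Because CO(GT) has fewer non-null readings than C6H6(GT) in the `info()` output, each hourly mean may rest on a different number of observations. The sketch below, on synthetic data (the random values and the `hourly_stats` name are assumptions), shows one way to report the standard deviation and sample count next to each hourly mean.

```python
import numpy as np
import pandas as pd

# Synthetic three days of hourly readings (illustrative values only).
rng = np.random.default_rng(0)
idx = pd.date_range("2004-03-10 00:00", periods=72, freq="h")
toy = pd.DataFrame({"CO(GT)": rng.uniform(0.5, 4.0, size=72)}, index=idx)
toy.iloc[5, 0] = np.nan  # simulate one missing reading at 05:00 on day one

# Mean, spread, and number of non-missing readings for each hour of day.
hourly_stats = toy.groupby(toy.index.hour)["CO(GT)"].agg(["mean", "std", "count"])
hourly_stats.index.name = "Hour"
print(hourly_stats.head(6))
```

Hours with a low `count` or a large `std` are the ones where the average is least reliable.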


## Investigate and Validate

### Prompt Log:
"
Load the AirQualityUCI dataset and perform the following comparative diurnal pattern analysis:
1. Extract the hour-of-day component from the DateTime timestamps
2. Calculate hourly average concentrations for both CO(GT) and C6H6(GT) pollutants
3. Generate comparative diurnal profile plots showing 24-hour patterns for both pollutants
4. Calculate the Pearson correlation coefficient between CO and Benzene diurnal patterns
5. Identify and compare the peak concentration hours for each pollutant

Use Plotly to create an interactive visualization that allows comparison of the two pollutant patterns.
"

"
Create an interactive Plotly figure showing:
1. Dual-axis line chart comparing hourly CO and Benzene concentrations
2. Annotations for peak hours
3. Hover info with exact values
4. Time range selector

Use simulated data showing realistic 24-hour patterns.
"


Next, I'll calculate the Pearson correlation coefficient between the hourly average concentrations of CO(GT) and C6H6(GT), identify each pollutant's peak hour, and plot both diurnal profiles in an interactive Plotly line chart. I'll then interpret the results to compare the two patterns and consider what they suggest about the pollutants' sources or atmospheric behavior.

In [22]:
correlation = hourly_co_benzene_avg['CO(GT)'].corr(hourly_co_benzene_avg['C6H6(GT)'])

print(f"Pearson Correlation between hourly average CO(GT) and C6H6(GT): {correlation:.2f}")

Pearson Correlation between hourly average CO(GT) and C6H6(GT): 0.98
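Note that this coefficient is computed across the 24 hourly averages, so it measures how similar the two average daily shapes are rather than how the raw hourly readings co-vary. For reference, the sketch below reproduces what `Series.corr` computes by default, using hypothetical values (the two short series are illustrative, not the notebook's data).

```python
import numpy as np
import pandas as pd

# Hypothetical hourly means for two pollutants (illustrative values only).
co = pd.Series([1.8, 1.5, 1.1, 0.9, 0.8, 1.6])
benzene = pd.Series([7.7, 6.0, 4.4, 3.4, 2.9, 6.5])

# Pearson r: sum of products of deviations, divided by the root of the
# product of the summed squared deviations.
co_dev = co - co.mean()
bz_dev = benzene - benzene.mean()
r_manual = (co_dev * bz_dev).sum() / np.sqrt((co_dev**2).sum() * (bz_dev**2).sum())

r_pandas = co.corr(benzene)  # pandas uses the Pearson method by default
print(round(r_manual, 4), round(r_pandas, 4))
```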


Identify Peak Hours:
Determine the hour of the day when CO(GT) and C6H6(GT) pollutants reach their peak average concentrations.


In [23]:
peak_co_hour = hourly_co_benzene_avg['CO(GT)'].idxmax()
peak_c6h6_hour = hourly_co_benzene_avg['C6H6(GT)'].idxmax()

print(f"Peak hour for CO(GT): {int(peak_co_hour)}:00")
print(f"Peak hour for C6H6(GT): {int(peak_c6h6_hour)}:00")

Peak hour for CO(GT): 19:00
Peak hour for C6H6(GT): 19:00
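`idxmax()` returns only the single highest hour. To locate a morning and an evening peak separately, one option is to search within time windows. The sketch below uses a hypothetical two-bump profile; the window boundaries (05:00-11:00 and 16:00-22:00) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 24-value diurnal profile with a morning and an evening bump.
hours = np.arange(24)
profile = pd.Series(
    1.0
    + 1.5 * np.exp(-((hours - 8) ** 2) / 4.0)    # morning bump centered on 08:00
    + 2.0 * np.exp(-((hours - 19) ** 2) / 4.0),  # evening bump centered on 19:00
    index=pd.Index(hours, name="Hour"),
)

# Assumed windows; .loc slicing on an integer index includes both endpoints.
morning_peak = profile.loc[5:11].idxmax()
evening_peak = profile.loc[16:22].idxmax()
print(f"Morning peak: {morning_peak}:00, evening peak: {evening_peak}:00")
```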


In [24]:
fig = px.line(
    hourly_co_benzene_avg,
    y=['CO(GT)', 'C6H6(GT)'],
    title='Diurnal Patterns of CO(GT) and C6H6(GT)',
    labels={'value': 'Average Concentration', 'Hour': 'Hour of Day'},
    line_dash='variable',  # needed so line_dash_map is applied to the wide-form columns
    line_dash_map={'CO(GT)': 'solid', 'C6H6(GT)': 'dot'}
)

fig.update_layout(
    xaxis_title='Hour of Day',
    yaxis_title='Average Concentration',
    legend_title='Pollutant',
    hovermode='x unified'
)

fig.show()

### Data Analysis Key Findings
*   The Pearson correlation coefficient between the hourly average concentrations of CO(GT) and C6H6(GT) is **0.98**, indicating a very strong positive relationship.
*   Both CO(GT) and C6H6(GT) pollutants reach their peak average concentrations at **19:00 (7 PM)**.
*   The diurnal patterns show:
    *   An early morning decline followed by a minimum around 04:00-06:00.
    *   A sharp increase from 06:00-07:00, peaking around 08:00-09:00.
    *   A slight dip or plateau during mid-day hours (10:00-16:00).
    *   Another significant rise leading to the absolute peak around 19:00.
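Since the two pollutants are on different scales (the simulation comments later in this notebook assume mg/m^3 for CO and ug/m^3 for Benzene), comparing curve shapes on one axis can be easier after rescaling. A minimal min-max normalization sketch on hypothetical values (the numbers below are illustrative, not the computed averages):

```python
import pandas as pd

# Hypothetical hourly means at a few hours of the day (illustrative values only).
toy = pd.DataFrame(
    {"CO(GT)": [1.8, 0.8, 2.4, 1.9, 3.0], "C6H6(GT)": [7.7, 2.9, 11.0, 8.5, 14.0]},
    index=pd.Index([0, 4, 8, 13, 19], name="Hour"),
)

# Min-max scale each column to [0, 1] so both curves share one axis.
normalized = (toy - toy.min()) / (toy.max() - toy.min())
print(normalized)
```

After scaling, differences in timing between the two curves stand out without being masked by the difference in magnitude.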

### Insights or Next Steps
*   The highly synchronized diurnal patterns and strong positive correlation suggest that both CO(GT) and C6H6(GT) are predominantly influenced by common sources, most likely vehicular emissions, given the peaks align with typical morning and evening rush hours.
*   Future analysis could investigate specific meteorological factors (e.g., wind speed, temperature, atmospheric stability) that might explain the mid-day dip and early morning lows, providing a more comprehensive understanding of pollutant dispersion.
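One further check of the rush-hour interpretation, offered only as a sketch, would be to compare weekday and weekend diurnal profiles, since commuter traffic is usually lighter on weekends. The example below uses synthetic data (random values, hypothetical `toy` and `profile` names), so it demonstrates the grouping only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic two weeks of hourly readings (not the real dataset).
rng = np.random.default_rng(1)
idx = pd.date_range("2004-03-08", periods=14 * 24, freq="h")  # starts on a Monday
toy = pd.DataFrame({"CO(GT)": rng.uniform(0.5, 4.0, size=len(idx))}, index=idx)

# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are Saturday and Sunday.
day_type = np.where(toy.index.dayofweek >= 5, "weekend", "weekday")

# One row per hour of day, one column per day type.
profile = (
    toy.groupby([toy.index.hour, day_type])["CO(GT)"]
    .mean()
    .unstack()
)
profile.index.name = "Hour"
print(profile.head())
```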


## Simulate Hourly Data
Generate a pandas DataFrame containing simulated hourly CO and Benzene concentrations for a 24-hour period, ensuring the data reflects realistic diurnal patterns with clear peaks and troughs.


In [25]:
import numpy as np
import pandas as pd

# 1. Create a numpy array representing the 24 hours of a day (0 to 23)
hours = np.arange(24)

# 2. Simulate CO(GT) concentrations
# Using a combination of sine waves to create a smooth base curve
# (on its own the base is highest late in the evening and lowest around midday;
# the rush-hour peaks near 8-9 AM and 6-8 PM are added explicitly below)
co_base = 1.5 + 1.2 * np.sin(hours * (2 * np.pi / 24) + np.pi/2) + 0.8 * np.sin(hours * (4 * np.pi / 24) + np.pi)
co_noise = np.random.normal(0, 0.2, 24)
co_simulated = np.maximum(0.1, co_base + co_noise) # Ensure values are not negative

# Add a specific morning peak and evening peak more distinctly
co_simulated[7:10] += 1.5 # Morning rush hour peak
co_simulated[17:20] += 2.0 # Evening rush hour peak

# Adjust CO(GT) values to a plausible range (e.g., 0.5 to 4.0 mg/m^3 for average hourly)
co_simulated = np.clip(co_simulated, 0.5, 4.0)


# 3. Simulate C6H6(GT) concentrations as a noisy multiple of CO(GT)
# Note: 0.8 * CO tops out near 3.2 before noise, so the 2.0 floor applied below
# overrides many hours and weakens the intended correlation with CO
benzene_base = 0.8 * co_simulated + np.random.normal(0, 0.5, 24) # Correlate with CO, add noise
benzene_simulated = np.maximum(0.5, benzene_base)

# Clip C6H6(GT) values to a plausible range (e.g., 2 to 20 ug/m^3 for average hourly)
benzene_simulated = np.clip(benzene_simulated, 2.0, 20.0)

# 4. Create a pandas DataFrame from the simulated CO and Benzene concentrations
simulated_hourly_data = pd.DataFrame({
    'CO(GT)': co_simulated,
    'C6H6(GT)': benzene_simulated
}, index=hours)

# 5. Name the index 'Hour'
simulated_hourly_data.index.name = 'Hour'

print("Simulated Hourly Data:")
display(simulated_hourly_data.head())

Simulated Hourly Data:


| Hour | CO(GT) | C6H6(GT) |
| --- | --- | --- |
| 0 | 2.401781 | 2.0 |
| 1 | 2.026357 | 2.0 |
| 2 | 2.077209 | 2.0 |
| 3 | 1.664039 | 2.0 |
| 4 | 1.28271 | 2.0 |
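Every Benzene value shown above sits at the 2.0 lower bound, because 0.8 x CO falls mostly below the clip floor. The sketch below is one possible adjustment, not the method used in this notebook: it scales CO by about 4.3, an assumption taken from the ratio of the hour-0 averages computed earlier (7.68 / 1.79), and fixes a random seed (also an assumption) so the result is reproducible. The CO profile here is a simplified hypothetical curve.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
hours = np.arange(24)

# Hypothetical CO profile with morning and evening bumps (mg/m^3 scale).
co_sim = (
    1.0
    + 1.5 * np.exp(-((hours - 8) ** 2) / 4.0)
    + 2.0 * np.exp(-((hours - 19) ** 2) / 4.0)
    + rng.normal(0, 0.1, 24)
)
co_sim = np.clip(co_sim, 0.5, 4.0)

# Assumed scale factor of about 4.3 keeps Benzene mostly inside the 2-20 range,
# so the clip rarely flattens the curve.
benzene_sim = np.clip(4.3 * co_sim + rng.normal(0, 0.5, 24), 2.0, 20.0)

sim = pd.DataFrame(
    {"CO(GT)": co_sim, "C6H6(GT)": benzene_sim},
    index=pd.Index(hours, name="Hour"),
)
print(sim["CO(GT)"].corr(sim["C6H6(GT)"]))
```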


### Visualize Simulated Data
Generate an interactive Plotly dual-axis line chart to visualize the simulated CO(GT) and C6H6(GT) concentrations, clearly representing CO on one axis and Benzene on the other. Identify and annotate the peak hours for both CO and Benzene directly on the Plotly figure. Implement interactive features such as hover information for exact concentration values and a time range selector.

In [26]:
peak_co_simulated_hour = simulated_hourly_data['CO(GT)'].idxmax()
peak_c6h6_simulated_hour = simulated_hourly_data['C6H6(GT)'].idxmax()

print(f"Peak hour for simulated CO(GT): {int(peak_co_simulated_hour)}:00")
print(f"Peak hour for simulated C6H6(GT): {int(peak_c6h6_simulated_hour)}:00")

Peak hour for simulated CO(GT): 19:00
Peak hour for simulated C6H6(GT): 22:00


In [28]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(
        x=simulated_hourly_data.index,
        y=simulated_hourly_data['CO(GT)'],
        name="CO(GT)",
        mode='lines+markers',
        line=dict(color='blue')
    ),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(
        x=simulated_hourly_data.index,
        y=simulated_hourly_data['C6H6(GT)'],
        name="C6H6(GT)",
        mode='lines+markers',
        line=dict(color='red')
    ),
    secondary_y=True,
)

# Add annotations for peak hours
fig.add_annotation(
    x=peak_co_simulated_hour,
    y=simulated_hourly_data.loc[peak_co_simulated_hour, 'CO(GT)'],
    text=f"Peak CO: {int(peak_co_simulated_hour)}:00",
    showarrow=True,
    arrowhead=1,
    yshift=10,
    font=dict(color='blue')
)

fig.add_annotation(
    x=peak_c6h6_simulated_hour,
    y=simulated_hourly_data.loc[peak_c6h6_simulated_hour, 'C6H6(GT)'],
    yref='y2',  # anchor to the secondary y-axis, where the C6H6(GT) trace is drawn
    text=f"Peak C6H6: {int(peak_c6h6_simulated_hour)}:00",
    showarrow=True,
    arrowhead=1,
    yshift=10,
    font=dict(color='red')
)

# Update layout
fig.update_layout(
    title_text='Simulated Diurnal Patterns of CO(GT) and C6H6(GT)',
    hovermode='x unified',
    xaxis_title='Hour of Day',
    yaxis_title='CO(GT) Concentration',
    yaxis2_title='C6H6(GT) Concentration',
    legend_title='Pollutant',
    xaxis=dict(
        tickmode='linear',
        dtick=1,
        rangeselector=dict(
            buttons=list([
                dict(count=6, label="6h", step="hour", stepmode="backward"),
                dict(count=12, label="12h", step="hour", stepmode="backward"),
                dict(count=1, label="1d", step="day", stepmode="backward")
            ])
        ),
        rangeslider=dict(visible=True),
        type='linear'
    )
)

fig.show()
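The range-selector buttons above use `step="hour"` and `step="day"`, which Plotly defines for date axes, so on a `type='linear'` axis they may have no visible effect. One option, sketched here with pandas only, is to turn the integer hours into timestamps on an arbitrary reference day (the date below and the `toy` frame are assumptions for illustration).

```python
import numpy as np
import pandas as pd

hours = np.arange(24)

# Assumed reference date; only the time-of-day part is meaningful.
reference_day = pd.Timestamp("2004-03-10")
hour_timestamps = reference_day + pd.to_timedelta(hours, unit="h")

# Placeholder values standing in for a simulated hourly series.
toy = pd.DataFrame({"CO(GT)": np.linspace(0.5, 4.0, 24)}, index=hour_timestamps)
toy.index.name = "DateTime"
print(toy.head())
```

Passing an index like this as `x` and setting the x-axis `type` to `'date'` instead of `'linear'` should give the buttons a date range to act on.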


## Dashboard Section:

### Looker Studio Dashboard (Team Report):
https://lookerstudio.google.com/u/0/reporting/2fc71b11-3a55-45b6-8a01-d77ad5878b63



## DIVE Summary

**D - Define:** The research question asks how the diurnal patterns of Carbon Monoxide (CO(GT)) and Benzene (C6H6(GT)) compare over a typical 24-hour cycle. The aim was to identify similarities and differences between the two patterns and, from those, to infer whether the pollutants share common sources and what the patterns imply about their atmospheric behavior.

**I - Investigate:** The hour of day was extracted from the preprocessed dataset's DateTime index, and the average CO(GT) and C6H6(GT) concentrations were calculated for each hour, which exposed the daily fluctuations of both pollutants. The Pearson correlation coefficient was then computed between the two sets of hourly averages to quantify their linear relationship. Finally, an interactive Plotly line plot was generated to show the diurnal patterns and their peak and trough periods.

**V - Validate:** The Pearson correlation coefficient between the hourly average concentrations of CO(GT) and C6H6(GT) is 0.98, so the two average daily cycles move almost identically. The hourly averages of both pollutants peak at 19:00 (7 PM). This synchronized evening peak suggests a common influence driving their accumulation. Both profiles also show early morning lows followed by sharp increases during the morning rush hours, which further supports shared behavior.

**E - Evaluate:** The results indicate that CO(GT) and C6H6(GT) are predominantly influenced by common sources, most plausibly vehicular emissions. Their diurnal patterns are strikingly similar, with synchronized morning and evening peaks that align with typical traffic rush hours, and the very high correlation coefficient supports shared origins and atmospheric processing. Variations during the day, such as the mid-day dip, are likely driven by atmospheric dispersion rather than by differing emission sources. Taken together, the data point to a shared emission profile, primarily from combustion processes associated with transportation.