
# Introduction
Air pollutants are considered responsible for a range of respiratory diseases, and some compounds (e.g., benzene) are known to increase the risk of cancer with prolonged exposure (De Vito et al., 2008). To explore how low-cost chemical sensors behave in real urban conditions, this project analyzes the [air quality dataset](https://archive.ics.uci.edu/dataset/360/air+quality) from the UCI Machine Learning Repository. The dataset contains 9358 hourly measurements collected from March 2004 to February 2005 by an array of five metal-oxide chemical sensors deployed at street level in a heavily polluted Italian city. It includes hourly readings for CO, non-methane hydrocarbons (NMHC), benzene, total nitrogen oxides (NOx), nitrogen dioxide (NO2), and ozone (O3), along with meteorological variables: temperature (T), relative humidity (RH, %), and absolute humidity (AH).

The purpose of the project is to conduct an exploratory data analysis of how sensor signals and environmental conditions relate to the “true” benzene concentration, and to explore the sensor behavior, cross-sensitivities, and drift phenomena documented in the original study. Details of the variables are available on the UCI air quality repository page.


# Read in data and check the data structure
To read in the dataset, we first need to install the ucimlrepo package with the following code.

In [3]:
!pip install ucimlrepo # install ucimlrepo package

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


Import the dataset for analysis.

In [4]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
X = air_quality.data.features
y = air_quality.data.targets

Check dataset structure and variable information.

In [6]:
X.head() # check the dataset structure
# X.info() # check the data info

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888


- Clean up the data

Since we do not need the “true” reference values for CO, NMHC, NOx, or NO2, we remove those columns from the dataset and rename C6H6(GT), PT08.S1(CO), PT08.S2(NMHC), PT08.S3(NOx), PT08.S4(NO2), and PT08.S5(O3) to shorter, more intuitive variable names. The result is stored in a new data frame named X_df.

In [20]:
# drop variables that will not be analyzed and rename the remaining ones
X_df = X.drop(columns = ["CO(GT)", "NMHC(GT)", "NOx(GT)", "NO2(GT)"]) \
    .rename(columns = {"C6H6(GT)" : "Ben",          # rename variable
                        "PT08.S1(CO)": "CO",
                        "PT08.S2(NMHC)" : "NMHC",
                        "PT08.S3(NOx)" : "NOx",
                        "PT08.S4(NO2)" : "NO2",
                        "PT08.S5(O3)" : "O3" })
X_df.head()                                          # check the new data frame

Unnamed: 0,Date,Time,CO,Ben,NMHC,NOx,NO2,O3,T,RH,AH
0,3/10/2004,18:00:00,1360,11.9,1046,1056,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,1292,9.4,955,1174,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,1402,9.0,939,1140,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,1376,9.2,948,1092,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1272,6.5,836,1205,1490,1110,11.2,59.6,0.7888
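
One caveat worth keeping in mind with this step: by default, `DataFrame.rename` ignores mapping keys that do not match any column, so a typo in an original column name would pass silently. The following is a minimal sketch on a toy frame, not part of the original analysis. The names `toy` and `mapping` are illustrative, and the values are copied from the first two rows shown above. It shows how `errors="raise"` surfaces such a mismatch.

```python
import pandas as pd

# Toy frame with two of the original column names (first two rows of the data)
toy = pd.DataFrame({"C6H6(GT)": [11.9, 9.4], "PT08.S1(CO)": [1360, 1292]})

# Hypothetical mapping with a deliberate typo in the second key ("C0" instead of "CO")
mapping = {"C6H6(GT)": "Ben", "PT08.S1(C0)": "CO"}

# Default behaviour: the unmatched key is ignored and the column keeps its old name
silent = toy.rename(columns=mapping)
print(list(silent.columns))

# With errors="raise", the unmatched key triggers a KeyError instead
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Checking the column list after renaming, as `X_df.head()` does above, serves the same purpose in an interactive notebook.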


- Check for missing values (coded as -200) and replace them with NaN

In [24]:
import numpy as np                  # import numpy module
X_df = X_df.replace(-200, np.nan)   # replace the -200 missing-value code with NaN
X_df.isna().sum()                   # count the missing values in each column

Unnamed: 0,0
Date,0
Time,0
CO,366
Ben,366
NMHC,366
NOx,366
NO2,366
O3,366
T,366
RH,366
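
To illustrate why the sentinel is converted before any summary is computed, here is a small sketch on made-up data. The `toy` frame and its values are assumptions for illustration only. It shows that a raw -200 distorts a column mean, whereas NaN is skipped by default.

```python
import numpy as np
import pandas as pd

# Toy frame: -200 marks a missing reading (values are illustrative only)
toy = pd.DataFrame({
    "Ben": [11.9, -200.0, 9.0],
    "T": [13.6, 13.3, -200.0],
})

cleaned = toy.replace(-200, np.nan)        # sentinel -> NaN
missing_per_column = cleaned.isna().sum()  # one missing value per column here
print(missing_per_column)

# The raw sentinel drags the mean down; NaN is excluded from the mean by default
print(toy["Ben"].mean(), cleaned["Ben"].mean())
```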


Each measurement column has the same number of missing values (366), which suggests that they occur in the same rows, so we drop those rows.
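
Equal counts per column do not by themselves prove that the gaps fall in the same rows. One way to check this is sketched below on a toy frame. The data and the variable names are assumptions for illustration. It compares the number of rows with any missing value against the number of rows where every column is missing.

```python
import numpy as np
import pandas as pd

# Toy frame in which the missing values line up in rows 1 and 3
toy = pd.DataFrame({
    "CO":  [1360.0, np.nan, 1402.0, np.nan],
    "Ben": [11.9,   np.nan, 9.0,    np.nan],
    "T":   [13.6,   np.nan, 11.9,   np.nan],
})

rows_any_missing = toy.isna().any(axis=1).sum()  # rows with at least one NaN
rows_all_missing = toy.isna().all(axis=1).sum()  # rows where every column is NaN

# If the two counts agree, the missing values occur in exactly the same rows
same_rows = rows_any_missing == rows_all_missing
print(rows_any_missing, rows_all_missing, same_rows)
```

On `X_df`, the same comparison would be restricted to the measurement columns, since `Date` and `Time` contain no missing values.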

- Drop the missing values

In [35]:
X_df = X_df.dropna()     # drop the rows that contain missing values (NaN)
X_df.isna().sum()        # check whether any missing values remain in the data frame
X_df.describe().round(4) # get the summary table for the numeric variables

Unnamed: 0,CO,Ben,NMHC,NOx,NO2,O3,T,RH,AH
count,8991.0,8991.0,8991.0,8991.0,8991.0,8991.0,8991.0,8991.0,8991.0
mean,1099.8332,10.0831,939.1534,835.4936,1456.2646,1022.9061,18.3178,49.2342,1.0255
std,217.08,7.4498,266.8314,256.8173,346.2068,398.4843,8.8321,17.3169,0.4038
min,647.0,0.1,383.0,322.0,551.0,221.0,-1.9,9.2,0.1847
25%,937.0,4.4,734.5,658.0,1227.0,731.5,11.8,35.8,0.7368
50%,1063.0,8.2,909.0,806.0,1463.0,963.0,17.8,49.6,0.9954
75%,1231.0,14.0,1116.0,969.5,1674.0,1273.5,24.4,62.5,1.3137
max,2040.0,63.7,2214.0,2683.0,2775.0,2523.0,44.6,88.7,2.231
