<a href="https://colab.research.google.com/github/bingxiaochen/ST-554-Project1/blob/main/Project1_Hui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
Air pollutants are considered responsible for a range of respiratory diseases, and some compounds (e.g., benzene) are known to increase the risk of cancer with prolonged exposure (De Vito et al., 2008). To explore how low-cost chemical sensors behave in real urban conditions, this project analyzes the [air quality dataset](https://archive.ics.uci.edu/dataset/360/air+quality) from the UCI Machine Learning Repository. The dataset contains 9358 hourly measurements collected from March 2004 to February 2005 by an array of five metal-oxide chemical sensors deployed at street level in a heavily polluted Italian city. It includes reference hourly concentrations for CO, non-methane hydrocarbons (NMHC), benzene, total nitrogen oxides (NOx), and nitrogen dioxide (NO2); the responses of the five sensors, nominally targeted at CO, NMHC, NOx, NO2, and ozone (O3); and meteorological variables: temperature (T), relative humidity (RH, %), and absolute humidity (AH).

The purpose of the project is to conduct an exploratory data analysis of how sensor signals and environmental conditions relate to the “true” (reference) benzene concentration, and to explore the sensor behavior, cross-sensitivities, and drift phenomena documented in the original study. Details of the variables are available on the UCI air quality repository page.


# Read in data and check the data structure
To read in the dataset, we first need to install the ucimlrepo package with the following code.

In [3]:
!pip install ucimlrepo # install ucimlrepo package

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


Import the dataset for analysis.

In [4]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
X = air_quality.data.features
y = air_quality.data.targets

Check dataset structure and variable information.

In [5]:
X.head() # preview the first rows (not displayed here: a cell only echoes its last expression)
X.info() # print column names, non-null counts, and dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   int64  
 4   NMHC(GT)       9357 non-null   int64  
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   int64  
 7   NOx(GT)        9357 non-null   int64  
 8   PT08.S3(NOx)   9357 non-null   int64  
 9   NO2(GT)        9357 non-null   int64  
 10  PT08.S4(NO2)   9357 non-null   int64  
 11  PT08.S5(O3)    9357 non-null   int64  
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(5), int64(8), object(2)
memory usage: 1.1+ MB


- Clean up the data

Since we do not need the “true” reference values for CO, NMHC, NOx, or NO2, we can remove the columns CO(GT), NMHC(GT), NOx(GT), and NO2(GT) from the dataset, and rename the sensor columns PT08.S1(CO), PT08.S2(NMHC), PT08.S3(NOx), PT08.S4(NO2), and PT08.S5(O3) to more intuitive names.
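As a rough illustration of this clean-up step, the sketch below drops the four reference columns and renames the five sensor columns on a small toy data frame. The toy values are placeholders, and the new names (`sensor_co`, `sensor_nmhc`, and so on) are hypothetical choices made for this example, not names taken from the dataset or the original study.

```python
import pandas as pd

# Toy frame using the same column names as the UCI air quality data.
# The values are placeholders for illustration only.
toy = pd.DataFrame({
    "Date": ["3/10/2004", "3/10/2004"],
    "Time": ["18:00:00", "19:00:00"],
    "CO(GT)": [2.6, 2.0],
    "PT08.S1(CO)": [1360, 1292],
    "NMHC(GT)": [150, 112],
    "C6H6(GT)": [11.9, 9.4],
    "PT08.S2(NMHC)": [1046, 955],
    "NOx(GT)": [166, 103],
    "PT08.S3(NOx)": [1056, 1174],
    "NO2(GT)": [113, 92],
    "PT08.S4(NO2)": [1692, 1559],
    "PT08.S5(O3)": [1268, 972],
    "T": [13.6, 13.3],
    "RH": [48.9, 47.7],
    "AH": [0.76, 0.73],
})

# Reference ("ground truth") columns that are not needed for the benzene analysis.
reference_cols = ["CO(GT)", "NMHC(GT)", "NOx(GT)", "NO2(GT)"]

# Hypothetical, more readable names for the five sensor columns.
rename_map = {
    "PT08.S1(CO)": "sensor_co",
    "PT08.S2(NMHC)": "sensor_nmhc",
    "PT08.S3(NOx)": "sensor_nox",
    "PT08.S4(NO2)": "sensor_no2",
    "PT08.S5(O3)": "sensor_o3",
}

# drop() and rename() both return new frames, so `toy` itself is left unchanged.
cleaned = toy.drop(columns=reference_cols).rename(columns=rename_map)
print(cleaned.columns.tolist())
```

The same two calls could be applied to `X` in place of `toy`. Keeping the mapping in a dictionary makes the renaming easy to review and adjust in one place.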