<div style="text-align: left; background-color:#5A96E3; font-family:Times New Roman; color:#191414; padding: 12px; line-height:1.25;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 40px;border-style: solid;border-color: #5A96E3;"><strong>COURSE PROJECT</strong></div>
<div style="text-align: left; background-color:#FFFFFF; font-family: Times New Roman; color:black; padding: 12px; line-height:1.25;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 18px"><strong>| CSC14119 – INTRODUCTION TO DATA SCIENCE - GROUP 16 |</strong></div>

## Team Member Table
|<center><div style="width:150px">ID</div></center>|<center><div style="width:290px">NAME</div></center>|
|---------- |:-------------:|
| <center>21120112</center>  |<center>Bùi Kim Phúc</center> |

## 1. INTRODUCTION

Air pollution has become one of the most critical environmental challenges, directly affecting human health and ecosystems. Airborne pollutants, including particulate matter (PM2.5, PM10) and gases such as O₃, NO₂, SO₂, and CO, contribute to respiratory diseases and significantly increase the global rate of premature mortality. According to international reports, air pollution is one of the leading causes of severe health problems, particularly in densely populated urban areas.

This project uses pollutant concentrations and the Air Quality Index (AQI) collected for Ho Chi Minh City, a region under immense air pollution pressure in Vietnam. Its primary focus is to analyze the city's air quality in order to evaluate pollution levels, identify the major contributing factors, and provide scientifically grounded insights into how air quality changes over time.

Through the application of data analysis and machine learning models, this study not only seeks to offer a comprehensive understanding of the current state of air pollution but also aims to support the development of actionable recommendations to mitigate the adverse impacts of air pollution on public health and the environment.


## 1.1 Import Required Libraries

In [1]:
import requests
import json
import pandas as pd
import time
import datetime
import calendar
import seaborn as sns
import matplotlib.pyplot as plt

## 1.2 Data Collection

To conduct a scientific and reliable analysis of air quality, the team chose to collect data from OpenWeatherMap (https://openweathermap.org) – one of the most reputable platforms for meteorological and environmental data worldwide. OpenWeatherMap offers detailed information on air quality, including pollutant concentrations such as PM2.5, PM10, O₃, NO₂, SO₂, CO, and the Air Quality Index (AQI) for numerous cities and regions globally.

The data collected by the team focuses on Ho Chi Minh City, an area under significant pressure from air pollution due to urbanization and the increasing number of vehicles. The data sourced from OpenWeatherMap not only ensures timeliness but also provides high accuracy, serving as a robust foundation for the analysis and development of air quality prediction models in this project. Using this data source enables the team to gain a detailed understanding of pollution trends and contributing factors, thereby facilitating the formulation of appropriate insights and recommendations.

API call: http://api.openweathermap.org/data/2.5/air_pollution/history?lat={lat}&lon={lon}&start={start}&end={end}&appid={API key}

Example of an API request: http://api.openweathermap.org/data/2.5/air_pollution/history?lat=10.8231&lon=106.6297&start=1606223802&end=1606482999&appid={API key}

| **Parameter** | **Requirement** | **Description**                                                                                                                                                              |
|---------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `lat`         | Required        | Latitude. If you need the geocoder to automatically convert city names and zip codes to geo-coordinates and vice versa, please use the Geocoding API.                    |
| `lon`         | Required        | Longitude. If you need the geocoder to automatically convert city names and zip codes to geo-coordinates and vice versa, please use the Geocoding API.                   |
| `start`       | Required        | Start date (unix time, UTC time zone), e.g., `start=1606488670`.                                                                                                         |
| `end`         | Required        | End date (unix time, UTC time zone), e.g., `end=1606747870`.                                                                                                             |
| `appid`       | Required        | Your unique API key (you can always find it on your account page under the "API key" tab).                                                                               |
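
As a rough illustration of how the parameters in the table map onto the query string, the sketch below assembles a request URL with the standard library. The helper name `build_history_url` and the placeholder key `YOUR_API_KEY` are assumptions made for this example; the coordinates are the Ho Chi Minh City values used later in this notebook, and the timestamps are the ones from the table.

```python
from urllib.parse import urlencode

# Endpoint taken from the API call shown above.
BASE_URL = "http://api.openweathermap.org/data/2.5/air_pollution/history"


def build_history_url(lat, lon, start, end, api_key):
    """Return the full request URL (hypothetical helper, for illustration only)."""
    query = urlencode(
        {"lat": lat, "lon": lon, "start": start, "end": end, "appid": api_key}
    )
    return f"{BASE_URL}?{query}"


# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_history_url(10.8231, 106.6297, 1606488670, 1606747870, "YOUR_API_KEY")
print(url)
```

Passing the same dictionary to `requests.get(..., params=...)` should produce an equivalent URL, so building it by hand is mainly useful for debugging.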


In [2]:
def convert_to_utc_epoch(datetime_obj):
    """Treat a naive datetime as UTC and return its Unix epoch time as a string."""
    format_str = "%Y-%m-%d %H:%M:%S"
    datetime_str = datetime_obj.strftime(format_str)  # Convert the datetime to a string
    time_tuple = time.strptime(datetime_str, format_str)  # Parse the string into a time tuple
    utc_epoch = calendar.timegm(time_tuple)  # Convert the time tuple to epoch seconds (UTC)
    return str(utc_epoch)
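
The conversion above can probably be written more directly. The sketch below is one possible alternative, assuming the input is a naive `datetime` that should be read as UTC. The name `to_utc_epoch` is hypothetical, and unlike the function above it returns an integer rather than a string.

```python
import calendar
from datetime import datetime, timezone


def to_utc_epoch(dt):
    """Interpret a naive datetime as UTC and return Unix seconds (illustrative helper)."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


start = datetime(2021, 1, 1)

# Both routes should agree: calendar.timegm also treats the time tuple as UTC.
via_timegm = calendar.timegm(start.timetuple())
via_timestamp = to_utc_epoch(start)
print(via_timegm, via_timestamp)
```

For 1 January 2021 both values come out as `1609459200`, which matches the first `dt` value in the collected data.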


## 1.3 Create a function to fetch data from the API

In [3]:
# Base URL of the API
base_url = 'http://api.openweathermap.org/data/2.5/air_pollution/history?'

# Fetch hourly air quality records and return them as a DataFrame
def fetch_air_quality_data(latitude: str, longitude: str, start_time: str, end_time: str, api_key: str) -> pd.DataFrame:
    params = {
        'lat': latitude,      # Latitude of the area where data is collected.
        'lon': longitude,     # Longitude of the area where data is collected.
        'start': start_time,  # Start time (Unix timestamp, UTC time zone).
        'end': end_time,      # End time (Unix timestamp, UTC time zone).
        'appid': api_key      # OpenWeatherMap API key.
    }
    response = requests.get(base_url, params=params, timeout=30)
    air_quality_df = pd.DataFrame()
    if response.status_code == 200:
        data = response.json()
        # Flatten each record: timestamp, AQI, and one column per pollutant
        extracted_data = [{'dt': item['dt'], 'aqi': item['main']['aqi'], **item['components']} for item in data['list']]
        air_quality_df = pd.DataFrame(extracted_data)
    else:
        # Report the failure instead of silently returning an empty DataFrame
        print(f"Request failed with status code {response.status_code}")
    return air_quality_df
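
To make the flattening step easier to follow, here is a small offline sketch that applies the same dictionary comprehension to a hand-written payload. The payload is trimmed to the fields the function reads (`dt`, `main.aqi`, `components`) and reuses the values of the first record shown later in this notebook. The helper name `flatten_records` is an assumption for this example.

```python
# Trimmed, hand-written payload: only the fields the fetch function reads.
sample_response = {
    "list": [
        {
            "dt": 1609459200,
            "main": {"aqi": 3},
            "components": {
                "co": 700.95, "no": 0.44, "no2": 35.99, "o3": 17.35,
                "so2": 32.90, "pm2_5": 20.33, "pm10": 26.64, "nh3": 8.99,
            },
        },
    ],
}


def flatten_records(payload):
    """Turn the nested 'list' entries into flat dicts, one per hourly record."""
    return [
        {"dt": item["dt"], "aqi": item["main"]["aqi"], **item["components"]}
        for item in payload.get("list", [])
    ]


rows = flatten_records(sample_response)
print(rows[0])
```

Each flat dictionary becomes one row when the list is passed to `pd.DataFrame`, which is why the resulting columns are `dt`, `aqi`, and one column per pollutant.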


Our group decided to fetch data from 1 January 2021 up to 30 November 2024, the most recent month at the time of collection.

In [4]:

start_date_epoch = convert_to_utc_epoch(datetime.datetime(2021, 1, 1))
end_date_epoch = convert_to_utc_epoch(datetime.datetime(2024, 11, 30))

# Ho Chi Minh City coordinates
latitude = '10.8231'
longitude = '106.6297'

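# API key issued by OpenWeatherMap, read from an environment variable so the secret
# is not stored in the notebook (the variable name below is an assumption).
import os
api_key = os.environ["OPENWEATHERMAP_API_KEY"]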

# fetch data
air_quality_df = fetch_air_quality_data(latitude, longitude, start_date_epoch, end_date_epoch, api_key)



In [5]:
air_quality_df

|  | dt | aqi | co | no | no2 | o3 | so2 | pm2_5 | pm10 | nh3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1609459200 | 3 | 700.95 | 0.44 | 35.99 | 17.35 | 32.90 | 20.33 | 26.64 | 8.99 |
| 1 | 1609462800 | 3 | 847.82 | 2.46 | 38.04 | 18.06 | 36.24 | 23.32 | 30.54 | 9.37 |
| 2 | 1609466400 | 3 | 894.55 | 5.25 | 38.39 | 23.25 | 41.01 | 24.16 | 31.93 | 9.25 |
| 3 | 1609470000 | 3 | 827.79 | 6.20 | 36.33 | 33.98 | 43.39 | 23.20 | 30.91 | 8.61 |
| 4 | 1609473600 | 2 | 660.90 | 3.69 | 29.13 | 54.36 | 35.76 | 19.50 | 25.60 | 6.21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33812 | 1732910400 | 2 | 600.81 | 1.30 | 37.70 | 5.99 | 23.13 | 21.54 | 27.61 | 9.25 |
| 33813 | 1732914000 | 2 | 554.08 | 0.75 | 35.99 | 8.85 | 23.13 | 20.50 | 26.39 | 8.36 |
| 33814 | 1732917600 | 2 | 567.44 | 0.64 | 36.67 | 10.19 | 24.80 | 22.20 | 28.90 | 8.04 |
| 33815 | 1732921200 | 2 | 600.81 | 0.76 | 37.36 | 9.66 | 26.23 | 24.03 | 32.19 | 8.61 |
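
The `dt` column holds Unix timestamps in UTC, and `aqi` uses OpenWeatherMap's 1 to 5 scale (Good, Fair, Moderate, Poor, Very Poor). The sketch below shows one way to decode a single reading using only the standard library. The names `AQI_LABELS`, `ICT`, and `describe_reading` are illustrative, and local time is assumed to be a fixed UTC+7 offset for Vietnam.

```python
from datetime import datetime, timedelta, timezone

# OpenWeatherMap's qualitative names for AQI levels 1 to 5.
AQI_LABELS = {1: "Good", 2: "Fair", 3: "Moderate", 4: "Poor", 5: "Very Poor"}

# Assumption: Vietnam local time as a fixed UTC+7 offset (no daylight saving).
ICT = timezone(timedelta(hours=7))


def describe_reading(dt_epoch, aqi):
    """Decode one record's timestamp and AQI level into readable values."""
    utc_time = datetime.fromtimestamp(dt_epoch, tz=timezone.utc)
    return {
        "utc": utc_time.isoformat(),
        "local": utc_time.astimezone(ICT).isoformat(),
        "aqi_label": AQI_LABELS[aqi],
    }


# First record of the collected data: dt=1609459200, aqi=3.
print(describe_reading(1609459200, 3))
```

With pandas, the same conversion could likely be applied to the whole column via `pd.to_datetime(air_quality_df['dt'], unit='s', utc=True)`.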


In [6]:
air_quality_df.to_csv('../Data/air_pollution.csv', sep=',', encoding='utf-8', index=False)
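
Before relying on the saved file, it may be worth checking whether the hourly series is complete. The sketch below compares the number of hourly slots between the first and last `dt` values in the preview with the number of rows returned. It assumes the index runs contiguously from 0 to 33815 (so 33816 rows) and that records are aligned to whole hours. The helper names `expected_hourly_count` and `find_gaps` are hypothetical.

```python
def expected_hourly_count(first_dt, last_dt, step=3600):
    """Number of hourly slots between two epoch timestamps, endpoints included."""
    return (last_dt - first_dt) // step + 1


def find_gaps(timestamps, step=3600):
    """Return (previous, next) pairs where consecutive timestamps are more than one step apart."""
    return [
        (prev, cur)
        for prev, cur in zip(timestamps, timestamps[1:])
        if cur - prev > step
    ]


# Values read from the preview: first and last dt, and the assumed row count.
first_dt, last_dt = 1609459200, 1732921200
rows_returned = 33816

expected = expected_hourly_count(first_dt, last_dt)
missing = expected - rows_returned
print(expected, missing)

# Tiny demonstration of gap detection on a made-up series with one missing hour.
print(find_gaps([0, 3600, 10800]))
```

Under these assumptions the count comes to 34296 expected slots, which would leave roughly 480 hours unaccounted for. Running `find_gaps` on the sorted `dt` column would show where such gaps fall.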