# Table of Contents:

### 0. Objectives

### 1. Loading Data

### 2. Data Cleaning & Preprocessing

### 3. Descriptive Analysis (Patterns)

### 4. Feature Engineering for Prediction

### 5. Baseline Forecasting Models

### 6. ML Models (Tree-based + Neural Nets)

### 7. Wrap-up & Presentation


## 0. Objectives

Descriptive:

- Identify daily, weekly, and seasonal patterns of PM2.5/PM10.

- Compare air quality across weekdays vs weekends and daytime vs nighttime.

- Visualize pollution cycles that align with traffic, heating, or special events (e.g., fireworks, holidays).

Predictive:

- Build models to predict next-day PM2.5 levels.

- Compare classical and machine learning approaches:

  - Baseline: ARIMA or Holt-Winters
  - ML: Random Forest, XGBoost
  - Deep Learning: LSTM / GRU


### Setup:


In [5]:
# --- Core data stack ---
import numpy as np
import pandas as pd

# --- Visualization ---
import matplotlib.pyplot as plt
import seaborn as sns

# --- Stats & time-series helpers ---
from scipy import stats                     # t-tests, basic stats
from statsmodels.tsa.seasonal import seasonal_decompose  # trend/seasonality

# --- Quality of life ---
from tqdm import tqdm  # progress bars for loops (e.g., many days/years)

## Load and Show the data!


In [7]:
# Loading PurpleAir Data Frames
df_purpleAir_44919 = pd.read_csv('44919 2016-08-26 2025-08-26 60-Minute Average.csv')
df_purpleAir_92387 = pd.read_csv('92387 2016-08-26 2025-08-26 60-Minute Average.csv')
df_purpleAir_217883 = pd.read_csv('217883 2016-08-26 2025-08-26 60-Minute Average.csv')

df_purpleAir_44919.head()

Unnamed: 0,time_stamp,humidity,temperature,pressure,voc,analog_input,pm2.5_alt|pm2.5_alt = C * (0.00030418*N1 0.0018512*N2 0.02069706*N3),deciviews,visual_range,0.3_um_count,...,1.0_um_count,2.5_um_count,5.0_um_count,10.0_um_count,pm1.0_cf_1,pm1.0_atm,pm2.5_atm,pm2.5_cf_1,pm10.0_atm,pm10.0_cf_1
0,2025-01-15T07:00:00+06:00,16,72,935.98,,0.04,10.6,16.9,71.6,2470.2,...,112.26,17.3,4.834,2.622,12.0,12.0,18.4,18.4,22.2,22.2
1,2025-01-15T08:00:00+06:00,18,56,936.21,,0.04,15.7,20.3,51.4,3657.7,...,164.92,26.63,6.509,3.412,20.3,19.2,29.3,30.9,36.0,36.5
2,2025-01-15T09:00:00+06:00,22,46,936.45,,0.05,16.4,20.0,52.8,3550.6,...,188.31,33.56,7.445,3.638,20.6,20.3,32.9,33.9,40.7,40.7
3,2025-01-15T10:00:00+06:00,20,50,936.39,,0.05,22.9,22.4,41.6,4652.3,...,275.75,48.06,10.694,5.239,26.8,23.9,39.5,46.1,51.4,55.6
4,2025-01-15T11:00:00+06:00,17,53,936.27,,0.05,14.2,19.3,56.4,3283.3,...,162.32,27.93,5.728,2.579,15.7,15.5,26.4,27.0,31.8,31.9


In [14]:
df_purpleAir_44919.describe()

Unnamed: 0,humidity,temperature,pressure,voc,analog_input,pm2.5_alt|pm2.5_alt = C * (0.00030418*N1 0.0018512*N2 0.02069706*N3),deciviews,visual_range,0.3_um_count,0.5_um_count,1.0_um_count,2.5_um_count,5.0_um_count,10.0_um_count,pm1.0_cf_1,pm1.0_atm,pm2.5_atm,pm2.5_cf_1,pm10.0_atm,pm10.0_cf_1
count,5333.0,5333.0,5333.0,0.0,5332.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0,5333.0
mean,30.63679,70.917495,929.868637,,0.043638,27.151416,19.671873,68.558241,5135.471273,1444.568611,337.455586,53.441378,13.772453,6.286568,31.905625,24.289162,41.189068,55.112676,52.084699,68.182205
std,12.183037,19.654373,6.650782,,0.004811,42.03111,7.621115,39.697254,6569.250133,1944.365056,584.650982,96.192155,25.958029,11.662299,67.878155,44.519029,67.760454,103.685645,83.295932,127.279819
min,7.0,21.0,915.2,,0.04,1.6,4.8,3.5,344.6,96.2,16.56,2.05,0.564,0.191,0.9,0.9,1.9,1.9,2.3,2.3
25%,21.0,53.0,924.59,,0.04,7.2,14.0,36.5,1707.2,453.8,74.17,10.97,3.246,1.48,8.0,8.0,12.0,12.0,15.2,15.2
50%,28.0,76.0,929.51,,0.04,11.7,17.7,66.3,2710.8,726.1,125.22,20.14,5.574,2.538,13.6,13.4,20.3,20.5,25.3,25.4
75%,39.0,87.0,934.59,,0.05,26.5,23.7,95.8,5381.0,1502.1,320.97,50.86,11.934,5.557,30.8,26.2,42.5,51.8,55.6,61.9
max,63.0,105.0,958.92,,0.05,502.7,47.1,240.7,61421.4,19760.8,7941.62,1514.77,431.094,189.362,2331.7,1554.7,1557.2,2334.5,1559.6,2337.1


In [15]:
df_purpleAir_92387.describe()

Unnamed: 0,humidity,temperature,pressure,voc,analog_input,pm2.5_alt|pm2.5_alt = C * (0.00030418*N1 0.0018512*N2 0.02069706*N3),deciviews,visual_range,0.3_um_count,0.5_um_count,1.0_um_count,2.5_um_count,5.0_um_count,10.0_um_count,pm1.0_cf_1,pm1.0_atm,pm2.5_atm,pm2.5_cf_1,pm10.0_atm,pm10.0_cf_1
count,39957.0,39957.0,39957.0,0.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0,39957.0
mean,19.94394,81.00896,926.661536,,0.02,15.661348,15.775579,100.082421,3240.631486,870.619231,168.493145,13.897153,3.17861,0.832086,17.449303,14.140586,20.990545,26.070318,23.727107,28.31877
std,5.090616,4.515667,6.467675,,1.093236e-14,26.514077,7.379462,58.532837,4542.470653,1305.992256,323.576579,30.544342,6.758318,1.868609,27.876624,18.371949,30.082495,45.458339,34.090651,50.234441
min,6.0,62.0,861.27,,0.02,0.1,0.5,3.7,29.6,7.8,1.17,0.05,0.021,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16.0,78.0,921.83,,0.02,4.1,10.8,54.7,1081.2,261.1,35.68,1.94,0.522,0.178,4.4,4.4,6.0,6.0,6.4,6.4
50%,20.0,81.0,926.29,,0.02,7.0,14.0,96.5,1689.3,430.2,64.79,3.76,1.018,0.328,8.1,8.1,10.9,10.9,11.6,11.6
75%,23.0,84.0,931.05,,0.02,15.8,19.6,132.4,3407.6,910.0,163.66,12.1,2.8,0.728,18.3,18.1,26.1,26.3,28.3,28.3
max,42.0,97.0,952.72,,0.02,450.2,46.5,370.3,57487.7,17631.2,6290.01,638.83,140.31,41.088,370.8,246.4,489.0,734.5,558.1,838.3


In [16]:
df_purpleAir_217883.describe()

Unnamed: 0,humidity,temperature,pressure,voc,analog_input,pm2.5_alt|pm2.5_alt = C * (0.00030418*N1 0.0018512*N2 0.02069706*N3),deciviews,visual_range,0.3_um_count,0.5_um_count,1.0_um_count,2.5_um_count,5.0_um_count,10.0_um_count,pm1.0_cf_1,pm1.0_atm,pm2.5_atm,pm2.5_cf_1,pm10.0_atm,pm10.0_cf_1
count,4898.0,4898.0,4898.0,0.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,19.289098,83.481421,924.192999,,0.0,11.858983,13.618416,113.772397,2133.342956,643.003348,133.679198,10.160135,2.480249,0.584667,11.979053,10.533463,16.666313,19.346325,18.679339,20.854614
std,4.547767,1.885903,5.716168,,0.0,16.244929,5.746021,48.688419,2533.300309,782.053027,203.377125,18.68759,4.493325,0.895643,15.650932,10.71574,18.726916,27.714401,21.66503,30.389882
min,6.0,76.0,910.76,,0.0,1.4,4.4,4.4,306.4,90.4,12.47,0.56,0.15,0.031,0.9,0.9,1.4,1.4,1.5,1.5
25%,16.0,82.0,919.63,,0.0,4.4,9.7,80.025,904.5,267.625,44.355,2.33,0.658,0.17,4.5,4.5,6.6,6.6,7.0,7.0
50%,19.0,83.0,923.85,,0.0,6.5,12.0,117.2,1292.55,383.5,67.265,3.8,1.0875,0.302,6.8,6.8,10.05,10.05,10.8,10.8
75%,22.0,85.0,928.63,,0.0,11.5,15.8,148.375,2151.625,643.0,126.525,8.7,2.2655,0.604,12.0,12.0,18.6,18.775,20.0,20.1
max,34.0,90.0,940.78,,0.0,395.2,44.9,251.4,49061.2,15670.4,5436.61,478.41,118.25,19.984,321.7,213.7,437.1,656.9,487.2,732.0


In [17]:
df_purpleAir_44919.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5333 entries, 0 to 5332
Data columns (total 21 columns):
 #   Column                                                                    Non-Null Count  Dtype  
---  ------                                                                    --------------  -----  
 0   time_stamp                                                                5333 non-null   object 
 1   humidity                                                                  5333 non-null   int64  
 2   temperature                                                               5333 non-null   int64  
 3   pressure                                                                  5333 non-null   float64
 4   voc                                                                       0 non-null      float64
 5   analog_input                                                              5332 non-null   float64
 6   pm2.5_alt|pm2.5_alt = C * (0.00030418*N1   0.0018512*N2   0.0206

In [18]:
df_purpleAir_92387.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39957 entries, 0 to 39956
Data columns (total 21 columns):
 #   Column                                                                    Non-Null Count  Dtype  
---  ------                                                                    --------------  -----  
 0   time_stamp                                                                39957 non-null  object 
 1   humidity                                                                  39957 non-null  int64  
 2   temperature                                                               39957 non-null  int64  
 3   pressure                                                                  39957 non-null  float64
 4   voc                                                                       0 non-null      float64
 5   analog_input                                                              39957 non-null  float64
 6   pm2.5_alt|pm2.5_alt = C * (0.00030418*N1   0.0018512*N2   0.02

In [19]:
df_purpleAir_217883.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 21 columns):
 #   Column                                                                    Non-Null Count  Dtype  
---  ------                                                                    --------------  -----  
 0   time_stamp                                                                4898 non-null   object 
 1   humidity                                                                  4898 non-null   int64  
 2   temperature                                                               4898 non-null   int64  
 3   pressure                                                                  4898 non-null   float64
 4   voc                                                                       0 non-null      float64
 5   analog_input                                                              4898 non-null   int64  
 6   pm2.5_alt|pm2.5_alt = C * (0.00030418*N1   0.0018512*N2   0.0206

## Define PurpleAir DF features:


### `humidity`

- **What:** Relative humidity.
- **Units:** % (0–100).
- **Notes:** Can be biased on some devices; consider smoothing and bounds checks.
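
The bounds check and smoothing mentioned in the note could look roughly like the sketch below. The sample values, the 3-hour window, and the choice of a rolling median are illustrative assumptions, not settings taken from this notebook.

```python
import pandas as pd

# Hypothetical hourly humidity readings with one out-of-range spike.
humidity = pd.Series([16, 18, 22, 140, 20, 17], dtype="float64")

# Bounds check: relative humidity outside 0-100 % is physically invalid,
# so such readings are replaced with NaN.
humidity_checked = humidity.where(humidity.between(0, 100))

# Light smoothing: centered 3-hour rolling median; NaN values are skipped.
humidity_smooth = humidity_checked.rolling(
    window=3, center=True, min_periods=1
).median()

print(humidity_smooth.tolist())
```

A median is used here instead of a mean because it is less sensitive to isolated spikes; a wider window would smooth more aggressively.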

---

### `temperature`

- **What:** Ambient temperature reported by the onboard environmental sensor.
- **Units:** °F (PurpleAir commonly reports Fahrenheit via API).
- **Notes:** Convert to °C if needed: `(F − 32) × 5/9`. Can read slightly high in sun or near walls.
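
As a small sketch of the conversion formula in the note, the helper below (a hypothetical name, `fahrenheit_to_celsius`) applies it to the first five `temperature` readings shown in the `head()` output.

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to Celsius; works on scalars and Series."""
    return (temp_f - 32) * 5 / 9

# First temperature readings from the head() output above, in °F.
temperature_f = pd.Series([72, 56, 46, 50, 53])
temperature_c = fahrenheit_to_celsius(temperature_f).round(1)

print(temperature_c.tolist())
```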

---

### `pressure`

- **What:** Barometric pressure at the sensor.
- **Units:** hPa (hectopascals).
- **Notes:** Varies with elevation and weather; Bishkek (~760 m) typically shows ~920–940 hPa.

---

### `voc`

- **What:** Volatile Organic Compounds signal or index (only on models with a VOC sensor).
- **Units:** Device/firmware dependent (often an index, sometimes absent).
- **Notes:** Many PurpleAir units do not report VOC, so expect `NaN` frequently. All three sensors loaded here have 0 non-null `voc` values.

---

### `analog_input`

- **What:** Raw analog input channel for external integrations.
- **Units:** Unitless (normalized 0–1) or device-specific voltage proxy.
- **Notes:** Often unused (constant near `0.02–0.05` or `0.0`). Treat as optional/diagnostic.
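
One way to flag diagnostic columns such as `voc` and `analog_input` before modelling is to look for columns that are entirely `NaN` or hold a single value. The helper name `uninformative_columns` and the miniature frame below are hypothetical; this is a sketch of the idea, not a step taken from the notebook.

```python
import numpy as np
import pandas as pd

def uninformative_columns(df):
    """Return names of columns that are all-NaN or contain one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Hypothetical miniature frame mimicking the sensor export.
demo = pd.DataFrame({
    "humidity": [16, 18, 22],
    "voc": [np.nan, np.nan, np.nan],     # never reported
    "analog_input": [0.02, 0.02, 0.02],  # constant channel
})

to_drop = uninformative_columns(demo)
demo_clean = demo.drop(columns=to_drop)

print(to_drop)
```

Note that a channel that flickers between two values (as `analog_input` does for sensor 44919, between 0.04 and 0.05) would not be caught by this rule and would need a variance threshold or a manual decision.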

---

### `pm2.5_alt`

- **What:** Alternate PM2.5 mass concentration derived from particle counts (the Lance Wallace method).
- **Units:** µg/m³.
- **Formula:**

  ```text
  pm2.5_alt = C * (0.00030418*N1 + 0.0018512*N2 + 0.02069706*N3)
  ```

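  where **C** is a constant “CF factor” (commonly 3.0; some use 3.4) and **N1**, **N2**, **N3** are the particle counts in the 0.3–0.5 µm, 0.5–1.0 µm, and 1.0–2.5 µm size bins, obtained by differencing the cumulative `*_um_count` columns.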

- **Notes:** Useful when comparing against `cf_1` and `atm` outputs; sensitive to count noise.
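
The formula can be checked against the data with a short sketch. It assumes that N1, N2, N3 are the differences between adjacent cumulative count columns and that C = 3.0; the function name `pm25_alt` is hypothetical. Because the formula is linear, applying it to the column means from `describe()` for sensor 44919 should land close to the reported mean of `pm2.5_alt` (27.15).

```python
import pandas as pd

# Coefficients from the formula above.
A1, A2, A3 = 0.00030418, 0.0018512, 0.02069706

def pm25_alt(counts, c=3.0):
    """Estimate pm2.5_alt (µg/m³) from cumulative particle counts.

    Assumption: N1, N2, N3 are the 0.3-0.5, 0.5-1.0 and 1.0-2.5 µm bins,
    obtained by differencing the cumulative ">= size" count columns.
    """
    n1 = counts["0.3_um_count"] - counts["0.5_um_count"]
    n2 = counts["0.5_um_count"] - counts["1.0_um_count"]
    n3 = counts["1.0_um_count"] - counts["2.5_um_count"]
    return c * (A1 * n1 + A2 * n2 + A3 * n3)

# Column means from describe() for sensor 44919.
means = pd.DataFrame({
    "0.3_um_count": [5135.471273],
    "0.5_um_count": [1444.568611],
    "1.0_um_count": [337.455586],
    "2.5_um_count": [53.441378],
})

estimate = pm25_alt(means)
print(round(float(estimate.iloc[0]), 2))
```

The same function can be applied row by row to a sensor DataFrame to compare the recomputed values with the exported column.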

---

### `deciviews`

- **What:** Logarithmic haze metric (higher = hazier).
- **Units:** deciview (dv).
- **Notes:** Derived from estimated light extinction; intended for visibility/haze interpretation rather than compliance reporting.

---

### `visual_range`

- **What:** Estimated visibility distance derived from aerosol extinction.
- **Units:** Typically km.
- **Notes:** Lower values indicate poorer visibility; derived (not directly measured).
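
The link between `deciviews` and `visual_range` can be sketched with the standard haze-index definition, dv = 10 · ln(b_ext / 10 Mm⁻¹), and the Koschmieder relation, VR = 3912 / b_ext km. Whether PurpleAir uses exactly these constants is an assumption, although the `head()` rows above are consistent with them (for example, 20.3 dv alongside 51.4 in `visual_range`). The function names are hypothetical.

```python
import math

def extinction_from_deciview(dv):
    """Light extinction in Mm^-1 from the haze index dv = 10 * ln(b_ext / 10)."""
    return 10.0 * math.exp(dv / 10.0)

def visual_range_km(dv):
    """Koschmieder visual range in km: VR = 3912 / b_ext, b_ext in Mm^-1."""
    return 3912.0 / extinction_from_deciview(dv)

# Deciview values taken from the head() output above.
for dv in (20.3, 22.4):
    print(dv, round(visual_range_km(dv), 1))
```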

---

### `0.3_um_count`, `0.5_um_count`, `1.0_um_count`, `2.5_um_count`, `5.0_um_count`, `10.0_um_count`

- **What:** Particle number counts in size bins ≥0.3 µm, ≥0.5 µm, …, ≥10 µm.
- **Units:** Particles per 0.1 L (Plantower convention) or device-specific count units.
- **Notes:** Extremely wide dynamic range—plot on **log scale**. Great for research/QC and diagnosing sensor issues.
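
To illustrate why a log scale helps, the sketch below takes the minimum, median, and maximum of `0.3_um_count` from the `describe()` output for sensor 44919, plus a zero (zeros do occur, for example in `10.0_um_count` for sensor 92387). Masking zeros before taking the logarithm is an illustrative choice; adding a small offset is another option.

```python
import numpy as np
import pandas as pd

# min, median and max of 0.3_um_count for sensor 44919, plus a zero reading.
counts = pd.Series([344.6, 2710.8, 61421.4, 0.0])

# log10 is undefined at zero, so non-positive counts are masked to NaN first.
log_counts = np.log10(counts.where(counts > 0))

# Number of decades spanned by the valid readings.
decades = log_counts.max() - log_counts.min()
print(round(float(decades), 2))
```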

---

### `pm1.0_cf_1`, `pm2.5_cf_1`, `pm10.0_cf_1`

- **What:** Mass concentrations computed by the sensor’s **CF=1** factory algorithm.
- **Units:** µg/m³.
- **Use:** Good for method comparison and diagnostics; may agree less closely with outdoor regulatory reference monitors.

---

### `pm1.0_atm`, `pm2.5_atm`, `pm10.0_atm`

- **What:** Mass concentrations with the sensor’s **“atmospheric”** correction.
- **Units:** µg/m³.
- **Use (recommended):** For outdoor reporting and AQI conversion, use `*_atm` (especially `pm2.5_atm`).
- **Notes:** Often aligns better with ambient aerosol than `cf_1`.


In [12]:
# Loading NOAA Weather DF:
df_weather = pd.read_csv('isd_lite_383530_99999_2016_2025.csv')
df_weather.head()

Unnamed: 0,datetime_utc,datetime_local,air_temp_c,dewpoint_c,slp_hpa,wind_dir_deg,wind_speed_ms,sky_cover_code,precip_1h_mm,precip_6h_mm,year,month,day,hour
0,2016-01-01 00:00:00+00:00,2016-01-01 06:00:00+06:00,4.9,-5.8,1023.4,280,1.0,8,,,2016,1,1,0
1,2016-01-01 03:00:00+00:00,2016-01-01 09:00:00+06:00,4.7,-3.7,1026.5,140,1.0,8,,,2016,1,1,3
2,2016-01-01 06:00:00+00:00,2016-01-01 12:00:00+06:00,11.6,-2.2,1024.7,140,2.0,5,,,2016,1,1,6
3,2016-01-01 09:00:00+00:00,2016-01-01 15:00:00+06:00,13.8,-2.2,1023.2,50,2.0,5,,,2016,1,1,9
4,2016-01-01 12:00:00+00:00,2016-01-01 18:00:00+06:00,9.5,-3.3,1024.6,90,1.0,5,,,2016,1,1,12


In [13]:
df_weather.describe()

Unnamed: 0,air_temp_c,dewpoint_c,slp_hpa,wind_dir_deg,wind_speed_ms,sky_cover_code,precip_1h_mm,precip_6h_mm,year,month,day,hour
count,26495.0,26465.0,26304.0,26532.0,22147.0,26532.0,0.0,3.0,26532.0,26532.0,26532.0,26532.0
mean,12.906243,2.589972,1017.827783,161.336876,1.472841,-1398.028155,,3.666667,2020.357154,6.405548,15.700814,10.526082
std,11.714525,6.898117,10.446068,231.462566,1.942067,3473.869635,,3.511885,2.79727,3.414833,8.790157,6.902611
min,-26.6,-29.0,994.5,-9999.0,0.0,-9999.0,,0.0,2016.0,1.0,1.0,0.0
25%,3.5,-2.3,1009.575,40.0,1.0,1.0,,2.0,2018.0,3.0,8.0,6.0
50%,13.6,3.4,1017.1,180.0,1.0,5.0,,4.0,2020.0,6.0,16.0,9.0
75%,22.2,8.1,1025.4,270.0,2.0,8.0,,5.5,2023.0,9.0,23.0,18.0
max,39.6,20.7,1058.9,360.0,53.0,9.0,,7.0,2025.0,12.0,31.0,21.0


In [20]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26532 entries, 0 to 26531
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   datetime_utc    26532 non-null  object 
 1   datetime_local  26532 non-null  object 
 2   air_temp_c      26495 non-null  float64
 3   dewpoint_c      26465 non-null  float64
 4   slp_hpa         26304 non-null  float64
 5   wind_dir_deg    26532 non-null  int64  
 6   wind_speed_ms   22147 non-null  float64
 7   sky_cover_code  26532 non-null  int64  
 8   precip_1h_mm    0 non-null      float64
 9   precip_6h_mm    3 non-null      float64
 10  year            26532 non-null  int64  
 11  month           26532 non-null  int64  
 12  day             26532 non-null  int64  
 13  hour            26532 non-null  int64  
dtypes: float64(6), int64(6), object(2)
memory usage: 2.8+ MB


## Define NOAA Weather DF features:


### `air_temp_c`

- **What:** Dry-bulb air temperature (near-surface).
- **Units:** °C (converted from tenths of °C in raw file).
- **Typical range:** −50 to 60 °C.
- **Missing code in raw:** `-9999` → set to `NaN`.

---

### `dewpoint_c`

- **What:** Dew point temperature (moisture content proxy).
- **Units:** °C (from tenths of °C).
- **Typical range:** −60 to 35 °C.
- **Missing:** `-9999`.

---

### `slp_hpa`

- **What:** Sea-level pressure.
- **Units:** hPa (from tenths of hPa).
- **Typical range:** 870–1085 hPa.
- **Missing:** `-9999`.

---

### `wind_dir_deg`

- **What:** Direction **from** which the wind blows.
- **Units:** degrees (0–360; 0/360 = north).
- **Conventions:** Calm winds can have speed 0 and any direction; some feeds use special codes for “variable.”
- **Missing:** often `-9999` in ISD-Lite → treat as `NaN`.

---

### `wind_speed_ms`

- **What:** Wind speed.
- **Units:** m/s (from tenths of m/s).
- **Typical range:** 0–60+ m/s (extremes possible in storms).
- **Missing:** `-9999`.

---

### `sky_cover_code`

- **What:** Total cloud cover (coded).
- **Codes (rule-of-thumb):**

  - `0–8` ≈ oktas (0 = clear, 8 = overcast)
  - `9` = sky obscured/indeterminable
  - **Missing:** `-9999`

- **Note:** Use as **categorical**; don’t average raw codes.
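
The `describe()` output above shows minima of `-9999` for `wind_dir_deg` and `sky_cover_code`, so the missing-value code still distorts their statistics. A minimal sketch of replacing it and treating sky cover as categorical follows; the three-row frame is hypothetical and only mimics the shape of `df_weather`.

```python
import numpy as np
import pandas as pd

MISSING = -9999  # ISD-Lite missing-value code noted above

# Hypothetical rows in the shape of df_weather.
weather = pd.DataFrame({
    "wind_dir_deg": [280, -9999, 140],
    "sky_cover_code": [8, 5, -9999],
})

# Replace the sentinel so it cannot distort means, minima, or model inputs.
weather = weather.replace(MISSING, np.nan)

# Treat sky cover as a category so it is counted, not averaged.
weather["sky_cover_code"] = weather["sky_cover_code"].astype("category")

print(weather["wind_dir_deg"].mean())
print(weather["sky_cover_code"].value_counts().to_dict())
```

Wind direction is still circular after cleaning, so an arithmetic mean of degrees should be read with care.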

---

### `precip_1h_mm`

- **What:** Precipitation in the last 1 hour.
- **Units:** mm (from tenths of mm).
- **Missing:** `-9999`. Zeros are valid (no rain).

---

### `precip_6h_mm`

- **What:** Precipitation in the last 6 hours.
- **Units:** mm (from tenths of mm).
- **Missing:** `-9999`.

---

### `year`, `month`, `day`, `hour`

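- **What:** Timestamp components of `datetime_utc` (ISD/ISD-Lite timestamps are in **UTC**).
- **Hour range:** 0–23 in principle. The maximum of 21 in `describe()` matches the 3-hourly reports (00, 03, …, 21 UTC) visible in `head()`, so it does not by itself indicate filtering or resampling.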
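
Since `datetime_utc` is loaded as `object`, the reporting interval can be checked after parsing. The sketch below uses the first four values from the `head()` output; the fixed +06:00 offset for local time is taken from the `datetime_local` column, and using a fixed offset rather than a named time zone is a simplifying assumption.

```python
from datetime import timedelta, timezone

import pandas as pd

# First datetime_utc values from the head() output above.
raw = pd.Series([
    "2016-01-01 00:00:00+00:00",
    "2016-01-01 03:00:00+00:00",
    "2016-01-01 06:00:00+00:00",
    "2016-01-01 09:00:00+00:00",
])

# Parse the strings as timezone-aware UTC timestamps.
ts_utc = pd.to_datetime(raw, utc=True)

# Most common gap between consecutive observations.
step = ts_utc.diff().mode().iloc[0]

# Convert to local time using the fixed +06:00 offset seen in datetime_local.
ts_local = ts_utc.dt.tz_convert(timezone(timedelta(hours=6)))

print(step)
print(ts_local.dt.hour.tolist())
```

The same parsing applied to the PurpleAir `time_stamp` column would put both sources on a common UTC index before any join.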
