### Download File

# Air Quality and Pollution Terms

## 1. NOx (Nitrogen Oxides)
- Refers to a group of gases composed of nitrogen and oxygen.
- The most common nitrogen oxides are nitrogen dioxide (NO2) and nitric oxide (NO).
- Produced from vehicle emissions, industrial processes, and combustion of fossil fuels.
- Contributes to the formation of smog and can have harmful effects on human health and the environment.

## 2. NO2 (Nitrogen Dioxide)
- A specific type of nitrogen oxide.
- Reddish-brown gas with a characteristic sharp, biting odor.
- Primarily produced from burning fossil fuels.
- A significant air pollutant that can irritate the respiratory system and is associated with health problems, including asthma and other lung diseases.

## 3. PM10 (Particulate Matter 10 micrometers or less)
- Refers to particulate matter that is 10 micrometers or smaller in diameter.
- Can include dust, pollen, soot, and smoke.
- Can be inhaled and may cause health issues, particularly respiratory problems, because the particles can penetrate the lungs.

## 4. PM2.5 (Particulate Matter 2.5 micrometers or less)
- Consists of finer particulate matter that is 2.5 micrometers or smaller.
- Originates from various sources, including vehicle emissions, industrial processes, and natural sources like wildfires.
- Particularly concerning for health as it can penetrate deep into the lungs and enter the bloodstream, leading to serious health effects, including cardiovascular and respiratory diseases.



#### Installs

In [1]:
!pip install requests pandas scikit-learn openpyxl


#### Method for downloading the file


In [2]:
import requests
from tempfile import TemporaryDirectory

emissions_url_excel = "https://data.london.gov.uk/download/london-atmospheric-emissions-inventory--laei--2019/17d21cd1-892e-4388-9fea-b48c1b61ee3c/LAEI-2019-Emissions-Summary-including-Forecast.zip"

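def download_dataset(url):
    # Create a temporary directory inside the current working directory.
    # It is deleted once the returned object is cleaned up or garbage-collected,
    # so the caller must keep a reference to it.
    tempdir = TemporaryDirectory(prefix="downloaded", suffix="datasets", dir=".")
    with requests.get(url) as response:
        # Fail early on HTTP errors instead of saving an error page as a zip.
        response.raise_for_status()
        with open(f"{tempdir.name}/datasets.zip", "wb") as f:
            f.write(response.content)
    return tempdir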
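
Because `download_dataset` returns the `TemporaryDirectory` object itself, the folder only lives as long as that object does. The following is a minimal sketch of that lifetime using only the standard library. The placeholder bytes stand in for the real archive and are an assumption for illustration, and the directory is created in the system temp location rather than in `.`.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

# Create a temporary directory much like download_dataset does,
# but in the system temp location so the sketch leaves nothing behind.
tempdir = TemporaryDirectory(prefix="downloaded", suffix="datasets")
path = Path(tempdir.name)

# Stand-in for the downloaded archive (hypothetical placeholder bytes).
(path / "datasets.zip").write_bytes(b"placeholder")
exists_before = path.exists()

# Explicit cleanup removes the directory and everything inside it.
tempdir.cleanup()
exists_after = path.exists()

print(exists_before, exists_after)
```

In the notebook this means `dir_` should stay in scope until the CSV files have been written and copied elsewhere if they need to survive the session.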

### Extract the file

In [3]:
from zipfile import ZipFile

def unzip(path):
    # Extract datasets.zip into the same directory that holds the archive.
    with ZipFile(f"{path}/datasets.zip") as zipf:
        zipf.extractall(path)
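
Before extracting, it can help to look at what the archive contains. The sketch below builds a tiny stand-in archive (the member name `summary/example.txt` is hypothetical, not a file from the LAEI download), lists its members with `namelist()`, and then extracts it the same way `unzip` does.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile

with TemporaryDirectory() as tmp:
    archive = Path(tmp) / "datasets.zip"

    # Build a small stand-in archive; the member name is made up for the demo.
    with ZipFile(archive, "w") as zipf:
        zipf.writestr("summary/example.txt", "hello")

    with ZipFile(archive) as zipf:
        # List the members before extracting them.
        members = zipf.namelist()
        zipf.extractall(tmp)

    # Nested folders inside the archive are recreated on extraction.
    extracted_text = (Path(tmp) / "summary" / "example.txt").read_text()

print(members, extracted_text)
```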

### Preparation

In [4]:
import pandas
from pathlib import Path

dir_ = download_dataset(emissions_url_excel)
unzip(dir_.name)
# Search only the extraction directory; rglob already recurses into subfolders.
files = Path(dir_.name).rglob("*.xlsx")
file = pandas.read_excel(next(files).as_posix(), sheet_name="Emissions by Grid ID")

In [8]:
file.info()
# All the required pollutant columns are present, along with extra columns that contain many null values.
# Pollutants of interest: nox, no2, pm10, pm2.5


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699120 entries, 0 to 699119
Data columns (total 30 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Year                  699120 non-null  int64  
 1   Grid ID 2019          699120 non-null  int64  
 2   LAEI 1km2 ID          699120 non-null  int64  
 3   Easting               699120 non-null  int64  
 4   Northing              699120 non-null  int64  
 5   Borough               699120 non-null  object 
 6   Zone                  699120 non-null  object 
 7   Main Source Category  699120 non-null  object 
 8   Sector                699120 non-null  object 
 9   Source                699120 non-null  object 
 10  bap                   162620 non-null  float64
 11  cd                    128020 non-null  float64
 12  c4h6                  186840 non-null  float64
 13  c6h6                  217980 non-null  float64
 14  ch4                   266420 non-null  float64
 15  

#### Filling missing values with the mean

In [14]:
key_pollutants = ["nox", "n2o", "pm10", "pm2.5", "co2"]
filled_na_with_mean = file[file.Year > 2019].copy()

for column in key_pollutants:
    colmean = filled_na_with_mean[column].mean()
    filled_na_with_mean[column] = filled_na_with_mean[column].fillna(colmean)

#### Filling missing values with the median

In [15]:
filled_na_with_median = file[file.Year < 2020].copy()
for column in key_pollutants:
    colmedian = filled_na_with_median[column].median()
    filled_na_with_median[column] = filled_na_with_median[column].fillna(colmedian)
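
To see how the two imputation strategies differ, here is a minimal sketch on a toy column. The values are invented for illustration and are not taken from the LAEI data. A single large value pulls the mean well above the median, so the choice of fill value matters for skewed pollutant columns.

```python
import pandas as pd

# Toy data (made up): one missing value and one large outlier.
toy = pd.DataFrame({"nox": [1.0, 2.0, None, 3.0, 100.0]})

# mean() and median() skip missing values by default.
mean_filled = toy["nox"].fillna(toy["nox"].mean())
median_filled = toy["nox"].fillna(toy["nox"].median())

print(mean_filled[2], median_filled[2])
```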

In [19]:
group_columns = ["Year", "Sector", *key_pollutants]

filled_na_with_mean[group_columns]\
.groupby(by=["Year", "Sector"])\
.sum()\
.reset_index()\
.to_csv(f"{dir_.name}/LAEI_2019_NA_FILLED_WITH_MEAN.csv", index=False)

# filled_na_with_median[group_columns]\
# .groupby(by=["Year", "Sector"])\
# .sum()\
# .reset_index()\
# .to_csv(f"{dir_.name}/LAEI_2019_NA_FILLED_WITH_MEDIAN.csv", index=False)

file[file.Year > 2019][group_columns]\
.groupby(by=["Year", "Sector"])\
.sum()\
.reset_index()\
.to_csv(f"{dir_.name}/LAEI_2019_TEST_DATA.csv", index=False)
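
One thing worth keeping in mind when comparing these exports: `sum()` skips missing values by default, so filling them before aggregating changes the group totals. The sketch below shows this on a toy frame. The sector labels and numbers are hypothetical and only mirror the column layout used above.

```python
import pandas as pd

# Toy data (made up) with one missing nox value.
toy = pd.DataFrame({
    "Year": [2025, 2025, 2025, 2030],
    "Sector": ["Transport", "Transport", "Domestic", "Transport"],
    "nox": [1.0, None, 5.0, 3.0],
})

# Aggregating the raw data: the missing value simply contributes nothing.
raw_totals = toy.groupby(by=["Year", "Sector"]).sum().reset_index()

# Aggregating after mean imputation: the filled value is added to its group.
filled = toy.copy()
filled["nox"] = filled["nox"].fillna(filled["nox"].mean())
filled_totals = filled.groupby(by=["Year", "Sector"]).sum().reset_index()

print(raw_totals)
print(filled_totals)
```

Here the 2025 Transport total rises from 1.0 to 4.0 once the gap is filled with the column mean of 3.0, which is the kind of difference to expect between the imputed CSV and the unfilled test CSV.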