# D-DUST Data Pre-Processing Notebook

See the Data Management Plan and the Variable list table at the following links: <br>
**1. [Data Management Plan (DMP)](https://docs.google.com/document/d/1n3PVat7PBTG76JnINOkL2pvBZuKQlakZkTgqNj39oAQ/edit#)**<br>
**2. [Variables list table](https://docs.google.com/spreadsheets/d/1-5pwMSc1QlFyC8iIaA-l1fWhWtpqVio2/edit#gid=91313358)**

This notebook describes the physical variables selected for the project and how they are preprocessed.
These variables are divided into <u>4 categories</u>, as shown in the Variable list table:
1. **Map Layer**: static layer used to describe Lombardy region morphology and its features (such as elevation, infrastructures, land use and cover etc.)
2. **Model**: data retrieved from a model that uses satellite and in-situ observations of meteorological and air quality data as input (such as ERA5, CAMS).
3. **Satellite**: data obtained directly from satellite observations (such as Sentinel-5P).
4. **Ground Sensor**: data retrieved from ground monitoring stations measuring air quality and meteorological variables.

## Import libraries

In [None]:
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import rasterio as rio
import rasterstats as rstat
import shapely.speedups
from shapely.geometry import MultiLineString, shape
shapely.speedups.enable()

In [None]:
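# Because "__file__" is quoted, this resolves to the current working directory
# (notebooks do not define the __file__ variable)
absolutepath = os.path.dirname(os.path.abspath("__file__"))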

In [None]:
print(absolutepath)

## Import grids

Three grids with different spatial resolution are used in this project:
1. **grid_cams**: 0.1° x 0.1° resolution - Grid with CAMS Model spatial resolution.
2. **grid_s5p**: 0.066° x 0.066° resolution - Grid with the Sentinel-5P approximate spatial resolution.
3. **grid_st**: 0.01° x 0.01° resolution - Grid generated with at most one ARPA monitoring station for each pixel.

Each grid covers the bounding box of the Lombardy region layer, extended by a 20 km buffer.

In [None]:
grid_cams_path = absolutepath + '/grid/grid0_1.gpkg'
grid_s5p_path = absolutepath + '/grid/grid0_066.gpkg'
grid_st_path = absolutepath + '/grid/grid0_01.gpkg'
grid_cams = gpd.read_file(grid_cams_path)
grid_s5p = gpd.read_file(grid_s5p_path)
grid_st = gpd.read_file(grid_st_path)
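
The grids loaded above are read from pre-built GeoPackage files. As a rough illustration of how a regular grid can be laid over a bounding box, the minimal sketch below enumerates cell bounds for a given resolution. The function name `make_grid_cells` and the example extent are hypothetical, and the 20 km buffer is assumed to have been applied to the bounds beforehand; this is not necessarily how the project grids were created.

```python
import math

def make_grid_cells(minx, miny, maxx, maxy, step):
    """Return (xmin, ymin, xmax, ymax) tuples covering the bounding box.

    Hypothetical helper: the extent is assumed to be already buffered.
    """
    # Small tolerance so that exact multiples do not add an extra column/row
    ncols = math.ceil((maxx - minx) / step - 1e-9)
    nrows = math.ceil((maxy - miny) / step - 1e-9)
    cells = []
    for i in range(nrows):
        for j in range(ncols):
            x0 = minx + j * step
            y0 = miny + i * step
            cells.append((x0, y0, x0 + step, y0 + step))
    return cells

# Made-up extent in degrees, not the real Lombardy bounding box
cells = make_grid_cells(8.0, 44.0, 10.0, 45.0, 0.5)
print(len(cells))
```

Each tuple could then be turned into a polygon, for instance with `shapely.geometry.box(xmin, ymin, xmax, ymax)`, and collected into a GeoDataFrame.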

Select the grid to process:

In [None]:
grid = grid_cams

In [None]:
# Unit conversion factors
m_to_km = 10**(-3)
m2_to_km2 = 10**(-6)

---

# Importing Map Layers

### [DUSAF - Land use - Geoportale Lombardia](https://www.geoportale.regione.lombardia.it/metadati?p_p_id=detailSheetMetadata_WAR_gptmetadataportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&_detailSheetMetadata_WAR_gptmetadataportlet_uuid=%7B18EE7CDC-E51B-4DFB-99F8-3CF416FC3C70%7D) <br>

DUSAF is a multi-temporal geographic database that classifies land by major land cover and land use types. Reference system: EPSG:4326.<br>
Land use classes:
- 2 = Agricultural areas.
- 3 = Forested land and semi-natural environments.
- 4 = Wetlands.
- 5 = Water bodies.
- 11 = Urbanized areas.
- 12 = Production settlements, large facilities and communication networks.
- 13 = Quarrying areas, landfills, construction sites, artificial and abandoned land.
- 14 = Non-agricultural green areas.

In [None]:
dusaf_path = absolutepath + '/land_use_cover/DUSAF6_dissolve_rast_4326.tiff'

In [None]:
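dusaf = rio.open(dusaf_path)
# Read the first band as float so that nodata pixels can be set to NaN
dusaf_array = dusaf.read(1).astype('float64')
# Values below 1 are treated as nodata
dusaf_array[dusaf_array < 1.0] = np.nan
affine = dusaf.transform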

In [None]:
stats = rstat.zonal_stats(grid, dusaf_array, affine=affine, nodata=np.nan, stats=['majority'], categorical=True)
majority_list = [{k: v for k, v in d.items() if k == 'majority'} for d in stats]
grid = grid.join(pd.DataFrame(majority_list), how='left')
grid = grid.rename(columns={"majority": "dusaf"})

In [None]:
# Class counts in each tile
stats2 = rstat.zonal_stats(grid, dusaf_array, affine=affine, nodata=np.nan, stats=['count'], categorical=True)
p = pd.DataFrame.from_dict(stats2, orient='columns')
p = p*m2_to_km2

# One column per DUSAF class code
for code in [2, 3, 4, 5, 11, 12, 13, 14]:
    grid[f'dsf{code}'] = p[float(code)]
grid['dsfSum'] = p['count']
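
The categorical zonal statistics above amount to counting the pixels of each class that fall inside a grid cell and picking the most frequent one. The sketch below reproduces that logic in plain Python for a single cell. The helper names and pixel values are made up, and the pixel area is an assumed input: multiplying raw counts by `m2_to_km2` gives km² only if each pixel covers 1 m², so in general the pixel area has to enter the conversion.

```python
from collections import Counter

def class_counts(pixels):
    """Count pixels per class, skipping None (nodata), and return the majority class."""
    counts = Counter(v for v in pixels if v is not None)
    majority = counts.most_common(1)[0][0] if counts else None
    return dict(counts), majority

def counts_to_km2(counts, pixel_area_m2):
    """Convert pixel counts to areas in km2; pixel_area_m2 is an assumed input."""
    return {cls: n * pixel_area_m2 * 1e-6 for cls, n in counts.items()}

# Made-up pixel values for one grid cell (2 = agricultural, 5 = water, 11 = urbanized)
pixels = [2, 2, 11, 2, None, 11, 2, 5]
counts, majority = class_counts(pixels)
areas = counts_to_km2(counts, pixel_area_m2=400.0)  # hypothetical 20 m x 20 m pixel
print(counts, majority, areas)
```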

 - - -

### [SIARL - Agricultural use - Geoportale Lombardia](https://www.geoportale.regione.lombardia.it/metadati?p_p_id=detailSheetMetadata_WAR_gptmetadataportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&_detailSheetMetadata_WAR_gptmetadataportlet_uuid=%7B83483117-8742-4A1F-A16E-3A48AEE2EBE2%7D) <br>
This layer contains the agricultural use of each cadastral parcel, as provided by the SIARL 2019 Catalog for the Lombardy region. Reference system: EPSG:4326. <br>
Agricultural use classes:
1. Other agricultural crops
2. Other cereals
3. Beet
4. Woods and tree crops
5. Floriculture and nursery crops
6. Vegetable crops
7. Forage crops
8. Fruit crops
9. Maize
10. Olive
11. Industrial plants and dried legumes
12. Rice
13. Seed crops
14. Unproductive and uncultivated land
15. Fallow land
16. Grapevine
17. Anthropized areas
18. Natural barren areas
19. Water bodies
20. Unclassifiable agricultural land
21. Natural vegetation

In [None]:
siarl_path = absolutepath + '/land_use_cover/siarl.tif'
siarl = rio.open(siarl_path)

In [None]:
siarl_array = siarl.read(1).astype('float64') 
siarl_array[siarl_array<1.0]=np.nan
affine = siarl.transform

In [None]:
stats_siarl = rstat.zonal_stats(grid, siarl_array, affine=affine, nodata=np.nan, stats=['majority'], categorical=True)
majority_list = [{k: v for k, v in d.items() if k == 'majority'} for d in stats_siarl]
grid = grid.join(pd.DataFrame(majority_list), how='left')
grid = grid.rename(columns={"majority": "siarl"})

In [None]:
# Class counts in each tile
stats2_siarl = rstat.zonal_stats(grid, siarl_array, affine=affine, nodata=np.nan, stats=['count'], categorical=True)
p = pd.DataFrame.from_dict(stats2_siarl, orient='columns')
p = p*m2_to_km2

# One column per SIARL class code (1 to 21)
for code in range(1, 22):
    grid[f'siarl{code}'] = p[float(code)]

grid['siarlSum'] = p['count']

- - -

### [Digital Terrain Model - Geoportale Lombardia](https://www.geoportale.regione.lombardia.it/metadati?p_p_id=detailSheetMetadata_WAR_gptmetadataportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&_detailSheetMetadata_WAR_gptmetadataportlet_uuid=%7BFC06681A-2403-481F-B6FE-5F952DD48BAF%7D)<br>
Digital Terrain Model of Lombardy region with 20 m resolution. Reference system EPSG:4326.
1. **Elevation** = DTM with 20 m resolution
2. **Aspect** = calculated previously from the elevation layer
3. **Slope** = calculated previously from the elevation layer

In [None]:
dtm_path = absolutepath + '/terrain/dtm20.tif'
aspect_path = absolutepath + '/terrain/aspect.tif'
slope_path = absolutepath + '/terrain/slope.tif'
dtm = rio.open(dtm_path)
aspect = rio.open(aspect_path)
slope = rio.open(slope_path)

In [None]:
# Cast to float so that NaN can be stored (integer arrays cannot hold NaN)
dtm_array = dtm.read(1).astype('float64')
dtm_array[dtm_array < 0] = np.nan
affine = dtm.transform
grid = grid.join(pd.DataFrame(rstat.zonal_stats(grid, dtm_array, affine=affine, nodata=np.nan, stats=['mean'])), how='left')
grid = grid.rename(columns={"mean": "h_mean"})

In [None]:
# Cast to float so that NaN can be stored (integer arrays cannot hold NaN)
aspect_array = aspect.read(1).astype('float64')
aspect_array[aspect_array < 0] = np.nan
affine = aspect.transform
grid = grid.join(pd.DataFrame(rstat.zonal_stats(grid, aspect_array, affine=affine, nodata=np.nan, stats=['mean'])), how='left')
grid = grid.rename(columns={"mean": "aspect_mean"})

In [None]:
# Cast to float so that NaN can be stored (integer arrays cannot hold NaN)
slope_array = slope.read(1).astype('float64')
slope_array[slope_array < 0] = np.nan
affine = slope.transform
grid = grid.join(pd.DataFrame(rstat.zonal_stats(grid, slope_array, affine=affine, nodata=np.nan, stats=['mean'])), how='left')
grid = grid.rename(columns={"mean": "slope_mean"})
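
The three terrain cells above follow the same pattern: negative values are replaced by NaN, then a mean that skips NaN is taken per grid cell. A minimal NumPy sketch of that pattern on made-up values is shown below; the `-9999` marker and the array contents are assumptions for illustration only.

```python
import numpy as np

# Made-up elevation values; -9999 stands in for a nodata marker
raw = np.array([[120, 135, -9999],
                [140, -9999, 150]], dtype="int16")

# Integer arrays cannot hold NaN, so cast to float before masking
values = raw.astype("float64")
values[values < 0] = np.nan

# nanmean ignores the masked pixels
cell_mean = np.nanmean(values)
print(cell_mean)
```

Note that a threshold at zero also discards any genuinely negative value, so it is only appropriate when valid data cannot be negative.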

- - -

### [Gridded Population of the World - GPW](https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev11)<br>
GPW provides raster estimates of population density for the year 2020, based on counts consistent with national censuses and population registers; the raster format facilitates data integration.
Input reference system: EPSG:4326.

In [None]:
pop_path = absolutepath + '/population/population.tif'
pop = rio.open(pop_path)

In [None]:
# Cast to float so that NaN can be stored (integer arrays cannot hold NaN)
pop_array = pop.read(1).astype('float64')
pop_array[pop_array < 0] = np.nan
affine = pop.transform
grid = grid.join(pd.DataFrame(rstat.zonal_stats(grid, pop_array, affine=affine, nodata=np.nan, stats=['sum'])), how='left')
grid = grid.rename(columns={"sum": "pop"})

 - - -

### [Road Infrastructures - Geoportale Lombardia (DBTR 2019)](https://www.geoportale.regione.lombardia.it/metadati?p_p_id=detailSheetMetadata_WAR_gptmetadataportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&_detailSheetMetadata_WAR_gptmetadataportlet_uuid=%7B17D4656F-2E9D-4951-9DC1-4AD32C0959B1%7D)

**Point layers** considered:
1. Intersection between primary roads including highways
2. Intersection between primary and secondary roads
3. Intersection between secondary roads

Input reference system: EPSG:4326.

In [None]:
int_prim_path = absolutepath + '/road_infrastructures/inters_highway_prim_road.gpkg'
int_prim_sec_path = absolutepath + '/road_infrastructures/inters_prim_sec_road.gpkg'
int_sec_path = absolutepath + '/road_infrastructures/inters_sec_road.gpkg'

In [None]:
int_prim = gpd.read_file(int_prim_path)
int_prim_sec = gpd.read_file(int_prim_sec_path)
int_sec = gpd.read_file(int_sec_path)

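# Map each output column name to its intersection layer
df_dict = {'int_prim': int_prim,
           'int_prim_sec': int_prim_sec,
           'int_sec': int_sec}

for key in df_dict:
    # Keep only the identifier and the geometry of each point
    poor_points = df_dict[key][['OBJECTID', 'geometry']]
    # Match each point with the grid cell(s) it intersects
    sjoined = gpd.sjoin(poor_points, grid)
    # Count points per cell; cells without points are left as NaN
    df_count = pd.DataFrame(sjoined.groupby('index_right').size())
    grid_join = grid.join(df_count)
    grid[key] = grid_join[0]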
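
The spatial join followed by `groupby('index_right').size()` counts how many points fall in each grid cell. The plain-Python sketch below illustrates the same idea with axis-aligned cells given as bounds; the function name, cells and points are made up, and real geometries would of course go through `gpd.sjoin` as above.

```python
def count_points_per_cell(points, cells):
    """Count points falling inside each axis-aligned cell.

    points: list of (x, y); cells: list of (xmin, ymin, xmax, ymax).
    Returns one entry per cell; None marks cells without points,
    mirroring the NaN left by the join in the notebook.
    """
    counts = [0] * len(cells)
    for x, y in points:
        for k, (xmin, ymin, xmax, ymax) in enumerate(cells):
            # A point on a shared edge is counted in every cell it touches,
            # as an "intersects" predicate would do
            if xmin <= x <= xmax and ymin <= y <= ymax:
                counts[k] += 1
    return [c if c > 0 else None for c in counts]

cells = [(0, 0, 1, 1), (1, 0, 2, 1), (2, 0, 3, 1)]
points = [(0.2, 0.5), (0.7, 0.1), (1.5, 0.5)]
print(count_points_per_cell(points, cells))
```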

In [None]:
grid = grid.to_crs(32632)

**Line layers** considered:
1. Highways
2. Primary roads
3. Secondary roads

Input reference system: EPSG:4326.

In [None]:
highway_path = absolutepath + '/road_infrastructures/highway.gpkg'
prim_road_path = absolutepath + '/road_infrastructures/prim_road.gpkg'
sec_road_path = absolutepath + '/road_infrastructures/sec_road.gpkg'

Lengths and areas must be computed in a projected reference system, so the layers are converted to EPSG:32632 (WGS 84 / UTM zone 32N), whose units are metres.

In [None]:
highway = gpd.read_file(highway_path).to_crs(32632)
prim_road = gpd.read_file(prim_road_path).to_crs(32632)
sec_road = gpd.read_file(sec_road_path).to_crs(32632)
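
To see why lengths should not be measured in EPSG:4326, the sketch below estimates the ground length of one degree of latitude and of longitude on a spherical Earth. The radius value, the function name and the latitude of 45.5° N (a rough value for Lombardy) are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def degree_lengths_km(lat_deg):
    """Approximate ground length of 1 degree of latitude and of longitude at lat_deg."""
    one_deg_lat = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    one_deg_lon = one_deg_lat * math.cos(math.radians(lat_deg))
    return one_deg_lat, one_deg_lon

# 45.5 N is used here as a rough latitude for Lombardy (an assumption)
lat_km, lon_km = degree_lengths_km(45.5)
print(round(lat_km, 1), round(lon_km, 1))
```

Since a degree of longitude is noticeably shorter than a degree of latitude at this latitude, a length expressed in degrees has no single conversion to kilometres, which is what the reprojection avoids.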

In [None]:
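df_dict = {'highway': highway, 'prim_road': prim_road, 'sec_road': sec_road}

for key in df_dict:
    grid[key] = np.nan
    poor_lines = df_dict[key][['geodb_oid', 'geometry']]
    for index, row in grid.iterrows():
        # Clip the road layer to the current cell and sum the clipped lengths (m)
        mask = row['geometry']
        clip = gpd.clip(poor_lines, mask)
        length_m = clip.geometry.length.sum()
        # Use .loc: chained assignment (grid[key].iloc[...]) may not write back to grid
        grid.loc[index, key] = length_m * m_to_km
    print(key)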
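
The loop above relies on `gpd.clip` to cut each road layer to a cell before summing lengths. For intuition only, the sketch below computes the length of a single straight segment inside an axis-aligned cell using Liang-Barsky clipping, a different and much simpler technique than the general geometric intersection GeoPandas performs. The function name and coordinates are made up.

```python
import math

def clipped_length(p0, p1, cell):
    """Length of segment p0-p1 inside an axis-aligned cell (xmin, ymin, xmax, ymax)."""
    (x0, y0), (x1, y1) = p0, p1
    xmin, ymin, xmax, ymax = cell
    dx, dy = x1 - x0, y1 - y0
    t_enter, t_exit = 0.0, 1.0
    # Each (p, q) pair encodes one cell edge: left, right, bottom, top
    for p, q in ((-dx, x0 - xmin), (dx, xmax - x0), (-dy, y0 - ymin), (dy, ymax - y0)):
        if p == 0:
            if q < 0:
                return 0.0  # parallel to this edge and outside it
            continue
        t = q / p
        if p < 0:
            t_enter = max(t_enter, t)  # segment enters through this edge
        else:
            t_exit = min(t_exit, t)    # segment leaves through this edge
    if t_enter >= t_exit:
        return 0.0
    return (t_exit - t_enter) * math.hypot(dx, dy)

# Made-up road segment and 1 km x 1 km cell, coordinates in metres
cell = (0.0, 0.0, 1000.0, 1000.0)
length_m = clipped_length((-500.0, 500.0), (1500.0, 500.0), cell)
print(length_m * 10**(-3))  # converted to km as in the notebook
```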

 - - -

### Farms
Vector file obtained from DUSAF 2018 (features with code 12112 = "Agricultural production settlements.
This class includes buildings used for productive activities of the primary sector, such as sheds, garages for agricultural machinery, barns, stables, silos, etc., together with their accessory spaces. When such buildings occur together with residential ones, forming a rural cluster, and the two types cannot be clearly separated, the whole cluster is classified as a farmstead (cascina, 11231)").

In [None]:
farms_path = absolutepath + '/farms/farms_dissolve.gpkg'
farms = gpd.read_file(farms_path)

In [None]:
df_dict2 = {'farms':farms}

In [None]:
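for key in df_dict2:
    grid[key] = np.nan
    poor_poly = df_dict2[key][['COD_TOT', 'geometry']]
    for index, row in grid.iterrows():
        # Clip the polygons to the current cell and sum the clipped areas (m2)
        mask = row['geometry']
        clip = gpd.clip(poor_poly, mask)
        area_m2 = clip.geometry.area.sum()
        # Use .loc: chained assignment (grid[key].iloc[...]) may not write back to grid
        grid.loc[index, key] = area_m2 * m2_to_km2
    print(key)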

---

In [None]:
grid.to_crs(4326).to_file("grid_prova.gpkg", driver="GPKG")