### Assigning air pollution levels to each local authority
#### Creation of the datasets covid_air_dt and covid_air_dt_5YA
This Python notebook is taken from the authors of the article.

First, the official longitude and latitude of each local authority in England are generated (with OpenCageGeocode). 

Then, the authors use a function in sklearn to find the 10 modelled air pollutant values nearest to the center of each city.

Finally, the average of those values is matched to each local authority.

Import modules, paths and functions

In [1]:
%run conf_files.ipynb
%run functions.ipynb

Read the first dataset, [merged_covid_dt](../Data/datasets.ipynb#merged_covid_dt), created in the [aim2](./aim2.ipynb#merged_covid_dt) notebook.

In [31]:
merged_covid_dt = pd.read_csv("../data_out/merged_covid_dt.csv")

Add latitude and longitude to each subregion

In [3]:
# Geocode each local authority to obtain its latitude and longitude
list_lat = []   # latitudes, in row order
list_lon = []   # longitudes, in row order

for index, row in merged_covid_dt.iterrows():  # iterate over rows in dataframe
    query = row['Name'] + ', England, UK'
    results = geocoder.geocode(query)
    # use the first match returned by the geocoder
    lat = results[0]['geometry']['lat']
    lon = results[0]['geometry']['lng']
    list_lat.append(lat)
    list_lon.append(lon)

# create new columns from the lists
merged_covid_dt['lat'] = list_lat
merged_covid_dt['lon'] = list_lon
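
The loop above assumes that every query returns at least one match. A minimal sketch of a more defensive variant follows. `geocode_names` and its `geocode_fn` argument are hypothetical names introduced here, and `fake_geocode` only stands in for `geocoder.geocode` so that the example runs offline. The one assumption about the real geocoder is the result shape already used above: a list of dicts with `['geometry']['lat']` and `['geometry']['lng']`.

```python
import math


def geocode_names(names, geocode_fn, suffix=", England, UK"):
    """Return (lats, lons) for each name, with NaN where no match is found.

    geocode_fn is any callable that takes a query string and returns a
    list of results shaped like those used in the loop above.
    """
    lats, lons = [], []
    for name in names:
        results = geocode_fn(name + suffix)
        if results:  # non-empty list: use the first match
            lats.append(results[0]["geometry"]["lat"])
            lons.append(results[0]["geometry"]["lng"])
        else:  # no match: keep row alignment with a placeholder
            lats.append(math.nan)
            lons.append(math.nan)
    return lats, lons


# Offline stand-in for geocoder.geocode (hypothetical, for illustration only)
def fake_geocode(query):
    known = {
        "Hartlepool, England, UK": [
            {"geometry": {"lat": 54.685728, "lng": -1.20937}}
        ]
    }
    return known.get(query, [])


lats, lons = geocode_names(["Hartlepool", "Nowhere"], fake_geocode)
```

With this shape, rows that could not be geocoded show up as missing values and can be inspected before the nearest-neighbour step, instead of stopping the loop with an `IndexError`.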

Look at the dataset

In [4]:
merged_covid_dt.head()

Unnamed: 0.1,Unnamed: 0,Code,Mean_ann_earnings,median_age_2018,Name,2018 people per sq. km,deaths,lat,lon
0,1,E06000001,25985.0,41.8,Hartlepool,997,20,54.685728,-1.20937
1,2,E06000002,22878.0,36.2,Middlesbrough,2608,60,54.576042,-1.234405
2,3,E06000003,23236.0,45.0,Redcar and Cleveland,558,26,54.567906,-1.005496
3,4,E06000004,26622.0,40.4,Stockton-on-Tees,962,26,54.564094,-1.312916
4,0,E06000005,26908.0,43.1,Darlington,540,14,54.524208,-1.555581


Save the dataset

In [6]:
merged_covid_dt.to_csv("../data_out/merged_covid_dt_LL.csv", index=False)
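
The `Unnamed: 0` columns visible in the output above are consistent with a CSV that was saved together with its index at some earlier step. As a small illustration of why `index=False` is used here, the sketch below round-trips a toy DataFrame through an in-memory buffer both ways; the toy data reuse two rows of the table above.

```python
import io

import pandas as pd

df = pd.DataFrame({"Code": ["E06000001", "E06000002"], "deaths": [20, 60]})

# Default behaviour: the index is written as an unnamed first column,
# which pandas labels "Unnamed: 0" when the file is read back.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Code', 'deaths']
print(cols_without_index)  # ['Code', 'deaths']
```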

Next, the pollutant datasets are read. They are all described in the linked notebook: [pm25_df, pm10_df, no2_df, o3_df, so2_df, nox_df](../Data/datasets.ipynb#pollutant_df)

In [5]:
pm25_df = pd.read_csv("%s/processed_pm25_lonlat.csv" %path, usecols=['pm25_lon','pm25_lat','pm25_val'])
no2_df = pd.read_csv('%s/processed_no2_lonlat.csv' %path, usecols = ['no2_lon','no2_lat','no2_val'])
o3_df = pd.read_csv('%s/processed_o3_lonlat.csv' %path, usecols = ['o3_lon','o3_lat','o3_val'])
pm10_df = pd.read_csv('%s/processed_pm10_lonlat.csv' %path, usecols = ['pm10_lon','pm10_lat','pm10_val'])
so2_df = pd.read_csv('%s/processed_so2_lonlat.csv' %path, usecols = ['so2_lon','so2_lat','so2_val'])
nox_df = pd.read_csv('%s/processed_nox_lonlat.csv' %path, usecols = ['nox_lon','nox_lat','nox_val'])
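
The six reads above follow a single naming pattern, so they could also be written as a loop. The sketch below is one way to do that, not the authors' code: `read_pollutant` is a hypothetical helper, a temporary directory stands in for `path`, and the demo files contain made-up rows. It also reorders the columns explicitly, since `usecols` selects columns but does not control their order.

```python
import os
import tempfile

import pandas as pd


def read_pollutant(path, name):
    """Read processed_<name>_lonlat.csv and return lon, lat, value in that order."""
    cols = [f"{name}_lon", f"{name}_lat", f"{name}_val"]
    df = pd.read_csv(os.path.join(path, f"processed_{name}_lonlat.csv"), usecols=cols)
    return df[cols]  # usecols ignores list order, so reorder explicitly


# Demo with throwaway files (made-up rows, for illustration only)
with tempfile.TemporaryDirectory() as demo_path:
    for name in ["pm25", "no2"]:
        pd.DataFrame({
            f"{name}_val": [7.4, 8.2],
            f"{name}_lat": [54.68, 54.58],
            f"{name}_lon": [-1.21, -1.23],
            "extra": [0, 1],
        }).to_csv(os.path.join(demo_path, f"processed_{name}_lonlat.csv"), index=False)

    pollutant_dfs = {name: read_pollutant(demo_path, name) for name in ["pm25", "no2"]}
```

Keeping the frames in a dictionary keyed by pollutant name also makes the later conversion and nearest-neighbour steps easy to loop over.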

Once read, they are all converted into GeoDataFrame through the function [to_gpd_air_dt](functions.ipynb#to_gpd_air_dt)

In [6]:
pm25_gpd = to_gpd_air_dt(pm25_df)
no2_gpd = to_gpd_air_dt(no2_df)
o3_gpd = to_gpd_air_dt(o3_df)
pm10_gpd = to_gpd_air_dt(pm10_df)
so2_gpd = to_gpd_air_dt(so2_df)
nox_gpd = to_gpd_air_dt(nox_df)

The merged_covid_dt dataset, with longitude and latitude just added, is converted into a GeoDataFrame. The function to_gpd_air_dt requires that the first two columns of the input DataFrame be longitude and latitude, so the columns have to be reordered before applying the conversion. The output dataset is now of type GeoDataFrame and has an additional column called 'geometry' with the coordinates.

In [7]:
merged_covid_dt = merged_covid_dt[['lon','lat','Code','deaths','2018 people per sq. km','Mean_ann_earnings','median_age_2018','Name']]
covid_gpd = to_gpd_air_dt(merged_covid_dt)
covid_gpd.head()

Unnamed: 0,lon,lat,Code,deaths,2018 people per sq. km,Mean_ann_earnings,median_age_2018,Name,geometry
0,-1.20937,54.685728,E06000001,20,997,25985.0,41.8,Hartlepool,POINT (-1.20937 54.68573)
1,-1.234405,54.576042,E06000002,60,2608,22878.0,36.2,Middlesbrough,POINT (-1.23440 54.57604)
2,-1.005496,54.567906,E06000003,26,558,23236.0,45.0,Redcar and Cleveland,POINT (-1.00550 54.56791)
3,-1.312916,54.564094,E06000004,26,962,26622.0,40.4,Stockton-on-Tees,POINT (-1.31292 54.56409)
4,-1.555581,54.524208,E06000005,14,540,26908.0,43.1,Darlington,POINT (-1.55558 54.52421)
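
The reordering step can also be expressed as a small helper that moves the two coordinate columns to the front without listing every other column by hand. This is only a sketch: `lonlat_first` is a hypothetical name, and unlike the explicit selection in the cell above it keeps all remaining columns, so unwanted ones would still need to be dropped separately.

```python
import pandas as pd


def lonlat_first(df, lon="lon", lat="lat"):
    """Return a copy of df with the lon and lat columns moved to the front."""
    rest = [c for c in df.columns if c not in (lon, lat)]
    return df[[lon, lat] + rest].copy()


demo = pd.DataFrame({
    "Code": ["E06000001"],
    "Name": ["Hartlepool"],
    "lat": [54.685728],
    "lon": [-1.20937],
})
reordered = lonlat_first(demo)
print(list(reordered.columns))  # ['lon', 'lat', 'Code', 'Name']
```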


Now the [nearest_neighbor](functions.ipynb#nearest_neighbor) function is applied. 

In [8]:
pm25_covid = nearest_neighbor(covid_gpd[['Code','geometry']], pm25_gpd, return_dist=True)
no2_covid = nearest_neighbor(covid_gpd[['Code','geometry']], no2_gpd, return_dist=True)
o3_covid = nearest_neighbor(covid_gpd[['Code','geometry']], o3_gpd, return_dist=True)
pm10_covid = nearest_neighbor(covid_gpd[['Code','geometry']], pm10_gpd, return_dist=True)
so2_covid = nearest_neighbor(covid_gpd[['Code','geometry']], so2_gpd, return_dist=True)
nox_covid = nearest_neighbor(covid_gpd[['Code','geometry']], nox_gpd, return_dist=True)

Each resulting dataset contains the subregion Code, the coordinates, the mean value of the 10 nearest pollutant data points, and the farthest distance (in meters) at which a pollutant data point was included.
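
To make the idea concrete, here is a minimal brute-force sketch of "average the k nearest modelled values and record the distance to the farthest one". It is not the notebook's `nearest_neighbor` function, which according to the introduction relies on sklearn; plain NumPy is used instead so the example is self-contained. The function name `knn_mean_and_radius`, the use of the haversine (great-circle) distance, and the toy grid are all assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def knn_mean_and_radius(points_lonlat, grid_lonlat, grid_values, k=10):
    """For each point, average the values of its k nearest grid cells.

    Returns (mean_values, radius_m), where radius_m is the great-circle
    distance in metres to the farthest of the k neighbours.
    """
    pts = np.radians(np.asarray(points_lonlat, dtype=float))
    grid = np.radians(np.asarray(grid_lonlat, dtype=float))
    values = np.asarray(grid_values, dtype=float)

    lon1, lat1 = pts[:, 0][:, None], pts[:, 1][:, None]
    lon2, lat2 = grid[:, 0][None, :], grid[:, 1][None, :]

    # Haversine formula, evaluated for every point/grid-cell pair
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_m = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

    nearest = np.argsort(dist_m, axis=1)[:, :k]  # indices of the k closest cells
    mean_values = values[nearest].mean(axis=1)
    radius_m = np.take_along_axis(dist_m, nearest, axis=1).max(axis=1)
    return mean_values, radius_m


# Toy grid (made up): 0.01 degree latitude steps north of one centre point
grid = [(-1.20937, 54.685728 + 0.01 * i) for i in range(20)]
vals = np.arange(20, dtype=float)
means, radii = knn_mean_and_radius([(-1.20937, 54.685728)], grid, vals, k=10)
```

The pairwise distance matrix makes this approach quadratic in memory, which is fine for a few hundred local authorities against a small grid; a tree-based index is the usual choice for a full national grid.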

Now the datasets for the different pollutants are merged together and the final dataset is saved as CSV.

In [61]:
covid_air_dt = pd.merge(covid_gpd.drop(columns=['geometry','Name']), pm25_covid.drop(columns=['geometry','radius']), on='Code')
covid_air_dt = pd.merge(covid_air_dt, no2_covid.drop(columns=['geometry','radius']), on='Code')
covid_air_dt = pd.merge(covid_air_dt, o3_covid.drop(columns=['geometry','radius']), on='Code')
covid_air_dt = pd.merge(covid_air_dt, pm10_covid.drop(columns=['geometry','radius']), on='Code')
covid_air_dt = pd.merge(covid_air_dt, so2_covid.drop(columns=['geometry','radius']), on='Code')
covid_air_dt = pd.merge(covid_air_dt, nox_covid.drop(columns=['geometry','radius']), on='Code')
#covid_air_dt.drop(columns=['pm25_lon','pm25_lat','pm10_lon','pm10_lat', 'o3_lon','o3_lat','no2_lon','no2_lat', 'nox_lon','nox_lat','so2_lon','so2_lat'], inplace=True)
#covid_air_dt.rename(columns = {'2018 people per sq. km': 'X2018_people_per_sq_km'}, inplace=True)
#covid_air_dt.to_csv("../data_out/covid_air_dt.csv", index=False)
covid_air_dt.head()
np.mean(covid_air_dt.pm25_val)

9.101279022455088

In [63]:
covid_air_dt.pm25_val[0]

7.3628211

In [64]:
a = pd.read_csv("%s/merged_covidAir_cov_dt_LA.csv" %path_output)
a.head()
a.pm25_val[0]
#np.mean(a.pm25_val)

7.3628211
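
The spot check above compares a single value with the authors' file. A sketch of extending it to a whole column is given below; `compare_columns` is a hypothetical helper, and the toy tables reuse two `pm25_val` values from this notebook, with one reference value deliberately altered to show a mismatch.

```python
import numpy as np
import pandas as pd


def compare_columns(new_df, ref_df, column, key="Code", atol=1e-6):
    """Align two tables on key and return rows where column differs by more than atol."""
    both = new_df[[key, column]].merge(
        ref_df[[key, column]], on=key, suffixes=("_new", "_ref")
    )
    close = np.isclose(both[f"{column}_new"], both[f"{column}_ref"], rtol=0, atol=atol)
    return both[~close]


new = pd.DataFrame({"Code": ["E06000001", "E06000002"],
                    "pm25_val": [7.3628211, 8.210541]})
# Reference table in a different row order; the second value is altered on purpose
ref = pd.DataFrame({"Code": ["E06000002", "E06000001"],
                    "pm25_val": [8.310541, 7.3628211]})

mismatches = compare_columns(new, ref, "pm25_val")
```

Merging on `Code` first means the comparison does not depend on both files having the same row order.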

In [47]:

#a.pm25_val[88]

Unnamed: 0,lon,lat,Code,deaths,2018 people per sq. km,Mean_ann_earnings,median_age_2018,pm25_lon,pm25_lat,pm25_val,...,o3_val,pm10_lon,pm10_lat,pm10_val,so2_lon,so2_lat,so2_val,nox_lon,nox_lat,nox_val
0,-1.209370,54.685728,E06000001,20,997,25985.0,41.8,-1.210453,54.684977,7.362821,...,5.303858,-1.210453,54.684977,11.506778,-1.210453,54.684977,1.575561,-1.210453,54.684977,17.730895
1,-1.234405,54.576042,E06000002,60,2608,22878.0,36.2,-1.234211,54.576387,8.210541,...,4.251802,-1.234211,54.576387,12.561928,-1.234211,54.576387,3.133287,-1.234211,54.576387,29.296034
2,-1.005496,54.567906,E06000003,26,558,23236.0,45.0,-1.005466,54.567534,7.293137,...,6.851500,-1.005466,54.567534,12.855827,-1.005466,54.567534,0.998864,-1.005466,54.567534,9.983241
3,-1.312916,54.564094,E06000004,26,962,26622.0,40.4,-1.310219,54.564262,7.900856,...,5.050632,-1.310219,54.564262,12.251130,-1.310219,54.564262,1.934330,-1.310219,54.564262,21.440737
4,-1.555581,54.524208,E06000005,14,540,26908.0,43.1,-1.556564,54.523176,7.153221,...,5.198811,-1.556564,54.523176,11.129635,-1.556564,54.523176,1.129034,-1.556564,54.523176,15.098272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
329,-0.703120,52.160450,W06000020,33,740,25755.0,42.4,-0.704773,52.159159,9.543746,...,7.687268,-0.704773,52.159159,14.961366,-0.704773,52.159159,1.014256,-0.704773,52.159159,11.810053
330,-2.768756,51.481873,W06000021,25,111,32558.0,48.6,-2.768916,51.480431,8.189910,...,9.438886,-2.768916,51.480431,12.382244,-2.768916,51.480431,1.465795,-2.768916,51.480431,12.792954
331,-1.316356,50.691311,W06000022,61,805,24974.0,38.8,-1.317599,50.689653,8.870632,...,13.130471,-1.317599,50.689653,13.383171,-1.317599,50.689653,1.229098,-1.317599,50.689653,11.169410
332,-3.242966,50.682753,W06000023,18,26,22421.0,49.9,-3.242789,50.685021,5.966326,...,11.397849,-3.242789,50.685021,9.498978,-3.242789,50.685021,0.607522,-3.242789,50.685021,6.371829


### Repeat this process for 5-year average air pollution values

A similar dataset is now created using pollution data averaged over the five years 2014-2018.

We again need the [merged_covid_dt](../Data/datasets.ipynb#merged_covid_dt) dataset.

The 5-year pollutant data are in these datasets: [pm25_5YA, no2_5YA, o3_5YA, pm10_5YA, nox_5YA](../Data/datasets.ipynb#).

Remember that the first two columns must be longitude and latitude.

In [14]:
merged_covid_dt_LL = pd.read_csv('../data_out/merged_covid_dt_LL.csv')
merged_covid_5YA_dt = merged_covid_dt_LL[['lon','lat','Code','deaths','2018 people per sq. km','Mean_ann_earnings','median_age_2018','Name']]

pm25_5YA_df = pd.read_csv("%s/processed_allAP_lonlat.csv" %path_output, usecols=['lon','lat','pm25_5yAvg'])
pm25_5YA_df = pm25_5YA_df[['lon', 'lat', 'pm25_5yAvg']]

no2_5YA_df = pd.read_csv('%s/processed_allAP_lonlat.csv' %path_output, usecols = ['lon','lat','no2_5yAvg'])
no2_5YA_df = no2_5YA_df[['lon', 'lat', 'no2_5yAvg']]

o3_5YA_df = pd.read_csv('%s/processed_allAP_lonlat.csv' %path_output, usecols = ['lon','lat','o3_5yAvg'])
o3_5YA_df = o3_5YA_df[['lon', 'lat', 'o3_5yAvg']]

pm10_5YA_df = pd.read_csv('%s/processed_allAP_lonlat.csv' %path_output, usecols = ['lon','lat','pm10_5yAvg'])
pm10_5YA_df = pm10_5YA_df[['lon', 'lat', 'pm10_5yAvg']]

nox_5YA_df = pd.read_csv('%s/processed_allAP_lonlat.csv' %path_output, usecols = ['lon','lat','nox_5yAvg'])
nox_5YA_df = nox_5YA_df[['lon', 'lat', 'nox_5yAvg']]
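
Since all five frames come from the same file, an alternative is to read it once and slice it per pollutant. The sketch below shows the idea on an in-memory stand-in for `processed_allAP_lonlat.csv`: the column names follow the cell above, while the rows, the column order in the stand-in, and the dictionary name `frames_5YA` are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for processed_allAP_lonlat.csv (made-up rows, two pollutants only)
csv_text = """lat,lon,pm25_5yAvg,no2_5yAvg
54.684977,-1.210453,7.6,14.4
54.576387,-1.234211,8.3,20.5
"""

pollutants = ["pm25", "no2"]
value_cols = [f"{p}_5yAvg" for p in pollutants]

# One read for all pollutants ...
all_ap = pd.read_csv(io.StringIO(csv_text), usecols=["lon", "lat"] + value_cols)

# ... then one three-column frame per pollutant, longitude and latitude first
frames_5YA = {p: all_ap[["lon", "lat", f"{p}_5yAvg"]].copy() for p in pollutants}
```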

They are all converted into GeoDataFrames.

In [15]:
pm25_5YA_gpd = to_gpd_air_dt(pm25_5YA_df)
no2_5YA_gpd = to_gpd_air_dt(no2_5YA_df)
o3_5YA_gpd = to_gpd_air_dt(o3_5YA_df)
pm10_5YA_gpd = to_gpd_air_dt(pm10_5YA_df)
nox_5YA_gpd = to_gpd_air_dt(nox_5YA_df)
covid_5YA_gpd = to_gpd_air_dt(merged_covid_5YA_dt)

Now the [nearest_neighbor](functions.ipynb#nearest_neighbor) function is applied.

In [16]:
pm25_5YA_covid = nearest_neighbor(covid_5YA_gpd[['Code','geometry']], pm25_5YA_gpd, return_dist=True)
no2_5YA_covid = nearest_neighbor(covid_5YA_gpd[['Code','geometry']], no2_5YA_gpd, return_dist=True)
o3_5YA_covid = nearest_neighbor(covid_5YA_gpd[['Code','geometry']], o3_5YA_gpd, return_dist=True)
pm10_5YA_covid = nearest_neighbor(covid_5YA_gpd[['Code','geometry']], pm10_5YA_gpd, return_dist=True)
nox_5YA_covid = nearest_neighbor(covid_5YA_gpd[['Code','geometry']], nox_5YA_gpd, return_dist=True)

In [17]:
pm25_5YA_covid.head()

Unnamed: 0,Code,geometry,lon,lat,pm25_5yAvg,radius
0,E06000001,POINT (-1.20937 54.68573),-1.210453,54.684977,7.617911,2313.328476
1,E06000002,POINT (-1.23440 54.57604),-1.234211,54.576387,8.342286,2140.177884
2,E06000003,POINT (-1.00550 54.56791),-1.005466,54.567534,7.748073,2144.162968
3,E06000004,POINT (-1.31292 54.56409),-1.310219,54.564262,8.03818,2274.218771
4,E06000005,POINT (-1.55558 54.52421),-1.556564,54.523176,7.368714,2342.234609


Each resulting dataset contains the subregion Code, the coordinates, the mean value of the 10 nearest pollutant data points, and the farthest distance (in meters, the `radius` column) at which a pollutant data point was included.

Now the datasets for the different pollutants are merged together and the final dataset is saved as CSV.

In [19]:
# Drop helper columns from each pollutant table, then merge on the local authority Code
covid_air_dt_5YA = pd.merge(covid_5YA_gpd.drop(columns=['geometry','Name','lon','lat']),
                                pm25_5YA_covid.drop(columns=['geometry','radius','lon','lat']), on='Code')
covid_air_dt_5YA = pd.merge(covid_air_dt_5YA,
                                no2_5YA_covid.drop(columns=['geometry','radius','lon','lat']), on='Code')
covid_air_dt_5YA = pd.merge(covid_air_dt_5YA,
                                o3_5YA_covid.drop(columns=['geometry','radius','lon','lat']), on='Code')
covid_air_dt_5YA = pd.merge(covid_air_dt_5YA,
                                pm10_5YA_covid.drop(columns=['geometry','radius','lon','lat']), on='Code')
covid_air_dt_5YA = pd.merge(covid_air_dt_5YA,
                                nox_5YA_covid.drop(columns=['geometry','radius','lon','lat']), on='Code')

covid_air_dt_5YA.to_csv("../data_out/covid_air_dt_5YA.csv", index=False)
covid_air_dt_5YA.head()

Unnamed: 0,Code,deaths,2018 people per sq. km,Mean_ann_earnings,median_age_2018,pm25_5yAvg,no2_5yAvg,o3_5yAvg,pm10_5yAvg,nox_5yAvg
0,E06000001,20,997,25985.0,41.8,7.617911,14.359226,2.811139,11.625293,20.206839
1,E06000002,60,2608,22878.0,36.2,8.342286,20.537654,2.240947,12.522083,30.628743
2,E06000003,26,558,23236.0,45.0,7.748073,8.440613,3.795747,12.845976,11.273612
3,E06000004,26,962,26622.0,40.4,8.03818,16.408541,2.670293,12.169144,23.435383
4,E06000005,14,540,26908.0,43.1,7.368714,12.552451,2.763806,11.199639,17.361185
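
The chain of merges used for both final datasets can be written more compactly with `functools.reduce`. The following is a sketch under stated assumptions, not the authors' code: `merge_on_code` is a hypothetical helper, and the toy frames reuse the first two rows shown in this notebook.

```python
from functools import reduce

import pandas as pd


def merge_on_code(base, pollutant_frames, drop=("geometry", "radius", "lon", "lat")):
    """Merge each pollutant frame into base on 'Code', dropping helper columns first."""
    cleaned = [f.drop(columns=list(drop), errors="ignore") for f in pollutant_frames]
    return reduce(lambda left, right: pd.merge(left, right, on="Code"), cleaned, base)


base = pd.DataFrame({"Code": ["E06000001", "E06000002"], "deaths": [20, 60]})
pm25 = pd.DataFrame({
    "Code": ["E06000001", "E06000002"],
    "lon": [-1.210453, -1.234211],
    "lat": [54.684977, 54.576387],
    "pm25_5yAvg": [7.617911, 8.342286],
    "radius": [2313.328476, 2140.177884],
})
no2 = pd.DataFrame({
    "Code": ["E06000001", "E06000002"],
    "no2_5yAvg": [14.359226, 20.537654],
})

merged = merge_on_code(base, [pm25, no2])
print(list(merged.columns))  # ['Code', 'deaths', 'pm25_5yAvg', 'no2_5yAvg']
```

Passing `errors="ignore"` lets the same helper handle frames that lack some of the helper columns, so adding or removing a pollutant only changes the list passed in.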
