# Dataset description
The datasets used are listed below; the notebook (or, for the pollutant tables, the dataset) that uses each one is given in parentheses.
- [preL_dt](#preL_dt) (aim1)
- [pop_dens](#pop_dens) (aim2)
- [earning](#earning) (aim2)
- [age](#age) (aim2)
- [covid_deaths](#covid_deaths) (aim2)
- [merged_covid_dt](#merged_covid_dt) (aim2)
- [covid_air_dt](#covid_air_dt) (aim2)
- [pm25_df, no2_df, o3_df, pm10_df, so2_df, nox_df](#pollutants) (covid_air_dt)

In [2]:
%run conf_files.ipynb

<a id='preL_dt'></a>
##### preL_dt

The preL_dt dataset is used for the regional-level analysis in [aim 1](http://localhost:8888/notebooks/Documents/links%20between%20air%20pollution%20and%20covid19/aim1.ipynb).

In [3]:
preL_dt = pd.read_csv("%s/26-4-2020_yyAIR_COVID_PRE_LD_dt.csv" %path)
#preL_dt.info()
preL_dt.head()

Unnamed: 0.1,Unnamed: 0,Region,cases_preL,deaths_preL,Date_cases,Population_size_2018,Average_Pop_density_personkm2,Cases,Deaths,NO.levels,NO2.levels,O3.levels
0,1,East Of England,5356,746,,6201214,324.0,6499,1448,9.502135,19.553718,54.367479
1,2,London,16913,2120,,8908081,5666.0,19511,3522,25.193133,38.51252,36.913913
2,3,Midlands,10501,1491,,10704906,380.5,14844,2684,14.529437,24.119851,47.698892
3,4,North East And Yorkshire,8004,893,,8137524,333.0,10633,1641,16.501209,25.151394,45.295342
4,5,North West,9394,847,,7292093,517.0,12093,1801,8.581661,20.062745,48.95081


The dataset preL_dt contains:
- the cumulative number of COVID-19 cases ('Cases') and deaths ('Deaths') per region up to and including April 8, 2020
- the cumulative number of COVID-19 cases ('cases_preL') and deaths ('deaths_preL') per region before lockdown
- the population size ('Population_size_2018') and the average population density ('Average_Pop_density_personkm2') per region
- annual mean values of daily measurements, recorded between 2018 and 2019, for three major air pollutants: nitrogen dioxide ('NO2.levels'), nitrogen oxide ('NO.levels') and ozone ('O3.levels')

<a id='pop_dens'></a>
##### pop_dens

In [7]:
pop_dens = pd.read_csv("%s/2018_official_popDensity.csv" %path)[['Code', '2018 people per sq. km']]
#pop_dens.info()
pop_dens.head()

Unnamed: 0,Code,2018 people per sq. km
0,E06000047,237
1,E06000005,540
2,E06000001,997
3,E06000002,2608
4,E06000057,64


The dataset pop_dens contains:
- the subregion code
- the 2018 population density per subregion 

<a id='earning'></a>
##### earning

In [8]:
earning = pd.read_csv("%s/ann_earning_2018_perLA.csv" %path)[["Code","Mean_ann_earnings"]]
#earning.info()
earning.head()

Unnamed: 0,Code,Mean_ann_earnings
0,K02000001,29817
1,K03000001,29950
2,K04000001,30140
3,E92000001,30397
4,E12000001,25805


The dataset earning contains:
- the subregion code
- the 2018 mean annual earnings

<a id='age'></a>
##### age

In [13]:
age = pd.read_csv("%s/processed_median_age_of_population_perLA.csv" %path)[["Code","median_age_2018","Name"]]
#age.info()
age.head()

Unnamed: 0,Code,median_age_2018,Name
0,K02000001,40.1,UNITED KINGDOM
1,K03000001,40.2,GREAT BRITAIN
2,K04000001,40.0,ENGLAND AND WALES
3,E92000001,39.9,ENGLAND
4,E12000001,41.8,NORTH EAST


The dataset age contains:
- the subregion code
- the subregion name
- the 2018 median age 

<a id='covid_deaths'></a>
##### covid_deaths

In [12]:
covid_deaths = pd.read_csv("%s/covid_deaths_until10April_byAreaCode.csv" %path)
#covid_deaths.info()
covid_deaths.head()

Unnamed: 0,Area code,Area name,Home,Hospital,Care home,Hospice,Other communal establishment,Elsewhere
0,E06000001,Hartlepool,1,14,5,0,0,0
1,E06000002,Middlesbrough,1,49,10,0,0,0
2,E06000003,Redcar and Cleveland,1,22,2,0,0,1
3,E06000004,Stockton-on-Tees,3,19,4,0,0,0
4,E06000005,Darlington,0,12,2,0,0,0


The covid_deaths dataset contains:
- the area code
- the area name
- the number of COVID-19 deaths at home
- the number of COVID-19 deaths in hospital
- the number of COVID-19 deaths in a care home
- the number of COVID-19 deaths in a hospice
- the number of COVID-19 deaths in other communal establishments
- the number of COVID-19 deaths elsewhere

After some manipulation in [aim 2](http://localhost:8888/notebooks/Documents/links%20between%20air%20pollution%20and%20covid19/aim2.ipynb), the final covid_deaths dataset (the one actually used in the analysis) contains:
- the area code
- the total number of COVID-19 deaths, obtained by summing the deaths across all locations
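
The summing step can be sketched as follows. This is a minimal illustration, not the code from aim 2: the rows are copied from the `covid_deaths.head()` output above, while the names `place_cols` and `covid_deaths_total` and the output column name `deaths` are assumptions made for the example.

```python
import pandas as pd

# Toy rows copied from the covid_deaths.head() output above.
covid_deaths = pd.DataFrame({
    "Area code": ["E06000001", "E06000002", "E06000003"],
    "Area name": ["Hartlepool", "Middlesbrough", "Redcar and Cleveland"],
    "Home": [1, 1, 1],
    "Hospital": [14, 49, 22],
    "Care home": [5, 10, 2],
    "Hospice": [0, 0, 0],
    "Other communal establishment": [0, 0, 0],
    "Elsewhere": [0, 0, 1],
})

# Hypothetical helper list: the place-of-death columns to add up.
place_cols = ["Home", "Hospital", "Care home", "Hospice",
              "Other communal establishment", "Elsewhere"]

# Row-wise sum over the place-of-death columns, keeping only code and total.
covid_deaths_total = covid_deaths.assign(
    deaths=covid_deaths[place_cols].sum(axis=1)
)[["Area code", "deaths"]]

print(covid_deaths_total)
```

For these three areas the totals are 20, 60 and 26, which agree with the `deaths` column shown for the same areas in merged_covid_dt.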

<a id='merged_covid_dt'></a>
##### merged_covid_dt

In [4]:
merged_covid_dt = pd.read_csv("./data_out/merged_covid_dt.csv")
#merged_covid_dt.info()
merged_covid_dt.head()

Unnamed: 0.1,Unnamed: 0,Code,Mean_ann_earnings,median_age_2018,Name,2018 people per sq. km,deaths
0,0,E06000005,540,43.1,Darlington,540,14
1,1,E06000001,997,41.8,Hartlepool,997,20
2,2,E06000002,2608,36.2,Middlesbrough,2608,60
3,3,E06000003,558,45.0,Redcar and Cleveland,558,26
4,4,E06000004,962,40.4,Stockton-on-Tees,962,26


The merged_covid_dt dataset is obtained by merging the [covid_deaths](#covid_deaths), [age](#age), [pop_dens](#pop_dens) and [earning](#earning) datasets. It contains:
- the area code
- the 2018 mean annual earnings
- the 2018 median age
- the area name
- the 2018 population density
- the total number of COVID-19 deaths ('deaths')
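
A merge of this kind can be sketched with `DataFrame.merge` on the area code. The sketch below rests on assumptions: the join type (`how="inner"`) and the merge order are not stated in this notebook, and the earnings figures are placeholders because the per-area values do not appear in `earning.head()`. The other values are taken from the outputs shown here.

```python
import pandas as pd

# Death totals per area (values from the merged_covid_dt output).
covid_deaths = pd.DataFrame({
    "Area code": ["E06000001", "E06000002", "E06000005"],
    "deaths": [20, 60, 14],
})
age = pd.DataFrame({
    "Code": ["E06000005", "E06000001", "E06000002"],
    "median_age_2018": [43.1, 41.8, 36.2],
    "Name": ["Darlington", "Hartlepool", "Middlesbrough"],
})
# E06000047 has no match in the other toy frames, so an inner join drops it.
pop_dens = pd.DataFrame({
    "Code": ["E06000047", "E06000005", "E06000001", "E06000002"],
    "2018 people per sq. km": [237, 540, 997, 2608],
})
# Placeholder earnings: NOT real data, for illustration only.
earning = pd.DataFrame({
    "Code": ["E06000005", "E06000001", "E06000002"],
    "Mean_ann_earnings": [30000, 31000, 32000],
})

# Chain the merges on the shared area code (join type is an assumption).
merged = (
    earning
    .merge(age, on="Code", how="inner")
    .merge(pop_dens, on="Code", how="inner")
    .merge(covid_deaths.rename(columns={"Area code": "Code"}),
           on="Code", how="inner")
)

print(merged)
```

With an inner join, only areas present in all four datasets survive, so aggregate rows such as `K02000001` (UNITED KINGDOM) in `earning` and `age` would drop out unless the deaths table also contained them.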

<a id='covid_air_dt'></a>
##### covid_air_dt

In [8]:
covid_air_dt = pd.read_csv("%s/merged_covidAir_cov_dt_LA.csv" %path_output, na_values='x')
#covid_air_dt.info()
covid_air_dt.columns

Index(['Unnamed: 0', 'Code', 'deaths', 'X2018.people.per.sq..km',
       'Mean_ann_earnings', 'median_age_2018', 'lat', 'lon', 'pm25_lon',
       'pm25_lat', 'pm25_val', 'no2_lon', 'no2_lat', 'no2_val', 'o3_lon',
       'o3_lat', 'o3_val', 'pm10_lon', 'pm10_lat', 'pm10_val', 'so2_lon',
       'so2_lat', 'so2_val', 'nox_lon', 'nox_lat', 'nox_val'],
      dtype='object')

The covid_air_dt dataset contains information about:
- the subregion code
- the subregion latitude and longitude
- the cumulative number of deaths
- the 2018 population density
- the 2018 mean annual earnings
- the 2018 median age
- the value, longitude and latitude of all pollutants:
    - particulate matter with an aerodynamic diameter < 2.5 μm (pm25)
    - particulate matter with an aerodynamic diameter < 10.0 μm (pm10)
    - ozone (O3)
    - nitrogen dioxide (NO2)
    - sulfur dioxide (SO2)
    - nitrogen oxides (NOX)
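
The `read_csv` call for covid_air_dt passes `na_values='x'`, which makes pandas treat a literal `x` in any column as a missing value. Which columns of the real file contain `x` is not shown here, so the CSV text in this small sketch is hypothetical.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical CSV text in which one value is recorded as the marker "x".
csv_text = "Code,deaths,Mean_ann_earnings\nA,14,30000\nB,20,x\n"

# With na_values="x" the marker becomes NaN and the column stays numeric.
with_na = pd.read_csv(io.StringIO(csv_text), na_values="x")

# Without it the marker is kept as text and the column is read as strings.
without_na = pd.read_csv(io.StringIO(csv_text))

print(with_na["Mean_ann_earnings"].isna().tolist())
print(is_numeric_dtype(with_na["Mean_ann_earnings"]))
print(is_numeric_dtype(without_na["Mean_ann_earnings"]))
```

Keeping the column numeric matters for any later arithmetic or model fitting, which would otherwise fail or silently operate on strings.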

<a id='pollutants'></a>
##### pm25_df, no2_df, o3_df, pm10_df, so2_df, nox_df

In [6]:
pm25_df = pd.read_csv("%s/processed_pm25_lonlat.csv" %path, usecols=['pm25_lon','pm25_lat','pm25_val'])
no2_df = pd.read_csv('%s/processed_no2_lonlat.csv' %path, usecols = ['no2_lon','no2_lat','no2_val'])
o3_df = pd.read_csv('%s/processed_o3_lonlat.csv' %path, usecols = ['o3_lon','o3_lat','o3_val'])
pm10_df = pd.read_csv('%s/processed_pm10_lonlat.csv' %path, usecols = ['pm10_lon','pm10_lat','pm10_val'])
so2_df = pd.read_csv('%s/processed_so2_lonlat.csv' %path, usecols = ['so2_lon','so2_lat','so2_val'])
nox_df = pd.read_csv('%s/processed_nox_lonlat.csv' %path, usecols = ['nox_lon','nox_lat','nox_val'])
pm25_df.head()

Unnamed: 0,pm25_lon,pm25_lat,pm25_val
0,-0.888372,60.853691,2.994053
1,-0.907079,60.844866,2.988176
2,-0.888684,60.844715,2.995678
3,-0.870288,60.844562,2.996176
4,-0.851893,60.844406,2.997217


These datasets contain:
- the pollutant's value
- the corresponding longitude and latitude

They are only used to build the covid_air_dt dataset.
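
This notebook does not show how the pollutant grids are attached to each subregion. One plausible approach, assumed here purely for illustration, is to give each subregion the value of the nearest grid point. In the sketch below the `areas` frame and its coordinates are hypothetical, the grid rows come from the `pm25_df.head()` output above, and distance is measured in plain degrees, which is only a rough approximation (a great-circle distance would be more accurate).

```python
import numpy as np
import pandas as pd

# Hypothetical subregion coordinates, invented for this example.
areas = pd.DataFrame({
    "Code": ["A", "B"],
    "lat": [60.85, 60.8445],
    "lon": [-0.89, -0.852],
})

# Grid rows copied from the pm25_df.head() output above.
pm25_df = pd.DataFrame({
    "pm25_lon": [-0.888372, -0.907079, -0.888684, -0.870288, -0.851893],
    "pm25_lat": [60.853691, 60.844866, 60.844715, 60.844562, 60.844406],
    "pm25_val": [2.994053, 2.988176, 2.995678, 2.996176, 2.997217],
})

# Squared distance (in degrees) between every area and every grid point.
d_lat = areas["lat"].to_numpy()[:, None] - pm25_df["pm25_lat"].to_numpy()[None, :]
d_lon = areas["lon"].to_numpy()[:, None] - pm25_df["pm25_lon"].to_numpy()[None, :]
nearest = np.argmin(d_lat**2 + d_lon**2, axis=1)

# Attach the nearest grid point's coordinates and value to each area.
matched = pd.concat(
    [areas.reset_index(drop=True),
     pm25_df.iloc[nearest].reset_index(drop=True)],
    axis=1,
)

print(matched)
```

The result has the same column layout as the pm25 part of covid_air_dt (`lat`, `lon`, `pm25_lon`, `pm25_lat`, `pm25_val`); repeating the step for each pollutant would give the remaining columns.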