<a href="https://colab.research.google.com/github/claret003/pythoncourse/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Air Pollution and Asthma 

---

This project will explore two separate datasets, one related to air pollution and the other to asthma prevalence, to test the following hypothesis:

### **'elevated levels of air pollution result in increased asthma prevalence in the general population'.**

Air pollution data has been obtained from the WHO Ambient Air Quality Database
https://whoairquality.shinyapps.io/AmbientAirQualityDatabase/

Asthma prevalence data has been sourced from Our World in Data (OWID)
https://ourworldindata.org/grapher/asthma-prevalence


For ease of access, both datasets have been saved on GitHub:

*   **Air pollution dataset** = https://github.com/claret003/pythoncourse/blob/main/WHO_AirQuality_Database_2018.csv

*   **Asthma prevalence dataset** = https://github.com/claret003/pythoncourse/blob/main/asthma-prevalence.csv

This hypothesis will be tested by first conducting exploratory data analysis on both datasets to establish their completeness, and cleaning the data where required. Relevant aspects of each dataset will then be combined for in-depth analysis, looking specifically at the correlation between air pollution and asthma prevalence.

*A project title and description (what is interesting and how do you propose to find and present this story).*

*A link to the data set* 

*A set of code cells containing functions that manipulate the data, analyse the data, visualise the data and show correlations and trends*

*Text cells which explain what is being done in each code cell and why*

*Text cells with a summary of the findings from the code*

*A final text cell with a reflection on what has been done, any references (including where the data came from).*



STEP 1 - 
Importing the relevant libraries.
Bringing the datasets into separate dataframes.
Carrying out some exploratory data analysis.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as py
import seaborn as sns

air_pollution = pd.read_csv('https://raw.githubusercontent.com/claret003/pythoncourse/main/WHO_AirQuality_Database_2018.csv')

asthma_prev = pd.read_csv('https://raw.githubusercontent.com/claret003/pythoncourse/main/asthma-prevalence.csv')

air_pollution.head()


Unnamed: 0,ID WHO city,iso3,country,city,pm10,Year,type_of_stations,pm10_type,pm25,pm25_type,reference,latitude,longitude,population,wbinc16_text,region,date_compiled,population_source,tempcov_PM10,tempcov_PM25,latitude_pop,longitude_pop,Region2,region_abbr,tempcov_PM10_grad,tempcov_PM25_grad,conc_pm25,color_pm25,conc_pm10,color_pm10
0,3.0,ALB,Albania,Tirana,31.615421,2013,"1 station, traffic, urban",Measured,16.062366,Measured,"European Environment Agency, Air quality e-rep...",41.330269,19.821772,453509.0,Upper middle income,Eur_LM,2016,,,,,,Europe (LMIC),Eur (LMIC),,,15-<25,orange,30-<50,darkred
1,15.0,AUS,Australia,Central Coast,12.820462,2014,-,Converted,5.5,Measured,http://www.environment.nsw.gov.au/resources/aq...,-33.278889,151.432495,297713.0,High income,Wpr_HI,2016,,,,,,Western Pacific (HIC),Wpr (HIC),,,<10,green,<20,green
2,18.0,AUS,Australia,Devonport,14.918356,2013,-,Converted,6.4,Measured,Environment Protection Authority,-41.184799,146.345993,29050.0,High income,Wpr_HI,2016,,,,,,Western Pacific (HIC),Wpr (HIC),,,<10,green,<20,green
3,19.0,AUS,Australia,Geelong,17.5,2014,"1 station, Residential/Light Industry",Measured,7.50753,Converted,"EPA Victoria, Environment Protection Authority...",-38.174999,144.369003,173450.0,High income,Wpr_HI,2016,,,,,,Western Pacific (HIC),Wpr (HIC),,,<10,green,<20,green
4,22.0,AUS,Australia,Hobart,14.219058,2013,-,Converted,6.1,Measured,Environment Protection Authority,-42.854599,147.315002,170977.0,High income,Wpr_HI,2016,,,,,,Western Pacific (HIC),Wpr (HIC),,,<10,green,<20,green


STEP 2 - Checking shape of datasets. Looking for null values and things that might cause an issue with analysis.

In [32]:
# creating a function to report the shape of a dataset and the null values in each column
def check_shape(df):
  shape = df.shape # (number of rows, number of columns)
  nulls = df.isnull().sum() # total number of null values in each column
  print(f'there are {shape[1]} columns in the dataset')
  print(f'there are {shape[0]} rows in the dataset \n')
  print('null values per column: \n')
  print(nulls)

print('Air pollution dataset') 
check_shape(air_pollution)
print('\n')
print('Asthma dataset')
check_shape(asthma_prev)



Air pollution dataset
there are 30 columns in the dataset
there are 11971 rows in the dataset 

null values per column: 

ID WHO city          11090
iso3                     0
country                  0
city                     0
pm10                     0
Year                     0
type_of_stations        68
pm10_type                0
pm25                     0
pm25_type                0
reference               40
latitude                 0
longitude                0
population               0
wbinc16_text             0
region                   0
date_compiled            0
population_source      907
tempcov_PM10          4320
tempcov_PM25          7458
latitude_pop         11961
longitude_pop        11961
Region2                  0
region_abbr              0
tempcov_PM10_grad     4320
tempcov_PM25_grad     7458
conc_pm25                0
color_pm25               0
conc_pm10                0
color_pm10               0
dtype: int64


Asthma dataset
there are 4 columns in the dataset
the

The fields of interest in each dataset are complete, i.e. the outputs above indicate that there are no null values in these columns. There is therefore no need to remove rows containing null values, as doing so would unnecessarily reduce the size of the dataset for no gain.

The fields of interest are:

Air pollution dataset: **country, pm25, Year**

Asthma dataset: **Entity, Year, Prevalence**
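
The completeness check can also be restricted to the fields of interest alone. The following is a minimal sketch on a small made-up dataframe; the helper name `nulls_in` and the toy values are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

def nulls_in(df, columns):
    """Return the number of null values in each of the named columns."""
    return df[columns].isnull().sum()

# Toy data (hypothetical values): the fields of interest are complete,
# while an unrelated column contains gaps.
toy = pd.DataFrame({
    'country': ['Albania', 'Australia', 'Australia'],
    'pm25': [16.1, 5.5, 6.4],
    'Year': [2013, 2014, 2013],
    'population_source': [None, None, 'census'],
})

null_counts = nulls_in(toy, ['country', 'pm25', 'Year'])
fields_complete = bool((null_counts == 0).all())

print(null_counts)
print(f'fields of interest complete: {fields_complete}')
```

With the real data, the same helper could be called as `nulls_in(air_pollution, ['country', 'pm25', 'Year'])`.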

'Entity' refers to the country, so the column name will be changed to reflect this.

In [33]:
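# changing column name. rename() returns a new dataframe, so the result is assigned back to asthma_prev

asthma_prev = asthma_prev.rename(columns={'Entity': 'country'})

# test that 'country' is now a column
assert 'country' in asthma_prev.columns, "rename failed: 'country' column not found"

asthma_prev # the output confirms that this change has been made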

Unnamed: 0,country,Code,Year,Prevalence - Asthma - Sex: Both - Age: Age-standardized (Percent)
0,Afghanistan,AFG,1990,6.871359
1,Afghanistan,AFG,1991,6.778874
2,Afghanistan,AFG,1992,6.694809
3,Afghanistan,AFG,1993,6.617201
4,Afghanistan,AFG,1994,6.546920
...,...,...,...,...
6463,Zimbabwe,ZWE,2013,2.924504
6464,Zimbabwe,ZWE,2014,2.938950
6465,Zimbabwe,ZWE,2015,2.953462
6466,Zimbabwe,ZWE,2016,2.967474


STEP 2 - Isolate the relevant reporting period for each dataset.

The two datasets have different reporting periods. As the hypothesis being tested assumes a link between air pollution and asthma, reporting periods are being selected on the basis that there is a lag time, i.e. exposure to air pollution is required for a period of time before the onset of asthma, and then its diagnosis.

Based on this assumption, data has been selected to incorporate a lag between the reporting periods of the two datasets. The air pollution reporting period is 2009-2013 and the asthma prevalence reporting period is 2013-2017, so the two five-year windows are offset by four years.
*The validity of this lag time is something that could be investigated further.*
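
The relationship between the two reporting periods can be made explicit by deriving the asthma window from the air pollution window and a chosen lag. This is a minimal sketch: the helper `lagged_window` is a hypothetical name, and the value of four years is simply the offset between the two start years given above (2009 and 2013).

```python
def lagged_window(start_year, end_year, lag_years):
    """Shift an inclusive (start, end) year window forward by lag_years."""
    return start_year + lag_years, end_year + lag_years

# Air pollution reporting period taken from the text above
air_window = (2009, 2013)

# Assumption: a four-year shift, matching the offset between the stated periods
asthma_window = lagged_window(*air_window, lag_years=4)

print(f'air pollution window: {air_window}')
print(f'asthma window:        {asthma_window}')
```

Changing `lag_years` would give alternative asthma windows, which is one way the validity of the lag assumption could be explored.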

In [67]:
# isolating 2009-2013 air pollution data

def keep_years(df, column_name, start_year, end_year):
  # keep only the rows where column_name is between start_year and end_year (both inclusive)
  return df[df[column_name].between(start_year, end_year)]

air_data = keep_years(air_pollution,"Year",2009,2013)

air_data.describe() #using this to check the max and min year, making sure that it has selected only what I have specified


Unnamed: 0,ID WHO city,pm10,Year,pm25,latitude,longitude,population,date_compiled,tempcov_PM10,tempcov_PM25,latitude_pop,longitude_pop
count,474.0,3304.0,3304.0,3304.0,3304.0,3304.0,3304.0,3304.0,1853.0,811.0,1.0,1.0
mean,1498.35654,35.025908,2012.538136,20.009787,34.619968,11.267983,460441.0,2017.713075,0.980149,1.028545,34.681667,33.017776
std,675.107112,36.326386,0.871244,17.364901,25.128767,44.113748,1875943.0,0.701194,1.415652,2.356234,,
min,3.0,3.414147,2009.0,1.420602,-46.413185,-123.16333,6.0,2016.0,0.013699,0.152854,34.681667,33.017776
25%,1207.25,18.6996,2012.0,11.750177,31.839295,-0.233333,11484.75,2018.0,0.931507,0.928767,34.681667,33.017776
50%,1537.0,23.606363,2013.0,15.306579,45.370477,8.840091,42769.0,2018.0,0.972603,0.967466,34.681667,33.017776
75%,1888.75,35.213722,2013.0,21.423682,49.329654,18.417867,195283.2,2018.0,0.992808,0.991781,34.681667,33.017776
max,2978.0,540.0,2013.0,216.62,69.66791,176.918625,25703170.0,2018.0,61.75,68.0,34.681667,33.017776


In [68]:
# isolating 2013 - 2017 asthma prevalence data by calling the function defined above

asthma_data = keep_years(asthma_prev,"Year",2013,2017)

asthma_data.describe()

Unnamed: 0,Year,Prevalence - Asthma - Sex: Both - Age: Age-standardized (Percent)
count,1155.0,1155.0
mean,2015.0,5.285344
std,1.414826,2.123482
min,2013.0,1.939809
25%,2014.0,3.641307
50%,2015.0,4.781563
75%,2016.0,6.216574
max,2017.0,12.612076


STEP 3 - Group each dataset by country to create a country mean for air pollution level and asthma prevalence, calculated over a five-year period.
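
The grouping step could look roughly like the sketch below, shown on made-up values rather than the real datasets. The helper name `country_mean` and the toy numbers are assumptions for illustration; the column names `country` and `pm25` are those of the air pollution dataset above.

```python
import pandas as pd

def country_mean(df, value_column, country_column='country'):
    """Average value_column over all rows belonging to each country."""
    return df.groupby(country_column)[value_column].mean()

# Toy data (hypothetical values), one row per city and year
toy_air = pd.DataFrame({
    'country': ['Albania', 'Australia', 'Australia', 'Australia'],
    'Year': [2013, 2012, 2013, 2013],
    'pm25': [16.0, 5.0, 6.0, 7.0],
})

mean_pm25 = country_mean(toy_air, 'pm25')
print(mean_pm25)
```

Note that a plain mean weights every city-year row equally, so a country with many monitored cities is averaged over more observations than a country with only a few.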