<h1>The adverse health effects of air pollution - are we making any progress?</h1>
<p><img src=files/air.jpg width="900"></p>
<p><strong>Credit:</strong>  <a href="https://www.flickr.com/people/44221799@N08/">Flickr/E4C</a> </p>

In [1]:
# Load relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings

warnings.filterwarnings("ignore")  # Suppress all warnings

<h2>Introduction</h2>
<p><strong>Business Context.</strong> Air pollution is a serious issue facing the global population. The abundance of air pollutants not only contributes to global warming but also causes health problems. Most nations have made efforts to protect and improve air quality, yet progress appears to be slow. One of the main reasons is that the majority of air pollutants derive from the burning of fossil fuels such as <em>coal</em>. Big industries, together with other economic and political factors, have slowed the transition to renewable energy by promoting the use of fossil fuels. Nevertheless, educating the general population and raising awareness of this issue can help overcome the problem in the future.</p>
<p>For this case, you have been hired as a data science consultant for an important environmental organization. In order to promote awareness of environmental and greenhouse gas issues, your client is interested in a study of <strong>plausible impacts of air contamination on the health of the global population</strong>. They have gathered some <em>raw</em> data provided by the <a href="https://www.who.int/">World Health Organization</a>, <a href="http://www.healthdata.org/">The Institute for Health Metrics and Evaluation</a> and the <a href="https://www.worldbank.org/">World Bank Group</a>. Your task is to conduct data analysis, search for potential information, and create visualizations that the client can use for their campaigns and grant applications. </p>
<p><strong>Analytical Context.</strong> You are given a folder, named <code>files</code> with <em>raw</em> data. This data contains quite a large number of variables and it is in a fairly disorganized state. In addition, one of the datasets contains very poor documentation, segmented into several datasets. Your objective will be to:</p>
<ol>
<li>Extract and clean the relevant data. You will have to manipulate several datasets to obtain useful information for the case. </li>
<li>Conduct Exploratory Data Analysis. You will have to create meaningful plots, formulate meaningful hypotheses and study the relationship between various indicators related to air pollution.</li>
</ol>
<p>Additionally, the client has some broad questions they would like to answer:<br />
1. Are we making any progress in reducing the amount of emitted pollutants across the globe?<br />
2. Which are the critical regions where we should start environmental campaigns?<br />
3. Are we making any progress in the prevention of deaths related to air pollution?<br />
4. Which demographic characteristics seem to correlate with the number of health-related issues derived from air pollution? </p>

<h2>Extracting and cleaning relevant data</h2>
<p>Let's take a look at the data provided by the client in the <code>files</code> folder. There, we see another folder  named <code>WDI_csv</code> with several CSV files corresponding to the World Bank's primary <a href="https://datacatalog.worldbank.org/dataset/world-development-indicators">World Development Indicators</a>. The client stated that this data may contain some useful information relevant to our study, but they have not told us anything aside from that. Thus, we are on our own in finding and extracting the relevant data for our study. This we will do next. </p>
<p>Let's take a peek at the file <code>WDIData.csv</code>:</p>

In [2]:
WDI_data = pd.read_csv("./files/WDI_csv/WDIData.csv")
print(WDI_data.columns)
print(WDI_data.info())
WDI_data.head()

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', 'Unnamed: 64'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377256 entries, 0 to 377255
Data columns (total 65 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Country Name    377256 non-null  object 
 1   Country Code    377256 non-null  object 
 2   Indicator Name  377256 non-null  object 
 3   Indicator Co

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 64
0,Arab World,ARB,"2005 PPP conversion factor, GDP (LCU per inter...",PA.NUS.PPP.05,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"2005 PPP conversion factor, private consumptio...",PA.NUS.PRVT.PP.05,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,82.783289,83.120303,83.533457,83.897596,84.171599,84.510171,,,,
3,Arab World,ARB,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,86.428272,87.070576,88.176836,87.342739,89.130121,89.678685,90.273687,,,
4,Arab World,ARB,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,73.942103,75.244104,77.162305,75.538976,78.741152,79.665635,80.749293,,,


<p>The data seems to have a large number of indicators dating from 1960. There are also columns containing country names and codes. Notice that the first couple of rows say <code>Arab World</code>, which may indicate that the data contains broad regional data as well. We notice also that there are at least 100,000 entries with <code>NaN</code> values for each year column.</p>
<p>Since we are interested in environmental indicators, we must get rid of any rows not relevant to our study. However, the number of indicators seems to be quite large and a manual inspection seems impossible. Let's load the file <code>WDISeries.csv</code> which seems to contain more information about the indicators:</p>

In [3]:
WDI_ids = pd.read_csv("./files/WDI_csv/WDISeries.csv")
print(WDI_ids.columns)
WDI_ids.head()
print(WDI_ids.shape)

Index(['Series Code', 'Topic', 'Indicator Name', 'Short definition',
       'Long definition', 'Unit of measure', 'Periodicity', 'Base Period',
       'Other notes', 'Aggregation method', 'Limitations and exceptions',
       'Notes from original source', 'General comments', 'Source',
       'Statistical concept and methodology', 'Development relevance',
       'Related source links', 'Other web links', 'Related indicators',
       'License Type', 'Unnamed: 20'],
      dtype='object')
(1429, 21)


<p>Bingo! The <code>WDI_ids</code> DataFrame contains a column named <code>Topic</code>. Moreover, it seems that <em>Environment</em> is listed as a key topic in the column.</p>

<h3>Exercise 1:</h3>
<p>Extract all the rows that have the topic key <em>Environment</em> in <code>WDI_ids</code>. Add to the resulting DataFrame a new column named <code>Subtopic</code> which contains the corresponding subtopic of the indicator. For example, the subtopic of <code>Environment: Agricultural production</code> is <code>Agricultural production</code>. Which subtopics do you think are of interest to us?</p>
<p><strong>Hint:</strong> Remember that you can apply string methods to a <code>pandas</code> Series through its <code>.str</code> accessor.</p>

**Answer.**

In [4]:
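# Keep only the indicators whose topic contains "Environment:" (copy to avoid chained-assignment warnings)
WDI_ids_Environment = WDI_ids[WDI_ids['Topic'].str.contains("Environment:", case=False, na=False)].copy()
# Split "Environment: <subtopic>" into the topic key and the subtopic
WDI_ids_Environment[['Topic', 'Subtopic']] = WDI_ids_Environment['Topic'].str.split(': ', n=1, expand=True)
pd.set_option('display.max_columns', None)  # Show every column when displaying DataFrames
print(WDI_ids_Environment['Subtopic'].unique())
WDI_ids_Environment.head(2)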

['Agricultural production' 'Land use' 'Energy production & use'
 'Emissions' 'Biodiversity & protected areas' 'Density & urbanization'
 'Freshwater' 'Natural resources contribution to GDP']


Unnamed: 0,Series Code,Topic,Indicator Name,Short definition,Long definition,Unit of measure,Periodicity,Base Period,Other notes,Aggregation method,Limitations and exceptions,Notes from original source,General comments,Source,Statistical concept and methodology,Development relevance,Related source links,Other web links,Related indicators,License Type,Unnamed: 20,Subtopic
0,AG.AGR.TRAC.NO,Environment,"Agricultural machinery, tractors",,Agricultural machinery refers to the number of...,,Annual,,,Sum,The data are collected by the Food and Agricul...,,,"Food and Agriculture Organization, electronic ...",A tractor provides the power and traction to m...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,,Agricultural production
1,AG.CON.FERT.PT.ZS,Environment,Fertilizer consumption (% of fertilizer produc...,,Fertilizer consumption measures the quantity o...,,Annual,,,Weighted average,The FAO has revised the time series for fertil...,,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,,Agricultural production


**The subtopics Emissions and Density & urbanization look most relevant to our study.**

<h3>Exercise 2:</h3>
<p>Use the results of Exercise 1 to create a new DataFrame with the history of all emissions indicators for countries and major regions. Call this new DataFrame <code>Emissions_df</code>. How many emissions indicators are in the study?</p>

**Answer.**

In [5]:
pd.set_option('display.max_columns', None)

# Select the rows whose subtopic is Emissions and build a list of their Series Codes
Emissions_sl = WDI_ids_Environment[WDI_ids_Environment['Subtopic'].str.contains('Emissions', case=False)]
list_Code = Emissions_sl['Series Code'].unique().tolist()

# For each code, select the rows of WDI_data whose Indicator Code contains that code.
# Note: str.contains() matches the pattern anywhere in the string, so a code that is a
# prefix of a longer code also selects the longer one, which then appears twice below.
Emissions_pre = [WDI_data[WDI_data['Indicator Code'].str.contains(code, case=False)] for code in list_Code]

# Concatenate the per-code selections; pd.concat accepts the whole list,
# so there is no need to index the pieces one by one
Emissions_df = pd.concat(Emissions_pre, axis=0)
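
As a side note, the selection above relies on substring matching. The sketch below contrasts it with exact matching through `isin()`. The data and the indicator codes (`AA.BB`, `AA.BB.CC`, `ZZ.YY`) are hypothetical toy values, not taken from the WDI files, and this is only one possible way to write the filter.

```python
import pandas as pd

# Hypothetical toy stand-in for WDI_data (codes are made up for illustration)
toy_data = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ARB"],
    "Indicator Code": ["AA.BB", "AA.BB.CC", "ZZ.YY"],
})
codes = ["AA.BB", "AA.BB.CC"]

# Substring matching: "AA.BB" is also found inside "AA.BB.CC",
# so concatenating the per-code selections repeats that row
by_contains = pd.concat(
    [toy_data[toy_data["Indicator Code"].str.contains(c, regex=False)] for c in codes]
)

# Exact matching: every row is selected at most once
by_isin = toy_data[toy_data["Indicator Code"].isin(codes)]

print(len(by_contains), len(by_isin))
```

With exact matching the number of selected rows per country equals the number of distinct codes, which makes it easier to count the indicators in the study.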

In [6]:
Emissions_df.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 64
207,Arab World,ARB,CO2 intensity (kg per kg of oil equivalent ene...,EN.ATM.CO2E.EG.ZS,,,,,,,,,,,,4.914091,5.146828,5.461254,5.000364,4.575705,4.99104,4.800156,4.509988,4.04432,4.022687,3.423656,2.9413,2.908118,3.101542,3.105701,3.280671,3.023061,2.951808,2.854121,2.903059,3.091429,3.005554,3.105094,3.011296,2.824569,2.700443,2.514938,2.699857,2.713744,3.018297,2.795018,2.720807,2.884507,2.974147,2.946281,2.88881,2.720899,2.780559,2.800466,2.750499,2.757789,2.769331,2.740288,2.805363,,,,,,
1636,Caribbean small states,CSS,CO2 intensity (kg per kg of oil equivalent ene...,EN.ATM.CO2E.EG.ZS,,,,,,,,,,,,2.981459,2.767799,3.18557,3.265939,3.559908,4.632752,4.460784,4.090289,4.08165,4.157995,3.863719,3.864821,3.647059,3.338497,3.785759,3.101243,3.365404,2.783669,2.996005,2.841171,2.832316,2.83304,2.820058,2.754908,2.801756,2.845184,2.788445,2.515103,2.521288,2.544499,2.539917,2.527502,2.512494,2.544184,2.496589,2.485315,2.428676,2.491693,2.439868,2.456941,2.436703,2.418799,2.440773,2.414172,,,,,,
3065,Central Europe and the Baltics,CEB,CO2 intensity (kg per kg of oil equivalent ene...,EN.ATM.CO2E.EG.ZS,3.687416,3.655012,3.618671,3.794636,3.843568,3.84983,3.808153,3.724021,3.771374,3.712066,3.700793,3.443223,3.475812,3.413793,3.411482,3.414267,3.408728,3.374856,3.289347,3.27037,3.321013,3.260478,3.230597,3.26581,3.306323,3.26308,3.231961,3.233544,3.123572,3.159256,,,3.008946,3.004492,2.939487,2.891952,2.846321,2.835288,2.806038,2.796169,2.778844,2.780281,2.723098,2.725158,2.687614,2.669642,2.695068,2.699467,2.644495,2.577999,2.5913,2.60234,2.534722,2.526103,2.488878,,,,,,


<h3>Exercise 3:</h3>
<p>The DataFrame <code>Emissions_df</code> has one column per year of observation. Data in this form is usually referred to as data in <em>wide format</em>, as the number of columns is high. However, it might be easier to query and filter the data if we had a single column containing the year in which each indicator was calculated. This way, <em>each observation will be represented by a single row</em>. Use the <code>pandas</code> function <a href="https://pandas.pydata.org/docs/reference/api/pandas.melt.html"><code>melt()</code></a> to reshape the <code>Emissions_df</code> data into <em>long format</em>. The resulting DataFrame should contain a pair of new columns named <code>Year</code> and <code>Indicator Value</code>:</p>

**Answer.**

In [7]:
# Before doing anything else, inspect the null values in the DataFrame
print(Emissions_df.shape)
print(Emissions_df.isnull().sum())

# The columns 'Unnamed: 64', '2019' and '2018' are almost entirely null, so we drop them
Emissions_df=Emissions_df.drop(['Unnamed: 64','2019','2018'],axis=1)
print(Emissions_df.isnull().sum())
Emissions_df.head(3)

(11352, 65)
Country Name          0
Country Code          0
Indicator Name        0
Indicator Code        0
1960               9463
                  ...  
2016              10289
2017              10289
2018              11351
2019              11352
Unnamed: 64       11352
Length: 65, dtype: int64
Country Name          0
Country Code          0
Indicator Name        0
Indicator Code        0
1960               9463
                  ...  
2013               6227
2014               6249
2015              10289
2016              10289
2017              10289
Length: 62, dtype: int64


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
207,Arab World,ARB,CO2 intensity (kg per kg of oil equivalent ene...,EN.ATM.CO2E.EG.ZS,,,,,,,,,,,,4.914091,5.146828,5.461254,5.000364,4.575705,4.99104,4.800156,4.509988,4.04432,4.022687,3.423656,2.9413,2.908118,3.101542,3.105701,3.280671,3.023061,2.951808,2.854121,2.903059,3.091429,3.005554,3.105094,3.011296,2.824569,2.700443,2.514938,2.699857,2.713744,3.018297,2.795018,2.720807,2.884507,2.974147,2.946281,2.88881,2.720899,2.780559,2.800466,2.750499,2.757789,2.769331,2.740288,2.805363,,,
1636,Caribbean small states,CSS,CO2 intensity (kg per kg of oil equivalent ene...,EN.ATM.CO2E.EG.ZS,,,,,,,,,,,,2.981459,2.767799,3.18557,3.265939,3.559908,4.632752,4.460784,4.090289,4.08165,4.157995,3.863719,3.864821,3.647059,3.338497,3.785759,3.101243,3.365404,2.783669,2.996005,2.841171,2.832316,2.83304,2.820058,2.754908,2.801756,2.845184,2.788445,2.515103,2.521288,2.544499,2.539917,2.527502,2.512494,2.544184,2.496589,2.485315,2.428676,2.491693,2.439868,2.456941,2.436703,2.418799,2.440773,2.414172,,,
3065,Central Europe and the Baltics,CEB,CO2 intensity (kg per kg of oil equivalent ene...,EN.ATM.CO2E.EG.ZS,3.687416,3.655012,3.618671,3.794636,3.843568,3.84983,3.808153,3.724021,3.771374,3.712066,3.700793,3.443223,3.475812,3.413793,3.411482,3.414267,3.408728,3.374856,3.289347,3.27037,3.321013,3.260478,3.230597,3.26581,3.306323,3.26308,3.231961,3.233544,3.123572,3.159256,,,3.008946,3.004492,2.939487,2.891952,2.846321,2.835288,2.806038,2.796169,2.778844,2.780281,2.723098,2.725158,2.687614,2.669642,2.695068,2.699467,2.644495,2.577999,2.5913,2.60234,2.534722,2.526103,2.488878,,,


In [13]:
Emissions_Long=pd.melt(Emissions_df, id_vars=['Country Name','Country Code','Indicator Name','Indicator Code'],var_name='Year', value_name='Indicator Value')
print(Emissions_Long.head(3))
print(Emissions_Long.shape)

                     Country Name Country Code  \
0                      Arab World          ARB   
1          Caribbean small states          CSS   
2  Central Europe and the Baltics          CEB   

                                      Indicator Name     Indicator Code  Year  \
0  CO2 intensity (kg per kg of oil equivalent ene...  EN.ATM.CO2E.EG.ZS  1960   
1  CO2 intensity (kg per kg of oil equivalent ene...  EN.ATM.CO2E.EG.ZS  1960   
2  CO2 intensity (kg per kg of oil equivalent ene...  EN.ATM.CO2E.EG.ZS  1960   

   Indicator Value  
0              NaN  
1              NaN  
2         3.687416  
(658416, 6)


<h3>Exercise 4:</h3>
<p>The column <code>Indicator Value</code> of the new <code>Emissions_df</code> contains a bunch of <code>NaN</code> values. Additionally, the <code>Year</code> column contains an <code>Unnamed: 64</code> value. What procedure should we follow to clean these missing values in our DataFrame? Proceed with your suggested cleaning process.</p>

**The `Unnamed: 64` column was already dropped before reshaping, so it does not appear in `Year`. The remaining rows with a null `Indicator Value`, i.e. the years with no emissions data, are removed from the DataFrame.**

**Answer.**

In [24]:
# Remove the null values from the DataFrame
print(Emissions_Long.isnull().sum())
print(Emissions_Long.shape)
Emissions_df=Emissions_Long.dropna()  # dropna() is enough because only the Indicator Value column contains nulls
print()
print('**********After dropping the nulls********')
print()
print(Emissions_df.isnull().sum())
print(Emissions_df.shape)

Country Name            0
Country Code            0
Indicator Name          0
Indicator Code          0
Year                    0
Indicator Value    326924
dtype: int64
(658416, 6)

**********After dropping the nulls********

Country Name       0
Country Code       0
Indicator Name     0
Indicator Code     0
Year               0
Indicator Value    0
dtype: int64
(331492, 6)
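
One further cleaning step that may help later: after `melt()`, the `Year` values come from column names and are therefore strings. The sketch below, on hypothetical toy data, drops the missing values and converts `Year` to integers so that years sort and plot numerically. The names `toy_long` and `toy_clean` are illustrative only.

```python
import pandas as pd

# Hypothetical toy stand-in for the long-format emissions data
toy_long = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "CSS"],
    "Year": ["1960", "1961", "1960"],
    "Indicator Value": [None, 4.9, 3.0],
})

# Drop rows without a value, then make Year numeric
toy_clean = toy_long.dropna(subset=["Indicator Value"]).copy()
toy_clean["Year"] = toy_clean["Year"].astype(int)

print(toy_clean.dtypes)
```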


<h3>Exercise 5:</h3>
<p>Split the <code>Emissions_df</code> into two DataFrames, one containing only countries and the other containing only regions. Name these <code>Emissions_C_df</code> and <code>Emissions_R_df</code> respectively.</p>
<p><strong>Hint:</strong> You may want to inspect the file <code>WDICountry.csv</code> for this task. Region country codes may be found by looking at <code>null</code> values of the <code>Region</code> column in <code>WDICountry</code>.</p>

**Answer.**

In [29]:
# Load the table with country metadata so that we can use it to filter the data
WDICountry = pd.read_csv("./files/WDI_csv/WDICountry.csv")
print(WDICountry.head(3))

# List the country/region names and codes present in Emissions_df
list_Country=Emissions_df['Country Name'].unique().tolist()
print(list_Country)
list_Code=Emissions_df['Country Code'].unique().tolist()
print(list_Code)

# List the short names and the regions available in WDICountry
list_Short_Name= WDICountry['Short Name'].unique().tolist()
print(list_Short_Name)
list_Region= WDICountry['Region'].unique().tolist()
print(list_Region)



  Country Code   Short Name   Table Name                     Long Name  \
0          ABW        Aruba        Aruba                         Aruba   
1          AFG  Afghanistan  Afghanistan  Islamic State of Afghanistan   
2          AGO       Angola       Angola   People's Republic of Angola   

  2-alpha code   Currency Unit Special Notes                     Region  \
0           AW   Aruban florin           NaN  Latin America & Caribbean   
1           AF  Afghan afghani           NaN                 South Asia   
2           AO  Angolan kwanza           NaN         Sub-Saharan Africa   

          Income Group WB-2 code National accounts base year  \
0          High income        AW                        2000   
1           Low income        AF                     2002/03   
2  Lower middle income        AO                        2002   

  National accounts reference year                SNA price valuation  \
0                              NaN  Value added at basic prices (VAB)   
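
Following the hint, the split itself could be sketched as below. This uses hypothetical toy stand-ins for `WDICountry` and the cleaned emissions data, and it assumes that every aggregate (region, income group, and so on) has a missing `Region` value. The names `toy_country`, `toy_emissions`, `toy_C_df` and `toy_R_df` are illustrative.

```python
import pandas as pd

# Hypothetical toy stand-in for WDICountry: aggregates have a missing Region
toy_country = pd.DataFrame({
    "Country Code": ["ABW", "AFG", "ARB"],
    "Region": ["Latin America & Caribbean", "South Asia", None],
})

# Hypothetical toy stand-in for the cleaned long-format emissions data
toy_emissions = pd.DataFrame({
    "Country Code": ["ABW", "AFG", "ARB", "ARB"],
    "Year": ["2000", "2000", "2000", "2001"],
    "Indicator Value": [1.0, 2.0, 3.0, 4.0],
})

# Codes whose Region is null are treated as regions/aggregates
region_codes = toy_country.loc[toy_country["Region"].isna(), "Country Code"]
is_region = toy_emissions["Country Code"].isin(region_codes)

toy_R_df = toy_emissions[is_region]   # regions only
toy_C_df = toy_emissions[~is_region]  # countries only

print(toy_C_df.shape, toy_R_df.shape)
```

Note that any code missing from the metadata table would end up in the countries frame under this rule, so it is worth checking that all emission codes appear in `WDICountry`.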

<h2>Finalizing the cleaning for our study</h2>
<p>Our data has improved a lot by now. However, since the number of indicators is still quite large, let us focus our study on the following indicators for now:</p>
<ul>
<li>
<p><strong>Total greenhouse gas emissions (kt of CO2 equivalent), EN.ATM.GHGT.KT.CE</strong>: The total of greenhouse emissions includes CO2, Methane, Nitrous oxide, among other pollutant gases. Measured in kilotons.</p>
</li>
<li>
<p><strong>CO2 emissions (kt), EN.ATM.CO2E.KT</strong>: Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.  </p>
</li>
<li>
<p><strong>Methane emissions (kt of CO2 equivalent), EN.ATM.METH.KT.CE</strong>: Methane emissions are those stemming from human activities such as agriculture and from industrial methane production.</p>
</li>
<li>
<p><strong>Nitrous oxide emissions (kt of CO2 equivalent), EN.ATM.NOXE.KT.CE</strong>: Nitrous oxide emissions are emissions from agricultural biomass burning, industrial activities, and livestock management.</p>
</li>
<li>
<p><strong>Other greenhouse gas emissions, HFC, PFC and SF6 (kt of CO2 equivalent), EN.ATM.GHGO.KT.CE</strong>: Other pollutant gases.</p>
</li>
<li>
<p><strong>PM2.5 air pollution, mean annual exposure (micrograms per cubic meter), EN.ATM.PM25.MC.M3</strong>: Population-weighted exposure to ambient PM2.5 pollution is defined as the average level of exposure of a nation's population to concentrations of suspended particles measuring less than 2.5 microns in aerodynamic diameter, which are capable of penetrating deep into the respiratory tract and causing severe health damage. Exposure is calculated by weighting mean annual concentrations of PM2.5 by population in both urban and rural areas.</p>
</li>
<li>
<p><strong>PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total), EN.ATM.PM25.MC.ZS</strong>: Percent of population exposed to ambient concentrations of PM2.5 that exceed the World Health Organization (WHO) guideline value.</p>
</li>
</ul>

<h3>Exercise 6:</h3>
<p>For each of the emissions DataFrames, extract the rows corresponding to the above indicators of interest. Replace the long names of the indicators by the short names <code>Total</code>, <code>CO2</code>, <code>CH4</code>, <code>N2O</code>, <code>Other</code>, <code>PM2.5</code>, and <code>PM2.5_WHO</code>. (This will be helpful later when we need to label plots of our data.) </p>

**Answer.**
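
A minimal sketch of one possible approach, shown on hypothetical toy data: filter on `Indicator Code` with the codes listed above and map each code to its short name. The helper `keep_indicators` and the frame `toy` are illustrative names, and the code `XX.YY.ZZ` is made up.

```python
import pandas as pd

# Indicator codes from the list above, mapped to the requested short names
short_names = {
    "EN.ATM.GHGT.KT.CE": "Total",
    "EN.ATM.CO2E.KT": "CO2",
    "EN.ATM.METH.KT.CE": "CH4",
    "EN.ATM.NOXE.KT.CE": "N2O",
    "EN.ATM.GHGO.KT.CE": "Other",
    "EN.ATM.PM25.MC.M3": "PM2.5",
    "EN.ATM.PM25.MC.ZS": "PM2.5_WHO",
}

# Hypothetical toy stand-in for Emissions_C_df or Emissions_R_df
toy = pd.DataFrame({
    "Country Code": ["ABW", "ABW", "ABW"],
    "Indicator Name": ["CO2 emissions (kt)",
                       "Methane emissions (kt of CO2 equivalent)",
                       "Some unrelated indicator"],
    "Indicator Code": ["EN.ATM.CO2E.KT", "EN.ATM.METH.KT.CE", "XX.YY.ZZ"],
    "Year": [2000, 2000, 2000],
    "Indicator Value": [10.0, 5.0, 1.0],
})

def keep_indicators(df, mapping):
    """Keep rows whose code is in `mapping` and replace the long name by the short one."""
    out = df[df["Indicator Code"].isin(list(mapping))].copy()
    out["Indicator Name"] = out["Indicator Code"].map(mapping)
    return out

toy_short = keep_indicators(toy, short_names)
print(toy_short)
```

The same helper could be applied to both emissions DataFrames so that they share identical labels.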

-------

<h2>Where shall the client start environmental campaigns?</h2>
<p>Now the DataFrames <code>Emissions_C_df</code> and <code>Emissions_R_df</code> seem to be in a good shape. Let's proceed to conduct some exploratory data analysis so that we can make recommendations to our client.</p>

<h3>Exercise 7:</h3>
<p>Let's first calculate some basic information about the main indicators across the globe.</p>
<h4>7.1</h4>
<p>Compute some basic statistics of the emissions (in kt) for each of the four main pollutants (<code>CO2, CH4, N2O, Other</code>) over the years. Use the <code>Emissions_C_df</code> data frame. What trends do you see? </p>

**Answer.**
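
One way to obtain such statistics is a grouped aggregation per pollutant and year. The sketch below uses hypothetical toy values and assumes the short indicator names from Exercise 6 are stored in `Indicator Name`. It is an illustration, not the result for the real data.

```python
import pandas as pd

# Hypothetical toy stand-in for Emissions_C_df after Exercise 6
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "Indicator Name": ["CO2", "CO2", "CO2", "CO2", "CH4", "CH4"],
    "Year": [2000, 2000, 2001, 2001, 2000, 2000],
    "Indicator Value": [10.0, 30.0, 20.0, 40.0, 1.0, 3.0],
})

pollutants = ["CO2", "CH4", "N2O", "Other"]

# One row per (pollutant, year) with a few summary statistics across countries
stats = (
    toy[toy["Indicator Name"].isin(pollutants)]
    .groupby(["Indicator Name", "Year"])["Indicator Value"]
    .agg(["mean", "median", "min", "max"])
)
print(stats)
```

Comparing the mean with the median per year gives a first hint of skewness, which is relevant for the question about the tails in 7.2.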

-------

<h4>7.2</h4>
<p>What can you say about the <em>distribution</em> of emissions around the globe over the years? What information can you extract from the <em>tails</em> of these distributions over the years?</p>

**Answer.**

-------

<h4>7.3</h4>
<p>Compute a plot showing the behavior of each of the four main air pollutants for each of the main global regions in the <code>Emissions_R_df</code> data frame. The main regions are <code>'Latin America &amp; Caribbean', 'South Asia', 'Sub-Saharan Africa', 'Europe &amp; Central Asia', 'Middle East &amp; North Africa', 'East Asia &amp; Pacific'</code> and <code>'North America'</code>. What conclusions can you make?</p>

**Answer.**
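
A possible starting point is to reshape the regional data into one table per pollutant, with years as rows and regions as columns. The sketch below does this on hypothetical toy values; `region_table` and `toy_R` are illustrative names, and the short indicator names from Exercise 6 are assumed.

```python
import pandas as pd

# Hypothetical toy stand-in for Emissions_R_df after Exercise 6
toy_R = pd.DataFrame({
    "Country Name": ["South Asia", "South Asia", "North America", "North America", "World"],
    "Indicator Name": ["CO2"] * 5,
    "Year": [2000, 2001, 2000, 2001, 2000],
    "Indicator Value": [1.0, 2.0, 5.0, 4.0, 20.0],
})

main_regions = [
    "Latin America & Caribbean", "South Asia", "Sub-Saharan Africa",
    "Europe & Central Asia", "Middle East & North Africa",
    "East Asia & Pacific", "North America",
]

def region_table(df, pollutant, regions):
    """Return a Year x Region table of values for one pollutant."""
    subset = df[(df["Indicator Name"] == pollutant) & (df["Country Name"].isin(regions))]
    return subset.pivot(index="Year", columns="Country Name", values="Indicator Value")

co2_by_region = region_table(toy_R, "CO2", main_regions)
print(co2_by_region)
```

Calling `co2_by_region.plot()` would then draw one line per region; repeating this for each pollutant yields the four requested plots.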

-------

<p>It seems that countries in East Asia and the Pacific are doing the worst at controlling pollutant emissions. We also see that Europe and Central Asia have been making some efforts to reduce their emissions. Surprisingly, this is not the case for North America and Sub-Saharan Africa, whose levels have been increasing over the years as well.</p>

<h3>Exercise 8:</h3>
<p>In Exercise 7 we discovered some interesting features of the distribution of the emissions over the years. Let us explore these features in more detail. </p>

<h4>8.1</h4>
<p>Which are the top five countries that have been in the top 10 of <code>CO2</code> emitters over the years? Have any of these countries made efforts to reduce the amount of CO2 emissions over the last 10 years?</p>

**Answer.**
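
One way to read "in the top 10 over the years" is to rank countries within each year and count how often each country lands in the top group. The sketch below illustrates this on hypothetical toy data with a top 2 instead of a top 10. The function name and its parameters are illustrative, and ties are broken by order of appearance, which is an assumption.

```python
import pandas as pd

# Hypothetical toy CO2 values for three countries and two years
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "A", "B", "C"],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Indicator Value": [5.0, 3.0, 1.0, 4.0, 1.0, 6.0],
})

def most_frequent_top_emitters(df, top_n, keep):
    """Count, per country, the number of years it ranks within the top_n emitters."""
    ranks = df.groupby("Year")["Indicator Value"].rank(ascending=False, method="first")
    in_top = df[ranks <= top_n]
    return in_top["Country Name"].value_counts().head(keep)

result = most_frequent_top_emitters(toy, top_n=2, keep=2)
print(result)
```

For the real data, `top_n=10` and `keep=5` on the `CO2` rows of the countries frame would match the wording of the question.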

-------

<h4>8.2</h4>
<p>Are these five countries responsible for most of the emissions produced globally over the years? Can we say that the rest of the world is making some effort to control its emissions of polluting gases over the years?</p>

**Answer.**
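
A simple way to quantify this is the yearly share of total emissions attributable to the selected countries. The sketch below uses hypothetical toy data and assumes the frame holds countries only, so that the yearly sum approximates the global total. `top_share_by_year` is an illustrative helper.

```python
import pandas as pd

# Hypothetical toy CO2 values for three countries and two years
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "A", "B", "C"],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Indicator Value": [5.0, 3.0, 1.0, 4.0, 1.0, 6.0],
})

def top_share_by_year(df, top_countries):
    """Share of each year's summed emissions that comes from top_countries."""
    total = df.groupby("Year")["Indicator Value"].sum()
    top = (
        df[df["Country Name"].isin(top_countries)]
        .groupby("Year")["Indicator Value"]
        .sum()
    )
    return (top / total).rename("share")

share = top_share_by_year(toy, ["A"])
print(share)
```

Countries with missing values in a given year are left out of that year's total, so shares for sparsely reported years should be interpreted with care.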

-------

<h2>The health impacts of air pollution</h2>

<h3>Exercise 9:</h3>
<p>One of the main contributors to poor health from air pollution is particulate matter. In particular, very small particles (those with a size less than 2.5 micrometres ($\mu$m)) can enter and affect the respiratory system. The <code>PM2.5</code> indicator measures the average level of exposure of a nation's population to concentrations of these small particles. The <code>PM2.5_WHO</code> indicator measures the percentage of the population exposed to ambient concentrations of these particles that exceed the guideline value set by the World Health Organization (WHO). In particular, countries with a higher <code>PM2.5_WHO</code> indicator are more likely to suffer from bad health conditions. </p>
<h4>9.1</h4>
<p>The client would like to know if there is any relationship between the <code>PM2.5_WHO</code> indicator and the level of income of the general population, as well as how this changes over time. What plot(s) might be helpful to solve the client's question?  What conclusion can you draw from your plot(s) to answer their question?</p>
<p><strong>Hint:</strong> The DataFrame <code>WDI_countries</code> contains a column named <code>Income Group</code>. </p>

**Answer.**
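
To prepare such plots, the indicator values can be joined with the income group of each country and averaged per year and group. The sketch below shows the idea on hypothetical toy data; `toy_pm` and `toy_income` stand in for the `PM2.5_WHO` rows and for the country metadata, respectively.

```python
import pandas as pd

# Hypothetical toy PM2.5_WHO values (% of population above the guideline)
toy_pm = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "Year": [2010, 2010, 2010, 2015, 2015, 2015],
    "Indicator Value": [100.0, 80.0, 20.0, 100.0, 60.0, 10.0],
})

# Hypothetical toy country metadata with an Income Group column
toy_income = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Income Group": ["Low income", "Low income", "High income"],
})

# Attach the income group, then average per year and group
merged = toy_pm.merge(toy_income, on="Country Code", how="left")
by_income = (
    merged.groupby(["Year", "Income Group"])["Indicator Value"]
    .mean()
    .unstack("Income Group")
)
print(by_income)
```

The resulting table has one column per income group, so a line plot over `Year` or a box plot per group can be drawn directly from it.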

-------

<h4>9.2</h4>
<p>What do you think are the causes behind the results in Exercise 9.1?</p>

**Answer.**

-------

<h3>Exercise 10:</h3>
<p>Finally, our client is interested in investigating the impacts and relationships between <strong>high levels of exposure to particulate matter</strong> and <strong>the health of the population</strong>. Coming up with additional data for this task may be infeasible for the client, thus they have asked us to search for relevant health data in the <code>WDIData.csv</code> file and work with that. </p>

<h4>10.1</h4>
<p>Which indicators present in the <code>WDISeries.csv</code> file might be useful to solve the client's question? Explain.</p>
<p><strong>Note:</strong> Naming one or two indicators is more than enough for this question. </p>

**Answer.**

-------

<h4>10.2</h4>
<p>Use the indicators provided in Exercise 10.1 to give valuable information to the client. </p>

**Answer.**

-------

<h4>10.3</h4>
<p>Extend the analysis above to find some countries of interest. These are defined as</p>
<ul>
<li>The countries that have a high mortality rate due to household and ambient air pollution, but with low PM2.5 exposure</li>
<li>The countries that have a low mortality rate due to household and ambient air pollution, but with high PM2.5 exposure</li>
</ul>

**Answer.**
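
"High" and "low" need a working definition. The sketch below assumes quartile thresholds (upper quartile for high, lower quartile for low), which is only one reasonable choice. The data are hypothetical, with one row per country and illustrative column names `PM2.5` and `Mortality`.

```python
import pandas as pd

# Hypothetical toy data: one row per country for a single year
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "PM2.5": [10.0, 90.0, 15.0, 80.0],
    "Mortality": [200.0, 10.0, 20.0, 180.0],
})

def countries_of_interest(df, low_q=0.25, high_q=0.75):
    """Return (high mortality & low PM2.5, low mortality & high PM2.5) country codes."""
    pm_low, pm_high = df["PM2.5"].quantile([low_q, high_q]).tolist()
    mort_low, mort_high = df["Mortality"].quantile([low_q, high_q]).tolist()
    high_mort_low_pm = df[(df["Mortality"] >= mort_high) & (df["PM2.5"] <= pm_low)]
    low_mort_high_pm = df[(df["Mortality"] <= mort_low) & (df["PM2.5"] >= pm_high)]
    return (
        high_mort_low_pm["Country Code"].tolist(),
        low_mort_high_pm["Country Code"].tolist(),
    )

high_low, low_high = countries_of_interest(toy)
print(high_low, low_high)
```

Changing `low_q` and `high_q` makes the selection stricter or looser, which is useful for checking how sensitive the list of countries is to the chosen thresholds.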

-------

<h4>10.4</h4>
<p>Finally, we want to look at the mortality data by income. We expect higher income countries to have lower pollution-related mortality. Find out if this assumption holds. Calculate summary statistics and histograms for each income category and note any trends.</p>

**Answer.**

-------

<h4>10.5</h4>
<p>At the start, we asked some questions. Based on your analysis, provide a short answer to each of these:</p>
<ol>
<li>Are we making any progress in reducing the amount of emitted pollutants across the globe?</li>
<li>Which are the critical regions where we should start environmental campaigns?</li>
<li>Are we making any progress in the prevention of deaths related to air pollution?</li>
<li>Which demographic characteristics seem to correlate with the number of health-related issues derived from air pollution? </li>
</ol>

**Answer.**

-------