# Introduction
For my very first data science project, I would like to explore how data science can help us understand the environmental challenges Singapore faces today as a nation. I chose to look at factors that affect air pollution because Singapore needs to take more steps to ensure that we [are able to meet air quality targets](https://www.channelnewsasia.com/news/singapore/singapore-not-meeting-its-air-quality-targets-masagos-7543240) set by the World Health Organization (WHO). Identifying the root causes of the problem would help administrators develop targeted policies that address these issues directly.

According to the [National Environment Agency](https://www.nea.gov.sg/our-services/pollution-control/air-pollution/air-quality) (NEA), Singapore enjoys better air quality than many cities in Asia, comparable with that of cities in the United States and Europe. Singapore’s Pollutant Standards Index (PSI) has remained in the 'Good' and 'Moderate' range for much of 2017.

The main sources of air pollution in Singapore are emissions from industries and motor vehicles. Therefore, we will look not only at the vehicle population data and the manufacturing industry data, but also at whether there is a relation between commercial and residential development and air pollution. I chose to include the latter because Singapore constantly has ongoing construction projects, and I would like to understand whether this affects the environment.

## Background
[Air pollution](https://www.environmentalpollutioncenters.org/air/) can be defined as the presence of toxic chemicals or compounds (including those of biological origin) in the air, at levels that pose a health risk. In an even broader sense, air pollution means the presence of chemicals or compounds in the air which are usually not present and which lower the quality of the air or cause detrimental changes to the quality of life (such as the damaging of the ozone layer or causing global warming).

Air pollution is probably one of the most serious environmental problems confronting our civilization today. Most often, it is caused by human activities such as mining, construction, transportation, industrial work, agriculture, smelting, etc. However, natural processes such as volcanic eruptions and wildfires may also pollute the air, but their occurrence is rare and they usually have a local effect, unlike human activities that are ubiquitous causes of air pollution and contribute to the global pollution of the air every single day.

The air pollutants that we will be studying in this project are as follows:
* Sulphur dioxide (SO2): This contaminant is mainly emitted during the combustion of fossil fuels such as crude oil and coal.
* Carbon monoxide (CO): This gas forms during the incomplete combustion of fuels, for example when a car engine runs in a closed room.
* Nitrogen dioxide (NO2): These contaminants are emitted by traffic, combustion installations and industry.
* Ozone (O3): Ozone is created through the influence of ultraviolet (UV) sunlight on pollutants in the outside air.
* Particulate matter (PM): Particulate matter is the sum of all solid and liquid particles suspended in air. NEA uses two measurements: PM-10 (10 micrometers or less) and PM-2.5 (2.5 micrometers or less). This complex mixture includes both organic and inorganic particles, such as dust, pollen, soot, smoke, and liquid droplets. These particles vary greatly in size, composition, and origin.

## Project overview
In this project I am trying to understand which factors affect air pollution in Singapore only. A description of the pollutants can be found [here](https://www.nea.gov.sg/our-services/pollution-control/air-pollution/air-quality). Links to all the data used here can be found at the end of this document.

The hypotheses that I am trying to test are as follows: 
1. An increase in the rate of manufacturing would lead to an increase in pollution in Singapore
2. An increase in the rate of commercial and housing development would lead to an increase in pollution in Singapore
3. An increase in vehicle population would lead to an increase in pollution in Singapore

In addition, I would like to identify which pollutant each of these factors correlates with the most.

## Results analysis
The analysis flow can be summarised into five main steps. They are as follows:

First, we will download and import the data. As mentioned previously, the download links for the data used can be found at the end of this document. All downloaded data are in the form of a CSV file.

Second, we will perform data cleansing. In this step, I first remove all NaN or 0 values and select a timeframe to study. Then I perform [mean normalization](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e) on the data so that the features can be compared on a common scale in the statistics and plots. We have to normalize our data because our features do not have a uniform scale. Since I wanted data that was centered at zero and kept within a small range, I subtract each column's mean and divide by its standard deviation. Strictly speaking, this is standardization (z-scoring); this notebook refers to it as mean normalization throughout.
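
As a rough illustration of this scaling step, here is a minimal sketch on made-up numbers. The `toy` frame and its column names are hypothetical and not part of the datasets. It contrasts the `(x - mean) / std` scaling that the code cells in this notebook apply with the textbook definition of mean normalization, `(x - mean) / (max - min)`.

```python
import pandas as pd

# Hypothetical toy data on very different scales (not from the real datasets)
toy = pd.DataFrame({
    "vehicles": [900_000.0, 920_000.0, 950_000.0, 960_000.0],
    "co_mean": [1.5, 1.7, 1.6, 1.9],
})

# Z-score scaling: subtract the column mean, divide by the column standard deviation
z_scored = (toy - toy.mean()) / toy.std()

# Textbook mean normalization divides by the column range instead
mean_normalized = (toy - toy.mean()) / (toy.max() - toy.min())

print(z_scored.round(2))
print(mean_normalized.round(2))
```

Both variants center every column at zero, so either one puts the features on a comparable scale for plotting. They differ only in the spread of the resulting values.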

Third, I want to statistically analyze the data. [DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) computes the pairwise correlation of columns. Its default method is Pearson, so I pass method = 'spearman' to compute [Spearman's](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php) rank correlation coefficient instead.

Spearman's correlation determines the strength and direction of the monotonic relationship between two variables, rather than the strength and direction of the linear relationship between them, which is what Pearson's correlation determines. A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the value of the other variable decreases. Since we are trying to understand the trend, Spearman is the most suitable method to compute the correlation coefficient.
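
The difference between the two coefficients can be seen in a small sketch. The `demo` frame below is made up for illustration: `y` always increases with `x`, but along a curve rather than a straight line.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7]})
demo["y"] = demo["x"] ** 3

# DataFrame.corr returns a matrix; pick the x-versus-y entry
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1, the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # 1.000, the ranks agree perfectly
```

Because Spearman works on ranks, any strictly increasing transformation of a column leaves the coefficient unchanged.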

Next, I will plot the data on a graph to verify the results obtained in the third step.

Finally, I will conclude with a short write up about the findings.

## Import the relevant libraries

In [None]:
import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Width = 16, Height = 6
DIMS=(16, 6)

## Import the pollution data

In [None]:
df1 = pd.read_csv("../input/air-pollutant-carbon-monoxide-2nd-maximum-8-hour-mean.csv")
df2 = pd.read_csv("../input/air-pollutant-nitrogen-dioxide.csv")
df3 = pd.read_csv("../input/air-pollutant-ozone.csv")
df4 = pd.read_csv("../input/air-pollutant-particulate-matter-pm2-5.csv")
df5 = pd.read_csv("../input/air-pollutant-particulate-matter-pm10.csv")
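# NOTE: df6 loads the same PM10 file as df5, so the merge below yields duplicated
# PM10 columns with _x/_y suffixes. The sulphur dioxide file listed in the appendix is never loaded.
df6 = pd.read_csv("../input/air-pollutant-particulate-matter-pm10.csv")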
df7 = pd.read_csv("../input/air-polluant-lead.csv")

Now, I want to combine all my dataframes into one dataframe so that it is easier to process and plot the data.

In [None]:
pollution_df_1 = pd.merge(df1, df2, on='year', how='outer')
pollution_df_2 = pd.merge(pollution_df_1, df3, on='year', how='outer')
pollution_df_3 = pd.merge(pollution_df_2, df4, on='year', how='outer')
pollution_df_4 = pd.merge(pollution_df_3, df5, on='year', how='outer')
pollution_df_5 = pd.merge(pollution_df_4, df6, on='year', how='outer')
pollution_df_6 = pd.merge(pollution_df_5, df7, on='year', how='outer')
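
The chain of merges above can also be written as a fold over a list of frames. The following is only a sketch of that alternative, using small hypothetical stand-in frames (`frames`, `co`, `no2` and `o3` are made-up names) rather than the real CSV files.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the per-pollutant frames loaded above
frames = [
    pd.DataFrame({"year": [2008, 2009], "co": [1.5, 1.7]}),
    pd.DataFrame({"year": [2008, 2009], "no2": [22.0, 23.0]}),
    pd.DataFrame({"year": [2009, 2010], "o3": [100.0, 120.0]}),
]

# Fold the list into one frame with successive outer merges on 'year'
merged = reduce(
    lambda left, right: pd.merge(left, right, on="year", how="outer"),
    frames,
)

print(merged.sort_values("year"))
```

An outer merge keeps every year that appears in any frame and fills the gaps with NaN, which is why the next step drops incomplete rows.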

## Data preprocessing

Next, we need to pre-process this data. I drop all the rows with 'NaN' values and keep only the data from 2008 to 2014. I chose to study 2008 to 2014 because that is the longest timeframe in which every feature has data for every year.

In [None]:
#Remove NaN values
pol_df = pollution_df_6.dropna().sort_values('year', ascending=True)

#Select data from 2008 to 2014
year = range(2008, 2015)
pol_df = pol_df[pol_df['year'].isin(year)]
pol_df = pol_df.drop('year', axis=1)

#Mean Normalization
pol_df=(pol_df-pol_df.mean())/pol_df.std()
pol_df['year'] = year

pol_df
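
One way to check which years have complete data is to drop the incomplete rows and look at the first and last year that remain. This is a sketch on a hypothetical frame (`merged`, `co` and `pm25` are made-up names), not the actual pollution data.

```python
import pandas as pd

# Hypothetical merged frame: some pollutants start being reported later than others
merged = pd.DataFrame({
    "year": [2006, 2007, 2008, 2009, 2010],
    "co": [None, 1.4, 1.5, 1.7, 1.6],
    "pm25": [None, None, 16.0, 19.0, 17.0],
})

# Keep only the years in which every column has a value
complete = merged.dropna()
first_year = int(complete["year"].min())
last_year = int(complete["year"].max())

print(f"Complete data from {first_year} to {last_year}")
```

This simple check assumes there are no gaps in the middle of the series. If a year in between were missing, the remaining rows would have to be inspected directly.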

## Plotting the pollution graph

In [None]:
#Variables to plot
Var_to_plot = ['carbon_monoxide_2nd_maximum_8hourly_mean','nitrogen_dioxide_mean', 'pm2.5_mean','ozone_4th_maximum_8hourly_mean',
               'pm10_2nd_maximum_24hourly_mean_x', 'air_pollutant_lead_mean']  # the _x suffix comes from merging the PM10 file twice

#Draw plot
Indi_pol_plot = pol_df.plot(x='year', y = Var_to_plot, kind = 'line', grid = True, figsize=DIMS,
                        title = 'Individual Pollution in Singapore from 2008 to 2014')

#Graph formatting
Indi_pol_plot.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

## Observation

This graph shows the trends of multiple pollutants. We will use this information to find out whether there is a correlation between pollution and certain industries.

## Importing manufacturing data
Now, we wish to study if changes in the manufacturing industry can affect the environment. According to the [smeportal](https://www.smeportal.sg/content/smeportal/en/industries/manufacturing.html) website, the manufacturing industry contributes 20% to 25% of Singapore’s Gross Domestic Product (GDP). [This](https://www.smeportal.sg/content/smeportal/en/industries/manufacturing/overview-of-manufacturing-industry.html) web page shows us that the two biggest industries are Chemicals & Chemical Products and Computer, Electronic & Optical Products. Therefore, we will study the behavior of these two industries.

To get started, we import the data. 

In [None]:
maufacture_df =  pd.read_csv("../input/total-output-in-manufacturing-by-industry-annual.csv")

## Data preprocessing

In [None]:
#Selecting years required
year = range(2008, 2015)
maufacture_df = maufacture_df[maufacture_df['year'].isin(year)]

#Selecting the industries
listtofind = ['Chemicals & Chemical Products', 'Computer, Electronic & Optical Products' ]
maufacture_df = maufacture_df[maufacture_df['level_2'].isin(listtofind)]

#Reformat the dataframe
maufacture_df =  maufacture_df.set_index(['year', 'level_2'])['value'].unstack()

#Mean normalization
maufacture_df=(maufacture_df-maufacture_df.mean())/maufacture_df.std()
maufacture_df['year'] = year
maufacture_df

## Plotting the manufacturing graph

In [None]:
#Draw plot
manu_graph = maufacture_df.plot(x='year', y=listtofind, kind = 'line', grid = True, figsize=DIMS,
                         title = 'Manufacturing of different industries from 2008 to 2014')
Indi_pol_plot = pol_df.plot(x='year', y = Var_to_plot, kind = 'line', grid = True, figsize=DIMS,
                        title = 'Individual Pollution in Singapore from 2008 to 2014')

#Graph formatting
Indi_pol_plot.legend(loc='center left', bbox_to_anchor=(1, 0.5))
manu_graph.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()

## Finding which pollutant the electronics industry correlates with the most

In [None]:
#Create new dataframe
com_corr = pol_df.copy()

#Add in the electronics data
com_corr['Com'] = maufacture_df['Computer, Electronic & Optical Products'].tolist()

#Produce correlation dataframe
com_corr.corr(method = 'spearman')

## Observation
As seen from the table above, there is an 86% correlation (a Spearman coefficient of 0.86) between the electronics industry and the carbon monoxide pollutant. To study this visually, we will plot a graph.
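
With only seven yearly observations, it can also help to look at the p-value that accompanies a Spearman coefficient. The sketch below is not part of the original analysis: it uses `scipy.stats.spearmanr` on two made-up series (`industry` and `pollutant` are hypothetical values, not the real data) to show how the coefficient and its p-value could be obtained together.

```python
from scipy.stats import spearmanr

# Hypothetical normalized values for seven years (not the real data)
industry = [-1.2, -0.8, 0.1, 0.9, 0.4, 1.1, -0.5]
pollutant = [-1.0, -0.3, -0.6, 1.2, 0.5, 0.8, -0.4]

# spearmanr returns the rank correlation coefficient and a two-sided p-value
rho, p_value = spearmanr(industry, pollutant)

print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```

A small p-value suggests that a coefficient of this size would be unlikely if the two series were unrelated. With a sample this small, the result should be read as indicative only.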

## Plotting the electronics industry VS Carbon monoxide pollution graph

In [None]:
#Draw plot
manu_graph = maufacture_df.plot(x='year', y='Computer, Electronic & Optical Products', kind = 'line', grid = True)
pol_df.plot(x='year', y = 'carbon_monoxide_2nd_maximum_8hourly_mean', kind = 'line', grid = True, figsize=DIMS, ax=manu_graph,
                        title = 'Electronics Industry VS Carbon Monoxide Pollution')

#Graph formatting
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()

## Observation
This graph supports the above observation, as the trends of carbon monoxide pollution and the electronics industry are very similar.

## Finding which pollutant the chemical industry correlates with the most

In [None]:
#Create new dataframe
chem_corr = pol_df.copy()

#Add in the chemical data
chem_corr['Chem'] = maufacture_df['Chemicals & Chemical Products'].tolist()

#Produce correlation dataframe
chem_corr.corr(method = 'spearman')

## Observation
As seen from the table above, there is an 82% correlation (a Spearman coefficient of 0.82) between the chemical industry and the ozone pollutant. To study this visually, we will plot a graph.

## Plotting the chemical industry VS Ozone pollution graph

In [None]:
#Draw plot
manu_graph = maufacture_df.plot(x='year', y='Chemicals & Chemical Products', kind = 'line', grid = True)
pol_df.plot(x='year', y = 'ozone_4th_maximum_8hourly_mean', kind = 'line', grid = True, figsize=DIMS, ax=manu_graph,
                        title = 'Chemicals Industry VS Ozone Pollution')

#Graph formatting
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()

## Observation
This graph supports the above observation, as the trends of ozone pollution and the chemicals industry are very similar.

## Hypothesis 1 conclusion
Studying the statistical data closely, we can see that the electronics industry has a Spearman correlation of 86% with carbon monoxide pollution and the chemical industry has a Spearman correlation of 82% with ozone pollution. To supplement this observation, we can also see from the graphs that the variables follow similar trends.

Therefore, I can conclude that there is **sufficient** evidence to support my first hypothesis claim.

## Importing housing data

Next, we import the housing data and do the necessary data pre-processing. For this section, I will only be looking at the under-construction data for HDB flats, for Emporiums and Supermarkets, and for Shops, Lock-Up Shops and Eating Houses. I chose HDB for housing because 80% of Singaporeans live in HDB flats. For commercial development, I chose Emporiums and Supermarkets and Shops, Lock-Up Shops and Eating Houses because these two categories are the most commonly built commercial developments in Singapore. In addition, these are the only two categories that have all the data for the selected time period.

First, for the HDB development dataset, we select the data from 2008 to 2014. Then, we remove DBSS flats in the 'type' column, as we are only looking at HDB flats. Next, we select 'Under Construction' in the 'status' column, as pollution is mainly produced during construction. After that, we convert the data in the 'no_of_units' column from string to a numeric type. Finally, we perform a mean normalization.

In [None]:
flats_df =  pd.read_csv("../input/completion-status-of-hdb-residential-developments.csv")
commercial_df = pd.read_csv("../input/completion-status-of-hdb-commercial-developments.csv")

## Data preprocessing for residential development

In [None]:
#Select the data we want
flats_df = flats_df[flats_df['financial_year'].isin(year)]
flats_df = flats_df[(flats_df["type"] != "DBSS") & (flats_df["status"] == 'Under Construction')]

#Change the data type
flats_df['no_of_units'] = flats_df['no_of_units'].astype(float)

#Mean normalization
flats_df['no_of_units']=(flats_df['no_of_units']-flats_df['no_of_units'].mean())/flats_df['no_of_units'].std()

flats_df
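
If a text column might contain entries that cannot be parsed as numbers, `pd.to_numeric` with `errors="coerce"` is one defensive option for the conversion step. The sketch below uses a hypothetical `units` series; whether the real 'no_of_units' column contains such entries is an assumption that has not been checked here.

```python
import pandas as pd

# Hypothetical column that arrives as text, with one unparseable entry
units = pd.Series(["1200", "850", "na", "4300"], name="no_of_units")

# errors="coerce" turns anything that cannot be parsed into NaN
units_numeric = pd.to_numeric(units, errors="coerce")

print(units_numeric)
print("Unparseable rows:", int(units_numeric.isna().sum()))
```

Counting the NaN values afterwards makes it visible how many rows were affected, instead of the conversion failing on the first bad entry.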

## Data preprocessing for commercial development

For the commercial development dataset, we first select the data from 2008 to 2014. Then, we select the 'Under Construction' data in the 'status' column and the two categories that we want from the 'type' column. Next, we rearrange the dataframe so that each category has its own column. Finally, we perform mean normalization.


In [None]:
#Select the data we want
commercial_df = commercial_df[commercial_df['financial_year'].isin(year)]
commercial_df = commercial_df[(commercial_df["status"] == 'Under Construction') & (commercial_df['no_of_units'] != 0) & (commercial_df['type'].isin(['Shops, Lock-Up Shops and Eating Houses', 'Emporiums and Supermarkets']))]

#Reformat dataframe
comm_df =  commercial_df.set_index(['financial_year', 'type'])['no_of_units'].unstack().reset_index()

#Mean normalization
comm_df=(comm_df-comm_df.mean())/comm_df.std()

comm_df
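
The rearranging step, where each category gets its own column, can be sketched on a tiny example. The `long_df` frame and its values are hypothetical. The sketch shows the `set_index` / `unstack` idiom used in this notebook next to `DataFrame.pivot`, which expresses the same reshaping.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (year, type) pair
long_df = pd.DataFrame({
    "financial_year": [2008, 2008, 2009, 2009],
    "type": ["Shops", "Supermarkets", "Shops", "Supermarkets"],
    "no_of_units": [120, 3, 150, 5],
})

# Route 1: the set_index / unstack idiom used in this notebook
wide_a = long_df.set_index(["financial_year", "type"])["no_of_units"].unstack()

# Route 2: DataFrame.pivot, which names the three roles explicitly
wide_b = long_df.pivot(index="financial_year", columns="type", values="no_of_units")

print(wide_a)
print(wide_b)
```

Both routes need each (year, type) pair to appear only once. Duplicate pairs would raise an error, which is a useful sanity check on the filtered data.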

## Joining the dataframes together

In [None]:
property_df = pd.DataFrame()
property_df['year'] = year
property_df['HDB'] = flats_df['no_of_units'].tolist()
property_df['Emporiums and Supermarkets'] = comm_df['Emporiums and Supermarkets'].tolist()
property_df['Shops, Lock-Up Shops and Eating Houses'] = comm_df['Shops, Lock-Up Shops and Eating Houses'].tolist()

property_df

## Finding a correlation between residential and commercial data and each pollutant

In [None]:
#Create new dataframe
housing_corr = pol_df.copy()

#Add in the housing data
housing_corr['HDB'] = property_df['HDB'].tolist()
housing_corr['E&S'] = property_df['Emporiums and Supermarkets'].tolist()
housing_corr['Shops'] = property_df['Shops, Lock-Up Shops and Eating Houses'].tolist()

#Produce correlation dataframe
housing_corr.corr(method = 'spearman')

## Observation

As seen from the table above, all three features are very closely related, more than 80%, to the ozone pollution. We will verify this by plotting the graphs.

## Plotting residential and commercial data

In [None]:
#Draw plot
property_df_plot = property_df.plot(x='year', y=['HDB', 
                              'Emporiums and Supermarkets', 
                              'Shops, Lock-Up Shops and Eating Houses'], 
                              kind = 'line', grid = True, figsize = DIMS,
                              title = 'Under construction HDB from 2008 to 2014')
pol_df.plot(x='year', y = 'ozone_4th_maximum_8hourly_mean', kind = 'line', grid = True, figsize=DIMS, ax=property_df_plot,
                        title = 'Residential and Commercial Development VS Ozone Pollution')

#Graph formatting
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()

## Hypothesis 2 conclusion
Statistically, each feature is closely related to the ozone pollution; 82% for HDB, 86% for emporiums and supermarkets and 86% as well for shops, lock-up shops and eating houses. The graph plotted also shows that the features do follow the trend of the ozone pollution.

Therefore, I can conclude that there is **sufficient** evidence to support my second hypothesis claim.

## Import vehicle data

Now we import the vehicle data and pre-process it. As in the previous sections, we first select the years that we are interested in. We then group by year so that the populations of the different vehicle types are added together. Lastly, we perform a mean normalization.

In [None]:
veh_df = pd.read_csv("../input/annual-motor-vehicle-population-by-vehicle-type.csv")

## Data preprocessing

In [None]:
#Select the data we need
veh_df = veh_df[veh_df['year'].isin(year)]

#Perform groupby
veh_df = veh_df.groupby('year')['number'].sum().reset_index()

#Mean normalization
veh_df['number']=(veh_df['number']-veh_df['number'].mean())/veh_df['number'].std()
veh_df.rename(columns = {'number':'Number of Vehicles'}, inplace = True)
veh_df
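
The group-by step can be illustrated on a few made-up rows. The `vehicles` frame and its figures are hypothetical; only the column names 'year' and 'number' are taken from the code above.

```python
import pandas as pd

# Hypothetical rows: several vehicle types per year
vehicles = pd.DataFrame({
    "year": [2008, 2008, 2009, 2009],
    "type": ["Cars", "Buses", "Cars", "Buses"],
    "number": [550_000, 15_000, 570_000, 16_000],
})

# Sum only the numeric 'number' column within each year
totals = vehicles.groupby("year")["number"].sum().reset_index()

print(totals)
```

Selecting the 'number' column before summing keeps text columns such as the vehicle type out of the aggregation.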

## Finding a correlation with the vehicle population and pollutants

In [None]:
#Create new dataframe
veh_corr = pol_df.copy()

#Add in the vehicle data
veh_corr['Veh'] = veh_df['Number of Vehicles'].tolist()

#Produce correlation dataframe
veh_corr.corr(method = 'spearman')

## Observation
The table above shows that the vehicle population has an 80% correlation with nitrogen dioxide and an 86% correlation with ground-level ozone. The graph below explores this further.

## Plot vehicle data

In [None]:
veh_graph = veh_df.plot(x='year', y='Number of Vehicles', kind = 'line', grid = True,
                    title = 'Vehicle Population from 2008 to 2014')
pol_df.plot(x='year', y = ['nitrogen_dioxide_mean', 'ozone_4th_maximum_8hourly_mean'], kind = 'line', grid = True, figsize=DIMS, ax=veh_graph,
                        title = 'Vehicle Population VS Ground-Level Ozone and Nitrogen Dioxide Pollution')

plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

## Observation
This graph shows that the trends of the vehicle population and of the two pollutants are similar. Therefore, it supports the result of the statistical analysis above.

## Hypothesis 3 conclusion
The trends of the vehicle population and of nitrogen dioxide and ozone pollution are closely related, both statistically and graphically. This makes sense: according to [this](https://www.theicct.org/cards/stack/vehicle-nox-emissions-basics#4) article on the International Council on Clean Transportation website, their study shows that vehicles contribute a lot of nitrogen oxide pollution. Note that nitrogen dioxide is a form of nitrogen oxide. In addition, [this](https://www.scientificamerican.com/article/ozone-pollution-grows-but-it-can-be-fixed/) Scientific American article says that there is a correlation between nitrogen oxide levels and ground-level ozone levels. Nitrogen oxide pollution, combined with volatile organic compounds, reacts in the presence of sunlight to produce ground-level ozone pollution. Since Singapore is a tropical island, we get a lot of sunlight, which could explain the high ozone levels.

Therefore, we can conclude that there is **sufficient** evidence to support my third hypothesis.

## Experiment conclusion
One of the main reasons I want to enter the data science industry is that, as seen from this experiment, data science allows us to find relations between subjects that seem unrelated and helps us find meaning in data. This experiment has helped us identify which pollutants Singapore's two biggest manufacturing industries correlate with. It has also given us some insight into the pollution associated with residential and commercial development and with the vehicle population in Singapore.

Thankfully, Singapore aims to become a [car-lite society](https://www.todayonline.com/singapore/getting-singaporeans-embrace-car-lite-society) by 2040. This means that the vehicle population will be significantly reduced. However, more work has to be done to find clean alternatives for residential and commercial development and for the manufacturing industries in Singapore.

## Further development
Further improvements in identifying sources of pollution can be made by studying more recent data when it becomes available, or by studying other factors such as haze, shipping and waste incineration.

## Reader take-aways
I hope the reader has gained a better understanding of air pollution and the types of air pollutants in the environment. I also hope that I have convinced the reader that data science can be used in many areas, not only environmental ones, to provide a bird’s eye view of any issue. In this case, it can awaken the general population to the reality of environmental problems and lend credibility to campaigns for lifestyle changes that help address these problems.

# APPENDIX

Data Used:  
[air-pollutant-carbon-monoxide-2nd-maximum-8-hour-mean.csv](http://data.gov.sg/dataset/air-pollutant-carbon-monoxide)  
[air-pollutant-nitrogen-dioxide.csv](https://data.gov.sg/dataset/air-pollutant-nitrogen-dioxide)  
[air-pollutant-ozone.csv](https://data.gov.sg/dataset/air-pollutant-ozone)  
[air-pollutant-particulate-matter-pm10.csv](https://data.gov.sg/dataset/air-pollutant-particulate-matter-pm10)  
[air-pollutant-particulate-matter-pm2-5.csv](https://data.gov.sg/dataset/air-pollutant-particulate-matter-pm2-5)  
[air-pollutant-sulphur-dioxide.csv](https://data.gov.sg/dataset/air-pollutant-sulphur-dioxide)  
[annual-motor-vehicle-population-by-vehicle-type.csv](https://data.gov.sg/dataset/annual-motor-vehicle-population-by-vehicle-type)  
[completion-status-of-hdb-commercial-developments.csv](https://data.gov.sg/dataset/number-of-units-of-hdb-developments-by-status?resource_id=cd37aed4-2d93-4e05-b6f9-4b249603f125)  
[completion-status-of-hdb-residential-developments.csv](https://data.gov.sg/dataset/number-of-units-of-hdb-developments-by-status?resource_id=ff97dd96-6db5-4eb7-ba79-ad8d4840a3aa)  
[total-output-in-manufacturing-by-industry-annual.csv](https://data.gov.sg/dataset/total-output-manufacturing-annual?resource_id=7ef90aef-5191-44cd-bc06-0d37088a5733)  
[air-polluant-lead.csv](https://data.gov.sg/dataset/air-polluant-lead?resource_id=2b14a0cf-203c-4b0f-8432-62be3971f9b6)
