# Progress of the Philippines' Sustainable Development Goals

### Import

In [None]:
import os
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore

## Data Collection
The **CSV** files used in this project were acquired through a request sent to the Knowledge Management and Communications Division of the Philippine Statistics Authority.

### Combining the Datasets 
In this stage, the separate datasets were pre-processed and cleaned before being combined.

First, the irrelevant rows were dropped. These were the rows in which all values are NaN and the additional rows (i.e., note rows, “Data available” rows) found in the CSV files.

Second, since the first row of each CSV file held the name of the indicator followed by empty cells, the resulting DataFrame had “Unnamed” column headers. Because of this, we set the column headers to the second row of the CSV file (the row at index 0 of the DataFrame), and then dropped that row afterward.

Third, since the `Geolocation` column would be used later to merge the datasets, the values in this column were standardized into the format `Region n: region_name`, where *n* is the corresponding region number and *region_name* is the name of the region. If a region does not have a region number, it was formatted as `region_abbreviation: region_name`, where *region_abbreviation* is its official abbreviation.

Fourth, some datasets are divided further within a region and year, but still include a cumulative value for that division (e.g., datasets that are also divided per `Sex`, while having a value of “Both Sexes”). In this situation, we decided to keep only the cumulative row (e.g., Both Sexes), drop the other rows that represent the division (e.g., Female and Male), and drop the column that is related to this division (Sex).

Fifth, we convert the DataFrame into its long representation. Once the dataset is in its long representation, we can merge it into the combined dataset, using the Year and Geolocation columns as the merge key. This is done for all twenty-five datasets.

This process results in one DataFrame in its long representation, with three kinds of columns: (1) Geolocation, (2) Year, and (3) the value for each of the indicators.
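
The five steps above can be sketched end to end on toy data. This is only a minimal sketch, not the cells used later in this notebook: the helper name `to_long`, the region labels, the indicator names, and all values are hypothetical placeholders.

```python
import numpy as np
import pandas as pd


def to_long(wide: pd.DataFrame, indicator: str) -> pd.DataFrame:
    """Reshape a cleaned wide table (one column per year) into long form."""
    long = wide.melt(id_vars="Geolocation", var_name="Year", value_name=indicator)
    long["Year"] = long["Year"].astype(int)
    return long


# Toy stand-ins for two cleaned indicator tables (placeholder values only).
table_a = pd.DataFrame({"Geolocation": ["Region A", "Region B"],
                        "2015": [1.0, 2.0], "2018": [3.0, 4.0]})
table_b = pd.DataFrame({"Geolocation": ["Region A", "Region B"],
                        "2018": [5.0, np.nan]})

# An outer join on (Geolocation, Year) keeps every region-year pair from both tables.
combined = to_long(table_a, "indicator_a").merge(
    to_long(table_b, "indicator_b"), how="outer", on=["Geolocation", "Year"]
)
print(combined)
```

Region-year pairs that exist in only one table are kept, with NaN in the column of the other indicator.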

#### 1.2.1. Proportion of population living below the national poverty line 
To start with, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

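The data folder of this project can also be located through the environment variable `DSDATA_PROJ`, read with the [`os.getenv`](https://docs.python.org/3/library/os.html) function. In the cell below, the relative `data` folder is used, and the `os.getenv` variant is kept as a commented-out alternative.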

In [None]:
data = pd.read_csv('data' + '/1.2.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.2.1.csv')
data
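
As a sketch of how the two ways of locating the data folder could be combined, the snippet below falls back to the relative `data` folder when `DSDATA_PROJ` is not set. The names `data_dir` and `csv_path` are illustrative and are not used elsewhere in this notebook.

```python
import os
from pathlib import Path

# Use DSDATA_PROJ when it is set; otherwise fall back to a local "data" folder.
data_dir = Path(os.getenv("DSDATA_PROJ", "data"))
csv_path = data_dir / "1.2.1.csv"
print(csv_path)
```

The resulting path can be passed directly to `pd.read_csv`.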

Looking at the DataFrame, we could see that the columns are unnamed and that the column names are located at the 0th row. Using [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), we could get the 0th row and then assign it as the column values. 

Then, using the [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function, we can drop the 0th row as we have no need for it anymore. Additionally, since the row at index 1 is a row full of NaN, we can also drop it using the same function. 

To be able to fix the indexing of the rows, the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used to reset the index from 0.

In [None]:
# setting our column names
data.columns = data.iloc[0]

# dropping the row of NaN values found at index 1
data = data.drop(data.index[1])

# dropping the row at index 0, as it is now used as the column header
data = data.drop(data.index[0])

data.reset_index(drop=True, inplace=True)
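
Since the same header promotion is repeated for every dataset, it could be wrapped in a small helper. The sketch below is one possible way to do that; the name `promote_header` and the toy table are hypothetical, and it removes the all-NaN row with `dropna(how="all")` instead of dropping it by position as the cell above does.

```python
import numpy as np
import pandas as pd


def promote_header(raw: pd.DataFrame, header_row: int = 0) -> pd.DataFrame:
    """Use one row as the column labels, then drop it and any all-NaN rows."""
    out = raw.copy()
    out.columns = out.iloc[header_row]
    out = out.drop(out.index[header_row])
    out = out.dropna(how="all")
    return out.reset_index(drop=True)


# Toy table shaped like the raw files: a header row, a NaN row, then the data rows.
raw = pd.DataFrame({
    "Indicator title": ["Year", np.nan, "Region A", "Region B"],
    "Unnamed: 1": ["2015", np.nan, "1.0", "2.0"],
})
clean = promote_header(raw)
print(clean)
```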

Irrelevant rows that are just footers for the file are also removed.

In [None]:
# dropping irrelevant rows 
data = data.drop (data.index [18:]) 

The `Year` column must also be renamed to `Geolocation`, as this column refers to the different regions in the Philippines and not to years. This can be done with the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function.

In [None]:
# renames the column 'Year' as it's actually the location column
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

To easily determine which region the `Geolocation` values refer to, we can also change these values to include the names that they are commonly referred to, instead of just their region numbers. 

For consistency throughout the different datasets, the `region_names` variable was declared. A map was not used because different datasets have different representations of the regions (i.e., differences in naming a region); however, the regions are always arranged in the same order. This is shown below in the pre-processing of each of the datasets.

In [None]:
# NOTE: Before applying, make sure that the arrangement of the regions is the same as the arrangement in your table
region_names = ['PHILIPPINES', 'NCR: National Capital Region',
                'CAR: Cordillera Administrative Region',
                'Region 1: Ilocos Region',
                'Region 2: Cagayan Valley',
                'Region 3: Central Luzon',
                'Region 4A: CALABARZON',
                'MIMAROPA: Southwestern Tagalog Region', 
                'Region 5: Bicol Region', 
                'Region 6: Western Visayas', 
                'Region 7: Central Visayas', 
                'Region 8: Eastern Visayas', 
                'Region 9: Zamboanga Peninsula', 
                'Region 10: Northern Mindanao', 
                'Region 11: Davao Region', 
                'Region 12: SOCCSKSARGEN', 
                'CARAGA: Caraga Administrative Region', 
                'BARMM: Bangsamoro Autonomous Region in Muslim Mindanao']
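
Because the names are assigned by position, it may help to look at the original labels next to the standardized ones before overwriting the column. The following is a minimal sketch of such a check; the helper `preview_renaming` and the original labels in the toy Series are hypothetical and do not come from the PSA files.

```python
import pandas as pd


def preview_renaming(original: pd.Series, names: list) -> pd.DataFrame:
    """Pair each original label with the standardized name assigned to its row."""
    if len(original) != len(names):
        raise ValueError(f"{len(original)} rows but {len(names)} names")
    return pd.DataFrame({"original": original.to_list(), "standardized": list(names)})


# Hypothetical original labels, as they might appear in a raw table.
original = pd.Series(["Philippines",
                      "National Capital Region (NCR)",
                      "Cordillera Administrative Region (CAR)"])
names = ["PHILIPPINES",
         "NCR: National Capital Region",
         "CAR: Cordillera Administrative Region"]

preview = preview_renaming(original, names)
print(preview)
```

Reading the two columns side by side makes a shifted or missing region easier to spot than checking the row count alone.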

In [None]:
# replaces the values in the Geolocation column for consistency
data['Geolocation'] = region_names
data = data.reset_index(drop=True)
data

Next, we can convert the strings of '..' and '...', which were used to represent that there were no values for these cells, to **NaN**, through the use of the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function.

However, the columns in which all values are **NaN** were not dropped. When this dataset is combined with other datasets, all years would still be present, as some datasets have complete data for all the years. Additionally, dropping the years for some of the datasets would leave the combined dataset with an unusual ordering (i.e., an ordering of the regions that does not follow the usual ordering in Philippine datasets), even if it was sorted based on the `Year` and `Geolocation` columns.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # cells without values are represented as either '..' or '...', so we convert them to NaN
    # (assigning the result back avoids relying on an in-place change to a column selection)
    data[c] = data[c].replace(['..', '...'], np.nan)

# drops columns if all of the values are NaN
# data = data.dropna(axis=1, how='all')
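
As an alternative sketch, the placeholders can be declared as missing values when the file is read, through the `na_values` parameter of `read_csv`. The inline CSV text below is a made-up, already clean table; in the raw files used here, the extra header and footer rows would still keep the columns as strings until those rows are removed.

```python
import io

import pandas as pd

# Toy CSV text in which '..' and '...' stand for missing values.
csv_text = "Geolocation,2015,2018\nRegion A,1.5,..\nRegion B,...,2.5\n"

toy = pd.read_csv(io.StringIO(csv_text), na_values=["..", "..."])
print(toy.dtypes)
print(toy)
```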

In [None]:
data

As the final step, the wide representation of this dataset is converted to a long representation through the use of the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. 

Then, the column that holds the value for a specific year and region is renamed, using [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html), to the ID and name of this Sustainable Development Goal (SDG) indicator, so that it can be distinguished when it is combined with other datasets.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'1.2.1. Proportion of population living below the national poverty line', 0 : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data
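
The key `0` in the `rename` call can look surprising. The sketch below, on a made-up two-row table, illustrates where it comes from: assigning a row to `columns` also carries over that row's label (here `0`) as the name of the column index, and `melt` uses that name for the variable column when `var_name` is not given. Passing `var_name` and `value_name` directly is one way to skip the rename; the indicator name used here is a placeholder.

```python
import pandas as pd

# Toy raw table: row 0 holds the header values, row 1 holds the data.
raw = pd.DataFrame([["Geolocation", "2015", "2018"],
                    ["Region A", 1.0, 2.0]])
raw.columns = raw.iloc[0]          # the column index is now named 0
wide = raw.drop(raw.index[0])

# Default melt: the variable column takes the name of the column index.
default_long = wide.melt(id_vars="Geolocation")
print(list(default_long.columns))

# Naming the columns during the melt avoids the later rename.
named_long = wide.melt(id_vars="Geolocation", var_name="Year", value_name="indicator")
print(list(named_long.columns))
```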

As this is the first dataset, we can just assign it to the `combined_data` DataFrame, which would hold the combined datasets.

In [None]:
combined_data = data

#### 1.4.1p5. Net Enrolment Rate in elementary

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the next dataset. 

In [None]:
data = pd.read_csv('data' + '/1.4.1p5.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.4.1p5.csv')
data

From the DataFrame above, we can see that the footer of the .csv files was included in the DataFrame. As the rows from the 56th index are irrelevant, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) them. 

In [None]:
data = data.drop (data.index [56:]) 

Additionally, we can see that the columns are unnamed, and upon inspection, the original column names can be found at `Index 0`. Thus, we can set the columns to this row, and then  [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Index 0` row as it would only be redundant and might affect the computations.

The [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used in order to make the index of the rows start from 0.

In [None]:
# setting the column names and removing the row that held the previous column names
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

However, there is still a row of NaN values at `Index 0`, and the names of the first two columns do not match the values underneath them: the values under the first column are actually geolocations, and those under the second column are the values for Sex. Thus, we can [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) these columns, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row at `Index 0`.

In [None]:
data = data.rename(columns = {np.nan:'Geolocation', 'Year': 'Sex'})
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As we only need the data that is grouped by region and not by sex, we only keep the rows that have **Both Sexes** as the value in the Sex column. After this, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the Sex column, as it is not used from this point onwards.

In [None]:
# Only getting the total data, then dropping Sex column as it's not needed anymore
data = data[data['Sex'] == 'Both Sexes']
data = data.drop("Sex", axis = 1)
data = data.reset_index (drop=True)
data
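
The same filtering applies to any dataset that carries a breakdown column next to its cumulative rows, so it could be written once as a helper. This is a minimal sketch on toy values; the name `keep_total` is hypothetical.

```python
import pandas as pd


def keep_total(frame: pd.DataFrame, column: str, total_label: str) -> pd.DataFrame:
    """Keep only the rows holding the cumulative value, then drop the breakdown column."""
    total = frame[frame[column] == total_label]
    return total.drop(columns=column).reset_index(drop=True)


toy = pd.DataFrame({
    "Geolocation": ["Region A"] * 3 + ["Region B"] * 3,
    "Sex": ["Both Sexes", "Female", "Male"] * 2,
    "2018": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
totals = keep_total(toy, "Sex", "Both Sexes")
print(totals)
```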

To be able to merge this into the combined DataFrame, the values of the Geolocation column are set to the same standardized region names.

In [None]:
data['Geolocation'] = region_names

Since the dataset represents missing values as either '...' or '..', we can [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) these values with `np.nan`.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # cells without values are represented as either '..' or '...', so we convert them to NaN
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

Then, we can transform the wide representation of the DataFrame to its long representation version using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. 

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'1.4.1p5 Net Enrolment Rate in elementary', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [None]:
data

Then we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) this long representation DataFrame into the combined DataFrame. It is merged with respect to the values in the **Geolocation** and **Year** columns. An outer join is used as we want to retain all the values of both DataFrames, even if there would be **NaN** values for some of the cells.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data
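
When merging on `Geolocation` and `Year`, two optional parameters of `merge` can help confirm that the join behaves as intended: `validate` raises an error if a key pair appears more than once on either side, and `indicator` adds a `_merge` column showing which side each row came from. The sketch below uses made-up frames and column names and is not part of the pipeline itself.

```python
import pandas as pd

left = pd.DataFrame({"Geolocation": ["Region A", "Region B"],
                     "Year": [2015, 2015], "indicator_a": [1.0, 2.0]})
right = pd.DataFrame({"Geolocation": ["Region A", "Region C"],
                      "Year": [2015, 2015], "indicator_b": [3.0, 4.0]})

merged = left.merge(
    right,
    how="outer",
    on=["Geolocation", "Year"],
    validate="one_to_one",   # fail if a (Geolocation, Year) pair is duplicated
    indicator=True,          # adds a _merge column: both / left_only / right_only
)
print(merged)
```

Rows marked `left_only` or `right_only` are the region-year pairs that exist in only one of the two datasets.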

#### 1.4.1p6. Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)

Next, we can load the third dataset.

In [None]:
data = pd.read_csv('data' + '/1.4.1p6.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.4.1p6.csv')
data

Just like in the processing of the previous datasets, we first [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the unnecessary rows at the bottom part of the DataFrame. 

In [None]:
data = data.drop (data.index [110:]) 

From the DataFrame above, we can see that the correct column headers are found at `Index 0`. However, upon inspection, we would see that there are two NaN values and the 'Year' value at the third column should actually be 'Sex' based on the values below it. Thus, before setting this row as the column header, we first correct the values of these first three columns using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

In [None]:
data.at[0, '1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)'] = 'Level of Education'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

Now that the first row can correctly act as the column header, we can set it as the column header before dropping the row at `Index 0`. Then we must also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s at `Index 1`, as it is unnecessary, before using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Using the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) function, we can see that there are two values in the 'Level of Education' column. As the combined dataset has no column for the education level, we separate the rows by level and add each level as its own indicator column.

In [None]:
data ['Level of Education'].unique ()

In [None]:
senior_high_data = data [54:]
junior_high_data = data [:54]
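
The split above relies on row positions. As an alternative sketch, the rows could be separated by the value of the `Level of Education` column itself, which does not depend on how many rows each level has. The level labels and values below are assumed for illustration, and this only works if the column is filled in on every row.

```python
import pandas as pd

# Toy table; the two level labels are assumptions, not taken from the CSV file.
toy = pd.DataFrame({
    "Level of Education": ["Junior High School"] * 2 + ["Senior High School"] * 2,
    "Geolocation": ["Region A", "Region B"] * 2,
    "2018": [1.0, 2.0, 3.0, 4.0],
})

# One DataFrame per level, keyed by the level label.
parts = {
    level: group.drop(columns="Level of Education").reset_index(drop=True)
    for level, group in toy.groupby("Level of Education", sort=False)
}
print(parts["Senior High School"])
```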

Now, we must process these two separately, but the processes done to them would be the same.

First, we only need the general data, without taking *Sex* into consideration. This can be done by only getting the rows that have **Both Sexes** as the value of the `Sex` column.

In [None]:
junior_high_data = junior_high_data [junior_high_data['Sex'] == 'Both Sexes']
junior_high_data = junior_high_data.reset_index (drop=True)

In [None]:
senior_high_data = senior_high_data [senior_high_data['Sex'] == 'Both Sexes']
senior_high_data = senior_high_data.reset_index (drop=True)

Next, as we have already separated the dataset into two based on the value of the `Level of Education` column, we have no need for this column anymore. The `Sex` column now also holds a single value, so we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) both columns.

In [None]:
junior_high_data = junior_high_data.drop("Level of Education", axis = 1)
junior_high_data = junior_high_data.drop("Sex", axis = 1)
junior_high_data = junior_high_data.reset_index (drop=True)

In [None]:
senior_high_data = senior_high_data.drop("Level of Education", axis = 1)
senior_high_data = senior_high_data.drop("Sex", axis = 1)
senior_high_data = senior_high_data.reset_index (drop=True)

For consistency, we set the values of the `Geolocation` column to the format of the region names that we have decided before.

In [None]:
senior_high_data['Geolocation'] = region_names

In [None]:
junior_high_data['Geolocation'] = region_names

As the dataset represents missing values as '..' or '...', we must [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html) these values with `np.nan`.

In [None]:
for c in junior_high_data.columns.difference(['Geolocation']):
    junior_high_data[c] = junior_high_data[c].replace(['..', '...'], np.nan)

In [None]:
for c in senior_high_data.columns.difference(['Geolocation']):
    senior_high_data[c] = senior_high_data[c].replace(['..', '...'], np.nan)

Looking at the senior high data, we can see that all of the values are `NaN` from 2000 to 2016, which is to be expected as Senior High School was only implemented from 2016.

In [None]:
senior_high_data

Next, we can convert both datasets into their long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

In [None]:
junior_high_data = pd.melt(junior_high_data, id_vars='Geolocation', value_vars=junior_high_data.columns [1:]) 

junior_high_data.rename(columns = {'value':'1.4.1p6 Net Enrolment Rate in secondary education (Junior High School)', 0 : 'Year'}, inplace=True)
junior_high_data = junior_high_data.astype({'Year':'int'})

In [None]:
senior_high_data = pd.melt(senior_high_data, id_vars='Geolocation', value_vars=senior_high_data.columns [1:]) 

senior_high_data.rename(columns = {'value':'1.4.1p6 Net Enrolment Rate in secondary education (Senior High School)', 0 : 'Year'}, inplace=True)
senior_high_data = senior_high_data.astype({'Year':'int'})

Once both datasets have been converted to their long representation, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) them into the combined dataset based on the values of the `Geolocation` and `Year` columns with an outer join.

In [None]:
combined_data = combined_data.merge(junior_high_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(senior_high_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 1.5.4. Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies
Then, the fourth dataset could be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv('data' + '/1.5.4.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.5.4.csv')
data

Same as the previous datasets, we would need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the irrelevant rows at the bottom of the DataFrame. These are the rows that were a footer outside of the table in the csv files.

In [None]:
data = data.drop (data.index [19:])

Likewise, we know that the row at `Index 0` holds the values that should be the column headers of the table. However, checking each of the cells in this row shows that the header for the first column should not be `Year`, but rather `Geolocation`, as the values in this column refer to the different regions.

Thus, we can change the value of the first column in this row to `Geolocation`, so that we do not need to rename the column after making the 0th row the column header. Then, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row at `Index 0`, as it is now unnecessary. Additionally, there is a row of **NaN**s at `Index 1`, which becomes the 0th row once we drop the row that became the column headers. This row is also dropped before the index is reset using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [None]:
data.at[0, '1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2)'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

The next step would be renaming the values under the `Geolocation`, although, as seen in the resulting table, we would notice that there is no row for **PHILIPPINES**. This is reflected in the way that we set the values of this column.

In [None]:
data ['Geolocation'] = region_names [1:]
data

As with the previous datasets, we have to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the '..' and '...' values, which represent missing values, with **NaN**s. This avoids errors when computing on these cells and represents the missing values properly.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

After all of this, we can now transform this dataset from its wide representation into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 
data

Once we were able to convert it to its long representation, we would see that the column names in this new DataFrame are not descriptive with respect to the values underneath the column. Directly merging this with the combined DataFrame would make it hard for its users to distinguish what these columns are for, which is why it was [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to its correct column names.

In [None]:
data.rename(columns = {'value':'1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

After this, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the combined dataframe.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 3.4.1. Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease
To start with the fifth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv('data' + '/3.4.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.4.1.csv')
data

Based on the DataFrame that we got using the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we can see that there are rows of **NaN**s at the lower part of the DataFrame. Upon further inspection, these start at `Index 266`, which is why the rows from this index onwards were [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped.

In [None]:
data = data.drop (data.index [266:])

As the column headers are all **Unnamed**, we need to set the column headers to their correct values, which are found at `Index 0`. However, the values of the first three columns in this row are not descriptive enough to be column headers, which is why we change them to descriptive names for the rows underneath them using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

Once the row at `Index 0` has been set as the column header, we have no further use for it and can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) it. We also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the next row, as it is just a row of **NaN**s.

In [None]:
data.at[0, '3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease'] = 'Indicator'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As the `Sex` column is not available for all datasets, it was decided that only the total (the rows with **Both Sexes**) would be considered. Once our data only includes rows with **Both Sexes** as the value of their `Sex` column, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column, as it would only have one unique value.

In [None]:
data = data [data ['Sex'] == 'Both Sexes']
data = data.drop('Sex', axis = 1)
data = data.reset_index(drop=True)

Then, we need to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) all cells that have the value of either '..' or '...' with **NaN**, so that later computations treat them as missing values.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

Upon studying the different indicators under this specific Sustainable Development Goal (SDG), we can see that it is composed of different subsets: (1) cardiovascular diseases, (2) cancer, (3) diabetes, and (4) chronic respiratory disease. However, as we only aim to get the total mortality rate with respect to all of these diseases, we only keep the rows under the total indicator, which run from `Index 0` to `Index 15`.

Then, after dividing the data into these subsets, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Indicator` column.

In [None]:
data['Indicator'].unique()

In [None]:
all_data = data [0:16]
cardio_data = data [16:34]
cancer_data = data [34:52]
diabetes_data = data [52:70]
respi_data = data [70:]

In [None]:
all_data = all_data.drop('Indicator', axis = 1)
all_data

Upon inspection, two regions are missing from the table, **Region V** and **Region VI**, which is why we only use the region names that are included in the DataFrame.

In [None]:
# no region five and six
all_data ['Geolocation'] = region_names [0:8] + region_names [10:]
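
The slice above removes the two regions by their positions in `region_names`. A sketch of an alternative is to exclude them by label, which keeps working if the order of the list ever changes. The list below is shortened for illustration, and the names `toy_region_names`, `missing`, and `present_names` are hypothetical.

```python
# Shortened list of standardized names, for illustration only.
toy_region_names = [
    "PHILIPPINES",
    "NCR: National Capital Region",
    "Region 5: Bicol Region",
    "Region 6: Western Visayas",
    "Region 7: Central Visayas",
]

# Regions that do not appear in this particular table.
missing = {"Region 5: Bicol Region", "Region 6: Western Visayas"}

# Keep the original order while skipping the missing regions.
present_names = [name for name in toy_region_names if name not in missing]
print(present_names)
```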

After this, with the use of the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, we can now convert our DataFrame to its long representation. Then, we must set the column headers to describe the values in this column, which is why we would need to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns. 

In [None]:
all_data = pd.melt(all_data, id_vars='Geolocation', value_vars=all_data.columns [1:]) 

all_data.rename(columns = {'value':'3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease (Total data)', 0 : 'Year'}, inplace=True)
all_data = all_data.astype({'Year':'int'})

After this, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the DataFrame which holds the combined datasets.

In [None]:
combined_data = combined_data.merge(all_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 3.7.1. Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the sixth dataset. 

In [None]:
data = pd.read_csv('data' + '/3.7.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.1.csv')
data

Irrelevant rows that are just footers for the file are also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped. From the DataFrame above, we can see that these are the rows from `Index 20`.

In [None]:
data = data.drop (data.index [20:])

Additionally, we can see that the current column names are **Unnamed**. Thus, we have to set the column names to its correct values so that we can determine what the values in the columns are.

Looking at the data, we can see that the row at `Index 0` holds the values for the column headers. However, there is a **NaN** value, which should be **Geolocation** based on the data underneath it. This is why the value of this cell was changed to **Geolocation** using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

This is done before the column names are set to the row at `Index 0`, after which this row and the row of NaNs that follows it are [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped.

In [None]:
data.at[0, 'Unnamed: 1'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

In addition, we can see that there is a column of **NaN**s, which we do not need, so we can also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) it.

In [None]:
data = data.drop('Year', axis=1)

Just like what we have done in the previous datasets, we would rename the **Geolocation** column based on the common names of the region for easier understanding of the dataset.

In [None]:
data ['Geolocation'] = region_names

As the missing data or null values in the dataset are represented by '..' or '...', which are strings that might affect computations on these numerical columns, we use the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function to replace these string values with **np.nan**.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')
data

As the dataset now looks like the wide representation that we wanted, we would be transforming it to its long representation, using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, so that we could merge it to the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 
data

Although, before merging it to the combined dataset, we would need to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns `0` and `value`, as they are not descriptive enough. If we directly merged it to the combined dataset, we might not be able to determine what the values in these columns mean. 

In [None]:
data.rename(columns = {'value':'3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})
data

Once the column names have been fixed, we can use the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function with an outer join to merge the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
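
To illustrate what the outer join does, here is a minimal sketch on two made-up long tables (`left`, `right`, and the indicator names are hypothetical). It also passes `validate` and `indicator`, two optional `merge` parameters that this notebook does not use, as one possible way to check that each (`Geolocation`, `Year`) pair appears only once per table and to see where each row came from.

```python
import pandas as pd

# Hypothetical long tables sharing the (Geolocation, Year) key
left = pd.DataFrame({
    'Geolocation': ['NCR', 'NCR'],
    'Year': [2019, 2020],
    'Indicator A': [1.0, 2.0],
})
right = pd.DataFrame({
    'Geolocation': ['NCR', 'NCR'],
    'Year': [2020, 2021],
    'Indicator B': [5.0, 6.0],
})

merged = left.merge(
    right,
    how='outer',
    on=['Geolocation', 'Year'],
    validate='one_to_one',  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a '_merge' column recording each row's origin
)
print(merged)
```

Keys that exist in only one table are kept, and the columns coming from the other table are filled with NaN.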

In [None]:
combined_data

#### 3.7.2. Adolescent birth rate aged 15-19 years per 1,000 women in that age group
Then, the seventh dataset could be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv('data' + '/3.7.2.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.2.csv')
data

As seen in the previous datasets, there are three types of rows that are processed and [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped first: (1) the irrelevant rows that were footers in the .csv file, (2) the row that holds the column headers, which is dropped after being set as the header, and (3) the row of **NaN**s.

In [None]:
data = data.drop (data.index [20:])

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

However, one column name does not correctly represent its data: the `Year` column contains regions, not years. This is why it was [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to `Geolocation`. 

In [None]:
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

Once we have cleaned the column headers, the values of the `Geolocation` column are replaced with the standardized region names. Note that we first made sure that the order of the rows exactly matches the order of the `region_names` variable.

In [None]:
data ['Geolocation'] = region_names
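
Because this assignment is purely positional, a wrong row order would go unnoticed. One possible safeguard, sketched below on made-up data, is to print the raw labels next to the standardized ones before overwriting them. The `region_names_demo` list is a hypothetical three-item excerpt that only follows the naming format described earlier, and the values in `toy` are invented.

```python
import pandas as pd

# Hypothetical excerpt; the notebook's real region_names list is defined earlier
region_names_demo = [
    'PHL: Philippines',
    'NCR: National Capital Region',
    'Region 1: Ilocos Region',
]
toy = pd.DataFrame({
    'Geolocation': ['PHILIPPINES', 'NCR', 'Region I'],
    '2020': [47.0, 20.0, 45.0],
})

# A length mismatch raises an error on assignment, but a wrong order does not,
# so show the raw and standardized labels side by side for a visual check
check = pd.DataFrame({'raw': toy['Geolocation'], 'standardized': region_names_demo})
print(check)

assert len(toy) == len(region_names_demo)
toy['Geolocation'] = region_names_demo
```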

Now that we have fixed the number of rows and the column names, we replace the string representation of null or missing values. This is done with the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function, which converts the '..' and '...' values into **np.nan**.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

Then, we can now convert our DataFrame into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. As in the processing of the previous datasets, we would have to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the column names as they are not descriptive enough.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

As we are now sure that the missing or null values are correctly represented, the values of the `Geolocation` are now more easily understandable, and the column headers are descriptive enough, we can now merge this dataset into the combined datasets using the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 4.1.s1. Completion Rate of elementary and secondary students
To start with the eighth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv('data' + '/4.1.s1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.1.s1.csv')
data

From the view of the DataFrame above, we can see that there are unnecessary rows captured by the  [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. To be able to correctly represent the data, we would need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) these rows.

In [None]:
data = data.drop(data.index[164:])

Another problem visible in the DataFrame above is the lack of column names, as shown by the **Unnamed** values in the header. Studying the DataFrame, we find the intended column headers in the row at `Index 0`, but the first three columns of this row contain **NaN** values. This is why the values in these cells are set using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) accessor before this row is converted into the column header.

After we have been able to turn this into the column header, we would need to drop this row and the row beneath it as they are unnecessary rows.

In [None]:
data.at[0, '4.1.s1 Completion Rate of elementary and secondary students 1/ 2/'] = 'Geolocation'
data.at[0, 'Unnamed: 1'] = 'Level of Education'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As in the other datasets that have a `Sex` column, we only keep the rows where this column is **Both Sexes**. Afterwards, as we no longer need this column, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) it. 

In [None]:
data = data [data['Sex'] == 'Both Sexes']
data = data.drop ('Sex', axis = 1)
data = data.reset_index(drop=True)
data

As we can see from the resulting dataset, there are still **NaN** values in the `Geolocation` column, which we do not want because this column is used in merging the datasets. The reason is that each `Geolocation` value spans three rows, one per value of the `Level of Education` column, and is written only on the first of them. We cannot simply split the dataset by `Level of Education` yet, as the `Geolocation` would be NaN for all **Secondary (Junior High School)** and **Secondary (Senior High School)** rows. 

Due to this, we copy the `Geolocation` value of each labeled row to the two rows after it. 

In [None]:
# Copy each region label to the two rows that follow it
for i in range(0, len(data), 3):
    data.at[i + 1, 'Geolocation'] = data.at[i, 'Geolocation']
    data.at[i + 2, 'Geolocation'] = data.at[i, 'Geolocation']
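
The same result can be sketched with a forward fill, assuming that the unlabeled rows really hold NaN in the `Geolocation` column. The example below uses a made-up table (`toy` is hypothetical) and the pandas `ffill` method. It is an alternative to the loop, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: the region label appears only on the first of every three rows
toy = pd.DataFrame({
    'Geolocation': ['PHILIPPINES', np.nan, np.nan, 'NCR', np.nan, np.nan],
    'Level of Education': [
        'Elementary',
        'Secondary (Junior High School)',
        'Secondary (Senior High School)',
    ] * 2,
})

# Forward fill copies the last non-missing label down into the NaN rows
toy['Geolocation'] = toy['Geolocation'].ffill()
print(toy)
```

Unlike the loop, a forward fill does not assume that every region has exactly three rows, but it does rely on the missing labels being actual NaN values.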

Before we divide the dataset based on the value of `Level of Education`, we must first replace cells with the strings '..' or '...' with **np.nan**. This is so that we would not need to process this representation of missing or null values separately (i.e., per division). Then, we can now separate them so that we can properly label it before merging it to the combined dataset.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

In [None]:
elem_data = data [data['Level of Education'] == 'Elementary']
elem_data = elem_data.reset_index (drop=True)

junior_data = data [data['Level of Education'] == 'Secondary (Junior High School)']
junior_data = junior_data.reset_index (drop=True)

senior_data = data [data['Level of Education'] == 'Secondary (Senior High School)']
senior_data = senior_data.reset_index (drop=True)

Once we have successfully divided the dataset based on the value of the `Level of Education` column, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column, as each division now has only one value for it.

In [None]:
elem_data = elem_data.drop ('Level of Education', axis = 1)
elem_data = elem_data.reset_index(drop=True)

In [None]:
junior_data = junior_data.drop ('Level of Education', axis = 1)
junior_data = junior_data.reset_index(drop=True)

In [None]:
senior_data = senior_data.drop ('Level of Education', axis = 1)
senior_data = senior_data.reset_index(drop=True)
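
The split-then-drop pattern above can also be sketched with `groupby`, which produces one DataFrame per level in a single pass. The table below is made up (`toy` and its values are hypothetical), and the dictionary `by_level` is an illustrative name that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical table with two education levels per region
toy = pd.DataFrame({
    'Geolocation': ['PHILIPPINES', 'PHILIPPINES', 'NCR', 'NCR'],
    'Level of Education': ['Elementary', 'Secondary (Junior High School)'] * 2,
    '2020': [97.0, 85.0, 98.0, 88.0],
})

# One DataFrame per level, keyed by the level name, with the grouping column dropped
by_level = {
    level: group.drop(columns='Level of Education').reset_index(drop=True)
    for level, group in toy.groupby('Level of Education', sort=False)
}
print(by_level['Elementary'])
```

Keeping three explicitly named DataFrames, as done above, is easier to follow in a notebook; the dictionary form mainly helps when the number of levels is not known in advance.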

After making sure that the order of the regions matches the order of the values in the `region_names` variable, we can change the values of the `Geolocation` column for each division. 

In [None]:
elem_data ['Geolocation'] = region_names

In [None]:
junior_data ['Geolocation'] = region_names

In [None]:
senior_data ['Geolocation'] = region_names

Then, we can now convert the DataFrames into their long representation, before using the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function to make the column names more descriptive of the data in the columns.

In [None]:
elem_data = pd.melt(elem_data, id_vars='Geolocation', value_vars=elem_data.columns [1:]) 

elem_data.rename(columns = {'value':'4.1.s1 Completion Rate of elementary and secondary students (Elementary)', 0 : 'Year'}, inplace=True)
elem_data = elem_data.astype({'Year':'int'})

In [None]:
junior_data = pd.melt(junior_data, id_vars='Geolocation', value_vars=junior_data.columns [1:]) 

junior_data.rename(columns = {'value':'4.1.s1 Completion Rate of elementary and secondary students (Junior High School)', 0 : 'Year'}, inplace=True)
junior_data = junior_data.astype({'Year':'int'})

In [None]:
senior_data = pd.melt(senior_data, id_vars='Geolocation', value_vars=senior_data.columns [1:]) 

senior_data.rename(columns = {'value':'4.1.s1 Completion Rate of elementary and secondary students (Senior High School)', 0 : 'Year'}, inplace=True)
senior_data = senior_data.astype({'Year':'int'})

Now that each division will remain identifiable after merging, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) each of them into the combined dataset.

In [None]:
combined_data = combined_data.merge(elem_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data = combined_data.merge(junior_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data = combined_data.merge(senior_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 4.c.s2. Number of Technical-Vocational Education and Training (TVET) trainers trained
Next, we can load the ninth dataset.

In [None]:
data = pd.read_csv('data' + '/4.c.s2.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.c.s2.csv')
data

As usual, we would first be [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping the irrelevant rows. 

In [None]:
data = data.drop(data.index[20:])

Then, as we know that the correct column headers are found at `Index 0`, we have to fix the values of this row to fully represent the data in the columns. This is why the **Year** value was changed into **Geolocation**: the values in this column are the regions of the country.

After this, we can now make the value of this row as the value of the column headers, before [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping this row as it would not be used anymore. In line with this, we can also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s underneath this row.

In [None]:
data.at[0, '4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Then, we need to change the values of the `Geolocation` column to match the prescribed format for the region names.

In [None]:
data ['Geolocation'] = region_names

After this, we need to clean the dataset by turning the string representation of missing or null values, which are '..' and '...', into **np.nan**. This would allow us to correctly apply mathematical functions to these columns without errors arising due to strings.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

Once we have done this, we can convert the DataFrame into its long representation, which would allow us to merge it with the combined dataset. Converting a DataFrame that is in its wide representation into its long representation is made possible by the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

However, using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function would result in a three-column DataFrame which has the following column names: (1) `Geolocation`, (2) `0`, and (3) `value`. The last two names do not properly describe the values of their columns, which is why these two columns are [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d. 

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

As we now have a DataFrame that is in its long representation, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the combined DataFrame, with respect to the values of the `Geolocation` and `Year` columns. This means that a row from this DataFrame would be [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)d into the combined dataset on the row that has the same `Geolocation` and `Year`. 

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
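
Since the same sequence of steps is repeated for every single-indicator file, it could be wrapped in a helper. The function below is only a sketch under stated assumptions: `tidy_indicator`, its parameters, and the toy input are hypothetical and do not exist in this notebook. It assumes the file layout described above (headers at index 0, a row of NaNs at index 1, footer rows at the end) and that the row order already matches the region labels passed in.

```python
import numpy as np
import pandas as pd

def tidy_indicator(raw, region_labels, value_name, first_col, n_keep=20):
    """Hypothetical helper mirroring the steps above for a single-indicator file."""
    df = raw.drop(raw.index[n_keep:])                  # drop the footer rows
    df.at[0, first_col] = 'Geolocation'                # fix the first header cell
    df.columns = df.loc[0].to_list()                   # promote row 0 to the header
    df = df.drop(df.index[:2]).reset_index(drop=True)  # drop the header row and the NaN row
    df['Geolocation'] = region_labels                  # assumes the row order already matches
    df = df.replace(['..', '...'], np.nan)             # placeholders become NaN
    long_df = df.melt(id_vars='Geolocation', var_name='Year', value_name=value_name)
    return long_df.astype({'Year': 'int'})

# Made-up input with the same layout: header row, NaN row, data rows, one footer row
raw = pd.DataFrame({
    'Title': ['Year', np.nan, 'PHILIPPINES', 'NCR', 'Note: footer text'],
    'Unnamed: 1': ['2019', np.nan, '10', '..', np.nan],
    'Unnamed: 2': ['2020', np.nan, '12', '4', np.nan],
})
tidy = tidy_indicator(
    raw,
    ['PHL: Philippines', 'NCR: National Capital Region'],
    'Trainers trained',
    first_col='Title',
    n_keep=4,
)
print(tidy)
```

The result could then be merged into the combined dataset on `Geolocation` and `Year` in the same way as above.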

In [None]:
combined_data

#### 7.1.1. Proportion of population with access to electricity

Now, we will proceed to loading the tenth dataset.

In [None]:
data = pd.read_csv('data' + '/7.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/7.1.1.csv')
data

Before anything else, we drop the irrelevant rows.

In [None]:
data = data.drop(data.index[20:])

First, we change the value at Index 0 in the column '7.1.1 Proportion of population with access to electricity 1/' to 'Geolocation', since our goal is to make the geolocation the first column of the DataFrame. With this change, Index 0 holds the correct column headers. 

We then set the values of this row as the column headers. 

After this, we drop this row (Index 0), as it is no longer needed, together with the row of NaNs underneath it.

In [None]:
data.at[0,'7.1.1 Proportion of population with access to electricity 1/'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

After checking that the order of the regions is what we intended, we overwrite the Geolocation column with `region_names` so that the format of the region names in this dataset matches the combined dataset.

In [None]:
data ['Geolocation'] = region_names

We will then change the '..' or '...' strings to NaN using np.nan. Again, these NaN values were not dropped because all years from 2001-2022 will be in the combined dataset.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

We can now convert the DataFrame into its long representation using the `melt` function. This would allow us to merge it with the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'7.1.1 Proportion of population with access to electricity', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [None]:
data

We will now combine this dataset to the currently combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 8.1.1. Annual growth rate of real GDP per capita

Loading the eleventh dataset...

In [None]:
data = pd.read_csv('data' + '/8.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/8.1.1.csv')
data

We will now drop the irrelevant rows.

In [None]:
data = data.drop(data.index[20:])

Comparing the header row with Index 0, the data in Index 0 is much closer to the column names we want for the dataset. Renaming every column in the header would be more of a hassle than fixing the data in Index 0 and setting that row as the header.

We therefore change the value at Index 0, Column 0 to 'Geolocation', since our goal is to make the geolocation the first column of the DataFrame. With this change, Index 0 holds the correct column headers, and we set the values of this row as the column headers.

After this, we drop this row (Index 0), as it is no longer needed, together with the row of NaNs underneath it.

In [None]:
data.at[0,'8.1.1 Annual growth rate of real GDP per capita'] = 'Geolocation'
data.head()

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

After checking that the order of the regions is what we intended, we overwrite the Geolocation column with `region_names`.

In [None]:
data ['Geolocation'] = region_names

To represent the missing values clearly, we change the '..' or '...' strings to NaN using np.nan.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)
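
As an aside, `read_csv` can treat these placeholders as missing while the file is being read, through its `na_values` parameter. The sketch below shows this on made-up CSV text (`csv_text` is hypothetical and much simpler than the PSA files). In the real files, the header and footer rows share columns with the data, so the columns would likely still be read as strings until those rows are removed.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for one of the data files
csv_text = "Geolocation,2019,2020\nPHILIPPINES,5.1,..\nNCR,...,4.2\n"

# '..' and '...' are parsed as NaN, so the year columns are read as floats
toy = pd.read_csv(io.StringIO(csv_text), na_values=['..', '...'])
print(toy.dtypes)
print(toy)
```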

We can now convert the DataFrame into its long representation to allow us to merge it with the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'8.1.1 Annual growth rate of real GDP per capita', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [None]:
data[200:]

After this, we combine this dataset with the currently combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 10.1.1. Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population

Loading the twelfth dataset...

In [None]:
data = pd.read_csv('data' + '/10.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/10.1.1.csv') 
data

Dropping the irrelevant rows...

In [None]:
data = data.drop(data.index[38:])

We change the value at Index 0, Column 0 to 'Geolocation', since our goal is to make the geolocation the first column of the DataFrame. With this change, Index 0 holds the correct column headers, and we set the values of this row as the column headers.

After this, we drop this row (Index 0), as it is no longer needed, together with the row of NaNs underneath it.

In [None]:
data.at[0,'10.1.1 Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

To represent the missing values clearly, we change the '..' or '...' strings to NaN using np.nan.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

As observed in this dataset, there are two parts: **10.1.1.1 Bottom 40 percent of the population** and **10.1.1.2 Total Population**. Since we need both parts, we keep both to combine with the other datasets, but we divide them into two separate datasets.

In [None]:
data['Geolocation'].unique()

**10.1.1.1 Bottom 40 percent of the population** goes to `bottom_popu_data` while **10.1.1.2 Total Population** goes to `total_popu_data`.

In [None]:
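# .copy() gives independent DataFrames, so assigning to their columns later
# does not trigger a SettingWithCopyWarning
bottom_popu_data = data[0:18].copy()
total_popu_data = data[18:].copy()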
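
The positional split relies on both parts having exactly the same number of rows. A hedged sketch of the same idea, which derives the blocks from a single block size instead of hard-coded slice bounds, is shown below. The `toy` table, `block_size`, and the `_demo` names are hypothetical; the block size would be 18 for this dataset.

```python
import numpy as np
import pandas as pd

block_size = 2  # 18 in this notebook
toy = pd.DataFrame({
    'Geolocation': ['PHILIPPINES', 'NCR', 'PHILIPPINES', 'NCR'],
    '2018': [1.0, 2.0, 3.0, 4.0],
})

# Integer-divide each row position by the block size to label the blocks 0, 1, ...
block_id = np.arange(len(toy)) // block_size
blocks = [group.reset_index(drop=True) for _, group in toy.groupby(block_id)]

bottom_demo, total_demo = blocks
print(total_demo)
```

Unpacking into exactly two names fails loudly if the file ever contains a different number of blocks, which acts as a basic sanity check.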

Since `total_popu_data` starts at index 18, we reset its index to start at 0. 

Also, since the first row of each part is the record for the Philippines and the order of the regions in each DataFrame is correct, we overwrite the Geolocation column with `region_names` for uniformity.

In [None]:
total_popu_data = total_popu_data.reset_index (drop=True)

bottom_popu_data ['Geolocation'] = region_names
total_popu_data ['Geolocation'] = region_names

In [None]:
bottom_popu_data

In [None]:
total_popu_data 

We can now convert both DataFrames into their long representation to allow us to merge both of them with the combined dataset.

In [None]:
bottom_popu_data = pd.melt(bottom_popu_data, id_vars='Geolocation', value_vars=bottom_popu_data.columns [1:]) 

bottom_popu_data.rename(columns = {'value':'10.1.1.1 Bottom 40 percent of the population', 0 : 'Year'}, inplace=True)
bottom_popu_data = bottom_popu_data.astype({'Year':'int'})

total_popu_data = pd.melt(total_popu_data, id_vars='Geolocation', value_vars=total_popu_data.columns [1:]) 

total_popu_data.rename(columns = {'value':'10.1.1.2 Total Population', 0 : 'Year'}, inplace=True)
total_popu_data = total_popu_data.astype({'Year':'int'})

Combining the two different datasets with the currently combined data...

In [None]:
# Adding the 10.1.1.1 dataset with the current combined dataset
combined_data = combined_data.merge(bottom_popu_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
# Adding the 10.1.1.2 dataset with the current combined dataset
combined_data = combined_data.merge(total_popu_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 14.5.1. Coverage of protected areas in relation to marine areas

We will now read the thirteenth dataset.

In [None]:
data = pd.read_csv('data' + '/14.5.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/14.5.1.csv')
data

As usual, we drop the irrelevant rows.

In [None]:
data = data.drop (data.index [38:])

We will now edit the data in columns 0 and 1 at Index 0 so that the whole row matches the column headers we want. Then, we set Index 0 as the column headers. After this, we drop Index 0 and the row of NaNs underneath it.

The first column was renamed to `Indicator` as it contains the names of the parts in this dataset.

In [None]:
data.at[0, '14.5.1 Coverage of protected areas in relation to marine areas'] = 'Indicator'
data.at[0, 'Unnamed: 1'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data.head()

To represent the missing values clearly, we change the '..' or '...' strings to NaN using np.nan.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

As observed in this dataset, there are two parts: **14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)** and **14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS ans Locally managed MPAs 1/**. Since we need both parts, we keep both to combine with the other datasets, but we divide them into two separate datasets.

For this, we retain the `Indicator` column for now, as it contains the names of the parts and shows how the dataset should be divided. 

In [None]:
# Print explicitly; a notebook cell only displays its last expression
print(data['Indicator'].unique())
data

**14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)** goes to `universe_data` while **14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS ans Locally managed MPAs 1/** goes to `nipas_data`. 

Since the `nipas_data` will start at Index 18, we will reset it to Index 0 after the division.

In [None]:
universe_data = data [0:18]
nipas_data = data [18:]

In [None]:
nipas_data = nipas_data.reset_index (drop=True)
nipas_data

Now that the dataset has been divided, we no longer need the `Indicator` column, so we drop it from both DataFrames.

In [None]:
universe_data = universe_data.drop('Indicator', axis = 1)
nipas_data = nipas_data.drop('Indicator', axis = 1)

Since the order of the regions in each DataFrame is correct, we overwrite the Geolocation column with `region_names` for uniformity.

In [None]:
universe_data ['Geolocation'] = region_names
nipas_data ['Geolocation'] = region_names

We can now convert both DataFrames into their long representation to allow us to merge both of them with the combined dataset.

In [None]:
# 14.5.1.1
universe_data = pd.melt(universe_data, id_vars='Geolocation', value_vars=universe_data.columns [1:]) 
universe_data.rename(columns = {'value':'14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)', 0 : 'Year'}, inplace=True)
universe_data = universe_data.astype({'Year':'int'})
# 14.5.1.2
nipas_data = pd.melt(nipas_data, id_vars='Geolocation', value_vars=nipas_data.columns [1:]) 
nipas_data.rename(columns = {'value':'14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS ans Locally managed MPAs', 0 : 'Year'}, inplace=True)
nipas_data = nipas_data.astype({'Year':'int'})

Combining the two different datasets with the currently combined data...

In [None]:
# Adding the 14.5.1.1 dataset with the current combined dataset
combined_data = combined_data.merge(universe_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
# Adding the 14.5.1.2 dataset with the current combined dataset
combined_data = combined_data.merge(nipas_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 16.1.1. Number of victims of intentional homicide (per 100,000 population)

We will now load the fourteenth dataset.

In [None]:
data = pd.read_csv('data' + '/16.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/16.1.1.csv')
data

Now, we will drop the irrelevant rows.

In [None]:
data = data.drop(data.index[20:])

Since Index 0 is almost the same as the column header we want, we only change the content of the first column to `Geolocation`, as this column already contains the regions of the Philippines.

Then, we set Index 0 as the column header. After this, we drop Index 0 and the row of NaNs underneath it, since we will not need them later.

In [None]:
data.at[0,'16.1.1 Number of victims of intentional homicide (per 100,000 population) 1/'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

We first check that the order of the regions is the same as in the combined dataset. Then, to make the naming uniform, we overwrite the Geolocation column with `region_names`.

In [None]:
data ['Geolocation'] = region_names

We will then change the '..' or '...' strings to NaN.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

We can now convert the DataFrame into its long representation using the melt function. This would allow us to merge it with the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'16.1.1 Number of victims of intentional homicide (per 100,000 population)', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [None]:
data

We will now combine this dataset to the currently combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### 16.1.s1. Number of murder cases

We are now loading our fifteenth dataset.

In [None]:
data = pd.read_csv('data' + '/16.1.s1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/16.1.s1.csv')
data

Dropping the irrelevant rows.

In [None]:
data = data.drop(data.index[20:])

Since Index 0 is almost the same as the column header we want, we only change the content of the first column to "Geolocation". Then, we set Index 0 as the column header. After this, we drop Index 0 and the row of NaNs underneath it, since we will not need them later.

In [None]:
data.at[0,'16.1.s1 Number of murder cases'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

We first check that the order of the regions is the same as in the combined dataset. Then, to make the naming uniform, we overwrite the Geolocation column with `region_names`.

In [None]:
data ['Geolocation'] = region_names

We will then change the '..' or '...' strings to NaN.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # Assign back rather than relying on inplace=True for a column selection
    data[c] = data[c].replace(['..', '...'], np.nan)

We can now convert the DataFrame into its long representation using the melt function. This would allow us to merge it with the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

data.rename(columns = {'value':'16.1.s1 Number of murder cases', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [None]:
data

We will now combine this dataset to the currently combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

#### Other Non-SDG datasets
These are datasets that can provide us with more context when exploring the datasets for the Sustainable Development Goals.

##### Changes in Inventories, by Region

Loading the sixteenth dataset...

In [None]:
data = pd.read_csv('data' + '/Changes in Inventories, by Region.csv', header=1, delimiter=";")
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/Changes in Inventories, by Region.csv')
data

Since we only need the current prices and will not be comparing each record against 2018 prices, we drop the columns labeled `At Constant 2018 Prices`.

In [None]:
# Keep the first 22 columns (Region plus the current-price years); drop the rest by label
data = data.drop(columns=data.columns[22:])
data

Also, since the ordering of the Geolocation is different in this dataset, we will be rearranging the rows based on the order of the Geolocation in `region_names`.

In [None]:
data = data.reindex(index=[17,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
data

Then, we will reset the index so that it runs from 0 again.

In [None]:
data = data.reset_index (drop=True)
data

After this, we will rename the columns: (1) `Region` to `Geolocation`, and (2) `At Current Prices <Year>` to `<Year>`.

In [None]:
data.columns = ['Geolocation', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
               '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017','2018', '2019', '2020']

Next, we overwrite the Geolocation column with `region_names` so that the region labels match those in the combined dataset.

In [None]:
data ['Geolocation'] = region_names
data

We will then change the '..' or '...' strings to NaN. This is to represent the missing values.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # assign the result back; an in-place replace on a selected column may not update `data`
    data[c] = data[c].replace(to_replace=['..', '...'], value=np.nan)

We can now convert the DataFrame into its long representation using the melt function. This would allow us to merge it with the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 
data.rename(columns = {'value':'Changes in Inventories, by Region', 'variable' : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})
data

Finally, we merge this dataset into the combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

##### Current Health Expenditure by Region, Growth Rates 

Loading the seventeenth dataset...

In [None]:
data = pd.read_csv('data' + '/Current Health Expenditure by Region, Growth Rates.csv', header=1, delimiter=";")
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/Current Health Expenditure by Region, Growth Rates.csv')
data

Since we only need the nationwide and per-region data, we drop the row at index 0, which contains the Total Current Health Expenditure.

In [None]:
data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Also, since the ordering of the Geolocation is different in this dataset, we will be rearranging the rows based on the order of the Geolocation in region_names. After this, we will reset the index again.

In [None]:
data = data.reindex(index=[17,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
data = data.reset_index (drop=True)
data

After this, we will rename the columns: (1) `Region` to `Geolocation`, and (2) `At Current Prices <Year>` to `<Year>`.

In [None]:
data.columns = ['Geolocation', '2014', '2015', '2016', '2017','2018', '2019']
data

As observed, this dataset has no records for 2000-2013 and 2020-2022. To make it easier to merge with the combined dataset, we add NaN-filled columns for the missing years.

In [None]:
# For adding columns 2000-2013
col = 1
for i in range(2000,2014):
    data.insert(col, str(i), np.nan, True)
    col+=1
# For adding columns 2020-2022
col = 21
for i in range(2020,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1
data
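
The insert positions used above depend on the exact number of existing columns. As a hedged alternative, `DataFrame.reindex` can build the same layout from a list of target labels, without tracking positions. The sketch below uses a toy frame; `toy`, `all_years`, and `padded` are illustrative names.

```python
import pandas as pd

# Toy wide table covering only 2014-2015 (hypothetical values)
toy = pd.DataFrame({"Geolocation": ["PHILIPPINES"], "2014": [1.5], "2015": [2.5]})

# Target layout: Geolocation followed by every year from 2000 to 2022
all_years = [str(year) for year in range(2000, 2023)]

# reindex keeps the existing columns and creates the missing ones filled with NaN
padded = toy.reindex(columns=["Geolocation"] + all_years)
print(padded.shape)
```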

Next, we overwrite the Geolocation column with `region_names` so that the region labels match those in the combined dataset.

In [None]:
data ['Geolocation'] = region_names
data

We will then change the '..' or '...' strings to NaN. This is to represent the missing values.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # assign the result back; an in-place replace on a selected column may not update `data`
    data[c] = data[c].replace(to_replace=['..', '...'], value=np.nan)

We can now convert the DataFrame into its long representation using the melt function. This would allow us to merge it with the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 
data.rename(columns = {'value':'Current Health Expenditure by Region, Growth Rates', 'variable' : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})
data

Finally, we merge this dataset into the combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

##### Current Health Expenditure by Region

In [None]:
data = pd.read_csv('data' + '/Current Health Expenditure by Region.csv',header = 1,sep = ';')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/Current Health Expenditure by Region.csv',header = 1,sep = ';')
data

In [None]:
#drop total current health expenditure
data = data.drop (data.index[0])
data

In [None]:
# strip leading '.' characters from the region names, and remove every lowercase 'r' from the column labels
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data.columns = data.columns.str.replace('[r]', '',regex = True)
data

In [None]:
#make nationwide index 0
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data
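
The `np.arange(-1, len(data)-1)` expression above produces the positions `-1, 0, 1, ...`, so `iloc` picks the last row first and then every other row in its original order. A minimal sketch on a toy frame (the names `toy`, `order`, and `rotated` and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy table where the nationwide row comes last (hypothetical values)
toy = pd.DataFrame({"Region": ["Region A", "Region B", "PHILIPPINES"],
                    "2019": [1, 2, 3]})

# Positions -1, 0, 1: the last row first, then the remaining rows in order
order = np.arange(-1, len(toy) - 1)

# Select by position and renumber the index from 0
rotated = toy.iloc[order].reset_index(drop=True)
print(rotated)
```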

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

To follow the format of the combined dataset and to make combining the datasets easier, we add columns for the years `2000-2013` and `2021-2022`.

In [None]:
# For adding columns 2000-2013
col = 1
for i in range(2000,2014):
    data.insert(col, str(i), np.nan, True)
    col+=1
# For adding columns 2021 and 2022 (string labels, consistent with the other year columns)
data.insert(22, '2021', np.nan, True)
data.insert(23, '2022', np.nan, True)
data

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Current Health Expenditure by Region', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

Finally, we merge this dataset into the combined dataset.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

##### Government Final Consumption Expenditure, by Region, Growth Rates
Load next dataset

In [None]:
data = pd.read_csv('data' + '/Government Final Consumption Expenditure, by Region, Growth Rates.csv',header = 1,sep = ';')
data

We remove the '..' at the start of the Region column values, then move the last row to the top to follow the order of `region_names`.

In [None]:
#remove '..' and arrange row
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
#data

We replace the values of the Region column with `region_names` for consistency, then rename the column header Region to Geolocation.

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
#data

We only need the values at current prices, so we drop the columns that are not needed. We then shorten the column names to the year.

In [None]:
data.drop(data.iloc[:, 21:41], inplace = True, axis = 1)
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data
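
One detail worth keeping in mind: `str.lstrip` treats its argument as a set of characters rather than as a prefix. It works here because a digit (or the `G` of `Geolocation`) stops the stripping, but it can remove more than intended on other labels. The sketch below contrasts it with `str.removeprefix`, assuming Python 3.9 or later; the column labels are hypothetical, and the real CSV headers may differ.

```python
# Hypothetical column labels; the real CSV headers may differ
labels = ["Geolocation", "At Current Prices 2000", "At Current Prices 2001"]

# str.lstrip removes any leading characters found in the given set
print("At Current Prices 2000".lstrip("At Current Prices"))   # 2000

# The same call also eats into labels that merely start with those characters
print("Percent Share 2000".lstrip("At Current Prices"))       # Share 2000

# str.removeprefix (Python 3.9+) removes the exact prefix only
prefix = "At Current Prices "
cleaned = [label.removeprefix(prefix) for label in labels]
print(cleaned)
```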

Add missing columns 2020-2022 to be able to merge easily

In [None]:
# For adding columns 2020-2022
col = 21
for i in range(2020,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1

data


Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Consumption Expenditure GR', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

##### Government Final Consumption Expenditure, by Region, Percent Share
Load the next dataset

In [None]:
data = pd.read_csv('data' + '/Government Final Consumption Expenditure, by Region, Percent Share.csv',header = 1,sep = ';')
data

We remove the '..' at the start of the Region column values, then move the last row to the top to follow the order of `region_names`.

In [None]:
#remove '..' and arrange row
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
#data

We replace the values of the Region column with `region_names` for consistency, then rename the column header Region to Geolocation.

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
# data

We only need the values at current prices, so we drop the columns that are not needed. We then shorten the column names to the year.

In [None]:
data.drop(data.iloc[:, 22:43], inplace = True, axis = 1)
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Add missing columns 2021-2022 to be able to merge easily

In [None]:
# For adding columns 2021-2022
col = 22
for i in range(2021,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1

Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Consumption Expenditure %', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

##### Gross Capital Formation, by Region
Load the next dataset

In [None]:
data = pd.read_csv('data' + '/Gross Capital Formation, by Region.csv',header = 1,sep = ';')
data

We remove the '..' at the start of the Region column values, then move the last row to the top to follow the order of `region_names`.

In [None]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
#data

We replace the values of the Region column with `region_names` for consistency, then rename the column header Region to Geolocation.

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
#data

We only need the values at current prices, so we drop the columns that are not needed. We then shorten the column names to the year.

In [None]:
data.drop(data.iloc[:, 22:43], inplace = True, axis = 1)
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Add missing columns 2021-2022 to be able to merge easily

In [None]:
# For adding columns 2021-2022
col = 22
for i in range(2021,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1

Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Gross Capital Formation', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

##### Gross Regional Domestic Product, by Region
Load the next dataset

In [None]:
data = pd.read_csv('data' + '/Gross Regional Domestic Product, by Region.csv',header = 1,sep = ';')
data

We remove the '..' at the start of the Region column values, then move the last row to the top to follow the order of `region_names`.

In [None]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
#data

We replace the values of the Region column with `region_names` for consistency, then rename the column header Region to Geolocation.

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
#data

We only need the values at current prices, so we drop the columns that are not needed. We then shorten the column names to the year.

In [None]:
data.drop(data.iloc[:, 22:43], inplace = True, axis = 1)
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Add missing columns 2021-2022 to be able to merge easily

In [None]:
# For adding columns 2021-2022
col = 22
for i in range(2021,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1
data

Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'GRDP', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

##### Population, by Region
Load the next dataset

In [None]:
data = pd.read_csv('data' + '/Population, by Region.csv',header = 1,sep = ';')
data

We remove the '..' at the start of the Region column values, then move the last row to the top to follow the order of `region_names`.

In [None]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
#data

We replace the values of the Region column with `region_names` for consistency, then rename the column header Region to Geolocation.

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Add missing columns 2021-2022 to be able to merge easily

In [None]:
# For adding columns 2021-2022
col = 22
for i in range(2021,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1

Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Population', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

##### Primary Drop-out rates by Region, Sex and Year
Load next dataset

In [None]:
data = pd.read_csv('data' + '/Primary Drop-out rates by Region, Sex and Year.csv',header = 1,sep = ';')
data

We replace the values of the Region column with `region_names` for consistency, then rename the column header Region to Geolocation.

In [None]:
# replace the region labels with the standardized names for consistency
data['Region'] = region_names
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
#data

Drop the unnecessary columns as we only need the data for both sexes

In [None]:
data.drop(data.iloc[:, 11:31], inplace = True, axis = 1)
  
data

Strip the 'Both Sexes' at the start of the column name so only the year would remain

In [None]:
data.columns = data.columns.map(lambda x: x.lstrip('Both Sexes '))
data

Add missing columns to be able to merge easily

In [None]:
# For adding columns 2000-2005
col = 1
for i in range(2000,2006):
    data.insert(col, str(i), np.nan, True)
    col+=1
    
# For adding columns 2016-2022
col = 17
for i in range(2016,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1
data


Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Drop-out rate', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

##### Quarterly Producer Price Index for Agriculture (First Quarter 2018 to Third Quarter 2021)
Load the next dataset

In [None]:
data = pd.read_csv('data' + '/Quarterly Producer Price Index for Agriculture (2018=100) _ First Quarter 2018 to Third Quarter 2021.csv',header = 1,sep = ';')
data[data['Commodity'] == 'AGRICULTURE']

We remove the leading '..' and '….' characters from the values of the Region and Commodity columns.

In [None]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data['Commodity'] = data['Commodity'].map(lambda x: x.lstrip('..'))
data['Commodity'] = data['Commodity'].map(lambda x: x.lstrip('….'))
#data

Since there is no NCR in this dataset, we declare the region names again as `region_names1`, this time without NCR.

In [None]:
# NOTE: Before applying, make sure that the arrangement of the regions is the same as the arrangement in your table
region_names1 = ['PHILIPPINES', 
                 'CAR: Cordillera Administrative Region', 
                 'Region 1: Ilocos Region', 
                 'Region 2: Cagayan Valley', 
                 'Region 3: Central Luzon', 
                 'Region 4A: CALABARZON', 
                'MIMAROPA: Southwestern Tagalog Region', 
                'Region 5: Bicol Region', 
                'Region 6: Western Visayas', 
                'Region 7: Central Visayas', 
                'Region 8: Eastern Visayas', 
                'Region 9: Zamboanga Peninsula', 
                'Region 10: Northern Mindanao', 
                'Region 11: Davao Region', 
                'Region 12: SOCCSKSARGEN', 
                'CARAGA: Cordillera Administrative Region', 
                'BARMM: Bangsamoro Autonomous Region in Muslim Mindanao']

We keep only the AGRICULTURE rows, replace the values of the Region column with `region_names1` for consistency, and then rename the column header Region to Geolocation.

In [None]:
# copy the filtered rows so that the assignments below act on an independent DataFrame
data = data[data['Commodity'] == 'AGRICULTURE'].copy()

# replace the region labels with the standardized names for consistency
data['Region'] = region_names1
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)

We keep the columns that hold the yearly average and drop the rest.

In [None]:
data = data[['Geolocation','Commodity','2018 Average (Jan-Dec)','2019 Average (Jan-Dec)','2020 Average (Jan-Dec)']]
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation','Comm':'Commodity' },inplace = True)
data

Add missing columns to be able to merge easily

In [None]:
# For adding columns 2000-2017
col = 2
for i in range(2000,2018):
    data.insert(col, str(i), np.nan, True)
    col+=1
# For adding columns 2021-2022
col = 23
for i in range(2021,2023):
    data.insert(col, str(i), np.nan, True)
    col+=1
data

Then, we can now convert our DataFrame into its long representation using the melt function. As in the processing of the previous datasets, we would have to rename the column names as they are not descriptive enough.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

# renaming the columns to more readable names
data.rename(columns = {'value':'Price Index for Agriculture', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

We use the merge function with an outer join to combine the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

## Data Cleaning
There are four steps for the cleaning of the combined dataset: (1) the dropping of the rows wherein all the values of the indicator columns are **NaN**s, (2) the fixing of the data types of the columns, (3) the dropping of duplicated rows, and (4) the cleaning of the individual columns.

### Dropping of rows that have all **NaN** values
The first thing that we do is drop the rows that only have **NaN** values. For such a row, no data was collected for any of the indicators for that specific region in that specific year, so we would not be able to derive any knowledge from it.

Using the combination of the isna and sum functions, we would be able to see the total number of **NaN** values a specific row has.

In [None]:
combined_data.isna().sum(axis = 1).sort_values(ascending=False)

From the result above, we can see that there are rows in which every indicator is **NaN** (i.e., the number of **NaN** values is equal to the number of indicator columns). Since the `Geolocation` and `Year` columns never contain **NaN** values, we set a threshold of 3 in the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function: a row is kept only if it has at least three non-NaN values, that is, at least one indicator value in addition to `Geolocation` and `Year`.

In [None]:
combined_data = combined_data.dropna(axis = 0, thresh = 3)

With this, we have a new DataFrame with 377 rows, with `Year` ranging from 2000 to 2021.

In [None]:
combined_data['Year'].describe()

In [None]:
combined_data
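
To make the effect of `thresh = 3` in the `dropna` call above concrete, here is a minimal sketch on a toy frame with two key columns and two indicator columns. The column names `ind1` and `ind2` and all values are hypothetical. The sketch also shows an equivalent formulation with `how="all"` restricted to the indicator columns.

```python
import numpy as np
import pandas as pd

# Toy frame: the two key columns are always filled, indicators may be missing
toy = pd.DataFrame({
    "Geolocation": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "ind1": [np.nan, 1.0, np.nan],
    "ind2": [np.nan, np.nan, np.nan],
})

# Keep rows with at least 3 non-NaN values: both keys plus one indicator
kept = toy.dropna(axis=0, thresh=3)

# Equivalent here: drop rows whose indicator columns are all NaN
kept_alt = toy.dropna(axis=0, how="all", subset=["ind1", "ind2"])

print(kept)
```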

### Fixing the Data Types of the Columns
Using the [`dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) property, we can see that some indicator columns are of type **object**. Since all columns except `Geolocation` and `Year` are supposed to be **float64**, we need to convert these columns.

In [None]:
combined_data.dtypes

For each column other than `Geolocation` and `Year`, the data type is checked. If it is not **float64**, the [`astype`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) function is used to convert it to float. Even though we expect all of the values in these columns to be convertible to float, as they were numeric in the original CSV files, the `errors` parameter is still set to **raise** for validation.

In [None]:
for x in combined_data.columns.difference(['Geolocation', 'Year']):
    if combined_data[x].dtypes != 'float64':
        # assign the whole column so that its dtype actually changes to float64
        combined_data[x] = combined_data[x].astype(float, errors = 'raise')

Using the [`info`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) function, we would see that all the indicator columns are now **float64**.

In [None]:
combined_data.info()
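
The conversion step can be sketched on a toy frame as follows. The column name `ind` and the values are hypothetical; the point is that an **object** column holding numeric strings becomes **float64** after the column is reassigned with `astype`.

```python
import numpy as np
import pandas as pd

# Toy frame where the indicator was read in as strings (object dtype)
toy = pd.DataFrame({
    "Geolocation": ["A", "B"],
    "Year": [2000, 2001],
    "ind": ["1.5", np.nan],
})

for col in toy.columns.difference(["Geolocation", "Year"]):
    if toy[col].dtype != "float64":
        # errors="raise" stops on any value that cannot be parsed as a float
        toy[col] = toy[col].astype(float, errors="raise")

print(toy.dtypes)
```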

### Dropping of Duplicated Rows
Using a combination of [`duplicated`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) and [`sum`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html), we would be able to see how many rows are duplicated and should be dropped.

In [None]:
combined_data.duplicated().sum()

As the combination of these functions returned 0, we can conclude that all rows are unique. This means that we do not have to drop any rows.
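
Note that `duplicated()` without arguments only flags rows that are identical in every column. As a hedged complement, the `subset` parameter can check whether any (`Geolocation`, `Year`) key appears more than once; the toy frame and its values below are hypothetical.

```python
import pandas as pd

# Toy frame: the first two rows share a key but differ in the indicator
toy = pd.DataFrame({
    "Geolocation": ["A", "A", "B"],
    "Year": [2000, 2000, 2000],
    "ind": [1.0, 2.0, 3.0],
})

# Rows identical in every column
full_dupes = toy.duplicated().sum()

# Rows that repeat an earlier (Geolocation, Year) key
key_dupes = toy.duplicated(subset=["Geolocation", "Year"]).sum()

print(full_dupes, key_dupes)
```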

### Cleaning of Each Column
As the columns came from different datasets, we check and clean the values of each column individually.

#### 1.2.1. Proportion of population living below the national poverty line
For this column, we use the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function to check whether there are outliers. We expect values from 0 to 100, as this indicator is a proportion expressed as a percentage.

In [None]:
combined_data['1.2.1. Proportion of population living below the national poverty line'].describe()

From what we can see, the minimum and maximum values of the columns are within the range of values that we expected from this column. Thus, there are no outliers that we need to remove or drop.

#### 1.4.1p5 Net Enrolment Rate in elementary
Just like in the first column, we would be using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function in order to check the value range of the variable. 

According to the Philippine Statistics Authority (n.d.), the net enrollment rate in elementary is defined as the total enrollment of children aged 6 to 11, divided by the population of the same age group, and then multiplied by 100. We therefore expect values from 0 to 100, as this is a percentage of a population: there cannot be more children of that age enrolled than there are children in that age group.
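
Written out as a formula (a restatement of the definition above, followed by a made-up numeric example):

$$
\text{NER}_{\text{elementary}} = \frac{\text{enrolled children aged 6 to 11}}{\text{population aged 6 to 11}} \times 100
$$

For instance, with hypothetical figures of 950 enrolled children out of a population of 1,000 in that age group, the rate would be $950 / 1000 \times 100 = 95\%$.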

In [None]:
combined_data['1.4.1p5 Net Enrolment Rate in elementary'].describe()

As we can see, the maximum value of this column is higher than 100, which is concerning because the unit of measurement set by the United Nations for this indicator is a percentage for all countries. Thus, these values might be encoding errors.

Let us check all of the rows that have values higher than 100 for this indicator.

In [None]:
combined_data[combined_data['1.4.1p5 Net Enrolment Rate in elementary'] > 100]

As we can see, there are 18 rows that have a value of more than 100% for `1.4.1p5 Net Enrolment Rate in elementary`. To prevent these values from skewing any later computation, they are replaced with **NaN**s.

In [None]:
combined_data.loc[combined_data['1.4.1p5 Net Enrolment Rate in elementary'] > 100, '1.4.1p5 Net Enrolment Rate in elementary'] = np.nan

Rerunning the filter, we can confirm that all of the values in this column are now within the expected range.

In [None]:
combined_data[combined_data['1.4.1p5 Net Enrolment Rate in elementary'] > 100]
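
The range check used for this column can be generalized into a small mask that flags any non-missing value outside 0 to 100. The following is a sketch on a toy column; the column name `rate` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy percentage column with one impossible value and one missing value
toy = pd.DataFrame({"rate": [95.2, 101.3, np.nan, 88.0]})

# Flag values that are present but fall outside the closed range [0, 100]
out_of_range = toy["rate"].notna() & ~toy["rate"].between(0, 100)
print(int(out_of_range.sum()))

# Replace the flagged values with NaN so they do not skew later computations
toy.loc[out_of_range, "rate"] = np.nan
print(toy)
```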

#### 1.4.1p6 Net Enrolment Rate in secondary education (Junior High School)

As we have the same expectations as for the previous column, the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function is used to check whether there are outliers or values outside of the range.

In [None]:
combined_data['1.4.1p6 Net Enrolment Rate in secondary education (Junior High School)'].describe()

From the minimum and maximum values, we can see that all values are within the expected range.

#### 1.4.1p6 Net Enrolment Rate in secondary education (Senior High School)
Next, in this column, we would be using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function for the same purpose: checking if the maximum and minimum values are within the range we expected.

In [None]:
combined_data['1.4.1p6 Net Enrolment Rate in secondary education (Senior High School)'].describe()

Based on the output, we can see that the minimum and maximum are within the range.

However, another expectation for this column is that the rows that are not **NaN** have a `Year` value of **2016 onwards**. This is because Senior High School was only added in 2016. Thus, if there are values for earlier years, we would need to turn them into **NaN**.

To check this, we can use a combination of the [`isnull`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function and the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) function. Using the negation of `isnull`, we select only the rows that are not missing. Then, using `unique`, we return the unique values of the `Year` column of the selected rows.

In [None]:
combined_data[~combined_data['1.4.1p6 Net Enrolment Rate in secondary education (Senior High School)'].isnull()]['Year'].unique()

We can see that the values of the `Year` column of the rows that are not **NaN** for this column are what we expected.
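
The same expectation can also be expressed as an explicit check rather than a visual inspection. The sketch below uses a toy frame; the column name `shs_rate` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: the indicator only has values from 2016 onwards
toy = pd.DataFrame({
    "Year": [2015, 2016, 2017],
    "shs_rate": [np.nan, 40.0, 45.0],
})

# Years for which the indicator is not missing
years_with_data = toy.loc[toy["shs_rate"].notna(), "Year"].unique()
print(years_with_data)

# Count non-missing values recorded before 2016; the expectation is zero
too_early = toy["shs_rate"].notna() & (toy["Year"] < 2016)
print(int(too_early.sum()))
```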

#### 1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies

As this column is a proportion, we again expect values from 0 to 100. We can check this with the same function as for the previous columns (i.e., the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function), using the returned minimum and maximum values.

In [None]:
combined_data['1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies'].describe()

Since the maximum is 100 and the minimum is not less than 0, we can conclude that no values fall outside the accepted range.
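Because the same 0 to 100 check is repeated for many columns, it could also be wrapped in a small helper. The function below, `out_of_range`, is a hypothetical name introduced here for illustration, and the sample values are made up.

```python
import numpy as np
import pandas as pd

def out_of_range(series, low=0, high=100):
    """Return the non-missing values of `series` that fall outside [low, high]."""
    present = series.dropna()
    # Series.between is inclusive on both ends by default.
    return present[~present.between(low, high)]

# Toy values standing in for one proportion column of combined_data.
proportion = pd.Series([0.0, 45.2, np.nan, 100.0, 104.3])
offenders = out_of_range(proportion)
print(offenders.tolist())  # [104.3]
```

An empty result means the column passes the check; a non-empty one lists the exact values (with their index labels) that need attention, which `describe` alone does not show.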

#### 3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease (Total data)
As the next column is again a rate, the accepted range is 0 to 100. This follows from its formula, which divides the number of deaths attributed to these diseases by the population. Since the number of deaths cannot exceed the population, a value higher than 100 cannot be accepted.

We can check for the minimum and maximum value through the use of the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function.

In [None]:
combined_data['3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease (Total data)'].describe()
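To make the reasoning behind the 0 to 100 bound concrete, here is a small worked example. It assumes the rate is expressed as a percentage of the population, which is an assumption based on the description above and should be checked against the indicator's metadata; the numbers and the helper name `mortality_rate` are hypothetical.

```python
def mortality_rate(deaths, population):
    """Deaths as a percentage of the population (assumed scaling, see lead-in)."""
    if deaths > population:
        raise ValueError('deaths cannot exceed the population')
    return 100 * deaths / population

# Hypothetical figures, not taken from the dataset.
rate = mortality_rate(deaths=1_200, population=400_000)
print(rate)  # 0.3

# The upper bound is reached only when the whole population has died.
upper_bound = mortality_rate(deaths=400_000, population=400_000)
print(upper_bound)  # 100.0
```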

#### 3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods

In [None]:
combined_data['3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods'].describe()

#### 3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group

In [None]:
combined_data['3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group'].describe()

#### 4.1.s1 Completion Rate of elementary and secondary students (Elementary)

In [None]:
combined_data['4.1.s1 Completion Rate of elementary and secondary students (Elementary)'].describe()

#### 4.1.s1 Completion Rate of elementary and secondary students (Junior High School)

In [None]:
combined_data['4.1.s1 Completion Rate of elementary and secondary students (Junior High School)'].describe()

#### 4.1.s1 Completion Rate of elementary and secondary students (Senior High School)

In [None]:
combined_data['4.1.s1 Completion Rate of elementary and secondary students (Senior High School)'].describe()

In [None]:
combined_data[~combined_data['4.1.s1 Completion Rate of elementary and secondary students (Senior High School)'].isnull()]['Year'].unique()

#### 7.1.1 Proportion of population with access to electricity

In [None]:
combined_data['7.1.1 Proportion of population with access to electricity'].describe()

In [None]:
combined_data[combined_data['7.1.1 Proportion of population with access to electricity'] > 100]

In [None]:
combined_data.loc[combined_data['7.1.1 Proportion of population with access to electricity'] > 100, '7.1.1 Proportion of population with access to electricity'] = np.nan

In [None]:
combined_data[combined_data['7.1.1 Proportion of population with access to electricity'] > 100]
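The cells above inspect the rows above 100, set them to **NaN**, and then re-check the column. As a sketch of an equivalent approach, the replacement step could also be written with the `mask` method of a pandas Series. The toy DataFrame and the shortened column name `elec_col` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for combined_data; the column name is shortened.
elec_col = 'Access to electricity'
toy = pd.DataFrame({
    'Geolocation': ['Region A', 'Region B', 'Region C'],
    elec_col: [98.7, 101.4, np.nan],
})

# Series.mask replaces values where the condition is True (NaN by default).
toy[elec_col] = toy[elec_col].mask(toy[elec_col] > 100)

# Re-check: no value above 100 should remain.
n_over_100 = int((toy[elec_col] > 100).sum())
print(n_over_100)  # 0
```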

#### 10.1.1.1 Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population

#### 10.1.1.2 Growth rates of household expenditure or income per capita among the Total Population

#### 14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)

#### 14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS and Locally managed MPAs

#### 16.1.1 Number of victims of intentional homicide (per 100,000 population)

#### 16.1.s1 Number of murder cases

#### Changes in Inventories, by Region

#### Current Health Expenditure by Region, Growth Rates

#### Current Health Expenditure by Region

#### Government Final Consumption Expenditure, by Region, Growth Rates

#### Government Final Consumption Expenditure, by Region, Percent Share

#### Gross Capital Formation, by Region

#### Gross Regional Domestic Product, by Region

#### Population, by Region

#### Primary Drop-out rates by Region, Sex and Year

#### Quarterly Producer Price Index for Agriculture (First Quarter 2018 to Third Quarter 2021)

## Exploratory Data Analysis
With the combined dataset, there is a substantial amount of raw data to process and analyze. Before performing any statistical analysis, it is good practice to do exploratory data analysis to observe patterns and detect any outliers in the dataset. With this, we can properly identify particular relationships between specific variables.

### Per year, what region has the lowest proportion value of the population living below the national poverty line?
To answer this question, we use three columns from the combined DataFrame: (1) `Geolocation`, (2) `Year`, and (3) `1.2.1. Proportion of population living below the national poverty line`. Since we aim to get the lowest proportion value per year, we first need to group the rows by year using the [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function.

Once the rows have been grouped, we take the minimum value of the `1.2.1. Proportion of population living below the national poverty line` column for each group using the [`min`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html) function. Then, as we know from the cleaning that some years have no values for this column, we keep only the years whose minimum is not **NaN**, using the [`notna`](https://pandas.pydata.org/docs/reference/api/pandas.Series.notna.html) function.

In [None]:
grouped_by_year = combined_data.groupby(['Year'])['1.2.1. Proportion of population living below the national poverty line'].min()
grouped_by_year = grouped_by_year[grouped_by_year.notna()]
grouped_by_year

Once we have the minimum values for this column, we can use them to retrieve the rows that hold these values.

In [None]:
combined_data[combined_data['1.2.1. Proportion of population living below the national poverty line'].isin(grouped_by_year.values)]

From the DataFrame, we can see that the **National Capital Region** has the lowest proportion of the population living below the national poverty line in both years, and that its value decreased further in 2018.
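One caveat of an `isin` lookup is that a row matches whenever its value equals the minimum of any year, not necessarily its own year. A stricter per-year comparison could be sketched as follows. The toy DataFrame, its values, and the shortened column name `pov_col` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for combined_data; values and column name are made up.
pov_col = 'Poverty proportion'
toy = pd.DataFrame({
    'Geolocation': ['Region A', 'Region B'] * 3,
    'Year': [2015, 2015, 2016, 2016, 2018, 2018],
    pov_col: [4.0, 20.0, np.nan, np.nan, 2.0, 4.0],
})

# Minimum of each row's own year, aligned to the original index.
yearly_min = toy.groupby('Year')[pov_col].transform('min')

# NaN == NaN evaluates to False, so years without data drop out automatically.
lowest_per_year = toy[toy[pov_col] == yearly_min]
print(lowest_per_year[['Year', 'Geolocation', pov_col]])
```

In this toy data, Region B's 2018 value (4.0) equals the 2015 minimum, so an `isin` lookup would return three rows, while the per-year comparison returns only the two true minima.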

Let us cross-check this using a bar graph, which lets us clearly see the proportion of the population living below the national poverty line per region.

To do this, let us first get all the rows for **2015** and **2018**.

In [None]:
data_2015 = combined_data[combined_data['Year'] == 2015]
data_2018 = combined_data[combined_data['Year'] == 2018]

Let us plot the data from 2015 into a bar graph using the [`plot`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) function.

In [None]:
# Bar graph of the poverty proportion, one bar per row of the 2015 data
ax1 = data_2015['1.2.1. Proportion of population living below the national poverty line'].plot(figsize=(8, 6), kind='bar', width=0.5)

# Label each bar with its region instead of the row index
ax1.set_xticklabels(data_2015['Geolocation'], rotation=90)

ax1.set_title('1.2.1. Proportion of population living below the national poverty line by Geolocation (2015)')
ax1.set_ylabel('Proportion');
ax1.set_xlabel('Geolocation');

#### Figure 1. Proportion of population living below the national poverty line by Geolocation (2015)
From the above figure, we can see that the bar of the **National Capital Region** is lower than those of the other regions. Its proportion is below 10%, while the other bars are near 10% or higher.

Using the same [`plot`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) function, let us also plot the data from 2018 into a bar graph.

In [None]:
# Bar graph of the poverty proportion, one bar per row of the 2018 data
ax2 = data_2018['1.2.1. Proportion of population living below the national poverty line'].plot(figsize=(8, 6), kind='bar', width=0.5)

# Label each bar with its region instead of the row index
ax2.set_xticklabels(data_2018['Geolocation'], rotation=90)

ax2.set_title('1.2.1. Proportion of population living below the national poverty line by Geolocation (2018)')
ax2.set_ylabel('Proportion');
ax2.set_xlabel('Geolocation');

#### Figure 2. Proportion of population living below the national poverty line by Geolocation (2018)
For the `Year` 2018, we can see that the **National Capital Region** still has the shortest bar. Compared to the region's 2015 bar, this one is shorter.

From these two bar graphs, we can conclude that the **National Capital Region** has the lowest proportion of the population living below the national poverty line for the years available in the dataset (i.e., 2015 and 2018).
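The visual comparison can also be backed by a numeric check. The sketch below reshapes the data so that each year becomes a column, then looks up the region with the lowest value per year and the change between the two years. The toy DataFrame, its values, and the shortened column name `pov_col` are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the 2015 and 2018 rows of combined_data.
pov_col = 'Poverty proportion'
toy = pd.DataFrame({
    'Geolocation': ['Region A', 'Region B', 'Region A', 'Region B'],
    'Year': [2015, 2015, 2018, 2018],
    pov_col: [4.0, 20.0, 2.0, 15.0],
})

# One row per region, one column per year.
wide = toy.pivot(index='Geolocation', columns='Year', values=pov_col)

# Region with the smallest value in each year.
lowest_region = wide.idxmin()

# Change from 2015 to 2018 per region (negative means the proportion went down).
change = wide[2018] - wide[2015]

print(lowest_region)
print(change)
```

With the wide layout, both the per-year minimum and the year-over-year change are single expressions, which makes the conclusion drawn from the two figures easy to verify.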

### What education level (Junior or Senior High School) has a higher rate per region (2016 - 2018)?

### What year has the most adolescent birth rate?

## Conversion of DataFrame to File

## References
https://www.un.org/esa/sustdev/natlinfo/indicators/methodology_sheets/education/net_enrolment.pdf
https://psa.gov.ph/content/net-enrolment-ratio-ner-1