# Progress of the Philippines' Sustainable Development Goals

### Import

In [245]:
import os
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore

## Data Collection
The **csv** files used in this project were acquired through a request sent to the Knowledge Management and Communications Division of the Philippine Statistics Authority.

### Combining the Datasets 
In this stage, the separate datasets are pre-processed and cleaned before they are combined. The cleaning done on each dataset includes: (1) fixing the column names, (2) modifying the values of the 'Geolocation' column, (3) removing unneeded rows and columns, and (4) converting '..' or '...' values to NaN. Each dataset is then converted into a long representation before it is merged with the others.

#### 1.2.1. Proportion of population living below the national poverty line 
To start with, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

The [`os.getenv`](https://docs.python.org/3/library/os.html) function was used to get the environment variable `DSDATA_PROJ`, which points to the data folder of this project.

In [246]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.2.1.csv')
data

Unnamed: 0,1.2.1 Proportion of population living below the national poverty line by sex age 1/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,23.5,..,..,16.7,..,..,...,..
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,4.1,..,..,2.2,..,..,...,..
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,22.7,..,..,12.0,..,..,...,..
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,18.8,..,..,9.9,..,..,...,..
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,17.8,..,..,16.3,..,..,...,..
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,10.5,..,..,7.0,..,..,...,..
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,12.5,..,..,7.1,..,..,...,..
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,25.2,..,..,15.1,..,..,...,..
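
If `DSDATA_PROJ` is not set, `os.getenv` returns `None`, and the string concatenation in the cell above would raise a `TypeError`. The following is a minimal sketch of a more defensive way to build the path. The helper name `data_path` is hypothetical and is not part of the original notebook.

```python
import os
from pathlib import Path


def data_path(filename, env_var="DSDATA_PROJ"):
    """Return the path of `filename` inside the folder named by `env_var`.

    `data_path` is a hypothetical helper, not part of the original notebook.
    """
    folder = os.environ.get(env_var)
    if folder is None:
        # Fail early with a readable message instead of a TypeError later on.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return Path(folder) / filename
```

With such a helper, the loading step could be written as `pd.read_csv(data_path('1.2.1.csv'))`.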


Looking at the DataFrame, we can see that the columns are unnamed and that the actual column names are in the 0th row. Using [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), we can get the 0th row and assign it as the column names.

Then, using the [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function, we can drop the 0th row as we have no need for it anymore. Additionally, since the row at index 1 holds only the `Geolocation` label followed by NaN values, we can also drop it using the same function.

Finally, the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function is used so that the row index starts again from 0.

In [247]:
# use the 0th row as the column names
data.columns = data.iloc[0]

# drop the row at index 1, which only holds the 'Geolocation' label
data = data.drop(data.index[1])

# drop the 0th row, which is now redundant with the column names
data = data.drop(data.index[0])

data.reset_index(drop=True, inplace=True)
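
Since the same header clean-up recurs for every dataset, it could be wrapped in a small function. The sketch below runs on a toy DataFrame shaped like the raw file. `promote_header` and the toy column names are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def promote_header(df, header_row=0, extra_rows=(1,)):
    """Use row `header_row` as the column labels, then drop it and `extra_rows`."""
    out = df.copy()
    out.columns = out.iloc[header_row].tolist()
    to_drop = [out.index[header_row]] + [out.index[i] for i in extra_rows]
    out = out.drop(index=to_drop)
    return out.reset_index(drop=True)


# Toy input shaped like the raw file: header in row 0, label-only row 1.
raw = pd.DataFrame(
    [
        ["Year", "2015", "2018"],
        ["Geolocation", None, None],
        ["PHILIPPINES", "23.5", "16.7"],
        ["..National Capital Region (NCR)", "4.1", "2.2"],
    ],
    columns=["title", "Unnamed: 1", "Unnamed: 2"],
)
clean = promote_header(raw)
print(clean)
```

Because the labels are passed as a plain list here, the columns index is left unnamed, so a later `melt` would call its variable column `variable` rather than `0`.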

Irrelevant rows that are just footers for the file are also removed.

In [248]:
# dropping irrelevant rows 
data = data.drop (data.index [18:]) 

The `Year` column must also be renamed to `Geolocation`, as this column refers to the different regions in the Philippines and not to the years. This can be done using the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function.

In [249]:
# renames the column 'Year' as it's actually the location column
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

To make it easier to tell which region each `Geolocation` value refers to, we can also change these values to include the names by which the regions are commonly known, instead of just their region numbers.

To keep these names consistent across the different datasets, the `region_names` variable is declared.

In [250]:
# NOTE: Before applying, make sure that the order of the regions here matches the order of the rows in your table
region_names = ['PHILIPPINES', 'NCR: National Capital Region', 
                 'CAR: Cordillera Administrative Region', 
                 'Region 1: Ilocos Region', 
                 'Region 2: Cagayan Valley', 
                 'Region 3: Central Luzon', 
                 'Region 4A: CALABARZON', 
                'MIMAROPA: Southwestern Tagalog Region', 
                'Region 5: Bicol Region', 
                'Region 6: Western Visayas', 
                'Region 7: Central Visayas', 
                'Region 8: Eastern Visayas', 
                'Region 9: Zamboanga Peninsula', 
                'Region 10: Northern Mindanao', 
                'Region 11: Davao Region', 
                'Region 12: SOCCSKSARGEN', 
                'CARAGA: Cordillera Administrative Region', 
                'BARMM: Bangsamoro Autonomous Region in Muslim Mindanao']

In [251]:
# replaces the values in the Geolocation column with the standardized region names
data['Geolocation'] = region_names
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,23.5,..,..,16.7,..,..,...,..
1,NCR: National Capital Region,..,..,..,..,..,..,..,..,..,...,..,..,4.1,..,..,2.2,..,..,...,..
2,CAR: Cordillera Administrative Region,..,..,..,..,..,..,..,..,..,...,..,..,22.7,..,..,12.0,..,..,...,..
3,Region 1: Ilocos Region,..,..,..,..,..,..,..,..,..,...,..,..,18.8,..,..,9.9,..,..,...,..
4,Region 2: Cagayan Valley,..,..,..,..,..,..,..,..,..,...,..,..,17.8,..,..,16.3,..,..,...,..
5,Region 3: Central Luzon,..,..,..,..,..,..,..,..,..,...,..,..,10.5,..,..,7.0,..,..,...,..
6,Region 4A: CALABARZON,..,..,..,..,..,..,..,..,..,...,..,..,12.5,..,..,7.1,..,..,...,..
7,MIMAROPA: Southwestern Tagalog Region,..,..,..,..,..,..,..,..,..,...,..,..,25.2,..,..,15.1,..,..,...,..
8,Region 5: Bicol Region,..,..,..,..,..,..,..,..,..,...,..,..,39.8,..,..,27.0,..,..,...,..
9,Region 6: Western Visayas,..,..,..,..,..,..,..,..,..,...,..,..,24.6,..,..,16.3,..,..,...,..
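
Assigning `region_names` by position only works while the rows stay in the expected order. As a hedged alternative, the raw labels could be normalized and looked up in a mapping, so that any unexpected label shows up as `NaN`. In this sketch, `REGION_MAP` and `normalize_label` are hypothetical names, and the mapping is deliberately partial, covering only a few labels that appear in this notebook's outputs.

```python
import re

import pandas as pd

# Partial, illustrative mapping from raw labels to the standardized names.
REGION_MAP = {
    "PHILIPPINES": "PHILIPPINES",
    "National Capital Region (NCR)": "NCR: National Capital Region",
    "Cordillera Administrative Region (CAR)": "CAR: Cordillera Administrative Region",
    "Region I": "Region 1: Ilocos Region",
    "Region IV-A": "Region 4A: CALABARZON",
    "MIMAROPA": "MIMAROPA: Southwestern Tagalog Region",
}


def normalize_label(label):
    """Strip the leading '..' indent and a trailing footnote marker such as ' 2/'."""
    label = label.strip().lstrip(".")
    return re.sub(r"\s+\d+/$", "", label)


raw = pd.Series(["PHILIPPINES", "..National Capital Region (NCR)", "..Region IV-A 2/"])
renamed = raw.map(normalize_label).map(REGION_MAP)
print(renamed.tolist())
```

A label that is missing from the mapping becomes `NaN`, which makes a mismatch visible instead of silently mislabeling a row.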


Next, we can convert the strings '..' and '...', which mark cells that have no value, to **NaN** using the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function.

The columns whose values are all **NaN** are not dropped, however. Other datasets have complete data for all the years, so every year would still be present once the datasets are combined. Additionally, dropping years from only some of the datasets would leave the combined dataset with an unusual row order (i.e., an ordering of the regions that does not follow the usual ordering of regions in the Philippines), even after sorting by the `Year` and `Geolocation` columns.

In [252]:
for c in data.columns.difference(['Geolocation']):
    # cells without values are represented as either '..' or '...', so we convert them to NaN
    # (the result is assigned back rather than relying on an in-place change to a column selection)
    data[c] = data[c].replace(to_replace=['..', '...'], value=np.nan)

# would drop the columns whose values are all NaN (intentionally left disabled)
# data = data.dropna(axis=1, how='all')
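
An alternative to replacing the markers afterwards is to let `read_csv` treat them as missing at load time through its `na_values` parameter. This is only a sketch on an inline toy CSV, not on the PSA files. The real files carry extra header rows, so they would still need the header clean-up described earlier.

```python
import io

import pandas as pd

# Toy CSV that uses the same missing-value markers as the PSA files.
csv_text = (
    "Geolocation,2015,2016,2018,2021\n"
    "PHILIPPINES,23.5,..,16.7,...\n"
    "NCR,4.1,..,2.2,...\n"
)

# '..' and '...' are parsed as NaN, so the year columns load as floats.
df = pd.read_csv(io.StringIO(csv_text), na_values=["..", "..."])
print(df)
print(df.dtypes)
```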

In [253]:
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,23.5,,,16.7,,,,
1,NCR: National Capital Region,,,,,,,,,,...,,,4.1,,,2.2,,,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,22.7,,,12.0,,,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,18.8,,,9.9,,,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,17.8,,,16.3,,,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,10.5,,,7.0,,,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,12.5,,,7.1,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,25.2,,,15.1,,,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,39.8,,,27.0,,,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,24.6,,,16.3,,,,


As the final step, the wide representation of this dataset is converted to a long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. Note that `data.columns[2:]` starts at the 2001 column, so the year 2000 is not included in the long table.

Then, the column that holds the value for a specific year and region is renamed, using [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html), to the ID of this Sustainable Development Goal (SDG) indicator, so that it can be distinguished once it is combined with other datasets.

In [254]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns[2:])

# renaming the columns to more readable names
data.rename(columns={'value': '1.2.1', 0: 'Year'}, inplace=True)

# converting the year values to integers
data = data.astype({'Year': 'int'})

data

Unnamed: 0,Geolocation,Year,1.2.1
0,PHILIPPINES,2001,
1,NCR: National Capital Region,2001,
2,CAR: Cordillera Administrative Region,2001,
3,Region 1: Ilocos Region,2001,
4,Region 2: Cagayan Valley,2001,
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,
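
To make the effect of the `value_vars` slice concrete, here is a small sketch of `melt` on a toy wide table. The numbers are placeholders rather than PSA data, and the `value` column name is arbitrary. It compares `columns[1:]`, which keeps every year column, with `columns[2:]`, which starts at the second year column.

```python
import pandas as pd

# Toy wide table; the values are placeholders, not real data.
wide = pd.DataFrame(
    {
        "Geolocation": ["PHILIPPINES", "NCR: National Capital Region"],
        "2000": [1.0, 2.0],
        "2001": [3.0, 4.0],
        "2002": [5.0, 6.0],
    }
)

# columns[1:] keeps every year column.
long_all = pd.melt(wide, id_vars="Geolocation", value_vars=wide.columns[1:],
                   var_name="Year", value_name="value")
# columns[2:] skips the first year column ("2000").
long_skip = pd.melt(wide, id_vars="Geolocation", value_vars=wide.columns[2:],
                    var_name="Year", value_name="value")

long_all = long_all.astype({"Year": "int"})
print(sorted(long_all["Year"].unique()))
print(sorted(long_skip["Year"].unique()))
```

Passing `var_name` and `value_name` directly to `melt` also removes the need for a separate `rename` step.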


As this is the first dataset, we can just assign it to the `combined_data` DataFrame, which would hold the combined datasets.

In [255]:
combined_data = data

#### 1.4.1p5. Net Enrolment Rate in elementary

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the next dataset. 

In [256]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.4.1p5.csv')
data

Unnamed: 0,1.4.1p5 Net Enrolment Rate in elementary (Indicator is also found in SDG 4.3.s1) 1/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,,Year,2000,2001,2002.00,2003.00,2004.00,2005.00,2006.00,2007.00,...,2013.00,2014.00,2015.00,2016.00,2017.00,2018.00,2019.00,2020.0000,2021,2022
1,Geolocation,Sex,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.20,97.19,96.90,96.15,94.19,94.05,93.96,89.1064,...,...
3,,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
4,,Girls,97.28,90.91,91.10,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Note:,,,,,,,,,,...,,,,,,,,,,
59,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
60,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
61,1/ - Updates were based on the submission of D...,,,,,,,,,,...,,,,,,,,,,


From the DataFrame above, we can see that the footer of the .csv file was included in the DataFrame. As the rows from index 56 onward are irrelevant, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) them.

In [257]:
data = data.drop (data.index [56:]) 
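
The cut-off index of 56 is specific to this file. As a sketch only, the footer could instead be located by its first line, assuming it always begins with a `Note:` row as in the output above. `drop_footer` is a hypothetical helper name, and the toy table below only imitates the layout of the real file.

```python
import numpy as np
import pandas as pd


def drop_footer(df, marker="Note:"):
    """Drop the first row whose first column equals `marker`, all rows after it,
    and any fully empty rows directly above it."""
    first_col = df.iloc[:, 0]
    hits = np.flatnonzero((first_col == marker).to_numpy())
    if len(hits) == 0:
        return df  # no footer marker found, leave the table unchanged
    end = int(hits[0])
    while end > 0 and df.iloc[end - 1].isna().all():
        end -= 1
    return df.iloc[:end]


toy = pd.DataFrame(
    {
        "a": ["PHILIPPINES", None, None, "Note:", ".. - Data not available"],
        "b": ["Both Sexes", "Boys", None, None, None],
    }
)
trimmed = drop_footer(toy)
print(trimmed)
```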

Additionally, we can see that the columns are unnamed, and upon inspection, the original column names can be found at `Index 0`. Thus, we can set the columns to this row, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Index 0` row, as it would be redundant and might affect the computations.

The [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used in order to make the index of the rows start from 0.

In [258]:
# setting the column names and removing the row that held the previous column names
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,NaN,Year,2000,2001,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,Geolocation,Sex,,,,,,,,,...,,,,,,,,,,
1,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.2,97.19,96.9,96.15,94.19,94.05,93.96,89.1064,...,...
2,,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
3,,Girls,97.28,90.91,91.1,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
4,..National Capital Region (NCR),Both Sexes,101,97.82,97.38,96.81,94.82,92.61,92.89,94.42,...,99.64,99.01,99.85,95.92,92.83,92.11,89.91,81.1478,...,...
5,,Boys,100.13,96.57,96.52,95.81,93.75,91.65,92.0,93.21,...,98.77,98.13,98.8,95.3,92.2,91.85,89.43,80.6316,...,...
6,,Girls,101.92,99.13,98.28,97.87,95.95,93.63,93.83,95.69,...,100.57,99.95,100.95,96.58,93.5,92.38,90.42,81.6903,...,...
7,..Cordillera Administrative Region (CAR),Both Sexes,94.42,92.89,91.52,89.19,86.4,82.58,80.86,81.5,...,99.66,100.16,99.19,97.24,94.37,92.24,91.4,87.5276,...,...
8,,Boys,94.26,91.96,90.53,88.36,85.52,81.75,80.19,81.01,...,99.85,100.27,99.42,97.94,95.13,93.45,92.25,88.5518,...,...
9,,Girls,94.58,93.88,92.57,90.07,87.31,83.46,81.57,82.01,...,99.47,100.05,98.95,96.51,93.59,90.99,90.51,86.4657,...,...


However, there is still a row at `Index 0` that holds only labels and NaN values, and the names of the first two columns do not match the values underneath them: the first column actually holds the Geolocations, and the second column holds the values for Sex. Thus, we can [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) these two columns, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row at `Index 0`.

In [259]:
data = data.rename(columns = {np.nan:'Geolocation', 'Year': 'Sex'})
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As we only need the data that is grouped by region and not by sex, we keep only the rows that have **Both Sexes** as the value in the Sex column. After this, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the Sex column as it is not used from this point onward.

In [260]:
# Only getting the total data, then dropping Sex column as it's not needed anymore
data = data[data['Sex'] == 'Both Sexes']
data = data.drop("Sex", axis = 1)
data = data.reset_index (drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,2008.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,85.11,...,97.2,97.19,96.9,96.15,94.19,94.05,93.96,89.1064,...,...
1,..National Capital Region (NCR),101,97.82,97.38,96.81,94.82,92.61,92.89,94.42,93.69,...,99.64,99.01,99.85,95.92,92.83,92.11,89.91,81.1478,...,...
2,..Cordillera Administrative Region (CAR),94.42,92.89,91.52,89.19,86.4,82.58,80.86,81.5,81.93,...,99.66,100.16,99.19,97.24,94.37,92.24,91.4,87.5276,...,...
3,..Region I,97.73,91.33,89.64,88.52,86.98,84.87,82.74,83.14,82.85,...,97.39,97.84,96.78,94.84,92.5,90.48,89.99,86.2185,...,...
4,..Region II,95.65,89.45,86.71,85.65,82.9,79.92,77.7,77.53,76.23,...,100.08,101.15,102.42,100.26,98.45,96.86,97.17,93.6348,...,...
5,..Region III,98.32,86.35,93.58,93.61,92.03,90.77,89.14,91.37,90.93,...,99.03,99.56,98.84,98.53,97.91,98.77,100.03,95.4067,...,...
6,..Region IV-A 2/,98.5,93.44,95.97,95.33,95.1,92.87,92.36,94.02,94.1,...,96.09,97.09,96.36,97.2,96.31,97.36,98.23,91.9912,...,...
7,..MIMAROPA 2/,..,..,91.52,89.42,88.0,84.39,83.84,84.07,85.42,...,95.77,95.58,96.56,94.98,92.33,90.99,90.26,86.2074,...,...
8,..Region V,95.56,91.77,90.95,89.3,87.78,85.43,83.8,85.41,85.07,...,98.53,98.25,96.41,95.77,93.56,93.12,92.68,87.2573,...,...
9,..Region VI,96.16,89.6,85.95,83.25,80.49,77.14,74.96,75.44,74.93,...,97.71,98.47,98.89,99.09,97.16,97.38,97.25,93.9281,...,...


To be able to merge this into the combined DataFrame, the values of the Geolocation column are set to the same region names that were used for the first dataset.

In [261]:
data['Geolocation'] = region_names

Since the dataset represents missing values as either '...' or '..', we can [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) these values with `np.nan`.

In [262]:
for c in data.columns.difference(['Geolocation']):
    # cells without values are represented as either '..' or '...', so we convert them to NaN
    data[c] = data[c].replace(to_replace=['..', '...'], value=np.nan)

# data = data.dropna(axis=1, how = 'all')

Then, we can transform the wide representation of the DataFrame to its long representation version using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. 

In [263]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'1.4.1p5', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [264]:
data

Unnamed: 0,Geolocation,Year,1.4.1p5
0,PHILIPPINES,2001,90.1
1,NCR: National Capital Region,2001,97.82
2,CAR: Cordillera Administrative Region,2001,92.89
3,Region 1: Ilocos Region,2001,91.33
4,Region 2: Cagayan Valley,2001,89.45
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,
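
The replace, melt, rename, and type-conversion steps follow the same pattern for each indicator, so they could be collected into one function. The sketch below is an assumption about how that might look, not the notebook's own code, and `to_long` is a hypothetical name. Unlike the cells above, it melts every year column, including the first one.

```python
import numpy as np
import pandas as pd


def to_long(wide, indicator_id, id_col="Geolocation"):
    """Convert a cleaned wide table (one column per year) to long form."""
    out = wide.replace(["..", "..."], np.nan)
    out = out.melt(id_vars=id_col, var_name="Year", value_name=indicator_id)
    # Year labels may be a mix of strings ('2021') and floats (2019.0).
    out["Year"] = out["Year"].astype(float).astype(int)
    out[indicator_id] = pd.to_numeric(out[indicator_id])
    return out


# Toy slice of the 1.4.1p5 table, with the mixed year labels seen in the output.
wide = pd.DataFrame(
    {
        "Geolocation": ["PHILIPPINES", "NCR: National Capital Region"],
        2019.0: ["93.96", "89.91"],
        2020.0: ["89.1064", "81.1478"],
        "2021": ["...", "..."],
    }
)
long = to_long(wide, "1.4.1p5")
print(long)
```

Converting the values with `pd.to_numeric` also gives the indicator column a numeric dtype, which later computations would need.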


Then we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) this long representation DataFrame into the combined DataFrame. It is merged on the values in the **Geolocation** and **Year** columns. An outer join is used as we want to retain all the values of both DataFrames, even if this leaves **NaN** values in some of the cells.

In [265]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [266]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5
0,PHILIPPINES,2001,,90.1
1,NCR: National Capital Region,2001,,97.82
2,CAR: Cordillera Administrative Region,2001,,92.89
3,Region 1: Ilocos Region,2001,,91.33
4,Region 2: Cagayan Valley,2001,,89.45
...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,
392,Region 11: Davao Region,2022,,
393,Region 12: SOCCSKSARGEN,2022,,
394,CARAGA: Cordillera Administrative Region,2022,,
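
To illustrate what the outer join keeps, here is a small sketch using a few of the PHILIPPINES values shown earlier. The `validate` and `indicator` arguments are optional additions that are not used in the notebook. They check that each (Geolocation, Year) pair is unique on both sides and record where each row came from.

```python
import pandas as pd

left = pd.DataFrame(
    {"Geolocation": ["PHILIPPINES", "PHILIPPINES"], "Year": [2015, 2018], "1.2.1": [23.5, 16.7]}
)
right = pd.DataFrame(
    {"Geolocation": ["PHILIPPINES", "PHILIPPINES"], "Year": [2015, 2016], "1.4.1p5": [96.9, 96.15]}
)

merged = left.merge(
    right,
    how="outer",
    on=["Geolocation", "Year"],
    validate="one_to_one",  # raises if a key pair is duplicated on either side
    indicator=True,         # adds a '_merge' column: left_only, right_only, both
)
print(merged)
```

Rows that exist in only one of the two tables are kept, with `NaN` in the columns that come from the other table.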


#### 1.4.1p6. Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)

Next, we can load the third dataset.

In [267]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.4.1p6.csv')
data

Unnamed: 0,1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,,,Year,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016.00,2017.00,2018.00,2019.00,2020.0000,2021,2022
1,Level of Education,Geolocation,Sex,,,,,,,,...,,,,,,,,,,
2,Junior High School,PHILIPPINES,Both Sexes,66.06,57.55,59,60.15,59.97,58.54,58.59,...,67.89,67.19,73.57,74.19,75.99,81.41,82.89,81.4869,...,...
3,,,Boys,62.72,52.96,54.39,55.34,55.04,53.65,53.85,...,62.42,61.68,68.09,68.79,70.88,77.24,78.80,77.6557,...,...
4,,,Girls,69.49,62.24,63.72,65.07,65.01,63.53,63.44,...,73.69,73.05,79.42,79.94,81.42,85.82,87.20,85.5003,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
113,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
114,1/ - Updates were based on submission of DepEd...,,,,,,,,,,...,,,,,,,,,,
115,2/ - Estimation of this sub-indicator only sta...,,,,,,,,,,...,,,,,,,,,,


Just like in the processing of the previous datasets, we first [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the unnecessary rows at the bottom part of the DataFrame. 

In [268]:
data = data.drop (data.index [110:]) 

From the DataFrame above, we can see that the correct column headers are found at `Index 0`. However, upon inspection, we would see that there are two NaN values and the 'Year' value at the third column should actually be 'Sex' based on the values below it. Thus, before setting this row as the column header, we first correct the values of these first three columns using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

In [269]:
data.at[0, '1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)'] = 'Level of Education'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

Now that the first row can correctly act as the column header, we can set it as the column header before dropping the row at `Index 0`. Then we must also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s at `Index 1` as it is unnecessary, before using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [270]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Using the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) function, we can see that the 'Level of Education' column has two values. Since the combined dataset has no column that holds the education level, we split the data by level and add each level to the combined dataset as its own column.

In [271]:
data ['Level of Education'].unique ()

array(['Junior High School', nan, 'Senior High School'], dtype=object)

In [272]:
senior_high_data = data [54:]
junior_high_data = data [:54]
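
The split above relies on the hard-coded row position 54. As a hedged alternative, the level label could be forward-filled and the table grouped by it, assuming each label appears on the first row of its block and is blank afterwards. The toy table below only imitates that layout.

```python
import pandas as pd

# Toy table where the level label appears only on the first row of each block.
toy = pd.DataFrame(
    {
        "Level of Education": ["Junior High School", None, None,
                               "Senior High School", None, None],
        "Geolocation": ["PHILIPPINES", None, None, "PHILIPPINES", None, None],
        "Sex": ["Both Sexes", "Boys", "Girls"] * 2,
    }
)

filled = toy.copy()
# Carry each level label down to the blank rows below it.
filled["Level of Education"] = filled["Level of Education"].ffill()

parts = {
    level: group.reset_index(drop=True)
    for level, group in filled.groupby("Level of Education", sort=False)
}
junior = parts["Junior High School"]
senior = parts["Senior High School"]
print(junior)
print(senior)
```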

Now, we must process these two separately, but the processes done to them would be the same.

First, we only need the general data, without taking *Sex* into consideration. This can be done by keeping only the rows that have **Both Sexes** as the value of the `Sex` column.

In [273]:
junior_high_data = junior_high_data [junior_high_data['Sex'] == 'Both Sexes']
junior_high_data = junior_high_data.reset_index (drop=True)

In [274]:
senior_high_data = senior_high_data [senior_high_data['Sex'] == 'Both Sexes']
senior_high_data = senior_high_data.reset_index (drop=True)

Next, as we have already separated the dataset into two based on the value of the `Level of Education` column, we have no need for this column anymore. The same is true for the `Sex` column now that only the **Both Sexes** rows remain. This means that we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) both columns.

In [275]:
junior_high_data = junior_high_data.drop("Level of Education", axis = 1)
junior_high_data = junior_high_data.drop("Sex", axis = 1)
junior_high_data = junior_high_data.reset_index (drop=True)

In [276]:
senior_high_data = senior_high_data.drop("Level of Education", axis = 1)
senior_high_data = senior_high_data.drop("Sex", axis = 1)
senior_high_data = senior_high_data.reset_index (drop=True)

For consistency, we set the values of the `Geolocation` column to the region names that were defined earlier.

In [277]:
senior_high_data['Geolocation'] = region_names

In [278]:
junior_high_data['Geolocation'] = region_names

As the dataset represents missing values as '..' or '...', we must [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html) these values with `np.nan`.

In [279]:
for c in junior_high_data.columns.difference(['Geolocation']):
    junior_high_data[c] = junior_high_data[c].replace(to_replace=['..', '...'], value=np.nan)

In [280]:
for c in senior_high_data.columns.difference(['Geolocation']):
    senior_high_data[c] = senior_high_data[c].replace(to_replace=['..', '...'], value=np.nan)

Looking at the senior high data, we can see that all of the values from 2000 to 2015 are `NaN`, which is to be expected as Senior High School was only implemented in 2016.

In [281]:
senior_high_data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,,37.38,46.12,51.24,47.76,49.48,,
1,NCR: National Capital Region,,,,,,,,,,...,,,,55.32,62.74,68.63,62.28,56.4435,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,,40.16,49.55,53.64,50.53,52.8763,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,,51.11,60.39,64.06,61.54,65.6379,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,,43.41,51.49,56.21,56.46,61.4433,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,,47.96,55.99,60.19,58.03,60.0165,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,,45.61,53.9,58.33,54.79,54.7999,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,,35.09,43.27,48.14,46.0,50.2024,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,,28.35,39.63,45.8,42.31,43.518,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,,32.54,44.17,49.74,44.22,48.2144,,


Next, we can convert both datasets into their long representations using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

In [282]:
junior_high_data = pd.melt(junior_high_data, id_vars='Geolocation', value_vars=junior_high_data.columns [2:]) 

junior_high_data.rename(columns = {'value':'1.4.1p6 (Junior High School)', 0 : 'Year'}, inplace=True)
junior_high_data = junior_high_data.astype({'Year':'int'})

In [283]:
senior_high_data = pd.melt(senior_high_data, id_vars='Geolocation', value_vars=senior_high_data.columns [2:]) 

senior_high_data.rename(columns = {'value':'1.4.1p6 (Senior High School)', 0 : 'Year'}, inplace=True)
senior_high_data = senior_high_data.astype({'Year':'int'})

Once both datasets have been converted to their long representations, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) them into the combined dataset on the values of the `Geolocation` and `Year` columns with an outer join.

In [284]:
combined_data = combined_data.merge(junior_high_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(senior_high_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [285]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School)
0,PHILIPPINES,2001,,90.1,57.55,
1,NCR: National Capital Region,2001,,97.82,67.84,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,
3,Region 1: Ilocos Region,2001,,91.33,68.21,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,
...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,
392,Region 11: Davao Region,2022,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,
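
As more indicators are added, the repeated `merge` calls could be folded into a single expression. The sketch below uses `functools.reduce` on three one-row tables built from the PHILIPPINES values for 2016 shown above. It is one possible arrangement, not the approach used in the notebook.

```python
from functools import reduce

import pandas as pd

keys = ["Geolocation", "Year"]
frames = [
    pd.DataFrame({"Geolocation": ["PHILIPPINES"], "Year": [2016],
                  "1.4.1p5": [96.15]}),
    pd.DataFrame({"Geolocation": ["PHILIPPINES"], "Year": [2016],
                  "1.4.1p6 (Junior High School)": [74.19]}),
    pd.DataFrame({"Geolocation": ["PHILIPPINES"], "Year": [2016],
                  "1.4.1p6 (Senior High School)": [37.38]}),
]

# Outer-join the tables one after another on the shared key columns.
combined = reduce(lambda left, right: left.merge(right, how="outer", on=keys), frames)
print(combined)
```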


#### 1.5.4. Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies
Then, the fourth dataset can be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [286]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.5.4.csv')
data

Unnamed: 0,1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
3,Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
4,Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
5,Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
6,Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
7,Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
8,MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
9,Region V,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...


As with the previous datasets, we need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the irrelevant rows at the bottom of the DataFrame. These rows come from the footer that sits outside of the table in the csv file.

In [287]:
data = data.drop (data.index [19:])

Likewise, we know that the row at `Index 0` holds the values that should serve as the column header of the table. However, checking the cells in this row shows that the header of the first column should not be `Year` but `Geolocation`, as the values in this column refer to the different regions.

Thus, we can change the value of the first column in this row to `Geolocation`, so that we do not need to rename the column after making the 0th row the column header. Then, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row at `Index 0` as it is now unnecessary. Additionally, we can see that there is a row of **NaN**s at `Index 1`, which becomes the 0th row once we drop the row that became the column header. This row should also be dropped before the index is reset using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [288]:
data.at[0, '1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2)'] = 'Geolocation'

In [289]:
data.columns = data.loc[0]
data = data.drop (data.index[0])

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
0,National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
1,Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
2,Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
3,Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
4,Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
5,Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
6,MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
7,Region V,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...
8,Region VI,..,..,..,..,..,..,..,..,..,...,..,..,..,25.1,..,20.2,..,99.3,100.0,...
9,Region VII,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,87.5,..,94.1,100.0,...


The next step is to rename the values under the `Geolocation` column. As seen in the table above, this dataset has no row for **PHILIPPINES**, so the first entry of `region_names` is skipped when setting the values of this column.

In [290]:
data ['Geolocation'] = region_names [1:]
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
0,NCR: National Capital Region,..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
1,CAR: Cordillera Administrative Region,..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
2,Region 1: Ilocos Region,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
3,Region 2: Cagayan Valley,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
4,Region 3: Central Luzon,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
5,Region 4A: CALABARZON,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
6,MIMAROPA: Southwestern Tagalog Region,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
7,Region 5: Bicol Region,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...
8,Region 6: Western Visayas,..,..,..,..,..,..,..,..,..,...,..,..,..,25.1,..,20.2,..,99.3,100.0,...
9,Region 7: Central Visayas,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,87.5,..,94.1,100.0,...


As with the previous datasets, we have to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the '..' and '...' values, which represent missing data, with **NaN**s in the DataFrame. This avoids errors in later computations on these rows and ensures that missing values are represented properly.

In [291]:
# '..' and '...' mark unavailable data; store them as NaN.
# Assigning the result back avoids relying on an in-place change to a column selection.
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')
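
One follow-up worth considering: after the placeholders are replaced, the remaining values may still be stored as strings. The sketch below, on made-up values, replaces both placeholders in one call and then converts each column to numbers with `pd.to_numeric`. The toy DataFrame is an assumption and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Geolocation': ['Region A', 'Region B', 'Region C'],
    '2016': ['52.9', '..', '44.8'],
    '2022': ['...', '...', '...'],
})

for c in toy.columns.difference(['Geolocation']):
    # Replace both placeholders, then turn the remaining strings into floats
    toy[c] = pd.to_numeric(toy[c].replace(['..', '...'], np.nan))

print(toy.dtypes)
```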

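After all of this, we can transform the dataset from its wide representation into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. Note that `data.columns[2:]` starts at the 2001 column, so the year 2000 is not carried over into the long table.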

In [292]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 
data

Unnamed: 0,Geolocation,0,value
0,NCR: National Capital Region,2001,
1,CAR: Cordillera Administrative Region,2001,
2,Region 1: Ilocos Region,2001,
3,Region 2: Cagayan Valley,2001,
4,Region 3: Central Luzon,2001,
...,...,...,...
369,Region 10: Northern Mindanao,2022,
370,Region 11: Davao Region,2022,
371,Region 12: SOCCSKSARGEN,2022,
372,CARAGA: Cordillera Administrative Region,2022,


Once the data is in its long representation, the column names of the new DataFrame no longer describe the values underneath them. Merging it directly into the combined DataFrame would make it hard for users to tell what these columns are for, which is why they are [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to descriptive names. The `Year` column is also cast to integers, since some year labels were read in as floats (for example, `2016.0`).

In [293]:
data.rename(columns = {'value':'1.5.4', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})
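
As an alternative to renaming afterwards, `melt` accepts `var_name` and `value_name` parameters that name the two new columns directly. A minimal sketch on made-up values follows; the toy DataFrame is an assumption and not the PSA data.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    'Geolocation': ['Region A', 'Region B'],
    '2016': [52.9, 94.0],
    '2018': [76.5, np.nan],
})

# var_name labels the former column headers, value_name labels the cell values
long = wide.melt(id_vars='Geolocation', var_name='Year', value_name='1.5.4')
long['Year'] = long['Year'].astype(int)
print(long)
```

When `value_vars` is omitted, as here, every column that is not listed in `id_vars` is melted.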

After this, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it into the combined DataFrame.

In [294]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
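
Because an outer merge keeps unmatched rows from both sides, a mismatch in region names or years silently produces extra rows instead of an error. The sketch below shows, on made-up frames, how the `validate` and `indicator` parameters of `merge` could be used to check this. The frames and the column names `indicator_x` and `indicator_y` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({
    'Geolocation': ['PHILIPPINES', 'Region A'],
    'Year': [2016, 2016],
    'indicator_x': [1.0, 2.0],
})
right = pd.DataFrame({
    'Geolocation': ['Region A', 'Region B'],
    'Year': [2016, 2016],
    'indicator_y': [10.0, 20.0],
})

# validate raises if a (Geolocation, Year) pair is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from
merged = left.merge(right, how='outer', on=['Geolocation', 'Year'],
                    validate='one_to_one', indicator=True)
print(merged)
```

Rows marked `left_only` or `right_only` in the `_merge` column are the ones that found no partner, which makes name mismatches easy to spot.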

In [295]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4
0,PHILIPPINES,2001,,90.1,57.55,,
1,NCR: National Capital Region,2001,,97.82,67.84,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,
...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,
392,Region 11: Davao Region,2022,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,


#### 3.4.1. Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease
To start with the fifth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [296]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.4.1.csv')
data

Unnamed: 0,"3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,,Year,,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
1,Indicator,Geolocation,,,,,,,,,...,,,,,,,,,,
2,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Both Sexes,..,..,..,..,..,..,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,..,...
3,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,5.6,..,...
4,,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,3.7,..,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Note:,,,,,,,,,,...,,,,,,,,,,
270,.. - Data not available,,,,,,,,,,...,,,,,,,,,,


Based on the DataFrame returned by the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we can see that there are rows of **NaN**s and notes at the bottom of the DataFrame. Upon further inspection, these start at `Index 266`, which is why all rows from this index onward are [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped.

In [297]:
data = data.drop (data.index [266:])
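
The cut-off index does not have to be found by eye. The sketch below, on a made-up frame, locates the first fully empty row with `isna().all(axis=1)` and trims everything from that row onward. The variable names are hypothetical, and the approach assumes the default integer index that `read_csv` produces.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'a': ['x', 'y', np.nan, 'Note:'],
    'b': ['1', '2', np.nan, np.nan],
})

empty_rows = raw.isna().all(axis=1)
# idxmax returns the first label where the mask is True
first_empty = empty_rows.idxmax() if empty_rows.any() else len(raw)
trimmed = raw.iloc[:first_empty]
print(first_empty, len(trimmed))
```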

As most of the column headers are **Unnamed**, we need to set them to their correct values, which are found at `Index 0`. However, the values of the first three columns in this row are not descriptive enough to serve as column headers, so we change them to descriptive names for the rows underneath using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) accessor.

Once the headers are set, the row at `Index 0` is no longer needed, so we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) it. The row after it is also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped, as it only holds leftover labels and **NaN**s.

In [298]:
data.at[0, '3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease'] = 'Indicator'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [299]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As the `Sex` column is not available for all datasets, it was decided that only the totals, that is, the rows with **Both Sexes**, would be considered. Once our data only includes rows with **Both Sexes** as the value of their `Sex` column, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column, as it would only have one unique value.

In [300]:
data = data [data ['Sex'] == 'Both Sexes']
data = data.drop('Sex', axis = 1)
data = data.reset_index(drop=True)

Then, we need to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) all cells that have the value of either '..' or '...' with **NaN** so that later computations treat them as missing.

In [302]:
# '..' and '...' mark unavailable data; store them as NaN.
# Assigning the result back avoids relying on an in-place change to a column selection.
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

Upon studying the different indicators under this specific Sustainable Development Goal (SDG), we would realize that it comprises four sub-indicators: (1) cardiovascular disease, (2) cancer, (3) diabetes, and (4) chronic respiratory disease. However, as we only aim to get the total mortality rate across all of these diseases, we only take the rows under the overall indicator, which run from `Index 0` to `Index 15` (the slice `data[0:16]`).

Then, after separating the subsets, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Indicator` column.

In [304]:
data['Indicator'].unique()

array(['3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease',
       nan,
       '..3.4.1.1 Mortality rate attributed to cardiovascular disease',
       '..3.4.1.2 Mortality rate attributed to cancer',
       '..3.4.1.3 Mortality rate attributed to diabetes',
       '..3.4.1.4 Mortality rate attributed to chronic respiratory disease'],
      dtype=object)

In [305]:
all_data = data [0:16]
cardio_data = data [16:34]
cancer_data = data [34:52]
diabetes_data = data [52:70]
respi_data = data [70:]
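
The slices above depend on each block having a fixed number of rows. As a hedged alternative, the `Indicator` column could be forward-filled and the rows grouped by its value, so that the split follows the labels rather than the positions. The sketch below uses a made-up frame with shortened, hypothetical indicator labels.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Indicator': ['3.4.1 total', np.nan, '..3.4.1.1 cardio', np.nan, np.nan],
    'Geolocation': ['PHILIPPINES', '..Region A',
                    'PHILIPPINES', '..Region A', '..Region B'],
    '2020': [4.6, 4.8, 2.9, 3.0, 2.7],
})

# The label appears only on the first row of each block; copy it downward
toy['Indicator'] = toy['Indicator'].ffill()

# sort=False keeps the blocks in the order in which they appear
subsets = {name: grp.drop(columns='Indicator').reset_index(drop=True)
           for name, grp in toy.groupby('Indicator', sort=False)}
print(list(subsets))
```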

In [306]:
all_data = all_data.drop('Indicator', axis = 1)
all_data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
0,PHILIPPINES,,,,,,,4.2,4.2,4.3,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,,
1,..National Capital Region (NCR),,,,,,,5.1,5.2,5.2,...,5.2,5.3,5.5,5.2,4.9,4.9,5.0,4.8,,
2,..Cordillera Administrative Region (CAR),,,,,,,3.3,3.1,3.3,...,3.4,3.5,3.7,3.6,3.6,3.8,4.1,3.8,,
3,..Region I,,,,,,,4.9,4.8,5.0,...,5.0,5.1,5.1,5.0,4.9,4.9,4.9,4.9,,
4,..Region II,,,,,,,4.0,3.9,4.0,...,4.4,4.4,4.5,4.4,4.3,4.5,4.7,4.3,,
5,..Region III,,,,,,,4.8,5.0,5.0,...,5.2,5.4,5.4,5.3,5.2,5.2,5.3,5.2,,
6,..Region IV-A,,,,,,,4.7,4.7,4.6,...,4.9,5.1,5.0,4.9,4.9,4.9,5.1,5.1,,
7,..MIMAROPA,,,,,,,3.5,3.5,3.5,...,3.8,3.9,3.9,4.2,3.9,4.1,4.3,4.2,,
8,..Region VII,,,,,,,4.3,4.2,4.4,...,4.7,4.8,5.0,4.9,4.7,4.7,4.8,4.9,,
9,..Region VIII,,,,,,,3.5,3.5,3.5,...,3.7,3.8,3.8,3.7,3.7,3.8,3.9,4.0,,


Upon inspection, we would realize that two regions are missing from the table, **Region V** and **Region VI**, which is why we only use the region names that are included in the DataFrame.

In [307]:
# no region five and six
all_data ['Geolocation'] = region_names [0:8] + region_names [10:]
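
Assigning names by position relies on the rows appearing in a fixed order and on knowing in advance which regions are absent. A sketch of a lookup-based alternative follows. The dictionary `name_map` is hypothetical and lists only a few entries, with the target labels copied from the tables above.

```python
import pandas as pd

# Hypothetical lookup from the raw names to the labels used in this notebook
name_map = {
    'PHILIPPINES': 'PHILIPPINES',
    'National Capital Region (NCR)': 'NCR: National Capital Region',
    'Region I': 'Region 1: Ilocos Region',
    'Region VII': 'Region 7: Central Visayas',
}

toy = pd.DataFrame({'Geolocation': ['PHILIPPINES',
                                    '..National Capital Region (NCR)',
                                    '..Region I',
                                    '..Region VII']})

# Strip the leading dots used for indentation, then look each name up
cleaned = toy['Geolocation'].str.lstrip('.')
toy['Geolocation'] = cleaned.map(name_map)
print(toy)
```

With this approach a missing region simply has no row to rename, and a name that is absent from the dictionary shows up as **NaN** instead of being mislabeled.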

After this, we can convert our DataFrame to its long representation with the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. Then, we need to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns so that the headers describe the values underneath them.

In [308]:
all_data = pd.melt(all_data, id_vars='Geolocation', value_vars=all_data.columns [2:]) 

all_data.rename(columns = {'value':'3.4.1 (Total data)', 0 : 'Year'}, inplace=True)
all_data = all_data.astype({'Year':'int'})

After this, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it into the DataFrame that holds the combined datasets.

In [309]:
combined_data = combined_data.merge(all_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [310]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data)
0,PHILIPPINES,2001,,90.1,57.55,,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,
...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,
392,Region 11: Davao Region,2022,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,


#### 3.7.1. Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods

In [59]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.1.csv')
data

Unnamed: 0,3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Year,,2000,2001,2002,2003.0,2004,2005,2006,2007,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
1,Indicator/Sub-indicators,Geolocation,,,,,,,,,...,,,,,,,,,,
2,3.7.1 Proportion of women of reproductive age ...,PHILIPPINES,..,..,..,46.7,..,..,..,..,...,51.8,..,..,..,56.9,..,..,..,..,...
3,,..National Capital Region (NCR),..,..,..,47.2,..,..,..,..,...,53.4,..,..,..,59.5,..,..,..,..,...
4,,..Cordillera Administrative Region (CAR),..,..,..,44.4,..,..,..,..,...,59.8,..,..,..,66.7,..,..,..,..,...
5,,..Region I,..,..,..,49.6,..,..,..,..,...,50.8,..,..,..,59.5,..,..,..,..,...
6,,..Region II,..,..,..,68.8,..,..,..,..,...,69.1,..,..,..,74.1,..,..,..,..,...
7,,..Region III,..,..,..,54.2,..,..,..,..,...,60.4,..,..,..,56.8,..,..,..,..,...
8,,..Region IV-A,..,..,..,46.1,..,..,..,..,...,49.1,..,..,..,49.2,..,..,..,..,...
9,,..MIMAROPA,..,..,..,48.5,..,..,..,..,...,55.1,..,..,..,61.7,..,..,..,..,...


In [60]:
data = data.drop (data.index [20:])

In [61]:
data.at[0, 'Unnamed: 1'] = 'Geolocation'

In [62]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

In [63]:
data = data.drop('Year', axis=1)

In [64]:
data ['Geolocation'] = region_names

In [65]:
# '..' and '...' mark unavailable data; store them as NaN.
# Assigning the result back avoids relying on an in-place change to a column selection.
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

In [66]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'3.7.1', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [67]:
data

Unnamed: 0,Geolocation,Year,3.7.1
0,PHILIPPINES,2001,
1,NCR: National Capital Region,2001,
2,CAR: Cordillera Administrative Region,2001,
3,Region 1: Ilocos Region,2001,
4,Region 2: Cagayan Valley,2001,
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,


In [68]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [69]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1
0,PHILIPPINES,2001,,90.1,57.55,,,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,
...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,


#### 3.7.2. Adolescent birth rate aged 15-19 years per 1,000 women in that age group

In [70]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.2.csv')
data

Unnamed: 0,"3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,53.0,..,..,..,..,54.0,...,57.0,..,..,..,47.0,..,..,..,..,...
3,..National Capital Region (NCR),..,..,..,35.0,..,..,..,..,25.0,...,48.0,..,..,..,27.0,..,..,..,..,...
4,..Cordillera Administrative Region (CAR),..,..,..,52.0,..,..,..,..,34.0,...,53.0,..,..,..,25.0,..,..,..,..,...
5,..Region I,..,..,..,55.0,..,..,..,..,52.0,...,78.0,..,..,..,46.0,..,..,..,..,...
6,..Region II,..,..,..,85.0,..,..,..,..,54.0,...,65.0,..,..,..,51.0,..,..,..,..,...
7,..Region III,..,..,..,42.0,..,..,..,..,69.0,...,63.0,..,..,..,61.0,..,..,..,..,...
8,..Region IV-A,..,..,..,44.0,..,..,..,..,63.0,...,58.0,..,..,..,37.0,..,..,..,..,...
9,..MIMAROPA,..,..,..,108.0,..,..,..,..,87.0,...,68.0,..,..,..,47.0,..,..,..,..,...


In [71]:
data = data.drop (data.index [20:])

In [72]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

In [73]:
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

In [74]:
data ['Geolocation'] = region_names

In [75]:
# '..' and '...' mark unavailable data; store them as NaN.
# Assigning the result back avoids relying on an in-place change to a column selection.
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

In [76]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'3.7.2', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [77]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [78]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1,3.7.2
0,PHILIPPINES,2001,,90.1,57.55,,,,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,,
...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,,


#### 4.1.s1. Completion Rate of elementary and secondary students

In [79]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.1.s1.csv')
data

Unnamed: 0,4.1.s1 Completion Rate of elementary and secondary students 1/ 2/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,Year,,,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018.00,2019.00,2020.000000,2021,2022
1,Geolocation,Level of Education,Sex,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,Elementary,Both Sexes,62.72,68.18,71.55,70.24,69.06,68.11,71.72,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.510000,...,...
3,,,Female,65.53,70.7,76.32,75.63,75.2,73.46,76.7,...,81.33,86.23,87.43,95.52,94.61,99.12,98.08,84.681828,...,...
4,,,Male,60.05,65.78,67.23,65.42,63.63,63.29,67.28,...,74.38,81.45,80.97,90.83,90.41,95.26,95.10,80.500538,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
167,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
168,1/ - Updates were based on the submission of D...,,,,,,,,,,...,,,,,,,,,,
169,2/ - Estimation in Senior High School only sta...,,,,,,,,,,...,,,,,,,,,,


In [80]:
data = data.drop(data.index[164:])

In [81]:
data.at[0, '4.1.s1 Completion Rate of elementary and secondary students 1/ 2/'] = 'Geolocation'
data.at[0, 'Unnamed: 1'] = 'Level of Education'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [82]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

In [83]:
data = data [data['Sex'] == 'Both Sexes']
data = data.drop ('Sex', axis = 1)
data = data.reset_index(drop=True)

In [84]:
# copying the geolocation value to the next two rows
# (each geolocation spans three rows, one per level of education)
for i in range(0, len(data), 3):
    data.at[i + 1, 'Geolocation'] = data.at[i, 'Geolocation']
    data.at[i + 2, 'Geolocation'] = data.at[i, 'Geolocation']
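
Copying a label down into the empty rows beneath it is what the `ffill` method does, so the loop could also be expressed in one line. A minimal sketch on a made-up frame, assuming that the empty `Geolocation` cells are **NaN**:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Geolocation': ['PHILIPPINES', np.nan, np.nan, '..Region A', np.nan, np.nan],
    'Level of Education': ['Elementary',
                           'Secondary (Junior High School)',
                           'Secondary (Senior High School)'] * 2,
})

# Forward-fill: each NaN takes the last non-missing value above it
toy['Geolocation'] = toy['Geolocation'].ffill()
print(toy)
```

Unlike the fixed step of three, this does not depend on every geolocation having exactly three rows.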

In [85]:
# '..' and '...' mark unavailable data; store them as NaN.
# Assigning the result back avoids relying on an in-place change to a column selection.
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

In [86]:
elem_data = data [data['Level of Education'] == 'Elementary']
elem_data = elem_data.reset_index (drop=True)

junior_data = data [data['Level of Education'] == 'Secondary (Junior High School)']
junior_data = junior_data.reset_index (drop=True)

senior_data = data [data['Level of Education'] == 'Secondary (Senior High School)']
senior_data = senior_data.reset_index (drop=True)

In [87]:
elem_data = elem_data.drop ('Level of Education', axis = 1)
elem_data = elem_data.reset_index(drop=True)

In [88]:
elem_data ['Geolocation'] = region_names

In [89]:
elem_data = pd.melt(elem_data, id_vars='Geolocation', value_vars=elem_data.columns [2:]) 

elem_data.rename(columns = {'value':'4.1.s1 (Elementary)', 0 : 'Year'}, inplace=True)
elem_data = elem_data.astype({'Year':'int'})

In [90]:
combined_data = combined_data.merge(elem_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [91]:
junior_data = junior_data.drop ('Level of Education', axis = 1)
junior_data = junior_data.reset_index(drop=True)

In [92]:
junior_data ['Geolocation'] = region_names

In [93]:
junior_data = pd.melt(junior_data, id_vars='Geolocation', value_vars=junior_data.columns [2:]) 

junior_data.rename(columns = {'value':'4.1.s1 (Junior High School)', 0 : 'Year'}, inplace=True)
junior_data = junior_data.astype({'Year':'int'})

In [94]:
combined_data = combined_data.merge(junior_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [95]:
senior_data = senior_data.drop ('Level of Education', axis = 1)
senior_data = senior_data.reset_index(drop=True)

In [96]:
senior_data ['Geolocation'] = region_names

In [97]:
senior_data = pd.melt(senior_data, id_vars='Geolocation', value_vars=senior_data.columns [2:]) 

senior_data.rename(columns = {'value':'4.1.s1 (Senior High School)', 0 : 'Year'}, inplace=True)
senior_data = senior_data.astype({'Year':'int'})

In [98]:
combined_data = combined_data.merge(senior_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
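
The elementary, junior high school, and senior high school blocks above repeat the same steps: drop the level column, melt, rename, and merge. A sketch of how the repetition could be folded into a loop is given below. The helper `to_long`, the toy frame, and its values are assumptions for illustration, and the generated column names differ slightly from the ones used in this notebook.

```python
import pandas as pd

def to_long(df, value_name):
    """Melt a wide year table into Geolocation / Year / value rows."""
    long = df.melt(id_vars='Geolocation', var_name='Year', value_name=value_name)
    long['Year'] = long['Year'].astype(int)
    return long

# Made-up values for a single geolocation and two levels of education
toy = pd.DataFrame({
    'Geolocation': ['Region A', 'Region A'],
    'Level of Education': ['Elementary', 'Secondary (Junior High School)'],
    '2001': [60.0, 70.0],
    '2002': [61.0, 71.0],
})

combined = None
for level, grp in toy.groupby('Level of Education', sort=False):
    long = to_long(grp.drop(columns='Level of Education'), f'4.1.s1 ({level})')
    # The first level starts the table; later levels are merged onto it
    combined = long if combined is None else combined.merge(
        long, how='outer', on=['Geolocation', 'Year'])
print(combined)
```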

In [99]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1,3.7.2,4.1.s1 (Elementary),4.1.s1 (Junior High School),4.1.s1 (Senior High School)
0,PHILIPPINES,2001,,90.1,57.55,,,,,,68.18,69.97,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,,,74.29,68.43,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,,,59.55,61.75,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,,,79.7,75.35,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,,,74.07,69.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,


#### 4.c.s2. Number of Technical-Vocational Education and Training (TVET) trainers trained

In [100]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.c.s2.csv')
data

Unnamed: 0,4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,..,6518.0,11159.0,10118.0,10855.0,4023.0,7746.0,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,610.0,1028.0,1280.0,1409.0,782.0,1985.0,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,201.0,302.0,166.0,260.0,92.0,199.0,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,474.0,455.0,475.0,501.0,375.0,327.0,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,270.0,612.0,447.0,686.0,215.0,240.0,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,280.0,262.0,354.0,839.0,277.0,471.0,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,833.0,1067.0,1440.0,817.0,177.0,647.0,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,139.0,523.0,709.0,413.0,162.0,255.0,...


In [101]:
data = data.drop(data.index[20:])

In [102]:
data.at[0, '4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained'] = 'Geolocation'

In [103]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

In [104]:
data ['Geolocation'] = region_names

In [105]:
# '..' and '...' mark unavailable data; store them as NaN.
# Assigning the result back avoids relying on an in-place change to a column selection.
for c in data.columns.difference(['Geolocation']):
    data[c] = data[c].replace(['..', '...'], np.nan)

# data = data.dropna(axis=1, how = 'all')

In [106]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'4.c.s2', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [107]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [108]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1,3.7.2,4.1.s1 (Elementary),4.1.s1 (Junior High School),4.1.s1 (Senior High School),4.c.s2
0,PHILIPPINES,2001,,90.1,57.55,,,,,,68.18,69.97,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,,,74.29,68.43,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,,,59.55,61.75,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,,,79.7,75.35,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,,,74.07,69.4,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,,


#### 7.1.1. Proportion of population with access to electricity

#### 8.1.1. Annual growth rate of real GDP per capita

#### 10.1.1. Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population

#### 14.5.1. Coverage of protected areas in relation to marine areas

#### 16.1.1 Number of victims of intentional homicide (per 100,000 population)

#### 16.1.s1 Number of murder cases

#### Other Non-SDG datasets that can help us in exploring the former datasets

##### Changes in Inventories, by Region

##### Current Health Expenditure by Region, Growth Rates 

##### Current Health Expenditure by Region

##### Government Final Consumption Expenditure, by Region, Growth Rates

##### Government Final Consumption Expenditure, by Region, Percent Share

##### Population, by Region

##### Primary Drop-out rates by Region, Sex and Year

##### Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods

##### Quarterly Producer Price Index for Agriculture (First Quarter 2018 to Third Quarter 2021)

## Data Cleaning and Wrangling

## Exploratory Data Analysis