# Progress of the Philippines' Sustainable Development Goals

### Import

In [1]:
import os
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore

## Data Collection
The following **csv** files used in this project are acquired through a request sent to the Knowledge Management and Communications Division of the Philippine Statistics Authority.

### Combining the Datasets 
In this stage, the separate datasets underwent pre-processing and cleaning before they are combined together. Some of the cleaning done on each of the datasets are: (1) fixing of column names, (2) modification of the values of the 'Geolocation' column, (3) removal of unneeded rows and columns, and (4) conversion of '..' or '...' values to NaN. After this, the dataset is converted into a long representation before they are merged together.

#### 1.2.1. Proportion of population living below the national poverty line 
To start with, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

The [`os.getenv`](https://docs.python.org/3/library/os.html) function was used to get the environment variable `DSDATA_PROJ`, which points to the data folder of this project.

In [12]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.2.1.csv')
data

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Looking at the DataFrame, we could see that the columns are unnamed and that the column names are located at the 0th row. Using [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), we could get the 0th row and then assign it as the column values. 

Then, using the [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function, we can drop the 0th row as we have no need for it anymore. Additionally, since the row at index 1 is a row full of NaN, we can also drop it using the same function. 

To be able to fix the indexing of the rows, the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used to reset the index from 0.

In [None]:
# setting our column names
data.columns = data.iloc [0] 

# dropping the 'geolocation' row as that is actually used as a header
data = data.drop (data.index [1])

# dropping the column names 
data = data.drop (data.index [0])

data.reset_index (drop=True, inplace=True)

Irrelevant rows that are just footers for the file are also removed.

In [None]:
# dropping irrelevant rows 
data = data.drop (data.index [18:]) 

The `Year` column must also be renamed into `Geolocation` as this row refers to the different regions in the Philippines, and not the years. This can be done through the use of the of the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function.

In [None]:
# renames the column 'Year' as its actually the location column
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

To easily determine which region the `Geolocation` values refer to, we can also change these values to include the names that they are commonly referred to, instead of just their region numbers. 

For consistency throughout the different datasets, the `region_names` variable was declared. The reason why a map was not used was that different datasets have different representations of the region (i.e., differences in naming a region), however, they are always arranged in the same way. This would be shown below in the pre-processing of each of the datasets.

In [2]:
# NOTE: Before applying, make sure that the arrangement of the regions are the same as the arrangement in your table
region_names = ['PHILIPPINES', 'NCR: National Capital Region', 
                 'CAR: Cordillera Administrative Region', 
                 'Region 1: Ilocos Region', 
                 'Region 2: Cagayan Valley', 
                 'Region 3: Central Luzon', 
                 'Region 4A: CALABARZON', 
                'MIMAROPA: Southwestern Tagalog Region', 
                'Region 5: Bicol Region', 
                'Region 6: Western Visayas', 
                'Region 7: Central Visayas', 
                'Region 8: Eastern Visayas', 
                'Region 9: Zamboanga Peninsula', 
                'Region 10: Northern Mindanao', 
                'Region 11: Davao Region', 
                'Region 12: SOCCSKSARGEN', 
                'CARAGA: Cordillera Administrative Region', 
                'BARMM: Bangsamoro Autonomous Region in Muslim Mindanao']

In [None]:
# renames the data in the Geolocation for consistency
data['Geolocation'] = region_names
data.set_index('Geolocation')
data = data.reset_index(drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,23.5,..,..,16.7,..,..,...,..
1,NCR: National Capital Region,..,..,..,..,..,..,..,..,..,...,..,..,4.1,..,..,2.2,..,..,...,..
2,CAR: Cordillera Administrative Region,..,..,..,..,..,..,..,..,..,...,..,..,22.7,..,..,12.0,..,..,...,..
3,Region 1: Ilocos Region,..,..,..,..,..,..,..,..,..,...,..,..,18.8,..,..,9.9,..,..,...,..
4,Region 2: Cagayan Valley,..,..,..,..,..,..,..,..,..,...,..,..,17.8,..,..,16.3,..,..,...,..
5,Region 3: Central Luzon,..,..,..,..,..,..,..,..,..,...,..,..,10.5,..,..,7.0,..,..,...,..
6,Region 4A: CALABARZON,..,..,..,..,..,..,..,..,..,...,..,..,12.5,..,..,7.1,..,..,...,..
7,MIMAROPA: Southwestern Tagalog Region,..,..,..,..,..,..,..,..,..,...,..,..,25.2,..,..,15.1,..,..,...,..
8,Region 5: Bicol Region,..,..,..,..,..,..,..,..,..,...,..,..,39.8,..,..,27.0,..,..,...,..
9,Region 6: Western Visayas,..,..,..,..,..,..,..,..,..,...,..,..,24.6,..,..,16.3,..,..,...,..


Next, we can convert the strings of '..' and '...', which were used to represent that there were no values for these cells, to **NaN**, through the use of the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function.

However, the columns that have all **NaN** values were not dropped because if this dataset would be combined with other datasets, all years would still be present as there are datasets with complete data for all the years. Additionally, dropping the years for some of the dataset would result in the combined dataset having a weird sorting (i.e., a sorting of the region that does not follow the usual sorting of the datasets in the Philippines), even if it was sorted based on the `Year` and `Geolocation` column.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # cells without values are represented as either '..' or '...', so we should convert them to NaN so we could dropna()
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# drops columns if all of the values are NaN
# data = data.dropna(axis=1)

In [None]:
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,23.5,,,16.7,,,,
1,NCR: National Capital Region,,,,,,,,,,...,,,4.1,,,2.2,,,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,22.7,,,12.0,,,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,18.8,,,9.9,,,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,17.8,,,16.3,,,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,10.5,,,7.0,,,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,12.5,,,7.1,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,25.2,,,15.1,,,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,39.8,,,27.0,,,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,24.6,,,16.3,,,,


As the final step, the wide representation of this dataset is converted to a long representation through the use of the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. 

Then, the column that holds the value for a specific year and region is coverted, using [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html), to the ID of this Sustainable Development Goal (SDG), so that it can be distinguished when it is combined with other datasets.

In [None]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

# renaming the columns into a more readable anmes
data.rename(columns = {'value':'1.2.1', 0 : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

Unnamed: 0,Geolocation,Year,1.2.1
0,PHILIPPINES,2001,
1,NCR: National Capital Region,2001,
2,CAR: Cordillera Administrative Region,2001,
3,Region 1: Ilocos Region,2001,
4,Region 2: Cagayan Valley,2001,
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,


As this is the first dataset, we can just assign it to the `combined_data` DataFrame, which would hold the combined datasets.

In [None]:
combined_data = data

#### 1.4.1p5. Net Enrolment Rate in elementary

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the next dataset. 

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.4.1p5.csv')
data

Unnamed: 0,1.4.1p5 Net Enrolment Rate in elementary (Indicator is also found in SDG 4.3.s1) 1/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,,Year,2000,2001,2002.00,2003.00,2004.00,2005.00,2006.00,2007.00,...,2013.00,2014.00,2015.00,2016.00,2017.00,2018.00,2019.00,2020.0000,2021,2022
1,Geolocation,Sex,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.20,97.19,96.90,96.15,94.19,94.05,93.96,89.1064,...,...
3,,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
4,,Girls,97.28,90.91,91.10,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Note:,,,,,,,,,,...,,,,,,,,,,
59,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
60,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
61,1/ - Updates were based on the submission of D...,,,,,,,,,,...,,,,,,,,,,


From the DataFrame above, we can see that the footer of the .csv files was included in the DataFrame. As the rows from the 56th index are irrelevant, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) them. 

In [None]:
data = data.drop (data.index [56:]) 

Additionally, we can see that the columns are unnamed, and upon inspection, the original column names can be found at `Index 0`. Thus, we can set the columns to this row, and then  [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Index 0` row as it would only be redundant and might affect the computations.

The [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used in order to make the index of the rows start from 0.

In [None]:
# setting the column names and removing the row that held the previous column names
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,NaN,Year,2000,2001,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,Geolocation,Sex,,,,,,,,,...,,,,,,,,,,
1,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.2,97.19,96.9,96.15,94.19,94.05,93.96,89.1064,...,...
2,,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
3,,Girls,97.28,90.91,91.1,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
4,..National Capital Region (NCR),Both Sexes,101,97.82,97.38,96.81,94.82,92.61,92.89,94.42,...,99.64,99.01,99.85,95.92,92.83,92.11,89.91,81.1478,...,...
5,,Boys,100.13,96.57,96.52,95.81,93.75,91.65,92.0,93.21,...,98.77,98.13,98.8,95.3,92.2,91.85,89.43,80.6316,...,...
6,,Girls,101.92,99.13,98.28,97.87,95.95,93.63,93.83,95.69,...,100.57,99.95,100.95,96.58,93.5,92.38,90.42,81.6903,...,...
7,..Cordillera Administrative Region (CAR),Both Sexes,94.42,92.89,91.52,89.19,86.4,82.58,80.86,81.5,...,99.66,100.16,99.19,97.24,94.37,92.24,91.4,87.5276,...,...
8,,Boys,94.26,91.96,90.53,88.36,85.52,81.75,80.19,81.01,...,99.85,100.27,99.42,97.94,95.13,93.45,92.25,88.5518,...,...
9,,Girls,94.58,93.88,92.57,90.07,87.31,83.46,81.57,82.01,...,99.47,100.05,98.95,96.51,93.59,90.99,90.51,86.4657,...,...


However, these is still a row of NaN found at `Index 0`, and we can see that the column names for the first two columns are not correct for the values underneath it, as the ones under the first column are actually Geolocations and those under the second columns are the values for Sex. Thus, we can [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) it, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)  the row at `Index 0`.

In [None]:
data = data.rename(columns = {np.nan:'Geolocation', 'Year': 'Sex'})
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As we would only need the data that is grouped by region and not by sex, we would only be getting the rows that has **Both Sexes** as the value in the Sex column. After this, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the Sex column as it would not be used onwards.

In [None]:
# Only getting the total data, then dropping Sex column as it's not needed anymore
data = data[data['Sex'] == 'Both Sexes']
data = data.drop("Sex", axis = 1)
data = data.reset_index (drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,2008.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,85.11,...,97.2,97.19,96.9,96.15,94.19,94.05,93.96,89.1064,...,...
1,..National Capital Region (NCR),101,97.82,97.38,96.81,94.82,92.61,92.89,94.42,93.69,...,99.64,99.01,99.85,95.92,92.83,92.11,89.91,81.1478,...,...
2,..Cordillera Administrative Region (CAR),94.42,92.89,91.52,89.19,86.4,82.58,80.86,81.5,81.93,...,99.66,100.16,99.19,97.24,94.37,92.24,91.4,87.5276,...,...
3,..Region I,97.73,91.33,89.64,88.52,86.98,84.87,82.74,83.14,82.85,...,97.39,97.84,96.78,94.84,92.5,90.48,89.99,86.2185,...,...
4,..Region II,95.65,89.45,86.71,85.65,82.9,79.92,77.7,77.53,76.23,...,100.08,101.15,102.42,100.26,98.45,96.86,97.17,93.6348,...,...
5,..Region III,98.32,86.35,93.58,93.61,92.03,90.77,89.14,91.37,90.93,...,99.03,99.56,98.84,98.53,97.91,98.77,100.03,95.4067,...,...
6,..Region IV-A 2/,98.5,93.44,95.97,95.33,95.1,92.87,92.36,94.02,94.1,...,96.09,97.09,96.36,97.2,96.31,97.36,98.23,91.9912,...,...
7,..MIMAROPA 2/,..,..,91.52,89.42,88.0,84.39,83.84,84.07,85.42,...,95.77,95.58,96.56,94.98,92.33,90.99,90.26,86.2074,...,...
8,..Region V,95.56,91.77,90.95,89.3,87.78,85.43,83.8,85.41,85.07,...,98.53,98.25,96.41,95.77,93.56,93.12,92.68,87.2573,...,...
9,..Region VI,96.16,89.6,85.95,83.25,80.49,77.14,74.96,75.44,74.93,...,97.71,98.47,98.89,99.09,97.16,97.38,97.25,93.9281,...,...


To be able to merge this to the combined DataFrame, the value of the Geolocation column has been set to the same values.

In [None]:
data['Geolocation'] = region_names

Since the dataset represents missing values as either '...' or '..', we can [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the columns with these values with `np.nan`.

In [None]:
for c in data.columns.difference(['Geolocation']):
    # cells without values are represented as either '..' or '...', so we should convert them to NaN so we could dropna()
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')

Then, we can transform the wide representation of the DataFrame to its long representation version using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. 

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'1.4.1p5', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

In [None]:
data

Unnamed: 0,Geolocation,Year,1.4.1p5
0,PHILIPPINES,2001,90.1
1,NCR: National Capital Region,2001,97.82
2,CAR: Cordillera Administrative Region,2001,92.89
3,Region 1: Ilocos Region,2001,91.33
4,Region 2: Cagayan Valley,2001,89.45
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,


Then we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) this long representation DataFrame into the combined DataFrame. It would be merged with respect to the values in the **Geolocation** and **Year** column. An outer join is used as we want to retain all the values of both of the DataFrames, even if there would be **NaN** values for some of cells.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5
0,PHILIPPINES,2001,,90.1
1,NCR: National Capital Region,2001,,97.82
2,CAR: Cordillera Administrative Region,2001,,92.89
3,Region 1: Ilocos Region,2001,,91.33
4,Region 2: Cagayan Valley,2001,,89.45
...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,
392,Region 11: Davao Region,2022,,
393,Region 12: SOCCSKSARGEN,2022,,
394,CARAGA: Cordillera Administrative Region,2022,,


#### 1.4.1p6. Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)

Next, we can load the third dataset.

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.4.1p6.csv')
data

Unnamed: 0,1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,,,Year,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016.00,2017.00,2018.00,2019.00,2020.0000,2021,2022
1,Level of Education,Geolocation,Sex,,,,,,,,...,,,,,,,,,,
2,Junior High School,PHILIPPINES,Both Sexes,66.06,57.55,59,60.15,59.97,58.54,58.59,...,67.89,67.19,73.57,74.19,75.99,81.41,82.89,81.4869,...,...
3,,,Boys,62.72,52.96,54.39,55.34,55.04,53.65,53.85,...,62.42,61.68,68.09,68.79,70.88,77.24,78.80,77.6557,...,...
4,,,Girls,69.49,62.24,63.72,65.07,65.01,63.53,63.44,...,73.69,73.05,79.42,79.94,81.42,85.82,87.20,85.5003,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
113,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
114,1/ - Updates were based on submission of DepEd...,,,,,,,,,,...,,,,,,,,,,
115,2/ - Estimation of this sub-indicator only sta...,,,,,,,,,,...,,,,,,,,,,


Just like in the processing of the previous datasets, we first [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the unnecessary rows at the bottom part of the DataFrame. 

In [None]:
data = data.drop (data.index [110:]) 

From the DataFrame above, we can see that the correct column headers are found at `Index 0`. However, upon inspection, we would see that there are two NaN values and the 'Year' value at the third column should actually be 'Sex' based on the values below it. Thus, before setting this row as the column header, we first correct the values of these first three columns using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

In [None]:
data.at[0, '1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)'] = 'Level of Education'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

Now that first row can correctly act as the column header, we can set is as the column header, before dropping the row at `Index 0`. Then we must also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s at `Index 1` as it is unnecessary, before using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Using the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) function, we can see that there are two values for 'Level of Education' columns. To be able to combine this to the combined dataset, we must separate them as we cannot add another column that would hold the education level, thus, we can just add it as two different columns.

In [None]:
data ['Level of Education'].unique ()

array(['Junior High School', nan, 'Senior High School'], dtype=object)

In [None]:
senior_high_data = data [54:]
junior_high_data = data [:54]

Now, we must process these two separately, but the processes done to them would be the same.

First, as we only need the general data, without taking *Sex* into consideration. This can be done by only getting the rows that has **Both Sexes** as the value of the `Sex` column.

In [None]:
junior_high_data = junior_high_data [junior_high_data['Sex'] == 'Both Sexes']
junior_high_data = junior_high_data.reset_index (drop=True)

In [None]:
senior_high_data = senior_high_data [senior_high_data['Sex'] == 'Both Sexes']
senior_high_data = senior_high_data.reset_index (drop=True)

Next, as we have already separated the dataset into two based on the value of the `Level of Education` column, we have no need for this column anymore. This means that we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column.  

In [None]:
junior_high_data = junior_high_data.drop("Level of Education", axis = 1)
junior_high_data = junior_high_data.drop("Sex", axis = 1)
junior_high_data = junior_high_data.reset_index (drop=True)

In [None]:
senior_high_data = senior_high_data.drop("Level of Education", axis = 1)
senior_high_data = senior_high_data.drop("Sex", axis = 1)
senior_high_data = senior_high_data.reset_index (drop=True)

For consistency, we set the values of the `Geolocation` column to the format of the region names that we have decided before.

In [None]:
senior_high_data['Geolocation'] = region_names

In [None]:
junior_high_data['Geolocation'] = region_names

As the dataset represents missing values as '..' or '...', we must [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html) these values with `np.nan`.

In [None]:
for c in junior_high_data.columns.difference(['Geolocation']):
    junior_high_data [c].replace(to_replace='..', value= np.nan, inplace= True)
    junior_high_data [c].replace(to_replace='...', value= np.nan, inplace= True)

In [None]:
for c in senior_high_data.columns.difference(['Geolocation']):
    senior_high_data [c].replace(to_replace='..', value= np.nan, inplace= True)
    senior_high_data [c].replace(to_replace='...', value= np.nan, inplace= True)

Looking at the senior high data, we can see that all of the values are `NaN` from 2000 to 2016, which is to be expected as Senior High School was only implemented from 2016.

In [None]:
senior_high_data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,,37.38,46.12,51.24,47.76,49.48,,
1,NCR: National Capital Region,,,,,,,,,,...,,,,55.32,62.74,68.63,62.28,56.4435,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,,40.16,49.55,53.64,50.53,52.8763,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,,51.11,60.39,64.06,61.54,65.6379,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,,43.41,51.49,56.21,56.46,61.4433,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,,47.96,55.99,60.19,58.03,60.0165,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,,45.61,53.9,58.33,54.79,54.7999,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,,35.09,43.27,48.14,46.0,50.2024,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,,28.35,39.63,45.8,42.31,43.518,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,,32.54,44.17,49.74,44.22,48.2144,,


Next, we can convert both of the datasets into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

In [None]:
junior_high_data = pd.melt(junior_high_data, id_vars='Geolocation', value_vars=junior_high_data.columns [2:]) 

junior_high_data.rename(columns = {'value':'1.4.1p6 (Junior High School)', 0 : 'Year'}, inplace=True)
junior_high_data = junior_high_data.astype({'Year':'int'})

In [None]:
senior_high_data = pd.melt(senior_high_data, id_vars='Geolocation', value_vars=senior_high_data.columns [2:]) 

senior_high_data.rename(columns = {'value':'1.4.1p6 (Senior High School)', 0 : 'Year'}, inplace=True)
senior_high_data = senior_high_data.astype({'Year':'int'})

Once that both datasets has been converted to their long representation, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) the two datasets to the combined dataset based on the values of the `Geolocation` and the `Year` column with an outer join.

In [None]:
combined_data = combined_data.merge(junior_high_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(senior_high_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School)
0,PHILIPPINES,2001,,90.1,57.55,
1,NCR: National Capital Region,2001,,97.82,67.84,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,
3,Region 1: Ilocos Region,2001,,91.33,68.21,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,
...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,
392,Region 11: Davao Region,2022,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,


#### 1.5.4. Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies
Then, the fourth dataset could be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/1.5.4.csv')
data

Unnamed: 0,1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
3,Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
4,Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
5,Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
6,Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
7,Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
8,MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
9,Region V,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...


Same as the previous datasets, we would need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the irrelevant rows at the bottom of the DataFrame. These are the rows that were a footer outside of the table in the csv files.

In [None]:
data = data.drop (data.index [19:])

Likewise, we know that the row at `Index 0` has the values that is the supposed column header for the table. However, checking each of the cells in this row would make us realize that the column header for the first column should not be `Year`, but rather `Geolocation` as the values in these columns refer to the different regions. 

Thus, we can change the value of the first column in this row to `Geolocation`, so that we would not need to rename the column if we directly made the 0th row into the column header. Then, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row at `Index 0` as it is now unnecessary. Additionally, we can see that there is a row of **NaN**s at `Index 1`, which would become the 0th row once we drop the row that became the column headers. This should be dropped also, before the index is resetted using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [None]:
data.at[0, '1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2)'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
0,National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
1,Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
2,Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
3,Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
4,Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
5,Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
6,MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
7,Region V,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...
8,Region VI,..,..,..,..,..,..,..,..,..,...,..,..,..,25.1,..,20.2,..,99.3,100.0,...
9,Region VII,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,87.5,..,94.1,100.0,...


The next step would be renaming the values under the `Geolocation`, although, as seen in the resulting table, we would notice that there is no row for **PHILIPPINES**. This is reflected in the way that we set the values of this column.

In [None]:
data ['Geolocation'] = region_names [1:]
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
0,NCR: National Capital Region,..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
1,CAR: Cordillera Administrative Region,..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
2,Region 1: Ilocos Region,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
3,Region 2: Cagayan Valley,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
4,Region 3: Central Luzon,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
5,Region 4A: CALABARZON,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
6,MIMAROPA: Southwestern Tagalog Region,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
7,Region 5: Bicol Region,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...
8,Region 6: Western Visayas,..,..,..,..,..,..,..,..,..,...,..,..,..,25.1,..,20.2,..,99.3,100.0,...
9,Region 7: Central Visayas,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,87.5,..,94.1,100.0,...


As with the previous datasets, we would have to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the '..' and '...' values, which represents **null**, in the DataFrame with **NaN**s. This is to avoid any errors that would happen in these rows, and so that it would be represented properly.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')

After all of this, we can now transform this dataset that is in its wide represetation into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 
data

Unnamed: 0,Geolocation,0,value
0,NCR: National Capital Region,2001,
1,CAR: Cordillera Administrative Region,2001,
2,Region 1: Ilocos Region,2001,
3,Region 2: Cagayan Valley,2001,
4,Region 3: Central Luzon,2001,
...,...,...,...
369,Region 10: Northern Mindanao,2022,
370,Region 11: Davao Region,2022,
371,Region 12: SOCCSKSARGEN,2022,
372,CARAGA: Cordillera Administrative Region,2022,


Once we were able to convert it to its long representation, we would see that the column names in this new DataFrame are not descriptive with respect to the values underneath the column. Directly merging this with the combined DataFrame would make it hard for its users to distinguish what these columns are for, which is why it was [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to its correct column names.

In [None]:
data.rename(columns = {'value':'1.5.4', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

After this, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the combined dataframe.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4
0,PHILIPPINES,2001,,90.1,57.55,,
1,NCR: National Capital Region,2001,,97.82,67.84,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,
...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,
392,Region 11: Davao Region,2022,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,


#### 3.4.1. Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease
To start with the fifth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.4.1.csv')
data

Unnamed: 0,"3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,,Year,,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
1,Indicator,Geolocation,,,,,,,,,...,,,,,,,,,,
2,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Both Sexes,..,..,..,..,..,..,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,..,...
3,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,5.6,..,...
4,,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,3.7,..,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Note:,,,,,,,,,,...,,,,,,,,,,
270,.. - Data not available,,,,,,,,,,...,,,,,,,,,,


Based on the DataFrame that we got using the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we can see that there are rows of **NaN**s at the lower part of the DataFrame. Upon further inspection, it started from `Index 266`, which is why the rows from this index was [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped.

In [None]:
data = data.drop (data.index [266:])

As the column headers are all **Unnamed**, we need to set the column headers to its correct value, which is found at `Index 0`. Although, the values for the first three columns in this row are not descriptive to be column headers, which is why we are changing their values to the correct descriptive name for the rows underneath them using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

As we have no use for the row at `Index 0`, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this row. With this, we would also be [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping the next row as it is just a row of **NaN**s.

In [None]:
data.at[0, '3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease'] = 'Indicator'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As the `Sex` column is not available for all datasets, it was decided that only the total—or those rows with **Both Sexes**—would be considered. Once we our data only includes rows with **Both Sexes** as the value of their `Sex` column, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column as this column would only have one unique value.

In [None]:
data = data [data ['Sex'] == 'Both Sexes']
data = data.drop('Sex', axis = 1)
data = data.reset_index(drop=True)

Then, we need to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) all cells that has the value of either '..' or '...' with **NaN** for better computation in the future. 

In [None]:
for c in data.columns.difference(['Geolocation']):
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')

Upon studying the different indicators under this specific Sustainable Development Goal (SDG), we would realize that it is comprised of different subsets: (1) cardiovascular diseases, (2) cancer, (3) diabetes, and (4) chronic respiratory disease. However, as we only aim to get the total mortality rate with respect to all of these diseases, we would only get the rows under this indicator which is from `Index 0` to `Index 16`.

Then, after dividing the different subsets, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Indicator` column. 

In [None]:
data['Indicator'].unique()

array(['3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease',
       nan,
       '..3.4.1.1 Mortality rate attributed to cardiovascular disease',
       '..3.4.1.2 Mortality rate attributed to cancer',
       '..3.4.1.3 Mortality rate attributed to diabetes',
       '..3.4.1.4 Mortality rate attributed to chronic respiratory disease'],
      dtype=object)

In [None]:
all_data = data [0:16]
cardio_data = data [16:34]
cancer_data = data [34:52]
diabetes_data = data [52:70]
respi_data = data [70:]

In [None]:
all_data = all_data.drop('Indicator', axis = 1)
all_data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
0,PHILIPPINES,,,,,,,4.2,4.2,4.3,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,,
1,..National Capital Region (NCR),,,,,,,5.1,5.2,5.2,...,5.2,5.3,5.5,5.2,4.9,4.9,5.0,4.8,,
2,..Cordillera Administrative Region (CAR),,,,,,,3.3,3.1,3.3,...,3.4,3.5,3.7,3.6,3.6,3.8,4.1,3.8,,
3,..Region I,,,,,,,4.9,4.8,5.0,...,5.0,5.1,5.1,5.0,4.9,4.9,4.9,4.9,,
4,..Region II,,,,,,,4.0,3.9,4.0,...,4.4,4.4,4.5,4.4,4.3,4.5,4.7,4.3,,
5,..Region III,,,,,,,4.8,5.0,5.0,...,5.2,5.4,5.4,5.3,5.2,5.2,5.3,5.2,,
6,..Region IV-A,,,,,,,4.7,4.7,4.6,...,4.9,5.1,5.0,4.9,4.9,4.9,5.1,5.1,,
7,..MIMAROPA,,,,,,,3.5,3.5,3.5,...,3.8,3.9,3.9,4.2,3.9,4.1,4.3,4.2,,
8,..Region VII,,,,,,,4.3,4.2,4.4,...,4.7,4.8,5.0,4.9,4.7,4.7,4.8,4.9,,
9,..Region VIII,,,,,,,3.5,3.5,3.5,...,3.7,3.8,3.8,3.7,3.7,3.8,3.9,4.0,,


Upon inspection, we would realize that there are two regions that are missing from the table, which are **Region V** and **Region VI**, which is why we would only be using the region names that are included in the DataFrame. 

In [None]:
# no region five and six
all_data ['Geolocation'] = region_names [0:8] + region_names [10:]

After this, with the use of the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, we can now convert our DataFrame to its long representation. Then, we must set the column headers to describe the values in this column, which is why we would need to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns. 

In [None]:
all_data = pd.melt(all_data, id_vars='Geolocation', value_vars=all_data.columns [2:]) 

all_data.rename(columns = {'value':'3.4.1 (Total data)', 0 : 'Year'}, inplace=True)
all_data = all_data.astype({'Year':'int'})

After this, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the DataFrame which holds the combined datasets.

In [None]:
combined_data = combined_data.merge(all_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data)
0,PHILIPPINES,2001,,90.1,57.55,,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,
...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,
392,Region 11: Davao Region,2022,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,


#### 3.7.1. Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the sixth dataset. 

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.1.csv')
data

Unnamed: 0,3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Year,,2000,2001,2002,2003.0,2004,2005,2006,2007,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
1,Indicator/Sub-indicators,Geolocation,,,,,,,,,...,,,,,,,,,,
2,3.7.1 Proportion of women of reproductive age ...,PHILIPPINES,..,..,..,46.7,..,..,..,..,...,51.8,..,..,..,56.9,..,..,..,..,...
3,,..National Capital Region (NCR),..,..,..,47.2,..,..,..,..,...,53.4,..,..,..,59.5,..,..,..,..,...
4,,..Cordillera Administrative Region (CAR),..,..,..,44.4,..,..,..,..,...,59.8,..,..,..,66.7,..,..,..,..,...
5,,..Region I,..,..,..,49.6,..,..,..,..,...,50.8,..,..,..,59.5,..,..,..,..,...
6,,..Region II,..,..,..,68.8,..,..,..,..,...,69.1,..,..,..,74.1,..,..,..,..,...
7,,..Region III,..,..,..,54.2,..,..,..,..,...,60.4,..,..,..,56.8,..,..,..,..,...
8,,..Region IV-A,..,..,..,46.1,..,..,..,..,...,49.1,..,..,..,49.2,..,..,..,..,...
9,,..MIMAROPA,..,..,..,48.5,..,..,..,..,...,55.1,..,..,..,61.7,..,..,..,..,...


Irrelevant rows that are just footers for the file are also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped. From the DataFrame above, we can see that these are the rows from `Index 20`.

In [None]:
data = data.drop (data.index [20:])

Additionally, we can see that the current column names are **Unnamed**. Thus, we have to set the column names to its correct values so that we can determine what the values in the columns are.

Understanding the data, we can see that the row at `Index 0` holds the value for the column headers. However, there is a **NaN** value, which should be **Geolocation** based on the data underneath it. This is why the value of this cell was changed to **Geolocation** using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

This is done before the column names was set to the row at `Index 0`, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping this row and the row of NaNs at the next row.

In [None]:
data.at[0, 'Unnamed: 1'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,Year,Geolocation,2000,2001,2002,2003.0,2004,2005,2006,2007,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
0,3.7.1 Proportion of women of reproductive age ...,PHILIPPINES,..,..,..,46.7,..,..,..,..,...,51.8,..,..,..,56.9,..,..,..,..,...
1,,..National Capital Region (NCR),..,..,..,47.2,..,..,..,..,...,53.4,..,..,..,59.5,..,..,..,..,...
2,,..Cordillera Administrative Region (CAR),..,..,..,44.4,..,..,..,..,...,59.8,..,..,..,66.7,..,..,..,..,...
3,,..Region I,..,..,..,49.6,..,..,..,..,...,50.8,..,..,..,59.5,..,..,..,..,...
4,,..Region II,..,..,..,68.8,..,..,..,..,...,69.1,..,..,..,74.1,..,..,..,..,...
5,,..Region III,..,..,..,54.2,..,..,..,..,...,60.4,..,..,..,56.8,..,..,..,..,...
6,,..Region IV-A,..,..,..,46.1,..,..,..,..,...,49.1,..,..,..,49.2,..,..,..,..,...
7,,..MIMAROPA,..,..,..,48.5,..,..,..,..,...,55.1,..,..,..,61.7,..,..,..,..,...
8,,..Region V,..,..,..,30.6,..,..,..,..,...,29.3,..,..,..,44.4,..,..,..,..,...
9,,..Region VI,..,..,..,42.3,..,..,..,..,...,45.5,..,..,..,56.8,..,..,..,..,...


Added to this, we can see that there is a column of **NaN**s, which we do not need, so we can also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this.

In [None]:
data = data.drop('Year', axis=1)

Just like what we have done in the previous datasets, we would rename the **Geolocation** column based on the common names of the region for easier understanding of the dataset.

In [None]:
data ['Geolocation'] = region_names

As the missing data or null values in the dataset are represented by '..' or '...', which are strings that might affect the computations that might be done in this numerical columns, we would be using the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function to replace these string values to **np.nan**.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')
data

Unnamed: 0,Geolocation,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
0,PHILIPPINES,,,,46.7,,,,,46.8,...,51.8,,,,56.9,,,,,
1,NCR: National Capital Region,,,,47.2,,,,,43.2,...,53.4,,,,59.5,,,,,
2,CAR: Cordillera Administrative Region,,,,44.4,,,,,55.0,...,59.8,,,,66.7,,,,,
3,Region 1: Ilocos Region,,,,49.6,,,,,49.7,...,50.8,,,,59.5,,,,,
4,Region 2: Cagayan Valley,,,,68.8,,,,,62.6,...,69.1,,,,74.1,,,,,
5,Region 3: Central Luzon,,,,54.2,,,,,54.0,...,60.4,,,,56.8,,,,,
6,Region 4A: CALABARZON,,,,46.1,,,,,46.1,...,49.1,,,,49.2,,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,48.5,,,,,48.5,...,55.1,,,,61.7,,,,,
8,Region 5: Bicol Region,,,,30.6,,,,,33.8,...,29.3,,,,44.4,,,,,
9,Region 6: Western Visayas,,,,42.3,,,,,44.4,...,45.5,,,,56.8,,,,,


As the dataset now looks like the wide representation that we wanted, we would be transforming it to its long representation, using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, so that we could merge it to the combined dataset.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 
data

Unnamed: 0,Geolocation,0,value
0,PHILIPPINES,2001,
1,NCR: National Capital Region,2001,
2,CAR: Cordillera Administrative Region,2001,
3,Region 1: Ilocos Region,2001,
4,Region 2: Cagayan Valley,2001,
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,


Although, before merging it to the combined dataset, we would need to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns `0` and `value`, as they are not descriptive enough. If we directly merged it to the combined dataset, we might not be able to determine what the values in these columns mean. 

In [None]:
data.rename(columns = {'value':'3.7.1', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})
data

Unnamed: 0,Geolocation,Year,3.7.1
0,PHILIPPINES,2001,
1,NCR: National Capital Region,2001,
2,CAR: Cordillera Administrative Region,2001,
3,Region 1: Ilocos Region,2001,
4,Region 2: Cagayan Valley,2001,
...,...,...,...
391,Region 10: Northern Mindanao,2022,
392,Region 11: Davao Region,2022,
393,Region 12: SOCCSKSARGEN,2022,
394,CARAGA: Cordillera Administrative Region,2022,


Once the column names have been fixed, we could use the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function to use outer join to merge the two datasets.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1
0,PHILIPPINES,2001,,90.1,57.55,,,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,
...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,


#### 3.7.2. Adolescent birth rate aged 15-19 years per 1,000 women in that age group
Then, the seventh dataset could be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.2.csv')
data

Unnamed: 0,"3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,53.0,..,..,..,..,54.0,...,57.0,..,..,..,47.0,..,..,..,..,...
3,..National Capital Region (NCR),..,..,..,35.0,..,..,..,..,25.0,...,48.0,..,..,..,27.0,..,..,..,..,...
4,..Cordillera Administrative Region (CAR),..,..,..,52.0,..,..,..,..,34.0,...,53.0,..,..,..,25.0,..,..,..,..,...
5,..Region I,..,..,..,55.0,..,..,..,..,52.0,...,78.0,..,..,..,46.0,..,..,..,..,...
6,..Region II,..,..,..,85.0,..,..,..,..,54.0,...,65.0,..,..,..,51.0,..,..,..,..,...
7,..Region III,..,..,..,42.0,..,..,..,..,69.0,...,63.0,..,..,..,61.0,..,..,..,..,...
8,..Region IV-A,..,..,..,44.0,..,..,..,..,63.0,...,58.0,..,..,..,37.0,..,..,..,..,...
9,..MIMAROPA,..,..,..,108.0,..,..,..,..,87.0,...,68.0,..,..,..,47.0,..,..,..,..,...


As seen in the previous datasets, there are three types of columns that are processed and [`drop`]ped first: (1) the irrelevant rows that were footers in the .csv file, (2) the row that would be turned into the column headers, and (3) the row of **NaN**s.

In [None]:
data = data.drop (data.index [20:])

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,Year,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
0,PHILIPPINES,..,..,..,53.0,..,..,..,..,54.0,...,57.0,..,..,..,47.0,..,..,..,..,...
1,..National Capital Region (NCR),..,..,..,35.0,..,..,..,..,25.0,...,48.0,..,..,..,27.0,..,..,..,..,...
2,..Cordillera Administrative Region (CAR),..,..,..,52.0,..,..,..,..,34.0,...,53.0,..,..,..,25.0,..,..,..,..,...
3,..Region I,..,..,..,55.0,..,..,..,..,52.0,...,78.0,..,..,..,46.0,..,..,..,..,...
4,..Region II,..,..,..,85.0,..,..,..,..,54.0,...,65.0,..,..,..,51.0,..,..,..,..,...
5,..Region III,..,..,..,42.0,..,..,..,..,69.0,...,63.0,..,..,..,61.0,..,..,..,..,...
6,..Region IV-A,..,..,..,44.0,..,..,..,..,63.0,...,58.0,..,..,..,37.0,..,..,..,..,...
7,..MIMAROPA,..,..,..,108.0,..,..,..,..,87.0,...,68.0,..,..,..,47.0,..,..,..,..,...
8,..Region V,..,..,..,60.0,..,..,..,..,63.0,...,59.0,..,..,..,36.0,..,..,..,..,...
9,..Region VI,..,..,..,57.0,..,..,..,..,41.0,...,58.0,..,..,..,38.0,..,..,..,..,...


Although, we can see that there is a column name that does not correctly represent the data of this column: the `Year` column does not indicate years, but rather the regions. This is why it was [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to `Geolocation`. 

In [None]:
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

Once we have cleaned the column headers, the values for the `Geolocation` column would be fixed to include their common names. It is important to note that it was made sure that each of the row completely match the arrangement in the `region_name` variable.

In [None]:
data ['Geolocation'] = region_names

As we now have fixed the number of rows and the column names, we would now replace the string representation of null or missing vlaues. This is done with the use of [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function, which would convert the '..' and '...' values into **np.nan**.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')

Then, we can now convert our DataFrame into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. As in the processing of the previous datasets, we would have to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the column names as they are not descriptive enough.

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'3.7.2', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

As we are now sure that the missing or null values are correctly represented, the values of the `Geolocation` are now more easily understandable, and the column headers are descriptive enough, we can now merge this dataset into the combined datasets using the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function.

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1,3.7.2
0,PHILIPPINES,2001,,90.1,57.55,,,,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,,
...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,,


#### 4.1.s1. Completion Rate of elementary and secondary students
To start with the eighth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.1.s1.csv')
data

Unnamed: 0,4.1.s1 Completion Rate of elementary and secondary students 1/ 2/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,Year,,,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018.00,2019.00,2020.000000,2021,2022
1,Geolocation,Level of Education,Sex,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,Elementary,Both Sexes,62.72,68.18,71.55,70.24,69.06,68.11,71.72,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.510000,...,...
3,,,Female,65.53,70.7,76.32,75.63,75.2,73.46,76.7,...,81.33,86.23,87.43,95.52,94.61,99.12,98.08,84.681828,...,...
4,,,Male,60.05,65.78,67.23,65.42,63.63,63.29,67.28,...,74.38,81.45,80.97,90.83,90.41,95.26,95.10,80.500538,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
167,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
168,1/ - Updates were based on the submission of D...,,,,,,,,,,...,,,,,,,,,,
169,2/ - Estimation in Senior High School only sta...,,,,,,,,,,...,,,,,,,,,,


From the view of the DataFrame above, we can see that there are unnecessary rows captured by the  [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. To be able to correctly represent the data, we would need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) these rows.

In [None]:
data = data.drop(data.index[164:])

Another problem that we have based on the DataFrame shown above is the lack of column names, as shown in the **Unnamed** values in the header. Studying the DataFrame, we would find the supposed column headers in the row of `Index 0`, though we face the problem of having **NaN** values at the first three columns of this row. This is why the values in these cells are changed using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function, before converting this row to be the column header.

After we have been able to turn this into the column header, we would need to drop this row and the row beneath it as they are unnecessary rows.

In [None]:
data.at[0, '4.1.s1 Completion Rate of elementary and secondary students 1/ 2/'] = 'Geolocation'
data.at[0, 'Unnamed: 1'] = 'Level of Education'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Just like in datasets that has the `Sex` column, we would only be getting rows with the value for this column as **Both Sexes**. Afterwards, as we already have no need for this column anymore, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) it. 

In [None]:
data = data [data['Sex'] == 'Both Sexes']
data = data.drop ('Sex', axis = 1)
data = data.reset_index(drop=True)
data

Unnamed: 0,Geolocation,Level of Education,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,Elementary,62.72,68.18,71.55,70.24,69.06,68.11,71.72,73.06,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.51,...,...
1,,Secondary (Junior High School),70.07,69.97,74.81,71.67,72.38,61.66,72.14,75.37,...,76.25,77.77,74.03,80.91,84.32,88.84,85.75,82.111684,...,...
2,,Secondary (Senior High School),..,..,..,..,..,..,..,..,...,..,..,..,..,..,81.01,76.71,69.317762,...,...
3,..National Capital Region (NCR),Elementary,63.87,74.29,84.35,83.81,82.1,82.5,88.48,85.35,...,78.72,74.71,82.29,85.97,94.65,99.04,94.97,69.36,...,...
4,,Secondary (Junior High School),68.16,68.43,75.51,73.36,77.33,65.87,71.62,78.71,...,76.33,77.27,74.23,80.29,90.62,92.8,87.31,73.645274,...,...
5,,Secondary (Senior High School),..,..,..,..,..,..,..,..,...,..,..,..,..,..,82.64,76.28,56.255397,...,...
6,..Cordillera Administrative Region (CAR),Elementary,61.95,59.55,77.61,73.9,69.46,67.53,74.99,71.7,...,82.61,86.43,86.22,93.51,91.64,97.52,95.81,94.56,...,...
7,,Secondary (Junior High School),70.31,61.75,59.41,73.61,72.54,63.2,83.69,75.67,...,76.34,76.91,69.97,79.78,81.23,87.01,81.69,87.860475,...,...
8,,Secondary (Senior High School),..,..,..,..,..,..,..,..,...,..,..,..,..,..,81.07,76.25,81.151454,...,...
9,..Region I,Elementary,78.73,79.7,86.74,84.46,85.49,85.48,81.64,82.71,...,91.14,94.54,91.5,97.41,95.45,98.1,98.97,97.32,...,...


As we can see from the resulting dataset, there are still **NaN** values in the `Geolocation` column, which we do not want as this would be used in merging the datasets together. However, if we study it, we would realize that the reason for this is that one value for `Geolocation` actually spans to the next two rows as there are different values for the `Level of Education` column. Although, we cannot just separate the dataset per unique value of the `Level of Education` column, as the `Geolocation` would be NaN for all  **Secondary (Junior High School)** and **Secondary (Senior High School)**. 

Due to this, we copy the value of the `Geolocation` column of a row to the next two rows after it. 

In [None]:
# copying the geolocation value to the next two rows
i = 0
while i < len (data):
    if i % 3 == 0:
        data.at[i + 1, 'Geolocation'] = data['Geolocation'][i]
        data.at[i + 2, 'Geolocation'] = data['Geolocation'][i]
        i = i + 3

Before we divide the dataset based on the value of `Level of Education`, we must first replace cells with the strings '..' or '...' with **np.nan**. This is so that we would not need to process this representation of missing or null values separately (i.e., per division). Then, we can now separate them so that we can properly label it before merging it to the combined dataset.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')

In [None]:
elem_data = data [data['Level of Education'] == 'Elementary']
elem_data = elem_data.reset_index (drop=True)

junior_data = data [data['Level of Education'] == 'Secondary (Junior High School)']
junior_data = junior_data.reset_index (drop=True)

senior_data = data [data['Level of Education'] == 'Secondary (Senior High School)']
senior_data = senior_data.reset_index (drop=True)

Once we have successfully divided the dataset based on the value of the `Level of Education` column, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column as each of the division would technically only have one value for this column.

In [None]:
elem_data = elem_data.drop ('Level of Education', axis = 1)
elem_data = elem_data.reset_index(drop=True)

In [None]:
junior_data = junior_data.drop ('Level of Education', axis = 1)
junior_data = junior_data.reset_index(drop=True)

In [None]:
senior_data = senior_data.drop ('Level of Education', axis = 1)
senior_data = senior_data.reset_index(drop=True)

After making sure that the arrangement of the region matches the arrangement of the values of the `region_names` variable, we can change the values of the `Geolocation` column for each of the division. 

In [None]:
elem_data ['Geolocation'] = region_names

In [None]:
junior_data ['Geolocation'] = region_names

In [None]:
senior_data ['Geolocation'] = region_names

Then, we can now convert the DataFrames into their long representation, before using the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function to make the column names more descriptive of the data in the columns.

In [None]:
elem_data = pd.melt(elem_data, id_vars='Geolocation', value_vars=elem_data.columns [2:]) 

elem_data.rename(columns = {'value':'4.1.s1 (Elementary)', 0 : 'Year'}, inplace=True)
elem_data = elem_data.astype({'Year':'int'})

In [None]:
junior_data = pd.melt(junior_data, id_vars='Geolocation', value_vars=junior_data.columns [2:]) 

junior_data.rename(columns = {'value':'4.1.s1 (Junior High School)', 0 : 'Year'}, inplace=True)
junior_data = junior_data.astype({'Year':'int'})

In [None]:
senior_data = pd.melt(senior_data, id_vars='Geolocation', value_vars=senior_data.columns [2:]) 

senior_data.rename(columns = {'value':'4.1.s1 (Senior High School)', 0 : 'Year'}, inplace=True)
senior_data = senior_data.astype({'Year':'int'})

As we have now made sure that each of division would be understandable even if combined with the combined dataset, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) each of them into the combined dataset

In [None]:
combined_data = combined_data.merge(elem_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data = combined_data.merge(junior_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data = combined_data.merge(senior_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1,3.7.2,4.1.s1 (Elementary),4.1.s1 (Junior High School),4.1.s1 (Senior High School)
0,PHILIPPINES,2001,,90.1,57.55,,,,,,68.18,69.97,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,,,74.29,68.43,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,,,59.55,61.75,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,,,79.7,75.35,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,,,74.07,69.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,


#### 4.c.s2. Number of Technical-Vocational Education and Training (TVET) trainers trained
Next, we can load the ninth dataset.

In [None]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.c.s2.csv')
data

Unnamed: 0,4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,..,6518.0,11159.0,10118.0,10855.0,4023.0,7746.0,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,610.0,1028.0,1280.0,1409.0,782.0,1985.0,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,201.0,302.0,166.0,260.0,92.0,199.0,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,474.0,455.0,475.0,501.0,375.0,327.0,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,270.0,612.0,447.0,686.0,215.0,240.0,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,280.0,262.0,354.0,839.0,277.0,471.0,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,833.0,1067.0,1440.0,817.0,177.0,647.0,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,139.0,523.0,709.0,413.0,162.0,255.0,...


As usual, we would first be [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping the irrelevant rows. 

In [None]:
data = data.drop(data.index[20:])

Then, as we know that the correct column headers are found at `Index 0`, we have to fix the values of this row to fully represent the data in the columns. This is why the **Year** value was changed into **Geolocation** because the values in this column are the rows of the country.

After this, we can now make the value of this row as the value of the column headers, before [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping this row as it would not be used anymore. In line with this, we can also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s underneath this row.

In [None]:
data.at[0, '4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained'] = 'Geolocation'

In [None]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Then, we need to change the values of the `Geolocation` column to match the prescribed format for the region names.

In [None]:
data ['Geolocation'] = region_names

After this, we need to clean the dataset by turning the string representation of missing or null values, which are '..' and '...', into **np.nan**. This would allow us to correctly use mathematical functions into these columns without errors arising due to strings.

In [None]:
for c in data.columns.difference(['Geolocation']):
    data [c].replace(to_replace='..', value= np.nan, inplace= True)
    data [c].replace(to_replace='...', value= np.nan, inplace= True)

# data = data.dropna(axis=1, how = 'all')

Once we have done this, we can convert the DataFrame into its long representation, which would allow us to merge it with the combined dataset. Converting a DataFrame that is in its wide representation into its long representation is made possible by the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

However, using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function would result into a three-column DataFrame which has the following column names: (1) `Geolocation`, (2) `0`, and (3) `value`. The last two columns are not properly descriptive of the values of the column, which is why these two columns are [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d. 

In [None]:
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [2:]) 

data.rename(columns = {'value':'4.c.s2', 0 : 'Year'}, inplace=True)
data = data.astype({'Year':'int'})

As we now have a DataFrame that is in its long representation, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the combined DataFrame, with respect to the values of the `Geolocation` and `Year` columns. This means that a row from this DataFrame would be [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)d into the combined dataset on the row that has the same `Geolocation` and `Year`. 

In [None]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [None]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1,1.4.1p5,1.4.1p6 (Junior High School),1.4.1p6 (Senior High School),1.5.4,3.4.1 (Total data),3.7.1,3.7.2,4.1.s1 (Elementary),4.1.s1 (Junior High School),4.1.s1 (Senior High School),4.c.s2
0,PHILIPPINES,2001,,90.1,57.55,,,,,,68.18,69.97,,
1,NCR: National Capital Region,2001,,97.82,67.84,,,,,,74.29,68.43,,
2,CAR: Cordillera Administrative Region,2001,,92.89,59.84,,,,,,59.55,61.75,,
3,Region 1: Ilocos Region,2001,,91.33,68.21,,,,,,79.7,75.35,,
4,Region 2: Cagayan Valley,2001,,89.45,59.67,,,,,,74.07,69.4,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2022,,,,,,,,,,,,
392,Region 11: Davao Region,2022,,,,,,,,,,,,
393,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,,
394,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,,


#### 7.1.1. Proportion of population with access to electricity

#### 8.1.1. Annual growth rate of real GDP per capita

#### 10.1.1. Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population

#### 14.5.1. Coverage of protected areas in relation to marine areas

#### 16.1.1 Number of victims of intentional homicide (per 100,000 population)

#### 16.1.s1 Number of murder cases

#### Other Non-SDG datasets that can help us in exploring the former datasets

##### Changes in Inventories, by Region

##### Current Health Expenditure by Region, Growth Rates 

##### Current Health Expenditure by Region

In [59]:
data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/Current Health Expenditure by Region.csv',header = 1,sep = ';')
data

Unnamed: 0,Region,2014,2015,2016,2017,2018r,2019r,2020
0,Total Current Health Expenditure,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1,..NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,..CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,..Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,..Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,..Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,..CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,..MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,..Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,..Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6


In [60]:
#drop total current health expenditure
data = data.drop (data.index[0])
data

Unnamed: 0,Region,2014,2015,2016,2017,2018r,2019r,2020
1,..NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,..CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,..Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,..Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,..Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,..CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,..MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,..Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,..Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6
10,..Central Visayas,5.9,5.6,5.6,5.7,5.9,5.9,5.7


In [61]:
#remove '..' and 'r'
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data.columns = data.columns.str.replace('[r]', '',regex = True)
data

Unnamed: 0,Region,2014,2015,2016,2017,2018,2019,2020
1,NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6
10,Central Visayas,5.9,5.6,5.6,5.7,5.9,5.9,5.7


In [62]:
#make nationwide index 0
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,2014,2015,2016,2017,2018,2019,2020
0,Nationwide,16.8,19.3,19.7,21.2,20.0,18.6,23.4
1,NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6


In [63]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2014,2015,2016,2017,2018,2019,2020
0,PHILIPPINES,16.8,19.3,19.7,21.2,20.0,18.6,23.4
1,NCR: National Capital Region,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,CAR: Cordillera Administrative Region,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,Region 1: Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,Region 2: Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,Region 3: Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,Region 4A: CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,MIMAROPA: Southwestern Tagalog Region,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,Region 5: Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,Region 6: Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6


In [64]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns into a more readable anmes
data.rename(columns = {'value':'Health Expenditure', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

Unnamed: 0,Geolocation,Year,Health Expenditure
0,PHILIPPINES,2014,16.8
1,NCR: National Capital Region,2014,24.7
2,CAR: Cordillera Administrative Region,2014,2.1
3,Region 1: Ilocos Region,2014,4.0
4,Region 2: Cagayan Valley,2014,2.3
...,...,...,...
121,Region 10: Northern Mindanao,2020,3.4
122,Region 11: Davao Region,2020,3.9
123,Region 12: SOCCSKSARGEN,2020,3.6
124,CARAGA: Cordillera Administrative Region,2020,1.7


##### Government Final Consumption Expenditure, by Region, Growth Rates

In [83]:
data = pd.read_csv("D:\datapre-project\data\Government Final Consumption Expenditure, by Region, Growth Rates.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000-2001,At Current Prices 2001-2002,At Current Prices 2002-2003,At Current Prices 2003-2004,At Current Prices 2004-2005,At Current Prices 2005-2006,At Current Prices 2006-2007,At Current Prices 2007-2008,At Current Prices 2008-2009,...,At Constant 2018 Prices 2010-2011,At Constant 2018 Prices 2011-2012,At Constant 2018 Prices 2012-2013,At Constant 2018 Prices 2013-2014,At Constant 2018 Prices 2014-2015,At Constant 2018 Prices 2015-2016,At Constant 2018 Prices 2016-2017,At Constant 2018 Prices 2017-2018,At Constant 2018 Prices 2018-2019,At Constant 2018 Prices 2019-2020
0,..National Capital Region (NCR),8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,-1.9,14.3,6.0,2.1,5.5,8.3,5.5,13.3,9.8,11.6
1,..Cordillera Administrative Region (CAR),2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,2.7,17.3,0.5,3.6,6.3,6.7,6.7,13.2,6.1,7.2
2,..Region I (Ilocos Region),1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,2.1,16.2,2.3,10.7,13.3,12.9,6.9,13.2,6.5,7.6
3,..Region II (Cagayan Valley),4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,4.1,5.7,-1.1,5.3,7.8,7.7,6.6,14.3,7.1,8.8
4,..Region III (Central Luzon),2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,1.8,17.1,8.1,3.4,8.8,9.4,7.3,13.9,8.8,9.1
5,..Region IV-A (CALABARZON),8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,1.2,20.5,2.2,7.9,9.4,10.8,7.5,14.3,9.0,10.1
6,..MIMAROPA Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,-0.7,18.4,-0.3,2.7,4.9,8.9,7.2,13.5,3.4,9.6
7,..Region V (Bicol Region),0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,2.2,20.5,1.7,5.9,8.7,5.5,6.5,9.1,12.9,9.0
8,..Region VI (Western Visayas),7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,2.0,16.2,-0.5,0.7,12.4,12.0,6.8,19.3,7.5,8.1
9,..Region VII (Central Visayas),6.8,11.0,1.8,13.7,0.6,17.7,6.3,7.5,13.2,...,5.0,22.8,12.7,12.1,14.7,13.0,7.8,15.7,8.2,10.6


In [84]:
#remove '..' and arrange row
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000-2001,At Current Prices 2001-2002,At Current Prices 2002-2003,At Current Prices 2003-2004,At Current Prices 2004-2005,At Current Prices 2005-2006,At Current Prices 2006-2007,At Current Prices 2007-2008,At Current Prices 2008-2009,...,At Constant 2018 Prices 2010-2011,At Constant 2018 Prices 2011-2012,At Constant 2018 Prices 2012-2013,At Constant 2018 Prices 2013-2014,At Constant 2018 Prices 2014-2015,At Constant 2018 Prices 2015-2016,At Constant 2018 Prices 2016-2017,At Constant 2018 Prices 2017-2018,At Constant 2018 Prices 2018-2019,At Constant 2018 Prices 2019-2020
0,Philippines,5.7,3.4,4.9,3.9,7.8,13.2,11.9,6.8,16.2,...,1.9,15.5,4.9,3.6,7.9,9.4,6.5,13.4,9.1,10.5
1,National Capital Region (NCR),8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,-1.9,14.3,6.0,2.1,5.5,8.3,5.5,13.3,9.8,11.6
2,Cordillera Administrative Region (CAR),2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,2.7,17.3,0.5,3.6,6.3,6.7,6.7,13.2,6.1,7.2
3,Region I (Ilocos Region),1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,2.1,16.2,2.3,10.7,13.3,12.9,6.9,13.2,6.5,7.6
4,Region II (Cagayan Valley),4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,4.1,5.7,-1.1,5.3,7.8,7.7,6.6,14.3,7.1,8.8
5,Region III (Central Luzon),2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,1.8,17.1,8.1,3.4,8.8,9.4,7.3,13.9,8.8,9.1
6,Region IV-A (CALABARZON),8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,1.2,20.5,2.2,7.9,9.4,10.8,7.5,14.3,9.0,10.1
7,MIMAROPA Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,-0.7,18.4,-0.3,2.7,4.9,8.9,7.2,13.5,3.4,9.6
8,Region V (Bicol Region),0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,2.2,20.5,1.7,5.9,8.7,5.5,6.5,9.1,12.9,9.0
9,Region VI (Western Visayas),7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,2.0,16.2,-0.5,0.7,12.4,12.0,6.8,19.3,7.5,8.1


In [85]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,At Current Prices 2000-2001,At Current Prices 2001-2002,At Current Prices 2002-2003,At Current Prices 2003-2004,At Current Prices 2004-2005,At Current Prices 2005-2006,At Current Prices 2006-2007,At Current Prices 2007-2008,At Current Prices 2008-2009,...,At Constant 2018 Prices 2010-2011,At Constant 2018 Prices 2011-2012,At Constant 2018 Prices 2012-2013,At Constant 2018 Prices 2013-2014,At Constant 2018 Prices 2014-2015,At Constant 2018 Prices 2015-2016,At Constant 2018 Prices 2016-2017,At Constant 2018 Prices 2017-2018,At Constant 2018 Prices 2018-2019,At Constant 2018 Prices 2019-2020
0,PHILIPPINES,5.7,3.4,4.9,3.9,7.8,13.2,11.9,6.8,16.2,...,1.9,15.5,4.9,3.6,7.9,9.4,6.5,13.4,9.1,10.5
1,NCR: National Capital Region,8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,-1.9,14.3,6.0,2.1,5.5,8.3,5.5,13.3,9.8,11.6
2,CAR: Cordillera Administrative Region,2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,2.7,17.3,0.5,3.6,6.3,6.7,6.7,13.2,6.1,7.2
3,Region 1: Ilocos Region,1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,2.1,16.2,2.3,10.7,13.3,12.9,6.9,13.2,6.5,7.6
4,Region 2: Cagayan Valley,4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,4.1,5.7,-1.1,5.3,7.8,7.7,6.6,14.3,7.1,8.8
5,Region 3: Central Luzon,2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,1.8,17.1,8.1,3.4,8.8,9.4,7.3,13.9,8.8,9.1
6,Region 4A: CALABARZON,8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,1.2,20.5,2.2,7.9,9.4,10.8,7.5,14.3,9.0,10.1
7,MIMAROPA: Southwestern Tagalog Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,-0.7,18.4,-0.3,2.7,4.9,8.9,7.2,13.5,3.4,9.6
8,Region 5: Bicol Region,0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,2.2,20.5,1.7,5.9,8.7,5.5,6.5,9.1,12.9,9.0
9,Region 6: Western Visayas,7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,2.0,16.2,-0.5,0.7,12.4,12.0,6.8,19.3,7.5,8.1


In [86]:
data.drop(data.iloc[:, 21:41], inplace = True, axis = 1)
  
data

Unnamed: 0,Geolocation,At Current Prices 2000-2001,At Current Prices 2001-2002,At Current Prices 2002-2003,At Current Prices 2003-2004,At Current Prices 2004-2005,At Current Prices 2005-2006,At Current Prices 2006-2007,At Current Prices 2007-2008,At Current Prices 2008-2009,...,At Current Prices 2010-2011,At Current Prices 2011-2012,At Current Prices 2012-2013,At Current Prices 2013-2014,At Current Prices 2014-2015,At Current Prices 2015-2016,At Current Prices 2016-2017,At Current Prices 2017-2018,At Current Prices 2018-2019,At Current Prices 2019-2020
0,PHILIPPINES,5.7,3.4,4.9,3.9,7.8,13.2,11.9,6.8,16.2,...,7.9,21.1,9.2,7.0,9.1,12.0,10.0,17.3,10.6,12.6
1,NCR: National Capital Region,8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,3.8,19.9,10.3,5.4,6.7,10.9,9.0,17.3,11.3,13.7
2,CAR: Cordillera Administrative Region,2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,8.7,23.0,4.5,6.9,7.5,9.3,10.2,17.1,7.6,9.2
3,Region 1: Ilocos Region,1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,8.1,21.9,6.4,14.3,14.5,15.6,10.5,17.1,8.0,9.6
4,Region 2: Cagayan Valley,4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,10.2,10.9,2.9,8.7,9.0,10.3,10.1,18.3,8.6,10.9
5,Region 3: Central Luzon,2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,7.7,22.8,12.5,6.8,10.0,12.0,10.9,17.9,10.3,11.2
6,Region 4A: CALABARZON,8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,7.1,26.4,6.3,11.5,10.5,13.4,11.0,18.3,10.5,12.2
7,MIMAROPA: Southwestern Tagalog Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,5.1,24.2,3.7,6.0,6.0,11.4,10.8,17.4,4.9,11.6
8,Region 5: Bicol Region,0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,8.2,26.4,5.8,9.4,9.9,8.0,10.1,12.9,14.4,11.0
9,Region 6: Western Visayas,7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,8.0,21.9,3.6,4.0,13.6,14.6,10.4,23.4,9.0,10.1


##### Government Final Consumption Expenditure, by Region, Percent Share

In [12]:
data = pd.read_csv("D:\datapre-project\data\Government Final Consumption Expenditure, by Region, Percent Share.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,..National Capital Region (NCR),45.9,47.0,44.8,44.7,45.1,44.3,42.7,42.9,43.9,...,42.1,41.7,42.1,41.5,40.6,40.2,39.8,39.8,40.0,40.4
1,..Cordillera Administrative Region (CAR),2.7,2.7,2.7,2.6,2.5,2.6,2.4,2.3,2.2,...,2.1,2.1,2.0,2.0,2.0,1.9,1.9,1.9,1.9,1.8
2,..Region I (Ilocos Region),4.0,3.8,3.9,4.0,3.9,4.0,4.6,4.2,4.0,...,3.7,3.7,3.6,3.9,4.0,4.2,4.2,4.2,4.1,4.0
3,..Region II (Cagayan Valley),2.4,2.4,2.6,2.6,2.7,2.6,2.7,2.7,2.6,...,2.5,2.3,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.1
4,..Region III (Central Luzon),4.3,4.1,4.8,4.9,5.7,5.5,5.8,6.3,6.4,...,6.7,6.8,7.0,7.0,7.0,7.0,7.1,7.1,7.1,7.0
5,..Region IV-A (CALABARZON),4.4,4.5,5.0,5.0,5.1,5.6,5.7,7.0,6.6,...,6.9,7.2,7.0,7.3,7.4,7.5,7.5,7.6,7.6,7.6
6,..MIMAROPA Region,2.0,2.0,2.1,2.1,1.8,2.1,2.2,2.4,2.4,...,2.3,2.4,2.3,2.2,2.2,2.2,2.2,2.2,2.1,2.1
7,..Region V (Bicol Region),3.6,3.5,3.5,3.5,3.2,3.5,3.8,4.0,3.8,...,4.0,4.2,4.1,4.1,4.2,4.0,4.0,3.9,4.0,3.9
8,..Region VI (Western Visayas),5.0,5.1,5.1,5.0,4.8,4.8,5.1,4.7,4.4,...,4.6,4.7,4.4,4.3,4.5,4.6,4.6,4.8,4.8,4.7
9,..Region VII (Central Visayas),2.7,2.8,3.0,2.9,3.2,3.0,3.1,2.9,2.9,...,3.0,3.2,3.5,3.8,4.0,4.1,4.2,4.3,4.2,4.2


In [13]:
#remove '..' and arrange row
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,Philippines,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1,National Capital Region (NCR),45.9,47.0,44.8,44.7,45.1,44.3,42.7,42.9,43.9,...,42.1,41.7,42.1,41.5,40.6,40.2,39.8,39.8,40.0,40.4
2,Cordillera Administrative Region (CAR),2.7,2.7,2.7,2.6,2.5,2.6,2.4,2.3,2.2,...,2.1,2.1,2.0,2.0,2.0,1.9,1.9,1.9,1.9,1.8
3,Region I (Ilocos Region),4.0,3.8,3.9,4.0,3.9,4.0,4.6,4.2,4.0,...,3.7,3.7,3.6,3.9,4.0,4.2,4.2,4.2,4.1,4.0
4,Region II (Cagayan Valley),2.4,2.4,2.6,2.6,2.7,2.6,2.7,2.7,2.6,...,2.5,2.3,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.1
5,Region III (Central Luzon),4.3,4.1,4.8,4.9,5.7,5.5,5.8,6.3,6.4,...,6.7,6.8,7.0,7.0,7.0,7.0,7.1,7.1,7.1,7.0
6,Region IV-A (CALABARZON),4.4,4.5,5.0,5.0,5.1,5.6,5.7,7.0,6.6,...,6.9,7.2,7.0,7.3,7.4,7.5,7.5,7.6,7.6,7.6
7,MIMAROPA Region,2.0,2.0,2.1,2.1,1.8,2.1,2.2,2.4,2.4,...,2.3,2.4,2.3,2.2,2.2,2.2,2.2,2.2,2.1,2.1
8,Region V (Bicol Region),3.6,3.5,3.5,3.5,3.2,3.5,3.8,4.0,3.8,...,4.0,4.2,4.1,4.1,4.2,4.0,4.0,3.9,4.0,3.9
9,Region VI (Western Visayas),5.0,5.1,5.1,5.0,4.8,4.8,5.1,4.7,4.4,...,4.6,4.7,4.4,4.3,4.5,4.6,4.6,4.8,4.8,4.7


In [14]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,PHILIPPINES,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1,NCR: National Capital Region,45.9,47.0,44.8,44.7,45.1,44.3,42.7,42.9,43.9,...,42.1,41.7,42.1,41.5,40.6,40.2,39.8,39.8,40.0,40.4
2,CAR: Cordillera Administrative Region,2.7,2.7,2.7,2.6,2.5,2.6,2.4,2.3,2.2,...,2.1,2.1,2.0,2.0,2.0,1.9,1.9,1.9,1.9,1.8
3,Region 1: Ilocos Region,4.0,3.8,3.9,4.0,3.9,4.0,4.6,4.2,4.0,...,3.7,3.7,3.6,3.9,4.0,4.2,4.2,4.2,4.1,4.0
4,Region 2: Cagayan Valley,2.4,2.4,2.6,2.6,2.7,2.6,2.7,2.7,2.6,...,2.5,2.3,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.1
5,Region 3: Central Luzon,4.3,4.1,4.8,4.9,5.7,5.5,5.8,6.3,6.4,...,6.7,6.8,7.0,7.0,7.0,7.0,7.1,7.1,7.1,7.0
6,Region 4A: CALABARZON,4.4,4.5,5.0,5.0,5.1,5.6,5.7,7.0,6.6,...,6.9,7.2,7.0,7.3,7.4,7.5,7.5,7.6,7.6,7.6
7,MIMAROPA: Southwestern Tagalog Region,2.0,2.0,2.1,2.1,1.8,2.1,2.2,2.4,2.4,...,2.3,2.4,2.3,2.2,2.2,2.2,2.2,2.2,2.1,2.1
8,Region 5: Bicol Region,3.6,3.5,3.5,3.5,3.2,3.5,3.8,4.0,3.8,...,4.0,4.2,4.1,4.1,4.2,4.0,4.0,3.9,4.0,3.9
9,Region 6: Western Visayas,5.0,5.1,5.1,5.0,4.8,4.8,5.1,4.7,4.4,...,4.6,4.7,4.4,4.3,4.5,4.6,4.6,4.8,4.8,4.7


##### Gross Capital Formation, by Region

In [15]:
data = pd.read_csv("D:\datapre-project\data\Gross Capital Formation, by Region.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,..National Capital Region (NCR),203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,795953279,814136171,899258808,943457629,1055777100,1334869204,1364654442,1640251453,1629320588,1102162331
1,..Cordillera Administrative Region (CAR),13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,43410767,29012559,33620165,35098035,30308209,24558165,33059389,32266092,26362454,23041977
2,..Region I (Ilocos Region),24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,82663294,85433538,104875107,120705416,128404394,160871351,181860608,193759545,220744366,171771884
3,..Region II (Cagayan Valley),32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,42835117,52381575,67970385,83988890,92410049,102800111,120125423,134311405,147930411,100972112
4,..Region III (Central Luzon),8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,254138225,309828802,353024974,371092410,406615615,473831044,585347897,590017254,610620576,342676989
5,..Region IV-A (CALABARZON),18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,175971083,160028504,288993216,323836715,451566485,630840253,767416654,712401187,770805671,377997089
6,..MIMAROPA Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,34062851,28545616,32125182,47669461,50527960,49834380,49231467,78995250,86362820,55876531
7,..Region V (Bicol Region),23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,51953786,65112644,89509800,108718089,125633690,142075341,160202016,181534196,198313314,140428210
8,..Region VI (Western Visayas),46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,90978389,97801511,84268564,86501669,91657960,90785992,111771723,155804331,171809008,123668823
9,..Region VII (Central Visayas),65471807,74574164,72065505,70933637,76422804,64122060,52741677,51229324,66644077,...,122005475,157231413,174373898,183162167,195165138,211462656,230970904,257710115,299306342,184535636


In [16]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,Philippines,579938180,762429457,890086990,921328434,1103698971,1098633998,1049071426,1160979516,1526893379,...,2169604419,2287334968,2709182905,2933448471,3326756400,4018374755,4456327663,4959105466,5132349167,3366706948
1,National Capital Region (NCR),203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,795953279,814136171,899258808,943457629,1055777100,1334869204,1364654442,1640251453,1629320588,1102162331
2,Cordillera Administrative Region (CAR),13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,43410767,29012559,33620165,35098035,30308209,24558165,33059389,32266092,26362454,23041977
3,Region I (Ilocos Region),24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,82663294,85433538,104875107,120705416,128404394,160871351,181860608,193759545,220744366,171771884
4,Region II (Cagayan Valley),32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,42835117,52381575,67970385,83988890,92410049,102800111,120125423,134311405,147930411,100972112
5,Region III (Central Luzon),8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,254138225,309828802,353024974,371092410,406615615,473831044,585347897,590017254,610620576,342676989
6,Region IV-A (CALABARZON),18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,175971083,160028504,288993216,323836715,451566485,630840253,767416654,712401187,770805671,377997089
7,MIMAROPA Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,34062851,28545616,32125182,47669461,50527960,49834380,49231467,78995250,86362820,55876531
8,Region V (Bicol Region),23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,51953786,65112644,89509800,108718089,125633690,142075341,160202016,181534196,198313314,140428210
9,Region VI (Western Visayas),46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,90978389,97801511,84268564,86501669,91657960,90785992,111771723,155804331,171809008,123668823


In [17]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,PHILIPPINES,579938180,762429457,890086990,921328434,1103698971,1098633998,1049071426,1160979516,1526893379,...,2169604419,2287334968,2709182905,2933448471,3326756400,4018374755,4456327663,4959105466,5132349167,3366706948
1,NCR: National Capital Region,203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,795953279,814136171,899258808,943457629,1055777100,1334869204,1364654442,1640251453,1629320588,1102162331
2,CAR: Cordillera Administrative Region,13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,43410767,29012559,33620165,35098035,30308209,24558165,33059389,32266092,26362454,23041977
3,Region 1: Ilocos Region,24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,82663294,85433538,104875107,120705416,128404394,160871351,181860608,193759545,220744366,171771884
4,Region 2: Cagayan Valley,32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,42835117,52381575,67970385,83988890,92410049,102800111,120125423,134311405,147930411,100972112
5,Region 3: Central Luzon,8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,254138225,309828802,353024974,371092410,406615615,473831044,585347897,590017254,610620576,342676989
6,Region 4A: CALABARZON,18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,175971083,160028504,288993216,323836715,451566485,630840253,767416654,712401187,770805671,377997089
7,MIMAROPA: Southwestern Tagalog Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,34062851,28545616,32125182,47669461,50527960,49834380,49231467,78995250,86362820,55876531
8,Region 5: Bicol Region,23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,51953786,65112644,89509800,108718089,125633690,142075341,160202016,181534196,198313314,140428210
9,Region 6: Western Visayas,46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,90978389,97801511,84268564,86501669,91657960,90785992,111771723,155804331,171809008,123668823


##### Gross Regional Domestic Product, by Region

In [18]:
data = pd.read_csv("D:\datapre-project\data\Gross Regional Domestic Product, by Region.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,..National Capital Region (NCR),1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3833040760,4072189800,4339858704,4576914911,4865073608,5216091453,5507681038,5814440130,6224134457,5596389427
1,..Cordillera Administrative Region (CAR),90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,221420475,221950621,236871362,248982342,259526063,266691916,291775205,308267122,321722276,289898072
2,..Region I (Ilocos Region),128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,378621382,398616600,432692177,461217885,486062080,525389666,554680762,587597251,630362667,581894468
3,..Region II (Cagayan Valley),85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,247587301,264876624,288512422,312814849,325728274,341540501,368250759,385061271,411513567,370865964
4,..Region III (Central Luzon),368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1233431610,1336963348,1417025690,1526621058,1619134911,1747565400,1929193858,2062393875,2183779631,1880093241
5,..Region IV-A (CALABARZON),601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1676719329,1809708699,1942476533,2054936917,2196313576,2346818115,2527658901,2706994745,2831599919,2535284422
6,..MIMAROPA Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,241561347,254595082,265363628,292033215,305952533,321170118,341505717,370744808,386783632,357386005
7,..Region V (Bicol Region),96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,304077556,334658891,368570786,390488251,431896619,462821105,488370132,522014835,564941774,517464559
8,..Region VI (Western Visayas),180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,564705465,605376449,631660677,665091892,715495909,758042789,820773095,860107768,913909365,825445426
9,..Region VII (Central Visayas),201368028,221348981,236861695,251698543,286487926,321657954,354599746,394422136,442118144,...,713605729,779760030,841425750,903755199,958557189,1029641257,1102761503,1180945761,1254113393,1129843546


In [19]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,Philippines,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11615360393,12416466190,13254643627,14096046745,14990907450,16062675895,17175978086,18265190258,19382750611,17527234105
1,National Capital Region (NCR),1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3833040760,4072189800,4339858704,4576914911,4865073608,5216091453,5507681038,5814440130,6224134457,5596389427
2,Cordillera Administrative Region (CAR),90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,221420475,221950621,236871362,248982342,259526063,266691916,291775205,308267122,321722276,289898072
3,Region I (Ilocos Region),128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,378621382,398616600,432692177,461217885,486062080,525389666,554680762,587597251,630362667,581894468
4,Region II (Cagayan Valley),85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,247587301,264876624,288512422,312814849,325728274,341540501,368250759,385061271,411513567,370865964
5,Region III (Central Luzon),368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1233431610,1336963348,1417025690,1526621058,1619134911,1747565400,1929193858,2062393875,2183779631,1880093241
6,Region IV-A (CALABARZON),601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1676719329,1809708699,1942476533,2054936917,2196313576,2346818115,2527658901,2706994745,2831599919,2535284422
7,MIMAROPA Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,241561347,254595082,265363628,292033215,305952533,321170118,341505717,370744808,386783632,357386005
8,Region V (Bicol Region),96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,304077556,334658891,368570786,390488251,431896619,462821105,488370132,522014835,564941774,517464559
9,Region VI (Western Visayas),180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,564705465,605376449,631660677,665091892,715495909,758042789,820773095,860107768,913909365,825445426


In [20]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Constant 2018 Prices 2011,At Constant 2018 Prices 2012,At Constant 2018 Prices 2013,At Constant 2018 Prices 2014,At Constant 2018 Prices 2015,At Constant 2018 Prices 2016,At Constant 2018 Prices 2017,At Constant 2018 Prices 2018,At Constant 2018 Prices 2019,At Constant 2018 Prices 2020
0,PHILIPPINES,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11615360393,12416466190,13254643627,14096046745,14990907450,16062675895,17175978086,18265190258,19382750611,17527234105
1,NCR: National Capital Region,1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3833040760,4072189800,4339858704,4576914911,4865073608,5216091453,5507681038,5814440130,6224134457,5596389427
2,CAR: Cordillera Administrative Region,90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,221420475,221950621,236871362,248982342,259526063,266691916,291775205,308267122,321722276,289898072
3,Region 1: Ilocos Region,128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,378621382,398616600,432692177,461217885,486062080,525389666,554680762,587597251,630362667,581894468
4,Region 2: Cagayan Valley,85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,247587301,264876624,288512422,312814849,325728274,341540501,368250759,385061271,411513567,370865964
5,Region 3: Central Luzon,368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1233431610,1336963348,1417025690,1526621058,1619134911,1747565400,1929193858,2062393875,2183779631,1880093241
6,Region 4A: CALABARZON,601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1676719329,1809708699,1942476533,2054936917,2196313576,2346818115,2527658901,2706994745,2831599919,2535284422
7,MIMAROPA: Southwestern Tagalog Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,241561347,254595082,265363628,292033215,305952533,321170118,341505717,370744808,386783632,357386005
8,Region 5: Bicol Region,96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,304077556,334658891,368570786,390488251,431896619,462821105,488370132,522014835,564941774,517464559
9,Region 6: Western Visayas,180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,564705465,605376449,631660677,665091892,715495909,758042789,820773095,860107768,913909365,825445426


##### Population, by Region

In [69]:
data = pd.read_csv("D:\datapre-project\data\Population, by Region.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,..National Capital Region (NCR),9961971,10153254,10344788,10536574,10729137,10921427,11113967,11306759,11500331,...,12080971,12275553,12469854,12664407,12859211,13066832,13264805,13453701,13633497,13804656
1,..Cordillera Administrative Region (CAR),1369249,1397362,1424800,1451561,1477718,1503126,1527858,1551914,1575358,...,1641438,1662169,1682167,1701488,1720134,1739380,1757717,1775210,1791881,1807738
2,..Region I (Ilocos Region),4209083,4265007,4320673,4376079,4431377,4486265,4540893,4595262,4649520,...,4810293,4863510,4916323,4968877,5021171,5076184,5128542,5178410,5225800,5270807
3,..Region II (Cagayan Valley),2819641,2860861,2902169,2943564,2985160,3026730,3068387,3110132,3152079,...,3278100,3320398,3362667,3405024,3447468,3493662,3537703,3579715,3619689,3657741
4,..Region III (Central Luzon),8233671,8420004,8607944,8797491,8989170,9181933,9376303,9572279,9770406,...,10372806,10577379,10783003,10990233,11199069,11437442,11667642,11890314,12105494,12313718
5,..Region IV-A (CALABARZON),9367205,9687547,10009909,10334289,10661585,10990009,11320451,11652912,11988312,...,13003881,13347384,13691969,14038573,14387196,14741686,15085285,15418944,15742673,16057299
6,..MIMAROPA Region,2305919,2352183,2398060,2443548,2488772,2533484,2577808,2621744,2665411,...,2793725,2835835,2877441,2918660,2959491,3006430,3051342,3094357,3135503,3174859
7,..Region V (Bicol Region),4698058,4772451,4846614,4920546,4994449,5067918,5141157,5214165,5287141,...,5504085,5576135,5647756,5719147,5790307,5865520,5937321,6005949,6071398,6133836
8,..Region VI (Western Visayas),6224949,6317904,6409990,6501206,6591799,6681274,6769880,6857616,6944719,...,7200096,7283710,7366225,7447870,7528646,7610389,7688734,7763898,7835883,7904899
9,..Region VII (Central Visayas),5723559,5830498,5937986,6046025,6154912,6264053,6373743,6483984,6595079,...,6930757,7044060,7157605,7271699,7386344,7511565,7631003,7745017,7853606,7957046


In [70]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Philippines,76723051,78273584,79832103,81398610,82977428,84559930,86150420,87748896,89359772,...,94227150,95870083,97516526,99170955,100833371,102530196,104169230,105755180,107288150,108771978
1,National Capital Region (NCR),9961971,10153254,10344788,10536574,10729137,10921427,11113967,11306759,11500331,...,12080971,12275553,12469854,12664407,12859211,13066832,13264805,13453701,13633497,13804656
2,Cordillera Administrative Region (CAR),1369249,1397362,1424800,1451561,1477718,1503126,1527858,1551914,1575358,...,1641438,1662169,1682167,1701488,1720134,1739380,1757717,1775210,1791881,1807738
3,Region I (Ilocos Region),4209083,4265007,4320673,4376079,4431377,4486265,4540893,4595262,4649520,...,4810293,4863510,4916323,4968877,5021171,5076184,5128542,5178410,5225800,5270807
4,Region II (Cagayan Valley),2819641,2860861,2902169,2943564,2985160,3026730,3068387,3110132,3152079,...,3278100,3320398,3362667,3405024,3447468,3493662,3537703,3579715,3619689,3657741
5,Region III (Central Luzon),8233671,8420004,8607944,8797491,8989170,9181933,9376303,9572279,9770406,...,10372806,10577379,10783003,10990233,11199069,11437442,11667642,11890314,12105494,12313718
6,Region IV-A (CALABARZON),9367205,9687547,10009909,10334289,10661585,10990009,11320451,11652912,11988312,...,13003881,13347384,13691969,14038573,14387196,14741686,15085285,15418944,15742673,16057299
7,MIMAROPA Region,2305919,2352183,2398060,2443548,2488772,2533484,2577808,2621744,2665411,...,2793725,2835835,2877441,2918660,2959491,3006430,3051342,3094357,3135503,3174859
8,Region V (Bicol Region),4698058,4772451,4846614,4920546,4994449,5067918,5141157,5214165,5287141,...,5504085,5576135,5647756,5719147,5790307,5865520,5937321,6005949,6071398,6133836
9,Region VI (Western Visayas),6224949,6317904,6409990,6501206,6591799,6681274,6769880,6857616,6944719,...,7200096,7283710,7366225,7447870,7528646,7610389,7688734,7763898,7835883,7904899


In [71]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,PHILIPPINES,76723051,78273584,79832103,81398610,82977428,84559930,86150420,87748896,89359772,...,94227150,95870083,97516526,99170955,100833371,102530196,104169230,105755180,107288150,108771978
1,NCR: National Capital Region,9961971,10153254,10344788,10536574,10729137,10921427,11113967,11306759,11500331,...,12080971,12275553,12469854,12664407,12859211,13066832,13264805,13453701,13633497,13804656
2,CAR: Cordillera Administrative Region,1369249,1397362,1424800,1451561,1477718,1503126,1527858,1551914,1575358,...,1641438,1662169,1682167,1701488,1720134,1739380,1757717,1775210,1791881,1807738
3,Region 1: Ilocos Region,4209083,4265007,4320673,4376079,4431377,4486265,4540893,4595262,4649520,...,4810293,4863510,4916323,4968877,5021171,5076184,5128542,5178410,5225800,5270807
4,Region 2: Cagayan Valley,2819641,2860861,2902169,2943564,2985160,3026730,3068387,3110132,3152079,...,3278100,3320398,3362667,3405024,3447468,3493662,3537703,3579715,3619689,3657741
5,Region 3: Central Luzon,8233671,8420004,8607944,8797491,8989170,9181933,9376303,9572279,9770406,...,10372806,10577379,10783003,10990233,11199069,11437442,11667642,11890314,12105494,12313718
6,Region 4A: CALABARZON,9367205,9687547,10009909,10334289,10661585,10990009,11320451,11652912,11988312,...,13003881,13347384,13691969,14038573,14387196,14741686,15085285,15418944,15742673,16057299
7,MIMAROPA: Southwestern Tagalog Region,2305919,2352183,2398060,2443548,2488772,2533484,2577808,2621744,2665411,...,2793725,2835835,2877441,2918660,2959491,3006430,3051342,3094357,3135503,3174859
8,Region 5: Bicol Region,4698058,4772451,4846614,4920546,4994449,5067918,5141157,5214165,5287141,...,5504085,5576135,5647756,5719147,5790307,5865520,5937321,6005949,6071398,6133836
9,Region 6: Western Visayas,6224949,6317904,6409990,6501206,6591799,6681274,6769880,6857616,6944719,...,7200096,7283710,7366225,7447870,7528646,7610389,7688734,7763898,7835883,7904899


In [72]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns into a more readable anmes
data.rename(columns = {'value':'Population', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

Unnamed: 0,Geolocation,Year,Population
0,PHILIPPINES,2000,76723051
1,NCR: National Capital Region,2000,9961971
2,CAR: Cordillera Administrative Region,2000,1369249
3,Region 1: Ilocos Region,2000,4209083
4,Region 2: Cagayan Valley,2000,2819641
...,...,...,...
373,Region 10: Northern Mindanao,2020,5017051
374,Region 11: Davao Region,2020,5290869
375,Region 12: SOCCSKSARGEN,2020,4378871
376,CARAGA: Cordillera Administrative Region,2020,2753109


##### Primary Drop-out rates by Region, Sex and Year

In [76]:
data = pd.read_csv("D:\datapre-project\data\Primary Drop-out rates by Region, Sex and Year.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,...,Girls 2006,Girls 2007,Girls 2008,Girls 2009,Girls 2010,Girls 2011,Girls 2012,Girls 2013,Girls 2014,Girls 2015
0,Philippines,6.37,5.99,6.02,6.28,6.29,6.36,6.24,4.85,3.26,...,5.0,4.72,4.87,4.93,5.02,5.18,5.12,4.04,2.77,2.01
1,NCR,2.37,2.83,2.92,4.07,3.31,2.93,4.1,4.36,4.25,...,1.46,1.8,2.13,3.33,2.53,2.42,3.27,3.78,3.6,1.45
2,CAR,5.67,6.41,5.49,5.04,6.0,4.91,4.79,3.75,2.84,...,4.23,5.09,4.16,2.73,4.94,3.57,3.77,3.15,1.71,1.36
3,Region I,3.93,3.76,3.09,3.6,3.78,3.36,3.1,1.92,1.13,...,3.15,2.92,2.41,2.75,3.02,2.53,2.48,1.42,0.93,0.83
4,Region II,4.72,4.95,4.3,4.81,4.95,4.73,3.93,2.42,2.61,...,3.21,3.71,3.28,3.45,3.71,3.59,2.97,1.67,1.96,0.98
5,Region III,3.69,3.97,3.94,3.72,4.15,4.07,3.71,1.9,2.28,...,2.74,3.02,3.05,2.7,3.02,3.17,2.89,1.37,1.74,0.93
6,Region IV-A,3.9,4.41,3.87,2.57,3.75,2.46,3.26,6.03,1.55,...,2.88,3.21,2.87,1.78,2.55,1.83,1.71,5.13,1.22,1.54
7,Region IV-B,6.7,7.4,6.4,6.93,6.25,6.13,5.87,4.73,2.78,...,5.18,6.29,5.23,5.64,4.79,4.86,4.52,3.84,2.25,0.89
8,Region V,6.06,5.78,5.9,5.8,5.79,5.7,5.35,3.19,2.72,...,4.78,4.61,4.71,4.63,4.51,4.46,4.17,2.39,2.16,1.92
9,Region VI,6.38,6.14,6.03,6.05,6.56,5.64,4.8,2.91,1.97,...,4.61,4.45,4.49,4.4,4.88,4.17,3.46,2.1,1.54,1.02


In [77]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,...,Girls 2006,Girls 2007,Girls 2008,Girls 2009,Girls 2010,Girls 2011,Girls 2012,Girls 2013,Girls 2014,Girls 2015
0,PHILIPPINES,6.37,5.99,6.02,6.28,6.29,6.36,6.24,4.85,3.26,...,5.0,4.72,4.87,4.93,5.02,5.18,5.12,4.04,2.77,2.01
1,NCR: National Capital Region,2.37,2.83,2.92,4.07,3.31,2.93,4.1,4.36,4.25,...,1.46,1.8,2.13,3.33,2.53,2.42,3.27,3.78,3.6,1.45
2,CAR: Cordillera Administrative Region,5.67,6.41,5.49,5.04,6.0,4.91,4.79,3.75,2.84,...,4.23,5.09,4.16,2.73,4.94,3.57,3.77,3.15,1.71,1.36
3,Region 1: Ilocos Region,3.93,3.76,3.09,3.6,3.78,3.36,3.1,1.92,1.13,...,3.15,2.92,2.41,2.75,3.02,2.53,2.48,1.42,0.93,0.83
4,Region 2: Cagayan Valley,4.72,4.95,4.3,4.81,4.95,4.73,3.93,2.42,2.61,...,3.21,3.71,3.28,3.45,3.71,3.59,2.97,1.67,1.96,0.98
5,Region 3: Central Luzon,3.69,3.97,3.94,3.72,4.15,4.07,3.71,1.9,2.28,...,2.74,3.02,3.05,2.7,3.02,3.17,2.89,1.37,1.74,0.93
6,Region 4A: CALABARZON,3.9,4.41,3.87,2.57,3.75,2.46,3.26,6.03,1.55,...,2.88,3.21,2.87,1.78,2.55,1.83,1.71,5.13,1.22,1.54
7,MIMAROPA: Southwestern Tagalog Region,6.7,7.4,6.4,6.93,6.25,6.13,5.87,4.73,2.78,...,5.18,6.29,5.23,5.64,4.79,4.86,4.52,3.84,2.25,0.89
8,Region 5: Bicol Region,6.06,5.78,5.9,5.8,5.79,5.7,5.35,3.19,2.72,...,4.78,4.61,4.71,4.63,4.51,4.46,4.17,2.39,2.16,1.92
9,Region 6: Western Visayas,6.38,6.14,6.03,6.05,6.56,5.64,4.8,2.91,1.97,...,4.61,4.45,4.49,4.4,4.88,4.17,3.46,2.1,1.54,1.02


In [78]:
data.drop(data.iloc[:, 11:31], inplace = True, axis = 1)
  
data

Unnamed: 0,Geolocation,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,Both Sexes 2015
0,PHILIPPINES,6.37,5.99,6.02,6.28,6.29,6.36,6.24,4.85,3.26,2.69
1,NCR: National Capital Region,2.37,2.83,2.92,4.07,3.31,2.93,4.1,4.36,4.25,2.05
2,CAR: Cordillera Administrative Region,5.67,6.41,5.49,5.04,6.0,4.91,4.79,3.75,2.84,1.9
3,Region 1: Ilocos Region,3.93,3.76,3.09,3.6,3.78,3.36,3.1,1.92,1.13,1.13
4,Region 2: Cagayan Valley,4.72,4.95,4.3,4.81,4.95,4.73,3.93,2.42,2.61,1.74
5,Region 3: Central Luzon,3.69,3.97,3.94,3.72,4.15,4.07,3.71,1.9,2.28,1.45
6,Region 4A: CALABARZON,3.9,4.41,3.87,2.57,3.75,2.46,3.26,6.03,1.55,1.76
7,MIMAROPA: Southwestern Tagalog Region,6.7,7.4,6.4,6.93,6.25,6.13,5.87,4.73,2.78,1.66
8,Region 5: Bicol Region,6.06,5.78,5.9,5.8,5.79,5.7,5.35,3.19,2.72,2.54
9,Region 6: Western Visayas,6.38,6.14,6.03,6.05,6.56,5.64,4.8,2.91,1.97,1.76


In [80]:
data.columns = data.columns.map(lambda x: x.lstrip('Both Sexes '))
data

Unnamed: 0,Geolocation,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,PHILIPPINES,6.37,5.99,6.02,6.28,6.29,6.36,6.24,4.85,3.26,2.69
1,NCR: National Capital Region,2.37,2.83,2.92,4.07,3.31,2.93,4.1,4.36,4.25,2.05
2,CAR: Cordillera Administrative Region,5.67,6.41,5.49,5.04,6.0,4.91,4.79,3.75,2.84,1.9
3,Region 1: Ilocos Region,3.93,3.76,3.09,3.6,3.78,3.36,3.1,1.92,1.13,1.13
4,Region 2: Cagayan Valley,4.72,4.95,4.3,4.81,4.95,4.73,3.93,2.42,2.61,1.74
5,Region 3: Central Luzon,3.69,3.97,3.94,3.72,4.15,4.07,3.71,1.9,2.28,1.45
6,Region 4A: CALABARZON,3.9,4.41,3.87,2.57,3.75,2.46,3.26,6.03,1.55,1.76
7,MIMAROPA: Southwestern Tagalog Region,6.7,7.4,6.4,6.93,6.25,6.13,5.87,4.73,2.78,1.66
8,Region 5: Bicol Region,6.06,5.78,5.9,5.8,5.79,5.7,5.35,3.19,2.72,2.54
9,Region 6: Western Visayas,6.38,6.14,6.03,6.05,6.56,5.64,4.8,2.91,1.97,1.76


In [81]:
# converting from a wide representation to a long representation
data = pd.melt(data, id_vars='Geolocation', value_vars=data.columns [1:]) 

# renaming the columns into a more readable anmes
data.rename(columns = {'value':'Drop-out rate', 'variable' : 'Year'}, inplace=True)

# making the year type into integer
data = data.astype({'Year':'int'})

data

Unnamed: 0,Geolocation,Year,Drop-out rate
0,PHILIPPINES,2006,6.37
1,NCR: National Capital Region,2006,2.37
2,CAR: Cordillera Administrative Region,2006,5.67
3,Region 1: Ilocos Region,2006,3.93
4,Region 2: Cagayan Valley,2006,4.72
...,...,...,...
175,Region 10: Northern Mindanao,2015,3.00
176,Region 11: Davao Region,2015,1.90
177,Region 12: SOCCSKSARGEN,2015,3.15
178,CARAGA: Cordillera Administrative Region,2015,2.84


##### Quarterly Producer Price Index for Agriculture (First Quarter 2018 to Third Quarter 2021)

In [28]:
data = pd.read_csv("D:\datapre-project\data\Quarterly Producer Price Index for Agriculture (2018=100) _ First Quarter 2018 to Third Quarter 2021.csv",header = 1,sep = ';')
data

Unnamed: 0,Region,Commodity,2018 First Quarter (Jan-Mar),2018 Second Quarter (Apr-Jun),2018 Third Quarter (Jul-Sep),2018 Fourth Quarter (Oct-Dec),2018 Average (Jan-Dec),2019 First Quarter (Jan-Mar),2019 Second Quarter (Apr-Jun),2019 Third Quarter (Jul-Sep),...,2020 First Quarter (Jan-Mar),2020 Second Quarter (Apr-Jun),2020 Third Quarter (Jul-Sep),2020 Fourth Quarter (Oct-Dec),2020 Average (Jan-Dec),2021 First Quarter (Jan-Mar),2021 Second Quarter (Apr-Jun),2021 Third Quarter (Jul-Sep),2021 Fourth Quarter (Oct-Dec),2021 Average (Jan-Dec)
0,PHILIPPINES,AGRICULTURE,95.4,101.7,104.4,98.4,100.0,96.0,96.4,93.4,...,93.7,95.2,95.1,96.2,95.1,101.7,103.4,104.0,..,..
1,PHILIPPINES,..CROPS,94.2,102.5,105.3,98.0,100.0,95.2,95.6,91.6,...,92.9,95.1,94.9,92.9,94.0,95.8,98.8,101.8,..,..
2,PHILIPPINES,….Cereals,96.3,101.0,105.7,97.1,100.0,93.4,89.3,81.8,...,80.8,89.4,83.5,77.3,82.8,85.2,86.9,87.7,..,..
3,PHILIPPINES,….Rootcrops,97.1,92.1,102.6,108.2,100.0,101.4,110.8,125.0,...,119.9,124.1,118.3,117.0,119.8,121.3,88.1,94.5,..,..
4,PHILIPPINES,….Beans and Legumes,100.1,96.0,106.1,97.8,100.0,95.5,95.6,93.6,...,94.9,91.2,78.2,117.1,95.4,94.5,89.9,90.1,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,..FISHERY,95.8,102.4,98.2,103.6,100.0,98.5,102.0,99.4,...,96.1,102.5,100.9,102.8,100.6,96.0,99.7,100.4,..,..
302,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,….Aquaculture,93.0,100.8,100.7,105.5,100.0,101.1,104.6,101.0,...,99.8,118.3,120.4,117.7,114.1,115.2,115.6,120.0,..,..
303,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,….Commercial Fishery,100.5,101.2,97.2,101.1,100.0,96.4,94.6,87.1,...,88.8,84.0,79.7,85.5,84.5,80.7,77.7,74.4,..,..
304,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,….Inland Municipal Fishery,90.4,112.5,92.8,104.3,100.0,95.4,110.7,92.3,...,89.2,105.9,105.3,104.1,101.1,85.0,106.4,115.1,..,..


In [29]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data['Commodity'] = data['Commodity'].map(lambda x: x.lstrip('..'))
data['Commodity'] = data['Commodity'].map(lambda x: x.lstrip('….'))
data

Unnamed: 0,Region,Commodity,2018 First Quarter (Jan-Mar),2018 Second Quarter (Apr-Jun),2018 Third Quarter (Jul-Sep),2018 Fourth Quarter (Oct-Dec),2018 Average (Jan-Dec),2019 First Quarter (Jan-Mar),2019 Second Quarter (Apr-Jun),2019 Third Quarter (Jul-Sep),...,2020 First Quarter (Jan-Mar),2020 Second Quarter (Apr-Jun),2020 Third Quarter (Jul-Sep),2020 Fourth Quarter (Oct-Dec),2020 Average (Jan-Dec),2021 First Quarter (Jan-Mar),2021 Second Quarter (Apr-Jun),2021 Third Quarter (Jul-Sep),2021 Fourth Quarter (Oct-Dec),2021 Average (Jan-Dec)
0,PHILIPPINES,AGRICULTURE,95.4,101.7,104.4,98.4,100.0,96.0,96.4,93.4,...,93.7,95.2,95.1,96.2,95.1,101.7,103.4,104.0,..,..
1,PHILIPPINES,CROPS,94.2,102.5,105.3,98.0,100.0,95.2,95.6,91.6,...,92.9,95.1,94.9,92.9,94.0,95.8,98.8,101.8,..,..
2,PHILIPPINES,Cereals,96.3,101.0,105.7,97.1,100.0,93.4,89.3,81.8,...,80.8,89.4,83.5,77.3,82.8,85.2,86.9,87.7,..,..
3,PHILIPPINES,Rootcrops,97.1,92.1,102.6,108.2,100.0,101.4,110.8,125.0,...,119.9,124.1,118.3,117.0,119.8,121.3,88.1,94.5,..,..
4,PHILIPPINES,Beans and Legumes,100.1,96.0,106.1,97.8,100.0,95.5,95.6,93.6,...,94.9,91.2,78.2,117.1,95.4,94.5,89.9,90.1,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,FISHERY,95.8,102.4,98.2,103.6,100.0,98.5,102.0,99.4,...,96.1,102.5,100.9,102.8,100.6,96.0,99.7,100.4,..,..
302,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,Aquaculture,93.0,100.8,100.7,105.5,100.0,101.1,104.6,101.0,...,99.8,118.3,120.4,117.7,114.1,115.2,115.6,120.0,..,..
303,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,Commercial Fishery,100.5,101.2,97.2,101.1,100.0,96.4,94.6,87.1,...,88.8,84.0,79.7,85.5,84.5,80.7,77.7,74.4,..,..
304,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,Inland Municipal Fishery,90.4,112.5,92.8,104.3,100.0,95.4,110.7,92.3,...,89.2,105.9,105.3,104.1,101.1,85.0,106.4,115.1,..,..


In [30]:
data.drop(columns =['2021 Fourth Quarter (Oct-Dec)', '2021 Average (Jan-Dec)'],inplace = True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,Commodity,2018 First Quarter (Jan-Mar),2018 Second Quarter (Apr-Jun),2018 Third Quarter (Jul-Sep),2018 Fourth Quarter (Oct-Dec),2018 Average (Jan-Dec),2019 First Quarter (Jan-Mar),2019 Second Quarter (Apr-Jun),2019 Third Quarter (Jul-Sep),2019 Fourth Quarter (Oct-Dec),2019 Average (Jan-Dec),2020 First Quarter (Jan-Mar),2020 Second Quarter (Apr-Jun),2020 Third Quarter (Jul-Sep),2020 Fourth Quarter (Oct-Dec),2020 Average (Jan-Dec),2021 First Quarter (Jan-Mar),2021 Second Quarter (Apr-Jun),2021 Third Quarter (Jul-Sep)
0,PHILIPPINES,AGRICULTURE,95.4,101.7,104.4,98.4,100.0,96.0,96.4,93.4,91.7,94.4,93.7,95.2,95.1,96.2,95.1,101.7,103.4,104.0
1,PHILIPPINES,CROPS,94.2,102.5,105.3,98.0,100.0,95.2,95.6,91.6,89.9,93.1,92.9,95.1,94.9,92.9,94.0,95.8,98.8,101.8
2,PHILIPPINES,Cereals,96.3,101.0,105.7,97.1,100.0,93.4,89.3,81.8,74.9,84.9,80.8,89.4,83.5,77.3,82.8,85.2,86.9,87.7
3,PHILIPPINES,Rootcrops,97.1,92.1,102.6,108.2,100.0,101.4,110.8,125.0,122.8,115.0,119.9,124.1,118.3,117.0,119.8,121.3,88.1,94.5
4,PHILIPPINES,Beans and Legumes,100.1,96.0,106.1,97.8,100.0,95.5,95.6,93.6,94.4,94.8,94.9,91.2,78.2,117.1,95.4,94.5,89.9,90.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,FISHERY,95.8,102.4,98.2,103.6,100.0,98.5,102.0,99.4,103.7,100.9,96.1,102.5,100.9,102.8,100.6,96.0,99.7,100.4
302,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,Aquaculture,93.0,100.8,100.7,105.5,100.0,101.1,104.6,101.0,109.5,104.1,99.8,118.3,120.4,117.7,114.1,115.2,115.6,120.0
303,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,Commercial Fishery,100.5,101.2,97.2,101.1,100.0,96.4,94.6,87.1,89.9,92.0,88.8,84.0,79.7,85.5,84.5,80.7,77.7,74.4
304,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,Inland Municipal Fishery,90.4,112.5,92.8,104.3,100.0,95.4,110.7,92.3,102.9,100.3,89.2,105.9,105.3,104.1,101.1,85.0,106.4,115.1


## Data Cleaning and Wrangling

## Exploratory Data Analysis