# Progress of the Philippines' Sustainable Development Goals
In 2015, the United Nations General Assembly created 17 interlinked global goals that was intended to be achieved by 2030. It was said that it would pave the way to "a better and more sustainable future for all". The interlinked global goals was named the **Sustainable Development Goals** (SDG).

In this notebook, 27 different datasets from the Philippine Statistics Authority, the government agency assigned to update the Philippine's data on this goals, would be combined. Fifteen (15) of these datasets are directly about the SDG, while others are indirectly connected to the SDGs.

### Import
Import `os`, `math`, `numpy`, and `pandas`.
- `os` is a module that allows the usage of operating system dependent functionality
- `numpy` contains a large collection of mathematical functions
- `pandas` is a software library for Python that is designed for data manipulation and data analysis
- `math` is a model that provides mathematical functions

In [1]:
import os
import math
import pandas as pd
import numpy as np

## Data Collection
The following **csv** files used in this project are acquired through a request sent to the Knowledge Management and Communications Division of the Philippine Statistics Authority.

### Combining the Datasets 
In this stage, the separate datasets underwent pre-processing and cleaning before they are combined together. 

First, the irrelevant rows were dropped first. These were the rows that have all NaN values and the additional rows (i.e., note rows, “Data available” rows) found in the CSV files. 

Second, since the first row of the CSV files was the name of the indicator and unnamed rows, the resulting dataframe had “Unnamed” as its column header. Due to this, we had to set the column headers to the second row of the dataframe, and then drop this afterward.

Third, since the `Geolocation` column would be used later to merge the datasets, the values in this column were standardized into the format `Region n: region_name`, where *n* is the corresponding region number and *region_name* is the name of the region. If it does not have a region number, then it was formatted as `region_abbreviation: region_name`, where *region_abbreviation* is its official abbreviation. 

Fourth, there are datasets that had divisions for a region and year, but still include a cumulative value for that division (e.g., datasets that are also divided per `Sex`, while having a value of “Both Sexes”. For this situation, we have decided to create three different columns: one for the cumulative row, another for the **Female**, and another for the **Male**.

Fifth, we convert the dataframe into its long representation. Once we have the dataset into its long representation, then we can merge it to the combined dataset while using the Year and Geolocation columns as its primary key. This would be done for all of the twenty-five datasets.

This process would result in one dataframe that is in its long representation, with three kinds of columns: (1) Geolocation, (2) Year, and (3) the value for each of the indicators. 

#### Functions used to transform the data
Throughout the data combination section, the `replace_missing`, `fix_geolocation` and `change_to_long` functions are widely used. Because of this, functions were created to abstract these functionalities.

The `replace_missing` function would [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the missing representation of '..' and '...' to NaN. This would allow numerical calculation ot be done on these columns.

In [2]:
def replace_missing (temp):
    # temp is the dataframe that we would need to change the '..' and '...' values to NaN
    
    # c is the column names of the dataframe, except for the 'Geolocation' column
    for c in temp.columns.difference(['Geolocation']):
        ## replaces the '..' values to NaN values
        temp [c].replace(to_replace='..', value= np.nan, inplace= True)
        
        ## replaces the '...' values to NaN values
        temp [c].replace(to_replace='...', value= np.nan, inplace= True)
    
    # returns the edited dataframe
    return temp

The `change_to_long` function would change the wide representation of the dataframe to its corresponding long representation, given the dataframe that needs to be converted and the column name. This function utilizes the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, which would convert it to its long representation. Afterwards, the columns would be [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) to be more descriptive, and then, the `Year` column would be changed from float to int.

In [3]:
def change_to_long (temp, col_name):
    # temp is the dataframe that we need to convert to its long representation
    # col_name would be the new column name, when the dataframe was converted
    
    # converting from a wide representation to a long representation
    temp = pd.melt(temp, id_vars='Geolocation', value_vars = temp.columns [1:]) 

    # renaming the columns into a more readable names
    if (0 in temp.columns):
        temp.rename(columns = {'value': col_name, 0 : 'Year'}, inplace=True)
    else:
        temp.rename(columns = {'value': col_name, 'variable' : 'Year'}, inplace=True)

    # making the year type into integer
    temp = temp.astype({'Year':'int'})

    # returns the long representation of the dataframe
    return temp

In the `fix_geolocation` function, for the rows with **NaN** value for the `Geolocation` column, we copy the `Geolocation` value of the nearest row above it with a not-NaN value. 

This is needed as there are datasets with the `Sex` column. In these datasets, the `Geolocation` column has **NaN** values.

In [4]:
def fix_geolocation (temp):
    # temp is the dataframe that we would have to fix the geolocation value of
    
    i = 0 # counter

    # curr_geo is the variable that holds the current value of the geolocation
    curr_geo = temp['Geolocation'][0]

    # we will iterate through the dataframe until its last row
    while i < len (temp):
        """
            If the value of the geolocation of the curernt row is not NaN, then that is the Geolocation
            value that we would copy to the next rows
        """ 
        if temp['Geolocation'][i] != np.nan:
            curr_geo = temp['Geolocation'][i]
            i = i + 1

        """
            We would copy the current Geolocation value to the next rows until the next not-NaN Geolocation.
        """
        while (i < len (temp) and data['Geolocation'][i] is np.nan):
            temp.at[i, 'Geolocation'] = curr_geo
            i = i + 1
    
    # return the fixed dataframe
    return temp

#### 1.2.1. Proportion of population living below the national poverty line 
To start with, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [5]:
data = pd.read_csv('data' + '/1.2.1.csv')         # data would now hold the csv data that we read
data

Unnamed: 0,1.2.1 Proportion of population living below the national poverty line by sex age 1/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,23.5,..,..,16.7,..,..,...,..
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,4.1,..,..,2.2,..,..,...,..
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,22.7,..,..,12.0,..,..,...,..
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,18.8,..,..,9.9,..,..,...,..
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,17.8,..,..,16.3,..,..,...,..
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,10.5,..,..,7.0,..,..,...,..
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,12.5,..,..,7.1,..,..,...,..
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,25.2,..,..,15.1,..,..,...,..


Looking at the dataframe, we could see that the columns are unnamed and that the column names are located at the 0th row. Using [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), we could get the 0th row and then assign it as the column values. 

Then, using the [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) function, we can drop the 0th row as we have no need for it anymore. Additionally, since the row at index 1 is a row full of NaN, we can also drop it using the same function. 

To be able to fix the indexing of the rows, the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used to reset the index from 0.

In [6]:
# setting our column names
data.columns = data.iloc [0] 

# dropping the 'geolocation' row as that is actually used as a header
data = data.drop (data.index [1])

# dropping the column names 
data = data.drop (data.index [0])

# resets the index of the dataframe
data.reset_index (drop=True, inplace=True)

Irrelevant rows that are just footers for the file are also removed.

In [7]:
# dropping irrelevant rows 
data = data.drop (data.index [18:]) 

The `Year` column must also be renamed into `Geolocation` as this row refers to the different regions in the Philippines, and not the years. This can be done through the use of the of the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function.

In [8]:
# renames the column 'Year' as its actually the location column
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

To easily determine which region the `Geolocation` values refer to, we can also change these values to include the names that they are commonly referred to, instead of just their region numbers. 

For consistency throughout the different datasets, the `region_names` variable was declared. The reason why a map was not used was that different datasets have different representations of the region (i.e., differences in naming a region), however, they are always arranged in the same way. This would be shown below in the pre-processing of each of the datasets.

In [9]:
# NOTE: Before applying, make sure that the arrangement of the regions are the same as the arrangement in your table
region_names = ['PHILIPPINES', 'NCR: National Capital Region', 
                 'CAR: Cordillera Administrative Region', 
                 'Region 1: Ilocos Region', 
                 'Region 2: Cagayan Valley', 
                 'Region 3: Central Luzon', 
                 'Region 4A: CALABARZON', 
                'MIMAROPA: Southwestern Tagalog Region', 
                'Region 5: Bicol Region', 
                'Region 6: Western Visayas', 
                'Region 7: Central Visayas', 
                'Region 8: Eastern Visayas', 
                'Region 9: Zamboanga Peninsula', 
                'Region 10: Northern Mindanao', 
                'Region 11: Davao Region', 
                'Region 12: SOCCSKSARGEN', 
                'CARAGA: Cordillera Administrative Region', 
                'BARMM: Bangsamoro Autonomous Region in Muslim Mindanao']

In [10]:
# setting the values of the region_names
data['Geolocation'] = region_names
data.set_index('Geolocation')
data = data.reset_index(drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,23.5,..,..,16.7,..,..,...,..
1,NCR: National Capital Region,..,..,..,..,..,..,..,..,..,...,..,..,4.1,..,..,2.2,..,..,...,..
2,CAR: Cordillera Administrative Region,..,..,..,..,..,..,..,..,..,...,..,..,22.7,..,..,12.0,..,..,...,..
3,Region 1: Ilocos Region,..,..,..,..,..,..,..,..,..,...,..,..,18.8,..,..,9.9,..,..,...,..
4,Region 2: Cagayan Valley,..,..,..,..,..,..,..,..,..,...,..,..,17.8,..,..,16.3,..,..,...,..
5,Region 3: Central Luzon,..,..,..,..,..,..,..,..,..,...,..,..,10.5,..,..,7.0,..,..,...,..
6,Region 4A: CALABARZON,..,..,..,..,..,..,..,..,..,...,..,..,12.5,..,..,7.1,..,..,...,..
7,MIMAROPA: Southwestern Tagalog Region,..,..,..,..,..,..,..,..,..,...,..,..,25.2,..,..,15.1,..,..,...,..
8,Region 5: Bicol Region,..,..,..,..,..,..,..,..,..,...,..,..,39.8,..,..,27.0,..,..,...,..
9,Region 6: Western Visayas,..,..,..,..,..,..,..,..,..,...,..,..,24.6,..,..,16.3,..,..,...,..


Next, we can convert the strings of '..' and '...', which were used to represent that there were no values for these cells, to **NaN**, through the use of the `replace_missing` function that we have created earlier.

However, the columns that have all **NaN** values were not dropped because if this dataset would be combined with other datasets, all years would still be present as there are datasets with complete data for all the years. Additionally, dropping the years for some of the dataset would result in the combined dataset having a weird sorting (i.e., a sorting of the region that does not follow the usual sorting of the datasets in the Philippines), even if it was sorted based on the `Year` and `Geolocation` column.

In [11]:
data = replace_missing (data)    # replaces the '..' and '...' with NaN
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,23.5,,,16.7,,,,
1,NCR: National Capital Region,,,,,,,,,,...,,,4.1,,,2.2,,,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,22.7,,,12.0,,,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,18.8,,,9.9,,,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,17.8,,,16.3,,,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,10.5,,,7.0,,,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,12.5,,,7.1,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,25.2,,,15.1,,,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,39.8,,,27.0,,,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,24.6,,,16.3,,,,


As the final step, the wide representation of this dataset is converted to a long representation through the use of the `convert_long` function. 

In this function, we have used the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function to convert it to its long representation. Then, the column that holds the value for a specific year and region is coverted, using [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html), to the ID of this Sustainable Development Goal (SDG), so that it can be distinguished when it is combined with other datasets.

In [12]:
data = change_to_long (data, '1.2.1 Poverty Proportion')

As this is the first dataset, we can just assign it to the `combined_data` dataframe, which would hold the combined datasets.

In [13]:
combined_data = data     
# combined_data would be the dataframe that holds the combinination of the datasets

#### 1.4.1p5. Net Enrolment Rate in elementary

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the next dataset. 

In [14]:
data = pd.read_csv('data' + '/1.4.1p5.csv')
data

Unnamed: 0,1.4.1p5 Net Enrolment Rate in elementary (Indicator is also found in SDG 4.3.s1) 1/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,,Year,2000,2001,2002.00,2003.00,2004.00,2005.00,2006.00,2007.00,...,2013.00,2014.00,2015.00,2016.00,2017.00,2018.00,2019.00,2020.0000,2021,2022
1,Geolocation,Sex,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.20,97.19,96.90,96.15,94.19,94.05,93.96,89.1064,...,...
3,,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
4,,Girls,97.28,90.91,91.10,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Note:,,,,,,,,,,...,,,,,,,,,,
59,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
60,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
61,1/ - Updates were based on the submission of D...,,,,,,,,,,...,,,,,,,,,,


From the dataframe above, we can see that the footer of the .csv files was included in the dataframe. As the rows from the 56th index are irrelevant, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) them. 

In [15]:
# dropping the irrelevant rows, which starts at Index 56
data = data.drop (data.index [56:]) 

Additionally, we can see that the columns are unnamed, and upon inspection, the original column names can be found at `Index 0`. Thus, we can set the columns to this row, and then  [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Index 0` row as it would only be redundant and might affect the computations.

The [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function was used in order to make the index of the rows start from 0.

In [16]:
# setting the column names
data.columns = data.loc[0]

# removing the row that held the column names
data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,NaN,Year,2000,2001,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,Geolocation,Sex,,,,,,,,,...,,,,,,,,,,
1,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.2,97.19,96.9,96.15,94.19,94.05,93.96,89.1064,...,...
2,,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
3,,Girls,97.28,90.91,91.1,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
4,..National Capital Region (NCR),Both Sexes,101,97.82,97.38,96.81,94.82,92.61,92.89,94.42,...,99.64,99.01,99.85,95.92,92.83,92.11,89.91,81.1478,...,...
5,,Boys,100.13,96.57,96.52,95.81,93.75,91.65,92.0,93.21,...,98.77,98.13,98.8,95.3,92.2,91.85,89.43,80.6316,...,...
6,,Girls,101.92,99.13,98.28,97.87,95.95,93.63,93.83,95.69,...,100.57,99.95,100.95,96.58,93.5,92.38,90.42,81.6903,...,...
7,..Cordillera Administrative Region (CAR),Both Sexes,94.42,92.89,91.52,89.19,86.4,82.58,80.86,81.5,...,99.66,100.16,99.19,97.24,94.37,92.24,91.4,87.5276,...,...
8,,Boys,94.26,91.96,90.53,88.36,85.52,81.75,80.19,81.01,...,99.85,100.27,99.42,97.94,95.13,93.45,92.25,88.5518,...,...
9,,Girls,94.58,93.88,92.57,90.07,87.31,83.46,81.57,82.01,...,99.47,100.05,98.95,96.51,93.59,90.99,90.51,86.4657,...,...


However, these is still a row of NaN found at `Index 0`, and we can see that the column names for the first two columns are not correct for the values underneath it, as the ones under the first column are actually Geolocations and those under the second columns are the values for Sex. Thus, we can [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) it, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)  the row at `Index 0`.

In [17]:
# renaming the columns
data = data.rename(columns = {np.nan:'Geolocation', 'Year': 'Sex'})

# dropping the row that held the Geolocation and Year column name
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

As we would want to create different columns for the different values of `Sex`, we can get the rows depending on the values of their `Sex` column. But, before we can do this, we have to copy the correct `Geolocation` values for the other rows through the use of the `fix_geolocation` function that we have created at the start.

In [18]:
# fixing the values of the geolocation column, such that each row has a value
data = fix_geolocation (data)
data 

Unnamed: 0,Geolocation,Sex,2000,2001,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,Both Sexes,96.77,90.1,90.29,88.74,87.11,84.44,83.22,84.93,...,97.2,97.19,96.9,96.15,94.19,94.05,93.96,89.1064,...,...
1,PHILIPPINES,Boys,96.27,89.33,89.51,87.84,86.17,83.56,82.39,84.07,...,96.74,96.87,96.66,96.17,94.12,94.25,93.79,88.9318,...,...
2,PHILIPPINES,Girls,97.28,90.91,91.1,89.68,88.08,85.35,84.08,85.83,...,97.68,97.53,97.15,96.12,94.27,93.85,94.15,89.2898,...,...
3,..National Capital Region (NCR),Both Sexes,101,97.82,97.38,96.81,94.82,92.61,92.89,94.42,...,99.64,99.01,99.85,95.92,92.83,92.11,89.91,81.1478,...,...
4,..National Capital Region (NCR),Boys,100.13,96.57,96.52,95.81,93.75,91.65,92.0,93.21,...,98.77,98.13,98.8,95.3,92.2,91.85,89.43,80.6316,...,...
5,..National Capital Region (NCR),Girls,101.92,99.13,98.28,97.87,95.95,93.63,93.83,95.69,...,100.57,99.95,100.95,96.58,93.5,92.38,90.42,81.6903,...,...
6,..Cordillera Administrative Region (CAR),Both Sexes,94.42,92.89,91.52,89.19,86.4,82.58,80.86,81.5,...,99.66,100.16,99.19,97.24,94.37,92.24,91.4,87.5276,...,...
7,..Cordillera Administrative Region (CAR),Boys,94.26,91.96,90.53,88.36,85.52,81.75,80.19,81.01,...,99.85,100.27,99.42,97.94,95.13,93.45,92.25,88.5518,...,...
8,..Cordillera Administrative Region (CAR),Girls,94.58,93.88,92.57,90.07,87.31,83.46,81.57,82.01,...,99.47,100.05,98.95,96.51,93.59,90.99,90.51,86.4657,...,...
9,..Region I,Both Sexes,97.73,91.33,89.64,88.52,86.98,84.87,82.74,83.14,...,97.39,97.84,96.78,94.84,92.5,90.48,89.99,86.2185,...,...


Now that all rows have a value for the `Geolocation` column, we can now divide the dataframe into three different dataframes. 

In [19]:
both_sexes = data[data['Sex'] == 'Both Sexes']   # Getting the data for Both Sexes
girls = data[data['Sex'] == 'Girls']             # Getting the data for Female
boys = data[data['Sex'] == 'Boys']               # Getting the data for Male

As we have divided the data based on the **Sex** column, each of these divisions would only have one value for this column. Thus, we can drop this column for all of these divisions.

In [20]:
# dropping the Sex column for both sexes
both_sexes = both_sexes.drop("Sex", axis = 1)
both_sexes = both_sexes.reset_index (drop=True)

# dropping the Sex column for girls
girls = girls.drop("Sex", axis = 1)
girls = girls.reset_index (drop=True)

# dropping the Sex column for boys
boys = boys.drop("Sex", axis = 1)
boys = boys.reset_index (drop=True)

To be able to merge these to the combined dataframe, let us check the `Geolocation` values of the dataframes if they follow the order of `region_names`.

In [21]:
# checks the Geolocation values for all of the dataframes

print(both_sexes['Geolocation'].values)
print(girls['Geolocation'].values)
print(boys['Geolocation'].values)

['PHILIPPINES' '..National Capital Region (NCR)'
 '..Cordillera Administrative Region (CAR)' '..Region I' '..Region II'
 '..Region III' '..Region IV-A 2/' '..MIMAROPA 2/' '..Region V'
 '..Region VI' '..Region VII' '..Region VIII' '..Region IX' '..Region X'
 '..Region XI' '..Region XII' '..Caraga' '..BARMM']
['PHILIPPINES' '..National Capital Region (NCR)'
 '..Cordillera Administrative Region (CAR)' '..Region I' '..Region II'
 '..Region III' '..Region IV-A 2/' '..MIMAROPA 2/' '..Region V'
 '..Region VI' '..Region VII' '..Region VIII' '..Region IX' '..Region X'
 '..Region XI' '..Region XII' '..Caraga' '..BARMM']
['PHILIPPINES' '..National Capital Region (NCR)'
 '..Cordillera Administrative Region (CAR)' '..Region I' '..Region II'
 '..Region III' '..Region IV-A 2/' '..MIMAROPA 2/' '..Region V'
 '..Region VI' '..Region VII' '..Region VIII' '..Region IX' '..Region X'
 '..Region XI' '..Region XII' '..Caraga' '..BARMM']


As they follow the same order, the value of the Geolocation column can be set to `region_names`. 

In [22]:
# setting the values of the region_names
both_sexes['Geolocation'] = region_names
girls['Geolocation'] = region_names
boys['Geolocation'] = region_names

Since the dataset represents missing values as either '...' or '..', we can [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the columns with these values with `np.nan`.

In [23]:
# converts the '..' and '...' values to NaN
both_sexes = replace_missing (both_sexes)
girls = replace_missing (girls)
boys = replace_missing (boys)

Then, we can transform the wide representation of the dataframes to their long representation version using the `change_to_long` function. 

In [24]:
# converts the wide representation to the long representation

both_sexes = change_to_long (both_sexes, '1.4.1 Net Elem Enrolment Rate')
girls = change_to_long (girls, '1.4.1 Net Elem Enrolment Rate (Girls)')
boys = change_to_long (boys, '1.4.1 Net Elem Enrolment Rate (Boys)')

Then we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) these long representation dataframes into the combined dataframe. These would be merged with respect to the values in the **Geolocation** and **Year** column. An outer join is used as we want to retain all the values of all of the dataframes, even if there would be **NaN** values for some of cells.

In [25]:
# combines the both sexes data to the combined dataframe
combined_data = combined_data.merge(both_sexes, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

# combines the girls data to the combined dataframe
combined_data = combined_data.merge(girls, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

# combines the boys data to the combined dataframe
combined_data = combined_data.merge(boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [26]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys)
0,PHILIPPINES,2000,,96.77,97.28,96.27
1,NCR: National Capital Region,2000,,101,101.92,100.13
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57
...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,
410,Region 11: Davao Region,2022,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,


#### 1.4.1p6. Net Enrolment Rate in secondary education

Next, we can load the third dataset.

In [27]:
data = pd.read_csv('data' + '/1.4.1p6.csv')
data

Unnamed: 0,1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,,,Year,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016.00,2017.00,2018.00,2019.00,2020.0000,2021,2022
1,Level of Education,Geolocation,Sex,,,,,,,,...,,,,,,,,,,
2,Junior High School,PHILIPPINES,Both Sexes,66.06,57.55,59,60.15,59.97,58.54,58.59,...,67.89,67.19,73.57,74.19,75.99,81.41,82.89,81.4869,...,...
3,,,Boys,62.72,52.96,54.39,55.34,55.04,53.65,53.85,...,62.42,61.68,68.09,68.79,70.88,77.24,78.80,77.6557,...,...
4,,,Girls,69.49,62.24,63.72,65.07,65.01,63.53,63.44,...,73.69,73.05,79.42,79.94,81.42,85.82,87.20,85.5003,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
113,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
114,1/ - Updates were based on submission of DepEd...,,,,,,,,,,...,,,,,,,,,,
115,2/ - Estimation of this sub-indicator only sta...,,,,,,,,,,...,,,,,,,,,,


Just like in the processing of the previous datasets, we first [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the unnecessary rows at the bottom part of the dataframe. 

In [28]:
# drops the unnecessary rows, which start at index 110
data = data.drop (data.index [110:])    

From the dataframe above, we can see that the correct column headers are found at `Index 0`. However, upon inspection, we would see that there are two NaN values and the 'Year' value at the third column should actually be 'Sex' based on the values below it. Thus, before setting this row as the column header, we first correct the values of these first three columns using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

In [29]:
# changes the values of the first row

data.at[0, '1.4.1p6 Net Enrolment Rate in secondary education (Indicator is also found in SDG 4.3.s2)'] = 'Level of Education'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

Now that first row can correctly act as the column header, we can set is as the column header, before dropping the row at `Index 0`. Then we must also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s at `Index 1` as it is unnecessary, before using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [30]:
data.columns = data.loc[0]            # sets the first row as values of the column header
data = data.drop (data.index[0])      # drops the first row
data = data.reset_index (drop=True)   # resets the index

data = data.drop (data.index[0])      # drops the first row, which is composed of NaN
data = data.reset_index (drop=True)   # resets the index

Using the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) function, we can see that there are two values for 'Level of Education' columns. To be able to combine this to the combined dataset, we must separate them as we cannot add another column that would hold the education level, thus, we can just add it as two different columns.

In [31]:
data ['Level of Education'].unique ()

array(['Junior High School', nan, 'Senior High School'], dtype=object)

In [32]:
# divides the dataframe into two: (1) jhs, and (2) shs
senior_high_data = data [54:]
junior_high_data = data [:54]

Now, we must process these two separately, but the processes done to them would be the same.

First, let us divided the Junior High School and Senior High School data by the values of the **Sex** column. Thus, this would result in three divisions per Level of Education.

In [33]:
# divides the jhs data into both sexes, girls, and boys
jhs_both = junior_high_data [junior_high_data['Sex'] == 'Both Sexes']
jhs_girls = junior_high_data [junior_high_data['Sex'] == 'Girls']
jhs_boys = junior_high_data [junior_high_data['Sex'] == 'Boys']

In [34]:
# divides the shs data into both sexes, girls, and boys
shs_both = senior_high_data [senior_high_data['Sex'] == 'Both Sexes']
shs_girls = senior_high_data [senior_high_data['Sex'] == 'Girls']
shs_boys = senior_high_data [senior_high_data['Sex'] == 'Boys']

Next, as we have already separated the dataset based on the value of the `Level of Education` column, we have no need for this column anymore. This means that we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column for all.  

In [35]:
# drops the Level of Education and Sex columns for JHS (Both Sex) data
jhs_both = jhs_both.drop("Level of Education", axis = 1)
jhs_both = jhs_both.drop("Sex", axis = 1)
jhs_both = jhs_both.reset_index (drop=True)

# drops the Level of Education and Sex columns for JHS (Girls) data
jhs_girls = jhs_girls.drop("Level of Education", axis = 1)
jhs_girls = jhs_girls.drop("Sex", axis = 1)
jhs_girls = jhs_girls.reset_index (drop=True)

# drops the Level of Education and Sex columns for JHS (Boys) data
jhs_boys = jhs_boys.drop("Level of Education", axis = 1)
jhs_boys = jhs_boys.drop("Sex", axis = 1)
jhs_boys = jhs_boys.reset_index (drop=True)

In [36]:
# drops the Level of Education and Sex columns for SHS (Both Sex) data
shs_both = shs_both.drop("Level of Education", axis = 1)
shs_both = shs_both.drop("Sex", axis = 1)
shs_both = shs_both.reset_index (drop=True)

# drops the Level of Education and Sex columns for SHS (Girls) data
shs_girls = shs_girls.drop("Level of Education", axis = 1)
shs_girls = shs_girls.drop("Sex", axis = 1)
shs_girls = shs_girls.reset_index (drop=True)

# drops the Level of Education and Sex columns for SHS (Boys) data
shs_boys = shs_boys.drop("Level of Education", axis = 1)
shs_boys = shs_boys.drop("Sex", axis = 1)
shs_boys = shs_boys.reset_index (drop=True)

For consistency, we set the values of the `Geolocation` column to the format of the region names that we have decided before.

In [37]:
# setting the values of the region_names

shs_both['Geolocation'] = region_names
shs_girls['Geolocation'] = region_names
shs_boys['Geolocation'] = region_names

In [38]:
# setting the values of the region_names

jhs_both['Geolocation'] = region_names
jhs_girls['Geolocation'] = region_names
jhs_boys['Geolocation'] = region_names

As the dataset represents missing values as '..' or '...', we must [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html) these values with `np.nan`.

In [39]:
# replaces the '..' and '...' with NaN for all of the dataframes
jhs_both = replace_missing (jhs_both)
jhs_girls = replace_missing (jhs_girls)
jhs_boys = replace_missing (jhs_boys)

shs_both = replace_missing (shs_both)
shs_girls = replace_missing (shs_girls)
shs_boys = replace_missing (shs_boys)

Looking at the one of the variables that has the senior high data, we can see that all of the values are `NaN` from 2000 to 2016, which is to be expected as Senior High School was only implemented from 2016.

In [40]:
shs_both

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,,37.38,46.12,51.24,47.76,49.48,,
1,NCR: National Capital Region,,,,,,,,,,...,,,,55.32,62.74,68.63,62.28,56.4435,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,,40.16,49.55,53.64,50.53,52.8763,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,,51.11,60.39,64.06,61.54,65.6379,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,,43.41,51.49,56.21,56.46,61.4433,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,,47.96,55.99,60.19,58.03,60.0165,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,,45.61,53.9,58.33,54.79,54.7999,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,,35.09,43.27,48.14,46.0,50.2024,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,,28.35,39.63,45.8,42.31,43.518,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,,32.54,44.17,49.74,44.22,48.2144,,


Next, we can convert both of the datasets into its long representation using the `change_to_long` function.

In [41]:
# converts all of the jhs data to its long representation
jhs_both = change_to_long (jhs_both, '1.4.1 Net JHS Enrolment Rate')
jhs_girls = change_to_long (jhs_girls, '1.4.1 Net JHS Enrolment Rate (Girls)')
jhs_boys = change_to_long (jhs_boys, '1.4.1 Net JHS Enrolment Rate (Boys)')

In [42]:
# converts all of the shs data to its long representation
shs_both = change_to_long (shs_both, '1.4.1 Net SHS Enrolment Rate')
shs_girls = change_to_long (shs_girls, '1.4.1 Net SHS Enrolment Rate (Girls)')
shs_boys = change_to_long (shs_boys, '1.4.1 Net SHS Enrolment Rate (Boys)')

Once that both datasets has been converted to their long representation, we can [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) the  datasets to the combined dataset based on the values of the `Geolocation` and the `Year` column with an outer join.

In [43]:
# merges the jhs (both sex) data to the combined dataframe
combined_data = combined_data.merge(jhs_both, how = 'outer', on = ['Geolocation', 'Year'])

# merges the jhs (girls) data to the combined dataframe
combined_data = combined_data.merge(jhs_girls, how = 'outer', on = ['Geolocation', 'Year'])

# merges the jhs (boys) data to the combined dataframe
combined_data = combined_data.merge(jhs_boys, how = 'outer', on = ['Geolocation', 'Year'])

# merges the shs (both sex) data to the combined dataframe
combined_data = combined_data.merge(shs_both, how = 'outer', on = ['Geolocation', 'Year'])

# merges the shs (girls) data to the combined dataframe
combined_data = combined_data.merge(shs_girls, how = 'outer', on = ['Geolocation', 'Year'])

# merges the shs (boys) data to the combined dataframe
combined_data = combined_data.merge(shs_boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [44]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys)
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,


#### 1.5.4. Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies
Then, the fourth dataset could be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [45]:
data = pd.read_csv('data' + '/1.5.4.csv')
data

Unnamed: 0,1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
3,Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
4,Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
5,Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
6,Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
7,Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
8,MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
9,Region V,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...


Same as the previous datasets, we would need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the irrelevant rows at the bottom of the dataframe. These are the rows that were a footer outside of the table in the csv files.

In [46]:
# drops the unnecessary rows, which start at index 19
data = data.drop (data.index [19:])

Likewise, we know that the row at `Index 0` has the values that is the supposed column header for the table. However, checking each of the cells in this row would make us realize that the column header for the first column should not be `Year`, but rather `Geolocation` as the values in these columns refer to the different regions. 

Thus, we can change the value of the first column in this row to `Geolocation`, so that we would not need to rename the column if we directly made the 0th row into the column header. Then, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row at `Index 0` as it is now unnecessary. Additionally, we can see that there is a row of **NaN**s at `Index 1`, which would become the 0th row once we drop the row that became the column headers. This should be dropped also, before the index is resetted using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [47]:
data.at[0, '1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies (Indicator can also found in SDG 13.1.3 and 11.b.2)'] = 'Geolocation'

In [48]:
data.columns = data.loc[0]           # sets the first row as the column header's values
data = data.drop (data.index[0])     # drops the first row

data = data.drop (data.index[0])     # drops the new first row
data = data.reset_index (drop=True)  # resets the index
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
0,National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
1,Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
2,Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
3,Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
4,Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
5,Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
6,MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
7,Region V,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...
8,Region VI,..,..,..,..,..,..,..,..,..,...,..,..,..,25.1,..,20.2,..,99.3,100.0,...
9,Region VII,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,87.5,..,94.1,100.0,...


The next step would be renaming the values under the `Geolocation`, although, as seen in the resulting table, we would notice that there is no row for **PHILIPPINES**. This is reflected in the way that we set the values of this column.

In [49]:
# converting the value of Geolocation column with the values of region_names, starting with NCR 
data ['Geolocation'] = region_names [1:]
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017,2018.0,2019,2020.0,2021.0,2022
0,NCR: National Capital Region,..,..,..,..,..,..,..,..,..,...,..,..,..,52.9,..,76.5,..,82.4,100.0,...
1,CAR: Cordillera Administrative Region,..,..,..,..,..,..,..,..,..,...,..,..,..,94.0,..,97.5,..,79.5,61.5,...
2,Region 1: Ilocos Region,..,..,..,..,..,..,..,..,..,...,..,..,..,44.8,..,100.0,..,74.4,76.7,...
3,Region 2: Cagayan Valley,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,100.0,..,49.0,55.1,...
4,Region 3: Central Luzon,..,..,..,..,..,..,..,..,..,...,..,..,..,59.0,..,99.3,..,100.0,100.0,...
5,Region 4A: CALABARZON,..,..,..,..,..,..,..,..,..,...,..,..,..,99.8,..,100.0,..,100.0,74.8,...
6,MIMAROPA: Southwestern Tagalog Region,..,..,..,..,..,..,..,..,..,...,..,..,..,82.0,..,100.0,..,100.0,100.0,...
7,Region 5: Bicol Region,..,..,..,..,..,..,..,..,..,...,..,..,..,91.0,..,93.3,..,57.5,56.7,...
8,Region 6: Western Visayas,..,..,..,..,..,..,..,..,..,...,..,..,..,25.1,..,20.2,..,99.3,100.0,...
9,Region 7: Central Visayas,..,..,..,..,..,..,..,..,..,...,..,..,..,100.0,..,87.5,..,94.1,100.0,...


As with the previous datasets, we would have to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the '..' and '...' values, which represents **null**, in the dataframe with **NaN**s. This is to avoid any errors that would happen in these rows, and so that it would be represented properly.

In [50]:
data = replace_missing (data)

After all of this, we can now transform this dataset that is in its wide represetation into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

Once we were able to convert it to its long representation, we would see that the column names in this new dataframe are not descriptive with respect to the values underneath the column. Directly merging this with the combined dataframe would make it hard for its users to distinguish what these columns are for, which is why it was [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to its correct column names.

In [51]:
data = change_to_long (data, '1.5.4 Proportion of LGU with DRR')

After this, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the combined dataframe.

In [52]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [53]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,


#### 3.4.1. Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease
To start with the fifth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [54]:
data = pd.read_csv('data' + '/3.4.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.4.1.csv')
data

Unnamed: 0,"3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,,Year,,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
1,Indicator,Geolocation,,,,,,,,,...,,,,,,,,,,
2,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Both Sexes,..,..,..,..,..,..,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,..,...
3,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,5.6,..,...
4,,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,3.7,..,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Note:,,,,,,,,,,...,,,,,,,,,,
270,.. - Data not available,,,,,,,,,,...,,,,,,,,,,


Based on the dataframe that we got using the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we can see that there are rows of **NaN**s at the lower part of the dataframe. Upon further inspection, it started from `Index 266`, which is why the rows from this index was [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped.

In [55]:
# drops the unnecessary rows, which start at index 266
data = data.drop (data.index [266:])

As the column headers are all **Unnamed**, we need to set the column headers to its correct value, which is found at `Index 0`. Although, the values for the first three columns in this row are not descriptive to be column headers, which is why we are changing their values to the correct descriptive name for the rows underneath them using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

As we have no use for the row at `Index 0`, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this row. With this, we would also be [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping the next row as it is just a row of **NaN**s.

In [56]:
data.at[0, '3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease'] = 'Indicator'
data.at[0, 'Unnamed: 1'] = 'Geolocation'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [57]:
data.columns = data.loc[0]             # sets the first row as the value of the column headers
data = data.drop (data.index[0])       # drops the first row
data = data.reset_index (drop=True)    # resets the index

data = data.drop (data.index[0])       # drops the new first row
data = data.reset_index (drop=True)    # resets the index

In [58]:
data

Unnamed: 0,Indicator,Geolocation,Sex,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
0,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Both Sexes,..,..,..,..,..,..,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,..,...
1,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,5.6,..,...
2,,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,3.7,..,...
3,,..National Capital Region (NCR),Both Sexes,..,..,..,..,..,..,5.1,...,5.2,5.3,5.5,5.2,4.9,4.9,5,4.8,..,...
4,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,6.1,..,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,0.3,..,...
260,,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,0.1,..,...
261,,..BBARMM,Both Sexes,..,..,..,..,..,..,0,...,0,0,0,0,0,0,0.1,0.1,..,...
262,,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,0.1,..,...


As we would be dividing the indicators into different variables, we would need to make sure that each row has a value for the `Indicator` column. Observing the table, we can copy the upper row that has a value for the `Indicator` column to the rows that are under it that has **NaN** as its value.

In [59]:
i = 0
curr_indicator = data['Indicator'][0]
while i < len (data):
    if data['Indicator'][i] != np.nan:
        curr_indicator = data['Indicator'][i]
        i = i + 1

    while (i < len (data) and data['Indicator'][i] is np.nan):
        data.at[i, 'Indicator'] = curr_indicator
        i = i + 1

Likewise, we would handle the missing `Geolocation` values in the same way.

In [60]:
data = fix_geolocation(data)
data

Unnamed: 0,Indicator,Geolocation,Sex,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
0,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Both Sexes,..,..,..,..,..,..,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,..,...
1,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,5.6,..,...
2,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,3.7,..,...
3,3.4.1 Mortality rate attributed to cardiovascu...,..National Capital Region (NCR),Both Sexes,..,..,..,..,..,..,5.1,...,5.2,5.3,5.5,5.2,4.9,4.9,5,4.8,..,...
4,3.4.1 Mortality rate attributed to cardiovascu...,..National Capital Region (NCR),Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,6.1,..,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,..3.4.1.4 Mortality rate attributed to chronic...,..Caraga,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,0.3,..,...
260,..3.4.1.4 Mortality rate attributed to chronic...,..Caraga,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,0.1,..,...
261,..3.4.1.4 Mortality rate attributed to chronic...,..BBARMM,Both Sexes,..,..,..,..,..,..,0,...,0,0,0,0,0,0,0.1,0.1,..,...
262,..3.4.1.4 Mortality rate attributed to chronic...,..BBARMM,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,..,..,0.1,..,...


Then, we need to [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) all cells that has the value of either '..' or '...' with **NaN** for better computation in the future. 

In [61]:
data = replace_missing (data)
data

Unnamed: 0,Indicator,Geolocation,Sex,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
0,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Both Sexes,,,,,,,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,,
1,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Male,,,,,,,,...,,,,,,,,5.6,,
2,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,Female,,,,,,,,...,,,,,,,,3.7,,
3,3.4.1 Mortality rate attributed to cardiovascu...,..National Capital Region (NCR),Both Sexes,,,,,,,5.1,...,5.2,5.3,5.5,5.2,4.9,4.9,5,4.8,,
4,3.4.1 Mortality rate attributed to cardiovascu...,..National Capital Region (NCR),Male,,,,,,,,...,,,,,,,,6.1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,..3.4.1.4 Mortality rate attributed to chronic...,..Caraga,Male,,,,,,,,...,,,,,,,,0.3,,
260,..3.4.1.4 Mortality rate attributed to chronic...,..Caraga,Female,,,,,,,,...,,,,,,,,0.1,,
261,..3.4.1.4 Mortality rate attributed to chronic...,..BBARMM,Both Sexes,,,,,,,0,...,0,0,0,0,0,0,0.1,0.1,,
262,..3.4.1.4 Mortality rate attributed to chronic...,..BBARMM,Male,,,,,,,,...,,,,,,,,0.1,,


As we aim to focus on the different sicknesses under the `Indicator` column, we will just get the rows that has **Both Sexes** as its value for the `Sex` column. Then, as all the rows would have the same value for the `Sex` column, we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column.

In [62]:
data = data [data ['Sex'] == 'Both Sexes']
data = data.drop('Sex', axis = 1)
data = data.reset_index(drop = True)
data

Unnamed: 0,Indicator,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018,2019,2020.0,2021,2022
0,3.4.1 Mortality rate attributed to cardiovascu...,PHILIPPINES,,,,,,,4.2,4.2,...,4.5,4.6,4.7,4.6,4.5,4.5,4.7,4.6,,
1,3.4.1 Mortality rate attributed to cardiovascu...,..National Capital Region (NCR),,,,,,,5.1,5.2,...,5.2,5.3,5.5,5.2,4.9,4.9,5,4.8,,
2,3.4.1 Mortality rate attributed to cardiovascu...,..Cordillera Administrative Region (CAR),,,,,,,3.3,3.1,...,3.4,3.5,3.7,3.6,3.6,3.8,4.1,3.8,,
3,3.4.1 Mortality rate attributed to cardiovascu...,..Region I,,,,,,,4.9,4.8,...,5,5.1,5.1,5,4.9,4.9,4.9,4.9,,
4,3.4.1 Mortality rate attributed to cardiovascu...,..Region II,,,,,,,4,3.9,...,4.4,4.4,4.5,4.4,4.3,4.5,4.7,4.3,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,..3.4.1.4 Mortality rate attributed to chronic...,..Region X,,,,,,,0.2,0.2,...,0.3,0.2,0.2,0.2,0.2,0.3,0.3,0.2,,
84,..3.4.1.4 Mortality rate attributed to chronic...,..Region XI,,,,,,,0.2,0.2,...,0.3,0.2,0.2,0.3,0.3,0.2,0.3,0.2,,
85,..3.4.1.4 Mortality rate attributed to chronic...,..Region XII,,,,,,,0.3,0.2,...,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.2,,
86,..3.4.1.4 Mortality rate attributed to chronic...,..Caraga,,,,,,,0.3,0.3,...,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,,


Upon studying the different indicators under this specific Sustainable Development Goal (SDG), we would realize that it is comprised of different subsets: (1) cardiovascular diseases, (2) cancer, (3) diabetes, and (4) chronic respiratory disease. 

Then, after dividing the different subsets, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Indicator` column. 

In [63]:
data['Indicator'].unique()

array(['3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease',
       '..3.4.1.1 Mortality rate attributed to cardiovascular disease',
       '..3.4.1.2 Mortality rate attributed to cancer',
       '..3.4.1.3 Mortality rate attributed to diabetes',
       '..3.4.1.4 Mortality rate attributed to chronic respiratory disease'],
      dtype=object)

In [64]:
all_data = data [data ['Indicator'] == '3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease']
cardio_data = data [data ['Indicator'] == '..3.4.1.1 Mortality rate attributed to cardiovascular disease']
cancer_data = data [data ['Indicator'] == '..3.4.1.2 Mortality rate attributed to cancer']
diabetes_data = data [data ['Indicator'] == '..3.4.1.3 Mortality rate attributed to diabetes']
respi_data = data [data ['Indicator'] == '..3.4.1.4 Mortality rate attributed to chronic respiratory disease']

In [65]:
all_data = all_data.drop('Indicator', axis = 1)
all_data = all_data.reset_index(drop = True)

cardio_data = cardio_data.drop('Indicator', axis = 1)
cardio_data = cardio_data.reset_index(drop = True)

cancer_data = cancer_data.drop('Indicator', axis = 1)
cancer_data = cancer_data.reset_index(drop = True)

diabetes_data = diabetes_data.drop('Indicator', axis = 1)
diabetes_data = diabetes_data.reset_index(drop = True)

respi_data = respi_data.drop('Indicator', axis = 1)
respi_data = respi_data.reset_index(drop = True)

After this, we would need to set the values of the `Geolocation` column to the formatted region names. But, before that, let us check if all the regions are present for all of the data.

In [66]:
print(len(all_data ['Geolocation'].unique()))
print(len(cardio_data ['Geolocation'].unique()))
print(len(cancer_data ['Geolocation'].unique()))
print(len(diabetes_data ['Geolocation'].unique()))
print(len(respi_data ['Geolocation'].unique()))

16
18
18
18
18


Upon inspection, we would realize that there are two regions that are missing from the table, which are **Region V** and **Region VI**, which is why we would only be using the region names that are included in the dataframe. Then, for the other variables, we can use all of the region names as the regions are complete.

In [67]:
# setting the values of the region_names, without the region five and six
all_data ['Geolocation'] = region_names [0:8] + region_names [10:]

# setting the values of the region_names
cardio_data ['Geolocation'] = region_names
cancer_data ['Geolocation'] = region_names
diabetes_data ['Geolocation'] = region_names
respi_data ['Geolocation'] = region_names

After this, with the use of the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, we can now convert our dataframes to their long representation. Then, we must set the column headers to describe the values in this column, which is why we would need to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns. 

In [68]:
all_data = change_to_long (all_data, '3.4.1 Mortality rate credited to NCD')
cardio_data = change_to_long (cardio_data, '3.4.1 Mortality rate credited to Cardio')
cancer_data = change_to_long (cancer_data, '3.4.1 Mortality rate credited to Cancer')
diabetes_data = change_to_long (diabetes_data, '3.4.1 Mortality rate credited to Diabetes')
respi_data = change_to_long (respi_data, '3.4.1 Mortality rate credited to Respi')

After this, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the dataframe which holds the combined datasets.

In [69]:
combined_data = combined_data.merge(all_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(cardio_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(cancer_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(diabetes_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(respi_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [70]:
combined_data [combined_data ['Year'] == 2020]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi
360,PHILIPPINES,2020,,89.1064,89.2898,88.9318,81.4869,85.5003,77.6557,49.48,57.4119,42.0505,,4.6,2.8,1.0,0.6,0.2
361,NCR: National Capital Region,2020,,81.1478,81.6903,80.6316,85.1646,88.9647,81.5199,56.4435,63.8118,49.4413,82.4,4.8,3.1,1.0,0.6,0.1
362,CAR: Cordillera Administrative Region,2020,,87.5276,86.4657,88.5518,84.9372,88.1383,81.866,52.8763,63.109,43.3977,79.5,3.8,2.1,1.2,0.3,0.2
363,Region 1: Ilocos Region,2020,,86.2185,85.8033,86.6105,90.1031,92.3834,87.9661,65.6379,73.2342,58.7108,74.4,4.9,2.9,1.2,0.6,0.3
364,Region 2: Cagayan Valley,2020,,93.6348,93.1946,94.053,93.4367,97.0227,90.0406,61.4433,70.8277,52.8585,49.0,4.3,2.5,1.1,0.4,0.3
365,Region 3: Central Luzon,2020,,95.4067,95.3453,95.4649,86.0948,89.9462,82.4587,60.0165,67.5813,53.02,100.0,5.2,3.1,1.1,0.7,0.2
366,Region 4A: CALABARZON,2020,,91.9912,92.497,91.5134,84.4897,88.3569,80.833,54.7999,61.8488,48.1915,100.0,5.1,3.2,1.0,0.8,0.2
367,MIMAROPA: Southwestern Tagalog Region,2020,,86.2074,86.6139,85.8247,80.7633,84.9722,76.7903,50.2024,57.9633,42.9974,100.0,4.2,2.5,0.9,0.6,0.2
368,Region 5: Bicol Region,2020,,87.2573,86.8802,87.6157,75.087,79.4398,70.953,43.518,52.4657,35.2324,57.5,,3.4,0.8,0.6,0.3
369,Region 6: Western Visayas,2020,,93.9281,93.4375,94.3939,87.906,91.3535,84.6275,48.2144,57.4538,39.5905,99.3,,2.6,1.1,0.6,0.3


#### 3.7.1. Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods

Using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, we load the sixth dataset. 

In [71]:
data = pd.read_csv('data' + '/3.7.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/3.7.1.csv')
data

Unnamed: 0,3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Year,,2000,2001,2002,2003.0,2004,2005,2006,2007,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
1,Indicator/Sub-indicators,Geolocation,,,,,,,,,...,,,,,,,,,,
2,3.7.1 Proportion of women of reproductive age ...,PHILIPPINES,..,..,..,46.7,..,..,..,..,...,51.8,..,..,..,56.9,..,..,..,..,...
3,,..National Capital Region (NCR),..,..,..,47.2,..,..,..,..,...,53.4,..,..,..,59.5,..,..,..,..,...
4,,..Cordillera Administrative Region (CAR),..,..,..,44.4,..,..,..,..,...,59.8,..,..,..,66.7,..,..,..,..,...
5,,..Region I,..,..,..,49.6,..,..,..,..,...,50.8,..,..,..,59.5,..,..,..,..,...
6,,..Region II,..,..,..,68.8,..,..,..,..,...,69.1,..,..,..,74.1,..,..,..,..,...
7,,..Region III,..,..,..,54.2,..,..,..,..,...,60.4,..,..,..,56.8,..,..,..,..,...
8,,..Region IV-A,..,..,..,46.1,..,..,..,..,...,49.1,..,..,..,49.2,..,..,..,..,...
9,,..MIMAROPA,..,..,..,48.5,..,..,..,..,...,55.1,..,..,..,61.7,..,..,..,..,...


Irrelevant rows that are just footers for the file are also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped. From the dataframe above, we can see that these are the rows from `Index 20`.

In [72]:
# drops the unnecessary rows, which start at index 20
data = data.drop (data.index [20:])

Additionally, we can see that the current column names are **Unnamed**. Thus, we have to set the column names to its correct values so that we can determine what the values in the columns are.

Understanding the data, we can see that the row at `Index 0` holds the value for the column headers. However, there is a **NaN** value, which should be **Geolocation** based on the data underneath it. This is why the value of this cell was changed to **Geolocation** using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function.

This is done before the column names was set to the row at `Index 0`, and then [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping this row and the row of NaNs at the next row.

In [73]:
data.at[0, 'Unnamed: 1'] = 'Geolocation'

In [74]:
data.columns = data.loc[0]             # sets the first row as the column header's values
data = data.drop (data.index[0])       # drops the first row
data = data.reset_index (drop=True)    # resets the index

data = data.drop (data.index[0])       # drops the first row
data = data.reset_index (drop=True)    # resets the index
data

Unnamed: 0,Year,Geolocation,2000,2001,2002,2003.0,2004,2005,2006,2007,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
0,3.7.1 Proportion of women of reproductive age ...,PHILIPPINES,..,..,..,46.7,..,..,..,..,...,51.8,..,..,..,56.9,..,..,..,..,...
1,,..National Capital Region (NCR),..,..,..,47.2,..,..,..,..,...,53.4,..,..,..,59.5,..,..,..,..,...
2,,..Cordillera Administrative Region (CAR),..,..,..,44.4,..,..,..,..,...,59.8,..,..,..,66.7,..,..,..,..,...
3,,..Region I,..,..,..,49.6,..,..,..,..,...,50.8,..,..,..,59.5,..,..,..,..,...
4,,..Region II,..,..,..,68.8,..,..,..,..,...,69.1,..,..,..,74.1,..,..,..,..,...
5,,..Region III,..,..,..,54.2,..,..,..,..,...,60.4,..,..,..,56.8,..,..,..,..,...
6,,..Region IV-A,..,..,..,46.1,..,..,..,..,...,49.1,..,..,..,49.2,..,..,..,..,...
7,,..MIMAROPA,..,..,..,48.5,..,..,..,..,...,55.1,..,..,..,61.7,..,..,..,..,...
8,,..Region V,..,..,..,30.6,..,..,..,..,...,29.3,..,..,..,44.4,..,..,..,..,...
9,,..Region VI,..,..,..,42.3,..,..,..,..,...,45.5,..,..,..,56.8,..,..,..,..,...


Added to this, we can see that there is a column of **NaN**s, which we do not need, so we can also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this.

In [75]:
# drops the Year column
data = data.drop('Year', axis=1)

Just like what we have done in the previous datasets, we would rename the **Geolocation** column based on the common names of the region for easier understanding of the dataset.

In [76]:
data ['Geolocation'] = region_names   # setting the values of the region_names

As the missing data or null values in the dataset are represented by '..' or '...', which are strings that might affect the computations that might be done in this numerical columns, we would be using the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function to replace these string values to **np.nan**.

In [77]:
# replaces the '..' and '...' values with NaN
data = replace_missing (data)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
0,PHILIPPINES,,,,46.7,,,,,46.8,...,51.8,,,,56.9,,,,,
1,NCR: National Capital Region,,,,47.2,,,,,43.2,...,53.4,,,,59.5,,,,,
2,CAR: Cordillera Administrative Region,,,,44.4,,,,,55.0,...,59.8,,,,66.7,,,,,
3,Region 1: Ilocos Region,,,,49.6,,,,,49.7,...,50.8,,,,59.5,,,,,
4,Region 2: Cagayan Valley,,,,68.8,,,,,62.6,...,69.1,,,,74.1,,,,,
5,Region 3: Central Luzon,,,,54.2,,,,,54.0,...,60.4,,,,56.8,,,,,
6,Region 4A: CALABARZON,,,,46.1,,,,,46.1,...,49.1,,,,49.2,,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,48.5,,,,,48.5,...,55.1,,,,61.7,,,,,
8,Region 5: Bicol Region,,,,30.6,,,,,33.8,...,29.3,,,,44.4,,,,,
9,Region 6: Western Visayas,,,,42.3,,,,,44.4,...,45.5,,,,56.8,,,,,


As the dataset now looks like the wide representation that we wanted, we would be transforming it to its long representation, using the `change_to_long` function.

In this function, we used the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, so that we could merge it to the combined dataset. Although, before merging it to the combined dataset, we used the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the columns `0` and `value`, as they are not descriptive enough. If we directly merged it to the combined dataset, we might not be able to determine what the values in these columns mean. 

In [78]:
data = change_to_long (data, '3.7.1 Proportion of Contraceptive Use of Women')

Once the column names have been fixed, we could use the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function to use outer join to merge the two datasets.

In [79]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [80]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi,3.7.1 Proportion of Contraceptive Use of Women
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,,,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,,,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,,,,,,,


#### 3.7.2. Adolescent birth rate aged 15-19 years per 1,000 women in that age group
Then, the seventh dataset could be loaded using the same [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [81]:
data = pd.read_csv('data' + '/3.7.2.csv')
data

Unnamed: 0,"3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,53.0,..,..,..,..,54.0,...,57.0,..,..,..,47.0,..,..,..,..,...
3,..National Capital Region (NCR),..,..,..,35.0,..,..,..,..,25.0,...,48.0,..,..,..,27.0,..,..,..,..,...
4,..Cordillera Administrative Region (CAR),..,..,..,52.0,..,..,..,..,34.0,...,53.0,..,..,..,25.0,..,..,..,..,...
5,..Region I,..,..,..,55.0,..,..,..,..,52.0,...,78.0,..,..,..,46.0,..,..,..,..,...
6,..Region II,..,..,..,85.0,..,..,..,..,54.0,...,65.0,..,..,..,51.0,..,..,..,..,...
7,..Region III,..,..,..,42.0,..,..,..,..,69.0,...,63.0,..,..,..,61.0,..,..,..,..,...
8,..Region IV-A,..,..,..,44.0,..,..,..,..,63.0,...,58.0,..,..,..,37.0,..,..,..,..,...
9,..MIMAROPA,..,..,..,108.0,..,..,..,..,87.0,...,68.0,..,..,..,47.0,..,..,..,..,...


As seen in the previous datasets, there are three types of columns that are processed and [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped first: (1) the irrelevant rows that were footers in the .csv file, (2) the row that would be turned into the column headers, and (3) the row of **NaN**s.

In [82]:
# drops the unnecessary rows, which start at index 20
data = data.drop (data.index [20:])

In [83]:
data.columns = data.loc[0]               # sets the values of the first row as the column headers
data = data.drop (data.index[0])         # drops the first row
data = data.reset_index (drop=True)      # resets the index of the dataframe

data = data.drop (data.index[0])         # drops the first row
data = data.reset_index (drop=True)      # resets the index of the dataframe
data

Unnamed: 0,Year,2000,2001,2002,2003.0,2004,2005,2006,2007,2008.0,...,2013.0,2014,2015,2016,2017.0,2018,2019,2020,2021,2022
0,PHILIPPINES,..,..,..,53.0,..,..,..,..,54.0,...,57.0,..,..,..,47.0,..,..,..,..,...
1,..National Capital Region (NCR),..,..,..,35.0,..,..,..,..,25.0,...,48.0,..,..,..,27.0,..,..,..,..,...
2,..Cordillera Administrative Region (CAR),..,..,..,52.0,..,..,..,..,34.0,...,53.0,..,..,..,25.0,..,..,..,..,...
3,..Region I,..,..,..,55.0,..,..,..,..,52.0,...,78.0,..,..,..,46.0,..,..,..,..,...
4,..Region II,..,..,..,85.0,..,..,..,..,54.0,...,65.0,..,..,..,51.0,..,..,..,..,...
5,..Region III,..,..,..,42.0,..,..,..,..,69.0,...,63.0,..,..,..,61.0,..,..,..,..,...
6,..Region IV-A,..,..,..,44.0,..,..,..,..,63.0,...,58.0,..,..,..,37.0,..,..,..,..,...
7,..MIMAROPA,..,..,..,108.0,..,..,..,..,87.0,...,68.0,..,..,..,47.0,..,..,..,..,...
8,..Region V,..,..,..,60.0,..,..,..,..,63.0,...,59.0,..,..,..,36.0,..,..,..,..,...
9,..Region VI,..,..,..,57.0,..,..,..,..,41.0,...,58.0,..,..,..,38.0,..,..,..,..,...


Although, we can see that there is a column name that does not correctly represent the data of this column: the `Year` column does not indicate years, but rather the regions. This is why it was [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d to `Geolocation`. 

In [84]:
data.rename(columns = {'Year':'Geolocation'}, inplace=True)

Once we have cleaned the column headers, the values for the `Geolocation` column would be fixed to include their common names. It is important to note that it was made sure that each of the row completely match the arrangement in the `region_name` variable.

In [85]:
data ['Geolocation'] = region_names   # setting the values of the region_names

As we now have fixed the number of rows and the column names, we would now replace the string representation of null or missing vlaues. This is done with the use of [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function, which would convert the '..' and '...' values into **np.nan**.

In [86]:
data = replace_missing (data)

Then, we can now convert our dataframe into its long representation using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function. As in the processing of the previous datasets, we would have to [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) the column names as they are not descriptive enough.

In [87]:
data = change_to_long(data, '3.7.2 Teenage pregnancy rates per 1000')

As we are now sure that the missing or null values are correctly represented, the values of the `Geolocation` are now more easily understandable, and the column headers are descriptive enough, we can now merge this dataset into the combined datasets using the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function.

In [88]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [89]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi,3.7.1 Proportion of Contraceptive Use of Women,3.7.2 Teenage pregnancy rates per 1000
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,,,,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,,,,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,,,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,,,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,,,,,,,,,,


#### 4.1.s1. Completion Rate of elementary and secondary students
To start with the eighth dataset, let us load the data from the csv file using pandas' [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [90]:
data = pd.read_csv('data' + '/4.1.s1.csv')
data

Unnamed: 0,4.1.s1 Completion Rate of elementary and secondary students 1/ 2/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,Year,,,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018.00,2019.00,2020.000000,2021,2022
1,Geolocation,Level of Education,Sex,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,Elementary,Both Sexes,62.72,68.18,71.55,70.24,69.06,68.11,71.72,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.510000,...,...
3,,,Female,65.53,70.7,76.32,75.63,75.2,73.46,76.7,...,81.33,86.23,87.43,95.52,94.61,99.12,98.08,84.681828,...,...
4,,,Male,60.05,65.78,67.23,65.42,63.63,63.29,67.28,...,74.38,81.45,80.97,90.83,90.41,95.26,95.10,80.500538,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,.. - Data not available,,,,,,,,,,...,,,,,,,,,,
167,... - Data not yet available,,,,,,,,,,...,,,,,,,,,,
168,1/ - Updates were based on the submission of D...,,,,,,,,,,...,,,,,,,,,,
169,2/ - Estimation in Senior High School only sta...,,,,,,,,,,...,,,,,,,,,,


From the view of the dataframe above, we can see that there are unnecessary rows captured by the  [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function. To be able to correctly represent the data, we would need to [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) these rows.

In [91]:
data = data.drop(data.index[164:])

Another problem that we have based on the dataframe shown above is the lack of column names, as shown in the **Unnamed** values in the header. Studying the dataframe, we would find the supposed column headers in the row of `Index 0`, though we face the problem of having **NaN** values at the first three columns of this row. This is why the values in these cells are changed using the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) function, before converting this row to be the column header.

After we have been able to turn this into the column header, we would need to drop this row and the row beneath it as they are unnecessary rows.

In [92]:
# changing the values at the row that would be used as the column
data.at[0, '4.1.s1 Completion Rate of elementary and secondary students 1/ 2/'] = 'Geolocation'
data.at[0, 'Unnamed: 1'] = 'Level of Education'
data.at[0, 'Unnamed: 2'] = 'Sex'

In [93]:
data.columns = data.loc[0]             # sets the values of the first row as the column header's value
data = data.drop (data.index[0])       # drops the first row
data = data.reset_index (drop=True)    # resets the index

data = data.drop (data.index[0])       # drops the first row
data = data.reset_index (drop=True)    # resets the index
data

Unnamed: 0,Geolocation,Level of Education,Sex,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,Elementary,Both Sexes,62.72,68.18,71.55,70.24,69.06,68.11,71.72,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.510000,...,...
1,,,Female,65.53,70.7,76.32,75.63,75.2,73.46,76.7,...,81.33,86.23,87.43,95.52,94.61,99.12,98.08,84.681828,...,...
2,,,Male,60.05,65.78,67.23,65.42,63.63,63.29,67.28,...,74.38,81.45,80.97,90.83,90.41,95.26,95.10,80.500538,...,...
3,,Secondary (Junior High School),Both Sexes,70.07,69.97,74.81,71.67,72.38,61.66,72.14,...,76.25,77.77,74.03,80.91,84.32,88.84,85.75,82.111684,...,...
4,,,Female,72.29,72.94,79.98,77.2,77.8,68.14,76.96,...,81.01,81.7,78.47,85.6,88.12,92.97,89.69,85.916052,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,,,Female,58.65,66.32,71.5,56.66,44.01,54.81,37.81,...,58.74,50.29,65.03,79.87,62.1,93.60,88.24,77.454619,...,...
158,,,Male,57.98,62.25,49.72,69.53,46.82,50.93,32.22,...,52.86,46.67,62.63,77.11,62.24,92.80,84.59,67.366448,...,...
159,,Secondary (Senior High School),Both Sexes,..,..,..,..,..,..,..,...,..,..,..,..,..,68.08,66.24,56.886363,...,...
160,,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,67.73,67.80,61.194563,...,...


As we can see from the resulting dataset, there are still **NaN** values in the `Geolocation` column, which we do not want as this would be used in merging the datasets together. However, if we study it, we would realize that the reason for this is that one value for `Geolocation` actually spans to the next rows after it (until the next new `Geolocation` value) as there are different values for the `Level of Education` column and the `Sex` column. Although, we cannot just separate the dataset per unique value of these two columns, as the `Geolocation` would be NaN for those that are not **Both Sexes** and **Elementary**.

Due to this, we copy the value of the `Geolocation` column of a row to the next two rows after it, using the `fix_geolocation` function. 

In [94]:
data = fix_geolocation (data)

As we can see, there are now values for the `Geolocation` column for each of the rows.

In [95]:
data.head(10)

Unnamed: 0,Geolocation,Level of Education,Sex,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,Elementary,Both Sexes,62.72,68.18,71.55,70.24,69.06,68.11,71.72,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.51,...,...
1,PHILIPPINES,,Female,65.53,70.7,76.32,75.63,75.2,73.46,76.7,...,81.33,86.23,87.43,95.52,94.61,99.12,98.08,84.681828,...,...
2,PHILIPPINES,,Male,60.05,65.78,67.23,65.42,63.63,63.29,67.28,...,74.38,81.45,80.97,90.83,90.41,95.26,95.1,80.500538,...,...
3,PHILIPPINES,Secondary (Junior High School),Both Sexes,70.07,69.97,74.81,71.67,72.38,61.66,72.14,...,76.25,77.77,74.03,80.91,84.32,88.84,85.75,82.111684,...,...
4,PHILIPPINES,,Female,72.29,72.94,79.98,77.2,77.8,68.14,76.96,...,81.01,81.7,78.47,85.6,88.12,92.97,89.69,85.916052,...,...
5,PHILIPPINES,,Male,67.66,66.72,69.5,66.03,66.87,55.06,67.17,...,71.57,73.93,69.68,76.2,80.52,84.73,81.89,78.412296,...,...
6,PHILIPPINES,Secondary (Senior High School),Both Sexes,..,..,..,..,..,..,..,...,..,..,..,..,..,81.01,76.71,69.317762,...,...
7,PHILIPPINES,,Female,..,..,..,..,..,..,..,...,..,..,..,..,..,84.78,80.45,74.609274,...,...
8,PHILIPPINES,,Male,..,..,..,..,..,..,..,...,..,..,..,..,..,77.23,73.03,64.171123,...,...
9,..National Capital Region (NCR),Elementary,Both Sexes,63.87,74.29,84.35,83.81,82.1,82.5,88.48,...,78.72,74.71,82.29,85.97,94.65,99.04,94.97,69.36,...,...


Before we divide the dataset based on the value of `Level of Education`, we must first replace cells with the strings '..' or '...' with **np.nan**. This is so that we would not need to process this representation of missing or null values separately (i.e., per division).

In [96]:
data = replace_missing (data)

Additionally, we face the same problem that we faced with the `Geolocation` column: there are rows with NaN values for the `Level of Education`. But by observing the dataset, we can see that a Level of Education's value spans the next rows after it as it is still divided by the value of `Gender`. Thus, we can copy a value of `Level of Education` from the above row.

In [97]:
i = 0
curr_educ = data['Level of Education'][0]
while i < len (data):
    if data['Level of Education'][i] != np.nan:
        curr_educ = data['Level of Education'][i]
        i = i + 1

    while (i < len (data) and data['Level of Education'][i] is np.nan):
        data.at[i, 'Level of Education'] = curr_educ
        i = i + 1

Now, there are no **NaN**s for the `Level of Education` column.

In [98]:
data.head(10)

Unnamed: 0,Geolocation,Level of Education,Sex,2000,2001,2002,2003,2004,2005,2006,...,2013,2014,2015,2016,2017,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,Elementary,Both Sexes,62.72,68.18,71.55,70.24,69.06,68.11,71.72,...,77.67,83.74,84.02,93.06,92.41,97.15,96.56,82.51,,
1,PHILIPPINES,Elementary,Female,65.53,70.7,76.32,75.63,75.2,73.46,76.7,...,81.33,86.23,87.43,95.52,94.61,99.12,98.08,84.681828,,
2,PHILIPPINES,Elementary,Male,60.05,65.78,67.23,65.42,63.63,63.29,67.28,...,74.38,81.45,80.97,90.83,90.41,95.26,95.1,80.500538,,
3,PHILIPPINES,Secondary (Junior High School),Both Sexes,70.07,69.97,74.81,71.67,72.38,61.66,72.14,...,76.25,77.77,74.03,80.91,84.32,88.84,85.75,82.111684,,
4,PHILIPPINES,Secondary (Junior High School),Female,72.29,72.94,79.98,77.2,77.8,68.14,76.96,...,81.01,81.7,78.47,85.6,88.12,92.97,89.69,85.916052,,
5,PHILIPPINES,Secondary (Junior High School),Male,67.66,66.72,69.5,66.03,66.87,55.06,67.17,...,71.57,73.93,69.68,76.2,80.52,84.73,81.89,78.412296,,
6,PHILIPPINES,Secondary (Senior High School),Both Sexes,,,,,,,,...,,,,,,81.01,76.71,69.317762,,
7,PHILIPPINES,Secondary (Senior High School),Female,,,,,,,,...,,,,,,84.78,80.45,74.609274,,
8,PHILIPPINES,Secondary (Senior High School),Male,,,,,,,,...,,,,,,77.23,73.03,64.171123,,
9,..National Capital Region (NCR),Elementary,Both Sexes,63.87,74.29,84.35,83.81,82.1,82.5,88.48,...,78.72,74.71,82.29,85.97,94.65,99.04,94.97,69.36,,


Then, we can now separate them so that we can properly label it before merging it to the combined dataset.

In [99]:
elem_data = data [data['Level of Education'] == 'Elementary']
elem_data = elem_data.reset_index (drop=True)

junior_data = data [data['Level of Education'] == 'Secondary (Junior High School)']
junior_data = junior_data.reset_index (drop=True)

senior_data = data [data['Level of Education'] == 'Secondary (Senior High School)']
senior_data = senior_data.reset_index (drop=True)

Once we have successfully divided the dataset based on the value of the `Level of Education` column, we can now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) this column as each of the division would technically only have one value for this column.

In [100]:
elem_data = elem_data.drop ('Level of Education', axis = 1)
elem_data = elem_data.reset_index(drop=True)

In [101]:
junior_data = junior_data.drop ('Level of Education', axis = 1)
junior_data = junior_data.reset_index(drop=True)

In [102]:
senior_data = senior_data.drop ('Level of Education', axis = 1)
senior_data = senior_data.reset_index(drop=True)

In [103]:
senior_data

Unnamed: 0,Geolocation,Sex,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,Both Sexes,,,,,,,,,...,,,,,,81.01,76.71,69.317762,,
1,PHILIPPINES,Female,,,,,,,,,...,,,,,,84.78,80.45,74.609274,,
2,PHILIPPINES,Male,,,,,,,,,...,,,,,,77.23,73.03,64.171123,,
3,..National Capital Region (NCR),Both Sexes,,,,,,,,,...,,,,,,82.64,76.28,56.255397,,
4,..National Capital Region (NCR),Female,,,,,,,,,...,,,,,,86.92,79.62,61.92332,,
5,..National Capital Region (NCR),Male,,,,,,,,,...,,,,,,78.35,72.96,50.881891,,
6,..Cordillera Administrative Region (CAR),Both Sexes,,,,,,,,,...,,,,,,81.07,76.25,81.151454,,
7,..Cordillera Administrative Region (CAR),Female,,,,,,,,,...,,,,,,88.36,82.86,85.668754,,
8,..Cordillera Administrative Region (CAR),Male,,,,,,,,,...,,,,,,73.88,69.88,76.673288,,
9,..Region I,Both Sexes,,,,,,,,,...,,,,,,83.26,80.57,84.450218,,


Then, for each of these levels of education, we can now divide it based on the value of the `Sex` column.

In [104]:
elem_both = elem_data [elem_data ['Sex'] == 'Both Sexes']
elem_girls = elem_data [elem_data ['Sex'] == 'Female']
elem_boys = elem_data [elem_data ['Sex'] == 'Male']

junior_both = junior_data [junior_data ['Sex'] == 'Both Sexes']
junior_girls = junior_data [junior_data ['Sex'] == 'Female']
junior_boys = junior_data [junior_data ['Sex'] == 'Male']

senior_both = senior_data [senior_data ['Sex'] == 'Both Sexes']
senior_girls = senior_data [senior_data ['Sex'] == 'Female']
senior_boys = senior_data [senior_data ['Sex'] == 'Male']

Then, we will now [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the `Sex` column for all of these variables as each of them has the same value for the said column

In [105]:
elem_both = elem_both.drop ('Sex', axis = 1)
elem_both = elem_both.reset_index(drop=True)

elem_girls = elem_girls.drop ('Sex', axis = 1)
elem_girls = elem_girls.reset_index(drop=True)

elem_boys = elem_boys.drop ('Sex', axis = 1)
elem_boys = elem_boys.reset_index(drop=True)

In [106]:
junior_both = junior_both.drop ('Sex', axis = 1)
junior_both = junior_both.reset_index(drop=True)

junior_girls = junior_girls.drop ('Sex', axis = 1)
junior_girls = junior_girls.reset_index(drop=True)

junior_boys = junior_boys.drop ('Sex', axis = 1)
junior_boys = junior_boys.reset_index(drop=True)

In [107]:
senior_both = senior_both.drop ('Sex', axis = 1)
senior_both = senior_both.reset_index(drop=True)

senior_girls = senior_girls.drop ('Sex', axis = 1)
senior_girls = senior_girls.reset_index(drop=True)

senior_boys = senior_boys.drop ('Sex', axis = 1)
senior_boys = senior_boys.reset_index(drop=True)

Then, we can set the value of the `Geolocation` column to the same format as the other datasets.

In [108]:
# setting the values of the region_names

elem_both ['Geolocation'] = region_names
elem_girls ['Geolocation'] = region_names
elem_boys ['Geolocation'] = region_names

In [109]:
# setting the values of the region_names

junior_both ['Geolocation'] = region_names
junior_girls ['Geolocation'] = region_names
junior_boys ['Geolocation'] = region_names

In [110]:
# setting the values of the region_names

senior_both ['Geolocation'] = region_names
senior_girls ['Geolocation'] = region_names
senior_boys ['Geolocation'] = region_names

Then, we can now convert the dataframes into their long representation, before using the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) function to make the column names more descriptive of the data in the columns.

In [111]:
elem_both = change_to_long (elem_both, '4.1 Elem Completion Rate')
elem_girls = change_to_long (elem_girls, '4.1 Elem Completion Rate (Female)')
elem_boys = change_to_long (elem_boys, '4.1 Elem Completion Rate (Male)')

In [112]:
junior_both = change_to_long (junior_both, '4.1 JHS Completion Rate')
junior_girls = change_to_long (junior_girls, '4.1 JHS Completion Rate (Female)')
junior_boys = change_to_long (junior_boys, '4.1 JHS Completion Rate (Male)')

In [113]:
senior_both = change_to_long (senior_both, '4.1 SHS Completion Rate')
senior_girls = change_to_long (senior_girls, '4.1 SHS Completion Rate (Female)')
senior_boys = change_to_long (senior_boys, '4.1 SHS Completion Rate (Male)')

As we have now made sure that each of division would be understandable even if combined with the combined dataset, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) each of them into the combined dataset

In [114]:
combined_data = combined_data.merge(elem_both, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(elem_girls, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(elem_boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [115]:
combined_data = combined_data.merge(junior_both, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(junior_girls, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(junior_boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [116]:
combined_data = combined_data.merge(senior_both, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(senior_girls, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

combined_data = combined_data.merge(senior_boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [117]:
combined_data [combined_data ['Year'] == 2020]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,3.7.2 Teenage pregnancy rates per 1000,4.1 Elem Completion Rate,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male)
360,PHILIPPINES,2020,,89.1064,89.2898,88.9318,81.4869,85.5003,77.6557,49.48,...,,82.51,84.681828,80.500538,82.111684,85.916052,78.412296,69.317762,74.609274,64.171123
361,NCR: National Capital Region,2020,,81.1478,81.6903,80.6316,85.1646,88.9647,81.5199,56.4435,...,,69.36,71.642798,67.256534,73.645274,77.586226,69.865084,56.255397,61.92332,50.881891
362,CAR: Cordillera Administrative Region,2020,,87.5276,86.4657,88.5518,84.9372,88.1383,81.866,52.8763,...,,94.56,95.523977,93.654864,87.860475,91.159161,84.679362,81.151454,85.668754,76.673288
363,Region 1: Ilocos Region,2020,,86.2185,85.8033,86.6105,90.1031,92.3834,87.9661,65.6379,...,,97.32,97.893346,96.793224,91.962261,94.392097,89.670049,84.450218,88.053999,81.027657
364,Region 2: Cagayan Valley,2020,,93.6348,93.1946,94.053,93.4367,97.0227,90.0406,61.4433,...,,98.56,99.130841,97.884624,95.806836,98.259478,93.394499,90.356108,94.490419,86.236932
365,Region 3: Central Luzon,2020,,95.4067,95.3453,95.4649,86.0948,89.9462,82.4587,60.0165,...,,86.62,86.855425,82.560382,84.012348,87.885741,80.325645,72.072685,77.756625,66.705252
366,Region 4A: CALABARZON,2020,,91.9912,92.497,91.5134,84.4897,88.3569,80.833,54.7999,...,,78.03,80.607718,75.661103,81.497899,85.640399,77.562889,64.199939,69.869538,58.854251
367,MIMAROPA: Southwestern Tagalog Region,2020,,86.2074,86.6139,85.8247,80.7633,84.9722,76.7903,50.2024,...,,90.55,92.31964,88.892115,84.393697,88.759497,80.195478,72.90367,78.930007,67.105459
368,Region 5: Bicol Region,2020,,87.2573,86.8802,87.6157,75.087,79.4398,70.953,43.518,...,,77.47,80.214233,74.958843,67.848866,74.010141,62.08974,51.050553,57.428523,45.098846
369,Region 6: Western Visayas,2020,,93.9281,93.4375,94.3939,87.906,91.3535,84.6275,48.2144,...,,93.46,95.107652,91.941081,93.506693,95.187537,91.875104,85.196948,89.311042,81.085195


#### 4.c.s2. Number of Technical-Vocational Education and Training (TVET) trainers trained
Next, we can load the ninth dataset.

In [118]:
data = pd.read_csv('data' + '/4.c.s2.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/4.c.s2.csv')
data

Unnamed: 0,4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,..,6518.0,11159.0,10118.0,10855.0,4023.0,7746.0,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,610.0,1028.0,1280.0,1409.0,782.0,1985.0,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,201.0,302.0,166.0,260.0,92.0,199.0,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,474.0,455.0,475.0,501.0,375.0,327.0,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,270.0,612.0,447.0,686.0,215.0,240.0,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,280.0,262.0,354.0,839.0,277.0,471.0,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,833.0,1067.0,1440.0,817.0,177.0,647.0,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,139.0,523.0,709.0,413.0,162.0,255.0,...


As usual, we would first be [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping the irrelevant rows. 

In [119]:
# drops the unnecessary rows, which start at index 20
data = data.drop(data.index[20:])

Then, as we know that the correct column headers are found at `Index 0`, we have to fix the values of this row to fully represent the data in the columns. This is why the **Year** value was changed into **Geolocation** because the values in this column are the rows of the country.

After this, we can now make the value of this row as the value of the column headers, before [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ping this row as it would not be used anymore. In line with this, we can also [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) the row of **NaN**s underneath this row.

In [120]:
data.at[0, '4.c.s2 Number of Technical-Vocational Education and Training (TVET) trainers trained'] = 'Geolocation'

In [121]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Then, we need to change the values of the `Geolocation` column to match the prescribed format for the region names.

In [122]:
data ['Geolocation'] = region_names   # setting the values of the region_names

After this, we need to clean the dataset by turning the string representation of missing or null values, which are '..' and '...', into **np.nan**. This would allow us to correctly use mathematical functions into these columns without errors arising due to strings.

In [123]:
data = replace_missing (data)

Once we have done this, we can convert the dataframe into its long representation, which would allow us to merge it with the combined dataset. Converting a dataframe that is in its wide representation into its long representation is made possible by the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.

However, using the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function would result into a three-column dataframe which has the following column names: (1) `Geolocation`, (2) `0`, and (3) `value`. The last two columns are not properly descriptive of the values of the column, which is why these two columns are [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)d. 

This was done using the `change_to_long` function.

In [124]:
data = change_to_long (data, '4.c TVET trainers trained')

As we now have a dataframe that is in its long representation, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it to the combined dataframe, with respect to the values of the `Geolocation` and `Year` columns. This means that a row from this dataframe would be [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)d into the combined dataset on the row that has the same `Geolocation` and `Year`. 

In [125]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [126]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 Elem Completion Rate,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,62.72,65.53,60.05,70.07,72.29,67.66,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,63.87,66.58,61.35,68.16,72.18,63.88,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,61.95,65.93,58.28,70.31,73.34,67.08,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,78.73,81.44,76.23,73.38,76.87,69.82,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,70.75,74.95,66.9,72.2,73.78,70.48,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


#### 7.1.1. Proportion of population with access to electricity

Now, we will proceed to loading the tenth dataset.

In [127]:
data = pd.read_csv('data' + '/7.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/7.1.1.csv') // AJ TO DO
data

Unnamed: 0,7.1.1 Proportion of population with access to electricity 1/,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,79.853466,80.9,89.62,90.65,91.09,96.12,92.96,94.49,...,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,97.737787,98.027099,98.023644,98.259443,103.023194,107.61482,100.0,100.0,...,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,78.061518,83.482667,87.491115,90.3,92.451772,95.622169,93.31971,94.3,...,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,83.917453,86.43631,93.663932,93.900447,99.154068,102.201664,98.687338,99.02,...,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,81.122927,83.534027,92.294218,93.095178,94.770382,97.913797,95.535692,99.63,...,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,93.178333,92.299013,96.871581,97.373164,107.082482,109.124555,99.429197,99.74,...,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,93.744276,92.033842,95.396047,96.670558,104.925142,110.433491,99.01244,99.17,...,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,67.231577,69.355541,82.033903,84.141455,82.919516,87.27024,90.250458,91.01,...,...


Before anything else, we drop the irrelevant rows.

In [128]:
data = data.drop(data.index[20:])

First, we will change the data in Index 0 at column **7.1.1 Proportion of population with access to electricity 1/** into `Geolocation` since our goal is to make the geolocation the first column of the dataframe. By doing this, `Index 0` now has the correct column headers. 

In [129]:
data.at[0,'7.1.1 Proportion of population with access to electricity 1/'] = 'Geolocation'

With this, we will now make the row of Index 0 as column headers. This is done by passing the row of `Index 0` to the `data.columns`. 

Following this, we drop this row (`Index 0`), as it will no longer be needed, as well as the row of NaNs underneath this row.

In [130]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
0,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,79.853466,80.9,89.62,90.65,91.09,96.12,92.96,94.49,...,...
1,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,97.737787,98.027099,98.023644,98.259443,103.023194,107.61482,100.0,100.0,...,...
2,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,78.061518,83.482667,87.491115,90.3,92.451772,95.622169,93.31971,94.3,...,...
3,..Region I,..,..,..,..,..,..,..,..,..,...,83.917453,86.43631,93.663932,93.900447,99.154068,102.201664,98.687338,99.02,...,...
4,..Region II,..,..,..,..,..,..,..,..,..,...,81.122927,83.534027,92.294218,93.095178,94.770382,97.913797,95.535692,99.63,...,...
5,..Region III,..,..,..,..,..,..,..,..,..,...,93.178333,92.299013,96.871581,97.373164,107.082482,109.124555,99.429197,99.74,...,...
6,..Region IV-A,..,..,..,..,..,..,..,..,..,...,93.744276,92.033842,95.396047,96.670558,104.925142,110.433491,99.01244,99.17,...,...
7,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,67.231577,69.355541,82.033903,84.141455,82.919516,87.27024,90.250458,91.01,...,...
8,..Region V,..,..,..,..,..,..,..,..,..,...,73.896402,72.962479,87.985816,89.964312,78.751847,82.385457,91.526618,93.65,...,...
9,..Region VI,..,..,..,..,..,..,..,..,..,...,72.969417,76.606153,91.451884,94.945718,86.057332,92.011066,90.91695,92.98,...,...


After checking if the order of the Geolocation is the same as what we intended, we will initialize the Geolocation column of the region names to make sure that the format of the region names in this dataset is the same as the currently combined dataset.

In [131]:
data ['Geolocation'] = region_names   # setting the values of the region_names

We will then change the '..' or '...' strings to NaN using `replace()` and setting the **value** to **np.nan**. Again, these missing or NaN values were not dropped because all years from 2000-2022 will be in the combined dataset. This is also to make combining the dataset easier.

In [132]:
data = replace_missing (data)

We can now convert the dataframe into its long representation using the `melt` function. This would allow us to merge it with the combined dataset since it reformats it into the same format as the combined data.

In [133]:
data = change_to_long (data, '7.1.1 Proportion of pop with electricity')
data

Unnamed: 0,Geolocation,Year,7.1.1 Proportion of pop with electricity
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
409,Region 10: Northern Mindanao,2022,
410,Region 11: Davao Region,2022,
411,Region 12: SOCCSKSARGEN,2022,
412,CARAGA: Cordillera Administrative Region,2022,


Lastly, we will now combine this dataset to the currently combined dataset.

In [134]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [135]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained,7.1.1 Proportion of pop with electricity
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,65.53,60.05,70.07,72.29,67.66,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,66.58,61.35,68.16,72.18,63.88,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,65.93,58.28,70.31,73.34,67.08,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,81.44,76.23,73.38,76.87,69.82,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,74.95,66.9,72.2,73.78,70.48,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


#### 8.1.1. Annual growth rate of real GDP per capita

Next, loading the eleventh dataset...

In [136]:
data = pd.read_csv('data' + '/8.1.1.csv')
data

Unnamed: 0,8.1.1 Annual growth rate of real GDP per capita,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001.0,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,2008.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,1.007914,1.691458,3.064526,4.541527,2.978552,3.372089,4.578884,2.463484,...,4.948185,4.573827,4.59497,5.376185,5.2485,4.746743,4.602268,-10.806602,4.2,...
3,..National Capital Region (NCR),..,0.841128,-0.916195,2.756815,6.355851,3.310895,3.809951,4.989652,2.771026,...,4.912506,3.842172,4.685639,5.5115,4.014293,4.087414,5.634448,-11.200476,...,...
4,..Cordillera Administrative Region (CAR),..,1.900838,2.60404,3.268534,3.487977,0.05674,2.068305,5.577005,2.103372,...,5.453843,3.919264,3.104855,1.624092,8.263996,4.611166,3.3938,-10.68223,...,...
5,..Region I,..,0.811589,1.704884,3.18301,3.889663,3.362599,4.594092,4.671401,1.774523,...,7.382397,5.465228,4.289072,6.919629,4.497287,4.914167,6.305168,-8.477176,...,...
6,..Region II,..,2.147953,0.089252,1.132812,7.426633,-3.238956,8.370662,5.103214,1.083639,...,7.554134,7.074622,2.846142,3.468013,6.478263,3.337775,5.689419,-10.815141,...,...
7,..Region III,..,3.600192,3.351012,2.107589,0.962747,1.72532,2.538984,3.720296,3.03303,...,3.967256,5.702767,4.082271,5.682582,8.215194,4.902421,4.003513,-15.362293,...,...
8,..Region IV-A,..,-1.683998,1.524221,1.948337,2.276289,2.138433,1.490928,2.573883,0.684335,...,4.635088,3.177655,4.289997,4.283138,5.252561,4.777452,2.452046,-12.218949,...,...
9,..MIMAROPA,..,1.664022,12.480291,7.750968,1.830638,7.306973,2.09092,7.61986,3.256205,...,2.722557,8.496021,3.320929,3.334896,4.76665,7.052686,2.957077,-8.745932,...,...


First and foremost, we will drop the irrelevant rows.

In [137]:
data = data.drop(data.index[20:])

Observing the header column and the row of Index 0, the data in the Index 0 is much more similar to the column names we want for the dataset, which is `[Geolocation | 2000 |  2001 | 2002 | ... |2022]`.

From this, it would be more hassle to (1) change all columns names in the current header column than (2) changing the data in Index 0 and setting it to be the header column. With this, we proceed to doing the second option.

Because we want the geolocation to be the first column of the dataframe, we'll update the data in `Index 0 Column 0` to **Geolocation**. With this, Index 0 has now the correct column headers. Then, will now make the value of this row as the value of the column headers using `data.columns`. 

Again, after this, the Index 0 and the row of NaN underneath it will be dropped.

In [138]:
data.at[0,'8.1.1 Annual growth rate of real GDP per capita'] = 'Geolocation'
data.head()

Unnamed: 0,8.1.1 Annual growth rate of real GDP per capita,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Geolocation,2000,2001.0,2002.0,2003.0,2004.0,2005.0,2006.0,2007.0,2008.0,...,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,1.007914,1.691458,3.064526,4.541527,2.978552,3.372089,4.578884,2.463484,...,4.948185,4.573827,4.59497,5.376185,5.2485,4.746743,4.602268,-10.806602,4.2,...
3,..National Capital Region (NCR),..,0.841128,-0.916195,2.756815,6.355851,3.310895,3.809951,4.989652,2.771026,...,4.912506,3.842172,4.685639,5.5115,4.014293,4.087414,5.634448,-11.200476,...,...
4,..Cordillera Administrative Region (CAR),..,1.900838,2.60404,3.268534,3.487977,0.05674,2.068305,5.577005,2.103372,...,5.453843,3.919264,3.104855,1.624092,8.263996,4.611166,3.3938,-10.68223,...,...


In [139]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

After confirming if the order of the `Geolocation` matches the currently combined dataset, we will initialize the Geolocation column of the region names for uniformity in Geolocation column.

In [140]:
data ['Geolocation'] = region_names    # setting the values of the region_names

To represent the missing values clearly, we change the the '..' or '...' strings to **NaN** using the np.nan.

In [141]:
data = replace_missing (data)

We can now convert the dataframe into its long representation to allow us to merge it with the combined dataset.

In [142]:
data = change_to_long (data, '8.1.1 Growth rate of real GDP per capita')

In [143]:
data[200:]

Unnamed: 0,Geolocation,Year,8.1.1 Growth rate of real GDP per capita
200,CAR: Cordillera Administrative Region,2011,-0.818431
201,Region 1: Ilocos Region,2011,2.760202
202,Region 2: Cagayan Valley,2011,5.009277
203,Region 3: Central Luzon,2011,5.734126
204,Region 4A: CALABARZON,2011,-0.346203
...,...,...,...
409,Region 10: Northern Mindanao,2022,
410,Region 11: Davao Region,2022,
411,Region 12: SOCCSKSARGEN,2022,
412,CARAGA: Cordillera Administrative Region,2022,


After this, we combine this dataset with the currently combined dataset.

In [144]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [145]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,60.05,70.07,72.29,67.66,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,61.35,68.16,72.18,63.88,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,58.28,70.31,73.34,67.08,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,76.23,73.38,76.87,69.82,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,66.9,72.2,73.78,70.48,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


#### 10.1.1. Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population

We will now load the twelfth dataset. 

As observed from the first column, there are two indicators present: **10.1.1.1 Bottom 40 percent of the population** and **10.1.1.2 Total Population**. Meaning, this dataset has two sub parts.

The following cells will demonstrate how this dataset will be merged with the currently combined dataset.

In [146]:
data = pd.read_csv('data' + '/10.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/10.1.1.csv') 
data

Unnamed: 0,10.1.1 Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
1,Indicator,,,,,,,,,,...,,,,,,,,,,
2,10.1.1.1 Bottom 40 percent of the population,..,..,..,..,..,..,..,..,..,...,..,..,7.406738,..,..,9.261018,..,..,...,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,5.883486,..,..,7.085782,..,..,...,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,6.635383,..,..,9.647948,..,..,...,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,7.453736,..,..,11.312069,..,..,...,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,8.900391,..,..,4.499523,..,..,...,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,6.519357,..,..,8.215452,..,..,...,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,6.307004,..,..,9.511903,..,..,...,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,9.530952,..,..,10.132831,..,..,...,...


Dropping the irrelevant rows...

In [147]:
data = data.drop(data.index[38:])

Once again, since the row of Index 0 is more similar to the arrangement of columns names we want: 

`[Geolocation | 2000 |  2001 | 2002 | ... |2022]`

We will proceed to changing  the data in `Index 0` to fit in our goal. With 
this, we change the data in `Index 0 Column 0` into **Geolocation**. Then, we will now make the value of this row as the value of the column headers.

After this, we drop this row (`Index 0`) as it would not be used anymore as well as the row of NaNs underneath this.

In [148]:
data.at[0,'10.1.1 Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population and the total population'] = 'Geolocation'

In [149]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

To represent the missing values clearly, we change the the '..' or '...' strings to NaN using `replace_missing`, which would turn these values to **NaN**.

In [150]:
data = replace_missing (data)

As observed in this dataset, we have two parts which are **10.1.1.1 Growth rates of household expenditure or income per capita among the bottom 40 percent of the population** and **10.1.1.2 Growth rates of household expenditure or income per capita among among the total Population**. Since we will need both parts, we will still get both parts to combine with other datasets. However, we will divide them into two different datasets.

In [151]:
data['Geolocation'].unique()

array(['10.1.1.1 Bottom 40 percent of the population',
       '..National Capital Region (NCR)',
       '..Cordillera Administrative Region (CAR)', '..Region I',
       '..Region II', '..Region III', '..Region IV-A', '..MIMAROPA',
       '..Region V', '..Region VI', '..Region VII', '..Region VIII',
       '..Region IX', '..Region X', '..Region XI', '..Region XII',
       '..Caraga', '..BARMM', '10.1.1.2 Total Population'], dtype=object)

**10.1.1.1 Bottom 40 percent of the population** goes to `bottom_popu_data` while **10.1.1.2 Total Population** goes to `total_popu_data`.

In [152]:
bottom_popu_data = data [0:18]
total_popu_data = data [18:]

Since `total_popu_data` started with index 18, we will reset its starting index using `.reset_index`. 

Also, since the first row of each of the parts is a record for the Philippines and the order of the geolocation of each dataframe is correct, we will proceed to initializing both parts with the `region_names` for uniformity.

In [153]:
total_popu_data = total_popu_data.reset_index (drop=True)

In [154]:
# setting the values of the region_names

bottom_popu_data ['Geolocation'] = region_names
total_popu_data ['Geolocation'] = region_names

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bottom_popu_data ['Geolocation'] = region_names


This shows the updated dataframe for the first part of this dataset which is **10.1.1.1 Growth rates of household expenditure or income per capita among the bottom 40 percent of the population**.

In [155]:
bottom_popu_data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,7.406738,,,9.261018,,,,
1,NCR: National Capital Region,,,,,,,,,,...,,,5.883486,,,7.085782,,,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,6.635383,,,9.647948,,,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,7.453736,,,11.312069,,,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,8.900391,,,4.499523,,,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,6.519357,,,8.215452,,,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,6.307004,,,9.511903,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,9.530952,,,10.132831,,,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,8.240401,,,8.730943,,,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,9.042917,,,8.430302,,,,


This shows the updated dataframe for the second part of this dataset which is **10.1.1.2 Growth rates of household expenditure or income per capita among among the total Population***.

In [156]:
total_popu_data 

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015.0,2016,2017,2018.0,2019,2020,2021,2022
0,PHILIPPINES,,,,,,,,,,...,,,5.045087,,,6.522714,,,,
1,NCR: National Capital Region,,,,,,,,,,...,,,4.570268,,,3.84488,,,,
2,CAR: Cordillera Administrative Region,,,,,,,,,,...,,,1.658328,,,10.502686,,,,
3,Region 1: Ilocos Region,,,,,,,,,,...,,,3.572707,,,9.985572,,,,
4,Region 2: Cagayan Valley,,,,,,,,,,...,,,7.312018,,,5.014911,,,,
5,Region 3: Central Luzon,,,,,,,,,,...,,,5.508813,,,5.008588,,,,
6,Region 4A: CALABARZON,,,,,,,,,,...,,,4.059653,,,7.614648,,,,
7,MIMAROPA: Southwestern Tagalog Region,,,,,,,,,,...,,,9.318983,,,5.518178,,,,
8,Region 5: Bicol Region,,,,,,,,,,...,,,5.21329,,,9.110868,,,,
9,Region 6: Western Visayas,,,,,,,,,,...,,,1.901536,,,7.777133,,,,


After this, we can now proceed to converting both dataframes into their long representation to allow us to merge both of them with the combined dataset easily.

In [157]:
bottom_popu_data = change_to_long (bottom_popu_data, '10.1.1.1 Income per capita growth rate of bottom 40')
total_popu_data = change_to_long (total_popu_data, '10.1.1.2 Income per capita growth rate')

Finally, we will now combine the two separated parts of the dataset with the currently combined data.

In [158]:
# Adding the 10.1.1.1 dataset with the current combined dataset
combined_data = combined_data.merge(bottom_popu_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

# Adding the 10.1.1.2 dataset with the current combined dataset
combined_data = combined_data.merge(total_popu_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [159]:
combined_data [combined_data ['Year'] == 2018]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate
324,PHILIPPINES,2018,16.7,94.05,93.85,94.25,81.41,85.82,77.24,51.24,...,92.97,84.73,81.01,84.78,77.23,10118.0,96.12,4.746743,9.261018,6.522714
325,NCR: National Capital Region,2018,2.2,92.11,92.38,91.85,88.74,92.64,85.04,68.63,...,96.13,89.49,82.64,86.92,78.35,1280.0,107.61482,4.087414,7.085782,3.84488
326,CAR: Cordillera Administrative Region,2018,12.0,92.24,90.99,93.45,83.64,88.13,79.4,53.64,...,93.18,81.04,81.07,88.36,73.88,166.0,95.622169,4.611166,9.647948,10.502686
327,Region 1: Ilocos Region,2018,9.9,90.48,89.67,91.26,87.81,90.68,85.14,64.06,...,94.6,87.35,83.26,86.38,80.25,475.0,102.201664,4.914167,11.312069,9.985572
328,Region 2: Cagayan Valley,2018,16.3,96.86,96.36,97.32,84.76,89.01,80.78,56.21,...,92.63,84.36,82.32,86.12,78.47,447.0,97.913797,3.337775,4.499523,5.014911
329,Region 3: Central Luzon,2018,7.0,98.77,98.32,99.2,85.01,88.83,81.44,60.19,...,93.91,86.96,81.76,84.95,78.57,354.0,109.124555,4.902421,8.215452,5.008588
330,Region 4A: CALABARZON,2018,7.1,97.36,97.37,97.34,86.38,89.97,83.01,58.33,...,96.35,89.79,81.66,85.25,77.8,1440.0,110.433491,4.777452,9.511903,7.614648
331,MIMAROPA: Southwestern Tagalog Region,2018,15.1,90.99,91.1,90.88,79.74,83.95,75.79,48.14,...,87.63,78.79,70.69,74.48,66.95,709.0,87.27024,7.052686,10.132831,5.518178
332,Region 5: Bicol Region,2018,27.0,93.12,92.38,93.82,82.97,87.51,78.71,45.8,...,92.07,80.84,78.34,83.2,73.5,543.0,82.385457,5.667794,8.730943,9.110868
333,Region 6: Western Visayas,2018,16.3,97.38,96.47,98.24,84.54,89.21,80.14,49.74,...,92.38,84.74,79.46,83.13,75.75,598.0,92.011066,3.777875,8.430302,7.777133


#### 14.5.1. Coverage of protected areas in relation to marine areas

We will now read the thirteenth dataset. 

As observed in this dataset, the data set has **two parts** as there are two indicators seen in the first column. This will be further discussed in the following cells.

In [160]:
data = pd.read_csv('data' + '/14.5.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/14.5.1.csv')
data

Unnamed: 0,14.5.1 Coverage of protected areas in relation to marine areas,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,Year,,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018,2019.0,2020.0,2021,2022
1,Sub-Indicator,Geolocation,,,,,,,,,...,,,,,,,,,,
2,14.5.1.1 Coverage of protected areas in relati...,PHILIPPINES,..,..,..,..,..,..,..,..,...,..,..,..,1.4121254399999998,..,..,3.143559,3.143559,...,...
3,,..National Capital Region (NCR),..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.000108,0.000108,...,...
4,,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.0,0.0,...,...
5,,..Region I,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.012083,0.012083,...,...
6,,..Region II,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.280804,0.280804,...,...
7,,..Region III,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.364699,0.364699,...,...
8,,..Region IV-A,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.00061,0.00061,...,...
9,,..MIMAROPA,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.635796,0.635796,...,...


As usual, we drop the irrelevant columns.

In [161]:
data = data.drop (data.index [38:])

Same process as the previous datasets, we will evaluate which is less hassle, (1) revising the header column or (2) revising the row of Index 0 to fit the goal arrangement of columns and putting it the header column after. 

Since option 2 is still the  better choice, we will now edit the data in column 0 & 1 at Index 0 to make the whole Index 0 look like the column headers we desire, which is `[Geolocation | 2000 |  2001 | 2002 | ... |2022]`. Then, we set the Index 0 to become the header columns. After this, we drop the Index 0 and the row of NaNs underneath it.

In addition, there is a unique column, `column 0`, which contains the indicators of the sub parts of this dataset. With this, the first column was renamed to `Indicator`.

In [162]:
data.at[0, '14.5.1 Coverage of protected areas in relation to marine areas'] = 'Indicator'
data.at[0, 'Unnamed: 1'] = 'Geolocation'

In [163]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data.head()

Unnamed: 0,Indicator,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018,2019.0,2020.0,2021,2022
0,14.5.1.1 Coverage of protected areas in relati...,PHILIPPINES,..,..,..,..,..,..,..,..,...,..,..,..,1.4121254399999998,..,..,3.143559,3.143559,...,...
1,,..National Capital Region (NCR),..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.000108,0.000108,...,...
2,,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.0,0.0,...,...
3,,..Region I,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.012083,0.012083,...,...
4,,..Region II,..,..,..,..,..,..,..,..,...,..,..,..,..,..,..,0.280804,0.280804,...,...


To represent the missing values clearly, we change the the '..' or '...' strings to NaN.

In [164]:
data = replace_missing (data)

As observed in this dataset, we have two parts which are **14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)** and **14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS ans Locally managed MPAs 1/**. Since we will need both parts, we will consider both parts in combining with other datasets. However, they will be merged separately.

For this, we will retain the `Indicator` column first, which contains the name of the parts, for identifying how this dataset will be divided. 

In [165]:
data['Indicator'].unique()

array(['14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)',
       nan,
       '14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS ans Locally managed MPAs 1/'],
      dtype=object)

**14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)** goes to `universe_data` while **14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS ans Locally managed MPAs 1/** goes to `nipas_data`. 

Since the `nipas_data` will start at Index 18, we will reset it to `Index 0` after the division using the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function.

In [166]:
universe_data = data [0:18]
nipas_data = data [18:]

In [167]:
nipas_data = nipas_data.reset_index (drop=True)
nipas_data

Unnamed: 0,Indicator,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,...,2013,2014,2015,2016,2017,2018,2019.0,2020.0,2021,2022
0,14.5.1.2 Coverage of protected areas in relati...,PHILIPPINES,,,,,,,,,...,,,,0.647,,,1.42,1.42,,
1,,..National Capital Region (NCR),,,,,,,,,...,,,,,,,4.9e-05,4.9e-05,,
2,,..Cordillera Administrative Region (CAR),,,,,,,,,...,,,,,,,0.0,0.0,,
3,,..Region I,,,,,,,,,...,,,,,,,0.005476,0.005476,,
4,,..Region II,,,,,,,,,...,,,,,,,0.127265,0.127265,,
5,,..Region III,,,,,,,,,...,,,,,,,0.165288,0.165288,,
6,,..Region IV-A,,,,,,,,,...,,,,,,,0.000277,0.000277,,
7,,..MIMAROPA,,,,,,,,,...,,,,,,,0.288154,0.288154,,
8,,..Region V,,,,,,,,,...,,,,,,,0.179797,0.179797,,
9,,..Region VI,,,,,,,,,...,,,,,,,0.015022,0.015022,,


Since the dividing of the dataset is done, we won't be needing the `Indicator` column anymore. Therefore, we drop the Indicator column from both of the separated dataframes.

In [168]:
universe_data = universe_data.drop('Indicator', axis = 1)
nipas_data = nipas_data.drop('Indicator', axis = 1)

Since the order of the geolocation of each dataframe is correct, we will proceed to initializing it with the `region_names` for uniformity.

In [169]:
# setting the values of the region_names

universe_data ['Geolocation'] = region_names
nipas_data ['Geolocation'] = region_names

We can now convert both dataframes into their long representation to allow us to merge both of them with the combined dataset.

In [170]:
# 14.5.1.1
universe_data = change_to_long (universe_data, '14.5.1.1 Coverage of protected areas')

# 14.5.1.2
nipas_data = change_to_long (nipas_data, '14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs')

Finally, we combine the two different datasets with the currently combined data.

In [171]:
# Adding the 14.5.1.1 dataset with the current combined dataset
combined_data = combined_data.merge(universe_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
# Adding the 14.5.1.2 dataset with the current combined dataset
combined_data = combined_data.merge(nipas_data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [172]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


#### 16.1.1 Number of victims of intentional homicide (per 100,000 population)

We will now load the fourteenth dataset.

In [173]:
data = pd.read_csv('data' + '/16.1.1.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/16.1.1.csv')
data

Unnamed: 0,"16.1.1 Number of victims of intentional homicide (per 100,000 population) 1/",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,..,12.110579,8.648427,6.502755,5.660457,4.753062,4.396605,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,17.739571,11.632286,5.448315,4.620971,4.766508,3.551425,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,10.348515,3.413519,3.999527,4.799426,3.595654,2.688292,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,11.031909,6.278588,4.441518,3.540128,2.921754,2.785503,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,9.188067,7.405935,5.838454,4.613656,3.198696,2.761486,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,9.704967,6.153771,3.725722,3.502542,2.444428,2.517212,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,11.043513,9.088327,6.336361,5.265942,3.991954,4.13175,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,7.284387,5.309139,5.20302,4.114172,3.779695,3.36217,...


Again, we will drop the irrelevant rows first.

In [174]:
data = data.drop(data.index[20:])

Since `Index 0` is almost the same as the column header we want, , which is `[Geolocation | 2000 |  2001 | 2002 | ... |2022]`. We will just change the content in the first column to **Geolocation**. This also because the column already contains the regions of the Philippines.

In [175]:
data.at[0,'16.1.1 Number of victims of intentional homicide (per 100,000 population) 1/'] = 'Geolocation'

Then, we set the `Index 0` to become the header column. 

As usual, after updating the header column, we will drop Index 0 and the rows of NANs underneath it since we will not be needing this later.

In [176]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

Again, we will now check the order of the Geolocation if it matches the combined dataset. Since it matches, we will proceed to initializing the Geolocation with region_names to make the naming of Geolocation uniformed.

In [177]:
data ['Geolocation'] = region_names    # setting the values of the region_names

As usual, we will next use the `replace_missing` function and set the values of '..' or '...' strings to NaN.

In [178]:
data = replace_missing (data)

We can now convert the dataframe into its long representation using the `change_to_long` function. This would rearrange the format of the dataframe in a way that would allow us to merge it with the combined dataset.

In [179]:
data = change_to_long (data, '16.1.1 Victims of intentional homicide per 100,000')
data

Unnamed: 0,Geolocation,Year,"16.1.1 Victims of intentional homicide per 100,000"
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
409,Region 10: Northern Mindanao,2022,
410,Region 11: Davao Region,2022,
411,Region 12: SOCCSKSARGEN,2022,
412,CARAGA: Cordillera Administrative Region,2022,


Finally, this dataset can be combined with the currently combined dataset.

In [180]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [181]:
combined_data [combined_data ['Year'] == 2016]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000"
288,PHILIPPINES,2016,,96.15,96.12,96.17,74.19,79.94,68.79,37.38,...,,,6518.0,90.65,5.376185,,,1.4121254399999998,0.647,12.110579
289,NCR: National Capital Region,2016,,95.92,96.58,95.3,83.37,88.33,78.7,55.32,...,,,610.0,98.259443,5.5115,,,,,17.739571
290,CAR: Cordillera Administrative Region,2016,,97.24,96.51,97.94,78.72,85.5,72.37,40.16,...,,,201.0,90.3,1.624092,,,,,10.348515
291,Region 1: Ilocos Region,2016,,94.84,94.14,95.5,84.85,89.18,80.86,51.11,...,,,474.0,93.900447,6.919629,,,,,11.031909
292,Region 2: Cagayan Valley,2016,,100.26,99.82,100.68,78.97,84.53,73.82,43.41,...,,,270.0,93.095178,3.468013,,,,,9.188067
293,Region 3: Central Luzon,2016,,98.53,98.44,98.62,82.78,87.52,78.39,47.96,...,,,280.0,97.373164,5.682582,,,,,9.704967
294,Region 4A: CALABARZON,2016,,97.2,97.46,96.94,81.41,86.46,76.69,45.61,...,,,833.0,96.670558,4.283138,,,,,11.043513
295,MIMAROPA: Southwestern Tagalog Region,2016,,94.98,94.88,95.07,73.48,79.26,68.1,35.09,...,,,139.0,84.141455,3.334896,,,,,7.284387
296,Region 5: Bicol Region,2016,,95.77,95.12,96.38,72.78,79.33,66.68,28.35,...,,,467.0,89.964312,5.786055,,,,,8.064076
297,Region 6: Western Visayas,2016,,99.09,98.42,99.72,74.2,80.81,68.02,32.54,...,,,357.0,94.945718,4.80852,,,,,7.121844


#### 16.1.s1 Number of murder cases

We are now loading our fifteenth dataset.

In [182]:
data = pd.read_csv('data' + '/16.1.s1.csv')
data

Unnamed: 0,16.1.s1 Number of murder cases,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Year,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022
1,Geolocation,,,,,,,,,,...,,,,,,,,,,
2,PHILIPPINES,..,..,..,..,..,..,..,..,..,...,..,..,..,12417.0,9009.0,6877.0,6073.0,5170.0,4845.0,...
3,..National Capital Region (NCR),..,..,..,..,..,..,..,..,..,...,..,..,..,2318.0,1543.0,733.0,630.0,658.0,496.0,...
4,..Cordillera Administrative Region (CAR),..,..,..,..,..,..,..,..,..,...,..,..,..,180.0,60.0,71.0,86.0,65.0,49.0,...
5,..Region I,..,..,..,..,..,..,..,..,..,...,..,..,..,560.0,322.0,230.0,185.0,154.0,148.0,...
6,..Region II,..,..,..,..,..,..,..,..,..,...,..,..,..,321.0,262.0,209.0,167.0,117.0,102.0,...
7,..Region III,..,..,..,..,..,..,..,..,..,...,..,..,..,1110.0,718.0,443.0,424.0,301.0,315.0,...
8,..Region IV-A,..,..,..,..,..,..,..,..,..,...,..,..,..,1628.0,1371.0,977.0,829.0,641.0,676.0,...
9,..MIMAROPA,..,..,..,..,..,..,..,..,..,...,..,..,..,219.0,162.0,161.0,129.0,120.0,108.0,...


Again, we first drop the irrelevant rows.

In [183]:
data = data.drop(data.index[20:])

Again, since `Index 0` is almost the same as our goal column header (`[Geolocation | 2000 |  2001 | 2002 | ... |2022]`), we will just change the content in the first column to `Geolocation`. Then, we set the `Index 0` to become the header column. After this, we will drop `Index 0` and the rows of **NANs** underneath it.

In [184]:
data.at[0,'16.1.s1 Number of murder cases'] = 'Geolocation'

In [185]:
data.columns = data.loc[0]
data = data.drop (data.index[0])
data = data.reset_index (drop=True)

data = data.drop (data.index[0])
data = data.reset_index (drop=True)

We will now check the order of the Geolocation if it is the same as the combined dataset. Then, to make the naming of Geolocation uniformed, we will initialized the Geolocation with region_names.

In [186]:
data ['Geolocation'] = region_names   # setting the values of the region_names

We will then replace the '..' or '...' strings to NaN.

In [187]:
data = replace_missing (data)

Using the `change_to_long` function, which utilizes the [`melt`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function, this would format the dataframe in a the same way as the format of the combined dataset.

In [188]:
data = change_to_long (data, '16.1.s1 Number of murder cases')
data

Unnamed: 0,Geolocation,Year,16.1.s1 Number of murder cases
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
409,Region 10: Northern Mindanao,2022,
410,Region 11: Davao Region,2022,
411,Region 12: SOCCSKSARGEN,2022,
412,CARAGA: Cordillera Administrative Region,2022,


After all of this, we can now combine this dataset with the currently combined dataset.

In [189]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.1 SHS Completion Rate (Male),4.c TVET trainers trained,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


#### Other Non-SDG datasets
These are datasets that can provide us with more context when exploring the datasets for the Sustainable Development Goals

##### Changes in Inventories, by Region

We will now proceed to loading the sixteenth dataset. 

Observing the result of reading this file, the content has semicolon (`;`) in between data and are not separated in cells.

In [190]:
data = pd.read_csv('data' + '/Changes in Inventories, by Region.csv')
data

Unnamed: 0,"Changes in Inventories, by Region"
0,"Region;""At Current Prices 2000"";""At Current Pr..."
1,..National Capital Region (NCR);2177317;324076...
2,..Cordillera Administrative Region (CAR);-6416...
3,..Region I (Ilocos Region);-1891391;-415859;-2...
4,..Region II (Cagayan Valley);5458610;6710711;7...
5,..Region III (Central Luzon);-52958073;-261358...
6,..Region IV-A (CALABARZON);-111297796;-2883045...
7,..MIMAROPA Region;2369275;2492120;1885432;1015...
8,..Region V (Bicol Region);5267903;6003300;6176...
9,..Region VI (Western Visayas);12238329;1219391...


With this, we will make use of the **delimeter** parameter of `read_csv()` and set the `delimeter=";"` to tell how the content of the file will be separated, in this case, the content will be separated by the semicolon `;`.

In [191]:
data = pd.read_csv('data' + '/Changes in Inventories, by Region.csv', delimiter=";")
data

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,"Changes in Inventories, by Region"
Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,At Current Prices 2009,At Current Prices 2010,At Current Prices 2011,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
..National Capital Region (NCR),2177317,32407610,47446140,38680890,90746761,-21121052,-62074630,-72647490,-16791691,-78327162,-8382259,37960386,-53978335,-25567519,-9498709,-78887754,-61168427,-3527878,-8952050,-170725161,-261889025,-73053461
..Cordillera Administrative Region (CAR),-6416286,-3465763,-854248,2816938,5942522,922235,1539829,2224400,1699529,4102771,10081530,20906816,268953,-2783987,-5031597,-1869509,2845014,-1400306,-12287003,-21374032,-9164951,-23580868
..Region I (Ilocos Region),-1891391,-415859,-2366347,1563172,11013024,2156933,-12615425,-7528468,-1384595,-2241727,17223201,22315022,-424963,-891655,2641538,20010380,161771,1990840,-344781,3195818,-12806475,-9173717
..Region II (Cagayan Valley),5458610,6710711,7073336,7434120,14501407,21591202,19634112,24258860,18866447,21317690,24698127,26715349,20349948,-3805473,3196568,-6693576,833107,-1813875,-1033799,4908941,-14374477,-27952163
..Region III (Central Luzon),-52958073,-26135881,-755268,-11413915,-18880906,-4770170,-23168248,-27299916,-1141473,-4638249,-25471626,9838677,15729819,14493604,-7236367,14886473,8706663,-2730350,5584131,-34811984,-162144251,-16635169
..Region IV-A (CALABARZON),-111297796,-28830457,-9172498,-5564609,-7379076,30758718,-40148922,-56711613,-51903824,-70383785,-74356788,-65654452,-58510828,-11077119,7465112,-89593345,-7623979,4054462,8412077,98541144,-152583773,-15075549
..MIMAROPA Region,2369275,2492120,1885432,1015582,-2997026,1792345,-6738316,-15985915,1808683,-4735049,-5964736,5318716,512923,-6946834,580986,-5855580,-10614285,-5470926,813229,14079039,-17843634,752871
..Region V (Bicol Region),5267903,6003300,6176622,133123,2225757,3413493,1929194,350885,5154369,3798417,13023064,8723895,7479267,8305364,9615771,-4034708,8943283,2236237,-487965,-1902098,-6441298,-19629124
..Region VI (Western Visayas),12238329,12193914,10581660,5708728,6233671,1445825,-6533844,5404782,15823095,6024890,11296248,21788503,3128975,4547658,-7722981,-15516499,3271575,-2277977,-3692731,3767878,5302810,-12227335


Now, we successfully separated the content of the file properly. However, all of the text in the previous output are bold letters. With this, we will work on making the output the same as the usual output of the dataframe which is having only the header column in bold letters. 

For this, we will make use of the **header** parameter of [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and set the `header` value to 1, to specify which row will be the column names for this dataframe.

In [192]:
data = pd.read_csv('data' + '/Changes in Inventories, by Region.csv', header=1, delimiter=";")
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,..National Capital Region (NCR),2177317,32407610,47446140,38680890,90746761,-21121052,-62074630,-72647490,-16791691,...,-53978335,-25567519,-9498709,-78887754,-61168427,-3527878,-8952050,-170725161,-261889025,-73053461
1,..Cordillera Administrative Region (CAR),-6416286,-3465763,-854248,2816938,5942522,922235,1539829,2224400,1699529,...,268953,-2783987,-5031597,-1869509,2845014,-1400306,-12287003,-21374032,-9164951,-23580868
2,..Region I (Ilocos Region),-1891391,-415859,-2366347,1563172,11013024,2156933,-12615425,-7528468,-1384595,...,-424963,-891655,2641538,20010380,161771,1990840,-344781,3195818,-12806475,-9173717
3,..Region II (Cagayan Valley),5458610,6710711,7073336,7434120,14501407,21591202,19634112,24258860,18866447,...,20349948,-3805473,3196568,-6693576,833107,-1813875,-1033799,4908941,-14374477,-27952163
4,..Region III (Central Luzon),-52958073,-26135881,-755268,-11413915,-18880906,-4770170,-23168248,-27299916,-1141473,...,15729819,14493604,-7236367,14886473,8706663,-2730350,5584131,-34811984,-162144251,-16635169
5,..Region IV-A (CALABARZON),-111297796,-28830457,-9172498,-5564609,-7379076,30758718,-40148922,-56711613,-51903824,...,-58510828,-11077119,7465112,-89593345,-7623979,4054462,8412077,98541144,-152583773,-15075549
6,..MIMAROPA Region,2369275,2492120,1885432,1015582,-2997026,1792345,-6738316,-15985915,1808683,...,512923,-6946834,580986,-5855580,-10614285,-5470926,813229,14079039,-17843634,752871
7,..Region V (Bicol Region),5267903,6003300,6176622,133123,2225757,3413493,1929194,350885,5154369,...,7479267,8305364,9615771,-4034708,8943283,2236237,-487965,-1902098,-6441298,-19629124
8,..Region VI (Western Visayas),12238329,12193914,10581660,5708728,6233671,1445825,-6533844,5404782,15823095,...,3128975,4547658,-7722981,-15516499,3271575,-2277977,-3692731,3767878,5302810,-12227335
9,..Region VII (Central Visayas),14163801,23634467,15639464,15346129,14804151,2064083,-22029093,-27281777,-23683350,...,-24630912,-15927072,1646648,17167841,3895526,-1260411,-2452424,1950007,-7527662,15804194


Also, since the ordering of the Geolocation is different in this dataset, we will be rearranging the rows based on the order of the Geolocation in `region_names`.

In [193]:
data = data.reindex (index = [17, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
17,Philippines,-136845782,24650494,91133746,40200557,115195365,20573012,-178148844,-198481639,-10934733,...,-42228645,-18218400,6863551,-125424625,-58265758,-15471546,-26943785,-148526300,-695464321,-216438323
0,..National Capital Region (NCR),2177317,32407610,47446140,38680890,90746761,-21121052,-62074630,-72647490,-16791691,...,-53978335,-25567519,-9498709,-78887754,-61168427,-3527878,-8952050,-170725161,-261889025,-73053461
1,..Cordillera Administrative Region (CAR),-6416286,-3465763,-854248,2816938,5942522,922235,1539829,2224400,1699529,...,268953,-2783987,-5031597,-1869509,2845014,-1400306,-12287003,-21374032,-9164951,-23580868
2,..Region I (Ilocos Region),-1891391,-415859,-2366347,1563172,11013024,2156933,-12615425,-7528468,-1384595,...,-424963,-891655,2641538,20010380,161771,1990840,-344781,3195818,-12806475,-9173717
3,..Region II (Cagayan Valley),5458610,6710711,7073336,7434120,14501407,21591202,19634112,24258860,18866447,...,20349948,-3805473,3196568,-6693576,833107,-1813875,-1033799,4908941,-14374477,-27952163
4,..Region III (Central Luzon),-52958073,-26135881,-755268,-11413915,-18880906,-4770170,-23168248,-27299916,-1141473,...,15729819,14493604,-7236367,14886473,8706663,-2730350,5584131,-34811984,-162144251,-16635169
5,..Region IV-A (CALABARZON),-111297796,-28830457,-9172498,-5564609,-7379076,30758718,-40148922,-56711613,-51903824,...,-58510828,-11077119,7465112,-89593345,-7623979,4054462,8412077,98541144,-152583773,-15075549
6,..MIMAROPA Region,2369275,2492120,1885432,1015582,-2997026,1792345,-6738316,-15985915,1808683,...,512923,-6946834,580986,-5855580,-10614285,-5470926,813229,14079039,-17843634,752871
7,..Region V (Bicol Region),5267903,6003300,6176622,133123,2225757,3413493,1929194,350885,5154369,...,7479267,8305364,9615771,-4034708,8943283,2236237,-487965,-1902098,-6441298,-19629124
8,..Region VI (Western Visayas),12238329,12193914,10581660,5708728,6233671,1445825,-6533844,5404782,15823095,...,3128975,4547658,-7722981,-15516499,3271575,-2277977,-3692731,3767878,5302810,-12227335


Then, we will proceed to reindexing the rows.

In [194]:
data = data.reset_index (drop=True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,Philippines,-136845782,24650494,91133746,40200557,115195365,20573012,-178148844,-198481639,-10934733,...,-42228645,-18218400,6863551,-125424625,-58265758,-15471546,-26943785,-148526300,-695464321,-216438323
1,..National Capital Region (NCR),2177317,32407610,47446140,38680890,90746761,-21121052,-62074630,-72647490,-16791691,...,-53978335,-25567519,-9498709,-78887754,-61168427,-3527878,-8952050,-170725161,-261889025,-73053461
2,..Cordillera Administrative Region (CAR),-6416286,-3465763,-854248,2816938,5942522,922235,1539829,2224400,1699529,...,268953,-2783987,-5031597,-1869509,2845014,-1400306,-12287003,-21374032,-9164951,-23580868
3,..Region I (Ilocos Region),-1891391,-415859,-2366347,1563172,11013024,2156933,-12615425,-7528468,-1384595,...,-424963,-891655,2641538,20010380,161771,1990840,-344781,3195818,-12806475,-9173717
4,..Region II (Cagayan Valley),5458610,6710711,7073336,7434120,14501407,21591202,19634112,24258860,18866447,...,20349948,-3805473,3196568,-6693576,833107,-1813875,-1033799,4908941,-14374477,-27952163
5,..Region III (Central Luzon),-52958073,-26135881,-755268,-11413915,-18880906,-4770170,-23168248,-27299916,-1141473,...,15729819,14493604,-7236367,14886473,8706663,-2730350,5584131,-34811984,-162144251,-16635169
6,..Region IV-A (CALABARZON),-111297796,-28830457,-9172498,-5564609,-7379076,30758718,-40148922,-56711613,-51903824,...,-58510828,-11077119,7465112,-89593345,-7623979,4054462,8412077,98541144,-152583773,-15075549
7,..MIMAROPA Region,2369275,2492120,1885432,1015582,-2997026,1792345,-6738316,-15985915,1808683,...,512923,-6946834,580986,-5855580,-10614285,-5470926,813229,14079039,-17843634,752871
8,..Region V (Bicol Region),5267903,6003300,6176622,133123,2225757,3413493,1929194,350885,5154369,...,7479267,8305364,9615771,-4034708,8943283,2236237,-487965,-1902098,-6441298,-19629124
9,..Region VI (Western Visayas),12238329,12193914,10581660,5708728,6233671,1445825,-6533844,5404782,15823095,...,3128975,4547658,-7722981,-15516499,3271575,-2277977,-3692731,3767878,5302810,-12227335


After this, we will now change the columns names: (1) `Region` to `Geolocation`, (2) `At Current Prices <Year>` to `<Year>`

In [195]:
data.columns = ['Geolocation', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
               '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017','2018', '2019', '2020', '2021']

After this, we will insert the region_names in the Geolocation column so that the format of the region_names will fit the ones in the combined data.

In [196]:
data ['Geolocation'] = region_names    # setting the values of the region_names
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,PHILIPPINES,-136845782,24650494,91133746,40200557,115195365,20573012,-178148844,-198481639,-10934733,...,-42228645,-18218400,6863551,-125424625,-58265758,-15471546,-26943785,-148526300,-695464321,-216438323
1,NCR: National Capital Region,2177317,32407610,47446140,38680890,90746761,-21121052,-62074630,-72647490,-16791691,...,-53978335,-25567519,-9498709,-78887754,-61168427,-3527878,-8952050,-170725161,-261889025,-73053461
2,CAR: Cordillera Administrative Region,-6416286,-3465763,-854248,2816938,5942522,922235,1539829,2224400,1699529,...,268953,-2783987,-5031597,-1869509,2845014,-1400306,-12287003,-21374032,-9164951,-23580868
3,Region 1: Ilocos Region,-1891391,-415859,-2366347,1563172,11013024,2156933,-12615425,-7528468,-1384595,...,-424963,-891655,2641538,20010380,161771,1990840,-344781,3195818,-12806475,-9173717
4,Region 2: Cagayan Valley,5458610,6710711,7073336,7434120,14501407,21591202,19634112,24258860,18866447,...,20349948,-3805473,3196568,-6693576,833107,-1813875,-1033799,4908941,-14374477,-27952163
5,Region 3: Central Luzon,-52958073,-26135881,-755268,-11413915,-18880906,-4770170,-23168248,-27299916,-1141473,...,15729819,14493604,-7236367,14886473,8706663,-2730350,5584131,-34811984,-162144251,-16635169
6,Region 4A: CALABARZON,-111297796,-28830457,-9172498,-5564609,-7379076,30758718,-40148922,-56711613,-51903824,...,-58510828,-11077119,7465112,-89593345,-7623979,4054462,8412077,98541144,-152583773,-15075549
7,MIMAROPA: Southwestern Tagalog Region,2369275,2492120,1885432,1015582,-2997026,1792345,-6738316,-15985915,1808683,...,512923,-6946834,580986,-5855580,-10614285,-5470926,813229,14079039,-17843634,752871
8,Region 5: Bicol Region,5267903,6003300,6176622,133123,2225757,3413493,1929194,350885,5154369,...,7479267,8305364,9615771,-4034708,8943283,2236237,-487965,-1902098,-6441298,-19629124
9,Region 6: Western Visayas,12238329,12193914,10581660,5708728,6233671,1445825,-6533844,5404782,15823095,...,3128975,4547658,-7722981,-15516499,3271575,-2277977,-3692731,3767878,5302810,-12227335


We will then change the '..' or '...' strings to NaN, representing the missing values.

In [197]:
data = replace_missing (data)

After this, we can now convert the dataframe into its long representation using the `change_to_long` function. This would make merging with the combined data easier since it reformats it into the same format as the combined data.

In [198]:
data = change_to_long (data, 'Changes in Inventories')

Finally, we will now combine this dataset to the currently combined dataset.

In [199]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [200]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,4.c TVET trainers trained,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,,,,,-136845782.0
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,,,,,2177317.0
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,,,,,-6416286.0
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,,,,,-1891391.0
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,,,,,5458610.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Current Health Expenditure by Region

Next, we load the eighteenth dataset.

Still, this has the same case as the previous dataset where there is a semicolon in between each content. 

Therefore, we will still set the `delimeter=";"` to separate the content of the file by the semicolon and set the `header` to 1 to specify the column header of this dataframe.

In [201]:
data = pd.read_csv('data' + '/Current Health Expenditure by Region.csv',header = 1, delimiter = ';')
data

Unnamed: 0,Region,2014,2015,2016,2017,2018r,2019r,2020
0,Total Current Health Expenditure,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1,..NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,..CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,..Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,..Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,..Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,..CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,..MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,..Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,..Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6


Again, we will drop the `Index 0` which contains the Total Current Health Expenditure. This row is irrelevant because we only need the data for nationwide and per region.

In [202]:
# drop total current health expenditure
data = data.drop (data.index[0])
data

Unnamed: 0,Region,2014,2015,2016,2017,2018r,2019r,2020
1,..NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,..CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,..Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,..Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,..Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,..CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,..MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,..Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,..Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6
10,..Central Visayas,5.9,5.6,5.6,5.7,5.9,5.9,5.7


To make the `Geolocation` names consistent, we remove the '..' at the start of the Region column values. Then, we remove some 'r' at the years (2018r & 2019r) since we will convert the years into an `int` datatype in the data cleaning part. Additionally, it is important that the years follow the same format as the other columns of the dataframe as it would be used to merge this dataset into the combined dataframe.

In [203]:
#remove '..' and 'r'
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data.columns = data.columns.str.replace('[r]', '',regex = True)
data

Unnamed: 0,Region,2014,2015,2016,2017,2018,2019,2020
1,NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6
10,Central Visayas,5.9,5.6,5.6,5.7,5.9,5.9,5.7


Next, we rearrange the rows and make the Nationwide Index 0.

In [204]:
# make nationwide index 0
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,2014,2015,2016,2017,2018,2019,2020
0,Nationwide,16.8,19.3,19.7,21.2,20.0,18.6,23.4
1,NCR,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,CAR,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,MIMAROPA,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6


After checking if the Geolocation column is in correct order, we initialize the Geolocation column of the `region_names` for consistency. 

In [205]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)

# renames the Region column to Geolocation
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2014,2015,2016,2017,2018,2019,2020
0,PHILIPPINES,16.8,19.3,19.7,21.2,20.0,18.6,23.4
1,NCR: National Capital Region,24.7,23.0,22.5,22.3,22.3,23.0,17.3
2,CAR: Cordillera Administrative Region,2.1,2.1,2.1,2.1,2.1,2.2,2.1
3,Region 1: Ilocos Region,4.0,4.0,4.0,4.1,4.0,3.9,4.4
4,Region 2: Cagayan Valley,2.3,2.4,2.4,2.3,2.3,2.4,2.6
5,Region 3: Central Luzon,10.9,10.7,10.8,10.7,10.9,11.0,11.0
6,Region 4A: CALABARZON,2.9,2.6,2.9,2.8,3.7,3.2,2.3
7,MIMAROPA: Southwestern Tagalog Region,1.2,1.2,1.2,1.4,0.5,1.1,0.9
8,Region 5: Bicol Region,4.2,4.1,4.1,4.0,4.0,4.1,4.3
9,Region 6: Western Visayas,6.4,6.2,6.3,6.3,6.4,6.0,6.6


Then, we can now change the wide representation into a long representation.

In [206]:
data = change_to_long (data, 'Current Health Expenditure')

Finally, we will now add this dataset with the currently combined dataset.

In [207]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [208]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,7.1.1 Proportion of pop with electricity,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,,,,-136845782.0,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,,,,2177317.0,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,,,,-6416286.0,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,,,,-1891391.0,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,,,,5458610.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Current Health Expenditure by Region, Growth Rates 

Next, we load the seventeenth dataset.

This has the same case as the loading of the previous dataset. The content of the file also has semicolon (`;`) in between data and are not separated in cells.

In [209]:
data = pd.read_csv('data' + '/Current Health Expenditure by Region, Growth Rates.csv')
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/Current Health Expenditure by Region, Growth Rates.csv')
data

Unnamed: 0,"Current Health Expenditure by Region, Growth Rates"
0,"Region;""2014-15"";""2015-16"";""2016-17"";""2017-18r..."
1,Total Current Health Expenditure;11.1;10.1;9.6...
2,..NCR;3.6;7.9;7.2;11.2;13.7;-15.1
3,..CAR;11.4;9.9;11.0;10.4;14.9;10.4
4,..Ilocos Region;12.4;9.5;10.2;9.8;7.7;26.1
5,..Cagayan Valley;13.9;10.6;5.0;11.7;13.7;23.6
6,..Central Luzon;9.9;10.9;7.6;13.0;11.4;12.3
7,..CALABARZON;1.8;23.6;4.3;46.1;-4.1;-18.8
8,..MIMAROPA;9.9;10.2;22.6;-63.0;163.8;-11.5
9,..Bicol Region;9.7;10.0;5.7;11.8;13.7;16.8


Just like what we did from the previous dataset, we will set the `delimeter=";"` to separate the content of the file by the semicolon and set the `header=1` to specify which the column names for this dataframe.

In [210]:
data = pd.read_csv('data' + '/Current Health Expenditure by Region, Growth Rates.csv', header=1, delimiter=";")
# data = pd.read_csv(os.getenv('DSDATA_PROJ') + '/Current Health Expenditure by Region, Growth Rates.csv')
data

Unnamed: 0,Region,2014-15,2015-16,2016-17,2017-18r,2018-19r,2019-20
0,Total Current Health Expenditure,11.1,10.1,9.6,10.1,10.2,12.6
1,..NCR,3.6,7.9,7.2,11.2,13.7,-15.1
2,..CAR,11.4,9.9,11.0,10.4,14.9,10.4
3,..Ilocos Region,12.4,9.5,10.2,9.8,7.7,26.1
4,..Cagayan Valley,13.9,10.6,5.0,11.7,13.7,23.6
5,..Central Luzon,9.9,10.9,7.6,13.0,11.4,12.3
6,..CALABARZON,1.8,23.6,4.3,46.1,-4.1,-18.8
7,..MIMAROPA,9.9,10.2,22.6,-63.0,163.8,-11.5
8,..Bicol Region,9.7,10.0,5.7,11.8,13.7,16.8
9,..Western Visayas,7.6,11.2,9.1,11.7,3.3,23.9


Since we will only need the data nationwide and per region, we will drop the `Index 0` which contains the Total Current Health Expenditure.

In [211]:
data = data.drop (data.index[0])
data = data.reset_index (drop=True)
data

Unnamed: 0,Region,2014-15,2015-16,2016-17,2017-18r,2018-19r,2019-20
0,..NCR,3.6,7.9,7.2,11.2,13.7,-15.1
1,..CAR,11.4,9.9,11.0,10.4,14.9,10.4
2,..Ilocos Region,12.4,9.5,10.2,9.8,7.7,26.1
3,..Cagayan Valley,13.9,10.6,5.0,11.7,13.7,23.6
4,..Central Luzon,9.9,10.9,7.6,13.0,11.4,12.3
5,..CALABARZON,1.8,23.6,4.3,46.1,-4.1,-18.8
6,..MIMAROPA,9.9,10.2,22.6,-63.0,163.8,-11.5
7,..Bicol Region,9.7,10.0,5.7,11.8,13.7,16.8
8,..Western Visayas,7.6,11.2,9.1,11.7,3.3,23.9
9,..Central Visayas,6.1,9.7,11.8,13.7,11.7,7.3


Also, since the ordering of the Geolocation is different in this dataset, we will be rearranging the rows based on the order of the Geolocation in region_names. After this, we will reset the index again.

In [212]:
data = data.reindex (index = [17, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
data = data.reset_index (drop=True)
data

Unnamed: 0,Region,2014-15,2015-16,2016-17,2017-18r,2018-19r,2019-20
0,..Nationwide,27.8,12.2,16.7,4.9,2.5,41.8
1,..NCR,3.6,7.9,7.2,11.2,13.7,-15.1
2,..CAR,11.4,9.9,11.0,10.4,14.9,10.4
3,..Ilocos Region,12.4,9.5,10.2,9.8,7.7,26.1
4,..Cagayan Valley,13.9,10.6,5.0,11.7,13.7,23.6
5,..Central Luzon,9.9,10.9,7.6,13.0,11.4,12.3
6,..CALABARZON,1.8,23.6,4.3,46.1,-4.1,-18.8
7,..MIMAROPA,9.9,10.2,22.6,-63.0,163.8,-11.5
8,..Bicol Region,9.7,10.0,5.7,11.8,13.7,16.8
9,..Western Visayas,7.6,11.2,9.1,11.7,3.3,23.9


After this, we will now change the columns names: (1) `Region` to `Geolocation`, (2) `At Current Prices <Year>` to `<Year>`. 

The `<Year>` columns must consist of numbers only since we will convert the years into an integer later in the data cleaning part. 

In [213]:
data.columns = ['Geolocation', '2014', '2015', '2016', '2017','2018', '2019']
data

Unnamed: 0,Geolocation,2014,2015,2016,2017,2018,2019
0,..Nationwide,27.8,12.2,16.7,4.9,2.5,41.8
1,..NCR,3.6,7.9,7.2,11.2,13.7,-15.1
2,..CAR,11.4,9.9,11.0,10.4,14.9,10.4
3,..Ilocos Region,12.4,9.5,10.2,9.8,7.7,26.1
4,..Cagayan Valley,13.9,10.6,5.0,11.7,13.7,23.6
5,..Central Luzon,9.9,10.9,7.6,13.0,11.4,12.3
6,..CALABARZON,1.8,23.6,4.3,46.1,-4.1,-18.8
7,..MIMAROPA,9.9,10.2,22.6,-63.0,163.8,-11.5
8,..Bicol Region,9.7,10.0,5.7,11.8,13.7,16.8
9,..Western Visayas,7.6,11.2,9.1,11.7,3.3,23.9


After confirming that the Geolocation order is correct, we will initialize the Geolocation column of the region names to ensure uniformity in the Geolocation column.

In [214]:
# setting the values of the region_names
data ['Geolocation'] = region_names
data

Unnamed: 0,Geolocation,2014,2015,2016,2017,2018,2019
0,PHILIPPINES,27.8,12.2,16.7,4.9,2.5,41.8
1,NCR: National Capital Region,3.6,7.9,7.2,11.2,13.7,-15.1
2,CAR: Cordillera Administrative Region,11.4,9.9,11.0,10.4,14.9,10.4
3,Region 1: Ilocos Region,12.4,9.5,10.2,9.8,7.7,26.1
4,Region 2: Cagayan Valley,13.9,10.6,5.0,11.7,13.7,23.6
5,Region 3: Central Luzon,9.9,10.9,7.6,13.0,11.4,12.3
6,Region 4A: CALABARZON,1.8,23.6,4.3,46.1,-4.1,-18.8
7,MIMAROPA: Southwestern Tagalog Region,9.9,10.2,22.6,-63.0,163.8,-11.5
8,Region 5: Bicol Region,9.7,10.0,5.7,11.8,13.7,16.8
9,Region 6: Western Visayas,7.6,11.2,9.1,11.7,3.3,23.9


Again, to represent the missing values, we will then replace '..' or '...' strings to NaN.

In [215]:
data = replace_missing (data)

We can now convert the dataframe into its long representation. This would reorganize the dataframe's format, allowing us to join it with the combined dataset easier.

In [216]:
data = change_to_long (data, 'Current Health Expenditure GR')

After this, this dataset can finally be added to the currently combined dataset.

In [217]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)

In [218]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,8.1.1 Growth rate of real GDP per capita,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,,,-136845782.0,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,,,2177317.0,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,,,-6416286.0,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,,,-1891391.0,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,,,5458610.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Government Final Consumption Expenditure, by Region, Percent Share
Afterwards, we can now load the next dataset. Like in the previous datasets, we would use a separator **;**, since the separator for the columns are semi-colons and not commas. Additionally, we would need to set the value of the header to 1, which means that `Row 1` would be our column headers. 

In [219]:
data = pd.read_csv('data' + '/Government Final Consumption Expenditure, by Region, Percent Share.csv', header = 1, sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,..National Capital Region (NCR),45.9,47.0,44.8,44.7,45.1,44.3,42.7,42.9,43.9,...,41.7,42.1,41.5,40.6,40.2,39.8,39.8,40.0,40.4,40.6
1,..Cordillera Administrative Region (CAR),2.7,2.7,2.7,2.6,2.5,2.6,2.4,2.3,2.2,...,2.1,2.0,2.0,2.0,1.9,1.9,1.9,1.9,1.8,1.8
2,..Region I (Ilocos Region),4.0,3.8,3.9,4.0,3.9,4.0,4.6,4.2,4.0,...,3.7,3.6,3.9,4.0,4.2,4.2,4.2,4.1,4.0,3.9
3,..Region II (Cagayan Valley),2.4,2.4,2.6,2.6,2.7,2.6,2.7,2.7,2.6,...,2.3,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.1,2.2
4,..Region III (Central Luzon),4.3,4.1,4.8,4.9,5.7,5.5,5.8,6.3,6.4,...,6.8,7.0,7.0,7.0,7.0,7.1,7.1,7.1,7.0,7.1
5,..Region IV-A (CALABARZON),4.4,4.5,5.0,5.0,5.1,5.6,5.7,7.0,6.6,...,7.2,7.0,7.3,7.4,7.5,7.5,7.6,7.6,7.6,7.4
6,..MIMAROPA Region,2.0,2.0,2.1,2.1,1.8,2.1,2.2,2.4,2.4,...,2.4,2.3,2.2,2.2,2.2,2.2,2.2,2.1,2.1,2.1
7,..Region V (Bicol Region),3.6,3.5,3.5,3.5,3.2,3.5,3.8,4.0,3.8,...,4.2,4.1,4.1,4.2,4.0,4.0,3.9,4.0,3.9,3.9
8,..Region VI (Western Visayas),5.0,5.1,5.1,5.0,4.8,4.8,5.1,4.7,4.4,...,4.7,4.4,4.3,4.5,4.6,4.6,4.8,4.8,4.7,4.6
9,..Region VII (Central Visayas),2.7,2.8,3.0,2.9,3.2,3.0,3.1,2.9,2.9,...,3.2,3.5,3.8,4.0,4.1,4.2,4.3,4.2,4.2,4.2


We remove the '..' at the start of the Region column value. Additionally, we would be re-ordering the rows, because, as we can see in the previous dataframe, **Philippines** is found in the last row instead of the first row.

In [220]:
# remove '..' in the values of Geolication
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))

# re-arrange the rows
data = data.iloc [np.arange (-1, len (data) - 1)]
data = data.reset_index ()
data.drop ('index', axis = 1, inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,Philippines,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1,National Capital Region (NCR),45.9,47.0,44.8,44.7,45.1,44.3,42.7,42.9,43.9,...,41.7,42.1,41.5,40.6,40.2,39.8,39.8,40.0,40.4,40.6
2,Cordillera Administrative Region (CAR),2.7,2.7,2.7,2.6,2.5,2.6,2.4,2.3,2.2,...,2.1,2.0,2.0,2.0,1.9,1.9,1.9,1.9,1.8,1.8
3,Region I (Ilocos Region),4.0,3.8,3.9,4.0,3.9,4.0,4.6,4.2,4.0,...,3.7,3.6,3.9,4.0,4.2,4.2,4.2,4.1,4.0,3.9
4,Region II (Cagayan Valley),2.4,2.4,2.6,2.6,2.7,2.6,2.7,2.7,2.6,...,2.3,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.1,2.2
5,Region III (Central Luzon),4.3,4.1,4.8,4.9,5.7,5.5,5.8,6.3,6.4,...,6.8,7.0,7.0,7.0,7.0,7.1,7.1,7.1,7.0,7.1
6,Region IV-A (CALABARZON),4.4,4.5,5.0,5.0,5.1,5.6,5.7,7.0,6.6,...,7.2,7.0,7.3,7.4,7.5,7.5,7.6,7.6,7.6,7.4
7,MIMAROPA Region,2.0,2.0,2.1,2.1,1.8,2.1,2.2,2.4,2.4,...,2.4,2.3,2.2,2.2,2.2,2.2,2.2,2.1,2.1,2.1
8,Region V (Bicol Region),3.6,3.5,3.5,3.5,3.2,3.5,3.8,4.0,3.8,...,4.2,4.1,4.1,4.2,4.0,4.0,3.9,4.0,3.9,3.9
9,Region VI (Western Visayas),5.0,5.1,5.1,5.0,4.8,4.8,5.1,4.7,4.4,...,4.7,4.4,4.3,4.5,4.6,4.6,4.8,4.8,4.7,4.6


We follow the previous format that we set for the `Geolocation` column. Additionally, we would rename the `Region` column to `Geolocation` for consistency.

In [221]:
# renames the data in the Geolocation for consistency
data ['Region'] = region_names
data.set_index ('Region')
data = data.reset_index (drop=True)

# renames the Region column to Geolocation
data.rename (columns = {'Region': 'Geolocation'}, inplace = True)
# data

We will now format the column names to remove the `At Current Price`, which would change the column headers to only the years.

In [222]:
data.columns = data.columns.map (lambda x: x.lstrip ('At Current Prices'))
data.columns = data.columns.str [:4]
data.rename (columns = {'Geol': 'Geolocation'}, inplace = True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,PHILIPPINES,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
1,NCR: National Capital Region,45.9,47.0,44.8,44.7,45.1,44.3,42.7,42.9,43.9,...,41.7,42.1,41.5,40.6,40.2,39.8,39.8,40.0,40.4,40.6
2,CAR: Cordillera Administrative Region,2.7,2.7,2.7,2.6,2.5,2.6,2.4,2.3,2.2,...,2.1,2.0,2.0,2.0,1.9,1.9,1.9,1.9,1.8,1.8
3,Region 1: Ilocos Region,4.0,3.8,3.9,4.0,3.9,4.0,4.6,4.2,4.0,...,3.7,3.6,3.9,4.0,4.2,4.2,4.2,4.1,4.0,3.9
4,Region 2: Cagayan Valley,2.4,2.4,2.6,2.6,2.7,2.6,2.7,2.7,2.6,...,2.3,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.1,2.2
5,Region 3: Central Luzon,4.3,4.1,4.8,4.9,5.7,5.5,5.8,6.3,6.4,...,6.8,7.0,7.0,7.0,7.0,7.1,7.1,7.1,7.0,7.1
6,Region 4A: CALABARZON,4.4,4.5,5.0,5.0,5.1,5.6,5.7,7.0,6.6,...,7.2,7.0,7.3,7.4,7.5,7.5,7.6,7.6,7.6,7.4
7,MIMAROPA: Southwestern Tagalog Region,2.0,2.0,2.1,2.1,1.8,2.1,2.2,2.4,2.4,...,2.4,2.3,2.2,2.2,2.2,2.2,2.2,2.1,2.1,2.1
8,Region 5: Bicol Region,3.6,3.5,3.5,3.5,3.2,3.5,3.8,4.0,3.8,...,4.2,4.1,4.1,4.2,4.0,4.0,3.9,4.0,3.9,3.9
9,Region 6: Western Visayas,5.0,5.1,5.1,5.0,4.8,4.8,5.1,4.7,4.4,...,4.7,4.4,4.3,4.5,4.6,4.6,4.8,4.8,4.7,4.6


After we have cleaned our dataset, we can now convert it to its long representation. This can be done through the use of the `change_to_long` function.

In [223]:
data = change_to_long (data, 'Consumption Expenditure %')

Once our dataframe is in its long format, we can now combine this with our combined dataset.

In [224]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,,-136845782.0,,,100.0
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,,2177317.0,,,45.9
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,,-6416286.0,,,2.7
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,,-1891391.0,,,4.0
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,,5458610.0,,,2.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Government Final Consumption Expenditure, by Region, Growth Rates
Then, we can now load the next dataset. We would pass the same parameters to the [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function, as those datasets that are under the non-SDGs follow the same format.

In [225]:
data = pd.read_csv('data' + '/Government Final Consumption Expenditure, by Region, Growth Rates.csv',header = 1,sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000-2001,At Current Prices 2001-2002,At Current Prices 2002-2003,At Current Prices 2003-2004,At Current Prices 2004-2005,At Current Prices 2005-2006,At Current Prices 2006-2007,At Current Prices 2007-2008,At Current Prices 2008-2009,...,At Current Prices 2011-2012,At Current Prices 2012-2013,At Current Prices 2013-2014,At Current Prices 2014-2015,At Current Prices 2015-2016,At Current Prices 2016-2017,At Current Prices 2017-2018,At Current Prices 2018-2019,At Current Prices 2019-2020,At Current Prices 2020-2021
0,..National Capital Region (NCR),8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,19.9,10.3,5.4,6.7,10.9,9.0,17.3,11.3,13.6,10.9
1,..Cordillera Administrative Region (CAR),2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,23.0,4.5,6.9,7.5,9.3,10.2,17.1,7.6,9.2,8.6
2,..Region I (Ilocos Region),1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,21.9,6.4,14.3,14.5,15.6,10.5,17.1,8.0,9.6,8.5
3,..Region II (Cagayan Valley),4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,10.9,2.9,8.7,9.0,10.3,10.1,18.3,8.6,10.8,15.0
4,..Region III (Central Luzon),2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,22.8,12.5,6.8,10.0,12.0,10.9,17.9,10.3,11.2,12.1
5,..Region IV-A (CALABARZON),8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,26.4,6.3,11.5,10.5,13.4,11.0,18.3,10.5,12.1,8.3
6,..MIMAROPA Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,24.2,3.7,6.0,6.0,11.4,10.8,17.4,4.9,11.6,10.2
7,..Region V (Bicol Region),0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,26.4,5.8,9.4,9.9,8.0,10.1,12.9,14.4,11.0,9.4
8,..Region VI (Western Visayas),7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,21.9,3.6,4.0,13.6,14.6,10.4,23.4,9.0,10.1,9.1
9,..Region VII (Central Visayas),6.8,11.0,1.8,13.7,0.6,17.7,6.3,7.5,13.2,...,28.8,17.3,15.7,16.0,15.7,11.3,19.7,9.7,12.7,9.6


Additionally, as our the values for our `Region` column has two periods in front of it, we would remove these.

In [226]:
# remove '..'
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))

Afterwards, we would be re-arranging the rows so that we could easily use the `region_names` to follow the set format for the values of the `Geolocation` column. 

In [227]:
# re-arranging the rows
data = data.iloc [np.arange(-1, len (data) - 1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000-2001,At Current Prices 2001-2002,At Current Prices 2002-2003,At Current Prices 2003-2004,At Current Prices 2004-2005,At Current Prices 2005-2006,At Current Prices 2006-2007,At Current Prices 2007-2008,At Current Prices 2008-2009,...,At Current Prices 2011-2012,At Current Prices 2012-2013,At Current Prices 2013-2014,At Current Prices 2014-2015,At Current Prices 2015-2016,At Current Prices 2016-2017,At Current Prices 2017-2018,At Current Prices 2018-2019,At Current Prices 2019-2020,At Current Prices 2020-2021
0,Philippines,5.7,3.4,4.9,3.9,7.8,13.2,11.9,6.8,16.2,...,21.1,9.2,7.0,9.1,12.0,10.0,17.3,10.6,12.6,10.3
1,National Capital Region (NCR),8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,19.9,10.3,5.4,6.7,10.9,9.0,17.3,11.3,13.6,10.9
2,Cordillera Administrative Region (CAR),2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,23.0,4.5,6.9,7.5,9.3,10.2,17.1,7.6,9.2,8.6
3,Region I (Ilocos Region),1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,21.9,6.4,14.3,14.5,15.6,10.5,17.1,8.0,9.6,8.5
4,Region II (Cagayan Valley),4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,10.9,2.9,8.7,9.0,10.3,10.1,18.3,8.6,10.8,15.0
5,Region III (Central Luzon),2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,22.8,12.5,6.8,10.0,12.0,10.9,17.9,10.3,11.2,12.1
6,Region IV-A (CALABARZON),8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,26.4,6.3,11.5,10.5,13.4,11.0,18.3,10.5,12.1,8.3
7,MIMAROPA Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,24.2,3.7,6.0,6.0,11.4,10.8,17.4,4.9,11.6,10.2
8,Region V (Bicol Region),0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,26.4,5.8,9.4,9.9,8.0,10.1,12.9,14.4,11.0,9.4
9,Region VI (Western Visayas),7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,21.9,3.6,4.0,13.6,14.6,10.4,23.4,9.0,10.1,9.1


As our rows now follow the same arrangement as `region_names`, we can now set this as our values for the `Region` column. However, in the previous dataframes, we used the column name `Geolocation` instead of `Region`, which is why we have to rename it for consistency.

In [228]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop = True)

# renames the Region column with Geolocation
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
# data

Another problem that we would have to clean is the name of the columns—instead of the column names being just the years, it has the term **At Current Prices** included. We would have to remove this term and retain just the year. so that we could combine it with the other datasets. 

In [229]:
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data

Unnamed: 0,Geolocation,2000-2001,2001-2002,2002-2003,2003-2004,2004-2005,2005-2006,2006-2007,2007-2008,2008-2009,...,2011-2012,2012-2013,2013-2014,2014-2015,2015-2016,2016-2017,2017-2018,2018-2019,2019-2020,2020-2021
0,PHILIPPINES,5.7,3.4,4.9,3.9,7.8,13.2,11.9,6.8,16.2,...,21.1,9.2,7.0,9.1,12.0,10.0,17.3,10.6,12.6,10.3
1,NCR: National Capital Region,8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,19.9,10.3,5.4,6.7,10.9,9.0,17.3,11.3,13.6,10.9
2,CAR: Cordillera Administrative Region,2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,23.0,4.5,6.9,7.5,9.3,10.2,17.1,7.6,9.2,8.6
3,Region 1: Ilocos Region,1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,21.9,6.4,14.3,14.5,15.6,10.5,17.1,8.0,9.6,8.5
4,Region 2: Cagayan Valley,4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,10.9,2.9,8.7,9.0,10.3,10.1,18.3,8.6,10.8,15.0
5,Region 3: Central Luzon,2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,22.8,12.5,6.8,10.0,12.0,10.9,17.9,10.3,11.2,12.1
6,Region 4A: CALABARZON,8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,26.4,6.3,11.5,10.5,13.4,11.0,18.3,10.5,12.1,8.3
7,MIMAROPA: Southwestern Tagalog Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,24.2,3.7,6.0,6.0,11.4,10.8,17.4,4.9,11.6,10.2
8,Region 5: Bicol Region,0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,26.4,5.8,9.4,9.9,8.0,10.1,12.9,14.4,11.0,9.4
9,Region 6: Western Visayas,7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,21.9,3.6,4.0,13.6,14.6,10.4,23.4,9.0,10.1,9.1


However, as we can see, the column headers still does not contain just the year. As the column header is a range of years (i.e., 2000-2001 means that the certain price is from January 2000 to January 2001), we would be adopting the first year.

In [230]:
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,PHILIPPINES,5.7,3.4,4.9,3.9,7.8,13.2,11.9,6.8,16.2,...,21.1,9.2,7.0,9.1,12.0,10.0,17.3,10.6,12.6,10.3
1,NCR: National Capital Region,8.3,-1.7,4.7,5.0,5.8,9.1,12.5,9.2,16.6,...,19.9,10.3,5.4,6.7,10.9,9.0,17.3,11.3,13.6,10.9
2,CAR: Cordillera Administrative Region,2.3,4.5,0.9,-1.4,14.3,6.1,3.3,3.4,10.0,...,23.0,4.5,6.9,7.5,9.3,10.2,17.1,7.6,9.2,8.6
3,Region 1: Ilocos Region,1.3,6.6,6.8,0.3,11.9,29.2,2.3,0.8,6.4,...,21.9,6.4,14.3,14.5,15.6,10.5,17.1,8.0,9.6,8.5
4,Region 2: Cagayan Valley,4.2,11.5,7.5,4.6,5.7,17.5,11.2,1.8,15.2,...,10.9,2.9,8.7,9.0,10.3,10.1,18.3,8.6,10.8,15.0
5,Region 3: Central Luzon,2.0,21.1,7.2,19.7,3.8,20.2,21.3,9.1,20.3,...,22.8,12.5,6.8,10.0,12.0,10.9,17.9,10.3,11.2,12.1
6,Region 4A: CALABARZON,8.2,15.0,5.9,5.1,19.4,14.1,38.8,0.7,20.3,...,26.4,6.3,11.5,10.5,13.4,11.0,18.3,10.5,12.1,8.3
7,MIMAROPA: Southwestern Tagalog Region,3.0,7.8,6.8,-9.0,25.2,18.0,21.4,7.0,16.7,...,24.2,3.7,6.0,6.0,11.4,10.8,17.4,4.9,11.6,10.2
8,Region 5: Bicol Region,0.3,3.1,6.7,-5.4,18.4,20.9,20.4,1.3,21.3,...,26.4,5.8,9.4,9.9,8.0,10.1,12.9,14.4,11.0,9.4
9,Region 6: Western Visayas,7.9,4.2,2.9,-1.9,9.5,18.6,4.1,0.3,20.7,...,21.9,3.6,4.0,13.6,14.6,10.4,23.4,9.0,10.1,9.1


As it now follows the format of the previous dataframes, we can now convert it to its long representation.

In [231]:
data = change_to_long (data, 'Consumption Expenditure GR')

We use the merge function to use outer join to merge the two datasets.

In [232]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,10.1.1.2 Income per capita growth rate,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,,-136845782.0,,,100.0,5.7
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,,2177317.0,,,45.9,8.3
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,,-6416286.0,,,2.7,2.3
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,,-1891391.0,,,4.0,1.3
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,,5458610.0,,,2.4,4.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Gross Capital Formation, by Region
After this, we can now move on to the next dataset `Gross Capital Formation, by Region`. The given dataset would have to be separated using a semi-colon, as it does not use the default comma as its separator.

In [233]:
data = pd.read_csv('data' + '/Gross Capital Formation, by Region.csv', header = 1 ,sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,..National Capital Region (NCR),203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,758973608,957586594,1055722012,1072923782,1268840988,1391592148,1640251453,1598461505,960095438,1283778216
1,..Cordillera Administrative Region (CAR),13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,29548738,33081560,24485878,29049288,32105379,35881923,32266092,23666821,31273604,24347205
2,..Region I (Ilocos Region),24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,68070088,75736595,88877912,113489548,131580148,163128777,193759545,228616710,172416085,199301927
3,..Region II (Cagayan Valley),32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,79946525,62731146,75028145,77480450,95196604,113995195,134311405,146410694,95318052,93038750
4,..Region III (Central Luzon),8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,244530920,274814176,273085629,351251636,421665953,513606387,590017254,610978504,283879955,478799528
5,..Region IV-A (CALABARZON),18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,231742975,290367514,320280912,303867486,517220942,630500072,712401187,766853055,298782200,518586252
6,..MIMAROPA Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,24934114,21427705,41529678,36665567,36490353,55532730,78995250,88406153,45361975,76249821
7,..Region V (Bicol Region),23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,69740827,83011228,98493385,104978578,134334431,142545107,181534196,202053823,153689482,168273510
8,..Region VI (Western Visayas),46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,95477787,101336678,77105459,95430431,119446427,132134843,155804331,176527730,133826656,132970199
9,..Region VII (Central Visayas),65471807,74574164,72065505,70933637,76422804,64122060,52741677,51229324,66644077,...,113821899,126595075,158426246,158168742,216329082,220634348,257710115,303559170,196171280,248790513


First, we will remove the '...' at the start of the values of the `Region` column. Then, we would re-arrange the rows because the dataframe starts with the value **National Capital Region**, instead of the **Philippines**. This would allow us to follow the format set in `region_names.`

In [234]:
data ['Region'] = data ['Region'].map(lambda x: x.lstrip ('..'))
data = data.iloc [np.arange (-1, len (data) - 1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
# data

Because the arrangement of the rows follow `region_names`, we can now set it to follow the format of `region_names`. After this, we would also need to rename the column `Region` to `Geolocation`.

In [235]:
# replaces the format of the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)

# renames the column name
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,PHILIPPINES,579938180,762429457,890086990,921328434,1103698971,1098633998,1049071426,1160979516,1526893379,...,2163531693,2487510204,2763392839,2975815372,3725365802,4231677010,4959105466,5153068851,3129565862,4111887238
1,NCR: National Capital Region,203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,758973608,957586594,1055722012,1072923782,1268840988,1391592148,1640251453,1598461505,960095438,1283778216
2,CAR: Cordillera Administrative Region,13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,29548738,33081560,24485878,29049288,32105379,35881923,32266092,23666821,31273604,24347205
3,Region 1: Ilocos Region,24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,68070088,75736595,88877912,113489548,131580148,163128777,193759545,228616710,172416085,199301927
4,Region 2: Cagayan Valley,32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,79946525,62731146,75028145,77480450,95196604,113995195,134311405,146410694,95318052,93038750
5,Region 3: Central Luzon,8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,244530920,274814176,273085629,351251636,421665953,513606387,590017254,610978504,283879955,478799528
6,Region 4A: CALABARZON,18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,231742975,290367514,320280912,303867486,517220942,630500072,712401187,766853055,298782200,518586252
7,MIMAROPA: Southwestern Tagalog Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,24934114,21427705,41529678,36665567,36490353,55532730,78995250,88406153,45361975,76249821
8,Region 5: Bicol Region,23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,69740827,83011228,98493385,104978578,134334431,142545107,181534196,202053823,153689482,168273510
9,Region 6: Western Visayas,46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,95477787,101336678,77105459,95430431,119446427,132134843,155804331,176527730,133826656,132970199


Like in the previous dataset, we would have to remove **At Current Prices**. This would result in the column names being the year only.

In [236]:
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,PHILIPPINES,579938180,762429457,890086990,921328434,1103698971,1098633998,1049071426,1160979516,1526893379,...,2163531693,2487510204,2763392839,2975815372,3725365802,4231677010,4959105466,5153068851,3129565862,4111887238
1,NCR: National Capital Region,203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,758973608,957586594,1055722012,1072923782,1268840988,1391592148,1640251453,1598461505,960095438,1283778216
2,CAR: Cordillera Administrative Region,13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,29548738,33081560,24485878,29049288,32105379,35881923,32266092,23666821,31273604,24347205
3,Region 1: Ilocos Region,24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,68070088,75736595,88877912,113489548,131580148,163128777,193759545,228616710,172416085,199301927
4,Region 2: Cagayan Valley,32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,79946525,62731146,75028145,77480450,95196604,113995195,134311405,146410694,95318052,93038750
5,Region 3: Central Luzon,8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,244530920,274814176,273085629,351251636,421665953,513606387,590017254,610978504,283879955,478799528
6,Region 4A: CALABARZON,18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,231742975,290367514,320280912,303867486,517220942,630500072,712401187,766853055,298782200,518586252
7,MIMAROPA: Southwestern Tagalog Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,24934114,21427705,41529678,36665567,36490353,55532730,78995250,88406153,45361975,76249821
8,Region 5: Bicol Region,23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,69740827,83011228,98493385,104978578,134334431,142545107,181534196,202053823,153689482,168273510
9,Region 6: Western Visayas,46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,95477787,101336678,77105459,95430431,119446427,132134843,155804331,176527730,133826656,132970199


Just like in the previous dataset, we would have adopt the first year, as the column header includes a range of years. 

In [237]:
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,PHILIPPINES,579938180,762429457,890086990,921328434,1103698971,1098633998,1049071426,1160979516,1526893379,...,2163531693,2487510204,2763392839,2975815372,3725365802,4231677010,4959105466,5153068851,3129565862,4111887238
1,NCR: National Capital Region,203930819,288023206,312792420,350366278,445396815,364638998,350945836,410794316,541772989,...,758973608,957586594,1055722012,1072923782,1268840988,1391592148,1640251453,1598461505,960095438,1283778216
2,CAR: Cordillera Administrative Region,13865180,13448285,16806013,19788755,24257807,20432006,25385665,26901466,29704835,...,29548738,33081560,24485878,29049288,32105379,35881923,32266092,23666821,31273604,24347205
3,Region 1: Ilocos Region,24454284,26821755,29627138,32967294,45222018,39338515,32127165,41310895,50763801,...,68070088,75736595,88877912,113489548,131580148,163128777,193759545,228616710,172416085,199301927
4,Region 2: Cagayan Valley,32773347,34873457,37349804,40755790,51068600,58370199,68595346,71904291,73649021,...,79946525,62731146,75028145,77480450,95196604,113995195,134311405,146410694,95318052,93038750
5,Region 3: Central Luzon,8037710,35506256,63533211,64973784,68231368,92419886,84806309,92665051,138098487,...,244530920,274814176,273085629,351251636,421665953,513606387,590017254,610978504,283879955,478799528
6,Region 4A: CALABARZON,18214696,76941117,108934789,111408235,117422285,173261985,120393803,119653336,149756689,...,231742975,290367514,320280912,303867486,517220942,630500072,712401187,766853055,298782200,518586252
7,MIMAROPA: Southwestern Tagalog Region,10234683,11200127,14899234,12029760,9464169,15527012,10027999,3017260,22411802,...,24934114,21427705,41529678,36665567,36490353,55532730,78995250,88406153,45361975,76249821
8,Region 5: Bicol Region,23129693,23889005,26003348,23452407,28387601,30110212,31513023,37647782,47880014,...,69740827,83011228,98493385,104978578,134334431,142545107,181534196,202053823,153689482,168273510
9,Region 6: Western Visayas,46342201,44831024,46626457,45601881,48709503,44867130,52570827,61931227,73382845,...,95477787,101336678,77105459,95430431,119446427,132134843,155804331,176527730,133826656,132970199


Afterwards, we can now convert the dataframe from the current wide representation to its long representation. This would allow us to merge this to the combined dataset.

In [238]:
data = change_to_long (data, 'Gross Capital Formation')

Then, we can now use the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function to use outer join to merge the two datasets.

In [239]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR,Gross Capital Formation
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,,-136845782.0,,,100.0,5.7,579938180.0
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,,2177317.0,,,45.9,8.3,203930819.0
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,,-6416286.0,,,2.7,2.3,13865180.0
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,,-1891391.0,,,4.0,1.3,24454284.0
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,,5458610.0,,,2.4,4.2,32773347.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Gross Regional Domestic Product, by Region
This is followed by the `Gross Regional Domestic Product, by Region`, which we would be loading. We would be passing the same parameters, as this follows the same format as the other non-SDG datasets.

In [240]:
data = pd.read_csv('data' + '/Gross Regional Domestic Product, by Region.csv', header = 1, sep = ';')
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,..National Capital Region (NCR),1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3553088571,3890702252,4215201150,4532443704,4928222539,5327124065,5814440130,6294194685,5800819179,6157784762
1,..Cordillera Administrative Region (CAR),90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,201836502,219989238,237096008,243673958,255584787,280805810,308267122,322106088,295502542,323711729
2,..Region I (Ilocos Region),128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,356651167,393993447,430678687,449373363,490900560,527801491,587597251,630300658,597917300,643928511
3,..Region II (Cagayan Valley),85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,234724620,268304482,304061090,303113509,323458416,358686602,385061271,399370781,367331248,399981012
4,..Region III (Central Luzon),368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1206580595,1306617621,1461916690,1511232438,1640708336,1860323671,2062393875,2184815143,1864111182,2061962928
5,..Region IV-A (CALABARZON),601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1721867470,1847878662,2006907506,2076801161,2206254090,2423069480,2706994745,2865792547,2565120461,2785911990
6,..MIMAROPA Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,231318002,242953220,276139956,274605684,287300274,321948705,370744808,375589553,342643478,382736604
7,..Region V (Bicol Region),96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,298247874,332295863,366663555,395256951,431762419,465966050,522014835,560835429,516847516,564612777
8,..Region VI (Western Visayas),180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,527279962,566792962,615916323,667996261,720566797,791281769,860107768,919163055,850797186,937244700
9,..Region VII (Central Visayas),201368028,221348981,236861695,251698543,286487926,321657954,354599746,394422136,442118144,...,674544563,746139449,829547258,889023020,977480106,1067272679,1180945761,1270612311,1170115820,1237626585


Before we start with cleaning the dataframe itself, we would be removing the '..' found at the start for each of the values of the `Region` column. Additionally, we would be re-arranging the rows so that it follows the arrangement of the other datasets.

In [241]:
# removing '..' at the start of each of the values of the Region column
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))

# re-arranging the rows
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,Philippines,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11060588831,12050591984,13206828252,13944157448,15132381470,16556651083,18265190258,19517863172,17951573570,19410568055
1,National Capital Region (NCR),1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3553088571,3890702252,4215201150,4532443704,4928222539,5327124065,5814440130,6294194685,5800819179,6157784762
2,Cordillera Administrative Region (CAR),90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,201836502,219989238,237096008,243673958,255584787,280805810,308267122,322106088,295502542,323711729
3,Region I (Ilocos Region),128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,356651167,393993447,430678687,449373363,490900560,527801491,587597251,630300658,597917300,643928511
4,Region II (Cagayan Valley),85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,234724620,268304482,304061090,303113509,323458416,358686602,385061271,399370781,367331248,399981012
5,Region III (Central Luzon),368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1206580595,1306617621,1461916690,1511232438,1640708336,1860323671,2062393875,2184815143,1864111182,2061962928
6,Region IV-A (CALABARZON),601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1721867470,1847878662,2006907506,2076801161,2206254090,2423069480,2706994745,2865792547,2565120461,2785911990
7,MIMAROPA Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,231318002,242953220,276139956,274605684,287300274,321948705,370744808,375589553,342643478,382736604
8,Region V (Bicol Region),96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,298247874,332295863,366663555,395256951,431762419,465966050,522014835,560835429,516847516,564612777
9,Region VI (Western Visayas),180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,527279962,566792962,615916323,667996261,720566797,791281769,860107768,919163055,850797186,937244700


For consistency of the values of the `Geolocation` column, as we would be using this as key in the combination, we would be replacing its values with the set format. This would not change the regions that the data represents as the `Region` column and the variable `region_names` follow the same format.

In [242]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data

Unnamed: 0,Region,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,PHILIPPINES,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11060588831,12050591984,13206828252,13944157448,15132381470,16556651083,18265190258,19517863172,17951573570,19410568055
1,NCR: National Capital Region,1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3553088571,3890702252,4215201150,4532443704,4928222539,5327124065,5814440130,6294194685,5800819179,6157784762
2,CAR: Cordillera Administrative Region,90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,201836502,219989238,237096008,243673958,255584787,280805810,308267122,322106088,295502542,323711729
3,Region 1: Ilocos Region,128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,356651167,393993447,430678687,449373363,490900560,527801491,587597251,630300658,597917300,643928511
4,Region 2: Cagayan Valley,85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,234724620,268304482,304061090,303113509,323458416,358686602,385061271,399370781,367331248,399981012
5,Region 3: Central Luzon,368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1206580595,1306617621,1461916690,1511232438,1640708336,1860323671,2062393875,2184815143,1864111182,2061962928
6,Region 4A: CALABARZON,601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1721867470,1847878662,2006907506,2076801161,2206254090,2423069480,2706994745,2865792547,2565120461,2785911990
7,MIMAROPA: Southwestern Tagalog Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,231318002,242953220,276139956,274605684,287300274,321948705,370744808,375589553,342643478,382736604
8,Region 5: Bicol Region,96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,298247874,332295863,366663555,395256951,431762419,465966050,522014835,560835429,516847516,564612777
9,Region 6: Western Visayas,180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,527279962,566792962,615916323,667996261,720566797,791281769,860107768,919163055,850797186,937244700


Additionally, we would be changing the name of the `Region` column to `Geolocation` so that the column names are consistent throughout the dataframes.

In [243]:
data.rename(columns = {'Region': 'Geolocation'}, inplace = True)
data

Unnamed: 0,Geolocation,At Current Prices 2000,At Current Prices 2001,At Current Prices 2002,At Current Prices 2003,At Current Prices 2004,At Current Prices 2005,At Current Prices 2006,At Current Prices 2007,At Current Prices 2008,...,At Current Prices 2012,At Current Prices 2013,At Current Prices 2014,At Current Prices 2015,At Current Prices 2016,At Current Prices 2017,At Current Prices 2018,At Current Prices 2019,At Current Prices 2020,At Current Prices 2021
0,PHILIPPINES,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11060588831,12050591984,13206828252,13944157448,15132381470,16556651083,18265190258,19517863172,17951573570,19410568055
1,NCR: National Capital Region,1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3553088571,3890702252,4215201150,4532443704,4928222539,5327124065,5814440130,6294194685,5800819179,6157784762
2,CAR: Cordillera Administrative Region,90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,201836502,219989238,237096008,243673958,255584787,280805810,308267122,322106088,295502542,323711729
3,Region 1: Ilocos Region,128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,356651167,393993447,430678687,449373363,490900560,527801491,587597251,630300658,597917300,643928511
4,Region 2: Cagayan Valley,85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,234724620,268304482,304061090,303113509,323458416,358686602,385061271,399370781,367331248,399981012
5,Region 3: Central Luzon,368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1206580595,1306617621,1461916690,1511232438,1640708336,1860323671,2062393875,2184815143,1864111182,2061962928
6,Region 4A: CALABARZON,601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1721867470,1847878662,2006907506,2076801161,2206254090,2423069480,2706994745,2865792547,2565120461,2785911990
7,MIMAROPA: Southwestern Tagalog Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,231318002,242953220,276139956,274605684,287300274,321948705,370744808,375589553,342643478,382736604
8,Region 5: Bicol Region,96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,298247874,332295863,366663555,395256951,431762419,465966050,522014835,560835429,516847516,564612777
9,Region 6: Western Visayas,180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,527279962,566792962,615916323,667996261,720566797,791281769,860107768,919163055,850797186,937244700


As we can see, the column headers does not have the same format as the previous dataframes, which we would have to fix. To start with, we would be removing the **At Current Prices** found at the start of the column headers for the years.

In [244]:
data.columns = data.columns.map(lambda x: x.lstrip('At Current Prices'))
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,PHILIPPINES,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11060588831,12050591984,13206828252,13944157448,15132381470,16556651083,18265190258,19517863172,17951573570,19410568055
1,NCR: National Capital Region,1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3553088571,3890702252,4215201150,4532443704,4928222539,5327124065,5814440130,6294194685,5800819179,6157784762
2,CAR: Cordillera Administrative Region,90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,201836502,219989238,237096008,243673958,255584787,280805810,308267122,322106088,295502542,323711729
3,Region 1: Ilocos Region,128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,356651167,393993447,430678687,449373363,490900560,527801491,587597251,630300658,597917300,643928511
4,Region 2: Cagayan Valley,85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,234724620,268304482,304061090,303113509,323458416,358686602,385061271,399370781,367331248,399981012
5,Region 3: Central Luzon,368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1206580595,1306617621,1461916690,1511232438,1640708336,1860323671,2062393875,2184815143,1864111182,2061962928
6,Region 4A: CALABARZON,601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1721867470,1847878662,2006907506,2076801161,2206254090,2423069480,2706994745,2865792547,2565120461,2785911990
7,MIMAROPA: Southwestern Tagalog Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,231318002,242953220,276139956,274605684,287300274,321948705,370744808,375589553,342643478,382736604
8,Region 5: Bicol Region,96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,298247874,332295863,366663555,395256951,431762419,465966050,522014835,560835429,516847516,564612777
9,Region 6: Western Visayas,180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,527279962,566792962,615916323,667996261,720566797,791281769,860107768,919163055,850797186,937244700


Additionally, as we now know that the range of years can be represented by the first year, we would be removing the second year. 

In [245]:
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,PHILIPPINES,3697556205,4024398940,4350559772,4717808940,5323904177,5917282301,6550417113,7198244888,8050200621,...,11060588831,12050591984,13206828252,13944157448,15132381470,16556651083,18265190258,19517863172,17951573570,19410568055
1,NCR: National Capital Region,1237450701,1349240704,1435447457,1545711905,1743576537,1954778997,2176864045,2399583921,2655351616,...,3553088571,3890702252,4215201150,4532443704,4928222539,5327124065,5814440130,6294194685,5800819179,6157784762
2,CAR: Cordillera Administrative Region,90446009,95301368,104838201,113075065,126064714,133005766,145857673,153956921,165532965,...,201836502,219989238,237096008,243673958,255584787,280805810,308267122,322106088,295502542,323711729
3,Region 1: Ilocos Region,128944988,137048662,147243617,157948520,178137944,198599341,218353822,235792743,265782293,...,356651167,393993447,430678687,449373363,490900560,527801491,587597251,630300658,597917300,643928511
4,Region 2: Cagayan Valley,85937981,93104321,96648982,99116555,117674721,121542578,139253836,154237291,174606287,...,234724620,268304482,304061090,303113509,323458416,358686602,385061271,399370781,367331248,399981012
5,Region 3: Central Luzon,368786804,406500228,452337803,494424474,548718978,610192972,668848239,721803908,832147448,...,1206580595,1306617621,1461916690,1511232438,1640708336,1860323671,2062393875,2184815143,1864111182,2061962928
6,Region 4A: CALABARZON,601691426,671228264,713061159,788436127,881529587,987614867,1093698856,1194749373,1286125893,...,1721867470,1847878662,2006907506,2076801161,2206254090,2423069480,2706994745,2865792547,2565120461,2785911990
7,MIMAROPA: Southwestern Tagalog Region,66182382,71908260,83490454,95423489,103521732,121806090,131197013,148513117,174989809,...,231318002,242953220,276139956,274605684,287300274,321948705,370744808,375589553,342643478,382736604
8,Region 5: Bicol Region,96854451,105014935,113543367,120066693,135875452,148022711,158101543,178100846,203567999,...,298247874,332295863,366663555,395256951,431762419,465966050,522014835,560835429,516847516,564612777
9,Region 6: Western Visayas,180855163,190527313,206272535,220844486,244363281,270572999,299925476,329317934,375511999,...,527279962,566792962,615916323,667996261,720566797,791281769,860107768,919163055,850797186,937244700


Now, we can change it to its long representation as its columns follow the set format. 

In [246]:
data = change_to_long (data, 'GRDP')

Then, we can use the merge function to combine this to the combined dataframe.

In [247]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR,Gross Capital Formation,GRDP
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,,-136845782.0,,,100.0,5.7,579938180.0,3.697556e+09
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,,2177317.0,,,45.9,8.3,203930819.0,1.237451e+09
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,,-6416286.0,,,2.7,2.3,13865180.0,9.044601e+07
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,,-1891391.0,,,4.0,1.3,24454284.0,1.289450e+08
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,,5458610.0,,,2.4,4.2,32773347.0,8.593798e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Population, by Region
Let us now load the next dataset, which is holds the data about the regional Population.

In [248]:
data = pd.read_csv('data' + '/Population, by Region.csv', header = 1, sep = ';')
data

Unnamed: 0,Region,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,..National Capital Region (NCR),9961971,10153254,10344788,10536574,10729137,10921427,11113967,11306759,11500331,...,12080971,12275553,12469854,12664407,12859211,13066832,13264805,13453701,13633497,13804656
1,..Cordillera Administrative Region (CAR),1369249,1397362,1424800,1451561,1477718,1503126,1527858,1551914,1575358,...,1641438,1662169,1682167,1701488,1720134,1739380,1757717,1775210,1791881,1807738
2,..Region I (Ilocos Region),4209083,4265007,4320673,4376079,4431377,4486265,4540893,4595262,4649520,...,4810293,4863510,4916323,4968877,5021171,5076184,5128542,5178410,5225800,5270807
3,..Region II (Cagayan Valley),2819641,2860861,2902169,2943564,2985160,3026730,3068387,3110132,3152079,...,3278100,3320398,3362667,3405024,3447468,3493662,3537703,3579715,3619689,3657741
4,..Region III (Central Luzon),8233671,8420004,8607944,8797491,8989170,9181933,9376303,9572279,9770406,...,10372806,10577379,10783003,10990233,11199069,11437442,11667642,11890314,12105494,12313718
5,..Region IV-A (CALABARZON),9367205,9687547,10009909,10334289,10661585,10990009,11320451,11652912,11988312,...,13003881,13347384,13691969,14038573,14387196,14741686,15085285,15418944,15742673,16057299
6,..MIMAROPA Region,2305919,2352183,2398060,2443548,2488772,2533484,2577808,2621744,2665411,...,2793725,2835835,2877441,2918660,2959491,3006430,3051342,3094357,3135503,3174859
7,..Region V (Bicol Region),4698058,4772451,4846614,4920546,4994449,5067918,5141157,5214165,5287141,...,5504085,5576135,5647756,5719147,5790307,5865520,5937321,6005949,6071398,6133836
8,..Region VI (Western Visayas),6224949,6317904,6409990,6501206,6591799,6681274,6769880,6857616,6944719,...,7200096,7283710,7366225,7447870,7528646,7610389,7688734,7763898,7835883,7904899
9,..Region VII (Central Visayas),5723559,5830498,5937986,6046025,6154912,6264053,6373743,6483984,6595079,...,6930757,7044060,7157605,7271699,7386344,7511565,7631003,7745017,7853606,7957046


As seen in the dataframe above, the values in the `Region` column has '..' at the start. For consistency's sake, we would be removing these '..'. This is followed by the rearrangement of the rows so that it would follow the order of the other dataframes.

In [249]:
# removing '..' at the frontt of the values in the Region column
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))

# rearranging of the rows
data = data.iloc[np.arange(-1, len(data)-1)]
data = data.reset_index()
data.drop('index', axis = 1,inplace = True)
data

Unnamed: 0,Region,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Philippines,76723051,78273584,79832103,81398610,82977428,84559930,86150420,87748896,89359772,...,94227150,95870083,97516526,99170955,100833371,102530196,104169230,105755180,107288150,108771978
1,National Capital Region (NCR),9961971,10153254,10344788,10536574,10729137,10921427,11113967,11306759,11500331,...,12080971,12275553,12469854,12664407,12859211,13066832,13264805,13453701,13633497,13804656
2,Cordillera Administrative Region (CAR),1369249,1397362,1424800,1451561,1477718,1503126,1527858,1551914,1575358,...,1641438,1662169,1682167,1701488,1720134,1739380,1757717,1775210,1791881,1807738
3,Region I (Ilocos Region),4209083,4265007,4320673,4376079,4431377,4486265,4540893,4595262,4649520,...,4810293,4863510,4916323,4968877,5021171,5076184,5128542,5178410,5225800,5270807
4,Region II (Cagayan Valley),2819641,2860861,2902169,2943564,2985160,3026730,3068387,3110132,3152079,...,3278100,3320398,3362667,3405024,3447468,3493662,3537703,3579715,3619689,3657741
5,Region III (Central Luzon),8233671,8420004,8607944,8797491,8989170,9181933,9376303,9572279,9770406,...,10372806,10577379,10783003,10990233,11199069,11437442,11667642,11890314,12105494,12313718
6,Region IV-A (CALABARZON),9367205,9687547,10009909,10334289,10661585,10990009,11320451,11652912,11988312,...,13003881,13347384,13691969,14038573,14387196,14741686,15085285,15418944,15742673,16057299
7,MIMAROPA Region,2305919,2352183,2398060,2443548,2488772,2533484,2577808,2621744,2665411,...,2793725,2835835,2877441,2918660,2959491,3006430,3051342,3094357,3135503,3174859
8,Region V (Bicol Region),4698058,4772451,4846614,4920546,4994449,5067918,5141157,5214165,5287141,...,5504085,5576135,5647756,5719147,5790307,5865520,5937321,6005949,6071398,6133836
9,Region VI (Western Visayas),6224949,6317904,6409990,6501206,6591799,6681274,6769880,6857616,6944719,...,7200096,7283710,7366225,7447870,7528646,7610389,7688734,7763898,7835883,7904899


After this, we would be converting the values of the `Region` column to the formatted region names. Furthermore, we would also be renaming the column header `Region` to `Geolocation`. This is because there is the value **Philippines** in this column, which is not a region.

In [250]:
# converts the region values to the set format
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)

# renaming the column header Region to Geolocation
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,PHILIPPINES,76723051,78273584,79832103,81398610,82977428,84559930,86150420,87748896,89359772,...,94227150,95870083,97516526,99170955,100833371,102530196,104169230,105755180,107288150,108771978
1,NCR: National Capital Region,9961971,10153254,10344788,10536574,10729137,10921427,11113967,11306759,11500331,...,12080971,12275553,12469854,12664407,12859211,13066832,13264805,13453701,13633497,13804656
2,CAR: Cordillera Administrative Region,1369249,1397362,1424800,1451561,1477718,1503126,1527858,1551914,1575358,...,1641438,1662169,1682167,1701488,1720134,1739380,1757717,1775210,1791881,1807738
3,Region 1: Ilocos Region,4209083,4265007,4320673,4376079,4431377,4486265,4540893,4595262,4649520,...,4810293,4863510,4916323,4968877,5021171,5076184,5128542,5178410,5225800,5270807
4,Region 2: Cagayan Valley,2819641,2860861,2902169,2943564,2985160,3026730,3068387,3110132,3152079,...,3278100,3320398,3362667,3405024,3447468,3493662,3537703,3579715,3619689,3657741
5,Region 3: Central Luzon,8233671,8420004,8607944,8797491,8989170,9181933,9376303,9572279,9770406,...,10372806,10577379,10783003,10990233,11199069,11437442,11667642,11890314,12105494,12313718
6,Region 4A: CALABARZON,9367205,9687547,10009909,10334289,10661585,10990009,11320451,11652912,11988312,...,13003881,13347384,13691969,14038573,14387196,14741686,15085285,15418944,15742673,16057299
7,MIMAROPA: Southwestern Tagalog Region,2305919,2352183,2398060,2443548,2488772,2533484,2577808,2621744,2665411,...,2793725,2835835,2877441,2918660,2959491,3006430,3051342,3094357,3135503,3174859
8,Region 5: Bicol Region,4698058,4772451,4846614,4920546,4994449,5067918,5141157,5214165,5287141,...,5504085,5576135,5647756,5719147,5790307,5865520,5937321,6005949,6071398,6133836
9,Region 6: Western Visayas,6224949,6317904,6409990,6501206,6591799,6681274,6769880,6857616,6944719,...,7200096,7283710,7366225,7447870,7528646,7610389,7688734,7763898,7835883,7904899


Since the dataframe now follows the format of the other dataframes, we can now convert it to a long representation.

In [251]:
data = change_to_long (data, 'Population')

Because the datagrame is now in a long representation, we can now [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) it with the combined dataset.

In [252]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR,Gross Capital Formation,GRDP,Population
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,-136845782.0,,,100.0,5.7,579938180.0,3.697556e+09,76723051.0
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,2177317.0,,,45.9,8.3,203930819.0,1.237451e+09,9961971.0
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,-6416286.0,,,2.7,2.3,13865180.0,9.044601e+07,1369249.0
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,-1891391.0,,,4.0,1.3,24454284.0,1.289450e+08,4209083.0
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,5458610.0,,,2.4,4.2,32773347.0,8.593798e+07,2819641.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Primary Drop-out rates by Region, Sex and Year
Consequently, we can now load the next dataset.

In [253]:
data = pd.read_csv('data' + '/Primary Drop-out rates by Region, Sex and Year.csv',header = 1,sep = ';')
data

Unnamed: 0,Region,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,...,Girls 2006,Girls 2007,Girls 2008,Girls 2009,Girls 2010,Girls 2011,Girls 2012,Girls 2013,Girls 2014,Girls 2015
0,Philippines,6.37,5.99,6.02,6.28,6.29,6.36,6.24,4.85,3.26,...,5.0,4.72,4.87,4.93,5.02,5.18,5.12,4.04,2.77,2.01
1,NCR,2.37,2.83,2.92,4.07,3.31,2.93,4.1,4.36,4.25,...,1.46,1.8,2.13,3.33,2.53,2.42,3.27,3.78,3.6,1.45
2,CAR,5.67,6.41,5.49,5.04,6.0,4.91,4.79,3.75,2.84,...,4.23,5.09,4.16,2.73,4.94,3.57,3.77,3.15,1.71,1.36
3,Region I,3.93,3.76,3.09,3.6,3.78,3.36,3.1,1.92,1.13,...,3.15,2.92,2.41,2.75,3.02,2.53,2.48,1.42,0.93,0.83
4,Region II,4.72,4.95,4.3,4.81,4.95,4.73,3.93,2.42,2.61,...,3.21,3.71,3.28,3.45,3.71,3.59,2.97,1.67,1.96,0.98
5,Region III,3.69,3.97,3.94,3.72,4.15,4.07,3.71,1.9,2.28,...,2.74,3.02,3.05,2.7,3.02,3.17,2.89,1.37,1.74,0.93
6,Region IV-A,3.9,4.41,3.87,2.57,3.75,2.46,3.26,6.03,1.55,...,2.88,3.21,2.87,1.78,2.55,1.83,1.71,5.13,1.22,1.54
7,Region IV-B,6.7,7.4,6.4,6.93,6.25,6.13,5.87,4.73,2.78,...,5.18,6.29,5.23,5.64,4.79,4.86,4.52,3.84,2.25,0.89
8,Region V,6.06,5.78,5.9,5.8,5.79,5.7,5.35,3.19,2.72,...,4.78,4.61,4.71,4.63,4.51,4.46,4.17,2.39,2.16,1.92
9,Region VI,6.38,6.14,6.03,6.05,6.56,5.64,4.8,2.91,1.97,...,4.61,4.45,4.49,4.4,4.88,4.17,3.46,2.1,1.54,1.02


As the dataframe follows the sorting of the `region_names`, we can now convert the values of the `Region` column to the standard. Additionally, to keep the consistency throughout the dataframes, we would be renaming the `Region` column to `Geolocation`, using the [`rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html). 

In [254]:
# renames the data in the Geolocation for consistency
data['Region'] = region_names
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,...,Girls 2006,Girls 2007,Girls 2008,Girls 2009,Girls 2010,Girls 2011,Girls 2012,Girls 2013,Girls 2014,Girls 2015
0,PHILIPPINES,6.37,5.99,6.02,6.28,6.29,6.36,6.24,4.85,3.26,...,5.0,4.72,4.87,4.93,5.02,5.18,5.12,4.04,2.77,2.01
1,NCR: National Capital Region,2.37,2.83,2.92,4.07,3.31,2.93,4.1,4.36,4.25,...,1.46,1.8,2.13,3.33,2.53,2.42,3.27,3.78,3.6,1.45
2,CAR: Cordillera Administrative Region,5.67,6.41,5.49,5.04,6.0,4.91,4.79,3.75,2.84,...,4.23,5.09,4.16,2.73,4.94,3.57,3.77,3.15,1.71,1.36
3,Region 1: Ilocos Region,3.93,3.76,3.09,3.6,3.78,3.36,3.1,1.92,1.13,...,3.15,2.92,2.41,2.75,3.02,2.53,2.48,1.42,0.93,0.83
4,Region 2: Cagayan Valley,4.72,4.95,4.3,4.81,4.95,4.73,3.93,2.42,2.61,...,3.21,3.71,3.28,3.45,3.71,3.59,2.97,1.67,1.96,0.98
5,Region 3: Central Luzon,3.69,3.97,3.94,3.72,4.15,4.07,3.71,1.9,2.28,...,2.74,3.02,3.05,2.7,3.02,3.17,2.89,1.37,1.74,0.93
6,Region 4A: CALABARZON,3.9,4.41,3.87,2.57,3.75,2.46,3.26,6.03,1.55,...,2.88,3.21,2.87,1.78,2.55,1.83,1.71,5.13,1.22,1.54
7,MIMAROPA: Southwestern Tagalog Region,6.7,7.4,6.4,6.93,6.25,6.13,5.87,4.73,2.78,...,5.18,6.29,5.23,5.64,4.79,4.86,4.52,3.84,2.25,0.89
8,Region 5: Bicol Region,6.06,5.78,5.9,5.8,5.79,5.7,5.35,3.19,2.72,...,4.78,4.61,4.71,4.63,4.51,4.46,4.17,2.39,2.16,1.92
9,Region 6: Western Visayas,6.38,6.14,6.03,6.05,6.56,5.64,4.8,2.91,1.97,...,4.61,4.45,4.49,4.4,4.88,4.17,3.46,2.1,1.54,1.02


As we can see in the column headers of the given dataset, it is composed of three divisions: (1) **Both Sexes**, (2) **Boys**, and (3) **Girls**. We would separate this into three different divisions in order to create three different columns in the combined dataset.

In [255]:
both_sexes, boys, girls = [], [], []
both_sexes = data.iloc[:,:11]
boys = data.iloc[:,11:21]
girls = data.iloc[:,21:]

Because we would want the column headers to just include the year, we would be removing the **Both Sexes**, **Boys**, and **Girls** at the start of the column names. 

In [256]:
both_sexes.columns = both_sexes.columns.map(lambda x: x.lstrip('Both Sexes '))
boys.columns = boys.columns.map(lambda x: x.lstrip('Boys '))
girls.columns = girls.columns.map(lambda x: x.lstrip('Girls '))

For the boys and girls partition, we will insert a Geolocation column at the start of their own dataframe since the range of column numbers during the division of the dataframe by columns does not capture the Geolocation column which is in index 0.

In [257]:
boys.insert(loc=0, column='Geolocation', value=data.iloc[:, 0])
girls.insert(loc=0, column='Geolocation', value=data.iloc[:, 0])

As all of the dataframes now include all the columns needed, we can now use the `change_to_long` function to convert it to its long representation.

In [258]:
both_sexes = change_to_long (both_sexes, 'Primary Drop-out rate')
boys = change_to_long (boys, 'Primary Drop-out rate (Boys)')
girls = change_to_long (girls, 'Primary Drop-out rate (Girls)')

Afterwards, the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function can be used to merge these three dataframes to the combined dataset.

In [259]:
combined_data = combined_data.merge(both_sexes, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(girls, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys)
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,,,100.0,5.7,579938180.0,3.697556e+09,76723051.0,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,,,45.9,8.3,203930819.0,1.237451e+09,9961971.0,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,,,2.7,2.3,13865180.0,9.044601e+07,1369249.0,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,,,4.0,1.3,24454284.0,1.289450e+08,4209083.0,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,,,2.4,4.2,32773347.0,8.593798e+07,2819641.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Secondary Drop-out rates by Region, Sex and Year
Then, this is followed by the loading of the next dataset.

In [260]:
data = pd.read_csv('data' + '/Secondary Drop-out rates by Geolocation, Sex and Year.csv', header = 1, sep = ';')
data

Unnamed: 0,Geolocation,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,...,Girls 2006,Girls 2007,Girls 2008,Girls 2009,Girls 2010,Girls 2011,Girls 2012,Girls 2013,Girls 2014,Girls 2015
0,Philippines,8.5,7.5,7.4,8.0,7.8,7.8,8.1,7.6,6.9,...,6.7,5.6,5.8,6.2,5.7,5.6,6.2,5.8,5.5,4.9
1,NCR,7.4,6.1,5.6,6.9,6.7,5.8,6.9,7.2,6.7,...,6.8,3.5,3.8,5.0,4.6,3.9,4.9,5.8,5.1,3.6
2,CAR,4.2,7.8,5.1,7.9,8.1,6.4,6.9,7.8,6.3,...,2.6,5.3,2.7,4.4,5.6,2.7,4.1,5.3,3.8,3.2
3,Region I,4.1,4.8,5.4,5.5,5.9,6.9,6.1,5.3,4.9,...,4.1,3.1,3.9,3.8,4.2,4.9,4.2,3.5,3.7,4.4
4,Region II,7.1,6.7,6.1,6.9,6.7,6.3,6.9,5.7,6.2,...,5.7,5.3,4.8,5.3,5.4,4.4,5.2,4.3,4.6,4.4
5,Region III,7.3,6.8,6.9,7.0,6.7,7.6,7.6,5.3,6.4,...,5.5,4.9,5.4,4.9,4.7,5.5,5.6,3.8,4.9,5.4
6,Region IV-A,5.3,5.5,5.4,6.5,5.9,4.5,5.4,7.0,5.3,...,3.4,3.2,3.2,4.2,3.9,2.7,3.6,5.0,4.2,3.5
7,Region IV-B,9.1,8.8,8.2,8.6,9.4,8.9,7.7,7.6,7.1,...,6.7,7.4,6.8,6.3,7.4,6.1,5.6,5.7,5.5,4.3
8,Region V,9.2,8.4,8.8,8.5,8.6,9.3,9.2,7.9,7.5,...,6.9,7.0,6.9,6.7,5.8,6.6,6.5,5.5,5.4,5.9
9,Region VI,8.9,7.1,7.4,7.1,7.5,7.4,7.7,6.8,5.5,...,6.2,5.3,5.4,5.4,5.5,5.0,5.5,5.0,4.0,4.8


Like in the previous datasets, we would be converting the values of the `Geolocation` column so that it would have the same values as the combined dataframe. This would not affect the data in anyway as the `Geolocation` column and the `region_names` variable follow the same ordering of the regions.

In [261]:
# renames the data in the Geolocation for consistency
data['Geolocation'] = region_names
data

Unnamed: 0,Geolocation,Both Sexes 2006,Both Sexes 2007,Both Sexes 2008,Both Sexes 2009,Both Sexes 2010,Both Sexes 2011,Both Sexes 2012,Both Sexes 2013,Both Sexes 2014,...,Girls 2006,Girls 2007,Girls 2008,Girls 2009,Girls 2010,Girls 2011,Girls 2012,Girls 2013,Girls 2014,Girls 2015
0,PHILIPPINES,8.5,7.5,7.4,8.0,7.8,7.8,8.1,7.6,6.9,...,6.7,5.6,5.8,6.2,5.7,5.6,6.2,5.8,5.5,4.9
1,NCR: National Capital Region,7.4,6.1,5.6,6.9,6.7,5.8,6.9,7.2,6.7,...,6.8,3.5,3.8,5.0,4.6,3.9,4.9,5.8,5.1,3.6
2,CAR: Cordillera Administrative Region,4.2,7.8,5.1,7.9,8.1,6.4,6.9,7.8,6.3,...,2.6,5.3,2.7,4.4,5.6,2.7,4.1,5.3,3.8,3.2
3,Region 1: Ilocos Region,4.1,4.8,5.4,5.5,5.9,6.9,6.1,5.3,4.9,...,4.1,3.1,3.9,3.8,4.2,4.9,4.2,3.5,3.7,4.4
4,Region 2: Cagayan Valley,7.1,6.7,6.1,6.9,6.7,6.3,6.9,5.7,6.2,...,5.7,5.3,4.8,5.3,5.4,4.4,5.2,4.3,4.6,4.4
5,Region 3: Central Luzon,7.3,6.8,6.9,7.0,6.7,7.6,7.6,5.3,6.4,...,5.5,4.9,5.4,4.9,4.7,5.5,5.6,3.8,4.9,5.4
6,Region 4A: CALABARZON,5.3,5.5,5.4,6.5,5.9,4.5,5.4,7.0,5.3,...,3.4,3.2,3.2,4.2,3.9,2.7,3.6,5.0,4.2,3.5
7,MIMAROPA: Southwestern Tagalog Region,9.1,8.8,8.2,8.6,9.4,8.9,7.7,7.6,7.1,...,6.7,7.4,6.8,6.3,7.4,6.1,5.6,5.7,5.5,4.3
8,Region 5: Bicol Region,9.2,8.4,8.8,8.5,8.6,9.3,9.2,7.9,7.5,...,6.9,7.0,6.9,6.7,5.8,6.6,6.5,5.5,5.4,5.9
9,Region 6: Western Visayas,8.9,7.1,7.4,7.1,7.5,7.4,7.7,6.8,5.5,...,6.2,5.3,5.4,5.4,5.5,5.0,5.5,5.0,4.0,4.8


Similar to the previous dataset, we will include all columns for this dataset. However, we have to separate them into three different dataframes: (1) **Both sexes**, (2) **Boys**, and (3) **Girls**. 

In [262]:
both_sexes, boys, girls = [],[],[]
both_sexes = data.iloc[:,:11]
boys = data.iloc[:,11:21]
girls = data.iloc[:,21:]

We removed the terms **Both sexes**, **Boys**, and **Girls** at the column headers because the column headers would need to only include the years. If these terms are not removed, this dataset would not be merged to the combined dataset correctly.

In [263]:
both_sexes.columns = both_sexes.columns.map(lambda x: x.lstrip('Both Sexes '))
boys.columns = boys.columns.map(lambda x: x.lstrip('Boys '))
girls.columns = girls.columns.map(lambda x: x.lstrip('Girls '))

As the dataframes for the **Boys** and **Girls** do not include the `Geolocation` column, we would have to insert this at the start of each of the dataframe. 

In [264]:
boys.insert(loc=0, column='Geolocation', value=data.iloc[:, 0])
girls.insert(loc=0, column='Geolocation', value=data.iloc[:, 0])

After this, we would now be able to convert the dataframes into their long representations.

In [265]:
both_sexes = change_to_long (both_sexes, 'Secondary Drop-out rate')
boys = change_to_long (boys, 'Secondary Drop-out rate (Boys)')
girls = change_to_long (girls, 'Secondary Drop-out rate (Girls)')

Seeing that all three dataframes are now in a long representation, we can now merge it to the combined dataset.

In [266]:
combined_data = combined_data.merge(both_sexes, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(girls, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.merge(boys, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Consumption Expenditure GR,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys)
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,5.7,579938180.0,3.697556e+09,76723051.0,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,8.3,203930819.0,1.237451e+09,9961971.0,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,2.3,13865180.0,9.044601e+07,1369249.0,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,1.3,24454284.0,1.289450e+08,4209083.0,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,4.2,32773347.0,8.593798e+07,2819641.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


##### Quarterly Producer Price Index for Agriculture (First Quarter 2018 to Third Quarter 2021)
We will now be loading the final dataset, which is the **Quarterly Producer Price Index for Agriculture (First Quarter 2018 to Third Quarter 2021)**.

In [267]:
data = pd.read_csv('data' + '/Quarterly Producer Price Index for Agriculture (2018=100) _ First Quarter 2018 to First Quarter 2022.csv',header = 1,sep = ';')
data

Unnamed: 0,Region,Commodity,2018 First Quarter (Jan-Mar),2018 Second Quarter (Apr-Jun),2018 Third Quarter (Jul-Sep),2018 Fourth Quarter (Oct-Dec),2018 Average (Jan-Dec),2019 First Quarter (Jan-Mar),2019 Second Quarter (Apr-Jun),2019 Third Quarter (Jul-Sep),...,2021 First Quarter (Jan-Mar),2021 Second Quarter (Apr-Jun),2021 Third Quarter (Jul-Sep),2021 Fourth Quarter (Oct-Dec),2021 Average (Jan-Dec),2022 First Quarter (Jan-Mar),2022 Second Quarter (Apr-Jun),2022 Third Quarter (Jul-Sep),2022 Fourth Quarter (Oct-Dec),2022 Average (Jan-Dec)
0,PHILIPPINES,AGRICULTURE,95.4,101.7,104.4,98.4,100.0,96.0,96.4,93.4,...,101.7,103.4,104.0,103.0,103.025,106.5,..,..,..,..
1,PHILIPPINES,..CROPS,94.2,102.5,105.3,98.0,100.0,95.2,95.6,91.6,...,95.8,98.8,101.8,98.9,98.825,101.2,..,..,..,..
2,PHILIPPINES,�.Cereals,96.3,101.0,105.7,97.1,100.0,93.4,89.3,81.8,...,85.2,86.9,87.7,86.3,86.525,91.9,..,..,..,..
3,PHILIPPINES,�.Rootcrops,97.1,92.1,102.6,108.2,100.0,101.4,110.8,125.0,...,121.3,88.1,94.5,96.2,100.025,98.3,..,..,..,..
4,PHILIPPINES,�.Beans and Legumes,100.1,96.0,106.1,97.8,100.0,95.5,95.6,93.6,...,94.5,89.9,90.1,95.2,92.425,91.9,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,..FISHERY,95.8,102.4,98.2,103.6,100.0,98.5,102.0,99.4,...,96.0,99.7,100.4,103.8,99.975,101.0,..,..,..,..
302,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,�.Aquaculture,93.0,100.8,100.7,105.5,100.0,101.1,104.6,101.0,...,115.2,115.6,120.0,122.1,118.225,120.0,..,..,..,..
303,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,�.Commercial Fishery,100.5,101.2,97.2,101.1,100.0,96.4,94.6,87.1,...,80.7,77.7,74.4,80.9,78.425,81.4,..,..,..,..
304,..BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDA...,�.Inland Municipal Fishery,90.4,112.5,92.8,104.3,100.0,95.4,110.7,92.3,...,85.0,106.4,115.1,115.7,105.550,114.4,..,..,..,..


Similar to how we handled the previous datasets, we would be removing the '..' found at the start of the values in the `Region` column, as well as the non-alphabet characters found in the `Commodity` column.

In [268]:
data['Region'] = data['Region'].map(lambda x: x.lstrip('..'))
data['Commodity'] = data['Commodity'].map(lambda x: x.lstrip('..'))
data['Commodity'] = data['Commodity'].map(lambda x: x.lstrip('….'))
data

Unnamed: 0,Region,Commodity,2018 First Quarter (Jan-Mar),2018 Second Quarter (Apr-Jun),2018 Third Quarter (Jul-Sep),2018 Fourth Quarter (Oct-Dec),2018 Average (Jan-Dec),2019 First Quarter (Jan-Mar),2019 Second Quarter (Apr-Jun),2019 Third Quarter (Jul-Sep),...,2021 First Quarter (Jan-Mar),2021 Second Quarter (Apr-Jun),2021 Third Quarter (Jul-Sep),2021 Fourth Quarter (Oct-Dec),2021 Average (Jan-Dec),2022 First Quarter (Jan-Mar),2022 Second Quarter (Apr-Jun),2022 Third Quarter (Jul-Sep),2022 Fourth Quarter (Oct-Dec),2022 Average (Jan-Dec)
0,PHILIPPINES,AGRICULTURE,95.4,101.7,104.4,98.4,100.0,96.0,96.4,93.4,...,101.7,103.4,104.0,103.0,103.025,106.5,..,..,..,..
1,PHILIPPINES,CROPS,94.2,102.5,105.3,98.0,100.0,95.2,95.6,91.6,...,95.8,98.8,101.8,98.9,98.825,101.2,..,..,..,..
2,PHILIPPINES,�.Cereals,96.3,101.0,105.7,97.1,100.0,93.4,89.3,81.8,...,85.2,86.9,87.7,86.3,86.525,91.9,..,..,..,..
3,PHILIPPINES,�.Rootcrops,97.1,92.1,102.6,108.2,100.0,101.4,110.8,125.0,...,121.3,88.1,94.5,96.2,100.025,98.3,..,..,..,..
4,PHILIPPINES,�.Beans and Legumes,100.1,96.0,106.1,97.8,100.0,95.5,95.6,93.6,...,94.5,89.9,90.1,95.2,92.425,91.9,..,..,..,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
301,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,FISHERY,95.8,102.4,98.2,103.6,100.0,98.5,102.0,99.4,...,96.0,99.7,100.4,103.8,99.975,101.0,..,..,..,..
302,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,�.Aquaculture,93.0,100.8,100.7,105.5,100.0,101.1,104.6,101.0,...,115.2,115.6,120.0,122.1,118.225,120.0,..,..,..,..
303,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,�.Commercial Fishery,100.5,101.2,97.2,101.1,100.0,96.4,94.6,87.1,...,80.7,77.7,74.4,80.9,78.425,81.4,..,..,..,..
304,BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANA...,�.Inland Municipal Fishery,90.4,112.5,92.8,104.3,100.0,95.4,110.7,92.3,...,85.0,106.4,115.1,115.7,105.550,114.4,..,..,..,..


As we only want to explore the cumulative **Quarterly Producer Price Index for Agriculture**, we would only be taking rows that has this value for the `Commodity` column. Then, as all of the values in the `Commodity` column would now be the same, we can now drop this column.

In [269]:
data = data[data['Commodity'] == 'AGRICULTURE']
data = data.drop("Commodity", axis = 1)
data

Unnamed: 0,Region,2018 First Quarter (Jan-Mar),2018 Second Quarter (Apr-Jun),2018 Third Quarter (Jul-Sep),2018 Fourth Quarter (Oct-Dec),2018 Average (Jan-Dec),2019 First Quarter (Jan-Mar),2019 Second Quarter (Apr-Jun),2019 Third Quarter (Jul-Sep),2019 Fourth Quarter (Oct-Dec),...,2021 First Quarter (Jan-Mar),2021 Second Quarter (Apr-Jun),2021 Third Quarter (Jul-Sep),2021 Fourth Quarter (Oct-Dec),2021 Average (Jan-Dec),2022 First Quarter (Jan-Mar),2022 Second Quarter (Apr-Jun),2022 Third Quarter (Jul-Sep),2022 Fourth Quarter (Oct-Dec),2022 Average (Jan-Dec)
0,PHILIPPINES,95.4,101.7,104.4,98.4,100.0,96.0,96.4,93.4,91.7,...,101.7,103.4,104.0,103.0,103.025,106.5,..,..,..,..
18,CORDILLERA ADMINISTRATIVE REGION (CAR),91.2,89.6,116.6,102.6,100.0,79.7,89.3,88.6,84.3,...,91.6,56.0,90.5,81.2,79.825,63.6,..,..,..,..
36,REGION I (ILOCOS REGION),95.7,94.5,107.8,102.0,100.0,91.3,88.7,89.7,89.4,...,96.9,93.2,96.2,94.4,95.175,95.0,..,..,..,..
54,REGION II (CAGAYAN VALLEY),90.7,100.9,108.3,100.1,100.0,96.6,91.4,86.8,80.4,...,96.1,94.6,94.8,98.1,95.9,99.2,..,..,..,..
72,REGION III (CENTRAL LUZON),98.9,99.8,102.3,99.0,100.0,90.8,88.5,86.4,88.7,...,116.6,105.4,101.4,106.3,107.425,107.0,..,..,..,..
90,REGION IV-A (CALABARZON),106.2,101.9,99.0,93.0,100.0,102.2,100.7,101.0,97.0,...,132.7,118.5,114.7,117.9,120.95,128.6,..,..,..,..
108,MIMAROPA REGION,95.8,101.8,105.1,97.3,100.0,91.7,90.3,84.6,82.0,...,84.3,82.3,80.6,78.0,81.3,84.6,..,..,..,..
126,REGION V (BICOL REGION),99.6,102.6,101.3,96.6,100.0,96.1,94.0,88.5,85.6,...,101.7,100.1,94.9,97.7,98.6,109.3,..,..,..,..
144,REGION VI (WESTERN VISAYAS),89.8,103.3,107.6,99.3,100.0,98.2,103.7,98.7,96.4,...,97.2,109.2,115.2,107.1,107.175,106.8,..,..,..,..
162,REGION VII (CENTRAL VISAYAS),86.9,103.4,114.9,94.8,100.0,92.2,91.2,91.7,89.6,...,95.5,101.1,97.7,99.3,98.4,110.4,..,..,..,..


Since there are no rows that correspond to the **National Capital Region** in this dataset, we would remove the value that corresponds to the **National Capital Region** from `region_names`. Moreover, we would be renaming the `Region` column header to `Geolocation` to maintain consistency.

In [270]:
data['Region'].unique()

array(['PHILIPPINES', 'CORDILLERA ADMINISTRATIVE REGION (CAR)',
       'REGION I (ILOCOS REGION)', 'REGION II (CAGAYAN VALLEY)',
       'REGION III (CENTRAL LUZON)', 'REGION IV-A (CALABARZON)',
       'MIMAROPA REGION', 'REGION V (BICOL REGION)',
       'REGION VI (WESTERN VISAYAS)', 'REGION VII (CENTRAL VISAYAS)',
       'REGION VIII (EASTERN VISAYAS)', 'REGION IX (ZAMBOANGA PENINSULA)',
       'REGION X (NORTHERN MINDANAO)', 'REGION XI (DAVAO REGION)',
       'REGION XII (SOCCSKSARGEN)', 'REGION XIII (CARAGA)',
       'BANGSAMORO AUTONOMOUS REGION IN MUSLIM MINDANAO (BARMM)'],
      dtype=object)

In [271]:
data['Region'] = region_names [0:1] + region_names [2: ]
data.set_index('Region')
data = data.reset_index(drop=True)
data.rename(columns = {'Region': 'Geolocation'}, inplace = True)

Because we would only need a yearly overview, we would only get the columns that corresponds to this and remove the rest. Afterwards, the column names would be renamed to only include the year. 

In [272]:
data = data[['Geolocation', '2018 Average (Jan-Dec)','2019 Average (Jan-Dec)','2020 Average (Jan-Dec)', '2021 Average (Jan-Dec)']]
data.columns = data.columns.str[:4]
data.rename(columns = {'Geol': 'Geolocation'},inplace = True)
data

Unnamed: 0,Geolocation,2018,2019,2020,2021
0,PHILIPPINES,100.0,94.375,95.05,103.025
1,CAR: Cordillera Administrative Region,100.0,85.475,64.35,79.825
2,Region 1: Ilocos Region,100.0,89.775,79.375,95.175
3,Region 2: Cagayan Valley,100.0,88.8,76.1,95.9
4,Region 3: Central Luzon,100.0,88.6,79.125,107.425
5,Region 4A: CALABARZON,100.0,100.225,102.875,120.95
6,MIMAROPA: Southwestern Tagalog Region,100.0,87.15,67.975,81.3
7,Region 5: Bicol Region,100.0,91.05,79.6,98.6
8,Region 6: Western Visayas,100.0,99.25,97.55,107.175
9,Region 7: Central Visayas,100.0,91.175,82.0,98.4


We can now convert our dataframe into its long representation.

In [273]:
data = change_to_long (data, 'Price Index for Agriculture')

And finally, we can merge this final dataset to the combined dataset.

In [274]:
combined_data = combined_data.merge(data, how = 'outer', on = ['Geolocation', 'Year'])
combined_data = combined_data.reset_index (drop=True)
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,579938180.0,3.697556e+09,76723051.0,,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,203930819.0,1.237451e+09,9961971.0,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,13865180.0,9.044601e+07,1369249.0,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,24454284.0,1.289450e+08,4209083.0,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,32773347.0,8.593798e+07,2819641.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,Region 10: Northern Mindanao,2022,,,,,,,,,...,,,,,,,,,,
410,Region 11: Davao Region,2022,,,,,,,,,...,,,,,,,,,,
411,Region 12: SOCCSKSARGEN,2022,,,,,,,,,...,,,,,,,,,,
412,CARAGA: Cordillera Administrative Region,2022,,,,,,,,,...,,,,,,,,,,


## Data Cleaning
There are four steps for the cleaning of the combined dataset: (1) the dropping of the rows wherein all the values of the indicator columns are **NaN**s, (2) the fixing of the data types of the columns, (3) the dropping of duplicated rows, and (4) the cleaning of the individual columns.

### Dropping of rows that has all **NaN** values
The first thing that we would do is to drop the rows that only have **NaN** values. This means that for that specific region in that specific year, there is no data that is collected for any of the indicators, thus, we would not be able to derive any knowledge from it.

Using the combination of the isna and sum functions, we would be able to see the total number of **NaN** values a specific row has.

In [275]:
combined_data.isna().sum(axis = 1).sort_values(ascending=False)

413    51
405    51
397    51
398    51
399    51
       ..
338    10
339    10
340    10
341    10
337    10
Length: 414, dtype: int64

From the result above, we can see that there are rows that have all **NaN** values (i.e., where the number of **NaN** values outputted is equal to the number of the columns for indicators). Since we know that the `Geolocation` and `Year` column does not have any **NaN** values, we would set a threshold of 3 (which means that if there are at least three non-NaN values, the row would not be dropped ) in the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function.

In [276]:
combined_data = combined_data.dropna(axis = 0, thresh = 3)

With this, we would have a new dataframe that has 377 rows, with the `Year` having a range of from 2000 to 2021.

In [277]:
combined_data['Year'].describe()

count     396.000000
mean     2010.500000
std         6.352314
min      2000.000000
25%      2005.000000
50%      2010.500000
75%      2016.000000
max      2021.000000
Name: Year, dtype: float64

In [278]:
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,579938180.0,3.697556e+09,76723051.0,,,,,,,
1,NCR: National Capital Region,2000,,101,101.92,100.13,79.05,79.5,78.57,,...,203930819.0,1.237451e+09,9961971.0,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,13865180.0,9.044601e+07,1369249.0,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,24454284.0,1.289450e+08,4209083.0,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,32773347.0,8.593798e+07,2819641.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,,,...,164566009.0,9.492320e+08,,,,,,,,107.975
392,Region 11: Davao Region,2021,,,,,,,,,...,257595240.0,9.672276e+08,,,,,,,,110.850
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,...,87077953.0,5.039756e+08,,,,,,,,103.350
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,...,100730468.0,3.317629e+08,,,,,,,,104.525


### Fixing the Data Types of the Columns
Using the [`dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) property, we would see that some indicators columns are **object**-types. As we know that all columns except for the `Geolocation` and `Year` are supposed to be **float64** columns, we would need to convert these objects.

In [279]:
combined_data.dtypes

Geolocation                                                       object
Year                                                               int64
1.2.1 Poverty Proportion                                         float64
1.4.1 Net Elem Enrolment Rate                                     object
1.4.1 Net Elem Enrolment Rate (Girls)                             object
1.4.1 Net Elem Enrolment Rate (Boys)                              object
1.4.1 Net JHS Enrolment Rate                                      object
1.4.1 Net JHS Enrolment Rate (Girls)                              object
1.4.1 Net JHS Enrolment Rate (Boys)                               object
1.4.1 Net SHS Enrolment Rate                                     float64
1.4.1 Net SHS Enrolment Rate (Girls)                             float64
1.4.1 Net SHS Enrolment Rate (Boys)                              float64
1.5.4 Proportion of LGU with DRR                                 float64
3.4.1 Mortality rate credited to NCD               

For each of the column that are not the `Geolocation` and `Year` columns, their datatypes are checker. In the scenario that they are not **float64**, the function [`astype`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) was used in order to convert it to float. Even though we are sure that all of the values in these columns can be transformed to float as this was its original value in the csv file, the parameter `errors` was still set to **raise** for validation.

In [280]:
for x in combined_data.columns.difference(['Geolocation', 'Year']):
    if combined_data[x].dtypes != 'float64':
        combined_data.loc[:, x] = combined_data[x].astype(float, errors = 'raise')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)


Using the [`info`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) function, we would see that all the indicator columns are now **float64**.

In [281]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 396 entries, 0 to 395
Data columns (total 53 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   Geolocation                                                    396 non-null    object 
 1   Year                                                           396 non-null    int64  
 2   1.2.1 Poverty Proportion                                       36 non-null     float64
 3   1.4.1 Net Elem Enrolment Rate                                  376 non-null    float64
 4   1.4.1 Net Elem Enrolment Rate (Girls)                          376 non-null    float64
 5   1.4.1 Net Elem Enrolment Rate (Boys)                           376 non-null    float64
 6   1.4.1 Net JHS Enrolment Rate                                   376 non-null    float64
 7   1.4.1 Net JHS Enrolment Rate (Girls)                          

### Dropping of Duplicated Rows
Using a combination of [`duplicated`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) and [`sum`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html), we would be able to see how many rows are duplicated and should be dropped.

In [282]:
combined_data.duplicated().sum()

0

As the combination of these functions outputted the number 0, then we can conclude that each of the rows are unique. This means that we would not have to drop any of the rows.

### Cleaning of Each Columns
As each of the column came from different datasets, we would be checking and cleaning the values for each of the column. However, even though there might be anomalies (e.g., a proportion or rate being higher than 100), we would not be droping them, as the data is from the official report of the Philippines to the United Nations. Although, it is important to notes these ## anomalies.

#### 1.2.1. Proportion of population living below the national poverty line
For this column, we would be using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function in order to check if we have an outliner. This is due to the fact that we are expecting a value of 0 to 10, as we are talking about proportion or percentage.

In [283]:
combined_data['1.2.1 Poverty Proportion'].describe()

count    36.000000
mean     24.463889
std      13.917228
min       2.200000
25%      16.000000
50%      23.300000
75%      31.200000
max      61.800000
Name: 1.2.1 Poverty Proportion, dtype: float64

**Result:** From what we can see, the minimum and maximum values of the columns are within the range of values that we expected from this column. Thus, there are no outliers that we need to remove or drop.

#### 1.4.1p5 Net Enrolment Rate in elementary
Just like in the first column, we would be using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function in order to check the value range of the variable. 

According to the Philippine Statistics Authority (n.d.), the formula for net enrollment rate in elementary is defined as total enrollment of aged six to 11, divided by the population of kids of the same age, and then multiplied by 100. For this, we are expecting a value of 0 to 100, as we are talking about a percentage of a population: we cannot have more children enrolled than the total population of kids. 

In this section there are three columns related to this data, these are: 
- `1.4.1 Net Elem Enrolment Rate` (Both Sexes), 
- `1.4.1 Net Elem Enrolment Rate (Girls)`,
- `1.4.1 Net Elem Enrolment Rate (boys)`

First, we will check for the values of `1.4.1 Net Elem Enrolment Rate`.

In [284]:
combined_data['1.4.1 Net Elem Enrolment Rate'].describe()

count    376.000000
mean      91.209016
std        7.184414
min       69.167300
25%       86.254125
50%       92.890000
75%       96.900000
max      108.070000
Name: 1.4.1 Net Elem Enrolment Rate, dtype: float64

**Result:** As we can see, the maximum value of this column is higher than 100, which can be concerning as the unit of measurement set by United Nations for all of the countries in this indicators is percentage. Thus, these might be error in encodings.

Let us check all of the rows which has values higher than 100 for this indicator.

In [285]:
combined_data[combined_data['1.4.1 Net Elem Enrolment Rate'] > 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
1,NCR: National Capital Region,2000,,101.0,101.92,100.13,79.05,79.5,78.57,,...,203930819.0,1237451000.0,9961971.0,,,,,,,
197,BARMM: Bangsamoro Autonomous Region in Muslim ...,2010,,103.25,110.01,96.66,44.54,51.13,38.0,,...,22262068.0,131635500.0,3692323.0,18.1,16.95,19.37,11.2,10.5,12.2,
200,CAR: Cordillera Administrative Region,2011,,100.8,101.71,99.96,68.27,75.39,61.47,,...,49944680.0,200486700.0,1641438.0,4.91,3.57,6.1,6.4,2.7,10.0,
212,Region 11: Davao Region,2011,,100.53,101.24,99.86,56.52,62.55,50.74,,...,119476612.0,437559400.0,4561874.0,9.35,7.34,11.16,9.3,6.8,11.7,
215,BARMM: Bangsamoro Autonomous Region in Muslim ...,2011,,108.07,114.0,102.23,44.32,51.64,37.05,,...,17333754.0,141744000.0,3793064.0,22.72,22.22,23.26,12.3,10.6,14.5,
238,Region 2: Cagayan Valley,2013,,100.08,100.17,100.0,72.9,78.74,67.49,,...,62731146.0,268304500.0,3362667.0,2.42,1.67,3.09,5.7,4.3,7.0,
244,Region 7: Central Visayas,2013,,100.97,101.36,100.59,68.13,75.26,61.44,,...,126595075.0,746139400.0,7157605.0,2.88,1.97,3.67,6.9,4.5,9.2,
248,Region 11: Davao Region,2013,,100.65,100.87,100.44,60.04,66.62,53.81,,...,120503064.0,495294300.0,4723542.0,5.28,4.02,6.4,8.9,7.1,10.8,
254,CAR: Cordillera Administrative Region,2014,,100.16,100.05,100.27,72.5,79.73,65.74,,...,24485878.0,237096000.0,1701488.0,2.84,1.71,3.5,6.3,3.8,8.8,
256,Region 2: Cagayan Valley,2014,,101.15,101.26,101.04,72.44,78.46,66.87,,...,75028145.0,304061100.0,3405024.0,2.61,1.96,3.2,6.2,4.6,7.7,


**Result:** As we can see, there are 18 rows which has more than 100% value for the `1.4.1. Net Elem Enrolment Rate`. These are some potential anomalies that we can have in this column.

Next, we will be checking for `1.4.1 Net Elem Enrolment Rate (Girls)`.

In [286]:
combined_data['1.4.1 Net Elem Enrolment Rate (Girls)'].describe()

count    376.000000
mean      91.862734
std        7.017429
min       72.358800
25%       86.977550
50%       93.610000
75%       97.160000
max      114.000000
Name: 1.4.1 Net Elem Enrolment Rate (Girls), dtype: float64

In [287]:
combined_data[combined_data['1.4.1 Net Elem Enrolment Rate (Girls)'] > 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
1,NCR: National Capital Region,2000,,101.0,101.92,100.13,79.05,79.5,78.57,,...,203930800.0,1237451000.0,9961971.0,,,,,,,
161,BARMM: Bangsamoro Autonomous Region in Muslim ...,2008,,99.85,104.34,95.37,37.98,42.57,33.24,,...,11818450.0,104963600.0,3511566.0,17.99,16.87,19.21,10.8,10.0,11.9,
164,CAR: Cordillera Administrative Region,2009,,99.51,100.21,98.84,64.76,71.8,58.0,,...,32099100.0,172017200.0,1598061.0,5.04,2.73,6.41,7.9,4.4,11.4,
197,BARMM: Bangsamoro Autonomous Region in Muslim ...,2010,,103.25,110.01,96.66,44.54,51.13,38.0,,...,22262070.0,131635500.0,3692323.0,18.1,16.95,19.37,11.2,10.5,12.2,
199,NCR: National Capital Region,2011,,99.69,101.54,97.97,79.79,83.41,76.27,,...,720272900.0,3237139000.0,12080971.0,2.93,2.42,3.86,5.8,3.9,7.6,
200,CAR: Cordillera Administrative Region,2011,,100.8,101.71,99.96,68.27,75.39,61.47,,...,49944680.0,200486700.0,1641438.0,4.91,3.57,6.1,6.4,2.7,10.0,
203,Region 3: Central Luzon,2011,,99.21,100.24,98.26,74.59,79.3,70.17,,...,216166900.0,1096259000.0,10372806.0,4.07,3.17,4.9,7.6,5.5,9.7,
212,Region 11: Davao Region,2011,,100.53,101.24,99.86,56.52,62.55,50.74,,...,119476600.0,437559400.0,4561874.0,9.35,7.34,11.16,9.3,6.8,11.7,
215,BARMM: Bangsamoro Autonomous Region in Muslim ...,2011,,108.07,114.0,102.23,44.32,51.64,37.05,,...,17333750.0,141744000.0,3793064.0,22.72,22.22,23.26,12.3,10.6,14.5,
217,NCR: National Capital Region,2012,,99.22,100.65,97.87,81.02,85.14,77.04,,...,758973600.0,3553089000.0,12275553.0,4.1,3.27,4.88,6.9,4.9,8.8,


**Result:** As seen in this result, there 28 instances of potential anomalies in this column.

Lastly, we will check the values from `1.4.1 Net Elem Enrolment Rate (Boys)`

In [288]:
combined_data['1.4.1 Net Elem Enrolment Rate (Boys)'].describe()

count    376.000000
mean      90.579812
std        7.497508
min       66.103200
25%       85.506900
50%       92.390000
75%       96.572500
max      102.500000
Name: 1.4.1 Net Elem Enrolment Rate (Boys), dtype: float64

**Result:** From this result, we can observe that its minimum and maximum value is similar to the 1.4.1 Net Elem Enrolment Rate for both sexes and for girls. 

With this, we will also now check for the number of instances where the value in this column is greater than 100%.

In [289]:
combined_data[combined_data['1.4.1 Net Elem Enrolment Rate (Boys)'] > 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
1,NCR: National Capital Region,2000,,101.0,101.92,100.13,79.05,79.5,78.57,,...,203930819.0,1237451000.0,9961971.0,,,,,,,
215,BARMM: Bangsamoro Autonomous Region in Muslim ...,2011,,108.07,114.0,102.23,44.32,51.64,37.05,,...,17333754.0,141744000.0,3793064.0,22.72,22.22,23.26,12.3,10.6,14.5,
244,Region 7: Central Visayas,2013,,100.97,101.36,100.59,68.13,75.26,61.44,,...,126595075.0,746139400.0,7157605.0,2.88,1.97,3.67,6.9,4.5,9.2,
248,Region 11: Davao Region,2013,,100.65,100.87,100.44,60.04,66.62,53.81,,...,120503064.0,495294300.0,4723542.0,5.28,4.02,6.4,8.9,7.1,10.8,
254,CAR: Cordillera Administrative Region,2014,,100.16,100.05,100.27,72.5,79.73,65.74,,...,24485878.0,237096000.0,1701488.0,2.84,1.71,3.5,6.3,3.8,8.8,
256,Region 2: Cagayan Valley,2014,,101.15,101.26,101.04,72.44,78.46,66.87,,...,75028145.0,304061100.0,3405024.0,2.61,1.96,3.2,6.2,4.6,7.7,
262,Region 7: Central Visayas,2014,,101.62,101.93,101.32,67.76,75.2,60.81,,...,158426246.0,829547300.0,7271699.0,2.68,1.91,3.36,7.9,5.7,10.1,
274,Region 2: Cagayan Valley,2015,17.8,102.42,102.33,102.5,77.58,83.5,72.12,,...,77480450.0,303113500.0,3447468.0,1.74,0.98,2.42,6.4,4.4,8.3,
280,Region 7: Central Visayas,2015,29.4,102.14,102.28,102.01,75.94,83.37,69.02,,...,158168742.0,889023000.0,7386344.0,1.85,1.07,2.89,6.1,4.3,7.9,
284,Region 11: Davao Region,2015,23.5,101.34,101.66,101.04,68.37,75.39,61.79,,...,127020224.0,602911800.0,4885808.0,1.9,0.99,2.72,7.2,5.4,8.9,


**Result:** Based on the result above, there 14 instances of when the rate is greater than 100%. With this, these instances are reported as potential anomalies.

Overall, there are 18, 28, and 14 instances of potential anomalies in the data of both sexes, girls, and boys, respectively. This is because the rate values of these instances are greater than 100%.

#### 1.4.1p6 Net Enrolment Rate in secondary education (Junior High School)

As we have the same expectations in the second dataset, the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function would be used in order to check if there are outliers or values that are outside of the range.

Also, this has 3 record versions (each has their own column), records for `Both sexes`, `Girls only`, and `Boys only`. 

To start, we will check for the record for `Both sexes`.

In [290]:
combined_data['1.4.1 Net JHS Enrolment Rate'].describe()

count    376.000000
mean      64.368974
std       13.925568
min       10.080000
25%       55.240000
50%       64.160000
75%       75.095250
max       93.436700
Name: 1.4.1 Net JHS Enrolment Rate, dtype: float64

**Result:** From the minimum and maximum value, we can see that the range of values are within the expected values.

Next, we will check for the values from the data for `Girls only`.

In [291]:
combined_data['1.4.1 Net JHS Enrolment Rate (Girls)'].describe()

count    376.000000
mean      69.668151
std       13.471561
min       22.050000
25%       60.800000
50%       70.150000
75%       79.835000
max       97.022700
Name: 1.4.1 Net JHS Enrolment Rate (Girls), dtype: float64

**Result:** Since the minimum and maximum value for this column is within the expected range, we can assume that all values in this column are valid.

Finally, we will check for the values from the data for `Boys only`.

In [292]:
combined_data['1.4.1 Net JHS Enrolment Rate (Boys)'].describe()

count    376.000000
mean      59.419022
std       14.144799
min       18.890000
25%       49.617500
50%       57.950000
75%       70.545000
max       90.040600
Name: 1.4.1 Net JHS Enrolment Rate (Boys), dtype: float64

**Result:** As seen from the displayed result, the minimum and maximum value for this column is within the expected range. Therefore, all values are valid and does need to do more checking.

#### 1.4.1p6 Net Enrolment Rate in secondary education (Senior High School)
Next, in this section, we would be using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function for the same purpose: checking if the maximum and minimum values are within the range we expected.

There are 3 versions of this data. Each is respresented by a column. These are: 
- `1.4.1 Net SHS Enrolment Rate` (Both Sexes)
- `1.4.1 Net SHS Enrolment Rate (Girls)`
- `1.4.1 Net SHS Enrolment Rate (Boys)`

We will begin with checking the values from the `1.4.1 Net SHS Enrolment Rate` column.

In [293]:
combined_data['1.4.1 Net SHS Enrolment Rate'].describe()

count    90.000000
mean     43.985397
std      12.801331
min       6.020000
25%      38.027500
50%      44.985000
75%      52.529725
max      68.630000
Name: 1.4.1 Net SHS Enrolment Rate, dtype: float64

**Result:** Based on the output, we can see that the minimum and maximum are within the range.

However, another expectation that we have from this column is that the rows that are not **NaN** have a value of **2016 - onwards** for the `Year` column. This is due to the fact that the Senior High School years was only added from 2016. Thus, if there are values for years lower than this, we would need to turn it to **NaN**.

To check this, we can use a mixture of the [`isnull`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function and the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function. Using the negation of the [`isnull`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function, we can only return rows that are not missing. Then, using the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) function, we can return the unique values of the `Year` column of the previously returned rows.

In [294]:
combined_data[~combined_data['1.4.1 Net SHS Enrolment Rate'].isnull()]['Year'].unique()

array([2016, 2017, 2018, 2019, 2020])

**Result:** We can see that the values of the `Year` column of the rows that are not **NaN** for this column are what we expected.


Next, we will check the `1.4.1 Net SHS Enrolment Rate (Girls)`.

In [295]:
combined_data['1.4.1 Net SHS Enrolment Rate (Girls)'].describe()

count    90.000000
mean     51.515750
std      13.786862
min       8.080000
25%      45.337500
50%      52.715000
75%      60.482500
max      76.420000
Name: 1.4.1 Net SHS Enrolment Rate (Girls), dtype: float64

**Result:** From this result, the minimum and maximum value are within the range which entails that all of the values in this column is valid. 

However, just as stated above, SHS was established in 2016, meaning we need to check if there are any records before the 2016 mark. If there are any, we will drop this since SHS is not yet established before 2016.

The same method is applied for checking if there are any values before 2016.

In [296]:
combined_data[~combined_data['1.4.1 Net SHS Enrolment Rate (Girls)'].isnull()]['Year'].unique()

array([2016, 2017, 2018, 2019, 2020])

**Result:** As seen from the result above, all values from this column are valid since the result indicated that the values in the years 2016 to 2020 are not NaN.

Finally, we will be checking for the values from the `1.4.1 Net SHS Enrolment Rate (Boys)` column.

In [297]:
combined_data['1.4.1 Net SHS Enrolment Rate (Boys)'].describe()

count    90.000000
mean     36.934978
std      12.015410
min       3.980000
25%      30.915000
50%      37.750000
75%      44.177500
max      61.170000
Name: 1.4.1 Net SHS Enrolment Rate (Boys), dtype: float64

**Result:** Looking at the minimum and maximum value, we can say that all values in this column are within the expected range.

Again, same method is applied for confirming if the values in this column only existed from 2016 to 2020.

In [298]:
combined_data[~combined_data['1.4.1 Net SHS Enrolment Rate (Boys)'].isnull()]['Year'].unique()

array([2016, 2017, 2018, 2019, 2020])

**Result:** Based on this result, all values in this column are valid since the result indicated that numerical values are only visible in the data points from 2016 to 2020.

#### 1.5.4 Proportion of local governments that adopt and implement local disaster risk reduction strategies in line with national disaster risk reduction strategies

As this column talks about proportion, we are expecting a value from 0 to 100 again. This means that we can check it using the same function as the previous columns (i.e., the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function), in order to recheck this using the returned minimum and maximum values.

In [299]:
combined_data['1.5.4 Proportion of LGU with DRR'].describe()

count     68.000000
mean      82.180882
std       23.442354
min       12.500000
25%       74.400000
50%       93.800000
75%      100.000000
max      100.000000
Name: 1.5.4 Proportion of LGU with DRR, dtype: float64

**Result:** Since the maximum is 100 and the minimum is not less than 0, then we can conclude that there are no values that are outside of the accepted range.

#### 3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease 
As the columns that will be evaluated in this section will talk about a rate again, the accepted value range is within 0 to 100. This is because of its formula wherein we divide the number of people who died attributed to the said diseases by the population. Since the number of deaths cannot be higher than the population, we cannot accept a value higher than 100.

Also, the columns that will be evaluated in this section are the following:
- 3.4.1 Mortality rate credited to NCD - total mortality rate of cardiovascular disease, cancer, diabetes or chronic respiratory disease
- 3.4.1 Mortality rate credited to Cardio
- 3.4.1 Mortality rate credited to Cancer
- 3.4.1 Mortality rate credited to Diabetes 
- 3.4.1 Mortality rate credited to Respi

We will check for the minimum and maximum value through the use of the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function.

First, we will check the values for `3.4.1 Mortality rate credited to NCD`.

In [300]:
combined_data['3.4.1 Mortality rate credited to NCD'].describe()

count    240.000000
mean       4.045417
std        1.073209
min        0.200000
25%        3.700000
50%        4.200000
75%        4.800000
max        5.500000
Name: 3.4.1 Mortality rate credited to NCD, dtype: float64

**Result:** From this, we can see that the range of values are within the accepted range, thus, there are no values that would be considered an oddity.

Next, we will specifically check the values of the mortality rate of cardiovascular diseases. This is represented by the column `3.4.1 Mortality rate credited to Cardio`.

In [301]:
combined_data['3.4.1 Mortality rate credited to Cardio'].describe()

count    270.00000
mean       2.46000
std        0.64019
min        0.10000
25%        2.20000
50%        2.50000
75%        2.80000
max        3.50000
Name: 3.4.1 Mortality rate credited to Cardio, dtype: float64

**Result:** As seen from the result above, the data from this column is within the range. Meaning, further checking will not be done anymore since all of the values in this column are valid.

Now, we will check for the mortality rate of cancer which is reprented by `3.4.1 Mortality rate credited to Cancer`.

In [302]:
combined_data['3.4.1 Mortality rate credited to Cancer'].describe()

count    270.000000
mean       0.937037
std        0.255858
min        0.000000
25%        0.800000
50%        1.000000
75%        1.100000
max        1.300000
Name: 3.4.1 Mortality rate credited to Cancer, dtype: float64

**Result:** From this, we can see that all values are valid since the minimum and maximum presented above is within the expecrted range. With this, we can now check for the next column.

We will now check the values under the column `3.4.1 Mortality rate credited to Diabetes`. 

In [303]:
combined_data['3.4.1 Mortality rate credited to Diabetes'].describe()

count    270.000000
mean       0.446667
std        0.150489
min        0.000000
25%        0.400000
50%        0.500000
75%        0.500000
max        0.900000
Name: 3.4.1 Mortality rate credited to Diabetes, dtype: float64

**Result:** Based on the result above, the values for this column are within the range since the minimum value is 0.00 and the maximum value is 0.90. Therefore, we can now move on to checking the next column.

Lastly, we will check for the values of mortality rate credited to chronic respiratory disease, which is represented by `3.4.1 Mortality rate credited to Respi`.

In [304]:
combined_data['3.4.1 Mortality rate credited to Respi'].describe()

count    270.000000
mean       0.280000
std        0.104401
min        0.000000
25%        0.200000
50%        0.300000
75%        0.300000
max        0.500000
Name: 3.4.1 Mortality rate credited to Respi, dtype: float64

**Result:** The result above indicates that all values in this column are valid since the minimum and maximum values presented above are within the range. With this, no more checking will be done for this column.

#### 3.7.1 Proportion of women of reproductive age (aged 15-49 years) who have their need for family planning satisfied [provided] with modern methods
Like the previous indicators that deals with proportion, we would be using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function in order to check if there values are within the range that we are expecting (i.e., 0 to 100). 

In [305]:
combined_data['3.7.1 Proportion of Contraceptive Use of Women'].describe()

count    72.000000
mean     50.681944
std      10.408235
min      21.200000
25%      45.975000
50%      51.000000
75%      56.825000
max      74.100000
Name: 3.7.1 Proportion of Contraceptive Use of Women, dtype: float64

**Result:** As the minimum and maximum are within the accepted range, we would not need to do further cleaning for this column.

#### 3.7.2 Adolescent birth rate aged 15-19 years per 1,000 women in that age group
Likewise, we are still expecting a value of 0 to 100 for this column as it is rate of the population (i.e., the number of adolescents who gave birth cannot go over the number of adolescents). To be able to determine if the values are within the range, we would still be utilizing the same function.

In [306]:
combined_data['3.7.2 Teenage pregnancy rates per 1000'].describe()

count     72.000000
mean      57.500000
std       16.758433
min       25.000000
25%       47.000000
50%       55.500000
75%       63.250000
max      108.000000
Name: 3.7.2 Teenage pregnancy rates per 1000, dtype: float64

**Result:** As we can see that there is at least one value that is higher than 100, we need to decide on what to be done for these values. But, first, let us see how many of the rows have this proble.

In [307]:
len(combined_data[combined_data['3.7.2 Teenage pregnancy rates per 1000'] > 100])

1

**Result:** As there is only one row with this problem, this means that there are 71 values in this column that are within the expected range.

#### 4.1.s1 Completion Rate of elementary and secondary students (Elementary)
Like any other columns that discusses the rate, we are expecting a value of 0 to 100. This can be checked using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function.

Also, there are 3 versions of this data (`Both sexes` (Overall), `Female`, and `Male`) and each has their own column. With this, we will be checking each column in this section.

We will start with checking for the overall record for this data.

In [308]:
combined_data['4.1 Elem Completion Rate'].describe()

count    376.000000
mean      76.328145
std       14.993444
min       23.460000
25%       67.842500
50%       77.540000
75%       86.865000
max       99.800000
Name: 4.1 Elem Completion Rate, dtype: float64

**Result:** We can see that the minimum is greater than 0 and the maximum is less than 100, which means that the range is within the expected range for this column.

Next, we will then check for the values from the record of elementary completion rate for *females only*.

In [309]:
combined_data['4.1 Elem Completion Rate (Female)'].describe()

count    376.000000
mean      80.465424
std       14.333781
min       24.350000
25%       73.490000
50%       82.175000
75%       90.285000
max      100.000000
Name: 4.1 Elem Completion Rate (Female), dtype: float64

**Result:** From this, checking further will not required since the minimum value and the maximum value are within the expected range.
    
Finally, we will check for the values of the *male* elementary completion rate.

In [310]:
combined_data['4.1 Elem Completion Rate (Male)'].describe()

count    376.000000
mean      72.597228
std       15.747094
min       22.040000
25%       62.540000
50%       73.030000
75%       84.002500
max       99.300000
Name: 4.1 Elem Completion Rate (Male), dtype: float64

**Result:** As seen in this result, it is evident that all values are valid in this column by observing the minimum and maximum value. Therefore, the checking of values for this section ends here.

#### 4.1.s1 Completion Rate of elementary and secondary students (Junior High School)
As the this column indicates the rate again, we would still need to determine if it is within the accepted range. This can be done through the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function. 

Also, in this section, there are three columns involved which are `4.1 JHS Completion Rate`, `4.1 JHS Completion Rate (Female)` and `4.1 JHS Completion Rate (Male)`.

Similar with the previous column, we will first check the overall record of this data, which is `4.1 JHS Completion Rate`.

In [311]:
combined_data['4.1 JHS Completion Rate'].describe()

count    376.000000
mean      74.390729
std        8.764703
min       35.210000
25%       69.965000
50%       74.790000
75%       79.790000
max       95.806836
Name: 4.1 JHS Completion Rate, dtype: float64

**Result:** As all the values are within the accepted range, which can be seen through the use of the **min** and **max** values, there is no need problems regarding the range of the values.

Then, we will now check the values of the column for `4.1 JHS Completion Rate (Female)`.

In [312]:
combined_data['4.1 JHS Completion Rate (Female)'].describe()

count    376.000000
mean      78.935592
std        8.945636
min       37.810000
25%       74.270000
50%       80.005000
75%       84.847500
max       98.259478
Name: 4.1 JHS Completion Rate (Female), dtype: float64

**Result:** Observing the result above, the minimum asnd maximum value of this column is within the range. This means that all values in this column is valid and this column wil not need any further checking. 

Finally, we will check for the values from the `4.1 JHS Completion Rate (Male)`.

In [313]:
combined_data['4.1 JHS Completion Rate (Male)'].describe()

count    376.000000
mean      69.797736
std        9.019642
min       32.220000
25%       65.047500
50%       69.775000
75%       75.287500
max       93.394499
Name: 4.1 JHS Completion Rate (Male), dtype: float64

**Result:** From this result, we can clearly see that all values are within the range by observing the minimum value and the maximum value. With this, no more checking is required for this section.

#### 4.1.s1 Completion Rate of elementary and secondary students (Senior High School)
Similar to the previous section, the accepted range for this section is 0 to 100 and there are three columns under this data each represents **(1) Both sexes**, **(2) Female**, and **(3) Male**.

As usual, we will start with checking the column that contains the overall data.

In [314]:
combined_data['4.1 SHS Completion Rate'].describe()

count    54.000000
mean     75.262275
std       7.754152
min      51.050553
25%      72.395000
50%      76.495000
75%      80.947500
max      90.356108
Name: 4.1 SHS Completion Rate, dtype: float64

**Result:** As we can see from the output of the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function, the minimum and maximum values are within the range.

However, as the Senior High School was only established in 2016, the first wave of completers would only be from 2018, thus, we would need to check if the values only exist starting from this year.

This can be done through using the [`isnull`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) and the [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) functions. By checking the unique values of the `Year` column that are not **NaN**, we can determine if the values are only from 2018 onwards. To add, this type of checking will also be done to the columns: `4.1 SHS Completion Rate (Female)` and `4.1 SHS Completion Rate (Male)`.

In [315]:
combined_data[~combined_data['4.1 SHS Completion Rate'].isnull()]['Year'].unique()

array([2018, 2019, 2020])

**Result:** From this, we can see that the values only exist for the years that has senior high school graduates, which is what is correct.

Next, we will be checking the validity of the values from `4.1 SHS Completion Rate (Female)`.

In [316]:
combined_data['4.1 SHS Completion Rate (Female)'].describe()

count    54.000000
mean     79.448554
std       7.581709
min      57.428523
25%      76.795000
50%      80.424756
75%      84.640000
max      94.490419
Name: 4.1 SHS Completion Rate (Female), dtype: float64

**Result:** From this, it is evident that the minimum and maximum value are within the range. 

Next, we will be checking if these numerical values only exist from 2018 onwards.

In [317]:
combined_data[~combined_data['4.1 SHS Completion Rate (Female)'].isnull()]['Year'].unique()

array([2018, 2019, 2020])

**Result:** Based on the output above, it proves that all numerical values in this column only exist from 2018 onwards. Meaning, all values in this column are valid.

Finally, we will check the values from the column, `4.1 SHS Completion Rate (Male)`.

In [318]:
combined_data['4.1 SHS Completion Rate (Male)'].describe()

count    54.000000
mean     70.963982
std       8.083650
min      45.098846
25%      67.540000
50%      72.255000
75%      77.015822
max      86.236932
Name: 4.1 SHS Completion Rate (Male), dtype: float64

**Result:** As presented above, the minimum and maximum value are within the range. This means that the values in this column is valid in terms of the value range.

Again, we will proceed to checking if the numerical values in this column only exist from 2018 onwards.

In [319]:
combined_data[~combined_data['4.1 SHS Completion Rate (Male)'].isnull()]['Year'].unique()

array([2018, 2019, 2020])

**Result:** From the output displayed above, we can confirm that all values in this column is valid since the output indicates that the numerical values in this column are in the years 2018 onwards. Therefore, no more checking is needed for this section.

#### 7.1.1 Proportion of population with access to electricity

Like other columns that indicates proportion, rate and percentage, one of the most important things that we need to check for this column is if the range of values are within `0 to 100`. With this, we will use the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function for checking the values.

In [320]:
combined_data['7.1.1 Proportion of pop with electricity'].describe()

count    216.000000
mean      84.371017
std       17.982998
min       26.561093
25%       74.077564
50%       90.475000
75%       98.883110
max      115.320595
Name: 7.1.1 Proportion of pop with electricity, dtype: float64

**Result:** Based on the result, the maximum value of this column is higher than the maximum value for the proportion. 

Let us see the rows that has values greater than 100 for this column.

In [321]:
combined_data[combined_data['7.1.1 Proportion of pop with electricity'] > 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
307,NCR: National Capital Region,2017,,92.83,93.5,92.2,84.76,89.56,80.22,62.74,...,1391592000.0,5327124000.0,13264805.0,,,,,,,
311,Region 3: Central Luzon,2017,,97.91,97.87,97.95,82.84,87.46,78.54,55.99,...,513606400.0,1860324000.0,11667642.0,,,,,,,
312,Region 4A: CALABARZON,2017,,96.31,96.59,96.04,82.51,87.21,78.09,53.9,...,630500100.0,2423069000.0,15085285.0,,,,,,,
322,CARAGA: Cordillera Administrative Region,2017,,95.89,95.07,96.69,74.54,79.83,69.5,38.11,...,115687900.0,273088200.0,2660236.0,,,,,,,
325,NCR: National Capital Region,2018,2.2,92.11,92.38,91.85,88.74,92.64,85.04,68.63,...,1640251000.0,5814440000.0,13453701.0,,,,,,,
327,Region 1: Ilocos Region,2018,9.9,90.48,89.67,91.26,87.81,90.68,85.14,64.06,...,193759500.0,587597300.0,5178410.0,,,,,,,100.0
329,Region 3: Central Luzon,2018,7.0,98.77,98.32,99.2,85.01,88.83,81.44,60.19,...,590017300.0,2062394000.0,11890314.0,,,,,,,100.0
330,Region 4A: CALABARZON,2018,7.1,97.36,97.37,97.34,86.38,89.97,83.01,58.33,...,712401200.0,2706995000.0,15418944.0,,,,,,,100.0
340,CARAGA: Cordillera Administrative Region,2018,30.5,94.74,93.68,95.76,80.7,84.84,76.74,44.36,...,132342900.0,290561800.0,2692072.0,,,,,,,100.0


**Result:** From this output, we can see that there are 9 out of 396 rows that contains a higher value than the maximum value set for this column.

#### 8.1.1. Annual growth rate of real GDP per capita

For this column, there is no specific expected range since this column contains growth rate values. Still, the values will still be first checked by using the `decribe` function. further checking will be needed only if there is still unclear about the values after the output of the first step of checking. 

Since the unit of measurement of the values under this column is percentage, the initial expected range is from 0 to 100. However, unlike other percentage data, growth rates can be negative or can exceed 100%. To add, since the values of growth rates are usually positive, we will take note the number of instances where the values are zero, negative, or exceeding 100%.

The different growth rate values will be expressed as:
The `value representations for growth rates` are the following: 
- `Decline` - A negative value means there is a decrease in real GDP per capita
- `No Growth` - A zero value means there is no change in real GDP per capita
- `Growth` - A positive value means there is a increase in real GDP per capita
    - A 100% value  means that the growth value from the previous year to the current year is equal to the data recorded from the previous year.
    - `Expansion` - A value exceeding 100% means that the growth value from the previous year to the current year is greater than data recorded from the previous year.

In [322]:
combined_data['8.1.1 Growth rate of real GDP per capita'].describe()

count    361.000000
mean       3.227028
std        3.829716
min      -15.362293
25%        1.983157
50%        3.951342
75%        5.338761
max       12.694224
Name: 8.1.1 Growth rate of real GDP per capita, dtype: float64

**Result:** The minimum value entails that there instances where a decline in GDP per capita occurred. As for the maximum value, it indicates that no expansion happened.

With this, more checking will be done. We will be checking for the number of instances of the folowing cases: no change and decrease in real GDP per capita.

First, we will check for the number of instances where there is a decline in real GDP per capita.

In [323]:
combined_data[combined_data['8.1.1 Growth rate of real GDP per capita'] < 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
24,Region 4A: CALABARZON,2001,,93.44,93.93,92.96,66.24,70.46,62.13,,...,76941120.0,671228300.0,9687547.0,,,,,,,
30,Region 9: Zamboanga Peninsula,2001,,90.22,90.52,89.93,43.95,47.79,40.13,,...,20350910.0,76332910.0,2906575.0,,,,,,,
33,Region 12: SOCCSKSARGEN,2001,,86.78,87.85,85.74,50.56,55.2,45.97,,...,6512848.0,95184440.0,3054507.0,,,,,,,
35,BARMM: Bangsamoro Autonomous Region in Muslim ...,2001,,83.68,87.46,79.89,20.48,22.05,18.89,,...,4180245.0,44751360.0,3096339.0,,,,,,,
37,NCR: National Capital Region,2002,,97.38,98.28,96.52,75.28,77.94,72.57,,...,312792400.0,1435447000.0,10344788.0,,,,,,,
94,Region 2: Cagayan Valley,2005,,79.92,80.34,79.52,59.02,64.95,53.38,,...,58370200.0,121542600.0,3026730.0,,,,,,,
105,Region 12: SOCCSKSARGEN,2005,,77.43,79.15,75.77,51.33,56.69,46.06,,...,22874350.0,146556500.0,3346482.0,,,,,,,
161,BARMM: Bangsamoro Autonomous Region in Muslim ...,2008,,99.85,104.34,95.37,37.98,42.57,33.24,,...,11818450.0,104963600.0,3511566.0,17.99,16.87,19.21,10.8,10.0,11.9,
162,PHILIPPINES,2009,,89.48,90.77,88.26,59.89,64.82,55.16,,...,1462595000.0,8390421000.0,90974244.0,6.28,4.93,7.52,8.0,6.2,9.7,
163,NCR: National Capital Region,2009,,90.96,92.01,89.98,76.69,79.74,73.69,,...,461856600.0,2754036000.0,11693627.0,4.07,3.33,4.75,6.9,5.0,8.7,


**Result:** As shown in the results above, there are 41 instances where the real GDP per capita decreased.

Now, we will now check if there are instances where there is no change in real GDP per capita.

In [324]:
combined_data[combined_data['8.1.1 Growth rate of real GDP per capita'] == 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture


**Result:** As shown in the results above, there are no instances where the real GDP per capita remained the same.

#### 10.1.1.1 Growth rates of household expenditure or income per capita among the bottom 40 per cent of the population

Similar to the columns with growth rate values, we will take note of the number of instances where the values are zero, negative, or exceeding 100%, if there are any. 

To recap, these are the `growth rates representation`: 
- If zero value, there is no change in the expenditure or income per capita.
- If negative value, there is a decrease in the expenditure or income per capita.
- If positive value, there is an increase in the expenditure or income per capita.
- If positive value exceeding 100, the expenditure or income per capita became twice or more than the previous year's.

Once again, we will now check this column by using the `describe` function. 

In [325]:
combined_data['10.1.1.1 Income per capita growth rate of bottom 40'].describe()

count    36.000000
mean      8.255049
std       2.286450
min       2.876687
25%       6.731169
50%       8.227927
75%       9.516665
max      14.063013
Name: 10.1.1.1 Income per capita growth rate of bottom 40, dtype: float64

**Result:** All values are valid since all of the values in this column are numerical and only indicates growth. 

As evident in the minimum and maximum value, negative, 0, and more than 100% values are not present in this column. With this, no more steps for cleaning will be done anymore.

#### 10.1.1.2 Growth rates of household expenditure or income per capita among the Total Population

The evaluation in this column will be the same as the `10.1.1.1 Growth rates ...`. This is because they contain the same type of data , which is *Growth rates of household expenditure or income per capita*. 

With this, the same growth rate representations are applied in this column: (1) negative growth rate means *decline*, (2) zero growth rate means *no growth*, (3) positive growth rate just means *growth*, and (4) growth rate more than 100% means *expansion*.

In [326]:
combined_data['10.1.1.2 Income per capita growth rate'].describe()

count    36.000000
mean      6.122632
std       2.291086
min       1.550724
25%       5.013330
50%       6.042426
75%       7.387676
max      10.502686
Name: 10.1.1.2 Income per capita growth rate, dtype: float64

**Result:** Similar to the result of `10.1.1.1 Growth rates...`, there are no instances where there is no growth, negative growth, or excessive growth. Therefore, no more further cleaning is required.

#### 14.5.1.1 Coverage of protected areas in relation to marine areas, Universe (in million hectares)

We will now check the protected areas for marine biodiversity by using the `describe()` function again. Since a coverage of area is being measured in percentage in this column, the expected value will be from `0 to 100`. A negative value or a value exceeding 100% will be reported as a potential anomaly.

In [327]:
combined_data['14.5.1.1 Coverage of protected areas'].describe()

count    37.000000
mean      0.378010
std       0.726692
min       0.000000
25%       0.001242
50%       0.209978
75%       0.364699
max       3.143559
Name: 14.5.1.1 Coverage of protected areas, dtype: float64

**Result:** As seen from the results above, the minimum and maximum value from this column are within the expected range. 

However, since the `minimum value is zero`, this means that there are regions in the Philippines that *does not have designated and protected areas* for marine biodiversity. 

With this, we will check the number of regions in the Philippines were protected marine biodiversity areas are not present.

In [328]:
combined_data[combined_data['14.5.1.1 Coverage of protected areas'] == 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
344,CAR: Cordillera Administrative Region,2019,,91.4,90.51,92.25,84.76,88.9,80.81,50.53,...,23666821.0,322106088.0,1791881.0,,,,,,,85.475
359,BARMM: Bangsamoro Autonomous Region in Muslim ...,2019,,71.51,74.7,68.42,36.48,41.96,31.07,10.59,...,48609129.0,254779370.0,4629060.0,,,,,,,98.6
362,CAR: Cordillera Administrative Region,2020,,87.5276,86.4657,88.5518,84.9372,88.1383,81.866,52.8763,...,31273604.0,295502542.0,1807738.0,,,,,,,64.35
377,BARMM: Bangsamoro Autonomous Region in Muslim ...,2020,,69.1673,72.3588,66.1032,37.1877,42.8858,31.543,12.7783,...,23815628.0,260347879.0,4724381.0,,,,,,,98.725


**Result:** Seeing that there are four rows that was displayed from the result above means that there are 4 records that has a zero value for this indicator. 

Observing the result further, only **CAR** and **BARMM** regions are present. This means that these areas does not have designated protected areas for marine biodiversity as of 2020.  

#### 14.5.1.2 Coverage of protected areas in relation to marine areas, NIPAS and Locally managed MPAs

Now, the checking of values is similar to `14.5.1.1 Coverage of protected areas...`. But, in this section, we will check the protected areas coverage for marine areas that are managed by National Integrated Protected Areas System (NIPAS) and Managed Protected Areas (MPA). Therefore, the expected range is still from `0 to 100`. To clearly see the data, especially the minimum and the maximum, we will is the `describe()` function.

It is important to take note that the results from `14.5.1.1 Coverage of protected areas...` will also be reflected where since this is just a more specific category of the coverage of protected areas in relation to marine areas in the Philippines. With this, the the minimum value is expected to be 0, and **CAR** and **BARMM** should appear in the results when this column is checked further later.

In [329]:
combined_data['14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs'].describe()

count    37.000000
mean      0.171255
std       0.328632
min       0.000000
25%       0.000563
50%       0.095166
75%       0.165288
max       1.420000
Name: 14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs, dtype: float64

**Result:** Similar to the results from `14.5.1.1 Coverage of protected areas...`, the minimum and maximum value from this column are within the expected range. 

Since the minimum value is zero, this means that there are regions in the Philippines that does not have designated and protected areas for marine biodiversity or may have areas for marine biodiversity but are not manged by NIPAS and MPA.

Again, we will now check for these regions.

In [330]:
combined_data[combined_data['14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs'] == 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
344,CAR: Cordillera Administrative Region,2019,,91.4,90.51,92.25,84.76,88.9,80.81,50.53,...,23666821.0,322106088.0,1791881.0,,,,,,,85.475
359,BARMM: Bangsamoro Autonomous Region in Muslim ...,2019,,71.51,74.7,68.42,36.48,41.96,31.07,10.59,...,48609129.0,254779370.0,4629060.0,,,,,,,98.6
362,CAR: Cordillera Administrative Region,2020,,87.5276,86.4657,88.5518,84.9372,88.1383,81.866,52.8763,...,31273604.0,295502542.0,1807738.0,,,,,,,64.35
377,BARMM: Bangsamoro Autonomous Region in Muslim ...,2020,,69.1673,72.3588,66.1032,37.1877,42.8858,31.543,12.7783,...,23815628.0,260347879.0,4724381.0,,,,,,,98.725


**Result:** As expected, the result is similar to `14.5.1.1 Coverage of protected areas...`, where **CAR** and **BARMM** are present in this result which entails that these do not have designated protected area for marine biodiversity

#### 16.1.1 Number of victims of intentional homicide (per 100,000 population)

Similar to the columns with proportion, percentage, rate values, the expected range for this column would be from `0 to 100`. This is because it would be impossible to get a negative value or a value exceeding 100% when measuring a **population proportion**. Again, the `describe()` function will be used to see the summary of the datas in this column.

In [331]:
combined_data['16.1.1 Victims of intentional homicide per 100,000'].describe()

count    108.000000
mean       7.183800
std        3.439573
min        2.444428
25%        4.720040
50%        6.307474
75%        9.101852
max       17.739571
Name: 16.1.1 Victims of intentional homicide per 100,000, dtype: float64

**Result:** From this, we can see that there are no values that went out of range. With this, no more further checking will be done.

#### 16.1.s1 Number of murder cases

In checking this column, the expected range would be `greater than or equal to zero` because the data in this column is **quantity**.

In [332]:
combined_data['16.1.s1 Number of murder cases'].describe()

count      108.000000
mean       822.055556
std       1750.312543
min         49.000000
25%        234.500000
50%        370.500000
75%        591.000000
max      12417.000000
Name: 16.1.s1 Number of murder cases, dtype: float64

**Result:** Based on the results, there are no negative values which also means that there are no potential anomalies in this column. Therefore, we will not be needing to check the values of this column further.

#### Changes in Inventories, by Region

Since this column measure the changes in inventories from the initial year to the current year, there will be `no specific range`. 

With this, these are the `value repersentations` to take into account:
- A positive value means increase in inventories
- A negative value means decrease in inventories
- A zero value means no changes in inventories

In [333]:
combined_data['Changes in Inventories'].describe()

count    3.960000e+02
mean    -7.539818e+06
std      5.055227e+07
min     -6.954643e+08
25%     -9.172803e+06
50%     -2.910125e+05
75%      6.234438e+06
max      1.785591e+08
Name: Changes in Inventories, dtype: float64

**Result:** From this result, we can see that there are instances where there is a decrease in inventories. 

With this, we will further check the column for the number of instances and the number of regions that experienced a decrease in inventories.

Now, we will check for instances where there is a decrease in inventories (*negative values*).

In [334]:
combined_data[combined_data['Changes in Inventories'] < 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,579938180.0,3.697556e+09,76723051.0,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,13865180.0,9.044601e+07,1369249.0,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,24454284.0,1.289450e+08,4209083.0,,,,,,,
5,Region 3: Central Luzon,2000,,98.32,97.85,98.77,74.32,76.58,72.13,,...,8037710.0,3.687868e+08,8233671.0,,,,,,,
6,Region 4A: CALABARZON,2000,,98.50,99.04,97.98,71.03,74.24,67.88,,...,18214696.0,6.016914e+08,9367205.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,,,...,164566009.0,9.492320e+08,,,,,,,,107.975
392,Region 11: Davao Region,2021,,,,,,,,,...,257595240.0,9.672276e+08,,,,,,,,110.850
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,...,87077953.0,5.039756e+08,,,,,,,,103.350
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,...,100730468.0,3.317629e+08,,,,,,,,104.525


**Result:** With this, we can see that there are 188 instances in this column that indicates a decrease in inventories.


Now, we will check for instances where there is no change in inventories (*zero values*).

In [335]:
combined_data[combined_data['Changes in Inventories'] == 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture


**Result:** As seen in this result, there are no instances where no change in inventories happened.

#### Current Health Expenditure by Region

In this column, the percent share of the health expenditure in the national expenditure of each region in a specific year is recorded. With this, the expected range of values for this column is `strictly from 0 to 100`. This is because negative values are impossible when measuring the proportion of the health expenditure in the national expenditure. Also, a value exceeding 100% will mean that the health expenditure exceeds the national expenditure, which is impossible.

Again, we will use the `describe()` function to see the summary of values for this column, especially, the minimum and maximum values.

In [336]:
combined_data['Current Health Expenditure'].describe()

count    126.000000
mean       5.565079
std        5.970183
min        0.500000
25%        2.200000
50%        3.500000
75%        5.700000
max       24.700000
Name: Current Health Expenditure, dtype: float64

**Result:** As of this result, the minimum and maximum value is not outside the expected range which means that there are no potential anomalies in this column. To add, since the minimum value is higher than zero, this means that there are no instances where a region or the country did not have health expenditure. With all of this, there is no need to check the column further.

#### Current Health Expenditure by Region, Growth Rates

This column is related to `Current Health Expenditure by Region` which contains annual *percent share* of health expenditures per region. As for this column, the values are the annual growth rate of health expenditures. Since these are recorded in growth rates there is no strict range of values expected for this column. 

It is important to take note that growth rates can possibly go down to negative or exceed 100% since this happens in real life: (1) negative growth rate means *decline*, (2) zero growth rate means *no growth*, and (3) growth rate more than 100% means *expansion*.

However, we will take not of the number of instances of these cases: *decline*, *no growth*, and *expansion*.

In [337]:
combined_data['Current Health Expenditure GR'].describe()

count    108.000000
mean      12.446296
std       19.581015
min      -63.000000
25%        7.200000
50%       10.600000
75%       13.825000
max      163.800000
Name: Current Health Expenditure GR, dtype: float64

**Result:** From this result, the minimum value for this column is `-63.00` while the maximum value is `163.80`. 

Therefore, we will be cheking this column further to the number of instances of these decline, no growth, and expansion cases.

To start, we will check the number of instances where there are negative values.

In [338]:
combined_data[combined_data['Current Health Expenditure GR'] < 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
313,MIMAROPA: Southwestern Tagalog Region,2017,,92.33,92.44,92.24,75.71,81.04,70.72,43.27,...,55532730.0,321948700.0,3051342.0,,,,,,,
322,CARAGA: Cordillera Administrative Region,2017,,95.89,95.07,96.69,74.54,79.83,69.5,38.11,...,115687900.0,273088200.0,2660236.0,,,,,,,
330,Region 4A: CALABARZON,2018,7.1,97.36,97.37,97.34,86.38,89.97,83.01,58.33,...,712401200.0,2706995000.0,15418944.0,,,,,,,100.0
343,NCR: National Capital Region,2019,,89.91,90.42,89.43,89.68,93.47,86.06,62.28,...,1598462000.0,6294195000.0,13633497.0,,,,,,,
348,Region 4A: CALABARZON,2019,,98.23,98.61,97.87,87.85,91.62,84.29,54.79,...,766853100.0,2865793000.0,15742673.0,,,,,,,100.225
349,MIMAROPA: Southwestern Tagalog Region,2019,,90.26,90.75,89.79,81.12,85.31,77.19,46.0,...,88406150.0,375589600.0,3135503.0,,,,,,,87.15


**Result:** From the result above, there are six instances where there is a decline in health expenditure.

Next, we will check for the number of instances where there is no growth in health expenditure.

In [339]:
combined_data[combined_data['Current Health Expenditure GR'] == 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
339,Region 12: SOCCSKSARGEN,2018,28.2,92.04,92.1,91.98,74.49,80.14,69.11,41.5,...,100721169.0,454304549.0,4256317.0,,,,,,,100.0


**Result:** As seen from the result, there is one instance where there is no growth in the health expenditure. The region involved in this case is **SOCCSKSARGEN**.

Lastly, we will check for the number of instances where an expansion of health expenditure happened.

In [340]:
combined_data[combined_data['Current Health Expenditure GR'] > 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
331,MIMAROPA: Southwestern Tagalog Region,2018,15.1,90.99,91.1,90.88,79.74,83.95,75.79,48.14,...,78995250.0,370744808.0,3094357.0,,,,,,,100.0


**Result:** Based on the result above, one instance experienced an expansion in health expenditure. The region involved in this case is **MIMAROPA**.

#### Government Final Consumption Expenditure, by Region, Percent Share

Similar to other columns with percent share values, the expected range for this value is `strictly from 0 to 100`. This is because negative values are impossible in measuring percent shares. Moreover, specifically for regional values, exceeding 100% will mean that this specific regional government expenditure exceeds the national expenditure, which is impossible as well. In relation to this, 100% must be the only value in the instances where the Geolocation is **PHILIPPINES** since the *Government Final Consumption Expenditure* is similar to the national expenditure.

With this, we use the `describe()` function to see the minimum and maximum values of this column.

In [341]:
combined_data['Consumption Expenditure %'].describe()

count    396.000000
mean      11.109091
std       23.406456
min        1.800000
25%        2.675000
50%        3.600000
75%        4.600000
max      100.000000
Name: Consumption Expenditure %, dtype: float64

**Result:** The result indicates that there are no values that went out of the expected range. However, this still needs to be checked further to confirm that there are no regional values that have 100% value. 

Now, we will check for instances where the value is 100%. In this case, the expected result should only be the instances where the PHILIPPINES is the geolocation. 

In [342]:
combined_data[combined_data['Consumption Expenditure %'] == 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,579938200.0,3697556000.0,76723051.0,,,,,,,
18,PHILIPPINES,2001,,90.1,90.91,89.33,57.55,62.24,52.96,,...,762429500.0,4024399000.0,78273584.0,,,,,,,
36,PHILIPPINES,2002,,90.29,91.1,89.51,59.0,63.72,54.39,,...,890087000.0,4350560000.0,79832103.0,,,,,,,
54,PHILIPPINES,2003,,88.74,89.68,87.84,60.15,65.07,55.34,,...,921328400.0,4717809000.0,81398610.0,,,,,,,
72,PHILIPPINES,2004,,87.11,88.08,86.17,59.97,65.01,55.04,,...,1103699000.0,5323904000.0,82977428.0,,,,,,,
90,PHILIPPINES,2005,,84.44,85.35,83.56,58.54,63.53,53.65,,...,1098634000.0,5917282000.0,84559930.0,,,,,,,
108,PHILIPPINES,2006,,83.22,84.08,82.39,58.59,63.44,53.85,,...,1049071000.0,6550417000.0,86150420.0,6.37,5.0,7.64,8.5,6.7,10.4,
126,PHILIPPINES,2007,,84.93,85.83,84.07,60.21,64.87,55.65,,...,1160980000.0,7198245000.0,87748896.0,5.99,4.72,7.17,7.5,5.6,9.3,
144,PHILIPPINES,2008,,85.11,85.7,84.55,60.74,65.18,56.39,,...,1526893000.0,8050201000.0,89359772.0,6.02,4.87,7.07,7.4,5.8,9.1,
162,PHILIPPINES,2009,,89.48,90.77,88.26,59.89,64.82,55.16,,...,1462595000.0,8390421000.0,90974244.0,6.28,4.93,7.52,8.0,6.2,9.7,


**Result:** As seen from the result above, 22 instances appeared which are the instances for when the geolocation is PHILIPPINES. Moreover, we can confirm that all of the PHILIPPINES instances are displayed based on the years displayed as well which is from 2000 to 2020. 

Since the result matched with the expected result, there is no need to further check this column.

#### Government Final Consumption Expenditure, by Region, Growth Rates

This column is related to the `Government Final Consumption Expenditure, by Region, Percent Share`. However, in this column the data is measured in  growth rates. Similiar to toher growth rate columns, there is no specific expected range value for this column because the go down to negative (represents *decline* in expenditure) or exceed 100% (*expansion* in expenditure).

Now, we will check for the minimum and maximum values of this column.

In [343]:
combined_data['Consumption Expenditure GR'].describe()

count    378.000000
mean      10.403704
std        6.725922
min       -9.000000
25%        6.300000
50%       10.000000
75%       14.050000
max       38.800000
Name: Consumption Expenditure GR, dtype: float64

**Result:** This result implies that a decline in expenditure happened and no expansion happened. With this we will be checking for the number ofinstances where a decline in expenditure happened. Moreover, we will also check for the number of instances where there were no changes in the expenditure.

Checking this further, we will now check for the instances where the expenditure declined (*negative growth rate*).

In [344]:
combined_data[combined_data['Consumption Expenditure GR'] < 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
11,Region 8: Eastern Visayas,2000,,94.58,95.53,93.67,48.21,54.64,42.22,,...,21092440.0,103764200.0,3618043.0,,,,,,,
19,NCR: National Capital Region,2001,,97.82,99.13,96.57,67.84,73.86,61.7,,...,288023206.0,1349241000.0,10153254.0,,,,,,,
56,CAR: Cordillera Administrative Region,2003,,89.19,90.07,88.36,61.76,68.74,55.02,,...,19788755.0,113075100.0,1451561.0,,,,,,,
61,MIMAROPA: Southwestern Tagalog Region,2003,,89.42,90.13,88.74,58.43,64.7,52.36,,...,12029760.0,95423490.0,2443548.0,,,,,,,
62,Region 5: Bicol Region,2003,,89.3,90.37,88.29,55.61,61.76,49.77,,...,23452407.0,120066700.0,4920546.0,,,,,,,
63,Region 6: Western Visayas,2003,,83.25,83.92,82.6,57.86,63.69,52.19,,...,45601881.0,220844500.0,6501206.0,,,,,,,
66,Region 9: Zamboanga Peninsula,2003,,84.83,85.66,84.02,49.31,54.6,44.04,,...,19815370.0,86783930.0,3033380.0,,,,,,,
68,Region 11: Davao Region,2003,,84.36,85.74,83.03,52.11,57.37,46.92,,...,50647069.0,194422100.0,3923852.0,,,,,,,
89,BARMM: Bangsamoro Autonomous Region in Muslim ...,2004,,90.01,94.19,85.84,28.43,31.1,25.73,,...,15457196.0,71276680.0,3232800.0,,,,,,,
195,Region 12: SOCCSKSARGEN,2010,,88.65,90.58,86.84,54.15,59.76,48.77,,...,33894195.0,252178000.0,3701963.0,8.9,7.43,10.28,8.7,7.4,10.1,


**Result:** The result displayed 10 instances where the consumption expenditure decreased. 

Lastly, we will check for instances where the consumption expenditure remained the same (*zero growth rate*). 

In [345]:
combined_data[combined_data['Consumption Expenditure GR'] == 0]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture


**Result:** Based on this result, there are no instances where the consumption expenditure remained the same. 

#### Gross Capital Formation, by Region

For this column, since this is recorded in thousand pesos, the expected range is values `greater than or equal to zero`. 

To check for the minimum and the maximum values, we will use the `describe()` function. This displays the summary of the data in this column.

In [346]:
combined_data['Gross Capital Formation'].describe()

count    3.960000e+02
mean     2.539494e+08
std      6.289540e+08
min      3.017260e+06
25%      3.282230e+07
50%      7.347546e+07
75%      1.530246e+08
max      5.153069e+09
Name: Gross Capital Formation, dtype: float64

**Result:** Since the minimum value is greater than zero, this means that all values in this column are valid. With this, no more checking is needed for this column.

#### Gross Regional Domestic Product, by Region

Similar to the previous column `Gross Capital Formation, by Region`, the unit of measurement for this column is peso. This means that the expected range of value for this column is `equal or greater than zero`.

In [347]:
combined_data['GRDP'].describe()

count    3.960000e+02
mean     1.186168e+09
std      2.769090e+09
min      4.475136e+07
25%      1.798965e+08
50%      3.235851e+08
75%      7.467392e+08
max      1.951786e+10
Name: GRDP, dtype: float64

**Result:** Based on this result, the minimum value is greater than zero which means that all values in this column is within the range. Therefore, we don't need to check this column further.

#### Population, by Region

Since this column contains population in **quantity** values, the expected range will be `greater than or equal to zero`. 

In [348]:
combined_data['Population'].describe()

count    3.780000e+02
mean     1.030015e+07
std      2.039379e+07
min      1.369249e+06
25%      3.316747e+06
50%      4.483862e+06
75%      7.495641e+06
max      1.087720e+08
Name: Population, dtype: float64

**Result:** Since the minimum value is greater than zero, this means that the rest of the values are within the expected range. Since we can confirm that all values in this column is valid, checking further is not needed for this column.

#### Primary Drop-out rates by Region, Sex and Year

As for this column, the values are measured in population proportion. With this, the expected range of value is from `0 to 100`, strictly. This is because a negative value and a value exceeding 100% will not mean anything if the value being measured is a fraction of a population - this will be reported as an anomaly instead. In addition, there will be three columns under this section, all contains primary drop-outs but have different population size the threre columns are Primary drop rates of **(1) Both sexes**, **(2) Boys**, and **(3) Girls**.

To start, we will first check for the `Primary Drop-rate for both sexes`. 

In [349]:
combined_data['Primary Drop-out rate'].describe()

count    180.000000
mean       5.915500
std        3.951325
min        1.130000
25%        3.297500
50%        4.975000
75%        7.200000
max       23.810000
Name: Primary Drop-out rate, dtype: float64

**Result:** As seen from the result above, the minimum value (1.13%) and the maximum value (23.81%) is within the expected range. Meaning, all of the values in this column are valid. Therefore, no further checking is needed.

Now, we will check for the values of `Primary Drop-rate for Boys`. 

In [350]:
combined_data['Primary Drop-out rate (Boys)'].describe()

count    180.000000
mean       7.012222
std        4.217429
min        1.300000
25%        4.107500
50%        6.185000
75%        8.497500
max       24.930000
Name: Primary Drop-out rate (Boys), dtype: float64

**Result:** Based on this result, for boys, the minimum value is 1.30% and the maximum value is 24.93% which means that no further checking is needed as well.

Lastly, we will check for the values of `Primary Drop-rate for Girls`. 

In [351]:
combined_data['Primary Drop-out rate (Girls)'].describe()

count    180.000000
mean       4.728556
std        3.728556
min        0.640000
25%        2.530000
50%        3.810000
75%        5.645000
max       22.780000
Name: Primary Drop-out rate (Girls), dtype: float64

**Result:** With this result, no more detailed checking is required as it is evident from the minimum (0.64%) and maximum (22.78%) values that all values from this column are valid.

#### Secondary Drop-out rates by Region, Sex and Year

This column is similar to the previous column `Primary Drop-out rates by Region, Sex and Year` but this time, in **secondary** education. With this, the values in this column are measured in population proportion as well which mean the expected range of value is `strictly from 0 to 100`. To add, there are three columns related in this section which are the following: **Secondary Drop-out rate (Both Sexes)**, **Secondary Drop-out rate (Boys)**, and **Secondary Drop-out rate (Girls)**.

Again, negative or more than 100% values will not mean anything if the value being measured is a fraction of a population. Thus, this will be reported as an anomaly as it is impossible for this column.  

We will first start with checking for the column `Secondary Drop-out rate (Both Sexes)`.

In [352]:
combined_data['Secondary Drop-out rate'].describe()

count    180.000000
mean       8.167778
std        2.489232
min        4.100000
25%        6.700000
50%        7.900000
75%        9.300000
max       24.800000
Name: Secondary Drop-out rate, dtype: float64

**Result:** Since the minimum value (4.10%) and the maximum value (24.80%) is within the expected range. This means that all of the values in this column are valid. Thus, we will stop checking for this column as it is not needed any more.

Next, we check for the values in the column `Secondary Drop-out rate (Boys)`.

In [353]:
combined_data['Secondary Drop-out rate (Boys)'].describe()

count    180.000000
mean       9.976111
std        2.525122
min        5.200000
25%        8.500000
50%        9.800000
75%       11.025000
max       26.500000
Name: Secondary Drop-out rate (Boys), dtype: float64

**Result:** Based on the result from above, the minimum value (5.20%) and the maximum value (26.50%) is within the expected range which means that all of the values in this column are within the expected range as well. With this, no more further checking is required.

Finally, the values for the column `Secondary Drop-out rate (Girls)` will be checked.

In [354]:
combined_data['Secondary Drop-out rate (Girls)'].describe()

count    180.000000
mean       6.372222
std        2.618244
min        2.600000
25%        4.875000
50%        5.800000
75%        7.600000
max       23.300000
Name: Secondary Drop-out rate (Girls), dtype: float64

**Result:** The result shows that the minimum value (2.60%) and the maximum value (23.30%) are within the expected range which means that all of the values in this column are valid. No more checking will be done.

#### Quarterly Producer Price Index (PPI) for Agriculture (First Quarter 2018 to Third Quarter 2021)

Lastly, for this column, as this contains average change in the prices data and recorded in **percentage**, there is no strict expected range for this column. 

With this the `representation of values` are the following:
- A zero value means there is no change in PPI
- A negative value means there is a decrease in PPI
- A positive value means there is an increase in PPI
- A value exceeding 100 means that PPI became more than twice the previous year's value.

Now, we will use the `describe()` function to see the summary of values in this column, specifically, the minimum and maximum values.

In [355]:
combined_data['Price Index for Agriculture'].describe()

count     68.000000
mean      94.687500
std        9.793088
min       64.350000
25%       89.650000
50%       97.512500
75%      100.000000
max      120.950000
Name: Price Index for Agriculture, dtype: float64

**Result:** Based on the result above, the minimum value (64.35%) and the maximum value (102.88%) shows that there are no instances where no changes or a decrease in PPI happened. However, a case where the PPI become more than twice the previous year's producer price happened. 

With this, we will be checking the number of instances for this case. Also, we will be checking for instances where the change in producer price is exactly 100% since this is not clearly seen in the result above.

Now, we will start with checking for instances where the PPI is more than 100%.

In [356]:
combined_data[combined_data['Price Index for Agriculture'] > 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
348,Region 4A: CALABARZON,2019,,98.23,98.61,97.87,87.85,91.62,84.29,54.79,...,766853100.0,2865793000.0,15742673.0,,,,,,,100.225
366,Region 4A: CALABARZON,2020,,91.9912,92.497,91.5134,84.4897,88.3569,80.833,54.7999,...,298782200.0,2565120000.0,16057299.0,,,,,,,102.875
378,PHILIPPINES,2021,,,,,,,,,...,4111887000.0,19410570000.0,,,,,,,,103.025
383,Region 3: Central Luzon,2021,,,,,,,,,...,478799500.0,2061963000.0,,,,,,,,107.425
384,Region 4A: CALABARZON,2021,,,,,,,,,...,518586300.0,2785912000.0,,,,,,,,120.95
387,Region 6: Western Visayas,2021,,,,,,,,,...,132970200.0,937244700.0,,,,,,,,107.175
390,Region 9: Zamboanga Peninsula,2021,,,,,,,,,...,113775200.0,428122000.0,,,,,,,,101.175
391,Region 10: Northern Mindanao,2021,,,,,,,,,...,164566000.0,949232000.0,,,,,,,,107.975
392,Region 11: Davao Region,2021,,,,,,,,,...,257595200.0,967227600.0,,,,,,,,110.85
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,...,87077950.0,503975600.0,,,,,,,,103.35


**Result:** The result above shows that there are two instances where the there is an expansion in the producer price. However, one instance from 2019 has a PPI value close to 100%, which is 100.2%.

Lastly, we will check for instances where the average change in producr price is exactly 100%.

In [357]:
combined_data[combined_data['Price Index for Agriculture'] == 100]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
324,PHILIPPINES,2018,16.7,94.05,93.85,94.25,81.41,85.82,77.24,51.24,...,4959105000.0,18265190000.0,105755180.0,,,,,,,100.0
326,CAR: Cordillera Administrative Region,2018,12.0,92.24,90.99,93.45,83.64,88.13,79.4,53.64,...,32266090.0,308267100.0,1775210.0,,,,,,,100.0
327,Region 1: Ilocos Region,2018,9.9,90.48,89.67,91.26,87.81,90.68,85.14,64.06,...,193759500.0,587597300.0,5178410.0,,,,,,,100.0
328,Region 2: Cagayan Valley,2018,16.3,96.86,96.36,97.32,84.76,89.01,80.78,56.21,...,134311400.0,385061300.0,3579715.0,,,,,,,100.0
329,Region 3: Central Luzon,2018,7.0,98.77,98.32,99.2,85.01,88.83,81.44,60.19,...,590017300.0,2062394000.0,11890314.0,,,,,,,100.0
330,Region 4A: CALABARZON,2018,7.1,97.36,97.37,97.34,86.38,89.97,83.01,58.33,...,712401200.0,2706995000.0,15418944.0,,,,,,,100.0
331,MIMAROPA: Southwestern Tagalog Region,2018,15.1,90.99,91.1,90.88,79.74,83.95,75.79,48.14,...,78995250.0,370744800.0,3094357.0,,,,,,,100.0
332,Region 5: Bicol Region,2018,27.0,93.12,92.38,93.82,82.97,87.51,78.71,45.8,...,181534200.0,522014800.0,6005949.0,,,,,,,100.0
333,Region 6: Western Visayas,2018,16.3,97.38,96.47,98.24,84.54,89.21,80.14,49.74,...,155804300.0,860107800.0,7763898.0,,,,,,,100.0
334,Region 7: Central Visayas,2018,17.7,98.17,97.69,98.62,87.64,93.34,82.27,53.28,...,257710100.0,1180946000.0,7745017.0,,,,,,,100.0


**Result:** As shown above, there are 17 instances where the PPI is exactly 100%.

## Conversion of the DataFrame to File
As we have already taken note of the possible anomalies in our data, we can now save our data into a `.csv` file, in order for it to be explored. Each of the indicators would be saved into different files, although there would be one `.csv` file for the combination of all datasets.

Additionally, if a row in a certain SDG is **NaN** for all of the indicators under that SDG, it would be [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)ped as there would be no insights that can be gathered from this row and to lessen the space needed for dataframes.

It is also important to take note that not all SDGs are present, since there are no available data on some of the SDGs in the OpenStat site that are divided per region per year.

### SDG #1: No Poverty
In this file, all related columns for first indicator **No Poverty** would be saved.

In [358]:
sdg_1 = combined_data.loc[ : , : '1.5.4 Proportion of LGU with DRR']
sdg_1

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,,
1,NCR: National Capital Region,2000,,101.00,101.92,100.13,79.05,79.50,78.57,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,,,,,96.9
392,Region 11: Davao Region,2021,,,,,,,,,,,100.0
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,,,100.0
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,,,93.6


First, let us check if there are rows that have all **NaN** values for all five columns. We can get these rows by using the combination of the [`isna`](https://pandas.pydata.org/docs/reference/api/pandas.isna.html) and [`sum`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) functions. This is due to the fact that that if it returns a sum of five (i.e., the number of columns included indicator), it means that all columns are **NaN.**

In [359]:
sdg_1[sdg_1.isna().sum(axis = 1) == 5]

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,,
1,NCR: National Capital Region,2000,,101.00,101.92,100.13,79.05,79.50,78.57,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,Region 10: Northern Mindanao,2014,,94.31,93.86,94.73,57.55,63.86,51.56,,,,
266,Region 11: Davao Region,2014,,99.78,100.17,99.41,60.20,66.96,53.82,,,,
267,Region 12: SOCCSKSARGEN,2014,,92.95,93.64,92.29,57.87,64.26,51.82,,,,
268,CARAGA: Cordillera Administrative Region,2014,,98.10,97.69,98.49,62.39,68.80,56.37,,,,


From this, we can see that there are only three rows that have all **NaN** values for the five columns, which would be dropped using the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function. 

The `thresh` parameter was set to three (3), as this means that, for a row not to be dropped, it would require three (3) **NaN** values. This means that at least one of the five columns must not be **NaN**.

In [360]:
sdg_1 = sdg_1.dropna(axis = 0, thresh = 3)
sdg_1 = sdg_1.reset_index(drop = True)
sdg_1

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,1.4.1 Net SHS Enrolment Rate (Girls),1.4.1 Net SHS Enrolment Rate (Boys),1.5.4 Proportion of LGU with DRR
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,,,
1,NCR: National Capital Region,2000,,101.00,101.92,100.13,79.05,79.50,78.57,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
388,Region 10: Northern Mindanao,2021,,,,,,,,,,,96.9
389,Region 11: Davao Region,2021,,,,,,,,,,,100.0
390,Region 12: SOCCSKSARGEN,2021,,,,,,,,,,,100.0
391,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,,,93.6


Once we were able to drop these rows, we can now save this dataframe to a `.csv` file, using the [`to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).

In [361]:
sdg_1.to_csv (f'data_output/sdg_1.csv', index = False) 

### SDG #3: Good Health and Well-Being
Next, as there is no provided datasets divided per region for the second indicator, **Zero Hunger**, we would be saving the columns related to the third indicator **Good Health and Well-Being**. 

As the columns for this indicator starts at the 7th column, we would need to [`insert`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html) the `Geolocation` and `Year` columns at the start of the dataframe.

In [362]:
start = '3.4.1 Mortality rate credited to NCD'
end = '3.7.2 Teenage pregnancy rates per 1000'
sdg_3 = combined_data.loc[ : , start : end]
sdg_3.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_3.insert (1, 'Year', combined_data ['Year'])
sdg_3

Unnamed: 0,Geolocation,Year,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi,3.7.1 Proportion of Contraceptive Use of Women,3.7.2 Teenage pregnancy rates per 1000
0,PHILIPPINES,2000,,,,,,,
1,NCR: National Capital Region,2000,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,,,,,,
3,Region 1: Ilocos Region,2000,,,,,,,
4,Region 2: Cagayan Valley,2000,,,,,,,
...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,
392,Region 11: Davao Region,2021,,,,,,,
393,Region 12: SOCCSKSARGEN,2021,,,,,,,
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,


As there are three columns included for this indicator and the `Geolocation` and `Year` column cannot be **NaN**, we can check which rows have these three columns as **NaN** by returning the rows which has three columns that are **NaN**.

In [363]:
sdg_3[sdg_3.isna().sum(axis = 1) == 3]

Unnamed: 0,Geolocation,Year,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi,3.7.1 Proportion of Contraceptive Use of Women,3.7.2 Teenage pregnancy rates per 1000
116,Region 5: Bicol Region,2006,,3.0,0.7,0.4,0.4,,
117,Region 6: Western Visayas,2006,,2.3,1.1,0.4,0.4,,
134,Region 5: Bicol Region,2007,,3.1,0.7,0.4,0.4,,
135,Region 6: Western Visayas,2007,,2.3,1.1,0.4,0.4,,
170,Region 5: Bicol Region,2009,,3.2,0.8,0.4,0.4,,
171,Region 6: Western Visayas,2009,,2.5,1.1,0.4,0.4,,
188,Region 5: Bicol Region,2010,,3.3,0.8,0.4,0.4,,
189,Region 6: Western Visayas,2010,,2.5,1.1,0.4,0.4,,
206,Region 5: Bicol Region,2011,,3.3,0.9,0.4,0.4,,
207,Region 6: Western Visayas,2011,,2.5,1.1,0.4,0.4,,


From the resulting dataframe above, we can see that there are many rows that have all **NaN** values. These are the rows that we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html). Additionally, we can also use the [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) function, in order to have the index of the dataframe start from 0. 

In [364]:
sdg_3 = sdg_3.dropna(axis = 0, thresh = 3)
sdg_3 = sdg_3.reset_index(drop = True)
sdg_3

Unnamed: 0,Geolocation,Year,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi,3.7.1 Proportion of Contraceptive Use of Women,3.7.2 Teenage pregnancy rates per 1000
0,PHILIPPINES,2003,,,,,,46.7,53.0
1,NCR: National Capital Region,2003,,,,,,47.2,35.0
2,CAR: Cordillera Administrative Region,2003,,,,,,44.4,52.0
3,Region 1: Ilocos Region,2003,,,,,,49.6,55.0
4,Region 2: Cagayan Valley,2003,,,,,,68.8,85.0
...,...,...,...,...,...,...,...,...,...
283,Region 10: Northern Mindanao,2020,4.4,2.7,1.0,0.6,0.2,,
284,Region 11: Davao Region,2020,4.8,2.9,1.0,0.7,0.2,,
285,Region 12: SOCCSKSARGEN,2020,4.1,2.5,0.8,0.6,0.2,,
286,CARAGA: Cordillera Administrative Region,2020,4.5,2.4,1.0,0.9,0.2,,


To make sure that we have dropped all the rows with **NaN** values, we can recheck it.

In [365]:
sdg_3[sdg_3.isna().sum(axis = 1) == 3]

Unnamed: 0,Geolocation,Year,3.4.1 Mortality rate credited to NCD,3.4.1 Mortality rate credited to Cardio,3.4.1 Mortality rate credited to Cancer,3.4.1 Mortality rate credited to Diabetes,3.4.1 Mortality rate credited to Respi,3.7.1 Proportion of Contraceptive Use of Women,3.7.2 Teenage pregnancy rates per 1000
26,Region 5: Bicol Region,2006,,3.0,0.7,0.4,0.4,,
27,Region 6: Western Visayas,2006,,2.3,1.1,0.4,0.4,,
44,Region 5: Bicol Region,2007,,3.1,0.7,0.4,0.4,,
45,Region 6: Western Visayas,2007,,2.3,1.1,0.4,0.4,,
80,Region 5: Bicol Region,2009,,3.2,0.8,0.4,0.4,,
81,Region 6: Western Visayas,2009,,2.5,1.1,0.4,0.4,,
98,Region 5: Bicol Region,2010,,3.3,0.8,0.4,0.4,,
99,Region 6: Western Visayas,2010,,2.5,1.1,0.4,0.4,,
116,Region 5: Bicol Region,2011,,3.3,0.9,0.4,0.4,,
117,Region 6: Western Visayas,2011,,2.5,1.1,0.4,0.4,,


As we can see, there are no rows with **NaN** for all of the columns. This means that we can now save it into a `.csv` file.

In [366]:
sdg_3.to_csv (f'data_output/sdg_3.csv', index = False) 

### SDG #4: Quality Education
Then, we would be loading the dataframe of the fourth SDG indicator **Quality Eduacation**. However, before we can do this, we must: (1) get the columns related to this indicator from the combined dataframe, and (2) [`insert`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html) the `Geolocation` and `Year` column at the first  part of the new dataframe. This is because if we do not have these two columns, we would not be able to identify what these indicators represent.

In [367]:
start = '4.1 Elem Completion Rate'
end = '4.c TVET trainers trained'
sdg_4 = combined_data.loc[ : , start : end]
sdg_4.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_4.insert (1, 'Year', combined_data ['Year'])
sdg_4

Unnamed: 0,Geolocation,Year,4.1 Elem Completion Rate,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),4.1 SHS Completion Rate (Male),4.c TVET trainers trained
0,PHILIPPINES,2000,62.72,65.53,60.05,70.07,72.29,67.66,,,,
1,NCR: National Capital Region,2000,63.87,66.58,61.35,68.16,72.18,63.88,,,,
2,CAR: Cordillera Administrative Region,2000,61.95,65.93,58.28,70.31,73.34,67.08,,,,
3,Region 1: Ilocos Region,2000,78.73,81.44,76.23,73.38,76.87,69.82,,,,
4,Region 2: Cagayan Valley,2000,70.75,74.95,66.90,72.20,73.78,70.48,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,,,,378.0
392,Region 11: Davao Region,2021,,,,,,,,,,543.0
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,,153.0
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,,149.0


Since there are **indicators from SDG #1 that also exist in SDG 4**, we will also add these columns.

The indicators from SDG#1 that will be added in SDG#4 are the following:
- Columns in `sdg1` that are officialy under from **1.4.1p5 Net Enrolment Rate in elementary** (which is also found in SDG 4.3.s1)
    - 1.4.1 Net Elem Enrolment Rate
    - 1.4.1 Net Elem Enrolment Rate (Girls)
    - 1.4.1 Net Elem Enrolment Rate (Boys) 
- Columns in `sdg1` that are officially under from **1.4.1p6 Net Enrolment Rate in secondary education** (which is also found in SDG 4.3.s2)
    - 1.4.1 Net JHS Enrolment Rate
    - 1.4.1 Net JHS Enrolment Rate (Girls)
    - 1.4.1 Net JHS Enrolment Rate (Boys)  
    - 1.4.1 Net SHS Enrolment Rate
    - 1.4.1 Net SHS Enrolment Rate (Girls)
    - 1.4.1 Net SHS Enrolment Rate (Boys)

In [368]:
start = '1.4.1 Net Elem Enrolment Rate'
end = '1.4.1 Net SHS Enrolment Rate (Boys)'
sdg_4_3 = combined_data.loc[ : ,  start : end]
sdg_4_3_col = sdg_4_3.columns
ctr = 11
for i in range(8,-1,-1):
    col = sdg_4_3.iloc[i]
    sdg_4.insert(int(ctr), sdg_4_3_col[i], col)
sdg_4.rename(columns= {'1.4.1 Net Elem Enrolment Rate': '4.3.s1 Net Elem Enrolment Rate',
                      '1.4.1 Net Elem Enrolment Rate (Girls)': '4.3.s1 Net Elem Enrolment Rate (Girls)',
                      '1.4.1 Net Elem Enrolment Rate (Boys)': '4.3.s1 Net Elem Enrolment Rate (Boys)',
                      '1.4.1 Net JHS Enrolment Rate': '4.3.s2 Net JHS Enrolment Rate',
                      '1.4.1 Net JHS Enrolment Rate (Girls)': '4.3.s2 Net JHS Enrolment Rate (Girls)',
                      '1.4.1 Net JHS Enrolment Rate (Boys)': '4.3.s2 Net JHS Enrolment Rate (Boys)',
                      '1.4.1 Net SHS Enrolment Rate': '4.3.s2 Net SHS Enrolment Rate',
                      '1.4.1 Net SHS Enrolment Rate (Girls)': '4.3.s2 Net SHS Enrolment Rate (Girls)',
                      '1.4.1 Net SHS Enrolment Rate (Boys)': '4.3.s2 Net SHS Enrolment Rate (Boys)'}, inplace=True)
sdg_4

Unnamed: 0,Geolocation,Year,4.1 Elem Completion Rate,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),...,4.3.s1 Net Elem Enrolment Rate,4.3.s1 Net Elem Enrolment Rate (Girls),4.3.s1 Net Elem Enrolment Rate (Boys),4.3.s2 Net JHS Enrolment Rate,4.3.s2 Net JHS Enrolment Rate (Girls),4.3.s2 Net JHS Enrolment Rate (Boys),4.3.s2 Net SHS Enrolment Rate,4.3.s2 Net SHS Enrolment Rate (Girls),4.3.s2 Net SHS Enrolment Rate (Boys),4.c TVET trainers trained
0,PHILIPPINES,2000,62.72,65.53,60.05,70.07,72.29,67.66,,,...,,,,,,,,,,
1,NCR: National Capital Region,2000,63.87,66.58,61.35,68.16,72.18,63.88,,,...,,,,,,,,,,
2,CAR: Cordillera Administrative Region,2000,61.95,65.93,58.28,70.31,73.34,67.08,,,...,,,,,,,,,,
3,Region 1: Ilocos Region,2000,78.73,81.44,76.23,73.38,76.87,69.82,,,...,,,,,,,,,,
4,Region 2: Cagayan Valley,2000,70.75,74.95,66.90,72.20,73.78,70.48,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,,,...,,,,,,,,,,378.0
392,Region 11: Davao Region,2021,,,,,,,,,...,,,,,,,,,,543.0
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,...,,,,,,,,,,153.0
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,...,,,,,,,,,,149.0


Next, we check the rows with the **NaN** values for all of the four indicator columns.

In [369]:
sdg_4[sdg_4.isna().sum(axis = 1) == 4]

Unnamed: 0,Geolocation,Year,4.1 Elem Completion Rate,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),...,4.3.s1 Net Elem Enrolment Rate,4.3.s1 Net Elem Enrolment Rate (Girls),4.3.s1 Net Elem Enrolment Rate (Boys),4.3.s2 Net JHS Enrolment Rate,4.3.s2 Net JHS Enrolment Rate (Girls),4.3.s2 Net JHS Enrolment Rate (Boys),4.3.s2 Net SHS Enrolment Rate,4.3.s2 Net SHS Enrolment Rate (Girls),4.3.s2 Net SHS Enrolment Rate (Boys),4.c TVET trainers trained


As we can see, from the 396 rows that our dataframe has, we only have two rows that are **NaN**. We will also drop these rows, then we will be resetting the index to make it chronological.

In [370]:
sdg_4 = sdg_4.dropna(axis = 0, thresh = 3)
sdg_4 = sdg_4.reset_index(drop = True)
sdg_4

Unnamed: 0,Geolocation,Year,4.1 Elem Completion Rate,4.1 Elem Completion Rate (Female),4.1 Elem Completion Rate (Male),4.1 JHS Completion Rate,4.1 JHS Completion Rate (Female),4.1 JHS Completion Rate (Male),4.1 SHS Completion Rate,4.1 SHS Completion Rate (Female),...,4.3.s1 Net Elem Enrolment Rate,4.3.s1 Net Elem Enrolment Rate (Girls),4.3.s1 Net Elem Enrolment Rate (Boys),4.3.s2 Net JHS Enrolment Rate,4.3.s2 Net JHS Enrolment Rate (Girls),4.3.s2 Net JHS Enrolment Rate (Boys),4.3.s2 Net SHS Enrolment Rate,4.3.s2 Net SHS Enrolment Rate (Girls),4.3.s2 Net SHS Enrolment Rate (Boys),4.c TVET trainers trained
0,PHILIPPINES,2000,62.72,65.53,60.05,70.07,72.29,67.66,,,...,,,,,,,,,,
1,NCR: National Capital Region,2000,63.87,66.58,61.35,68.16,72.18,63.88,,,...,,,,,,,,,,
2,CAR: Cordillera Administrative Region,2000,61.95,65.93,58.28,70.31,73.34,67.08,,,...,,,,,,,,,,
3,Region 1: Ilocos Region,2000,78.73,81.44,76.23,73.38,76.87,69.82,,,...,,,,,,,,,,
4,Region 2: Cagayan Valley,2000,70.75,74.95,66.90,72.20,73.78,70.48,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
389,Region 10: Northern Mindanao,2021,,,,,,,,,...,,,,,,,,,,378.0
390,Region 11: Davao Region,2021,,,,,,,,,...,,,,,,,,,,543.0
391,Region 12: SOCCSKSARGEN,2021,,,,,,,,,...,,,,,,,,,,153.0
392,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,...,,,,,,,,,,149.0


As we can see, two rows were removed from the dataframe (i.e., from 396 rows, we now have 394 rows). Additionally, the last index is 394, which means it was reset to start from `index 0`. Thus, we can already save this dataframe to a `.csv` file.

In [371]:
sdg_4.to_csv (f'data_output/sdg_4.csv', index = False) 

### SDG #7: Affordable and Clean Energy
Next, we would be saving the seventh SDG, **Affordable and Clean Energy**, as there are no provided datasets for the fifth and sixth dataset. Just like in the previous SDGs, we would need to insert the `Geolocation` and `Year` columns in the 0th and 1st column, respectively.  

In [372]:
start = '7.1.1 Proportion of pop with electricity'
end = '7.1.1 Proportion of pop with electricity'
sdg_7 = combined_data.loc[ : , start : end]
sdg_7.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_7.insert (1, 'Year', combined_data ['Year'])
sdg_7

Unnamed: 0,Geolocation,Year,7.1.1 Proportion of pop with electricity
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
391,Region 10: Northern Mindanao,2021,
392,Region 11: Davao Region,2021,
393,Region 12: SOCCSKSARGEN,2021,
394,CARAGA: Cordillera Administrative Region,2021,


As there is only one column for this SDG, we can directly drop a row if there is even one value that is **NaN**. 

In [373]:
sdg_7 = sdg_7.dropna()
sdg_7 = sdg_7.reset_index(drop = True)
sdg_7

Unnamed: 0,Geolocation,Year,7.1.1 Proportion of pop with electricity
0,PHILIPPINES,2009,99.39
1,NCR: National Capital Region,2009,100.00
2,CAR: Cordillera Administrative Region,2009,100.00
3,Region 1: Ilocos Region,2009,100.00
4,Region 2: Cagayan Valley,2009,99.96
...,...,...,...
211,Region 10: Northern Mindanao,2020,94.81
212,Region 11: Davao Region,2020,88.40
213,Region 12: SOCCSKSARGEN,2020,85.91
214,CARAGA: Cordillera Administrative Region,2020,100.00


Afterwards, we can save it as a `.csv` file.

In [374]:
sdg_7.to_csv (f'data_output/sdg_7.csv', index = False) 

### SDG #8: Decent Work and Economic Growth
This is followed by the eighth SDG, **Decent Work and Economic Growth**. For this SDG, we would only need to get three columns from the combined dataset: (1) one column which is related to this SDG, (2) `Geolocation`, and (3) `Year`.

In [375]:
start = '8.1.1 Growth rate of real GDP per capita'
end = '8.1.1 Growth rate of real GDP per capita'
sdg_8 = combined_data.loc[ : , start : end]
sdg_8.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_8.insert (1, 'Year', combined_data ['Year'])
sdg_8

Unnamed: 0,Geolocation,Year,8.1.1 Growth rate of real GDP per capita
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
391,Region 10: Northern Mindanao,2021,
392,Region 11: Davao Region,2021,
393,Region 12: SOCCSKSARGEN,2021,
394,CARAGA: Cordillera Administrative Region,2021,


Just like in the seventh dataset, we can directly drop the row with even just one **NaN** value. As the `Geolocation` and `Year` columns cannot have **NaN** values, we know that if a **NaN** was seen in a row, it stems from the `8.1.1 Annual growth rate of real GDP per capita`.

In [376]:
sdg_8 = sdg_8.dropna()
sdg_8 = sdg_8.reset_index(drop = True)
sdg_8

Unnamed: 0,Geolocation,Year,8.1.1 Growth rate of real GDP per capita
0,PHILIPPINES,2001,1.007914
1,NCR: National Capital Region,2001,0.841128
2,CAR: Cordillera Administrative Region,2001,1.900838
3,Region 1: Ilocos Region,2001,0.811589
4,Region 2: Cagayan Valley,2001,2.147953
...,...,...,...
356,Region 11: Davao Region,2020,-9.046379
357,Region 12: SOCCSKSARGEN,2020,-5.555509
358,CARAGA: Cordillera Administrative Region,2020,-8.237564
359,BARMM: Bangsamoro Autonomous Region in Muslim ...,2020,-3.881252


From 396 rows, we now have 361 rows. This means that 35 rows were dropped. 

As we now have no **NaN** values, we can now save this into a `.csv` file.

In [377]:
sdg_8.to_csv (f'data_output/sdg_8.csv', index = False) 

### SDG #10: Reduced Inequalities
Meanwhile, for the **Reduced Inequalities**, the tenth SDG, we would need to get two columns from the combined dataframe. This is aside from the `Geolocation` and `Year` columns, which would be inserted as the first two columns. 

In [378]:
start = '10.1.1.1 Income per capita growth rate of bottom 40'
end = '10.1.1.2 Income per capita growth rate'
sdg_10 = combined_data.loc[ : , start : end]
sdg_10.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_10.insert (1, 'Year', combined_data ['Year'])
sdg_10

Unnamed: 0,Geolocation,Year,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate
0,PHILIPPINES,2000,,
1,NCR: National Capital Region,2000,,
2,CAR: Cordillera Administrative Region,2000,,
3,Region 1: Ilocos Region,2000,,
4,Region 2: Cagayan Valley,2000,,
...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,
392,Region 11: Davao Region,2021,,
393,Region 12: SOCCSKSARGEN,2021,,
394,CARAGA: Cordillera Administrative Region,2021,,


Let us check which rows have **NaN** values for both `10.1.1.1 Growth rates of household expenditure or income per capita (among the bottom 40 percent of the population)` and `10.1.1.2 Growth rates of household expenditure or income per capita (among the Total Population)` column. This is because we know that `Geolocation` and `Year` cannot have **NaN** values, thus, we can say that a row is **NaN** if it has two **NaN** values.

In [379]:
sdg_10[sdg_10.isna().sum(axis = 1) == 2]

Unnamed: 0,Geolocation,Year,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate
0,PHILIPPINES,2000,,
1,NCR: National Capital Region,2000,,
2,CAR: Cordillera Administrative Region,2000,,
3,Region 1: Ilocos Region,2000,,
4,Region 2: Cagayan Valley,2000,,
...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,
392,Region 11: Davao Region,2021,,
393,Region 12: SOCCSKSARGEN,2021,,
394,CARAGA: Cordillera Administrative Region,2021,,


So, this means that we can drop the rows that do not have at least three not-NaN values. This can be done by setting the thresh parameter of the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function. Although, after this, we have to reset the indices to make it start from `Index 0`.

In [380]:
sdg_10 = sdg_10.dropna(axis = 0, thresh = 3)
sdg_10 = sdg_10.reset_index(drop = True)
sdg_10

Unnamed: 0,Geolocation,Year,10.1.1.1 Income per capita growth rate of bottom 40,10.1.1.2 Income per capita growth rate
0,PHILIPPINES,2015,7.406738,5.045087
1,NCR: National Capital Region,2015,5.883486,4.570268
2,CAR: Cordillera Administrative Region,2015,6.635383,1.658328
3,Region 1: Ilocos Region,2015,7.453736,3.572707
4,Region 2: Cagayan Valley,2015,8.900391,7.312018
5,Region 3: Central Luzon,2015,6.519357,5.508813
6,Region 4A: CALABARZON,2015,6.307004,4.059653
7,MIMAROPA: Southwestern Tagalog Region,2015,9.530952,9.318983
8,Region 5: Bicol Region,2015,8.240401,5.21329
9,Region 6: Western Visayas,2015,9.042917,1.901536


We can see that from 396 rows, we now have 35 rows. As we now have no **NaN** rows, we can now save it to a `.csv` file, using the [`to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) function.

In [381]:
sdg_10.to_csv (f'data_output/sdg_10.csv', index = False) 

### SDG #11: Sustainable Cities and Communities
Our next SDG, **Sustainable Cities and Communities**, has one column related to it from the combined dataset. Like in the previous datasets, we would get this column from the combined dataset, before inserting the `Geolocation` and `Year` columns in front.

The column related to this SDG is also placed in SDG #1, which is `1.5.4 Proportion of LGU with DRR`. 

However, for this SDG, we will change its column name by replacing `1.5.4 <Indicator name>` to `11.b.2 <Indicator name>`. This is because we are also placing this column in SDG #11 and the indicator (1.5.4) related to this column is found specifically in **11.b.2**.

In [382]:
sdg_11 = pd.DataFrame(combined_data['1.5.4 Proportion of LGU with DRR'])
sdg_11.rename(columns={'1.5.4 Proportion of LGU with DRR': '11.b.2 Proportion of LGU with DRR'}, inplace=True)
sdg_11.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_11.insert (1, 'Year', combined_data ['Year'])
sdg_11

Unnamed: 0,Geolocation,Year,11.b.2 Proportion of LGU with DRR
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
391,Region 10: Northern Mindanao,2021,96.9
392,Region 11: Davao Region,2021,100.0
393,Region 12: SOCCSKSARGEN,2021,100.0
394,CARAGA: Cordillera Administrative Region,2021,93.6


For now, we will save this into a csv file. However, as of now, this file will not be explored since the only one indicator included in this SDG also exists and is being explored in the other SDG #1.

In [383]:
sdg_11.to_csv (f'data_output/sdg_11.csv', index = False) 

### SDG #13: Climate Action
Our next SDG, **Climate Action**, has one column related to it from the combined dataset. Like in the previous datasets, we would get this column from the combined dataset, before inserting the `Geolocation` and `Year` columns in front.

Similar to the case of SDG #11, the only column related to this SDG is also related to SDG#1, which is `1.5.4 Proportion of LGU with DRR`.

Since we are placing this column in SDG #13 and this indicator (1.5.4) is found specifically in **13.1.3** as well, we will change its column name by replacing `1.5.4 <Indicator name>` to `13.1.3 <Indicator name>`.

In [384]:
sdg_13 = pd.DataFrame(combined_data['1.5.4 Proportion of LGU with DRR'])
sdg_13.rename(columns={'1.5.4 Proportion of LGU with DRR': '13.1.3 Proportion of LGU with DRR'}, inplace=True)
sdg_13.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_13.insert (1, 'Year', combined_data ['Year'])
sdg_13

Unnamed: 0,Geolocation,Year,13.1.3 Proportion of LGU with DRR
0,PHILIPPINES,2000,
1,NCR: National Capital Region,2000,
2,CAR: Cordillera Administrative Region,2000,
3,Region 1: Ilocos Region,2000,
4,Region 2: Cagayan Valley,2000,
...,...,...,...
391,Region 10: Northern Mindanao,2021,96.9
392,Region 11: Davao Region,2021,100.0
393,Region 12: SOCCSKSARGEN,2021,100.0
394,CARAGA: Cordillera Administrative Region,2021,93.6


Finally, we will save this dataframe into a csv file. However, this file will not be explored as of now because the only one indicator included in this SDG is being explored already in SDG #1.

In [385]:
sdg_13.to_csv (f'data_output/sdg_13.csv', index = False) 

### SDG #14: Life Below Water
Our next SDG, **Life Below Water**, has two columns related to it from the combined dataset. Like in the previous datasets, we would get these two columns from the combined dataset, before inserting the `Geolocation` and `Year` columns in front.

In [386]:
start = '14.5.1.1 Coverage of protected areas'
end = '14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs'
sdg_14 = combined_data.loc[ : , start : end]
sdg_14.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_14.insert (1, 'Year', combined_data ['Year'])
sdg_14

Unnamed: 0,Geolocation,Year,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs
0,PHILIPPINES,2000,,
1,NCR: National Capital Region,2000,,
2,CAR: Cordillera Administrative Region,2000,,
3,Region 1: Ilocos Region,2000,,
4,Region 2: Cagayan Valley,2000,,
...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,
392,Region 11: Davao Region,2021,,
393,Region 12: SOCCSKSARGEN,2021,,
394,CARAGA: Cordillera Administrative Region,2021,,


Once we have created a dataframe from the 14th SDG, we can now check if there are rows that have **NaN** values for the two indicators under this SDG. This can be done using the combination of the [`isna`](https://pandas.pydata.org/docs/reference/api/pandas.isna.html) and [`sum`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) function, as we know that having at least two **NaN** values per row would mean that the indicators are both **NaN**.

In [387]:
sdg_14[sdg_14.isna().sum(axis = 1) == 2]

Unnamed: 0,Geolocation,Year,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs
0,PHILIPPINES,2000,,
1,NCR: National Capital Region,2000,,
2,CAR: Cordillera Administrative Region,2000,,
3,Region 1: Ilocos Region,2000,,
4,Region 2: Cagayan Valley,2000,,
...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,
392,Region 11: Davao Region,2021,,
393,Region 12: SOCCSKSARGEN,2021,,
394,CARAGA: Cordillera Administrative Region,2021,,


From the result above, we know that we have to drop 359 rows. 

In [388]:
sdg_14 = sdg_14.dropna(axis = 0, thresh = 3)
sdg_14 = sdg_14.reset_index(drop = True)
sdg_14

Unnamed: 0,Geolocation,Year,14.5.1.1 Coverage of protected areas,14.5.1.2 Coverage of protected NIPAS and Locally managed MPAs
0,PHILIPPINES,2016,1.412125,0.647
1,PHILIPPINES,2019,3.143559,1.42
2,NCR: National Capital Region,2019,0.000108,4.9e-05
3,CAR: Cordillera Administrative Region,2019,0.0,0.0
4,Region 1: Ilocos Region,2019,0.012083,0.005476
5,Region 2: Cagayan Valley,2019,0.280804,0.127265
6,Region 3: Central Luzon,2019,0.364699,0.165288
7,Region 4A: CALABARZON,2019,0.00061,0.000277
8,MIMAROPA: Southwestern Tagalog Region,2019,0.635796,0.288154
9,Region 5: Bicol Region,2019,0.396713,0.179797


As we now have only 36 rows that are not NaN, we can now save it into a `.csv` file.

In [389]:
sdg_14.to_csv (f'data_output/sdg_14.csv', index = False) 

### SDG #16: Peace, Justice and Strong Institutions
Now, for the last SDG that we have data on, **Peace, Justice and Strong Institutions**, we have to get the two columns that are related to this goal, and the `Geolocation` and `Year` columns.

In [390]:
start = '16.1.1 Victims of intentional homicide per 100,000'
end = '16.1.s1 Number of murder cases'
sdg_16 = combined_data.loc[ : , start : end]
sdg_16.insert (0, 'Geolocation', combined_data ['Geolocation'])
sdg_16.insert (1, 'Year', combined_data ['Year'])
sdg_16

Unnamed: 0,Geolocation,Year,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases
0,PHILIPPINES,2000,,
1,NCR: National Capital Region,2000,,
2,CAR: Cordillera Administrative Region,2000,,
3,Region 1: Ilocos Region,2000,,
4,Region 2: Cagayan Valley,2000,,
...,...,...,...,...
391,Region 10: Northern Mindanao,2021,5.616247,285.0
392,Region 11: Davao Region,2021,6.627875,356.0
393,Region 12: SOCCSKSARGEN,2021,4.031386,201.0
394,CARAGA: Cordillera Administrative Region,2021,7.799533,217.0


From the processing of the previous goals, we know that if we have a thresh parameter set, we can choose to drop rows that have **NaN** values for both of the indicator columns. Since there are two columns that cannot have **NaN** values, we know that we should have at least three non-NaN values for a row. 

In [391]:
sdg_16 = sdg_16.dropna(axis = 0, thresh = 3)
sdg_16 = sdg_16.reset_index(drop = True)
sdg_16

Unnamed: 0,Geolocation,Year,"16.1.1 Victims of intentional homicide per 100,000",16.1.s1 Number of murder cases
0,PHILIPPINES,2016,12.110579,12417.0
1,NCR: National Capital Region,2016,17.739571,2318.0
2,CAR: Cordillera Administrative Region,2016,10.348515,180.0
3,Region 1: Ilocos Region,2016,11.031909,560.0
4,Region 2: Cagayan Valley,2016,9.188067,321.0
...,...,...,...,...
103,Region 10: Northern Mindanao,2021,5.616247,285.0
104,Region 11: Davao Region,2021,6.627875,356.0
105,Region 12: SOCCSKSARGEN,2021,4.031386,201.0
106,CARAGA: Cordillera Administrative Region,2021,7.799533,217.0


Our new dataframe for the 16th SDG have 108 rows. This dataframe can now be saved to a `.csv` file.

In [392]:
sdg_16.to_csv (f'data_output/sdg_16.csv', index = False) 

### Supplementary Datasets
However, we also have columns in our combined dataset that are not directly under an SDG. These columns were included as they might deepen our understanding on the different goals, and we can explore if these columns affect a specific SDG. Thus, we would save it in a separate file, so that we can easily load these columns together.

In [393]:
start = 'Changes in Inventories'
others = combined_data.loc[ : , start : ]
others.insert (0, 'Geolocation', combined_data ['Geolocation'])
others.insert (1, 'Year', combined_data ['Year'])
others

Unnamed: 0,Geolocation,Year,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,-136845782.0,,,100.0,5.7,579938180.0,3.697556e+09,76723051.0,,,,,,,
1,NCR: National Capital Region,2000,2177317.0,,,45.9,8.3,203930819.0,1.237451e+09,9961971.0,,,,,,,
2,CAR: Cordillera Administrative Region,2000,-6416286.0,,,2.7,2.3,13865180.0,9.044601e+07,1369249.0,,,,,,,
3,Region 1: Ilocos Region,2000,-1891391.0,,,4.0,1.3,24454284.0,1.289450e+08,4209083.0,,,,,,,
4,Region 2: Cagayan Valley,2000,5458610.0,,,2.4,4.2,32773347.0,8.593798e+07,2819641.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,-1655432.0,,,4.0,,164566009.0,9.492320e+08,,,,,,,,107.975
392,Region 11: Davao Region,2021,-4048336.0,,,3.5,,257595240.0,9.672276e+08,,,,,,,,110.850
393,Region 12: SOCCSKSARGEN,2021,-5780014.0,,,2.6,,87077953.0,5.039756e+08,,,,,,,,103.350
394,CARAGA: Cordillera Administrative Region,2021,-1896392.0,,,1.9,,100730468.0,3.317629e+08,,,,,,,,104.525


Just like how the SDGs were handled, we would be dropping the rows that have **NaN** values for each of the column (i.e., a row that does not have at least 3 non-NaN values).

In [394]:
others = others.dropna(axis = 0, thresh = 3)
others = others.reset_index(drop = True)
others

Unnamed: 0,Geolocation,Year,Changes in Inventories,Current Health Expenditure,Current Health Expenditure GR,Consumption Expenditure %,Consumption Expenditure GR,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,-136845782.0,,,100.0,5.7,579938180.0,3.697556e+09,76723051.0,,,,,,,
1,NCR: National Capital Region,2000,2177317.0,,,45.9,8.3,203930819.0,1.237451e+09,9961971.0,,,,,,,
2,CAR: Cordillera Administrative Region,2000,-6416286.0,,,2.7,2.3,13865180.0,9.044601e+07,1369249.0,,,,,,,
3,Region 1: Ilocos Region,2000,-1891391.0,,,4.0,1.3,24454284.0,1.289450e+08,4209083.0,,,,,,,
4,Region 2: Cagayan Valley,2000,5458610.0,,,2.4,4.2,32773347.0,8.593798e+07,2819641.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,-1655432.0,,,4.0,,164566009.0,9.492320e+08,,,,,,,,107.975
392,Region 11: Davao Region,2021,-4048336.0,,,3.5,,257595240.0,9.672276e+08,,,,,,,,110.850
393,Region 12: SOCCSKSARGEN,2021,-5780014.0,,,2.6,,87077953.0,5.039756e+08,,,,,,,,103.350
394,CARAGA: Cordillera Administrative Region,2021,-1896392.0,,,1.9,,100730468.0,3.317629e+08,,,,,,,,104.525


Although there are still cells that have **NaN** values, this is expected as we only dropped the rows that have **NaN** for all of the values of its cells, except `Geolocation` and `Year`. Then, we can save this dataframe to a `.csv` file. 

In [395]:
others.to_csv (f'data_output/supplementary_datasets.csv', index = False) 

### Complete Data
Additionally, we would also be creating a `.csv` file for the combination of all of these data sets. However, as this already went through data cleaning to drop the rows that have all **NaN** values, we can directly save it as a `.csv`.

In [396]:
combined_data.to_csv (f'data_output/combined_data.csv', index = False) 
combined_data

Unnamed: 0,Geolocation,Year,1.2.1 Poverty Proportion,1.4.1 Net Elem Enrolment Rate,1.4.1 Net Elem Enrolment Rate (Girls),1.4.1 Net Elem Enrolment Rate (Boys),1.4.1 Net JHS Enrolment Rate,1.4.1 Net JHS Enrolment Rate (Girls),1.4.1 Net JHS Enrolment Rate (Boys),1.4.1 Net SHS Enrolment Rate,...,Gross Capital Formation,GRDP,Population,Primary Drop-out rate,Primary Drop-out rate (Girls),Primary Drop-out rate (Boys),Secondary Drop-out rate,Secondary Drop-out rate (Girls),Secondary Drop-out rate (Boys),Price Index for Agriculture
0,PHILIPPINES,2000,,96.77,97.28,96.27,66.06,69.49,62.72,,...,579938180.0,3.697556e+09,76723051.0,,,,,,,
1,NCR: National Capital Region,2000,,101.00,101.92,100.13,79.05,79.50,78.57,,...,203930819.0,1.237451e+09,9961971.0,,,,,,,
2,CAR: Cordillera Administrative Region,2000,,94.42,94.58,94.26,71.19,76.37,66.14,,...,13865180.0,9.044601e+07,1369249.0,,,,,,,
3,Region 1: Ilocos Region,2000,,97.73,97.01,98.41,87.51,90.05,85.07,,...,24454284.0,1.289450e+08,4209083.0,,,,,,,
4,Region 2: Cagayan Valley,2000,,95.65,95.74,95.57,77.11,81.11,73.31,,...,32773347.0,8.593798e+07,2819641.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
391,Region 10: Northern Mindanao,2021,,,,,,,,,...,164566009.0,9.492320e+08,,,,,,,,107.975
392,Region 11: Davao Region,2021,,,,,,,,,...,257595240.0,9.672276e+08,,,,,,,,110.850
393,Region 12: SOCCSKSARGEN,2021,,,,,,,,,...,87077953.0,5.039756e+08,,,,,,,,103.350
394,CARAGA: Cordillera Administrative Region,2021,,,,,,,,,...,100730468.0,3.317629e+08,,,,,,,,104.525


## References
*Indicators of sustainable development: Guidelines and methodologies*. United Nations. (n.d.). Retrieved May 23, 2022, from https://www.un.org/esa/sustdev/natlinfo/indicators/methodology_sheets.pdf 

*Net Enrolment Ratio (NER)*. Philippine Statistics Authority. (n.d.). Retrieved May 23, 2022, from https://psa.gov.ph/content/net-enrolment-ratio-ner-1 