This notebook demonstrates how Python can be used to gather and adapt data from different sources.

# Loading socio-economic data

#### Loading functions

First we import the [pandas](http://pandas.pydata.org/) library. Pandas is a standard Python library that allows us to manipulate Excel-like tables (called DataFrames) with named rows and columns.

In [159]:
import pandas as pd

#### Reading data

Now we read the Excel data into a pandas DataFrame.
We start from an Excel file that contains socio-economic data. In the future, this file may for instance be populated by PSA.

In [160]:
data_from_excel = pd.read_excel("inputs/input_data_Feb2016.xlsx", #the name of the file
                                sheet_name="Consolidated (2012)", #the Excel tab where the data is (older pandas versions call this argument sheetname)
                                index_col="province", #column to use as index
                                header=1, #skips the first line of the Excel file
                                )
data_from_excel.index = data_from_excel.index.str.title() #fixes the case of province names in the Excel file
data_from_excel.head() #shows the first few lines of the table

Unnamed: 0_level_0,Region,Region PSGC,Province PSGC,GRDPC 2012 (At Current Prices),Projected Population 2012,"Average Annual Family Income, 2009","Average Annual Family Income, by Region, 2012",% Wages and salaries 2012,% Entrepreneurial activities 2012,% Other sources of income 2012,...,% Others Deposits 2012,% Health Expenditure 2012,% of Births by Attended Skilled Health Personnel 2012,% hh with radio 2012,% hh with landlines 2012,% hh with cellular phones 2012,"Public Schools, Elementary, 2012-2013","Public Schools, Secondary, 2012-2013",Estimated QRF 2012,Estimated LDRRM Fund 2012
province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abra,CAR,14,1401,126843,240135.244121,133688,257000,0.343701,0.247626,0.297152,...,0.00227,0.031414,0.85772,0.652174,0.062112,0.953416,277,33,31746830.1432,105822800.0
Agusan Del Norte,CARAGA,16,1602,48954,661728.454375,179014,180000,0.3875,0.224497,0.298354,...,4.5e-05,0.034563,0.921445,0.395745,0.02766,0.821277,293,86,40128811.09725,133762700.0
Agusan Del Sur,CARAGA,16,1603,48954,677779.682154,126492,180000,0.3875,0.224497,0.298354,...,0.000552,0.034563,0.727442,0.395745,0.02766,0.821277,483,95,50795871.21195,169319600.0
Aklan,6,6,604,57801,554414.442422,119962,202000,0.371111,0.195986,0.374721,...,0.000133,0.044318,0.806176,0.548898,0.069559,0.823003,320,70,34597652.21625,115325500.0
Albay,5,5,505,38870,1264097.894966,158629,162000,0.384548,0.211663,0.335946,...,0.003677,0.032568,0.84084,0.514019,0.024299,0.8,601,122,61822427.32725,206074800.0


This table contains more data (more columns) than we need to run the model. In addition, the column names are human-readable instead of corresponding to variable names in the model. Finally, some data is missing. We address each of these problems in the following sections.

### Matching columns in the Excel file to variables in the model

#### pov_head, pop, gdp_pc_pp

Some of the data in the Excel file directly match variables in the model. We can convert them using a simple dictionary, [inputs/data_source_matching.csv](inputs/data_source_matching.csv), that maps each name in the Excel file to the corresponding name in the model.

In [161]:
#reads the CSV file that matches names in Excel to names in the model
data_source_matching =pd.read_csv("inputs/data_source_matching.csv",
                                  index_col="name_in_data",
                                 )
data_source_matching #displays the result

Unnamed: 0_level_0,name_in_model
name_in_data,Unnamed: 1_level_1
"Average Annual Family Income, 2009",gdp_pc_pp
Projected Population 2012,pop
"Poverty Incidence among Population (%), 2012",pov_head
% hh with cellular phones 2012,shew


In [162]:
#keeps only the columns listed in data_source_matching
df=data_from_excel[data_source_matching.index]
#renames those columns to their name in the model
df=df.rename(columns=data_source_matching["name_in_model"])
df.head()

Unnamed: 0_level_0,gdp_pc_pp,pop,pov_head,shew
province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Abra,133688,240135.244121,0.373595,0.953416
Agusan Del Norte,179014,661728.454375,0.346715,0.821277
Agusan Del Sur,126492,677779.682154,0.480785,0.821277
Aklan,119962,554414.442422,0.249662,0.823003
Albay,158629,1264097.894966,0.409587,0.8
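
The selection and renaming step above can be tried on a toy table. The sketch below uses made-up values and only two of the column names from the matching file; `raw`, `matching` and `model_inputs` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the Excel table (values are made up for illustration)
raw = pd.DataFrame(
    {
        "Projected Population 2012": [1000.0, 2500.0],
        "% hh with cellular phones 2012": [0.9, 0.8],
        "Region": ["A", "B"],  # a column the model does not need
    },
    index=pd.Index(["Province X", "Province Y"], name="province"),
)

# Toy stand-in for data_source_matching.csv: name in data -> name in model
matching = pd.Series(
    {
        "Projected Population 2012": "pop",
        "% hh with cellular phones 2012": "shew",
    },
    name="name_in_model",
)

# Keep only the listed columns, then rename them to the model's names
model_inputs = raw[matching.index].rename(columns=matching.to_dict())
print(model_inputs)
```

Columns that are not listed in the matching table (here `Region`) are dropped, and the remaining ones carry the names the model expects.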


##### Adapting the data on income and poverty

The model needs income information in each province to be provided relative to the average income in the Philippines.
Within each province, we need the income of the poor and nonpoor households relative to the average income in the province.

To compute the weighted average, we will use another standard Python library, [NumPy](http://www.numpy.org/), which provides standard mathematical functions such as log, exp, weighted average, etc.

In [163]:
import numpy as np

In [164]:
#Changes the unit of GDP to thousands of pesos (technical: to reduce risk of float overflows when computing welfare)
df["gdp_pc_pp"]/=1e3

#National average income 
df["gdp_pc_pp_nat"] = np.average(df.dropna().gdp_pc_pp,  weights=df.dropna()["pop"]) #note that we have to manually remove the lines with missing data (.dropna()) because numpy does not handle missing data

#Average income of poor households (estimated from WB data on income distribution: http://iresearch.worldbank.org/PovcalNet/index.htm?2)
wp=50

#Relative income of the province and poor families in those provinces
df["rel_gdp_pp"]=df["gdp_pc_pp"]/df["gdp_pc_pp_nat"]
df["share1"]=wp/df["gdp_pc_pp"]
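
To see why the rows with missing data are dropped before averaging, here is a minimal sketch on made-up numbers (the table `toy` and its values are assumptions for illustration, not data from the Excel file).

```python
import numpy as np
import pandas as pd

# Made-up incomes and populations; province C has no income data
toy = pd.DataFrame(
    {"gdp_pc_pp": [100.0, 200.0, np.nan], "pop": [1.0, 3.0, 10.0]},
    index=["A", "B", "C"],
)

# Without dropping the incomplete row, the NaN propagates to the result
naive = np.average(toy["gdp_pc_pp"], weights=toy["pop"])

# Dropping incomplete rows first keeps values and weights aligned
complete = toy.dropna()
national = np.average(complete["gdp_pc_pp"], weights=complete["pop"])

print(naive)     # nan
print(national)  # 175.0
```

Note that `dropna()` applied to the whole table removes a province as soon as any of its columns is missing, not only income or population.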

#### Access to savings, transfers

Some other model variables do not directly match a single column in the data.


In [165]:
#access to bank accounts: we use the same value for poor and nonpoor households
df["axfin_p"]=df["axfin_r"]=data_from_excel["%Savings Deposit 2012"]

#share of income from transfers: we use the same value for poor and nonpoor, and we sum two columns of the input data
df["social_p"]=df["social_r"]=data_from_excel[["% Other sources of income 2012","% Other receipts 2012"]].sum(axis=1)

df.head()

Unnamed: 0_level_0,gdp_pc_pp,pop,pov_head,shew,gdp_pc_pp_nat,rel_gdp_pp,share1,axfin_p,axfin_r,social_p,social_r
province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Abra,133.688,240135.244121,0.373595,0.953416,184.136685,0.726026,0.374005,0.693233,0.693233,0.408683,0.408683
Agusan Del Norte,179.014,661728.454375,0.346715,0.821277,184.136685,0.97218,0.279308,0.49688,0.49688,0.388003,0.388003
Agusan Del Sur,126.492,677779.682154,0.480785,0.821277,184.136685,0.686946,0.395282,0.475969,0.475969,0.388003,0.388003
Aklan,119.962,554414.442422,0.249662,0.823003,184.136685,0.651483,0.416799,0.660083,0.660083,0.432903,0.432903
Albay,158.629,1264097.894966,0.409587,0.8,184.136685,0.861474,0.315201,0.551314,0.551314,0.403794,0.403794
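
When summing columns as above, it is worth keeping in mind how pandas treats missing values. The sketch below, on made-up numbers with hypothetical column names, shows that `sum(axis=1)` counts missing values as zero by default, while the `min_count` argument keeps a row missing when all of its inputs are missing.

```python
import numpy as np
import pandas as pd

# Made-up shares; province C has no data in either column
toy = pd.DataFrame(
    {
        "other_income": [0.30, np.nan, np.nan],
        "other_receipts": [0.10, 0.05, np.nan],
    },
    index=["A", "B", "C"],
)

default_sum = toy.sum(axis=1)              # missing values count as 0
strict_sum = toy.sum(axis=1, min_count=1)  # all-missing rows stay NaN

print(pd.DataFrame({"default": default_sum, "min_count=1": strict_sum}))
```

With the default behaviour, a province without any data would silently get a transfer share of zero, so it would not show up as missing later on.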


# Loading data on exposure, hazard, and protection

### Exposure (population in flood-prone areas)

Exposure data comes from different files; for instance, they could be provided by DOST.

#### River floods

In [166]:
#Exposure to RIVER floods (at present from GLOFRIS)
pop_exposed = pd.read_csv("inputs/pop_exposed.csv",index_col=["NAME_1"])
pop_exposed.index=pop_exposed.index.str.title()
pop_exposed.head()

Unnamed: 0_level_0,rp10_pop,rp100_pop
NAME_1,Unnamed: 1_level_1,Unnamed: 2_level_1
Abra,0.1641,0.2977
Agusan Del Norte,0.318,0.344
Agusan Del Sur,0.1146,0.1531
Aklan,0.0,0.0
Albay,0.0,0.0


Note that some provinces (Aklan, Albay) are not exposed to river floods according to our data source. Also, the data we have here covers several return periods. The model can work with either a single return period or several return periods. The information on the different return periods is stored in a separate variable, `fa_ratios`.

First we define the exposure (`fa`, the Fraction of people Affected) as the exposure to the 10-year return period event.

In [167]:
expo_river = pop_exposed[["rp10_pop"]]
expo_river.columns=["fa"]
expo_river.head()

Unnamed: 0_level_0,fa
NAME_1,Unnamed: 1_level_1
Abra,0.1641
Agusan Del Norte,0.318
Agusan Del Sur,0.1146
Aklan,0.0
Albay,0.0


Then we define the exposure to events with other return periods relative to the exposure to the 10-year event.

In [168]:
fa_ratios_river = pop_exposed.div(expo_river.squeeze(),axis=0)
fa_ratios_river.columns=[10,100]
fa_ratios_river.columns.name="rp"
fa_ratios_river.index.name="province"
fa_ratios_river.head()

rp,10,100
province,Unnamed: 1_level_1,Unnamed: 2_level_1
Abra,1.0,1.814138
Agusan Del Norte,1.0,1.081761
Agusan Del Sur,1.0,1.335951
Aklan,,
Albay,,
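
The missing values for Aklan and Albay come from the division itself: a province with zero exposure to the 10-year event gives 0/0, which pandas reports as NaN. A minimal sketch on made-up numbers (`toy_exposed` is a hypothetical table):

```python
import pandas as pd

# Made-up exposures: one exposed province, one with no exposure at all
toy_exposed = pd.DataFrame(
    {"rp10_pop": [0.2, 0.0], "rp100_pop": [0.4, 0.0]},
    index=["Exposed province", "Unexposed province"],
)

# Divide every column by the 10-year exposure, row by row
ratios = toy_exposed.div(toy_exposed["rp10_pop"], axis=0)
print(ratios)
```

The exposed province gets a ratio of 1 for the 10-year event by construction, while the unexposed province has no defined ratio.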


#### Coastal floods

We now turn to exposure to coastal floods.

In [169]:
pd.read_csv("inputs/exposure_to_coastal_foods.csv").head() #inspecting the raw data

Unnamed: 0,id_1,pop_landscan,pop_flooded,fa_coast
0,Abra,279104,0.0,0.0
1,Agusan del Norte,704064,70048.438,0.099492
2,Agusan del Sur,798685,8112.8823,0.010158
3,Aklan,556987,61929.047,0.111186
4,Albay,1455600,13530.88,0.009296


In [170]:
expo_coast = pd.read_csv("inputs/exposure_to_coastal_foods.csv", usecols=["fa_coast", "id_1"],index_col="id_1", )
expo_coast.index = expo_coast.index.str.title() #matches case
expo_coast.columns=["fa"]
expo_coast.head()

Unnamed: 0_level_0,fa
id_1,Unnamed: 1_level_1
Abra,0.0
Agusan Del Norte,0.099492
Agusan Del Sur,0.010158
Aklan,0.111186
Albay,0.009296
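
Matching the case of province names matters because pandas aligns tables on exact labels. The sketch below, using two one-row series built for illustration, shows what happens when the same province is spelled "Agusan del Norte" in one source and "Agusan Del Norte" in the other.

```python
import pandas as pd

river = pd.Series([0.318], index=["Agusan Del Norte"], name="river")
coast = pd.Series([0.099492], index=["Agusan del Norte"], name="coast")

# Labels differ by one letter's case, so the province appears twice
misaligned = pd.concat([river, coast], axis=1)

# After title-casing, both sources use the same label and line up
coast.index = coast.index.str.title()
aligned = pd.concat([river, coast], axis=1)

print(misaligned)
print(aligned)
```

`str.title()` capitalizes every word, which is why "del" becomes "Del" throughout this notebook.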


For coastal floods, the data we have covers only one return period.

In [171]:
fa_ratios_coast = pd.DataFrame(1,index=expo_coast.index, columns=[10])
fa_ratios_coast.index.name = "province"
fa_ratios_coast.columns.name="rp"
fa_ratios_coast.head()

rp,10
province,Unnamed: 1_level_1
Abra,1
Agusan Del Norte,1
Agusan Del Sur,1
Aklan,1
Albay,1


#### Combining hazard and differentiating poor and nonpoor

Finally, we combine the data into a table indexed by province and hazard (the return periods are kept in `fa_ratios`). We store multi-hazard data in a separate DataFrame.


In [172]:
df_multihazard = pd.concat( [expo_coast,expo_river],keys=["coast", "river"],names=["hazard", "param"],axis=1)
df_multihazard.index.name = "province"
df_multihazard=df_multihazard.stack("hazard")
df_multihazard.head(20)

Unnamed: 0_level_0,param,fa
province,hazard,Unnamed: 2_level_1
Abra,coast,0.0
Abra,river,0.1641
Agusan Del Norte,coast,0.099492
Agusan Del Norte,river,0.318
Agusan Del Sur,coast,0.010158
Agusan Del Sur,river,0.1146
Aklan,coast,0.111186
Aklan,river,0.0
Albay,coast,0.009296
Albay,river,0.0


The model uses exposure for poor and nonpoor people separately.
Based on international case studies, poor people are about 15% more exposed to floods than nonpoor people.

In [173]:
dfm = pd.merge(
    df_multihazard.reset_index(), df["pov_head"].reset_index(), on="province")

pe=15/100
fa= dfm["fa"]
ph = dfm["pov_head"]

fap = fa*(1+pe)
far= (fa-ph*fap)/(1-ph)

dfm["fap"]=fap
dfm["far"]=far


dfm = dfm.set_index(["province","hazard"]).drop("fa",axis=1)
dfm.replace(0,np.nan,inplace=True) #treat no exposure as no data
dfm.head()


Unnamed: 0_level_0,param,pov_head,fap,far
province,hazard,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Abra,coast,0.373595,,
Abra,river,0.373595,0.188715,0.149419
Agusan Del Norte,coast,0.346715,0.114415,0.091571
Agusan Del Norte,river,0.346715,0.3657,0.292684
Agusan Del Sur,coast,0.480785,0.011681,0.008747


We store the `fa_ratios` separately.

In [195]:
fa_ratios = pd.concat([fa_ratios_river, fa_ratios_coast], keys=["river","coast"], names=["hazard"], axis=1).stack("hazard")
fa_ratios.head()

Unnamed: 0_level_0,rp,10,100
province,hazard,Unnamed: 2_level_1,Unnamed: 3_level_1
Abra,river,1,1.814138
Abra,coast,1,
Agusan Del Norte,river,1,1.081761
Agusan Del Norte,coast,1,
Agusan Del Sur,river,1,1.335951


In [175]:
fa_ratios.to_csv("fa_ratios.csv")

#### Files for single-hazard version of the model

In [176]:
fa_ratios_river.to_csv("fa_ratios_river.csv") #stored as an example of a single hazard with multiple return periods

In [177]:
# df["fa"] = expo_river["fa"] #just so that we can run the model with single hazard
dfmriver = dfm.query("hazard=='river'").reset_index("hazard")[["fap","far"]]
df["fap"]=dfmriver["fap"]
df["far"]=dfmriver["far"]

df.head()

Unnamed: 0_level_0,gdp_pc_pp,pop,pov_head,shew,gdp_pc_pp_nat,rel_gdp_pp,share1,axfin_p,axfin_r,social_p,social_r,fap,far
province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Abra,133.688,240135.244121,0.373595,0.953416,184.136685,0.726026,0.374005,0.693233,0.693233,0.408683,0.408683,0.188715,0.149419
Agusan Del Norte,179.014,661728.454375,0.346715,0.821277,184.136685,0.97218,0.279308,0.49688,0.49688,0.388003,0.388003,0.3657,0.292684
Agusan Del Sur,126.492,677779.682154,0.480785,0.821277,184.136685,0.686946,0.395282,0.475969,0.475969,0.388003,0.388003,0.13179,0.098682
Aklan,119.962,554414.442422,0.249662,0.823003,184.136685,0.651483,0.416799,0.660083,0.660083,0.432903,0.432903,,
Albay,158.629,1264097.894966,0.409587,0.8,184.136685,0.861474,0.315201,0.551314,0.551314,0.403794,0.403794,,


In [178]:
df.fap.dropna().shape

(37,)

### Vulnerability

To assess asset vulnerability in each province, we use census data on roof and wall types in each province.
We match these types to a given vulnerability with reduced vulnerability curves. Let us first open the files that match wall and roof types to vulnerability.

#### Reduced vulnerability curves for wall and roofs

In [179]:
#matches roof and wall types to vulnerabilities
roof_types_to_vuln =pd.read_csv("inputs/roof_types_to_vuln.csv").squeeze().sort_values(ascending=False)
wall_types_to_vuln =pd.read_csv("inputs/wall_types_to_vuln.csv").squeeze().sort_values(ascending=False)

print("Reduced vulnerability curve for roofs\n")
print(roof_types_to_vuln)
#print("\nReduced vulnerability curve for walls")
#print(wall_types_to_vuln)

Reduced vulnerability curve for roofs

Roof_% Salvaged/mixed but predominatly salvaged materials 2012    0.7
Roof_% Light/mixed but predominantly light materials 2012         0.4
Roof_% Strong/mixed but predominantly strong materials 2012       0.1
Name: 0, dtype: float64


#### Sorting roofs according to income

The data for **roof** types in each province come from the Excel file with socio-economic data that we used at the beginning.

In [180]:
share =data_from_excel[roof_types_to_vuln.index]
share.head()

Unnamed: 0_level_0,Roof_% Salvaged/mixed but predominatly salvaged materials 2012,Roof_% Light/mixed but predominantly light materials 2012,Roof_% Strong/mixed but predominantly strong materials 2012
province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abra,0.002667,0.050667,0.946667
Agusan Del Norte,0.007519,0.327068,0.665414
Agusan Del Sur,0.007519,0.327068,0.665414
Aklan,0.0131,0.183406,0.803493
Albay,0.008584,0.306438,0.684979


Then we assume that the poorest households in each province live in the houses with the lowest-quality roofs.

In [181]:
#sorts roof types according to income
p=(share.cumsum(axis=1).add(-df["pov_head"],axis=0)).clip(lower=0)
poor=(share-p).clip(lower=0)
rich=share-poor

print("Type of roofs for nonpoor households:")
rich.head()

Type of roofs for nonpoor households:


Unnamed: 0_level_0,Roof_% Salvaged/mixed but predominatly salvaged materials 2012,Roof_% Light/mixed but predominantly light materials 2012,Roof_% Strong/mixed but predominantly strong materials 2012
province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abra,0,0,0.626405
Agusan Del Norte,0,0,0.653285
Agusan Del Sur,0,0,0.519215
Aklan,0,0,0.750338
Albay,0,0,0.590413


Finally we average vulnerability across roof types.

In [182]:
#averages vulnerability across roof types
vp_roof=((poor*roof_types_to_vuln).sum(axis=1)/df["pov_head"] )
vr_roof=(rich*roof_types_to_vuln).sum(axis=1)/(1-df["pov_head"])

vp_roof.head()

province
Abra                0.144969
Agusan Del Norte    0.396011
Agusan Del Sur      0.313467
Aklan               0.351869
Albay               0.337023
dtype: float64
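
The sorting step is easier to follow on small numbers. The sketch below reproduces the same operations on a made-up table with two provinces and short, hypothetical roof-type names, ordered from most to least vulnerable as in the curve above.

```python
import pandas as pd

# Vulnerability by roof type, ordered from most to least vulnerable
vuln = pd.Series({"salvaged": 0.7, "light": 0.4, "strong": 0.1})

# Made-up housing shares and poverty headcounts for two provinces
share = pd.DataFrame(
    {"salvaged": [0.3, 0.1], "light": [0.5, 0.2], "strong": [0.2, 0.7]},
    index=["A", "B"],
)
pov_head = pd.Series({"A": 0.5, "B": 0.2})

# Part of each cumulative share that lies above the poverty headcount
above = share.cumsum(axis=1).sub(pov_head, axis=0).clip(lower=0)
poor = (share - above).clip(lower=0)  # roofs of the poorest pov_head share
rich = share - poor                   # roofs of everyone else

# Average vulnerability within each group
vp = (poor * vuln).sum(axis=1) / pov_head
vr = (rich * vuln).sum(axis=1) / (1 - pov_head)

print(poor)
print(vp, vr, sep="\n")
```

In province A, half of the population is poor: they occupy all the salvaged roofs (0.3) and part of the light ones (0.2), so `poor` sums to the poverty headcount in each row.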

#### Sorting walls according to income

Then we do the same for **walls**...

In [183]:
#sorts wall types according to income
share =data_from_excel[wall_types_to_vuln.keys()]
p=(share.cumsum(axis=1).add(-df["pov_head"],axis=0)).clip(lower=0)
poor=(share-p).clip(lower=0)
rich=share-poor

#walls
vp_wall=((poor*wall_types_to_vuln).sum(axis=1)/df["pov_head"] )
vr_wall=(rich*wall_types_to_vuln).sum(axis=1)/(1-df["pov_head"])


...and take the average value for roof and walls.

In [184]:
#averages value for roofs and walls
vp = (vp_roof+vp_wall)/2
vr = (vr_roof+vr_wall)/2

df["v_p"]=vp
df["v_r"]=vr

#plots
# import matplotlib.pyplot as plt
# %matplotlib inline
# vp.hist(), plt.xlabel("vp")
# vr.hist(color="red",alpha=0.5),plt.xlabel("vr")


### Hazard (protection)

We capture hazard through the protection level, expressed as a return period. Here we use data from FLOPROS as a placeholder.
FLOPROS uses a different spelling for some provinces, so we correct that here.

In [185]:
protection = pd.read_csv("inputs/protection_phl.csv", index_col="province").squeeze("columns").sort_index() #squeeze turns the one-column table into a Series
protection.index = protection.index.str.title()
protection.rename(index={"Cotabato":"North Cotabato",
                         'Mindoro Occidental':"Occidental Mindoro",
                         'Mindoro Oriental':"Oriental Mindoro",}, inplace=True) #(an alternative way would be to use and demonstrate the function replace_with_warning)
protection.head()

province
Abra                10.57
Agusan Del Norte     9.41
Agusan Del Sur       8.61
Aklan                0.00
Albay                0.00
Name: 0, dtype: float64

In [186]:
df["protection"]=protection.clip(lower=2) #assumes at least 2yr protection
df["protection"].head()

province
Abra                10.57
Agusan Del Norte     9.41
Agusan Del Sur       8.61
Aklan                2.00
Albay                2.00
Name: protection, dtype: float64

# Manually filling data gaps and informing parameters

Some data is missing and has to be added manually.

In [187]:
#average productivity of capital
df["avg_prod_k"] = .23

#Reconstruction time (can only be guessed ex-ante)
df["T_rebuild_K"] = 3

# how much early warning reduces vulnerability (eg reactivity to early warnings)
df["pi"] = 0.2

Some other inputs are normative or policy choices

In [188]:
#assumption on cross-provincial risk sharing
df["nat_buyout"] = 0.3

#scale up of transfers after a disaster hits
df["sigma_r"]=df["sigma_p"]=1/3

#income elasticity
df["income_elast"] = 1.5

#discount rate
df["rho"]=5/100

# Adding descriptions to the variable names

Here we add a human-readable description to all model variables, based on the descriptions gathered in [inputs/inputs_info.csv](inputs/inputs_info.csv).

In [189]:
description = pd.read_csv("inputs/inputs_info.csv", index_col="key")["descriptor"]
description.head()

key
avg_prod_k                    Productivity of capital
axfin_p             Access to finance for poor people
axfin_r         Access to finance for non-poor people
axhealth                        Access to health care
bashs         Births attended by skilled health staff
Name: descriptor, dtype: object

In [190]:
df.loc["description"]= description #adds the descriptions as an extra row (.ix is no longer available in recent pandas versions)
data=df.T.reset_index().set_index(["description","index"]).T
data.columns.names = ['description', 'variable']
data.head().T #displays the first few provinces, transposed for ease of reading.

Unnamed: 0_level_0,province,Abra,Agusan Del Norte,Agusan Del Sur,Aklan,Albay
description,variable,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Average income in the province,gdp_pc_pp,133.688,179.014,126.492,119.962,158.629
Population,pop,240135.0,661728.0,677780.0,554414.0,1264100.0
Poverty incidence,pov_head,0.373595,0.346715,0.480785,0.249662,0.409587
Access to early warning,shew,0.953416,0.821277,0.821277,0.823003,0.8
National GDP per capita,gdp_pc_pp_nat,184.137,184.137,184.137,184.137,184.137
Average income of the province,rel_gdp_pp,0.726026,0.97218,0.686946,0.651483,0.861474
Relative income of poor families,share1,0.374005,0.279308,0.395282,0.416799,0.315201
Access to finance for poor people,axfin_p,0.693233,0.49688,0.475969,0.660083,0.551314
Access to finance for non-poor people,axfin_r,0.693233,0.49688,0.475969,0.660083,0.551314
Social protection for poor people,social_p,0.408683,0.388003,0.388003,0.432903,0.403794
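
Adding a row of text to `df` turns its numeric columns into generic object columns. A possible alternative, sketched here on toy data and not the method used in this notebook, builds the two-level column header directly and leaves the numbers untouched. The names `df_toy`, `description_toy` and `data_toy` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df and for the descriptions read from inputs_info.csv
df_toy = pd.DataFrame(
    {"pop": [1000.0, 2500.0], "shew": [0.9, 0.8]},
    index=pd.Index(["A", "B"], name="province"),
)
description_toy = pd.Series(
    {"pop": "Population", "shew": "Access to early warning"}
)

# Build (description, variable) column labels without touching the values
data_toy = df_toy.copy()
data_toy.columns = pd.MultiIndex.from_arrays(
    [description_toy.reindex(df_toy.columns).to_numpy(), df_toy.columns],
    names=["description", "variable"],
)

print(data_toy.T)
```

A variable without an entry in the description table would get a missing description in the first level rather than raising an error.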


# Saves the data

In [191]:
#saves the data
data.to_csv("all_data_compiled.csv")
data.to_excel("all_data_compiled.xlsx")
data.columns

           labels=[[0, 16, 17, 3, 15, 1, 19, 5, 4, 22, 21, 14, 13, 7, 6, 2, 18, 23, 8, 20, 10, 11, 12, 9], [6, 11, 12, 17, 7, 14, 16, 2, 3, 20, 21, 4, 5, 22, 23, 13, 1, 0, 10, 9, 19, 18, 8, 15]],
           names=['description', 'variable'])

**That's it, we have built an Excel file with all our data!**
To see how to use this data with the resilience model, go to [socio_economic_capacity_demo.ipynb](socio_economic_capacity_demo.ipynb).



 

## Multi-hazard data

In [192]:
dfm.to_csv("multi_hazard_data.csv")

# Report missing data by province

This code builds a table reporting the missing data points for each province.

In [193]:
def write_missing_data(s):
    which = s[s.isnull()].index.values
    return ", ".join(which)

def count_missing_data(s):
    return s.isnull().sum()

report = pd.DataFrame()

report["nb_missing"]=df.apply(count_missing_data,axis=1)  
report["missing_data"]=df.apply(write_missing_data,axis=1)

report = report.loc[report["nb_missing"]>0,:].sort_values(by="nb_missing") #keeps provinces with missing data, fewest gaps first
report.to_csv("missing_data_report.csv")

report.head()

Unnamed: 0_level_0,nb_missing,missing_data
province,Unnamed: 1_level_1,Unnamed: 2_level_1
Misamis Occidental,1,protection
Negros Oriental,1,protection
Aklan,2,"fap, far"
Marinduque,2,"fap, far"
Masbate,2,"fap, far"


We see that for two provinces, we have no data on protection. Let us inspect the data on protection.

In [194]:
protection.loc[["Misamis Occidental", "Negros Oriental"]]

province
Misamis Occidental   NaN
Negros Oriental      NaN
Name: 0, dtype: float64

In our data on protection, these two provinces have a missing value (NaN). This problem should be investigated by going back to the source used for protection (here FLOPROS, as a placeholder, which could be replaced by a domestic source, for instance DOST).
