<h1>Chapter 2 | Data Exercise #4 | <code>world-bank-immunization</code>: Data Preparation</h1>

<p>4. Consider the <code>world-bank-immunization</code> dataset based on World Bank data.</p>
<p>Assignments</p>
<ul>
    <li>Generate a new variable for the growth rate of GDP per capita, in percentage terms.</li>
</ul>

$(gdppc_{it}/gdppc_{i,t-1} - 1)*100$

<ul>
    <li>Add this variable to the long and wide format of the data.</li>
    <li>Add the variable to Tables 2.4 and 2.5.</li>
</ul>

<h2><b>1.</b> Load the data</h2>

In [37]:
import os
import sys
import warnings
import pandas as pd

warnings.filterwarnings("ignore")

In [38]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_data_exercises")[0]

# Get location folders
data_in = f"{dirname}da_data_repo/worldbank-immunization/clean/"
data_out = f"{dirname}da_data_exercises/ch02-preparing_data_for_analysis/data"
output = f"{dirname}da_data_exercises/ch02-preparing_data_for_analysis/data/output/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [39]:
# Import the prewritten helper functions
from py_helper_functions import *

In [40]:
data = pd.read_csv(f"{data_in}worldbank-immunization-panel.csv")

<p>Let's take a brief look at the data:</p>

In [41]:
data.head()

Unnamed: 0,year,c,countryname,countrycode,pop,mort,surv,imm,gdppc,lngdppc,hexp
0,1998,ARG,Argentina,ARG,36.063459,21.5,97.85,95,15973.067259,9.678659,
1,1998,AUS,Australia,AUS,18.711,6.4,99.36,82,33174.49984,10.409537,
2,1998,BRA,Brazil,BRA,169.78525,39.5,96.05,95,11193.381175,9.323078,
3,1998,CHN,China,CHN,1241.935,41.4,95.86,83,3211.623211,8.074532,
4,1998,FRA,France,FRA,60.186288,5.6,99.44,82,32679.688245,10.394509,


In [42]:
data.tail()

Unnamed: 0,year,c,countryname,countrycode,pop,mort,surv,imm,gdppc,lngdppc,hexp
3802,2017,VEN,"Venezuela, RB",VEN,29.390409,30.9,96.91,96,,,
3803,2017,VNM,Vietnam,VNM,94.596642,20.9,97.91,97,6233.485045,8.737691,
3804,2017,YEM,"Yemen, Rep.",YEM,27.834821,55.4,94.46,65,2404.42237,7.785065,
3805,2017,ZMB,Zambia,ZMB,16.853688,60.0,94.0,96,3717.667166,8.220852,
3806,2017,ZWE,Zimbabwe,ZWE,14.236745,50.3,94.97,90,2568.410072,7.851042,


In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3807 entries, 0 to 3806
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   year         3807 non-null   int64  
 1   c            3807 non-null   object 
 2   countryname  3807 non-null   object 
 3   countrycode  3807 non-null   object 
 4   pop          3807 non-null   float64
 5   mort         3807 non-null   float64
 6   surv         3807 non-null   float64
 7   imm          3807 non-null   int64  
 8   gdppc        3642 non-null   float64
 9   lngdppc      3642 non-null   float64
 10  hexp         3165 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 327.3+ KB


In [44]:
data.shape

(3807, 11)

<h3>1.1. <b>Task #1 </b>| Generate a new variable for the growth rate of GDP per capita, in percentage terms</h3>
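<p>Basically, for each country we have to:</p>
<ul>
    <li>Divide <code>gdppc</code> by the value registered in the previous year</li>
    <li>Subtract 1 to get the relative change</li>
    <li>Multiply by 100 to express it in percentage terms</li>
</ul>
<p>We can use <code>pct_change()</code> to get the relative change from one year to the next. It returns a fraction, so 0.05 stands for 5%:</p>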

In [45]:
# Sort the DataFrame by country code and year
data = data.sort_values(["countrycode", "year"]).reset_index(drop=True)

In [46]:
# Calculate the gdppc growth using pct_change (returned as a fraction, not a percentage)
data["gdppc_growth"] = data.groupby("countrycode")["gdppc"].pct_change()

In [47]:
data.head(20)

Unnamed: 0,year,c,countryname,countrycode,pop,mort,surv,imm,gdppc,lngdppc,hexp,gdppc_growth
0,1998,AFG,Afghanistan,AFG,19.737765,135.8,86.42,31,,,,
1,1999,AFG,Afghanistan,AFG,20.170844,132.3,86.77,31,,,,
2,2000,AFG,Afghanistan,AFG,20.779953,128.8,87.12,27,,,,
3,2001,AFG,Afghanistan,AFG,21.606988,125.3,87.47,37,,,,
4,2002,AFG,Afghanistan,AFG,22.60077,121.7,87.83,35,1016.245409,6.92387,9.443391,
5,2003,AFG,Afghanistan,AFG,23.680871,117.9,88.21,39,1055.557459,6.961824,8.941259,0.038684
6,2004,AFG,Afghanistan,AFG,24.726684,114.1,88.59,48,1025.208245,6.932651,9.808473,-0.028752
7,2005,AFG,Afghanistan,AFG,25.654277,110.1,88.99,50,1099.104568,7.002251,9.948289,0.072079
8,2006,AFG,Afghanistan,AFG,26.433049,106.1,89.39,53,1123.871323,7.024535,10.622766,0.022534
9,2007,AFG,Afghanistan,AFG,27.100536,102.0,89.8,55,1247.753118,7.1291,9.904674,0.110228
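
<p>Note that the values above are fractions: 0.038684 for Afghanistan in 2003 means growth of about 3.87%. The following is a minimal sketch on a made-up two-country panel (the numbers and the column name <code>gdppc_growth_pct</code> are assumptions for illustration, not part of the World Bank data) showing how the result could be scaled to percentage terms:</p>

```python
import pandas as pd

# Toy panel with made-up numbers (not World Bank data)
toy = pd.DataFrame({
    "countrycode": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdppc": [100.0, 110.0, 99.0, 200.0, 210.0, 231.0],
})

# Sort so that each country's rows are in chronological order
toy = toy.sort_values(["countrycode", "year"]).reset_index(drop=True)

# pct_change() gives a fraction; multiply by 100 for percentage terms
toy["gdppc_growth_pct"] = toy.groupby("countrycode")["gdppc"].pct_change() * 100

print(toy)
```

<p>The first year of each country has no previous value, so its growth rate is missing.</p>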


In [48]:
data[["year","countryname","gdppc", "gdppc_growth"]].head(30)

Unnamed: 0,year,countryname,gdppc,gdppc_growth
0,1998,Afghanistan,,
1,1999,Afghanistan,,
2,2000,Afghanistan,,
3,2001,Afghanistan,,
4,2002,Afghanistan,1016.245409,
5,2003,Afghanistan,1055.557459,0.038684
6,2004,Afghanistan,1025.208245,-0.028752
7,2005,Afghanistan,1099.104568,0.072079
8,2006,Afghanistan,1123.871323,0.022534
9,2007,Afghanistan,1247.753118,0.110228


<p>Great! Likewise, we can use <code>.shift()</code> to get the previous year's <code>gdppc</code>, divide the current year's <code>gdppc</code> by the shifted value, and subtract 1 to get the relative change. Multiplying that by 100 would express it as a percentage. Let's try it in a new variable:</p>

In [61]:
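# group_keys=False keeps the original row index, so the result aligns with data on newer pandas versions
data["gdppc_pct_change"] = data.groupby("countrycode", group_keys=False)["gdppc"].apply(lambda x: x / x.shift(1) - 1)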

In [62]:
data.head(10)

Unnamed: 0,year,c,countryname,countrycode,pop,mort,surv,imm,gdppc,lngdppc,hexp,gdppc_growth,gdppc_pct_change
0,1998,AFG,Afghanistan,AFG,19.737765,135.8,86.42,31,,,,,
1,1999,AFG,Afghanistan,AFG,20.170844,132.3,86.77,31,,,,,
2,2000,AFG,Afghanistan,AFG,20.779953,128.8,87.12,27,,,,,
3,2001,AFG,Afghanistan,AFG,21.606988,125.3,87.47,37,,,,,
4,2002,AFG,Afghanistan,AFG,22.60077,121.7,87.83,35,1016.245409,6.92387,9.443391,,
5,2003,AFG,Afghanistan,AFG,23.680871,117.9,88.21,39,1055.557459,6.961824,8.941259,0.038684,0.038684
6,2004,AFG,Afghanistan,AFG,24.726684,114.1,88.59,48,1025.208245,6.932651,9.808473,-0.028752,-0.028752
7,2005,AFG,Afghanistan,AFG,25.654277,110.1,88.99,50,1099.104568,7.002251,9.948289,0.072079,0.072079
8,2006,AFG,Afghanistan,AFG,26.433049,106.1,89.39,53,1123.871323,7.024535,10.622766,0.022534,0.022534
9,2007,AFG,Afghanistan,AFG,27.100536,102.0,89.8,55,1247.753118,7.1291,9.904674,0.110228,0.110228


<p>It worked like a charm! In this case, we grouped the DataFrame by <code>countrycode</code> and applied a lambda function to <code>gdppc</code>.</p>
<p>As a reminder, <code>lambda</code> is a keyword used to define an <b>anonymous function</b> that can take any number of arguments but can have only <b>one expression</b>. Here, the function takes a Pandas Series as its <code>x</code> argument, that is, the <code>gdppc</code> values of one country, and returns a new Series built from the following operations:</p>
<ol>
    <li><code>x.shift(1)</code> shifts the values in the Series <code>x</code> down by <b>one row</b>, effectively aligning the current row with the previous row.</li>
    <li><code>x / x.shift(1) - 1</code> calculates the relative change between the current row and the previous row.</li>
    <li><code>(x / x.shift(1) - 1) * 100</code> multiplies the result by 100 to express the change as a percentage growth rate. The lambda above leaves out this step, so its values are fractions.</li>
</ol>
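
<p>The steps listed above can be traced one at a time on a single Series. This is only a sketch with made-up values for one hypothetical country; the names <code>previous</code>, <code>change</code> and <code>growth_pct</code> are chosen for illustration:</p>

```python
import pandas as pd

# Made-up values for a single country, indexed by year in chronological order
x = pd.Series([100.0, 110.0, 99.0], index=[2000, 2001, 2002])

previous = x.shift(1)        # previous year's value; the first entry becomes NaN
change = x / previous - 1    # relative change as a fraction
growth_pct = change * 100    # the same change in percentage terms

# Put the intermediate results side by side for inspection
steps = pd.DataFrame({
    "x": x,
    "previous": previous,
    "change": change,
    "growth_pct": growth_pct,
})
print(steps)
```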
<p>Finally, we passed this anonymous function to <code>.apply()</code> on the grouped <code>gdppc</code> Series, which computes the growth rate of GDP per capita for each country and year.</p>
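
<p>The same calculation can also be written without <code>.apply()</code>. The sketch below, on made-up data with hypothetical column names, compares three variants that should agree: the built-in <code>pct_change()</code>, a division by the group-wise <code>.shift()</code>, and the lambda passed to <code>.transform()</code>, which returns a result aligned with the original rows:</p>

```python
import pandas as pd

# Toy panel with made-up numbers, already sorted by country code and year
toy = pd.DataFrame({
    "countrycode": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdppc": [100.0, 110.0, 99.0, 200.0, 210.0, 231.0],
})

grouped = toy.groupby("countrycode")["gdppc"]

# Variant 1: built-in relative change within each country, scaled to percent
toy["growth_pct_change"] = grouped.pct_change() * 100

# Variant 2: divide by the value shifted within each country, no lambda needed
toy["growth_shift"] = (toy["gdppc"] / grouped.shift(1) - 1) * 100

# Variant 3: transform runs the lambda per country and keeps the original row index
toy["growth_transform"] = grouped.transform(lambda x: x / x.shift(1) - 1) * 100

print(toy)
```

<p>All three variants work on row positions, so they treat neighboring rows as consecutive years. If a country had missing years in the data, the computed value would span more than one year.</p>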