# Wrangling World Indicators

The data wrangled here is the World Bank's [World Development Indicators dataset](https://datacatalog.worldbank.org/dataset/world-development-indicators).

We are going to create a smaller dataset in long format.

Let us load in our data.

In [15]:
import pandas as pd
df = pd.read_csv('WDI_Data.csv', encoding='UTF-8')

In [2]:
df.shape

(311829, 65)

In [3]:
df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFG,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,22.33,24.08,26.17,27.99,30.1,32.44,,,,
1,Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,43.222019,69.1,68.933266,89.5,71.5,97.7,97.7,98.713203,,
2,Afghanistan,AFG,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,29.572881,60.849157,61.282199,86.500512,64.573354,97.09936,97.091973,98.272872,,
3,Afghanistan,AFG,"Access to electricity, urban (% of urban popul...",EG.ELC.ACCS.UR.ZS,,,,,,,...,86.567779,95.0,92.673767,98.7,92.5,99.5,99.5,100.0,,
4,Afghanistan,AFG,Account ownership at a financial institution o...,FX.OWN.TOTL.ZS,,,,,,,...,9.005013,,,9.961,,,14.893312,,,


Pick out the countries we want in our dataset.

In [4]:
# Keep only the rows whose country is in the list
countries = ['France', 'Italy', 'Germany', 'United Kingdom']
df_1 = df[df['Country Name'].isin(countries)]

Pick out the variables.

In [5]:
# Keep only the rows whose indicator is in the list
indicators = [
    'GDP per capita (current US$)',
    'Imports of goods and services (current US$)',
    'Land area (sq. km)',
    'Life expectancy at birth, total (years)',
    'Population in largest city',
    'Population growth (annual %)',
    'Primary education, duration (years)',
    'Progression to secondary school (%)',
    'Rural population (% of total population)',
    'Access to electricity (% of population)',
    'Population, total',
]
df_2 = df_1[df_1['Indicator Name'].isin(indicators)]

Generate the year labels (1960 to 2004) that match the year column names.

In [6]:
# The year columns are named with strings, so build '1960' .. '2004' directly
dates = [str(year) for year in range(1960, 2005)]

Reshape the data to long format, which works well with Seaborn.

In [7]:
df_3 = pd.melt(df_2,  
               id_vars = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],
               value_vars = dates,
               var_name = 'year',
               value_name = 'value')
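
To see what the reshape does on a small scale, here is a minimal sketch on a made-up table. The values and the names `wide` and `long_df` are illustrative only and are not taken from the World Indicators data.

```python
import pandas as pd

# Toy wide table (made-up values): one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Indicator Name": ["Population, total", "Population, total"],
    "1960": [10.0, 20.0],
    "1961": [11.0, 21.0],
})

# melt stacks the year columns into two columns: 'year' and 'value'.
long_df = pd.melt(
    wide,
    id_vars=["Country Name", "Indicator Name"],
    value_vars=["1960", "1961"],
    var_name="year",
    value_name="value",
)
print(long_df)
```

Each (country, indicator, year) combination ends up on its own row, so two rows with two year columns become four rows.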

In [8]:
df_3

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,year,value
0,France,FRA,Access to electricity (% of population),EG.ELC.ACCS.ZS,1960,
1,France,FRA,GDP per capita (current US$),NY.GDP.PCAP.CD,1960,1.334690e+03
2,France,FRA,Imports of goods and services (current US$),NE.IMP.GNFS.CD,1960,7.703449e+09
3,France,FRA,Land area (sq. km),AG.LND.TOTL.K2,1960,
4,France,FRA,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,1960,6.986829e+01
...,...,...,...,...,...,...
1975,United Kingdom,GBR,Population in largest city,EN.URB.LCTY,2004,7.456170e+06
1976,United Kingdom,GBR,"Population, total",SP.POP.TOTL,2004,5.998790e+07
1977,United Kingdom,GBR,"Primary education, duration (years)",SE.PRM.DURS,2004,6.000000e+00
1978,United Kingdom,GBR,Progression to secondary school (%),SE.SEC.PROG.ZS,2004,


Write out the long-format file.

In [9]:
df_3.to_csv('world_indicators_long.csv')
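
By default, to_csv also writes the row index as an unnamed first column. The sketch below uses a made-up two-row table and in-memory buffers (both assumptions for illustration) to show how that column appears when the file is read back, and how index=False avoids it.

```python
import io
import pandas as pd

# Made-up data, written to in-memory buffers instead of files.
toy = pd.DataFrame({"year": ["1960", "1961"], "value": [1.0, 2.0]})

# Default: the index is written as a first column with an empty header.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# index=False: only the data columns are written.
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without_index = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index is a matter of preference; it only matters that whoever reads the file back knows which choice was made.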

To make things a little easier in pandas, we can convert the long format to wide using a pivot table, and then turn the result back into a plain dataframe (see [this link](https://stackoverflow.com/questions/42708193/pandas-pivot-table-to-data-frame/42708606)).

In [10]:
df_pivot = df_3.pivot_table(index=['Country Name', 'year'], columns = 'Indicator Name', values = ['value'])
df_4 = pd.DataFrame(df_pivot.to_records())

# clean up column names
df_4.columns = [hdr.replace("('value', ", "").replace("')", "").replace("'", "") \
                     for hdr in df_4.columns]

# remove spaces from variable names
df_4.columns = [c.replace(' ', '_') for c in df_4.columns]
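
As an alternative sketch (not the method used above), passing values as a plain string rather than a one-element list gives flat column names, so reset_index can replace the string clean-up. The toy data and the names `long_df` and `wide_df` are made up for illustration.

```python
import pandas as pd

# Made-up long-format data with the same column layout as above.
long_df = pd.DataFrame({
    "Country Name": ["A", "A", "A", "A"],
    "year": ["1960", "1960", "1961", "1961"],
    "Indicator Name": ["Land area (sq. km)", "Population, total"] * 2,
    "value": [5.0, 10.0, 5.0, 11.0],
})

# values='value' (a string) yields one column level, named after each indicator.
wide_df = long_df.pivot_table(
    index=["Country Name", "year"],
    columns="Indicator Name",
    values="value",
).reset_index()

# Remove spaces from variable names, as in the cell above.
wide_df.columns = [c.replace(" ", "_") for c in wide_df.columns]
print(wide_df)
```

Note that pivot_table averages duplicate entries by default, which makes no difference here because each country, year and indicator combination appears once.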

In [11]:
df_4

Unnamed: 0,Country_Name,year,Access_to_electricity_(%_of_population),GDP_per_capita_(current_US$),Imports_of_goods_and_services_(current_US$),Land_area_(sq._km),"Life_expectancy_at_birth,_total_(years)",Population_growth_(annual_%),Population_in_largest_city,"Population,_total","Primary_education,_duration_(years)",Progression_to_secondary_school_(%),Rural_population_(%_of_total_population)
0,France,1960,,1334.690056,7.703449e+09,,69.868293,,7410735.0,46621669.0,,,38.120
1,France,1961,,1428.045487,8.273435e+09,547566.0156,70.117073,1.318705,7539888.0,47240543.0,,,37.393
2,France,1962,,1578.284604,9.042717e+09,547566.0156,70.314634,1.396483,7650751.0,47904877.0,,,36.511
3,France,1963,,1744.640590,1.036737e+10,547566.0156,70.514634,1.404835,7718298.0,48582611.0,,,35.298
4,France,1964,,1909.541232,1.213712e+10,547566.0156,70.663415,1.324961,7786535.0,49230595.0,,,34.102
...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,United Kingdom,2000,100.0,28149.870010,4.465000e+11,241930.0000,77.741463,0.357301,7272819.0,58892514.0,6.0,,21.349
176,United Kingdom,2001,100.0,27744.506460,4.467580e+11,241930.0000,77.992683,0.384976,7322400.0,59119673.0,6.0,,21.249
177,United Kingdom,2002,100.0,30056.586220,4.762260e+11,241930.0000,78.143902,0.423337,7366701.0,59370479.0,6.0,,20.953
178,United Kingdom,2003,100.0,34419.147910,5.349980e+11,241930.0000,78.446341,0.465641,7411269.0,59647577.0,6.0,,20.661


Write out the wide-format file for use with pandas.

In [12]:
df_4.to_csv('world_indicators_pandas.csv', index=False)

# Not time series

We can also look at a single year, here 2000, instead of a time series.

In [16]:
df_year = df[['Country Name', 'Country Code', 'Indicator Name', '2000']]

In [22]:
# The dataset labels the country 'United States'; 'USA' matches no rows, so it is left out
countries_2000 = [
    'France', 'Italy', 'Germany', 'Japan', 'China', 'Spain',
    'Afghanistan', 'Chile', 'Norway', 'United States', 'United Kingdom',
]
df_year = df_year[df_year['Country Name'].isin(countries_2000)]
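
A label that does not appear in the data matches nothing and raises no error, so it is easy to miss. One possible sanity check is sketched below; the toy table and the names `wanted`, `present` and `missing` are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the 'Country Name' column.
toy = pd.DataFrame({"Country Name": ["France", "United States", "Chile"]})

# Labels we intend to filter on; 'USA' is deliberately not in the data.
wanted = ["France", "USA", "United States"]

# Compare the requested labels with those actually present.
present = set(toy["Country Name"].unique())
missing = [name for name in wanted if name not in present]
print(missing)
```

An empty list means every requested label exists in the data; anything else points to a spelling or naming mismatch.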

In [23]:
df_year

Unnamed: 0,Country Name,Country Code,Indicator Name,2000
58917,China,CHN,Access to clean fuels and technologies for coo...,46.780000
58918,China,CHN,Access to electricity (% of population),96.907104
58919,China,CHN,"Access to electricity, rural (% of rural popul...",95.176621
58920,China,CHN,"Access to electricity, urban (% of urban popul...",100.000000
58921,China,CHN,Account ownership at a financial institution o...,
...,...,...,...,...
296017,United Kingdom,GBR,Women who believe a husband is justified in be...,
296018,United Kingdom,GBR,Women who believe a husband is justified in be...,
296019,United Kingdom,GBR,Women who were first married by age 15 (% of w...,
296020,United Kingdom,GBR,Women who were first married by age 18 (% of w...,


In [24]:
df_year.to_csv('world_indicators_2000.csv')