This notebook describes the steps followed to build an analysis-ready database from World Bank indicators (https://data.worldbank.org/indicator/). The same steps can be reused to add or remove indicators, and at the end the table can be restricted to a chosen list of countries.

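The database built here starts from 7 World Bank indicators (GDP, GDP per capita, gross capital formation as a % of GDP, unemployment rate, exports as a % of GDP, population and labor force) and derives 3 more from them (exports in dollars, exports per capita and exports per labor unit). After cleaning, complete data are available from 2000 to 2018 for 145 countries.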

# Code

We start by importing the required modules and defining a few helper functions for manipulating the dataframes, switching between country codes and country names.

In [88]:
import pandas as pd
import re
import numpy as np

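def refresh_dico(Name,Code,df):
    # Map each value of column Name to the matching value of column Code (distinct, non-missing pairs)
    dico = {}
    for i in df[[Name,Code]].dropna().drop_duplicates().iterrows() :
        dico[i[1][Name]] = i[1][Code]
    return (dico)

def rech_ligne_pays (y,df) :
    # Index labels of country y (a country name); relies on the global dic_pays defined further down
    expression = re.compile(re.escape(dic_pays[y])+".*?;")
    return(list(i[:-1] for i in (expression.findall(';'.join(df.index)+';'))))

def rech_ligne_indic (y,df):
    # Index labels of indicator y; re.escape keeps '(', ')' and '$' in the name literal
    expression = re.compile(".{,4}"+re.escape(y)+";")
    return(list(i[:-1] for i in (expression.findall(';'.join(df.index)+';'))))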
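
As a cross-check on the regex helpers above, the same lookups can be sketched with pandas string methods on the index. This is only an illustration on a toy frame: `rows_for_country` and `rows_for_indicator` are hypothetical names that do not appear in the notebook, and they assume index labels of the form `CODE_Indicator name`.

```python
import pandas as pd

# Toy frame indexed like df1: "<country code>_<indicator name>"
toy = pd.DataFrame(
    {"2000": [1.0, 2.0, 3.0, 4.0]},
    index=[
        "FRA_GDP (current million US$)",
        "FRA_Population, total",
        "AGO_GDP (current million US$)",
        "AGO_Population, total",
    ],
)

def rows_for_country(code, df):
    """Hypothetical helper: index labels starting with '<code>_'."""
    return list(df.index[df.index.str.startswith(code + "_")])

def rows_for_indicator(indicator, df):
    """Hypothetical helper: index labels ending with '_<indicator>'."""
    return list(df.index[df.index.str.endswith("_" + indicator)])

print(rows_for_country("FRA", toy))
print(rows_for_indicator("Population, total", toy))
```

Plain prefix and suffix tests avoid regex special characters altogether, which matters here because indicator names contain parentheses and `$`.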

We then import the CSV files downloaded from the World Bank website, which is open data (https://data.worldbank.org/indicator/).

Caution: to load these files cleanly, first open each CSV in a text editor (Notepad or any other program that can open it) and delete the first two lines, which are roughly the document's title and make pd.read_csv fail.
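
The manual deletion described above can probably be avoided with the `skiprows` argument of `pd.read_csv`. The sketch below uses a synthetic stand-in for a World Bank export. The number of metadata lines (4 here) and the contents of those lines are assumptions, so check them against the actual file before relying on this.

```python
import io
import pandas as pd

# Synthetic stand-in for a World Bank export: metadata lines, then the real header.
# The layout and the number of metadata lines are assumptions for illustration.
raw = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","YYYY-MM-DD",\n'
    '\n'
    '"Country Name","Country Code","Indicator Name","Indicator Code","2000","2001",\n'
    '"Aruba","ABW","Population, total","SP.POP.TOTL","90000","91000",\n'
)

# skiprows=4 skips the first four physical lines before the header is parsed
demo = pd.read_csv(io.StringIO(raw), skiprows=4)
print(demo.columns.tolist())
```

The trailing comma on each line yields an extra empty `Unnamed: N` column, which is the kind of column the notebook later drops.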

In [89]:
#Import the csv files and tag each one with a short indicator label
gdp = pd.read_csv('pib.csv')
gdp['Simple Indicator'] = 'gdp'

for i in range(1960, 2021):
    gdp[str(i)] = gdp[str(i)] / 1000000
gdp['Indicator Name'] = 'GDP (current million US$)'

gdp_hab = pd.read_csv('pib_hab.csv')
gdp_hab['Simple Indicator'] = 'gdb_hab'

fbcf = pd.read_csv('FBCF.csv')
fbcf['Simple Indicator'] = 'FBCF'

exports_rate = pd.read_csv('exports_rate.csv')
exports_rate['Simple Indicator'] = 'exports_rate'

unemployment_rate = pd.read_csv('unemployment_rate.csv')
unemployment_rate['Simple Indicator'] = 'unemployment rate'

pop = pd.read_csv('pop.csv')
pop['Simple Indicator'] = 'pop'

labor_force = pd.read_csv('labor force.csv')
labor_force ['Simple Indicator'] = 'labor_force'

In [90]:
#Build the aggregated database
df_list = [gdp, gdp_hab, fbcf, unemployment_rate, exports_rate, pop, labor_force]

for df in df_list :
    df['Pays_indic'] = df['Country Code'].str[:3] + '_' + df['Indicator Name'].str[:99]
df1 = pd.concat(df_list)
df1 = df1.sort_values('Pays_indic').set_index('Pays_indic')
df1.head(3)

Pays_indic,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65,Simple Indicator
ABW_Exports of goods and services (% of GDP),Aruba,ABW,Exports of goods and services (% of GDP),NE.EXP.GNFS.ZS,,,,,,,...,76.509512,77.555556,73.51703,71.294029,73.332115,,,,,exports_rate
ABW_GDP (current million US$),Aruba,ABW,GDP (current million US$),NY.GDP.MKTP.CD,,,,,,,...,2701.675978,2765.363128,2919.553073,2965.921788,3056.424581,,,,,gdp
ABW_GDP per capita (current US$),Aruba,ABW,GDP per capita (current US$),NY.GDP.PCAP.CD,,,,,,,...,26189.435509,26647.938101,27980.880695,28281.350482,29007.693003,,,,,gdb_hab


We first drop the columns we do not need: the indicator code, the years outside the study period, and the last column (Unnamed: 65), an empty column produced when these files are loaded with pd.read_csv.

We then do the same with the rows, dropping all the country aggregates that are not countries in their own right. The list below also drops a few individual economies (Central African Republic, West Bank and Gaza, Hong Kong SAR and Macao SAR).

In [91]:
#REMOVE UNNEEDED ROWS AND COLUMNS
dic_pays = refresh_dico('Country Name','Country Code',df1)


useless_data_list = ['Unnamed: 65', 'Indicator Code', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', 
                   '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', 
                   '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994',
                   '1995', '1996', '1997', '1998', '1999', '2019', '2020']
df1 = df1.drop(useless_data_list, axis = 1)


suppr_liste = ['Arab World', 'Central Europe and the Baltics',
 'Caribbean small states', 'East Asia & Pacific (excluding high income)', 
  'Early-demographic dividend', 
 'East Asia & Pacific',
 'Europe & Central Asia (excluding high income)', 
 'Europe & Central Asia',  
 'European Union','Fragile and conflict affected situations',
 'High income',
 'Heavily indebted poor countries (HIPC)','IBRD only',
 'IDA & IBRD total',
 'IDA total',
 'IDA blend',
 'IDA only',
 'Latin America & Caribbean (excluding high income)',
 'Latin America & Caribbean',
 'Least developed countries: UN classification',
 'Low income',
 'Lower middle income',
 'Low & middle income',
 'Late-demographic dividend',
 'Middle East & North Africa',
 'Middle income',
 'Middle East & North Africa (excluding high income)',
 'North America',
 'OECD members',
 'Other small states',
 'Pre-demographic dividend',
 'Pacific island small states',
 'Post-demographic dividend',
 'South Asia',
 'Sub-Saharan Africa (excluding high income)',
 'Sub-Saharan Africa',
 'Small states',
 'East Asia & Pacific (IDA & IBRD countries)',
 'Europe & Central Asia (IDA & IBRD countries)',
 'Latin America & the Caribbean (IDA & IBRD countries)',
 'Middle East & North Africa (IDA & IBRD countries)',
 'South Asia (IDA & IBRD)',
 'Sub-Saharan Africa (IDA & IBRD countries)',
 'Upper middle income',
 'World',
 'Central African Republic',
 'West Bank and Gaza',
 'Hong Kong SAR, China',
 'Macao SAR, China']
for i in suppr_liste :
    df1 = df1.drop(labels = rech_ligne_pays(i,df1)) 
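
The list of year columns to drop is typed out by hand above. As a possible shortcut, it can be generated from the study window. The sketch below works on a toy frame; `first_year`, `last_year` and `outside_window` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy frame with the same kind of columns as the World Bank files
years = [str(y) for y in range(1960, 2021)]
toy = pd.DataFrame([[0.0] * len(years)], columns=years)
toy.insert(0, "Indicator Code", "NY.GDP.MKTP.CD")

first_year, last_year = 2000, 2018   # study window used in this notebook
outside_window = [y for y in years if not first_year <= int(y) <= last_year]

# Drop the indicator code and every year outside the window in one call
trimmed = toy.drop(columns=["Indicator Code"] + outside_window)
print(trimmed.columns[0], trimmed.columns[-1], trimmed.shape[1])
```

Changing the study period then only requires editing the two bounds.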

We now remove every country for which at least one indicator is missing. This is done in two steps:

(1) Drop every row with at least one missing value in any year.
(2) Drop all remaining rows of every country that lost at least one row in step 1.

In [92]:
#Step 1
df1 = df1.dropna(axis = 0)
df1.head(7)

#Step 2 
dic_code = refresh_dico('Country Code','Country Name',df1)

for i in dic_pays.keys() :
    if len(rech_ligne_pays(i,df1)) < len(df_list) :
        df1 = df1.drop(rech_ligne_pays(i,df1))

dic_code = refresh_dico('Country Code','Country Name',df1)

#Print the number of countries left in the database
print(len(dic_code.keys()))


145
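
The two-step filter can also be sketched with a `groupby` on the country code instead of the regex helpers. This is an illustrative variant on toy data, not the method used in the cell above; the toy frame and the name `n_indicators` are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country Code": ["FRA", "FRA", "AGO", "AGO"],
    "Simple Indicator": ["gdp", "pop", "gdp", "pop"],
    "2000": [1.0, 2.0, 3.0, np.nan],   # AGO is missing one value
    "2001": [1.5, 2.5, 3.5, 4.5],
})
n_indicators = toy["Simple Indicator"].nunique()

# Step 1: drop every row with at least one missing value
complete_rows = toy.dropna(axis=0)

# Step 2: keep only countries that still have all their indicator rows
rows_per_country = complete_rows.groupby("Country Code")["Simple Indicator"].transform("count")
complete = complete_rows[rows_per_country == n_indicators]
print(complete["Country Code"].unique().tolist())
```

Here AGO loses its population row in step 1, so its remaining GDP row is removed in step 2 and only FRA survives.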


We now rebuild the variable for each country's exports in dollars from GDP and exports as a percentage of GDP.

Once this is done, the old variable expressing exports as a % of GDP can be dropped by uncommenting the corresponding line (removing the leading #).

We used the formula:

    Exports$ = (Exports%GDP * GDP) / 100
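
As a quick sanity check of the formula, here is the arithmetic for a single cell, using Angola's figures for 2000 as they appear in the table printed further down in this notebook. The variable names are illustrative only.

```python
# Angola, 2000 (values copied from the notebook's output table)
exports_pct_gdp = 89.68583      # Exports of goods and services (% of GDP)
gdp_million_usd = 9129.595      # GDP (current million US$)

# Exports$ = (Exports%GDP * GDP) / 100, in million US$ because GDP is in million US$
exports_million_usd = exports_pct_gdp * gdp_million_usd / 100
print(exports_million_usd)
```

The result is about 8187.95 million US$, in line with the value shown for Angola in the exports row of that table.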

In [93]:
#CREATE THE EXPORTS VARIABLE FROM THE EXPORTS RATE AND GDP

for i in dic_code.keys():
    df1.loc[i+ '_Exports of goods and services (current million US$)']= (df1.loc[i+'_Exports of goods and services (% of GDP)'][3:df1.shape[1]-1].astype(np.float64) * df1.loc[i+'_GDP (current million US$)'][3:df1.shape[1]-1].astype(np.float64)) / 100
    df1.loc[i+ '_Exports of goods and services (current million US$)',['Country Name','Country Code']]=df1.loc[i+'_Exports of goods and services (% of GDP)',['Country Name','Country Code']]
    df1.loc[i+ '_Exports of goods and services (current million US$)','Indicator Name']= 'Exports of goods and services (current million US$)'
    df1.loc[i+ '_Exports of goods and services (current million US$)','Simple Indicator'] = 'exports'
    #df1 = df1.drop(labels = i+'_Exports of goods and services (% of GDP)')
df1 = df1.sort_index()

We now create the per capita and per labor unit variables.

In [94]:
def gen_per_capita(variable):
    for i in dic_code.keys():
        df1.loc[i+variable+' per capita']= df1.loc[i+variable][3:df1.shape[1]-1].astype(np.float64) / df1.loc[i+'_Population, total'][3:df1.shape[1]-1].astype(np.float64)
        df1.loc[i+variable+' per capita',['Country Name','Country Code']]=df1.loc[i+variable,['Country Name','Country Code']]
        df1.loc[i+variable+' per capita','Indicator Name']= variable + ' per capita'
        df1.loc[i+variable+' per capita','Simple Indicator'] = df1.loc[i+variable, 'Simple Indicator'] + '_hab'
    return
gen_per_capita('_Exports of goods and services (current million US$)')

def gen_per_labor_unit(variable) :
    for i in dic_code.keys():
        df1.loc[i+variable+' per labor unit']= df1.loc[i+variable][3:df1.shape[1]-1].astype(np.float64) / df1.loc[i+'_Labor force, total'][3:df1.shape[1]-1].astype(np.float64)
        df1.loc[i+variable+' per labor unit',['Country Name','Country Code']]=df1.loc[i+variable,['Country Name','Country Code']]
        df1.loc[i+variable+' per labor unit','Indicator Name']= variable + ' per labor unit'
        df1.loc[i+variable+' per labor unit','Simple Indicator'] = df1.loc[i+variable, 'Simple Indicator'] + '_labor_unit'
    return
gen_per_labor_unit('_Exports of goods and services (current million US$)')
df1 = df1.sort_index()
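
A note on units: because exports are stored in million US$, the ratios computed above are in million US$ per person (or per member of the labor force). The sketch below redoes the division for Angola in 2000 with values from the table printed in the next cell and converts the result to plain US$. The variable names are illustrative.

```python
# Angola, 2000 (values copied from the notebook's output table)
exports_million_usd = 8187.953
population = 16395470.0

per_capita_million_usd = exports_million_usd / population   # what the notebook stores
per_capita_usd = per_capita_million_usd * 1_000_000          # same figure in plain US$

print(per_capita_million_usd, per_capita_usd)
```

A stored value of roughly 0.0005 therefore corresponds to roughly 499 US$ of exports per inhabitant.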

In [95]:
df1.head(10)

Pays_indic,Country Name,Country Code,Indicator Name,2000,2001,2002,2003,2004,2005,2006,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,Simple Indicator
AGO_Exports of goods and services (% of GDP),Angola,AGO,Exports of goods and services (% of GDP),89.68583,75.38894,57.085,54.32134,58.38035,65.52627,63.46786,...,61.54311,60.66995,55.94013,50.74708,44.69503,29.7546,28.12448,29.0041,40.83629,exports_rate
AGO_Exports of goods and services (current million US$),Angola,AGO,Exports of goods and services (current million...,8187.953,6736.804,8725.781,9676.1,13749.77,24225.66,33245.11,...,51572.82,67822.75,71632.93,69376.27,65126.11,34572.95,28440.56,35420.92,41388.9,exports
AGO_Exports of goods and services (current million US$) per capita,Angola,AGO,_Exports of goods and services (current millio...,0.0004994033,0.0003975512,0.0004980634,0.0005339575,0.0007330026,0.001246586,0.001649889,...,0.002208095,0.002800202,0.002853,0.0026667,0.002417291,0.001239868,0.0009860649,0.001187954,0.00134337,exports_hab
AGO_Exports of goods and services (current million US$) per labor unit,Angola,AGO,_Exports of goods and services (current millio...,0.001219836,0.0009712125,0.001217387,0.001304164,0.001789817,0.003043671,0.004032362,...,0.005394397,0.006843792,0.006974856,0.00651627,0.005901985,0.00302266,0.002400369,0.002887063,0.003257561,exports_labor_unit
AGO_GDP (current million US$),Angola,AGO,GDP (current million US$),9129.595,8936.064,15285.59,17812.71,23552.05,36970.92,52381.01,...,83799.5,111789.7,128052.9,136709.9,145712.2,116193.6,101123.9,122123.8,101353.2,gdp
AGO_GDP per capita (current US$),Angola,AGO,GDP per capita (current US$),556.8363,527.3335,872.4945,982.9609,1255.564,1902.422,2599.566,...,3587.884,4615.468,5100.096,5254.882,5408.41,4166.98,3506.073,4095.813,3289.647,gdb_hab
AGO_Gross capital formation (% of GDP),Angola,AGO,Gross capital formation (% of GDP),30.49322,30.49322,30.49317,30.45111,30.89367,27.55658,23.30077,...,28.19731,26.42435,26.66758,26.14297,27.50046,34.20249,27.21471,24.1303,17.86942,FBCF
"AGO_Labor force, total",Angola,AGO,"Labor force, total",6712338.0,6936488.0,7167634.0,7419387.0,7682222.0,7959356.0,8244573.0,...,9560441.0,9910112.0,10270170.0,10646620.0,11034610.0,11437920.0,11848410.0,12268840.0,12705490.0,labor_force
"AGO_Population, total",Angola,AGO,"Population, total",16395470.0,16945750.0,17519420.0,18121480.0,18758140.0,19433600.0,20149900.0,...,23356250.0,24220660.0,25107930.0,26015780.0,26941780.0,27884380.0,28842480.0,29816750.0,30809760.0,pop
"AGO_Unemployment, total (% of total labor force) (modeled ILO estimate)",Angola,AGO,"Unemployment, total (% of total labor force) (...",3.837,3.835,3.871,3.875,3.842,3.8,3.712,...,9.43,7.362,7.379,7.4,7.331,7.282,7.223,7.119,7.019,unemployment rate


We print the number of countries, check that France is still in the table, and print the shape of the df.

In [96]:
print('We kept ', len(dic_code.keys()), ' countries')
print('The size of the df is ', df1.shape)
df1.loc[rech_ligne_pays('France', df1)]


We kept  145  countries
The size of the df is  (1450, 23)


Pays_indic,Country Name,Country Code,Indicator Name,2000,2001,2002,2003,2004,2005,2006,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,Simple Indicator
FRA_Exports of goods and services (% of GDP),France,FRA,Exports of goods and services (% of GDP),28.59497,28.2659,27.53105,26.11197,26.46884,27.03305,27.93506,...,26.7883,28.42134,29.20303,29.36474,29.66668,30.59262,30.24754,30.94863,31.71916,exports_rate
FRA_Exports of goods and services (current million US$),France,FRA,Exports of goods and services (current million...,389534.7,389070.2,411392.8,480585.8,560012.4,593679.9,647700.4,...,707910.2,813250.5,783758.2,825465.6,846143.0,745911.7,747503.0,803163.8,884286.9,exports
FRA_Exports of goods and services (current million US$) per capita,France,FRA,_Exports of goods and services (current millio...,0.006394988,0.006341045,0.006656274,0.007720888,0.00893092,0.009396739,0.01018055,...,0.01088632,0.01244591,0.01193665,0.0125073,0.01276002,0.01120858,0.01120289,0.01201183,0.01320503,exports_hab
FRA_Exports of goods and services (current million US$) per labor unit,France,FRA,_Exports of goods and services (current millio...,0.01423671,0.01418445,0.0148184,0.01697049,0.01964856,0.02062555,0.02238176,...,0.02380105,0.02734086,0.02613852,0.02739995,0.02814089,0.02474985,0.02475404,0.02656276,0.02909621,exports_labor_unit
FRA_GDP (current million US$),France,FRA,GDP (current million US$),1362249.0,1376465.0,1494287.0,1840481.0,2115742.0,2196126.0,2318594.0,...,2642610.0,2861408.0,2683825.0,2811078.0,2852166.0,2438208.0,2471286.0,2595151.0,2787864.0,gdp
FRA_GDP per capita (current US$),France,FRA,GDP per capita (current US$),22364.03,22433.56,24177.34,29568.39,33741.27,34760.19,36443.62,...,40638.33,43790.73,40874.7,42592.93,43011.26,36638.18,37037.37,38812.16,41631.09,gdb_hab
FRA_Gross capital formation (% of GDP),France,FRA,Gross capital formation (% of GDP),22.48805,22.16487,21.31558,21.18613,21.8896,22.45262,23.23771,...,21.94629,23.22091,22.62663,22.28719,22.70983,22.71232,22.60939,23.43619,23.87174,FBCF
"FRA_Labor force, total",France,FRA,"Labor force, total",27361300.0,27429360.0,27762300.0,28318910.0,28501440.0,28783710.0,28938760.0,...,29742820.0,29744880.0,29984800.0,30126540.0,30068090.0,30138030.0,30197210.0,30236460.0,30391820.0,labor_force
"FRA_Population, total",France,FRA,"Population, total",60912500.0,61357430.0,61805270.0,62244890.0,62704900.0,63179350.0,63621380.0,...,65027510.0,65342780.0,65659810.0,65998690.0,66312070.0,66548270.0,66724100.0,66864380.0,66965910.0,pop
"FRA_Unemployment, total (% of total labor force) (modeled ILO estimate)",France,FRA,"Unemployment, total (% of total labor force) (...",10.217,8.61,8.702,8.306,8.914,8.493,8.448,...,8.871,8.811,9.4,9.921,10.292,10.359,10.057,9.397,9.059,unemployment rate


In [97]:
df1.to_csv('C:/Users/Titouan/Desktop/WB_data2.csv')
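
The introduction notes that the final table can be restricted to a chosen list of countries, a step the cells above do not show. A minimal sketch of one way to do it, on a toy frame shaped like df1, is given below. The list `countries_to_keep` is a hypothetical selection, and filtering on the `Country Code` column is an assumption about how one would want to select.

```python
import pandas as pd

# Toy frame shaped like df1: one row per country and indicator
toy = pd.DataFrame(
    {"Country Code": ["FRA", "FRA", "AGO", "DEU"], "2000": [1.0, 2.0, 3.0, 4.0]},
    index=["FRA_gdp", "FRA_pop", "AGO_gdp", "DEU_gdp"],
)
countries_to_keep = ["FRA", "DEU"]   # hypothetical selection

# Keep every indicator row belonging to one of the selected countries
subset = toy[toy["Country Code"].isin(countries_to_keep)]
print(subset.index.tolist())
```

Applied to df1 before the export, the same filter would write only the selected countries to the CSV file.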