You are a Data Scientist at an EdTech start-up called academy, which offers online training content for high-school and university students.

Mark, your manager, has invited you to a meeting to present the company's international expansion project. He gives you a first exploratory analysis assignment: determine whether the World Bank's education data can inform the expansion project.

Here are the questions Mark would like to explore, as you noted them during the meeting:

Which countries have a strong potential customer base for our services?
For each of these countries, how will this customer potential evolve?
In which countries should the company operate first?
Your mission
Mark has therefore asked you to carry out a preliminary exploratory analysis of this dataset. He sent you this email after the meeting:

Hello,

The World Bank data is available at the following address:

https://datacatalog.worldbank.org/dataset/education-statistics

Or as a direct download at this link.

I'll let you look at the landing page that describes the dataset. In short, the World Bank's “EdStats All Indicator Query” lists 4000 international indicators describing access to education, graduation, and information about teachers, education spending... You'll find more information on this site:

http://datatopics.worldbank.org/education/

For the pre-analysis, could you:

Validate the quality of this dataset (does it contain a lot of missing or duplicated data?)

Describe the information contained in the dataset (number of columns? number of rows?)

Select the information that seems relevant to the business question (which columns contain information that could help answer the company's question?)

Determine orders of magnitude for the usual summary statistics across the world's geographic areas and countries (mean/median/standard deviation per country and per continent or geographic block)

Your work will help us determine whether this dataset can inform decisions about opening in new countries. We will share your analysis with the board, so please take care over the presentation and illustrate it with relevant, readable charts!

# CSV used : EdStatsData

## Library imports

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

## File imports

In [83]:
filename = 'EdStatsData.csv'
directory = 'CSV'
# the file is expected in a CSV folder under the current working directory
dfEdStatsData = pd.read_csv(os.path.join(os.getcwd(), directory, filename))

## Basic information about the dataset

> Data sample :

In [84]:
dfEdStatsData.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,,,...,,,,,,,,,,
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,,,...,,,,,,,,,,
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,,,,,,,,,,


> Data infos

In [85]:
dfEdStatsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 886930 entries, 0 to 886929
Data columns (total 70 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Country Name    886930 non-null  object 
 1   Country Code    886930 non-null  object 
 2   Indicator Name  886930 non-null  object 
 3   Indicator Code  886930 non-null  object 
 4   1970            72288 non-null   float64
 5   1971            35537 non-null   float64
 6   1972            35619 non-null   float64
 7   1973            35545 non-null   float64
 8   1974            35730 non-null   float64
 9   1975            87306 non-null   float64
 10  1976            37483 non-null   float64
 11  1977            37574 non-null   float64
 12  1978            37576 non-null   float64
 13  1979            36809 non-null   float64
 14  1980            89122 non-null   float64
 15  1981            38777 non-null   float64
 16  1982            37511 non-null   float64
 17  1983      

The dataset contains 886,930 rows and 70 columns: 66 float columns and 4 object columns. The object columns hold strings; we keep them as object, the dtype pandas assigned to them when reading the file.
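
The brief also asks about missing and duplicated data. A minimal sketch of such checks is given here on a small made-up frame rather than on the real file; the helper name summarize_quality and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return basic quality figures for a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # rows that are exact copies of an earlier row
        "duplicated_rows": int(df.duplicated().sum()),
        # share of empty cells over the whole table
        "missing_share": float(df.isna().to_numpy().mean()),
        # columns that contain no value at all
        "empty_columns": [c for c in df.columns if df[c].isna().all()],
    }


# toy data, not taken from EdStatsData.csv
toy = pd.DataFrame(
    {
        "Country Code": ["AAA", "AAA", "BBB", "BBB"],
        "Indicator Code": ["X", "X", "X", "Y"],
        "1970": [1.0, 1.0, np.nan, 2.0],
        "Unnamed: 69": [np.nan] * 4,
    }
)

quality = summarize_quality(toy)
print(quality)
```

Applied to dfEdStatsData, the same helper would report whether a column such as Unnamed: 69 is entirely empty and can be dropped.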


## Dataset fractioning

We create two sub-datasets: one with the "World" rows only and one with the individual countries only.

In [86]:
# we convert the column names to snake case
dfEdStatsData.columns = dfEdStatsData.columns.str.lower().str.replace(' ', '_')

# world sub dataset
dfEdStatsDataWorld = dfEdStatsData[dfEdStatsData['country_name'] == 'World']

# country sub dataset
# we look for the first occurrence of Afghanistan, the first country in the dataset;
# this assumes every aggregate row (Arab World, World, ...) comes before it
firstOccurence = dfEdStatsData[dfEdStatsData['country_name'] == 'Afghanistan'].index[0]
# we keep every row from that position onwards and renumber the rows from 0
dfEdStatsDataCountry = dfEdStatsData.iloc[firstOccurence:].reset_index(drop=True)
dfEdStatsDataCountry

Unnamed: 0,country_name,country_code,indicator_name,indicator_code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,unnamed:_69
0,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,,,,,7.05911,,...,,,,,,,,,,
1,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,,,,,2.53138,,...,,,,,,,,,,
2,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,,,,,0.22154,,...,,,,,,,,,,
3,Afghanistan,AFG,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,,,,,11.42652,,...,,,,,,,,,,
4,Afghanistan,AFG,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795300,Zimbabwe,ZWE,"Youth illiterate population, 15-24 years, male...",UIS.LP.AG15T24.M,,,,,,,...,,,,,,,,,,
795301,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, b...",SE.ADT.1524.LT.ZS,,,,,,,...,,,,,,,,,,
795302,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, f...",SE.ADT.1524.LT.FE.ZS,,,,,,,...,,,,,,,,,,
795303,Zimbabwe,ZWE,"Youth literacy rate, population 15-24 years, g...",SE.ADT.1524.LT.FM.ZS,,,,,,,...,,,,,,,,,,
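
The positional split above relies on every aggregate row sitting before Afghanistan in the file. A sketch of a split that does not depend on row order is shown below, under the assumption that a hand-maintained set of aggregate codes is available. The set aggregate_codes is hypothetical and incomplete (ARB appears in the sample above; WLD is assumed to be the code used for World), and the toy values are made up.

```python
import pandas as pd

# Hypothetical, incomplete set of aggregate codes; to be completed from the data
aggregate_codes = {"ARB", "WLD"}

# toy data, not taken from EdStatsData.csv
toy = pd.DataFrame(
    {
        "country_name": ["Arab World", "World", "Afghanistan", "Zimbabwe"],
        "country_code": ["ARB", "WLD", "AFG", "ZWE"],
        "1970": [54.8, 60.0, None, None],
    }
)

# boolean mask: True for rows describing a group of countries
is_aggregate = toy["country_code"].isin(aggregate_codes)

toy_aggregates = toy[is_aggregate].reset_index(drop=True)
toy_countries = toy[~is_aggregate].reset_index(drop=True)

print(toy_countries["country_name"].tolist())
```

The mask-based version keeps working if the file is re-sorted, at the cost of maintaining the list of aggregate codes.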


## Null percentage by indicator

Indicators ranked by their share of missing values over the **past** years (1970 to 2016): the lower the share, the more usable the indicator.

In [87]:
# share of missing values per indicator and per column, across all countries
dfNullPercentageByIndicator = dfEdStatsDataCountry.isnull().groupby(dfEdStatsDataCountry['indicator_code']).mean()

# we keep the columns from 1970 to 2016 (label-based slice, both ends included);
# .copy() lets us add a column without triggering a SettingWithCopyWarning
dfNullPercentageByIndicatorPast = dfNullPercentageByIndicator.loc[:, '1970':'2016'].copy()

# we get the mean of the null values for each indicator
dfNullPercentageByIndicatorPast['mean'] = dfNullPercentageByIndicatorPast.mean(axis=1)
dfNullPercentageByIndicatorPast = dfNullPercentageByIndicatorPast.sort_values('mean', ascending=True)
dfNullPercentageByIndicatorPast


indicator_code,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,mean
SP.POP.TOTL,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,...,0.009217,0.009217,0.009217,0.009217,0.013825,0.013825,0.032258,0.032258,0.032258,0.020590
SP.POP.GROW,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,0.027650,...,0.009217,0.009217,0.009217,0.013825,0.013825,0.013825,0.032258,0.032258,0.032258,0.021179
SE.PRM.DURS,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,...,0.041475,0.041475,0.041475,0.041475,0.041475,0.041475,0.041475,0.041475,0.179724,0.055888
SE.PRM.AGES,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,...,0.041475,0.041475,0.041475,0.041475,0.041475,0.041475,0.041475,0.041475,0.179724,0.055888
UIS.THDUR.0,0.078341,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,0.073733,...,0.055300,0.055300,0.055300,0.055300,0.055300,0.055300,0.055300,0.055300,0.193548,0.060888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SABER.TER.GOAL6.LVL1,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
SABER.TER.GOAL6.LVL2,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
SABER.TER.GOAL6.LVL3,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
SABER.TER.GOAL4,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
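
To make the computation behind this table concrete, here is a small sketch of the same idea on made-up data: the share of missing values per indicator and per year, then the mean over the years. The toy frame and its indicator codes X and Y are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# toy data: two indicators (X, Y) observed for two countries
toy = pd.DataFrame(
    {
        "country_code": ["AAA", "BBB", "AAA", "BBB"],
        "indicator_code": ["X", "X", "Y", "Y"],
        "1970": [1.0, np.nan, np.nan, np.nan],
        "1971": [1.0, 2.0, np.nan, 3.0],
    }
)

year_columns = ["1970", "1971"]

# share of countries with no value, per indicator and per year
null_share = toy[year_columns].isna().groupby(toy["indicator_code"]).mean()

# average over the years, then sort from most to least complete
null_share["mean"] = null_share.mean(axis=1)
null_share = null_share.sort_values("mean")

print(null_share)
```

Indicator X is missing for one country out of two in 1970 and for none in 1971, hence a mean of 0.25; Y reaches 0.75 and is ranked after it.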


Indicators ranked by their share of missing values over the **future** years (2020 to 2100): the lower the share, the more usable the indicator.

In [88]:
# we keep the columns from 2020 to 2100 (label-based slice, both ends included);
# .copy() lets us add a column without triggering a SettingWithCopyWarning
dfNullPercentageByIndicatorFuture = dfNullPercentageByIndicator.loc[:, '2020':'2100'].copy()

# we get the mean of the null values for each indicator
dfNullPercentageByIndicatorFuture['mean'] = dfNullPercentageByIndicatorFuture.mean(axis=1)
dfNullPercentageByIndicatorFuture = dfNullPercentageByIndicatorFuture.sort_values('mean', ascending=True)
dfNullPercentageByIndicatorFuture


indicator_code,2020,2025,2030,2035,2040,2045,2050,2055,2060,2065,2070,2075,2080,2085,2090,2095,2100,mean
PRJ.ATT.60UP.1.MF,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023
PRJ.ATT.2064.4.FE,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023
PRJ.ATT.2064.4.MA,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023
PRJ.ATT.2064.4.MF,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023
PRJ.ATT.2064.NED.FE,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023,0.235023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
LO.LLECE.SCI6.3,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
LO.LLECE.SCI6.3.FE,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
LO.LLECE.SCI6.3.MA,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
LO.LLECE.REA6.3.FE,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
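
Both rankings depend on picking the right block of year columns. As an alternative to positional indices, the sketch below selects year columns by parsing their labels; the helper name year_columns and the shortened column list are assumptions for illustration only.

```python
import pandas as pd


def year_columns(columns, first_year, last_year):
    """Return the column labels that are years within [first_year, last_year]."""
    return [
        c for c in columns
        if str(c).isdigit() and first_year <= int(c) <= last_year
    ]


# shortened, illustrative list of column labels
columns = pd.Index(
    ["country_name", "indicator_code", "2016", "2017", "2020", "2025", "2100", "unnamed:_69"]
)

past = year_columns(columns, 1970, 2016)
future = year_columns(columns, 2020, 2100)

print(past, future)
```

Non-year labels such as country_name or unnamed:_69 are skipped because they are not made of digits only, so the selection does not break if columns are added or reordered.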


## Main indicators ranking

Population | SP.POP.1524.TO.UN

Economic | NY.GNP.PCAP.PP.CD

Education | SE.SEC.ENRR | SE.TER.ENRR

Digital | IT.NET.USER.P2

In [89]:
numberOfIndicators = dfNullPercentageByIndicatorPast.shape[0]

# Index.get_loc returns the 0-based position of an indicator in the sorted table,
# so a rank of 0 is the indicator with the lowest mean share of missing values

# population indicator rank
populationIndicatorRankPast = dfNullPercentageByIndicatorPast.index.get_loc('SP.POP.TOTL')
populationIndicatorRankFuture = dfNullPercentageByIndicatorFuture.index.get_loc('SP.POP.TOTL')

# economic indicator rank
economicIndicatorRankPast = dfNullPercentageByIndicatorPast.index.get_loc('NY.GNP.PCAP.PP.CD')
economicIndicatorRankFuture = dfNullPercentageByIndicatorFuture.index.get_loc('NY.GNP.PCAP.PP.CD')

# education first indicator rank
educationIndicatorRankPast = dfNullPercentageByIndicatorPast.index.get_loc('SE.SEC.ENRR')
educationIndicatorRankFuture = dfNullPercentageByIndicatorFuture.index.get_loc('SE.SEC.ENRR')

# education second indicator rank
educationIndicatorRankPast2 = dfNullPercentageByIndicatorPast.index.get_loc('SE.TER.ENRR')
educationIndicatorRankFuture2 = dfNullPercentageByIndicatorFuture.index.get_loc('SE.TER.ENRR')

# digital indicator rank
numericIndicatorRankPast = dfNullPercentageByIndicatorPast.index.get_loc('IT.NET.USER.P2')
numericIndicatorRankFuture = dfNullPercentageByIndicatorFuture.index.get_loc('IT.NET.USER.P2')

print("Past aggregated data rankings :\n")
print('Rank of SP.POP.TOTL in the list of indicators: '+ str(populationIndicatorRankPast) +'/'+ str(numberOfIndicators))
print('Rank of NY.GNP.PCAP.PP.CD in the list of indicators: '+ str(economicIndicatorRankPast) +'/'+ str(numberOfIndicators))
print('Rank of SE.SEC.ENRR in the list of indicators: '+ str(educationIndicatorRankPast) +'/'+ str(numberOfIndicators))
print('Rank of SE.TER.ENRR in the list of indicators: '+ str(educationIndicatorRankPast2) +'/'+ str(numberOfIndicators))
print('Rank of IT.NET.USER.P2 in the list of indicators: '+ str(numericIndicatorRankPast) +'/'+ str(numberOfIndicators))

print("\nFuture aggregated data rankings :\n")
print('Rank of SP.POP.TOTL in the list of indicators: '+ str(populationIndicatorRankFuture) +'/'+ str(numberOfIndicators))
print('Rank of NY.GNP.PCAP.PP.CD in the list of indicators: '+ str(economicIndicatorRankFuture) +'/'+ str(numberOfIndicators))
print('Rank of SE.SEC.ENRR in the list of indicators: '+ str(educationIndicatorRankFuture) +'/'+ str(numberOfIndicators))
print('Rank of SE.TER.ENRR in the list of indicators: '+ str(educationIndicatorRankFuture2) +'/'+ str(numberOfIndicators))
print('Rank of IT.NET.USER.P2 in the list of indicators: '+ str(numericIndicatorRankFuture) +'/'+ str(numberOfIndicators))

Past aggregated data rankings :

Rank of SP.POP.TOTL in the list of indicators: 0/3665
Rank of NY.GNP.PCAP.PP.CD in the list of indicators: 134/3665
Rank of SE.SEC.ENRR in the list of indicators: 72/3665
Rank of SE.TER.ENRR in the list of indicators: 97/3665
Rank of IT.NET.USER.P2 in the list of indicators: 132/3665

Future aggregated data rankings :

Rank of SP.POP.TOTL in the list of indicators: 1757/3665
Rank of NY.GNP.PCAP.PP.CD in the list of indicators: 3092/3665
Rank of SE.SEC.ENRR in the list of indicators: 379/3665
Rank of SE.TER.ENRR in the list of indicators: 667/3665
Rank of IT.NET.USER.P2 in the list of indicators: 2034/3665
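
Many indicators share the same mean share of missing values (for instance 1.0 when no value exists at all), and the order of tied rows after sort_values is not guaranteed to be meaningful, so neither is the position returned by get_loc for them. A possible tie-aware alternative, sketched on made-up values with hypothetical indicator codes, is Series.rank with method="min", which gives a 1-based rank shared by all tied indicators.

```python
import pandas as pd

# made-up mean null shares, indexed by hypothetical indicator codes
mean_null_share = pd.Series(
    {"A": 0.02, "B": 0.25, "C": 1.0, "D": 1.0, "E": 1.0}
)

# method="min": tied indicators all receive the best rank of their group
rank = mean_null_share.rank(method="min").astype(int)

print(rank.to_dict())
```

With this convention, all indicators that are entirely empty over a period get the same rank, which makes it easier to see that such a rank carries no information about relative quality.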
