# World_Bank_Data_-Wrangling
## This notebook uses the pandas library to clean and prepare a dataset

### Website:    https://filipedeabreu.com

### Author:     Filipe de Abreu

### Maintainer: Filipe de Abreu

<hr>

# Tested on
### OS: Windows 11
### Python version: 3.12.7

In [5]:
# imports

import pandas as pd

In [6]:
df = pd.read_excel('datasets/WDI World Bank.xlsx')

df.head(5)

,Country Name,Country Code,Series Name,Series Code,2021 [YR2021],Topic
0,Afghanistan,AFG,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,..,Environment: Energy production & use
1,Afghanistan,AFG,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,..,Environment: Energy production & use
2,Afghanistan,AFG,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,..,Environment: Energy production & use
3,Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,..,Environment: Energy production & use
4,Afghanistan,AFG,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,..,Environment: Energy production & use


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383577 entries, 0 to 383576
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Country Name   383574 non-null  object
 1   Country Code   383572 non-null  object
 2   Series Name    383572 non-null  object
 3   Series Code    383572 non-null  object
 4   2021 [YR2021]  383572 non-null  object
 5   Topic          383572 non-null  object
dtypes: object(6)
memory usage: 17.6+ MB
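The non-null counts reported by `df.info()` can also be turned into explicit missing-value counts and shares per column. The following is a minimal sketch on a toy DataFrame (the `toy` frame and its values are made up for illustration and are not taken from the World Bank file):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'Country Name': ['Afghanistan', 'Albania', None, 'Algeria'],
    'Series Code': ['EG.ELC.ACCS.ZS', None, None, 'EG.ELC.ACCS.ZS'],
})

null_counts = toy.isna().sum()   # number of missing values per column
null_share = toy.isna().mean()   # fraction of missing values per column

print(null_counts)
print(null_share)
```

Applied to the real `df`, the same two calls would give the missing-value count and rate for each of the six columns.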


In [8]:
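# The share of null values is low, so I re-import the data and handle missing values at load time.
# Note: na_values tells pandas which strings to read as NaN (it does not replace NaN with them).
# The preview above shows '..' (two dots) as the placeholder, so '...' matches nothing here
# and the df.info() output below is identical to the previous one.

df = pd.read_excel('datasets/WDI World Bank.xlsx',
                  na_values='...')  # read '...' as NaN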

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383577 entries, 0 to 383576
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Country Name   383574 non-null  object
 1   Country Code   383572 non-null  object
 2   Series Name    383572 non-null  object
 3   Series Code    383572 non-null  object
 4   2021 [YR2021]  383572 non-null  object
 5   Topic          383572 non-null  object
dtypes: object(6)
memory usage: 17.6+ MB
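The two `df.info()` outputs are identical, which suggests that the `'...'` marker did not match anything in the file; the `head()` preview shows `'..'` (two dots) as the placeholder. The sketch below illustrates the difference on a small in-memory CSV (the `csv_text` sample is invented for illustration, and `read_csv` is used only so the example runs without the Excel file; `na_values` behaves the same way in `read_excel`):

```python
import io
import pandas as pd

# Invented two-row sample that mimics the '..' placeholder seen in the preview.
csv_text = "Country Name,2021 [YR2021]\nAfghanistan,..\nBrazil,12.5\n"

# '...' (three dots) does not match '..', so nothing becomes NaN.
unmatched = pd.read_csv(io.StringIO(csv_text), na_values='...')

# '..' (two dots) matches the placeholder, so it is read as NaN.
matched = pd.read_csv(io.StringIO(csv_text), na_values='..')

print(unmatched['2021 [YR2021]'].isna().sum())  # missing values without a match
print(matched['2021 [YR2021]'].isna().sum())    # missing values with a match
print(matched['2021 [YR2021]'].dtype)           # column can now be numeric
```

Note that switching the real import to `na_values='..'` would change the non-null counts and the dtype of the 2021 column, so the outputs in the rest of this notebook would need to be regenerated.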


In [9]:
# Simplifying the column names and removing spaces

df.rename(
    columns={
        'Country Name':'pais',
        'Country Code':'cod_pais',
        'Series Name':'serie',
        'Series Code':'cod_serie',
        '2021 [YR2021]':'ano_2021',
        'Topic':'topico'
    },
    inplace=True
)

# Checking that the names were actually changed
df.columns

Index(['pais', 'cod_pais', 'serie', 'cod_serie', 'ano_2021', 'topico'], dtype='object')
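As a side note, `rename` can also return a new DataFrame instead of modifying the original in place, which makes it easier to re-run a cell without side effects. This is a minimal sketch on an empty toy frame (the `toy` and `renamed` names are illustrative and only three of the six columns are used):

```python
import pandas as pd

# Empty toy frame that only carries column names.
toy = pd.DataFrame(columns=['Country Name', 'Country Code', 'Topic'])

# Without inplace=True, rename returns a new DataFrame and leaves `toy` untouched.
renamed = toy.rename(
    columns={
        'Country Name': 'pais',
        'Country Code': 'cod_pais',
        'Topic': 'topico',
    }
)

print(list(renamed.columns))
print(list(toy.columns))
```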

In [10]:
# Collect the distinct topics, drop the missing entry, and keep only those starting with 'Health'
serie_unique_topico = pd.Series(df['topico'].unique())
serie_unique_topico.dropna(inplace = True)
serie_unique_topico_health = serie_unique_topico[serie_unique_topico.str.startswith('Health')]
serie_unique_topico_health

5           Health: Reproductive health
6                  Health: Risk factors
7          Health: Population: Dynamics
22           Health: Disease prevention
38               Health: Health systems
42                    Health: Nutrition
68    Health: Universal Health Coverage
73                    Health: Mortality
79        Health: Population: Structure
82                               Health
dtype: object
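The same selection can be expressed as a boolean mask on the `topico` column itself, without building the intermediate Series of unique topics. The sketch below uses a toy frame (its rows are invented for illustration); `na=False` makes missing topics count as non-matches:

```python
import pandas as pd

# Toy frame with one missing topic (values are illustrative only).
toy = pd.DataFrame({
    'topico': [
        'Health: Mortality',
        'Environment: Energy production & use',
        None,
        'Health',
    ]
})

# True where the topic starts with 'Health'; missing topics are treated as False.
is_health = toy['topico'].str.startswith('Health', na=False)
health_rows = toy[is_health]

print(health_rows)
```

Keeping the explicit list of topics, as done above, has the advantage that the matched categories can be inspected before filtering.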

In [11]:
# Using the selection above, count the rows per country and topic, restricted to topics that start with 'Health'

pd.pivot_table(
    df[df['topico'].isin(serie_unique_topico_health)],
    index=['pais','cod_pais'],
    columns=['topico'],
    aggfunc='size'
)

pais,cod_pais,Health,Health: Disease prevention,Health: Health systems,Health: Mortality,Health: Nutrition,Health: Population: Dynamics,Health: Population: Structure,Health: Reproductive health,Health: Risk factors,Health: Universal Health Coverage
Afghanistan,AFG,2,27,23,39,26,13,58,16,32,13
Africa Eastern and Southern,AFE,2,27,23,39,26,13,58,16,32,13
Africa Western and Central,AFW,2,27,23,39,26,13,58,16,32,13
Albania,ALB,2,27,23,39,26,13,58,16,32,13
Algeria,DZA,2,27,23,39,26,13,58,16,32,13
...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,PSE,2,27,23,39,26,13,58,16,32,13
World,WLD,2,27,23,39,26,13,58,16,32,13
"Yemen, Rep.",YEM,2,27,23,39,26,13,58,16,32,13
Zambia,ZMB,2,27,23,39,26,13,58,16,32,13


In [36]:
# A second way to obtain a similar result

df[df['topico'].isin(serie_unique_topico_health)].groupby(['pais','cod_pais','topico']).count()

pais,cod_pais,topico,serie,cod_serie,ano_2021
Afghanistan,AFG,Health,2,2,2
Afghanistan,AFG,Health: Disease prevention,27,27,27
Afghanistan,AFG,Health: Health systems,23,23,23
Afghanistan,AFG,Health: Mortality,39,39,39
Afghanistan,AFG,Health: Nutrition,26,26,26
...,...,...,...,...,...
Zimbabwe,ZWE,Health: Population: Dynamics,13,13,13
Zimbabwe,ZWE,Health: Population: Structure,58,58,58
Zimbabwe,ZWE,Health: Reproductive health,16,16,16
Zimbabwe,ZWE,Health: Risk factors,32,32,32
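The two approaches are similar but not identical: `aggfunc='size'` counts rows, while `groupby(...).count()` counts non-null values in each remaining column. The sketch below shows the difference on a toy frame with one missing value (the data is invented for illustration), and how `unstack` turns the long result into a wide table like the pivot:

```python
import pandas as pd

# Toy frame: country 'A' has two 'Health' rows, one with a missing value.
toy = pd.DataFrame({
    'pais': ['A', 'A', 'A', 'B'],
    'topico': ['Health', 'Health', 'Health: Mortality', 'Health'],
    'ano_2021': [1.0, None, 2.0, 3.0],
})

grouped = toy.groupby(['pais', 'topico'])

sizes = grouped.size()                # rows per group, NaN included
counts = grouped['ano_2021'].count()  # non-null values per group

# Wide layout comparable to the pivot table: one column per topic.
wide = sizes.unstack('topico', fill_value=0)

print(sizes)
print(counts)
print(wide)
```

In the outputs above, the three count columns agree for every group shown, which is what one would expect if none of those rows holds a NaN in `serie`, `cod_serie`, or `ano_2021`.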


In [34]:
# To validate the results above: for Afghanistan, there should be 27 observations where the topic is 'Health: Disease prevention'

# filter by topic
health = df[df['topico'].isin(serie_unique_topico_health)]

# filter by country
health = health[ health['pais'] == 'Afghanistan']

# count
len(health[health['topico'] == 'Health: Disease prevention'])

# The value 27 matches exactly the value shown for the topic 'Health: Disease prevention' in the previous DataFrame

27
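The same check can be written as a single combined boolean mask, which avoids reassigning the intermediate `health` variable. A minimal sketch on a toy frame (rows invented for illustration, so the count here is not 27):

```python
import pandas as pd

# Toy frame: two of the four rows match both conditions.
toy = pd.DataFrame({
    'pais': ['Afghanistan', 'Afghanistan', 'Albania', 'Afghanistan'],
    'topico': [
        'Health: Disease prevention',
        'Health: Mortality',
        'Health: Disease prevention',
        'Health: Disease prevention',
    ],
})

# Both conditions must hold; the parentheses are required around each comparison.
mask = (toy['pais'] == 'Afghanistan') & (toy['topico'] == 'Health: Disease prevention')
n_rows = int(mask.sum())  # True counts as 1, so the sum is the number of matching rows

print(n_rows)
```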

In [13]:
# Checking the number of countries
len(df['pais'].unique())

269

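269 countries seems incorrect, since there are not that many countries in the world. Two things likely inflate the count: the `pais` column also holds aggregates such as 'Africa Eastern and Southern' and 'World' (both visible in the tables above), and `unique()` includes the missing value as one of its entries.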
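One way to investigate the inflated count is to compare `unique()` with `nunique()` and to exclude known aggregate codes. The sketch below runs on a toy frame; the `aggregate_codes` set is a hypothetical, incomplete list built only from the three aggregate codes visible in the outputs above (AFE, AFW, WLD), not an official list:

```python
import pandas as pd

# Toy frame with one real country, two aggregates, and one missing name.
toy = pd.DataFrame({
    'pais': ['Afghanistan', 'Afghanistan', 'World', 'Africa Eastern and Southern', None],
    'cod_pais': ['AFG', 'AFG', 'WLD', 'AFE', None],
})

n_unique_with_nan = len(toy['pais'].unique())  # the missing value counts as an entry
n_unique = toy['pais'].nunique()               # the missing value is ignored

# Hypothetical, incomplete set of aggregate codes (assumption for illustration).
aggregate_codes = {'AFE', 'AFW', 'WLD'}
countries_only = toy[~toy['cod_pais'].isin(aggregate_codes)]
n_countries = countries_only['pais'].nunique()

print(n_unique_with_nan, n_unique, n_countries)
```

For the real dataset, the full set of aggregate codes would have to come from the World Bank's own metadata rather than from a hand-written list.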