# Importing, handling, and analyzing a dataset in pandas

This *notebook* covers the following topics:
* Importing data and quick analysis
* How to rename columns
* How to create and work with subsets of a DataFrame (iloc, loc, and boolean filters)
* Handling missing data
* How to group observations (groupby)
* How to work with a MultiIndex (in Series and DataFrames)
* Merging and joining DataFrames


Some useful *shortcuts*:
* tab --> suggests functions and attributes
* shift + tab --> shows the documentation
* shift + enter --> runs the cell

*pandas* documentation: http://pandas.pydata.org/pandas-docs/stable/

The indicators we will work with in this section were obtained from the [World Bank](https://data.worldbank.org/indicator?tab=featured) website.

In [1]:
import pandas as pd

------

## Importing data and quick analysis

In [2]:
# Import the data from the csv file into a DataFrame
data = pd.read_csv('data/indicadores.csv')
type(data)
# pandas also provides functions to import other file types (txt, JSON, XML, etc.)

pandas.core.frame.DataFrame
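
As a minimal sketch of the comment above, the other readers follow the same pattern as read_csv. The example uses small in-memory strings (hypothetical data, not the World Bank file) so that it runs without any file on disk; the column names and values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical data standing in for files on disk
csv_text = "country;gdp_pc\nMexico;100.5\nBelize;50.25\n"
json_text = '[{"country": "Mexico", "gdp_pc": 100.5}, {"country": "Belize", "gdp_pc": 50.25}]'

# read_csv accepts any file-like object; sep sets the delimiter
df_csv = pd.read_csv(io.StringIO(csv_text), sep=';')

# read_json turns a list of records into one row per record
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls return a DataFrame, so everything that follows in this notebook applies regardless of the original file format.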

In [3]:
# Check how many rows and columns the df has
print(data.shape)

(263, 27)


In [4]:
# List the columns it contains (useful for datasets with many columns):
for col in data.columns: print(col)

Country Name
Country Code
Region
IncomeGroup
cellph_per100
co2_perc
land_area
electricity
fertility
fuel_exps_pc
start_business
foreign_inv
gdp
gdp_pc
growth_gdp_pc
hiv_prev
servs_mill
mil_exp
internet_users
population
rd_exp
taxRev_pc
val_primary
val_industry
val_manufacturing
val_services
region_code


In [5]:
# To find out what each column refers to, import the metadata for our dataframe:
metadata = pd.read_csv('data/indicadores_metadata.csv')
metadata

Unnamed: 0,indicador,descripcion,year
0,cellph_per100,Mobile cellular subscriptions (per 100 people),2017
1,co2_perc,CO2 emissions (metric tons per capita),2014
2,land_area,Land area (sq. km),2017
3,electricity,Access to electricity (% of population),2017
4,fertility,"Fertility rate, total (births per woman)",2017
5,fuel_exps_pc,Fuel exports (% of merchandise exports),2014
6,start_business,Time required to start a business (days),2018
7,foreign_inv,"Foreign direct investment, net inflows (BoP, c...",2017
8,gdp,GDP (current US$),2017
9,gdp_pc,"GDP per capita, PPP (current international $)",2017


In [6]:
# To take a quick look at the dataset:
data.head() # shows the first rows (default=5)

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf
3,Albania,ALB,Europe & Central Asia,upper_mid,123.736096,1.978763,27400.0,100.0,1.71,1.567523,...,1.178901,71.847041,2866376.0,,18.51579,19.849978,21.141999,5.684427,46.697169,Eu_CAs
4,Andorra,AND,Europe & Central Asia,high,104.381212,5.832906,470.0,100.0,,0.122804,...,,98.871436,77006.0,,,0.492101,9.844719,3.207886,79.204103,Eu_CAs


In [7]:
# Show the last rows (default=5):
data.tail()

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
258,Kosovo,XKX,Europe & Central Asia,upper_mid,,,10887.0,100.0,2.02,,...,0.795345,,1845300.0,,,10.461838,23.799921,10.969374,46.13293,Eu_CAs
259,"Yemen, Rep.",YEM,Middle East & North Africa,low,54.363326,0.878996,527970.0,79.2,3.889,70.258253,...,,26.718355,28498687.0,,,7.083504,46.611335,6.868056,22.943957,MEa_NAf
260,South Africa,ZAF,Sub-Saharan Africa,upper_mid,156.033229,8.979062,1213090.0,84.4,2.43,10.825269,...,0.98171,56.167394,57779622.0,0.79816,27.332406,2.178174,26.009143,12.040173,61.02059,SSAf
261,Zambia,ZMB,Sub-Saharan Africa,lower_mid,78.614934,0.292412,743390.0,40.3,4.925,1.149178,...,1.40095,27.852579,17351822.0,,14.393956,6.22847,34.877139,7.686142,54.182454,SSAf
262,Zimbabwe,ZWE,Sub-Saharan Africa,lower_mid,85.252183,0.884721,386850.0,40.421368,3.682,0.555653,...,2.169606,27.055488,14439018.0,,18.06022,7.873986,22.115059,11.59602,60.409902,SSAf


In [8]:
# Show general information about each column:
data.info() # Not all columns have the same number of non-null values --> we will deal with this potential problem later on

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 27 columns):
Country Name         263 non-null object
Country Code         263 non-null object
Region               217 non-null object
IncomeGroup          217 non-null object
cellph_per100        248 non-null float64
co2_perc             250 non-null float64
land_area            261 non-null float64
electricity          261 non-null float64
fertility            246 non-null float64
fuel_exps_pc         200 non-null float64
start_business       235 non-null float64
foreign_inv          244 non-null float64
gdp                  246 non-null float64
gdp_pc               237 non-null float64
growth_gdp_pc        246 non-null float64
hiv_prev             168 non-null float64
servs_mill           259 non-null float64
mil_exp              193 non-null float64
internet_users       250 non-null float64
population           262 non-null float64
rd_exp               122 non-null float64
taxRev_pc        

In [9]:
# To get descriptive statistics for each numeric column
display(data.describe())

# We can also transpose the dataframe returned by describe():
display(round(data.describe().transpose(), ndigits=2))

Unnamed: 0,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,start_business,foreign_inv,gdp,gdp_pc,...,servs_mill,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services
count,248.0,250.0,261.0,261.0,246.0,200.0,235.0,244.0,246.0,237.0,...,259.0,193.0,250.0,262.0,122.0,177.0,234.0,237.0,226.0,231.0
mean,105.810981,4.88632,5001030.0,85.326578,2.698302,19.996131,20.256882,66605040000.0,2724524000000.0,20763.986057,...,6770.556134,1.842963,53.240934,307230200.0,1.119049,16.423394,10.390486,25.167851,12.120341,56.529177
std,38.27371,6.068998,14855650.0,24.073241,1.266155,26.295054,23.31398,231428400000.0,8998200000000.0,21391.18102,...,25801.12673,1.256072,27.188752,970013400.0,0.968676,6.167757,10.153936,10.053544,6.895775,11.557682
min,13.711248,0.0447,2.0,9.3,1.052,0.0,0.5,-39482280000.0,40620560.0,737.978583,...,0.0,0.0,1.308907,11508.0,0.01248,0.057734,0.030928,0.003733,0.556843,22.943957
25%,84.08889,0.880427,18280.0,79.960797,1.73925,1.148058,8.15,247679200.0,9723166000.0,5117.754563,...,62.423065,1.103039,29.679294,1638404.0,0.344247,12.336507,2.382371,18.257102,7.358159,49.912379
50%,107.353111,3.091317,183630.0,99.985047,2.256,8.132799,15.0,1577276000.0,57331770000.0,13859.203384,...,379.56276,1.546464,55.619054,10232470.0,0.832685,15.589575,6.906123,24.141872,11.734975,55.525754
75%,125.470705,6.391435,1213090.0,100.0,3.54975,27.240213,23.933333,16256220000.0,615923700000.0,28798.642557,...,3721.222154,2.277198,76.669597,57414300.0,1.668162,20.610425,15.6897,30.7483,15.536211,63.355297
max,328.799953,43.857308,127354600.0,100.0,7.184,99.913384,230.0,2085130000000.0,80885580000000.0,124609.30414,...,355546.539541,8.774719,98.871436,7594270000.0,4.26895,33.921623,58.208741,57.289364,48.442878,91.693804


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
cellph_per100,248.0,105.81,38.27,13.71,84.09,107.35,125.47,328.8
co2_perc,250.0,4.89,6.07,0.04,0.88,3.09,6.39,43.86
land_area,261.0,5001030.0,14855650.0,2.0,18280.0,183630.0,1213090.0,127354600.0
electricity,261.0,85.33,24.07,9.3,79.96,99.99,100.0,100.0
fertility,246.0,2.7,1.27,1.05,1.74,2.26,3.55,7.18
fuel_exps_pc,200.0,20.0,26.3,0.0,1.15,8.13,27.24,99.91
start_business,235.0,20.26,23.31,0.5,8.15,15.0,23.93,230.0
foreign_inv,244.0,66605040000.0,231428400000.0,-39482280000.0,247679200.0,1577276000.0,16256220000.0,2085130000000.0
gdp,246.0,2724524000000.0,8998200000000.0,40620560.0,9723166000.0,57331770000.0,615923700000.0,80885580000000.0
gdp_pc,237.0,20763.99,21391.18,737.98,5117.75,13859.2,28798.64,124609.3


In [11]:
# To see how many observations we have for each category of a column (especially useful for categorical variables):
print('Regiones:')
print(data.Region.value_counts(dropna=False), '\n') # dropna=False also shows how many observations are NaN

print('Grupos de ingreso:')
print(data.IncomeGroup.value_counts(dropna=False))


Regiones:
Europe & Central Asia         58
Sub-Saharan Africa            48
NaN                           46
Latin America & Caribbean     42
East Asia & Pacific           37
Middle East & North Africa    21
South Asia                     8
North America                  3
Name: Region, dtype: int64 

Grupos de ingreso:
high         79
upper_mid    60
lower_mid    47
NaN          46
low          31
Name: IncomeGroup, dtype: int64
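
When proportions are more informative than raw counts, value_counts also accepts normalize=True. A minimal sketch on a hypothetical toy Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical income groups with one missing value
inc_group = pd.Series(['high', 'low', 'high', np.nan])

# normalize=True returns shares instead of counts;
# dropna=False keeps NaN as its own category
shares = inc_group.value_counts(dropna=False, normalize=True)
print(shares)
```

With dropna=False the shares add up to 1 over all rows, including the missing ones.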


----

## How to rename columns

In [12]:
# To see the column names
for col in data.columns: print(col)

Country Name
Country Code
Region
IncomeGroup
cellph_per100
co2_perc
land_area
electricity
fertility
fuel_exps_pc
start_business
foreign_inv
gdp
gdp_pc
growth_gdp_pc
hiv_prev
servs_mill
mil_exp
internet_users
population
rd_exp
taxRev_pc
val_primary
val_industry
val_manufacturing
val_services
region_code


The names do not all follow the same format: some contain uppercase letters and others contain spaces. Removing the spaces is particularly useful, because a column whose name contains spaces cannot be accessed with attribute notation (e.g. data.country).

In [13]:
# Create a dictionary whose keys are the names of the columns we want to
# change, and whose values are their new names
dict_rename = {'Country Name': 'country', 'Country Code': 'code', 'Region': 'region', 'IncomeGroup':'inc_group'}

# Apply the DataFrame's rename method to rename the columns
data.rename(columns = dict_rename, inplace=True)
# inplace=True applies the operation to the dataframe itself,
# so there is no need to reassign it explicitly
data.head(3) 

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf
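
When many columns need the same cleanup, an alternative to writing the dictionary by hand is to transform all the names at once. The following is a sketch on a hypothetical empty DataFrame; note that it produces names such as country_name rather than the shorter names chosen above, so it is not a drop-in replacement for the cell above.

```python
import pandas as pd

# Hypothetical DataFrame that only carries column names
toy = pd.DataFrame(columns=['Country Name', 'Country Code', 'IncomeGroup', 'gdp_pc'])

# Index.str applies string methods to every column name at once:
# trim whitespace, lowercase, and replace spaces with underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(' ', '_')
print(list(toy.columns))
```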


-----

## How to create and work with subsets of a DataFrame

### *iloc* and *loc*

In this section we will work with *loc* and *iloc* from *pandas*, although they are not the only way to create subsets of our data.

*loc* is used when we want to select rows and columns by their labels, whereas *iloc* is used when we want to select them by their integer position.

In [14]:
# Suppose we want to work with rows 30 to 35 and all columns
data.loc[30:35, :] # Note that loc includes the stop value of the range (here, 35)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
30,Bhutan,BTN,South Asia,lower_mid,90.467305,1.39223,38144.0,97.7,2.02,,...,,48.106416,754394.0,,12.756118,16.644481,41.387424,7.4321,37.366721,SAs
31,Botswana,BWA,Sub-Saharan Africa,upper_mid,141.40787,3.367451,566730.0,62.824947,2.683,0.488135,...,2.782322,41.413795,2254126.0,,24.687495,2.049733,32.094695,5.194247,56.972556,SSAf
32,Central African Republic,CAF,Sub-Saharan Africa,low,25.227921,0.067357,622980.0,29.982038,4.796,,...,1.40861,4.339255,4666377.0,,5.82106,31.379401,19.910122,17.774796,31.05982,SSAf
33,Canada,CAN,North America,high,86.535681,15.158927,9093510.0,100.0,1.4961,29.921805,...,1.252629,92.701372,37058856.0,1.6493,12.418604,,,,,NAm
34,Central Europe and the Baltics,CEB,,,123.778004,6.148883,1105054.0,100.0,1.511663,4.995959,...,1.644244,73.435759,102511922.0,1.146878,17.557023,2.882645,28.969249,19.019444,56.274348,
35,Switzerland,CHE,Europe & Central Asia,high,130.821914,4.311563,39516.0,100.0,1.54,1.537855,...,0.675557,93.713866,8516543.0,3.3743,9.841604,0.664239,24.829443,17.913815,71.435943,Eu_CAs


In [15]:
# If we want rows 30 to 35 and only certain columns
data.loc[30:35, ['country', 'inc_group', 'co2_perc']]

Unnamed: 0,country,inc_group,co2_perc
30,Bhutan,lower_mid,1.39223
31,Botswana,upper_mid,3.367451
32,Central African Republic,low,0.067357
33,Canada,high,15.158927
34,Central Europe and the Baltics,,6.148883
35,Switzerland,high,4.311563


In [16]:
# If we want to work with all the values of a specific column ('country')
data.loc[:, ['country']].head()

Unnamed: 0,country
0,Aruba
1,Afghanistan
2,Angola
3,Albania
4,Andorra


In [17]:
# To work with a specific column using iloc we have to specify its position
data.iloc[:, [0]].head() # country is the first column in the list (index 0)

Unnamed: 0,country
0,Aruba
1,Afghanistan
2,Angola
3,Albania
4,Andorra


In [18]:
# To get rows 30 to 35 and all columns with iloc
data.iloc[30:36, :] # Note that, unlike loc, iloc excludes the stop value of the range

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
30,Bhutan,BTN,South Asia,lower_mid,90.467305,1.39223,38144.0,97.7,2.02,,...,,48.106416,754394.0,,12.756118,16.644481,41.387424,7.4321,37.366721,SAs
31,Botswana,BWA,Sub-Saharan Africa,upper_mid,141.40787,3.367451,566730.0,62.824947,2.683,0.488135,...,2.782322,41.413795,2254126.0,,24.687495,2.049733,32.094695,5.194247,56.972556,SSAf
32,Central African Republic,CAF,Sub-Saharan Africa,low,25.227921,0.067357,622980.0,29.982038,4.796,,...,1.40861,4.339255,4666377.0,,5.82106,31.379401,19.910122,17.774796,31.05982,SSAf
33,Canada,CAN,North America,high,86.535681,15.158927,9093510.0,100.0,1.4961,29.921805,...,1.252629,92.701372,37058856.0,1.6493,12.418604,,,,,NAm
34,Central Europe and the Baltics,CEB,,,123.778004,6.148883,1105054.0,100.0,1.511663,4.995959,...,1.644244,73.435759,102511922.0,1.146878,17.557023,2.882645,28.969249,19.019444,56.274348,
35,Switzerland,CHE,Europe & Central Asia,high,130.821914,4.311563,39516.0,100.0,1.54,1.537855,...,0.675557,93.713866,8516543.0,3.3743,9.841604,0.664239,24.829443,17.913815,71.435943,Eu_CAs
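
In the examples above, labels and positions coincide because the index is the default 0, 1, 2, ... range. The difference between *loc* and *iloc* becomes visible once the index has gaps, for example after filtering. A minimal sketch with a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical subset whose index labels are no longer consecutive
toy = pd.DataFrame(
    {'country': ['Bhutan', 'Botswana', 'Canada', 'Switzerland']},
    index=[30, 31, 33, 35],
)

by_label = toy.loc[30:33]      # labels 30 through 33, stop included
by_position = toy.iloc[0:2]    # positions 0 and 1, stop excluded

print(by_label)
print(by_position)
```

Here *loc* returns three rows (labels 30, 31, and 33), while *iloc* returns the first two rows regardless of their labels.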


### Subsets from boolean filters
As in *numpy*, in *pandas* we can also create filters based on values (in this case, the values in the "cells" of the dataframes).

In [19]:
data.head(3)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf


In [20]:
# Suppose we want to work with fertility rates and
# GDP per capita only for South Asian countries (in the 'region' column)

# It helps to print the columns again to see which ones we need
data.columns.values

array(['country', 'code', 'region', 'inc_group', 'cellph_per100',
       'co2_perc', 'land_area', 'electricity', 'fertility',
       'fuel_exps_pc', 'start_business', 'foreign_inv', 'gdp', 'gdp_pc',
       'growth_gdp_pc', 'hiv_prev', 'servs_mill', 'mil_exp',
       'internet_users', 'population', 'rd_exp', 'taxRev_pc',
       'val_primary', 'val_industry', 'val_manufacturing', 'val_services',
       'region_code'], dtype=object)

In [21]:
# Use the following method to see exactly what the South Asia region is called in our dataset
data.region.unique() # 'South Asia'

array(['Latin America & Caribbean', 'South Asia', 'Sub-Saharan Africa',
       'Europe & Central Asia', nan, 'Middle East & North Africa',
       'East Asia & Pacific', 'North America'], dtype=object)

In [22]:
data.loc[(data.region == 'South Asia'), ['country', 'fertility', 'gdp_pc']]

Unnamed: 0,country,fertility,gdp_pc
1,Afghanistan,4.477,1934.636754
18,Bangladesh,2.076,3998.419424
30,Bhutan,2.02,10173.051069
107,India,2.304,7168.992551
135,Sri Lanka,2.032,12878.588671
149,Maldives,2.052,14668.693033
175,Nepal,2.083,2866.542016
181,Pakistan,3.414,5249.206365


In [23]:
# We can also combine multiple filters.
# Suppose we want to work with all the countries in the Americas and the same variables as in the previous example
filter = (data.region == 'Latin America & Caribbean') | (data.region == 'North America')

# List of the indicators we are interested in
indicator_list = ['country', 'fertility', 'gdp_pc'] 
data.loc[filter, indicator_list]\
    .head(10)

Unnamed: 0,country,fertility,gdp_pc
0,Aruba,1.798,39454.629831
7,Argentina,2.277,20843.155068
10,Antigua and Barbuda,2.04,25145.541229
21,"Bahamas, The",1.758,31581.10439
24,Belize,2.475,8500.4448
25,Bermuda,1.61,
26,Bolivia,2.839,7480.077141
27,Brazil,1.711,15662.247018
28,Barbados,1.799,18526.00856
33,Canada,1.4961,46723.317764


In [24]:
# Another example
# Suppose we want to work with the military expenditure of high-income countries
# whose fertility rate is greater than 1.9
print(data.inc_group.unique())
filter = (data.fertility > 1.9) & (data.inc_group == 'high')

data.loc[filter, ['country', 'fertility',  'mil_exp']]\
    .sort_values(by=['fertility'], ascending = False)\
    .head(10)

['high' 'low' 'lower_mid' 'upper_mid' nan]


Unnamed: 0,country,fertility,mil_exp
223,Seychelles,3.63,1.44422
112,Israel,3.11,4.348101
179,Oman,2.592,8.1713
76,Faroe Islands,2.5,
202,Saudi Arabia,2.491,8.774719
182,Panama,2.487,0.0
91,Guam,2.328,
89,Greenland,2.09,
253,Virgin Islands (U.S.),2.08,
10,Antigua and Barbuda,2.04,
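
The same kind of combined filter can also be written as a string expression with the DataFrame method query, which some readers find easier to scan. The sketch below uses a hypothetical toy DataFrame (invented values) and shows that both forms select the same rows:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'inc_group': ['high', 'high', 'low', 'high'],
    'fertility': [2.5, 1.5, 4.0, 2.0],
})

# Boolean mask: each condition needs its own parentheses around & and |
mask = (toy.fertility > 1.9) & (toy.inc_group == 'high')
with_mask = toy.loc[mask, ['country', 'fertility']]

# Equivalent selection written as a query string
with_query = toy.query("fertility > 1.9 and inc_group == 'high'")[['country', 'fertility']]

print(with_query)
```

The parentheses in the mask version are required because & and | bind more tightly than comparison operators in Python.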


In [25]:
# If we want all the values for Mexico, Guatemala, and Belize
filter = data.country.isin(['Mexico', 'Guatemala', 'Belize']) # A practical way to build the filter in this case
data.loc[filter, :]

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
24,Belize,BLZ,Latin America & Caribbean,upper_mid,63.905295,1.400941,22810.0,98.265121,2.475,16.658267,...,1.262456,47.082626,383071.0,,24.519991,10.23875,13.828479,6.503924,62.886725,LAm_Ca
90,Guatemala,GTM,Latin America & Caribbean,upper_mid,118.168791,1.151001,107160.0,93.288094,2.92,6.561547,...,0.352941,40.703049,17247807.0,0.02988,10.21309,10.00943,26.025469,18.331609,61.475605,LAm_Ca
151,Mexico,MEX,Latin America & Caribbean,upper_mid,88.515371,3.990446,1943950.0,100.0,2.153,10.601613,...,0.538884,63.852249,126190788.0,0.52419,12.796622,3.349548,29.508655,16.975577,60.886458,LAm_Ca


In [26]:
# Note: comparing a pandas Series against a value produces a Series of the same size, but with boolean values

filter = data.cellph_per100 > 100
print(data.cellph_per100.shape, '\n')
print(data.cellph_per100[:5], '\n')

print(filter.shape, '\n')
print(filter[:5])

(263,) 

0           NaN
1     67.350573
2     44.734977
3    123.736096
4    104.381212
Name: cellph_per100, dtype: float64 

(263,) 

0    False
1    False
2    False
3     True
4     True
Name: cellph_per100, dtype: bool
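
Note that row 0 in the output above is NaN and evaluates to False. This holds for every ordering comparison, so missing values are silently excluded from both a filter and its "opposite" comparison. A minimal sketch on a hypothetical toy Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the first one is missing
cellph = pd.Series([np.nan, 67.35, 123.74])

above = cellph > 100
not_above = cellph <= 100

# The NaN row is False in both masks
print(above.tolist())      # [False, False, True]
print(not_above.tolist())  # [False, True, False]

# Negating a mask does include the NaN row, so ~above differs from not_above
print((~above).tolist())   # [True, True, False]

# Select the missing rows explicitly with isna()
print(cellph[cellph.isna()])
```

When the missing rows matter, it is safer to handle them explicitly with isna() or notna() rather than rely on a comparison.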


----

## Handling missing data in a dataset

In [27]:
data.head(3)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf


In [28]:
# See which values are read as NaN:
data.head().isnull() # isnull() is a method that returns a boolean dataframe with the same dimensions

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,False,False,False,False,True,False,False,False,False,False,...,True,False,False,True,True,True,True,True,True,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,False,True,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,True,False,...,True,False,False,True,True,False,False,False,False,False


In [29]:
# Now see which values are read as non-NaN:
data.head().notnull() # notnull() is a method that returns a boolean dataframe with the same dimensions

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,True,True,True,True,False,True,True,True,True,True,...,False,True,True,False,False,False,False,False,False,True
1,True,True,True,True,True,True,True,True,True,False,...,True,True,True,False,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,...,True,True,True,False,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,False,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,False,True,...,False,True,True,False,False,True,True,True,True,True


In [30]:
# See which columns have at least one NaN value:
data.isnull().any()

country              False
code                 False
region                True
inc_group             True
cellph_per100         True
co2_perc              True
land_area             True
electricity           True
fertility             True
fuel_exps_pc          True
start_business        True
foreign_inv           True
gdp                   True
gdp_pc                True
growth_gdp_pc         True
hiv_prev              True
servs_mill            True
mil_exp               True
internet_users        True
population            True
rd_exp                True
taxRev_pc             True
val_primary           True
val_industry          True
val_manufacturing     True
val_services          True
region_code           True
dtype: bool

In [31]:
# See how many NaN values each column has:
data.isnull().sum()

country                0
code                   0
region                46
inc_group             46
cellph_per100         15
co2_perc              13
land_area              2
electricity            2
fertility             17
fuel_exps_pc          63
start_business        28
foreign_inv           19
gdp                   17
gdp_pc                26
growth_gdp_pc         17
hiv_prev              95
servs_mill             4
mil_exp               70
internet_users        13
population             1
rd_exp               141
taxRev_pc             86
val_primary           29
val_industry          26
val_manufacturing     37
val_services          32
region_code           46
dtype: int64
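
Counts are easier to compare as shares of the total number of rows. Because True counts as 1 and False as 0, the mean of the boolean dataframe gives the fraction of missing values per column. A minimal sketch on a hypothetical toy DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values
toy = pd.DataFrame({
    'rd_exp': [np.nan, np.nan, 1.2, 0.8],
    'gdp': [1.0, np.nan, 3.0, 4.0],
    'country': ['A', 'B', 'C', 'D'],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting puts the columns with the most missing data at the top, which helps when deciding which indicators are usable.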

In [32]:
# We may want to know, for example, which countries have missing values (NaN) for the region
# To do so, we can use isnull() or notnull() as filters
print(data.region.isnull()[:5]) # Returns a boolean Series, so we can use it to build the filter
display(data.loc[data.region.isnull(), :].head(10)) # these turn out to be categories or groups of countries

0    False
1    False
2    False
3    False
4    False
Name: region, dtype: bool


Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
5,Arab World,ARB,,,100.423541,4.886988,11232650.0,90.273687,3.270962,70.116464,...,5.638102,48.847621,419790600.0,,5.316772,5.478197,39.170952,13.695738,54.104593,
34,Central Europe and the Baltics,CEB,,,123.778004,6.148883,1105054.0,100.0,1.511663,4.995959,...,1.644244,73.435759,102511900.0,1.146878,17.557023,2.882645,28.969249,19.019444,56.274348,
47,Caribbean small states,CSS,,,112.91841,8.87122,404850.0,98.383731,2.018646,41.843778,...,1.074941,57.398375,7358965.0,,23.502316,3.679661,24.080113,10.110291,63.734256,
59,East Asia & Pacific (excluding high income),EAP,,,114.295768,5.776128,15913500.0,97.591622,1.849218,5.826903,...,1.749066,50.787621,2081652000.0,1.958267,10.123983,8.952254,39.250348,27.279774,51.27709,
60,Early-demographic dividend,EAR,,,97.42971,2.298216,33107750.0,89.68483,2.509145,30.969984,...,2.34752,37.858569,3249141000.0,,12.9922,9.691805,29.841406,15.709186,53.442,
61,East Asia & Pacific,EAS,,,116.35219,6.294256,24396750.0,97.825556,1.801326,7.348907,...,1.667271,55.0954,2328221000.0,2.381415,11.885324,4.948481,33.383742,23.02624,59.984479,
62,Europe & Central Asia (excluding high income),ECA,,,127.274515,7.421073,22622180.0,99.985047,1.928217,48.550861,...,3.097537,66.384578,417797300.0,0.852931,14.108796,5.570165,28.913346,13.695063,54.728906,
63,Europe & Central Asia,ECS,,,125.421698,6.915939,27440110.0,99.993208,1.753149,12.951708,...,1.704113,74.911855,918793600.0,1.921355,19.160311,1.966487,23.036173,14.185818,64.601386,
66,Euro area,EMU,,,124.594419,6.474921,2678924.0,100.0,1.563142,6.527045,...,1.448541,79.372703,341783200.0,2.137483,18.916993,1.512159,22.261729,15.09843,65.930363,
71,European Union,EUU,,,123.524777,6.379149,4238694.0,100.0,1.589699,6.783763,...,1.498245,80.642489,513213400.0,2.038779,20.209907,1.443517,21.981951,14.38863,66.003159,
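
Since the rows without a region appear to be groups of countries rather than individual countries, one option is to split the data into two DataFrames. The sketch below uses a hypothetical toy DataFrame and assumes that every row with a missing region is an aggregate, which should be verified on the real data before relying on it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries and two aggregates
toy = pd.DataFrame({
    'country': ['Aruba', 'Arab World', 'Angola', 'Euro area'],
    'region': ['Latin America & Caribbean', np.nan, 'Sub-Saharan Africa', np.nan],
})

# Assumption: a missing region marks an aggregate row
is_aggregate = toy.region.isnull()

# .copy() gives independent DataFrames that can be modified safely
countries = toy.loc[~is_aggregate].copy()
aggregates = toy.loc[is_aggregate].copy()

print(countries)
print(aggregates)
```

Keeping aggregates out of the country-level data avoids double counting in statistics such as sums or means.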


In [33]:
# If we want to drop every row that has at least one missing value:
data.dropna(how='any')[:5]
# Note: to replace the original DataFrame with the new one, pass the argument inplace=True
# (we did not do so here)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
7,Argentina,ARG,Latin America & Caribbean,upper_mid,139.8146,4.781508,2736690.0,100.0,2.277,4.751054,...,0.854561,75.809744,44494502.0,0.61408,12.336507,6.264566,22.054107,13.488026,56.122236,LAm_Ca
8,Armenia,ARM,Europe & Central Asia,upper_mid,119.043969,1.898719,28470.0,100.0,1.604,6.339751,...,4.778338,69.718125,2951776.0,0.25002,20.931221,16.390474,25.577133,10.28465,49.910509,Eu_CAs
11,Australia,AUS,East Asia & Pacific,high,112.688621,15.388766,7692020.0,100.0,1.765,28.019409,...,1.89156,86.544721,24992369.0,1.92296,22.002263,2.387302,22.327603,6.087334,68.258542,EAs_Pa
12,Austria,AUT,Europe & Central Asia,high,170.847923,6.869868,82523.0,100.0,1.53,2.386196,...,0.735993,87.935587,8847037.0,3.04771,26.827003,1.112721,25.201835,16.709604,62.837454,Eu_CAs
13,Azerbaijan,AZE,Europe & Central Asia,upper_mid,103.046637,3.931561,82670.0,100.0,1.9,92.638413,...,3.773694,79.0,9942334.0,0.22232,15.603134,5.605112,47.557973,4.930228,39.322336,Eu_CAs


In [34]:
# If we want to drop the rows in which all values are missing
data.dropna(how='all')[:5]
# In this case nothing changes, because 'country' and 'code' have no missing values, so no row is entirely NaN

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf
3,Albania,ALB,Europe & Central Asia,upper_mid,123.736096,1.978763,27400.0,100.0,1.71,1.567523,...,1.178901,71.847041,2866376.0,,18.51579,19.849978,21.141999,5.684427,46.697169,Eu_CAs
4,Andorra,AND,Europe & Central Asia,high,104.381212,5.832906,470.0,100.0,,0.122804,...,,98.871436,77006.0,,,0.492101,9.844719,3.207886,79.204103,Eu_CAs


In [35]:
# We can also restrict dropna() to a subset of columns:
# Suppose we want to drop every observation that has one or more missing values in 'inc_group',
# 'cellph_per100', 'co2_perc', and 'fertility'
data.dropna(subset=['inc_group', 'cellph_per100', 'co2_perc', 'fertility'], how='any')[:5]

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf
3,Albania,ALB,Europe & Central Asia,upper_mid,123.736096,1.978763,27400.0,100.0,1.71,1.567523,...,1.178901,71.847041,2866376.0,,18.51579,19.849978,21.141999,5.684427,46.697169,Eu_CAs
6,United Arab Emirates,ARE,Middle East & North Africa,high,210.914023,22.939606,71020.0,100.0,1.731,42.49624,...,,94.819923,9630959.0,0.89504,0.057734,0.776055,41.448681,8.984418,57.775303,MEa_NAf
7,Argentina,ARG,Latin America & Caribbean,upper_mid,139.8146,4.781508,2736690.0,100.0,2.277,4.751054,...,0.854561,75.809744,44494502.0,0.61408,12.336507,6.264566,22.054107,13.488026,56.122236,LAm_Ca
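
Between how='any' and how='all' there is a middle ground: the thresh argument of dropna keeps only the rows that have at least a given number of non-missing values. A minimal sketch comparing the three options on a hypothetical toy DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 has one NaN, row 2 is all NaN
toy = pd.DataFrame({
    'x': [1.0, np.nan, np.nan],
    'y': [1.0, 2.0, np.nan],
    'z': [1.0, 2.0, np.nan],
})

n_any = len(toy.dropna(how='any'))    # keeps only complete rows
n_all = len(toy.dropna(how='all'))    # drops only rows that are entirely NaN
n_thresh = len(toy.dropna(thresh=2))  # keeps rows with at least 2 non-NaN values

print(n_any, n_all, n_thresh, len(toy))
```

None of these calls modifies toy, because dropna returns a new DataFrame unless inplace=True is passed.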


In [36]:
# To replace missing values with a specific value for each column:
data.fillna(value={'cellph_per100':'FALTA CEL', 'fuel_exps_pc':'FALTA FUEL'})
# Remember: fillna() returns a new DataFrame; to overwrite the original one, pass inplace=True (or assign the result back to data)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,FALTA CEL,8.410064,1.800000e+02,100.000000,1.798000,0.250657,...,,97.170000,1.058450e+05,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.3506,0.293946,6.528600e+05,97.700000,4.477000,FALTA FUEL,...,0.984561,11.447688,3.717239e+07,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.735,1.290307,1.246700e+06,41.886230,5.623000,96.1944,...,1.777138,14.339079,3.080976e+07,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf
3,Albania,ALB,Europe & Central Asia,upper_mid,123.736,1.978763,2.740000e+04,100.000000,1.710000,1.56752,...,1.178901,71.847041,2.866376e+06,,18.515790,19.849978,21.141999,5.684427,46.697169,Eu_CAs
4,Andorra,AND,Europe & Central Asia,high,104.381,5.832906,4.700000e+02,100.000000,,0.122804,...,,98.871436,7.700600e+04,,,0.492101,9.844719,3.207886,79.204103,Eu_CAs
5,Arab World,ARB,,,100.424,4.886988,1.123265e+07,90.273687,3.270962,70.1165,...,5.638102,48.847621,4.197906e+08,,5.316772,5.478197,39.170952,13.695738,54.104593,
6,United Arab Emirates,ARE,Middle East & North Africa,high,210.914,22.939606,7.102000e+04,100.000000,1.731000,42.4962,...,,94.819923,9.630959e+06,0.895040,0.057734,0.776055,41.448681,8.984418,57.775303,MEa_NAf
7,Argentina,ARG,Latin America & Caribbean,upper_mid,139.815,4.781508,2.736690e+06,100.000000,2.277000,4.75105,...,0.854561,75.809744,4.449450e+07,0.614080,12.336507,6.264566,22.054107,13.488026,56.122236,LAm_Ca
8,Armenia,ARM,Europe & Central Asia,upper_mid,119.044,1.898719,2.847000e+04,100.000000,1.604000,6.33975,...,4.778338,69.718125,2.951776e+06,0.250020,20.931221,16.390474,25.577133,10.284650,49.910509,Eu_CAs
9,American Samoa,ASM,East Asia & Pacific,upper_mid,FALTA CEL,,2.000000e+02,,,FALTA FUEL,...,,,5.546500e+04,,,,,,,EAs_Pa
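
Filling a numeric column with a text placeholder, as above, leaves that column holding a mix of numbers and strings. If the column has to stay numeric, one possible alternative is to pass numeric fill values in the same dictionary form. The sketch below uses a hypothetical `toy` DataFrame; filling with the column mean and with `0.0` is only an illustrative assumption, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from indicadores.csv)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'cellph_per100': [60.0, np.nan, 40.0],
    'fuel_exps_pc': [np.nan, 96.0, 1.5],
})

# One fill value per column: the column mean for one, a constant for the other
fill_values = {
    'cellph_per100': toy['cellph_per100'].mean(),
    'fuel_exps_pc': 0.0,
}

# fillna() returns a new DataFrame; toy itself is left unchanged
filled = toy.fillna(value=fill_values)

print(filled)
print(filled.dtypes)
```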


In [37]:
# If we want a dataset made up strictly of individual countries,
# we need to drop the observations that refer to regions (aggregates such as 'Arab World' have no value in 'region')
data.dropna(subset=['region'], how='any', inplace=True)
# Note that we pass inplace=True to overwrite the original DataFrame
data.isnull().sum()

country                0
code                   0
region                 0
inc_group              0
cellph_per100         15
co2_perc              13
land_area              2
electricity            2
fertility             17
fuel_exps_pc          60
start_business        28
foreign_inv           19
gdp                   17
gdp_pc                26
growth_gdp_pc         17
hiv_prev              81
servs_mill             4
mil_exp               67
internet_users        13
population             1
rd_exp               120
taxRev_pc             76
val_primary           29
val_industry          26
val_manufacturing     36
val_services          31
region_code            0
dtype: int64

In [38]:
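# Let's see which observation has no value for the population variable
data.loc[data.population.isnull(), :]
# Note: the selection above is not displayed, because a cell only shows its last expression; wrap it in display(...) to see it
# Then we drop that observation
data.dropna(subset=['population'], how='any', inplace=True)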
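
The inspect-then-drop pattern can be sketched on a small hypothetical DataFrame (`toy` and its values are invented for illustration). The boolean mask is stored first so the affected rows can be reviewed before they are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from indicadores.csv)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'region': ['South Asia', 'North America', 'South Asia'],
    'population': [1000.0, np.nan, 3000.0],
})

# Boolean mask: True where population is missing
missing_pop = toy['population'].isnull()

# Look at the affected rows first ...
rows_to_drop = toy.loc[missing_pop, :]
print(rows_to_drop)

# ... then drop them; assigning the result back has the same effect as inplace=True
toy = toy.dropna(subset=['population'])
print(toy.shape)
```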

#### Export the countries-only dataset (to be used later in the visualization section)

In [39]:
# data.to_csv('data/indicadores_clean.csv', index=False)

-----

## How to group observations (groupby)

In [40]:
data.head(2)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs


In [41]:
# Suppose we want to compute what percentage of regional GDP each country contributes and
# what percentage of the regional population lives in each country

#### groupby() con map()

In [41]:
# Suppose we want to create a column that captures how important a country is (in terms of population) within its region...

# Group by region and sum the population column
p1 = data.groupby(by=['region']) \
         .population \
         .sum() # other common methods to use with groupby are mean(), count(), max() and min()

# Pass the population by region to the map method: each row's region is replaced by its regional total
p2 = data.region.map(p1) # shift + tab to see how map works

# Divide each observation's population by its regional population
data['pop_pc_reg'] = 100*(data.population/p2) # Multiply by 100 so the share is expressed as a percentage (between 0 and 100)

# It can also all be done in a single line...
data['pop_pc_reg'] = 100*(data.population/data.region.map(data.groupby(by=['region']).population.sum()))

data.head()

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code,pop_pc_reg
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,97.17,105845.0,,,,,,,LAm_Ca,0.016503
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs,2.048755
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf,2.866414
3,Albania,ALB,Europe & Central Asia,upper_mid,123.736096,1.978763,27400.0,100.0,1.71,1.567523,...,71.847041,2866376.0,,18.51579,19.849978,21.141999,5.684427,46.697169,Eu_CAs,0.311972
4,Andorra,AND,Europe & Central Asia,high,104.381212,5.832906,470.0,100.0,,0.122804,...,98.871436,77006.0,,,0.492101,9.844719,3.207886,79.204103,Eu_CAs,0.008381
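
The groupby-plus-map steps above can be checked on a small hypothetical DataFrame where the shares are easy to verify by hand (`toy` and its values are invented for illustration). The sketch also shows `transform('sum')`, an existing pandas method that returns the group total already aligned with the original rows, as one possible alternative to `map`.

```python
import pandas as pd

# Hypothetical toy data (not taken from indicadores.csv)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'region': ['North', 'North', 'South', 'South'],
    'population': [30.0, 70.0, 10.0, 40.0],
})

# Route 1 (as in the cell above): aggregate, then map the totals back onto the rows
regional_total = toy.groupby('region')['population'].sum()
share_map = 100 * toy['population'] / toy['region'].map(regional_total)

# Route 2: transform('sum') broadcasts each group's total to its rows directly
share_transform = 100 * toy['population'] / toy.groupby('region')['population'].transform('sum')

print(share_map.tolist())
print(share_transform.tolist())
```

Both routes give the same percentages, and within each region they add up to 100.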


In [42]:
# Drop the new column we just created (it was only there to illustrate the case)
data.drop(columns=['pop_pc_reg'], inplace=True, errors='ignore') # errors='ignore' avoids raising an error if the column does not exist
data.head(2)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs


#### groupby() con agg()

In [43]:
# agg() lets us compute several aggregation methods at once for each group
data.groupby(by=['region']) \
    .fertility \
    .agg(['count', 'min', 'max', 'mean'])

Unnamed: 0_level_0,count,min,max,mean
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
East Asia & Pacific,31,1.052,5.391,2.414871
Europe & Central Asia,53,1.234,3.313,1.75066
Latin America & Caribbean,36,1.101,2.92,2.092111
Middle East & North Africa,21,1.37,4.309,2.589524
North America,3,1.4961,1.7655,1.623867
South Asia,8,2.02,4.477,2.55725
Sub-Saharan Africa,47,1.44,7.184,4.485511
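
When different columns need different aggregations, `agg()` also accepts named aggregations of the form `new_name=(column, function)`. The sketch below uses a hypothetical `toy` DataFrame, and the output column names (`n_fertility`, `mean_fertility`, `total_population`) are made up for illustration. Note that `'count'` only counts non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from indicadores.csv)
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'fertility': [1.5, 2.5, 4.0, np.nan],
    'population': [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: each keyword becomes an output column
summary = toy.groupby('region').agg(
    n_fertility=('fertility', 'count'),       # non-missing values only
    mean_fertility=('fertility', 'mean'),     # NaN is skipped
    total_population=('population', 'sum'),
)

print(summary)
```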


---

## Multi Index (multi level indexing)

In [44]:
data.head(3)

Unnamed: 0,country,code,region,inc_group,cellph_per100,co2_perc,land_area,electricity,fertility,fuel_exps_pc,...,mil_exp,internet_users,population,rd_exp,taxRev_pc,val_primary,val_industry,val_manufacturing,val_services,region_code
0,Aruba,ABW,Latin America & Caribbean,high,,8.410064,180.0,100.0,1.798,0.250657,...,,97.17,105845.0,,,,,,,LAm_Ca
1,Afghanistan,AFG,South Asia,low,67.350573,0.293946,652860.0,97.7,4.477,,...,0.984561,11.447688,37172386.0,,7.585382,21.081086,21.823223,11.370465,52.761719,SAs
2,Angola,AGO,Sub-Saharan Africa,lower_mid,44.734977,1.290307,1246700.0,41.88623,5.623,96.194403,...,1.777138,14.339079,30809762.0,,11.002019,9.831169,42.643567,6.752681,46.980994,SSAf


#### Series with a multi index

In [44]:
# Let's first see how a Series with a multi index is created. Here we do it with groupby and mean
ser = data.groupby(by=['inc_group', 'region_code']).cellph_per100.mean()
# In this case the multi index adds a dimension to the data: each value is identified by an (inc_group, region_code) pair.

In [45]:
ser

inc_group  region_code
high       EAs_Pa         162.583723
           Eu_CAs         122.234676
           LAm_Ca         130.918467
           MEa_NAf        153.989765
           NAm            104.391677
           SSAf           176.575150
low        EAs_Pa          14.946472
           Eu_CAs         111.014676
           LAm_Ca          57.424010
           MEa_NAf         69.289247
           SAs             95.262726
           SSAf            63.211292
lower_mid  EAs_Pa          90.475353
           Eu_CAs         105.466179
           LAm_Ca         119.035018
           MEa_NAf         94.581852
           SAs             85.693709
           SSAf            86.678617
upper_mid  EAs_Pa          97.907309
           Eu_CAs         125.971269
           LAm_Ca         104.555010
           MEa_NAf         95.380776
           SAs            170.683213
           SSAf           120.797433
Name: cellph_per100, dtype: float64

In [46]:
# Let's look at what the index looks like
print('class: ', type(ser.index))
ser.index

class:  <class 'pandas.core.indexes.multi.MultiIndex'>


MultiIndex(levels=[['high', 'low', 'lower_mid', 'upper_mid'], ['EAs_Pa', 'Eu_CAs', 'LAm_Ca', 'MEa_NAf', 'NAm', 'SAs', 'SSAf']],
           codes=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], [0, 1, 2, 3, 4, 6, 0, 1, 2, 3, 5, 6, 0, 1, 2, 3, 5, 6, 0, 1, 2, 3, 5, 6]],
           names=['inc_group', 'region_code'])

In [47]:
# To select every element that belongs to inc_group=high
print('SELECCION: \n', ser.loc['high'], '\n')
print(ser.loc['high'].index, '\n')
print('-------------')

print('SELECCIÓN: \n', ser.loc['high', :], '\n')
print(ser.loc['high', :].index)
# In the first case .index returns an Index object (the first level is dropped); in the second it returns a MultiIndex

SELECCION: 
 region_code
EAs_Pa     162.583723
Eu_CAs     122.234676
LAm_Ca     130.918467
MEa_NAf    153.989765
NAm        104.391677
SSAf       176.575150
Name: cellph_per100, dtype: float64 

Index(['EAs_Pa', 'Eu_CAs', 'LAm_Ca', 'MEa_NAf', 'NAm', 'SSAf'], dtype='object', name='region_code') 

-------------
SELECCIÓN: 
 inc_group  region_code
high       EAs_Pa         162.583723
           Eu_CAs         122.234676
           LAm_Ca         130.918467
           MEa_NAf        153.989765
           NAm            104.391677
           SSAf           176.575150
Name: cellph_per100, dtype: float64 

MultiIndex(levels=[['high', 'low', 'lower_mid', 'upper_mid'], ['EAs_Pa', 'Eu_CAs', 'LAm_Ca', 'MEa_NAf', 'NAm', 'SAs', 'SSAf']],
           codes=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 6]],
           names=['inc_group', 'region_code'])


In [48]:
# To select every observation whose second-level index label is 'EAs_Pa'
ser.loc[:, 'EAs_Pa']
# A slightly more complex selection (only this last expression is displayed by the cell)
ser.loc[['high', 'low'],['EAs_Pa', 'LAm_Ca']]

inc_group  region_code
high       EAs_Pa         162.583723
           LAm_Ca         130.918467
low        EAs_Pa          14.946472
           LAm_Ca          57.424010
Name: cellph_per100, dtype: float64
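
Two other existing pandas methods are often handy with a multi-index Series: `xs()` takes a cross-section on a named level, and `unstack()` pivots one level into columns. The sketch below builds a small hypothetical Series with the same level names as `ser`; the values are rounded stand-ins chosen for illustration.

```python
import pandas as pd

# Hypothetical Series with the same level names as `ser` (values are illustrative)
idx = pd.MultiIndex.from_product(
    [['high', 'low'], ['EAs_Pa', 'LAm_Ca']],
    names=['inc_group', 'region_code'],
)
toy_ser = pd.Series([160.0, 130.0, 15.0, 57.0], index=idx, name='cellph_per100')

# xs(): cross-section on a named level; the selected level is dropped from the result
east_asia = toy_ser.xs('EAs_Pa', level='region_code')

# unstack(): move the inner level into columns, giving a 2-D table
wide = toy_ser.unstack('region_code')

print(east_asia)
print(wide)
```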

#### DataFrames with a multi index

In [49]:
# There are several ways to build a df with a multi index. One example:
data_multi = data.groupby(by=['inc_group', 'region_code'])[['cellph_per100', 'co2_perc', 'fertility']]\
                 .mean()\
                 .sort_index()
data_multi[:10]
# Selecting the numeric columns before mean() avoids averaging text columns such as 'country', which recent pandas versions reject
# an alternative would be to use the set_index(['level1', 'level2', ... ]) method

Unnamed: 0_level_0,Unnamed: 1_level_0,cellph_per100,co2_perc,fertility
inc_group,region_code,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
high,EAs_Pa,162.583723,10.82008,1.621455
high,Eu_CAs,122.234676,7.218047,1.627219
high,LAm_Ca,130.918467,10.761337,1.821833
high,MEa_NAf,153.989765,20.499351,2.143625
high,NAm,104.391677,13.500026,1.623867
high,SSAf,176.57515,5.418678,3.63
low,EAs_Pa,14.946472,1.617371,1.899
low,Eu_CAs,111.014676,0.62873,3.313
low,LAm_Ca,57.42401,0.27114,2.868
low,MEa_NAf,69.289247,1.259767,3.377


In [50]:
data_multi.index

MultiIndex(levels=[['high', 'low', 'lower_mid', 'upper_mid'], ['EAs_Pa', 'Eu_CAs', 'LAm_Ca', 'MEa_NAf', 'NAm', 'SAs', 'SSAf']],
           codes=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], [0, 1, 2, 3, 4, 6, 0, 1, 2, 3, 5, 6, 0, 1, 2, 3, 5, 6, 0, 1, 2, 3, 5, 6]],
           names=['inc_group', 'region_code'])

In [51]:
# Making selections with a multi index
display(data_multi.loc[('high'), :]) # Note: ('high') is just the string 'high' in parentheses; a one-element tuple would be ('high',)
print('type:', type(data_multi.loc[('high'), :])); print('---------------------------------------')

# Specify one label from each level (we have two levels here); this tuple selects a single row, so the result is a Series
display(data_multi.loc[('high','EAs_Pa'),:])
print('type:', type(data_multi.loc[('high','EAs_Pa'),:])); print('---------------------------------------')

# To pick several labels from the first level (we want high and low)
display(data_multi.loc[(['high', 'low']), ['co2_perc', 'fertility']])
print('type:', type(data_multi.loc[(['high', 'low']), ['co2_perc', 'fertility']])); print('---------------------------------------')

# To pick every label of one level while filtering on another (slightly more involved):
display(data_multi.loc[(slice(None), ['EAs_Pa']), :]) # Here we select all columns for every observation whose second-level label is 'EAs_Pa'
print('type:', type(data_multi.loc[(slice(None), ['EAs_Pa']), :])); print('---------------------------------------')

Unnamed: 0_level_0,cellph_per100,co2_perc,fertility
region_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
EAs_Pa,162.583723,10.82008,1.621455
Eu_CAs,122.234676,7.218047,1.627219
LAm_Ca,130.918467,10.761337,1.821833
MEa_NAf,153.989765,20.499351,2.143625
NAm,104.391677,13.500026,1.623867
SSAf,176.57515,5.418678,3.63


type: <class 'pandas.core.frame.DataFrame'>
---------------------------------------


cellph_per100    162.583723
co2_perc          10.820080
fertility          1.621455
Name: (high, EAs_Pa), dtype: float64

type: <class 'pandas.core.series.Series'>
---------------------------------------


Unnamed: 0_level_0,Unnamed: 1_level_0,co2_perc,fertility
inc_group,region_code,Unnamed: 2_level_1,Unnamed: 3_level_1
high,EAs_Pa,10.82008,1.621455
high,Eu_CAs,7.218047,1.627219
high,LAm_Ca,10.761337,1.821833
high,MEa_NAf,20.499351,2.143625
high,NAm,13.500026,1.623867
high,SSAf,5.418678,3.63
low,EAs_Pa,1.617371,1.899
low,Eu_CAs,0.62873,3.313
low,LAm_Ca,0.27114,2.868
low,MEa_NAf,1.259767,3.377


type: <class 'pandas.core.frame.DataFrame'>
---------------------------------------


Unnamed: 0_level_0,Unnamed: 1_level_0,cellph_per100,co2_perc,fertility
inc_group,region_code,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
high,EAs_Pa,162.583723,10.82008,1.621455
low,EAs_Pa,14.946472,1.617371,1.899
lower_mid,EAs_Pa,90.475353,1.30887,3.076538
upper_mid,EAs_Pa,97.907309,3.410301,2.521833


type: <class 'pandas.core.frame.DataFrame'>
---------------------------------------
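
The `slice(None)` selection can be written more readably with `pd.IndexSlice`, or with `xs(..., drop_level=False)`. The sketch below uses a small hypothetical DataFrame with the same level names as `data_multi` (values are illustrative stand-ins). Slicing on a multi index generally expects a sorted index, which is why `sort_index()` is applied first.

```python
import pandas as pd

# Hypothetical DataFrame with the same level names as `data_multi`
idx = pd.MultiIndex.from_product(
    [['high', 'low'], ['EAs_Pa', 'LAm_Ca']],
    names=['inc_group', 'region_code'],
)
toy_multi = pd.DataFrame(
    {'co2_perc': [10.8, 10.7, 1.6, 0.3], 'fertility': [1.6, 1.8, 1.9, 2.9]},
    index=idx,
).sort_index()

# Same selection three ways: every row whose second-level label is 'EAs_Pa'
by_slice = toy_multi.loc[(slice(None), ['EAs_Pa']), :]

ix = pd.IndexSlice
by_ix = toy_multi.loc[ix[:, ['EAs_Pa']], :]

by_xs = toy_multi.xs('EAs_Pa', level='region_code', drop_level=False)

print(by_ix)
```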


## Merge and join of DataFrames

Exercise/homework:
* Look up how a merge is done in pandas (check the documentation, or use shift + tab) --> pd.merge()
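
As a starting point for the exercise, here is a minimal sketch of `pd.merge()` on two small tables built by hand (the tables `countries` and `regions` are hypothetical and are not files from this notebook). It only contrasts `how='left'` with `how='inner'`; the documentation covers the remaining options.

```python
import pandas as pd

# Hypothetical lookup tables built by hand for illustration
countries = pd.DataFrame({
    'country': ['Albania', 'Angola', 'Aruba'],
    'region_code': ['Eu_CAs', 'SSAf', 'LAm_Ca'],
})
regions = pd.DataFrame({
    'region_code': ['Eu_CAs', 'SSAf'],
    'region': ['Europe & Central Asia', 'Sub-Saharan Africa'],
})

# how='left': keep every row of `countries`; unmatched keys get NaN
left_merged = pd.merge(countries, regions, on='region_code', how='left')

# how='inner': keep only the rows whose key appears in both tables
inner_merged = pd.merge(countries, regions, on='region_code', how='inner')

print(left_merged)
print(inner_merged)
```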