# Multiple indexes

An index lets you locate a value in the data much like a set of coordinates.

Structuring the data this way makes it possible to apply mathematical functions at several levels.

[World Bank Open Data](https://data.worldbank.org/)

[Reshape pandas dataframe with melt](https://towardsdatascience.com/reshape-pandas-dataframe-with-melt-in-python-tutorial-and-visualization-29ec1450bb02)

First I will adjust the DataFrame so that it matches the one used in class.

In [2]:
import numpy as np
import pandas as pd

pd.options.display.float_format = '{:_.1f}'.format

# skip the first 4 lines of the file
# keep only the columns of interest
df = pd.read_csv('files/poblacion.csv', 
    skiprows=4, 
    usecols=['Country Name', '2015', '2016', '2017', '2018'])

# reshape dataframe from wide to long form
df = df.melt(id_vars=['Country Name'], 
    var_name='year', 
    value_name='pop')

df.rename(columns={'Country Name':'country'}, inplace=True)

df['country'] = df['country'].astype('string')
df['year'] = df['year'].astype('category')
df['pop'] = df['pop'].astype('Int64') # nullable Int64; plain int64 does not work here (presumably because of missing values)

df.dtypes

country      string
year       category
pop           Int64
dtype: object
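
To make the wide-to-long step easier to follow, here is a minimal sketch of `melt` on a tiny table. The `wide` frame is a hypothetical stand-in for the World Bank file; it only reuses a few values that appear in the outputs of this notebook.

```python
import pandas as pd

# Hypothetical wide table shaped like the CSV: one column per year
wide = pd.DataFrame({
    "Country Name": ["Aruba", "Colombia"],
    "2015": [104339, 47520667],
    "2016": [104865, 48175048],
})

# melt turns each year column into rows: one row per (country, year) pair
long = wide.melt(id_vars=["Country Name"], var_name="year", value_name="pop")
long = long.rename(columns={"Country Name": "country"})

print(long)
```

The year labels end up in a single `year` column, which is what later makes them usable as an index level.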

In [3]:
idx_filtro = df['country'].isin(['Aruba','Colombia'])
df_sample = df[idx_filtro]
df_sample

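|     | country  | year | pop      |
|-----|----------|------|----------|
| 0   | Aruba    | 2015 | 104339   |
| 45  | Colombia | 2015 | 47520667 |
| 266 | Aruba    | 2016 | 104865   |
| 311 | Colombia | 2016 | 48175048 |
| 532 | Aruba    | 2017 | 105361   |
| 577 | Colombia | 2017 | 48909844 |
| 798 | Aruba    | 2018 | 105846   |
| 843 | Colombia | 2018 | 49661056 |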


## Defining the two indexes

From now on the row labels will no longer be the integers shown above but the new index levels we define.

Multiple indexes structure the data in a way that makes it easier to analyze.

In [4]:
df_sample = df_sample.set_index(['country','year']).sort_index()
df_sample

| country  | year | pop      |
|----------|------|----------|
| Aruba    | 2015 | 104339   |
| Aruba    | 2016 | 104865   |
| Aruba    | 2017 | 105361   |
| Aruba    | 2018 | 105846   |
| Colombia | 2015 | 47520667 |
| Colombia | 2016 | 48175048 |
| Colombia | 2017 | 48909844 |
| Colombia | 2018 | 49661056 |
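
As a small sketch of what `set_index` with two columns produces, the example below rebuilds a reduced version of the sample and inspects the resulting index. The `sample` and `indexed` names are illustrative only; the values come from the table above.

```python
import pandas as pd

# Reduced, deliberately unsorted copy of the sample data
sample = pd.DataFrame({
    "country": ["Colombia", "Aruba", "Colombia", "Aruba"],
    "year": ["2016", "2015", "2015", "2016"],
    "pop": [48175048, 104339, 47520667, 104865],
})

# Two columns become the two levels of a MultiIndex; sort_index orders them
indexed = sample.set_index(["country", "year"]).sort_index()

print(indexed.index.names)                    # names of the levels
print(indexed.index.nlevels)                  # number of levels
print(indexed.index.is_monotonic_increasing)  # whether the index is sorted
```

Sorting matters: label slicing on a MultiIndex generally expects a sorted index.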


In [5]:
df_sample.loc['Colombia',:]

| year | pop      |
|------|----------|
| 2015 | 47520667 |
| 2016 | 48175048 |
| 2017 | 48909844 |
| 2018 | 49661056 |


In [6]:
df_sample.loc['Colombia',:].loc['2016',:]

pop    48175048
Name: 2016, dtype: Int64
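
The chained `.loc` calls above can usually be written as a single lookup by passing a tuple with one label per index level. A minimal sketch on a reduced copy of the sample (the `indexed` name is illustrative):

```python
import pandas as pd

indexed = pd.DataFrame({
    "country": ["Aruba", "Aruba", "Colombia", "Colombia"],
    "year": ["2015", "2016", "2015", "2016"],
    "pop": [104339, 104865, 47520667, 48175048],
}).set_index(["country", "year"]).sort_index()

# Partial key: every year of one country (the country level is dropped)
colombia = indexed.loc["Colombia", :]

# Full key as a tuple: one row, returned as a Series
row = indexed.loc[("Colombia", "2016"), :]

print(colombia)
print(row)
```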

.xs does the same job as .loc, but it can select across several index levels in a single call.

In [7]:
df_sample.xs('Colombia')

| year | pop      |
|------|----------|
| 2015 | 47520667 |
| 2016 | 48175048 |
| 2017 | 48909844 |
| 2018 | 49661056 |


In [8]:
df_sample.xs(('Aruba','2018'))

pop    105846
Name: (Aruba, 2018), dtype: Int64

Selecting on an inner level

In [9]:
df_sample.xs('2018', level='year')

| country  | pop      |
|----------|----------|
| Aruba    | 105846   |
| Colombia | 49661056 |
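
The three ways of calling `xs` used above can be put side by side. This is a sketch on a reduced copy of the sample; `drop_level=False` is an extra option not used in the original cells, shown here because it keeps the selected level in the result.

```python
import pandas as pd

indexed = pd.DataFrame({
    "country": ["Aruba", "Aruba", "Colombia", "Colombia"],
    "year": ["2015", "2016", "2015", "2016"],
    "pop": [104339, 104865, 47520667, 48175048],
}).set_index(["country", "year"]).sort_index()

# Tuple key: one label per level, outermost first
by_pair = indexed.xs(("Aruba", "2016"))

# Select on the inner level by name; that level is dropped from the result
by_year = indexed.xs("2016", level="year")

# Same selection, but keep the 'year' level in the index
by_year_full = indexed.xs("2016", level="year", drop_level=False)

print(by_pair)
print(by_year)
print(by_year_full)
```

Passing the key as a tuple rather than a list avoids the deprecation of list keys in `xs`.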


## Indexes on the whole DataFrame

In [10]:
df_counties = df.set_index(['country','year']).sort_index()
df_counties

| country                     | year | pop       |
|-----------------------------|------|-----------|
| Afghanistan                 | 2015 | 34413603  |
| Afghanistan                 | 2016 | 35383028  |
| Afghanistan                 | 2017 | 36296111  |
| Afghanistan                 | 2018 | 37171922  |
| Africa Eastern and Southern | 2015 | 593871847 |
| ...                         | ...  | ...       |
| Zambia                      | 2018 | 17351714  |
| Zimbabwe                    | 2015 | 13814642  |
| Zimbabwe                    | 2016 | 14030338  |
| Zimbabwe                    | 2017 | 14236599  |


In [11]:
ids = pd.IndexSlice
df_counties.loc[ids['Aruba':'Austria','2015':'2017'],:]

| country   | year | pop      |
|-----------|------|----------|
| Aruba     | 2015 | 104339   |
| Aruba     | 2016 | 104865   |
| Aruba     | 2017 | 105361   |
| Australia | 2015 | 23815995 |
| Australia | 2016 | 24190907 |
| Australia | 2017 | 24601860 |
| Austria   | 2015 | 8642699  |
| Austria   | 2016 | 8736668  |
| Austria   | 2017 | 8797566  |
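
`pd.IndexSlice` builds one slice per index level, so label ranges can be combined across levels. The sketch below reuses the nine rows shown above in a small frame of its own (the `df_small` name is illustrative) and also shows `:` to take every label of a level.

```python
import pandas as pd

df_small = pd.DataFrame({
    "country": ["Aruba"] * 3 + ["Australia"] * 3 + ["Austria"] * 3,
    "year": ["2015", "2016", "2017"] * 3,
    "pop": [104339, 104865, 105361,
            23815995, 24190907, 24601860,
            8642699, 8736668, 8797566],
}).set_index(["country", "year"]).sort_index()  # slicing needs a sorted index

idx = pd.IndexSlice

# Label slices are inclusive on both ends, on every level
subset = df_small.loc[idx["Aruba":"Australia", "2016":"2017"], :]

# ':' on the first level means "all countries"
all_2016 = df_small.loc[idx[:, "2016"], :]

print(subset)
print(all_2016)
```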


In [13]:
df_counties.index.get_level_values(0)

Index(['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan',
       'Africa Eastern and Southern', 'Africa Eastern and Southern',
       'Africa Eastern and Southern', 'Africa Eastern and Southern',
       'Africa Western and Central', 'Africa Western and Central',
       ...
       'Yemen, Rep.', 'Yemen, Rep.', 'Zambia', 'Zambia', 'Zambia', 'Zambia',
       'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
      dtype='object', name='country', length=1064)
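
`get_level_values` repeats each label once per row, as the output above shows. When only the distinct labels are needed, a sketch like the following may be more convenient (reduced copy of the sample, illustrative names):

```python
import pandas as pd

indexed = pd.DataFrame({
    "country": ["Aruba", "Aruba", "Colombia", "Colombia"],
    "year": ["2015", "2016", "2015", "2016"],
    "pop": [104339, 104865, 47520667, 48175048],
}).set_index(["country", "year"]).sort_index()

# One entry per row; levels can be addressed by position or by name
countries = indexed.index.get_level_values("country")

# Distinct labels only
unique_countries = countries.unique().tolist()
years = indexed.index.levels[1].tolist()

print(unique_countries)
print(years)
```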

In [14]:
df_counties['pop']['Colombia']['2018']

49661056
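
The chained brackets above work for reading a value, but the same lookup can be expressed as a single `.loc` call with a row tuple and a column label, which is the form pandas generally recommends. A minimal sketch on a reduced copy of the data (illustrative names):

```python
import pandas as pd

indexed = pd.DataFrame({
    "country": ["Aruba", "Aruba", "Colombia", "Colombia"],
    "year": ["2017", "2018", "2017", "2018"],
    "pop": [105361, 105846, 48909844, 49661056],
}).set_index(["country", "year"]).sort_index()

# Chained lookup: column, then country, then year
chained = indexed["pop"]["Colombia"]["2018"]

# Single lookup: (row tuple, column label)
direct = indexed.loc[("Colombia", "2018"), "pop"]

print(chained, direct)
```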

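## Why multiple indexes are useful

Their main benefit shows when applying mathematical functions, since an aggregation can be computed per index level. Note that the file also contains aggregate rows such as "Africa Eastern and Southern", so the yearly totals below are larger than the world population.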

In [15]:
# df_counties.sum(level='year')  # older form; the level argument was removed in pandas 2.0
df_counties.groupby(level='year').sum()

| year | pop         |
|------|-------------|
| 2015 | 78877276156 |
| 2016 | 79882774995 |
| 2017 | 80889487158 |
| 2018 | 81880400653 |
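
Grouping by a level name works for either level and for any aggregation. The sketch below uses a reduced copy of the sample (illustrative names) to total by year and to summarize by country.

```python
import pandas as pd

indexed = pd.DataFrame({
    "country": ["Aruba", "Aruba", "Colombia", "Colombia"],
    "year": ["2015", "2016", "2015", "2016"],
    "pop": [104339, 104865, 47520667, 48175048],
}).set_index(["country", "year"]).sort_index()

# Sum across countries for each year
totals = indexed.groupby(level="year")["pop"].sum()

# Several aggregations per country in one call
per_country = indexed.groupby(level="country")["pop"].agg(["min", "max"])

print(totals)
print(per_country)
```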


## Going back

In [22]:
df_sample.unstack('year')
df_sample.unstack('country')

| year | pop (Aruba) | pop (Colombia) |
|------|-------------|----------------|
| 2015 | 104339      | 47520667       |
| 2016 | 104865      | 48175048       |
| 2017 | 105361      | 48909844       |
| 2018 | 105846      | 49661056       |
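
`unstack` moves an index level into the columns; `stack` and `reset_index` go the other way. A minimal sketch of the round trip on a reduced copy of the sample (illustrative names):

```python
import pandas as pd

indexed = pd.DataFrame({
    "country": ["Aruba", "Aruba", "Colombia", "Colombia"],
    "year": ["2015", "2016", "2015", "2016"],
    "pop": [104339, 104865, 47520667, 48175048],
}).set_index(["country", "year"]).sort_index()

# Move the 'year' level to the columns: one row per country
wide = indexed["pop"].unstack("year")

# stack puts the column labels back as the innermost index level
back = wide.stack()

# reset_index turns both levels back into ordinary columns
flat = indexed.reset_index()

print(wide)
print(back)
print(flat)
```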
