# Slicing and Indexing DataFrames


A DataFrame is made up of three parts: a NumPy array that stores the table's values, and two indexes that store the row and column labels.
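As a rough illustration of those three parts, the sketch below builds a tiny DataFrame and pulls each part out. The three-row `sample` frame is a hypothetical stand-in for the real dataset (its values are copied from the tables in these notes), and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "country": ["Argentina", "Colombia", "Cuba"],
        "cont": ["Americas", "Americas", "Americas"],
        "life_exp": [75.320, 72.889, 78.273],
    }
)

values = sample.to_numpy()   # the table's values as a NumPy array
col_labels = sample.columns  # index holding the column labels
row_labels = sample.index    # index holding the row labels (default RangeIndex)

print(values.shape)
print(list(col_labels))
print(list(row_labels))
```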

## Changing a DataFrame's row index

In [2]:
import pandas as pd


countries = pd.read_csv('datasets/gapminder.csv')

countries_ind = countries.set_index('country')

countries_ind


country,country_code,year,population,cont,life_exp,gdp_cap
Afghanistan,11,2007,31889923.0,Asia,43.828,974.580338
Albania,23,2007,3600523.0,Europe,76.423,5937.029526
Algeria,35,2007,33333216.0,Africa,72.301,6223.367465
Angola,47,2007,12420476.0,Africa,42.731,4797.231267
Argentina,59,2007,40301927.0,Americas,75.320,12779.379640
...,...,...,...,...,...,...
Vietnam,1655,2007,85262356.0,Asia,74.249,2441.576404
West Bank and Gaza,1667,2007,4018332.0,Asia,73.422,3025.349798
"Yemen, Rep.",1679,2007,22211743.0,Asia,62.698,2280.769906
Zambia,1691,2007,11746035.0,Africa,42.384,1271.211593


This version of the DataFrame uses one of its columns as the index instead of the default index.

To restore the default index, use the `.reset_index()` method.

In [3]:
countries_ind = countries_ind.reset_index()
countries_ind

,country,country_code,year,population,cont,life_exp,gdp_cap
0,Afghanistan,11,2007,31889923.0,Asia,43.828,974.580338
1,Albania,23,2007,3600523.0,Europe,76.423,5937.029526
2,Algeria,35,2007,33333216.0,Africa,72.301,6223.367465
3,Angola,47,2007,12420476.0,Africa,42.731,4797.231267
4,Argentina,59,2007,40301927.0,Americas,75.320,12779.379640
...,...,...,...,...,...,...,...
137,Vietnam,1655,2007,85262356.0,Asia,74.249,2441.576404
138,West Bank and Gaza,1667,2007,4018332.0,Asia,73.422,3025.349798
139,"Yemen, Rep.",1679,2007,22211743.0,Asia,62.698,2280.769906
140,Zambia,1691,2007,11746035.0,Africa,42.384,1271.211593


When you reset the index to its default, you can also discard the column that was serving as the index. To do this, pass `drop=True` to the `.reset_index()` method.

In [4]:
countries_ind = countries.set_index('country')
countries_ind = countries_ind.reset_index(drop=True)
countries_ind

,country_code,year,population,cont,life_exp,gdp_cap
0,11,2007,31889923.0,Asia,43.828,974.580338
1,23,2007,3600523.0,Europe,76.423,5937.029526
2,35,2007,33333216.0,Africa,72.301,6223.367465
3,47,2007,12420476.0,Africa,42.731,4797.231267
4,59,2007,40301927.0,Americas,75.320,12779.379640
...,...,...,...,...,...,...
137,1655,2007,85262356.0,Asia,74.249,2441.576404
138,1667,2007,4018332.0,Asia,73.422,3025.349798
139,1679,2007,22211743.0,Asia,62.698,2280.769906
140,1691,2007,11746035.0,Africa,42.384,1271.211593


As you can see, the `country` column that held the country names has been dropped. It is only dropped in the DataFrame returned by `reset_index()`, though. If that new DataFrame is not assigned to a variable, the reset is discarded and the original DataFrame is left unchanged.
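A minimal sketch of that behaviour, using a hypothetical two-row `sample` frame rather than the full dataset: calling `reset_index()` without assigning the result changes nothing, while assigning it keeps the reset.

```python
import pandas as pd

# Hypothetical two-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {"country": ["Argentina", "Colombia"], "life_exp": [75.320, 72.889]}
)

indexed = sample.set_index("country")

# The result is not assigned, so `indexed` still has "country" as its index.
indexed.reset_index(drop=True)
still_indexed = indexed.index.name == "country"

# Assigning the result keeps the change; drop=True discards the old index.
flat = indexed.reset_index(drop=True)

print(still_indexed)
print(list(flat.columns))
```

Note that `set_index()` follows the same rule: `sample` itself still has its `country` column after the call.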

## Indexes make subsetting simpler

Why should you bother with indexes? The answer is that they make subsetting code cleaner. Consider this example of subsetting for the rows where the country is Colombia or Argentina.

In [5]:
countries[countries['country'].isin(['Colombia', 'Argentina'])]

,country_code,country,year,population,cont,life_exp,gdp_cap
4,59,Argentina,2007,40301927.0,Americas,75.32,12779.37964
25,311,Colombia,2007,44227550.0,Americas,72.889,7006.580419


That's a fairly tricky line of code for such a simple task. Now look at the equivalent when the country names are in the index. DataFrames have a subsetting accessor called `.loc`, which filters on index values. Here you simply pass the country names to `.loc` as a list. Much easier!

In [6]:
countries_ind = countries.set_index('country')
countries_ind.loc[['Colombia', 'Argentina']]

country,country_code,year,population,cont,life_exp,gdp_cap
Colombia,311,2007,44227550.0,Americas,72.889,7006.580419
Argentina,59,2007,40301927.0,Americas,75.32,12779.37964
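To make the comparison concrete, here is a small sketch that runs both approaches on a hypothetical three-row `sample` frame. One difference worth noticing: `.isin()` keeps the rows in their original order, while `.loc` returns them in the order of the list you pass, which is why the two outputs above list the countries in a different order.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "country": ["Argentina", "Chile", "Colombia"],
        "life_exp": [75.320, 78.553, 72.889],
    }
)
wanted = ["Colombia", "Argentina"]

# Column-based filter: build a Boolean mask, then subset with it.
by_column = sample[sample["country"].isin(wanted)]

# Index-based filter: put the names in the index, then pass the list to .loc.
by_index = sample.set_index("country").loc[wanted]

print(list(by_column["country"]))
print(list(by_index.index))
```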


## Index values don't need to be unique

Suppose we use the continent column, `cont`, as the index. If we call `.loc` with a `cont` value that appears in the index more than once, it returns multiple rows, as you would expect.

In [7]:
countries_ind = countries.set_index('cont')
countries_ind.loc[['Oceania']]
# When you pass .loc a single argument, it takes a subset of rows.

cont,country_code,country,year,population,life_exp,gdp_cap
Oceania,71,Australia,2007,20434176.0,81.235,34435.36744
Oceania,1103,New Zealand,2007,4115771.0,80.204,25185.00911
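The sketch below reproduces this on a hypothetical three-row `sample` frame and also checks `index.is_unique`, which is one way to find out whether a label lookup might return more than one row.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "cont": ["Oceania", "Americas", "Oceania"],
        "country": ["Australia", "Chile", "New Zealand"],
        "life_exp": [81.235, 78.553, 80.204],
    }
)

by_cont = sample.set_index("cont")

# "Oceania" appears twice, so the index is not unique.
is_unique = by_cont.index.is_unique

# Every row labelled "Oceania" is returned.
oceania = by_cont.loc[["Oceania"]]

print(is_unique)
print(list(oceania["country"]))
```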


## Multi-level Indexes
An index can be built from more than one column. The first column is called the outer-level index. The next column is nested inside the first, so it is an inner-level index with respect to it, and so on for every column included in the index.

In [8]:
countries_ind = countries.set_index(['cont', 'country'])
countries_ind.sort_index()  # more options for sorting by index are covered later

cont,country,country_code,year,population,life_exp,gdp_cap
Africa,Algeria,35,2007,33333216.0,72.301,6223.367465
Africa,Angola,47,2007,12420476.0,42.731,4797.231267
Africa,Benin,131,2007,8078314.0,56.728,1441.284873
Africa,Botswana,167,2007,1639131.0,50.728,12569.851770
Africa,Burkina Faso,203,2007,14326203.0,52.295,1217.032994
...,...,...,...,...,...,...
Europe,Switzerland,1487,2007,7554661.0,81.701,37506.419070
Europe,Turkey,1583,2007,71158647.0,71.777,8458.276384
Europe,United Kingdom,1607,2007,60776238.0,79.425,33203.261280
Oceania,Australia,71,2007,20434176.0,81.235,34435.367440


Let's extract a subset of the multi-level indexed DataFrame using the outer level.

In [9]:
countries_ind.loc[['Oceania', 'Americas']]

cont,country,country_code,year,population,life_exp,gdp_cap
Oceania,Australia,71,2007,20434176.0,81.235,34435.36744
Oceania,New Zealand,1103,2007,4115771.0,80.204,25185.00911
Americas,Argentina,59,2007,40301927.0,75.32,12779.37964
Americas,Bolivia,143,2007,9119152.0,65.554,3822.137084
Americas,Brazil,179,2007,190010647.0,72.39,9065.800825
Americas,Canada,251,2007,33390141.0,80.653,36319.23501
Americas,Chile,287,2007,16284741.0,78.553,13171.63885
Americas,Colombia,311,2007,44227550.0,72.889,7006.580419
Americas,Costa Rica,359,2007,4133884.0,78.782,9645.06142
Americas,Cuba,395,2007,11416987.0,78.273,8948.102923


Now let's subset on the inner levels. For this you pass a list of tuples, one tuple per lookup, where each entry of the tuple corresponds to one level of the DataFrame's index hierarchy, like this:

In [10]:
countries_ind.loc[[('Oceania', 'Australia'), ('Americas', 'Colombia'), ('Americas', 'Chile')]]

cont,country,country_code,year,population,life_exp,gdp_cap
Oceania,Australia,71,2007,20434176.0,81.235,34435.36744
Americas,Colombia,311,2007,44227550.0,72.889,7006.580419
Americas,Chile,287,2007,16284741.0,78.553,13171.63885
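Both kinds of lookup can be sketched side by side on a hypothetical four-row `sample` frame: a list of plain labels selects on the outer level, and a list of tuples selects specific outer/inner combinations.

```python
import pandas as pd

# Hypothetical four-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "cont": ["Americas", "Americas", "Oceania", "Oceania"],
        "country": ["Chile", "Colombia", "Australia", "New Zealand"],
        "life_exp": [78.553, 72.889, 81.235, 80.204],
    }
)

nested = sample.set_index(["cont", "country"])

# Outer level only: a list of plain labels.
outer = nested.loc[["Oceania"]]

# Outer and inner levels: a list of (cont, country) tuples.
inner = nested.loc[[("Oceania", "Australia"), ("Americas", "Colombia")]]

print(list(outer.index))
print(list(inner["life_exp"]))
```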


# Sorting by Index
The `.sort_index()` method sorts all levels in ascending order by default, from the outer level to the inner level. You can also choose which levels to sort by and whether each one is sorted in ascending or descending order, like this:

In [11]:
countries_ind.sort_index(level=['country', 'cont'], ascending=[True, False])

cont,country,country_code,year,population,life_exp,gdp_cap
Asia,Afghanistan,11,2007,31889923.0,43.828,974.580338
Europe,Albania,23,2007,3600523.0,76.423,5937.029526
Africa,Algeria,35,2007,33333216.0,72.301,6223.367465
Africa,Angola,47,2007,12420476.0,42.731,4797.231267
Americas,Argentina,59,2007,40301927.0,75.320,12779.379640
...,...,...,...,...,...,...
Asia,Vietnam,1655,2007,85262356.0,74.249,2441.576404
Asia,West Bank and Gaza,1667,2007,4018332.0,73.422,3025.349798
Asia,"Yemen, Rep.",1679,2007,22211743.0,62.698,2280.769906
Africa,Zambia,1691,2007,11746035.0,42.384,1271.211593
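Because country names are unique in the data above, the descending order on `cont` has no visible effect there. The sketch below uses a hypothetical four-row `sample` frame and sorts by `cont` ascending and then `country` descending, so the effect of each direction is easier to see.

```python
import pandas as pd

# Hypothetical four-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "cont": ["Asia", "Africa", "Africa", "Americas"],
        "country": ["Vietnam", "Zambia", "Angola", "Chile"],
        "life_exp": [74.249, 42.384, 42.731, 78.553],
    }
).set_index(["cont", "country"])

# Default: every level ascending, outer level first.
default_order = sample.sort_index()

# Continents ascending, countries descending within each continent.
custom_order = sample.sort_index(
    level=["cont", "country"], ascending=[True, False]
)

print(list(default_order.index))
print(list(custom_order.index))
```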


### Indexes can be complex and are sometimes unnecessary, and the functions for working with columns are not the same as those for working with indexes. Still, it is good to know that they exist and how they work, in case a query becomes easier with indexes or someone uses them in code we are studying.

# Slicing

With lists, the syntax `[start:end+1:step]` extracts a sublist of elements from the original list.

You can do the same with DataFrames, but you have to sort the index first, and you must keep in mind that the last element of the slice is included.

In [19]:
countries_ind = countries.set_index('country')
countries_ind =  countries_ind.sort_index()
print(countries_ind.loc['Colombia':'Cuba'])

                  country_code  year  population      cont  life_exp  \
country                                                                
Colombia                   311  2007  44227550.0  Americas    72.889   
Comoros                    323  2007    710960.0    Africa    65.152   
Congo, Dem. Rep.           335  2007  64606759.0    Africa    46.462   
Congo, Rep.                347  2007   3800610.0    Africa    55.322   
Costa Rica                 359  2007   4133884.0  Americas    78.782   
Cote d'Ivoire              371  2007  18013409.0    Africa    48.328   
Croatia                    383  2007   4493312.0    Europe    75.748   
Cuba                       395  2007  11416987.0  Americas    78.273   

                       gdp_cap  
country                         
Colombia           7006.580419  
Comoros             986.147879  
Congo, Dem. Rep.    277.551859  
Congo, Rep.        3632.557798  
Costa Rica         9645.061420  
Cote d'Ivoire      1544.750112  
Croatia        

#### Cuba is included!
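The difference between list slicing and label slicing can be sketched as follows. The `letters` list and the four-row `sample` frame are hypothetical; the point is only that a list slice stops before its end position, while a `.loc` slice includes its end label.

```python
import pandas as pd

# List slicing: the stop position (3) is excluded.
letters = ["a", "b", "c", "d"]
list_slice = letters[1:3]

# Hypothetical four-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "country": ["Cuba", "Colombia", "Croatia", "Chile"],
        "life_exp": [78.273, 72.889, 75.748, 78.553],
    }
)

# Sort the index first, then slice by label: the end label is included.
by_country = sample.set_index("country").sort_index()
label_slice = by_country.loc["Colombia":"Cuba"]

print(list_slice)
print(list(label_slice.index))
```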

### Slicing with multiple index levels

The previous example had only one index level, so that is how you slice on the outer level with `.loc`.

Now let's do the same kind of slicing with two index levels: first by continent and then by country:

In [24]:
countries_ind = countries.set_index(['cont', 'country'])
countries_ind =  countries_ind.sort_index()
# start tuple : end tuple
countries_ind.loc[('Americas','Argentina'):('Americas','Venezuela')]



cont,country,country_code,year,population,life_exp,gdp_cap
Americas,Argentina,59,2007,40301927.0,75.32,12779.37964
Americas,Bolivia,143,2007,9119152.0,65.554,3822.137084
Americas,Brazil,179,2007,190010647.0,72.39,9065.800825
Americas,Canada,251,2007,33390141.0,80.653,36319.23501
Americas,Chile,287,2007,16284741.0,78.553,13171.63885
Americas,Colombia,311,2007,44227550.0,72.889,7006.580419
Americas,Costa Rica,359,2007,4133884.0,78.782,9645.06142
Americas,Cuba,395,2007,11416987.0,78.273,8948.102923
Americas,Dominican Republic,443,2007,9319622.0,72.235,6025.374752
Americas,Ecuador,455,2007,13755680.0,74.994,6873.262326


## Slicing Columns

Suppose we only want a consecutive subset of a DataFrame's columns, and we also want certain consecutive rows.

Keep in mind that the index columns are always included in the output DataFrame.

In [33]:
basic_info_americas = countries.set_index('cont')
basic_info_americas = basic_info_americas.loc['Americas','country':'life_exp']
basic_info_americas

cont,country,year,population,life_exp
Americas,Argentina,2007,40301927.0,75.32
Americas,Bolivia,2007,9119152.0,65.554
Americas,Brazil,2007,190010647.0,72.39
Americas,Canada,2007,33390141.0,80.653
Americas,Chile,2007,16284741.0,78.553
Americas,Colombia,2007,44227550.0,72.889
Americas,Costa Rica,2007,4133884.0,78.782
Americas,Cuba,2007,11416987.0,78.273
Americas,Dominican Republic,2007,9319622.0,72.235
Americas,Ecuador,2007,13755680.0,74.994
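The second argument to `.loc` selects columns. As a sketch on a hypothetical three-row `sample` frame, passing `:` as the first argument keeps every row, while passing a label keeps only the matching rows.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "cont": ["Americas", "Africa", "Americas"],
        "country": ["Argentina", "Zambia", "Chile"],
        "year": [2007, 2007, 2007],
        "population": [40301927.0, 11746035.0, 16284741.0],
        "life_exp": [75.320, 42.384, 78.553],
    }
)

by_cont = sample.set_index("cont")

# All rows, columns from "country" through "population" (end included).
all_rows = by_cont.loc[:, "country":"population"]

# Only the "Americas" rows, same column slice.
americas = by_cont.loc["Americas", "country":"population"]

print(list(all_rows.columns))
print(list(americas["country"]))
```

In both results the `cont` index is still attached to the rows, even though it was not part of the column slice.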


### Slice twice

In [35]:
countries_ind = countries.set_index(['cont', 'country'])
countries_ind =  countries_ind.sort_index()
# (start tuple):(end tuple), first_column:last_column
countries_ind.loc[('Americas','Argentina'):('Americas','Mexico'), 'population':'life_exp']

cont,country,population,life_exp
Americas,Argentina,40301927.0,75.32
Americas,Bolivia,9119152.0,65.554
Americas,Brazil,190010647.0,72.39
Americas,Canada,33390141.0,80.653
Americas,Chile,16284741.0,78.553
Americas,Colombia,44227550.0,72.889
Americas,Costa Rica,4133884.0,78.782
Americas,Cuba,11416987.0,78.273
Americas,Dominican Republic,9319622.0,72.235
Americas,Ecuador,13755680.0,74.994
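The same double slice can be checked on a hypothetical four-row `sample` frame, where it is easy to confirm that both the row range and the column range include their end points.

```python
import pandas as pd

# Hypothetical four-row sample; values copied from the tables in these notes.
sample = pd.DataFrame(
    {
        "cont": ["Americas", "Africa", "Americas", "Americas"],
        "country": ["Cuba", "Zambia", "Argentina", "Chile"],
        "year": [2007, 2007, 2007, 2007],
        "population": [11416987.0, 11746035.0, 40301927.0, 16284741.0],
        "life_exp": [78.273, 42.384, 75.320, 78.553],
    }
)

nested = sample.set_index(["cont", "country"]).sort_index()

# Rows: (start tuple):(end tuple). Columns: first_column:last_column.
subset = nested.loc[
    ("Americas", "Argentina"):("Americas", "Chile"),
    "population":"life_exp",
]

print(list(subset.index))
print(list(subset.columns))
```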


## Slicing Time series

Slicing is particularly useful for time series since it's a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.

`rows_despues_2010 = temperatures['date'] >= '2010-01-01'`

`rows_antes_2011 = temperatures['date'] <= '2011-12-31'`

Use Boolean conditions to subset temperatures for rows in 2010 and 2011

`temperatures_bool = temperatures[rows_despues_2010 & rows_antes_2011]`

`print(temperatures_bool)`

Set date as the index and sort the index

`temperatures_ind = temperatures.set_index('date').sort_index()`

Use .loc[] to subset temperatures_ind for rows in 2010 and 2011

`print(temperatures_ind.loc['2010':'2011'])`


`print(temperatures_ind.loc['2010-08':'2011-02'])`
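The snippets above can be put together into one runnable sketch. The `temperatures` frame here is a made-up five-row stand-in (the `avg_temp_c` column name and its readings are hypothetical), and it assumes the `date` column has a datetime dtype, which is what lets `.loc` treat `'2010'` or `'2010-08'` as a whole year or month.

```python
import pandas as pd

# Made-up readings for illustration; "avg_temp_c" is a hypothetical column name.
temperatures = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2009-12-31", "2010-08-15", "2011-02-01", "2011-12-31", "2012-01-01"]
        ),
        "avg_temp_c": [1.0, 25.5, 3.2, 0.4, -1.1],
    }
)

# Boolean approach: two conditions combined with &.
rows_despues_2010 = temperatures["date"] >= "2010-01-01"
rows_antes_2011 = temperatures["date"] <= "2011-12-31"
temperatures_bool = temperatures[rows_despues_2010 & rows_antes_2011]

# Index approach: set the date as the index, sort it, then slice with .loc.
temperatures_ind = temperatures.set_index("date").sort_index()
years = temperatures_ind.loc["2010":"2011"]         # all of 2010 and 2011
months = temperatures_ind.loc["2010-08":"2011-02"]  # Aug 2010 through Feb 2011

print(len(temperatures_bool), len(years), len(months))
```

If the dates were stored as plain strings instead, the year-only slice would compare text rather than dates and would not cover the whole of 2011, so converting with `pd.to_datetime` (or `parse_dates` when reading the file) is the safer choice.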
