# DataFrames

A set of Series that share the same index

DataFrames are the workhorse of Pandas and are directly inspired by the R programming language.

**We can think of a DataFrame as a set of Series that share the same index.**
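To make that idea concrete, here is a minimal sketch that builds a small DataFrame from two Series defined on the same index. The names `idx`, `w`, `x` and `small_df` and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Two Series built on the same index labels (hypothetical example data)
idx = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=idx, name='W')
x = pd.Series([10.0, 20.0, 30.0], index=idx, name='X')

# A dict of Series becomes a DataFrame: keys are the column names,
# and the shared index becomes the row index
small_df = pd.DataFrame({'W': w, 'X': x})

# Each column pulled back out is a Series again, on the same index
print(small_df['W'].index.equals(small_df['X'].index))
```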

In [1]:
import pandas as pd
import numpy as np

In [2]:
from numpy.random import randn
np.random.seed(101)

In [3]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [4]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


## Selection and indexing

The following are several ways to access the data in a DataFrame:

In [5]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [6]:
# Attribute (dot) syntax (NOT RECOMMENDED!)
# Although a column can also be accessed directly as df.W,
# this is not recommended because the name could clash with a DataFrame method or attribute

df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
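The risk with dot access can be seen in a small sketch. It assumes a hypothetical frame `shop` with a column named `count`, which is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame with a column whose name matches a DataFrame method
shop = pd.DataFrame({'item': ['a', 'b'], 'count': [3, 5]})

# Bracket access always returns the column
col = shop['count']

# Attribute access returns the bound method DataFrame.count, not the column
attr = shop.count

print(type(col).__name__)
print(callable(attr))
```

Bracket notation avoids the ambiguity, and it is also the only form that works for column names containing spaces.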

**DataFrame columns are simply Series**

In [7]:
type(df['W'])

pandas.core.series.Series

In [8]:
type(df)

pandas.core.frame.DataFrame

In [9]:
# Several columns can be accessed at once by passing a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [10]:
df[['Z','X']]

Unnamed: 0,Z,X
A,0.503826,0.628133
B,0.605965,-0.319318
C,-0.589001,0.740122
D,0.955057,-0.758872
E,0.683509,1.978757


**Creating a new column:**

In [11]:
df['new'] = df['W'] + df['Y']

In [12]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [13]:
df.columns

Index(['W', 'X', 'Y', 'Z', 'new'], dtype='object')
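A derived column like the one above can also be produced without touching the original frame. The sketch below uses `DataFrame.assign`. The name `demo` is a hypothetical stand-in that rebuilds a frame of the same shape as `df`.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
demo = pd.DataFrame(np.random.randn(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

# assign returns a new DataFrame with the extra column; demo is not modified
with_new = demo.assign(new=demo['W'] + demo['Y'])

print(list(demo.columns))
print(list(with_new.columns))
```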

**Removing columns**

In [14]:
df.drop('new',axis=1)   # axis=1 means we want to drop a column; axis=0 would drop a row (axis defaults to 0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [15]:
# drop does not modify the DataFrame unless explicitly told to!
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [16]:
df.drop('new',axis=1,inplace=True)

In [17]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Rows can be removed in the same way:

In [18]:
# The axis parameter defaults to 0 (rows)

df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [19]:
# The object is not modified because inplace=True was not passed
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
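The `drop` calls above can also be written with the `columns=` and `index=` keywords instead of `axis`, and the result can be kept by rebinding the name rather than using `inplace=True`. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

# columns= is equivalent to passing the label with axis=1
trimmed = demo.drop(columns='new')

# index= is equivalent to passing the label with axis=0
fewer_rows = demo.drop(index='B')

# demo itself is still untouched at this point
columns_before = list(demo.columns)

# Rebinding the name keeps the result without inplace=True
demo = demo.drop(columns='new')
```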


**Selecting rows**

In [20]:
# In a DataFrame, not only the columns can be treated as Series; the rows can too.
# A specific row can be accessed by its label with the loc indexer (location)

df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [21]:
df.loc['D']

W    0.188695
X   -0.758872
Y   -0.933237
Z    0.955057
Name: D, dtype: float64

A row can also be selected by its integer position with the iloc indexer (integer location):

In [22]:
df.iloc[3]

W    0.188695
X   -0.758872
Y   -0.933237
Z    0.955057
Name: D, dtype: float64
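The difference between label-based and position-based access shows up most clearly with slices. The sketch below uses a hypothetical one-column frame `demo`.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0, 3.0, 4.0]}, index=['A', 'B', 'C', 'D'])

by_label = demo.loc['D']      # row whose index label is 'D'
by_position = demo.iloc[3]    # fourth row, counting from 0

# Label-based slices include the end label; positional slices exclude the end
label_slice = demo.loc['A':'C']   # rows A, B, C
pos_slice = demo.iloc[0:2]        # rows A, B
```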

**Selecting a subset of rows and columns**

In [23]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [24]:
df.loc['B','Y']

-0.84807698340363147

In [25]:
df.loc[['A','B'],['W','Y']]  # by position: df.iloc[0:2, [0, 2]]

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
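The same block of rows and columns can be reached by position. The sketch below compares the two forms on a hypothetical frame `demo` shaped like `df`. `Index.get_indexer` is used to look up column positions instead of hard-coding them.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
demo = pd.DataFrame(np.random.randn(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

by_label = demo.loc[['A', 'B'], ['W', 'Y']]

# Same block by position: rows 0 and 1, columns 0 and 2
by_position = demo.iloc[[0, 1], [0, 2]]

# Positions can be looked up from the labels instead of being hard-coded
col_pos = demo.columns.get_indexer(['W', 'Y'])
looked_up = demo.iloc[0:2, col_pos]
```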


### Conditional selection

As in NumPy, Pandas supports conditional selection using square brackets:

In [28]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [29]:
df>0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [30]:
booldf = df > 0

In [31]:
booldf

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [33]:
df[booldf]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [34]:
# Directly, without the intermediate variable
df[df>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509
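Indexing with a boolean DataFrame keeps the shape and fills the non-matching cells with NaN, which is the behaviour of `DataFrame.where`. A minimal sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'W': [1.0, -2.0], 'X': [-3.0, 4.0]}, index=['A', 'B'])

masked = demo[demo > 0]        # boolean-frame indexing
same = demo.where(demo > 0)    # explicit equivalent

# where can also substitute a value other than NaN
zeroed = demo.where(demo > 0, 0.0)
```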


In [35]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [36]:
df['W']>0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [39]:
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [37]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [40]:
df [df['Z']>0.8]

Unnamed: 0,W,X,Y,Z
D,0.188695,-0.758872,-0.933237,0.955057


In [41]:
df [df['W']> 0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [42]:
resultdf = df [df['W']> 0]

In [43]:
resultdf['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [44]:
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [45]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [46]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


In [49]:
df.loc['A']>0

W    True
X    True
Y    True
Z    True
Name: A, dtype: bool

In [47]:
# The same selection, done step by step:
boolSer = df['W']>0
result = df[boolSer]
myCols = ['Y', 'X']
result[myCols]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757
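The row filter and the column selection can also be combined in a single `loc` call. The sketch below compares both forms on a hypothetical frame `demo` shaped like `df`.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
demo = pd.DataFrame(np.random.randn(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

chained = demo[demo['W'] > 0][['Y', 'X']]      # two indexing steps
single = demo.loc[demo['W'] > 0, ['Y', 'X']]   # one step: row mask, column list
```

For reading data, both forms return the same result. The single `loc` form is generally the safer choice when the selection will be assigned to, because a chained selection may operate on a temporary copy.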


### Multiple conditions

To combine more than one condition, use the operators | (or) and & (and), wrapping each condition in parentheses:

In [50]:
df[(df['X']>0) & (df['Y']>0)]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
C,-2.018168,0.740122,0.528813,-0.589001
E,0.190794,1.978757,2.605967,0.683509


In [52]:
df[(df['X']>0) | (df['Z']>0)]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
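The sketch below illustrates why the element-wise operators are needed: Python's `and` tries to reduce a whole Series to a single truth value, which raises a `ValueError`. The frame `demo` and its values are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'X': [1, -1, 2], 'Y': [1, 1, -2]}, index=['A', 'B', 'C'])

both = demo[(demo['X'] > 0) & (demo['Y'] > 0)]     # element-wise AND
either = demo[(demo['X'] > 0) | (demo['Y'] > 0)]   # element-wise OR
not_x = demo[~(demo['X'] > 0)]                     # element-wise NOT

# Python's `and` asks for the truth value of a whole Series, which fails
try:
    demo[(demo['X'] > 0) and (demo['Y'] > 0)]
    and_failed = False
except ValueError:
    and_failed = True
```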


## More on indexes


In [53]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [54]:
# The reset_index method resets the index of a DataFrame to integers
# and keeps the previous index as an additional column

df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [None]:
# To make the change permanent, the inplace parameter must be True

# df.reset_index(inplace=True)
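If the old labels are not needed, `reset_index` also accepts `drop=True`, which discards them instead of saving them in a column. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

kept = demo.reset_index()              # old labels saved in a column named 'index'
dropped = demo.reset_index(drop=True)  # old labels discarded
```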

In [55]:
# We can create a new list whose elements we will later use as the new index
newind = 'Colombia Perú Ecuador Venezuela Brasil'.split()

In [56]:
newind

['Colombia', 'Perú', 'Ecuador', 'Venezuela', 'Brasil']

In [57]:
df['País'] = newind

In [58]:
df

Unnamed: 0,W,X,Y,Z,País
A,2.70685,0.628133,0.907969,0.503826,Colombia
B,0.651118,-0.319318,-0.848077,0.605965,Perú
C,-2.018168,0.740122,0.528813,-0.589001,Ecuador
D,0.188695,-0.758872,-0.933237,0.955057,Venezuela
E,0.190794,1.978757,2.605967,0.683509,Brasil


In [59]:
# Set the 'País' column (the list of countries) as the new index
df.set_index('País')

País,W,X,Y,Z
Colombia,2.70685,0.628133,0.907969,0.503826
Perú,0.651118,-0.319318,-0.848077,0.605965
Ecuador,-2.018168,0.740122,0.528813,-0.589001
Venezuela,0.188695,-0.758872,-0.933237,0.955057
Brasil,0.190794,1.978757,2.605967,0.683509


In [60]:
df

Unnamed: 0,W,X,Y,Z,País
A,2.70685,0.628133,0.907969,0.503826,Colombia
B,0.651118,-0.319318,-0.848077,0.605965,Perú
C,-2.018168,0.740122,0.528813,-0.589001,Ecuador
D,0.188695,-0.758872,-0.933237,0.955057,Venezuela
E,0.190794,1.978757,2.605967,0.683509,Brasil


In [61]:
# To make the change permanent, pass inplace=True
df.set_index('País',inplace=True)

In [62]:
df

País,W,X,Y,Z
Colombia,2.70685,0.628133,0.907969,0.503826
Perú,0.651118,-0.319318,-0.848077,0.605965
Ecuador,-2.018168,0.740122,0.528813,-0.589001
Venezuela,0.188695,-0.758872,-0.933237,0.955057
Brasil,0.190794,1.978757,2.605967,0.683509
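Note that `set_index` replaces the existing index, so the labels A to E no longer appear in the result above. The sketch below, on a hypothetical two-row frame `demo`, shows one way to keep the old labels by resetting the index first.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'País': ['Colombia', 'Perú']}, index=['A', 'B'])

# set_index replaces the current index, so the labels A and B are lost
replaced = demo.set_index('País')

# Resetting first keeps the old labels as a regular column named 'index'
preserved = demo.reset_index().set_index('País')
```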


## Multi-indexes and index hierarchy



In [63]:
# Index levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))


In [64]:
hier_index

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

In [65]:
# Multi-index
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [66]:
hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [67]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646
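The same hierarchical index can be built in other ways. The sketch below compares `MultiIndex.from_tuples`, as used above, with `from_arrays` and `from_product`. The variable names are hypothetical.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]

# One tuple per row, as in the cells above
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# One array per level, without zipping them first
from_arrays = pd.MultiIndex.from_arrays([outside, inside])

# Every combination of the unique values of each level
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
```

`from_product` is the shortest form when every combination of the levels is present, as it is here.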


Now let's see how to index this!
For an index hierarchy on the rows we use df.loc[].
If the index hierarchy were on the column axis, plain bracket notation df[] would be enough.

Selecting one value of the outer index level returns the corresponding sub-DataFrame:

In [68]:
df.loc['G1']

Unnamed: 0,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [69]:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64
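The remark about a hierarchy on the column axis can be checked by transposing a frame like the one above. This is a minimal sketch. The names `demo` and `wide` are hypothetical, and the values come from `np.arange` instead of random numbers.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
demo = pd.DataFrame(np.arange(12.0).reshape(6, 2), index=rows, columns=['A', 'B'])

# Transposing moves the two-level index onto the columns
wide = demo.T

# Plain brackets now select the outer level, as loc['G1'] does for rows
g1_cols = wide['G1']
```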

In [70]:
df.index.names

FrozenList([None, None])

In [71]:
df.index.names = ['Grupo','Num']

In [72]:
df

Grupo,Num,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [75]:
# To select a specific value
df.loc['G2'].loc[3]['B']

0.32964629880452445
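The chained lookup above can be collapsed into a single `loc` call by passing the row labels as a tuple. The sketch below uses a hypothetical frame `demo` with the same index structure and deterministic values.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Grupo', 'Num'])
demo = pd.DataFrame(np.arange(12.0).reshape(6, 2), index=rows, columns=['A', 'B'])

chained = demo.loc['G2'].loc[3]['B']   # three lookups
direct = demo.loc[('G2', 3), 'B']      # one lookup: (row tuple, column)
```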

### Cross-Section

The **xs** method returns a cross-section of a DataFrame that uses a multi-index.

For example, it is useful when we want to select all the rows with a given "Num" in the DataFrame above, regardless of their Grupo.

In [76]:
# In this case we select Num=1
df.xs(1,level='Num')

Grupo,A,B
G1,0.302665,1.693723
G2,0.166905,0.184502
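By default `xs` removes the level that was selected on. The sketch below, on a hypothetical frame `demo` with the same index structure, shows the `drop_level=False` option and a `loc` call with `slice(None)` that selects the same rows.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Grupo', 'Num'])
demo = pd.DataFrame(np.arange(12.0).reshape(6, 2), index=rows, columns=['A', 'B'])

dropped = demo.xs(1, level='Num')                  # 'Num' level removed
kept = demo.xs(1, level='Num', drop_level=False)   # both levels kept

# slice(None) means "every value" of the outer level; both levels are kept
via_loc = demo.loc[(slice(None), 1), :]
```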


In [77]:
df.loc['G1']

Num,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [78]:
# Alternatively, the following also works (without using loc):
df.xs('G1')

Num,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [79]:
# The key is passed as a tuple, one label per level (recent pandas versions reject a list here)
df.xs(('G1',1))

A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64

In [80]:
# With loc, the same selection would have been written as follows:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64