<img style="float: left;" src='../Figures/iteso.jpg' width="50" height="100"/>

# <center> <font color= #000047> Review of the Pandas Library</font> </center>


# DataFrames

DataFrames are the main tool of the pandas library and are directly inspired by the R programming language. We can think of a DataFrame as a group of Series objects put together so that they share the same index. Let's use pandas to explore this topic!
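
To make the "Series sharing one index" picture concrete, here is a minimal sketch. The names `w`, `x`, and `frame` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# A dict of Series becomes a DataFrame: keys are the column names,
# and the shared index becomes the row index
frame = pd.DataFrame({'W': w, 'X': x})

# Each column, taken back out, is a Series on that same index
print(type(frame['W']))
print(frame.index.equals(w.index))
```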

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
np.random.seed(101)
arr = np.random.randn(5,4)
arr

array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])

In [4]:
arr.shape

(5, 4)

In [5]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2,3
0,2.70685,0.628133,0.907969,0.503826
1,0.651118,-0.319318,-0.848077,0.605965
2,-2.018168,0.740122,0.528813,-0.589001
3,0.188695,-0.758872,-0.933237,0.955057
4,0.190794,1.978757,2.605967,0.683509


In [6]:
indices = 'A B C D E'.split() # indices = ['A', 'B', ...]
indices

['A', 'B', 'C', 'D', 'E']

In [7]:
df = pd.DataFrame(data = arr, index = indices)
df

Unnamed: 0,0,1,2,3
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [8]:
columnas = 'W X Y Z'.split()
columnas

['W', 'X', 'Y', 'Z']

In [9]:
df = pd.DataFrame(data = arr, index = indices, columns = columnas)
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


## Selection and Indexing

Let's learn the various ways to get data out of a DataFrame.

In [10]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [12]:
df['X']

A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [11]:
type(df['W'])

pandas.core.series.Series

In [15]:
df.loc['C','Z']

-0.5890005332865824

In [16]:
df[['W','Y']]

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
C,-2.018168,0.528813
D,0.188695,-0.933237
E,0.190794,2.605967


In [18]:
df[['Z','W','Y']]

Unnamed: 0,Z,W,Y
A,0.503826,2.70685,0.907969
B,0.605965,0.651118,-0.848077
C,-0.589001,-2.018168,0.528813
D,0.955057,0.188695,-0.933237
E,0.683509,0.190794,2.605967


In [19]:
df['W']  # df.W


A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [20]:
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [22]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [21]:
df.iloc[1,1]

-0.31931804459303326

### Creating a new column:

In [23]:
df['Operacion1'] = df['W'] + df['Y']
df

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [24]:
df['Operacion2'] = df['W']**2 + df['Y']**3
df

Unnamed: 0,W,X,Y,Z,Operacion1,Operacion2
A,2.70685,0.628133,0.907969,0.503826,3.614819,8.075574
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,-0.186012
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,4.220882
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,-0.77718
E,0.190794,1.978757,2.605967,0.683509,2.796762,17.733697
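
Bracket assignment, as used above, modifies the DataFrame in place. As a sketch of an alternative, `DataFrame.assign` returns a new DataFrame and leaves the original untouched. The frame `demo` and the column name `Suma` are hypothetical and used only for illustration.

```python
import pandas as pd

# Small illustrative frame (not the notebook's df)
demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [3.0, 4.0]}, index=['A', 'B'])

# assign() builds the new column and returns a copy that includes it
with_sum = demo.assign(Suma=demo['W'] + demo['Y'])

print(with_sum)
print('Suma' in demo.columns)  # the original frame does not gain the column
```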


### Dropping Columns

In [25]:
df.drop('Operacion2', axis=1, inplace=False)

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [27]:
df2 = df.drop('Operacion2', axis=1, inplace=False)

In [28]:
df2

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [29]:
df.drop('Operacion2', axis=1, inplace=True)

In [30]:
df

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


We can also drop columns this way:
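
Two other common ways to remove a column, shown here as a sketch that is not necessarily the approach the author had in mind, are the `del` statement and `DataFrame.pop`. Both modify the DataFrame in place. The frame `demo` is hypothetical.

```python
import pandas as pd

# Small illustrative frame (not the notebook's df)
demo = pd.DataFrame({'W': [1.0, 2.0], 'X': [3.0, 4.0], 'Y': [5.0, 6.0]})

# del removes the column in place and returns nothing
del demo['Y']

# pop removes the column in place and returns it as a Series
x_col = demo.pop('X')

print(list(demo.columns))
print(x_col.tolist())
```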

### Selecting Columns

In [31]:
df.iloc[0,1]

0.6281327087844596

In [32]:
df.iloc[0:3,3]

A    0.503826
B    0.605965
C   -0.589001
Name: Z, dtype: float64

In [35]:
df.iloc[:2,:2]

Unnamed: 0,W,X
A,2.70685,0.628133
B,0.651118,-0.319318


In [36]:
df.iloc[::2,::2]

Unnamed: 0,W,Y,Operacion1
A,2.70685,0.907969,3.614819
C,-2.018168,0.528813,-1.489355
E,0.190794,2.605967,2.796762


In [39]:
df.loc['A','W']

2.706849839399938

In [40]:
df.loc[:,'W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [41]:
df.loc[::2,'W']

A    2.706850
C   -2.018168
E    0.190794
Name: W, dtype: float64

In [42]:
df.loc[['A','B'],['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965


Or select by position instead of by label.

### Selecting a subset of rows and columns

In [43]:
df.iloc[[0,1],[0,3]]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
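
One difference between the two accessors is easy to miss: a slice in `loc` includes its end label, while a slice in `iloc` excludes its end position, as in regular Python slicing. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Small illustrative frame (not the notebook's df)
demo = pd.DataFrame({'W': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])

by_label = demo.loc['A':'C', 'W']   # label slice: 'C' is included
by_position = demo.iloc[0:2, 0]     # position slice: position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```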


### Conditional selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [44]:
df

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [47]:
arr>1

array([[ True, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False,  True,  True, False]])

In [48]:
df>1

Unnamed: 0,W,X,Y,Z,Operacion1
A,True,False,False,False,True
B,False,False,False,False,False
C,False,False,False,False,False
D,False,False,False,False,False
E,False,True,True,False,True


In [49]:
df[df>1]

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,,,,3.614819
B,,,,,
C,,,,,
D,,,,,
E,,1.978757,2.605967,,2.796762
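
Indexing with a boolean DataFrame keeps the original shape and fills the positions where the condition is False with NaN. The same result can be obtained with `DataFrame.where`, sketched here on a hypothetical frame `demo`:

```python
import pandas as pd

# Small illustrative frame (not the notebook's df)
demo = pd.DataFrame({'W': [2.5, 0.5], 'X': [0.1, 1.9]}, index=['A', 'B'])

masked = demo[demo > 1]       # positions where the mask is False become NaN
same = demo.where(demo > 1)   # explicit spelling of the same operation

print(masked)
print(masked.equals(same))
```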


In [53]:
df

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [51]:
df['X']>0

A     True
B    False
C     True
D    False
E     True
Name: X, dtype: bool

In [52]:
df[df['X']>0]

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
E,0.190794,1.978757,2.605967,0.683509,2.796762


For two conditions, combine them with the | and & operators, wrapping each condition in parentheses:

In [55]:
df3 = df[(df['X']>0)&(df['W']>0)]
df3

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [57]:
df3[['W','X']]

Unnamed: 0,W,X
A,2.70685,0.628133
E,0.190794,1.978757


In [58]:
df[(df['X']>0)&(df['W']>0)][['W','X']]

Unnamed: 0,W,X
A,2.70685,0.628133
E,0.190794,1.978757
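
The cells above only use `&`. As a sketch, `|` (or) and `~` (not) work the same way, shown here on a hypothetical frame `demo`. The parentheses are needed because `&` and `|` bind more tightly than comparisons such as `>`, and Python's `and`/`or` keywords raise a `ValueError` when applied to a whole Series.

```python
import pandas as pd

# Small illustrative frame (not the notebook's df)
demo = pd.DataFrame({'W': [1.0, -1.0, 2.0], 'X': [-3.0, -4.0, 5.0]},
                    index=['A', 'B', 'C'])

# Rows where at least one of the two conditions holds
either = demo[(demo['W'] > 0) | (demo['X'] > 0)]

# Rows where neither condition holds: negate the combined mask with ~
neither = demo[~((demo['W'] > 0) | (demo['X'] > 0))]

print(either.index.tolist())
print(neither.index.tolist())
```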


## More about the index

Let's look at some more indexing features, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [59]:
df

Unnamed: 0,W,X,Y,Z,Operacion1
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [60]:
df.reset_index(inplace=True)
df

Unnamed: 0,index,W,X,Y,Z,Operacion1
0,A,2.70685,0.628133,0.907969,0.503826,3.614819
1,B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
2,C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
3,D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
4,E,0.190794,1.978757,2.605967,0.683509,2.796762


In [61]:
newind = 'CA NY WY OR CO'.split()
newind

['CA', 'NY', 'WY', 'OR', 'CO']

In [62]:
df['States'] = newind
df

Unnamed: 0,index,W,X,Y,Z,Operacion1,States
0,A,2.70685,0.628133,0.907969,0.503826,3.614819,CA
1,B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,NY
2,C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,WY
3,D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,OR
4,E,0.190794,1.978757,2.605967,0.683509,2.796762,CO


In [63]:
df.set_index('States', inplace=True)
df

States,index,W,X,Y,Z,Operacion1
CA,A,2.70685,0.628133,0.907969,0.503826,3.614819
NY,B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
WY,C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
OR,D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
CO,E,0.190794,1.978757,2.605967,0.683509,2.796762


In [64]:
df.drop('index', axis=1, inplace=True)

In [65]:
df

States,W,X,Y,Z,Operacion1
CA,2.70685,0.628133,0.907969,0.503826,3.614819
NY,0.651118,-0.319318,-0.848077,0.605965,-0.196959
WY,-2.018168,0.740122,0.528813,-0.589001,-1.489355
OR,0.188695,-0.758872,-0.933237,0.955057,-0.744542
CO,0.190794,1.978757,2.605967,0.683509,2.796762
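
The cells above pass `inplace=True`. As a sketch of the non-mutating style, `set_index` and `reset_index` return new DataFrames by default, and `reset_index(drop=True)` discards the old index instead of turning it into a column. The frame `demo` is hypothetical.

```python
import pandas as pd

# Small illustrative frame (not the notebook's df)
demo = pd.DataFrame({'W': [1.0, 2.0], 'States': ['CA', 'NY']}, index=['A', 'B'])

by_state = demo.set_index('States')        # 'States' moves from column to index
restored = by_state.reset_index()          # the index becomes a column again
dropped = by_state.reset_index(drop=True)  # the old index is discarded

print(by_state.index.tolist())
print(list(restored.columns))
print(list(dropped.columns))
```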


## DataFrame Summaries
There are a couple of ways to get summary data from a DataFrame.<br>
<tt><strong>df.describe()</strong></tt> provides summary statistics for all numeric columns.<br>
<tt><strong>df.info() and df.dtypes</strong></tt> show the data type of every column.

In [66]:
df.describe()

Unnamed: 0,W,X,Y,Z,Operacion1
count,5.0,5.0,5.0,5.0,5.0
mean,0.343858,0.453764,0.452287,0.431871,0.796145
std,1.681131,1.061385,1.454516,0.594708,2.26555
min,-2.018168,-0.758872,-0.933237,-0.589001,-1.489355
25%,0.188695,-0.319318,-0.848077,0.503826,-0.744542
50%,0.190794,0.628133,0.528813,0.605965,-0.196959
75%,0.651118,0.740122,0.907969,0.683509,2.796762
max,2.70685,1.978757,2.605967,0.955057,3.614819


In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, CA to CO
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   W           5 non-null      float64
 1   X           5 non-null      float64
 2   Y           5 non-null      float64
 3   Z           5 non-null      float64
 4   Operacion1  5 non-null      float64
dtypes: float64(5)
memory usage: 240.0+ bytes


In [69]:
df.dtypes

W             float64
X             float64
Y             float64
Z             float64
Operacion1    float64
dtype: object
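
Every column in this notebook is float64, so `describe()` covers all of them. As a sketch of what happens with mixed types, on a hypothetical frame `demo`, non-numeric columns are skipped by default, and the numeric ones can be picked out with `select_dtypes`:

```python
import pandas as pd

# Small illustrative frame with one numeric and one text column
demo = pd.DataFrame({'W': [1.0, 2.0, 3.0], 'State': ['CA', 'NY', 'WY']})

summary = demo.describe()                            # numeric columns only, by default
numeric_only = demo.select_dtypes(include='number')  # keep just the numeric columns

print(list(summary.columns))
print(list(numeric_only.columns))
print(summary.loc['mean', 'W'])
```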