
<img style="float: left;" src='Figures/alinco.png' height="100"/>

# <center> <font color= #000047> DataFrames in Pandas</font> </center>


# DataFrames

DataFrames are the main tool of the pandas library and are directly inspired by the R programming language. We can think of a DataFrame as a group of Series objects put together so that they share the same index. Let's use pandas to explore this topic!
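
To make the "group of Series sharing one index" idea concrete, here is a minimal sketch. The names `ages`, `cities` and `people`, and the values in them, are invented for illustration and do not come from the notebook.

```python
import pandas as pd

# Two Series that share the same index labels
ages = pd.Series([31, 45, 27], index=["ana", "luis", "eva"])
cities = pd.Series(["Lima", "Quito", "Bogota"], index=["ana", "luis", "eva"])

# Combining them column-wise yields a DataFrame with that shared index
people = pd.DataFrame({"age": ages, "city": cities})

# Each column of the DataFrame is again a Series
age_column = people["age"]
print(type(age_column).__name__)   # Series
print(list(people.index))          # ['ana', 'luis', 'eva']
```

Selecting a single column gives back a Series whose index is the DataFrame's index, which is why column selection and row labels line up throughout the rest of this notebook.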

In [1]:
# Import libraries
import numpy as np
import pandas as pd

## Selection and Indexing

Let's learn the various ways to get data out of a DataFrame.

In [2]:
df= pd.DataFrame(np.random.randn(5,4))
print(df)

          0         1         2         3
0  0.553747 -0.335115  1.189771  0.275664
1  0.105198 -0.337352 -0.542403  0.338660
2 -0.503820  0.463611 -0.110793  0.537875
3  0.095747 -0.174665  0.765632  0.690525
4 -0.424041  0.803894  0.418056 -0.867862


In [3]:
df[0]

0    0.553747
1    0.105198
2   -0.503820
3    0.095747
4   -0.424041
Name: 0, dtype: float64

In [7]:
ind='A B C D E'.split()
df= pd.DataFrame(np.random.randn(5,4), index=ind, columns='W X Y Z'.split())
print(df)

          W         X         Y         Z
A  0.236764  1.752594  0.225746  0.388335
B  0.587074 -0.018796  0.261965  0.925602
C  0.400808  0.618183  1.271299  0.239199
D  1.414572  0.363157  1.789851  0.618117
E -0.122182 -0.806622 -0.351290  0.492906


In [12]:
df['W']

A    0.236764
B    0.587074
C    0.400808
D    1.414572
E   -0.122182
Name: W, dtype: float64

In [13]:
df.loc['C','X']

np.float64(0.6181832659396477)

In [14]:
df.iloc[2,1]

np.float64(0.6181832659396477)

In [16]:
df[['W','X']]

,W,X
A,0.236764,1.752594
B,0.587074,-0.018796
C,0.400808,0.618183
D,1.414572,0.363157
E,-0.122182,-0.806622


In [17]:
df.W

A    0.236764
B    0.587074
C    0.400808
D    1.414572
E   -0.122182
Name: W, dtype: float64

### Creating a new column:

In [19]:
df['newColumn'] =df['X'] -df.Y*2
df

,W,X,Y,Z,newColumn
A,0.236764,1.752594,0.225746,0.388335,1.301102
B,0.587074,-0.018796,0.261965,0.925602,-0.542727
C,0.400808,0.618183,1.271299,0.239199,-1.924415
D,1.414572,0.363157,1.789851,0.618117,-3.216544
E,-0.122182,-0.806622,-0.35129,0.492906,-0.104042


### Removing Columns

In [22]:
df.drop('newColumn',axis=1, inplace=True)
df

,W,X,Y,Z
A,0.236764,1.752594,0.225746,0.388335
B,0.587074,-0.018796,0.261965,0.925602
C,0.400808,0.618183,1.271299,0.239199
D,1.414572,0.363157,1.789851,0.618117
E,-0.122182,-0.806622,-0.35129,0.492906


Columns can also be removed in other ways, for example with the <tt>del</tt> statement or by passing <tt>columns=</tt> to <tt>drop</tt>.
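
The notebook does not show which alternative was intended here, so the following is only a sketch of three common options. The frame `demo` and its random values are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((3, 4)), columns=list("WXYZ"))

# drop() returns a new DataFrame by default; the original is untouched
without_z = demo.drop(columns=["Z"])

# del removes the column from the original object in place
del demo["Y"]

# pop() removes the column in place and returns it as a Series
x_values = demo.pop("X")

print(list(without_z.columns))  # ['W', 'X', 'Y']
print(list(demo.columns))       # ['W', 'Z']
```

`drop(columns=...)` is equivalent to `drop(..., axis=1)` and tends to read more clearly, while `del` and `pop` always modify the DataFrame they are applied to.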

### Selecting Columns

<tt>df.loc[]</tt> selects by label.

<tt>df.iloc[]</tt> selects by position instead of by label.
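
A small sketch of label-based versus position-based selection follows. The frame `demo` uses integer values from `np.arange` (an assumption made so that results are easy to check by eye) and the same row and column labels as the notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                    index=list("ABCDE"), columns=list("WXYZ"))

row_by_label = demo.loc["C"]          # row labelled 'C' as a Series
row_by_position = demo.iloc[2]        # third row, same data
col_by_label = demo.loc[:, "X"]       # all rows, column 'X'
col_by_position = demo.iloc[:, 1]     # all rows, second column

# Label slices include the end point; position slices do not
label_slice = demo.loc["A":"C"]       # rows A, B and C
position_slice = demo.iloc[0:2]       # rows A and B

print(list(label_slice.index))     # ['A', 'B', 'C']
print(list(position_slice.index))  # ['A', 'B']
```

The differing end-point rule for slices is a frequent source of off-by-one mistakes when switching between `loc` and `iloc`.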

### Selecting a subset of rows and columns

In [23]:
df.loc[['A', 'B'],['X','Y']]

,X,Y
A,1.752594,0.225746
B,-0.018796,0.261965


### Conditional selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [24]:
df

,W,X,Y,Z
A,0.236764,1.752594,0.225746,0.388335
B,0.587074,-0.018796,0.261965,0.925602
C,0.400808,0.618183,1.271299,0.239199
D,1.414572,0.363157,1.789851,0.618117
E,-0.122182,-0.806622,-0.35129,0.492906


In [25]:
df>0

,W,X,Y,Z
A,True,True,True,True
B,True,False,True,True
C,True,True,True,True
D,True,True,True,True
E,False,False,False,True


In [26]:
df[df>0]

,W,X,Y,Z
A,0.236764,1.752594,0.225746,0.388335
B,0.587074,,0.261965,0.925602
C,0.400808,0.618183,1.271299,0.239199
D,1.414572,0.363157,1.789851,0.618117
E,,,,0.492906


In [27]:
df['W']>0

A     True
B     True
C     True
D     True
E    False
Name: W, dtype: bool

In [28]:
df[df['W']>0]

,W,X,Y,Z
A,0.236764,1.752594,0.225746,0.388335
B,0.587074,-0.018796,0.261965,0.925602
C,0.400808,0.618183,1.271299,0.239199
D,1.414572,0.363157,1.789851,0.618117


In [29]:
df[df['W']>0][['W','Z']]

,W,Z
A,0.236764,0.388335
B,0.587074,0.925602
C,0.400808,0.239199
D,1.414572,0.618117


In [31]:
df[(df['W']>0) & (df['Z']>0)]

,W,X,Y,Z
A,0.236764,1.752594,0.225746,0.388335
B,0.587074,-0.018796,0.261965,0.925602
C,0.400808,0.618183,1.271299,0.239199
D,1.414572,0.363157,1.789851,0.618117
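
The `&` in the last cell is the element-wise AND. As a sketch of the companion operators, the example below also uses `|` (OR) and `~` (NOT), and shows why Python's `and` keyword does not work on a Series. The frame `demo` and its values are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"W": [0.2, 0.6, -0.1], "Z": [0.4, -0.9, 0.5]},
                    index=["A", "B", "C"])

both_positive = demo[(demo["W"] > 0) & (demo["Z"] > 0)]   # AND
any_positive = demo[(demo["W"] > 0) | (demo["Z"] > 0)]    # OR
w_not_positive = demo[~(demo["W"] > 0)]                   # NOT

# Python's `and` asks for a single truth value, which a Series cannot give
error_message = None
try:
    demo[(demo["W"] > 0) and (demo["Z"] > 0)]
except ValueError as err:
    error_message = str(err)

print(list(both_positive.index))   # ['A']
print(list(w_not_positive.index))  # ['C']
print(error_message)
```

The parentheses around each comparison are required because `&` and `|` bind more tightly than `>` in Python.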


## More Index Details

Let's look at some more features of indexing, including resetting the index or setting it to something else. We will also talk about index hierarchy!

In [36]:
df.reset_index()

,index,W,X,Y,Z
0,A,0.236764,1.752594,0.225746,0.388335
1,B,0.587074,-0.018796,0.261965,0.925602
2,C,0.400808,0.618183,1.271299,0.239199
3,D,1.414572,0.363157,1.789851,0.618117
4,E,-0.122182,-0.806622,-0.35129,0.492906


In [37]:
newInd =['H','J','K','L','M']
df['Test']=newInd
df

,W,X,Y,Z,Test
A,0.236764,1.752594,0.225746,0.388335,H
B,0.587074,-0.018796,0.261965,0.925602,J
C,0.400808,0.618183,1.271299,0.239199,K
D,1.414572,0.363157,1.789851,0.618117,L
E,-0.122182,-0.806622,-0.35129,0.492906,M


In [39]:
df.set_index('Test',inplace=True)
df

Test,W,X,Y,Z
H,0.236764,1.752594,0.225746,0.388335
J,0.587074,-0.018796,0.261965,0.925602
K,0.400808,0.618183,1.271299,0.239199
L,1.414572,0.363157,1.789851,0.618117
M,-0.122182,-0.806622,-0.35129,0.492906
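
Note that `reset_index()` above was called without `inplace=True`, so it only displayed a new DataFrame, whereas `set_index(..., inplace=True)` changed `df` itself. The sketch below illustrates the difference on a small invented frame named `demo`.

```python
import pandas as pd

demo = pd.DataFrame({"W": [1.0, 2.0], "Test": ["H", "J"]}, index=["A", "B"])

# Without inplace=True both methods return a new DataFrame
reindexed = demo.set_index("Test")
restored = reindexed.reset_index()

# drop=True discards the old index instead of turning it into a column
discarded = reindexed.reset_index(drop=True)

print(list(demo.index))         # ['A', 'B'], original unchanged
print(list(reindexed.index))    # ['H', 'J'], old labels A and B are gone
print(list(restored.columns))   # ['Test', 'W']
print(list(discarded.columns))  # ['W']
```

Because `set_index` replaces the existing index, the original labels are lost unless they were first saved as a column with `reset_index()`.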


## DataFrame Summary
There are a couple of ways to obtain summary data on DataFrames.<br>
<tt><strong>df.describe()</strong></tt> provides summary statistics for all numeric columns.<br>
<tt><strong>df.info() and df.dtypes</strong></tt> show the data type of every column.
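
As a brief sketch of how these summaries behave when a frame mixes numeric and text columns, consider the invented frame `demo` below. The `include="all"` argument is an extra option not used in this notebook.

```python
import pandas as pd

demo = pd.DataFrame({"price": [10.0, 12.5, 9.0], "label": ["a", "b", "a"]})

numeric_summary = demo.describe()            # numeric columns only
full_summary = demo.describe(include="all")  # adds count/unique/top/freq for text

print(list(numeric_summary.columns))  # ['price']
print(demo.dtypes)                    # one dtype per column
demo.info()                           # prints dtypes, non-null counts, memory use
```

`dtypes` is an attribute that returns a Series, while `info()` is a method that prints its report and returns `None`.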

In [40]:
df.describe()

,W,X,Y,Z
count,5.0,5.0,5.0,5.0
mean,0.503407,0.381703,0.639514,0.532832
std,0.572322,0.936499,0.8686,0.259921
min,-0.122182,-0.806622,-0.35129,0.239199
25%,0.236764,-0.018796,0.225746,0.388335
50%,0.400808,0.363157,0.261965,0.492906
75%,0.587074,0.618183,1.271299,0.618117
max,1.414572,1.752594,1.789851,0.925602


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, H to M
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   W       5 non-null      float64
 1   X       5 non-null      float64
 2   Y       5 non-null      float64
 3   Z       5 non-null      float64
dtypes: float64(4)
memory usage: 200.0+ bytes


# Missing Data

Here are a few convenient methods for dealing with missing data in pandas:

In [42]:
df= pd.DataFrame({'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]})
df

,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [43]:
df.dropna()

,A,B,C
0,1.0,5.0,1


In [44]:
df.dropna(axis=1)

,C
0,1
1,2
2,3


In [45]:
df.dropna(thresh=2)

,A,B,C
0,1.0,5.0,1
1,2.0,,2


In [47]:
df.fillna(value=df['A'].mean())  # Fill every NaN with the mean of column A

,A,B,C
0,1.0,5.0,1
1,2.0,1.5,2
2,1.5,1.5,3


In [48]:
df.fillna(value='unk')

,A,B,C
0,1.0,5.0,1
1,2.0,unk,2
2,unk,unk,3
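
The cells above rely on a few parameters that are worth spelling out. The sketch below reuses the same three-column data under the assumed name `demo`; `subset=` and filling with `demo.mean()` are additions that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1, 2, np.nan], "B": [5, np.nan, np.nan], "C": [1, 2, 3]})

# thresh=2 keeps rows that have at least two non-missing values
at_least_two = demo.dropna(thresh=2)

# subset limits the check to the listed columns
a_present = demo.dropna(subset=["A"])

# Passing a Series of column means fills each column with its own mean
filled = demo.fillna(demo.mean())

print(list(at_least_two.index))  # [0, 1]
print(filled["A"].tolist())      # [1.0, 2.0, 1.5]
print(filled["B"].tolist())      # [5.0, 5.0, 5.0]
```

Filling with `demo.mean()` differs from `fillna(value=df['A'].mean())`: the latter writes the mean of column A into every column, which is rarely what is wanted for columns on different scales.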


# Groupby

The groupby method lets us group rows of data and apply aggregation functions to each group.

In [49]:
data = {'Company': ['GOOG','GOOG','MSFT', 'MSFT', 'FB', 'FB'],
       'Person': ['Sam','Charly','Amy', 'Vane', 'Carlos', 'Sara'],
       'Sales':[200,120,340,124,243,350]}

In [50]:
df = pd.DataFrame(data)
df

,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charly,120
2,MSFT,Amy,340
3,MSFT,Vane,124
4,FB,Carlos,243
5,FB,Sara,350


<strong>Now we can use the .groupby() method to group rows based on a column name. <br> For example, let's group by company. This creates a DataFrameGroupBy object, on which we then call an aggregation method:</strong>

In [52]:
df.groupby('Company').count()

Company,Person,Sales
FB,2,2
GOOG,2,2
MSFT,2,2


In [55]:
df.groupby('Company')['Sales'].mean()

Company
FB      296.5
GOOG    160.0
MSFT    232.0
Name: Sales, dtype: float64

In [56]:
df.groupby('Company')['Sales'].min()

Company
FB      243
GOOG    120
MSFT    124
Name: Sales, dtype: int64

In [57]:
df.groupby('Company')['Sales'].describe()

Company,count,mean,std,min,25%,50%,75%,max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0
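
Several aggregations can also be computed in one call. The sketch below rebuilds the same sales data under the assumed name `sales` and uses `agg` and `idxmax`, two methods the notebook does not call, to summarise each company and find its top seller.

```python
import pandas as pd

sales = pd.DataFrame({
    "Company": ["GOOG", "GOOG", "MSFT", "MSFT", "FB", "FB"],
    "Person": ["Sam", "Charly", "Amy", "Vane", "Carlos", "Sara"],
    "Sales": [200, 120, 340, 124, 243, 350],
})

by_company = sales.groupby("Company")     # lazy DataFrameGroupBy object
summary = by_company["Sales"].agg(["sum", "mean", "max"])

# idxmax gives the row label of the highest sale within each group
top_rows = sales.loc[by_company["Sales"].idxmax()]

print(summary)
print(top_rows[["Company", "Person", "Sales"]])
```

Nothing is computed when `groupby` is called; the work happens only once an aggregation such as `agg`, `mean` or `count` is applied.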


# Operations

There are many operations in pandas that will be really useful to us, for example:

In [58]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df

,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


### Info on Unique Values

In [59]:
df['col3'].unique()

array(['abc', 'def', 'ghi', 'xyz'], dtype=object)

In [60]:
df['col3'].nunique()

4

In [61]:
df['col3'].value_counts()

col3
abc    1
def    1
ghi    1
xyz    1
Name: count, dtype: int64
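
Since every value in `col3` is distinct, the counts above are all 1. A column with repeats makes these methods more informative, as in this sketch on the `col2` values (the frame name `demo` is an assumption, and `normalize=True` is an option not used in the notebook).

```python
import pandas as pd

demo = pd.DataFrame({"col2": [444, 555, 666, 444]})

counts = demo["col2"].value_counts()                 # absolute counts, most frequent first
shares = demo["col2"].value_counts(normalize=True)   # relative frequencies

print(demo["col2"].nunique())  # 3
print(counts.loc[444])         # 2
print(shares.loc[444])         # 0.5
```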

### Selecting Data

In [62]:
df[(df['col1']>2) & (df['col2']==444)]

,col1,col2,col3
3,4,444,xyz


### Defining and Applying Functions

In [63]:
def times2(x):
    # Despite its name, this evaluates the polynomial x**2 + 2*x + 2
    return x**2 + 2*x + 2

In [64]:
df['col1'].apply(times2)

0     5
1    10
2    17
3    26
Name: col1, dtype: int64

In [66]:
df['col4']=df['col1'].apply(times2)
df

,col1,col2,col3,col4
0,1,444,abc,5
1,2,555,def,10
2,3,666,ghi,17
3,4,444,xyz,26


In [67]:
df['col3'].apply(len)

0    3
1    3
2    3
3    3
Name: col3, dtype: int64
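
`apply` accepts any callable, including a lambda, and for simple arithmetic the same result can be written directly on the Series. The sketch below compares the approaches on an invented frame `demo` that mirrors `col1` and `col3`.

```python
import pandas as pd

demo = pd.DataFrame({"col1": [1, 2, 3, 4], "col3": ["abc", "def", "ghi", "xyz"]})

# A lambda avoids defining a named function for one-off use
via_apply = demo["col1"].apply(lambda x: x**2 + 2*x + 2)

# Arithmetic on the whole Series gives the same values without a Python-level loop
vectorized = demo["col1"]**2 + 2*demo["col1"] + 2

# The .str accessor is the vectorized counterpart of apply(len) for text columns
lengths = demo["col3"].str.len()

print(via_apply.tolist())   # [5, 10, 17, 26]
print(vectorized.tolist())  # [5, 10, 17, 26]
print(lengths.tolist())     # [3, 3, 3, 3]
```

On large columns the vectorized form is usually much faster, so `apply` is best kept for logic that cannot be expressed with built-in Series operations.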

### Permanently Removing a Column

In [None]:
del df['col4']  # del removes the column from df in place
df


,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


### Getting Index and Column Names:

In [75]:
df.set_index('col3', inplace=True)

In [76]:
df

col3,col1,col2
abc,1,444
def,2,555
ghi,3,666
xyz,4,444


In [78]:
df.index

Index(['abc', 'def', 'ghi', 'xyz'], dtype='object', name='col3')


### Sorting and Ordering a DataFrame:

In [79]:
df

col3,col1,col2
abc,1,444
def,2,555
ghi,3,666
xyz,4,444


In [80]:
df.sort_values(by='col2', inplace=True)

df

col3,col1,col2
abc,1,444
xyz,4,444
def,2,555
ghi,3,666


In [82]:
df.sort_values(by='col2', inplace=True, ascending=False)
df

col3,col1,col2
ghi,3,666
def,2,555
xyz,4,444
abc,1,444
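
Rows `abc` and `xyz` tie on `col2`, so their relative order depends on the sort algorithm. A second sort key makes the order explicit, as in this sketch (the frame `demo` rebuilds the same values; sorting by two keys and `sort_index` are additions not shown in the notebook).

```python
import pandas as pd

demo = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [444, 555, 666, 444]},
                    index=["abc", "def", "ghi", "xyz"])

# Sort by col2 ascending; ties on col2 are broken by col1 descending
ordered = demo.sort_values(by=["col2", "col1"], ascending=[True, False])

# sort_index orders by the row labels instead of by column values
by_label = ordered.sort_index()

print(list(ordered.index))   # ['xyz', 'abc', 'def', 'ghi']
print(list(by_label.index))  # ['abc', 'def', 'ghi', 'xyz']
```

Both methods return a new DataFrame unless `inplace=True` is passed, so `demo` keeps its original order here.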


# Data Input and Output

The pandas library can read a variety of file types through its pd.read_* functions (for example pd.read_csv and pd.read_excel). Let's take a look at the most common file types:


In [83]:
df_bank = pd.read_csv('Data/bank.csv')
df_bank.tail(10)

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
4511,46,blue-collar,married,secondary,no,668,yes,no,unknown,15,may,1263,2,-1,0,unknown,yes
4512,40,blue-collar,married,secondary,no,1100,yes,no,unknown,29,may,660,2,-1,0,unknown,no
4513,49,blue-collar,married,secondary,no,322,no,no,cellular,14,aug,356,2,-1,0,unknown,no
4514,38,blue-collar,married,secondary,no,1205,yes,no,cellular,20,apr,45,4,153,1,failure,no
4515,32,services,single,secondary,no,473,yes,no,cellular,7,jul,624,5,-1,0,unknown,no
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no
4520,44,entrepreneur,single,tertiary,no,1136,yes,yes,cellular,3,apr,345,2,249,7,other,no


In [84]:
df_bank.describe()

,age,balance,day,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,15.915284,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,8.247667,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,1.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,9.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,16.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,21.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,31.0,3025.0,50.0,871.0,25.0


In [85]:
df_bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB


## CSV
Comma-separated values (CSV) files are text files that use commas as field delimiters.<br>
### CSV Input

In [None]:
# The CSV file was already read above with: df_bank = pd.read_csv('Data/bank.csv')
df

col3,col1,col2
ghi,3,666
def,2,555
xyz,4,444
abc,1,444


### CSV Output

In [87]:
df.to_csv('Data/e1.csv')
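
By default `to_csv` writes the index as the first column, so it has to be restored explicitly when reading the file back. The sketch below shows the round trip with an in-memory `io.StringIO` buffer standing in for a file on disk; that substitution, and the frame `demo`, are assumptions made to keep the example self-contained.

```python
import io
import pandas as pd

demo = pd.DataFrame({"col1": [3, 2], "col2": [666, 555]},
                    index=pd.Index(["ghi", "def"], name="col3"))

buffer = io.StringIO()
demo.to_csv(buffer)            # the index is written as the first column
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)

print(restored.index.name)     # col3
print(list(restored.index))    # ['ghi', 'def']
```

Reading the file without `index_col=0` would leave `col3` as an ordinary column, and passing `index=False` to `to_csv` would omit the index from the file entirely.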

## Excel
With pandas we can read and write MS Excel files. However, this only imports data, not formulas or images. A file that contains images or macros may cause the <tt>.read_excel()</tt> method to fail. Reading .xlsx files also requires the optional openpyxl package.

In [90]:
pd.read_excel('Data/Excel2.xlsx', index_col =[0])

ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.