![](./imagenes/python_logo.jpeg)
# The Pandas Library.
***

[*Pandas*](https://pandas.pydata.org/pandas-docs/stable/) is arguably the most important *Python* package used in *Data Science*. It not only offers many methods and functions that make working with data easier, it is also optimized for speed, which gives it a significant advantage over processing numeric data in plain *Python*.

*Pandas* is a library that provides fast, flexible, and expressive data structures designed for working with *labeled* and/or *relational* data. Conceptually, they can be thought of as *NumPy* *arrays* whose rows and columns are *labeled*. In shape, *Pandas* data structures resemble spreadsheets inside *Python*.

Like *NumPy*, *Pandas* is a very good tool for working with numbers, vectors, linear algebra, and so on. *Pandas* is well suited to working with:

* Tabular, heterogeneous data (floats, strings, integers, etc.).
* Time series.
* The same data that can be handled with *NumPy* arrays.

*Pandas* is not part of the standard *Python* installation, so it must be installed separately. To install *Pandas*, run the following command from a console or terminal:
```bash
pip install pandas
```
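In addition, reading Excel files requires a separate engine package. For `.xlsx` files, current versions of *Pandas* rely on `openpyxl` (`xlrd` 2.0 and later only reads the legacy `.xls` format):
```bash
pip install openpyxl
```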
*Pandas* has very thorough documentation and a range of [tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html).

Historically, *Pandas* offered three basic object types, all of them built on top of *NumPy*:

* *Series* (lists, 1D),
* *DataFrame* (tables, 2D), and
* *Panel* (3D tables), which was deprecated in version 0.20 and removed in version 0.25.

We will cover the basic use of the first two object types; for more detail, see the [manual](http://pandas.pydata.org/pandas-docs/stable/dsintro.html).
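
As a small illustration of these two structures, the sketch below builds a `Series` and a `DataFrame` with explicit labels. The names `prices` and `inventory` and all of their values are made up for this example.

```python
import pandas as pd

# A Series is a 1-D labeled array: each value carries an index label.
prices = pd.Series([1.5, 0.25, 3.0], index=["apple", "lime", "melon"])

# A DataFrame is a 2-D table: labeled rows and labeled columns.
inventory = pd.DataFrame(
    {"price": [1.5, 0.25, 3.0], "units": [10, 40, 2]},
    index=["apple", "lime", "melon"],
)

print(prices["lime"])                   # access a Series value by label
print(inventory.loc["melon", "units"])  # row label, then column label
print(inventory.shape)                  # (rows, columns)
```

Each column of `inventory` is itself a `Series` that shares the row index of the table.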

This notebook uses the following abbreviations:

* **df**: any *Pandas* `DataFrame` object.
* **s**: any *Pandas* `Series` object.

To begin, we import *Pandas* following the usual convention:

In [63]:
import pandas as pd

## Importing data.

In [64]:
csv_data  = "datos/capacitaciones.csv"
json_data = "datos/capacitaciones.json"
xlsx_data = "datos/capacitaciones.xlsx"

df = pd.read_csv(csv_data, encoding = "ISO-8859-1") # From a CSV file.
df = pd.read_json(json_data)                        # From a JSON file.
df = pd.read_excel(xlsx_data)                       # From an XLSX file.

Other data *importers* are available, such as:

* `pd.read_table(filename)`: imports from a delimited text file (such as TSV).
* `pd.read_sql(query, connection_object)`: imports from a SQL database or table.
* `pd.read_html(url)`: imports from an HTML URL, string, or file and extracts its tables into a list.
* `pd.read_clipboard()`: reads data from the contents of the clipboard.
* `pd.DataFrame(dict)`: builds a DataFrame from a *Python* dictionary.
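
The importers can be tried without any file on disk. The following is a minimal sketch that assumes a made-up CSV text (`csv_text`) and made-up column names; it reads the text through an in-memory buffer and builds the same table from a dictionary.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file.
csv_text = "name,hours\nAna,12\nLuis,8\n"

# read_csv accepts a path or any file-like object, such as StringIO.
from_csv = pd.read_csv(io.StringIO(csv_text))

# From a dictionary: keys become column names, lists become columns.
from_dict = pd.DataFrame({"name": ["Ana", "Luis"], "hours": [12, 8]})

print(from_csv)
print(from_dict)
```

Both calls should produce a table with the columns `name` and `hours`.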

## Exporting data.

Data *exporters* save the contents of *df* to disk or to a database:

* `df.to_csv(filename)`: writes the data to a CSV file.
* `df.to_excel(filename)`: writes the data to an Excel file.
* `df.to_sql(table_name, connection_object)`: writes the data to a SQL table.
* `df.to_json(filename)`: writes the data to a JSON-formatted file.
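
A quick way to inspect what an exporter produces is to omit the file name, in which case `to_csv` and `to_json` return the text instead of writing it. The sketch below uses a made-up table named `courses`; the column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

courses = pd.DataFrame({"course": ["Python", "SQL"], "attendees": [25, 18]})

# With no path argument, the exporters return a string.
csv_text = courses.to_csv(index=False)        # index=False omits row labels
json_text = courses.to_json(orient="records")  # one JSON object per row

# Reading the CSV text back recovers the same values.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
print(restored)
```

Passing `index=False` is a common choice when the row labels are just the default 0, 1, 2, ... and carry no information.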

## Creating test objects.

Creating test data is useful for trying out code snippets:

In [65]:
# Generate a DataFrame with 5 columns and 10 rows.
import numpy as np # NumPy supplies the random values.
df = pd.DataFrame(np.random.rand(10, 5))
df

Unnamed: 0,0,1,2,3,4
0,0.323111,0.746657,0.780768,0.955938,0.118458
1,0.123266,0.310638,0.893211,0.902347,0.321628
2,0.139962,0.826734,0.143109,0.621192,0.459418
3,0.664837,0.321543,0.636465,0.006667,0.440059
4,0.630373,0.918279,0.944503,0.618441,0.854934
5,0.262089,0.468748,0.613187,0.829566,0.51005
6,0.976877,0.954365,0.151554,0.506109,0.576196
7,0.214438,0.421249,0.627729,0.916895,0.94696
8,0.624012,0.221798,0.749764,0.077372,0.82914
9,0.086453,0.924445,0.487996,0.503651,0.441132


In [66]:
# Create a Series from a list.
my_list = [2, 7, 3, 9, 5]
s = pd.Series(my_list)
s

0    2
1    7
2    3
3    9
4    5
dtype: int64

In [67]:
# Add a date index to the DataFrame.
df.index = pd.date_range('1900/1/30', periods = df.shape[0])
df

Unnamed: 0,0,1,2,3,4
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


## Viewing / inspecting data.

In [68]:
df.head(5) # Show the first 5 rows of the DataFrame.

Unnamed: 0,0,1,2,3,4
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934


In [69]:
df.tail(5) # Show the last 5 rows of the DataFrame.

Unnamed: 0,0,1,2,3,4
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [70]:
df.shape # Show the number of rows and columns of the DataFrame.

(10, 5)

In [71]:
df.info() # Show the index, data types, and memory usage.

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 1900-01-30 to 1900-02-08
Freq: D
Data columns (total 5 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
4    10 non-null float64
dtypes: float64(5)
memory usage: 480.0 bytes


In [72]:
df.describe() # Show summary statistics for numeric columns.

Unnamed: 0,0,1,2,3,4
count,10.0,10.0,10.0,10.0,10.0
mean,0.404542,0.611446,0.602829,0.593818,0.549798
std,0.299602,0.290064,0.275752,0.33581,0.258476
min,0.086453,0.221798,0.143109,0.006667,0.118458
25%,0.158581,0.346469,0.519294,0.504266,0.440327
50%,0.2926,0.607703,0.632097,0.619816,0.484734
75%,0.628783,0.895393,0.773017,0.884151,0.765904
max,0.976877,0.954365,0.944503,0.955938,0.94696


In [73]:
s.value_counts(dropna = False) # Show the unique values in the Series and their counts.

7    1
5    1
3    1
2    1
9    1
dtype: int64

## Selection.

In [74]:
df[2] # Return the column labeled 2 as a Series.

1900-01-30    0.780768
1900-01-31    0.893211
1900-02-01    0.143109
1900-02-02    0.636465
1900-02-03    0.944503
1900-02-04    0.613187
1900-02-05    0.151554
1900-02-06    0.627729
1900-02-07    0.749764
1900-02-08    0.487996
Freq: D, Name: 2, dtype: float64

In [75]:
df[[2, 4]] # Return columns 2 and 4 as a new DataFrame.

Unnamed: 0,2,4
1900-01-30,0.780768,0.118458
1900-01-31,0.893211,0.321628
1900-02-01,0.143109,0.459418
1900-02-02,0.636465,0.440059
1900-02-03,0.944503,0.854934
1900-02-04,0.613187,0.51005
1900-02-05,0.151554,0.576196
1900-02-06,0.627729,0.94696
1900-02-07,0.749764,0.82914
1900-02-08,0.487996,0.441132


In [76]:
s.iloc[0] # Selection by position: the element at position 0 of the Series.

2

In [77]:
s.loc[3] # Selection by index label of the Series.

9
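
With the default integer index, positions and labels coincide, so the difference between `iloc` and `loc` is easy to miss. The sketch below uses a hypothetical Series, `s_labeled`, whose labels deliberately differ from its positions.

```python
import pandas as pd

# Labels (10, 20, ...) are chosen so they cannot be confused with positions.
s_labeled = pd.Series([2, 7, 3, 9, 5], index=[10, 20, 30, 40, 50])

by_position = s_labeled.iloc[0]  # first element, whatever its label is
by_label = s_labeled.loc[30]     # the element whose label is 30

# Slices differ as well: iloc excludes the end position,
# while loc includes the end label.
pos_slice = s_labeled.iloc[0:2]
label_slice = s_labeled.loc[10:30]

print(by_position, by_label)
print(len(pos_slice), len(label_slice))
```

Here `s_labeled.loc[0]` would raise a `KeyError`, because no element carries the label 0.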

In [78]:
df.iloc[0, :] # First row of the DataFrame.

0    0.323111
1    0.746657
2    0.780768
3    0.955938
4    0.118458
Name: 1900-01-30 00:00:00, dtype: float64

In [79]:
df.iloc[0, 0] # First element of the first column.

0.3231114297946387

## Data cleaning.

In [80]:
df.columns = ["a", "b", "c", "x", "y"] # Rename the columns of the DataFrame.
df

Unnamed: 0,a,b,c,x,y
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [81]:
df.isnull() # Check for null values; returns a boolean DataFrame.

Unnamed: 0,a,b,c,x,y
1900-01-30,False,False,False,False,False
1900-01-31,False,False,False,False,False
1900-02-01,False,False,False,False,False
1900-02-02,False,False,False,False,False
1900-02-03,False,False,False,False,False
1900-02-04,False,False,False,False,False
1900-02-05,False,False,False,False,False
1900-02-06,False,False,False,False,False
1900-02-07,False,False,False,False,False
1900-02-08,False,False,False,False,False


In [82]:
df.notnull() # The opposite of df.isnull().

Unnamed: 0,a,b,c,x,y
1900-01-30,True,True,True,True,True
1900-01-31,True,True,True,True,True
1900-02-01,True,True,True,True,True
1900-02-02,True,True,True,True,True
1900-02-03,True,True,True,True,True
1900-02-04,True,True,True,True,True
1900-02-05,True,True,True,True,True
1900-02-06,True,True,True,True,True
1900-02-07,True,True,True,True,True
1900-02-08,True,True,True,True,True


In [83]:
df.dropna(axis = 0) # Drop every row that contains null values.

Unnamed: 0,a,b,c,x,y
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [84]:
df.dropna(axis = 1) # Drop every column that contains null values.

Unnamed: 0,a,b,c,x,y
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [85]:
# Drop every column that has fewer than 3 non-null values.
df.dropna(axis = 1, thresh = 3)

Unnamed: 0,a,b,c,x,y
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [86]:
df.fillna(0) # Replace all null values with 0.

Unnamed: 0,a,b,c,x,y
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [87]:
s.fillna(s.mean()) # Replace all null values with the mean.

0    2
1    7
2    3
3    9
4    5
dtype: int64
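
The test objects used in this section contain no missing values, so the calls above return them unchanged. To see the effect, here is a minimal sketch on a hypothetical frame, `gaps`, with `NaN` entries placed by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; column "c" is almost empty.
gaps = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

missing_per_column = gaps.isnull().sum()     # count of NaN in each column
rows_kept = gaps.dropna(axis=0)              # rows without any NaN
cols_kept = gaps.dropna(axis=1, thresh=3)    # columns with >= 3 non-null values
filled = gaps["a"].fillna(gaps["a"].mean())  # NaN replaced by the column mean

print(missing_per_column)
print(rows_kept.shape)
print(list(cols_kept.columns))
print(filled.tolist())
```

None of these calls modifies `gaps` itself; each one returns a new object.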

In [88]:
s.astype(float) # Convert the data type to float.

0    2.0
1    7.0
2    3.0
3    9.0
4    5.0
dtype: float64

In [89]:
s.replace(3, '100') # Replace every value equal to 3 with '100'.

0      2
1      7
2    100
3      9
4      5
dtype: object

In [90]:
s.replace([7, 9], ['700', '900']) # Replace every 7 with '700' and every 9 with '900'.

0      2
1    700
2      3
3    900
4      5
dtype: object

In [91]:
df.columns = (0, 1, 2, 3, 4)
df.rename(columns = lambda x: 2*x + 1) # Rename columns in bulk.

Unnamed: 0,1,3,5,7,9
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [92]:
df.rename(columns={1:'one'}) # Rename a selected column.

Unnamed: 0,0,one,2,3,4
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132


In [93]:
df.set_index(2) # Use a column (2) of the DataFrame as the index.

Unnamed: 0_level_0,0,1,3,4
2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.780768,0.323111,0.746657,0.955938,0.118458
0.893211,0.123266,0.310638,0.902347,0.321628
0.143109,0.139962,0.826734,0.621192,0.459418
0.636465,0.664837,0.321543,0.006667,0.440059
0.944503,0.630373,0.918279,0.618441,0.854934
0.613187,0.262089,0.468748,0.829566,0.51005
0.151554,0.976877,0.954365,0.506109,0.576196
0.627729,0.214438,0.421249,0.916895,0.94696
0.749764,0.624012,0.221798,0.077372,0.82914
0.487996,0.086453,0.924445,0.503651,0.441132


In [94]:
df.set_index(2).rename(index = lambda x: x + 1) # Change the index labels in bulk.

Unnamed: 0_level_0,0,1,3,4
2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.780768,0.323111,0.746657,0.955938,0.118458
1.893211,0.123266,0.310638,0.902347,0.321628
1.143109,0.139962,0.826734,0.621192,0.459418
1.636465,0.664837,0.321543,0.006667,0.440059
1.944503,0.630373,0.918279,0.618441,0.854934
1.613187,0.262089,0.468748,0.829566,0.51005
1.151554,0.976877,0.954365,0.506109,0.576196
1.627729,0.214438,0.421249,0.916895,0.94696
1.749764,0.624012,0.221798,0.077372,0.82914
1.487996,0.086453,0.924445,0.503651,0.441132


## Filtering, sorting, and grouping.

In [95]:
df[df[2] > 0.5] # Keep the rows where column 2 is greater than 0.5.

Unnamed: 0,0,1,2,3,4
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914


In [96]:
df[(df[2] > 0.5) & (df[2] < 0.7)] # Keep the rows where 0.5 < column 2 < 0.7.

Unnamed: 0,0,1,2,3,4
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696


In [97]:
df.sort_values(3) # Sort by the values of column 3 in ascending order.

Unnamed: 0,0,1,2,3,4
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458


In [98]:
df.sort_values(3, ascending = False) # Sort by column 3 in descending order.

Unnamed: 0,0,1,2,3,4
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059


In [99]:
# Sort the values by column 2 in ascending order
# and then by column 3 in descending order.
df.sort_values([2, 3], ascending = [True, False])

Unnamed: 0,0,1,2,3,4
1900-02-01,0.139962,0.826734,0.143109,0.621192,0.459418
1900-02-05,0.976877,0.954365,0.151554,0.506109,0.576196
1900-02-08,0.086453,0.924445,0.487996,0.503651,0.441132
1900-02-04,0.262089,0.468748,0.613187,0.829566,0.51005
1900-02-06,0.214438,0.421249,0.627729,0.916895,0.94696
1900-02-02,0.664837,0.321543,0.636465,0.006667,0.440059
1900-02-07,0.624012,0.221798,0.749764,0.077372,0.82914
1900-01-30,0.323111,0.746657,0.780768,0.955938,0.118458
1900-01-31,0.123266,0.310638,0.893211,0.902347,0.321628
1900-02-03,0.630373,0.918279,0.944503,0.618441,0.854934


In [100]:
df.groupby(3) # Return a groupby object for the values of one column.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f0a41968b00>

In [101]:
# Return a groupby object for the values of multiple columns,
# in this case columns 3 and 4.
df.groupby([3, 4])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f0a41968e80>

In [102]:
# Create a pivot table that groups by column 2
# and computes the mean of columns 3 and 4.
df.pivot_table(index = 2, values = [3, 4], aggfunc = "mean")

Unnamed: 0_level_0,3,4
2,Unnamed: 1_level_1,Unnamed: 2_level_1
0.143109,0.621192,0.459418
0.151554,0.506109,0.576196
0.487996,0.503651,0.441132
0.613187,0.829566,0.51005
0.627729,0.916895,0.94696
0.636465,0.006667,0.440059
0.749764,0.077372,0.82914
0.780768,0.955938,0.118458
0.893211,0.902347,0.321628
0.944503,0.618441,0.854934


In [103]:
# Compute the mean of every column
# for each unique value of column 3.
df.groupby(3).agg("mean")

Unnamed: 0_level_0,0,1,2,4
3,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.006667,0.664837,0.321543,0.636465,0.440059
0.077372,0.624012,0.221798,0.749764,0.82914
0.503651,0.086453,0.924445,0.487996,0.441132
0.506109,0.976877,0.954365,0.151554,0.576196
0.618441,0.630373,0.918279,0.944503,0.854934
0.621192,0.139962,0.826734,0.143109,0.459418
0.829566,0.262089,0.468748,0.613187,0.51005
0.902347,0.123266,0.310638,0.893211,0.321628
0.916895,0.214438,0.421249,0.627729,0.94696
0.955938,0.323111,0.746657,0.780768,0.118458
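
Grouping by a column of continuous random values, as above, puts every row in its own group, so the result looks like a re-indexed copy of the data. Grouping becomes more meaningful with a key that repeats. The sketch below assumes a hypothetical table, `sessions`, with a categorical `team` column.

```python
import pandas as pd

# Hypothetical training records: one row per session.
sessions = pd.DataFrame({
    "team":  ["ops", "dev", "ops", "dev", "ops"],
    "hours": [2, 5, 4, 3, 6],
    "cost":  [20.0, 50.0, 40.0, 30.0, 60.0],
})

grouped = sessions.groupby("team")   # lazy object: nothing is computed yet

totals = grouped["hours"].sum()      # one aggregate value per group
# A dict maps each column to its own aggregation function.
summary = grouped.agg({"hours": "mean", "cost": "max"})

print(totals)
print(summary)
```

The group keys (`dev`, `ops`) become the index of both results.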


In [104]:
df.apply("mean") # Apply the mean function to each column.

0    0.404542
1    0.611446
2    0.602829
3    0.593818
4    0.549798
dtype: float64

In [105]:
df.apply("max", axis = 1) # Apply the max function to each row.

1900-01-30    0.955938
1900-01-31    0.902347
1900-02-01    0.826734
1900-02-02    0.664837
1900-02-03    0.944503
1900-02-04    0.829566
1900-02-05    0.976877
1900-02-06    0.946960
1900-02-07    0.829140
1900-02-08    0.924445
Freq: D, dtype: float64

## Joining / combining.

To illustrate how DataFrames are joined and combined, suppose we have the following two DataFrames:

In [121]:
data_1 = {"A": [1, 2, 3, 4],
          "B": ["a", "b", "c", "d"],
          "C": ["w", "x", "y", "z"]}
df1 = pd.DataFrame(data_1)
df1

Unnamed: 0,A,B,C
0,1,a,w
1,2,b,x
2,3,c,y
3,4,d,z


In [122]:
data_2 = {"A": [6, 7, 3, 4],
          "B": [True, False, True, False],
          "C": [1, 2, 3, 4]}
df2 = pd.DataFrame(data_2)
df2

Unnamed: 0,A,B,C
0,6,True,1
1,7,False,2
2,3,True,3
3,4,False,4


In [141]:
# Append the rows of df2 to the end of df1 (the columns must be identical).
pd.concat([df1, df2])

Unnamed: 0,A,B,C
0,1,a,w
1,2,b,x
2,3,c,y
3,4,d,z
0,6,True,1
1,7,False,2
2,3,True,3
3,4,False,4


In [124]:
# Append the columns of df2 to the right of df1 (the rows must be identical).
pd.concat([df1, df2], axis = 1)

Unnamed: 0,A,B,C,A.1,B.1,C.1
0,1,a,w,6,True,1
1,2,b,x,7,False,2
2,3,c,y,3,True,3
3,4,d,z,4,False,4


In [131]:
# Join the columns of df1 with the columns of df2 on the rows where A
# has identical values. The how argument also accepts: left, right, outer, inner.
pd.merge(df1, df2, on = ["A"], how = 'inner')

Unnamed: 0,A,B_x,C_x,B_y,C_y
0,3,c,y,True,3
1,4,d,z,False,4


## Statistics.

In [132]:
df.describe() # Summary statistics for numeric columns.

Unnamed: 0,0,1,2,3,4
count,10.0,10.0,10.0,10.0,10.0
mean,0.404542,0.611446,0.602829,0.593818,0.549798
std,0.299602,0.290064,0.275752,0.33581,0.258476
min,0.086453,0.221798,0.143109,0.006667,0.118458
25%,0.158581,0.346469,0.519294,0.504266,0.440327
50%,0.2926,0.607703,0.632097,0.619816,0.484734
75%,0.628783,0.895393,0.773017,0.884151,0.765904
max,0.976877,0.954365,0.944503,0.955938,0.94696


In [133]:
df.mean(axis = 0) # Return the mean of every column.

0    0.404542
1    0.611446
2    0.602829
3    0.593818
4    0.549798
dtype: float64

In [134]:
df.mean(axis = 1) # Return the mean of every row.

1900-01-30    0.584987
1900-01-31    0.510218
1900-02-01    0.438083
1900-02-02    0.413914
1900-02-03    0.793306
1900-02-04    0.536728
1900-02-05    0.633020
1900-02-06    0.625454
1900-02-07    0.500417
1900-02-08    0.488735
Freq: D, dtype: float64

In [135]:
df.corr() # Return the correlation between the columns of a DataFrame.

Unnamed: 0,0,1,2,3,4
0,1.0,0.09991,-0.113422,-0.552906,0.293707
1,0.09991,1.0,-0.446381,0.214068,-0.129492
2,-0.113422,-0.446381,1.0,0.137561,0.07825
3,-0.552906,0.214068,0.137561,1.0,-0.223966
4,0.293707,-0.129492,0.07825,-0.223966,1.0


In [136]:
df.count(axis = 0) # Return the number of non-null values in each column.

0    10
1    10
2    10
3    10
4    10
dtype: int64

In [137]:
df.max(axis = 0) # Return the highest value in each column.

0    0.976877
1    0.954365
2    0.944503
3    0.955938
4    0.946960
dtype: float64

In [138]:
df.min(axis = 0) # Return the lowest value in each column.

0    0.086453
1    0.221798
2    0.143109
3    0.006667
4    0.118458
dtype: float64

In [139]:
df.median(axis = 0) # Return the median of each column.

0    0.292600
1    0.607703
2    0.632097
3    0.619816
4    0.484734
dtype: float64

In [140]:
df.std(axis = 0) # Return the standard deviation of each column.

0    0.299602
1    0.290064
2    0.275752
3    0.335810
4    0.258476
dtype: float64

## Additional references.

For additional *Pandas* operations, see the repository of [Dr. Tirthajyoti Sarkar](https://github.com/tirthajyoti):
* [Pandas Operations](https://github.com/tirthajyoti/Machine-Learning-with-Python/blob/master/Pandas%20and%20Numpy/Pandas_Operations.ipynb).
* [Advanced Pandas Operations](https://github.com/tirthajyoti/Machine-Learning-with-Python/blob/master/Pandas%20and%20Numpy/Advanced%20Pandas%20Operations.ipynb).