![](./imagenes/python_logo.jpeg)
# The Pandas library.
***

[*Pandas*](https://pandas.pydata.org/pandas-docs/stable/) is arguably the most important *Python* package used in *Data Science*. It offers many methods and functions that make working with data easier, and it has been optimized for speed, which gives it a significant advantage over processing numeric data in plain *Python*.

*Pandas* is a library that provides fast, flexible and expressive data structures designed for working with *labeled* and/or *relational* data. Conceptually, these structures can be thought of as *NumPy* *arrays* whose rows and columns are *labeled*. In form, they resemble spreadsheets inside *Python*.
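
To make the "labeled array" idea concrete, here is a minimal sketch. The array contents and the labels `r1`, `r2`, `x` and `y` are made up for illustration; only the contrast between positional and label-based access matters.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: rows and columns are addressed by position only.
values = np.array([[1.0, 2.0], [3.0, 4.0]])

# The same values with (hypothetical) row and column labels attached.
labeled = pd.DataFrame(values, index=["r1", "r2"], columns=["x", "y"])

by_position = values[1, 0]         # NumPy: second row, first column
by_label = labeled.loc["r2", "x"]  # Pandas: row "r2", column "x"
print(by_position, by_label)
```

Both expressions address the same cell; the DataFrame simply lets the code name it.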

Like *NumPy*, *Pandas* is a very good tool for working with numbers, vectors, linear algebra, etc. *Pandas* is well suited to:

* Tabular and heterogeneous data (floats, strings, integers, etc.)
* Time series.
* The same data that can be handled with *NumPy* arrays.

*Pandas* is not part of the standard *Python* installation, so it must be installed separately. To install *Pandas*, run the following command from a console or terminal:
```bash
pip install pandas
```
In addition, to work with Excel `.xlsx` files, also run the command below (`xlrd` version 2.0 and later reads only the legacy `.xls` format):
```bash
pip install openpyxl
```
*Pandas* has very thorough documentation and a variety of [tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html).

*Pandas* has historically had three basic types of objects, all of them built on *NumPy*:

* *Series* (lists, 1D),
* *DataFrame* (tables, 2D) and
* *Panels* (3D tables), deprecated in version 0.20 and removed in version 0.25.

We will cover the basic use of the first two types of objects; for more detail, see the [manual](http://pandas.pydata.org/pandas-docs/stable/dsintro.html).

In this notebook we use the following abbreviations:

* **df**: any *Pandas* `DataFrame` object.
* **s**: any *Pandas* `Series` object.

To begin, we import *Pandas* following the usual convention:

In [1]:
import pandas as pd

## Importing data.
***

In [2]:
csv_data  = "datos/capacitaciones.csv"
json_data = "datos/capacitaciones.json"
xlsx_data = "datos/capacitaciones.xlsx"

df = pd.read_csv(csv_data, encoding = "ISO-8859-1") # From a CSV file.
df = pd.read_json(json_data)                        # From a JSON file.
df = pd.read_excel(xlsx_data)                       # From an XLSX file.

There are other data *importers*, such as:

* `pd.read_table(filename)`: imports from a delimited text file (such as TSV).
* `pd.read_sql(query, connection_object)`: imports from a SQL database or table.
* `pd.read_html(url)`: parses HTML from a URL, a string or a file and extracts its tables into a list.
* `pd.read_clipboard()`: takes the data from the contents of the clipboard.
* `pd.DataFrame(dict)`: builds a DataFrame from a *Python* dictionary.
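
As a minimal sketch of two of these importers that runs without any file on disk, the example below reads CSV text from memory and builds the same table from a dictionary. The column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk.
csv_text = "name,hours\nAna,12\nLuis,8\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a Python dictionary (column name -> values).
df_dict = pd.DataFrame({"name": ["Ana", "Luis"], "hours": [12, 8]})

print(df_csv)
print(df_dict)
```

`pd.read_csv` accepts any file-like object, which is why `io.StringIO` can stand in for a path here.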

## Exporting data.
***

The data *exporters* save the contents of *df* to disk:

* `df.to_csv(filename)`: writes the data to a CSV file.
* `df.to_excel(filename)`: writes the data to an Excel file.
* `df.to_sql(table_name, connection_object)`: writes the data to a SQL table.
* `df.to_json(filename)`: writes the data to a file in JSON format.
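
The following sketch shows a round trip through CSV without touching the file system. The frame `df_out` and its columns are hypothetical; the only behavior relied on is that `to_csv` returns the CSV text when no path is given.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"name": ["Ana", "Luis"], "hours": [12, 8]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df_out.to_csv(index=False)

# Reading the text back should reproduce the original columns and values.
df_back = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
print(df_back)
```

Passing `index=False` keeps the row labels out of the output, so reading the file back does not produce an extra unnamed column.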

## Creating test objects.
***

Creating test data is useful for trying out code snippets:

In [3]:
import numpy as np  # The pd.np alias was removed in Pandas 2.0.
# Generates a DataFrame with 5 columns and 10 rows.
df = pd.DataFrame(np.random.rand(10, 5))
df

Unnamed: 0,0,1,2,3,4
0,0.582785,0.388263,0.42623,0.658371,0.75127
1,0.713651,0.695313,0.079295,0.030452,0.586475
2,0.109922,0.537157,0.637753,0.879046,0.471591
3,0.355367,0.784931,0.957191,0.455044,0.48774
4,0.942278,0.856266,0.039143,0.776321,0.147819
5,0.505111,0.253876,0.991086,0.382961,0.84647
6,0.114346,0.581086,0.754694,0.054246,0.226384
7,0.560394,0.034081,0.412505,0.796324,0.218537
8,0.080404,0.265832,0.29644,0.903831,0.344192
9,0.213173,0.049282,0.106273,0.606679,0.282871


In [4]:
# Creates a Series from a list.
my_list = [2, 7, 3, 9, 5]
s = pd.Series(my_list)
s

0    2
1    7
2    3
3    9
4    5
dtype: int64

In [5]:
# Adds a date index to the DataFrame.
df.index = pd.date_range('1900/1/30', periods = df.shape[0])
df

Unnamed: 0,0,1,2,3,4
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871
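
One benefit of a date index is that rows can be selected by date labels. The sketch below uses a small hypothetical frame `ts` rather than the random `df` above, so that the selected values are predictable.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: one integer per day, starting on 1900-01-30.
ts = pd.DataFrame({"value": np.arange(10)},
                  index=pd.date_range("1900-01-30", periods=10))

# Label-based slicing on a DatetimeIndex includes both endpoints.
february_start = ts.loc["1900-02-01":"1900-02-03"]
print(february_start)
```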


## Viewing / inspecting data.
***

In [6]:
df.head(5) # Shows the first 5 rows of the DataFrame.

Unnamed: 0,0,1,2,3,4
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819


In [7]:
df.tail(5) # Shows the last 5 rows of the DataFrame.

Unnamed: 0,0,1,2,3,4
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [8]:
df.shape # Shows the number of rows and columns of the DataFrame.

(10, 5)

In [9]:
df.info() # Shows the index, data types and memory usage.

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 1900-01-30 to 1900-02-08
Freq: D
Data columns (total 5 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
4    10 non-null float64
dtypes: float64(5)
memory usage: 480.0 bytes


In [10]:
df.describe() # Shows summary statistics for the numeric columns.

Unnamed: 0,0,1,2,3,4
count,10.0,10.0,10.0,10.0,10.0
mean,0.417743,0.444609,0.470061,0.554328,0.436335
std,0.291407,0.292813,0.353397,0.318461,0.235848
min,0.080404,0.034081,0.039143,0.030452,0.147819
25%,0.139053,0.256865,0.153815,0.400981,0.240506
50%,0.430239,0.46271,0.419368,0.632525,0.407892
75%,0.577187,0.666756,0.725459,0.791323,0.561791
max,0.942278,0.856266,0.991086,0.903831,0.84647


In [11]:
s.value_counts(dropna = False) # Shows the unique values in the Series and their counts.

7    1
5    1
3    1
2    1
9    1
dtype: int64

## Selection.
***

In [12]:
df[2] # Returns the column labeled 2 as a Series.

1900-01-30    0.426230
1900-01-31    0.079295
1900-02-01    0.637753
1900-02-02    0.957191
1900-02-03    0.039143
1900-02-04    0.991086
1900-02-05    0.754694
1900-02-06    0.412505
1900-02-07    0.296440
1900-02-08    0.106273
Freq: D, Name: 2, dtype: float64

In [13]:
df[[2, 4]] # Returns columns 2 and 4 as a new DataFrame.

Unnamed: 0,2,4
1900-01-30,0.42623,0.75127
1900-01-31,0.079295,0.586475
1900-02-01,0.637753,0.471591
1900-02-02,0.957191,0.48774
1900-02-03,0.039143,0.147819
1900-02-04,0.991086,0.84647
1900-02-05,0.754694,0.226384
1900-02-06,0.412505,0.218537
1900-02-07,0.29644,0.344192
1900-02-08,0.106273,0.282871


In [14]:
s.iloc[0] # Selection by position: selects element 0 of the Series.

2

In [15]:
s.loc[3] # Selection by index label: selects the element labeled 3.

9

In [16]:
df.iloc[0, :] # First row of the DataFrame.

0    0.582785
1    0.388263
2    0.426230
3    0.658371
4    0.751270
Name: 1900-01-30 00:00:00, dtype: float64

In [17]:
df.iloc[0, 0] # First element of the first column.

0.5827851483624562
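
With the default integer index, `iloc` and `loc` often return the same element, which can hide the difference between them. The sketch below uses a hypothetical Series `s_demo` whose labels are deliberately out of order.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions.
s_demo = pd.Series([10, 20, 30], index=[2, 0, 1])

first_by_position = s_demo.iloc[0]  # element at position 0
label_zero = s_demo.loc[0]          # element whose label is 0
print(first_by_position, label_zero)
```

Here `iloc[0]` picks the first stored element, while `loc[0]` picks the element labeled `0`, which sits in the second position.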

## Data cleaning.
***

In [18]:
df.columns = ["a", "b", "c", "x", "y"] # Renames the DataFrame columns.
df

Unnamed: 0,a,b,c,x,y
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [19]:
df.isnull() # Checks for null values; returns a boolean DataFrame.

Unnamed: 0,a,b,c,x,y
1900-01-30,False,False,False,False,False
1900-01-31,False,False,False,False,False
1900-02-01,False,False,False,False,False
1900-02-02,False,False,False,False,False
1900-02-03,False,False,False,False,False
1900-02-04,False,False,False,False,False
1900-02-05,False,False,False,False,False
1900-02-06,False,False,False,False,False
1900-02-07,False,False,False,False,False
1900-02-08,False,False,False,False,False


In [20]:
df.notnull() # The opposite of df.isnull().

Unnamed: 0,a,b,c,x,y
1900-01-30,True,True,True,True,True
1900-01-31,True,True,True,True,True
1900-02-01,True,True,True,True,True
1900-02-02,True,True,True,True,True
1900-02-03,True,True,True,True,True
1900-02-04,True,True,True,True,True
1900-02-05,True,True,True,True,True
1900-02-06,True,True,True,True,True
1900-02-07,True,True,True,True,True
1900-02-08,True,True,True,True,True


In [21]:
df.dropna(axis = 0) # Drops all rows that contain null values.

Unnamed: 0,a,b,c,x,y
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [22]:
df.dropna(axis = 1) # Drops all columns that contain null values.

Unnamed: 0,a,b,c,x,y
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [23]:
# Drops all rows that have fewer than 3 non-null values.
df.dropna(axis = 0, thresh = 3)

Unnamed: 0,a,b,c,x,y
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [24]:
df.fillna(0) # Replaces all null values with 0.

Unnamed: 0,a,b,c,x,y
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [25]:
s.fillna(s.mean()) # Replaces all null values with the mean.

0    2
1    7
2    3
3    9
4    5
dtype: int64
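
Because the test data above contains no null values, the `dropna` and `fillna` calls return it unchanged. The following sketch applies the same calls to a small hypothetical frame `df_nan` with missing values in known positions, so that their effect is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in known positions.
df_nan = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

rows_complete = df_nan.dropna(axis=0)          # keeps only the last row
cols_complete = df_nan.dropna(axis=1)          # keeps only column "c"
rows_thresh = df_nan.dropna(axis=0, thresh=2)  # keeps rows with at least 2 non-null values
filled_mean = df_nan["a"].fillna(df_nan["a"].mean())  # the NaN in "a" becomes 2.0

print(rows_thresh)
print(filled_mean)
```

None of these calls modifies `df_nan`; each one returns a new object, as do the calls in the cells above.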

In [26]:
s.astype(float) # Converts the data type to float.

0    2.0
1    7.0
2    3.0
3    9.0
4    5.0
dtype: float64

In [27]:
s.replace(3, '100') # Replaces all values equal to 3 with '100'.

0      2
1      7
2    100
3      9
4      5
dtype: object

In [28]:
s.replace([7, 9], ['700', '900']) # Replaces every 7 with '700' and every 9 with '900'.

0      2
1    700
2      3
3    900
4      5
dtype: object

In [29]:
df.columns = (0, 1, 2, 3, 4)
df.rename(columns = lambda x: 2*x + 1) # Renames columns in bulk.

Unnamed: 0,1,3,5,7,9
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [30]:
df.rename(columns={1:'one'}) # Renames a selected column.

Unnamed: 0,0,one,2,3,4
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871


In [31]:
df.set_index(2) # Uses a column (2) of the DataFrame as the index.

Unnamed: 0_level_0,0,1,3,4
2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.42623,0.582785,0.388263,0.658371,0.75127
0.079295,0.713651,0.695313,0.030452,0.586475
0.637753,0.109922,0.537157,0.879046,0.471591
0.957191,0.355367,0.784931,0.455044,0.48774
0.039143,0.942278,0.856266,0.776321,0.147819
0.991086,0.505111,0.253876,0.382961,0.84647
0.754694,0.114346,0.581086,0.054246,0.226384
0.412505,0.560394,0.034081,0.796324,0.218537
0.29644,0.080404,0.265832,0.903831,0.344192
0.106273,0.213173,0.049282,0.606679,0.282871


In [32]:
df.set_index(2).rename(index = lambda x: x + 1) # Modifies the index labels in bulk.

Unnamed: 0_level_0,0,1,3,4
2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.42623,0.582785,0.388263,0.658371,0.75127
1.079295,0.713651,0.695313,0.030452,0.586475
1.637753,0.109922,0.537157,0.879046,0.471591
1.957191,0.355367,0.784931,0.455044,0.48774
1.039143,0.942278,0.856266,0.776321,0.147819
1.991086,0.505111,0.253876,0.382961,0.84647
1.754694,0.114346,0.581086,0.054246,0.226384
1.412505,0.560394,0.034081,0.796324,0.218537
1.29644,0.080404,0.265832,0.903831,0.344192
1.106273,0.213173,0.049282,0.606679,0.282871


## Filtering, sorting and grouping.
***

In [33]:
df[df[2] > 0.5] # Keeps the rows where column 2 is greater than 0.5.

Unnamed: 0,0,1,2,3,4
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384


In [34]:
df[(df[2] > 0.5) & (df[2] < 0.7)] # Keeps the rows where 0.5 < column 2 < 0.7.

Unnamed: 0,0,1,2,3,4
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591


In [35]:
df.sort_values(3) # Sorts by the values of column 3 in ascending order.

Unnamed: 0,0,1,2,3,4
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192


In [36]:
df.sort_values(3, ascending = False) # Sorts by column 3 in descending order.

Unnamed: 0,0,1,2,3,4
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475


In [37]:
# Sorts by column 2 in ascending order
# and then by column 3 in descending order.
df.sort_values([2, 3], ascending = [True, False])

Unnamed: 0,0,1,2,3,4
1900-02-03,0.942278,0.856266,0.039143,0.776321,0.147819
1900-01-31,0.713651,0.695313,0.079295,0.030452,0.586475
1900-02-08,0.213173,0.049282,0.106273,0.606679,0.282871
1900-02-07,0.080404,0.265832,0.29644,0.903831,0.344192
1900-02-06,0.560394,0.034081,0.412505,0.796324,0.218537
1900-01-30,0.582785,0.388263,0.42623,0.658371,0.75127
1900-02-01,0.109922,0.537157,0.637753,0.879046,0.471591
1900-02-05,0.114346,0.581086,0.754694,0.054246,0.226384
1900-02-02,0.355367,0.784931,0.957191,0.455044,0.48774
1900-02-04,0.505111,0.253876,0.991086,0.382961,0.84647


In [38]:
df.groupby(3) # Returns a groupby object for the values of one column.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe959554d30>

In [39]:
# Returns a groupby object for the values of multiple columns,
# in this case columns 3 and 4.
df.groupby([3, 4])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe959562080>
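
A groupby object computes nothing until an aggregation is requested, and grouping is most informative when the key column has repeated values (the random floats above are all distinct, so every group holds a single row). A minimal sketch with hypothetical data, where `sales`, `team` and `units` are illustrative names:

```python
import pandas as pd

# Hypothetical data: a key column with repeated values and one numeric column.
sales = pd.DataFrame({
    "team": ["north", "south", "north", "south"],
    "units": [10, 20, 30, 40],
})

grouped = sales.groupby("team")       # lazy grouping object, nothing computed yet
mean_units = grouped["units"].mean()  # one mean per distinct team
print(mean_units)
```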

In [44]:
# Creates a pivot table that groups by column 2
# and computes the mean of columns 3 and 4.
df.pivot_table(index = 2, values = [3, 4], aggfunc = "mean")

Unnamed: 0_level_0,3,4
2,Unnamed: 1_level_1,Unnamed: 2_level_1
0.039143,0.776321,0.147819
0.079295,0.030452,0.586475
0.106273,0.606679,0.282871
0.29644,0.903831,0.344192
0.412505,0.796324,0.218537
0.42623,0.658371,0.75127
0.637753,0.879046,0.471591
0.754694,0.054246,0.226384
0.957191,0.455044,0.48774
0.991086,0.382961,0.84647


In [41]:
# Computes the mean of all columns
# for each unique value of column 3.
df.groupby(3).agg("mean")

Unnamed: 0_level_0,0,1,2,4
3,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.030452,0.713651,0.695313,0.079295,0.586475
0.054246,0.114346,0.581086,0.754694,0.226384
0.382961,0.505111,0.253876,0.991086,0.84647
0.455044,0.355367,0.784931,0.957191,0.48774
0.606679,0.213173,0.049282,0.106273,0.282871
0.658371,0.582785,0.388263,0.42623,0.75127
0.776321,0.942278,0.856266,0.039143,0.147819
0.796324,0.560394,0.034081,0.412505,0.218537
0.879046,0.109922,0.537157,0.637753,0.471591
0.903831,0.080404,0.265832,0.29644,0.344192


In [42]:
df.apply("mean") # Applies the mean to each column (pd.np was removed in Pandas 2.0).

0    0.417743
1    0.444609
2    0.470061
3    0.554328
4    0.436335
dtype: float64

In [43]:
df.apply("max", axis = 1) # Applies the maximum to each row (pd.np was removed in Pandas 2.0).

1900-01-30    0.751270
1900-01-31    0.713651
1900-02-01    0.879046
1900-02-02    0.957191
1900-02-03    0.942278
1900-02-04    0.991086
1900-02-05    0.754694
1900-02-06    0.796324
1900-02-07    0.903831
1900-02-08    0.606679
Freq: D, dtype: float64
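
`apply` also accepts user-defined functions. As a minimal sketch, the hypothetical helper `value_range` below is applied to each column and then to each row of a small made-up frame `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 4, 2], "b": [10, 30, 20]})

def value_range(values):
    """Difference between the largest and smallest value of a Series."""
    return values.max() - values.min()

per_column = df_demo.apply(value_range)       # one result per column
per_row = df_demo.apply(value_range, axis=1)  # one result per row
print(per_column)
print(per_row)
```

In both cases the function receives a Series: a whole column by default, or a whole row when `axis=1`.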

## Joining / combining.
***
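
A minimal sketch of two common ways to combine frames, `pd.concat` and `pd.merge`. The frames `left` and `right` and their column names are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [0.2, 0.3, 0.4]})

# Stacks the rows of two frames that share the same columns.
stacked = pd.concat([left, left], ignore_index=True)

# SQL-style inner join on the shared "key" column.
joined = pd.merge(left, right, on="key", how="inner")
print(stacked)
print(joined)
```

With `how="inner"` only the keys present in both frames survive; `"left"`, `"right"` and `"outer"` keep the unmatched rows of one or both sides.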

## Statistics.
***

In [45]:
df.describe() # Summary statistics for the numeric columns.

Unnamed: 0,0,1,2,3,4
count,10.0,10.0,10.0,10.0,10.0
mean,0.417743,0.444609,0.470061,0.554328,0.436335
std,0.291407,0.292813,0.353397,0.318461,0.235848
min,0.080404,0.034081,0.039143,0.030452,0.147819
25%,0.139053,0.256865,0.153815,0.400981,0.240506
50%,0.430239,0.46271,0.419368,0.632525,0.407892
75%,0.577187,0.666756,0.725459,0.791323,0.561791
max,0.942278,0.856266,0.991086,0.903831,0.84647


In [48]:
df.mean(axis = 0) # Returns the mean of each column.

0    0.417743
1    0.444609
2    0.470061
3    0.554328
4    0.436335
dtype: float64

In [49]:
df.mean(axis = 1) # Returns the mean of each row.

1900-01-30    0.561384
1900-01-31    0.421037
1900-02-01    0.527094
1900-02-02    0.608055
1900-02-03    0.552365
1900-02-04    0.595901
1900-02-05    0.346151
1900-02-06    0.404368
1900-02-07    0.378140
1900-02-08    0.251656
Freq: D, dtype: float64

In [51]:
df.corr() # Returns the pairwise correlation between the columns of the DataFrame.

Unnamed: 0,0,1,2,3,4
0,1.0,0.332894,-0.374344,-0.08681,0.100152
1,0.332894,1.0,0.049785,-0.318906,-0.029674
2,-0.374344,0.049785,1.0,-0.211969,0.428057
3,-0.08681,-0.318906,-0.211969,1.0,-0.239399
4,0.100152,-0.029674,0.428057,-0.239399,1.0


In [57]:
df.count(axis = 0) # Returns the number of non-null values in each column.

0    10
1    10
2    10
3    10
4    10
dtype: int64

In [56]:
df.max(axis = 0) # Returns the highest value in each column.

0    0.942278
1    0.856266
2    0.991086
3    0.903831
4    0.846470
dtype: float64

In [58]:
df.min(axis = 0) # Returns the lowest value in each column.

0    0.080404
1    0.034081
2    0.039143
3    0.030452
4    0.147819
dtype: float64

In [59]:
df.median(axis = 0) # Returns the median of each column.

0    0.430239
1    0.462710
2    0.419368
3    0.632525
4    0.407892
dtype: float64

In [60]:
df.std(axis = 0) # Returns the standard deviation of each column.

0    0.291407
1    0.292813
2    0.353397
3    0.318461
4    0.235848
dtype: float64