# Pandas

Pandas is a Python package that provides two data structures for fast, flexible, and expressive analysis of "relational" data. Pandas is built on top of NumPy, so many of the advantages of NumPy carry over to Pandas.

Pandas can be used to work with different kinds of data:

* Tabular data, where the values within each column share a single type (for example, database tables and Excel sheets).
* Time series.
* Matrices with labeled rows and columns.
* Data from statistical experiments in general.


In [2]:
# pd is the conventional alias for pandas
import pandas as pd
import numpy as np

# Structures

## Series

The first Pandas data structure is the Series. A Series is a labeled 1D array whose elements all share the same data type; each element is assigned an index label:


| index | value |
|---|----|
| a | s1 |
| b | s2 |
| c | s3 |
| ... | ... |

In [3]:
# a Series built from a list
s1 = pd.Series(data=[1, 1, 2, 3, 5], index=["a", "b", "c", "d", "e"])
s1

a    1
b    1
c    2
d    3
e    5
dtype: int64
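
As a minimal sketch of what the index labels are for, the same Series can be read by label or by position. The variable names `by_label`, `by_position`, and `label_slice` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same Series as in the cell above
s1 = pd.Series(data=[1, 1, 2, 3, 5], index=["a", "b", "c", "d", "e"])

by_label = s1["c"]             # access by index label
by_position = s1.iloc[2]       # access by integer position
label_slice = s1.loc["b":"d"]  # label slices include both endpoints

print(by_label, by_position)
print(label_slice.tolist())
```

Both lookups reach the same element, because the label `"c"` sits at position 2.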

## DataFrame

The DataFrame is the second Pandas structure. It is used for labeled 2D data (typically columns with names), and its columns can have different data types. You can think of a DataFrame as a container of Series, where each column is a Series (a labeled 1D array). A DataFrame can be represented as a table in which every row has an index label and every column has a name:

| index\columns | A  | B  | C  | D  |
|---|----|----|----|----|
| 1 | a1 | b1 | c1 | d1 |
| 2 | a2 | b2 | c2 | d2 |
| 3 | a3 | b3 | c3 | d3 |
In [4]:
# a random DataFrame built with NumPy's random module
df = pd.DataFrame(np.random.randn(6, 4), columns=["A", "B", "C", "D"])
df

Unnamed: 0,A,B,C,D
0,-0.834125,-0.81803,1.015872,0.848633
1,-0.436428,0.46103,-0.469404,-1.621243
2,-0.897734,0.366286,1.405613,-1.220837
3,0.508403,-1.651465,1.08486,-0.834111
4,0.140955,0.407926,0.512196,0.007177
5,0.025478,0.971165,0.624289,-1.06602
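
To illustrate the idea of a DataFrame as a container of Series, the sketch below builds a small frame with hypothetical data (the names `frame` and `col` are assumptions for this example) and checks that one column is a Series sharing the frame's index.

```python
import pandas as pd

# Hypothetical small frame with columns of different types
frame = pd.DataFrame({
    "name": ["x", "y", "z"],
    "value": [1.5, 2.5, 3.5],
    "count": [1, 2, 3],
})

col = frame["value"]                   # a single column is a Series
print(type(col).__name__)
print(col.index.equals(frame.index))   # the column shares the frame's index
print(frame.dtypes)                    # each column keeps its own dtype
```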


### Axes

Recall that the axes of a DataFrame (a 2D table) are the following:

![Numpy/Pandas axes](https://raw.githubusercontent.com/jhermosillo/diplomado_CDD2019/master/Programaci%C3%B3n%20en%20Python/images/axes.png)

# Creation and storage

## Creating from a dictionary of lists

In [15]:
# a Python dictionary
# all the lists must have the same length
dict1 = {
    'pais': ['Mexico','EUA', "Francia"], 
    'continente': ["America", "America", "Europa"], 
    'poblacion': [129, 325, 67]
}

paises = pd.DataFrame(dict1)
paises

Unnamed: 0,pais,continente,poblacion
0,Mexico,America,129
1,EUA,America,325
2,Francia,Europa,67


## Creating from a nested dictionary

In [14]:
# a nested Python dictionary
# the inner keys specify the row index labels
dict2 = {
    'pais': {
        "p1": 'Mexico',
        "p2": 'EUA', 
        "p3": "Francia"
    }, 
    'continente': {
        "p1": "America", 
        "p2": "America", 
        "p3": "Europa"
     }, 
    'poblacion': {
        "p1": 129, 
        "p2": 325, 
        "p3": 67
    }
}

paises = pd.DataFrame(dict2)
paises

Unnamed: 0,pais,continente,poblacion
p1,Mexico,America,129
p2,EUA,America,325
p3,Francia,Europa,67


## Creating from CSV files

In [20]:
# read a CSV file stored in Colab
cali_housing = pd.read_csv("sample_data/california_housing_test.csv", nrows=10)
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.3,34.26,43.0,1510.0,310.0,809.0,277.0,3.599,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
5,-119.56,36.51,37.0,1018.0,213.0,663.0,204.0,1.6635,67000.0
6,-121.43,38.63,43.0,1009.0,225.0,604.0,218.0,1.6641,67000.0
7,-120.65,35.48,19.0,2310.0,471.0,1341.0,441.0,3.225,166900.0
8,-122.84,38.4,15.0,3080.0,617.0,1446.0,599.0,3.6696,194400.0
9,-118.02,34.08,31.0,2402.0,632.0,2830.0,603.0,2.3333,164200.0


## Creating from Excel files

In [26]:
# load the Excel (.xlsx) file into a pandas DataFrame
# header=0 uses the first row as column names; index_col=0 uses the first column as the index
cali_housing = pd.read_excel("sample_data/california_housing_test.xlsx", index_col=0, header=0)
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700
1,-118.30,34.26,43,1510,310,809,277,3.5990,176500
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500
3,-118.36,33.82,28,67,15,49,11,6.1359,330000
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500


## Persisting to disk

In [27]:
cali_housing.to_csv("cali_housing.csv")
# or in Excel format:
cali_housing.to_excel("cali_housing.xlsx")

Two files have now been created in Colab.

## Other formats

Besides the formats already mentioned, Pandas can read and write other common formats such as:

* JSON
* HTML
* Parquet
* SQL
* BigQuery

You can see the full list in the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/io.html#io-tools-text-csv-hdf5).
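
As a small sketch of one of these formats, the example below writes a DataFrame to a JSON string and reads it back. It uses an in-memory string instead of a file so that it runs anywhere; the `orient="records"` layout and the variable names are choices made for this example only.

```python
import io
import pandas as pd

paises = pd.DataFrame({
    "pais": ["Mexico", "EUA", "Francia"],
    "poblacion": [129, 325, 67],
})

# Serialize to a JSON string with one object per row
json_text = paises.to_json(orient="records")

# Read it back; StringIO makes the string behave like a file
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```

Passing a file path to `to_json()` and `read_json()` instead of a string works the same way as `to_csv()` and `read_csv()` above.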

# Accessing the data

## info()

The info() method displays general information about a Pandas object, such as its index and the data types it uses.

In [28]:
cali_housing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           3000 non-null   float64
 1   latitude            3000 non-null   float64
 2   housing_median_age  3000 non-null   int64  
 3   total_rooms         3000 non-null   int64  
 4   total_bedrooms      3000 non-null   int64  
 5   population          3000 non-null   int64  
 6   households          3000 non-null   int64  
 7   median_income       3000 non-null   float64
 8   median_house_value  3000 non-null   int64  
dtypes: float64(3), int64(6)
memory usage: 234.4 KB


## head() and tail()

The head() and tail() methods return a small sample of the first or last elements of a Series or DataFrame. Both accept the number of elements to show as a parameter, which defaults to 5.

In [32]:
cali_housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700
1,-118.3,34.26,43,1510,310,809,277,3.599,176500
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500
3,-118.36,33.82,28,67,15,49,11,6.1359,330000
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700


In [33]:
cali_housing.tail(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
2997,-119.7,36.3,10,956,201,693,220,2.2895,62000
2998,-117.12,34.1,40,96,14,46,14,3.2708,162500
2999,-119.63,34.42,42,1765,263,753,260,8.5608,500001


## df.sample()

Returns a random sample of the object.

In [31]:
cali_housing.sample(n=3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
2834,-121.31,37.93,21,1556,314,1140,304,2.4667,81400
778,-122.5,37.74,44,2792,615,1640,579,4.0625,272800
2786,-121.84,39.75,29,4362,1053,2053,1000,1.7284,74500


## df.index

The index attribute returns the index of the Pandas object.

In [34]:
cali_housing.index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            2990, 2991, 2992, 2993, 2994, 2995, 2996, 2997, 2998, 2999],
           dtype='int64', length=3000)

## df.columns

The columns attribute of a DataFrame returns its column labels.

In [35]:
cali_housing.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')

## describe()

The describe() method returns a statistical summary of the object.

In [39]:
cali_housing.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,-119.5892,35.63539,28.845333,2599.578667,529.950667,1402.798667,489.912,3.807272,205846.275
std,1.994936,2.12967,12.555396,2155.593332,415.654368,1030.543012,365.42271,1.854512,113119.68747
min,-124.18,32.56,1.0,6.0,2.0,5.0,2.0,0.4999,22500.0
25%,-121.81,33.93,18.0,1401.0,291.0,780.0,273.0,2.544,121200.0
50%,-118.485,34.27,29.0,2106.0,437.0,1155.0,409.5,3.48715,177650.0
75%,-118.02,37.69,37.0,3129.0,636.0,1742.75,597.25,4.656475,263975.0
max,-114.49,41.92,52.0,30450.0,5419.0,11935.0,4930.0,15.0001,500001.0


In [40]:
cali_housing["longitude"].describe()

count    3000.000000
mean     -119.589200
std         1.994936
min      -124.180000
25%      -121.810000
50%      -118.485000
75%      -118.020000
max      -114.490000
Name: longitude, dtype: float64

# Selection and assignment

## Selecting columns

You can use two notations to select a column of a DataFrame; either one returns a Series.



```
df["A"]
```

or



```
df.A
```





In [41]:
# selecting a column returns a Series
cali_housing["latitude"]

0       37.37
1       34.26
2       33.78
3       33.82
4       36.33
        ...  
2995    34.42
2996    34.06
2997    36.30
2998    34.10
2999    34.42
Name: latitude, Length: 3000, dtype: float64

In [42]:
# use the second notation to select a column
cali_housing.latitude

0       37.37
1       34.26
2       33.78
3       33.82
4       36.33
        ...  
2995    34.42
2996    34.06
2997    36.30
2998    34.10
2999    34.42
Name: latitude, Length: 3000, dtype: float64

In [43]:
# selecting a single column as a DataFrame
cali_housing[["latitude"]]

Unnamed: 0,latitude
0,37.37
1,34.26
2,33.78
3,33.82
4,36.33
...,...
2995,34.42
2996,34.06
2997,36.30
2998,34.10


Selecting multiple columns:



```
df[["A", ...]]
```



In [48]:
cali_housing[["longitude", "latitude"]]

Unnamed: 0,longitude,latitude
0,-122.05,37.37
1,-118.30,34.26
2,-117.81,33.78
3,-118.36,33.82
4,-119.67,36.33
...,...,...
2995,-119.86,34.42
2996,-118.14,34.06
2997,-119.70,36.30
2998,-117.12,34.10


## Selecting rows with slicing

In [49]:
cali_housing[0:3]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700
1,-118.3,34.26,43,1510,310,809,277,3.599,176500
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500


## Selection by label with loc

In [50]:
cali_housing.loc[:, "total_rooms":"households"]

Unnamed: 0,total_rooms,total_bedrooms,population,households
0,3885,661,1537,606
1,1510,310,809,277
2,3589,507,1484,495
3,67,15,49,11
4,1241,244,850,237
...,...,...,...,...
2995,1450,642,1258,607
2996,5257,1082,3496,1036
2997,956,201,693,220
2998,96,14,46,14


## Selection by integer position with iloc

In [52]:
cali_housing.iloc[3:, 6:]

Unnamed: 0,households,median_income,median_house_value
3,11,6.1359,330000
4,237,2.9375,81700
5,204,1.6635,67000
6,218,1.6641,67000
7,441,3.2250,166900
...,...,...,...
2995,607,1.1790,225000
2996,1036,3.3906,237200
2997,220,2.2895,62000
2998,14,3.2708,162500
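
One difference between the two selectors is how slices treat their endpoints. The sketch below uses a small hypothetical frame (`small` is an illustrative name) to show that `loc` slices include both labels, while `iloc` slices exclude the stop position, as in regular Python slicing.

```python
import pandas as pd

small = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4], "C": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

by_label = small.loc["x":"y", "A":"B"]  # both endpoints are included
by_position = small.iloc[1:3, 0:2]      # stop positions are excluded

print(by_label)
print(by_label.equals(by_position))
```

Both expressions select rows `x` and `y` and columns `A` and `B`.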


## Assignment over a range

In [53]:
# first add a column named random, initialized to 0
cali_housing["random"] = 0.0
# then reassign it using np.random
cali_housing.loc[:, "random"] = np.random.randn(cali_housing.shape[0])
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
1,-118.30,34.26,43,1510,310,809,277,3.5990,176500,-0.136305
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749
...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533


## Selection with boolean indexing

In [62]:
cali_housing[cali_housing["population"] > 1000]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
7,-120.65,35.48,19,2310,471,1341,441,3.2250,166900,1.272826
8,-122.84,38.40,15,3080,617,1446,599,3.6696,194400,0.141389
9,-118.02,34.08,31,2402,632,2830,603,2.3333,164200,0.655466
...,...,...,...,...,...,...,...,...,...,...
2988,-122.01,36.97,43,2162,509,1208,464,2.5417,260900,0.153929
2989,-122.02,37.60,32,1295,295,1097,328,3.2386,149600,-1.272297
2990,-118.23,34.09,49,1638,456,1500,430,2.6923,150000,0.847022
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349


In [66]:
cali_housing[cali_housing["population"] > 1000][:4]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
7,-120.65,35.48,19,2310,471,1341,441,3.225,166900,1.272826
8,-122.84,38.4,15,3080,617,1446,599,3.6696,194400,0.141389
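
Boolean masks can also be combined. The following sketch, on a small hypothetical frame named `homes`, uses `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

homes = pd.DataFrame({
    "population": [1537, 809, 1484, 49],
    "median_income": [6.6085, 3.599, 5.7934, 6.1359],
})

crowded = homes["population"] > 1000
wealthy = homes["median_income"] > 6

both = homes[crowded & wealthy]        # rows that satisfy both conditions
either = homes[crowded | wealthy]      # rows that satisfy at least one
neither = homes[~(crowded | wealthy)]  # rows that satisfy none

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```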


## Selecting cells with at and iat

**at** uses labels to select elements:

In [71]:
cali_housing.at[0, "housing_median_age"]

27

**iat** uses integer positions to select elements:

In [72]:
cali_housing.iat[1, 4]

310

## Assigning to an element

You can also change the contents of a cell directly by assignment:

In [75]:
cali_housing.iat[1, 4] = 201.0
# total_bedrooms row 1
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
1,-118.30,34.26,43,1510,201,809,277,3.5990,176500,-0.136305
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749
...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533


## Others
Other methods that may be useful:

* idxmax()
* idxmin()
* filter()
* take()
* truncate()

You can see the full list of methods for selection, reindexing, and label manipulation in the [Pandas documentation](https://pandas.pydata.org/docs/reference/frame.html#reindexing-selection-label-manipulation).
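
As a brief sketch of three of the methods listed, the example below uses a small hypothetical frame (`homes`) with a few rows borrowed from the housing data.

```python
import pandas as pd

homes = pd.DataFrame({
    "total_rooms": [3885, 1510, 3589],
    "total_bedrooms": [661, 310, 507],
    "population": [1537, 809, 1484],
})

largest = homes["population"].idxmax()   # index label of the maximum value
smallest = homes["population"].idxmin()  # index label of the minimum value
totals = homes.filter(like="total")      # columns whose name contains "total"

print(largest, smallest)
print(totals.columns.tolist())
```

Note that `filter()` selects by label, not by the values stored in the frame.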

# Attributes

* df.T
* df.axes
* df.dtypes
* df.shape
* df.size
* df.values

## df.T

The transpose of df (rows and columns swapped):

In [76]:
# the index labels become the column names, and the column names become the index
cali_housing.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
longitude,-122.05,-118.3,-117.81,-118.36,-119.67,-119.56,-121.43,-120.65,-122.84,-118.02,...,-118.23,-117.17,-122.33,-117.91,-117.93,-119.86,-118.14,-119.7,-117.12,-119.63
latitude,37.37,34.26,33.78,33.82,36.33,36.51,38.63,35.48,38.4,34.08,...,34.09,34.28,37.39,33.6,33.86,34.42,34.06,36.3,34.1,34.42
housing_median_age,27.0,43.0,27.0,28.0,19.0,37.0,43.0,19.0,15.0,31.0,...,49.0,13.0,52.0,37.0,35.0,23.0,27.0,10.0,40.0,42.0
total_rooms,3885.0,1510.0,3589.0,67.0,1241.0,1018.0,1009.0,2310.0,3080.0,2402.0,...,1638.0,4867.0,573.0,2088.0,931.0,1450.0,5257.0,956.0,96.0,1765.0
total_bedrooms,661.0,201.0,507.0,15.0,244.0,213.0,225.0,471.0,617.0,632.0,...,456.0,718.0,102.0,510.0,181.0,642.0,1082.0,201.0,14.0,263.0
population,1537.0,809.0,1484.0,49.0,850.0,663.0,604.0,1341.0,1446.0,2830.0,...,1500.0,780.0,232.0,673.0,516.0,1258.0,3496.0,693.0,46.0,753.0
households,606.0,277.0,495.0,11.0,237.0,204.0,218.0,441.0,599.0,603.0,...,430.0,250.0,92.0,390.0,174.0,607.0,1036.0,220.0,14.0,260.0
median_income,6.6085,3.599,5.7934,6.1359,2.9375,1.6635,1.6641,3.225,3.6696,2.3333,...,2.6923,7.1997,6.2263,5.1048,5.5867,1.179,3.3906,2.2895,3.2708,8.5608
median_house_value,344700.0,176500.0,270500.0,330000.0,81700.0,67000.0,67000.0,166900.0,194400.0,164200.0,...,150000.0,253800.0,500001.0,500001.0,182500.0,225000.0,237200.0,62000.0,162500.0,500001.0
random,-1.600038,-0.136305,-1.098863,-0.310166,0.378749,-0.112603,1.119988,1.272826,0.141389,0.655466,...,0.847022,-0.237,-0.776239,-0.155301,-2.32374,0.865349,-1.433246,-0.272285,1.109533,1.836168


## df.axes

Returns a list with the axes of the object (for a DataFrame, the row index and the column labels):


In [77]:
cali_housing.axes

[Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
             ...
             2990, 2991, 2992, 2993, 2994, 2995, 2996, 2997, 2998, 2999],
            dtype='int64', length=3000),
 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
        'total_bedrooms', 'population', 'households', 'median_income',
        'median_house_value', 'random'],
       dtype='object')]

## df.dtypes

Returns the data type of each column of the object:

In [78]:
cali_housing.dtypes

longitude             float64
latitude              float64
housing_median_age      int64
total_rooms             int64
total_bedrooms          int64
population              int64
households              int64
median_income         float64
median_house_value      int64
random                float64
dtype: object

## df.shape

The shape of the object (number of rows and columns in a DataFrame):

In [83]:
cali_housing.shape

(3000, 10)

## df.size

The number of elements in the object (number of cells in a DataFrame):

In [84]:
cali_housing.size

30000

## df.values

Returns the Pandas object as a NumPy array:

In [85]:
cali_housing.values

array([[-1.22050000e+02,  3.73700000e+01,  2.70000000e+01, ...,
         6.60850000e+00,  3.44700000e+05, -1.60003778e+00],
       [-1.18300000e+02,  3.42600000e+01,  4.30000000e+01, ...,
         3.59900000e+00,  1.76500000e+05, -1.36305161e-01],
       [-1.17810000e+02,  3.37800000e+01,  2.70000000e+01, ...,
         5.79340000e+00,  2.70500000e+05, -1.09886318e+00],
       ...,
       [-1.19700000e+02,  3.63000000e+01,  1.00000000e+01, ...,
         2.28950000e+00,  6.20000000e+04, -2.72284848e-01],
       [-1.17120000e+02,  3.41000000e+01,  4.00000000e+01, ...,
         3.27080000e+00,  1.62500000e+05,  1.10953281e+00],
       [-1.19630000e+02,  3.44200000e+01,  4.20000000e+01, ...,
         8.56080000e+00,  5.00001000e+05,  1.83616775e+00]])

# Deletion


## Removing rows or columns with df.drop()

The drop() method removes rows or columns according to the axis you specify; pass a list to remove more than one:

In [86]:
# first add two new columns
cali_housing["temp"] = "temp"
cali_housing["temp2"] = cali_housing["temp"] + "2"
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random,temp,temp2
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038,temp,temp2
1,-118.30,34.26,43,1510,201,809,277,3.5990,176500,-0.136305,temp,temp2
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863,temp,temp2
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166,temp,temp2
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749,temp,temp2
...,...,...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349,temp,temp2
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246,temp,temp2
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285,temp,temp2
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533,temp,temp2


Removing columns with axis = 1:

In [87]:
# drop returns a new DataFrame without the columns
df = cali_housing.drop(["temp", "temp2"], axis=1)
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
1,-118.30,34.26,43,1510,201,809,277,3.5990,176500,-0.136305
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749
...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533


Removing rows with axis = 0:

In [89]:
# drop returns a new DataFrame without the rows
df = cali_housing.drop([1,2,3], axis=0)
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random,temp,temp2
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038,temp,temp2
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749,temp,temp2
5,-119.56,36.51,37,1018,213,663,204,1.6635,67000,-0.112603,temp,temp2
6,-121.43,38.63,43,1009,225,604,218,1.6641,67000,1.119988,temp,temp2
7,-120.65,35.48,19,2310,471,1341,441,3.2250,166900,1.272826,temp,temp2
...,...,...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349,temp,temp2
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246,temp,temp2
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285,temp,temp2
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533,temp,temp2
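
`drop()` also accepts the `columns=` and `index=` keywords, which some readers find clearer than remembering the axis numbers. The following is a minimal sketch on a hypothetical frame, not a replacement for the cells above.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "temp": ["t", "t", "t"]})

no_temp = frame.drop(columns=["temp"])  # same as frame.drop(["temp"], axis=1)
no_first = frame.drop(index=[0])        # same as frame.drop([0], axis=0)

print(no_temp.columns.tolist())
print(no_first.index.tolist())
print(frame.columns.tolist())           # the original frame is unchanged
```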


## Removing columns with ```del```

You can use ```del df[col_name]``` to remove a column from a DataFrame in place:

In [90]:
# the original DataFrame has not been altered
# it still contains the temporary columns
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random,temp,temp2
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038,temp,temp2
1,-118.30,34.26,43,1510,201,809,277,3.5990,176500,-0.136305,temp,temp2
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863,temp,temp2
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166,temp,temp2
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749,temp,temp2
...,...,...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349,temp,temp2
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246,temp,temp2
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285,temp,temp2
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533,temp,temp2


In [91]:
# remove the new temp column with del, directly in the DataFrame
del cali_housing["temp"]
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random,temp2
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038,temp2
1,-118.30,34.26,43,1510,201,809,277,3.5990,176500,-0.136305,temp2
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863,temp2
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166,temp2
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749,temp2
...,...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349,temp2
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246,temp2
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285,temp2
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533,temp2


In [92]:
# del removes one column at a time
# call del a second time to remove temp2
del cali_housing["temp2"]
cali_housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
1,-118.30,34.26,43,1510,201,809,277,3.5990,176500,-0.136305
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749
...,...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23,1450,642,1258,607,1.1790,225000,0.865349
2996,-118.14,34.06,27,5257,1082,3496,1036,3.3906,237200,-1.433246
2997,-119.70,36.30,10,956,201,693,220,2.2895,62000,-0.272285
2998,-117.12,34.10,40,96,14,46,14,3.2708,162500,1.109533


# Missing data

In Pandas, missing values are represented by the value NaN (Not a Number). When values are missing, it is often necessary to remove the entries that contain them, or to replace them with other values, before working with the data. Pandas provides several methods for these tasks.

In [148]:
# create a DataFrame with missing data
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['A', 'B', 'C'])
df['D'] = df['A'] > 0
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df

Unnamed: 0,A,B,C,D
a,-0.46346,0.090193,-0.672014,False
b,,,,
c,1.445934,0.18609,-0.709763,True
d,,,,
e,1.375455,-2.517051,-1.280936,True
f,-0.439717,1.444073,-0.376402,False
g,,,,
h,1.118576,1.007971,0.918047,True


## df.isna()

The isna() method reports which values in the object are missing. It returns an object of the same shape whose boolean elements are True where a value is missing.

In [149]:
df.isna()

Unnamed: 0,A,B,C,D
a,False,False,False,False
b,True,True,True,True
c,False,False,False,False
d,True,True,True,True
e,False,False,False,False
f,False,False,False,False
g,True,True,True,True
h,False,False,False,False
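
Because True counts as 1 when summed, the boolean result of isna() can be aggregated to count missing values. The sketch below assumes a small hypothetical frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
})

missing_per_column = frame.isna().sum()       # number of NaN values in each column
rows_with_missing = frame.isna().any(axis=1)  # True for rows with at least one NaN

print(missing_per_column)
print(rows_with_missing.tolist())
```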


## df.dropna()

The dropna() method removes from a DataFrame the rows or columns that contain one or more NaN elements.

The main parameters of dropna():

```
axis : {0 or 'index', 1 or 'columns'}, default 0
        Determines whether rows or columns that contain
        missing values are removed.
    
how : {'any', 'all'}, default 'any'
    Determines whether a row or column is removed when it has at least one missing value, or only when all of its values are missing.

    * 'any' : If any value is missing, drop the row or column.
    * 'all' : If all values are missing, drop the row or column.
    
thresh : int, optional
    Keep only the rows or columns that have at least this many non-missing values.
```
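
The `thresh` parameter is the one not exercised in the cells of this section, so here is a minimal sketch of it on a hypothetical frame. With `thresh=2`, a row is kept only if it has at least two non-missing values.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": [1.0, np.nan, np.nan],
        "B": [2.0, 5.0, np.nan],
        "C": [3.0, 6.0, np.nan],
    },
    index=["r1", "r2", "r3"],
)

# r1 has 3 valid values, r2 has 2, r3 has 0
kept = frame.dropna(thresh=2)
print(kept.index.tolist())
```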



In [150]:
df.dropna(axis=0, how='any')

Unnamed: 0,A,B,C,D
a,-0.46346,0.090193,-0.672014,False
c,1.445934,0.18609,-0.709763,True
e,1.375455,-2.517051,-1.280936,True
f,-0.439717,1.444073,-0.376402,False
h,1.118576,1.007971,0.918047,True


In [152]:
# insert a new NaN
df.iat[0, 0] = np.nan
df

Unnamed: 0,A,B,C,D
a,,0.090193,-0.672014,False
b,,,,
c,1.445934,0.18609,-0.709763,True
d,,,,
e,1.375455,-2.517051,-1.280936,True
f,-0.439717,1.444073,-0.376402,False
g,,,,
h,1.118576,1.007971,0.918047,True


Using ```all``` with ```axis=0``` removes the rows in which every element is missing:

In [153]:
df.dropna(axis = 0, how='all')

Unnamed: 0,A,B,C,D
a,,0.090193,-0.672014,False
c,1.445934,0.18609,-0.709763,True
e,1.375455,-2.517051,-1.280936,True
f,-0.439717,1.444073,-0.376402,False
h,1.118576,1.007971,0.918047,True


Using ```any``` with ```axis=0``` removes the rows with one or more missing values:

In [155]:
df.dropna(axis=0, how="any")

Unnamed: 0,A,B,C,D
c,1.445934,0.18609,-0.709763,True
e,1.375455,-2.517051,-1.280936,True
f,-0.439717,1.444073,-0.376402,False
h,1.118576,1.007971,0.918047,True


Add a new column whose elements are all ```NaN```:

In [156]:
df["empty_col"] = np.nan
df

Unnamed: 0,A,B,C,D,empty_col
a,,0.090193,-0.672014,False,
b,,,,,
c,1.445934,0.18609,-0.709763,True,
d,,,,,
e,1.375455,-2.517051,-1.280936,True,
f,-0.439717,1.444073,-0.376402,False,
g,,,,,
h,1.118576,1.007971,0.918047,True,


Removing that column with ```axis=1``` and ```how="all"```:

In [157]:
# the column that contains only NaN is removed entirely
df.dropna(axis=1, how="all")

Unnamed: 0,A,B,C,D
a,,0.090193,-0.672014,False
b,,,,
c,1.445934,0.18609,-0.709763,True
d,,,,
e,1.375455,-2.517051,-1.280936,True
f,-0.439717,1.444073,-0.376402,False
g,,,,
h,1.118576,1.007971,0.918047,True


## df.fillna()

The fillna() method replaces the missing values in a DataFrame with another value.

The main parameters of fillna():

```
value : scalar, dict, Series, or DataFrame
    The value used to replace missing entries.
    If it is a dict, Series, or DataFrame, each missing entry takes
    the value associated with its key, index label, or column.

method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    The method used to fill the gaps.
    * pad / ffill: propagate the last valid value
    forward until the next valid value.
    * backfill / bfill: use the next valid observation
    to fill the gap.

axis : {0 or 'index', 1 or 'columns'}
```



Replace all missing values with a scalar:

In [158]:
# fillna with a scalar
df.fillna(0.0)

Unnamed: 0,A,B,C,D,empty_col
a,0.0,0.090193,-0.672014,False,0.0
b,0.0,0.0,0.0,0,0.0
c,1.445934,0.18609,-0.709763,True,0.0
d,0.0,0.0,0.0,0,0.0
e,1.375455,-2.517051,-1.280936,True,0.0
f,-0.439717,1.444073,-0.376402,False,0.0
g,0.0,0.0,0.0,0,0.0
h,1.118576,1.007971,0.918047,True,0.0


Replace using a dictionary keyed by the columns of a DataFrame:

In [159]:
# use a dictionary to fill each column with a different value
df.fillna({'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': False})

Unnamed: 0,A,B,C,D,empty_col
a,0.0,0.090193,-0.672014,False,
b,0.0,0.0,0.0,False,
c,1.445934,0.18609,-0.709763,True,
d,0.0,0.0,0.0,False,
e,1.375455,-2.517051,-1.280936,True,
f,-0.439717,1.444073,-0.376402,False,
g,0.0,0.0,0.0,False,
h,1.118576,1.007971,0.918047,True,


With a dictionary keyed by the index labels of a Series:

In [160]:
df["A"].fillna({'a':0.0, 'b':1.0, 'c':2.0})

a    0.000000
b    1.000000
c    1.445934
d         NaN
e    1.375455
f   -0.439717
g         NaN
h    1.118576
Name: A, dtype: float64

Forward fill:

```axis=0``` by default (down each column, following the index):

In [165]:
df.fillna(method='ffill')

Unnamed: 0,A,B,C,D,empty_col
a,,0.090193,-0.672014,False,
b,,0.090193,-0.672014,False,
c,1.445934,0.18609,-0.672014,True,
d,1.445934,0.18609,-0.672014,True,
e,1.375455,-2.517051,-1.280936,True,
f,-0.439717,1.444073,-0.376402,False,
g,-0.439717,1.444073,-0.376402,False,
h,1.118576,1.007971,0.918047,True,


```axis=1``` (along each row):

In [166]:
# a new missing element in row c, column C
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df.iat[2, 2] = np.nan
df

Unnamed: 0,A,B,C,D,empty_col
a,,0.090193,-0.672014,False,
b,,,,,
c,1.445934,0.18609,,True,
d,,,,,
e,1.375455,-2.517051,-1.280936,True,
f,-0.439717,1.444073,-0.376402,False,
g,,,,,
h,1.118576,1.007971,0.918047,True,


In [167]:
# fill from left to right along each row
df.fillna(method='ffill', axis=1)

Unnamed: 0,A,B,C,D,empty_col
a,,0.0901925,-0.672014,False,False
b,,,,,
c,1.44593,0.18609,0.18609,True,True
d,,,,,
e,1.37546,-2.51705,-1.28094,True,True
f,-0.439717,1.44407,-0.376402,False,False
g,,,,,
h,1.11858,1.00797,0.918047,True,True


Back Fill:

```axis=0``` by default (along each column):

In [168]:
# fill from bottom to top along each column
df.fillna(method='bfill')

Unnamed: 0,A,B,C,D,empty_col
a,1.445934,0.090193,-0.672014,False,
b,1.445934,0.18609,-1.280936,True,
c,1.445934,0.18609,-1.280936,True,
d,1.375455,-2.517051,-1.280936,True,
e,1.375455,-2.517051,-1.280936,True,
f,-0.439717,1.444073,-0.376402,False,
g,1.118576,1.007971,0.918047,True,
h,1.118576,1.007971,0.918047,True,


```axis=1``` (along each row):

In [169]:
# fill from right to left along each row
df.fillna(method='bfill', axis=1)

Unnamed: 0,A,B,C,D,empty_col
a,0.0901925,0.0901925,-0.672014,False,
b,,,,,
c,1.44593,0.18609,True,True,
d,,,,,
e,1.37546,-2.51705,-1.28094,True,
f,-0.439717,1.44407,-0.376402,False,
g,,,,,
h,1.11858,1.00797,0.918047,True,
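
The same fills are also available as the dedicated methods `ffill()` and `bfill()`. Recent pandas releases deprecate the `method=` argument of `fillna()`, so the sketch below may be the more portable spelling. The small `demo` DataFrame is a made-up example, not the `df` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
demo = pd.DataFrame(
    {"A": [np.nan, 1.0, np.nan], "B": [2.0, np.nan, 4.0]},
    index=["a", "b", "c"],
)

forward = demo.ffill()        # carry the last valid value down each column
backward = demo.bfill()       # carry the next valid value up each column
across = demo.ffill(axis=1)   # carry values left to right along each row

print(forward)
print(backward)
print(across)
```

A leading gap (forward fill) or a trailing gap (back fill) has no neighbor to copy from, so it stays NaN.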


# Utility Methods

* df.copy()
* df.sort_values([ascending=True|False])
* df.sort_index([ascending=True|False])

## df.copy()

Creates a deep copy (the default) of the object:

In [170]:
df = cali_housing.copy()
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122.05,37.37,27,3885,661,1537,606,6.6085,344700,-1.600038
1,-118.3,34.26,43,1510,201,809,277,3.599,176500,-0.136305
2,-117.81,33.78,27,3589,507,1484,495,5.7934,270500,-1.098863
3,-118.36,33.82,28,67,15,49,11,6.1359,330000,-0.310166
4,-119.67,36.33,19,1241,244,850,237,2.9375,81700,0.378749


Modifying ```df``` now does not affect ```cali_housing```, and vice versa.
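
A minimal sketch of the difference between a plain assignment and `copy()`. The names `original`, `alias` and `clone` are hypothetical and stand in for any DataFrame.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})
alias = original          # same object, nothing is copied
clone = original.copy()   # independent deep copy

clone.loc[0, "x"] = 99    # changes only the clone
alias.loc[1, "x"] = -1    # changes the original as well

print(original["x"].tolist())
print(clone["x"].tolist())
```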

## df.sort_values()

Sort by the values in the DataFrame:

In [171]:
df.sort_values("housing_median_age")

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
842,-117.95,35.08,1,83,15,32,15,4.8750,141700,0.861210
1115,-116.95,33.86,1,6,2,8,2,1.6250,55000,0.203798
2117,-117.51,34.16,2,718,98,119,50,4.1000,315000,-0.194365
2429,-117.20,33.58,2,30450,5033,9419,3197,4.5936,174300,0.647425
495,-122.29,37.82,2,158,43,94,57,2.5625,60000,-0.209260
...,...,...,...,...,...,...,...,...,...,...
717,-122.27,37.86,52,2307,583,1127,548,1.8447,198200,-0.240777
708,-118.21,33.97,52,4220,908,3731,892,3.1901,167600,0.003300
1826,-122.04,37.00,52,3365,644,796,333,2.9712,116600,0.546383
754,-121.49,38.56,52,1777,368,624,350,3.6729,137800,-0.172833


Sort by more than one column:

In [172]:
df.sort_values(["housing_median_age", "households"])

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
1115,-116.95,33.86,1,6,2,8,2,1.6250,55000,0.203798
842,-117.95,35.08,1,83,15,32,15,4.8750,141700,0.861210
1050,-121.96,37.74,2,200,20,25,9,15.0001,350000,-0.677188
2117,-117.51,34.16,2,718,98,119,50,4.1000,315000,-0.194365
495,-122.29,37.82,2,158,43,94,57,2.5625,60000,-0.209260
...,...,...,...,...,...,...,...,...,...,...
26,-122.42,37.76,52,3587,1030,2259,979,2.5403,250000,0.687504
2950,-122.42,37.79,52,3364,1100,2112,1045,2.1343,400000,-1.080658
1780,-122.42,37.76,52,4407,1192,2280,1076,3.3937,270000,0.377767
1346,-122.44,37.77,52,5604,1268,2023,1196,4.4085,400000,0.809044


## df.sort_index()

Sort by the index of the DataFrame/Series:

In [173]:
# a random sample
df = cali_housing.sample(5)
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
2368,-122.74,38.46,9,2268,594,1311,585,2.6607,91500,-0.336768
523,-122.53,37.95,22,7446,1979,2980,1888,3.5838,271300,-0.051061
2882,-118.18,34.05,52,1070,231,925,220,1.825,133000,-0.040384
1884,-117.3,34.12,43,1018,261,736,215,2.6,66900,0.612593
812,-119.14,36.06,32,1838,441,1628,425,1.6452,41500,0.580128


In [174]:
# sort by index
df.sort_index()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
523,-122.53,37.95,22,7446,1979,2980,1888,3.5838,271300,-0.051061
812,-119.14,36.06,32,1838,441,1628,425,1.6452,41500,0.580128
1884,-117.3,34.12,43,1018,261,736,215,2.6,66900,0.612593
2368,-122.74,38.46,9,2268,594,1311,585,2.6607,91500,-0.336768
2882,-118.18,34.05,52,1070,231,925,220,1.825,133000,-0.040384


For a Series object:

In [175]:
s = cali_housing["total_rooms"].sample(5)
s

882     2785
807     2715
0       3885
2234    7141
1709    1086
Name: total_rooms, dtype: int64

In [176]:
s.sort_index()

0       3885
807     2715
882     2785
1709    1086
2234    7141
Name: total_rooms, dtype: int64
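
The `ascending` parameter listed for both methods is not exercised above. As a small sketch on made-up data (the `toy` frame and its columns are assumptions), it accepts a single flag or one flag per sort column:

```python
import pandas as pd

toy = pd.DataFrame({"age": [2, 1, 2, 1], "rooms": [10, 30, 20, 40]})

# age ascending, then rooms descending within each age
mixed = toy.sort_values(["age", "rooms"], ascending=[True, False])

# index from largest to smallest label
reversed_index = mixed.sort_index(ascending=False)

print(mixed)
print(reversed_index)
```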

# Mathematical Methods

[Binary operations](https://pandas.pydata.org/docs/reference/frame.html#binary-operator-functions):

* add(other), adds a DataFrame/Series and other element by element.
* sub(other), subtracts other from a DataFrame/Series element by element.
* mul(other), multiplies a DataFrame/Series by other element by element.
* div(other), divides a DataFrame/Series by other element by element.
* mod(other), computes the element-wise modulo of a DataFrame/Series by other.
* pow(other), raises each element of a DataFrame/Series to the power other.
* dot(other), dot product of two Series, or matrix product of two DataFrames.

_Binary operations support broadcasting, so other can be a scalar, a DataFrame, or a Series._


[Statistical operations](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats):
* abs(), absolute value of each element.
* count(axis), counts the non-null elements in each column or row.
* max(axis), returns the maximum value of each column or row.
* min(axis), returns the minimum value of each column or row.
* mean(axis), returns the mean of each column or row.
* median(axis), returns the median of each column or row.
* sum(axis), returns the sum of the elements of each column or row.
* std(axis), returns the standard deviation of each column or row.
* var(axis), returns the variance of each column or row.

## Binary Operations

In [177]:
# create a DataFrame to illustrate the functions
df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]}, index=['circle', 'triangle', 'rectangle'])
df

Unnamed: 0,angles,degrees
circle,0,360
triangle,3,180
rectangle,4,360


### add(), sub(), mul() y div()

In [178]:
df.add(1)

Unnamed: 0,angles,degrees
circle,1,361
triangle,4,181
rectangle,5,361


or

In [179]:
df + 1

Unnamed: 0,angles,degrees
circle,1,361
triangle,4,181
rectangle,5,361


In [180]:
df.mul(2)

Unnamed: 0,angles,degrees
circle,0,720
triangle,6,360
rectangle,8,720


or

In [181]:
df * 2

Unnamed: 0,angles,degrees
circle,0,720
triangle,6,360
rectangle,8,720


In [182]:
df.div(2)

Unnamed: 0,angles,degrees
circle,0.0,180.0
triangle,1.5,90.0
rectangle,2.0,180.0


or

In [183]:
df / 2

Unnamed: 0,angles,degrees
circle,0.0,180.0
triangle,1.5,90.0
rectangle,2.0,180.0


### mod() y pow()

In [184]:
df.mod(2)

Unnamed: 0,angles,degrees
circle,0,0
triangle,1,0
rectangle,0,0


or

In [185]:
df % 2

Unnamed: 0,angles,degrees
circle,0,0
triangle,1,0
rectangle,0,0


In [186]:
df.pow(3)

Unnamed: 0,angles,degrees
circle,0,46656000
triangle,27,5832000
rectangle,64,46656000


or

In [0]:
df ** 3

### dot()

In [187]:
# a new DataFrame to illustrate dot
df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
df

Unnamed: 0,0,1,2,3
0,0,1,-2,-1
1,1,1,1,1


In [188]:
# dot with a Series
s = pd.Series([1, 1, 2, 1])
df.dot(s)

0   -4
1    5
dtype: int64

In [189]:
# dot of two matrices
other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
df.dot(other)

Unnamed: 0,0,1
0,1,4
1,2,2


*Note: binary operations can be applied with scalars, Series, or other DataFrames.*
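
The examples above only use scalars. As a sketch of the Series case, the snippet below subtracts one value per column and then one value per row. The Series `per_column` and `per_row` are made up for illustration; the `axis` argument decides which labels the Series is aligned with.

```python
import pandas as pd

df = pd.DataFrame(
    {"angles": [0, 3, 4], "degrees": [360, 180, 360]},
    index=["circle", "triangle", "rectangle"],
)

per_column = pd.Series({"angles": 1, "degrees": 10})
per_row = pd.Series([1, 2, 3], index=df.index)

by_columns = df.sub(per_column)       # aligned on column labels (the default)
by_rows = df.sub(per_row, axis=0)     # aligned on the row index

print(by_columns)
print(by_rows)
```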

## Statistical Operations

### abs()

In [190]:
# a DataFrame with negative values
df = pd.DataFrame({'a': [4, -5, -6, 7], 'b': [-10, 20, 30, -40], 'c': [100, 50, -30, -50]})
df

Unnamed: 0,a,b,c
0,4,-10,100
1,-5,20,50
2,-6,30,-30
3,7,-40,-50


In [191]:
df.abs()

Unnamed: 0,a,b,c
0,4,10,100
1,5,20,50
2,6,30,30
3,7,40,50


### count()

In [192]:
# a DataFrame with null and non-null values
df = pd.DataFrame({"Person":["John", "Myla", "Lewis", "John", "Myla"],
                    "Age": [24., np.nan, 21., 33, 26],
                    "Single": [False, True, True, np.nan, np.nan]})
df

Unnamed: 0,Person,Age,Single
0,John,24.0,False
1,Myla,,True
2,Lewis,21.0,True
3,John,33.0,
4,Myla,26.0,


By column (axis=0 by default):

In [193]:
df.count()

Person    5
Age       4
Single    3
dtype: int64

By row (axis=1):

In [194]:
df.count(axis=1)

0    3
1    2
2    3
3    2
4    2
dtype: int64

### max() y min()

In [195]:
# create a DataFrame to illustrate the functions
df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]}, index=['circle', 'triangle', 'rectangle'])
df

Unnamed: 0,angles,degrees
circle,0,360
triangle,3,180
rectangle,4,360


By column (```axis=0``` by default):

In [196]:
df.max(axis=0)

angles       4
degrees    360
dtype: int64

In [197]:
# axis=0 by default
df.min()

angles       0
degrees    180
dtype: int64

By row (```axis=1```):

In [198]:
df.max(axis=1)

circle       360
triangle     180
rectangle    360
dtype: int64

In [199]:
df.min(axis=1)

circle       0
triangle     3
rectangle    4
dtype: int64

### mean() y median()

By column (axis=0 by default):

In [25]:
df.mean()

angles       2.333333
degrees    300.000000
dtype: float64

In [26]:
df.median()

angles       3.0
degrees    360.0
dtype: float64

By row (axis=1):

In [200]:
df.mean(axis=1)

circle       180.0
triangle      91.5
rectangle    182.0
dtype: float64

In [201]:
df.median(axis=1)

circle       180.0
triangle      91.5
rectangle    182.0
dtype: float64

### sum()

By column (axis=0 by default):

In [202]:
df.sum()

angles       7
degrees    900
dtype: int64

By row (axis=1):

In [203]:
df.sum(axis=1)

circle       360
triangle     183
rectangle    364
dtype: int64

### std() y var()

By column (axis=0 by default):

In [204]:
df.std()

angles       2.081666
degrees    103.923048
dtype: float64

In [205]:
df.var()

angles         4.333333
degrees    10800.000000
dtype: float64

By row (axis=1):

In [206]:
df.std(axis=1)

circle       254.558441
triangle     125.157900
rectangle    251.730014
dtype: float64

In [207]:
df.var(axis=1)

circle       64800.0
triangle     15664.5
rectangle    63368.0
dtype: float64

# Applying Custom Functions

## apply()

Takes a function and applies it to every column or row, depending on the given axis.

```axis=0``` applies the function to each column:

In [208]:
f = lambda x: (x - x.mean())/x.var()

# apply the function to every column
cali_housing.apply(f, axis=0)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-0.618327,0.382452,-0.011706,0.000277,0.000759,0.000126,0.000869,0.814497,0.000011,-1.564046
1,0.323938,-0.303250,0.089792,-0.000234,-0.001904,-0.000559,-0.001594,-0.060558,-0.000002,-0.120181
2,0.447061,-0.409082,-0.011706,0.000213,-0.000133,0.000076,0.000038,0.577495,0.000005,-1.069674
3,0.308862,-0.400263,-0.005362,-0.000545,-0.002980,-0.001275,-0.003586,0.677082,0.000010,-0.291682
4,-0.020303,0.153150,-0.062455,-0.000292,-0.001655,-0.000521,-0.001894,-0.252898,-0.000010,0.387882
...,...,...,...,...,...,...,...,...,...,...
2995,-0.068044,-0.267973,-0.037081,-0.000247,0.000649,-0.000136,0.000877,-0.764207,0.000001,0.867877
2996,0.364142,-0.347347,-0.011706,0.000572,0.003195,0.001971,0.004090,-0.121153,0.000002,-1.399518
2997,-0.027841,0.146535,-0.119548,-0.000354,-0.001904,-0.000668,-0.002021,-0.441314,-0.000011,-0.254315
2998,0.620438,-0.338528,0.070761,-0.000539,-0.002986,-0.001278,-0.003564,-0.155987,-0.000003,1.108746


```axis=1``` applies the function to each row:

In [209]:
# apply the function to every row
cali_housing.apply(f, axis=1)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,0.000026,-0.000003
1,-0.000006,-0.000006,-0.000006,-0.000005,-0.000006,-0.000006,-0.000006,-0.000006,0.000051,-0.000006
2,-0.000004,-0.000004,-0.000004,-0.000003,-0.000004,-0.000004,-0.000004,-0.000004,0.000033,-0.000004
3,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,-0.000003,0.000027,-0.000003
4,-0.000013,-0.000013,-0.000013,-0.000011,-0.000012,-0.000011,-0.000012,-0.000013,0.000111,-0.000013
...,...,...,...,...,...,...,...,...,...,...
2995,-0.000005,-0.000005,-0.000005,-0.000004,-0.000004,-0.000004,-0.000004,-0.000005,0.000040,-0.000005
2996,-0.000004,-0.000004,-0.000004,-0.000004,-0.000004,-0.000004,-0.000004,-0.000004,0.000038,-0.000004
2997,-0.000017,-0.000017,-0.000017,-0.000014,-0.000016,-0.000015,-0.000016,-0.000017,0.000146,-0.000017
2998,-0.000006,-0.000006,-0.000006,-0.000006,-0.000006,-0.000006,-0.000006,-0.000006,0.000055,-0.000006
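
When the function returns a single value instead of a Series, `apply()` collapses each column or row to a scalar. A minimal sketch on made-up data (`toy` and `value_range` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

per_column = toy.apply(value_range)          # one result per column
per_row = toy.apply(value_range, axis=1)     # one result per row

print(per_column)
print(per_row)
```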


## applymap()

Applies a function element by element to a DataFrame. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().

In [210]:
f = lambda x: round(x)

# apply the function to every element
cali_housing.applymap(f)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,random
0,-122,37,27,3885,661,1537,606,7,344700,-2
1,-118,34,43,1510,201,809,277,4,176500,0
2,-118,34,27,3589,507,1484,495,6,270500,-1
3,-118,34,28,67,15,49,11,6,330000,0
4,-120,36,19,1241,244,850,237,3,81700,0
...,...,...,...,...,...,...,...,...,...,...
2995,-120,34,23,1450,642,1258,607,1,225000,1
2996,-118,34,27,5257,1082,3496,1036,3,237200,-1
2997,-120,36,10,956,201,693,220,2,62000,0
2998,-117,34,40,96,14,46,14,3,162500,1
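
A version-tolerant sketch of an element-wise call: it prefers `DataFrame.map` when the installed pandas provides it and falls back to `applymap` otherwise. The `toy` frame and the formatting function are assumptions chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.4, 2.6], "b": [-0.5, 3.5]})

# Pick whichever element-wise method this pandas version offers
elementwise = getattr(toy, "map", None) or toy.applymap

# Format every cell as text with one decimal place
formatted = elementwise(lambda value: f"{value:.1f}")

print(formatted)
```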


# Column Operations

You can operate directly on the columns of a DataFrame to obtain a Series or to create new columns. For example:

* Perform binary operations between columns (addition, subtraction, multiplication, etc.).
* Apply statistical methods to columns (sum(), mean(), var(), etc.).
* Combine binary operations and methods.

In [211]:
# create a DataFrame
df = pd.DataFrame({"A":[1, 2, 3], "B":[2, 2, 2]})
df

Unnamed: 0,A,B
0,1,2
1,2,2
2,3,2


In [212]:
# operations between columns return a Series
df["A"] * df["B"]

0    2
1    4
2    6
dtype: int64

In [213]:
# create a new column from an operation
df["A^B"] = df["A"].pow(df["B"])
df 

Unnamed: 0,A,B,A^B
0,1,2,1
1,2,2,4
2,3,2,9


In [214]:
# applying a statistical method to a column returns a single value
df["A"].sum()

6

In [215]:
# create a new column with the sum of each row
df["suma"] = df.sum(axis=1)
df

Unnamed: 0,A,B,A^B,suma
0,1,2,1,4
1,2,2,4,8
2,3,2,9,14
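
The third case in the list, mixing binary operations with methods, could look like the sketch below. The column names `share` and `centered` are made up; `assign()` is used so the original DataFrame stays unchanged.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 2, 2]})

# Each value of A as a fraction of the column total
share = df["A"] / df["A"].sum()

# assign() returns a new DataFrame with the extra columns
extended = df.assign(share=share, centered=df["A"] - df["A"].mean())

print(extended)
```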


# Group Functions and Aggregation

## aggregate()/agg()

Data aggregation is a transformation that produces a scalar value from an array, for example the "sum" and "mean" functions/methods. The aggregate() method and its alias agg() apply one or more aggregation functions to Pandas objects.

The main parameters are:



```
func : function, str, list or dict
  The function or functions used to aggregate the data.

  Accepted combinations:

  function
  string with the function name
  list of functions and/or function names (e.g. [np.sum, 'mean'])
  dictionary mapping axis labels to:
  - functions.
  - function names.
  - lists of functions.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
  If axis=0 or ‘index’: apply the function to each column.
  If axis=1 or ‘columns’: apply the function to each row.
```



In [216]:
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9


By column (axis=0 by default):

In [217]:
# aggregate several statistics over the columns
df.agg(["min", "max"])

Unnamed: 0,A,B,C
min,1,2,3
max,7,8,9


By row (axis=1):

In [218]:
# aggregate several statistics over the rows
df.agg(["min", "max"], axis=1)

Unnamed: 0,min,max
0,1,3
1,4,6
2,7,9


Using a dictionary as the argument:

In [219]:
df.agg({"A": np.min, "B": np.max, "C": [np.mean, np.std]})

Unnamed: 0,A,B,C
amax,,8.0,
amin,1.0,,
mean,,,6.0
std,,,3.0
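
The same dictionary can be written with function names as strings, which gives predictable row labels (`min`, `max`) and may avoid warnings that newer pandas versions raise for NumPy callables. This is a sketch, not the only valid spelling; every value is wrapped in a list here so that the result is always a DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A", "B", "C"])

# One list of aggregation names per column
summary = df.agg({"A": ["min"], "B": ["max"], "C": ["mean", "std"]})

print(summary)
```

Cells for statistics that were not requested for a column are NaN.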


## groupby()

The groupby() method splits the data into groups by one or more columns so that transformation or aggregation functions can be applied to each group.

groupby() is useful for answering questions such as: what is the mean height of men and women? Or: which state has the smallest population in each country?

In [224]:
df = pd.DataFrame({"Sexo": ["M", "F", "F", "M"], "Altura": [1.68, 1.55, 1.75, 1.82], "Peso": [70, 50, 68, 72]})
df

Unnamed: 0,Sexo,Altura,Peso
0,M,1.68,70
1,F,1.55,50
2,F,1.75,68
3,M,1.82,72


In [225]:
# group a DataFrame by one column
df.groupby("Sexo")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027A16791DA0>

groupby() on its own returns a DataFrameGroupBy object rather than a result.

An aggregation or transformation has to be applied to it:

In [226]:
# the mean of each column
df.groupby("Sexo").mean()

Unnamed: 0_level_0,Altura,Peso
Sexo,Unnamed: 1_level_1,Unnamed: 2_level_1
F,1.65,59
M,1.75,71


In [227]:
# select a single column for the aggregation
df.groupby("Sexo")["Altura"].mean()

Sexo
F    1.65
M    1.75
Name: Altura, dtype: float64

**groupby() and agg()**

You can combine groupby() with agg() to apply more than one function.

In [228]:
df.groupby("Sexo").agg(["mean", "std"])

Unnamed: 0_level_0,Altura,Altura,Peso,Peso
Unnamed: 0_level_1,mean,std,mean,std
Sexo,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
F,1.65,0.141421,59,12.727922
M,1.75,0.098995,71,1.414214
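
If the two-level column header is inconvenient, named aggregation is one alternative: each keyword becomes an output column built from a `(column, function)` pair. The output names `altura_media` and `peso_max` below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"Sexo": ["M", "F", "F", "M"],
     "Altura": [1.68, 1.55, 1.75, 1.82],
     "Peso": [70, 50, 68, 72]}
)

# keyword = (source column, aggregation name)
summary = df.groupby("Sexo").agg(
    altura_media=("Altura", "mean"),
    peso_max=("Peso", "max"),
)

print(summary)
```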


**Grouping by more than one column**

In [229]:
# add a new column to df
df["Estado"] = ["Morelos", "DF", "DF", "DF"]
df

Unnamed: 0,Sexo,Altura,Peso,Estado
0,M,1.68,70,Morelos
1,F,1.55,50,DF
2,F,1.75,68,DF
3,M,1.82,72,DF


In [230]:
# group by more than one key
df.groupby(["Sexo", "Estado"]).agg("mean")

Unnamed: 0_level_0,Unnamed: 1_level_0,Altura,Peso
Sexo,Estado,Unnamed: 2_level_1,Unnamed: 3_level_1
F,DF,1.65,59
M,DF,1.82,72
M,Morelos,1.68,70


# Combining DataFrames

## append()

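Concatenates the rows of one DataFrame onto another, matching columns by name. Columns that do not match are added to the resulting DataFrame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is its replacement.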

In [232]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df

Unnamed: 0,A,B
0,1,2
1,3,4


In [254]:
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["A", "B"])
df2

Unnamed: 0,A,B
0,5,6
1,7,8


In [233]:
df.append(df2)

Unnamed: 0,A,B
0,1,2
1,3,4
0,5,6
1,7,8
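
The same result can be obtained with `pd.concat()`, which works across pandas versions. A minimal sketch with the two frames from above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=["A", "B"])

stacked = pd.concat([df, df2])                         # keeps the original index labels
renumbered = pd.concat([df, df2], ignore_index=True)   # builds a fresh 0..n-1 index

print(stacked)
print(renumbered)
```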


## merge() y join()

SQL-style joins.

If you are not familiar with SQL, this site gives an overview of the available join types: https://www.w3schools.com/sql/sql_join.asp

![dofactory.com](https://www.dofactory.com/Images/sql-joins.png)

**merge()**

Main parameters:


```
right : DataFrame or Series
  Object to merge with.

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
  Type of merge to be performed.

  left: similar to a SQL left outer join.
  right: similar to a SQL right outer join.
  outer: similar to a SQL full outer join.
  inner: similar to a SQL inner join.

on : label or list of labels
  Columns or index levels to join on. They must exist in both DataFrames.

left_on : label or list of labels
  Columns or index levels to join on in the left DataFrame.

right_on : label or list of labels
  Columns or index levels to join on in the right DataFrame.
```


In [234]:
# df1 has a unique key u1 that df2 does not have
df1 = pd.DataFrame({'key': ['a', 'b', 'u1', 'a'], 'value': [1, 2, 3, 5]})
df1

Unnamed: 0,key,value
0,a,1
1,b,2
2,u1,3
3,a,5


In [235]:
# df2 has a unique key u2 that df1 does not have
df2 = pd.DataFrame({'key': ['a', 'b', 'u2', 'a'], 'value': [5, 6, 7, 8]})
df2

Unnamed: 0,key,value
0,a,5
1,b,6
2,u2,7
3,a,8


Inner

Keeps only the ```key``` values that appear in both DataFrames.


| df1   || matching ```key``` rows in df2|
|---||----|
| (a, 1) || (a, 5) and (a, 8) |
| (a, 5) || (a, 5) and (a, 8) |
| (b, 2) || (b, 6) |

In [236]:
df1.merge(df2, how="inner", on="key")

Unnamed: 0,key,value_x,value_y
0,a,1,5
1,a,1,8
2,a,5,5
3,a,5,8
4,b,2,6


Outer

Keeps the rows of both DataFrames, whether or not their ```key``` values have a match on the other side.

| df1   || df2|
|---||----|
| (a, 1) || (a, 5) and (a, 8) |
| (a, 5) || (a, 5) and (a, 8) |
| (b, 2) || (b, 6) |
| (u1, 3)|| NaN |
| NaN || (u2, 7)|

In [237]:
df1.merge(df2, how="outer", on="key")

Unnamed: 0,key,value_x,value_y
0,a,1.0,5.0
1,a,1.0,8.0
2,a,5.0,5.0
3,a,5.0,8.0
4,b,2.0,6.0
5,u1,3.0,
6,u2,,7.0


Left

Keeps every row of the left DataFrame, together with the rows of the right DataFrame whose ```key``` matches.

| df1 (left)   || df2 (right)|
|---||----|
| (a, 1) || (a, 5) and (a, 8) |
| (a, 5) || (a, 5) and (a, 8) |
| (b, 2) || (b, 6) |
| (u1, 3)|| NaN |

In [238]:
df1.merge(df2, how="left", on="key")

Unnamed: 0,key,value_x,value_y
0,a,1,5.0
1,a,1,8.0
2,b,2,6.0
3,u1,3,
4,a,5,5.0
5,a,5,8.0


Right

Keeps every row of the right DataFrame, together with the rows of the left DataFrame whose ```key``` matches.

| df1 (left)   || df2 (right)|
|---||----|
| (a, 1) || (a, 5) and (a, 8) |
| (a, 5) || (a, 5) and (a, 8) |
| (b, 2) || (b, 6) |
| NaN || (u2, 7) |

In [239]:
df1.merge(df2, how="right", on="key")

Unnamed: 0,key,value_x,value_y
0,a,1.0,5
1,a,5.0,5
2,a,1.0,8
3,a,5.0,8
4,b,2.0,6
5,u2,,7
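
To see which side each row of a merge came from, `merge()` accepts `indicator=True`, and `suffixes` replaces the default `_x`/`_y` column endings. The sketch below reuses the two frames from above; the suffix names are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "u1", "a"], "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"key": ["a", "b", "u2", "a"], "value": [5, 6, 7, 8]})

merged = df1.merge(
    df2,
    how="outer",
    on="key",
    indicator=True,                  # adds a _merge column
    suffixes=("_left", "_right"),    # names for the overlapping columns
)

print(merged)
```

The `_merge` column holds `both`, `left_only` or `right_only` for each row.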


**join()**

Joins the columns of one or more objects, using either their index or a column as the key.


Parameters:
```
other : DataFrame, Series, or list of DataFrames

on : str or list of str, optional (uses the index by default)

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
  Type of join to be performed.

  left: use the index of the calling DataFrame.
  right: use the index of other.
  outer: union of the indexes.
  inner: intersection of the indexes.

lsuffix : str, default ‘’
  suffix applied to overlapping columns from the left DataFrame.

rsuffix : str, default ‘’
  suffix applied to overlapping columns from the right DataFrame.
```



In [240]:
df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]}).set_index("key")
df1

Unnamed: 0_level_0,value
key,Unnamed: 1_level_1
foo,1
bar,2
baz,3
foo,5


In [241]:
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]}).set_index("key")
df2

Unnamed: 0_level_0,value
key,Unnamed: 1_level_1
foo,5
bar,6
baz,7
foo,8


In [242]:
df1.join(df2, lsuffix="_l", rsuffix="_r")

Unnamed: 0_level_0,value_l,value_r
key,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,2,6
baz,3,7
foo,1,5
foo,1,8
foo,5,5
foo,5,8
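
Because both frames above share the same index labels, the effect of `how` is not visible there. The sketch below uses two made-up frames (`left` and `right`) whose indexes only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"value": [1, 2, 3]}, index=["foo", "bar", "baz"])
right = pd.DataFrame({"value": [10, 20]}, index=["foo", "qux"])

# Default how="left": every label of the calling DataFrame is kept
default_join = left.join(right, lsuffix="_l", rsuffix="_r")

# how="inner": only labels present in both indexes
inner_join = left.join(right, how="inner", lsuffix="_l", rsuffix="_r")

print(default_join)
print(inner_join)
```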


# Strings

Many of Python's string methods have an equivalent in Pandas, available on Series objects through the .str accessor.

Main string methods:

* str.cat(other, sep), concatenates other to each string in the object, separated by sep.
* str.split(str), splits each string on the given separator.
* str.replace(pat, repl), replaces pat with repl in each string.
* str.lower() and str.upper(), convert strings to lowercase or uppercase.
* str.len(), computes the length of each string.
* str.count(str), counts the occurrences of str in each string.
* str.strip(str), removes the characters in str from the start and end of each string.

See the documentation for the full list of [Pandas text methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary).


In [243]:
s1 = pd.Series(["A", "B", "C"])
s1

0    A
1    B
2    C
dtype: object

In [244]:
s2 = pd.Series(["1", "2", "3"])
s2

0    1
1    2
2    3
dtype: object

In [246]:
# concatenate s1 and s2 into s3
s3 = s1.str.cat(s2, sep="-")
s3

0    A-1
1    B-2
2    C-3
dtype: object

In [247]:
#str split
s3.str.split("-")

0    [A, 1]
1    [B, 2]
2    [C, 3]
dtype: object

In [248]:
#str.replace
s3.str.replace(pat="-", repl="_")

0    A_1
1    B_2
2    C_3
dtype: object

In [249]:
# a new Series
s = pd.Series(["gato", "perro", "elefante"])

In [250]:
#upper
s.str.upper()

0        GATO
1       PERRO
2    ELEFANTE
dtype: object

In [251]:
#len
s.str.len()

0    4
1    5
2    8
dtype: int64

In [252]:
#count
s.str.count("o")

0    1
1    1
2    0
dtype: int64

In [253]:
# a new Series
s4 = pd.Series(["_A_", "_B_", "_C_", "_D_"])
s4

0    _A_
1    _B_
2    _C_
3    _D_
dtype: object

In [52]:
s4.str.strip("_")

0    A
1    B
2    C
3    D
dtype: object
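
These methods can be chained, and `split()` can return separate columns with `expand=True`. A short sketch on made-up values (the `codes` Series is an assumption):

```python
import pandas as pd

codes = pd.Series([" a-1 ", "B-2", "c-3 "])

# Trim whitespace, then normalize the case
clean = codes.str.strip().str.upper()

# expand=True returns a DataFrame with one column per piece
parts = clean.str.split("-", expand=True)

print(clean)
print(parts)
```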

# Time Series

Pandas uses the "Timestamp" data type to handle dates and times.

## String to Timestamp

In [255]:
ts_index = pd.to_datetime(["10/09/2006", "11/09/2006", "12/09/2006"])
ts_index

DatetimeIndex(['2006-10-09', '2006-11-09', '2006-12-09'], dtype='datetime64[ns]', freq=None)
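
Note that "10/09/2006" was read month first, as October 9. When the layout of the strings is known, passing `format` removes the ambiguity. A minimal sketch:

```python
import pandas as pd

raw = ["10/09/2006"]

month_first = pd.to_datetime(raw, format="%m/%d/%Y")   # October 9, 2006
day_first = pd.to_datetime(raw, format="%d/%m/%Y")     # September 10, 2006

print(month_first)
print(day_first)
```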

Creating a time series whose values are strings (object dtype):

In [256]:
ts = pd.Series(["5", "10", "15"], index=ts_index)
ts

2006-10-09     5
2006-11-09    10
2006-12-09    15
dtype: object

Creating a random time series with NumPy and the Pandas date_range function:

In [257]:
ts = pd.Series(np.random.normal(0, 1, 100), index=pd.date_range(start="2006-01-01", periods=100, freq="D"))
ts.head(10)

2006-01-01    1.626137
2006-01-02   -0.632834
2006-01-03   -0.414172
2006-01-04   -1.293596
2006-01-05    0.196971
2006-01-06    0.110002
2006-01-07    1.296111
2006-01-08    0.849119
2006-01-09    0.799508
2006-01-10   -0.022591
Freq: D, dtype: float64

## Indexing

You can use dates to index a time series:

In [258]:
# use the year and month as the index
ts["2006-02"]

2006-02-01    1.298992
2006-02-02    0.102221
2006-02-03   -1.068939
2006-02-04   -1.143115
2006-02-05   -2.298487
2006-02-06    0.438334
2006-02-07    0.811359
2006-02-08    0.216005
2006-02-09   -1.334542
2006-02-10    0.161672
2006-02-11   -0.952871
2006-02-12    1.382208
2006-02-13   -3.130507
2006-02-14    1.159404
2006-02-15   -1.257624
2006-02-16    1.042069
2006-02-17   -2.484514
2006-02-18   -0.523080
2006-02-19   -2.353845
2006-02-20   -0.504369
2006-02-21    0.090765
2006-02-22   -0.923170
2006-02-23   -0.374246
2006-02-24   -1.178402
2006-02-25   -0.742739
2006-02-26   -0.461150
2006-02-27   -0.271193
2006-02-28   -1.163409
Freq: D, dtype: float64

In [259]:
# a slice between two dates (both endpoints included)
ts["2006-02-20": "2006-03-10"]

2006-02-20   -0.504369
2006-02-21    0.090765
2006-02-22   -0.923170
2006-02-23   -0.374246
2006-02-24   -1.178402
2006-02-25   -0.742739
2006-02-26   -0.461150
2006-02-27   -0.271193
2006-02-28   -1.163409
2006-03-01   -0.081818
2006-03-02    0.317361
2006-03-03   -0.481347
2006-03-04    0.048051
2006-03-05    0.208532
2006-03-06   -0.573935
2006-03-07   -0.452151
2006-03-08    0.970055
2006-03-09    1.365781
2006-03-10   -0.624655
Freq: D, dtype: float64

## Resampling / Aggregation

Resampling derives a new time series at a lower frequency by aggregating the data of the original series.

In [260]:
# from daily to weekly frequency
ts.resample("W").mean()

2006-01-01    1.626137
2006-01-08    0.015943
2006-01-15   -0.148842
2006-01-22   -0.119186
2006-01-29    0.334597
2006-02-05   -0.411843
2006-02-12    0.103166
2006-02-19   -1.078299
2006-02-26   -0.584759
2006-03-05   -0.203403
2006-03-12   -0.017113
2006-03-19   -0.119355
2006-03-26   -0.050291
2006-04-02   -0.049817
2006-04-09    0.088476
2006-04-16   -0.182110
Freq: W-SUN, dtype: float64
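
Since the series above is random, the weekly means cannot be checked by hand. The sketch below uses a deterministic series (values 0 to 13 starting on Monday 2006-01-02, an assumption made for illustration) so that each weekly bin is easy to verify:

```python
import pandas as pd

days = pd.date_range(start="2006-01-02", periods=14, freq="D")
counter = pd.Series(range(14), index=days)

# "W" bins end on Sunday and are labeled with that date
weekly = counter.resample("W").mean()

print(weekly)
```

The first bin covers the values 0 to 6 and the second 7 to 13.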

## Windows

You can apply sliding windows to a time series to compute different statistics.

In [261]:
ts.rolling(window=7).mean().head(14)

2006-01-01         NaN
2006-01-02         NaN
2006-01-03         NaN
2006-01-04         NaN
2006-01-05         NaN
2006-01-06         NaN
2006-01-07    0.126946
2006-01-08    0.015943
2006-01-09    0.220563
2006-01-10    0.276503
2006-01-11    0.321274
2006-01-12    0.258723
2006-01-13    0.148396
2006-01-14   -0.065698
Freq: D, dtype: float64
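
The leading NaN values appear because a window of 7 needs 7 observations before it produces a result. The `min_periods` parameter relaxes that requirement. A small sketch with a made-up series and a window of 3:

```python
import pandas as pd

values = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range(start="2006-01-01", periods=5, freq="D"),
)

strict = values.rolling(window=3).mean()                   # NaN until 3 values exist
lenient = values.rolling(window=3, min_periods=1).mean()   # uses whatever is available

print(strict)
print(lenient)
```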