# Pandas: containers and basic operations

[Pandas](https://pandas.pydata.org) is an open-source library that provides data analysis and manipulation tools built on powerful data structures.

Some features:

- Tools for loading data from different file formats into in-memory objects
- __DataFrame__: an efficient object with default and custom indexing
- Integrated handling of missing data
- _Reshaping_ and _pivoting_ of data sets
- Insertion and deletion of columns in data structures
- Grouping of data for aggregation and transformation
- High-performance merging and joining of data
- Functions for handling and manipulating time series

[Pandas tutorial](https://pandas.pydata.org/docs/user_guide/10min.html)

In [1]:
import pandas as pd

pd.__version__

'2.1.1'

## Data structures

Pandas works with the following data structures:

- __Series:__ a one-dimensional labeled array of homogeneous data.
- __DataFrame:__ (a container of Series) a two-dimensional labeled structure of heterogeneous data.
- __Panel:__ (a container of DataFrames) a three-dimensional structure of heterogeneous data. `Panel` was deprecated and later removed in pandas 0.25, so it is not available in the version used here (2.1.1).

All pandas structures are value-mutable (the values they contain can be changed), but not all are size-mutable. According to the pandas documentation, the length of a Series cannot be changed, while a DataFrame is size-mutable: columns can be inserted and deleted.
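
A minimal sketch of what value and size mutability look like in practice. The names `s` and `frame` and their contents are hypothetical toy data, not part of the data set used later.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s['b'] = 20                    # value mutation: an existing value is replaced

frame = pd.DataFrame({'x': [1, 2, 3]})
frame['y'] = frame['x'] * 2    # size mutation: a new column is inserted in place
del frame['x']                 # and an existing column is removed in place

print(s.tolist())              # [1, 20, 3]
print(list(frame.columns))     # ['y']
```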

### Series

A pandas Series can be created with the constructor [`pandas.Series(data, index,...)`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html):
- `data` can take several forms (e.g., `ndarray`s, lists, dictionaries, scalar constants).
- `index` holds the axis labels. For array-like `data` it must have the same length as `data`; labels must be hashable but need not be unique. It defaults to the integers `0` through `len(data)-1`.
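
As a small sketch of the two constructor arguments, the following builds one Series that relies on the default index and one from a scalar constant, where the index determines the length. The variable names and values are hypothetical.

```python
import pandas as pd

# No index given: labels default to 0 .. len(data) - 1.
default_index = pd.Series([10, 20, 30])

# Scalar data: the value is repeated once per index label.
constant = pd.Series(5.0, index=['x', 'y', 'z'])

print(list(default_index.index))   # [0, 1, 2]
print(constant.tolist())           # [5.0, 5.0, 5.0]
```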


In [2]:
from IPython.display import Image
Image(url="./img/pandas_series.png", width=150)

In [3]:
import pandas as pd

# Create a Series from a Python list
data = [1,2,3,4]
serie1 = pd.Series(data, index=['p','q','r','s'])
serie1

p    1
q    2
r    3
s    4
dtype: int64

In [4]:
# Create a Series from a dictionary
serie2 = pd.Series({'a': 0., 'b': 1., 'c': 2.})
serie2

a    0.0
b    1.0
c    2.0
dtype: float64

In [5]:
# Missing data is filled in with NaN (Not a Number)
serie3 = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['a', 'b', 'c', 'd', 'e'])
serie3

a    0.0
b    1.0
c    2.0
d    NaN
e    NaN
dtype: float64
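
The integrated handling of missing data mentioned earlier can be sketched with three common Series methods: `isna()`, `fillna()` and `dropna()`. The Series below is rebuilt locally under the hypothetical name `s` so that the block is self-contained.

```python
import pandas as pd

# Same construction as serie3: labels 'd' and 'e' have no data and become NaN.
s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['a', 'b', 'c', 'd', 'e'])

missing_mask = s.isna()     # boolean Series, True where the value is NaN
filled = s.fillna(0.0)      # new Series with NaN replaced by 0.0
dropped = s.dropna()        # new Series without the NaN entries

print(missing_mask.tolist())   # [False, False, False, True, True]
print(list(dropped.index))     # ['a', 'b', 'c']
print(s.sum())                 # 3.0, because NaN values are skipped by default
```

None of these calls modifies `s`; each returns a new object.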

__Accessing the data in a Series__ works much like accessing the data in an `ndarray`.

In [6]:
# Get the first element
serie1.iloc[0]

1

In [7]:
# Get a slice
serie1.iloc[:3]

p    1
q    2
r    3
dtype: int64

In [8]:
# Get the last two elements
serie1.iloc[-2:]

r    3
s    4
dtype: int64

In [9]:
# Get an element by its index label
serie3['c']

2.0

In [10]:
# Get elements from a list of index labels
serie3[['a', 'c', 'b']]

a    0.0
c    2.0
b    1.0
dtype: float64
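
The cells above mix access by position and access by label. The difference matters most when the labels are themselves integers, as in this small sketch (the Series `s` is hypothetical): `.loc` always looks up labels and `.iloc` always looks up positions.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]       # the element whose label is 0
by_position = s.iloc[0]   # the element at position 0

print(by_label)      # 30
print(by_position)   # 10
```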

### DataFrame

- A [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) is a two-dimensional structure with labeled axes (rows and columns).

In [11]:
from IPython.display import Image
Image(url="./img/pandas_dataframe.png", height=150)

In [12]:
data = [[0, 4, 5], [0, 6, 7], [20, 30, 40]]
pd.DataFrame(data, index=[1,2,3], columns=['p', 'q', 'r'])

| | p | q | r |
|---|---|---|---|
| 1 | 0 | 4 | 5 |
| 2 | 0 | 6 | 7 |
| 3 | 20 | 30 | 40 |


One way to create a `DataFrame` is from a dictionary.

In [13]:
rick_and_morty = pd.DataFrame(
    {
        'nombre': ['Ricky', 'Morty'],
        'apellidos': ['Sanchez', 'Smith'],
        'edad': [60, 14]
    }
)
rick_and_morty

| | nombre | apellidos | edad |
|---|---|---|---|
| 0 | Ricky | Sanchez | 60 |
| 1 | Morty | Smith | 14 |


Importing data from a comma-separated values (CSV) file.

In [14]:
df = pd.read_csv('./data/primary_results_usa_2016.csv')
df

| | state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | AL | Autauga | 1001.0 | Democrat | Bernie Sanders | 544 | 0.182 |
| 1 | Alabama | AL | Autauga | 1001.0 | Democrat | Hillary Clinton | 2387 | 0.800 |
| 2 | Alabama | AL | Baldwin | 1003.0 | Democrat | Bernie Sanders | 2694 | 0.329 |
| 3 | Alabama | AL | Baldwin | 1003.0 | Democrat | Hillary Clinton | 5290 | 0.647 |
| 4 | Alabama | AL | Barbour | 1005.0 | Democrat | Bernie Sanders | 222 | 0.078 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24606 | Wyoming | WY | Teton-Sublette | 95600028.0 | Republican | Ted Cruz | 0 | 0.000 |
| 24607 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Donald Trump | 0 | 0.000 |
| 24608 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | John Kasich | 0 | 0.000 |
| 24609 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Marco Rubio | 0 | 0.000 |


Column descriptions:
- __state__: state
- __county__: county
- __fips__: unique code that identifies a geographic area ([more information...](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html#ti1187912100))
- __party__: party
- __votes__: votes
- __fraction_votes__: fraction of votes per county

__NOTE__: The [`pandas.read_...`](https://pandas.pydata.org/docs/reference/io.html) family of functions imports data from a variety of sources.
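
`pandas.read_csv` accepts file-like objects as well as paths, which makes it easy to try out without a file on disk. The sketch below parses a small in-memory CSV; the text in `csv_text` is hypothetical sample data, not an excerpt of the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory instead of a file.
csv_text = """state,party,votes
Alabama,Democrat,544
Alabama,Republican,2387
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (2, 3)
print(list(sample.columns))     # ['state', 'party', 'votes']
print(sample['votes'].sum())    # 2931
```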

In [15]:
# Get the first n rows
df.head(5)

| | state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | AL | Autauga | 1001.0 | Democrat | Bernie Sanders | 544 | 0.182 |
| 1 | Alabama | AL | Autauga | 1001.0 | Democrat | Hillary Clinton | 2387 | 0.8 |
| 2 | Alabama | AL | Baldwin | 1003.0 | Democrat | Bernie Sanders | 2694 | 0.329 |
| 3 | Alabama | AL | Baldwin | 1003.0 | Democrat | Hillary Clinton | 5290 | 0.647 |
| 4 | Alabama | AL | Barbour | 1005.0 | Democrat | Bernie Sanders | 222 | 0.078 |


In [16]:
# Get the last n rows
df.tail(5)

| | state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|---|
| 24606 | Wyoming | WY | Teton-Sublette | 95600028.0 | Republican | Ted Cruz | 0 | 0.0 |
| 24607 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Donald Trump | 0 | 0.0 |
| 24608 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | John Kasich | 0 | 0.0 |
| 24609 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Marco Rubio | 0 | 0.0 |
| 24610 | Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Ted Cruz | 53 | 1.0 |


In [17]:
# Check the data types of a DataFrame
df.dtypes

state                  object
state_abbreviation     object
county                 object
fips                  float64
party                  object
candidate              object
votes                   int64
fraction_votes        float64
dtype: object

In [18]:
# Get basic statistics for the numeric columns of a DataFrame
df.describe()

| | fips | votes | fraction_votes |
|---|---|---|---|
| count | 24511.0 | 24611.0 | 24611.0 |
| mean | 26671520.0 | 2306.252773 | 0.304524 |
| std | 42009780.0 | 9861.183572 | 0.231401 |
| min | 1001.0 | 0.0 | 0.0 |
| 25% | 21091.0 | 68.0 | 0.094 |
| 50% | 42081.0 | 358.0 | 0.273 |
| 75% | 90900120.0 | 1375.0 | 0.479 |
| max | 95600040.0 | 590502.0 | 1.0 |


In [19]:
# Get the index of a DataFrame
df.index

RangeIndex(start=0, stop=24611, step=1)

## Basic DataFrame operations

### Indexes and selection

In [20]:
# Change the index: use the 'state' column as row labels (returns a new DataFrame)
df2 = df.set_index('state')
df2

| state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|
| Alabama | AL | Autauga | 1001.0 | Democrat | Bernie Sanders | 544 | 0.182 |
| Alabama | AL | Autauga | 1001.0 | Democrat | Hillary Clinton | 2387 | 0.800 |
| Alabama | AL | Baldwin | 1003.0 | Democrat | Bernie Sanders | 2694 | 0.329 |
| Alabama | AL | Baldwin | 1003.0 | Democrat | Hillary Clinton | 5290 | 0.647 |
| Alabama | AL | Barbour | 1005.0 | Democrat | Bernie Sanders | 222 | 0.078 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Wyoming | WY | Teton-Sublette | 95600028.0 | Republican | Ted Cruz | 0 | 0.000 |
| Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Donald Trump | 0 | 0.000 |
| Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | John Kasich | 0 | 0.000 |
| Wyoming | WY | Uinta-Lincoln | 95600027.0 | Republican | Marco Rubio | 0 | 0.000 |


In [21]:
# Selection by index label
df2.loc['Alabama']

| state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|
| Alabama | AL | Autauga | 1001.0 | Democrat | Bernie Sanders | 544 | 0.182 |
| Alabama | AL | Autauga | 1001.0 | Democrat | Hillary Clinton | 2387 | 0.800 |
| Alabama | AL | Baldwin | 1003.0 | Democrat | Bernie Sanders | 2694 | 0.329 |
| Alabama | AL | Baldwin | 1003.0 | Democrat | Hillary Clinton | 5290 | 0.647 |
| Alabama | AL | Barbour | 1005.0 | Democrat | Bernie Sanders | 222 | 0.078 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Alabama | AL | Winston | 1133.0 | Republican | Ben Carson | 539 | 0.090 |
| Alabama | AL | Winston | 1133.0 | Republican | Donald Trump | 3352 | 0.561 |
| Alabama | AL | Winston | 1133.0 | Republican | John Kasich | 163 | 0.027 |
| Alabama | AL | Winston | 1133.0 | Republican | Marco Rubio | 708 | 0.119 |


In [22]:
# Selection by position
df2.iloc[1]

state_abbreviation                 AL
county                        Autauga
fips                           1001.0
party                        Democrat
candidate             Hillary Clinton
votes                            2387
fraction_votes                    0.8
Name: Alabama, dtype: object
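
Both `.loc` and `.iloc` also accept a second argument that selects columns. The sketch below uses a hypothetical three-row DataFrame named `toy` (its values are made up) to show row and column selection together.

```python
import pandas as pd

# Hypothetical toy data indexed by state, loosely shaped like df2.
toy = pd.DataFrame(
    {
        'state': ['Alabama', 'Alabama', 'Wyoming'],
        'county': ['Autauga', 'Baldwin', 'Uinta-Lincoln'],
        'votes': [100, 200, 50],
    }
).set_index('state')

alabama_rows = toy.loc['Alabama']            # label appears twice: a DataFrame
alabama_votes = toy.loc['Alabama', 'votes']  # row label plus column label
first_votes = toy.iloc[0, 1]                 # row position 0, column position 1

print(alabama_votes.tolist())   # [100, 200]
print(first_votes)              # 100
```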

In [23]:
# Select a single column (returns a Series)
df['county']

0               Autauga
1               Autauga
2               Baldwin
3               Baldwin
4               Barbour
              ...      
24606    Teton-Sublette
24607     Uinta-Lincoln
24608     Uinta-Lincoln
24609     Uinta-Lincoln
24610     Uinta-Lincoln
Name: county, Length: 24611, dtype: object

In [24]:
# Select two or more columns (returns a DataFrame)
df[['county', 'party']]

| | county | party |
|---|---|---|
| 0 | Autauga | Democrat |
| 1 | Autauga | Democrat |
| 2 | Baldwin | Democrat |
| 3 | Baldwin | Democrat |
| 4 | Barbour | Democrat |
| ... | ... | ... |
| 24606 | Teton-Sublette | Republican |
| 24607 | Uinta-Lincoln | Republican |
| 24608 | Uinta-Lincoln | Republican |
| 24609 | Uinta-Lincoln | Republican |


### Applying filters

In [25]:
# Apply a filter (boolean condition)
df[df['votes'] > 300000]

| | state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|---|
| 1385 | California | CA | Los Angeles | 6037.0 | Democrat | Bernie Sanders | 434656 | 0.42 |
| 1386 | California | CA | Los Angeles | 6037.0 | Democrat | Hillary Clinton | 590502 | 0.57 |
| 4450 | Illinois | IL | Chicago | 91700103.0 | Democrat | Bernie Sanders | 311225 | 0.454 |
| 4451 | Illinois | IL | Chicago | 91700103.0 | Democrat | Hillary Clinton | 366954 | 0.536 |


In [26]:
# Apply a filter that combines conditions with & (each condition in parentheses)
df[(df['county'] == 'Manhattan') & (df['party'] == 'Democrat')]

| | state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|---|
| 15011 | New York | NY | Manhattan | 36061.0 | Democrat | Bernie Sanders | 90227 | 0.337 |
| 15012 | New York | NY | Manhattan | 36061.0 | Democrat | Hillary Clinton | 177496 | 0.663 |
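
Besides `&`, boolean masks can be combined with `|` (or), inverted with `~` (not), and built from a list of allowed values with `isin()`. A minimal sketch on a hypothetical DataFrame `toy` with made-up vote counts:

```python
import pandas as pd

# Hypothetical toy data; the vote counts are made up.
toy = pd.DataFrame(
    {
        'county': ['Manhattan', 'Brooklyn', 'Chicago', 'Manhattan'],
        'party': ['Democrat', 'Democrat', 'Democrat', 'Republican'],
        'votes': [100, 200, 300, 50],
    }
)

# OR: rows from Chicago or with fewer than 100 votes.
either = toy[(toy['county'] == 'Chicago') | (toy['votes'] < 100)]

# Membership: rows whose county is in a list of values.
in_list = toy[toy['county'].isin(['Manhattan', 'Brooklyn'])]

# NOT: rows whose party is not 'Democrat'.
negated = toy[~(toy['party'] == 'Democrat')]

print(either.index.tolist())    # [2, 3]
print(in_list.index.tolist())   # [0, 1, 3]
print(negated.index.tolist())   # [3]
```

The parentheses around each condition are needed because `&`, `|` and `~` bind more tightly than comparison operators in Python.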


### Sorting values

In [27]:
# Sort the DataFrame by votes in descending order and show the top 10 rows
df.sort_values(by='votes', ascending=False).head(10)

| | state | state_abbreviation | county | fips | party | candidate | votes | fraction_votes |
|---|---|---|---|---|---|---|---|---|
| 1386 | California | CA | Los Angeles | 6037.0 | Democrat | Hillary Clinton | 590502 | 0.57 |
| 1385 | California | CA | Los Angeles | 6037.0 | Democrat | Bernie Sanders | 434656 | 0.42 |
| 4451 | Illinois | IL | Chicago | 91700103.0 | Democrat | Hillary Clinton | 366954 | 0.536 |
| 4450 | Illinois | IL | Chicago | 91700103.0 | Democrat | Bernie Sanders | 311225 | 0.454 |
| 4463 | Illinois | IL | Cook Suburbs | 91700104.0 | Democrat | Hillary Clinton | 249217 | 0.536 |
| 17309 | Pennsylvania | PA | Philadelphia | 42101.0 | Democrat | Hillary Clinton | 212785 | 0.626 |
| 4462 | Illinois | IL | Cook Suburbs | 91700104.0 | Democrat | Bernie Sanders | 212428 | 0.457 |
| 1519 | California | CA | Los Angeles | 6037.0 | Republican | Donald Trump | 179130 | 0.698 |
| 15012 | New York | NY | Manhattan | 36061.0 | Democrat | Hillary Clinton | 177496 | 0.663 |
| 14964 | New York | NY | Brooklyn | 36047.0 | Democrat | Hillary Clinton | 174236 | 0.6 |
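
`sort_values` also accepts a list of columns, with one `ascending` flag per column. The sketch below sorts a hypothetical DataFrame `toy` by state (A to Z) and then by votes (highest first); note that it returns a new object and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        'state': ['B', 'A', 'A', 'B'],
        'votes': [5, 1, 9, 7],
    }
)

ordered = toy.sort_values(by=['state', 'votes'], ascending=[True, False])

print(ordered.index.tolist())    # [2, 1, 3, 0]
print(toy.index.tolist())        # [0, 1, 2, 3]
```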


### Aggregating values

In [28]:
# Aggregate data (1): total votes per state
df.groupby(by='state')['votes'].sum() # other functions: count, mean...

state
Alabama           1223959
Alaska              22469
Arizona            834200
Arkansas           605971
California        4938197
Colorado           121184
Connecticut        531302
Delaware           160416
Florida           3940929
Georgia           2032941
Hawaii              46886
Idaho              238989
Illinois          3372537
Indiana           1719291
Iowa               326704
Kansas             111296
Kentucky           648885
Louisiana          585781
Maine                3415
Maryland          1233272
Massachusetts     1813320
Michigan          2431111
Mississippi        614200
Missouri          1531672
Montana            257516
Nebraska           213691
Nevada              86815
New Hampshire      525966
New Jersey        1321220
New Mexico         309532
New York          2686539
North Carolina    2185747
North Dakota          354
Ohio              3204172
Oklahoma           766123
Oregon             933975
Pennsylvania      3176340
Rhode Island       179594
South 

More information on the method [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).
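
Several aggregation functions can be applied in one pass with `agg()`. A minimal sketch, assuming a hypothetical DataFrame `toy` with made-up values:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        'state': ['A', 'A', 'B'],
        'votes': [10, 30, 5],
    }
)

# One row per state, one column per aggregation function.
summary = toy.groupby('state')['votes'].agg(['sum', 'mean', 'count'])

print(summary.loc['A', 'sum'])     # 40
print(summary.loc['A', 'mean'])    # 20.0
print(summary.loc['B', 'count'])   # 1
```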

In [29]:
# Aggregate data (2): total votes per state and party
df.groupby(['state','party'])['votes'].sum()

state          party     
Alabama        Democrat       386327
               Republican     837632
Alaska         Democrat          539
               Republican      21930
Arizona        Democrat       399097
                              ...   
West Virginia  Republican     188138
Wisconsin      Democrat      1000703
               Republican    1072699
Wyoming        Democrat          280
               Republican        903
Name: votes, Length: 95, dtype: int64
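
Grouping by two columns produces a Series with a two-level index, as in the output above. Two common ways to reshape such a result are `reset_index()`, which turns the index levels back into columns, and `unstack()`, which pivots one level into columns. The sketch uses a hypothetical DataFrame `toy` with made-up values.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        'state': ['A', 'A', 'B', 'B'],
        'party': ['D', 'R', 'D', 'R'],
        'votes': [1, 2, 3, 4],
    }
)

totals = toy.groupby(['state', 'party'])['votes'].sum()

flat = totals.reset_index()      # columns: state, party, votes
wide = totals.unstack('party')   # rows: state, columns: one per party

print(list(flat.columns))    # ['state', 'party', 'votes']
print(list(wide.columns))    # ['D', 'R']
print(wide.loc['B', 'R'])    # 4
```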