# Pandas
> David Quintanar Pérez

http://pandas.pydata.org/

Pandas is a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation.

It has two main data structures:
- Series
- DataFrame

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.__version__

## Series
A **one-dimensional** labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).

The axis labels are collectively called the index.

var = pd.Series(data, index=index)

In [None]:
# Create from a NumPy array
pd_np_series = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
pd_np_series

In [None]:
# Create from a dictionary
dictionary = {"b": 1, "a": 0, "c": 2}

pd_dictionary_series = pd.Series(dictionary)
pd_dictionary_series

In [None]:
# Create from a dictionary, but with a different index
pd.Series(dictionary, index=["b", "c", "d", "a"])
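
When the requested index contains a label that the dictionary lacks (here `"d"`), pandas fills that position with `NaN`, and the integer values are upcast to floats to accommodate it. A minimal sketch of that behavior; the name `reindexed` is introduced here only for illustration:

```python
import pandas as pd

dictionary = {"b": 1, "a": 0, "c": 2}

# "d" is not a key of the dictionary, so its value becomes NaN
reindexed = pd.Series(dictionary, index=["b", "c", "d", "a"])

# Boolean Series marking which labels have no value
missing = reindexed.isna()
print(reindexed)
print(missing)
```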

In [None]:
# Create from a constant (scalar) value
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

In [None]:
# Select an element by position (plain integer keys on a label index are deprecated in recent pandas versions)
print(pd_dictionary_series)
print(pd_dictionary_series.iloc[0])

In [None]:
# Select an element by label
pd_dictionary_series["b"]

In [None]:
# Check whether a label exists in the index
"b" in pd_dictionary_series

In [None]:
# Data type
pd_dictionary_series.dtype

In [None]:
# Arithmetic operations: element-wise between two Series
pd_dictionary_series + pd_dictionary_series

In [None]:
# Arithmetic operations: with a scalar
pd_dictionary_series * 3
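
Arithmetic between two Series aligns on index labels rather than on position, and a label present in only one operand yields `NaN`. A minimal sketch, where the sliced Series `partial` is a name introduced here only for illustration:

```python
import pandas as pd

s = pd.Series({"b": 1, "a": 0, "c": 2})

# Drop the first element, keeping only the labels "a" and "c"
partial = s.iloc[1:]

# Values are matched by label; "b" exists only in s, so its sum is NaN
total = s + partial
print(total)
```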

## DataFrame
A **two-dimensional** labeled data structure with **columns** of potentially different types.

Think of it as a spreadsheet, a SQL table, or a collection of Series objects.
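
The "collection of Series" view can be sketched by building a DataFrame from a dictionary of Series that share an index. The variable names below are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Two Series sharing the same index labels
names = pd.Series(["Finn", "Jake"], index=["a", "b"])
ages = pd.Series([17, 34], index=["a", "b"])

# Each dictionary key becomes a column; rows are aligned on the index
df_from_series = pd.DataFrame({"nombre": names, "edad": ages})

# Selecting a single column returns a Series again
column = df_from_series["edad"]
print(df_from_series)
print(type(column))
```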

In [None]:
d_adventure_time = {
    "nombre": ["Finn", "Jake"],
    "apodo": ["El humano", "El perro"],
    "edad": [17, 34]
}

d_adventure_time

In [None]:
# Create from a dictionary
pd.DataFrame(d_adventure_time)

In [None]:
# Create from a dictionary, but with a different index
pd.DataFrame(d_adventure_time, index=["a", "b"])

In [None]:
# Create from a dictionary, but with a different index and a selection of columns
pd.DataFrame(d_adventure_time, index=["d", "b"], columns=["nombre", "apodo"])

In [None]:
# By convention, a DataFrame is assigned to a variable named df or prefixed with df_
# e.g. df_adventure_time
df = pd.DataFrame(d_adventure_time)
df

In [None]:
# Create from a CSV file
pd.read_csv("./CSV/primary_results.csv")

> ### Exploration

In [None]:
df = pd.read_csv("./CSV/primary_results.csv")
df

In [None]:
df.shape

In [None]:
df.columns

In [None]:
# A single column is a pandas Series object
type(df.state)

In [None]:
type(df[["state", "county"]])

In [None]:
df.index

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.dtypes

In [None]:
df.describe()

> ### Selection

In [None]:
df.loc[0]

In [None]:
df2 = df.set_index("county")
df2

In [None]:
df2.loc["Los Angeles"]

In [None]:
df2.iloc[0]
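
The difference between `.loc` (label-based) and `.iloc` (position-based) can be sketched on a tiny stand-in for the election data. The DataFrame `df_small` and its values are hypothetical, made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the election data
df_small = pd.DataFrame(
    {"county": ["Manhattan", "Los Angeles", "Cook"], "votes": [100, 250, 175]}
)
df_by_county = df_small.set_index("county")

# .loc looks up the row whose index label is "Los Angeles"
by_label = df_by_county.loc["Los Angeles"]

# .iloc looks up the second row, regardless of its label
by_position = df_by_county.iloc[1]

print(by_label)
print(by_position)
```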

In [None]:
df["state"][:10]

In [None]:
df[df.votes>=590502]

In [None]:
df[(df.county=="Manhattan") & (df.party=="Democrat")]

In [None]:
df[
    (df.county=="Manhattan") &
    (df.party=="Democrat")
]

In [None]:
df.query("county=='Manhattan' and party=='Democrat'")

In [None]:
county = 'Manhattan'
df.query("county==@county and party=='Democrat'")
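
Boolean masks and `query` express the same filter. With masks, each comparison needs its own parentheses because `&` binds more tightly than `==`. A minimal sketch on hypothetical data (`df_small` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the election data
df_small = pd.DataFrame({
    "county": ["Manhattan", "Manhattan", "Cook"],
    "party": ["Democrat", "Republican", "Democrat"],
    "votes": [100, 40, 175],
})

# Boolean mask: both conditions must hold for a row to be kept
mask = (df_small["county"] == "Manhattan") & (df_small["party"] == "Democrat")
with_mask = df_small[mask]

# The same filter written as a query string
with_query = df_small.query("county == 'Manhattan' and party == 'Democrat'")

print(with_mask)
print(with_query)
```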

> ### Processing

In [None]:
df_sorted = df.sort_values(by="votes", ascending=False)
df_sorted

In [None]:
df.groupby(["state", "party"])["votes"].sum()
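
Grouping by two columns produces a Series whose index has one level per grouping column, so individual totals are addressed with a tuple. A minimal sketch on hypothetical data (`df_small` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the election data
df_small = pd.DataFrame({
    "state": ["NY", "NY", "CA", "CA"],
    "party": ["Democrat", "Republican", "Democrat", "Democrat"],
    "votes": [100, 40, 250, 50],
})

# Sum the votes within each (state, party) combination
votes_by_group = df_small.groupby(["state", "party"])["votes"].sum()

# Look up a single group with a (state, party) tuple
ca_democrat = votes_by_group.loc[("CA", "Democrat")]
print(votes_by_group)
print(ca_democrat)
```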

In [None]:
df['letra_inicial'] = df.state_abbreviation.apply(lambda s: s[0])
df.groupby("letra_inicial")["votes"].sum().sort_values()

In [None]:
# Poverty data by US county, downloaded from https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data/
df_pobreza = pd.read_csv("./CSV/PovertyEstimates.csv")
df_pobreza

In [None]:
df = df.merge(df_pobreza, left_on="fips", right_on="FIPStxt")
df
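
`merge` performs an inner join by default, so rows whose key appears in only one of the two tables are dropped, and both key columns are kept when `left_on` and `right_on` differ. A minimal sketch with hypothetical values (only the column names come from the notebook):

```python
import pandas as pd

# Hypothetical stand-ins for the two tables
df_votes = pd.DataFrame({"fips": [1, 2, 3], "votes": [100, 250, 175]})
df_poverty = pd.DataFrame(
    {"FIPStxt": [1, 2, 4], "PCTPOVALL_2015": [12.5, 18.0, 9.3]}
)

# Default how="inner": only keys present in both tables survive
merged = df_votes.merge(df_poverty, left_on="fips", right_on="FIPStxt")
print(merged)
```

Passing `how="left"` instead would keep every row of `df_votes` and fill the missing poverty values with `NaN`.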

In [None]:
county_votes = df.groupby(["county", "party"]).agg({
    "fraction_votes": "mean",
    "PCTPOVALL_2015": "mean",
})
county_votes
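
Passing a dictionary to `agg` applies one aggregation per column and returns a DataFrame indexed by the grouping keys; `reset_index` turns those keys back into ordinary columns. A minimal sketch with hypothetical values (only the column names come from the notebook):

```python
import pandas as pd

# Hypothetical stand-in for the merged table
df_small = pd.DataFrame({
    "county": ["Cook", "Cook", "Kings", "Kings"],
    "party": ["Democrat", "Democrat", "Democrat", "Republican"],
    "fraction_votes": [0.25, 0.75, 0.5, 0.5],
    "PCTPOVALL_2015": [10.0, 20.0, 30.0, 30.0],
})

# One aggregation function per column
summary = df_small.groupby(["county", "party"]).agg({
    "fraction_votes": "mean",
    "PCTPOVALL_2015": "mean",
})

# Move the grouping keys from the index back into columns
flat = summary.reset_index()
print(summary)
print(flat)
```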

> ### Export

In [None]:
d_adventure_time = {
    "nombre": ["Finn", "Jake"],
    "apodo": ["El humano", "El perro"],
    "edad": [17, 34]
}

df_adventure_time = pd.DataFrame(d_adventure_time)
df_adventure_time

In [None]:
# !pip install openpyxl
# The legacy .xls writer (xlwt) was removed in pandas 2.0; .xlsx files are written with openpyxl
df_adventure_time.to_excel("./CSV/adventure_time.xlsx", sheet_name="personajes", index=False)

In [None]:
df_adventure_time_xlsx = pd.read_excel("./CSV/adventure_time.xlsx", sheet_name="personajes")
df_adventure_time_xlsx
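
The same export and re-import round trip can also be sketched with CSV, which needs no Excel engine. Writing to an in-memory `io.StringIO` buffer instead of a file is an assumption made here to keep the example self-contained:

```python
import io

import pandas as pd

d_adventure_time = {
    "nombre": ["Finn", "Jake"],
    "apodo": ["El humano", "El perro"],
    "edad": [17, 34],
}
df_adventure_time = pd.DataFrame(d_adventure_time)

# Write the DataFrame as CSV text, omitting the index column
buffer = io.StringIO()
df_adventure_time.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)
print(df_roundtrip)
```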