# Data Manipulation and Analysis with Pandas

## What is pandas?

The name pandas comes from **pan**el **da**ta. It is one of the most widely used libraries among data scientists worldwide.

It was created in 2008 by Wes McKinney, out of a need to manipulate data in the financial markets.

## Advantages and Disadvantages

- Reduces lines of code
- Designed specifically for data analysis
- Simple, concise API
- Wide range of functions
- Compatibility with 3D arrays (NumPy)
- Slow learning curve (a disadvantage)

# Introduction

NumPy is built for analyzing numeric data stored in arrays of one or more dimensions, and those analyses can be quite complex. Pandas specializes in manipulating, cleaning, and analyzing that data, which can be of any type: numeric, text, and so on.
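The difference in how the two libraries treat data types can be seen in a minimal sketch. The variable names (`mixed`, `countries`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing numbers and text
# coerces every element to a string.
mixed = np.array([1, 2.5, "tres"])
print(mixed.dtype.kind)  # 'U' means a Unicode string dtype

# A pandas Series of text exposes string methods through the .str accessor.
countries = pd.Series(["co", "mx", "ar"])
print(countries.str.upper().tolist())
```

The `.str` accessor is one of the conveniences pandas adds on top of NumPy for non-numeric data.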

In [1]:
import numpy as np

my_list = [1,2,3,4,5]
print(my_list + [10])

my_list = np.array(my_list)
print(my_list + 10)

[1, 2, 3, 4, 5, 10]
[11 12 13 14 15]


In [2]:
my_list.shape

(5,)

In [3]:
new_list = np.array([[1,2,3,4], [6,7,8,9]])
new_list.shape

(2, 4)

## Series: Indexing and Selecting Data

In [4]:
import pandas as pd

In [5]:
sr = pd.Series([10,9,8,7,6])
sr

0    10
1     9
2     8
3     7
4     6
dtype: int64

In [6]:
sr.values

array([10,  9,  8,  7,  6])

In [7]:
sr.index

RangeIndex(start=0, stop=5, step=1)

In [8]:
sr.shape

(5,)

In [9]:
sr[3]

7

In [10]:
sr[[0,4,2]]

0    10
4     6
2     8
dtype: int64

In [11]:
sr = pd.Series([10,9,8,7,6], index=["a","b","c","d","e"])
sr

a    10
b     9
c     8
d     7
e     6
dtype: int64

In [12]:
sr["c"]

8

In [13]:
sr[["c", "a"]]

c     8
a    10
dtype: int64

In [14]:
sr["b":"e"]

b    9
c    8
d    7
e    6
dtype: int64
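Note that the label slice above includes its end label `"e"`. A short sketch contrasting label-based and position-based slicing, using the explicit `.loc` and `.iloc` accessors (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

sr = pd.Series([10, 9, 8, 7, 6], index=["a", "b", "c", "d", "e"])

# Label-based slice: both endpoints are included.
by_label = sr.loc["b":"d"]

# Position-based slice: the stop position is excluded, as in a Python list.
by_position = sr.iloc[1:3]

print(by_label.tolist())     # [9, 8, 7]
print(by_position.tolist())  # [9, 8]
```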

In [15]:
dict_data = {"CO": 100, "MX": 200, "AR": 300}
dict_data

{'CO': 100, 'MX': 200, 'AR': 300}

In [16]:
dict_data.keys()

dict_keys(['CO', 'MX', 'AR'])

In [17]:
dict_data["MX"]

200

In [18]:
pd.Series(dict_data)

CO    100
MX    200
AR    300
dtype: int64
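A Series built from a dictionary keeps some dictionary-like behavior while adding vectorized arithmetic. The sketch below uses an illustrative variable name, `sales`, that does not appear in the original notebook.

```python
import pandas as pd

sales = pd.Series({"CO": 100, "MX": 200, "AR": 300})

# Membership tests look at the index labels, not at the values.
print("MX" in sales)  # True
print(200 in sales)   # False

# .get returns a default instead of raising KeyError for a missing label.
print(sales.get("PE", 0))  # 0

# Unlike a dict, arithmetic applies to every value at once.
doubled = sales * 2
print(doubled["MX"])  # 400
```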

In [19]:
pd.Series(dict_data, index=["CO", "MX", "PE"])

CO    100.0
MX    200.0
PE      NaN
dtype: float64

In [20]:
np.nan

nan

In [21]:
np.nan + 10

nan
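The two previous results explain why the Series with a missing `"PE"` entry switched from `int64` to `float64`: `NaN` is a floating-point value. A small sketch of that behavior (the names `complete` and `with_gap` are illustrative):

```python
import numpy as np
import pandas as pd

# NaN is a Python float, so an integer Series that gains a missing
# value is stored as float64.
print(type(np.nan))

complete = pd.Series({"CO": 100, "MX": 200})
with_gap = pd.Series({"CO": 100, "MX": 200}, index=["CO", "MX", "PE"])
print(complete.dtype, with_gap.dtype)

# NaN is not equal to itself, so detect it with np.isnan or pd.isna
# rather than with ==.
print(np.nan == np.nan)        # False
print(np.isnan(np.nan + 10))   # True: NaN propagates through arithmetic
```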

In [22]:
sr = pd.Series(dict_data, index=["CO", "MX", "PE"])

In [23]:
sr.isnull()

CO    False
MX    False
PE     True
dtype: bool

In [24]:
sr.notnull()

CO     True
MX     True
PE    False
dtype: bool
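These boolean masks are mostly useful as inputs to other operations. A minimal sketch, rebuilding the same Series so the block is self-contained (`known` and `missing_count` are illustrative names):

```python
import pandas as pd

sr = pd.Series({"CO": 100, "MX": 200}, index=["CO", "MX", "PE"])

# A boolean mask can index the Series directly, keeping only known values.
known = sr[sr.notnull()]

# Booleans sum as 0/1, so summing the mask counts the missing entries.
missing_count = int(sr.isnull().sum())

print(known.index.tolist())  # ['CO', 'MX']
print(missing_count)         # 1
```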

## From Panel Data to the DataFrame

A DataFrame is a two-dimensional structure whose columns can each hold a different kind of data: text, numeric, boolean, and so on.
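As a quick illustration of that definition, the sketch below builds a small DataFrame with one column per kind of data. The `aprobado` column and the `people` variable are hypothetical and only serve this example.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "edad": [10, 9],             # numeric
        "pais": ["co", "mx"],        # text
        "aprobado": [True, False],   # boolean (hypothetical column)
    },
    index=["ana", "benito"],
)

print(people.shape)   # (2, 3): two rows, three columns
print(people.dtypes)  # one dtype per column

# Each column is a Series that shares the DataFrame's index.
col = people["edad"]
print(type(col).__name__)  # Series
```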

In [25]:
pd.__version__

'1.0.3'

In [26]:
dict_data = {"CH": [100,800,200], "CO": [100,200,300], "MX": [300,500,400]}
dict_data

{'CH': [100, 800, 200], 'CO': [100, 200, 300], 'MX': [300, 500, 400]}

In [27]:
df = pd.DataFrame(dict_data)
df

    CH   CO   MX
0  100  100  300
1  800  200  500
2  200  300  400


In [28]:
dict_data = {
    "edad": [10,9,13,14,12,11,12],
    "cm": [115,110,130,155,125,120,125],
    "pais": ["co","mx","co","mx","mx","ch","ch"],
    "genero": ["M","F","F","M","M","M","F"],
    "Q1": [5,10,8,np.nan,7,8,3],
    "Q2": [7,9,9,8,8,8,9]
}

In [29]:
df = pd.DataFrame(dict_data, index=["ana", "benito", "camilo", "daniel", "erika", "fabian", "gabriela"])
df

          edad   cm  pais  genero    Q1  Q2
ana         10  115    co       M   5.0   7
benito       9  110    mx       F  10.0   9
camilo      13  130    co       F   8.0   9
daniel      14  155    mx       M   NaN   8
erika       12  125    mx       M   7.0   8
fabian      11  120    ch       M   8.0   8
gabriela    12  125    ch       F   3.0   9


In [30]:
df.index

Index(['ana', 'benito', 'camilo', 'daniel', 'erika', 'fabian', 'gabriela'], dtype='object')

In [31]:
df.columns

Index(['edad', 'cm', 'pais', 'genero', 'Q1', 'Q2'], dtype='object')

In [32]:
df.values

array([[10, 115, 'co', 'M', 5.0, 7],
       [9, 110, 'mx', 'F', 10.0, 9],
       [13, 130, 'co', 'F', 8.0, 9],
       [14, 155, 'mx', 'M', nan, 8],
       [12, 125, 'mx', 'M', 7.0, 8],
       [11, 120, 'ch', 'M', 8.0, 8],
       [12, 125, 'ch', 'F', 3.0, 9]], dtype=object)

In [33]:
df["edad"]

ana         10
benito       9
camilo      13
daniel      14
erika       12
fabian      11
gabriela    12
Name: edad, dtype: int64

In [34]:
df[["edad", "cm", "Q1"]]

          edad   cm    Q1
ana         10  115   5.0
benito       9  110  10.0
camilo      13  130   8.0
daniel      14  155   NaN
erika       12  125   7.0
fabian      11  120   8.0
gabriela    12  125   3.0


In [35]:
df.loc["ana", ["edad", "cm", "Q1"]]

edad     10
cm      115
Q1        5
Name: ana, dtype: object

In [36]:
df.loc[["ana", "erika"], ["edad", "cm", "Q1"]]

       edad   cm   Q1
ana      10  115  5.0
erika    12  125  7.0


In [37]:
df.loc["daniel", "Q1"]

nan

In [38]:
# By position (row, column)
df.iloc[2, 1]

130

In [39]:
df.iloc[2, [1, 3]]

cm        130
genero      F
Name: camilo, dtype: object

In [40]:
df.iloc[[2,4,5], [1, 3]]

         cm  genero
camilo  130       F
erika   125       M
fabian  120       M


In [41]:
df.iloc[:, [1, 3]]

           cm  genero
ana       115       M
benito    110       F
camilo    130       F
daniel    155       M
erika     125       M
fabian    120       M
gabriela  125       F
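The cells above use `.loc` for labels and `.iloc` for integer positions. A short sketch, on a reduced copy of the data, showing that both reach the same cell and how a label can be translated into a position with `Index.get_loc`:

```python
import pandas as pd

df = pd.DataFrame(
    {"edad": [10, 9, 13], "cm": [115, 110, 130], "genero": ["M", "F", "F"]},
    index=["ana", "benito", "camilo"],
)

# Same cell, addressed by label and by position.
by_label = df.loc["camilo", "cm"]
by_position = df.iloc[2, 1]
print(by_label == by_position)  # True

# get_loc translates a label into its integer position, which helps
# when mixing both styles.
row_pos = df.index.get_loc("camilo")
col_pos = df.columns.get_loc("cm")
print(row_pos, col_pos)  # 2 1
```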


## Conditions

In [42]:
# Everyone whose age is 12 or older
df["edad"] >= 12

ana         False
benito      False
camilo       True
daniel       True
erika        True
fabian      False
gabriela     True
Name: edad, dtype: bool

In [43]:
df[df["edad"] >= 12]

          edad   cm  pais  genero   Q1  Q2
camilo      13  130    co       F  8.0   9
daniel      14  155    mx       M  NaN   8
erika       12  125    mx       M  7.0   8
gabriela    12  125    ch       F  3.0   9


In [44]:
# When combining conditions, wrap each one in parentheses
df[(df["edad"] >= 12) & (df["pais"] == "mx")]

        edad   cm  pais  genero   Q1  Q2
daniel    14  155    mx       M  NaN   8
erika     12  125    mx       M  7.0   8
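The parentheses are needed because `&` binds more tightly than comparison operators such as `>=`. One way to sidestep the issue is to name each mask first, as in this sketch on a reduced copy of the data (`older` and `from_mx` are illustrative names). It also shows `|` (or) and `~` (not).

```python
import pandas as pd

df = pd.DataFrame(
    {"edad": [10, 9, 13, 14], "pais": ["co", "mx", "co", "mx"]},
    index=["ana", "benito", "camilo", "daniel"],
)

older = df["edad"] >= 12
from_mx = df["pais"] == "mx"

print(df[older & from_mx].index.tolist())  # ['daniel']
print(df[older | from_mx].index.tolist())  # ['benito', 'camilo', 'daniel']
print(df[~older].index.tolist())           # ['ana', 'benito']
```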


In [45]:
df.query("edad > 12")

        edad   cm  pais  genero   Q1  Q2
camilo    13  130    co       F  8.0   9
daniel    14  155    mx       M  NaN   8
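`DataFrame.query` also accepts combined conditions and can read Python variables prefixed with `@`. A minimal sketch on a reduced copy of the data (the variable `min_age` is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"edad": [10, 9, 13, 14], "pais": ["co", "mx", "co", "mx"]},
    index=["ana", "benito", "camilo", "daniel"],
)

# Inside the query string, `and` / `or` replace `&` / `|`.
both = df.query("edad >= 12 and pais == 'mx'")
print(both.index.tolist())  # ['daniel']

# A local variable is referenced with the @ prefix.
min_age = 12
older = df.query("edad >= @min_age")
print(older.index.tolist())  # ['camilo', 'daniel']
```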


In [46]:
df['Q2'] > df["Q1"]

ana          True
benito      False
camilo       True
daniel      False
erika        True
fabian      False
gabriela     True
dtype: bool

In [47]:
df[df['Q2'] > df["Q1"]]

          edad   cm  pais  genero   Q1  Q2
ana         10  115    co       M  5.0   7
camilo      13  130    co       F  8.0   9
erika       12  125    mx       M  7.0   8
gabriela    12  125    ch       F  3.0   9
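Note that `daniel` is absent from this result because his `Q1` is `NaN`, and any comparison against `NaN` evaluates to `False`. A sketch on a reduced copy of the data showing that such a row falls out of both a condition and its apparent opposite (`improved` and `not_improved` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Q1": [5, np.nan, 7], "Q2": [7, 8, 8]},
    index=["ana", "daniel", "erika"],
)

improved = df["Q2"] > df["Q1"]
not_improved = df["Q2"] <= df["Q1"]

# daniel is False in both masks because his Q1 is missing.
print(improved.tolist())      # [True, False, True]
print(not_improved.tolist())  # [False, False, False]

# Checking for missing values first makes the gap explicit.
print(df[df["Q1"].isnull()].index.tolist())  # ['daniel']
```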
