# **Pandas**

pandas is an open-source Python library for specialized data analysis. It is currently the standard reference for professionals who use Python for statistical analysis and decision-making.

In [22]:
import numpy as np 
import pandas as pd 

In [23]:
s = pd.Series([12, -4, 7,9])
s

0    12
1    -4
2     7
3     9
dtype: int64

In [24]:
a = np.random.randint(0,20,15)
a

array([11,  2,  4, 18,  0,  1, 17, 16,  5, 14,  6, 18, 11, 14,  3])

In [25]:
t = pd.Series(a)
t

0     11
1      2
2      4
3     18
4      0
5      1
6     17
7     16
8      5
9     14
10     6
11    18
12    11
13    14
14     3
dtype: int64

In [26]:
s = pd.Series([12, -4, 7,9], index= ['a', 'b', 'c', 'd'])
s


a    12
b    -4
c     7
d     9
dtype: int64

In [27]:
s.values, s.index

(array([12, -4,  7,  9]), Index(['a', 'b', 'c', 'd'], dtype='object'))

## **Selecting internal elements**

In [28]:
s.iloc[2]  # positional access; s[2] on a label index is deprecated in recent pandas

7

In [29]:
s['b']

-4

In [30]:
s[0:2]

a    12
b    -4
dtype: int64

In [31]:
s.iloc[[0, 2]], s[['a', 'c']]  # by position with .iloc, by label with []

(a    12
 c     7
 dtype: int64,
 a    12
 c     7
 dtype: int64)

## **Assigning values to elements**

In [32]:
s.iloc[1] = 0  # assign by position
s

a    12
b     0
c     7
d     9
dtype: int64

In [33]:
s['b'] = 1
s

a    12
b     1
c     7
d     9
dtype: int64

## **Defining a Series from an array and from other Series**

In [34]:
arr = np.arange(1,10,2)
s3 = pd.Series(arr)
s3

0    1
1    3
2    5
3    7
4    9
dtype: int64

In [35]:
s4 = pd.Series(s)
s4

a    12
b     1
c     7
d     9
dtype: int64

## **Operations and mathematical functions**

Arithmetic operators (+, -, *, /) and NumPy's mathematical functions can be applied directly to a Series. They act element by element and keep the index.

In [36]:
s/2

a    6.0
b    0.5
c    3.5
d    4.5
dtype: float64

In [37]:
np.log(s)

a    2.484907
b    0.000000
c    1.945910
d    2.197225
dtype: float64
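
As a minimal sketch of the point above, the following cell applies an arithmetic operator and a NumPy function to a small Series and checks that the index labels survive. The names `prices`, `halved` and `rooted` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; labels and values are illustrative only.
prices = pd.Series([12, 1, 7, 9], index=['a', 'b', 'c', 'd'])

halved = prices / 2        # element-wise division, result is float64
rooted = np.sqrt(prices)   # a NumPy function applied to the whole Series

# Both results are Series that keep the original index labels.
print(halved)
print(rooted.index.tolist())
```

Because the result is again a Series, these operations can be chained or combined with the selection techniques shown earlier.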

## **Filtering values**

Because pandas is built on NumPy and uses it for its underlying data structures, many operations that apply to NumPy arrays also extend to Series. One of them is filtering values with a condition.

In [38]:
s[s>8]

a    12
d     9
dtype: int64
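
The filter above can be read as two steps, sketched here under the assumption of a small illustrative Series: the comparison first produces a boolean Series, and indexing with that mask keeps only the elements where it is True. The names `values`, `mask`, `filtered` and `in_range` are hypothetical.

```python
import pandas as pd

values = pd.Series([12, 1, 7, 9], index=['a', 'b', 'c', 'd'])

mask = values > 8         # boolean Series on the same index
filtered = values[mask]   # keeps only the elements where mask is True

# Conditions can be combined with & and |, each wrapped in parentheses.
in_range = values[(values > 1) & (values < 12)]

print(mask.tolist())
print(filtered)
print(in_range)
```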

In [39]:
colors = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
colors

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [40]:
colors.unique()

array([1, 0, 2, 3])

In [41]:
colors.value_counts()

1    2
2    2
0    1
3    1
dtype: int64

In [42]:
colors.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

In [43]:
colors[colors.isin([0,3])]

white     0
yellow    3
dtype: int64

## **NaN values**

In [44]:
s2 = pd.Series([5, -3, np.nan, 14])  # np.nan: the np.NaN alias was removed in NumPy 2.0
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [45]:
s2[s2.isnull()]

2   NaN
dtype: float64

In [46]:
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64

## **Series and dictionaries**

In [47]:
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500,
'orange': 1000}

In [48]:
s5 = pd.Series(mydict)
s5

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

In [49]:
colors = ['red','yellow','orange','blue','green']
myseries = pd.Series(mydict, index=colors)
myseries

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

## **DataFrame**

The DataFrame is a tabular data structure very similar to a spreadsheet. It is designed to extend the Series to multiple dimensions.

In [50]:
data = {'color' : ['blue','green','yellow','red','white'],       
        'object' : ['ball','pen','pencil','paper','mug'],
        'price' : [1.2,1.0,0.6,0.9,1.7]}

In [51]:
frame = pd.DataFrame(data)
frame

    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    0.6
3     red   paper    0.9
4   white     mug    1.7
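
One way to see how a DataFrame extends the Series is to pull a single column out of it. The sketch below uses a hypothetical, shortened version of the data (`table`, `column`) and checks that the column is a Series sharing the row index of the frame.

```python
import pandas as pd

# Illustrative data, a shortened variant of the dictionary used above.
data = {'color': ['blue', 'green', 'yellow'],
        'price': [1.2, 1.0, 0.6]}
table = pd.DataFrame(data)

column = table['price']   # selecting one column returns a Series

print(type(column).__name__)
print(column.index.equals(table.index))   # same row index as the frame
print(table.shape)                        # (rows, columns)
```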


In [52]:
frame2 = pd.DataFrame(data, columns= ['object', 'price'])
frame2

   object  price
0    ball    1.2
1     pen    1.0
2  pencil    0.6
3   paper    0.9
4     mug    1.7


In [53]:
frame2 = pd.DataFrame(data, index=['one','two','three','four','five'])
frame2

        color  object  price
one      blue    ball    1.2
two     green     pen    1.0
three  yellow  pencil    0.6
four      red   paper    0.9
five    white     mug    1.7


In [54]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])
frame3

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


## **Selecting elements**

In [55]:
frame.columns

Index(['color', 'object', 'price'], dtype='object')

In [56]:
frame.index

RangeIndex(start=0, stop=5, step=1)

In [57]:
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6],
       ['red', 'paper', 0.9],
       ['white', 'mug', 1.7]], dtype=object)

In [58]:
frame['price']

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

In [59]:
frame.loc[2]

color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

In [60]:
frame.loc[[2,4]]

    color  object  price
2  yellow  pencil    0.6
4   white     mug    1.7


In [61]:
frame[0:1]

   color  object  price
0   blue    ball    1.2


In [62]:
frame[1:3]

    color  object  price
1   green     pen    1.0
2  yellow  pencil    0.6


In [63]:
frame['object'][3]

'paper'

In [64]:
frame.index.name = 'id'
frame.columns.name = 'item'
frame

item   color  object  price
id
0       blue    ball    1.2
1      green     pen    1.0
2     yellow  pencil    0.6
3        red   paper    0.9
4      white     mug    1.7


In [65]:
frame['new'] = 12
frame

item   color  object  price  new
id
0       blue    ball    1.2   12
1      green     pen    1.0   12
2     yellow  pencil    0.6   12
3        red   paper    0.9   12
4      white     mug    1.7   12


In [66]:
frame['new'] = [3.0,1.3,2.2,0.8,1.1]
frame

item   color  object  price  new
id
0       blue    ball    1.2  3.0
1      green     pen    1.0  1.3
2     yellow  pencil    0.6  2.2
3        red   paper    0.9  0.8
4      white     mug    1.7  1.1


In [67]:
ser = pd.Series(np.arange(5))
ser

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [68]:
frame['new'] = ser
frame

item   color  object  price  new
id
0       blue    ball    1.2    0
1      green     pen    1.0    1
2     yellow  pencil    0.6    2
3        red   paper    0.9    3
4      white     mug    1.7    4


In [69]:
frame.loc[2, 'price'] = 3.3  # .loc sets the value on the frame itself and avoids chained assignment
frame

item   color  object  price  new
id
0       blue    ball    1.2    0
1      green     pen    1.0    1
2     yellow  pencil    3.3    2
3        red   paper    0.9    3
4      white     mug    1.7    4


In [70]:
frame.isin([1.0,'pen'])

item  color  object  price    new
id
0     False   False  False  False
1     False    True   True   True
2     False   False  False  False
3     False   False  False  False
4     False   False  False  False


In [71]:
frame[frame.isin([1.0,'pen'])]

item  color  object  price  new
id
0       NaN     NaN    NaN  NaN
1       NaN     pen    1.0  1.0
2       NaN     NaN    NaN  NaN
3       NaN     NaN    NaN  NaN
4       NaN     NaN    NaN  NaN


### **Deleting a column**

In [72]:
del frame['new']
frame

item   color  object  price
id
0       blue    ball    1.2
1      green     pen    1.0
2     yellow  pencil    3.3
3        red   paper    0.9
4      white     mug    1.7


### **Filtering**

In [73]:
frame[frame['price'] < 1.2]

item  color  object  price
id
1     green     pen    1.0
3       red   paper    0.9


### **DataFrame from a nested dict**

In [74]:
nestdict = { 'red': {2012: 22, 2013: 33},
             'white': {2011: 13, 2012: 22, 2013: 16},
             'blue': {2011: 17, 2012: 27, 2013: 18} }
nestdict

{'red': {2012: 22, 2013: 33},
 'white': {2011: 13, 2012: 22, 2013: 16},
 'blue': {2011: 17, 2012: 27, 2013: 18}}

In [75]:
frame2 = pd.DataFrame(nestdict)
frame2

       red  white  blue
2012  22.0     22    27
2013  33.0     16    18
2011   NaN     13    17


### **Transposing a DataFrame**

In [76]:
frame2.T

       2012  2013  2011
red    22.0  33.0   NaN
white  22.0  16.0  13.0
blue   27.0  18.0  17.0


### **The Index object**

In [77]:
ser = pd.Series([5,0,3,8,4], index = ['red', 'blue', 'yellow', 'white', 'green'])
ser.index

Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')

### **Methods on Index**

In [78]:
ser.idxmin()

'blue'

In [79]:
ser.idxmax()

'white'

### **Index with duplicate labels**

In [80]:
serd = pd.Series(range(6), index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
serd

white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [81]:
serd['white']

white    0
white    1
dtype: int64

With small data structures it is easy to spot duplicate index labels, but as the structure grows this becomes harder. For this case, the Index object in pandas has an attribute called is_unique, a boolean that is True only when no label is repeated.

In [82]:
serd.index.is_unique

False

In [83]:
frame.index.is_unique

True
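
Knowing that an index is not unique usually leads to the next question: which labels repeat? The following sketch, which rebuilds the `serd` Series from above, uses `Index.duplicated()` to list them and then shows one possible way (an assumption, not something the notebook prescribes) to collapse the duplicates by summing per label. The names `dup_mask`, `repeated_labels` and `collapsed` are hypothetical.

```python
import pandas as pd

serd = pd.Series(range(6),
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

print(serd.index.is_unique)

# Boolean array marking every occurrence of a label after its first one.
dup_mask = serd.index.duplicated()
repeated_labels = serd.index[dup_mask].unique().tolist()
print(repeated_labels)

# One possible way to collapse duplicates: aggregate the values per label.
collapsed = serd.groupby(level=0).sum()
print(collapsed)
```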

## **Other index functionality**

This section looks in more detail at a few basic operations for manipulating Series and DataFrames through their index, which are valuable because they are so easy to use.

* Reindexing
* Dropping
* Aligning

### **Reindexing**

In [84]:
ser = pd.Series([2,5,7,4], index = ['one', 'two', 'three', 'four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In [85]:
ser.reindex(['three', 'four', 'five', 'one'])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64

In [86]:
ser3 = pd.Series([1,5,6,3], index= [0,3,5,6])

In [87]:
ser3.reindex(range(6), method = 'ffill' )

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

In [88]:
ser3.reindex(range(6), method = 'bfill' )

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64

In [89]:
frame.reindex(range(5), method = 'ffill', columns = ['colors', 'price', 'new', 'object'])

item  colors  price     new  object
id
0       blue    1.2    blue    ball
1      green    1.0   green     pen
2     yellow    3.3  yellow  pencil
3        red    0.9     red   paper
4      white    1.7   white     mug


### **Dropping**

In [90]:
ser = pd.Series(np.arange(4.), index = ['red', 'blue', 'yellow', 'white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [91]:
ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64

In [92]:
ser.drop(['yellow', 'blue'])

red      0.0
white    3.0
dtype: float64

In [100]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index = ['red', 'blue', 'yellow', 'white'], 
                    columns = ['ball', 'pen', 'pencil', 'paper'])
frame

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [102]:
frame.drop(['blue', 'yellow'])

       ball  pen  pencil  paper
red       0    1       2      3
white    12   13      14     15


To delete columns you still pass the column labels, but you must also specify the axis along which the elements are removed. The axis option does this (axis = 1 refers to columns).

In [103]:
frame.drop(['pen', 'pencil'], axis = 1)

        ball  paper
red        0      3
blue       4      7
yellow     8     11
white     12     15
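
As a small sketch of the axis option described above, the cell below drops the same two columns in two equivalent ways, with `axis=1` and with the `columns` keyword of `DataFrame.drop`, and confirms that the original frame is left unchanged. The names `by_axis` and `by_keyword` are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

by_axis = frame.drop(['pen', 'pencil'], axis=1)       # labels plus axis
by_keyword = frame.drop(columns=['pen', 'pencil'])    # same result, explicit keyword

print(by_axis.equals(by_keyword))
print(list(frame.columns))   # drop returns a new object; frame keeps all columns
```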


## **Arithmetic and data alignment**

In [104]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])


In [105]:
s1+s2

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

In [108]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                      index=['blue','green','white','yellow'],
                      columns=['mug','pen','ball'])


In [109]:
frame1+frame2

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN


## **Operations between data structures**

### **Flexible arithmetic methods**
The operations shown above can also be performed with methods known as flexible arithmetic methods, such as add(), sub(), mul() and div().

In [110]:
frame1.add(frame2)

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN
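
One reason to prefer the method form over the operator is that it accepts extra options. The sketch below rebuilds `frame1` and `frame2` and passes `fill_value=0` to `add`, so a cell present in only one frame is treated as if the other frame held 0 there; cells missing from both frames still come out as NaN. The name `total` is hypothetical.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# Same alignment as frame1 + frame2, but a value missing on one side counts as 0.
total = frame1.add(frame2, fill_value=0)

print(total.loc['blue', 'ball'])     # present in both frames
print(total.loc['blue', 'paper'])    # present only in frame1
print(total.loc['green', 'paper'])   # missing from both frames
```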


## **Statistical functions**

In [111]:
frame.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [112]:
frame.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [113]:
frame.describe()

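            ball        pen     pencil      paper
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000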


# **References**

[1]. https://es.wikipedia.org/wiki/NumPy

[2]. Unpingco J. Python for Probability, Statistics, and Machine Learning. Second Edition. Springer.

[3]. Nelli F. Python Data Analytics: With Pandas, NumPy, and Matplotlib. Second Edition. Apress.