<center>
  <h1>The PyData Toolbox</h1>
  <h3>Scott Sanderson (Twitter: @scottbsanderson, GitHub: ssanderson)</h3>
  <h3><a href="https://github.com/ssanderson/pydata-toolbox">https://github.com/ssanderson/pydata-toolbox</a></h3>
</center>

In [1]:
pip install -U fortran-magic

Requirement already up-to-date: fortran-magic in /usr/local/lib/python3.6/dist-packages (0.7)


In [2]:
%reload_ext fortranmagic



In [3]:
import py_compile
import numpy as np


# About Me:

- Senior Engineer at [Quantopian](https://www.quantopian.com)
- Background in Mathematics and Philosophy
- **Twitter:** [@scottbsanderson](https://twitter.com/scottbsanderson)
- **GitHub:** [ssanderson](https://github.com/ssanderson)

## Outline

- Built-in Data Structures
- Numpy `array`
- Pandas `Series`/`DataFrame`
- Plotting and "Real-World" Analyses

# Data Structures

> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms
will almost always be self-evident. Data structures, not algorithms, are central to programming.

- *Notes on Programming in C*, by Rob Pike.

# Lists

In [4]:
l = [1, 'two', 3.0, 4, 5.0, "six"]
l

[1, 'two', 3.0, 4, 5.0, 'six']

In [5]:
J = [2, True, 'strings', 3.141592, 'b']  # Replicated example
J

[2, True, 'strings', 3.141592, 'b']

In [6]:
# Lists can be indexed like C-style arrays.
first = l[0]
second = l[1]
print("first:", first)
print("second:", second)

first: 1
second: two


In [7]:
# Replicated example: lists are indexed like C/C++ arrays.
primer = J[0]
ultimo = J[len(J)-1]
print("First element:", primer)
print("Last element:", ultimo)

First element: 2
Last element: b


In [8]:
# Negative indexing gives elements relative to the end of the list.
last = l[-1]
penultimate = l[-2]
print("last:", last)
print("second to last:", penultimate)

last: six
second to last: 5.0


In [9]:
# Replicated example: negative indices run from -1 down to -len(list).
ultimo = J[-1]
ante_p_ultimo=J[-2]
print("Last element:", ultimo)
print("Second to last element:", ante_p_ultimo)

Last element: b
Second to last element: 3.141592


In [10]:
# Lists can also be sliced, which makes a copy of elements between 
# start (inclusive) and stop (exclusive)
sublist = l[1:3]
sublist

['two', 3.0]

In [11]:
# Lists can be sliced in three ways.
J_slt = J[2:4]
J_slt
# Elements at indices 2 and 3 of the list.

['strings', 3.141592]

In [12]:
# l[:N] is equivalent to l[0:N].
first_three = l[:3]
first_three

[1, 'two', 3.0]

In [13]:
# The second way is to slice up to (but not including) a given index.
J_xel = J[:4]
J_xel

[2, True, 'strings', 3.141592]

In [14]:
# l[3:] is equivalent to l[3:len(l)].
after_three = l[3:]
after_three

[4, 5.0, 'six']

In [15]:
# The third way is to slice from a given index to the end of the list.
J_aftx = J[2:]
J_aftx

['strings', 3.141592, 'b']

In [16]:
# There's also a third parameter, "step", which gets every Nth element.
l = ['a', 'b', 'c', 'd', 'e', 'f', 'g','h']
l[1:7:2]

['b', 'd', 'f']

In [17]:
# A third parameter sets the step, taking every Nth element.
J = [0,1,2,3,4,5,6,7,8,9,10]
J[0:9:2]

[0, 2, 4, 6, 8]

In [18]:
# This is a cute way to reverse a list.
l[::-1]

['h', 'g', 'f', 'e', 'd', 'c', 'b', 'a']

In [19]:
# Read a list in reverse.
J[::-1]

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [20]:
# Lists can be grown efficiently (in O(1) amortized time).
l = [1, 2, 3, 4, 5]
print("Before:", l)
l.append('six')
print("After:", l)

Before: [1, 2, 3, 4, 5]
After: [1, 2, 3, 4, 5, 'six']


In [21]:
# Lists grow efficiently: appending at the end is O(1) amortized,
# but inserting at the front is costly.
k = ['a','e','i','o','u'] 
print("Before:",k)
k.append('y')
print("After:",k)

Before: ['a', 'e', 'i', 'o', 'u']
After: ['a', 'e', 'i', 'o', 'u', 'y']


In [22]:
# Comprehensions let us perform elementwise computations.
l = [1, 2, 3, 4, 5]
[x * 2 for x in l]

[2, 4, 6, 8, 10]

In [23]:
# Expressions can be applied to each element of a list.
J = [0,1,2,3,4,5,6,7,8,9,10]
[i**2 for i in J]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

## Review: Python Lists

- Zero-indexed sequence of arbitrary Python values.
- Slicing syntax: `l[start:stop:step]` copies elements at regular intervals from `start` to `stop`.
- Efficient (`O(1)`) appends and removes from end.
- Comprehension syntax: `[f(x) for x in l if cond(x)]`.
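
The review points above can be checked in a few lines. This is only a small sketch; the names `letters`, `head`, and `evens_squared` are illustrative and do not come from the original notebook.

```python
# Illustrative sketch; the variable names are assumptions, not from the talk.
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# Slicing copies: mutating the slice leaves the original list untouched.
head = letters[:3]
head[0] = 'z'

# A step of 2 picks every other element between start and stop.
every_other = letters[1:6:2]

# Comprehension with a filter: square only the even numbers.
evens_squared = [x ** 2 for x in range(10) if x % 2 == 0]

print(letters)        # ['a', 'b', 'c', 'd', 'e', 'f']
print(head)           # ['z', 'b', 'c']
print(every_other)    # ['b', 'd', 'f']
print(evens_squared)  # [0, 4, 16, 36, 64]
```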

# Dictionaries

In [24]:
# Dictionaries are key-value mappings.
philosophers = {'David': 'Hume', 'Immanuel': 'Kant', 'Bertrand': 'Russell'}
philosophers

{'Bertrand': 'Russell', 'David': 'Hume', 'Immanuel': 'Kant'}

In [25]:
# Dictionaries map each unique key to a value.
notas = {'Fisica':4.2, 'Quimica':3.9, 'Calculo 1':4.5, 'Filosofia':3.2}
notas

{'Calculo 1': 4.5, 'Filosofia': 3.2, 'Fisica': 4.2, 'Quimica': 3.9}

In [26]:
# Like lists, dictionaries are size-mutable.
philosophers['Ludwig'] = 'Wittgenstein'
philosophers

{'Bertrand': 'Russell',
 'David': 'Hume',
 'Immanuel': 'Kant',
 'Ludwig': 'Wittgenstein'}

In [27]:
# They are mutable: entries can be added or removed.
notas['Arte'] = 4.0
notas

{'Arte': 4.0,
 'Calculo 1': 4.5,
 'Filosofia': 3.2,
 'Fisica': 4.2,
 'Quimica': 3.9}

In [28]:
del philosophers['David']
philosophers

{'Bertrand': 'Russell', 'Immanuel': 'Kant', 'Ludwig': 'Wittgenstein'}

In [29]:
del notas['Filosofia']
notas

{'Arte': 4.0, 'Calculo 1': 4.5, 'Fisica': 4.2, 'Quimica': 3.9}

In [30]:
# No slicing.
#philosophers['Bertrand':'Immanuel']

In [31]:
# Dictionaries cannot be sliced the way lists can.

## Review: Python Dictionaries

- Unordered key-value mapping from (almost) arbitrary keys to arbitrary values.
- Efficient (`O(1)`) lookup, insertion, and deletion.
- No slicing (would require a notion of order).
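
A short sketch of what these points look like in practice. The `points` dictionary is a made-up example, and the exact exception raised by the slicing attempt depends on the Python version (a `TypeError` on older interpreters, a `KeyError` from 3.12 on, where slice objects became hashable), so the code catches both.

```python
philosophers = {'David': 'Hume', 'Immanuel': 'Kant', 'Bertrand': 'Russell'}

# Membership tests and lookups are hash-based, so they do not scan the dict.
has_kant = 'Immanuel' in philosophers
missing = philosophers.get('Ludwig', 'unknown')  # default instead of KeyError

# "Almost arbitrary" keys: they must be hashable. A tuple works, a list does not.
points = {(0, 0): 'origin'}  # hypothetical example data
try:
    points[[1, 1]] = 'bad key'
    list_key_error = None
except TypeError as exc:
    list_key_error = type(exc).__name__

# Slicing is not supported; the exception type varies by Python version.
try:
    philosophers['Bertrand':'Immanuel']
    slice_error = None
except (TypeError, KeyError) as exc:
    slice_error = type(exc).__name__

print(has_kant, missing, list_key_error)  # True unknown TypeError
print(slice_error)
```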

In [32]:
# Suppose we have some matrices...
a = [[1, 2, 3],
     [2, 3, 4],
     [5, 6, 7],
     [1, 1, 1]]

b = [[1, 2, 3, 4],
     [2, 3, 4, 5]]

In [33]:
A = [[0,1,1],
     [1,2,5],
     [6,0,9]]

B = [[1,4,7],
     [2,3,9],
     [0,1,2]]  


In [34]:
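def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows).

    Assumes len(A[0]) == len(B); the shapes are not checked.
    """
    rows_out = len(A)
    cols_out = len(B[0])
    # Start from a rows_out x cols_out matrix of zeros.
    out = [[0 for col in range(cols_out)] for row in range(rows_out)]
    
    for i in range(rows_out):
        for j in range(cols_out):
            # out[i][j] is the dot product of row i of A and column j of B.
            for k in range(len(B)):
                out[i][j] += A[i][k] * B[k][j]
    return out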

In [35]:
%%time

matmul(a, b)

CPU times: user 35 µs, sys: 0 ns, total: 35 µs
Wall time: 40.3 µs


[[5, 8, 11, 14], [8, 13, 18, 23], [17, 28, 39, 50], [3, 5, 7, 9]]

In [36]:
%%time
matmul(A,B)

CPU times: user 23 µs, sys: 5 µs, total: 28 µs
Wall time: 31.7 µs


[[2, 4, 11], [5, 15, 35], [6, 33, 60]]
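
One thing the timings above do not reveal: `a` is 4 x 3 and `b` is 2 x 4, so their inner dimensions disagree, and the loop-based `matmul` quietly uses only the first two columns of `a`. As a hedged illustration of why a shape check matters, NumPy's `@` operator rejects the same inputs:

```python
import numpy as np

a = [[1, 2, 3],
     [2, 3, 4],
     [5, 6, 7],
     [1, 1, 1]]          # 4 x 3

b = [[1, 2, 3, 4],
     [2, 3, 4, 5]]       # 2 x 4

# The inner dimensions (3 and 2) differ, so the product is undefined.
try:
    np.array(a) @ np.array(b)
    shape_error = None
except ValueError as exc:
    shape_error = str(exc)

print(shape_error is not None)  # True
```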

In [37]:
import random
def random_matrix(m, n):
    out = []
    for row in range(m):
        out.append([random.random() for _ in range(n)])
    return out

randm = random_matrix(2, 3)
randm

[[0.02166869197659893, 0.5747056189626042, 0.6691095582386547],
 [0.8711575336112014, 0.33784968481166255, 0.7606802732871605]]

In [38]:
rdm_A = random_matrix(30,20)
rdm_A

[[0.2187593464100639,
  0.6059548791937581,
  0.5228231909458115,
  ...],
 ...]

In [39]:
%%time
randa = random_matrix(600, 100)
randb = random_matrix(100, 600)
x = matmul(randa, randb)

CPU times: user 6.74 s, sys: 22.6 ms, total: 6.76 s
Wall time: 6.82 s


In [40]:
%%time
rdm_A = random_matrix(30,20)
rdm_B = random_matrix(20,30)
y = matmul(rdm_A,rdm_B)

CPU times: user 5.77 ms, sys: 34 µs, total: 5.81 ms
Wall time: 7.27 ms


In [41]:
# Maybe that's not that bad?  Let's try a simpler case.
def python_dot_product(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

In [42]:
%%fortran
subroutine fortran_dot_product(xs, ys, result)
    double precision, intent(in) :: xs(:)
    double precision, intent(in) :: ys(:)
    double precision, intent(out) :: result
    
    result = sum(xs * ys)
end

In [43]:
list_data = [float(i) for i in range(100000)]
array_data = np.array(list_data)

In [44]:
ld_B = [float(j) for j in range(200000)]
ad_B = np.array(ld_B)

In [45]:
%%time
python_dot_product(list_data, list_data)

CPU times: user 9.19 ms, sys: 10 µs, total: 9.2 ms
Wall time: 9.5 ms


333328333350000.0

In [46]:
%%time
python_dot_product(ld_B,ld_B)

CPU times: user 20.6 ms, sys: 0 ns, total: 20.6 ms
Wall time: 21.9 ms


2666646666700000.0

In [47]:
%%time
fortran_dot_product(array_data, array_data)

CPU times: user 199 µs, sys: 9 µs, total: 208 µs
Wall time: 219 µs


333328333350000.0

In [48]:
%%time
fortran_dot_product(ad_B, ad_B)

CPU times: user 382 µs, sys: 17 µs, total: 399 µs
Wall time: 409 µs


2666646666700000.0

<center><img src="images/sloth.gif" alt="Drawing" style="width: 1080px;"/></center>


## Why is the Python Version so Much Slower?

In [49]:
# Dynamic typing.
def mul_elemwise(xs, ys):
    return [x * y for x, y in zip(xs, ys)]

mul_elemwise([1, 2, 3, 4], [1, 2 + 0j, 3.0, 'four'])
#[type(x) for x in _]

[1, (4+0j), 9.0, 'fourfourfourfour']

In [50]:
mul_elemwise([1, 2.25, 78.1, 2], [12, 2 + 3j, 7.859, 'number'])
#[type(x) for x in _]

[12, (4.5+6.75j), 613.7878999999999, 'numbernumber']

In [51]:
# Interpretation overhead.
source_code = 'a + b * c'
bytecode = compile(source_code, '', 'eval')
import dis; dis.dis(bytecode)

  1           0 LOAD_NAME                0 (a)
              2 LOAD_NAME                1 (b)
              4 LOAD_NAME                2 (c)
              6 BINARY_MULTIPLY
              8 BINARY_ADD
             10 RETURN_VALUE


In [52]:
sc_B = 'a*b+ (c**d)/e'
bytecode = compile(sc_B, '', 'eval')
dis.dis(bytecode)

  1           0 LOAD_NAME                0 (a)
              2 LOAD_NAME                1 (b)
              4 BINARY_MULTIPLY
              6 LOAD_NAME                2 (c)
              8 LOAD_NAME                3 (d)
             10 BINARY_POWER
             12 LOAD_NAME                4 (e)
             14 BINARY_TRUE_DIVIDE
             16 BINARY_ADD
             18 RETURN_VALUE


## Why is the Python Version so Slow?
- Dynamic typing means that every single operation requires dispatching on the input type.
- Having an interpreter means that every instruction is fetched and dispatched at runtime.
- Other overheads:
  - Arbitrary-size integers.
  - Reference-counted garbage collection.
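
These overheads can be observed directly. The following is a rough sketch for CPython; the byte sizes it prints vary by build, so none are quoted here.

```python
import sys

# Python ints are full objects that grow as needed, so they never overflow...
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# ...but each one carries object overhead well beyond the 8 bytes of a C int64.
print(sys.getsizeof(1), sys.getsizeof(big))

# Dispatch on type: the same `*` runs different code depending on the operands.
kinds = (type(2 * 3).__name__, type(2 * 3.0).__name__, type(2 * 'ab').__name__)
print(kinds)  # ('int', 'float', 'str')
```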

> This is the paradox that we have to work with when we're doing scientific or numerically-intensive Python. What makes Python fast for development -- this high-level, interpreted, and dynamically-typed aspect of the language -- is exactly what makes it slow for code execution.

- Jake VanderPlas, [*Losing Your Loops: Fast Numerical Computing with NumPy*](https://www.youtube.com/watch?v=EEUXKG97YRw)

# What Do We Do?

<center><img src="images/runaway.gif" alt="Drawing" style="width: 50%;"/></center>

<center><img src="images/thisisfine.gif" alt="Drawing" style="width: 1080px;"/></center>

- Python is slow for numerical computation because it performs dynamic dispatch on every operation we perform...

- ...but often, we just want to do the same thing over and over in a loop!

- If we don't need Python's dynamism, we don't want to pay (much) for it.

- **Idea:** Dispatch **once per operation** instead of **once per element**.
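
To make the idea concrete, here is a hedged sketch that computes the same sum of squares twice: once with a Python loop (one dispatch per element) and once with NumPy (one dispatch per whole-array operation). The function names are illustrative, and the timings it prints depend on the machine.

```python
import timeit
import numpy as np

xs = list(range(100_000))
arr = np.arange(100_000, dtype=np.int64)  # explicit dtype avoids 32-bit overflow

def python_sum_of_squares(values):
    # One multiply and one add are dispatched per element by the interpreter.
    total = 0
    for v in values:
        total += v * v
    return total

def numpy_sum_of_squares(values):
    # One multiply and one sum are dispatched for the whole array.
    return int((values * values).sum())

print(python_sum_of_squares(xs) == numpy_sum_of_squares(arr))  # True
print("loop: ", timeit.timeit(lambda: python_sum_of_squares(xs), number=5))
print("numpy:", timeit.timeit(lambda: numpy_sum_of_squares(arr), number=5))
```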

In [53]:
import numpy as np

data = np.array([1, 2, 3, 4])
data

array([1, 2, 3, 4])

In [54]:
example = np.array([67.1,32.12,4.2,18.59])
example

array([67.1 , 32.12,  4.2 , 18.59])

In [55]:
data + data

array([2, 4, 6, 8])

In [56]:
example * 2.25

array([150.975 ,  72.27  ,   9.45  ,  41.8275])

In [57]:
%%time
# Naive dot product
(array_data * array_data).sum()

CPU times: user 220 µs, sys: 1.01 ms, total: 1.23 ms
Wall time: 1.77 ms


333328333350000.0

In [58]:
%%time
(ad_B*ad_B).sum()

CPU times: user 1.45 ms, sys: 21 µs, total: 1.47 ms
Wall time: 1.65 ms


2666646666700000.0

In [59]:
%%time
# Built-in dot product.
array_data.dot(array_data)

CPU times: user 940 µs, sys: 5.99 ms, total: 6.93 ms
Wall time: 9.08 ms


333328333350000.0

In [60]:
%%time
ad_B.dot(ad_B)

CPU times: user 773 µs, sys: 988 µs, total: 1.76 ms
Wall time: 2.55 ms


2666646666700000.0

In [61]:
%%time
fortran_dot_product(array_data, array_data)

CPU times: user 96 µs, sys: 955 µs, total: 1.05 ms
Wall time: 1.22 ms


333328333350000.0

In [62]:
%%time
fortran_dot_product(ad_B, ad_B)

CPU times: user 1.25 ms, sys: 0 ns, total: 1.25 ms
Wall time: 1.38 ms


2666646666700000.0

In [63]:
# Numpy won't allow us to write a string into an int array.
#data[0] = "foo"

In [64]:
# We also can't grow an array once it's created.
#data.append(3)

In [65]:
# We **can** reshape an array though.
two_by_two = data.reshape(2, 2)
two_by_two

array([[1, 2],
       [3, 4]])

Numpy arrays are:

- Fixed-type

- Size-immutable

- Multi-dimensional

- Fast\*

\* If you use them correctly.
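
The commented-out cells above can be turned into a runnable sketch by catching the errors. `np.append` and `np.shares_memory` are standard NumPy functions; the variable `longer` is introduced here only for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4])

# Fixed-type: a non-numeric string cannot be stored in an integer array.
try:
    data[0] = "foo"
    assign_error = None
except ValueError as exc:
    assign_error = type(exc).__name__
print(assign_error)  # ValueError

# Size-immutable: there is no in-place append; np.append builds a new array.
print(hasattr(data, "append"))  # False
longer = np.append(data, 5)
print(data.size, longer.size)   # 4 5

# Reshaping here returns a view onto the same buffer rather than a copy.
two_by_two = data.reshape(2, 2)
two_by_two[0, 0] = 100
print(data[0])                              # 100
print(np.shares_memory(data, two_by_two))   # True
```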

# What's in an Array?

In [66]:
arr = np.array([1, 2, 3, 4, 5, 6], dtype='int16').reshape(2, 3)
print("Array:\n", arr, sep='')
print("===========")
print("DType:", arr.dtype)
print("Shape:", arr.shape)
print("Strides:", arr.strides)
print("Data:", arr.data.tobytes())

Array:
[[1 2 3]
 [4 5 6]]
===========
DType: int16
Shape: (2, 3)
Strides: (6, 2)
Data: b'\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00'


In [67]:
arb = np.array([145,-25,74,6,-13,7,89,-12], dtype='int16').reshape(4, 2)
print("Array:\n", arb, sep='')
print("===========")
print("DType:", arb.dtype)
print("Shape:", arb.shape)
print("Strides:", arb.strides)
print("Data:", arb.data.tobytes())

Array:
[[145 -25]
 [ 74   6]
 [-13   7]
 [ 89 -12]]
===========
DType: int16
Shape: (4, 2)
Strides: (4, 2)
Data: b'\x91\x00\xe7\xffJ\x00\x06\x00\xf3\xff\x07\x00Y\x00\xf4\xff'
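
The strides are what make many array operations cheap. As a brief sketch of this, transposing or stepping through an array only changes the shape and strides metadata, while the underlying bytes stay where they are:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6], dtype='int16').reshape(2, 3)

# Strides are byte steps per axis: 3 items * 2 bytes to reach the next row,
# 2 bytes to reach the next column.
print(arr.strides)                    # (6, 2)

# Transposing swaps shape and strides; no bytes are moved.
print(arr.T.shape, arr.T.strides)     # (3, 2) (2, 6)
print(np.shares_memory(arr, arr.T))   # True

# Taking every other column doubles the column stride.
print(arr[:, ::2].strides)            # (6, 4)
```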


# Core Operations

- Vectorized **ufuncs** for elementwise operations.
- Fancy indexing and masking for selection and filtering.
- Aggregations across axes.
- Broadcasting

# UFuncs

UFuncs (universal functions) are functions that operate elementwise on one or more arrays.

In [68]:
data = np.arange(15).reshape(3, 5)
data

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [69]:
dt_B = np.arange(20).reshape(5, 4)
dt_B

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [70]:
# Binary operators.
data * data

array([[  0,   1,   4,   9,  16],
       [ 25,  36,  49,  64,  81],
       [100, 121, 144, 169, 196]])

In [71]:
dt_B/np.pi

array([[0.        , 0.31830989, 0.63661977, 0.95492966],
       [1.27323954, 1.59154943, 1.90985932, 2.2281692 ],
       [2.54647909, 2.86478898, 3.18309886, 3.50140875],
       [3.81971863, 4.13802852, 4.45633841, 4.77464829],
       [5.09295818, 5.41126807, 5.72957795, 6.04788784]])

In [72]:
# Unary functions.
np.sqrt(data)

array([[0.        , 1.        , 1.41421356, 1.73205081, 2.        ],
       [2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ],
       [3.16227766, 3.31662479, 3.46410162, 3.60555128, 3.74165739]])

In [73]:
np.sin(dt_B)

array([[ 0.        ,  0.84147098,  0.90929743,  0.14112001],
       [-0.7568025 , -0.95892427, -0.2794155 ,  0.6569866 ],
       [ 0.98935825,  0.41211849, -0.54402111, -0.99999021],
       [-0.53657292,  0.42016704,  0.99060736,  0.65028784],
       [-0.28790332, -0.96139749, -0.75098725,  0.14987721]])

In [74]:
# Comparison operations
(data % 3) == 0

array([[ True, False, False,  True, False],
       [False,  True, False, False,  True],
       [False, False,  True, False, False]])

In [75]:
dt_B >= 0.5

array([[False,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

In [76]:
# Boolean combinators.
((data % 2) == 0) & ((data % 3) == 0)

array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False]])

In [77]:
((dt_B%3) == 0) & (((dt_B**2)%2) == 1)

array([[False, False, False,  True],
       [False, False, False, False],
       [False,  True, False, False],
       [False, False, False,  True],
       [False, False, False, False]])

In [78]:
# As of Python 3.5, @ is the matrix-multiplication operator.
data @ data.T

array([[ 30,  80, 130],
       [ 80, 255, 430],
       [130, 430, 730]])

In [79]:
dt_B @ (2*dt_B.T)

array([[  28,   76,  124,  172,  220],
       [  76,  252,  428,  604,  780],
       [ 124,  428,  732, 1036, 1340],
       [ 172,  604, 1036, 1468, 1900],
       [ 220,  780, 1340, 1900, 2460]])

# UFuncs Review

- UFuncs provide efficient elementwise operations applied across one or more arrays.
- Arithmetic Operators (`+`, `*`, `/`)
- Comparisons (`==`, `>`, `!=`)
- Boolean Operators (`&`, `|`, `^`)
- Trigonometric Functions (`sin`, `cos`)
- Transcendental Functions (`exp`, `log`)
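
The operators listed above are shorthand for ufunc objects that can also be called by name. A small sketch, using a throwaway 2 x 3 array:

```python
import numpy as np

data = np.arange(6).reshape(2, 3)

# Operators on arrays are shorthand for ufunc calls.
print(isinstance(np.add, np.ufunc))                      # True
print(np.array_equal(data + data, np.add(data, data)))   # True

# Ufuncs also carry methods; reduce folds the operation along an axis.
column_sums = np.add.reduce(data, axis=0)
print(column_sums)                                       # [3 5 7]

# outer applies the operation to every pair of elements.
table = np.multiply.outer([1, 2], [10, 20])
print(table.tolist())                                    # [[10, 20], [20, 40]]
```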

# Selections

We often want to perform an operation on just a subset of our data.

In [80]:
sines = np.sin(np.linspace(0, 3.14, 10))
cosines = np.cos(np.linspace(0, 3.14, 10))
sines

array([0.        , 0.34185385, 0.64251645, 0.86575984, 0.98468459,
       0.98496101, 0.8665558 , 0.64373604, 0.34335012, 0.00159265])

In [81]:
tan = np.tan(np.linspace(0,6.28,20))
tan

array([ 0.00000000e+00,  3.43113039e-01,  7.77792965e-01,  1.52893401e+00,
        3.93781237e+00, -1.21923821e+01, -2.28601853e+00, -1.08885118e+00,
       -5.42908169e-01, -1.68421722e-01,  1.65147806e-01,  5.38791101e-01,
        1.08191342e+00,  2.26633049e+00,  1.17335071e+01, -3.99105827e+00,
       -1.53961748e+00, -7.82917976e-01, -3.46677248e-01, -3.18531795e-03])

In [82]:
# Slicing works with the same semantics as Python lists.
sines[0]

0.0

In [83]:
tan[8]

-0.5429081690349788

In [84]:
sines[:3]  # First three elements  

array([0.        , 0.34185385, 0.64251645])

In [85]:
tan[:10]

array([  0.        ,   0.34311304,   0.77779296,   1.52893401,
         3.93781237, -12.19238214,  -2.28601853,  -1.08885118,
        -0.54290817,  -0.16842172])

In [86]:
sines[5:]  # Elements from 5 on.

array([0.98496101, 0.8665558 , 0.64373604, 0.34335012, 0.00159265])

In [87]:
tan[8:]

array([-5.42908169e-01, -1.68421722e-01,  1.65147806e-01,  5.38791101e-01,
        1.08191342e+00,  2.26633049e+00,  1.17335071e+01, -3.99105827e+00,
       -1.53961748e+00, -7.82917976e-01, -3.46677248e-01, -3.18531795e-03])

In [88]:
sines[::2]  # Every other element.

array([0.        , 0.64251645, 0.98468459, 0.8665558 , 0.34335012])

In [89]:
tan[::3]

array([ 0.        ,  1.52893401, -2.28601853, -0.16842172,  1.08191342,
       -3.99105827, -0.34667725])

In [90]:
# More interesting: we can index with boolean arrays to filter by a predicate.
print("sines:\n", sines)
print("sines > 0.5:\n", sines > 0.5)
print("sines[sines > 0.5]:\n", sines[sines > 0.5])

sines:
 [0.         0.34185385 0.64251645 0.86575984 0.98468459 0.98496101
 0.8665558  0.64373604 0.34335012 0.00159265]
sines > 0.5:
 [False False  True  True  True  True  True  True False False]
sines[sines > 0.5]:
 [0.64251645 0.86575984 0.98468459 0.98496101 0.8665558  0.64373604]


In [91]:
print("Tangents:\n", tan)
print("Positive Tangents:\n", tan > 0)
print("Tangents[Tangents > 0.5]:\n", tan[tan > 0.5])

Tangents:
 [ 0.00000000e+00  3.43113039e-01  7.77792965e-01  1.52893401e+00
  3.93781237e+00 -1.21923821e+01 -2.28601853e+00 -1.08885118e+00
 -5.42908169e-01 -1.68421722e-01  1.65147806e-01  5.38791101e-01
  1.08191342e+00  2.26633049e+00  1.17335071e+01 -3.99105827e+00
 -1.53961748e+00 -7.82917976e-01 -3.46677248e-01 -3.18531795e-03]
Positive Tangents:
 [False  True  True  True  True False False False False False  True  True
  True  True  True False False False False False]
Tangents[Tangents > 0.5]:
 [ 0.77779296  1.52893401  3.93781237  0.5387911   1.08191342  2.26633049
 11.73350714]


In [92]:
# We index with lists/arrays of integers to select values at those indices.
print(sines)
sines[[0, 4, 7]]

[0.         0.34185385 0.64251645 0.86575984 0.98468459 0.98496101
 0.8665558  0.64373604 0.34335012 0.00159265]


array([0.        , 0.98468459, 0.64373604])

In [93]:
print(tan)
tan[[15, 12, 9, 6, 3, 0,-2 ,-4,-8,-16]]

[ 0.00000000e+00  3.43113039e-01  7.77792965e-01  1.52893401e+00
  3.93781237e+00 -1.21923821e+01 -2.28601853e+00 -1.08885118e+00
 -5.42908169e-01 -1.68421722e-01  1.65147806e-01  5.38791101e-01
  1.08191342e+00  2.26633049e+00  1.17335071e+01 -3.99105827e+00
 -1.53961748e+00 -7.82917976e-01 -3.46677248e-01 -3.18531795e-03]


array([-3.99105827,  1.08191342, -0.16842172, -2.28601853,  1.52893401,
        0.        , -0.34667725, -1.53961748,  1.08191342,  3.93781237])

In [94]:

# Index arrays are often used for sorting one or more arrays.
unsorted_data = np.array([1, 3, 2, 12, -1, 5, 2])

In [95]:
to_sort =np.array([-7.25,8.63,0.19,0.001,-0.24,3.25,-9.86])

In [96]:
sort_indices = np.argsort(unsorted_data)
sort_indices

array([4, 0, 2, 6, 1, 5, 3])

In [97]:
sort_indx = np.argsort(to_sort)
sort_indx

array([6, 0, 4, 3, 2, 5, 1])

In [98]:
unsorted_data[sort_indices]

array([-1,  1,  2,  2,  3,  5, 12])

In [99]:
to_sort[sort_indx]

array([-9.86e+00, -7.25e+00, -2.40e-01,  1.00e-03,  1.90e-01,  3.25e+00,
        8.63e+00])

In [100]:
market_caps = np.array([12, 6, 10, 5, 6])  # Presumably in dollars?
assets = np.array(['A', 'B', 'C', 'D', 'E'])

In [101]:
values = np.array([2,3,4,7,8,11,6,5])
names = np.array(['X','Y','T','R','G','Q','S','Z'])

In [102]:
# Sort assets by market cap by using the permutation that would sort market caps on ``assets``.
sort_by_mcap = np.argsort(market_caps)
assets[sort_by_mcap]

array(['D', 'B', 'E', 'C', 'A'], dtype='<U1')

In [103]:
srtByName = np.argsort(values)
names[srtByName]

array(['X', 'Y', 'T', 'Z', 'S', 'R', 'G', 'Q'], dtype='<U1')

In [104]:
# Indexers are also useful for aligning data.
#print("Dates:\n", repr(event_dates))
#print("Values:\n", repr(event_values))
#print("Calendar:\n", repr(calendar))

In [105]:
#print("Raw Dates:", event_dates)
#print("Indices:", calendar.searchsorted(event_dates))
#print("Forward-Filled Dates:", calendar[calendar.searchsorted(event_dates)])
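
The cells above are commented out because their inputs are not defined in this notebook. The sketch below substitutes hypothetical data: `calendar` and `event_dates` are invented here and only approximate what the original demo used. It shows how `searchsorted` maps each event date to the first calendar date on or after it.

```python
import numpy as np

# Hypothetical trading calendar (weekdays only) and event dates.
calendar = np.array(['2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06',
                     '2017-01-09', '2017-01-10'], dtype='datetime64[D]')
event_dates = np.array(['2017-01-04', '2017-01-07', '2017-01-08'],
                       dtype='datetime64[D]')

# For each event, find the first calendar position whose date is >= the event.
indices = calendar.searchsorted(event_dates)
aligned = calendar[indices]

print("Indices:", indices)        # Indices: [1 4 4]
print("Aligned dates:", aligned)
```

An event later than the last calendar date would get index `len(calendar)`, which is out of bounds, so real code would need to handle that case.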

On multi-dimensional arrays, we can slice along each axis independently.

In [106]:
data = np.arange(25).reshape(5, 5)
data

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [107]:
dtC = np.arange(12).reshape(4,3)
dtC

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [108]:
data[:2, :2]  # First two rows and first two columns.

array([[0, 1],
       [5, 6]])

In [109]:
dtC[:3, :3]

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [110]:
data[:2, [0, -1]]  # First two rows, first and last columns.

array([[0, 4],
       [5, 9]])

In [111]:
dtC[:2,[1,-1]]

array([[1, 2],
       [4, 5]])

In [112]:
data[(data[:, 0] % 2) == 0]  # Rows where the first column is divisible by two.

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24]])

In [113]:
dtC[(dtC[:,0]**2)>= 10]

array([[ 6,  7,  8],
       [ 9, 10, 11]])

# Selections Review

- Indexing with an integer removes a dimension.
- Slicing operations use the same syntax on Numpy arrays as on lists, but return views of the original data rather than copies.
- Indexing with a boolean array filters to True locations.
- Indexing with an integer array selects indices along an axis.
- Multidimensional arrays can apply selections independently along different axes.
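
A short sketch of two details behind these points: how each kind of index affects the number of dimensions, and which selections share memory with the original array.

```python
import numpy as np

data = np.arange(25).reshape(5, 5)

# An integer index drops an axis; a length-one slice keeps it.
print(data[0].shape, data[0:1].shape, data[:, 0].shape)  # (5,) (1, 5) (5,)

# Basic slices are views: writes through them reach the original array.
view = data[:2, :2]
view[0, 0] = -1
print(data[0, 0])   # -1

# Integer-array (and boolean) indexing returns a copy instead.
picked = data[[0, 1]]
picked[0, 0] = 99
print(data[0, 0])   # -1
```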

## Reductions

Functions that reduce an array to a scalar.

$Var(X) = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2$

In [114]:
def variance(x):
    return ((x - x.mean()) ** 2).sum() / len(x)

In [115]:
variance(np.random.standard_normal(1000))

1.0230174403857861

In [116]:
variance(np.random.chisquare(10,250))

19.922717063744138
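
As a sanity check, the hand-written `variance` should agree with NumPy's built-in `np.var`, which uses the same 1/N normalization by default. The seeded generator below is an assumption added for repeatability and is not part of the original cells.

```python
import numpy as np

def variance(x):
    return ((x - x.mean()) ** 2).sum() / len(x)

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
sample = rng.standard_normal(1000)

# np.var divides by N by default (ddof=0), matching the function above.
print(np.isclose(variance(sample), np.var(sample)))   # True

# ddof=1 divides by N - 1 instead, which gives a slightly larger value.
print(np.var(sample, ddof=1) > np.var(sample))        # True
```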

- `sum()` and `mean()` are both **reductions**.

- In the simplest case, we use these to reduce an entire array into a single value...

In [117]:
data = np.arange(30)
data.mean()

14.5

In [118]:
dtb = np.arange(40)
dtb.std() 

11.543396380615196

- ...but we can do more interesting things with multi-dimensional arrays.

In [119]:
data = np.arange(30).reshape(3, 10)
data

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])

In [120]:
dtb=np.arange(40).reshape(8,5)
dtb

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39]])

In [121]:
data.mean()

14.5

In [122]:
dtb.std()

11.543396380615196

In [123]:
data.mean(axis=0)

array([10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])

In [124]:
dtb.sum(axis=1)

array([ 10,  35,  60,  85, 110, 135, 160, 185])

In [125]:
data.mean(axis=1)

array([ 4.5, 14.5, 24.5])

In [126]:
data.max(axis=0)

array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

## Reductions Review

- Reductions allow us to perform efficient aggregations over arrays.
- We can do aggregations over a single axis to collapse a single dimension.
- Many built-in reductions (`mean`, `sum`, `min`, `max`, `median`, ...).
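
A quick sketch of how the `axis` argument changes the result's shape. Note that `median` is available as the function `np.median` rather than as an array method.

```python
import numpy as np

data = np.arange(30).reshape(3, 10)

# Reducing over an axis removes that axis from the shape.
print(data.sum(axis=0).shape)                  # (10,)
print(data.sum(axis=1).shape)                  # (3,)

# keepdims=True keeps the reduced axis with length 1.
print(data.sum(axis=1, keepdims=True).shape)   # (3, 1)

row_medians = np.median(data, axis=1)
print(row_medians)                             # [ 4.5 14.5 24.5]
```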

# Broadcasting

In [127]:
row = np.array([1, 2, 3, 4])
column = np.array([[1], [2], [3]])
print("Row:\n", row, sep='')
print("Column:\n", column, sep='')

Row:
[1 2 3 4]
Column:
[[1]
 [2]
 [3]]


In [128]:
fil = np.array([1, 2, 4, 8, 16, 32, 64, 128])
col = np.array([[1], [2], [4], [8], [16], [32], [64], [128]])
print("Row:\n", fil, sep='')
print("Column:\n", col, sep='')

Row:
[  1   2   4   8  16  32  64 128]
Column:
[[  1]
 [  2]
 [  4]
 [  8]
 [ 16]
 [ 32]
 [ 64]
 [128]]


In [129]:
row + column


array([[2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7]])

In [130]:
fil * col

array([[    1,     2,     4,     8,    16,    32,    64,   128],
       [    2,     4,     8,    16,    32,    64,   128,   256],
       [    4,     8,    16,    32,    64,   128,   256,   512],
       [    8,    16,    32,    64,   128,   256,   512,  1024],
       [   16,    32,    64,   128,   256,   512,  1024,  2048],
       [   32,    64,   128,   256,   512,  1024,  2048,  4096],
       [   64,   128,   256,   512,  1024,  2048,  4096,  8192],
       [  128,   256,   512,  1024,  2048,  4096,  8192, 16384]])

<center><img src="images/broadcasting.png" alt="Drawing" style="width: 60%;"/></center>

<h5>Source: http://www.scipy-lectures.org/_images/numpy_broadcasting.png</h5>

In [131]:
# Broadcasting is particularly useful in conjunction with reductions.
print("Data:\n", data, sep='')
print("Mean:\n", data.mean(axis=0), sep='')
print("Data - Mean:\n", data - data.mean(axis=0), sep='')

Data:
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]]
Mean:
[10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Data - Mean:
[[-10. -10. -10. -10. -10. -10. -10. -10. -10. -10.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [ 10.  10.  10.  10.  10.  10.  10.  10.  10.  10.]]


In [132]:
print("Data:\n", dtC, sep='')
print("Row means:\n", dtC.mean(axis=1), sep='')
print("Data - column means:\n", dtC - dtC.mean(axis=0), sep='')

Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Row means:
[ 1.  4.  7. 10.]
Data - column means:
[[-4.5 -4.5 -4.5]
 [-1.5 -1.5 -1.5]
 [ 1.5  1.5  1.5]
 [ 4.5  4.5  4.5]]


# Broadcasting Review

- Numpy operations can work on arrays of different dimensions as long as the arrays' shapes are still "compatible".
- Broadcasting works by "tiling" the smaller array along the missing dimension.
- The result of a broadcasted operation is always at least as large in each dimension as the largest array in that dimension.
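
What "compatible" means can be sketched as follows: shapes are compared from the right, and a dimension of size 1 (or a missing dimension) is stretched to match. The `centered` example is an illustration added here, showing why subtracting per-row means needs `keepdims=True`.

```python
import numpy as np

row = np.array([1, 2, 3, 4])          # shape (4,)
column = np.array([[1], [2], [3]])    # shape (3, 1)

# (4,) and (3, 1) broadcast to (3, 4).
print(np.broadcast(row, column).shape)   # (3, 4)

# Sizes that differ and are not 1 are an error: (3, 4) vs (3,).
try:
    np.ones((3, 4)) + np.ones(3)
    broadcast_error = None
except ValueError as exc:
    broadcast_error = type(exc).__name__
print(broadcast_error)                   # ValueError

# Per-row means must be a column, shape (3, 1), to line up with the rows.
data = np.arange(12).reshape(3, 4)
centered = data - data.mean(axis=1, keepdims=True)
print(centered.sum(axis=1))              # [0. 0. 0.]
```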

# Numpy Review

- Numerical algorithms are slow in pure Python because the overhead of dynamic dispatch dominates our runtime.

- Numpy solves this problem by:
  1. Imposing additional restrictions on the contents of arrays.
  2. Moving the inner loops of our algorithms into compiled C code.

- Using Numpy effectively often requires reworking an algorithm to use vectorized operations instead of for-loops, but the resulting operations are usually simpler, clearer, and faster than the pure Python equivalent.
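
As one illustration of that reworking, the triple-loop `matmul` from earlier collapses to a single `@` expression. The sketch repeats the function so that it runs on its own, and uses the 3 x 3 matrices `A` and `B` from the earlier cell.

```python
import numpy as np

def matmul(A, B):
    """Triple-loop matrix product on nested lists, as defined earlier."""
    rows_out = len(A)
    cols_out = len(B[0])
    out = [[0 for col in range(cols_out)] for row in range(rows_out)]
    for i in range(rows_out):
        for j in range(cols_out):
            for k in range(len(B)):
                out[i][j] += A[i][k] * B[k][j]
    return out

A = [[0, 1, 1], [1, 2, 5], [6, 0, 9]]
B = [[1, 4, 7], [2, 3, 9], [0, 1, 2]]

loop_result = matmul(A, B)
vectorized = np.array(A) @ np.array(B)  # the inner loops run in compiled code

print(vectorized.tolist() == loop_result)  # True
```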

<center><img src="images/unicorn.jpg" alt="Drawing" style="width: 75%;"/></center>

Numpy is great for many things, but...

- Sometimes our data is equipped with a natural set of **labels**:
  - Dates/Times
  - Stock Tickers
  - Field Names (e.g. Open/High/Low/Close)

- Sometimes we have **more than one type of data** that we want to keep grouped together.
  - Tables with a mix of real-valued and categorical data.

- Sometimes we have **missing** data, which we need to ignore, fill, or otherwise work around.

<center><img src="images/panda-wrangling.gif" alt="Drawing" style="width: 75%;"/></center>

<center><img src="images/pandas_logo.png" alt="Drawing" style="width: 75%;"/></center>


Pandas extends Numpy with more complex data structures:

- `Series`: 1-dimensional, homogeneously-typed, labelled array.
- `DataFrame`: 2-dimensional, semi-homogeneous, labelled table.
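
A minimal sketch of the two structures together. The table below is hypothetical, with invented prices and variety names, and is meant only to show that each column of a `DataFrame` is a `Series` with its own dtype.

```python
import pandas as pd

# Hypothetical table mixing real-valued and categorical columns.
df = pd.DataFrame(
    {'price': [1.25, 0.99, 1.40], 'variety': ['HASS', 'HASS', 'GREENSKIN']},
    index=pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03']),
)

print(df.dtypes)                            # one dtype per column
print(isinstance(df['price'], pd.Series))   # True
print(df.shape)                             # (3, 2)
```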

Pandas also provides many utilities for: 
- Input/Output
- Data Cleaning
- Rolling Algorithms
- Plotting

# Selection in Pandas

In [133]:
import pandas as pd

In [134]:
s = pd.Series(index=['a', 'b', 'c', 'd', 'e'], data=[1, 2, 3, 4, 5])
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [135]:
values = [2.1,3.5,4.4,3.7,3.8,4.1,4.6,2.5]
names = ['X','Y','T','R','G','Q','S','Z']
t = pd.Series(index=names, data=values)
t

X    2.1
Y    3.5
T    4.4
R    3.7
G    3.8
Q    4.1
S    4.6
Z    2.5
dtype: float64

In [136]:
# There are two pieces to a Series: the index and the values.
print("The index is:", s.index)
print("The values are:", s.values)

The index is: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
The values are: [1 2 3 4 5]


In [137]:
print("The index is:", t.index)
print("The values are:", t.values)

The index is: Index(['X', 'Y', 'T', 'R', 'G', 'Q', 'S', 'Z'], dtype='object')
The values are: [2.1 3.5 4.4 3.7 3.8 4.1 4.6 2.5]


In [138]:
# We can look up values out of a Series by position...
s.iloc[0]

1

In [139]:
t.iloc[4]

3.8

In [140]:
# ... or by label.
s.loc['a']

1

In [141]:
t.loc['Q']

4.1

In [142]:
# Slicing works as expected...
s.iloc[:2]

a    1
b    2
dtype: int64

In [143]:
t.iloc[2:]

T    4.4
R    3.7
G    3.8
Q    4.1
S    4.6
Z    2.5
dtype: float64

In [144]:
# ...but it works with labels too!
s.loc[:'c']

a    1
b    2
c    3
dtype: int64

In [145]:
t.loc['G':]

G    3.8
Q    4.1
S    4.6
Z    2.5
dtype: float64

In [146]:
# Fancy indexing works the same as in numpy.
s.iloc[[0, -1]]

a    1
e    5
dtype: int64

In [147]:
t.iloc[[1,-1]]

Y    3.5
Z    2.5
dtype: float64

In [148]:
# As does boolean masking.
s.loc[s > 2]

c    3
d    4
e    5
dtype: int64

In [149]:
t.loc[t%2 ==0]

Series([], dtype: float64)

In [150]:
# Element-wise operations are aligned by index.
other_s = pd.Series({'a': 10.0, 'c': 20.0, 'd': 30.0, 'z': 40.0})
other_s

a    10.0
c    20.0
d    30.0
z    40.0
dtype: float64

In [151]:
u = pd.Series({'Fisica':4.2, 'Quimica':3.9, 'Calculo 1':4.5, 'Filosofia':3.2})
u

Fisica       4.2
Quimica      3.9
Calculo 1    4.5
Filosofia    3.2
dtype: float64

In [152]:
s + other_s

a    11.0
b     NaN
c    23.0
d    34.0
e     NaN
z     NaN
dtype: float64

In [153]:
# t and u share no labels, so every aligned element is NaN.
t + u

Calculo 1   NaN
Filosofia   NaN
Fisica      NaN
G           NaN
Q           NaN
Quimica     NaN
R           NaN
S           NaN
T           NaN
X           NaN
Y           NaN
Z           NaN
dtype: float64

In [154]:
# We can fill in missing values with fillna().
(s + other_s).fillna(0.0)

a    11.0
b     0.0
c    23.0
d    34.0
e     0.0
z     0.0
dtype: float64

In [155]:
(t+u).fillna(0.0)

Calculo 1    0.0
Filosofia    0.0
Fisica       0.0
G            0.0
Q            0.0
Quimica      0.0
R            0.0
S            0.0
T            0.0
X            0.0
Y            0.0
Z            0.0
dtype: float64
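
Filling after the addition throws away values that were present on only one side. If the intent is to treat a missing label as zero before adding, `Series.add` with `fill_value` is one alternative; the sketch below compares the two on the same `s` and `other_s`.

```python
import pandas as pd

s = pd.Series(index=['a', 'b', 'c', 'd', 'e'], data=[1, 2, 3, 4, 5])
other_s = pd.Series({'a': 10.0, 'c': 20.0, 'd': 30.0, 'z': 40.0})

# Fill after adding: labels present on only one side become 0.0.
filled_after = (s + other_s).fillna(0.0)

# Fill before adding: a missing side counts as 0.0, so the other side survives.
filled_before = s.add(other_s, fill_value=0.0)

print(filled_after)
print(filled_before)
```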

In [156]:
# Most real datasets are read in from an external file format.
#aapl = pd.read_csv('AAPL.csv', parse_dates=['Date'], index_col='Date')
#aapl.head()
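
Since `AAPL.csv` is not available here, the sketch below reads a tiny in-memory CSV with the same `parse_dates` and `index_col` arguments. The rows are made up for illustration and are not real AAPL prices.

```python
import io
import pandas as pd

# Made-up rows standing in for a real price file.
csv_text = """Date,Close,Volume
2010-02-01,10.0,100
2010-02-02,11.0,150
2010-02-03,10.5,120
"""

# parse_dates converts the column to timestamps; index_col makes it the index.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')
print(prices.head())
print(type(prices.index).__name__)
```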

In [157]:
# Slicing generalizes to two dimensions as you'd expect:
#aapl.iloc[:2, :2]

In [158]:
#aapl.loc[pd.Timestamp('2010-02-01'):pd.Timestamp('2010-02-04'), ['Close', 'Volume']]
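
The same two-dimensional slicing can be tried on a small synthetic frame. The column names and values below are assumptions chosen to mirror the commented-out cells, not the real dataset.

```python
import pandas as pd

dates = pd.date_range('2010-02-01', periods=5, freq='D')
frame = pd.DataFrame(
    {
        'Open': [1.0, 2.0, 3.0, 4.0, 5.0],
        'Close': [1.5, 2.5, 3.5, 4.5, 5.5],
        'Volume': [10, 20, 30, 40, 50],
    },
    index=dates,
)

# Positional: first two rows, first two columns.
top_left = frame.iloc[:2, :2]

# Label-based: a date range (both ends included) and a list of column names.
window = frame.loc[
    pd.Timestamp('2010-02-02'):pd.Timestamp('2010-02-04'),
    ['Close', 'Volume'],
]

print(top_left)
print(window)
```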

# Rolling Operations

<center><img src="images/rolling.gif" alt="Drawing" style="width: 75%;"/></center>

In [159]:
#aapl.rolling(5)[['Close', 'Adj Close']].mean().plot();

In [160]:
# Drop `Volume`, since it's way bigger than everything else.
#aapl.drop('Volume', axis=1).resample('2W').max().plot();

In [161]:
# 30-day rolling exponentially-weighted stddev of returns.
#aapl['Close'].pct_change().ewm(span=30).std().plot();
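
The three window operations above can be sketched on a synthetic price series. The data here is a simple ramp, assumed only so that the results are easy to check by hand.

```python
import pandas as pd

dates = pd.date_range('2020-01-01', periods=10, freq='D')
close = pd.Series([float(i) for i in range(1, 11)], index=dates)

# 5-day rolling mean: the first four entries are NaN (window not yet full).
rolling_mean = close.rolling(5).mean()

# Downsample to two-week bins and keep the maximum of each bin.
biweekly_max = close.resample('2W').max()

# Exponentially-weighted standard deviation of daily returns.
ewm_vol = close.pct_change().ewm(span=30).std()

print(rolling_mean)
print(biweekly_max)
print(ewm_vol)
```

Unlike `rolling`, `resample` changes the length of the result, because it produces one row per time bin rather than one row per input row.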

# "Real World" Data

In [162]:
#from demos.avocados import read_avocadata
#
#avocados = read_avocadata('2014', '2016')
#avocados.head()

In [163]:
# Unlike numpy arrays, pandas DataFrames can have a different dtype for each column.
#avocados.dtypes
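
To illustrate per-column dtypes without the avocado data, here is a small frame with hypothetical columns. The real dataset's schema is not shown in this notebook, so these names are assumptions.

```python
import pandas as pd

# Hypothetical columns, one per kind of data.
frame = pd.DataFrame({
    'Region': ['West', 'East'],
    'Organic': [True, False],
    'Weighted Avg Price': [1.25, 0.99],
    'Units': [100, 250],
})

# Each column keeps its own dtype.
print(frame.dtypes)
```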

In [164]:
# What's the regional average price of a HASS avocado every day?
#hass = avocados[avocados.Variety == 'HASS']
#hass.groupby(['Date', 'Region'])['Weighted Avg Price'].mean().unstack().ffill().plot();
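
The groupby, mean, unstack, ffill chain can be followed step by step on a few made-up rows. The column names mirror the ones in the cell above; the values are invented.

```python
import pandas as pd

# Hypothetical observations: two West rows on the first date, no East row on the second.
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-05', '2014-01-05', '2014-01-05', '2014-01-12']),
    'Region': ['West', 'West', 'East', 'West'],
    'Weighted Avg Price': [1.0, 2.0, 0.8, 1.2],
})

regional = (
    sales
    .groupby(['Date', 'Region'])['Weighted Avg Price']
    .mean()      # one value per (Date, Region) pair
    .unstack()   # Region labels become columns
    .ffill()     # carry the last known price forward over gaps
)
print(regional)
```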

In [165]:
#def _organic_spread(group):
#    # Regions without exactly two columns (organic and non-organic) get a zero spread.
#    if len(group.columns) != 2:
#        return pd.Series(index=group.index, data=0.0)
#
#    is_organic = group.columns.get_level_values('Organic').values.astype(bool)
#    organics = group.loc[:, is_organic].squeeze()
#    non_organics = group.loc[:, ~is_organic].squeeze()
#    diff = organics - non_organics
#    return diff
#
#def organic_spread_by_region(df):
#    """What's the difference between the price of an organic
#    and non-organic avocado within each region?
#    """
#    return (
#        df
#        .set_index(['Date', 'Region', 'Organic'])
#         ['Weighted Avg Price']
#        .unstack(level=['Region', 'Organic'])
#        .ffill()
#        .groupby(level='Region', axis=1)
#        .apply(_organic_spread)
#    )

In [166]:
#organic_spread_by_region(hass).plot();
#plt.gca().set_title("Daily Regional Organic Spread");
#plt.legend(bbox_to_anchor=(1, 1));

In [167]:
#spread_correlation = organic_spread_by_region(hass).corr()
#spread_correlation

In [168]:
#import seaborn as sns
#grid = sns.clustermap(spread_correlation, annot=True)
#fig = grid.fig
#axes = fig.axes
#ax = axes[2]
#ax.set_xticklabels(ax.get_xticklabels(), rotation=45);

# Pandas Review

- Pandas extends numpy with more complex data structures and algorithms.
- If you understand numpy, you understand 90% of pandas.
- `groupby`, `set_index`, and `unstack` are powerful tools for working with categorical data.
- Avocado prices are surprisingly interesting :)
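
As a compact recap of `set_index` and `unstack` on categorical columns, the sketch below computes an organic-minus-conventional price spread per region. It is a simplified variant written for this review, not the `organic_spread_by_region` helper above, and both the data and the `conventional`/`organic` column labels are assumptions.

```python
import pandas as pd

# Made-up prices: two dates, two regions, organic and conventional rows for each.
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-05'] * 4 + ['2014-01-12'] * 4),
    'Region': ['West', 'West', 'East', 'East'] * 2,
    'Organic': [True, False, True, False] * 2,
    'Weighted Avg Price': [2.0, 1.5, 1.8, 1.0, 2.2, 1.6, 1.9, 1.2],
})

wide = (
    prices
    .set_index(['Date', 'Region', 'Organic'])['Weighted Avg Price']
    .unstack(level='Organic')  # one column per value of the Organic flag
    .rename(columns={False: 'conventional', True: 'organic'})
)

# Subtract column-wise, then pivot Region into columns: one spread series per region.
spread = (wide['organic'] - wide['conventional']).unstack(level='Region')
print(spread)
```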

# Thanks!