# 3 - NumPy for pandas

In [1]:
# this allows us to access numpy using the np. prefix
import numpy as np

Vectorized operations are much faster than operations performed by loops. Vectorized operations also require less code, which improves readability:

In [2]:
# a function that squares all the values in a sequence
def squares(values):
    result = []
    for v in values:
        result.append(v * v)
    return result

In [3]:
# create 100,000 numbers using python range
to_square = range(100000)

# time how long it takes to repeatedly square them all
%timeit squares(to_square)

15.9 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
# now let's do this with a numpy array
array_to_square = np.arange(0, 100000)

# and time using a vectorized operation
%timeit array_to_square ** 2

72.2 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Replacing the loop with a vectorized operation makes this example roughly two orders of magnitude faster: the loop takes on the order of milliseconds (15.9 ms), while the vectorized operation takes on the order of tens of microseconds (72.2 µs).

> If you find yourself coding a loop to iterate across elements of a NumPy array, or a pandas Series or DataFrame, 
> then you are, as they say, **doing it wrong**.
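Outside IPython, where the `%timeit` magic is not available, the same comparison can be sketched with the standard-library `timeit` module. The helper name `squares_loop` and the repetition count of 20 are illustrative assumptions, and the absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np


def squares_loop(values):
    """Square every value with an explicit Python loop."""
    return [v * v for v in values]


values = range(100_000)
# int64 is requested explicitly so the squares cannot overflow on any platform
array = np.arange(100_000, dtype=np.int64)

# Both approaches produce the same numbers
assert squares_loop(values) == (array ** 2).tolist()

# Time 20 runs of each approach; results depend on the machine
loop_time = timeit.timeit(lambda: squares_loop(values), number=20)
vector_time = timeit.timeit(lambda: array ** 2, number=20)
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s")
```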

## Creating NumPy arrays and performing basic array operations

In NumPy, n-dimensional arrays are represented by **ndarray** objects. Arrays can be created from ordinary Python lists through the ```np.array()``` function:

In [5]:
# a simple array
a1 = np.array([1, 2, 3, 4, 5])
a1

array([1, 2, 3, 4, 5])

In [6]:
# what is its type ?
type(a1)

numpy.ndarray

In [7]:
np.size(a1)

5

In [8]:
# array size in each dimension
a1.shape

(5,)

All elements of a NumPy array must have the same type:

In [9]:
# any float in the sequence makes it an array of floats
a2 = np.array([1, 2, 3, 4.0, 5.0])
a2

array([1., 2., 3., 4., 5.])

In [10]:
# array is all of one type (float64 in this case)
a2.dtype

dtype('float64')
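As a small sketch of how the single-type rule plays out, the example below checks the dtype NumPy chooses and converts it with `astype`. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])

# Integers stay integers; a single float upcasts the whole array
print(ints.dtype.kind)   # 'i' (signed integer)
print(mixed.dtype)       # float64

# astype returns a converted copy; the fractional part is truncated
truncated = mixed.astype(np.int64)
print(truncated)         # [1 2 3]
```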

In [11]:
# shorthand to repeat a sequence 10 times
a3 = np.array([0] * 10)
a3

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [12]:
# convert a python range to numpy array
np.array(range(10))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Although NumPy arrays can be created from Python lists, they can be created more efficiently using NumPy built-in functions:

In [13]:
# create a numpy array of 10 0.0's
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [14]:
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [15]:
# make "a range" starting at 0 and with 10 values
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
# 0 <= x < 10, increment by two
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

In [17]:
# 10 >= x > 0, counting down
np.arange(10, 0, -1)

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1])

The function ```np.linspace()``` generates an array with a specific number of evenly spaced items over an interval. While ```np.arange()``` calculates the number of items from the specified step, ```np.linspace()``` calculates the step from the specified number of items.

In [18]:
np.linspace(0, 10, 21)  # for np.linspace() both range limits are inclusive

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. ])
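To make the relationship between the two functions concrete, this sketch uses the `retstep` and `endpoint` parameters of `np.linspace()` to expose the computed step and to drop the upper limit. The variable names are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing that linspace computed
points, step = np.linspace(0, 10, 21, retstep=True)
print(step)          # 0.5
print(len(points))   # 21

# endpoint=False leaves out the upper limit, as np.arange does
half_open = np.linspace(0, 10, 20, endpoint=False)
print(half_open[-1])  # 9.5
print(np.allclose(half_open, np.arange(0, 10, 0.5)))  # True
```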

NumPy arrays vectorize many mathematical operators, for both vectors and scalars:

In [19]:
# multiply numpy array by 2
a1 = np.arange(0, 10)
a1 * 2

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [20]:
# add two numpy arrays
a2 = np.arange(10, 20)
a1 + a2

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
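The contrast with plain Python lists may help here: the same operators mean something quite different on a list. The following is a minimal sketch with illustrative names.

```python
import numpy as np

numbers = [1, 2, 3]
array = np.array(numbers)

# On a list, * repeats and + concatenates
print(numbers * 2)        # [1, 2, 3, 1, 2, 3]
print(numbers + numbers)  # [1, 2, 3, 1, 2, 3]

# On an array, the same operators work element by element
print(array * 2)          # [2 4 6]
print(array + array)      # [2 4 6]
print(array ** 2 - 1)     # [0 3 8]
```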

NumPy arrays are n-dimensional, but most of the time we are interested in 1-dimensional and 2-dimensional arrays.

In [21]:
# create a 2-dimensional array (2x2) by passing a list of lists to np.array()
# each sublist is a row of the 2-dimensional array
np.array([
    [1, 2],
    [3, 4]
])

array([[1, 2],
       [3, 4]])

The ```reshape()``` method (also available as the function ```np.reshape()```) allows us to redefine the dimensionality of an array:

In [22]:
# create a 1-d array of 20 elements, and reshape it to a 5x4 2-d array
m = np.arange(0, 20).reshape(5, 4)
m

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [23]:
# size of any dimensional array is the # of elements
np.size(m)

20

In [24]:
# can ask the size along a given axis (0 is rows)
np.size(m, axis = 0)

5

In [25]:
# (1 is columns)
np.size(m, axis = 1)

4
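The same information is available as attributes of the array, and `reshape` can infer one dimension when it is given `-1`. This is a short sketch; the variable `inferred` is an illustrative name.

```python
import numpy as np

m = np.arange(20).reshape(5, 4)

print(m.ndim)    # 2
print(m.shape)   # (5, 4)
print(m.size)    # 20

# -1 asks NumPy to infer that dimension from the total size
inferred = np.arange(20).reshape(-1, 4)
print(inferred.shape)  # (5, 4)

# A shape that does not match the element count raises ValueError
try:
    np.arange(20).reshape(3, 7)
except ValueError as error:
    print("cannot reshape:", error)
```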

## Selecting array elements

NumPy array elements are accessed via the \[\] operator.

In [26]:
# select 0-based elements 0 and 2
a1[0], a1[2]

(0, 2)

Elements in two-dimensional arrays can be accessed using two values separated by a comma:

In [27]:
# select an element in 2d array at row 1 and column 2
m[1, 2]

6

In [28]:
# select all items from row 1
m[1, ]

array([4, 5, 6, 7])

To select a full column from a 2d array, we must use the : (colon) symbol in place of the row index:

In [29]:
# select all items from column 2
m[:, 2]

array([ 2,  6, 10, 14, 18])
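Negative indices count from the end, just as with Python lists, and they combine with the row and column selection shown above. A brief sketch, assuming the same 5x4 array `m`:

```python
import numpy as np

m = np.arange(20).reshape(5, 4)

print(m[-1, -1])   # 19, last row and last column
print(m[-1])       # [16 17 18 19], the last row
print(m[:, -1])    # [ 3  7 11 15 19], the last column

# m[1] and m[1, :] select the same row
print(np.array_equal(m[1], m[1, :]))  # True
```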

## Logical operations on arrays

In [30]:
# which items are less than 2 ?
a = np.arange(5)
a < 2

array([ True,  True, False, False, False])

In [34]:
b = np.array([False, True, False])
c = np.array([True, True, False])
b | c

array([ True,  True, False])

In [37]:
# this will cause an exception
# a < 2 or a > 3

# The correct way is to use the element-wise | operator, with each comparison in parentheses
(a < 2) | (a > 3)

array([ True,  True, False, False,  True])
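The failure of `or` can be reproduced safely inside a `try` block. The sketch below also uses `np.logical_or()`, which is the function form of the element-wise operation on Boolean arrays.

```python
import numpy as np

a = np.arange(5)

try:
    a < 2 or a > 3
except ValueError as error:
    # Python's `or` needs a single truth value, which an array cannot give
    print("ValueError:", error)

# np.logical_or is the function form of the element-wise | on Booleans
mask = np.logical_or(a < 2, a > 3)
print(mask)   # [ True  True False False  True]

# Parentheses matter: | binds more tightly than < and >
print(np.array_equal(mask, (a < 2) | (a > 3)))  # True
```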

The ```np.vectorize()``` function takes a function that operates on a single element and returns a version that applies it to every element of an array:

In [40]:
np.vectorize(lambda x : x < 2 or x > 3)(a)

array([ True,  True, False, False,  True])
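The NumPy documentation describes `np.vectorize()` as a convenience rather than a performance tool, since it is essentially a loop underneath. When an expression can be written with array operators, that form is usually preferable. The sketch below checks that both give the same result; the function name `outside_2_to_3` is illustrative.

```python
import numpy as np

a = np.arange(5)


def outside_2_to_3(x):
    """Scalar test: True when x is below 2 or above 3."""
    return x < 2 or x > 3


by_vectorize = np.vectorize(outside_2_to_3)(a)
by_operators = (a < 2) | (a > 3)

print(by_vectorize)  # [ True  True False False  True]
print(np.array_equal(by_vectorize, by_operators))  # True
```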

Arrays of Booleans are especially useful for selecting elements from other arrays. Passing an array of Booleans to the \[\] operator selects the elements for which the corresponding value in the Boolean array is **True**:

In [41]:
a[a < 2]

array([0, 1])

In [42]:
a[a > 3]

array([4])

In [43]:
a[(a < 2) | (a > 3)]

array([0, 1, 4])

In [44]:
# np.sum() treats True as 1 and False as 0
# so this is how many items are less than 3
np.sum(a < 3)

3
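Boolean masks support a few more operations that follow the same idea: negation with `~`, element-wise choice with `np.where()`, and assignment through a mask. A minimal sketch with illustrative variable names:

```python
import numpy as np

a = np.arange(5)
mask = a < 3

print(a[~mask])                # [3 4], ~ negates the mask
print(np.count_nonzero(mask))  # 3, same as np.sum(mask)

# np.where picks from the second or third argument per element
print(np.where(mask, a, -1))   # [ 0  1  2 -1 -1]

# A mask on the left-hand side assigns only to the selected elements
clipped = a.copy()
clipped[clipped > 2] = 2
print(clipped)                 # [0 1 2 2 2]
```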

Finally, arrays can be compared to other arrays:

In [45]:
a1 = np.arange(0, 5)
a1

array([0, 1, 2, 3, 4])

In [47]:
a2 = np.arange(5, 0, -1)
a2

array([5, 4, 3, 2, 1])

In [48]:
a1 < a2

array([ True,  True,  True, False, False])

In [53]:
# multi-dimensional arrays can also be compared
a1 = np.arange(9).reshape(3, 3)
a2 = np.arange(9,0,-1).reshape(3, 3)
a1 < a2

array([[ True,  True,  True],
       [ True,  True, False],
       [False, False, False]])

However, two arrays must have the same shape (or shapes that NumPy can broadcast to a common shape) in order to be compared:

In [52]:
a3 = np.arange(10, 20)
# the operation below throws an exception
# a1 < a3

In [50]:
# here a1 is the 5-element 1-d array created in In [45]
a1 < a3

ValueError: operands could not be broadcast together with shapes (5,) (10,) 
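The error message mentions broadcasting, which is the rule NumPy uses to decide whether two different shapes are compatible. The sketch below shows one compatible and one incompatible pair; the names `grid` and `row` are illustrative.

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)
row = np.array([1, 4, 7])

# A (3, 3) array and a (3,) array are compatible:
# the row is compared against every row of the grid
print(grid < row)
# [[ True  True  True]
#  [False False  True]
#  [False False False]]

# A scalar is compatible with any shape
print((grid < 4).sum())  # 4

# Shapes (5,) and (10,) are not compatible
try:
    np.arange(5) < np.arange(10)
except ValueError as error:
    print("ValueError:", error)
```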

## Slicing arrays

Slicing retrieves zero or more items from an array. The \[\] operator accepts a slice of the form ```start:end:step```, where ```start``` is inclusive and ```end``` is exclusive.

In [54]:
# get all items in the array from position 3
# up to, but not including, position 8
a1 = np.arange(1, 10)
a1[3:8]

array([4, 5, 6, 7, 8])

In [55]:
# every other element, starting at position 0 (here, the odd values)
a1[::2]

array([1, 3, 5, 7, 9])

In [56]:
# omitting the start of the slice, NumPy assumes it is 0
a1[:6]

array([1, 2, 3, 4, 5, 6])

In [57]:
# omitting the end of the slice, NumPy continues through the last item of the array
a1[3:]

array([4, 5, 6, 7, 8, 9])

In [58]:
# array in reverse order
a1[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1])

In [59]:
# note that when in reverse, this does not include
# the element specified in the second component of the slice;
# that is, the item at position 0 (the value 1) is not in the result
a1[9:0:-1]

array([9, 8, 7, 6, 5, 4, 3, 2])

In [60]:
a1[9::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1])

In [61]:
# all items from position 5 onwards
a1[5:]

array([6, 7, 8, 9])

In [62]:
# the items in the first 5 positions
a1[:5]

array([1, 2, 3, 4, 5])
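Two details about slices are worth a short sketch: the `start:end` notation corresponds to Python's built-in `slice` object, and a slice of a NumPy array is a view onto the same data rather than a copy. The variable names are illustrative.

```python
import numpy as np

a1 = np.arange(1, 10)

# a1[3:8] is shorthand for indexing with a slice object
window = slice(3, 8)
print(a1[window])          # [4 5 6 7 8]

# A slice of an array is a view: writing to it changes the source
view = a1[3:8]
view[0] = 400
print(a1[3])               # 400

# Use .copy() when an independent array is needed
independent = a1[3:8].copy()
independent[1] = 500
print(a1[4])               # 5
```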

Two-dimensional arrays can also be sliced. Row and column selection uses the same slice notation:

In [63]:
# we saw this earlier
# : as the row specifier means all rows
# so this gets items in column position 1, from all rows
m[:, 1]

array([ 1,  5,  9, 13, 17])

To the left of the comma is a slice for the rows, and to the right is one for the columns:

In [64]:
# all rows, but only the columns in positions
# 1 up to but not including 3
m[:, 1:3]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10],
       [13, 14],
       [17, 18]])

In [65]:
# the rows in positions 3 up to but not including 5,
# all columns
m[3:5, :]

array([[12, 13, 14, 15],
       [16, 17, 18, 19]])

In [66]:
# combined to pull out a submatrix of the matrix
m[3:5, 1:3]

array([[13, 14],
       [17, 18]])

Finally, it is also possible to explicitly select items by passing a Python list as argument to the \[\] operator: 

In [67]:
# using a Python list, we can select
# non-contiguous rows or columns
m[[1,3,4], :]

array([[ 4,  5,  6,  7],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
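Selecting with a list behaves differently from slicing in two ways that the sketch below illustrates: the result is a copy, and two lists are paired element by element. `np.ix_()` is used here to obtain the cross product of rows and columns instead.

```python
import numpy as np

m = np.arange(20).reshape(5, 4)

# Indexing with a list returns a copy, unlike slicing
rows = m[[1, 3, 4], :]
rows[0, 0] = -1
print(m[1, 0])   # 4, the source is unchanged

# Two lists are paired element by element: (1, 0) and (3, 2)
print(m[[1, 3], [0, 2]])          # [ 4 14]

# np.ix_ selects the full cross product of rows and columns instead
print(m[np.ix_([1, 3], [0, 2])])
# [[ 4  6]
#  [12 14]]
```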

## Reshaping arrays

NumPy offers the ```reshape()``` method to change the shape (that is, the dimensions) of an array. For example, we can convert a vector into a matrix, and a matrix back into a one-dimensional vector.

In [70]:
# create a 9-element 1-d array
a = np.arange(0, 9)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [72]:
# and reshape to a 3x3 2d array (matrix)
m = a.reshape(3, 3)
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [73]:
# and we can reshape downward in dimensions too
v = m.reshape(9)
v

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [74]:
# .ravel will generate an array representing a flattened 2d array
raveled = m.ravel()
raveled

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [75]:
# .ravel and .reshape do not alter the shape of the source
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [76]:
# but the result is a view into the source (whenever NumPy can avoid a copy),
# so items changed in the result of the ravel
# are changed in the original object

# reshape m to an array
reshaped = m.reshape(np.size(m))

# ravel into an array
raveled = m.ravel()

# change values in either
reshaped[2] = 1000
raveled[5] = 2000

# and they show as changed in the original
m


array([[   0,    1, 1000],
       [   3,    4, 2000],
       [   6,    7,    8]])

In [78]:
# .flatten is like .ravel, but returns a copy of the data,
# not a view into the source
m2 = np.arange(0, 9).reshape(3, 3)
flattened = m2.flatten()

# change in the flattened object
flattened[0] = 1000
flattened

array([1000,    1,    2,    3,    4,    5,    6,    7,    8])

In [79]:
# but not the original
m2

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [80]:
# we can reshape by assigning a tuple to the .shape property
# we start with this, which has one dimension
flattened.shape

(9,)

In [82]:
# and make it 3x3
flattened.shape = (3,3)

# it is no longer flattened
flattened

array([[1000,    1,    2],
       [   3,    4,    5],
       [   6,    7,    8]])

In [83]:
# transpose a matrix
flattened.transpose()

array([[1000,    3,    6],
       [   1,    4,    7],
       [   2,    5,    8]])

In [84]:
# can also use .T property to transpose
flattened.T

array([[1000,    3,    6],
       [   1,    4,    7],
       [   2,    5,    8]])

In [87]:
# we can also use .resize, which changes the shape of
# an object in-place
m = np.arange(0, 9).reshape(3, 3)
m.resize(1, 9)
m  # the shape of m has changed

array([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
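Whether a reshaping operation returned a view or a copy can be checked directly with `np.shares_memory()`. The sketch below applies it to the methods covered in this section.

```python
import numpy as np

m = np.arange(9).reshape(3, 3)

# ravel and reshape return views here, because m is contiguous
print(np.shares_memory(m, m.ravel()))      # True
print(np.shares_memory(m, m.reshape(9)))   # True

# flatten always returns a copy
print(np.shares_memory(m, m.flatten()))    # False

# The transpose is a view too, but it is not C-contiguous,
# so raveling it has to copy the data
print(np.shares_memory(m, m.T))            # True
print(np.shares_memory(m, m.T.ravel()))    # False
```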

## Combining arrays

In NumPy, the process of combining arrays is known as _stacking_. Stacking can take the following forms:
* horizontal
* vertical 
* depth-wise

In [88]:
# creating two arrays for examples
a = np.arange(9).reshape(3, 3)
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [89]:
b = (a + 1) * 10
b

array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [91]:
# horizontally stack two arrays 
# b becomes columns of a to the right of a's columns

# hstack takes a tuple as a single argument
np.hstack((a, b))

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

In [93]:
# identical to concatenate along axis = 1
np.concatenate((a, b) , axis = 1)

array([[ 0,  1,  2, 10, 20, 30],
       [ 3,  4,  5, 40, 50, 60],
       [ 6,  7,  8, 70, 80, 90]])

In [94]:
# vertical stack, adding b as rows after a's rows
np.vstack((a, b))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [95]:
# concatenate along axis = 0 is the same as vstack
np.concatenate((a, b), axis = 0)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [96]:
# dstack stacks along a third (depth) axis, pairing each element of a with the matching element of b
np.dstack((a, b))

array([[[ 0, 10],
        [ 1, 20],
        [ 2, 30]],

       [[ 3, 40],
        [ 4, 50],
        [ 5, 60]],

       [[ 6, 70],
        [ 7, 80],
        [ 8, 90]]])

In [97]:
np.dstack((a,b)).shape

(3, 3, 2)

In [99]:
# stack two columns
one_d_a = np.arange(5)
one_d_b = (one_d_a + 1) * 10
np.column_stack((one_d_a, one_d_b))

array([[ 0, 10],
       [ 1, 20],
       [ 2, 30],
       [ 3, 40],
       [ 4, 50]])

In [100]:
# stack along rows
# (np.row_stack is an alias of np.vstack and is deprecated as of NumPy 2.0)
np.row_stack((one_d_a, one_d_b))

array([[ 0,  1,  2,  3,  4],
       [10, 20, 30, 40, 50]])
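The stacking helpers above can be related to one another with a few equality checks. The sketch below also uses `np.stack()`, which joins arrays along a new axis. It is a summary under the assumption of equally shaped 2-d inputs, as in the examples above.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
b = (a + 1) * 10

# hstack and vstack match concatenate along axis 1 and axis 0 for 2-d input
print(np.array_equal(np.hstack((a, b)), np.concatenate((a, b), axis=1)))  # True
print(np.array_equal(np.vstack((a, b)), np.concatenate((a, b), axis=0)))  # True

# np.stack joins arrays along a new axis
print(np.stack((a, b)).shape)            # (2, 3, 3)
print(np.stack((a, b), axis=2).shape)    # (3, 3, 2), the same layout as dstack
print(np.array_equal(np.stack((a, b), axis=2), np.dstack((a, b))))  # True
```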