# Using Data Structures Effectively

## NumPy arrays

In [1]:
import numpy as np

### NumPy array functionality

It is possible to make a multidimensional data structure from nested Python lists, but it quickly becomes difficult to perform calculations on them. I’ll illustrate this with an example: given a two-dimensional array, how can you look up the values in the first column?

To do this using nested lists, you would need to write a line of code to go through all the rows and extract the first value. Here’s one way of doing that, using a list comprehension:

In [2]:
python_2d_list = [[1, 3, 5], [2, 4, 6], [7, 9, 11]]

In [3]:
first_column = [row[0] for row in python_2d_list]

But if you have a NumPy array, you can simply look up the values in the first column using NumPy’s array slicing syntax:

In [4]:
np_2d_array = np.array([[1, 3, 5], [2, 4, 6], [7, 9, 11]])

In [5]:
first_column = np_2d_array[:, 0]

If you slice a list and assign the result to a new variable, Python creates a new list (a shallow copy of the selected elements). But if you slice a NumPy array in the same way, you get a view of the original array. So if you change the values in the `np_2d_array` object in the above example, the corresponding values in the `first_column` object change too. Creating a view is faster and more memory efficient than creating a copy, and this is another way NumPy arrays give better performance than lists.
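The difference between a copy and a view can be checked directly. The following is a minimal sketch; the variable names (`rows`, `grid`, `first_col`) are made up for illustration and are not part of the earlier cells.

```python
import numpy as np

# Nested list: slicing builds a new outer list.
rows = [[1, 3, 5], [2, 4, 6], [7, 9, 11]]
first_two_rows = rows[:2]
rows[0] = [100, 300, 500]      # replace the first row of the original
# first_two_rows[0] is still [1, 3, 5]

# NumPy array: basic slicing returns a view onto the same memory.
grid = np.array([[1, 3, 5], [2, 4, 6], [7, 9, 11]])
first_col = grid[:, 0]         # no data is copied here
grid[0, 0] = 100               # modify the original array
# first_col[0] is now 100

# np.shares_memory reports whether two arrays overlap in memory.
is_view = np.shares_memory(grid, first_col)

# An explicit .copy() is independent of the original.
first_col_copy = grid[:, 0].copy()
grid[1, 0] = 200
# first_col_copy[1] is still 2, while first_col[1] is 200
```

If you need an independent array, calling `.copy()` on the slice is the usual way to get one.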

In the same vein, many other operations on multidimensional data are much easier using NumPy arrays than nested lists. These include matrix multiplication, concatenating arrays, transposing arrays, and reshaping arrays.
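As a rough illustration of the operations just listed, here is a short sketch using two small arrays. The arrays `a` and `b` are hypothetical examples chosen so the results are easy to check by hand.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

product = a @ b                            # matrix multiplication
stacked = np.concatenate([a, b], axis=0)   # stack b's rows below a's rows
transposed = a.T                           # swap rows and columns
reshaped = np.arange(6).reshape(2, 3)      # 0..5 laid out as 2 rows, 3 columns

print(product)         # [[19 22]
                       #  [43 50]]
print(stacked.shape)   # (4, 2)
print(transposed)      # [[1 3]
                       #  [2 4]]
print(reshaped)        # [[0 1 2]
                       #  [3 4 5]]
```

Each of these would take an explicit loop or comprehension with nested lists.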


### NumPy array performance

In [6]:
mixed_type_list = ["one", 2, 3.14]

In [7]:
mixed_type_array = np.array(["one", 2, 3.14])

In [8]:
print(mixed_type_array)

['one' '2' '3.14']


In [9]:
integer_array = np.array([1, 2, 3])

In [10]:
integer_array.dtype

dtype('int32')

In [11]:
array_to_fill = np.zeros(1000)

In [12]:
random_int_array = np.random.randint(1, 100000, 100000)
random_int_list = list(random_int_array)

In [13]:
%%timeit -r 7 -n 100
sum(random_int_list)



2.93 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%%timeit -r 7 -n 100
np.sum(random_int_array)

35.2 µs ± 6.13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


That’s roughly 80 times faster, which is an enormous performance boost. If the operation you want to perform is available as a vectorized NumPy function or array method, you should use it rather than a native Python function or your own loop. You can consult the NumPy documentation to find out whether the operation is available.

When using NumPy arrays, you also need to consider whether you’ll need to add more elements to an array later.

Unlike a regular Python list, a NumPy array is allocated with no extra room. So every time you append more elements to a NumPy array, the entire array has to be copied to a new memory location. This means appending to a NumPy array is O(n). It’s worthwhile to initialize your array with the correct amount of space, and an easy way to do this is to use `np.zeros`, like so:

`array_to_fill = np.zeros(1000)`


You can then replace the zeros with the new elements instead of appending to the array.
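A minimal sketch of the fill-in-place pattern is shown here, next to the `np.append` approach it replaces. The size `n_values` and the squared values are placeholders chosen for illustration.

```python
import numpy as np

n_values = 1000  # hypothetical number of results we expect

# Preallocate once, then overwrite the zeros by index.
array_to_fill = np.zeros(n_values)
for i in range(n_values):
    array_to_fill[i] = i ** 2

# For comparison: np.append returns a brand-new array on every call,
# so this loop copies the data n_values times.
grown = np.array([])
for i in range(n_values):
    grown = np.append(grown, i ** 2)

# Both approaches produce the same values.
same_result = np.array_equal(array_to_fill, grown)
```

If the final size is not known in advance, one common alternative is to collect values in a Python list and convert it with `np.array` once at the end.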


### Performance benefits of vectorization
Perhaps the largest benefit of vectorization is not code clarity but performance: vectorized code runs much faster. Let’s explore this through a simple example. Say we have a large array of numbers and we want to double each of them.

We can start by using a non-vectorized approach which loops through each element in the array, doubling it along the way. Let’s create a function that does just that:

In [28]:
def double_nonvectorized(array):
    doubled = array.copy()
    for i in range(len(array)):
        doubled[i] = array[i] * 2
    return doubled

Next, let’s create the equivalent function, but vectorize it:

In [29]:
def double_vectorized(array):
    return array * 2

In [30]:
array = np.array([1, 2, 3, 4])
print("Nonvectorized = ", double_nonvectorized(array))
print("Vectorized    = ", double_vectorized(array))

Nonvectorized =  [2 4 6 8]
Vectorized    =  [2 4 6 8]


In [31]:
big_array = np.arange(1000000)

In [33]:
%%timeit
double_nonvectorized(big_array)

185 ms ± 5.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
%%timeit
double_vectorized(big_array)

2.17 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


That’s roughly an 85x speedup. So why does this happen? The answer is twofold.


First, the vectorized function hands the whole operation to NumPy code written in C, which is designed to do something to every entry of an array. It looks up where the array is stored once, rather than once per element.


In addition, arrays are typed, so NumPy knows that every entry of the array it’s modifying is an integer. As a result, it doesn’t have to check the type of every entry when the operation is vectorized; it checks once and knows that it’s working with an array of integers.
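The "one type for the whole array" point can be seen by comparing a list, a typed array, and an array created with `dtype=object`. This is only an illustrative sketch: the object array is used here as a stand-in for per-element type handling, since it stores references to Python objects instead of raw integers.

```python
import numpy as np

values_list = [1, 2, 3, 4]
values_array = np.array(values_list)

# Every list element is a separate Python object that carries its own type.
element_types = {type(v) for v in values_list}

# The array records a single dtype for all of its elements.
array_dtype = values_array.dtype

# With dtype=object, NumPy stores Python objects and applies the
# multiplication element by element through each object's own operator.
object_array = np.array(values_list, dtype=object)

doubled_typed = values_array * 2     # one typed loop in compiled code
doubled_object = object_array * 2    # per-element Python-level multiplication
```

Both results hold the same numbers, but timing the two multiplications on a large array is a simple way to see what the type check costs.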

### Adding Two Arrays Elementwise

#### Non-vectorized Version

In [61]:
n = 10000000
x = np.random.rand(n)
y = np.random.rand(n)

In [66]:
def add_nonvectorized(x, y):
    z1 = []
    # Use the length of the input rather than the global n
    for i in range(len(x)):
        z1.append(x[i] + y[i])
    return z1

#### Vectorized Version

In [67]:
def add_vectorized(x, y):
    return x + y

In [69]:
%%timeit
add_nonvectorized(x, y)

3.05 s ± 262 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [70]:
%%timeit
add_vectorized(x, y)

42.7 ms ± 466 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Matrix Multiplication

#### Non-vectorized

In [75]:
n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

In [77]:
def matrix_multiply_nonvectorized(A, B):
    n = len(A)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C
    

#### Vectorized

In [79]:
def matrix_multiply_vectorized(A, B):
    # Use the function's own arguments, not the global A and B
    return np.dot(A, B)

In [83]:
%%timeit -r 3 -n 1
matrix_multiply_nonvectorized(A, B)

9.38 s ± 3.75 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [81]:
%%timeit
matrix_multiply_vectorized(A, B)

465 µs ± 90 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


#### Vectorization syntax parallels much of the math of data science
Often in data science, we use linear algebra to perform matrix operations. Linear regression, principal component analysis, and correlation analyses all involve matrix operations.

Many of these matrix operations can be expressed directly as vectorized operations, in much the same way the math is written. For example, a data scientist often needs to multiply one matrix by another.
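As one illustration of code that mirrors the math, ordinary least squares can be written from the normal equations, (XᵀX)β = Xᵀy. This is a minimal sketch on a tiny made-up dataset (`X` and `y` are hypothetical, chosen so that the exact answer is intercept 1 and slope 2); it is not meant as a recommended way to fit regressions in practice.

```python
import numpy as np

# Design matrix: a column of ones (intercept) and one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])   # generated from y = 1 + 2 * feature

# Solve (X^T X) beta = X^T y, which reads almost like the formula.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Predictions are a single matrix-vector product.
predictions = X @ beta

print(beta)          # approximately [1. 2.]
print(predictions)   # approximately [1. 3. 5.]
```

Here `@` is Python’s matrix multiplication operator and `.T` is the transpose, so each symbol in the equation has a direct counterpart in the code.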

### Vectorization Recap
- Vectorization can drastically increase the speed of execution versus looping over arrays

- Vectorization keeps code simpler and more readable so it’s easier to understand and build on later

- Much of the math of data science is similar to vectorized implementations, making it easier to translate into vectorized code

- While performance may be important for your particular problem, prefer a modular implementation that ships on time and can be optimized later over a delayed deliverable

### NumPy's Different Types
You can also save a lot of memory with NumPy arrays by taking advantage of NumPy’s different data types. NumPy arrays are held in memory, so reducing their size can help when you are dealing with large arrays.


You can generate an array of random integers as before:


In [21]:
random_int_array = np.random.randint(1, 100_000, 100_000)

In [22]:
random_int_array.nbytes

400000

In [23]:
random_int_array_32 = random_int_array.astype(np.int32)

In [24]:
random_int_array.dtype

dtype('int32')

In [25]:
random_int_array_32.nbytes

400000

In [26]:
small_array = np.array([1, 3, 5], dtype=np.int16)

In [27]:
small_array.nbytes

6
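The effect of the data type on memory use is easiest to see when the dtype is set explicitly, because the default integer type depends on the platform and NumPy version. The following is a sketch with a made-up array named `values`; it also checks the range of `int16` before choosing it, since values outside that range would not fit.

```python
import numpy as np

values = np.arange(100_000)

as_int64 = values.astype(np.int64)   # 8 bytes per element
as_int32 = values.astype(np.int32)   # 4 bytes per element

print(as_int64.nbytes)   # 800000
print(as_int32.nbytes)   # 400000

# Check the representable range before switching to a smaller type.
int16_max = np.iinfo(np.int16).max   # 32767
fits_in_int16 = values.max() <= int16_max
print(fits_in_int16)     # False: 99999 is too large for int16

# Small values are safe to store in int16: 2 bytes per element.
small_array = np.array([1, 3, 5], dtype=np.int16)
print(small_array.nbytes)   # 6
```

Halving the element size halves `nbytes`, which is why picking the smallest type that safely holds your data matters for large arrays.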

### Parallel array operations with Dask

In [35]:
import dask.array as da

In [41]:
large_np_array = np.random.randint(1, 100000, 1000000000)

In [42]:
%%timeit -r 1 -n 7
np.max(large_np_array)

692 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 7 loops each)


In [43]:
large_dask_array = da.random.randint(1, 100000, 1000000000)

In [39]:
array_max = large_dask_array.max()

In [44]:
%%timeit -r 1 -n 7
array_max = large_dask_array.max()
array_max.compute()

5.31 s ± 0 ns per loop (mean ± std. dev. of 1 run, 7 loops each)
