## Numpy - I LOVE Numpy!

Our linear algebra library for Python! It's fast because its core is written in C!

Numpy is a library for representing and working with large and multi-dimensional arrays. Most other libraries in the data-science ecosystem depend on numpy, making it one of the fundamental data science libraries.

Numpy provides a number of useful tools for scientific programming, and in this lesson, we'll take a look at some of the most common.

### Problems Solved:

Vectorization means operations are applied to whole arrays instead of individual elements. No need to loop!!

Native Python lists don't support linear algebra out of the box, and looping over them element by element is much slower than numpy's array operations.

### Vectorization:

Vectorization means operations are applied to whole arrays instead of individual elements.

Scalar multiplication = multiplying an array or matrix by a number.

Applying a function or operation to every element in an array.
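As a small illustration of the two ideas above, the sketch below multiplies an array by a scalar and then applies a function to every element. The array name `values` and the choice of `np.sqrt` are only assumptions made for this example.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# Scalar multiplication: every element is multiplied by 3
tripled = values * 3
print(tripled)   # [ 3. 12. 27.]

# Applying a function to every element, no loop required
roots = np.sqrt(values)
print(roots)     # [1. 2. 3.]
```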

### Operations with arrays


In [40]:
import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 3, 4])

print(x - y)
print(x + y)
print(x * y)
print(x.dot(y))    # The dot product multiplies element-wise, then sums:
                   # 1*2 + 2*3 + 3*4 = 20
print(x / y)
print(x + 3)

[-1 -1 -1]
[3 5 7]
[ 2  6 12]
20
[0.5        0.66666667 0.75      ]
[4 5 6]
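To see where the 20 from `x.dot(y)` comes from, the dot product can be rebuilt by hand from the element-wise product. This is only a sketch; the names `products` and `manual_dot` are made up for the example.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 3, 4])

# Element-wise products: 1*2, 2*3, 3*4
products = x * y

# Summing them gives the same value as x.dot(y)
manual_dot = products.sum()
print(manual_dot)              # 20
print(manual_dot == x.dot(y))  # True
```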


In [41]:
numbers = list(range(10_000_000))
%timeit [number + 1 for number in numbers]

679 ms ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [43]:
# This is how much faster...
numbers = np.arange(10_000_000)
%timeit numbers + 1

20.2 ms ± 563 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
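`%timeit` is an IPython magic, so it only works in a notebook or IPython shell. In a plain Python script, a rough equivalent might use the standard library's `timeit` module, as sketched below. The array size and the number of runs are arbitrary choices for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

size = 1_000_000  # smaller than the 10_000_000 used above, to keep the run short
numbers_list = list(range(size))
numbers_array = np.arange(size)

# Total time for 3 runs of each approach
list_seconds = timeit.timeit(lambda: [n + 1 for n in numbers_list], number=3)
array_seconds = timeit.timeit(lambda: numbers_array + 1, number=3)

print(f"list comprehension: {list_seconds:.4f} s for 3 runs")
print(f"numpy addition:     {array_seconds:.4f} s for 3 runs")
```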


### Bracket Syntax (Boolean Mask to List of Ints)

In [45]:
divisible_by_fifteen = numbers[numbers % 15 == 0]

In [46]:
divisible_by_fifteen[1:5]

array([15, 30, 45, 60])

In [54]:
print(x % 2 == 0)     # returns a boolean array, also called a **boolean mask**
print(x[x % 2 == 0])  # indexing with the mask keeps only the elements where the mask is True

[False  True False]
[2]


In [55]:
x == 2     #boolean mask to find out where a certain value lives

array([False,  True, False])

In [58]:
mask_above_1 = x > 1   # Awesome!! No need for loops!
mask_above_1

array([False,  True,  True])
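Masks are ordinary arrays, so they can be counted and combined. The sketch below shows two common uses; the name `mask_even` is an assumption made for this example.

```python
import numpy as np

x = np.array([1, 2, 3])

mask_above_1 = x > 1
mask_even = x % 2 == 0

# True counts as 1, so summing a mask counts the matches
print(mask_above_1.sum())            # 2

# & combines two masks element by element (both must be True)
print(x[mask_above_1 & mask_even])   # [2]
```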

### Reshape

In [53]:
ones = np.ones(30)  # or pass a shape tuple from the start: np.ones((10, 3))

In [52]:
ones.reshape(10,3)

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
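One detail worth checking is that `reshape` returns a new array object and leaves the shape of `ones` alone. The sketch below also shows `-1`, which asks numpy to work out that dimension from the array's size. The names `reshaped` and `inferred` are made up for the example.

```python
import numpy as np

ones = np.ones(30)
reshaped = ones.reshape(10, 3)

print(ones.shape)      # (30,) because the original is unchanged
print(reshaped.shape)  # (10, 3)

# -1 lets numpy infer the number of rows: 30 / 3 = 10
inferred = ones.reshape(-1, 3)
print(inferred.shape)  # (10, 3)
```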

In [1]:
import numpy as np

### Indexing

Numpy provides an array type that goes above and beyond what Python's built-in lists can do.

We can create a numpy array by passing a list to the np.array function:

In [2]:
a = np.array([1, 2, 3])
a

array([1, 2, 3])

We can create a multi-dimensional array by passing a list of lists to the array function:

In [3]:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Referencing elements in numpy arrays at its most basic is the same as referencing elements in Python lists.



In [4]:
a[0]


1

In [5]:
print('a    == {}'.format(a))
print('a[0] == {}'.format(a[0]))
print('a[1] == {}'.format(a[1]))
print('a[2] == {}'.format(a[2]))

a    == [1 2 3]
a[0] == 1
a[1] == 2
a[2] == 3


However, multidimensional numpy arrays are easier to index into. To obtain the element at the second column in the second row, we would write:



In [6]:
matrix[1, 1]


5

To get the first 2 elements of the last 2 rows:



In [7]:
matrix[1:, :2]


array([[4, 5],
       [7, 8]])
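For comparison, here is a sketch of the same two lookups done with a plain list of lists. The name `rows` is an assumption made for the example.

```python
import numpy as np

rows = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
matrix = np.array(rows)

# Single element: two bracket pairs for the list, one for the array
print(rows[1][1])    # 5
print(matrix[1, 1])  # 5

# First 2 elements of the last 2 rows
from_lists = [row[:2] for row in rows[1:]]  # needs a comprehension
from_array = matrix[1:, :2]                 # one indexing expression
print(from_lists)           # [[4, 5], [7, 8]]
print(from_array.tolist())  # [[4, 5], [7, 8]]
```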

Arrays can also be indexed with a boolean sequence used to indicate which values should be included in the resulting array.



In [8]:
should_include_elements = [True, False, True]
a[should_include_elements]


array([1, 3])

Note that the boolean sequence must be the same length as the array being indexed.
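A quick sketch of what happens when the lengths don't match: current numpy versions raise an `IndexError`. The exact wording of the message varies between versions, so it is printed rather than shown here. The names `too_short` and `error_message` are made up for the example.

```python
import numpy as np

a = np.array([1, 2, 3])
too_short = [True, False]  # only 2 values for a 3-element array

error_message = None
try:
    a[too_short]
except IndexError as e:
    error_message = str(e)
    print(f'IndexError: {error_message}')
```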



### Vectorized Operations
Another useful feature of numpy arrays is vectorized operations.

Without numpy, if we want to add 1 to every element in a list, we can't simply add 1 to the list, as that results in a TypeError.

In [9]:
original_array = [1, 2, 3, 4, 5]
try:
    original_array + 1
except TypeError as e:
    print('An Error Occurred!')
    print(f'TypeError: {e}')


An Error Occurred!
TypeError: can only concatenate list (not "int") to list


Instead, we might write a for loop or a list comprehension:


In [10]:
original_array = [1, 2, 3, 4, 5]
array_with_one_added = []
for n in original_array:
    array_with_one_added.append(n + 1)
print(array_with_one_added)

[2, 3, 4, 5, 6]
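The list comprehension version of the same loop might look like this:

```python
original_array = [1, 2, 3, 4, 5]

# Build the new list in a single expression instead of appending in a loop
array_with_one_added = [n + 1 for n in original_array]
print(array_with_one_added)  # [2, 3, 4, 5, 6]
```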


Vectorizing operations means that operations are automatically applied to every element in a vector, which in our case will be a numpy array. So if we are working with a numpy array, we can simply add 1:

In [11]:
original_array = np.array([1, 2, 3, 4, 5])
original_array + 1

array([2, 3, 4, 5, 6])

This works the same way for the other basic arithmetic operators as well.

In [12]:
my_array = np.array([-3, 0, 3, 16])

print('my_array      == {}'.format(my_array))
print('my_array - 5  == {}'.format(my_array - 5))
print('my_array * 4  == {}'.format(my_array * 4))
print('my_array / 2  == {}'.format(my_array / 2))
print('my_array ** 2 == {}'.format(my_array ** 2))
print('my_array % 2  == {}'.format(my_array % 2))

my_array      == [-3  0  3 16]
my_array - 5  == [-8 -5 -2 11]
my_array * 4  == [-12   0  12  64]
my_array / 2  == [-1.5  0.   1.5  8. ]
my_array ** 2 == [  9   0   9 256]
my_array % 2  == [1 0 1 0]


Not only are the arithmetic operators vectorized, but the same applies to the comparison operators.



In [13]:
my_array = np.array([-3, 0, 3, 16])

print('my_array       == {}'.format(my_array))
print('my_array == -3 == {}'.format(my_array == -3))
print('my_array >= 0  == {}'.format(my_array >= 0))
print('my_array < 10  == {}'.format(my_array < 10))

my_array       == [-3  0  3 16]
my_array == -3 == [ True False False False]
my_array >= 0  == [False  True  True  True]
my_array < 10  == [ True  True  True False]


Knowing what we know about indexing numpy arrays, we can use the comparison operators to select a certain subset of an array.

For example, we can get all the positive numbers in my_array like so:

In [14]:
my_array[my_array > 0]

array([ 3, 16])

### In-Depth Example
As another example, we could obtain all the even numbers like this:

In [15]:
my_array[my_array % 2 == 0]

array([ 0, 16])

To better understand how this is all working let's go through the above example in a little more detail.

The first expression that gets evaluated is this:

In [16]:
my_array % 2


array([1, 0, 1, 0])

Next, the result of that expression is compared to 0:

In [17]:
result = my_array % 2
result == 0

array([False,  True, False,  True])

Lastly, we use this array of boolean values to index into the original array, giving us only the values that are evenly divisible by 2.



In [18]:
step_1 = my_array % 2
step_2 = step_1 == 0
step_3 = my_array[step_2]

step_3

array([ 0, 16])

Put another way, here is how the expression is evaluated:



In [19]:
print('1. my_array[my_array % 2 == 0]')
print('    - the original expression')
print('2. my_array[{} % 2 == 0]'.format(my_array))
print('    - variable substitution')
print('3. my_array[{} == 0]'.format(my_array % 2))
print('    - result of performing the vectorized modulus 2')
print('4. my_array[{}]'.format(my_array % 2 == 0))
print('    - result of comparing to 0')
print('5. {}[{}]'.format(my_array, my_array % 2 == 0))
print('    - variable substitution')
print('6. {}'.format(my_array[my_array % 2 == 0]))
print('    - our final result')

1. my_array[my_array % 2 == 0]
    - the original expression
2. my_array[[-3  0  3 16] % 2 == 0]
    - variable substitution
3. my_array[[1 0 1 0] == 0]
    - result of performing the vectorized modulus 2
4. my_array[[False  True False  True]]
    - result of comparing to 0
5. [-3  0  3 16][[False  True False  True]]
    - variable substitution
6. [ 0 16]
    - our final result


### Array Creation

Numpy provides several functions for creating arrays. We'll take a look at some of them.

np.random.randn can be used to create an array of specified length of random numbers drawn from the standard normal distribution.

In [20]:
np.random.randn(10)

array([ 0.82070746, -0.71204293,  0.15326132,  0.24100122,  0.25023941,
        0.89873336,  0.67904663, -1.1466945 ,  0.29663266, -1.02537765])

We can also pass a second argument to this function to define the shape of a two dimensional array.



In [21]:
np.random.randn(3, 4)

array([[-1.97913735,  1.09066429,  0.16747329, -1.29217895],
       [ 0.52644363,  0.8189482 ,  0.33624969, -1.08542562],
       [-0.80154492,  0.33883055, -1.45174571,  0.62152247]])

If we wish to draw from a normal distribution with mean μ and standard deviation σ, we'll need to apply some arithmetic. Recall that to convert from the standard normal distribution, we'll need to multiply by the standard deviation, and add the mean.



In [22]:
mu = 100
sigma = 30

sigma * np.random.randn(20) + mu

array([101.74544284,  78.19272901, 119.29007159, 140.04463168,
        96.71611024, 146.23474233,  46.61564161,  71.84052004,
        82.47266105,  98.96360126,  90.39564861, 108.04168346,
        61.16638364,  76.09662232,  57.99274615, 135.20406581,
        94.96137359,  91.10227428, 125.28531453,  79.19338521])
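One way to sanity-check the scale-and-shift arithmetic is to draw a large sample and look at its mean and standard deviation. The sketch below uses numpy's newer `Generator` interface (`np.random.default_rng`) rather than `np.random.randn`, and also shows `Generator.normal`, which takes the mean and standard deviation directly. The seed and sample size are arbitrary choices for this example.

```python
import numpy as np

mu = 100
sigma = 30

rng = np.random.default_rng(42)  # the seed 42 is arbitrary

# Scale and shift standard normal draws: multiply by sigma, add mu
scaled = sigma * rng.standard_normal(100_000) + mu

# Or ask for the distribution directly
direct = rng.normal(loc=mu, scale=sigma, size=100_000)

# Both should come out close to mu = 100 and sigma = 30
print(scaled.mean(), scaled.std())
print(direct.mean(), direct.std())
```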

The zeros and ones functions provide the ability to create arrays of a specified size full of either 0s or 1s, and the full function allows us to create an array of the specified size with a default value.

In [23]:
print('np.zeros(3)    == {}'.format(np.zeros(3)))
print('np.ones(3)     == {}'.format(np.ones(3)))
print('np.full(3, 17) == {}'.format(np.full(3, 17)))

np.zeros(3)    == [0. 0. 0.]
np.ones(3)     == [1. 1. 1.]
np.full(3, 17) == [17 17 17]


We can also use these methods to create multi-dimensional arrays by passing a tuple of the dimensions of the desired array, instead of a single integer value.

In [24]:
np.zeros((2, 3))

array([[0., 0., 0.],
       [0., 0., 0.]])

Numpy's arange function is very similar to python's builtin range function. It can take a single argument and generate a range from zero up to, but not including, the passed number.

In [25]:
np.arange(4)


array([0, 1, 2, 3])

We can also specify a starting point for the range:



In [26]:
np.arange(1, 4)


array([1, 2, 3])

As well as a step:



In [28]:
np.arange(1, 4, 2)


array([1, 3])

Unlike python's builtin range, numpy's arange can handle decimal numbers.



In [29]:
np.arange(3, 5, 0.5)


array([3. , 3.5, 4. , 4.5])

The linspace method creates a range of numbers between a minimum and a maximum, with a set number of elements.



In [30]:
print('min: 1, max: 4, length = 4 -- {}'.format(np.linspace(1, 4, 4)))
print('min: 1, max: 4, length = 7 -- {} '.format(np.linspace(1, 4, 7)))

min: 1, max: 4, length = 4 -- [1. 2. 3. 4.]
min: 1, max: 4, length = 7 -- [1.  1.5 2.  2.5 3.  3.5 4. ] 


**Note** that here the maximum is inclusive.
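The spacing that `linspace` picks can be inspected with its `retstep=True` argument, and the inclusive maximum can be switched off with `endpoint=False`. A short sketch, with variable names made up for the example:

```python
import numpy as np

# retstep=True also returns the spacing between neighbouring values
values, step = np.linspace(1, 4, 7, retstep=True)
print(step)        # 0.5, which is (4 - 1) / (7 - 1)
print(values[-1])  # 4.0, the maximum is included

# endpoint=False leaves the maximum out, like arange does
no_end = np.linspace(1, 4, 6, endpoint=False)
print(no_end)      # [1.  1.5 2.  2.5 3.  3.5]
```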



### Array Methods
Numpy arrays also come with built-in methods to make many mathematical operations easier.

In [31]:
a = np.array([1, 2, 3, 4, 5])


Some of the most common are:

.min

In [32]:
a.min()


1

.max



In [33]:
a.max()


5

.mean



In [34]:
a.mean()


3.0

In [59]:
np.median(a)  # median is a numpy function, not an array method

3.0

.sum



In [35]:
a.sum()


15

.std: standard deviation



In [36]:
a.std()


1.4142135623730951
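By default `.std()` computes the population standard deviation, dividing by n. The sketch below rebuilds that value by hand and shows the `ddof=1` argument, which divides by n - 1 to give the sample standard deviation instead. The name `manual_std` is an assumption made for the example.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Population standard deviation: square root of the mean squared deviation
manual_std = np.sqrt(((a - a.mean()) ** 2).mean())
print(np.isclose(manual_std, a.std()))  # True

# Sample standard deviation divides by n - 1 instead of n
sample_std = a.std(ddof=1)
print(sample_std)  # about 1.58
```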

### Convert string numbers into floats

We will be using Pandas to clean data, but this is good to know.

In [60]:
import numpy as np
x = np.array(['1.1', '2.2', '3.3'])
y = x.astype(float)  # np.float was removed in NumPy 1.24; the builtin float works
type(y[1])
print(y)

[1.1 2.2 3.3]
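The conversion only works when every string looks like a number. The sketch below shows the result's dtype and what happens when one entry can't be parsed. The `'n/a'` value and the names `clean`, `dirty`, and `error_message` are made up for the example.

```python
import numpy as np

clean = np.array(['1.1', '2.2', '3.3']).astype(float)
print(clean.dtype)  # float64

# A single unparseable entry makes the whole conversion fail
dirty = np.array(['1.1', 'n/a', '3.3'])
error_message = None
try:
    dirty.astype(float)
except ValueError as e:
    error_message = str(e)
    print(f'ValueError: {error_message}')
```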
