# Numpy
* The core library for manipulating and cleaning data in Python is called pandas.
* The core library underneath pandas is numpy (numerical Python).
* Numpy is all about n-dimensional arrays and doing things quickly.
* We won't talk about numpy much beyond today, but it's worth a quick consideration...

In [1]:
import numpy as np #importing numpy
a = np.array([1, 2, 3])
print(a) #prints array
print(type(a)) #type is numpy array

[1 2 3]
<class 'numpy.ndarray'>


In [2]:
# Things we can do with an ndarray include:
print(a.ndim) #prints the number of dimensions; a is one dimensional

1


In [3]:
# Let's create something multidimensional
b = np.array([[1, 2, 3], [4, 5, 6]]) #nested lists, in this example two rows and three columns
print(b) #formats nicely
print(b.shape) #returns tuple with number of rows and number of columns, in this case 2 rows and three columns

[[1 2 3]
 [4 5 6]]
(2, 3)


In [4]:
# What is this ndarray filled with?
b.dtype #filled with 64 bit integers

dtype('int64')

In [5]:
# What is this really?
type(b[0][0]) #the element in row zero, column zero
#numpy defines its own set of integer types

numpy.int64

In [6]:
# And some automatic typecasting is available
c = np.array([2.2, 5, 1.1])
print(c.dtype)
print(c) #all have automatically been cast to float for numpy


l=[2.2, 5, 1.1] #the same values in a plain list keep their individual types
print(type(l[0])) #prints a float from the list
print(type(l[1])) #prints an int from the list
print(type(l[2])) #prints a float from the list


float64
[2.2 5.  1.1]
<class 'float'>
<class 'int'>
<class 'float'>
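To make the casting rule a little more concrete, here is a minimal sketch. The variable names (`ints`, `mixed`, `forced`) are illustrative and not part of the notebook, and the exact integer width is assumed to depend on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])       # only integers, so numpy picks an integer dtype
mixed = np.array([2.2, 5, 1.1])  # a single float is enough to upcast every element
forced = mixed.astype(np.int64)  # an explicit cast back to integers truncates toward zero

print(ints.dtype.kind)  # i
print(mixed.dtype)      # float64
print(forced)           # [2 5 1]
```

An ndarray holds one dtype for all of its elements, which is why the `5` above is stored as `5.0` rather than staying an integer as it does in the list.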


In [7]:
# You'll see this code a lot in examples
np.zeros((2, 3)) #creates an array of zeros as floats

array([[0., 0., 0.],
       [0., 0., 0.]])

In [8]:
import numpy as np
np.random.rand(2, 3) #generates a 2 by 3 array of random floats in the range [0, 1)

array([[0.39483033, 0.78719389, 0.13033862],
       [0.85188803, 0.91877064, 0.14126827]])

In [12]:
# just like range! ten to fifty by twos!
np.arange(10, 50, 2) #from 10 up to (but not including) 50, counting by two
np.arange(1, 2, .1) #from 1 up to (but not including) 2, counting by .1; only this last expression is displayed

array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])

In [10]:
# last one
# 15 numbers from 0 (inclusive) to 2 (inclusive) EVENLY SPACED OUT
np.linspace(0, 2, 15) 

array([0.        , 0.14285714, 0.28571429, 0.42857143, 0.57142857,
       0.71428571, 0.85714286, 1.        , 1.14285714, 1.28571429,
       1.42857143, 1.57142857, 1.71428571, 1.85714286, 2.        ])
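The difference between the two endpoint conventions is easy to miss, so here is a small sketch comparing them. The variable names are illustrative; `endpoint=False` is a standard `np.linspace` argument that the notebook itself does not use.

```python
import numpy as np

half_open = np.arange(10, 50, 2)                  # stop value 50 is excluded, like range()
closed = np.linspace(0, 2, 15)                    # stop value 2 is included by default
open_end = np.linspace(0, 2, 15, endpoint=False)  # ask linspace to exclude the stop value

print(half_open[0], half_open[-1])  # 10 48
print(len(half_open))               # 20
print(closed[-1])                   # 2.0
print(open_end[-1] < 2)             # True
```

A rough rule of thumb: reach for `arange` when you know the step, and `linspace` when you know how many points you want.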

## Array Operations
* We can do many things with arrays, such as mathematical manipulation (addition, subtraction, squares, exponents), and we can use boolean arrays, whose values are all `True` or `False`.
* We can also do matrix manipulation such as product, transpose, inverse, and so forth.
* Key difference: arithmetic operations are **element wise**.

In [14]:
a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])
c = a-b # takes 10 - 1, 20 - 2, 30 - 3 etc
c = a+b # takes 10 + 1, 20 + 2, 30 + 3 etc (this overwrites the previous c)
c # this is the same as `display(c)`
#if the shapes are not compatible (see the next cell for a compatible case) you will get a ValueError

array([11, 22, 33, 44])

In [15]:
a = np.array([10,20,30,40])
b = np.array([[1, 2, 3, 4], [0, 0, 0, 0]])
c = a+b # a is added to each row of b; stretching the smaller array like this is called broadcasting
c #this is the same as display(c)

array([[11, 22, 33, 44],
       [10, 20, 30, 40]])

In [58]:
a = np.array([1, 2, 3, 4])
b = np.array([[22, 7, 14, 2], [4, 7, 12, 3]])
c = a+b
d = c - np.array([9, 1, 7 ,8])
display(d)

array([[14,  8, 10, -2],
       [-4,  8,  8, -1]])
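As a rough sketch of when broadcasting works and when it does not, the example below reuses the shapes from the cells above. The names `row` and `grid` are illustrative, and the failing case is an assumption about what "incompatible" looks like in practice.

```python
import numpy as np

row = np.array([10, 20, 30, 40])               # shape (4,)
grid = np.array([[1, 2, 3, 4], [0, 0, 0, 0]])  # shape (2, 4)

# The trailing dimensions match (4 and 4), so row is reused for both rows of grid.
print((row + grid).shape)  # (2, 4)

# Shapes (4,) and (3,) do not match and neither is 1, so numpy refuses to guess.
try:
    row + np.array([1, 2, 3])
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```

Numpy compares shapes from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.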

## Example
* Metrication in the US ([Wiki article](https://en.wikipedia.org/wiki/Metrication_in_the_United_States))

* Might want to convert, for our international audience...

![weather forecast](datasets/weather.jpg)

In [17]:
fahrenheit = np.array([32,27,32,21,29,16])
celsius = (fahrenheit - 32) * (5/9)
celsius #computed all the data from the array at once

array([ 0.        , -2.77777778,  0.        , -6.11111111, -1.66666667,
       -8.88888889])

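* What's happening underneath is so beautiful. It's called *vectorization*: the expression is written once and applied to the whole array, and the scalars `32` and `5/9` are *broadcast* to every element.
* Each item in the ndarray can be operated on individually; no element's result depends on any other element's.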
* I believe this is a data science **threshold concept**, and we should call it out:

*The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.* (wikipedia)



* Vectorization allows for:
1. Massive parallelization and thus efficiency
2. Increased readability
3. Added flexibility
4. Increased code quality
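The readability point can be seen by writing the same conversion both ways. This is a minimal sketch using the temperatures from the example above; the names `vectorized` and `looped` are illustrative.

```python
import numpy as np

fahrenheit = np.array([32, 27, 32, 21, 29, 16])

# Array-programming style: one expression over the whole array.
vectorized = (fahrenheit - 32) * (5 / 9)

# Scalar style: the same arithmetic, one element at a time.
looped = np.array([(temp - 32) * (5 / 9) for temp in fahrenheit])

print(np.allclose(vectorized, looped))  # True
```

Both versions produce the same values; the vectorized one simply hands the loop to numpy instead of spelling it out in Python.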

In [18]:
%%timeit fahrenheit = np.linspace( -10, 20, 1000 ) #%%timeit times the cell body; the code on this line is untimed setup
# Solve this the numpy way, vectorized!
celsius = (fahrenheit - 32) * (5/9)

3.02 µs ± 8.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [20]:
%%timeit fahrenheit = np.linspace( -10, 20, 1000 )
c = []
for i in fahrenheit:
    c.append((i-32)*(5/9)) #one Python-level operation per element, so it takes far longer

521 µs ± 6.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
# How else could we solve this iteratively?

In [9]:
%%timeit fahrenheit = np.linspace( -10, 20, 1000 )
[(temp-32) * (5/9) for temp in fahrenheit] #a list comprehension is slightly faster than the for loop but much slower than the vectorized way

480 µs ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
# Any other ideas on solving this?

In [None]:
%%timeit fahrenheit = np.linspace( -10, 20, 1000 )


* How do we better understand this? What's going on underneath?

In [10]:
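fahrenheit = np.array([32,27,32,21,29,16]) #redefine the data; without this line the cell raises a NameError in a fresh kernel
celsius = map(lambda temp: (temp-32)*(5/9), fahrenheit) #map applies the function lazily, one element at a time
for i in celsius:
    print(i)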
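The `map` version above spells out what vectorization does for us: the same function is applied to each element independently. The sketch below, with a hypothetical helper `to_celsius`, shows that the identical function body works on one scalar at a time and on the whole array at once.

```python
import numpy as np

fahrenheit = np.array([32, 27, 32, 21, 29, 16])

def to_celsius(temp):
    """Convert Fahrenheit to Celsius; works for a scalar or an ndarray."""
    return (temp - 32) * (5 / 9)

lazy = map(to_celsius, fahrenheit)   # a lazy iterator; nothing is computed yet
from_map = np.array(list(lazy))      # forces one Python call per element
from_numpy = to_celsius(fahrenheit)  # one call, applied to every element by numpy

print(np.allclose(from_map, from_numpy))  # True
```

Because no element depends on its neighbours, numpy is free to run the arithmetic in compiled code rather than in a Python loop.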

# Final thoughts
1. Vectorization is the process of applying array programming techniques to data. When you vectorize an operation, function(s) are broadcast across elements in an array which allows for parallelization of operations.
2. Vectorization is powerful, and it's thinking in a vectorized way which is really important here. This aligns well with functional programming methods, and is a key to being an effective data scientist. #loopsaredead


# Boolean masking
* This is a **critical concept** (and not difficult!) in this course.
* This will impact how you look at data and understand how queries work.

* A Boolean mask is analogous to a bitwise mask!

* It's really simple. You take a sequence of values which are `True (1)` or `False (0)` and you either AND them or OR them, position by position, with another sequence of values which are `True` or `False`.

* See https://stackoverflow.com/questions/28282869/shift-masked-bits-to-the-lsb
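For readers who have not met bitwise masks before, here is a small sketch in plain Python followed by the numpy equivalent. The specific bit patterns and names are made up for illustration.

```python
import numpy as np

value = 0b10110110
mask = 0b00001111               # keep only the low four bits

low_bits = value & mask         # AND keeps a bit only if it is set in both
with_flag = value | 0b01000000  # OR switches a bit on

print(bin(low_bits))   # 0b110
print(bin(with_flag))  # 0b11110110

# The same AND, applied position by position to arrays of ones and zeroes.
bits = np.bitwise_and(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1]))
print(bits)            # [1 0 0 1]
```

A Boolean mask works the same way, except that each position holds `True` or `False` instead of a single bit.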

In [11]:
#bitwise masking
a=np.random.randint(2, size=10) #array of ones and zeroes; the upper bound 2 is exclusive
b=np.random.randint(2, size=10)
print(a) #random bitwise mask
print(b) #random bitwise mask
print(np.bitwise_and(a, b)) #element-wise AND: the result is one only where both a and b are one

[2 0 2 0 0 2 0 1 2 1]
[1 1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]


In [25]:
#boolean masking
import random
a=np.array([random.choice([True,False]) for x in range(0,10)]) #a, b and c are each an array of ten random boolean values
b=np.array([random.choice([True,False]) for x in range(0,10)])
c=np.array([random.choice([True,False]) for x in range(0,10)])

print(a)
print(b)
print(c)
print((a & b) & c) #AND a and b together, then AND that result with c
                   #a position is True only where all three arrays are True

[ True  True  True  True False  True False False  True  True]
[ True  True False False False False  True  True  True False]
[ True  True  True False False False  True  True False False]
[ True  True False False False False False False False False]


# Why are we doing this?
* It's really common to take an array of data and mask it to reveal a result.
* This works hand in hand with broadcasting! This is a highly parallelizable result!
* We can broadcast individual values with comparison operators.
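As a minimal sketch of broadcasting a comparison, the example below builds masks from a fixed array rather than a random one so the results are reproducible. The names `above_one` and `in_range` are illustrative.

```python
import numpy as np

a = np.array([3, 4, 4, 2, 0, 3, 0, 2, 2, 2])

above_one = a > 1             # the scalar 1 is compared with every element
in_range = (a > 1) & (a < 4)  # parentheses are required: & binds tighter than > and <

print(above_one.sum())  # 8
print(a[in_range])      # [3 2 3 2 2 2]
```

Note that numpy masks are combined with `&` and `|`, not with Python's `and` and `or`, which do not work element by element.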

In [34]:
a = np.random.randint(5,size=10) #array of ten random integers from 0 to 4
mask1 = np.array([random.choice([True,False]) for x in range(0,10)]) #array of ten random booleans
mask2 = np.array([random.choice([True,False]) for x in range(0,10)]) #array of ten random booleans
print(a)
print(mask1)
print(mask2)

[3 4 4 2 0 3 0 2 2 2]
[ True False  True  True False False  True  True False  True]
[False False  True False  True False  True  True False False]


In [32]:
a[mask1] #keeps only the elements where mask1 is True
a[mask1 & mask2] #keeps only the elements where both mask1 and mask2 are True

array([2, 3, 2, 4, 2])

In [43]:
a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])
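mask = a > 0.5 #compares every element of a with 0.5
print(mask)
print(mask[a]) #a holds integers, so this is integer indexing into mask (positions 0 and 1), not boolean masking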

[ True False  True  True False  True  True False  True False]
[False  True False False  True False False  True False  True]
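The two index expressions `a[mask]` and `mask[a]` look alike but do different things. Here is a short sketch of the contrast, reusing the array from the cell above; the names `picked_by_mask` and `picked_by_position` are illustrative.

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])
mask = a > 0.5

picked_by_mask = a[mask]      # boolean index: keeps elements where mask is True
picked_by_position = mask[a]  # integer index: each value of a is used as a position in mask

print(picked_by_mask)           # [1 1 1 1 1 1]
print(len(picked_by_position))  # 10
```

A boolean index can shrink the result, while an integer index always returns one item per index value.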


# Indexing operator
* The indexing operator in numpy, and pandas, is incredibly overloaded. You can use it to
  * get a single item out of the array, e.g. `a[0]`
  * slice a range out of the array, e.g. `a[1:4]`
  * apply a boolean mask to an array, e.g. `a[mask]`, where `mask` is a boolean array or list such as `[True, False, True]` with one entry per element
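The three uses of the indexing operator listed above can be tried side by side. This is a small sketch on a made-up array.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

print(a[0])       # 10
print(a[1:4])     # [20 30 40]
print(a[np.array([True, False, True, False, False])])  # [10 30]
print(a[a > 25])  # [30 40 50]
```

The last line builds the mask and applies it in one step, which is the form you will see most often in pandas queries.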

In [35]:
# Extended Topic
# Just for fun, here's how to implement the indexing operator yourself
class HomemadePsychologist:
    def __getitem__(self, key):
        print("""That's a good question, what do you think about the question '{}' Why do you think you are wondering about that?""".format(key))

psych = HomemadePsychologist()
psych["Are dogs fun?"]

That's a good question, what do you think about the question 'Are dogs fun?' Why do you think you are wondering about that?
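Going one step further, the sketch below shows how a single `__getitem__` could dispatch on the type of the key, which is roughly the idea behind numpy's overloaded indexing. `MaskableList` is a hypothetical class written for this illustration and is not how numpy is actually implemented.

```python
class MaskableList:
    """Hypothetical list wrapper that accepts an int, a slice, or a boolean mask."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, key):
        # Integers and slices are passed straight to the underlying list.
        if isinstance(key, (int, slice)):
            return self.items[key]
        # Anything else is treated as a sequence of booleans, one per item.
        return [item for item, keep in zip(self.items, key) if keep]


temps = MaskableList([32, 27, 32, 21])
print(temps[0])                           # 32
print(temps[1:3])                         # [27, 32]
print(temps[[True, False, False, True]])  # [32, 21]
```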


In [36]:
a=np.random.randint(2, size=10) #array of ones and zeroes; the upper bound 2 is exclusive
a

array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])

In [37]:
mask = a > 0.5
mask #an array of boolean values: True where the element is greater than .5, False otherwise

array([False,  True,  True,  True, False,  True,  True, False, False,
       False])

In [38]:
a[mask] #gets all the values of a where the mask is True, i.e. the values greater than .5

array([1, 1, 1, 1, 1])

You and boolean masking with numpy (and thus pandas!) will become good friends. :)