# Numpy
* The core library for manipulating and cleaning data in Python is called pandas.
* The core library underneath pandas is numpy (numerical Python).
* Numpy is all about n-dimensional arrays and doing things quickly.
* We won't talk about numpy much beyond today, but it's worth a quick consideration...

In [1]:
import numpy as np
a = np.array([1, 2, 3])
print(a)
print(type(a))

[1 2 3]
<class 'numpy.ndarray'>


In [2]:
# Things we can do with an ndarray include:
print(a.ndim) #number of dimensions

1


In [3]:
# Let's create something multidimensional
b = np.array([[1,2,3],[4,5,6]])
print(b)
print(b.shape)

[[1 2 3]
 [4 5 6]]
(2, 3)


In [4]:
# What is this ndarray filled with?
b.dtype

dtype('int64')

In [5]:
# What is this really?
type(b[0][0])

numpy.int64

In [9]:
# And some automatic typecasting is available
c = np.array([2.2, 5, 1.1])
print(c.dtype)
print(c)
#l=[2.2, 5, 1.1]
#print(type(l[0]))
#print(type(l[1]))
#print(type(l[2]))


float64
[2.2 5.  1.1]
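A small sketch of what that typecasting means, compared with a plain Python list. The names `mixed`, `plain_list`, and `as_float` are made up for this illustration.

```python
import numpy as np

# An ndarray holds one dtype, so the integer 5 is upcast to a float.
mixed = np.array([2.2, 5, 1.1])
print(mixed.dtype)  # float64

# A plain Python list keeps each element's own type.
plain_list = [2.2, 5, 1.1]
element_types = [type(x).__name__ for x in plain_list]
print(element_types)  # ['float', 'int', 'float']

# You can also convert explicitly with astype.
as_float = np.array([1, 2, 3]).astype(np.float64)
print(as_float.dtype)  # float64
```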


In [10]:
# You'll see this code a lot in examples
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [11]:
np.random.rand(2,3)

array([[0.27035481, 0.51848331, 0.41234737],
       [0.1241825 , 0.54737886, 0.01375916]])

In [12]:
np.arange(10, 50, 2) # just like range! ten up to (but not including) fifty, by twos!

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
       44, 46, 48])

In [13]:
# last one
np.linspace( 0, 2, 15 ) # 15 numbers from 0 (inclusive) to 2 (inclusive)

array([0.        , 0.14285714, 0.28571429, 0.42857143, 0.57142857,
       0.71428571, 0.85714286, 1.        , 1.14285714, 1.28571429,
       1.42857143, 1.57142857, 1.71428571, 1.85714286, 2.        ])
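The one difference worth remembering between the last two is how they treat the end value. A quick sketch (variable names are illustrative only):

```python
import numpy as np

# arange takes a step and excludes the stop value, like range.
evens = np.arange(10, 50, 2)
print(evens[-1], len(evens))  # 48 20

# linspace takes a count and, by default, includes the stop value.
points = np.linspace(0, 2, 15)
print(points[0], points[-1], len(points))  # 0.0 2.0 15

# 15 points means 14 gaps, so the spacing is 2/14.
step = points[1] - points[0]
print(np.isclose(step, 2 / 14))  # True
```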

## Array Operations
* We can do many things with arrays, such as mathematical manipulation (addition, subtraction, squaring, exponentiation), as well as use boolean arrays, whose values are `True` or `False`.
* We can also do matrix manipulation such as product, transpose, inverse, and so forth.
* Key difference: arithmetic operations are **element-wise**.

In [14]:
a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])
c = a-b
c # this is the same as `display(c)`

array([ 9, 18, 27, 36])
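To make the element-wise point concrete, here is a minimal sketch that contrasts `*` with the matrix product `@` and shows the transpose and inverse mentioned above. The matrix `m` is an arbitrary example.

```python
import numpy as np

m = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# `*` multiplies matching elements: values [[1, 4], [9, 16]]
elementwise = m * m

# `@` is the matrix product: values [[7, 10], [15, 22]]
matmul = m @ m

# Transpose and inverse
transposed = m.T
inverse = np.linalg.inv(m)

# A matrix times its inverse gives the identity (up to rounding).
print(np.allclose(m @ inverse, np.eye(2)))  # True
```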

# Real world example
* My wife is an ultra-🇨🇦. We're talking syrup-chugging lumberjack who knows the words to *God Save the Queen*
* I'm kinda trying to fit in around here (poorly!)
* We bought a house full of Amazon Alexas...

* And they all talk in Celsius. And she refuses to change. But....

* I got to the thermostat first 😏

![](datasets/mwahaha.gif)

In [15]:
farenheit = np.array([0,-10,-5,-15,0])

# And the formula for conversion is ((°F − 32) × 5/9 = °C)
celcius = (farenheit - 32) * (5/9)
celcius

array([-17.77777778, -23.33333333, -20.55555556, -26.11111111,
       -17.77777778])

* What's happening underneath is so beautiful. The scalars `32` and `5/9` are *broadcast* across the whole array, and the arithmetic is *vectorized*: it is applied to every element at once, with no explicit Python loop.
* Each item in the ndarray can be operated on independently, so no element has to wait on the result for another.
* I believe this is a data science **threshold concept**, and we should call it out:

*The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.* (wikipedia)
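Broadcasting is not limited to a scalar and a 1-D array. As a rough sketch, the same conversion works on a 2-D array, and a 1-D array can be stretched across every row. The `readings` and `offsets` values are invented for this example.

```python
import numpy as np

# Two rows of readings in °F, shape (2, 3)
readings = np.array([[0.0, 32.0, 212.0],
                     [-40.0, 50.0, 98.6]])

# The scalars 32 and 5/9 are broadcast to every element.
celsius = (readings - 32) * (5 / 9)
print(celsius.shape)  # (2, 3)

# A shape (3,) array is broadcast across both rows of the (2, 3) array.
offsets = np.array([1.0, 0.0, -1.0])  # hypothetical per-column correction
adjusted = readings + offsets
print(adjusted.shape)  # (2, 3)
```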



* Vectorization allows for:
1. Massive parallelization and thus efficiency
2. Increased readability
3. Added flexibility
4. Increased code quality

In [17]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
# Solve this the numpy way, vectorized!
celcius = (farenheit - 32) * (5/9)
# How could we solve this iteratively?
#np.linspace( -10, 20, 1000 )

2.78 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [18]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
c=[]
for i in farenheit:
    c.append((i-32)*(5/9))

515 µs ± 9.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
# How else could we solve this iteratively?

In [19]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
[(temp-32)*(5/9) for temp in farenheit]

468 µs ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
# One more thought on how could we solve this iteratively?

In [21]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
# Note: this overwrites farenheit in place rather than building a new result
for i in range(len(farenheit)):
    farenheit[i]=((farenheit[i]-32)*(5/9))

552 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
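`%%timeit` only works inside IPython or Jupyter. If you want to try the comparison in a plain script, something like the following sketch should do, using the standard library's `timeit` module. The function names `vectorized` and `looped` are just labels for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

fahrenheit = np.linspace(-10, 20, 1000)


def vectorized(temps):
    # One expression over the whole array
    return (temps - 32) * (5 / 9)


def looped(temps):
    # One Python-level operation per element
    return [(t - 32) * (5 / 9) for t in temps]


# Both approaches should agree on the answer...
same = np.allclose(vectorized(fahrenheit), looped(fahrenheit))
print(same)  # True

# ...but not on how long it takes to get there.
t_vec = timeit.timeit(lambda: vectorized(fahrenheit), number=200)
t_loop = timeit.timeit(lambda: looped(fahrenheit), number=200)
print(f"vectorized: {t_vec:.4f}s, looped: {t_loop:.4f}s")
```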


In [None]:
# Any other ideas on solving this?

In [25]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
celcius=map(lambda temp: (temp-32)*(5/9), farenheit)

170 ns ± 0.415 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


# wth
* How do we better understand this? What's going on underneath?

In [32]:
celcius=map(lambda temp: (temp-32)*(5/9), farenheit)
for i in celcius:
    print(i)

-17.77777777777778
-23.333333333333336
-20.555555555555557
-26.11111111111111
-17.77777777777778
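So why was `map` so fast? In Python 3, `map` returns a lazy iterator: building it does none of the arithmetic, so the timing above measured only the cost of creating the `map` object. The conversions happen when something consumes the iterator, such as the `for` loop above. A minimal sketch (`lazy`, `values`, and `leftover` are names made up here):

```python
import numpy as np

fahrenheit = np.array([0, -10, -5, -15, 0])

# Creating the map object does no conversions yet.
lazy = map(lambda temp: (temp - 32) * (5 / 9), fahrenheit)
print(type(lazy).__name__)  # map

# The work happens only when the iterator is consumed.
values = list(lazy)
print(len(values))  # 5

# An iterator can be consumed only once; a second pass is empty.
leftover = list(lazy)
print(leftover)  # []
```

To time the `map` version fairly, you would wrap it in `list(...)` so the conversions actually run.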


# Final thoughts
1. Vectorization is the process of applying array programming techniques to data. When you vectorize an operation, the function is broadcast across the elements of an array, which allows the operations to be parallelized.
2. Vectorization is powerful, and thinking in a vectorized way is what really matters here. This aligns well with functional programming methods, and is a key to being an effective data scientist. We live in a parallel world: #loopsaredead


# Boolean masking
* This is a **critical concept** (and not difficult!) in this course.
* This will impact how you look at data and understand how queries work.

* A Boolean mask is analogous to a bitwise mask!
* Has everyone used bitwise masking before?

* Ok, it's really simple. You take a range of values which are `True (1)` or `False (0)` and you either AND them or OR them with another range of values which are `True` or `False`.

![](./datasets/bitmask.png)
* shamelessly stolen from https://stackoverflow.com/questions/28282869/shift-masked-bits-to-the-lsb
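If you haven't used bitwise masking before, here is a small sketch with plain Python integers. The values are arbitrary, chosen only to make the bit patterns easy to read.

```python
value = 0b10110110  # the data
mask = 0b00001111   # 1s mark the bits we care about

# AND keeps only the bits where the mask is 1.
kept = value & mask
print(format(kept, "08b"))  # 00000110

# OR forces the masked bits on and leaves the rest alone.
combined = value | mask
print(format(combined, "08b"))  # 10111111
```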

In [34]:
#bitwise masking
a=np.random.randint(2, size=10)
b=np.random.randint(2, size=10)
print(a)
print(b)
print(np.bitwise_and(a,b))

[1 1 1 0 0 0 1 1 1 0]
[1 0 0 0 0 0 1 0 1 1]
[1 0 0 0 0 0 1 0 1 0]


In [36]:
#boolean masking
import random
a=[random.choice([True,False]) for x in range(0,10)]
b=[random.choice([True,False]) for x in range(0,10)]
c=[random.choice([True,False]) for x in range(0,10)]

print(a)
print(b)
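# Python's `and` is not element-wise on lists: for non-empty lists,
# `(a and b) and c` simply returns `c`. Use arrays and `&` instead.
print(np.array(a) & np.array(b) & np.array(c))

[False, True, True, False, False, True, False, False, False, False]
[True, False, False, False, True, False, True, False, False, False]
[False False False False False False False False False False]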


# Why are we doing this?
* It's really common to take an array of data and mask it to reveal a result.
* This works hand in hand with broadcasting! This is a highly parallelizable result!
* We can broadcast individual values with comparison operators

In [None]:
a=np.random.randint(5,size=10)
a > 3

In [None]:
a[a>3]
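Since the cells above use random data, here is the same idea sketched with a fixed array so the results are predictable. The values in `a` are made up for illustration.

```python
import numpy as np

a = np.array([0, 4, 2, 3, 4, 1, 0, 4, 2, 3])

# The comparison is broadcast, giving one True/False per element.
mask = a > 3
print(mask.sum())  # 3  (True counts as 1)

# Indexing with the mask keeps only the True positions.
print(a[mask])  # [4 4 4]

# Masks combine element-wise with & and |; the parentheses are required.
middle = a[(a >= 2) & (a < 4)]
print(middle)  # [2 3 2 3]

# A mask can also select positions to assign to.
capped = a.copy()
capped[mask] = 3
```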

![](https://ibin.co/4v17tWp4dxMC.jpg)

# Indexing operator
* The indexing operator in numpy and pandas is incredibly overloaded. You can use it to
  * get a single item out of the array, e.g. `a[0]`
  * slice a range out of the array, e.g. `a[1:4]`
  * apply a boolean mask to an array, e.g. `a[a > 3]`
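All three uses side by side, as a quick sketch on a small made-up array:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

# 1. A single item
first = a[0]
print(first)  # 10

# 2. A slice (the stop index is excluded)
middle = a[1:4]
print(middle)  # [20 30 40]

# 3. A boolean mask with one entry per element
keep = np.array([True, False, True, False, True])
picked = a[keep]
print(picked)  # [10 30 50]
```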

In [39]:
a=np.random.randint(2, size=10)
a

array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0])

In [40]:
mask=a>0.5
mask

array([False,  True,  True,  True,  True, False, False,  True,  True,
       False])

In [42]:
#a[mask]
a[0]

0

You and boolean masking with numpy (and thus pandas!) will become good friends. :)

In [None]:
# Extended Topic #1
# Just for fun, here's how to implement the indexing operator yourself
class HomemadePsychologist:
    def __getitem__(self, key):
        print("""That's a good question, what do you think about the question '{}' Why do you think you are wondering about that?""".format(key))

psych = HomemadePsychologist()
psych["Are dogs fun?"]
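To connect this back to arrays, here is a toy sketch of how one `__getitem__` can serve single items, slices, and boolean masks. `TinyArray` is a hypothetical class written for this example; it is not how numpy is actually implemented.

```python
class TinyArray:
    """Toy container showing an overloaded indexing operator."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, key):
        # Slice: return a new TinyArray with the selected range
        if isinstance(key, slice):
            return TinyArray(self.values[key])
        # List of booleans: treat it as a mask
        if isinstance(key, list):
            if len(key) != len(self.values):
                raise IndexError("mask length must match array length")
            return TinyArray([v for v, keep in zip(self.values, key) if keep])
        # Anything else: treat it as a single position
        return self.values[key]

    def __gt__(self, threshold):
        # Comparison returns a mask, one boolean per element
        return [v > threshold for v in self.values]

    def __repr__(self):
        return f"TinyArray({self.values})"


t = TinyArray([1, 5, 2, 8])
print(t[0])      # 1
print(t[1:3])    # TinyArray([5, 2])
print(t[t > 3])  # TinyArray([5, 8])
```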