# NumPy
* The core library for manipulating and cleaning data in Python is called pandas.
* The core library underneath pandas is NumPy (Numerical Python).
* NumPy is all about n-dimensional arrays and doing things quickly.
* We won't talk about NumPy much beyond today, but it's worth a quick consideration...

In [None]:
import numpy as np
a = np.array([1, 2, 3])
print(a)
print(type(a))

In [None]:
# Things we can do with an ndarray include:
print(a.ndim) #number of dimensions

In [None]:
# Let's create something multidimensional
b = np.array([[1,2,3],[4,5,6]])
print(b)
print(b.shape)

In [None]:
# What is this ndarray filled with?
b.dtype

In [None]:
# Automatic typecasting: the integer 5 is upcast to a float to match the other values
c = np.array([2.2, 5, 1.1])
print(c.dtype)
print(c)

In [None]:
# You'll see this code a lot in examples
np.zeros((2,3))

In [None]:
np.random.rand(2,3)

In [None]:
np.arange(10, 50, 2) # just like range! ten up to (but not including) fifty, by twos!

In [None]:
# last one
np.linspace(0, 2, 15) # 15 evenly spaced numbers from 0 (inclusive) to 2 (inclusive)
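To make the endpoint behaviour of these two helpers concrete, here is a small sketch. The names `stepped` and `spaced` are only illustrative. `np.arange` takes a step and stops before its end value, while `np.linspace` takes a count and includes both endpoints by default.

```python
import numpy as np

# arange takes a step size and excludes the stop value, like range
stepped = np.arange(10, 50, 2)
print(stepped[0], stepped[-1], len(stepped))  # first value, last value, count

# linspace takes a number of points and includes both endpoints by default
spaced = np.linspace(0, 2, 5)
print(spaced)
```

Reach for `arange` when you care about the step size, and `linspace` when you care about how many points you get.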

## Array Operations
* We can do many things with arrays, such as mathematical manipulation (addition, subtraction, squares, exponents), and we can use boolean arrays, whose values are `True` or `False`.
* We can also do matrix manipulation such as product, transpose, inverse, and so forth.
* Key difference from matrix algebra: arithmetic operators (including `*`) are applied **element-wise**.
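A minimal sketch of the difference between element-wise arithmetic and matrix operations, using two made-up 2x2 arrays (`m` and `n` are illustrative names, not part of the lecture data):

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
n = np.array([[0, 1], [1, 0]])

elementwise = m * n         # multiplies matching entries
matrix_product = m @ n      # row-by-column matrix multiplication
transposed = m.T            # rows become columns
inverse = np.linalg.inv(m)  # matrix inverse (m must be square and invertible)

print(elementwise)
print(matrix_product)
print(transposed)
print(inverse)
```

The `*` operator never performs matrix multiplication on ndarrays; use `@` (or `np.matmul`) when you want the matrix product.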

In [None]:
a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])
c = a-b
c

# Real world example
* My wife is an ultra-🇨🇦. We're talking a syrup-chugging lumberjack who knows the words to *God Save the Queen*
* I'm kinda trying to fit in around here (poorly!)
* We bought a house full of Amazon Alexas...

* And they all talk in Celsius. And she refuses to change. But....

* I got to the thermostat first 😏

In [None]:
farenheit = np.array([0,-10,-5,-15,0])

# And the formula for conversion is ((°F − 32) × 5/9 = °C)
celcius = (farenheit - 32) * (5/9)
celcius

* What's happening underneath is so beautiful. Two related ideas are at work: *vectorization* (the operation is applied to the whole array at once) and *broadcasting* (the scalars `32` and `5/9` are stretched to match the shape of the array).
* Each item in the ndarray can be operated on individually; its result does not depend on any other item.
* I believe this is a data science **threshold concept**, and we should call it out:

*The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.* (Wikipedia)
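To see what broadcasting does with the thermostat readings, here is a small sketch. The names `temps_f`, `grid`, and `row` are illustrative assumptions. Subtracting a scalar gives the same result as subtracting an array filled with that scalar, and the same rule lets a one-dimensional row combine with every row of a two-dimensional array.

```python
import numpy as np

temps_f = np.array([0, -10, -5, -15, 0])

# The scalar 32 is broadcast: conceptually stretched to the shape of temps_f
explicit = temps_f - np.full(temps_f.shape, 32)
broadcast = temps_f - 32
print(np.array_equal(explicit, broadcast))

# Broadcasting also works between arrays of compatible shapes:
# a (2, 3) grid plus a (3,) row adds the row to each row of the grid
grid = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
print(grid + row)
```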



* Vectorization allows for:
1. Efficiency: the loop runs in optimized, compiled code, which also opens the door to massive parallelization
2. Increased readability
3. Added flexibility
4. Increased code quality

In [None]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
celcius = (farenheit - 32) * (5/9)
# How could we solve this iteratively?

In [None]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
celcius=[]
for i in farenheit:
    celcius.append((i - 32) * (5/9))
# How else could we solve this iteratively?

In [None]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
[(temp - 32) * (5/9) for temp in farenheit]
# One more thought on how could we solve this iteratively?

In [None]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
# Note: unlike the earlier versions, this one overwrites farenheit in place
for i in range(len(farenheit)):
    farenheit[i]=(farenheit[i] - 32) * (5/9)
# Any other ideas?

In [None]:
%%timeit farenheit = np.linspace( -10, 20, 1000 )
celcius = map(lambda temp: (temp - 32) * (5/9), farenheit)

# wtf
* How do we better understand this?

https://stackoverflow.com/questions/57937570/vectorization-in-numpy-vs-python-map
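One likely reason the `map` cell looks suspiciously fast: in Python 3, `map` returns a lazy iterator, so the timed line builds the iterator without converting a single temperature. The sketch below (with the illustrative names `temps_f`, `lazy`, and `converted`) shows that the work only happens once the iterator is consumed.

```python
import numpy as np

temps_f = np.linspace(-10, 20, 1000)

lazy = map(lambda temp: (temp - 32) * (5 / 9), temps_f)
print(type(lazy))  # a map object; no conversions have run yet

# The conversions only run when the iterator is consumed
converted = list(lazy)
print(len(converted))
```

For a fair comparison with the other cells, the timed statement would need to consume the iterator, for example by wrapping the `map` call in `list(...)`.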

# Final thoughts
1. Vectorization is the process of applying array programming techniques to data. When you vectorize an operation, the function is applied across all elements of an array at once, which allows the operations to be parallelized.
2. Vectorization is powerful, and learning to think in a vectorized way is what really matters here. This aligns well with functional programming methods, and is a key to being an effective data scientist. We live in a parallel world: #loopsaredead


# Boolean masking
* This is a **critical concept** (and not difficult!) in this course.
* This will impact how you look at data and understand how queries work.

* A Boolean mask is analogous to a bitwise mask!
* Has everyone used bitwise masking before?

* Ok, it's really simple. You take a sequence of values which are `True (1)` or `False (0)` and you either AND them or OR them, position by position, with another sequence of values which are `True` or `False`.

![](https://i.stack.imgur.com/jy0de.png)
* Shamelessly stolen from https://stackoverflow.com/questions/28282869/shift-masked-bits-to-the-lsb (ignore the "wanted" row)
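For anyone who has not used bitwise masks before, here is a minimal sketch on plain Python integers. The bit patterns are made up for illustration and are not taken from the image above.

```python
value = 0b10110110  # the bits we want to inspect
mask = 0b00001111   # keep only the low four bits

kept = value & mask       # AND: a bit survives only where the mask is 1
forced_on = value | mask  # OR: a bit is switched on wherever the mask is 1

print(format(kept, "08b"))
print(format(forced_on, "08b"))
```

A Boolean mask does the same thing with one `True` or `False` per array element instead of one bit per position.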

In [None]:
#bitwise masking
a=np.random.randint(2, size=10)
b=np.random.randint(2, size=10)
print(a)
print(b)
print(np.bitwise_and(a,b))

In [None]:
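# boolean masking
import random
a = [random.choice([True, False]) for x in range(0, 10)]
b = [random.choice([True, False]) for x in range(0, 10)]

print(a)
print(b)
# `a and b` on two lists would simply return b, so AND the elements pairwise instead
print(np.logical_and(a, b))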

# Why are we doing this?
* It's really common to take an array of data and mask it to reveal a result.
* This works hand in hand with broadcasting! This is a highly parallelizable operation!
* We can broadcast individual values with comparison operators
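A small sketch of comparison operators producing masks, using a made-up array of Celsius readings (`temps_c` and the mask names are illustrative). It also shows how masks can be combined, which is the array equivalent of AND-ing two conditions in a query.

```python
import numpy as np

temps_c = np.array([-18, -23, -21, -26, -18, 4, 12])

below_minus_20 = temps_c < -20  # one True/False per element
print(below_minus_20)

# Combine masks with & (and), | (or), ~ (not); the parentheses are required
mild = (temps_c > -20) & (temps_c < 10)
print(temps_c[mild])

# True counts as 1, so summing a mask counts the matches
print(below_minus_20.sum())
```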

In [None]:
a=np.random.randint(5,size=10)
a > 3

In [None]:
a[a>3]

![](https://ibin.co/4v17tWp4dxMC.jpg)

# Indexing operator
* The indexing operator in numpy and pandas is incredibly overloaded. You can use it to
  * get a single item out of the array, e.g. `a[0]`
  * slice a range out of the array, e.g. `a[1:4]`
  * apply a boolean mask to an array, e.g. `a[[True, False, True]]` for a three-element `a` (the mask must be as long as the array)
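The three uses of the indexing operator can be sketched side by side on a small made-up array (`values` and `keep` are illustrative names):

```python
import numpy as np

values = np.array([5, 8, 2, 9, 4])

print(values[0])    # single item
print(values[1:4])  # slice: positions 1, 2 and 3

# Boolean mask: one True/False per element, same length as the array
keep = np.array([True, False, True, False, True])
print(values[keep])
```

One practical difference worth knowing: a slice is a view onto the original data, while indexing with a boolean mask returns a copy.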

In [None]:
a

In [None]:
mask=a>3
mask

In [None]:
a[mask]

You and boolean masking with numpy (and thus pandas!) will become good friends. :)