In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("nb2.ipynb")

# nb02:  Basics with Pandas and In-Depth with NumPy
***

In this notebook we'll explore both Pandas and NumPy.  We'll start with some basic dataframe manipulation using Pandas.  Since Pandas is built on the backbone of NumPy we'll then dive into more NumPy functionality.


---
### Pandas module:

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating DataFrames
* Slicing DataFrames (i.e. selecting rows and columns)
* Filtering data (using boolean arrays and groupby.filter)
* Aggregating (using groupby.agg)

In this notebook you are going to use several pandas methods. As a reminder from lecture, you may press `shift+tab` on method parameters to see the documentation for that method. For example, if you were using the `drop` method in pandas, you could press shift+tab to see what `drop` is expecting.



**Note**: The Pandas interface is notoriously confusing for beginners, and the documentation is not consistently great. Throughout the semester, you will have to search through Pandas documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
%matplotlib inline

## Creating DataFrames & Basic Manipulations

Recall that a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) is a table in which each column has a specific data type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

Usually you'll create DataFrames by using a function like `pd.read_csv`. However, in this section, we'll discuss how to create them from scratch.

The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for the pandas `DataFrame` class provides several constructors for the DataFrame class.

**Syntax 1:** You can create a DataFrame by specifying the columns and values using a dictionary as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink'],
          'price': [1.0, 0.75, 0.35, 0.05]
          })
fruit_info

**Syntax 2:** You can also define a DataFrame by specifying the rows as shown below. 

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple", 1.0), ("orange", "orange", 0.75), ("yellow", "banana", 0.35),
     ("pink", "raspberry", 0.05)], 
    columns = ["color", "fruit", "price"])
fruit_info2

You can obtain the dimensions of a DataFrame by using the shape attribute `DataFrame.shape`.

In [None]:
fruit_info.shape

You can also convert the entire DataFrame into a two-dimensional NumPy array.

In [None]:
fruit_info.values

### **REVIEW:** Selecting Rows and Columns in Pandas

As you've seen in lecture, there are two verbose operators in Pandas for selecting rows: `loc` and `iloc`. Let's review them briefly.

#### Approach 1: `loc`

The first of the two verbose operators is `loc`, which takes two arguments. The first is one or more row **labels**, the second is one or more column **labels** - both of which are displayed in bold to the left of each of the rows and above each of the columns respectively. These are not the same as positional indices, which are used for indexing Python lists or NumPy arrays!

The desired rows or columns can be provided individually, in slice notation, or as a list. Some examples are given below.

Note that **slicing in `loc` is inclusive** on the provided labels.

In [None]:
#get rows 0 through 2 and columns fruit through price
fruit_info.loc[0:2, 'fruit':'price']

In [None]:
# get rows 0 through 2 and columns fruit and price. 
# Note the difference in notation and result from the previous example.
fruit_info.loc[0:2, ['fruit', 'price']]

In [None]:
# get rows 0 and 2 and columns fruit and price. 
fruit_info.loc[[0, 2], ['fruit', 'price']]

In [None]:
# get rows 0 and 2 and column fruit
fruit_info.loc[[0, 2], ['fruit']]

Note that if we request a single column but don't enclose it in a list, the return type of the `loc` operator is a `Series` rather than a DataFrame. 

In [None]:
# get rows 0 and 2 and column fruit, returning the result as a Series
fruit_info.loc[[0, 2], 'fruit']

If we provide only one argument to `loc`, it uses the provided argument to select rows, and returns all columns.

In [None]:
fruit_info.loc[0:1]

Note that if you try to access columns without providing rows, `loc` will raise a `KeyError`, because it looks for the column names among the row labels.

In [None]:
# uncomment, this code will crash
#fruit_info.loc[["fruit", "price"]]

# uncomment, this code works fine: 
#fruit_info.loc[:, ["fruit", "price"]]

#### Approach 2: `iloc`

`iloc` is very similar to `loc` except that its arguments are row numbers and column numbers, rather than row labels and column labels. A useful mnemonic is that the `i` stands for "integer". This is quite similar to indexing into a Python list or NumPy array.

In addition, **slicing for `iloc` is exclusive** on the provided integer indices. Some examples are given below:

In [None]:
# get rows 0 through 3 (exclusive) and columns 0 through 3 (exclusive)
fruit_info.iloc[0:3, 0:3]

In [None]:
# get rows 0 through 3 (exclusive) and columns 0 and 2.
fruit_info.iloc[0:3, [0, 2]]

In [None]:
# get rows 0 and 2 and columns 0 and 2.
fruit_info.iloc[[0, 2], [0, 2]]

In [None]:
#get rows 0 and 2 and column fruit
fruit_info.iloc[[0, 2], [0]]

In [None]:
# get rows 0 and 2 and column fruit, returning the result as a Series
fruit_info.iloc[[0, 2], 0]

Note that in these loc and iloc examples above, the row **label** and row **number** were always the same.

Let's see an example where they are different. If we sort our fruits by price, we get:

In [None]:
fruit_info_sorted = fruit_info.sort_values("price")
fruit_info_sorted

Observe that the row number 0 now has index 3, row number 1 now has index 2, etc. These indices are the arbitrary numerical index generated when we created the DataFrame. For example, banana was originally in row 2, and so it has row label 2.

If we request the rows in positions 0 and 2 using `iloc`, we're indexing using the row NUMBERS, not labels. 

In [None]:
fruit_info_sorted.iloc[[0, 2], 0]
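
To make the label versus position distinction concrete, the sketch below rebuilds the sorted table and asks for "0 and 2" both ways. The names `demo` and `demo_sorted` are introduced only for this illustration so that the cell runs on its own.

```python
import pandas as pd

# Rebuild the fruit table under hypothetical names so this cell is self-contained.
demo = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "color": ["red", "orange", "yellow", "pink"],
    "price": [1.0, 0.75, 0.35, 0.05],
})
demo_sorted = demo.sort_values("price")

# iloc uses positions: the first and third rows of the sorted table.
by_position = demo_sorted.iloc[[0, 2], 0]

# loc uses labels: the rows whose index labels are 0 and 2.
by_label = demo_sorted.loc[[0, 2], "fruit"]

print(by_position.tolist())
print(by_label.tolist())
```

The two calls return different fruits because sorting moved the rows but left each row's label attached to it.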

Lastly, as with `loc`, the second argument to `iloc` is optional. That is, if you provide only one argument to `iloc`, it treats the argument you provide as a set of desired row numbers, not column numbers.

In [None]:
fruit_info.iloc[[0, 2]]

#### Approach 3: `[]` Notation for Accessing Rows and Columns

Pandas also supports a bare `[]` operator. It's similar to `loc` in that it lets you access rows and columns by their name.

However, unlike `loc`, which takes row names and also optionally column names, `[]` is more flexible. If you provide it only row names, it'll give you rows (same behavior as `loc`), and if you provide it with only column names, it'll give you columns (whereas `loc` will crash).

Some examples:

In [None]:
fruit_info[0:2]

In [None]:
# Here we're providing a list of column names as a single argument to []
fruit_info[["fruit", "color", "price"]]

Note that slicing notation is not supported for columns if you use `[]` notation. Use `loc` instead.

In [None]:
# uncomment and this code crashes
#fruit_info["fruit":"price"]

# uncomment and this works fine
#fruit_info.loc[:, "fruit":"price"]

`[]` and `loc` are quite similar. For example, the following two pieces of code are functionally equivalent for selecting the fruit and price columns.

1. `fruit_info[["fruit", "price"]]` 
2. `fruit_info.loc[:, ["fruit", "price"]]`.

Because it yields more concise code, you'll find that our code and your code both tend to feature `[]`. However, there are some subtle pitfalls of using `[]`. If you're ever having performance issues, weird behavior, or you see a `SettingWithCopyWarning` in pandas, switch from `[]` to `loc` and this may help.
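
As one illustration of why `loc` is the safer habit when modifying data, the sketch below updates a subset of rows with a single `loc` call instead of chaining two `[]` selections. The table `prices` and the 0.5 threshold are made up for this example.

```python
import pandas as pd

# Hypothetical table used only for this illustration.
prices = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

# A chained selection such as
#     prices[prices["price"] < 0.5]["price"] = 0.5
# assigns into an intermediate object and is a common trigger for
# SettingWithCopyWarning, so it is left commented out here.

# A single loc call selects the rows and the column together
# and updates `prices` itself.
prices.loc[prices["price"] < 0.5, "price"] = 0.5

print(prices)
```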

To avoid getting too bogged down in indexing syntax, we'll avoid a more thorough discussion of `[]` and `loc`. We may return to this at a later point in the course.

For more on `[]` vs `loc`, you may optionally try reading:
1. https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte
2. https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc/65875826#65875826
3. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986

Now that we've reviewed basic indexing, let's discuss how we can modify dataframes. We'll do this via a series of exercises. 

### Question 1(a)

For a DataFrame `d`, you can add a column with `d['new column name'] = ...` and assign a list or array of values to the column. Add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table which expresses your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty). 


In [None]:
...
fruit_info

In [None]:
grader.check("q1a")

### Question 1(b)

You can also add a column to `d` with `d.loc[:, 'new column name'] = ...`. As above, the first parameter is for the rows and second is for columns. The `:` means change all rows and the `'new column name'` indicates the name of the column you are modifying (or in this case, adding). 

Add a column called `rank2` to the `fruit_info` table which contains the same values in the same order as the `rank1` column.


In [None]:
...
fruit_info

In [None]:
grader.check("q1b")

### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created. Make sure to use the `axis` parameter correctly. Note that `drop` does not change a table, but instead returns a new table with fewer columns or rows unless you set the optional `inplace` parameter.

*Hint*: Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to see how you can drop multiple columns of a Pandas DataFrame at once using a list of column names.


In [None]:
fruit_info_original = ...
fruit_info_original

In [None]:
grader.check("q2")

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with capital letters. Set this new DataFrame to `fruit_info_caps`. For an example of how to use rename, see the linked documentation above.

In [None]:
...
fruit_info_caps

In [None]:
grader.check("q3")

---


## Switching Gears:  Deep Diving Into Numpy 


### NumPy module:

**NumPy** (short for "Numerical Python") is an open source Python module for scientific computing. NumPy supports large, multidimensional arrays and matrices. It also supports a large collection of mathematical functions not found in Python's standard math library.

It's customary in data science to import NumPy with the alias $\texttt{np}$.  We can then ***access any mathematical function in the NumPy library by prefixing the function name with $\texttt{np.}$***

For a more in-depth description of methods and attributes built into Numpy see here:  https://numpy.org/doc/stable/user/quickstart.html   
and  here:  https://numpy.org/doc/stable/reference/routines.math.html



### Mathematical Functions

Numpy includes most of the functions and mathematical constants found in Python's standard math library, like logarithms, exponentiation, and even $\pi$.  

In [None]:
print(np.log(np.exp(1)))
print(np.log2(16))
print(np.log10(1000))
print(np.pi)

The nice thing about Numpy's mathematical functions is that they can be applied to arrays as well as scalars.  For instance

In [None]:
u = np.array([10, 100, 1000, 10000])

np.log10(u)

### Question 4

Use numpy built-in functions to find the square root of 17:


In [None]:
#Exercise 4 


In [None]:
# if you don't like how cumbersome it is to use np as the prefix, there is a solution!
from numpy import pi

print(pi)

#Or, to practice printing this with specific formatting:

"Floating point pi = {0:.3f}, with {1:d} digit precision".format(pi, 3)
#Here we specify 3 digits of precision and f is used to represent floating point number
#d in {1:d} represents integer value.
#See this site for many ways to format your output: https://thepythonguru.com/python-string-formatting/

## The Numpy ndarray
The main workhorse of Numpy is its multidimensional array class, called the `ndarray`.

A numpy array is a grid of values that all share the same data type.

The numpy array values are indexed by a tuple of nonnegative integers.

This allows us to build and work with one dimensional, two dimensional, or even $N$-dimensional arrays of numbers.

We can build a one dimensional ndarray by passing a Python list to the `np.array()` function.

See here for a full list of the array attributes and methods:  https://docs.scipy.org/doc/numpy-1.6.0/reference/generated/numpy.ndarray.html

### Creating Arrays

You can build arrays from python lists. 


In [None]:
np.array([[1.,2.], [3.,4.]])

In [None]:
np.array([x for x in range(5)])

Arrays don't have to contain numbers:

In [None]:
np.array([["A", "matrix"], ["of", "words."]])

## Making Arrays of Zeros

In [None]:
np.zeros(5)

## Making Arrays of Ones

In [None]:
np.ones([3,2])

In [None]:
np.eye(4)

## Making Arrays from ranges:

The `np.arange(start, stop, step)` function is like the python `range` function.

In [None]:
np.arange(0, 10, 2)

You can make a range of other types as well, such as dates. Learn more about working with [datetime objects](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#).

In [None]:
np.arange(np.datetime64('2016-12-31'), np.datetime64('2017-02-01'))

## Interpolating numbers 

The `np.linspace(start, stop, num)` function generates `num` evenly spaced numbers from `start` to `stop`, with both endpoints included.

In [None]:
np.linspace(0, 5, 10)

## A random array 

You can also generate arrays of random numbers (we will cover this in greater detail later).


- `rand` generates random numbers from a Uniform(low=0, high=1) distribution.
- `permutation` generates a random permutation of a sequence of numbers.

In [None]:
np.random.rand(3,2)

In [None]:
np.random.permutation(range(0,10))

In [None]:
# generate random integers
np.random.randint(0,20,5) # low, high, size


# Properties of Arrays

## Shape

Arrays have a shape which corresponds to the number of rows, columns, fibers, ...

In [None]:
A = np.array([[1., 2., 3.], [4., 5., 6.]])
print(A)
A.shape

## Type

Arrays have a type which corresponds to the type of data they contain

In [None]:
A.dtype

In [None]:
np.arange(1,5).dtype

In [None]:
(np.array([True, False])).dtype

In [None]:
np.array(["Hello", "Worlddddd!"]).dtype

What does `<U10` mean?

- `<` Little Endian
- `U` Unicode
- `10` length of the longest string, in characters

#### We can also change the type of an array:

In [None]:
np.array([1,2,3]).astype(float)

In [None]:
np.array(["1","2","3"]).astype(int)

Learn more about numpy [array types](https://docs.scipy.org/doc/numpy/user/basics.types.html)
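
As a small sketch of how a string dtype relates to the strings it holds, the cell below inspects two arrays. The names `short` and `longer` are made up for this example, and the byte-order prefix that gets printed may differ on your machine.

```python
import numpy as np

short = np.array(["Hello", "World!"])        # longest string has 6 characters
longer = np.array(["Hello", "Worlddddd!"])   # longest string has 10 characters

for arr in (short, longer):
    dt = arr.dtype
    # kind 'U' means Unicode; itemsize is in bytes, at 4 bytes per character.
    print(dt, dt.kind, dt.itemsize // 4)
```

The number in the dtype tracks the longest string in the array, so every element is stored with room for that many characters.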

In [None]:
# 'y' is a list in Python
y = [1,2,3,4,5]
print(y)

#We can check the type of what we've just created with Python's type function.
type(y)

In [None]:
#Now, the list is an array
x = np.array(y, dtype=float)
print(x)

type(x)

In [None]:
# The attribute .dtype will let you see the datatype of the elements: 

print(x.dtype)

### Question 5
    
Write code to create a vector of zeros of size 10 and update the sixth entry to 11 

In [None]:
#We are not limited to 1-dimensional arrays - we can also create  n-Dimensional arrays!
# 1-dimensional array
z=np.array([14, 15, 22.5, 100])
# 2-dimensional array
w=np.array([[1, 2],
       [3, 4],
       [5, 6]])

<img src="images/NumpyArrays.png"/>

In [None]:
#You can find the dimension(s) of your array using the shape attribute:

print(z.shape)

print(w.shape)

The nice thing about using Numpy arrays to store numerical values is that we can perform mathematical operations on them.  For instance, if we want to multiply every value in $\texttt{x}$ by $5$ and store the result in a new array $\texttt{y}$ we can do so as follows: 

In [None]:
y = 5*x
print(y)

Contrast this to what happens if we tried the exact same operations on a list instead:


In [None]:
w = [1, 2, 3, 4, 5]
# w is a Python list

q = 5 * w
# q is one flat list: the contents of w repeated 5 times
print('w = ', w)
print('5 * w = ', q)

If we want to add two ndarrays of equal length together we can do that too: 

In [None]:
z = x + y 
print(z)

If we want to create a two dimensional array we simply pass a list of lists to the $\texttt{np.array()}$ function.  The lists in the list of lists then become the rows of the two dimensional array. 

In [None]:
A = np.array([[1,2,3,4,5], [6,7,8,9,10]], dtype=float)
print(A)

We can access elements of Numpy arrays in a way similar to the way we access elements in Python lists. For instance, if we want to get the first $3$ elements of the array $\texttt{y}$, we can do so as follows: 

In [None]:
print(y)

In [None]:
y[0:3]

Just like with Python lists, if we're indexing from the start of the array there is no need to include the $0$ in the index range.  We can simply do 

In [None]:
y[:3]

Similarly, if we want to access everything from the third entry to the end of the array, we can do 

In [None]:
y[2:]

Indexing in multidimensional arrays is similar.  First, let's build a two dimensional array with our arrays $\texttt{x}, \texttt{y},$ and $\texttt{z}$ as the rows. 

In [None]:
B = np.array([x,y,z])
print(B)

We can slice up the two dimensional array by doing slices along rows and columns.  Let's suppose we wanted to carve out the second row of $\texttt{B}$. 

In [None]:
B[1,:]


Here, the stuff before the comma refers to rows of the array, and the stuff after the comma refers to columns of the array.  In the previous command we've indicated that we want the row with index $1$, and the colon indicated that we want _all_ of the columns. Similarly, if we want all of the fourth column, we could do 

In [None]:
B[:,3]

If you've played with slicing lists of Python lists then you'll note that the main difference when slicing two dimensional Numpy arrays is that we only need one set of brackets, instead of two. Otherwise it's pretty similar.   

If we want to carve out certain rows and certain columns, we can do that by grabbing ranges of indices for both the rows and the columns.  For instance, if we want the part of $\texttt{B}$ in the second and third rows and the second through the fourth columns we can do 

In [None]:
B[1:3, 1:4]
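
To see the bracket difference between nested lists and two dimensional arrays side by side, here is a minimal sketch. The names `nested` and `grid` are hypothetical and hold the same six values.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]   # a list of Python lists
grid = np.array(nested)           # the same values as a 2-D array

# Single element: two sets of brackets for lists, one for arrays.
print(nested[1][2], grid[1, 2])

# A whole column needs a loop with lists but only a slice with arrays.
col_from_list = [row[1] for row in nested]
col_from_array = grid[:, 1]
print(col_from_list, col_from_array)
```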

We can also apply mathematical functions to the two dimensional array.  For instance, if we want to sum all of the entries in $\texttt{B}$ we can do 

In [None]:
np.sum(B)

If instead we just want to sum along the rows or columns of the array we can add the $\texttt{axis}$ parameter.  

In [None]:
np.sum(B, axis=0)

Notice that choosing $\texttt{axis=0}$ caused us to sum down each column of $\texttt{B}$, collapsing the rows and leaving one total per column.  If instead we used $\texttt{axis=1}$ we would sum across each row, leaving one total per row 

In [None]:
np.sum(B, axis=1)
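
Because the `axis` convention is easy to mix up, the following sketch repeats the idea on a tiny array whose totals can be checked by hand. The name `M` is introduced only for this example.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3): 2 rows, 3 columns

col_sums = M.sum(axis=0)    # collapses the row axis: one total per column
row_sums = M.sum(axis=1)    # collapses the column axis: one total per row

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```

One way to remember it: the axis you name is the one that disappears from the shape of the result.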

We can also index into an array by conditions.  For instance, remember the vector $\texttt{z}$

In [None]:
print(z)

Let's suppose I want to grab all of the entries in $\texttt{z}$ that are bigger than $15$.  The condition $\texttt{z > 15}$ returns a boolean array indicating whether each entry in $\texttt{z}$ satisfies the given condition 

In [None]:
z > 15

Now, if I want to actually extract those entries of $\texttt{z}$ that satisfy the condition we can index $\texttt{z}$ using this boolean array 

In [None]:
z[z>15]


### Question 6
Use numpy built-in functions to create an array with elements [2, 4, 8] and then output an array with values $[e^2, e^4, e^8]$ 

### Accessing elements in Arrays

We can access elements of Numpy arrays in a way similar to the way we access elements in Python lists. For instance, if we want to get the first $3$ elements of the array $\texttt{y}$, we can do so as follows: 

In [None]:
print(y)
# recall y is an nparray

print(y[0:3])



Just like with Python lists, if we're indexing from the start of the array there is no need to include the $0$ in the index range.  We can simply do 

In [None]:
y[:3]

Similarly, if we want to access everything from the third entry (indexed with a 2) to the end of the array, we can do 

In [None]:
# Recall y = [5. 10. 15. 20. 25.]
y[2:]

We can also index into an array by conditions.  For instance, remember the vector $\texttt{z}$

In [None]:
print(z)

Let's suppose I want to grab all of the entries in $\texttt{z}$ that are bigger than $15$.  The condition $\texttt{z > 15}$ returns a boolean array indicating whether each entry in $\texttt{z}$ satisfies the given condition 

In [None]:
z > 15
#type(z)

Now, if I want to actually extract those entries of $\texttt{z}$ that satisfy the condition we can index $\texttt{z}$ using this boolean array 

In [None]:
print(z)
z[z>15]

### Question 7
Write code to extract the entries of z that are divisible by 4: 

In [None]:
#Question 7

z[z%4==0]


# Reshaping

Often you will need to reshape matrices.  Suppose you have the following array:

In [None]:
np.arange(1,13)

**What will the following produce:**

```python
np.arange(1,13).reshape(4,3)
```

**Option A:**

```python
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
```

**Option B:**

```python
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
```

**Solution**

In [None]:
A = np.arange(1,13).reshape(4,3)
A

## Flattening Matrix

Flattening a matrix (higher dimensional array) produces a one dimensional array.

In [None]:
A.flatten()

Understanding the slice syntax

```python
begin:end:stride
```
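
A short sketch of the three parts of the slice syntax, using made-up arrays `v` and `G`. Any of the three parts may be omitted.

```python
import numpy as np

v = np.arange(10)          # the integers 0 through 9

evens_mid = v[2:8:2]       # begin=2, end=8 (exclusive), stride=2
every_third = v[::3]       # omitted begin and end default to the whole array
backwards = v[::-1]        # a negative stride walks from the end to the start

G = np.arange(1, 13).reshape(4, 3)
sub = G[::2, 1:]           # every other row, columns 1 onward

print(evens_mid, every_third, backwards)
print(sub)
```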


## Modifying a Slice

Suppose I wanted to set all entries in the top right corner of my matrix (the first two rows, last two columns) to 0.

In [None]:
H = np.arange(1,13).reshape(4,3)
print("Before:\n", H)

In [None]:
H[:2, 1:] = 0
print("After:\n", H)

# Boolean Indexing

We can apply boolean operations to arrays.  This is essential when trying to select and modify individual elements.


**Question:** *Given the following definition of A:*

```python
[[   1.    2.    3.]
 [   4.    5. -999.]
 [   7.    8.    9.]
 [  10. -999. -999.]]
```

*what will the following output:*
```python
A > 3
```


- **Option A:**

```python
False
```

- **Option B:**

```python
array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True, False, False]], dtype=bool)
```

In [None]:
A = np.array([[  1.,   2.,   3.],
       [  4.,   5.,   -999.0],
       [  7.,   8.,   9.],
       [ 10.,  -999.0,  -999.0]])

A > 3.

**Question:** *What will the following output*
```python
A = np.array([[   1.,    2.,    3.],
       [   4.,    5., -999.],
       [   7.,    8.,    9.],
       [  10., -999., -999.]])

A[A > 3]
```


- **Option A:**

```python
array([ 4,  7, 10,  5,  8, 11,  6,  9, 12])
```

- **Option B:**

```python
array([  4.,   5.,   7.,   8.,   9.,  10.])
```


- **Option C:**

```python
array([[  nan,   nan,  nan],
       [  4.,    5.,   nan],
       [  7.,    8.,   9.],
       [ 10.,    nan,  nan]])
```

In [None]:
A = np.array([[  1.,   2.,   3.],
       [  4.,   5.,   -999.0],
       [  7.,   8.,   9.],
       [ 10.,  -999.0,  -999.0]])

A[A > 3] 

**Question:** *Replace the -999.0 entries with `np.nan`.*

```python
array([[   1.,    2.,    3.],
       [   4.,    5., -999.],
       [   7.,    8.,    9.],
       [  10., -999., -999.]])
```

**Solution**

In [None]:
A = np.array([[  1.,   2.,   3.],
       [  4.,   5.,   -999.0],
       [  7.,   8.,   9.],
       [ 10.,  -999.0,  -999.0]])

* Construct a boolean array that indicates where the value is -999.0:

In [None]:
ind = (A == -999.0)
print(ind)

* Assign `np.nan` to all the `True` entries:

In [None]:
A[ind] = np.nan
A

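**Question:** *What might -999.0 represent?  Why might I want to replace the -999.0 with a `np.nan`?*

**Solution:** A value like -999.0 is often a placeholder for missing data. Replacing it with `np.nan` can be safer in calculations, because the placeholder can no longer be silently treated as a real measurement.  For example when computing the mean of the transformed A we get: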

In [None]:
print(A)
np.mean(A)

Perhaps instead we want:

In [None]:
np.nanmean(A)

In [None]:
help(np.nanmean)

# More Complex Bit Logic

Often we will want to work with multiple different arrays at once and select subsets of entries from each array.  Consider the following example:

In [None]:
names = np.array(["Joey", "Henry", "Joseph", 
                  "Jim", "Sam", "Deb", "Mike", 
                  "Bin", "Joe", "Andrew", "Bob"])

favorite_number = np.arange(len(names)) 

Suppose a subset of these people are staff members:

In [None]:
staff = ["Joey",  "Deb", "Sam"]

### Question 8:
*How could we compute the sum of the staff members' favorite numbers?*

One solution is to use for loops:

Another solution would be to use the [np.in1d](https://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html) function to determine which people are staff.
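
Both approaches can be sketched as follows. This is one possible version, not the only one. It repeats the definitions of `names`, `favorite_number`, and `staff` so the cell runs on its own, and it uses `np.isin`, the newer spelling of `np.in1d`. The names `total_loop` and `total_mask` are made up here.

```python
import numpy as np

names = np.array(["Joey", "Henry", "Joseph",
                  "Jim", "Sam", "Deb", "Mike",
                  "Bin", "Joe", "Andrew", "Bob"])
favorite_number = np.arange(len(names))
staff = ["Joey", "Deb", "Sam"]

# Loop version: add a person's number whenever their name is in `staff`.
total_loop = 0
for name, number in zip(names, favorite_number):
    if name in staff:
        total_loop += number

# Array version: build a boolean mask, then index with it.
is_staff = np.isin(names, staff)
total_mask = favorite_number[is_staff].sum()

print(total_loop, total_mask)
```

The mask `is_staff` has one boolean per person, so it can be reused to index any array aligned with `names`.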

Boolean indexing

### Question:
*What does the following expression compute:*

```python
starts_with_j = np.char.startswith(names, "J")
starts_with_j[is_staff].mean()
```

### Question:
*What does it mean to take the mean of an array of booleans?*

### Question
*What does the following expression compute:*

```python
favorite_number[starts_with_j & is_staff].sum()
```
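
The expressions in the questions above can be tried out with the sketch below. It assumes `is_staff` is the boolean mask marking staff members, built here with `np.isin`, and it repeats the earlier definitions so that it is self-contained. The names `frac` and `both` are made up for this example.

```python
import numpy as np

names = np.array(["Joey", "Henry", "Joseph",
                  "Jim", "Sam", "Deb", "Mike",
                  "Bin", "Joe", "Andrew", "Bob"])
favorite_number = np.arange(len(names))
is_staff = np.isin(names, ["Joey", "Deb", "Sam"])   # assumed definition

starts_with_j = np.char.startswith(names, "J")

# True counts as 1 and False as 0, so the mean of a boolean array
# is the fraction of entries that are True.
frac = starts_with_j[is_staff].mean()

# & combines two masks element-wise: True only where both are True.
both = starts_with_j & is_staff

print(frac)
print(names[both], favorite_number[both].sum())
```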

## Other Useful Operations

* [`choose()`](https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.choose.html)
* [`where()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html)
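
As a brief, hedged illustration of these two functions, the sketch below uses made-up arrays `vals` and `picks`. See the linked documentation for the full set of options.

```python
import numpy as np

vals = np.array([1.0, -999.0, 3.0, -999.0])

# Three-argument where: take the second argument where the condition
# holds and the third argument everywhere else.
cleaned = np.where(vals == -999.0, np.nan, vals)

# One-argument where: return the indices at which the condition holds.
(idx,) = np.where(vals == -999.0)

# choose: entry i of `picks` says which of the listed arrays
# supplies element i of the result.
picks = np.array([0, 1, 0, 1])
mixed = np.choose(picks, [np.zeros(4), np.ones(4)])

print(cleaned, idx, mixed)
```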



# A Note on using Array operations

In [None]:
data = np.random.rand(1000000)

Consider the following two programs.  


### Program A
```python
s = 0
c = 0
for x in data:
    if x > 0.5:
        s += x
        c += 1
result = s/c
```

### Program B
```python
result = data[data > 0.5].mean()
```

1. What do they do?
1. Which one is faster?
1. Which one is clearer?

---

<br/><br/><br/><br/><br/>

### Solution

## Important Points

Using the array abstractions instead of looping can often be:

1. Clearer
2. Faster

These are fundamental goals of abstraction. 
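
One way to check both claims yourself is to run the two programs and time them, as in the rough sketch below. It uses a smaller, seeded array than the one above so that it finishes quickly. The timing names are made up, and the measured times will vary from machine to machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
data = rng.random(100_000)

# Program A: explicit Python loop.
start = time.perf_counter()
s, c = 0.0, 0
for x in data:
    if x > 0.5:
        s += x
        c += 1
result_a = s / c
loop_seconds = time.perf_counter() - start

# Program B: boolean indexing followed by mean.
start = time.perf_counter()
result_b = data[data > 0.5].mean()
array_seconds = time.perf_counter() - start

# The two results agree up to floating point rounding.
print(np.isclose(result_a, result_b))
print(f"loop: {loop_seconds:.4f}s  array: {array_seconds:.4f}s")
```

Both programs compute the mean of the entries greater than 0.5.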

<br/>
<br/>
<br/>




## Be Careful with Floating Point Numbers

What is the value of the following:
$$
A - \exp \left( \log \left( A \right) \right)
$$

<br/>
<br/>
<br/>

**Solution:**


In [None]:
A = np.arange(1., 13.).reshape(4,3)
print(A)

(A - np.exp(np.log(A)))

**What happened?!**

Floating point precision is not perfect and we are applying transcendental functions.

<br/><br/><br/><br/>
### A simpler example

What is the value of the following expression:

```python
0.1 + 0.2 == 0.3
```

In [None]:
0.1 + 0.2 == 0.3

In [None]:
print(0.1 + 0.2)

For these situations consider using `np.isclose`:

In [None]:
help(np.isclose)
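
A minimal sketch of `np.isclose` applied to the two situations above, using the default tolerances. The name `roundtrip` is made up for this example.

```python
import numpy as np

exact = (0.1 + 0.2 == 0.3)              # exact comparison
close = np.isclose(0.1 + 0.2, 0.3)      # comparison within a tolerance
print(exact, close)

A = np.arange(1., 13.).reshape(4, 3)
roundtrip = np.exp(np.log(A))

# isclose works element-wise; allclose reduces the result to a single boolean.
print(np.isclose(A, roundtrip))
print(np.allclose(A, roundtrip))
```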

<br/>
<br/>
<br/>
<br/>


## Aggregating along an axis 


### Summing over the rows (`axis=0`):

In [None]:
A.sum(axis=0)

This is the same as:
```python
(nrows, ncols) = A.shape

s = np.zeros(ncols)

for i in range(nrows):
    s += A[i,:]

print(s)
```

### Summing over the columns (`axis=1`):

In [None]:
A.sum(axis=1)

This is the same as:
```python
(nrows, ncols) = A.shape

s = np.zeros(nrows)

for i in range(ncols):
    s += A[:,i]

print(s)
```

## Other Functions to Check Out

* [`mean`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) computes the mean 
* [`std`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html) standard deviation
* [`var`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html) variance

and [many more](https://docs.scipy.org/doc/numpy/reference/ufuncs.html#math-operations)

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)