## Martin Dionne

## 1. What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then cast the column to a NumPy array with `np.array`.

In [124]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/president_heights.csv')
arr = np.array(df['height(cm)'])

min_h = arr.argmin()
max_h = np.where(arr == arr.max())

In [123]:
print(f'Average height is {round(arr.mean(),1)} with a {round(arr.std(),2)} standard deviation.')
print('')
print(f"The smallest president is {df['name'][min_h]}")
print(f"The tallest presidents are {df['name'][max_h[0][0]]} and {df['name'][max_h[0][1]]}")

Average height is 179.7 with a 6.93 standard deviation.

The smallest president is James Madison
The tallest presidents are Abraham Lincoln and Lyndon B. Johnson
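The cell above hardcodes two indices for the tallest presidents, which only works when exactly two share the maximum. A minimal sketch of a tie-safe lookup follows; the `toy` DataFrame and its names and heights are made up for illustration and stand in for the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: made-up names and heights
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height(cm)': [170, 193, 163, 193],
})
heights = toy['height(cm)'].to_numpy()

# flatnonzero returns every index where the condition holds, so ties are kept
tallest_idx = np.flatnonzero(heights == heights.max())
shortest_idx = np.flatnonzero(heights == heights.min())

tallest = toy['name'].iloc[tallest_idx].tolist()
shortest = toy['name'].iloc[shortest_idx].tolist()
print(tallest, shortest)
```

Joining the resulting lists with `', '.join(...)` would give a printable sentence for any number of ties.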


## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

(This is already implemented in `np.poly1d`, but use that only to test your function)

- Hint: Use `np.cumprod()`  


In [118]:
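def p(x, coeff):
    # coeff is ordered from the highest power down, the same convention as np.poly1d
    coeff = np.array(coeff)
    exp = np.flip(np.arange(0, coeff.size)) # e.g. [2, 1, 0] for three coefficients
    y = ((x ** exp) * coeff).sum()
    return y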

In [119]:
print(p(5,[1,2,3]))

poly = np.poly1d([1,2,3])
print(poly(5))

38
38
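The hint mentions `np.cumprod`, which the solution above does not use. As a rough sketch of that route, the function below builds the powers of `x` with a cumulative product. Note that `p_cumprod` is a hypothetical name, and it takes the coefficients in the order of formula (1), lowest degree first, which is the reverse of `np.poly1d`.

```python
import numpy as np

def p_cumprod(x, coeff):
    """Evaluate a_0 + a_1 x + ... + a_N x^N for coeff = [a_0, ..., a_N]."""
    coeff = np.asarray(coeff, dtype=float)
    # Build [1, x, x, ..., x]; its cumulative product is [1, x, x^2, ..., x^N]
    factors = np.full(coeff.size, x, dtype=float)
    factors[0] = 1.0
    powers = np.cumprod(factors)
    # Dot product of the powers with the coefficients gives the sum in (1)
    return powers @ coeff

coeff = [3, 2, 1]                     # 3 + 2x + x^2
print(p_cumprod(5, coeff))            # 38.0
# np.poly1d expects the highest-degree coefficient first, so reverse the list
print(np.poly1d(coeff[::-1])(5))      # 38
```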


## Exercise 3: Softmax

Read in `data/iris.csv` and compute the softmax of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = \{x_0, x_1, ..., x_{n-1}\}$ is

$$\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}$$


Your result should be equal to the output of `scipy.special.softmax`

In [116]:

import scipy.special  # import the submodule explicitly; the name `softmax` is used for our own result below

df = pd.read_csv('data/iris.csv')
sl = np.array(df['sepallength'])
# np.exp(x) is faster than np.e ** x
softmax = np.exp(sl) / np.sum(np.exp(sl)) 
#print(softmax)

In [117]:
# must round, can't compare floats as is
print(scipy.special.softmax(sl).round(5) == softmax.round(5))

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]
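Two refinements may be worth considering here. First, `np.allclose` compares with a tolerance and returns a single boolean, which avoids rounding both arrays. Second, subtracting the maximum before exponentiating leaves the result unchanged (the common factor cancels) but prevents overflow for large inputs. The sketch below uses made-up values rather than the iris column, and `stable_softmax` is a hypothetical name.

```python
import numpy as np

def stable_softmax(x):
    x = np.asarray(x, dtype=float)
    # exp(x - max) / sum(exp(x - max)) equals exp(x) / sum(exp(x)),
    # but the largest exponent is 0, so np.exp cannot overflow
    z = np.exp(x - x.max())
    return z / z.sum()

x = np.array([5.1, 4.9, 7.0, 6.3])            # made-up values, not the iris data
naive = np.exp(x) / np.exp(x).sum()
print(np.allclose(stable_softmax(x), naive))  # one tolerance-based comparison

# With inputs this large the naive formula would overflow to inf / inf
big = stable_softmax(np.array([1000.0, 1001.0]))
print(big)
```

The same `np.allclose` call can be applied to the output of `scipy.special.softmax` in place of the element-wise rounded comparison.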


## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
Output contains 10 columns representing numbers from 1 to 10. The values are the counts of the numbers in the respective rows.
For example, cell (0, 2) has the value 2, which means that the number 3 occurs exactly twice in the first row.

In [114]:
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))

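def count_by_row(arr):
    # one output column per value 1..arr.max(); this need not equal the number of input columns
    cbr = np.empty((arr.shape[0], arr.max()), dtype=int)
    for i in range(0,arr.max()):
        # column i holds, for each row, how many entries equal i+1
        cbr[:,i]  = np.count_nonzero(arr == i+1, axis=1)
    return cbr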

In [115]:
count_by_row(arr)

array([[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
       [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
       [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
       [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
       [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]])
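The loop over values can also be replaced by broadcasting. The following is a sketch under the assumption that the values to count are the integers 1 to 10, as in the exercise; it compares every entry with every candidate value at once and sums along the position axis.

```python
import numpy as np

np.random.seed(100)
arr = np.random.randint(1, 11, size=(6, 10))

values = np.arange(1, 11)              # the numbers being counted
# arr[:, :, None] has shape (6, 10, 1); comparing with values (10,)
# broadcasts to (6, 10, 10), indexed as [row, position, value]
matches = arr[:, :, None] == values
counts = matches.sum(axis=1)           # sum over positions -> shape (6, 10)
print(counts)
```

This trades memory for speed: the intermediate boolean array has rows × columns × values entries, so the loop may still be preferable for very large inputs.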

## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [112]:
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr

#generalize case
def one_hot(arr):
    oh = np.zeros((len(arr), arr.max()), dtype=float)
    for i in range(len(arr)):
        oh[i,arr[i]-1] = 1
    return oh

In [113]:
print(arr)
one_hot(arr)

[2 3 2 2 2 1]


array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [7]:
np.random.seed(101)
arr2 = np.random.randint(1,10, size=8)

print(arr2)
one_hot(arr2)

[2 7 8 9 5 9 6 1]


array([[0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.]])
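A loop-free variant is possible by indexing into an identity matrix. The sketch below makes the same assumption as `one_hot` above, namely that the values are positive integers and that one column is wanted for each value from 1 to the maximum; `one_hot_vec` is a hypothetical name.

```python
import numpy as np

def one_hot_vec(arr):
    arr = np.asarray(arr)
    # Row k of the identity matrix is the one-hot vector for the value k + 1,
    # so indexing with arr - 1 selects one such row per element
    return np.eye(arr.max())[arr - 1]

encoded = one_hot_vec(np.array([2, 3, 2, 2, 2, 1]))
print(encoded)
```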

## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) of a discrete distribution, so that `q[i]` is the probability of event `i`.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows:

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`.
It helps to sketch the intervals on paper.

**Your exercise is to speed it up using NumPy, avoiding explicit loops**

- Hint: Use `np.searchsorted` and `np.cumsum`  
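To see how the two hinted functions relate to the intervals, the sketch below traces `q = [0.25, 0.75]` with a few hand-picked values of $ U $ instead of random draws. `np.cumsum` gives the right endpoints of the subintervals, and `np.searchsorted` returns the first index whose endpoint is at least $ U $, which matches the `a < U <= a + q[i]` test in the loop version.

```python
import numpy as np

q = np.array([0.25, 0.75])
edges = np.cumsum(q)              # right endpoints of I_0 and I_1: [0.25, 1.0]

# Hand-picked values of U, chosen to land in each interval and on a boundary
U = np.array([0.1, 0.25, 0.3, 0.9])
idx = np.searchsorted(edges, U)   # first i with U <= edges[i]
print(idx)                        # [0 0 1 1]
```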


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.

In [111]:
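class DiscreteRV:
    def __init__(self, q=(1.0,)):
        self.q = np.asarray(q, dtype=float)
        # compare with a tolerance: a floating-point sum is rarely exactly 1
        if not np.isclose(self.q.sum(), 1.0):
            raise ValueError("q must sum to 1")
        # right endpoints of the subintervals I_0, ..., I_{n-1}
        self.Q = np.cumsum(self.q)

    def draw(self, k=1):
        # k uniform draws, each mapped to the index of the subinterval containing it
        return np.searchsorted(self.Q, np.random.uniform(0, 1, size=k))

In [110]:
drv = DiscreteRV([0.25, 0.75])
drv.draw(10)  # array of ten draws, each 0 or 1; values differ between runs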
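One way to sanity-check a sampler of this kind is to draw many values and compare the empirical frequencies with `q`. The sketch below is self-contained and assumes the `np.cumsum` plus `np.searchsorted` approach from the hint; the seed is fixed only so that the check is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only for repeatability
q = np.array([0.25, 0.75])

# 100,000 uniform draws mapped to interval indices in one vectorized call
draws = np.searchsorted(np.cumsum(q), rng.random(100_000))

# Fraction of draws that landed on each index; should be close to q
freq = np.bincount(draws, minlength=len(q)) / draws.size
print(freq)
```

Every draw should lie in `range(len(q))`; an index equal to `len(q)` would indicate that the search was run on `q` itself rather than on its cumulative sum.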

## Exercise 7: Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of the peak values 7 and 6.

### 1. Solve this using a regular Python for loop

### 2. Solve this using no loops and only NumPy functions

In [108]:
def peaks(arr):
    p = []
    # skip the endpoints: they have a neighbour on one side only
    for i in range(1, len(arr)-1):
        if arr[i-1] < arr[i] > arr[i+1]:
            p.append(i)
    return p

In [109]:

arr = [1, 3, 7, 1, 2, 6, 0, 1]
print(peaks(arr))

[2, 5]


In [102]:
import numpy as np

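def np_peaks(arr):
    arr = np.asarray(arr)
    left = np.roll(arr, 1)    # left[i] == arr[i-1]
    right = np.roll(arr, -1)  # right[i] == arr[i+1]
    is_peak = (arr > left) & (arr > right)
    # np.roll wraps around, so the endpoints were compared with the opposite end; exclude them
    is_peak[[0, -1]] = False
    return np.where(is_peak)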

In [126]:
arr = np.array([1, 3, 7, 1, 2, 6, 0, 1]) 
print(np_peaks(arr))

arr = np.array([1, 3, 7, 4, 1, 2, 6, 0, 1])
print(np_peaks(arr))

(array([2, 5], dtype=int64),)
(array([2, 6], dtype=int64),)


In [106]:
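# initial solution: flags every downward step, so it reports extra indices when a peak is followed by two or more consecutive decreases (e.g. 7, 4, 1)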
def np_peaks_old(arr):
    d = np.diff(arr) 
    s = np.sign(d)
    return np.where(s == s.min())

In [107]:
arr = np.array([1, 3, 7, 1, 2, 6, 0, 1])
print(np_peaks_old(arr))

(array([2, 5], dtype=int64),)
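The `np.diff` idea can be made to work by looking at where the sign of the slope changes rather than where it is negative. A possible sketch, under the exercise's definition of a peak as strictly larger than both neighbours (`np_peaks_diff` is a hypothetical name):

```python
import numpy as np

def np_peaks_diff(arr):
    s = np.sign(np.diff(arr))     # +1 where rising, -1 where falling, 0 where flat
    # A strict peak is a rise immediately followed by a fall: +1 then -1,
    # which shows up as a change of -2 in the sign sequence.
    # The +1 shifts from the position in the diff array back to the original index.
    return np.where(np.diff(s) == -2)[0] + 1

print(np_peaks_diff(np.array([1, 3, 7, 1, 2, 6, 0, 1])))      # [2 5]
print(np_peaks_diff(np.array([1, 3, 7, 4, 1, 2, 6, 0, 1])))   # [2 6]
```

Because it only inspects interior slope changes, this version never reports the first or last element as a peak.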
