## 1. What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then cast the column to a `np.array`.

In [9]:
import pandas as pd
import numpy as np
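# Forward slashes work on every platform and avoid invalid '\p' escapes
data = pd.read_csv('data/president_heights.csv')
data2 = np.array(data['height(cm)'])

# Row positions of the shortest and tallest presidents
i_min = data2.argmin()
i_max = data2.argmax()
std = data2.std()
avg = data2.mean()
# Look the rows up by position instead of hard-coding the indices
small = data.iloc[i_min]
tall = data.iloc[i_max]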

avg, std, small, tall

(179.73809523809524,
 6.931843442745892,
 order                     4
 name          James Madison
 height(cm)              163
 Name: 3, dtype: object,
 order                      16
 name          Abraham Lincoln
 height(cm)                193
 Name: 15, dtype: object)

## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

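(This is already implemented in `np.poly1d`, but use that only to test your function. Note that `np.poly1d` expects the coefficients from the highest power down, the reverse of the order in equation (1).)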

- Hint: Use `np.cumprod()`  


In [None]:

import numpy as np

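def poly(x, coeff):
    # float dtype so that a non-integer x is not truncated
    coeff = np.asarray(coeff, dtype=float)
    # array [x, x, ..., x] with the same shape as coeff
    powers = np.full(coeff.shape, x, dtype=float)
    # x**0 == 1
    powers[0] = 1.0
    # running product gives 1, x, x*x, x*x*x, etc
    powers = np.cumprod(powers)
    # elementwise multiplication and sum
    return np.sum(powers * coeff)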

poly(5, [2, 1, 1])
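
One way to check the function, as the exercise suggests, is to compare it with `np.poly1d`. The sketch below repeats a `cumprod`-based `poly` so that it runs on its own, and it reverses the coefficients before passing them to `np.poly1d`, which expects the highest power first.

```python
import numpy as np

def poly(x, coeff):
    coeff = np.asarray(coeff, dtype=float)
    powers = np.full(coeff.shape, x, dtype=float)
    powers[0] = 1.0
    return np.sum(np.cumprod(powers) * coeff)

coeff = [2, 1, 1]                      # p(x) = 2 + x + x^2
x = 5
reference = np.poly1d(coeff[::-1])     # poly1d wants a_N first
print(poly(x, coeff), reference(x))
print(np.isclose(poly(x, coeff), reference(x)))
```

Comparing with `np.isclose` rather than `==` keeps the check robust to floating-point rounding.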

## Exercise 3: Softmax

Read in `data/iris.csv` and compute the softmax of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = \{x_0, x_1, \ldots, x_{n-1}\}$ is

$$
\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}
$$

Your result should be equal to the output of `scipy.special.softmax`.

In [8]:
from scipy import special

import pandas as pd
import numpy as np
data = pd.read_csv('data/iris.csv')
data2 = np.array(data['sepallength'])

def softmax(arr):
    arr = np.array(arr)
    # Exponentiate once, then normalise so the outputs sum to 1
    exps = np.exp(arr)
    return exps / exps.sum()

a = softmax(data2)
b = special.softmax(data2)

b[0:5], a[0:5]
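# These results agree to the displayed precision, but an exact '=='
# test fails because of floating-point rounding;
# compare with np.allclose(a, b) instead.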

(array([0.00221959, 0.00181724, 0.00148783, 0.00134625, 0.00200836]),
 array([0.00221959, 0.00181724, 0.00148783, 0.00134625, 0.00200836]))
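
A common refinement, not required by the exercise, is to subtract the maximum before exponentiating. This leaves the result mathematically unchanged but avoids overflow for large inputs. The sketch below uses hypothetical sample values rather than the iris file, and the names `stable_softmax` and `naive_softmax` are illustrative.

```python
import numpy as np

def stable_softmax(arr):
    arr = np.asarray(arr, dtype=float)
    # Shifting by the maximum cancels in the ratio but keeps exp() bounded
    exps = np.exp(arr - arr.max())
    return exps / exps.sum()

def naive_softmax(arr):
    arr = np.asarray(arr, dtype=float)
    exps = np.exp(arr)
    return exps / exps.sum()

x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])   # hypothetical sample values
print(np.allclose(stable_softmax(x), naive_softmax(x)))

# Inputs this large would overflow in the naive version
print(stable_softmax(np.array([1000.0, 1001.0])))
```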

## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
Output contains 10 columns representing numbers from 1 to 10. The values are the counts of the numbers in the respective rows.
For example, cell (0, 2) has the value 2, which means that the number 3 occurs exactly twice in the first row.

In [7]:
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
out = np.zeros_like(arr)

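for i in range(len(arr)):
    # Unique values in row i and how often each one occurs
    val, counts = np.unique(arr[i], return_counts=True)
    # The count for value v belongs in column v - 1
    out[i, val-1] = counts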

print(out, arr, sep='\n')

[[1 0 2 1 1 1 0 2 2 0]
 [2 1 3 0 1 0 1 0 1 1]
 [0 3 0 2 3 1 0 1 0 0]
 [1 0 2 1 0 1 0 2 1 2]
 [2 2 2 0 0 1 1 1 1 0]
 [1 1 1 1 1 2 0 0 2 1]]
[[ 9  9  4  8  8  1  5  3  6  3]
 [ 3  3  2  1  9  5  1 10  7  3]
 [ 5  2  6  4  5  5  4  8  2  2]
 [ 8  8  1  3 10 10  4  3  6  9]
 [ 2  1  8  7  3  1  9  3  6  2]
 [ 9  2  6  5  3  9  4  6  1 10]]
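
The same counts can also be obtained without a Python loop by comparing every element against the candidate values 1 to 10 with broadcasting. This is a sketch of one possible approach, shown on the first two rows of the array above.

```python
import numpy as np

arr = np.array([[9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
                [3, 3, 2, 1, 9, 5, 1, 10, 7, 3]])
values = np.arange(1, 11)

# Shape (rows, cols, 1) compared with shape (10,) broadcasts to
# (rows, cols, 10); summing over axis 1 counts matches per row and value
counts = (arr[:, :, None] == values).sum(axis=1)
print(counts)
```

The intermediate boolean array has `rows * cols * 10` entries, so this trades memory for the absence of a loop.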


## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [6]:
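def onehot(inp):
    inp = np.asarray(inp)
    rows = len(inp)
    # One column per integer from 1 up to the largest value in inp
    cols = inp.max()
    out = np.zeros((rows, cols))
    # Set column inp[i] - 1 of row i to 1, for all rows at once
    out[np.arange(rows), inp - 1] = 1
    return out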
    
onehot([2, 3, 2, 2, 2, 10])

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
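
The function above creates one column for every integer from 1 up to the largest value, so values that never occur get an all-zero column. If one column per unique value is wanted instead, a possible sketch uses `np.unique` with `return_inverse=True`. The name `onehot_unique` is illustrative.

```python
import numpy as np

def onehot_unique(inp):
    # inverse[i] is the position of inp[i] among the sorted unique values
    uniq, inverse = np.unique(np.asarray(inp), return_inverse=True)
    # Row j of the identity matrix is the one-hot vector for class j
    return np.eye(len(uniq))[inverse]

print(onehot_unique([2, 3, 2, 2, 2, 10]))
```

For the input `[2, 3, 2, 2, 2, 10]` this yields three columns (for 2, 3 and 10) instead of ten.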

## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) of a discrete distribution, that is, `q[i]` is the probability of outcome `i`.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows:

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`.
It helps to sketch the intervals on paper.

**Your exercise is to speed it up using NumPy, avoiding explicit loops**

- Hint: Use `np.searchsorted` and `np.cumsum`  
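
To see how the two hinted functions fit together, the sketch below applies them to `q = [0.25, 0.75]` with a few hand-picked values of `U` (chosen for illustration, not drawn at random).

```python
import numpy as np

q = np.array([0.25, 0.75])
edges = np.cumsum(q)            # right endpoints of I_0 and I_1: [0.25, 1.0]

U = np.array([0.1, 0.25, 0.6, 0.99])
# With the default side='left', searchsorted returns the i with
# edges[i-1] < U <= edges[i], matching the test in sample()
idx = np.searchsorted(edges, U)
print(idx)                      # [0 0 1 1]
```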


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.

In [5]:

import numpy as np


class DiscreteRV():
    def __init__(self, q):
        self.q = q
        # Right endpoints of the subintervals I_0, ..., I_{n-1}
        self.space = np.cumsum(q)
        
    def __repr__(self):
        return 'DiscreteRV(' + str(self.q) + ')'
        
    def draw(self, k=1):
        # k uniform draws on [0, 1), located in the subintervals in one call
        U = np.random.uniform(0, 1, size=k)
        # each entry is the result of a single draw as an index of self.space
        return np.searchsorted(self.space, U)
        
DRV = DiscreteRV([0.1, 0.25, 0.65])
print(*DRV.draw(10), sep='\n')

2
2
2
2
2
2
2
2
2
1
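
A quick sanity check for any implementation is to draw many samples and compare the empirical frequencies with `q`. The sketch below is only an illustration: the helper `draw`, the seed and the sample size are hypothetical choices, and it uses NumPy's `default_rng` rather than the `random` module.

```python
import numpy as np

def draw(q, k, rng):
    # Locate k uniform draws among the cumulative sums of q
    return np.searchsorted(np.cumsum(q), rng.uniform(0, 1, size=k))

q = np.array([0.1, 0.25, 0.65])
rng = np.random.default_rng(0)
samples = draw(q, 100_000, rng)

# bincount tallies how often each index occurs
freq = np.bincount(samples, minlength=len(q)) / len(samples)
print(freq)
```

With this many draws the frequencies should sit close to `[0.1, 0.25, 0.65]`, which also explains why most of the ten draws above came out as 2.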


## Exercise 7: Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of the peak values 7 and 6.

### 1. Solve this using a regular Python for loop

### 2. Solve this using no loops and only NumPy functions

In [1]:

def peak_find(arr):
    out = []
    n = len(arr)
    for i in range(1, n-1):
        # 'and', not '&': '&' binds tighter than '>' and would be applied first
        if arr[i] > arr[i-1] and arr[i] > arr[i+1]:
            out.append(i)
            
    return out

peak_find([1, 3, 7, 1, 2, 6, 0, 1])

[2, 5]

In [10]:
def loopless_peak(inp):
    inp = np.asarray(inp)
    # Interior points only; slicing avoids the wrap-around of np.roll,
    # which would compare the first element with the last one
    mid = inp[1:-1]
    is_peak = (mid > inp[:-2]) & (mid > inp[2:])
    # np.where indexes into mid, so shift by 1 to index into inp
    return np.where(is_peak)[0] + 1

# Note, this design doesn't report
# 'peaks' in the [0] or [-1] positions 

test = [1,3,2,4,3]
loopless_peak(test)

array([1, 3], dtype=int64)
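
Another loop-free formulation, sketched here as an alternative, looks at where the slope changes from positive to negative using `np.diff` and `np.sign`. The name `diff_peaks` is illustrative, and the function only considers interior positions.

```python
import numpy as np

def diff_peaks(a):
    a = np.asarray(a)
    # +1 where the array rises, -1 where it falls, 0 where it is flat
    slope = np.sign(np.diff(a))
    # A drop from +1 to -1 (a difference of -2) marks a strict peak
    return np.where(np.diff(slope) == -2)[0] + 1

print(diff_peaks([1, 3, 7, 1, 2, 6, 0, 1]))   # [2 5]
```

Because a flat step has slope 0, plateaus are not reported as peaks, which matches the "smaller values on both sides" definition.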