## 1. What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then cast the column to a `np.array`.

In [342]:
# to get heights and president names. 

import pandas as pd 
import numpy as np 

my_file = pd.read_csv('/Users/mayarossi/DS-Workshop/m1-6-numpy/data/president_heights.csv')

my_file = np.array(my_file)


# mean height
heights = my_file[:, 2]
print(f"Mean height is {heights.mean()}cm.")
# standard dev of height
print(f"Standard dev of heights is {heights.std()}cm.")
# smallest pres
presidents = my_file[:,1]
print(f"Smallest pres height is {heights.min()}cm and is President {presidents[heights.argmin()]}.")
# tallest pres
print(f"Tallest pres height is {heights.max()}cm and is President {presidents[heights.argmax()]}.")

Mean height is 179.73809523809524cm.
Standard dev of heights is 6.931843442745893cm.
Smallest pres height is 163cm and is President James Madison.
Tallest pres height is 193cm and is President Abraham Lincoln.
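
As a minimal sketch of the aggregates used above, the snippet below works on a small made-up array rather than the real CSV. The names and heights are hypothetical toy values, chosen only so the results are easy to check by hand.

```python
import numpy as np

# Hypothetical toy data, NOT taken from president_heights.csv.
names = np.array(["A", "B", "C", "D"])
heights = np.array([170, 185, 163, 193])

mean_height = heights.mean()         # arithmetic mean
std_height = heights.std()           # population standard deviation (ddof=0)
shortest = names[heights.argmin()]   # argmin returns the index of the minimum
tallest = names[heights.argmax()]    # argmax returns the index of the maximum

print(mean_height, std_height, shortest, tallest)
```

Because `argmin` and `argmax` return positions rather than values, the same index can be used to look up the matching label in a second array.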


## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

(This is already implemented in `np.poly1d`, but use that only to test your function.)

- Hint: Use `np.cumprod()`  


In [344]:
import numpy as np

def p(x, coeff):
    coeff = np.array(coeff)

    # Pick a dtype wide enough for both x and coeff, so a non-integer x is not truncated.
    y = np.ones(len(coeff), dtype=np.result_type(x, coeff))
    y[1:] = x
    y2 = np.cumprod(y) * coeff  # cumprod gives x^0, x^1, x^2, ...
    return y2.sum()


p(5, [2, 1, 1])

32
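
One way to test such a function against `np.poly1d` is sketched below. The name `poly_eval` is a hypothetical stand-in for the exercise function, written out again here so the snippet runs on its own. Note that `np.poly1d` expects coefficients ordered from the highest degree to the lowest, so the list is reversed before the comparison.

```python
import numpy as np

def poly_eval(x, coeff):
    """Evaluate a_0 + a_1 x + ... + a_N x^N without a Python loop (illustrative name)."""
    coeff = np.asarray(coeff, dtype=float)
    powers = np.ones(len(coeff))
    powers[1:] = x                              # [1, x, x, ..., x]
    return np.sum(coeff * np.cumprod(powers))   # cumprod -> [1, x, x^2, ...]

coeff = [2, 1, 1]   # a_0 = 2, a_1 = 1, a_2 = 1
x = 0.5

# np.poly1d takes the highest-degree coefficient first, hence coeff[::-1].
reference = np.poly1d(coeff[::-1])(x)

print(poly_eval(x, coeff), reference)
```

Testing with a non-integer `x` is deliberate: it catches accidental integer truncation inside the function.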

## Exercise 3: Softmax

Read in `data/iris.csv` and compute the softmax of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = \{x_0, x_1, \ldots, x_{n-1}\}$ is

$$\sigma(x)_j = \frac{e^{x_j}}{\sum_{k=0}^{n-1} e^{x_k}}$$


Your result should be equal to the output of `scipy.special.softmax`.

In [322]:
import numpy as np
import pandas as pd

my_file = pd.read_csv('/Users/mayarossi/DS-Workshop/m1-6-numpy/data/iris.csv')
sepal_length = np.array(my_file["sepallength"])

def softmax(x):
    e = np.exp(x)  # compute the exponentials once
    return e / np.sum(e)  # np.sum instead of the slower built-in sum

softmax(sepal_length)



array([0.00221959, 0.00181724, 0.00148783, 0.00134625, 0.00200836,
       0.00299613, 0.00134625, 0.00200836, 0.00110221, 0.00181724,
       0.00299613, 0.00164431, 0.00164431, 0.00099732, 0.0044697 ,
       0.00404435, 0.00299613, 0.00221959, 0.00404435, 0.00221959,
       0.00299613, 0.00221959, 0.00134625, 0.00221959, 0.00164431,
       0.00200836, 0.00200836, 0.00245302, 0.00245302, 0.00148783,
       0.00164431, 0.00299613, 0.00245302, 0.00331123, 0.00181724,
       0.00200836, 0.00331123, 0.00181724, 0.00110221, 0.00221959,
       0.00200836, 0.00121813, 0.00110221, 0.00200836, 0.00221959,
       0.00164431, 0.00221959, 0.00134625, 0.00271101, 0.00200836,
       0.01483991, 0.00814432, 0.01342771, 0.00331123, 0.00900086,
       0.00404435, 0.00736928, 0.00181724, 0.00994749, 0.00245302,
       0.00200836, 0.00493978, 0.0054593 , 0.00603346, 0.00365948,
       0.01099368, 0.00365948, 0.0044697 , 0.006668  , 0.00365948,
       0.00493978, 0.00603346, 0.00736928, 0.00603346, 0.00814
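
The direct formula can overflow when the inputs are large, because `np.exp` of a big number is `inf`. A common workaround, sketched here under the assumption that `x` is a 1D array, is to subtract the maximum before exponentiating. This leaves the result unchanged mathematically, since the common factor cancels in the ratio. The name `softmax_stable` is illustrative only.

```python
import numpy as np

def softmax_stable(x):
    """Softmax with the maximum subtracted first to avoid overflow (illustrative name)."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)   # largest entry becomes 0, so exp never overflows
    e = np.exp(shifted)
    return e / np.sum(e)

small = np.array([1.0, 2.0, 3.0])
naive = np.exp(small) / np.sum(np.exp(small))   # direct formula for comparison

big = np.array([1000.0, 1001.0, 1002.0])        # np.exp(1000.0) alone would overflow
probs = softmax_stable(big)

print(softmax_stable(small), naive)
print(probs)
```

For inputs the size of sepal lengths the two versions agree, so the shift only matters as a safeguard.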

## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
The output contains 10 columns representing the numbers from 1 to 10. The values are the counts of those numbers in the respective rows.
For example, Cell(0,2) has the value 2, which means that the number 3 occurs exactly twice in the first row.

In [354]:
# Using np.arrays

np.random.seed(100)
arr = np.random.randint(1, 11, size=(6, 10))
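def count_unique(row):
    # Count how often each value 1..10 occurs in a single row.
    val, count = np.unique(row, return_counts=True)
    r = np.zeros(10, dtype='int32')
    r[val - 1] = count  # the count for value v goes into column v - 1
    return r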

result = [count_unique(arr[i]) for i in range(arr.shape[0])]
np.array(result, dtype = 'int32')

array([[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
       [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
       [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
       [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
       [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]], dtype=int32)

In [206]:
# Using for loops

np.random.seed(100)
arr = np.random.randint(1, 11, size=(6, 10))
output = np.zeros((6, 10), dtype=int)

for i in range(len(arr)): # rows in the array (6)
    
    for j in range(len(arr[i])): # numbers in the row (10)

        count = 0
        
        for ij in range(len(arr[i])): 
            
            if arr[i][ij] == j + 1:
                count += 1
        
        output[i][j] = count

print(arr)
print(output)


[[ 9  9  4  8  8  1  5  3  6  3]
 [ 3  3  2  1  9  5  1 10  7  3]
 [ 5  2  6  4  5  5  4  8  2  2]
 [ 8  8  1  3 10 10  4  3  6  9]
 [ 2  1  8  7  3  1  9  3  6  2]
 [ 9  2  6  5  3  9  4  6  1 10]]
[[1 0 2 1 1 1 0 2 2 0]
 [2 1 3 0 1 0 1 0 1 1]
 [0 3 0 2 3 1 0 1 0 0]
 [1 0 2 1 0 1 0 2 1 2]
 [2 2 2 0 0 1 1 1 1 0]
 [1 1 1 1 1 2 0 0 2 1]]
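
Both solutions above still iterate over the rows in Python. A loop-free alternative, sketched here with broadcasting, compares every element against every candidate value at once. It uses only the first two rows of the exercise input, typed in literally so the snippet does not depend on the random seed.

```python
import numpy as np

# First two rows of the exercise input, copied literally.
arr = np.array([[9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
                [3, 3, 2, 1, 9, 5, 1, 10, 7, 3]])

values = np.arange(1, 11)   # candidate values 1..10

# arr[:, :, None] has shape (rows, 10, 1); comparing with values (shape (10,))
# broadcasts to (rows, 10, 10). Summing over axis 1 counts matches per row.
counts = (arr[:, :, None] == values).sum(axis=1)

print(counts)
```

The intermediate boolean array has `rows * cols * 10` entries, so this trades memory for speed and is best suited to small value ranges.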


## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [239]:
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr

def one_hot(arr):
    my_array = np.zeros((len(arr), arr.max()), dtype = 'int64')
    for i in range(len(arr)):
        my_array[i][arr[i]-1] = 1
    return my_array

one_hot(arr)



array([[0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0]])
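
The same encoding can be built without a Python loop by using integer array indexing. This sketch assumes, as the solution above does, that the values run from 1 to `arr.max()`. The input is typed in literally from the exercise statement.

```python
import numpy as np

arr = np.array([2, 3, 2, 2, 2, 1])   # input from the exercise statement

one_hot = np.zeros((arr.size, arr.max()))
# Row i gets a 1 in column arr[i] - 1; both index arrays are paired element-wise.
one_hot[np.arange(arr.size), arr - 1] = 1

print(one_hot)
```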

## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents the [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) of a discrete distribution. Recall that such a distribution can be stored as an array of event probabilities.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows:

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`.
It helps to sketch the intervals on paper.

**Your exercise is to speed it up using NumPy, avoiding explicit loops.**

- Hint: Use `np.searchsorted` and `np.cumsum`  
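
To see what the two hinted functions do, here is a small sketch with a made-up `q` and hand-picked values of `U` (both are illustrative, not part of the exercise). `np.cumsum` gives the right endpoints of the subintervals, and `np.searchsorted` finds which subinterval each value falls into.

```python
import numpy as np

q = np.array([0.25, 0.5, 0.25])   # illustrative probabilities
Q = np.cumsum(q)                  # right endpoints: [0.25, 0.75, 1.0]

U = np.array([0.1, 0.25, 0.6, 0.9])   # hand-picked values instead of random draws
# With the default side='left', searchsorted returns the i with Q[i-1] < U <= Q[i].
idx = np.searchsorted(Q, U)

print(Q)
print(idx)
```

The default `side='left'` matches the condition `a < U <= a + q[i]` in the loop version, so a value exactly on an endpoint is assigned to the interval on its left.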


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.

In [375]:
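import numpy as np
from numpy.random import uniform

class DiscreteRV:
    def __init__(self, q):
        self.q = np.asarray(q)
        self.Q = np.cumsum(self.q)  # right endpoints of the intervals I_0, ..., I_{n-1}

    def draw(self, k=1):
        U = uniform(0, 1, size=k)
        print(U)  # show the uniform draws so the mapping to indices can be checked
        # searchsorted returns the i with Q[i-1] < U <= Q[i], i.e. U lies in I_i
        return np.searchsorted(self.Q, U)

q = [0.25, 0.75]

m = DiscreteRV(q)
m.draw(4)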



[0.35950784 0.59885895 0.35479561 0.34019022]


array([1, 1, 1, 1])
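
Four draws say little about whether the sampler is correct. A rough sanity check, sketched below, is to draw many samples and compare the empirical frequencies with `q`. The helper name `draw_discrete`, the seed, and the sample size are assumptions made for this illustration.

```python
import numpy as np

def draw_discrete(q, k, rng):
    """Return k draws from the distribution q via inverse transform (illustrative name)."""
    Q = np.cumsum(q)
    U = rng.uniform(0, 1, size=k)
    return np.searchsorted(Q, U)

rng = np.random.default_rng(0)   # fixed seed so the check is repeatable
q = np.array([0.25, 0.75])

draws = draw_discrete(q, 100_000, rng)
# bincount tallies how often each index was drawn; dividing gives frequencies.
freq = np.bincount(draws, minlength=len(q)) / len(draws)

print(freq)
```

With this many draws the frequencies should land close to `[0.25, 0.75]`, although they will not match exactly.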

## Exercise 7: Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of the peak values 7 and 6.

### 1. Solve this using a regular Python for loop

### 2. Solve this using no loops and only numpy functions

In [266]:
# 1. Solve with python for loop

def peak_python(arr):
    my_list = []
    # Skip the endpoints: they do not have a neighbour on both sides.
    for i in range(1, len(arr) - 1):
        if arr[i-1] < arr[i] and arr[i+1] < arr[i]:
            my_list.append(i)
    return my_list

peak_python(np.array([1,3,7,1,2,6,0,1]))


[2, 5]

In [378]:
# 2. Solve this using no loops and only numpy functions
import numpy as np
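a = np.array([1,3,7,1,2,6,0,1])

print(a[:-2])   # left neighbours
print(a[1:-1])  # candidate points
print(a[2:])    # right neighbours

# A peak is strictly larger than both neighbours; interior indices start at 1.
is_peak = (a[:-2] < a[1:-1]) & (a[1:-1] > a[2:])
print(np.arange(1, len(a) - 1)[is_peak])

# Same test via slopes: at a peak the sign of the slope flips from +1 to -1.
slope_sign = np.sign(np.diff(a))
print(np.where(np.diff(slope_sign) == -2)[0] + 1)


[1 3 7 1 2 6]
[3 7 1 2 6 0]
[7 1 2 6 0 1]
[2 5]
[2 5]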
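
A useful test case for any peak finder is an array with two descents in a row, because checking only for a negative slope would report too many peaks there. The sketch below wraps the neighbour comparison in a function with the hypothetical name `find_peaks` and tries it on such an array.

```python
import numpy as np

def find_peaks(a):
    """Indices of points strictly larger than both neighbours (illustrative name)."""
    a = np.asarray(a)
    is_peak = (a[:-2] < a[1:-1]) & (a[1:-1] > a[2:])
    # is_peak[j] describes a[j + 1], so shift the positions by one.
    return np.flatnonzero(is_peak) + 1

b = np.array([1, 5, 3, 2, 4, 1])   # made-up array: falls twice in a row after index 1

print(find_peaks(b))
print(np.where(np.diff(b) < 0)[0])   # negative slopes alone: one entry too many
```

Here the slope is negative at three positions, but only two of them follow an ascent, which is why both neighbours have to be compared.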
