## 1. What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then cast the column to an `np.array`.

In [18]:
import pandas as pd
import numpy as np

p_file = pd.read_csv("./data/president_heights.csv", usecols=["order", "name", "height(cm)"])

heights = np.array(p_file["height(cm)"])
names = np.array(p_file["name"])
shortestI = np.argmin(heights)
tallestI = np.argmax(heights)

print("The average height of presidents is {height} cm".format(height=np.average(heights)))
print("The standard deviation of height of presidents is +/- {height} cm".format(height=np.std(heights)))
print("The shortest president is/was {president} ({height} cm)".format(president=names[shortestI],height=heights[shortestI]))
print("The tallest president is/was {president} ({height} cm)".format(president=names[tallestI],height=heights[tallestI]))



The average height of presidents is 179.73809523809524 cm
The standard deviation of height of presidents is +/- 6.931843442745892 cm
The shortest president is/was James Madison (163 cm)
The tallest president is/was Abraham Lincoln (193 cm)


## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

(This is already implemented in `np.poly1d`, but use that only to test your function)

- Hint: Use `np.cumprod()`  


In [57]:
import numpy as np

def p(x, coeff):
    arr = np.array(coeff)
    # Exponents 0..N, one per coefficient
    enum = list(range(0, len(coeff)))
    # np.vectorize applies the lambda to each (exponent, coefficient) pair
    poly = np.vectorize(lambda i, t: t*(x**i))(enum, arr)
    return np.sum(poly)

def prin(x, a):
    print("x = {x}; coeff = {arra} --> {result}".format(
        x=x,
        arra=a,
        result=p(x, a)
    ))
    
prin(5, [1,1])
prin(5, [2,1,1])

x = 5; coeff = [1, 1] --> 6
x = 5; coeff = [2, 1, 1] --> 32
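
The hint points to `np.cumprod`, whereas `np.vectorize` is essentially a convenience wrapper around a Python-level loop. Below is a minimal sketch of one way to follow the hint, assuming `coeff` is ordered from $a_0$ to $a_N$ as in equation (1). The name `p_cumprod` is hypothetical and only used for illustration.

```python
import numpy as np

def p_cumprod(x, coeff):
    """Evaluate a_0 + a_1 x + ... + a_N x^N using array operations only."""
    coeff = np.asarray(coeff, dtype=float)
    # Start from [1, x, x, ..., x]; the cumulative product is [1, x, x^2, ..., x^N]
    powers = np.ones(len(coeff))
    powers[1:] = x
    powers = np.cumprod(powers)
    # The dot product sums a_n * x^n over n
    return coeff @ powers

value = p_cumprod(5, [2, 1, 1])
# np.poly1d expects the highest-order coefficient first, hence the reversal
reference = np.poly1d([2, 1, 1][::-1])(5)
```

Here `value` and `reference` should agree; note that `np.poly1d` takes the coefficients in the opposite order from `coeff`.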


## Exercise 3 Softmax

Read in `data/iris.csv` and compute the [softmax]() of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = \{x_0, x_1, \ldots, x_{n-1}\}$ is

$$
\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}
$$


Your result should be equal to the output of `scipy.special.softmax`
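
As a side note, the formula can overflow for large inputs because of the exponentials. A common remedy, sketched here as an assumption rather than as part of the exercise, is to subtract the maximum before exponentiating; this leaves the result unchanged because the common factor $e^{-\max(x)}$ cancels in the ratio. The name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(x):
    """Softmax of a 1D array, shifted by its maximum to avoid overflow."""
    x = np.asarray(x, dtype=float)
    # Shifting by max(x) makes the largest exponent 0, so np.exp cannot overflow
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

# A few sepal-length-like values, used only for illustration
example = stable_softmax([5.1, 4.9, 4.7])
```

For moderate inputs such as the sepal lengths, this should agree with the direct formula up to floating-point rounding.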

In [5]:
import pandas as pd
import numpy as np
from scipy.special import softmax

def sum_exp(values):
    # np.exp already works element-wise on arrays, so np.vectorize is not needed
    return np.sum(np.exp(values))


i_file = pd.read_csv("./data/iris.csv", usecols=["sepallength"])
print(i_file)

def soft_max_array(ve):
    # Softmax: exponentiate each entry and divide by the sum of all exponentials
    arr = np.array(ve)
    return np.exp(arr) / sum_exp(arr)
    

print("The softmax of sepallength is {softm}".format(
    softm=soft_max_array(i_file["sepallength"])))
print(softmax(i_file["sepallength"]))


     sepallength
0            5.1
1            4.9
2            4.7
3            4.6
4            5.0
..           ...
145          6.7
146          6.3
147          6.5
148          6.2
149          5.9

[150 rows x 1 columns]
The softmax of sepallength is [0.00221959 0.00181724 0.00148783 0.00134625 0.00200836 0.00299613
 0.00134625 0.00200836 0.00110221 0.00181724 0.00299613 0.00164431
 0.00164431 0.00099732 0.0044697  0.00404435 0.00299613 0.00221959
 0.00404435 0.00221959 0.00299613 0.00221959 0.00134625 0.00221959
 0.00164431 0.00200836 0.00200836 0.00245302 0.00245302 0.00148783
 0.00164431 0.00299613 0.00245302 0.00331123 0.00181724 0.00200836
 0.00331123 0.00181724 0.00110221 0.00221959 0.00200836 0.00121813
 0.00110221 0.00200836 0.00221959 0.00164431 0.00221959 0.00134625
 0.00271101 0.00200836 0.01483991 0.00814432 0.01342771 0.00331123
 0.00900086 0.00404435 0.00736928 0.00181724 0.00994749 0.00245302
 0.00200836 0.00493978 0.0054593  0.00603346 0.00365948 0.01099368
 0.00

## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
Output contains 10 columns representing numbers from 1 to 10. The values are the counts of the numbers in the respective rows.
For example, cell (0, 2) has the value 2, which means that the number 3 occurs exactly 2 times in the first row.

In [13]:
import numpy as np

def unique_count_row_wise(arr):
    result = []
    for row in arr:
        temp_row = np.array(row)
        new_row = []
        # The values range from 1 to 10, which here equals the row length,
        # so column j of the output holds the count of the value j + 1
        for value in range(1, len(temp_row) + 1):
            # count_nonzero counts the occurrences of `value` in the current row
            new_row.append(int(np.count_nonzero(temp_row == value)))
        result.append(new_row)
    return result

array = [[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
         [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
         [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
         [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
         [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
         [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]]

unique_count_row_wise(array)
    

[[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
 [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
 [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
 [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
 [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
 [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
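
The same counts can also be obtained without Python loops. The following is a minimal sketch using broadcasting, assuming the values lie between 1 and 10 as in the exercise; the function name `row_value_counts` and its `low`/`high` parameters are hypothetical.

```python
import numpy as np

def row_value_counts(arr, low=1, high=10):
    """Count, for each row, how often each value in low..high occurs."""
    arr = np.asarray(arr)
    values = np.arange(low, high + 1)
    # (rows, cols, 1) == (n_values,) broadcasts to (rows, cols, n_values)
    matches = arr[:, :, None] == values
    # Summing over the column axis leaves one count per (row, value) pair
    return matches.sum(axis=1)

arr = np.array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
                [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
                [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
                [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
                [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
                [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
counts = row_value_counts(arr)
```

The intermediate boolean array has shape `(rows, cols, n_values)`, so this trades memory for the absence of loops; for large inputs a per-row `np.bincount` may be preferable.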

## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [19]:
"""
Given that the web has several ways of implementing this algorithm
in practice, I went with the one here as it seemed the simplest and since
I didn't quite understand the concept at first or how to
generate "[0,1]." values.

https://stackoverflow.com/a/28663910/7183483

"""

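import numpy as np

def one_hot(arr):
    a = np.array(arr)
    # Map each element to the position of its value among the sorted unique values,
    # so the output has one column per unique value (no empty column for 0)
    uniques, inverse = np.unique(a, return_inverse=True)
    b = np.zeros((a.size, uniques.size))
    b[np.arange(a.size), inverse] = 1
    return b

one_hot([2, 3, 2, 2, 2, 1])

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])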

## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) over a statistical distribution. Recall that a distribution is an array of probabilities of events.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows:

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`.
It helps to sketch the intervals on paper.

**Your exercise is to speed it up using NumPy, avoiding explicit loops**

- Hint: Use `np.searchsorted` and `np.cumsum`  


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.
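
One possible shape for such a class is sketched below. It is an illustration under stated assumptions, not the reference solution: the `seed` argument and the use of `np.random.default_rng` are additions that are not part of the exercise text, and the final clipping step is a guard against floating-point rounding in the cumulative sum.

```python
import numpy as np

class DiscreteRV:
    """Discrete random variable with P{x = i} = q[i] (sketch)."""

    def __init__(self, q, seed=None):
        self.q = np.asarray(q, dtype=float)
        # Right endpoints of the subintervals I_0, ..., I_{n-1}
        self._cdf = np.cumsum(self.q)
        # seed is an assumed convenience parameter for reproducible draws
        self._rng = np.random.default_rng(seed)

    def draw(self, k=1):
        """Return an array of k draws from q."""
        u = self._rng.uniform(0.0, 1.0, size=k)
        # side="left" returns i with cdf[i-1] < u <= cdf[i], matching the loop version
        idx = np.searchsorted(self._cdf, u, side="left")
        # If rounding leaves cdf[-1] slightly below 1, keep the index in range
        return np.minimum(idx, len(self.q) - 1)

rv = DiscreteRV([0.25, 0.75], seed=0)
draws = rv.draw(10_000)
```

With `q = [0.25, 0.75]`, roughly three quarters of the entries in `draws` should be 1.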

In [136]:
import numpy as np
import random
from random import uniform

"""
We assume this is not a uniform distribution

"""
class DiscreteRV:
    
    __skip = True
        
    __last_q_i = 0
    
    q = []
    
    _n = 0
    
    def __init__(self, n):
        self.build(n)
        
    """
    Convenience method
    """
    def build(self, n):
        if n < 1:
            self.q = []
            self._n = 0
        else:
            self._n = n
            self.q = np.fromfunction(np.vectorize(self.doLambda), (1, n), dtype=float)[0]
        self.__last_q_i = 0
        self.__skip = True
    

    def doLambda(self, i, j):
        """We use a first-iteration guard here to skip over np.vectorize's 
        first iteration (see the print below to see what I mean)"""
        if self.__skip:
            self.__skip = False
            return i
        """
        We update the new sub-interval's max with a random value between
        the max of the former sub-interval and 1
        """
        self.__last_q_i = uniform(self.__last_q_i, 1)
        return self.__last_q_i

    def show(self):
        print("q = {a}".format(a=self.q))
        
    """
    Default to 1 draw
    """
    def draw(self, k=1):
        """
        Generate k uniform random values in [0, 1) and return, for each one,
        its index in q as found by searchsorted
        """
        # Drawing all k uniforms at once avoids random.sample, which raises
        # a ValueError when k exceeds len(q) - 1
        return np.searchsorted(self.q, np.random.uniform(0, 1, size=k))
        
        
dist = DiscreteRV(10)

dist.show()

print(dist.draw(3))
print(dist.draw(7))



q = [0.32550555 0.63937228 0.76249495 0.85869799 0.85887894 0.92027329
 0.9878375  0.99725703 0.99924164 0.99969571]
[3 3 3]
[5 0 5 1 3 1 1]


## Exercise 7 Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of the peak values 7 and 6.

### 1. Solve this using a regular Python for loop

### 2. Solve this using no loops and only NumPy functions

In [143]:
def peaks_loop(a):
    p = []
    if len(a) < 3: return []
    # The two end elements have only one neighbor, so only the interior
    # indices 1 .. len(a) - 2 are checked
    for i in range(1, len(a) - 1): 
        if a[i - 1] < a[i] > a[i + 1]: p.append(i)
    return p

peaks_loop([1, 3, 7, 1, 2, 6, 0, 1])

[2, 5]

In [151]:
def peaks_np(a):
    a = np.asarray(a)
    if a.size < 3: return np.array([], dtype=int)
    # Compare every interior element with its left and right neighbors
    is_peak = (a[1:-1] > a[:-2]) & (a[1:-1] > a[2:])
    # flatnonzero indexes into the interior slice, so shift by 1
    return np.flatnonzero(is_peak) + 1
    
    
peaks_np([1, 3, 7, 1, 2, 6, 0, 1])

array([2, 5])
