## Exercise 1: What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then convert the column to a NumPy array with `np.array`.

In [24]:
import pandas as pd
import numpy as np
file1 = 'data/president_heights.csv'
df_heights = pd.read_csv(file1)
height_np = np.array(df_heights['height(cm)'])

In [25]:
#1- Mean height 
mean_height = np.mean(height_np)
print(f"Answer 1.1: The average height is {mean_height:.2f} cm.")

#2- Standard deviation of heights
std_height = np.std(height_np)
print(f"Answer 1.2: The standard deviation is {std_height:.2f} cm.")

#3- Person with minimum height
min_height = np.min(height_np)
#Select the 'name' column directly so the result is a Series of names
name_min_height = df_heights.loc[df_heights['height(cm)'] == min_height, 'name']
print(f"Answer 1.3: The smallest person is {name_min_height.iloc[0]}.")

#4- Person with maximum height 
max_height = np.max(height_np)
names_max_height = df_heights[df_heights['height(cm)'] == max_height]['name']
names_max_height_str = (' and ').join(names_max_height)
print(f"Answer 1.4: The tallest persons are {names_max_height_str}.")

Answer 1.1: The average height is 179.74 cm.
Answer 1.2: The standard deviation is 6.93 cm.
Answer 1.3: The smallest person is James Madison.
Answer 1.4: The tallest persons are Abraham Lincoln and Lyndon B. Johnson.
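
One detail worth keeping in mind for Answer 1.2: `np.std` computes the population standard deviation by default (`ddof=0`). If the heights are treated as a sample, passing `ddof=1` gives the sample standard deviation instead. The short sketch below uses three illustrative values (assumed for the example, not taken from the CSV file) to show the difference.

```python
import numpy as np

# Illustrative values only; these are NOT read from president_heights.csv
heights = np.array([170.0, 180.0, 190.0])

# Default: divide the sum of squared deviations by N (population formula)
population_std = np.std(heights)

# ddof=1: divide by N - 1 instead (sample formula)
sample_std = np.std(heights, ddof=1)

print(f"Population std: {population_std:.3f}")
print(f"Sample std:     {sample_std:.3f}")
```

With only a few values the two results differ noticeably; with several dozen heights the gap is much smaller.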


## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

(This is already implemented in `np.poly1d`, but use that only to test your function.)

- Hint: Use `np.cumprod()`  


In [26]:
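def p(x, coeff):
    #Coefficients are read highest degree first (the np.poly1d convention),
    #i.e. in the reverse order of a_0, ..., a_N in equation (1)
    powers = np.arange(len(coeff))[::-1]
    #Multiply each coefficient by the matching power of x, then add everything up
    values = np.array(coeff)*x**powers
    res = np.sum(values)
    return res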

#Test if function equals to np.poly1d
#Input parameters
x0 = 2
coeff0 = np.arange(3)

#Results
poly = p(x0, coeff0)
poly1d = np.poly1d(coeff0)

#Display
print(f"Result from built function with numpy : {poly}")
print(f"Result from np.poly1d : {poly1d(x0)}")
print(f"Does the result of the polynomial function built with numpy equal the result of np.poly1d?\
 {poly == poly1d(x0)}")

Result from built function with numpy : 4
Result from np.poly1d : 4
Does the result of the polynomial function built with numpy equal the result of np.poly1d? True
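
The hint suggests `np.cumprod()`. As a minimal sketch of that route, the function below (named `p_cumprod` here, a hypothetical name) builds the powers $1, x, x^2, \ldots, x^N$ with a cumulative product and takes the coefficients in the ascending order $a_0, \ldots, a_N$ of equation (1). Because `np.poly1d` expects the highest degree first, the comparison reverses the coefficients.

```python
import numpy as np

def p_cumprod(x, coeff):
    """Evaluate a_0 + a_1 x + ... + a_N x^N with coeff = [a_0, ..., a_N]."""
    coeff = np.asarray(coeff, dtype=float)
    # Start from [1, x, x, ..., x]; the cumulative product turns it into
    # [1, x, x^2, ..., x^N]
    powers = np.ones(len(coeff))
    powers[1:] = x
    powers = np.cumprod(powers)
    # Dot product of coefficients and powers gives the polynomial value
    return coeff @ powers

coeff_ascending = [0, 1, 2]          # represents 0 + 1*x + 2*x^2
value = p_cumprod(2, coeff_ascending)
reference = np.poly1d(coeff_ascending[::-1])(2)
print(value, reference)
```

Both values agree because the only difference between the two conventions is the order in which the coefficients are listed.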


## Exercise 3: Softmax

Read in `data/iris.csv` and compute the [softmax]() of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = (x_0, x_1, \ldots, x_{n-1})$ is

$$\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}$$


Your result should be equal to the output of `scipy.special.softmax`.

In [27]:
from scipy.special import softmax

file3 = 'data/iris.csv'
df_iris = pd.read_csv(file3)
sepal_lengths = df_iris['sepallength'].to_numpy()

#Softmax: exponentiate every value and normalize by the sum of the exponentials
denominator = np.sum(np.exp(sepal_lengths))
sigmas = np.exp(sepal_lengths)/denominator
scipy_softmax_fn = softmax(sepal_lengths)

#Test (np.allclose compares within floating-point tolerance, so no rounding is needed)
print(f"Does the code generate the same values as scipy.special.softmax? {np.allclose(sigmas, scipy_softmax_fn)}\n")
# print(sigmas)
# print(scipy_softmax_fn)

Does the code generate the same values as scipy.special.softmax? True
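
The direct formula works for sepal lengths because the values are small. For large inputs `np.exp` can overflow, and a common remedy is to subtract the maximum before exponentiating, which leaves the result unchanged mathematically. The sketch below illustrates this with a hypothetical helper `stable_softmax` and made-up values; it is not part of the original solution.

```python
import numpy as np

def stable_softmax(x):
    """Softmax that subtracts max(x) first to avoid overflow in np.exp."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)          # largest entry becomes 0
    exps = np.exp(shifted)           # every exponent is now <= 1
    return exps / np.sum(exps)

# Illustrative values (not read from iris.csv)
values = np.array([5.1, 4.9, 7.0])
naive = np.exp(values) / np.sum(np.exp(values))
print(np.allclose(stable_softmax(values), naive))

# Inputs this large would overflow in the naive formula
big = stable_softmax(np.array([1000.0, 1001.0]))
print(big)
```

The shift cancels out because the same factor $e^{-\max(x)}$ appears in the numerator and the denominator.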



## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
The output contains 10 columns, one for each number from 1 to 10. Each value is the count of that number in the corresponding row.
For example, cell (0, 2) has the value 2, which means that the number 3 occurs exactly 2 times in the first row.

In [28]:
def count_numbers(array):
    #Initialize the output with one column per possible value (1 to 10),
    #independent of how many columns the input has
    n_rows = array.shape[0]
    output = np.zeros((n_rows, 10), dtype=int)
    
    #Count the occurrences of each number in each row
    for i in range(n_rows):
        for j in range(1,11):
            output[i, j-1] = np.count_nonzero(array[i] == j)
    return output

#Test
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
print(count_numbers(arr))

[[1 0 2 1 1 1 0 2 2 0]
 [2 1 3 0 1 0 1 0 1 1]
 [0 3 0 2 3 1 0 1 0 0]
 [1 0 2 1 0 1 0 2 1 2]
 [2 2 2 0 0 1 1 1 1 0]
 [1 1 1 1 1 2 0 0 2 1]]
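
The same counts can also be obtained without Python loops. The sketch below, with the hypothetical name `count_numbers_vectorized` and assumed `low`/`high` parameters, compares every element with every candidate value through broadcasting and sums the matches along each row. It is shown on a small hand-made array so the result is easy to verify by eye.

```python
import numpy as np

def count_numbers_vectorized(array, low=1, high=10):
    """Row-wise counts of each integer in [low, high], without Python loops."""
    array = np.asarray(array)
    values = np.arange(low, high + 1)
    # Broadcasting: (rows, cols, 1) == (n_values,) -> (rows, cols, n_values)
    matches = array[:, :, None] == values
    # Summing over the column axis counts the matches per row and per value
    return matches.sum(axis=1)

demo = np.array([[1, 1, 3],
                 [2, 10, 10]])
counts = count_numbers_vectorized(demo)
print(counts)
```

The intermediate boolean array has `rows * cols * n_values` entries, so this trades memory for speed; for very large inputs a per-row `np.bincount` may be preferable.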


## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [29]:
#Solution implemented from scratch with a Python loop.
#Note: a one-hot row has exactly one 1, placed in the column of the element's value
#(this is different from the binary representation of the number).
def one_hot_encoding(array):
    #Sorted unique values define the column order
    uniques = sorted(set(array.tolist()))
    output = []
    for num in array:
        #1.0 in the column that matches the value, 0.0 everywhere else
        row = [1.0 if num == u else 0.0 for u in uniques]
        output.append(row)
    return np.array(output)

#Test
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
print(arr)
one_hot_encoding(arr)

[2 3 2 2 2 1]


array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [30]:
#Solution implemented with NumPy broadcasting instead of a Python loop.

def one_hot_encoding(array):
    array = np.asarray(array)
    #Compare every element with every unique value: result has shape (len(array), n_unique)
    return (array[:, None] == np.unique(array)).astype(float)

#Test
print(arr)
one_hot_encoding(arr)


[2 3 2 2 2 1]


array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function), that is, an array whose entries are the probabilities of the possible outcomes.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows:

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`.
It helps to sketch the intervals on paper.

**Your exercise is to speed it up using NumPy, avoiding explicit loops**

- Hint: Use `np.searchsorted` and `np.cumsum`  


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.

In [31]:
from random import uniform
class DiscreteRV():
    def __init__(self, data):
        self.data = data
    
    def draw(self, k=1):
        #Cumulative probabilities are the right endpoints of the subintervals I_0, ..., I_{n-1}
        cum_q = np.cumsum(np.array(self.data))
        output = np.zeros(k)
        for i in range(k):
            U = uniform(0, 1)
            #Index of the subinterval that contains U
            output[i] = np.searchsorted(cum_q, U)
        return output

q = [0.25, 0.75]
discrete_rv = DiscreteRV(q)
discrete_rv.draw(10)

array([0., 0., 0., 1., 1., 1., 1., 1., 1., 1.])
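
The cell above still loops over the `k` draws. A fully vectorized variant is sketched below under a few assumptions: the class name `DiscreteRVVectorized` and the `seed` parameter are hypothetical, uniforms come from NumPy's `default_rng` rather than `random.uniform`, and the subintervals are treated as closed on the left (`side="right"`), which differs from the original only on a set of probability zero.

```python
import numpy as np

class DiscreteRVVectorized:
    """Inverse transform sampling for a probability vector q, without Python loops."""

    def __init__(self, q, seed=None):
        self.q = np.asarray(q, dtype=float)
        # Right endpoints of the subintervals I_0, ..., I_{n-1}
        self.cum_q = np.cumsum(self.q)
        self.rng = np.random.default_rng(seed)

    def draw(self, k=1):
        # k uniform draws on [0, 1) in a single call
        U = self.rng.random(k)
        # For each U, find the subinterval whose right endpoint exceeds it
        idx = np.searchsorted(self.cum_q, U, side="right")
        # Guard against rounding that leaves cum_q[-1] slightly below 1
        return np.minimum(idx, len(self.q) - 1)

rv = DiscreteRVVectorized([0.25, 0.75], seed=0)
draws = rv.draw(10_000)
print(draws.mean())  # fraction of draws equal to 1; expected to be near 0.75
```

Computing the cumulative sum once in `__init__` means each call to `draw` only costs one batch of uniforms and one `np.searchsorted`.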

## Exercise 7: Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of the peak values 7 and 6.

### 1. Solve this using a regular python for loop

### 2. Solve this using no loops and only numpy functions

In [32]:
#7.1

def find_peak(array):
    a = np.copy(array)
    result = []
    for i in range(1, len(a)-1):
        if a[i]> a[i-1] and a[i]>a[i+1]:
            result.append(i)
    return result

arr = np.array([1, 3, 7, 1, 2, 6, 0, 1])
find_peak(arr)

[2, 5]

In [33]:
#7.2
def find_peak(array):
    
    #Interior points and their left and right neighbours
    a0 = array[1:-1]
    a1 = array[:-2]
    a2 = array[2:]
    
    #True where the number at that index is bigger than the number to its left
    r1 = a0 > a1
    
    #True where the number at that index is bigger than the number to its right
    r2 = a0 > a2
    
    #A peak is bigger than both of its neighbours
    is_peak = r1 & r2
    
    #Indexes of the peaks within the interior slice
    r4 = np.where(is_peak)
    
    #Adjust indexes by 1 because the interior slice starts at position 1 of the original array
    r5 = r4[0]+1
    
    return r5

find_peak(arr)

array([2, 5], dtype=int64)
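
Another loop-free route, sketched here under the hypothetical name `find_peaks_diff`, looks at the sign of consecutive differences: a peak is where the slope switches from rising (+1) to falling (-1), so the difference of the signs equals -2 at that point.

```python
import numpy as np

def find_peaks_diff(array):
    """Positions of strict peaks, found from sign changes of the first difference."""
    array = np.asarray(array)
    # +1 where the sequence rises, -1 where it falls, 0 where it is flat
    slope_sign = np.sign(np.diff(array))
    # A rise followed directly by a fall gives (-1) - (+1) = -2
    turning = np.diff(slope_sign)
    # Shift by 1 because np.diff shortens the array from the left
    return np.where(turning == -2)[0] + 1

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
peaks = find_peaks_diff(a)
print(peaks)
```

Flat stretches produce a sign of 0, so plateaus are not reported, which matches the strict "smaller values on both sides" definition used in the exercise.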