## Steven Zajac-Descôteaux Workshop 6

## 1. What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then cast the column to a `np.array`.

In [165]:
import numpy as np
import pandas as pd

height = pd.read_csv('data/president_heights.csv')
#height = height.reset_index(drop=True, inplace=True)

height_arr = np.array(height["height(cm)"])

h_mean = round(np.mean(height_arr), 2)

h_std = round(np.std(height_arr), 2)

h_max = np.amax(height_arr)
h_min = np.amin(height_arr)

t_index = list(height[height['height(cm)']==h_max].index.values)
tallest = list(height['name'][t_index])


s_index = list(height[height['height(cm)']==h_min].index.values)
shortest = list(height['name'][s_index])

print(f"The tallest president(s): {', '.join(tallest)}")
print(f"\nThe shortest president(s): {', '.join(shortest)}")
print(f"\nThe average height for the presidents is {h_mean} cm with a standard deviation of {h_std}.")


The tallest president(s): Abraham Lincoln, Lyndon B. Johnson

The shortest president(s): James Madison

The average height for the presidents is 179.74 cm with a standard deviation of 6.93.
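
As a side note on the aggregates used above, here is a minimal sketch that needs only NumPy. The names and heights are hypothetical toy values, not rows from *president_heights.csv*. It shows a boolean mask returning every name tied for the extreme value, and the difference between NumPy's default population standard deviation (`ddof=0`) and the sample standard deviation (`ddof=1`).

```python
import numpy as np

# Hypothetical toy data, not taken from president_heights.csv
names = np.array(["A", "B", "C", "D"])
heights = np.array([170, 193, 163, 193])

# A boolean mask keeps every name tied for the extreme value
tallest = names[heights == heights.max()].tolist()
shortest = names[heights == heights.min()].tolist()

pop_std = heights.std()           # population standard deviation (ddof=0, NumPy's default)
sample_std = heights.std(ddof=1)  # sample standard deviation (divides by n - 1)

print(tallest, shortest)
print(round(float(pop_std), 2), round(float(sample_std), 2))
```

Which `ddof` is appropriate depends on whether the list of presidents is treated as the whole population or as a sample; the exercise does not say, so the default is a reasonable reading.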


## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

(This is already implemented in `np.poly1d`, but use that only to test your function)

- Hint: Use `np.cumprod()`  


In [72]:
import numpy as np

# np.poly1d orders coefficients from the highest power down; p() below takes them from x^0 up

def p(x, coeff):
    """Evaluate a one-dimensional polynomial at x."""
    coeff = np.array(coeff)  # create an array from the coefficients
    power = np.arange(0, coeff.size)  # powers of x, starting at x^0
    answer = np.sum(coeff * (x ** power))  # sum of a_n * x^n using array operations only
    return answer
              
p(5, [2,1,1])

32

In [73]:
# Flip because our p() starts at x^0, whereas np.poly1d starts at the highest power
# (named p_ref so it does not overwrite the function p defined above)

p_ref = np.poly1d(np.flip([2, 1, 1]))
print(p_ref(5))

32
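
The hint in the exercise mentions `np.cumprod()`, which the solution above does not use. A possible sketch of that approach follows; the function name `p_cumprod` is my own and not part of the workshop. It builds the vector of powers `[1, x, x^2, ...]` as a running product and then takes a dot product with the coefficients.

```python
import numpy as np

def p_cumprod(x, coeff):
    """Evaluate a_0 + a_1 x + ... + a_N x^N using np.cumprod (hypothetical helper)."""
    coeff = np.asarray(coeff, dtype=float)
    powers = np.ones(coeff.size)
    powers[1:] = x                 # [1, x, x, ..., x]
    powers = np.cumprod(powers)    # running product gives [1, x, x^2, ..., x^N]
    return coeff @ powers          # dot product is the sum of a_n * x^n

result = p_cumprod(5, [2, 1, 1])
reference = np.poly1d(np.flip([2, 1, 1]))(5)  # np.poly1d expects the highest power first
print(result, reference)
```

For `x = 5` and coefficients `[2, 1, 1]` this is 2 + 5 + 25 = 32, matching the result above.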


## Exercise 3 Softmax

Read in `data/iris.csv` and compute the softmax of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = \{x_0, x_1, \ldots, x_{n-1}\}$ is

$$\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}$$


Your result should be equal to the output of `scipy.special.softmax`

In [304]:
import numpy as np
import pandas as pd
from scipy.special import softmax as sf

df = pd.read_csv('/Users/Main/Data_Science/m1-6-numpy-main/data/iris.csv')
df.head()
arr = np.array(df['sepallength'])
sf(arr)
    

array([0.00221959, 0.00181724, 0.00148783, 0.00134625, 0.00200836,
       0.00299613, 0.00134625, 0.00200836, 0.00110221, 0.00181724,
       0.00299613, 0.00164431, 0.00164431, 0.00099732, 0.0044697 ,
       0.00404435, 0.00299613, 0.00221959, 0.00404435, 0.00221959,
       0.00299613, 0.00221959, 0.00134625, 0.00221959, 0.00164431,
       0.00200836, 0.00200836, 0.00245302, 0.00245302, 0.00148783,
       0.00164431, 0.00299613, 0.00245302, 0.00331123, 0.00181724,
       0.00200836, 0.00331123, 0.00181724, 0.00110221, 0.00221959,
       0.00200836, 0.00121813, 0.00110221, 0.00200836, 0.00221959,
       0.00164431, 0.00221959, 0.00134625, 0.00271101, 0.00200836,
       0.01483991, 0.00814432, 0.01342771, 0.00331123, 0.00900086,
       0.00404435, 0.00736928, 0.00181724, 0.00994749, 0.00245302,
       0.00200836, 0.00493978, 0.0054593 , 0.00603346, 0.00365948,
       0.01099368, 0.00365948, 0.0044697 , 0.006668  , 0.00365948,
       0.00493978, 0.00603346, 0.00736928, 0.00603346, 0.00814

In [324]:
def softmx(ar):
    """
    Compute softmax values.
    Each element's exponential is divided
    by the sum of the exponentials of all the elements.
    """
    exps = np.exp(ar)  # exponential of every element in the input array, computed once
    soft_max = exps / np.sum(exps)  # each exponential divided by the sum of all exponentials
    return soft_max
     
df = pd.read_csv('/Users/Main/Data_Science/m1-6-numpy-main/data/iris.csv')
df.head()
arr = np.array(df['sepallength'])
softmx(arr)

array([0.00221959, 0.00181724, 0.00148783, 0.00134625, 0.00200836,
       0.00299613, 0.00134625, 0.00200836, 0.00110221, 0.00181724,
       0.00299613, 0.00164431, 0.00164431, 0.00099732, 0.0044697 ,
       0.00404435, 0.00299613, 0.00221959, 0.00404435, 0.00221959,
       0.00299613, 0.00221959, 0.00134625, 0.00221959, 0.00164431,
       0.00200836, 0.00200836, 0.00245302, 0.00245302, 0.00148783,
       0.00164431, 0.00299613, 0.00245302, 0.00331123, 0.00181724,
       0.00200836, 0.00331123, 0.00181724, 0.00110221, 0.00221959,
       0.00200836, 0.00121813, 0.00110221, 0.00200836, 0.00221959,
       0.00164431, 0.00221959, 0.00134625, 0.00271101, 0.00200836,
       0.01483991, 0.00814432, 0.01342771, 0.00331123, 0.00900086,
       0.00404435, 0.00736928, 0.00181724, 0.00994749, 0.00245302,
       0.00200836, 0.00493978, 0.0054593 , 0.00603346, 0.00365948,
       0.01099368, 0.00365948, 0.0044697 , 0.006668  , 0.00365948,
       0.00493978, 0.00603346, 0.00736928, 0.00603346, 0.00814
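
One caveat with the direct formula: `np.exp` overflows for large inputs. Sepal lengths are small, so it does not matter here, but a common workaround is to subtract the maximum before exponentiating, which leaves the result unchanged because the common factor cancels. A minimal sketch, with `stable_softmax` being a name I made up for illustration:

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the maximum subtracted first to avoid overflow (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - x.max())  # largest exponent is now 0, so exp() cannot overflow
    return exps / exps.sum()

small = np.array([5.1, 4.9, 4.7])
naive = np.exp(small) / np.exp(small).sum()   # direct formula, fine for small values
big = stable_softmax([1000.0, 1001.0])        # direct formula would overflow here

print(np.allclose(stable_softmax(small), naive))
print(big)
```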

## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
Output contains 10 columns representing numbers from 1 to 10. The values are the counts of the numbers in the respective rows.
For example, cell (0, 2) has the value 2, which means the number 3 occurs exactly 2 times in the first row.

In [295]:
import numpy as np

np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
print(arr)

r = np.zeros((6, 10), dtype='int')  # integer array of zeros; values absent from a row keep a count of 0
for i in range(len(arr)):  # loop over the rows of arr
    ind, count = np.unique(arr[i], return_counts=True)  # unique values in the row and their counts
    r[i][ind-1] = count  # value v goes in column v-1, since values run 1..10 but indices run 0..9
print(r)

[[ 9  9  4  8  8  1  5  3  6  3]
 [ 3  3  2  1  9  5  1 10  7  3]
 [ 5  2  6  4  5  5  4  8  2  2]
 [ 8  8  1  3 10 10  4  3  6  9]
 [ 2  1  8  7  3  1  9  3  6  2]
 [ 9  2  6  5  3  9  4  6  1 10]]
[[1 0 2 1 1 1 0 2 2 0]
 [2 1 3 0 1 0 1 0 1 1]
 [0 3 0 2 3 1 0 1 0 0]
 [1 0 2 1 0 1 0 2 1 2]
 [2 2 2 0 0 1 1 1 1 0]
 [1 1 1 1 1 2 0 0 2 1]]
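
The solution above loops over the rows. A loop-free alternative, sketched here under the assumption that the values are known to lie in 1..10, compares every element against every candidate value with broadcasting and sums the matches along each row. Only the first two rows of the input are hard-coded to keep the example short.

```python
import numpy as np

# First two rows of the input array from the exercise
arr = np.array([[9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
                [3, 3, 2, 1, 9, 5, 1, 10, 7, 3]])

values = np.arange(1, 11)  # candidate values 1..10 (assumed known in advance)

# arr[:, :, None] has shape (2, 10, 1); comparing with values broadcasts to (2, 10, 10),
# where the last axis indexes the candidate value
matches = arr[:, :, None] == values

# Summing over axis 1 (the elements of each row) counts occurrences of each value
counts = matches.sum(axis=1)
print(counts)
```

This trades memory for speed, since it builds a rows × columns × values boolean array, so the loop may still be preferable for very large inputs.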


## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [375]:
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr

dum_bin = np.zeros((arr.size, arr.max()))  # array of zeros, 6 by 3 (the max value in arr is 3)
dum_bin[np.arange(arr.size), arr-1] = 1  # set a 1 in row i at column arr[i]-1
# (subtract 1 because values start at 1 but column indices start at 0, so 1 maps to column 0 and 3 to column 2)
dum_bin


array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [376]:
#Notes for me :) 
print([np.arange(arr.size), arr-1])
# The row indices 0-5 (inclusive) are paired element-wise with the column indices arr-1.
# So row 0 gets its 1 at column 1. That's why we subtract 1 from the array.
# Otherwise we would need a 6 by 4 array, not 6 by 3.

[array([0, 1, 2, 3, 4, 5]), array([1, 2, 1, 1, 1, 0])]
[1 2 1 1 1 0]
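
The `arr-1` trick relies on the values being exactly 1, 2, 3 with no gaps. A more general sketch, assuming only that the input is a 1D array, maps each value to its position among the sorted unique values with `np.unique(..., return_inverse=True)` and then picks rows out of an identity matrix:

```python
import numpy as np

arr = np.array([2, 3, 2, 2, 2, 1])  # the input array from the exercise

# inverse[i] is the position of arr[i] among the sorted unique values
uniques, inverse = np.unique(arr, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for the k-th unique value
one_hot = np.eye(uniques.size)[inverse]
print(one_hot)
```

With this version an array such as `[10, 30, 10]` would give two columns rather than thirty.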


## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) over a statistical distribution. Recall that a distribution is an array of probabilities of events.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`.
It helps to sketch the intervals on paper.
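
The same example can be worked through in NumPy. In this small sketch the three values of `U` are picked by hand rather than drawn at random. The cumulative sums give the right-hand endpoints of the intervals, and `np.searchsorted` with its default `side='left'` returns the index `i` such that `a < U <= a + q[i]`, the same condition the loop checks.

```python
import numpy as np

q = np.array([0.25, 0.75])
edges = np.cumsum(q)  # right endpoints of I_0 and I_1: [0.25, 1.0]

U = np.array([0.1, 0.25, 0.6])  # hand-picked values standing in for uniform draws
idx = np.searchsorted(edges, U)  # index of the interval containing each U
print(idx)  # [0 0 1]
```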

**Your exercise is to speed it up using NumPy, avoiding explicit loops**

- Hint: Use `np.searchsorted` and `np.cumsum`  


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.

In [835]:
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        print(U)
        if a < U <= a + q[i]:
            print(q[i])
            return i
        a = a + q[i]
        print(a)
        
sample([.25,.75])

0.6295479387514277
0.25
0.6295479387514277
0.75


1

In [915]:
import numpy as np
from numpy.random import uniform

class DiscreteRV:
    
    def __init__(self, q):
        """Store the probabilities q and their cumulative sums."""
        self.q = q
        self.Q = np.cumsum(q)
        
        
    def draw(self, k=1):
        """Return k draws from q as an array. Default is 1."""
        rand = uniform(0, 1, size=k)  # k uniform draws on [0, 1)
        x = np.searchsorted(self.Q, rand)  # index of the subinterval containing each draw
        return x       
        
dr = DiscreteRV([.25,.25, .5])    
dr.draw(13)

array([2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 0, 2, 2])
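
Thirteen draws are too few to judge whether the sampler follows `q`. One way to check, sketched here as a standalone snippet with the same cumulative-sum and `np.searchsorted` logic, is to take many draws and compare the empirical frequencies with `q`. The seed and sample size are arbitrary choices of mine.

```python
import numpy as np

q = np.array([0.25, 0.25, 0.5])
Q = np.cumsum(q)  # cumulative probabilities, the interval endpoints

rng = np.random.default_rng(0)           # seeded generator so the check is repeatable
U = rng.uniform(0, 1, size=100_000)      # many uniform draws on [0, 1)
draws = np.searchsorted(Q, U)            # one index per uniform draw

# Fraction of draws that landed on each index, to compare against q
freq = np.bincount(draws, minlength=q.size) / draws.size
print(freq)
```

The frequencies should land close to `[0.25, 0.25, 0.5]`, with small deviations that shrink as the number of draws grows.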

## Exercise 7 Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of the peak values 7 and 6.

### 1. Solve this using a regular Python for loop

### 2. Solve this using no loops and only NumPy functions

In [278]:
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

answer_array = []
for i in range(1, len(a) - 1):
    if a[i-1] < a[i] and a[i] > a[i+1]:
        answer_array.append(i)
print(answer_array)

[2, 5]


list

In [337]:
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

print(a[2:])
print(a[1:-1])
print(a[:-2])

lst = [False, True, True, True, True, True, True, False]  # False because end values cannot be peaks
a[lst]  # array without the end values, in the original order
(a[:-2] < a[1:-1]) & (a[1:-1] > a[2:])  # compare the middle slice with its left and right neighbors
np.arange(1, len(a)-1)[(a[:-2] < a[1:-1]) & (a[1:-1] > a[2:])]  # positions where both comparisons are True

[7 1 2 6 0 1]
[3 7 1 2 6 0]
[1 3 7 1 2 6]


array([2, 5])
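
Another loop-free variant, shown as a sketch rather than a replacement for the slicing approach, looks for places where the slope changes from rising to falling. The sign of `np.diff(a)` is +1 on a rise and -1 on a fall, so a peak shows up as a drop of 2 in the sign sequence. This assumes strict peaks; flat plateaus (equal neighbors) are not detected.

```python
import numpy as np

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

slope_sign = np.sign(np.diff(a))  # +1 where a rises, -1 where it falls
turns = np.diff(slope_sign)       # -2 marks a rise followed directly by a fall

# turns[i] compares the steps on either side of a[i + 1], hence the + 1
peaks = np.where(turns == -2)[0] + 1
print(peaks)  # [2 5]
```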