## 1. What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values.

Find the mean height, the standard deviation of the heights, and the shortest and tallest presidents.

You can use `pandas` to read in the file if you want, then cast the column to a `np.array`.

In [64]:
import numpy as np
import pandas as pd

data = pd.read_csv('/Users/mike_stein612/Desktop/m1-6-numpy/data/president_heights.csv')
data

Unnamed: 0,order,name,height(cm)
0,1,George Washington,189
1,2,John Adams,170
2,3,Thomas Jefferson,189
3,4,James Madison,163
4,5,James Monroe,183
5,6,John Quincy Adams,171
6,7,Andrew Jackson,185
7,8,Martin Van Buren,168
8,9,William Henry Harrison,173
9,10,John Tyler,183


In [65]:
np.mean(data, axis = 0)# average height is 179.738095

order          22.476190
height(cm)    179.738095
dtype: float64

In [66]:
np.std(data, axis = 0)# standard deviation heights is 6.931843

order         12.994941
height(cm)     6.931843
dtype: float64

In [67]:
np.amin(data, axis = 0)#column-wise minimum: shortest height is 163cm (the name shown is only the alphabetically first name)

order                       1
name          Abraham Lincoln
height(cm)                163
dtype: object

In [68]:
np.amax(data, axis = 0)#column-wise maximum: tallest height is 193cm (the name shown is only the alphabetically last name)

order                     44
name          Zachary Taylor
height(cm)               193
dtype: object

In [69]:
np.where(data == 163)#match at row 3, column 2 (height) --> James Madison

(array([3]), array([2]))

In [71]:
np.where(data == 193)#two rows (15 and 33) have that height
                     #Tallest presidents = Abraham Lincoln and Lyndon B. Johnson

(array([15, 33]), array([2, 2]))
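
A more direct way to get the names is to work on the height column as a NumPy array and use `argmin` together with a boolean mask. The sketch below is only an illustration: it hard-codes the ten rows shown in the preview above instead of reading the file, so its statistics differ from the full-data results.

```python
import numpy as np

# First ten rows of the preview above (not the full data set)
names = np.array([
    "George Washington", "John Adams", "Thomas Jefferson", "James Madison",
    "James Monroe", "John Quincy Adams", "Andrew Jackson", "Martin Van Buren",
    "William Henry Harrison", "John Tyler",
])
heights = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183])

mean_height = heights.mean()
std_height = heights.std()

# argmin returns the position of the first minimum
shortest = names[heights.argmin()]
# a boolean mask keeps every president tied for the maximum
tallest = names[heights == heights.max()]

print(mean_height, std_height)
print(shortest, list(tallest))
```

With the full file loaded, the same steps should apply to `np.array(data['height(cm)'])` and `np.array(data['name'])`.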

## Exercise 2

Recall the polynomial formula

$$
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_N x^N = \sum_{n=0}^N a_n x^n \tag{1}
$$

In the **math functions workshop**, you wrote a simple function `p(x, coeff)` to evaluate it without thinking about efficiency.

Now write a new function that does the same job, but uses NumPy arrays and array operations for its computations, rather than any form of Python loop.

(This is already implemented in `np.poly1d`, but use that only to test your function.)

- Hint: Use `np.cumprod()`  


In [37]:
import numpy as np

def poly_numpy(x, coeff):
    print("The coefficient values are " + str(coeff))
    powers = np.ones(len(coeff)) #start as [1, 1, ..., 1]
    powers[1:] = x #becomes [1, x, x, ..., x]
    z = np.cumprod(powers) #running product gives [1, x, x**2, ...]
    print('The cumulative product array of x is ' + str(z))
    poly = np.dot(coeff, z) #dot product: sum of coeff[n] * x**n
    print("The sum of the polynomial function is " + str(poly))
    return poly
value = poly_numpy(5, [2,1,1])

The coefficient values are [2, 1, 1]
The cumulative product array of x is [ 1.  5. 25.]
The sum of the polynomial function is 32.0
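
The exercise suggests testing against `np.poly1d`. A minimal sketch of that check follows; `poly_eval` is a hypothetical print-free variant of the function above. Note that `np.poly1d` expects coefficients from the highest degree down, so the list is reversed first.

```python
import numpy as np

def poly_eval(x, coeff):
    """Evaluate a_0 + a_1 x + ... + a_N x^N without a Python loop."""
    coeff = np.asarray(coeff, dtype=float)
    powers = np.ones(len(coeff))
    powers[1:] = x                      # [1, x, x, ..., x]
    return coeff @ np.cumprod(powers)   # cumprod gives [1, x, x**2, ...]

coeff = [2, 1, 1]                       # 2 + x + x**2
reference = np.poly1d(coeff[::-1])      # poly1d wants highest degree first

ours = poly_eval(5, coeff)
theirs = reference(5)
print(ours, theirs)
```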


## Exercise 3 Softmax

Read in `data/iris.csv` and compute the softmax of the sepal length. The formula for the softmax function $\sigma(x)$ for a vector $x = \{x_0, x_1, ..., x_{n-1}\}$ is

$$\sigma(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}$$


Your result should be equal to the output of `scipy.special.softmax`.

In [38]:
import numpy as np
import pandas as pd

a = pd.read_csv('/Users/mike_stein612/Desktop/m1-6-numpy/data/iris.csv')
a

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,flower
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [39]:
logits = a["sepallength"]
logits

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepallength, Length: 150, dtype: float64

In [52]:
def softmax(logits):
    return np.exp(logits) / np.sum(np.exp(logits))

softmax(logits)

0      0.002220
1      0.001817
2      0.001488
3      0.001346
4      0.002008
         ...   
145    0.010994
146    0.007369
147    0.009001
148    0.006668
149    0.004940
Name: sepallength, Length: 150, dtype: float64

In [53]:
np.sum(np.exp(logits) / np.sum(np.exp(logits))) #sanity check: softmax outputs should sum to 1, up to floating-point rounding

0.9999999999999999
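
The direct formula works for sepal lengths, but `np.exp` overflows for large inputs. A common variant subtracts the maximum before exponentiating, which leaves the result unchanged mathematically. The sketch below assumes a hypothetical helper name `softmax_stable` and uses only the first five sepal lengths shown above as sample input.

```python
import numpy as np

def softmax_stable(x):
    """Softmax with the maximum subtracted first to avoid overflow."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()          # largest exponent becomes exp(0) = 1
    e = np.exp(shifted)
    return e / e.sum()

sample = np.array([5.1, 4.9, 4.7, 4.6, 5.0])  # first five sepal lengths shown above
naive = np.exp(sample) / np.exp(sample).sum()
result = softmax_stable(sample)
print(np.allclose(result, naive))

# Inputs this large would overflow in the direct formula
large = softmax_stable([1000.0, 1001.0])
print(large)
```

The same `np.allclose` comparison can be made against `scipy.special.softmax` to confirm the exercise's requirement.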

## Exercise 4: unique counts


Compute the counts of unique values row-wise.

Input:
```
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
> array([[ 9,  9,  4,  8,  8,  1,  5,  3,  6,  3],
>        [ 3,  3,  2,  1,  9,  5,  1, 10,  7,  3],
>        [ 5,  2,  6,  4,  5,  5,  4,  8,  2,  2],
>        [ 8,  8,  1,  3, 10, 10,  4,  3,  6,  9],
>        [ 2,  1,  8,  7,  3,  1,  9,  3,  6,  2],
>        [ 9,  2,  6,  5,  3,  9,  4,  6,  1, 10]])
```
Desired Output:
```
> [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
>  [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
>  [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
>  [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
>  [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
>  [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
```
Output contains 10 columns representing numbers from 1 to 10. The values are the counts of the numbers in the respective rows.
For example, cell (0, 2) has the value 2, which means the number 3 occurs exactly 2 times in the 1st row.

In [54]:
import numpy as np

np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
z = np.zeros((6,10))

for i in range(len(arr)):
    #unique values in row i and how often each one occurs
    val, count = np.unique(arr[i], return_counts=True)
    z[i, val - 1] = count #value v is counted in column v - 1

z

array([[1., 0., 2., 1., 1., 1., 0., 2., 2., 0.],
       [2., 1., 3., 0., 1., 0., 1., 0., 1., 1.],
       [0., 3., 0., 2., 3., 1., 0., 1., 0., 0.],
       [1., 0., 2., 1., 0., 1., 0., 2., 1., 2.],
       [2., 2., 2., 0., 0., 1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1., 2., 0., 0., 2., 1.]])
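
The row loop can also be avoided with broadcasting. This is a sketch of one possible approach, not the only one: each entry is compared against every candidate value 1 to 10, and the matches are summed along the row axis. The array is copied from the exercise input so the block does not depend on the random seed.

```python
import numpy as np

# Array copied from the exercise input above
arr = np.array([
    [9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
    [3, 3, 2, 1, 9, 5, 1, 10, 7, 3],
    [5, 2, 6, 4, 5, 5, 4, 8, 2, 2],
    [8, 8, 1, 3, 10, 10, 4, 3, 6, 9],
    [2, 1, 8, 7, 3, 1, 9, 3, 6, 2],
    [9, 2, 6, 5, 3, 9, 4, 6, 1, 10],
])
values = np.arange(1, 11)

# matches[i, j, v] is True when arr[i, j] == values[v]; shape (6, 10, 10)
matches = arr[:, :, None] == values
# Summing over j counts how often each value occurs in row i
counts = matches.sum(axis=1)
print(counts)
```

The intermediate boolean array has one entry per (row, column, value) triple, so this trades memory for the absence of a loop.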

## Exercise 5: One-Hot encodings

Compute the one-hot encodings (AKA dummy binary variables) for each unique value in the array.

Input:
```
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
#> array([2, 3, 2, 2, 2, 1])
```
Output:
```
#> array([[ 0.,  1.,  0.],
#>        [ 0.,  0.,  1.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 0.,  1.,  0.],
#>        [ 1.,  0.,  0.]])
```

In [55]:
import numpy as np

np.random.seed(101)
arr = np.random.randint(1, 4, size = 6)
arr

array([2, 3, 2, 2, 2, 1])

In [56]:
arr2 = np.zeros((6,3))
arr2

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [73]:
arr3 = arr2.copy()
#Place each value of arr in the column matching its label (positions are hard-coded)
arr3[0, 1], arr3[2, 1], arr3[3, 1], arr3[4, 1] = arr[0], arr[2], arr[3], arr[4] #Copy 2s
arr3[5, 0] = arr[5] #Copy 1
arr3[1, 2] = arr[1] #Copy 3
arr3

array([[0., 2., 0.],
       [0., 0., 3.],
       [0., 2., 0.],
       [0., 2., 0.],
       [0., 2., 0.],
       [1., 0., 0.]])

In [58]:
np.where(arr3 < 1, arr3, 1)

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
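
The cells above place each value by hand, which only works for this particular array. A more general sketch, offered as one possible approach, compares every element against the sorted unique values with broadcasting:

```python
import numpy as np

arr = np.array([2, 3, 2, 2, 2, 1])  # array from the exercise input
labels = np.unique(arr)             # sorted unique values: [1, 2, 3]

# Row i has a 1 in the column whose label equals arr[i]
one_hot = (arr[:, None] == labels).astype(float)
print(one_hot)
```

Because the columns come from `np.unique`, this also handles labels that are not consecutive integers.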

## Exercise 6

Let `q` be a NumPy array of length `n` with `q.sum() == 1`.

Suppose that `q` represents a [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) over a statistical distribution. Recall that a distribution is an array of probabilities of events.

We want to generate a discrete random variable $ x $ such that $ \mathbb P\{x = i\} = q_i $.

In other words, `x` takes values in `range(len(q))` and `x = i` with probability `q[i]`.

The standard (inverse transform) algorithm is as follows:

- Divide the unit interval $ [0, 1] $ into $ n $ subintervals $ I_0, I_1, \ldots, I_{n-1} $ such that the length of $ I_i $ is $ q_i $.  
- Draw a uniform random variable $ U $ on $ [0, 1] $ and return the $ i $ such that $ U \in I_i $.  


The probability of drawing $ i $ is the length of $ I_i $, which is equal to $ q_i $.

We can implement the algorithm as follows:

```python
from random import uniform

def sample(q):
    a = 0.0
    U = uniform(0, 1)
    for i in range(len(q)):
        if a < U <= a + q[i]:
            return i
        a = a + q[i]
```

If you can’t see how this works, try thinking through the flow for a simple example, such as `q = [0.25, 0.75]`. It helps to sketch the intervals on paper.
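
To trace the flow without randomness, the loop can be run with hand-picked values of `U`. The helper `sample_with` below is hypothetical: it is the same loop as `sample`, with `U` passed in as an argument instead of drawn.

```python
def sample_with(q, U):
    """Same loop as sample(q), but with U supplied so the flow can be traced."""
    a = 0.0
    for i in range(len(q)):
        # I_i is the interval (a, a + q[i]]
        if a < U <= a + q[i]:
            return i
        a = a + q[i]

q = [0.25, 0.75]
# I_0 = (0, 0.25] and I_1 = (0.25, 1]
results = [sample_with(q, U) for U in (0.1, 0.25, 0.6, 0.99)]
print(results)  # [0, 0, 1, 1]
```

Values of `U` up to and including 0.25 fall in $ I_0 $, and everything above falls in $ I_1 $.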

**Your exercise is to speed it up using NumPy, avoiding explicit loops.**

- Hint: Use `np.searchsorted` and `np.cumsum`  


If you can, implement the functionality as a class called `DiscreteRV`, where

- the data for an instance of the class is the vector of probabilities `q`  
- the class has a `draw()` method, which returns one draw according to the algorithm described above  


If you can, write the method so that `draw(k)` returns `k` draws from `q`.

In [1]:
import numpy as np

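class DiscreteRV():
    def __init__(self, q, k):
        self.q = q #vector of probabilities
        self.k = k #number of draws returned by draw()

    def draw(self):
        #Draw all k uniforms at once; searchsorted maps each one to the
        #index of the subinterval of [0, 1] that contains it (no Python loop)
        d = np.random.rand(self.k)
        return np.searchsorted(self.sum2(), d).tolist()

    def sum2(self):
        #Right endpoints of the subintervals I_0, ..., I_{n-1}
        return np.cumsum(self.q)
q = (0.2, 0.3, 0.1, 0.1, 0.3)
dv = DiscreteRV(q, k = 3)
dv.sum2()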

array([0.2, 0.5, 0.6, 0.7, 1. ])

In [2]:
dv.draw()

[4, 1, 0]
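
The exercise also asks for `draw(k)` to take the number of draws as an argument. The sketch below is one way to arrange that, under a few assumptions that are not in the original: the cumulative sum is computed once in the constructor, a `seed` parameter is added for reproducibility, and NumPy's `default_rng` generator is used in place of `np.random.rand`.

```python
import numpy as np

class DiscreteRV:
    """Discrete random variable with P{x = i} = q[i] (illustrative sketch)."""

    def __init__(self, q, seed=None):
        self.q = np.asarray(q, dtype=float)
        self.cdf = np.cumsum(self.q)   # right endpoints of I_0, ..., I_{n-1}
        self.cdf[-1] = 1.0             # guard against round-off in the last edge
        self.rng = np.random.default_rng(seed)

    def draw(self, k=1):
        """Return k independent draws as an integer array."""
        u = self.rng.random(k)         # k uniforms on [0, 1)
        return np.searchsorted(self.cdf, u)

rv = DiscreteRV((0.2, 0.3, 0.1, 0.1, 0.3), seed=0)
draws = rv.draw(10_000)

# Empirical frequencies should be close to q for a large sample
freq = np.bincount(draws, minlength=5) / draws.size
print(freq)
```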

## Exercise 7 Peaks

Find all the peaks in a 1D NumPy array `a`. Peaks are points surrounded by smaller values on both sides.

Input:
```
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
```
Desired Output:
```
#> array([2, 5])
```
where 2 and 5 are the positions of peak values 7 and 6.

### 1. Solve this using a regular Python for loop

### 2. Solve this using no loops and only NumPy functions

In [59]:
#Python Loop
import numpy as np

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

def peaks(a):
    #Skip the endpoints: a peak needs a smaller neighbour on both sides
    for i in range(1, len(a) - 1):
        if a[i - 1] < a[i] > a[i + 1]:
            print("At index " + str(i) + " , the value is " + str(a[i]))
peaks(a)

At index 2 , the value is 7
At index 5 , the value is 6


In [60]:
#NumPy method (steps are specific to this array): start from the global maximum
print("At index " + str(np.argmax(a)) + " , the value is " + str(a.max()))

At index 2 , the value is 7


In [61]:
#Create new array that slices after highest value in a
b = a[3:]
b

array([1, 2, 6, 0, 1])

In [62]:
#Find highest value in new array
b.max()
print("The highest value in array b is " + str(b.max()))

The highest value in array b is 6


In [63]:
#Refer back to array a to find the index of value 6
np.where(a == 6)
print("At index " + str(np.where(a == 6)) + " , the value is " + str(b.max()))

At index (array([5]),) , the value is 6
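
The steps above depend on knowing where the largest value sits in this particular array. A general loop-free sketch, given here as one possible approach, compares each interior element with its left and right neighbours using shifted slices:

```python
import numpy as np

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

interior = a[1:-1]                       # candidates: every element except the two ends
is_peak = (interior > a[:-2]) & (interior > a[2:])  # larger than left and right neighbour
peak_idx = np.where(is_peak)[0] + 1      # +1 converts back to positions in a
print(peak_idx)  # [2 5]
```

The endpoints are never reported because they have only one neighbour, which matches the definition of a peak used in this exercise.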
