# Lab 2: Algorithms for Online FDR Control
Welcome to the second DS102 lab! 

The goal of this lab is to understand the mechanics of the LORD algorithm and why it controls FDR. You'll modify the provided code for LORD to match the simpler method that we saw in lecture, and you'll assess the performance of both methods under various sequences of p-values.

The code you need to write is marked with the message "TODO: fill in". There is additional documentation for each part as you go along.


## Course Policies

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others please **include their names** in the cell below.

**Submission**: To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run All), then export it as a PDF (File > Download as > PDF) and submit it to Gradescope.


**This assignment should be completed and submitted before Tuesday September 17, 2019 at 11:59 PM.**
    

In [None]:
%pylab inline
import matplotlib.pyplot as plt
import numpy as np
import math

In [None]:
# Here's a restatement of the LORD algorithm provided in HW2.
# Part 1a) will ask you to modify it to return alpha_ts.
def LORD(stream, alpha):
    # Inputs: stream - array of p-values, alpha - target FDR level
    # Output: array of indices k such that the k-th p-value corresponds to a discovery,
    #         array of alpha values for each time step
    
    gamma = lambda t: 6 / (math.pi * t) ** 2
    w_0 = alpha / 2
    rejections = []
    alpha_t = gamma(1) * w_0
    
    n = len(stream)
    
    for t in range(1, n + 1):
        # Offset by one since indexing by 1 for t.
        p_t = stream[t - 1] 
        if p_t < alpha_t:
            rejections.append(t)

        next_alpha_t = gamma(t + 1) * w_0 + alpha * sum([gamma(t + 1 - rej) for rej in rejections])
        # Check if tau_1 exists.
        if len(rejections) > 0:
            next_alpha_t -= gamma(t + 1 - rejections[0]) * w_0
        
        # Update alpha.
        alpha_t = next_alpha_t
    # Shift rejections since the rejections are 1-indexed.
    shifted_rej = [rej - 1 for rej in rejections]
    
    # TODO: fill in. Return alpha_ts in addition to shifted_rej.
    return shifted_rej, alpha_ts

## 1) Tracking Decisions and Alphas Over Time

### 1a) Modify the LORD algorithm above to return not only the decisions, but also the alpha_t's that correspond to each time step. 

Use the code below to plot the decisions and alpha values over time. Make sure to fill in all the missing labels in the plotting function defined below.

In [None]:
# Modify the plotting code so that the relevant quantities are correctly labeled.
def plot_disc_and_alphas(discoveries, alphas, algorithm_name, color='green', plotting_offset=0.5):
    m = len(alphas)
    plt.bar(np.arange(1, m + 1), alphas, label=algorithm_name, color=color, alpha=0.8)
    
    for i, disc in enumerate(discoveries):
        plt.axvline(disc + 1 + plotting_offset, color=color)  
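        # Label the i-th discovery as tau_{i+1}; doubled braces escape str.format so LaTeX sees a subscript.
        plt.annotate(r"$\tau_{{{0}}}$".format(i + 1), (disc + plotting_offset, np.max(alphas)), color=color, size=16)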
    plt.xlabel("TODO: fill in")
    plt.ylabel('TODO: fill in')
    
    plt.legend()

In [None]:
# SETUP FOR PROBLEM 1 - Run this cell to instantiate the p-values.
m = 20

rs = np.random.RandomState(0)
# p-values for the null generated from uniform (0,1).
p_values = rs.uniform(0, 1, size=m)

# Instantiate p-values for the alternative as 0.001.
alt_idxs = [0, 1, 4, 5, 10]
p_values[alt_idxs] = 0.001
alpha = 0.05


In [None]:
# Run the LORD algorithm.
discoveries, alphas = LORD(p_values, alpha)

# Plot the discoveries (make sure everything is labeled).
fig, ax = plt.subplots(figsize=(12, 6))
plot_disc_and_alphas(discoveries, alphas, algorithm_name="LORD")

### 1b) Now make a second function, LORD_most_recent() that follows the algorithm we analyzed in class; that is, the update step on alpha only depends on the most recent discovery (rather than the sum over all previous discoveries).

The update rule is given by
$$\alpha_t = \begin{cases} \gamma_t \alpha, & \text{if no rejection has yet been made}\\
\gamma_{t-r_t} \alpha, & \text{otherwise,} \end{cases}$$
where $r_t$ denotes the time of the most recent rejection before time $t$.

As in part 1a, we use
$$\gamma_t = \frac{6}{(\pi t)^2}.$$
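One property of this choice worth keeping in mind is that the sequence $\gamma_t$ sums to 1 over all $t \geq 1$, since $\sum_{t \geq 1} 1/t^2 = \pi^2/6$. The sketch below checks this numerically; the name `gamma_seq` is a hypothetical helper introduced only for this illustration and is not part of the lab code.

```python
import math

def gamma_seq(t):
    """gamma_t = 6 / (pi * t)^2, restated from the formula above (hypothetical helper name)."""
    return 6 / (math.pi * t) ** 2

# The first few values decay like 1 / t^2.
first_values = [gamma_seq(t) for t in range(1, 4)]

# Partial sums approach 1 because sum_{t >= 1} 1 / t^2 = pi^2 / 6.
partial_sum = sum(gamma_seq(t) for t in range(1, 100001))

print(first_values)
print(partial_sum)
```

The partial sum stays slightly below 1 for any finite horizon, which is one way to see that the test levels $\gamma_t \alpha$ used before any rejection add up to at most $\alpha$.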


In [None]:
# Suggestion: start by copying the LORD() function you modified above to output alphas. Then modify as needed.
def LORD_most_recent(stream, alpha):
    # Inputs: stream - array of p-values, alpha - target FDR level
    # Output: array of indices k such that the k-th p-value corresponds to a discovery,
    #         array of alpha values for each time step

    # TODO: fill in.
    return shifted_rej, alpha_ts

In [None]:
# Use the p-values from before so we can compare the plots.
# Run the LORD algorithm variant you defined above.
discoveries_2, alphas_2 = LORD_most_recent(p_values, alpha)

# Plot the discoveries from the original algorithm.
fig, ax = plt.subplots(figsize=(12, 6))

# The plotting offset keeps the two algorithms' discovery lines from overlapping each other.
plot_disc_and_alphas(discoveries, alphas, color="green", algorithm_name="TODO: fill in", plotting_offset=0.4)

# Plot the discoveries from the variant.
plot_disc_and_alphas(discoveries_2, alphas_2, color="blue", algorithm_name="TODO: fill in", plotting_offset=0.6)

### 1c) What do you notice? Are the two algorithms making substantially different decisions?

TODO: fill in what you notice.


## 2) Repeat the same experiment with a second set of p-values, defined in the cell below.

In [None]:
# SETUP FOR PROBLEM 2
m = 70

rs = np.random.RandomState(0)
# p-values for the null generated from uniform (0,1).
p_values = rs.uniform(0, 1, size=m)

# Instantiate p-values for the alternative as 0.001.
alt_idxs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 20, 25, 30, 35, 40, 50]
p_values[alt_idxs] = 0.001
alpha = 0.05

# Run the algorithms.
discoveries, alphas = LORD(p_values, alpha)
discoveries_2, alphas_2 = LORD_most_recent(p_values, alpha)

# Plot the results.
# Make the figure very wide so you can see what's going on.
fig, ax = plt.subplots(figsize=(30, 5))
plot_disc_and_alphas(discoveries, alphas, color="green", 
                     algorithm_name="TODO: fill in", 
                     plotting_offset=0.4)
plot_disc_and_alphas(discoveries_2, alphas_2, color="blue",
                     algorithm_name="TODO: fill in", 
                     plotting_offset=0.6)


### 2a) What do you notice? Compare with your results from the p-value sequence from part 1.

TODO: fill in with your findings.
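
Because the setup cells fix which indices are alternatives (`alt_idxs`), one way to put numbers on the comparison is to compute the false discovery proportion (FDP) of each run, along with the fraction of alternatives that were discovered. The following is a minimal sketch, assuming discoveries are 0-indexed as returned by `LORD`. The helper `fdp_and_power` and the toy inputs are hypothetical and are not part of the provided lab code. Keep in mind that FDR is the expectation of the FDP, so a single run gives only one realization.

```python
def fdp_and_power(discoveries, alt_idxs):
    """Return (false discovery proportion, fraction of alternatives discovered).

    discoveries: 0-indexed positions rejected by an algorithm.
    alt_idxs: 0-indexed positions where the alternative is true.
    """
    discovered = set(discoveries)
    alternatives = set(alt_idxs)
    false_discoveries = discovered - alternatives
    # Convention: the FDP is 0 when nothing is rejected.
    fdp = len(false_discoveries) / max(len(discovered), 1)
    power = len(discovered & alternatives) / max(len(alternatives), 1)
    return fdp, power

# Toy inputs made up for illustration (not actual output from LORD).
toy_discoveries = [0, 1, 3]
toy_alt_idxs = [0, 1, 4, 5, 10]
toy_fdp, toy_power = fdp_and_power(toy_discoveries, toy_alt_idxs)
print(toy_fdp, toy_power)
```

In the notebook, the same helper could be called as `fdp_and_power(discoveries, alt_idxs)` and `fdp_and_power(discoveries_2, alt_idxs)` once both algorithms have been run.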
