## Bash

In [1]:
%%sh 
pwd

# %%sh runs the cell in a shell; the magic command must be the first line of the cell (no comment above or beside it)

# Regular Unix commands apply
#ls
#cd mydir
#curl -O http://lamastex.org/datasets/public/SOU/sou/20170228.txt
#ls
#head 20170228.txt

/Users/emiresenov/Documents/Masters/Intro to Data Science/Notebooks/first_lecture_and_data


## Assignment 1, PROBLEM 2

Evaluate the following cells by replacing X with the right command-line option in order to find the first two lines of the CSV file data/earthquakes.csv.

In [2]:
%%sh
head -2 data/earthquakes.csv

publicid,eventtype,origintime,modificationtime,longitude, latitude, magnitude, depth,magnitudetype,depthtype,evaluationmethod,evaluationstatus,evaluationmode,earthmodel,usedphasecount,usedstationcount,magnitudestationcount,minimumdistance,azimuthalgap,originerror,magnitudeuncertainty
2018p368955,,2018-05-17T12:19:35.516Z,2018-05-17T12:21:54.953Z,178.4653957,-37.51944533,2.209351541,20.9375,M,,NonLinLoc,,automatic,nz3drx,12,12,6,0.1363924727,261.0977462,0.8209633086,0


In [3]:
line_1_earthquakes = "publicid,eventtype,origintime,modificationtime,longitude, latitude, magnitude, depth,magnitudetype,depthtype,evaluationmethod,evaluationstatus,evaluationmode,earthmodel,usedphasecount,usedstationcount,magnitudestationcount,minimumdistance,azimuthalgap,originerror,magnitudeuncertainty"
line_2_earthquakes = "2018p368955,,2018-05-17T12:19:35.516Z,2018-05-17T12:21:54.953Z,178.4653957,-37.51944533,2.209351541,20.9375,M,,NonLinLoc,,automatic,nz3drx,12,12,6,0.1363924727,261.0977462,0.8209633086,0"

## Assignment 1, PROBLEM 3

In this assignment the goal is to parse the earthquakes.csv file from the previous problem.

1. Read the file data/earthquakes.csv, parse it using the csv package, and store the result as follows:

the header variable contains the list of column names, all as strings

the data variable is a list of lists containing all the remaining rows of the CSV file

In [4]:
import csv

header = []
data = []
with open('data/earthquakes.csv', mode='r') as f:
    reader = csv.reader(f)
    header = next(reader)
    for line in reader:
        data.append(line)

#header
#data
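
The cell above can be checked on a small in-memory sample. The following is a minimal sketch, not part of the assignment: the three-column excerpt is a hypothetical stand-in for data/earthquakes.csv. It also shows `skipinitialspace=True`, a standard `csv` option that drops the space after a delimiter; the real header contains names such as ` latitude` with a leading space. Whether the assignment expects the raw or the stripped names is not stated here, so treat that option as an assumption.

```python
import csv
import io

# Hypothetical three-column excerpt mimicking the layout of data/earthquakes.csv.
sample = io.StringIO(
    "publicid,longitude, latitude\n"
    "2018p368955,178.4653957,-37.51944533\n"
)

# skipinitialspace=True removes whitespace directly after each comma.
reader = csv.reader(sample, skipinitialspace=True)
header = next(reader)             # first row: column names
data = [row for row in reader]    # remaining rows, each a list of strings

# csv.reader yields strings only; convert explicitly when numbers are needed.
longitude = float(data[0][header.index("longitude")])
```

Every field stays a string until it is converted, which is why `longitude` is parsed with `float` here.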

## Students passing exam (Sample exam problem)
Let's say we have an exam question which consists of $20$ yes/no questions. 
From past performance of similar students, a randomly chosen student will know the correct answer to $N \sim \text{binom}(20,11/20)$ questions. Furthermore, we assume that the student will guess the answer with equal probability to each question they don't know the answer to, i.e. given $N$ we define $Z \sim \text{binom}(20-N,1/2)$ as the number of correctly guessed answers. Define $Y = N + Z$, i.e., $Y$ represents the number of total correct answers.

We are interested in setting a deterministic threshold $T$, i.e., we would pass a student at threshold $T$ if $Y \geq T$. Here $T \in \{0,1,2,\ldots,20\}$.

1. [5p] For each threshold $T$, compute the probability that the student *knows* fewer than $10$ correct answers (i.e., $N < 10$) given that the student passed (i.e., $Y \geq T$). Put the answer in `problem11_probabilities` as a list.
2. [3p] What is the smallest value of $T$ such that if $Y \geq T$ then we are 90\% certain that $N \geq 10$?

In [5]:
from scipy.special import binom as binomial
p = 11/20
p_N = lambda k: binomial(20,k)*((1-p)**(20-k))*(p**k)

# Answer

## 1)
We solve this problem with the conditional probability formula $$P(A \mid B)=\frac{P(A \cap B)}{P(B)},$$

where $A = \{N < 10\}$ and $B = \{N + Z \geq T\} = \{Y \geq T\}$.

The reasoning is that the probability we want is that of a student who knows $N<10$ correct answers, given that the known and correctly guessed answers together reach the threshold, i.e. $N + Z \geq T$.

The question posed as a conditional probability can thus be expressed as $$P(N<10 \mid N + Z \geq T) = \frac{P(N < 10,\; N + Z \geq T)}{P(N + Z \geq T)}$$
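
As a sketch of how both probabilities can be evaluated, assuming only the definitions in the problem statement, numerator and denominator expand by the law of total probability over the value of $N$:

$$P(N<10,\; Y \geq T) = \sum_{i=0}^{9} P(N=i) \sum_{j=\max(0,\,T-i)}^{20-i} P(Z=j \mid N=i),$$

$$P(Y \geq T) = \sum_{i=0}^{20} P(N=i) \sum_{j=\max(0,\,T-i)}^{20-i} P(Z=j \mid N=i),$$

with $P(N=i) = \binom{20}{i}\left(\tfrac{11}{20}\right)^{i}\left(\tfrac{9}{20}\right)^{20-i}$ and $P(Z=j \mid N=i) = \binom{20-i}{j}\left(\tfrac{1}{2}\right)^{20-i}$. The inner sum starts at $\max(0, T-i)$ because a student who already knows $i \geq T$ answers passes regardless of the guesses.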

In [6]:
# Define p(Z) function
p_Z = lambda j,k: binomial(j,k)*((0.5)**(j-k)*(0.5**k))

# P(N < 10 and N + Z >= T) for a given T
def AnB(t):
    prob=0
    for i in range(10): # i = known answers
        pn = p_N(i)
        
        # Idea: with i known answers, sum the probabilities of every number of
        # correct guesses j for which guesses + known answers reach the threshold.
        # E.g. for 9 known answers and threshold 10, all cases from 1 up to 11
        # correct guesses contribute.
        # - If the threshold is at least i, count guesses from t-i up to 20-i
        #   (21-i in range(), whose upper bound is exclusive).
        # - If the threshold is lower than i, count guesses from 0 up to 20-i.
        for j in range(max(0,t-i),21-i): # j = correctly guessed answers
            prob += pn*p_Z(20-i,j)
    return prob 

# P(B) = P(N + Z >= T)
def B(t):
    prob=0
    # i = known answers
    for i in range(21):
        pn = p_N(i)
        # j = correctly guessed answers
        for j in range(max(0,t-i),21-i):
            prob += pn*p_Z(20-i,j)
    return prob

# P(A | B) = (A and B) / B
problem11_probabilities = [AnB(i)/B(i) for i in range(21)]

## Code reduction

The two functions above compute the same quantity, the probability of passing the threshold $t$, and differ only in the range of $N$ they sum over. We can therefore remove the repeated code by merging them into a single function parameterized by the upper bound on $N$.

In [7]:
# Probability of passing threshold t with N < maxN
def pThresh(maxN, t):
    prob=0
    # i = known answers
    for i in range(maxN):
        pn = p_N(i)
        # j = correctly guessed answers
        for j in range(max(0,t-i),21-i):
            prob += pn*p_Z(20-i,j)
    return prob

# P(A | B) = (A and B) / B
problem11_probabilities = [pThresh(10,i)/pThresh(21,i) for i in range(21)]
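
The denominator can be sanity-checked independently. Under the model in the problem statement each question is answered correctly with probability $11/20 + (9/20)(1/2) = 31/40$, independently of the others, so $Y \sim \text{binom}(20, 31/40)$. The sketch below compares the double sum with that closed form using only the standard library; `pmf`, `pass_prob_double_sum` and `pass_prob_closed_form` are hypothetical helper names introduced for this check, not part of the original solution.

```python
from math import comb, isclose

def pmf(n, k, p):
    """Binomial probability mass function P(X = k) for X ~ binom(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pass_prob_double_sum(t, max_n=21):
    """P(N < max_n and Y >= t), computed with the same double sum as pThresh."""
    total = 0.0
    for i in range(max_n):                      # i = known answers
        for j in range(max(0, t - i), 21 - i):  # j = correctly guessed answers
            total += pmf(20, i, 11 / 20) * pmf(20 - i, j, 0.5)
    return total

def pass_prob_closed_form(t):
    """P(Y >= t), assuming Y ~ binom(20, 31/40)."""
    return sum(pmf(20, y, 31 / 40) for y in range(t, 21))

# One comparison per threshold T = 0, ..., 20.
checks = [
    isclose(pass_prob_double_sum(t), pass_prob_closed_form(t), rel_tol=1e-9)
    for t in range(21)
]
```

If every entry of `checks` is true, the double sum agrees with the closed form for all thresholds, which gives some confidence in the loop bounds.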

In [9]:
# Part 2: Give an integer between 0 and 20 which is the answer to 2.
problem12_T = 0
for i in range(len(problem11_probabilities)):
    if (problem11_probabilities[i] < 0.10):
        problem12_T = i
        break

problem12_T

17

## Concentration of measure (Sample exam problem)

As you recall, we said that concentration of measure is simply the phenomenon where we expect the probability of a large deviation of some quantity to become smaller as we observe more samples: [0.4 points per correct answer]

1. Which of the following will exponentially concentrate, i.e. for some constants $C_1,C_2,C_3,C_4$
$$
    P(Z - \mathbb{E}[Z] \geq \epsilon) \leq C_1 e^{-C_2 n \epsilon^2} \wedge C_3 e^{-C_4 n (\epsilon+1)} \enspace .
$$

    1. The empirical mean of i.i.d. sub-Gaussian random variables?
    2. The empirical mean of i.i.d. sub-Exponential random variables?
    3. The empirical mean of i.i.d. random variables with finite variance?
    4. The empirical variance of i.i.d. random variables with finite variance?
    5. The empirical variance of i.i.d. sub-Gaussian random variables?
    6. The empirical variance of i.i.d. sub-Exponential random variables?
    7. The empirical third moment of i.i.d. sub-Gaussian random variables?
    8. The empirical fourth moment of i.i.d. sub-Gaussian random variables?
    9. The empirical mean of i.i.d. deterministic random variables?
    10. The empirical tenth moment of i.i.d. Bernoulli random variables?

2. Which of the above will concentrate in the weaker sense, that for some $C_1$
$$
    P(Z - \mathbb{E}[Z] \geq \epsilon) \leq \frac{C_1}{n \epsilon^2}?
$$

In [8]:
problem3_answer_1 = [1,2,5,9,10]
problem3_answer_2 = [1,2,3,5,6,7,8,9,10]
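
To get a feel for the difference between the two rates, the sketch below evaluates one bound of each kind for the empirical mean of i.i.d. Bernoulli(1/2) variables. The choice of distribution, the use of Hoeffding's inequality for the exponential rate and of Chebyshev's inequality for the $1/(n\epsilon^2)$ rate, and the function names are assumptions made for illustration; they are not taken from the exam question.

```python
from math import exp

def hoeffding_bound(n, eps):
    """Hoeffding bound on P(mean - E[mean] >= eps) for i.i.d. variables in [0, 1]."""
    return exp(-2 * n * eps**2)

def chebyshev_bound(n, eps, variance):
    """Chebyshev bound on P(|mean - E[mean]| >= eps) for i.i.d. variables with the given variance."""
    return min(1.0, variance / (n * eps**2))

eps = 0.1
variance = 0.25  # variance of a Bernoulli(1/2) variable

# Print both bounds for growing sample sizes.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(n, eps), chebyshev_bound(n, eps, variance))
```

Both bounds shrink as $n$ grows, but the exponential one shrinks much faster, which is the distinction the two parts of the question draw.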