# Report 3 - Carl Moser

# Project Ideas

Some project ideas that I am interested in working on:

- Election prediction based on data mining from news sources or social media
- Expanding on my first report problem and doing some sort of word prediction

In [1]:
from __future__ import print_function, division

% matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import math
import numpy as np
from scipy import stats

from thinkbayes2 import Pmf, Cdf, Suite, Joint, Beta
import thinkplot

## Gluten allergy
From: [This study from 2015](http://onlinelibrary.wiley.com/doi/10.1111/apt.13372/full) showed that many subjects diagnosed with non-celiac gluten sensitivity (NCGS) were not able to distinguish gluten flour from non-gluten flour in a blind challenge.

Here is a description of the study:

>"We studied 35 non-CD subjects (31 females) that were on a gluten-free diet (GFD), in a double-blind challenge study. Participants were randomised to receive either gluten-containing ﬂour or gluten-free ﬂour for 10 days, followed by a 2-week washout period and were then crossed over. The main outcome measure was their ability to identify which ﬂour contained gluten.
>"The gluten-containing ﬂour was correctly identiﬁed by 12 participants (34%)..."
Since 12 out of 35 participants were able to identify the gluten flour, the authors conclude "Double-blind gluten challenge induces symptom recurrence in just one-third of patients fulﬁlling the clinical diagnostic criteria for non-coeliac gluten sensitivity."

This conclusion seems odd to me, because if none of the patients were sensitive to gluten, we would expect some of them to identify the gluten flour by chance.  So the results are consistent with the hypothesis that none of the subjects are actually gluten sensitive.

We can use a Bayesian approach to interpret the results more precisely.  But first we have to make some modeling decisions.

1. Of the 35 subjects, 12 identified the gluten flour based on resumption of symptoms while they were eating it.  Another 17 subjects wrongly identified the gluten-free flour based on their symptoms, and 6 subjects were unable to distinguish.  So each subject gave one of three responses.  To keep things simple I follow the authors of the study and lump together the second two groups; that is, I consider two groups: those who identified the gluten flour and those who did not.

2. I assume (1) people who are actually gluten sensitive have a 95% chance of correctly identifying gluten flour under the challenge conditions, and (2) subjects who are not gluten sensitive have only a 40% chance of identifying the gluten flour by chance (and a 60% chance of either choosing the other flour or failing to distinguish).

Using this model, estimate the number of study participants who are sensitive to gluten.  What is the most likely number?  What is the 95% credible interval?

In [2]:
correct = 12
incorrect = 17
indistinguishable = 6





## Bugfinder
From [John D. Cook](http://www.johndcook.com/blog/2010/07/13/lincoln-index/)

"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.

Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find.  The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine]."

So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs.  What is the Bayesian estimate of the number of errors based on this data?

To start, we can define the number of bugs that each of the programmers found and how many of those were in common.

In [5]:
num_p1 = 20
num_p2 = 15
num_c = 3

Next, we can estimate the total number of bugs using the Lincoln index

(num_p1)(num_p2)/num_c

In [7]:
num_p1*num_p2/num_c

100.0

We see that the estimated number of bugs is about 100. We can take this problem further and do the Bayesian estimate of the number of errors. First, we can create a class that inherits from Suite and Joint. Then, we can define the Likelihood function.

In [8]:
class Bugfinder(Suite, Joint):
    def __init__(self, n, p1, p2):
        '''
        Makes a joint suite of parameters n, p1, p2

        n: possible number of bugs
        p1: the probability that programmer1 will find a bug
        p1: The probability that programmer 2 will find a bug
        '''
        trips = [(num, a1, a2)
                 for num in n
                 for a1 in p1 
                 for a2 in p2]
        
        Suite.__init__(self, trips)
        
    def Likelihood(self, data, hypo):
        '''
        k1, k2, c = data
        n, p1, p2 = hypo
        
        k1: The number of bugs that programmer 1 found
        k2: The number of bugs that programmer 2 found
        c:  The number of bugs that they found in common
        '''
        k1, k2, c = data
        n, p1, p2 = hypo
        
        like1 = stats.binom.pmf(k1, n, p1)
        like2 = stats.binom.pmf(k2, n, p2)
        
        return like1 + like2

We can then create arrays of the number of possible bugs, the probability that programmer 1 will find a bug, and the probability that programmer 2 will find a bug.

In [9]:
n = np.linspace(32, 40)
p1 = np.linspace(0,1,10)
p2 = np.linspace(0,1,10)

b = Bugfinder(n, p1, p2)

In [10]:
b.Update((20,15,3))

0.048848351520928451