## Homework 1: Brandon Shurick

### HW1.0.0
Define big data. Provide an example of a big data problem in your domain of expertise. 

In my opinion, "big data" is the integration of data into core business processes, regardless of data size. In my domain of expertise (business intelligence and operations), big data shows up in two major ways:
- recommendations based on observational data, e.g. for account managers, customer care agents, and other internal employees, based on client or partner history
- causal interpretation of randomized, controlled experiments, e.g. A/B tests

Each of these requires careful recording and cleansing of data that may involve any of the "3 V's": volume, variety, and velocity. For example, we may need cheap storage and Hadoop to store all of our operational and customer usage data, because we don't always know in advance which features the recommendations will need. We might need to build complicated data processing code to cleanse and connect all of the types of data we record. Lastly, we might need to build infrastructure that can support real-time analysis at high speed.

### HW1.0.1
In 500 words (English or pseudocode or a combination), describe how to estimate the bias, the variance, and the irreducible error for a test dataset T when polynomial regression models of degree 1, 2, 3, 4, and 5 are considered. How would you select a model?

First, to estimate bias we need the target function that the regression models are attempting to approximate. We can call this function $f(x)$. The test dataset $T$ consists of pairs $(x, y)$, where $y$ is the result of $f(x)$ plus some random noise. For each polynomial degree $n$ in $N = \{1, 2, 3, 4, 5\}$, a regression model $g_n(x)$ is fit on training data. Because bias and variance describe how the fit behaves across training samples, the fit is repeated on several training sets (for example, bootstrap resamples of the training data), giving several models per degree.

The squared bias for degree $n$ is then the mean, over the test points in $T$, of the squared difference between the average prediction of the degree-$n$ models and the true function $f(x)$. In pseudocode, where `fit` and `training_sets` stand for the fitting routine and the collection of training sets:

models = {}  
bias = {}  
for n in N:  
&nbsp;&nbsp;models[n] = [ fit(D, degree=n) for D in training_sets ]  
for n in N:  
&nbsp;&nbsp;sq_diffs = [ ]  
&nbsp;&nbsp;for (x, y) in T:  
&nbsp;&nbsp;&nbsp;&nbsp;avg_pred = mean([ g(x) for g in models[n] ])  
&nbsp;&nbsp;&nbsp;&nbsp;sq_diffs.append((avg_pred - f(x))\*\*2)  
&nbsp;&nbsp;bias[n] = mean(sq_diffs)  

Measuring the variance does not require the function $f(x)$. For each polynomial degree, fit the model on several training sets, and at each test point measure how far the individual predictions spread around their own average. The variance for degree $n$ is the mean of that spread over the test points in $T$. In pseudocode, where `fit` and `training_sets` stand for the fitting routine and the collection of training sets:

models = {}  
variance = {}  
for n in N:  
&nbsp;&nbsp;models[n] = [ fit(D, degree=n) for D in training_sets ]  
for n in N:  
&nbsp;&nbsp;spreads = [ ]  
&nbsp;&nbsp;for (x, y) in T:  
&nbsp;&nbsp;&nbsp;&nbsp;preds = [ g(x) for g in models[n] ]  
&nbsp;&nbsp;&nbsp;&nbsp;spreads.append(mean([ (p - mean(preds))\*\*2 for p in preds ]))  
&nbsp;&nbsp;variance[n] = mean(spreads)

The irreducible (constant) error does not depend on the model or on the polynomial degree. It is estimated as the mean squared difference between each observation $y$ in the dataset $T$ and the true function $f(x)$ evaluated at the same $x$. In pseudocode:

errors = [ ]  
for (x, y) in T:  
&nbsp;&nbsp;errors.append((y - f(x))\*\*2)  
noise = mean(errors)
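
For reference, the three quantities estimated above are the terms of the standard decomposition of the expected squared test error at a point $x$. The notation here is introduced only for illustration: it assumes $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and writes $\bar{g}_n(x)$ for the degree-$n$ prediction averaged over training sets.

$$
\mathbb{E}\big[(y - g_n(x))^2\big] = \underbrace{\big(\bar{g}_n(x) - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(g_n(x) - \bar{g}_n(x))^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}
$$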

To choose the model, I would select the polynomial degree whose model $g_n(x)$ has the lowest sum of squared bias and variance. The irreducible error can be ignored in this comparison because it is the same for every model.
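
The procedure can be tried end to end on simulated data. The following is a minimal sketch, not a definitive implementation: the target function `f`, the noise level `NOISE_SD`, the sample sizes, and the helper names (`polyfit`, `predict`, `sample`) are all assumptions made for illustration, and the hand-rolled least-squares solver only keeps the example free of third-party dependencies.

```python
import random

random.seed(0)

def f(x):
    # Hypothetical true function; it is known here only because the data is simulated.
    return x ** 3 - x

NOISE_SD = 0.2            # assumed noise standard deviation
DEGREES = [1, 2, 3, 4, 5]

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y], where A is the Vandermonde matrix of xs.
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        s = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - s) / aug[i][i]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def mean(values):
    values = list(values)
    return sum(values) / float(len(values))

def sample(n):
    """Draw n noisy observations of f on [-1, 1]."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys

x_test, y_test = sample(40)                       # test dataset T
training_sets = [sample(30) for _ in range(100)]  # repeated training samples

bias_sq, variance = {}, {}
for n in DEGREES:
    models = [polyfit(xs, ys, n) for xs, ys in training_sets]
    # preds[i] holds the predictions of every degree-n model at test point i
    preds = [[predict(model, x) for model in models] for x in x_test]
    avg = [mean(p) for p in preds]
    bias_sq[n] = mean((a - f(x)) ** 2 for a, x in zip(avg, x_test))
    variance[n] = mean(mean((g - a) ** 2 for g in p) for p, a in zip(preds, avg))

noise = mean((y - f(x)) ** 2 for x, y in zip(x_test, y_test))
best = min(DEGREES, key=lambda n: bias_sq[n] + variance[n])

for n in DEGREES:
    print("degree %d: bias^2=%.4f variance=%.4f" % (n, bias_sq[n], variance[n]))
print("irreducible error estimate: %.4f, selected degree: %d" % (noise, best))
```

With real data $f(x)$ is unknown, so bias and irreducible error cannot be separated this way; comparing held-out or cross-validated squared error across degrees is the usual substitute.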


### HW1.1
Read through the provided control script (pNaiveBayes.sh)
   and all of its comments. When you are comfortable with their
   purpose and function, respond to the remaining homework questions below. 
   A simple cell in the notebook with a print statement with a "done" string will suffice here. (Don't forget to include the question number and the question in the cell as a multiline comment!)

### HW1.2
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.

In [38]:
%%writefile mapper.py
#!/usr/bin/env python
# Emit, for each line of the input file, how often the target word occurs.
import re, sys
filename = sys.argv[1]   # input file handed to this mapper
findword = sys.argv[2]   # word to count (case-sensitive match)
WORDS = re.compile(r'[\w]+')
with open(filename, 'r') as infile:
    for line in infile:
        wordslist = WORDS.findall(line.strip())
        findwords = [w for w in wordslist if w == findword]
        print(len(findwords))

Writing mapper.py


In [39]:
!chmod +x mapper.py

In [40]:
%%writefile reducer.py
#!/usr/bin/env python
# Sum the per-line counts found in every mapper output file.
import sys
filenames = sys.argv[1:]
total = 0
for f in filenames:
    with open(f, 'r') as infile:
        for line in infile:
            total += int(line.strip())
print(total)

Writing reducer.py


In [41]:
!chmod +x reducer.py

In [42]:
!chmod +x pNaiveBayes.sh

In [44]:
!./pNaiveBayes.sh 4 assistance
!echo assistance: `cat *.output`

assistance: 10
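
The counting logic can also be checked without the control script. This is a small sketch under stated assumptions: the functions `map_chunk` and `reduce_counts` are hypothetical stand-ins for mapper.py and reducer.py, and the sample lines are made up rather than taken from the email data.

```python
import re

WORDS = re.compile(r'[\w]+')

def map_chunk(lines, findword):
    """Mimic mapper.py: one count per input line."""
    return [sum(1 for w in WORDS.findall(line) if w == findword) for line in lines]

def reduce_counts(chunks):
    """Mimic reducer.py: add up the per-line counts from every chunk."""
    return sum(sum(counts) for counts in chunks)

# Hypothetical sample lines, split into two chunks standing in for mapper inputs.
chunk_a = ["Please call for assistance.", "No match here"]
chunk_b = ["assistance, assistance and more Assistance"]

total = reduce_counts([map_chunk(chunk_a, "assistance"),
                       map_chunk(chunk_b, "assistance")])
print(total)  # prints 3
```

The capitalized "Assistance" is not counted because the comparison is case-sensitive, so lowercasing both the text and the search word would likely change the total reported above.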


### HW1.3
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a single, user-specified word using the multinomial Naive Bayes formulation. Examine the word “assistance” and report your results. To do so, provide
   
   - mapper.py and
   - reducer.py 

that together perform a single-word Naive Bayes classification. For multinomial Naive Bayes, Pr(X=“assistance”|Y=SPAM) is calculated as follows:

  the number of times “assistance” occurs in SPAM labeled documents / the number of words in documents labeled SPAM 

In [110]:
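%%writefile mapper.py
#!/usr/bin/env python
# Emit per-document counts needed for single-word multinomial Naive Bayes.
import re, sys
filename = sys.argv[1]
findword = sys.argv[2]
WORDS = re.compile(r'[\w]+')
with open(filename, 'r') as infile:
    for line in infile:
        # Drop punctuation but keep whitespace so the tab-separated fields survive.
        line = re.sub(r'[^\w\s]+', '', line.strip())
        components = line.split('\t')
        try:
            spamdocs = int(components[1])   # second field is read as the spam label (1 = spam)
        except (IndexError, ValueError):
            spamdocs = 0                    # treat malformed lines as non-spam
        words = ' '.join(components[2:])
        wordslist = WORDS.findall(words)
        totaldocs = 1
        totalwords = len(wordslist)
        if spamdocs == 1:
            findwords = [w for w in wordslist if w == findword]
            spamwords = len(wordslist)
        else:
            findwords = []
            spamwords = 0
        # Output: totalwords, totaldocs, spamwords, spamdocs, target-word count in spam
        print('{}\t{}\t{}\t{}\t{}'.format(
            totalwords, totaldocs, spamwords, spamdocs, len(findwords)))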

Overwriting mapper.py


In [105]:
!chmod +x mapper.py

In [106]:
%%writefile reducer.py
#!/usr/bin/env python
# Combine the mapper outputs into Pr(word|SPAM) * Pr(SPAM).
import sys
filenames = sys.argv[1:]
findwords = []
spamwords = []
totaldocs = []
spamdocs = []
for f in filenames:
    with open(f, 'r') as infile:
        for line in infile:
            components = line.strip().split('\t')
            totaldocs.append(int(components[1]))
            spamwords.append(int(components[2]))
            spamdocs.append(int(components[3]))
            findwords.append(int(components[4]))

# Multinomial likelihood: occurrences of the word in spam / all words in spam.
spamp = sum(findwords)*1.0/sum(spamwords)
# Prior: fraction of documents labeled spam.
prior = sum(spamdocs)*1.0/sum(totaldocs)
print('{}'.format(spamp*prior))

Overwriting reducer.py


In [107]:
!chmod +x reducer.py

In [108]:
!chmod +x pNaiveBayes.sh

In [109]:
!./pNaiveBayes.sh 4 assistance
!echo assistance: `cat *.output`

assistance: 0.00016419906044
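
The printed value is the product Pr(X=“assistance”|Y=SPAM) × Pr(Y=SPAM). To make the arithmetic concrete, here is a worked toy example; the four-document corpus and all variable names are hypothetical and are not drawn from the homework dataset.

```python
# Hypothetical toy corpus of (is_spam, text) pairs; not the homework dataset.
docs = [
    (1, "need assistance claim your prize now"),
    (1, "urgent assistance required assistance fund"),
    (0, "meeting moved to monday"),
    (0, "lunch tomorrow"),
]
word = "assistance"

spam_docs = [text.split() for label, text in docs if label == 1]
spam_word_total = sum(len(words) for words in spam_docs)      # 6 + 5 = 11
word_in_spam = sum(words.count(word) for words in spam_docs)  # 1 + 2 = 3

likelihood = float(word_in_spam) / spam_word_total  # Pr(X=word | Y=SPAM) = 3/11
prior = float(len(spam_docs)) / len(docs)           # Pr(Y=SPAM) = 2/4
score = likelihood * prior
print(round(score, 4))  # prints 0.1364
```

On its own this product is an unnormalized score for the SPAM class; classifying a message would mean computing the matching score for the non-spam class and picking the larger of the two.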


### HW1.4
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results.
   
To do so, make sure that

   - mapper.py counts all occurrences of a list of words, and
   - reducer.py performs the multiple-word multinomial Naive Bayes classification via the chosen list.  