# Skills Test 1

Solve problems that you are comfortable with first, to the best of your ability. Realize that many, if not most, students may not complete the entire test, and thus you need to focus on *doing your best on what you can*. 

* Time: 90 minutes
* Closed Book
* Individual

Remember: Everything is Showbiz. Break a leg.

# Written Examination

Focus on producing concise, complete answers to the written sections. Place the answer to each question in the appropriate cell. Written answers that are complete but spitballing (i.e. B.S.) are unlikely to receive any points. Partial answers with a few correct elements are much more likely to do better.

## Discussing Models:

1) Write a concise description of each of the elements for k-Nearest Neighbors:

* Hypothesis
* Cost
* Optimization

(It is true that these elements do not map cleanly to those of other algorithms you have learned, however you should be able to take a moment and reason out approximately what they should be.)

**ANSWER:**

    3 point response:

    * Hypothesis: Members of a test set can be classified by the spatial relationships maintained with neighborhoods of training points.
    * Cost: a) Each class votes for a given test point according to its neighborhood proximity to that point (using a given distance norm). b) The class is assigned based on the largest number of votes. The neighborhood is defined as having k or more neighbors to the test point (thus k is effectively a threshold).
    * Optimization: The training set is only learned once and so the algorithm is already optimal within the definition of the problem.

    Deduct 1 full point for conceptual deviations from each of these terms. 0.5 point deduction if important pieces are missing (in this case only the cost function has several parts)
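
The voting rule described in the answer can be sketched in a few lines. This is a minimal illustration, not the lab's implementation: the function name `knn_predict`, the toy points, and the choice of exactly `k` neighbors with a simple majority vote are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and let each one cast a vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["a", "a", "a", "b", "b", "b"]
prediction_near_a = knn_predict(train_points, train_labels, (0.05, 0.1))
prediction_near_b = knn_predict(train_points, train_labels, (1.0, 0.95))
```

Note that there is no training loop: "fitting" amounts to storing the training set, which is why the optimization element has nothing to iterate on.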

2) k-Nearest Neighbors: what are the:

* Use cases
* Strengths
* Weaknesses 

(short answers!)

**ANSWER:**
    
    3 point response:

    * Use cases: Where the problem can be formulated in two dimensions and does not include strongly overlapping clusters of varying density.
    * Strengths: Fast, reasonably robust to small amounts of noise.
    * Weaknesses: There are many, especially the curse of dimensionality and variable density.

    Deduct 1 full point for conceptual deviations from each of these terms. 0.5 point deduction if important pieces are missing.

3) You have produced the below diagram of performance for your k-NN algorithm. Assume that you have used Euclidean distance for your distance metric. Explain in a concise description what is happening and the *mathematical reason as to why* it is happening.



![dvp](./images/dvp.png)

**ANSWER:**

    3 point response:

    This illustrates the curse of dimensionality raised in the previous question. As the number of dimensions increases, the (Euclidean) distances between a point $x_i$ belonging to actual class $C_1$ and training points belonging to other classes $C_2 \cdots C_N$ begin to converge to similar values. Thus all classes begin to vote equally, and the model becomes unable to distinguish between similar classes. The effects are noticeable even at four and five dimensions.
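
The convergence of distances can be checked with a small simulation. The sketch below is an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, and `relative_contrast` is a hypothetical helper that reports how far apart the nearest and farthest neighbors of a single query are, relative to the nearest.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# Contrast between nearest and farthest neighbor for several dimensionalities.
contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
```

When the contrast is close to zero, the "nearest" neighbors are barely nearer than any other point, so their votes carry little information about the class.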



4) Write a concise description (including mathematical formulae) of each of the elements for Regularized Logistic Regression:

* Hypothesis
* Cost
* Optimization


**ANSWER:**
    
3 point response:

    * Hypothesis: The training set points have a binary relationship, and belong to one of two classes. The underlying variables that determine the binary relationship have a linear relationship amongst each other (linear regression).
    * Cost: 

$$J = - \sum_{i = 1}^{n} \left[ y_i \log(p(x_i)) + (1 - y_i) \log(1 - p(x_i)) \right] + \lambda \sum_{j = 1}^{p} \beta_j^2$$

    * Optimization: Gradient descent. (Writing the gradient of the cost is welcome but not necessary)
    
$$\frac{\partial J}{\partial \beta_j} = -\sum_{i=1}^{n} \left( y_i - p(x_i) \right) x_{ij} + 2 \lambda \beta_j$$
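
As a rough sketch of how the cost $J$ and gradient descent fit together, the following pure-Python example evaluates the penalized cost, its partial derivatives, and a fixed-step descent loop. The function names, the toy data, the step size, and the choice to penalize the constant column are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(beta, X, y, lam):
    """Negative log-likelihood plus an L2 penalty on every coefficient."""
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total + lam * sum(b * b for b in beta)


def gradient(beta, X, y, lam):
    """Partial derivatives of `cost` with respect to each coefficient."""
    grad = [2.0 * lam * b for b in beta]
    for row, label in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        for j, x in enumerate(row):
            grad[j] -= (label - p) * x
    return grad


def gradient_descent(X, y, lam=0.1, step=0.1, n_iter=500):
    """Repeatedly step against the gradient, starting from all zeros."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = gradient(beta, X, y, lam)
        beta = [b - step * g for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data; the first column is a constant 1 acting as an intercept
# (penalized here as well, purely to keep the sketch short).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]
beta_hat = gradient_descent(X, y)
```

A useful sanity check when writing a gradient by hand is to compare it against a finite-difference approximation of the cost at a few arbitrary points.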

5) Logistic Regression: Why would we choose to use a logistic regression over any other type of regression? What is the real difference between a logistic regression and any other?

**ANSWER:**
    
    3 point response:

    A logistic regression is used specifically for binary classification. The real difference between a logistic regression and any other is simply the inclusion of the logistic function into the cost function. 

6) How do we measure the performance of Logistic Regression? How does this differ from measurement of the performance of standard Regression?

**ANSWER:**

    A logistic regression is typically measured in terms of its confusion matrix - particularly the true positive rate (sensitivity) and false positive rate (1-specificity). Negative labels are not commonly accounted for. Standard regression is typically discussed in terms of explained variance, so continuous measures such as MSE and F-test are used.
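
To make the contrast concrete, the sketch below computes sensitivity and the false positive rate from hard class predictions, and mean squared error from continuous predictions. The helper names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_and_fpr(y_true, y_pred):
    """True positive rate and false positive rate (1 - specificity)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


def mean_squared_error(y_true, y_pred):
    """Average squared residual, the usual continuous regression measure."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical classifier output.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tpr, fpr = sensitivity_and_fpr(labels, predicted)

# Hypothetical regression output.
mse = mean_squared_error([2.0, 4.0], [1.0, 6.0])
```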

7) Write a concise description (including mathematical formulae) of each of the elements for Naive Bayes:

* Hypothesis
* Cost
* Optimization


**ANSWER:**

* Hypothesis: We may apply Bayes' law to classify points (documents) among M different classes using N different features as keys, using the probability of observing a given feature $x_i$ for a given class $C_{k}$, $p(x_i|C_{k})$, as if it were completely independent of the presence of other features.
* Cost: 

8) Why is Naive Bayes so naive? Use the mathematics to provide an explanation. The best answer will discuss the probability theory involved.


**ANSWER:**
    The probability of observing a class $C$ dependent on the observation of a set of features $X_{1}, X_{2},...,X_{N}$ is the fully joint probability:
    
$$P(C|X_{1}, X_{2},...,X_{N})$$
    
    Applying Bayes' law we get:
    
$$P(C|X_{1}, X_{2},...,X_{N}) = \dfrac{P(X_{1}, X_{2},...,X_{N}|C)P(C)}{P(X_{1}, X_{2},...,X_{N})}$$
    
    In order to compute the actual probability on the left, we would need to use and parameterize joint histograms with a dimension for each of the features (N dimensions). For any significant number of features (say 8 or more), an enormous training set would be required to produce adequate sampling. We can overcome this by assuming that the features are independent of one another. Two events A and B are independent iff:
   
   $$P(A \cap B) = P(A,B) = P(A)P(B)$$
    
    This reduces our above equation to the product:
    
$$P(C|X_{1}, X_{2},...,X_{N}) = \dfrac{P(X_{1}|C) P(X_{2}|C) \cdots P(X_{N}|C)P(C) }{P(X_{1})P(X_{2}) \cdots P(X_{N})}$$

    In practice the denominator is the same for every class, so it is treated as a constant and ignored, thus we have:
    
$$P(C|X_{1}, X_{2},...,X_{N}) \propto P(X_{1}|C) P(X_{2}|C) \cdots P(X_{N}|C)P(C)$$
    
    However, the notion that all features are completely independent is naive, considering that pairs or triples of features are quite likely to appear together because they share an underlying cause, for example brown eyes and brown hair. It turns out that having perfect priors is less important with enough data.
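
The factorized form is simple enough to evaluate by hand. As a small worked sketch, the code below multiplies a class prior by per-feature conditional probabilities and then normalizes across classes, which is where the ignored denominator is recovered. The class names, words, and all probability values are made up for illustration.

```python
def naive_bayes_scores(priors, likelihoods, observed):
    """Unnormalized posterior: P(C) times the product of P(X_i | C)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    return scores


def normalize(scores):
    """Divide by the sum over classes so the values add up to one."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical numbers, not taken from any dataset.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"money": 0.5, "nigeria": 0.2},
    "ham": {"money": 0.1, "nigeria": 0.01},
}
scores = naive_bayes_scores(priors, likelihoods, ["money", "nigeria"])
posteriors = normalize(scores)
```

Because only the ranking of classes matters for prediction, the normalization step can be skipped entirely when classifying.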

9) Why would I prefer to use lasso regression instead of ridge regression? 


**ANSWER:**

    Lasso regression is commonly used as an exploratory technique to determine which variables among many are explanatory. Normally it is used when it is believed that all variables are weakly collinear and there are several dominating variables.
    Ridge regression is not intended to remove variables, but simply provide good shrinkage and robustness to the final regression.
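
The difference in behavior is easiest to see in the special case of an orthonormal design, where each penalized coefficient is a simple function of the least-squares coefficient $b$. The sketch below assumes the per-coefficient objectives $\frac{1}{2}(b - \beta)^2 + \lambda|\beta|$ for lasso and $\frac{1}{2}(b - \beta)^2 + \frac{\lambda}{2}\beta^2$ for ridge; the function names and the coefficient values are hypothetical.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: move b toward zero by lam, clipping at exactly zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Proportional shrinkage: every coefficient is scaled by 1 / (1 + lam)."""
    return b / (1.0 + lam)


# Hypothetical least-squares coefficients: two dominant, two weak.
ols_coefficients = [3.0, -0.4, 0.2, -2.0]
lasso_coefficients = [lasso_shrink(b, 0.5) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, 0.5) for b in ols_coefficients]
```

Under these assumptions the lasso sets the two weak coefficients to exactly zero, while ridge keeps all four and only makes them smaller.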


## Model Metrics

These questions will require a little more writing than the above section. 

1) Why does the Bias-Variance tradeoff occur? More specifically, what causes it?

**ANSWER:**

    The B-V tradeoff occurs due to overfitting of the model to the underlying system. Additional parameters enable the model to better capture the underlying behavior of the system (less biased), but the parameters become progressively more fragile and prone to undersampling and error in the training data. Consequently the model becomes less robust (increased variance). Too few parameters create a model that fails to capture the complexity of the underlying system, but does so reliably.
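
The tradeoff can be observed numerically by refitting a model on many resampled training sets and looking at its predictions at one point. The sketch below swaps in k-NN regression as the model, since its flexibility is controlled by a single number: `k=1` plays the role of a highly flexible model and `k` equal to the training-set size plays the role of a rigid one. The true function, noise level, and sample sizes are assumptions chosen for illustration.

```python
import math
import random
import statistics


def true_function(x):
    return math.sin(2 * math.pi * x)


def knn_regress(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_and_variance(k, x0=0.25, n_train=30, n_trials=300, noise=0.3, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_trials):
        # Draw a fresh noisy training set for every trial.
        train = []
        for _ in range(n_train):
            x = rng.random()
            train.append((x, true_function(x) + rng.gauss(0.0, noise)))
        predictions.append(knn_regress(train, x0, k))
    mean_prediction = statistics.fmean(predictions)
    bias_squared = (mean_prediction - true_function(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_squared, variance


flexible_bias_sq, flexible_variance = bias_and_variance(k=1)
rigid_bias_sq, rigid_variance = bias_and_variance(k=30)
```

The flexible setting tracks the true value on average but jumps around from one training set to the next; the rigid setting is stable but consistently wrong.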

2) How do we know that a model is underfit? 

**ANSWER:**
    
    This question definitely involves some nuance; however, underfitting can be detected by checking whether the model reproduces similar measures of performance during cross-validation. In general, underfitting is characterized by an inability of the model to produce good measures of performance on training data. Thus being underfit suggests a lack of sufficient model complexity.

3) How do we know that a model is overfit?

**ANSWER:** 

    Cross-validation can often help in this case too. Overfitting regularly produces highly accurate training results, but the model will fail when used on test data because it imputes behaviors to the system that don't exist (often a result of too many parameters rather than too few, e.g. trying to fit a 2nd-order polynomial with an 8th-order polynomial). Measures of model performance will vary widely based on the training set used.

4) How do bias and variance relate to precision and accuracy? How do they relate to Type I and Type II errors?

**ANSWER:**

        Increased bias leads to reduced accuracy - the model cannot accurately predict the correct behavior of the system. Increased variance leads to reduced precision. Although the model may be able to predict the correct behavior of the system, it cannot do so reliably. Bias and variance can lead to both types of errors. In the case of bias we typically focus on type I error.

5) Why is $R^{2}$ not a good metric to compare two regression models? What would be a better choice to compare them?

**ANSWER:**
    
    $R^2$ is ultimately just a measure of goodness-of-fit to the regressed data for an individual model, and thus does not account for the explanatory value of additional degrees of freedom. A better measure is the F-statistic, which does this explicitly. Adjusted $R^2$ can also be used, but it is not a substitute for model-specific F-tests and p-values.
    
    
[read this link](https://sakai.duke.edu/access/content/group/25e08a3d-9fc4-41b0-a7e9-815732c1c4ba/New%20folder/Stat%20Topic%20Files/Non-Linear%20Regression/FTestTutorial.pdf)
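
For two nested least-squares models, both quantities mentioned in the answer reduce to short formulas. The sketch below assumes the common textbook forms: the F-statistic compares the drop in residual sum of squares against the larger model's residual variance, with `p_small` and `p_large` counting all fitted parameters (intercept included), while adjusted $R^2$ takes the number of predictors excluding the intercept. The function names and example numbers are hypothetical.

```python
def f_statistic(rss_small, rss_large, p_small, p_large, n):
    """F-statistic for comparing a nested small model against a larger one."""
    # Improvement in fit per extra parameter.
    numerator = (rss_small - rss_large) / (p_large - p_small)
    # Residual variance estimate of the larger model.
    denominator = rss_large / (n - p_large)
    return numerator / denominator


def adjusted_r_squared(r_squared, n, n_predictors):
    """Penalize R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Hypothetical residual sums of squares for two nested models on 54 observations.
f_value = f_statistic(rss_small=120.0, rss_large=100.0, p_small=2, p_large=4, n=54)

# Hypothetical R^2 for a model with 4 predictors on 21 observations.
adj_r2 = adjusted_r_squared(0.8, n=21, n_predictors=4)
```

The resulting `f_value` would then be compared against an F distribution with `p_large - p_small` and `n - p_large` degrees of freedom.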

### Questions 6-8

Below is a ROC curve detailing the confusion matrix of a binary ultrasound test for uterine cancer. The ultrasound device makes it possible to measure the thickness of the uterine wall. The idea of the test is to measure this thickness and set a threshold of thickness that is "normal". 

When the test measures a thickness greater than this number, the patient will be labeled as a cancer risk and the test will be "positive". Below the number the test will be "negative."

Of course there are limitations. 

![uterine_roc](./images/ROC_endometrial.jpg)

6) Write a *brief* description of what is happening in the above figure, discussing the role of sensitivity and specificity with respect to the confusion matrix.

**ANSWER:**

    As the threshold is lowered, both the true positive (sensitivity) and false positive (1-specificity) (Type I error) rates are increasing. The ROC curve provides us a comparison of relative rate of increase vs. threshold value.
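
The way a ROC curve is traced out by lowering the threshold can be sketched with a handful of made-up measurements. The thickness values and labels below are entirely hypothetical and are not read off the figure; `roc_points` is an illustrative helper.

```python
def roc_points(thickness_mm, has_cancer, thresholds):
    """Return (threshold, false positive rate, true positive rate) triples."""
    positives = sum(has_cancer)
    negatives = len(has_cancer) - positives
    points = []
    for threshold in thresholds:
        # The test is "positive" when the measured thickness exceeds the threshold.
        tp = sum(1 for t, c in zip(thickness_mm, has_cancer) if t > threshold and c == 1)
        fp = sum(1 for t, c in zip(thickness_mm, has_cancer) if t > threshold and c == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Hypothetical measurements: five healthy patients, five with cancer.
thickness_mm = [3, 4, 6, 7, 9, 5, 11, 14, 16, 20]
has_cancer = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
points = roc_points(thickness_mm, has_cancer, thresholds=[15, 10, 5, 2])
```

Each threshold yields one point on the curve; moving to a lower threshold can only keep or raise both rates, never lower them.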

7) Suppose that this test is expensive and lacks predictive value of outcome and thus we want to minimize the number of false positives while still keeping the test usefully diagnostic. Where should we set the threshold?

**ANSWER:**
    
    A threshold of about 12-15 mm will be adequate. This provides a 10-20% FP rate while capturing 75-85% of the TP. This is the best setting for a test of this type.

8) Suppose that this test is cheap and it doesn't matter to us what the human implications are of giving millions of women a year the belief that they are soon to be diagnosed with uterine cancer. Our intent is to diagnose every possible case while keeping the number of false positives reasonable. Where should we set the threshold?

**ANSWER:**

    Set the threshold around 4-5 mm. This captures nearly 100% of the TP, keeping FP around 40%. Unfortunately, this also yields a significant Type II rate, meaning there will be a large fraction of women who actually have uterine cancer (with a uterine wall thickness below the threshold) who will be later diagnosed at an unfavorable point, falsely believing they had nothing to worry about. This is a problem with overconfidence in a test. 

# Coding Examination


## Bernoulli Bayes

You will need to use the original lab for Naive Bayes.

Naive Bayes has numerous variants. The only difference between each of these variants and Multinomial Bayes is in the calculation of the distribution of likelihoods.  For Bernoulli Bayes we have the following likelihood function:

$$p(y|x) = p(x_{i}|y)^{N_{1}}(1-p(x_{i}|y))^{N_{0}}$$

Where $y$ is the class to be predicted and $p(x_{i}|y)$ is the frequency of a feature (word) given the class. Unlike the multinomial model, we also account for how often each feature is absent. That is to say, $N_{1}$ is the number of counts of a given feature (word) within that class and/or prediction, and $N_{0}$ is the number of counts when the given feature does *not* appear within a given class! Hence we are predicting *negative* relationships within a class as well as positive relationships.

You are permitted to study this [resource](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html) and compare with [this one](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html). Don't waste too much time reading!!


1) Write the log-likelihood cost function needed for BernoulliBayes.


$$\log p(y|x) = \sum_{i=1}^{K} N_{1, x_{i}} \log{(p(x_{i}|y))} + N_{0, x_{i}} \log{(1-p(x_{i}|y))}$$
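
For a single document, each feature is either present or absent, so $N_{1}$ is 1 and $N_{0}$ is 0 for a present feature and the reverse for an absent one. The sketch below evaluates the log-likelihood in that setting and adds the log prior so that classes can be compared. The per-feature probabilities, priors, and helper name are hypothetical.

```python
import math


def bernoulli_log_score(x, feature_probs, prior):
    """log prior + sum over features of x*log(p) + (1 - x)*log(1 - p)."""
    score = math.log(prior)
    for present, p in zip(x, feature_probs):
        # A present feature contributes log(p); an absent one contributes log(1 - p).
        score += present * math.log(p) + (1 - present) * math.log(1.0 - p)
    return score


# Hypothetical per-class probabilities that each of three words appears.
feature_probs = {"spam": [0.8, 0.6, 0.1], "ham": [0.1, 0.2, 0.7]}
priors = {"spam": 0.4, "ham": 0.6}

document = [1, 1, 0]  # first two words present, third absent
scores = {
    label: bernoulli_log_score(document, feature_probs[label], priors[label])
    for label in priors
}
predicted = max(scores, key=scores.get)
```

The absence of the third word raises the spam score here, which is exactly the negative evidence that the multinomial variant ignores.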

2) Inherit from the `NaiveBayes` class from the lab to create a `BernoulliBayes` subclass.


3) Overwrite the `_predict()` function of the `NaiveBayes` parent class (within the subclass) to reflect the log-likelihood of the Bernoulli Bayes variant.


```python
import numpy as np
from collections import defaultdict
from sklearn.preprocessing import binarize


class NaiveBayes(object):
    """Multinomial Naive Bayes from the lab, used as the parent class."""

    def __init__(self, alpha=1):
        self.prior = {}
        self.per_feature_per_label = {}
        self.feature_sum_per_label = {}
        self.likelihood = {}
        self.posterior = {}
        self.alpha = alpha
        self.p = None

    def compute_prior(self, y):
        # Raw class counts; left unnormalized because the argmax is unaffected.
        for label in y:
            self.prior[label] = self.prior.get(label, 0) + 1

    def compute_likelihood(self, X, y):
        for label, row_features in zip(y, X):
            if label in self.per_feature_per_label:
                self.per_feature_per_label[label] += row_features
            else:
                # Copy so the in-place += above never modifies a row of X.
                self.per_feature_per_label[label] = row_features.copy()
            self.feature_sum_per_label[label] = (
                self.feature_sum_per_label.get(label, 0) + row_features.sum())

        # Laplace-smoothed per-class feature frequencies.
        for label, feature_counts in self.per_feature_per_label.items():
            numerator = feature_counts + self.alpha
            denominator = self.feature_sum_per_label[label] + self.alpha * self.p
            self.likelihood[label] = numerator / denominator

    def fit(self, X, y):
        self.p = X.shape[1]
        self.compute_prior(y)
        self.compute_likelihood(X, y)

    def predict(self, X):
        predictions = []
        for row in X:
            scores = defaultdict(float)
            for label, prior in self.prior.items():
                scores[label] = np.log(prior) + np.sum(row * np.log(self.likelihood[label]))
            # Pick the label with the largest log-posterior score.
            predictions.append(max(scores, key=scores.get))
        return predictions

    def score(self, X, y):
        return np.mean(np.asarray(self.predict(X)) == y)


class BernoulliBayes(NaiveBayes):
    """Naive Bayes variant that models presence/absence of each feature."""

    def compute_likelihood(self, X, y):
        for label, row_features in zip(y, X):
            # binarize expects a 2-D array, so reshape the row and unwrap it.
            present = binarize(row_features.reshape(1, -1))[0]
            if label in self.per_feature_per_label:
                self.per_feature_per_label[label] += present
            else:
                self.per_feature_per_label[label] = present
            self.feature_sum_per_label[label] = (
                self.feature_sum_per_label.get(label, 0) + present.sum())

        for label, feature_counts in self.per_feature_per_label.items():
            numerator = feature_counts + self.alpha
            # NOTE: this denominator uses the summed feature presences. A standard
            # Bernoulli estimate divides by the number of documents in the class
            # plus 2 * alpha, which may explain any disagreement with BernoulliNB.
            denominator = self.feature_sum_per_label[label] + self.alpha * len(set(y))
            self.likelihood[label] = numerator / denominator

    def predict(self, X):
        predictions = []
        for row in X:
            row_presence = binarize(row.reshape(1, -1), threshold=0.)[0]
            scores = defaultdict(float)
            for label, prior in self.prior.items():
                log_prob = np.log(self.likelihood[label])
                log_neg_prob = np.log(1 - self.likelihood[label])
                # log prior + sum_i [x_i * log(p_i) + (1 - x_i) * log(1 - p_i)]
                scores[label] += np.log(prior)
                scores[label] += row_presence.dot(log_prob - log_neg_prob)
                scores[label] += log_neg_prob.sum()
            predictions.append(max(scores, key=scores.get))
        return predictions
```


4) Run your `BernoulliBayes` model on the same test harness used in the lab.  How does it score compared to the Multinomial Naive Bayes? Why would this behavior occur?

**ANSWER:**

        This section is only scored for logical participation. We might expect that BB performs better compared to NB due to the fact that groups (pairs) of words may be a useful indicator in this dataset, i.e. "Nigeria" and "money". 

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

data = np.genfromtxt('./data/spam.csv', delimiter=',')

# Last column is the label; everything before it is a feature.
y = data[:, -1]
X = data[:, 0:-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)

print("My Implementation:")
my_nb = BernoulliBayes()
my_nb.fit(X_train, y_train)
print('Accuracy:', my_nb.score(X_test, y_test))
my_predictions = my_nb.predict(X_test)

print("sklearn's Implementation")
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
print('Accuracy:', bnb.score(X_test, y_test))
sklearn_predictions = bnb.predict(X_test)

# Assert I get the same results as sklearn
# (will give an error if different)
assert np.all(sklearn_predictions == my_predictions)
```

```
Train shape: (3450, 57)
Test shape: (1151, 57)
My Implementation:
Accuracy: 0.913119026933
sklearn's Implementation
Accuracy: 0.891398783666


AssertionError: 
```