# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup postings on a variety of topics. You'll train classifiers to distinguish between the topics based on the text of the posts. In digit classification, the input is relatively dense: a 28x28 matrix of pixels, many of which are non-zero. Here, we'll instead represent each document with a "bag-of-words" model. As you'll see, this makes the feature representation quite sparse -- only a few words of the total vocabulary are active in any given document. The bag-of-words assumption is that the label depends only on which words appear; their order is not important.
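
To make the sparsity concrete, here is a minimal sketch of a bag-of-words encoding in plain Python. The two toy documents and the whitespace tokenizer are assumptions for illustration; CountVectorizer's own tokenizer and the newsgroup data behave differently in the details.

```python
from collections import Counter

# Two made-up documents standing in for newsgroup posts.
docs = [
    "the shuttle reached orbit",
    "the image renders in the viewer",
]

# Build the vocabulary: every distinct word across the toy corpus, sorted.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return one count per vocabulary word; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]

# Count the non-zero entries to see how sparse the representation is.
nonzero = sum(1 for vec in vectors for value in vec if value != 0)
fraction_nonzero = nonzero / (len(docs) * len(vocab))
print(vocab)
print(vectors)
print("fraction non-zero:", fraction_nonzero)
```

Even in this tiny corpus, each document leaves about half of the vocabulary at zero; with thousands of documents and tens of thousands of words, the non-zero fraction becomes far smaller.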

The SK-learn documentation on feature extraction will prove useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up and write your own code.

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test so we have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [None]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

num_test = len(newsgroups_test.target)
test_data, test_labels = newsgroups_test.data[int(num_test/2):], newsgroups_test.target[int(num_test/2):]
dev_data, dev_labels = newsgroups_test.data[:int(num_test/2)], newsgroups_test.target[:int(num_test/2)]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('test label shape:', test_labels.shape)
print('dev label shape:', dev_labels.shape)
print('labels names:', newsgroups_train.target_names)

(1) For each of the first 5 training examples, print the text of the message along with the label.

[2 pts]

In [None]:
#def P1(num_examples=5):
### STUDENT START ###

def print_examples(num_examples):
    for i in range(num_examples):
        print("\n\n\nTraining Example {}:".format(i + 1))
        print("\n{}".format(train_data[i]))
        print("\nLabel: {}".format(newsgroups_train.target_names[train_labels[i]]))
        
print_examples(5)

### STUDENT END ###
#P1(2)

(2) Use CountVectorizer to turn the raw training text into feature vectors. You should use the fit_transform function, which makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").

The vectorizer has a lot of options. To get familiar with some of them, write code to answer these questions:

a. The output of the transform (also of fit_transform) is a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html. What is the size of the vocabulary? What is the average number of non-zero features per example? What fraction of the entries in the matrix are non-zero? Hint: use "nnz" and "shape" attributes.

b. What are the 0th and last feature strings (in alphabetical order)? Hint: use the vectorizer's get_feature_names function (get_feature_names_out in scikit-learn 1.0 and later).

c. Specify your own vocabulary with 4 words: ["atheism", "graphics", "space", "religion"]. Confirm the training vectors are appropriately shaped. Now what's the average number of non-zero features per example?

d. Instead of extracting unigram word features, use "analyzer" and "ngram_range" to extract bigram and trigram character features. What size vocabulary does this yield?

e. Use the "min_df" argument to prune words that appear in fewer than 10 documents. What size vocabulary does this yield?

f. Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? Hint: build a vocabulary for both train and dev and look at the size of the difference.

[6 pts]

In [None]:
#def P2():
### STUDENT START ###

def make_cv_get_dtms(cv, train_inputs, dev_inputs):
    doc_term_matrix = cv.fit_transform(train_inputs)
    dev_doc_term = cv.transform(dev_inputs)
    return cv, doc_term_matrix, dev_doc_term

# these are the default
# will use them in subsequent problems
# that ask for default CountVectorizer
(default_cv,
 default_train_dtm,
 default_dev_dtm) = make_cv_get_dtms(CountVectorizer(), train_data, dev_data)

## QUESTION A:
print("Question A:")
print("Vocabulary size: {}".format(default_train_dtm.shape[1]))
print("Average number of non-zero features per example: {:.2f}".format(default_train_dtm.nnz / 
                                                                       default_train_dtm.shape[0]))
print("Fraction of entries in matrix that are non-zero: {:.4f}".format(default_train_dtm.nnz /
                                                                       (default_train_dtm.shape[0] * 
                                                                       default_train_dtm.shape[1])))

##QUESTION B:
print("\nQuestion B:")
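# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available since 1.0) returns the same names as an array.
feature_names = default_cv.get_feature_names_out()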
print("First feature name: {}".format(feature_names[0]))
print("Last feature name: {}".format(feature_names[-1]))

##QUESTION C:
print("\nQuestion C:")

(lim_vocab_cv,
 lim_vocab_train_dtm,
 lim_vocab_dev_dtm) = make_cv_get_dtms(CountVectorizer(vocabulary=["atheism", "graphics", "space", "religion"]), 
                                      train_data, 
                                      dev_data)

# the shape here makes sense - there are now only 4 features,
# but still 2034 training examples
print("Training vector shape: {}".format(lim_vocab_train_dtm.shape))
print("Average number of non-zero features per example: {:.2f}".format(lim_vocab_train_dtm.nnz / 
                                                                       lim_vocab_train_dtm.shape[0]))
print("Fraction of entries in matrix that are non-zero: {:.4f}".format(lim_vocab_train_dtm.nnz /
                                                                       (lim_vocab_train_dtm.shape[0] * 
                                                                       lim_vocab_train_dtm.shape[1])))

##QUESTION D:
print("\nQuestion D:")
(bigram_cv,
 bigram_train_dtm,
 bigram_dev_dtm) = make_cv_get_dtms(CountVectorizer(analyzer='char', ngram_range=(2,2)), 
                                    train_data, 
                                    dev_data)

print("Vocabulary size (for bi-gram character features): {}".format(bigram_train_dtm.shape[1]))

(trigram_cv,
 trigram_train_dtm,
 trigram_dev_dtm) = make_cv_get_dtms(CountVectorizer(analyzer='char', ngram_range=(3,3)), 
                                     train_data, 
                                     dev_data)
print("Vocabulary size (for tri-gram character features): {}".format(trigram_train_dtm.shape[1]))


##QUESTION E:
print("\nQuestion E:")
(minword_cv,
 minword_train_dtm,
 minword_dev_dtm) = make_cv_get_dtms(CountVectorizer(min_df=10), 
                                     train_data, 
                                     dev_data)
print("Vocabulary size (with only words that appear in 10+ documents): {}".format(minword_train_dtm.shape[1]))

##Question F:
print("\nQuestion F:")
# build a doc term matrix that includes train and dev
cv = CountVectorizer()
cv_doc_term_matrix = cv.fit_transform(train_data + dev_data)

# also build one that includes just dev
dev_cv = CountVectorizer()
dev_cv_doc_term_matrix = dev_cv.fit_transform(dev_data)

# subtract length of train only from train and dev
num_dev_words_not_in_train = cv_doc_term_matrix.shape[1] - default_train_dtm.shape[1]

# take that over length of dev
dev_fraction_missing = num_dev_words_not_in_train / dev_cv_doc_term_matrix.shape[1]
print("Fraction of dev words missing from train set: {:.2f}".format(dev_fraction_missing))
### STUDENT END ###
#P2()
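
The Question F calculation can also be sketched with plain sets, since the number of words that appear in dev but not in train is exactly the size of the set difference between the two vocabularies. The toy documents and the whitespace `tokenize` helper below are hypothetical stand-ins, not the newsgroup data or CountVectorizer's tokenizer.

```python
def tokenize(text):
    """Crude lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

# Made-up train and dev documents.
train_docs = ["NASA launched the probe", "OpenGL draws the scene"]
dev_docs = ["The probe reached Mars"]

train_vocab = {word for doc in train_docs for word in tokenize(doc)}
dev_vocab = {word for doc in dev_docs for word in tokenize(doc)}

# Words the dev set uses that the training vocabulary never saw.
missing = dev_vocab - train_vocab
fraction_missing = len(missing) / len(dev_vocab)
print(sorted(missing), fraction_missing)
```

This gives the same count as subtracting the train-only vocabulary size from the combined train-plus-dev vocabulary size, because the combined vocabulary is the train vocabulary plus exactly those missing words.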

(3) Use the default CountVectorizer options and report the f1 score (use metrics.f1_score) for a k nearest neighbors classifier; find the optimal value for k. Also fit a Multinomial Naive Bayes model and find the optimal value for alpha. Finally, fit a logistic regression model and find the optimal value for the regularization strength C using l2 regularization. A few questions:

a. Why doesn't nearest neighbors work well for this problem?

b. Any ideas why logistic regression doesn't work as well as Naive Bayes?

c. Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute. Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.

[4 pts]

In [None]:
#def P3():
### STUDENT START ###
def fit_grid_search(model, 
                    params, 
                    dtm,
                    dev_dtm,
                    model_name):
    clf = GridSearchCV(model, params, cv=3)
    clf.fit(dtm, train_labels)
    print("\nBest params: {}".format(clf.best_params_))    
    y_pred = clf.predict(dev_dtm)
    f1 = metrics.f1_score(dev_labels, 
                          y_pred,
                          average="weighted")
    print("Score for best {} classifier: {:.3f}".format(model_name,
                                                    f1))
    return clf, f1


parameters = {'n_neighbors': list(range(1, 100, 2)), 'weights':['distance', 'uniform']}
knn = KNeighborsClassifier()
_ = fit_grid_search(knn,
                parameters,
                default_train_dtm,
                default_dev_dtm,
                "knn")

# using a geometric sequence to try to efficiently get in the right ballpark
parameters = {'alpha': [2 * .5 ** (n - 1) for n in range(1, 15 + 1)]}
m_nb = MultinomialNB()
_ = fit_grid_search(m_nb, 
                parameters,
                default_train_dtm,
                default_dev_dtm,
                "multinomial nb")

# using a geometric sequence to try to efficiently get in the right ballpark
parameters = {'C': [2 * .5 ** (n - 1) for n in range(1, 15 + 1)] }
log_reg = LogisticRegression(penalty='l2')
_ = fit_grid_search(log_reg,
                parameters,
                default_train_dtm,
                default_dev_dtm,
                "logistic regression")


### STUDENT END ###
#P3()

ANSWERS:

a. *Why doesn't nearest neighbors work well for this problem?*

Answer: A nearest neighbors classifier predicts a label by finding the training documents whose feature vectors are closest to the document being classified (roughly, the documents that share the most words with it). That method is prone to error here, because two documents on the same topic can use almost entirely different vocabularies. Additionally, the distance metric weights every feature equally and nearest neighbors learns no per-feature weights, so in deciding which documents are neighbors it treats overlap in common, uninformative words (which occur a lot) the same as overlap in rarer signal words. As a result, document A can end up close to document B because they share many very common words, even though they differ significantly in the words that actually pertain to the labels.
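
The effect described above can be illustrated with a small sketch using Euclidean distance (the default metric for KNeighborsClassifier) on raw count vectors. The three documents and their counts are invented for illustration only.

```python
import math

# Toy count vectors over the vocabulary ["the", "of", "space", "god"].
# The documents and counts are made up for illustration.
docs = {
    "space_short": [2, 1, 3, 0],
    "space_long": [20, 12, 4, 0],
    "religion_long": [19, 11, 0, 4],
}

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Find the nearest neighbor of the long space post.
query = "space_long"
distances = {
    name: euclidean(docs[query], vec)
    for name, vec in docs.items()
    if name != query
}
nearest = min(distances, key=distances.get)
print(nearest, distances)
```

The long space post ends up closest to the long religion post, because the counts of "the" and "of" dominate the distance and the topical words barely register.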

b. *Any ideas why logistic regression doesn't work as well as Naive Bayes?*

Answer: 
Two ideas:
1. Naive Bayes works particularly well here because its independence assumption is well-suited to this problem. The independence assumption allows the model to treat each word in the document completely independently in determining its probability, which is a helpful assumption when you have a large number of words and those words may or may not co-occur in similar documents. In logistic regression, the model learns coefficients that explain the additional likelihood that a document is of a particular class _given that all other words in the document are equal_. This is a more restrictive assumption that isn't as well-suited to this problem. Since the weights in logistic regression are learned jointly (without the independence assumption that's present in Naive Bayes), the model is extra sensitive to the sparsity of the matrices: the presence or absence of certain words could change the prediction dramatically based on the coefficients assigned to those words.
2. Naive Bayes and logistic regression sit at different points on the bias-variance tradeoff, which could be impacting their performance on this task. Since Naive Bayes is essentially a set of univariate estimators (one per word in the feature set), the overall variance of this model is lower than that of the multivariate estimation in logistic regression. If logistic regression has higher variance, and we see that it's performing more poorly, that could be a sign that it's overfitting to the training data (which seems awfully likely when we know that a quarter of the dev vocabulary consists of words that aren't in the training set at all).

c. *Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute. Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.* 

In [None]:
# for question C: sum of squared weight values for different settings of C
cs = []
sswvs = []
for c in [2 * .5 ** (n - 1) for n in range(1, 15 + 1)]:
    
    # fit logistic regression with given value of C
    log_reg = LogisticRegression(C=c)
    log_reg.fit(default_train_dtm, train_labels)
    
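    # calculate the sum of the squared weight values for each class
    # (one row of coef_ per class), plus the total across classes
    per_class_sswvs = np.sum(np.square(log_reg.coef_), axis=1)
    sum_square_wvs = np.sum(per_class_sswvs)
    
    # print results and add them to lists to be plotted
    print("\nC: {}".format(c))
    for name, value in zip(newsgroups_train.target_names, per_class_sswvs):
        print("Sum of squared weight values for {}: {:.3f}".format(name, value))
    print("Total sum of squared weight values: {:.3f}".format(sum_square_wvs))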
    cs.append(c)
    sswvs.append(sum_square_wvs)
    
plt.plot(cs, sswvs, '.')
plt.title("Sum of Squared Weight vs C values")
plt.xlabel("C")
plt.ylabel("Sum of Squared Weight")

Answer: The sum of squared weights grows as C grows. C is the inverse of the regularization strength, so smaller C values mean stronger regularization. L2 regularization adds a penalty on large parameters to the loss function, which discourages the model from becoming more complex. With weaker regularization (larger C values), the model is free to learn larger weights, and indeed we see the aggregate size of the weights increase as C gets larger and the regularization penalty gets more relaxed.
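
The relationship between C and the weight size can be reproduced on a tiny problem. The sketch below is a simplified stand-in, not scikit-learn's solver: it fits a single weight (no intercept) on made-up one-dimensional data by gradient descent, minimizing `0.5 * w**2 + C * (sum of log-losses)`, which has the same shape as an L2-penalized logistic regression objective where C scales the data term.

```python
import math

# Toy 1-D data with a little label noise, so the optimal weight stays finite.
xs = [-2.0, -1.0, 1.0, 2.0, 0.5, -0.5]
ys = [0, 0, 1, 1, 0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(C, steps=5000):
    """Minimize 0.5 * w**2 + C * sum(log-loss) by gradient descent."""
    # Step size 1/L, where L is an upper bound on the objective's curvature.
    step = 1.0 / (1.0 + C * sum(x * x for x in xs) / 4.0)
    w = 0.0
    for _ in range(steps):
        grad = w + C * sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys))
        w -= step * grad
    return w

for C in [0.01, 1.0, 100.0]:
    w = fit_weight(C)
    print("C={:<6} squared weight={:.4f}".format(C, w * w))
```

As C increases, the penalty term matters less relative to the data term, so the fitted weight moves away from zero toward the unpenalized optimum.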

(4) Train a logistic regression model. Find the 5 features with the largest weights for each label -- 20 features in total. Create a table with 20 rows and 4 columns that shows the weight for each of these features for each of the labels. Create the table again with bigram features. Any surprising features in this table?

[5 pts]

In [None]:
#def P4():
### STUDENT START ###

### I meant to ask about this ahead of time but I forgot:
### we weren't supposed to import new libraries for this project,
### but we didn't have pandas, and I wondered if that was an oversight
### given that that's the most common/effective way to make tables in a notebook. 
def make_coefs_table(d_t_m, cv):
    
    log_reg = LogisticRegression()
    log_reg.fit(d_t_m, train_labels)
    idx_to_word = {v:k for k,v in cv.vocabulary_.items()}
    coefs = log_reg.coef_
    all_idxes_of_interest = []
    cat_labels_for_df = []
    
    # iterate over each set of coefficients associated with a label
    # to find the top weights for that label
    for i, coef_set in enumerate(coefs):
        
        # I'm going to interpret largest weights to mean largest absolute value
        # because that corresponds best to the idea of which features are strongest signals
        # in making a prediction for a class (negative signal is still a strong signal)
        # find indexes of the 5 coefficients with highest absolute values
        indexes = np.argsort(np.abs(coef_set))[::-1][:5].tolist()
        
        all_idxes_of_interest.extend(indexes)
        cat_labels_for_df.extend([newsgroups_train.target_names[i]]*5)

    # find the actual words from the index positions
    words = [idx_to_word[idx] for idx in all_idxes_of_interest]
    
    # for each word of interest, find the coefficients across all labels
    coef_0_list = [coefs[0, idx] for idx in all_idxes_of_interest]
    coef_1_list = [coefs[1, idx] for idx in all_idxes_of_interest]
    coef_2_list = [coefs[2, idx] for idx in all_idxes_of_interest]
    coef_3_list = [coefs[3, idx] for idx in all_idxes_of_interest]

    df = pd.DataFrame(data={"words": words,
                            "category": cat_labels_for_df,
                            newsgroups_train.target_names[0]: coef_0_list,
                            newsgroups_train.target_names[1]: coef_1_list,
                            newsgroups_train.target_names[2]: coef_2_list,
                            newsgroups_train.target_names[3]: coef_3_list})
    
    df = df.set_index(['category', 'words'])
    
    return df
        
df = make_coefs_table(default_train_dtm, default_cv)
(bigram_word_cv,
 bigram_word_dtm,
 bigram_dev_dtm) = make_cv_get_dtms(CountVectorizer(ngram_range=(2,2)),
                                   train_data,
                                   dev_data)
bigram_df = make_coefs_table(bigram_word_dtm, bigram_word_cv)

df.round(3)
### STUDENT END ###
#P4()

By choosing to use largest absolute values for each category, we end up getting some duplicates in the table, but I chose to keep it that way because I thought it was an interesting finding. Most notably, the word "space" shows up as one of the strongest signals in every single category: it has a positive coefficient for the "space" category, indicating that the presence of that word is a strong signal that the containing document is in that category, and it has a strong negative coefficient for all the other categories. The most surprising features in this table, overall, are "blood" as a positive signal for religion, "bobby" as a positive signal for atheism, and "could" as a negative signal for religion. These signals suggest particularities of the data we're training on that might not hold up if our sample got bigger; perhaps a number of the documents reference a certain "bobby" who is active in the atheism conversation, but that signal might not generalize well if we get more documents.

In [None]:
bigram_df.round(3)

Many of these features are less obvious than their unigram counterparts. Atheism and graphics in particular have a lot of phrases that seem very run-of-the-mill ("looking for", "was just") and don't necessarily seem like they should be signals for the given label. The two-word phrases for religion, however, give a pretty strong signal about what kind of documents are included. Phrases like "such lunacy" and "the fbi" suggest that these documents might be sourced from some kind of conservative talk radio or something like that, which is a very particular subset of all documents we could imagine being categorized as religion-oriented in content. Both of these cases - the seemingly generic combinations and the strong but specific ones - could be evidence of overfitting. The features we're picking up on here might just be quirks of the particular documents we have rather than true signals for the labels.

(5) Try to improve the logistic regression classifier by passing a custom preprocessor to CountVectorizer. The preprocessing function runs on the raw text, before it is split into words by the tokenizer. Your preprocessor should try to normalize the input in various ways to improve generalization. For example, try lowercasing everything, replacing sequences of numbers with a single token, removing various other non-letter characters, and shortening long words. If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular. With your new preprocessor, how much did you reduce the size of the dictionary?

For reference, I was able to improve dev F1 by 2 points.

[4 pts]
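
As a rough illustration of the normalization steps the prompt lists, here is a minimal sketch using only `re`. The `normalize` function, the `numtoken` placeholder, and the 8-character cutoff are hypothetical choices for this example, not a recommended configuration.

```python
import re

def normalize(text, max_len=8):
    """Lowercase, collapse digit runs to one token, drop non-letters, truncate long words."""
    text = text.lower()
    # Replace each run of digits with a single placeholder token.
    text = re.sub(r"\d+", " numtoken ", text)
    # Replace anything that is not a lowercase letter or a space.
    text = re.sub(r"[^a-z ]", " ", text)
    # Shorten long words so related forms share a prefix.
    words = [word[:max_len] for word in text.split()]
    return " ".join(words)

print(normalize("Call 555-1234 NOW!!!"))
print(normalize("Extraordinarily long words"))
```

A function like this can be passed as the `preprocessor` argument of CountVectorizer; because many surface forms collapse to the same string, the resulting vocabulary is smaller.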

In [None]:
# Original logistic regression to start with
log_reg = LogisticRegression(C=.5, penalty='l2')
log_reg.fit(default_train_dtm, train_labels)
dev_preds = log_reg.predict(default_dev_dtm)
print("Number of features: {}".format(default_train_dtm.shape[1]))
baseline_f1 = metrics.f1_score(dev_labels,
                               dev_preds,
                               average="weighted")
print("Baseline F1 score: {:.5f}".format(baseline_f1))

In [None]:
# various regexes for use in better preprocesser

# digit regex: for a single digit on its own
digit_regex = re.compile(r"\b\d\b")
# big number regex: for a series of 2 or more digits in a row
big_num_regex = re.compile(r"\d{2,}")
# special char: for all characters other than letters and numbers
special_char = re.compile(r"[^a-zA-Z0-9 ]")

# various word endings: these match on any combo of letters 
# followed by the ending; the trailing \b anchors the match
# to the end of a word so letters in the middle of a word are kept
ly = re.compile(r"([a-z]+)(ly)\b")
ing = re.compile(r"([a-z]+)(ing)\b")
final_s = re.compile(r"([a-z]+)(s)\b")
ed = re.compile(r"([a-z]+)(ed)\b")
endings = [ly, ing, final_s, ed]

def better_preprocessor(s):
    s = s.lower()
    s = re.sub(digit_regex, "digittoken ", s)
    s = re.sub(big_num_regex, "numtoken", s)
    s = re.sub(special_char, " ", s)
    for ending in endings:
        s = re.sub(ending, r"\1", s)
    for word in s.split():
        if len(word) > 12:
            s = s.replace(word, word[:12])
        if word in ENGLISH_STOP_WORDS:
            s = s.replace(" " + word + " ", " ")
    return s

In [None]:
def train_with_new_cv(preprocessor,
                      compare_dtm):
    cv = CountVectorizer(preprocessor=preprocessor)
    new_dtm = cv.fit_transform(train_data)
    new_dev_dtm = cv.transform(dev_data)
    # report the vocabulary size and the reduction relative to the baseline
    print("Number of features: {}".format(new_dtm.shape[1]))
    print("Number of features fewer than in baseline regression: {}".format(compare_dtm.shape[1] -
                                                                            new_dtm.shape[1]))
    # TODO: replace with grid search
    parameters = {'C': [.5] }
    log_reg = LogisticRegression(penalty='l2')
    lr, f1 = fit_grid_search(log_reg,
                    parameters,
                    new_dtm,
                    new_dev_dtm,
                    "logistic regression")
    print("Improvement from baseline: {:.3f}".format(f1 - baseline_f1))
    

train_with_new_cv(better_preprocessor,
                  default_train_dtm)

(6) The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. That is, logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size. The default regularization, L2, computes this size as the sum of the squared weights (see P3, above). L1 regularization computes this size as the sum of the absolute values of the weights. The result is that whereas L2 regularization makes all the weights relatively small, L1 regularization drives lots of the weights to 0, effectively removing unimportant features.
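
The different behavior of the two penalties can be seen in a one-dimensional analogue. This sketch uses a squared-error loss rather than logistic regression (an assumption made so that the minimizers have closed forms): for each unpenalized weight `a`, the L1-penalized solution is a soft-threshold and the L2-penalized solution is a proportional shrink. The example weights and `lam` value are made up.

```python
def l1_shrink(a, lam):
    """argmin over w of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(a) - lam, 0.0)
    return magnitude if a >= 0 else -magnitude

def l2_shrink(a, lam):
    """argmin over w of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1.0 + lam)

# Hypothetical unpenalized weights: two strong features and two weak ones.
unpenalized = [3.0, -1.2, 0.3, -0.1]
lam = 0.5

l1_weights = [l1_shrink(a, lam) for a in unpenalized]
l2_weights = [l2_shrink(a, lam) for a in unpenalized]

print("L1:", l1_weights)
print("L2:", l2_weights)
```

Under L1 the two weak weights land exactly on zero, while under L2 every weight shrinks but stays non-zero, which mirrors the feature-pruning behavior described above.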

Train a logistic regression model using a "l1" penalty. Output the number of learned weights that are not equal to zero. How does this compare to the number of non-zero weights you get with "l2"? Now, reduce the size of the vocabulary by keeping only those features that have at least one non-zero weight and retrain a model using "l2".

Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when pruning unused features by adjusting the C parameter.

Note: The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.01 (the default is .0001).

[4 pts]

In [None]:
np.random.seed(0)

def train_lr_return_coefs(penalty,
                          train_dtm,
                          c=1):
    
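    # liblinear supports both 'l1' and 'l2' penalties; the default solver
    # in scikit-learn 0.22+ (lbfgs) rejects penalty='l1'
    lr = LogisticRegression(penalty = penalty,
                            C=c,
                            solver='liblinear')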
    lr.fit(train_dtm, train_labels)
    return lr.coef_
  
# i thought the right answer here was 1609 - not sure why that would have changed 
l1_weights = train_lr_return_coefs('l1',
                                   default_train_dtm)
l2_weights = train_lr_return_coefs('l2',
                                   default_train_dtm)
print("Number of nonzero weights for LR trained with L1 regularization: {}".format(np.sum(l1_weights != 0)))
print("Number of nonzero weights for LR trained with L2 regularization: {}".format(np.sum(l2_weights != 0)))

In [None]:
def P6(c):
    # train with the l1 penalty at the given C value
    l1 = train_lr_return_coefs('l1',
                               default_train_dtm,
                               c)
    
    # keep only the vocabulary elements with
    # at least one non-zero weight in any class
    # (checking each weight directly, since summing across classes
    # would miss weights that cancel each other out)
    non_zero_list = np.flatnonzero(np.any(l1 != 0, axis=0))
    print("Number of features with non-zero weight: {}".format(len(non_zero_list)))
    
    # change dtm to only have the features with
    # non-zero weights when trained on 'l1' penalty
    dtm = default_train_dtm.toarray()
    just_those_features = dtm[:,non_zero_list]
    
    # retrain on l2 with the restricted vocabulary 
    lr = LogisticRegression(penalty = 'l2', tol=.01)
    lr.fit(just_those_features, train_labels)
    
    # process the dev dtm to only have the new features
    dev_dtm = default_dev_dtm.toarray()
    just_those_dev_features = dev_dtm[:, non_zero_list]
    
    # predict on dev and get the f1 score
    dev_preds = lr.predict(just_those_dev_features)
    f1_score = metrics.f1_score(dev_labels,
                   dev_preds,
                   average="weighted")
    
    return f1_score
        
    

In [None]:
cs = []
f1s = []

# iterate over c values
# and find different values for different C's 
for c in [2 * .5 ** (n - 1) for n in range(1, 10 + 1)]:
    cs.append(c)
    f1s.append(P6(c))

In [None]:
plt.plot(cs, f1s, 'bo')
plt.xlabel("C values")
plt.ylabel("F1 score")
plt.show()

In general, as the C value gets higher (and the regularization gets more relaxed), the f1 score for the model trained with l2 regularization on the feature set pruned by l1 regularization rises up to a certain point, then plateaus. At the lower C values, we reduce our vocabulary very drastically, which should partially explain why our f1 scores get so low. With small C values the regularization is strong: the model is heavily penalized for each additional weight, so the number of non-zero parameters is limited significantly and, as a result, the re-trained model doesn't have as large a vocabulary to work with. The words it keeps are the strongest signals, which keeps the f1 from completely plummeting, but given the sparsity of these matrices in general, there are probably a lot of documents that contain none of the kept features.

(7) Use the TfidfVectorizer -- how is this different from the CountVectorizer? Train a logistic regression model with C=100.

Make predictions on the dev data and show the top 3 documents where the ratio R is largest, where R is:

maximum predicted probability / predicted probability of the correct label

What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

[4 pts]
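
The ratio R defined above can be sketched on a few hand-written probability vectors. The probabilities and labels below are hypothetical; in the notebook they would come from `predict_proba` and the dev labels.

```python
def ratio_r(probs, true_label):
    """Maximum predicted probability over the probability of the correct label."""
    return max(probs) / probs[true_label]

# Hypothetical predicted probabilities and true labels for three documents.
examples = [
    ([0.70, 0.20, 0.10], 0),  # correct prediction
    ([0.50, 0.40, 0.10], 1),  # near miss
    ([0.96, 0.03, 0.01], 1),  # confident mistake
]

r_values = [ratio_r(probs, label) for probs, label in examples]

# Rank documents from the worst mistake to the best prediction.
worst_first = sorted(range(len(examples)), key=lambda i: r_values[i], reverse=True)
print(r_values, worst_first)
```

R equals 1 whenever the model's top choice is the correct label, and it grows as the model becomes more confident in a wrong label, so sorting by R surfaces the most confident mistakes first.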

In [None]:
#def P7():
    ### STUDENT START ###

def train_vectorizer_get_f1(tf_vectorizer, lr):
    tf_dtm = tf_vectorizer.fit_transform(train_data)
    lr.fit(tf_dtm, train_labels)
    tf_dtm_dev = tf_vectorizer.transform(dev_data)
    dev_probs = lr.predict_proba(tf_dtm_dev)
    dev_preds = lr.predict(tf_dtm_dev)
    print("F1 score: {:.3f}".format(metrics.f1_score(dev_labels,
                                 dev_preds,
                                 average="weighted")))
    return dev_probs
    
dev_probs = train_vectorizer_get_f1(TfidfVectorizer(), LogisticRegression(C=100))

def calculate_r(probs, label):
    return max(probs) / probs[label]

r_values = np.array([calculate_r(probs, dev_labels[i]) for i, probs in enumerate(dev_probs)])
# indexes of the 3 largest R values, largest first
idxes_of_top_r = np.argsort(r_values)[::-1][:3]
top_3_r = r_values[idxes_of_top_r]

for i in range(3):
    # position of this document in the dev set
    idx_in_set = idxes_of_top_r[i]
    print("\nR value: {:.3f}".format(top_3_r[i]))
    print(idx_in_set)
    print("Correct label: {}".format(newsgroups_train.target_names[dev_labels[idx_in_set]]))
    preds = dev_probs[idx_in_set]
    print("Predicted label: {}".format(newsgroups_train.target_names[np.argmax(preds)]))
    print("Predicted probabilities: {}".format(dev_probs[idx_in_set]))
    print("Text: {}".format(dev_data[idx_in_set]))
    ### STUDENT END ###
#P7()

ANSWER:

The TfidfVectorizer is different from the CountVectorizer in that instead of representing a document by the raw counts of the words it contains, it represents it by those counts multiplied by each word's inverse document frequency. This measure down-weights a word's count according to the number of documents the word appears in, so that words that appear frequently across documents (and are, therefore, less meaningful from a classification perspective) don't get values as high.
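
A minimal sketch of that weighting in plain Python follows. It assumes the smoothed formula that scikit-learn documents for TfidfVectorizer's defaults (`smooth_idf=True`, `norm='l2'`), namely idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector; the three tokenized documents are made up.

```python
import math
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["the", "rocket", "launch"],
    ["the", "pixel", "shader"],
    ["the", "rocket", "engine"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each word.
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(doc):
    """Count times idf for each word, then scale the vector to unit L2 length."""
    counts = Counter(doc)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "the" appears in every document and gets the smallest weight, "rocket" appears in two and sits in the middle, and "launch" appears in only one and gets the largest weight.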

In terms of mistakes the model is making, it seems like the documents with the highest R values are the ones where a particular word with an extreme tf-idf value is used in a different context. For instance, in the first two examples, the documents are about religion, but they both contain the word "ftp," which is a word that we typically see in the context of the graphics documents where authors are talking about how to transfer graphics files. In this case, the word "ftp" is used in reference to a religious file that is being transferred, but the signal from the word "ftp" is so strong that the model predicts that those are graphics documents. The situation in the last case is similar - we already discussed earlier how our documents in the "religion" category are generally very violent, so this document probably registers as a religious document due to the use of the word "killed" or "gunman." A way to correct this problem in general might be to minimize the effect an "outlier" feature has on the final prediction - perhaps by normalizing the weight with the biggest contribution. Another way might be to increase the size of the ngrams included in the model, so that the model has more features that put the signal words in context, which might counteract the effect of seeing them in isolation where they mean something different.

(8) EXTRA CREDIT

Try implementing one of your ideas based on your error analysis. Use logistic regression as your underlying model.

- [1 pt] for a reasonable attempt
- [2 pts] for improved performance


It is interesting to see how the f1 score goes up, and the R values for this set of examples change, when we change the features the vectorizer extracts. For instance, see how the f1 and R values change below, when we adjust the TfidfVectorizer to include word bigrams in addition to unigrams:

In [None]:
new_probs = train_vectorizer_get_f1(TfidfVectorizer(ngram_range = (1,2)), LogisticRegression(C=100))

for idx in idxes_of_top_r:
    preds = new_probs[idx]
    r_value = calculate_r(preds, dev_labels[idx])
    print("\nR value: {:.3f}".format(r_value))
    print("Correct label: {}".format(newsgroups_train.target_names[dev_labels[idx]]))
    print("Predicted label: {}".format(newsgroups_train.target_names[np.argmax(preds)]))
    print("Predicted probabilities: {}".format(new_probs[idx]))


I changed the ngram range to include word bigrams in addition to unigrams, in the hope that adding features with a little more context (pairs of words as opposed to single words) would counteract the effect in the previous examples where a strong signal word, used in an unusual context, points the model in the wrong direction. I was pleased to see that when I made this change, the overall f1 score for the model goes up slightly (76.4% as opposed to 76%), and the R values for each of these three examples go down. The new model is still predicting incorrectly on these three examples, but it's a little less sure of its predictions, so its R values go down, reflecting that it isn't messing up quite so badly on these cases (interestingly, in the case of the last example, the model still predicts the wrong answer, but it now predicts space instead of religion).
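
To show what the extra features look like, here is a small sketch of word n-gram extraction in plain Python. The `word_ngrams` helper and its whitespace tokenization are assumptions for illustration; TfidfVectorizer's `ngram_range=(1, 2)` produces the same kind of features with its own tokenizer.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("available by ftp"))
print(word_ngrams("available by ftp", n_min=2, n_max=2))
```

With bigrams included, a word like "ftp" still contributes its own feature, but "by ftp" gives the model a second feature that carries some of the surrounding context.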

I was also curious to see what would happen if we totally dropped unigrams, and just considered bigrams. 

In [None]:
new_probs = train_vectorizer_get_f1(TfidfVectorizer(ngram_range = (2,2)), LogisticRegression(C=100))

for idx in idxes_of_top_r:
    preds = new_probs[idx]
    r_value = calculate_r(preds, dev_labels[idx])
    print("\nR value: {:.3f}".format(r_value))
    print("Correct label: {}".format(newsgroups_train.target_names[dev_labels[idx]]))
    print("Predicted label: {}".format(newsgroups_train.target_names[np.argmax(preds)]))
    print("Predicted probabilities: {}".format(new_probs[idx]))


Interestingly (and perhaps unsurprisingly), the R values for these three cases decrease even more when we only consider bigram word features. Without the unigrams, the model is even less sure of its wrong predictions, since the unigrams are what give it the strong signal towards the incorrect labels. However, the overall F1 score goes down significantly when we make this change. That is not surprising, since those unigrams were strong signals for a reason - they point the model to the correct label much of the time, enough to compensate for the smaller set of cases where they point the model to a wrong label.