# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup postings on a variety of topics. You'll train classifiers to distinguish between the topics based on the text of the posts. In digit classification, the input is relatively dense: a 28x28 matrix of pixels, many of which are non-zero. Here, by contrast, we'll represent each document with a "bag-of-words" model. As you'll see, this makes the feature representation quite sparse: only a few words of the total vocabulary are active in any given document. The bag-of-words assumption is that the label depends only on which words appear; their order is not important.
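
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the function name `bag_of_words`, the three toy documents, and the simplified tokenization rule (runs of two or more word characters) are all assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens, discarding word order."""
    # Assumed tokenization: runs of two or more word characters.
    tokens = re.findall(r"\w\w+", text.lower())
    return Counter(tokens)


# Toy documents (made up for this sketch).
doc_a = "The shuttle reached orbit"
doc_b = "Orbit reached the shuttle"
doc_c = "Render the image in 3D"

bags = [bag_of_words(d) for d in (doc_a, doc_b, doc_c)]

# The shared vocabulary is the sorted set of all tokens seen.
vocabulary = sorted(set(word for bag in bags for word in bag))

# One count per vocabulary word; words absent from the document count as 0.
vector_c = [bags[2][word] for word in vocabulary]

print(bags[0] == bags[1])  # same words in a different order
print(vocabulary)
print(vector_c)
```

The first two documents produce identical bags even though their word order differs, and the vector for the third document is zero for every vocabulary word it does not contain. With a realistic vocabulary, almost all entries are zero, which is the sparsity described above.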

The SK-learn documentation on feature extraction will prove useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up and write your own code.

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import time
import pandas as pd

# import pretty printing
from pprint import pprint

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *



Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test so we have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [3]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

num_test = len(newsgroups_test.target)
# Use floor division so the slice indices stay integers.
test_data, test_labels = newsgroups_test.data[num_test//2:], newsgroups_test.target[num_test//2:]
dev_data, dev_labels = newsgroups_test.data[:num_test//2], newsgroups_test.target[:num_test//2]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print 'training label shape:', train_labels.shape
print 'test label shape:', test_labels.shape
print 'dev label shape:', dev_labels.shape
print 'labels names:', newsgroups_train.target_names

training label shape: (2034L,)
test label shape: (677L,)
dev label shape: (676L,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


(1) For each of the first 5 training examples, print the text of the message along with the label.

[2 pts]

In [3]:
### STUDENT START ###

def P1(num_examples=5):
    names = newsgroups_train['target_names']
    for n in range(num_examples):
        print '\nSample number:', n, '\nClassification Label:', names[train_labels[n]]
        print '\nText:\n', train_data[n]
P1()

### STUDENT END ###


Sample number: 0 
Classification Label: comp.graphics

Text:
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

Sample number: 1 
Classification Label: talk.religion.misc

Text:


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like

(2) Use CountVectorizer to turn the raw training text into feature vectors. You should use the fit_transform function, which makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").

The vectorizer has a lot of options. To get familiar with some of them, write code to answer these questions:

a. The output of the transform (also of fit_transform) is a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html. What is the size of the vocabulary? What is the average number of non-zero features per example? What fraction of the entries in the matrix are non-zero? Hint: use "nnz" and "shape" attributes.

b. What are the 0th and last feature strings (in alphabetical order)? Hint: use the vectorizer's get_feature_names function.

c. Specify your own vocabulary with 4 words: ["atheism", "graphics", "space", "religion"]. Confirm the training vectors are appropriately shaped. Now what's the average number of non-zero features per example?

d. Instead of extracting unigram word features, use "analyzer" and "ngram_range" to extract bigram and trigram character features. What size vocabulary does this yield?

e. Use the "min_df" argument to prune words that appear in fewer than 10 documents. What size vocabulary does this yield?

f. Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? Hint: build a vocabulary for both train and dev and look at the size of the difference.

[6 pts]
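
The sparsity statistics asked for in part (a) reduce to simple arithmetic on the number of stored non-zero entries and the matrix shape. The sketch below uses a toy matrix stored as one `{column: count}` dict per row. That layout is an assumption standing in for a sparse matrix, not scipy's actual storage format, and the numbers are made up.

```python
# Toy sparse matrix: one dict per row, holding only the non-zero entries.
rows = [
    {0: 2, 3: 1},
    {1: 1},
    {0: 1, 2: 4, 4: 1},
]
n_rows = len(rows)
n_cols = 5  # size of the (toy) vocabulary

# Plays the role of the matrix's "nnz" attribute.
nnz = sum(len(row) for row in rows)

# Average number of non-zero features per example.
avg_nonzero_per_row = nnz / float(n_rows)

# Fraction of all matrix entries that are non-zero.
fraction_nonzero = nnz / float(n_rows * n_cols)

print("nnz: %d" % nnz)
print("average non-zero features per example: %.2f" % avg_nonzero_per_row)
print("fraction of non-zero entries: %.2f" % fraction_nonzero)
```

Working from the non-zero count and the shape avoids ever building the dense matrix, which matters once the vocabulary has tens of thousands of columns.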

In [4]:
### STUDENT START ###
def P2(train_data, dev_data):
    print 'PART (A):'
    v = CountVectorizer()
    features = v.fit_transform(train_data)
    print '\nSize of vocabulary:', len(v.vocabulary_)
    # nnz is the number of stored non-zero entries in the sparse matrix.
    avg_nz_doc = 1.0 * features.nnz / features.shape[0]
#     targets = newsgroups_train['target_names']
#     examples = {targets[i] : [] for i in range(len(targets))}
#     for i in range(len(train_labels)):
#         examples[targets[train_labels[i]]].append(features[i].nnz)
#     avg_nz_ex = {k: 1.0 * np.sum(v) / len(v) for k,v in examples.items()}
    print 'Average number of non-zero features per example:', avg_nz_doc
#     print 'Average number of non-zero features per topic:'
#     pprint(avg_nz_ex)
    # Use the shape directly; toarray() would build the full dense matrix.
    frac_nz = 1.0 * features.nnz / (features.shape[0] * features.shape[1])
    print 'Fraction of entries that are non-zero:', frac_nz
    print '\nPART (B):'
    names = v.get_feature_names()
    print '\n0th feature: {}, last feature: {}'.format(names[0], names[-1])
    print '\nPART (C):'
    v2 = CountVectorizer(
        vocabulary=['atheism', 'graphics', 'space', 'religion']
    )
    features2 = v2.fit_transform(train_data)
    avg_nz_doc = 1.0 * features2.nnz / features2.shape[0]
    print '\nVector Shape:', features2.shape
    print 'Average number of non-zero features per example:', avg_nz_doc
    print '\nPART (D):'
    # NOTE: the question asks for character n-grams, which would need
    # analyzer='char'. The recorded output below is for word n-grams.
    v3 = CountVectorizer(analyzer='word', ngram_range=(2, 3))
    v3.fit_transform(train_data)
    n = len(v3.vocabulary_)
    print '\nSize of vocabulary using word bigrams and trigrams:', n
    print '\nPART (E):'
    # min_df is given here as a proportion of documents (10 out of len(train_data)).
    v4 = CountVectorizer(min_df=1.0 * 10/len(train_data))
    v4.fit_transform(train_data)
    print '\nSize of vocabulary after pruning words in < 10 docs:', len(v4.vocabulary_)
    print '\nPART (F):'
    v = CountVectorizer()
    v.fit_transform(train_data)
    names1 = v.get_feature_names()
    v.fit_transform(dev_data)
    names2 = v.get_feature_names()
    # NOTE: this divides by the train vocabulary size. The fraction of dev
    # words missing from the train vocabulary would divide by len(names2).
    diff = 1.0 * len(set(names2).difference(set(names1))) / len(names1)
    print '\nDev-only words as a fraction of the train vocabulary size:', diff

P2(train_data, dev_data)

### STUDENT END ###

PART (A):

Size of vocabulary: 26879
Average number of non-zero features per example: 96.7059980334
Fraction of entries that are non-zero: 0.00359782722696

PART (B):

0th feature: 00, last feature: zyxel

PART (C):

Vector Shape: (2034, 4)
Average number of non-zero features per example: 0.268436578171

PART (D):

Size of vocabulary using word bigrams and trigrams: 510583

PART (E):

Size of vocabulary after pruning words in < 10 docs: 3064

PART (F):

Dev-only words as a fraction of the train vocabulary size: 0.14981956174


(3) Use the default CountVectorizer options and report the f1 score (use metrics.f1_score) for a k nearest neighbors classifier; find the optimal value for k. Also fit a Multinomial Naive Bayes model and find the optimal value for alpha. Finally, fit a logistic regression model and find the optimal value for the regularization strength C using l2 regularization. A few questions:

a. Why doesn't nearest neighbors work well for this problem?

b. Any ideas why logistic regression doesn't work as well as Naive Bayes?

c. Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute. Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.

[4 pts]
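
For part (c), the relationship between C and the size of the weights can be seen on a toy problem. The sketch below is plain Python, not scikit-learn: it fits a single weight `w` by gradient descent on the objective `0.5 * w**2 + C * sum(log(1 + exp(-y * w * x)))`, which is the L2-regularized form given in the scikit-learn documentation for logistic regression. The data points, the helper name `fit_weight`, and the step count are assumptions chosen for illustration.

```python
import math

# Toy 1-D data with labels in {-1, +1}; the last point is deliberately noisy,
# so the classes are not separable and the optimal weight stays finite.
xs = [2.0, 1.0, 1.5, -1.0, -2.0, 0.5]
ys = [1, 1, 1, -1, -1, -1]


def fit_weight(C, n_steps=5000):
    """Minimize 0.5*w^2 + C * sum(log(1 + exp(-y*w*x))) by gradient descent."""
    # Step size 1/L, where L is an upper bound on the objective's curvature.
    curvature_bound = 1.0 + C * sum(x * x for x in xs) / 4.0
    w = 0.0
    for _ in range(n_steps):
        grad = w  # gradient of the penalty term 0.5*w^2
        for x, y in zip(xs, ys):
            margin = y * w * x
            # gradient of log(1 + exp(-margin)) with respect to w
            grad -= C * y * x / (1.0 + math.exp(margin))
        w -= grad / curvature_bound
    return w


C_values = [0.01, 1.0, 100.0]
squared_weights = [fit_weight(C) ** 2 for C in C_values]

for C, sq in zip(C_values, squared_weights):
    print("C = %-6g squared weight = %.4f" % (C, sq))
```

Because C multiplies the data-fit term, a small C makes the penalty dominate and pulls the weight toward zero, while a large C lets the weight grow toward its unregularized value. The same pattern should appear in the per-class sums of squared `coef_` values.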

In [5]:
# Set the parameter grid and run a grid search over k (n_neighbors).
start = time.time()
v = CountVectorizer()
f = v.fit_transform(train_data)
params = {'n_neighbors': range(1, 100)}
scorer = metrics.make_scorer(metrics.f1_score, average='weighted')
gs = GridSearchCV(KNeighborsClassifier(), params, scoring=scorer)
gs.fit(f, train_labels)
end = time.time()
print 'Best k:', gs.best_params_.values()[0]
print 'Best F1-score:', gs.best_score_ 
print 'runtime:', round(end-start, 3)

Best k: 96
Best F1-score: 0.429989842138
runtime: 48.916


In [6]:
# Set the parameter grid and run a grid search over the smoothing parameter alpha.
start = time.time()
v = CountVectorizer()
f = v.fit_transform(train_data)
scorer = metrics.make_scorer(metrics.f1_score, average='weighted')
params = {'alpha': np.arange(0.0, 0.1, 0.0001)}
gs = GridSearchCV(MultinomialNB(), params, scoring=scorer)
gs.fit(f, train_labels)
end = time.time()
print 'Best alpha:', gs.best_params_.values()
print 'Best F1-score:', gs.best_score_ 
print 'runtime:', round(end-start, 3)

  'setting alpha = %.1e' % _ALPHA_MIN)


Best alpha: [0.0036000000000000003]
Best F1-score: 0.831144346553
runtime: 43.88


In [7]:
# Set the parameter grid and run a grid search over the regularization parameter C.
start = time.time()
v = CountVectorizer()
f = v.fit_transform(train_data)
scorer = metrics.make_scorer(metrics.f1_score, average='weighted')
params = {'C': np.arange(0.1, 1, 0.01)}
gs = GridSearchCV(LogisticRegression(), params, scoring=scorer)
gs.fit(f, train_labels)
end = time.time()
print 'Best regularization parameter:', gs.best_params_.values()
print 'Best F1-score:', gs.best_score_ 
print 'runtime:', round(end-start, 3)

Best regularization parameter: [0.17999999999999997]
Best F1-score: 0.772589852946
runtime: 275.26


In [8]:
help(LogisticRegression)

Help on class LogisticRegression in module sklearn.linear_model.logistic:

class LogisticRegression(sklearn.base.BaseEstimator, sklearn.linear_model.base.LinearClassifierMixin, sklearn.linear_model.base.SparseCoefMixin)
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr', and uses the cross-
 |  entropy loss if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs',
 |  'sag' and 'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  'liblinear' library, 'newton-cg', 'sag' and 'lbfgs' solvers. It can handle
 |  both dense and sparse input. Use C-ordered arrays or CSR matrices
 |  containing 64-bit floats for optimal performance; any other input format
 |  will be converted (and copied).
 |  
 |  The 'newton-cg', 'sag', and 'lbfgs' solve

In [9]:
### STUDENT START ###
def P3():
    # The cells above use grid search, but they probably want us to use gradient descent.
    pass  # Placeholder so the empty function body is valid Python.

P3()
### STUDENT END ###

ANSWER:

(4) Train a logistic regression model. Find the 5 features with the largest weights for each label -- 20 features in total. Create a table with 20 rows and 4 columns that shows the weight for each of these features for each of the labels. Create the table again with bigram features. Any surprising features in this table?

[5 pts]
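
The selection step, picking the top features per class and then reading off every class's weight for each of them, can be sketched without any libraries. The feature names and weights below are made up for illustration and are not taken from the trained model; `top_k_indices` is a hypothetical helper.

```python
# Hypothetical feature names and per-class weights (not from the real model).
feature_names = ["atheism", "graphics", "image", "orbit", "space"]
labels = ["comp.graphics", "sci.space"]
weights = {
    "comp.graphics": [-0.3, 1.2, 0.8, -0.4, -0.8],
    "sci.space": [-0.3, -0.5, -0.5, 0.7, 1.5],
}


def top_k_indices(row, k):
    """Return the column indices of the k largest values, largest first."""
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]


# Collect the top-2 columns of each class, keeping first-seen order.
selected = []
for label in labels:
    for idx in top_k_indices(weights[label], 2):
        if idx not in selected:
            selected.append(idx)

# One table row per selected feature, one column per class.
table = [
    (feature_names[idx], [weights[label][idx] for label in labels])
    for idx in selected
]

for name, row in table:
    print("%-10s %s" % (name, row))
```

The key point is that the lookup goes through the column index: the same index selects the feature's name and its weight in every class row.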

- lg.coef_ is a 4x26879 matrix. Each row holds one class's regression coefficients, one per vocabulary word, ordered by the word's column index
- vocab maps each word in the vocabulary (key) to its index (value), i.e. its column in the vectorizer's output matrix
- need to map the

In [4]:
labels = newsgroups_train['target_names']
lg = LogisticRegression(C=0.18)
v = CountVectorizer()
f = v.fit_transform(train_data)
vocab = v.vocabulary_
lg.fit(f, train_labels)

LogisticRegression(C=0.18, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [5]:
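word_map = {}
for ind, c in enumerate(lg.coef_):
    # Column indices of the 5 largest weights for this class, largest first.
    top5 = np.argsort(c)[::-1][:5]
    # Map each top word to its column index. Distinct loop names avoid clobbering
    # the vectorizer `v` (Python 2 list comprehensions leak their loop variables).
    words = [(word, col) for word, col in vocab.items() if col in top5]
    for w in words:
        word_map[w[0]] = w[1]
word_map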

{u'3d': 1145,
 u'atheism': 3866,
 u'atheists': 3870,
 u'blood': 4743,
 u'bobby': 4784,
 u'christian': 5901,
 u'christians': 5904,
 u'computer': 6555,
 u'fbi': 10234,
 u'file': 10376,
 u'graphics': 11552,
 u'image': 12769,
 u'islam': 13668,
 u'launch': 14540,
 u'nasa': 16697,
 u'orbit': 17597,
 u'order': 17609,
 u'religion': 20430,
 u'space': 22567,
 u'spacecraft': 22570}

In [10]:
weight_arr = np.array(
    [[lg.coef_[i][v] for v in word_map.values()] for i in range(len(labels))])
# weight_arr.shape
df = pd.DataFrame(weight_arr.T, columns=labels, index=word_map.keys())
df

            alt.atheism  comp.graphics  sci.space  talk.religion.misc
christian     -0.330484      -0.238977   -0.21752            0.672245
religion        0.59601      -0.367913  -0.482972           -0.015839
image          -0.32916       0.802789  -0.468117           -0.269745
bobby          0.597084      -0.146222  -0.208707           -0.282748
space         -0.793108      -0.849346    1.46719           -0.716614
orbit         -0.263077      -0.410297   0.735836           -0.329093
launch        -0.263923      -0.298331   0.581582           -0.207426
atheism        0.596496      -0.257976  -0.255648           -0.305634
atheists       0.582813      -0.085958  -0.195574           -0.399363
spacecraft    -0.215581      -0.230584    0.50249           -0.186686


In [95]:
labels = newsgroups_train['target_names']
word_list = []
words = []
for ind,c in enumerate(lg.coef_):
    top5 = np.sort(c)[-5:]
#     top5 = np.sort(c)[::-1][:5]
    idx = [(i, top5[j]) for i in range(len(c)) for j in range(5) if c[i] == top5[j]]
idx
#     words = [(k, v, top5[j]) for k,v in vocab.items() for j in range(5) if v == idx[j][0]]
#     for w in words:
#         word_list.append(w)
# word_list

[(4743, 0.5706131255103621),
 (5901, 0.6722453326558024),
 (5904, 0.6405559727375951),
 (10234, 0.5364960477583791),
 (17609, 0.5340546682114011)]

In [99]:
labels = newsgroups_train['target_names']
word_list = []
words = []
for ind,c in enumerate(lg.coef_):
    top5 = np.argsort(c)[::-1][:5]
    words = [(k, v, c[top5[j]]) for k,v in vocab.items() for j in range(5) if v == top5[j]]
    for w in words:
        word_list.append(w)
word_list

[(u'atheism', 3866, 0.5964956187568897),
 (u'bobby', 4784, 0.5970837638203422),
 (u'atheists', 3870, 0.5828128319130426),
 (u'religion', 20430, 0.5960100065889833),
 (u'islam', 13668, 0.5121175704268935),
 (u'computer', 6555, 0.6606312142606383),
 (u'3d', 1145, 0.6845519794908685),
 (u'image', 12769, 0.8027890564041955),
 (u'graphics', 11552, 1.2155893925133994),
 (u'file', 10376, 0.7817224429010882),
 (u'launch', 14540, 0.5815820631270786),
 (u'orbit', 17597, 0.7358355510288339),
 (u'spacecraft', 22570, 0.5024903850730208),
 (u'space', 22567, 1.4671903386906164),
 (u'nasa', 16697, 0.6446863532684038),
 (u'fbi', 10234, 0.5364960477583791),
 (u'christians', 5904, 0.6405559727375951),
 (u'christian', 5901, 0.6722453326558024),
 (u'order', 17609, 0.5340546682114011),
 (u'blood', 4743, 0.5706131255103621)]

In [98]:
labels = newsgroups_train['target_names']
word_list = []
words = []
for ind,c in enumerate(lg.coef_):
#     top5 = np.sort(c)[-5:]
    top5 = np.argsort(c)[::-1][:5]
#     top5 = np.argsort(c)[-5:][::-1]
top5
#     idx = [(i, top5[j]) for i in range(len(c)) for j in range(5) if c[i] == top5[j]]
#     words = [(k, v, top5[j]) for k,v in vocab.items() for j in range(5) if v == idx[j][0]]
#     for w in words:
#         word_list.append(w)
# word_list

array([ 5901,  5904,  4743, 10234, 17609], dtype=int64)

In [100]:
label = labels[0]
weights = [(w, lg.coef_[0][w[1]]) for w in word_list]
weights

[((u'atheism', 3866, 0.5964956187568897), 0.5964956187568897),
 ((u'bobby', 4784, 0.5970837638203422), 0.5970837638203422),
 ((u'atheists', 3870, 0.5828128319130426), 0.5828128319130426),
 ((u'religion', 20430, 0.5960100065889833), 0.5960100065889833),
 ((u'islam', 13668, 0.5121175704268935), 0.5121175704268935),
 ((u'computer', 6555, 0.6606312142606383), -0.01124715421129772),
 ((u'3d', 1145, 0.6845519794908685), -0.22410397258601977),
 ((u'image', 12769, 0.8027890564041955), -0.32915989149716246),
 ((u'graphics', 11552, 1.2155893925133994), -0.48851841152609043),
 ((u'file', 10376, 0.7817224429010882), -0.2079274191946244),
 ((u'launch', 14540, 0.5815820631270786), -0.2639226430423041),
 ((u'orbit', 17597, 0.7358355510288339), -0.2630767786154764),
 ((u'spacecraft', 22570, 0.5024903850730208), -0.21558139861966702),
 ((u'space', 22567, 1.4671903386906164), -0.7931082594693554),
 ((u'nasa', 16697, 0.6446863532684038), -0.33428686486929327),
 ((u'fbi', 10234, 0.5364960477583791), -0.16

In [88]:
import numpy as np

arr = np.array([1, 3, 2, 4, 5])
arr[::-1]

# arr.argsort()[-3:][::-1]

array([5, 4, 2, 3, 1])

In [76]:
# labels = newsgroups_train['target_names']
# weight_map = {
#     labels[i]: {
#         vocab.keys()[j]: lg.coef_[i,j] 
#         for j in range(lg.coef_.shape[1])
#     } 
#     for i in range(len(labels))
# }
# # pprint(weight_map)

In [77]:
# for label in weight_map.keys():
#     top5 = np.sort(weight_map[label].values())[-5:]
#     print '\n', label, top5, '\n'
#     for k,v in weight_map[label].items():
#         if v in top5:
#             print k, v

In [53]:
labels = newsgroups_train['target_names']
# print len(labels)
# print lg.coef_.shape
# for i,c in enumerate(lg.coef_):
#     print i, c.shape
word_list = []
words = {}
for ind,c in enumerate(lg.coef_):
    top5 = np.sort(c)[-5:]
    print top5
    idx = [i for i in range(len(c)) for j in range(5) if c[i] == top5[j]]
    words[labels[ind]] = [k for k,v in vocab.items() for j in range(5) if v == idx[j]]
#     print ind, labels[ind], words
    for w in words[labels[ind]]:
        word_list.append(w)
words, word_list

[0.51211757 0.58281283 0.59601001 0.59649562 0.59708376]
[0.66063121 0.68455198 0.78172244 0.80278906 1.21558939]
[0.50249039 0.58158206 0.64468635 0.73583555 1.46719034]
[0.53405467 0.53649605 0.57061313 0.64055597 0.67224533]


({'alt.atheism': [u'atheism', u'bobby', u'atheists', u'religion', u'islam'],
  'comp.graphics': [u'computer', u'3d', u'image', u'graphics', u'file'],
  'sci.space': [u'launch', u'orbit', u'spacecraft', u'space', u'nasa'],
  'talk.religion.misc': [u'fbi',
   u'christians',
   u'christian',
   u'order',
   u'blood']},
 [u'atheism',
  u'bobby',
  u'atheists',
  u'religion',
  u'islam',
  u'computer',
  u'3d',
  u'image',
  u'graphics',
  u'file',
  u'launch',
  u'orbit',
  u'spacecraft',
  u'space',
  u'nasa',
  u'fbi',
  u'christians',
  u'christian',
  u'order',
  u'blood'])

In [41]:
weights = {label: {word: weight_map[label][word] for word in word_list} for label in weight_map.keys()}
weights

{'alt.atheism': {u'3d': 0.03694414553519378,
  u'atheism': 1.1083750105278388e-06,
  u'atheists': -0.00632191195267421,
  u'blood': -0.008162866924341731,
  u'bobby': -0.005160527308382557,
  u'christian': -0.00012176811247876634,
  u'christians': 0.04345151092407021,
  u'computer': 0.0014024427996590268,
  u'fbi': -0.0064279338631855355,
  u'file': -0.026131610554068155,
  u'graphics': -0.01390926889975656,
  u'image': -2.7282289890424304e-07,
  u'islam': -0.0004339606746706673,
  u'launch': -1.4820981235059479e-06,
  u'nasa': 0.1161226561127644,
  u'orbit': -8.548672025013739e-06,
  u'order': -0.0032220828494332904,
  u'religion': -0.00012486423717855024,
  u'space': 6.277317854525447e-07,
  u'spacecraft': 7.838765649257208e-07},
 'comp.graphics': {u'3d': -0.005734535147474898,
  u'atheism': 1.1917502829141858e-06,
  u'atheists': 0.018667988667478732,
  u'blood': -0.02028456544476551,
  u'bobby': -0.0016831622534122877,
  u'christian': -2.3685445408404427e-06,
  u'christians': -0.007

In [21]:
labels = newsgroups_train['target_names']
weights = {label: {} for label in labels}
# weights
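for i,label in enumerate(labels):
    for j,word in enumerate(word_list):
# #     for j in range(len(word_list)):
        # NOTE: j is the word's position in word_list, not its vocabulary column, so
        # lg.coef_[i,j] is not this word's weight; that would be lg.coef_[i, vocab[word]].
        print label, word, vocab[word], lg.coef_[i,j]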
#         weights[label][word] = lg.coef_[i,j]
# pprint(weights)

alt.atheism atheism 3866 -0.06121247482274525
alt.atheism bobby 4784 0.0467968299398923
alt.atheism atheists 3870 -3.517159354482324e-05
alt.atheism religion 20430 -7.034318708964648e-05
alt.atheism islam 13668 5.970104066968314e-07
alt.atheism computer 6555 3.731546697855308e-07
alt.atheism 3d 1145 2.895321101175386e-07
alt.atheism image 12769 -0.0007993076975865455
alt.atheism graphics 11552 -0.001100976288210156
alt.atheism file 10376 -3.517159354482324e-05
alt.atheism launch 14540 -1.789132176787828e-05
alt.atheism orbit 17597 -1.7280271776944964e-05
alt.atheism spacecraft 22570 -3.517159354482324e-05
alt.atheism space 22567 -1.7280271776944964e-05
alt.atheism nasa 16697 1.537794218909252e-06
alt.atheism fbi 10234 -0.0011207221825874636
alt.atheism christians 5904 2.8060463498847007e-06
alt.atheism christian 5901 -1.789132176787828e-05
alt.atheism order 17609 7.463093395710616e-07
alt.atheism blood 4743 1.233301813606777e-06
comp.graphics atheism 3866 0.09282670215525876
comp.graph

In [142]:
# labels = newsgroups_train['target_names']
# # labels
# label_weights = {label: [] for label in labels}
# label_weights[labels[2]]
for i,c in enumerate(lg.coef_):
    print i, labels[i]
    top5 = np.sort(c)[-5:]
    idx = [i for i in range(len(c)) for j in range(5) if f1[i] == top5[j]]
    words = [k for k,v in vocab.items() for j in range(5) if v == idx[j]]
#     label_weights[labels[i]].append(words)
    
# pprint(label_weights)

[]

In [129]:
f1 = lg.coef_[0]
tester = np.sort(f1)
top5 = tester[-5:]
idx = [i for i in range(len(f1)) for j in range(5) if f1[i] == top5[j]]
idx

[3866, 3870, 4784, 13668, 20430]

In [131]:
words = [k for k,v in vocab.items() for j in range(5) if v == idx[j]]
words

[u'atheism', u'bobby', u'atheists', u'religion', u'islam']

In [124]:
f[1153]

<1x26879 sparse matrix of type '<type 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [116]:
help(CountVectorizer)

Help on class CountVectorizer in module sklearn.feature_extraction.text:

class CountVectorizer(sklearn.base.BaseEstimator, VectorizerMixin)
 |  Convert a collection of text documents to a matrix of token counts
 |  
 |  This implementation produces a sparse representation of the counts using
 |  scipy.sparse.csr_matrix.
 |  
 |  If you do not provide an a-priori dictionary and you do not use an analyzer
 |  that does some kind of feature selection then the number of features will
 |  be equal to the vocabulary size found by analyzing the data.
 |  
 |  Read more in the :ref:`User Guide <text_feature_extraction>`.
 |  
 |  Parameters
 |  ----------
 |  input : string {'filename', 'file', 'content'}
 |      If 'filename', the sequence passed as an argument to fit is
 |      expected to be a list of filenames that need reading to fetch
 |      the raw content to analyze.
 |  
 |      If 'file', the sequence items must have a 'read' method (file-like
 |      object) that is called to fetc

In [6]:
#def P4():
### STUDENT START ###


### STUDENT END ###
#P4()

ANSWER:

(5) Try to improve the logistic regression classifier by passing a custom preprocessor to CountVectorizer. The preprocessing function runs on the raw text, before it is split into words by the tokenizer. Your preprocessor should try to normalize the input in various ways to improve generalization. For example, try lowercasing everything, replacing sequences of numbers with a single token, removing various other non-letter characters, and shortening long words. If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular. With your new preprocessor, how much did you reduce the size of the dictionary?

For reference, I was able to improve dev F1 by 2 points.

[4 pts]
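
As a starting point, the normalizations listed above can be chained with `re.sub()`. This is a minimal sketch, not a tuned solution: the name `sketch_preprocessor`, the placeholder token `numtoken`, and the 8-character word limit are assumptions, and the letter filter keeps only ASCII `a-z`.

```python
import re


def sketch_preprocessor(text, max_word_len=8):
    """Normalize raw text before tokenization (illustrative choices only)."""
    # 1. Lowercase everything.
    text = text.lower()
    # 2. Replace each run of digits with a single placeholder token.
    text = re.sub(r"\d+", " numtoken ", text)
    # 3. Replace anything that is not an ASCII letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # 4. Shorten long words to their first max_word_len characters.
    words = [word[:max_word_len] for word in text.split()]
    return " ".join(words)


example = "Launch #42 of the Spacecrafts!!"
print(sketch_preprocessor(example))
```

A function like this can be passed to the vectorizer through its `preprocessor` argument. Whether each step helps dev F1 is an empirical question, so it is worth toggling the steps one at a time and comparing vocabulary sizes.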

In [7]:
def empty_preprocessor(s):
    return s

#def better_preprocessor(s):
### STUDENT START ###

### STUDENT END ###

#def P5():
### STUDENT START ###

    
### STUDENT END ###
#P5()

(6) The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. That is, logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size. The default regularization, L2, computes this size as the sum of the squared weights (see P3, above). L1 regularization computes this size as the sum of the absolute values of the weights. The result is that whereas L2 regularization makes all the weights relatively small, L1 regularization drives lots of the weights to 0, effectively removing unimportant features.
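
The difference between the two penalties can be illustrated with a few lines of plain Python. The sketch below computes both penalty sizes for a made-up weight vector, then applies one shrinkage step of each kind: soft-thresholding for the L1 penalty `t * |w|`, and proportional scaling for the L2 penalty `0.5 * t * w**2`. These are the standard proximal steps for those penalties, used here only to show the qualitative effect; this is not how the solver inside LogisticRegression works, and the weights and `t` are assumptions.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def l1_shrink(w, t):
    """Soft-thresholding: move w toward zero by t, stopping at exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def l2_shrink(w, t):
    """Proportional shrinkage: scale w down, never reaching exactly zero."""
    return w / (1.0 + t)


weights = [1.5, -0.2, 0.05, -0.9, 0.1]  # made-up weights
t = 0.25

after_l1 = [l1_shrink(w, t) for w in weights]
after_l2 = [l2_shrink(w, t) for w in weights]

print("L1 penalty: %.4f, L2 penalty: %.4f" % (l1_penalty(weights), l2_penalty(weights)))
print("non-zero after L1 step: %d" % sum(1 for w in after_l1 if w != 0.0))
print("non-zero after L2 step: %d" % sum(1 for w in after_l2 if w != 0.0))
```

The L1 step sets every weight smaller than `t` in magnitude to exactly zero, while the L2 step shrinks all weights but leaves them all non-zero, which matches the behavior described above.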

Train a logistic regression model using a "l1" penalty. Output the number of learned weights that are not equal to zero. How does this compare to the number of non-zero weights you get with "l2"? Now, reduce the size of the vocabulary by keeping only those features that have at least one non-zero weight and retrain a model using "l2".

Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when pruning unused features by adjusting the C parameter.

Note: The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.01 (the default is .0001).

[4 pts]

In [8]:
def P6():
    # Keep this random seed here to make comparison easier.
    np.random.seed(0)

    ### STUDENT START ###

    

    ### STUDENT END ###
P6()

(7) Use the TfidfVectorizer -- how is this different from the CountVectorizer? Train a logistic regression model with C=100.
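
As a rough guide to the difference, the sketch below reweights raw counts in plain Python. It assumes scikit-learn's documented defaults for TfidfVectorizer (raw counts as term frequency, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector). The toy documents and helper names are made up, and the code is an illustration rather than the library's implementation.

```python
import math
from collections import Counter

# Toy tokenized documents (made up for this sketch).
docs = [
    ["space", "orbit", "space"],
    ["space", "image"],
    ["image", "file", "image"],
]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))


def idf(word):
    """Smoothed inverse document frequency (assumed default formula)."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq[word])) + 1.0


def tfidf_vector(doc):
    """Counts reweighted by idf, then scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


vec = tfidf_vector(docs[0])
print("idf(orbit) = %.3f, idf(space) = %.3f" % (idf("orbit"), idf("space")))
print(sorted(vec.items()))
```

Words that appear in fewer documents get a larger idf, so rare words are boosted relative to common ones, and the normalization removes the effect of document length. A plain count vector does neither.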

Make predictions on the dev data and show the top 3 documents where the ratio R is largest, where R is:

maximum predicted probability / predicted probability of the correct label

What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.
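
The ratio R defined above can be computed directly from a matrix of predicted probabilities. In this sketch the probabilities and true labels are made-up numbers standing in for the output of a classifier's probability predictions on the dev set; `confusion_ratio` is a hypothetical helper name.

```python
def confusion_ratio(probs, true_label):
    """R = maximum predicted probability / predicted probability of the correct label."""
    return max(probs) / probs[true_label]


# Made-up predicted probabilities: one row per document, one column per class.
dev_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.01, 0.01, 0.97, 0.01],
]
dev_true = [0, 2, 0, 3]  # made-up correct labels

ratios = [confusion_ratio(p, y) for p, y in zip(dev_probs, dev_true)]

# Indices of the 3 documents with the largest R, worst first.
worst = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)[:3]

for i in worst:
    print("document %d: R = %.2f" % (i, ratios[i]))
```

R equals 1 when the model's top prediction is the correct label and grows as the model becomes more confident in a wrong label, so the documents with the largest R are the most confidently misclassified ones.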

[4 pts]

In [11]:
#def P7():
    ### STUDENT START ###



    ### STUDENT END ###
#P7()

ANSWER:

(8) EXTRA CREDIT

Try implementing one of your ideas based on your error analysis. Use logistic regression as your underlying model.

- [1 pt] for a reasonable attempt
- [2 pts] for improved performance