# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup posts on a variety of topics. You'll train classifiers to distinguish posts by topics inferred from the text. Whereas with digit classification, where each input is relatively dense (represented as a 28x28 matrix of pixels, many of which are non-zero), here each document is relatively sparse (represented as a bag-of-words). Only a few words of the total vocabulary are active in any given document. The assumption is that a label depends only on the count of words, not their order.

The `sklearn` documentation on feature extraction may be useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on Slack, but <b> please prepare your own write-up with your own code. </b>

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk

Load the data, stripping out metadata so that only textual features will be used, and restricting documents to 4 specific topics. By default, newsgroups data is split into training and test sets, but here the test set gets further split into development and test sets.  (If you remove the categories argument from the fetch function calls, you'd get documents from all 20 topics.)

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test  = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels   = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels     = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('labels names:',         newsgroups_train.target_names)

training label shape: (2034,)
dev label shape: (676,)
test label shape: (677,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


### Part 1:

For each of the first 5 training examples, print the text of the message along with the label.

In [3]:
# https://stackoverflow.com/questions/8924173/how-do-i-print-bold-text-in-python
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'
    
def P1(num_examples=5):
    for i in range(num_examples):
        print (color.BOLD + "Message for example number {0} with label {1} ({2}) is".format(i, train_labels[i], newsgroups_train.target_names[train_labels[i]]) + color.END)
        print(color.BLUE + train_data[i] + color.END)
        print()
P1(5)

[1mMessage for example number 0 with label 1 (comp.graphics) is[0m
[94mHi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych[0m

[1mMessage for example number 1 with label 3 (talk.religion.misc) is[0m
[94m

Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1

### Part 2:

Transform the training data into a matrix of **word** unigram feature vectors.  What is the size of the vocabulary? What is the average number of non-zero features per example?  What is the fraction of the non-zero entries in the matrix?  What are the 0th and last feature strings (in alphabetical order)?<br/>
_Use `CountVectorization` and its `.fit_transform` method.  Use `.nnz` and `.shape` attributes, and `.get_feature_names` method._

Now transform the training data into a matrix of **word** unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"].  Confirm the size of the vocabulary. What is the average number of non-zero features per example?<br/>
_Use `CountVectorization(vocabulary=...)` and its `.transform` method._

Now transform the training data into a matrix of **character** bigram and trigram feature vectors.  What is the size of the vocabulary?<br/>
_Use `CountVectorization(analyzer=..., ngram_range=...)` and its `.fit_transform` method._

Now transform the training data into a matrix of **word** unigram feature vectors and prune words that appear in fewer than 10 documents.  What is the size of the vocabulary?<br/>
_Use `CountVectorization(min_df=...)` and its `.fit_transform` method._

Now again transform the training data into a matrix of **word** unigram feature vectors. What is the fraction of words in the development vocabulary that is missing from the training vocabulary?<br/>
_Hint: Build vocabularies for both train and dev and look at the size of the difference._

Notes:
* `.fit_transform` makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").
* `.fit_transform` and `.transform` return sparse matrix objects.  See about them at http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html.  

In [19]:
def P2():
    # by default, analyzer is word - so we have word n-grams
    vectorizer = CountVectorizer()
    vec_result = vectorizer.fit_transform(train_data)
    
    print(color.BOLD + "[Standard]" + color.END + " The size of the vocabulary is {0}".format(len(vectorizer.vocabulary_)))
    # print(vec_result.toarray())
    
    print(color.BOLD + "[Standard]" + color.END + " The first feature alphabetically is {0}".format(vectorizer.get_feature_names()[0]))
    print(color.BOLD + "[Standard]" + color.END + " The last feature alphabetically is {0}".format(vectorizer.get_feature_names()[-1]))
    
    # count all non zero elements per example in the result vector
    nonzero = np.count_nonzero(vec_result.toarray(), axis=1)
    
    print(color.BOLD + "[Standard]" + color.END + " The average number of non-zero features per example is {0}".format(np.average(nonzero)))
    print(color.BOLD + "[Standard]" + color.END + " The fraction of the non-zero entries in the matrix is {0}".format(np.sum(nonzero)/np.sum(vec_result.toarray())))
    
    custom_vocab = ["atheism", "graphics", "space", "religion"]
    vectorizer = CountVectorizer(vocabulary=custom_vocab)
    vec_result = vectorizer.transform(train_data)
    print(color.BOLD + "[Custom Vocab]" + color.END +" The custom vocab size is {0} and expected size is {1}".format(len(vectorizer.vocabulary_), len(custom_vocab)))
    print(vectorizer.vocabulary_)
    
    nonzero = np.count_nonzero(vec_result.toarray(), axis=1)
    print(color.BOLD + "[Custom Vocab]" + color.END +" The average number of non-zero features per example is {0}".format(np.average(nonzero)))

    # make analyzer to be string, so we have char n-grams
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(2,3))
    vec_result = vectorizer.fit_transform(train_data)
    print(color.BOLD + "[Character n-grams]" + color.END + " The size of the vocabulary is {0}".format(len(vectorizer.vocabulary_)))
    
    # prune words that appear in fewer than 10 docs
    vectorizer = CountVectorizer(min_df=10)
    vec_result = vectorizer.fit_transform(train_data)
    print(color.BOLD + "[Min word freq]" + color.END + " The size of the vocabulary is {0}".format(len(vectorizer.vocabulary_)))
    
    vectorizer_train = CountVectorizer()
    vec_result_train = vectorizer_train.fit_transform(train_data)

    vectorizer_dev = CountVectorizer()
    vec_result_dev = vectorizer_dev.fit_transform(dev_data)
    
    diffkeys = [k for k in vectorizer_dev.vocabulary_ if k not in vectorizer_train.vocabulary_]
    print("Fraction of words in dev vocab that are not in the training vocab are {0}".format(len(diffkeys)/len(vectorizer_dev.vocabulary_)))
    

P2()

[1m[Standard][0m The size of the vocabulary is 26879
[1m[Standard][0m The first feature alphabetically is 00
[1m[Standard][0m The last feature alphabetically is zyxel


[1m[Standard][0m The average number of non-zero features per example is 96.70599803343165
[1m[Standard][0m The fraction of the non-zero entries in the matrix is 0.5215127315919528
[1m[Custom Vocab][0m The custom vocab size is 4 and expected size is 4
{'atheism': 0, 'graphics': 1, 'space': 2, 'religion': 3}
[1m[Custom Vocab][0m The average number of non-zero features per example is 0.26843657817109146
[1m[Character n-grams][0m The size of the vocabulary is 35478
[1m[Min word freq][0m The size of the vocabulary is 3064
Fraction of words in dev vocab that are not in the training vocab are 0.24787640034470024


### Part 3:

Transform the training and development data to matrices of word unigram feature vectors.

1. Produce several k-Nearest Neigbors models by varying k, including one with k set to optimize f1 score.  For each model, show the k value and f1 score.
1. Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score.  For each model, show the alpha value and f1 score.
1. Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score.  For each model, show the C value, f1 score, and sum of squared weights for each topic.

* Why doesn't k-Nearest Neighbors work well for this problem?
* Why doesn't Logistic Regression work as well as Naive Bayes does?
* What is the relationship between logistic regression's sum of squared weights vs. C value?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer` and its `.fit_transform` and `.transform` methods to transform data.
* You can use `KNeighborsClassifier(...)` to produce a k-Nearest Neighbors model.
* You can use `MultinomialNB(...)` to produce a Naive Bayes model.
* You can use `LogisticRegression(C=..., solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` method to get weights for each topic.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.

In [7]:
from collections import defaultdict

def knn_helper(k_values, train_set, dev_set, train_y, dev_y, verbose=True):
    f1_score = defaultdict(list)
    
    # define knn model
    for k in k_values:
        neigh = KNeighborsClassifier(n_neighbors=k)
        neigh.fit(train_set, train_y)
        predictions = neigh.predict(dev_set)
        f1_score[k] = metrics.f1_score(dev_y, predictions, average="weighted")
    
    if verbose:
        for k in k_values:
            print(color.BOLD + "[KNN]" + color.END + " For k ={0}, the F1 score is {1}".format(k, f1_score[k]))
            
    return f1_score

def multinomial_nb_helper(alphas, train_set, dev_set, train_y, dev_y, verbose=True):
    f1_score = defaultdict(list)
    
    for a in alphas:
        clf = MultinomialNB(alpha=a)
        clf.fit(train_set, train_y)
        predictions = clf.predict(dev_set)
        f1_score[a] = metrics.f1_score(dev_y, predictions, average="weighted")

    if verbose:
        for a in alphas:
            print(color.BOLD + "[MultiNomial NB]" + color.END + " For alpha ={0}, the F1 score is {1}".format(a, f1_score[a]))
    
    return f1_score

def logistic_reg_helper(l2_reg, train_set, dev_set, train_y, dev_y, verbose=True):
    result = defaultdict(list)
    
    for l2 in l2_reg:
        clf = LogisticRegression(C=l2, solver="liblinear", multi_class="auto")
        clf.fit(train_set, train_y)
        predictions = clf.predict(dev_set)
        
        # add relevant attributes for logit to result
        result[l2].append(metrics.f1_score(dev_y, predictions, average="weighted"))
        result[l2].append(np.sum(clf.coef_ ** 2, axis=1))
        result[l2].append(clf.coef_)
        
    if verbose:
        for l2 in l2_reg:
            print(color.BOLD + "[Logistic Regression]" + color.END + " For C ={0}, F1 score: {1}, SSQ: {2}".format(l2, result[l2][0], result[l2][1]))
    
    return result
    
def P3():
    # prepare dataset
    vocab = set()
    vec = CountVectorizer()
    vec_result_train = vec.fit_transform(train_data)
    vocab.update(vec.vocabulary_.keys())

    vec = CountVectorizer(vocabulary=list(vocab))
    vec_result_dev = vec.transform(dev_data)
    
    print("Shape of the training set is {0}".format(str(vec_result_train.toarray().shape)))
    print("Shape of the dev set is {0}".format(str(vec_result_dev.toarray().shape)))
    
    # perform knn
    #knn_helper([i for i in range(1, 20, 3)], vec_result_train.toarray(), vec_result_dev.toarray(), train_labels, dev_labels)
   
    # perform multinomial nb
    #multinomial_nb_helper([1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0], vec_result_train.toarray(), vec_result_dev.toarray(), train_labels, dev_labels)
    
    # perform logistic regression
    logistic_reg_helper([1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.7, 1.0, 2.0, 10.0, 50, 100], vec_result_train.toarray(), vec_result_dev.toarray(), train_labels, dev_labels)

P3()

Shape of the training set is (2034, 26879)
Shape of the dev set is (676, 26879)


  'precision', 'predicted', average, warn_for)


[1m[Logistic Regression][0m For C =1e-10, F1 score: 0.17995908910526975, SSQ: [5.70674183e-13 8.45299474e-13 4.04006981e-13 6.35819484e-13]
[1m[Logistic Regression][0m For C =0.0001, F1 score: 0.18389221656390384, SSQ: [0.00770175 0.0119412  0.00943508 0.00910284]
[1m[Logistic Regression][0m For C =0.001, F1 score: 0.17480590631079723, SSQ: [0.16509345 0.20095275 0.18067094 0.18724278]
[1m[Logistic Regression][0m For C =0.01, F1 score: 0.23200347507376304, SSQ: [2.54149597 2.93970937 2.86246884 2.25002867]
[1m[Logistic Regression][0m For C =0.1, F1 score: 0.21404766410894882, SSQ: [27.13276422 24.65876272 27.45791178 23.02092251]
[1m[Logistic Regression][0m For C =0.5, F1 score: 0.238210423317165, SSQ: [102.60222594  83.1201744   99.01364429  88.98434638]
[1m[Logistic Regression][0m For C =0.7, F1 score: 0.2339022200730513, SSQ: [130.8094896  104.16699008 124.93573965 113.83960131]
[1m[Logistic Regression][0m For C =1.0, F1 score: 0.22839605502775226, SSQ: [166.96406144

ANSWER:

**KNN**

KNN doesnt perform as well due to the curse of dimensionality. As we notice, there are over 25k features in the model.
Due to this, not only does KNN run slowest of the 3 models, but its F score also suffers. The feature space is large and its hard to generalize well with small values of K (small compared to the size of the features). We can try to increase the value of K, however, beyond a threshold - increasing K will lose its meaning. For example, it would not be useful to consider (say) 10k neighbors for a data point for classification as its not useful to consider so many neighbors (because everyone is "close").

The peak value for KNN classification F score is around 0.46 at K=7

**Multinomial NB vs Logistic Regression**

https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c

The best performing logistic model is when C=0.5 with F1 score = 0.708

The best performing multinomial NB model is when alpha =0.1 with F1 score = 0.79

The above article describes that "when the training size reaches infinity the discriminative model, ie, logistic regression performs better than the generative model, ie, Naive Bayes. The generative model reaches its asymptotic faster (O(log n)) than the discriminative model (O(n)), ie, the generative model (Naive Bayes) reaches the asymptotic solution for fewer training sets than the discriminative model (Logistic Regression)". In our dataset, we have only around 2k training examples - which is relatively small. Thus, in such a setting NB performs better than logistic regression.


**Relationship between C and SSQ**

From the above, we can think of C as an inverse of the traditional regularization parameter, lambda. That is, when C is large, we're hardly regularizing the weights but when C is small, we are heavily regularizing the weights. Thus, during high regularization (C is small), we see smaller sum of squares on the weights - we're penalizing unnecessary features highly and being parsimonious. When C is large, we're allowing more features' weights to creep into the model, thus SSQ is large.


### Part 4:

Transform the data to a matrix of word **bigram** feature vectors.  Produce a Logistic Regression model.  For each topic, find the 5 features with the largest weights (that's 20 features in total).  Show a 20 row (features) x 4 column (topics) table of the weights.

Do you see any surprising features in this table?

Notes:
* Train on the transformed training data.
* You can use `CountVectorizer` and its `.fit_transform` method to transform data.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` method to get weights for each topic.
* You can use `np.argsort` to get indices sorted by element value. 

In [28]:
def P4(l2_reg = 0.5, num_features = 5):
    # prepare dataset
    vocab = set()
    vec = CountVectorizer(ngram_range=(2,2))
    vec_result_train = vec.fit_transform(train_data)
    vocab.update(vec.vocabulary_.keys())

    vec = CountVectorizer(ngram_range=(2,2), vocabulary=list(vocab))
    vec_result_dev = vec.transform(dev_data)
    
    logit_result = logistic_reg_helper([l2_reg], vec_result_train.toarray(), vec_result_dev.toarray(), train_labels, dev_labels)
    
    # get the weights for each topic
    weights_topic = logit_result[l2_reg][2]
    # print(weights_topic.shape)
    
    result = defaultdict(list)
    
    # sort the weights row wise for each topic to get indices
    ind_topic_sorted_by_wt = np.argsort(weights_topic, axis=1)
    
    # for each topic
    for i in range(len(newsgroups_train.target_names)):
        
        # get topic name, which is the key of the result dict
        topic = newsgroups_train.target_names[i]
        print("For index {0} the topic name is {1}".format(i, topic))
        
        # add indices corresponding to highest N weights
        result[topic].append(list(ind_topic_sorted_by_wt[i][-num_features:]))
        
        # add weight values corresponding to the highest N weights
        wts = list()
        # add features (n grams) corresponding to the highest N weights
        ngrams = list()
        
        for idx in result[topic][0]:
            wts.append(weights_topic[i][idx])
            
            for k,v in vec.vocabulary_.items():
                if idx == v:
                    ngrams.append(k)
            
        result[topic].append(wts)
        result[topic].append(ngrams)
    
    print(result)
'''

comp graphics
in there
in advance
looking for


atheism
was just
look up?? (maybe)
are you -
you are
is not
in this
was just
cheers kent?

see image - for table

atheism - 0.229 as a weight

SURPRISING PART
cheers kent - appears twice - this is probably the signature under the end of the message

fbi appears in religion

these params 
'''
P4()

[1m[Logistic Regression][0m For C =0.5, F1 score: 0.20347421529530357, SSQ: [111.75537598 118.73154706 123.42664907 100.53142283]
For index 0 the topic name is alt.atheism
For index 1 the topic name is comp.graphics
For index 2 the topic name is sci.space
For index 3 the topic name is talk.religion.misc


ANSWER:

### Part 5:

To improve generalization, it is common to try preprocessing text in various ways before splitting into words. For example, you could try transforming strings to lower case, replacing sequences of numbers with single tokens, removing various non-letter characters, and shortening long words.

Produce a Logistic Regression model (with no preprocessing of text).  Evaluate and show its f1 score and size of the dictionary.

Produce an improved Logistic Regression model by preprocessing the text.  Evaluate and show its f1 score and size of the vocabulary.  Try for an improvement in f1 score of at least 0.02.

How much did the improved model reduce the vocabulary size?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer(preprocessor=...)` to preprocess strings with your own custom-defined function.
* `CountVectorizer` default is to preprocess strings to lower case.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
* If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular.

In [None]:
#def better_preprocessor(s):
    ### STUDENT START ###
    ### STUDENT END ###

#def P5():
    ### STUDENT START ###
    ### STUDENT END ###
'''
we have to use bigrams for 5-7
0.02 - may or maynot get this improvement (more like 1.8 percent improvement)

bigrams - 194k features

feature preprocessing
lower case, change numbers to num, remove plurals (remove es plurals to empty),
remove ending with ing etc.
replace underscore with space (no benefit in f1 score)

baseline f1 for bigrams - 0.607 
.689 unigrams baseline without preprocessing


unigram - original and better (preprocessing) one
bigram - original and better (preprocessing) one

porter stemming in nltk improves 25 basis points 
lamitization in nltk improves 25 basis points

maybe use document frequency > 10 as a filter criteria to reduce the vocab

'''
#P5()

### Part 6:

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small total size. The default L2 regularization computes this size as the sum of the squared weights (as in Part 3 above). L1 regularization computes this size as the sum of the absolute values of the weights. Whereas L2 regularization makes all the weights relatively small, L1 regularization drives many of the weights to 0, effectively removing unimportant features.

For several L1 regularization strengths ...<br/>
* Produce a Logistic Regression model using the **L1** regularization strength.  Reduce the vocabulary to only those features that have at least one non-zero weight among the four categories.  Produce a new Logistic Regression model using the reduced vocabulary and **L2** regularization strength of 0.5.  Evaluate and show the L1 regularization strength, vocabulary size, and f1 score associated with the new model.

Show a plot of f1 score vs. log vocabulary size.  Each point corresponds to a specific L1 regularization strength used to reduce the vocabulary.

How does performance of the models based on reduced vocabularies compare to that of a model based on the full vocabulary?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `LogisticRegression(..., penalty="l1")` to produce a logistic regression model using L1 regularization.
* You can use `LogisticRegression(..., penalty="l2")` to produce a logistic regression model using L2 regularization.
* You can use `LogisticRegression(..., tol=0.015)` to produce a logistic regression model using relaxed gradient descent convergence criteria.  The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.015 (the default is .0001).

In [None]:
#def P6():
    # Keep this random seed here to make comparison easier.
    #np.random.seed(0)
    
    ### STUDENT START ###
    ### STUDENT END ###
'''
c=0.5 unigram, curve wasnt linear - shot up fast and then decreases

c value of 1 is optimal, vocab = 2419 or 1681 (bigrams - was used for this)
c=.6
THERE IS NO CONSENSUS ON THE C VALUE HERE, PEOPLE ARE GETTING 0.5,1 AND EVEN UP TO 30

L1 is a feature selector, use this to feed into the other model l2 to get the f1

I did it this way to obtain features with non-zero weights and calculate the size:
    nonzero_features = np.unique(np.nonzero(lr1.coef_)[1])
    nonzero_vocab = np.array(cv.get_feature_names())
    vocab_size = len(nonzero_vocab)

c=0.5 in the problem description
'''
#P6()

ANSWER:

### Part 7:

How is `TfidfVectorizer` different than `CountVectorizer`?

Produce a Logistic Regression model based on data represented in tf-idf form, with L2 regularization strength of 100.  Evaluate and show the f1 score.  How is `TfidfVectorizer` different than `CountVectorizer`?

Show the 3 documents with highest R ratio, where ...<br/>
$R\,ratio = maximum\,predicted\,probability \div predicted\,probability\,of\,correct\,label$

Explain what the R ratio describes.  What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

Note:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `TfidfVectorizer` and its `.fit_transform` method to transform data to tf-idf form.
* You can use `LogisticRegression(C=100, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `LogisticRegression`'s `.predict_proba` method to access predicted probabilities.

In [None]:
#def P7():
    ### STUDENT START ###
    ### STUDENT END ###
'''
bigram, 0.69

book of mormon - i am pleased to announce
why is the ....jesus OR (24 children)
book of mormon


indices of the docs
215,667,615
471,215,665

r ratio can be derived from predicted_proba
for index 665 =663.333
index 215=476, 168
index 471 = 236, 94

'''
#P7()

ANSWER:

### Part 8 EXTRA CREDIT:

Produce a Logistic Regression model to implement your suggestion from Part 7.