## Classifying text

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score

We now apply machine learning classification methods to text. No new
principles are involved: everything works the same way it did when we
learned how to classify irises.

1.  We need to find labeled data; each of the exemplars in the data should be represented with a fixed set of features.  
2. We need to split our data into training and test data.  
3. We need to train a learner on the training data and evaluate (test) it on the test data.

The problem is that text data is not in a form that is compatible with
what we have learned about classifiers.  The text must be put in a suitable
form before a linear model can be trained on it. 

**Training**

1.  Labeled data must be loaded (into Python).  It should be a sequence of documents T accompanied by a sequence of labels L. 
2.  Split T and L into training and test groups, yielding T1 and T2 as well as L1 and L2.
3.  Train a **feature model** on the training data T1 (or, in scikit-learn terminology, **fit** the model **to** the training data).  The feature model takes the text sequence as input and outputs a **term-document** matrix suitable for training a linear classifier.  The feature model is called a **vectorizer** (because it turns a document into a vector, a fixed-length sequence of numbers).
4.  Using the trained vectorizer, transform T1 into a term-document matrix M1.
5.  Train a linear model $\mu$ on M1 and L1.

**Evaluation**

1.  Transform the test data T2 into a term-document matrix M2 using the vectorizer fit during step 3 of training. In particular, this means that words in T2 that were never seen during training are ignored in building M2.
2.  Use $\mu$ to classify the texts represented in M2; that is, produce a set of predicted labels P2.
3.  Compare the actual labels L2 with the predicted labels P2 using standard evaluation metrics such as precision, accuracy, and recall.
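
The training and evaluation steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions: the four training comments and two test comments are made up, and the point is to show that the vectorizer fit on T1 is reused on T2, so that a test word never seen in training ("zebra") gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins for T1/L1 (training) and T2/L2 (test); not real data.
T1 = ["you are a moron", "have a nice day", "what a moron", "nice to see you"]
L1 = [1, 0, 1, 0]
T2 = ["you moron zebra", "see you another day"]
L2 = [1, 0]

vectorizer = TfidfVectorizer()
M1 = vectorizer.fit_transform(T1)   # training steps 3 and 4: fit, then transform T1
clf = BernoulliNB().fit(M1, L1)     # training step 5

M2 = vectorizer.transform(T2)       # evaluation step 1: reuse the fitted vectorizer
P2 = clf.predict(M2)                # evaluation step 2: predicted labels

print(M1.shape[1] == M2.shape[1])          # train and test matrices share columns
print("zebra" in vectorizer.vocabulary_)   # False: the unseen test word was ignored
```

Because M2 has exactly the columns learned from T1, the classifier trained on M1 can be applied to it directly.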


## Review the steps with insult detection

We looked at the insult detection data in  the text classification notebook.

### Training step 1: Loading the data

Let's load the CSV file.

In [15]:
import pandas as pd
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment  taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the comment.

In [3]:
df.tail()

Unnamed: 0,Insult,Date,Comment
3942,1,20120502172717Z,"""you are both morons and that is never happening"""
3943,0,20120528164814Z,"""Many toolbars include spell check, like Yahoo..."
3944,0,20120620142813Z,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,20120528205648Z,"""How about Felix? He is sure turning into one ..."
3946,0,20120515200734Z,"""You're all upset, defending this hipster band..."


Now we define the text sequences $\mathbf{T}$ and the label sequence  $\mathbf{L}$.

In [16]:
T = df['Comment']

In [17]:
L = df['Insult']

### Step 2 Split the data and labels into training and test groups

In [6]:
T1, T2, L1, L2 = train_test_split(T,L)

### Step 3 and 4:  Fit the feature model (vectorizer) to the training data and Transform  it

In [7]:
tf = text.TfidfVectorizer()
# Scikit learn has one function that does both fitting and transforming.
# M1 is the transformed data
# tf is the trained feature model (which will be used to transform the test data)
M1 = tf.fit_transform(T1)

### Step 5 Training the classifier

Now we train a classifier as usual, using the training portion of the split made in Step 2 (M1 and L1).

We use a **Bernoulli Naive Bayes classifier**.

In [8]:
# Create classifier
bnb = nb.BernoulliNB()

# Fit (train) the classifier  using the training data and labels
bnb.fit(M1, L1);

### Evaluation

Evaluate the classifier, first using accuracy (what `.score()` returns).

In [9]:
# vectorize the test data using the vectorizer trained on T1
# Notice we DON'T call .fit_transform() because that would retrain the vectorizer on the test data
# We call .transform() using the trained model to transform the new data.
# Words not seen during training will be ignored.
M2 = tf.transform(T2)
# Classify the data using the trained classifier and report the accuracy
bnb.score(M2, L2)

0.7507598784194529

Now try re-executing steps 2 through 5.  (Just re-execute the cells)  The results should be the same, right?

Well, are they?  

What happens: each train/test split produces a different set of test data.  Sometimes the test is harder.
Sometimes it's easier.  Or, looking at it another way: sometimes the training data is a better preparation for the test than at other times.  

To get a realistic view of how our classifier is doing, we take the average performance over a number of 
train/test splits.  This is a form of **cross validation**.  We return to that point below.
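
As a hedged sketch of the averaging idea, scikit-learn can run the repeated splits for us. The corpus below is made up, and the choice of `ShuffleSplit` (repeated random splits) is an assumption meant to mirror repeated calls to `train_test_split`; classic k-fold cross validation instead partitions the data into k non-overlapping folds. Wrapping the vectorizer and classifier in a pipeline ensures the vectorizer is refit on the training part of every split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the comments (T) and labels (L); not real data.
T = ["you are a moron", "what an idiot", "total moron", "such an idiot",
     "you idiot moron", "stupid idiot",
     "have a nice day", "nice to see you", "see you soon", "what a nice day",
     "good to see you", "have a good day"]
L = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits the vectorizer on the training part of each split,
# so the test part never influences the vocabulary.
model = make_pipeline(TfidfVectorizer(), BernoulliNB())

# Ten random 75/25 splits.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
results = cross_validate(model, T, L, cv=splitter,
                         scoring=["accuracy", "precision", "recall"])

for metric in ["accuracy", "precision", "recall"]:
    print(metric, results[f"test_{metric}"].mean())
```

On a corpus this small some splits may trigger an undefined-metric warning; that is a property of the toy data, not of the method.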

#### Using all three evaluation metrics

First let's get more evaluation numbers, in particular precision and recall.  We do
that by calling a method that returns the predicted labels P2, so we can compare
L2 and P2 using different evaluation metrics.

In [10]:
P2 = bnb.predict(M2)
# scikit-learn metrics take the true labels first, then the predictions
scores = np.array([accuracy_score(L2, P2),
                   precision_score(L2, P2),
                   recall_score(L2, P2)])
print(f'Accuracy: {scores[0]:.2f} Precision: {scores[1]:.2f} Recall: {scores[2]:.2f}')

Accuracy: 0.75 Precision: 0.79 Recall: 0.12


We see that the accuracy is a bit misleading.  There is a serious recall problem.

What does that mean in the setting of insult detection?  It means the BNB classifier is too
reluctant to call something an insult.  When it does flag something as an insult, it
is right 79% of the time, but it catches only 12% of the actual insults.

Why would that be?  Think about how the model is trained and what its weakness might be.
This is what it means to try to interpret or discuss a model's performance.  Zoom
in on the model's weakness. Talk about where that weakness comes from.
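
Because precision and recall are easy to mix up, it can help to work them out on a tiny example. The labels below are made up for illustration. Note that scikit-learn's metric functions take the true labels as the first argument and the predictions as the second; swapping the two swaps precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 real insults
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # 2 items flagged, 1 of them correctly

# True labels first, predictions second.
print(precision_score(y_true, y_pred))  # 0.5: 1 of the 2 flagged items is an insult
print(recall_score(y_true, y_pred))     # 0.25: 1 of the 4 insults was flagged

# Swapping the arguments swaps the two metrics.
print(precision_score(y_pred, y_true))  # 0.25
print(recall_score(y_pred, y_true))     # 0.5
```

A classifier with high precision and low recall is cautious: it rarely flags anything, but what it flags is usually right.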

#### Basic train and test loop

How to get the average of a number of runs.

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import text
from sklearn.metrics import accuracy_score, precision_score, recall_score

def split_fit_and_eval(T,L,test_size=.2):
    # This code collects together training steps 2-5 plus the evaluation.
    # That is, it does one split, train, test, and eval run.
    (T1, T2, L1, L2) = train_test_split(T, L, test_size=test_size)
    tf = text.TfidfVectorizer()
    M1 = tf.fit_transform(T1)
    bnb = nb.BernoulliNB()
    bnb.fit(M1,L1)
    # Use .transform(), not .fit_transform(), on the test data
    M2 = tf.transform(T2)
    P2 = bnb.predict(M2)
    # scikit-learn metrics take the true labels first, then the predictions
    return np.array([accuracy_score(L2, P2),
                     precision_score(L2, P2),
                     recall_score(L2, P2)])

# Split, train, test and eval 10 times
num_runs = 10
# an accumulator for acc., pre., and rec.
scores = np.zeros((3,))
for test_run in range(num_runs):
    scores += split_fit_and_eval(T,L)
# Compute the average of the num_runs runs for all metrics
normed_stats = scores/num_runs

print(f'Accuracy: {normed_stats[0]:.2f} Precision: {normed_stats[1]:.2f} Recall: {normed_stats[2]:.2f}')

Accuracy: 0.77 Precision: 0.90 Recall: 0.14


## Homework

Read the online book draft chapter about text classification, especially the part
about movie review data.  Note that you will be using a different classifier implementation (`scikit-learn`) than the one used in the book
(`nltk`).  Therefore, when it comes to writing code for training the classifier, focus on the code examples in this notebook, which use `scikit-learn`.

Try using two classifiers on the movie review data: the one used in the textbook, an SVM, and
the Bernoulli Naive Bayes model used above. Be sure
to stick with scikit-learn (it has an SVM implementation).
Some points of emphasis:

1.  Be sure to get the average of at least 10 runs for **both** classifiers.
2.  Be sure to get average accuracy, precision, and recall for both classifiers on those multiple runs. You will probably find `split_fit_and_eval` defined above useful, but you may need to modify it.
3.  For your first discussion post turn in the new code you wrote, including the code that labels and shuffles the data (discussed further below).  If you have to do a new import, show that. If you have to rewrite `split_fit_and_eval`, turn in the new version.  Also show the output, which should be a single line giving the accuracy, precision, and recall.
4.  Discuss which classifier does better.  Discuss which metric the best classifier does the worst at and speculate as to why (this will require reviewing the definitions of precision and recall and thinking about what they mean in a movie review setting).

#### Help with getting the movie reviews data.

Execute the next two cells to get the movie review data.

In [12]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/gawron/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [43]:
from nltk.corpus import movie_reviews as mr

def get_file_strings (corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))

pos_file_ids = data['pos']
neg_file_ids = data['neg']

# Get all the positive and negative reviews.
pos_file_reviews = get_file_strings (mr, pos_file_ids)
neg_file_reviews = get_file_strings (mr, neg_file_ids)

Each review is a string.  In principle, a list of strings like `pos_file_reviews`  can be passed to `text.TfidfVectorizer()` via the `fit_transform` method to train a vectorizer for machine learning.
You could code that up.

What you'd really like to do is use `split_fit_and_eval`, defined above, which does a lot of the work for you.

But hold on. You have a coding problem. You don't have  a sequence of documents and labels.  Instead you have
one sequence of positive documents  and another sequence of negative documents.  

So you will need to turn those two sequences into a sequence of documents and a sequence of labels
because that's what `split_fit_and_eval` wants.  You also want the doc sequence
to contain a random mixture of positive and negative documents, because some machine
learning algorithms are sensitive to the order in which training data is presented to
them.

The next cell does **not** do that for you.  But it illustrates an approach using 
two sets of English letters in place of two sets of English documents.
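
One way to sketch that approach, using vowels and consonants as the two hypothetical sets of letters (the variable names here are illustrative and not part of the assignment):

```python
from random import shuffle
from string import ascii_lowercase

# Two "classes" of items: vowels and consonants stand in for the two review lists.
vowels = [ch for ch in ascii_lowercase if ch in "aeiou"]
consonants = [ch for ch in ascii_lowercase if ch not in "aeiou"]

# Pair every item with its class label, then pool and shuffle the pairs,
# so that each item stays attached to its label.
pairs = [(ch, "vowel") for ch in vowels] + [(ch, "consonant") for ch in consonants]
shuffle(pairs)

# Unzip into one sequence of items and one parallel sequence of labels.
letters, labels = zip(*pairs)
print(letters[:5], labels[:5])
```

Shuffling the pairs rather than the two sequences separately is what keeps items and labels aligned.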

In [44]:
# Let's work on documents
# There are 2 classes, pos and neg

from random import shuffle


# Now get pairs of reviews and labels
pos_pairs = [(rev,'pos') for rev in pos_file_reviews]
neg_pairs = [(rev,'neg') for rev in neg_file_reviews]

#### Solution starts here

In [45]:
###########  Shuffling  ###########################
# Way too orderly, the classes aren't mixed yet.
data = pos_pairs + neg_pairs
shuffle(data)
###################  Now they're shuffled! ###############

# Separate the reviews from their labels
revs, lbls = zip(*data)

In [7]:
# What you were given
#def split_fit_and_eval(T,L,test_size=.2):
#    # This code just collects together the training steps 2-5 + the eval
#    # That is, It does one training,test., eval run
#    (T1, T2, L1, L2) = train_test_split(T, L, test_size=test_size)
#    tf = text.TfidfVectorizer()
#    M1 = tf.fit_transform(T1)
#    bnb = nb.BernoulliNB()
#    bnb.fit(M1,L1)
#    # .fit(), ..fit_transform()
#    M2 = tf.transform(T2)
#    P2 = bnb.predict(M2)
#    return np.array([accuracy_score(L2,P2),
#                     precision_score(L2, P2),
#                     recall_score(L2,P2)])

def split_vectorize_and_fit(T,L,clf,pos_label='pos'):
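    """
    Split, vectorize, fit `clf`, and return (precision, recall, accuracy).

    Changes from `split_fit_and_eval`:
    Added default value for pos label
    Added classifier argument.  
    """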
    X_train, X_test, y_train, y_test, tf = split_and_vectorize(T,L)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    return precision_score(y_test,predictions,pos_label=pos_label), \
           recall_score(y_test, predictions,pos_label=pos_label), \
           accuracy_score(y_test,predictions)

def split_and_vectorize(T,L):
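    """
    Split T and L into training and test sets and vectorize the text.

    The vectorizer is fit on the training text only and is returned
    along with the train/test matrices and labels.
    """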
    T_train,T_test, y_train,y_test = train_test_split(T,L)
    tf = text.TfidfVectorizer()
    X_train = tf.fit_transform(T_train)
    X_test = tf.transform(T_test)
    return X_train, X_test, y_train, y_test, tf

Solution Partial:  Running just one classifier:

Bernoulli Naive Bayes

In [23]:
import numpy as np

num_test_runs = 10
scores = np.zeros((3,))
for test_run in range(num_test_runs):
    clf = nb.BernoulliNB()
    scores += split_vectorize_and_fit(revs,lbls,clf)

p_score,r_score,a_score = scores/num_test_runs
print(f"{'BernoulliNB':<50} Precision {p_score:.3f} Recall {r_score:.3f}  Accuracy: {a_score:.3f}")

BernoulliNB                                        Precision 0.878 Recall 0.669  Accuracy: 0.785


In [24]:
from sklearn import linear_model
from sklearn.svm import LinearSVC

clfs = {"Logistic Regression": linear_model.LogisticRegression(),
        "Ridge Classifier":   linear_model.RidgeClassifier(),
        "Multinomial NB" : nb.MultinomialNB(),
        "Passive Aggressive Classifier": linear_model.PassiveAggressiveClassifier(),
        "SVM": LinearSVC(),
       }

Testing multiple classifiers in one loop with 10 training test splits for each classifier.
Reporting average precision, recall and accuracy scores for the 10 test runs for each of
the classifiers.

In [25]:
num_test_runs = 10
scores = np.zeros((3,))

print(f"Clf: {'':<50} Precision   Recall  Accuracy ")
for clf in clfs:
    for test_run in range(num_test_runs):
        scores += split_vectorize_and_fit(revs,lbls,clfs[clf])
    p_score,r_score,a_score = scores/num_test_runs
    print(f"Clf: {clf:<50} {p_score:.3f}        {r_score:.3f}   {a_score:.3f}")
    scores = np.zeros((3,))
    print()

Clf:                                                    Precision   Recall  Accuracy 
Clf: Logistic Regression                                0.832        0.830   0.829

Clf: Ridge Classifier                                   0.840        0.848   0.843

Clf: Multinomial NB                                     0.881        0.695   0.792

Clf: Passive Aggressive Classifier                      0.845        0.857   0.850

Clf: SVM                                                0.843        0.875   0.856



The Multinomial Naive Bayes classifier had the best precision score, but it also had the worst accuracy
and by far the worst recall.  This is an excellent example of the precision-recall trade-off. The classifier
achieves its high precision at the cost of a lot of false negatives, lowering its recall and accuracy.

In the table above, the highest accuracy and the best recall were achieved by the SVM, with the
Passive Aggressive Classifier second on both and just a smidge behind in accuracy. Except for the Multinomial Naive Bayes
classifier and (by a hair) Logistic Regression, the precision scores all lagged behind their recall scores, meaning most of the classifiers
traded in a little precision for better recall.

**Note:  Your mileage may vary.  It doesn't mean much if you couldn't reproduce these numbers.**

**Addendum**: Note that the results discussed above aren't **stable**.  On this next set of 10 test runs, the Passive Aggressive Classifier got the best precision.   That's okay.  Just discuss the results you got.

In [16]:
num_test_runs = 10
scores = np.zeros((3,))

print(f"Clf: {'':<50} Precision   Recall  Accuracy ")
for clf in clfs:
    for test_run in range(num_test_runs):
        scores += split_vectorize_and_fit(revs,lbls,clfs[clf])
    p_score,r_score,a_score = scores/num_test_runs
    print(f"Clf: {clf:<50} {p_score:.3f}        {r_score:.3f}   {a_score:.3f}")
    scores = np.zeros((3,))
    print()

Clf:                                                    Precision   Recall  Accuracy 
Clf: Logistic Regression                                0.807        0.827   0.816

Clf: Ridge Classifier                                   0.815        0.866   0.837

Clf: Multinomial NB                                     0.848        0.752   0.805

Clf: Passive Aggressive Classifier                      0.850        0.862   0.855

Clf: SVM                                                0.845        0.863   0.853
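
One way to quantify how unstable such averages are is to keep the score from every run and report the standard deviation alongside the mean. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the vectorized reviews, and the array name `run_scores` is an assumption, not something defined elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the vectorized reviews.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

num_test_runs = 10
# One row per run: precision, recall, accuracy.
run_scores = np.zeros((num_test_runs, 3))
for run in range(num_test_runs):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=run)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions = clf.predict(X_test)
    run_scores[run] = (precision_score(y_test, predictions),
                       recall_score(y_test, predictions),
                       accuracy_score(y_test, predictions))

means, stds = run_scores.mean(axis=0), run_scores.std(axis=0)
for name, m, s in zip(["Precision", "Recall", "Accuracy"], means, stds):
    print(f"{name}: {m:.3f} +/- {s:.3f}")
```

If two classifiers' means differ by less than the spread across runs, the ranking between them should not be taken too seriously.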



Bonus:

A brief ("truncated") demo of using dimensionality reduction on textual data.  The 
technique is known as LSI (Latent Semantic Indexing), because the SVD
axes are latent features,  linear combinations of the original word features.
See the dimensionality reduction notebook. 

It doesn't do badly. One reason this is attractive is because it is a whole
lot faster than it used to be with classic eigenvector computation methods.
Try to articulate a way in which the LSI helps classifier performance, at
least in this experiment vis a vis the results above.

In [5]:
from sklearn import naive_bayes as nb

In [26]:
from sklearn import decomposition as dec
from sklearn import linear_model
from sklearn import naive_bayes as nb

X_train, X_test, y_train, y_test, tf = split_and_vectorize(revs,lbls)

#Going to try what's known as LSI (Latent Semantic Indexing)
reducer = dec.TruncatedSVD(n_components=200)
# Using output of vectorizer
X_train_reduced = reducer.fit_transform(X_train)
clf_red = linear_model.LogisticRegression(solver='liblinear')
clf_red.fit(X_train_reduced, y_train)

# Test
X_test_reduced = reducer.transform(X_test)
y_predicted = clf_red.predict(X_test_reduced)

# Optional but convenient below
y_test = np.array(y_test)
pos_label='pos'
acc, prec, rec = accuracy_score(y_test,y_predicted),\
                  precision_score(y_test,y_predicted,pos_label=pos_label), \
                    recall_score(y_test,y_predicted,pos_label=pos_label)

print(f"Acc: {acc:.3f} Prec: {prec:.3f} Rec: {rec:.3f}")

Acc: 0.804 Prec: 0.776 Rec: 0.816


In [None]:
from sklearn import decomposition as dec

n_components=2
reducer = dec.PCA(n_components=n_components)
X_r = reducer.fit_transform(X)

### Most important features (SVM)

Note that for two-class problems a linear SVM has only one set
of weights to look at.  So you need the **lowest** weighted features
if you want to look at the more fun word set that best characterizes **bad**
reviews.  You will have to modify `get_topn` from the Insult Detection Notebook
to do that:

In [27]:
import sklearn.svm as svm

def get_feature_weights (clf):
    """
    Return weight vector and weight dimensionality
    """
    global clf0
    if hasattr(clf,"coef_"):
        (wt_dims,feats) = clf.coef_.shape
        return wt_dims, clf.coef_
    elif hasattr(clf,'feature_log_prob_'):
        (wt_dims,feats) = clf.feature_log_prob_.shape
        return wt_dims, clf.feature_log_prob_
    else:
        clf0 = clf
        raise Exception(f"Can\'t find weights for classifier clf0: {clf0}")
 

def get_topn(clf, vectorizer=None, feature_names=None, top_n=10, class_labels=(True,), verbose=True):
    """
    Prints features with the highest/lowest coefficient values, per class if appropriate.
    
    Assumptions: In a binary class problem the positive class is labeled `"Pos"` or `True`,
    the negative class `"Neg"` or `False`.  Generalizing from this case needs some more parameterization.    
    
    For the multiclass case the entire set of class labels is used and user-supplied
    class_labels are ignored.  The same holds for a classifier that uses more than one
    weight vector for a 2-class problem (e.g., Naive Bayes).
    
    If no vectorizer is supplied, the parameter `feature_names` must be a 1D numpy array with the names
    aligned with the data matrix columns.
    """
    ## NB: feature_names is a numpy array
    if feature_names is None:
        feature_names = vectorizer.get_feature_names_out()
    ## Look at the feature weights the classifier learned.
    wt_dims,weights = get_feature_weights (clf)
    if wt_dims > 1:
        # Must go through all classes in the right order when there's more than one weight vector
        class_labels = clf.classes_
    word_lists = []
    ## ArgSort the feature weights row by row (that is, class by class): feature indices sorted from
    ## lowest weighted feature to highest-weighted feature.
    word_importance = weights.argsort(axis=1)
    for i, class_label in enumerate(class_labels):
        ## The value-sorted indices of the feature weights the classifier learned for class i.
        word_indices = word_importance[i]
        ## Get the topn WORDS using fancy indexing on feature_names
        if wt_dims==1 and class_label in ["Pos",True]:
            # Reorder so most important comes first
            Words = feature_names[word_indices[-top_n:]][::-1] 
        elif wt_dims==1 and class_label in ["Neg",False]:
            Words = feature_names[word_indices[:top_n]]
        else:
            # NB clf case: each class has its own logprob vector
            # Multiclass SVM: each class has its own coefficient vector
            Words = feature_names[word_indices[-top_n:]][::-1] 
        if verbose:
            WordsStr = " ".join(Words)
            print(f"{class_label}: {WordsStr}",end="\n\n")
        word_lists.append(Words)
    return word_lists

tf = text.TfidfVectorizer(stop_words="english")
#revs, lbls
X_train = tf.fit_transform(revs)
est = svm.LinearSVC()
est.fit(X_train, lbls)
# Now find the most heavily weighted features [= words]
Words = get_topn(est, tf, class_labels=("Pos",), top_n=50)

Pos: great fun life quite hilarious memorable terrific overall excellent especially seen job perfectly trek performances true different matrix family good perfect american definitely performance gives truman effective wonderful enjoyed enjoyable frank bulworth titanic best pulp cameron rocky pace wonderfully mulan war people political bowfinger mamet solid works mike bit dark



In [30]:
est.decision_function(X_train).min()

np.float64(-1.7519762264209062)

In [31]:
est.decision_function(X_train).max()

np.float64(1.4920838699931187)

In [25]:
feature_names = tf.get_feature_names_out()
feature_names

array(['00', '000', '0009f', ..., 'zwigoff', 'zycie', 'zzzzzzz'],
      dtype=object)

In [44]:
BadWords = get_topn(est,tf, class_labels=("Neg",), top_n=100)

Neg: bad plot worst supposed unfortunately boring script waste awful poor reason looks ridiculous stupid attempt dull harry wasted mess terrible tv lame pointless cheap carpenter poorly maybe better fails jakob filmmakers attempts minute given worse adam guess director west point joke potential seagal falls material write didn bland make hurlyburly saved flat tries minutes eve tedious disappointing obvious embarrassing laughable charlie unfunny audience naked badly joan save talent predictable grade beast idea weak data alessa actors sloppy clich tired francis iii godzilla metro designed superior headed apparently dutch sports pay bore franklin annoying movie nbsp lifeless interesting problem figure random



**Code note**: Demonstrating how `.argsort()` works on a 2D array `A`: `A.argsort()` always returns an array of 
the same shape as `A` but it's an array of indices, and we can sort either rows or columns:

In [189]:
A1 = np.arange(12,dtype=float).reshape((2,6))
A2 = np.arange(12,0,-1,dtype=float).reshape((2,6))
A = np.concatenate([A1,A2])
print("A")
print(A)
print("argsort Axis 0 (sorted columns, indices are row indices)")
col_sort = A.argsort(axis=0)
print(col_sort)
print("Example: Col 1 Sorted")
sorted_indices = col_sort[:,1]
# Use fancy indexing 
print(A[sorted_indices,1])
print("argsort axis 1 (sorted rows,  indices are column indices)")
print(A.argsort(axis=1))

A
[[ 0.  1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10. 11.]
 [12. 11. 10.  9.  8.  7.]
 [ 6.  5.  4.  3.  2.  1.]]
argsort Axis 0 (sorted columns, indices are row indices)
[[0 0 0 0 3 3]
 [1 3 3 3 0 0]
 [3 1 1 1 2 2]
 [2 2 2 2 1 1]]
Example: Col 1 Sorted
[ 1.  5.  7. 11.]
argsort axis 1 (sorted rows,  indices are column indices)
[[0 1 2 3 4 5]
 [0 1 2 3 4 5]
 [5 4 3 2 1 0]
 [5 4 3 2 1 0]]


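As can be seen, sort order is smallest to biggest.  In other words, if we've argsorted the rows (axis=1)
and then pick a row, say, the third row, its entries are the column indices of that row's values, ordered from the smallest value to the largest. 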

In [190]:
sorted_indices = A.argsort(axis=1)
#third row
row_idx = 2
print("The entire row:",A[row_idx,:])
col_idx = sorted_indices[row_idx,1]
print("The index of the second smallest element in the third row is:", col_idx)
print(f"The value of the second smallest element in the third row is: {A[row_idx,col_idx]:.0f}.")

The entire row: [12. 11. 10.  9.  8.  7.]
The index of the second smallest element in the third row is: 4
The value of the second smallest element in the third row is: 8.


Sorting always happens along an axis for a multidimensional array; the default
is sorting along the last axis (`axis=-1`), which for a 2D array is axis 1.

In [191]:
A.argsort()

array([[0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [5, 4, 3, 2, 1, 0],
       [5, 4, 3, 2, 1, 0]])

Mnemonic.  Axis 0 is the row axis; axis 1 is the column axis.  If you want the indices
in the sort to be row indices, use a row-axis (axis=0) argsort; if you want the indices
in the sort to be column indices, use a column-axis (axis=1) argsort.
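
Putting the pieces together, here is a small sketch of the top-k pattern that `get_topn` relies on: argsort each row, keep the last k column indices, and reverse them so the largest comes first. The weight matrix and names are made up for illustration.

```python
import numpy as np

# Two "classes" (rows) and three "features" (columns).
weights = np.array([[0.1, 0.9, 0.5],
                    [0.7, 0.2, 0.3]])
names = np.array(["a", "b", "c"])

k = 2
order = weights.argsort(axis=1)    # column indices, smallest to largest, per row
top_k = order[:, -k:][:, ::-1]     # last k indices of each row, largest first
print(top_k)          # [[1 2]
                      #  [0 2]]
print(names[top_k])   # [['b' 'c']
                      #  ['a' 'c']]
```

Fancy indexing with the 2D index array returns the feature names in the same 2D layout, one row per class.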

### Addendum  (not part of the assignment):  Trying this out with NaiveBayes

The whole strategy doesn't work very well with NB because NB uses a kind of cumulative evidence strategy.
All very common words are positive indicators of both classes but as we sum the log probs for all the words, the evidence gradually tilts in one direction or the other.

In [12]:
# Use stop words to reduce the tendency of NB to make the most-important-feature list a list of very common words.
tf = text.TfidfVectorizer(stop_words="english")
X_train = tf.fit_transform(revs)
est = nb.BernoulliNB()
est.fit(X_train, lbls)
top_n=100
# Now find the most heavily weighted features [= words]
Words = get_topn(est, tf, top_n=top_n)

neg: film movie like just time good plot make character bad story way characters director little does don really doesn people know scene films scenes man end better movies best new big work gets script going life isn makes audience look long thing action think actually come real love plays old say things fact year great acting did role seen played point comes minutes screen ve comedy goes funny right years cast actors course interesting away lot didn performance world far trying dialogue takes watching john hard set instead looks star pretty ll want watch reason making half having guy given

pos: film movie like just time good story character way does make life best characters little people films man really director new great scene doesn makes scenes don plot end work movies world know love seen performance years role right year real long audience gets old better things takes big comes going fact come ve young say plays cast look screen isn quite think actually thing played making away

A reasonable thing to try is to sort the highest-weighted words by their IDF score.  The more
docs a word has occurred in, the lower the IDF score; so a high-IDF 
word is more likely to be a content word. Let's print those first. 

It so happens the TFIDF Vectorizer stores the vocabulary
IDF-scores in an attribute (in part because that's the part of the TFIDF score
that can be computed once for each word for the entire document set).

In [239]:
# As before
top_n = 100
feature_names = tf.get_feature_names_out()
_wt_dims, weights = get_feature_weights (est)
word_importance = weights.argsort(axis=1)

# New: argsort idf scores for vocab
idf_idx = tf.idf_.argsort()[::-1]


for (i, this_class) in enumerate(est.classes_):
    topn_idxs = word_importance[i][-top_n:]
    # Present the top_n idxs in idf score order
    idf_idx0 = [idx for idx in idf_idx if idx in topn_idxs]
    print(f"Class: {this_class}")
    print(' '.join(feature_names[idf_idx0]))
    print()

Class: neg
reason looks half pretty dialogue given guy watching ll didn having trying hard want instead star watch john interesting minutes making far set course lot point funny comedy away actors goes did acting takes script played cast screen ve action actually comes fact plays say thing think right performance things come old world years look role year isn real going seen long audience gets big love better great makes work movies end know new bad scenes scene man life don doesn films best people really plot director little does characters way make character story good time just like movie film

Class: pos
family performances especially gives picture wife job bit high fun help times instead star watch john place quite day interesting making far set course lot point funny comedy young away actors goes did acting takes played cast screen ve action actually comes fact plays say thing think right performance things come old world years look role year isn real going seen long audience get

###  Looking at confidence scores (functional margins)

Train classifier, prepare test set.

In [72]:
# Enforce labeling convention used for SVMs
lbl_set = set(lbls)
if lbl_set == {"pos","neg"}:
    lbls = np.array([1 if l == "pos" else -1 for l in lbls])
elif lbl_set != {-1,1}:
    raise Exception("Unknown labeling!")
T_train,T_test, y_train,y_test = train_test_split(revs,lbls)
y_train, y_test = np.array(y_train),np.array(y_test)
clf = LinearSVC()
tf = text.TfidfVectorizer()
X_train = tf.fit_transform(T_train)


clf.fit(X_train,  y_train)
X_test = tf.transform(T_test)

In [87]:
# Predict labels for the test set; they are needed to find the errors below
predictions = clf.predict(X_test)
# When given one argument (the condition) np.where(condition)
# returns a 1-tuple whose first member is the array of indexes 
# satisfying the condition
fn_idxs = np.where((y_test > 0) & (predictions < 0))[0] 
fp_idxs = np.where((y_test < 0) & (predictions > 0))[0]

# Take an example of each type
fn_idx = fn_idxs[0]
fp_idx = fp_idxs[0]

In [150]:
train_scores = clf.decision_function(X_train)
lbls_svm_train = clf.predict(X_train)
# Guarantee margins are positive
functional_margins = lbls_svm_train * train_scores

test_scores = clf.decision_function(X_test)
predictions = clf.predict(X_test)
test_margins = predictions * test_scores

# Margins are all positive so this does what we want
cmax_idx, cmin_idx = functional_margins.argmax(),functional_margins.argmin()
cmax, cmin= functional_margins[cmax_idx],functional_margins[cmin_idx]

def get_confidence_level(idx,test_margins,cmax,cmin):
    """
    Note that this function is vectorized, like a NumPy ufunc.  That is, idx can be an array
    of idxs and an array of confidence values will be returned.
    """
    return (test_margins[idx]-cmin)/(cmax-cmin)

Both the fp item and the fn item  were low-confidence predictions, but the false positive
prediction is significantly more confident.

In [151]:
for (tp,idx) in [("fp",fp_idx),("fn",fn_idx)]:
    print(f"{idx} {tp} {get_confidence_level(idx,test_margins,cmax,cmin):.2f}")

7 fp 0.13
5 fn 0.01


On average, the fns and fps are low-confidence, suggesting confidence is not generally misplaced.

In [152]:
fn_conf_mn = get_confidence_level(fn_idxs,test_margins,cmax,cmin).mean()
print(f"Avg fn margin: {fn_conf_mn:.2f}")


fp_conf_mn = get_confidence_level(fp_idxs,test_margins,cmax,cmin).mean()
print(f"Avg fp margin: {fp_conf_mn:.2f}")

Avg fn margin: 0.14
Avg fp margin: 0.11


And the largest misplaced confidence levels are significantly less than 0.5.

In [153]:
print(f"{len(fn_idxs)} fn items; max conf: {get_confidence_level(fn_idxs,test_margins,cmax,cmin).max():.2f}")
print(f"{len(fp_idxs)} fp items; max conf: {get_confidence_level(fp_idxs,test_margins,cmax,cmin).max():.2f}")

37 fn items; max conf: 0.39
47 fp items; max conf: 0.34
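
The confidence level used in this section is a min-max rescaling of a test margin against the range of margins seen in training. A minimal sketch with hypothetical margin values (the notebook computes the real ones from `decision_function`):

```python
import numpy as np

# Hypothetical margins, for illustration only.
train_margins = np.array([0.2, 0.6, 1.0, 1.8])
test_margins = np.array([0.2, 1.0, 2.2])

cmin, cmax = train_margins.min(), train_margins.max()
# 0 means "as unsure as the least confident training item",
# 1 means "as sure as the most confident training item".
confidence = (test_margins - cmin) / (cmax - cmin)
print(confidence)   # roughly 0, 0.5 and 1.25
```

Because the scale is anchored on the training margins, a test item can score above 1 (or below 0) when its margin falls outside the training range.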


### Addendum  (not part of the assignment):  Demonstrating use of `get_topn` on a multiclass nonlinguistic classification problem

In [69]:
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, recall_score, precision_score
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score
from sklearn import linear_model
from sklearn.svm import LinearSVC

data = load_iris()
features = data['data']
feature_names = np.array(data['feature_names'])
target = data['target']

In [251]:
from sklearn.metrics import accuracy_score
from sklearn import linear_model
# Rerun prediction on just the training data.
X = features[:,:]
Y = target
# we create an instance of a Logistic Regression Classifier.
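# multi_class='auto' is omitted: it is the default behavior, and the
# parameter is deprecated in recent scikit-learn releases.
logreg = linear_model.LogisticRegression(C=1e5,solver='lbfgs')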
#logreg = LinearSVC(C=1e5)

logreg.fit(X, Y)
predicted = logreg.predict(X)

Here's what the weights on many sklearn classifiers will look like on
a multiclass problem.  

There is one weight vector for each iris class, with weights for four features.
That yields a 3x4 array:

In [252]:
print(list(feature_names))
logreg.coef_

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


array([[  3.93510528,   9.17853868, -12.37698229,  -5.92989148],
       [ -0.73525102,  -1.25242607,   1.47764768,  -6.16849281],
       [ -3.19985426,  -7.92611261,  10.89933461,  12.09838429]])

It turns out each class has a different "Most important feature":

In [253]:
feat_array = get_topn(logreg,top_n=4,feature_names=feature_names)

0: sepal width (cm) sepal length (cm) petal width (cm) petal length (cm)

1: petal length (cm) sepal length (cm) sepal width (cm) petal width (cm)

2: petal width (cm) petal length (cm) sepal length (cm) sepal width (cm)



In [20]:
from random import shuffle
pos_pairs = [(review, 'pos') for review in pos_file_reviews]
neg_pairs = [(review, 'neg') for review in neg_file_reviews]
review_data = pos_pairs + neg_pairs
shuffle(review_data)
reviews, labels = zip(*review_data)
#print(labels)
df_data = [reviews, labels]
df = pd.DataFrame.from_records(df_data, index=["Review","Opinion"]).transpose()

In [24]:
pd.DataFrame({"Review":reviews,"Opinion":labels})

Unnamed: 0,Review,Opinion
0,bruce barth's mellow piano plays in the backgr...,pos
1,almost a full decade before steven spielberg's...,pos
2,adam sandler vehicles are never anything speci...,neg
3,""" the red violin "" is a cold , sterile featur...",neg
4,the calendar year has not even reached its mid...,pos
...,...,...
1995,""" the endurance : shackleton's legendary anta...",pos
1996,i've never written a review for a movie i have...,neg
1997,"edward zwick's "" the siege "" raises more quest...",pos
1998,how do films like mouse hunt get into theatres...,neg


In [22]:
df

Unnamed: 0,Review,Opinion
0,bruce barth's mellow piano plays in the backgr...,pos
1,almost a full decade before steven spielberg's...,pos
2,adam sandler vehicles are never anything speci...,neg
3,""" the red violin "" is a cold , sterile featur...",neg
4,the calendar year has not even reached its mid...,pos
...,...,...
1995,""" the endurance : shackleton's legendary anta...",pos
1996,i've never written a review for a movie i have...,neg
1997,"edward zwick's "" the siege "" raises more quest...",pos
1998,how do films like mouse hunt get into theatres...,neg
