## Classifying text

In [2]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
%matplotlib inline

We turn to applying machine learning classification methods to text. No new
principles are at stake: everything is the same as it was for
learning how to classify irises.

1.  We need to find labeled data; each of the exemplars in the data should be represented with a fixed set of features.  
2. We need to split our data into training and test data.  
3. We need to train a learner on the training data and evaluate it (test it) on the test data.

The problem is that text data is not in a form that is compatible with
what we have learned about classifiers.  The text must be put in a suitable
form before a linear model can be trained on it.

**Training**

1.  Labeled data must be loaded (into Python).  It should be a sequence of documents T accompanied by a sequence of labels L.
2.  Split T and L into training and test groups, yielding T1 and T2 as well as L1 and L2.
3.  Train a **feature model** on the training data T1 (or, in scikit-learn terminology, **fit** the model **to** the training data).  The feature model takes the text sequence as input and outputs a **term-document** matrix suitable for training a linear classifier.  The feature model is called a **vectorizer**
(because it turns a document into a vector, a column of numbers).
4.  Using the trained vectorizer, transform T1 into a term-document matrix M1.
5.  Train a linear model $\mu$ on M1 and L1.
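
The difference between fitting a vectorizer and transforming with it can be illustrated without scikit-learn. What follows is a minimal sketch, not how `TfidfVectorizer` is implemented: the hypothetical class `ToyCountVectorizer` learns a vocabulary in `fit` and counts only those words in `transform`, so a word first seen at transform time gets no column. It uses raw counts rather than TF-IDF weights, and the example documents are made up.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a vectorizer: raw word counts, no TF-IDF."""

    def fit(self, docs):
        # Learn the vocabulary: one column index per distinct word.
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # One row per document, one column per vocabulary word.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:   # words unseen during fit are ignored
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows


T1 = ["you are a fool", "you are kind"]   # made-up training documents
T2 = ["you are a genius"]                 # 'genius' never occurs in T1

vec = ToyCountVectorizer().fit(T1)
M1 = vec.transform(T1)
M2 = vec.transform(T2)
print(sorted(vec.vocabulary_))   # ['a', 'are', 'fool', 'kind', 'you']
print(M1)                        # [[1, 1, 1, 0, 1], [0, 1, 0, 1, 1]]
print(M2)                        # [[1, 1, 0, 0, 1]]
```

The row for the test document has no entry for `genius`, because the columns were fixed when the vectorizer was fit to `T1`.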

**Evaluation**

1.  Transform the test data T2 into a term-document matrix M2 using the vectorizer fit during step 3 of training; in particular, this means that if there are words in the T2 data that were never seen during training, they are ignored in building M2.
2.  Use $\mu$ to classify the texts represented in M2; that is, produce a set of predicted labels P2.
3.  Compare the actual labels L2 with the predicted labels P2 using standard evaluation metrics such as precision, accuracy, and recall.
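
The three metrics named in the last step can be computed directly from the actual and predicted labels. The following is a minimal sketch in plain Python with made-up labels, treating 1 as the positive class. The helper name `basic_metrics` is hypothetical; in practice you would use scikit-learn's `accuracy_score`, `precision_score`, and `recall_score`.

```python
def basic_metrics(actual, predicted, positive=1):
    """Return (accuracy, precision, recall) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)

    accuracy = correct / len(pairs)
    # Precision: of the items flagged positive, how many really are positive?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the truly positive items, how many were flagged?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


L2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up actual labels
P2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # made-up predicted labels

acc, prec, rec = basic_metrics(L2, P2)
print(f'Accuracy: {acc:.2f} Precision: {prec:.2f} Recall: {rec:.2f}')
# Accuracy: 0.70 Precision: 1.00 Recall: 0.25
```

In this toy case the accuracy looks respectable even though three of the four positive items were missed, which is why the three numbers are reported together.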


## Review the steps with insult detection

We looked at the insult detection data in  the text classification notebook.

### Training step 1: Loading the data

Let's load the CSV file.

In [33]:
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment  taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the comment.

In [34]:
df.tail()

Unnamed: 0,Insult,Date,Comment
3942,1,20120502172717Z,"""you are both morons and that is never happening"""
3943,0,20120528164814Z,"""Many toolbars include spell check, like Yahoo..."
3944,0,20120620142813Z,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,20120528205648Z,"""How about Felix? He is sure turning into one ..."
3946,0,20120515200734Z,"""You're all upset, defending this hipster band..."


Now we define the text sequences $\mathbf{T}$ and the label sequence  $\mathbf{L}$.

In [35]:
T = df['Comment']

In [36]:
L = df['Insult']

### Training step 2: Splitting the data and labels into training and test groups

In [37]:
T1, T2, L1, L2 = train_test_split(T,L)

### Training steps 3 and 4: Fitting the feature model (vectorizer) to the training data and transforming it

In [38]:
tf = text.TfidfVectorizer()
# Scikit learn has one function that does both fitting and transforming.
# M1 is the transformed data
# tf is the trained feature model (which will be used to transform the test data)
M1 = tf.fit_transform(T1)#.toarray()

### Training step 5: Training the classifier

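Now we are going to train a classifier as usual. The data has already been split into training and test sets and the training set has been vectorized, so we can fit the classifier directly to M1 and L1.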

We use a **Bernoulli Naive Bayes classifier**.

In [39]:
# Create the classifier
bnb = nb.BernoulliNB()
#bnb = nb.MultinomialNB()
#bnb = nb.GaussianNB()

# Fit (train) the classifier  using the training data and labels
bnb.fit(M1, L1);

### Evaluation

Evaluate the classifier, first using accuracy (what `.score()` returns).

In [40]:
# Vectorize the test data using the vectorizer trained on T1.
# Notice we DON'T call .fit_transform(), because that would retrain the vectorizer on the test data.
# We call .transform(), which uses the trained model to transform the new data.
# Words not seen during training will be ignored.
M2 = tf.transform(T2)#.toarray()
# Classify the data using the trained classifier and report the accuracy
bnb.score(M2, L2)

0.7882472137791287

Now try re-executing steps 2 through 5.  (Just re-execute the cells)  The results should be the same, right?

Well, are they?  

What happens: each train/test split produces a different set of test data.  Sometimes the test is harder.
Sometimes it's easier.  Or, looking at it another way: sometimes the training data is a better preparation for the test than at other times.  

To get a realistic view of how our classifier is doing we take the average performance on a  number of
train/test splits.  This is called **cross validation**.  We return to that point below.
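
Averaging over repeated random splits can also be delegated to scikit-learn. The following is a minimal sketch, assuming scikit-learn is installed: `ShuffleSplit` generates the random train/test splits, and `make_pipeline` makes sure the vectorizer is refit on the training portion of each split only. The tiny corpus is made up so that the example runs on its own; it is not the insult data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 1 = insult, 0 = not an insult.
docs = ["you are an idiot", "what a moron", "you fool", "idiot and moron",
        "have a nice day", "thanks for the link", "great point", "nice game"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# Vectorizer and classifier are fit together on each training split.
model = make_pipeline(TfidfVectorizer(), BernoulliNB())

# Ten random splits, each holding out a quarter of the data for testing.
splits = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
metrics = ["accuracy", "precision", "recall"]
results = cross_validate(model, docs, labels, cv=splits, scoring=metrics)

for metric in metrics:
    print(f"{metric}: {results[f'test_{metric}'].mean():.2f}")
```

Strictly speaking, k-fold cross validation partitions the data into k folds and tests on each fold once; repeated random splitting, as here, is a closely related way of averaging over several train/test splits.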

#### Using all three evaluation metrics

First let's get more evaluation numbers, in particular precision and recall.  We do
that by calling a method that returns the predicted labels P2, so we can compare
L2 and P2 using different evaluation metrics.

In [41]:
P2 = bnb.predict(M2)
# The scikit-learn metric functions take the true labels first, then the predictions
scores = np.array([accuracy_score(L2, P2),
                   precision_score(L2, P2),
                   recall_score(L2, P2)])
print(f'Accuracy: {scores[0]:.2f} Precision: {scores[1]:.2f} Recall: {scores[2]:.2f}')

Accuracy: 0.79 Precision: 0.92 Recall: 0.15


We see that the accuracy is a bit misleading.  There is a serious recall problem.

What does that mean in the setting of insult detection?  It means the BNB classifier is too
reluctant to call something an insult.  When it flags something as an insult, it
is right 92% of the time, but it finds only 15% of the insults in the test data.

Why would that be?  Think about how the model is trained and what its weakness might be.
This is what it means to try to interpret or discuss a model's performance.  Zoom
in on the model's weakness. Talk about where that weakness comes from.

#### Basic train and test loop

How to get the average of a number of runs.
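
A minimal sketch of such a loop, assuming scikit-learn is installed. The function name `average_scores` and its parameters are hypothetical (this is not the course's `split_fit_and_eval`), and the tiny corpus is made up so that the example runs on its own; with the insult data you would pass `T` and `L` instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB


def average_scores(docs, labels, num_runs=10, test_size=0.25):
    """Average accuracy, precision, and recall over several random splits."""
    results = []
    for run in range(num_runs):
        # A different seed on each run gives a different train/test split.
        T1, T2, L1, L2 = train_test_split(
            docs, labels, test_size=test_size, random_state=run)
        vectorizer = TfidfVectorizer()
        M1 = vectorizer.fit_transform(T1)   # fit on the training data only
        M2 = vectorizer.transform(T2)       # reuse the fitted vocabulary
        clf = BernoulliNB().fit(M1, L1)
        P2 = clf.predict(M2)
        # True labels first, predictions second.
        results.append([accuracy_score(L2, P2),
                        precision_score(L2, P2, zero_division=0),
                        recall_score(L2, P2, zero_division=0)])
    return np.mean(results, axis=0)


# Made-up toy corpus: 1 = insult, 0 = not an insult.
docs = ["you are an idiot", "what a moron", "you fool", "idiot and moron",
        "have a nice day", "thanks for the link", "great point", "nice game"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

acc, prec, rec = average_scores(docs, labels)
print(f'Accuracy: {acc:.2f} Precision: {prec:.2f} Recall: {rec:.2f}')
```

Refitting the vectorizer inside the loop matters: each run must learn its vocabulary from that run's training documents only, so that the test documents stay unseen.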

## Homework

Read the online book draft chapter about text classification, especially the part
about insult detection. Focus on the use of scikit-learn, especially the
`TfidfVectorizer`.

Try two different classifiers on the movie review data: the one used in the textbook, an SVM called
`LinearSVC`, and the Bernoulli Naive Bayes model used above. Some points of emphasis:

1.  Be sure to get the average of at least 10 runs for **both** classifiers.
2.  Be sure to get average accuracy, precision, and recall for both classifiers on those multiple runs. You will probably find `split_fit_and_eval` defined above useful, but you will need to modify it.
3.  For your first discussion post, turn in the new code you wrote, including the code that labels and shuffles the data (discussed further below).  If you have to do a new import, show that. If you have to rewrite `split_fit_and_eval`, turn in the new version.  Also show the output, which should be a single line giving the accuracy, precision, and recall.
4.  Discuss which classifier does better.  Discuss which metric the best classifier does the worst at and speculate as to why (this will require reviewing the definitions of precision and recall and thinking about what they mean in a movie review setting).
5. Using the SVM classifier and training on **all** the data find the 50 most important Positive features for the Movie Reviews Data.  They should differ significantly from the most important features in Insult Detection. The function `print_topn` (from the Insult Detection Notebook) should be of help.
6.  Find the 100 most important Negative features for the Movie Reviews Data. Note that the way two-class problems work with SVMs, there is only one set of weights to look at, so it won't work to pass more than one class name to the `class_labels` parameter of `print_topn` (it would work with a Naive Bayes classifier).  In particular: you need the **lowest** weighted features if you want to look at the more fun word set that best characterizes bad reviews. You will have to modify `print_topn` (from the Insult Detection Notebook) to do that. Try to do so in such a way that with one set of parameters it prints the most positive words, and with another, it prints the most negative words. You may notice the names of a few actors appearing in this feature set. Try not to laugh as the meaning of this dawns on you.  For an extra bit of approval from your instructor, while you're at it, modify it so that it returns a list of words in addition to printing them.  You should probably change the name of the function to `get_topn` if you succeed.

#### Help with getting the movie reviews data.

Execute the next two cells to get the movie review data.

In [99]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [100]:
from nltk.corpus import movie_reviews as mr

def get_file_strings (corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))

pos_file_ids = data['pos']
neg_file_ids = data['neg']

# Get all the positive and negative reviews.
pos_file_reviews = get_file_strings (mr, pos_file_ids)
neg_file_reviews = get_file_strings (mr, neg_file_ids)

# Organize reviews as tuples of (document, label)
pos_documents = [(review, 'pos') for review in pos_file_reviews]
neg_documents = [(review, 'neg') for review in neg_file_reviews]

# Combine positive and negative reviews
all_documents = pos_documents + neg_documents



Each review is a string.  In principle, a list of strings like `pos_file_reviews`  can be passed to `text.TfidfVectorizer()` via the `fit_transform` method to train a vectorizer for machine learning.
You could code that up.

What you'd really like to do is use `split_fit_and_eval`, defined above, which does a lot of the work for you.

But hold on. You have a coding problem. You don't have  a sequence of documents and labels.  Instead you have
one sequence of positive documents  and another sequence of negative documents.  

So you will need to turn those two sequences into a sequence of documents and a sequence of labels
because that's what `split_fit_and_eval` wants.  You also want the doc sequence
to contain a random mixture of positive and negative documents, because some machine
learning algorithms are sensitive to the order in which training data is presented to
them.

The next cell does **not** do that for you.  But it illustrates an approach using
two sets of English letters in place of two sets of English documents.

In [25]:
# Let's work on letters instead of documents.
# There are 2 classes: letters from the first half of the
# alphabet ('f') and letters from the last half ('l')

from random import shuffle
from string import ascii_lowercase

#Class 1 of the letters: the f_lets
f_lets = ascii_lowercase[:13]
print(f_lets)
#Class2 of the letters: the l_lets
l_lets = ascii_lowercase[13:]
print(l_lets)

# Now get pairs of letters and labels
f_pairs = [(let,'f') for let in f_lets]
l_pairs = [(let,'l') for let in l_lets]

###########  Shuffling  ###########################
# Way too orderly; the classes aren't mixed yet.
data = f_pairs + l_pairs
shuffle(data)
###################  Now they're shuffled! ###############

# Separate the letters from their labels
lets, lbls = zip(*data)
print(lets)
print(lbls)

abcdefghijklm
nopqrstuvwxyz
('p', 'r', 't', 'j', 'u', 'k', 'c', 'o', 'w', 's', 'v', 'x', 'n', 'g', 'l', 'd', 'z', 'a', 'b', 'i', 'q', 'e', 'f', 'm', 'y', 'h')
('l', 'l', 'l', 'f', 'l', 'f', 'f', 'l', 'l', 'l', 'l', 'l', 'l', 'f', 'f', 'f', 'l', 'f', 'f', 'f', 'l', 'f', 'f', 'f', 'l', 'f')


In [98]:
import numpy as np
from sklearn.feature_extraction import text
import sklearn.svm as svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import random
from nltk.corpus import movie_reviews as mr

def get_file_strings(corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

def load_reviews():
    data = dict(pos=mr.fileids('pos'), neg=mr.fileids('neg'))

    pos_file_ids = data['pos']
    neg_file_ids = data['neg']

    # Get all the positive and negative reviews.
    pos_file_reviews = get_file_strings(mr, pos_file_ids)
    neg_file_reviews = get_file_strings(mr, neg_file_ids)

    # Organize reviews as tuples of (document, label)
    pos_documents = [(review, 'pos') for review in pos_file_reviews]
    neg_documents = [(review, 'neg') for review in neg_file_reviews]

    # Combine positive and negative reviews
    all_documents = pos_documents + neg_documents

    return all_documents

# Load reviews as (document, label) tuples
documents = load_reviews()

def load_and_shuffle_reviews():
    documents = load_reviews()
    random.shuffle(documents)
    return zip(*documents)

# Load and shuffle reviews
X, y = load_and_shuffle_reviews()

# Split the data into training and testing sets (you can adjust the test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a TfidfVectorizer and transform the training data
tf = TfidfVectorizer()
X_train_tfidf = tf.fit_transform(X_train)

# Train a LinearSVC classifier
est = svm.LinearSVC()
est.fit(X_train_tfidf, y_train)



In [106]:
def print_top_features(vectorizer, clf, class_label, top_n=10):
    feature_names = np.array(vectorizer.get_feature_names_out())

    try:
        class_index = np.where(clf.classes_ == class_label)[0][0]
    except IndexError:
        print(f"Class {class_label} not found in classifier classes.")
        return

    try:
        if clf.coef_.ndim == 1:
            coefficients = clf.coef_
        else:
            coefficients = clf.coef_[0]

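        # Combine feature names with their coefficients
        feature_coef_pairs = list(zip(feature_names, coefficients))
        # Sort by coefficient values, lowest first
        sorted_feature_coef_pairs = sorted(feature_coef_pairs, key=lambda x: x[1])

        # Take the top_n highest-weighted features. Note that class_label is
        # not used in this selection: a two-class linear model has a single
        # weight vector, and its highest weights favor clf.classes_[1]
        # whichever label is passed in.
        top_features = sorted_feature_coef_pairs[-top_n:]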
        top_feature_names = [feat for feat, _ in top_features]
        print(f"Top {top_n} features for class {class_label}:\n{', '.join(top_feature_names)}")
    except Exception as e:
        print(f"Error printing top features: {e}")
        print(f"Classes: {clf.classes_}")
        print(f"Coefficients shape: {clf.coef_.shape}")
        print(f"Class index: {class_index}")

# Print the top 50 features for the positive class
if len(est.classes_) == 2 and 'pos' in est.classes_:
    print_top_features(tf, est, class_label='pos', top_n=50)
else:
    print("Expected a two-class classifier with a 'pos' class.")

# Call again with class_label='neg' and top_n=100
if len(est.classes_) == 2 and 'neg' in est.classes_:
    print_top_features(tf, est, class_label='neg', top_n=100)
else:
    print("Expected a two-class classifier with a 'neg' class.")


Top 50 features for class pos:
kits, grunted, hartman, bankability, droagon, banisters, flings, terribly, decaying, brake, showiest, ahabs, rossellini, crowns, lyrical, larroquette, trepidation, sputtering, leoni, tiara, 04, sweat, allegory, mongo, blonde, negligent, rounded, grace, negotiating, staple, profit, stanley, alarming, imbibed, threads, recidivist, arachnids, sloppiness, humorously, peasant, tackles, krueger, hy, jaoui, reclaims, eileen, freshener, ferociously, glacier, amalgamations
Top 100 features for class neg:
carrion, mortally, amiel, shindler, tit, allegorically, risqueness, exasperating, arc, sophia, neice, mael, chucklesome, medak, proclaimed, swoop, mnc, pettiford, educating, graphically, smearing, torrential, fluent, daddy, onanism, discounted, airs, invokes, driving, interdimensional, mitch, _two_, beart, productions, nehru, mai, treated, exit, brushes, prevue, beleiveable, parr, dirty, torrent, ensemble, scratching, penny, enlistees, mosaic, coward, kits, grunte

Looks like the reviewers tend not to like movies that are too prudish and want movies that make them think and feel smart. Eileen shows up on both sides. I see Stanley, which I assume is the famous director Stanley Kubrick.
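
For a two-class linear model such as `LinearSVC`, `coef_` holds a single row of weights, and positive weights push a document toward the second label in `classes_`. Taking the last `top_n` entries of an ascending sort therefore always gives the features for that second class; the features that characterize the other class are the ones with the **lowest** weights. The following is a minimal sketch of that selection logic. The helper `get_topn` is hypothetical and, to keep the example self-contained, takes plain sequences with made-up weights rather than a fitted vectorizer and classifier.

```python
def get_topn(feature_names, weights, n=10, most_positive=True):
    """Return the n features with the highest (or lowest) weights.

    Hypothetical helper: takes plain sequences rather than a fitted
    vectorizer and classifier.
    """
    # Sort (weight, name) pairs in ascending order of weight.
    ranked = sorted(zip(weights, feature_names))
    if most_positive:
        chosen = ranked[-n:][::-1]   # highest weights, strongest first
    else:
        chosen = ranked[:n]          # lowest weights, most negative first
    return [name for _, name in chosen]


# Made-up feature names and weights standing in for a fitted model.
names = ["awful", "boring", "fine", "great", "superb"]
coefs = [-1.4, -0.9, 0.1, 1.1, 1.6]

print(get_topn(names, coefs, n=2))                       # ['superb', 'great']
print(get_topn(names, coefs, n=2, most_positive=False))  # ['awful', 'boring']
```

With the objects from the cells above, the inputs would be `tf.get_feature_names_out()` and `est.coef_[0]`.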

In [103]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
import numpy as np

# Function to train and evaluate BernoulliNB classifier
def train_and_evaluate_bernoulli(X_train, X_test, y_train, y_test):
    # Create a TfidfVectorizer instance
    vectorizer = TfidfVectorizer()

    # Fit and transform the training data
    X_train_tfidf = vectorizer.fit_transform(X_train)

    # Transform the test data using the same vectorizer
    X_test_tfidf = vectorizer.transform(X_test)

    # Train BernoulliNB classifier
    clf = BernoulliNB()
    clf.fit(X_train_tfidf, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(X_test_tfidf)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    return accuracy, precision, recall

# Repeat the process 10 times
num_repeats = 10
accuracies_bernoulli = []
precisions_bernoulli = []
recalls_bernoulli = []

for i in range(num_repeats):
    # Load and shuffle reviews
    X, y = load_and_shuffle_reviews()

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)

    # Train and evaluate the BernoulliNB classifier
    accuracy_bernoulli, precision_bernoulli, recall_bernoulli = train_and_evaluate_bernoulli(X_train, X_test, y_train, y_test)
    accuracies_bernoulli.append(accuracy_bernoulli)
    precisions_bernoulli.append(precision_bernoulli)
    recalls_bernoulli.append(recall_bernoulli)

    print(f'Repeat {i+1}: Accuracy (BernoulliNB) = {accuracy_bernoulli:.2f}, Precision = {precision_bernoulli:.2f}, Recall = {recall_bernoulli:.2f}')

# Display the average accuracy, precision, and recall over 10 repeats for BernoulliNB
average_accuracy_bernoulli = np.mean(accuracies_bernoulli)
average_precision_bernoulli = np.mean(precisions_bernoulli)
average_recall_bernoulli = np.mean(recalls_bernoulli)

print(f'\nAverage Accuracy (BernoulliNB) over {num_repeats} repeats: {average_accuracy_bernoulli:.2f}')
print(f'Average Precision (BernoulliNB) over {num_repeats} repeats: {average_precision_bernoulli:.2f}')
print(f'Average Recall (BernoulliNB) over {num_repeats} repeats: {average_recall_bernoulli:.2f}')


Repeat 1: Accuracy (BernoulliNB) = 0.80, Precision = 0.81, Recall = 0.80
Repeat 2: Accuracy (BernoulliNB) = 0.76, Precision = 0.78, Recall = 0.76
Repeat 3: Accuracy (BernoulliNB) = 0.76, Precision = 0.79, Recall = 0.76
Repeat 4: Accuracy (BernoulliNB) = 0.78, Precision = 0.79, Recall = 0.78
Repeat 5: Accuracy (BernoulliNB) = 0.79, Precision = 0.82, Recall = 0.79
Repeat 6: Accuracy (BernoulliNB) = 0.79, Precision = 0.80, Recall = 0.79
Repeat 7: Accuracy (BernoulliNB) = 0.80, Precision = 0.81, Recall = 0.80
Repeat 8: Accuracy (BernoulliNB) = 0.81, Precision = 0.82, Recall = 0.81
Repeat 9: Accuracy (BernoulliNB) = 0.80, Precision = 0.82, Recall = 0.80
Repeat 10: Accuracy (BernoulliNB) = 0.82, Precision = 0.83, Recall = 0.82

Average Accuracy (BernoulliNB) over 10 repeats: 0.79
Average Precision (BernoulliNB) over 10 repeats: 0.81
Average Recall (BernoulliNB) over 10 repeats: 0.79


In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
import numpy as np

# Function to train and evaluate SVC classifier
def train_and_evaluate_svc(X_train, X_test, y_train, y_test):
    # Create a TfidfVectorizer instance
    vectorizer = TfidfVectorizer()

    # Fit and transform the training data
    X_train_tfidf = vectorizer.fit_transform(X_train)

    # Transform the test data using the same vectorizer
    X_test_tfidf = vectorizer.transform(X_test)

    # Train SVC classifier
    clf = SVC()
    clf.fit(X_train_tfidf, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(X_test_tfidf)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    return accuracy, precision, recall

# Repeat the process 10 times
num_repeats = 10
accuracies_svc = []
precisions_svc = []
recalls_svc = []

for i in range(num_repeats):
    # Load and shuffle reviews
    X, y = load_and_shuffle_reviews()

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)

    # Train and evaluate the SVC classifier
    accuracy_svc, precision_svc, recall_svc = train_and_evaluate_svc(X_train, X_test, y_train, y_test)
    accuracies_svc.append(accuracy_svc)
    precisions_svc.append(precision_svc)
    recalls_svc.append(recall_svc)

    print(f'Repeat {i+1}: Accuracy (SVC) = {accuracy_svc:.2f}, Precision = {precision_svc:.2f}, Recall = {recall_svc:.2f}')

# Display the average accuracy, precision, and recall over 10 repeats for SVC
average_accuracy_svc = np.mean(accuracies_svc)
average_precision_svc = np.mean(precisions_svc)
average_recall_svc = np.mean(recalls_svc)

print(f'\nAverage Accuracy (SVC) over {num_repeats} repeats: {average_accuracy_svc:.2f}')
print(f'Average Precision (SVC) over {num_repeats} repeats: {average_precision_svc:.2f}')
print(f'Average Recall (SVC) over {num_repeats} repeats: {average_recall_svc:.2f}')


Repeat 1: Accuracy (SVC) = 0.86, Precision = 0.86, Recall = 0.86
Repeat 2: Accuracy (SVC) = 0.86, Precision = 0.86, Recall = 0.86
Repeat 3: Accuracy (SVC) = 0.86, Precision = 0.86, Recall = 0.86
Repeat 4: Accuracy (SVC) = 0.80, Precision = 0.80, Recall = 0.80
Repeat 5: Accuracy (SVC) = 0.83, Precision = 0.83, Recall = 0.83
Repeat 6: Accuracy (SVC) = 0.82, Precision = 0.82, Recall = 0.82
Repeat 7: Accuracy (SVC) = 0.81, Precision = 0.82, Recall = 0.81
Repeat 8: Accuracy (SVC) = 0.81, Precision = 0.81, Recall = 0.81
Repeat 9: Accuracy (SVC) = 0.83, Precision = 0.84, Recall = 0.83
Repeat 10: Accuracy (SVC) = 0.84, Precision = 0.85, Recall = 0.84

Average Accuracy (SVC) over 10 repeats: 0.83
Average Precision (SVC) over 10 repeats: 0.83
Average Recall (SVC) over 10 repeats: 0.83


Based on the above results, the SVC was more accurate and had better precision and recall than Bernoulli Naive Bayes. That means it more often predicted correctly whether a review was positive or negative, and it produced fewer false positives and false negatives over the 10 repeats. This is likely because an SVM is a more complex model than Naive Bayes, which is simpler and faster to compute. The SVC also has the advantage that it can be used for more than just binary data and copes with higher dimensionality, but its higher computing cost and running time are something to consider, and for this data the small difference suggests that the improved accuracy may not be worth it.