# Logistic Regression for Text Categorization

In this document, we experiment with the Logistic Regression algorithm for text classification, using the scikit-learn (sklearn) framework.

For binary classification, we reuse the sentiment classification data. For multi-class classification, we use the 20 newsgroups dataset, which is downloaded automatically and then cached.

## Binary classification

We download the data set as the first step.


In [None]:
!rm -f sentiment.txt
!wget https://raw.githubusercontent.com/minhpqn/nlp_100_drill_exercises/master/data/sentiment.txt

--2020-03-15 04:15:42--  https://raw.githubusercontent.com/minhpqn/nlp_100_drill_exercises/master/data/sentiment.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270444 (1.2M) [text/plain]
Saving to: ‘sentiment.txt’


2020-03-15 04:15:45 (16.2 MB/s) - ‘sentiment.txt’ saved [1270444/1270444]



### Load data

We will load data into a list of sentences with their labels.

In [None]:
import re


def load_data(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if line == "":
                continue
            match = re.search(r"(\+1|-1)[\s\t]+(.+)$", line)  # match the line +1 ...
            if match:
                lb = match.group(1)
                sentence = match.group(2)
                if sentence == "":
                    continue
                data.append((sentence,lb))
    return data
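
To make the expected line format concrete, here is a minimal sketch that applies an equivalent regular expression to a few made-up lines. The helper name `parse_line` and the sample sentences are assumptions for illustration; they are not part of `sentiment.txt` or of the loader above.

```python
import re

# Equivalent to the pattern in load_data: a "+1" or "-1" label,
# some whitespace, then the sentence up to the end of the line.
LINE_PATTERN = re.compile(r"(\+1|-1)\s+(.+)$")


def parse_line(line):
    """Return (sentence, label) for a well-formed line, or None otherwise."""
    match = LINE_PATTERN.search(line.strip())
    if match is None:
        return None
    return match.group(2), match.group(1)


# Hypothetical sample lines, for illustration only.
parsed_pos = parse_line("+1 a fun and clever film .")
parsed_neg = parse_line("-1 slow and forgettable .")
parsed_bad = parse_line("no label on this line")

print(parsed_pos)
print(parsed_neg)
print(parsed_bad)
```

Lines without a recognizable label yield `None` here, which mirrors how `load_data` silently skips them.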

We will use the above function to load sentiment data.

In [None]:
DATA_PATH = "./sentiment.txt"
data = load_data(DATA_PATH)

print("# Loaded {} examples".format(len(data)))

# Loaded 10662 examples


We also split data into training/test data.

In [None]:
import random
from sklearn.model_selection import train_test_split

data = load_data(DATA_PATH)
docs, labels = zip(*data)

train_docs, test_docs, train_labels, test_labels = train_test_split(docs, labels,
                                                                   test_size=0.2,
                                                                   random_state=42)
print("Training reviews: {}".format(len(train_docs)))
print("Test reviews: {}".format(len(test_docs)))

# Let's see some positive and negative documents in test data.
posi_docs = []
neg_docs = []
for d, lb in zip(test_docs, test_labels):
    if lb == "+1":
        posi_docs.append(d)
    else:
        neg_docs.append(d)

print("Random positive review")
print(random.choice(posi_docs))
print("Random negative review")
print(random.choice(neg_docs))

Training reviews: 8529
Test reviews: 2133
Random positive review
the story ultimately takes hold and grips hard .
Random negative review
first-time writer-director dylan kidd also has a good ear for dialogue , and the characters sound like real people .


### Using scikit-learn for feature extraction

We can use scikit-learn for [feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html). We use the bag-of-words representation: each document becomes a vector indexed by vocabulary terms. In scikit-learn, we can use `CountVectorizer` for count (or binary) features and `TfidfVectorizer` for tf-idf weights.

### Feature extraction with CountVectorizer



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)  # use binary (presence/absence) features
vectorizer

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Now, we fit the vectorizer object on the training data.

In [None]:
X_train = vectorizer.fit_transform(train_docs)
X_train.shape

(8529, 16530)

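Let's try the vectorizer's analyzer to get the bag-of-words tokens of a sentence. Note that the single-character token "a" is dropped, because the default `token_pattern` only keeps tokens of two or more characters.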

In [None]:
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.")

['this', 'is', 'text', 'document', 'to', 'analyze']
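
To see what the binary bag-of-words matrix looks like, the following sketch fits a separate vectorizer on two toy sentences. The names `toy_docs`, `toy_vectorizer`, and `toy_X` are hypothetical and chosen so that they do not overwrite the `vectorizer` fitted above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny, made-up documents (not from the sentiment data).
toy_docs = ["good good movie", "bad movie"]

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_X.toarray())
```

With `binary=True`, the repeated word "good" in the first document still produces a 1 rather than a 2, so each entry only records whether a term is present.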

### Text categorization with logistic regression

Now let's try text categorization with the [logistic regression implementation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in scikit-learn. See the [user guide](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for more details.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=500)
clf

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now, we fit the model on the training data.

In [None]:
clf.fit(X_train, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Evaluation on test set

Now let's evaluate the model on the test data.

In [None]:
X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

In [None]:
from sklearn import metrics

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.753398968588842


See the classification report:

In [None]:
print(metrics.classification_report(test_labels, test_preds))

              precision    recall  f1-score   support

          +1       0.76      0.74      0.75      1062
          -1       0.75      0.77      0.76      1071

    accuracy                           0.75      2133
   macro avg       0.75      0.75      0.75      2133
weighted avg       0.75      0.75      0.75      2133
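
The per-class precision and recall in such a report can be traced back to a confusion matrix. The sketch below uses a handful of made-up labels (`y_true` and `y_pred` are hypothetical, not the test predictions above) to show the relationship.

```python
from sklearn import metrics

# Hypothetical gold labels and predictions, for illustration only.
y_true = ["+1", "+1", "+1", "-1", "-1"]
y_pred = ["+1", "+1", "-1", "-1", "+1"]

# Rows are true classes, columns are predicted classes, both in this order.
label_order = ["+1", "-1"]
cm = metrics.confusion_matrix(y_true, y_pred, labels=label_order)

# Precision for "+1": correct "+1" predictions / all "+1" predictions.
precision_pos = metrics.precision_score(y_true, y_pred, pos_label="+1")
# Recall for "+1": correct "+1" predictions / all true "+1" examples.
recall_pos = metrics.recall_score(y_true, y_pred, pos_label="+1")

print(cm)
print(precision_pos, recall_pos)
```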



We can predict the label for an input review.

In [None]:
example = "a thoughtful , provocative , insistently humanizing film ."
test_x = vectorizer.transform([example])
print("Predicted class: {}".format(clf.predict(test_x)))

Predicted class: ['+1']
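
Because every prediction needs the same `transform` step first, it can be convenient to bundle the vectorizer and the classifier into a single scikit-learn pipeline. This is a minimal sketch on a made-up toy corpus; `toy_docs`, `toy_labels`, and `pipeline` are hypothetical names, and the design is one option rather than the approach used in the rest of this document.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set (two positive and two negative reviews).
toy_docs = [
    "good great film",
    "great fun",
    "bad boring film",
    "boring mess",
]
toy_labels = ["+1", "+1", "-1", "-1"]

# The pipeline vectorizes raw text and then classifies it.
pipeline = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(max_iter=500),
)
pipeline.fit(toy_docs, toy_labels)

# Raw strings go straight in; no separate transform() call is needed.
toy_prediction = pipeline.predict(["great film"])
print(toy_prediction)
```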


We can get prediction probabilities.

In [None]:
clf.predict_proba(test_x)

array([[0.85279457, 0.14720543]])

The first value is the probability that the instance belongs to the class "+1" and the second value is the probability that the instance belongs to the class "-1"; the column order follows `clf.classes_`. Let's try a negative review.

In [None]:
example2 = "for all its surface frenzy , high crimes should be charged with loitering -- so much on view , so little to offer ."
test_x2 = vectorizer.transform([example2])
clf.predict_proba(test_x2)

array([[0.14283192, 0.85716808]])

We can combine probability values with a threshold $t$ to customize our prediction. For instance, we can decide that the prediction is "+1" if the probability is greater than 0.6 instead of 0.5.
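
The thresholding idea can be sketched in a few lines of plain Python. The helper `predict_with_threshold` is a hypothetical name, not a scikit-learn function; the probability row is the one printed above for the first example, and the class order is assumed to match `clf.classes_`.

```python
def predict_with_threshold(proba_row, classes, positive_label="+1", threshold=0.5):
    """Return positive_label only if its probability exceeds the threshold.

    Assumes a binary problem: any other outcome maps to the remaining class.
    """
    classes = list(classes)
    pos_index = classes.index(positive_label)
    negative_label = [c for c in classes if c != positive_label][0]
    if proba_row[pos_index] > threshold:
        return positive_label
    return negative_label


# Class order as in clf.classes_ and the probabilities shown earlier.
classes = ["+1", "-1"]
proba_row = [0.85279457, 0.14720543]

label_t06 = predict_with_threshold(proba_row, classes, threshold=0.6)
label_t09 = predict_with_threshold(proba_row, classes, threshold=0.9)

print(label_t06)  # 0.85 > 0.6, so the review stays positive
print(label_t09)  # 0.85 <= 0.9, so a stricter threshold flips it
```

Raising the threshold trades recall on "+1" for precision, which can be useful when false positives are costly.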

### Get top features with the highest weights

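In this section, we look at the features with the highest weights. For a binary problem, `clf.coef_` has a single row, and positive weights push the prediction toward the second class in `clf.classes_` (here "-1"), so the top-weighted features are the ones most indicative of negative reviews.

First, we get all feature names from the vectorizer and define `target_names`.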



In [None]:
feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
target_names = ["+1", "-1"]
print(len(clf.coef_), clf.coef_)

1 [[ 0.08720718  0.05923897  0.05923897 ...  0.03480673 -0.00243007
   0.0372368 ]]


In [None]:
import numpy as np

topN = 50
print("top {} keywords:".format(topN))
top_idx = np.argsort(clf.coef_[0])[-topN:]  # indices of the topN largest weights, in ascending order
top_features = [feature_names[i] for i in top_idx]
print(" ".join(top_features))

top 50 keywords:
obvious junk stupid choppy tired has its none already plain premise video wrong awful unfortunately has all shallow plays like script jokes badly the worst pointless doesn ill thin suffers bland plodding unpleasant fails routine tv tedious generic pretentious mediocre bore mildly heavy flat neither only unfunny mess lacks worst bad boring too dull
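
The listed keywords are negative because, in a binary model, the single row of `coef_` belongs to the second entry of `classes_`. The sketch below, on a made-up toy corpus with hypothetical variable names, reads off the strongest feature in each direction by looking at both ends of the sorted weights.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny, made-up corpus: "great" occurs in both positive reviews,
# "boring" in both negative ones.
toy_docs = ["good great film", "great fun", "bad boring film", "boring mess"]
toy_labels = ["+1", "+1", "-1", "-1"]

toy_vec = CountVectorizer(binary=True)
toy_X = toy_vec.fit_transform(toy_docs)
toy_clf = LogisticRegression(max_iter=500).fit(toy_X, toy_labels)

# Invert vocabulary_ (term -> column index) into column index -> term.
index_to_term = {idx: term for term, idx in toy_vec.vocabulary_.items()}

# argsort is ascending: the last index has the largest weight (toward
# classes_[1]), the first index has the smallest (toward classes_[0]).
order = np.argsort(toy_clf.coef_[0])
top_for_second_class = index_to_term[order[-1]]
top_for_first_class = index_to_term[order[0]]

print(toy_clf.classes_)
print(top_for_second_class, top_for_first_class)
```

Applying the same idea to the real model, `np.argsort(clf.coef_[0])[:topN]` would give the features most indicative of "+1".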


### Try with tf-idf term weighting

Now, we use tf-idf term weighting for feature extraction.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(train_docs)

clf = LogisticRegression(solver='lbfgs')

clf.fit(X_train, train_labels)

X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.7510548523206751
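
To clarify what the `TfidfVectorizer` options do: `sublinear_tf=True` replaces the raw term frequency with `1 + log(tf)`, `stop_words='english'` drops common English function words, and `max_df=0.5` ignores terms that occur in more than half of the documents. Rows are L2-normalized by default. The sketch below checks two of these effects on a made-up corpus (`max_df` is left out because the toy corpus is too small for it to be meaningful); all variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny, made-up corpus, for illustration only.
toy_docs = [
    "the movie was great great great",
    "the movie was dull",
    "the plot was weak",
]

toy_tfidf = TfidfVectorizer(sublinear_tf=True, stop_words="english")
toy_X = toy_tfidf.fit_transform(toy_docs)

# With the default norm="l2", every non-empty row has unit Euclidean length.
row_norms = np.linalg.norm(toy_X.toarray(), axis=1)

print(sorted(toy_tfidf.vocabulary_))
print(row_norms)
```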


## Multiclass Text Classification

In this section, we do multiclass text classification with the 20 newsgroups dataset. It is downloaded automatically and then cached.

In [None]:
from sklearn.datasets import fetch_20newsgroups

remove = ('headers', 'footers', 'quotes')

data_train = fetch_20newsgroups(subset='train',
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test',
                               shuffle=True, random_state=42,
                               remove=remove)

y_train, y_test = data_train.target, data_test.target

In [None]:
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print()

11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)



### Feature Extraction

We will use TF-IDF features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

Let's try Logistic Regression with 'ovr' (one-vs-rest) strategy.

In [None]:
clf = LogisticRegression(solver='lbfgs', multi_class='ovr')
clf.fit(X_train, y_train)

Let's evaluate the results on the test set.

In [None]:
from sklearn import metrics

y_preds = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.6949017525225704
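
One-vs-rest trains one binary classifier per class and picks the class whose classifier is most confident. Newer scikit-learn releases deprecate the `multi_class` argument (this is worth checking against your installed version), so an alternative way to express the same strategy is the `OneVsRestClassifier` wrapper. The sketch below uses a made-up three-class corpus; the documents, targets, and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Tiny, made-up corpus with three topics (0, 1, 2).
toy_docs = [
    "the goalie saved the puck",
    "hockey playoffs start tonight",
    "the rocket reached orbit",
    "nasa launched a satellite",
    "the gpu renders graphics",
    "image rendering on the screen",
]
toy_targets = [0, 0, 1, 1, 2, 2]

toy_X = TfidfVectorizer().fit_transform(toy_docs)

# Wrap a binary logistic regression; one copy is fitted per class.
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr_clf.fit(toy_X, toy_targets)

n_binary_models = len(ovr_clf.estimators_)
toy_preds = ovr_clf.predict(toy_X)

print(n_binary_models)
print(toy_preds)
```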


Let's try multinomial Logistic Regression.

In [None]:
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We will test multinomial Logistic Regression on the test data.

In [None]:
y_preds = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.6946362187997875
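
The two strategies differ in how per-class scores become probabilities. Multinomial logistic regression applies a softmax over all class scores jointly, while one-vs-rest applies a sigmoid to each score independently and then renormalizes. The following plain-Python sketch illustrates this with made-up scores; it is a simplified picture under those assumptions, not scikit-learn's internal code.

```python
import math


def softmax(scores):
    """Jointly normalize scores into probabilities (multinomial model)."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Probability from a single binary classifier's score."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical decision scores for three classes.
scores = [2.0, 1.0, -1.0]

multinomial_probs = softmax(scores)

# One-vs-rest: independent sigmoids, then renormalize so they sum to one.
ovr_raw = [sigmoid(s) for s in scores]
ovr_probs = [p / sum(ovr_raw) for p in ovr_raw]

print(multinomial_probs)
print(ovr_probs)
```

Both variants rank the classes identically for a given score vector; in practice the difference comes from training, since the two strategies learn different scores.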


### SGDClassifier with log loss

We will use [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) with the logistic loss function, which trains a logistic regression model by stochastic gradient descent.

In [None]:
from sklearn.linear_model import SGDClassifier

# Note: scikit-learn >= 1.1 names this loss 'log_loss'; the 'log' alias was removed in 1.3.
clf2 = SGDClassifier(alpha=.0001, max_iter=100, loss='log',
                     penalty='l2')
clf2.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=100,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

Let's evaluate SGDClassifier.

In [None]:
y_preds = clf2.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.6935740839086564


## Naive Bayes Classifier

We will compare the result of Logistic Regression with Naive Bayes.

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf3 = MultinomialNB(alpha=.01)
clf3.fit(X_train, y_train)
y_preds = clf3.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.6964949548592672
