# Logistic Regression for Text Categorization

In this document, we run experiments with the logistic regression algorithm on text classification tasks. We use the scikit-learn (`sklearn`) framework for the experiments.


## Binary classification

We download the data set as the first step.


In [1]:
%%capture
!rm -f titles-en-train.labeled
!rm -f titles-en-test.labeled

!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled


Each sample is written on one line, as a label and a sentence separated by a tab. There are two labels, 1 and -1, in the data.

```
1	FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .
-1	Yomi is the world of the dead .
```

We will need to predict whether the title of an article is about an individual or not.

- Label 1: The title is about an individual
- Label -1: The title is not about an individual

### Load data

We will load data into a list of sentences with their labels.

In [2]:
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            # Split on the first tab only, so a tab inside the text cannot break parsing
            lb, text = line.split('\t', 1)
            data.append((text, int(lb)))

    return data


Next, we load the training and test data from the downloaded files.

In [3]:
train_data = load_data('./titles-en-train.labeled')
test_data = load_data('./titles-en-test.labeled')

train_docs, train_labels = zip(*train_data)
test_docs, test_labels = zip(*test_data)


### Using scikit-learn for feature extraction

We can use scikit-learn for [feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html). We use the bag-of-words representation. In scikit-learn, we can use `CountVectorizer` for word counts or `TfidfVectorizer` for tf-idf weights.
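
As a rough illustration of what a binary bag-of-words representation looks like, the sketch below builds a vocabulary and 0/1 vectors in plain Python. The helper names `build_vocab` and `to_binary_vector` and the two sample sentences are hypothetical, and the whitespace tokenizer is a simplification; `CountVectorizer` uses its own tokenizer and returns a sparse matrix.

```python
def build_vocab(docs):
    """Map each distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_binary_vector(doc, vocab):
    """Return a 0/1 list: 1 if the vocabulary word occurs in the document."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:  # words not seen while building the vocabulary are ignored
            vec[vocab[tok]] = 1
    return vec


# Made-up sentences, for illustration only
docs = ["the emperor lived in kyoto", "the temple is in kyoto"]
vocab = build_vocab(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each column corresponds to one vocabulary word, so two documents that share many words get similar vectors regardless of word order.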

### Feature extraction with CountVectorizer



In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
                             binary=True,  # Use binary features
                             max_features=10000
                            )
vectorizer


Now, we fit the vectorizer object on the training data.

In [5]:
X_train = vectorizer.fit_transform(train_docs)
X_train.shape


(11288, 10000)

We can try the vectorizer's analyzer to get the bag-of-words tokens of a sentence.

In [6]:
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.")


['this', 'is', 'text', 'document', 'to', 'analyze']
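
Note that the single-character word "a" is missing from the output. A minimal sketch of why, assuming the vectorizer's default settings (lowercasing plus the default `token_pattern`, which keeps only tokens of two or more word characters); `simple_analyze` is a hypothetical helper, not part of scikit-learn:

```python
import re

# Default token_pattern of CountVectorizer: tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyze(text):
    """Lowercase the text, then extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_analyze("This is a text document to analyze.")
print(tokens)
```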

### Text categorization with logistic regression

Now let's try text categorization with [logistic regression implementation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in scikit-learn. See the document [here](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for more details.

In [7]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=500)
clf


Now, we fit the model on the training data.

In [8]:
clf.fit(X_train, train_labels)
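
For intuition, a fitted binary logistic regression model turns a feature vector $x$ into a probability $\sigma(w \cdot x + b)$, where $\sigma$ is the sigmoid function. The sketch below computes this by hand; the weights, bias, and input are made-up values, not the ones learned above.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_positive(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical parameters for a 3-word vocabulary
weights = [1.5, -2.0, 0.5]
bias = -0.25
x = [1, 0, 1]  # binary bag-of-words vector

p = predict_proba_positive(x, weights, bias)
print("P(class +1) = {:.4f}".format(p))
```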


### Evaluation on test set

Now let's evaluate the model on the test data.

In [9]:
X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)


In [10]:
from sklearn import metrics

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))


# Test accuracy: 0.9411973078285512


See the classification report:

In [11]:
print( metrics.classification_report(test_labels, test_preds) )


              precision    recall  f1-score   support

          -1       0.93      0.96      0.94      1477
           1       0.95      0.93      0.94      1346

    accuracy                           0.94      2823
   macro avg       0.94      0.94      0.94      2823
weighted avg       0.94      0.94      0.94      2823
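
To make the columns of the report concrete, the sketch below computes precision, recall, and F1 for one class from scratch. The function name `precision_recall_f1` and the toy labels are assumptions for illustration; they are not the test data used above.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels, for illustration only
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)
```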



We can predict the label for an input sentence.

In [12]:
example = "FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period ."
test_x = vectorizer.transform([example])
print("Predicted class: {}".format(clf.predict(test_x)))


Predicted class: [1]


We can get prediction probabilities.

In [13]:
clf.predict_proba(test_x)


array([[8.21743903e-04, 9.99178256e-01]])

The first value is the probability that the instance belongs to class "-1" and the second value is the probability that it belongs to class "+1"; the columns follow the order of `clf.classes_`. Let's try another sample.

In [14]:
example2 = "Yomi is the world of the dead ."
test_x2 = vectorizer.transform([example2])
clf.predict_proba(test_x2)


array([[0.81846015, 0.18153985]])

We can combine the probability values with a threshold $t$ to customize the prediction. For instance, we can predict "-1" only if the probability of class "-1" is greater than 0.6 instead of 0.5.
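
The threshold rule can be sketched as follows. The function `predict_with_threshold` is a hypothetical helper; the two probabilities are the class "-1" values from the outputs above.

```python
def predict_with_threshold(proba_neg, threshold=0.6):
    """Return -1 only if P(class -1) exceeds the threshold, otherwise +1."""
    return -1 if proba_neg > threshold else 1


# P(class -1) for the two example sentences above
proba_yomi = 0.81846015
proba_fujiwara = 8.21743903e-04

print(predict_with_threshold(proba_yomi))                  # default threshold 0.6
print(predict_with_threshold(proba_fujiwara))
print(predict_with_threshold(proba_yomi, threshold=0.9))   # stricter threshold
```

Raising the threshold makes the classifier more reluctant to output "-1", which trades recall on class "-1" for precision.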

### Get top features with the highest weights

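In this section, we look at the features with the highest weights. For a binary problem, `clf.coef_` has a single row, and a positive weight pushes the prediction toward class "+1".

First, we get all feature names from the vectorizer and define the target names.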



In [15]:
feature_names = vectorizer.get_feature_names_out()
target_names = ["-1", "+1"]  # same order as clf.classes_
print(len(clf.coef_), clf.coef_)


1 [[ 0.02084022 -0.00103193 -0.0033699  ... -0.00458677 -0.02910268
   0.00760712]]


In [16]:
import numpy as np

topN = 50
print("top {} keywords:".format(topN))
# argsort is ascending, so the last topN indices have the highest weights
top_idx = np.argsort(clf.coef_[0])[-topN:]
top_features = [ feature_names[i] for i in top_idx ]
print(" ".join(top_features))


top 50 keywords:
march may story shinsengumi he statesman kugyo real october november literature drama august nihonshoki uesugi december crown anecdotes member september imperial emperor kutsuki lord fiction performer tanka chapters lived warlord july detached miyake actors poems throne unknown noble emperors novel waka poetry commander director professional myth tale monk priests priest
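
The list above is in ascending order of weight, so the strongest indicators of class "+1" come last. As a small sketch of the same idea that also reports the most negative weights (the strongest indicators of class "-1"), the code below ranks a toy weight vector in plain Python. The helper `rank_features` and the feature names and weights are made up for illustration.

```python
def rank_features(weights, names, n):
    """Return the n most positive and n most negative features by weight."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])  # ascending
    most_negative = [names[i] for i in order[:n]]
    most_positive = [names[i] for i in order[-n:][::-1]]  # highest weight first
    return most_positive, most_negative


# Hypothetical feature names and weights
names = ["priest", "temple", "poet", "river", "festival"]
weights = [2.1, -1.7, 1.3, -0.4, -2.5]

most_positive, most_negative = rank_features(weights, names, n=2)
print("toward +1:", most_positive)
print("toward -1:", most_negative)
```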


### Try with tf-idf term weighting

Now, we use tf-idf term weighting for feature extraction instead of binary features.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(train_docs)

clf = LogisticRegression(solver='lbfgs')

clf.fit(X_train, train_labels)

X_test = vectorizer.transform(test_docs)
test_preds = clf.predict(X_test)

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))


# Test accuracy: 0.9351753453772582
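
To make the weighting concrete, the sketch below computes tf-idf vectors by hand, following scikit-learn's documented defaults combined with `sublinear_tf=True`: term frequency $1 + \ln(tf)$, smoothed inverse document frequency $\ln\frac{1 + n}{1 + df} + 1$, and L2 normalization. The function `tfidf_vectors` and the toy documents are assumptions for illustration; the `max_df` and stop-word filters are left out.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = {
            tok: (1.0 + math.log(tf))                            # sublinear tf
                 * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)  # smoothed idf
            for tok, tf in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({tok: v / norm for tok, v in raw.items()})
    return vectors


# Toy tokenized documents, for illustration only
docs = [["emperor", "emperor", "kyoto"], ["temple", "kyoto"]]
vecs = tfidf_vectors(docs)
for vec in vecs:
    print({tok: round(w, 3) for tok, w in vec.items()})
```

Words that appear in every document, such as "kyoto" here, receive a lower weight than words that are specific to one document.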


## Multiclass Text Classification

In this section, we do multiclass text classification with the 20 Newsgroups dataset. scikit-learn downloads the dataset automatically and then caches it.

In [18]:
from sklearn.datasets import fetch_20newsgroups

remove = ('headers', 'footers', 'quotes')

data_train = fetch_20newsgroups(subset='train',
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test',
                               shuffle=True, random_state=42,
                               remove=remove)

y_train, y_test = data_train.target, data_test.target


In [19]:
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print()


11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)



### Feature Extraction

We will use TF-IDF features.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)


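Let's try logistic regression with the 'ovr' (one-vs-rest) strategy. Note that recent scikit-learn releases deprecate the `multi_class` parameter and suggest `OneVsRestClassifier(LogisticRegression())` for one-vs-rest instead.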

In [21]:
clf = LogisticRegression(solver='lbfgs', multi_class='ovr')
clf.fit(X_train, y_train)




Let's evaluate the results on the test set.

In [22]:
from sklearn import metrics

y_preds = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_preds)
print("# Test accuracy: {}".format(accuracy))


# Test accuracy: 0.6955655868295274


Let's try multinomial logistic regression, which fits a single softmax model over all classes.

In [23]:
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
clf.fit(X_train, y_train)




We will test multinomial Logistic Regression on the test data.

In [24]:
y_preds = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_preds)
print("# Test accuracy: {}".format(accuracy))


# Test accuracy: 0.6953000531067446
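
The two strategies give almost the same accuracy here. The main difference lies in training: one-vs-rest fits one independent binary model per class, while the multinomial model optimizes a single cross-entropy loss over all classes. At prediction time the scores are also turned into probabilities differently, which the sketch below illustrates with made-up class scores (the function names are hypothetical).

```python
import math


def ovr_probabilities(scores):
    """One-vs-rest: an independent sigmoid per class, renormalized to sum to 1."""
    probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(probs)
    return [p / total for p in probs]


def softmax_probabilities(scores):
    """Multinomial: a single softmax over all class scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical decision scores for three classes
scores = [2.0, 0.5, -1.0]

ovr = ovr_probabilities(scores)
soft = softmax_probabilities(scores)
print("ovr:    ", [round(p, 3) for p in ovr])
print("softmax:", [round(p, 3) for p in soft])
```

Both functions rank the classes in the same order for a given score vector, so the predicted label only differs when the two trained models produce different scores.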
