## Application of Logistic Regression for classifying English articles into fiction and non-fiction categories

For more details: https://medium.com/@atmabodha/fictometer-a-simple-and-explainable-algorithm-for-sentiment-analysis-31186d2a8c7e

In [None]:
# NLTK is a popular library used for analysing texts
# The brown corpus dataset is present inside this library
import nltk
from nltk.corpus import brown
nltk.download('brown')

import pandas as pd

from sklearn import preprocessing
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Details of all the information contained in the NLTK brown corpus
help(brown)

In [None]:
# List of all text categories present in the brown corpus
brown.categories()

In [None]:
# List of all articles within the 'news' category
brown.fileids('news')

In [None]:
# List of the first 20 tagged words in article number 'ca44'
# As you can see, each article is divided into individual words (tokenization),
# and for each word, the corresponding Part of Speech (POS) tag is specified.
# The functions defined later on convert these POS tags to universal tags that are easier to understand.

print(type(brown.tagged_words('ca44')))
print(len(brown.tagged_words('ca44')))
print(brown.tagged_words('ca44')[0:20])

The Fictometer algorithm is based on the Part of Speech (POS) tags in a text. For a given input text, it first counts the number of adverbs, adjectives, and pronouns. Two ratios derived from these counts (adjectives to pronouns, and adverbs to adjectives) are then used as the input features of a Logistic Regression classifier.
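
As a rough illustration of how POS counts become classifier inputs, the sketch below turns three counts into the two ratios that this notebook computes later (adverbs to adjectives, and adjectives to pronouns). The helper name `pos_ratio_features` and the counts are invented for illustration; they are not taken from the Brown corpus.

```python
# Hypothetical helper: turn coarse POS counts into the two ratio features.
def pos_ratio_features(n_adj, n_adv, n_pron):
    """Return (adverb/adjective ratio, adjective/pronoun ratio)."""
    if n_adj == 0 or n_pron == 0:
        raise ValueError("ratios are undefined when a denominator count is zero")
    return n_adv / n_adj, n_adj / n_pron

# Invented counts for an imaginary text (assumption, not real data)
radvadj, radjpron = pos_ratio_features(n_adj=40, n_adv=60, n_pron=80)
print("adverbs / adjectives :", radvadj)
print("adjectives / pronouns:", radjpron)
```

Raising an error on a zero denominator is one possible design choice; a real pipeline might instead skip such texts.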

We will be using the Brown corpus dataset for this work. This corpus contains both the text and the POS tags (added by human experts). However, these tags are fine-grained: adjectives, for example, are split into several sub-categories. Our analysis only needs the high-level tags, so the first step is to group the fine-grained tags into high-level ones, which the functions defined below do.
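
To make the grouping concrete, here is a small sketch that collapses a few hand-picked fine-grained Brown tags into coarse classes using the same prefix rules as the functions defined below. The function name `coarse_tag` and the sample tag list are assumptions made for this illustration only.

```python
from collections import Counter

def coarse_tag(tag):
    """Map a fine-grained Brown tag to a coarse class by its prefix (illustrative)."""
    if tag.startswith('J'):
        return 'ADJ'
    if tag.startswith('R') or tag.startswith('WRB'):
        return 'ADV'
    if tag.startswith('P') or tag[:3] in ('WP$', 'WPO', 'WPS'):
        return 'PRON'
    if tag.startswith('N') and not tag.startswith('NC'):
        return 'NOUN'
    if tag.startswith('V'):
        return 'VERB'
    return 'other'

# Hand-picked sample of fine-grained tags (not a real article)
sample_tags = ['JJ', 'JJR', 'JJS', 'RB', 'WRB', 'PPS', 'PP$', 'NN', 'VBD', 'AT']
counts = Counter(coarse_tag(t) for t in sample_tags)
print(counts)
```

Here the three adjective variants (`JJ`, `JJR`, `JJS`) all end up in the single `ADJ` class, which is the level of detail the classifier needs.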

In [None]:
# Define functions to count the number of POS tags in the text.

# This function counts the number of adjectives
def n_adj(text):
    adj=0
    for i in text:
        if i[0] == 'J':
            adj=adj+1
    return adj

# This function counts the number of nouns
def n_noun(text):
    noun=0
    for i in text:
        if ((i[0] == 'N') and (i[1] != 'C')):
            noun=noun+1
    return noun

# This function counts the number of verbs
def n_verb(text):
    verb=0
    for i in text:
        if i[0] == 'V':
            verb=verb+1
    return verb

# This function counts the number of pronouns
def n_pronoun(text):
    pronoun=0
    for i in text:
        if (i[0] == 'P') or (i[:3] in ['WP$','WPO','WPS']):
            pronoun=pronoun+1
    return pronoun

# This function counts the number of adverbs
def n_adv(text):
    adv=0
    for i in text:
        if (i[0] == 'R') or (i[:3] in ['WRB']):
            adv=adv+1
    return adv

# This function outputs the universal high level tag using a finer tag as input
def func_utag(tag):
    if tag[0] == 'J' or tag == 'ADJ':
        utag='ADJ'
    elif ((tag[0] == 'N') and (tag[1] != 'C')) or tag == 'NOUN':
        utag='NOUN'
    elif tag[0] == 'V' or tag == 'VERB':
        utag='VERB'
    elif (tag[0] == 'P') or (tag[:3] in ['WP$','WPO','WPS']) or tag == 'PRON':
        utag='PRON'
    elif (tag[0] == 'R') or (tag[:3] in ['WRB']) or tag == 'ADV':
        utag='ADV'
    else:
        utag='unknown'
    return utag

# This function outputs True or False depending on whether the input tag is one of the 5 high level universal tags or not.
def func_is5tag(tag):
    return tag in ['ADJ','ADV','NOUN','PRON','VERB']

In [None]:
# This creates an empty dataframe with the defined columns
brownpostable=pd.DataFrame(columns=['category','filename','ADJ','ADV','NOUN','VERB','PRON','RADJPRON','RADVADJ'])

In [None]:
# Take each article from the Brown corpus, count the number of each universal POS tag in the article, 
# and populate the DataFrame

rows=[]
for i in brown.categories():
    # This loop iterates over all the 15 categories of articles present in the Brown corpus

    for j in brown.fileids(categories=i):
        # This loop iterates over all the articles present in the chosen category

        # Keep only the POS tag (second element) of each (word, tag) pair in the article
        taglist=[tag for word, tag in brown.tagged_words(j)]

        adj=n_adj(taglist) # Count the number of adjectives in the article
        adv=n_adv(taglist) # Count the number of adverbs in the article
        noun=n_noun(taglist) # Count the number of nouns in the article
        verb=n_verb(taglist) # Count the number of verbs in the article
        pronoun=n_pronoun(taglist) # Count the number of pronouns in the article

        # Collect the above information for each article as one row
        rows.append({'category' : i,'filename' : j, 'ADJ' : int(adj), 'ADV' : int(adv), 'NOUN' : int(noun), 'VERB' : int(verb), 'PRON' : int(pronoun)})

# Build the DataFrame from the collected rows in one step
# (DataFrame.append is not available in pandas 2.0 and later).
# The two ratio columns are not in the rows yet, so they start out as NaN.
brownpostable=pd.DataFrame(rows,columns=brownpostable.columns)

In [None]:
brownpostable

In [None]:
# Compute the ratio of Adjectives to Pronouns, and the ratio of Adverbs to Adjectives in each article 
# and populate the last 2 columns of the DataFrame

# Column-wise arithmetic avoids chained .iloc assignment, which may fail to update the DataFrame
brownpostable['RADJPRON']=brownpostable['ADJ'].astype(float)/brownpostable['PRON'].astype(float)
brownpostable['RADVADJ']=brownpostable['ADV'].astype(float)/brownpostable['ADJ'].astype(float)

In [None]:
brownpostable

In [None]:
# Re-categorise the Brown corpus categories as fiction and non-fiction.
# 5 categories are identified as fiction, 5 as non-fiction and the remaining 5 are dropped due to ambiguity.

nonfiction_cats=['news','reviews','government','learned','hobbies']
fiction_cats=['fiction','mystery','science_fiction','adventure','romance']

# Keep only the articles that belong to the 10 selected categories
brown2=brownpostable[brownpostable['category'].isin(nonfiction_cats+fiction_cats)].copy()

# Relabel the 'category' column only, leaving the other columns untouched
label_map={c:'nonfiction' for c in nonfiction_cats}
label_map.update({c:'fiction' for c in fiction_cats})
brown2['category']=brown2['category'].map(label_map)

In [None]:
# Keep only the category label and the two ratio features
brown2.drop(columns=['filename','PRON','ADJ','ADV','NOUN','VERB'],inplace=True)

In [None]:
brown2

In [None]:
sns.scatterplot(data=brown2, hue='category', x='RADVADJ', y='RADJPRON')

In [None]:
# Replace the text labels by integers (nonfiction -> 0, fiction -> 1)
brown3=brown2.copy()
brown3['category']=brown3['category'].map({'nonfiction':0,'fiction':1}).astype(int)

In [None]:
brown3

In [None]:
# x holds the two ratio features, y holds the class labels
x=brown3.drop(columns=['category'])
y=brown3.category

In [None]:
x

In [None]:
y

In [None]:
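# Hold out 20% of the articles for testing.
# The split is random, so the accuracies below vary slightly between runs.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Fit a Logistic Regression classifier on the training articles
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(x_train,y_train)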

In [None]:
# Training accuracy

y_pred=logreg.predict(x_train)
accuracy = metrics.accuracy_score(y_train,y_pred)
print("Training Accuracy : ",accuracy)

In [None]:
# Testing accuracy

y_pred=logreg.predict(x_test)
accuracy = metrics.accuracy_score(y_test,y_pred)
print("Testing Accuracy : ", accuracy)
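
Because a single random 80/20 split gives an accuracy that changes from run to run, k-fold cross-validation is one common way to get a steadier estimate. The sketch below uses scikit-learn's `cross_val_score` on synthetic two-feature data; the arrays `x_demo` and `y_demo` are made-up stand-ins for the ratio features, not Brown corpus values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the two ratio features (assumption, NOT real data)
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(loc=[0.4, 3.0], scale=0.3, size=(50, 2)),  # pretend "nonfiction"
    rng.normal(loc=[0.9, 0.8], scale=0.3, size=(50, 2)),  # pretend "fiction"
])
y_demo = np.array([0] * 50 + [1] * 50)

# 5-fold cross-validation: each sample is used for testing exactly once
scores = cross_val_score(LogisticRegression(solver='lbfgs'), x_demo, y_demo, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```

With the real data, `x` and `y` from this notebook could presumably be passed in place of the synthetic arrays.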

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['nonfiction', 'fiction'])
disp.plot(cmap=plt.cm.Blues)  # plot the confusion matrix
plt.show()  # show the plot
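
The confusion matrix can also be summarised as precision and recall for the positive class. In scikit-learn's layout, rows are true labels and columns are predicted labels. The sketch below works on a tiny invented set of labels (`y_true_demo` and `y_pred_demo` are illustrative names, not the notebook's results).

```python
from sklearn.metrics import confusion_matrix

# Tiny invented example: 0 = nonfiction, 1 = fiction
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # share of predicted fiction that really is fiction
recall = tp / (tp + fn)     # share of true fiction that was found
print(cm_demo)
print("Precision:", precision)
print("Recall   :", recall)
```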

In [None]:
# Parameter values of Logistic Regression after training
print(logreg.intercept_)
print(logreg.coef_)
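
The intercept and coefficients are what make the model explainable: for a binary problem, the predicted probability of class 1 is the logistic (sigmoid) function applied to the intercept plus the weighted sum of the features. The sketch below checks this on synthetic data; `x_demo`, `y_demo`, and `model` are illustrative names and the data is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the two ratio features (assumption, NOT real data)
rng = np.random.default_rng(1)
x_demo = np.vstack([
    rng.normal(loc=[0.4, 3.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[0.9, 0.8], scale=0.3, size=(50, 2)),
])
y_demo = np.array([0] * 50 + [1] * 50)

model = LogisticRegression(solver='lbfgs').fit(x_demo, y_demo)

# Linear score z = intercept + w1*x1 + w2*x2, then the sigmoid function
z = model.intercept_[0] + x_demo @ model.coef_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's own probability for class 1
p_sklearn = model.predict_proba(x_demo)[:, 1]
print("Manual and library probabilities agree:", np.allclose(p_manual, p_sklearn))
```

The sign of each coefficient then indicates whether a larger value of that feature pushes the prediction towards class 1 or class 0.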