# DIGI405 Lab Class: Genre classification of historical newspaper texts

In this notebook we will train and test logistic regression and Naive Bayes classifiers on a collection of texts from [historical New Zealand newspapers](https://paperspast.natlib.govt.nz/newspapers). Our aim is to build genre classification models that are independent of topic, so we will use features based on the structure and layout of the text (for example line widths), linguistic features (such as the frequency of certain parts-of-speech), and other text statistics.

The data used in this notebook is originally sourced from the [National Library of New Zealand's Papers Past open data](https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot/dataset-papers-past-newspaper-open-data-pilot). It consists of a small dataset of articles that have been pre-labelled with their genre and includes features related to line widths and offsets that have been extracted from the [METS/ALTO XML files](https://veridiansoftware.com/knowledge-base/metsalto/) for each newspaper.

We will use [spaCy](https://spacy.io/), [textfeatures](https://towardsdatascience.com/textfeatures-library-for-extracting-basic-features-from-text-data-f98ba90e3932), and [textstat](https://pypi.org/project/textstat/) to extract additional features and add them to our dataframe. We will then use [scikit-learn](https://scikit-learn.org/stable/) to train and test our models.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 0:</strong> Throughout the notebook there are defined tasks for you to do. Watch out for them - they will have a box around them like this! Make sure you take some notes as you go.
</div>

![National Library Papers Past](https://images.ctfassets.net/pwv49hug9jad/58rs0U4wNbQAhTch5JeZsC/e22c8ac2acb9aa94698e8f084e136352/datasets-pp-open-data-feature-image.jpg?fm=webp)

[Image source: natlib.govt.nz](https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot/dataset-papers-past-newspaper-open-data-pilot)

## Setup

We need to make sure the latest version of scikit-learn is installed (you only need to run this cell once):

In [None]:
%%bash
python -m pip install scikit-learn --upgrade

Now import the required libraries. 

In [None]:
import sys

import pandas as pd
import numpy as np
import math
import pickle
import re

# Classifier training and evaluation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import average_precision_score, precision_recall_curve

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Feature extraction
import spacy
import textstat
import textfeatures as tf

nlp = spacy.load('en_core_web_sm')

sns.set(rc = {'figure.figsize':(15,8)})

### Install Python packages if needed

If the import cell above raises an error telling you that a module such as textstat or seaborn is missing, run the cell below and then re-run the import cell. Note that the line(s) you want to run must be directly below the %%bash line, i.e. with no commented lines in between.

In [None]:
%%bash
python -m spacy download en_core_web_sm
python -m pip install textstat
python -m pip install textfeatures
python -m pip install seaborn

## Load and explore the dataset

In [None]:
# Load the dataframe
filepath = 'paperspast_4genres_labelled.csv'
df = pd.read_csv(filepath, index_col = 0)

In [None]:
# View the count of articles by genre
display(df.groupby(['genre'])['genre'].count())

In [None]:
# View first ten rows of the dataframe
df.head(10)

In [None]:
# Explore the distribution of the articles in our dataset by newspaper

sample_papers_unique = df['newspaper'].nunique()
print("-----------------------------------------------------") 
print(f"Number of newspaper titles in sample dataset: {sample_papers_unique}") 
print("-----------------------------------------------------") 
print("") 

ax_1 = sns.countplot(x="newspaper", 
                     data = df, 
                     order = df['newspaper'].value_counts().index, 
                     color = "#3949ab")
ax_1.set_xlabel("Newspaper", fontsize = 14)
ax_1.set_ylabel("Count of articles", fontsize = 14)
ax_1.set_title("Distribution of articles by newspaper", fontsize = 16)
plt.xticks(rotation = 90, fontsize = 13)
plt.yticks(fontsize = 14)
plt.show()

In [None]:
# Explore the distribution of the articles in our dataset by year

df['date'] = pd.to_datetime(df['date'])
annual_df = (df.groupby([df['date'].dt.year.rename('year')])
             ['text'].count().reset_index())

ax_2 = sns.barplot(x="year", y="text", data=annual_df, color = '#3949ab')
ax_2.set_xlabel("Year", fontsize = 14)
ax_2.set_ylabel("Count of articles", fontsize = 14)
ax_2.set_title("Distribution of articles in dataset by year", fontsize = 16)
plt.xticks(rotation = 90, fontsize = 13)
plt.yticks(fontsize = 14)
plt.show()

In [None]:
# Display the full text of a selected article by index
selected_index = 250

print(f"\nGenre: {df['genre'].values[selected_index]}\n")
print("==============\n")
print(f"Title: {df['title'].values[selected_index]}\n")
df['text'].values[selected_index]

## Data cleaning

You'll see from the above that there can be symbols and punctuation in the text that are the result of [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition) errors. We will run a simple cleaner function over the text column of the dataframe to improve this and add the cleaned text to a new column. Before we remove punctuation, we will count the sentences and add this feature to the dataframe.  

In [None]:
def cleaner(df, column_name):
    """
    Remove unnecessary symbols to create a clean text column from the original dataframe column using a regex.
    """
    # A column of sentence count is added to the dataframe before punctuation is removed.
    df['sentence_count'] = df[column_name].apply(lambda x: textstat.sentence_count(x))

    # Regex pattern for only alphanumeric, hyphenated text
    pattern = re.compile(r"[A-Za-z0-9\-]{1,50}")
    df['clean_text'] = df[column_name].str.findall(pattern).str.join(' ')
    
    return df

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 1:</strong> Here we are doing a very simple clean-up of the text, but a number of OCR errors will remain, for example incorrect words such as 'oar' or 'onr' instead of 'our'. Think about the impact this might have on our model and discuss it with your classmates or tutors. 
</div>
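
As a quick illustration of what the cleaner's regex keeps and what it leaves behind, the sketch below applies the same pattern to a single string. The sample sentence is invented for this example and is not taken from the dataset.

```python
import re

# Same pattern as the cleaner: runs of letters, digits, and hyphens (1 to 50 characters)
pattern = re.compile(r"[A-Za-z0-9\-]{1,50}")

# Hypothetical OCR-style snippet, invented for illustration
sample = "The *Otago' Witness, 1st Jan. 1880 ~ onr well-known paper!"

tokens = pattern.findall(sample)
clean_sample = " ".join(tokens)

print(clean_sample)
# Stray symbols and punctuation are dropped, but the OCR misspelling 'onr' survives
```

Note that the regex can only remove characters; it has no way of knowing that 'onr' should have been 'our'.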

In [None]:
df = cleaner(df, 'text')

In [None]:
df.head(5)

## Feature extraction: linguistic features and text statistics

The following cells extract parts-of-speech and text statistic features and add them to the dataframe. For efficiency, the texts are [processed for parts-of-speech tagging](https://spacy.io/usage/processing-pipelines) as a stream using spaCy's [nlp.pipe](https://spacy.io/usage/processing-pipelines#processing). This allows the texts to be buffered in batches instead of one-by-one.
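
The counting functions in the next cells each loop over the document once per part-of-speech. As a rough alternative sketch, a single pass with `collections.Counter` would give the same counts. The function name `pos_counts` and the hand-written tag list are assumptions made for illustration; in the notebook the tags would come from `token.pos_` for each token in a spaCy doc.

```python
from collections import Counter

# Universal POS tags counted in this notebook
TAGS_OF_INTEREST = ("PROPN", "VERB", "NOUN", "ADJ", "NUM", "PRON")

def pos_counts(tags):
    """Count the tags of interest in one pass over an iterable of POS tag strings."""
    counts = Counter(tags)
    return {tag: counts.get(tag, 0) for tag in TAGS_OF_INTEREST}

# Hand-written stand-in for [token.pos_ for token in doc]
example_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "PROPN", "PROPN", "NUM"]

result = pos_counts(example_tags)
print(result)
```

Tags that never occur (here `PRON`) are reported as zero, so every document yields the same set of keys.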

In [None]:
# Define the dataframe column containing the text to be processed
text_col = 'clean_text'  

In [None]:
def count_propn_spacy(doc):
    """
    Given a Spacy doc object return count of the 
    following parts-of-speech: proper nouns.
    """
    count_propn = 0

    for token in doc:
        if token.pos_ == 'PROPN':
            count_propn += 1
        
    return count_propn 


def count_verb_spacy(doc):
    """
    Given a Spacy doc object return count of the 
    following parts-of-speech: verbs.
    """
    count_verb = 0

    for token in doc:
        if token.pos_ == 'VERB':
            count_verb += 1

    return count_verb


def count_noun_spacy(doc):
    """
    Given a Spacy doc object return count of the 
    following parts-of-speech: nouns.
    """
    count_noun = 0

    for token in doc:
        if token.pos_ == 'NOUN':
            count_noun += 1

    return count_noun


def count_adj_spacy(doc):
    """
    Given a Spacy doc object return count of the 
    following parts-of-speech: adjectives.
    """
    count_adj = 0

    for token in doc:
        if token.pos_ == 'ADJ':
            count_adj += 1
            
    return count_adj


def count_nums_spacy(doc):
    """
    Given a Spacy doc object return count of the 
    following parts-of-speech: numbers.
    """ 
    count_nums = 0

    for token in doc:
        if token.pos_ == 'NUM':
            count_nums += 1
        
    return count_nums


def count_pron_spacy(doc):
    """
    Given a Spacy doc object return count of the 
    following parts-of-speech: pronouns.
    """
    count_pron = 0
    
    for token in doc:
        if token.pos_ == 'PRON':
            count_pron += 1
        
    return count_pron

In [None]:
def text_features_pipe(text_col, df):
    """
    Process given text column of a dataframe to 
    extract linguistic features and add them to
    the dataframe. Return the updated dataframe.
    """
    
    input_col = df[text_col]  

    # spaCy processing pipeline
    nlp_text_pipe = nlp.pipe(input_col, batch_size=20)
    
    propn_count = []
    verb_count = []
    noun_count = []
    adj_count = []
    nums_count = []
    pron_count = []
    
    for doc in nlp_text_pipe:

        # POS tags
        # Universal POS Tags
        # http://universaldependencies.org/u/pos/

        # Count proper nouns
        propn_total = count_propn_spacy(doc)
        propn_count.append(propn_total)

        # Count verbs
        verb_total = count_verb_spacy(doc)
        verb_count.append(verb_total)

        # Count nouns
        noun_total = count_noun_spacy(doc)
        noun_count.append(noun_total)

        # Count adjectives
        adj_total = count_adj_spacy(doc)
        adj_count.append(adj_total)

        # Count numbers
        nums_total = count_nums_spacy(doc)
        nums_count.append(nums_total)

        # Count pronouns
        pron_total = count_pron_spacy(doc)
        pron_count.append(pron_total)

    # Add word and syllable counts using the textstat library 
    df['word_count'] = input_col.apply(lambda x: textstat.lexicon_count(x, removepunct=True)) 
    df['syll_count'] = input_col.apply(lambda x: textstat.syllable_count(x))
    
    # Add the number of words with a syllable count greater than or equal to 3
    df['polysyll_count'] = input_col.apply(lambda x: textstat.polysyllabcount(x)) 
    
    # Add the number of words with a syllable count equal to one
    df['monosyll_count'] = input_col.apply(lambda x: textstat.monosyllabcount(x)) 
    
    # Add stopwords count using the textfeatures library
    tf.stopwords_count(df,text_col,'stopwords_count')
    
    # tf.stopwords(df,text_col,'stopwords')  # Include a column that lists the stopwords found in the text
    
    # Add average word length and character counts using the textfeatures library 
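    # If the library call fails, fall back to a column of zeros so the notebook can continue
    try:
        tf.avg_word_length(df,text_col,'avg_word_length')
    except Exception:
        df['avg_word_length'] = 0
    
    try:
        tf.char_count(df,text_col,'char_count')
    except Exception:
        df['char_count'] = 0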
        
    # Add parts of speech counts to the dataframe
    df['propn_count'] = propn_count
    df['verb_count'] = verb_count
    df['noun_count'] = noun_count
    df['adj_count'] = adj_count
    df['nums_count'] = nums_count
    df['pron_count'] = pron_count
    
    # Add frequency columns   
    df['propn_freq'] = df['propn_count']/df['word_count']
    df['verb_freq'] = df['verb_count']/df['word_count']
    df['noun_freq'] = df['noun_count']/df['word_count']
    df['adj_freq'] = df['adj_count']/df['word_count']
    df['nums_freq'] = df['nums_count']/df['word_count']
    df['pron_freq'] = df['pron_count']/df['word_count']
    
    df['polysyll_freq'] = df['polysyll_count']/df['word_count']
    df['monosyll_freq'] = df['monosyll_count']/df['word_count']
    df['stopword_freq'] = df['stopwords_count']/df['word_count']
    
    return df 

In [None]:
# Run the function to extract text features and add them to the dataframe
# This will take a little while to run
df = text_features_pipe(text_col, df)

In [None]:
# Inspect the first few rows of the dataframe to see the features that have been added
pd.set_option('display.max_columns', None)
df.head(5)

In [None]:
# Inspect descriptive statistics for our numerical data
df.describe()

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 2:</strong> Why is exploratory data analysis (EDA) using techniques such as visualising the data and examining descriptive statistics important? What can it reveal? Discuss with your classmates or tutors. 
</div>

In [None]:
# Inspect the features and data types of the dataframe
display(df.dtypes)

In [None]:
# Change the text, newspaper name, newspaper id, genre, and clean_text columns to strings
df['genre'] = (df['genre']).astype('string')
df['text'] = (df['text']).astype('string')
df['clean_text'] = (df['clean_text']).astype('string')
df['newspaper_id'] = (df['newspaper_id']).astype('string')
df['newspaper'] = (df['newspaper']).astype('string')

## Specify features to include in the model

* We now need to specify the features we want to include in our model. For example, it makes sense to include only the parts-of-speech frequencies and not the raw counts, because counts grow with the length of the article.
* You can include or remove features from the model to explore the impact of different combinations of features on the performance of the classifier.
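
To illustrate why frequencies are a better choice than raw counts, consider two made-up texts of very different lengths. The numbers are invented, and the helper name `safe_freq` is hypothetical; it returns 0.0 for an empty text, a case the notebook's own division does not guard against.

```python
def safe_freq(count, word_count):
    """Return count / word_count, or 0.0 when the text has no words."""
    return count / word_count if word_count else 0.0

# Invented (noun_count, word_count) pairs for a short and a long text
short_text = (10, 50)
long_text = (100, 500)

short_freq = safe_freq(*short_text)
long_freq = safe_freq(*long_text)

print(short_freq, long_freq)   # same proportion despite a tenfold difference in counts
print(safe_freq(0, 0))         # an empty text yields 0.0 instead of a division error
```

A model trained on the counts would see these two texts as very different, while the frequencies show that they use nouns at the same rate.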

In [None]:
# List of features to include in the model 
# Place cursor in the text and press Ctrl + / to comment or uncomment the line

features = ["propn_freq", 
            "verb_freq", 
            "noun_freq", 
            "adj_freq",
            "nums_freq", 
            "pron_freq", 
            "stopword_freq", 
            "avg_line_offset", 
#             "max_line_offset", 
            "avg_line_width", 
#             "min_line_width", 
#             "max_line_width", 
#             "line_width_range", 
            "polysyll_freq", 
            "monosyll_freq", 
            "sentence_count", 
#             "word_count", 
            "avg_word_length", 
#             "char_count", 
            
            # binary_class is the target label, not a predictor:
            # the target genre is coded as '1' and the others as '0'.
            # It is separated from the predictors before training, so do not remove it from this list
            "binary_class"  
           ]

## Set the target genre

* We will specify the genre we want to predict with the binary classifier. 
* The selected genre will be labelled as 1 in the binary classification model, with the other classes labelled as 0.
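
The one-vs-rest labelling described above can be sketched on a plain list. The genre labels below are invented stand-ins for the dataframe's genre column.

```python
from collections import Counter

target_genre = "Poetry"

# Invented genre labels standing in for the dataframe's genre column
genres = ["Poetry", "Fiction", "FamilyNotice", "Poetry", "LetterToEditor", "Fiction"]

# One-vs-rest encoding: 1 for the target genre, 0 for every other genre
binary_class = [1 if genre == target_genre else 0 for genre in genres]

print(binary_class)
print(Counter(binary_class))  # shows how many articles fall in each class
```

Checking the class balance is worthwhile because a target genre that is rare makes accuracy alone a misleading measure.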

In [None]:
# Select from:
# FamilyNotice     
# Fiction          
# LetterToEditor    
# Poetry         

target_genre = "Poetry"

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 3:</strong> Train and test classifiers for each of the four genres and take note of the results in a separate document. Which combination of genre and classifier achieved the best metrics and which was the worst? Discuss the results with your classmates or tutors.
</div>
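
When recording results for Task 3, it may help to recall how the reported metrics relate to the counts of correct and incorrect predictions. The sketch below computes them by hand on invented labels; the notebook itself relies on scikit-learn's metric functions.

```python
# Invented true labels and predictions for eight test articles
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the articles predicted as the genre, how many were correct
recall = tp / (tp + fn)      # of the articles in the genre, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy = {accuracy:.3f}")
print(f"Precision = {precision:.3f}")
print(f"Recall = {recall:.3f}")
print(f"F1 Score = {f1:.3f}")
```

Accuracy counts the many correctly rejected articles, so it can look high even when precision and recall for the target genre are modest.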

## Split the data into train and test sets

* Run the cells below to split the data into train and test data sets.
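
The split function keeps track of which rows end up in each set so that predictions can later be written back to the right articles. The idea can be illustrated without scikit-learn; this is a simplified stand-in using Python's `random` module, not the algorithm that `train_test_split` uses, and the row labels are invented.

```python
import random

# Invented row labels standing in for the dataframe index
indices = list(range(100, 110))

rng = random.Random(3)           # fixed seed so the split is repeatable
shuffled = indices[:]
rng.shuffle(shuffled)

n_test = round(0.3 * len(shuffled))   # hold out 30% of the rows for testing
indices_test = shuffled[:n_test]
indices_train = shuffled[n_test:]

# Predictions for the test rows can be stored against their original labels
predictions = {idx: None for idx in indices}
for idx in indices_test:
    predictions[idx] = 0         # placeholder prediction

print(len(indices_train), len(indices_test))
```

Rows that were used for training keep `None` in the test-prediction slot, which mirrors how the notebook's prediction columns are only filled for the relevant subset.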

In [None]:
def train_test_data(df, features, target_genre):
    """
    Given the dataframe, features to include in the model,
    and the target genre, split the data into 
    training and test sets and use the dataframe indices to 
    save the order of the split
    """
    
    df['binary_class'] = np.where(df['genre']== target_genre, 1, 0)
    model_df = df.filter(features, axis=1)
    indices = df.index.values

    # Extract the explanatory variables in X and the target variable in y
    y = model_df.binary_class.copy()
    X = model_df.drop(['binary_class'], axis=1)

    # Train test split 
    # Use the indices to save the order of the split.
    # https://stackoverflow.com/questions/48947194/add-randomforestclassifier-predict-proba-results-to-original-dataframe
    X_train, X_test, indices_train, indices_test = train_test_split(X, 
                                                                    indices, 
                                                                    test_size = .3,    # This value changes the proportion of data held out for the test set
                                                                    random_state = 3)
    
    # Select the target values by index label so they stay aligned with X_train and X_test
    y_train, y_test = y.loc[indices_train], y.loc[indices_test]
    
    return X_train, X_test, y_train, y_test, indices_train, indices_test

In [None]:
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_data(df, features, target_genre)

In [None]:
X_train.head(10)

In [None]:
y_train.head(10)

## Train and test a logistic regression classifier 
* [Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) is a binary classification method popular for its computational efficiency and interpretability.
* Run the cells below to train and test a logistic regression classifier for our selected genre.
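
The pipeline in the next cell standardises each feature and then fits the logistic regression. As a sketch of those two ideas, the code below z-scores one invented feature using the population standard deviation (which, as far as we are aware, matches what `StandardScaler` does) and passes a weighted value through the logistic function. The weight and intercept are invented rather than fitted.

```python
import math
import statistics

def standardise(values):
    """Z-score a list of values using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Invented average line widths for five articles
avg_line_width = [200.0, 220.0, 400.0, 410.0, 420.0]
z_scores = standardise(avg_line_width)

# Invented (not fitted) coefficient and intercept for a one-feature model
weight, intercept = -2.0, 0.0
probabilities = [sigmoid(weight * z + intercept) for z in z_scores]

for width, prob in zip(avg_line_width, probabilities):
    print(f"avg_line_width = {width:.0f} -> P(positive class) = {prob:.3f}")
```

With a negative weight, articles with narrower lines than average receive a probability above 0.5 and wider ones fall below it.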

In [None]:
def log_reg_binary(X_train, X_test, y_train, y_test, target_genre):
    """
    Train a logistic regression model to classify the selected genre
    """ 
    pipe = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression())]) 
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    accuracy_result = accuracy_score(y_test, y_pred)
    precision_result = precision_score(y_test, y_pred)
    recall_result = recall_score(y_test, y_pred)
    f1_result = f1_score(y_test, y_pred)
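    # AUROC is a ranking metric, so score it on the predicted probability of the positive class
    auroc_result = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])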

    print("-----------------------------------------------")
    print(f"Binary Classification - Logistic Regression")
    print(f"{target_genre}")
    print("-----------------------------------------------")
    print()
    print(f"Accuracy = {accuracy_result:.3f}")
    print(f"Precision = {precision_result:.3f}")
    print(f"Recall = {recall_result:.3f}")
    print(f"F1 Score = {f1_result:.3f}")
    print(f"AUROC Score = {auroc_result:.3f}")
    
    RocCurveDisplay.from_predictions(y_test, pipe.predict_proba(X_test)[:, 1])
    plt.title("AUROC: Logistic Regression")
    plt.show()
    
    print()
    print("-----------------------------------------------")
    print(f"Model coefficients (converted to odds)")
    print(f"{target_genre}")
    print("-----------------------------------------------")
    
    
    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)
    
    y_prob_train = pipe.predict_proba(X_train)
    y_prob_test = pipe.predict_proba(X_test)
    
    # Get coefficients and convert from log odds to odds
    # https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1
    odds = np.exp(pipe.named_steps['clf'].coef_[0])
    
    return y_pred_train, y_pred_test, y_prob_train, y_prob_test, odds

In [None]:
def genres_binary_lr(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test):
    """
    Train and test the model, and return the dataframe
    with appended predictions.
    """

    y_pred_train, y_pred_test, y_prob_train, y_prob_test, odds = log_reg_binary(X_train, 
                                                                                X_test, 
                                                                                y_train, 
                                                                                y_test, 
                                                                                target_genre)

    # Add the predictions to a copy of the original dataframe
    df_new = df.copy()
    df_new.loc[indices_train,'pred_train'] = y_pred_train
    df_new.loc[indices_test,'pred_test'] = y_pred_test
    df_new.loc[indices_train,'prob_0_train'] = y_prob_train[:,0]
    df_new.loc[indices_test,'prob_0_test'] = y_prob_test[:,0]
    df_new.loc[indices_train,'prob_1_train'] = y_prob_train[:,1]
    df_new.loc[indices_test,'prob_1_test'] = y_prob_test[:,1]   
    
    # We will construct a link that allows us to visit the newspaper on the Papers Past website
    df_new["newspaper_web"] = df_new["date"].astype('string')
    df_new["newspaper_web"] = df_new["newspaper_web"].str.replace('-','/')
    df_new["newspaper_web"] = "https://paperspast.natlib.govt.nz/newspapers/" \
                            + df_new["newspaper"].str.replace(
                            ' ',
                            '-', 
                            regex = False).str.lower().str.replace(
                            "'",                                                                            
                            "", 
                            regex = False).str.replace(
                            ".", 
                            "", 
                            regex = False) \
                            + "/" \
                            + df_new["newspaper_web"]

    # Sort the dataframe by probability of being the given genre
    df_new = df_new.sort_values(by="prob_1_test", ascending=False)  
    
    # Create a dataframe of the coefficients and features 
    lr_odds_df = pd.DataFrame(odds, X_train.columns, columns=['coef (odds)']).sort_values(by='coef (odds)', ascending=False)
    lr_odds_df['feature'] = lr_odds_df.index
    
    return df_new, lr_odds_df

In [None]:
lr_preds_df, lr_odds_df = genres_binary_lr(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test)

# Explore the model coefficients
display(lr_odds_df)

### Interpreting the logistic regression model
A benefit of logistic regression is that it is relatively easy to interpret compared to other classifiers. We can extract the coefficients of the features in the final model (using the 'coef_' attribute) to see which features were the strongest predictors of the positive class (in our case, the selected genre). 

The coefficients extracted using 'coef_' are expressed in log odds: each one is the change in the log odds of the positive class associated with a one-unit increase in that feature. In order to interpret them, we exponentiate them to obtain odds ratios. Because the pipeline standardises the features with StandardScaler, one unit here corresponds to one standard deviation of the feature. Odds ratios greater than 1 indicate a positive relationship and can be interpreted as follows:

**"For every unit increase in {feature}, the odds that the observation is {positive class} are multiplied by {coef (odds)} when all other variables are held constant."**

Odds ratios less than 1 correspond to negative coefficients, i.e. features whose increase makes the positive class less likely. To describe them in a similar way to the above, we need to take 1/odds. For example:

"For every unit increase in {feature}, the odds that the observation **is not** {positive class} are multiplied by {1 / coef (odds)} when all other variables are held constant."

When interpreting the model coefficients it is important to consider the influence of features that may be correlated with each other. Correlated features carry overlapping information about the outcome, so the model may share the effect between them in unstable ways, and the sign and value of their individual coefficients should therefore be interpreted with caution. 

You can read more about calculating and interpreting the coefficients of regression models in this [Towards Data Science](https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1) article. 
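
The conversion from log odds to odds described above can be sketched with the standard library. The coefficient values below are invented for illustration and are not results from this dataset.

```python
import math

# Invented log-odds coefficients for two standardised features
coefficients = {"pron_freq": 0.693, "avg_line_width": -1.386}

# Exponentiating a log-odds coefficient gives the multiplicative change in the odds
odds_ratios = {feature: math.exp(coef) for feature, coef in coefficients.items()}

for feature, ratio in odds_ratios.items():
    if ratio >= 1:
        print(f"{feature}: odds of the positive class are multiplied by {ratio:.2f}")
    else:
        # Take the reciprocal to describe the effect on the odds of the negative class
        print(f"{feature}: odds of the negative class are multiplied by {1 / ratio:.2f}")
```

A coefficient of zero gives an odds ratio of exactly 1, meaning the feature leaves the odds unchanged.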

## Train and test a Naive Bayes classifier 
* [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) methods are also widely used for text classification and have been effective in many real-world applications such as spam filtering.
* Run the cells below to train and test a Naive Bayes classifier for our selected genre.
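
For intuition about what a Gaussian Naive Bayes classifier does, the following is a minimal from-scratch sketch, assuming each feature is normally distributed within each class and independent of the others. It is not scikit-learn's implementation; the function names, the variance-smoothing constant, and the data are all invented for this example.

```python
import math
import statistics

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [statistics.fmean(col) for col in columns],
            # Small constant avoids zero variance; the value is an arbitrary choice
            "vars": [statistics.pvariance(col) + 1e-9 for col in columns],
        }
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def log_posterior(params):
        total = math.log(params["prior"])
        for value, mean, var in zip(x, params["means"], params["vars"]):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

# Invented two-feature data: [pron_freq, avg_line_width]
X_demo = [[0.02, 200.0], [0.03, 220.0], [0.10, 400.0], [0.12, 420.0]]
y_demo = [1, 1, 0, 0]

nb_model = fit_gaussian_nb(X_demo, y_demo)
pred_a = predict_gaussian_nb(nb_model, [0.025, 210.0])
pred_b = predict_gaussian_nb(nb_model, [0.11, 410.0])
print(pred_a, pred_b)
```

The "naive" part is the sum over features inside `log_posterior`: each feature contributes independently, which is rarely true in practice but often works well enough.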

In [None]:
def nb_binary(X_train, X_test, y_train, y_test, target_genre):
    """
    Train a Naive Bayes model to classify the selected genre
    """ 
    pipe = Pipeline([('scl', StandardScaler()),
                    ('clf', GaussianNB())]) 
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    accuracy_result = accuracy_score(y_test, y_pred)
    precision_result = precision_score(y_test, y_pred)
    recall_result = recall_score(y_test, y_pred)
    f1_result = f1_score(y_test, y_pred)
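    # AUROC is a ranking metric, so score it on the predicted probability of the positive class
    auroc_result = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])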

    print("-----------------------------------------------")
    print(f"Binary Classification - Naive Bayes")
    print(f"{target_genre}")
    print("-----------------------------------------------")
    print()
    print(f"Accuracy = {accuracy_result:.3f}")
    print(f"Precision = {precision_result:.3f}")
    print(f"Recall = {recall_result:.3f}")
    print(f"F1 Score = {f1_result:.3f}")
    print(f"AUROC Score = {auroc_result:.3f}")
    
    RocCurveDisplay.from_predictions(y_test, pipe.predict_proba(X_test)[:, 1])
    plt.title("AUROC: Naive Bayes")
    plt.show()
    
    y_pred_train = pipe.predict(X_train)
    y_pred_test = pipe.predict(X_test)
    
    y_prob_train = pipe.predict_proba(X_train)
    y_prob_test = pipe.predict_proba(X_test)
    
    return y_pred_train, y_pred_test, y_prob_train, y_prob_test

In [None]:
def genres_binary_nb(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test):
    """
    Train and test the model, and return the dataframe
    with appended predictions.
    """
    
    y_pred_train, y_pred_test, y_prob_train, y_prob_test = nb_binary(X_train, 
                                                                     X_test, 
                                                                     y_train, 
                                                                     y_test, 
                                                                     target_genre)

    # Add the predictions to a copy of the original dataframe
    df_new = df.copy()
    df_new.loc[indices_train,'pred_train'] = y_pred_train
    df_new.loc[indices_test,'pred_test'] = y_pred_test
    df_new.loc[indices_train,'prob_0_train'] = y_prob_train[:,0]
    df_new.loc[indices_test,'prob_0_test'] = y_prob_test[:,0]
    df_new.loc[indices_train,'prob_1_train'] = y_prob_train[:,1]
    df_new.loc[indices_test,'prob_1_test'] = y_prob_test[:,1]    
       
    # We will construct a link that allows us to visit the newspaper on the Papers Past website
    df_new["newspaper_web"] = df_new["date"].astype('string')
    df_new["newspaper_web"] = df_new["newspaper_web"].str.replace('-','/')
    df_new["newspaper_web"] = "https://paperspast.natlib.govt.nz/newspapers/" \
                            + df_new["newspaper"].str.replace(
                            ' ',
                            '-', 
                            regex = False).str.lower().str.replace(
                            "'",                                                                            
                            "", 
                            regex = False).str.replace(
                            ".", 
                            "", 
                            regex = False) \
                            + "/" \
                            + df_new["newspaper_web"]

    # Sort the dataframe by probability of being the given genre
    df_new = df_new.sort_values(by="prob_1_test", ascending=False)  
    
    return df_new

In [None]:
nb_preds_df = genres_binary_nb(df, target_genre, X_train, X_test, y_train, y_test, indices_train, indices_test)

## Inspect incorrectly classified texts

We can explore which texts were incorrectly classified by the two models. Run the cell below to display dataframes of the misclassified texts.

In [None]:
pd.set_option('display.max_columns', None)

lr_misclass = lr_preds_df.loc[(lr_preds_df["binary_class"] != lr_preds_df["pred_test"]) & (lr_preds_df["pred_test"] >= 0)]
lr_misclass = lr_misclass.filter(["date", 
                                  "newspaper_id", 
                                  "newspaper", 
                                  "article_id", 
                                  "title", 
                                  "text", 
                                  "genre", 
                                  "binary_class", 
                                  "pred_test", 
                                  "newspaper_web"], axis=1).reset_index(drop=True)

print(f"\nMisclassified texts for Logistic Regression model (lr)")
print(f"{target_genre}")
print("========================================================\n")
display(lr_misclass)

nb_misclass = nb_preds_df.loc[(nb_preds_df["binary_class"] != nb_preds_df["pred_test"]) & (nb_preds_df["pred_test"] >= 0)]
nb_misclass = nb_misclass.filter(["date", 
                                  "newspaper_id", 
                                  "newspaper", 
                                  "article_id", 
                                  "title", 
                                  "text", 
                                  "genre", 
                                  "binary_class", 
                                  "pred_test", 
                                  "newspaper_web"], axis=1).reset_index(drop=True)

print(f"\nMisclassified texts for Naive Bayes model (nb)")
print(f"{target_genre}")
print("========================================================\n")
display(nb_misclass)

### Display the full text and newspaper web address of a selected misclassification by model and index

In [None]:
# Select the model and index number of the misclassified text

# Enter 'lr' or 'nb'
model = 'lr'
selected_index = 1

##################################################################

if model == 'lr':
    print(f"\n{lr_misclass['title'].values[selected_index]}\n")
    print(lr_misclass['text'].values[selected_index])
    print("\nView the scanned newspaper on the Papers Past website - use Ctrl + F to search for the article by title")
    print(lr_misclass['newspaper_web'].values[selected_index])
elif model == 'nb':
    print(f"\n{nb_misclass['title'].values[selected_index]}\n")
    print(nb_misclass['text'].values[selected_index])
    print("\nView the scanned newspaper on the Papers Past website - use Ctrl + F to search for the article by title")
    print(nb_misclass['newspaper_web'].values[selected_index])
else:
    print("\nPlease enter either 'lr' or 'nb' for the model")

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 4:</strong> Examine some of the misclassified texts. Why do you think they were misclassified? Do the coefficients of the logistic regression model provide any clues? Discuss with your classmates or tutors.
</div>
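
When following up a misclassified text, the link in the `newspaper_web` column is built from the newspaper title and the issue date. The string handling can be sketched in isolation as below. The function name `papers_past_url` is hypothetical, and the title and date are chosen only to show the transformation, so the resulting address is not guaranteed to point to an existing issue.

```python
def papers_past_url(newspaper, date_iso):
    """Build a Papers Past issue URL from a newspaper title and a YYYY-MM-DD date."""
    # Mirror the notebook's steps: spaces to hyphens, lower case, drop apostrophes and full stops
    slug = (newspaper.replace(" ", "-")
                     .lower()
                     .replace("'", "")
                     .replace(".", ""))
    return ("https://paperspast.natlib.govt.nz/newspapers/"
            + slug + "/" + date_iso.replace("-", "/"))

# Illustrative title and date
example_url = papers_past_url("Hawke's Bay Herald", "1880-01-01")
print(example_url)
```

Titles containing other punctuation would need extra handling, which is one reason a generated link may occasionally fail to resolve.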