# Athar bag-of-words baseline naive-bayes - balance and bias

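This notebook explores how different ways of balancing (undersampling) the dataset, and of randomising it before the train/test split, affect the scores of a Bernoulli naive Bayes bag-of-words baseline.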

In [1]:
import pandas as pd

import nltk
from nltk.stem import PorterStemmer

import warnings
warnings.filterwarnings('ignore')

import sklearn
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

from IPython.display import display

In [2]:
import sys
sys.path.append('../')

from citation_sentiment_analysis.datasets.athar import (
    download_and_read_athar_txt_with_sentiment_label,
    filter_long_sentences_from_athar
)
from citation_sentiment_analysis.preprocessing.token_filter import (
    get_default_words_to_include,
    keep_sentence_list_tokens_in
)
from citation_sentiment_analysis.utils.jupyter import printmd
from citation_sentiment_analysis.utils.scoring import train_test_score
from citation_sentiment_analysis.utils.vectorizer import transform_to_counts

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
athar_all_df = filter_long_sentences_from_athar(download_and_read_athar_txt_with_sentiment_label())
athar_all_df.head()

Unnamed: 0,source_paper_id,target_paper_id,sentiment,citation_text,sentiment_label
0,A00-1043,A00-2024,o,We analyzed a set of articles and identified s...,neutral
1,H05-1033,A00-2024,o,Table 3: Example compressions Compression AvgL...,neutral
2,I05-2009,A00-2024,o,5.3 Related works and discussion Our two-step ...,neutral
3,I05-2009,A00-2024,o,(1999) proposed a summarization system based o...,neutral
4,I05-2009,A00-2024,o,We found that the deletion of lead parts did n...,neutral


In [5]:
def balance_by_target_paper_id(df):
    """Keep, per target paper, equally many negative and neutral citations."""
    df = df[df['sentiment'].isin(['n', 'o'])]
    # per target paper: size of the smaller class, or 0 if one class is missing
    min_df = (
        df.groupby(['target_paper_id', 'sentiment']).size()
        .groupby(level=0).agg(lambda x: min(x) if len(x) == 2 else 0)
    )
    df = pd.concat([
        df[df['target_paper_id'] == paper_id].groupby('sentiment').head(min_df[paper_id])
        for paper_id in min_df.index
    ])
    return df

In [6]:
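def balance_by_sentiment_only(df):
    """Undersample: keep all negative rows plus the first equally many neutral rows."""
    df_athar_negative = df[df['sentiment'] == "n"]
    df_athar_neutral_selection = df[df['sentiment'] == "o"][:df_athar_negative.shape[0]]

    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same row order
    return pd.concat([df_athar_negative, df_athar_neutral_selection])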
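
Because the neutral rows are taken from the top of the frame, the selection depends on row order. The sketch below illustrates this on a made-up toy frame (the paper ids `P1` to `P3` and the helper name `undersample_first_n` are hypothetical, not part of the Athar data or the project code). It assumes, as the `head()` output above suggests, that rows arrive grouped by target paper.

```python
import pandas as pd

# Toy data (hypothetical): rows grouped by target paper
toy_df = pd.DataFrame({
    'target_paper_id': ['P1', 'P1', 'P1', 'P1', 'P2', 'P2', 'P3', 'P3'],
    'sentiment':       ['o',  'o',  'o',  'n',  'o',  'n',  'o',  'o'],
})

def undersample_first_n(df):
    # same idea as balance_by_sentiment_only: all negatives + first N neutrals
    negative = df[df['sentiment'] == 'n']
    neutral = df[df['sentiment'] == 'o'][:len(negative)]
    return pd.concat([negative, neutral])

unshuffled = undersample_first_n(toy_df)
shuffled = undersample_first_n(toy_df.sample(frac=1, random_state=42))

# Without shuffling, every selected neutral row comes from the first paper
neutral_papers_unshuffled = set(
    unshuffled[unshuffled['sentiment'] == 'o']['target_paper_id']
)
print(neutral_papers_unshuffled)  # {'P1'}
print(shuffled[['target_paper_id', 'sentiment']])
```

Both variants end up class-balanced; only the unshuffled one is guaranteed to draw its neutral rows from the top of the file.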

In [7]:
words_to_include = get_default_words_to_include()

len(words_to_include)

109442

In [8]:
def athar_df_to_X_y(df):
    citation_texts = df['citation_text']
    citation_tokens = [nltk.word_tokenize(s) for s in citation_texts]
    citation_filtered_tokens = keep_sentence_list_tokens_in(citation_tokens, words_to_include)
    ps = PorterStemmer()
    citation_stemmed_tokens = [[ps.stem(t) for t in tokens] for tokens in citation_filtered_tokens]
    X = transform_to_counts(citation_stemmed_tokens)
    y = df['sentiment'] == 'n'
    return X, y

In [9]:
# regular train/test split, no randomisation
X, y = athar_df_to_X_y(balance_by_sentiment_only(athar_all_df))
score = train_test_score(
    BernoulliNB(), X, y, test_size=0.2,
    scoring='accuracy',
    shuffle=False
)
printmd('\n'.join([
    '### train/test split',
    '* not randomised before undersampling',
    '* not randomised before train/test split',
    '* accuracy: **%s**'
]) % score)

### train/test split
* not randomised before undersampling
* not randomised before train/test split
* accuracy: **0.5357142857142857**
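
A likely explanation for the near-chance accuracy is that the undersampled frame lists all negative rows first and the neutral rows after them, so an unshuffled split takes its test set from the tail. The sketch below reproduces that layout with made-up labels (50 of each class is a hypothetical size) and assumes that `train_test_score` splits the data the way scikit-learn's `train_test_split` does, which the notebook does not show.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels in the order produced by the undersampling:
# negatives (True) first, neutrals (False) after them
y_toy = np.array([True] * 50 + [False] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# shuffle=False: the last 20% of the rows become the test set
_, _, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False
)
print('train negative fraction:', y_train.mean())  # 0.625
print('test negative fraction:', y_test.mean())    # 0.0

# shuffle=True mixes both classes into train and test
_, _, y_train_s, y_test_s = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=42
)
print('shuffled test negative fraction:', y_test_s.mean())
```

In this layout the unshuffled test set contains a single class, while the training set is skewed towards the other one.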

In [10]:
# same as above, but randomise before train/test split
X, y = athar_df_to_X_y(balance_by_sentiment_only(athar_all_df))
score = train_test_score(
    BernoulliNB(), X, y, test_size=0.2,
    scoring='accuracy',
    shuffle=True,
    random_state=42
)
printmd('\n'.join([
    '### train/test split',
    '* not randomised before undersampling',
    '* randomised before train/test split',
    '* accuracy: **%s**'
]) % score)

### train/test split
* not randomised before undersampling
* randomised before train/test split
* accuracy: **0.8125**

In [11]:
# let's do the same with cross validation, first without randomising
X, y = athar_df_to_X_y(balance_by_sentiment_only(athar_all_df))
scores = cross_val_score(
    BernoulliNB(), X, y, cv=5,
    scoring='accuracy'
)
printmd('\n'.join([
    '### cross validation',
    '* not randomised before undersampling',
    '* not randomised before cv train/test split',
    '* accuracy: **%.3f** (std: %.3f)'
]) % (scores.mean(), scores.std()))

### cross validation
* not randomised before undersampling
* not randomised before cv train/test split
* accuracy: **0.780** (std: 0.054)

In [12]:
# cross validation, with randomising before doing the splits
X, y = athar_df_to_X_y(balance_by_sentiment_only(athar_all_df).sample(frac=1, random_state=42))
scores = cross_val_score(
    BernoulliNB(), X, y, cv=5,
    scoring='accuracy'
)
printmd('\n'.join([
    '### cross validation',
    '* not randomised before undersampling',
    '* randomised before cv train/test split',
    '* accuracy: **%.3f** (std: %.3f)'
]) % (scores.mean(), scores.std()))

### cross validation
* not randomised before undersampling
* randomised before cv train/test split
* accuracy: **0.857** (std: 0.020)
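
For a classifier, an integer `cv` makes `cross_val_score` use `StratifiedKFold` without shuffling, which is why the rows are shuffled with `sample(frac=1, ...)` beforehand. An alternative, sketched here on random toy data unrelated to Athar, is to pass a shuffling splitter instead. Note that this only randomises the fold assignment; it does not change which rows the undersampling step selects.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Random binary features and ordered labels (toy data, purely illustrative)
rng = np.random.RandomState(42)
X_toy = rng.randint(0, 2, size=(200, 20))
y_toy = np.array([True] * 100 + [False] * 100)

# Stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_scores = cross_val_score(
    BernoulliNB(), X_toy, y_toy, cv=cv, scoring='accuracy'
)
print('accuracy: %.3f (std: %.3f)' % (toy_scores.mean(), toy_scores.std()))
```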

In [13]:
# let's try to balance also by target paper (i.e. get same number of neg/neutral citations from same paper)
X, y = athar_df_to_X_y(balance_by_target_paper_id(athar_all_df).sample(frac=1, random_state=42))
scores = cross_val_score(
    BernoulliNB(), X, y, cv=5,
    scoring='accuracy'
)
printmd('\n'.join([
    '### cross validation',
    '* special undersampling method: *by target paper*',
    '* not randomised before undersampling',
    '* randomised before cv train/test split',
    '* accuracy: **%.3f** (std: %.3f)'
]) % (scores.mean(), scores.std()))

### cross validation
* special undersampling method: *by target paper*
* not randomised before undersampling
* randomised before cv train/test split
* accuracy: **0.757** (std: 0.029)
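
To make the per-paper balancing concrete, the sketch below applies a copy of `balance_by_target_paper_id` from In [5] to a small made-up frame (the paper ids are hypothetical). Papers that have only one of the two classes are dropped entirely.

```python
import pandas as pd

def balance_by_target_paper_id(df):
    # copy of the notebook function from In [5]
    df = df[df['sentiment'].isin(['n', 'o'])]
    min_df = (
        df.groupby(['target_paper_id', 'sentiment']).size()
        .groupby(level=0).agg(lambda x: min(x) if len(x) == 2 else 0)
    )
    return pd.concat([
        df[df['target_paper_id'] == paper_id]
        .groupby('sentiment').head(min_df[paper_id])
        for paper_id in min_df.index
    ])

# Toy data (hypothetical): P1 has 3 neutral / 1 negative,
# P2 has 1 neutral / 2 negative, P3 has neutral citations only
toy_df = pd.DataFrame({
    'target_paper_id': ['P1'] * 4 + ['P2'] * 3 + ['P3'] * 2,
    'sentiment': ['o', 'o', 'o', 'n', 'o', 'n', 'n', 'o', 'o'],
})

balanced = balance_by_target_paper_id(toy_df)
print(balanced)
print(balanced.groupby(['target_paper_id', 'sentiment']).size())
```

Here `P1` and `P2` each contribute one negative and one neutral row, and `P3` contributes nothing, so the result is balanced both overall and within each remaining paper.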

In [14]:
# go back to balancing by sentiment only, but randomise before doing so (and after)
X, y = athar_df_to_X_y(balance_by_sentiment_only(
    athar_all_df.sample(frac=1, random_state=42)
).sample(frac=1, random_state=42))
scores = cross_val_score(
    BernoulliNB(), X, y, cv=5,
    scoring='accuracy'
)
printmd('\n'.join([
    '### cross validation',
    '* randomised before undersampling',
    '* randomised before cv train/test split',
    '* accuracy: **%.3f** (std: %.3f)'
]) % (scores.mean(), scores.std()))

### cross validation
* randomised before undersampling
* randomised before cv train/test split
* accuracy: **0.784** (std: 0.020)

In [15]:
# now no undersampling, we're just using roc_auc as the scoring method instead
X, y = athar_df_to_X_y(athar_all_df.sample(frac=1, random_state=42))
scores = cross_val_score(
    BernoulliNB(), X, y, cv=5,
    scoring='roc_auc'
)
printmd('\n'.join([
    '### cross validation',
    '* no undersampling',
    '* randomised before cv train/test split',
    '* roc auc: **%.3f** (std: %.3f)'
]) % (scores.mean(), scores.std()))

### cross validation
* no undersampling
* randomised before cv train/test split
* roc auc: **0.814** (std: 0.007)
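
The switch from accuracy to ROC AUC matters once the classes are left imbalanced. As a minimal illustration (the 10/90 class ratio is made up and is not the Athar ratio), a model that never predicts the minority class still gets a high accuracy, while its ROC AUC stays at chance level.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 10 negative citations (1), 90 others (0)
y_true = np.array([1] * 10 + [0] * 90)

# A useless "model" that gives every sample the same score
constant_scores = np.zeros(len(y_true))
predictions = (constant_scores > 0.5).astype(int)  # always predicts 0

acc = accuracy_score(y_true, predictions)
auc = roc_auc_score(y_true, constant_scores)
print('accuracy:', acc)  # 0.9
print('roc auc:', auc)   # 0.5
```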

## Conclusion

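Randomising the data is very important, and it is most effective at the start, before undersampling.
Without shuffling, the plain train/test split reaches an accuracy of only 0.536, compared with 0.8125 with shuffling.
Shuffling only after undersampling gives a cross-validation accuracy of 0.857; shuffling before undersampling as well lowers it to 0.784, which suggests that the first neutral rows of the unshuffled dataset are not a representative sample.

Additionally removing bias by balancing per target paper id (0.757) produces a similar result to undersampling by class on shuffled data (0.784).

The ROC AUC result without undersampling (0.814) suggests that there is potential to learn from more data.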