# Classification of PubMed manuscripts using Naive Bayes

Previously, we have scraped 26000 manuscript references from [PubMed](https://pubmed.ncbi.nlm.nih.gov/). The dataset is available in our `Data/pubmed/` [folder](https://github.com/chauvu/chauvu.github.io/tree/main/Data/pubmed).

In this project, we want to classify the PubMed references into topic classes based on the abstract text. The abstract is a concise summary of the manuscript and is provided with almost every publication. We will use the Naive Bayes method to perform this classification.

One barrier to this project is the lack of specific topic classes for the manuscripts. Each manuscript entry contains the title, author list, abstract and a list of typically 5-7 keywords. We will make use of these keywords to group manuscripts into topics ([dataset](https://github.com/chauvu/chauvu.github.io/tree/main/Data/pubmed/manuscripts_topics.pkl)); afterward, we will use Naive Bayes to classify each abstract into its corresponding topic class. Notably, Naive Bayes is implemented from scratch in this script instead of using the ready-made Naive Bayes classes in `sklearn`.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
manuscripts = pd.read_csv('../Data/pubmed/manuscripts.csv')
manuscripts = manuscripts[manuscripts['Keywords'].notnull()] # drop entries with null keywords
manuscripts = manuscripts[manuscripts['Abstract'].notnull()] # drop entries with null abstracts
manuscripts.reset_index(drop=True, inplace=True)
manuscripts.head()

Unnamed: 0,Titles,Authors,Journal,Date,PMID,Free Article,Abstract,Keywords
0,Characteristics of the isocitrate dehydrogenas...,"Qu CX, Ji HM, Shi XC, Bi H, Zhai LQ, Han DW.",Brain Behav,2020,32146731,False,OBJECTIVES: To explore the characteristics of ...,Chinese gliomas; IDH mutation; TERT promoter m...
1,"Demographics, Natural History and Treatment Ou...","Savage P, Winter M, Parker V, Harding V, Sita-...",BJOG,2020,32146729,False,"OBJECTIVE: To investigate the demographics, na...",Choriocarcinoma; chemotherapy; demographics; i...
2,Integrated seed proteome and phosphoproteome a...,"Sinha A, Haider T, Narula K, Ghosh S, Chakrabo...",Proteomics,2020,32146728,False,Nutrient dynamics in storage organs is a compl...,2DE; chickpea; mass spectrometry; nutrient; pr...
3,Is R(+)-Baclofen the best option for the futur...,"Echeverry-Alzate V, Jeanblanc J, Sauton P, Blo...",Addict Biol,2020,32146727,False,"For several decades, studies conducted to eval...",GABAB receptor; R(+)-Baclofen; RS(±)-Baclofen;...
4,Association between the dimensions of the maxi...,"Zhang B, Wei Y, Cao J, Xu T, Zhen M, Yang G, C...",J Periodontol,2020,32146722,False,BACKGROUND: The information of the association...,Cone-beam computed tomography; molars; mucosal...


## Generate topics

Each reference entry contains a list of keywords, which name the relevant topics of the manuscript. Although a manuscript is a complex piece of writing that can fall into multiple topics, we will assume for simplicity that these topics are mutually exclusive (which is likely not true, e.g. 'IDH mutation' and 'mutation frequencies' are related). With this assumption, we will count the frequency of each topic and **choose the top 10 most popular topics** for analysis.

In [3]:
manuscripts['Keywords'].iloc[0]

'Chinese gliomas; IDH mutation; TERT promoter mutation; mutation frequencies; overall survival analysis; sanger sequencing'

In [4]:
manuscripts['Keywords'].iloc[1]

'Choriocarcinoma; chemotherapy; demographics; incidence'

To create the frequency table of topics, we split the `Keywords` column into a list of keywords (separated by semicolons). The top 10 topics of our manuscripts are:

   * `Apoptosis` (cell death)
   * `Breast cancer`
   * `Cancer`
   * `Depression`
   * `Epidemiology`
   * `Inflammation`
   * `Obesity`
   * `Oxidative stress`
   * `Prognosis` (likely course of disease)
   * `Quality of life`

In [5]:
manuscripts['Keywords_list'] = manuscripts['Keywords'].apply(lambda x: [y.strip().lower() for y in x.split(';')])
keywords = [k.strip().lower() for ks in list(manuscripts['Keywords_list']) for k in ks]
keywords = pd.Series(keywords)

# top 10 topics
topics = list(keywords.value_counts().head(10).index)
topics.sort()
print(topics)

['apoptosis', 'breast cancer', 'cancer', 'depression', 'epidemiology', 'inflammation', 'obesity', 'oxidative stress', 'prognosis', 'quality of life']


From these 10 topics, we can clearly see that the topics are *not* mutually exclusive. For example, `breast cancer` and `cancer` are related, and `breast cancer` should fall within the larger topic `cancer`. To address this incorrect assumption, we will check the overlap of manuscripts between different topics.

From the frequency of overlaps, `inflammation` and `oxidative stress` have the most overlap, which makes sense because the process of inflammation is typically caused by the generation of reactive oxygen species, leading to oxidative stress. Therefore, I will remove `oxidative stress` as a topic but retain `inflammation`.

Additionally, I will also remove `breast cancer` due to the overlap with the `cancer` topic and remove `epidemiology` as a topic due to its broad definition (incidence, distribution and control of diseases).

In [6]:
overlap = [0] * len(manuscripts)
topics_overlap = [''] * len(manuscripts)
for index, row in manuscripts.iterrows():
    for t in topics:
        if t in row['Keywords_list']:
            overlap[int(index)] += 1
            if topics_overlap[int(index)] == '':
                topics_overlap[int(index)] = t
            else:
                topics_overlap[int(index)] += ', ' + t
manuscripts['Overlap_count'] = overlap
manuscripts['Overlap_topics'] = topics_overlap

# over 1000 manuscripts in these topics
print('Number of overlaps is {}'.format(len(manuscripts.loc[manuscripts['Overlap_count']>0])))
topics_overlap = manuscripts.loc[manuscripts['Overlap_count']>1, 'Overlap_topics']
print(topics_overlap.value_counts()[:5])

Number of overlaps is 1880
inflammation, oxidative stress    21
breast cancer, prognosis          11
apoptosis, oxidative stress        8
inflammation, obesity              8
epidemiology, obesity              8
Name: Overlap_topics, dtype: int64
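
As an aside, the per-row matching in the loop above boils down to intersecting each manuscript's keyword set with the topic list. Below is a minimal sketch of that idea on a single keyword string. The helper `matched_topics` and the example inputs are hypothetical and not part of the original notebook.

```python
def matched_topics(keywords_field, topics):
    """Return the sorted topics that appear among a manuscript's keywords."""
    # split on semicolons and normalise case/whitespace, as done for Keywords_list
    keywords = {k.strip().lower() for k in keywords_field.split(';')}
    return sorted(keywords & set(topics))

# hypothetical example inputs
example_topics = ['inflammation', 'obesity', 'oxidative stress']
hits = matched_topics('Inflammation; Oxidative stress; aging', example_topics)
print(len(hits), ', '.join(hits))
```

Here `len(hits)` plays the role of `Overlap_count` and the joined string plays the role of `Overlap_topics`.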


In [7]:
# drop these topics
topics.remove('oxidative stress')
topics.remove('epidemiology')
topics.remove('breast cancer')

Now that we have the topics: `apoptosis`, `cancer`, `depression`, `inflammation`, `obesity`, `prognosis` and `quality of life`, we restrict the analysis to the subset of manuscripts within these topics. Additionally, any manuscript tagged with more than one topic is removed from the dataset. Because `Overlap_count` was computed against the original 10 topics, this also drops manuscripts that pair a retained topic with one of the 3 removed topics. All columns are removed except for the `Topic` and `Abstract` columns.

In [8]:
manuscripts = manuscripts.loc[manuscripts['Overlap_count']==1] # only 1 topic
manuscripts.reset_index(drop=True, inplace=True)

abstract_topic = [' '] * len(manuscripts)
for index, row in manuscripts.iterrows():
    for t in topics:
        if t in row['Keywords_list']:
            abstract_topic[int(index)] = t
            break
abstract_topic = [at.strip() for at in abstract_topic]
manuscripts['Topic'] = abstract_topic
manuscripts = manuscripts[manuscripts['Topic']!='']
manuscripts = manuscripts[['Abstract','Topic']]
manuscripts.reset_index(drop=True, inplace=True)

In [9]:
manuscripts.head()

Unnamed: 0,Abstract,Topic
0,BACKGROUND: The purpose of this prospective st...,inflammation
1,INTRODUCTION: Each dermatological condition as...,quality of life
2,An abdominal aortic aneurysm (AAA) is a relati...,inflammation
3,"For several years, the number of studies on th...",depression
4,Neuroblastoma (NB) is the common pediatric tum...,apoptosis


In [10]:
print(manuscripts['Topic'].value_counts())

inflammation       209
depression         197
apoptosis          196
prognosis          186
obesity            180
quality of life    161
cancer             130
Name: Topic, dtype: int64


Finally, we retain 1259 manuscripts. The dataset is slightly unbalanced, with the most popular topic `inflammation` containing 209 entries while the least popular topic `cancer` contains 130 entries. This dataframe is written into a pickle file `manuscripts_topics.pkl` in our Data folder for further analysis.

In [11]:
manuscripts.to_pickle('../Data/pubmed/manuscripts_topics.pkl')

## Classification into topics

We start off by cleaning the text in the `Abstract` column: lowercasing everything, replacing punctuation with spaces, and removing digits and words shorter than 5 characters to retain only *complex* words. An example of a processed abstract is shown below. The removals leave a variable number of spaces between words, but since Naive Bayes uses a *bag of words* approach, the spacing does not matter.

Afterward, we split the data into training and testing sets (20% test).

In [12]:
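# lower() and replace all punctuation with spaces
# (regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default)
manuscripts['Abstract'] = manuscripts['Abstract'].str.lower().str.replace(r'[\W]', ' ', regex=True)
# remove words shorter than 5 characters (only complex words)
manuscripts['Abstract'] = manuscripts['Abstract'].str.replace(r'\b\w\w?\w?\w?\b', '', regex=True)
# remove remaining numbers (5 digits or longer)
manuscripts['Abstract'] = manuscripts['Abstract'].str.replace(r'\b\d+\b', '', regex=True)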

In [13]:
manuscripts.iloc[0]['Abstract']

'background   purpose   prospective study   compare  changes  periodontal somatosensory function  microcirculation  patients  periodontitis following initial treatment  scaling   planing      without adjuvant laser therapy methods  twenty  patients suffering  periodontitis  recruited  randomly allocated   split mouth design  either  combined laser therapy           control     treatments  performed    investigator   single visit  laser doppler flowmetry     quantitative sensory testing     performed  baseline            weeks       weeks    after treatment   sides   attached gingiva   maxillary lateral incisor  clinical examination including pocket probing depth     bleeding  probing     performed          sides    analyzed    analysis  variance  anova  results      significantly improved after treatment          values  significantly decreased   sides   follow   points        temperature  increased             whereas there   significant change   control          significantly  sensit
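
The cleaning steps above can be sketched on a single string with the standard `re` module. This is only an illustration: the function name `clean_abstract` and the sample sentence are made up, and the patterns mirror the ones used in the cell above.

```python
import re

def clean_abstract(text):
    """Lowercase, strip punctuation, and drop short words and numbers."""
    text = re.sub(r'\W', ' ', text.lower())   # punctuation -> spaces
    text = re.sub(r'\b\w{1,4}\b', '', text)   # words of 1 to 4 characters
    text = re.sub(r'\b\d+\b', '', text)       # remaining (longer) numbers
    return text.split()                       # split() ignores repeated spaces

sample = "BACKGROUND: The purpose of this study was to compare 12345 patients."
tokens = clean_abstract(sample)
print(tokens)
```

Because `split()` collapses runs of whitespace, the irregular spacing left by the removals has no effect on the resulting bag of words.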

In [14]:
train_df, test_df = train_test_split(manuscripts, test_size=0.2, random_state=1)
# work on explicit copies so that later column assignments modify these dataframes directly
train_df = train_df.reset_index(drop=True).copy()
test_df = test_df.reset_index(drop=True).copy()

Next, we will build the vocabulary space from the training set. Even though the testing set might contain additional words, we will assume that the training set represents the entire word space; any word not in the training set is treated as nonexistent and is ignored.

In [15]:
vocab = set()
vocab_count = {}
for index, row in train_df.iterrows():
    words = row['Abstract'].split() # split by space
    vocab = vocab.union(set(words))
    for w in words:
        if w in vocab_count:
            vocab_count[w] += 1
        else:
            vocab_count[w] = 1
vocab_count_series = pd.Series(vocab_count).sort_values(ascending=False)

In [16]:
print(vocab_count_series.head(10))

patients      1844
study         1566
results       1035
cancer         790
between        759
group          733
associated     659
levels         638
treatment      635
depression     605
dtype: int64


In [17]:
print(vocab_count_series.tail(10))

chictr            1
crocus            1
sativus           1
saffron           1
fourty            1
solvents          1
piperlongumine    1
edible            1
pepper            1
trace             1
dtype: int64


From the frequency table of the vocabulary, we see that the most frequently used words are general words that likely appear in many abstracts. For example, `study` and `results` are extremely general since each scientific paper is an independent *study* and needs to provide *results* to be published in a journal. On the other hand, the bottom words are very rare; words such as `chictr` (Chinese Clinical Trial Registry) or `crocus` (a genus of flowering plants in the iris family) are unfamiliar to most readers.

To avoid the overly general words and the obscure words, we will remove top results with > 400 occurrences and bottom results with < 30 occurrences.

In [18]:
vocab_count_series = vocab_count_series[vocab_count_series>=30]
vocab_count_series = vocab_count_series[vocab_count_series<=400]
vocab = set(list(vocab_count_series.keys()))
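
For readers who prefer the standard library, the counting and frequency filtering above can be sketched with `collections.Counter`. The function `build_vocab`, its parameter names and the toy documents are illustrative assumptions; the notebook itself uses thresholds of 30 and 400.

```python
from collections import Counter

def build_vocab(docs, min_count, max_count):
    """Keep words whose total count lies within [min_count, max_count]."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {w for w, c in counts.items() if min_count <= c <= max_count}

# toy corpus: 'study' is too common, 'growth'/'results'/'depression' too rare
toy_docs = ["tumor growth study", "tumor study results", "study depression"]
toy_vocab = build_vocab(toy_docs, min_count=2, max_count=2)
print(toy_vocab)
```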

After converting `vocab` to a list object and sorting it alphabetically, we can look at a subset of words. We can immediately recognize that a lot of words can be combined, such as `addition`, `additional`, and `additionally`.

In [19]:
vocab_list = list(vocab)
vocab_list.sort()

In [20]:
vocab_list[:20]

['ability',
 'about',
 'accompanied',
 'according',
 'accumulation',
 'accuracy',
 'acids',
 'across',
 'activated',
 'activation',
 'active',
 'activities',
 'activity',
 'acute',
 'addition',
 'additional',
 'additionally',
 'adenocarcinoma',
 'adherence',
 'adjusted']

Words such as `additional` and `additionally` should be merged into `addition` since they are simply variant forms of the same word. This process is a simple form of **lemmatization**. We do this by creating a table `root_words` with two columns: `root` (the root word to keep) and `remove` (a variant of the root word that should be replaced by it).

We assume that the root word and its variants are directly adjacent to each other in the alphabetically-sorted list. This assumption is mostly valid: for example, the plural of a noun usually has an additional `-s` at the end (such as `adult` and `adults`), and the simple past of a verb usually has an `-ed` at the end (like `show` and `showed`). Other relationships are ignored, such as antonyms like `clear` and `unclear`. After going through the vocab list, I also manually modified a couple of words.

Afterward, we replace every `remove` word with its root in the training and testing datasets and drop it from the vocabulary.

In [21]:
root_words = []
conjugations = ['s','al','ly','ed','d','ion'] # plural nouns, adjectives, adverbs, past tense
for idx in range(len(vocab_list)-1): # compare current word with next word
    v1 = vocab_list[idx]
    v2 = vocab_list[idx+1]
    for conj in conjugations:
        if v1 + conj == v2:
            # print(v1 + ' ' + v2)
            root_words.append({'root':v1, 'remove':v2})
            break
root_words = pd.DataFrame(root_words, columns=['root','remove'])
# manual modify certain words
root_words.loc[root_words['root']=='additional', 'root'] = 'addition'
root_words.loc[root_words['root']=='clinical', 'root'] = 'clinic'

In [22]:
# remove conjugated word variants from training and testing sets
for i in range(len(root_words)):
    root_word = root_words.loc[i,'root']
    remove_word = root_words.loc[i,'remove']
    train_df['Abstract'] = train_df['Abstract'].str.replace(remove_word,root_word)
    test_df['Abstract'] = test_df['Abstract'].str.replace(remove_word,root_word)
    vocab.remove(remove_word)
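
The adjacent-pair matching can also be sketched without pandas. Note one deliberate difference, which is an assumption of this sketch rather than what the cell above does: the replacement here uses word boundaries (`\b`), so a variant is only replaced when it appears as a whole word, whereas `str.replace` in the cell above also matches inside longer words. The function names and toy vocabulary are hypothetical.

```python
import re

SUFFIXES = ['s', 'al', 'ly', 'ed', 'd', 'ion']  # same suffix list as above

def find_variant_pairs(sorted_vocab, suffixes=SUFFIXES):
    """Map each variant to its root when the two are adjacent in sorted order."""
    pairs = {}
    for current, following in zip(sorted_vocab, sorted_vocab[1:]):
        if any(current + suffix == following for suffix in suffixes):
            pairs[following] = current
    return pairs

def merge_variants(text, pairs):
    """Replace whole-word variants by their roots."""
    for variant, root in pairs.items():
        text = re.sub(r'\b' + re.escape(variant) + r'\b', root, text)
    return text

toy_vocab = sorted(['adult', 'adults', 'show', 'showed', 'tumor'])
pairs = find_variant_pairs(toy_vocab)
merged = merge_variants('adults showed tumor', pairs)
print(pairs, merged)
```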



After lemmatization, we retain a vocabulary of around 700 words. In the training and testing dataframes, we will now create a column for each vocabulary word and track the number of occurrences of each word.

In [23]:
len(vocab)

726

In [24]:
# create a column for each vocab word in training and testing set
for v in vocab: # initialize at 0 for all words
    train_df.loc[:,v] = 0
    test_df.loc[:,v] = 0
for index, row in train_df.iterrows():
    words = row['Abstract'].split()
    for w in words:
        if w in vocab:
            train_df.loc[index,w] = train_df.loc[index,w] + 1
for index, row in test_df.iterrows():
    words = row['Abstract'].split()
    for w in words:
        if w in vocab:
            test_df.loc[index,w] = test_df.loc[index,w] + 1
            
# let's pickle this for later use as well
train_df.to_pickle('../Data/pubmed/naivebayes_train_df.pkl')
test_df.to_pickle('../Data/pubmed/naivebayes_test_df.pkl')



For example, among the columns displayed below, the first abstract of the training dataset contains 1 occurrence of `explore`, while the second contains 1 occurrence each of `however` and `target`. Overall, the training dataset has 1007 rows with 728 columns, and the testing dataset has 252 rows with 728 columns (726 vocab words plus the `Abstract` and `Topic` columns).

In [25]:
print(train_df.iloc[0:2,2:50])

   negative  explore  inhibitor  obtained  patient  suggests  however  \
0         0        1          0         0        0         0        0   
1         0        0          0         0        0         0        1   

   administration  period  older  ...  periodontitis  impact  growth  target  \
0               0       0      0  ...              0       0       0       0   
1               0       0      0  ...              0       0       0       1   

   involved  oxidative  anxiety  database  gender  order  
0         0          0        0         0       0      0  
1         0          0        0         0       0      0  

[2 rows x 48 columns]


In [26]:
print(train_df.shape)
print(test_df.shape)

(1007, 728)
(252, 728)


Now let's compute the two quantities Naive Bayes needs. For each topic, we will calculate:
   * P(topic) `p_topic`, the prior probability of each topic
   * P(word | topic) `p_word_given_topic`, the likelihood of each word given the topic

Since there are so many vocabulary words (726) compared to the number of manuscript abstracts available, many words will never occur in a given topic and would have P(word | topic) equal to 0; multiplying by zero would wipe out the non-zero probabilities of all other words. To avoid this situation, we will implement **Laplace smoothing** with alpha = 1, which ensures the probability of a word given a topic is never exactly 0.

In [27]:
alpha = 1 # Laplace smoothing alpha is 1
n_train = len(train_df)
topics = list(train_df['Topic'].unique())
n_topics = len(topics)
p_topic = {}
p_word_given_topic = {}
for topic in topics:
    train_df_topic = train_df[train_df['Topic']==topic]
    p_topic[topic] = len(train_df_topic) / n_train # probability of this topic
    p_word = {} # probability of each word for this topic
    n_words = train_df_topic.iloc[:,2:].sum(axis=1).sum() # number of words for this topic
    for word in vocab:
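        # NOTE: textbook Laplace smoothing adds alpha * len(vocab) to the denominator so that
        # P(word | topic) sums to 1 over the vocabulary; the outputs shown below use alpha * n_topics
        p_word[word] = (train_df_topic[word].sum() + alpha) / (n_words + alpha * n_topics) # Laplace smoothing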
    p_word_given_topic[topic] = p_word
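
For reference, the cell above adds `alpha * n_topics` to the denominator, whereas the textbook form of Laplace smoothing for a multinomial model adds alpha times the vocabulary size, which makes the word probabilities of a topic sum to 1. Below is a minimal sketch of the textbook form on toy counts. The function `smoothed_word_probs` and the toy data are assumptions for illustration, and using it in the notebook would change the numbers reported below.

```python
def smoothed_word_probs(word_counts, vocab, alpha=1.0):
    """P(word | topic) with add-alpha smoothing over the whole vocabulary."""
    total = sum(word_counts.get(w, 0) for w in vocab)
    denom = total + alpha * len(vocab)  # vocabulary size, not number of topics
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocab}

toy_vocab = ['tumor', 'mortality', 'anxiety']
toy_counts = {'tumor': 3, 'mortality': 1}   # 'anxiety' never seen in this topic
probs = smoothed_word_probs(toy_counts, toy_vocab)
print(probs, sum(probs.values()))
```

The unseen word `anxiety` still receives a small non-zero probability, which is the whole point of the smoothing.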

As we saw before, the dataset is only slightly unbalanced, with `inflammation` accounting for 17% and `cancer` accounting for 9% of the training manuscripts.

In [28]:
p_topic

{'inflammation': 0.16782522343594836,
 'apoptosis': 0.15789473684210525,
 'prognosis': 0.14597815292949354,
 'obesity': 0.14498510427010924,
 'quality of life': 0.13406156901688182,
 'depression': 0.15590863952333664,
 'cancer': 0.09334657398212512}

For example, let's look at the topic `cancer`. Here are the top 10 words (highest `P(word | topic='cancer')`). We see general words like `conclusion` and `studies`, but we also see important topic-related terms such as `tumor` (cancer usually involves the growth of a tumor) and `mortality` (many cancers are life-threatening and thus lead to a higher mortality rate).

In [29]:
sorted(((value,key) for (key,value) in p_word_given_topic['cancer'].items()), reverse=True)[:10]

[(0.010524528942454592, 'conclusion'),
 (0.008996774741130538, 'tumor'),
 (0.008487523340689187, 'disease'),
 (0.00814802240706162, 'studies'),
 (0.00814802240706162, 'among'),
 (0.007978271940247837, 'clinic'),
 (0.007808521473434052, 'outcome'),
 (0.007808521473434052, 'mortality'),
 (0.007469020539806484, 'specific'),
 (0.0072992700729927005, 'their')]

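Now let's calculate the (unnormalized) posterior of each topic for the first row in the testing set. Naive Bayes correctly classifies this manuscript as topic `obesity`. Note that the products are extremely small, and the value for `apoptosis` underflows to exactly 0.0.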

In [30]:
row = test_df.iloc[0] # first row
p_topic_row = {}
for topic in topics:
    p = p_topic[topic] # prior of this topic
    for word in vocab:
        p *= p_word_given_topic[topic][word] ** row[word] # multiply P(word | topic) to power of num_word_occurence
    p_topic_row[topic] = p
print('The posterior probabilities are {} for each topic.'.format(p_topic_row))
row_topic_predicted = max(p_topic_row, key=p_topic_row.get) # topic with maximum posterior
row_topic_observed = row['Topic']
print('\nThe predicted and observed topic for this row is {} and {}.'.format(row_topic_predicted, row_topic_observed))

The posterior probabilities are {'inflammation': 1.6945017779849863e-300, 'apoptosis': 0.0, 'prognosis': 2.672776502143779e-287, 'obesity': 1.0196211598166791e-238, 'quality of life': 8.483031499994382e-290, 'depression': 2.6976390033587065e-288, 'cancer': 1.6681406122475038e-293} for each topic.

The predicted and observed topic for this row is obesity and obesity.
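
Because the products above are so small that one of them underflows to 0.0, a common remedy is to sum log-probabilities instead of multiplying probabilities; the topic with the largest log-score is the same as the topic with the largest product. Below is a minimal sketch under that assumption. The function `log_posteriors` and the two-topic toy model are hypothetical and not taken from the notebook.

```python
import math

def log_posteriors(word_counts, p_topic, p_word_given_topic):
    """Unnormalized log-posterior of each topic for one document."""
    scores = {}
    for topic, prior in p_topic.items():
        score = math.log(prior)
        for word, count in word_counts.items():
            # count * log(p) is the log of p ** count
            score += count * math.log(p_word_given_topic[topic][word])
        scores[topic] = score
    return scores

# toy model with two topics and two words
toy_p_topic = {'a': 0.5, 'b': 0.5}
toy_p_word = {'a': {'x': 0.9, 'y': 0.1}, 'b': {'x': 0.2, 'y': 0.8}}
toy_doc = {'x': 10000, 'y': 2}  # counts large enough that plain products would underflow

scores = log_posteriors(toy_doc, toy_p_topic, toy_p_word)
best = max(scores, key=scores.get)
print(best)
```

The log-scores stay finite even for long documents, so no topic is silently assigned a score of zero.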


Now that the first row has been classified correctly, we will apply our Naive Bayes algorithm to the entire testing dataset.

In [31]:
n_correct = 0
incorrect = [] # list of incorrect classifications
for index, row in test_df.iterrows():
    p_topic_row = {}
    for topic in topics:
        p = p_topic[topic]
        for word in vocab:
            p *= p_word_given_topic[topic][word] ** row[word]
        p_topic_row[topic] = p
    row_topic_predicted = max(p_topic_row, key=p_topic_row.get)
    row_topic_observed = row['Topic']
    if(row_topic_observed == row_topic_predicted):
        n_correct += 1 # number of correct predictions
    else:
        incorrect.append([row_topic_observed, row_topic_predicted]) # observation and prediction for incorrect entries
accuracy = n_correct / len(test_df)
print(accuracy)

0.7579365079365079


Overall, we get an accuracy of 76% on our testing dataset, which is impressive for such a simple algorithm (uniform random guessing among 7 topics would give about 14%). To understand what went wrong, we can look at which pairs of observed and predicted topics are most often confused.

In [32]:
incorrect_df = pd.DataFrame(incorrect, columns=['Observed','Predicted'])
incorrect_df['Combined'] = incorrect_df['Observed'] + ' : ' + incorrect_df['Predicted']
print(incorrect_df['Combined'].value_counts())

depression : quality of life      9
inflammation : apoptosis          5
cancer : apoptosis                5
depression : apoptosis            4
prognosis : apoptosis             3
inflammation : cancer             3
depression : obesity              3
apoptosis : inflammation          3
cancer : depression               3
quality of life : depression      2
cancer : quality of life          2
quality of life : inflammation    2
obesity : depression              2
cancer : prognosis                2
prognosis : cancer                2
obesity : inflammation            1
depression : prognosis            1
prognosis : inflammation          1
obesity : prognosis               1
apoptosis : cancer                1
cancer : inflammation             1
inflammation : prognosis          1
prognosis : depression            1
quality of life : obesity         1
obesity : quality of life         1
inflammation : obesity            1
Name: Combined, dtype: int64
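
Besides counting confused pairs, it can help to look at the fraction of each observed topic that was classified correctly (per-topic recall). Below is a minimal sketch using only the standard library. The function `per_topic_recall` and the toy labels are hypothetical; in the notebook, the observed and predicted labels would have to be collected for every test row, not just the incorrect ones.

```python
from collections import Counter

def per_topic_recall(observed, predicted):
    """Fraction of documents of each observed topic that were predicted correctly."""
    totals = Counter(observed)
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {topic: correct[topic] / totals[topic] for topic in totals}

# toy labels for illustration only
toy_observed = ['cancer', 'cancer', 'obesity', 'obesity', 'obesity', 'depression']
toy_predicted = ['cancer', 'apoptosis', 'obesity', 'obesity', 'depression', 'depression']
recall = per_topic_recall(toy_observed, toy_predicted)
print(recall)
```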


## Conclusion

Firstly, we see that `apoptosis` appears in 4 of the 5 most frequent misclassifications. Topics `cancer` and `apoptosis` are closely related, because cancer involves uncontrolled cell growth and apoptosis is programmed cell death, so abstracts on one often discuss the other. Topics `inflammation` and `apoptosis` are also related because cellular inflammation is often accompanied by cell death. Quite likely, a large number of pathologies involve the death of cells, so `apoptosis` might be too general a topic to include.

Secondly, the most frequent incorrect classification is between `depression` and `quality of life`. In hindsight, we should have noticed that these two topics are very closely related, as depression and other mental health issues very likely lower the quality of life. Another round of topic identification to remove these overlapping topics could be key to improving the performance of our model.

In conclusion, in this work we use Naive Bayes to classify scientific abstracts into different topics. The text is processed by removing short words (a rough proxy for stop-words) and numbers, merging word variants (a simple form of lemmatization), and removing words that appear too frequently or too rarely across the entries. Future work could make use of dedicated NLP libraries to improve this process.