# Notebook 00: Data Preprocessing and baseline models

In this notebook, we inspect the data that will be used for this practice project. We'll also build some models using non-neural-network methods. These models serve as baselines for the neural network model developed later.

>[Notebook 00: Data Preprocessing and baseline models](#scrollTo=1e9uKq4VaKnL)

>>[0.1 Load needed libraries and functions](#scrollTo=P2lmp4MQ2HAZ)

>>[0.2 Load data](#scrollTo=JkQoMaWAbVxe)

>>>[0.2.1 Preprocess data](#scrollTo=-G7b_G8EbY-i)

>>[0.3 Baseline models](#scrollTo=KHaZH3fAB_gj)

>>>[0.3.1 Naive Bayes](#scrollTo=sxJDUxpKCCgu)

>>>[0.3.2 RandomForest](#scrollTo=LDf8uN7-GpuG)

>>>[0.3.3 Baseline model selection](#scrollTo=DNxJruqXKqW2)



## 0.1 Load needed libraries and functions

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

I have also organized some of my own utility functions in a separate `.py` file. Here I download that file from my project repository on GitHub.

In [2]:
!wget https://raw.githubusercontent.com/ZYWZong/ML_Practice_Projects/refs/heads/main/SkimLit_project_practice/SkimLit_utils.py

--2024-12-24 01:21:48--  https://raw.githubusercontent.com/ZYWZong/ML_Practice_Projects/refs/heads/main/SkimLit_project_practice/SkimLit_utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2818 (2.8K) [text/plain]
Saving to: ‘SkimLit_utils.py’


2024-12-24 01:21:48 (35.7 MB/s) - ‘SkimLit_utils.py’ saved [2818/2818]



In [3]:
from SkimLit_utils import *

## 0.2 Load data

Fetch the labeled `PubMed` data from Franck Dernoncourt's GitHub repository.


In [4]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git
!ls pubmed-rct

fatal: destination path 'pubmed-rct' already exists and is not an empty directory.
PubMed_200k_RCT				       PubMed_20k_RCT_numbers_replaced_with_at_sign
PubMed_200k_RCT_numbers_replaced_with_at_sign  README.md
PubMed_20k_RCT


Let's inspect a few samples from the `train.txt` file, which contains the training data. As seen below, each abstract starts with an ID line (`###` followed by a number), and each line of the abstract is labeled as one of `OBJECTIVE`, `METHODS`, `RESULTS`, `BACKGROUND`, or `CONCLUSIONS`. Later, we will preprocess the data and create categorical and quantitative features from it.




In [5]:
# Inspect samples of the training data
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"
read_lines(data_dir+"train.txt")[:16]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

The data is already split into `train`, `dev`, and `test` sets. We load each of them accordingly.

In [6]:
%%time
train_samples = preprocess_data_with_line_numbers(data_dir + "train.txt")
dev_samples = preprocess_data_with_line_numbers(data_dir + "dev.txt")
test_samples = preprocess_data_with_line_numbers(data_dir + "test.txt")
len(train_samples), len(dev_samples), len(test_samples)

CPU times: user 438 ms, sys: 120 ms, total: 559 ms
Wall time: 565 ms


(180040, 30212, 30135)
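The helper `preprocess_data_with_line_numbers` is defined in `SkimLit_utils.py` and is not shown in this notebook. As a rough illustration only, a parser for this file format might look like the sketch below. The function name `parse_abstract_lines`, the lowercasing of the text, and the convention that `total_lines` is the index of the last line of the abstract are all assumptions based on the samples and columns shown in this notebook, not the actual implementation.

```python
def parse_abstract_lines(lines):
    """Turn raw PubMed RCT lines into one dict per labeled line (hypothetical sketch)."""
    samples = []
    current = []  # labeled lines of the abstract currently being read

    def flush():
        # Emit one record per line of the finished abstract, then reset.
        for i, row in enumerate(current):
            label, text = row.split("\t", 1)
            samples.append({
                "label": label,
                "text": text.lower(),             # assumed: text is lowercased
                "line_number": i,
                "total_lines": len(current) - 1,  # assumed convention
            })
        current.clear()

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("###"):  # an ID line starts a new abstract
            flush()
        elif not stripped:              # a blank line ends the abstract
            flush()
        else:
            current.append(stripped)
    flush()  # handle input that does not end with a blank line
    return samples


demo_lines = [
    "###123\n",
    "OBJECTIVE\tTo test X .\n",
    "METHODS\tWe measured Y .\n",
    "\n",
]
demo_samples = parse_abstract_lines(demo_lines)
print(demo_samples[0])
```

Each record carries the position of the line within its abstract, which is what makes positional features possible later on.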

### 0.2.1 Preprocess data

Here, we preprocess the data and create features that will be used to train our models later.

In [7]:
train_df = pd.DataFrame(train_samples)
dev_df = pd.DataFrame(dev_samples)
test_df = pd.DataFrame(test_samples)
train_df.head(16)

| | label | text | line_number | total_lines |
|---|---|---|---|---|
| 0 | OBJECTIVE | to investigate the efficacy of @ weeks of dail... | 0 | 11 |
| 1 | METHODS | a total of @ patients with primary knee oa wer... | 1 | 11 |
| 2 | METHODS | outcome measures included pain reduction and i... | 2 | 11 |
| 3 | METHODS | pain was assessed using the visual analog pain... | 3 | 11 |
| 4 | METHODS | secondary outcome measures included the wester... | 4 | 11 |
| 5 | METHODS | serum levels of interleukin @ ( il-@ ) , il-@ ... | 5 | 11 |
| 6 | RESULTS | there was a clinically relevant reduction in t... | 6 | 11 |
| 7 | RESULTS | the mean difference between treatment arms ( @... | 7 | 11 |
| 8 | RESULTS | further , there was a clinically relevant redu... | 8 | 11 |
| 9 | RESULTS | these differences remained significant at @ we... | 9 | 11 |


In [8]:
# Extract the sentences and create a one-hot encoding of the labels
train_sentences = train_df["text"].tolist()
dev_sentences = dev_df["text"].tolist()
test_sentences = test_df["text"].tolist()

one_hot_encoder = OneHotEncoder(sparse_output=False)
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["label"].to_numpy().reshape(-1, 1))
dev_labels_one_hot = one_hot_encoder.transform(dev_df["label"].to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df["label"].to_numpy().reshape(-1, 1))

print(train_labels_one_hot[:5])

[[0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]]


In [9]:
# Create label encoding to build the baseline models
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_df["label"].to_numpy())
dev_labels_encoded = label_encoder.transform(dev_df["label"].to_numpy())
test_labels_encoded = label_encoder.transform(test_df["label"].to_numpy())

print(train_labels_encoded[:5])

[3 2 2 2 2]
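To read the encoded labels, it helps to recall that `LabelEncoder` assigns integer codes to classes in sorted order. The small self-contained sketch below fits an encoder on the five label names only. The variable names are for illustration and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["OBJECTIVE", "METHODS", "RESULTS", "BACKGROUND", "CONCLUSIONS"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(labels)

# classes_ holds the label names in sorted order; each position is the integer code
class_names = list(demo_encoder.classes_)
print(class_names)
print(list(encoded))

# inverse_transform maps integer codes back to label names
decoded = list(demo_encoder.inverse_transform([3, 2]))
print(decoded)
```

With this ordering, `3` corresponds to `OBJECTIVE` and `2` to `METHODS`, which matches the first rows of the training data. `OneHotEncoder` sorts its categories the same way, so the column positions in the one-hot encoding follow the same order.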


## 0.3 Baseline models

In this section, I build baseline models using the Naive Bayes and Random Forest methods. I will pick the model with the highest accuracy on the development set.

### 0.3.1 Naive Bayes

In [10]:
from sklearn.naive_bayes import MultinomialNB

# Create and fit a Naive Bayes model using the scikit-learn library
model_Naive_Bayes = Pipeline([
  ("tf-idf", TfidfVectorizer()),
  ("clf", MultinomialNB())
])

model_Naive_Bayes.fit(X=train_sentences,
            y=train_labels_encoded);

# Get the prediction on the development dataset
Naive_Bayes_preds = model_Naive_Bayes.predict(dev_sentences)

print(Naive_Bayes_preds[:5])

[4 1 3 2 2]


In [11]:
# Perform model evaluations
Naive_Bayes_evaluations = perform_evaluations(y_true=dev_labels_encoded, y_pred=Naive_Bayes_preds, model_name = "Naive Bayes")

Naive_Bayes_evaluations_df = pd.DataFrame(Naive_Bayes_evaluations)

Naive_Bayes_evaluations_df.head()

| | Metric | Naive Bayes |
|---|---|---|
| 0 | accuracy | 72.183238 |
| 1 | precision | 0.718647 |
| 2 | recall | 0.721832 |
| 3 | F1 | 0.698925 |
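`perform_evaluations` also comes from `SkimLit_utils.py` and is not shown here. The sketch below is one plausible way to compute such a table. It assumes that accuracy is scaled to a percentage and that the other metrics are support-weighted averages over the classes, which is consistent with recall equaling accuracy divided by 100 in the output above. The name `evaluate_predictions` is hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate_predictions(y_true, y_pred, model_name):
    """Hypothetical stand-in for perform_evaluations; not the author's code."""
    accuracy = accuracy_score(y_true, y_pred) * 100  # assumed: reported as a percentage
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0  # assumed: weighted averages
    )
    return {
        "Metric": ["accuracy", "precision", "recall", "F1"],
        model_name: [accuracy, precision, recall, f1],
    }


demo_scores = evaluate_predictions([0, 1, 2, 2], [0, 1, 2, 1], "demo")
print(demo_scores)
```

Returning a dictionary of equal-length lists means the result can be passed directly to `pd.DataFrame`, as the evaluation cells in this notebook do.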


### 0.3.2 RandomForest

In addition to the Naive Bayes model, I also try a Random Forest model to see whether it achieves better accuracy.

Ideally, I would try several values of `n_estimators` and select the best-performing ensemble, but in the interest of time I fix the number of trees at 15.
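For reference, the selection over `n_estimators` could be sketched as follows. This is a minimal illustration on toy data, not something run in this notebook. The toy sentences, the candidate sizes, and the variable names are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real sentences and encoded labels
toy_train_x = [
    "we aimed to assess pain", "patients were randomized", "pain scores improved",
    "we aimed to study sleep", "participants were randomized", "sleep scores improved",
]
toy_train_y = [0, 1, 2, 0, 1, 2]
toy_dev_x = ["we aimed to assess sleep", "patients were randomized", "pain scores improved"]
toy_dev_y = [0, 1, 2]

candidate_sizes = [5, 15, 50]  # hypothetical values to compare
dev_accuracy = {}
for n in candidate_sizes:
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", RandomForestClassifier(n_estimators=n, random_state=42)),
    ])
    model.fit(toy_train_x, toy_train_y)
    dev_accuracy[n] = accuracy_score(toy_dev_y, model.predict(toy_dev_x))

# Pick the size with the highest dev accuracy (ties go to the smaller ensemble)
best_n = max(candidate_sizes, key=lambda n: dev_accuracy[n])
print(dev_accuracy, best_n)
```

On the real data, the toy lists would be replaced by `train_sentences`, `train_labels_encoded`, `dev_sentences`, and `dev_labels_encoded`. Training time grows with the number of trees, so each extra candidate adds to the total run time.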

In [12]:
from sklearn.ensemble import RandomForestClassifier

model_Random_Forest = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=15, random_state=42)) # Build a smaller than default ensemble for fast processing.
])

model_Random_Forest.fit(X=train_sentences,
            y=train_labels_encoded);

Random_Forest_preds = model_Random_Forest.predict(dev_sentences)

print(Random_Forest_preds[:5])

[0 0 0 2 2]


In [13]:
Random_Forest_evaluations = perform_evaluations(y_true=dev_labels_encoded, y_pred=Random_Forest_preds, model_name = "Random Forest (15)")

Random_Forest_evaluations_df = pd.DataFrame(Random_Forest_evaluations)

Random_Forest_evaluations_df.head()

| | Metric | Random Forest (15) |
|---|---|---|
| 0 | accuracy | 76.095591 |
| 1 | precision | 0.754801 |
| 2 | recall | 0.760956 |
| 3 | F1 | 0.753103 |


### 0.3.3 Baseline model selection

In [14]:
joined_df = pd.merge(Naive_Bayes_evaluations_df, Random_Forest_evaluations_df, on="Metric")

joined_df.set_index("Metric", inplace=True)

joined_df

| Metric | Naive Bayes | Random Forest (15) |
|---|---|---|
| accuracy | 72.183238 | 76.095591 |
| precision | 0.718647 | 0.754801 |
| recall | 0.721832 | 0.760956 |
| F1 | 0.698925 | 0.753103 |


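Based on the accuracy score on the development set (76.10% versus 72.18%), I select the 15-tree Random Forest model as my baseline model. Note that accuracy is reported as a percentage, while precision, recall, and F1 are reported as fractions between 0 and 1.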