# Chapter 2: Baseline Model
Baseline model is an important step in any machine learning project. It gives data scientists the illusion of progress, imaging that empty dopamine surge when they see some number goes up for a test set that is in no way an indication of the model's performance with real world data after deployment.

Without this baseline, how data scientists suppose to tell the world what kind of progress they have made? How can they tell their boss that they have done something useful? How can they tell their friends that they are not wasting their time? **How can they tell their parents that they are not a failure**? (I swear to god this entire paragraph is generated by Copilot).

In [3]:
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper, IterDataPipe
from transformers import AutoTokenizer
import pandas as pd

from tqdm import tqdm
from transformers import logging


import boto3, json

Remember in the previous chapter I was like "*oh saving the metadata of our dataset is going to be useful*" and you were totally doubting me? Well, here we are, using importing it from our bucket again. I hope you feel proud of yourself :)

In [4]:
logging.set_verbosity_error()

BUCKET_NAME = 'xy-mp-pipeline'
METADATA_KEY = 'data/covid-csv-metadata.json'
bucket = boto3.resource('s3').Bucket(BUCKET_NAME)
metadata = json.loads(bucket.Object(METADATA_KEY).get()['Body'].read())
input_schema = metadata['schema']['input_features']

Yeah, just gonna casually define more constants..

In [5]:

OUTPUT_PATH = 'data/covid-csv'
N_SAMPLES = metadata['dataset_size']
TRAIN_FILES = N_SAMPLES * 4 // 5 // 16 + 1
TEST_FILES = N_SAMPLES // 5 // 16 + 1
BATCH_SIZE = metadata['batch_size']
TRAIN_S3_URL = f's3://{BUCKET_NAME}/{OUTPUT_PATH}/training/'
TEST_S3_URL = f's3://{BUCKET_NAME}/{OUTPUT_PATH}/testing/'
TEST_DATASET_SIZE = metadata['test_size']
TRAIN_DATASET_SIZE = metadata['train_size']
MODEL_OUTPUT_PATH = 'assets/model'

### Create a data pipe
So a data pipe sounds really dirty for some reason. I'm sure most of PyTorch users are already familiar with Datasets, it is a convenient way to load data into our devices for training a batch eachn time. We can also do some preprocessing on the fly, which is very useful when we have a large dataset and we don't want to load all the data into memory at once.

This is how a [pipe](https://pytorch.org/data/main/torchdata.datapipes.iter.html) comes in handy. Instead of loading the entire dataset into memory and then split into batches, we can import the data from a cloud bucket, one batch at a time;. This was also why we split our data into multiple files in the previous chapter.and we can also do some preprocessing on the fly. This is very useful when we have a large dataset and we don't want to load all the data into memory at once.

With this, we can also easily perform data parallel training, which is a very useful technique when we have a large dataset and we want to train our model faster. Well I really hope you are excited about this because it is actually really really cool. Are you psyched? Did you have your hopes up? Well, too bad, we are not doing that here. Well, do I look like someone who can afford a machine with multiple GPUs?

**Recall the dataframe**
>
    RangeIndex: 19454 entries, 0 to 19453
    Data columns (total 22 columns):
    #   Column           Non-Null Count  Dtype 
    ---  ------           --------------  ----- 
    0   headlines        19454 non-null  object
    1   length           19454 non-null  int64 
    2   has_num          19454 non-null  bool  
    3   ner_percent      19454 non-null  int64 
    4   ner_quantity     19454 non-null  int64 
    5   ner_law          19454 non-null  int64 
    6   ner_person       19454 non-null  int64 
    7   ner_product      19454 non-null  int64 
    8   ner_gpe          19454 non-null  int64 
    9   ner_work_of_art  19454 non-null  int64 
    10  ner_date         19454 non-null  int64 
    11  ner_time         19454 non-null  int64 
    12  ner_cardinal     19454 non-null  int64 
    13  ner_org          19454 non-null  int64 
    14  ner_money        19454 non-null  int64 
    15  ner_language     19454 non-null  int64 
    16  ner_ordinal      19454 non-null  int64 
    17  ner_event        19454 non-null  int64 
    18  ner_loc          19454 non-null  int64 
    19  ner_fac          19454 non-null  int64 
    20  ner_norp         19454 non-null  int64 
    21  outcome          19454 non-null  int64 
    dtypes: bool(1), int64(20), object(1)
    memory usage: 3.1+ MB

For this model, let do something fun by combining a pretrained model with some layers of our own. One of the most popular pretrained NLP is BERT, just like our favorite yellow muppet living in the basement apartment on *123 Sesame Street*. While the muppet Bert teaches you words and shit, the pretrained BERT embeds them. We can throw our entire text into BERT and we will get some sweet sweet embeddings. 

To do so, we need to tokenize the text, pretty much the same way you would process for a pre-transformer era NLP model. For that, we can use the BERT autotokenizer from HuggingFace.

Now back to the data pipe, we first load data from S3 using PyTorch's inbuilt [S3FileLoader](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.S3FileLoader.html), and then we
    1. tokenize the news headlines into.. well, tokens padded to a length of 100, together with the attention masks. 
    2. convert the non-text features into tensors
    3. return the `outcome` column as labels

In [6]:
class TextDataset(IterDataPipe):
    '''
    This class is used to load the data from S3 and return the data in the format of (bert_input, tabular_input, label)

    Args:
        s3_data_path: the url of the  directory containing the all the csv files in s3 
        tokenizer: the tokenizer used to tokenize the text
        num_files: the number of files to be loaded from s3
    ''' 
    def __init__(self, s3_data_path: str, tokenizer: AutoTokenizer, num_files: int) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.url_wrapper = IterableWrapper([s3_data_path]).list_files_by_s3().shuffle().sharding_filter()
        self.num_files = num_files

    def __iter__(self) -> tuple[torch.TensorType, ...]:
        '''
            This function defines the iterative behavior of the data pipe, which allows the __next__ method to get the next batch of data samples

            Yields:
                bert_input (torch.TensorType): the input of the BERT model, which is a list of two tensors, the first tensor is the input_ids, the second tensor is the attention_mask
                tabular_input (torch.TensorType): the input of the tabular model, which is a list of tensors
                label (torch.TensorType): the label of the data sample 
        '''
        for _, file in self.url_wrapper.load_files_by_s3():
            temp = pd.read_csv(file)
            label = torch.from_numpy(temp['outcome'].values)
            # For BERT model
            bert_input = []
            tokens = [self.tokenizer(t, padding='max_length', max_length=100, truncation=True, return_tensors='pt') for t in temp['headlines']]
            bert_input.append(torch.cat([e['input_ids'] for e in tokens], dim=0))
            bert_input.append(torch.cat([e['attention_mask'] for e in tokens], dim=0))

            # Tabular features
            tabular_input = [torch.from_numpy(temp[col].values).to(torch.float32).squeeze() for col in temp.columns if col not in ['outcome', 'headlines']]
            yield bert_input, tabular_input, label

    def __len__(self):
        ''' 
            Returns the number of files to be loaded from s3, this method helps tqdm to know the progress of the data loading process.

            Returns:
                num_files (int): the number of files to be loaded from s3
        '''
        return self.num_files



Now, we get to the part where people gets unnecessarily excited about: making the model. The model I had in mind kinda looks like this:

![alt text](images/model_architecture.png "Model Architecture")

In [7]:
from transformers import BertModel

class FakeNewsClassifier(torch.nn.Module):
    ''' 
        This class defines the model architecture of the fake news classifier, which is a combination of BERT model and tabular model.

        Args:
            pretrained_model_name: the name of the pretrained BERT model

        Attributes:
            bert: the pretrained BERT model
            dropout_1: the dropout layer
            linear: the linear layer with input shape equals to BERT embedding
            dropout_2: the dropout layer
            final_linear: the linear layer with input shape equals to the output shape of the tabular model
            relu: the relu activation function
            normalize: the normalization function
            sigmoid: the sigmoid activation function  
    '''
    def __init__(self, pretrained_model_name: str) -> None:
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained_model_name)
        self.dropout_1 = torch.nn.Dropout(0.25)
        self.linear = torch.nn.Linear(768, 12)
        self.dropout_2 = torch.nn.Dropout(0.25)
        self.final_linear = torch.nn.Linear(32, 1)
        self.relu = torch.nn.ReLU()
        self.normalize = torch.nn.functional.normalize
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, bert_input: dict, tabular_input: list):
        ''' 
            This function defines the forward pass of the model, which takes the input of the BERT model and the tabular model, and returns the output of the model.

            Args:
                bert_input (dict): the input of the BERT model, which is a dictionary containing two tensors, the first tensor is the input_ids, the second tensor is the attention_mask, and a boolean parameter for return_dict
                tabular_input (torch.TensorType): tensor for the input of the tabular model

            Returns:
                final_output (torch.TensorType): the output of the model, a single bit logit
        '''
        print(bert_input['input_ids'].shape, bert_input['attention_mask'].shape, tabular_input.shape)
        # Left input path
        _, pooled_output = self.bert(**bert_input)
        dropout_1_output = self.dropout_1(pooled_output)
        linear_output = self.linear(dropout_1_output)
        relu_output = self.relu(linear_output)
        norm1 = self.normalize(relu_output, p=2, dim=1)
        
        # Right input path
        norm2 = self.normalize(tabular_input, p=2, dim=1)
        # Combine two paths
        combined_output = torch.cat([norm1, norm2], dim=1)
        dropout_2_output = self.dropout_2(combined_output)
        final_output = self.final_linear(dropout_2_output)
        return self.sigmoid(final_output)
    

In [8]:
def train_model(pretrained_model_name: str, train_data_url: str, test_data_url: str, train_len: int, test_len:int, train_file_len: int, test_file_len: int, epochs: int, lr: float):
    # Prepare dataloaders
    model = FakeNewsClassifier(pretrained_model_name)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)

    train_df = TextDataset(train_data_url, tokenizer, train_file_len)
    test_df = TextDataset(test_data_url, tokenizer, test_file_len)

    train_loader = DataLoader(train_df, batch_size=1, shuffle=True)
    test_loader = DataLoader(test_df, batch_size=1, shuffle=True)

    # Config device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    # Config optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_function = torch.nn.BCELoss()

    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)

    # Train
    for epoch in range(epochs):
        training_loss = 0.0
        training_acc = 0.0

        for bert_input, tabular_input, label in tqdm(train_loader):
            bert_input = {
                'input_ids': bert_input[0].squeeze().to(device),
                'attention_mask': bert_input[1].squeeze().to(device),
                'return_dict': False
            }
            tabular_input = torch.cat(tabular_input).T.to(device)
            label = label.T.to(device)

            output = model(bert_input, tabular_input)

            loss = loss_function(output, label.float())
            training_loss += loss.item()

            # get acc of signmoid output
            acc = (output[0].round() == label).sum().item()
            training_acc += acc

            model.zero_grad()
            loss.backward()
            optimizer.step()

        validation_loss = 0.0
        validation_acc = 0.0

        with torch.no_grad():
            for bert_input, tabular_input, label in test_loader:
                bert_input = {
                    'input_ids': bert_input[0].squeeze().to(device),
                    'attention_mask': bert_input[1].squeeze().to(device),
                    'return_dict': False
                }
                tabular_input = torch.cat(tabular_input).T.to(device)
                label = label.T.to(device)

                output = model(bert_input, tabular_input)

                loss = loss_function(output, label.float())
                validation_loss += loss.item()

                # get acc of signmoid output
                acc = (output[0].round() == label).sum().item()
                validation_acc += acc
        print(f'Epoch: {epoch+1}/{epochs} | Training loss: {training_loss/train_len:.3f} | Training acc: {training_acc/train_len:.3f} | Validation loss: {validation_loss/test_len:.3f} | Validation acc: {validation_acc/test_len:.3f}')

In [10]:
pretrain_name = 'bert-base-uncased'
model = FakeNewsClassifier(pretrain_name)
EPOCHS = 1
LR = 5e-6

train_model(
    pretrain_name, 
    TRAIN_S3_URL, 
    TEST_S3_URL, 
    TRAIN_DATASET_SIZE, 
    TEST_DATASET_SIZE, 
    TRAIN_FILES, 
    TEST_FILES,
    EPOCHS, 
    LR
)

import os
if not os.path.exists(MODEL_OUTPUT_PATH):
    os.makedirs(MODEL_OUTPUT_PATH)

torch.save(model.state_dict(), MODEL_OUTPUT_PATH + '/baseline.pth')
print('Model saved to:  ', MODEL_OUTPUT_PATH)

100%|██████████| 973/973 [03:28<00:00,  4.67it/s]


Epoch: 1/1 | Training loss: 0.038 | Training acc: 0.871 | Validation loss: 0.038 | Validation acc: 0.883
Model saved to:   assets/model


### Load model and use it

In [9]:
import torch 
from transformers import AutoTokenizer
pretrain_name = 'bert-base-uncased'
# load model from output path
model = FakeNewsClassifier(pretrain_name)
model.load_state_dict(torch.load(MODEL_OUTPUT_PATH + '/baseline.pth'))

<All keys matched successfully>

### Measure F1 score from test set

In [6]:


# from sklearn.metrics import f1_score
# tokenizer = AutoTokenizer.from_pretrained(pretrain_name)
# test_ds = TextDataset(TEST_S3_URL, tokenizer, TEST_FILES)
# test_loader = DataLoader(test_ds, batch_size=1, shuffle=True)

# with torch.no_grad():
#     outputs = []
#     labels = []
#     for bert_input, tabular_input, label in tqdm(test_loader):
#         bert_input = {
#             'input_ids': bert_input[0].squeeze(),
#             'attention_mask': bert_input[1].squeeze(),
#             'return_dict': False
#         }
#         tabular_input = torch.cat(tabular_input).T
#         label = label.T

#         outputs.append(model(bert_input, tabular_input))
#         labels.append(label)

# output = torch.cat(outputs, dim=0)
# label = torch.cat(labels, dim=0)
# print(f'F1 score: {f1_score(output.round().detach().numpy(), label.detach().numpy())}')

100%|██████████| 244/244 [03:42<00:00,  1.10it/s]

F1 score: 0.8316017316017316





### Make inference with model and user input

In [16]:
import spacy
from collections import Counter
import re
nlp = spacy.load('en_core_web_sm')
tokenizer = AutoTokenizer.from_pretrained(pretrain_name)

def preprocess_headline(headline: str):
    tokens = tokenizer(headline, return_tensors='pt', padding='max_length', truncation=True, max_length=64)
    tokens = {k: v.squeeze() for k, v in tokens.items()}
    input_ids = tokens['input_ids'].unsqueeze(0)
    attention_mask = tokens['attention_mask'].unsqueeze(0)
    bert_input = {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'return_dict': False
    }


    headline_len = len(headline.split())
    has_stats = int(re.match(r'\d', headline) is not None)

    ner_count = [f'ner_{ent.label_}' for ent in nlp(headline).ents]
    ner_count = Counter(ner_count)
    ner_input = get_ner_input(ner_count, input_schema)
    return bert_input, torch.cat([torch.tensor([headline_len, has_stats]), ner_input], dim=0).unsqueeze(0)

    
def get_ner_input(ner_count, input_schema):
    ner_features = [s for s in input_schema if s.startswith('ner_')]
    ner_input = torch.zeros(len(ner_features))
    for i, feat in enumerate(ner_features):
        ner_input[i] = ner_count.get(feat, 0)
    return ner_input
    

def classify(headline: str):
    bert_input, tabular_input = preprocess_headline(headline)
    with torch.no_grad():
        output = model(bert_input, tabular_input)
    return output.round().item()


In [17]:
# notebook text input ui
from ipywidgets import widgets
from IPython.display import display

text = widgets.Textarea(
    value='',
    placeholder='Type something',
    description='Headline:',
    disabled=False,
    layout={'width': '69%', 'height': '69px', 'display': 'flex', 'flex_flow': 'column wrap'}
)

button = widgets.Button(description="Fact Check", layout={'display': 'flex'})
out = widgets.Label(value='Click button to fact check headline!', layout={'display': 'flex', 'flex_flow': 'column wrap', 'align_items': 'flex-end'})

def fact_check(b):
    headline = text.value
    pred = classify(headline)
    
    if pred == 1:
        
        out.value = 'This is fake news!'
    else:
        out.value = 'This is real news!'

button.on_click(fact_check)

vb = widgets.Box([button, out])
display(text)
display(vb)

Textarea(value='', description='Headline:', layout=Layout(display='flex', flex_flow='column wrap', height='69p…

Box(children=(Button(description='Fact Check', layout=Layout(display='flex'), style=ButtonStyle()), Label(valu…