# **Training Notebook (NovaSBE X GregoryAI)**

![Train and tune pipeline diagram](../images/train_tune_pipeline_diagram.png)

## 1. Import libraries

In [1]:
import os
import sys
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datetime import datetime

# Add the parent directory of code_utils to the Python path
sys.path.append(os.path.abspath(os.path.join('..')))

from code_utils.text_utils import *  # Import everything from text_utils.py
from code_utils.model_utils.LSTM_algorithm_utils import *  
from code_utils.model_utils.BERT_algorithm_utils import *  
from code_utils.model_utils.LGBM_algorithm_utils import *  
from code_utils.model_utils.classify_model_choose import *
from code_utils.download_utils import * 
from code_utils.pseudo_utils.utils_pseudo import *



## 2. Download articles

In [None]:
# load the previously downloaded data (os.path.join keeps the path portable across operating systems)
old_articles_path = os.path.join('..', 'data', 'articles_08-06-2024_14h13m04s.csv')
url = 'https://api.gregory-ms.com/articles/?format=json'
fetch_all_articles(url, old_articles_path, 'all')

## 3. Clean and Preprocess

In [2]:
dataset_path = os.path.join('../data/2024-05-07',  # choose the date folder you intend to use
                            'train_articles.csv')

# additional step to ensure the index column is consistently formatted as article_id
articles_df = pd.read_csv(dataset_path)

# if the first column is not article_id, remove that first column

if articles_df.columns[0] != 'article_id':
    articles_df = articles_df.drop(columns=articles_df.columns[0])

articles_clean_df = load_and_format_dataset(dataset_path, text_cleaning_pd_series)

articles_clean_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[label_column] = data[label_column].apply(lambda x: 1 if x is True else (0 if x is False else 'unlabeled'))


| article_id | full_text_clean | relevant |
|---|---|---|
| 283515 | prevalence stress urinary incontinence urge ur... | unlabeled |
| 283512 | radiologic lag brain mri lesion dynamics attac... | unlabeled |
| 283510 | motor function multiple sclerosis assessed nav... | unlabeled |
| 283508 | additive value complementing diagnostic idiopa... | unlabeled |
| 283507 | australian headache epidemiology data ahead pi... | unlabeled |


In [3]:
articles_clean_df.relevant.value_counts()

relevant
unlabeled    21676
0             1267
1              952
Name: count, dtype: int64

## 4. Data Split into train, validation and test sets

In [4]:
# let's divide the articles_clean_df into labelled and unlabelled data

unlabelled_df = articles_clean_df[articles_clean_df.relevant == 'unlabeled']
labelled_df = articles_clean_df[articles_clean_df.relevant != 'unlabeled']

In [5]:

relevant_column = 'relevant'

# First split: 85% train_val and 15% test
train_val_df, test_df = train_test_split(
    labelled_df,
    test_size=0.15,
    stratify=labelled_df[relevant_column],
    random_state=69
)

# Second split: ~82.35% train and ~17.65% val from train_val_df
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.1765,  # 0.1765 * 0.85 ≈ 0.15 of the original dataset
    stratify=train_val_df[relevant_column],
    random_state=69
)

# Verifying the splits
print("Train set distribution:")
print(train_df[relevant_column].value_counts(normalize=True))
print("\nValidation set distribution:")
print(val_df[relevant_column].value_counts(normalize=True))
print("\nTest set distribution:")
print(test_df[relevant_column].value_counts(normalize=True))

# Check the number of articles in each set
print(f"Number of articles in the training set: {len(train_df)}")
print(f"Number of articles in the validation set: {len(val_df)}")
print(f"Number of articles in the test set: {len(test_df)}")


Train set distribution:
relevant
0    0.571153
1    0.428847
Name: proportion, dtype: float64

Validation set distribution:
relevant
0    0.570571
1    0.429429
Name: proportion, dtype: float64

Test set distribution:
relevant
0    0.570571
1    0.429429
Name: proportion, dtype: float64
Number of articles in the training set: 1553
Number of articles in the validation set: 333
Number of articles in the test set: 333
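The arithmetic behind the two-stage split can be checked directly. The short sketch below only uses the fractions and counts shown above; the variable names are illustrative and not part of the notebook's utilities.

```python
# Fractions used in the two-stage split above
test_fraction = 0.15             # first split: share of all labelled articles
val_share_of_train_val = 0.1765  # second split: share of the remaining 85%

train_val_fraction = 1 - test_fraction
val_fraction = val_share_of_train_val * train_val_fraction
train_fraction = (1 - val_share_of_train_val) * train_val_fraction

print(f"train: {train_fraction:.4f}")
print(f"val:   {val_fraction:.4f}")
print(f"test:  {test_fraction:.4f}")

# Counts reported in the output above
n_train, n_val, n_test = 1553, 333, 333
n_total = n_train + n_val + n_test
print(f"observed shares: {n_train / n_total:.4f}, {n_val / n_total:.4f}, {n_test / n_total:.4f}")
```

Validation and test each end up at roughly 15% of the labelled articles, with the remaining ~70% used for training.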


## 5. Pseudo Labelling

Here you may choose how to perform the pseudo-labelling task:

1 - Self-training with a traditional ML model (this example uses LogisticRegression, but you can test others)

2 - Co-training, a different type of pseudo-labelling that combines two traditional machine learning models

3 - The BERT uncased model

We recommend BERT for the best results, but it can be computationally intensive and tricky to run.
Below are the sections for the different approaches.

### 5.1 Using BERT

In [8]:
# choose max_length for BERT

max_len = 128  # per the report, 400 would be optimal, but 128 is used here for computational reasons
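To make the effect of `max_len` concrete, here is a toy illustration of what a fixed sequence length implies: longer inputs are cut and shorter ones are padded, with an attention mask marking the real tokens. This is a simplified sketch, not the tokenizer's actual implementation (a real BERT tokenizer also keeps the special end token when truncating); `pad_or_truncate` and `PAD_ID` are hypothetical names used only for this example.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding id for this toy example


def pad_or_truncate(token_ids: List[int], max_len: int) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # drop everything past max_len
    n_pad = max_len - len(kept)                     # how many padding slots are needed
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

A smaller `max_len` therefore trades away the tail of long abstracts in exchange for lower memory use and faster training.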

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = create_bert_uncased_model(max_len = max_len)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [12]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the sliced frames
labelled_df = labelled_df.assign(text_processed=labelled_df['full_text_clean'])
unlabelled_df = unlabelled_df.assign(text_processed=unlabelled_df['full_text_clean'])
train_df = train_df.assign(text_processed=train_df['full_text_clean'])
val_df = val_df.assign(text_processed=val_df['full_text_clean'])
test_df = test_df.assign(text_processed=test_df['full_text_clean'])


In [None]:
labelled_train_df = train_df[['text_processed', 'relevant']]
val_df_pseudo = val_df[['text_processed', 'relevant']]
unlabelled_data_pseudo = unlabelled_df[['text_processed']]

labelled_train_df, X_val, y_val, y_pred_val = bert_iterative_training(
    model=bert_model,
    tokenizer=tokenizer,
    unlabelled_data_pseudo=unlabelled_data_pseudo,
    labelled_train_df=labelled_train_df,
    val_df_pseudo=val_df_pseudo,
    max_len=max_len,
    confidence_threshold=0.9,
    max_iterations=10
)


### 5.2 Using Traditional ML + vectorizer (Self-Training)
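As a rough idea of what self-training with a vectorizer and LogisticRegression can look like, here is a minimal sketch. It is not the implementation in `code_utils.pseudo_utils`; the function `self_train`, its parameters, and the toy texts are assumptions made for illustration, and TF-IDF is assumed as the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def self_train(labelled_texts, labels, unlabelled_texts,
               confidence_threshold=0.9, max_iterations=10):
    """Iteratively move confidently predicted unlabelled texts into the training set."""
    texts, y = list(labelled_texts), list(labels)
    pool = list(unlabelled_texts)
    vectorizer = TfidfVectorizer()
    vectorizer.fit(texts + pool)  # vocabulary from all text; no labels are used here
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_iterations):
        model.fit(vectorizer.transform(texts), y)
        if not pool:
            break
        proba = model.predict_proba(vectorizer.transform(pool))
        confident = proba.max(axis=1) >= confidence_threshold
        if not confident.any():
            break  # no prediction is confident enough to become a pseudo-label
        predicted = model.classes_[proba.argmax(axis=1)]
        for text, label, keep in zip(pool, predicted, confident):
            if keep:
                texts.append(text)
                y.append(int(label))
        pool = [t for t, keep in zip(pool, confident) if not keep]
    model.fit(vectorizer.transform(texts), y)  # final fit on everything collected
    return model, vectorizer, texts, y


# Toy data, invented for this example only
labelled = ["multiple sclerosis relapse treatment trial", "ms lesion mri progression",
            "football match result", "stock market prices today"]
labels = [1, 1, 0, 0]
unlabelled = ["mri lesion multiple sclerosis", "football stock prices"]

model, vectorizer, texts_out, y_out = self_train(
    labelled, labels, unlabelled, confidence_threshold=0.6, max_iterations=3)
print(len(texts_out), "training examples after self-training")
```

The threshold is lowered to 0.6 in the toy call because a regularized model trained on four short texts rarely reaches 0.9 confidence; on the real dataset a stricter threshold reduces the risk of reinforcing wrong pseudo-labels.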

### 5.3 Using Co-Training
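One possible shape of a co-training loop is sketched below, assuming two models that each see a different feature "view" of the text (word-level and character-level TF-IDF) and that pseudo-labels are only accepted when the confident models do not contradict each other. This is an illustrative variant, not the code in `code_utils.pseudo_utils`; `co_train`, the choice of models, and the toy texts are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB


def co_train(labelled_texts, labels, unlabelled_texts,
             confidence_threshold=0.9, max_iterations=10):
    """Two models on two feature views jointly grow a shared labelled set."""
    texts, y = list(labelled_texts), list(labels)
    pool = list(unlabelled_texts)
    # Two views of the same text: word-level and character-level TF-IDF
    views = [TfidfVectorizer(analyzer="word"),
             TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))]
    models = [LogisticRegression(max_iter=1000), MultinomialNB()]
    for view in views:
        view.fit(texts + pool)
    for _ in range(max_iterations):
        for view, model in zip(views, models):
            model.fit(view.transform(texts), y)
        if not pool:
            break
        proposals = {}  # pool index -> set of labels proposed by confident models
        for view, model in zip(views, models):
            proba = model.predict_proba(view.transform(pool))
            best = proba.argmax(axis=1)
            for i in np.where(proba.max(axis=1) >= confidence_threshold)[0]:
                proposals.setdefault(int(i), set()).add(int(model.classes_[best[i]]))
        # accept an item only if the confident models proposed a single label for it
        accepted = {i: next(iter(labs)) for i, labs in proposals.items() if len(labs) == 1}
        if not accepted:
            break
        for i, label in accepted.items():
            texts.append(pool[i])
            y.append(label)
        pool = [t for i, t in enumerate(pool) if i not in accepted]
    for view, model in zip(views, models):
        model.fit(view.transform(texts), y)  # final fit on everything collected
    return models, views, texts, y


# Toy data, invented for this example only
labelled = ["multiple sclerosis relapse treatment trial", "ms lesion mri progression",
            "football match result", "stock market prices today"]
labels = [1, 1, 0, 0]
unlabelled = ["mri lesion multiple sclerosis", "football stock prices"]

models, views, texts_out, y_out = co_train(
    labelled, labels, unlabelled, confidence_threshold=0.6, max_iterations=3)
print(len(texts_out), "training examples after co-training")
```

Using two different views is what distinguishes this from plain self-training: an example that one model finds easy can become a training signal for the other.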

## 6. Train the Model and Store Model Weights

Here too, you may choose among different options to train the model:

1 - BERT (Pubmed)

2 - LSTM

3 - LGBM (TF-IDF)

Again, we recommend BERT for the best results, but it can be computationally intensive and tricky to run.
Below are the sections for the different approaches.

## 7. Show Model Performance Results
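As a minimal sketch of how performance on the held-out test set could be summarised for a binary relevance classifier, the block below computes standard metrics with scikit-learn. The labels and predictions are toy values invented for illustration, not results from this project.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels and predictions, invented for this example only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # share of predicted-relevant that are relevant
recall = recall_score(y_true, y_pred)        # share of relevant articles that were found
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(cm)
```

In practice `y_true` would come from `test_df['relevant']` and `y_pred` from the trained model's predictions on the same rows.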