# Introduction:
This notebook contains all the scripts to run the training of SciBert model on custom dataset. The workflow is as follows:
- Downloading SCIBERT model from huggingface.
- Creating train and dev sentences for MLM(Mask Language Modelling) Task, using Data_Fine-tuning.xlsx file.
- Train SCI-BERT model using this dataset and save the model in google drive.
- Using, labelled dataset, fine-tune this model on downstream task.
- Create sentence embeddings from Paper_Corpus dataset and compute Cosine Similarity.

# NOTE:
All the files should be under ./drive/MyDrive/Fiverr/sci_bert_training/ folder. Create a Fiverr folder and upload the entire folder into this directory.
Make sure to turn on GPU instance by clicking runtime, then change runtime type and then clicking GPU.

Importing google drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Installing dependancies 

In [2]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
[K     |████████████████████████████████| 2.3MB 9.1MB/s 
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
[K     |████████████████████████████████| 3.3MB 37.7MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
[K     |███████

In [3]:
!pip install sentence-transformers

Collecting sentence-transformers
[?25l  Downloading https://files.pythonhosted.org/packages/cc/75/df441011cd1726822b70fbff50042adb4860e9327b99b346154ead704c44/sentence-transformers-1.2.0.tar.gz (81kB)
[K     |████                            | 10kB 22.8MB/s eta 0:00:01[K     |████████                        | 20kB 16.1MB/s eta 0:00:01[K     |████████████                    | 30kB 14.1MB/s eta 0:00:01[K     |████████████████                | 40kB 13.6MB/s eta 0:00:01[K     |████████████████████▏           | 51kB 7.7MB/s eta 0:00:01[K     |████████████████████████▏       | 61kB 9.0MB/s eta 0:00:01[K     |████████████████████████████▏   | 71kB 8.5MB/s eta 0:00:01[K     |████████████████████████████████| 81kB 5.7MB/s 
Collecting sentencepiece
[?25l  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
[K     |████████████████████████████████| 

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# PATH VARIABLES
These variables contain several paths which will be used by the script.

In [3]:
OUTPUT_DIR_SCI_BERT = "./drive/MyDrive/Fiverr/sci_bert_training/output/"
OUTPUT_MODEL_NAME = "SCIBERT_pretrained_on_patentdata"
TRAIN_PATH = "./drive/MyDrive/Fiverr/sci_bert_training/datafinetuning_train.txt"
VAL_PATH = "./drive/MyDrive/Fiverr/sci_bert_training/datafinetuning_val.txt"
SCIBERT_MODEL = "allenai/scibert_scivocab_uncased"
RANDOM_SEED = 0

# Script to create datafinetuning_train.txt and datafinetuning_val.txt files which will be used by the "Pre-training" of SCIBERT model.
Note: This cell is note required if these two files already exists.

In [None]:
import pandas as pd
import nltk
from tqdm import tqdm
corpus = pd.read_excel("./drive/MyDrive/Fiverr/sci_bert_training/Data_Fine-tuning.xlsx")
corpus = corpus.sample(frac=1, random_state=RANDOM_SEED)
corpus = corpus.dropna()
train_length = corpus.shape[0] * 0.80
val_length = 1 - train_length
train_list = []
dev_list = []
for i in tqdm(range(corpus.shape[0])):
    if i <= train_length:
      for column in ['TTL', 'ABST', 'ACLM']:
          sentences = nltk.tokenize.sent_tokenize(corpus.loc[i, column])
          for sentence in sentences:
              if sentence[0] not in ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']: # Deleting the numbering in front of each sentence.
                  train_list.append(sentence)
    else:
      for column in ['TTL', 'ABST', 'ACLM']:
          sentences = nltk.tokenize.sent_tokenize(corpus.loc[i, column])
          for sentence in sentences:
              if sentence[0] not in ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']: # Deleting the numbering in front of each sentence.
                  dev_list.append(sentence)

print("Total training sentences: ", len(train_list))
textfile = open(TRAIN_PATH, "w")
for element in train_list:
    textfile.write(element + "\n")
textfile.close()

print("Total validation sentences: ", len(dev_list))
textfile = open(VAL_PATH, "w")
for element in dev_list:
    textfile.write(element + "\n")
textfile.close()


100%|██████████| 204/204 [00:00<00:00, 627.42it/s]

Total training sentences:  4261
Total validation sentences:  1037





# MASK LANGUAGE MODELLING task for Pre training of SCIBERT MODEL

NOTE: model will be saved in this directory: ./drive/MyDrive/Fiverr/sci_bert_training/output/SCIBERT_pretrained_on_patentdata

Due to GPU Limitations, the max length of sentences will be 100. If sentence exceeds 100, the rest of sentence will be truncated.

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers import DataCollatorForLanguageModeling, DataCollatorForWholeWordMask
from transformers import Trainer, TrainingArguments
import sys
import gzip
from datetime import datetime


model_name = SCIBERT_MODEL
per_device_train_batch_size = 64

save_steps = 10               #Save model every 1k steps
num_train_epochs = 6            #Number of epochs
use_fp16 = False                #Set to True, if your GPU supports FP16 operations
max_length = 100                #Max length for a text input
do_whole_word_mask = True       #If set to true, whole words are masked
mlm_prob = 15                   #Probability that a word is replaced by a [MASK] token

# Load the model
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


output_dir = OUTPUT_DIR_SCI_BERT+"{}".format(OUTPUT_MODEL_NAME)
print("Save checkpoints to:", output_dir)


##### Load our training datasets

train_sentences = []
train_path = TRAIN_PATH
with gzip.open(train_path, 'rt', encoding='utf8') if train_path.endswith('.gz') else  open(train_path, 'r', encoding='utf8') as fIn:
    for line in fIn:
        line = line.strip()
        if len(line) >= 10:
            train_sentences.append(line)

print("Train sentences:", len(train_sentences))

dev_sentences = []
dev_path = VAL_PATH
with gzip.open(dev_path, 'rt', encoding='utf8') if dev_path.endswith('.gz') else open(dev_path, 'r', encoding='utf8') as fIn:
    for line in fIn:
        line = line.strip()
        if len(line) >= 10:
            dev_sentences.append(line)

print("Dev sentences:", len(dev_sentences))

#A dataset wrapper, that tokenizes our data on-the-fly
class TokenizedSentencesDataset:
    def __init__(self, sentences, tokenizer, max_length, cache_tokenization=False):
        self.tokenizer = tokenizer
        self.sentences = sentences
        self.max_length = max_length
        self.cache_tokenization = cache_tokenization

    def __getitem__(self, item):
        if not self.cache_tokenization:
            return self.tokenizer(self.sentences[item], add_special_tokens=True, truncation=True, max_length=self.max_length, return_special_tokens_mask=True)

        if isinstance(self.sentences[item], str):
            self.sentences[item] = self.tokenizer(self.sentences[item], add_special_tokens=True, truncation=True, max_length=self.max_length, return_special_tokens_mask=True)
        return self.sentences[item]

    def __len__(self):
        return len(self.sentences)

train_dataset = TokenizedSentencesDataset(train_sentences, tokenizer, max_length)
dev_dataset = TokenizedSentencesDataset(dev_sentences, tokenizer, max_length, cache_tokenization=True) if len(dev_sentences) > 0 else None


##### Training arguments

if do_whole_word_mask:
    data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm=True, mlm_probability=mlm_prob)
else:
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=mlm_prob)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=num_train_epochs,
    evaluation_strategy="steps" if dev_dataset is not None else "no",
    per_device_train_batch_size=per_device_train_batch_size,
    eval_steps=save_steps,
    save_steps=100,
    logging_steps=save_steps,
    save_total_limit=1,
    prediction_loss_only=True,
    fp16=use_fp16
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset
)

print("Save tokenizer to:", output_dir)
tokenizer.save_pretrained(output_dir)

trainer.train()

print("Save model to:", output_dir)
model.save_pretrained(output_dir)

print("Training done")

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Save checkpoints to: ./drive/MyDrive/Fiverr/sci_bert_training/output/SCIBERT_pretrained_on_patentdata
Train sentences: 4099
Dev sentences: 1024
Save tokenizer to: ./drive/MyDrive/Fiverr/sci_bert_training/output/SCIBERT_pretrained_on_patentdata


Step,Training Loss,Validation Loss
10,6.3869,5.323237
20,5.0176,4.900414
30,4.7,4.665883
40,4.5059,4.540105
50,4.4532,4.48819
60,4.3352,4.417169
70,4.2291,4.417305
80,4.233,4.418386
90,4.1466,4.366575
100,4.1745,4.338973


Save model to: ./drive/MyDrive/Fiverr/sci_bert_training/output/SCIBERT_pretrained_on_patentdata
Training done


In [None]:
MODEL_PATH = "./drive/MyDrive/Fiverr/sci_bert_training/output/SCIBERT_pretrained_on_patentdata"

# Script to convert Transformer model into Sentence Transformer model and train it for a STS(Sentence Similarity Task) using labelled dataset.
Name of sentence transformer model will be SciBert_finetuning_BIOSSES. 

Note: Model will be saved also in ./drive/MyDrive/Fiverr/sci_bert_training/output/

In [4]:
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer,  LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
import logging
from datetime import datetime
import sys
import os
import gzip
import csv
import pandas as pd

In [5]:
SENTENCETRANSFORMER_MODEL_NAME = 'SciBert_finetuning_BIOSSES'
model_save_path = './drive/MyDrive/Fiverr/sci_bert_training/output/'+SENTENCETRANSFORMER_MODEL_NAME


In [None]:


#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#You can specify any huggingface/transformers pre-trained model here, for example, bert-base-uncased, roberta-base, xlm-roberta-base
model_name = MODEL_PATH
# Read the dataset
train_batch_size = 8
num_epochs = 4


# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device='cuda')

train_samples = []
dataset = pd.read_csv('./drive/MyDrive/Fiverr/sci_bert_training/BIOSSES-Dataset/BIOSSES_Pairs_Scores.csv')
dataset = dataset.sample(frac = 1)
dataset = dataset.dropna()
for i in range(dataset.shape[0]):
  text1 = dataset['Sentence 1'].iloc[i]
  text2 = dataset['Sentence 2'].iloc[i]
  #TAKING THE MAX 
  score_list = [int(dataset['Annotator A'].iloc[i]), int(dataset['Annotator B'].iloc[i]), \
                int(dataset['Annotator C'].iloc[i]), int(dataset['Annotator D'].iloc[i]), int(dataset['Annotator E'].iloc[i])]
  score = max(set(score_list), key = score_list.count)/4.0 # Normalize score to range 0 ... 1
  inp_example = InputExample(texts=[text1, text2], label=score)
  train_samples.append(inp_example)

logging.info("train samples: {}".format(len(train_samples)))

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)

# Configure the training. We skip evaluation in this example
warmup_steps = math.ceil(len(train_dataloader) * num_epochs  * 0.1) #10% of train data for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))


# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=None,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)



Some weights of the model checkpoint at ./drive/MyDrive/Fiverr/sci_bert_training/output/SCIBERT_pretrained_on_patentdata were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ./drive/MyDrive/Fiverr/sci_bert_training/out

2021-06-12 12:18:03 - NumExpr defaulting to 2 threads.
2021-06-12 12:18:03 - train samples: 94
2021-06-12 12:18:03 - Warmup-steps: 5


HBox(children=(FloatProgress(value=0.0, description='Epoch', max=4.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Iteration', max=12.0, style=ProgressStyle(description_wid…

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.





HBox(children=(FloatProgress(value=0.0, description='Iteration', max=12.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='Iteration', max=12.0, style=ProgressStyle(description_wid…




HBox(children=(FloatProgress(value=0.0, description='Iteration', max=12.0, style=ProgressStyle(description_wid…



2021-06-12 12:18:32 - Save model to ./drive/MyDrive/Fiverr/sci_bert_training/output/SciBert_finetuning_BIOSSES


# Comparing patent sentences with the corresponding content of doi paper.

In [6]:
model = SentenceTransformer(model_save_path, device='cuda')

In [7]:
PATH = './drive/MyDrive/Fiverr/sci_bert_training/'
pairs_df = pd.read_excel(PATH + "Paper_Corpus.xlsx", sheet_name=0, header=2)
corpus_df = pd.read_excel(PATH + 'Paper_Corpus.xlsx', sheet_name=1)

This cell extracts all the sentences from "Sentences" folder and the corresponding mapping number from the file number

In [None]:
import glob
import pandas as pd
files = glob.glob(PATH + "/Sentences/*.txt")
def create_sentence_df(file_list):
    sentences_df = pd.DataFrame(columns=['filename', 'sentence', 'mapping'])
    sentences = []
    mapping = []
    for file in file_list:
        with open(file, 'r') as fd:
            sentences.append(fd.readline())
            mapping.append(int(file.split('#')[1].split('_')[0].replace(',', '')))
    sentences_df['filename'] = file_list
    sentences_df['sentence'] = sentences
    sentences_df['mapping'] = mapping
    
    return sentences_df

sentences_df = create_sentence_df(files)

# This cell reads each file in the Sentences folder. Gets the mapping number e.g USPTO-Dokument #8,501,349_1_P_1 will have 8501349. Search that number inside the Paper_Corpus file, get the corresponding doi numbers and extract the texts related to that doi. Compute Cosine Similarity, based on the embeddings by the trained model, and saves the maximum cos similarity sentence.
Doi texts are converted to each sentence. These sentences will be the corpus of that corresponding patent sentence. Each sentence is converted into embeddings, and compared against the patent sentence. Maximum similarity sentence is saved inside the csv in the end

There was a misspelling of doi 10.1017/s1431927614012744. In the Corpus sheet it was 10.1017/S1431927614012744 but in pairs sheet it was 10.1017/s1431927614012744. So i manually changed it.

In [None]:
from tqdm import tqdm
import numpy as np
scores_list = []
top_sentence_list = []
doi_list = []
for INDEX in tqdm(range(sentences_df.shape[0])):
  test = sentences_df.iloc[INDEX, :]['mapping']
  doi_many = pairs_df[pairs_df.iloc[:, 2] == test]['citation external id'].values
  paper_text = []
  query_sentence = [sentences_df.iloc[INDEX, :]['sentence']]
  embeddings1 = model.encode(query_sentence, convert_to_tensor=True)
  max_cos_similariy_ref = -np.Inf
  most_relevant_sentence_ref = ''
  doi_ref = None
  for each_doi in doi_many:
    each_doi_text = corpus_df[corpus_df['DOI'] == each_doi]['Paper_Text'].values.tolist()
    each_doi_sentences = nltk.tokenize.sent_tokenize(each_doi_text[0])
    embeddings2 = model.encode(each_doi_sentences, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
    descending_order_idx = reversed(np.argsort(cosine_scores[0, :].cpu()))
    max_cos_similariy = cosine_scores[0][descending_order_idx[0]]
    most_relevant_sentence = each_doi_sentences[descending_order_idx[0]]
    if max_cos_similariy > max_cos_similariy_ref:
      max_cos_similariy_ref = max_cos_similariy
      most_relevant_sentence_ref = most_relevant_sentence
      doi_ref = each_doi
  scores_list.append(max_cos_similariy_ref.cpu().numpy().tolist())
  top_sentence_list.append(most_relevant_sentence_ref)
  doi_list.append(doi_ref)

In [None]:
sentences_df['cosine_similarity'] = scores_list
sentences_df['doi'] = doi_list
sentences_df['top_sentence'] = top_sentence_list
sentences_df.to_csv(PATH+'scibert_pretrained_finetuned__pred_v1.csv', index=False)

# TSNE: 
Tsne visualization is done by taking the average embedding of patent sentence and corresponding top sentence in the sentences_df. Average embedding, which is 768 dimensions, is then compressed into 2 dimensions by the TSNE algorithm.

Due to large sentences, parts of sentences , in some cases, could not be visualized completely in the graph.

In [54]:
from sklearn.manifold import TSNE
import pandas as pd
import plotly.express as px

In [55]:
model = SentenceTransformer(model_save_path, device='cuda')
PATH = './drive/MyDrive/Fiverr/sci_bert_training/'

In [8]:
sentences_df = pd.read_csv(PATH+'scibert_pretrained_finetuned__pred_v1.csv')
sentence_embeddings = model.encode(sentences_df['sentence'].values)
top_sentence_embeddings = model.encode(sentences_df['top_sentence'].values)
average_embeddings = (sentence_embeddings + top_sentence_embeddings)/2
labels = 'Patent: ' + sentences_df['sentence'].values + '<br>' + 'Top Sci Sent: '+sentences_df['top_sentence'].values

In [52]:

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=2000, random_state=RANDOM_SEED)
tsne_results = tsne.fit_transform(average_embeddings)
tsne_df = pd.DataFrame(columns=['tsne-2d-one', 'tsne-2d-two', 'labels'])
tsne_df['tsne-2d-one'] = tsne_results[:,0]
tsne_df['tsne-2d-two'] = tsne_results[:,1]
tsne_df['labels'] = labels

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 350 samples in 0.017s...
[t-SNE] Computed neighbors for 350 samples in 0.173s...
[t-SNE] Computed conditional probabilities for sample 350 / 350
[t-SNE] Mean sigma: 3.584561
[t-SNE] KL divergence after 250 iterations with early exaggeration: 58.203316
[t-SNE] KL divergence after 1750 iterations: 0.312426


In [53]:

fig = px.scatter(tsne_df, x="tsne-2d-one",  y='tsne-2d-two', hover_name=tsne_df.labels)
fig.show()

In [None]:
import plotly.express as px
#df = px.data.tips()
fig = px.histogram(sentences_df, x="cosine_similarity",  color = 'mapping', hover_data=[sentences_df.doi], hover_name=sentences_df.doi)
fig.show()