## Methodology

The purpose of the code in this notebook is to take a Pandas Dataframe containing documents and be able to match a before-unseen block of text to our domain-specific documents found in the dataframe.

To perform our matching task, we will use a BERT model which has been previously fine-tuned on Sentence-BERT tasks (outlined in https://arxiv.org/abs/1908.10084 "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"). This model is designed to take in two document embeddings (less than 512 BERT tokens) and compare them using dot-product or cosine similarity.

One weakness of this approach is that off-the-shelf models do not transfer well to unique domains, and so in order to further fine-tune a BERT model for our purposes, we will use the transformer model FLAN T-5 to generate summaries for each of our documents, and then fine-tune the BERT model on a summary-source match task. Inspiration for this approach comes from BEIR's paper outlining its choice to perform synthetic text generation in order to append documents with semantically similar text to improve BM-25 search results (https://arxiv.org/abs/2104.08663 "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models")

Once we have fine-tuned our BERT model on our specific domain, we can now embed the documents using the model and compare new blocks of text with our document store. 

We will be using FAISS to index our documents and perform our matching. For more information on FAISS indices, please visit their github (https://github.com/facebookresearch/faiss)


In [12]:
!pip install huggingface_hub



In [24]:
from sentence_transformers_multiGPU import SentenceTransformer, util, losses, models, datasets, InputExample
from torch import nn
import os
from transformers import AutoTokenizer, AutoModel, T5Tokenizer, T5ForConditionalGeneration
import pandas as pd
import torch
import joblib
import numpy as np
from accelerate import Accelerator
from matplotlib import pyplot as plt
import faiss
from tqdm import tqdm


data = pd.read_csv('swcs_text_data.csv')
# data = joblib.load('fulldataset.joblib')

In [15]:
#This creates a new dataframe column called 'body'
data['body'] = data.text.str.strip()

In [16]:
#This makes sure all of the body text is in the same format before sending it to be encoded
data['body'] = [''.join(map(str, l)) for l in data['body']]

In [17]:
'''Helper function which fetches article info for query-article match. Right now it is formatted to pull the body 
column from the data dataframe, but can be adjusted'''
def fetch_article_info(dataframe_idx):
    info = data.iloc[dataframe_idx]
    meta_dict = dict()
#     meta_dict['Title'] = info['Title']
    meta_dict['Body'] = info['body']
    return meta_dict
    
'''Helper function which encodes the query using the BERT model and then performs a search to try to match the query vector
to the top k most similar articles'''
def search(query, top_k, index, model):
    t=time.time()
    query_vector = model.encode([query])
    if index == index_cosine:
        faiss.normalize_L2(query_vector)
    distances, top_k = index.search(query_vector, top_k)
    print('>>>> Time to return results: {}'.format(time.time()-t))
    top_k_ids = top_k.tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = zip([fetch_article_info(idx) for idx in top_k_ids], distances[0])
    return results

In [34]:
# model_name = 'google/flan-t5-large' #Time to complete 3795 paragraphs w/ num_queries = 5, batch_size = 64, max_query_length = 64: 1:45:06
model_name = 'google/flan-t5-base' #Time to complete 3795 paragraphs w/ num_queries = 5, batch_size = 64, max_query_length = 64: 58:04

device = 'cuda'

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i) < 128)

# Parameters for generation
batch_size = 64 #Batch size
num_queries = 5 #Number of queries to generate for every paragraph
max_length_query = 64   #Max length for output query

corpus = data.body
k = 0

'''Now we create a new tsv which will store our summary-source pairs'''

with open('generated_queries_t5-base(2).tsv', 'w') as fOut:
    for para in tqdm(corpus):
        para = str(corpus.values[k])
        pre_para= 'summarize:'+para
        index = corpus.index[k]
        input_ids = tokenizer.encode(pre_para, truncation=True, return_tensors='pt').to(device)
        outputs = model.generate(
            input_ids=input_ids,
            max_length=max_length_query,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries)

        for i in range(len(outputs)):
            query = tokenizer.decode(outputs[i], skip_special_tokens=True)
            query = _removeNonAscii(query)
            para = _removeNonAscii(para)
            fOut.write("{}\t{}\n".format(query,para))
        k += 1

100%|███████████████████████████████████████| 3795/3795 [57:47<00:00,  1.09it/s]


In [22]:
'''create the training dataset using the tsv we just created.'''

from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
from torch import nn
import os

log = []
train_examples = [] 
with open('generated_queries_t5-base(2).tsv') as fIn:
    for line in fIn:
        try:
            query, paragraph = line.strip().split('\t', maxsplit=1)
            train_examples.append(InputExample(texts=[query, paragraph]))
        except:
            log.append("error")
            print(line)
            pass
    print("The following number of examples could not be appended into your training examples: {} out of {}".format(len(log), len(train_examples)))

	National Counter Terrorism Center (NCTC) Analyst Comments: The Sand Hill Crips consist primarily of black males. They are loosely affiliated with the US-based Crips. As the Bloods, they are located throughout NPP. Their dominant color is blue, and they wear hats tilted to the right, or wear blue-colored shoelaces on the right shoe or roll up their right pant leg. The Sand Hill Crips' main symbol is a six-pointed star. Their primary income is from narcotics distribution, but they are also involved in shootings, theft, prostitution, and other violent crimes. They also serve to carryout low-level actions for the WU  builders.  Very little activity has been heard of in the WU since they posted their manifesto online on W-061.

	Analyst Historical Note: In 1895, ROP liberals, frustrated with the intransigence, urged the liberal President Parnell to advocate for nationwide referendum on universal suffrage. Before he could achieve his goal, President Parnell was assassinated while touring Un

In [25]:
'''Now we fine tune our model using the dataset we created from the tsv'''

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=8)
train_loss = losses.MultipleNegativesRankingLoss(model)
accelerator = Accelerator()

num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], 
          epochs=num_epochs, 
          warmup_steps=warmup_steps, 
          show_progress_bar=True,
         accelerator=accelerator)

'''save our fine-tuned model to disk'''

os.makedirs('search', exist_ok=True)
model.save('search/search-model-t5-base-queries')
#Time to fine tune "sentence-transformers/all-distilroberta-v1": 10 mins 1 sec with 3795 paragraphs w/ 5 queries each (18903)
#Time to fine tune "sentence-transformers/all-distilroberta-v1": 10 mins 1 sec with 3795 paragraphs w/ 5 queries each (18903)



In [39]:
'''load model checkpoint and encode documents again using fine-tuned model'''
model = SentenceTransformer('search/search-model-t5-large-queries')
#7 seconds to encode 3795 paragraphs using fine-tuned distilroberta model
encoded_data = model.encode(data.body.tolist(), show_progress_bar=True)
encoded_data = np.asarray(encoded_data.astype('float32'))

index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
ids = np.array(range(0, len(data)), dtype='int64')
index.add_with_ids(encoded_data, ids)

# faiss.normalize_L2(encoded_data)
# index_cosine = faiss.IndexFlat(768, faiss.METRIC_INNER_PRODUCT)
# index_cosine.add(encoded_data)


faiss.write_index(index, 'body_paragraphs.index')
# faiss.write_index(index_cosine, 'body_paragraphs_cosine.index')

Batches:   0%|          | 0/119 [00:00<?, ?it/s]

In [1]:
import torch

In [11]:
device = 'cuda'

In [26]:
torch.cuda.empty_cache()

In [1]:
# from transformers import AutoTokenizer, T5EncoderModel
# tokenizer = AutoTokenizer.from_pretrained("t5-small")
# model = T5EncoderModel.from_pretrained("t5-small")
# input_ids = tokenizer(
#     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
# ).input_ids  # Batch size 1
# outputs = model(input_ids=input_ids)
# last_hidden_states = outputs.last_hidden_state



from transformers import AutoTokenizer, T5EncoderModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

# Tokenize sentences
encoded_input = tokenizer(data.body.tolist(), padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# print("Sentence embeddings:")
# print(sentence_embeddings)




# '''load model checkpoint and encode documents again using fine-tuned model'''
# model2 = SentenceTransformer('search/search-model-t5-large-queries')
# #7 seconds to encode 3795 paragraphs using fine-tuned distilroberta model
# encoded_data = model2.encode(data.body.tolist(), show_progress_bar=True)
# encoded_data = np.asarray(encoded_data.astype('float32'))

# index2 = faiss.IndexIDMap(faiss.IndexFlatIP(768))
# ids = np.array(range(0, len(data)), dtype='int64')
# index2.add_with_ids(encoded_data, ids)

# # faiss.normalize_L2(encoded_data)
# # index_cosine = faiss.IndexFlat(768, faiss.METRIC_INNER_PRODUCT)
# # index_cosine.add(encoded_data)


# faiss.write_index(index2, 'body_paragraphs.index')
# # faiss.write_index(index_cosine, 'body_paragraphs_cosine.index')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Some weights of the model checkpoint at t5-small were not used when initializing T5EncoderModel: ['decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.2.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.5.layer.1.EncDecAttention.q.weight', 'decoder.block.2.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.layer_norm.weight', 'decoder.block.5.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.2.layer_norm.weight', '

NameError: name 'data' is not defined

In [15]:
sentence_embeddings.shape

torch.Size([2, 512])

In [31]:
from pprint import pprint
import time
index = faiss.read_index('body_paragraphs.index')

query="Who is the WTU"
results=search(query, top_k=5, index=index, model=model)

print("\n")
for result in results:
    print('\t','\n',result)

>>>> Time to return results: 0.009956121444702148


	 
 ({'Body': 'Analyst Comment: The WTU is associated with instances of violence and leads open protests and public demonstrations. The WTU hard core cadre has been known to incite violence that leads directly to riots. Michael K. Shinne (1978-Present) is one of the main organizers and has personally provided numerous hours of instruction to establish a “cadre” to incite acti ve and passive subversive activities within demonstrations. The WTU cadre includes some former members of the ROP military and the fringe media, and can very effectively manipulate emotions to cause the desired effect. These agitators are taught how to evade civil authorities once the action begins and are experts at highlighting to the media the perception of abuses of power by law enforcement. It is assumed the WTU is aligned with the Liberal Progressive Party (LPP), but this cannot be ascertained.'}, 0.64612675)
	 
 ({'Body': "The WTU is promoting a dedicated,

In [29]:
from pprint import pprint
import time

query = "President Canteth recently made the controversial move to make Mr. David Patton the new Governor of the Northern Pineland Province."
results=search(query, top_k=5, index=index, model=model)

print("\n")
for result in results:
    print('\t','\n',result)

>>>> Time to return results: 0.011704444885253906


	 
 ({'Body': 'President Hale-Canteth appointed Mr. David S. Paton as the new Governor of the NPP. This move is controversial considering Mr. Paton is from the opposition party, the Liberal Progressive Party (LPP). This selection comes immediately following the Republic of Pineland (ROP) President ’ s decision to impose martial law in certain counties of the NPP. The martial law is allowable under Presidential Decision Directive-65 (PDD-65) issued on (W-103) and will permit law enforcement to arrest without probable cause and to detain individuals indefinitely. The multiple abductions point to a further'}, 0.72334385)
	 
 ({'Body': 'President Hale-Canteth appointed Mr. David S. Paton as the new Governor of the NPP.  This move is controversial considering Mr. Paton is from the opposition party, the Liberal Progressive Party (LPP).  This selection comes immediately following the Republic of Pineland (ROP) President’s decision to impose 