
In [1]:
pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import pandas as pd
import numpy as np
import itertools
from itertools import combinations
import torch
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
import re
from sentence_transformers import InputExample, util
from tqdm.auto import tqdm

In [3]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [4]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [5]:
# Change directory
%cd drive/MyDrive
%pwd

/content/drive/MyDrive


'/content/drive/MyDrive'

## ***Unzipping the files to get the datasets***

In [6]:
# import tarfile
# fname = 'collectionandqueries.tar.gz'

# if fname.endswith("tar.gz"):
#     tar = tarfile.open(fname, "r:gz")
#     tar.extractall()
#     tar.close()
# elif fname.endswith("tar"):
#     tar = tarfile.open(fname, "r:")
#     tar.extractall()
#     tar.close()

In [7]:
# import tarfile
# f1name = 'triples.train.small.tar.gz'

# if f1name.endswith("tar.gz"):
#     tar = tarfile.open(f1name, "r:gz")
#     tar.extractall()
#     tar.close()
# elif f1name.endswith("tar"):
#     tar = tarfile.open(f1name, "r:")
#     tar.extractall()
#     tar.close()

In [8]:
# import gzip
# import shutil
# with gzip.open('qidpidtriples.train.full.2.tsv.gz', 'rb') as f_in:
#     with open('qidpidtriples.train.full.2.tsv', 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)

In [65]:
%ls

'=0.0.7'                          queries.eval.tsv
'Colab Notebooks'/                queries.train.tsv
 collection.tsv                   sbert_test_a/
 Final_model/                     sbert_test_b/
'Getting started.pdf'             sbert_test_c/
'IR&WS - Project notes.gdoc'      sbert_test_d/
 logs/                            sbert_test_e/
 qidpidtriples.train.full.2.tsv   sbert_test_f/
 qrels.dev.small.tsv              sbert_test_g/
 qrels.train.tsv                  sbert_test_h/
 queries.dev.small.tsv            sbert_test_ht/
 queries.dev.tsv                  sbert_test_ht1/
 queries.eval.small.tsv           triples.train.small.tsv


## ***Extracting Training Data***

In [10]:
# Extracting a chunk as the file is huge
reader = pd.read_csv('triples.train.small.tsv', iterator=True, chunksize=1000000, sep = '\t', header = None)

train_triples = next(reader)
train_triples.columns = ['query', 'positive_passage', 'negative_passage']

In [11]:
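train_data = []
# Unpack each row positionally: (query, positive passage, negative passage)
for query, positive, negative in train_triples.itertuples(index=False, name=None):
    train_data.append([query, positive, 1])
    train_data.append([query, negative, 0])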
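
The same triple-to-pair expansion can be written with vectorized pandas operations instead of a row loop. This is a minimal sketch on a toy frame, assuming the same three-column layout; `triples_to_pairs` and `toy_triples` are hypothetical names used only for illustration, and the rows come out stacked (all positives, then all negatives) rather than interleaved.

```python
import pandas as pd

def triples_to_pairs(triples_df):
    """Turn (query, positive, negative) rows into labelled (query, passage) rows."""
    positives = pd.DataFrame({
        'query': triples_df.iloc[:, 0],
        'passage': triples_df.iloc[:, 1],
        'relevance': 1,
    })
    negatives = pd.DataFrame({
        'query': triples_df.iloc[:, 0],
        'passage': triples_df.iloc[:, 2],
        'relevance': 0,
    })
    # Stack both halves, then drop exact duplicate rows and renumber the index
    pairs = pd.concat([positives, negatives], ignore_index=True)
    return pairs.drop_duplicates().reset_index(drop=True)

# Hypothetical toy data: one query whose positive passage repeats across two triples
toy_triples = pd.DataFrame([
    ['q1', 'relevant passage', 'unrelated passage A'],
    ['q1', 'relevant passage', 'unrelated passage B'],
])
toy_pairs = triples_to_pairs(toy_triples)
print(toy_pairs)
```

In the toy data the repeated positive pair collapses to one row while both negatives survive. A similar effect may explain why the deduplicated training data has fewer positives than negatives.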

In [12]:
training_data_df = pd.DataFrame(train_data, columns=['query', 'passage', 'relevance'])
# Drop exact duplicate rows and renumber the index from 0
triples = training_data_df.drop_duplicates().reset_index(drop=True)

## ***Balancing the training data***

In [13]:
# define dataset
X = triples.drop('relevance', axis = 1)
y = triples['relevance']
# summarize class distribution
print(Counter(y))
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X, y)
# summarize class distribution
print(Counter(y_under))

Counter({0: 995386, 1: 369374})
Counter({0: 369374, 1: 369374})
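
To make the balancing step concrete, here is a minimal standard-library sketch of random undersampling. It is not the imbalanced-learn implementation; `undersample_majority`, its `seed` parameter and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def undersample_majority(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in sorted(by_class.items()):
        # Keep the smallest class whole; sample the others down to its size
        kept = group if len(group) == target else rng.sample(group, target)
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels

toy_rows = ['p%d' % i for i in range(10)]
toy_labels = [0] * 7 + [1] * 3
balanced_rows, balanced_labels = undersample_majority(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

The sketch returns its rows grouped by class. If the real resampler output is grouped in the same way, shuffling in the data loader matters so that each batch mixes both classes.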


## ***Text Preprocessing***

In [14]:
def text_preprocessing(text):
    """
    - Remove every character except ASCII letters, digits, spaces, newlines and periods
    - Collapse runs of whitespace into a single space and trim both ends
    - Convert text to lower case
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Remove 'special characters'
    text = re.sub(r'[^a-zA-Z0-9 \n.]', '', text)

    # Collapse whitespace runs and strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # converting to lower case
    text = text.lower()

    return text
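
A short usage example shows what the cleaning steps do to a concrete string. `normalize` is a hypothetical stand-in that repeats the same three steps so the snippet runs on its own; the sample sentence is made up.

```python
import re

def normalize(text):
    # Same three steps as text_preprocessing above
    text = re.sub(r'[^a-zA-Z0-9 \n.]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

sample = "  What's the Café's   opening-time?\n"
cleaned = normalize(sample)
print(repr(cleaned))  # 'whats the cafs openingtime'
```

Apostrophes, hyphens, question marks and accented letters are all deleted rather than replaced, so hyphenated words fuse together and non-ASCII words lose characters.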

In [15]:
X_under['query'] = X_under['query'].map(text_preprocessing)
X_under['passage'] = X_under['passage'].map(text_preprocessing)

In [17]:
X_under['relevance'] = y_under
train = X_under

In [18]:
train

| | query | passage | relevance |
|---|---|---|---|
| 0 | what is the purpose of a micelle | what is the purpose of the dotmlpfp change rec... | 0 |
| 1 | when did india get republic | a parisian women march on versailles storming ... | 0 |
| 2 | are onions lethal to cats | cooking onions and sweet onions are better sto... | 0 |
| 3 | how long do chickens molt for | every city is going to have a municipal code r... | 0 |
| 4 | how do they measure auroras | pls advice how to convert square meter to mete... | 0 |
| ... | ... | ... | ... |
| 738743 | who played sandman in spider man | exists and is an alternate of . if a sword had... | 1 |
| 738744 | is uninsured motorist insurance required in vi... | if you have auto insurance on the vehicle you ... | 1 |
| 738745 | who played velma in scoobydoo | velma in the scooby doo movie is played by lin... | 1 |
| 738746 | how is the mountain pine beetle affecting cana... | the mountain pine beetle is a naturally occurr... | 1 |


## ***Creating Dataloader***

In [19]:
train_samples = []
for index, row in tqdm(train.iterrows()):
    train_samples.append(InputExample(texts = [row['query'], row['passage']],
                                      label = [row['relevance']]))


In [20]:
from torch.utils.data import DataLoader

batch_size = 64

loader = DataLoader(train_samples, batch_size = batch_size, shuffle = True)

## ***Loading model***

In [21]:
from sentence_transformers import models, SentenceTransformer

model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

## ***Defining Loss***

In [22]:
from sentence_transformers import losses

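# MultipleNegativesRankingLoss does not read InputExample.label: every (query, passage)
# pair in a batch is treated as a match, and the other passages in the batch act as negatives.
loss = losses.MultipleNegativesRankingLoss(model = model)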
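
The idea behind this loss can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the library code: `mnr_loss` is a hypothetical function, the embeddings are toy values, and cosine similarity with `scale=20.0` is assumed to mirror the library defaults.

```python
import numpy as np

def mnr_loss(query_emb, passage_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities, using in-batch negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                            # shape: (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The "correct" passage for query i is passage i, i.e. the diagonal
    return float(-np.mean(np.diag(log_probs)))

aligned_q = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_p = np.array([[0.9, 0.1], [0.1, 0.9]])
swapped_p = aligned_p[::-1]   # each query is now paired with the wrong passage

loss_aligned = mnr_loss(aligned_q, aligned_p)
loss_swapped = mnr_loss(aligned_q, swapped_p)
print(loss_aligned < loss_swapped)
```

Because only the diagonal counts as correct, a pair with relevance 0 fed through this loss is pulled together like any other pair. Training on positive pairs only, or on (query, positive, negative) triplets, would match the loss's assumptions more closely.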

## ***Training the model***

In [23]:
epochs = 1
warmup_steps = int(0.1 * len(loader))

model.fit(
    train_objectives = [(loader, loss)],
    epochs = epochs,
    warmup_steps = warmup_steps,
    output_path = './Final_model'
)



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/11543 [00:00<?, ?it/s]
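
The iteration count in the progress bar follows from the dataset size and the batch size. A small worked example, assuming the balanced class counts printed earlier and the default data-loader behaviour of keeping the last partial batch:

```python
import math

n_pairs = 369374 * 2      # balanced class counts printed earlier
batch_size = 64

# The last, smaller batch is kept, so round up
steps_per_epoch = math.ceil(n_pairs / batch_size)
warmup_steps = int(0.1 * steps_per_epoch)

print(steps_per_epoch, warmup_steps)  # 11543 1154
```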

## ***Extracting test data***

In [24]:
reader_test = pd.read_csv('triples.train.small.tsv', iterator=True, skiprows = 1000000, chunksize=10000, sep = '\t', header = None)

test_triples= next(reader_test)
test_triples.columns = ['query', 'positive_passage', 'negative_passage']
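
The sketch below shows how `skiprows` and `chunksize` select different rows of the same file, so the test chunk starts after the rows used for training. It uses a ten-line in-memory stand-in for the triples file; `toy_tsv` and the chunk sizes are illustrative assumptions.

```python
import io
import pandas as pd

# Ten-row stand-in for the triples file
toy_tsv = ''.join('q%d\tpos%d\tneg%d\n' % (i, i, i) for i in range(10))

# First chunk of 6 rows, analogous to the training chunk
train_chunk = next(pd.read_csv(io.StringIO(toy_tsv), sep='\t', header=None,
                               iterator=True, chunksize=6))
# Skip those 6 rows, then read a chunk of 3, analogous to the test chunk
test_chunk = next(pd.read_csv(io.StringIO(toy_tsv), sep='\t', header=None,
                              iterator=True, skiprows=6, chunksize=3))

print(train_chunk[0].tolist())
print(test_chunk[0].tolist())
```

The two chunks cover different rows of the file. This does not guarantee that the same query or passage text never appears in both.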

In [25]:
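test_data = []
# Unpack each row positionally: (query, positive passage, negative passage)
for query, positive, negative in test_triples.itertuples(index=False, name=None):
    test_data.append([query, positive, 1])
    test_data.append([query, negative, 0])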

In [26]:
test_data_df = pd.DataFrame(test_data, columns=['query', 'passage', 'relevance'])
# Drop exact duplicate rows and renumber the index from 0
test = test_data_df.drop_duplicates().reset_index(drop=True)

In [27]:
test['query'] = test['query'].map(text_preprocessing)
test['passage'] = test['passage'].map(text_preprocessing)

In [28]:
test

| | query | passage | relevance |
|---|---|---|---|
| 0 | why disease in silk road | people got very sick some even died because of... | 1 |
| 1 | why disease in silk road | the capital of afghanistan is its largest city... | 0 |
| 2 | does any rental car company offer unlimited on... | you can easily pick up and return you vehicle ... | 1 |
| 3 | does any rental car company offer unlimited on... | could be your best value in getting an unlimit... | 0 |
| 4 | how much to reimburse for mileage | the irs standard mileage rate is the rate prov... | 1 |
| ... | ... | ... | ... |
| 19814 | when did nyan cat come out | what is category 5 5e 6 cat5 cat5e cat6 this i... | 0 |
| 19815 | how to find molar mass of cyclohexane | i also taught masters students at imperial for... | 1 |
| 19816 | how to find molar mass of cyclohexane | answers. best answer you have enough informati... | 0 |
| 19817 | honda dealers in tampa fl area | welcome to kuhn honda. honda vehicles span nea... | 1 |


## ***Creating test dataloader***

In [29]:
test_samples = []
for index, row in tqdm(test.iterrows()):
    test_samples.append(InputExample(texts = [row['query'], row['passage']],
                                      label = [row['relevance']]))


In [30]:
batch_size = 64
test_loader = DataLoader(test_samples, batch_size = batch_size, shuffle = True)

## ***Defining evaluator***

In [31]:
query_ls = test['query'].tolist()
passage_ls = test['passage'].tolist()
label_ls = test['relevance'].tolist()

In [32]:
from sentence_transformers.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(query_ls, passage_ls, label_ls, name = 'result', show_progress_bar= True)

## ***Evaluating the trained model on test data***

In [35]:
trained_model = SentenceTransformer('./Final_model')

In [36]:
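epochs = 1
# Warm up over 10% of the steps in the loader that is actually used here
warmup_steps = int(0.1 * len(test_loader))

# Caution: this call passes the test pairs as a training objective, so it is a further
# fine-tuning pass rather than a pure evaluation. Note also that `loss` was constructed
# around `model`, not `trained_model`.
trained_model.fit(
    train_objectives=[(test_loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    evaluator=evaluator,
    evaluation_steps=500,
    output_path = './Final_model'
)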



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/310 [00:00<?, ?it/s]

Batches:   0%|          | 0/677 [00:00<?, ?it/s]

In [37]:
evaluator(trained_model)

Batches:   0%|          | 0/677 [00:00<?, ?it/s]

0.8845458574239929
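
`BinaryClassificationEvaluator` scores each (query, passage) pair by embedding similarity and reports threshold-based metrics such as accuracy and F1, along with average precision. Which of these the call returns depends on the library version, so the value above is not interpreted here. The sketch below illustrates only the best-threshold accuracy idea on toy scores; `best_threshold_accuracy` is a hypothetical helper, not the library's implementation.

```python
def best_threshold_accuracy(scores, labels):
    """Return (accuracy, threshold) for the cut-off that classifies the most pairs correctly."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total = len(ranked)
    negatives_left = sum(1 for _, label in ranked if label == 0)
    positives_seen = 0
    best_acc, best_thr = 0.0, None
    for i in range(total - 1):
        score, label = ranked[i]
        if label == 1:
            positives_seen += 1
        else:
            negatives_left -= 1
        # Predict "relevant" for the i+1 highest-scoring pairs, "not relevant" for the rest
        acc = (positives_seen + negatives_left) / total
        if acc > best_acc:
            best_acc = acc
            best_thr = (score + ranked[i + 1][0]) / 2
    return best_acc, best_thr

toy_scores = [0.91, 0.80, 0.62, 0.55, 0.30, 0.12]
toy_labels = [1, 1, 0, 1, 0, 0]
accuracy, threshold = best_threshold_accuracy(toy_scores, toy_labels)
print(round(accuracy, 3), round(threshold, 2))  # 0.833 0.71
```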

## ***Extracting search data***

In [38]:
corpus = {}

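with open('collection.tsv', 'r', encoding='utf-8') as file:
    for line in file:
        # Split on the first tab only, so a tab inside a passage cannot break unpacking
        pid, passage = line.strip().split("\t", 1)
        corpus[pid] = passage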

print('Loading corpus finished')

Loading corpus finished


In [39]:
queries = {}

with open('queries.train.tsv', 'r', encoding='utf-8') as file:
    for line in file:
        # Split on the first tab only, so a tab inside a query cannot break unpacking
        qid, query = line.strip().split("\t", 1)
        queries[qid] = query
print('Loading queries finished')

Loading queries finished


In [40]:
reader_q = pd.read_csv('qidpidtriples.train.full.2.tsv', iterator=True, skiprows = 10000, chunksize=10000, sep = '\t', header = None)

triples_q = next(reader_q)
triples_q.columns = ['qid', 'ppid', 'npid']

In [41]:
triples_q = triples_q.astype(str).values.tolist()

In [42]:
q_data = []
for lines in triples_q:
    if lines[0] in queries:
        q_data.append([lines[0], lines[1], queries[lines[0]], corpus[lines[1]], 1])
        q_data.append([lines[0], lines[2], queries[lines[0]], corpus[lines[2]], 0])

In [43]:
q_data_df = pd.DataFrame(q_data, columns=['qid', 'pid', 'query', 'passage', 'relevance'])
# Drop exact duplicate rows and renumber the index from 0
q_data_df = q_data_df.drop_duplicates().reset_index(drop=True)

In [64]:
q_data_df

| | qid | pid | query | passage | relevance |
|---|---|---|---|---|---|
| 0 | 1003900 | 4788874 | where was the movie goonies filmed | The picturesque town of Astoria, in Oregon, wh... | 1 |
| 1 | 1003900 | 3299997 | where was the movie goonies filmed | 1 The Omega House was filmed in 729 E. 11th Av... | 0 |
| 2 | 1003900 | 7813127 | where was the movie goonies filmed | Below is a list of feature films and select te... | 0 |
| 3 | 1003900 | 6855194 | where was the movie goonies filmed | Camp No-Be-Bo-Sco, also known as NoBe, is a Bo... | 0 |
| 4 | 1003900 | 4129258 | where was the movie goonies filmed | Goonies, The : Did you know? Accuracy: A team ... | 0 |
| ... | ... | ... | ... | ... | ... |
| 9008 | 1006748 | 6685489 | which chamber of the heart is responsible for ... | The damaged muscle either can't relax properly... | 0 |
| 9009 | 1006748 | 8448834 | which chamber of the heart is responsible for ... | The aorta supplies oxygenated blood from the h... | 0 |
| 9010 | 1006748 | 128067 | which chamber of the heart is responsible for ... | Function. While veins usually carry deoxygenat... | 0 |
| 9011 | 1006748 | 12960 | which chamber of the heart is responsible for ... | Which parts of the heart carry oxygenated bloo... | 0 |


## ***Searching ranked passages for a query***

In [45]:
tested_model = SentenceTransformer('./Final_model')

In [53]:
def search(queryID):
    """
    - Prints a ranked list of passages for the given query
    @param    queryID (str): ID of the query to be searched.
    @return   None. Prints up to 10 passages for the query, ordered by descending score.
    """
    # Get the first 50 candidate rows for the searched query
    search_data = q_data_df.loc[q_data_df['qid'] == queryID].head(50)
    if search_data.empty:
        print("No passages found for query ID:", queryID)
        return
    query = search_data['query'].str.lower().unique().tolist()
    passage = search_data['passage'].str.lower().unique().tolist()

    # Creating query embedding using the model
    query_emb = tested_model.encode(query)

    # Encode every candidate passage once and score it against the query
    doc_emb = tested_model.encode(passage)
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
    doc_score_pairs = list(zip(passage, scores))

    # Pairwise ranking: compare every pair of passages and keep the higher-scoring one.
    # Scores stay numeric here; comparing formatted strings would misorder negative values.
    winners = set()
    for first, second in itertools.combinations(doc_score_pairs, 2):
        winners.add(first if first[1] > second[1] else second)

    # Sort the pair winners by score and keep the top 10
    ranked = sorted(winners, key=lambda x: x[1], reverse=True)[:10]

    # Printing final result: passages and scores
    print("Query:", ''.join(query))
    for doc, score in ranked:
        print('%.3f' % score, doc)
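
Because each passage is scored against the query independently, keeping the winner of every pair yields the same ordering as a single sort, minus the lowest-scoring passage, which never wins a comparison. A minimal sketch with toy scores, assuming all scores are distinct; both function names and the data are hypothetical.

```python
import itertools

def pairwise_top_k(scored, k=10):
    """Keep the winner of every pair, dedupe, then sort by score."""
    winners = set()
    for a, b in itertools.combinations(scored, 2):
        winners.add(a if a[1] > b[1] else b)
    return sorted(winners, key=lambda x: x[1], reverse=True)[:k]

def direct_top_k(scored, k=10):
    """Sort once by score; the lowest-scoring item never wins a pair, so drop it."""
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    return ranked[:-1][:k] if len(ranked) > 1 else []

toy_scored = [('doc a', 0.41), ('doc b', 0.73), ('doc c', 0.12), ('doc d', 0.58)]
print(pairwise_top_k(toy_scored, k=3))
print(direct_top_k(toy_scored, k=3))
```

The pairwise version needs a number of comparisons that grows quadratically with the number of passages. The direct sort gives the same ranking in one pass over the scores.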

In [57]:
search('1006748')

Query: which chamber of the heart is responsible for pumping deoxygenated blood to the lungs?
0.705 pulmonary circulation is the movement of blood from the heart, to the lungs, and back to the heart again. deoxygenated blood leaves the heart, goes to the lungs, and then re-enters the heart; deoxygenated blood leaves through the right ventricle through the pulmonary artery.
0.691 the strongest of the four chambers of the heart responsible for pumping oxygenated blood out to the body. refers to the relaxation phase of the cardiac cycle. the middle layer of the heart wall that contains the cardiac muscle cells responsible the pumping action of the heart.
0.689 after it has become deoxygenated, it goes through the veins back to the heart, where the process starts all over again. pulmonary circulation pulmonary circulation is the part of the cardiovascular system which carries deoxygenated blood away from the heart, to the lungs, and returns oxygenated blood back to the heart. it is what th

In [63]:
search('1004979')

Query: which airport is closest to london bridge?
0.662 london city airport is the closest, approximately 6 miles from the city centre.
0.633 manchester airport is england and the uk's 3rd busiest airport. it is based in ringway, manchester and is the biggest airport outside of the london region.
0.622 niagara falls international airport is obviously the closest airport to the city. it only flies charters and tourism flights and is located on the us side. 2. the closest commuter airport on the us side is buffalo international airport. 3. the closest commuter airport on the canadian side is john c munro hamilton international airport.
0.579 find out how to travel between stansted airport and central london by rail, coach or taxi, plus your guide to hotel transfers and car parking. london stansted airport (stn) is the uk's third busiest airport and one of the fastest growing airports in europe.the airport is located 64km (40 miles) to the north-east of london. it is home to many of the u

In [55]:
search('1003900')

Query: where was the movie goonies filmed
0.650 the picturesque town of astoria, in oregon, where 1980s adventure film the goonies was set. this scene from the famous film shows chunk showing off the iconic 'truffle shuffle'. 'on the way to astoria, we stopped off at cannon beach which you can see near the start of the movie.
0.628 check them out: movies. lilo and stitch â how can an animated movie be filmed at walt disney world â® simple in its animation studio located at disneyâs hollywood studios â®. that is exactly where most of the 2002 animated science fiction comedy-drama film lilo & stitch was filmed.
0.627 below is a list of feature films and select television movies that were filmed in tennessee, either fully or partially: 1  hannah montana the movie-filmed in columbia, tn, released 2009, more info here. 2  tennessee-filmed in dunlap, 2007. 3  black snake moan-filmed in memphis, 2005.
0.624 doug and i went to madrid, new mexico where the movie, wild hogs was filmed. maggi

In [56]:
search('1005307')

Query: which animals are cnidarians
0.667 cnidaria. cnidaria is a phylum of animals, which contains amazingly beautiful coral reefs, electrifying jellyfish, and many other interesting oceanic creatures. there are about 10,000 species of cnidarians and all of them are unique among all other organisms for the presence of the cnidocytes. cnidaria and porifera are two different phyla. â¢ cnidarians have cnidocytes but not the poriferans. â¢ cnidarians have well-organized organ systems but not the poriferans; on the other hand, poriferans have an efficient tunnel system made up of pores but not the cnidarians.
0.636 cnidarians are a group of aquatic invertebrates that includes jellyfish, corals, sea anemones and hydras.
0.625 1 animals of the phylum cnidaria the phylum cnidaria includes animals such as corals, jellyfish, and sea anemones. 2  all cnidarians are aquatic and carnivorous, and they occur in two basic body shapes: polyps,â¦. 3  community collaboration in reiki practice reiki i