# **INF8460 A20 Project: Open-domain question answering**

<br>

Team 8:


*   Cedric Sadeu (GloVe, ranking with classification)
*   Mamoudou Sacko (preprocessing + TF-IDF, cosine ranking)
*   Oumayma Messoussi (pre-trained contextual embeddings with BERT, ML/DL for ranking)

<br>

---

<br>

In [1]:
!pip install transformers pytorch-pretrained-bert # pytorch-nlp pytorch_transformers



In [2]:
import io
import os
import math
import nltk
import time
import torch
import random
import sklearn
import zipfile
import operator
import requests
import functools
import itertools
import numpy as np
import pandas as pd
import lightgbm as lgb
import multiprocessing
from functools import partial
from typing import Dict, List, Tuple
from collections import Counter, defaultdict
from concurrent.futures import ProcessPoolExecutor
from scipy.spatial.distance import euclidean, cosine
from transformers import pipeline, Trainer, TrainingArguments
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, BertTokenizer, BertModel, BertForQuestionAnswering

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
from google.colab import drive
drive.mount('/content/drive')
!ls '/content/drive/My Drive/Colab Notebooks/INF8460/Project/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data				  LSTMSiameseTextSimilarity  pytorch_model.bin
inf8460_projet_A20_equipe8.ipynb  output		     yahooLTR_C14.tgz


In [4]:
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/INF8460/Project/')

### Reading the data

In [5]:
def read_data(path: str) -> Tuple[List[int], List[str]]:
    data = pd.read_csv(path)
    ids = data["id"].tolist()
    paragraphs = data["paragraph"].tolist()
    return ids, paragraphs

def read_questions(path: str) -> Tuple:
    # Labelled files yield (ids, questions, paragraph_ids, answers);
    # the test file has no labels and yields only (ids, questions).
    data = pd.read_csv(path)
    ids = data["id"].tolist()
    questions = data["question"].tolist()
    if "test" not in path:
        paragraph_ids = data["paragraph_id"].tolist()
        answers = data["answer"].tolist()
        return ids, questions, paragraph_ids, answers
    else:
        return ids, questions

def save_to_csv(path: str, corpus: Dict):
    # Writes all rows to `output_path`, the output directory defined in the next cell.
    df = pd.DataFrame(corpus, columns=list(corpus.keys()))
    df.to_csv(os.path.join(output_path, path), index=False, header=True)

In [6]:
data_path = "/content/drive/My Drive/Colab Notebooks/INF8460/Project/data"
output_path = "/content/drive/My Drive/Colab Notebooks/INF8460/Project/output"

# train_data = read_data(os.path.join(data_path, "corpus.csv"))
train_ids = read_questions(os.path.join(data_path, "train_ids.csv"))
test_ids = read_questions(os.path.join(data_path, "test.csv"))


# paragraphs = [" ".join(sentence.split()).lower() for sentence in train_data[1]]
questions = [" ".join(sentence.split()).lower() for sentence in train_ids[1]]
test_questions = [" ".join(sentence.split()).lower() for sentence in test_ids[1]]

### Preprocessing

In [7]:
class Preprocess(object):
    def __init__(self, lemmatize=True):
        self.stopwords = set(nltk.corpus.stopwords.words("english"))
        self.lemmatize = lemmatize

    def preprocess_pipeline(self, data):
        clean_tokenized_data = self._clean_doc(data)
        if self.lemmatize:
            clean_tokenized_data = self._lemmatize(clean_tokenized_data)

        return clean_tokenized_data

    def _clean_doc(self, data):
        tokenizer = nltk.tokenize.RegexpTokenizer(r"\w+")
        return [
            [
                token.lower()
                for token in tokenizer.tokenize(review)
                if token.lower() not in self.stopwords
                and len(token) > 1
                and token.isalpha()
            ]
            for review in data
        ]

    def _lemmatize(self, data):
        lemmatizer = nltk.stem.WordNetLemmatizer()
        return [[lemmatizer.lemmatize(word) for word in review] for review in data]

    def convert_to_reviews(self, tokenized_reviews):
        reviews = []
        for tokens in tokenized_reviews:
            reviews.append(" ".join(tokens))

        return reviews

In [8]:
pre = Preprocess()

# paragraphs_tokenized = pre.preprocess_pipeline(paragraphs)
questions_tokenized = pre.preprocess_pipeline(questions)
test_questions_tokenized = pre.preprocess_pipeline(test_questions)

# paragraphs_text = [" ".join(sentence) for sentence in paragraphs_tokenized]
questions_text = [" ".join(sentence) for sentence in questions_tokenized]
test_questions_text = [" ".join(sentence) for sentence in test_questions_tokenized]

# del paragraphs_tokenized
del questions_tokenized
del test_questions_tokenized
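
To make the effect of the pipeline concrete, here is a minimal, dependency-free sketch of the same cleaning steps (lowercasing, `\w+` tokenization, removal of stop words, one-character tokens and non-alphabetic tokens). The tiny `TOY_STOPWORDS` set and the `clean` function are illustrative assumptions: the notebook itself uses NLTK's English stop-word list and adds WordNet lemmatization, which this sketch leaves out.

```python
import re

# Hypothetical, very small stop-word list; the notebook uses nltk.corpus.stopwords.
TOY_STOPWORDS = {"the", "is", "of", "a", "what", "in"}

def clean(text):
    """Lowercase, tokenize on word characters, drop stop words, short and non-alphabetic tokens."""
    tokens = re.findall(r"\w+", text)
    return [
        t.lower()
        for t in tokens
        if t.lower() not in TOY_STOPWORDS and len(t) > 1 and t.isalpha()
    ]

print(clean("What is the capital of France in 2020?"))
```

Numbers such as `2020` are dropped by the `isalpha()` filter, which may matter for questions whose answer depends on a date or a quantity.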



---

<br>

## **1. Word embeddings**

### TF-IDF

In [40]:
from sklearn.decomposition import TruncatedSVD

def buildVocab(X) -> dict:
  # Token -> column index mapping for the already preprocessed corpus.
  vectorizer = CountVectorizer(min_df=1, lowercase=False)
  vectorizer.fit(X)
  return vectorizer.vocabulary_

def getTfIdfReprentation(data, vectorizer) -> object: 
  # Fit the vectorizer on `data` and return the dense TF-IDF matrix.
  data_tfidf = vectorizer.fit_transform(data)
  return data_tfidf.todense()

# def getTfIdfEmbedded(vocab, data, feature = 5) -> object:
#   vectorizer = TfidfVectorizer(max_features=15) 
#   data_tfidf = vectorizer.fit_transform(data)
#   features = vectorizer.get_feature_names()
#   dense = data_tfidf.todense()
#   denselist = dense.tolist()
#   df = pd.DataFrame(
#     denselist,columns=features)
#   return df


def get_doc_embedded(X, vocab, embeddings) -> object:
  # Each document is the mean of the embeddings of its in-vocabulary words.
  X_embedded = np.zeros((len(X), len(embeddings)), dtype=float)

  for i, doc in enumerate(X):
    vec = np.zeros((1, len(embeddings)), dtype=float)
    tokens = doc.split() #new_question_tfidf
    cpt = 0
    for word in tokens:
      if(word in vocab):
        cpt += 1
        vec += embeddings[word]
    if cpt > 0:  # keep the zero vector when no word is in the vocabulary
      vec /= cpt
    X_embedded[i] = vec
  return X_embedded

def getMedian(corpus):
  # Despite its name, this returns the document length at the two-thirds quantile.
  total_lenght = sorted([len(doc) for doc in corpus])
  return total_lenght[int(len(total_lenght) * 2/3)]


def sklearn_svd(df, k):
    # Reduce the matrix to k dimensions with truncated SVD.
    svd_model = TruncatedSVD(n_components=k)
    df_r = svd_model.fit_transform(df)
    return df_r
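
As a reading aid for the TF-IDF helpers, the following is a minimal pure-Python sketch of the weighting itself. It assumes raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` and L2-normalised rows, which is intended to mirror the default settings of `TfidfVectorizer`; the function name `tfidf` and the toy documents are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

# Toy corpus of already preprocessed documents (hypothetical).
vocab, rows = tfidf(["capital france", "capital germany", "river france"])
print(vocab)
print([round(x, 3) for x in rows[1]])
```

In the second document, `germany` appears in fewer documents than `capital`, so it receives the larger weight; this is the property the cosine ranking in section 2 relies on.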

In [None]:
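# `train_data` comes from the `read_data` call that is commented out above.
corpus = {'id': train_data[0], 'paragraph': paragraphs_tfidf }
save_to_csv("corpus.csv", corpus)

# A separate name keeps the `train_ids` tuple available to later cells.
train_ids_out = {'id': train_ids[0], 'question': paragraphs_tfidf, 'paragraph_id': train_ids[2], 'answer': train_ids[3] }
save_to_csv("train_ids.csv", train_ids_out)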

### GloVe

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
!rm glove.6B.50d.txt
!rm glove.6B.100d.txt
!rm glove.6B.200d.txt


In [None]:
def read_from_csv(path):
    """
    Reads a csv file and returns its columns as a list of lists.
    """
    data = pd.read_csv(path)
    data = data.dropna(axis=1, how='all')
    return (data.to_numpy().T).tolist()

def get_lines_gloves(line):
    """
    This function takes:
    line: a line from the GloVe text file (a string)
    and returns a tuple (word, embedding vector).
    """
    values = line.split()
    word = values[0]
    return word, np.asarray(values[1:], dtype=float)

def get_gloves_dict(path = "glove.6B.300d.txt"):
    """
    This function takes:
    path: path to a GloVe text file (a string)
    and returns a dict {key=word: value=embedding vector}.
    """
    with open(path, "r", encoding="UTF-8") as f:
        lines = f.readlines()
    # Parse the lines in parallel; the pool is released when the block exits.
    with multiprocessing.Pool() as p:
        result = p.map(get_lines_gloves, lines)
    return dict(result)

def get_plong_doc(doc, embeddings_dict, len_vec_emb):
    """
    This function takes:
    doc: a string representing a document of the corpus, e.g. 'il est'
    embeddings_dict: a dict {key=word: value=embedding vector}
    len_vec_emb: the length of the embedding vectors (d)
    and returns an embedding vector for the document: the mean of the
    embeddings of its words, each weighted by its number of occurrences.
    Words without an embedding are ignored.
    """
    # Same tokenization as CountVectorizer, but counting occurrences:
    # `vocabulary_` holds column indices, not counts.
    analyzer = CountVectorizer().build_analyzer()
    counts = Counter(analyzer(doc))
    vec = np.zeros(len_vec_emb, dtype=float)
    total = 0
    for word, count in counts.items():
        if word in embeddings_dict:
            vec += embeddings_dict[word] * count
            total += count
    # Return the zero vector when no word of the document has an embedding.
    return vec / total if total else vec

def get_plong_corpus(corpus, embeddings_dict):
    """
    This function takes:
    corpus: a list of strings, e.g. ['je vais', 'il est']; each string is a document
    embeddings_dict: a dict {key=word: value=embedding vector}
    and returns a list with one embedding vector per document.
    """
    len_vec_emb = len(next(iter(embeddings_dict.values())))
    with multiprocessing.Pool() as p:
        result = p.map(partial(get_plong_doc, embeddings_dict=embeddings_dict, len_vec_emb=len_vec_emb), corpus)
    return result
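
The document embedding used here is a count-weighted mean of word vectors. The sketch below illustrates that averaging on a toy example with plain Python lists; the 3-dimensional `toy_embeddings` and the function `mean_embedding` are hypothetical stand-ins for the 300-dimensional GloVe vectors and the helpers above.

```python
from collections import Counter

# Hypothetical 3-dimensional embeddings; the notebook loads 300-dimensional GloVe vectors.
toy_embeddings = {
    "capital": [1.0, 0.0, 0.0],
    "france": [0.0, 1.0, 0.0],
}

def mean_embedding(doc, embeddings, dim):
    """Average the vectors of the known words in `doc`, weighted by their counts."""
    total = [0.0] * dim
    n_known = 0
    for word, count in Counter(doc.split()).items():
        if word in embeddings:
            n_known += count
            total = [t + count * v for t, v in zip(total, embeddings[word])]
    # An all-zero vector is returned when no word is covered by the embeddings.
    return [t / n_known for t in total] if n_known else total

print(mean_embedding("capital capital france unknownword", toy_embeddings, 3))
```

Out-of-vocabulary words are skipped rather than counted as zeros, so a document with many unknown words is not pulled toward the origin.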

In [None]:
path = "/content/drive/My Drive/Colab Notebooks/INF8460/Project/output/corpus.csv"
datat = read_from_csv(path)

vectorizer = CountVectorizer()
X = vectorizer.fit(datat[1]).vocabulary_

In [None]:
glove_dict = get_gloves_dict()
key_set = set(X.keys()) & set(glove_dict.keys())
glove_dict_vocab_corpus = {key: glove_dict[key] for key in key_set}

In [None]:
plongement_doc = get_plong_corpus(datat[1], glove_dict_vocab_corpus)

### Contextual embeddings: pre-trained / not pre-trained



> #### Huggingface ready pipeline



In [None]:
question = "How many parameters does BERT-large have?"
answer_text = r"""BERT-large is really big... it has 24-layers and an embedding size of 1,024, 
                  for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take 
                  a couple minutes to download to your Colab instance."""

nlp = pipeline("question-answering")
result = nlp(question=question, context=answer_text)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: '340M', score: 0.7121, start: 111, end: 115




> #### DistilBERT SQuAD pre-trained



In [None]:
questions = ["How many parameters does BERT-large have?"]
answer_text = r"""BERT-large is really big... it has 24-layers and an embedding size of 1,024, 
                  for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take 
                  a couple minutes to download to your Colab instance."""

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad", 
                                                      return_dict=True, output_hidden_states = True)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad", return_dict=True)

for question in questions:
    inputs = tokenizer(question, answer_text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # display tokens and ids
    for token, id in zip(text_tokens, input_ids):
        if id == tokenizer.sep_token_id:
            print('')
        print('{:<12} {:>6,}'.format(token, id))
        if id == tokenizer.sep_token_id:
            print('')

    outputs = model(**inputs)

    last_hidden_states = outputs.hidden_states[-1]
    print(last_hidden_states.shape)

    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}, start: {answer_start}, end: {answer_end}")

[CLS]           101
How           1,731
many          1,242
parameters   11,934
does          1,674
B               139
##ER          9,637
##T           1,942
-               118
large         1,415
have          1,138
?               136

[SEP]           102

B               139
##ER          9,637
##T           1,942
-               118
large         1,415
is            1,110
really        1,541
big           1,992
.               119
.               119
.               119
it            1,122
has           1,144
24            1,572
-               118
layers        8,798
and           1,105
an            1,126
em            9,712
##bed         4,774
##ding        3,408
size          2,060
of            1,104
1               122
,               117
02            5,507
##4           1,527
,               117
for           1,111
a               170
total         1,703
of            1,104
340          16,984
##M           2,107
parameters   11,934
!               106
Alto         17,76
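
The cell above takes the `argmax` of the start and end logits independently, which can produce an end position that precedes the start. A common alternative, sketched here under the assumption that the logits are available as plain lists, is to pick the pair with the highest combined score subject to `start <= end` and a maximum answer length. The function `best_span`, the `max_answer_len` default and the toy logits are illustrative and not taken from the notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) maximising start_logits[s] + end_logits[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical logits for a 5-token input: independent argmax would give
# start=3 and end=1, which is not a valid span.
start_logits = [0.1, 0.2, 0.3, 2.0, 0.5]
end_logits = [0.1, 3.0, 0.2, 0.4, 1.5]
print(best_span(start_logits, end_logits))
```

The returned end index is inclusive, so slicing the token ids would use `input_ids[start:end + 1]`, matching the `+ 1` applied to `answer_end` in the cell above.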

> #### BERT base pre-trained

In [None]:
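# Select the first GPU only when one is available, so the cell also runs on CPU.
if torch.cuda.is_available():
    torch.cuda.set_device(0)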

questions = ["How many parameters does BERT-large have?"]
answer_text = r"""BERT-large is really big... it has 24-layers and an embedding size of 1,024, 
                  for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take 
                  a couple minutes to download to your Colab instance."""

model = BertModel.from_pretrained("bert-base-cased", return_dict=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# print(device)
# # model = model.to(device)

for question in questions:
    inputs = tokenizer(question, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # display tokens and ids
    for token, id in zip(text_tokens, input_ids):
        if id == tokenizer.sep_token_id:
            print('')
        print('{:<12} {:>6,}'.format(token, id))
        if id == tokenizer.sep_token_id:
            print('')

    outputs = model(**inputs)

    last_hidden_states = outputs.last_hidden_state
    print(last_hidden_states.shape)

[CLS]           101
How           1,731
many          1,242
parameters   11,934
does          1,674
B               139
##ER          9,637
##T           1,942
-               118
large         1,415
have          1,138
?               136

[SEP]           102

torch.Size([1, 13, 768])


> #### DistilBERT SQuAD training



---

<br>

## **2. Ranking**



> #### cosine similarity



In [None]:
def voisins(word, df, n, distfunc=cosine):
    # Return the ids and distances of the n entries of `df` closest to `word`.
    assert distfunc.__name__ in ('cosine', 'euclidean'), "distance metric not supported"

    # scipy's distance functions expect 1-D vectors.
    word = np.asarray(word).ravel()
    closest = {w: distfunc(word, np.asarray(df[w]).ravel()) for w in df}

    # Both metrics are distances, so the nearest neighbours have the smallest values.
    closest = sorted(closest.items(), key=lambda item: item[1])[:n]

    return [k for k, _ in closest], [v for _, v in closest]
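
The ranking step boils down to computing a cosine distance between the query vector and every stored vector, then keeping the smallest values. The following is a small pure-Python sketch of that idea; `cosine_distance`, `top_n` and the toy vectors are illustrative assumptions, and the guard for all-zero vectors is a design choice of this sketch rather than the behaviour of `scipy.spatial.distance.cosine`.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def top_n(query, vectors, n):
    """Return the n (id, distance) pairs with the smallest distance to `query`."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1])[:n]

# Hypothetical TF-IDF-like vectors keyed by question id.
vectors = {10: [1.0, 0.0], 20: [1.0, 1.0], 30: [0.0, 1.0]}
print(top_n([1.0, 0.1], vectors, 2))
```

A question whose TF-IDF vector is all zeros (none of its words are among the retained features) has no defined cosine similarity; treating it as maximally distant keeps such questions at the bottom of the ranking.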

In [None]:
# paragraphs_vocab = buildVocab(paragraphs_text)
questions_vocab = buildVocab(questions_text)

questions_ids = train_ids[0]
vectorizer = TfidfVectorizer(max_features=500) # vocabulary=questions_vocab
questions_tfidf = getTfIdfReprentation(questions_text, vectorizer)

# Map each training question id to its TF-IDF row.
dic_questions = {}
for i, question_id in enumerate(questions_ids):
    dic_questions[question_id] = questions_tfidf[i]

#questions_tfidf_r = sklearn_svd(questions_tfidf, len(embeddings))

In [None]:
start = time.time()

ranking_list = {}
for i in range(len(test_questions_text)):
    new_question = test_questions_text[i]
    new_question_tokenized = pre.preprocess_pipeline([new_question])
    new_question_text = [" ".join(sentence) for sentence in new_question_tokenized]
    
    new_question_tfidf = vectorizer.transform(new_question_text).todense()

    topk_ids, topk_questions = voisins(new_question_tfidf, dic_questions, 3, distfunc=cosine)
    print(topk_ids, topk_questions)

    ranking_list[i] = topk_ids

print(time.time() - start)

  dist = 1.0 - uv / np.sqrt(uu * vv)

[3736, 9277, 663] [0.21815962598654437, 0.3765213479496673, 0.5667729562836451]
[549, 623, 1404] [0.0, 0.0, 0.0]
[2927, 25752, 558] [0.28044591097594374, 0.28044591097594374, 0.4304412417979788]
[964, 2990, 22204] [0.0, 0.0, 0.0]
[10171, 1710, 67] [0.3970612842870358, 0.5506789894166402, 0.5716624108482339]
[331, 2920, 1074] [0.5369323027345263, 0.5369715647490179, 0.6235697311139214]
[85, 5948, 96348] [0.21686895949687934, 0.21686895949687934, 0.3258161538570762]
[1240, 3571, 17243] [0.0, 0.0, 0.0]
[584, 247, 2] [0.38061932977127344, 0.466747743096798, 0.5520734461509238]
[599, 5, 69] [0.6237818822519656, 0.6877793564683086, 0.7594926247939646]
[3253, 6324, 19286] [0.0, 0.0, 0.0]

KeyboardInterrupt: ignored

In [None]:
# save to file
save_to_csv("test_cosine_sim_questions.csv", ranking_list)



> #### LambdaMART with lightgbm





> #### LSTM Siamese text similarity



In [None]:
# !git clone https://github.com/amansrivastava17/lstm-siamese-text-similarity.git

In [None]:
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/INF8460/Project/LSTMSiameseTextSimilarity/')
# Fetch the raw file: the github.com/.../tree/... URL returns an HTML page, not the dataset.
!wget https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt



In [None]:
from model import SiameseBiLSTM
from inputHandler import word_embed_meta_data, create_test_data
from config import siamese_config
import pandas as pd

############ Data Preparation ##########

df = pd.read_csv('lstm-siamese-text-similarity/sample_data.csv')

sentences1 = list(df['sentences1'])
sentences2 = list(df['sentences2'])
is_similar = list(df['is_similar'])
del df

######## Word Embedding ############

tokenizer, embedding_matrix = word_embed_meta_data(sentences1 + sentences2,  siamese_config['EMBEDDING_DIM'])

embedding_meta_data = {
    'tokenizer': tokenizer,
    'embedding_matrix': embedding_matrix
}

## creating sentence pairs
sentences_pair = [(x1, x2) for x1, x2 in zip(sentences1, sentences2)]
del sentences1
del sentences2

######## Training ########

class Configuration(object):
    """Dump stuff here"""

CONFIG = Configuration()

CONFIG.embedding_dim = siamese_config['EMBEDDING_DIM']
CONFIG.max_sequence_length = siamese_config['MAX_SEQUENCE_LENGTH']
CONFIG.number_lstm_units = siamese_config['NUMBER_LSTM']
CONFIG.rate_drop_lstm = siamese_config['RATE_DROP_LSTM']
CONFIG.number_dense_units = siamese_config['NUMBER_DENSE_UNITS']
CONFIG.activation_function = siamese_config['ACTIVATION_FUNCTION']
CONFIG.rate_drop_dense = siamese_config['RATE_DROP_DENSE']
CONFIG.validation_split_ratio = siamese_config['VALIDATION_SPLIT']

siamese = SiameseBiLSTM(CONFIG.embedding_dim , CONFIG.max_sequence_length, CONFIG.number_lstm_units , CONFIG.number_dense_units, CONFIG.rate_drop_lstm, CONFIG.rate_drop_dense, CONFIG.activation_function, CONFIG.validation_split_ratio)

best_model_path = siamese.train_model(sentences_pair, is_similar, embedding_meta_data, model_save_directory='./')

Embedding matrix shape: (3052, 50)
Null word embeddings: 1
Epoch 1/200
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200


In [None]:
######## Testing ########

from operator import itemgetter
from keras.models import load_model

model = load_model(best_model_path)

test_sentence_pairs = [('What can make Physics easy to learn?','How can you make physics easy to learn?'),('How many times a day do a clocks hands overlap?','What does it mean that every time I look at the clock the numbers are the same?')]

test_data_x1, test_data_x2, leaks_test = create_test_data(tokenizer,test_sentence_pairs,  siamese_config['MAX_SEQUENCE_LENGTH'])

preds = list(model.predict([test_data_x1, test_data_x2, leaks_test], verbose=1).ravel())
results = [(x, y, z) for (x, y), z in zip(test_sentence_pairs, preds)]
results.sort(key=itemgetter(2), reverse=True)
print(results)

[('What can make Physics easy to learn?', 'How can you make physics easy to learn?', 0.39372748), ('How many times a day do a clocks hands overlap?', 'What does it mean that every time I look at the clock the numbers are the same?', 0.169769)]




---

<br>

## **3. Answer extraction**



> #### Pytorch BERT squad 2.0



In [None]:
!git clone https://github.com/surbhardwaj/BERT-QnA-Squad_2.0_Finetuned_Model.git

Cloning into 'BERT-QnA-Squad_2.0_Finetuned_Model'...
remote: Enumerating objects: 60, done.
remote: Total 60 (delta 0), reused 0 (delta 0), pack-reused 60
Unpacking objects: 100% (60/60), done.
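
The extraction code in the next cell handles paragraphs longer than the model's maximum sequence length by splitting them into overlapping windows ("doc spans") that advance by a fixed stride. The following standalone sketch reproduces that windowing logic on token counts only; the function name `make_doc_spans` and the sizes in the example are illustrative assumptions.

```python
from collections import namedtuple

DocSpan = namedtuple("DocSpan", ["start", "length"])

def make_doc_spans(n_tokens, max_tokens_for_doc, doc_stride):
    """Split n_tokens into overlapping windows of at most max_tokens_for_doc tokens."""
    spans, start = [], 0
    while start < n_tokens:
        length = min(n_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start, length))
        # Stop once the current window reaches the end of the paragraph.
        if start + length == n_tokens:
            break
        start += min(length, doc_stride)
    return spans

# Hypothetical sizes: a 10-token paragraph, windows of 4 tokens, stride of 2.
print(make_doc_spans(10, 4, 2))
```

With a stride smaller than the window, each token can appear in several windows; the `_check_is_max_context` helper in the next cell then decides which window gives a token the most surrounding context.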


In [None]:
import math
import logging
import argparse
import collections
import numpy as np
import pandas as pd
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from pytorch_pretrained_bert.tokenization import (BasicTokenizer,
                                                  BertTokenizer, whitespace_tokenize)
from pytorch_pretrained_bert.modeling import BertForQuestionAnswering, BertConfig
from tqdm import tqdm
from termcolor import colored, cprint

# Used by get_final_text for its verbose diagnostics.
logger = logging.getLogger(__name__)


class SquadExample(object):
    """
    A single training/test example for the Squad dataset.
    For examples without an answer, the start and end position are -1.
    """

    def __init__(self,
                 example_id,
                 para_text,
                 qas_id,
                 question_text,
                 doc_tokens,
                 unique_id):
        self.qas_id = qas_id
        self.question_text = question_text
        self.doc_tokens = doc_tokens
        self.example_id = example_id
        self.para_text = para_text
        self.unique_id = unique_id
        

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = ""
        s += "qas_id: %s" % (self.qas_id)
        s += ", question_text: %s" % (
            self.question_text)
        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
        
        return s



# Split each paragraph into whitespace-delimited tokens and build one example per question.
def read_squad_examples(input_data):
    """Convert entries with 'id', 'text' and 'ques' keys into a list of SquadExample."""
    def is_whitespace(c):
        return c in (" ", "\t", "\r", "\n") or ord(c) == 0x202F
    i = 0
    examples = []
    for entry in input_data:
        example_id = entry['id']
        paragraph_text = entry['text']
        doc_tokens = []
        prev_is_whitespace = True
        for c in paragraph_text:
            if is_whitespace(c):
                prev_is_whitespace = True
            else:
                if prev_is_whitespace:
                    doc_tokens.append(c)
                else:
                    doc_tokens[-1] += c
                prev_is_whitespace = False
            
        for qa in entry['ques']:
            qas_id = i
            question_text = qa
            
            
            example = SquadExample(example_id = example_id,
                    qas_id=qas_id,
                    para_text = paragraph_text,               
                    question_text=question_text,
                    doc_tokens=doc_tokens,
                    unique_id = i)
            i+=1
            examples.append(example)

    return examples



def _check_is_max_context(doc_spans, cur_span_index, position):
    """Check if this is the 'max context' doc span for the token."""
    best_score = None
    best_span_index = None
    for (span_index, doc_span) in enumerate(doc_spans):
        end = doc_span.start + doc_span.length - 1
        if position < doc_span.start:
            continue
        if position > end:
            continue
        num_left_context = position - doc_span.start
        num_right_context = end - position
        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
        if best_score is None or score > best_score:
            best_score = score
            best_span_index = span_index

    return cur_span_index == best_span_index

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self,
                 unique_id,
                 example_index,
                 doc_span_index,
                 tokens,
                 token_is_max_context,
                 token_to_orig_map,
                 input_ids,
                 input_mask,
                 segment_ids):
 
        self.doc_span_index = doc_span_index
        self.unique_id = unique_id
        self.example_index = example_index
        self.tokens = tokens
        self.token_is_max_context = token_is_max_context
        self.token_to_orig_map = token_to_orig_map
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        




def convert_examples_to_features(examples, tokenizer, max_seq_length,
                                 doc_stride, max_query_length):
    """Converts a list of examples into a list of `InputFeatures`."""


    features = []
    unique_id = 1
    for (example_index, example) in enumerate(examples):
        query_tokens = tokenizer.tokenize(example.question_text)
        ### Truncate the query if query length > max_query_length..
        if len(query_tokens) > max_query_length:
            query_tokens = query_tokens[0:max_query_length]

        tok_to_orig_index = []
        orig_to_tok_index = []
        all_doc_tokens = []
        for (i, token) in enumerate(example.doc_tokens):
            orig_to_tok_index.append(len(all_doc_tokens))
            sub_tokens = tokenizer.tokenize(token)
            for sub_token in sub_tokens:
                tok_to_orig_index.append(i)
                all_doc_tokens.append(sub_token)

        tok_start_position = None
        tok_end_position = None

        max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
    

        # We can have documents that are longer than the maximum sequence length.
        # To deal with this we use a sliding window approach, where we take chunks
        # of up to our max length with a stride of `doc_stride`.
        _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
            "DocSpan", ["start", "length"])
        doc_spans = []
        start_offset = 0
        while start_offset < len(all_doc_tokens):
            length = len(all_doc_tokens) - start_offset
            if length > max_tokens_for_doc:
                length = max_tokens_for_doc
            doc_spans.append(_DocSpan(start=start_offset, length=length))
            if start_offset + length == len(all_doc_tokens):
                break
            start_offset += min(length, doc_stride)

        for (doc_span_index, doc_span) in enumerate(doc_spans):
            tokens = []
            token_to_orig_map = {}
            token_is_max_context = {}
            segment_ids = []
            tokens.append("[CLS]")
            segment_ids.append(0)
            for token in query_tokens:
                tokens.append(token)
                segment_ids.append(0)
            tokens.append("[SEP]")
            segment_ids.append(0)

            for i in range(doc_span.length):
                split_token_index = doc_span.start + i
                token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]

                is_max_context = _check_is_max_context(doc_spans, doc_span_index,
                                                       split_token_index)
                token_is_max_context[len(tokens)] = is_max_context
                tokens.append(all_doc_tokens[split_token_index])
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)


            input_ids = tokenizer.convert_tokens_to_ids(tokens)

            # The mask has 1 for real tokens and 0 for padding tokens. Only real
            # tokens are attended to.
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            while len(input_ids) < max_seq_length:
                input_ids.append(0)
                input_mask.append(0)
                segment_ids.append(0)

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            
            features.append(InputFeatures(unique_id = unique_id,
                            example_index = example_index,
                            doc_span_index=doc_span_index,
                            tokens=tokens,   
                            token_is_max_context=token_is_max_context,
                            token_to_orig_map=token_to_orig_map,
                            input_ids=input_ids,
                            input_mask=input_mask,
                            segment_ids=segment_ids))
            unique_id += 1

            
    
    return features


def _get_best_indexes(logits, n_best_size):
    """Get the n-best logits from a list."""
    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)
    best_indexes = []
    for i in range(len(index_and_score)):
        if i >= n_best_size:
            break
        best_indexes.append(index_and_score[i][0])
   
    return best_indexes



def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False):
    """Project the tokenized prediction back to the original text."""

    def _strip_spaces(text):
        ns_chars = []
        ns_to_s_map = collections.OrderedDict()
        for (i, c) in enumerate(text):
            if c == " ":
                continue
            ns_to_s_map[len(ns_chars)] = i
            ns_chars.append(c)
        ns_text = "".join(ns_chars)
        return (ns_text, ns_to_s_map)

    
    tokenizer = BasicTokenizer(do_lower_case=do_lower_case)

    tok_text = " ".join(tokenizer.tokenize(orig_text))

    start_position = tok_text.find(pred_text)
    if start_position == -1:
        if verbose_logging:
            logger.info(
                "Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
            print("no answer")
        return orig_text
    end_position = start_position + len(pred_text) - 1

    (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text)
    (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)

    if len(orig_ns_text) != len(tok_ns_text):
        if verbose_logging:
            logger.info("Length not equal after stripping spaces: '%s' vs '%s'",
                        orig_ns_text, tok_ns_text)
        return orig_text

    # We then project the characters in `pred_text` back to `orig_text` using
    # the character-to-character alignment.
    tok_s_to_ns_map = {}
    for (i, tok_index) in tok_ns_to_s_map.items():
        tok_s_to_ns_map[tok_index] = i

    orig_start_position = None
    if start_position in tok_s_to_ns_map:
        ns_start_position = tok_s_to_ns_map[start_position]
        if ns_start_position in orig_ns_to_s_map:
            orig_start_position = orig_ns_to_s_map[ns_start_position]

    if orig_start_position is None:
        if verbose_logging:
            logger.info("Couldn't map start position")
        return orig_text

    orig_end_position = None
    if end_position in tok_s_to_ns_map:
        ns_end_position = tok_s_to_ns_map[end_position]
        if ns_end_position in orig_ns_to_s_map:
            orig_end_position = orig_ns_to_s_map[ns_end_position]

    if orig_end_position is None:
        if verbose_logging:
            logger.info("Couldn't map end position")
        return orig_text

    output_text = orig_text[orig_start_position:(orig_end_position + 1)]
    return output_text

    
    
_PrelimPrediction = collections.namedtuple(  # pylint: disable=invalid-name
    "PrelimPrediction",
    ["feature_index", "start_index", "end_index", "start_logit", "end_logit"])


def _compute_softmax(scores):
    """Compute softmax probability over raw logits."""
    if not scores:
        return []

    max_score = None
    for score in scores:
        if max_score is None or score > max_score:
            max_score = score

    exp_scores = []
    total_sum = 0.0
    for score in scores:
        x = math.exp(score - max_score)
        exp_scores.append(x)
        total_sum += x

    probs = []
    for score in exp_scores:
        probs.append(score / total_sum)
    return probs


_NbestPrediction = collections.namedtuple(  # pylint: disable=invalid-name
    "NbestPrediction", ["text", "start_logit", "end_logit"])

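def predict(examples, all_features, all_results, max_answer_length, thresh):
    """Return an OrderedDict mapping each example to its best answer text.

    The best answer is replaced by "<No Answer>" when its softmax
    probability does not exceed `thresh`.
    """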

    n_best_size = 10
    
    ### Group features by the example they belong to ###
    example_index_to_features = collections.defaultdict(list)
    for feature in all_features:
        example_index_to_features[feature.example_index].append(feature)
     
    unique_id_to_result = {}
    for result in all_results:
        unique_id_to_result[result.unique_id] = result
        
        
    all_predictions = collections.OrderedDict()
   
    
    
    for example in examples:
        index = 0
        features = example_index_to_features[example.unique_id]
        prelim_predictions = []
       
        for (feature_index, feature) in enumerate(features):
            result = unique_id_to_result[feature.unique_id]
            start_indexes = _get_best_indexes(result.start_logits, n_best_size)
            end_indexes = _get_best_indexes(result.end_logits, n_best_size)
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip index pairs that cannot form a valid answer span.
                    if start_index >= len(feature.tokens):
                        continue
                    if end_index >= len(feature.tokens):
                        continue
                    if start_index not in feature.token_to_orig_map:
                        continue
                    if end_index not in feature.token_to_orig_map:
                        continue
                    if not feature.token_is_max_context.get(start_index, False):
                        continue
                    if end_index < start_index:
                        continue
                    length = end_index - start_index + 1
                    if length > max_answer_length:
                        continue

                    prelim_predictions.append(
                        _PrelimPrediction(
                            feature_index=feature_index,
                            start_index=start_index,
                            end_index=end_index,
                            start_logit=result.start_logits[start_index],
                            end_logit=result.end_logits[end_index]))


        prelim_predictions = sorted(
            prelim_predictions,
            key=lambda x: (x.start_logit + x.end_logit),
            reverse=True) 
            
    
         
        seen_predictions = {}
        nbest = []
        for pred in prelim_predictions:
            if len(nbest) >= n_best_size:
                break
                
            feature = features[pred.feature_index]
            if pred.start_index > 0:  # this is a non-null prediction
                tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1)]
                orig_doc_start = feature.token_to_orig_map[pred.start_index]
                orig_doc_end = feature.token_to_orig_map[pred.end_index]
                orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + 1)]
                tok_text = " ".join(tok_tokens)

                # De-tokenize WordPieces that have been split off.
                tok_text = tok_text.replace(" ##", "")
                tok_text = tok_text.replace("##", "")

                # Clean whitespace
                tok_text = tok_text.strip()
                tok_text = " ".join(tok_text.split())
                orig_text = " ".join(orig_tokens)

                final_text = get_final_text(tok_text, orig_text, True)
                if final_text in seen_predictions:
                    continue

                seen_predictions[final_text] = True
            else:
                final_text = ""
                seen_predictions[final_text] = True

            nbest.append(
                _NbestPrediction(
                    text=final_text,
                    start_logit=pred.start_logit,
                    end_logit=pred.end_logit))
        
        
    
        if not nbest:
            nbest.append(
                _NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0))

        assert len(nbest) >= 1
        
        
        total_scores = []
        best_non_null_entry = None
        for entry in nbest:
            total_scores.append(entry.start_logit + entry.end_logit)
            if not best_non_null_entry:
                if entry.text:
                    best_non_null_entry = entry

    
        probs = _compute_softmax(total_scores)
        nbest_json = []
        for (i, entry) in enumerate(nbest):

            output = collections.OrderedDict()
            output["text"] = entry.text if probs[i] > thresh else "<No Answer>"
            output["probability"] = probs[i]
            output["start_logit"] = entry.start_logit if probs[i] > thresh else -1
            output["end_logit"] = entry.end_logit if probs[i] > thresh else -1
            nbest_json.append(output)

        assert len(nbest_json) >= 1
        all_predictions[example] = nbest_json[0]["text"]
        index += 1
    return all_predictions
        

                

RawResult = collections.namedtuple("RawResult",
                                   ["unique_id", "start_logits", "end_logits"])




def main(para_file="BERT-QnA-Squad_2.0_Finetuned_Model/Input_file.txt"):
    
    # args = parser.parse_args()
    model_path = "/content/drive/My Drive/Colab Notebooks/INF8460/Project/pytorch_model.bin"
    
    # is_available is a function: without the call, the bare attribute is always truthy.
    if torch.cuda.is_available():
        print('GPU available')
    else:
        print('Please set GPU')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    n_gpu = torch.cuda.device_count()
    
    ### Reading paragraph
    with open(para_file, 'r') as f:
        para = f.read()
    
    ## Reading question
#     f = open(ques_file, 'r')
#     ques = f.read()
#     f.close()
    
    para_list = para.split('\n\n')
    
    input_data = []
    i = 1
    for para in para_list :
        paragraphs = {}
        splits = para.split('\nQuestions:')
        paragraphs['id'] = i
        paragraphs['text'] = splits[0].replace('Paragraph:', '').strip('\n')
        paragraphs['ques']=splits[1].lstrip('\n').split('\n')
        input_data.append(paragraphs)
        i+=1
       
    
    ## input_data is a list of dictionaries, each holding one paragraph and its questions

    
    examples = read_squad_examples(input_data)
    # print(examples)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    
    
    eval_features = convert_examples_to_features(
            examples = examples,
            tokenizer=tokenizer,
            max_seq_length=384,
            doc_stride=128,
            max_query_length=64)
    
    # print(eval_features)
    
    
    all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
    all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
    
    ### Loading Pretrained model for QnA 
    config = BertConfig("BERT-QnA-Squad_2.0_Finetuned_Model//Results/bert_config.json")
    model = BertForQuestionAnswering(config)
    model.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))
    model.to(device)
    model.eval()  # inference mode: disables dropout so predictions are deterministic
   

    pred_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_example_index)
    # Run prediction for full data
    pred_sampler = SequentialSampler(pred_data)
    pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=9)
    
    predictions = []
    for input_ids, input_mask, segment_ids, example_indices in tqdm(pred_dataloader):
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)
        segment_ids = segment_ids.to(device)
        
        with torch.no_grad():
            batch_start_logits, batch_end_logits = model(input_ids, segment_ids, input_mask)
            
    
        features=[]
        example = []
        all_results = []
       
        for i, example_index in enumerate(example_indices):
            start_logits = batch_start_logits[i].detach().cpu().tolist()
            end_logits = batch_end_logits[i].detach().cpu().tolist()
            feature = eval_features[example_index.item()]
            unique_id = int(feature.unique_id)
            features.append(feature)
            all_results.append(RawResult(unique_id=unique_id,
                                         start_logits=start_logits,
                                         end_logits=end_logits))
                
       
        output = predict(examples, features, all_results, 30, 0.25)
        print('Output:\n')
        print(output)
        print('\n')
        predictions.append(output)
 
   
    ### For printing the results ####
    index = None
    for example in examples:
        if index!= example.example_id:
            # print(example.para_text)
            index = example.example_id
            print('\n')
            print(colored('***********Question and Answers *************', 'red'))
          
        ques_text = colored(example.question_text, 'blue')
        print(ques_text)
        prediction = colored(predictions[math.floor(example.unique_id/12)][example], 'green', attrs=['reverse', 'blink'])
        print(prediction)
        print('\n')
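
The parsing step in `main()` implies a specific input file layout: blocks separated by a blank line, each holding a `Paragraph:` line followed by a `Questions:` header and one question per line. The sketch below illustrates that layout on an in-memory string. The helper name `parse_paragraph_file` and the sample text are hypothetical, and the skip-on-malformed-block check and `.strip()` call are added safeguards that `main()` does not have.

```python
def parse_paragraph_file(raw_text):
    """Split raw text into {'id', 'text', 'ques'} records, mirroring main()."""
    records = []
    for idx, block in enumerate(raw_text.strip().split('\n\n'), start=1):
        if '\nQuestions:' not in block:
            continue  # added safeguard: main() would raise IndexError here
        text_part, ques_part = block.split('\nQuestions:', 1)
        records.append({
            'id': idx,
            # .strip() also drops the space left after 'Paragraph:' (assumption)
            'text': text_part.replace('Paragraph:', '').strip(),
            'ques': [q for q in ques_part.strip('\n').split('\n') if q],
        })
    return records


# Hypothetical sample content, in the layout the parser expects.
sample = (
    "Paragraph: Ethers have the general formula R-O-R'.\n"
    "Questions:\n"
    "What is the general formula of ethers?\n"
    "What do R and R' represent?\n"
    "\n"
    "Paragraph: Regis McKenna shaped the Macintosh launch.\n"
    "Questions:\n"
    "Who shaped the Macintosh launch?"
)

records = parse_paragraph_file(sample)
for record in records:
    print(record['id'], record['text'], record['ques'])
```

Each record has the same keys that `main()` passes on to `read_squad_examples`, so a file written in this layout should be usable as `para_file`.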

In [None]:
main()

GPU available


100%|██████████| 1/1 [00:00<00:00,  5.38it/s]

Output:

OrderedDict([(qas_id: 0, question_text: What is the formula of alkyl group?, doc_tokens: [Ethers are a class of organic compounds that contain an ether group—an oxygen atom connected to two alkyl or aryl groups. They have the general formula R–O–R′, where R and R′ represent the alkyl or aryl groups. Ethers can again be classified into two varieties: if the alkyl groups are the same on both sides of the oxygen atom, then it is a simple or symmetrical ether, whereas if they are different, the ethers are called mixed or unsymmetrical ethers. A typical example of the first group is the solvent and anesthetic diethyl ether, commonly referred to simply as "ether" (CH3–CH2–O–CH2–CH3). Ethers are common in organic chemistry and even more prevalent in biochemistry, as they are common linkages in carbohydrates and lignin.], 'R–O–R′'), (qas_id: 1, question_text: What is symmetrical ether ?, doc_tokens: [Ethers are a class of organic compounds that contain an ether group—an oxygen atom co
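
Each answer in the output comes from the top n-best candidate, kept only if its softmax probability exceeds the threshold passed to `predict` (0.25 in `main()`). The following is a minimal sketch of that decision, assuming candidates are already sorted best first. `pick_answer` and the logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    if not scores:
        return []
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_answer(candidates, thresh):
    """candidates: list of (text, start_logit, end_logit), best first.

    Returns (answer_text, probability_of_top_candidate).
    """
    probs = softmax([start + end for _, start, end in candidates])
    top_text = candidates[0][0]
    answer = top_text if probs[0] > thresh else "<No Answer>"
    return answer, probs[0]


# One clearly dominant candidate: its probability is far above the threshold.
confident = pick_answer([("R-O-R'", 5.0, 4.0), ("ether", 2.0, 1.0)], thresh=0.25)

# Ten equally scored candidates: each gets probability 0.1, below the threshold.
uncertain = pick_answer([("span %d" % i, 1.0, 1.0) for i in range(10)], thresh=0.25)

print(confident)
print(uncertain)
```

Because the probability is normalized over the n-best list only, the threshold measures how much the top span dominates its competitors, not an absolute confidence that the paragraph contains an answer.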




In [None]:
main("sample_project.txt")

GPU available


100%|██████████| 1/1 [00:00<00:00,  4.60it/s]

Output:

OrderedDict([(qas_id: 0, question_text: Who heads an office or division?, doc_tokens: [In 1982, Regis McKenna was brought in to shape the marketing and launch of the Macintosh. Later the Regis McKenna team grew to include Jane Anderson, Katie Cadigan and Andy Cunningham, who eventually led the Apple account for the agency. Cunningham and Anderson were the primary authors of the Macintosh launch plan. The launch of the Macintosh pioneered many different tactics that are used today in launching technology products, including the "multiple exclusive," event marketing (credited to John Sculley, who brought the concept over from Pepsi), creating a mystique around a product and giving an inside look into a product's creation.], 'Regis McKenna'), (qas_id: 1, question_text: Who leaders the sub-divisions of offices or divisions?, doc_tokens: [In 1982, Regis McKenna was brought in to shape the marketing and launch of the Macintosh. Later the Regis McKenna team grew to include Jane Ander
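
The span search inside `predict` pairs the highest-scoring start positions with the highest-scoring end positions and ranks the valid pairs by the sum of their logits. Below is a simplified sketch of that idea. It deliberately omits the `token_to_orig_map` and max-context checks, and the function names and logit values are hypothetical.

```python
def top_indexes(logits, n_best):
    """Indexes of the n_best largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:n_best]


def best_spans(start_logits, end_logits, n_best=3, max_answer_length=30):
    """Return (start, end, score) tuples for valid spans, best score first."""
    spans = []
    for start in top_indexes(start_logits, n_best):
        for end in top_indexes(end_logits, n_best):
            if end < start:
                continue  # a span cannot end before it starts
            if end - start + 1 > max_answer_length:
                continue  # span is longer than allowed
            spans.append((start, end, start_logits[start] + end_logits[end]))
    return sorted(spans, key=lambda span: span[2], reverse=True)


# Hypothetical logits for a four-token passage.
start_logits = [0.1, 3.0, 0.2, 1.0]
end_logits = [0.0, 0.5, 2.5, 1.5]

spans = best_spans(start_logits, end_logits, n_best=2)
print(spans)
```

Restricting the search to the `n_best` starts and ends keeps the number of candidate pairs at most `n_best * n_best` per feature, instead of growing quadratically with sequence length.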


