## File input and preprocessing

In [None]:
import pandas as pd
import numpy as np

In [None]:
train_df = pd.read_csv('FAQs.csv')

In [None]:
train_df

Unnamed: 0,Question,Answer
0,When was Albert Einstein born?,Albert Einstein was born on 14 March 1879.
1,Where was he born?,"He was born in Ulm, Germany."
2,When did he die?,"He died 18 April 1955 in Princeton, New Jersey..."
3,Who were his parents?,His father was Hermann Einstein and his mother...
4,Did he have any sisters and brothers?,He had one sister named Maja.
5,Did he marry and have children?,He was married to Mileva Marić between 1903 an...
6,Where did he receive his education?,He received his main education at the followin...
7,When was Albert Einstein awarded the Nobel Pri...,"The Nobel Prize Awarding Institution, the Roya..."
8,Did Albert Einstein attend the Nobel Prize Awa...,The Nobel Prize was announced on 9 November 19...
9,For what did he receive the Nobel Prize?,Einstein was rewarded for his many contributio...


In [None]:
len(train_df)

10

In [None]:
questions = train_df['Question']
answers = train_df['Answer']

In [None]:
FAQ = dict(zip(questions, answers))

In [None]:
test_df = pd.read_csv('FAQs_test.csv')

In [None]:
test_questions = test_df['Question']

In [None]:
test_questions

0        What is the date of his death?
1           Did Einstein have siblings?
2                     Who was his wife?
3    What was Einstein's father's name?
4    At what institutions did he study?
Name: Question, dtype: object

## Sentence Transformers: multi-qa-mpnet-base-dot-v1

In [None]:
pip install -U sentence-transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected 

In [None]:
from sentence_transformers import SentenceTransformer, util

In [None]:
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')


In [None]:
print(test_questions[0])

What is the date of his death?


In [None]:
# final_answers = []

In [None]:
# Encode the FAQ questions once, outside the loop, instead of on every iteration
doc_emb = model.encode(questions.tolist())

final_answers = []
for ques in test_questions:
  query_emb = model.encode(ques)
  # Dot-product score between the test question and every FAQ question
  scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

  # Pair each FAQ question with its score and sort by decreasing score
  doc_score_pairs = sorted(zip(questions, scores), key=lambda x: x[1], reverse=True)

  # The top-ranked FAQ question is the closest match
  similar_ques = doc_score_pairs[0][0]
  print(ques)
  print(similar_ques)

  # Look up the answer that belongs to the matched FAQ question
  final_answers.append(FAQ[similar_ques])
  

What is the date of his death?
When did he die?
Did Einstein have siblings?
Did he have any sisters and brothers?
Who was his wife?
Did he marry and have children?
What was Einstein's father's name?
When was Albert Einstein born?
At what institutions did he study?
Where did he receive his education?
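The matching step in the loop above boils down to scoring every FAQ question against the query with a dot product and keeping the highest-scoring one. Below is a minimal sketch of that idea in plain Python. The three-dimensional vectors are invented for illustration and are not real model output, and `dot` and `best_match` are hypothetical helper names, not part of sentence-transformers.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query_vec: Sequence[float],
               doc_vecs: List[Sequence[float]],
               docs: List[str]) -> Tuple[str, float]:
    """Return the document with the highest dot-product score, and that score."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    best = max(range(len(docs)), key=scores.__getitem__)
    return docs[best], scores[best]


# Toy embeddings, made up for this sketch (assumption, not model output)
faq_questions = ["When did he die?", "Where was he born?"]
faq_vectors = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
query_vector = [0.8, 0.2, 0.1]  # stands in for "What is the date of his death?"

match, score = best_match(query_vector, faq_vectors, faq_questions)
print(match)
```

With real embeddings the logic is the same; only the vectors come from `model.encode` instead of being written by hand.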


In [None]:
final_test = list(zip(test_questions, final_answers))

In [None]:
final_test_df = pd.DataFrame(final_test, columns=['Questions','Answers'])

In [None]:
final_test_df

Unnamed: 0,Questions,Answers
0,What is the date of his death?,"He died 18 April 1955 in Princeton, New Jersey..."
1,Did Einstein have siblings?,He had one sister named Maja.
2,Who was his wife?,He was married to Mileva Marić between 1903 an...
3,What was Einstein's father's name?,Albert Einstein was born on 14 March 1879.
4,At what institutions did he study?,He received his main education at the followin...


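From our general knowledge, **we can see that the model matched 4 of the 5 test questions correctly, so the accuracy is 80% on this small test set.** The one miss is "What was Einstein's father's name?", which was matched to "When was Albert Einstein born?" rather than "Who were his parents?". Let's see if we can improve it further.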
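The accuracy figure can be reproduced with a few lines of Python. The sketch below copies the matched FAQ questions from the printed output above and compares them with hand-assigned expected matches. The `expected` mapping is an assumption based on reading the FAQ table; it is not a labelled file that ships with the notebook.

```python
# (test question, FAQ question the model selected), copied from the printed output
predicted = [
    ("What is the date of his death?", "When did he die?"),
    ("Did Einstein have siblings?", "Did he have any sisters and brothers?"),
    ("Who was his wife?", "Did he marry and have children?"),
    ("What was Einstein's father's name?", "When was Albert Einstein born?"),
    ("At what institutions did he study?", "Where did he receive his education?"),
]

# Hand-assigned expected matches (assumption: judged by reading the FAQ table)
expected = {
    "What is the date of his death?": "When did he die?",
    "Did Einstein have siblings?": "Did he have any sisters and brothers?",
    "Who was his wife?": "Did he marry and have children?",
    "What was Einstein's father's name?": "Who were his parents?",
    "At what institutions did he study?": "Where did he receive his education?",
}

# Count the test questions whose selected FAQ entry equals the expected one
correct = sum(1 for question, match in predicted if expected[question] == match)
accuracy = correct / len(predicted)
print(f"{correct}/{len(predicted)} correct, accuracy = {accuracy:.0%}")
```

With only five test questions, each mismatch shifts the accuracy by 20 percentage points, so the number is a rough indication at best.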

## Using BERT and sent2vec

In [None]:
pip install sent2vec

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sent2vec
  Downloading sent2vec-0.3.0-py3-none-any.whl (8.1 kB)
Installing collected packages: sent2vec
Successfully installed sent2vec-0.3.0


In [None]:
type(questions)

pandas.core.series.Series

In [None]:
from scipy import spatial
from sent2vec.vectorizer import Vectorizer

In [None]:
vectorizer = Vectorizer(pretrained_weights='distilbert-base-uncased')
vectorizer.run(questions.to_list())
vectors_bert = vectorizer.vectors

Initializing Bert distilbert-base-uncased
Vectorization done on cpu



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
vectorizer_test = Vectorizer(pretrained_weights='distilbert-base-uncased')
vectorizer_test.run(test_questions.to_list())
vectors_bert_test = vectorizer_test.vectors

Initializing Bert distilbert-base-uncased
Vectorization done on cpu


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
dist_1 = spatial.distance.cosine(vectors_bert[0], vectors_bert[1])

In [None]:
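selected_questions = []
for i in range(len(test_questions)):
  min_dist = float('inf')
  index = -1
  for j in range(len(questions)):
    # Cosine distance between test question i and FAQ question j
    dist = spatial.distance.cosine(vectors_bert_test[i], vectors_bert[j])
    if dist < min_dist:
      min_dist = dist
      index = j
  # Append the closest match. Using questions[j] here would always append
  # the last FAQ question, whatever the distances are.
  selected_questions.append(questions[index])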

    

In [None]:
selected_questions

['For what did he receive the Nobel Prize?',
 'For what did he receive the Nobel Prize?',
 'For what did he receive the Nobel Prize?',
 'For what did he receive the Nobel Prize?',
 'For what did he receive the Nobel Prize?']

In [None]:
final_answers_bert = []
for qs in selected_questions:
  final_answers_bert.append(FAQ[qs])

In [None]:
final_answers_bert

['Einstein was rewarded for his many contributions to theoretical physics, and especially for his discovery of the law of the photoelectric effect.',
 'Einstein was rewarded for his many contributions to theoretical physics, and especially for his discovery of the law of the photoelectric effect.',
 'Einstein was rewarded for his many contributions to theoretical physics, and especially for his discovery of the law of the photoelectric effect.',
 'Einstein was rewarded for his many contributions to theoretical physics, and especially for his discovery of the law of the photoelectric effect.',
 'Einstein was rewarded for his many contributions to theoretical physics, and especially for his discovery of the law of the photoelectric effect.']

In [None]:
final_test_bert = list(zip(test_questions, final_answers_bert))

In [None]:
final_test_bert_df = pd.DataFrame(final_test_bert, columns=['Questions','Answers'])

In [None]:
final_test_bert_df

Unnamed: 0,Questions,Answers
0,What is the date of his death?,Einstein was rewarded for his many contributio...
1,Did Einstein have siblings?,Einstein was rewarded for his many contributio...
2,Who was his wife?,Einstein was rewarded for his many contributio...
3,What was Einstein's father's name?,Einstein was rewarded for his many contributio...
4,At what institutions did he study?,Einstein was rewarded for his many contributio...


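**So, in this run, distilbert-base-uncased does not give a useful result: every test question is mapped to the same FAQ entry.** Note that the selected entry is always the last FAQ question, which is exactly what appending `questions[j]` (the final loop index) instead of `questions[index]` (the closest match) produces. These outputs may therefore reflect the selection step rather than the quality of the embeddings.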
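The nearest-neighbour selection used in this section can be sketched in plain Python as follows. The two-dimensional vectors are invented for illustration, and `cosine_distance` and `nearest_index` are hypothetical helper names. The distance is defined as one minus the cosine similarity, which is the same definition `scipy.spatial.distance.cosine` uses.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_index(query: Sequence[float], candidates: List[Sequence[float]]) -> int:
    """Index of the candidate with the smallest cosine distance to the query."""
    best_index, best_dist = -1, math.inf
    for j, candidate in enumerate(candidates):
        dist = cosine_distance(query, candidate)
        if dist < best_dist:
            best_index, best_dist = j, dist
    # Return the tracked best index, not the final loop index j
    return best_index


# Toy vectors, made up for this sketch (assumption, not model output)
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.1, 0.9]

idx = nearest_index(query, candidates)
print(idx)
```

Here the closest candidate is the second one, while the final loop index points at the third, which shows why the tracked index and the loop variable must not be confused.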

## Best Outcome

The best outcome comes from multi-qa-mpnet-base-dot-v1, which matched 4 of the 5 test questions.

In [None]:
final_test_df

Unnamed: 0,Questions,Answers
0,What is the date of his death?,"He died 18 April 1955 in Princeton, New Jersey..."
1,Did Einstein have siblings?,He had one sister named Maja.
2,Who was his wife?,He was married to Mileva Marić between 1903 an...
3,What was Einstein's father's name?,Albert Einstein was born on 14 March 1879.
4,At what institutions did he study?,He received his main education at the followin...


In [None]:
final_test_df.to_csv('FAQ_QA.csv', index=False)