Notes: I tested the retriever using previously generated embedding files and questions extracted from the GPT-3-generated QA pairs. The Contriever model works best on section embeddings and worst on sentence embeddings.
The model does retrieve the proper section when the top 5 sections are considered.


In [8]:
# importing required libraries

import torch
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import json
import re
import numpy as np

In [9]:
tokenizer = AutoTokenizer.from_pretrained('facebook/contriever-msmarco')
model = AutoModel.from_pretrained('facebook/contriever-msmarco')

RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
install() got an unexpected keyword argument 'show_locals'

In [3]:
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings
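
To make the masking step concrete, here is a minimal sketch that applies the same mean pooling to a toy batch. The tensor values and the name `mean_pooling_demo` are illustrative assumptions and are not part of the notebook's data.

```python
import torch

def mean_pooling_demo(token_embeddings, mask):
    # Zero out padded positions, then average over the real tokens only.
    masked = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / mask.sum(dim=1)[..., None]

# Toy batch: 2 sequences, 3 token slots, 2-dimensional embeddings (made-up values).
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = mean_pooling_demo(toy_embeddings, toy_mask)
print(pooled)
```

The padded slot in the first sequence does not affect its average, which is why the attention mask has to be passed alongside the token embeddings.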

In [4]:
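# load the section/paragraph/sentence texts; the context managers close the files

with open("paragraphs.json") as fp:
    para_data = json.load(fp)

with open("sections.json") as fsec:
    section_data = json.load(fsec)

with open("sentences.json") as fs:
    sentence_data = json.load(fs)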

In [5]:
saved_section_embeddings = np.load('section_embeddings.npy')
saved_para_embeddings = np.load('paragraph_embeddings.npy')
saved_sentence_embeddings = np.load('sentence_embeddings.npy')

In [6]:
section_tensors = torch.from_numpy(saved_section_embeddings)
para_tensors = torch.from_numpy(saved_para_embeddings)
sentence_tensors = torch.from_numpy(saved_sentence_embeddings)

In [7]:
print(len(section_data))
print(len(para_data))
print(len(sentence_data))

144
2078
4243


In [8]:
# cleaning the data

def clean(text):
    new_text = re.sub('\n', '', text)
    return new_text
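
Note that replacing `\n` with an empty string joins the words on either side of a line break. A hedged alternative, assuming the intent is a single-line string with words kept apart, is to collapse each run of whitespace into one space. The name `clean_keep_spaces` is hypothetical and the sample string is made up for illustration.

```python
import re

def clean_keep_spaces(text):
    # Collapse every run of whitespace (including newlines) into one space.
    return re.sub(r'\s+', ' ', text).strip()

sample = "shown to\nthe right.\n\nImagine that"
print(clean_keep_spaces(sample))
```

For the single-line questions used here the two versions should behave the same; the difference only matters for multi-line text.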

Extracting questions from the gpt-3 sections data

In [9]:
# loading the gpt-3 data

with open("../data-generator/gpt-3/GPT-3_section_level.json") as fq:
    gpt3_section_data = json.load(fq)
print("length of section data:", len(gpt3_section_data))

length of section data: 144


In [10]:
# create a dictionary with section number as key and question as value

qc_section_dict = {}
for i in range(len(gpt3_section_data)):
    clean_question = clean(gpt3_section_data[i]['questions'][4:])
    qc_section_dict[i] = clean_question

Retrieving most relevant section, paragraph, and sentence from saved embeddings

In [11]:
# returns the index of the section whose embedding has the highest dot product with the question

def section_retriever(embedded_question):
    section_scores = {}
    for i in range(len(section_tensors)):
        score = embedded_question[0] @ section_tensors[i]
        section_scores[i] = score

    # key (section index) with the largest score
    best_index = max(section_scores, key=section_scores.get)
    return best_index

In [27]:
# returns the index of the paragraph whose embedding has the highest dot product with the question

def paragraph_retriever(embedded_question):
    para_scores = {}
    for i in range(len(para_tensors)):
        score = embedded_question[0] @ para_tensors[i]
        para_scores[i] = score

    # key (paragraph index) with the largest score
    best_index = max(para_scores, key=para_scores.get)
    return best_index

In [28]:
# returns the index of the sentence whose embedding has the highest dot product with the question

def sentence_retriever(embedded_question):
    sentence_scores = {}
    for i in range(len(sentence_tensors)):
        score = embedded_question[0] @ sentence_tensors[i]
        sentence_scores[i] = score

    # key (sentence index) with the largest score
    best_index = max(sentence_scores, key=sentence_scores.get)
    return best_index
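
The three retrievers above differ only in the embedding matrix they scan, so the Python loop can be replaced by a single matrix-vector product. The following is a sketch under assumptions: `top_k_retriever` is a hypothetical helper, the toy tensors are made up, and it presumes the question and corpus embeddings share a dtype and dimension.

```python
import torch

def top_k_retriever(embedded_question, corpus_tensors, k=1):
    # embedded_question: shape (1, dim); corpus_tensors: shape (n, dim).
    scores = corpus_tensors @ embedded_question[0]   # one dot product per row
    k = min(k, scores.shape[0])
    top = torch.topk(scores, k)                      # sorted, highest score first
    return [(int(i), float(s)) for i, s in zip(top.indices, top.values)]

# Toy corpus of three 2-dimensional embeddings and one question (made-up values).
toy_corpus = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_question = torch.tensor([[2.0, 1.0]])

result = top_k_retriever(toy_question, toy_corpus, k=2)
print(result)
```

With `k=1` the first element of the returned list plays the role of the index returned by the loop-based functions; passing `section_tensors`, `para_tensors`, or `sentence_tensors` would select the level.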

In [38]:
# creating a dataframe with the index of the top-scoring section/paragraph/sentence for every question

cols = ['Question', 'GPT-3 Section', 'Section', 'Paragraph', 'Sentence']
score_data = []

for j in qc_section_dict:
    question = qc_section_dict[j]
    
    #embed the question
    tokenized_question = tokenizer(question, padding=True, truncation=True, return_tensors='pt')
    output_question = model(**tokenized_question)
    embeddings_question = mean_pooling(output_question[0], tokenized_question['attention_mask'])
    
    # retrieve section/para/sentence
    section_no = section_retriever(embeddings_question)
    paragraph_no = paragraph_retriever(embeddings_question)
    sentence_no = sentence_retriever(embeddings_question)
    
    # store the question, its GPT-3 section number, and the retrieved indices as one row
    row = [question, j, section_no, paragraph_no, sentence_no]
    score_data.append(row)

score_df = pd.DataFrame(score_data, columns=cols)

In [39]:
score_df

     Question                                           GPT-3 Section  Section  Paragraph  Sentence
0    How does the design process for a digital FSM ...              0       28        371         1
1    Why is it important to design digital systems ...              1        1        845      1883
2    What is a Gray code?                                           2        2         12      2600
3    How does a three-bit gray code counter work?                   3        2        124       225
4    Is it possible to create a counter with fewer ...              4       41         30        59
...  ...                                                          ...      ...        ...       ...
139  What is the meaning of the term "universal com...            139       60       1687      3636
140  What is the overflow condition for unsigned ad...            140       73       1398      3046
141  Why is the converse of an implication not alwa...            141       81       1571      4119
142  -Why is it important to know when an addition ...            142       53       1397      4124


In [40]:
# extracting the section level retrievals not matching with GPT-3 data

sections_not_match = score_df.loc[score_df['GPT-3 Section'] != score_df['Section']]

In [62]:
sections_not_match.shape

(88, 5)

In [82]:
k = 4
print(score_df.loc[k])
print("-------------")
print(score_df.loc[k]['Question'])
print("-------------")
print(gpt3_section_data[score_df.loc[k]['GPT-3 Section']]['positive_ctxs']['text'])
print("-------------")
print("SECTION:", section_data[str(score_df.loc[k]['Section'])])
print("-------------")
print("PARAGRAPH:", para_data[str(score_df.loc[k]['Paragraph'])])
print("-------------")
print("SENTENCE: ", sentence_data[str(score_df.loc[k]['Sentence'])])

Question         Is it possible to create a counter with fewer ...
GPT-3 Section                                                    4
Section                                                         41
Paragraph                                                       30
Sentence                                                        59
Name: 4, dtype: object
-------------
Is it possible to create a counter with fewer than 3 flip-flops?
-------------
{Example: A Color Sequencer}


Early graphics systems used a three-bit red-green-blue (RGB) 
encoding for colors.  The color mapping for such a system is shown to
the right.

Imagine that you are charged with creating a counter to drive a light
through a sequence of colors.  The light takes an RGB input as just
described, and the desired pattern is

{off (black)     yellow     violet     green     blue}

You immediately recognize that you merely need a counter with five
states.  How many flip-flops will we need?  At least three, since
\lceil \log_2 (5) \rceil = 3

Retrieving the top 5 sections based on a question

In [88]:
# returns a dict mapping the indices of the 5 highest-scoring sections to their dot products with the question

from operator import itemgetter


def section_retriever(embedded_question):
    section_scores = {}
    for i in range(len(section_tensors)):
        score = embedded_question[0] @ section_tensors[i]
        section_scores[i] = score

    # sort by score, highest first, and keep the top 5
    res = dict(sorted(section_scores.items(), key=itemgetter(1), reverse=True)[:5])
    return res

In [105]:
question = qc_section_dict[4]
print(question)

Is it possible to create a counter with fewer than 3 flip-flops?


In [106]:
# embed the question
tokenized_question = tokenizer(question, padding=True, truncation=True, return_tensors='pt')
output_question = model(**tokenized_question)
embeddings_question = mean_pooling(output_question[0], tokenized_question['attention_mask'])

# retrieve the top 5 sections and their scores
section_no = section_retriever(embeddings_question)

In [107]:
section_no

{41: tensor(1.4539, grad_fn=<DotBackward0>),
 78: tensor(1.3056, grad_fn=<DotBackward0>),
 40: tensor(1.3005, grad_fn=<DotBackward0>),
 4: tensor(1.2652, grad_fn=<DotBackward0>),
 2: tensor(1.2119, grad_fn=<DotBackward0>)}
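
The observation that retrieval succeeds once the top 5 sections are considered can be quantified with a hit rate at k: the fraction of questions whose source section appears among the k highest-scoring sections. A minimal sketch, assuming each question's gold section is its GPT-3 section number as in the dataframe above; `hit_rate_at_k` is a hypothetical helper, and only the first toy ranking mirrors the output above while the other two are made up.

```python
def hit_rate_at_k(ranked_ids_per_question, gold_ids, k=5):
    # ranked_ids_per_question[i] lists retrieved section ids, best first.
    hits = sum(
        1 for ranked, gold in zip(ranked_ids_per_question, gold_ids)
        if gold in ranked[:k]
    )
    return hits / len(gold_ids)

# Rankings for three questions whose gold sections are 4, 1, and 7.
toy_rankings = [[41, 78, 40, 4, 2], [1, 9, 3, 5, 6], [2, 3, 5, 8, 9]]
toy_gold = [4, 1, 7]

print(hit_rate_at_k(toy_rankings, toy_gold, k=5))
print(hit_rate_at_k(toy_rankings, toy_gold, k=1))
```

Comparing the value at `k=1` with the value at `k=5` over all questions would show how much of the top-1 mismatch count is recovered by looking further down the ranking.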

In [114]:
print(section_data['41'])

{Ripple Counters}

A second class of
counter drives some of its flip-flops with a clock signal and feeds
flip-flop outputs into the clock inputs of its remaining flip-flops,
possibly through additional logic.  Such a counter is called a {
ripple counter}, because the effect of a clock edge ripples through
the flip-flops.  The delay inherent to the ripple effect, along with
the complexity of ensuring that timing issues do not render the design
unreliable, are the major drawbacks of ripple counters.  Compared with
synchronous counters, however, ripple counters consume less energy,
and are sometimes used for devices with restricted energy supplies.


General ripple counters
can be tricky because of timing issues, but certain types are easy.

Consider the design of binary ripple counter.  The state diagram for 
a {3-bit} binary counter is replicated to the right.
Looking
at the states, notice that the least-significant bit alternates with
each state, while higher bits flip whenever the nex