# Contriever - Playground

## Imports

In [170]:
%load_ext autoreload
%autoreload 2

import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd
from tqdm import trange, tqdm
from bs4 import BeautifulSoup

import sys
sys.path.append('../code')
sys.path.append('../code/contriever-main')
sys.path.append('../code/contriever-main/src')

from contriever_utils import calc_contriever_score, get_contriever_embedding, mail_raw_text2paragraphs
from hillary_mails import HillaryEmails, HillaryQueryScore


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load model and tokenizer

In [22]:
from src.contriever import Contriever
from transformers import AutoTokenizer

contriever = Contriever.from_pretrained("facebook/contriever")
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")  # load the associated tokenizer

Some weights of the model checkpoint at facebook/contriever were not used when initializing Contriever: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing Contriever from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Contriever from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
slat_text = " Salt is a mineral composed primarily of sodium chloride (NaCl), a chemical compound belonging to the larger class of salts; salt in the form of a natural crystalline mineral is known as rock salt or halite. Salt is present in vast quantities in seawater. The open ocean has about 35 g (1.2 oz) of solids per liter of sea water, a salinity of 3.5%. Salt is essential for life in general, and saltiness is one of the basic human tastes. Salt is one of the oldest and most ubiquitous food seasonings, and is known to uniformly improve the taste perception of food, including otherwise unpalatable food.[1] Salting, brining, and pickling are also ancient and important methods of food preservation. Some of the earliest evidence of salt processing dates to around 6,000 BC, when people living in the area of present-day Romania boiled spring water to extract salts; a salt-works in China dates to approximately the same period. Salt was also prized by the ancient Hebrews, Greeks, Romans, Byzantines, Hittites, Egyptians, and Indians. Salt became an important article of trade and was transported by boat across the Mediterranean Sea, along specially built salt roads, and across the Sahara on camel caravans. The scarcity and universal need for salt have led nations to go to war over it and use it to raise tax revenues. Salt is used in religious ceremonies and has other cultural and traditional significance. Salt is processed from salt mines, and by the evaporation of seawater (sea salt) and mineral-rich spring water in shallow pools. The greatest single use for salt (sodium chloride) is as a feedstock for the production of chemicals.[2] It is used to produce caustic soda and chlorine; it is also used in the manufacturing processes of polyvinyl chloride, plastics, paper pulp and many other products. Of the annual global production of around three hundred million tonnes of salt, only a small percentage is used for human consumption. 
Other uses include water conditioning processes, de-icing highways, and agricultural use. Edible salt is sold in forms such as sea salt and table salt which usually contains an anti-caking agent and may be iodised to prevent iodine deficiency. As well as its use in cooking and at the table, salt is present in many processed foods. Sodium is an essential nutrient for human health via its role as an electrolyte and osmotic solute.[3][4][5] Excessive salt consumption may increase the risk of cardiovascular diseases, such as hypertension, in children and adults. Such health effects of salt have long been studied. Accordingly, numerous world health associations and experts in developed countries recommend reducing consumption of popular salty foods.[5][6] The World Health Organization recommends that adults consume less than 2,000 mg of sodium, equivalent to 5 grams of salt per day"

In [4]:
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace.",
    slat_text,
    "diagestible product"
]

model = contriever
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs)

In [5]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [6]:
embeddings

tensor([[-0.0209, -0.0018, -0.0142,  ...,  0.0633, -0.0633, -0.0476],
        [-0.0803,  0.0197, -0.0279,  ..., -0.0004, -0.0683, -0.0378],
        [-0.0471, -0.0310, -0.0315,  ...,  0.0225,  0.0133, -0.0278],
        [ 0.0935,  0.0421,  0.0112,  ...,  0.0324, -0.0065,  0.0623],
        [-0.0451, -0.0315, -0.0090,  ..., -0.0232,  0.0295, -0.0788]],
       grad_fn=<DivBackward0>)
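
The `grad_fn=<DivBackward0>` on the output indicates that the last operation was a division, which is consistent with the mean pooling shown in the published Contriever usage example: token vectors are summed under the attention mask and divided by the number of real tokens. A NumPy sketch of that pooling step, with toy arrays standing in for the model's hidden states (the values are illustrative, not model output):

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padded tokens contribute 0
    counts = mask.sum(axis=1)                        # real tokens per sequence
    return summed / counts

# Two sequences of length 3 with 2-D token vectors; the second has one padded slot.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last position is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = mean_pooling(tokens, mask)
print(pooled)
```

Because padded positions are masked out, the long salt paragraph and the two-word query in the batch above are each averaged over their own tokens only.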

In [7]:
score01 = embeddings[0] @ embeddings[1]  # 1.0473
score02 = embeddings[0] @ embeddings[2]  # 1.0095
score03 = embeddings[0] @ embeddings[3]  # 0.4666

In [8]:
score01

tensor(1.0473, grad_fn=<DotBackward0>)

In [9]:
score02

tensor(1.0095, grad_fn=<DotBackward0>)

In [10]:
score03

tensor(0.4666, grad_fn=<DotBackward0>)

In [11]:
# Same comparison with sentence 4 ("diagestible product") as the query.
score01 = embeddings[4] @ embeddings[1]  # 0.4529
score02 = embeddings[4] @ embeddings[2]  # 0.4204
score03 = embeddings[4] @ embeddings[3]  # 0.7001
print(f"score01 = {score01}")
print(f"score02 = {score02}")
print(f"score03 = {score03}")

score01 = 0.45287925004959106
score02 = 0.420390784740448
score03 = 0.7001084685325623


In [15]:
np.linalg.norm(embeddings[3].detach().numpy())

1.82967
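
The norm printed above is about 1.83 rather than 1, so these embeddings are not unit-length, and the raw dot products in the earlier cells (some of them above 1) mix direction with magnitude. A minimal sketch of the difference, using small made-up vectors rather than real Contriever output:

```python
import numpy as np

def dot_score(a, b):
    """Raw inner product: sensitive to vector length."""
    return float(np.dot(a, b))

def cosine_score(a, b):
    """Inner product of the unit-normalized vectors: depends on direction only."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (illustrative values, not model output).
query = np.array([1.0, 0.0])
short_doc = np.array([0.5, 0.0])   # same direction as the query, small norm
long_doc = np.array([3.0, 3.0])    # 45 degrees off the query, large norm

print(dot_score(query, short_doc), dot_score(query, long_doc))
print(cosine_score(query, short_doc), cosine_score(query, long_doc))
```

The two scores rank the toy documents in opposite orders. Which one is appropriate depends on how the model was trained; the later cells in this notebook use the cosine form.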

## TechCrunch Scrapings

In [92]:
paragraphs_path = '/home/ron/projects/contriever/notebooks/techcrunch_paragraphs21.fea'

paragraphs = pd.read_feather(paragraphs_path)
paragraphs

Unnamed: 0,index,url,paragraph
0,0,https://techcrunch.com/2023/01/31/snapchat-now...,Snapchat has more than 2 million paid subscrib...
1,0,https://techcrunch.com/2023/01/31/snapchat-now...,The social network first launched Snapchat+ in...
2,0,https://techcrunch.com/2023/01/31/snapchat-now...,Snap was able to attract a million subscribers...
3,0,https://techcrunch.com/2023/01/31/snapchat-now...,Snapchat+ offers users features like the abili...
4,0,https://techcrunch.com/2023/01/31/snapchat-now...,"In the last quarter, the company introduces ne..."
...,...,...,...
3869,312,https://techcrunch.com/2023/01/12/career-karma...,"“Last year, we made the decision to right-size..."
3870,312,https://techcrunch.com/2023/01/12/career-karma...,"During Career Karma’s last cut, Harris emphasi..."
3871,312,https://techcrunch.com/2023/01/12/career-karma...,"As TechCrunch has discussed in the past, the s..."
3872,312,https://techcrunch.com/2023/01/12/career-karma...,"With 80 staff now remaining at Career Karma, H..."


In [55]:
batch_size = 100
embeddings = []

for i in trange(0, len(paragraphs), batch_size):
    embeddings += get_contriever_embedding(paragraphs.iloc[i:i+batch_size]['paragraph'])

100% 13/13 [01:12<00:00,  5.56s/it]
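
The loop above feeds the encoder 100 paragraphs at a time, so no single forward pass has to hold the whole corpus. The pattern can be sketched independently of the model; `embed_in_batches` and `fake_embed` below are hypothetical stand-ins, not functions from `contriever_utils`.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Apply embed_fn to consecutive slices of texts and concatenate the results."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))  # embed_fn returns one vector per text
    return embeddings

def fake_embed(batch):
    """Stand-in encoder: maps each text to a 2-element vector of simple counts."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]

texts = [f"paragraph number {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors), vectors[0])
```

The last batch is simply shorter than the others, so the corpus size does not need to be a multiple of `batch_size`.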


In [56]:
paragraphs['embedding'] = embeddings

In [59]:
query = 'How many users does ChatGPT have?'
query_embedding = get_contriever_embedding([query])[0]

In [67]:
best_idx = paragraphs['embedding'].map(
    lambda emb: np.dot(emb, query_embedding) / (
        np.linalg.norm(emb) * np.linalg.norm(query_embedding)
    )
).argmax()

best_paragraph = paragraphs.iloc[best_idx]['paragraph']

In [68]:
best_paragraph

'The potential of AI tools like ChatGPT creates a similar dilemma — should companies license large language models without modifications, or customize them and pay much higher usage rates?'
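
The per-row `map` above computes one cosine similarity at a time. The same ranking can be sketched as a single matrix operation that also returns the top few hits instead of only the best one. The function name `top_k_cosine` and the toy data are illustrative and not part of `contriever_utils`.

```python
import numpy as np

def top_k_cosine(doc_embeddings, query_embedding, k=3):
    """Return (indices, scores) of the k rows most similar to the query."""
    docs = np.asarray(doc_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Normalize rows and query so the matrix-vector product is cosine similarity.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = docs_unit @ query_unit
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# Toy 2-D "embeddings" standing in for the real model vectors.
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.0]]
idx, scores = top_k_cosine(docs, [2.0, 0.0], k=2)
print(idx, scores)
```

With the DataFrame above, `np.stack(paragraphs['embedding'].to_numpy())` would build the matrix, assuming every row holds a vector of the same length.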

In [69]:
contriever_scores = paragraphs['embedding'].map(lambda emb: calc_contriever_score(emb, query_embedding))

In [82]:
paragraphs[contriever_scores > 0.4]

Unnamed: 0,index,url,paragraph,embedding
55,3,https://techcrunch.com/2023/01/31/daily-crunch...,"Meanwhile, Rita ponders what would happen if C...","[-0.027771467342972755, -0.08329060673713684, ..."
57,4,https://techcrunch.com/2023/01/31/openai-relea...,As the fervor around generative AI — particula...,"[0.0299002043902874, -0.12141577154397964, 0.0..."
58,4,https://techcrunch.com/2023/01/31/openai-relea...,OpenAI’s classifier — aptly called OpenAI AI T...,"[-0.0034281201660633087, -0.026922766119241714..."
62,4,https://techcrunch.com/2023/01/31/openai-relea...,"Out of curiosity, I fed some text through the ...","[-0.0027712653391063213, 3.058241054532118e-05..."
494,31,https://techcrunch.com/2023/01/30/rewinds-new-...,"For a bit of fun, the app leveraged ChatGPT to...","[-0.05853530392050743, -0.11826053261756897, 0..."
812,58,https://techcrunch.com/2023/01/27/best-twitter...,Discord doesn’t really work like Twitter at al...,"[-0.0696338340640068, -0.041346512734889984, -..."
815,58,https://techcrunch.com/2023/01/27/best-twitter...,The downside is that Discord is more about cha...,"[-0.04414699971675873, -0.11405253410339355, 0..."
829,58,https://techcrunch.com/2023/01/27/best-twitter...,We’ll keep this list updated as we explore new...,"[-0.022291801869869232, -0.07762783020734787, ..."
875,63,https://techcrunch.com/2023/01/27/techcrunch-r...,The potential of AI tools like ChatGPT creates...,"[0.015728451311588287, -0.06545095145702362, 0..."
945,67,https://techcrunch.com/2023/01/27/the-current-...,"Companies like Stability AI and OpenAI, the co...","[-0.059773217886686325, -0.06636650115251541, ..."


In [85]:
paragraphs.iloc[1213]

index                                                       87
url          https://techcrunch.com/2023/01/26/nea-now-mana...
paragraph    On one more note about LPs, Sandell shut down ...
embedding    [0.011584528721868992, -0.052937474101781845, ...
Name: 1213, dtype: object

In [93]:
paragraphs[paragraphs['paragraph'].str.startswith('Since')]

Unnamed: 0,index,url,paragraph
62,4,https://techcrunch.com/2023/01/31/energy-x-sec...,"Since its inception in 2019, the startup says ..."
263,21,https://techcrunch.com/2023/01/30/secai-marche...,"Since its seed funding, Secai Marche has built..."
796,58,https://techcrunch.com/2023/01/27/best-twitter...,"Since Cohost is fairly new and a bit rocky, it..."
849,63,https://techcrunch.com/2023/01/27/techcrunch-r...,"Since most startups are not AI businesses, his..."
1384,106,https://techcrunch.com/2023/01/25/apparel-desi...,"Since January 6, a database containing hundred..."
1492,115,https://techcrunch.com/2023/01/24/japans-terra...,"Since its last fundraising, its subsidiaries’ ..."
1559,123,https://techcrunch.com/2023/01/24/all-raise-ce...,"Since launch, the nonprofit has raised $11 mil..."
1678,130,https://techcrunch.com/2023/01/24/alexa-funds-...,Since the Alexa fund was created under Jeff Be...
2239,180,https://techcrunch.com/2023/01/20/a-new-kind-o...,"Since its foundation in 2019, FLEX Capital has..."
3072,253,https://techcrunch.com/2023/01/16/twitters-thi...,"Since the beginning of the saga, many develope..."


## Hillary

In [23]:
emails_path = '/home/hackathon_2023/data/hillary_clinton_emails/data/Emails.csv'

In [24]:
emails = pd.read_csv(emails_path)

In [25]:
emails

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\nFriday, March 11,...",B6\nUNCLASSIFIED\nU.S. Department of State\nCa...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,7941,C05778462,WYDEN,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>",,"Thursday, December 16, 2010 7:41 PM",F-2014-20439,C05778462,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7941,7942,C05778463,SENATE,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>","Sullivan, Jacob J; Mills, Cheryl D; Abedin, Huma","Thursday, December 16, 2010 8:09 PM",F-2014-20439,C05778463,08/31/2015,RELEASE IN FULL,Big change of plans in the Senate. Senator Rei...,UNCLASSIFIED U.S. Department of State Case No....
7942,7943,C05778465,RICHARD (TNR),H,"Jiloty, Lauren C",116.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Jiloty, Lauren C <JilotyLC@state.gov>",,"Thursday, December 16, 2010 10:52 PM",F-2014-20439,C05778465,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7943,7944,C05778466,FROM,H,PVerveer,143.0,2012-12-17T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,"PVervee,",,,12/14/201,F-2014-20439,C05778466,08/31/2015,RELEASE IN PART,"PVerveer B6\nFriday, December 17, 2010 12:12 A...","Hi dear Melanne and Alyse,\nHope this email re..."


In [66]:

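# Assumption: raw_text1 is not defined in the cells shown; the output below
# appears to come from the Syria/Qaddafi memo at row 1 of `emails`.
raw_text1 = emails.iloc[1]['RawText']
mail_raw_text2paragraphs(raw_text1)[8]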

"The Syrian soldiers in Libya are part of a mission established in 1984 following the signing of a military agreement between Qaddafi and Syria's long-time ruler and Bashir's father, Hafez al- Assad, in the presence of General Soubhi Haddad, who was the commander in chief of the Air Force at the time. Both Air Forces are equipped with Russian materiel and have had long- standing, close links with Moscow."

In [126]:
hl = HillaryEmails()



In [160]:
paragraphs = hl.get_paragraphs()

In [161]:
paragraphs

Unnamed: 0,DocNumber,RawText,paragraphs
0,C05739545,UNCLASSIFIED\nU.S. Department of State\nCase N...,"What a wonderful, strong and moving statement ..."
1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,H: Latest How Syria is aiding Qaddafi and more...
1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,hrc memo syria aiding libya 030311.docx; hrc m...
1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,This memo has two parts. Part one is the repor...
1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,"During the afternoon of March 3, advisers to M..."
...,...,...,...
7942,C05778465,UNCLASSIFIED U.S. Department of State Case No....,Source URL: http://www.tnr.com/article/79956/r...
7943,C05778466,"Hi dear Melanne and Alyse,\nHope this email re...",Hope this email reaches you at good time.
7943,C05778466,"Hi dear Melanne and Alyse,\nHope this email re...",let me know if I can be of any help to your de...
7944,C05778470,UNCLASSIFIED U.S. Department of State Case No....,The outpouring of condolences from the interna...


In [162]:
hl.add_embedding()

100% 423/423 [1:22:27<00:00, 11.70s/it]


In [165]:
fea_save_path = '/home/hackathon_2023/ron/hillary_paragraph_embedded.fea'
hl.emails_paragraphs.reset_index().to_feather(fea_save_path)

In [166]:
paragraphs = pd.read_feather(fea_save_path)

In [167]:
paragraphs

Unnamed: 0,index,DocNumber,RawText,paragraphs,embedding
0,0,C05739545,UNCLASSIFIED\nU.S. Department of State\nCase N...,"What a wonderful, strong and moving statement ...","[0.021930361166596413, -0.021510425955057144, ..."
1,1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,H: Latest How Syria is aiding Qaddafi and more...,"[-0.0028949459083378315, 0.003931017126888037,..."
2,1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,hrc memo syria aiding libya 030311.docx; hrc m...,"[-0.005725331604480743, -0.03571368753910065, ..."
3,1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,This memo has two parts. Part one is the repor...,"[0.0843600481748581, 0.08511161804199219, -0.0..."
4,1,C05739546,UNCLASSIFIED\nU.S. Department of State\nCase N...,"During the afternoon of March 3, advisers to M...","[0.05619313567876816, -0.01754940114915371, -0..."
...,...,...,...,...,...
42231,7942,C05778465,UNCLASSIFIED U.S. Department of State Case No....,Source URL: http://www.tnr.com/article/79956/r...,"[0.009207452647387981, 0.02946539968252182, -0..."
42232,7943,C05778466,"Hi dear Melanne and Alyse,\nHope this email re...",Hope this email reaches you at good time.,"[-0.037801507860422134, -0.0005279278266243637..."
42233,7943,C05778466,"Hi dear Melanne and Alyse,\nHope this email re...",let me know if I can be of any help to your de...,"[-0.06297259032726288, 0.018239501863718033, 0..."
42234,7944,C05778470,UNCLASSIFIED U.S. Department of State Case No....,The outpouring of condolences from the interna...,"[0.1036808043718338, 6.542534720210824e-06, -0..."


## Queries

In [182]:
hqs = HillaryQueryScore()

In [183]:
query = 'Benghazi attack'

query_scores = hqs.get_query_scores(query)
query_scores

Unnamed: 0,index,DocNumber,RawText,paragraphs,embedding,score
2373,227,C05739826,UNCLASSIFIED\nU.S. Department of State\nCase N...,"on the successful Benghazi attack, or launchin...","[-0.030353069305419922, -0.032487206161022186,...",0.721530
2028,202,C05739798,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libyans march against Islamist militias in Ben...,"[0.03709092363715172, -0.00034611611044965684,...",0.704742
1788,166,C05739757,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libyans march against Islamist militias in Ben...,"[0.03709092363715172, -0.00034611611044965684,...",0.704742
36,6,C05739560,UNCLASSIFIED\nU.S. Department of State\nCase N...,Ambassador J. Christopher Stevens and three ot...,"[-0.019471045583486557, -0.07690425217151642, ...",0.689470
502,52,C05739623,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
...,...,...,...,...,...,...
28012,5284,C05769949,UNCLASSIFIED U.S. Department of State Case No....,The core goals of the Arenga Reforestation Pro...,"[-0.021177805960178375, -0.08578544110059738, ...",0.125444
41274,7718,C05775819,UNCLASSIFIED U.S. Department of State Case No....,The core goals of the Arenga Reforestation Pro...,"[-0.021177805960178375, -0.08578544110059738, ...",0.125444
25787,4860,C05769080,UNCLASSIFIED U.S. Department of State Case No....,"1171:n70.1134041:=2:147.:MTY:71,32'W...arinTWA...","[0.0834178626537323, 0.02872912585735321, -0.0...",0.120128
6793,1359,C05760796,UNCLASSIFIED U.S. Department of State Case No....,"It is this conviction that drove Domenici, who...","[-0.061600152403116226, 0.09540513902902603, -...",0.119166


In [184]:
query_scores.iloc[:10]

Unnamed: 0,index,DocNumber,RawText,paragraphs,embedding,score
2373,227,C05739826,UNCLASSIFIED\nU.S. Department of State\nCase N...,"on the successful Benghazi attack, or launchin...","[-0.030353069305419922, -0.032487206161022186,...",0.72153
2028,202,C05739798,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libyans march against Islamist militias in Ben...,"[0.03709092363715172, -0.00034611611044965684,...",0.704742
1788,166,C05739757,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libyans march against Islamist militias in Ben...,"[0.03709092363715172, -0.00034611611044965684,...",0.704742
36,6,C05739560,UNCLASSIFIED\nU.S. Department of State\nCase N...,Ambassador J. Christopher Stevens and three ot...,"[-0.019471045583486557, -0.07690425217151642, ...",0.68947
502,52,C05739623,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
567,57,C05739629,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
534,55,C05739627,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
446,46,C05739615,B6\nUNCLASSIFIED\nU.S. Department of State\nCa...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
465,48,C05739618,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
484,50,C05739620,UNCLASSIFIED\nU.S. Department of State\nCase N...,Libya attack to paint Obama as weak on terrorism,"[-0.03538616746664047, -0.011815848760306835, ...",0.673741
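
Several of the top hits above are the same paragraph quoted in different emails (identical text, embedding, and score). One way to surface more distinct results is to keep only the best-scoring row per paragraph text. A plain-Python sketch over made-up `(doc_number, paragraph, score)` tuples, with `dedupe_by_text` as an illustrative name:

```python
def dedupe_by_text(rows):
    """Keep the highest-scoring row for each distinct paragraph text.

    rows: iterable of (doc_number, paragraph, score) tuples.
    Returns the surviving rows sorted by score, highest first.
    """
    best = {}
    for doc_number, paragraph, score in rows:
        # Strict '>' keeps the first row seen when scores tie.
        if paragraph not in best or score > best[paragraph][2]:
            best[paragraph] = (doc_number, paragraph, score)
    return sorted(best.values(), key=lambda row: row[2], reverse=True)

# Made-up rows that mimic the duplication pattern in the table above.
rows = [
    ("doc-a", "first paragraph", 0.72),
    ("doc-b", "second paragraph", 0.70),
    ("doc-c", "second paragraph", 0.70),
    ("doc-d", "third paragraph", 0.67),
    ("doc-e", "third paragraph", 0.67),
]
unique_rows = dedupe_by_text(rows)
print(unique_rows)
```

If `query_scores` is sorted by score, as it appears to be, `query_scores.drop_duplicates(subset='paragraphs')` would have a similar effect, since pandas keeps the first occurrence by default.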


In [194]:
print(query_scores.iloc[9]['paragraphs'])

Libya attack to paint Obama as weak on terrorism


In [209]:
query = 'receipt over work'

query_scores = hqs.get_query_scores(query)
query_scores.iloc[:10]

Unnamed: 0,index,DocNumber,RawText,paragraphs,embedding,score
12821,2347,C05763223,UNCLASSIFIED U.S. Department of State Case No....,6 employee would have been entitled had,"[0.01179588958621025, -0.007500728126615286, -...",0.416572
12832,2347,C05763223,UNCLASSIFIED U.S. Department of State Case No....,19 that no employee of the Office may receive ...,"[0.03442981839179993, 0.10039772093296051, -0....",0.390957
9856,1860,C05762308,UNCLASSIFIED U.S. Department of State Case No....,Lawsuits against abusive contractors,"[-0.02425953559577465, -0.0375225804746151, -0...",0.383827
32399,6161,C05771617,UNCLASSIFIED U.S. Department of State Case No....,attorney/client communications or work product...,"[-0.09112697094678879, -0.008438197895884514, ...",0.381435
32427,6164,C05771621,UNCLASSIFIED U.S. Department of State Case No....,attorney/client communications or work product...,"[-0.09112697094678879, -0.008438197895884514, ...",0.381435
12972,2347,C05763223,UNCLASSIFIED U.S. Department of State Case No....,19 employees for any provision of law administ...,"[-0.0526542030274868, 0.03969474136829376, -0....",0.380533
9768,1860,C05762308,UNCLASSIFIED U.S. Department of State Case No....,"the committee members' work ""would be","[-0.013876518234610558, 0.030208135023713112, ...",0.378028
12792,2347,C05763223,UNCLASSIFIED U.S. Department of State Case No....,"22 pose of preserving such employee's allowances,","[0.015126695856451988, 0.04493490979075432, -0...",0.377528
13227,2347,C05763223,UNCLASSIFIED U.S. Department of State Case No....,18 velopment on a reimbursable basis. Any empl...,"[-0.09702442586421967, 0.010156353935599327, -...",0.375811
18415,3423,C05766021,UNCLASSIFIED U.S. Department of State Case No....,ATTORNEY-CLIENT PRIVILEGE/ATTORNEY WORK-PRODUCT,"[-0.04743870347738266, 0.0212693028151989, 0.0...",0.373973


In [203]:
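# Note: this cell ran as In [203], before the query cell above (In [209]),
# so the printed paragraph comes from an earlier query.
print(query_scores.iloc[2]['paragraphs'])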

Iran provides 14 percent of China's demand for oil.
