#  **Exploring the use of sentence-BERT**

[Tutorial](https://towardsdatascience.com/quick-semantic-search-using-siamese-bert-networks-1052e7b4df1)


In [None]:
# installing packages needed
!pip install sentence-transformers
!pip install scipy



In [None]:
# load the BERT model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')



In [None]:
# set up an example corpus

sentences = ['Absence of sanity',
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']

# create embeddings
sentence_embeddings = model.encode(sentences)

print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))
print('Sample BERT embedding vector - note includes negative values', sentence_embeddings[0].tolist())

Sample BERT embedding vector - length 768
Sample BERT embedding vector - note includes negative values [-0.05245405435562134, 0.026297958567738533, 0.02364524081349373, -0.046886514872312546, 0.09920787066221237, -0.016844285652041435, 0.02325245551764965, 0.010967695154249668, 0.03190600499510765, -0.04975314810872078, 0.07217402011156082, -0.02231350913643837, 0.015832798555493355, -0.013539413921535015, 0.02234629914164543, -0.043928712606430054, -0.07782064378261566, -0.009699025191366673, -0.018518369644880295, 0.0025585137773305178, -0.023588215932250023, 0.06260788440704346, -0.02184705249965191, 0.023285064846277237, 0.011254267767071724, 0.005217744503170252, 0.060471609234809875, -0.01357501931488514, 0.006760442163795233, 0.0357733815908432, -0.030896849930286407, 0.0076118772849440575, 0.012363860383629799, 0.016074014827609062, -0.030300596728920937, -0.02721293829381466, 0.0050917877815663815, -0.022153057157993317, -0.0003695595369208604, -0.02154984138906002, -0.0723322

In [None]:
# perform semantic search
import scipy.spatial.distance  # import the submodule explicitly so that cdist is available

query = 'Nobody has sane thoughts' #@param {type: "string"}

queries = [query]
query_embeddings = model.encode(queries)

# Find the closest sentences of the corpus for each query sentence based on cosine similarity (3 by default)
number_top_matches = 3 #@param {type: "number"}

print("Semantic Search Results")

for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print(f"\nTop {number_top_matches} most similar sentences in corpus:")

    for idx, distance in results[0:number_top_matches]:
        print(sentences[idx].strip(), "(Cosine Score: %.4f)" % (1-distance))

Semantic Search Results




Query: Nobody has sane thoughts

Top 3 most similar sentences in corpus:
Lack of saneness (Cosine Score: 0.8958)
Absence of sanity (Cosine Score: 0.8744)
A man is riding a horse. (Cosine Score: 0.1705)
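
To make the `1 - distance` step concrete: `cdist(..., "cosine")` returns the cosine distance, so subtracting it from 1 recovers the cosine similarity. The sketch below recomputes that similarity in plain Python on made-up 3-dimensional vectors. The names `cosine_similarity`, `top_matches`, `toy_corpus` and `toy_query` are illustrative assumptions and are not part of the tutorial.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# toy 3-dimensional "embeddings" (invented for illustration, not model output)
toy_corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [2.0, 0.0, 0.0]  # same direction as the first vector, different length

toy_results = top_matches(toy_query, toy_corpus, k=2)
print(toy_results)
```

The first toy vector scores 1.0 even though the query is twice as long, because cosine similarity depends only on direction and not on magnitude.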


# **Semantic search using Quora dataset**

Utilising a larger dataset and a vector database.

[Tutorial 1](https://www.youtube.com/watch?v=7RF03_WQJpQ), [Tutorial 2](https://docs.pinecone.io/docs/semantic-text-search), [Tutorial 3](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)

In [None]:
# check where the notebook is running
!curl ipinfo.io

{
  "ip": "35.199.0.35",
  "hostname": "35.0.199.35.bc.googleusercontent.com",
  "city": "Washington",
  "region": "Washington, D.C.",
  "country": "US",
  "loc": "38.8951,-77.0364",
  "org": "AS396982 Google LLC",
  "postal": "20004",
  "timezone": "America/New_York",
  "readme": "https://ipinfo.io/missingauth"
}

In [None]:
# install dependencies
!pip install sentence-transformers torch
!pip install pinecone-client
!pip install datasets # quora data

**Get a sample of the dataset**

Loaded from the `datasets` library (provided by Hugging Face).

In [None]:
# Get sample of the Quora dataset
import datasets
quora = datasets.load_dataset('quora', split='train[:100]') # first 100 samples

In [None]:
# create dictionary with all the sentences
id2text = {}
for item in quora:
    question_ids = item['questions']['id']
    question_texts = item['questions']['text']
    for q_id, q_text in zip(question_ids, question_texts):
        id2text[str(q_id)] = q_text

**Exploring the dataset**

The row at index 5 (`quora[5]`) gives us a pair of questions (IDs 11 and 12). These questions are duplicates of one another: semantically they are the same question. The row at index 30 (`quora[30]`) is an example where the two questions are semantically different.

In [None]:
# exploring the data
# the row at index 5 is a pair of duplicate questions
quora[5]

{'questions': {'id': [11, 12],
  'text': ['Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
   "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"]},
 'is_duplicate': True}

In [None]:
quora[30]

{'questions': {'id': [61, 62],
  'text': ["What's one thing you would like to do better?",
   "What's one thing you do despite knowing better?"]},
 'is_duplicate': False}

In [None]:
# exploring the dictionary imported
# all the sentences are contained in the dictionary
id2text['62']

"What's one thing you do despite knowing better?"

In [None]:
print(type(id2text))

<class 'dict'>
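
The `is_duplicate` flag and the ID-to-text dictionary can be combined to pull out the labelled duplicate pairs. The sketch below does this on two invented records that copy the shape of the rows shown above. `toy_rows`, `toy_id2text` and `duplicate_pairs` are hypothetical names, and the question texts are made up.

```python
# toy records with the same shape as the Quora rows shown above (texts are invented)
toy_rows = [
    {"questions": {"id": [1, 2],
                   "text": ["How do I learn Python?", "What is the best way to learn Python?"]},
     "is_duplicate": True},
    {"questions": {"id": [3, 4],
                   "text": ["What is a vector?", "Who invented the piano?"]},
     "is_duplicate": False},
]

toy_id2text = {}
duplicate_pairs = []
for row in toy_rows:
    ids = row["questions"]["id"]
    texts = row["questions"]["text"]
    # same id -> text mapping as id2text in the notebook, with string keys
    for q_id, q_text in zip(ids, texts):
        toy_id2text[str(q_id)] = q_text
    # remember which ID pairs are labelled as duplicates
    if row["is_duplicate"]:
        duplicate_pairs.append((str(ids[0]), str(ids[1])))

for first, second in duplicate_pairs:
    print(toy_id2text[first], "<->", toy_id2text[second])
```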


**Get the model**

We use `all-distilroberta-v1` from Hugging Face, which is intended to be used as a sentence and short-paragraph encoder.

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-distilroberta-v1')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

**Generate embeddings and then upsert**

Use the sentence-transformer model to generate sentence embeddings for the corpus and upload these embeddings to Pinecone (vector storage). Encode the query into the same embedding format, then search the Pinecone index for similar sentences by comparing the query embedding with the stored sentence embeddings using cosine similarity.

Generate vectors and organise into tuples:
* `id`: a string ID for the vector
* `values`: the embedding values
* `metadata`: additional JSON data
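
The tuple layout above can be sketched without the model or Pinecone. In the sketch below, `placeholder_embed` is a hypothetical stand-in for `model.encode(text).tolist()` and returns meaningless numbers. It is only there so that the example runs on its own.

```python
def placeholder_embed(text, dim=4):
    """Hypothetical stand-in for model.encode(text).tolist(); the values carry no meaning."""
    return [float((len(text) + i) % 7) / 7.0 for i in range(dim)]

# two questions taken from the dataset rows shown earlier
toy_texts = {
    "61": "What's one thing you would like to do better?",
    "62": "What's one thing you do despite knowing better?",
}

toy_vectors = []
for q_id, q_text in toy_texts.items():
    vec_id = f"vec{q_id}"                        # a string ID for the vector
    vec_values = placeholder_embed(q_text)       # the embedding values as a plain list
    vec_metadata = {"char_length": len(q_text)}  # additional JSON-serialisable data
    toy_vectors.append((vec_id, vec_values, vec_metadata))

for vec in toy_vectors:
    print(vec)
```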



In [None]:
# example embedding
text = "What's one thing you do despite knowing better?"
embedding = model.encode(text).tolist() # using .tolist() so it prints out as a flat list (rather than as a NumPy array)
print(embedding)

[0.03564424067735672, 0.004422139376401901, -0.03592892736196518, -0.03588281199336052, 0.04677799344062805, -0.04415673762559891, 0.0043084993958473206, -0.041780732572078705, 0.09228257834911346, 0.0037722173146903515, -0.011697283945977688, -0.041758306324481964, 0.0063322484493255615, 0.027928894385695457, -0.056048694998025894, 0.06027890369296074, 0.00499312998726964, -0.01382930763065815, 0.02822950668632984, -0.0016739473212510347, 0.013883999548852444, 0.0463692843914032, -0.022510463371872902, -0.014811544679105282, 0.03470756113529205, 0.006957618519663811, 0.009613013826310635, 0.014117548242211342, 0.04858928173780441, -0.029262663796544075, 0.0028094835579395294, 0.017335528507828712, -0.02663308009505272, 0.0461873896420002, -0.008705617859959602, -0.011891786940395832, -0.004178375471383333, -0.025888128206133842, 0.02351728267967701, -0.0015236211474984884, -0.10949676483869553, -0.015226143412292004, 0.011276714503765106, -0.04821989685297012, -0.01833612471818924, 0.

The code below demonstrates that encoding a single sentence leaves us with a `768`-dimensional sentence embedding (matching the `word_embedding_dimension` above).

In [None]:
embedding = model.encode(text)
embedding.shape

(768,)

In [None]:
# organise into a vector tuple - this needs to be done so that the vectors can later be upserted to Pinecone
id = "vec62"
values = embedding.tolist()  # embedding is a NumPy array at this point, so convert it to a plain list
metadata = {
    'is_duplicate': False,
    'char_length':len(text)
}

vector = (id, values, metadata)
print(vector)

('vec62', [0.03564424067735672, 0.004422139376401901, -0.03592892736196518, -0.03588281199336052, 0.04677799344062805, -0.04415673762559891, 0.0043084993958473206, -0.041780732572078705, 0.09228257834911346, 0.0037722173146903515, -0.011697283945977688, -0.041758306324481964, 0.0063322484493255615, 0.027928894385695457, -0.056048694998025894, 0.06027890369296074, 0.00499312998726964, -0.01382930763065815, 0.02822950668632984, -0.0016739473212510347, 0.013883999548852444, 0.0463692843914032, -0.022510463371872902, -0.014811544679105282, 0.03470756113529205, 0.006957618519663811, 0.009613013826310635, 0.014117548242211342, 0.04858928173780441, -0.029262663796544075, 0.0028094835579395294, 0.017335528507828712, -0.02663308009505272, 0.0461873896420002, -0.008705617859959602, -0.011891786940395832, -0.004178375471383333, -0.025888128206133842, 0.02351728267967701, -0.0015236211474984884, -0.10949676483869553, -0.015226143412292004, 0.011276714503765106, -0.04821989685297012, -0.01833612471

Next we connect to **Pinecone**.

A tutorial on using Pinecone can be found [here](https://towardsdatascience.com/crud-with-pinecone-ee6b6f8b54e8).

In [None]:
# upgrade the Pinecone client before connecting
!pip install --upgrade pinecone-client


In [None]:
import pinecone
from google.colab import userdata

key = userdata.get('pinecone_api_key')
env = userdata.get('pinecone_environment')

In [None]:
# initialize connection to Pinecone
pinecone.init(api_key=key, environment=env)

In [None]:
import time  # needed for time.sleep below

# create pinecone index
index_name = 'quora-semantic-search-example'

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=768, # embedding dimension of all-distilroberta-v1
        metric='cosine'
    )
    # wait a moment for the index to be fully initialized
    time.sleep(1)

index = pinecone.Index(index_name)

In [None]:
# Upsert vector to pinecone
index.upsert(vectors=[vector])

{'upserted_count': 1}

In [None]:
# check index
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1}},
 'total_vector_count': 1}

**Upserting more data to pinecone**

The data first needs to be transformed into a suitable format.

In [None]:
# transforming the data

questions = []

for record in quora['questions']:
    questions.extend(record['text'])

# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

Nd she is always sad?
Does Quora have a character limit for profile descriptions?
How do I prevent breast cancer?
How is the average speed of gas molecules determined?
Are there any people who genuinely enjoy salad with no dressing?
200
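
`list(set(questions))` removes duplicates but does not guarantee a stable order, so position-based IDs derived from this list can differ between runs. If reproducible IDs matter, one option (an assumption about what may be wanted, not something the tutorial does) is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each item in insertion order. `toy_questions` below is a made-up list that reuses two of the questions printed above.

```python
toy_questions = [
    "How do I prevent breast cancer?",
    "Does Quora have a character limit for profile descriptions?",
    "How do I prevent breast cancer?",  # repeated on purpose
]

# dict keys are unique and keep insertion order, so the first occurrence wins
unique_in_order = list(dict.fromkeys(toy_questions))
print(unique_in_order)
```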


When uploading to Pinecone, the records object needs to be a list of `(id, embedding, metadata)` tuples.

In [None]:
from tqdm.auto import tqdm

batch_size = 128

for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    xc = model.encode(questions[i:i_end]).tolist()
    # create records list for upsert
    records = list(zip(ids, xc, metadatas))
    print(type(records))
    # upsert to Pinecone
    index.upsert(vectors=records)

# check number of records in the index
index.describe_index_stats()

<class 'list'>
<class 'list'>


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 201}},
 'total_vector_count': 201}
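
The count of 201 is the 200 deduplicated questions plus the single `vec62` vector upserted earlier. The batching arithmetic used in the loop can be checked in isolation. The sketch below uses a hypothetical `batch_bounds` helper and makes no Pinecone calls.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in chunks of batch_size."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

bounds = list(batch_bounds(total=200, batch_size=128))
print(bounds)  # [(0, 128), (128, 200)]
```

With 200 questions and a batch size of 128 this gives two batches, of 128 and 72 items, which matches the two `<class 'list'>` lines printed by the loop.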

**Making queries**

This example performs a semantic search for similar questions, so we need to embed another question and search the index with it.


In [None]:
query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode([query]).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '82',
              'metadata': {'text': 'What are some of the high salary income '
                                   'jobs in the field of biotechnology?'},
              'score': 0.239463449,
              'values': []},
             {'id': '110',
              'metadata': {'text': 'Who is the richest gambler of all time and '
                                   'how can I reach his level?'},
              'score': 0.214109302,
              'values': []},
             {'id': '22',
              'metadata': {'text': 'Who is the richest gambler of all time and '
                                   'how can I reach his level as a gambler?'},
              'score': 0.192162454,
              'values': []},
             {'id': '133',
              'metadata': {'text': 'What is the best travel website in spain?'},
              'score': 0.188128248,
              'values': []},
             {'id': '16',
              'metadata': {'text': 'What are some high paying jobs 

Reformat the response so it is easier to read.

**Note:** *the matches are only weakly related to the query (scores of roughly 0.2) because the index holds only 201 vectors.*

In [None]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.24: What are some of the high salary income jobs in the field of biotechnology?
0.21: Who is the richest gambler of all time and how can I reach his level?
0.19: Who is the richest gambler of all time and how can I reach his level as a gambler?
0.19: What is the best travel website in spain?
0.17: What are some high paying jobs for a fresher with an M.Tech in biotechnology?
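
With scores this low it can be useful to drop weak matches before displaying them. The sketch below applies a cut-off to a dictionary shaped like the response above. `toy_response` copies three of the matches shown, and `min_score` is an arbitrary threshold chosen purely for illustration.

```python
toy_response = {
    "matches": [
        {"id": "82", "score": 0.239463449,
         "metadata": {"text": "What are some of the high salary income jobs in the field of biotechnology?"}},
        {"id": "110", "score": 0.214109302,
         "metadata": {"text": "Who is the richest gambler of all time and how can I reach his level?"}},
        {"id": "133", "score": 0.188128248,
         "metadata": {"text": "What is the best travel website in spain?"}},
    ]
}

min_score = 0.2  # arbitrary cut-off, chosen only for this example
strong_matches = [m for m in toy_response["matches"] if m["score"] >= min_score]

for match in strong_matches:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```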


# **Creating a semantic search engine with Transformers and Faiss**

This section again uses a sentence-transformers model, as in the sections above.

**Faiss** is a library for efficient similarity search over dense vectors, developed by Facebook AI Research. Here it is used as an alternative to a hosted vector database.

Change the Colab runtime to GPU, then clone this repository to access the data used in the tutorial (which is similar to the data that we would need for our own app).


In [None]:
!git clone https://github.com/kstathou/vector_engine

In [None]:
%cd vector_engine

/content/vector_engine


In [None]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/vector_engine (from -r requirements.txt (line 9))
  Preparing metadata (setup.py) ... done
ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1)
ERROR: No matching distribution found for torch==1.8.1

In [None]:
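# the pinned torch==1.8.1 could not be installed above, so install torch and the other packages imported below directly
!pip install torch
!pip install sentence-transformers
# install only one Faiss build: faiss-gpu for a GPU runtime, otherwise faiss-cpu
!pip install faiss-gpu
# !pip install faiss-cpu
# pathlib is part of the Python standard library, and the notebook imports vector_engine
# from the cloned repository (not the unrelated `vector` package), so neither needs installing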

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2
# Used to import data from local.
import pandas as pd

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path

# Used to do vector searches and display the results - these are functions specifically in the cloned git directory
# written out in full later in the code
from vector_engine.utils import vector_search, id2details

**get data**

The data has already been pre-processed into this dataframe. We will need to do something similar for the NAO reports.

In [None]:
# load the pre-processed data from the local data/ folder
df = pd.read_csv('data/misinformation_papers.csv')

In [None]:
df.head(10)

| Unnamed: 0 | original_title | abstract | year | citations | id | is_EN |
|---|---|---|---|---|---|---|
| 0 | When Corrections Fail: The Persistence of Poli... | An extensive literature addresses citizen igno... | 2010 | 901 | 2132553681 | 1 |
| 1 | A postmodern Pandora's box: anti-vaccination m... | The Internet plays a large role in disseminati... | 2010 | 440 | 2117485795 | 1 |
| 2 | Spread of (Mis)Information in Social Networks | We provide a model to investigate the tension ... | 2010 | 278 | 2120015072 | 1 |
| 3 | Mandatory Influenza Vaccination of Health Care... | BACKGROUND Influenza vaccination of health car... | 2010 | 232 | 2156683801 | 1 |
| 4 | Intrauterine contraception in Saint Louis: a s... | Abstract Background Many obstacles to intraute... | 2010 | 99 | 2148310716 | 1 |
| 5 | Tracking the source of glacier misinformation. | A recent News of the Week story on Himalayan g... | 2010 | 83 | 2052194488 | 1 |
| 6 | Cost Overruns in Large-Scale Transportation In... | Managing large-scale transportation infrastruc... | 2010 | 122 | 2120193275 | 1 |
| 7 | The Role of the Social Network in Contraceptiv... | Abstract Purpose Understanding reasons for con... | 2010 | 106 | 1972823986 | 1 |
| 8 | Persistent high fertility in Uganda: young peo... | Background\r\nHigh fertility among young peopl... | 2010 | 100 | 2012373852 | 1 |
| 9 | How to Disappear: Erase Your Digital Footprint... | 1. How to Disappear2. Database Sites3. Disappe... | 2010 | 4 | 631917952 | 1 |


In [None]:
print(f"Misinformation, disinformation and fake news papers: {df.id.unique().shape[0]}")

Misinformation, disinformation and fake news papers: 8430


This section uses the `distilbert-base-nli-stsb-mean-tokens` model, which the tutorial describes as having the best performance on Semantic Textual Similarity tasks among the DistilBERT versions.

In [None]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)


cuda:0


In [None]:
# convert abstracts to embeddings
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)


In [None]:
print(f'Shape of the vectorised abstract: {embeddings[0].shape}')

Shape of the vectorised abstract: (768,)


**vector similarity search with Faiss**

Faiss is built around the `Index` object which contains, and sometimes preprocesses, the searchable vectors.

**Note:** Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.

We use the `IndexFlatL2` index, a simple index that performs a brute-force L2 distance search. Search time scales linearly with the number of indexed vectors.

To create the index:

1. Change the data type of the abstract vectors to float32.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to `IndexIDMap`, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their paper IDs from Microsoft Academic Graph (MAG).

In [None]:
# Step 1: Change data type
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.id.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 8430
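
Conceptually, a flat L2 index compares the query against every stored vector and keeps the `k` smallest distances, while the ID map translates row positions into custom IDs. The plain-Python sketch below mimics that behaviour on toy data. It is not how Faiss is implemented internally, and `squared_l2`, `flat_l2_search`, `toy_points` and `toy_ids` are illustrative names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(query, vectors, ids, k):
    """Brute-force search: score every vector and return the k nearest as (distances, ids)."""
    scored = sorted((squared_l2(query, vec), vec_id) for vec, vec_id in zip(vectors, ids))
    nearest = scored[:k]
    return [dist for dist, _ in nearest], [vec_id for _, vec_id in nearest]

toy_points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_ids = [101, 202, 303]  # made-up IDs standing in for paper IDs

distances, neighbour_ids = flat_l2_search([0.0, 0.0], toy_points, toy_ids, k=2)
print(distances, neighbour_ids)  # [0.0, 1.0] [101, 202]
```

Every query touches every stored vector, which is why the cost of this kind of search grows linearly with the size of the index.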


Searching the index:

The index we built performs a k-nearest-neighbour search, so we need to provide the number of neighbours to return.

Query the index with an abstract from the dataset. Because that abstract is itself in the index, the first result should be the query document, at distance 0.

In [None]:
# Paper abstract used as the query (selected by column name rather than by position)
df.iloc[5415]['abstract']

"We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction number [Formula: see text] for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."

In [None]:
# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([embeddings[5415]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 1.3749969005584717, 55.55474853515625, 65.87544250488281, 67.96585083007812, 69.05718994140625, 69.70634460449219, 70.40652465820312, 70.57284545898438, 71.0338134765625]

MAG paper IDs: [3092618151, 3011345566, 3012936764, 3011186656, 3092128270, 3048848247, 3044429417, 3055557295, 3024620668, 3044097955]
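
When reading these numbers it may help to recall how L2 distance relates to the cosine similarity used in the earlier sections. As far as we are aware, `IndexFlatL2` reports squared L2 distances. Under that assumption, expanding the square for two vectors $a$ and $b$ gives:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b
$$

For unit-length vectors this reduces to $2 - 2\cos\theta$, which can never exceed 4, and in that case ranking by L2 distance and ranking by cosine similarity give the same order. Most of the distances above are far larger than 4, so these embeddings are not unit-normalised and the two rankings may differ.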


In [None]:
# Fetch the paper titles based on their index
id2details(df, I, 'original_title')

[['The COVID-19 social media infodemic.'],
 ['The COVID-19 Social Media Infodemic'],
 ['Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset'],
 ['Coronavirus Goes Viral: Quantifying the COVID-19 Misinformation Epidemic on Twitter'],
 ['Analysis of online misinformation during the peak of the COVID-19 pandemics in Italy'],
 ['COVID-19-Related Infodemic and Its Impact on Public Health: A Global Social Media Analysis.'],
 ['Effects of misinformation on COVID-19 individual responses and recommendations for resilience of disastrous consequences of misinformation'],
 ['Covid-19 infodemic reveals new tipping point epidemiology and a revised R formula.'],
 ['Quantifying COVID-19 Content in the Online Health Opinion War Using Machine Learning'],
 ['Coronavirus-related online web search desire amidst the rising novel coronavirus incidence in Ethiopia: Google Trends-based infodemiology']]

In [None]:
# Fetch the paper abstracts based on their index
id2details(df, I, 'abstract')

[["We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction number [Formula: see text] for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."],
 ["We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and p

**Putting it all together**

To query the index with an unseen query and retrieve its most relevant documents:

1. Encode the query with the same sentence-DistilBERT model we used for the rest of the abstract vectors
2. Change its data type to float32
3. Search the index with the encoded query

In [None]:
user_query = """
WhatsApp was alleged to have been widely used to spread misinformation and propaganda
during the 2018 elections in Brazil and the 2019 elections in India. Due to the
private encrypted nature of the messages on WhatsApp, it is hard to track the dissemination
of misinformation at scale. In this work, using public WhatsApp data from Brazil and India, we
observe that misinformation has been largely shared on WhatsApp public groups even after they
were already fact-checked by popular fact-checking agencies. This represents a significant portion
of misinformation spread in both Brazil and India in the groups analyzed. We posit that such
misinformation content could be prevented if WhatsApp had a means to flag already fact-checked
content. To this end, we propose an architecture that could be implemented by WhatsApp to counter
such misinformation. Our proposal respects the current end-to-end encryption architecture on WhatsApp,
thus protecting users’ privacy while providing an approach to detect the misinformation that benefits
from fact-checking efforts.
"""

In [None]:
def vector_search(query, model, index, num_results=10):
    """Transforms the query to a vector using a pretrained, sentence-level
    DistilBERT model and finds similar vectors using Faiss.

    Args:
        query (str or list of str): User query that should be more than a
            sentence long, or a list of such queries.
        model (sentence_transformers.SentenceTransformer.SentenceTransformer)
        index (faiss.Index): Faiss index to search.
        num_results (int): Number of results to return.

    Returns:
        D (:obj:`numpy.array` of `float`): Distance between results and query.
        I (:obj:`numpy.array` of `int`): Paper ID of the results.

    """
    # wrap a single string so that list() does not split it into characters
    queries = [query] if isinstance(query, str) else list(query)
    vector = model.encode(queries)
    D, I = index.search(np.array(vector).astype("float32"), k=num_results)
    return D, I


def id2details(df, I, column):
    """Returns the values of `column` for each paper ID in the first row of I."""
    return [list(df[df.id == idx][column]) for idx in I[0]]
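
One detail to watch when preparing the query for `model.encode` is how it is wrapped into a batch: calling `list()` on a plain string splits it into single characters, so each character would be encoded as its own "sentence". The sketch below shows the difference using a hypothetical `as_batch` helper and involves no model.

```python
def as_batch(query):
    """Wrap a single string in a list; turn any other iterable of strings into a list."""
    return [query] if isinstance(query, str) else list(query)

print(list("fake news")[:4])          # ['f', 'a', 'k', 'e']
print(as_batch("fake news"))          # ['fake news']
print(as_batch(["fake news", "x"]))   # ['fake news', 'x']
```

This is why the query is passed to `vector_search` as a one-element list in the cell that follows.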


In [None]:
# query the index
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [7.636616230010986, 58.327415466308594, 58.327415466308594, 70.91807556152344, 73.32894897460938, 81.48760986328125, 85.36540985107422, 85.85223388671875, 87.20014953613281, 92.07552337646484]

MAG paper IDs: [3047438096, 3037966274, 3021927925, 2889959140, 2791045616, 2943077655, 2990343632, 2974128076, 3014380170, 3028584171]


In [None]:
# Fetching the paper titles based on their index
id2details(df, I, 'original_title')

[['Can WhatsApp Benefit from Debunked Fact-Checked Stories to Reduce Misinformation?'],
 ['A Dataset of Fact-Checked Images Shared on WhatsApp During the Brazilian and Indian Elections'],
 ['A Dataset of Fact-Checked Images Shared on WhatsApp During the Brazilian and Indian Elections'],
 ['A System for Monitoring Public Political Groups in WhatsApp'],
 ['Politics of Fake News: How WhatsApp Became a Potent Propaganda Tool in India'],
 ['Characterizing Attention Cascades in WhatsApp Groups'],
 ['Can WhatsApp Counter Misinformation by Limiting Message Forwarding'],
 ['Can WhatsApp Counter Misinformation by Limiting Message Forwarding'],
 ['OS IMPACTOS JURÍDICOS E SOCIAIS DAS FAKE NEWS EM TERRITÓRIO BRASILEIRO'],
 ['Images and Misinformation in Political Groups: Evidence from WhatsApp in India']]

In [None]:
# Define project base directory
project_dir = Path('notebooks').resolve().parents[0]
print(project_dir)

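# Serialise the index and store it as a pickle
models_dir = project_dir / "models"
models_dir.mkdir(parents=True, exist_ok=True)  # make sure the target directory exists
with open(models_dir / "faiss_index.pickle", "wb") as h:
    pickle.dump(faiss.serialize_index(index), h)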

/content/vector_engine
