https://www.sbert.net/index.html
1. Split into topics (SBERT - asymmetric semantic search / clustering / topic modelling)
2. Find questions within the relevant topics (manual)
3. Get synonyms for those questions (SBERT - symmetric semantic search)
4. Extract question-answer pairs (IBM - domain-specific-QA)

# Asymmetric Semantic Search
For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.
<br>
### Suitable models for asymmetric semantic search:

- msmarco-distilbert-base-v2
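
As a rough illustration of what semantic search does once the texts are embedded, the sketch below ranks corpus entries by cosine similarity to a query vector. The 3-dimensional vectors are invented stand-ins for real SBERT embeddings, and `top_k_cosine` is a hypothetical helper, not part of the sentence-transformers API.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs of the k corpus rows most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of every corpus row
    order = np.argsort(-scores)[:k]     # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for embeddings (assumed values, not model output)
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
query = np.array([1.0, 0.1, 0.0])

hits = top_k_cosine(query, corpus, k=2)
print(hits)
```

With a real model, the rows of `corpus` would come from `model.encode(...)` and the query vector from encoding the question.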

# Symmetric Semantic Search
For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: your query could be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.
<br>
### Suitable models for symmetric semantic search:

- paraphrase-distilroberta-base-v1 / paraphrase-xlm-r-multilingual-v1

- quora-distilbert-base / quora-distilbert-multilingual

- distiluse-base-multilingual-cased-v2

In [None]:
"""SBERT - symmetric semantic search
This script shows how to perform semantic search with PyTorch. It performs exact nearest neighbor search.
As dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions (we only use about 100k):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
As embeddings model, we use the SBERT model set in model_name below ('quora-distilbert-base').
With 'quora-distilbert-multilingual' instead, which is aligned for 100 languages, you can type in a question
in various languages and it will return the closest questions in the corpus
(questions in the corpus are mainly in English).
Google Colab example: https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import pickle
import time

model_name = 'quora-distilbert-base'
model = SentenceTransformer(model_name)

url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000

embedding_cache_path = 'quora-embeddings-{}-size-{}.pkl'.format(
    model_name.replace('/', '_'), max_corpus_size)

# Compute the embeddings only if no cached copy exists
if not os.path.exists(embedding_cache_path):
    # Download the dataset if needed
    if not os.path.exists(dataset_path):
        print("Download dataset")
        util.http_get(url, dataset_path)

    # Get all unique sentences from the file
    corpus_sentences = set()
    with open(dataset_path, encoding='utf8') as fIn:
        reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            corpus_sentences.add(row['question1'])
            if len(corpus_sentences) >= max_corpus_size:
                break

            corpus_sentences.add(row['question2'])
            if len(corpus_sentences) >= max_corpus_size:
                break

    corpus_sentences = list(corpus_sentences)
    print("Encode the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences,
                                     show_progress_bar=True,
                                     convert_to_tensor=True)

    print("Store file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump(
            {
                'sentences': corpus_sentences,
                'embeddings': corpus_embeddings
            }, fOut)
else:
    print("Load pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences'][0:max_corpus_size]
        corpus_embeddings = cache_data['embeddings'][0:max_corpus_size]

###############################
print("Corpus loaded with {} sentences / embeddings".format(
    len(corpus_sentences)))

#Move embeddings to the target device of the model
corpus_embeddings = corpus_embeddings.to(model._target_device)

while True:
    inp_question = input("Please enter a question: ")

    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings)
    end_time = time.time()
    hits = hits[0]  #Get the hits for the first query

    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'],
                                    corpus_sentences[hit['corpus_id']]))

    print("\n\n========\n")

# Extracting only the queries from MS MARCO

In [None]:
# open the JSON
import json
path = r"D:\Gabriel\Documents\TCC\MS MARCO\train_v2.1.json"
with open(path, 'r', encoding='utf8') as f:
    msmarco = json.load(f)

In [None]:
# take the queries and join them into a single string
msmarco_queries = ';'.join(msmarco['query'].values())
del msmarco

In [None]:
# write the string to a file
with open(r"D:\Gabriel\Documents\TCC\MS MARCO\queries.txt", "w", encoding='utf8') as f2:
    f2.write(msmarco_queries)

In [None]:
# read the string back from the file
with open(r"D:\Gabriel\Documents\TCC\MS MARCO\queries.txt", "r", encoding='utf8') as f3:
    msmarco_queries = f3.read()
print(msmarco_queries[0:1000])
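
Joining the queries with `;` assumes that no query contains a semicolon; if one does, a later `split(';')` produces extra fragments. One possible alternative, sketched below, stores the queries as a JSON list instead. The helper names `save_queries` and `load_queries`, the temporary path, and the second sample query are all made up for illustration.

```python
import json
import os
import tempfile

def save_queries(queries, path):
    """Write the queries as a JSON list so that no delimiter is needed."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(list(queries), f, ensure_ascii=False)

def load_queries(path):
    """Read the list of queries back exactly as it was saved."""
    with open(path, "r", encoding="utf8") as f:
        return json.load(f)

# The second query contains a semicolon on purpose (invented example)
sample = ["what is a mid ocean ridge", "tide; wave difference"]

tmp_path = os.path.join(tempfile.mkdtemp(), "queries.json")
save_queries(sample, tmp_path)
restored = load_queries(tmp_path)

print(restored == sample)                  # the JSON round trip keeps both queries
print(len(";".join(sample).split(";")))    # the ';' round trip yields one fragment too many
```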

# Topic modelling with BERT

## Installs

In [None]:
!pip install umap-learn tqdm

In [None]:
!pip install hdbscan

In [None]:
# Slow! ~3 GB
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
# check that it is installed correctly
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.current_device()) # int
print(torch.cuda.device_count()) # >0
print(torch.cuda.get_device_name(0)) # GeForce ...

## Imports

In [None]:
from tqdm import tqdm
import umap
import os
import hdbscan
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option("display.max_rows", 10, "display.max_columns", None, "display.width", None, "display.max_colwidth", 70)

## Load the data

In [None]:
with open(r"D:\Gabriel\Documents\TCC\MS MARCO\queries.txt", "r", encoding='utf8') as f3:
    msmarco_queries = f3.read()
data = msmarco_queries.split(';')
del msmarco_queries

In [None]:
print(len(data))
display(data[:10])

## Compute embeddings

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(data, show_progress_bar=True, device='cuda')
print('Finished')

Saving the embeddings is not worth it: they take up a lot of space (16 GB), saving takes longer than generating them again, and I was not able to load them back.
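
If caching the embeddings is ever revisited, one option that might help is NumPy's binary `.npy` format read back with memory mapping. This is only a sketch on a small random array; it has not been tried on the real data. Assuming `float32` values, 809214 rows of 768 components would take roughly 2.5 GB in this format.

```python
import os
import tempfile
import numpy as np

# Small random array standing in for the real (809214, 768) embeddings
embeddings_demo = np.random.rand(1000, 768).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, embeddings_demo)          # binary format, no text conversion

# mmap_mode="r" maps the file instead of reading it all into RAM at once
loaded = np.load(path, mmap_mode="r")
print(loaded.shape, loaded.dtype)
```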

## Compute UMAP

In [None]:
# reduce the embeddings from 768 components to 5
umap_embeddings = umap.UMAP(n_neighbors=15, 
                            n_components=5, 
                            metric='cosine').fit_transform(embeddings)

In [None]:
display(umap_embeddings)

In [None]:
np.shape(umap_embeddings)

In [None]:
# save umap_embeddings
umap_path = os.path.join(os.getcwd(), 'umap_embeddings.txt')
with open(umap_path, 'w') as file:
    for row in tqdm(umap_embeddings):
        np.savetxt(file, row)

In [None]:
%%time
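# load umap_embeddings (only if they are not already in memory)
try:
    umap_embeddings
except NameError:
    umap_path = os.path.join(os.getcwd(), 'umap_embeddings.txt')
    # 809214 rows x 5 components; -1 lets NumPy infer the row count
    umap_embeddings = np.loadtxt(umap_path).reshape(-1, 5)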

## Compute clusters

In [None]:
cluster = hdbscan.HDBSCAN(min_cluster_size=15,
                          metric='euclidean',                      
                          cluster_selection_method='eom').fit(umap_embeddings)

## Visualize
Optional!

In [None]:
%%time
# Visualization
# Reduce the embeddings again, this time to only 2 components (extremely slow!)
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

In [None]:
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_

# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()
plt.show()

## Analyze results

In [None]:
# Results as a dataframe
# Doc = query, Topic = cluster
docs_df = pd.DataFrame(data, columns=["Doc"])
docs_df['Topic'] = cluster.labels_
docs_df['Doc_ID'] = range(len(docs_df))
docs_per_topic = docs_df.groupby(['Topic'], as_index = False).agg({'Doc': ' '.join})

In [None]:
display(docs_df)

In [None]:
%%time
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(documents, m, ngram_range=(1, 1)):
    count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(documents)
    t = count.transform(documents).toarray()
    w = t.sum(axis=1)
    tf = np.divide(t.T, w)
    sum_t = t.sum(axis=0)
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)

    return tf_idf, count
  
tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(data))
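
To make the arithmetic inside `c_tf_idf` concrete, the sketch below repeats the same steps on a hand-made count matrix. The counts and the value of `m` are invented for illustration; each row is one topic's merged document and each column is one word.

```python
import numpy as np

# rows = topics (classes), columns = words; toy counts, not real data
t = np.array([[3, 1, 0],
              [0, 1, 4]])
m = 10                                   # assumed total number of original documents

w = t.sum(axis=1)                        # number of words in each topic
tf = np.divide(t.T, w)                   # term frequency per topic, shape (words, topics)
sum_t = t.sum(axis=0)                    # frequency of each word across all topics
idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
tf_idf = np.multiply(tf, idf)            # shape (words, topics)

print(tf_idf.round(3))
```

The largest value in each column points to the word most characteristic of that topic: word 0 for the first topic and word 2 for the second.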

In [None]:
%%time
def extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20):
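    # get_feature_names() was removed in newer scikit-learn releases; fall back to it only on old versions
    words = count.get_feature_names_out() if hasattr(count, "get_feature_names_out") else count.get_feature_names()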
    labels = list(docs_per_topic.Topic)
    tf_idf_transposed = tf_idf.T
    indices = tf_idf_transposed.argsort()[:, -n:]
    top_n_words = {label: [(words[j], tf_idf_transposed[i][j]) for j in indices[i]][::-1] for i, label in enumerate(labels)}
    return top_n_words

def extract_topic_sizes(df):
    topic_sizes = (df.groupby(['Topic'])
                     .Doc
                     .count()
                     .reset_index()
                     .rename({"Topic": "Topic", "Doc": "Size"}, axis='columns')
                     .sort_values("Size", ascending=False))
    return topic_sizes

top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=10)
topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)

In [None]:
top_n_words[5]

In [None]:
# words to search for among the clusters
to_find = ['marine', 'ocean', 'sea', 'oil', 'beach', 'current', 'tide', 'wave']

In [None]:
%%time
found = {word : [] for word in to_find}
for i in range(len(top_n_words) - 1):
    for word in to_find:
        if word in [a for a,b in top_n_words[i]]:
            found[word].append(i)
            
display(found)  # clusters that contain the searched words

In [None]:
to_find = ['marine', 'ocean', 'sea', 'oil', 'beach', 'current', 'tide', 'wave']
for i in found['wave']:
    display(i, top_n_words[i])

In [None]:
s = docs_df[docs_df['Topic'] == 3908]
display(s)
print(s.index.tolist())

In [None]:
# Save the result
inter = [872, 905, 1885, 891, 1616, 1651, 1652, 3779, 1421, 1913, 885]  # clusters that possibly contain interesting questions

In [None]:
inter_df = docs_df[docs_df['Topic'].isin(inter)]  # rows that belong to the interesting clusters
display(inter_df)

In [None]:
inter_json = inter_df.drop(columns=['Topic', 'Doc_ID'])
inter_json.to_json('pre_lookup_table.json')

## Reduce the number of clusters

In [None]:
# TOPIC REDUCTION
from sklearn.metrics.pairwise import cosine_similarity
for i in range(20):
    # Calculate cosine similarity
    similarities = cosine_similarity(tf_idf.T)
    np.fill_diagonal(similarities, 0)

    # Extract label to merge into and from where
    topic_sizes = docs_df.groupby(['Topic']).count().sort_values("Doc", ascending=False).reset_index()
    topic_to_merge = topic_sizes.iloc[-1].Topic
    topic_to_merge_into = np.argmax(similarities[topic_to_merge + 1]) - 1

    # Adjust topics
    docs_df.loc[docs_df.Topic == topic_to_merge, "Topic"] = topic_to_merge_into
    old_topics = docs_df.sort_values("Topic").Topic.unique()
    map_topics = {old_topic: index - 1 for index, old_topic in enumerate(old_topics)}
    docs_df.Topic = docs_df.Topic.map(map_topics)
    docs_per_topic = docs_df.groupby(['Topic'], as_index = False).agg({'Doc': ' '.join})

    # Calculate new topic words
    m = len(data)
    tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m)
    top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)

topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)
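
The `+ 1` and `- 1` in the loop above come from the outlier label: HDBSCAN marks noise points as -1, so, assuming outliers exist, topic `k` sits at row `k + 1` of the similarity matrix. The toy sketch below uses an invented 4x4 similarity matrix whose rows stand for topics -1, 0, 1 and 2.

```python
import numpy as np

# Invented symmetric similarities between topics -1, 0, 1, 2 (row index = label + 1)
similarities = np.array([[1.0, 0.1, 0.2, 0.1],
                         [0.1, 1.0, 0.3, 0.8],
                         [0.2, 0.3, 1.0, 0.4],
                         [0.1, 0.8, 0.4, 1.0]])
np.fill_diagonal(similarities, 0)        # a topic must not be merged into itself

topic_to_merge = 2                       # assumed to be the smallest topic
row = similarities[topic_to_merge + 1]   # shift by one because of label -1
topic_to_merge_into = int(np.argmax(row)) - 1   # shift back from row index to label

print(topic_to_merge_into)
```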

# Domain-Specific QA (IBM)
Inspired by: https://github.com/ibm-aur-nlp/domain-specific-QA  
IBM's code *does not filter* the topics; it only takes the list of already-filtered ids and generates  
the JSON containing all the information relevant to each question in each domain.

In [None]:
# IMPORTANT!!
# opening the JSON uses a lot of memory. Delete large variables first
to_del = ['tf_idf', 'umap_embeddings', 'docs_df', 'data', 'top_n_words', 'docs_per_topic', 'topic_sizes', 's', 'inter_df', 'inter_json']
for var in to_del:
    globals().pop(var, None)

In [None]:
import json
from tqdm import tqdm

def fix_lookup(lookup):
    """
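    For some reason, the lookup table did not record the correct query id of each captured query.
    So, for each query in the lookup table, we iterate over the queries of the original file
    (the global marco_query) until we find the correct id.

    :param lookup: dict loaded from the lookup table JSON; lookup['Doc'] maps row index to query text
    :return: dict mapping the real query id to the query text
    """
    new_lookup = {}
    print('Fixing lookup! ~1min')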
    for l_index, l_query in tqdm(lookup['Doc'].items()):
        real_key = next(m_index for m_index, m_query in marco_query.items() if m_query == l_query)
        new_lookup[real_key] = l_query
    return new_lookup


with open('train_v2.1_query.json', 'r') as marco_query:
    marco_query = json.load(marco_query)
    
with open('train_v2.1_answers.json', 'r') as marco_answers:
    marco_answers = json.load(marco_answers)
    
with open('train_v2.1_wellFormedAnswers.json', 'r') as marco_wfanswers:
    marco_wfanswers = json.load(marco_wfanswers)

print('Reading lookup...')
with open('lookup_table.json', 'r') as lookupfile:
    lookup = json.load(lookupfile)

# replace the keys with the correct ones
lookup = fix_lookup(lookup)

In [None]:
# Get the answer and the well-formed answer for each question
final = {}
print('Building final file...')
for qid, query in tqdm(lookup.items()):
    temp = {
        'query': query,
        'answer': marco_answers[qid][0],
        'wellFormedAnswer': marco_wfanswers[qid][0]
    }
    final[qid] = temp

In [None]:
# Save
print('Done! Saving...')
with open('filtered_qna2.json', 'w') as finalfile:
    json.dump(final, finalfile)

## Attempt with ijson

In [None]:
# The ijson library is meant for opening large JSON files,
# but I could not get it to work
import ijson
from pathlib import Path
# load MS MARCO
path = str(Path().resolve().parent / 'edited_v2.1.json')
print(path)
f = open(path, 'rb')  # remember to close it afterwards!!

queries_ = ijson.kvitems(f, 'marco.query.item')
queries = (v for k, v in queries_)

In [None]:
for query in queries:
    print(query)

In [None]:
f.close()

## Extract to CSV

In [1]:
import json

with open('filtered_qna.json', 'rb') as f:
    filtered = json.load(f)
display(filtered)

{'101004': {'query': 'how big can miniature turtles get',
  'answer': 'The size of miniature turtles, 7 to 9 inches for males and 8 to 12 inches for females.',
  'wellFormedAnswer': '['},
 '101220': {'query': 'seafloor spreading definition',
  'answer': 'iT is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge.',
  'wellFormedAnswer': '['},
 '101561': {'query': 'average production cost for oil in permian basin',
  'answer': '$90.57',
  'wellFormedAnswer': 'The average production cost for an oil in the Permian Basin is $90.57.'},
 '102096': {'query': 'price of castor oil',
  'answer': '$10 for a supply of 180 milliliters.',
  'wellFormedAnswer': '['},
 '103233': {'query': 'what causes waves to occur in the ocean',
  'answer': 'Wind',
  'wellFormedAnswer': 'The wind causes waves to occur in the ocean.'},
 '103329': {'query': 'what do turtles eat and drink',
  'answer': 'No Answer Present.',
 

In [None]:
from IPython.display import clear_output

skip = ['hemp', 'canola', 'cbd', 'car', 'tanker', 'salt lake', 'argan', 'town', 'color']
always_keep = ['turtle']
new = {}
for i, (qid, qaw) in enumerate(filtered.items()):
    if i > 1200:
        break
    print(f"Question #{i}")
    # Skip queries containing an unwanted word. any() is needed here: a `continue`
    # inside a loop over `skip` would only advance that inner loop.
    if any(word in qaw['query'] for word in skip):
        continue

    # Use the well-formed answer unless it is the '[' placeholder
    newqaw = {}
    newqaw['query'] = qaw['query']
    wfa = qaw['wellFormedAnswer']
    newqaw['answer'] = wfa if wfa != '[' else qaw['answer']

    # Queries with an always-keep word are kept without asking
    keep = any(word in qaw['query'] for word in always_keep)
    if not keep:
        print('Query: ', newqaw['query'])
        print('Answer:', newqaw['answer'])
        keep = input('Keep? ') == 'y'
    if keep:
        new[qid] = newqaw
    clear_output(wait=True)

with open('filtered_qna_.json', 'w') as f:
    json.dump(new, f)


Question #1
Query:  seafloor spreading definition
Answer: iT is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge.
ananas


In [None]:
'what is a mid ocean ridge'

In [None]:
import json

with open('filtered_qna.json', 'rb') as f:
    filtered = json.load(f)

# Second pass: review only the questions after the first 1200.
# Note: this overwrites the filtered_qna_.json written by the first review pass.
new = {}
for i, (qid, qaw) in enumerate(filtered.items()):
    if i <= 1200:
        continue
    newqaw = {}
    newqaw['query'] = qaw['query']
    wfa = qaw['wellFormedAnswer']
    newqaw['answer'] = wfa if wfa != '[' else qaw['answer']
    print('Query: ', newqaw['query'])
    print('Answer:', newqaw['answer'])
    keep = input('Keep? ') == 'y'
    if keep:
        new[qid] = newqaw

with open('filtered_qna_.json', 'w') as f:
    json.dump(new, f)