https://www.sbert.net/index.html
1. Split into topics (SBERT - asymmetric semantic search / clustering / topic modelling)
2. Find questions within the relevant topics (manual)
3. Get synonyms for those questions (SBERT - symmetric semantic search)
4. Extract question-answer pairs (IBM - domain-specific-QA)

# Asymmetric Semantic Search
For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.   
<br>
### Suitable models for asymmetric semantic search:

- msmarco-distilbert-base-v2

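Whichever model produces the vectors, the search step itself amounts to ranking corpus passages by their similarity to the query. A minimal, dependency-free sketch, assuming cosine similarity as the score; the 3-dimensional vectors are made-up stand-ins for real model output, and `cosine` and `rank_passages` are hypothetical helper names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_passages(query_vec, passage_vecs, top_k=2):
    """Return (passage_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors (assumption): in practice these would come from the embedding model.
query = [0.9, 0.1, 0.0]      # short query, e.g. "What is Python"
passages = [
    [0.1, 0.9, 0.2],         # unrelated paragraph
    [0.8, 0.2, 0.1],         # paragraph describing Python
    [0.0, 0.1, 0.9],         # another unrelated paragraph
]
hits = rank_passages(query, passages)
print(hits)
```

The query and the passages play different roles here: only the query is scored against every passage, which is why swapping them makes little sense for this task.
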
# Symmetric Semantic Search
For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: your query could, for example, be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.   
<br>
### Suitable models for symmetric semantic search:

- paraphrase-distilroberta-base-v1 / paraphrase-xlm-r-multilingual-v1

- quora-distilbert-base / quora-distilbert-multilingual

- distiluse-base-multilingual-cased-v2

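Because both sides of a symmetric task are interchangeable, similar questions can be found by scoring every unordered pair once. A minimal sketch, assuming cosine similarity and an arbitrary threshold of 0.8; the vectors are invented stand-ins for model embeddings and `similar_pairs` is a hypothetical helper:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_pairs(vectors, threshold=0.8):
    """Return (i, j, score) for each unordered pair scoring at least `threshold`."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

questions = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What causes ocean tides?",
]
# Toy vectors (assumption): one per question, in the same order.
vectors = [[0.9, 0.2, 0.1], [0.85, 0.25, 0.1], [0.1, 0.1, 0.95]]

pairs = similar_pairs(vectors)
for i, j, score in pairs:
    print(f"{questions[i]!r} ~ {questions[j]!r} ({score:.3f})")
```
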
# Extracting only the queries from MSMARCO

In [None]:
# open the JSON
import json
path = r"data/train_v2.1.json"
with open(path, 'r', encoding='utf8') as f:
    msmarco = json.load(f)

In [None]:
# take the queries and join them into a single string
msmarco_queries = ';'.join(msmarco['query'].values())
del msmarco

In [None]:
# write the string to a file
with open("data/queries.txt", "w", encoding='utf8') as f2:
    f2.write(msmarco_queries)

In [None]:
# read the string back from the file
with open("data/queries.txt", "r", encoding='utf8') as f3:
    msmarco_queries = f3.read()
print(msmarco_queries[0:1000])

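Joining with `';'` only round-trips cleanly if no query contains a semicolon itself, which the cells above do not check. A minimal sketch of an alternative, assuming a JSON list is acceptable as the on-disk format; the sample queries and the temporary file are illustrative only:

```python
import json
import os
import tempfile

queries = ["what is python", "tides; how do they form", "define marine biology"]

# Delimiter-based round trip: the embedded ';' splits one query in two (4 items).
delimited = ";".join(queries).split(";")

# JSON round trip: the list comes back exactly as it was written.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "queries.json")
    with open(path, "w", encoding="utf8") as f:
        json.dump(queries, f, ensure_ascii=False)
    with open(path, "r", encoding="utf8") as f:
        restored = json.load(f)

print(len(delimited), len(restored))
```
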
# Topic modelling with BERT

## Installs

In [None]:
!pip install umap-learn tqdm

In [None]:
!pip install hdbscan

In [None]:
# Slow! ~3 GB
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
# check that it is installed correctly
import torch
print(torch.cuda.is_available()) # True
print(torch.cuda.current_device()) # int
print(torch.cuda.device_count()) # >0
print(torch.cuda.get_device_name(0)) # GeForce ...

## Imports

In [None]:
from tqdm import tqdm
import umap
import os
import hdbscan
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option("display.max_rows", 10, "display.max_columns", None, "display.width", None, "display.max_colwidth", 70)

## Load the data

In [None]:
with open("data/queries.txt", "r", encoding='utf8') as f3:
    msmarco_queries = f3.read()
data = msmarco_queries.split(';')
del msmarco_queries

In [None]:
print(len(data))
display(data[:10])

## Compute the embeddings

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(data, show_progress_bar=True, device='cuda')
print('Finished')

Saving the embeddings is not worthwhile: they take up a lot of space (16 GB), saving them takes longer than regenerating them, and I was not able to load them back.

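A back-of-the-envelope estimate hints at where the 16 GB may come from. This sketch assumes the embeddings are float32 and that the file was written as text in the `'%.18e'` format that `np.savetxt` uses by default (about 25 bytes per value, separator included); the notebook does not say how they were actually saved:

```python
n_queries = 809_214   # number of queries in this notebook's dataset
dim = 768             # embedding size before the UMAP reduction
n_values = n_queries * dim

float32_bytes = n_values * 4   # raw binary float32
text_bytes = n_values * 25     # assumption: ~25 bytes per value as '%.18e' text

print(f"binary float32: {float32_bytes / 1e9:.1f} GB")
print(f"text '%.18e':   {text_bytes / 1e9:.1f} GB")
```

Under these assumptions, a raw binary dump (for example via `np.save`) would be roughly six times smaller than the text file.
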
## Compute UMAP

In [None]:
# reduce the embeddings from 768 components to 5
umap_embeddings = umap.UMAP(n_neighbors=15, 
                            n_components=5, 
                            metric='cosine').fit_transform(embeddings)

In [None]:
display(umap_embeddings)

In [None]:
np.shape(umap_embeddings)

In [None]:
# save umap_embeddings
umap_path = os.path.join(os.getcwd(), 'umap_embeddings.txt')
with open(umap_path, 'w') as file:
    for row in tqdm(umap_embeddings):
        np.savetxt(file, row)

In [None]:
%%time
# load umap_embeddings if they are not already in memory
try:
    umap_embeddings
except NameError:
    umap_path = os.path.join(os.getcwd(), 'umap_embeddings.txt')
    # 809214 rows x 5 UMAP components for this dataset
    umap_embeddings = np.loadtxt(umap_path).reshape(-1, 5)

## Compute the clusters

In [None]:
cluster = hdbscan.HDBSCAN(min_cluster_size=15,
                          metric='euclidean',                      
                          cluster_selection_method='eom').fit(umap_embeddings)

## Visualize
Optional!

In [None]:
%%time
# Visualization
# Reduce the embeddings again, this time to only 2 components (extremely slow!)
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

In [None]:
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_

# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()
plt.show()

## Analyze the results

In [None]:
# Results as a dataframe
# Doc = query, Topic = cluster
docs_df = pd.DataFrame(data, columns=["Doc"])
docs_df['Topic'] = cluster.labels_
docs_df['Doc_ID'] = range(len(docs_df))
docs_per_topic = docs_df.groupby(['Topic'], as_index = False).agg({'Doc': ' '.join})

In [None]:
display(docs_df)

In [None]:
%%time
from sklearn.feature_extraction.text import CountVectorizer

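def c_tf_idf(documents, m, ngram_range=(1, 1)):
    """Class-based TF-IDF: one entry of `documents` per topic, `m` = number of original queries."""
    count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(documents)
    t = count.transform(documents).toarray()  # term counts, shape (n_topics, n_terms)
    w = t.sum(axis=1)                         # total term count per topic
    tf = np.divide(t.T, w)                    # term frequency within each topic
    sum_t = t.sum(axis=0)                     # total count of each term over all topics
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)             # shape (n_terms, n_topics)

    return tf_idf, count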
  
tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(data))

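To make the arithmetic in `c_tf_idf` concrete: the score of a term in a topic is its share of that topic's term count, multiplied by the log of `m` divided by the term's total count over all topics. A minimal pure-Python sketch of the same formula on two toy topics; `c_tf_idf_sketch` and the token lists are illustrative, and stop-word removal is omitted:

```python
import math
from collections import Counter

def c_tf_idf_sketch(topic_tokens, m):
    """topic_tokens: {topic: list of tokens from all its queries}; m: number of queries."""
    counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}
    total_per_term = Counter()
    for term_counts in counts.values():
        total_per_term.update(term_counts)

    scores = {}
    for topic, term_counts in counts.items():
        w = sum(term_counts.values())  # total term count in this topic
        scores[topic] = {
            term: (freq / w) * math.log(m / total_per_term[term])
            for term, freq in term_counts.items()
        }
    return scores

topics = {
    0: "ocean wave tide ocean".split(),
    1: "python code python wave".split(),
}
scores = c_tf_idf_sketch(topics, m=8)
top_word = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_word)
```

A term shared by both topics ("wave") gets a lower score than a term of equal frequency that is specific to one topic ("tide" or "code").
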
In [None]:
%%time
def extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20):
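    # get_feature_names() was removed in scikit-learn 1.2; fall back to it on versions older than 1.0
    get_names = getattr(count, "get_feature_names_out", None) or count.get_feature_names
    words = get_names()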
    labels = list(docs_per_topic.Topic)
    tf_idf_transposed = tf_idf.T
    indices = tf_idf_transposed.argsort()[:, -n:]
    top_n_words = {label: [(words[j], tf_idf_transposed[i][j]) for j in indices[i]][::-1] for i, label in enumerate(labels)}
    return top_n_words

def extract_topic_sizes(df):
    topic_sizes = (df.groupby(['Topic'])
                     .Doc
                     .count()
                     .reset_index()
                     .rename({"Topic": "Topic", "Doc": "Size"}, axis='columns')
                     .sort_values("Size", ascending=False))
    return topic_sizes

top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=10)
topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)

In [None]:
top_n_words[5]

In [None]:
# words to look for among the clusters
to_find = ['marine', 'ocean', 'sea', 'oil', 'beach', 'current', 'tide', 'wave']

In [None]:
%%time
found = {word: [] for word in to_find}
for topic, words_scores in top_n_words.items():
    if topic == -1:  # skip the HDBSCAN outlier cluster
        continue
    topic_words = {w for w, _ in words_scores}
    for word in to_find:
        if word in topic_words:
            found[word].append(topic)

display(found)  # clusters whose top words include the search terms

In [None]:
to_find = ['marine', 'ocean', 'sea', 'oil', 'beach', 'current', 'tide', 'wave']
for i in found['wave']:
    display(i, top_n_words[i])

In [None]:
s = docs_df[docs_df['Topic'] == 3908]
display(s)
print(s.index.tolist())

In [None]:
# Save the result
inter = [872, 905, 1885, 891, 1616, 1651, 1652, 3779, 1421, 1913, 885]  # clusters that possibly contain interesting questions

In [None]:
inter_df = docs_df[docs_df['Topic'].isin(inter)]  # rows belonging to the interesting clusters
display(inter_df)

In [None]:
inter_json = inter_df.drop(columns=['Topic', 'Doc_ID'])
inter_json.to_json('pre_lookup_table.json')

## Reduce the number of clusters

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# TOPIC REDUCTION
for i in range(20):
    # Calculate cosine similarity
    similarities = cosine_similarity(tf_idf.T)
    np.fill_diagonal(similarities, 0)

    # Extract label to merge into and from where
    topic_sizes = docs_df.groupby(['Topic']).count().sort_values("Doc", ascending=False).reset_index()
    topic_to_merge = topic_sizes.iloc[-1].Topic
    topic_to_merge_into = np.argmax(similarities[topic_to_merge + 1]) - 1

    # Adjust topics
    docs_df.loc[docs_df.Topic == topic_to_merge, "Topic"] = topic_to_merge_into
    old_topics = docs_df.sort_values("Topic").Topic.unique()
    map_topics = {old_topic: index - 1 for index, old_topic in enumerate(old_topics)}
    docs_df.Topic = docs_df.Topic.map(map_topics)
    docs_per_topic = docs_df.groupby(['Topic'], as_index = False).agg({'Doc': ' '.join})

    # Calculate new topic words
    m = len(data)
    tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m)
    top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)

topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)

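Each iteration of the topic reduction loop merges the smallest topic into the topic whose c-TF-IDF vector is most similar to it. A minimal sketch of one such step on toy data, in plain Python rather than with the scikit-learn call; `merge_smallest_topic`, the sizes, and the vectors are hypothetical, and the `-1` outlier offset used in the loop is left out:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_smallest_topic(sizes, vectors):
    """Return (topic_to_merge, topic_to_merge_into) for one reduction step."""
    smallest = min(sizes, key=sizes.get)
    candidates = [t for t in sizes if t != smallest]
    target = max(candidates, key=lambda t: cosine(vectors[smallest], vectors[t]))
    return smallest, target

# Toy data (assumption): number of queries and a term-weight vector per topic.
sizes = {0: 120, 1: 80, 2: 15}
vectors = {0: [0.9, 0.1, 0.0], 1: [0.0, 0.2, 0.9], 2: [0.8, 0.3, 0.1]}

merged, target = merge_smallest_topic(sizes, vectors)
sizes[target] += sizes.pop(merged)  # the merged topic's queries now belong to the target
print(merged, target, sizes)
```

In the notebook's loop, the topic words are then recomputed after every merge, since the merged topic's vocabulary changes the c-TF-IDF scores.
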
# Domain-Specific QA (IBM)
Inspired by: https://github.com/ibm-aur-nlp/domain-specific-QA  
IBM's code *does not filter* the topics; it only takes the list of already-filtered ids and generates  
the JSON containing all the information relevant to each question in each domain.