# Generating a survey section

This notebook generates one section of a survey. The user provides a query, which is used to retrieve a set of papers relevant to the section.

Starting from the query, the notebook searches for a set of papers and runs the analysis on them. If you already have the desired set of papers, you can skip the paper search step.

## Setup

In [1]:
!pip install openai -q

from getpass import getpass
import openai

OPENAI_API_KEY = getpass('OpenAI API key: ')
openai.api_key = OPENAI_API_KEY

OpenAI API key: ··········


In [2]:
# Folder where the papers will be downloaded
folder_papers = './papers_pdf/'

# Section query and main topic of the survey
survey_topic = 'neural information retrieval'
query_section = 'text representation for ranking'
# Number of subsections to generate for this section
n_sub_sections = 3

# Name of the model to use
#gpt_model_name = 'gpt-3.5-turbo'
gpt_model_name = 'gpt-3.5-turbo-0613'
#gpt_model_name = 'gpt-4-0314'

# If use_chat_model = True, use langchain.chat_models.ChatOpenAI. Otherwise, use langchain.llms.OpenAI
use_chat_model = True

# Batch size for generating embeddings with Specter
batch_size = 32
# If True, build the Specter input from the TLDR instead of the abstract.
# Set to True only for testing, since Specter was designed for title [SEP] abstract
use_tldr_instead_of_abstract = False

# This notebook supports two ways of splitting the text:
# 1. Split the text every max_char_length_chunk_to_index characters
# 2. Split the text every sentences_in_chunk sentences (runs if split_using_sentences = True)
#
# Maximum length of each text chunk.
# Chunks cannot be too large, otherwise they will not fit in Specter
# (512 tokens ~ 2000 characters). The chunks are also sent to GPT later,
# so they must fit in its context window as well.
max_char_length_chunk_to_index = 700
split_using_sentences = True
sentences_in_chunk = 7
sentences_overlap = 2

# Number of papers used to extract a section name
n_papers_to_extract_section_name = 10
# Number of text chunks returned by the retriever (Specter)
n_chunks_returned_by_vector_retriever = 20
# Number of text chunks used to generate a reference
n_chunks_to_use_as_reference = 10

In [3]:
!pip install langchain -q
!pip install faiss-gpu -q
!pip install adapter-transformers -q
!pip install pypdfium2 -q
!pip install spacy -q


## Paper search

Papers are searched through the Semantic Scholar API.

In [4]:
import requests
import json
import pickle

def search_by_keywords(query,
                       fields='url,title,venue,year,authors,abstract,openAccessPdf,citationCount,referenceCount,publicationTypes,journal,tldr,publicationDate',
                       fieldsOfStudy='Computer Science',
                       year='2020-2023',
                       openAccessPdf=True,
                       offset=0,
                       limit=100):
    query_openaccess = '&openAccessPdf' if openAccessPdf else ''
    url = f'https://api.semanticscholar.org/graph/v1/paper/search?query={query}&fields={fields}&fieldsOfStudy={fieldsOfStudy}&year={year}{query_openaccess}&offset={offset}&limit={limit}'
    return requests.get(url).json()

def save_all_papers(query="neural+information+retrieval", file_name='papers_metadata.pkl', year='2020-2023', fieldsOfStudy="Computer Science"):
    offset = 0
    limit = 100
    total = 1
    all_papers = []

    while offset < total and (offset + limit < 10000):
        print(f'Searching {offset} to {offset+limit} (Total: {total})')
        result = search_by_keywords(query, fieldsOfStudy=fieldsOfStudy, year=year, offset=offset, limit=limit)

        all_papers.extend(result['data'])
        total = result['total']
        offset += limit

    with open(file_name, 'wb') as f:
        pickle.dump(all_papers, f)

    return all_papers
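
As a side note on how the request URL is assembled: the f-string in `search_by_keywords` leaves the escaping of spaces and commas to the HTTP client. The sketch below builds the same URL with `urllib.parse.urlencode` so the escaping is explicit. `build_search_url` is a hypothetical helper (not part of the notebook), its default `fields` value is shortened for illustration, and the bare `openAccessPdf` flag is appended the same way the notebook does it.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.semanticscholar.org/graph/v1/paper/search'

def build_search_url(query, fields='title,abstract', fields_of_study='Computer Science',
                     year='2020-2023', open_access_pdf=True, offset=0, limit=100):
    """Hypothetical helper: same parameters as search_by_keywords, explicitly URL-encoded."""
    params = {
        'query': query,
        'fields': fields,
        'fieldsOfStudy': fields_of_study,
        'year': year,
        'offset': offset,
        'limit': limit,
    }
    url = f'{BASE_URL}?{urlencode(params)}'
    # The open-access filter is a bare flag, appended as in the notebook
    return f'{url}&openAccessPdf' if open_access_pdf else url

url = build_search_url('neural information retrieval text representation for ranking')
print(url)
```

The resulting string can be passed to `requests.get` exactly like the URL built in the cell above.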

Let's check how many papers match this query for the 2020-2023 period:

In [5]:
query = f'{survey_topic} {query_section}'
print(f'Searching query: {query}')
paper_2020_2023 = search_by_keywords(f'{survey_topic} {query_section}')
print(paper_2020_2023['total'])

Searching query: neural information retrieval text representation for ranking
86


Saves a pickle and keeps only the list of papers. This call is strictly needed only when more than 100 papers are returned, since it paginates through the results and concatenates all pages. Otherwise, reading `['data']` from the `search_by_keywords` response is enough.

In [6]:
all_papers = save_all_papers(query)

Searching 0 to 100 (Total: 1)
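
The pagination loop in `save_all_papers` can be tried offline. The sketch below reproduces the same offset/limit/total logic against a stub; `paginate` and `fake_search` are hypothetical names introduced only for this illustration, and the stub returns made-up paper ids. It also shows why the first log line above reads `Total: 1`: the total is initialized to 1 and only becomes known after the first response.

```python
def paginate(fetch_page, limit=100, max_results=10000):
    """Collect every item from an offset/limit API (fetch_page is a hypothetical callable)."""
    offset, total, items = 0, 1, []
    # Same stopping rule as save_all_papers: stop at the reported total
    # or before offset + limit reaches max_results
    while offset < total and offset + limit < max_results:
        page = fetch_page(offset, limit)
        items.extend(page.get('data', []))
        total = page.get('total', 0)
        offset += limit
    return items

def fake_search(offset, limit, n_results=250):
    """Stub that mimics the shape of the search response: {'total': ..., 'data': [...]}."""
    ids = list(range(n_results))[offset:offset + limit]
    return {'total': n_results, 'data': [{'paperId': str(i)} for i in ids]}

papers = paginate(fake_search)
print(len(papers))
```

Using `page.get('data', [])` instead of `page['data']` is a small defensive choice for responses that carry no results.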


## Extracting the section topics

### SpecterEmbeddings class

This class generates Specter-v2 embeddings in both `specter2_proximity` mode (proximity between documents) and `specter2_adhoc_query` mode (document search from a query). `embed_documents` activates the proximity adapter and `embed_query` activates the ad hoc query adapter.

In [7]:
# Based on https://github.com/hwchase17/langchain/blob/4379bd4cbb8482e70d8936f747abd5ae7663f977/langchain/embeddings/huggingface.py#L16

from transformers import AutoAdapterModel, AutoTokenizer, AutoModel
from torch import cuda, bfloat16
import transformers
import torch
from tqdm.auto import tqdm
import numpy as np

from pydantic import BaseModel, Extra, Field
from langchain.embeddings.base import Embeddings
from typing import Any, Dict, List, Optional

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

class SpecterEmbeddings(BaseModel, Embeddings):

  """Key word arguments to pass to the model."""
  encode_kwargs: Dict[str, Any] = Field(default_factory=dict)

  def __init__(self, **kwargs: Any):
    super().__init__(**kwargs)

    self.tokenizer = AutoTokenizer.from_pretrained('allenai/specter2')
    self.model = AutoModel.from_pretrained('allenai/specter2')

    self.model.load_adapter("allenai/specter2_proximity", source="hf", load_as="specter2_proximity", set_active=False)
    self.model.load_adapter("allenai/specter2_adhoc_query", source="hf", load_as="adhoc_query", set_active=False)

    self.device = device
    self.model.eval()
    self.model.to(self.device)

  @torch.no_grad()
  def embed_documents(self, texts: List[str]) -> List[List[float]]:
    """Compute doc embeddings using a HuggingFace transformer model.

    Args:
        texts: The list of texts to embed.

    Returns:
        List of embeddings, one for each text.
    """

    self.model.set_active_adapters(None)
    self.model.set_active_adapters("specter2_proximity")

    all_embeddings = []

    batch_size = 32
    show_progress_bar = True

    if 'batch_size' in self.encode_kwargs:
      batch_size = self.encode_kwargs['batch_size']
    if 'show_progress_bar' in self.encode_kwargs:
      show_progress_bar = self.encode_kwargs['show_progress_bar']

    # sort text for less padding
    length_sorted_idx = np.argsort([-len(sen) for sen in texts])
    texts_sorted = [texts[idx] for idx in length_sorted_idx]

    for start_index in tqdm(range(0, len(texts_sorted), batch_size), desc="Batches", disable=not show_progress_bar):
      texts_batch = texts_sorted[start_index:start_index+batch_size]

      inputs = self.tokenizer(texts_batch, padding=True, truncation=True,
                              return_tensors="pt", return_token_type_ids=False, max_length=512)

      output = self.model(**inputs.to(self.device))
      # take the first token in the batch as the embedding
      embeddings = output.last_hidden_state[:, 0, :]

      all_embeddings.extend(embeddings.tolist())

    return all_embeddings

  class Config:
      """Configuration for this pydantic object."""

      extra = Extra.allow

  @torch.no_grad()
  def embed_query(self, text: str) -> List[float]:
    """Compute query embeddings using a HuggingFace transformer model.

    Args:
        text: The text to embed.

    Returns:
        Embeddings for the text.
    """
    self.model.set_active_adapters(None)
    self.model.set_active_adapters("adhoc_query")

    inputs = self.tokenizer(text, padding=True, truncation=True,
                            return_tensors="pt", return_token_type_ids=False, max_length=512)

    output = self.model(**inputs.to(self.device))
    # take the first token in the batch as the embedding
    embeddings = output.last_hidden_state[:, 0, :]

    return embeddings.squeeze(0).tolist()

### Generate document embeddings from [title] [SEP] [abstract]

These embeddings are used to group the papers into clusters, so that each cluster can be written up as a subsection of the section.

In [8]:
def get_metadata(paper):
  metadata = {
    'paperId': paper['paperId'],
    'title': paper['title'],
    'venue': paper['venue'],
    'year': paper['year'],
    'authors': paper['authors'],
    'abstract': paper['abstract'],
    'citationCount': paper['citationCount'],
    'referenceCount': paper['referenceCount'],
    'journal': paper['journal'],
  }
  return metadata

In [9]:
from langchain.schema import Document

documents = []
documents_str = []
for paper in all_papers:
  paper_id = paper['paperId']
  paper_title = paper['title']
  paper_abstract = paper['abstract']
  # 'tldr' may be missing or null for some papers
  paper_tldr = (paper.get('tldr') or {}).get('text')

  documents_str.append(f'{paper_title} [SEP] {paper_abstract}')
  documents.append(Document(page_content=f'{paper_title} [SEP] {paper_tldr if use_tldr_instead_of_abstract else paper_abstract}', metadata=get_metadata(paper)))

In [10]:
batch_size_embeddings = 1 if device == 'cpu' else batch_size

embeddings = SpecterEmbeddings(encode_kwargs={'batch_size': batch_size_embeddings})


In [11]:
document_embeddings = embeddings.embed_documents(documents_str)


### Split the documents into n_sub_sections clusters with k-means

Since we are already at the section level, we create only n_sub_sections subsections.

In [12]:
from sklearn.cluster import SpectralClustering, KMeans
import numpy as np
from collections import defaultdict

# Create a KMeans model with n_sub_sections clusters
kmeans_model = KMeans(n_clusters=n_sub_sections, random_state=12345)

# Fit the model to the embeddings
kmeans_model.fit(document_embeddings)

# Get the cluster label assigned to each document
labels = kmeans_model.labels_

# Cluster centers and distance of each document to the center of its cluster
cluster_centers = kmeans_model.cluster_centers_
distance_to_center = np.linalg.norm(document_embeddings - cluster_centers[labels], axis=1)

# Group the documents by cluster
idx_docs_in_cluster = defaultdict(list)
distances_docs_in_cluster = defaultdict(list)
for idx_doc, label in enumerate(labels):
  idx_docs_in_cluster[label].append(idx_doc)
  distances_docs_in_cluster[label].append(distance_to_center[idx_doc])

# Sort the indices by distance to the cluster center (largest to smallest)
# and print the number of documents in each cluster
for key in idx_docs_in_cluster:
  idx_docs_in_cluster[key] = [idx_doc for _, idx_doc in sorted(zip(distances_docs_in_cluster[key], idx_docs_in_cluster[key]), reverse=True)]
  print(f"Cluster {key}: {len(idx_docs_in_cluster[key])}")

Cluster 2: 38
Cluster 0: 18
Cluster 1: 30
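
To make the ordering step concrete, here is a dependency-free sketch on toy 2-D vectors. `rank_by_distance` is a hypothetical helper, and the vectors, labels, and centers are made up. It mirrors the `sorted(..., reverse=True)` call above, which puts the documents farthest from the centroid first, and shows the closest-first alternative for comparison. The notebook does not state which ordering is intended, so treat the second call as an option to evaluate, not as a correction.

```python
import math
from collections import defaultdict

def rank_by_distance(vectors, labels, centers, farthest_first=True):
    """Group item indices by cluster label and sort each group by distance to its center."""
    groups = defaultdict(list)
    for idx, (vec, label) in enumerate(zip(vectors, labels)):
        groups[label].append((math.dist(vec, centers[label]), idx))
    return {label: [idx for _, idx in sorted(pairs, reverse=farthest_first)]
            for label, pairs in groups.items()}

# Toy data: three points in cluster 0, two points in cluster 1
vectors = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 0, 1, 1]
centers = {0: (1.0, 0.0), 1: (10.0, 10.5)}

farthest = rank_by_distance(vectors, labels, centers)
closest = rank_by_distance(vectors, labels, centers, farthest_first=False)
print(farthest)
print(closest)
```

Because only the first `n_papers_to_extract_section_name` papers of each cluster are later shown to GPT, the choice of ordering decides whether the naming prompt sees the most typical papers of a cluster or its outliers.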




### Use GPT to name the subtopics

For each cluster, we call GPT and ask it to suggest a name for the corresponding topic. In total, this makes n_sub_sections calls to the API.

We start by defining the message structure. The system message and the first exchange between the human and the LLM are the same regardless of the cluster. The last human message depends on the cluster being processed, since it contains the titles and abstracts of the papers in that cluster.

In [13]:
from langchain.prompts import (
    ChatPromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

# SYSTEM MESSAGE
system_template = "You are a renowned scientist who is writing a survey on '{survey_topic}'. You are currently writing a section about '{query_section}'"
system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

# FIRST HUMAN MESSAGE - EXPLAINING THE TASK
human_template_task = """\
I will send you a list of title and abstract of scientific articles. \
Most of them cover a specific subtopic about section '{query_section}'. \
Your task is to find out what this subtopic is and suggest a good title for a section in a scientific survey that addresses this subtopic. \
Your answer should be a valid RFC8259 compliant JSON object with three properties. \
The first property, called "subtopic", describes the subtopic and must be a subset of '{query_section}'. \
The second property, called "title", is the title of the section that will cover this subtopic and must be clearly related to the property "subtopic". \
The last property is called "reasoning" and should contains your reasoning to choose this subtopic as an answer. \
Remember to format your answer as a valid RFC8259 compliant JSON object, enclosing the keys and values in quotes. \
Do you understand?
"""
human_message_prompt_task = HumanMessagePromptTemplate.from_template(human_template_task)

# FIRST AI ANSWER - AGREEING
ai_message_prompt_yes = AIMessagePromptTemplate.from_template('Sure, send me the list and I will give you what you need.')

# SECOND HUMAN MESSAGE - ABSTRACT AND TITLE
# This is a variable message that depends on the cluster
def text_message_human_prompt_papers_in_subsection(papers):
  message = ''
  for paper in papers:
    if (paper['abstract'] is None or paper['title'] is None):
      continue
    message = message + f"Title: {paper['title']}\nAbstract: {paper['abstract']}\n\n"

  # Encode "{" and "}" from the message
  message = message.replace("{", "{{")
  message = message.replace("}", "}}")
  return HumanMessagePromptTemplate.from_template(message)
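
The brace-doubling at the end of `text_message_human_prompt_papers_in_subsection` matters because abstracts often contain literal `{` and `}` (for example in formulas). The sketch below illustrates the effect with plain `str.format`, on the assumption that the prompt template is rendered with the same rules; `escape_braces` and the sample abstract are made up for the illustration.

```python
def escape_braces(text):
    """Double every brace so str.format treats it as a literal character."""
    return text.replace('{', '{{').replace('}', '}}')

abstract = 'We minimize sum_{i} loss(x_{i}) over the {train} split.'
template = 'Abstract: ' + escape_braces(abstract) + '\nSection: {query_section}'
rendered = template.format(query_section='text representation for ranking')
print(rendered)

# Without escaping, the braces are parsed as template variables and formatting fails
try:
    ('Abstract: ' + abstract).format(query_section='unused')
    unescaped_ok = True
except (KeyError, IndexError):
    unescaped_ok = False
print(unescaped_ok)
```

After formatting, the doubled braces collapse back to single ones, so the model receives the abstract unchanged.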

In [14]:
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

if use_chat_model:
  llm_gpt = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name=gpt_model_name)
else:
  llm_gpt = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name=gpt_model_name)

subsections_per_cluster_str = {}
subsections_per_cluster = {}

for cluster in range(n_sub_sections):
  # Get the indexes of the docs in the cluster
  idx_docs = idx_docs_in_cluster[cluster]
  # Get the papers corresponding to the given indexes
  papers_subsection = [all_papers[idx_doc] for idx_doc in idx_docs]
  # Select only the first n_papers_to_extract_section_name (default=10) papers to generate the last human message
  message_few_shot_prompt = text_message_human_prompt_papers_in_subsection(papers_subsection[0:n_papers_to_extract_section_name])

  # Now, generate the chat messages for this subsection
  chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt,
                                                  human_message_prompt_task,
                                                  ai_message_prompt_yes,
                                                  message_few_shot_prompt])
  #chat_messages = chat_prompt.format_prompt(survey_topic=survey_topic, query_section=query_section).to_messages()

  question_chain = LLMChain(llm=llm_gpt, prompt=chat_prompt, verbose=True)
  subsections_per_cluster_str[cluster] = question_chain.run(survey_topic=survey_topic, query_section=query_section)

# Try to convert to json
for cluster in range(n_sub_sections):
  subsections_per_cluster[cluster] = json.loads(subsections_per_cluster_str[cluster])



> Entering new  chain...
Prompt after formatting:
System: You are a renowned scientist who is writing a survey on 'neural information retrieval'. You are currently writing a section about 'text representation for ranking'
Human: I will send you a list of title and abstract of scientific articles. Most of them cover a specific subtopic about section 'text representation for ranking'. Your task is to find out what this subtopic is and suggest a good title for a section in a scientific survey that addresses this subtopic. \ 
Your answer should be a valid RFC8259 compliant JSON object with three properties. The first property, called "subtopic", describes the subtopic and must be a subset of 'text representation for ranking'. The second property, called "title", is the title of the section that will cover this subtopic and must be clearly related to the property "subtopic". The last property is called "reasoning" and should contains your reasoning to choose this subt

Section names suggested by GPT based on the papers.

For reference, in [this survey](https://arxiv.org/pdf/2207.13443.pdf) the sections are:

- BOW Encodings
- LTR Features
- Word Embeddings

In [15]:
for cluster in range(n_sub_sections):
  print(f"Title: {subsections_per_cluster[cluster]['title']}")
  print(f"Subtopic: {subsections_per_cluster[cluster]['subtopic']}")
  print('.'*50)

Title: Text-Image Matching for Cross-Modal Retrieval via Graph Neural Network
Subtopic: Text representation for cross-modal retrieval
..................................................
Title: Deep Learning Models for Text Representation in Ranking
Subtopic: Deep Learning Models for Text Representation
..................................................
Title: Textual Representations for Crosslingual Information Retrieval
Subtopic: Text representation for crosslingual information retrieval
..................................................
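
The `json.loads` call in cell 14 assumes the model answers with a bare JSON object, which the prompt requests but cannot guarantee. A minimal sketch of a more tolerant parser follows. `parse_subsection_reply` is a hypothetical helper, the sample reply is invented for the example, and the three required keys are the ones named in the prompt.

```python
import json

REQUIRED_KEYS = ('subtopic', 'title', 'reasoning')

def parse_subsection_reply(reply):
    """Extract the outermost JSON object from a model reply and check its keys."""
    start, end = reply.find('{'), reply.rfind('}')
    if start == -1 or end < start:
        raise ValueError('no JSON object found in the reply')
    data = json.loads(reply[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f'missing keys: {missing}')
    return data

# Invented reply with extra text around the JSON object
reply = ('Sure! {"subtopic": "dense retrieval", "title": "Dense Representations", '
         '"reasoning": "Most abstracts use bi-encoders."}')
parsed = parse_subsection_reply(reply)
print(parsed['title'])
```

Raising `ValueError` with a specific message makes it easier to decide whether to retry the call for that cluster.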


## Generating the subsection text

### Downloading the paper PDFs

In [16]:
!mkdir -p {folder_papers}

for i, paper in enumerate(all_papers):
  print(f"Downloading paper {i}: {paper['paperId']}: {paper['openAccessPdf']['url']}")
  !wget {paper['openAccessPdf']['url']} -O {folder_papers}{paper['paperId']}.pdf --user-agent="Mozilla" --tries=1 -T 5

Downloading paper 0: a609db40216a4071f9f739766c6691fa46fb8072: https://aclanthology.org/2021.ecnlp-1.14.pdf
--2023-06-14 19:04:05--  https://aclanthology.org/2021.ecnlp-1.14.pdf
Resolving aclanthology.org (aclanthology.org)... 174.138.37.75
Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 317304 (310K) [application/pdf]
Saving to: ‘./papers_pdf/a609db40216a4071f9f739766c6691fa46fb8072.pdf’


2023-06-14 19:04:05 (986 KB/s) - ‘./papers_pdf/a609db40216a4071f9f739766c6691fa46fb8072.pdf’ saved [317304/317304]

Downloading paper 1: aacc51b75d910031d8b34476e6a343d5eed73fc2: https://aclanthology.org/2021.emnlp-main.78.pdf
--2023-06-14 19:04:05--  https://aclanthology.org/2021.emnlp-main.78.pdf
Resolving aclanthology.org (aclanthology.org)... 174.138.37.75
Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 627293 (613K) [

Now that the subtopics and section titles have been generated, we can use GPT to write the text of the subsections.

To do so, we use the content of the papers to find passages related to the title of the subsection being written.

There are two main alternatives here:

1. Ignore the papers already retrieved and search for new, more specific papers related to the generated title.

2. Reuse the papers already retrieved, since they have already been clustered.

Both alternatives are valid and will cover different papers. For consistency with what has been done so far, we take approach 2 and use only the papers in each subsection's cluster.

<br>

On the code side, we now drop the document structure used so far (title and abstract only) and work with title + full content. Since the content of a paper is far too long to fit in the context window, we split it into text chunks.

### Extracting text from the PDFs

In [17]:
import pypdfium2 as pdfium

from langchain.vectorstores import FAISS
from langchain.schema import Document

documents_in_subsection = {}
# For each subsection...
for cluster in range(n_sub_sections):
  # Papers in subsection:
  idx_docs = idx_docs_in_cluster[cluster]
  papers_subsection = [all_papers[idx_doc] for idx_doc in idx_docs]

  documents = []
  for paper in papers_subsection:
    paper_id = paper['paperId']
    paper_title = paper['title']
    pdf_file = f'{folder_papers}{paper_id}.pdf'
    txt_contents = ''

    print(f"Extracting {paper_id} in subsection {cluster}")
    try:
      pdf = pdfium.PdfDocument(pdf_file)
    except Exception:
      print(f'***** Problems with {pdf_file}. Ignoring...')
      continue

    for i in range(len(pdf)):
      txt_contents += pdf[i].get_textpage().get_text_range()

    # Remove the break lines and considers only one big string of text:
    txt_contents = txt_contents.replace('\r\n', ' ')
    txt_contents = txt_contents.replace('\n', ' ')
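    # Keep only the text between the introduction and the reference section:
    txt_contents_lower = txt_contents.lower()
    idx_introduction = max(txt_contents_lower.find('introduction'), 0)
    idx_references = txt_contents_lower.rfind('reference')
    if idx_references < idx_introduction:
      # No 'reference' found after the introduction: keep the text up to the end
      idx_references = len(txt_contents)
    txt_contents = txt_contents[idx_introduction:idx_references]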

    documents.append(Document(page_content=txt_contents, metadata=get_metadata(paper)))

  documents_in_subsection[cluster] = documents

Extracting 1ee7569c388b53ce5c4bff610df5ee06db5ed7f0 in subsection 0
Extracting 7b8fe8c28a371120b4479540b2c8a0f7c5af25bf in subsection 0
Extracting 8f41470b690f38a5b9c06b7e50c865e6a3d937f4 in subsection 0
Extracting f4663ff98c7d696b8fda2cb7c5e729862b0df191 in subsection 0
Extracting 4af673758c1d501cdf761e57a34e29485668336b in subsection 0
Extracting 54da6371750c53cad52314a3aa80b5ed2e0e89ad in subsection 0
***** Problems with ./papers_pdf/54da6371750c53cad52314a3aa80b5ed2e0e89ad.pdf. Ignoring...
Extracting 01b6bf20e38818df0b1c9f5a55a5f013aadcef09 in subsection 0
Extracting a8cc7fe29cd4a7c1480cf6f36b698db6103b1a53 in subsection 0
Extracting 336e531a59cafbe215b950fd749bca866b89cea0 in subsection 0
Extracting 8c21b1df7ac375742e412251cb37f10966bb3bfa in subsection 0
Extracting f6d69afebcebcbd3e511faf19375f71dd679cdcb in subsection 0
Extracting 471dea6589d6f19e78db1f47fbc7cff0d9f1aab3 in subsection 0
Extracting 4aa1d28944856ebe1950a27f633c6667ead3cbf8 in subsection 0
Extracting 3355935d5e2d08

### Class for splitting text into sentences

In [18]:
from typing import Any, List, Optional
from transformers import AutoTokenizer
from langchain.docstore.document import Document
import copy
import spacy

class SentenceTextSplitter():
  def __init__(self, sentences_in_chunk=7, sentences_overlap=2, pipeline: str = "en_core_web_sm"):
    self._spacy_tokenizer = spacy.load(pipeline)
    self._sentences_in_chunk = sentences_in_chunk
    self._sentences_overlap = sentences_overlap

  def split_documents(self, docs: List[Document]) -> List[Document]:
    documents = []
    for i, doc in tqdm(enumerate(docs), desc="Splitting", total=len(docs)):
      for chunk in self.split_text(doc.page_content):
        new_doc = Document(
            page_content=chunk, metadata=copy.deepcopy(doc.metadata)
        )
        documents.append(new_doc)
    return documents

  def split_text(self, text: str) -> List[str]:
    self._spacy_tokenizer.max_length = len(text) + 100
    sentences_in_chunk = self._sentences_in_chunk
    sentences_overlap = self._sentences_overlap
    sentences = (str(s) for s in self._spacy_tokenizer(text).sents)

    chunks = []
    chunk = []

    for sentence in sentences:
      chunk.append(sentence)

      # Emit a chunk as soon as it holds sentences_in_chunk sentences,
      # then carry the last sentences_overlap sentences over to the next chunk
      if len(chunk) >= sentences_in_chunk:
        chunks.append(' '.join(chunk))
        chunk = chunk[-sentences_overlap:]

    # Emit the remaining sentences (if any) as the last chunk
    if chunk:
      chunks.append(' '.join(chunk))

    return chunks
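
The windowing logic of `split_text` is easier to follow without spaCy. The sketch below applies the same rule to a list of already separated sentences; `window_sentences` is a hypothetical stand-alone function, the sentences are placeholders, and the only deviation from the class is an explicit guard for an overlap of 0.

```python
def window_sentences(sentences, sentences_in_chunk=7, sentences_overlap=2):
    """Group sentences into chunks of sentences_in_chunk, repeating the last
    sentences_overlap sentences of each chunk at the start of the next one."""
    chunks, chunk = [], []
    for sentence in sentences:
        chunk.append(sentence)
        if len(chunk) >= sentences_in_chunk:
            chunks.append(' '.join(chunk))
            # chunk[-0:] would keep the whole chunk, so handle overlap 0 explicitly
            chunk = chunk[-sentences_overlap:] if sentences_overlap > 0 else []
    if chunk:
        chunks.append(' '.join(chunk))
    return chunks

sentences = [f's{i}.' for i in range(10)]
chunks = window_sentences(sentences, sentences_in_chunk=4, sentences_overlap=1)
for c in chunks:
    print(c)
```

Note that the last chunk here is just `s9.`, the overlap carried over from the previous chunk. The class behaves the same way, so a short trailing chunk that only repeats earlier sentences is expected.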

### Splitting each document into smaller chunks

Each document is too long to be indexed as a single text, so every document must be split into smaller chunks before indexing.

In [19]:
from langchain.embeddings import HuggingFaceEmbeddings

from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import SpacyTextSplitter

# 1 token is roughly 4 characters - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
# Since Specter v2 accepts up to 512 tokens, the input should stay under about 2000 characters.
# Besides the chunk text, the paper title is also part of the input: reserving 200 characters
# for the title leaves at most 1800 characters per chunk. The chunk size actually used here is
# max_char_length_chunk_to_index, set in the configuration cell.
text_splitter = CharacterTextSplitter(chunk_size=max_char_length_chunk_to_index, chunk_overlap=100, separator=' ')

splitted_documents_in_subsection = {}
for cluster in range(n_sub_sections):
  splitted_documents_in_subsection[cluster] = text_splitter.split_documents(documents_in_subsection[cluster])
  print(f'There is {len(splitted_documents_in_subsection[cluster])} chunks in subsection {cluster}')

There is 4187 chunks in subsection 0
There is 3462 chunks in subsection 1
There is 4878 chunks in subsection 2
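
For intuition about the character-based option, here is a simplified, dependency-free sketch of splitting on spaces with a character budget and an overlap. It is not the implementation of `CharacterTextSplitter`; `split_by_chars` is a hypothetical function whose defaults only echo the values used above (700 characters per chunk, 100 characters of overlap).

```python
def split_by_chars(text, chunk_size=700, overlap=100):
    """Greedy split on spaces: fill a chunk up to chunk_size characters, then
    start the next chunk with the trailing words that fit in `overlap` characters."""
    chunks, current = [], []
    for word in text.split(' '):
        if current and len(' '.join(current + [word])) > chunk_size:
            chunks.append(' '.join(current))
            # Carry over trailing words while they fit in the overlap budget
            kept = []
            while current and len(' '.join([current[-1]] + kept)) <= overlap:
                kept.insert(0, current.pop())
            current = kept
        current.append(word)
    if current:
        chunks.append(' '.join(current))
    return chunks

text = ' '.join(f'w{i}' for i in range(100))
chunks = split_by_chars(text, chunk_size=50, overlap=10)
print(len(chunks), chunks[0], '|', chunks[1])
```

With 700-character chunks plus a title of up to 200 characters, the Specter input stays well under the rough 2000-character estimate for 512 tokens.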


In [20]:
%%time

if split_using_sentences:
  sentence_splitter = SentenceTextSplitter(sentences_in_chunk=sentences_in_chunk, sentences_overlap=sentences_overlap)

  splitted_documents_in_subsection = {}
  for cluster in range(n_sub_sections):
    splitted_documents_in_subsection[cluster] = sentence_splitter.split_documents(documents_in_subsection[cluster])
    print(f'There is {len(splitted_documents_in_subsection[cluster])} chunks in subsection {cluster}')

There is 4502 chunks in subsection 0


There is 3648 chunks in subsection 1


There is 5121 chunks in subsection 2
CPU times: user 3min 38s, sys: 24.2 s, total: 4min 2s
Wall time: 4min 24s


Specter embeddings expect [title] [SEP] [abstract], but here the abstract is replaced by a chunk of the paper's content. We therefore need to change the page_content of the generated documents to include the title as well. Only the first 200 characters of the title are kept, in case the title is very long (the input is limited to 512 tokens).

In [21]:
for cluster in range(n_sub_sections):
  documents = splitted_documents_in_subsection[cluster]

  for doc in documents:
    # Since [SEP] is used at inference time, replace any literal [SEP] in the text with {sep}
    doc.page_content = f"{doc.metadata['title'][:200].replace('[SEP]', '{sep}')} [SEP] {doc.page_content.replace('[SEP]', '{sep}')}"
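
The same transformation can be written as two small functions, which also makes it easy to recover the chunk text later. This is a sketch: `with_title` and `chunk_only` are hypothetical helpers, and the title and chunk below are made up.

```python
SEPARATOR = ' [SEP] '

def with_title(title, chunk, max_title_chars=200):
    """Build '<title> [SEP] <chunk>', neutralizing any literal [SEP] in either part."""
    safe_title = title[:max_title_chars].replace('[SEP]', '{sep}')
    safe_chunk = chunk.replace('[SEP]', '{sep}')
    return f'{safe_title}{SEPARATOR}{safe_chunk}'

def chunk_only(page_content):
    """Return the text after the first separator, whatever the title length was."""
    return page_content.split(SEPARATOR, 1)[-1]

text = with_title('A very long title ' * 20, 'BERT joins segments with [SEP] tokens.')
print(text.count('[SEP]'))
print(chunk_only(text))
```

Splitting on the separator avoids relying on the title length, which matters when the title was truncated to 200 characters.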

The resulting content looks like this:

In [22]:
print(splitted_documents_in_subsection[0][0].page_content)
print(splitted_documents_in_subsection[0][1].page_content)
print(splitted_documents_in_subsection[0][2].page_content)

Toward English-Chinese Translation Based on Neural Networks [SEP] Introduction Machine translation (MT) has a long history. In 1949, War￾ren Weaver put forward the first influential machine transla￾tion proposal, which marked the beginning of machine translation [1]. In Weaver’s proposal, he mentioned the idea of computer translation and proposed to combine the knowl￾edge of statistics, logic, and linguistics to solve the problem of ambiguity in language. In the decades since, MT has come a long way. In 1949, the success of MT is started, but in the early decades of MT research, the so-called MT was almost entirely word-to-word substitution relying on bilingual dic￾tionaries. MT research soon fell into a cold winter. In 1966, the American Advisory Committee on Automatic Language Processing (ALPAC) pointed out in its research report lan￾guage and machine that “there is no hope for machine trans￾lation in the near future.”
Toward English-Chinese Translation Based on Neural Networks [SEP]
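
The printed chunk contains a marker character (U+FFFE) at places where a word appears to have been hyphenated at a line break in the PDF, for example inside "Warren" and "translation". A possible cleanup step, sketched here under that assumption, removes the marker and collapses whitespace before chunking. `clean_pdf_text` is a hypothetical helper and is not applied anywhere in this notebook.

```python
import re

# Marker seen in the extracted text where a word was split at a line break
HYPHEN_MARKER = '\ufffe'

def clean_pdf_text(text):
    """Rejoin words split across lines and collapse repeated whitespace."""
    text = text.replace(HYPHEN_MARKER, '')
    return re.sub(r'\s+', ' ', text).strip()

sample = 'War\uffferen Weaver put forward the first influential machine transla\ufffetion  proposal'
cleaned = clean_pdf_text(sample)
print(cleaned)
```

One caveat: when the line break falls on a genuine hyphen (as in "Sentence-BERT"), dropping the marker also drops the hyphen, so this is a trade-off and not a lossless repair.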

### Building an index for each subsection

In [23]:
%%time

db_for_subsection = {}

for cluster in range(n_sub_sections):
  db_for_subsection[cluster] = FAISS.from_documents(splitted_documents_in_subsection[cluster], embeddings)
  db_for_subsection[cluster].save_local(f"faiss_index_{cluster}")

### Retrieve the most relevant paper chunks for each subsection

For each subsection we have an index with chunks of the full text of every paper. The idea now is to embed the section title as a vector and run a vector search over the chunks, looking for passages related to that title.

The goal is to find, among the papers of the section, the most relevant chunks, regardless of which paper they come from.

Once found, we take a set of these chunks and send them to GPT to be summarized into a single text (the subsection text), with references to each paper.
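
Conceptually, the vector search described above is a nearest-neighbour lookup: embed the title, then keep the k chunk vectors closest to it. The brute-force sketch below assumes Euclidean (L2) distance, where a lower score means a closer chunk; whether the FAISS index built here uses exactly that metric should be checked against the library documentation. `top_k_chunks` and the toy vectors are made up for the illustration.

```python
import heapq
import math

def top_k_chunks(query_vec, chunk_vecs, k):
    """Return the k (distance, index) pairs closest to the query, nearest first."""
    scored = ((math.dist(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs))
    return heapq.nsmallest(k, scored)

# Toy 2-D "embeddings" for four chunks and one query
chunk_vecs = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
hits = top_k_chunks((1.0, 1.0), chunk_vecs, k=2)
print([idx for _, idx in hits])
```

In the notebook, `n_chunks_returned_by_vector_retriever` plays the role of `k`.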

In [24]:
docs_title_per_cluster = {}

for cluster in range(n_sub_sections):
  title_cluster = subsections_per_cluster[cluster]['title']

  # Retrieve the chunks of the articles using the title of the subsection
  # (it is also possible to use the topic generated, but the results are similar)
  docs_title = db_for_subsection[cluster].similarity_search_with_score(title_cluster, n_chunks_returned_by_vector_retriever)

  docs_title_per_cluster[cluster] = docs_title

  print(f'Cluster: {cluster}. Query using title: {title_cluster}. Retrieved chunks: ')
  for doc in docs_title:
    # Remove the title from content and also the ' [SEP] ' string
    print(f"{doc[0].metadata['paperId']}: {doc[0].page_content[len(doc[0].metadata['title'])+7:]}")
  print('.'*50)

Cluster: 0. Query using title: Text-Image Matching for Cross-Modal Retrieval via Graph Neural Network. Retrieved chunks: 
336e531a59cafbe215b950fd749bca866b89cea0: Journal of machine learn￾ing research, 12(Oct):2825–2830. M. Polignano, P. Basile, M. de Gemmis, G. Semer￾aro, and V. Basile. 2019. Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets. In R. Bernardi, R. Navigli, and G. Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org. M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, and I. Russo. 2020.
336e531a59cafbe215b950fd749bca866b89cea0: Nils Reimers and Iryna Gurevych. 2019. Sentence￾BERT: Sentence Embeddings using Siamese BERT￾Networks. arXiv:1908.10084 [cs], August. arXiv: 1908.10084. Nils Reimers and Iryna Gurevych. 2020. Making Monolingual Sentence Embeddings

In [25]:
print(docs_title_per_cluster)

{0: [(Document(page_content='SNK @ DANKMEMES: Leveraging Pretrained Embeddings for Multimodal Meme Detection (short paper) [SEP] Journal of machine learn\ufffeing research, 12(Oct):2825–2830. M. Polignano, P. Basile, M. de Gemmis, G. Semer\ufffearo, and V. Basile. 2019. Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets. In R. Bernardi, R. Navigli, and G. Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org. M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, and I. Russo. 2020.', metadata={'paperId': '336e531a59cafbe215b950fd749bca866b89cea0', 'title': 'SNK @ DANKMEMES: Leveraging Pretrained Embeddings for Multimodal Meme Detection (short paper)', 'venue': 'International Workshop on Evaluation of Natural Language and Speech Tools for Italian', 'year': 2020, 'authors': [

### Use GPT to rerank the chunks returned by Specter

In [26]:
import re

def remove_citation(text):
  # Example: "Here is a [1] citation. Is that nice? Or not? [2]"
  patternIEEE = r"\[\d+\]"
  # "This method (Doe, J., 2020) has been shown to outperform previously discussed methods (Smith, J. et al., 2014) and while it has its draw-backs, it is clear that the benefits outweigh the disadvantages (Jones, A. & Karver, B., 2009, Lubber, H. et al., 2013)."
  patternAPA1 = r"\s\([A-Z][a-z]+,\s[A-Z][a-z]?\.[^\)]*,\s\d{4}\)"

  new_text = re.sub(patternIEEE, '', text)
  new_text = re.sub(patternAPA1, '', new_text)

  return new_text
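
To make the behavior of the two patterns concrete, here is a small check that re-declares the same function and runs it on a few made-up sentences (the sentences are illustrative, not taken from the papers). It suggests that single numeric markers and APA citations with an author initial are removed, while grouped markers such as `[3, 4]` and author-only forms are left in place, and that removing a bracketed marker leaves the surrounding spaces behind.

```python
import re

def remove_citation(text):
    """Same two patterns as the notebook cell above."""
    text = re.sub(r"\[\d+\]", '', text)
    return re.sub(r"\s\([A-Z][a-z]+,\s[A-Z][a-z]?\.[^\)]*,\s\d{4}\)", '', text)

examples = [
    "Dense retrievers [12] encode queries and documents.",
    "This method (Doe, J., 2020) has been shown to work.",
    "Grouped markers [3, 4] are not matched.",
    "Author-only forms (Doe et al., 2020) are not matched.",
]
cleaned = [remove_citation(e) for e in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), '->', repr(after))
```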


In [27]:
# SYSTEM MESSAGE
system_template = "You are a renowned scientist who is writing a section of a survey entitled '{title_subsection}'."
system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

# FIRST HUMAN MESSAGE - EXPLAINING THE TASK
human_template_task = """\
I've found a text that might be useful for your survey. \
Your task is to generate a score for it ranging from 0 to 5 indicating \
its importance to the section that you are writing. \
The score of a text written in a language other than English must be 0. \
You should also explain why you choose this score. \
Your answer MUST be enclosed in a RFC8259 compliant JSON object with two properties, "score" and "reasoning", \
containing the score and the reasoning for it. Remember to enclose the value of the reasoning property in quotes. \
Do not answer with anything besides the JSON object. Do not insert any text before or after the RFC8259 compliant JSON object. \
Use the following format to answer: \n\n\

```\n\
{{\n\
  "score": {{SCORE}},\n\
  "reasoning": {{REASONING}}\n\
}}\n\
```
"""
human_message_prompt_task = HumanMessagePromptTemplate.from_template(human_template_task)

# FIRST AI ANSWER - AGREEING
ai_message_prompt_yes = AIMessagePromptTemplate.from_template('Sure, send me the text and I will give you what you need. I will answer with only a JSON object containing the score and the reasoning.')

# SECOND HUMAN MESSAGE - TEXT TO BE SCORED
human_message_prompt_text = HumanMessagePromptTemplate.from_template('Text: {text}')

In [28]:
if use_chat_model:
  llm_gpt = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name=gpt_model_name)
else:
  llm_gpt = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name=gpt_model_name)

importance_of_chunks_per_cluster = {}

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt,
                                                human_message_prompt_task,
                                                ai_message_prompt_yes,
                                                human_message_prompt_text])
question_chain = LLMChain(llm=llm_gpt, prompt=chat_prompt, verbose=True)

for cluster in range(n_sub_sections):
  importance_of_chunks_per_cluster[cluster] = []

  for doc in docs_title_per_cluster[cluster]:
    text = doc[0].page_content # 0 is the doc, 1 is the score of the retriever
    text = text[len(doc[0].metadata['title'])+7:] # page_content contains title [SEP] content. Remove title and [SEP]
    text = remove_citation(text) # Remove citations in the text

    result = question_chain.run(title_subsection=subsections_per_cluster[cluster]['title'], text=text)
    # Save the scores
    importance_of_chunks_per_cluster[cluster].append(result)
    print(result)



> Entering new  chain...
Prompt after formatting:
System: You are a renowned scientist who is writing a section of a survey entitled 'Text-Image Matching for Cross-Modal Retrieval via Graph Neural Network'.
Human: I've found a text that might be useful for your survey. Your task is to generate a score for it ranging from 0 to 5 indicating its importance to the section that you are writing. The score of a text written in a language other than English must be 0. You should also explain why you choose this score. Your answer MUST be enclosed in a RFC8259 compliant JSON object with two properties, "score" and "reasoning", containing the score and the reasoning for it. Remember to enclose the value of the reasoning property in quotes. Do not answer with anything besides the JSON object. Do not insert any text before or after the RFC8259 compliant JSON object. Use the following format to answer: 


```
{
  "score": {SCORE},
  "reasoning": {REASONING}
}
```

AI: Sure, s

The LLM's answer does not always follow the instruction, so the JSON cannot be parsed directly. The model often places some text before the returned JSON.

We will write a simple parser for these JSON objects, ignoring the results that cannot be parsed.

In [29]:
import json

chunks_per_cluster_and_score = {}

for cluster in range(n_sub_sections):
  chunks_per_cluster_and_score[cluster] = []

  for idx, result_gpt in enumerate(importance_of_chunks_per_cluster[cluster]):
    # Keep only the span from the first '{' to the last '}', dropping any text around the JSON
    open_curly_brackets = result_gpt.find('{')
    close_curly_brackets = result_gpt.rfind('}')+1
    json_str = result_gpt[open_curly_brackets:close_curly_brackets]

    try:
      parsed_json = json.loads(json_str)
      # Each idx is paired with the document at the same position in docs_title_per_cluster
      chunk_document = docs_title_per_cluster[cluster][idx][0]
      chunks_per_cluster_and_score[cluster].append( (chunk_document, parsed_json['score'], parsed_json['reasoning'])  )
    except (json.JSONDecodeError, KeyError):
      # Skip answers that are not valid JSON or lack the expected properties
      continue

  # Sort based on the score of GPT (the score is the second element)
  chunks_per_cluster_and_score[cluster] = sorted(chunks_per_cluster_and_score[cluster], key=lambda x: x[1], reverse=True)
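
As a standalone illustration of the parse-then-sort step, the sketch below extracts the JSON object from a few made-up LLM answers and ranks the ones that parse. The helper name `parse_score_answer` and the sample answers are hypothetical. Coercing the score to `float` is an added assumption (the notebook cell uses the value as returned) so that a score returned as a string can still be compared with numeric ones.

```python
import json

def parse_score_answer(answer):
    """Return (score, reasoning) from an LLM answer, or None if it cannot be parsed."""
    start = answer.find('{')
    end = answer.rfind('}') + 1
    try:
        parsed = json.loads(answer[start:end])
        return float(parsed['score']), str(parsed['reasoning'])
    except (ValueError, KeyError, TypeError):
        # ValueError also covers json.JSONDecodeError
        return None

# Made-up answers: text before the JSON, a string score, and no JSON at all.
answers = [
    'Here is my answer:\n```\n{"score": 4, "reasoning": "Directly on topic."}\n```',
    '{"score": "2", "reasoning": "Only a reference list."}',
    'I cannot score this text.',
]
parsed = [parse_score_answer(a) for a in answers]

# Drop unparseable answers, then sort by score, highest first.
ranked = sorted((p for p in parsed if p is not None), key=lambda p: p[0], reverse=True)
print(ranked)
```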

### Use GPT to write the section text

Here we consider only the first 10 chunks returned.

In [37]:
# SYSTEM MESSAGE
system_template = "You are a renowned scientist who is writing a section of a survey entitled '{title_subsection}'."
system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

# FIRST HUMAN MESSAGE - EXPLAINING THE TASK
human_template_task = """\
Your task is to write the contents of a section of a survey. \
The title of the section that you are writing is '{title_subsection}'. \n\
To complete this task, I will give you a list of documents that should be used as references. \
Each document has a text and an alphanumeric ID. \n\
When writing the section, you MUST follow these rules: \n\
- be aware of plagiarism, i.e., you should not copy the text, but use them as inspiration.\n\
- when using some reference, you must cite it right after its use. You should use the IEEE citing style (write the id of the text \
between square brackets).\n\
- you are writing the paragraphs of the section. You MUST write only this section.\n\
- you MUST NOT split the section in subsections, nor create introduction and conclusion for it.\n\
- DO NOT write any conclusion in any form for the subsection.\n\n\
- DO NOT write a references section.\n\
Do you understand your task?\
"""
human_message_prompt_task = HumanMessagePromptTemplate.from_template(human_template_task)

# FIRST AI ANSWER - AGREEING
ai_message_prompt_yes = AIMessagePromptTemplate.from_template('Sure, send me a list of text and I will write a section about {title_subsection} using them as references. I am aware that I should use the IEEE citing style.')

# SECOND HUMAN MESSAGE - LIST OF DOCUMENTS
def text_message_human_prompt_list_documents(texts):
  message = ''
  for idx, text in enumerate(texts):
    message = message + f"ID: REF{idx}\nText: {text.replace('{', '{{').replace('}', '}}')}\n\n"
  return HumanMessagePromptTemplate.from_template(message)
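
The brace doubling in the function above matters because chunk text taken from papers can contain literal `{` and `}`. The sketch below shows the effect with plain `str.format`, assuming the prompt template is rendered with that style of substitution; `escape_braces` and the sample chunk are hypothetical names used only for illustration.

```python
def escape_braces(text):
    """Double every brace so str.format-style templating treats it as literal text."""
    return text.replace('{', '{{').replace('}', '}}')

chunk = 'The positive set is {x | x > 0}.'

# Without escaping, the braces are read as a placeholder and formatting fails.
try:
    ('Text: ' + chunk).format()
    unescaped_ok = True
except (KeyError, IndexError, ValueError):
    unescaped_ok = False

# With escaping, formatting restores the original braces.
template = 'ID: REF0\nText: ' + escape_braces(chunk)
rendered = template.format()
print(unescaped_ok)
print(rendered)
```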


In [38]:
section_text = {}

for cluster in range(n_sub_sections):
  print(f"Title: {subsections_per_cluster[cluster]['title']}")
  # Extract the first ten texts reranked by the GPT
  texts = []

  list_of_references = '\n\nReferences given to GPT: \n'
  for idx, tuple_doc_score_reasoning in enumerate(chunks_per_cluster_and_score[cluster][0:n_chunks_to_use_as_reference]):
    text = tuple_doc_score_reasoning[0].page_content
    text = text[text.find('[SEP]')+6:] # page_content contains title [SEP] content. Remove title and [SEP]
    text = remove_citation(text)

    texts.append(text)
    list_of_references = list_of_references + f"[REF{idx}] - paperID: {tuple_doc_score_reasoning[0].metadata['paperId']}"
    list_of_references = list_of_references + f"\tTitle: {tuple_doc_score_reasoning[0].metadata['title']}"
    list_of_references = list_of_references + f"\tChunk of text: {text}\n\n"
  message_list_documents = text_message_human_prompt_list_documents(texts)

  # Now, generate the chat messages for this subsection
  chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt,
                                                  human_message_prompt_task,
                                                  ai_message_prompt_yes,
                                                  message_list_documents])

  #chat_messages = chat_prompt.format_prompt(title_subsection=subsections_per_cluster[cluster]['title']).to_messages()
  #print(chat_messages)
  question_chain = LLMChain(llm=llm_gpt, prompt=chat_prompt, verbose=True)
  section_text[cluster] = question_chain.run(title_subsection=subsections_per_cluster[cluster]['title'])
  print(section_text[cluster])
  print(list_of_references)
  print('.'*200)

Title: Text-Image Matching for Cross-Modal Retrieval via Graph Neural Network


> Entering new  chain...
Prompt after formatting:
System: You are a renowned scientist who is writing a survey a section of a survey entitled 'Text-Image Matching for Cross-Modal Retrieval via Graph Neural Network'.
Human: Your task is to write the contents of a section of a survey. The title of the section that you are writing is 'Text-Image Matching for Cross-Modal Retrieval via Graph Neural Network'. 
To complete this task, I will give you a list of documents that should be used as references. Each document has a text and an alphanumeric ID. 
When writing the section, you MUST follow this rules: 
- be aware of plagiarism, i.e., you should not copy the text, but use them as inspiration.
- when using some reference, you must cite it right after its use. You should use the IEEE citing style (write the id of the text between square brackets).
- you are writing the paragraphs of the secti