# Segmentation v1

Build a pipeline that extracts the sentences in a paper most relevant to a user query, then summarizes those sentences.

Step 1: Score the relevance of each sentence in a long text.

- Text to sentences
- Sentences to embedding vectors
- Question to embedding vector
- Dot product between the question embedding and each sentence embedding to measure relatedness
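
The scoring step can be sketched with plain Python and hand-written toy vectors. The vectors and the `dot` helper here are illustrative assumptions, not the `backend` implementation; a real run would use embeddings produced by a sentence-embedding model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy 3-dimensional "embeddings" (assumed values); real sentence
# embeddings have hundreds of dimensions.
sentence_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
question_vector = [0.8, 0.6, 0.0]

# One relatedness score per sentence; a higher score means more related.
scores = [dot(v, question_vector) for v in sentence_vectors]
best = scores.index(max(scores))
print(best, scores)
```

The dot product only behaves like cosine similarity when the vectors have unit length, so it is worth checking whether the chosen embedding model normalizes its output.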

Step 2: Subset by top-k or threshold to get the sentences relevant to a query.

Step 3: Summarize the sentences.

### PubMed data

In [12]:
from datasets import load_dataset
from backend import dot_score, to_embeddings, summarize

dataset = load_dataset("ccdv/pubmed-summarization")
dataset = dataset["train"][4]
dataset

No config specified, defaulting to: pubmed-summarization/section
Found cached dataset pubmed-summarization (/root/.cache/huggingface/datasets/ccdv___pubmed-summarization/section/1.0.0/f765ec606c790e8c5694b226814a13f1974ba4ea98280989edaffb152ded5e2b)



{'article': 'syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . \n clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsions , and urinary incontinence , making it difficult to differentiate from epileptic seizure by clinical features alone . \n significant fluctuations in cerebral perfusion pressure are prevented by autoregulation of cerebral circulation , but there may be conditions where such mechanism may not compensate adequately . \n cough syncope , a rare form of syncope , may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow . \n we present an unusual case of recurrent cough syncope , which was initially diagnosed and treated as seizures , in the context of a left - sided glomus jugulare tumor , a benign paraganglio

In [13]:
question = "What are the symptoms of syncope?"

In [14]:
sentences = dataset["article"].split("\n")
sentences.append(question)  # Last sentence is query
print(len(dataset["article"]))
print("\nExcerpt:")
print(dataset["article"][:300])

6918

Excerpt:
syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . 
 clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsion
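
Splitting on `"\n"` keeps the leading space that follows each newline in this dataset, and it can leave empty strings behind. A minimal sketch of a cleaner split is shown here; `split_sentences` and the sample text are hypothetical and not part of `backend`.

```python
def split_sentences(article):
    """Split on newlines, trim surrounding whitespace, drop empty pieces."""
    return [piece.strip() for piece in article.split("\n") if piece.strip()]


# Shortened stand-in for dataset["article"], with the same " \n " separators.
sample = "syncope is caused by hypoperfusion . \n clinical features may include jerks . \n"
cleaned = split_sentences(sample)
print(cleaned)
```

This still treats each line as one unit, so a line that holds two sentences stays a single item.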


## Step 1: Score the relevance of each sentence in a long text.

In [15]:
embeddings = to_embeddings(sentences, model="sentence-transformers/all-mpnet-base-v2")  # Can use more powerful models
similarity = dot_score(embeddings[:-1], embeddings[-1])  # Last sentence is query

## Step 2: Subset by top-k or threshold to get the sentences relevant to a query.

In [16]:
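def get_by_topk(sentences, similarity, topk=5):
    """Get the top-k sentences by similarity, highest score first."""
    # Clamp k so that documents with fewer than topk sentences do not raise.
    k = min(topk, similarity.shape[0])
    top = similarity.topk(k, dim=0)  # Separate name: do not shadow the topk argument
    return [sentences[i] for i in top.indices.flatten().tolist()]


def get_by_threshold(sentences, similarity, threshold):
    """Get sentences whose similarity exceeds the threshold, in document order."""
    sel = (similarity > threshold).flatten().tolist()
    return [sentence for sentence, keep in zip(sentences, sel) if keep]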


In [17]:
top5_sentences = get_by_topk(sentences, similarity, topk=5)
top5_sentences

[' clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsions , and urinary incontinence , making it difficult to differentiate from epileptic seizure by clinical features alone . ',
 'syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . ',
 ' hypotension accompanying syncope observed in our patient was not orthostasis - related ( the patient was always supine during spells ) and was most likely related to cough . ',
 ' transient cerebral circulatory arrest has been demonstrated by transcranial doppler measurements during cough syncope . our patient also had a drop in blood pressure and heart rate but probably not sufficient to cause syncope by itself . ',
 ' cough syncope , a rare form of syncope , may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decre

In [18]:
relevant_sentences = get_by_threshold(sentences, similarity, 0.3)
relevant_sentences

['syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . ',
 ' clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsions , and urinary incontinence , making it difficult to differentiate from epileptic seizure by clinical features alone . ',
 ' cough syncope , a rare form of syncope , may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow . ',
 ' we present an unusual case of recurrent cough syncope , which was initially diagnosed and treated as seizures , in the context of a left - sided glomus jugulare tumor , a benign paraganglioma . ',
 ' they were triggered by coughing ( usually a bout of cough ) and were characterized by staring and unresponsiveness as well as stiffening of the body with mild shaking of both upper extremit
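
The two outputs above are ordered differently: top-k returns sentences best first, while the threshold filter keeps document order. A pure-Python sketch of the same two selections makes the difference visible; the names `top_k` and `above_threshold` and the scores are made up for illustration.

```python
def top_k(sentences, scores, k):
    """Return the k highest-scoring sentences, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]


def above_threshold(sentences, scores, threshold):
    """Return every sentence scoring above the threshold, in document order."""
    return [s for s, score in zip(sentences, scores) if score > threshold]


toy_sentences = ["s0", "s1", "s2", "s3"]
toy_scores = [0.45, 0.70, 0.10, 0.35]  # assumed values

by_rank = top_k(toy_sentences, toy_scores, k=2)
by_cutoff = above_threshold(toy_sentences, toy_scores, threshold=0.3)
print(by_rank)
print(by_cutoff)
```

Order may matter for Step 3, since the summarizer reads the sentences in the order they are passed.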

- This approach is very local: each sentence is scored in isolation, so it may miss larger concepts that span multiple sentences.
- Consider LongT5, which could let us combine similar local and global features to capture embeddings at multiple levels. DEFER for now.
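
One simple way to give each unit more context, short of switching models, is to score overlapping windows of adjacent sentences instead of single sentences. This is a different technique from the LongT5 idea above and is only a sketch; `sentence_windows` is a hypothetical helper.

```python
def sentence_windows(sentences, size=3, stride=1):
    """Join each run of `size` adjacent sentences into one passage.

    With stride > 1 the last few sentences may fall outside every window.
    """
    if len(sentences) <= size:
        return [" ".join(sentences)]
    return [
        " ".join(sentences[start:start + size])
        for start in range(0, len(sentences) - size + 1, stride)
    ]


windows = sentence_windows(["a .", "b .", "c .", "d ."], size=2)
print(windows)
```

Each window would then be embedded and scored against the question exactly as single sentences are in Step 1.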

## Step 3: Summarize the sentences.

In [8]:
summarize(top5_sentences)



'<pad> cough syncope, a rare form of syncope, may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow.</s>'

In [19]:
summarize(relevant_sentences)



'<pad> cough syncope, a rare form of syncope, may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow. this case highlights the fact that cough syncope, a rare form of syncope, may be associated with intracranial mass lesions that indirectly exaggerate the increase in icp in response to cough.</s>'

In [21]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")

In [22]:
tokenizer.all_special_tokens

['</s>',
 '<unk>',
 '<pad>',
 '<extra_id_0>',
 '<extra_id_1>',
 '<extra_id_2>',
 '<extra_id_3>',
 '<extra_id_4>',
 '<extra_id_5>',
 '<extra_id_6>',
 '<extra_id_7>',
 '<extra_id_8>',
 '<extra_id_9>',
 '<extra_id_10>',
 '<extra_id_11>',
 '<extra_id_12>',
 '<extra_id_13>',
 '<extra_id_14>',
 '<extra_id_15>',
 '<extra_id_16>',
 '<extra_id_17>',
 '<extra_id_18>',
 '<extra_id_19>',
 '<extra_id_20>',
 '<extra_id_21>',
 '<extra_id_22>',
 '<extra_id_23>',
 '<extra_id_24>',
 '<extra_id_25>',
 '<extra_id_26>',
 '<extra_id_27>',
 '<extra_id_28>',
 '<extra_id_29>',
 '<extra_id_30>',
 '<extra_id_31>',
 '<extra_id_32>',
 '<extra_id_33>',
 '<extra_id_34>',
 '<extra_id_35>',
 '<extra_id_36>',
 '<extra_id_37>',
 '<extra_id_38>',
 '<extra_id_39>',
 '<extra_id_40>',
 '<extra_id_41>',
 '<extra_id_42>',
 '<extra_id_43>',
 '<extra_id_44>',
 '<extra_id_45>',
 '<extra_id_46>',
 '<extra_id_47>',
 '<extra_id_48>',
 '<extra_id_49>',
 '<extra_id_50>',
 '<extra_id_51>',
 '<extra_id_52>',
 '<extra_id_53>',
 '<extra_
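
The summaries above still carry `<pad>` and `</s>`, both of which appear in this special-token list. A minimal sketch of removing them after the fact is shown here; `strip_special_tokens` is a hypothetical helper, and how `backend.summarize` decodes its output is not known.

```python
def strip_special_tokens(text, special_tokens=("<pad>", "</s>", "<unk>")):
    """Remove the given special-token strings and trim the result."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()


raw = "<pad> cough syncope is a rare form of syncope.</s>"
clean = strip_special_tokens(raw)
print(clean)
```

If the summarizer decodes with a Hugging Face tokenizer, passing `skip_special_tokens=True` to `tokenizer.decode` should make this post-processing unnecessary.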

In [23]:
from backend import prompt_sum

prompt_sum(''.join(relevant_sentences), question)



'<pad> this case highlights the fact that cough syncope, a rare form of syncope, may be associated with intracranial mass lesions that indirectly exaggerate the increase in icp in response to cough.</s>'

In [25]:
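from transformers import pipeline

# No task is given, so the pipeline infers it from the checkpoint;
# for distilbert-base-uncased that is fill-mask, not summarization.
pipe = pipeline(model='distilbert-base-uncased')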


In [26]:
pipe(''.join(relevant_sentences))

Token indices sequence length is longer than the specified maximum sequence length for this model (802 > 512). Running this sequence through the model will result in indexing errors


PipelineException: No mask_token ([MASK]) found on the input
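
Two separate problems show up here. The exception suggests the pipeline resolved to a fill-mask task, which expects a `[MASK]` token in its input. The warning reports 802 tokens against a 512-token limit. For the length problem, one option is to pack sentences into chunks that stay under a budget. The sketch below assumes a whitespace word count as a rough stand-in for the model's real token count; `chunk_sentences` is a hypothetical helper.

```python
def chunk_sentences(sentences, max_units=512, count=lambda s: len(s.split())):
    """Greedily pack sentences into chunks whose total count stays within max_units.

    A single sentence longer than max_units still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + n > max_units:
            chunks.append(current)
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_sentences(["a b c", "d e", "f g h i", "j"], max_units=5)
print(chunks)
```

In practice `count` would be replaced by a function that measures length with the model's own tokenizer, since word counts usually underestimate token counts.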