# Segmentation v1

Build a pipeline that retrieves the sentences in a paper most relevant to a user query, then summarizes those sentences.

Step 1: Score the relevance of each sentence in a long text.

- Text to sentences
- Sentences to embedding vectors
- Question to embedding vector
- Dot product between the question embedding and each sentence embedding to measure relatedness
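
The relatedness measure in the list above can be illustrated with toy vectors. This is a minimal sketch, assuming the embeddings are L2-normalized so that the dot product behaves like cosine similarity. The three-dimensional vectors and the helper names `normalize` and `dot` are made up for illustration; they are not output from a real model and are not part of `backend`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products behave like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-d "embeddings" (hypothetical values, not from a real model).
question_vec = normalize([1.0, 0.0, 1.0])
sentence_vecs = [
    normalize([1.0, 0.1, 0.9]),  # points in nearly the same direction as the question
    normalize([0.0, 1.0, 0.0]),  # orthogonal to the question
]

# One relatedness score per sentence; higher means more related.
scores = [dot(question_vec, s) for s in sentence_vecs]
```

With unit-length vectors the scores fall between -1 and 1, which makes a fixed threshold in Step 2 easier to interpret.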

Step 2: Select the sentences relevant to the query by top-k or threshold.

Step 3: Summarize the sentences.

### PubMed data

In [1]:
from datasets import load_dataset
from backend import dot_score, to_embeddings, summarize

dataset = load_dataset("ccdv/pubmed-summarization")
dataset = dataset["train"][4]
dataset

No config specified, defaulting to: pubmed-summarization/section
Found cached dataset pubmed-summarization (/root/.cache/huggingface/datasets/ccdv___pubmed-summarization/section/1.0.0/f765ec606c790e8c5694b226814a13f1974ba4ea98280989edaffb152ded5e2b)



{'article': 'syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . \n clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsions , and urinary incontinence , making it difficult to differentiate from epileptic seizure by clinical features alone . \n significant fluctuations in cerebral perfusion pressure are prevented by autoregulation of cerebral circulation , but there may be conditions where such mechanism may not compensate adequately . \n cough syncope , a rare form of syncope , may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow . \n we present an unusual case of recurrent cough syncope , which was initially diagnosed and treated as seizures , in the context of a left - sided glomus jugulare tumor , a benign paraganglio

In [2]:
question = "What are the symptoms of syncope?"

In [3]:
sentences = dataset["article"].split("\n")
sentences.append(question)  # Last sentence is query
print(len(dataset["article"]))
print("\nExcerpt:")
print(dataset["article"][:300])

6918

Excerpt:
syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . 
 clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsion
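
Splitting on `"\n"` keeps the leading space that follows each line break in this dataset (visible in the excerpt above) and would also keep any empty strings. A minimal sketch of a cleanup step, assuming whitespace-only entries should be dropped; `split_sentences` is a hypothetical helper, not part of `backend`, and the sample text is a shortened stand-in for the article:

```python
def split_sentences(article):
    """Split an article on newlines, trimming whitespace and dropping empty entries."""
    return [line.strip() for line in article.split("\n") if line.strip()]


# Shortened, made-up sample in the same "sentence . \n sentence ." layout as the dataset.
sample = "syncope is caused by hypoperfusion . \n clinical features may include jerks . \n"
cleaned = split_sentences(sample)
```

Note that this still treats each line as one sentence, so a line that holds two sentences stays a single entry.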


## Step 1: Score the relevance of each sentence in a long text.

In [4]:
embeddings = to_embeddings(sentences, model="sentence-transformers/all-mpnet-base-v2")  # A stronger embedding model could be used here
similarity = dot_score(embeddings[:-1], embeddings[-1])  # Score every sentence against the query (last entry)

## Step 2: Select the sentences relevant to the query by top-k or threshold.

In [5]:
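def get_by_topk(sentences, similarity, topk=5):
    """Get the top-k sentences by similarity, in descending order of score."""
    k = min(topk, similarity.shape[0])  # Avoid an error when fewer than topk sentences exist
    top = similarity.topk(k, dim=0)  # Separate name so the topk argument is not shadowed
    return [sentences[i] for i in top.indices.flatten().tolist()]


def get_by_threshold(sentences, similarity, threshold):
    """Get the sentences whose similarity exceeds the threshold, in document order."""
    sel = (similarity > threshold).flatten().tolist()
    # zip stops at the shorter input, so the query appended to sentences is never selected
    return [sentence for sentence, keep in zip(sentences, sel) if keep]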
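
`get_by_topk` returns sentences in descending order of similarity, whereas `get_by_threshold` keeps them in document order. If the summarizer benefits from the original ordering (an assumption, not something tested in this notebook), the top-k indices can be sorted before the lookup. The sketch below uses plain Python lists instead of tensors, and `get_by_topk_in_order` is a hypothetical name:

```python
import heapq


def get_by_topk_in_order(sentences, scores, topk=5):
    """Pick the top-k sentences by score, then return them in document order."""
    # Indices of the k highest scores.
    best = heapq.nlargest(topk, range(len(scores)), key=scores.__getitem__)
    # Sorting the indices restores the order in which the sentences appear.
    return [sentences[i] for i in sorted(best)]


toy_sentences = ["s0", "s1", "s2", "s3"]
toy_scores = [0.2, 0.9, 0.1, 0.7]  # made-up similarity values
picked = get_by_topk_in_order(toy_sentences, toy_scores, topk=2)
```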


In [6]:
top5_sentences = get_by_topk(sentences, similarity, topk=5)
top5_sentences

[' clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsions , and urinary incontinence , making it difficult to differentiate from epileptic seizure by clinical features alone . ',
 'syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . ',
 ' hypotension accompanying syncope observed in our patient was not orthostasis - related ( the patient was always supine during spells ) and was most likely related to cough . ',
 ' transient cerebral circulatory arrest has been demonstrated by transcranial doppler measurements during cough syncope . our patient also had a drop in blood pressure and heart rate but probably not sufficient to cause syncope by itself . ',
 ' cough syncope , a rare form of syncope , may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decre

In [7]:
relevant_sentences = get_by_threshold(sentences, similarity, 0.5)
relevant_sentences

['syncope is caused by transient diffuse cerebral hypoperfusion and is characterized by transient loss of consciousness with a rapid onset followed by spontaneous and complete recovery . ',
 ' clinical features of syncope may include myoclonic jerks which are often multifocal and asynchronous , convulsions , and urinary incontinence , making it difficult to differentiate from epileptic seizure by clinical features alone . ',
 ' cough syncope , a rare form of syncope , may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow . ',
 ' transient cerebral circulatory arrest has been demonstrated by transcranial doppler measurements during cough syncope . our patient also had a drop in blood pressure and heart rate but probably not sufficient to cause syncope by itself . ',
 ' hypotension accompanying syncope observed in our patient was not orthostasis - related ( the patient was always supine during spells ) and was mo

- This approach is very local: each sentence is scored on its own, so it may fail to identify larger concepts that span multiple sentences.
- Consider LongT5: similar local and global features could be incorporated to capture multiple levels of embeddings. Deferred for now.
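
One simple way to give each scored unit more context, different from the LongT5 idea noted above and not evaluated here, is to embed overlapping windows of neighboring sentences instead of single sentences. A minimal sketch; `make_windows`, `size`, and `stride` are hypothetical names:

```python
def make_windows(sentences, size=3, stride=1):
    """Join neighboring sentences into overlapping windows of `size` sentences."""
    if not sentences:
        return []
    if len(sentences) <= size:
        # Too few sentences for more than one window.
        return [" ".join(sentences)]
    # With stride > 1, trailing sentences that do not fill a window are left out.
    return [
        " ".join(sentences[start:start + size])
        for start in range(0, len(sentences) - size + 1, stride)
    ]


windows = make_windows(["a .", "b .", "c .", "d ."], size=3, stride=1)
```

Each window could then be embedded and scored against the query in the same way as a single sentence in Step 1.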

## Step 3: Summarize the sentences.

In [8]:
summarize(top5_sentences)



'<pad> cough syncope, a rare form of syncope, may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow.</s>'

In [9]:
summarize(relevant_sentences)

'<pad> cough syncope, a rare form of syncope, may be a result of transient failure of the cerebral autoregulatory mechanism to cope with sudden decrease in cerebral blood flow.</s>'
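
Both summaries still contain the special tokens `<pad>` and `</s>`. A minimal sketch of a post-processing step, assuming these are the only special tokens to remove; `strip_special_tokens` is a hypothetical helper, not part of `backend`, and the input string is a shortened stand-in for the output above. If `summarize` decodes with a Hugging Face tokenizer (not confirmed by this notebook), passing `skip_special_tokens=True` to `decode` would likely be the more direct fix.

```python
def strip_special_tokens(text, tokens=("<pad>", "</s>")):
    """Remove the given special-token strings and trim surrounding whitespace."""
    for token in tokens:
        text = text.replace(token, "")
    return text.strip()


# Shortened stand-in for a raw summary string.
raw = "<pad> cough syncope, a rare form of syncope.</s>"
clean = strip_special_tokens(raw)
```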