# Embedding WikiSection data

This notebook shows how to embed WikiSection sentences and store them as `networkx` graph objects.

In [21]:
%load_ext autoreload
%autoreload 2

import os
import sys
import json
from pathlib import Path

from tqdm.notebook import tqdm

import torch
import numpy as np

from sentence_transformers import SentenceTransformer

sys.path.append("src")  # make the local `utils` module importable before importing from it

from utils import process_sentences
from utils import serialize

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


The dataset is a list of dictionaries, one per Wikipedia page about a disease. Each page carries a list of `annotations`; for a given `sectionHeading`, the matching section is the slice of `text` from character index `begin` to `begin` + `length`.
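As a minimal sketch of the slicing rule described above, the helper below pairs each `sectionHeading` with the substring it annotates. The name `extract_sections` and the toy page are illustrative assumptions; they are not part of the dataset or of the notebook's `utils` module.

```python
def extract_sections(page):
    """Return (sectionHeading, section text) pairs for one page dictionary."""
    sections = []
    for annotation in page["annotations"]:
        begin = int(annotation["begin"])
        end = begin + int(annotation["length"])
        sections.append((annotation["sectionHeading"], page["text"][begin:end]))
    return sections


# Toy page with the same keys as the real data (the contents are made up).
toy_page = {
    "text": "Fever and cough. Caused by bacteria.",
    "annotations": [
        {"begin": 0, "length": 17, "sectionHeading": "Signs and symptoms"},
        {"begin": 17, "length": 19, "sectionHeading": "Cause"},
    ],
}

toy_sections = extract_sections(toy_page)
```

Because each annotation starts where the previous one ends, concatenating the extracted chunks in order should reproduce `text`.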

In [22]:
train_path = "notebook_datasets/wikisection_en_disease_train.json"
with open(train_path, 'r') as f:
    train_data = json.load(f)
    
print(type(train_data),len(train_data)) #List

<class 'list'> 2513


In [23]:
train_data[0].keys()

dict_keys(['id', 'type', 'title', 'abstract', 'text', 'annotations'])

In [24]:
train_data[0]['annotations'][:3]

[{'class': 'SectionAnnotation',
  'begin': 0,
  'length': 715,
  'sectionHeading': 'Signs and symptoms',
  'sectionLabel': 'disease.symptom'},
 {'class': 'SectionAnnotation',
  'begin': 715,
  'length': 503,
  'sectionHeading': 'Cause | Spread',
  'sectionLabel': 'disease.cause'},
 {'class': 'SectionAnnotation',
  'begin': 1218,
  'length': 522,
  'sectionHeading': 'Treatment',
  'sectionLabel': 'disease.treatment'}]

In [25]:
print(train_data[0]['text'][:715])
print(train_data[0]['text'][715:715+503])

The most apparent symptom of pneumonic plague is coughing, often with hemoptysis (coughing up blood). With pneumonic plague, the first signs of illness are fever, headache, weakness and rapidly developing pneumonia with shortness of breath, chest pain, cough and sometimes bloody or watery sputum.
The pneumonia progresses for two to four days and may cause respiratory failure and shock. Patients will die without early treatment, some within 36 hours.
Initial pneumonic plague symptoms can often include the following:
- Fever
- Weakness
- Headaches
- Nausea
Rapidly developing pneumonia with:
- Shortness of breath
- Chest pain
- Cough
- Bloody or watery sputum (saliva and discharge from respiratory passages).

Pneumonic plague can be caused in two ways: primary, which results from the inhalation of aerosolised plague bacteria, or secondary, when septicaemic plague spreads into lung tissue from the bloodstream. Pneumonic plague is not exclusively vector-borne like bubonic plague; instead it

### Data Components

`text` is the document, identified by `wiki_page_index`; it is split into a list of sentences.

We are storing 4 things:

1. `section_pseudo`: the topic label of each sentence in a document, e.g. `[..., 'disease.cause', 'disease.pathophysiology', ...]`
2. `section_labels`: sentence-level binary labels for each document, e.g. `[0, 0, 0, 1, 1, 1, 0, 0, 0]`, where the middle 3 sentences are positive
3. `document_labels`: 1 if `section_labels` is not all 0's, otherwise 0
4. `doc_<wiki_page_index>.npy`: a NumPy array of embeddings, one row per sentence in document `wiki_page_index`
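The relationship between the first three items can be sketched on toy data as follows. The helper name `make_labels` and the toy labels are assumptions made for illustration; the notebook builds the same quantities inline in its main loop.

```python
def make_labels(pseudo_labels, target):
    """Derive sentence-level and document-level binary labels from topic labels."""
    # A sentence is positive when its section topic equals the target class.
    section_labels = [int(label == target) for label in pseudo_labels]
    # A document is positive when at least one of its sentences is positive.
    document_label = int(any(section_labels))
    return section_labels, document_label


toy_pseudo = ["disease.symptom", "disease.genetics", "disease.genetics", "disease.cause"]
toy_section_labels, toy_document_label = make_labels(toy_pseudo, "disease.genetics")
```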

In [26]:
embedding_model = SentenceTransformer(
    'all-MiniLM-L6-v2', 
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
)


In [32]:
save_path = "notebook_datasets/wikisection_processed/"
save_path_Zs = os.path.join(save_path, "train_embeddings")
directory = Path(save_path_Zs)
directory.mkdir(parents=True, exist_ok=True)

# good ones: mechanism, genetics, prevention, prognosis
# bad ones: treatment, symptom

target = "disease.genetics" # class-1

document_labels = {}
section_labels = {}
section_pseudo = {}

for wiki_page_index in tqdm(range(len(train_data)), total=len(train_data)):

    text = train_data[wiki_page_index]["text"]

    annots = train_data[wiki_page_index]["annotations"]

    labels = []
    embeddings = []
    pseudos = []
    
    for annotation in annots:
        
        begin = int(annotation["begin"])
        end = int(annotation["begin"])+int(annotation["length"])
        chunk = text[begin:end]

        sentences = [s for s in chunk.split(".")]
        sentences = process_sentences(sentences)
        num_chunk_sents = len(sentences)

        pseudo = [annotation["sectionLabel"]] * num_chunk_sents
        pseudos.extend(pseudo)
        
        embedded_sentences = [embedding_model.encode(sent) for sent in sentences]
        embeddings.extend(embedded_sentences)

        if annotation["sectionLabel"] == target:
            label = [1] * num_chunk_sents
        else:
            label = [0] * num_chunk_sents

        labels.extend(label)
            
    document_labels[wiki_page_index] = int(np.sum(labels) > 0)
    section_labels[wiki_page_index] = labels
    section_pseudo[wiki_page_index] = pseudos

    Z = np.stack(embeddings)
    Z_name = "doc_"+str(wiki_page_index)+".npy"
    save_path_Z = os.path.join(save_path_Zs, Z_name)
    np.save(save_path_Z, Z)

    # print(len(labels), len(pseudos), Z.shape)
    # break

  0%|          | 0/2513 [00:00<?, ?it/s]

In [33]:
print("# docs:", len(document_labels.values()))

count_0, count_1 = 0, 0
for dv in document_labels.values():
    if dv == 0:
        count_0 += 1
    else:
        count_1 += 1
print("# class-0:", count_0)
print("# class-1:", count_1)
print("number of numpy files:", len(os.listdir(save_path_Zs)))

# docs: 2513
# class-0: 2177
# class-1: 336
number of numpy files: 2513
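With 336 positive documents out of 2513, roughly 13% of the training documents belong to class 1. The same tally can be written more compactly with `collections.Counter`; the sketch below uses a made-up `toy_document_labels` dictionary in place of the real `document_labels`.

```python
from collections import Counter

# Toy stand-in for `document_labels` (document index -> 0/1 label).
toy_document_labels = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}

class_counts = Counter(toy_document_labels.values())
positive_fraction = class_counts[1] / len(toy_document_labels)
```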


In [34]:
print(document_labels[3])
print(section_labels[3])
print(section_pseudo[3])
Z.shape  # note: Z still holds the last document processed by the loop, not document 3

1
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
['disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.symptom', 'disease.genetics', 'disease.genetics', 'disease.genetics', 'disease.genetics', 'disease.cause', 'disease.cause', 'disease.cause', 'disease.cause', 'disease.cause', 'disease.cause', 'disease.cause', 'disease.cause', 'disease.pathophysiology', 'disease.pathophysiology', 'disease.pathophysiology', 'disease.pathophysiology', 'disease.research', 'disease.research', 'disease.research', 'disease.treatment', 'disease.treatment', 'disease.treatment', 'disease.treatment', 'disease.treatment', 'disease.t

(56, 384)

In [35]:
save_path_doc_labs = os.path.join(save_path, "train_document_labels.obj")
serialize(document_labels, save_path_doc_labs)
save_path_sec_pseudo = os.path.join(save_path, "train_section_pseudo.obj")
serialize(section_pseudo, save_path_sec_pseudo)
save_path_sec_labs = os.path.join(save_path, "train_section_labels.obj")
serialize(section_labels, save_path_sec_labs)
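The `serialize` and `deserialize` helpers come from the local `utils` module, and their implementation is not shown in this notebook. As an assumption only, a `pickle`-based pair like the one below would give a similar round trip for plain dictionaries; the names `serialize_sketch` and `deserialize_sketch` are hypothetical.

```python
import os
import pickle
import tempfile


def serialize_sketch(obj, path):
    """Write a Python object to disk with pickle (assumed behaviour, not utils.serialize)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def deserialize_sketch(path):
    """Read back an object written by serialize_sketch."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip on toy labels inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_labels.obj")
    serialize_sketch({0: [0, 1, 1], 1: [0, 0]}, toy_path)
    restored = deserialize_sketch(toy_path)
```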

# Sample embeddings + build map graphs

In [16]:
%load_ext autoreload
%autoreload 2

import os
import sys; sys.path.append("src")
import random
from pathlib import Path
import numpy as np
import networkx as nx

from utils import convert_text2graph, deserialize

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
save_path = "notebook_datasets/wikisection_processed/"
save_path_Gs = os.path.join(save_path, "train_Gs")
directory = Path(save_path_Gs)
directory.mkdir(parents=True, exist_ok=True)

In [36]:
save_path = "notebook_datasets/wikisection_processed/"
save_path_Zs = os.path.join(save_path, "train_embeddings")
save_path_sec_labs = os.path.join(save_path, "train_section_labels.obj")
section_labels_dict = deserialize(save_path_sec_labs)

In [40]:
wiki_page_index = 3
print(len(section_labels_dict[wiki_page_index]))
z = np.load(os.path.join(save_path_Zs,f"doc_{wiki_page_index}.npy"))
z.shape

58


(58, 384)
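The check above confirms for one document that the number of embedding rows matches the number of sentence labels. A sketch of the same check over every document is given below; `find_mismatched_docs` is a hypothetical helper, and the toy files stand in for the real `train_embeddings` directory.

```python
import os
import tempfile

import numpy as np


def find_mismatched_docs(embeddings_dir, section_labels):
    """Return indices of documents whose embedding row count differs from their label count."""
    mismatched = []
    for doc_index, labels in section_labels.items():
        z = np.load(os.path.join(embeddings_dir, f"doc_{doc_index}.npy"))
        if z.shape[0] != len(labels):
            mismatched.append(doc_index)
    return mismatched


# Toy data: doc_0 is consistent, doc_1 has 2 embedding rows but 3 labels.
with tempfile.TemporaryDirectory() as tmp_dir:
    np.save(os.path.join(tmp_dir, "doc_0.npy"), np.zeros((3, 384)))
    np.save(os.path.join(tmp_dir, "doc_1.npy"), np.zeros((2, 384)))
    toy_labels = {0: [0, 1, 1], 1: [0, 0, 0]}
    toy_mismatched = find_mismatched_docs(tmp_dir, toy_labels)
```

On the real data this would be called as `find_mismatched_docs(save_path_Zs, section_labels_dict)`, and an empty list would mean every document is consistent.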