# Explore label examples
This notebook explores the annotations for each individual label to get better insight into the data.

In [4]:
from spacy.tokens import DocBin
import spacy
import random

# Initialize NLP object
nlp = spacy.blank("en")

# Get participants docs from training data
participants_path = "../corpus/ner_train_p.spacy"
participants_docBin = DocBin().from_disk(participants_path)
participants_docs = list(participants_docBin.get_docs(nlp.vocab))

# Get intervention docs from training data
intervention_path = "../corpus/ner_train_i.spacy"
intervention_docBin = DocBin().from_disk(intervention_path)
intervention_docs = list(intervention_docBin.get_docs(nlp.vocab))

# Get outcome docs from training data
outcome_path = "../corpus/ner_train_o.spacy"
outcome_docBin = DocBin().from_disk(outcome_path)
outcome_docs = list(outcome_docBin.get_docs(nlp.vocab))

## Label [Participants]

In [11]:
# Show every participant span whose last token is punctuation (docs visited in random order)
random.shuffle(participants_docs)
for doc in participants_docs:
    for i,ent in enumerate(doc.ents):
        if ent[-1].is_punct:
            print(i,ent,ent.label_)

0 tumors of hematopoietic organs . PARTICIPANTS
2 multiple institutions using a central registration system . PARTICIPANTS
3 ( 55/93 ) PARTICIPANTS
4 ( 44/93 ) PARTICIPANTS
0 metastatic breast cancer : PARTICIPANTS
1 67 patients with systemic breast cancer treated by chemotherapy ; 55 were assessable by UICC criteria and the response index ( 96 % of all UICC assessable patients ) . PARTICIPANTS
0 primary hypercholesterolemia in hypertensive patients treated with hydrochlorothiazide ] PARTICIPANTS
0 resectable locally invasive pancreatic cancer . PARTICIPANTS
2 Patients with pancreatic cancer who met our preoperative criteria for inclusion ( pancreatic cancer invading the pancreatic capsule without involvement of the superior mesenteric artery or the common hepatic artery , or without distant metastasis ) underwent laparotomy . PARTICIPANTS
3 Twenty patients were assigned to the resection group and 22 to the radiochemotherapy group . PARTICIPANTS
0 autism spectrum disorder . PARTICIPANT

0 autism spectrum disorder . PARTICIPANTS
0 children with autism spectrum disorders : PARTICIPANTS
2 children with ASD . PARTICIPANTS
0 children with autism : PARTICIPANTS
0 anorexia nervosa and bulimia nervosa . PARTICIPANTS
1 Eighty patients ( 57 with anorexia nervosa ; 23 with bulimia nervosa ) were first admitted to a specialized unit to restore their weight to normal . PARTICIPANTS
2 patients whose illness was not chronic and had begun before the age of 19 years . PARTICIPANTS
3 older patients . PARTICIPANTS
0 healthy volunteers . PARTICIPANTS
1 included 36 healthy male volunteers . PARTICIPANTS
0 tooth enamel . PARTICIPANTS
1 Two study groups were randomly formed : enamel blocks brushed with ( a ) the Gantrez-NNP combination and ( b ) conventional toothpaste , for 1 minute once daily for 4 weeks , then rinsed with distilled water and placed in thymol solution . PARTICIPANTS
0 patients undergoing ambulatory cosmetic surgery . PARTICIPANTS
1 postoperative nausea and vomiting ( PONV

1 patients with hormone-receptor-positive breast cancer . PARTICIPANTS
2 outpatient clinics and hospitals . We enrolled postmenopausal women with hormone-receptor-positive , locally advanced or metastatic breast cancer previously treated with endocrine treatment . PARTICIPANTS
3 We screened 189 patients and enrolled 156 ( 106 in the ganitumab group and 50 in the placebo group ) . PARTICIPANTS
0 Hawaiian youth and communities . PARTICIPANTS
0 acute extremity pain after emergency department discharge . PARTICIPANTS
1 patients discharged from the emergency department ( ED ) . PARTICIPANTS
0 South African population . PARTICIPANTS
1 women with pregnancy-related low back pain . PARTICIPANTS
2 Fifty women between 16 and 24 weeks of pregnancy were recruited at Tygerberg and Paarl Hospitals , Western Cape , South Africa . Twenty-six women were randomized to a 10-week exercise program and 24 were randomized as controls . PARTICIPANTS
3 pregnancy in South African women with lumbar and pelvic gir

2 ( N = 70 ) PARTICIPANTS
0 postoperative pain relief after total knee arthroplasty . PARTICIPANTS
1 total knee arthroplasty . Patients were randomly enrolled into patient-controlled anesthesia ( PCA ) alone , PCA plus TENS , or PCA plus sham TENS . PARTICIPANTS
1 in nondependent subjects . PARTICIPANTS
2 15 male , experienced , intermittent nontherapeutic drug users . PARTICIPANTS
0 patients with superficial bladder tumors . PARTICIPANTS
2 A total of 301 patients underwent transurethral resection of bladder tumors with white light or fluorescence diagnosis . PARTICIPANTS
0 pediatric patients with irritability associated with autistic disorder . PARTICIPANTS
1 autistic disorder in pediatric patients . PARTICIPANTS
2 patients ( 6-17 years ) who met the current Diagnostic and Statistical Manual of Mental Disorders , Fourth Edition , Text Revision ( DMS-IV-TR ) criteria for autistic disorder and who also had serious behavioral problems ( ie , tantrums , aggression , self-injurious behavio

2 Six Departments of Physical Medicine and Rehabilitation in university-based medical schools . PARTICIPANTS
3 Individuals ( N=123 ) with SCI and major depression between 18 and 64 years of age , at least 1 month post-SCI who also reported pain . PARTICIPANTS
0 post-operative pain . PARTICIPANTS
4 low-grade glioma in 70 patients , who were the focus of the current study . PARTICIPANTS
5 42 males and 28 females ( median age , 7.7 years ) with a median follow-up of 10.4 years . PARTICIPANTS
0 treatment of primary open angle glaucoma . PARTICIPANTS
1 Seventeen children were randomized into the Intervention ( n = 9 ) and Control ( n = 8 ) groups . PARTICIPANTS
0 [ Localised prostate cancer : PARTICIPANTS
1 Sixty chronic pancreatitis patients were compared to 15 healthy controls . PARTICIPANTS
0 in HIV-associated B-cell non-Hodgkin lymphoma . PARTICIPANTS
1 immunocompetent patients with B-cell non-Hodgkin lymphoma ( NHL ) . PARTICIPANTS
2 HIV-associated NHL . PARTICIPANTS
5 HIV-associated l

0 childhood . PARTICIPANTS
1 500 mother-child pairs from a low-income area of S?o Leopoldo , State of Rio Grande do Sul , Brazil , to evaluate the impact of a nutritional intervention in the first year of life on the dietary quality of 3- to 4-y-old children . PARTICIPANTS
3 of children in a low-income population . PARTICIPANTS
0 patients with seasonal allergic rhinitis . PARTICIPANTS
1 seasonal allergic rhinitis due to ragweed . PARTICIPANTS
3 seasonal allergic rhinitis due to ragweed . PARTICIPANTS
0 following cholecystectomy . PARTICIPANTS
1 after cholecystectomy . PARTICIPANTS
0 liver resection with intermittent clamping ( INT ) . PARTICIPANTS
0 obese patients after jejunoileal bypass with 3:1 or 1:3 jejunoileal ratio . PARTICIPANTS
1 Seven patients were studied before bypass surgery and 28 were examined after end-to-side jejunoileal bypass with 50 cm intestine in continuity and a 3:1 or 1:3 ratio between the length of the jejunal and ileal segments . PARTICIPANTS
1 Forty-seven sub
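
The loop above prints every participant span that ends in punctuation. To inspect a fixed-size random sample instead, and to see how common such spans are overall, something like the following sketch could be used. The helper names `punct_final_spans` and `sample_spans` and the `seed` parameter are hypothetical; the code only assumes objects shaped like spaCy docs and spans (`doc.ents`, `len(ent)`, `ent[-1].is_punct`).

```python
import random


def punct_final_spans(docs):
    """Return (all_spans, spans_whose_last_token_is_punctuation)."""
    all_spans = [ent for doc in docs for ent in doc.ents]
    flagged = [ent for ent in all_spans if len(ent) > 0 and ent[-1].is_punct]
    return all_spans, flagged


def sample_spans(spans, k=10, seed=None):
    """Pick up to k spans at random without modifying the input list."""
    rng = random.Random(seed)
    return rng.sample(spans, min(k, len(spans)))


# Possible usage in the notebook (participants_docs comes from the first cell):
# all_spans, flagged = punct_final_spans(participants_docs)
# print(f"{len(flagged)}/{len(all_spans)} spans end in punctuation")
# for ent in sample_spans(flagged, k=10, seed=0):
#     print(ent.text, ent.label_)
```

Sampling from a copy rather than shuffling `participants_docs` in place keeps the document order stable for later cells.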

In [10]:
# Count participant spans by length, measured in non-stop-word tokens
participant_lengths = {}
for doc in participants_docs:
    for ent in doc.ents:
        ent_len = sum(1 for token in ent if not token.is_stop)
        participant_lengths[ent_len] = participant_lengths.get(ent_len, 0) + 1

# Total number of participant spans
participant_sum = sum(participant_lengths.values())

# Number of spans with at most range_max non-stop-word tokens
range_max = 7
range_sum = 0
for i in range(0, range_max + 1):
    # .get avoids a KeyError when no span has exactly i tokens
    range_sum += participant_lengths.get(i, 0)

print(f"0-{range_max}: {range_sum} ({round((range_sum/participant_sum)*100,2)}%)")

# Per-length counts and shares, shortest spans first
for ent_len in sorted(participant_lengths.keys()):
    print(f"{ent_len}: {participant_lengths[ent_len]} ({round((participant_lengths[ent_len]/participant_sum)*100,2)}%)")

0-7: 9479 (69.65%)
0: 3 (0.02%)
1: 604 (4.44%)
2: 1424 (10.46%)
3: 2033 (14.94%)
4: 1914 (14.06%)
5: 1548 (11.37%)
6: 1067 (7.84%)
7: 886 (6.51%)
8: 656 (4.82%)
9: 467 (3.43%)
10: 379 (2.78%)
11: 332 (2.44%)
12: 244 (1.79%)
13: 202 (1.48%)
14: 201 (1.48%)
15: 181 (1.33%)
16: 158 (1.16%)
17: 135 (0.99%)
18: 126 (0.93%)
19: 114 (0.84%)
20: 96 (0.71%)
21: 95 (0.7%)
22: 83 (0.61%)
23: 64 (0.47%)
24: 77 (0.57%)
25: 45 (0.33%)
26: 31 (0.23%)
27: 40 (0.29%)
28: 40 (0.29%)
29: 29 (0.21%)
30: 24 (0.18%)
31: 19 (0.14%)
32: 27 (0.2%)
33: 24 (0.18%)
34: 19 (0.14%)
35: 19 (0.14%)
36: 15 (0.11%)
37: 13 (0.1%)
38: 15 (0.11%)
39: 17 (0.12%)
40: 8 (0.06%)
41: 6 (0.04%)
42: 11 (0.08%)
43: 10 (0.07%)
44: 7 (0.05%)
45: 8 (0.06%)
46: 8 (0.06%)
47: 9 (0.07%)
48: 10 (0.07%)
49: 5 (0.04%)
50: 6 (0.04%)
51: 4 (0.03%)
52: 2 (0.01%)
53: 7 (0.05%)
54: 1 (0.01%)
55: 5 (0.04%)
56: 4 (0.03%)
57: 4 (0.03%)
58: 1 (0.01%)
62: 1 (0.01%)
63: 3 (0.02%)
64: 2 (0.01%)
65: 2 (0.01%)
66: 1 (0.01%)
67: 2 (0.01%)
69: 1 (0.01%)
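
The table answers the question "what share of spans has at most 7 non-stop-word tokens?" (69.65%). The inverse question, which length cutoff is needed to cover a given share of spans, can be answered from the same histogram. The following is a minimal sketch; `coverage_cutoff` is a hypothetical helper and the toy histogram is illustrative, not the notebook's data.

```python
def coverage_cutoff(length_counts, target=0.9):
    """Return the smallest length L such that spans of length <= L
    make up at least `target` (a fraction in (0, 1]) of all spans."""
    total = sum(length_counts.values())
    if total == 0:
        raise ValueError("length_counts is empty")
    running = 0
    for length in sorted(length_counts):
        running += length_counts[length]
        if running / total >= target:
            return length
    # Only reached through floating-point edge cases; fall back to the maximum
    return max(length_counts)


# Toy histogram mapping span length to number of spans
toy = {1: 5, 2: 3, 3: 1, 10: 1}
print(coverage_cutoff(toy, target=0.75))  # 2, since lengths 1-2 cover 8 of 10 spans
```

In the notebook, `coverage_cutoff(participant_lengths, 0.9)` would apply this to the participant histogram built in the cell above.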


## Label [Intervention]

In [15]:
# Show the intervention spans from 15 randomly chosen docs
random.shuffle(intervention_docs)
for doc in intervention_docs[:15]:
    for i,ent in enumerate(doc.ents):
        print(i,ent)

0 inpatient Dialectical Behavior Therapy ( DBT )
1 standard outpatient DBT ,
2 receive 12 weeks of intensified inpatient DBT plus six months of standard DBT ,
0 all-norgestrel
1 dl-norgestrel alone
2 estradiol-17 beta alone
3 combined hormones
4 placebo control
5 dl-norgestrel
6 dl-norgestrel
0 careful estimates
1 public commitment , self-consistency
2 unique causal risk models .
3 risk anchor based on downward social comparison processes
4 comparison anchors
0 New drug trials
0 recombinant human granulocyte-macrophage colony-stimulating factor
1 Recombinant murine GM-CSF administration
2 recombinant human ( rhu ) GM-CSF
3 placebo
4 rhuGM-CSF
5 rhuGM-CSF .
6 rhuGM-CSF
7 rhuGM-CSF .
8 rhuGM-CSF
9 rhuGM-CSF
10 rhuGM-CSF
0 adenosine myocardial perfusion imaging
1 adenosine myocardial perfusion imaging
2 Coronary angiography was conducted within 6 weeks of an adenosine thallium-201 myocardial perfusion imaging study .
3 adenosine thallium-201 myocardial perfusion imaging
0 remote ischemic 
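
The sample shows the same intervention annotated with and without trailing punctuation (for example `rhuGM-CSF` and `rhuGM-CSF .`). A rough way to count such variants together is sketched below. `normalized_span_counts` is a hypothetical helper, the set of stripped characters in `TRAILING` is an assumption, and lowercasing is a design choice that also merges capitalization variants; the code relies only on `doc.ents` and `ent.text`.

```python
from collections import Counter

# Characters stripped from the right end of a span's text (an assumed set)
TRAILING = " .,;:"


def normalized_span_counts(docs):
    """Count span texts, ignoring case and trailing punctuation or whitespace."""
    counts = Counter()
    for doc in docs:
        for ent in doc.ents:
            key = ent.text.rstrip(TRAILING).lower()
            if key:  # skip spans that consist only of stripped characters
                counts[key] += 1
    return counts


# Possible usage in the notebook:
# for text, n in normalized_span_counts(intervention_docs).most_common(20):
#     print(n, text)
```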

## Label [Outcome]

In [2]:
# Show the outcome spans from 15 randomly chosen docs
random.shuffle(outcome_docs)
for doc in outcome_docs[:15]:
    for i,ent in enumerate(doc.ents):
        print(i,ent)

0 proteinuria
1 proteinuria-lowering effect of a renin inhibitor ( aliskiren )
2 reduced proteinuria . These
3 Furthermore , 24-h proteinuria was
4 significantly reduced proteinuria . The antiproteinuric effect is
5 with chronic proteinuric non-diabetic kidney disease .
0 advanced colorectal cancer :
1 5-fluorouracil
2 partial response
3 complete
4 partial response
5 Time to failure
6 median survival time
7 Diarrhea , stomatitis and vomiting
8 nonhematologic toxicities
9 hematologic toxicity was leukopenia ;
10 advanced colorectal cancer
0 language
1 behavioral symptoms
2 assessments of language , behavior , and autism symptomatology .
3 mean scores on any measure of language , behavior , or autism symptom severity
0 incidence of clinical sepsis .
1 incidence of positive blood cultures , necrotising enterocolitis ( NEC ) stage II or III , or death , and the duration of hospital stay .
2 incidence of clinical sepsis
3 Mortality
4 blood cultures
5 incidence of NEC and the duration of hos
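
Several outcome spans in this sample run past a sentence boundary (for example `reduced proteinuria . These`). The sketch below flags such spans with a simple heuristic: a standalone `.` token anywhere before the span's last token. This assumes periods are tokenized separately, as they appear to be in the printed output; decimal numbers such as `7.7` stay in one token and are not flagged. The function names are hypothetical.

```python
def crosses_sentence_boundary(ent):
    """Heuristic: True if a standalone "." token occurs before the span's last token."""
    tokens = [token.text for token in ent]
    return "." in tokens[:-1]


def boundary_crossing_spans(docs):
    """Collect all spans that the heuristic flags."""
    return [ent for doc in docs for ent in doc.ents if crosses_sentence_boundary(ent)]


# Possible usage in the notebook:
# for ent in boundary_crossing_spans(outcome_docs)[:20]:
#     print(ent.text)
```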