# B. First automatic tag detection

## Intro - Importing libraries and datasets

In [45]:
# import of libraries
import pandas as pd
from fuzzywuzzy import fuzz
from tqdm import tqdm

In [2]:
# import of thesaurus
thesaurus = pd.read_csv('data/thesaurus_key_words.csv', encoding="ISO-8859-1", sep=';')
thesaurus.head()

Unnamed: 0,classification_E,catégorie,symptome-fr,symptome-en,CIM_10,CIM11,Orphanet
0,E1,Période néonatale,Encéphalopathie myoclonique précoce,Benign familial neonatal epilepsy (BFNE),G40.8,8A61.0Y,1935.0
1,E2,Période néonatale,Epilepsie néonatale familiale bénigne (BFNE),Early myoclonic encephalopathy (EME),G40.8,8A61.10,1949.0
2,E3,Période néonatale,Syndrome d'ohtahara,Ohtahara syndrome,G40.8,8A62.Y,1934.0
3,E31,Nourrisons,Encépahlopathie myoclonique des affections non...,Myoclonic encephalopathy in nonprogressive dis...,G40.4,8A62.Y,86913.0
4,E33,Nourrisons,Epilepsie benigne du nourisson,Benign infantile epilepsy,G40.3,8A61.1Z,166302.0


In [3]:
# import of classification dataset
classification_dataset = pd.read_csv('data/classification_dataset.csv')
classification_dataset.head()

Unnamed: 0,filepath,report
0,CR_Patients_info_patients-v0_4/edf/dev/01_tcp_...,Description: 2.5 to 5 hz spike/wave and polys...
1,CR_Patients_info_patients-v0_4/edf/dev/01_tcp_...,LENGTH OF THE RECORDING: 22 minutes and 53 s...
2,CR_Patients_info_patients-v0_4/edf/dev/01_tcp_...,"MEDICATIONS: Vimpat, Norvasc, Felbamate, Car..."
3,CR_Patients_info_patients-v0_4/edf/dev/01_tcp_...,CLINICAL HISTORY: 27 year old gentleman with...
4,CR_Patients_info_patients-v0_4/edf/dev/01_tcp_...,"MEDICATIONS: Vimpat, Norvasc, Felbamate, Car..."


# I - Working with Levenshtein distance on full text

## A - Using partial ratio on full text

In [79]:

%%time
# Compute the partial_ratio of every thesaurus term against each report and store it in a results dataset

for i in tqdm(list(thesaurus['symptome-en'])):
    classification_dataset[i] = classification_dataset['report'].apply(lambda x: fuzz.partial_ratio(x, i)) 

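# Note: if the dataset only holds 'filepath' and 'report' before the loop, columns[4:] leaves out the first two terms; columns[2:] would keep them
df_results = pd.DataFrame(data=classification_dataset.columns[4:], columns=['target'])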
df_results['ratio'] = df_results['target'].apply(lambda x: max(classification_dataset[x]))

# What can we predict at best?
df_results.sort_values(by='ratio', ascending=False)

100%|██████████| 46/46 [00:47<00:00,  1.03s/it]CPU times: user 47 s, sys: 46.9 ms, total: 47.1 s
Wall time: 47.3 s



Unnamed: 0,target,ratio
13,Lennox-Gastaut syndrome,100
16,Epilepsy with generalized tonicclonic seizure...,81
7,West syndrome,77
31,temporal epilepsy,76
23,central epilepsy,75
26,frontal epilepsy,75
27,insular epilepsy,75
6,Dravet syndrome,73
2,Benign infantile epilepsy,72
30,parietal epilepsy,71


It looks like we have "honest" results, but in reality, apart from Lennox-Gastaut syndrome, it does not really work. For example, for temporal epilepsy the ratio is high thanks to the word "epilepsy" alone.
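To see why a generic word such as "epilepsy" is enough to inflate the score, here is a minimal sketch of a partial-ratio-style measure. It relies on the standard library's `difflib.SequenceMatcher` as a stand-in and is not fuzzywuzzy's actual implementation; `window_partial_ratio` and the sample report are hypothetical names and data used only for illustration.

```python
from difflib import SequenceMatcher

def window_partial_ratio(text, target):
    """Best 0-100 similarity between `target` and any same-length window of `text`.

    Simplified stand-in for a partial ratio, not fuzzywuzzy's code.
    """
    if len(text) <= len(target):
        return round(100 * SequenceMatcher(None, text, target, autojunk=False).ratio())
    best = 0.0
    for start in range(len(text) - len(target) + 1):
        window = text[start:start + len(target)]
        best = max(best, SequenceMatcher(None, window, target, autojunk=False).ratio())
    return round(100 * best)

# Hypothetical report: it mentions epilepsy, but not temporal epilepsy
report = "history of epilepsy with left frontal spikes"
score_exact = window_partial_ratio(report, "epilepsy")
score_partial = window_partial_ratio(report, "temporal epilepsy")
print(score_exact, score_partial)
```

The second score stays well above 50 even though the report never mentions temporal epilepsy, because the shared word "epilepsy" covers about half of the target.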

In [133]:
# For a target, output the related reports sorted by partial_ratio

def research_similarity(target):
    df = pd.DataFrame(classification_dataset['report'])
    df['partial_ratio'] = df['report'].apply(lambda x: fuzz.partial_ratio(x, target))
    df = df.sort_values(by='partial_ratio', ascending=False)
    return df

' CLINICAL HISTORY: This is a 29-year-old, mentally retarded female with seizures since age 7, after removal of an occipital tumor.  Last seizure one week ago. MEDICATIONS: Dilantin. INTRODUCTION: Digital video EEG is performed in the lab using standard 10-20 system of electrode placement with one channel of EKG. Hyperventilation and photic stimulation were not performed.  This is an awake and light sleep record. DESCRIPTION OF THE RECORD: The background EEG is diffusely slow with theta activity and disorganized.  A rare, slow alpha rhythm of 7 Hz, 20 V is seen. Low voltage, frontocentral beta activity is also seen.  Soon after the EEG starts, bursts of 2 to 4 Hz frontally predominant sharp and slow wave complexes lasting for one to two seconds are seen.  In addition, frequent right focal spike and wave complexes are seen in the right hemisphere, mostly in the temporal region, but also in the occipital area.  This focal spike and wave activity occasionally generalized. Drowsiness is ch

In [322]:
## B - Using token_sort_ratio on full text

In [318]:

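# Compute the token_sort_ratio of every thesaurus term against each report and store it in a results dataset
# Note: `%time` on its own line times an empty statement; `%%time` as the first line of the cell would time the whole cell

%time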

for i in tqdm(list(thesaurus['symptome-en'])):
    classification_dataset[i] = classification_dataset['report'].apply(lambda x: fuzz.token_sort_ratio(x, i)) 

df_results = pd.DataFrame(data=classification_dataset.columns[4:], columns=['target'])
df_results['ratio'] = df_results['target'].apply(lambda x: (classification_dataset[x]).max())

# What can we predict at best?
df_results.sort_values(by='ratio', ascending=False)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.3 µs


Unnamed: 0,target,ratio
37,Other location,50
38,Unkown location,48
8,Epilepsy with myoclonic atonic (previously ast...,46
3,Epilepsy of infancy with migrating focal seizures,44
18,Autosomal dominant epilepsy with auditory feat...,43
33,Mesial temporal lobe epilepsy with hippocampal...,41
34,Mesial temporal lobe epilepsy without hippoca...,40
16,Epilepsy with generalized tonicclonic seizure...,40
20,Gelastic seizures with hypothalamic hamartoma,39
11,Epilepsy with myoclonic absences,37


This method does not really work: even the best target only reaches a ratio of 50.
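A likely reason is that a token-sort ratio compares the two whole strings, so a short term can never score high against a long report. The sketch below illustrates this with `difflib.SequenceMatcher`; it is a simplified approximation rather than fuzzywuzzy's code, and `token_sort_like_ratio`, `length_upper_bound` and the sample report are hypothetical.

```python
from difflib import SequenceMatcher

def sorted_tokens(text):
    """Lowercase the text and sort its whitespace-separated tokens."""
    return " ".join(sorted(text.lower().split()))

def token_sort_like_ratio(a, b):
    """0-100 similarity of the two token-sorted strings (whole-string comparison)."""
    sa, sb = sorted_tokens(a), sorted_tokens(b)
    return round(100 * SequenceMatcher(None, sa, sb, autojunk=False).ratio())

def length_upper_bound(a, b):
    """Highest score reachable given only the two lengths: 200 * min / (len_a + len_b)."""
    sa, sb = sorted_tokens(a), sorted_tokens(b)
    return 200 * min(len(sa), len(sb)) / (len(sa) + len(sb))

# Hypothetical long report that does contain the target term
report = "CLINICAL HISTORY: seizures since age 7. " * 5 + "Lennox-Gastaut syndrome suspected."
target = "Lennox-Gastaut syndrome"
score = token_sort_like_ratio(report, target)
bound = length_upper_bound(report, target)
print(score, round(bound, 1))
```

Even with the term present in the report, the score cannot exceed the length-based bound, which shrinks as the report gets longer.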

## C - Example Cases

### 1 - Focus on Lennox-Gastaut

In [137]:
# Looking for the report indices where the partial ratio is high
research_similarity('Lennox-Gastaut')['partial_ratio'].head(30)

838     100
117     100
45      100
46      100
47      100
48      100
49      100
51      100
1258    100
843     100
220     100
1315    100
372     100
44      100
1107    100
276     100
1302    100
227     100
1211    100
547     100
1167    100
684     100
1269    100
597      50
596      50
595      50
594      50
593      50
598      50
809      43
Name: partial_ratio, dtype: int64

In [140]:
# Displaying the report at index 227 (partial ratio of 100)
research_similarity('Lennox-Gastaut').report[227]

' CLINICAL HISTORY: This is a 27-year-old male with a history of severe MR, multiple medical problems with multiple brief seizures per month.  Seizures characterized by generalized shaking lasting 20 seconds. MEDICATIONS: Lamictal, Tegretol, Tranxene, and many others. INTRODUCTION: Digital video EEG is performed in the lab using standard 10-20 system of electrode placement with one channel of EKG. The patient is drowsy or somnolent. Photic stimulation is performed. DESCRIPTION OF THE RECORD: The background EEG is markedly abnormal and is primarily a mixture of rhythmic 3 Hz activity with smaller amounts of 2 Hz activity and some 4 to 5 Hz theta.  There are multifocal spike and slow wave complexes identified in the record including bifrontal, high amplitude spike and slow wave complexes with an approximately 2 Hz after going slow wave.  Focal epileptiform activity is also seen in the occipital regions, sometimes maximum at O2 and at other times with a poly spike wave component at O1-O2.

It works when the partial ratio is 100.

In [None]:
# Displaying the report at index 597 (partial ratio of 50)
research_similarity('Lennox-Gastaut').report[597]

It does not work when the partial ratio is 50.

### 2 - Focus on temporal epilepsy

In [154]:
research_similarity('temporal epilepsy')

Unnamed: 0,report,partial_ratio
377,EEG REMARKS: 7 L temporal Spikes but seems se...,76
191,CLINICAL HISTORY: 40 year old right handed ma...,71
194,CLINICAL HISTORY: 40 year old right handed ma...,71
1133,CLINICAL HISTORY: \tForty-seven-year-old male...,71
523,HISTORY: A 62-year-old woman with adult-onse...,71
...,...,...
145,"CLINICAL HISTORY: A 25-year-old man, with hi...",0
426,REASON FOR STUDY: Seizures. CLINICAL HISTORY...,0
118,CLINICAL HISTORY: A 35-year-old woman with c...,0
688,REASON FOR STUDY: Change in mental status. C...,0


In [155]:
research_similarity('temporal epilepsy')['report'][191]

' CLINICAL HISTORY: 40 year old right handed male with encephalitis and recurrent seizures. MEDICATIONS: Lacosamide, dilantin, Ativan, Klonopin INTRODUCTION: Continuous digital video EEG monitoring was performed at bedside using standard 10-20 system of electrode placement with 1 channel of EKG. As this section of the records begins, the patient reports "he is feeling great" as if he is not having more seizures. Then subsequently he has 2 events that he describes as auras, which are seizures with impairment of awareness. He does have occasional myoclonic jerks. DESCRIPTION OF THE RECORD: This section of the 24-hour period includes more of the rhythmic repetitive slowing than noted at other times. Isolated high amplitude right hemispheric spike and wave activity is observed. Push button times include 5:20 which includes actually a seizure. Although the patient describes this as an aura, it is really a focal motor seizure with loss of axial tone and stiffening of the right leg. The patie

No trace of temporal epilepsy: it just does not work out!

### 3 - Other research

Let's try to search for the medications associated with one Lennox-Gastaut syndrome report: maybe we can find other occurrences?

In [171]:
research_similarity('Keppra Ativan famotidine Lovenox topiramate Flagyl Depakote').head(30)

Unnamed: 0,report,partial_ratio
44,DURATION OF STUDY: Study date 03/26/2013 thr...,90
51,DURATION OF STUDY: Study date 03/26/2013 thr...,90
48,REASON FOR STUDY: Seizures. CLINICAL HISTORY...,90
47,REASON FOR STUDY: Seizures. CLINICAL HISTORY...,90
46,DURATION OF STUDY: Study date 03/26/2013 thr...,90
45,REASON FOR STUDY: Seizures. CLINICAL HISTORY...,90
52,REASON FOR STUDY: Seizures. CLINICAL HISTORY...,54
652,CLINICAL HISTORY: 60 year old right handed fe...,53
1071,CLINICAL HISTORY: 60 year old right handed fe...,53
1070,CLINICAL HISTORY: 60 year old right handed fe...,53


The analysis shows that it does not really work.
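One possible alternative, not used in this notebook, is to compare medications as sets, so that the order in which they are listed no longer matters. The sketch below assumes reports follow the `MEDICATIONS: ... NEXT HEADER:` layout seen above; `medication_set`, `jaccard` and the sample report are hypothetical.

```python
import re

def medication_set(report):
    """Extract the comma-separated medications listed after 'MEDICATIONS:'.

    Assumes the field ends at the next upper-case header such as 'INTRODUCTION:'.
    """
    match = re.search(r"MEDICATIONS:\s*(.*?)(?=\s+[A-Z][A-Z ]{2,}:|$)", report, flags=re.S)
    if not match:
        return set()
    names = (name.strip(" .").lower() for name in re.split(r"[,;]", match.group(1)))
    return {name for name in names if name}

def jaccard(a, b):
    """Share of medications common to both sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical report and query
sample_report = ("CLINICAL HISTORY: Seizures. MEDICATIONS: Keppra, Ativan, Depakote. "
                 "INTRODUCTION: Digital video EEG.")
query = {"keppra", "depakote", "topiramate"}
overlap = jaccard(medication_set(sample_report), query)
print(medication_set(sample_report), overlap)
```

Whether such an overlap score would actually help find other Lennox-Gastaut reports has not been tested here.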

# II - Working with Levenshtein distance on each sentence of a text

Empirical tests have shown that precision can be higher when matching on individual sentences rather than on the full text. Let's check how well it works!

In [320]:
# Sentences of 5 characters or fewer are skipped: very short strings naturally get a high ratio
def partial_ratio_by_sentence(text, target):
    best = 0  # avoid shadowing the built-in max()
    for sentence in text.split('.'):
        if len(sentence) > 5:
            # compute the ratio once per sentence and keep the highest one
            best = max(best, fuzz.partial_ratio(sentence, target))
    return best

# For a target, output the related reports sorted by partial_ratio

def research_similarity_by_sentence(target):
    df = pd.DataFrame(classification_dataset['report'])
    df['partial_ratio'] = df['report'].apply(lambda x: partial_ratio_by_sentence(x, target))
    df = df.sort_values(by='partial_ratio', ascending=False)
    return df
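Splitting on every `.` also cuts decimals such as "2.5 to 5 hz" into separate pieces. A possible refinement, sketched here under the assumption that sentences end with punctuation followed by whitespace, is a regex-based splitter; `split_sentences` and the sample text are hypothetical and not part of the notebook.

```python
import re

def split_sentences(text, min_length=5):
    """Split after '.', '!' or '?' only when followed by whitespace, and drop very short pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if len(part) > min_length]

sample = "Description: 2.5 to 5 hz spike/wave. No seizures. Ok."
naive = sample.split('.')          # cuts "2.5" in two
refined = split_sentences(sample)  # keeps "2.5" intact
print(naive)
print(refined)
```

The refined splitter could replace `text.split('.')` inside `partial_ratio_by_sentence`, although the scores would then need to be recomputed.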


In [321]:

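# Compute the sentence-level partial_ratio of every thesaurus term against each report and store it in a results dataset
%time

# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
classification_dataset_by_sentence = classification_dataset[['filepath', 'report']].copy()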

for i in tqdm(list(thesaurus['symptome-en'])):
    print(i)
    classification_dataset_by_sentence[i] = classification_dataset_by_sentence['report'].apply(lambda x: partial_ratio_by_sentence(x, i)) 

df_results_by_sentence = pd.DataFrame(data=classification_dataset_by_sentence.columns[2:], columns=['target'])
df_results_by_sentence['ratio'] = df_results_by_sentence['target'].apply(lambda x: classification_dataset_by_sentence[x].max())

# What can we predict at best?
df_results_by_sentence.sort_values(by='ratio', ascending=False)

0%|          | 0/46 [00:00<?, ?it/s]CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 8.82 µs
Benign familial neonatal epilepsy (BFNE)
  2%|▏         | 1/46 [00:42<31:54, 42.55s/it]Early myoclonic encephalopathy (EME)
  4%|▍         | 2/46 [01:13<28:32, 38.93s/it]Ohtahara syndrome
  7%|▋         | 3/46 [01:26<22:28, 31.36s/it]Myoclonic encephalopathy in nonprogressive disorders
  9%|▊         | 4/46 [02:23<27:16, 38.96s/it]Benign infantile epilepsy
 11%|█         | 5/46 [02:45<23:10, 33.91s/it]Epilepsy of infancy with migrating focal seizures
 13%|█▎        | 6/46 [03:35<25:46, 38.66s/it]Benign familial infantile epilepsy
 15%|█▌        | 7/46 [04:07<23:57, 36.87s/it]Myoclonic epilepsy in infancy (MEI)
 17%|█▋        | 8/46 [04:38<22:03, 34.84s/it]Dravet syndrome
 20%|█▉        | 9/46 [04:50<17:15, 27.99s/it]West syndrome
 22%|██▏       | 10/46 [05:00<13:37, 22.71s/it]Epilepsy with myoclonic atonic (previously astatic) seizures
 24%|██▍       | 11/46 [06:04<20:29, 35.14s/it]Late 

Unnamed: 0,target,ratio
33,temporal epilepsy,100
15,Lennox-Gastaut syndrome,100
18,Epilepsy with generalized tonicclonic seizure...,91
32,parietal epilepsy,91
30,multifocal epilepsy,89
31,occipital epilepsy,89
10,Epilepsy with myoclonic atonic (previously ast...,88
25,central epilepsy,88
22,Gelastic seizures with hypothalamic hamartoma,88
34,external temporal epilepsy,88


In [323]:
df_results_by_sentence.to_csv('df_results_by_sentence.csv')

# III - Using a simplified Thesaurus

In [365]:
# Loading simplified Thesaurus
thesaurus_simplified = pd.read_csv('data/thesaurus_key_words - simplified.csv', encoding="ISO-8859-1", sep=';')

In [366]:
thesaurus_simplified.head()

Unnamed: 0,classification_E,catégorie,symptome-fr,symptome-en,symptome-en-simple,Comments,CIM_10,CIM11,Orphanet
0,E1,Période néonatale,Encéphalopathie myoclonique précoce,Benign familial neonatal epilepsy (BFNE),BFNE,"Too much common words, keeping acronym",G40.8,8A61.0Y,1935.0
1,E2,Période néonatale,Epilepsie néonatale familiale bénigne (BFNE),Early myoclonic encephalopathy (EME),EME,"Too much common words, keeping acronym",G40.8,8A61.10,1949.0
2,E3,Période néonatale,Syndrome d'ohtahara,Ohtahara syndrome,Ohtahara,syndrome is too common,G40.8,8A62.Y,1934.0
3,E31,Nourrisons,Encépahlopathie myoclonique des affections non...,Myoclonic encephalopathy in nonprogressive dis...,Myoclonic encephalopathy,simplification,G40.4,8A62.Y,86913.0
4,E33,Nourrisons,Epilepsie benigne du nourisson,Benign infantile epilepsy,infantile,generalisation,G40.3,8A61.1Z,166302.0


In [486]:

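# Compute the sentence-level partial_ratio of every simplified thesaurus term and store it in a results dataset

%time
# Keep the first two columns; .copy() avoids pandas' SettingWithCopyWarning when columns are added below
classification_dataset_simple = classification_dataset.iloc[:, 0:2].copy()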

for i in tqdm(list(thesaurus_simplified['symptome-en-simple'].unique())):
    classification_dataset_simple[i] = classification_dataset_simple['report'].apply(lambda x: partial_ratio_by_sentence(x, i)) 

df_results_simple = pd.DataFrame(data=classification_dataset_simple.columns[2:], columns=['target'])
df_results_simple['ratio'] = df_results_simple['target'].apply(lambda x: (classification_dataset_simple[x]).max())

# What can we predict at best?
df_results_simple = df_results_simple.sort_values(by='ratio', ascending=False)
df_results_simple.to_csv('df_results_simple.csv')
df_results_simple

0%|          | 0/36 [00:00<?, ?it/s]CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs
100%|██████████| 36/36 [06:58<00:00, 11.62s/it]


Unnamed: 0,target,ratio
18,tonic-clonic,100
11,Gastaut type,100
32,parietal,100
31,occipital,100
28,frontal,100
25,central,100
24,Rasmussen,100
1,EME,100
17,temporal lobe,100
15,Lennox-Gastaut,100


In [487]:
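# Defining the threshold: no error allowed up to 6 characters,
# 2 errors for a longer single word, 4 errors for a term containing a space

df_results_simple['threshold'] = df_results_simple['target'].apply(
    lambda x: 100 if len(x) <= 6
    else (len(x) - 2) * 100 / len(x) if " " not in x
    else (len(x) - 4) * 100 / len(x)
)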
df_results_simple['correspondance'] = df_results_simple['ratio'] >= df_results_simple['threshold']
df_results_simple = df_results_simple.sort_values(by=['correspondance', 'threshold'], ascending=False)
df_results_simple

Unnamed: 0,target,ratio,threshold,correspondance
1,EME,100,100.0,True
15,Lennox-Gastaut,100,85.714286,True
18,tonic-clonic,100,83.333333,True
3,Myoclonic encephalopathy,84,83.333333,True
30,mutlifocal,90,80.0,True
31,occipital,100,77.777778,True
24,Rasmussen,100,77.777778,True
35,temporal occipital,100,77.777778,True
4,infantile,100,77.777778,True
13,myoclonic absences,78,77.777778,True


More targets now reach a ratio of 100! We now set a threshold score to decide which results to keep. For a term of n characters in which e characters are allowed to differ, the threshold is 100 × (n − e) / n:
- For terms of 6 characters or fewer, no error is allowed (e.g. acronyms).
- For single words longer than 6 characters, we allow 2 errors.
- For search terms containing a space (two words), we allow 4 errors (2 per word).
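The rules above can be written as a small standalone function. This is a sketch equivalent to the lambda used in the notebook; the name `match_threshold` is hypothetical.

```python
def match_threshold(term):
    """Minimum score (0-100) for a term of n characters with e tolerated errors: 100 * (n - e) / n."""
    n = len(term)
    if n <= 6:
        allowed_errors = 0   # short terms and acronyms must match exactly
    elif " " in term:
        allowed_errors = 4   # terms containing a space: 2 errors per word
    else:
        allowed_errors = 2   # longer single words
    return 100 * (n - allowed_errors) / n

for term in ["EME", "Lennox-Gastaut", "Myoclonic encephalopathy", "temporal occipital"]:
    print(term, round(match_threshold(term), 2))
```

For "Lennox-Gastaut" (14 characters, no space) this gives 100 × 12 / 14 ≈ 85.71, which matches the threshold column shown above.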

In [488]:
# Work on a copy so that the scores in classification_dataset_simple are not overwritten
classification_dataset_hashtag = classification_dataset_simple.copy()
for i in tqdm(classification_dataset_hashtag.columns[2:]):
    threshold = df_results_simple[df_results_simple['target'] == i]['threshold'].iloc[0]
    print(threshold)
    # 1 if the score reaches the threshold of the term, 0 otherwise
    classification_dataset_hashtag[i] = (classification_dataset_hashtag[i] >= threshold).astype(int)

0%|          | 0/36 [00:00<?, ?it/s]100.0
  3%|▎         | 1/36 [00:01<00:42,  1.20s/it]100.0
  6%|▌         | 2/36 [00:02<00:39,  1.16s/it]75.0
  8%|▊         | 3/36 [00:03<00:37,  1.15s/it]83.33333333333333
 11%|█         | 4/36 [00:04<00:36,  1.15s/it]77.77777777777777
 14%|█▍        | 5/36 [00:05<00:36,  1.17s/it]73.33333333333333
 17%|█▋        | 6/36 [00:06<00:35,  1.17s/it]100.0
 19%|█▉        | 7/36 [00:07<00:33,  1.14s/it]100.0
 22%|██▏       | 8/36 [00:09<00:31,  1.12s/it]100.0
 25%|██▌       | 9/36 [00:10<00:30,  1.13s/it]100.0
 28%|██▊       | 10/36 [00:11<00:29,  1.12s/it]75.0
 31%|███       | 11/36 [00:12<00:27,  1.12s/it]66.66666666666667
 33%|███▎      | 12/36 [00:13<00:26,  1.10s/it]100.0
 36%|███▌      | 13/36 [00:14<00:26,  1.13s/it]77.77777777777777
 39%|███▉      | 14/36 [00:15<00:24,  1.13s/it]86.66666666666667
 42%|████▏     | 15/36 [00:16<00:23,  1.11s/it]85.71428571428571
 44%|████▍     | 16/36 [00:17<00:21,  1.10s/it]86.66666666666667
 47%|████▋     | 17/36 [0

In [489]:
classification_dataset_hashtag.iloc[:,2:].sum().sort_values(ascending=False)

temporal lobe               620
parietal                    571
frontal                     558
central                     538
occipital                   258
tonic-clonic                140
external temporal            56
temporal occipital           44
Gastaut type                 32
mutlifocal                   26
Lennox-Gastaut               23
Gelastic                     21
infantile                    11
myoclonic atonic             10
Rasmussen                     8
insular                       7
myoclonic absences            6
EME                           5
MTLE with HS                  4
Myoclonic encephalopathy      1
migrating focal               1
Ohtahara                      0
West                          0
Unkown                        0
MEI                           0
Dravet                        0
Temporoparietal junction      0
CAE                           0
supplementary motor area      0
Landau-Kleffner               0
Panayiotopoulos               0
jAE     

In [490]:
for report in classification_dataset_hashtag[classification_dataset_hashtag['Rasmussen']==1]['report']:
    display(report)

" CLINICAL HISTORY:  Rasmussen's encephalitis with breakthrough seizures. MEDICATIONS:  Keppra, IVIG, phenobarbital, Klonopin, others. INTRODUCTION:  Continuous video EEG monitoring is performed in the unit.  During a section of the record, the patient has approximately 40 simple partial seizures, all characterized by involuntary movements on the right.  Other seizures can occur out of sleep, but in this 24-hour section almost all the seizures seem to wake him up and are associated with right-sided shaking. The seizures have variable patterns, but all localize to the left hemisphere.  Some seem to start with a beta buzz in the left central region, others with more higher amplitude spike and wave activity.  The interictal activity includes a pattern with excess beta and theta from the right hemisphere.  The left hemisphere demonstrates __________ delta and the epileptiform activity interictally is more of a polyspike activity in the left posterior temporal region or central parietal reg

' CLINICAL HISTORY:  41 year old right handed male with Rasmussenâ\x80\x99s encephalitis with increasing seizures. MEDICATIONS:  Topiramate, Lacosamide, Phenobarbital, Klonopin, Lipitor, Pantoprazole, Lisinopril INTRODUCTION:  Digital video EEG was performed in lab using standard 10-20 system of electrode placement with channel of EKG.  Photic stimulation was performed. DESCRIPTION OF THE RECORD:  The background EEG is markedly abnormal.  As the record begins, the activity includes a prominent interhemispheric asymmetry.  It is medium amplitude, but slow, primarily theta on the right with some occasional posterior delta.  From there left there is clearly a breach with a high amplitude spike and slow-wave complex at T3 and T5.  It is also picked up at C3/P3. The first seizure occurs within 1 minute with a burst of 14 Hz activity emanating from the left frontal region with frequency evolution.  This is over 4 minutes and 35 seconds into the EEG.  Additional seizure occurs at 4 minutes an

' CLINICAL HISTORY:  Rasmussen encephalitis. MEDICATIONS:  Vimpat, Topamax, phenobarbital, IVIG, and Solu-Medrol. INTRODUCTION:  Continuous video EEG monitoring is performed for this individual.  He has many seizures typically characterized by right-sided shaking. DESCRIPTION OF THE RECORD:  The majority of the seizures occur on the evening of the 26th with multiple, repetitive focal seizures.  Aside from this, he demonstrates stage 2 sleep with vertex waves, K complexes and spindles.  By the later sections of the record on the 27th, the patient has more significant sections where he is awake, doing well and then drifting off to sleep. This piece of EEG concludes at 3:24 on the 27th. IMPRESSION:'

" CLINICAL HISTORY:  Rasmussen's encephalitis. MEDICATIONS:  Topamax, IVIG, Glucosamine, phenobarbital. INTRODUCTION:  Digital video EEG with long term EEG monitoring is performed in the long term monitoring unit using standard 10-20 system of electrode placement with 1 channel EKG.  The patient has a tender scalp and the tech sometimes had to modify the electrode placement. DESCRIPTION OF THE RECORD:  The interictal EEG continues to demonstrate focal slowing from the left hemisphere with left posterior temporal sharp waves.  Multiple seizures are identified in the 24 hour section, including in wakefulness and sleep.  The patient does not seem to wake up for all of them.  Stage II sleep, including the 2:00 a.m. to 3:00 a.m. section are prominent.  The nurses were aware of the seizures in sleep.  These seizures seem to be beginning with a burst of fast activity, almost some 10 to 5 Hz which is picked up very close to the midline.  The activity is really very prominent at CZ where it is 

' CLINICAL HISTORY:  41 year old right handed male with Rasmussenâ\x80\x99s encephalitis with increasing seizures. MEDICATIONS:  Topiramate, Lacosamide, Phenobarbital, Klonopin, Lipitor, Pantoprazole, Lisinopril INTRODUCTION:  Digital video EEG was performed in lab using standard 10-20 system of electrode placement with channel of EKG.  Photic stimulation was performed. DESCRIPTION OF THE RECORD:  The background EEG is markedly abnormal.  As the record begins, the activity includes a prominent interhemispheric asymmetry.  It is medium amplitude, but slow, primarily theta on the right with some occasional posterior delta.  From there left there is clearly a breach with a high amplitude spike and slow-wave complex at T3 and T5.  It is also picked up at C3/P3. The first seizure occurs within 1 minute with a burst of 14 Hz activity emanating from the left frontal region with frequency evolution.  This is over 4 minutes and 35 seconds into the EEG.  Additional seizure occurs at 4 minutes an

' DATES OF STUDY:  February 23-24, 2012. CLINICAL HISTORY:  Rasmussen encephalitis with increase in seizures. MEDICATIONS:  Vimpat, Topamax, phenobarbital, IVIG, others. INTRODUCTION:  Continuous video EEG monitoring is performed in the unit using standard 10-20 system of electrode placement with one channel of EKG.  This is an awake and asleep record. DESCRIPTION OF THE RECORD:  Random wakefulness and sleep, in wakefulness, the background EEG is somewhat slow from the right hemisphere.  The left hemisphere demonstrates arrhythmic delta activity with a high amplitude left posterior temporal spike complex. Clinical seizures are noted reliably with the patient and nurse and there are more than 20 pushbutton events, approximately 23, all 30-60 seconds in duration.  They are characterized by focal motor activity on the right hemibody.  Electrocardiographically, there is a buzz of mixed 5 and 10 Hz activity in the left hemisphere including the central regions.   There are a handful of seizu

' CLINICAL HISTORY:  A 42-year-old gentleman with Rasmussen encephalitis and increasing right-sided weakness as well as 2 tonic-clonic seizures and simple partial seizures. MEDICATIONS:  Vimpat Topamax, phenobarbital, IVIG, and others. INTRODUCTION:  Digital video EEG was performed in the lab using standard 10-20 system of electrode placement with 1-channel EKG.  Hyperventilation was not possible but photic stimulation was completed.  This was an awake and drowsy record. The patient had brief seizures with R jerks just prior to initiation of EEG and had a clinical seizure with eyes closed,  looking left,  and slowed responsiveness DESCRIPTION OF THE RECORD:  In wakefulness, the background EEG demonstrates a marked asymmetry between the 2 hemispheres.  The right hemisphere demonstrates modest background slowing with excess theta.  The left hemisphere demonstrates significant disruption of faster frequency activity.  Frequent sharp waves or spike is noted, high amplitude in the left hemi

" CLINICAL HISTORY:  A 42-year-old male with Rasmussen's encephalitis, status post left craniotomy with recent focal motor seizure followed by right-sided weakness and then epilepsy partialis continua. MEDICATIONS:  Decadron, phenobarbital, lacosamide, Zocor, others. INTRODUCTION:  Digital video EEG is performed in the lab/bedside using standard 10-20 system of electrode placement with one channel of EKG.  Photic stimulation was completed.  The patient was not experiencing involuntary movements during the EEG.  So this is a technically satisfactory EEG with acceptable impedances, but the craniotomy defect was noted. DESCRIPTION OF THE RECORD:  The background EEG is abnormal and demonstrates an asymmetry.  The right hemisphere is moderately slow with primarily a theta frequency background noted in wakefulness.  The left hemisphere demonstrates more significant arrhythmic delta activity particularly in the left posterior quadrant.  A high amplitude epileptiform discharge, high amplitude 

In [424]:
classification_dataset_simple.columns[2:]

Index(['BFNE', 'EME', 'Ohtahara', 'Myoclonic encephalopathy', 'infantile',
       'migrating focal', 'Unkown', 'MEI', 'Dravet', 'West',
       'myoclonic atonic', 'Gastaut type', 'CAE', 'myoclonic absences',
       'Landau-Kleffner', 'Lennox-Gastaut', 'Panayiotopoulos', 'temporal lobe',
       'tonic-clonic', 'jAE', 'ADEAF', 'JME', 'Gelastic', 'Reflex',
       'Rasmussen', 'central', 'supplementary motor area',
       'Temporoparietal junction', 'frontal', 'insular', 'mutlifocal',
       'occipital', 'parietal', 'external temporal', 'MTLE with HS',
       'temporal occipital'],
      dtype='object')

In [412]:
# Let's implement the hashtag system
# Assumption: `hashtags` is the list of all tag columns (everything after 'filepath' and 'report')
hashtags = list(classification_dataset_hashtag.columns[2:])
classification_dataset_hashtag = classification_dataset_hashtag[['filepath', 'report'] + hashtags]
classification_dataset_hashtag.head()
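As an illustration of what a hashtag system could look like for a single report, here is a minimal pure-Python sketch. The `to_hashtags` helper and the scores are hypothetical; only the two non-trivial thresholds (77.78 for "Rasmussen", 85.71 for "Lennox-Gastaut") come from the table above.

```python
def to_hashtags(scores, thresholds):
    """Return a '#tag' label for every term whose score reaches its threshold."""
    return [
        "#" + term.replace(" ", "_")
        for term, score in scores.items()
        if score >= thresholds[term]
    ]

# Thresholds follow the 100 * (n - e) / n rule; scores are made up for one report
thresholds = {"Rasmussen": 700 / 9, "Lennox-Gastaut": 1200 / 14, "EME": 100.0}
scores = {"Rasmussen": 100, "Lennox-Gastaut": 50, "EME": 67}
tags = to_hashtags(scores, thresholds)
print(tags)
```

Applied row by row to the score columns, the same idea would give one list of tags per report.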

# IV - Rake experiment

In [None]:
from rake_nltk import Rake

r = Rake() # Uses NLTK's English stopwords and all punctuation characters.

r.extract_keywords_from_text(research_similarity('temporal epilepsy')['report'][191])

r.get_ranked_phrases() # To get keyword phrases ranked highest to lowest.
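For readers unfamiliar with RAKE, the following is a minimal sketch of the general idea: cut the text into candidate phrases at stopwords and punctuation, then score each phrase by summing degree/frequency over its words. It is a simplified illustration, not rake_nltk's implementation; the tiny stopword list, the function names and the sample text are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake_nltk relies on NLTK's English list)
STOPWORDS = {"a", "an", "and", "is", "of", "the", "with", "in", "to", "was", "are"}

def candidate_phrases(text):
    """Split the text into word sequences delimited by punctuation and stopwords."""
    phrases = []
    for chunk in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    """Rank phrases by the sum of degree/frequency of their words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

ranked = rake_scores(
    "The background EEG is markedly abnormal. "
    "Focal spike and slow wave complexes are seen in the temporal region."
)
print(ranked)
```

Longer phrases made of rare words rank highest, which explains why RAKE tends to surface descriptive EEG findings rather than diagnosis names.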