### References

* Chatbot with TF: https://chatbotsmagazine.com/contextual-chat-bots-with-tensorflow-4391749d0077
* Good reference list for reading: http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/


### Options:
* Use categories from semtype tagging to identify

## Initialize

In [1]:
# Utility
import sys,os
import time

import pandas as pd

# Custom
from processing import tag_utterances
from processing import load_sem_types

## NLP
import spacy

# Set the absolute paths to the QuickUMLS install and its data directory
abs_path_umls = '/Users/austinpowell/Google_Drive/kp_datascience/doctor_notes/ontology/UMLS'
abs_path_data_umls = '/Users/austinpowell/Google_Drive/kp_datascience/doctor_notes/ontology/UMLS/QuickUMLS_data'
sys.path.append(abs_path_umls+'/QuickUMLS')
from quickumls import QuickUMLS
tagger = QuickUMLS(abs_path_data_umls)

In [2]:
path_to_data = '../data/reddit_comments_askDocs_2014_to_2018_03.gz'
df = pd.read_csv(path_to_data,low_memory=False)
print('Shape',df.shape)
df.head(2)

Shape (557648, 21)


Unnamed: 0,body,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,...,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,removal_reason
0,for a manlet such as yourself I'd recommend at...,,,,-Ai,This user has not yet been verified.,,1513411674,t5_2xtuc,t3_7k5x2h,...,0,1514772000.0,0,0,drbt2db,AskDocs,,,default,
1,Thank you very much for answering!,,,,-SY,This user has not yet been verified.,,1445798103,t5_2xtuc,t3_3q697b,...,2,1447190000.0,0,0,cwcfjpr,AskDocs,2.0,,default,


In [3]:
# Load semantic types
sem_type_dict = load_sem_types('../data/SemGroups_2013.txt')
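
`load_sem_types` comes from the local `processing` module and is not shown in this notebook. The following is only a minimal sketch of what such a loader might look like, assuming the file uses the pipe-delimited UMLS Semantic Groups layout (`group abbreviation|group name|type id|type name`) and that the result maps type ids such as `T170` to type names. The function name `load_sem_types_sketch` and the two sample rows are illustrative, not taken from the project.

```python
import io

def load_sem_types_sketch(lines):
    """Map semantic type ids (e.g. 'T170') to type names.

    `lines` is any iterable of pipe-delimited strings, such as an open file.
    Hypothetical stand-in for processing.load_sem_types.
    """
    sem_types = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split('|')
        if len(parts) != 4:
            continue  # skip rows that do not have the assumed four fields
        _group_abbrev, _group_name, type_id, type_name = parts
        sem_types[type_id] = type_name
    return sem_types

# Illustrative rows in the assumed layout
sample = io.StringIO(
    "CONC|Concepts & Ideas|T170|Intellectual Product\n"
    "DISO|Disorders|T037|Injury or Poisoning\n"
)
sem_types = load_sem_types_sketch(sample)
print(sem_types['T170'])
```

With a real file, the same function could be called as `load_sem_types_sketch(open(path))`, provided the file really has this layout.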

In [8]:
t = tagger.nlp(df['body'].iloc[4])
t.text

"She's had it for 8 months, we've never had any issues with it before."

In [9]:
# Look up UMLS concept matches for the utterance text
matches = tagger.match(t.text, best_match=True, ignore_syntax=False)
# Each entry in matches is a list of candidate dicts for one matched span
for match in matches:
    for m in match:
        print(m)

{'start': 19, 'end': 25, 'ngram': 'months', 'term': 'month', 'cui': 'C1561542', 'similarity': 0.75, 'semtypes': {'T170'}, 'preferred': 1}


In [21]:
sd = load_sem_types('../data/SemGroups_2013.txt')
m['semtypes']

AttributeError: 'set' object has no attribute 'text'

In [11]:
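import importlib
import processing

# Reload the local module so edits to processing.py are picked up
importlib.reload(processing)
from processing import tag_utterances
from processing import load_sem_types

tag_utterances(1, t.text, tagger)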

{'T170'}


[[1, 19, 25, 'month', 'C1561542', 0.75, {'Intellectual Product'}]]
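
The body of `tag_utterances` is not shown here. As a rough sketch of the conversion it appears to perform, the function below turns QuickUMLS-style match groups into flat rows and swaps semantic type ids for names. `rows_from_matches` is a hypothetical name, and taking the first candidate of each group as the best one is an assumption.

```python
def rows_from_matches(doc_id, matches, sem_types):
    """Flatten match groups into [doc_id, start, end, term, cui, similarity, names] rows.

    `matches` is a list of candidate lists, each candidate being a dict with
    the keys shown in the printed match above. `sem_types` maps ids to names.
    """
    rows = []
    for candidates in matches:
        if not candidates:
            continue
        best = candidates[0]  # assumption: the first candidate is the best one
        # Fall back to the raw id when a type is missing from the lookup
        names = {sem_types.get(type_id, type_id) for type_id in best['semtypes']}
        rows.append([doc_id, best['start'], best['end'], best['term'],
                     best['cui'], best['similarity'], names])
    return rows

# The single match printed earlier in this notebook
matches = [[{'start': 19, 'end': 25, 'ngram': 'months', 'term': 'month',
             'cui': 'C1561542', 'similarity': 0.75, 'semtypes': {'T170'},
             'preferred': 1}]]
rows = rows_from_matches(1, matches, {'T170': 'Intellectual Product'})
print(rows)  # [[1, 19, 25, 'month', 'C1561542', 0.75, {'Intellectual Product'}]]
```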

In [31]:
%%time
print('Iterating over every document')
# Iterate over every document and extract the concepts
result = []
for idx, doc in enumerate(df['body']):
    if idx % 100 == 0:
        print("Documents processed: {}".format(idx))
    try:
        annotations = tag_utterances(idx, doc, tagger)
        result.extend(annotations)
    except Exception as e:
        # Report the failing document and keep going
        print("Document {}: {}".format(idx, e))

df_matches = pd.DataFrame(data=result, columns=['document', 'start', 'end', 'term', 'cui', 'similarity', 'semtypes'])
df_matches.sort_values(by=['document', 'start'], inplace=True)

Iterating over every document
Documents processed: 0
Documents processed: 5
Documents processed: 10
Documents processed: 15
Documents processed: 20
Documents processed: 25
Documents processed: 30
Documents processed: 35
Documents processed: 40
Documents processed: 45
Documents processed: 50
Documents processed: 55
Documents processed: 60
Documents processed: 65
Documents processed: 70
Documents processed: 75
Documents processed: 80
Documents processed: 85
Documents processed: 90
Documents processed: 95
Documents processed: 100
Documents processed: 105
Documents processed: 110
Documents processed: 115
Documents processed: 120
Documents processed: 125
Documents processed: 130
Documents processed: 135
Documents processed: 140
Documents processed: 145
Documents processed: 150
Documents processed: 155
Documents processed: 160
Documents processed: 165
Documents processed: 170
Documents processed: 175
Documents processed: 180
Documents processed: 185
Documents processed: 190
Documents process
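
The loop above prints any exception and moves on, which makes it hard to tell afterwards which documents were skipped. One possible variant, sketched here with a stand-in tagging function so it runs without QuickUMLS, records failures alongside the rows. pandas loads missing CSV values as float `NaN`, so non-string bodies are skipped explicitly; whether this dataset contains such rows is an assumption. `tag_corpus` and `fake_tag` are hypothetical names.

```python
def tag_corpus(docs, tag_fn):
    """Apply tag_fn(idx, doc) to each document, collecting rows and failures."""
    rows, failures = [], []
    for idx, doc in enumerate(docs):
        # Skip missing or empty bodies instead of passing them to the tagger
        if not isinstance(doc, str) or not doc.strip():
            failures.append((idx, 'skipped: not a non-empty string'))
            continue
        try:
            rows.extend(tag_fn(idx, doc))
        except Exception as exc:
            failures.append((idx, repr(exc)))
    return rows, failures

def fake_tag(idx, doc):
    # Stand-in for: tag_utterances(idx, doc, tagger)
    return [[idx, 0, len(doc), doc.lower()]]

rows, failures = tag_corpus(['Fever', float('nan'), 'Cough'], fake_tag)
print(rows)
print(failures)
```

In the notebook, `tag_fn` could be `lambda i, d: tag_utterances(i, d, tagger)`.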

In [35]:
df_matches.head()

Unnamed: 0,document,start,end,term,cui,similarity,semtypes
2,0,62,67,water,C0043047,1.0,"{Pharmacologic Substance, Inorganic Chemical}"
0,0,89,103,hours of sleep,C2937255,1.0,{Finding}
3,0,105,108,cut,C0000925,1.0,{Injury or Poisoning}
1,0,140,153,sodium intake,C0489645,1.0,{Clinical Attribute}
11,2,11,17,tested,C0392366,1.0,"{Body Part, Organ, or Organ Component, Intelle..."
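
One way to use the semantic type categories downstream is to count how often each type occurs per document. The sketch below does this over plain row lists in the same column order as `df_matches`; the helper name `semtype_counts` is hypothetical, and the sample rows are the first four rows of the table above.

```python
from collections import Counter, defaultdict

def semtype_counts(rows):
    """Return {document: Counter of semantic type names} for the given rows."""
    counts = defaultdict(Counter)
    for document, _start, _end, _term, _cui, _similarity, semtypes in rows:
        counts[document].update(semtypes)
    return counts

rows = [
    [0, 62, 67, 'water', 'C0043047', 1.0,
     {'Pharmacologic Substance', 'Inorganic Chemical'}],
    [0, 89, 103, 'hours of sleep', 'C2937255', 1.0, {'Finding'}],
    [0, 105, 108, 'cut', 'C0000925', 1.0, {'Injury or Poisoning'}],
    [0, 140, 153, 'sodium intake', 'C0489645', 1.0, {'Clinical Attribute'}],
]
counts = semtype_counts(rows)
print(dict(counts[0]))
```

The rows of the DataFrame itself could be passed in with `df_matches.itertuples(index=False)`, assuming the columns keep the order shown above.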


In [33]:
df.shape

(557648, 21)

In [6]:
df.head()

Unnamed: 0,body,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,...,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,removal_reason
0,for a manlet such as yourself I'd recommend at...,,,,-Ai,This user has not yet been verified.,,1513411674,t5_2xtuc,t3_7k5x2h,...,0,1514772000.0,0,0,drbt2db,AskDocs,,,default,
1,Thank you very much for answering!,,,,-SY,This user has not yet been verified.,,1445798103,t5_2xtuc,t3_3q697b,...,2,1447190000.0,0,0,cwcfjpr,AskDocs,2.0,,default,
2,Never been tested for that. I was hoping the ...,,,,-o2,This user has not yet been verified.,,1461952470,t5_2xtuc,t3_4gz1fi,...,1,1463777000.0,0,0,d2mce34,AskDocs,1.0,,default,
3,"She said her constant abdominal pain is a 6, t...",,,,05P,This user has not yet been verified.,,1504214332,t5_2xtuc,t3_6x9jk0,...,1,1504553000.0,0,0,dme9lzr,AskDocs,,,default,
4,"She's had it for 8 months, we've never had any...",,,,05P,This user has not yet been verified.,,1504217835,t5_2xtuc,t3_6x9jk0,...,1,1504554000.0,0,0,dmecohs,AskDocs,,,default,


In [9]:
df.author_flair_text.value_counts()

This user has not yet been verified.     352466
Physician                                 26484
Medical Student                           13399
B.S., Medical Lab Sciences                 3069
FY1 Doctor                                 2860
Moderator                                  2644
Registered Nurse                           2589
Emergency Medical Technician               2023
Surgeon                                    1765
Surgeon | Moderator                        1710
Orthopaedic Surgeon                        1569
Psychiatrist                               1513
Dermatologist                              1481
Doctor (A&amp;E)                           1435
Pharm.D. - Hospital Pharmacist             1056
Emergency Physician                        1027
Nursing Graduate, RPN                       803
Pharm.D.                                    748
Nursing Student                             735
Occupational Therapist                      727
Anesthesiologist                        

### Predicting Author

In [16]:
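# Label 1 for any flair other than the "unverified" placeholder.
# Note: a missing (NaN) flair also differs from the placeholder, so it is labeled 1 as well.
df['is_clinician'] = (df['author_flair_text'] != 'This user has not yet been verified.').astype(int)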
print(df['is_clinician'].mean())
print(df['is_clinician'].value_counts())

0.36794178406449946
0    352466
1    205182
Name: is_clinician, dtype: int64
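
The `is_clinician` rule above labels every row whose flair differs from the unverified placeholder as 1, which includes rows with no flair at all. A small sketch of a three-way label that keeps missing flair separate is given below; `label_flair` and the label names are hypothetical, and whether to treat flairs such as "Moderator" or "Nursing Student" as clinicians is left open.

```python
UNVERIFIED = 'This user has not yet been verified.'

def label_flair(flair):
    """Classify a flair value as 'missing', 'unverified', or 'flaired'."""
    # pandas represents missing text as float NaN, so anything that is not
    # a non-empty string is treated as missing
    if not isinstance(flair, str) or not flair.strip():
        return 'missing'
    if flair == UNVERIFIED:
        return 'unverified'
    return 'flaired'

labels = [label_flair(f) for f in ['Physician', UNVERIFIED, float('nan'), None]]
print(labels)  # ['flaired', 'unverified', 'missing', 'missing']
```

On the DataFrame this could be applied as `df['author_flair_text'].apply(label_flair)`.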


In [6]:
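# The spacy module itself is not callable; loading a pipeline by name is the assumed intent,
# and the model name 'en' is a guess.
nlp_pipe_umls = spacy.load('en')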

TypeError: 'float' object is not iterable

In [None]:
model.train(num_epochs=100)

In [None]:
model.generate_tsne()