## Notebook #1: Intro and EDA

In this notebook, I will introduce my project and explore my data. This will revolve around understanding the language used by the consumers, patterns of use, and looking at randomly selected inputs. The notebook will finish with a brief look into my thought process for modeling and how I will be deciding my modeling direction.

---

### Introduction and Background

As a member of the data science team, I have been tasked with the second phase of our new online product on medlineplus.gov from the US National Library of Medicine. This product allows consumers to input a question into our embedded chat feature. Our team has already built out a model to summarize the questions into a basic and logical form. This is where I come in: I will be building the model to classify incoming questions by content and direct each one to the appropriate on-call specialist for answering. This chat feature will function similarly to an online customer service chat platform. In this second phase, I will focus on the classification feature and on building out an initial chat interface to hand off to the engineering team.

Explicitly, my project will aim to build a binomial classification NLP model that identifies patient needs from an incoming message and matches that patient with a specialist to answer in real time. This will be done using the MeQSum corpus, a dataset of 1,000 patient questions pulled from the US National Library of Medicine. These questions were then summarized by medical experts. The model I will build will use the natural patient question to identify whether the patient is seeking a pharmacist or an internist to answer their question. I will build a chat-feature Streamlit app to highlight the importance of this model's work. This model would fit into a larger project outside the scope of this course but buildable in the long term. The use case for a product like this would be a question-asking chat feature embedded into a medical information website. Patients would be able to ask a doctor a question, their question would be sorted to the correct specialist, and a specialist would be able to answer them.

The most important metrics here will be accuracy and the F1 score. Most importantly, I want to prioritize accuracy. In the medical field, it is of utmost importance for patient safety that what we tell our patients is accurate (predicted correctly overall). For my project, it is not necessarily life-threatening if a question is categorized incorrectly, but minimizing that occurrence is still a high priority. However, because I have imbalanced data, it is also important to ensure that my model does not disproportionately favor one class over the other. This is why I will also look at the F1 score, which is the harmonic mean of precision and recall (sensitivity). I am truly after recall, which measures the share of actual positive cases that the model correctly identifies, but because I have imbalanced data, the F1 score will also help evaluate how fairly the model treats both classes. This is important in identifying better or worse models for my use case.
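To make the relationship between these metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration, with 'pharmacist' treated as the positive class; they are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: a model that never predicts the positive class...
always_negative = classification_metrics(tp=0, fp=0, fn=20, tn=80)
# ...versus a model that finds most positives at the cost of some false alarms.
some_model = classification_metrics(tp=15, fp=10, fn=5, tn=70)

print(always_negative)
print(some_model)
```

In this toy example the first model still scores 80% accuracy while its recall and F1 are both zero, which is the failure mode the F1 score is meant to expose on imbalanced data.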

The risks and issues with this project are clear from the outset. Ideally, I would have more than 1,000 data points, more time to hand-label questions (or use a program), and would be able to build a multi-class target (several specialties). When I initially labeled the data, I had about 20 classes of different specialties, from pediatrician to infectious disease to dentist. I quickly realized that for this size of dataset, a multi-class project would not be feasible. Therefore, I trimmed it down to differentiating between pharmacist and internist (general MD or primary care physician). I made this decision because I noticed several medication-related questions that would best be answered by a pharmacist. This binary differentiation fell into place naturally. With this, however, I realize that the medical questions being classified as 'internist' may not always be relevant to that internist's work. Because there is a broad range of questions in this dataset, some questions may be more suited for an ophthalmologist, dentist, pulmonologist, oncologist, etc. By making this a binary classification model, I am accepting that this project will help me identify a good starting point for what could be a great model in production. Further, because I hand-labeled the data, there is a risk that I mislabeled or misunderstood a question. To minimize this risk, I researched symptoms, conditions, medications, and concerns when I did not have extensive background knowledge on the topic. I believe I labeled the questions as accurately as I could and to the best of my knowledge.

---

As I continue in this notebook, we will explore the data and identify early trends. In identifying patterns of language between the two target classes, we can see anything but a clear divide in language. Both classes share several common words that may make modeling more difficult. 

In [36]:
# basics
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
import pickle

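import nltk
nltk.download('stopwords') # resource: https://pythonspot.com/nltk-stop-words/
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))

# vectorizing (CountVectorizer is used later in this notebook)
from sklearn.feature_extraction.text import CountVectorizer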

# graphs
import os
import kaleido

[nltk_data] Downloading package stopwords to /Users/ER/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
questions = pd.read_csv('./data/Capstone-Data - Sheet1.csv')
questions.head(3)

Unnamed: 0,File,message,binary: pharmacist/internist,specialty,Summary,specialty.1,Unnamed: 6
0,1-131188152.xml.txt,SUBJECT: who and where to get cetirizine - D\n...,pharmacist,pharmacist,Who manufactures cetirizine?,pharmacist,pharmacist
1,14348.txt,who makes bromocriptine\n i am wondering what ...,pharmacist,pharmacist,Who manufactures bromocriptine?,pediatrician,internist
2,1-131985747.xml.txt,SUBJECT: nulytely\n MESSAGE: Hello can you tel...,pharmacist,pharmacist,"Who makes nulytely, and where can I buy it?",oncologist,dentist


In [19]:
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   File         1000 non-null   object
 1   message      1000 non-null   object
 2   binary       1000 non-null   object
 3   specialty    1000 non-null   object
 4   Summary      1000 non-null   object
 5   specialty.1  21 non-null     object
 6   Unnamed: 6   5 non-null      object
dtypes: object(7)
memory usage: 54.8+ KB


In our initial view of the data, the column names and meanings may be confusing. The 'binary' column is where I differentiated between 'pharmacist' or 'internist'. The 'specialty' column is what I had initially used when I was hoping for a multinomial classification model. This ended up having about 20 target classes as can be seen in the 'specialty.1' column (used in Excel to find unique classes). Because I am doing a binomial classification, I will be focusing on the 'binary' column for target classes.

In [20]:
questions.rename(columns = {'binary: pharmacist/internist': 'binary'}, inplace=True)  # renaming the binary column for ease

In [21]:
questions['binary'].value_counts() 

internist     861
pharmacist    139
Name: binary, dtype: int64

In [22]:
questions.loc[questions['binary'] == 'internists', 'binary'] = 'internist'  # fixing the plural internist datapoint; .loc avoids chained assignment

In [23]:
questions['binary'].value_counts()

internist     861
pharmacist    139
Name: binary, dtype: int64

In [None]:
fig1 = px.histogram(data_frame=questions['binary'],
            title='Distribution of Target Classes: pharmacist and internist',
            labels={'value': 'Target Class'})

fig1.update_yaxes(title = 'Count')

fig1.update_layout(title_font_size=18, title_x=.5)

<img src='../assets/class-distribution.png' width='1600' height='800'>

As I mentioned in my introduction, my data is imbalanced. It is not an extreme case, but I am concerned it may cause some issues with modeling. I will use an oversampler to fix this in my modeling process. Aside from that, I am not surprised it is so imbalanced, but I also want to explore what makes up the 'internist' designation. 
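As an illustration of what an oversampler does, here is a minimal random-oversampling sketch using only the standard library. The function name, the toy rows, and the seed are hypothetical; this is not necessarily the oversampler used in the modeling notebook, where a library implementation such as imbalanced-learn's `RandomOverSampler` would be a more typical choice.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=42):
    """Duplicate minority-class rows at random until each class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy training split: 8 internist rows and 2 pharmacist rows.
train_rows = (
    [{"message": f"internist question {i}", "binary": "internist"} for i in range(8)]
    + [{"message": f"pharmacist question {i}", "binary": "pharmacist"} for i in range(2)]
)

balanced_rows = random_oversample(train_rows, label_key="binary")
balanced_counts = Counter(row["binary"] for row in balanced_rows)
print(balanced_counts)
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated minority rows never leak into the data used for evaluation.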

In [25]:
questions['specialty'].value_counts()  # multinomial specialties

pharmacist            139
neurologist           113
internist              87
gastroenterologist     78
orthopedist            64
pediatrician           58
obgyn                  56
dermatologist          55
cardiologist           53
oncologist             48
endocrinologist        40
urologist              36
ophthamologist         34
pulmonologist          33
infectious disease     30
ENT                    28
psychologist           18
rheumatologist         14
dentist                 8
bariatrician            7
bariaticians            1
Name: specialty, dtype: int64

In [26]:
questions.loc[questions['specialty'] == 'bariaticians', 'specialty'] = 'bariatrician'  # fixing the misspelled plural; .loc avoids chained assignment

In [None]:
fig2 = px.histogram(data_frame=questions['specialty'][questions['specialty'] != 'pharmacist'],
            title='Distribution of Proposed Multinomial Target Classes',
            labels={'value': 'Target Class'})

fig2.update_yaxes(title = 'Count')

fig2.update_layout(title_font_size=18, title_x=.5)

<img src='../assets/distribution-multi-nomial-class.png' width='1600' height='800'>

The plot above shows the distribution of the specialties that make up the 'internist' class. As I discussed in my introduction, joining all of these specialties together under the 'internist' title is a risk I am accepting with this model. These specialties are broad and are not typically grouped together (dentist, dermatologist, and orthopedist, for example). However, for the sake of this problem, I will be addressing them all under one umbrella.
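As a quick consistency check on that umbrella, the specialty counts reported above can be collapsed into the two-class target. This is a sketch under one assumption: that every specialty other than 'pharmacist' maps to 'internist'. The labels were assigned by hand, so `to_binary` is a hypothetical helper rather than the rule actually used, but the totals it produces match the 'binary' column.

```python
from collections import Counter

# Specialty counts as reported by value_counts() above, with the
# misspelled 'bariaticians' row already merged into 'bariatrician'.
specialty_counts = {
    "pharmacist": 139, "neurologist": 113, "internist": 87,
    "gastroenterologist": 78, "orthopedist": 64, "pediatrician": 58,
    "obgyn": 56, "dermatologist": 55, "cardiologist": 53,
    "oncologist": 48, "endocrinologist": 40, "urologist": 36,
    "ophthamologist": 34, "pulmonologist": 33, "infectious disease": 30,
    "ENT": 28, "psychologist": 18, "rheumatologist": 14,
    "dentist": 8, "bariatrician": 8,
}


def to_binary(specialty):
    """Collapse a specialty label into the two-class target (assumed rule)."""
    return "pharmacist" if specialty == "pharmacist" else "internist"


binary_counts = Counter()
for specialty, count in specialty_counts.items():
    binary_counts[to_binary(specialty)] += count

print(dict(binary_counts))
```

The collapsed totals come to 861 internist and 139 pharmacist questions, the same split shown for the 'binary' column.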

In [28]:
pd.options.display.max_colwidth = 1000

questions[['message', 'Summary', 'specialty']].iloc[[855]] # np.random.randint(0,1000) to get random ints

Unnamed: 0,message,Summary,specialty
855,SUBJECT: taking oxycodone 5mg\n MESSAGE: what will happen to me if my doctors just stop my oxycodone 5mg 2 every 4 hours and iv been on it for three ahalf weeks? and how do i have them wean me off slowly?,How do I stop taking oxycodone?,pharmacist


In [29]:
questions[['message', 'Summary', 'specialty']].iloc[[906]]

Unnamed: 0,message,Summary,specialty
906,"hi. i'm a student that suffers from Pectus excavatum (funnel chest), and i need help to pass it. please give me some way to get the solution of this problem. waiting for your answer. please need help!",What are treatments for pectus excavatum?,pulmonologist


In [30]:
questions[['message', 'Summary', 'specialty']].iloc[[386]]

Unnamed: 0,message,Summary,specialty
386,"SUBJECT: contents of barium sulfate solution\n MESSAGE: I have your Readi-Cat 2 barium sulfate solution. 21% w/v, 2.0% w/w. I need to know of this contains any gluten or dairy. Also, do you know if it contains any salicylates or anything similar to aspirin?","What are the ingredients of Readi-Cat 2 barium sulfate solution, and is it gluten and dairy free?",pharmacist


In [31]:
questions[['message', 'Summary', 'specialty']].iloc[[525]]

Unnamed: 0,message,Summary,specialty
525,"SUBJECT: ALT in blood\n MESSAGE: My ALT is 45, AST is 56. What is best way or test to know the reason for these increased values in Blood. And any treatment possible without knowing the reason. I am a [LOCATION] having retired life. I feel 99-100 F most of the time, Age 73, wt 87 kg, no alkohol, no smoking etc.",What are the causes of and treatments for eleveted ALT and AST?,gastroenterologist


Looking at a few random consumer-submitted questions, we can see how the different classes present themselves in the raw data. The questions asked about medications are specific: frequency and patterns of taking oxycodone, or ingredients of a solution. Patients often asked about the ingredients or allergy recommendations for certain medications as well as how to take or not take medications. I also saw people ask about side effects or off-brand labels for medications.

This is a stark difference from some of the internist questions: looking at lab counts or asking about treatments, causes, and symptoms of certain conditions. People also ask about genetic testing, symptoms of diagnoses, and types of diagnoses.

In [32]:
questions['message_word_count'] = [len(i.split()) for i in questions['message']]  # getting word count per message

questions.sort_values(by='message_word_count', ascending=False).head(5)  # longest messages

Unnamed: 0,File,message,binary,specialty,Summary,specialty.1,Unnamed: 6,message_word_count
714,1-135983184.xml.txt,"Hi All,\n I am from India and really worried.\n I have a 6 months old baby girl. I have read the article on Lactose intolerance. She is unable to digest any formula milk or any milk products. When she was a new born, I started giving her Lactogen (a formula milk) a little bit along with my milk. She was fine with it but after 3 weeks, I switched her to another formula milk named Nan Pro 0 which is for babies upto 1 year. She started drinking that along with my breast milk. After 2 weeks, she started getting Diarrhea and a severe one where she pooped at least 15 times in one day. Therefore, I consulted the doctor and the doctor immediately asked me to put her on breast milk only. I did try that but she was hungry and crying therefore, I gave her a little bit of formula milk aswell. She was file with all the medicines. However, once the medicine stooped, she again started getting diarrhea. This continued on and off.\n After a while, the doctor asked me to stop the formula milk and st...",internist,pediatrician,Where can I get help for my 6 months old baby girl with Lactose intolerance in India?,,,378
642,11199.txt,"ClinicalTrials.gov - Question - general information.\n Hello, \n My name is [NAME],I am 30 years old and I am from [LOCATION]. I met my friend [NAME] on a chat room. [NAME] is 25 years old and he currently lives in [LOCATION], Algeria. His dad died, his brother died 2 years ago in a motorcycle accident and about 9 months ago my dear friend got into a car accident. In that car accident he suffered a T-6 and T-7 fracture which cause him to be paralyzed from his waist down. He is so young and it really hurts how much he has suffered in his short life. Not only him, but just thinking how much his mom has suffered it really breaks my heart. I asked him how or what can I do to help him, he did not asked for money or something like, he just said: ""Would you help me to research how to get cured?"". Since I met my friend there is no one day I have stopped thinking in what can I do to make his dream come true. I know we lives in an imperfect world and things happen to people but faith moves ...",internist,neurologist,Where can I get information about treatment for T-6 and T-7 fracture paraplegia?,,,348
640,1-132048350.xml.txt,"Hello.\n I am writing this mail from [LOCATION]. This in regards with seeking help for\n Ulceratice colitis.\n My mother is suffering from this diseases from last 3 Years. Her treatment\n is going with [LOCATION] from past a\n year and half.\n Below are the symptoms of her disease :\n 1. Mucas Pus flow in quantity while passing stool\n 2. Blood flow while passing stool\n 3. Heavy weakness\n 4. Heavy and steady weight loss\n 5. Weak Eyesight\n 6. Heavy cramps after meal\n 7. stomach pain sometimes after lifting heavy weight\n 8. Poor appetite\n Below are the medicines given by the doctor during the course :\n Cap SonprazD, Tab Coolgut 1.5 , tab Falute, Mesacol Supporteries, Cap A to\n Z , Tab siho fix, and Tab omnacortil , rabelco rd cap , coolgut , folvitc ,\n entofoam , bevon , anovate.\n She had undergone Colonoscopy , Endoscopy , Stool test , Urine tests and\n Blood test many times ( i can mail the reports of all the above if needed\n ). Her progress was very good at initial sta...",internist,gastroenterologist,Where can I get advice and help for ulceratice colitis?,,,341
802,12800.txt,"I understand that you cannot provide opinions,nor suggest, any type of therpy. \n My son (dob-[DATE]), was called back to the hospital four days after birth due to a abnormal PKU, which suggested adrenal hyperplasia. After test the endocrine physians felt he did not have the condition. Several years later he began having severe headaches preceded by violent vomiting epidodes \n that lasted for hours. When he was admitted to [LOCATION], he was suffering from bleeding esphogal ulcers, and severe dehydration and, malnutrician. His urine (24 hr), blood plasma,& VMA,showed increased values of catacholomines.He also demostrated intermittent hypertension, and tachcardia through out the day (he had been diagnoised with ADHD several years earlier). His MIGB clearly showed bifocal uptake in both adrenal glands. The mri did not show tumors at that time ([DATE]. He has since been diagnoised with antral polys (biosphy's taken were misplaced, and the gastrologist would not redo the test), aberra...",internist,endocrinologist,How can I find physician(s) or hospital(s) who specialize in pheochromocytomas?,,,328
392,1-135532945.xml.txt,SUBJECT: G6PD deficiency adopted 1957\n MESSAGE: My primary care says that even though I have had three blood tests that all confirm I am totally lacking the G6PD that my deficiency is unsubstantiated. Other than expensive genetic testing how to I get them to confirm this? I want need this in my medical records and a bracelet or necklace or something to wear that I have this so they do not give me drugs and/or vitamins supplements that I should not have at all and make me worse or kill me? I am adopted and the records are still sealed and I have a son and a daughter and four grandchildren that could have had this deficiency passed on to them so you see why this is important to me. I got injured at work years ago and they gave me some kind of anesthesia I should not of had to find a bleeder on the inside and sew it and then they could not wake me up for two days. Also same job required that I get yellow fever shot every six months. No wonder I felt so awfull back then. I do not want...,internist,internist,"How is G6PD deficiency inherited and diagnosed, and what are the treatments for it?",,,314


In [33]:
questions.sort_values(by='message_word_count', ascending=False).tail(5)  # shortest messages

Unnamed: 0,File,message,binary,specialty,Summary,specialty.1,Unnamed: 6,message_word_count
577,38.txt,cinca sindrome. where the treatment of cinca sindrome.,internist,pediatrician,What are the tratments for cinca syndrome?,,,8
946,5566.txt,Cross Eye.\n Need to fix my cross eyed,internist,ophthamologist,How to treat crossed eyes?,,,8
611,78.txt,are jumping genes[transposons] associated with lynch syndrome? [NAME],internist,oncologist,Are jumping genes [transposons] associated with Lynch syndrome?,,,8
827,11352.txt,erection problems.\n how to get rid of erections?,internist,urologist,What are the treatments for erection problems?,,,8
759,11947.txt,bile.\n because reason vomit bile?,internist,gastroenterologist,What causes vomiting bile?,,,5


The range of message word counts is substantial. Some patients tell entire stories within their question, while others simply ask a basic and straightforward question. Luckily, the summarizer-produced questions make these easier to understand. One issue that concerns me is the amount of extraneous information in the longer messages that does not get at the heart of the question. I suspect this may be an issue when it comes to modeling. The opposite concern exists for the short messages. The question related to erections could be asking about medications or other treatments. Because there is no other information, I did not assume the patient was asking for medications and instead directed their question toward an internist who may be able to help with a more holistic approach.

As can be seen below, the majority of the questions are under or around 50 words in length, but there are still some messages that are very long. The longest message is over 370 words in length.
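The same whitespace-split word count used above can be summarized numerically as well as plotted. The sketch below is a hypothetical helper (`length_summary`) applied to four of the short messages shown in this notebook, not to the full dataset; the 50-word threshold mirrors the observation in the text.

```python
import statistics


def length_summary(messages, threshold=50):
    """Summarize whitespace-split word counts for a list of messages."""
    counts = [len(message.split()) for message in messages]
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        # Fraction of messages at or below the chosen word-count threshold.
        "share_at_or_below_threshold": sum(c <= threshold for c in counts) / len(counts),
    }


# A small subset of messages copied from the outputs above.
sample_messages = [
    "bile.\n because reason vomit bile?",
    "Cross Eye.\n Need to fix my cross eyed",
    "erection problems.\n how to get rid of erections?",
    "SUBJECT: nulytely\n MESSAGE: Hello can you tell me where do i order "
    "the nulytely who is the manufacture, what phone number can i call. thanks.",
]

summary = length_summary(sample_messages)
print(summary)
```

On the real data, the same idea could be applied to `questions['message_word_count']` directly, for example with pandas' `describe()`.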

In [None]:
fig = px.histogram(data_frame=questions,
             x='message_word_count',
             nbins=25,
             title='Distribution of Message Length',
             labels={'message_word_count' : 'Message Length by Word Count'}  # key must match the x column to relabel the axis
            )

fig.update_yaxes(title = 'Count')

fig.update_layout(title_font_size=18, title_x=.5)

<img src='../assets/message-length.png' width='1600' height='800'>

In [30]:
df_vec = questions[['message', 'binary']]  # preparing dataset for vectorized EDA
df_vec.head(3)

Unnamed: 0,message,binary
0,SUBJECT: who and where to get cetirizine - D\n MESSAGE: I need/want to know who manufscturs Cetirizine. My Walmart is looking for a new supply and are not getting the recent,pharmacist
1,"who makes bromocriptine\n i am wondering what company makes the drug bromocriptine, i need it for a mass i have on my pituitary gland and the cost just keeps raising. i cannot ever buy a full prescription because of the price and i was told if i get a hold of the maker of the drug sometimes they offer coupons or something to help me afford the medicine. if i buy 10 pills in which i have to take 2 times a day it costs me 78.00. and that is how i have to buy them. thanks.",pharmacist
2,"SUBJECT: nulytely\n MESSAGE: Hello can you tell me where do i order the nulytely who is the manufacture, what phone number can i call. thanks.",pharmacist


In [31]:
df_vec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  1000 non-null   object
 1   binary   1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [33]:
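cv = CountVectorizer(stop_words=list(stops))  # newer scikit-learn versions expect a list here, not a set
dfqu_vec = pd.DataFrame(cv.fit_transform(df_vec['message']).toarray(), columns=cv.get_feature_names_out())  # .toarray() is the explicit equivalent of .A
dfqu_vec.head(3)  # vectorizing messages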

Unnamed: 0,00,000,000421,001274,00527172874,01,01d08e1e,01t11,02,03,...,zerolac,zest,zinc,zolmitriptan,zostavax,évidence,úlcera,ımportant,ınformatıon,ıs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
dfqu_vec.sum().sort_values(ascending=False).head(15)  # most common words overall

message        731
subject        666
name           354
help           336
please         255
know           228
would          227
years          217
information    193
thank          180
treatment      171
old            169
location       160
need           155
get            151
dtype: int64

The most common words here are unsurprising but still telling. Many of the questions started with 'message' or 'subject', which are field labels in the raw text ('SUBJECT:' and 'MESSAGE:') rather than patient language. Disregarding those, we can see that most people are asking for help or the name of something, although the counts for 'name' and 'location' are likely inflated by the '[NAME]' and '[LOCATION]' anonymization placeholders that appear in the messages. Patients are asking politely ('please') for information, treatments, and locations, and are expressing a need for knowledge. In production, this chat feature would allow these patients to receive the answers and help they need in real time.
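Tokens such as 'subject', 'message', 'name', and 'location' in the counts above plausibly come from the 'SUBJECT:' / 'MESSAGE:' field labels and the '[NAME]' / '[LOCATION]' / '[DATE]' placeholders visible in the raw messages. A minimal sketch of stripping them before vectorizing follows; the function name and patterns are assumptions for illustration, and the corpus may contain other placeholders not covered here.

```python
import re

# Upper-case field labels as they appear at the start of many messages.
FIELD_LABEL = re.compile(r"\b(?:SUBJECT|MESSAGE):\s*")
# Bracketed anonymization placeholders seen in the sample outputs.
PLACEHOLDER = re.compile(r"\[(?:NAME|LOCATION|DATE)\]")


def strip_template_tokens(message):
    """Remove field labels and placeholders, then collapse whitespace."""
    text = FIELD_LABEL.sub(" ", message)
    text = PLACEHOLDER.sub(" ", text)
    return " ".join(text.split())


example = "SUBJECT: ALT in blood\n MESSAGE: I am a [LOCATION] having retired life."
cleaned = strip_template_tokens(example)
print(cleaned)
```

The patterns are case-sensitive on purpose, so an ordinary lower-case use of a word like "message" inside a question is left alone. Applied with something like `questions['message'].apply(strip_template_tokens)`, this would keep template tokens out of the word-frequency plots.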

In [None]:
freq = dfqu_vec.sum().sort_values(ascending=False).head(15)

bar_freq = px.bar(
                  x=list(freq.index),
                  y=list(freq.values),
                  title='15 Most Common Words from Submitted Messages',
                  labels = {'y': 'Count', 'x': 'Common Word'}

)

bar_freq.update_layout(title_x = 0.5)

<img src='../assets/most-common-words.png' width='1600' height='800'>

In [36]:
dfqu_vec['col'] = range(0,1000)
dfqu_vec.head(3)  # preparing datasets for merge on 'col'

Unnamed: 0,00,000,000421,001274,00527172874,01,01d08e1e,01t11,02,03,...,zest,zinc,zolmitriptan,zostavax,évidence,úlcera,ımportant,ınformatıon,ıs,col
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [37]:
questions['col'] = range(0,1000)
questions.head(3)

Unnamed: 0,File,message,binary,specialty,Summary,specialty.1,Unnamed: 6,message_word_count,col
0,1-131188152.xml.txt,SUBJECT: who and where to get cetirizine - D\n MESSAGE: I need/want to know who manufscturs Cetirizine. My Walmart is looking for a new supply and are not getting the recent,pharmacist,pharmacist,Who manufactures cetirizine?,pharmacist,pharmacist,31,0
1,14348.txt,"who makes bromocriptine\n i am wondering what company makes the drug bromocriptine, i need it for a mass i have on my pituitary gland and the cost just keeps raising. i cannot ever buy a full prescription because of the price and i was told if i get a hold of the maker of the drug sometimes they offer coupons or something to help me afford the medicine. if i buy 10 pills in which i have to take 2 times a day it costs me 78.00. and that is how i have to buy them. thanks.",pharmacist,pharmacist,Who manufactures bromocriptine?,pediatrician,internist,97,1
2,1-131985747.xml.txt,"SUBJECT: nulytely\n MESSAGE: Hello can you tell me where do i order the nulytely who is the manufacture, what phone number can i call. thanks.",pharmacist,pharmacist,"Who makes nulytely, and where can I buy it?",oncologist,dentist,25,2


In [38]:
df_all = pd.merge(left = dfqu_vec, right = questions[['binary', 'col']], how = 'left', left_on = 'col', right_on = 'col')
df_all.drop(columns='col', inplace=True)
df_all.head(3)  # merging so we can see most common words for both target classes

Unnamed: 0,00,000,000421,001274,00527172874,01,01d08e1e,01t11,02,03,...,zinc,zolmitriptan,zostavax,évidence,úlcera,ımportant,ınformatıon,ıs,col,binary
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pharmacist
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,pharmacist
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,pharmacist


In [40]:
df_all[df_all['binary'] == 'pharmacist'].head(3)

Unnamed: 0,00,000,000421,001274,00527172874,01,01d08e1e,01t11,02,03,...,zest,zinc,zolmitriptan,zostavax,évidence,úlcera,ımportant,ınformatıon,ıs,binary
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pharmacist
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pharmacist
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pharmacist


In [None]:
internist = df_all[df_all['binary'] == 'internist'].drop(columns='binary')  # drop returns a new frame, avoiding an in-place edit on a slice
freq1 = internist.sum().sort_values(ascending=False).head(15)

bar_freq1 = px.bar(
                  x=list(freq1.index),
                  y=list(freq1.values),
                  title='15 Most Common Words from Submitted Messages (Hand-Classified as Internist before Modeling)',
                  labels = {'y': 'Count', 'x': 'Common Word'}

)

bar_freq1.update_layout(title_x = 0.5)

<img src='../assets/most-common-words-internist.png' width='1600' height='800'>

From the plot above, patients seeking internist (or, for multinomial classification, specialist) help are searching for information, treatments, and names. It is interesting to see many overlapping words in both of these plots. We can see below that patient questions designated for pharmacists ask about the names and ingredients (such as gluten) of medications. I suspect this overlap may make it harder for the models to differentiate the two classes. It would also be interesting to run a topic analysis on the messages in both target classes.
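One way to quantify that overlap is to compare the top words of each class and separate the shared ones from the distinctive ones. The sketch below uses hypothetical toy messages and a tiny hand-picked stop-word set; its whitespace tokenization is simpler than `CountVectorizer`'s, so it illustrates the idea rather than reproducing the plots.

```python
from collections import Counter


def top_words(messages, stop_words, k):
    """Return the k most frequent non-stop-words across the messages."""
    counts = Counter(
        word
        for message in messages
        for word in message.lower().split()
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(k)]


def split_shared_and_distinct(top_a, top_b):
    """Return (shared, only_in_a, only_in_b) as sorted lists."""
    set_a, set_b = set(top_a), set(top_b)
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical toy messages, not taken from the dataset.
internist_msgs = ["please help treatment for migraine",
                  "need help with treatment information"]
pharmacist_msgs = ["please help who makes this medication",
                   "does this medication contain gluten"]
toy_stops = {"for", "with", "who", "this", "does"}

shared, internist_only, pharmacist_only = split_shared_and_distinct(
    top_words(internist_msgs, toy_stops, k=6),
    top_words(pharmacist_msgs, toy_stops, k=6),
)
print(shared, internist_only, pharmacist_only)
```

Words that land in the shared list carry little signal for separating the classes, while the class-only lists hint at which terms a model could lean on.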

In [None]:
pharmacist = df_all[df_all['binary'] == 'pharmacist'].drop(columns='binary')  # drop returns a new frame, avoiding an in-place edit on a slice
freq2 = pharmacist.sum().sort_values(ascending=False).head(15)

bar_freq2 = px.bar(
                  x=list(freq2.index),
                  y=list(freq2.values),
                  title='15 Most Common Words from Submitted Messages (Hand-Classified as Pharmacist before Modeling)',
                  labels = {'y': 'Count', 'x': 'Common Word'}

)

bar_freq2.update_layout(title_x = 0.5)

<img src='../assets/most-common-words-pharmacist.png' width='1600' height='800'>

As we reach the end of my EDA, I have seen that patients are often asking for information about their own medical status, but also very often for help with their loved ones. They are asking about treatments and medications. They are asking for more knowledge, to be educated on what concerns or interests them. I can see that some patients are desperate for help and often very concerned. I believe that this product will be very useful for patients seeking real-time answers to their questions.

When approaching the modeling phase in my next notebook, I will first want to run a basic model to see how well the data fits. I already know from the distribution of my target classes that the null model is about 86% accurate (861 of 1,000 questions) when predicting internist for every question. The models need to beat that to be useful. I also want to explore a topic analysis to see if any models can pick up what specifically the questions are asking about or if they can be grouped. The next notebook will explore these ideas.
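The null-model accuracy can be checked directly from the class counts reported earlier in this notebook (861 internist, 139 pharmacist). The helper name below is hypothetical; the arithmetic is simply the majority class divided by the total.

```python
from collections import Counter


def majority_baseline(class_counts):
    """Return the majority label and the accuracy of always predicting it."""
    label, count = max(class_counts.items(), key=lambda item: item[1])
    return label, count / sum(class_counts.values())


# Class counts from questions['binary'].value_counts() above.
class_counts = Counter({"internist": 861, "pharmacist": 139})

baseline_label, baseline_accuracy = majority_baseline(class_counts)
print(baseline_label, round(baseline_accuracy, 3))
```

Because this baseline never predicts 'pharmacist', its recall for that class is zero, which is why accuracy alone is not enough to judge the models in the next notebook.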