## Experimenting with the dataset

In [13]:
import requests
import hashlib
import re



In [3]:
# URL of the JSON file
url = 'https://raw.githubusercontent.com/Kent0n-Li/ChatDoctor/main/chatdoctor5k.json'

# Send a GET request to the URL (the timeout avoids hanging indefinitely)
response = requests.get(url, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON content
    data = response.json()
    print("JSON data loaded successfully.")
else:
    # Raise instead of printing so later cells never run with 'data' undefined
    raise RuntimeError(f"Failed to retrieve data: {response.status_code}")

# 'data' now holds the parsed JSON: a list of dictionaries, one per record


JSON data loaded successfully.


In [9]:
data[0]

{'instruction': "If you are a doctor, please answer the medical questions based on the patient's description.",
 'input': "Doctor, I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
 'output': "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms."}
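
The record above suggests that every entry is a dictionary with `instruction`, `input`, and `output` keys. A minimal sketch of a schema check follows, assuming that shape holds for the whole file; `REQUIRED_KEYS`, `find_malformed`, and the `sample` records are hypothetical names and data introduced here for illustration, not part of the notebook.

```python
# Assumption: each record should be a dict with these three keys,
# as seen in data[0]. The names below are illustrative only.
REQUIRED_KEYS = {"instruction", "input", "output"}


def find_malformed(records):
    """Return the indices of records that are not dicts or lack a required key."""
    bad = []
    for index, record in enumerate(records):
        if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
            bad.append(index)
    return bad


# Made-up sample records standing in for the downloaded list
sample = [
    {"instruction": "i", "input": "q", "output": "a"},
    {"input": "question without the other keys"},
    "not a dict",
]
malformed = find_malformed(sample)
print(malformed)
```

In the notebook, the same check could be run as `find_malformed(data)` before processing, so that missing keys surface early instead of being silently replaced by empty strings via `item.get(..., "")`.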

In [18]:
# Function to generate a unique ID based on the question text
def generate_id(question):
    hash_object = hashlib.sha256(question.encode())
    return hash_object.hexdigest()

# Function to clean up the question by removing the introductory phrases
def clean_question(question):
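    # Regular expression to match "Hi doctor,", "Hello doctor,", "Doctor,", "Hi, Doctor." etc.
    # The character class [.,] matches a literal period or comma; a bare "." would match any character.
    pattern = r"^(hi doctor,|hello doctor,|doctor,|hi, doctor[.,])\s*"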
    # Substitute the matched phrase with an empty string
    cleaned_question = re.sub(pattern, "", question, flags=re.IGNORECASE)
    return cleaned_question

# Process the data with ID generation and question cleanup
processed_data = []

for item in data:
    question = item.get("input", "")
    
    # Clean the question by removing the introductory phrases
    cleaned_question = clean_question(question)
    
    # Create a new dictionary with the desired key names and generated ID
    new_item = {
        "id": generate_id(cleaned_question),  # Generate a reproducible unique ID
        "question": cleaned_question,         # Use the cleaned question
        "answer": item.get("output", "")      # Rename 'output' to 'answer'
    }
    
    processed_data.append(new_item)

# 'processed_data' now contains the list with the cleaned questions and reproducible IDs
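
Because each ID is a SHA-256 hash of the cleaned question, two records whose questions clean to the same text will share an ID. The sketch below shows one way to detect that. It restates the two helpers with a simplified greeting pattern so that it runs on its own; `find_id_collisions` and `sample_questions` are hypothetical names and data, not part of the notebook.

```python
import hashlib
import re
from collections import defaultdict

# Simplified greeting pattern (an assumption; the notebook's pattern has more variants)
INTRO = re.compile(r"^(hi doctor,|hello doctor,|doctor,)\s*", re.IGNORECASE)


def clean_question(question):
    """Strip a leading greeting such as 'Doctor,' from the question."""
    return INTRO.sub("", question)


def generate_id(question):
    """Return the SHA-256 hex digest of the question text."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()


def find_id_collisions(questions):
    """Map each ID shared by more than one raw question to those questions."""
    by_id = defaultdict(list)
    for raw in questions:
        by_id[generate_id(clean_question(raw))].append(raw)
    return {key: raws for key, raws in by_id.items() if len(raws) > 1}


# Made-up questions: the first two differ only in their greeting
sample_questions = [
    "Doctor, I have a sore throat.",
    "Hello doctor, I have a sore throat.",
    "I think I fractured my hand.",
]
collisions = find_id_collisions(sample_questions)
for shared_id, raws in collisions.items():
    print(shared_id[:12], raws)
```

If collisions turn up in the real data, one option is to hash the question together with the answer, or to deduplicate before assigning IDs; which is appropriate depends on how the IDs are used downstream.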


In [19]:
processed_data

[{'id': 'e0d69b874ce0eea9a131edd3581d2678bf57c7e4be27cdb0e97cf976e82e3306',
  'question': "I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
  'answer': "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms."},
 {'id': '36ee867e1cda9611cb05b1e9833153b621a40b335837ce528aae9cd2669c412e',
  'question': "I have been experiencing a hoarse voice for a few weeks now and it's not getting any better despite taking medication. What could be the problem?",
  'answer': "It'

In [20]:
# Iterate over and print the first 100 questions
for i, item in enumerate(processed_data[:100]):
    print(f"Question {i+1}: {item['question']}")


Question 1: I have been experiencing sudden and frequent panic attacks. I don't know what to do.
Question 2: I have been experiencing a hoarse voice for a few weeks now and it's not getting any better despite taking medication. What could be the problem?
Question 3: I think I might have Turner syndrome. 
Question 4: I've been experiencing a condition called cryptorchidism. My testicles have not descended properly into the scrotum. What medical tests do I need to take?
Question 5: I think I've been poisoned. I drank some ethylene glycol by mistake. 
Question 6: I've been having discomfort and dryness in my vagina for a while now. I also experience pain during sex. What could be the problem and what tests do I need?
Question 7: I think I fractured my hand while playing basketball. It hurts a lot and I can't move it properly.
Question 8: I have a problem with my mouth. It's painful and there's swelling around it. It's difficult to eat or talk properly.
Question 9: I have been having troub

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform([item['question'] for item in processed_data])

# K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Assign labels to questions
for item, label in zip(processed_data, labels):
    item["topic"] = f"Cluster {label}"

# Optionally group by clusters
grouped_by_topic = {}
for item in processed_data:
    topic = item["topic"]
    if topic not in grouped_by_topic:
        grouped_by_topic[topic] = []
    grouped_by_topic[topic].append(item)


In [24]:
grouped_by_topic

{'Cluster 5': [{'id': 'e0d69b874ce0eea9a131edd3581d2678bf57c7e4be27cdb0e97cf976e82e3306',
   'question': "I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
   'answer': "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms.",
   'topic': 'Cluster 5'},
  {'id': '36ee867e1cda9611cb05b1e9833153b621a40b335837ce528aae9cd2669c412e',
   'question': "I have been experiencing a hoarse voice for a few weeks now and it's not getting any better despite taking medication. Wh

In [41]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 7.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [42]:
import spacy
from gensim import corpora, models
from spacy.lang.en.stop_words import STOP_WORDS

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Preprocessing
def preprocess(text):
    doc = nlp(text.lower())
    return [token.text for token in doc if token.is_alpha and token.text not in STOP_WORDS]

# Prepare the data
texts = [preprocess(item['question']) for item in processed_data]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA model
lda_model = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)

# Label questions based on the dominant topic
# (this overwrites the "Cluster N" labels that the K-Means cell stored under "topic")
for item, bow in zip(processed_data, corpus):
    topic_probs = lda_model.get_document_topics(bow)
    dominant_topic = max(topic_probs, key=lambda x: x[1])[0]
    item["topic"] = f"Topic {dominant_topic}"

# Optionally group by topics
grouped_by_topic = {}
for item in processed_data:
    topic = item["topic"]
    if topic not in grouped_by_topic:
        grouped_by_topic[topic] = []
    grouped_by_topic[topic].append(item)
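
Once every item carries a `topic` label, a quick sanity check is to count how many questions fall under each label, since a heavily skewed distribution can indicate that the number of topics is a poor fit. The sketch below uses only the standard library; `topic_sizes` and `sample_items` are hypothetical names and data, not part of the notebook.

```python
from collections import Counter


def topic_sizes(items, key="topic"):
    """Return (label, count) pairs sorted from the largest group to the smallest."""
    return Counter(item[key] for item in items).most_common()


# Made-up labeled items standing in for processed_data
sample_items = [
    {"question": "q1", "topic": "Topic 2"},
    {"question": "q2", "topic": "Topic 7"},
    {"question": "q3", "topic": "Topic 2"},
]
sizes = topic_sizes(sample_items)
for label, count in sizes:
    print(f"{label}: {count}")
```

In the notebook this could be called as `topic_sizes(processed_data)`, and it works equally for the `Cluster N` labels from K-Means and the `Topic N` labels from LDA.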


In [43]:
grouped_by_topic

{'Topic 2': [{'id': 'e0d69b874ce0eea9a131edd3581d2678bf57c7e4be27cdb0e97cf976e82e3306',
   'question': "I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
   'answer': "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms.",
   'topic': 'Topic 2'},
  {'id': '43ba6b70aa312c100dedeb573f18ca69701f9ec36bc73a0cb4be03a22f918a3f',
   'question': "I think I fractured my hand while playing basketball. It hurts a lot and I can't move it properly.",
   'answer': "I'm sorry 

In [44]:
import pandas as pd

# Assuming processed_data is a list of dictionaries
df = pd.DataFrame(processed_data)

# Optionally, you can inspect the first few rows
print(df.head())


                                                  id  \
0  e0d69b874ce0eea9a131edd3581d2678bf57c7e4be27cd...   
1  36ee867e1cda9611cb05b1e9833153b621a40b335837ce...   
2  32cf528cff3e2838a23d11f69f41dce6fe99ea74df59cd...   
3  a58f734a5703bc0753c864a5ddb2a813b0c98a23404710...   
4  3d622cdbc4710dbb3cb74bca76fea3e45dafe1ec9093bb...   

                                            question  \
0  I have been experiencing sudden and frequent p...   
1  I have been experiencing a hoarse voice for a ...   
2             I think I might have Turner syndrome.    
3  I've been experiencing a condition called cryp...   
4  I think I've been poisoned. I drank some ethyl...   

                                              answer    topic  
0  Well, based on what you're telling me, it soun...  Topic 2  
1  It's possible that you have a vocal cord polyp...  Topic 7  
2  Okay, in that case, you will need to undergo a...  Topic 1  
3  You will need to undergo a complete physical s...  Topic 1  
4  I'm

In [45]:
df

|      | id | question | answer | topic |
|------|----|----------|--------|-------|
| 0 | e0d69b874ce0eea9a131edd3581d2678bf57c7e4be27cd... | I have been experiencing sudden and frequent p... | Well, based on what you're telling me, it soun... | Topic 2 |
| 1 | 36ee867e1cda9611cb05b1e9833153b621a40b335837ce... | I have been experiencing a hoarse voice for a ... | It's possible that you have a vocal cord polyp... | Topic 7 |
| 2 | 32cf528cff3e2838a23d11f69f41dce6fe99ea74df59cd... | I think I might have Turner syndrome. | Okay, in that case, you will need to undergo a... | Topic 1 |
| 3 | a58f734a5703bc0753c864a5ddb2a813b0c98a23404710... | I've been experiencing a condition called cryp... | You will need to undergo a complete physical s... | Topic 1 |
| 4 | 3d622cdbc4710dbb3cb74bca76fea3e45dafe1ec9093bb... | I think I've been poisoned. I drank some ethyl... | I'm sorry to hear that. We need to run a few t... | Topic 6 |
| ... | ... | ... | ... | ... |
| 5447 | 6fd6d1e32675c12db324ce207e86d5f2708cbd999454de... | I have an open wound in my nose and I also fee... | Based on your symptoms, you may have a complic... | Topic 4 |
| 5448 | dfdd413ef45d59fdc24889ebd174de49967d6e82ced051... | I have an open wound in my nose and my knee fe... | Based on your symptoms, you may have a staph i... | Topic 2 |
| 5449 | 3d4b7c458c73ed29cd0e5ab241962050e09ac76850a08d... | I have an open wound on my nose and I'm experi... | Hot flashes are a common symptom of menopause,... | Topic 5 |
| 5450 | 92ed88d54d176f02b7cfca962efb8ad56a81caa1466c98... | I have a sore in my nose and I keep getting he... | It sounds like you have an open wound in your ... | Topic 6 |
