
### 📒 Step 1: Business Understanding

The aim of this project is to build a system that simplifies medical information retrieval for respiratory illnesses. Instead of relying on traditional machine learning models for prediction, the approach focuses on **data preprocessing and retrieval-based matching**.

The system will:

* Map **symptoms → disease → treatment** in a structured way.
* Convert the dataset into a clean, consistent CSV where each row represents a **unique disease–treatment pair with its associated symptoms**.
* Feed this structured data into a **vector database** (e.g., FAISS or Chroma) for efficient semantic search.
* Allow user queries (e.g., "shortness of breath and coughing") to be matched against the database, retrieving the most relevant disease and its treatment options.
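The matching step in the last bullet can be sketched without any external service. This is a minimal illustration only: it uses bag-of-words cosine similarity as a stand-in for real embeddings, and a plain Python list as a stand-in for FAISS or Chroma. The records, `bag_of_words`, `cosine`, and `retrieve` are hypothetical names and toy data, not part of the project code.

```python
import math
import re
from collections import Counter

# Toy records for illustration only; real rows come from the cleaned CSV.
RECORDS = [
    {"disease": "asthma", "treatment": "mepolizumab",
     "symptoms": "coughing, wheezing, shortness of breath, tight feeling in the chest"},
    {"disease": "influenza", "treatment": "oseltamivir",
     "symptoms": "high fever, muscle aches, chills, sore throat"},
    {"disease": "pneumonia", "treatment": "antibiotics",
     "symptoms": "fever, chest pain, rapid breathing"},
]

def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, records, top_k=1):
    """Return the top_k (score, record) pairs, best match first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(r["symptoms"])), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

best_score, best = retrieve("shortness of breath and coughing", RECORDS)[0]
print(best["disease"], "->", best["treatment"], f"(score={best_score:.2f})")
```

In the actual pipeline, an embedding model and a vector index would replace `bag_of_words` and the linear scan, but the query-to-record flow stays the same.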

**Why this approach?**

* Traditional predictive modeling is unnecessary here. Instead, the objective is **precise matching and retrieval** of knowledge.
* This method ensures **transparency and explainability**—users see exactly which symptoms map to which disease and treatment.
* It also supports cases where a disease has multiple treatments (each treatment is preserved as a separate entry).

This business framing highlights the project’s role as a **student-friendly, cost-free RAG (Retrieval-Augmented Generation) pipeline prototype** that prioritizes **usability, clarity, and reproducibility** over model complexity.



### 📒 Step 2: Data Understanding

In this step, we load the dataset into a Pandas DataFrame and perform an initial inspection. The goal is to understand:

* **Shape of the dataset** (number of rows and columns).
* **Column names** and what they represent.
* **Sample records** to see how symptoms, diseases, and treatments are structured.
* Whether the dataset contains missing values or irregularities that may affect preprocessing.

This understanding will guide the cleaning and preprocessing steps later.

In [1]:
# Import core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization style
sns.set(style="whitegrid")

# Load the dataset
file_path = "../data/respiratory symptoms and treatment.csv"  # adjust if needed
df = pd.read_csv(file_path)

# Display dataset structure
print("✅ Dataset loaded successfully\n")
print("Shape of dataset:", df.shape, "\n")
print("Columns:", df.columns.tolist(), "\n")
print("First 5 rows of the dataset:\n")
display(df.head())


✅ Dataset loaded successfully

Shape of dataset: (38537, 6) 

Columns: ['Symptoms', 'Age', 'Sex', 'Disease', 'Treatment', 'Nature'] 

First 5 rows of the dataset:



Unnamed: 0,Symptoms,Age,Sex,Disease,Treatment,Nature
0,coughing,5.0,female,Asthma,Omalizumab,high
1,tight feeling in the chest,4.0,female,Asthma,Mepolizumab,high
2,wheezing,6.0,male,Asthma,Mepolizumab,high
3,shortness of breath,7.0,male,Asthma,Mepolizumab,high
4,shortness of breath,9.0,male,Asthma,Mepolizumab,high



### 📒 Step 2: Data Understanding (Results Explanation)

The dataset has been successfully loaded, containing **38,537 rows** and **6 columns**. Let’s break down what we now know:

* **Columns:**

  * **Symptoms** → text descriptions of medical complaints (e.g., “coughing”, “shortness of breath”).
  * **Age** → numeric (float), representing the age of the individual.
  * **Sex** → categorical (male/female).
  * **Disease** → categorical, the diagnosed condition (e.g., Asthma).
  * **Treatment** → categorical, prescribed medication or intervention (e.g., Omalizumab, Mepolizumab).
  * **Nature** → categorical (e.g., “high”), likely severity of the disease or treatment context.

* **Observations from sample rows:**

  * Multiple rows can belong to the same disease, but with different symptoms and treatments.
  * Example: *Asthma* appears multiple times, with symptoms like *“coughing”*, *“wheezing”*, and *“shortness of breath”*, and treatments like *Omalizumab* and *Mepolizumab*.
  * This confirms our earlier hypothesis: the dataset is **symptom-level granular**, not yet aggregated into the disease–treatment–symptom format we want.

* **Next Steps (Data Cleaning):**

  * Check for missing values and duplicates.
  * Normalize text entries (consistent casing, spacing).
  * Verify unique disease–treatment pairs.
  * Decide how to handle numeric fields like **Age** and categorical fields like **Sex** and **Nature** since the project’s retrieval pipeline may or may not need them.

This exploration shows the dataset is rich but fragmented—we’ll need to restructure it so that **all symptoms for a disease–treatment pair live in a single row** for embedding and retrieval.
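A minimal sketch of that restructuring, assuming a `groupby` on `Disease` and `Treatment` is the chosen approach (the notebook has not committed to one yet). The toy frame copies the five sample rows shown above; `join_unique` is a hypothetical helper.

```python
import pandas as pd

# Toy rows copied from the sample output above (not the full dataset).
toy = pd.DataFrame({
    "Disease": ["Asthma"] * 5,
    "Treatment": ["Omalizumab", "Mepolizumab", "Mepolizumab",
                  "Mepolizumab", "Mepolizumab"],
    "Symptoms": ["coughing", "tight feeling in the chest", "wheezing",
                 "shortness of breath", "shortness of breath"],
})

def join_unique(values):
    """Join symptoms into one string, dropping repeats but keeping first-seen order."""
    return ", ".join(dict.fromkeys(values))

# One row per disease-treatment pair, with all of its symptoms collected.
aggregated = (
    toy.groupby(["Disease", "Treatment"], as_index=False)["Symptoms"]
       .agg(join_unique)
)
print(aggregated)
```

Each resulting row is a single text record that can later be embedded as one unit.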




### 📒 Step 3: Data Cleaning

The purpose of this step is to ensure the dataset is consistent and reliable before we transform it. Data quality issues—such as missing values, duplicate entries, or inconsistent text formatting—can cause errors when we later aggregate symptoms or feed the data into a vector database.

Key cleaning checks include:

1. **Missing Values** → identify whether any column has null/NaN values.
2. **Duplicates** → check for repeated records that may artificially inflate results.
3. **Text Normalization** → standardize symptom, disease, treatment, and categorical columns (lowercase, strip spaces).
4. **Column Inspection** → decide whether all columns are needed for retrieval or only a subset (likely: `Disease`, `Treatment`, `Symptoms`).


In [2]:
# Check for missing values
print("🔎 Missing Values per Column:\n", df.isnull().sum(), "\n")

# Check for duplicate rows
print("🔎 Number of duplicate rows:", df.duplicated().sum(), "\n")

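# Normalize text fields (Symptoms, Disease, Treatment, Sex, Nature)
# Caveat: astype(str) also converts NaN into the literal string "nan",
# so later isna()/dropna() calls no longer treat those entries as missing.
text_columns = ["Symptoms", "Disease", "Treatment", "Sex", "Nature"]
for col in text_columns:
    df[col] = df[col].astype(str).str.lower().str.strip()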

# Preview after normalization
print("✅ Preview after text normalization:\n")
display(df.head())


🔎 Missing Values per Column:
 Symptoms      696
Age           342
Sex           922
Disease       340
Treatment    2841
Nature       2190
dtype: int64 

🔎 Number of duplicate rows: 37634 

✅ Preview after text normalization:



Unnamed: 0,Symptoms,Age,Sex,Disease,Treatment,Nature
0,coughing,5.0,female,asthma,omalizumab,high
1,tight feeling in the chest,4.0,female,asthma,mepolizumab,high
2,wheezing,6.0,male,asthma,mepolizumab,high
3,shortness of breath,7.0,male,asthma,mepolizumab,high
4,shortness of breath,9.0,male,asthma,mepolizumab,high


In [3]:
# Drop rows missing critical columns: Disease or Symptoms
df = df.dropna(subset=["Disease", "Symptoms"])

# Re-check missing values after dropping
missing_after = df.isna().sum()

print("🔎 Missing values after dropping rows without Disease or Symptoms:")
print(missing_after)
print(f"\n✅ Remaining rows: {len(df)}")


🔎 Missing values after dropping rows without Disease or Symptoms:
Symptoms       0
Age          342
Sex            0
Disease        0
Treatment      0
Nature         0
dtype: int64

✅ Remaining rows: 38537



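* ⚠️ `Disease` and `Symptoms` report **no nulls**, but no rows were actually dropped: the count is still 38,537. The normalization cell cast every text value to a string, so the original gaps (340 in `Disease`, 696 in `Symptoms`) are now the literal text `nan`, which `dropna` does not remove.
* ⚠️ Only `Age` still reports **342 missing values**. This is acceptable for now, since `Age` is not the primary mapping key.
* ⚠️ `Sex`, `Treatment`, and `Nature` no longer report nulls for the same reason: their gaps persist as `nan` strings and reappear in the value counts below (for example, 922 `nan` entries in `Sex`).
* ✅ All **38,537 rows** are retained, but the `nan` placeholder rows still need to be handled before aggregation.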
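One caveat worth making concrete: the `nan` entries in the value counts further down suggest that the `astype(str)` call in the normalization cell turned missing values into the literal text `"nan"`, which `isna()` and `dropna()` do not treat as missing. The sketch below uses a hypothetical toy frame to show one way to restore real missing values, and a NaN-preserving alternative for the normalization itself. It is an illustration, not the notebook's actual fix.

```python
import numpy as np
import pandas as pd

# Toy frame in the state the normalization cell leaves it: gaps stored as text.
toy = pd.DataFrame({
    "Sex": ["male", "nan", "female"],
    "Treatment": ["nan", "oxygen", "inhaler"],
})
print("Nulls seen by pandas:", int(toy.isna().sum().sum()))

# Repair: turn the placeholder text back into real missing values.
restored = toy.replace("nan", np.nan)
print(restored.isna().sum())

# Alternative for the normalization step: the .str accessor leaves NaN
# untouched, so no cast to str is needed on text columns.
raw = pd.Series(["Male ", np.nan, " female"])
preserved = raw.str.lower().str.strip()
print(preserved.tolist())
```

With real missing values restored, `dropna(subset=["Disease", "Symptoms"])` would remove the affected rows as the cell intends.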


In [5]:
# Assign the cleaned dataset to df_clean for consistency
df_clean = df.copy()

# Inspect unique values in categorical columns
categorical_cols = ['Sex', 'Disease', 'Treatment', 'Nature']

for col in categorical_cols:
    print(f"\n🔎 Unique values in {col}:")
    print(df_clean[col].value_counts(dropna=False))



🔎 Unique values in Sex:
Sex
male          21256
female        15411
not to say      948
nan             922
Name: count, dtype: int64

🔎 Unique values in Disease:
Disease
pneumonia                                6144
bronchitis                               4925
chronic obstructive pulmonary disease    3888
mesothelioma                             3216
pneumothorax                             2880
bronchiolitis                            2650
chronic bronchitis                       2016
bronchiectasis                           1950
influenza                                1872
tuberculosis                             1680
pulmonary hypertension                   1680
asthma                                   1096
chronic cough                             912
sleep apnea                               864
respiratory syncytial virus               720
acute respiratory distress syndrome       696
asbestosis                                504
aspergillosis                             504


Before tackling the messier `Symptoms` column, we first resolve the **minor inconsistencies** in `Treatment`.
We build a **mapping dictionary** to unify values that mean the same thing but are spelled differently.

In [6]:
# Define mapping dictionary for Treatment standardization
treatment_mapping = {
    'antibiotic': 'antibiotics',
    'antibiotics.': 'antibiotics',
    'oxyzen': 'oxygen',
    'consult doctor': 'consult a doctor',
    'inhealer': 'inhaler'
}

# Apply the mapping
df_clean['Treatment'] = df_clean['Treatment'].replace(treatment_mapping)

# Verify results after standardization
print("🔎 Unique treatments after normalization:")
print(df_clean['Treatment'].value_counts().head(20))


🔎 Unique treatments after normalization:
Treatment
antibiotics                          9839
chemotherapy                         2928
isotonic sodium chloride solution    2880
nan                                  2841
consult a doctor                     2336
oseltamivir                          1872
saline nose drops                    1800
oxygen                               1704
diuretics                            1680
pulmonary rehabilitation             1104
cough medicine                        960
inhaler                               957
hypertonic saline                     850
adaptive servo-ventilation            816
ethambutol                            720
intravenous fluids                    720
steroids to reduce inflammation       672
x-ray                                 624
pyrazinamide                          528
surgery                               432
Name: count, dtype: int64


**Treatment** column is standardized:

* `antibiotic` + `antibiotics.` merged into **antibiotics** (9839 total)
* `oxyzen` corrected to **oxygen** (1704 total)
* `consult doctor` unified with **consult a doctor** (2336 total)
* `inhealer` corrected to **inhaler** (957 total)

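Treatment labels are now consistent and will not fragment later grouping or aggregation. Note that the 2,841 `nan` entries in the output above are placeholder strings for missing treatments and still need to be handled.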
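Spelling variants like these can also be surfaced automatically rather than by eye. The sketch below is one possible approach using the standard library's `difflib`; the `near_duplicates` helper and the `0.8` cutoff are assumptions chosen for this small example, and any suggested pair still needs manual review before it goes into the mapping.

```python
from difflib import get_close_matches

# A few treatment labels taken from the mapping and output above.
labels = ["antibiotics", "antibiotic", "antibiotics.", "oxygen", "oxyzen",
          "inhaler", "inhealer", "chemotherapy"]

def near_duplicates(values, cutoff=0.8):
    """Return sorted pairs of distinct labels whose similarity is >= cutoff."""
    pairs = set()
    for value in values:
        others = [v for v in values if v != value]
        for match in get_close_matches(value, others, n=3, cutoff=cutoff):
            pairs.add(tuple(sorted((value, match))))
    return sorted(pairs)

candidates = near_duplicates(labels)
for a, b in candidates:
    print(f"{a!r} <-> {b!r}")
```

A lower cutoff finds more candidates but also more false positives, so the threshold is worth tuning against the real label list.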

👉 Next step: **Symptoms cleanup**.
Unlike Treatments, Symptoms are free-text and will need:

1. **Lowercasing and stripping** (already done).
2. **Deduplication of similar terms** (e.g., “shortness of breath” vs “breathlessness”).
3. **Optional lemmatization** (e.g., “coughing” → “cough”).
4. Grouping similar variations into a **controlled vocabulary** for consistency.


In [8]:
# Count unique symptoms
num_unique_symptoms = df_clean['Symptoms'].nunique()
print(f"🔎 Number of unique symptoms: {num_unique_symptoms}")

# Preview some of them
print("\n✅ Sample of unique symptoms:")
print(df_clean['Symptoms'].unique())  # prints every unique symptom, not only a sample


🔎 Number of unique symptoms: 78

✅ Sample of unique symptoms:
['coughing' 'tight feeling in the chest' 'wheezing' 'shortness of breath'
 'fever' 'cold' 'allergy' 'coughing up yellow or green mucus daily'
 'shortness of breath that gets worse during flare-ups'
 'fatigue, feeling run-down or tired' 'chest pain'
 'whistling sound while you breathe' 'coughing up blood' 'runny nose'
 'stuffy nose' 'loss of appetite' 'cough' 'low-grade fever'
 'chest congestion' 'whistling sound while breathing' 'yellow cough'
 'feeling run-down or tired' 'mucus' 'nan' 'chronic cough' 'fatigue'
 'lower back pain' 'dry cough' 'greenish cough' 'cough with blood'
 'sweating' 'shaking' 'rapid breathing' 'shallow breathing' 'low energy'
 'nausea' 'vomiting' 'sharp chest pain' 'bluish skin' 'rapid heartbeat'
 'high fever' 'headache' 'muscle aches' 'joint pain' 'chills'
 'sore throat' 'nasal congestion' 'diarrhea' 'breath' 'dizziness'
 'fainting' 'heart palpitations' 'edema' 'snoring' 'daytime sleepiness'
 'pauses 

In [9]:
# List all unique symptoms
unique_symptoms = df_clean['Symptoms'].unique()

print("🔎 List of all unique symptoms:\n")
for i, symptom in enumerate(unique_symptoms, 1):
    print(f"{i}. {symptom}")


🔎 List of all unique symptoms:

1. coughing
2. tight feeling in the chest
3. wheezing
4. shortness of breath
5. fever
6. cold
7. allergy
8. coughing up yellow or green mucus daily
9. shortness of breath that gets worse during flare-ups
10. fatigue, feeling run-down or tired
11. chest pain
12. whistling sound while you breathe
13. coughing up blood
14. runny nose
15. stuffy nose
16. loss of appetite
17. cough
18. low-grade fever
19. chest congestion
20. whistling sound while breathing
21. yellow cough
22. feeling run-down or tired
23. mucus
24. nan
25. chronic cough
26. fatigue
27. lower back pain
28. dry cough
29. greenish cough
30. cough with blood
31. sweating
32. shaking
33. rapid breathing
34. shallow breathing
35. low energy
36. nausea
37. vomiting
38. sharp chest pain
39. bluish skin
40. rapid heartbeat
41. high fever
42. headache
43. muscle aches
44. joint pain
45. chills
46. sore throat
47. nasal congestion
48. diarrhea
49. breath
50. dizziness
51. fainting
52. heart palpitatio

The symptom list contains several kinds of inconsistency:

* **Duplicates / Variations**:

  * “cough” vs “coughing” vs “dry cough” vs “chronic cough” vs “cough with blood” vs “wheezing cough” vs “persistent dry coug” (typo).
  * “shortness of breath” vs “short of breath” vs “breath”.
  * “tight feeling in the chest” vs “chest tightness or chest pain” vs “chest pain”.
  * “whistling sound while you breathe” vs “whistling sound while breathing”.

* **Synonyms or near-duplicates**:

  * “fatigue” vs “feeling run-down or tired” vs “fatigue, feeling run-down or tired”.
  * “low energy” overlaps with “fatigue”.
  * “loss of appetite” vs “loss of appetite and unintentional weight loss” vs “weight loss from loss of appetite”.

* **Compound symptoms**:

  * “fatigue, feeling run-down or tired” should be split or reduced to “fatigue”.
  * “short, shallow and rapid breathing” overlaps with “rapid breathing” and “shallow breathing”.

* **Errors**:

  * “persistent dry coug” → typo, should be “persistent dry cough”.

The plan:

1. Build a **mapping dictionary** to collapse all these variations into standardized forms.
2. Keep **atomic, distinct symptoms** (like “edema” or “snoring”) separate.
3. Remove “nan” entries (they’re just missing values).
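The three steps above can be sketched on a toy series. This is a hedged illustration with a hypothetical three-entry `toy_map`; the full dictionary is defined in the next cell. The ordering matters: if `"nan"` were mapped to `None` and the result then filled from the original column, the `"nan"` text would simply come back, so the placeholder is converted to a real missing value first.

```python
import numpy as np
import pandas as pd

# Small excerpt-style mapping for illustration.
toy_map = {
    "coughing": "cough",
    "feeling run-down or tired": "fatigue",
    "persistent dry coug": "persistent dry cough",
}

symptoms = pd.Series(["coughing", "edema", "nan", "persistent dry coug"])

# 1) Turn the literal "nan" text back into a real missing value.
symptoms = symptoms.replace("nan", np.nan)

# 2) Map known variants; unmapped (atomic) symptoms fall back to themselves,
#    and missing values stay missing because the fallback is also missing.
standardized = symptoms.map(toy_map).fillna(symptoms)

print(standardized.tolist())
```

Rows whose standardized symptom is still missing can then be removed with `dropna` before aggregation.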

In [10]:
df_clean.head(20)  # Display the first few rows of the cleaned DataFrame

Unnamed: 0,Symptoms,Age,Sex,Disease,Treatment,Nature
0,coughing,5.0,female,asthma,omalizumab,high
1,tight feeling in the chest,4.0,female,asthma,mepolizumab,high
2,wheezing,6.0,male,asthma,mepolizumab,high
3,shortness of breath,7.0,male,asthma,mepolizumab,high
4,shortness of breath,9.0,male,asthma,mepolizumab,high
5,tight feeling in the chest,,male,asthma,mepolizumab,high
6,shortness of breath,,male,asthma,mepolizumab,high
7,tight feeling in the chest,8.0,female,asthma,mepolizumab,high
8,shortness of breath,36.0,female,asthma,mepolizumab,medium
9,wheezing,40.0,female,asthma,omalizumab,medium


In [13]:

# Normalize the raw symptom column
df_clean['symptom_clean'] = df_clean['Symptoms'].str.lower().str.strip()

# --- Step 1: Define a standardization dictionary ---
symptom_map = {
    # coughing & variants
    "coughing": "cough",
    "cough": "cough",
    "dry cough": "dry cough",
    "chronic cough": "chronic cough",
    "wheezing cough": "wheezing cough",
    "persistent dry coug": "persistent dry cough",  # spelling correction
    "persistent dry cough": "persistent dry cough",
    "a cough that lasts more than three weeks": "persistent cough",
    "yellow cough": "cough with yellow mucus",
    "greenish cough": "cough with green mucus",
    "cough with blood": "cough with blood",
    "coughing up blood": "cough with blood",
    "coughing up yellow or green mucus daily": "productive cough with yellow/green mucus",
    
    # breathing issues
    "shortness of breath": "shortness of breath",
    "short of breath": "shortness of breath",
    "breath": "shortness of breath",
    "shortness of breath that gets worse during flare-ups": "shortness of breath (worsens during flare-ups)",
    "rapid breathing": "rapid breathing",
    "shallow breathing": "shallow breathing",
    "short, shallow and rapid breathing": "rapid shallow breathing",
    "whistling sound while you breathe": "wheezing",
    "whistling sound while breathing": "wheezing",
    "wheezing": "wheezing",
    "a dry, crackling sound in the lungs while breathing in": "lung crackles",
    
    # chest-related
    "tight feeling in the chest": "chest tightness",
    "chest tightness or chest pain": "chest tightness/pain",
    "chest pain": "chest pain",
    "sharp chest pain": "sharp chest pain",
    "chest congestion": "chest congestion",
    
    # fever variants
    "fever": "fever",
    "low-grade fever": "low-grade fever",
    "high fever": "high fever",
    
    # fatigue & low energy
    "fatigue": "fatigue",
    "fatigue, feeling run-down or tired": "fatigue",
    "feeling run-down or tired": "fatigue",
    "low energy": "fatigue",
    
    # nose/throat
    "runny nose": "runny nose",
    "stuffy nose": "nasal congestion",
    "nasal congestion": "nasal congestion",
    "sore throat": "sore throat",
    
    # appetite/weight
    "loss of appetite": "loss of appetite",
    "loss of appetite and unintentional weight loss": "loss of appetite & weight loss",
    "weight loss": "weight loss",
    "weight loss from loss of appetite": "weight loss from appetite loss",
    
    # systemic / flu-like
    "cold": "cold",
    "allergy": "allergy",
    "muscle aches": "muscle aches",
    "joint pain": "joint pain",
    "headache": "headache",
    "morning headaches": "morning headaches",
    "chills": "chills",
    "sweating": "sweating",
    "shaking": "shaking",
    "night sweats": "night sweats",
    
    # GI issues
    "nausea": "nausea",
    "vomiting": "vomiting",
    "diarrhea": "diarrhea",
    
    # cardio / circulation
    "rapid heartbeat": "rapid heartbeat",
    "faster heart beating": "rapid heartbeat",
    "heart palpitations": "heart palpitations",
    "edema": "edema",
    "bluish skin": "cyanosis",
    "dizziness": "dizziness",
    "fainting": "fainting",
    
    # sleep & cognitive
    "snoring": "snoring",
    "daytime sleepiness": "daytime sleepiness",
    "pauses in breathing": "sleep apnea episodes",
    "frequently waking": "frequent waking",
    "dry mouth": "dry mouth",
    "difficulties with memory and concentration": "memory/concentration problems",
    "unusual moodiness": "mood changes",
    "irritability": "irritability",
    
    # rare
    "wider and rounder than normal fingertips and toes": "clubbing",
    "distressing": "distress",
    "pain": "pain",
    "nan": None  # intended as missing, but the fillna() in Step 2 restores the "nan" text
}

# --- Step 2: Apply mapping ---
df_clean['symptom_standardized'] = (
    df_clean['symptom_clean'].map(symptom_map).fillna(df_clean['symptom_clean'])
)

# --- Step 3: Check results ---
print(
    df_clean[['Symptoms', 'symptom_standardized']]
    .drop_duplicates()
    .sort_values(by='symptom_standardized')
)


                              Symptoms            symptom_standardized
35                             allergy                         allergy
231                   chest congestion                chest congestion
69                          chest pain                      chest pain
1           tight feeling in the chest                 chest tightness
980      chest tightness or chest pain            chest tightness/pain
..                                 ...                             ...
981  weight loss from loss of appetite  weight loss from appetite loss
77   whistling sound while you breathe                        wheezing
2                             wheezing                        wheezing
233    whistling sound while breathing                        wheezing
899                     wheezing cough                  wheezing cough

[78 rows x 2 columns]

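The table above still lists 78 raw symptoms, so it does not show how much the mapping actually consolidated. A quick check, sketched here on a hypothetical excerpt of the mapping and symptom column, reports which raw values have no dictionary entry and how many distinct values the mapping removed.

```python
import pandas as pd

# Excerpt-style inputs for illustration; values are taken from the lists above.
toy_map = {"coughing": "cough", "stuffy nose": "nasal congestion"}
symptoms = pd.Series(["coughing", "cough", "stuffy nose", "snoring"])

# Raw values that fall through to the fillna() fallback unchanged.
unmapped = sorted(set(symptoms.dropna()) - set(toy_map))

# How many distinct values the mapping merged away.
standardized = symptoms.map(toy_map).fillna(symptoms)
collapsed = symptoms.nunique() - standardized.nunique()

print("Unmapped symptoms:", unmapped)
print("Distinct values removed by the mapping:", collapsed)
```

Reviewing the unmapped list is a cheap way to catch typos or variants the dictionary missed.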

In [14]:
df_clean.head(20)  # Display the first few rows of the cleaned DataFrame

Unnamed: 0,Symptoms,Age,Sex,Disease,Treatment,Nature,symptom_clean,symptom_standardized
0,coughing,5.0,female,asthma,omalizumab,high,coughing,cough
1,tight feeling in the chest,4.0,female,asthma,mepolizumab,high,tight feeling in the chest,chest tightness
2,wheezing,6.0,male,asthma,mepolizumab,high,wheezing,wheezing
3,shortness of breath,7.0,male,asthma,mepolizumab,high,shortness of breath,shortness of breath
4,shortness of breath,9.0,male,asthma,mepolizumab,high,shortness of breath,shortness of breath
5,tight feeling in the chest,,male,asthma,mepolizumab,high,tight feeling in the chest,chest tightness
6,shortness of breath,,male,asthma,mepolizumab,high,shortness of breath,shortness of breath
7,tight feeling in the chest,8.0,female,asthma,mepolizumab,high,tight feeling in the chest,chest tightness
8,shortness of breath,36.0,female,asthma,mepolizumab,medium,shortness of breath,shortness of breath
9,wheezing,40.0,female,asthma,omalizumab,medium,wheezing,wheezing


In [15]:
# Check total number of rows in df_clean
print(f"Number of rows in df_clean: {len(df_clean)}")


Number of rows in df_clean: 38537
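The row count is unchanged even though the earlier check reported 37,634 duplicate rows. One possible follow-up, sketched here on hypothetical toy data, is to deduplicate on the columns that matter for retrieval. Whether `Age`, `Sex`, and `Nature` should be part of the key is a design decision the notebook has not made yet, so the choice of `key_cols` is an assumption.

```python
import pandas as pd

# Toy rows for illustration; Age differs but the retrieval columns repeat.
toy = pd.DataFrame({
    "Disease": ["asthma"] * 4,
    "Treatment": ["mepolizumab", "mepolizumab", "omalizumab", "omalizumab"],
    "symptom_standardized": ["wheezing", "wheezing", "cough", "wheezing"],
    "Age": [6.0, 40.0, 5.0, 40.0],
})

# Assumed key: only the columns used for symptom -> disease -> treatment mapping.
key_cols = ["Disease", "Treatment", "symptom_standardized"]
deduped = toy.drop_duplicates(subset=key_cols).reset_index(drop=True)

print(f"{len(toy)} rows -> {len(deduped)} unique disease-treatment-symptom rows")
```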
