# 🧬 Unify Thyroid Datasets

This notebook merges three thyroid-related datasets into a single unified CSV/Excel file.

**Datasets used:**
- `thyroidDF.csv`
- `hypothyroid.csv`
- `thyroid_cancer_risk_data.csv`

## 1. Imports

In [1]:
import pandas as pd
from pathlib import Path

## 2. Define Directory Paths

The notebook lives inside `notebooks/`, so we go one level up to reach `data/`.

In [2]:
BASE_DIR = Path("..")
DATA_DIR = BASE_DIR / "data"
OUT_DIR  = DATA_DIR / "Unified Dataset"

OUT_DIR.mkdir(parents=True, exist_ok=True)

print("Data directory:  ", DATA_DIR)
print("Output directory:", OUT_DIR)

Data directory:   ..\data
Output directory: ..\data\Unified Dataset
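
`Path("..")` is resolved against the kernel's current working directory, so the paths above point at `data/` only when the notebook is run from inside `notebooks/`. A minimal sketch of a pre-flight check, assuming the three file names listed at the top; the helper `missing_files` and the `EXPECTED` list are hypothetical names introduced here for illustration:

```python
from pathlib import Path


def missing_files(data_dir, names):
    """Return the entries of `names` that are not regular files in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


# Hypothetical list, copied from the "Datasets used" section.
EXPECTED = ["thyroidDF.csv", "hypothyroid.csv", "thyroid_cancer_risk_data.csv"]

data_dir = Path("..") / "data"
absent = missing_files(data_dir, EXPECTED)
if absent:
    # resolve() shows the absolute location that was actually searched
    print("Missing from", data_dir.resolve(), ":", absent)
```

Running this before the load cell turns a wrong working directory into a readable message instead of a `FileNotFoundError` from `pd.read_csv`.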


## 3. Load the Three Datasets

In [3]:
df1 = pd.read_csv(DATA_DIR / "thyroidDF.csv")
df2 = pd.read_csv(DATA_DIR / "hypothyroid.csv")
df3 = pd.read_csv(DATA_DIR / "thyroid_cancer_risk_data.csv")

print("thyroidDF   :", df1.shape)
print("hypothyroid :", df2.shape)
print("cancer_risk :", df3.shape)

thyroidDF   : (9172, 31)
hypothyroid : (3770, 30)
cancer_risk : (212691, 17)


## 4. Preprocess & Unify

Steps performed here:
1. Tag each row with its source dataset.
2. Normalise all column names (lowercase + strip whitespace).
3. Rename each dataset's target column to a common `class` column.
4. Reindex all three dataframes to the full union of columns; any column a dataset lacks is filled with `NaN`.
5. Concatenate into a single `final_df`.

In [4]:
# --- Step 1: Add source label (must happen before reindex) ---
df1["source"] = "thyroidDF"
df2["source"] = "hypothyroid"
df3["source"] = "cancer_risk"

# --- Step 2: Normalise column names ---
for df in (df1, df2, df3):
    df.columns = df.columns.str.lower().str.strip()

# --- Step 3: Rename target columns to 'class' ---
df1 = df1.rename(columns={"target": "class"})
df2 = df2.rename(columns={"binaryclass": "class"})
df3 = df3.rename(columns={"thyroid_cancer_risk": "class"})

# --- Step 4: Build union of all columns and reindex ---
# "source" was added in Step 1, so it is already in every column set.
all_cols = sorted(set(df1.columns) | set(df2.columns) | set(df3.columns))

df1 = df1.reindex(columns=all_cols)
df2 = df2.reindex(columns=all_cols)
df3 = df3.reindex(columns=all_cols)

# --- Step 5: Concatenate ---
final_df = pd.concat([df1, df2, df3], ignore_index=True)

print("Final shape:", final_df.shape)
print(final_df["source"].value_counts())

Final shape: (225633, 61)
source
cancer_risk    212691
thyroidDF        9172
hypothyroid      3770
Name: count, dtype: int64
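
The five steps can also be tried on toy data to see how the reindex fills gaps. The sketch below is a hypothetical helper (`unify`, `toy_a`, `toy_b` are names introduced here, not part of the notebook) that applies the same tag, normalise, rename, reindex, and concatenate sequence to two tiny frames:

```python
import pandas as pd


def unify(frames, target_cols):
    """Stack frames on the sorted union of their columns.

    frames:      dict mapping a source label to a DataFrame
    target_cols: dict mapping the same labels to each frame's
                 lowercased target column name
    """
    prepared = []
    for label, frame in frames.items():
        frame = frame.copy()                      # leave the caller's data untouched
        frame["source"] = label                   # step 1: tag the source
        frame.columns = frame.columns.str.lower().str.strip()        # step 2
        frame = frame.rename(columns={target_cols[label]: "class"})  # step 3
        prepared.append(frame)

    # step 4: sorted union of all column names, then reindex each frame
    all_cols = sorted(set().union(*(f.columns for f in prepared)))
    prepared = [f.reindex(columns=all_cols) for f in prepared]

    # step 5: stack the rows
    return pd.concat(prepared, ignore_index=True)


toy_a = pd.DataFrame({"Age": [30, 41], "Target": ["P", "N"]})
toy_b = pd.DataFrame({" TSH ": [1.2], "binaryClass": ["P"]})

toy = unify({"a": toy_a, "b": toy_b}, {"a": "target", "b": "binaryclass"})
print(toy)
```

`pd.concat` already performs an outer join on columns by default, so the explicit reindex mainly serves to fix a deterministic, sorted column order.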


## 5. Quick Quality Check

In [5]:
print("Has 'class'  column:", "class"  in final_df.columns)
print("Has 'source' column:", "source" in final_df.columns)
print("\nTop 10 columns by % missing values:")
display(
    (final_df.isna().mean() * 100)
    .sort_values(ascending=False)
    .head(10)
)

Has 'class'  column: True
Has 'source' column: True

Top 10 columns by % missing values:


ftimeasured                98.329145
onantithyroidmedication    98.329145
queryhyperthyroid          98.329145
i131treatment              98.329145
tbgmeasured                98.329145
t4umeasured                98.329145
thyroxine                  98.329145
thyroidsurgery             98.329145
t3measured                 98.329145
t4                         98.329145
dtype: float64
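
All ten columns share the same missing rate, 98.329145 %, which equals (212691 + 9172) / 225633, the share of rows that do not come from `hypothyroid`. That is consistent with these columns being populated only for `hypothyroid` rows. A sketch of a per-source breakdown that makes this kind of pattern visible; the helper name `missing_pct_by_source` is an assumption introduced here, and it is shown on a toy frame rather than on `final_df`:

```python
import pandas as pd


def missing_pct_by_source(df, source_col="source"):
    """Percentage of missing values per column, one row per source label."""
    is_missing = df.drop(columns=[source_col]).isna()
    # Group the boolean mask by the source label; the mean of booleans
    # is the fraction of True values, i.e. the fraction missing.
    return is_missing.groupby(df[source_col]).mean() * 100


demo = pd.DataFrame(
    {
        "source": ["a", "a", "b"],
        "x": [1.0, None, None],
        "y": [None, None, 3.0],
    }
)
report = missing_pct_by_source(demo)
print(report)
```

Applied to the unified frame, `missing_pct_by_source(final_df)` would give one row per dataset, so a column that is 100 % missing in two sources and well populated in the third stands out immediately.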

## 6. Save the Unified Dataset

Export to both **CSV** and **Excel** formats.

In [6]:
# --- CSV ---
csv_path = OUT_DIR / "unified_dataset.csv"
final_df.to_csv(csv_path, index=False)
print("✅ CSV saved :", csv_path)


✅ CSV saved : ..\data\Unified Dataset\unified_dataset.csv
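
The cell shown above writes only the CSV file. A minimal sketch of what the Excel half could look like, assuming the `openpyxl` engine is installed; the helper names and the output file name `unified_dataset.xlsx` are hypothetical. An `.xlsx` worksheet holds at most 1,048,576 rows and 16,384 columns, so the sketch checks the size before writing:

```python
import pandas as pd

# Worksheet limits of the .xlsx format.
XLSX_MAX_ROWS = 1_048_576
XLSX_MAX_COLS = 16_384


def fits_in_excel_sheet(df):
    """True if df plus one header row fits on a single .xlsx worksheet."""
    n_rows, n_cols = df.shape
    return n_rows + 1 <= XLSX_MAX_ROWS and n_cols <= XLSX_MAX_COLS


def save_excel(df, path):
    """Write df to `path` as .xlsx (assumes openpyxl is available)."""
    if not fits_in_excel_sheet(df):
        raise ValueError("DataFrame is too large for one Excel worksheet")
    df.to_excel(path, index=False)
```

In the notebook this would be called as `save_excel(final_df, OUT_DIR / "unified_dataset.xlsx")`. At 225,633 rows and 61 columns the unified frame is within both limits.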
