1 Dataset overview

This dataset comprises all documented cases of thyroid hydatid cyst (THC) reported worldwide. 
It includes clinical and demographic data from a total of 75 patients.



2 Loading libraries and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.read_csv('thyroid_hydatid_cyst.csv')

In [3]:
#preview rows and shape of data
print(df.info())
print('-----------------------------')
print(df.head())
print('-----------------------------')
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patient_id         75 non-null     float64
 1   year               75 non-null     float64
 2   rural              75 non-null     object 
 3   gender             75 non-null     object 
 4   age                75 non-null     object 
 5   country            75 non-null     object 
 6   duration_years     75 non-null     object 
 7   symptom_type       75 non-null     object 
 8   cyst_size_mm       75 non-null     object 
 9   Serology           75 non-null     object 
 10  location           75 non-null     object 
 11  Treatment          75 non-null     object 
 12  follow_up_outcome  75 non-null     object 
dtypes: float64(2), object(11)
memory usage: 7.7+ KB
None
-----------------------------
   patient_id    year rural gender age country duration_years  \
0         1.0  2023.0   

3 Initial sanity check

Check the missing values per column, the number of 'N/d' placeholders per column, the column data types, and the summary statistics.

In [4]:
print(df.isnull().sum())
print('----------------------')
print((df=='N/d').sum())
print('----------------------')
print(df.info())
print('----------------------')
print(df.describe())

patient_id           0
year                 0
rural                0
gender               0
age                  0
country              0
duration_years       0
symptom_type         0
cyst_size_mm         0
Serology             0
location             0
Treatment            0
follow_up_outcome    0
dtype: int64
----------------------
patient_id            0
year                  0
rural                42
gender                1
age                   1
country               0
duration_years       38
symptom_type          2
cyst_size_mm         39
Serology             44
location              2
Treatment             1
follow_up_outcome    36
dtype: int64
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patient_id         75 non-null     float64
 1   year               75 non-null     float64
 2   rural            
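
The count above matches 'N/d' exactly, so an entry padded with spaces would be missed. A minimal sketch of a whitespace-tolerant count follows; the toy frame and the helper name `count_placeholder` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame; the real columns and values may differ.
toy = pd.DataFrame({
    'rural': ['1', 'N/d', ' N/d '],
    'age': ['34', 'N/d ', '27'],
})

def count_placeholder(frame, token='N/d'):
    """Count cells equal to `token` after trimming surrounding spaces."""
    return frame.apply(
        lambda col: col.astype(str).str.strip().eq(token).sum()
    )

exact = (toy == 'N/d').sum()       # exact string match only
tolerant = count_placeholder(toy)  # also catches ' N/d ' and 'N/d '
print(exact)
print(tolerant)
```

If the two counts differ on the real data, some placeholders carry stray whitespace.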

4 Data cleaning

4.1 Replacing missing data

Some source papers did not report certain fields. These are recorded as 'N/d' in the dataset and are replaced with NaN here.

In [5]:
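#replace the 'N/d' placeholder (ignoring surrounding spaces) with NaN
#DataFrame.map needs pandas 2.1 or later; older versions use applymap
df=df.map(lambda x: np.nan if isinstance(x,str) and x.strip()=='N/d' else x)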
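
As an alternative, the placeholder can be declared at load time through the `na_values` parameter of `pd.read_csv`. The sketch below uses a hypothetical three-row extract in place of the real file; entries padded with stray spaces may not match and would still need the replacement step above.

```python
import io
import pandas as pd

# Hypothetical three-row extract standing in for thyroid_hydatid_cyst.csv.
raw = io.StringIO(
    "patient_id,age,cyst_size_mm\n"
    "1,34,N/d\n"
    "2,N/d,40\n"
    "3,27,25\n"
)

# 'N/d' is parsed as NaN, so the numeric columns load as float64 directly.
toy = pd.read_csv(raw, na_values=['N/d'])
print(toy.dtypes)
print(toy.isna().sum())
```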

4.2 Convert numeric columns

Convert numeric columns to float64/int64 for statistical analysis.

In [6]:
cols_to_float=['age','duration_years','cyst_size_mm']
df[cols_to_float]=df[cols_to_float].astype('float64')

In [7]:
cols_to_int=['patient_id','year']
df[cols_to_int]=df[cols_to_int].astype('int64')
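
`astype` raises an error if a value cannot be parsed, and `astype('int64')` fails on NaN. A minimal sketch of a more forgiving conversion follows, assuming hypothetical toy values: `pd.to_numeric` with `errors='coerce'` turns unparsable entries into NaN, and the nullable `Int64` dtype keeps integers alongside missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 'about 30' stands in for a free-text entry.
toy = pd.DataFrame({
    'age': ['34', 'about 30', np.nan],
    'year': [2023.0, 2019.0, np.nan],
})

# Unparsable strings become NaN instead of raising an error.
toy['age'] = pd.to_numeric(toy['age'], errors='coerce')
# Nullable integer dtype: whole numbers stay integers, gaps become <NA>.
toy['year'] = toy['year'].astype('Int64')
print(toy.dtypes)
```

In this dataset `patient_id` and `year` have no missing values, so the plain `int64` cast above is sufficient.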

4.3 Convert binary columns

The rural, gender, Serology, and follow_up_outcome columns are cast to numeric 0/1 flags. Treatment, location, and symptom_type are split into separate binary indicator columns.

In [8]:
#male=1 female=0
df['gender']=df['gender'].astype('float64')
#rural=1 urban=0
df['rural']=df['rural'].astype('float64')
#positive= 1 negative=0
df['Serology']=df['Serology'].astype('float64')
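#Treatment codes: 1=surgery, 0=medication, '0/1'=both
surgery=[]
medication=[]
for x in df['Treatment']:
    if x=='1':
        surgery.append(1)
        medication.append(0)
    elif x=='0':
        surgery.append(0)
        medication.append(1)
    elif x=='0/1':
        surgery.append(1)
        medication.append(1)
    else:
        #missing or unrecognized code: append NaN so the lists stay aligned with the rows
        surgery.append(np.nan)
        medication.append(np.nan)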
df['medication']=medication
df['surgery']=surgery
#drop treatment column
df.drop(columns=['Treatment'],inplace=True)
#no recurrence=0 recurrence=1
df['follow_up_outcome']=df['follow_up_outcome'].astype('float64')
#location codes: 0=right lobe of thyroid, 1=left lobe of thyroid, '0/1'=both lobes
right_lobe=[]
left_lobe=[]
for loc in df['location']:
    if loc=='0':
        right_lobe.append(1)
        left_lobe.append(0)
    elif loc=='1':
        right_lobe.append(0)
        left_lobe.append(1)
    elif loc=='0/1':
        right_lobe.append(1)
        left_lobe.append(1)
    else:
        #missing or unrecognized code: append NaN so the lists stay aligned with the rows
        right_lobe.append(np.nan)
        left_lobe.append(np.nan)
df['right_lobe']=right_lobe
df['left_lobe']=left_lobe
#drop string location column
df.drop(columns=['location'],inplace=True)
#Symptoms column split into multiple binary columns: 1 = symptom present, 0 = symptom absent
df['has_mass']=df['symptom_type'].str.contains('mass').astype('float64')
df['has_pulmonary']=df['symptom_type'].str.contains('pulmonary').astype('float64')
df['has_digestive']=df['symptom_type'].str.contains('digestive').astype('float64')
df.drop(columns=['symptom_type'],inplace=True)
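
The explicit loops over Treatment and location could also be written without a loop. The sketch below is one possible vectorized version using `Series.str.get_dummies`, shown on hypothetical toy values that follow the same coding ('1', '0', '0/1'); it is an illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values using the same coding as the Treatment column.
treatment = pd.Series(['1', '0/1', np.nan, '0'])

# One 0/1 column per code; reindex guarantees both columns exist
# even when one of the codes never occurs in the data.
flags = (
    treatment.str.get_dummies(sep='/')
    .reindex(columns=['0', '1'], fill_value=0)
    .rename(columns={'0': 'medication', '1': 'surgery'})
    .astype('float64')
)
# get_dummies returns all zeros for missing input, so restore NaN there.
flags.loc[treatment.isna(), :] = np.nan
print(flags)
```

The same pattern would apply to location, with the columns renamed to right_lobe and left_lobe.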

4.4 Clean string data

In [9]:
#trim surrounding whitespace, then title-case the country names
df['country']=df['country'].str.strip().str.title()
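
Title-casing capitalizes every word and lowercases the rest, which can distort acronyms and names containing small words. The sketch below illustrates this on hypothetical country strings (not taken from the dataset) and applies a hand-written override map, `fixes`, which is an assumption introduced here.

```python
import pandas as pd

# Hypothetical raw values; the dataset's actual country names may differ.
countries = pd.Series([' turkey', 'IRAN ', 'usa', 'bosnia and herzegovina'])

cleaned = countries.str.strip().str.title()
print(cleaned.tolist())
# ['Turkey', 'Iran', 'Usa', 'Bosnia And Herzegovina']

# Hypothetical manual overrides for names that title-casing gets wrong.
fixes = {'Usa': 'USA', 'Bosnia And Herzegovina': 'Bosnia and Herzegovina'}
cleaned = cleaned.replace(fixes)
print(cleaned.tolist())
```

Running `df['country'].value_counts()` after cleaning is a quick way to spot spellings that still need an override.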

5 Final sanity check

Ensure all columns are correctly formatted and no invalid values remain.

In [10]:
print(df.isna().sum())
print('----------------------')
print(df.info())
print('----------------------')
print(df.describe())
        

patient_id            0
year                  0
rural                42
gender                1
age                   1
country               0
duration_years       38
cyst_size_mm         39
Serology             44
follow_up_outcome    36
medication            1
surgery               1
right_lobe            2
left_lobe             2
has_mass              2
has_pulmonary         2
has_digestive         2
dtype: int64
----------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patient_id         75 non-null     int64  
 1   year               75 non-null     int64  
 2   rural              33 non-null     float64
 3   gender             74 non-null     float64
 4   age                74 non-null     float64
 5   country            75 non-null     object 
 6   duration_years     37 non-null     float64
 7   cyst_size_mm    
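
The printed summaries have to be read by eye. A minimal sketch of an automated check follows, confirming that indicator columns hold only 0, 1, or NaN. The helper name `non_binary_columns` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def non_binary_columns(frame, columns):
    """Return the columns that hold anything other than 0, 1, or NaN."""
    bad = []
    for name in columns:
        values = frame[name].dropna()
        if not values.isin([0, 1]).all():
            bad.append(name)
    return bad

# Hypothetical toy frame with one deliberately invalid value (2.0).
toy = pd.DataFrame({
    'surgery': [1.0, 1.0, np.nan],
    'has_mass': [1.0, 0.0, 2.0],
})
print(non_binary_columns(toy, ['surgery', 'has_mass']))  # ['has_mass']
```

On the cleaned frame, the call would list the binary columns created in section 4.3, and an empty result would indicate that no unexpected codes remain.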

6 Save cleaned dataset

The cleaned dataset is saved to a new CSV file for further analysis.

In [11]:
df.to_csv('thyroid_hydatid_cyst_cleaned.csv', index=False,encoding='utf-8')
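
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below does this on a hypothetical toy frame written to a temporary directory; the file name `toy_cleaned.csv` is illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the cleaned data.
toy = pd.DataFrame({
    'patient_id': [1, 2],
    'age': [34.0, np.nan],
    'country': ['Turkey', 'Iran'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_cleaned.csv')
    toy.to_csv(path, index=False, encoding='utf-8')
    reloaded = pd.read_csv(path)

# Raises an AssertionError if values, dtypes, or column order differ.
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.shape)
```

Missing values are written as empty fields and come back as NaN, so the float columns keep their dtype across the round trip.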