Work done by: Savelii Shaposhnyk (50%),  Yurii Huziienko (50%)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.stats.power import TTestIndPower
import re

In [None]:
station = pd.read_csv("./094/station.csv", sep='\t', engine="python")
patient = pd.read_csv("./094/patient.csv", sep='\t', engine="python")
observation = pd.read_csv("./094/observation.csv", sep='\t', engine="python")

## 1.1 A 

### Analysis of data structures

### Initial rows of datasets


In [None]:
station.head()

In [None]:
observation.head()

In [None]:
patient.head()

### Statistics and information about datasets

In [None]:
patient.info()
patient.describe(include='all')

In [None]:
station.info()
station.describe(include='all')

In [None]:
observation.info()
observation.describe(include='all')

### Data type, shape and missing values

In [None]:
def datatypes_counts(df): 
    return df.dtypes.astype(str).value_counts().to_dict()

In [None]:
patient_smm = {
    "Number_of_lines": patient.shape[0],
    "Number_of_columns": patient.shape[1],
    "Data_types": datatypes_counts(patient),
    "Missing_total": int(patient.isna().sum().sum())
}
patient_smm

In [None]:
station_smm = {
    "Number_of_lines": station.shape[0],
    "Number_of_columns": station.shape[1],
    "Data_types": datatypes_counts(station),
    "Missing_total": int(station.isna().sum().sum())
}
station_smm

In [None]:
observation_smm = {
    "Number_of_lines": observation.shape[0],
    "Number_of_columns": observation.shape[1],
    "Data_types": datatypes_counts(observation),
    "Missing_total": int(observation.isna().sum().sum())
}
observation_smm

### Data analysis – records and attributes
| File                |   Rows | Columns | Data types                    | Missing values |
| :------------------ | -----: | ------: | :---------------------------- | -------------: |
| **patient.csv**     |  2 102 |      13 | 10×object, 2×int64, 1×float64 |          3 993 |
| **station.csv**     |    798 |       6 | 4×object, 2×float64           |              0 |
| **observation.csv** | 12 081 |      23 | 23×float64                    |              0 |



### Analysis of missing values
Based on the missing-value totals listed above, we now check which attributes are affected.

In [None]:
missing_percento = (patient.isna().sum() / len(patient) * 100).sort_values(ascending=False).round(2)
missing_percento.head(10)

The percentages show that some attributes contain a significant share of empty records. The findings for all three datasets are summarized below.

Dataset: patient.csv
- Number of records: 2102
- Number of attributes: 13
- Data types: object, int64, float64
- Missing values total = 3993  
  - Most missing: residence (100%), job (70%), address (15%), current_location (5%).  
- Description: contains demographic information about patients and a link to the station (station_ID).  
Cannot be directly linked to the station file.

Dataset: station.csv
- Number of records: 798
- Number of attributes: 6
- Data types: object, float64  
- Missing values: 0 
- Description: contains information about measuring stations — station, latitude, longitude etc.  
- Observation: revision values use different date formats, so normalization is required in further steps.
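
The date normalization mentioned for `revision` could look roughly like the sketch below. The sample strings are invented for illustration and are not rows from station.csv, and `format="mixed"` assumes pandas 2.0 or newer.

```python
import pandas as pd

# Hypothetical examples of differently formatted dates (not real rows from station.csv)
raw = pd.Series(["2021-03-05", "05 Mar 2021", "2021/03/05 14:30:00"])

# format="mixed" lets pandas infer the format of each element separately
parsed = pd.to_datetime(raw, format="mixed")

# Normalize to midnight so that only the calendar date is compared
dates_only = parsed.dt.normalize()
print(dates_only.dt.strftime("%Y-%m-%d").tolist())
```

Ambiguous day/month strings such as `03/05/2021` would still need an explicit convention, which this sketch does not address.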

Dataset: observation.csv
- Number of records: 12,081
- Number of attributes: 23  
- Data types: float64 
- Missing values: 0  
- Target variable: `oximetry` (oxygen saturation flag, 0/1).
- Attributes: SpO₂, HR, Skin Temperature, BP, CO, FiO₂, etc.

## 1.1 B

### Chosen attributes

In [None]:
skin_temp = observation["Skin Temperature"]
spo = observation["SpO₂"]
hr =  observation["HR"]
pi = observation["PI"]
rr = observation["RR"]  
prv = observation["PRV"]
bp = observation["BP"]
pvi = observation["PVI"]
sv = observation["SV"]
co = observation["CO"]

### General statistics and information about attributes

In [None]:
cols = ["Skin Temperature", "SpO₂", "HR", "PI", "RR", "PRV", "BP", "PVI", "SV", "CO"]
observation[cols].describe()

### Visualisation of attributes

In [None]:
for col in cols:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(observation[col], bins=30, kde=True)
    plt.title(f'{col}')
    plt.xlabel(f'{col}')
    
    plt.subplot(1, 2, 2)
    plt.boxplot(observation[col])
    plt.title(f'{col}')
    plt.xlabel(f'{col}')
    plt.show()

Almost every chosen attribute appears approximately normally distributed, and the boxplots show outliers as well.

## 1.1 C

### Identify relationships and dependencies between pairs of attributes

In [None]:
corr = observation.corr(numeric_only=False)

plt.figure(figsize=(14,10))
sns.heatmap(corr, annot=True, fmt=".2f")
plt.show()

In [None]:
corr_prs = corr.unstack().sort_values(key=lambda x: x.abs(), ascending=False)
corr_prs = corr_prs[(corr_prs < 0.999) & (corr_prs > -0.999)]
corr_prs.drop_duplicates(inplace=True)
print("Top correlations:\n")
print(corr_prs.head(6))

### Interpretation of paired data analysis results

Several significant relationships can be identified:

**CO and HR (r = 0.76)** – dependence between heart rate and cardiac output.

**Oximetry and PVI (r = 0.67)** – oxygen saturation is closely related to perfusion variability.

**Skin Temperature and PI (r = –0.49)** – inverse relationship between skin temperature and perfusion index: as skin temperature decreases, the perfusion index tends to increase.

**Skin Temperature and Oximetry (r = 0.37)** – correlation between peripheral temperature and saturation.

**EtCO₂ and PI (r = 0.31)** – correlation between exhaled CO₂ and perfusion index.

**PVI and Skin Temperature (r = 0.29)** – relationship between perfusion variability and skin temperature.  

Most other attribute pairs show only weak linear correlations.

These findings point to physiological relationships between selected variables
and help determine which attributes may be relevant in future modeling
and prediction of the target variable `oximetry`.
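
One way to list each attribute pair only once is to keep the strict upper triangle of the correlation matrix. The sketch below is a possible alternative to value-based de-duplication; the helper name `top_correlated_pairs` and the demo columns are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n attribute pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic demo data (hypothetical columns, not the observation dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
top = top_correlated_pairs(demo, n=3)
print(top)
```

Unlike dropping duplicate values, this keeps two different pairs that happen to share the same coefficient.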

In [None]:
pairs_to_plot = [pair for pair, value in corr_prs.head(5).items()]

for x, y in pairs_to_plot:
    sns.scatterplot(data=observation, x=x, y=y)
    plt.title(f"Relationship between {x} and {y}")
    plt.show()

#### CO - HR
The graph shows a strong positive correlation between cardiac output (CO) and heart rate (HR).
With higher cardiac output, heart rate also increases, which is expected from a physiological point of view.
The points form a curved shape, indicating that the relationship is not completely linear, but strongly positive.

#### EtCO₂ and PI
The values are more evenly distributed on this graph. There is a slight positive correlation between
the concentration of carbon dioxide in exhaled air (EtCO₂) and the perfusion index (PI).
The relationship is not strong, but it confirms that changes in breathing can partially affect peripheral perfusion.

## 1.1 D

### Correlation with the predicted variable

In [None]:
corr = observation.corr(numeric_only=False)
corr['oximetry'].sort_values()

Our predicted variable has its strongest positive correlations with **PVI**, **Skin Temperature**, and **EtCO₂**, and a weak negative correlation with **SpO₂**.
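
Because the target is binary, its Pearson correlation with a continuous attribute is the same quantity as the point-biserial correlation. The sketch below illustrates this on synthetic data; the variable names and the size of the shift are assumptions, not values from the observation dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic binary target and a feature that is shifted upward when the target is 1
target = rng.integers(0, 2, size=500)
feature = rng.normal(size=500) + 1.5 * target

r_pb, p_pb = stats.pointbiserialr(target, feature)
r_pearson, _ = stats.pearsonr(target, feature)
print(f"point-biserial r = {r_pb:.3f}, Pearson r = {r_pearson:.3f}")
```

The coefficient reflects how far apart the two class means are relative to the overall spread, so a scatterplot against a 0/1 target shows two horizontal bands rather than a trend line.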

### Visualisation of correlations

In [None]:
attributes = ['PVI', 'Skin Temperature', 'EtCO₂', 'SpO₂']

In [None]:
for attribute in attributes:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=observation, x=f'{attribute}', y='oximetry')
    plt.title(f'{attribute}')
    plt.xlabel(f'{attribute}')
    plt.show()

## (E-1b)

After analysis, we can see a strong positive correlation between **CO** and **HR** (r = 0.76), weaker positive correlations between **PI** and **EtCO₂** (r = 0.31) and between **Skin Temperature** and **PVI** (r = 0.299), and a negative correlation between **Skin Temperature** and **PI** (r = –0.49).

Our target variable, oximetry, is most strongly correlated with **PVI** (r = 0.66), **Skin Temperature** (r = 0.368), and **EtCO₂** (r = 0.281),
and has a weak negative correlation with **SpO₂** (r = −0.121).

We do not need to combine the other datasets; all the required information is already in the observation dataset.

## (A-2b)

### Check data types

In [None]:
observation.dtypes

In [None]:
station.dtypes

In [None]:
patient.dtypes

### Format data types

In [None]:
station['revision'] = pd.to_datetime(station['revision'], format="mixed")
station['station'] = station['station'].astype('string')
station['QoS'] = station['QoS'].astype('category')
station['location'] = station['location'].astype('string')

In [None]:
patient['job'] = patient['job'].astype('string')
patient['ssn'] = patient['ssn'].astype('string')
patient['blood_group'] = patient['blood_group'].astype('category')
patient['company'] = patient['company'].astype('string')
patient['name'] = patient['name'].astype('string')
patient['username'] = patient['username'].astype('string')
patient['residence'] = patient['residence'].astype('string')
patient['registration'] = patient['registration'].astype('string')
patient['address'] = patient['address'].astype('string')
patient['mail'] = patient['mail'].astype('string')
patient[['longitude', 'latitude']] = (
    patient['current_location']
    .astype(str)
    .str.extract(r"Decimal\('([\d\.\-]+)'\), Decimal\('([\d\.\-]+)'\)")
    .astype(float)
)
patient.drop(columns=['current_location'], inplace=True)

### Check nulls

In [None]:
observation.isnull().sum()

In [None]:
station.isnull().sum()

In [None]:
patient.isnull().sum()

The **patient** dataset contains many nulls, so we fill them with default values: "Unknown" for text columns and 0 for the coordinates.

In [None]:
patient['residence'] = patient['residence'].fillna('Unknown')
patient['job'] = patient['job'].fillna('Unknown')
patient['address'] = patient['address'].fillna('Unknown')
patient['longitude'] = patient['longitude'].fillna(0)
patient['latitude'] = patient['latitude'].fillna(0)

### Check duplicates

In [None]:
observation.duplicated().sum()

In [None]:
station.duplicated().sum()

In [None]:
patient.duplicated().sum()

There is one duplicate row in the **observation** dataset, which needs to be removed.

### Clean duplicates

In [None]:
observation = observation.drop_duplicates()

## 1.2 B

In [None]:
ranges = pd.read_csv("./094/sensor_variable_range.csv", sep="\t")
print(ranges.head())

In [None]:
num_pat = re.compile(r"[-+]?\d+(?:[.,]\d+)?")
def parse_range(s):
    nums = num_pat.findall(str(s))
    if len(nums) >= 2:
        a = float(nums[0].replace(",", "."))
        b = float(nums[1].replace(",", "."))
        lo, hi = (a, b) if a <= b else (b, a)
        return lo, hi
    return None, None

ranges[["Min", "Max"]] = ranges["Value Range"].apply(lambda r: pd.Series(parse_range(r)))
ranges = ranges.dropna(subset=["Min", "Max"])
ranges.loc[ranges["Variable"] == "BP", ["Min", "Max"]] = [90.0, 120.0]
print(ranges[["Variable", "Min", "Max"]].reset_index(drop=True))

In [None]:
records = []

for var, low, high in zip(ranges["Variable"], ranges["Min"], ranges["Max"]):
    if var in observation.columns:
        vals = pd.to_numeric(observation[var], errors="coerce")
        mask = (vals < low) | (vals > high)
        records.append({
            "Attribute": var,
            "Number of abnormal values": int(mask.sum()),
            "Allowed range": f"{low} – {high}",
            "Examples idx": mask[mask].index.tolist()
        })
        
anomalies_df = pd.DataFrame(records).sort_values("Number of abnormal values", ascending=False).reset_index(drop=True)
anomalies_df

### Data correctness check

The data from *observation.csv* were compared with the reference ranges of physiological parameters from *sensor_variable_range.csv*.  
No abnormal values outside the defined intervals were found in any attribute, which suggests that the dataset does not contain erroneous or extreme measurements relative to these ranges.  

To be safe, we also checked for illogical combinations of values
(relationships between attributes) that could indicate sensor or annotation errors.

In [None]:
logic_errors = []

# Pressure = 0 when pulse is present
mask = (observation["BP"] == 0) & (observation["HR"] > 0)
logic_errors.append(("BP = 0 and HR > 0", mask.sum()))

# Does the measured cardiac output (CO) match the calculated value based on heart rate (HR) and stroke volume (SV)
co_est = observation["HR"] * observation["SV"] / 1000.0
mask = (observation["CO"] - co_est).abs() > 0.5 * co_est.fillna(0).abs()
logic_errors.append(("|CO - HR*SV/1000| > 50%", mask.sum()))

# if the signal quality is >= 80%, but the signal to noise ratio is < 20 dB (strong noise), then the data is contradictory.
mask = (observation["Signal Quality Index"] >= 80) & (observation["SNR"] < 20)
logic_errors.append(("Signal Quality Index >= 80 and SNR < 20", mask.sum()))

# It is impossible to accurately measure ideal saturation with a poor signal.
mask = (observation["Signal Quality Index"] <= 10) & (observation["SpO₂"] >= 99)
logic_errors.append(("Signal Quality Index <= 10 and SpO₂ >= 99", mask.sum()))

# When breathing normal air, saturation should not be this low.
mask = (observation["FiO₂"] <= 22) & (observation["SpO₂"] < 85)
logic_errors.append(("FiO₂ ≈ 21% and SpO₂ < 85%", mask.sum()))

# Do such coordinates exist on Earth?
mask = (
    (observation["latitude"] < -90) |
    (observation["latitude"] > 90) |
    (observation["longitude"] < -180) |
    (observation["longitude"] > 180)
)
logic_errors.append(("Latitude/Longitude out of range", mask.sum()))

# Unrealistic combination — a person breathes often, but exhales almost no CO₂.
mask = (observation["RR"] > 40) & (observation["EtCO₂"] < 20)
logic_errors.append(("RR > 40 and EtCO₂ < 20", mask.sum()))

logic_df = pd.DataFrame(logic_errors, columns=["-", "Number of violations"])
logic_df

### Data accuracy check

Based on the reference ranges from the sensor_variable_range.csv file, the values in the observation.csv dataset were checked for accuracy. No abnormal values outside the permitted physiological ranges were found.
Subsequently, a check of logical relationships between attributes was also performed:

**BP = 0 and HR > 0**

**|CO – HR×SV/1000| > 50%**

**Signal Quality Index >= 80 and SNR < 20**

**Signal Quality Index <= 10 and SpO₂ >= 99**

**FiO₂ ≈ 21% and SpO₂ < 85%**

**Latitude/Longitude outside range**

**RR > 40 and EtCO₂ < 20**

All conditions had 0 violations, which means that the dataset does not contain illogical or erroneous combinations of data.  
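
The same kind of check can also be organized as a table of named rules, which makes it easier to add or remove conditions. The sketch below uses a tiny synthetic frame with made-up values, so the violation counts it prints say nothing about observation.csv.

```python
import pandas as pd

# Tiny synthetic frame with hypothetical values (not taken from observation.csv)
demo = pd.DataFrame({
    "HR": [70.0, 80.0, 65.0],
    "SV": [70.0, 60.0, 75.0],
    "CO": [4.9, 9.9, 4.8],
    "BP": [110.0, 0.0, 105.0],
})

def co_mismatch(d):
    """Flag rows where measured CO differs from HR*SV/1000 by more than 50%."""
    co_est = d["HR"] * d["SV"] / 1000.0
    return (d["CO"] - co_est).abs() > 0.5 * co_est.abs()

# Each rule maps a readable label to a function returning a boolean mask of violations
rules = {
    "BP = 0 and HR > 0": lambda d: (d["BP"] == 0) & (d["HR"] > 0),
    "|CO - HR*SV/1000| > 50%": co_mismatch,
}

violations = pd.Series({label: int(rule(demo).sum()) for label, rule in rules.items()})
print(violations)
```

Only the second synthetic row is constructed to break both rules; the other two rows are consistent.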

## 1.2 C

### Remove outliers or distant observations
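
The IQR rule used in this section flags values that lie more than 1.5 interquartile ranges below the first quartile or above the third quartile. A minimal sketch on a toy series follows; the helper `iqr_bounds` and the values are hypothetical.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return the lower and upper Tukey fences for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values with one obvious outlier
s = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 100])
lower, upper = iqr_bounds(s)
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, outliers.tolist())
```

Note that dropping rows attribute by attribute, as done below, recomputes the fences on an already reduced dataset at each step.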


In [None]:
def identify_outliers(a):
    lower = a.quantile(0.25) - 1.5 * stats.iqr(a)
    upper = a.quantile(0.75) + 1.5 * stats.iqr(a)
    
    return a[(a > upper) | (a < lower)]

In [None]:
co_out = identify_outliers(observation['CO'])
observation = observation.drop(co_out.index)

pvi_out = identify_outliers(observation['PVI'])
observation = observation.drop(pvi_out.index)

sv_out = identify_outliers(observation['SV'])
observation = observation.drop(sv_out.index)

bp_out = identify_outliers(observation['BP'])
observation = observation.drop(bp_out.index)

### Visualisation of cleaned attributes

In [None]:
cleaned_attributes = ["BP", "PVI", "SV", "CO"]

In [None]:
for attribute in cleaned_attributes:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(observation[attribute], bins=30, kde=True)
    plt.title(f'{attribute}')
    plt.xlabel(f'{attribute}')
    
    plt.subplot(1, 2, 2)
    plt.boxplot(observation[attribute])
    plt.title(f'{attribute}')
    plt.xlabel(f'{attribute}')
    plt.show()

### Replace outliers
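
Clipping to the 5th and 95th percentiles keeps every row and only caps extreme values at the bounds. A small sketch on toy data (hypothetical values) shows the effect:

```python
import pandas as pd

# Hypothetical toy values; 0..100 in steps of 1 make the percentiles easy to read
s = pd.Series(range(101), dtype=float)
lower, upper = s.quantile(0.05), s.quantile(0.95)

# clip() keeps every row but caps values outside [lower, upper] at the bounds
clipped = s.clip(lower, upper)
print(lower, upper, clipped.min(), clipped.max(), len(clipped) == len(s))
```

Because roughly 10% of the values are moved onto the two bounds, histograms of clipped attributes typically show spikes at both ends.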

In [None]:
def replace_outliers(a):
    lower = a.quantile(0.05)
    upper = a.quantile(0.95)
    
    clipped = a.clip(lower, upper)
    
    return clipped

In [None]:
observation["PRV"] = replace_outliers(observation["PRV"])

observation["Skin Temperature"] = replace_outliers(observation["Skin Temperature"])

observation["SpO₂"] = replace_outliers(observation["SpO₂"])

observation["HR"] = replace_outliers(observation["HR"])

### Visualisation of replaced attributes

In [None]:
replaced_attributes = ["Skin Temperature", "SpO₂", "HR", "PRV"]

In [None]:
for attribute in replaced_attributes:
    plt.figure(figsize=(8, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(observation[attribute], bins=30, kde=True)
    plt.title(f'{attribute}')
    plt.xlabel(f'{attribute}')
    
    plt.subplot(1, 2, 2)
    plt.boxplot(observation[attribute])
    plt.title(f'{attribute}')
    plt.xlabel(f'{attribute}')
    plt.show()

## 1.3 A

### Hypotheses


#### H1: The mean SpO₂ is lower under high respiratory effort.

In [None]:
low_effort = observation.loc[observation["Respiratory effort"] <= observation["Respiratory effort"].median(), "SpO₂"]
high_effort = observation.loc[observation["Respiratory effort"] > observation["Respiratory effort"].median(), "SpO₂"]
print("Average SpO₂ during low exertion:", low_effort.mean())
display(low_effort.describe())
print("Average SpO₂ during high exertion:", high_effort.mean())
display(high_effort.describe())

In [None]:
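# SciPy notes that Shapiro p-values may be inaccurate for N > 5000, so each group is capped at 5000 values
sh_low = stats.shapiro(low_effort.sample(min(5000, len(low_effort)), random_state=0))
sh_high = stats.shapiro(high_effort.sample(min(5000, len(high_effort)), random_state=0))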
print("Shapiro p (low):", sh_low.pvalue)
print("Shapiro p (high):", sh_high.pvalue)

The Shapiro–Wilk test was used to verify the normality of the distribution of SpO₂ values in both groups.
At lower exertion, the test gave p = 1.49e-28, and at high exertion p = 5.24e-29, i.e. p < 0.05 in both cases.

A p-value below 0.05 means that the distribution differs significantly from normal.
Therefore, in addition to the parametric t-test, the non-parametric Mann–Whitney U test, which does not require normal data distribution, was also used in further analysis.
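
To illustrate how the Shapiro–Wilk decision is read, the sketch below runs the test on two synthetic samples, one drawn from a normal distribution and one clearly skewed. The samples are generated for illustration only and are unrelated to the SpO₂ data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=97.0, scale=1.0, size=500)   # synthetic, roughly bell-shaped
skewed_sample = rng.exponential(scale=1.0, size=500)        # synthetic, clearly skewed

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    result = stats.shapiro(sample)
    verdict = "reject normality" if result.pvalue < 0.05 else "no evidence against normality"
    print(f"{name}: W = {result.statistic:.3f}, p = {result.pvalue:.3g} -> {verdict}")
```

With several thousand observations the test becomes sensitive to very small departures from normality, so a tiny p-value alone does not say how large the departure is.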

In [None]:
lev = stats.levene(low_effort, high_effort)
print("Levene p:", lev.pvalue)

Levene's test (p = 0.895 > 0.05) gave no evidence that the variances of the two groups differ.  
We nevertheless use Welch's t-test, which does not assume equal variances and loses little power when they are in fact equal.
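
For reference, Welch's statistic and its approximate degrees of freedom (Welch–Satterthwaite) can be written as follows, where $\bar{x}_i$, $s_i^2$ and $n_i$ denote the sample mean, sample variance and size of group $i$ (notation introduced here for illustration):

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

The sign of $t$ follows the order of the groups: it is negative when the first group has the smaller mean.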

In [None]:
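# H1 corresponds to low_effort > high_effort; halving the two-sided p is valid only when t has that sign
t, p = stats.ttest_ind(low_effort, high_effort, equal_var=False, nan_policy="omit")
p_h1 = p / 2 if t > 0 else 1 - p / 2
print(f"t = {t:.3f}, p/2 = {p/2:.4f}, one-sided p for H1 = {p_h1:.4f}")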

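Welch's t-test gave t = –1.516. The negative sign means that the mean SpO₂ was slightly lower in the low-effort group than in the high-effort group, which is the opposite of what H1 predicts. The halved two-sided p-value (0.0648) therefore belongs to the opposite alternative; for H1 the one-sided p-value is 1 − 0.0648 ≈ 0.935. A decrease in SpO₂ during increased respiratory effort was not confirmed.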
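
The relationship between a two-sided p-value and its one-sided counterparts can be sketched as follows. The helper `one_sided_p` is hypothetical and only valid for tests with a symmetric null distribution, such as the t-test; the inputs are the values reported above.

```python
def one_sided_p(t_stat, p_two_sided, alternative):
    """Convert a two-sided p-value of a symmetric test (e.g. a t-test) to a one-sided one.

    alternative="greater" tests mean(first group) > mean(second group),
    alternative="less" tests mean(first group) < mean(second group).
    """
    half = p_two_sided / 2
    if alternative == "greater":
        return half if t_stat > 0 else 1 - half
    if alternative == "less":
        return half if t_stat < 0 else 1 - half
    raise ValueError("alternative must be 'greater' or 'less'")

# Values reported above: t = -1.516 and p/2 = 0.0648, so the two-sided p is 0.1296
p_h1 = one_sided_p(-1.516, 0.1296, "greater")    # H1: SpO₂ is higher under low effort
p_opposite = one_sided_p(-1.516, 0.1296, "less")
print(f"H1 direction: {p_h1:.4f}, opposite direction: {p_opposite:.4f}")
```

Recent SciPy versions also accept an `alternative` argument in `stats.ttest_ind`, which avoids the manual conversion.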

In [None]:
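# H1 (SpO₂ lower under high effort) corresponds to low_effort > high_effort, i.e. "greater"
for alt in ("greater", "less"):
    u, p_mwu = stats.mannwhitneyu(low_effort, high_effort, alternative=alt)
    print(f"Mann-Whitney p ({alt}) = {p_mwu:.4f}")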

The Mann–Whitney U test with `alternative="less"` (p = 0.0456 < 0.05) asks whether SpO₂ in the low-effort group tends to be lower than in the high-effort group, which is the reverse of H1. The result is borderline and the effect is very small; if anything, it points to slightly higher SpO₂ during high exertion.
Overall, increased respiratory effort shows no meaningful effect on SpO₂ values in the observed data.
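
The meaning of the `alternative` argument depends on the order in which the groups are passed. The toy example below (hypothetical values) makes the direction explicit:

```python
from scipy import stats

# Hypothetical toy groups: every value in x is smaller than every value in y
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]

# "less" asks whether x tends to be smaller than y; "greater" asks the reverse
p_less = stats.mannwhitneyu(x, y, alternative="less").pvalue
p_greater = stats.mannwhitneyu(x, y, alternative="greater").pvalue
print(f"p (x < y) = {p_less:.4f}, p (x > y) = {p_greater:.4f}")
```

For a hypothesis of the form "the first group has higher values", the matching choice is therefore `alternative="greater"`.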

Based on the results of both tests, SpO₂ is not lower during higher respiratory exertion; the small difference that exists points the other way and is not practically significant.
Hypothesis H1 (SpO₂ has a lower average value during higher exertion) was not confirmed.

#### H2: The mean RR is lower at higher FiO₂.

In [None]:
high_fio2 = observation.loc[observation["FiO₂"] > observation["FiO₂"].median(), "RR"]
low_fio2 = observation.loc[observation["FiO₂"] <= observation["FiO₂"].median(), "RR"]
print("RR average at lower FiO₂:", low_fio2.mean())
display(low_fio2.describe())
print("RR average at higher FiO₂:", high_fio2.mean())
display(high_fio2.describe())

In [None]:
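# SciPy notes that Shapiro p-values may be inaccurate for N > 5000, so each group is capped at 5000 values
sh_low = stats.shapiro(low_fio2.sample(min(5000, len(low_fio2)), random_state=0))
sh_high = stats.shapiro(high_fio2.sample(min(5000, len(high_fio2)), random_state=0))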
print("Shapiro p (low):", sh_low.pvalue)
print("Shapiro p (high):", sh_high.pvalue)

Both p-values are > 0.05 ⇒ RR distributions are approximately normal.  
We can therefore use a parametric t-test; to be on the safe side, we will also include a non-parametric test.

In [None]:
lev = stats.levene(low_fio2, high_fio2)
print("Levene p:", lev.pvalue)

p > 0.05, so there is no evidence of a difference in variance between the groups.  
We still use Welch's t-test, which does not assume equal variances.

In [None]:
u, p_mwu = stats.mannwhitneyu(high_fio2, low_fio2, alternative="less")
print(f"Mann-Whitney p(one-sided)= {p_mwu:.4f}")

The Mann–Whitney U test (p = 0.042 < 0.05) showed that the RR value is statistically significantly lower at higher FiO₂.
Hypothesis H2 (lower RR at higher FiO₂) was confirmed.

In [None]:
t, p = stats.ttest_ind(high_fio2, low_fio2, equal_var=False, nan_policy="omit")
print(f"t = {t:.3f}, p = {p/2:.4f}")

Both tests (Welch's t-test and Mann–Whitney U) confirm H2: at higher FiO₂, the respiratory rate RR is lower (p < 0.05).  

The data are approximately normally distributed (Shapiro p > 0.05) and have comparable variance (Levene p = 0.33).
Both the one-sided Welch's t-test (t = –1.662, p = 0.0483) and the Mann–Whitney U test (p = 0.042) confirmed that the respiratory rate (RR) is statistically significantly lower at higher FiO₂.

Hypothesis H2 (lower RR at higher FiO₂) was confirmed, but the difference is of little practical significance.
At increased oxygen concentrations, there is a slight but statistically significant decrease in respiratory rate.

## 1.3 B

### Verification of statistical power of the test (H1 – SpO₂ and respiratory effort)

In [None]:
n1 = len(high_effort)
n2 = len(low_effort)
mean1, mean2 = high_effort.mean(), low_effort.mean()
sd1, sd2 = high_effort.std(ddof=1), low_effort.std(ddof=1)


In [None]:
spooled = np.sqrt(((n1-1)*sd1**2 + (n2-1)*sd2**2) / (n1+n2-2))
d = (mean1 - mean2) / spooled
print(f"Cohen's d = {d:.4f}")

The calculated effect size Cohen’s d = 0.0289 represents a very small difference between the groups.  
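
Written out, the pooled standard deviation and effect size computed in the cell above are as follows, with group 1 being the high-effort group and group 2 the low-effort group:

```latex
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}
```

A positive $d$ therefore means that the mean SpO₂ is higher in the high-effort group.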

In [None]:
analysis = TTestIndPower()
power = analysis.power(effect_size=abs(d), nobs1=n1, ratio=n2/n1, alpha = 0.05)
print(f"Statistical power of the test = {power:.3f}")

The test has only limited ability to reliably detect such small differences in data.

In [None]:
mde = analysis.solve_power(effect_size=None, nobs1=n1, ratio=n2/n1, alpha=0.05, power=0.8)
print(f"Minimum detectable effect (MDE) = {mde:.3f}")

Based on these results, the difference in SpO₂ between the respiratory effort levels is not statistically significant in the direction of H1, and SpO₂ values remain practically stable.  
The small shift that does exist (Cohen's d = 0.0289) is negligible from a physiological point of view.

### Verification of statistical power of the test (H2 – RR and FiO₂)

In [None]:
n1 = len(high_fio2)
n2 = len(low_fio2)

mean1, mean2 = high_fio2.mean(), low_fio2.mean()
sd1, sd2 = high_fio2.std(ddof=1), low_fio2.std(ddof=1)

In [None]:
spooled = np.sqrt(((n1 - 1)*sd1**2 + (n2 - 1)*sd2**2) / (n1 + n2 - 2))
d = (mean1 - mean2) / spooled
print(f"Cohen’s d = {d:.4f}")

The effect size Cohen’s d = –0.0317 represents a very small and practically negligible difference between the groups.  

In [None]:
analysis = TTestIndPower()
power = analysis.power(effect_size=abs(d), nobs1=n1, ratio=n2/n1, alpha=0.05)
print(f"Statistical power of the test = {power:.3f}")

Power = 0.383 is low, meaning that the test has only limited ability to reliably detect such small differences.  
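
As a rough cross-check, the per-group sample size needed to reach 80% power for such a small effect can be approximated with the normal distribution. The helper `n_per_group` is a hypothetical sketch for a two-sided test with equal group sizes, not a replacement for `TTestIndPower`.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

# Effect size reported above for H2 (absolute value)
n_needed = n_per_group(0.0317)
print(n_needed)
```

Under these assumptions the requirement is on the order of 15,000 observations per group.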

In [None]:
mde = analysis.solve_power(effect_size=None, nobs1=n1, ratio=n2/n1, alpha=0.05, power=0.8)
print(f"Minimum detectable effect (MDE) = {mde:.3f}")

Welch's t-test and the Mann–Whitney U test both showed a statistically significant difference (p = 0.048 and p = 0.042), but the effect size is extremely small and has no practical physiological significance.  
Higher oxygen concentration (FiO₂) is associated with a slightly lower respiratory rate (RR), but the difference is negligible.

## Conclusion

There were minor issues with the format and quality of the source data.
Some values were of type object, so they were converted to a more suitable type.

In the patient dataset, missing text values were replaced with the category “Unknown” and missing coordinates with 0, so that no records had to be dropped.

One duplicate record was found in the observation table and was deleted.

Rare outliers in the variables were also identified and processed using two methods: IQR deletion and replacement with boundary values (5th and 95th percentiles).

No inconsistent values were found in the data. No violations were found when checking the logical relationships between attributes. All values are mutually consistent and within physiologically acceptable ranges.

After cleaning, the data does not contain duplicates, missing or incorrect values,
making it suitable for further analytical and model processing. 