**The goal of this step is to identify and handle outliers in the dataset, with a specific focus on columns in which more than a set percentage of the values are outliers. The Interquartile Range (IQR) method is used to identify and remove outliers.**

**Outlier Detection Using IQR:**
-   Outliers were identified using the Interquartile Range (IQR) method for each column individually. The IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1), and outliers are defined as values falling outside the range **[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].**
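
The rule above can be illustrated with a minimal pandas sketch. The helper names `iqr_bounds` and `outlier_mask` and the toy values are assumptions made for illustration; they are not part of the project's `data_cleaning` module, whose source is not shown here.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the fences."""
    lower, upper = iqr_bounds(series, k)
    return (series < lower) | (series > upper)


# Toy data (not from the dataset): six ordinary values and one extreme one.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
lower, upper = iqr_bounds(values)
mask = outlier_mask(values)

print(lower, upper)            # the two fences
print(values[mask].tolist())   # values flagged as outliers
```

Here Q1 is 11 and Q3 is 12.5, so the IQR is 1.5 and only the value 95 falls outside the fences.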

**How it is applied here:**
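-   If the share of outliers in a column exceeded a set percentage, the column was dropped from the dataset. The `remove_outliers` call below passes `column_outlier_percentage_threshold=40`.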
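
The column-level rule can be sketched as follows. This is only an illustration of the idea, not the project's `remove_outliers` function, whose implementation is not shown in this notebook; the function name `drop_outlier_heavy_columns`, its parameters, and the toy DataFrame are all hypothetical.

```python
import pandas as pd


def drop_outlier_heavy_columns(df: pd.DataFrame, max_outlier_pct: float, k: float = 1.5):
    """Drop numeric columns whose share of IQR outliers exceeds max_outlier_pct.

    Returns the reduced DataFrame and the per-column outlier percentage.
    """
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1

    # Compare every column against its own fences.
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    outlier_pct = (below | above).mean() * 100

    to_drop = outlier_pct[outlier_pct > max_outlier_pct].index
    return df.drop(columns=to_drop), outlier_pct


# Toy data (not from the dataset). "spiky" has Q1 = Q3 = 0, so every
# non-zero value (4 of 9) counts as an outlier.
toy = pd.DataFrame({
    "steady": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "spiky": [-1, -1, 0, 0, 0, 0, 0, 1, 1],
})

# 40 mirrors the percentage passed to remove_outliers in this notebook.
cleaned, pct = drop_outlier_heavy_columns(toy, max_outlier_pct=40)
print(pct.round(1).to_dict())
print(list(cleaned.columns))
```

In this sketch a column whose IQR is zero treats every value different from the quartile value as an outlier, which is why near-constant columns are the ones most likely to be dropped.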


## 1- import and read data

In [7]:
import sys

# Make the project's utility and preprocessing scripts importable from this notebook.
sys.path.append('../../../scripts/utilities')
from helper_functions import *
sys.path.append('../../../scripts/data_preprocessing')
from data_cleaning import *

In [8]:
base_path = '../../../data/processed_data/'
df = read_files('df_filling_missing_values_with_median_encoded_handle_noisy.csv', base_path=base_path)[0]
df.shape

(10175, 372)

## 2- remove outliers

In [9]:
df = remove_outliers(df, outlier_threshold=1.5, column_outlier_percentage_threshold=40)
print(df.head())

    SEQN  RIDSTATR  RIAGENDR  RIDAGEYR  RIDRETH1  RIDRETH3  RIDEXMON  \
0  73557       2.0       1.0      69.0       4.0       4.0       1.0   
1  73558       2.0       1.0      54.0       3.0       3.0       1.0   
2  73559       2.0       1.0      72.0       3.0       3.0       2.0   
3  73560       2.0       1.0      26.0       3.0       3.0       1.0   
4  73561       2.0       2.0      73.0       3.0       3.0       1.0   

   DMQMILIZ  DMDBORN4  DMDCITZN  ...  LBXTC  LBDTCSI  LBXTTG      WTSH2YR.y  \
0       1.0       1.0       1.0  ...  167.0     4.32     2.0   34086.061823   
1       2.0       1.0       1.0  ...  170.0     4.40     2.0   49123.400015   
2       1.0       1.0       1.0  ...  126.0     3.26     2.0  115794.742161   
3       2.0       1.0       1.0  ...  168.0     4.34     2.0   55766.512438   
4       2.0       1.0       1.0  ...  201.0     5.20     2.0   34086.061823   

   LBDBCDLC  LBDTHGLC  URXVOL1  URDFLOW1  MCQ160L  MCQ220  
0       0.0       0.0     87.0  

## 3- save dataframe

In [10]:
save_files([df], 'df_filling_missing_values_with_median_encoded_handle_noisy_handle_outlier.csv', base_path=base_path)