**This data cleaning step removes duplicate records and constant columns from the dataset, improving the quality and reliability of the data for subsequent analysis.**

**1. Handling Duplicate Records:**
- Duplicate records were identified by comparing all rows in the dataset and flagging any row whose values are all identical to those of another row.
- Duplicate records were dropped, retaining only the first occurrence.
- Removing duplicate records ensures that each observation in the dataset is unique, so repeated observations cannot bias subsequent analyses.
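The duplicate-handling rule above can be sketched with plain pandas. This is a minimal illustration, not the project's code: `drop_duplicates_sketch` and the toy frame are hypothetical, and the actual `drop_duplicates` helper in `data_cleaning` may differ in its details.

```python
import pandas as pd


def drop_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the project's drop_duplicates helper."""
    # With no subset given, rows count as duplicates only when every
    # column matches; keep="first" retains the first occurrence.
    return df.drop_duplicates(keep="first")


# Toy data (made up for illustration): the second row repeats the first.
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
n_duplicates = int(toy.duplicated(keep="first").sum())
deduped = drop_duplicates_sketch(toy)
```

Here `n_duplicates` counts the rows that would be removed, which is a convenient check to run before dropping anything.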

**2. Handling Constant Columns:**
- Constant columns were identified by checking whether a column has the same value across all rows in the dataset.
- Columns with constant values were dropped from the dataset, since a column with a single value carries no information for distinguishing observations.
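One way to express the constant-column check is shown below. This is a sketch under assumptions: `drop_constant_columns_sketch` and the toy frame are hypothetical, the project's own `drop_constant_columns` helper may be implemented differently, and treating `NaN` as a value in its own right is a choice made here, not something the notebook states.

```python
import pandas as pd


def drop_constant_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the project's drop_constant_columns helper."""
    # Count distinct values per column. dropna=False counts NaN as a value,
    # so a column mixing one value with NaN is NOT treated as constant,
    # while an all-NaN column is.
    n_unique = df.nunique(dropna=False)
    constant_cols = n_unique[n_unique <= 1].index.tolist()
    return df.drop(columns=constant_cols)


# Toy data (made up for illustration): "status" holds a single value.
toy = pd.DataFrame(
    {"id": [1, 2, 3], "status": [2.0, 2.0, 2.0], "age": [69, 54, 72]}
)
reduced = drop_constant_columns_sketch(toy)
```

With `dropna=True` instead, a column containing one value plus missing entries would also be dropped; which behavior is appropriate depends on whether missing values have already been filled.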

## 1- import and read data

In [1]:
import sys

# Add the project's script directories to the import path,
# then import the helpers they define.
sys.path.append('../../../scripts/utilities')
from helper_functions import *
sys.path.append('../../../scripts/data_preprocessing')
from data_cleaning import *

In [2]:
base_path = '../../../data/processed_data/'
df1 = read_files('df_filling_missing_values_with_median_encoded.csv', base_path=base_path)[0]

## 2- handle constant columns

In [3]:
df2 = drop_constant_columns(df1)
df2.shape

(10175, 372)

## 3- handle duplicate records

In [4]:
df3 = drop_duplicates(df2)
df3.shape

(10175, 372)

In [5]:
df3.head()

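| Unnamed: 0 | SEQN | RIDSTATR | RIAGENDR | RIDAGEYR | RIDRETH1 | RIDRETH3 | RIDEXMON | DMQMILIZ | DMDBORN4 | DMDCITZN | ... | LBXBSE | LBDBSESI | LBXBMN | LBDBMNSI | URXVOL1 | URDFLOW1 | LBDB12 | LBDB12SI | MCQ160L | MCQ220 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73557 | 2.0 | 1.0 | 69.0 | 4.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 186.5 | 2.37 | 9.89 | 180.0 | 87.0 | 0.821 | 524.0 | 386.7 | 2.0 | 2.0 |
| 1 | 73558 | 2.0 | 1.0 | 54.0 | 3.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | ... | 204.73 | 2.6 | 8.15 | 148.33 | 90.0 | 1.636 | 507.0 | 374.2 | 2.0 | 2.0 |
| 2 | 73559 | 2.0 | 1.0 | 72.0 | 3.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 209.64 | 2.66 | 9.57 | 174.17 | 66.0 | 0.647 | 732.0 | 540.2 | 2.0 | 1.0 |
| 3 | 73560 | 2.0 | 1.0 | 26.0 | 3.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | ... | 169.82 | 2.16 | 13.07 | 237.87 | 61.0 | 0.575 | 514.0 | 379.3 | 0.0 | 0.0 |
| 4 | 73561 | 2.0 | 2.0 | 73.0 | 3.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | ... | 186.5 | 2.37 | 9.89 | 180.0 | 5.0 | 0.109 | 225.0 | 166.1 | 2.0 | 2.0 |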


## 4- save dataframe

In [6]:
save_files([df3], 'df_filling_missing_values_with_median_encoded_handle_noisy.csv', base_path=base_path)