In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

***

 ### Step 1: Loading and Preprocessing the Data
---
Read the CSV file and save it in a temporary dataframe.    
To further safeguard patient privacy, the data has been subjected to extra deidentification prior to importing.

In [2]:
temp_df = pd.read_csv("~/Desktop/mimiciv_project/egfr_deidentified.csv")

### Renaming Columns and Setting Index
---
* Rename the column "Unnamed: 0" to "instance" in the temporary dataframe.
* Set the "instance" column as the index of the dataframe.
* Remove the index name to improve clarity.
* Create a new dataframe, "df," as a copy of the temporary dataframe.

In [3]:
temp_df.rename(columns={"Unnamed: 0": "instance"}, inplace = True)
temp_df.set_index('instance', inplace = True)
temp_df.index.name = None
df = temp_df.copy()
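
The rename-and-reindex steps above could probably be folded into the read itself. The following is a minimal sketch, assuming the first CSV column is the unnamed row identifier (as the "Unnamed: 0" label suggests). The inline `csv_text` and the name `sample_df` are hypothetical stand-ins for the real file and dataframe.

```python
import io

import pandas as pd

# Hypothetical stand-in for egfr_deidentified.csv: the first column has no header.
csv_text = """,gender,age,serum_creatinine
31084037,M,83.886161,0.4
31318874,M,70.337792,0.8
"""

# index_col=0 uses the first column as the index, so no "Unnamed: 0" column is created.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
sample_df.index.name = None  # explicit, in case the index picked up a name

print(sample_df)
```

Keeping a separate `temp_df` copy, as the notebook does, still has the advantage that the raw data stays available for later comparison.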


### Displaying DataFrame Information
---
This next code snippet provides various insights and information about the DataFrame:    

* Displaying the First Five Lines:    
    The first five lines of the dataframe are printed.
   
* Displaying DataFrame Columns:    
    The names of the columns in the dataframe are printed.
   
* Displaying Column Data Types:    
    The data types of the columns in the dataframe are printed.
   
* Displaying DataFrame Shape:    
    The number of rows and columns in the dataframe are printed.
   
* Displaying Summary Statistics:    
    Summary statistics for the numerical columns in the dataframe are printed, rounded to two decimal places.
   
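* Displaying Gender Distribution:    
    The distribution of genders in the dataframe is printed as a percentage. The proportions are rounded to five decimal places before being multiplied by 100, which leaves three decimal places in the percentage.    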

These lines of code help provide an overview of the dataframe's content, structure, and statistical information.

In [4]:
print(f"\033[1mThe first five lines are:\033[0m\n\n{df.head()}\n")
print(f"\033[1mThe DataFrame columns are:\033[0m\n\n{df.columns}\n")
print(f"\033[1mThe column data types are:\033[0m\n\n{df.dtypes}\n")
print(f"\033[1mThe dataframe has\033[0m {df.shape[0]} \033[1mlines and\033[0m {df.shape[1]} \033[1mcolumns\033[0m\n")
print(f"\033[1mThe summary statistics for the numerical columns are:\033[0m\n\n{df.describe().round(2)}\n")
print(f"\033[1mThe gender distribution is:\033[0m\n\n{(df['gender'].value_counts(normalize=True).round(5) * 100).apply(lambda x: f'{x}%')}\n")

The first five lines are:

         gender        age  serum_creatinine
31084037      M  83.886161               0.4
31318874      M  70.337792               0.8
31456588      F  70.218365               2.0
31468360      F  74.144543               0.7
31572939      M  58.215661               2.0

The DataFrame columns are:

Index(['gender', 'age', 'serum_creatinine'], dtype='object')

The column data types are:

gender               object
age                 float64
serum_creatinine    float64
dtype: object

The dataframe has 1892649 lines and 3 columns

The summary statistics for the numerical columns are:

              age  serum_creatinine
count  1892649.00        1891927.00
mean        64.53              1.43
std         16.68              1.53
min         18.00              0.00
25%         54.28              0.70
50%         65.96              0.90
75%         77.01              1.50
max        103.15             80.00

[

---

### Calculating Missing and Duplicate Values
---
This code snippet calculates and displays the following information about the dataframe:

* Calculating Percentage of Missing Values:    
    The percentage of missing values in each column is calculated and rounded to three decimal places.
   
* Calculating Percentage of Duplicate Values:    
    The percentage of duplicate rows based on specific columns ('gender', 'age', and 'serum_creatinine') is calculated and rounded to two decimal places. Since the eGFR equation uses only gender, age, and Scr as inputs, there is no point in keeping multiple rows in which all three values are identical.

These lines of code provide insights into the presence of missing values and duplicates within the dataframe.

In [5]:
missing_values_percentage = (df.isnull().mean()*100).round(3).apply(lambda x: f'{x}%')
duplicates_count = (df.duplicated(subset=['gender', 'age', 'serum_creatinine']).mean()*100).round(2)

print(f"\033[1mThe percentage of missing values is:\033[0m\n\n{missing_values_percentage}\n")
print(f"\033[1mThe percentage of duplicate values is:\033[0m\n\n{duplicates_count}%")

The percentage of missing values is:

gender                0.0%
age                   0.0%
serum_creatinine    0.038%
dtype: object

The percentage of duplicate values is:

49.62%
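
To make the two percentages easier to interpret, here is a small sketch on made-up data (the `toy` dataframe is hypothetical, not part of the dataset). It shows that `duplicated()` flags only the repeats after the first occurrence, so the reported share corresponds to the rows that `drop_duplicates()` would remove.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, row 3 has a missing creatinine value.
toy = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "M"],
        "age": [60.0, 60.0, 45.5, 60.0],
        "serum_creatinine": [1.1, 1.1, 0.8, None],
    }
)

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# With the default keep="first", only later repeats are marked True.
dup_mask = toy.duplicated(subset=["gender", "age", "serum_creatinine"])
dup_pct = dup_mask.mean() * 100

print(missing_pct)
print(dup_mask.tolist())
print(f"{dup_pct}%")
```

In this toy frame one of four rows is a repeat and one of four creatinine values is missing, so both shares come out at 25%.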


### Counting Rows with Serum Creatinine Equal to 0
----------------------------------------------------
This code snippet counts the number of rows in the dataframe where the 'serum_creatinine' column has a value of 0.    
The eGFR equation contains the term $\min\left( \frac{standardized\ Scr}{a},\ 1 \right)^{- 0.241}$, where $a$ is 0.7 for females and 0.9 for males (for males the exponent is -0.302).    
If Scr equals 0, the minimum evaluates to 0 and Python has to raise 0 to a negative power, which throws a ZeroDivisionError.
 
* Counting Rows with Serum Creatinine Equal to 0:    
    Rows in the dataframe where the 'serum_creatinine' column is equal to 0 are counted.    
    The count is stored in the variable 'creatinine_0' as a list and the calculated percentage is printed.
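
The failure described above can be reproduced in isolation. This is a minimal sketch; the helper name `scr_term` and its default arguments (the female constants) are chosen here for illustration only.

```python
def scr_term(s_cr, a=0.7, exponent=-0.241):
    """Evaluate min(Scr / a, 1) ** exponent, the first creatinine term of the equation."""
    return pow(min(s_cr / a, 1), exponent)


# A positive creatinine value evaluates normally.
print(scr_term(0.5))

# A creatinine value of 0 makes the base 0.0, and 0.0 to a negative power raises.
try:
    scr_term(0.0)
except ZeroDivisionError as exc:
    print(f"ZeroDivisionError: {exc}")
```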

In [6]:
creatinine_0 = df[df['serum_creatinine'] == 0]['serum_creatinine'].value_counts().to_list()
print(f"\033[1mThe percentage of indexes with serum_creatinine equal to 0 are\033[0m {round((100*creatinine_0[0]/len(df)),5)}%")

The percentage of indexes with serum_creatinine equal to 0 are 0.00571%


### Data Cleaning Operations
---
This code snippet performs various data cleaning operations on the dataframe:

* Removing Rows with Missing Values:    
    Since the percentage of missing values for Scr is 0.038%, the rows containing missing values are removed from the dataframe, modifying it in-place.
   
* Removing Duplicate Rows:    
    Duplicate rows based on specific columns ('gender', 'age', and 'serum_creatinine') are removed from the dataframe, modifying it in-place. As mentioned above, it would be redundant to calculate the same values more than once.
   
* Removing Rows with Serum Creatinine Equal to 0:
    Rows where the 'serum_creatinine' column has a value of 0 are removed from the dataframe, creating a new modified dataframe.
    
These operations help ensure data quality and prepare the dataframe for further analysis.

In [7]:
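# Drop rows with any missing value (only serum_creatinine has missing entries)
df.dropna(inplace=True)
# Keep the first occurrence of each (gender, age, serum_creatinine) combination
df.drop_duplicates(subset=['gender', 'age', 'serum_creatinine'], inplace=True)
# Drop zero serum creatinine values, which would break the eGFR power term
df = df[df['serum_creatinine'] != 0]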
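
To see how much each cleaning step removes, the same three operations can be applied without `inplace` and the row count reported after each one. The sketch below uses a made-up `toy` dataframe, and the intermediate names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one duplicate (row 1), one missing value (row 2), one zero (row 3).
toy = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "F", "M"],
        "age": [60.0, 60.0, 45.5, 70.0, 52.0],
        "serum_creatinine": [1.1, 1.1, None, 0.0, 0.9],
    }
)
key_cols = ["gender", "age", "serum_creatinine"]

no_missing = toy.dropna()
no_duplicates = no_missing.drop_duplicates(subset=key_cols)
cleaned = no_duplicates[no_duplicates["serum_creatinine"] != 0]

steps = [
    ("start", toy),
    ("after dropna", no_missing),
    ("after drop_duplicates", no_duplicates),
    ("after removing zeros", cleaned),
]
for label, frame in steps:
    print(f"{label}: {len(frame)} rows")
```

Each step returns a new dataframe here, which keeps the intermediate results available for inspection at the cost of extra memory on a dataset of this size.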

In [8]:
print(f"\033[1mThe dataframe has\033[0m {df.shape[0]} \033[1mlines and\033[0m {df.shape[1]} \033[1mcolumns\033[0m\n")
print(f"\033[1mThe summary statistics for the numerical columns are:\033[0m\n\n{df.describe().round(2)}\n")
print(f"\033[1mThe gender distribution is:\033[0m\n\n{(df['gender'].value_counts(normalize=True).round(5) * 100).apply(lambda x: f'{x}%')}\n")

The dataframe has 953144 lines and 3 columns

The summary statistics for the numerical columns are:

             age  serum_creatinine
count  953144.00         953144.00
mean       64.73              1.74
std        16.91              1.90
min        18.00              0.10
25%        54.15              0.80
50%        66.18              1.10
75%        77.54              1.80
max       103.15             80.00

The gender distribution is:

M    53.431%
F    46.569%
Name: gender, dtype: object



In [9]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

### DataFrames comparison

Comparing the original DataFrame with the DataFrame obtained after the cleaning operations, the gender distribution and the age statistics are essentially unchanged (see the table below). The serum creatinine distribution shifts upward after cleaning (mean from 1.43 to 1.74, median from 0.90 to 1.10). The blood sample count (number of rows) has been reduced by almost 50% (from 1,892,649 to 953,144).

|           |               |   **Dataframe**      |             |                      |
| :-------- | :------------ | :------------------- | :---------- | :------------------- |
|           | **Original**  |                      | **Cleaned** |                      |
| **M**     | 53.159%       |                      | 53.431%     |                      |
| **F**     | 46.841%       |                      | 46.569%     |                      |
|           |               |                      |             |                      |
|           | **age**       | **serum_creatinine** | **age**     | **serum_creatinine** |
| **count** | 1,892,649     | 1,891,927            | 953,144     | 953,144              |
| **mean**  | 64.53         | 1.43                 | 64.73       | 1.74                 |
| **std**   | 16.68         | 1.53                 | 16.91       | 1.90                 |
| **min**   | 18.00         | 0.00                 | 18.00       | 0.10                 |
| **25%**   | 54.28         | 0.70                 | 54.15       | 0.80                 |
| **50%**   | 65.96         | 0.90                 | 66.18       | 1.10                 |
| **75%**   | 77.01         | 1.50                 | 77.54       | 1.80                 |
| **max**   | 103.15        | 80.00                | 103.15      | 80.00                |
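
A comparison table like the one above could also be generated rather than typed by hand. The sketch below is one possible approach on made-up data. The `original` and `cleaned` frames are hypothetical stand-ins for `temp_df` and `df`.

```python
import pandas as pd

# Hypothetical stand-ins for the original and cleaned dataframes.
original = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "F", "M"],
        "age": [60.0, 60.0, 45.5, 70.0, 52.0],
        "serum_creatinine": [1.1, 1.1, None, 0.8, 0.0],
    }
)
cleaned = original.dropna().drop_duplicates()
cleaned = cleaned[cleaned["serum_creatinine"] != 0]

# Side-by-side summary statistics; the dict keys become the top column level.
summary = pd.concat(
    {"Original": original.describe(), "Cleaned": cleaned.describe()},
    axis=1,
).round(2)

# Side-by-side gender shares in percent.
gender_share = pd.concat(
    {
        "Original": original["gender"].value_counts(normalize=True),
        "Cleaned": cleaned["gender"].value_counts(normalize=True),
    },
    axis=1,
).mul(100).round(3)

print(summary)
print(gender_share)
```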

### eGFR Calculation
***

The CKD-EPI (2021) eGFR equation differs by patient gender. Based on the CKD-EPI (2021) equation from [kidney.org](https://www.kidney.org/professionals/kdoqi/gfr_calculator/formula), the gender-specific equations are:    

$eGFR_{female} = 142 \times \min\left( \frac{standardized\ Scr}{0.7},\ 1 \right)^{- 0.241} \times \max\left( \frac{standardized\ Scr}{0.7},\ 1 \right)^{- 1.2} \times 0.9938^{age} \times 1.012$    

$eGFR_{male} = 142 \times \min\left( \frac{standardized\ Scr}{0.9},\ 1 \right)^{- 0.302} \times \max\left( \frac{standardized\ Scr}{0.9},\ 1 \right)^{- 1.2} \times 0.9938^{age}$    

---
This code snippet defines two functions to calculate the estimated glomerular filtration rate (eGFR) based on age and serum creatinine levels. The calculated eGFR is rounded to two decimal places and returned as the result:

* Equation for Females: egfr_f(age,s_cr).
* Equation for Males: egfr_m(age,s_cr).    

Where:    
**age** represents the age of the individual    
**s_cr** denotes the serum creatinine level

In [10]:
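# For females
def egfr_f(age, s_cr):
    """Return the CKD-EPI (2021) eGFR for a female patient, rounded to two decimals."""
    ratio = s_cr / 0.7
    result = round(142 * pow(min(ratio, 1), -0.241) * pow(max(ratio, 1), -1.2) * pow(0.9938, age) * 1.012, 2)
    return result

# For males
def egfr_m(age, s_cr):
    """Return the CKD-EPI (2021) eGFR for a male patient, rounded to two decimals."""
    ratio = s_cr / 0.9
    result = round(142 * pow(min(ratio, 1), -0.302) * pow(max(ratio, 1), -1.2) * pow(0.9938, age), 2)
    return result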
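
Because the two functions differ only in three constants, they can also be written as one parameterised function. The sketch below is an illustration rather than the notebook's own code. The name `egfr_ckd_epi_2021` and the `COEFFICIENTS` lookup are hypothetical, and any gender other than 'F' is assumed to use the male constants.

```python
# Hypothetical lookup: (divisor, exponent, multiplier) per gender, taken from the two equations.
COEFFICIENTS = {
    "F": (0.7, -0.241, 1.012),
    "M": (0.9, -0.302, 1.0),
}


def egfr_ckd_epi_2021(age, s_cr, gender):
    """CKD-EPI (2021) eGFR rounded to two decimals; non-'F' genders use the male constants."""
    divisor, exponent, multiplier = COEFFICIENTS.get(gender, COEFFICIENTS["M"])
    ratio = s_cr / divisor
    return round(
        142 * pow(min(ratio, 1), exponent) * pow(max(ratio, 1), -1.2) * pow(0.9938, age) * multiplier,
        2,
    )


# Inputs taken from two rows of the dataframe preview (one female, one male).
print(egfr_ckd_epi_2021(70.218365, 2.0, "F"))
print(egfr_ckd_epi_2021(58.215661, 2.0, "M"))
```

A single function keeps the constants in one place, which makes it harder for the two variants to drift apart if the equation is ever updated.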

### Calculating eGFR for Each Row
---    
This code snippet calculates the estimated glomerular filtration rate (eGFR) for each row in the dataframe based on gender-specific equations and the values of 'age' and 'serum_creatinine' columns.

1. Applying eGFR Calculation:
   - The `apply()` function is used to iterate over each row in the dataframe and calculate the eGFR.
   - For females (gender == 'F'), the `egfr_f()` function is applied with the 'age' and 'serum_creatinine' values from the current row.
   - For males (gender != 'F'), the `egfr_m()` function is applied with the 'age' and 'serum_creatinine' values from the current row.

2. Creating 'egfr' Column:
   - The calculated eGFR values are assigned to a new column named 'egfr' in the dataframe.

In [13]:
df['egfr'] = df.apply(lambda row: egfr_f(row['age'], row['serum_creatinine']) if row['gender'] == 'F' else egfr_m(row['age'], row['serum_creatinine']), axis=1)
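
Row-wise `apply()` calls a Python function once per row, which can be slow on a dataframe with close to a million rows. As an alternative, the same formula can be evaluated on whole columns with NumPy. This is a sketch under the same assumption as above (every gender other than 'F' uses the male constants), and the `sample` dataframe simply repeats the five preview rows.

```python
import numpy as np
import pandas as pd

# The five preview rows, used here as a small stand-in for the full dataframe.
sample = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "F", "M"],
        "age": [83.886161, 70.337792, 70.218365, 74.144543, 58.215661],
        "serum_creatinine": [0.4, 0.8, 2.0, 0.7, 2.0],
    },
    index=[31084037, 31318874, 31456588, 31468360, 31572939],
)

# Pick the gender-specific constants for every row at once.
is_female = sample["gender"].eq("F")
divisor = np.where(is_female, 0.7, 0.9)
exponent = np.where(is_female, -0.241, -0.302)
multiplier = np.where(is_female, 1.012, 1.0)

# Evaluate the equation column-wise instead of row by row.
ratio = sample["serum_creatinine"] / divisor
sample["egfr"] = (
    142
    * np.minimum(ratio, 1) ** exponent
    * np.maximum(ratio, 1) ** -1.2
    * 0.9938 ** sample["age"]
    * multiplier
).round(2)

print(sample)
```

The vectorized result should agree with the row-wise version up to floating-point rounding; comparing the two on a subset is a cheap sanity check before switching.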

In [14]:
df.head()

         gender        age  serum_creatinine    egfr
31084037      M  83.886161               0.4  107.66
31318874      M  70.337792               0.8   95.01
31456588      F  70.218365               2.0   26.34
31468360      F  74.144543               0.7   90.62
31572939      M  58.215661               2.0   37.92
