# PROJECT UNDERSTANDING

**Introduction**

Sepsis is a life-threatening medical condition that occurs when the body's response to an infection injures its own tissues and organs. It is a leading cause of death in hospitals worldwide, and its incidence is increasing. Early diagnosis and treatment are crucial for improving patient outcomes, but sepsis is difficult to diagnose in its early stages because its symptoms are often subtle, non-specific, and overlap with those of other conditions. As a result, sepsis is often underdiagnosed or misdiagnosed, leading to delayed treatment and worse patient outcomes.

A sepsis prediction model could help to improve the early diagnosis and treatment of sepsis by identifying patients who are at high risk of developing the condition. This could be done by using machine learning algorithms to analyze patient data from electronic health records (EHRs).



Read about the data columns **[here](https://github.com/fantastic-rambo/Embedding-Machine-Learning-Model-in-FastAPI/blob/main/data/README.md)**.

**About the Datasets:**

* **ID**: Unique patient identifier.
* **PRG** (Plasma glucose): Measurement of plasma glucose levels.
* **PL** (Blood Work Result-1): First blood work result (in mu U/ml).
* **PR** (Blood Pressure): Blood pressure measurement (in mm Hg).
* **SK** (Blood Work Result-2): Second blood work result (in mm).
* **TS** (Blood Work Result-3): Third blood work result (in mu U/ml).
* **M11** (Body mass index): Body mass index, calculated as weight in kg divided by the square of height in meters.
* **BD2** (Blood Work Result-4): Fourth blood work result (in mu U/ml).
* **Age**: Age of the patient in years.
* **Insurance**: Binary indicator of whether the patient holds a valid insurance card.
* **Sepsis**: Binary outcome indicating whether the patient develops sepsis in the ICU (Positive or Negative). The column is spelled `Sepssis` in the data files.


## Goal Of Project
The goal of the project is to develop a model that can predict whether or not a patient in the ICU will develop sepsis. This model could be used to identify patients who are at high risk of developing sepsis, allowing clinicians to initiate early treatment and improve patient outcomes.





## Hypothesis
**Null Hypothesis (H0)**
There is no significant difference in the blood work results and clinical data of patients who will develop sepsis compared to those who will not develop sepsis.

**Alternative Hypothesis (Ha)**
There is a significant difference in the blood work results and clinical data of patients who will develop sepsis compared to those who will not develop sepsis.
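
One way to examine these hypotheses is to compare each numeric feature between the Positive and Negative groups with a two-sample test. The sketch below is only an illustration: it assumes SciPy is available, picks the Mann-Whitney U test (a choice made here, not one stated in this notebook), uses a hypothetical helper named `compare_groups`, and runs on synthetic stand-in data rather than the real patient files.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_groups(df, target="Sepssis", positive="Positive", alpha=0.05):
    """Run a two-sided Mann-Whitney U test on every numeric column,
    comparing rows labelled `positive` against all other rows."""
    is_pos = df[target] == positive
    rows = []
    for col in df.select_dtypes(include="number").columns:
        stat, p_value = mannwhitneyu(
            df.loc[is_pos, col], df.loc[~is_pos, col], alternative="two-sided"
        )
        rows.append(
            {"feature": col, "U": stat, "p_value": p_value, "reject_H0": p_value < alpha}
        )
    return pd.DataFrame(rows).set_index("feature")


# Synthetic stand-in data (NOT the real dataset): PL is shifted upward for
# Positive patients, PR is drawn from the same distribution for both groups.
rng = np.random.default_rng(0)
n = 200
labels = np.where(rng.random(n) < 0.35, "Positive", "Negative")
demo = pd.DataFrame(
    {
        "PL": rng.normal(110, 20, n) + np.where(labels == "Positive", 30, 0),
        "PR": rng.normal(70, 15, n),
        "Sepssis": labels,
    }
)

result = compare_groups(demo)
print(result)
```

Because one test is run per feature, a multiple-comparison correction (for example, dividing `alpha` by the number of features) may be worth considering before rejecting H0 for any single column.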




## Analytical Questions

* Are there discernible demographic patterns among patients who develop sepsis, such as age, insurance status, or other demographic variables?

* What is the correlation between different medical indicators (e.g., plasma glucose, blood pressure) and the likelihood of developing sepsis?

* Do specific blood work results exhibit a noticeable impact on the probability of sepsis development?

* Is there a significant association between insurance status and the risk of developing sepsis in ICU patients?

* How well can we predict the likelihood of sepsis based on the available variables in the dataset?

* Are there any temporal trends in the occurrence of sepsis or changes in the distribution of key variables over time?

* How do different predictive models perform in estimating the likelihood of sepsis?
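
The last question, on comparing predictive models, could be approached with cross-validation. The following is a minimal sketch under stated assumptions: the helper `compare_models`, the choice of ROC AUC as the metric, the tree depth, and the synthetic data are all illustrative and are not taken from this notebook. Only the column names and the two estimator classes match what the notebook loads and imports.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["PRG", "PL", "PR", "SK", "TS", "M11", "BD2", "Age", "Insurance"]


def compare_models(df, target="Sepssis", cv=5):
    """Return the mean cross-validated ROC AUC for two candidate classifiers."""
    X = df[FEATURES]
    y = (df[target] == "Positive").astype(int)  # encode Positive as 1
    models = {
        # Scaling lives inside the pipeline so it is fitted per training fold.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, model in models.items()
    }


# Synthetic stand-in data (NOT the real dataset): only PL carries signal.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame(rng.normal(size=(n, len(FEATURES))), columns=FEATURES)
noise = 0.5 * rng.normal(size=n)
demo["Sepssis"] = np.where(demo["PL"] + noise > 0.4, "Positive", "Negative")

scores = compare_models(demo)
for name, auc in scores.items():
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```

Running the same function on `train_data` would give a like-for-like comparison on the real features, since both models see identical folds.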

# DATA UNDERSTANDING

In [16]:
import pandas as pd                       # For data manipulation and analysis
import numpy as np                        # For numerical operations
import matplotlib.pyplot as plt           # For data visualization
import seaborn as sns                     # For statistical data visualization
import re                                 # For regular expressions
import warnings                           # For controlling warning output


#Libraries for feature scaling
from sklearn.preprocessing import StandardScaler

#Libraries for Validation
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics   #Import scikit-learn metrics module for accuracy calculation

#Libraries for Training model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [17]:
#Check numpy and pandas version

print("Numpy version: ", np.__version__)
print("Pandas version: ",pd.__version__)

Numpy version:  1.23.5
Pandas version:  1.5.3


In [18]:
#load data
train_data = pd.read_csv("https://raw.githubusercontent.com/fantastic-rambo/Embedding-Machine-Learning-Model-in-FastAPI/main/data/Paitients_Files_Train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/fantastic-rambo/Embedding-Machine-Learning-Model-in-FastAPI/main/data/Paitients_Files_Test.csv")

# DATA PREPARATION

In [19]:
train_data.head()

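| | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance | Sepssis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ICU200010 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0 | Positive |
| 1 | ICU200011 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | Negative |
| 2 | ICU200012 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | Positive |
| 3 | ICU200013 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 1 | Negative |
| 4 | ICU200014 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | Positive |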


In [48]:
test_data.head()

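| | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ICU200609 | 1 | 109 | 38 | 18 | 120 | 23.1 | 0.407 | 26 | 1 |
| 1 | ICU200610 | 1 | 108 | 88 | 19 | 0 | 27.1 | 0.4 | 24 | 1 |
| 2 | ICU200611 | 6 | 96 | 0 | 0 | 0 | 23.7 | 0.19 | 28 | 1 |
| 3 | ICU200612 | 1 | 124 | 74 | 36 | 0 | 27.8 | 0.1 | 30 | 1 |
| 4 | ICU200613 | 7 | 150 | 78 | 29 | 126 | 35.2 | 0.692 | 54 | 0 |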


### EDA

In [31]:
def display_dataset_info(train_data, test_data):
    """
    Display information about the train and test datasets.

    Parameters:
    - train_data: DataFrame, the training dataset.
    - test_data: DataFrame, the testing dataset.
    """
    print("Train Dataset Info:")
    train_data.info()

    print("\nTest Dataset Info:")
    test_data.info()

# Uses the train_data and test_data DataFrames loaded earlier
display_dataset_info(train_data, test_data)

Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         599 non-null    object 
 1   PRG        599 non-null    int64  
 2   PL         599 non-null    int64  
 3   PR         599 non-null    int64  
 4   SK         599 non-null    int64  
 5   TS         599 non-null    int64  
 6   M11        599 non-null    float64
 7   BD2        599 non-null    float64
 8   Age        599 non-null    int64  
 9   Insurance  599 non-null    int64  
 10  Sepssis    599 non-null    object 
dtypes: float64(2), int64(7), object(2)
memory usage: 51.6+ KB

Test Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         169 non-null    object 
 1   PRG        169 non-null    int64  
 2   PL

In [40]:
train_data.shape, test_data.shape          # checking for the shapes

((599, 11), (169, 10))

The training dataset contains 599 patient records in 11 columns, with no missing values. The columns include the target variable, which is spelled `Sepssis` in the data.

The test dataset contains 169 entries in 10 columns. The target column is absent, so the test data carries no sepsis outcome labels.
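
Since the target column is spelled `Sepssis` and holds the strings Positive and Negative, a later step will likely need to rename and encode it. The sketch below shows one possible way to do that; the helper name `prepare_target` and the 0/1 mapping are assumptions made for illustration, and the small frame reuses only the five rows shown by `train_data.head()`.

```python
import pandas as pd


def prepare_target(df, raw_name="Sepssis", clean_name="Sepsis"):
    """Return a copy with the target column renamed and encoded as 0/1."""
    out = df.rename(columns={raw_name: clean_name})  # rename returns a new frame
    out[clean_name] = out[clean_name].map({"Negative": 0, "Positive": 1})
    return out


# The five rows shown by train_data.head(), reduced to three columns.
sample = pd.DataFrame(
    {
        "ID": ["ICU200010", "ICU200011", "ICU200012", "ICU200013", "ICU200014"],
        "PL": [148, 85, 183, 89, 137],
        "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
    }
)

prepared = prepare_target(sample)
# Share of each class; on the full training data this reveals any imbalance.
print(prepared["Sepsis"].value_counts(normalize=True))
```

Calling `prepare_target(train_data)` would apply the same steps to the full training set while leaving the original frame untouched.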

In [36]:
train_data.duplicated().sum(), test_data.duplicated().sum()   #checking for duplicates

(0, 0)

The result `(0, 0)` shows that neither the train nor the test dataset contains duplicated rows.

In [47]:
train_data.isna().sum()     #checking for null values

ID           0
PRG          0
PL           0
PR           0
SK           0
TS           0
M11          0
BD2          0
Age          0
Insurance    0
Sepssis      0
dtype: int64

In [None]:
test_data.isna().sum()    #checking for null values

In [43]:
train_data.describe()     #checking statistical info of dataset

| | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance |
|---|---|---|---|---|---|---|---|---|---|
| count | 599.0 | 599.0 | 599.0 | 599.0 | 599.0 | 599.0 | 599.0 | 599.0 | 599.0 |
| mean | 3.824708 | 120.153589 | 68.732888 | 20.562604 | 79.460768 | 31.920033 | 0.481187 | 33.290484 | 0.686144 |
| std | 3.362839 | 32.682364 | 19.335675 | 16.017622 | 116.576176 | 8.008227 | 0.337552 | 11.828446 | 0.464447 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 64.0 | 0.0 | 0.0 | 27.1 | 0.248 | 24.0 | 0.0 |
| 50% | 3.0 | 116.0 | 70.0 | 23.0 | 36.0 | 32.0 | 0.383 | 29.0 | 1.0 |
| 75% | 6.0 | 140.0 | 80.0 | 32.0 | 123.5 | 36.55 | 0.647 | 40.0 | 1.0 |
| max | 17.0 | 198.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [44]:
test_data.describe()       #checking statistical info of dataset

| | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance |
|---|---|---|---|---|---|---|---|---|---|
| count | 169.0 | 169.0 | 169.0 | 169.0 | 169.0 | 169.0 | 169.0 | 169.0 | 169.0 |
| mean | 3.91716 | 123.52071 | 70.426036 | 20.443787 | 81.0 | 32.249704 | 0.438876 | 33.065089 | 0.727811 |
| std | 3.402415 | 29.259123 | 19.426805 | 15.764962 | 110.720852 | 7.444886 | 0.306935 | 11.54811 | 0.44641 |
| min | 0.0 | 56.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 21.0 | 0.0 |
| 25% | 1.0 | 102.0 | 62.0 | 0.0 | 0.0 | 27.6 | 0.223 | 24.0 | 0.0 |
| 50% | 3.0 | 120.0 | 74.0 | 23.0 | 0.0 | 32.4 | 0.343 | 28.0 | 1.0 |
| 75% | 6.0 | 141.0 | 80.0 | 32.0 | 135.0 | 36.6 | 0.587 | 42.0 | 1.0 |
| max | 13.0 | 199.0 | 114.0 | 49.0 | 540.0 | 57.3 | 1.698 | 70.0 | 1.0 |
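
The summaries above show a minimum of 0 for columns such as PR (blood pressure) and M11 (body mass index), even though `isna()` reports no nulls. A zero in those measurements is not physiologically plausible, so these zeros may stand in for missing readings. That interpretation is an assumption, not something the dataset documentation confirms. The sketch below, using a hypothetical helper `zeros_to_nan` and the five rows from `train_data.head()`, shows how such zeros could be flagged for later imputation.

```python
import numpy as np
import pandas as pd

# Columns whose zeros are ASSUMED to encode missing measurements.
ZERO_SUSPECT_COLS = ["PL", "PR", "SK", "TS", "M11"]


def zeros_to_nan(df, cols=ZERO_SUSPECT_COLS):
    """Return a copy of df with zeros in `cols` replaced by NaN."""
    out = df.copy()
    out[cols] = out[cols].replace(0, np.nan)
    return out


# The five rows shown by train_data.head(), restricted to the suspect columns.
sample = pd.DataFrame(
    {
        "PL": [148, 85, 183, 89, 137],
        "PR": [72, 66, 64, 66, 40],
        "SK": [35, 29, 0, 23, 35],
        "TS": [0, 0, 0, 94, 168],
        "M11": [33.6, 26.6, 23.3, 28.1, 43.1],
    }
)

cleaned = zeros_to_nan(sample)
# Number of values per column that would be treated as missing.
print(cleaned[ZERO_SUSPECT_COLS].isna().sum())
```

Whether to impute these values (for example with the column median) or leave the zeros in place is a modelling decision that depends on how the measurements were actually recorded.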


# MODELLING

# EVALUATION