# Goal 
- Given a dataset with 50 features, we must reduce it to `7` features, including the 2 possible target variables `A1Cresult` and `readmitted`. We must then find `5` more features that are likely to have strong predictive power for our chosen target. We decided that `7` is a good number because it greatly reduces the number of potential patterns to be discovered while, we expect, keeping most of the useful information. If we used all 50 features, it would be much harder to find the true pattern.

# About the dataset
- The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row is the hospital record of a patient diagnosed with diabetes who underwent laboratory tests, received medications, and stayed up to 14 days. The goal is to predict early readmission of the patient, that is, readmission within 30 days of discharge. The problem is important for the following reasons. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fail to attend to glycemic control. Failure to provide proper diabetes care not only increases the managing costs for the hospitals (as the patients are readmitted) but also impacts the morbidity and mortality of the patients, who may face complications associated with diabetes.

### Import Libraries & Data #### 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
print("Libraries imported ....")

Libraries imported ....


In [3]:
# Load Data
df = pd.read_csv("../data/diabetic_data.csv")
print("data loaded ...")

data loaded ...


In [4]:
df.replace('?', np.nan, inplace=True)   # Replace all '?' to NaN for easier processing
print("? -> NaN")

? -> NaN
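As an alternative to replacing `'?'` after loading, the same effect can usually be had at read time. The snippet below is a minimal sketch using the `na_values` parameter of `pd.read_csv`; the inline CSV text and the name `sample` are made up for illustration and are not part of the dataset.

```python
import io
import pandas as pd

# Tiny illustrative CSV in which '?' marks a missing value (not the real data)
csv_text = "race,weight,age\nCaucasian,?,[0-10)\n?,?,[10-20)\n"

# na_values tells the parser to treat '?' as NaN while reading
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample.isna().sum())
```

Either approach should leave the frame in the same state; parsing the marker at load time simply avoids a second pass over the data.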


In [5]:
df.head(5)


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [6]:
df.shape

(101766, 50)

# Features to Keep 

### Potential Targets
1. `readmitted`  - This represents whether the patient was readmitted and is directly related to patient outcomes.

2. `A1Cresult` - Outcome of the hemoglobin A1c test, which measures the average blood sugar level over the past 2-3 months. It's a key metric for assessing diabetes control.

### Additional 5 Features 
- We judged that these features have the strongest predictive power for our targets. Together they give a good glimpse of who the patient is, the severity of their diagnosis, and the complexity of their overall case.

3. `age` - Although age is given only as a range, we believe it still has strong predictive power for `readmitted` or `A1Cresult`, since we expect older patients to be more prone to diabetes-related complications.

4. `time_in_hospital` - Indicates the length of hospital stay, potentially related to both readmission likelihood and diabetes control. 


5. `num_medications` - The number of medications prescribed could reflect the severity and complexity of the patient's condition.

6. `number_diagnoses` - Represents the total number of diagnoses, indicating the complexity of a patient's condition.

7. `insulin` - Whether a patient is taking insulin, and how the dosage changed, is a key indicator, as the prescription is crucial in diabetes management. If we had to guess, this may be one of the most important features.


### Total features = `7`
- Six of these seven features have no null values, which helps data integrity: the values are the true values in the dataset, and we do not have to rely on imputation, which can make a model less capable of generalizing because imputation methods usually rely on interpolation. The exception is `A1Cresult`, whose missing values are handled in the feature transformation step.



In [7]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [8]:
features_to_keep = ['age', 'time_in_hospital', 'num_medications', 'number_diagnoses', 'insulin', 'A1Cresult', 'readmitted']
print(f"Features to keep: {features_to_keep}. Total number of features: {len(features_to_keep)}")

Features to keep: ['age', 'time_in_hospital', 'num_medications', 'number_diagnoses', 'insulin', 'A1Cresult', 'readmitted']. Total number of features: 7


In [9]:
# Reduce dataset
reduced_df = df[features_to_keep]
print("Dataframe Reduced ...")

Dataframe Reduced ...


In [10]:
reduced_df

Unnamed: 0,age,time_in_hospital,num_medications,number_diagnoses,insulin,A1Cresult,readmitted
0,[0-10),1,1,1,No,,NO
1,[10-20),3,18,9,Up,,>30
2,[20-30),2,13,6,No,,NO
3,[30-40),2,16,7,Up,,NO
4,[40-50),1,8,5,Steady,,NO
...,...,...,...,...,...,...,...
101761,[70-80),3,16,9,Down,>8,>30
101762,[80-90),5,18,9,Steady,,NO
101763,[70-80),1,9,13,Down,,NO
101764,[80-90),10,21,9,Up,,NO


In [11]:
# Generate a null values report
null_report = pd.DataFrame({
    'Total Nulls': reduced_df.isnull().sum(),
    'Percentage Null': (reduced_df.isnull().mean() * 100)
}).sort_values(by='Total Nulls', ascending=False)

print(null_report)


                  Total Nulls  Percentage Null
A1Cresult               84748        83.277322
age                         0         0.000000
time_in_hospital            0         0.000000
num_medications             0         0.000000
number_diagnoses            0         0.000000
insulin                     0         0.000000
readmitted                  0         0.000000


Apart from `A1Cresult`, none of the features has null values. `A1Cresult` is missing in 84,748 rows (about 83%), but we treat this as informative rather than as a data-quality problem: no result likely means the test was not ordered, possibly because the case was less serious.

### Feature transformation

## age <br>
Possible values: 10. The ages are not exact; they are given as 10-year bins. We will convert the bins to integers, since integers are easier to process computationally. <br>
- [0-10)  -> 0
- [10-20)  -> 1
- [20-30) -> 2
- [30-40) -> 3
- [40-50) -> 4
- [50-60) -> 5
- [60-70) -> 6
- [70-80) -> 7
- [80-90) -> 8
- [90-100) -> 9
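The mapping listed above follows a simple rule: each bin maps to the lower bound of its range divided by 10. A minimal sketch of deriving it programmatically is shown here; the helper name `age_bin_to_int` and the variable `derived_mapping` are hypothetical and do not appear in the notebook.

```python
def age_bin_to_int(label: str) -> int:
    """Map a label such as '[30-40)' to its decade index (3)."""
    # Drop the surrounding '[' and ')' and keep the lower bound of the range
    lower = int(label.strip("[)").split("-")[0])
    return lower // 10

# Rebuild the ten labels used in the dataset and derive their integer codes
labels = [f"[{lo}-{lo + 10})" for lo in range(0, 100, 10)]
derived_mapping = {label: age_bin_to_int(label) for label in labels}

print(derived_mapping)
```

Writing the dictionary out by hand, as the notebook does, is equally valid; the derived version only guards against typos in the labels.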

In [12]:
reduced_df['age'].unique()

array(['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)',
       '[60-70)', '[70-80)', '[80-90)', '[90-100)'], dtype=object)

In [13]:
# Define the mapping from age bins to integers
age_mapping = {
    '[0-10)': 0,
    '[10-20)': 1,
    '[20-30)': 2,
    '[30-40)': 3,
    '[40-50)': 4,
    '[50-60)': 5,
    '[60-70)': 6,
    '[70-80)': 7,
    '[80-90)': 8,
    '[90-100)': 9
}

# Replace age bins with integers in the DataFrame
reduced_df['age'] = reduced_df['age'].replace(age_mapping)

  reduced_df['age'] = reduced_df['age'].replace(age_mapping)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reduced_df['age'] = reduced_df['age'].replace(age_mapping)
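The warning above appears because `reduced_df` was created by selecting columns from `df`, so pandas cannot tell whether the assignment targets a view or a copy. In pandas versions that emit this warning, one common way to avoid it is to take an explicit `.copy()` when subsetting. The sketch below assumes the reduced frame is meant to be independent of `df`; the toy data is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the full dataset (values are illustrative only)
df = pd.DataFrame({
    "age": ["[0-10)", "[10-20)", "[20-30)"],
    "weight": [None, None, None],
    "readmitted": ["NO", ">30", "NO"],
})

# An explicit copy makes reduced_df an independent DataFrame,
# so later column assignments do not target a slice of df
reduced_df = df[["age", "readmitted"]].copy()

age_mapping = {"[0-10)": 0, "[10-20)": 1, "[20-30)": 2}
reduced_df["age"] = reduced_df["age"].map(age_mapping)

print(reduced_df["age"].tolist())
print(df["age"].tolist())  # the original frame is unchanged
```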


In [14]:
reduced_df['age'].unique()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [49]:
from methods import find_optimal_bin_size, discretize_continuous_variables
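# NOTE: this list holds the unique values of time_in_hospital, not ages;
# `age` was already mapped to the integers 0-9 above.
age_values = [ 1,  3,  2,  4,  5, 13, 12,  9,  7, 10,  6, 11,  8, 14]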
age_optimal_bin_size = find_optimal_bin_size(age_values)
print(f"Optimal bin size for age: {age_optimal_bin_size}")
age_output = discretize_continuous_variables(age_values, age_optimal_bin_size)
print(f"Bins for age: {age_output['bin_edges']}")

Optimal bin size for age: 9
Bins for age: [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (5.0, 7.0), (7.0, 8.0), (8.0, 9.0), (9.0, 10.5), (10.5, 13.0)]


### Discretization of Continuous variables
- Find the optimal number of bins by clustering the values with k-means and choosing the number of clusters with the elbow method.

#### Steps:
1. Find unique values
2. Find optimal number of bins
3. Perform discretization
4. Convert values in the dataset
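The `methods` module used in this notebook is not shown, so the following is only a sketch of how the steps above could look, not the actual implementation. It assumes a plain 1-D k-means, an elbow chosen by the largest second difference of the inertia curve, and bin boundaries placed at midpoints between neighbouring cluster centers. All function names here are hypothetical, and the output is not expected to reproduce the bin edges printed elsewhere in this notebook.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns sorted centers and inertia."""
    x = np.asarray(values, dtype=float)
    # Deterministic start: spread the initial centers over the data quantiles
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Empty clusters keep their previous center
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    inertia = float(((x - centers[labels]) ** 2).sum())
    return np.sort(centers), inertia

def elbow_k(inertias):
    """Choose k where the inertia curve bends most (largest second difference).

    inertias[0] must correspond to k = 1.
    """
    return int(np.argmax(np.diff(inertias, n=2))) + 2

def edges_from_centers(centers):
    """Use midpoints between neighbouring centers as bin boundaries."""
    return (centers[:-1] + centers[1:]) / 2

# Two obvious groups, made up for illustration
values = [1, 2, 3, 4, 50, 51, 52, 53]
inertias = [kmeans_1d(values, k)[1] for k in range(1, 6)]
best_k = elbow_k(inertias)
centers, _ = kmeans_1d(values, best_k)

print(best_k, edges_from_centers(centers))
```

The second-difference rule is only one of several ways to locate an elbow; it tends to favour small k when the first drop in inertia dominates the curve.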

In [16]:
from methods import find_optimal_bin_size, discretize_continuous_variables # custom implementation of discretization

## time_in_hospital <br>

In [20]:
print(f"Total number of unique values: {len(reduced_df['time_in_hospital'].unique())}")
reduced_df['time_in_hospital'].unique()

Total number of unique values: 14


array([ 1,  3,  2,  4,  5, 13, 12,  9,  7, 10,  6, 11,  8, 14],
      dtype=int64)

In [17]:
time_in_hospital_values = [ 1,  3,  2,  4,  5, 13, 12,  9,  7, 10,  6, 11,  8, 14]
time_in_hospital_optimal_bin_size = find_optimal_bin_size(time_in_hospital_values)
print(f"Optimal bin size for time_in_hospital: {time_in_hospital_optimal_bin_size}")
time_in_hospital_output = discretize_continuous_variables(time_in_hospital_values, time_in_hospital_optimal_bin_size)
print(f"Bins for time_in_hospital: {time_in_hospital_output['bin_edges']}")

Optimal bin size for time_in_hospital: 9
Bins for time_in_hospital: [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (5.0, 7.0), (7.0, 8.0), (8.0, 9.0), (9.0, 10.5), (10.5, 13.0)]


In [21]:
# Define the mapping for time_in_hospital bins
time_in_hospital_mapping = {
    '(1.0, 2.0)': 0,
    '(2.0, 3.0)': 1,
    '(3.0, 5.0)': 2,
    '(5.0, 7.0)': 3,
    '(7.0, 8.0)': 4,
    '(8.0, 9.0)': 5,
    '(9.0, 10.5)': 6,
    '(10.5, 13.0)': 7,
    'below': -1,   # Special case for values below 1.0
    'above': 8     # Special case for values above 13.0
}

# Replace time_in_hospital with integers in the DataFrame
def map_time_in_hospital(value):
    if value < 1.0:
        return time_in_hospital_mapping['below']
    elif value >= 13.0:
        return time_in_hospital_mapping['above']
    elif 1.0 <= value < 2.0:
        return time_in_hospital_mapping['(1.0, 2.0)']
    elif 2.0 <= value < 3.0:
        return time_in_hospital_mapping['(2.0, 3.0)']
    elif 3.0 <= value < 5.0:
        return time_in_hospital_mapping['(3.0, 5.0)']
    elif 5.0 <= value < 7.0:
        return time_in_hospital_mapping['(5.0, 7.0)']
    elif 7.0 <= value < 8.0:
        return time_in_hospital_mapping['(7.0, 8.0)']
    elif 8.0 <= value < 9.0:
        return time_in_hospital_mapping['(8.0, 9.0)']
    elif 9.0 <= value < 10.5:
        return time_in_hospital_mapping['(9.0, 10.5)']
    elif 10.5 <= value < 13.0:
        return time_in_hospital_mapping['(10.5, 13.0)']

# Apply the mapping
reduced_df['time_in_hospital'] = reduced_df['time_in_hospital'].apply(map_time_in_hospital)
print("Hospital Values Binned ...")

Hospital Values Binned ...




In [25]:
# Validate
reduced_df['time_in_hospital'].unique()

array([0, 2, 1, 3, 8, 7, 6, 4, 5], dtype=int64)
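The chain of `elif` branches can also be expressed with `np.digitize`. The sketch below uses the bin boundaries printed for `time_in_hospital` and should yield the same codes as `map_time_in_hospital` (-1 below the first edge, 8 at or above the last); the helper name `bin_codes` is hypothetical.

```python
import numpy as np

# Boundaries taken from the bins printed for time_in_hospital
edges = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 8.0, 9.0, 10.5, 13.0])

def bin_codes(values, edges):
    # np.digitize returns i such that edges[i-1] <= v < edges[i];
    # subtracting 1 gives -1 below the first edge and
    # len(edges) - 1 at or above the last edge
    return np.digitize(values, edges) - 1

days = np.arange(1, 15)  # the 14 distinct stay lengths, 1..14
codes = bin_codes(days, edges)

print(dict(zip(days.tolist(), codes.tolist())))
```

The same helper would apply to the other numeric columns by swapping in their own boundary arrays.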

## num_medications <br>


In [23]:
print(f"Total number of unique values: {len(reduced_df['num_medications'].unique())}")
reduced_df['num_medications'].unique()

Total number of unique values: 75


array([ 1, 18, 13, 16,  8, 21, 12, 28, 17, 11, 15, 31,  2, 23, 19,  7, 20,
       14, 10, 22,  9, 27, 25,  4, 32,  6, 30, 26, 24, 33,  5, 39,  3, 29,
       61, 40, 46, 41, 36, 34, 35, 50, 43, 42, 37, 51, 38, 45, 54, 52, 49,
       62, 55, 47, 44, 53, 48, 57, 59, 56, 60, 63, 58, 70, 67, 64, 69, 65,
       68, 66, 81, 79, 75, 72, 74], dtype=int64)

In [24]:
num_medications_values = [ 1, 18, 13, 16,  8, 21, 12, 28, 17, 11, 15, 31,  2, 23, 19,  7, 20,
       14, 10, 22,  9, 27, 25,  4, 32,  6, 30, 26, 24, 33,  5, 39,  3, 29,
       61, 40, 46, 41, 36, 34, 35, 50, 43, 42, 37, 51, 38, 45, 54, 52, 49,
       62, 55, 47, 44, 53, 48, 57, 59, 56, 60, 63, 58, 70, 67, 64, 69, 65,
       68, 66, 81, 79, 75, 72, 74]

num_medications_optimal_bin_size = find_optimal_bin_size(num_medications_values)
print(f"Optimal bin size for num_medications: {num_medications_optimal_bin_size}")
num_medications_output = discretize_continuous_variables(num_medications_values, num_medications_optimal_bin_size)
print(f"Bins for num_medications: {num_medications_output['bin_edges']}")

Optimal bin size for num_medications: 2
Bins for num_medications: [(19.499999999999993, 57.432432432432435)]


In [27]:
# Define the mapping for num_medications bins
num_medications_mapping = {
    '(19.5, 57.4)': 0,   # Single bin
    'below': -1,         # Special case for values below 19.5
    'above': 1           # Special case for values above 57.4
}

# Replace num_medications with integers in the DataFrame
def map_num_medications(value):
    if value < 19.5:
        return num_medications_mapping['below']
    elif value > 57.4:
        return num_medications_mapping['above']
    else:  # value is between 19.5 and 57.4
        return num_medications_mapping['(19.5, 57.4)']

# Apply the mapping
reduced_df['num_medications'] = reduced_df['num_medications'].apply(map_num_medications)




In [28]:
# Validate
reduced_df['num_medications'].unique()

array([-1,  0,  1], dtype=int64)

## number_diagnoses <br>

In [29]:
print(f"Total number of unique values: {len(reduced_df['number_diagnoses'].unique())}")
reduced_df['number_diagnoses'].unique()

Total number of unique values: 16


array([ 1,  9,  6,  7,  5,  8,  3,  4,  2, 16, 12, 13, 15, 10, 11, 14],
      dtype=int64)

In [30]:
number_diagnoses_values = [ 1,  9,  6,  7,  5,  8,  3,  4,  2, 16, 12, 13, 15, 10, 11, 14]

number_diagnoses_optimal_bin_size = find_optimal_bin_size(number_diagnoses_values)
print(f"Optimal bin size for number_diagnoses: {number_diagnoses_optimal_bin_size}")
number_diagnoses_output = discretize_continuous_variables(number_diagnoses_values, number_diagnoses_optimal_bin_size)
print(f"Bins for number_diagnoses: {number_diagnoses_output['bin_edges']}")

Optimal bin size for number_diagnoses: 5
Bins for number_diagnoses: [(2.5, 7.0), (7.0, 11.0), (11.0, 13.5), (13.5, 15.5)]


In [31]:
# Define the mapping for number_diagnoses bins
number_diagnoses_mapping = {
    '(2.5, 7.0)': 0,
    '(7.0, 11.0)': 1,
    '(11.0, 13.5)': 2,
    '(13.5, 15.5)': 3,
    'below': -1,         # Special case for values below 2.5
    'above': 4           # Special case for values above 15.5
}

# Replace number_diagnoses with integers in the DataFrame
def map_number_diagnoses(value):
    if value < 2.5:
        return number_diagnoses_mapping['below']
    elif 2.5 <= value < 7.0:
        return number_diagnoses_mapping['(2.5, 7.0)']
    elif 7.0 <= value < 11.0:
        return number_diagnoses_mapping['(7.0, 11.0)']
    elif 11.0 <= value < 13.5:
        return number_diagnoses_mapping['(11.0, 13.5)']
    elif 13.5 <= value < 15.5:
        return number_diagnoses_mapping['(13.5, 15.5)']
    else:  # value is above 15.5
        return number_diagnoses_mapping['above']

# Apply the mapping
reduced_df['number_diagnoses'] = reduced_df['number_diagnoses'].apply(map_number_diagnoses)




In [32]:
# Validate
reduced_df['number_diagnoses'].unique()

array([-1,  1,  0,  4,  2,  3], dtype=int64)
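For comparison, the same binning can be sketched with `pd.cut`, using left-closed intervals (`right=False`) and open-ended outer bins so that the "below" and "above" cases get the codes -1 and 4. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Boundaries from the number_diagnoses bins, padded with +/- infinity
bins = [-np.inf, 2.5, 7.0, 11.0, 13.5, 15.5, np.inf]
codes = [-1, 0, 1, 2, 3, 4]

# Illustrative counts, not rows from the dataset
counts = pd.Series([1, 2, 3, 6, 7, 10, 11, 13, 14, 15, 16])

# right=False makes each interval [low, high), matching the elif chain
binned = pd.cut(counts, bins=bins, right=False, labels=codes).astype(int)

print(binned.tolist())
```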

## insulin <br>

- `No` - No insulin was prescribed or administered. 
- `Down` - Indicates a decrease in the insulin dosage.
- `Steady` - Indicates that the insulin dosage remained consistent.
- `Up` - Indicates an increase in the insulin dosage.
<br>
- Modification: we convert these categories into integers using a fixed mapping.

In [33]:
reduced_df['insulin'].unique()

array(['No', 'Up', 'Steady', 'Down'], dtype=object)

In [34]:
# Define the mapping for insulin
insulin_mapping = {
    'No': 0,      # Map 'No' to 0
    'Down': 1,    # Map 'Down' to 1
    'Steady': 2,  # Map 'Steady' to 2
    'Up': 3       # Map 'Up' to 3
}

# Replace insulin values with the mapping in the DataFrame
reduced_df['insulin'] = reduced_df['insulin'].replace(insulin_mapping)

print("Insulin has been mapped ...")

Insulin has been mapped ...




In [35]:
reduced_df['insulin'].unique()

array([0, 3, 2, 1], dtype=int64)
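One risk with dictionary-based encoding is that a category missing from the mapping passes through unnoticed. The sketch below uses `Series.map`, which yields NaN for unmapped values, and raises if any are found. It assumes the column has no missing values (as is the case for `insulin`); the helper name `encode_categories` is hypothetical.

```python
import pandas as pd

insulin_mapping = {"No": 0, "Down": 1, "Steady": 2, "Up": 3}

def encode_categories(series: pd.Series, mapping: dict) -> pd.Series:
    """Encode categories as integers, failing loudly on unmapped values."""
    encoded = series.map(mapping)  # unmapped categories become NaN
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped categories: {list(unmapped)}")
    return encoded.astype("int64")

sample = pd.Series(["No", "Up", "Steady", "Down"])
print(encode_categories(sample, insulin_mapping).tolist())
```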

## A1Cresult <br>
- `>8` : This suggests the patient has consistently high blood glucose levels over time and may require adjustments to their diabetes management plan, such as changes to medication, diet, or lifestyle.

- `>7` : This level suggests the patient is managing their blood sugar better than `>8` but still may not meet the recommended target range for A1c.

- `Norm` : The patient has good blood sugar control, likely meeting the target set by their healthcare provider.

- `nan` : This means the test wasn't conducted or results are unavailable, so no inference about long-term blood sugar control can be made. This may imply that the test was not needed, which would make the case less complicated.


- The modification will be to convert the unique values into integers so that we can process the data more easily.

In [36]:
reduced_df['A1Cresult'].unique()

array([nan, '>7', '>8', 'Norm'], dtype=object)

In [37]:
# Define the mapping for A1Cresult
a1c_mapping = {
    np.nan: 0,  # Map NaN (no test result) to 0
    'Norm': 1,  # Map 'Norm' to 1
    '>7': 2,    # Map '>7' to 2
    '>8': 3     # Map '>8' to 3
}

# Replace A1Cresult values with the mapping in the DataFrame
reduced_df['A1Cresult'] = reduced_df['A1Cresult'].replace(a1c_mapping)
print("Mapped A1Cresult ...")

Mapped A1Cresult ...




In [38]:
reduced_df['A1Cresult'].unique()

array([0, 2, 3, 1], dtype=int64)
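Because the missing A1c results are treated as their own category, it can be clearer to name that category explicitly before encoding instead of using NaN as a dictionary key. A minimal sketch follows; the placeholder label `"not_measured"` is hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# The four values observed in A1Cresult, including the missing marker
a1c = pd.Series([np.nan, ">7", ">8", "Norm"], dtype=object)

a1c_codes = {"not_measured": 0, "Norm": 1, ">7": 2, ">8": 3}

# Name the missing category first, then apply the integer codes
encoded = a1c.fillna("not_measured").map(a1c_codes)

print(encoded.tolist())
```

The resulting codes match the mapping used in the notebook: 0 for no result, then 1 to 3 in order of increasing A1c level.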

## readmitted <br>
- `NO` - Patient has no record of readmission within the period of the study
- `>30` - Patient was readmitted more than 30 days after discharge
- `<30` - Patient was readmitted within 30 days of discharge
<br>

- As before, we map these categories to integers.

In [39]:
reduced_df['readmitted'].unique()

array(['NO', '>30', '<30'], dtype=object)

In [40]:
# Define the mapping for readmitted
readmitted_mapping = {
    'NO': 0,     # Map 'NO' to 0
    '>30': 1,    # Map '>30' to 1
    '<30': 2     # Map '<30' to 2
}

# Replace readmitted values with the mapping in the DataFrame
reduced_df['readmitted'] = reduced_df['readmitted'].replace(readmitted_mapping)

print("readmitted has been mapped ...")


readmitted has been mapped ...




In [41]:
reduced_df['readmitted'].unique()

array([0, 1, 2], dtype=int64)
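The dataset description frames the goal as detecting readmission within 30 days. This notebook keeps three classes, but if a binary "early readmission" target were wanted instead (an assumption, not something done here), it could be derived from the integer codes as sketched below. The sample values and the name `early_readmission` are made up for illustration.

```python
import pandas as pd

# Illustrative encoded labels: 0 = NO, 1 = >30, 2 = <30
readmitted = pd.Series([0, 1, 2, 0, 2], name="readmitted")

# 1 when the patient was readmitted within 30 days, 0 otherwise
early_readmission = (readmitted == 2).astype(int)

print(early_readmission.tolist())
```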

# Train-test-split
- Our data has been preprocessed and transformed and is ready to be used for modeling. To frame this as a supervised learning task, we must assign X (our features) and y (our target) and split the data. We opt for an 80% train / 20% test split, since `101,766` rows leave plenty of data both to train on and to evaluate on.

In [44]:
reduced_df.head(15)

Unnamed: 0,age,time_in_hospital,num_medications,number_diagnoses,insulin,A1Cresult,readmitted
0,0,0,-1,-1,0,0,0
1,1,2,-1,1,3,0,1
2,2,1,-1,0,0,0,0
3,3,1,-1,1,3,0,0
4,4,0,-1,0,2,0,0
5,5,2,-1,1,2,0,1
6,6,2,0,1,2,0,0
7,7,3,-1,1,0,0,1
8,8,8,0,1,2,0,0
9,9,7,-1,1,2,0,0


In [45]:
# Export full dataset
reduced_df.to_csv('../data/full_transformed_diabetic_data.csv', index=False)
print("dataset exported ...")

dataset exported ...


In [46]:
# Separate features (X) and target labels (y)
X = reduced_df.drop(columns=['readmitted'])  # Features
y = reduced_df['readmitted']  # Target labels
print("X and y separated ...")

X and y separated ...


In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Train test split executed ...")

Train test split executed ...
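Since the split uses `stratify=y`, the class proportions of `readmitted` should be nearly identical in the training and test sets. A minimal sketch of how that could be checked is shown below on synthetic labels; the 50/30/20 class mix is made up and is not the distribution of the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with a 50/30/20 class mix (illustrative, not the real data)
y = pd.Series([0] * 500 + [1] * 300 + [2] * 200, name="readmitted")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare the class proportions side by side
comparison = pd.DataFrame({
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).sort_index()

print(comparison)
```

Running the same comparison on the real `y_train` and `y_test` would confirm that the rarer classes are represented in both sets.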


In [48]:
# Combine the training data and labels
train_data = X_train.copy()
train_data['readmitted'] = y_train

# Combine the testing data and labels
test_data = X_test.copy()
test_data['readmitted'] = y_test

# Export training data to a CSV file
train_data.to_csv('../data/train_transformed_diabetic_data.csv', index=False)

# Export testing data to a CSV file
test_data.to_csv('../data/test_transformed_diabetic_data.csv', index=False)

print("Training and Testing Data exported ...")

Training and Testing Data exported ...
