Notes:
- <b>clustering_data.csv</b> is created by taking train_test_validate_data and adding an extra column 'cluster_group'. The 'cluster_group' is filtered using the 'los_icu' column. (Each patient is only able to be in one cluster - they are distinct). This is important because this information is used to create the intime distribution into the ABM.

    1. 'los_icu' <=0
    2. 'los_icu' 1 - 7
    3. 'los_icu' 8 - 24
    4. 'los_icu' > 25
    

- the four csvs <b>'group1_df.csv', 'group2_df.csv', 'group3_df.csv', 'group4_df.csv'</b> are not distinct. They are created using filters that have come from clustering_exploration_v2.ipynb. Patients can be in more than one group - they are not distinct). 
This is because these groups are used to create los distributions, where we do not need the groups to be distinct. 

    we want patients in each group to have a few of the same characteristics because we think those are the most important features for determining which group for los they should be in. However, we also want to keep in patients with other information too, which is how we enrich the los that is given to a patient when they enter the ABM. 


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from category_encoders import TargetEncoder
from category_encoders.cat_boost import CatBoostEncoder
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
import os

In [3]:
#os.chdir('/Users/chloemaine/Documents/Chloe/BGSE/masters_project/processed_data')
train_test_validate_data = pd.read_csv('../output_data/train_test_validate_data.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [91]:
train_test_validate_data['age_group'].unique()

array(['40.0-59.0', '60.0-79.0', '80-100', '0.0-19.0', '20.0-39.0',
       'nan-nan'], dtype=object)

In [95]:
train_test_validate_data[train_test_validate_data['age_group'] == '40.0-59.0']['los_icu'].value_counts()

1.0     4344
2.0     2658
3.0     1375
4.0      820
5.0      502
        ... 
59.0       1
57.0       1
76.0       1
58.0       1
62.0       1
Name: los_icu, Length: 68, dtype: int64

In [94]:
train_test_validate_data[train_test_validate_data['age_group'] == '20.0-39.0']['los_icu'].value_counts()

1.0     1555
2.0      851
3.0      427
4.0      217
5.0      148
0.0      134
6.0       92
7.0       77
11.0      58
9.0       38
8.0       34
10.0      33
12.0      23
15.0      23
14.0      19
13.0      18
17.0      18
16.0      16
20.0      13
21.0      11
24.0       9
23.0       8
27.0       8
32.0       7
26.0       6
22.0       6
18.0       6
19.0       5
28.0       4
31.0       4
33.0       4
37.0       4
39.0       4
40.0       3
25.0       3
29.0       2
36.0       2
34.0       2
51.0       1
49.0       1
43.0       1
79.0       1
42.0       1
77.0       1
47.0       1
60.0       1
30.0       1
38.0       1
48.0       1
83.0       1
Name: los_icu, dtype: int64

In [58]:
train_test_validate_data['los_icu'].mean()

4.639160384451263

In [59]:
train_test_validate_data['los_icu'].median()

2.0

# Add column for group for cluster group. Needed for creating intime distribution (input to admission_priors.ipynb)

#### Prepare data for model in same way as clustering_exploration.ipynb but now do not drop ideal_icu and FIRST_CAREUNIT

We do not drop 'first_category' now because we will use it to create intime distributions. We dropped it in clustering_exploration because it was correlated with other diagnosis information that would affect the prediction model

- start with 43 columns
- end with 33 columns

In [60]:
train_test_validate_data['los_icu'].isnull().any()

True

In [61]:
#we have 11 NAN FOr los_icu which will be dropped when we make clusters
train_test_validate_data[train_test_validate_data['los_icu'].isnull()]['los_icu']

772     NaN
1023    NaN
2681    NaN
5304    NaN
7578    NaN
9487    NaN
11528   NaN
13153   NaN
14804   NaN
15646   NaN
21088   NaN
21156   NaN
45327   NaN
Name: los_icu, dtype: float64

In [62]:
drop_cols = ['subject_id', 'hadm_id', 'icustay_id', 'admittime','dischtime', 'los_hospital' ,'dod', 'category', 'pregnancy complications','hospital_expire_flag']
train_test_validate_data = train_test_validate_data.drop(drop_cols, axis=1)


In [63]:
#add column to train_test_validate_data with the cluster_number they are in
cluster_1 = train_test_validate_data[train_test_validate_data['los_icu'] == 0]
cluster_1['cluster_number'] = 1

cluster_2 = train_test_validate_data[(train_test_validate_data['los_icu'] >=1) & (train_test_validate_data['los_icu'] <=7)]
cluster_2['cluster_number'] = 2

cluster_3 = train_test_validate_data[(train_test_validate_data['los_icu'] >=8) & (train_test_validate_data['los_icu'] <=24)]
cluster_3['cluster_number'] = 3

cluster_4 = train_test_validate_data[train_test_validate_data['los_icu'] >=25 ]
cluster_4['cluster_number'] = 4


clustered_data = pd.concat([cluster_1, cluster_2, cluster_3, cluster_4])



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col

In [64]:
#checks:
print(len(cluster_1) + len(cluster_2) + len(cluster_3) + len(cluster_4) )
print(len(train_test_validate_data))

49213
49226


In [67]:
clustered_data.groupby('INTIME_COLLAPSED_day').count().index.max()
clustered_data.groupby('INTIME_COLLAPSED_day').count().index.min()

'2003-03-01'

In [73]:
from datetime import datetime
min_date = clustered_data.groupby('INTIME_COLLAPSED_day').count().index.min()
min_date = datetime.strptime(min_date, '%Y-%m-%d').date()

max_date = clustered_data.groupby('INTIME_COLLAPSED_day').count().index.max()
max_date = datetime.strptime(max_date, '%Y-%m-%d').date()

max_date - min_date

datetime.timedelta(days=3591)

In [75]:
max_date

datetime.date(2012, 12, 29)

In [76]:
clustered_data.to_csv('clustered_data_for_priors.csv', index=False)


# Make clusters/4 groups


In [272]:
print(len(cluster_1)/len(train_test_validate_data)) #6%
print(len(cluster_2)/len(train_test_validate_data)) #79%
print(len(cluster_3)/len(train_test_validate_data)) #12%
print(len(cluster_4)/len(train_test_validate_data)) #2%

0.06358428472758298
0.7949254459025719
0.11282655507252265
0.028399626213789463


In [273]:
print(cluster_1['los_icu'].sum()/ train_test_validate_data['los_icu'].sum() )
print(cluster_2['los_icu'].sum()/ train_test_validate_data['los_icu'].sum() )
print(cluster_3['los_icu'].sum()/ train_test_validate_data['los_icu'].sum() )


#23% of all days are in cluster 4 even though only 2% of data is in there
print(cluster_4['los_icu'].sum()/ train_test_validate_data['los_icu'].sum() )


0.0
0.4054847201356069
0.3151589745386694
0.2793563053257237


# Make clusters/ 4 groups (input into los_prediction_ideal_icu.ipynb)

The groups have been chosen in a way where we expect there to be a distinct los between groups (discovery of these groups is in clustering_exploration.ipynb)

We have found classification problems in the literature that try to classify these four groups

#### Group 1 ( from clustering analysis, we expect los = 0)
- age group 0-19
- admission type = Newbown or emergency
- perinatal > 0
- sepsis = False



#### Group 2: (we expect los between 1-7 days)
- age group 40-59, 60-79, 80-100


#### Group 3: (we expect los between 8-24 days)
- respiratory > 0
- diagnoses_count > =5
- sepsis=True




#### Group 4 : (we expect los >25 days)
- congential > 0
- age grcoup 0-19, 40-59, 60-79, 80-100
- sepsis = True
- Infectious parasitic > 0
- Diagnoses count >=5



#### Each group will have it's own los prediction model in los_prediction_clustering.ipynb 

In [274]:
#group1

group1_df = train_test_validate_data[(train_test_validate_data['diagnoses_count']<=4)&
                                    ((train_test_validate_data['admission_type']=='NEWBORN') | (train_test_validate_data['admission_type']=='EMERGENCY')) &
                                    (train_test_validate_data['sepsis']==False) & 
                                    (train_test_validate_data['age_group']=='0.0-19.0')&
                                    (train_test_validate_data['perinatal']>0)]





In [275]:
group1_df['los_icu'].median()

0.0

In [276]:
group1_df['los_icu'].mean()

0.665514261019879

In [277]:
#group 2 (average group)
group2_df =train_test_validate_data[((train_test_validate_data['age_group'] == '40.0-59.0') | 
                                    (train_test_validate_data['age_group'] == '60.0-79.0') |
                                    (train_test_validate_data['age_group'] == '80-100'))]

In [278]:
group2_df['los_icu'].mean()


4.01221451104101

In [279]:
#group 3
group3_df = train_test_validate_data[ (train_test_validate_data['diagnoses_count'] >=5) & 
                                    (train_test_validate_data['respiratory'] >0) & 
                                    (train_test_validate_data['sepsis'] ==True) &
                                    (train_test_validate_data['congenital'] >0)  ]




In [280]:
group3_df['los_icu'].mean()

14.608938547486034

In [281]:
group3_df['los_icu'].median()

6.0

In [282]:
group4_df = train_test_validate_data[((train_test_validate_data['age_group']=='0.0-19.0') | 
                                     (train_test_validate_data['age_group']=='60.0-79.0') |
                                      (train_test_validate_data['age_group']=='80.0-10.0') |
                                    (train_test_validate_data['age_group']=='40.0-59.0')) &
                                    (train_test_validate_data['sepsis']==True) &
                                    (train_test_validate_data['congenital']>0) &
                                    (train_test_validate_data['infectious|parasitic']>0) & 
                                    (train_test_validate_data['diagnoses_count']>=5)] 


In [283]:
group4_df['los_icu'].mean()

31.987951807228917

In [284]:

print('mean los in group 1:',group1_df['los_icu'].mean())
print('mean los in group 2:',group2_df['los_icu'].mean())
print('mean los in group 3:',group3_df['los_icu'].mean())
print('mean los in group 4:', group4_df['los_icu'].mean())

mean los in group 1: 0.665514261019879
mean los in group 2: 4.01221451104101
mean los in group 3: 14.608938547486034
mean los in group 4: 31.987951807228917


In [285]:
len(group1_df) + len(group2_df) + len(group3_df) + len(group4_df) 

42369

In [286]:
len(train_test_validate_data)

49226

In [287]:
group1_df = group1_df.dropna() #drop 1 row
group2_df = group2_df.dropna() #drop 1 row
group3_df = group3_df.dropna() #drop 0 row
group4_df = group4_df.dropna() #drop 0 row

In [288]:
group1_df.to_csv('../output_data/group1_df.csv', index=False)
group2_df.to_csv('../output_data/group2_df.csv', index=False)
group3_df.to_csv('../output_data/group3_df.csv', index=False)
group4_df.to_csv('../output_data/group4_df.csv', index=False)

In [292]:
len(group1_df)

2314

In [293]:
len(group2_df)

39625

In [294]:
len(group3_df)

179