<a href="https://colab.research.google.com/github/dcpatton/Structured-Data/blob/main/tf_embedding_cms_claims.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective
This notebook is a proof of concept. It demonstrates tf.feature_column, and embedding columns in particular (https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column).

In [None]:
import tensorflow as tf
import pandas as pd
pd.set_option('display.max_rows', 999)
pd.set_option('max_info_columns', 200)
import numpy as np
import random
import warnings
warnings.filterwarnings("ignore")

seed = 52
tf.random.set_seed(seed)
np.random.seed(seed)
random.seed(seed)

tf.__version__

'2.3.0'

# Get the Data

The data can be downloaded from https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF. Here I copy it from my own GCP storage bucket instead.

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
!curl https://sdk.cloud.google.com >/dev/null

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   443  100   443    0     0  18458      0 --:--:-- --:--:-- --:--:-- 18458


In [None]:
!gcloud init --skip-diagnostics --account='dcpatton@gmail.com' --project='cms-de1' 

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'
core:
  account: dcpatton@gmail.com
  project: cms-de1

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You are logged in as: [dcpatton@gmail.com].

Your current project has been set to: [cms-de1].

Not setting default zone/region (this feature makes it easier to use
[gcloud compute] by setting an appropriate default value for the
--zone and --region flag).
See https://cloud.google.com/compute/docs/gcloud-compute section on how to set
default compute region and zone manually. If you would like [gcloud init] to be
able to do this for you the next time you run it, make sure the
Compute Engine API i

In [None]:
!gsutil cp gs://de-synpuf/*.zip .

Copying gs://de-synpuf/176537_DE1_0_2010_Beneficiary_Summary_File_Sample_20.zip...
Copying gs://de-synpuf/176541_DE1_0_2008_Beneficiary_Summary_File_Sample_1.zip...
Copying gs://de-synpuf/176549_DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.zip...
Copying gs://de-synpuf/176600_DE1_0_2009_Beneficiary_Summary_File_Sample_1.zip...
/ [4 files][ 12.8 MiB/ 12.8 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://de-synpuf/176616_DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.zip...
Copying gs://de-synpuf/DE1_0_2008_to_2010_Carrier_Claims_Sample_1A.zip...
Copying gs://de-synpuf/DE1_0_2008_to_2010_Carrier_Claims_Sample_1B.zip...
Copying gs://de-synpuf/DE1_0_2008_to_2010_Prescription_Drug_Events_Sample_1.zip...
- [8 files][361.5 MiB/36

# Preprocess the data


In [None]:
!unzip 176549_DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.zip

Archive:  176549_DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.zip
  inflating: DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.csv  


In [None]:
claims_df = pd.read_csv('DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.csv', parse_dates=['CLM_FROM_DT', 'CLM_THRU_DT'])

In [None]:
!unzip 176541_DE1_0_2008_Beneficiary_Summary_File_Sample_1.zip
!unzip 176600_DE1_0_2009_Beneficiary_Summary_File_Sample_1.zip
!unzip 176537_DE1_0_2010_Beneficiary_Summary_File_Sample_20.zip

Archive:  176541_DE1_0_2008_Beneficiary_Summary_File_Sample_1.zip
  inflating: DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv  
Archive:  176600_DE1_0_2009_Beneficiary_Summary_File_Sample_1.zip
  inflating: DE1_0_2009_Beneficiary_Summary_File_Sample_1.csv  
Archive:  176537_DE1_0_2010_Beneficiary_Summary_File_Sample_20.zip
  inflating: DE1_0_2010_Beneficiary_Summary_File_Sample_20.csv  


In [None]:
summary_2008_df = pd.read_csv('DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv', parse_dates=['BENE_BIRTH_DT'])
summary_2009_df = pd.read_csv('DE1_0_2009_Beneficiary_Summary_File_Sample_1.csv', parse_dates=['BENE_BIRTH_DT'])
summary_2010_df = pd.read_csv('DE1_0_2010_Beneficiary_Summary_File_Sample_20.csv', parse_dates=['BENE_BIRTH_DT'])

Combining the beneficiary data into a single dataframe.

In [None]:
# merge the 2008 and 2009 summaries first, then add 2010
summary_df = pd.merge(summary_2008_df, summary_2009_df, how='outer')
summary_df = pd.merge(summary_df, summary_2010_df, how='outer')
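
When `pd.merge` is called without an `on` argument, it joins on every column the two frames share, so an outer merge of same-shaped yearly files behaves much like a de-duplicated union of their rows. The sketch below uses two made-up toy frames (the IDs and flag values are hypothetical, not SynPUF data) and compares the merge with a `pd.concat` formulation, which gives the same rows as long as neither frame contains internal duplicates.

```python
import pandas as pd

# Hypothetical toy frames standing in for two yearly beneficiary summaries.
year_a = pd.DataFrame({"DESYNPUF_ID": ["A1", "B2"], "SP_CHF": [1, 2]})
year_b = pd.DataFrame({"DESYNPUF_ID": ["B2", "C3"], "SP_CHF": [2, 1]})

# With no `on` argument, merge joins on all shared columns; how='outer'
# keeps rows found in either frame, and identical rows match once.
merged = pd.merge(year_a, year_b, how="outer")

# Alternative for frames without internal duplicates: stack, then de-duplicate.
stacked = pd.concat([year_a, year_b], ignore_index=True).drop_duplicates()

print(len(merged), len(stacked))
```

Note that a beneficiary whose flags differ between years keeps one row per distinct combination, so the combined frame can hold several rows per `DESYNPUF_ID`.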

In [None]:
# drop all rows with SEGMENT=2 because they contain neither diagnosis codes nor procedure codes
claims_df = claims_df[claims_df['SEGMENT']==1]

In [None]:
# set missing admitting diagnosis codes to the first diagnosis code
# (fillna with a Series aligns on the index, so each row uses its own ICD9_DGNS_CD_1)
claims_df['ADMTNG_ICD9_DGNS_CD'] = claims_df['ADMTNG_ICD9_DGNS_CD'].fillna(claims_df['ICD9_DGNS_CD_1'])

In [None]:
# row 26530 is still missing after the fill above, so use its ICD9_DGNS_CD_2 value
claims_df.at[26530, 'ADMTNG_ICD9_DGNS_CD'] = '8020'

In [None]:
claims_df['ADMTNG_ICD9_DGNS_CD'].isna().sum()

0
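
The fill-with-fallback steps above can also be written without naming a specific row index. This is a minimal sketch on a hypothetical toy frame (the codes are placed only for illustration), assuming the desired fallback order is the admitting code, then `ICD9_DGNS_CD_1`, then `ICD9_DGNS_CD_2`.

```python
import pandas as pd

# Hypothetical toy claims; the real SynPUF file has many more columns.
toy = pd.DataFrame({
    "ADMTNG_ICD9_DGNS_CD": ["4580", None, None],
    "ICD9_DGNS_CD_1": ["4580", "7866", None],
    "ICD9_DGNS_CD_2": ["2724", "4019", "8020"],
})

# fillna accepts a Series and aligns on the index, so each missing
# admitting code takes the value from the same row of the fallback column.
admitting = toy["ADMTNG_ICD9_DGNS_CD"]
for fallback in ["ICD9_DGNS_CD_1", "ICD9_DGNS_CD_2"]:
    admitting = admitting.fillna(toy[fallback])
toy["ADMTNG_ICD9_DGNS_CD"] = admitting

print(toy["ADMTNG_ICD9_DGNS_CD"].tolist())
```

Chaining the fallbacks this way avoids depending on a row label that may change if the input file or the filtering step changes.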

In [None]:
claims_sub_df = claims_df[['DESYNPUF_ID', 'ADMTNG_ICD9_DGNS_CD', 'CLM_UTLZTN_DAY_CNT']]
claims_sub_df.isna().sum()

DESYNPUF_ID            0
ADMTNG_ICD9_DGNS_CD    0
CLM_UTLZTN_DAY_CNT     0
dtype: int64

In [None]:
summary_sub_df = summary_df[['DESYNPUF_ID', 'BENE_BIRTH_DT', 'BENE_SEX_IDENT_CD', 'SP_ALZHDMTA', 'SP_CHF', 'SP_CHRNKIDN', 'SP_CNCR', 'SP_COPD',
                             'SP_DEPRESSN', 'SP_DIABETES', 'SP_ISCHMCHT', 'SP_OSTEOPRS', 'SP_RA_OA', 'SP_STRKETIA']]
summary_sub_df.isna().sum()

DESYNPUF_ID          0
BENE_BIRTH_DT        0
BENE_SEX_IDENT_CD    0
SP_ALZHDMTA          0
SP_CHF               0
SP_CHRNKIDN          0
SP_CNCR              0
SP_COPD              0
SP_DEPRESSN          0
SP_DIABETES          0
SP_ISCHMCHT          0
SP_OSTEOPRS          0
SP_RA_OA             0
SP_STRKETIA          0
dtype: int64

In [None]:
data_df = claims_sub_df.merge(summary_sub_df, on='DESYNPUF_ID')
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66518 entries, 0 to 66517
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   DESYNPUF_ID          66518 non-null  object        
 1   ADMTNG_ICD9_DGNS_CD  66518 non-null  object        
 2   CLM_UTLZTN_DAY_CNT   66518 non-null  float64       
 3   BENE_BIRTH_DT        66518 non-null  datetime64[ns]
 4   BENE_SEX_IDENT_CD    66518 non-null  int64         
 5   SP_ALZHDMTA          66518 non-null  int64         
 6   SP_CHF               66518 non-null  int64         
 7   SP_CHRNKIDN          66518 non-null  int64         
 8   SP_CNCR              66518 non-null  int64         
 9   SP_COPD              66518 non-null  int64         
 10  SP_DEPRESSN          66518 non-null  int64         
 11  SP_DIABETES          66518 non-null  int64         
 12  SP_ISCHMCHT          66518 non-null  int64         
 13  SP_OSTEOPRS          66518 non-

In [None]:
data_df['year'] = pd.DatetimeIndex(data_df['BENE_BIRTH_DT']).year
data_df['age'] = 2020 - data_df['year']  # approximate age as of 2020
data_df.drop(['year','BENE_BIRTH_DT'], axis='columns', inplace=True)

CLM_UTLZTN_DAY_CNT will be the target to predict, and ADMTNG_ICD9_DGNS_CD will be the high-cardinality categorical column we encode with an embedding. Both are renamed for convenience.

In [None]:
data_df = data_df.rename(columns={"CLM_UTLZTN_DAY_CNT": "target", "ADMTNG_ICD9_DGNS_CD": "diagnosis"})
data_df.head()

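| | DESYNPUF_ID | diagnosis | target | BENE_SEX_IDENT_CD | SP_ALZHDMTA | SP_CHF | SP_CHRNKIDN | SP_CNCR | SP_COPD | SP_DEPRESSN | SP_DIABETES | SP_ISCHMCHT | SP_OSTEOPRS | SP_RA_OA | SP_STRKETIA | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00013D2EFD8E45D1 | 4580 | 1.0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 97 |
| 1 | 00016F745862898F | 7866 | 6.0 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 77 |
| 2 | 00016F745862898F | 6186 | 2.0 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 77 |
| 3 | 00016F745862898F | 29590 | 3.0 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 77 |
| 4 | 00016F745862898F | 5849 | 5.0 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 77 |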


In [None]:
# Note the high cardinality
data_df.diagnosis.nunique()

2316
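
To see why this cardinality matters, it helps to count the weights the diagnosis feature would need under each encoding. The arithmetic below is a rough sketch: it assumes an embedding dimension of 16 and a first dense layer of 512 units (the values this notebook uses further down), and it counts only the weights attributable to the diagnosis feature, ignoring biases and all other inputs.

```python
vocab_size = 2316        # distinct diagnosis codes reported above
embedding_dim = 16       # assumed embedding width
first_layer_units = 512  # assumed width of the first dense layer

# One-hot (indicator) encoding: every code is its own input to the dense layer.
one_hot_weights = vocab_size * first_layer_units

# Embedding: a vocab_size x embedding_dim lookup table, followed by a
# much narrower input to the dense layer.
embedding_weights = vocab_size * embedding_dim + embedding_dim * first_layer_units

print(one_hot_weights)    # 1185792
print(embedding_weights)  # 45248
```

Under these assumptions the embedding route needs roughly 26 times fewer weights for this one feature.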

In [None]:
data_df.drop(['DESYNPUF_ID'], axis='columns', inplace=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66518 entries, 0 to 66517
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   diagnosis          66518 non-null  object 
 1   target             66518 non-null  float64
 2   BENE_SEX_IDENT_CD  66518 non-null  int64  
 3   SP_ALZHDMTA        66518 non-null  int64  
 4   SP_CHF             66518 non-null  int64  
 5   SP_CHRNKIDN        66518 non-null  int64  
 6   SP_CNCR            66518 non-null  int64  
 7   SP_COPD            66518 non-null  int64  
 8   SP_DEPRESSN        66518 non-null  int64  
 9   SP_DIABETES        66518 non-null  int64  
 10  SP_ISCHMCHT        66518 non-null  int64  
 11  SP_OSTEOPRS        66518 non-null  int64  
 12  SP_RA_OA           66518 non-null  int64  
 13  SP_STRKETIA        66518 non-null  int64  
 14  age                66518 non-null  int64  
dtypes: float64(1), int64(13), object(1)
memory usage: 8.1+ MB


In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data_df, test_size=0.2, random_state=seed)
print(train_df.shape)
print(test_df.shape)

(53214, 15)
(13304, 15)


In [None]:
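def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  """Convert a DataFrame into a batched tf.data.Dataset of (features, label) pairs."""
  dataframe = dataframe.copy()  # avoid mutating the caller's frame
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    # the buffer covers the whole frame, so the shuffle is uniform
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds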

Next we define the feature columns, including a 16-dimensional embedding for the diagnosis codes. The embedding weights are learned together with the rest of the model.

In [None]:
from tensorflow import feature_column

feature_columns = []

# numeric cols
for header in ['age']:
  feature_columns.append(feature_column.numeric_column(header))

# indicator_columns
indicator_column_names = ['BENE_SEX_IDENT_CD', 'SP_ALZHDMTA', 'SP_CHF', 'SP_CHRNKIDN', 'SP_CNCR',
                          'SP_COPD', 'SP_DEPRESSN', 'SP_DIABETES', 'SP_ISCHMCHT', 'SP_OSTEOPRS', 'SP_RA_OA', 'SP_STRKETIA']
for col_name in indicator_column_names:
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, data_df[col_name].unique())
  indicator_column = feature_column.indicator_column(categorical_column)
  feature_columns.append(indicator_column)

# embedding columns
diagnosis = feature_column.categorical_column_with_vocabulary_list(
      'diagnosis', data_df.diagnosis.unique())
diagnosis_embedding = feature_column.embedding_column(diagnosis, dimension=16)
feature_columns.append(diagnosis_embedding)
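
Conceptually, an embedding column is a lookup table: each vocabulary entry maps to an integer index, and that index selects one row of a trainable matrix. The NumPy sketch below illustrates the idea with a hypothetical four-code vocabulary and a randomly initialized matrix. It is not how TensorFlow implements the column internally, and in the real model the matrix values are updated by backpropagation rather than left random.

```python
import numpy as np

# Hypothetical vocabulary; the notebook builds the real one from
# data_df.diagnosis.unique().
vocabulary = ["4580", "7866", "6186", "5849"]
code_to_index = {code: i for i, code in enumerate(vocabulary)}

embedding_dim = 16
rng = np.random.default_rng(52)
# One row per code: random values here, learned weights in a trained model.
embedding_matrix = rng.normal(scale=0.1, size=(len(vocabulary), embedding_dim))

def embed(codes):
    """Return one dense vector per diagnosis code via row lookup."""
    indices = [code_to_index[code] for code in codes]
    return embedding_matrix[indices]

batch = embed(["7866", "4580", "7866"])
print(batch.shape)  # (3, 16)
```

Identical codes always map to the same vector, which is what lets the model share what it learns about a diagnosis across every claim that carries it.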

In [None]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [None]:
batch_size = 32
train_ds = df_to_dataset(train_df, batch_size=batch_size)
test_ds = df_to_dataset(test_df, shuffle=False, batch_size=batch_size)
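
`Dataset.batch` keeps the final, smaller batch by default (`drop_remainder=False`), so the number of steps per epoch is the row count divided by the batch size, rounded up. A quick check using the split sizes printed after `train_test_split`:

```python
import math

batch_size = 32
train_rows, test_rows = 53214, 13304  # shapes printed after train_test_split

# The last batch may be partial, hence the ceiling.
train_steps = math.ceil(train_rows / batch_size)
test_steps = math.ceil(test_rows / batch_size)

print(train_steps, test_steps)  # 1663 416
```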

# The Model

In [None]:
tf.keras.backend.clear_session()
from tensorflow.keras.layers import Dense

model = tf.keras.Sequential([
  feature_layer,
  Dense(512, activation='relu'),
  Dense(256, activation='relu'),
  Dense(128, activation='relu'),
  Dense(64, activation='relu'),
  Dense(1)
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
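
The model is trained to minimize mean squared error (MSE) and also reports mean absolute error (MAE), which stays in the same units as the target (days). A small worked example with hypothetical values, chosen only for illustration, shows how the two differ:

```python
# Hypothetical day counts and predictions, not taken from the dataset.
y_true = [1.0, 6.0, 2.0, 3.0]
y_pred = [2.0, 4.0, 2.0, 7.0]

errors = [p - t for p, t in zip(y_pred, y_true)]

# MSE squares each error, so the single 4-day miss dominates the total.
mse = sum(e ** 2 for e in errors) / len(errors)
# MAE averages absolute errors and is expressed in days.
mae = sum(abs(e) for e in errors) / len(errors)

print(mse, mae)  # 5.25 1.75
```

Because MSE penalizes large misses much more heavily, reporting MAE alongside it gives a more directly interpretable figure.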

# Training

Because the data is synthetic, we should not expect the results to be meaningful. We run just 10 epochs to demonstrate how this would be done with real data.

In [None]:
history = model.fit(train_ds, epochs=10, validation_data=test_ds)

Epoch 1/10
Consider rewriting this model with the Functional API.
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
