# Predict Future Medical Events

### Problem Statement

Insurance Plus++, a premium payer, wants to use existing data about its covered patients' previous medical events to predict future events in their patient journeys. Events are recorded in the standardized ICD-9 format. The goal of this challenge is to predict, for each patient, the next 10 events in 2014 in order of occurrence.

### Data Description

The "train.csv" file contains historical patient information from Jan 2011 to Dec 2013. The "test.csv" file contains the list of patient IDs for which we aim to predict the next 10 events in 2014. Event codes should be treated as categorical, not continuous.

| Variable | Description |
| :--- | :--- |
| UID | Unique Patient ID |
| Age | Age of the patient |
| Gender | Gender of the patient |
| Date | Date of Event |
| Event_Code | Event Code (ICD-9 format, the target variable of this challenge) |
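Since event codes are categorical, one option is to read them as strings and store them with a categorical dtype, so that they are never treated as numbers by accident. The sketch below is illustrative only: `sample_csv` is a small in-memory stand-in for "train.csv" built from three rows of the training data, and the `dtype` choice is an assumption rather than a step this notebook performs.

```python
import io
import pandas as pd

# In-memory stand-in for train.csv (three rows taken from the training data).
sample_csv = io.StringIO(
    "UID,Age,Gender,Date,Event_Code\n"
    "Id_e45bbc48,14,F,201205,8707\n"
    "Id_e45a8472,52,F,201305,7261\n"
    "Id_e45b2099,69,F,201306,220\n"
)

# Read Event_Code as text so the codes carry no numeric meaning,
# then store the column with a categorical dtype.
sample = pd.read_csv(sample_csv, dtype={"Event_Code": str})
sample["Event_Code"] = sample["Event_Code"].astype("category")

print(sample.dtypes)
```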

## 1. Understanding the Problem Statement and Dataset

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline

In [2]:
# Read training dataset
df = pd.read_csv('train.csv')
df.head()

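|  | UID | Age | Gender | Date | Event_Code |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | Id_e45bbc48 | 14 | F | 201205 | 8707 |
| 1 | Id_e45a8472 | 52 | F | 201305 | 7261 |
| 2 | Id_e45b20d6 | 12 | F | 201212 | 1967 |
| 3 | Id_e45aabad | 22 | F | 201211 | 7172 |
| 4 | Id_e45c5780 | 73 | F | 201312 | 8100 |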


In [3]:
df.tail()

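|  | UID | Age | Gender | Date | Event_Code |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 766782 | Id_e45c576b | 65 | M | 201107 | 9937 |
| 766783 | Id_e45a5e18 | 12 | M | 201201 | 3325 |
| 766784 | Id_e45c5771 | 17 | M | 201105 | 5308 |
| 766785 | Id_e45a84d5 | 62 | M | 201302 | 7225 |
| 766786 | Id_e45b2099 | 69 | F | 201306 | 220 |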


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 766787 entries, 0 to 766786
Data columns (total 5 columns):
UID           766787 non-null object
Age           766787 non-null int64
Gender        766787 non-null object
Date          766787 non-null int64
Event_Code    766787 non-null object
dtypes: int64(2), object(3)
memory usage: 29.3+ MB


In [5]:
# How many unique IDs?
n_id = len(df['UID'].unique())
print("There are %d unique UID." % n_id)

There are 3000 unique UID.


In [6]:
# How many unique events?
n_event = len(df['Event_Code'].unique())
print("There are %d unique Events." % n_event)

There are 6472 unique Events.


In [7]:
# How many events for one patient?
df_0 = df[df['UID'] == df['UID'][0]]
print("Patient 0 has %d Events." % len(df_0['Event_Code']))
print("Patient 0 has %d unique Events." % len(df_0['Event_Code'].unique()))

Patient 0 has 81 Events.
Patient 0 has 44 unique Events.


In [8]:
df_1 = df[df['UID'] == df['UID'][1]]
print("Patient 1 has %d Events." % len(df_1['Event_Code']))
print("Patient 1 has %d unique Events." % len(df_1['Event_Code'].unique()))

Patient 1 has 346 Events.
Patient 1 has 82 unique Events.


In [9]:
df_2 = df[df['UID'] == df['UID'][2]]
print("Patient 2 has %d Events." % len(df_2['Event_Code']))
print("Patient 2 has %d unique Events." % len(df_2['Event_Code'].unique()))

Patient 2 has 826 Events.
Patient 2 has 227 unique Events.


### Convert categorical variables into numeric ones

In [10]:
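from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer labels 0 .. n_classes - 1.
# The same encoder object is refit for every column, so after the loop
# it only retains the mapping of the last column, 'Event_Code'.
var_mod = ['UID', 'Gender', 'Event_Code']
le = LabelEncoder()
for item in var_mod:
    df[item] = le.fit_transform(df[item])
df.dtypes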

UID           int64
Age           int64
Gender        int64
Date          int64
Event_Code    int64
dtype: object

In [11]:
df.head()

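|  | UID | Age | Gender | Date | Event_Code |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1435 | 14 | 0 | 201205 | 4729 |
| 1 | 198 | 52 | 0 | 201305 | 3894 |
| 2 | 935 | 12 | 0 | 201212 | 493 |
| 3 | 428 | 22 | 0 | 201211 | 3819 |
| 4 | 2054 | 73 | 0 | 201312 | 4397 |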


In [12]:
df.tail()

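|  | UID | Age | Gender | Date | Event_Code |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 766782 | 2033 | 65 | 1 | 201107 | 5257 |
| 766783 | 189 | 12 | 1 | 201201 | 1700 |
| 766784 | 2039 | 17 | 1 | 201105 | 2843 |
| 766785 | 297 | 62 | 1 | 201302 | 3868 |
| 766786 | 874 | 69 | 0 | 201306 | 667 |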


In [13]:
max_id = np.max(df['UID'])
max_event = np.max(df['Event_Code'])
print('Maximum UID = ', max_id)
print('Maximum Event_Code = ', max_event)

Maximum UID =  2999
Maximum Event_Code =  6471


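So 'UID' is now encoded as integers from 0 to 2999 and 'Event_Code' as integers from 0 to 6471, matching the 3000 unique patients and 6472 unique event codes counted above. 'Gender' becomes 0 for F and 1 for M.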
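The integer labels have to be mapped back to the original ICD-9 codes and patient IDs before predictions can be reported. One way to keep every mapping available is to hold a separate encoder per column. The sketch below uses a hypothetical three-row frame `toy` and a hypothetical `encoders` dictionary; neither name appears elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame built from three rows of the training data.
toy = pd.DataFrame({
    "UID": ["Id_e45bbc48", "Id_e45a8472", "Id_e45c576b"],
    "Gender": ["F", "F", "M"],
    "Event_Code": ["8707", "7261", "9937"],
})

# One encoder per column, so each mapping can be inverted later.
encoders = {}
for col in ["UID", "Gender", "Event_Code"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original event codes from their integer labels.
decoded = encoders["Event_Code"].inverse_transform(toy["Event_Code"])
print(list(decoded))
```

`LabelEncoder` assigns labels in sorted order of the classes, which is why F maps to 0 and M to 1.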

### Generate Year and Month from Date

In [15]:
# Date is an integer in YYYYMM form: integer division by 100 yields the
# year, and the remainder yields the month.
df['Year'] = df['Date'] // 100
df['Month'] = df['Date'] % 100
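As a quick check of the arithmetic, 201205 splits into year 2012 and month 5. The sketch below works through a few `Date` values from the training data and, as an assumed alternative that this notebook does not use, shows the same split via `pd.to_datetime`, which may be handy if calendar arithmetic is needed later.

```python
import pandas as pd

# A few YYYYMM values taken from the Date column.
dates = pd.Series([201205, 201312, 201101], name="Date")

year = dates // 100   # 201205 // 100 -> 2012
month = dates % 100   # 201205 % 100  -> 5

# Alternative route: parse YYYYMM into real timestamps (first day of the month).
parsed = pd.to_datetime(dates.astype(str), format="%Y%m")

print(pd.DataFrame({"Date": dates, "Year": year, "Month": month}))
```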

In [17]:
df.head(20)

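|  | UID | Age | Gender | Date | Event_Code | Year | Month |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1435 | 14 | 0 | 201205 | 4729 | 2012 | 5 |
| 1 | 198 | 52 | 0 | 201305 | 3894 | 2013 | 5 |
| 2 | 935 | 12 | 0 | 201212 | 493 | 2012 | 12 |
| 3 | 428 | 22 | 0 | 201211 | 3819 | 2012 | 11 |
| 4 | 2054 | 73 | 0 | 201312 | 4397 | 2013 | 12 |
| 5 | 1548 | 77 | 1 | 201203 | 4397 | 2012 | 3 |
| 6 | 1487 | 76 | 0 | 201307 | 5259 | 2013 | 7 |
| 7 | 803 | 20 | 1 | 201206 | 5243 | 2012 | 6 |
| 8 | 1287 | 24 | 1 | 201104 | 1548 | 2011 | 4 |
| 9 | 1971 | 62 | 1 | 201105 | 496 | 2011 | 5 |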


In [28]:
# Sort by UID, Year and Month
df.sort_values(by = ['UID','Year','Month'], inplace=True)

In [32]:
# Reset index
df = df.reset_index(drop=True)

In [33]:
df

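|  | UID | Age | Gender | Date | Event_Code | Year | Month |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 17 | 0 | 201101 | 2671 | 2011 | 1 |
| 1 | 0 | 17 | 0 | 201101 | 678 | 2011 | 1 |
| 2 | 0 | 17 | 0 | 201101 | 1150 | 2011 | 1 |
| 3 | 0 | 17 | 0 | 201101 | 3957 | 2011 | 1 |
| 4 | 0 | 17 | 0 | 201101 | 3763 | 2011 | 1 |
| 5 | 0 | 17 | 0 | 201101 | 657 | 2011 | 1 |
| 6 | 0 | 17 | 0 | 201102 | 678 | 2011 | 2 |
| 7 | 0 | 17 | 0 | 201102 | 3763 | 2011 | 2 |
| 8 | 0 | 17 | 0 | 201102 | 1003 | 2011 | 2 |
| 9 | 0 | 17 | 0 | 201103 | 3804 | 2011 | 3 |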