# Oscar Take-Home Case Study
----------------------------------------
Submission by: David Li  
Contact: davidjli508@gmail.com / (248)918-8782

-------------------------------------------------------------------------------------

### Opening Thoughts on the Use Case / Task:

The task is to analyze the Member-Diagnosis-Drug Datasets and build a meaningful model to predict patient health statuses from prescription drug data.  

As part of strategizing when approaching any new task, it's an important step to review the business requirements, success criteria, and anticipated challenges, whether identified by the stakeholder or by the Data Scientist. 
This leads to a more organized strategy for handling the task and surfaces potential technical/data/project risks as far in advance as possible!

#### Business/Task Requirements:
- Classification Model Type (the outcomes to predict are categories/labels)
- Column of Interest: TBD (some variety of patient health statuses, but to be defined as part of the task)
- Predictive Features: Drugs administered

#### Success Criteria:
- Build a Classification Model using the Member-Diagnosis-Drug Datasets referenced.
- Due to time constraints, accuracy is not the highest priority, but it is still best to do sanity checks and quick reviews, especially with data the company/individual/stakeholders are not necessarily familiar with.
- Write Organized Code with Helpful Documentation
- Document Helpful Ideas and Thoughts along the process.

#### Anticipated Challenges:
- We are not fully familiar with how this data was collected and engineered, so we should be especially on the lookout for missing or low-quality data.
- The domain knowledge of the data is somewhat niche, so it is important we take time and steps to thoroughly understand the medical-specific terminology and nomenclature. We will be sure to discuss investigative steps, learnings, and decisions based on these.
- Part of the open-ended nature of this assignment is that we will need to define our own Target Column of interest. In some datasets this is immediately provided, but we will need to strategize accordingly.

------------------------------------------------------------------

### Ingestion/Exploratory Data Analysis (EDA)

Though the Modeling is usually the more "exciting" part of the project, it's important to be well-informed on the data we are working with.  

The idea is to look at the variables individually (uni-variate) and to check a few possible correlations/relationships between variables (bi-variate) where we suspect interesting patterns. Additionally, there are other vital reasons why EDA is needed for successful and clean modeling.

- Determining how to address missing/low-quality data upfront separates much of the Data Manipulation code from the AI/ML Modelling code, making corrections/revisions easy to add in the corresponding section (versus having all of the code entangled together)  
- Addressing all Data Quality Issues upfront eliminates confusion between a Data process issue vs. a Modelling process issue, which is extremely relevant when we need to identify the root cause of poor model performance or make model enhancements/improvements. It saves significant time.
- It gives direction on how to initially approach the modelling steps. Learning which variables are poorly maintained, insignificant, or irrelevant saves significant time when deciding how to construct and set up the model. It becomes easier to quickly start with a fairly reasonable model, which is especially valuable when deadlines or resources are constrained. 

In this assessment in particular, we have 3 datasets to review. The goal of this section is to roughly **determine the meaning, data quality, statistical overview, and important learnings/facts about the data that will be either a prerequisite or an important factor in our modelling.**

Because of the suggested 4-hour constraint, we will focus on the most evident learnings that we can quickly understand. There may be further analysis that could be done with more time - these potential steps are covered in the summary section of the EDA portion.

In [258]:
# Import Python Packages Necessary for Assignment
import numpy as np # Data Manipulation
import pandas as pd  # Data Manipulation
import matplotlib.pyplot as plt # Generating Graphs, especially for EDA
import seaborn as sns # Generating Graphs, especially for EDA

##### Claim Lines w/ Diagnosis Codes ("claim_lines.csv") - EDA Overview

In [259]:
# Import Dataset
claims_df = pd.read_csv("claim_lines.csv") # Assumes dataset is in same folder as this .ipynb

In [260]:
# Preview the Data
claims_df.head()

| | record_id | member_id | date_svc | diag1 |
|---|---|---|---|---|
| 0 | 57738 | M0000001 | 2015-12-06 | N92.6 |
| 1 | 57750 | M0000001 | 2015-12-06 | O26.842 |
| 2 | 65072 | M0000001 | 2015-12-13 | O26.842 |
| 3 | 201796 | M0000001 | 2016-02-29 | O26.843 |
| 4 | 267197 | M0000001 | 2016-03-27 | O26.843 |


In [261]:
# Review the columns we have access to in the Dataset
claims_df.columns

Index(['record_id', 'member_id', 'date_svc', 'diag1'], dtype='object')

At this stage, it's important at a High-Level Glance to:  

Understand what information is being conveyed in the data. There appear to be 2 main types of information conveyed. 
- Identification Information: "record_id", "member_id", "date_svc". Mostly used as referencing or labeling
- Diagnosis Code: "diag1". This is the target variable of interest in this dataset, describing the diagnosis code.

Other Learnings/Notes About This Dataset:  
- The assessment instructions describe the "diag1" codes as following the ICD-10 format. 
- From the documentation, we learn that there are ICD-10 **diagnosis** codes and ICD-10 **procedure** codes. 
- Our evidence that these are **diagnosis** codes comes from their format: the codes in this dataset follow the diagnosis code syntax described in:
    - https://hcup-us.ahrq.gov/datainnovations/BriefIntrotoICD-10Codes041117.pdf : Table 1


In [262]:
# Next, we would want to check data quality to give both a better understanding and set up our modelling for success (i.e. "Garbage in Garbage Out")

# Count total NaN at each column in a DataFrame 
print(" \nCount total NaN at each column in a DataFrame : \n\n", 
    claims_df.isnull().sum()) 

# Count Number of Duplicate Rows 
print(" \nCount total duplicates in Dataframe : \n\n", 
    len(claims_df)-len(claims_df.drop_duplicates()))

# Print Total Number of Rows for Relative Reference
print(" \nCount total rows in Dataframe : \n\n", 
    claims_df.shape[0])



 
Count total NaN at each column in a DataFrame : 

 record_id     0
member_id     0
date_svc     24
diag1         0
dtype: int64
 
Count total duplicates in Dataframe : 

 0
 
Count total rows in Dataframe : 

 1919983


Even a simple exercise of counting the number of missing values will give some insight on useful variables. 

- Here, we can see that only the date_svc column has missing values, 24 of them.
- There are also no duplicate rows, so we will not need to handle for duplicate data.
- We have 1,919,983 data points, described in our instructions as "Every row lists one diagnosis given to a member on a certain day.". We are also told this data covers 200,000 members over 3 years, so as a sanity check: that works out to roughly 10 diagnosis rows per member, which is a plausible volume for that time period.

Then, we need to think how to handle the missing data.
- Normally, we would want to ask what led to the missing data in the first place (e.g. human error, data gaps, or an intentional N/A). If the root cause is known, it can inform how to resolve the missing data appropriately (e.g. human error could be fixed by manual correction, and missing numeric values in a time series could be filled with the average of the two adjacent data points).
- We would not need to trim data rows which have missing data in columns that are irrelevant for our model to learn from.

Given the size of this dataset, and for simplicity in this assignment, we will simply remove the 24 rows with missing values, since they are a marginal portion of the data (roughly 0.001% of the rows). 

In [263]:
# Final Refinements for the Claim Lines Dataset

# Drop rows that have a missing value in any column (this removes rows, not columns)
claims_df.dropna(subset=claims_df.columns.tolist(), inplace = True)

# Note: A particular step I personally find very useful and worthwhile to do is coercing all the types very early ahead of time if the dataset size is reasonable for that exercise. 
# It facilitates all future code as we know exactly what data types we are working with, and makes it easy to check for data type correctness for all future EDA and Modelling requirements.
# Takes some extra time in the beginning, but saves tons of time and headache later!

# Specify Data Types for what values we would expect
STR_COLS = ['record_id', "member_id", 'diag1'] 
#FLOAT_COLS = []
DATE_COLS = ['date_svc']

# Coerce Data Types for what values we would expect. Note: with errors="coerce", values that fail datetime parsing become NaT; astype(str) never fails, but it would turn a missing value into the string 'nan'.
claims_df[STR_COLS] = claims_df[STR_COLS].astype(str) # String
#claims_df[FLOAT_COLS] = claims_df[FLOAT_COLS].apply(pd.to_numeric, errors = 'coerce') # FLoats/Decimals
claims_df[DATE_COLS] = claims_df[DATE_COLS].apply(pd.to_datetime, errors = "coerce") # DateTime

##### CCS / Clinical Categories ("ccs.csv") - EDA Overview

We follow the same methodology as the claim lines dataset. For conciseness, the code is all joined together; and the insights/summaries are still given in a separate cell.
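Since the same three checks (missing values, duplicate rows, row count) are repeated for every dataset, they could also be wrapped in a small helper. The sketch below is only an illustration: `summarize_quality` and `toy_df` are hypothetical names, and the toy rows are made up rather than taken from the real files.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return missing-value counts per column, duplicate-row count, and row count."""
    return {
        "missing_per_column": {col: int(n) for col, n in df.isnull().sum().items()},
        "duplicate_rows": int(df.duplicated().sum()),
        "total_rows": len(df),
    }


# Toy frame (made-up values) standing in for a real dataset such as claims_df
toy_df = pd.DataFrame({
    "member_id": ["M1", "M1", "M2", "M2"],
    "date_svc": ["2016-01-01", "2016-01-01", None, "2016-03-01"],
})

quality = summarize_quality(toy_df)
print(quality)
```

Calling the helper once per dataset would keep the three overview cells consistent with each other.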

In [264]:
# Import Dataset
ccs_df = pd.read_csv("ccs.csv") # Assumes dataset is in same folder as this .ipynb

# Review the columns we have access to in the Dataset
print(ccs_df.columns)

# Next, we would want to check data quality to give both a better understanding and set up our modelling for success (i.e. "Garbage in Garbage Out")

# Count total NaN at each column in a DataFrame 
print(" \nCount total NaN at each column in a DataFrame : \n\n", 
    ccs_df.isnull().sum()) 

# Count Number of Duplicate Rows 
print(" \nCount total duplicates in Dataframe : \n\n", 
    len(ccs_df)-len(ccs_df.drop_duplicates()))

# Print Total Number of Rows for Relative Reference
print(" \nCount total rows in Dataframe : \n\n", 
    ccs_df.shape[0])

# Final Refinements for the CCS Dataset

# Specify Data Types for what values we would expect
STR_COLS = ['diag', 'diag_desc', 'ccs_1_desc', 'ccs_2_desc', 'ccs_3_desc'] 

# Coerce Data Types for what values we would expect. Note: Errors in Coercion become NA from the parameter specified.
ccs_df[STR_COLS] = ccs_df[STR_COLS].astype(str) # String

Index(['diag', 'diag_desc', 'ccs_1_desc', 'ccs_2_desc', 'ccs_3_desc'], dtype='object')
 
Count total NaN at each column in a DataFrame : 

 diag          0
diag_desc     0
ccs_1_desc    0
ccs_2_desc    0
ccs_3_desc    0
dtype: int64
 
Count total duplicates in Dataframe : 

 0
 
Count total rows in Dataframe : 

 72167


As before, we should review two things: the meaning of this dataset, and any data quality/overview details we need to take note of.

- "diag" follows the same format of the ICD-10 diagnosis codes, but with one small yet important difference; there are **no decimals** in this dataset's depiction.
    - Consequence 1: When joining the ccs dataset with the claim_lines dataset, we will need to adjust the representation (i.e. claim_lines's "Z38.01" matches to ccs's "Z3801")
        - This is an assumption that normally would be recommended to confer and check with the data expert / owner of the datasets
    - Consequence 2: We would need to validate that the above-mentioned join is relatively successful/correct
        - If the matching is poor based on our assumption above, we would need to investigate further for an alternative join strategy.
        - There could be both technical and non-technical causes for a poor join
            - Significant Zeroes, or other similar issues regarding different formatting for the same expression(s)
            - Data Source Consolidation Gaps, such as different systems capturing the same information but in different syntaxes or nomenclatures.
- "diag_desc", "ccs_1_desc", "ccs_2_desc", "ccs_3_desc" are all descriptions of the diagnosis, but at different levels of breadth/hierarchy.
    - At a glance, "diag_desc" appears to be the "true" description of the diagnosis. 
    - Since multiple diagnoses can be similar in nature, the multiple "ccs_desc" columns classify them at different levels of specificity.
        - Hypothesis: "ccs_1" is the broadest grouping, narrowing down to "ccs_3" as the most specific categorization

From the above, we have important action items:

- Validate / Investigate Consequences 1 & 2, to ensure our data is being used together correctly.
- Understand the description hierarchy well enough to accomplish an important step: determining an appropriate level of categorization for the model to target.
    - This will be explored in the Uni-Variate Sections.

Since there are no missing values or duplicate rows, we will assume for now that this dataset is good as is.
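Consequences 1 and 2 above could be checked along the following lines. This is a minimal sketch, assuming that removing the decimal point is the only formatting difference between the two code columns; the toy frames, the helper column `diag_key`, and the placeholder category labels are made up for illustration.

```python
import pandas as pd

# Toy stand-ins (made-up rows) for claims_df and ccs_df
toy_claims = pd.DataFrame({
    "member_id": ["M1", "M2", "M3"],
    "diag1": ["Z38.01", "I10", "999.9"],
})
toy_ccs = pd.DataFrame({
    "diag": ["Z3801", "I10"],
    "ccs_1_desc": ["Category A", "Category B"],  # placeholder labels
})

# Assumption: dropping the "." is enough to align the two representations
toy_claims["diag_key"] = (
    toy_claims["diag1"].str.replace(".", "", regex=False).str.upper()
)

# indicator=True adds a "_merge" column showing which rows found a match
merged = toy_claims.merge(
    toy_ccs, how="left", left_on="diag_key", right_on="diag", indicator=True
)

match_rate = (merged["_merge"] == "both").mean()
unmatched = merged.loc[merged["_merge"] == "left_only", "diag1"].tolist()

print(f"match rate: {match_rate:.2%}")
print("unmatched codes:", unmatched)
```

A low match rate, or a recognizable pattern among the unmatched codes, would be the signal to revisit the join assumption with the data owner.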

##### Prescription Drugs Data ("prescription_drugs.csv") - EDA Overview

We follow the same methodology as the claim lines & ccs dataset.

In [265]:
# Import Dataset
drug_df = pd.read_csv("prescription_drugs.csv") # Assumes dataset is in same folder as this .ipynb

# Review the columns we have access to in the Dataset
print(drug_df.columns)

# Next, we would want to check data quality to give both a better understanding and set up our modelling for success (i.e. "Garbage in Garbage Out")

# Count total NaN at each column in a DataFrame 
print(" \nCount total NaN at each column in a DataFrame : \n\n", 
    drug_df.isnull().sum()) 

# Count Number of Duplicate Rows 
print(" \nCount total duplicates in Dataframe : \n\n", 
    len(drug_df)-len(drug_df.drop_duplicates()))

# Print Total Number of Rows for Relative Reference
print(" \nCount total rows in Dataframe : \n\n", 
    drug_df.shape[0])

# Final Refinements for the Prescription Drugs Dataset

# Specify Data Types for what values we would expect
STR_COLS = ['record_id', 'member_id', 'ndc', 'drug_category', 'drug_group', 'drug_class'] 
DATE_COLS = ['date_svc']

# Coerce Data Types for what values we would expect. Note: Errors in Coercion become NA from the parameter specified.
drug_df[STR_COLS] = drug_df[STR_COLS].astype(str) # String
drug_df[DATE_COLS] = drug_df[DATE_COLS].apply(pd.to_datetime, errors = "coerce") # DateTime

Index(['record_id', 'member_id', 'date_svc', 'ndc', 'drug_category',
       'drug_group', 'drug_class'],
      dtype='object')
 
Count total NaN at each column in a DataFrame : 

 record_id        0
member_id        0
date_svc         0
ndc              0
drug_category    0
drug_group       0
drug_class       0
dtype: int64
 
Count total duplicates in Dataframe : 

 0
 
Count total rows in Dataframe : 

 3005934


As before, we should review two things: the meaning of this dataset, and any data quality/overview details we need to take note of.

- "record_id" is presumably the unique transaction id logged when the drug was given to the member
    - Hypothesis: this is not the same as the "record_id" in the claim_lines dataset.
    - Validation Process: we could join "record_id" across the two datasets, and validate if the secondary data attributes (i.e. "member_id") are consistent.
        - If this is validated to actually match, this could be our id to join on
        - If not, we likely will need to join with other data attributes.
- "member_id" is the member the drug was administered to.
    - This is likely consistent with the "member_id" seen in claim_lines (otherwise, the join between these datasets will be quite difficult)
    - Validation Process: we could join these two "member_id"s and spot check for data consistency (i.e. diabetes patients are both diagnosed with diabetes AND are prescribed diabetes-treatment drugs)
- "date_svc" is unclear; it could fall into one of two scenarios. 
    - Scenario 1: These correspond to the same "date_svc" as when the member received the diagnosis, so the column could serve as an additional joining key
    - Scenario 2: These are dates indicating when the prescription drug was filled, and are specific to the prescription drug dataset only.
        - Since dates are harder to validate intuitively, there may not be a clear-cut strategy to find the answer to this.
            - Consequence 1: Our data joining strategy should not join on "date_svc", to avoid misleading joins
            - Consequence 2: Since we are uncertain about the consistency between the multiple "date_svc" columns, we should exclude this as a predictive attribute from our final model, since it could yield misleading insights. That is the risk of using data elements we don't understand.
- "ndc" is the National Drug Code for the exact drug being used. Most likely, we will not need this column, as the other columns describe the functional purpose of the drug (e.g. a brand name like Tylenol tells us nothing directly - it is the properties and usages of Tylenol that we care about).
- "drug_category", "drug_group", "drug_class" are helpful descriptors of the drug being used. Since so many unique drugs exist, we will probably use these descriptions to bin the drugs broadly into functional groups as model predictor inputs.

From the above, we have important action items:

- We need to clarify the consistency of the data attributes named in this dataset to the claims_lines.
- We will need to establish a joining strategy with claims_lines, to ensure the data is being used together as correctly as possible.
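The validation ideas for "record_id" and "member_id" can be sketched as a simple overlap measurement. `overlap_share` is a hypothetical helper, and the toy frames below are made-up stand-ins for claims_df and drug_df; on the real data, a near-zero overlap for "record_id" and a high overlap for "member_id" would support the hypotheses listed above.

```python
import pandas as pd

# Toy stand-ins (made-up rows) for claims_df and drug_df
toy_claims = pd.DataFrame({
    "record_id": ["1", "2", "3"],
    "member_id": ["M1", "M1", "M2"],
})
toy_drugs = pd.DataFrame({
    "record_id": ["900", "901"],
    "member_id": ["M1", "M3"],
})


def overlap_share(left: pd.Series, right: pd.Series) -> float:
    """Share of distinct values in `left` that also appear in `right`."""
    left_values = set(left.dropna())
    right_values = set(right.dropna())
    if not left_values:
        return 0.0
    return len(left_values & right_values) / len(left_values)


record_overlap = overlap_share(toy_drugs["record_id"], toy_claims["record_id"])
member_overlap = overlap_share(toy_drugs["member_id"], toy_claims["member_id"])

print(f"record_id overlap: {record_overlap:.0%}")
print(f"member_id overlap: {member_overlap:.0%}")
```

Overlap alone does not prove the ids mean the same thing, so a spot check of a few matched members would still be worthwhile.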

Since there are no missing values or duplicate rows, we will assume for now that this dataset is good as is.

## Univariate EDA

Univariate EDA means generating counts, statistical summaries, and breakdowns of individual attributes. In essence, we are digging one level deeper than the broad overview in the sections above. This section will help clarify some of the consequences, validations, and assumptions identified above. **Its importance lies in passing our data correctly to the modelling phase and learning which features need to be engineered or adjusted**.

##### Claim Lines w/ Diagnosis Codes ("claim_lines.csv") - Univariate EDA

In [266]:
# Review record_id uniqueness
claims_df["record_id"].value_counts()


record_id
57738      1
1196744    1
1216321    1
1215584    1
1215527    1
          ..
1342480    1
1342352    1
1315855    1
1303734    1
695615     1
Name: count, Length: 1919959, dtype: int64

In [267]:
# Review range of record_ids
claims_df["record_id"].value_counts().sort_index()

record_id
100        1
1000       1
10000      1
100000     1
1000000    1
          ..
999995     1
999996     1
999997     1
999998     1
999999     1
Name: count, Length: 1919959, dtype: int64

In [268]:
# Review member_id frequencies
claims_df["member_id"].value_counts()

member_id
M0135203    921
M0242760    805
M0117950    781
M0080814    780
M0087703    692
           ... 
M0190692      1
M0044626      1
M0190689      1
M0190679      1
M0145870      1
Name: count, Length: 245166, dtype: int64

In [269]:
# Review range of member_ids
claims_df["member_id"].value_counts().sort_index()

member_id
M0000001    15
M0000002     6
M0000003     2
M0000004     2
M0000005     2
            ..
M0291609     3
M0291610     8
M0291611     1
M0291612     2
M0291613     7
Name: count, Length: 245166, dtype: int64

In [270]:
# Review date_svc frequencies
claims_df["date_svc"].value_counts()

date_svc
2018-02-15    4237
2018-02-27    4230
2018-02-14    4227
2018-02-28    4212
2018-02-11    4203
              ... 
2013-08-30       1
2015-03-08       1
2014-12-13       1
2014-06-10       1
2015-03-30       1
Name: count, Length: 1150, dtype: int64

In [271]:
# Review date_svc range
claims_df["date_svc"].value_counts().sort_index()

date_svc
1899-12-04    1
1899-12-07    1
1899-12-22    1
1899-12-24    1
1899-12-26    1
             ..
2018-10-16    1
2018-11-10    2
2018-11-16    1
2018-11-18    1
2018-12-06    1
Name: count, Length: 1150, dtype: int64

In [272]:
# Review diag1
claims_df["diag1"].value_counts()

diag1
Z00.00      70352
I10         46739
Z01.419     36676
Z12.31      25565
M54.5       23535
            ...  
S92.311G        1
S92.311D        1
S21.009A        1
S29.099A        1
M24.474         1
Name: count, Length: 20566, dtype: int64

In [273]:
# Review range of diag1 values
claims_df["diag1"].value_counts().sort_index()

diag1
000         63
000.0        1
000.00       1
002.1        1
035.1XX0     1
            ..
Z98.891     18
Z99.11      54
Z99.2       14
Z99.81       4
Z99.89       2
Name: count, Length: 20566, dtype: int64

In [274]:
# Review breakdown of general diagnoses category frequencies

# Get First Letter (the category)
claims_df["temp_category"] = claims_df["diag1"].str[0]

# Get Category Frequencies
print(claims_df["temp_category"].value_counts())

# About 1,600 values don't follow the ICD-10 diagnosis code format (i.e. a letter as the first character); this could be obsolete formatting or data entry error. 
# Since it is a small portion of the entire dataset, we can remove these for now...
claims_df.drop(claims_df[~claims_df["temp_category"].isin([chr(i) for i in range(ord('A'), ord('Z') + 1)])].index, inplace=True)

# Get Category Frequencies
print(claims_df["temp_category"].value_counts())

# Revert dataframe back to earlier form
claims_df.drop(columns = ["temp_category"], inplace = True)


temp_category
Z    334300
M    250146
R    241012
J    161433
E    113395
I    110368
N    101250
K     80552
L     78757
S     78004
H     73936
F     60683
D     49843
G     48318
C     47172
B     30716
O     24729
T     13000
A     11609
P      4418
Q      4265
9      1578
W       181
V       101
Y        96
0        69
X        11
3         4
7         4
4         2
1         2
8         2
2         1
5         1
6         1
Name: count, dtype: int64
temp_category
Z    334300
M    250146
R    241012
J    161433
E    113395
I    110368
N    101250
K     80552
L     78757
S     78004
H     73936
F     60683
D     49843
G     48318
C     47172
B     30716
O     24729
T     13000
A     11609
P      4418
Q      4265
W       181
V       101
Y        96
X        11
Name: count, dtype: int64


For the claim_lines dataset, our uni-variate analysis of frequencies and ranges of values gives some simple but useful facts about our dataset.

- "record_id" is unique for every row, as expected.
- "member_id" frequency counts go as high as ~900 rows for a single member (921 at the top). These are diagnosis rows, not necessarily distinct diagnoses, but they show that some members have a very large number of diagnoses, so we need to accommodate this in our binning strategy.
- "date_svc" ranges from as early as 1899 up to 2018. This is an interesting discovery, considering the problem statement described a 3-year span for members.
    - Dates in the 1990s would be relatively explainable; 1899, however, is implausibly far back and may point to a data entry or default-value issue.
- "diag1" does contain some codes that do not follow the ICD-10 diagnosis code syntax (either by data entry error, or alternative syntaxes at data recording).
    - About 1,600 rows are affected by this, and given our much larger dataset, we can follow the simplistic approach of removing these rows
    - Our new claim_lines dataset still has ~1,900,000+ rows after this operation
    - After this, we see that the distribution of 1st-position alpha characters is far from uniform across the codes,
        - Some have as many as ~334,000 rows (Z-headed codes), while others have as few as 11 (X-headed codes).
            - This implies there is some imbalance in our dataset (i.e. the model will learn from much more data in the well-populated categories than in the sparse categories)
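One way to make the two findings above explicit (codes that do not start with a letter, and implausible service dates) is to flag the rows rather than drop them right away. The sketch below uses made-up toy rows; the window bounds `WINDOW_START` and `WINDOW_END` are hypothetical placeholders that would need confirming with the data owner, and the first-letter test is only a loose check, not a full ICD-10 validator.

```python
import pandas as pd

# Toy stand-in (made-up rows) for claims_df after type coercion
toy_claims = pd.DataFrame({
    "diag1": ["Z00.00", "I10", "000.0", "9"],
    "date_svc": pd.to_datetime(
        ["2017-05-01", "1899-12-04", "2018-02-15", "2016-07-09"]
    ),
})

# Loose format check: the first character must be a letter A-Z
starts_with_letter = toy_claims["diag1"].str.match(r"[A-Z]", na=False)

# Hypothetical plausibility window for service dates (an assumption)
WINDOW_START = pd.Timestamp("2013-01-01")
WINDOW_END = pd.Timestamp("2018-12-31")
in_window = toy_claims["date_svc"].between(WINDOW_START, WINDOW_END)

toy_claims["flag_bad_code"] = ~starts_with_letter
toy_claims["flag_bad_date"] = ~in_window

print(toy_claims)
```

Keeping the flags as columns makes it easy to count the affected rows before deciding whether to remove or repair them.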

##### CCS / Clinical Categories ("ccs.csv") - Univariate EDA

In [275]:
# Review diag frequencies being unique
ccs_df["diag"].value_counts()

diag
A000       1
S82319D    1
S82391B    1
S82391A    1
S82319S    1
          ..
S02609D    1
S02609G    1
S02609K    1
S02609S    1
Z9989      1
Name: count, Length: 72167, dtype: int64

In [276]:
# Review range of diag codes
ccs_df["diag"].value_counts().sort_index()

diag
A000     1
A001     1
A009     1
A0100    1
A0101    1
        ..
Z9912    1
Z992     1
Z993     1
Z9981    1
Z9989    1
Name: count, Length: 72167, dtype: int64

In [277]:
# Review Frequencies of ccs_1_desc
ccs_df["ccs_1_desc"].value_counts()

ccs_1_desc
Injury and poisoning                                                                 40432
Residual codes; unclassified; all E codes [259. and 260.]                             7621
Diseases of the musculoskeletal system and connective tissue                          5392
Diseases of the nervous system and sense organs                                       4017
Complications of pregnancy; childbirth; and the puerperium                            2332
Mental Illness                                                                        2012
Neoplasms                                                                             1729
Endocrine; nutritional; and metabolic diseases and immunity disorders                 1492
Diseases of the circulatory system                                                    1222
Diseases of the digestive system                                                       941
Infectious and parasitic diseases                                              

In [278]:
# Review Frequencies of ccs_2_desc
ccs_df["ccs_2_desc"].value_counts()

ccs_2_desc
Fractures                                                      16924
                                                                7621
Open wounds                                                     3636
Other injuries and conditions due to external causes [244.]     3126
Crushing injury or internal injury [234.]                       2736
                                                               ...  
Cystic fibrosis [56.]                                              5
Osteoporosis [206.]                                                3
Maintenance chemotherapy; radiotherapy [45.]                       3
Respiratory distress syndrome [221.]                               1
Aspiration pneumonitis; food/vomitus [129.]                        1
Name: count, Length: 136, dtype: int64

In [279]:
# Review Frequencies of ccs_3_desc
ccs_df["ccs_3_desc"].value_counts()

ccs_3_desc
Fracture of lower limb                                  6888
Fracture of upper limb                                  6651
Other injuries and conditions due to external causes    3126
Crushing injury or internal injury                      2736
Burns                                                   2584
                                                        ... 
Respiratory distress syndrome                              1
Forceps delivery                                           1
Aspiration pneumonitis; food/vomitus                       1
Syncope                                                    1
Parkinson`s disease                                        1
Name: count, Length: 283, dtype: int64

As in the earlier section, we collect our thoughts and learnings from the CCS dataset describing the diagnosis code details.

- A main takeaway is the approximate number of unique categories in each respective CCS layer.
    - ccs_1 has 18 unique categories, ccs_2 has 136, ccs_3 has 283.
    - Because these categories come from an established, well-researched classification scheme, they are likely good candidates for the patient health statuses as-is.
        - The category descriptions also align fairly well with the CCS/ICD-10 documentation linked in the assessment, so they appear consistent with that documentation.
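The earlier hypothesis that "ccs_1" is the broadest level and "ccs_3" the most specific could be tested by checking that each finer category belongs to exactly one coarser category. The sketch below uses a made-up toy frame with placeholder labels, and `is_nested` is a hypothetical helper. On the real data, blank labels (such as the empty "ccs_2_desc" value visible in the output above) would likely need separate handling.

```python
import pandas as pd

# Toy stand-in (made-up rows and placeholder labels) for ccs_df
toy_ccs = pd.DataFrame({
    "diag": ["A000", "A001", "B010", "B011"],
    "ccs_1_desc": ["Level1 X", "Level1 X", "Level1 Y", "Level1 Y"],
    "ccs_2_desc": ["Level2 X1", "Level2 X1", "Level2 Y1", "Level2 Y2"],
    "ccs_3_desc": ["Level3 X1a", "Level3 X1b", "Level3 Y1a", "Level3 Y2a"],
})

levels = ["ccs_1_desc", "ccs_2_desc", "ccs_3_desc"]

# Number of distinct categories per level (should grow with specificity)
unique_counts = toy_ccs[levels].nunique()


def is_nested(df: pd.DataFrame, child: str, parent: str) -> bool:
    """True if every child category maps to exactly one parent category."""
    return bool((df.groupby(child)[parent].nunique() <= 1).all())


level2_in_level1 = is_nested(toy_ccs, "ccs_2_desc", "ccs_1_desc")
level3_in_level2 = is_nested(toy_ccs, "ccs_3_desc", "ccs_2_desc")

print(unique_counts)
print(level2_in_level1, level3_in_level2)
```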

##### Prescription Drugs Data ("prescription_drugs.csv") - Univariate EDA

In [280]:
drug_df.columns

Index(['record_id', 'member_id', 'date_svc', 'ndc', 'drug_category',
       'drug_group', 'drug_class'],
      dtype='object')

In [281]:
# Check uniqueness of record_id
drug_df["record_id"].value_counts()

record_id
4115084976453758912    1
7027581895774705290    1
5623458955091541006    1
56846722386239700      1
2199505267363317795    1
                      ..
3405075619608131908    1
6328362827352577029    1
8477828071356189092    1
2334210907924906058    1
1237414475289191242    1
Name: count, Length: 3005934, dtype: int64

In [282]:
# Check range of values from record_id
drug_df["record_id"].value_counts().sort_index()

record_id
1000007285353966078    1
1000009593670105711    1
1000020063626312092    1
1000020374385918396    1
10000205291842512      1
                      ..
999993725707021047     1
999993948402636287     1
999996671367233637     1
999996871257577583     1
999999727377882656     1
Name: count, Length: 3005934, dtype: int64

In [283]:
# Check frequency of member_id
drug_df["member_id"].value_counts()

member_id
M0151509    669
M0023252    454
M0118840    445
M0026330    444
M0287678    441
           ... 
M0151419      1
M0100986      1
M0029023      1
M0212291      1
M0291201      1
Name: count, Length: 240735, dtype: int64

In [284]:
# Check range of values from member_id
drug_df["member_id"].value_counts().sort_index()

member_id
M0000001     1
M0000002     4
M0000003     1
M0000004     2
M0000005     3
            ..
M0291608    13
M0291610    18
M0291612    15
M0291613     7
M0291614     1
Name: count, Length: 240735, dtype: int64

In [285]:
# Check frequency of date_svc
drug_df["date_svc"].value_counts()

date_svc
2018-03-27    5881
2018-04-03    5809
2018-02-28    5755
2018-03-01    5736
2018-03-21    5727
              ... 
2013-12-10       4
2013-12-13       3
2013-12-05       2
2013-12-06       1
2013-12-04       1
Name: count, Length: 1644, dtype: int64

In [286]:
# Check range of date_svc
drug_df["date_svc"].value_counts().sort_index()

date_svc
2013-12-04      1
2013-12-05      2
2013-12-06      1
2013-12-07     10
2013-12-08      6
             ... 
2018-05-31    399
2018-06-01    268
2018-06-02    207
2018-06-03    129
2018-06-04     28
Name: count, Length: 1644, dtype: int64

In [287]:
# Check frequencies of drug_category
drug_df["drug_category"].value_counts()

drug_category
Antidepressants                271689
Antihypertensives              180439
Antihyperlipidemics            170385
Contraceptives                 150723
Antidiabetics                  145150
                                ...  
Antiseptics & Disinfectants        13
Nutrients                          11
Alternative Medicines               2
Antacids                            2
General Anesthetics                 1
Name: count, Length: 92, dtype: int64

In [288]:
# Check frequencies of drug_group
drug_df["drug_group"].value_counts()

drug_group
Hmg CoA Reductase Inhibitors                       147973
Selective Serotonin Reuptake Inhibitors (SSRIs)    140335
Combination Contraceptives - Oral                  131597
Thyroid Hormones                                    94937
Nonsteroidal Anti-Inflammatory Agents (NSAIDs)      78008
                                                    ...  
Internal Vehicle Ingredients/Agents                     1
Zinc                                                    1
Antacids - Bicarbonate                                  1
Antihistamines-Topical                                  1
Bulk Chemicals - P's                                    1
Name: count, Length: 464, dtype: int64

In [289]:
# Check frequencies of drug_class
drug_df["drug_class"].value_counts()

drug_class
Hmg CoA Reductase Inhibitors                       147961
Selective Serotonin Reuptake Inhibitors (SSRIs)    140335
Thyroid Hormones                                    94937
Combination Contraceptives - Oral                   91720
Anticonvulsants - Misc.                             74351
                                                    ...  
Keratolytic And/Or Antimitotic Combinations             1
Bulk Chemicals - Ca's                                   1
Bulk Chemicals - Di's                                   1
C1 Inhibitors                                           1
Electrolytes & Dextrose                                 1
Name: count, Length: 687, dtype: int64

Similar to the earlier section, we collect our thoughts and learnings from the Prescription Drug dataset describing what drugs were filled.

- "record_id", as suspected, covers a completely different range of values from the claim_lines dataset's "record_id"
    - Thus they are simply unique IDs for their respective datasets and have no relation to each other
- "member_id" has a similar range of values to the claim_lines dataset's "member_id", so there is a high chance that the attribute represents the same information across the two datasets
    - This would be a good candidate for the joining strategy
- "date_svc" ranges from December 2013 to June 2018, so it covers a very different range from the claim_lines dataset's "date_svc"
    - The recommendation still holds that these dates should be assumed to describe only their respective datasets, and they should not be used as a join key without further understanding.
- "drug_category", "drug_group", and "drug_class" similarly form a descriptive hierarchy, with 92, 464, and 687 distinct values respectively.
    - For our first iteration, and to respect time, we will consider only "drug_category" as our grouping
        - With too many groupings, the model may require more compute, especially with compute-intensive algorithms.
        - With too few groupings, the model may be too simplistic to handle a complex problem statement.
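
The hierarchy claim above can be checked mechanically. The following is a minimal sketch, assuming a toy frame (`toy_drug_df` and all of its placeholder values are hypothetical, not taken from the real data): it counts the distinct values per level and tests whether each lower level maps to exactly one parent.

```python
import pandas as pd

# Hypothetical toy stand-in for drug_df; the values are placeholders only.
toy_drug_df = pd.DataFrame({
    "drug_category": ["cat_1", "cat_1", "cat_1", "cat_2"],
    "drug_group":    ["grp_1", "grp_1", "grp_2", "grp_3"],
    "drug_class":    ["cls_1", "cls_2", "cls_3", "cls_4"],
})

# Distinct values per level (the real data reports 92, 464, and 687).
level_counts = toy_drug_df[["drug_category", "drug_group", "drug_class"]].nunique()

# A strict hierarchy means every group belongs to exactly one category,
# and every class belongs to exactly one group.
categories_per_group = toy_drug_df.groupby("drug_group")["drug_category"].nunique()
groups_per_class = toy_drug_df.groupby("drug_class")["drug_group"].nunique()
is_strict_hierarchy = bool((categories_per_group == 1).all() and (groups_per_class == 1).all())

print(level_counts)
print("Strict hierarchy:", is_strict_hierarchy)
```

If the check fails on the real data, the three columns should be treated as overlapping labels rather than nested levels.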

##### Uni-variate EDA Summary / Learnings

There were many sections above detailing a brief yet helpful overview of our data. We re-summarize a few major points to make this section easy to reference in one spot.

**Rough Joining Strategy**: Drugs.MemberId <-> Claim_lines.MemberId, Claim_lines.diag1 <-> CCS.diag

- Remember that we are not confident about how the "date_svc" columns in the different datasets relate to each other, so we won't use them as a join key for now.
- The Claim_lines to CCS connection requires a small adjustment to one of the diagnosis columns so that the codes match: either remove the decimal from one or add it to the other
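
As a minimal sketch of this joining strategy, assuming toy stand-ins for the real tables (the frames `toy_claims` and `toy_ccs` and their values are hypothetical, and we assume the Claim_lines codes are the ones carrying the decimal), the decimal can be stripped before the merge. The `indicator` option of `merge` then gives a quick way to see how many claim lines found a CCS match.

```python
import pandas as pd

# Hypothetical toy data; the codes and descriptions are placeholders.
toy_claims = pd.DataFrame({
    "member_id": ["M1", "M1", "M2"],
    "diag1": ["A00.0", "Z99.11", "X99.9"],
})
toy_ccs = pd.DataFrame({
    "diag": ["A000", "Z9911"],
    "ccs_1_desc": ["desc_a", "desc_z"],
})

# regex=False makes '.' a literal character instead of a regex wildcard.
toy_claims["diag_icd10"] = toy_claims["diag1"].str.replace(".", "", regex=False)

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for claim lines without a CCS entry.
joined = toy_claims.merge(
    toy_ccs, how="left", left_on="diag_icd10", right_on="diag", indicator=True
)
match_rate = (joined["_merge"] == "both").mean()
print(f"{match_rate:.0%} of claim lines matched a CCS row")
```

A low match rate on the real data would suggest the codes need more reconciling than just the decimal.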


**Model Structure**: Our target variable is the health status of the members, which we have defined as "ccs_1_desc" from the CCS dataset. Our input is likely to be the list of drugs that a member has filled/taken.

Because any particular member can have multiple diagnoses and multiple drugs filled, we need to consider the layout of the model's input set. There are likely multiple approaches to accomplishing this, but we can follow the strategy below:

- Because a member can have multiple health conditions/statuses, we should decide whether each row predicts multiple categories at once, or a single category, with separate rows per category. 
    Decision: Multiple rows, one per health status

- We should decide whether the identity of a member is relevant here. The member IDs are masked, and the model can run its predictions without knowing which member it is looking at. In other words, we assume as a simplification that members are mostly interchangeable.
    Decision: We don't need to track member IDs. A member can show up one or many times in the model input; the model focuses on learning the connection between the drugs and the health statuses, and deprioritizes characteristic differences between members.

- Each row needs to capture the full set of drugs associated with the health status, so the dataset must represent multiple drugs together.
    Decision: We can convert our data to indicator variables (i.e. one column per drug category, holding "1" if that drug was taken and "0" otherwise).
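
The indicator layout described above can be sketched on toy data. This is one possible approach, assuming a long-format frame with one row per (member, health status, drug) combination; `toy_long` and its values are hypothetical, and `pd.crosstab` is used here as an illustration rather than as the method applied later in this notebook.

```python
import pandas as pd

# Hypothetical long-format data: one row per (member, health status, drug).
# The last row is a deliberate duplicate.
toy_long = pd.DataFrame({
    "member_id":     ["M1", "M1", "M1", "M1", "M2", "M2"],
    "health_status": ["status_x", "status_x", "status_y", "status_y", "status_x", "status_x"],
    "drug_category": ["drug_a", "drug_b", "drug_a", "drug_b", "drug_b", "drug_b"],
})

# crosstab counts occurrences per (member, status) x drug;
# clipping at 1 turns the counts into 0/1 indicators.
indicators = (
    pd.crosstab(
        index=[toy_long["member_id"], toy_long["health_status"]],
        columns=toy_long["drug_category"],
    )
    .clip(upper=1)
    .reset_index()
)
indicators.columns.name = None  # drop the leftover "drug_category" axis label

print(indicators)
```

Each output row is one member/health-status pair, and the drug columns record whether that drug category appears for the member.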

**Model Considerations**:
- Because of the discussed layout, the resulting input dataset can become quite large after the indicator variable refactoring (i.e. a column per drug category). So we may want to pick a first algorithm that trains and executes relatively quickly.
- The dataset may be relatively sparse, with most entries being 0. A large number of features in a sparse dataset brings its own issues, which should be kept in mind.
- Because there can be multiple health statuses per member, we would like an output with multiple predictions. The idea is that each health status category has its own predicted probability, so we can assess whether one or more health statuses are likely for the member in question.
- Our dataset is large, which should alleviate some of these concerns to some degree (hopefully).
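
To make the sparsity concern measurable, a small sketch along these lines could be run on the indicator matrix. The frame `toy_X` is a hypothetical stand-in for the real indicator columns.

```python
import pandas as pd

# Hypothetical indicator matrix: rows are member/status pairs, columns are drugs.
toy_X = pd.DataFrame({
    "drug_a": [1, 0, 0, 0],
    "drug_b": [0, 0, 0, 0],
    "drug_c": [1, 1, 0, 0],
})

# Overall share of zero entries in the matrix.
sparsity = float((toy_X == 0).to_numpy().mean())

# Share of rows in which each drug column is set, lowest first.
column_prevalence = toy_X.mean().sort_values()

# Columns that are never set carry no information for a model.
unused_columns = column_prevalence[column_prevalence == 0].index.tolist()

print(f"Sparsity: {sparsity:.0%}")
print("Never-used columns:", unused_columns)
```

Columns with near-zero prevalence are the ones most likely to add runtime without adding signal, so this view can inform whether any of them deserve a closer look.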

Nice-To-Haves:  
With further time, it would be great to polish the outputs and build visuals that could serve as aids / references in PowerPoint decks or presentations for stakeholders. More research could also be done on the data attributes we questioned above. Additionally, there may be better ways to lay out the input dataset for the model. 

Potential Issues:  
We aren't entirely sure that our categorization/binning strategy is optimal. Additionally, using indicator variables expands the dataset greatly, which may not suit algorithms whose compute cost grows quickly with dataset size. Finally, we noted an imbalance in the final health status categories, so the model may not learn evenly, especially for the very infrequent health statuses.

## Bivariate EDA

This section is skipped, as the time spent on this assignment is nearing the given limit. Instead, we summarize what the intent of this section would have been, and what we would have done.

Many datasets have correlated attributes or covariates that play an important part in the behavior and/or outcomes within the data. Some of these correlations are patterns that can be leveraged to the model's benefit; others can mislead the model, for example through statistical effects such as Simpson's Paradox. It's important to be aware of the relationships that exist in the data.

- In our case, the hierarchical relationships would have been of interest, to analyze how these attributes break down and compose with each other.
- Another interesting relationship would be between dates and factors such as the diagnoses or drugs recorded in specific periods.

Bivariate EDA also lends itself to many helpful graphical representations of two variables together. These commonly reveal initial key insights whose patterns may surprise even stakeholders.

- Analyzing the observed frequencies of Health Statuses paired against the Drug categories, without needing to implement a model
- Analyzing a correlation heatmap of the different pairings of Health Statuses and Drug categories.
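
As a rough sketch of the frequency analysis mentioned above, assuming a long-format table of health status and drug category pairs (`toy_pairs` and its values are hypothetical), a cross-tabulation gives both the raw pair counts and the per-status shares that a heatmap could be drawn from.

```python
import pandas as pd

# Hypothetical pairs of health status and drug category.
toy_pairs = pd.DataFrame({
    "health_status": ["status_x", "status_x", "status_x", "status_y"],
    "drug_category": ["drug_a", "drug_b", "drug_a", "drug_b"],
})

# Raw co-occurrence counts of each status/drug pairing.
pair_counts = pd.crosstab(toy_pairs["health_status"], toy_pairs["drug_category"])

# normalize="index" rescales each row to sum to 1, so frequent and rare
# health statuses can be compared on the same scale.
pair_shares = pd.crosstab(
    toy_pairs["health_status"], toy_pairs["drug_category"], normalize="index"
)

print(pair_counts)
print(pair_shares.round(2))
```

Row-normalizing matters here because of the class imbalance noted earlier: raw counts would be dominated by the most common health statuses.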

For now, we have enough guidance to proceed into our model preparation.

--------------------------------------------------------------------------------------------

### Initial / v1 Classification Modelling

As discussed in the Opening Thoughts and EDA sections, we are set on the type of model we want to build for our use case: Classification of Health States/Statuses from Existing Drugs Used. 

Recall a few key insights/points we've gathered that should be accounted for in our modelling (copied from the earlier section):

- Because a member can have multiple health conditions/statuses, we should decide whether each row predicts multiple categories at once, or a single category, with separate rows per category. 
    Decision: Multiple rows, one per health status
- We should decide whether the identity of a member is relevant here. The member IDs are masked, and the model can run its predictions without knowing which member it is looking at. In other words, we assume as a simplification that members are mostly interchangeable.
    Decision: We don't need to track member IDs. A member can show up one or many times in the model input; the model focuses on learning the connection between the drugs and the health statuses, and deprioritizes characteristic differences between members.
- Each row needs to capture the full set of drugs associated with the health status, so the dataset must represent multiple drugs together.
    Decision: We can convert our data to indicator variables (i.e. one column per drug category, holding "1" if that drug was taken and "0" otherwise).

- Because of the discussed layout, the resulting input dataset can become quite large after the indicator variable refactoring (i.e. a column per drug category). So we may want to pick a first algorithm that trains and executes relatively quickly.
- The dataset may be relatively sparse, with most entries being 0. A large number of features in a sparse dataset brings its own issues, which should be kept in mind.
- Because there can be multiple health statuses per member, we would like an output with multiple predictions. The idea is that each health status category has its own predicted probability, so we can assess whether one or more health statuses are likely for the member in question.
- Our dataset is large, which should alleviate some of these concerns to some degree (hopefully).


------------------------------------------------------------------

Our Key Steps in the (first-pass) Modelling Phase will be to:

- Perform any Data Transformations in the Dataset if necessary. This could be either to synthesize/derive additional variables that help the model learn, or to redesign a few attributes so that the model can use them more effectively. For example, we mentioned earlier that we will need indicator variable refactoring to represent scenarios involving multiple drugs.

- Implement a ML model. In our case, the focus of a v1 model is to establish a baseline, the value of which should not be underestimated: it provides a frame of reference for model implementations and improvements in further iterations. 

- Stay aware of the model's performance by reviewing the performance metrics. Additionally, consider the non-technical advantages/disadvantages of the model: interpretation, usability, monitoring, and complexity of the results. Adoption of a model in industry relies heavily on both its technical and non-technical merits.

In [290]:
# Based on our earlier mentioned joining strategy, we have enough details to begin combining the data in the datasets

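# Remove the decimal point from the Claim_lines diagnoses.
# regex=False treats '.' as a literal character; in regex mode '.' would match every character.
claims_df['diag_icd10'] = claims_df['diag1'].str.replace('.', '', regex=False)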

# Join Claim_lines.diag1 with ccs.diag
modelling_df = claims_df.merge(ccs_df[["diag", "ccs_1_desc"]], how = "left", left_on = "diag_icd10", right_on = "diag")

# Validate that the Joins were mostly successful and if there needs to be further reconciling
print(modelling_df["diag"].value_counts().sort_index())

# Clean up Columns that won't be used probably
modelling_df.drop(columns = ["record_id", "diag_icd10", "date_svc", "diag"], inplace = True)


# Join the modelling DataFrame with the drug info. A member may have multiple drugs, so this expands the number of rows.
# Later, we will collapse this again when we pivot to indicator variables
modelling_df = modelling_df.merge(drug_df[["member_id", "drug_category"]], how = "left", on = "member_id")

# This process generates duplicates, so we drop them, considering ALL columns in the duplicate detection
modelling_df.drop_duplicates(inplace = True)

# Because we have effectively summarized the exact diagnosis code into its tier 1 description, we no longer need the exact diagnosis code
modelling_df.drop(columns = ["diag1"], inplace = True)

# Then recheck for duplicates
modelling_df.drop_duplicates(inplace = True)

diag
A000       4
A001       1
A009       2
A0100      6
A020      16
          ..
Z98891    18
Z9911     54
Z992      14
Z9981      4
Z9989      2
Name: count, Length: 19845, dtype: int64


In [291]:
# Review the Dataset before we repivot it to indicator variables
modelling_df

| | member_id | ccs_1_desc | drug_category |
|---|---|---|---|
| 0 | M0000001 | Diseases of the genitourinary system | Antivirals |
| 1 | M0000001 | Complications of pregnancy; childbirth; and th... | Antivirals |
| 10 | M0000001 | Certain conditions originating in the perinata... | Antivirals |
| 15 | M0000002 | Residual codes; unclassified; all E codes [259... | Vaccines |
| 17 | M0000002 | Residual codes; unclassified; all E codes [259... | Tetracyclines |
| ... | ... | ... | ... |
| 57461531 | M0291613 | Diseases of the nervous system and sense organs | Vitamins |
| 57461533 | M0291613 | Diseases of the nervous system and sense organs | Antiemetics |
| 57461534 | M0291613 | Diseases of the nervous system and sense organs | Antidiabetics |
| 57461536 | M0291613 | Diseases of the nervous system and sense organs | Anticonvulsants |


In [292]:
# Pivoting such that a new column is generated per drug_category
modelling_df = modelling_df.pivot(index=['member_id', 'ccs_1_desc'], columns='drug_category', values = 'drug_category')

# Reset the Index so it is a normal dataframe
modelling_df = modelling_df.reset_index()

# Drop the column whose label is NaN, selecting it by label rather than by position
modelling_df = modelling_df.loc[:, modelling_df.columns.notna()]

# We have some NaNs in our ccs_1_desc column. There are not too many, so we drop these rows for completeness
modelling_df.dropna(subset=['ccs_1_desc'], inplace = True)

# Convert the indicator columns to 0/1:
# NaN becomes 0 (the drug wasn't taken by the member for the associated diagnosis), any other value becomes 1 (the drug was taken)
for indicator_cols in modelling_df.columns.tolist()[2:]:
    modelling_df[indicator_cols] = modelling_df[indicator_cols].notnull().astype(int)

# We have functionally utilized the member_id as part of the relevant groupings, so we can drop this column now.
modelling_df.drop(columns = ["member_id"], inplace = True)

# Rename the diagnosis column for clarity
modelling_df.rename(columns = {'ccs_1_desc':'health_status'}, inplace = True)

In [293]:
# Review the modelling dataframe up until now
modelling_df

| drug_category | health_status | Adhd/Anti-narcolepsy/Anti-obesity/Anorexiants | Allergenic Extracts/Biologicals Misc | Alternative Medicines | Aminoglycosides | Analgesics - Anti-Inflammatory | Analgesics - NonNarcotic | Analgesics - Opioid | Androgens-Anabolic | Anorectal Agents | ... | Tetracyclines | Thyroid Agents | Toxoids | Ulcer Drugs | Urinary Anti-Infectives | Urinary Antispasmodics | Vaccines | Vaginal Products | Vasopressors | Vitamins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Certain conditions originating in the perinata... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Complications of pregnancy; childbirth; and th... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Diseases of the genitourinary system | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Diseases of the circulatory system | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Mental Illness | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 712567 | Diseases of the musculoskeletal system and con... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 712568 | Diseases of the musculoskeletal system and con... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 712569 | Diseases of the nervous system and sense organs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 712570 | Endocrine; nutritional; and metabolic diseases... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


### Theoretical Modelling Steps / Future Steps & Extensions

At this point, we've reached the time limit for the assessment. However, this section is dedicated to the following explanations:

##### How Our Final Model Dataset Expresses the Problem Statement: 

Our final Dataset depicts the combinations of drugs associated with any particular health status. The definitions were presented in our earlier model preparation section. In technical terms, each data row gives a particular health status and the drugs that were given to an individual with that health status. **This is an adequate layout for providing the model's predictive inputs (which drugs were used and which weren't) to predict a classification (what health status we would anticipate or expect).** It is also fitting in that new or sample data could easily be converted to the same form. The member IDs were used to form the required groupings, but they are not predictive as a trait themselves and were thus dropped in the final step of the model dataset preparation.

##### What Model we would potentially employ, and why.

There is a wide portfolio of classification models that could be tried here. We also identified that this is a supervised learning problem, with historical "true" labels we have access to. Given our desire for explainability as well as reasonable runtime, we could try a relatively robust model type such as Random Forests or XGBoost (gradient-boosted trees). The next steps would be to implement the model code, train it on our dataset, and assess its performance on test data (either set aside from the original dataset, or brand new data). Depending on the performance, future iterations could continue to improve the model, and it could then be used for real predictions as described in the problem statement.

Advantages/Fit For This Problem Statement:
- Tree Models are relatively easy to interpret, since a decision tree mirrors the branching, rule-based way humans often reason. An ensemble is harder to read than a single tree, but it still provides feature importances.
- Tree Models are often quite accurate for a first-choice model, as reflected in many published competition write-ups: https://www.kaggle.com/datasets/kaggle/kaggle-blog-winners-posts/code
- They compute relatively fast given modern compute resources and increasingly efficient packages.
- They require relatively little hyperparameter tuning compared to something like neural networks.

Disadvantages/Concerns for this Problem Statement:
- The Tree Model may react strangely to the extremely sparse nature of the dataset
- There are many columns, which could increase the runtime well beyond what we would hope.
- The categorization assumptions we have used may be too simple compared to an optimal solution.
- It may be hard to generate readable tree visuals for explanations or stakeholder interaction, given the many decision points across all the columns.
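
The next steps described above (train, hold out test data, inspect per-class probabilities) could look roughly like the sketch below. It assumes scikit-learn is available and runs on synthetic data: the frame `X`, the labels `y`, and the rule linking them are hypothetical stand-ins for the real indicator dataset, and the parameter choices are illustrative rather than tuned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic 0/1 indicator features standing in for the drug category columns.
X = pd.DataFrame(
    rng.integers(0, 2, size=(400, 5)),
    columns=[f"drug_{i}" for i in range(5)],
)
# Hypothetical labelling rule, only there to give the model a signal to learn.
y = np.where(X["drug_0"] == 1, "status_x",
             np.where(X["drug_1"] == 1, "status_y", "status_z"))

# Hold out a stratified test set so every health status appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" is one way to counter imbalanced health statuses.
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# One probability column per health status, as wished for in the considerations.
proba = pd.DataFrame(model.predict_proba(X_test), columns=model.classes_, index=X_test.index)
accuracy = model.score(X_test, y_test)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

print(f"Hold-out accuracy: {accuracy:.2f}")
print(importances)
```

The probability table is what would let us flag more than one likely health status per input row, and the importances offer a compact explanation when the trees themselves are too numerous to draw.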

##### Other Extension/Explorative steps & Ideas

We re-summarize some areas of improvement and reasonable future steps if there was more time and/or this project was on a larger scope/scale.

- We made multiple decisions & assumptions in favor of relatively easy categorizations. With further data exploration, we could find a more ideal way of expressing both the drug categories and the health statuses that we are trying to predict.
- Once we have a baseline model, it would be great to try alternative model algorithms and compare whether the performance improves or weakens.
- We could reshape the dataset in an alternative way that is still adequate as a model input, and see whether the model's performance improves.
- Most classification models come with established evaluation metrics. We have not discussed these in this assessment, but they would include misclassification rate, sensitivity/specificity, etc.
- It would be well worth exploring whether other datasets could supplement the information here.
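
To make the evaluation metrics mentioned above concrete, here is a small plain-Python sketch. The labels in `y_true` and `y_pred` are hypothetical, and the helper `one_vs_rest_rates` is an illustrative name, not part of any library. For a multi-class problem, sensitivity and specificity are computed per health status by treating that status as "positive" and all others as "negative".

```python
def one_vs_rest_rates(y_true, y_pred, positive_label):
    """Return (sensitivity, specificity) for one class against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    tn = sum(t != positive_label and p != positive_label for t, p in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical true and predicted health statuses for six rows.
y_true = ["x", "x", "y", "y", "z", "z"]
y_pred = ["x", "y", "y", "y", "z", "x"]

# Share of rows where the prediction differs from the true label.
misclassification_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
sens_x, spec_x = one_vs_rest_rates(y_true, y_pred, positive_label="x")

print(f"Misclassification rate: {misclassification_rate:.2f}")
print(f"Status x: sensitivity={sens_x:.2f}, specificity={spec_x:.2f}")
```

Reporting these per health status, rather than as a single overall accuracy, would show whether the infrequent statuses are being learned at all.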