# 2023 Humana Mays Healthcare Analytics Case Competition
## Problem Prompt
**By: Dustin James Harper—Senior Data Scientist, Humana Pharmacy Analytics and Consulting**

---

### Introduction
This Jupyter Notebook outlines the problem statement and available data for the 2023 Humana Mays Healthcare Analytics Case Competition. For additional information about registration, schedule, submission logistics, and the leaderboard, visit [Humana TAMU Analytics](https://mays.tamu.edu/humana-tamu-analytics/).

---

### Table of Contents
1. [Motivation and Opportunity](#Motivation-and-Opportunity)
2. [Predictive Modeling Target](#Predictive-Modeling-Target)
    1. [Unsuccessful Therapy](#Unsuccessful-Therapy)
    2. [All Other Therapies](#All-Other-Therapies)
3. [Available Data](#Available-Data)
    1. [File Descriptions](#File-Descriptions)

---

### 1. Motivation and Opportunity
Cancer remains a leading cause of death in the U.S., despite significant advances in research and new therapies. One such therapy, Osimertinib (brand name Tagrisso, the name used in the data files), has been effective but presents challenges because of its side effects. About a quarter of Humana members taking Osimertinib discontinue therapy within the first 6 months due to side effects. The aim is to use data and analytics to identify these members early and encourage medication adherence.

---

### 2. Predictive Modeling Target
Your task is to build a predictive model that identifies patients who are likely to discontinue Osimertinib therapy due to adverse drug events (ADEs).

#### 2.1 Unsuccessful Therapy
- **`tgt_ade_dc_ind == 1`**: The therapy ends fewer than 180 days after it starts, and an ADE is reported during the therapy.

#### 2.2 All Other Therapies
- **`tgt_ade_dc_ind == 0`**: Includes successful therapies, therapies with no ADEs, and those where members changed insurance plans or passed away before 180 days.
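
The two definitions above can be read as a simple labeling rule. The sketch below is only an illustration of that rule, not how the competition target was actually built: `ade_reported` and `censored` (plan change or death before 180 days) are hypothetical boolean columns, and the strict "fewer than 180 days" boundary is an assumption. In practice the label is already supplied as `tgt_ade_dc_ind` in `target_train.csv`.

```python
import pandas as pd


def label_unsuccessful(therapies: pd.DataFrame, window_days: int = 180) -> pd.Series:
    """Return 1 where a therapy ended early with an ADE reported, else 0."""
    start = pd.to_datetime(therapies["therapy_start_date"], utc=True)
    end = pd.to_datetime(therapies["therapy_end_date"], utc=True)
    ended_early = (end - start).dt.days < window_days
    # `ade_reported` and `censored` are hypothetical columns used for illustration only.
    flagged = ended_early & therapies["ade_reported"] & ~therapies["censored"]
    return flagged.astype(int)


example = pd.DataFrame(
    {
        "therapy_start_date": ["2022-01-01T00:00:00.000+0000"] * 4,
        "therapy_end_date": [
            "2022-03-01T00:00:00.000+0000",  # ended early, ADE reported
            "2022-12-01T00:00:00.000+0000",  # lasted longer than 180 days
            "2022-02-01T00:00:00.000+0000",  # ended early, no ADE
            "2022-02-01T00:00:00.000+0000",  # ended early, but member left the plan
        ],
        "ade_reported": [True, True, False, True],
        "censored": [False, False, False, True],
    }
)
example_labels = label_unsuccessful(example)
print(example_labels.tolist())  # [1, 0, 0, 0]
```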

---

### 3. Available Data
Each record describes a single therapy: one member, a therapy start date, and a therapy end date. The target and claims datasets are each split into a training file and a holdout file.
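
Because the modeling unit is the therapy, claim-level rows generally need to be rolled up to one row per therapy before modeling. A minimal sketch of that roll-up follows, assuming the claims tables carry the same `therapy_id` key that appears in the target tables; that shared key and the helper name `claims_per_therapy` are assumptions for illustration.

```python
import pandas as pd


def claims_per_therapy(target: pd.DataFrame, claims: pd.DataFrame, name: str) -> pd.DataFrame:
    """Attach a per-therapy claim count to the target table."""
    # Assumes `claims` has a `therapy_id` column matching the one in `target`.
    counts = claims.groupby("therapy_id").size().rename(name).reset_index()
    merged = target.merge(counts, on="therapy_id", how="left")
    # Therapies with no claims are missing after the left join; count them as 0.
    merged[name] = merged[name].fillna(0).astype(int)
    return merged


toy_target = pd.DataFrame({"therapy_id": ["A-TAGRISSO-1", "B-TAGRISSO-1", "C-TAGRISSO-1"]})
toy_claims = pd.DataFrame({"therapy_id": ["A-TAGRISSO-1", "A-TAGRISSO-1", "B-TAGRISSO-1"]})
toy_features = claims_per_therapy(toy_target, toy_claims, "n_med_claims")
print(toy_features["n_med_claims"].tolist())  # [2, 1, 0]
```

A left join keeps every therapy in the target table, so therapies without any claims still receive a feature row.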

#### 3.1 File Descriptions
- **Target**: `target_train.csv` (1232 records), `target_holdout.csv` (420 records)
- **Medical Claims**: `medclms_train.csv` (100159 records), `medclms_holdout.csv` (23232 records)
- **Pharmacy Claims**: `rxclms_train.csv` (32133 records), `rxclms_holdout.csv` (6670 records)
- **Data Dictionary**: `data_dictionary.csv` (49 records)
- **Race Code Descriptions**: `race_cd_desc.csv` (7 records)
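
The record counts listed above can double as a quick integrity check after loading. The sketch below assumes the loaded tables sit in a dictionary keyed by file name without the `.csv` extension, and that "records" means data rows excluding the header; the helper name `find_row_count_mismatches` is hypothetical.

```python
import pandas as pd

# Record counts as listed in the file descriptions.
EXPECTED_ROWS = {
    "target_train": 1232,
    "target_holdout": 420,
    "medclms_train": 100159,
    "medclms_holdout": 23232,
    "rxclms_train": 32133,
    "rxclms_holdout": 6670,
    "data_dictionary": 49,
    "race_cd_desc": 7,
}


def find_row_count_mismatches(datasets, expected=EXPECTED_ROWS):
    """Return {name: (expected_rows, actual_rows)} for missing or mis-sized tables."""
    mismatches = {}
    for name, n_expected in expected.items():
        if name not in datasets:
            mismatches[name] = (n_expected, None)  # table was never loaded
        elif len(datasets[name]) != n_expected:
            mismatches[name] = (n_expected, len(datasets[name]))
    return mismatches


# Toy check against a subset of the expected counts.
toy_datasets = {"race_cd_desc": pd.DataFrame({"race_cd": range(7)})}
toy_expected = {"race_cd_desc": 7, "data_dictionary": 49}
toy_mismatches = find_row_count_mismatches(toy_datasets, toy_expected)
print(toy_mismatches)  # {'data_dictionary': (49, None)}
```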

---

**Important Note**: For Round 1 submissions, you must submit an ID, a score, and a rank for every ID in the `target_holdout.csv` file.
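
One way to assemble such a file from model scores is sketched below. The column names `ID`, `SCORE`, and `RANK`, and the convention that rank 1 goes to the highest score, are assumptions; check the official submission instructions before relying on them.

```python
import pandas as pd


def build_submission(ids, scores) -> pd.DataFrame:
    """Build a table with one row per ID, its score, and a unique rank."""
    submission = pd.DataFrame({"ID": list(ids), "SCORE": list(scores)})
    # Assumed convention: rank 1 is the highest score; ties are broken by row order.
    submission["RANK"] = (
        submission["SCORE"].rank(method="first", ascending=False).astype(int)
    )
    return submission.sort_values("RANK").reset_index(drop=True)


toy_submission = build_submission([101, 102, 103], [0.2, 0.9, 0.5])
print(toy_submission["ID"].tolist())    # [102, 103, 101]
print(toy_submission["RANK"].tolist())  # [1, 2, 3]
```

Using `method="first"` guarantees that every ID receives a distinct rank even when scores tie.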

# 2023 Humana Mays Healthcare Analytics Case Competition Code
### Import necessary libraries

In [4]:
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [5]:
# Function to load multiple datasets from a given path
def load_data(base_path):
    """Read each competition CSV under base_path; return a dict mapping file stem to DataFrame."""
    datasets = {}
    # Loop through each dataset file and load it into a Pandas DataFrame
    for filename in ['data_dictionary', 'target_holdout', 'target_train',
                     'medclms_holdout', 'medclms_train', 'rxclms_holdout', 'rxclms_train',
                     'race_cd_desc']:
        datasets[filename] = pd.read_csv(f"{base_path}/{filename}.csv")
    return datasets

In [6]:
# Step 1: Load Data
# Define the path where the data files are stored 
base_path = "/Users/brocktbennett/GitHub/Project Data/2023_TAMU_competition_data"

In [7]:
# Call function to load datasets into a dictionary 
datasets = load_data(base_path)

### Examine the loaded datasets to verify that they loaded correctly

In [10]:
# Examine the loaded datasets
# Loop through each loaded dataset and print its head and shape 
for name, df in datasets.items(): 
    print(f"Dataset: {name}")
    print(f"Shape: {df.shape}")
    print(f"First few rows:\n{df.head()}\n{'-'*40}")

Dataset: data_dictionary
Shape: (49, 3)
First few rows:
                field                                         definition  \
0                  id            Person Identifier - unique for a member   
1  therapy_start_date   The date of the member's first fill of Tagrisso.   
2    therapy_end_date  The date the member runs out of their supply o...   
3      tgt_ade_dc_ind  An indicator for whether this person meets the...   
4             race_cd                       a numeric indicator for race   

       table  
0  target_df  
1  target_df  
2  target_df  
3  target_df  
4  target_df  
----------------------------------------
Dataset: target_holdout
Shape: (420, 8)
First few rows:
           id             therapy_id            therapy_start_date  race_cd  \
0  1018450235  1018450235-TAGRISSO-1  2022-05-23T00:00:00.000+0000      5.0   
1  1032849118  1032849118-TAGRISSO-1  2020-01-22T00:00:00.000+0000      1.0   
2  1044251683  1044251683-TAGRISSO-1  2020-09-25T00:00:00.000+0