# Task description:

You need to predict new diseases for patients.

The training data contains medical information for the 2018-2019 period and labels for new diseases discovered in 2019.

You are required to predict new diseases for 2020 using the data for 2018-2019.

Background
Hierarchical condition category (HCC) coding relies on ICD-10 codes to assign risk scores to patients. Each HCC groups one or more ICD-10 codes. Along with demographic factors (such as age and gender), insurance companies use HCC coding to assign each patient a risk adjustment factor, so predicting HCC codes is important for evaluating patients' health risk.

Every HCC can be mapped to several ICDs, but not every ICD has a corresponding HCC: the HCC-to-ICD relationship is one-to-many.

We provide the ICD history of patients so that their net new HCC codes for the next year can be predicted.
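
The mapping described above can be pictured as a lookup from ICD code to HCC in which some ICD codes have no entry. The sketch below is illustrative only: the codes in `ICD_TO_HCC` are made-up placeholders, the helper names `hcc_set` and `net_new_hcc` are hypothetical, and reading "net new" as "present next year but absent the year before" is an assumption.

```python
# Hypothetical ICD -> HCC lookup; the keys and values are placeholders,
# not real codes from the competition data.
ICD_TO_HCC = {
    "ICD_A1": 19,
    "ICD_A2": 19,  # several ICD codes may share one HCC
    "ICD_B1": 85,
    # "ICD_C1" is deliberately absent: not every ICD code has an HCC
}

def hcc_set(icd_codes, icd_to_hcc=ICD_TO_HCC):
    """Return the set of HCC categories implied by a list of ICD codes."""
    return {icd_to_hcc[code] for code in icd_codes if code in icd_to_hcc}

def net_new_hcc(prev_year_icds, next_year_icds):
    """HCCs seen next year that were not seen in the previous year (assumed definition)."""
    return hcc_set(next_year_icds) - hcc_set(prev_year_icds)

print(net_new_hcc(["ICD_A1", "ICD_C1"], ["ICD_A2", "ICD_B1"]))  # {85}
```

Here HCC 19 is not counted as new because the member already had an ICD code mapping to it in the previous year.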

# Imports

In [None]:
import numpy as np 
import pandas as pd 
import pickle
import string
import re
import os

from tqdm.notebook import tqdm
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from scipy.spatial.distance import cosine
from datetime import datetime

import xgboost as xgb
from joblib import dump, load
from sklearn.multioutput import MultiOutputClassifier
from skmultilearn.model_selection import iterative_train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve, precision_score, recall_score

data_path = '/kaggle/input/eastwood-and-cleef-ml-disease/drive-download-20210929T123943Z-001/'

In [None]:
# use a context manager so the file handle is closed after loading
with open("/kaggle/input/icd-decode-dict/icd_decode_di.pickle", "rb") as f:
    icd_decode_di = pickle.load(f)
print(len(icd_decode_di))
print(f"Get sample value: {icd_decode_di['K8020']}")

# Plan: 

1. Create statistical features using numerical data. 
2. Create embedding features using text data. 
3. Generate features for train (2018) and for test (2019).  
4. Split the data into training and validation datasets.  
5. Train 30 xgboost models using training data for 2018 and labels for 2019.  
6. Choose a threshold on the validation dataset.  
7. Calculate the mean AUC on the validation dataset (65%).  
8. Train on full dataset using data for 2018 and labels for 2019.  
9. Make predictions for 2020 using selected thresholds and features generated for 2019.  
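
Step 6 does not say how the threshold is chosen. One common option, shown here as a dependency-free sketch, is to pick for each label the score cutoff that maximizes F1 on the validation set. The F1 criterion and the names `f1_at_threshold` and `choose_threshold` are assumptions for illustration, not necessarily what this notebook does.

```python
def f1_at_threshold(y_true, y_score, thr):
    """F1 score when every score >= thr is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < thr and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def choose_threshold(y_true, y_score):
    """Try every observed score as a cutoff and keep the one with the best F1."""
    candidates = sorted(set(y_score))
    return max(candidates, key=lambda thr: f1_at_threshold(y_true, y_score, thr))

# toy validation labels and predicted probabilities for a single HCC label
y_val = [0, 0, 1, 1]
p_val = [0.1, 0.4, 0.35, 0.8]
print(choose_threshold(y_val, p_val))  # 0.35
```

With 30 separate models, the same routine would be run once per label, giving 30 thresholds to reuse at prediction time.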

ToDo:
1. Create dynamic features: capture how features have changed over time (for example, compare the first and last quarter).   
2. Mean target encoding: (1) select the most common ICD codes for each HCC and generate embeddings for them; (2) calculate distances between members' ICD embeddings and (1).   
3. Tune xgboost params (especially for the largest HCC classes). 1st step (with a small n_estimators): max_depth, subsample, regularization; 2nd step (with a larger n_estimators): add early stopping and tune the learning rate.  
4. Use native xgboost instead of sklearn's MultiOutputClassifier (the problem is that xgboost used through sklearn cannot work with None values).
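
As a rough illustration of the first ToDo item, a dynamic feature could be the change in event counts between the first and last quarter of a year. The sketch below uses only the standard library; the function names and the `(member_id, date)` input shape are hypothetical, and in the notebook the same idea would more likely be a pandas `groupby`.

```python
from collections import defaultdict
from datetime import datetime

def quarter_of(date_str):
    """Quarter (1-4) of a 'YYYY-MM-DD' date string."""
    return (datetime.strptime(date_str, "%Y-%m-%d").month - 1) // 3 + 1

def quarter_delta(events):
    """events: iterable of (member_id, date_str).
    Return {member_id: events in Q4 minus events in Q1}."""
    counts = defaultdict(lambda: {1: 0, 4: 0})
    for member_id, date_str in events:
        per_member = counts[member_id]  # creates a zeroed entry on first sight
        quarter = quarter_of(date_str)
        if quarter in per_member:       # only Q1 and Q4 are tracked
            per_member[quarter] += 1
    return {m: c[4] - c[1] for m, c in counts.items()}

events = [
    ("m1", "2018-01-15"), ("m1", "2018-11-02"), ("m1", "2018-12-20"),
    ("m2", "2018-02-01"), ("m2", "2018-06-10"),
]
print(quarter_delta(events))  # {'m1': 1, 'm2': -1}
```

A positive value means the member had more events late in the year than early in the year.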

# Tools

Helper functions that are used many times in the notebook

### To work with DFs

In [None]:
def calc_date_diff(date_li):
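    '''calculate difference between two string dates (YYYY-MM-DD), return days;
    return None for malformed dates or a negative difference'''
    
    # raw string avoids invalid escape sequence warnings
    regexp = r'\d{4}-\d{2}-\d{2}'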
    date_flag = all(
        [isinstance(c, str) and re.match(regexp,c) for c in date_li])
    if date_flag: 
        date_li = [datetime.strptime(c,'%Y-%m-%d') for c in date_li]
        date_diff = (date_li[1] - date_li[0]).days
        return date_diff if date_diff >= 0 else None
    else: 
        return None

def test_calc_date_diff(): 
    dates = ('2018-03-03', '2018-04-03')
    date_diff = calc_date_diff(dates)
    print(f'test_calc_date_diff:\ndates:{dates}\ndiff:{date_diff}')
    return True

print(test_calc_date_diff())

def calc_ndays_in_hosptl(df, clmn1:str, clmn2:str, new_col:str):
    '''calculate  difference between two string dates in df, 
    return df'''
    
    df = df.copy()
    df[new_col] = df[[clmn1, clmn2]].apply(
        lambda x: calc_date_diff(x), axis= 1).values
    return df

def get_year_slice_df(
    df,
    date_start:str,
    date_end:str,
    year:str): 
    '''get df slice in a specific year'''
    return df[df[[date_start, date_end]].apply(
        lambda x: all(c.split('-')[0] == year for c in x), axis= 1)].reset_index(drop= True)

### To work with text

In [None]:
pnct = string.punctuation
stop_words = set(stopwords.words('english'))

def remove_punctuation(word, punctuation= pnct):
    # escape the punctuation so characters like ], \ and - are treated literally
    return re.sub(f'[{re.escape(punctuation)}]','', word) if isinstance(word, str) else None

def test_remove_punctuation(): 
    print('test remove punctuation')
    t = '!anemia,'
    print(t,'->', remove_punctuation(t))
    #return True

def prepare_text(text, stop_words= stop_words):
    if isinstance(text, str):
        # convert to lowercase
        text = text.lower()
        # tokenize
        text = word_tokenize(text)
        # remove punctuation
        text = [remove_punctuation(w) for w in text]
        # filter out stop words and empty tokens
        text = [w for w in text if w not in stop_words and w != '']
        # stemming is skipped for now: it can lose info in domain-specific texts
        
        # for now, keep only alphabetic tokens
        text = [w for w in text if w.isalpha()]
        
        return text
    else:
        return None

def test_prepare_text():
    print('test prepare text:')
    t = 'Displaced bicondylar fracture of left tibia, subsequent encounter for closed fracture with routine healing'
    print(t, '->\n', prepare_text(t))
    return True

print(test_prepare_text())

In [None]:
# can be one function 
def decode_and_collect_text_corpus(codes,
                                   decodes_di): 
    '''get codes, decode them to text, then tokenize and clean text.
       return list of lists
    '''
    num_missed = 0
    corpus = []
    for code in tqdm(codes): 
        decode = decodes_di.get(code)
        if pd.isnull(decode): 
            num_missed+=1
            continue
        if isinstance(decode, str): 
            corpus.append(prepare_text(decode))
    print(num_missed, len(corpus))
    return corpus

# not now:
def dis_decode_and_collect_text_corpus(
    code_df,
    decodes_di): 
    '''get codes and chronic flag,
       decode codes to text, then tokenize and clean text, add chronic flag.
       return list of lists
    '''
    num_missed= 0
    corpus = []
    for code, chronic_flag in tqdm(code_df.values): 
        decode = decodes_di.get(code)
        if pd.isnull(decode): 
            num_missed+=1
            continue
        if isinstance(decode, str): 
            decode = prepare_text(decode)
            if isinstance(chronic_flag, str) and chronic_flag != 'Unknown': 
                # add chronic flag to prepared text of icd 
                decode = decode + [chronic_flag.lower()]
            corpus.append(decode)
    print(num_missed, len(corpus))
    return corpus

In [None]:
def get_adm_icd_embedding(icd_code,
                          decodes_di,
                          w2v_model,
                          ): 
    '''decode an icd code to text and average the word vectors of its tokens.
    return the mean embedding, or None if nothing could be embedded'''
    
    decode = decodes_di.get(icd_code)
    embngs_li = []
    if isinstance(decode, str):
        clnd_tokens_li = prepare_text(decode)
        for w in clnd_tokens_li: 
            try: 
                embngs_li.append(w2v_model.wv[w])
            except KeyError: 
                # token is missing from the model vocabulary, skip it
                continue
        if len(embngs_li) > 0:
            return sum(embngs_li)/len(embngs_li)
    return None

# Feature generation

## Icd_code to embeddings
Get ICD codes from the 2018 admissions and disease data to train the word2vec model.  
Better embeddings were achieved using only the admissions data. 
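
The embedding features in this section are built by averaging twice: word vectors are averaged into one vector per ICD description, and those vectors are averaged into one vector per member. The sketch below shows that pooling on toy data without any dependencies. `TOY_WORD_VECTORS` and the function names are hypothetical stand-ins for the trained word2vec model and the notebook's own helpers.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors; None if the list is empty."""
    if not vectors:
        return None
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

# hypothetical 2-dimensional word vectors standing in for a trained model
TOY_WORD_VECTORS = {
    "fracture": [1.0, 0.0],
    "tibia": [0.0, 1.0],
    "left": [1.0, 1.0],
}

def description_embedding(tokens, word_vectors=TOY_WORD_VECTORS):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    return mean_vector([word_vectors[t] for t in tokens if t in word_vectors])

def member_embedding(descriptions, word_vectors=TOY_WORD_VECTORS):
    """Average the description embeddings of all codes that could be embedded."""
    code_vectors = [description_embedding(d, word_vectors) for d in descriptions]
    return mean_vector([v for v in code_vectors if v is not None])

print(description_embedding(["fracture", "tibia", "unknown"]))        # [0.5, 0.5]
print(member_embedding([["fracture"], ["tibia", "left"], ["unknown"]]))  # [0.75, 0.5]
```

Because each level is a plain mean, a member with many codes and a member with one code both end up with a vector of the same size, which is what lets the result be used as fixed-width model features.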

In [None]:
admissions_df = pd.read_csv(
    os.path.join(data_path, 'admissions_data.csv'))
print(admissions_df.shape)
admissions_df.head(2)

### Data from 2018

In [None]:
admissions_2018df = get_year_slice_df(
    df= admissions_df, 
    date_start= 'admission_date',
    date_end= 'discharge_date', 
    year= '2018'
  )

# save to encode features
admission_types_set = set(
    admissions_2018df.admission_type.unique()) 
print(f'Number admission types: {len(admission_types_set)}')

print(admissions_2018df.shape,
      admissions_2018df.member_id.nunique())
admissions_2018df.head()

### adm_icd_corpus

In [None]:
# collect unique icd_codes
adm_icd_code_2018 = admissions_2018df.icd_code.dropna().apply(
    lambda x: x.split(';')).explode().drop_duplicates()
print(f'Unique icd_code in 2018: {len(adm_icd_code_2018)}')

In [None]:
# icd to tokens:
adm_icd_corpus_2018 = decode_and_collect_text_corpus(
    adm_icd_code_2018,
    icd_decode_di
    )
print(adm_icd_corpus_2018[0])

In [None]:
# train model only on adm_icd_corpus_2018
adm_icd_wv_model = Word2Vec(
    adm_icd_corpus_2018,
    vector_size= 100,
    min_count=1,
    workers= 5,
    window = 2,
    seed = 1
)

In [None]:
# Quick check: the distance between v1 and v2 should be smaller than between v1 and v3: 
v1 = get_adm_icd_embedding(
    'A048',
    icd_decode_di,
    adm_icd_wv_model)

v2 = get_adm_icd_embedding(
    'A049',
    icd_decode_di,
    adm_icd_wv_model)

v3 = get_adm_icd_embedding(
    'Z9989',
    icd_decode_di,
    adm_icd_wv_model)

print(
prepare_text(icd_decode_di['A048']),'\n',
prepare_text(icd_decode_di['A049']),'\n', 
prepare_text(icd_decode_di['Z9989'])
)

# unspecified - drop it
print(cosine(v1,v2), cosine(v1, v3))

## Features from admissions data

In [None]:
# funcs to get embdngs 
def combine_many_embdngs(
    x,
    decode_di:dict,
    vw_model,
    dim:int
): 
    '''get average vector from icd_code embeddings. 
    return averaged vector'''
    
    embdngs_li = []
    for c in x: 
        if not pd.isnull(c):
            embdng = get_adm_icd_embedding(
                c,
                decode_di, 
                vw_model) 
            if isinstance(embdng, np.ndarray): 
                embdngs_li.append(embdng)
    l = len(embdngs_li)
    if l > 0:
        return sum(embdngs_li)/l
    else: 
        return [None] * dim

def generate_icd_embng_for_df(
    df, 
    id_column:str,
    emb_clmn:str,
    decode_di:dict, 
    wv_model, # trained model to get embdngs
    dim:int, # vector size
    emb_feature_name:str):
    '''group by id and generate averaged embedding. 
    return df with id and embeddings'''

    df = df[[id_column, emb_clmn]].dropna().copy()

    df[emb_clmn] = df[emb_clmn].str.split(';')
    df = df.explode(emb_clmn)

    df.drop_duplicates(
        inplace= True)
    df.reset_index(
        drop= True,
        inplace= True)
    
    # group by id
    df = df.groupby(id_column)[emb_clmn].apply(
        lambda x: combine_many_embdngs(
            x,
            decode_di,
            wv_model,
            dim
        )
    ).copy()
    
    df = df.reset_index()
    # adding embeddings columns
    new_cols = [f'{emb_feature_name}_{i}' for i in range(1,dim + 1)]
    df[new_cols] = pd.DataFrame(
        df[emb_clmn].to_list(), 
        index = df.index).values
    
    # some codes have no embeddings
    prev_num_rows = df.shape[0]
    df.dropna(inplace= True)
    # share of rows dropped because no embedding could be built
    print(f'Share of Null embdngs: {(1 - df.shape[0]/max(prev_num_rows, 1)):.2f}')
    
    df.drop(emb_clmn, axis= 1, inplace= True)
    df.reset_index(drop= True, inplace= True)
    return df

In [None]:
adm_icd_embngs_2018df = generate_icd_embng_for_df(
    admissions_2018df,
    id_column = 'member_id',
    emb_clmn = 'icd_code', 
    decode_di = icd_decode_di, 
    wv_model = adm_icd_wv_model,
    dim = 100, 
    emb_feature_name = 'icd_embdng'
)

In [None]:
class Generate_Adm_Features: 
    '''Generate features from admissions data. return df'''
    
    def __init__(self,
                 df,
                 date_strat:str,
                 date_end:str,
                 year:str, 
                 adm_types_set:set, 
                 id_column: str, 
                 emb_clmn: str, 
                 decode_di: dict, 
                 wv_model, 
                 dim,
                 emb_feature_name
                ): 

        self.df = df
        self.date_strat = date_strat
        self.date_end = date_end
        self.year = year
        self.adm_types_set = adm_types_set 
        self.id_column = id_column
        self.emb_clmn = emb_clmn
        self.decode_di = decode_di
        self.wv_model = wv_model
        self.dim = dim
        self.emb_feature_name = emb_feature_name
        
    def adm_hosptl_day_agg_rules(self, x): 
        '''rules to aggregate days in hospital'''       

        d = {
            'num_apprnc_in_hosptl'   : x['admission_date'].count(),
            'sum_readmissions'       : x['enc_is_readmission'].sum(), 
            'sum_admission_transfer' : x['enc_admission_transfer'].sum(),
            
            'max_day_in_hosptl'    : x['day_in_hosptl'].max(),
            'mean_day_in_hosptl'   : x['day_in_hosptl'].mean(),
        }
        return pd.Series(d, index= d.keys())

    def adm_agg_hosptl_day_df(self, df):
        '''aggregate days in hospital'''
        
        return df.groupby('member_id').apply(
            self.adm_hosptl_day_agg_rules).reset_index()
    
    def generate_features(self): 
        
        # get data from specific date
        self.df = get_year_slice_df(
            self.df,
            self.date_strat,
            self.date_end, 
            self.year
          )
        
        if self.df.shape[0] == 0: 
            print(f'df shape equals 0')
            return self.df

        # calculate new feature - day in hospital
        self.df = calc_ndays_in_hosptl(
            df = self.df,
            clmn1 = self.date_strat,
            clmn2 = self.date_end,
            new_col = 'day_in_hosptl'
            )
        
        # encode words
        self.df['enc_is_readmission'] = self.df.is_readmission.replace(
            {'No':0, 'Yes':1})
        self.df['enc_admission_transfer'] = self.df.er_to_inp_admission_transfer.replace(
            {'N':0,'Y':1})
        
        # exclude unknown admission_type 
        self.df.admission_type = self.df.admission_type.apply(
            lambda x: x if x in self.adm_types_set else None)
        
        # get dummies from admission_type
        dummy_df = pd.get_dummies(
            self.df.set_index('member_id').admission_type, 
            prefix= 'adm_type').reset_index()
        
        # aggregate dummies
        dummy_df = dummy_df.groupby('member_id').sum()
    
        # generate and agg embeddings: 
        icd_embngs_df = generate_icd_embng_for_df(
            self.df,
            self.id_column,
            self.emb_clmn, 
            self.decode_di, 
            self.wv_model,
            self.dim, 
            self.emb_feature_name
        )
    
        # aggregate features 
        self.df = self.adm_agg_hosptl_day_df(self.df)
    
        # combine features 
        print(f'Dummies shape matched: {dummy_df.shape[0] == self.df.shape[0]}')
        
        self.df = self.df.merge(
            dummy_df,
            on = 'member_id', 
            how = 'inner'
        )
        
        self.df = self.df.merge(
            icd_embngs_df, 
            on = 'member_id', 
            how= 'left'
        )
        print(f'Unique id: {self.df.member_id.nunique()}, shape: {self.df.shape}')
        return self.df

In [None]:
adm_features_2018df = Generate_Adm_Features(
    df = admissions_df, 
    date_strat = 'admission_date', 
    date_end='discharge_date',
    year = '2018',
    adm_types_set = admission_types_set,
    id_column = 'member_id',
    emb_clmn = 'icd_code', 
    decode_di = icd_decode_di, 
    wv_model = adm_icd_wv_model,
    dim = 100, 
    emb_feature_name = 'icd_embdng').generate_features()

In [None]:
adm_features_2019df = Generate_Adm_Features(
    df = admissions_df, 
    date_strat = 'admission_date', 
    date_end='discharge_date',
    year = '2019',
    adm_types_set = admission_types_set,
    id_column = 'member_id',
    emb_clmn = 'icd_code', 
    decode_di = icd_decode_di, 
    wv_model = adm_icd_wv_model,
    dim = 100, 
    emb_feature_name = 'icd_embdng').generate_features()

In [None]:
print(set(adm_features_2018df.columns) - 
      set(adm_features_2019df.columns))

In [None]:
adm_features_2018df.to_pickle(
    '/kaggle/working/adm_features_2018df.pickle')

In [None]:
adm_features_2019df.to_pickle(
    '/kaggle/working/adm_features_2019df.pickle')

# diseas_df

1. icd_chronic_or_acute differs from hcc_chronic_or_acute in about 3% of rows.

In [None]:
diseas_df = pd.read_csv(
    os.path.join(data_path, 'disease_data.csv'))
print(diseas_df.shape, diseas_df.member_id.nunique())
diseas_df.head(2)

In [None]:
diseas_df.hcc_chronic_or_acute.value_counts()

In [None]:
hcc_cat_unique = list(diseas_df[
    diseas_df.year_of_service == 2018].hcc_code.dropna().unique())
print(len(hcc_cat_unique))

In [None]:
class Generate_Hcc_Features():
    '''Generate hcc features for specific date.
    return df''' 
        
    def __init__(
        self, 
        df, 
        year:int, 
        clmn_id:str,
        hcc_catgs:list,
        embdng_clmn:str, 
        decode_di:dict,
        wv_model, 
        dim:int,
    ):
        
        self.df = df
        self.year = year
        self.clmn_id = clmn_id
        self.hcc_catgs = hcc_catgs
        self.embdng_clmn = embdng_clmn
        self.decode_di = decode_di
        self.wv_model = wv_model
        self.dim = dim        
        
    def left_merg(self, df1, df2):
        return df1.merge(df2, on = self.clmn_id, how= 'left')    
    
    def stats_icc_hcc(self, df): 
        d = {
            'dis_num_icd': df['icd_code'].dropna().count(),
            'dis_num_unique_icd' : df['icd_code'].nunique(),

            'dis_num_hcc': df['hcc_code'].dropna().count(),
            'dis_num_unique_hcc' : df['hcc_code'].nunique()
        }
        return pd.Series(d, index= d.keys())

    def agg_stats_icc_hcc(self, df):
        '''aggregate icc hcc info'''

        return df.groupby(self.clmn_id).apply(
            self.stats_icc_hcc).reset_index()
            
    def count_nflag(
        self, 
        df, 
        clmn_id:str,
        clmn_flag:str, 
        flag:str, 
        new_clmn_name:str):
        '''Count n flag for id. return df'''
        
        df = df[[clmn_id, clmn_flag]].copy()
        
        df = df.groupby(clmn_id)[clmn_flag].apply(
            lambda x: sum([True if isinstance(c,str) 
                                and (c == flag) else False for c in x]))
        df = df.reset_index()
        df.columns = [clmn_id, new_clmn_name]
        return df
    
    def generate_hcc_dummies(
        self,
        df):
        '''generate dummies variables'''
        
        df = df.copy()
        # get catgs only from train
        df['hcc_code'] = df['hcc_code'].apply(
            lambda x: x if x in self.hcc_catgs else None)

        df['hcc_code'] = df['hcc_code'].apply(
            lambda x: str(int(x)) if not pd.isnull(x) else x)

        hcc_dumm_df = pd.get_dummies(
                    df.set_index(self.clmn_id).hcc_code, 
                    prefix= 'dis_hcc_code').reset_index()
        
        hcc_dumm_df = hcc_dumm_df.groupby(
            self.clmn_id).sum().reset_index()
        
        return hcc_dumm_df
    
    def generate_features(self): 
        
        # get df on a specific date
        self.df = self.df[
            self.df.year_of_service == self.year].copy()
        
        self.df.reset_index(drop= True, inplace= True)
        
        # count statistics on icd, hcc codes
        stats_code_df = self.agg_stats_icc_hcc(self.df)
            
        # count flag
        n_hcc_chronic_df = self.count_nflag(
            df = self.df, 
            clmn_id = self.clmn_id,
            clmn_flag = 'hcc_chronic_or_acute',
            flag = 'Chronic',
            new_clmn_name = 'dis_hcc_nchronic'
        )     
        
        n_hcc_acute_df = self.count_nflag(
            df = self.df, 
            clmn_id = self.clmn_id,
            clmn_flag = 'hcc_chronic_or_acute',
            flag = 'Acute',
            new_clmn_name = 'dis_hcc_acute'
        )
        
        # hcc dummies
        hcc_dumm_df = self.generate_hcc_dummies(self.df)
        
        # icd embngs
        dis_icd_embngs_df = generate_icd_embng_for_df(
            df = self.df,
            id_column = self.clmn_id,
            emb_clmn = self.embdng_clmn, 
            decode_di = self.decode_di, 
            wv_model =  self.wv_model,
            dim = self.dim, 
            emb_feature_name = 'dis_icd_embdng'
        )             
        
        features_df = self.df[[self.clmn_id]].copy()
        features_df.drop_duplicates(inplace= True)
        features_df.reset_index(inplace= True, drop= True)
        
        # merge features
        features_df = self.left_merg(
            features_df,
            stats_code_df)
        
        features_df = self.left_merg(
            features_df,
            n_hcc_chronic_df)
        
        features_df = self.left_merg(
            features_df,
            n_hcc_acute_df)
        
        features_df = self.left_merg(
            features_df, 
            hcc_dumm_df)
        
        features_df = self.left_merg(
            features_df, 
            dis_icd_embngs_df)
        
        return features_df

In [None]:
dis_features_2018df = Generate_Hcc_Features(
    df = diseas_df, 
    year = 2018, 
    clmn_id = 'member_id', 
    hcc_catgs = hcc_cat_unique,
    embdng_clmn = 'icd_code',
    decode_di = icd_decode_di, 
    wv_model = adm_icd_wv_model, 
    dim = 100).generate_features()

In [None]:
print(dis_features_2018df.shape)
dis_features_2018df.head(3)

In [None]:
dis_features_2019df = Generate_Hcc_Features(
    df = diseas_df, 
    year = 2019, 
    clmn_id = 'member_id', 
    hcc_catgs = hcc_cat_unique,
    embdng_clmn = 'icd_code',
    decode_di = icd_decode_di, 
    wv_model = adm_icd_wv_model, 
    dim = 100).generate_features()

In [None]:
print(dis_features_2019df.shape)
dis_features_2019df.head()

In [None]:
print(
    set(dis_features_2018df.columns) - set(dis_features_2019df.columns))
print(
      set(dis_features_2019df.columns) - set(dis_features_2018df.columns))

In [None]:
dis_features_2019df['dis_hcc_code_160'] = 0
dis_features_2019df = dis_features_2019df[
    dis_features_2018df.columns].copy()

In [None]:
print(
    set(dis_features_2018df.columns) - set(dis_features_2019df.columns))
print(
      set(dis_features_2019df.columns) - set(dis_features_2018df.columns))

In [None]:
dis_features_2018df.to_pickle(
    '/kaggle/working/dis_features_2018df.pickle')

In [None]:
dis_features_2019df.to_pickle(
    '/kaggle/working/dis_features_2019df.pickle')

## labs_data

In [None]:
labs_df = pd.read_csv(os.path.join(data_path, 'labs_data.csv'))
print(labs_df.shape)
labs_df.head()

In [None]:
class Generate_Labs_Features():
    '''Generate labs features.return df'''
    
    def __init__(
        self,
        df, 
        year: int
    ):
        self.df = df
        self.year = year
    
    def correct_test_value(self, df):
        '''Initial value seems to be from other scale. return df.'''
        df = df.copy()
        df['correct_res_val'] = df.result_value.apply(
            lambda x: x/1000 if not pd.isnull(x) else x).values
        return df

    def flag_res_val(self, x):
        '''Convert res test val to category'''
        
        res_val, lower_bound, upper_bound = x
        if res_val <= lower_bound:
            return 'lower'
        elif res_val >= upper_bound: 
            return 'higher'
        elif lower_bound < res_val < upper_bound:
            return 'normal'
        else:
            return None
    
    def add_test_result_clmn(self, df):
        '''Add new column with category for test value'''

        df = df.copy()
        df['test_result'] = df[[
            'correct_res_val', 
            'normal_low_value_numeric',
            'normal_high_value_numeric']].apply(
            lambda x: self.flag_res_val(x), axis= 1)
        return df

    def test_agg_rules(self, df): 
        d = {
            'n_test' : df['test_result'].count(),
            'n_normal_test': sum(df['test_result'] == 'normal'),
            'n_higher_test': sum(df['test_result'] == 'higher'),
            'n_lower_test': sum(df['test_result'] == 'lower'),
        }
        return pd.Series(d, index= d.keys())

    def agg_test_results(self, df):
        '''Aggregate test results. return df'''
        
        return df.groupby('member_id').apply(
            self.test_agg_rules).reset_index()    
        
    def generate_features(self): 
        
        # get specific date:
        self.df = self.df[self.df.date_of_service.apply(
            lambda x: True if isinstance(x,str) and 
                      x.split('-')[0] == str(self.year) else False)].copy()
        
        self.df.reset_index(
            drop= True,
            inplace= True
        )
        
        # add corrected test val
        self.df = self.correct_test_value(self.df)
        
        # add test result categories clmn:
        self.df = self.add_test_result_clmn(self.df)
        
        # aggregate 
        self.df = self.agg_test_results(self.df)
        print(f'Unique member_id: {self.df.member_id.nunique()}')
        return self.df

In [None]:
labs_feature_2018_df = Generate_Labs_Features(
    labs_df, 
    2018).generate_features()
print(labs_feature_2018_df.shape)
labs_feature_2018_df.head(3)

In [None]:
labs_feature_2019_df = Generate_Labs_Features(
    labs_df, 
    2019).generate_features()
print(labs_feature_2019_df.shape)
labs_feature_2019_df.head(3)

In [None]:
labs_feature_2018_df.to_pickle(
    '/kaggle/working/labs_feature_2018_df.pickle')
labs_feature_2019_df.to_pickle(
    '/kaggle/working/labs_feature_2019_df.pickle')

## patients_data

In [None]:
patients_df =  pd.read_csv(os.path.join(data_path, 'patients_data.csv'))
print(patients_df.shape)
patients_df.head()

In [None]:
patients_df.insurance_type.value_counts()

In [None]:
# dictionaries to decode:
unique_ins_comp = patients_df.insurance_company.unique()
l = len(unique_ins_comp)
ins_di = {key:val for key, val in zip(
    unique_ins_comp, [f'ins_cmp_{i}' for i in range(l)])}
print(ins_di,'\n')

unique_ins_type = patients_df.insurance_type.unique()
l_type = len(unique_ins_type)
ins_type_di = {key:val for key, val in zip(
    unique_ins_type, [i for i in range(l_type)])}
print(ins_type_di,'\n')

unique_ins_pbp_type = patients_df.pbp_type.dropna().unique()
l_pbp = len(unique_ins_pbp_type)
print(l_pbp)
pbp_type_di = {key:val for key, val in zip(
    unique_ins_pbp_type, [f'pbp_type_{i}' for i in range(l_pbp)])}
print(pbp_type_di)

In [None]:
for clmn in patients_df.columns[1:]:
    print(f'{clmn} nunique vals: {patients_df[clmn].nunique()}')

In [None]:
class Generate_Patients_features():
    '''Generate patients features. return df'''
    
    def __init__(
        self, 
        df,):
        self.df = df
        
    def generate(self):
        
        self.df['encode_patient_gender'] = self.df.patient_gender.apply(
            lambda x: 1 if x == 'M' else 0)
        
        self.df['enc_ins_comp'] = self.df.insurance_company.apply(
            lambda x: ins_di.get(x))
        
        self.df['enc_ins_type'] = self.df.insurance_type.apply(
            lambda x: ins_type_di.get(x))
        
        self.df['enc_ins_pbp'] = self.df.pbp_type.apply(
            lambda x: pbp_type_di.get(x))
        
        ins_comp_dummy = pd.get_dummies(
            self.df.set_index('member_id').enc_ins_comp).reset_index()
        
        pbp_dummy = pd.get_dummies(
            self.df.set_index('member_id').enc_ins_pbp).reset_index()        
                
        feature_clmns = ['member_id',
                         'encode_patient_gender',
                         'enc_ins_type'
                        ]
        # merge dummies
        feature_df = self.df[feature_clmns].merge(
            ins_comp_dummy, 
            on = 'member_id'
        )
        
        feature_df = feature_df.merge(
            pbp_dummy, 
            on = 'member_id'
        )
            
        return feature_df

In [None]:
patients_feature_df = Generate_Patients_features(
    patients_df).generate()
print(patients_feature_df.shape)
patients_feature_df.head(3)

In [None]:
patients_feature_df.to_pickle(
    '/kaggle/working/patients_feature_df.pickle')

## prescription_data

In [None]:
prescriptions_df = pd.read_csv(os.path.join(
    data_path, 'prescriptions_data.csv'))
print(prescriptions_df.shape)

prescriptions_df.days_supply= prescriptions_df.days_supply.apply(
    lambda x: int(x) if not pd.isnull(x) else x)

# extract the short drug name (first word of drug_name)
prescriptions_df['short_drug_name'] = prescriptions_df.drug_name.apply(
    lambda x: x.split()[0] if isinstance(x,str) else x)

# clean days_supply; metric_quantity
prescriptions_df['clnd_days_supply'] = prescriptions_df.days_supply.apply(
    lambda x: None if pd.isnull(x) or (x < 0) or (x > 364) else x)
prescriptions_df['clnd_metric_quantity'] = prescriptions_df.metric_quantity.apply(
    lambda x: None if pd.isnull(x) or (x < 0) or (x > 364) else x)

# check percent of Null
ds_null1 = prescriptions_df.days_supply.isnull().mean()
ds_null2 = prescriptions_df.clnd_days_supply.isnull().mean()

mq_null1 = prescriptions_df.metric_quantity.isnull().mean()
mq_null2 = prescriptions_df.clnd_metric_quantity.isnull().mean()

print('Columns changes after cleaning:')
print(f'days_supply percent of null:{ds_null1:.2} -> {ds_null2:.2}')
print(f'metric_quantity percent of null:{mq_null1:.2} -> {mq_null2:.2}')

prescriptions_df.head()

### Drug names embeddings

In [None]:
# get data from 2018 
prescriptions_2018df = prescriptions_df[prescriptions_df.date_filled.apply(
    lambda x: x.split('-')[0] == '2018')].reset_index(drop= True).copy()

# train model on 2018 drugs name corpus
short_drug_names = prescriptions_2018df.short_drug_name.drop_duplicates().dropna()
short_drug_names = short_drug_names.values
drug_corpus = [row.split(',') for row in short_drug_names]

# drugs name to vectors: 
drugs_model = Word2Vec(
    drug_corpus,vector_size= 50,
    min_count=1,
    workers=3,
    window = 1)

# embeddings dict
drg_emdngs = {}
missed_drugs= []
for short_name in short_drug_names:
    try:
        drg_emdngs[short_name] = drugs_model.wv[short_name]
    except KeyError: 
        # name is missing from the model vocabulary
        missed_drugs.append(short_name)
        continue
print(f'Missed drugs names {len(missed_drugs)}')

In [None]:
# Example
v1 = drugs_model.wv['CYCLOSPORINE']
v2 = drugs_model.wv['OMEPRAZOLE']
v3 = drugs_model.wv['METOPROLOL']

w1 = drugs_model.wv['OMEPRAZOLE']
w2 = drugs_model.wv['ALLOPURINOL']
w3 = drugs_model.wv['METOPROLOL']

k1 = drugs_model.wv['ATORVASTATIN']
k2 = drugs_model.wv['CYCLOBENZAPRINE']

v = (v1+v2+v3)/3
w = (w1+w2+w3)/3
k = (k1+k2)/2

# the distance is larger when the sets of drugs do not overlap:
print(cosine(v,w))
print(cosine(k,v))
print(cosine(k,w))

In [None]:
def get_embdngs(word_list, embdngs_di:dict): 
    all_embdngs = []
    for w in word_list: 
        embdng = embdngs_di.get(w, None)
        if isinstance(embdng, np.ndarray): 
            all_embdngs.append(embdng)
    l = len(all_embdngs) 
    if l == 0: 
        return None
    else: 
        return sum(all_embdngs)/l
    
def make_drug_embdngs_df(df,
                         id_clmn:str,
                         short_name_clmn:str,
                         embdngs_di:dict):
    
    # make embeddings for all clients drugs:
    df= df.groupby(id_clmn)[short_name_clmn].apply(
        lambda x: get_embdngs(x, embdngs_di))
    
    # serie to df:
    df = df.reset_index().dropna()
    df.reset_index(drop= True, inplace= True)
    
    # add new clmns (one per embedding dimension, vector_size= 50):
    new_cols = [f'drag_name_{i}' for i in range(1,51)]
    df[new_cols] = pd.DataFrame(
        df[short_name_clmn].to_list(),  
        index = df.index)
    
    # drop clmn with lists
    df.drop(
        short_name_clmn,
        axis=1,
        inplace= True
    )
    return df

In [None]:
drug_emdb_2018df = make_drug_embdngs_df(
    prescriptions_2018df,
    'member_id',
    'short_drug_name',
    drg_emdngs)

print(drug_emdb_2018df.shape)
drug_emdb_2018df.head(3)

In [None]:
# get data from 2019 
prescriptions_2019df = prescriptions_df[prescriptions_df.date_filled.apply(
    lambda x: x.split('-')[0] == '2019')].reset_index(drop= True).copy()

drug_emdb_2019df = make_drug_embdngs_df(
    prescriptions_2019df,
    'member_id',
    'short_drug_name',
    drg_emdngs)

print(drug_emdb_2019df.shape)
drug_emdb_2019df.head(3)

### Calculate statistics on other features

ToDo: 
* dynamic features: how a member's drugs and their amounts change over time
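
A rough sketch of how this ToDo could be approached, assuming per-member yearly aggregates are already available as two DataFrames. The function `make_dynamic_features`, the `delta_`/`ratio_` column prefixes, and the toy numbers are all hypothetical and not part of the pipeline.

```python
import numpy as np
import pandas as pd

def make_dynamic_features(prev_df, cur_df, id_clmn='member_id'):
    """Year-over-year change of every shared feature (hypothetical helper)."""
    merged = prev_df.merge(cur_df, on=id_clmn, how='outer',
                           suffixes=('_prev', '_cur'))
    out = merged[[id_clmn]].copy()
    feature_clmns = [c for c in prev_df.columns
                     if c != id_clmn and c in cur_df.columns]
    for c in feature_clmns:
        prev, cur = merged[f'{c}_prev'], merged[f'{c}_cur']
        out[f'delta_{c}'] = cur - prev
        # a zero previous value gives NaN instead of a division by zero
        out[f'ratio_{c}'] = cur / prev.replace(0, np.nan)
    return out

# Toy aggregates for two consecutive years (made-up numbers).
stats_prev = pd.DataFrame({'member_id': [1, 2],
                           'count_ndc_number': [4, 10],
                           'mean_days_supply': [30.0, 90.0]})
stats_cur = pd.DataFrame({'member_id': [1, 3],
                          'count_ndc_number': [6, 2],
                          'mean_days_supply': [30.0, 15.0]})

dyn_df = make_dynamic_features(stats_prev, stats_cur)
print(dyn_df)
```

Members present in only one of the two years get NaN for every dynamic feature, which the later `fillna(0)` would turn into zeros; whether that is the right encoding is an open design question.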

In [None]:
class make_drugs_statistic_features_df:
    '''Per-member statistical features built from prescription data.'''
    def __init__(self, df): 
        self.df = df
        
        # total statistics
        
        # days_supply
        self.ttl_ds_mean = self.df.groupby(
            'member_id').clnd_days_supply.mean().mean()
        
        self.ttl_ds_median = self.df.clnd_days_supply.median()
        
        self.ttl_ds_max = self.df.clnd_days_supply.max()
        
        # metric_quantity
        self.ttl_mq_mean = self.df.groupby(
            'member_id').clnd_metric_quantity.mean().mean()
        
        self.ttl_mq_median = self.df.clnd_metric_quantity.median()
        
        self.ttl_mq_max = self.df.clnd_metric_quantity.max()
        
        # ndc_number
        self.ttl_ndc_number_mean = self.df.groupby(
            'member_id').ndc_number.count().mean()
        
        self.ttl_ndc_number_unique_mean = self.df.groupby(
            'member_id').ndc_number.nunique().mean()
        
        
    def new_to_relif_ratio(self, clmn):
        vc = clmn.dropna().value_counts()
        if set(['N','R']) == set(vc.index):
            return vc['N']/vc['R'] if vc['R'] > 0 else None
        else: 
            return None

    def agg_rules(self, x): 
        d= {
            # ndc_number
            'count_ndc_number'       : x['ndc_number'].count(),
            'nunique_ndc_number'     : x['ndc_number'].nunique(),
            'rel_count_ndc_number'   : x['ndc_number'].count()/self.ttl_ndc_number_mean,
            'rel_nunique_ndc_number' : x['ndc_number'].nunique()/self.ttl_ndc_number_unique_mean,

            # days_supply 
            'mean_days_supply'  : x['clnd_days_supply'].mean(), # can be variable
            'std_days_supply'   : x['clnd_days_supply'].std(),
            'median_days_supply': x['clnd_days_supply'].median(),
            'max_days_supply'   : x['clnd_days_supply'].max(),
            'min_days_supply'   : x['clnd_days_supply'].min(),

            # relative days_supply to total 
            'rel_mean_days_supply'  : x['clnd_days_supply'].mean()/self.ttl_ds_mean,
            'rel_median_days_supply': x['clnd_days_supply'].median()/self.ttl_ds_median,
            'rel_max_days_supply'   : x['clnd_days_supply'].max()/self.ttl_ds_max,

            # metric_quantity
            'mean_metric_quantity'  : x['clnd_metric_quantity'].mean(),
            'std_metric_quantity'   : x['clnd_metric_quantity'].std(),
            'median_metric_quantity': x['clnd_metric_quantity'].median(),
            'max_metric_quantity'   : x['clnd_metric_quantity'].max(),
            'min_metric_quantity'   : x['clnd_metric_quantity'].min(),
            
            # relative metric_quantity to total 
            'rel_mean_metric_quantity'  : x['clnd_metric_quantity'].mean()/self.ttl_mq_mean,
            'rel_median_metric_quantity': x['clnd_metric_quantity'].median()/self.ttl_mq_median,
            'rel_max_metric_quantity'   : x['clnd_metric_quantity'].max()/self.ttl_mq_max,
            
            # new_or_refill
            'ratio_new_to_refill'       : self.new_to_relif_ratio(x['new_or_refill'])
        }

        return pd.Series(d, index= d.keys())
    
    def group_and_agg_df(self):
        return self.df.groupby('member_id').apply(
            self.agg_rules).reset_index()

In [None]:
stats_presrp_2018df = make_drugs_statistic_features_df(
    prescriptions_2018df).group_and_agg_df()
print(stats_presrp_2018df.shape)
stats_presrp_2018df.head(3)

In [None]:
prescr_features_2018df = prescriptions_2018df[
    ['member_id']].drop_duplicates().reset_index(drop= True).copy()
print(prescr_features_2018df.shape)

prescr_features_2018df = prescr_features_2018df.merge(
    drug_emdb_2018df, 
    on = 'member_id', 
    how= 'left'
)
print(prescr_features_2018df.shape)

prescr_features_2018df = prescr_features_2018df.merge(
    stats_presrp_2018df, 
    on = 'member_id', 
    how= 'left'
)
print(prescr_features_2018df.shape)

In [None]:
stats_presrp_2019df = make_drugs_statistic_features_df(
    prescriptions_2019df).group_and_agg_df()
print(stats_presrp_2019df.shape)
stats_presrp_2019df.head(3)

In [None]:
prescr_features_2019df = prescriptions_2019df[
    ['member_id']].drop_duplicates().reset_index(drop= True).copy()
print(prescr_features_2019df.shape)

prescr_features_2019df = prescr_features_2019df.merge(
    drug_emdb_2019df, 
    on = 'member_id', 
    how= 'left'
)
print(prescr_features_2019df.shape)

prescr_features_2019df = prescr_features_2019df.merge(
    stats_presrp_2019df, 
    on = 'member_id', 
    how= 'left'
)
print(prescr_features_2019df.shape)

In [None]:
prescr_features_2018df.to_pickle(
    '/kaggle/working/prescr_features_2018df.pickle')

In [None]:
prescr_features_2019df.to_pickle(
    '/kaggle/working/prescr_features_2019df.pickle')

## sample_submission

In [None]:
sample_submission_df = pd.read_csv(
    os.path.join(data_path, 'sample_submission.csv'))
print(sample_submission_df.shape)
sample_submission_df.head()

In [None]:
# Check output format:
req_columns = sample_submission_df.columns
print(f'Required columns: {list(req_columns)},\nnumber of columns: {len(req_columns)}')
print(sample_submission_df.member_id.nunique())

## train_labels

In [None]:
train_labels_df = pd.read_csv(
    os.path.join(data_path, 'train_labels.csv'))
print(train_labels_df.shape)
train_labels_df.head()

In [None]:
print(f'member_id nunique: {train_labels_df.member_id.nunique()}')

# Collect train data set

In [None]:
patients_df = pd.read_pickle(
    '/kaggle/working/patients_feature_df.pickle')

adm_features_2018df = pd.read_pickle(
    '/kaggle/working/adm_features_2018df.pickle')

dis_features_2018df = pd.read_pickle(
    '/kaggle/working/dis_features_2018df.pickle')

labs_feature_2018df = pd.read_pickle(
    '/kaggle/working/labs_feature_2018_df.pickle')

prescr_features_2018df = pd.read_pickle(
    '/kaggle/working/prescr_features_2018df.pickle')

In [None]:
train_df = train_labels_df[['member_id']].copy()

for df in [adm_features_2018df, dis_features_2018df, 
           labs_feature_2018df, prescr_features_2018df]:
    
    train_df = train_df.merge(
        df, 
        on = 'member_id', 
        how= 'left'
    )
    print(train_df.shape)
    
# train_df.to_pickle('/kaggle/working/train_df.pickle')

In [None]:
adm_features_2019df = pd.read_pickle(
    '/kaggle/working/adm_features_2019df.pickle')

dis_features_2019df = pd.read_pickle(
    '/kaggle/working/dis_features_2019df.pickle')

labs_feature_2019df = pd.read_pickle(
    '/kaggle/working/labs_feature_2019_df.pickle')

prescr_features_2019df = pd.read_pickle(
    '/kaggle/working/prescr_features_2019df.pickle')

In [None]:
test_df = train_labels_df[['member_id']].copy()

for df in [adm_features_2019df, dis_features_2019df, 
           labs_feature_2019df, prescr_features_2019df]:
    
    test_df = test_df.merge(
        df, 
        on = 'member_id', 
        how= 'left'
    )
    print(test_df.shape)
    
print(all(test_df.columns == train_df.columns))
# test_df.to_pickle('/kaggle/working/test_df.pickle')

In [None]:
train_df = pd.read_pickle(
    '/kaggle/input/prepared-features/train_df.pickle')
print(train_df.shape)

test_df = pd.read_pickle(
    '/kaggle/input/prepared-features/test_df.pickle')
print(test_df.shape)

train_labels_df = pd.read_csv(
    '/kaggle/input/eastwood-and-cleef-ml-disease/drive-download-20210929T123943Z-001/train_labels.csv')

In [None]:
print(all(train_labels_df.member_id == train_df.member_id))
print(all(train_labels_df.member_id == test_df.member_id))

# Train

In [None]:
train_df.head(1)

In [None]:
X_train, y_train, X_val, y_val =  iterative_train_test_split(
    train_df.drop('member_id', axis= 1).fillna(0).values,
    train_labels_df.drop('member_id', axis= 1).values,
    test_size = 0.15)

print(f'Train shape: {X_train.shape},Val shape {X_val.shape}')

In [None]:
def class_num_dis(x):
    df = pd.DataFrame([[i for i in (range(30))],
        [sum(x[:,i]) for i in range(30)]]).T
    df.columns = ['dis_clmn_number', 'num_dis']
    print(f'Zero classes:{df[df.num_dis == 0].dis_clmn_number.values}')
    return df

In [None]:
class_num_dis(y_train).head()

In [None]:
class_num_dis(y_val).head()

In [None]:
xgb_estimator = xgb.XGBClassifier(
    n_estimators = 300,
    eval_metric= 'logloss',
    use_label_encoder= False
) 

multilabel_model = MultiOutputClassifier(
    xgb_estimator,
)

multilabel_model.fit(
    X_train,
    y_train
)

print('Model trained')

In [None]:
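# make sure the output directory exists before saving
os.makedirs('/kaggle/working/wth_embdngs', exist_ok= True)
dump(
    multilabel_model,
    '/kaggle/working/wth_embdngs/model_emdng_ftrs.joblib')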

In [None]:
val_proba_df = multilabel_model.predict_proba(
    X_val)

val_proba_df = np.transpose(
    [pred[:,1] for pred in val_proba_df])

print(val_proba_df.shape)

## Select threshold on validation set

In [None]:
def custom_auc_score(y_true, y_pred, thr):
    '''ROC AUC of the hard labels obtained by thresholding y_pred at thr.'''
    if pd.isnull(thr):
        return None
    label_pred = [1 if p > thr else 0 for p in y_pred]
    try: 
        return roc_auc_score(y_true, label_pred)
    except ValueError: 
        # e.g. y_true contains a single class
        return None
    
def return_top_auc_row(
    class_number:int, 
    y_val_,
    val_proba_df_):
    '''Return [class_number, precision, recall, threshold, auc]
    for the threshold with the highest custom AUC.'''
    
    precision_, recall_, thresholds_ = precision_recall_curve(
        y_val_,
        val_proba_df_)

    pr_df = pd.DataFrame([
        precision_, 
        recall_, 
        thresholds_
    ]).T
    
    pr_df.columns= ['precision', 'recall', 'threshold']

    pr_df['custom_auc'] = pr_df['threshold'].apply(
        lambda x: custom_auc_score(
            y_val_,
            val_proba_df_,
            x))

    pr_df.sort_values(
        by= 'custom_auc',
        ascending= False, 
        inplace= True
    )
    
    res_li = [class_number] + list(pr_df.iloc[0])
    print(f'Auc:{res_li[-1]}')
    return res_li
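
For hard 0/1 predictions, ROC AUC reduces to the mean of the true-positive and true-negative rates (balanced accuracy), so the search above effectively picks the threshold with the best balanced accuracy. The sketch below illustrates this with plain NumPy on made-up scores; `balanced_accuracy` and `best_threshold` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of TPR and TNR; None when only one class is present."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == 0
    if pos.sum() == 0 or neg.sum() == 0:
        return None
    tpr = (y_pred[pos] == 1).mean()
    tnr = (y_pred[neg] == 0).mean()
    return (tpr + tnr) / 2

def best_threshold(y_true, proba):
    """Try every distinct score as a threshold and keep the best one."""
    best_thr, best_score = None, -1.0
    for thr in np.unique(proba):
        # strict '>' mirrors the comparison used in the notebook
        score = balanced_accuracy(y_true, (proba > thr).astype(int))
        if score is not None and score > best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

# Made-up labels and predicted probabilities for one class.
y_toy = np.array([0, 0, 0, 1, 0, 1])
p_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.80])

thr, score = best_threshold(y_toy, p_toy)
print(thr, score)
```

Because the threshold is tuned and scored on the same validation rows, the resulting score is likely optimistic for rare classes.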

In [None]:
ftr_df_li = []
for i in tqdm(range(30)): 
    cur_y_val, cur_prob_val = y_val[:,i], val_proba_df[:,i]
    ftr_df_li.append(
        return_top_auc_row(
            i,
            cur_y_val,
            cur_prob_val)
    )

In [None]:
thr_df = pd.DataFrame(ftr_df_li)
thr_df.columns = ['dis_number', 'precision', 
                  'recall', 'threshold', 'top_auc']
print(thr_df.shape)
thr_df.head()

In [None]:
np.mean(thr_df.top_auc)

In [None]:
thr_df.to_pickle(
    '/kaggle/working/wth_embdngs/thr_df.pickle'
)

## Training on all train DF

In [None]:
# train on the full train DF:
xgb_estimator = xgb.XGBClassifier(
    n_estimators = 300,
    eval_metric='logloss',
    use_label_encoder=False,
) 

full_multilabel_model = MultiOutputClassifier(
    xgb_estimator
)

full_multilabel_model.fit(
    train_df.drop('member_id', axis= 1).fillna(0).values,
    train_labels_df.drop('member_id', axis= 1).values,
)

print('Model trained')

In [None]:
dump(
    full_multilabel_model,
    '/kaggle/working/wth_embdngs/full_model_no_embdngs.joblib') 

## Make prediction

In [None]:
test_proba = full_multilabel_model.predict_proba(
    test_df.drop('member_id', axis= 1).fillna(0).values
)

test_proba = np.transpose(
    [pred[:,1] for pred in test_proba])

In [None]:
test_proba_df = pd.DataFrame(test_proba)
test_proba_df.columns = list(train_labels_df.columns)[1:]
test_proba_df.head(1)

In [None]:
res_df = pd.DataFrame()
for clmn, thr_ in zip(test_proba_df.columns, thr_df.threshold): 
    res_df[clmn] = test_proba_df[clmn].apply(
        lambda x: 1 if x > thr_ else 0)
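
The same per-class thresholding can also be written as one broadcast comparison. A minimal sketch on made-up numbers, assuming the threshold vector is ordered like the probability columns; `proba_toy` and `thr_toy` are illustrative names only. A NaN threshold compares as False, so that class is predicted 0 for every row, as in the loop above.

```python
import numpy as np

# Rows are members, columns are classes (made-up probabilities).
proba_toy = np.array([[0.10, 0.70, 0.40],
                      [0.60, 0.20, 0.90]])

# One threshold per class; NaN stands for "no threshold could be selected".
thr_toy = np.array([0.50, 0.50, np.nan])

# Broadcasting compares every column with its own threshold.
pred_toy = (proba_toy > thr_toy).astype(int)
print(pred_toy)
```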

In [None]:
res_df['member_id'] = test_df.member_id.values
res_df = res_df[train_labels_df.columns].copy()
print(res_df.shape)
res_df.head()

In [None]:
res_df.to_csv(
    '/kaggle/working/emb_res_df.csv', 
    index= False
)