## Overview
This notebook presents exploratory data analysis of the initial dataset with _all_ features, as well as of the final compressed features (used for submission).

For reference, [this OHDSI website](https://athena.ohdsi.org/) provides a lookup dictionary for all possible concept_ids. Initially, we considered every concept_id (i.e. every unique value across a manually identified set of columns) as a candidate feature. This is challenging because 1) there are many unique clinical concepts (>5 million), 2) clinical concepts are often correlated or loosely coupled, and 3) concepts appear very sparsely across datasets (the more niche the condition, drug, or procedure, the sparser the data). Even so, the analysis let us identify the concept_ids most correlated with our target hospitalization label. We combined these results with the concept_ids identified by the NLP analysis and the automated model selection framework to curate a final list of IDs.
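As a rough illustration of the correlation step described above, the following sketch computes the Pearson correlation between each feature column and a binary label using pandas. The column names (`concept_a`, `concept_b`, `concept_c`) and the toy values are hypothetical and only stand in for the real concept counts and hospitalization label; the project's own ETL may compute this differently.

```python
import pandas as pd

# Toy feature matrix: one row per patient, one column per (hypothetical) concept.
features = pd.DataFrame(
    {
        "concept_a": [0, 2, 0, 2],
        "concept_b": [1, 0, 1, 0],
        "concept_c": [1, 1, 0, 0],
    }
)

# Toy binary label (1 = hospitalized); a stand-in for the real target column.
label = pd.Series([0, 1, 0, 1], name="label")

# Pearson correlation of every feature column with the label.
# corrwith aligns on the index, so both objects must share the same row labels.
corr_with_label = features.corrwith(label)
print(corr_with_label)
```

Here `concept_a` rises with the label, `concept_b` falls with it, and `concept_c` is unrelated, so the three correlations come out positive, negative, and zero respectively.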

The "use_all_concepts" ETL generates the counts and discrete "parsed" values (when available) for every concept_id found in the given dataset.

The "use_compressed_concepts" ETL was used for the final submission and generates counts based on an input list of specific concept_ids.
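To make the idea of "counts based on an input list of specific concept_ids" concrete, here is a minimal sketch of one way such a matrix could be built with pandas. The function name `count_concepts_per_person`, the long-format `events` table, and its toy rows are assumptions made for illustration; this is not the code in `use_compressed_concepts`, and it produces raw counts without any scaling.

```python
from typing import Sequence

import pandas as pd


def count_concepts_per_person(events: pd.DataFrame, concept_ids: Sequence[int]) -> pd.DataFrame:
    """Return a person_id x concept_id matrix of occurrence counts (hypothetical helper)."""
    wanted = list(concept_ids)
    # Keep only the rows whose concept_id is on the input list.
    selected = events[events["concept_id"].isin(wanted)]
    # Cross-tabulate: one row per person, one column per concept, cell = count.
    counts = pd.crosstab(selected["person_id"], selected["concept_id"])
    # Reindex so every person and every requested concept appears, with 0 where absent.
    all_people = sorted(events["person_id"].unique())
    return counts.reindex(index=all_people, columns=wanted, fill_value=0)


# Toy long-format events table: one row per recorded concept occurrence.
events = pd.DataFrame(
    {
        "person_id": [0, 0, 0, 1, 2, 2],
        "concept_id": [30437, 30437, 312437, 312437, 999, 30437],
    }
)

matrix = count_concepts_per_person(events, [30437, 312437])
print(matrix)
```

Concept `999` is not on the input list, so it is dropped, while person `2` still gets a row because the reindex step keeps every patient.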

## Data Inspection (All concept_ids)
This analysis was used in conjunction with the separate NLP analysis.

In [1]:
import pandas as pd
import use_all_concepts.etl as etl

PATH = etl.TRAIN_PATH

In [2]:
# Summary _does not_ include 'Parsed' values
summary_df = etl.generate_concept_summary(PATH)
summary_df

,concept_id,unique_pid_count,avg_per_pid,concept_name,from_table
0,44818702,1251,144.921663,,
1,3028553,1246,13.002408,,
2,37208405,1244,14.748392,History of alcohol use,observation
3,3035995,1243,8.670153,Alkaline phosphatase [Enzymatic activity/volum...,measurement
4,3000905,1240,9.941129,Leukocytes [#/volume] in Blood by Automated count,measurement
...,...,...,...,...,...
1506,2765743,1,1.000000,,
1507,2002747,1,1.000000,Other partial resection of small intestine,procedure_occurrence
1508,2765672,1,1.000000,,
1509,2003287,1,1.000000,Endoscopic sphincterotomy and papillotomy,procedure_occurrence
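
The `unique_pid_count` and `avg_per_pid` columns above can be read as "how many patients have this concept" and "how often it occurs per patient who has it". That reading is an assumption, as is everything in the sketch below, which shows how such a summary could be derived from a long-format table with a pandas `groupby`; the `events` table and its rows are toy data, and `etl.generate_concept_summary` may work differently.

```python
import pandas as pd

# Toy long-format events table: one row per recorded concept occurrence.
events = pd.DataFrame(
    {
        "person_id": [0, 0, 0, 1, 2, 2, 2],
        "concept_id": [3035995, 3035995, 3000905, 3035995, 3035995, 3035995, 3035995],
    }
)

# Number of occurrences for each (concept, person) pair.
per_person = events.groupby(["concept_id", "person_id"]).size()

# Per concept: how many distinct patients have it, and the mean count among them.
summary = (
    per_person.groupby(level="concept_id")
    .agg(unique_pid_count="count", avg_per_pid="mean")
    .sort_values("unique_pid_count", ascending=False)
    .reset_index()
)
print(summary)
```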


In [3]:
# This Concept-Feature map _does_ include 'Parsed' values
cf_map, corr_series = etl.get_highest_corr_concept_feature_id_map_and_corr_series(PATH)
cf_map_as_df = pd.DataFrame(cf_map.values(), index=cf_map.keys())
cf_map_as_df.columns = ['feature_id']
cf_map_as_df.index.rename('concept_id', inplace=True)
cf_map_as_df


concept_id,feature_id
2741240,0
3043697,1
4239779,2
2617452,3
2787823,4
...,...
4075892611,2244
3005033111,2245
3037110111,2246
3026156111,2247


In [4]:
concept_to_correlation_df = pd.DataFrame(corr_series)
concept_to_correlation_df.insert(1, 'abs_pearson_corr', abs(corr_series))
concept_to_correlation_df = concept_to_correlation_df.reset_index().rename(columns={'index':'concept_id', 'status': 'pearson_corr'})
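# sort_values returns a new DataFrame, so keep the result; the output below shows the unsorted frame
sorted_by_abs_corr_df = concept_to_correlation_df.sort_values('abs_pearson_corr', ascending=False)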
concept_to_correlation_df

,concept_id,pearson_corr,abs_pearson_corr
0,380378,-0.009100,0.009100
1,75909,0.043308,0.043308
2,438409,0.029498,0.029498
3,435875,-0.037932,0.037932
4,80502,0.002094,0.002094
...,...,...,...
2244,3029187111,0.000000,0.000000
2245,3004254111,0.000000,0.000000
2246,3005755111,0.000000,0.000000
2247,4075831011,0.000000,0.000000
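
One way to turn a correlation table like this into a candidate feature list is to rank concept_ids by absolute correlation and keep the top `k`. The sketch below does that for the five leading rows shown above; the helper name `top_k_by_abs_corr` and the choice of `k` are hypothetical, and the actual selection also drew on the NLP analysis and the model selection framework.

```python
import pandas as pd

# The first five concept_id / pearson_corr pairs from the output above.
corr = pd.Series(
    {380378: -0.009100, 75909: 0.043308, 438409: 0.029498, 435875: -0.037932, 80502: 0.002094},
    name="pearson_corr",
)


def top_k_by_abs_corr(corr: pd.Series, k: int) -> list:
    """Return the k concept_ids with the largest absolute correlation (hypothetical helper)."""
    # Drop undefined correlations (e.g. from constant columns) before ranking.
    ranked = corr.dropna().abs().sort_values(ascending=False)
    return ranked.index[:k].tolist()


top_ids = top_k_by_abs_corr(corr, k=3)
print(top_ids)
```

Ranking on the absolute value keeps strongly negative correlations such as concept `435875`, which a plain descending sort would push to the bottom.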


## Matrix Generation

### Without Feature Compression
The following DataFrame shows the feature matrix built from _all_ possible features. While more detailed, it is less performant given the high sparsity and dimensionality of the dataset.

In [5]:
feature_df = etl.create_feature_df(cf_map, path=PATH)
feature_df

person_id,0,1,2,3,4,5,6,7,8,9,...,2239,2240,2241,2242,2243,2244,2245,2246,2247,2248
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1246,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
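
To quantify the sparsity visible in a matrix like this, one simple check is the fraction of cells that are exactly zero, along with the number of columns that carry any signal at all. The sketch below runs on a small hypothetical stand-in for `feature_df`; the values are toy data, not figures from the real dataset.

```python
import pandas as pd

# Small stand-in for the full feature matrix (toy values).
feature_df = pd.DataFrame(
    [
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 2.0, 0.0, 0.0],
    ],
    index=pd.Index([0, 1, 2], name="person_id"),
)

# Fraction of cells that are exactly zero.
sparsity = float((feature_df.to_numpy() == 0).mean())

# Columns with at least one non-zero entry; all-zero columns carry no information.
informative = feature_df.loc[:, (feature_df != 0).any(axis=0)]

print(f"sparsity: {sparsity:.3f}")
print(f"informative columns: {list(informative.columns)}")
```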


### With Feature Compression
The following DataFrame shows the feature matrix built with the compressed approach (counts of concept_ids as features). This approach performed much better and was used for the final submission.

In [6]:
import use_compressed_concepts.simple_etl as simple_etl

In [7]:
predictors = simple_etl.get_features_from_list()
predictors = predictors.set_index('person_id')
print(f"number of compressed features: {len(predictors.columns)}")
predictors

number of features from id list:  300
N unique condition:  18
N unique drug:  12
N unique device:  0
N unique measurement:  12
N unique observation:  1
N unique procedure:  1
number of compressed features: 44


person_id,30437,133810,196523,312437,376065,378726,380097,436659,437247,437663,...,19133873,19133905,37016349,37119138,40173507,40238886,40481089,44507566,44782429,45768812
0,0.333333,0.0,0.0,0.000000,0.5,0.0,0.5,0.666667,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.25,0.0,0.666667,0.0
1,0.333333,0.5,0.0,0.666667,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.333333,0.0,0.0,0.0,0.00,0.0,0.000000,0.0
2,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.00,0.0,0.333333,0.0
3,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,1.0,0.0,0.333333,0.0,0.0,0.0,0.00,0.0,0.000000,0.0
4,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.333333,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.00,0.0,0.333333,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1246,0.000000,0.0,0.0,0.333333,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.25,0.0,0.000000,0.0
1247,0.000000,0.0,0.0,0.333333,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.00,0.0,0.000000,0.0
1248,0.000000,0.0,0.0,0.333333,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.00,0.0,0.000000,0.0
1249,0.333333,0.0,0.0,0.000000,0.0,0.0,0.0,0.666667,0.0,0.000000,...,0.0,0.0,0.000000,0.5,0.0,0.0,0.00,0.0,0.000000,0.0
