# Look at composition of topics from LDA model

provider_type: Hematology/Oncology and Medical Oncology
- filtered out least and most common hcpcs_codes
- only consider in-facility claims
- number of topics = 6
- bene_unique_cnt as value
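
The first filtering step in the list above can be sketched roughly as follows. This is only an illustration, not the preprocessing that produced the filtered data frame: the function name `filter_codes_by_frequency`, the thresholds `min_docs` and `max_frac`, and the toy data are all assumptions, since the actual cutoffs are not stated here.

```python
from collections import Counter

def filter_codes_by_frequency(claims, min_docs=2, max_frac=0.8):
    """Keep codes billed by at least `min_docs` providers and by at most
    `max_frac` of all providers (both thresholds are hypothetical).

    claims: dict mapping npi -> dict of {hcpcs_code: bene_unique_cnt}
    returns: sorted list of the codes that survive the filter
    """
    n_providers = len(claims)
    # document frequency: number of providers who billed each code
    doc_freq = Counter(code for codes in claims.values() for code in codes)
    return sorted(
        code for code, n in doc_freq.items()
        if n >= min_docs and n / n_providers <= max_frac
    )

# toy data with made-up NPIs and counts
toy_claims = {
    "1001": {"J9035": 12, "J1100": 30},
    "1002": {"J9035": 15, "J1100": 22, "J2505": 11},
    "1003": {"J1100": 40, "J2505": 14},
    "1004": {"J1100": 18, "J0881": 25},
}
kept = filter_codes_by_frequency(toy_claims, min_docs=2, max_frac=0.8)
print(kept)
```

In this toy example the code billed by every provider and the code billed by only one provider are both dropped, which is the intent of removing the most and least common codes.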

In [1]:
import psycopg2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import time

from gensim import matutils, models, corpora

%matplotlib inline
sns.set(style="white")

In [2]:
## connect to database
con = psycopg2.connect("dbname='doctordb' user='cathy'")

In [3]:
q = """SELECT npi, nppes_provider_last_org_name, nppes_provider_state, nppes_provider_first_name,
provider_type, hcpcs_code, hcpcs_description, bene_unique_cnt
FROM payments 
WHERE (provider_type='Medical Oncology' OR provider_type='Hematology/Oncology')
AND place_of_service='O'
AND hcpcs_drug_indicator='Y'"""
payments = pd.read_sql_query(q, con=con)

In [4]:
payments.shape

(56147, 8)

In [5]:
payments['provider_type'].unique()

array(['Medical Oncology', 'Hematology/Oncology'], dtype=object)

In [6]:
## how many of each provider type?
payments.drop_duplicates('npi').groupby('provider_type')['npi'].count()

provider_type
Hematology/Oncology    3603
Medical Oncology       1056
Name: npi, dtype: int64

## Read in filtered data frame

In [7]:
## read the filtered provider-by-hcpcs_code matrix from file
by_npi = pd.read_csv("11f_by_npi_reduced_medical_hematology_oncolgists.csv", index_col=0)
by_npi.index = by_npi.index.astype(str)

In [8]:
by_npi.shape

(3122, 150)

## Load LDA model (filtered data frame using 6 topics)

In [9]:
## Load lda model
model_fname = "11f_lda_6topics_colsDropped_docsDropped_hema_medi_oncology.model"
ldamodel = models.LdaModel.load(model_fname)

## Interpret the 6 topics in the model

In [12]:
## .values replaces as_matrix(), which was removed in pandas 1.0
corpus = matutils.Dense2Corpus(by_npi.values, documents_columns=False)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

In [13]:
## top 20 words in each topic
topics_matrix = ldamodel.show_topics(formatted=False, num_words=20)

In [14]:
topics_hcpcs = {}

## convert topic word indices to hcpcs_codes => value is a list of tuples of (hcpcs_code, probability)
for topic_id, words in topics_matrix:
    topics_hcpcs[topic_id] = [(by_npi.columns.values[int(word)], prob) for word, prob in words]

In [17]:
# How many drugs are chemo, non-chemo, or neither?
def drug_category_count(drugs):
    """Count drugs per category A, B, C.

    input - list of hcpcs codes for drugs
    returns - lists of drugs in categories A, B and C, plus an error list of
              'J' codes that fall in neither A nor B

    A : J0000 - J8499 --- drugs other than chemo
    B : J8521 - J9999 --- chemo drugs
    C : doesn't begin with 'J'
    """
    A = []
    B = []
    C = []
    errorlist = []  # 'J' codes outside both ranges (J8500 - J8520)
    for d in drugs:
        if d[0].upper() == 'J':
            num = int(d[1:])  # numeric part of the code, parsed once
            if 0 <= num <= 8499:
                A.append(d)
            elif 8521 <= num <= 9999:
                B.append(d)
            else:
                errorlist.append(d)
        else:
            C.append(d)

    print("drugs other than chemo: {0}; \n chemo drugs: {1}; \n drugs that don't start with J: {2}".format(len(A), len(B), len(C)))
    return A, B, C, errorlist

In [26]:
for i in range(len(topics_hcpcs)):
    print('topic ',i)
    nonchemo_drugs, chemo_drugs, other_drugs, error_drugs = drug_category_count(np.array(topics_hcpcs[i])[:,0])
    print('\n')

topic  0
drugs other than chemo: 15; 
 chemo drugs: 0; 
 drugs that don't start with J: 5


topic  1
drugs other than chemo: 18; 
 chemo drugs: 0; 
 drugs that don't start with J: 2


topic  2
drugs other than chemo: 13; 
 chemo drugs: 5; 
 drugs that don't start with J: 2


topic  3
drugs other than chemo: 13; 
 chemo drugs: 0; 
 drugs that don't start with J: 7


topic  4
drugs other than chemo: 13; 
 chemo drugs: 5; 
 drugs that don't start with J: 2


topic  5
drugs other than chemo: 16; 
 chemo drugs: 0; 
 drugs that don't start with J: 4




Topics 2 and 4 are the only topics with any chemo drugs in their top 20 codes (five each). They are also the topics that the most doctors fall under. What if I restrict the feature space to chemo drugs only (there should be just 31 of them)?
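
One way to restrict the feature space is to keep only the columns whose code falls in the chemo range used in `drug_category_count` (J8521 - J9999). The sketch below is an assumption about how that could look, not something that has been run on the data: `is_chemo_code` and `chemo_columns` are hypothetical helpers, and the column names are toy stand-ins for `by_npi.columns`.

```python
def is_chemo_code(code):
    """True for J8521 - J9999, the chemo range used in drug_category_count.

    Assumes codes are one letter followed by four digits; anything else
    is treated as not chemo rather than raising an error.
    """
    code = str(code).upper()
    return (
        len(code) == 5
        and code[0] == "J"
        and code[1:].isdigit()
        and 8521 <= int(code[1:]) <= 9999
    )

def chemo_columns(columns):
    """Return the column names that are chemo drug codes, in original order."""
    return [c for c in columns if is_chemo_code(c)]

# hypothetical column names standing in for by_npi.columns
toy_columns = ["J1100", "J9035", "96413", "J8520", "J9310", "Q2043"]
chemo_cols = chemo_columns(toy_columns)
print(chemo_cols)

# In the notebook this might become something like:
# by_npi_chemo = by_npi[chemo_columns(by_npi.columns)]
# by_npi_chemo = by_npi_chemo[by_npi_chemo.sum(axis=1) > 0]  # drop providers with no chemo claims
```

Checking `len(chemo_columns(by_npi.columns))` would also confirm whether the expected 31 chemo drugs actually survive the earlier filtering.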