## Finding hospitals with expensive lab tests

Let's clone the database and export the data (this might take a sec ;-))

In [None]:
!sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | sudo bash'
!dolt clone dolthub/quest-v3
!dolt sql -q "select billing_code_type, billing_code, billing_code_modifier, reporting_entity_name, negotiated_rate, npi from rate join npi_rate on npi_rate.rate_id = rate.id join code on code.id = rate.code_id join price_metadata on price_metadata.id = rate.price_metadata_id join insurer on insurer.id = rate.insurer_id" -r csv > outputwnpi.csv

Our analysis is done in polars, which (imo) has a cleaner API than pandas.

In [None]:
import polars as pl
from polars import col

In [None]:
save_dir = './quest-v3-redux' # change this to '.' if file saved in this dir

In [None]:
df = pl.read_csv(f'{save_dir}/outputwnpi.csv', infer_schema_length = 10_000)

In [None]:
df = df.unique()

In [None]:
placeholder_prices = [ # "suspected..."
    999999.99,         # Sierra Health...
    699999.99,         # Blue Cross
    99999.99,          # UMR
    88888.88,          # United, Medica, Oxford
    49999.5,
    39999.6,           # Rocky Mountain Health placeholder value
    8720.0,            # Aetna
    811.0,             # Anthem (?)
    458.0,             # Anthem (?)
    140.0,             # ?
    .01,               # Aetna (?) (internal?)
    .02,               # (?)
    0]

df = df.filter(~col('negotiated_rate').is_in(placeholder_prices))

One billing code consistently comes up as surprisingly expensive in this analysis, and I'm not sure why. It's a simple blood draw, coded CPT 36415 or 36416, that is usually bundled with other codes rather than billed separately. I'm going to filter it out for the time being.

In [None]:
df = df.filter(~col('billing_code').is_in(['36416', '36415']))

In [None]:
def compute_means_and_ratios(df: pl.DataFrame) -> pl.DataFrame:
    """Compute the mean negotiated rate for each billing code as a kind of reference value.
    The 'multiplier' is negotiated_rate / rate_mean."""
    return (df
      .with_columns([
          # mean rate per (code type, code, modifier) group, broadcast back to every row
          pl.mean('negotiated_rate').over(['billing_code_type', 'billing_code', 'billing_code_modifier']).alias('rate_mean')
      ]).with_columns([
          (col('negotiated_rate')/col('rate_mean')).alias('multiplier')
      ]))

In [None]:
df = compute_means_and_ratios(df)
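
To make the two new columns concrete, here is a minimal plain-Python sketch of the same computation on made-up rows. The codes and rates are invented, and `add_mean_and_multiplier` is a hypothetical helper, not part of the notebook. It groups by the same three key columns, takes the group mean, and divides each rate by it.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (billing_code_type, billing_code, billing_code_modifier, negotiated_rate)
rows = [
    ("CPT", "80053", None, 10.0),
    ("CPT", "80053", None, 20.0),
    ("CPT", "80053", None, 60.0),
    ("CPT", "85025", None, 8.0),
    ("CPT", "85025", None, 12.0),
]

def add_mean_and_multiplier(rows):
    """Return each row extended with (rate_mean, multiplier) for its code group."""
    groups = defaultdict(list)
    for code_type, code, modifier, rate in rows:
        groups[(code_type, code, modifier)].append(rate)
    means = {key: mean(rates) for key, rates in groups.items()}
    return [
        (code_type, code, modifier, rate,
         means[(code_type, code, modifier)],
         rate / means[(code_type, code, modifier)])
        for code_type, code, modifier, rate in rows
    ]

for row in add_mean_and_multiplier(rows):
    print(row)
```

A multiplier of 2.0 means the rate is twice the mean for that code, which is the scale the later filters are written in.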

Let's get rid of any prices that are so low they might be skewing our mean downwards. This makes our analysis more robust: by making the average price as high as is reasonably possible, we can say more confidently that prices far above it are truly outliers.

In [None]:
df = df.filter(col('multiplier') > .01)

We'll need to compute the means and ratios again.

In [None]:
df = compute_means_and_ratios(df)
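
A small made-up example shows why the second pass matters. Assume five rates for a single code, one of which is a stray near-zero value. Dropping it raises the mean, so every remaining multiplier shrinks. The numbers below are invented purely for illustration.

```python
from statistics import mean

rates = [0.05, 20.0, 20.0, 20.0, 400.0]  # made-up rates for a single billing code

# First pass: mean and multipliers over all rates.
first_mean = mean(rates)
first_multipliers = [r / first_mean for r in rates]

# Drop rates at or below 1% of the mean, mirroring the `multiplier > .01` filter.
kept = [r for r, m in zip(rates, first_multipliers) if m > 0.01]

# Second pass: the mean rises, so the top multiplier falls.
second_mean = mean(kept)
second_multipliers = [r / second_mean for r in kept]

print(f"mean before: {first_mean:.2f}, top multiplier: {max(first_multipliers):.2f}")
print(f"mean after:  {second_mean:.2f}, top multiplier: {max(second_multipliers):.2f}")
```

The recomputed multipliers are the more conservative ones, which is what we want before calling anything an outlier.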

Now let's look at hospitals which appear often in this dataset. We'll filter down to rates which are over 20x the average.

In [None]:
(df
 .filter(col('multiplier') > 20) # filter down to the highest negotiated rates
 .select(['npi', 'billing_code_type', 'billing_code', 'billing_code_modifier',])
 .unique()
 ['npi']                         # get just the NPI numbers
 .value_counts()
 .sort('counts')                 # sort by the NPIs that appear most frequently in this set
 [-10:]                          # take just the last 10
)
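
The chain above is compact, so here is a hedged plain-Python sketch of the same counting logic on made-up rows. It simplifies the key to `(npi, billing_code)` and ignores code type and modifier. The NPIs, codes and multipliers are invented.

```python
from collections import Counter

# Made-up rows: (npi, billing_code, multiplier)
rows = [
    (1111111111, "80053", 25.0),
    (1111111111, "80053", 31.0),   # same code again, e.g. under another plan
    (1111111111, "85025", 22.0),
    (2222222222, "80053", 40.0),
    (2222222222, "85025", 3.0),    # below the threshold, ignored
]

THRESHOLD = 20

# Deduplicate (npi, code) pairs first so each code counts once per provider.
distinct_pairs = {(npi, code) for npi, code, mult in rows if mult > THRESHOLD}
counts = Counter(npi for npi, _ in distinct_pairs)

# Ascending sort, like .sort('counts'), then keep the tail.
top = sorted(counts.items(), key=lambda item: item[1])[-10:]
print(top)
```

Deduplicating before counting is the important step: without it, a provider with many plans for one expensive code would look like a provider with many expensive codes.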

Let's make this easier to understand by joining this with NPPES, the database of NPIs with provider information.

In [None]:
!wget https://download.cms.gov/nppes/NPPES_Data_Dissemination_January_2023.zip
!unzip NPPES_Data_Dissemination_January_2023.zip

In [None]:
npi = pl.scan_csv(f'{save_dir}/npidata_pfile_20050523-20230108.csv', infer_schema_length = 10_000)

In [None]:
npi = npi.select(['NPI', 
            'Provider Organization Name (Legal Business Name)', 
            'Provider Business Practice Location Address City Name', 
            'Provider Business Practice Location Address State Name',])

In [None]:
npi = npi.collect()

In [None]:
exp_hosps = (df
 .filter(col('multiplier') > 20) # filter down to the highest negotiated rates
 .select(['npi', 'billing_code_type', 'billing_code', 'billing_code_modifier',])
 .unique()
 ['npi']                         # get just the NPI numbers
 .value_counts()
 .sort('counts')                 # sort by the NPIs that appear most frequently in this set
 [-10:]                          # take just the last 10
).join(npi, left_on = 'npi', right_on = 'NPI').sort('counts').rename({'counts': 'number_distinct_codes_gt_20_times_mean_rate'})

In [None]:
print(exp_hosps.to_pandas().set_index('npi').to_markdown())

The last hospital, Havasu Regional, has the highest number of lab tests with a cost ratio of over 20x the mean price. We can look more closely at those rates by filtering down the first dataframe.

In [None]:
exp_npi = exp_hosps[-1]['npi'][0]
df.filter(col('npi') == exp_npi).sort('multiplier')

## Finding the codes with the highest dispersion

In [None]:
df.filter(col('billing_code') == '83903').with_columns([
    (pl
     .std('negotiated_rate')
     .over(['billing_code_type', 'billing_code', 'billing_code_modifier'])/col('rate_mean')).alias('dispersion'),
    (col('negotiated_rate')/col('rate_mean')).alias('normalized_rate')
])

In [None]:
disp = (df
 .with_columns([
     (pl
      .std('negotiated_rate')
      .over(['billing_code_type', 'billing_code', 'billing_code_modifier'])
      /col('rate_mean')
     ).alias('dispersion'),
    (col('negotiated_rate')/col('rate_mean')).alias('normalized_rate')
 ])
 .filter(col('dispersion') > 0)
)
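
The `dispersion` column is the standard deviation divided by the mean, i.e. the coefficient of variation. A minimal sketch on made-up rates, assuming the sample standard deviation (which I believe is what polars' `std` uses by default), shows how it separates tightly clustered prices from ones with an extreme value:

```python
from statistics import mean, stdev

def dispersion(rates):
    """Sample standard deviation divided by the mean (coefficient of variation)."""
    return stdev(rates) / mean(rates)

tight = [9.0, 10.0, 11.0]             # made-up rates clustered around the mean
spread = [1.0, 1.0, 1.0, 1.0, 96.0]   # made-up rates with one extreme value

print(round(dispersion(tight), 3))
print(round(dispersion(spread), 3))
```

Because the measure is divided by the mean, it is unitless, so a cheap test and an expensive one can be compared on the same scale.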

In [None]:
import altair as alt

In [None]:
alt.data_transformers.disable_max_rows()

In [None]:
def wordsin(string, wordlist):
    """Return True if any word in wordlist occurs as a substring of string."""
    return any(w in string for w in wordlist)

from functools import partial

In [None]:
def get_insurer(string) -> str:
    string = string.lower()
    pwordsin = partial(wordsin, string)
    if pwordsin(['unitedhealth', 'united health', 'umr']):
        return 'UnitedHealthCare'
    elif pwordsin(['blue cross', 'bluecross', 'blueshield', 'blue shield', 'anthem', 'florida blue']):
        return 'Anthem'
    elif pwordsin(['centene']):
        return 'Centene'
    elif pwordsin(['aetna']):
        return 'Aetna'
    return 'Other'
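
For comparison, the same first-match-wins rule can be written as a lookup table, which makes the grouping easier to audit and extend. This is a sketch only: `INSURER_KEYWORDS` and `normalize_insurer` are hypothetical names, and the entity names at the bottom are illustrative strings, not values taken from the dataset.

```python
# Keyword groups are checked in order; the first group with a match wins.
INSURER_KEYWORDS = [
    ("UnitedHealthCare", ["unitedhealth", "united health", "umr"]),
    ("Anthem", ["blue cross", "bluecross", "blueshield", "blue shield",
                "anthem", "florida blue"]),
    ("Centene", ["centene"]),
    ("Aetna", ["aetna"]),
]

def normalize_insurer(name: str) -> str:
    """Map a reporting entity name to a coarse insurer label by substring match."""
    lowered = name.lower()
    for label, keywords in INSURER_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return label
    return "Other"

# Illustrative entity names, not taken from the dataset.
for name in ["UMR, Inc.", "Florida Blue", "Aetna Life Insurance Company", "Example Health Plan"]:
    print(name, "->", normalize_insurer(name))
```

Substring matching is deliberately loose, so short keywords like `umr` can match inside unrelated names. It's worth eyeballing the resulting table before trusting the colors in the chart.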

In [None]:
source = disp.filter(col('dispersion') > 1.7).filter(col('normalized_rate') < 50)

In [None]:
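# one row per distinct reporting entity, with its normalized insurer label
entity_names = source['reporting_entity_name'].unique().to_list()
insurer_table = pl.DataFrame({
    'reporting_entity_name': entity_names,
    'normalized_name': [get_insurer(name) for name in entity_names],
})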

In [None]:
!wget https://gist.githubusercontent.com/lieldulev/439793dc3c5a6613b661c33d71fdd185/raw/25c3abcc5c24e640a0a5da1ee04198a824bf58fa/cpt4.csv

In [None]:
cpt = pl.read_csv('cpt4.csv')

In [None]:
cpt.columns = ['billing_code', 'label']

In [None]:
source = source.join(insurer_table, on = 'reporting_entity_name').join(cpt, on = 'billing_code').to_pandas()

In [None]:
import altair as alt

# each list entry is one subtitle line; the empty strings add vertical padding
subtitle_text = [
    '',
    'These are the (normalized) rates that insurance companies have negotiated with hospitals for lab tests.',
    'Because lab tests make up only 3-4% of hospital revenues, they can more freely use "strategic pricing"',
    'to extract more in reimbursements from insurance companies. Rates can vary wildly between hospitals.',
    'Some tests come in at more than 20 times the average price.',
    '',
    ]

alt.Chart(source.sample(100)).mark_tick(opacity = 0.5).encode(
    y = alt.Y('label:N', title = None),
    x = alt.X('normalized_rate:Q', title = 'Hospital reimbursement, as multiple of mean negotiated price'),
    color = alt.Color('normalized_name:N', scale=alt.Scale(scheme='category10'), title = 'Insurance Co.'),
).properties(width = 700,
             title = {'text': 'Price dispersion for lab tests',
                      'subtitle': subtitle_text,
                      'anchor': 'start',}
).configure_axis(
    labelFontSize=16,
    titleFontSize=16,
    labelLimit = 200,
).configure_title(
    align = 'right',
    fontSize = 20,
    subtitleFontSize = 15,
).configure_legend(
    labelFontSize = 16,
    titleFontSize = 16,
)
