In [3]:
import polars as pl
from polars import col
pl.__version__

'0.13.43'

### Summary

The major lessons here are:

1. Getting comfortable with polars
1. The power in using `extract` combined with `when/then/otherwise` for data cleaning
1. How to write code that's easy to read for cleaning data that's hard to read

### Cleaning the hospitals dataset -- motivation

The hospitals dataset at DoltHub contains 300M rows of prices. Each price comes with a couple of codes (a primary `code` and an `internal_revenue_code`) that describe how the procedure is billed. Billing codes are the standard for comparing hospital prices. If you know that a hospital charges $2,000 for code `59410`, you can be pretty sure that you (or your insurer) will pay about that price for a normal childbirth.

#### Where the codes are in our dataset

Hospitals, according to the CMS law, are required to post a chargemaster with a hospital-generic code. As they should be: if hospitals only used proprietary coding for their procedures there would be no way to compare prices between hospitals. For an apples-to-apples comparison, we all need to use apples. Typically these apples are the CPT codes ("current procedural terminology") which follow a specific pattern. 

We'll do our best to extract CPT codes (as well as DRG and NDC codes) from the chargemasters as much as we reasonably can. Remember that the CMS law requires that these chargemasters be machine readable. If we can't figure out what the code is supposed to be from the chargemaster, then it's not machine readable. It's junk.


#### How codes were entered

Participants typically put the most generic code in the `code` column. This would be where the CPT code went. If there was a second code, it went in the `internal_revenue_code` column. A third code would have gone in the `code_disambiguator` column, but this turns out to not have been necessary.

However, because the codes are mixed -- some contain pharmacy codes, random codes, etc. -- it is a bit difficult to work with the data. Plus, it's not clear how many of our rows even have valid codes.

So I wrote a short cleaning pipeline to figure out the coding for each row, and see what fraction of rows were coded in some machine readable way.

### Making the pipeline: custom pipes

`pipes` are an easy way to chain data transformations together. In polars/pandas, it looks something like:

```python
out_df = in_df.pipe(transformation, **params).pipe(another_transformation, **params)
```
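
To make the mechanics concrete, here is a minimal sketch of what `.pipe` does. It uses a hypothetical `Table` class (not part of polars or pandas) as a stand-in for a DataFrame, and the two transformations are made up for illustration. The point is only that `obj.pipe(f, **params)` calls `f(obj, **params)` and returns the result, which is what makes the calls chainable.

```python
class Table:
    """Hypothetical stand-in for a DataFrame: wraps a list of dict rows."""

    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        # The whole trick: hand ourselves to the function as its first argument.
        return func(self, *args, **kwargs)


def keep_longer_than(table, n=0):
    # Illustrative transformation: keep rows whose code has more than n chars.
    return Table([row for row in table.rows if len(row['code']) > n])


def add_prefix(table, prefix=''):
    # Illustrative transformation: prepend a label to every code.
    return Table([{**row, 'code': prefix + row['code']} for row in table.rows])


in_table = Table([{'code': '99213'}, {'code': '319'}])
out_table = in_table.pipe(keep_longer_than, n=3).pipe(add_prefix, prefix='CPT ')
print(out_table.rows)
```

Each transformation takes a table and returns a table, so the order of the steps can be read top to bottom.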

The first transformation we'll need is one that extracts the CPT codes from the codes column.

In [935]:
def extract_cpt(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract CPT singlet codes
    '''
    
    # do some transformations
    
    return df

### A test dataframe

We'll use this to test out our transformations.

In [1191]:
testdf = pl.DataFrame({'code': ['CPT/HCPC 0206009123', 'NDC 0011-1231-12', 'CPT 99899 00123', 'MS-DRG 1101112', 
                                'CPT 123', '12341-12345', 'CPT12345', 'NDC 0012312311',
                                'MS123', '319',
                               ]})
print(testdf)

shape: (10, 1)
┌─────────────────────┐
│ code                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ CPT/HCPC 0206009123 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NDC 0011-1231-12    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CPT 99899 00123     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ MS-DRG 1101112      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CPT12345            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NDC 0012312311      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ MS123               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 319                 │
└─────────────────────┘


### The power of regex

We'll hunt for the CPT codes in the codes column. We'll use a few heuristics for finding them. To get those heuristics, we'll need to know what kind of junk is present in the database already, and how to distinguish that junk from valid codes to the greatest possible extent.

#### Heuristic 1: There's only one code type in each cell

If the cell contains the string 'NDC', 'MS-DRG', or 'CDM', it is overwhelmingly likely to *not* contain a CPT code.
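
As a rough sketch of this heuristic outside of polars, the check below uses Python's `re` module as a stand-in for the regex engine polars uses. The function name `may_contain_cpt` is made up for illustration.

```python
import re

# Markers of other code types, taken from the heuristic above.
OTHER_CODE_TYPES = re.compile(r'NDC|MS-DRG|CDM')


def may_contain_cpt(cell: str) -> bool:
    """Heuristic 1: a cell that names another code type is assumed to hold no CPT code."""
    return OTHER_CODE_TYPES.search(cell) is None


for cell in ['NDC 0011-1231-12', 'MS-DRG 1101112', 'CPT 99899 00123']:
    print(cell, '->', may_contain_cpt(cell))
```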

#### Heuristic 2: A specific set of 5-char codes

CPT codes come mainly in two flavors:

1. CPT I: 5 digit numerical code from 00100 to 99499
2. CPT II: 5 digit alphanumerical code like `any('ACEGJKLMQTV')` + 4 digits or 4 digits + `any('ACEGJKLMQTV')`

We will hunt for explicitly these patterns.
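
Here is a minimal sketch of the two flavors as a check on a single token, written with Python's `re` rather than polars. The helper name `looks_like_cpt` is an assumption for illustration; the numeric range and the letter set are the ones quoted above.

```python
import re

LETTERS = 'ACEGJKLMQTV'  # the letter set quoted in the text above


def looks_like_cpt(code: str) -> bool:
    """Check one 5-character token against the two flavors described above."""
    if re.fullmatch(r'\d{5}', code):
        # Numeric flavor: must fall in 00100..99499.
        return 100 <= int(code) <= 99499
    # Alphanumeric flavor: letter + 4 digits, or 4 digits + letter.
    return bool(re.fullmatch(rf'[{LETTERS}]\d{{4}}|\d{{4}}[{LETTERS}]', code))


for token in ['59410', '99899', 'J1234', '1234T', 'B1234', '123']:
    print(token, looks_like_cpt(token))
```

The pipeline below starts from a simplified version of these patterns and tightens them later.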

In [1192]:
def extract_cpt(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract CPT singlet codes
    '''
    
    cpt_pats = ['\d{5}',       # simplified CPT I
                '[A-Z]\d{4}',  # simplified CPT II
                '\d{4}[A-Z]']  # simplified CPT II
    
    # to "extract" a pattern, it needs to be in parentheses
    # and to extract multiple patterns with OR, they need to
    # have a bar char between them. A valid extraction string
    # looks something like 'abc(pat)'
    
    extraction_str = '|'.join([f'({pat})' for pat in cpt_pats])
    
    return df.with_column(
        
        pl.when(
            
            # Heuristic 1
            ~col('code').str.contains('NDC|MS|CDM')
            
        ).then(
            
            # Heuristic 2
            col('code').str.extract(extraction_str)
            
        ).otherwise(None).alias('extracted_cpt')
    )

testdf.pipe(extract_cpt)

code,extracted_cpt
str,str
"""CPT/HCPC 02060...","""02060"""
"""NDC 0011-1231-...",
"""CPT 99899 0012...","""99899"""
"""MS-DRG 1101112...",
"""CPT 123""",
"""12341-12345""","""12341"""
"""CPT12345""",
"""NDC 0012312311...",
"""MS123""",
"""319""",


Right away we see some flaws:

1. Because there can be multiple CPT codes in a cell, we'll need a way to extract them all.
2. Sometimes we're extracting codes that we shouldn't be: the `02060` pulled from the first row is just the start of a longer digit string, not a CPT code.
3. The row `12341-12345`, which looks like a range of CPT codes, should probably be handled separately. Plus, these "ranges" might actually be longer codes with dashes in the middle (unless they're specifically marked as CPT). Let's filter them out for now.
4. 99899 isn't really a valid CPT code. More on that later.
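
One more oddity in the output above is that `CPT12345` comes back empty. A plausible explanation, assuming `str.extract` returns only the first capture group by default, is that the leftmost match starts at the `T` and is made by the second alternative, which leaves group 1 empty. The sketch below reproduces the effect with Python's `re`, which is only a stand-in for the regex engine polars uses.

```python
import re

cpt_pats = [r'\d{5}', r'[A-Z]\d{4}', r'\d{4}[A-Z]']
extraction_str = '|'.join(f'({pat})' for pat in cpt_pats)

m = re.search(extraction_str, 'CPT12345')
print(m.group(0))  # the whole match: 'T1234'
print(m.group(1))  # first capture group: None, because alternative 2 matched
print(m.group(2))  # second capture group: 'T1234'
```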

### Solution addressing (1): `str.extract_all` to the rescue

This polars function, undocumented at the time of writing, lets us extract all the CPT codes in a cell in one stroke.

### Solution addressing (2): regex word boundaries

By putting `\b` on either side of a pattern, we can force our five-char codes to not be part of a larger digit string.
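
A quick sketch of the effect, using Python's `re` as a stand-in for the polars regex engine and a simplified five-digit pattern:

```python
import re

unbounded = r'\d{5}'
bounded = r'\b\d{5}\b'

cell = 'CPT/HCPC 0206009123'
print(re.findall(unbounded, cell))              # ['02060', '09123']
print(re.findall(bounded, cell))                # []
print(re.findall(bounded, 'CPT 99899 00123'))   # ['99899', '00123']
print(re.findall(bounded, 'CPT12345'))          # []
```

The last line shows the trade-off: there is no word boundary between `T` and `1`, so a code glued to its label is no longer found.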

### Solution addressing (3): eliminate dashed digit/char strings with a `str.replace_all`

We'll simply eliminate those patterns entirely before we do regex matching. (We could do this with negative lookaheads, but these aren't supported by the Rust regex engine that polars uses.)
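
The clean-then-match idea can be sketched in plain Python as follows. The helper name `drop_dashed_pairs` is made up for illustration, and `re.sub` plays the role of `str.replace_all`.

```python
import re


def drop_dashed_pairs(cell: str) -> str:
    """Remove 'ddddd-ddddd' strings so they can't be mistaken for two CPT codes."""
    return re.sub(r'\b\d{5}-\d{5}\b', '', cell)


for cell in ['12341-12345', 'CPT 99899 00123', 'NDC 0011-1231-12']:
    cleaned = drop_dashed_pairs(cell)
    print(repr(cell), '->', repr(cleaned), re.findall(r'\b\d{5}\b', cleaned))
```

Only the first cell changes; the NDC string has a different dash layout and passes through untouched.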

In [1193]:
def extract_cpt(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract CPT singlet codes
    '''
    
    cpt_pats = ['\d{5}',       # simplified CPT I
                '[A-Z]\d{4}',  # simplified CPT II
                '\d{4}[A-Z]']  # simplified CPT II
    
    # to "extract" a pattern, it needs to be in parentheses
    # and to extract multiple patterns with OR, they need to
    # have a bar char between them. A valid extraction string
    # looks something like 'abc(pat)'
    
    extraction_str = '|'.join([fr'\b({pat})\b' for pat in cpt_pats])
    
    return df.with_column(
        
        col('code').str.replace_all(r'\b\d{5}-\d{5}\b', '').alias('code_cleaned')
        
    ).with_column(
        
        pl.when(
            
            # Heuristic 1
            ~col('code_cleaned').str.contains('NDC|MS|CDM')
            
        ).then(
            
            # Heuristic 2
            col('code_cleaned').str.extract_all(extraction_str)
            
        ).otherwise(None).alias('extracted_cpt')
    )

testdf.pipe(extract_cpt)

code,code_cleaned,extracted_cpt
str,str,list[str]
"""CPT/HCPC 02060...","""CPT/HCPC 02060...",
"""NDC 0011-1231-...","""NDC 0011-1231-...",
"""CPT 99899 0012...","""CPT 99899 0012...","[""99899"", ""00123""]"
"""MS-DRG 1101112...","""MS-DRG 1101112...",
"""CPT 123""","""CPT 123""",
"""12341-12345""","""""",
"""CPT12345""","""CPT12345""",
"""NDC 0012312311...","""NDC 0012312311...",
"""MS123""","""MS123""",
"""319""","""319""",


Cool! We're on our way to doing a good cleanup on the dataset.

### the "raw string" r''

If you've never seen `r'some string'` before, the `r` just stops backslashes inside the quotes from being treated as escape characters. So, for example, `r'\n'` is *really* a backslash followed by n, as opposed to `'\n'`, which is a newline.

In [1194]:
print(r'This is \n not a newline')
print('This is a \n newline')

This is \n not a newline
This is a 
 newline


This prevents the `\b` from being parsed as something else (in a regular Python string it is a backspace character), so that the regex engine can work with it properly and treat it as a word boundary. Without the `r`, you need double backslashes.

In [1195]:
print(r'A raw string: \b and \\b')
print('A regular string \\b and \b')

A raw string: \b and \\b
A regular string \b and
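
The difference matters for matching, not just for printing. A small sketch with Python's `re` (standing in for the polars regex engine):

```python
import re

# A regular string turns \b into a single backspace character (0x08)...
assert '\b' == '\x08' and len('\b') == 1
# ...while a raw string keeps the backslash and the letter b.
assert len(r'\b') == 2 and r'\b' == '\\b'

raw_pattern = r'\b\d{5}\b'
cooked_pattern = '\b' + r'\d{5}' + '\b'  # backspace, five digits, backspace

print(re.findall(raw_pattern, 'CPT 99899'))     # ['99899']
print(re.findall(cooked_pattern, 'CPT 99899'))  # []
```

The second pattern looks for literal backspace characters around the digits, so it finds nothing.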


### Convenience functions

#### `in_col`
The argument to `str.contains` is a regex, here a list of alternatives joined with `|`:

```python
# Heuristic 1
~col('code_cleaned').str.contains('NDC|MS|CDM')
```

But this will be cumbersome to write once we have a lot of stuff to put in there. It would be nice if we could just pass a space-separated string of patterns directly to a function.

In [1196]:
def in_col(colname, words): 
    return col(colname).str.contains('|'.join(words.split()))

def extract_all_from(colname, words):
    words_str = '|'.join([fr'\b{word}\b' for word in words.split()])
    return col(colname).str.extract_all(words_str)

def clear_all_from(colname, words):
    words_str = '|'.join([fr'{word}' for word in words.split()])
    return col(colname).str.replace_all(words_str, '')

This allows us to replace, for example:
```python
# ~col('code_cleaned').str.contains('NDC|MS|CDM') 
~in_col('code_cleaned', 'NDC MS CDM')

# cpt_pats = ['\d{5}', '[A-Z]\d{4}', '\d{4}[A-Z]'] 
# col('code_cleaned').str.extract_all(extraction_str)
cpt_pats = '\d{5} [A-Z]\d{4} \d{4}[A-Z]'
extract_all_from('code_cleaned', cpt_pats)
```

which looks neater to me and will get easier to work with.
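
To see what regex these helpers actually hand to polars, here is a sketch of just the string-building step in plain Python. The function name `alternation` is hypothetical; it mirrors the `'|'.join(... words.split())` logic used by the helpers.

```python
def alternation(words: str, bounded: bool = False) -> str:
    """Turn a whitespace-separated list of patterns into one OR-ed regex."""
    parts = words.split()
    if bounded:
        # Same wrapping as extract_all_from: a word boundary on each side.
        parts = [fr'\b{part}\b' for part in parts]
    return '|'.join(parts)


print(alternation('NDC MS CDM'))
print(alternation(r'\d{5} [A-Z]\d{4}', bounded=True))
print(alternation(r'CPT\s?/?HCPCS?'))
```

One consequence of splitting on whitespace is that a pattern can't contain a literal space; `\s` has to be used instead.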

### Writing the extraction function with our new functions
results in a much cleaner transformation.

In [1199]:
def extract_cpt(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract CPT singlet codes
    '''
    
    cpt_pats = '\d{5} [A-Z]\d{4} \d{4}[A-Z]'
    
    return df.with_column(
        
        clear_all_from('code', '\d{5}-\d{5}').alias(f'code_cleaned')
        
    ).with_column(
        
        pl.when(
            
            # Heuristic 1
            ~in_col('code_cleaned', 'NDC MS CDM')
            
        ).then(
            
            # Heuristic 2
            extract_all_from('code_cleaned', cpt_pats)
            
        ).otherwise(None).alias('extracted_cpt')
    )

testdf.pipe(extract_cpt)

code,code_cleaned,extracted_cpt
str,str,list[str]
"""CPT/HCPC 02060...","""CPT/HCPC 02060...",
"""NDC 0011-1231-...","""NDC 0011-1231-...",
"""CPT 99899 0012...","""CPT 99899 0012...","[""99899"", ""00123""]"
"""MS-DRG 1101112...","""MS-DRG 1101112...",
"""CPT 123""","""CPT 123""",
"""12341-12345""","""""",
"""CPT12345""","""CPT12345""",
"""NDC 0012312311...","""NDC 0012312311...",
"""MS123""","""MS123""",
"""319""","""319""",


### Continuing on
Let's add the detail needed to make this extraction more realistic.

1. We'll only use valid 5 digit CPT codes (so we'll need a more advanced regex). I used https://3widgets.comcodes to get a regex that will capture numbers from 00100-99499, and added the alphanumerical ones manually.
2. It sometimes happens that codes are marked as CPT codes but don't fit the regex pattern. Any time there's a CPT/HCPCS string in the cell we're going to keep all the (reasonable) codes that are in it, even if they don't match the exact pattern of being a 5 digit string. They might have just had some zeros lopped off the front or something like that.
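
A generated range regex is easy to get wrong, so it's worth a brute-force check. The sketch below takes the numeric patterns used in the next cell and confirms, with Python's `re`, that they accept exactly the five-digit strings from 00100 to 99499.

```python
import re

numeric_pats = '''0010[0-9] 001[1-9][0-9] 00[2-9][0-9]{2}
                  0[1-9][0-9]{3} [1-8][0-9]{4} 9[0-8][0-9]{3}
                  99[0-4][0-9]{2}'''
range_re = re.compile('|'.join(numeric_pats.split()))

# Try every five-digit string and collect the numbers that match.
matched = {n for n in range(100_000) if range_re.fullmatch(f'{n:05d}')}

print(matched == set(range(100, 99_500)))  # True
```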

In [1230]:
def extract_cpt(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract CPT singlet codes
    '''
    
    cpt_pats = '''0010[0-9] 001[1-9][0-9] 00[2-9][0-9]{2} 
                0[1-9][0-9]{3} [1-8][0-9]{4} 9[0-8][0-9]{3} 
                99[0-4][0-9]{2} [ACEGJKLMQTV]\d{4} \d{4}[ACEGJKLMQTV]'''
    
    return df.with_column(
        
        clear_all_from('code', '\\b\w{5}-\w{5}\\b \\b?CPT\\b? \\b?HCPCS?\\b?').alias(f'code_cleaned'),
        
    ).with_column(
        
        pl.when(            
            # If we have a code to exclude...
            in_col('code', 'NDC MS CDM DRG PH\d+ ITM')
        ).then(
            # ...ignore the cell
            None
        ).when(
            # If we have an explicit CPT code...
            in_col('code', 'CPT HCPCS? CPT\s?/?HCPCS? HCPCS\s?/?CPT')
        ).then(
            # greedily take the candidates
            extract_all_from('code_cleaned', cpt_pats + ' ([A-Z]|\d){1,2}?\d{1,4}([A-Z]|\d){2} \\b\d{3}\\b')
        ).otherwise(
            # If we have no information about the code,
            # just select the good candidates
            extract_all_from('code_cleaned', cpt_pats)
            
        ).alias('extracted_cpt')
    ).drop('code_cleaned')

testdf.pipe(extract_cpt)

code,extracted_cpt
str,list[str]
"""CPT/HCPC 02060...",
"""NDC 0011-1231-...",
"""CPT 99899 0012...","[""99899"", ""00123""]"
"""MS-DRG 1101112...",
"""CPT 123""","[""123""]"
"""12341-12345""",
"""CPT12345""","[""12345""]"
"""NDC 0012312311...",
"""MS123""",
"""319""",


### Repeating for the other code types
We'll do the same thing to extract NDC codes and DRG codes cleanly from our dataset.

In [1232]:
def extract_ndc(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract NDC singlet codes
    '''
    
    ndc_pats = '''\d{4}-\d{4}-\d{2} \d{5}-\d{3}-\d{2} \d{5}-\d{4}-\d{1}'''
    
    return df.with_column(
        
        pl.when(            
            in_col('code', 'CPT HCPCS? CPT\s?/?HCPCS? HCPCS\s?/?CPT MS CDM DRG PH\d+ ITM')
        ).then(None).when(
            in_col('code', 'NDC')
        ).then(
            extract_all_from('code', ndc_pats + ' ([0-9]|-){12} ([0-9]|-){11} \d{10}')
        ).otherwise(
            extract_all_from('code', ndc_pats)
        ).alias('extracted_ndc')
    )

testdf.pipe(extract_ndc)

code,extracted_ndc
str,list[str]
"""CPT/HCPC 02060...",
"""NDC 0011-1231-...","[""0011-1231-12""]"
"""CPT 99899 0012...",
"""MS-DRG 1101112...",
"""CPT 123""",
"""12341-12345""",
"""CPT12345""",
"""NDC 0012312311...","[""0012312311""]"
"""MS123""",
"""319""",
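
The three dashed layouts in `ndc_pats` (4-4-2, 5-3-2 and 5-4-1) all add up to ten digits, which is why the looser fallback accepts a bare `\d{10}`. A small sketch with Python's `re`; apart from the first sample, which comes from the test dataframe, the sample strings are made up:

```python
import re

ndc_pats = r'\d{4}-\d{4}-\d{2} \d{5}-\d{3}-\d{2} \d{5}-\d{4}-\d{1}'
ndc_re = re.compile('|'.join(fr'\b{pat}\b' for pat in ndc_pats.split()))

samples = ['0011-1231-12', '12345-123-12', '12345-1234-1', '0011-1231-1']
for sample in samples:
    is_ndc = ndc_re.fullmatch(sample) is not None
    n_digits = len(sample.replace('-', ''))
    print(sample, is_ndc, n_digits)
```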


In [1233]:
def extract_drg(df: pl.DataFrame) -> pl.DataFrame:
    '''
    Extract DRG singlet codes
    '''
    
    drg_pats = '''\d{1,3}'''
    
    return df.with_column(
        clear_all_from('code', ' \\b?MS\\b? \\b?MS-DRG\\b? \\b?DRG\\b?').alias(f'code_cleaned'),
    ).with_column(
        pl.when(            
            in_col('code', 'CPT HCPCS? CPT\s?/?HCPCS? HCPCS\s?/?CPT NDC CDM PH\d+ ITM')
        ).then(None).when(
            in_col('code', drg_pats + ' MS-DRG DRG MS')
        ).then(
            extract_all_from('code_cleaned', drg_pats)
        ).otherwise(None).alias('extracted_drg')
    ).drop('code_cleaned')

testdf.pipe(extract_drg)

code,extracted_drg
str,list[str]
"""CPT/HCPC 02060...",
"""NDC 0011-1231-...",
"""CPT 99899 0012...",
"""MS-DRG 1101112...",
"""CPT 123""",
"""12341-12345""",
"""CPT12345""",
"""NDC 0012312311...",
"""MS123""","[""123""]"
"""319""","[""319""]"


Thanks to the magic of combined `when().then().otherwise()` we're really able to keep the code clean!
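
For readers new to the chained form, it behaves like a per-row `if / elif / else`: the first condition that holds decides the value. Below is a plain-Python sketch of the CPT logic under that reading. It uses simplified patterns (not the full ones from `extract_cpt`), the name `extract_cpt_row` is made up, and Python's `re` stands in for the polars regex engine.

```python
import re

EXCLUDE = re.compile(r'NDC|MS|CDM|DRG|PH\d+|ITM')
EXPLICIT = re.compile(r'CPT|HCPCS?')
STRICT = re.compile(r'\b(?:\d{5}|[A-Z]\d{4}|\d{4}[A-Z])\b')
LOOSE = re.compile(r'\b(?:\d{5}|[A-Z]\d{4}|\d{4}[A-Z]|\d{3})\b')


def extract_cpt_row(code: str):
    # when(excluded).then(None)
    if EXCLUDE.search(code):
        return None
    # Strip dashed pairs and the CPT/HCPCS labels before matching.
    cleaned = re.sub(r'\b\w{5}-\w{5}\b|CPT|HCPCS?', '', code)
    # .when(explicitly marked).then(greedy candidates)
    if EXPLICIT.search(code):
        return LOOSE.findall(cleaned)
    # .otherwise(strict candidates only)
    return STRICT.findall(cleaned)


for code in ['NDC 0011-1231-12', 'CPT 123', 'CPT12345', '12341-12345', '319']:
    print(repr(code), '->', extract_cpt_row(code))
```

The order of the branches matters: an excluded cell never reaches the greedy branch, even if it also mentions CPT.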

We can also drop the unneeded columns later. I just kept them here so it's easier to see what's going on.

In [1235]:
lf = pl.scan_csv('../prices.csv', n_rows = 20_000_000, infer_schema_length = 0, encoding = 'utf8-lossy')

In [1236]:
lf.fetch(2)

cms_certification_num,payer,code,internal_revenue_code,units,description,inpatient_outpatient,price,code_disambiguator
str,str,str,str,str,str,str,str,str
"""010001""","""BLUE ADVANTAGE...","""HCPCS 82441""","""3018244101""",,"""HC TEST FOR CH...","""UNSPECIFIED""","""279.02""","""NONE"""
"""010001""","""BLUE CROSS OF ...","""HCPCS 82441""","""3018244101""",,"""HC TEST FOR CH...","""UNSPECIFIED""","""279.02""","""NONE"""


In [None]:
df = (lf
      .collect()
      .pipe(extract_ndc)
      .pipe(extract_cpt)
      .pipe(extract_drg)
      # .pipe(extract_cpt_range)
      .drop(['units', 'code_disambiguator'])
      .with_columns(
          pl.col(['cms_certification_num', 'inpatient_outpatient']).cast(pl.Categorical)
      )
     )

In [None]:
df.sample(5)

In [None]:
# rows where none of the three extractors found a code
g = df.filter(
    col('extracted_ndc').is_null() &
    col('extracted_cpt').is_null() &
    col('extracted_drg').is_null() # &
    # col('cpt_ranges_extracted').is_null()
)

In [None]:
g.sample(5)

In [None]:
len(g)/len(df)