In [None]:
import pandas as pd

Reads the CSV file into a pandas DataFrame.

In [None]:
df=pd.read_csv('data/Inpatient_Prospective_Payment_System__IPPS__Provider_Summary_for_the_Top_100_Diagnosis-Related_Groups__DRG__-_FY2011.csv')
df.head(1)

Strips whitespace from beginning and end of column names. Replaces spaces in column names with underscores and makes all characters lower case.

In [None]:
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(" ","_").str.lower()
df.head(1)
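
As a minimal sketch of the renaming step, the same chain can be tried on a toy frame. The header strings below are made up for illustration and are not copied from the IPPS file.

```python
import pandas as pd

# Toy frame with messy headers (hypothetical, not the real IPPS columns).
toy = pd.DataFrame(columns=[" Provider Id", "Provider Name ", " Average Total Payments "])

# Trim both ends, swap inner spaces for underscores, then lower-case.
toy.columns = toy.columns.str.strip().str.replace(" ", "_", regex=False).str.lower()

print(list(toy.columns))
# ['provider_id', 'provider_name', 'average_total_payments']
```

Stripping before replacing matters: otherwise leading or trailing spaces would turn into leading or trailing underscores.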

For columns with dollar amounts, strips the dollar sign and converts the values to floats.

In [None]:
# regex=False treats '$' as a literal character, not the regex end-of-string anchor
df.average_covered_charges = df.average_covered_charges.str.replace('$','',regex=False).astype('float')
df.average_total_payments = df.average_total_payments.str.replace('$','',regex=False).astype('float')
df.average_medicare_payments = df.average_medicare_payments.str.replace('$','',regex=False).astype('float')
df.head(1)
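
The conversion can also be written as a loop over the money columns. This is only a sketch on a toy frame. The dollar values are invented for illustration, and `money_cols` is a hypothetical helper name.

```python
import pandas as pd

# Toy data with made-up dollar amounts stored as strings.
toy = pd.DataFrame({
    "average_covered_charges": ["$32963.07", "$15131.85"],
    "average_total_payments": ["$5777.24", "$5787.57"],
})

money_cols = ["average_covered_charges", "average_total_payments"]
for col in money_cols:
    # regex=False makes "$" a literal character rather than a regex anchor.
    toy[col] = toy[col].str.replace("$", "", regex=False).astype(float)

print(toy.dtypes)
```

Passing `regex=False` explicitly keeps the behavior the same across pandas versions, since the default value of `regex` for `Series.str.replace` changed in pandas 2.0.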

ZIP codes with leading zeros were read from the CSV file as integers with fewer than five digits, so this converts the column to strings and left-pads each value with zeros to five characters.

In [None]:
df.provider_zip_code = df.provider_zip_code.astype(str).str.zfill(5)
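
The sketch below shows why the padding is needed and one possible alternative: reading the column as text in the first place. The in-memory CSV and the column name `zip` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the column name "zip" is invented for this sketch.
csv_text = "zip\n01040\n36301\n"

# Default parsing infers integers, so the leading zero is lost.
as_int = pd.read_csv(io.StringIO(csv_text))
padded = as_int["zip"].astype(str).str.zfill(5)

# Reading the column as text keeps the original five characters.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(as_int["zip"].tolist())  # [1040, 36301]
print(padded.tolist())         # ['01040', '36301']
print(as_str["zip"].tolist())  # ['01040', '36301']
```

Both routes give the same strings here. Padding after the fact assumes every value was originally five digits long.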

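Removes the ", THE" suffix from provider names, and commas and apostrophes from provider names and street addresses, because these interfere with geocoding (finding the GPS coordinates of each provider).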

In [None]:
df.provider_name = df.provider_name.str.replace(", THE","")
df.provider_name = df.provider_name.str.replace(",","").str.replace("'","")
df.provider_street_address = df.provider_street_address.str.replace(",","").str.replace("'","")
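
To see the effect of these replacements, here is a small sketch on two invented provider names (they are not taken from the data set).

```python
import pandas as pd

# Made-up provider names containing the characters removed above.
names = pd.Series(["CHILDREN'S HOSPITAL, THE", "ST. MARY'S MEDICAL CENTER, INC"])

cleaned = (
    names.str.replace(", THE", "", regex=False)  # drop the ", THE" suffix first
         .str.replace(",", "", regex=False)      # then any remaining commas
         .str.replace("'", "", regex=False)      # then apostrophes
)

print(cleaned.tolist())
# ['CHILDRENS HOSPITAL', 'ST. MARYS MEDICAL CENTER INC']
```

The order matters: removing commas first would leave a stray " THE" that the first pattern no longer matches.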

City names longer than 15 characters were truncated to 15 characters in the source data. This code selects one row per provider by ID number, keeps the cities whose names are at least 15 characters long (the possibly truncated ones), and writes them to a CSV file. A hand-edited version of that file (read in the next cell as data/city_corrections.csv) adds a second column containing the corrected city names, which the replacement code uses. Correcting the city names improved the geocoding results.

In [None]:
dfp=df.drop_duplicates(subset='provider_id')
dfp = dfp[dfp.provider_city.str.len() >= 15].reset_index(drop=True)
dfp['provider_city'].to_csv('data/LongCityNames.csv',index=False)
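
A minimal sketch of the selection on toy data follows. The provider IDs are invented, and the extra `drop_duplicates()` on the city names is an addition of this sketch (it is not in the cell above) that would shorten the list to edit by hand.

```python
import pandas as pd

# Toy data: provider 1 appears twice, as providers do in the full table.
toy = pd.DataFrame({
    "provider_id": [1, 1, 2, 3],
    "provider_city": ["NORTH LITTLE RO", "NORTH LITTLE RO", "DOTHAN", "RANCHO CUCAMONG"],
})

providers = toy.drop_duplicates(subset="provider_id")

# Keep names that hit the 15-character limit, listing each city once.
is_long = providers.provider_city.str.len() >= 15
long_names = providers.loc[is_long, "provider_city"].drop_duplicates().reset_index(drop=True)

print(long_names.tolist())  # ['NORTH LITTLE RO', 'RANCHO CUCAMONG']
```

Not every 15-character name is necessarily truncated, so the resulting list still needs a manual check.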

This uses the hand-edited CSV file described previously to replace all of the truncated city names with the full names.

In [None]:
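dfcity = pd.read_csv('data/city_corrections.csv')
# Replace exact matches only; substring replacement could alter a name
# that already contains a truncated name (e.g. one corrected earlier).
city_map = dict(zip(dfcity.city, dfcity.corrected_city))
df.provider_city = df.provider_city.replace(city_map)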
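
The difference between substring replacement (`Series.str.replace`) and exact-value replacement (`Series.replace` with a dict) can be sketched on toy data. The one-row corrections table below is invented, but it uses the same `city` and `corrected_city` column names as the file read above.

```python
import pandas as pd

cities = pd.Series(["NORTH LITTLE RO", "NORTH LITTLE ROCK", "BIRMINGHAM"])

# Hypothetical corrections table with the same column names as the real file.
corrections = pd.DataFrame({
    "city": ["NORTH LITTLE RO"],
    "corrected_city": ["NORTH LITTLE ROCK"],
})

# Substring replacement also rewrites the name that was already complete.
substring = cities.str.replace("NORTH LITTLE RO", "NORTH LITTLE ROCK", regex=False)

# Exact-value replacement only touches cells equal to a key in the mapping.
exact = cities.replace(dict(zip(corrections.city, corrections.corrected_city)))

print(substring.tolist())  # ['NORTH LITTLE ROCK', 'NORTH LITTLE ROCKCK', 'BIRMINGHAM']
print(exact.tolist())      # ['NORTH LITTLE ROCK', 'NORTH LITTLE ROCK', 'BIRMINGHAM']
```

The same corruption can appear with substring replacement if the corrections file lists a city twice or if the cell is run a second time.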

Adds a column with just the three-digit DRG code taken from the start of each DRG definition, which makes selecting procedure types easier. Also moves the new column to the beginning of the dataframe.

In [None]:
df['drg_id']=df.drg_definition.str[:3]
df = df[['drg_id'] + [c for c in df.columns if c != 'drg_id']]
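
As a sketch of the two steps on a toy frame: the rows below are illustrative only, and the provider IDs are invented.

```python
import pandas as pd

# Illustrative rows; the provider IDs are made up.
toy = pd.DataFrame({
    "drg_definition": [
        "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC",
        "057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC",
    ],
    "provider_id": [10001, 10005],
})

# The first three characters of the definition are the DRG code (kept as text).
toy["drg_id"] = toy.drg_definition.str[:3]

# Move the new column to the front, keeping the other columns in order.
toy = toy[["drg_id"] + [c for c in toy.columns if c != "drg_id"]]

print(toy.columns.tolist())  # ['drg_id', 'drg_definition', 'provider_id']
print(toy.drg_id.tolist())   # ['039', '057']
```

Keeping `drg_id` as a string preserves leading zeros such as the one in "039". Filtering then looks like `toy[toy.drg_id == "039"]`.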

In [None]:
df.head()

Writes the dataframe containing the cleaned data to a csv file.

In [None]:
df.to_csv('data/IPPS_Data_Clean_tmp.csv',index=False)