In [1]:
import pandas as pd
import numpy as np

**Import the dataset**

- Hop Teaming data can be found at https://careset.com/docgraph-hop-teaming-dataset/.
- Download the NPPES Data Dissemination from https://download.cms.gov/nppes/NPI_Files.html.
- Download the taxonomy code to classification crosswalk from the National Uniform Claim Committee (https://www.nucc.org/index.php/code-sets-mainmenu-41/provider-taxonomy-mainmenu-40/csv-mainmenu-57).
- Download the ZIP code to CBSA crosswalk from https://www.huduser.gov/portal/datasets/usps_crosswalk.html.

In [2]:
hopteaming = pd.read_csv('../data/DocGraph_Hop_Teaming_2018.csv', nrows = 100)
taxonomy = pd.read_csv('../data/nucc_taxonomy_240.csv', nrows = 100)
zip_cbsa = pd.read_excel('../data/ZIP_CBSA_122023.xlsx', nrows = 100)

In [3]:
endpoint = pd.read_csv('../data/endpoint_pfile_20050523-20240211.csv', nrows = 100)
npidata = pd.read_csv('../data/npidata_pfile_20050523-20240211.csv', nrows = 100)
othername = pd.read_csv('../data/othername_pfile_20050523-20240211.csv', nrows = 100)
pl_pfile = pd.read_csv('../data/pl_pfile_20050523-20240211.csv', nrows = 100)

In [4]:
othername.head(2)

Unnamed: 0,NPI,Provider Other Organization Name,Provider Other Organization Name Type Code
0,1023011053,Vine Discount Pharmacy & Medical Supply,3
1,1023011178,PROVIDENCE PALLIATIVE CARE NAPA VALLEY,3


In [5]:
pl_pfile.head(2)

Unnamed: 0,NPI,Provider Secondary Practice Location Address- Address Line 1,Provider Secondary Practice Location Address- Address Line 2,Provider Secondary Practice Location Address - City Name,Provider Secondary Practice Location Address - State Name,Provider Secondary Practice Location Address - Postal Code,Provider Secondary Practice Location Address - Country Code (If outside U.S.),Provider Secondary Practice Location Address - Telephone Number,Provider Secondary Practice Location Address - Telephone Extension,Provider Practice Location Address - Fax Number
0,1154324382,7800 Sheridan St,,Pembroke Pines,FL,330242536,US,9549872000,,9544377000.0
1,1154324382,500 N Hiatus Rd Ste 200,,Pembroke Pines,FL,330265213,US,9544374800,,


In [6]:
endpoint.head(2)

Unnamed: 0,NPI,Endpoint Type,Endpoint Type Description,Endpoint,Affiliation,Endpoint Description,Affiliation Legal Business Name,Use Code,Use Description,Other Use Description,Content Type,Content Description,Other Content Description,Affiliation Address Line One,Affiliation Address Line Two,Affiliation Address City,Affiliation Address State,Affiliation Address Country,Affiliation Address Postal Code
0,1154324382,DIRECT,Direct Messaging Address,rclose13800@MHSDIRECT.NET,N,,,,,,,,,3501 Johnson St,,Hollywood,FL,US,330215421
1,1154324382,DIRECT,Direct Messaging Address,Richard.Close@SEP.EClinicalDirectPlus.com,N,,,DIRECT,Direct,,,,,500 N Hiatus,Ste 200,Pembroke Pines,FL,US,33026


In [7]:
hopteaming.head(3)

Unnamed: 0,from_npi,to_npi,patient_count,transaction_count,average_day_wait,std_day_wait
0,1508062167,1730166109,350,370,53.922,72.612
1,1508065640,1730166109,25,25,49.8,55.006
2,1508052093,1730166109,16,16,109.5,70.593


**Remove "accidental" referrals in hopteaming DataFrame.** 

Filter for records where transaction_count is at least 50 and the average_day_wait is less than 50.

In [8]:
# "At least 50" transactions is an inclusive bound, so use >= rather than >
hopteaming = hopteaming.loc[(hopteaming['transaction_count'] >= 50) & (hopteaming['average_day_wait'] < 50)]

hopteaming.reset_index(inplace=True)

hopteaming.head(5)

Unnamed: 0,index,from_npi,to_npi,patient_count,transaction_count,average_day_wait,std_day_wait
0,7,1508085911,1730166125,58,67,23.925,43.923
1,10,1508167040,1730166125,51,51,28.196,52.876
2,15,1508863549,1730166125,340,391,18.302,42.422
3,18,1508867870,1730166125,50,79,12.658,26.402
4,25,1508011040,1730166224,132,145,8.579,28.053
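
The filter described above has one inclusive bound ("at least 50" transactions) and one strict bound (a wait of less than 50 days). A minimal sketch on toy rows that sit exactly on those boundaries is shown here; the values are hypothetical and not taken from the Hop Teaming file. Passing `drop=True` to `reset_index` is an optional variation that avoids carrying the old row labels along as an extra `index` column.

```python
import pandas as pd

# Toy rows (hypothetical values) chosen to sit on the filter boundaries
toy = pd.DataFrame({
    'from_npi': [1, 2, 3, 4],
    'to_npi': [9, 9, 9, 9],
    'transaction_count': [49, 50, 51, 200],
    'average_day_wait': [10.0, 49.9, 50.0, 12.5],
})

# Inclusive lower bound on transactions, strict upper bound on the wait
keep = (toy['transaction_count'] >= 50) & (toy['average_day_wait'] < 50)

# drop=True discards the old index instead of adding it as a column
filtered = toy.loc[keep].reset_index(drop=True)
print(filtered)
```

Only the rows with `from_npi` 2 and 4 survive: row 1 has too few transactions and row 3 has a wait of exactly 50 days.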


**Select the relevant columns in NPPES dataset for this project.**

- 'NPI'
- Entity Type, indicated by the 'Entity Type Code' field:
    - 1 = Provider (doctors, nurses, etc.)
    - 2 = Facility (Hospitals, Urgent Care, Doctors' Offices)
- Entity Name: Either First/Last or Organization or Other Organization Name contained in the following fields:
    - 'Provider Organization Name (Legal Business Name)'
    - 'Provider Last Name (Legal Name)'
    - 'Provider First Name'
    - 'Provider Middle Name'
    - 'Provider Name Prefix Text'
    - 'Provider Name Suffix Text'
    - 'Provider Credential Text'
- Address: Business Practice Location (not mailing), contained in the following fields:
    - 'Provider First Line Business Practice Location Address'
    - 'Provider Second Line Business Practice Location Address'
    - 'Provider Business Practice Location Address City Name'
    - 'Provider Business Practice Location Address State Name'
    - 'Provider Business Practice Location Address Postal Code'
- The provider's taxonomy code, which is contained in one of the 'Healthcare Provider Taxonomy Code*' columns.
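
Because only a handful of fields are needed, one option is to restrict the columns when the file is read rather than afterwards. The sketch below assumes a tiny in-memory stand-in for the NPPES file (hypothetical rows, only a few of the real column names) and uses the `usecols` and `dtype` parameters of `pd.read_csv`. Reading the postal code as a string is a suggestion, not something the notebook does; it keeps leading zeros intact.

```python
import io
import pandas as pd

# Tiny stand-in for the NPPES file (hypothetical rows, a few real column names)
csv_text = (
    "NPI,Entity Type Code,Provider Last Name (Legal Name),"
    "Provider Business Practice Location Address Postal Code,Replacement NPI\n"
    "1000000001,1,DOE,037401234,\n"
    "1000000002,2,,37203,\n"
)

# Columns to keep; everything else is skipped while parsing
wanted = [
    'NPI',
    'Entity Type Code',
    'Provider Last Name (Legal Name)',
    'Provider Business Practice Location Address Postal Code',
]

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    # Read postal codes as text so a leading zero is not lost
    dtype={'Provider Business Practice Location Address Postal Code': str},
)
print(sample)
```

With the real file, the same `usecols` list could be extended with the taxonomy code and switch columns.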

In [9]:
npidata.head(2)

Unnamed: 0,NPI,Entity Type Code,Replacement NPI,Employer Identification Number (EIN),Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,...,Healthcare Provider Taxonomy Group_7,Healthcare Provider Taxonomy Group_8,Healthcare Provider Taxonomy Group_9,Healthcare Provider Taxonomy Group_10,Healthcare Provider Taxonomy Group_11,Healthcare Provider Taxonomy Group_12,Healthcare Provider Taxonomy Group_13,Healthcare Provider Taxonomy Group_14,Healthcare Provider Taxonomy Group_15,Certification Date
0,1679576722,1.0,,,,WIEBE,DAVID,A,,,...,,,,,,,,,,
1,1588667638,1.0,,,,PILCHER,WILLIAM,C,DR.,,...,,,,,,,,,,


Note: The npidata DataFrame has 330 columns.

In [10]:
list(npidata.columns.values)

['NPI',
 'Entity Type Code',
 'Replacement NPI',
 'Employer Identification Number (EIN)',
 'Provider Organization Name (Legal Business Name)',
 'Provider Last Name (Legal Name)',
 'Provider First Name',
 'Provider Middle Name',
 'Provider Name Prefix Text',
 'Provider Name Suffix Text',
 'Provider Credential Text',
 'Provider Other Organization Name',
 'Provider Other Organization Name Type Code',
 'Provider Other Last Name',
 'Provider Other First Name',
 'Provider Other Middle Name',
 'Provider Other Name Prefix Text',
 'Provider Other Name Suffix Text',
 'Provider Other Credential Text',
 'Provider Other Last Name Type Code',
 'Provider First Line Business Mailing Address',
 'Provider Second Line Business Mailing Address',
 'Provider Business Mailing Address City Name',
 'Provider Business Mailing Address State Name',
 'Provider Business Mailing Address Postal Code',
 'Provider Business Mailing Address Country Code (If outside U.S.)',
 'Provider Business Mailing Address Telephone Nu

In [11]:
npidata1 = npidata.loc[:,['NPI',
                          'Entity Type Code',
                          'Provider Organization Name (Legal Business Name)',
                          'Provider Last Name (Legal Name)',
                          'Provider First Name',
                          'Provider Middle Name',
                          'Provider Name Prefix Text',
                          'Provider Name Suffix Text',
                          'Provider Credential Text',
                          'Provider First Line Business Practice Location Address',
                          'Provider Second Line Business Practice Location Address',
                          'Provider Business Practice Location Address City Name',
                          'Provider Business Practice Location Address Postal Code']]
                        
npidata1.head(2)

Unnamed: 0,NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,Provider Credential Text,Provider First Line Business Practice Location Address,Provider Second Line Business Practice Location Address,Provider Business Practice Location Address City Name,Provider Business Practice Location Address Postal Code
0,1679576722,1.0,,WIEBE,DAVID,A,,,M.D.,3500 CENTRAL AVE,,KEARNEY,688472944.0
1,1588667638,1.0,,PILCHER,WILLIAM,C,DR.,,MD,1824 KING STREET,SUITE 300,JACKSONVILLE,322044736.0
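
The name fields selected above can be collapsed into a single entity name: the organization name for facilities (Entity Type Code 2) and first plus last name for individual providers. This is a minimal sketch on toy rows; the `Entity Name` column and the sample values are assumptions introduced for illustration, not part of the NPPES file.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): one individual provider and one facility
toy = pd.DataFrame({
    'Entity Type Code': [1.0, 2.0],
    'Provider Organization Name (Legal Business Name)': [np.nan, 'EXAMPLE CLINIC LLC'],
    'Provider Last Name (Legal Name)': ['DOE', np.nan],
    'Provider First Name': ['JANE', np.nan],
})

# "FIRST LAST" for individuals; missing parts become empty strings
person = (toy['Provider First Name'].fillna('') + ' '
          + toy['Provider Last Name (Legal Name)'].fillna('')).str.strip()

# Facilities take the organization name, everyone else the person name
toy['Entity Name'] = np.where(
    toy['Entity Type Code'] == 2,
    toy['Provider Organization Name (Legal Business Name)'],
    person,
)
print(toy['Entity Name'].tolist())
```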


**Retrieve the provider's taxonomy code from the npidata DataFrame.**

A provider can have up to 15 taxonomy codes, but we want the one whose associated 'Healthcare Provider Primary Taxonomy Switch*' field is 'Y'. Note that the primary code is not always in slot 1.

In [12]:
npidata.filter(like='Taxonomy')

Unnamed: 0,Healthcare Provider Taxonomy Code_1,Healthcare Provider Primary Taxonomy Switch_1,Healthcare Provider Taxonomy Code_2,Healthcare Provider Primary Taxonomy Switch_2,Healthcare Provider Taxonomy Code_3,Healthcare Provider Primary Taxonomy Switch_3,Healthcare Provider Taxonomy Code_4,Healthcare Provider Primary Taxonomy Switch_4,Healthcare Provider Taxonomy Code_5,Healthcare Provider Primary Taxonomy Switch_5,...,Healthcare Provider Taxonomy Group_6,Healthcare Provider Taxonomy Group_7,Healthcare Provider Taxonomy Group_8,Healthcare Provider Taxonomy Group_9,Healthcare Provider Taxonomy Group_10,Healthcare Provider Taxonomy Group_11,Healthcare Provider Taxonomy Group_12,Healthcare Provider Taxonomy Group_13,Healthcare Provider Taxonomy Group_14,Healthcare Provider Taxonomy Group_15
0,207X00000X,Y,,,,,,,,,...,,,,,,,,,,
1,207RC0000X,Y,207RC0000X,N,,,,,,,...,,,,,,,,,,
2,251G00000X,Y,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,174400000X,N,207RH0003X,Y,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,207V00000X,Y,,,,,,,,,...,,,,,,,,,,
96,261QF0400X,Y,,,,,,,,,...,,,,,,,,,,
97,261QA1903X,Y,,,,,,,,,...,,,,,,,,,,
98,363AM0700X,Y,,,,,,,,,...,,,,,,,,,,


In [13]:
npidata1 = npidata.loc[:,['NPI',
                          'Entity Type Code',
                          'Provider Organization Name (Legal Business Name)',
                          'Provider Last Name (Legal Name)',
                          'Provider First Name',
                          'Provider Middle Name',
                          'Provider Name Prefix Text',
                          'Provider Name Suffix Text',
                          'Provider Credential Text',
                          'Provider First Line Business Practice Location Address',
                          'Provider Second Line Business Practice Location Address',
                          'Provider Business Practice Location Address City Name',
                          'Provider Business Practice Location Address Postal Code']].copy()

# Create a new column for the Healthcare Provider Taxonomy Code
# Fill with NaN values for now; object dtype so string codes can be assigned later
npidata1['Healthcare Provider Taxonomy Code'] = pd.Series(np.nan, index=npidata1.index, dtype=object)

# For loop to retrieve the taxonomy code whose paired switch column contains 'Y'
for index, row in npidata.iterrows():
    # Iterate over pairs of (taxonomy code, taxonomy switch)
    for i in range(1, 16):  # For the 15 columns relating to taxonomy code
        code_col = f'Healthcare Provider Taxonomy Code_{i}'
        switch_col = f'Healthcare Provider Primary Taxonomy Switch_{i}'
        # Check if the taxonomy switch is 'Y' and the code is not NaN
        if row[switch_col] == 'Y' and not pd.isna(row[code_col]):
            npidata1.at[index, 'Healthcare Provider Taxonomy Code'] = row[code_col]

npidata1.head(2)

Unnamed: 0,NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,Provider Credential Text,Provider First Line Business Practice Location Address,Provider Second Line Business Practice Location Address,Provider Business Practice Location Address City Name,Provider Business Practice Location Address Postal Code,Healthcare Provider Taxonomy Code
0,1679576722,1.0,,WIEBE,DAVID,A,,,M.D.,3500 CENTRAL AVE,,KEARNEY,688472944.0,207X00000X
1,1588667638,1.0,,PILCHER,WILLIAM,C,DR.,,MD,1824 KING STREET,SUITE 300,JACKSONVILLE,322044736.0,207RC0000X
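
The row-by-row loop is easy to follow but can be slow on the full NPPES file. One possible alternative is a vectorized lookup with NumPy, sketched below on a toy frame with 3 taxonomy slots instead of 15 (the rows are hypothetical, though the second one mirrors a provider whose primary code sits in slot 2). Unlike the loop, this sketch takes the first slot flagged 'Y' if several are flagged, and it does not separately check that the selected code is non-missing.

```python
import numpy as np
import pandas as pd

# Toy frame with 3 taxonomy slots instead of 15 (hypothetical rows)
toy = pd.DataFrame({
    'NPI': [1, 2, 3],
    'Healthcare Provider Taxonomy Code_1': ['207X00000X', '174400000X', np.nan],
    'Healthcare Provider Primary Taxonomy Switch_1': ['Y', 'N', np.nan],
    'Healthcare Provider Taxonomy Code_2': [np.nan, '207RH0003X', np.nan],
    'Healthcare Provider Primary Taxonomy Switch_2': [np.nan, 'Y', np.nan],
    'Healthcare Provider Taxonomy Code_3': [np.nan, np.nan, np.nan],
    'Healthcare Provider Primary Taxonomy Switch_3': [np.nan, np.nan, np.nan],
})

n_slots = 3  # would be 15 for the real file
code_cols = [f'Healthcare Provider Taxonomy Code_{i}' for i in range(1, n_slots + 1)]
switch_cols = [f'Healthcare Provider Primary Taxonomy Switch_{i}' for i in range(1, n_slots + 1)]

codes = toy[code_cols].to_numpy(dtype=object)
is_primary = (toy[switch_cols] == 'Y').to_numpy()

# Position of the first 'Y' in each row; argmax returns 0 when there is none,
# so rows without any primary switch are masked out afterwards
first_y = is_primary.argmax(axis=1)
picked = codes[np.arange(len(toy)), first_y]

toy['Healthcare Provider Taxonomy Code'] = (
    pd.Series(picked, index=toy.index).where(is_primary.any(axis=1))
)
print(toy[['NPI', 'Healthcare Provider Taxonomy Code']])
```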


How many columns are there? 

In [14]:
len(npidata1.columns)

14

**Using the primary taxonomy code, match each provider to a classification (from the Classification column).**

In [15]:
taxonomy.head(3)

Unnamed: 0,Code,Grouping,Classification,Specialization,Definition,Notes,Display Name,Section
0,193200000X,Group,Multi-Specialty,,A business group of one or more individual pra...,[7/1/2003: new],Multi-Specialty Group,Individual
1,193400000X,Group,Single Specialty,,A business group of one or more individual pra...,[7/1/2003: new],Single Specialty Group,Individual
2,207K00000X,Allopathic & Osteopathic Physicians,Allergy & Immunology,,An allergist-immunologist is trained in evalua...,"Source: American Board of Medical Specialties,...",Allergy & Immunology Physician,Individual


In [16]:
# Rename Code column in taxonomy df 
taxonomy = taxonomy.rename(columns = {'Code':'Healthcare Provider Taxonomy Code'})

# Only keep Taxonomy Code and Classification columns
taxonomy = taxonomy.loc[:, ['Healthcare Provider Taxonomy Code',
                           'Classification']]

taxonomy.head(2)


Unnamed: 0,Healthcare Provider Taxonomy Code,Classification
0,193200000X,Multi-Specialty
1,193400000X,Single Specialty


In [17]:
# Merge npidata1 and taxonomy on the taxonomy code.
# The default inner join keeps only providers whose code appears in taxonomy.
npidata1 = npidata1.merge(taxonomy, on = 'Healthcare Provider Taxonomy Code')

npidata1.head(2)

Unnamed: 0,NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,Provider Credential Text,Provider First Line Business Practice Location Address,Provider Second Line Business Practice Location Address,Provider Business Practice Location Address City Name,Provider Business Practice Location Address Postal Code,Healthcare Provider Taxonomy Code,Classification
0,1588667638,1.0,,PILCHER,WILLIAM,C,DR.,,MD,1824 KING STREET,SUITE 300,JACKSONVILLE,322044736.0,207RC0000X,Internal Medicine
1,1932102084,1.0,,ADUSUMILLI,RAVI,K,,,MD,2940 N MCCORD RD,,TOLEDO,436151753.0,207RC0000X,Internal Medicine
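
`DataFrame.merge` defaults to an inner join, so providers whose taxonomy code is missing or absent from the crosswalk drop out of the result. If those providers should be kept, a left join is one option. The sketch below uses toy frames (hypothetical NPIs) and the `how`, `validate`, and `indicator` parameters of `merge` to keep every provider and list the ones left without a classification.

```python
import pandas as pd

# Toy providers (hypothetical NPIs): one matching code, one unmatched, one missing
providers = pd.DataFrame({
    'NPI': [1, 2, 3],
    'Healthcare Provider Taxonomy Code': ['207RC0000X', '207X00000X', None],
})
# Toy crosswalk containing a single code
crosswalk = pd.DataFrame({
    'Healthcare Provider Taxonomy Code': ['207RC0000X'],
    'Classification': ['Internal Medicine'],
})

merged = providers.merge(
    crosswalk,
    on='Healthcare Provider Taxonomy Code',
    how='left',              # keep every provider row
    validate='many_to_one',  # fail if a code is duplicated in the crosswalk
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Providers that did not receive a classification
unmatched = merged.loc[merged['_merge'] == 'left_only', 'NPI'].tolist()
print(unmatched)
```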


How many columns are there? 

In [18]:
len(npidata1.columns)

15

**Match each provider to a CBSA using the Business Zip code.** 

Note that ZIP codes in the NPPES dataset have either 5 or 9 digits, and leading zeros may be dropped when the data is read into a DataFrame (the postal code column above was parsed as a float, e.g. 688472944.0). Once each provider has a CBSA, you can filter to providers in the Nashville CBSA only.

In [19]:
zip_cbsa.head(3)

Unnamed: 0,ZIP,CBSA,USPS_ZIP_PREF_CITY,USPS_ZIP_PREF_STATE,RES_RATIO,BUS_RATIO,OTH_RATIO,TOT_RATIO
0,501,35620,HOLTSVILLE,NY,0.0,1.0,0.0,1.0
1,601,38660,ADJUNTAS,PR,1.0,1.0,1.0,1.0
2,602,10380,AGUADA,PR,1.0,1.0,1.0,1.0
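
Both tables store ZIP codes as numbers, so they need a common 5-character key before they can be joined. The sketch below is one possible approach under stated assumptions: `to_zip5` is a hypothetical helper, the length check is a heuristic (more than 5 digits is treated as a ZIP+4 that may have lost leading zeros), and the 37203 crosswalk row is illustrative and should be checked against the real file.

```python
import pandas as pd

def to_zip5(value):
    """Return the 5-digit ZIP as a string, or None when the value is missing."""
    if pd.isna(value):
        return None
    digits = str(value).split('.')[0]  # drop a trailing '.0' left by float parsing
    # More than 5 digits: treat as ZIP+4 and pad to 9; otherwise pad to 5
    width = 9 if len(digits) > 5 else 5
    return digits.zfill(width)[:5]

# Toy providers: a 9-digit Nashville postal code and a ZIP that lost its zeros
providers = pd.DataFrame({
    'NPI': [1922001957, 1000000001],
    'Provider Business Practice Location Address Postal Code': [372032023.0, 501.0],
})
# Toy crosswalk rows (the 37203 row is illustrative)
crosswalk = pd.DataFrame({'ZIP': [501, 37203], 'CBSA': [35620, 34980]})

providers['ZIP'] = providers[
    'Provider Business Practice Location Address Postal Code'
].map(to_zip5)
crosswalk['ZIP'] = crosswalk['ZIP'].map(to_zip5)

# Left join keeps providers even when their ZIP is not in the crosswalk
with_cbsa = providers.merge(crosswalk, on='ZIP', how='left')
print(with_cbsa[['NPI', 'ZIP', 'CBSA']])
```

In the full crosswalk a ZIP may appear under more than one CBSA, in which case the join produces several rows per provider; keeping the row with the largest `TOT_RATIO` per ZIP is one way to reduce that to a single match.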


In [20]:
npidata1[npidata1['Provider Business Practice Location Address City Name'] == 'NASHVILLE']

Unnamed: 0,NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,Provider Credential Text,Provider First Line Business Practice Location Address,Provider Second Line Business Practice Location Address,Provider Business Practice Location Address City Name,Provider Business Practice Location Address Postal Code,Healthcare Provider Taxonomy Code,Classification
24,1922001957,1.0,,PRESLEY,RICHARD,E,,,M.D.,2011 MURPHY AVE,STE 302,NASHVILLE,372032023.0,207V00000X,Obstetrics & Gynecology
35,1003819046,1.0,,NYLANDER,BARBARA,H,,,M.D.,345 23RD AVE N,SUITE 209,NASHVILLE,372031513.0,207VG0400X,Obstetrics & Gynecology
