In [1]:
# importing packages
import numpy as np
import pandas as pd

In [2]:
# reading in LEIE_updated
leie_updated = pd.read_csv('s3://leie-updated/UPDATED.csv', dtype=object)

Exploring the data's structure: the column names, the number of rows and columns, the data types, the number of unique values per column, and the number of null values per column.

In [3]:
leie_updated.columns

Index(['LASTNAME', 'FIRSTNAME', 'MIDNAME', 'BUSNAME', 'GENERAL', 'SPECIALTY',
       'UPIN', 'NPI', 'DOB', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'EXCLTYPE',
       'EXCLDATE', 'REINDATE', 'WAIVERDATE', 'WVRSTATE'],
      dtype='object')

In [4]:
leie_updated.shape

(72899, 18)

In [5]:
print(leie_updated.dtypes)

LASTNAME      object
FIRSTNAME     object
MIDNAME       object
BUSNAME       object
GENERAL       object
SPECIALTY     object
UPIN          object
NPI           object
DOB           object
ADDRESS       object
CITY          object
STATE         object
ZIP           object
EXCLTYPE      object
EXCLDATE      object
REINDATE      object
WAIVERDATE    object
WVRSTATE      object
dtype: object


In [6]:
leie_updated.nunique()

LASTNAME      28296
FIRSTNAME     11218
MIDNAME        8250
BUSNAME        3069
GENERAL          87
SPECIALTY       193
UPIN           6123
NPI            5277
DOB           20606
ADDRESS       68969
CITY           9789
STATE            60
ZIP           17179
EXCLTYPE         31
EXCLDATE       2274
REINDATE          1
WAIVERDATE       16
WVRSTATE         10
dtype: int64

In [7]:
leie_updated.isnull().sum()

LASTNAME       3128
FIRSTNAME      3127
MIDNAME           0
BUSNAME       69774
GENERAL           0
SPECIALTY      4191
UPIN          66600
NPI               0
DOB            3993
ADDRESS           9
CITY              1
STATE             5
ZIP               0
EXCLTYPE          0
EXCLDATE          0
REINDATE          0
WAIVERDATE        0
WVRSTATE      72886
dtype: int64

In [8]:
leie_updated['NPI'].value_counts()

0000000000    67515
1225072028        3
1801839139        3
1811058282        2
1326021098        2
              ...  
1437467461        1
1548251788        1
1487737243        1
1992030050        1
1093892119        1
Name: NPI, Length: 5277, dtype: int64

Note that 67,515 of the 72,899 NPI values are 0000000000. The column has no null values, but it has only 5,277 distinct values, so it appears that zeros are used when an NPI does not exist. We had planned to use NPI as the primary key to connect across data sets, so we researched why so many values are zero. According to the Medicare data website, NPI was not a required identifier for providers until 2008. UPIN was used before that, so we also explored those values, even though the null counts above show that UPIN is missing for most rows (66,600) as well.
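One way to make the placeholder explicit is to convert the all-zero NPI to a true missing value and then count how many rows carry any usable identifier. The sketch below runs on a small made-up frame rather than the real file. The column name `NPI_CLEAN` and the sample values are illustrative assumptions, and it assumes that 0000000000 is the only placeholder in use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for leie_updated (values are made up).
toy = pd.DataFrame({
    'NPI': ['0000000000', '1225072028', '0000000000', '1801839139'],
    'UPIN': [np.nan, 'A73915', np.nan, np.nan],
}, dtype=object)

# Assumption: the all-zero string is a placeholder, not a real NPI.
# mask() replaces the values where the condition is True with NaN.
toy['NPI_CLEAN'] = toy['NPI'].mask(toy['NPI'] == '0000000000')

# Count rows with a usable NPI, a usable UPIN, or either identifier.
has_npi = toy['NPI_CLEAN'].notna()
has_upin = toy['UPIN'].notna()
id_summary = {
    'has_npi': int(has_npi.sum()),
    'has_upin': int(has_upin.sum()),
    'has_either': int((has_npi | has_upin).sum()),
}
print(id_summary)
```

Applying the same steps to `leie_updated` would show how many records could be joined to other data sets on NPI, on UPIN, or on neither.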

In [9]:
leie_updated['UPIN'].value_counts()

A73915    3
T55450    3
A77906    3
D63434    3
T28092    3
         ..
D66736    1
D23628    1
U43627    1
C49303    1
B01124    1
Name: UPIN, Length: 6123, dtype: int64

In [10]:
# Specialty might be useful in our future analysis
leie_updated['SPECIALTY'].value_counts()

NURSE/NURSES AIDE       31984
OWNER/OPERATOR           3327
HEALTH CARE AIDE         2963
NO KNOWN AFFILIATION     2014
CHIROPRACTIC             1989
                        ...  
TEACHER                     1
EMPLOYEE - GM/GS-15         1
EMPLOYEE - COMM OFFI        1
MEDICARE PART D CONT        1
PRINTING FIRM               1
Name: SPECIALTY, Length: 193, dtype: int64

In [11]:
# exploring exclusion type column
leie_updated['EXCLTYPE'].value_counts()

1128b4       30619
1128a1       20554
1128a2        6877
1128a3        4124
1128a4        2824
1128b14       2296
1128b8        1492
1128a1         903
1128b1         829
1128b5         810
1128b7         613
1128b3         308
1128Aa         149
1128a3         126
1128a2          88
1128b6          66
1156            58
1128b2          54
1128b15         34
1128b7          24
1128b11         11
BRCH SA         10
1160             9
BRCH CIA         8
1128b16          3
1128b6           3
1128b2           2
1128a4           2
1128b1           1
1128b12          1
1128b5           1
Name: EXCLTYPE, dtype: int64

We will need to look up what each exclusion type code means.
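Several codes in the counts above, such as 1128a1 and 1128a3, appear on more than one row, which should not happen for identical strings. A plausible cause, though not one we have confirmed, is stray whitespace in the raw values. The sketch below uses a made-up sample to show how stripping whitespace before counting would merge such duplicates. The sample values and the trailing spaces are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample; the trailing spaces are an assumed cause of the duplicates.
excl = pd.Series(['1128a1', '1128a1 ', '1128b4', '1128b4', '1128a2  '], dtype=object)

# Strip surrounding whitespace so visually identical codes collapse together.
excl_clean = excl.str.strip()

raw_counts = excl.value_counts()
clean_counts = excl_clean.value_counts()

# Compare the number of distinct codes before and after cleaning.
print(raw_counts.size, clean_counts.size)
```

If the number of distinct codes in `leie_updated['EXCLTYPE']` drops after the same `str.strip()` call, whitespace is the likely explanation. If it does not, the raw strings would need a closer look, for example with `repr()`.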

In [12]:
# looking at the date range for exclusions
leie_updated['EXCLDATE'].min()

'19770701'

In [13]:
leie_updated['EXCLDATE'].max()

'20200220'

Per the website where we pulled the data, this updated LEIE is a complete database containing all exclusions currently in effect. Individuals and entities who have been reinstated are not included in this file. The file is complete and should not be used in conjunction with the monthly exclusion and reinstatement supplements. We should therefore most likely focus on this dataset rather than the monthly update data. The exclusions go back to 1977, and because Medicare policies have changed since then, we need to decide how far back to go; data from the 1970s could throw off our modeling. Given the NPI change in 2008, we will most likely want to filter the data to 2008 onward, or to an even more recent starting point.
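The `min()` and `max()` calls above work on strings only because the YYYYMMDD format sorts chronologically. For filtering, it is safer to parse the column into real dates first. The sketch below shows one possible approach on a made-up frame. The column name `EXCLDATE_DT` and the 1 January 2008 cutoff are assumptions, not a decision we have made.

```python
import pandas as pd

# Hypothetical rows; dates use the same YYYYMMDD string format as EXCLDATE.
toy = pd.DataFrame(
    {'EXCLDATE': ['19770701', '20071231', '20080101', '20200220']},
    dtype=object,
)

# Parse the strings into real dates; unparseable values become NaT.
toy['EXCLDATE_DT'] = pd.to_datetime(toy['EXCLDATE'], format='%Y%m%d', errors='coerce')

# Assumed cutoff: keep exclusions dated 1 January 2008 or later.
cutoff = pd.Timestamp('2008-01-01')
recent = toy[toy['EXCLDATE_DT'] >= cutoff]

print(recent['EXCLDATE'].tolist())
```

Using `errors='coerce'` means any malformed date becomes `NaT` and drops out of the filtered result, so it would be worth checking `toy['EXCLDATE_DT'].isna().sum()` on the real data before filtering.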