In [1]:
import pandas as pd
import pickle
import xlrd
import numpy as np
import matplotlib.pyplot as plt

PWS_SIZE - Indicates Public Water Systems by size of population served. 
Size	Description
Very Small	500 or less
Small	501 - 3,300
Medium	3,301 - 10,000
Large	10,001 - 100,000
Very Large	>100,000

SERIOUS_VIOLATOR - 'Yes' indicates a public water system with unresolved serious, multiple, and/or continuing violations that is designated as a priority candidate for formal enforcement, as directed by EPA's Drinking Water Enforcement Response Policy. EPA designates systems as serious violators so that the drinking water system and primacy agency will act quickly to resolve the most significant noncompliance. Many public water systems with violations, however, are not serious violators. Operators and the primacy agencies are expected to correct the violations at non-serious violators as well, but without the stricter requirements and deadlines applicable to serious violators. If the violations at a non-serious violator are left uncorrected, that system may become a serious violator. When a serious violator has received formal enforcement action or has returned to compliance, it is no longer designated a serious violator. EPA updates its serious violator list on a quarterly basis.

In [2]:
serious_violators = pd.read_csv('../data/SDWA_SERIOUS_VIOLATORS.csv')
serious_violators.head(30)

Unnamed: 0,PWSID,PWS_NAME,CITY_SERVED,STATE,STATE_NAME,PWS_TYPE_CODE,PWS_TYPE_SHORT,SOURCE_WATER,PWS_SIZE,POPULATION_SERVED_COUNT,FISCAL_YEAR,SERIOUS_VIOLATOR
0,FL4501229,"RIVIERA BEACH UTILITY DISTRICT, CITY OF",RIVIERA BEACH,FL,Florida,CWS,Community,GW,Large,31500,2015,Y
1,AK2299032,OMNI PARKS STORE,GLENNALLEN,AK,Alaska,TNCWS,Non-Community,GW,Very Small,222,2011,Y
2,NJ1708300,E I DUPONT CHAMBER WORKS,PENNSVILLE TWP.-1708,NJ,New Jersey,NTNCWS,Non-Community,SW,Small,920,2013,Y
3,NJ1708300,E I DUPONT CHAMBER WORKS,PENNSVILLE TWP.-1708,NJ,New Jersey,NTNCWS,Non-Community,SW,Small,920,2012,Y
4,ID1280084,HAUSER LAKE WATER ASSN INC,,ID,Idaho,CWS,Community,GW,Small,1200,2011,Y
5,ID6030036,MARSH VALLEY JR AND SR HIGH SCHOOL,,ID,Idaho,NTNCWS,Non-Community,GW,Small,624,2014,Y
6,VT0005626,VALLEY PARK CONDOMINIUM,KILLINGTON,VT,Vermont,CWS,Community,GW,Very Small,42,2014,Y
7,VT0005626,VALLEY PARK CONDOMINIUM,KILLINGTON,VT,Vermont,CWS,Community,GW,Very Small,42,2013,Y
8,VT0005384,WHIFFLETREE CONDOMINIUM,KILLINGTON,VT,Vermont,CWS,Community,GW,Very Small,189,2014,Y
9,OK2001036,MARY JACKSON TP,,OK,Oklahoma,CWS,Community,GW,Very Small,50,2014,Y


In [3]:
serious_violators.shape

(48908, 12)

In [4]:
serious_violators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48908 entries, 0 to 48907
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   PWSID                    48908 non-null  object
 1   PWS_NAME                 48908 non-null  object
 2   CITY_SERVED              22217 non-null  object
 3   STATE                    48908 non-null  object
 4   STATE_NAME               48902 non-null  object
 5   PWS_TYPE_CODE            48908 non-null  object
 6   PWS_TYPE_SHORT           48908 non-null  object
 7   SOURCE_WATER             48908 non-null  object
 8   PWS_SIZE                 48908 non-null  object
 9   POPULATION_SERVED_COUNT  48908 non-null  int64 
 10  FISCAL_YEAR              48908 non-null  int64 
 11  SERIOUS_VIOLATOR         48908 non-null  object
dtypes: int64(2), object(10)
memory usage: 4.5+ MB


In [5]:
serious_violators.describe()

Unnamed: 0,POPULATION_SERVED_COUNT,FISCAL_YEAR
count,48908.0,48908.0
mean,2207.711,2014.870042
std,26542.36,2.878089
min,0.0,2011.0
25%,50.0,2012.0
50%,132.0,2015.0
75%,500.0,2017.0
max,1661445.0,2020.0


For visual EDA, plot a bar chart with fiscal year on the x-axis and the number of serious violators on the y-axis.
https://echo.epa.gov/help/drinking-water-dashboard-help
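
A minimal sketch of the bar chart described above. The small `sample` frame is a stand-in for `serious_violators` (a few PWSID/FISCAL_YEAR pairs copied from the head() output), and `plot_year_counts` is a hypothetical helper name, not something already in this notebook.

```python
import pandas as pd

# Stand-in for serious_violators; the real frame comes from SDWA_SERIOUS_VIOLATORS.csv.
sample = pd.DataFrame({
    "PWSID": ["FL4501229", "AK2299032", "NJ1708300", "NJ1708300", "VT0005626", "VT0005626"],
    "FISCAL_YEAR": [2015, 2011, 2013, 2012, 2014, 2013],
})

# Count rows per fiscal year and put the years in chronological order.
year_counts = sample["FISCAL_YEAR"].value_counts().sort_index()
print(year_counts)


def plot_year_counts(counts, title="Serious violators by fiscal year"):
    """Draw the bar chart (hypothetical helper). matplotlib is imported here
    so the counting step above runs even where plotting is unavailable."""
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar", figsize=(8, 4))
    ax.set_xlabel("Fiscal year")
    ax.set_ylabel("Number of serious violators")
    ax.set_title(title)
    plt.tight_layout()
    return ax
```

On the real data the call would be `plot_year_counts(serious_violators["FISCAL_YEAR"].value_counts().sort_index())`.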

VIOLATION_NAME - Violations required to be reported under SDWA are grouped into the following categories:
Health-based violations - Violations of maximum contaminant levels (MCLs) or maximum residual disinfectant levels (MRDLs), which specify the highest concentrations of contaminants or disinfectants, respectively, allowed in drinking water; or of treatment technique (TT) rules, which specify required processes intended to reduce the amounts of contaminants in drinking water. MCLs, MRDLs, and treatment technique rules are all health-based drinking water standards.
Monitoring and reporting (MR) violations - Failure to conduct regular monitoring of drinking water quality, as required by SDWA, or to submit monitoring results in a timely fashion to the state environmental agency or EPA.
Other violations - Violations of other requirements of SDWA, such as failing to issue annual consumer confidence reports or to conduct periodic sanitary surveys.

In [6]:
violations = pd.read_csv('../data/SDWA_VIOLATIONS.csv')



In [7]:
violations.head()

Unnamed: 0,PWSID,PWS_NAME,CITY_SERVED,STATE,STATE_NAME,PWS_TYPE_CODE,PWS_TYPE_SHORT,SOURCE_WATER,PWS_SIZE,POPULATION_SERVED_COUNT,...,VIOLATION_NAME,VIOLATION_ID,RULE_NAME,BEGIN_YEAR,END_YEAR,RTC_YEAR,ACUTE_HEALTH_BASED,HEALTH_BASED,MONITORING_REPORTING,PUBLIC_NOTIF_OTHER
0,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119615,St2 DBP,2017,2017.0,,N,Y,N,N
1,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,Monitoring and Reporting (DBP),119611,St1 DBP,2017,2017.0,,N,N,Y,N
2,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119606,St2 DBP,2017,2017.0,,N,Y,N,N
3,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119599,St2 DBP,2016,2016.0,,N,Y,N,N
4,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119597,St2 DBP,2017,2017.0,,N,Y,N,N


In [8]:
violations.shape

(3142143, 21)

In [9]:
violations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142143 entries, 0 to 3142142
Data columns (total 21 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   PWSID                    object 
 1   PWS_NAME                 object 
 2   CITY_SERVED              object 
 3   STATE                    object 
 4   STATE_NAME               object 
 5   PWS_TYPE_CODE            object 
 6   PWS_TYPE_SHORT           object 
 7   SOURCE_WATER             object 
 8   PWS_SIZE                 object 
 9   POPULATION_SERVED_COUNT  int64  
 10  FISCAL_YEAR              int64  
 11  VIOLATION_NAME           object 
 12  VIOLATION_ID             object 
 13  RULE_NAME                object 
 14  BEGIN_YEAR               int64  
 15  END_YEAR                 float64
 16  RTC_YEAR                 float64
 17  ACUTE_HEALTH_BASED       object 
 18  HEALTH_BASED             object 
 19  MONITORING_REPORTING     object 
 20  PUBLIC_NOTIF_OTHER       object 
dtypes: float64(2), int64(3), object(16)

In [10]:
#violations.describe()

In [11]:
#df.round({'dogs': 1, 'cats': 0})
#violations.round({'POPULATION_SERVED_COUNT':0, 'FISCAL_YEAR':0, 'BEGIN_YEAR':0, 'END_YEAR':0, 'RTC_YEAR':0})


# Map out violations
Find out how many NaNs there are in the CITY_SERVED field. See whether a function can fill in the missing cities using PWS_NAME. Is it possible to get latitude and longitude from PWS_NAME?

In [12]:
count = violations["CITY_SERVED"].isna().sum()
print(count)

1734086


Number of entries minus NaN values:
3,142,143 - 1,734,086 = 1,408,057
So 1,734,086 rows have NaN in the city column (more than half),
and 1,408,057 rows have a value in the city column.
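
The same arithmetic can be done in pandas rather than by hand. This is a small sketch on a made-up four-row stand-in for `violations`; only the column name CITY_SERVED is taken from the data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the violations frame.
sample = pd.DataFrame({
    "PWSID": ["OK1020515", "TX0840021", "AK2210574", "FL1464061"],
    "CITY_SERVED": [np.nan, np.nan, "ANCHORAGE", "EGLIN AFB"],
})

missing = sample["CITY_SERVED"].isna().sum()         # rows without a city
present = sample["CITY_SERVED"].notna().sum()        # rows with a city
share_missing = sample["CITY_SERVED"].isna().mean()  # fraction of rows without a city

print(f"missing: {missing:,}  present: {present:,}  share missing: {share_missing:.1%}")
```

With the counts reported above, 1,734,086 / 3,142,143 is about 55.2% missing.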

In [13]:
# Rows where CITY_SERVED is missing
nullcity_violations = violations.loc[violations["CITY_SERVED"].isnull()]
nullcity_violations.head(50)

Unnamed: 0,PWSID,PWS_NAME,CITY_SERVED,STATE,STATE_NAME,PWS_TYPE_CODE,PWS_TYPE_SHORT,SOURCE_WATER,PWS_SIZE,POPULATION_SERVED_COUNT,...,VIOLATION_NAME,VIOLATION_ID,RULE_NAME,BEGIN_YEAR,END_YEAR,RTC_YEAR,ACUTE_HEALTH_BASED,HEALTH_BASED,MONITORING_REPORTING,PUBLIC_NOTIF_OTHER
0,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119615,St2 DBP,2017,2017.0,,N,Y,N,N
1,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,Monitoring and Reporting (DBP),119611,St1 DBP,2017,2017.0,,N,N,Y,N
2,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119606,St2 DBP,2017,2017.0,,N,Y,N,N
3,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119599,St2 DBP,2016,2016.0,,N,Y,N,N
4,OK1020515,CHECOTAH PWA,,OK,Oklahoma,CWS,Community,SW,Medium,3481,...,"MCL, Average",119597,St2 DBP,2017,2017.0,,N,Y,N,N
34,TX0840021,GALVESTON COUNTY MUD 12,,TX,Texas,CWS,Community,SW,Medium,4542,...,Follow-up Or Routine LCR Tap M/R,518,LCR,2017,,2017.0,N,N,Y,N
35,TX1240001,JIM HOGG COUNTY WCID 2,,TX,Texas,CWS,Community,GW,Medium,5526,...,Water Quality Parameter M/R,990009968,LCR,2017,2017.0,,N,N,Y,N
36,TX1240001,JIM HOGG COUNTY WCID 2,,TX,Texas,CWS,Community,GW,Medium,5526,...,"MCL, Average",990009973,Arsenic,2017,2017.0,,N,Y,N,N
37,TX1240001,JIM HOGG COUNTY WCID 2,,TX,Texas,CWS,Community,GW,Medium,5526,...,"MCL, Average",990009971,Arsenic,2017,2017.0,,N,Y,N,N
38,TX1240001,JIM HOGG COUNTY WCID 2,,TX,Texas,CWS,Community,GW,Medium,5526,...,"MCL, Average",990009965,Arsenic,2017,2017.0,,N,Y,N,N
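
One way to try the idea of filling missing cities from PWS_NAME is to borrow the city from another row with the same system name. This is a sketch under assumptions: `fill_city_from_name` is a hypothetical helper, and it assumes that a (STATE, PWS_NAME) pair refers to a single system. It can only help where at least one row for that system has a city. Systems such as CHECOTAH PWA, which has no city in any row shown above, would still need an outside source.

```python
import numpy as np
import pandas as pd


def fill_city_from_name(df):
    """Fill missing CITY_SERVED from other rows with the same STATE and PWS_NAME.

    Hypothetical helper: rows whose system never has a city stay NaN.
    """
    # First non-null city within each (STATE, PWS_NAME) group, aligned to df's rows.
    first_known = df.groupby(["STATE", "PWS_NAME"])["CITY_SERVED"].transform("first")
    out = df.copy()
    out["CITY_SERVED"] = out["CITY_SERVED"].fillna(first_known)
    return out


sample = pd.DataFrame({
    "STATE": ["OK", "AK", "AK", "OK"],
    "PWS_NAME": ["CHECOTAH PWA", "TOTEM TRAILER TOWN TC",
                 "TOTEM TRAILER TOWN TC", "CHECOTAH PWA"],
    "CITY_SERVED": [np.nan, "ANCHORAGE", np.nan, np.nan],
})

filled = fill_city_from_name(sample)
print(filled)
```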


# Map out violations
Each PWSID links to a page that gives the county name. Is it possible to map from the PWSID instead of the city? County would be more accurate. This would require web scraping the page linked to each PWSID, and they all have different URLs.

Make a function that iterates over the IDs and pulls the county out of each web page?

Can we maybe focus on one particular State? Do some EDA based on number of state violations. 
First pull in some more data to see if there are addresses anywhere. - No luck so far, see below. 
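
A sketch of the "iterate over IDs" idea, assuming some lookup exists. `add_county` and `lookup_county` are hypothetical names: `lookup_county` stands in for whatever scraper or API client ends up being used, and the county values below are placeholders, not real data. The one design point shown is to call the lookup once per unique PWSID rather than once per row, since the violations table repeats each ID many times.

```python
import pandas as pd


def add_county(df, lookup_county):
    """Attach a COUNTY column, calling lookup_county once per unique PWSID.

    lookup_county is a hypothetical callable: PWSID -> county name or None.
    """
    unique_ids = df["PWSID"].dropna().unique()
    county_by_id = {pwsid: lookup_county(pwsid) for pwsid in unique_ids}
    out = df.copy()
    out["COUNTY"] = out["PWSID"].map(county_by_id)
    return out


# Stand-in lookup backed by a dict; placeholder county names for illustration only.
fake_pages = {"OK1020515": "County A", "TX1240001": "County B"}
calls = []


def fake_lookup(pwsid):
    calls.append(pwsid)  # record each lookup to show how many are made
    return fake_pages.get(pwsid)


sample = pd.DataFrame({"PWSID": ["OK1020515", "OK1020515", "TX1240001", "TX0840021"]})
result = add_county(sample, fake_lookup)
print(result)
print("lookups made:", len(calls))
```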

In [14]:
sdwa_watersystems = pd.read_csv('../data/SDWA_PUB_WATER_SYSTEMS.csv')
sdwa_watersystems.head()

Unnamed: 0,PWSID,FISCAL_YEAR,STATE,STATE_NAME,EPA_REGION,PWS_TYPE_CODE,PWS_NAME,CITY_SERVED,STATE_CODE,SOURCE_WATER,IS_TRIBAL,SYSTEM_SIZE,POPULATION_SERVED_COUNT
0,AK2210574,2014,AK,Alaska,10,CWS,TOTEM TRAILER TOWN TC,ANCHORAGE,AK,GW,N,Very Small,480
1,AK2210574,2013,AK,Alaska,10,CWS,TOTEM TRAILER TOWN TC,ANCHORAGE,AK,GW,N,Very Small,480
2,AK2210574,2012,AK,Alaska,10,CWS,TOTEM TRAILER TOWN TC,ANCHORAGE,AK,GW,N,Very Small,480
3,AK2210574,2011,AK,Alaska,10,CWS,TOTEM TRAILER TOWN TC,ANCHORAGE,AK,GW,N,Very Small,480
4,OK2004301,2015,OK,Oklahoma,6,CWS,MARIETTA PWA,,OK,GW,N,Small,2445


In [15]:
sdwa_sitevisits = pd.read_csv('../data/SDWA_SITE_VISITS.csv')
sdwa_sitevisits.head()

Unnamed: 0,PWSID,PWS_NAME,CITY_SERVED,STATE,STATE_NAME,PWS_TYPE_CODE,PWS_TYPE_SHORT,SOURCE_WATER,PWS_SIZE,POPULATION_SERVED_COUNT,FISCAL_YEAR,SITE_VISIT_DATE,SANITARY_SURVEY
0,AK2210574,TOTEM TRAILER TOWN TC,ANCHORAGE,AK,Alaska,CWS,Community,GW,Very Small,480,2012,05/18/2012,Y
1,OK2004301,MARIETTA PWA,,OK,Oklahoma,CWS,Community,GW,Small,2445,2015,04/15/2015,Y
2,OK2004301,MARIETTA PWA,,OK,Oklahoma,CWS,Community,GW,Small,2445,2015,09/08/2015,Y
3,FL1464061,EGLIN SITE C-3 (LASER),EGLIN AFB,FL,Florida,NTNCWS,Non-Community,GW,Very Small,25,2014,06/17/2014,Y
4,FL1464061,EGLIN SITE C-3 (LASER),EGLIN AFB,FL,Florida,NTNCWS,Non-Community,GW,Very Small,25,2013,08/01/2013,N


# Initial ideas:
Count of violations by State
Chart increase of violations over time (year over year) - see if it correlates with new legislation. 
Focus on a specific region or state - drill down from there. 
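
The first two ideas can be sketched with `value_counts` and `groupby`. The frame below is a made-up stand-in for `violations`. Using BEGIN_YEAR as the time axis is an assumption; FISCAL_YEAR may be the better choice.

```python
import pandas as pd

# Stand-in for the violations frame.
sample = pd.DataFrame({
    "STATE": ["TX", "TX", "TX", "PA", "PA", "OK"],
    "BEGIN_YEAR": [2016, 2017, 2017, 2016, 2017, 2017],
})

# Count of violations by state, largest first.
by_state = sample["STATE"].value_counts()

# Violations per year, and the change from the previous year.
by_year = sample["BEGIN_YEAR"].value_counts().sort_index()
yoy_change = by_year.diff()

# State-by-year table, useful for drilling into one state.
state_year = sample.groupby(["STATE", "BEGIN_YEAR"]).size().unstack(fill_value=0)

print(by_state)
print(yoy_change)
print(state_year)
```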

Possible datasets to pull in: median income, population, age demographics, weather averages over time (year over year, or by month?), coronavirus data?

Datasets to continue: waterborne illnesses, with the same counts as above: count of waterborne diseases by state, and increase of diseases over time (year over year). Focus on a specific region or state and drill down from there.

In [16]:
violations.STATE_NAME.value_counts()

Texas                                                                                       345999
Pennsylvania                                                                                311978
Mississippi                                                                                 230522
West Virginia                                                                               129322
Alaska                                                                                      116349
                                                                                             ...  
Chickasaw Nation, Oklahoma                                                                       1
Robinson Rancheria of Pomo Indians of California                                                 1
Big Valley Band of Pomo Indians of the Big Valley Rancheria, California                          1
Augustine Band of Cahuilla Mission Indians of the Augustine Reservation, California              1
La Posta B

In [17]:
violation_counts = violations.STATE_NAME.value_counts()
violation_counts.head()

Texas            345999
Pennsylvania     311978
Mississippi      230522
West Virginia    129322
Alaska           116349
Name: STATE_NAME, dtype: int64

# Join serious violators with violations

In [18]:
#all_violations = pd.merge(serious_violators, violations, left_index = True, right_index = True, how = 'inner')
#all_violations.head()
#all_violations.shape

In [19]:
all_violations = pd.concat([serious_violators, violations])
all_violations.shape

(3191051, 22)

In [20]:
all_violations.columns

Index(['PWSID', 'PWS_NAME', 'CITY_SERVED', 'STATE', 'STATE_NAME',
       'PWS_TYPE_CODE', 'PWS_TYPE_SHORT', 'SOURCE_WATER', 'PWS_SIZE',
       'POPULATION_SERVED_COUNT', 'FISCAL_YEAR', 'SERIOUS_VIOLATOR',
       'VIOLATION_NAME', 'VIOLATION_ID', 'RULE_NAME', 'BEGIN_YEAR', 'END_YEAR',
       'RTC_YEAR', 'ACUTE_HEALTH_BASED', 'HEALTH_BASED',
       'MONITORING_REPORTING', 'PUBLIC_NOTIF_OTHER'],
      dtype='object')

In [21]:
all_violations.shape

(3191051, 22)

In [22]:
all_violations.head()

Unnamed: 0,PWSID,PWS_NAME,CITY_SERVED,STATE,STATE_NAME,PWS_TYPE_CODE,PWS_TYPE_SHORT,SOURCE_WATER,PWS_SIZE,POPULATION_SERVED_COUNT,...,VIOLATION_NAME,VIOLATION_ID,RULE_NAME,BEGIN_YEAR,END_YEAR,RTC_YEAR,ACUTE_HEALTH_BASED,HEALTH_BASED,MONITORING_REPORTING,PUBLIC_NOTIF_OTHER
0,FL4501229,"RIVIERA BEACH UTILITY DISTRICT, CITY OF",RIVIERA BEACH,FL,Florida,CWS,Community,GW,Large,31500,...,,,,,,,,,,
1,AK2299032,OMNI PARKS STORE,GLENNALLEN,AK,Alaska,TNCWS,Non-Community,GW,Very Small,222,...,,,,,,,,,,
2,NJ1708300,E I DUPONT CHAMBER WORKS,PENNSVILLE TWP.-1708,NJ,New Jersey,NTNCWS,Non-Community,SW,Small,920,...,,,,,,,,,,
3,NJ1708300,E I DUPONT CHAMBER WORKS,PENNSVILLE TWP.-1708,NJ,New Jersey,NTNCWS,Non-Community,SW,Small,920,...,,,,,,,,,,
4,ID1280084,HAUSER LAKE WATER ASSN INC,,ID,Idaho,CWS,Community,GW,Small,1200,...,,,,,,,,,,
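
The head() output above has empty violation columns for the rows that came from `serious_violators`, because `pd.concat` stacks the two tables (48,908 + 3,142,143 = 3,191,051 rows) rather than matching them. If the goal is to flag each violation with the serious-violator status of its system, a key-based merge is one option. The sketch below assumes PWSID plus FISCAL_YEAR is the right key, and that filling unmatched rows with "N" is acceptable. The VIOLATION_ID values are placeholders.

```python
import pandas as pd

# Stand-ins for the two frames; VIOLATION_ID values are placeholders.
serious = pd.DataFrame({
    "PWSID": ["NJ1708300", "VT0005626"],
    "FISCAL_YEAR": [2013, 2014],
    "SERIOUS_VIOLATOR": ["Y", "Y"],
})
viol = pd.DataFrame({
    "PWSID": ["NJ1708300", "NJ1708300", "OK1020515"],
    "FISCAL_YEAR": [2013, 2013, 2017],
    "VIOLATION_ID": ["a1", "a2", "b1"],
})

# concat stacks rows: the row counts add up and non-shared columns are NaN.
stacked = pd.concat([serious, viol], ignore_index=True)
print(stacked.shape)

# A left merge keeps one row per violation and adds the flag where the keys match.
flagged = viol.merge(serious, on=["PWSID", "FISCAL_YEAR"], how="left")
flagged["SERIOUS_VIOLATOR"] = flagged["SERIOUS_VIOLATOR"].fillna("N")
print(flagged)
```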


In [23]:
all_violations.to_csv("../data/all_violations.csv", index=False)

In [24]:
#type(violation_counts)

In [25]:
#violation_counts = violation_counts.reset_index()
#violation_counts

In [26]:
#violation_counts.columns = ['state_name', 'count']
#violation_counts.head(20)

In [27]:
#serious_violators.STATE_NAME.value_counts()
#serious_violators = serious_violators.STATE_NAME.value_counts()
#serious_violators.head()

In [28]:
#serious_violators = serious_violators.reset_index()
#serious_violators

In [29]:
#serious_violators.columns = ['state_name', 'count']
#serious_violators.head(20)

# Take a look at waterborne disease data

In [30]:
#water_2018 = pd.read_csv("../data/2018-table2h.csv")
#water_2018.head()

In [31]:
#this data is from years 1998-2017
water_outbreaks = pd.read_excel("../data/NationalOutbreakPublicDataTool.xlsx")
water_outbreaks.tail()

Unnamed: 0,Year,Month,State,Primary Mode,Etiology,Serotype or Genotype,Etiology Status,Setting,Illnesses,Hospitalizations,...,Deaths,Info on Deaths,Food Vehicle,Food Contaminated Ingredient,IFSAC Category,Water Exposure,Water Type,Animal Type,Animal Type Specify,Water Status
738,2017,6,Michigan,Water,Legionella pneumophila; Legionella pneumophila,serogroup 1; serogroup 2-14,Confirmed; Confirmed,Hospital/Health Care,4,4.0,...,0.0,4.0,,,,Drinking water,Community,,,Cleaned
739,2011,11,Louisiana,Water,Legionella pneumophila,serogroup 1,Confirmed,,3,3.0,...,0.0,3.0,,,,Recreational water -- treated,,,,Cleaned
740,2017,5,Illinois,Water,Legionella pneumophila; Legionella anisa,serogroup 1;,Confirmed; Confirmed,Hospital/Health Care,2,2.0,...,1.0,2.0,,,,Drinking water,Community,,,Cleaned
741,2017,2,Texas,Water,Legionella pneumophila,serogroup 1,Confirmed,Unknown,3,3.0,...,0.0,3.0,,,,Recreational water -- treated,Spa/Whirlpool/Hot Tub,,,Cleaned
742,2016,7,New York,Water,Avian schistosomes,,Suspected,Beach - Public,56,,...,,,,,,Recreational water -- untreated,Lake/Reservoir/Impoundment,,,Cleaned


In [32]:
water_outbreaks.shape

(743, 21)

In [33]:
# Keep outbreaks whose Etiology mentions Legionella (any species or combination)
legionella = water_outbreaks[water_outbreaks['Etiology'].str.contains('Legionella')]
legionella.shape

(290, 21)

In [34]:
legionella.head()

Unnamed: 0,Year,Month,State,Primary Mode,Etiology,Serotype or Genotype,Etiology Status,Setting,Illnesses,Hospitalizations,...,Deaths,Info on Deaths,Food Vehicle,Food Contaminated Ingredient,IFSAC Category,Water Exposure,Water Type,Animal Type,Animal Type Specify,Water Status
11,2009,9,Ohio,Water,Legionella unknown,,Confirmed,Long term care facility,2,2.0,...,0.0,2.0,,,,Undetermined water,Unknown,,,Reviewed
12,2009,7,Florida,Water,Legionella unknown,,Confirmed,Club (Requires Membership),2,2.0,...,0.0,2.0,,,,Drinking water,Community,,,Reviewed
17,2009,6,Utah,Water,Legionella pneumophila,serogroup 1,Confirmed,Hotel/Motel/Lodge/Inn,5,5.0,...,0.0,5.0,,,,Drinking water,Community,,,Reviewed
19,2009,9,Illinois,Water,Legionella pneumophila,serogroup 1,Confirmed,Assisted Living/Rehab,8,8.0,...,2.0,8.0,,,,Undetermined water,Fountain - Ornamental; Watering - indoor plant...,,,Reviewed
28,2009,4,New York,Water,Legionella pneumophila,serogroup 1,Confirmed,Hospital/Health Care,3,3.0,...,2.0,3.0,,,,Drinking water,Community; Community; Community; Community,,,Reviewed
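
A slightly more defensive version of the Legionella filter, followed by a per-year summary in line with the "increase of diseases over time" idea. The sample rows are a stand-in loosely based on the output above; the last row, with a missing Etiology, is invented to show why `na=False` can matter (without it, a NaN in the mask would make boolean indexing fail).

```python
import numpy as np
import pandas as pd

# Stand-in for water_outbreaks; the last row is invented to include a missing Etiology.
sample = pd.DataFrame({
    "Year": [2009, 2009, 2017, 2016, 2017],
    "Etiology": [
        "Legionella unknown",
        "Legionella pneumophila",
        "Legionella pneumophila; Legionella anisa",
        "Avian schistosomes",
        np.nan,
    ],
    "Illnesses": [2, 5, 2, 56, 3],
})

# na=False treats missing etiologies as non-matches; case=False ignores capitalisation.
mask = sample["Etiology"].str.contains("Legionella", case=False, na=False)
legionella = sample[mask]

# Outbreak count and total illnesses per year.
per_year = legionella.groupby("Year")["Illnesses"].agg(["count", "sum"])
print(per_year)
```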


In [35]:
legionella.to_csv("../data/legionella.csv", index=False)

# Read in CDC waterborne disease weekly reports

In [36]:
col_df = pd.read_csv('../data/2016-table2h.txt', sep=r'\\t', engine='python', skiprows = 3, nrows = 9)
columns = [col for col in col_df['Reporting Area']]
print(columns)

['Human immunodeficiency virus diagnoses', 'Influenza-associated pediatric mortality', 'Invasive pneumococcal disease, All ages, Confirmed', 'Invasive pneumococcal disease, All ages, Probable', 'Invasive pneumococcal disease, Age <5 years, Confirmed', 'Invasive pneumococcal disease, Age <5 years, Probable', 'Legionellosis', 'Leptospirosis']


In [37]:
read_txt_file = pd.read_csv('../data/2016-table2h.txt', sep='\\\t', names=columns, engine='python', skiprows = 14)
read_txt_file.head()

Unnamed: 0,Human immunodeficiency virus diagnoses,Influenza-associated pediatric mortality,"Invasive pneumococcal disease, All ages, Confirmed","Invasive pneumococcal disease, All ages, Probable","Invasive pneumococcal disease, Age <5 years, Confirmed","Invasive pneumococcal disease, Age <5 years, Probable",Legionellosis,Leptospirosis
United States,34775,82,17603,23,1137,5,6141,78
New England,966,4,1106,�,58,�,310,4
Connecticut,201,�,239,�,13,�,78,N
Maine,43,1,133,�,8,�,16,�
Massachusetts,607,2,513,�,31,�,141,2


# Get rid of the � sign

In [38]:
read_txt_file['Influenza-associated pediatric mortality'] = read_txt_file['Influenza-associated pediatric mortality'].replace('�', np.nan)
read_txt_file.head()

Unnamed: 0,Human immunodeficiency virus diagnoses,Influenza-associated pediatric mortality,"Invasive pneumococcal disease, All ages, Confirmed","Invasive pneumococcal disease, All ages, Probable","Invasive pneumococcal disease, Age <5 years, Confirmed","Invasive pneumococcal disease, Age <5 years, Probable",Legionellosis,Leptospirosis
United States,34775,82.0,17603,23,1137,5,6141,78
New England,966,4.0,1106,�,58,�,310,4
Connecticut,201,,239,�,13,�,78,N
Maine,43,1.0,133,�,8,�,16,�
Massachusetts,607,2.0,513,�,31,�,141,2


In [39]:
#read_txt_file['Influenza-associated pediatric mortality'].value_counts()

In [40]:
read_txt_file = read_txt_file.replace('�', np.nan)
read_txt_file.head()

Unnamed: 0,Human immunodeficiency virus diagnoses,Influenza-associated pediatric mortality,"Invasive pneumococcal disease, All ages, Confirmed","Invasive pneumococcal disease, All ages, Probable","Invasive pneumococcal disease, Age <5 years, Confirmed","Invasive pneumococcal disease, Age <5 years, Probable",Legionellosis,Leptospirosis
United States,34775,82.0,17603,23.0,1137,5.0,6141,78
New England,966,4.0,1106,,58,,310,4
Connecticut,201,,239,,13,,78,N
Maine,43,1.0,133,,8,,16,
Massachusetts,607,2.0,513,,31,,141,2
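
After the replacement the columns still hold strings, so sums and plots will not work yet. A possible next step is converting everything to numbers with `pd.to_numeric`. This sketch uses a three-row stand-in for `read_txt_file` with shortened column names. Note that `errors="coerce"` also turns the letter code "N" into NaN; if that code carries a distinct meaning in the source table, it may be worth recording in a separate flag column first.

```python
import numpy as np
import pandas as pd

# Stand-in for read_txt_file; "\ufffd" is the replacement character shown as the odd sign above.
sample = pd.DataFrame(
    {
        "Legionellosis": ["6141", "310", "78"],
        "Leptospirosis": ["78", "4", "N"],
        "Probable": ["23", "\ufffd", "\ufffd"],
    },
    index=["United States", "New England", "Connecticut"],
)

# Replace the placeholder, then convert every column; non-numeric codes become NaN.
cleaned = sample.replace("\ufffd", np.nan).apply(pd.to_numeric, errors="coerce")

print(cleaned)
print(cleaned.dtypes)
```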


# Medicare data for testing 


In [41]:
#Legionella testing
HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'] == '86713']) 
               
                
hcpcs_pay_86713 = pd.concat(HCPCS_rows, ignore_index=True)
print(hcpcs_pay_86713.shape)

hcpcs_pay_86713.head()

(20, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount
0,1063497451,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,69 1ST AVE,,RARITAN,...,86713,Analysis for antibody to Legionella (waterborn...,N,43.0,42,42,20.081163,142.537209,19.679535,20.57
1,1134277494,"BIO-REFERENCE LABORATORIES, INC.",,,,,O,481 EDWARD H ROSS DR,,ELMWOOD PARK,...,86713,Analysis for antibody to Legionella (waterborn...,N,75.0,39,43,20.99,75.050533,20.57,20.57
2,1194769497,"CLINICAL PATHOLOGY LABORATORIES, INC.",,,,,O,9200 WALL ST,,AUSTIN,...,86713,Analysis for antibody to Legionella (waterborn...,N,26.0,14,14,11.22,92.701923,10.995385,20.57
3,1235234402,"ENZO CLINICAL LABS, INC.",,,,,O,60 EXECUTIVE BLVD,,FARMINGDALE,...,86713,Analysis for antibody to Legionella (waterborn...,N,14.0,14,14,20.99,273.0,20.57,20.57
4,1245307818,QUEST DIAGNOSTICS INCORPORATED MD,,,,,O,1901 SULPHUR SPRING RD,,BALTIMORE,...,86713,Analysis for antibody to Legionella (waterborn...,N,88.0,48,69,20.99,155.377045,20.499773,20.57


In [42]:
#Detection Test for legionella 
HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'] == '87278']) 
               
                
hcpcs_pay_87278 = pd.concat(HCPCS_rows, ignore_index=True)

print(hcpcs_pay_87278.shape)
hcpcs_pay_87278.head()

(3, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount
0,1366543795,NORTHWELL HEALTH LABORATORIES,,,,,O,10 NEVADA DR,,NEW HYDE PARK,...,87278,Detection test for legionella pneumophila (wat...,N,56.0,53,56,15.09,100.0,14.79,16.11
1,1447296272,"QUEST DIAGNOSTICS NICHOLS INSTITUTE, INC.",,,,,O,14225 NEWBROOK DR,,CHANTILLY,...,87278,Detection test for legionella pneumophila (wat...,N,47.0,30,47,15.09,153.775319,14.79,16.11
2,1538144910,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,1447 YORK CT,,BURLINGTON,...,87278,Detection test for legionella pneumophila (wat...,N,15.0,15,15,15.71,228.076667,15.4,16.11


In [65]:
my_list = ['87278', '86713', '87541']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'].isin(my_list)]) 
               
                
legion_testing_17 = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_testing_17.shape)
legion_testing_17.head(35)


(35, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount
0,1053721373,"TAUSTIN LABORATORIES, LLC",,,,,O,2868 ACTON RD,SUITE 207,VESTAVIA,...,87541,Detection test for legionella pneumophila (wat...,N,23.0,22,23,30.92,75.0,28.982609,45.128696
1,1063497451,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,69 1ST AVE,,RARITAN,...,86713,Analysis for antibody to Legionella (waterborn...,N,43.0,42,42,20.081163,142.537209,19.679535,20.57
2,1124442306,"TOTAL DIAGNOSTIX II, LLC",,,,,O,3740 BUSINESS DR STE 101,,MEMPHIS,...,87541,Detection test for legionella pneumophila (wat...,N,88.0,88,88,48.14,61.513636,47.18,47.18
3,1134277494,"BIO-REFERENCE LABORATORIES, INC.",,,,,O,481 EDWARD H ROSS DR,,ELMWOOD PARK,...,86713,Analysis for antibody to Legionella (waterborn...,N,75.0,39,43,20.99,75.050533,20.57,20.57
4,1134377781,DIATHERIX LABORATORIES LLC,,,,,O,601 GENOME WAY,SUITE 2100,HUNTSVILLE,...,87541,Detection test for legionella pneumophila (wat...,N,3235.0,2876,3235,30.0,30.0,29.229713,47.158015
5,1194769497,"CLINICAL PATHOLOGY LABORATORIES, INC.",,,,,O,9200 WALL ST,,AUSTIN,...,86713,Analysis for antibody to Legionella (waterborn...,N,26.0,14,14,11.22,92.701923,10.995385,20.57
6,1235234402,"ENZO CLINICAL LABS, INC.",,,,,O,60 EXECUTIVE BLVD,,FARMINGDALE,...,86713,Analysis for antibody to Legionella (waterborn...,N,14.0,14,14,20.99,273.0,20.57,20.57
7,1245307818,QUEST DIAGNOSTICS INCORPORATED MD,,,,,O,1901 SULPHUR SPRING RD,,BALTIMORE,...,86713,Analysis for antibody to Legionella (waterborn...,N,88.0,48,69,20.99,155.377045,20.499773,20.57
8,1255314704,LABORATORY CORPORATION OF AMERICA,,,,,O,5610 W LASALLE STREET,,TAMPA,...,86713,Analysis for antibody to Legionella (waterborn...,N,12.0,12,12,20.99,146.510833,20.57,20.57
9,1366479099,UNILAB CORPORATION,,,,,O,8401 FALLBROOK AVE,,WEST HILLS,...,86713,Analysis for antibody to Legionella (waterborn...,N,28.0,17,17,20.739286,202.016786,20.324286,20.57
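
The chunked filter is repeated in several cells, so it could be wrapped in one function. `filter_hcpcs` is a hypothetical helper name. The sketch also reads 'HCPCS Code' as a string explicitly. This is a precaution rather than a confirmed bug: with `chunksize`, pandas infers dtypes per chunk, so a chunk containing only numeric codes could come back as integers, and a comparison against '86713' would then match nothing. The demo writes a tiny temporary CSV so the block runs without the Medicare file.

```python
import os
import tempfile

import pandas as pd


def filter_hcpcs(path, codes, chunksize=1000):
    """Return the rows of a large CSV whose 'HCPCS Code' is in codes (hypothetical helper)."""
    wanted = {str(code) for code in codes}
    kept = []
    # dtype=str keeps codes like '86713' and 'G0008' as text in every chunk.
    for chunk in pd.read_csv(path, chunksize=chunksize, dtype={"HCPCS Code": str}):
        kept.append(chunk[chunk["HCPCS Code"].isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Demo on a small temporary file; the numbers are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    pd.DataFrame({
        "HCPCS Code": ["86713", "99213", "87278", "G0008", "87541"],
        "Number of Services": [43, 10, 56, 7, 23],
    }).to_csv(demo_path, index=False)
    demo = filter_hcpcs(demo_path, ["87278", "86713", "87541"], chunksize=2)

print(demo)
```

On the real data this would be called as `filter_hcpcs('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', my_list)`.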


In [60]:
legion_testing_17.columns

Index(['National Provider Identifier',
       'Last Name/Organization Name of the Provider',
       'First Name of the Provider', 'Middle Initial of the Provider',
       'Credentials of the Provider', 'Gender of the Provider',
       'Entity Type of the Provider', 'Street Address 1 of the Provider',
       'Street Address 2 of the Provider', 'City of the Provider',
       'Zip Code of the Provider', 'State Code of the Provider',
       'Country Code of the Provider', 'Provider Type',
       'Medicare Participation Indicator', 'Place of Service', 'HCPCS Code',
       'HCPCS Description', 'HCPCS Drug Indicator', 'Number of Services',
       'Number of Medicare Beneficiaries',
       'Number of Distinct Medicare Beneficiary/Per Day Services',
       'Average Medicare Allowed Amount', 'Average Submitted Charge Amount',
       'Average Medicare Payment Amount',
       'Average Medicare Standardized Amount'],
      dtype='object')

In [62]:
my_list = ['87540', '87542']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'].isin(my_list)]) 
               
                
legion_diagnosis = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_diagnosis.shape)
legion_diagnosis.head(15)

(0, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount


In [None]:
legion_diagnosis.head(15)

# Medicare data: diagnosis codes for Legionnaires' disease

87540
87541
87542

HCPCS_rows = []
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000):
    # Keep only the rows for HCPCS code 87540
    HCPCS_rows.append(chunk[chunk['HCPCS Code'] == '87540'])

hcpcs_pay_87540 = pd.concat(HCPCS_rows, ignore_index=True)

print(hcpcs_pay_87540.shape)
hcpcs_pay_87540.head()

asc_provider = asc_provider.drop(columns= ['Middle Initial of the Provider',
                            'Credentials of the Provider',
                            'Street Address 2 of the Provider',
                           'Medicare Participation Indicator',
                           'HCPCS Drug Indicator',
                           'Average Submitted Charge Amount',
                           'Gender of the Provider',
                            'First Name of the Provider',
                            'Average Medicare Payment Amount',
                            'Average Medicare Standardized Amount',
                           ])
                           
asc_provider['Zip Code of the Provider'] = asc_provider['Zip Code of the Provider'].apply( lambda x: (5 - len(str(x)))*'0' + str(x) if len(str(x)) <= 5 else (9 - len(str(x)))*'0' + str(x)[:-4]  )

# Use snake_case column names: spaces to underscores, then lowercase
asc_provider.columns = asc_provider.columns.str.replace(' ', '_').str.lower()


df_name = df_name.drop(columns=['average_medicare_payment_amount',
                                'outlier_services',
                                'average_medicare_outlier_amount',
                               'average_total_submitted_charges'])

df_name = df_name.rename(columns={'national_provider_identifier':'npi',
                  'last_name/organization_name_of_the_provider':'name',
                  'entity_type_of_the_provider':'entity_type',
                 'street_address_1_of_the_provider':'address',
                 'city_of_the_provider':'city',
                 'zip_code_of_the_provider':'zip_code',
                 'state_code_of_the_provider':'state',
                 'country_code_of_the_provider':'country',
                'number_of_medicare_beneficiaries':'medicare_beneficiaries',
                'number_of_distinct_medicare_beneficiary/per_day_services':'medicare_beneficiary/per_day_services'})
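
The ZIP-code lambda in the cleanup code above is hard to read at a glance. A named function with `zfill` appears to do the same thing for values of up to nine digits: pad 5-digit ZIPs that lost leading zeros, and cut ZIP+4 values down to their first five digits. `normalize_zip` is a hypothetical name and the example values are made up.

```python
def normalize_zip(value):
    """Return a 5-digit ZIP string from a ZIP or ZIP+4 that may have lost leading zeros."""
    digits = str(value)
    if len(digits) <= 5:
        # Plain ZIP: restore leading zeros, e.g. 1234 -> '01234'.
        return digits.zfill(5)
    # ZIP+4 stored as one number: restore zeros to 9 digits, keep the first 5.
    return digits.zfill(9)[:5]


examples = [1234, 37203, 12345678, 372031234]
print([normalize_zip(z) for z in examples])
```

It would be applied with `asc_provider['Zip Code of the Provider'].apply(normalize_zip)`.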

# Medicare data for 2016 back to 2012: pickle each year's filtered subset, add a year column to each one, then merge them together.
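
The plan above could look roughly like the sketch below. It assumes the per-year frames are collected in a dict keyed by year; the two tiny frames are made-up stand-ins for the real `legion_testing_*` results, and the `year` column name is an assumption rather than something the notebook defines.

```python
import pandas as pd

# Toy stand-ins for the per-year filtered frames (hypothetical values).
frames_by_year = {
    2016: pd.DataFrame({'HCPCS Code': ['86713', '87278'],
                        'Number of Services': [15.0, 11.0]}),
    2015: pd.DataFrame({'HCPCS Code': ['86713'],
                        'Number of Services': [22.0]}),
}

# Tag each frame with its year, then stack them into a single frame.
tagged = [df.assign(year=year) for year, df in frames_by_year.items()]
legion_testing_all = pd.concat(tagged, ignore_index=True)

print(legion_testing_all)
```

Each yearly frame could be saved with `DataFrame.to_pickle` before tagging and reloaded later with `pd.read_pickle`, which avoids re-reading the large CSV files.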


In [67]:
my_list = ['87278', '86713', '87541', '87540', '87542']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2016.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'].isin(my_list)]) 
               
                
legion_testing_16 = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_testing_16.shape)
legion_testing_16.head(35)


(28, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount
0,1366543795,NORTHWELL HEALTH LABORATORIES,,,,,O,10 NEVADA DR,,NEW HYDE PARK,...,86713,Analysis for antibody to Legionella (waterborn...,N,15.0,15,15,20.85,91.0,20.43,20.43
1,1124085576,QUEST DIAGNOSTICS INCORPORATED MI,,,,,O,1947 TECHNOLOGY DR,SUITE 100,TROY,...,86713,Analysis for antibody to Legionella (waterborn...,N,30.0,15,15,20.85,94.209,20.43,20.43
2,1740262880,LABORATORY CORPORATION OF AMERICA,,,,,O,5005 S 40TH ST,STE 1200,PHOENIX,...,86713,Analysis for antibody to Legionella (waterborn...,N,124.0,81,85,14.742097,103.100484,14.447903,20.43
3,1538144910,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,1447 YORK CT,,BURLINGTON,...,86713,Analysis for antibody to Legionella (waterborn...,N,293.0,261,278,19.656962,140.397816,19.261502,20.430034
4,1932145778,QUEST DIAGNOSTICS INCORPORATED,,,,,O,1 MALCOLM AVE,,TETERBORO,...,86713,Analysis for antibody to Legionella (waterborn...,N,42.0,23,23,20.85,109.900476,20.434524,20.434524
5,1225074065,QUEST DIAGNOSTICS CLINICAL LABORATORIES INC,,,,,O,50 REPUBLIC RD,,MELVILLE,...,86713,Analysis for antibody to Legionella (waterborn...,N,30.0,20,20,20.85,102.426667,20.433333,20.433333
6,1891731626,QUEST DIAGNOSTICS CLINICAL LABORATORIES INC,,,,,O,4225 E FOWLER AVE,,TAMPA,...,86713,Analysis for antibody to Legionella (waterborn...,N,28.0,14,14,20.85,96.801429,20.43,20.43
7,1538144910,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,1447 YORK CT,,BURLINGTON,...,87278,Detection test for legionella pneumophila (wat...,N,11.0,11,11,15.6,208.430909,15.29,16.0
8,1588093165,SHIEL HOLDINGS LLC,,,,,O,63 FLUSHING AVENUE,"UNIT 336, 2ND FLOOR",BROOKLYN,...,86713,Analysis for antibody to Legionella (waterborn...,N,31.0,24,24,11.14,99.0,10.917742,20.432258
9,1366479099,UNILAB CORPORATION,,,,,O,8401 FALLBROOK AVE,,WEST HILLS,...,86713,Analysis for antibody to Legionella (waterborn...,N,70.0,34,36,19.055,44.099714,18.673714,20.433143
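
Since the same chunked filter is repeated for every year, it could be wrapped in a helper. The sketch below is one possible version, not the notebook's own code: `filter_hcpcs` and `LEGIONELLA_CODES` are hypothetical names, and the in-memory CSV is made-up demo data. It also passes `dtype=str` for the code column, because pandas infers types per chunk and a chunk containing only numeric codes may be read as integers, in which case comparing against string codes would find no matches.

```python
import io
import pandas as pd

# Same codes as `my_list` in the cells above.
LEGIONELLA_CODES = ('87278', '86713', '87541', '87540', '87542')

def filter_hcpcs(source, code_column='HCPCS Code',
                 codes=LEGIONELLA_CODES, chunksize=1000):
    """Read a CSV in chunks and keep only rows whose code is in `codes`."""
    kept = []
    # dtype=str keeps the codes as text in every chunk.
    for chunk in pd.read_csv(source, chunksize=chunksize,
                             dtype={code_column: str}):
        kept.append(chunk[chunk[code_column].isin(list(codes))])
    return pd.concat(kept, ignore_index=True)

# Small in-memory CSV standing in for a yearly file (made-up rows).
demo_csv = io.StringIO(
    "National Provider Identifier,HCPCS Code,Number of Services\n"
    "1,86713,15\n"
    "2,99213,40\n"
    "3,87278,11\n"
)
demo = filter_hcpcs(demo_csv, chunksize=2)
print(demo)
```

For the 2013 file, whose code column is named `HCPCS_CODE`, the same helper could be called with `code_column='HCPCS_CODE'`.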


In [70]:
my_list = ['87278', '86713', '87541', '87540', '87542']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2015.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'].isin(my_list)]) 
               
                
legion_testing_15 = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_testing_15.shape)
legion_testing_15.head(35)


(29, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount
0,1245307818,QUEST DIAGNOSTICS INCORPORATED MD,,,,,O,1901 SULPHUR SPRING RD,,BALTIMORE,...,86713,Analysis for antibody to Legionella (waterborn...,N,22.0,11.0,11.0,20.82,72.504091,20.405,17.621818
1,1457341851,"PHYSICIAN'S AUTOMATED LABORATORY, INC",,,,,O,820 34TH ST,SUITE 103,BAKERSFIELD,...,86713,Analysis for antibody to Legionella (waterborn...,N,67.0,34.0,35.0,11.13,16.92,10.607015,16.14
2,1225074065,QUEST DIAGNOSTICS CLINICAL LABORATORIES INC,,,,,O,575 UNDERHILL BLVD,,SYOSSET,...,86713,Analysis for antibody to Legionella (waterborn...,N,31.0,20.0,21.0,20.82,101.751613,20.403226,17.11129
3,1588093165,SHIEL HOLDINGS LLC,,,,,O,63 FLUSHING AVENUE,"UNIT 336, 2ND FLOOR",BROOKLYN,...,86713,Analysis for antibody to Legionella (waterborn...,N,58.0,46.0,48.0,11.13,99.0,10.908276,18.994138
4,1396746673,"LABORATORY ALLIANCE OF CENTRAL NEW YORK, LLC",,,,,O,113 INNOVATION LN,,LIVERPOOL,...,87278,Detection test for legionella pneumophila (wat...,N,11.0,11.0,11.0,15.58,40.94,15.27,15.99
5,1366543795,NORTHWELL HEALTH LABORATORIES,,,,,O,10 NEVADA DR,,NEW HYDE PARK,...,87278,Detection test for legionella pneumophila (wat...,N,25.0,25.0,25.0,14.98,100.0,14.68,15.99
6,1235234402,"ENZO CLINICAL LABS, INC.",,,,,O,60 EXECUTIVE BLVD,,FARMINGDALE,...,86713,Analysis for antibody to Legionella (waterborn...,N,30.0,29.0,30.0,20.82,273.0,20.4,20.4
7,1063497451,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,69 1ST AVE,,RARITAN,...,86713,Analysis for antibody to Legionella (waterborn...,N,62.0,55.0,57.0,18.944516,133.952903,18.562419,20.071613
8,1083613376,WEVER,KURT,A,MD,M,I,720 W US HIGHWAY 24,,WOODLAND PARK,...,87541,Detection test for legionella pneumophila (wat...,N,16.0,15.0,16.0,29.14375,29.14375,28.56125,46.8
9,1447296272,"QUEST DIAGNOSTICS NICHOLS INSTITUTE, INC.",,,,,O,14225 NEWBROOK DR,,CHANTILLY,...,87278,Detection test for legionella pneumophila (wat...,N,15.0,14.0,15.0,14.98,148.35,14.68,15.99


In [72]:
my_list = ['87278', '86713', '87541', '87540', '87542']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2014.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'].isin(my_list)]) 
               
                
legion_testing_14 = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_testing_14.shape)
legion_testing_14.head(35)


(27, 26)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,First Name of the Provider,Middle Initial of the Provider,Credentials of the Provider,Gender of the Provider,Entity Type of the Provider,Street Address 1 of the Provider,Street Address 2 of the Provider,City of the Provider,...,HCPCS Code,HCPCS Description,Identifies HCPCS As Drug Included in the ASP Drug List,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,Average Submitted Charge Amount,Average Medicare Payment Amount,Average Medicare Standardized Amount
0,1932145778,QUEST DIAGNOSTICS INCORPORATED,,,,,O,1 MALCOLM AVE,,TETERBORO,...,86713,Analysis for antibody to Legionella (waterborn...,N,74.0,46.0,47.0,20.88,104.406216,20.46,12.994865
1,1366479099,UNILAB CORPORATION,,,,,O,8401 FALLBROOK AVE,,WEST HILLS,...,86713,Analysis for antibody to Legionella (waterborn...,N,40.0,25.0,25.0,19.364,51.4855,18.97475,13.299
2,1124085576,QUEST DIAGNOSTICS INCORPORATED MI,,,,,O,4444 GIDDINGS RD,,AUBURN HILLS,...,86713,Analysis for antibody to Legionella (waterborn...,N,53.0,19.0,19.0,19.612075,32.643774,19.219245,20.46
3,1538105366,"SONORA QUEST LABORATORIES, LLC.",,,,,O,1255 W WASHINGTON ST,,TEMPE,...,86713,Analysis for antibody to Legionella (waterborn...,N,172.0,86.0,86.0,20.88,81.943779,20.263721,10.23
4,1225074065,QUEST DIAGNOSTICS CLINICAL LABORATORIES INC,,,,,O,575 UNDERHILL BLVD,,SYOSSET,...,86713,Analysis for antibody to Legionella (waterborn...,N,46.0,26.0,27.0,20.88,107.43087,20.46,12.00913
5,1245307818,QUEST DIAGNOSTICS INCORPORATED MD,,,,,O,1901 SULPHUR SPRING RD,,BALTIMORE,...,86713,Analysis for antibody to Legionella (waterborn...,N,26.0,12.0,13.0,20.88,70.948077,20.46,10.23
6,1396746673,"LABORATORY ALLIANCE OF CENTRAL NEW YORK, LLC",,,,,O,113 INNOVATION LN,,LIVERPOOL,...,87278,Detection test for legionella pneumophila (wat...,N,13.0,11.0,13.0,15.62,40.94,15.31,16.03
7,1063497451,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,69 1ST AVE,,RARITAN,...,86713,Analysis for antibody to Legionella (waterborn...,N,59.0,53.0,53.0,18.573559,131.801186,18.2,18.379322
8,1811997711,SPECIALTY LABORATORIES INC,,,,,O,27027 TOURNEY RD,,VALENCIA,...,86713,Analysis for antibody to Legionella (waterborn...,N,42.0,24.0,25.0,20.88,149.738571,20.460714,12.178571
9,1538144910,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,1447 YORK CT,,BURLINGTON,...,86713,Analysis for antibody to Legionella (waterborn...,N,209.0,182.0,191.0,19.252249,135.00201,18.865742,20.362105


In [80]:
test = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_CY2013.csv')
test.head(3)



Unnamed: 0,National Provider Identifier,Last Name/Organization Name,First Name,Middle Initial,Credentials,Gender,Entity Code,Street Address 1,Street Address 2,City,...,HCPCS_DRUG_INDICATOR,LINE_SRVC_CNT,BENE_UNIQUE_CNT,BENE_DAY_SRVC_CNT,AVERAGE_MEDICARE_ALLOWED_AMT,STDEV_MEDICARE_ALLOWED_AMT,AVERAGE_SUBMITTED_CHRG_AMT,STDEV_SUBMITTED_CHRG_AMT,AVERAGE_MEDICARE_PAYMENT_AMT,STDEV_MEDICARE_PAYMENT_AMT
0,1104813138,SERSHON,PETER,D,M.D.,M,I,360 SHERMAN ST,SUITE 450,SAINT PAUL,...,N,23.0,23.0,23.0,211.82,0.0,509.0,0.0,149.303478,36.891437
1,1184755308,DAVIS,AMY,B,MD,F,I,121 PRATT DR STE 1A,,CORINTH,...,N,45.0,39.0,45.0,3.09,0.0,20.0,0.0,3.040667,0.02294
2,1972533834,SCHMIDT,MARY,S,M.D.,F,I,3003 W GOOD HOPE RD,,MILWAUKEE,...,N,25.0,21.0,22.0,28.68,0.0,169.0,0.0,21.728,4.440295


In [81]:
test.columns
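## The 2013 file names its columns differently (e.g. HCPCS_CODE instead of 'HCPCS Code'); these need renaming before 2013 is merged with the other years.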

Index(['National Provider Identifier ', 'Last Name/Organization Name',
       'First Name', 'Middle Initial', 'Credentials', 'Gender', 'Entity Code',
       'Street Address 1', 'Street Address 2', 'City', 'Zip Code',
       'State Code', 'Country Code', 'Provider Type', 'Medicare Participation',
       'Place of Service', 'HCPCS_CODE', 'HCPCS_DESCRIPTION',
       'HCPCS_DRUG_INDICATOR', 'LINE_SRVC_CNT', 'BENE_UNIQUE_CNT',
       'BENE_DAY_SRVC_CNT', 'AVERAGE_MEDICARE_ALLOWED_AMT',
       'STDEV_MEDICARE_ALLOWED_AMT', 'AVERAGE_SUBMITTED_CHRG_AMT',
       'STDEV_SUBMITTED_CHRG_AMT', 'AVERAGE_MEDICARE_PAYMENT_AMT',
       'STDEV_MEDICARE_PAYMENT_AMT'],
      dtype='object')
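
One way to bring the 2013 names in line before merging is a rename mapping, sketched below. The correspondence in `RENAME_2013` is an assumption inferred by matching column positions against the 2012 file's column listing, and `harmonize_2013` is a hypothetical helper; the one-row frame is demo data only. The sketch also strips the trailing space visible in `'National Provider Identifier '`.

```python
import pandas as pd

# Assumed mapping from 2013 names to the 2012-style names (inferred by position).
RENAME_2013 = {
    'HCPCS_CODE': 'HCPCS Code',
    'HCPCS_DESCRIPTION': 'HCPCS Description',
    'HCPCS_DRUG_INDICATOR': 'HCPCS Drug Indicator',
    'LINE_SRVC_CNT': 'Number of Services',
    'BENE_UNIQUE_CNT': 'Number of Medicare Beneficiaries',
    'BENE_DAY_SRVC_CNT': 'Number of Medicare Beneficiary/Day Services',
}

def harmonize_2013(df):
    """Return a copy with stripped column names and 2012-style labels."""
    out = df.copy()
    out.columns = out.columns.str.strip()  # removes the trailing space
    return out.rename(columns=RENAME_2013)

# One-row demo frame using 2013-style headers (made-up values).
demo = pd.DataFrame({'National Provider Identifier ': [1104813138],
                     'HCPCS_CODE': ['86713'],
                     'LINE_SRVC_CNT': [23.0]})
renamed = harmonize_2013(demo)
print(list(renamed.columns))
```

The remaining `AVERAGE_*` and `STDEV_*` columns could be added to the mapping in the same way once their counterparts are confirmed.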

In [None]:
my_list = ['87278', '86713', '87541', '87540', '87542']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_CY2013.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS_CODE'].isin(my_list)]) 
               
legion_testing_13 = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_testing_13.shape)
legion_testing_13.head(35)


In [77]:
my_list = ['87278', '86713', '87541', '87540', '87542']

HCPCS_rows =[]
for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_CY2012.csv', 
                         chunksize = 1000):
    HCPCS_rows.append(chunk[chunk['HCPCS Code'].isin(my_list)]) 
               
                
legion_testing_12 = pd.concat(HCPCS_rows, ignore_index=True)

print(legion_testing_12.shape)
legion_testing_12.head(35)


(28, 28)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name,First Name,Middle Initial,Credentials,Gender,Entity Code,Street Address 1,Street Address 2,City,...,HCPCS Drug Indicator,Number of Services,Number of Medicare Beneficiaries,Number of Medicare Beneficiary/Day Services,Average Medicare Allowed Amount,Standard Deviation of Medicare Allowed Amount,Average Submitted Charge Amount,Standard Deviation of Submitted Charge Amount,Average Medicare Payment Amount,Standard Deviation of Medicare Payment Amount
0,1245307818,QUEST DIAGNOSTICS INCORPORATED MD,,,,,O,1901 SULPHUR SPRING RD,,BALTIMORE,...,N,94.0,39,58,21.67,0.0,110.686064,3.388146,21.67,0.0
1,1124085576,QUEST DIAGNOSTICS INCORPORATED MI,,,,,O,4444 GIDDINGS RD,,AUBURN HILLS,...,N,137.0,23,23,20.126642,1.758282,39.493139,30.159865,20.126642,1.758282
2,1518903350,QUEST DIAGNOSTICS CLINICAL LABORATORIES INC,,,,,O,10200 COMMERCE PKWY,,MIRAMAR,...,N,83.0,12,12,17.353373,1.702915,28.22988,29.528679,17.353373,1.702915
3,1497773337,"REGIONAL MEDICAL LABORATORY, INC",,,,,O,1923 S UTICA AVE,,TULSA,...,N,19.0,19,19,21.67,0.0,71.118421,20.636964,21.67,0.0
4,1134277494,"BIO-REFERENCE LABORATORIES, INC.",,,,,O,481 EDWARD H ROSS DR,,ELMWOOD PARK,...,N,101.0,33,35,21.67,0.0,99.69604,48.59285,21.67,0.0
5,1346233277,"SOLSTAS LAB PARTNERS GROUP, LLC",,,,,O,4380 FEDERAL DR,STE 100,GREENSBORO,...,N,37.0,30,33,21.67,0.0,470.415676,255.929615,21.67,0.0
6,1740262880,LABORATORY CORPORATION OF AMERICA,,,,,O,5005 S 40TH ST,STE 1200,PHOENIX,...,N,35.0,33,33,21.67,0.0,139.657143,23.918983,21.67,0.0
7,1225074065,QUEST DIAGNOSTICS CLINICAL LAB INC.,,,,,O,575 UNDERHILL BLVD,,SYOSSET,...,N,22.0,22,22,21.67,0.0,95.082273,10.215607,21.67,0.0
8,1790721538,QUEST DIAGNOSTICS CLINICAL LABORATORIES INC,,,,,O,4770 REGENT BLVD,,IRVING,...,N,15.0,12,12,21.67,0.0,94.618667,20.895954,21.67,0.0
9,1538144910,LABORATORY CORPORATION OF AMERICA HOLDINGS,,,,,O,1447 YORK CT,,BURLINGTON,...,N,202.0,190,197,21.070594,2.385103,140.611436,28.245609,20.982822,2.670579


In [82]:
legion_testing_12.columns


Index(['National Provider Identifier', 'Last Name/Organization Name',
       'First Name', 'Middle Initial', 'Credentials', 'Gender', 'Entity Code',
       'Street Address 1', 'Street Address 2', 'City', 'Zip Code',
       'State Code', 'Country Code', 'Provider Type', 'Medicare Participation',
       'Place of Service', 'HCPCS Code', 'HCPCS Description',
       'HCPCS Drug Indicator', 'Number of Services',
       'Number of Medicare Beneficiaries',
       'Number of Medicare Beneficiary/Day Services',
       'Average Medicare Allowed Amount',
       'Standard Deviation of Medicare Allowed Amount',
       'Average Submitted Charge Amount',
       'Standard Deviation of Submitted Charge Amount',
       'Average Medicare Payment Amount',
       'Standard Deviation of Medicare Payment Amount'],
      dtype='object')