# Explore CSV files and format fields to streamline integration into a SQL database

Most date fields in our CSV files are not formatted the way MySQL expects them (_expected format is `YYYY-MM-DD`_). To query the data with SQL, some pre-processing is required.

**Note**: all the tables from the original archive contain a comma at the end of the header row. This confuses `pandas` and other data wrangling tools. I copied the original data into files prefixed with `DG_`, which I use below to process the dates. _E.g._ `WATER-SYSTEM.csv` becomes `DG_WATER_SYSTEM.csv`, whose header row DOES NOT end with a comma.
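
The `DG_` copies could also be produced programmatically. Below is a minimal sketch, assuming only the header row carries the extra comma; `strip_header_trailing_comma` is a hypothetical helper, not part of the original workflow.

```python
def strip_header_trailing_comma(text):
    """Return CSV text whose first line no longer ends with a comma."""
    header, sep, rest = text.partition("\n")
    # Only the header is touched; data rows are left unchanged.
    # A Windows-style "\r" at the end of the header is dropped as well.
    header = header.rstrip("\r")
    if header.endswith(","):
        header = header[:-1]
    return header + sep + rest


raw = "A,B,C,\n1,2,3\n4,5,6\n"
fixed = strip_header_trailing_comma(raw)
print(fixed.splitlines()[0])
```

On the real files, the returned text would be written to the matching `DG_` file.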

In [288]:
from os.path import join
import uuid

import pandas as pd
from datetime import datetime as dt

**Important**: Update the global variable below with the absolute path to the folder containing all the csv files.

In [97]:
PATH_TO_DATA_FOLDER = "/Users/fpaupier/projects/safe-water/data/SDWIS/"

## `WATER_SYSTEM` table

There are 5 date fields in this table, with two types of formatting:

1. Dates formatted as `DD-MON-YY`, _e.g._ `01-JUN-83` for June 1, 1983. Fields encoded with this date format are:
    - `OUTSTANDING_PERFORM_BEGIN_DATE`
    - `PWS_DEACTIVATION_DATE`
    - `SOURCE_PROTECTION_BEGIN_DATE`
    
    Note that this format is ambiguous about the century, since the year has only two digits. 
    Dates formatted that way are converted to the MySQL-friendly date format `YYYY-MM-DD`.
    
    
2. Dates formatted as `MM-DD`. Fields encoded that way are:
     - `SEASON_BEGIN_DATE`
     - `SEASON_END_DATE`
     
     Note that those dates describe events that recur every year. We keep them as raw text fields. Fine-grained processing of those dates is left to consumer applications.
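
To make the first conversion concrete, here is a small sketch using only the standard library. `to_mysql_date` is a hypothetical helper, and the way two-digit years are resolved is Python's default behaviour, not something the dataset documents.

```python
from datetime import datetime


def to_mysql_date(raw):
    """Convert 'DD-MON-YY' (e.g. '01-JUN-83') to 'YYYY-MM-DD'."""
    # %b matches the abbreviated month name (case-insensitive when parsing;
    # English names are assumed, i.e. the default C locale).
    # %y resolves two-digit years with the POSIX pivot:
    # 69-99 -> 1969-1999 and 00-68 -> 2000-2068.
    return datetime.strptime(raw, "%d-%b-%y").strftime("%Y-%m-%d")


print(to_mysql_date("01-JUN-83"))  # 1983-06-01
print(to_mysql_date("24-OCT-12"))  # 2012-10-24
```

Because of the pivot, a date such as `01-JAN-68` would come out as 2068, so it is worth checking the parsed dates for values in the future.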
     


## Load and format data

In [236]:
OUTSTANDING_PERFORM_BEGIN_DATE_idx = 44
PWS_DEACTIVATION_DATE_idx = 8
SOURCE_PROTECTION_BEGIN_DATE_idx = 42
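
The indices above are hard-coded and silently break if the column order changes. As a hedged alternative, the positions can be looked up from the header row; `column_positions` and the sample content are made up for illustration. Note that `parse_dates` in `pd.read_csv` also accepts column names directly.

```python
import csv
import io


def column_positions(csv_file, names):
    """Map each requested column name to its 0-based position in the header."""
    header = next(csv.reader(csv_file))
    return {name: header.index(name) for name in names}


# Tiny in-memory stand-in for the real file (hypothetical content).
sample = io.StringIO("T.PWSID,T.PWS_NAME,T.PWS_DEACTIVATION_DATE\nAR1,FOO,01-JUN-83\n")
print(column_positions(sample, ["T.PWS_DEACTIVATION_DATE"]))
```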

In [237]:
df = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_WATER_SYSTEM.csv"),
                 sep=",",
                 header=0,
                 index_col=0,
                 encoding="utf-8",
                 low_memory=False,  # parse the file in one pass to avoid mixed-type inference across chunks
                 parse_dates=[PWS_DEACTIVATION_DATE_idx, SOURCE_PROTECTION_BEGIN_DATE_idx, OUTSTANDING_PERFORM_BEGIN_DATE_idx])




In [238]:
df.head()

Unnamed: 0_level_0,WATER_SYSTEM.PWS_NAME,WATER_SYSTEM.NPM_CANDIDATE,WATER_SYSTEM.PRIMACY_AGENCY_CODE,WATER_SYSTEM.EPA_REGION,WATER_SYSTEM.SEASON_BEGIN_DATE,WATER_SYSTEM.SEASON_END_DATE,WATER_SYSTEM.PWS_ACTIVITY_CODE,WATER_SYSTEM.PWS_DEACTIVATION_DATE,WATER_SYSTEM.PWS_TYPE_CODE,WATER_SYSTEM.DBPR_SCHEDULE_CAT_CODE,...,WATER_SYSTEM.CITY_NAME,WATER_SYSTEM.ZIP_CODE,WATER_SYSTEM.COUNTRY_CODE,WATER_SYSTEM.STATE_CODE,WATER_SYSTEM.SOURCE_WATER_PROTECTION_CODE,WATER_SYSTEM.SOURCE_PROTECTION_BEGIN_DATE,WATER_SYSTEM.OUTSTANDING_PERFORMER,WATER_SYSTEM.OUTSTANDING_PERFORM_BEGIN_DATE,WATER_SYSTEM.CITIES_SERVED,WATER_SYSTEM.COUNTIES_SERVED
WATER_SYSTEM.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AR1900063,USCOE BSWP118 PONTIAC,N,AR,6,01-01,12-31,I,1983-06-01,TNCWS,,...,LITTLE ROCK,72203,US,AR,,NaT,,NaT,Not Reported,Marion
AR1900071,USCOE BSWP126 HIGHWAY K,N,AR,6,01-01,12-31,I,1983-06-01,TNCWS,,...,LITTLE ROCK,72203,US,AR,,NaT,,NaT,Not Reported,Marion
AR1900072,USCOE BSW127 LOWERY,N,AR,6,01-01,12-31,I,1983-06-01,TNCWS,,...,LITTLE ROCK,72203,US,AR,,NaT,,NaT,Not Reported,Marion
AR1900075,USCOE GFW02 DAM SITE,N,AR,6,01-01,12-31,I,1983-06-01,TNCWS,,...,LITTLE ROCK,72203,US,AR,,NaT,,NaT,Not Reported,Cleburne
AR1900076,USCOE GFW03 DAM SITE,N,AR,6,01-01,12-31,I,1983-06-01,TNCWS,,...,LITTLE ROCK,72203,US,AR,,NaT,,NaT,Not Reported,Cleburne


### Sanitize booleans
The fields `IS_GRANT_ELIGIBLE_IND`, `IS_WHOLESALER_IND`, `NPM_CANDIDATE` and `IS_SCHOOL_OR_DAYCARE_IND` hold the text values `N` or `Y`. They should be cast to booleans in the database to make queries easier to express.

In [239]:
df['WATER_SYSTEM.IS_GRANT_ELIGIBLE_IND'].dropna().unique()

array(['N', 'Y'], dtype=object)

In [240]:
df['WATER_SYSTEM.IS_WHOLESALER_IND'].dropna().unique()

array(['N', 'Y'], dtype=object)

In [241]:
df['WATER_SYSTEM.IS_SCHOOL_OR_DAYCARE_IND'].dropna().unique()

array(['N', 'Y'], dtype=object)

In [242]:
df['WATER_SYSTEM.NPM_CANDIDATE'].dropna().unique()

array(['N', 'Y'], dtype=object)

In [243]:
def type_checker(arr):
    """Print whether every item of `arr` has the same type as the first one."""
    ref_type = type(arr[0])
    compatible = True
    for item in arr:
        if type(item) != ref_type:
            print('ERROR!')
            compatible = False
    if compatible:
        print('compatible types')

In [244]:
tst = df['WATER_SYSTEM.ZIP_CODE'].dropna().unique()
tst

array(['72203', '92365', '92398', ..., '49661', '48909-7741',
       '55011-9204'], dtype=object)

In [245]:
type_checker(tst)

compatible types


### Perform sanitization

Map `N` to `0` (false) and `Y` to `1` (true).

In [246]:
df['WATER_SYSTEM.IS_GRANT_ELIGIBLE_IND'] = df['WATER_SYSTEM.IS_GRANT_ELIGIBLE_IND'].map({'N': 0, 'Y': 1})
df['WATER_SYSTEM.IS_WHOLESALER_IND'] = df['WATER_SYSTEM.IS_WHOLESALER_IND'].map({'N': 0, 'Y': 1})
df['WATER_SYSTEM.IS_SCHOOL_OR_DAYCARE_IND'] = df['WATER_SYSTEM.IS_SCHOOL_OR_DAYCARE_IND'].map({'N': 0, 'Y': 1})
df['WATER_SYSTEM.NPM_CANDIDATE'] = df['WATER_SYSTEM.NPM_CANDIDATE'].map({'N': 0, 'Y': 1})
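
One caveat: `Series.map` with a dict silently turns any value missing from the dict into `NaN`, so a stray `y` or `X` would go unnoticed. The sketch below shows a stricter mapping; `yn_to_int` is a hypothetical helper that could be passed to `Series.map` in place of the dict.

```python
YN_TO_INT = {"N": 0, "Y": 1}


def yn_to_int(value):
    """Map 'Y'/'N' to 1/0, keep missing values as None, reject anything else."""
    if value is None or value != value:  # None or NaN (NaN never equals itself)
        return None
    try:
        return YN_TO_INT[value]
    except KeyError:
        raise ValueError(f"unexpected flag value: {value!r}")


print([yn_to_int(v) for v in ["N", "Y", None]])  # [0, 1, None]
```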



Check sanitization output:

In [247]:
df['WATER_SYSTEM.IS_GRANT_ELIGIBLE_IND'].dropna().unique()

array([0, 1])

In [248]:
df['WATER_SYSTEM.IS_WHOLESALER_IND'].dropna().unique()

array([0, 1])

In [249]:
df['WATER_SYSTEM.IS_SCHOOL_OR_DAYCARE_IND'].dropna().unique()

array([0, 1])

In [250]:
df['WATER_SYSTEM.NPM_CANDIDATE'].dropna().unique()

array([0, 1])

## Save sanitized data

Save the sanitized dataset in a new csv file:

In [251]:
# Data will be saved in the `sanitized` folder.
sanitized_csv_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'WATER_SYSTEM.csv')

In [252]:
df.to_csv(sanitized_csv_file, sep=",", encoding='utf-8')
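
`to_csv` does not create missing directories, so the `sanitized` folder must exist before the cell above runs. A minimal sketch, with `sanitized_path` as a hypothetical helper:

```python
import os
import tempfile
from os.path import join


def sanitized_path(data_folder, file_name):
    """Return the path under `<data_folder>/sanitized`, creating the folder if needed."""
    out_dir = join(data_folder, "sanitized")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    return join(out_dir, file_name)


# Demonstration in a throwaway directory instead of the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    target = sanitized_path(tmp, "WATER_SYSTEM.csv")
    print(os.path.isdir(os.path.dirname(target)))  # True
```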



## `WATER_SYSTEM_FACILITY` table

In [511]:
# Indices of the date columns to parse
FACILITY_DEACTIVATION_DATE_idx = 7
PWS_DEACTIVATION_DATE_idx = 18

In [512]:
wsf = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_WATER_SYSTEM_FACILITY.csv"),
                 sep=",",
                 header=0,
                 index_col=0,
                 encoding="utf-8",
                 low_memory=False,
                 parse_dates=[FACILITY_DEACTIVATION_DATE_idx, PWS_DEACTIVATION_DATE_idx],
                 )



In [284]:
# Process binary fields
wsf['WATER_SYSTEM_FACILITY.IS_SOURCE_IND'] = wsf['WATER_SYSTEM_FACILITY.IS_SOURCE_IND'].map({'N': 0, 'Y': 1})

In [513]:
def longest_item(arr):
    """Return the first item of `arr` with the greatest length."""
    longest = arr[0]
    for item in arr:
        if len(item) > len(longest):
            longest = item
    return longest

In [515]:
len(longest_item(wsf['WATER_SYSTEM_FACILITY.FACILITY_ID'].unique()))

12

In [287]:
len(wsf['WATER_SYSTEM_FACILITY.FACILITY_ID'])

1408854

In [272]:
wsf.shape

(1408854, 20)

**Important**: There are only `203 078` unique `FACILITY_ID` values in the dataset, yet there are `1 408 854` records. `FACILITY_ID` alone cannot serve as a primary key.

Thus we prepend an `ID` column to use as primary key.
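
Before settling on a surrogate key, one could also test whether a combination of columns is unique. The sketch below uses made-up rows; whether `(PWSID, FACILITY_ID)` is actually unique in the real data is not verified here, and `duplicated_keys` is a hypothetical helper.

```python
from collections import Counter


def duplicated_keys(rows, key_fields):
    """Return the key tuples that occur more than once in `rows` (a list of dicts)."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Toy rows (made up) where FACILITY_ID repeats across water systems.
rows = [
    {"PWSID": "NY0900222", "FACILITY_ID": "54061"},
    {"PWSID": "NY1000240", "FACILITY_ID": "54061"},
    {"PWSID": "NY1000240", "FACILITY_ID": "73477"},
]
print(duplicated_keys(rows, ["FACILITY_ID"]))           # [('54061',)]
print(duplicated_keys(rows, ["PWSID", "FACILITY_ID"]))  # []
```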

In [289]:
ids = []
for idx in range(wsf.shape[0]):
    ids.append(str(uuid.uuid4()))

In [291]:
# Prepend the ID series to the dataframe
wsf.insert(0, 'ID', pd.Series(ids, index=wsf.index))
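
The same ID-generation loop is repeated for several tables in this notebook. It could be factored into a small function; `new_ids` is a hypothetical name.

```python
import uuid


def new_ids(n):
    """Return `n` random UUID4 strings, each 36 characters long."""
    return [str(uuid.uuid4()) for _ in range(n)]


demo_ids = new_ids(3)
print(len(demo_ids), len(demo_ids[0]))  # 3 36
```

The result can be passed to `pd.Series(..., index=frame.index)` exactly as in the cell above.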

In [292]:
wsf.head()

Unnamed: 0_level_0,ID,WATER_SYSTEM_FACILITY.PRIMACY_AGENCY_CODE,WATER_SYSTEM_FACILITY.EPA_REGION,WATER_SYSTEM_FACILITY.FACILITY_ID,WATER_SYSTEM_FACILITY.FACILITY_NAME,WATER_SYSTEM_FACILITY.STATE_FACILITY_ID,WATER_SYSTEM_FACILITY.FACILITY_ACTIVITY_CODE,WATER_SYSTEM_FACILITY.FACILITY_DEACTIVATION_DATE,WATER_SYSTEM_FACILITY.FACILITY_TYPE_CODE,WATER_SYSTEM_FACILITY.SUBMISSION_STATUS_CODE,...,WATER_SYSTEM_FACILITY.WATER_TYPE_CODE,WATER_SYSTEM_FACILITY.AVAILABILITY_CODE,WATER_SYSTEM_FACILITY.SELLER_TREATMENT_CODE,WATER_SYSTEM_FACILITY.SELLER_PWSID,WATER_SYSTEM_FACILITY.SELLER_PWS_NAME,WATER_SYSTEM_FACILITY.FILTRATION_STATUS_CODE,WATER_SYSTEM_FACILITY.PWS_ACTIVITY_CODE,WATER_SYSTEM_FACILITY.PWS_DEACTIVATION_DATE,WATER_SYSTEM_FACILITY.PWS_TYPE_CODE,WATER_SYSTEM_FACILITY.IS_SOURCE_TREATED_IND
WATER_SYSTEM_FACILITY.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
NY0900222,9c2ce89f-7c29-489a-8c23-0485bcf908aa,NY,2,54061,TRANS. MAIN 6 X 500',000000081224,A,NaT,TM,Y,...,,,,,,,A,NaT,CWS,
NY0900222,e023b201-683b-4ae5-9b2c-7b92a55b0dae,NY,2,75508,XXIDSE-DISTRIBUTION STAGE 2,XXIDSE,I,2012-05-23,DS,Y,...,,,,,,,A,NaT,CWS,
NY1000240,959cf25d-2854-4c85-b14d-97ef385a71b5,NY,2,63273,DISTRIBUTION SYSTEM,DS-01,A,NaT,DS,Y,...,,,,,,,A,NaT,CWS,
NY1000240,e2a3b971-9427-4d3f-b70c-fd01d02161ef,NY,2,73477,"485,000 GALLON STORAGE TANK",ST-01,A,NaT,ST,Y,...,,,,,,,A,NaT,CWS,
NY0611916,54965745-440f-4a8f-90b2-984d5283516f,NY,2,67310,DISTRIBUTION SYSTEM (HV2),DS-001,A,NaT,DS,Y,...,,,,,,,A,NaT,TNCWS,


In [293]:
# Data will be saved in the `sanitized` folder.
sanitized_wsf_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'WATER_SYSTEM_FACILITY.csv')
wsf.to_csv(sanitized_wsf_file, sep=",", encoding='utf-8', quotechar='"')

In [295]:
len('e5862a56-14ea-4dda-8fed-bf1efb0c9bbd')

36

Unnamed: 0_level_0,VIOLATION.VIOLATION_ID,VIOLATION.FACILITY_ID,VIOLATION.POPULATION_SERVED_COUNT,VIOLATION.NPM_CANDIDATE,VIOLATION.PWS_ACTIVITY_CODE,VIOLATION.PWS_DEACTIVATION_DATE,VIOLATION.PRIMARY_SOURCE_CODE,VIOLATION.POP_CAT_5_CODE,VIOLATION.PRIMACY_AGENCY_CODE,VIOLATION.EPA_REGION,...,VIOLATION.LATEST_ENFORCEMENT_ID,VIOLATION.RTC_ENFORCEMENT_ID,VIOLATION.RTC_DATE,VIOLATION.PUBLIC_NOTIFICATION_TIER,VIOLATION.ORIGINATOR_CODE,VIOLATION.SAMPLE_RESULT_ID,VIOLATION.CORRECTIVE_ACTION_ID,VIOLATION.RULE_CODE,VIOLATION.RULE_GROUP_CODE,VIOLATION.RULE_FAMILY_CODE
VIOLATION.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
WI1110274,1200004,2.0,35,N,A,,GW,1,WI,5,...,1300019,1300016,24-OCT-12,3,S,,,310,300,310
WI1110274,1200003,2.0,35,N,A,,GW,1,WI,5,...,1300019,1300016,24-OCT-12,3,S,,,310,300,310
WI1110274,1200024,,35,N,A,,GW,1,WI,5,...,1200015,1200015,21-AUG-12,2,S,,,110,100,110
WI1110274,1200023,2.0,35,N,A,,GW,1,WI,5,...,1300019,1300016,24-OCT-12,3,S,,,310,300,310
WI1110274,1200002,,35,N,A,,GW,1,WI,5,...,1200010,1200010,26-MAR-12,2,S,,,110,100,110


## `VIOLATION` table


In [303]:
# Indices of the date columns to parse
COMPL_PER_BEGIN_DATE_idx = 22
COMPL_PER_END_DATE_idx = 23
RTC_DATE_idx = 26

In [304]:
violation = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_VIOLATION.csv"),
                 sep=",",
                 header=0,
                 index_col=0,
                 encoding="utf-8",
                 low_memory=False,
                 parse_dates=[COMPL_PER_BEGIN_DATE_idx, COMPL_PER_END_DATE_idx, RTC_DATE_idx],
                 )



In [305]:
violation.head()

Unnamed: 0_level_0,VIOLATION.VIOLATION_ID,VIOLATION.FACILITY_ID,VIOLATION.POPULATION_SERVED_COUNT,VIOLATION.NPM_CANDIDATE,VIOLATION.PWS_ACTIVITY_CODE,VIOLATION.PWS_DEACTIVATION_DATE,VIOLATION.PRIMARY_SOURCE_CODE,VIOLATION.POP_CAT_5_CODE,VIOLATION.PRIMACY_AGENCY_CODE,VIOLATION.EPA_REGION,...,VIOLATION.LATEST_ENFORCEMENT_ID,VIOLATION.RTC_ENFORCEMENT_ID,VIOLATION.RTC_DATE,VIOLATION.PUBLIC_NOTIFICATION_TIER,VIOLATION.ORIGINATOR_CODE,VIOLATION.SAMPLE_RESULT_ID,VIOLATION.CORRECTIVE_ACTION_ID,VIOLATION.RULE_CODE,VIOLATION.RULE_GROUP_CODE,VIOLATION.RULE_FAMILY_CODE
VIOLATION.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
WI1110274,1200004,2.0,35,N,A,,GW,1,WI,5,...,1300019,1300016,2012-10-24,3,S,,,310,300,310
WI1110274,1200003,2.0,35,N,A,,GW,1,WI,5,...,1300019,1300016,2012-10-24,3,S,,,310,300,310
WI1110274,1200024,,35,N,A,,GW,1,WI,5,...,1200015,1200015,2012-08-21,2,S,,,110,100,110
WI1110274,1200023,2.0,35,N,A,,GW,1,WI,5,...,1300019,1300016,2012-10-24,3,S,,,310,300,310
WI1110274,1200002,,35,N,A,,GW,1,WI,5,...,1200010,1200010,2012-03-26,2,S,,,110,100,110


In [306]:
violation['VIOLATION.IS_HEALTH_BASED_IND'].unique()

array(['N', 'Y', nan], dtype=object)

We do not map this field because it contains `NaN` values; we keep `Y`, `N` and `NaN` as they are.
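
If a numeric flag is wanted despite the missing values, pandas' nullable integer dtype is one option. This is only a sketch on toy data, not what the notebook does:

```python
import numpy as np
import pandas as pd

flags = pd.Series(["N", "Y", np.nan])

# map() leaves NaN untouched; the nullable "Int64" dtype then stores
# 0, 1 and <NA> without falling back to floats.
mapped = flags.map({"N": 0, "Y": 1}).astype("Int64")
print(mapped.isna().tolist())  # [False, False, True]
```

With its default settings, `to_csv` writes `<NA>` as an empty field.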

In [300]:
len(violation['VIOLATION.VIOLATION_ID'].unique())

1089493

In [301]:
violation.shape

(2212450, 33)

Again, the number of unique IDs is lower than the number of records, so we create a new `ID` column as primary key.

In [307]:
ids_violation = []
for idx in range(violation.shape[0]):
    ids_violation.append(str(uuid.uuid4()))

In [308]:
ids_violation[:5]

['da7c77d1-57fa-40e6-9a8e-01bbbf9dbbd3',
 'de518b9e-5d0b-4a81-8159-289024c0e5ee',
 'c60dcd65-443d-45a2-b385-8c3fa400d6b6',
 '366f21f4-abfb-41ac-b0fa-a5d5442c6e83',
 'aaec3dd6-246f-4fd6-bf13-a237b2714d76']

In [309]:
# Prepend the ID series to the dataframe
violation.insert(0, 'ID', pd.Series(ids_violation, index=violation.index))

In [310]:
violation.head()

Unnamed: 0_level_0,ID,VIOLATION.VIOLATION_ID,VIOLATION.FACILITY_ID,VIOLATION.POPULATION_SERVED_COUNT,VIOLATION.NPM_CANDIDATE,VIOLATION.PWS_ACTIVITY_CODE,VIOLATION.PWS_DEACTIVATION_DATE,VIOLATION.PRIMARY_SOURCE_CODE,VIOLATION.POP_CAT_5_CODE,VIOLATION.PRIMACY_AGENCY_CODE,...,VIOLATION.LATEST_ENFORCEMENT_ID,VIOLATION.RTC_ENFORCEMENT_ID,VIOLATION.RTC_DATE,VIOLATION.PUBLIC_NOTIFICATION_TIER,VIOLATION.ORIGINATOR_CODE,VIOLATION.SAMPLE_RESULT_ID,VIOLATION.CORRECTIVE_ACTION_ID,VIOLATION.RULE_CODE,VIOLATION.RULE_GROUP_CODE,VIOLATION.RULE_FAMILY_CODE
VIOLATION.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
WI1110274,da7c77d1-57fa-40e6-9a8e-01bbbf9dbbd3,1200004,2.0,35,N,A,,GW,1,WI,...,1300019,1300016,2012-10-24,3,S,,,310,300,310
WI1110274,de518b9e-5d0b-4a81-8159-289024c0e5ee,1200003,2.0,35,N,A,,GW,1,WI,...,1300019,1300016,2012-10-24,3,S,,,310,300,310
WI1110274,c60dcd65-443d-45a2-b385-8c3fa400d6b6,1200024,,35,N,A,,GW,1,WI,...,1200015,1200015,2012-08-21,2,S,,,110,100,110
WI1110274,366f21f4-abfb-41ac-b0fa-a5d5442c6e83,1200023,2.0,35,N,A,,GW,1,WI,...,1300019,1300016,2012-10-24,3,S,,,310,300,310
WI1110274,aaec3dd6-246f-4fd6-bf13-a237b2714d76,1200002,,35,N,A,,GW,1,WI,...,1200010,1200010,2012-03-26,2,S,,,110,100,110


In [311]:
# Data will be saved in the `sanitized` folder.
sanitized_violation_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'VIOLATION.csv')
violation.to_csv(sanitized_violation_file, sep=",", encoding='utf-8', quotechar='"')

In [322]:
longest_item(violation['VIOLATION.RULE_CODE'])

TypeError: object of type 'numpy.int64' has no len()
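
The error comes from calling `len()` on integers. A variant that measures the text form of each value works for numeric columns too; `longest_as_text` is a hypothetical helper.

```python
def longest_as_text(values):
    """Return the value whose text form is longest, or None if `values` is empty."""
    # str() lets integers and other non-string values be measured too.
    return max(values, key=lambda v: len(str(v)), default=None)


print(longest_as_text([310, 110, 2950, 22]))  # 2950
```

The length of the returned value's text form gives a lower bound for the width of the matching SQL column.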

In [332]:
violation['VIOLATION.POPULATION_SERVED_COUNT'].unique()

array([    35,   2500,     59, ..., 205489,  13560,  13855])

## `VIOLATION_ENF_ASSOC` table


In [361]:
violation_enf = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_VIOLATION_ENF_ASSOC.csv"),
                 sep=",",
                 header=0,
                 index_col=0,
                 encoding="utf-8",
                 low_memory=False,
                 )


In [362]:
violation_enf.head()

Unnamed: 0_level_0,VIOLATION_ENF_ASSOC.ENFORCEMENT_ID,VIOLATION_ENF_ASSOC.VIOLATION_ID
VIOLATION_ENF_ASSOC.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1
NC0241588,788429,4626636
MS0020006,7105413,7104973
CT0180334,1209,129209
PA1150767,11001569001,1102713
OH3841912,8122709,8658209


In [356]:
violation_enf.shape

(4891691, 3)

In [419]:
violation_enf['VIOLATION_ENF_ASSOC.VIOLATION_ID'].isnull().any()

False

In [360]:
violation_enf['VIOLATION_ENF_ASSOC.ENFORCEMENT_ID'].is_unique

False

Sadly, we cannot use `ENFORCEMENT_ID` as primary key because it is not unique, so we create a new `ID` key.

In [365]:
ids_violation_enf = []
for idx in range(violation_enf.shape[0]):
    ids_violation_enf.append(str(uuid.uuid4()))

In [366]:
# Prepend the ID series to the dataframe
violation_enf.insert(0, 'ID', pd.Series(ids_violation_enf, index=violation_enf.index))

In [367]:
violation_enf.head()

Unnamed: 0_level_0,ID,VIOLATION_ENF_ASSOC.ENFORCEMENT_ID,VIOLATION_ENF_ASSOC.VIOLATION_ID
VIOLATION_ENF_ASSOC.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NC0241588,c9f81ec0-c050-4bd0-9af3-d019de1fb31a,788429,4626636
MS0020006,fbf0edec-efb8-4c4a-80d6-afdcf1ffc67f,7105413,7104973
CT0180334,7d72119f-0dac-44f3-87b0-25da9f49345b,1209,129209
PA1150767,9829fa37-d453-4151-bb2f-c12b8bdda854,11001569001,1102713
OH3841912,de725c07-c4e4-4987-8db1-be06f6721f5c,8122709,8658209


In [369]:
# Data will be saved in the `sanitized` folder.
sanitized_violation_enf_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'VIOLATION_ENF_ASSOC.csv')
violation_enf.to_csv(sanitized_violation_enf_file, sep=",", encoding='utf-8')

## `SERVICE_AREA` table


In [378]:
service_area = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_SERVICE_AREA.csv"),
                 sep=",",
                 header=0,
                 index_col=0,
                 encoding="utf-8",
                 low_memory=False,
                 )


In [379]:
service_area.head()

Unnamed: 0_level_0,SERVICE_AREA.PRIMACY_AGENCY_CODE,SERVICE_AREA.EPA_REGION,SERVICE_AREA.PWS_ACTIVITY_CODE,SERVICE_AREA.PWS_TYPE_CODE,SERVICE_AREA.SERVICE_AREA_TYPE_CODE,SERVICE_AREA.IS_PRIMARY_SERVICE_AREA_CODE
SERVICE_AREA.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
MD0100047,MD,3,A,CWS,OR,Y
MD1220047,MD,3,A,NTNCWS,DC,Y
MD1101338,MD,3,A,NTNCWS,DC,Y
MD1011124,MD,3,A,TNCWS,OT,Y
MD1021756,MD,3,A,TNCWS,OT,Y


Again, there is no primary key, so we add one.

In [381]:
service_area.index.isnull().any()

False

`PWSID` can serve as a foreign key to the `WATER_SYSTEM` table.
We still need to add a primary key.

In [382]:
ids_service_area = []
for idx in range(service_area.shape[0]):
    ids_service_area.append(str(uuid.uuid4()))

In [383]:
# Prepend the ID series to the dataframe
service_area.insert(0, 'ID', pd.Series(ids_service_area, index=service_area.index))

In [384]:
service_area.head()

Unnamed: 0_level_0,ID,SERVICE_AREA.PRIMACY_AGENCY_CODE,SERVICE_AREA.EPA_REGION,SERVICE_AREA.PWS_ACTIVITY_CODE,SERVICE_AREA.PWS_TYPE_CODE,SERVICE_AREA.SERVICE_AREA_TYPE_CODE,SERVICE_AREA.IS_PRIMARY_SERVICE_AREA_CODE
SERVICE_AREA.PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
MD0100047,96e894b0-91b2-4c6a-915c-77a4c5dabd41,MD,3,A,CWS,OR,Y
MD1220047,3dd5b572-e95e-4564-b329-ebc0450e04dd,MD,3,A,NTNCWS,DC,Y
MD1101338,9bf90f34-8d2c-40ff-8340-d4b0c6ded7cf,MD,3,A,NTNCWS,DC,Y
MD1011124,2cd1f536-3fc6-4424-9165-cef9e065e653,MD,3,A,TNCWS,OT,Y
MD1021756,480b760a-dd49-46a8-86c4-f25984e9c2d9,MD,3,A,TNCWS,OT,Y


In [385]:
# Data will be saved in the `sanitized` folder.
sanitized_service_area_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'SERVICE_AREA.csv')
service_area.to_csv(sanitized_service_area_file, sep=",", encoding='utf-8')

## `GEOGRAPHIC_AREA` table


In [386]:
geo = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_GEOGRAPHIC_AREA.csv"),
                 sep=",",
                 header=0,
                 index_col=0,
                 encoding="utf-8",
                 low_memory=False,
                 )


In [387]:
geo.head()

Unnamed: 0_level_0,GEO_ID,PRIMACY_AGENCY_CODE,EPA_REGION,PWS_ACTIVITY_CODE,PWS_TYPE_CODE,TRIBAL_CODE,STATE_SERVED,ANSI_ENTITY_CODE,ZIP_CODE_SERVED,CITY_SERVED,AREA_TYPE_CODE,COUNTY_SERVED
PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
AK2113536,22749033,AK,10,N,TNCWS,,AK,,,HAINES,CT,
AK2113536,22749034,AK,10,N,TNCWS,,,100.0,,,CN,Haines Borough
ME0009920,22749035,ME,1,I,TNCWS,,ME,,,BETHEL,CT,
ME0009920,22749036,ME,1,I,TNCWS,,,17.0,,,CN,Oxford
NY0319346,22749037,NY,2,I,TNCWS,,NY,,,UNION (T),CT,


In [391]:
geo['GEO_ID'].is_unique

True

In [392]:
geo['GEO_ID'].isnull().values.any()

False

In [394]:
geo.index.isnull().any()

False

We can use the `GEO_ID` as primary key for the table. We also set a foreign key on PWSID to WATER_SYSTEM (PWSID). 

No need to write a sanitized copy of this file.
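
Before declaring the foreign key, it may be worth checking that every `PWSID` in the child table exists in `WATER_SYSTEM`; otherwise the constraint will reject rows at load time. A minimal sketch with made-up identifiers (`orphan_keys` is a hypothetical helper):

```python
def orphan_keys(child_keys, parent_keys):
    """Return the sorted child keys that have no match among the parent keys."""
    return sorted(set(child_keys) - set(parent_keys))


# Made-up identifiers for illustration only.
water_system_ids = ["AK2113536", "ME0009920"]
geo_area_ids = ["AK2113536", "AK2113536", "NY0319346"]
print(orphan_keys(geo_area_ids, water_system_ids))  # ['NY0319346']
```

With the dataframes of this notebook, the inputs would be the two `PWSID` indexes.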

## `ENFORCEMENT_ACTION` table

In [398]:
ENFORCEMENT_DATE_idx = 3

In [410]:
enfo = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_ENFORCEMENT_ACTION.csv"),
                 sep=",",
                 header=0,
                 encoding="utf-8",
                 index_col=0,
                 low_memory=False,
                 parse_dates=[ENFORCEMENT_DATE_idx]
                 )


In [411]:
enfo.head()

Unnamed: 0_level_0,ENFORCEMENT_ID,ORIGINATOR_CODE,ENFORCEMENT_DATE,ENFORCEMENT_ACTION_TYPE_CODE,ENFORCEMENT_COMMENT_TEXT
PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
OH8200003,1091220,S,2015-11-19,SIE,
OH8200003,1091221,S,2015-12-09,SIF,
OH8300012,6062769,S,2015-11-19,SIE,
OH8301012,8073814,S,2015-12-07,SIA,
OH8301012,8073815,S,2015-12-07,SIE,


In [402]:
enfo.index.isnull().any()  # PWSID is the index here (index_col=0)

False

We can use `PWSID` as a foreign key to `WATER_SYSTEM (PWSID)`.

In [403]:
enfo['ORIGINATOR_CODE'].unique()

array(['S', 'R', 'F'], dtype=object)

In [404]:
enfo['ENFORCEMENT_ACTION_TYPE_CODE'].unique()

array(['SIE', 'SIF', 'SIA', 'SFM', 'SOX', 'SFJ', 'SFO', 'SO0', 'SFG',
       'SIB', 'SIC', 'SID', 'EOX', 'SFL', 'SFR', 'SFK', 'SO8', 'SFH',
       'SOY', 'SFQ', 'EFJ', 'SO6', 'SII', 'SFN', 'SF4', 'EFL', 'EF/',
       'SO+', 'EF<', 'SO7', 'SF%', 'EO6', 'SFS', 'SFU', 'EFR', 'EIA',
       'EO8', 'SF3', 'SFT', 'EIC', 'EFK', 'EIE', 'EIF', 'EF-', 'SFV',
       'EO0', 'EIB', 'EOY', 'EO7', 'SOZ', 'EID', 'EFG', 'EFH', 'EFQ',
       'EII', 'SF5', 'EF!', 'EO+', 'SFW'], dtype=object)

In [409]:
enfo['ENFORCEMENT_ID'].is_unique

False

`ENFORCEMENT_ID` is not unique, so we need to add another ID.

In [412]:
ids_enfo = []
for idx in range(enfo.shape[0]):
    ids_enfo.append(str(uuid.uuid4()))

In [413]:
# Prepend the ID series to the dataframe
enfo.insert(0, 'ID', pd.Series(ids_enfo, index=enfo.index))

In [414]:
enfo.head()

Unnamed: 0_level_0,ID,ENFORCEMENT_ID,ORIGINATOR_CODE,ENFORCEMENT_DATE,ENFORCEMENT_ACTION_TYPE_CODE,ENFORCEMENT_COMMENT_TEXT
PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
OH8200003,dc3dbc09-bf95-42ca-b55e-f6bad0c8b66a,1091220,S,2015-11-19,SIE,
OH8200003,74f8bed5-04c8-4f38-811b-a0b3f9d2698c,1091221,S,2015-12-09,SIF,
OH8300012,77583693-34fa-4664-b222-2f5d341b5d78,6062769,S,2015-11-19,SIE,
OH8301012,e3613975-3b98-4434-8178-9a6863e449f6,8073814,S,2015-12-07,SIA,
OH8301012,ff14337c-7dd2-4e54-8b70-bde602de20df,8073815,S,2015-12-07,SIE,


In [415]:
# Data will be saved in the `sanitized` folder.
sanitized_enfo_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'ENFORCEMENT_ACTION.csv')
enfo.to_csv(sanitized_enfo_file, sep=",", encoding='utf-8')

## `contaminant-codes` table 

In [420]:
cont = pd.read_csv(join(PATH_TO_DATA_FOLDER, "contaminant-codes.csv"),
                 sep=",",
                 header=0,
                 encoding="utf-8",
                index_col=0,
                 low_memory=False,
                 )


In [421]:
cont.head()

Unnamed: 0_level_0,NAME,SCIENTIFIC_NAME,TYPE_CODE
CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
100,TURBIDITY,,WQ
200,SWTR,,RL
300,IESWTR,,RL
400,DBP STAGE 1,,RL
500,FILTER BACKWASH RULE,,RL


In [423]:
cont.index.is_unique

True

In [426]:
cont.index.isnull().any()

False

We can use `CODE` as primary key.

In [428]:
cont['TYPE_CODE'].unique()

array(['WQ', 'RL', 'GC', 'IOC', 'OC', 'RA', 'OT', 'MOR'], dtype=object)

We can integrate the file directly.

## `contaminant-group-codes` table

In [431]:
cont_groups = pd.read_csv(join(PATH_TO_DATA_FOLDER, "contaminant-group-codes.csv"),
                 sep=",",
                 header=0,
                 encoding="utf-8",
                index_col=0,
                 low_memory=False,
                 )


In [433]:
cont_groups.head()

Unnamed: 0_level_0,CONTAMINANT_NAME,CONTAMINANT_GROUP,CONTAMINANT_GROUP_CODE
CONTAMINANT_CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4000,"GROSS ALPHA, EXCL. RADON & U",ALL RADIOCHEMICAL,ARAD
4002,"GROSS ALPHA, INCL. RADON & U",ALL RADIOCHEMICAL,ARAD
4004,RADON,ALL RADIOCHEMICAL,ARAD
4006,COMBINED URANIUM,ALL RADIOCHEMICAL,ARAD
4100,GROSS BETA PARTICLE ACTIVITY,ALL RADIOCHEMICAL,ARAD


In [436]:
cont_groups.CONTAMINANT_GROUP.unique()

array(['ALL RADIOCHEMICAL', 'ALL REG EXP.PBCU/DBP', 'ALL SOCS',
       'ALL VOCS(REG/UNREG)', 'BASELINE SECONDARIES', 'CYANIDE GROUP',
       'TTHM & HAA5', 'SOC - DIQUAT', 'SOC - ENDOTHALL', 'FLUORIDE GROUP',
       'GROSS ALPHA', 'SOC - GLYPHOSATE', 'HALOACETIC ACIDS',
       'HEAVY METALS', 'NITRATES GROUP', 'NITRITE GROUP', 'NEW RAD RULE',
       'LEAD AND COPPER', 'ALL REG PRE-2008', 'QT GA & UMASS',
       'OLD RAD RULE', 'REGULATED SOCS', 'SECONDARY PARAMETERS',
       'SECONDARY HEAVY META', 'OLD SOCS', 'TOCA', 'TOTAL TRIHALOMETHANE',
       'UNREGULATED', 'VOLATILE ORGANICS'], dtype=object)

In [437]:
cont_groups.CONTAMINANT_GROUP_CODE.unique()

array(['ARAD', 'AREG', 'ASOC', 'AVOC', 'BSEC', 'CYA', 'DBP1', 'DIQU',
       'ENDO', 'FLU', 'GA', 'GLYP', 'HAA5', 'HM', 'NIT', 'NITI', 'NRAD',
       'PBCU', 'PRE8', 'QGAU', 'RAD', 'RSOC', 'SEC', 'SECM', 'SOCS',
       'TOCA', 'TTHM', 'UNRG', 'VOC1'], dtype=object)

In [434]:
cont_groups.index.is_unique

False

In [444]:
cont_groups.CONTAMINANT_GROUP_CODE.isnull().values.any()

False

In [443]:
cont_groups.index.isnull().any()

False

`CONTAMINANT_CODE` is not unique, so we need a new primary key.

In [438]:
ids_conts_groupe = []
for idx in range(cont_groups.shape[0]):
    ids_conts_groupe.append(str(uuid.uuid4()))

In [440]:
# Prepend the ID series to the dataframe
cont_groups.insert(0, 'ID', pd.Series(ids_conts_groupe, index=cont_groups.index))

In [441]:
cont_groups.head()

Unnamed: 0_level_0,ID,CONTAMINANT_NAME,CONTAMINANT_GROUP,CONTAMINANT_GROUP_CODE
CONTAMINANT_CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4000,33e97efe-8477-4ff3-a8d7-93f9f8b29aeb,"GROSS ALPHA, EXCL. RADON & U",ALL RADIOCHEMICAL,ARAD
4002,5b9dce38-5fa6-467d-a55c-5e48d912d5c5,"GROSS ALPHA, INCL. RADON & U",ALL RADIOCHEMICAL,ARAD
4004,ff0c8dd4-f77e-4b2d-81bb-9012d23fea38,RADON,ALL RADIOCHEMICAL,ARAD
4006,66a54c51-175d-436a-85d5-f3661d7ea79d,COMBINED URANIUM,ALL RADIOCHEMICAL,ARAD
4100,5376e21c-1dcb-4621-a33f-d0da4fcfa405,GROSS BETA PARTICLE ACTIVITY,ALL RADIOCHEMICAL,ARAD


In [442]:
# Data will be saved in the `sanitized` folder.
sanitized_conts_groupe_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'CONTAMINANT_GROUP_CODES.csv')
cont_groups.to_csv(sanitized_conts_groupe_file, sep=",", encoding='utf-8')

## `LCR_SAMPLE` table

In [445]:
SAMPLING_START_DATE_idx = 3 
SAMPLING_END_DATE_idx = 2

In [447]:
samp = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_LCR_SAMPLE.csv"),
                 sep=",",
                 header=0,
                 encoding="utf-8",
                index_col=1,
                 low_memory=False,
                parse_dates=[SAMPLING_END_DATE_idx, SAMPLING_START_DATE_idx]
                 )


In [448]:
samp.head()

Unnamed: 0_level_0,PWSID,SAMPLING_END_DATE,SAMPLING_START_DATE,RECONCILIATION_ID,PRIMACY_AGENCY_CODE,EPA_REGION
SAMPLE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ME30165,ME0090250,2018-12-31,2016-01-01,,ME,1
ME30307,ME0009850,2018-12-31,2016-01-01,,ME,1
ME30353,ME0091565,2018-12-31,2016-01-01,,ME,1
ME30211,ME0091537,2018-12-31,2016-01-01,,ME,1
ME30272,ME0002893,2019-12-31,2017-01-01,,ME,1


In [452]:
samp.PWSID.isnull().values.any()

False

In [453]:
samp.EPA_REGION.unique()

array([ 1,  7,  2,  3,  8, 10,  4,  9,  5,  6])

In [454]:
samp.PRIMACY_AGENCY_CODE.unique()

array(['ME', 'NE', 'NJ', 'VA', 'SD', 'ID', 'TN', 'AS', 'OH', 'LA', 'NV',
       'NN', 'CA', 'NY', 'VT', 'DE', 'FL', 'MS', '10', 'WA', 'WI', 'ND',
       'TX', 'MO', 'RI', 'UT', 'CT', 'MT', 'AR', 'WY', 'GA', 'OR', 'WV',
       'MN', 'IA', 'MA', 'OK', 'HI', 'MD', 'IL', 'AZ', 'NC', 'CO', '08',
       'IN', 'NM', 'KY', 'SC', 'PR', '01', 'DC', '06', 'MI', 'KS', 'AK',
       'NH', 'PA', '05', 'AL', '07', '09', '04', 'MP', 'GU', 'VI', '02'],
      dtype=object)

In [455]:
samp.RECONCILIATION_ID.unique()

array([nan])

`RECONCILIATION_ID` contains only `NaN`, so we drop it.

In [458]:
samp = samp.drop(columns='RECONCILIATION_ID')
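
Columns that contain only missing values can also be found without inspecting each one by hand. A sketch on a toy dataframe (the values are made up):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "PWSID": ["ME0090250", "ME0009850"],
    "RECONCILIATION_ID": [np.nan, np.nan],
})

# Columns in which every value is missing.
empty_cols = [c for c in demo.columns if demo[c].isna().all()]
print(empty_cols)  # ['RECONCILIATION_ID']

# One-step removal of all such columns.
cleaned = demo.dropna(axis=1, how="all")
print(list(cleaned.columns))  # ['PWSID']
```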

In [459]:
samp.head()

Unnamed: 0_level_0,PWSID,SAMPLING_END_DATE,SAMPLING_START_DATE,PRIMACY_AGENCY_CODE,EPA_REGION
SAMPLE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ME30165,ME0090250,2018-12-31,2016-01-01,ME,1
ME30307,ME0009850,2018-12-31,2016-01-01,ME,1
ME30353,ME0091565,2018-12-31,2016-01-01,ME,1
ME30211,ME0091537,2018-12-31,2016-01-01,ME,1
ME30272,ME0002893,2019-12-31,2017-01-01,ME,1


In [450]:
samp.index.is_unique

False

`SAMPLE_ID` is not unique, so we need a new key.

In [460]:
ids_samp = []
for idx in range(samp.shape[0]):
    ids_samp.append(str(uuid.uuid4()))

In [461]:
# Prepend the ID series to the dataframe
samp.insert(0, 'ID', pd.Series(ids_samp, index=samp.index))

In [462]:
samp.head()

Unnamed: 0_level_0,ID,PWSID,SAMPLING_END_DATE,SAMPLING_START_DATE,PRIMACY_AGENCY_CODE,EPA_REGION
SAMPLE_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ME30165,b386c319-cafd-45a8-acd8-f0b12a0b4977,ME0090250,2018-12-31,2016-01-01,ME,1
ME30307,c58688dc-8d85-4db6-ab37-aa34ea2e76bf,ME0009850,2018-12-31,2016-01-01,ME,1
ME30353,151d9793-5ecc-474a-b4f0-d7172460bd1d,ME0091565,2018-12-31,2016-01-01,ME,1
ME30211,400b8e00-5af2-4213-8b23-26654623e5c6,ME0091537,2018-12-31,2016-01-01,ME,1
ME30272,3194b387-cc9e-4268-a575-2542457243f0,ME0002893,2019-12-31,2017-01-01,ME,1


In [463]:
# Data will be saved in the `sanitized` folder.
sanitized_samp_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'LCR_SAMPLE.csv')
samp.to_csv(sanitized_samp_file, sep=",", encoding='utf-8')

In [464]:
longest_item(samp.index)

'191021111290L3Y2008-'

## `LCR_SAMPLE_RESULT` table

In [465]:
res = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_LCR_SAMPLE_RESULT.csv"),
                 sep=",",
                 header=0,
                 encoding="utf-8",
                index_col=0,
                 low_memory=False,
                 )


In [466]:
res.head()

Unnamed: 0_level_0,SAMPLE_ID,PRIMACY_AGENCY_CODE,EPA_REGION,SAR_ID,CONTAMINANT_CODE,RESULT_SIGN_CODE,SAMPLE_MEASURE,UNIT_OF_MEASURE
PWSID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AL0001366,AL56912,AL,4,16251692,PB90,,0.005,mg/L
AL0001380,AL56805,AL,4,16251696,PB90,,0.005,mg/L
WV3302709,WV32819,WV,3,15563385,PB90,,0.0,mg/L
WV3302947,WV32818,WV,3,15563453,PB90,,0.003,mg/L
KS2015905,KS6623,KS,7,14482660,PB90,,0.0027,mg/L


In [469]:
res.index.isnull().any()

False

`PWSID` is never null, so we can set a foreign key on `WATER_SYSTEM(PWSID)`.

In [470]:
res.SAMPLE_ID.isnull().values.any()

False

`SAMPLE_ID` is never null, so we can set a foreign key on `LCR_SAMPLE(SAMPLE_ID)`.
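A non-null column is only part of what a foreign key needs: every value in the child table must also exist in the parent table. The sketch below shows one way to look for orphan values with `isin`, using toy stand-ins for the two tables (the `KS6623` row is deliberately left without a parent).

```python
import pandas as pd

# Toy stand-ins for the parent (LCR_SAMPLE) and child (LCR_SAMPLE_RESULT) tables.
parent_ids = pd.Index(["AL56912", "AL56805", "WV32819"], name="SAMPLE_ID")
child = pd.DataFrame({"SAMPLE_ID": ["AL56912", "WV32819", "KS6623"]})

# Rows whose SAMPLE_ID has no match in the parent would violate the constraint.
orphan_mask = ~child["SAMPLE_ID"].isin(parent_ids)
orphans = child.loc[orphan_mask, "SAMPLE_ID"].tolist()
```

On the real data the equivalent check would be `res.SAMPLE_ID.isin(samp.index)`, assuming `samp` is still indexed by `SAMPLE_ID`.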

In [471]:
res.CONTAMINANT_CODE.unique()

array(['PB90', 'CU90'], dtype=object)

In [479]:
res.SAR_ID.is_unique and not res.SAR_ID.isnull().values.any()

True

`SAR_ID` is unique and never null, so we can use it as the primary key.
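If `SAR_ID` should become the index of the dataframe before export, one possible approach is sketched below on a toy frame. This step is not part of the original notebook; it only illustrates `set_index` with `verify_integrity=True`, which raises if the new key contains duplicates.

```python
import pandas as pd

# Toy frame shaped like `res`: indexed by PWSID, with SAR_ID as a regular column.
toy = pd.DataFrame(
    {"SAMPLE_ID": ["AL56912", "AL56805"], "SAR_ID": [16251692, 16251696]},
    index=pd.Index(["AL0001366", "AL0001380"], name="PWSID"),
)

# reset_index() turns PWSID back into a column; set_index() then promotes SAR_ID.
keyed = toy.reset_index().set_index("SAR_ID", verify_integrity=True)
```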

In [480]:
res.EPA_REGION.unique()

array([ 4,  3,  7,  6,  5,  1,  2,  9,  8, 10])

In [481]:
res.PRIMACY_AGENCY_CODE.unique()

array(['AL', 'WV', 'KS', 'OK', 'IL', 'CT', 'NJ', 'WI', 'CA', 'NY', 'MA',
       'PR', 'MT', 'ND', 'OR', 'MO', 'IA', 'MI', 'VT', '05', 'FL', 'AZ',
       'WA', 'KY', 'TX', 'VA', 'CO', 'ID', 'OH', 'LA', 'NN', 'IN', 'ME',
       'WY', 'GA', 'PA', 'TN', 'UT', 'RI', 'MN', 'AK', 'SD', 'DE', 'NE',
       'AR', 'SC', '06', 'NM', 'MD', 'NC', '09', 'HI', 'MS', 'NH', '10',
       '08', 'MP', '01', 'NV', 'DC', '04', '07', 'AS', '02', 'GU', 'VI'],
      dtype=object)

In [482]:
res.RESULT_SIGN_CODE.unique()

array([nan, '<', '='], dtype=object)

In [483]:
res.UNIT_OF_MEASURE.unique()

array(['mg/L'], dtype=object)

All measures use the same unit (`mg/L`), so there is no processing to do.

In [487]:
res.SAMPLE_MEASURE

PWSID
AL0001366    0.005000
AL0001380    0.005000
WV3302709    0.000000
WV3302947    0.003000
KS2015905    0.002700
KS2009102    0.000000
OK1020909    0.003177
OK1021622    0.004496
OK1021731    0.000000
IL2035200    0.004600
IL2010080    0.003500
CT1440021    0.000000
CT0189973    0.000000
NJ0805427    0.000000
WI8160511    0.004500
CA2100549    0.000000
NY2800138    0.002000
NY1330601    0.004000
MA2028015    0.000000
MA2012008    0.002000
MA1283003    0.011000
WI6480200    0.003005
WI7350614    0.002600
CA3901477    0.002100
CA3900517    0.009100
PR0004604    0.006000
WV9925013    0.001300
WV9925016    0.001500
IL1410250    0.000000
MT0003089    0.007000
               ...   
AZ0410242    0.000000
AZ0410317    0.000000
NY3202411    0.011000
VT0005619    0.009000
VT0021020    0.002000
VT0021037    0.005000
CT0110011    0.000000
CT0110051    0.001000
CT0180171    0.004000
CT0180181    0.003000
CT0970041    0.003000
CA3600086    0.002600
CA3600139    0.000000
CA3600166    0.000000
CA36

## `TREATMENT` table

In [488]:
treat = pd.read_csv(join(PATH_TO_DATA_FOLDER, "DG_TREATMENT.csv"),
                    sep=",",
                    header=0,
                    encoding="utf-8",
                    index_col=2,
                    low_memory=False,
                    )


In [491]:
treat.head()

Unnamed: 0_level_0,PWSID,FACILITY_ID,COMMENTS_TEXT,TREATMENT_OBJECTIVE_CODE,TREATMENT_PROCESS_CODE
TREATMENT_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
22727,MS0220004,31986,"INHIBITOR, POLYPHOSPHATE",C,447
26515,MS0220005,39514,"AERATION, PACKED TOWER",C,145
26518,MS0220005,39514,LIME - SODA ASH ADDITION,F,500
26524,MS0220005,39514,"FILTRATION, PRESSURE SAND",F,344
22407,MS0220007,31301,LIME - SODA ASH ADDITION,C,500


In [495]:
treat.PWSID.isnull().values.any() 

False

`PWSID` is never null, so we can set a foreign key on `WATER_SYSTEM(PWSID)`.

In [502]:
treat.FACILITY_ID.isnull().values.any()

False

`FACILITY_ID` is never null, so we can set a foreign key on `WATER_SYSTEM_FACILITY(FACILITY_ID)`.

In [496]:
treat.TREATMENT_OBJECTIVE_CODE.unique()

array(['C', 'F', 'D', 'P', 'T', 'Z', 'R', 'O', 'S', 'I', 'M', 'B', 'E'],
      dtype=object)

In [497]:
treat.TREATMENT_PROCESS_CODE.unique()

array([447, 145, 500, 344, 401, 742, 240, 125, 360, 345, 660, 421, 740,
       348, 403, 143, 380, 160, 600, 121, 520, 141, 423, 461, 473, 460,
       999, 640, 147, 680, 346, 343, 720, 341, 741, 200, 560, 443, 445,
       342, 449, 100, 361, 543, 363, 320, 220, 347, 455, 300, 700, 149,
       627, 623, 441, 541, 620, 180, 580, 625, 365, 362, 370, 190, 369,
       364, 367, 372, 368])

In [492]:
treat.index.is_unique

False

The `TREATMENT_ID` index is not unique, so we need a new key for the treatment table.

In [498]:
# One random UUID per row, used as a surrogate key
ids_treat = [str(uuid.uuid4()) for _ in range(treat.shape[0])]

In [499]:
# Prepend the ID series to the dataframe
treat.insert(0, 'ID', pd.Series(ids_treat, index=treat.index))
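The same UUID pattern is applied to both `samp` and `treat`, so it could be factored into a small helper. The function below is a sketch; `add_uuid_column` is a hypothetical name that does not exist in the notebook. It passes a plain list to `insert`, which is assigned by position and therefore works even when the index contains duplicates.

```python
import uuid

import pandas as pd


def add_uuid_column(df: pd.DataFrame, name: str = "ID") -> pd.DataFrame:
    """Insert a first column of random UUID4 strings (in place) and return df."""
    ids = [str(uuid.uuid4()) for _ in range(len(df))]
    df.insert(0, name, ids)
    return df


# Toy frame with a duplicated index label, as in `treat`.
toy = pd.DataFrame(
    {"PWSID": ["MS0220004", "MS0220005", "MS0220005"]},
    index=pd.Index([22727, 26515, 26515], name="TREATMENT_ID"),
)
add_uuid_column(toy)
```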

In [500]:
treat.head()

Unnamed: 0_level_0,ID,PWSID,FACILITY_ID,COMMENTS_TEXT,TREATMENT_OBJECTIVE_CODE,TREATMENT_PROCESS_CODE
TREATMENT_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
22727,835a5452-dc6b-498a-a957-7d1abb2c0ecf,MS0220004,31986,"INHIBITOR, POLYPHOSPHATE",C,447
26515,5f5d79e3-4c59-40f7-8535-eeeb55ec8213,MS0220005,39514,"AERATION, PACKED TOWER",C,145
26518,e8ace27a-056b-40b9-ac21-e955d2411acf,MS0220005,39514,LIME - SODA ASH ADDITION,F,500
26524,55139563-567d-408e-815b-d32ecdfdb687,MS0220005,39514,"FILTRATION, PRESSURE SAND",F,344
22407,f4da5e67-5e1b-47e4-9062-3b0608de3282,MS0220007,31301,LIME - SODA ASH ADDITION,C,500


In [501]:
# Data will be saved in the `sanitized` folder.
sanitized_treat_file = join(PATH_TO_DATA_FOLDER, 'sanitized', 'TREATMENT.csv')
treat.to_csv(sanitized_treat_file, sep=",", encoding='utf-8')

In [507]:
len(longest_item(treat.index))

17

In [506]:
treat.index.isnull().any()

False