# Objective
The objective of this notebook is to carry out the following initial tasks:
1. Read all the tables (csv files) from the Safe Drinking Water Information System (SDWIS)
2. Select the columns highlighted in green or yellow in the data dictionary (Google Docs)
3. Select data from years 2009-2016
4. Select only active water systems
5. Select only community water systems

The tasks listed above were discussed at the meetup on April 30. If there are any changes to our approach, we can update our tasks.

As we continue our discussions, we can add next steps such as data cleaning, transformation, visualization, analysis, and modeling.

In [34]:
import os
import pandas as pd
# for OS X, install using pip (or pip3) install mysqlclient
import MySQLdb

%matplotlib inline

## Connect to DB
### Note: Fetching data from the cloud DB will take a few minutes

Follow the instructions here to connect to the Google Cloud MySQL DB with the SDWIS data:
https://docs.google.com/presentation/d/17iO91g03-PkIXLm0Vba5tTsTUobVcS96fuV52u8R7NE/edit?usp=sharing

In [35]:
# Read the username (un) and password (pwd) from the local file data/my_settings.txt, which is not deployed!
# First line: username; second line: password
data_dir = '../../../data'
my_settings = []
with open(os.path.join(data_dir, 'my_settings.txt')) as f:
    for line in f:
        my_settings.append(line)
un = my_settings[0].rstrip('\n')
pwd = my_settings[1].rstrip('\n')

# Alternatively, just enter un and pwd here, but don't check them in!
db_user = un
password = pwd
db = MySQLdb.connect(host="127.0.0.1",    # your host, usually localhost
                     user=db_user,         # your username
                     passwd=password,  # your password
                     port=3306,
                     db = 'safe_water')        # name of the data base
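
The credential-reading step above could also be wrapped in a small helper. This is only a sketch: `read_credentials` is a hypothetical name, and it assumes the same two-line file layout (username first, password second). It strips all surrounding whitespace, so it would not suit a password that ends in a space. The demo writes a throwaway file rather than touching `my_settings.txt`.

```python
import os
import tempfile


def read_credentials(path):
    """Return (username, password) from a two-line settings file.

    Hypothetical helper: assumes line 1 is the username and line 2 the password.
    """
    with open(path) as f:
        # Drop blank lines and surrounding whitespace (including '\r' and '\n').
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a username line and a password line in " + path)
    return lines[0], lines[1]


# Demo with a temporary file holding made-up credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "my_settings.txt")
    with open(demo_path, "w") as f:
        f.write("demo_user\ndemo_password\n")
    demo_un, demo_pwd = read_credentials(demo_path)
```
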

# Reading data

In [36]:
cmd = "select PWSID, AVAILABILITY_CODE, FACILITY_ACTIVITY_CODE, FACILITY_TYPE_CODE, FILTRATION_STATUS_CODE, " \
    + "IS_SOURCE_IND, IS_SOURCE_TREATED_IND, SELLER_PWSID, SELLER_TREATMENT_CODE, SUBMISSION_STATUS_CODE " \
    + "from WATER_SYSTEM_FACILITY"
water_sys_fac = pd.read_sql(cmd, db)
water_sys_fac.head()

Unnamed: 0,PWSID,AVAILABILITY_CODE,FACILITY_ACTIVITY_CODE,FACILITY_TYPE_CODE,FILTRATION_STATUS_CODE,IS_SOURCE_IND,IS_SOURCE_TREATED_IND,SELLER_PWSID,SELLER_TREATMENT_CODE,SUBMISSION_STATUS_CODE
0,WA5329887,P,I,WL,,1,,,,Y
1,WA5302600,P,A,WL,,1,,,,Y
2,VA4073310,E,I,WL,,1,U,,,Y
3,KS2002107,GW,00032175,,,0,CWS,,P,WL
4,OR4191666,P,A,WL,,1,,,,Y


In [None]:
cmd = "select PWSID, EPA_REGION, GW_SW_CODE, IS_GRANT_ELIGIBLE_IND, IS_SCHOOL_OR_DAYCARE_IND, " \
    + "IS_WHOLESALER_IND, OUTSTANDING_PERFORMER, OUTSTANDING_PERFORM_BEGIN_DATE, OWNER_TYPE_CODE, " \
    + "POPULATION_SERVED_COUNT, PRIMACY_AGENCY_CODE, PRIMACY_TYPE, PRIMARY_SOURCE_CODE, PWS_ACTIVITY_CODE, " \
    + "PWS_TYPE_CODE, SERVICE_CONNECTIONS_COUNT, SOURCE_WATER_PROTECTION_CODE, STATE_CODE, SUBMISSION_STATUS_CODE, " \
    + "ZIP_CODE from WATER_SYSTEM"
water_sys = pd.read_sql(cmd, db)
water_sys.head()

Unnamed: 0,PWSID,EPA_REGION,GW_SW_CODE,IS_GRANT_ELIGIBLE_IND,IS_SCHOOL_OR_DAYCARE_IND,IS_WHOLESALER_IND,OUTSTANDING_PERFORMER,OUTSTANDING_PERFORM_BEGIN_DATE,OWNER_TYPE_CODE,POPULATION_SERVED_COUNT,PRIMACY_AGENCY_CODE,PRIMACY_TYPE,PRIMARY_SOURCE_CODE,PWS_ACTIVITY_CODE,PWS_TYPE_CODE,SERVICE_CONNECTIONS_COUNT,SOURCE_WATER_PROTECTION_CODE,STATE_CODE,SUBMISSION_STATUS_CODE,ZIP_CODE
0,10106001,1,SW,1,0,1,,,N,39552,1,Tribal,GU,A,CWS,143,,CT,Y,06339-3060
1,10109001,1,GW,0,0,0,,,N,41185,1,Tribal,GW,I,NTNCWS,1,,,Y,
2,10109005,1,SW,1,0,0,,,N,37860,1,Tribal,SWP,A,CWS,20,,CT,Y,06382
3,10307001,1,GW,1,0,0,,,N,84,1,Tribal,GW,A,CWS,33,,MA,Y,02535
4,10502001,1,GW,0,0,0,,,N,30,1,Tribal,GW,I,TNCWS,2,,RI,Y,02813
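
Tasks 4 and 5 (active, community water systems) could be applied to this table along the following lines. This is a minimal sketch, not an agreed method: it assumes that `PWS_ACTIVITY_CODE == 'A'` marks an active system and that `PWS_TYPE_CODE == 'CWS'` marks a community water system, which is consistent with the values in the preview but should be checked against the data dictionary. The helper name `select_active_community` and the toy `sample` frame are illustrative only.

```python
import pandas as pd

# Toy rows that reuse a few values from the WATER_SYSTEM preview.
sample = pd.DataFrame({
    "PWSID": ["10106001", "10109001", "10109005", "10502001"],
    "PWS_ACTIVITY_CODE": ["A", "I", "A", "I"],
    "PWS_TYPE_CODE": ["CWS", "NTNCWS", "CWS", "TNCWS"],
})


def select_active_community(df):
    """Keep rows that are both active and community water systems.

    Assumption: 'A' means active and 'CWS' means community water system.
    """
    mask = (df["PWS_ACTIVITY_CODE"] == "A") & (df["PWS_TYPE_CODE"] == "CWS")
    return df.loc[mask].reset_index(drop=True)


active_cws = select_active_community(sample)
```

On the real data the same call would be `select_active_community(water_sys)`. The resulting `PWSID` values could then be used to restrict the other tables.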


In [None]:
cmd = "select PWSID, AREA_TYPE_CODE, TRIBAL_CODE from GEOGRAPHIC_AREA"
geog_area = pd.read_sql(cmd, db)
geog_area.head()

Unnamed: 0,PWSID,AREA_TYPE_CODE,TRIBAL_CODE
0,MI2420105,CN,
1,MI2420108,CN,
2,MI2420109,CN,
3,MI2420127,CN,
4,MI2420129,CN,


In [None]:
cmd = "select PWSID, SERVICE_AREA_TYPE_CODE from SERVICE_AREA"
serv_area = pd.read_sql(cmd, db)
serv_area.head()

Unnamed: 0,PWSID,SERVICE_AREA_TYPE_CODE
0,NY1330311,MU
1,OH3030812,SC
2,WI4710506,OA
3,VA5019905,PA
4,IA9271602,MH


In [None]:
cmd = "select PWSID, TREATMENT_OBJECTIVE_CODE, TREATMENT_PROCESS_CODE from TREATMENT"
treat = pd.read_sql(cmd, db)
treat.head()

Unnamed: 0,PWSID,TREATMENT_OBJECTIVE_CODE,TREATMENT_PROCESS_CODE
0,MN5820293,P,341
1,ND2911428,I,640
2,OH7748212,P,341
3,OR4100959,F,560
4,TX2300020,D,423


In [None]:
cmd = "select PWSID, COMPL_PER_BEGIN_DATE, CONTAMINANT_CODE, CORRECTIVE_ACTION_ID, FACILITY_ID, " \
    + "IS_HEALTH_BASED_IND, IS_MAJOR_VIOL_IND, LATEST_ENFORCEMENT_ID, PUBLIC_NOTIFICATION_TIER, " \
    + "RTC_DATE, RTC_ENFORCEMENT_ID, RULE_CODE, RULE_FAMILY_CODE, RULE_GROUP_CODE, SEVERITY_IND_CNT, " \
    + "VIOLATION_CATEGORY_CODE, VIOLATION_CODE from VIOLATION"
viol = pd.read_sql(cmd, db)
viol.head()
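
Task 3 (years 2009-2016) might be applied to the violations with a date filter like the one below. This is a sketch under two assumptions: that `COMPL_PER_BEGIN_DATE` is the right column for defining the year, and that its values parse with `pd.to_datetime`. The PWSIDs and dates in `viol_sample` are made up, and `select_years` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows; only the column names come from the VIOLATION query.
viol_sample = pd.DataFrame({
    "PWSID": ["AA0000001", "AA0000002", "AA0000003", "AA0000004"],
    "COMPL_PER_BEGIN_DATE": ["2008-12-31", "2009-01-01", "2016-12-31", None],
})


def select_years(df, date_col="COMPL_PER_BEGIN_DATE", first=2009, last=2016):
    """Keep rows whose date falls in the years first..last (inclusive)."""
    # Unparseable or missing dates become NaT and are dropped by the mask.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    mask = dates.dt.year.between(first, last)
    return df.loc[mask].reset_index(drop=True)


viol_2009_2016 = select_years(viol_sample)
```

Rows with a missing date are excluded here. Whether they should be kept is a decision for the group.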

# Notes on feature selection
There were no variables highlighted in green/yellow in the data dictionary sheet for the following tables:
1. ENFORCEMENT_ACTION
2. LCR_SAMPLE 
3. LCR_SAMPLE_RESULT 
4. VIOLATION_ENF_ASSOC

We should confirm that there really are no variables we want to include from these tables, or decide whether to revisit them at a later date.