# Updating E. coli data for CRWA's visualization

Approach:
1. A Google spreadsheet connected to ArcGIS serves as the main database. This code updates that database with new data.
2. The new data is pulled through an Access query. The exported file name must match the one used in the code.
3. Running this code clears the current data in the Google spreadsheet and replaces it with the updated and formatted data.
4. ArcGIS should then be updated with the new data.
5. The data is also archived in Google Drive, and a downloadable data file is created.

Steps to run code:

1.   Export the data from Access and store it in a local folder. Make sure the file is named "DataVizPipeline.xlsx" (an Excel export; the import cell also contains a commented-out alternative for a CSV export).
2.   In the Runtime menu, select "Run all".
3.   Scroll down and follow the code:
  1. In the "Allowing Access to Google Drive / Google Sheet" section, click the generated link, which opens a new tab. In the new tab, select the Google account and click "Allow". Then copy the verification code and paste it into the input box of that section in this notebook.
  2. In the "File Upload" section, click "Choose files" and select the newly exported Access data.
   
Notes:
* Make sure the following spreadsheets already exist in Google Drive:
  * E.Coli Database for ArcGis
  * data_site
* Make sure the outputs point to the correct locations:
  * For example, the folder for archiving data is initially set to "E. Coli Data" but can be changed to any other location.


# Initializing Libraries 

In [50]:
#Installing Reverse Geocoder for coordinates mapping
!pip install reverse_geocoder



In [51]:
# Importing libraries
import pandas as pd
import numpy as np
from datetime import date

from oauth2client.client import GoogleCredentials
from google.colab import files
from google.colab import auth
from google.colab import drive
import io

import reverse_geocoder as rg

import gspread
from gspread_dataframe import set_with_dataframe

import warnings
warnings.filterwarnings('ignore')

# Allowing Access to Google Drive / Google Sheet

In [52]:
# Allowing access to Google Drive / Google Sheet
# Follow the instructions to allow access: copy the code from the new tab into the box below and press Enter to confirm. Repeat this each time it is prompted.
auth.authenticate_user()

# File Upload

In [53]:
# Upload the Access export into Google Colab
uploaded = files.upload()

Saving DataVizPipeline.xlsx to DataVizPipeline (1).xlsx


In [54]:
# Importing data into dataframe to be manipulated 
df = pd.read_excel(io.BytesIO(uploaded['DataVizPipeline.xlsx']))

# Alternative code to import CSV
# df = pd.read_csv(io.BytesIO(uploaded['DataVizPipeline.csv']))

# Check
df.head()

Unnamed: 0,Monitoring_Sites.Site_ID,Latitude_DD,Longitude_DD,River_Mile_Headwaters,Activity_ID,Results.Site_ID,Reporting_Result,Component_Name,Activity_Type,Unit_Abbreviation,Date_Collected,Time_Collected,QAQC_Status,Site_Name,Town
0,1NBS,42.3589,-71.1619,,FLG201806261NBSEC01,1NBS,3200.0,Escherichia coli,Sample-Routine,cfu/100ml,2018-06-26,09:42:00,Final/Accepted,North Beacon St. Bridge-Center,Watertown
1,2LARZ,42.3691,-71.1235,,FLG201806262LARZEC01,2LARZ,1530.0,Escherichia coli,Sample-Routine,cfu/100ml,2018-06-26,09:10:00,Final/Accepted,Larz Anderson Bridge-Center,Cambridge
2,3BU,42.3526,-71.1112,,FLG201806263BUEC01,3BU,320.0,Escherichia coli,Sample-Routine,cfu/100ml,2018-06-26,08:50:00,Final/Accepted,BU Bridge-Center,Cambridge
3,4LONG,42.3611,-71.0758,,FLG201806264LONGEC01,4LONG,120.0,Escherichia coli,Sample-Routine,cfu/100ml,2018-06-26,08:03:00,Final/Accepted,Longfellow Bridge-Center,Cambridge
4,1NBS,42.3589,-71.1619,,FLG201806281NBSEC01,1NBS,250.0,Escherichia coli,Sample-Routine,cfu/100ml,2018-06-28,08:56:00,Final/Accepted,North Beacon St. Bridge-Center,Watertown
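
The upload log above shows Colab saving a repeated upload as `DataVizPipeline (1).xlsx`, so a lookup by one fixed key can be fragile. A minimal sketch of a more tolerant import, assuming `uploaded` is the dict of file name to bytes returned by `files.upload()`; the helper name `read_first_upload` is hypothetical and not part of the original notebook:

```python
import io

import pandas as pd


def read_first_upload(uploaded):
    """Load the first uploaded file into a DataFrame, whatever its name.

    `uploaded` is assumed to map file names to raw bytes.
    """
    name, content = next(iter(uploaded.items()))
    buffer = io.BytesIO(content)
    # Pick the reader from the file extension; anything else is treated as Excel.
    if name.lower().endswith('.csv'):
        return pd.read_csv(buffer)
    return pd.read_excel(buffer)
```

In the notebook this would be called as `df = read_first_upload(uploaded)` in place of the fixed-key lookup.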


# Data Manipulation

In [55]:
# Formatting/processing data into a usable, correct format

# Selecting columns to be used
# .copy() makes df_proc an independent dataframe, so later edits do not trigger SettingWithCopyWarning
df_proc = df[['Date_Collected', 'Component_Name', 'Results.Site_ID', 'Site_Name', 'Latitude_DD', 'Longitude_DD', 'Reporting_Result', 'Unit_Abbreviation']].copy()

# Renaming the Site ID column
df_proc = df_proc.rename(columns = {'Results.Site_ID':'Site_ID'})

In [56]:
# Finding the number of unique values in each column to spot erroneous data
for i in df_proc.columns:
    print(i)
    print(df_proc[i].nunique())

Date_Collected
193
Component_Name
2
Site_ID
44
Site_Name
43
Latitude_DD
38
Longitude_DD
39
Reporting_Result
469
Unit_Abbreviation
2


In [57]:
# Finding unique values within column 
print(df_proc['Component_Name'].unique())
print(df_proc['Unit_Abbreviation'].unique())

['Escherichia coli' 'Fecal coliform']
['cfu/100ml' 'MPN/100ml']


Note: Component_Name and Unit_Abbreviation should each contain a single unique value, but the output above shows two of each.

In [58]:
# Renaming Fecal coliform to match with Escherichia coli
df_proc.loc[df_proc['Component_Name'] == 'Fecal coliform', 'Component_Name'] = 'Escherichia coli'

# Renaming unit abbreviation to match
df_proc.loc[df_proc['Unit_Abbreviation'] == 'MPN/100ml', 'Unit_Abbreviation'] = 'cfu/100ml'

# Check
for i in df_proc.columns:
    print(i)
    print(df_proc[i].nunique())

Date_Collected
193
Component_Name
1
Site_ID
44
Site_Name
43
Latitude_DD
38
Longitude_DD
39
Reporting_Result
469
Unit_Abbreviation
1


In [59]:
# Check
df_proc.head()

Unnamed: 0,Date_Collected,Component_Name,Site_ID,Site_Name,Latitude_DD,Longitude_DD,Reporting_Result,Unit_Abbreviation
0,2018-06-26,Escherichia coli,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619,3200.0,cfu/100ml
1,2018-06-26,Escherichia coli,2LARZ,Larz Anderson Bridge-Center,42.3691,-71.1235,1530.0,cfu/100ml
2,2018-06-26,Escherichia coli,3BU,BU Bridge-Center,42.3526,-71.1112,320.0,cfu/100ml
3,2018-06-26,Escherichia coli,4LONG,Longfellow Bridge-Center,42.3611,-71.0758,120.0,cfu/100ml
4,2018-06-28,Escherichia coli,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619,250.0,cfu/100ml


In [60]:
# Get coordinates of the sites
coordinates = []
for row in df_proc.pivot_table(index=['Latitude_DD', 'Longitude_DD']).index:
    coordinates.append(row)
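# Remove duplicates, then convert back to a list: a list keeps a stable order
# and is accepted by pd.DataFrame (recent pandas versions reject sets)
coordinates = list(set(coordinates))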

# Get town names based on coordinates
towns = []
for coor in coordinates: # This takes about 30 seconds
    towns.append(rg.search(coor)[0]['name'])

In [61]:
# Make a dataframe with Latitude, Longitude and Town names
df_town = pd.DataFrame(data = coordinates, columns=['Latitude_DD', 'Longitude_DD'])

# Adding the town names as a new column
df_town['Town'] = towns

# Check 
df_town.head()

Unnamed: 0,Latitude_DD,Longitude_DD,Town
0,42.1169,-71.5014,Milford
1,42.1311,-71.3768,Medway
2,42.1365,-71.4185,Medway
3,42.0943,-71.476,Bellingham
4,42.1395,-71.5123,Milford
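
The coordinate lookup above can also be written without the pivot table and with a single batched search. This is only a sketch under stated assumptions: the helper names `unique_coordinates` and `build_town_table` are hypothetical, and `search` is assumed to behave like `rg.search` when given a list of `(lat, lon)` tuples, returning one dict per tuple with a `'name'` key.

```python
import pandas as pd


def unique_coordinates(df):
    """Return the distinct (latitude, longitude) pairs as a list of tuples."""
    pairs = df[['Latitude_DD', 'Longitude_DD']].dropna().drop_duplicates()
    return list(pairs.itertuples(index=False, name=None))


def build_town_table(df, search):
    """Build a Latitude/Longitude/Town lookup table.

    `search` is an assumed callable: list of (lat, lon) tuples in,
    list of dicts with a 'name' key out, in the same order.
    """
    coords = unique_coordinates(df)
    results = search(coords)  # one batched call instead of one call per site
    table = pd.DataFrame(coords, columns=['Latitude_DD', 'Longitude_DD'])
    table['Town'] = [r['name'] for r in results]
    return table
```

In the notebook the call would look like `df_town = build_town_table(df_proc, rg.search)`.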


In [62]:
# Adding town locations to dataset
df_final = df_proc.merge(df_town, on=['Latitude_DD', 'Longitude_DD'], how='left')

# Check 
df_final.head()

Unnamed: 0,Date_Collected,Component_Name,Site_ID,Site_Name,Latitude_DD,Longitude_DD,Reporting_Result,Unit_Abbreviation,Town
0,2018-06-26,Escherichia coli,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619,3200.0,cfu/100ml,Watertown
1,2018-06-26,Escherichia coli,2LARZ,Larz Anderson Bridge-Center,42.3691,-71.1235,1530.0,cfu/100ml,Cambridge
2,2018-06-26,Escherichia coli,3BU,BU Bridge-Center,42.3526,-71.1112,320.0,cfu/100ml,Brookline
3,2018-06-26,Escherichia coli,4LONG,Longfellow Bridge-Center,42.3611,-71.0758,120.0,cfu/100ml,Boston
4,2018-06-28,Escherichia coli,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619,250.0,cfu/100ml,Watertown


In [63]:
#Checking if dataset contains null values
df_final.isnull().sum()

Date_Collected         0
Component_Name         0
Site_ID                0
Site_Name              6
Latitude_DD          123
Longitude_DD         123
Reporting_Result       0
Unit_Abbreviation      0
Town                 123
dtype: int64

In [64]:
# Checking number of rows before deleting
print('Number of rows before deleting: {}'.format(df_final.shape))

# Deleting any rows with empty values
df_final = df_final.dropna(how='any', axis = 0)

# Checking number of rows after deleting
print('Number of rows after deleting: {}'.format(df_final.shape))

#Checking if dataset contains null values
print('')
df_final.isnull().sum()

Number of rows before deleting: (4140, 9)
Number of rows after deleting: (4017, 9)



Date_Collected       0
Component_Name       0
Site_ID              0
Site_Name            0
Latitude_DD          0
Longitude_DD         0
Reporting_Result     0
Unit_Abbreviation    0
Town                 0
dtype: int64
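
Since rows without coordinates are dropped, it may help to see which sites they belong to before they disappear. A minimal sketch, assuming the column names used in this notebook; the helper name `sites_missing_coordinates` is hypothetical:

```python
import pandas as pd


def sites_missing_coordinates(df):
    """Count rows per Site_ID that lack a latitude or a longitude."""
    missing = df['Latitude_DD'].isna() | df['Longitude_DD'].isna()
    counts = df.loc[missing].groupby('Site_ID').size()
    # Largest offenders first
    return counts.sort_values(ascending=False)
```

This would need to run on `df_final` before the `dropna` call; the result could be used to correct the site records in Access.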

In [65]:
#Adding safety column based on reporting result column
df_final['Safety'] = df_final['Reporting_Result']
df_final['Safety'] = pd.cut(df_final['Safety'], [-1, 235, 1260, 1000000], labels=['Safe', 'No Swimming', 'Not safe for activities'])

# Note: do not change these labels. The color coding in the ArcGIS base map is keyed to them exactly as written.
#       The legend of the web map was then edited manually to show more precise wording on the final dashboard.

# Check
df_final.head()

Unnamed: 0,Date_Collected,Component_Name,Site_ID,Site_Name,Latitude_DD,Longitude_DD,Reporting_Result,Unit_Abbreviation,Town,Safety
0,2018-06-26,Escherichia coli,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619,3200.0,cfu/100ml,Watertown,Not safe for activities
1,2018-06-26,Escherichia coli,2LARZ,Larz Anderson Bridge-Center,42.3691,-71.1235,1530.0,cfu/100ml,Cambridge,Not safe for activities
2,2018-06-26,Escherichia coli,3BU,BU Bridge-Center,42.3526,-71.1112,320.0,cfu/100ml,Brookline,No Swimming
3,2018-06-26,Escherichia coli,4LONG,Longfellow Bridge-Center,42.3611,-71.0758,120.0,cfu/100ml,Boston,Safe
4,2018-06-28,Escherichia coli,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619,250.0,cfu/100ml,Watertown,No Swimming
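
The thresholds above can be checked in isolation. The sketch below reuses the same bin edges and labels but, as an assumption not in the original notebook, makes the top bin open-ended with `np.inf`, so that a result above 1,000,000 is still classified instead of becoming a missing value. The function name `classify_safety` is hypothetical.

```python
import numpy as np
import pandas as pd

# Same edges and labels as the notebook, except for the open-ended upper bound
BINS = [-1, 235, 1260, np.inf]
LABELS = ['Safe', 'No Swimming', 'Not safe for activities']


def classify_safety(results):
    """Map cfu/100ml results to safety labels; bins are closed on the right."""
    return pd.cut(results, bins=BINS, labels=LABELS)


sample = pd.Series([120.0, 235.0, 250.0, 1260.0, 3200.0, 2_000_000.0])
safety = classify_safety(sample)
```

Because the bins are closed on the right, a result of exactly 235 is labeled "Safe" and exactly 1260 is labeled "No Swimming".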


# Clearing Old Data and Importing New Data

In [66]:
# Establishing credentials to interact with Google Drive & Google Sheets
gc = gspread.authorize(GoogleCredentials.get_application_default())

Note: There must be a spreadsheet in google drive named "E.Coli Database for ArcGis"

In [67]:
# Defining worksheet 
sheet = gc.open('E.Coli Database for ArcGis').sheet1

In [68]:
# Clear all contents in worksheet
sheet.clear()

{'clearedRange': 'Data!A1:Z4018',
 'spreadsheetId': '1ua9lk6zM7AXeko9905qmmK6z4KHKEvR_QyzNFWmSi2E'}

In [69]:
# Importing final data frame into google sheets
set_with_dataframe(sheet, df_final)

# Creating Data_site Sheet

In [70]:
# Selecting columns to be used to create data_site sheet
df_site = df_final[['Town', 'Site_ID', 'Site_Name', 'Latitude_DD', 'Longitude_DD']]

# Check 
df_site.head()

Unnamed: 0,Town,Site_ID,Site_Name,Latitude_DD,Longitude_DD
0,Watertown,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619
1,Cambridge,2LARZ,Larz Anderson Bridge-Center,42.3691,-71.1235
2,Brookline,3BU,BU Bridge-Center,42.3526,-71.1112
3,Boston,4LONG,Longfellow Bridge-Center,42.3611,-71.0758
4,Watertown,1NBS,North Beacon St. Bridge-Center,42.3589,-71.1619


In [71]:
# Checking number of rows before deleting
print('Number of rows before deleting: {}'.format(df_site.shape))

# Deleting any duplicates (keeping unique rows)
df_site_final = df_site.drop_duplicates()

# Checking number of rows after deleting
print('Number of rows after deleting: {}'.format(df_site_final.shape))


Number of rows before deleting: (4017, 5)
Number of rows after deleting: (39, 5)
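
A site whose name, town, or coordinates were recorded inconsistently would survive `drop_duplicates()` as more than one row. A minimal sketch for spotting such cases, assuming the `df_site` columns shown above; the helper name `conflicting_sites` is hypothetical:

```python
import pandas as pd


def conflicting_sites(df_site):
    """Return deduplicated rows whose Site_ID still appears more than once."""
    unique_rows = df_site.drop_duplicates()
    counts = unique_rows['Site_ID'].value_counts()
    repeated_ids = counts[counts > 1].index
    return unique_rows[unique_rows['Site_ID'].isin(repeated_ids)]
```

An empty result from `conflicting_sites(df_site)` would suggest that each site ID maps to exactly one name and location.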


# Creating Data_Site File

In [72]:
# Establishing credentials to interact with Google Drive & Google Sheets
gc = gspread.authorize(GoogleCredentials.get_application_default())

Note: There must be a spreadsheet in google drive named "data_site"

In [73]:
# Defining worksheet 
sheet = gc.open('data_site').sheet1

In [74]:
# Clear all contents in worksheet
sheet.clear()

{'clearedRange': 'data_site!A1:Y1000',
 'spreadsheetId': '172irAQZIu7NkOzVp8KsrsQ2Tujx8twzPK60KZXfQez0'}

In [75]:
# Importing final data frame into google sheets
set_with_dataframe(sheet, df_site_final)

# Archiving Dataset to Google Drive

In [76]:
# Follow the instructions to allow access: copy the code from the new tab into the box below and press Enter to confirm.
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [77]:
# Naming csv file to current date
today = date.today()
file_name = f'{today}_e_coli_data.csv'

# Writing the final dataframe to a dated CSV and copying it to Google Drive for reference
# (to_csv returns None when given a path, so there is nothing to assign)
df_final.to_csv(file_name)
!cp $file_name "drive/My Drive/E. Coli Data/"

# Creating Download File

In [78]:
# Fixed file name for the downloadable csv (overwritten on each run)
download_file_name = 'e_coli_data.csv'

# Writing the final dataframe to csv for the download file
df_final.to_csv(download_file_name)
!cp  $download_file_name "drive/My Drive/"