# SARA Data
#### Water Quality Data
- Water quality data collected at fixed sampling locations along the San Antonio River and its tributaries. Parameters include observational data, organics, bacteria, and others. (22.79 MB)
./data/sara-water-quality-bexar.csv
More information:
https://sara-tx.maps.arcgis.com/apps/MapSeries/index.html?appid=3a4ca132222e41589e6f41eebfe6d36d	


#### San Antonio River Authority - Index of Elevation Contours with Download Links - Four Counties
- High-resolution terrain contours of Bexar, Wilson, Karnes, and Goliad counties.
Only available via the website:
https://sara-tx.maps.arcgis.com/home/item.html?id=c8d366e2680242fbae437006a3e8a3d2	

#### Rainfall	
- Rainfall amounts at 5-minute intervals for locations around Bexar County. (2.56 MB)
./data/sara-rainfall-detail.csv		

- Daily total rainfall amounts for locations around Bexar County. (800.73 KB)
./data/sara-rainfall-summary.csv		

#### Flood stage levels	
- Surface water levels at four locations along the San Antonio River within the Mission Reach area. (109.96 KB)
./data/sara-riverstage-daily-average.csv	


#### Impervious coverage (GIS data)	
- GIS shapefiles detailing the impervious cover within the Brooks City Base area. (21.27 MB)
./data/brooks_ic/brooks_ic.cpg
./data/brooks_ic/brooks_ic.dbf
./data/brooks_ic/brooks_ic.prj
./data/brooks_ic/brooks_ic.sbn
./data/brooks_ic/brooks_ic.sbx
./data/brooks_ic/brooks_ic.shp
./data/brooks_ic/brooks_ic.shp.xml
./data/brooks_ic/brooks_ic.shx

- GIS shapefiles detailing the impervious cover within TAMUS areas. (1.7 MB)
./data/tamusa_ic/tamusa_ic.cpg
./data/tamusa_ic/tamusa_ic.dbf
./data/tamusa_ic/tamusa_ic.prj
./data/tamusa_ic/tamusa_ic.sbn
./data/tamusa_ic/tamusa_ic.sbx
./data/tamusa_ic/tamusa_ic.shp
./data/tamusa_ic/tamusa_ic.shp.xml
./data/tamusa_ic/tamusa_ic.shx
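
Each shapefile listed above is a group of sidecar files sharing one base name. The `.shp`, `.shx`, and `.dbf` parts are mandatory; the others (`.prj`, `.cpg`, `.sbn`, `.sbx`, `.shp.xml`) are optional. The following is a minimal sketch, assuming the directory layout shown above, of a check that a download is complete before it is handed to a GIS library. The helper name `missing_shapefile_parts` is hypothetical and not part of this project.

```python
from pathlib import Path

# Mandatory components of an ESRI shapefile; other sidecars are optional.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(directory, base_name):
    """Return the mandatory sidecar files that are absent from `directory`."""
    folder = Path(directory)
    return [
        f"{base_name}{suffix}"
        for suffix in REQUIRED_SUFFIXES
        if not (folder / f"{base_name}{suffix}").is_file()
    ]


if __name__ == "__main__":
    # Directory and base names taken from the listing above.
    for name in ("brooks_ic", "tamusa_ic"):
        missing = missing_shapefile_parts(f"./data/{name}", name)
        print(name, "complete" if not missing else f"missing: {missing}")
```

An empty list means all three mandatory parts are present.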

In [1]:
import os
from datetime import timedelta, datetime

import pandas as pd

# data visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from pylab import rcParams

# Spark
import pyspark
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# project helper module (provides read_data, used below)
import acquire

# Acquire

In [2]:
saws = acquire.read_data('saws-ssos.csv')
saws.head()

Unnamed: 0,SSO_ID,INSPKEY,SERVNO,REPORTDATE,SPILL_ADDRESS,SPILL_ST_NAME,TOTAL_GAL,GALSRET,GAL,SPILL_START,...,Root_Cause,STEPS_TO_PREVENT,SPILL_START_2,SPILL_STOP_2,HRS_2,GAL_2,SPILL_START_3,SPILL_STOP_3,HRS_3,GAL_3
0,6582,567722.0,,3/10/19,3200,THOUSAND OAKS DR,2100,2100.0,2100.0,3/10/2019 1:16:00 PM,...,,,,,0.0,0.0,,,0.0,0.0
1,6583,567723.0,,3/10/19,6804,S FLORES ST,80,0.0,80.0,3/10/2019 2:25:00 PM,...,,,,,0.0,0.0,,,0.0,0.0
2,6581,567714.0,,3/9/19,215,AUDREY ALENE DR,79,0.0,10.0,3/9/2019 6:00:00 PM,...,,,03/10/2019 09:36,03/10/2019 10:45,1.15,69.0,,,0.0,0.0
3,6584,567713.0,,3/9/19,3602,SE MILITARY DR,83,0.0,83.0,3/9/2019 3:37:00 PM,...,,,,,0.0,0.0,,,0.0,0.0
4,6580,567432.0,,3/6/19,100,PANSY LN,75,0.0,75.0,3/6/2019 9:40:00 AM,...,,,,,0.0,0.0,,,0.0,0.0
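
The preview mixes several timestamp styles: `REPORTDATE` uses two-digit years (`3/10/19`), `SPILL_START` uses a 12-hour clock with seconds (`3/10/2019 1:16:00 PM`), and `SPILL_START_2` / `SPILL_STOP_2` use a 24-hour clock without seconds (`03/10/2019 09:36`). The following is a minimal sketch, assuming these three formats cover the values in those columns, that parses them with the standard library and recomputes a spill duration. `parse_spill_timestamp` and `duration_hours` are hypothetical helpers, not part of the `acquire` module.

```python
from datetime import datetime

# Formats observed in the preview above; other rows may need more entries.
TIMESTAMP_FORMATS = (
    "%m/%d/%Y %I:%M:%S %p",  # 3/10/2019 1:16:00 PM
    "%m/%d/%Y %H:%M",        # 03/10/2019 09:36
    "%m/%d/%y",              # 3/10/19
)


def parse_spill_timestamp(text):
    """Try each known format in turn; raise ValueError if none fits."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {text!r}")


def duration_hours(start_text, stop_text):
    """Elapsed time between two timestamps, in hours."""
    delta = parse_spill_timestamp(stop_text) - parse_spill_timestamp(start_text)
    return delta.total_seconds() / 3600


if __name__ == "__main__":
    # Row 2 of the preview: SPILL_START_2 and SPILL_STOP_2.
    print(round(duration_hours("03/10/2019 09:36", "03/10/2019 10:45"), 2))
```

For row 2 this gives 1.15 hours (69 minutes), which agrees with the `HRS_2` value shown in the preview.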


In [3]:
spark = SparkSession.builder.master("local").appName("read").\
    enableHiveSupport().\
    getOrCreate()

# ./data/sara-water-quality-bexar.csv

In [4]:
sara_water_quality = spark.read.format("csv").\
    option("sep", ",").\
    option("header", True).\
    option("inferSchema", True).\
    load("./data/sara-water-quality-bexar.csv")

sara_water_quality.printSchema()

root
 |-- Station ID: integer (nullable = true)
 |-- Station Description: string (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- End Date: timestamp (nullable = true)
 |-- Tag ID: string (nullable = true)
 |-- End Time: string (nullable = true)
 |-- End Depth: double (nullable = true)
 |-- Sample Type: string (nullable = true)
 |-- Program Code: string (nullable = true)
 |-- TEMPERATURE, WATER (DEGREES CENTIGRADE) (00010): string (nullable = true)
 |-- RESERVOIR STAGE (FEET ABOVE MEAN SEA LEVEL) (00052): string (nullable = true)
 |-- RESERVOIR PERCENT FULL (00053): string (nullable = true)
 |-- FLOW  STREAM, INSTANTANEOUS (CUBIC FEET PER SEC) (00061): string (nullable = true)
 |-- DEPTH, MAXIMUM, OF SAMPLE (FEET) (00068): string (nullable = true)
 |-- TRANSPARENCY, SECCHI DISC (METERS) (00078): string (nullable = true)
 |-- SPECIFIC CONDUCTANCE,FIELD (US/CM @ 25C) (00094): string (nullable = true)
 |-- TEMPERATURE, WATER (DEGREES 
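
The measurement columns in the schema follow a `NAME (UNITS) (CODE)` pattern, where the trailing five-digit number appears to be a parameter code. Headers containing commas, spaces, and parentheses are awkward to reference in queries. The following is a minimal sketch, assuming every measurement column ends with such a parenthesised code, that splits a header into its label and code. `split_parameter_column` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches a trailing five-digit code in parentheses, e.g. "... (00010)".
_CODE_PATTERN = re.compile(r"^(?P<label>.*?)\s*\((?P<code>\d{5})\)\s*$")


def split_parameter_column(column_name):
    """Split 'LABEL (CODE)' into (label, code); code is None if absent."""
    match = _CODE_PATTERN.match(column_name)
    if match is None:
        return column_name.strip(), None
    return match.group("label"), match.group("code")


if __name__ == "__main__":
    # Column names copied from the schema output above.
    for column in (
        "Station ID",
        "TEMPERATURE, WATER (DEGREES CENTIGRADE) (00010)",
        "RESERVOIR PERCENT FULL (00053)",
    ):
        print(split_parameter_column(column))
```

The extracted codes could then serve as short column names, for example by passing them to the DataFrame's `withColumnRenamed` method.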

In [5]:
sara_water_quality.show(1, vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------
 Station ID                                                                               | 12689                                              
 Station Description                                                                      | ROSILLO CREEK 0.1 KM ABOVE SALADO CREEK CONFLUENCE 
 Latitude                                                                                 | 29.320101                                          
 Longitude                                                                                | -98.406263                                         
 End Date                                                                                 | 2008-09-18 00:00:00                                
 Tag ID                                                                                   | SA10917T                                    

In [6]:
shape = (sara_water_quality.count(), len(sara_water_quality.columns))
shape

(10884, 500)

In [7]:
sara_water_quality.select('End Date') \
                  .sort('End Date', ascending=False) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-----------------------
 End Date | 2019-02-27 00:00:00 
-RECORD 1-----------------------
 End Date | 2019-02-27 00:00:00 
-RECORD 2-----------------------
 End Date | 2019-02-20 00:00:00 
only showing top 3 rows



In [8]:
sara_water_quality.select('End Date') \
                  .sort('End Date', ascending=True) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-----------------------
 End Date | 1998-01-05 00:00:00 
-RECORD 1-----------------------
 End Date | 1998-01-05 00:00:00 
-RECORD 2-----------------------
 End Date | 1998-01-08 00:00:00 
only showing top 3 rows



# ./data/sara-rainfall-summary.csv

In [9]:
sara_rainfall_summary = spark.read.format("csv").\
    option("sep", ",").\
    option("header", True).\
    option("inferSchema", True).\
    load("./data/sara-rainfall-summary.csv")

sara_rainfall_summary.printSchema()

root
 |-- location_name: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- daily_rainfall_total_inches: double (nullable = true)



In [10]:
sara_rainfall_summary.show(1, vertical=True, truncate=False)

-RECORD 0------------------------------------------
 location_name               | Blanco Road Dam     
 latitude                    | 29.6248             
 longitude                   | -98.52135           
 date                        | 2018-01-01 00:00:00 
 daily_rainfall_total_inches | 0.01                
only showing top 1 row



In [11]:
shape = (sara_rainfall_summary.count(), len(sara_rainfall_summary.columns))
shape

(13395, 5)

In [12]:
sara_rainfall_summary.select('date') \
                  .sort('date', ascending=False) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-------------------
 date | 2019-04-08 00:00:00 
-RECORD 1-------------------
 date | 2019-04-08 00:00:00 
-RECORD 2-------------------
 date | 2019-04-08 00:00:00 
only showing top 3 rows



In [13]:
sara_rainfall_summary.select('date') \
                  .sort('date', ascending=True) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-------------------
 date | 2018-01-01 00:00:00 
-RECORD 1-------------------
 date | 2018-01-01 00:00:00 
-RECORD 2-------------------
 date | 2018-01-01 00:00:00 
only showing top 3 rows



In [14]:
sara_rainfall_summary.select('daily_rainfall_total_inches') \
                  .sort('daily_rainfall_total_inches', ascending=False) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0----------------------------
 daily_rainfall_total_inches | 12.66 
-RECORD 1----------------------------
 daily_rainfall_total_inches | 7.75  
-RECORD 2----------------------------
 daily_rainfall_total_inches | 6.76  
only showing top 3 rows



In [15]:
sara_rainfall_summary.select('daily_rainfall_total_inches') \
                  .sort('daily_rainfall_total_inches', ascending=True) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0--------------------------
 daily_rainfall_total_inches | 0.0 
-RECORD 1--------------------------
 daily_rainfall_total_inches | 0.0 
-RECORD 2--------------------------
 daily_rainfall_total_inches | 0.0 
only showing top 3 rows



# ./data/sara-riverstage-daily-average.csv

In [16]:
sara_riverstage_daily_avg = spark.read.format("csv").\
    option("sep", ",").\
    option("header", True).\
    option("inferSchema", True).\
    load("./data/sara-riverstage-daily-average.csv")

sara_riverstage_daily_avg.printSchema()

root
 |-- location_name: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- daily_average_stage: double (nullable = true)
 |-- tranducer_elevation: double (nullable = true)



In [17]:
sara_riverstage_daily_avg.show(1, vertical=True, truncate=False)

-RECORD 0----------------------------------
 location_name       | SAR 01 (Lonestar)   
 latitude            | 29.4019             
 longitude           | -98.48849           
 date                | 2018-06-20 00:00:00 
 daily_average_stage | 1.12641791          
 tranducer_elevation | 602.3               
only showing top 1 row



In [18]:
shape = (sara_riverstage_daily_avg.count(), len(sara_riverstage_daily_avg.columns))
shape

(1470, 6)

In [19]:
sara_riverstage_daily_avg.select('date') \
                  .sort('date', ascending=False) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-------------------
 date | 2019-04-09 00:00:00 
-RECORD 1-------------------
 date | 2019-04-09 00:00:00 
-RECORD 2-------------------
 date | 2019-04-09 00:00:00 
only showing top 3 rows



In [20]:
sara_riverstage_daily_avg.select('date') \
                  .sort('date', ascending=True) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-------------------
 date | 2018-06-20 00:00:00 
-RECORD 1-------------------
 date | 2018-06-20 00:00:00 
-RECORD 2-------------------
 date | 2018-06-20 00:00:00 
only showing top 3 rows



In [21]:
sara_riverstage_daily_avg.select('daily_average_stage') \
                  .sort('daily_average_stage', ascending=False) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-------------------------
 daily_average_stage | 6.37479167 
-RECORD 1-------------------------
 daily_average_stage | 5.89284722 
-RECORD 2-------------------------
 daily_average_stage | 5.25489583 
only showing top 3 rows



In [22]:
sara_riverstage_daily_avg.select('daily_average_stage') \
                  .sort('daily_average_stage', ascending=True) \
                  .show(3, vertical=True, truncate=False)

-RECORD 0-------------------------
 daily_average_stage | 0.2159375  
-RECORD 1-------------------------
 daily_average_stage | 0.21989583 
-RECORD 2-------------------------
 daily_average_stage | 0.22291228 
only showing top 3 rows



In [23]:
spark.stop()