---
Eli Schwat

elilouis@uw.edu

Created for Professor Michael Brett's CEWA547 Course, Winter 2021

---

# Extract WQM Outputs at WA Ecology Monitoring Stations

This notebook extracts sub-daily water quality model (WQM) outputs produced by the Salish Sea Model. The model generates an output file

```
SSM_2014_v2.7hyak_chk/WQM/SSM_2014_DO_Ph_T52/outputs/ssm_station.out
```

which contains outputs at 26 monitoring stations that represent a variety of environmental conditions.

To run this notebook, you must provide the path to a copy of that output file. The notebook will create a CSV of results. The saved results are a small subset of available outputs, including

Identifying variables:
```
    StationID
    Node
    Layer
    depth(m)
```
Simulated values:
```
    DO
    NO3
    NH4
    T
    S
```

The data are saved in wide format: each identifying and simulated variable is a column, alongside a `time` column. The gzipped CSV is written next to the input file, with the `.out` extension replaced by `.csv.gz`.

User input is required wherever you see this marker:

**<span style="color:red">USER INPUT REQUIRED</span>**

In [1]:
import pandas as pd
import os
import numpy as np

**<span style="color:red">USER INPUT REQUIRED</span>**

In [2]:
!find outputs -type f -name "*.out"

outputs/ssm_station3.out
outputs/ssm_station0.out
outputs/ssm_station8_shortened.out
outputs/ssm_station15.out
outputs/ssm_stationNEW60.out
outputs/ssm_stationNEW15.out
outputs/ssm_station60.out
outputs/ssm_stationNEW3.out
outputs/ssm_stationNEW0.out
outputs/ssm_stationNEW32.out
outputs/ssm_stationNEW25.out
outputs/ssm_stationNEW8.out
outputs/ssm_station25.out
outputs/ssm_stationNEW120.out
outputs/ssm_station32.out
outputs/ssm_station8.out
outputs/ssm_station120.out


In [3]:
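# Number of stations in the output file; this must match the file being read.
n_stations = 35
# Path to a copy of the station output file.
input_file = (
    "/Users/elischwat/Google Drive/UW/Classes Winter 2021/Watershed MGMT/salish sea model/code/outputs/ssm_stationNEW120.out"
)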

In [4]:
with open(input_file, mode='r') as fp:
    df = pd.DataFrame(fp.readlines())

In [5]:
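def get_list_of_variable_names_from_line(line):
    """
    Parse the 'Variables=' header line into a list of variable names.
    `line` is one row of the single-column df of raw file lines.
    """
    return line[0].replace("Variables=", "").replace("\"", "").split(",")

# The third line of the file holds the variable names.
variables_list = get_list_of_variable_names_from_line(df.iloc[2])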
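
The header parsing can be tried on its own. The following is a minimal sketch, assuming the header line has the form `Variables="A","B",...`, as the `replace` calls above imply. The names `parse_variable_names` and `example_header` are hypothetical, and the sample line is made up and shortened. Unlike the cell above, the sketch also strips surrounding whitespace, so a trailing newline does not end up in the last name.

```python
def parse_variable_names(header_line):
    """Return the variable names from a 'Variables=...' header line."""
    cleaned = header_line.strip().replace("Variables=", "").replace('"', "")
    return [name.strip() for name in cleaned.split(",")]


# Hypothetical header line, shortened for illustration.
example_header = 'Variables="StationID","Node","Layer","depth(m)","DO"\n'
names = parse_variable_names(example_header)
print(names)
```
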

In [6]:
variables_list

['StationID',
 'Node',
 'Layer',
 'depth(m)',
 'DO',
 'NO3',
 'NH4',
 'Alg1',
 'Alg2',
 'LDOC',
 'RDOC',
 'LPOC',
 'RPOC',
 'PO4',
 'DIC',
 'TALK',
 'pH',
 'pCO2',
 'T',
 'S',
 'P1',
 'P2',
 'BM1',
 'BM2',
 'NL1',
 'NL2',
 'PL1',
 'PL2',
 'FI1',
 'FI2',
 'B1SZ',
 'B2SZ',
 'B1LZ',
 'B2LZ',
 'PR1',
 'PR2',
 'IAVG',
 'DICUPT',
 'DICBMP',
 'DICPRD',
 'DICMNL',
 'DICDEN',
 'DICGAS',
 'DICSED',
 'DICADV',
 'DICVDIF',
 'ALKNH4',
 'ALKNO3',
 'ALKNIT',
 'ALKDEN',
 'ALKREM',
 'ALKNH4SED',
 'ALKNO3SED',
 'ALKADV',
 'ALKVDIF',
 'Jcin1',
 'Jcin2',
 'Jcin3',
 'Jnin1',
 'Jnin2',
 'Jnin3',
 'Jpin1',
 'Jpin2',
 'Jpin3',
 'Jsin',
 'O20',
 'Depth',
 'Tw',
 'NH30',
 'NO30',
 'PO40',
 'SI0',
 'CH40',
 'SALw',
 'SOD',
 'Jnh4',
 'Jno3',
 'JDenitT',
 'Jch4',
 'Jch4g',
 'Jhs',
 'Jpo4',
 'Jsi',
 'NH31',
 'NH32',
 'NO31',
 'NO32',
 'PO41',
 'PO42',
 'Si1',
 'Si2',
 'CH41',
 'CH42',
 'HS1',
 'HS2',
 'POC21',
 'POC22',
 'POC23',
 'PON21',
 'PON22',
 'PON23',
 'POP21',
 'POP22',
 'POP23',
 'POS2',
 'H1',
 'BEN_STR\

In [7]:
# Drop the three header lines (the last of them holds the variable names)
df = df.iloc[3:].reset_index(drop=True)

Drop empty lines

In [8]:
df = df[df[0].str.strip() != '']

In [9]:
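def process_one_timestamp_chunk(df):
    """
    Split a df with data for one timestamp into a list of dataframes,
    one df per station
    """
    # Work on a copy so the caller's slice is left untouched.
    df = df.copy()
    # The first line of the chunk is the timestamp line.
    df['time'] = df[0].iloc[0]
    df = df.iloc[1:]
    # Each station occupies 76 lines; use integer division for the chunk count.
    one_timestamp_df_list = np.array_split(df, len(df) // 76)
    return one_timestamp_df_list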

In [10]:
def process_one_station_chunk(df):
    """
    Convert a df for one station and one time into a wide format.
    """
    time = df.time.iloc[0]
    # The last 13 lines hold the bottom layer; the lines before them hold
    # the 9 layers above it.
    layer_df_list = np.array_split(df[:-13], 9) + [df.iloc[-13:]]

    # DataFrame.append was removed in pandas 2.0, so collect one row per
    # layer and concatenate once.
    row_df_list = []
    for layer_df in layer_df_list:
        # zip stops at the shorter sequence, so a layer with fewer values
        # than variables_list keeps only the leading variable names.
        values = ' '.join(layer_df[0]).split()
        index, values = zip(*zip(variables_list, values))
        row_df_list.append(pd.DataFrame(values, index=index).transpose())
    return_df = pd.concat(row_df_list)
    return_df['time'] = time
    return return_df
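
The line counts hard-coded in the two functions above (76 lines per station, 13 for the bottom layer, 9 layers above it) imply 7 lines for each of the upper layers. The following is a minimal pure-Python sketch of that split, using plain lists instead of dataframes. The constant names and `split_station_lines` are hypothetical, and the integers stand in for real file lines.

```python
LINES_PER_STATION = 76    # from the split in process_one_timestamp_chunk
BOTTOM_LAYER_LINES = 13   # from df.iloc[-13:]
N_UPPER_LAYERS = 9        # from np.array_split(df[:-13], 9)


def split_station_lines(lines):
    """Split one station's lines into 9 upper-layer groups and 1 bottom group."""
    if len(lines) != LINES_PER_STATION:
        raise ValueError(f"expected {LINES_PER_STATION} lines, got {len(lines)}")
    upper = lines[:-BOTTOM_LAYER_LINES]
    per_layer = len(upper) // N_UPPER_LAYERS
    groups = [upper[i:i + per_layer] for i in range(0, len(upper), per_layer)]
    groups.append(lines[-BOTTOM_LAYER_LINES:])
    return groups


# Stand-in for real file lines: the integers 0..75.
groups = split_station_lines(list(range(76)))
print([len(g) for g in groups])
```
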

In [11]:
# Lines per time step: 76 lines per station plus one timestamp line
chunk_length = n_stations*76+1

In [12]:
chunk_length

2661

In [13]:
# Integer division: the number of chunks must be a whole number
time_step_df_list = np.array_split(df, len(df) // chunk_length)
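
Because `np.array_split` accepts uneven splits without complaint, a wrong `n_stations` would produce misaligned chunks rather than an error. The following is a minimal sketch of a check that could be run before splitting, assuming the layout used above (76 lines per station plus one timestamp line per time step). The function `count_time_steps` is hypothetical and not part of the original notebook.

```python
def count_time_steps(n_lines, n_stations, lines_per_station=76):
    """Return the number of time steps, or raise if the layout does not fit."""
    # One timestamp line precedes the station blocks in every time step.
    chunk_length = n_stations * lines_per_station + 1
    n_steps, remainder = divmod(n_lines, chunk_length)
    if n_steps == 0 or remainder != 0:
        raise ValueError(
            f"{n_lines} lines is not a whole number of {chunk_length}-line "
            f"chunks; check n_stations={n_stations}"
        )
    return n_steps


# Three time steps of 35 stations each: 3 * (35 * 76 + 1) lines.
print(count_time_steps(n_lines=3 * 2661, n_stations=35))
```

In the notebook this could be called as `count_time_steps(len(df), n_stations)` after the empty lines have been dropped.
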

In [14]:
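from multiprocessing import Pool

# Number of worker processes
NUM_PROCS = 7

def process_single_df(df):
    """
    Convert the raw lines for one time step into a wide df with
    one row per station and layer.
    """
    station_df_list = process_one_timestamp_chunk(df)
    all_stations_one_date_df = pd.concat([
        process_one_station_chunk(station_df)
        for station_df in station_df_list
    ])
    return all_stations_one_date_df

# The context manager shuts the worker pool down once the map has finished.
with Pool(processes=NUM_PROCS) as pool:
    allDfs = pool.map(process_single_df, time_step_df_list)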

In [15]:
final_df = pd.concat(allDfs)

In [16]:
final_df = final_df[["StationID","Node","Layer","depth(m)","DO","NO3","NH4","T","S",'time']]

In [17]:
# Keep the third whitespace-separated token of the timestamp line as a float
final_df['time'] = final_df['time'].apply(lambda x: x.split()[2]).astype('float')

In [18]:
# Values were parsed as strings; convert each column to its numeric type
final_df = final_df.astype({
    'StationID': 'int',
    'Node': 'int',
    'Layer': 'int',
    'depth(m)': 'float',
    'DO': 'float',
    'NO3': 'float',
    'NH4': 'float',
    'T': 'float',
    'S': 'float',
})

In [19]:
final_df.to_csv(
    input_file.replace(".out", ".csv.gz"),
    compression="gzip"
)
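
The saved file can be read back with pandas. The following is a minimal sketch, not part of the original notebook: it writes a hypothetical two-row sample with the same columns to a temporary `.csv.gz` file and loads it again. All values are made up for illustration. Since `to_csv` above writes the index as the first column, the sketch reads it back with `index_col=0`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample with the same columns as the saved file.
sample = pd.DataFrame(
    {
        "StationID": [1, 1],
        "Node": [100, 100],
        "Layer": [1, 2],
        "depth(m)": [0.5, 1.5],
        "DO": [8.1, 7.9],
        "NO3": [0.2, 0.3],
        "NH4": [0.01, 0.02],
        "T": [10.5, 10.1],
        "S": [29.0, 29.5],
        "time": [0.0, 0.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv.gz")
    sample.to_csv(path, compression="gzip")
    # Compression is inferred from the .gz extension; the first CSV column
    # is the unnamed index written by to_csv.
    loaded = pd.read_csv(path, index_col=0)

# Select the rows for a single layer.
layer_one = loaded[loaded["Layer"] == 1]
print(layer_one[["StationID", "depth(m)", "DO"]])
```
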