# Preprocessing the Data
---

The first model I plan to make will use temperature data from the surface mooring and CTD-O variables from the 200 meter platform, both from 2017. Before modeling, I need to complete these steps:
* find proportion of missing data - then drop or impute it accordingly
* resample to lower resolution to reduce number of observations to pass to model
* create new dataframe containing the target variable and features
* save clean dataframe as csv
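
As a rough sketch of the resampling and saving steps above, the snippet below averages a small made-up hourly series down to daily means. The frame `raw`, its values, and the choice of a daily mean are assumptions for illustration only, not the actual mooring data or the final resolution.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings standing in for the raw mooring data
times = pd.Timestamp('2017-01-01') + pd.to_timedelta(np.arange(48), unit='h')
raw = pd.DataFrame({
    'time': times,
    'sea_surface_temperature': np.linspace(10.0, 12.0, 48),
})

# Resample to one row per day by averaging all readings within each day
daily = (
    raw.set_index('time')
       .resample('D')
       .mean()
       .reset_index()
)

# to_csv returns the CSV text when no path is given;
# pass a file path instead to write the clean frame to disk
csv_text = daily.to_csv(index=False)
print(daily)
```

Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column.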

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

---
### Load data

Read in the surface mooring and 200m platform data.

In [2]:
METBK_data = pd.read_csv('../../coastal_upwelling_output/metbk_data_2017.csv')
platform_data = pd.read_csv('../../coastal_upwelling_output/platform_data_2017.csv')

---
### Missing data

A significant number of values are missing at random, and we should consider dropping them or replacing them with imputed values. Instead of being stored as nulls, these missing values are automatically filled with a designated fill value. The fill value is set per variable and can be found in the variable descriptions on the OPeNDAP servers.

METBK variables: sea_surface_temperature, met_windavg_mag_corr_east, met_windavg_mag_corr_north


| METBK variable             | Fill value |
|----------------------------|------------|
| sea_surface_temperature    | -9999999.0 |
| met_windavg_mag_corr_east  | -9999999.0 |
| met_windavg_mag_corr_north | -9999999.0 |

200m platform variables: seawater_pressure, density, practical_salinity, seawater_temperature, dissolved_oxygen

| 200m platform variable | Fill value |
|------------------------|------------|
| seawater_pressure      | -9999999.0 |
| density                | -9999999.0 |
| practical_salinity     | -9999999.0 |
| seawater_temperature   | -9999999.0 |
| dissolved_oxygen       | -9999999.0 |

How nice that they all have the same fill value! 
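
Since every variable shares one fill value, a single `replace` call can turn it into `NaN`, after which the proportion of missing data per column is a one-liner. This is a minimal sketch on a made-up frame (`demo` and its values are hypothetical); the same two calls should carry over to `METBK_data` and `platform_data`.

```python
import numpy as np
import pandas as pd

FILL_VALUE = -9999999.0  # fill value listed in the tables above

# Small hypothetical frame standing in for METBK_data
demo = pd.DataFrame({
    'sea_surface_temperature': [11.2, FILL_VALUE, 11.0, 10.9],
    'met_windavg_mag_corr_east': [6.4, -1.1, FILL_VALUE, FILL_VALUE],
})

# Swap the fill value for NaN so pandas treats it as missing
demo_clean = demo.replace(FILL_VALUE, np.nan)

# Proportion of missing values in each column
missing_share = demo_clean.isna().mean()
print(missing_share)
```

A column with a small share of missing values is a candidate for dropping rows or imputing, while a large share may argue for leaving the variable out altogether.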

---
### Appending the CUTI index

In [49]:
CUTI['time'][0] == METBK_daily['time'][0] # == CTD_daily['time'][0]

True

In [50]:
METBK_daily['time'].max()

'2017-11-14'

In [51]:
CUTI[:318]['time'].max()

'2017-11-14'

In [52]:
CUTI[:318]['44N']

0      1.731
1      1.308
2      0.360
3      0.742
4      1.469
       ...  
313   -1.239
314   -0.740
315   -0.543
316   -2.091
317   -2.339
Name: 44N, Length: 318, dtype: float64

In [53]:
METBK_daily['CUTI'] = CUTI[:318]['44N']
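
The positional slice works here because both frames start on the same day and have one row per day. If either frame had a gap, the rows would silently shift out of alignment. A possible alternative, sketched below on hypothetical stand-in frames (`metbk_demo` and `cuti_demo` are not the real data), is to join on the `time` column instead.

```python
import pandas as pd

# Hypothetical stand-ins for METBK_daily and CUTI
metbk_demo = pd.DataFrame({
    'time': ['2017-01-01', '2017-01-02', '2017-01-03'],
    'sst': [11.25, 11.15, 11.09],
})
cuti_demo = pd.DataFrame({
    'time': ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
    '44N': [1.731, 1.308, 0.360, 0.742],
})

# Left join on the date so each mooring row picks up the CUTI value
# for the same day; validate raises if a date appears more than once
merged = metbk_demo.merge(
    cuti_demo[['time', '44N']].rename(columns={'44N': 'CUTI'}),
    on='time',
    how='left',
    validate='one_to_one',
)
print(merged)
```

With a left join, any mooring date that has no CUTI entry ends up with `NaN` rather than a value from the wrong day.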

In [54]:
METBK_daily

Unnamed: 0,time,Sea Surface Temperature (deg_C),Eastward Wind Velocity (m s-1),Northward Wind Velocity (m s-1),CUTI
0,2017-01-01,11.247412,6.413056,-5.371468,1.731
1,2017-01-02,11.149430,-1.148830,-1.207261,1.308
2,2017-01-03,11.089363,-5.203772,2.533484,0.360
3,2017-01-04,10.926763,-6.222477,-4.027738,0.742
4,2017-01-05,10.756387,-6.911579,-3.048631,1.469
...,...,...,...,...,...
313,2017-11-10,12.468681,-0.408066,4.630594,-1.239
314,2017-11-11,12.591080,0.274552,4.486309,-0.740
315,2017-11-12,12.417127,-0.501287,8.977260,-0.543
316,2017-11-13,11.889144,4.022042,10.856365,-2.091


In [55]:
# Binary target: 1 when CUTI is positive (upwelling), 0 otherwise
METBK_daily['upwelling'] = (METBK_daily['CUTI'] > 0).astype(int)

In [56]:
METBK_daily['upwelling'].value_counts(normalize=True)

1    0.613208
0    0.386792
Name: upwelling, dtype: float64

Depending on the fill values for these datasets, I may need to drop zeros and negative values. I'm not concerned with how clean the data is for this initial EDA, but I'm leaving this function here to come back to later.

In [None]:
# This function can only be used if the expected values are always > 0:
# it drops every row where any listed variable is zero, negative, or NaN.
def remove_zeros(df, variables):
    # 'lat' and 'lon' can legitimately be zero or negative, so skip them
    checked = [v for v in variables if v not in ('lat', 'lon')]
    # Row-wise mask: True only where all checked columns are > 0
    keep = (df[checked] > 0).all(axis=1)
    return df.loc[keep]

In [None]:
# METBK_data = remove_zeros(METBK_data, METBK_var)
# profiler_data = remove_zeros(profiler_data, profiler_var)
# platform_data = remove_zeros(platform_data, platform_var)