## Find gaps in EK80 data (Armstrong)

Analyze underway data from the EK80 to determine when data collection started and stopped.

The first step is to download an HTML index page and scrape the data file names from it. The data are split into files, each covering approximately one minute, and each file name encodes the time at which the file begins.

In [6]:
INDEX_URL = 'http://dlacruisedata.whoi.edu/AR/AR028L0A/ek80/'

import requests

In [7]:
html = requests.get(INDEX_URL).text

In [8]:
from bs4 import BeautifulSoup

bs = BeautifulSoup(html, 'html.parser')

In [9]:
import re
import pandas as pd

dts = []

# find all anchors whose href ends ".raw"
for a in bs.find_all('a', href=re.compile(r'\.raw$')):
    # raw string so that \d and \. reach the regex engine unchanged
    y, m, d, H, M, S = re.match(r'.*D(\d{4})(\d{2})(\d{2})-T(\d{2})(\d{2})(\d{2})\.raw', a['href']).groups()
    dt = '{}-{}-{} {}:{}:{}'.format(y,m,d,H,M,S)
    dts.append(dt)

dts = pd.Series(pd.to_datetime(dts))
dts.head()

0   2018-03-23 23:09:48
1   2018-03-23 23:11:01
2   2018-03-23 23:12:13
3   2018-03-23 23:13:36
4   2018-03-23 23:15:16
dtype: datetime64[ns]
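
The filename parsing can also be tried offline. The following is a minimal sketch, not the notebook's own method: it swaps the loop for pandas string methods, and the `cruise-` prefix and the three file names are hypothetical, chosen only to match the `D<YYYYMMDD>-T<HHMMSS>.raw` pattern used above.

```python
import pandas as pd

# Hypothetical file names; only the D<date>-T<time>.raw suffix matters here.
names = pd.Series([
    'cruise-D20180323-T230948.raw',
    'cruise-D20180323-T231101.raw',
    'cruise-D20180323-T231213.raw',
])

# Capture the 8-digit date and the 6-digit time as two string columns (0 and 1).
parts = names.str.extract(r'D(\d{8})-T(\d{6})\.raw$')

# Join date and time, then parse with an explicit format.
parsed = pd.to_datetime(parts[0] + parts[1], format='%Y%m%d%H%M%S')
print(parsed)
```

Passing an explicit `format` avoids relying on pandas to guess the layout of the timestamp string.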

In [10]:
# now find gaps longer than 2 minutes in this sequence of timestamps

# https://stackoverflow.com/questions/32974166/how-do-i-find-5-minutes-gaps-in-a-pandas-dataframe

gap = dts.diff()

df = pd.DataFrame({
    'timestamp': dts,
    'gap': gap,
    'event': 'run',
})
# .copy() makes an independent frame, so the assignment below
# does not trigger SettingWithCopyWarning
start_times = df[df.gap > pd.Timedelta(minutes=2)].copy()
start_times['event'] = 'start'
start_times

| | timestamp | gap | event |
|---|---|---|---|
| 850 | 2018-03-25 19:44:35 | 1 days 09:57:49 | start |
| 950 | 2018-03-26 00:33:30 | 0 days 02:49:25 | start |
| 1050 | 2018-03-28 10:52:52 | 2 days 08:18:29 | start |
| 1186 | 2018-03-28 14:34:46 | 0 days 01:42:48 | start |
| 1328 | 2018-03-28 20:42:11 | 0 days 04:13:08 | start |
| 1368 | 2018-03-28 23:52:54 | 0 days 02:39:57 | start |
| 1532 | 2018-03-30 11:46:33 | 1 days 09:47:32 | start |
| 1828 | 2018-03-31 14:59:32 | 0 days 23:09:49 | start |
| 2031 | 2018-03-31 19:30:23 | 0 days 01:07:40 | start |
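
To see the gap test in isolation, here is a small sketch on made-up timestamps (the five values and the `toy_` names are illustrative and not taken from the cruise data). `Series.diff()` gives the time elapsed since the previous file, and any row whose difference exceeds the threshold marks the first file after a gap.

```python
import pandas as pd

# Made-up timestamps: one file per minute, with an 8-minute hole in the middle.
toy_ts = pd.Series(pd.to_datetime([
    '2018-01-01 00:00:00',
    '2018-01-01 00:01:00',
    '2018-01-01 00:02:00',
    '2018-01-01 00:10:00',  # first file after the hole
    '2018-01-01 00:11:00',
]))

# Time since the previous file; the first row has no predecessor, so it is NaT.
toy_gap = toy_ts.diff()

# NaT compares as False, so the first row is never flagged here.
toy_starts = toy_ts[toy_gap > pd.Timedelta(minutes=2)]
print(toy_starts)
```

Because the first row is never flagged, the very first start of recording has to be added separately.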


In [11]:
# stop times are the ones immediately preceding start times

# index arithmetic works here because df has the default 0..n-1 index;
# .copy() avoids SettingWithCopyWarning on the assignment below
stop_times = df.iloc[start_times.index - 1].copy()
stop_times['event'] = 'stop'
stop_times


| | timestamp | gap | event |
|---|---|---|---|
| 849 | 2018-03-24 09:46:46 | 00:01:11 | stop |
| 949 | 2018-03-25 21:44:05 | 00:01:33 | stop |
| 1049 | 2018-03-26 02:34:23 | 00:01:30 | stop |
| 1185 | 2018-03-28 12:51:58 | 00:01:31 | stop |
| 1327 | 2018-03-28 16:29:03 | 00:01:36 | stop |
| 1367 | 2018-03-28 21:12:57 | 00:00:46 | stop |
| 1531 | 2018-03-29 01:59:01 | 00:01:01 | stop |
| 1827 | 2018-03-30 15:49:43 | 00:00:57 | stop |
| 2030 | 2018-03-31 18:22:43 | 00:01:15 | stop |
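
The "row just before each start" rule can also be written without index arithmetic, which may be useful when the index is not a plain 0..n-1 range. This is an alternative sketch on made-up timestamps, not what the cell above does: it shifts the boolean start mask back by one row.

```python
import pandas as pd

# Made-up timestamps with a single 8-minute hole.
toy_ts = pd.Series(pd.to_datetime([
    '2018-01-01 00:00:00',
    '2018-01-01 00:01:00',
    '2018-01-01 00:02:00',  # last file before the hole
    '2018-01-01 00:10:00',  # first file after the hole
    '2018-01-01 00:11:00',
]))

# True on the first file after each gap.
is_start = toy_ts.diff() > pd.Timedelta(minutes=2)

# Shift the mask one row earlier: True on the last file before each gap.
# fill_value=False keeps the mask boolean for the final row.
is_stop = is_start.shift(-1, fill_value=False)

toy_stops = toy_ts[is_stop]
print(toy_stops)
```

As with the start mask, the final stop of the whole record (the last row) is not flagged and would need to be added by hand.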


In [12]:
# now we need to create an initial start event. this will have no value for "gap"
events = pd.concat([df.head(1), start_times, stop_times]).sort_values('timestamp')
# a single positional assignment avoids chained indexing and its SettingWithCopyWarning
events.iloc[0, events.columns.get_loc('event')] = 'start'
events


| | timestamp | gap | event |
|---|---|---|---|
| 0 | 2018-03-23 23:09:48 | NaT | start |
| 849 | 2018-03-24 09:46:46 | 0 days 00:01:11 | stop |
| 850 | 2018-03-25 19:44:35 | 1 days 09:57:49 | start |
| 949 | 2018-03-25 21:44:05 | 0 days 00:01:33 | stop |
| 950 | 2018-03-26 00:33:30 | 0 days 02:49:25 | start |
| 1049 | 2018-03-26 02:34:23 | 0 days 00:01:30 | stop |
| 1050 | 2018-03-28 10:52:52 | 2 days 08:18:29 | start |
| 1185 | 2018-03-28 12:51:58 | 0 days 00:01:31 | stop |
| 1186 | 2018-03-28 14:34:46 | 0 days 01:42:48 | start |
| 1327 | 2018-03-28 16:29:03 | 0 days 00:01:36 | stop |
