
Proof of Concept for Pepys in Jupyter #1078

Open
2 of 7 tasks
IanMayo opened this issue Nov 11, 2021 · 7 comments · May be fixed by #1096
Labels
rw_backlog_opt Optional items for RW rw_backlog Candidate tasks for Robin Task Package of work

Comments


IanMayo commented Nov 11, 2021

🐞 Overview

Produce a proof-of-concept for viewing Pepys data in a Jupyter notebook.

This will de-risk the future use of Jupyter notebooks both in Pepys and in general usage by analysts, offering lessons learned in data connectivity, data processing, and visualisation.

Time-permitting, to include:

  • connecting to Pepys (Postgres)
  • extracting State data for a period of time from one or more platforms
  • giving initial visualisations of
    • spatial perspective (tracks)
    • temporal perspective (speed vs time & range vs time)
  • data-pipeline to perform some cleaning of data (smooth speed?)
  • perform some SciKit-Learn analysis on data to produce some new calculated variable (TBD, see below)
  • push some calculated data back into Pepys
  • any thoughts on offline mapping (more below)
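By way of illustration, the "extract, clean, visualise" steps above might look like the sketch below. Everything here is an invented stand-in (real State data would come from the Pepys Postgres database, not a synthetic DataFrame, and the column names are illustrative, not the real schema):

```python
import numpy as np
import pandas as pd

# Hypothetical State data: one platform, 1-minute samples (a stand-in
# for the result of a Pepys query).
rng = np.random.default_rng(0)
times = pd.date_range("2021-11-11 00:00", periods=60, freq="1min")
states = pd.DataFrame({
    "time": times,
    "speed": 10 + rng.normal(0, 1.5, size=60),  # knots, with sensor noise
})

# Simple cleaning step: smooth speed with a centred 5-sample rolling mean.
states["speed_smooth"] = (
    states["speed"].rolling(window=5, center=True, min_periods=1).mean()
)
```

From there, something like `states.plot(x="time", y=["speed", "speed_smooth"])` would give the temporal (speed vs time) perspective.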

🔗 Feature

This represents an alternative solution for #859

🔢 Acceptance criteria

Machine Learning

SciKit-Learn provides capable clustering algorithms, but we need to think of how to apply them to Pepys data.

  • We could try to apply clustering to course and speed data (or change of course/speed) to identify "hi-tempo" periods for a platform, and shade the track accordingly.
  • Use frequency of comments to infer the tempo of operations (note: I don't see this as ML).
  • Some sentiment analysis/clustering of comments?
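A minimal sketch of the first idea, clustering change-of-course/speed features to find "hi-tempo" periods. The feature values are entirely synthetic (two simulated behaviour regimes); a real run would derive them from consecutive State records:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-minute features for one platform: |change of course|
# (degrees) and |change of speed| (knots). Two regimes are simulated:
# quiet transit vs. high-tempo manoeuvring.
rng = np.random.default_rng(42)
quiet = np.column_stack([rng.uniform(0, 2, 200), rng.uniform(0, 0.5, 200)])
busy = np.column_stack([rng.uniform(20, 60, 50), rng.uniform(2, 6, 50)])
features = np.vstack([quiet, busy])

# Cluster into two groups; the labels could then shade the track by tempo.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
```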

Offline mapping

Pepys will frequently be used without an Internet connection, so an OpenStreetMap backdrop will be unavailable. It would be useful to consider how a similar mapping capability could be provided, covering these areas in descending order of importance:

  • NW Scotland
  • All UK Waters
  • Europe
  • Global coverage

I guess some options are:

  • GeoTIFFs of UK Admiralty Charts
  • UK/Other Coastlines expressed in GeoJSON
  • Server-based solution such as OpenStreetMap
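The GeoJSON-coastline option is probably the lightest-weight of the three. As a sketch, with a tiny invented feature standing in for a real UK coastline file, the coordinate rings it yields are what a plotting library would draw as the offline backdrop:

```python
import json

# A tiny inline FeatureCollection standing in for a real coastline file.
coastline = json.loads("""{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "properties": {"name": "demo-island"},
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-5.0, 58.0], [-4.5, 58.2],
                                  [-4.8, 58.5], [-5.0, 58.0]]]}
  }]
}""")

# Extract the outer ring of each polygon, ready for plotting.
rings = [feat["geometry"]["coordinates"][0]
         for feat in coastline["features"]
         if feat["geometry"]["type"] == "Polygon"]
```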

Sample analysis task #

  • Pick a track (primary) that is present for > 50% of the day
  • Randomly apply a dummy sensor activation / deactivation on that track (probably something like a ~1 hour cycle? Can vary to make a more interesting dataset...)
  • Assume the sensor has a range of X; select all the other tracks within X of the primary while the sensor is active (again, X can be varied to make the dataset more interesting)
  • Produce summary statistics and visualisations of the above: for example, a plot of the number of tracks in range as a function of time, a histogram of distances from selected tracks to the primary, or a spatial plot that lets you select each interaction. The focus would be on things you wouldn't do in Debrief.
  • Test the process works by repeating using different days
  • An illustration of how the above concept could be used is in Fisheries Protection. A platform may be suspected of dark fishing, and analysts wish to inspect the platform's interactions with others to see if it may be offloading illegal catch. So they may wish to view a timeline showing periods when other platforms are in the vicinity of the suspect one; clicking on an item on the timeline would show a map-plot of the interaction, to let them assess whether the pattern of behaviour shows that catch was being offloaded. A similarly clickable table of CPAs (Closest Points of Approach: not the nearest point between two polylines, but the time-matched closest distance the two moving tracks reach) could also be useful. Whereas the above description uses periods of sensor coverage, in this example it could be periods of day/night, or periods with a suitable sea-state. Fundamentally, there are temporal windows of interest that relate to the data.
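A sketch of the core geometry in the task above: time-matched separation between two tracks, the in-range windows for a sensor of range X, and the CPA. Flat-earth metres and invented straight-line tracks are used for simplicity; real data would need proper geodesic distances from lat/lon:

```python
import numpy as np
import pandas as pd

# Two hypothetical tracks sampled at the same minutes (flat-earth metres).
t = pd.date_range("2021-11-11 06:00", periods=120, freq="1min")
primary = pd.DataFrame({"time": t, "x": np.linspace(0, 12000, 120), "y": 0.0})
other = pd.DataFrame({"time": t, "x": np.linspace(12000, 0, 120),
                      "y": np.linspace(3000, -3000, 120)})

# Time-matched separation between the two moving tracks.
sep = np.hypot(primary["x"] - other["x"], primary["y"] - other["y"])

# CPA: the minimum *time-matched* distance, not the nearest point
# between the two polylines.
cpa_idx = int(sep.idxmin())
cpa_range = float(sep[cpa_idx])
cpa_time = t[cpa_idx]

# "In range" windows for a sensor of range X (here 2000 m).
in_range = sep < 2000
```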

Extended analysis task, considering bulk data #

  • Fictional high-level requirement. A government agency is interested in catching smugglers around the UK coastline. They have identified a handful of tracks of known smuggling vessels, but wish to identify occasions when other shipping could have passed within some distance of those vessels. A sample range is 2km.
  • The agency wishes to provide a quarterly report on this data. In a quarter it can typically obtain 20 × 24-hour smuggling-vessel tracks, with a one-minute sample interval.
  • The report should list the number of "close encounters" for each smuggling vessel, with details of the duration of each close encounter.
  • As sample data for this task, 24-hour datasets can be downloaded from: https://marinecadastre.gov/ais/
  • In the UK context, 24 hours of data is a 74 MB .csv with ~2,000 unique vessels in a 9 × 700,000 table. So, for a quarter there is a very large amount of data present.
  • As far as possible, the analysts would like the above process to be automated, so they are free to concentrate on the "close-encounters", and not spend time collating the data
  • It is accepted that this bulk-data challenge may introduce new technological challenges/requirements.
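One way to keep a quarter's worth of AIS data tractable is to stream the CSVs in chunks rather than loading them whole. A sketch, with a tiny inline CSV standing in for a real marinecadastre.gov extract (the column names are illustrative, not guaranteed to match the real files):

```python
import io
import pandas as pd

# Tiny stand-in for an AIS extract (real files are ~74 MB, ~700,000 rows).
csv_data = io.StringIO(
    "MMSI,BaseDateTime,LAT,LON\n"
    "111,2021-01-01T00:00:00,50.1,-1.1\n"
    "111,2021-01-01T00:01:00,50.2,-1.2\n"
    "222,2021-01-01T00:00:00,51.0,-1.5\n"
    "333,2021-01-01T00:00:00,49.9,-1.0\n"
)

# Stream the file in chunks so a quarter's data never has to fit in
# memory at once; accumulate per-vessel point counts as we go.
counts = {}
for chunk in pd.read_csv(csv_data, chunksize=2, parse_dates=["BaseDateTime"]):
    for mmsi, n in chunk["MMSI"].value_counts().items():
        counts[mmsi] = counts.get(mmsi, 0) + int(n)
```

The same pattern extends to the close-encounter check itself: filter each chunk against the known smuggling-vessel positions before discarding it.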

Prioritised subsequent tasks #

  • Continue with examples of the ‘vessels within X distance of a specific vessel’ analysis, including tidying up and sharing outputs, plus adding other visualisations such as timelines
  • Extend ‘vessels within X distance…’ example to deal with ‘periods of interest’ (eg. just during night time etc)
  • SQL implementation of Closest Point of Approach analysis
  • Think about and demo offline background mapping for use on non-internet-connected systems
  • Demonstrate GUI-based dataframe manipulation in a Jupyter notebook using various tools that we’ve discussed before - and showing how those could be used to do interesting quick analyses and could integrate with the rest of Pepys
  • Continuing PR to make pandas support SQLAlchemy 2.0 properly (very early PR at ENH: Initial implementation of support for SQLAlchemy 2.0 in read_sql etc (WIP) pandas-dev/pandas#44794, needs quite a bit more work)
  • Create a ‘dashboard-style’ demo where you can interactively choose a vessel, choose distance parameters, view plots of timelines and then click to get maps - all built into a nice little demo application (probably using the tool ‘Panel’)
@IanMayo IanMayo added Task Package of work rw_backlog_opt Optional items for RW labels Nov 11, 2021
@IanMayo IanMayo added the rw_backlog Candidate tasks for Robin label Nov 11, 2021

robintw commented Nov 29, 2021

@IanMayo Here are some initial demos of a very simple notebook interface:
[Screenshots "Jupyter_1" and "Jupyter_2": demos of the simple notebook interface]

There are loads of problems with this interface, but it's just an idea of what is possible with just a few lines of code. I'll put up a PR shortly so you can see the actual notebook code, and then I'll move on to some of the other stuff we wanted to demo.


robintw commented Nov 29, 2021

See #1096 for a PR including this notebook code. I've also included some static and interactive plots of other variables - see:

[Two screenshots: static and interactive plots of other variables]

Notably, at the moment we have to work around pandas' incompatibility with SQLAlchemy 2.0. This means that the SQLAlchemy 'engine' that we create in the Pepys DataStore won't work with pandas, because we create it with future=True (to opt in to the new features and deprecations of SQLAlchemy 1.4, making it ready for 2.0). There is a pandas issue for adding support for SQLAlchemy 2.0 (see pandas-dev/pandas#40460), which seems to have stalled for lack of volunteers with the relevant experience; that might be something we could contribute to, if you were interested.
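One possible interim workaround (a sketch, not necessarily what the PR does): bypass the future=True engine and give pandas a plain DBAPI connection, which read_sql_query accepts directly. SQLite stands in for Postgres here, and the table/column names are invented:

```python
import sqlite3
import pandas as pd

# SQLite stands in for the real Postgres database; the States table and
# its columns are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE States (platform TEXT, speed REAL);"
    "INSERT INTO States VALUES ('HIPP', 9.5), ('HIPP', 10.1);"
)

# pandas reads happily through a raw DBAPI connection, sidestepping the
# future=True SQLAlchemy engine entirely.
states = pd.read_sql_query("SELECT * FROM States", conn)
```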


robintw commented Nov 29, 2021

Ah yes, one more thing:

Do you have any really good, realistic (ideally actually real - but not sensitive) data that I could use for playing around with developing analysis capabilities in Jupyter? Part of the reason I built the UI for selecting a platform and plotting the points was so that I could see if I could find a realistic looking track - a lot of the data on TracStor is obviously test data. The best I found was this HIPP platform, but it hasn't got a massive amount of data (only ~350 data points). If I were to start running scikit-learn models on data I'd ideally like something fairly realistic and reasonably large. Any ideas?


IanMayo commented Nov 30, 2021

Aah, @robintw - from the depths of my memory I remembered where I'd seen a sample dataset, it's in the CSV files here:
https://www.gov.uk/government/news/dstl-shares-new-open-source-framework-initiative

Some tracks appeared to have up to 3k points.

Obvs you'll either have to produce a parser to get the data into Pepys, or do some Excel column fiddling to make it look like an existing format which we parse. The "unknown platform" handling will be great for this data :-D

@IanMayo IanMayo added interactive_review Triggers an interactive review via Binder and removed interactive_review Triggers an interactive review via Binder labels Nov 30, 2021
@IanMayo IanMayo linked a pull request Nov 30, 2021 that will close this issue

IanMayo commented Nov 30, 2021

Here's another source of AIS data @robin - it's a huge dataset, hopefully they're long tracks rather than just lots of small ones.
https://marinecadastre.gov/ais/


IanMayo commented Nov 30, 2021

@robintw - the analysts have come up with a useful analysis task (above) to "drive" the technical demonstrator. I'm happy to either expand the terms or rephrase the description as necessary for you to understand/implement it.


robintw commented Nov 30, 2021

Thanks @IanMayo. That's an interesting task, and slightly different to what I was expecting. I'll have a ponder and do some experimentation and get back to you.
