# Southern Ocean Geospatial Analysis
## Sample Code
### TODO: Add geolabel.ipynb code here

## Notes

### NSIDC Sea Ice Extent data challenge
- TODO: FINISH SECTION
- [Include images of the grid from NSIDC]
- Isolated boundary but the points were out of order. Solution to that problem



### Summer 2022 second half goals
- Main goal is to enter data and make visualization/front end for the database
- Features still pending are metadata information when entering datasets and ways to visualize the different types of data
- Latest completed task: separating the data into two tables, one for datasets with biological and chemical measurements and one for presence-only data.
  - The first table holds our ideal datasets, such as LTER with HPLC + chlorophyll.
  - The second table holds datasets that only have lat, lon, time, and species present. Phytobase and OBIS have lots of data like that.
  - All the data is still there; the split just avoids having many rows with no entry in our biological columns. It also makes grouping the data easier, since we can ignore the Phytobase-type data when we do clustering later on.
- Current software progress:
<p align="center">
  <img 
    width="350"
    src="img/fronts_shapes.png"
></p>
- All fronts/boundaries and their respective zones are shapefiles. Data is easily labelled with a spatial join to this GeoDataFrame
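
The two-table split described in the goals above could be sketched with pandas roughly as follows. The column names (`chlorophyll`, `hplc_fucoxanthin`, `species`) and the sample rows are made up for illustration; the real schema and the rule for deciding what counts as a "biological column" may differ.

```python
import pandas as pd

# Hypothetical biological/chemical columns; the real schema may differ.
BIO_COLUMNS = ["chlorophyll", "hplc_fucoxanthin"]


def split_by_measurements(df, bio_columns=BIO_COLUMNS):
    """Return (measurements, presence_only) based on the biological columns."""
    # A row counts as a measurement if any biological column has a value.
    has_measurement = df[bio_columns].notna().any(axis=1)
    measurements = df[has_measurement]
    # Presence-only rows keep just position, time, and species.
    presence_only = df.loc[~has_measurement].drop(columns=bio_columns)
    return measurements, presence_only


# Made-up rows: one LTER-like row with measurements, two presence-only rows.
raw = pd.DataFrame({
    "lat": [-64.8, -60.1, -55.3],
    "lon": [-64.0, 20.5, 140.2],
    "time": pd.to_datetime(["2019-01-05", "2018-12-11", "2017-02-20"]),
    "species": ["species_a", "species_b", "species_c"],
    "chlorophyll": [1.2, None, None],
    "hplc_fucoxanthin": [0.3, None, None],
})
measurements, presence_only = split_by_measurements(raw)
```

Keeping the presence-only rows in their own table means clustering code can read `measurements` directly without filtering out empty rows first.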

### GeoLabel class
- Class to label data by fronts and sector. Also creates the shapefiles with raw data
- Wrote hundreds of test lines over 2 weeks to get shapefiles and geopandas working
- Refactored several times to simplify the functions as much as possible. Zones are labeled with a simple geospatial join
```python
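import geopandas as gpd

# Zone polygons written by GeoLabel; data_gdf holds the data points to label
zones_gdf = gpd.read_file(GeoLabel.zones_shapefile)
# Inner spatial join: each point picks up the attributes of the zone it intersects
result = gpd.sjoin(data_gdf, zones_gdf)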
```
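
The snippet above depends on the GeoLabel class and an existing `data_gdf`. A self-contained sketch of the same kind of join, using made-up rectangular zones and points instead of the real front shapefiles, might look like this. The zone names, coordinates, and the choice of `how="left"` are assumptions for illustration only.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up rectangular "zones"; real zones come from the front shapefiles.
zones_gdf = gpd.GeoDataFrame(
    {"zone": ["zone_north", "zone_south"]},
    geometry=[box(-180, -50, 180, -40), box(-180, -60, 180, -50)],
    crs="EPSG:4326",
)

# Made-up sample locations as (lon, lat).
data_gdf = gpd.GeoDataFrame(
    {"sample_id": [1, 2, 3]},
    geometry=[Point(10, -45), Point(-70, -55), Point(0, -30)],
    crs="EPSG:4326",
)

# how="left" keeps points that fall outside every zone (their zone is NaN).
labeled = gpd.sjoin(data_gdf, zones_gdf, how="left", predicate="within")
print(labeled[["sample_id", "zone"]])
```

With the default inner join, the third point would be dropped instead of kept with an empty label, so the choice of `how` decides whether unlabeled data stays visible.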

### Modifications to Kim and Orsi's front data
- Removed small extra dots included in the fronts
- Old versions have SBdy from Kim and Orsi. There is a large gap in the SBdy that is closed with a straight line by Shapely
- Data includes points with longitudes in [-180, 360], but the [0, 360] range duplicates [-180, 180]; we kept [-180, 180]
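
The notes do not record how the duplicated longitude range was removed. One possible approach, sketched here in plain Python, is to wrap every longitude into [-180, 180) and drop repeated points. The function names and sample points are hypothetical, and the rounding tolerance is an assumption.

```python
def wrap_longitude(lon):
    """Map any longitude in degrees onto the half-open range [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def dedupe_front_points(points, ndigits=6):
    """Wrap longitudes and drop repeated (lon, lat) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for lon, lat in points:
        # Round so that tiny floating-point differences do not hide duplicates.
        key = (round(wrap_longitude(lon), ndigits), round(lat, ndigits))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Made-up points: 350 and -10 describe the same meridian.
points = [(-10.0, -55.0), (20.0, -54.0), (350.0, -55.0), (200.0, -60.0)]
cleaned = dedupe_front_points(points)
```

Note that this convention maps a longitude of exactly 180 to -180, so the antimeridian appears only once.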

### RG Climatology to Southern Ocean Fronts
- I have access to code that converts [RG Climatology data](http://sio-argo.ucsd.edu/RG_Climatology.html) from Argo sensors into ocean front polygons
- The two options are:
1) Rewrite all the code
   - Code is well documented but inefficient: it contains lots of extra steps and hardcoded values, and it is not very modular
   - Meet with author and document every step/calculation. Then replicate process but with my own code
   - Pros:
     - High quality code. Easy to read, modify, and debug
   - Cons:
     - Time (at least a week). I need to understand the program, new data structures, new packages, etc.
     - If there are updates to the original code, I will not receive them (no longer a fork)
2) Make small updates to code
   - Fix hard coding
   - Pros:
     - Code is already written and verified 
     - Saves me a lot of time
   - Cons:
     - Leaves overall codebase vulnerable to bugs in that portion of code
- My current choice is option 2. The core of this project is to make the most useful database possible, so I think my time is better spent building other features. I will still update the code to remove hardcoding and fix small performance issues as they come up. Now I can focus on tasks that build around this code. Some examples include:
  - Automatic data retrieval. Simply call a function or script and the database will download new climatology data, calculate the fronts, and update the database.
  - NetCDF export/import. Add a feature to export fronts as a NetCDF file so they can be used for other research. In the future, if there are new ways of calculating fronts, we can import the NetCDF from that calculation and keep the rest of the code the same.

### Spatial Data Workflow
- How will data flow from raw input data to queryable results at the front end?
- Geo data is different because points inside a shape are not directly stored (only the edges are)
- This section describes the different technologies that can be used and how they will interact

#### GeoPandas and GeoJSON
- GeoJSON is an open standard format for encoding a variety of geographic data structures such as lines, polygons, inner/outer rings, etc.
- Similar to the popular Esri shapefile, but it is an open standard. It can also be easily created and modified with Python/Shapely
- GeoPandas allows geospatial queries on GeoJSON files and DataFrames
- For example, point-in-polygon (PIP) calculations can be done with .contains(), .within(), or a geospatial join with sjoin()
- GeoPandas also supports spatial indexing of the loaded geometries to speed up queries. [This example](https://geoffboeing.com/2016/10/r-tree-spatial-index-python/) shows R-tree spatial indexing in GeoPandas
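
As a rough illustration of the spatial index mentioned above, the sketch below queries the `sindex` attribute of a GeoDataFrame for the polygons that intersect a point. The grid of square cells is made up and stands in for real zone polygons.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# A grid of 100 made-up unit squares standing in for real zone polygons.
cells = gpd.GeoDataFrame(
    {"cell_id": range(100)},
    geometry=[box(x, y, x + 1, y + 1) for x in range(10) for y in range(10)],
)

point = Point(3.5, 7.5)

# The spatial index is built lazily on first access to .sindex.
# query() returns integer positions of the geometries that satisfy the predicate.
positions = cells.sindex.query(point, predicate="intersects")
matches = cells.iloc[positions]
```

The index first narrows the search using bounding boxes, so only a handful of polygons need the exact geometric test.
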
#### SpatiaLite?
- Open-source extension to SQLite that allows spatial data (like GeoJSON shapes) to be stored and queried
- Has all the functionality of SQLite, so it doesn't narrow the usability of the database
- This will only be used if we decide to keep shapefiles in the database for query use. Otherwise, labeling the data at insert time serves the same purpose as geospatial queries
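
To illustrate the labeling alternative, here is a minimal sketch using Python's built-in `sqlite3` module with no spatial extension. The table name, columns, and rows are hypothetical, and the zone labels attached to the made-up coordinates are placeholders rather than real classifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: zone and sector are filled in at insert time.
conn.execute(
    "CREATE TABLE samples ("
    "id INTEGER PRIMARY KEY, lat REAL, lon REAL, zone TEXT, sector TEXT)"
)

# Made-up rows; the labels here are placeholders, not real classifications.
rows = [
    (-45.0, 10.0, "SAZ", "sector_a"),
    (-42.0, 80.0, "STZ", "sector_b"),
    (-47.5, 15.0, "SAZ", "sector_a"),
]
conn.executemany(
    "INSERT INTO samples (lat, lon, zone, sector) VALUES (?, ?, ?, ?)", rows
)

# An index on the label keeps zone filters cheap as the table grows.
conn.execute("CREATE INDEX idx_samples_zone ON samples (zone)")

# A "geospatial" question answered with plain SQL on the stored label.
saz_count = conn.execute(
    "SELECT COUNT(*) FROM samples WHERE zone = ?", ("SAZ",)
).fetchone()[0]
```

Users only need to know the zone names, not how the polygons are stored or queried.
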
#### Point Inside Polygon (PIP)
- Large amounts of data will need to be processed with PIP
- All data entries will be labeled with a zone (STZ, SAZ, etc.) and sector which could be expensive
- An alternative is to store the shapes and calculate PIP at runtime
  - However, this moves the cost to query time, whereas I want to pay the labeling cost once at insert time
  - This also adds complexity for the users since they need to understand what a polygon/shapefile is and how to use them in queries
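
To make the cost of a PIP test concrete, here is a plain-Python sketch of the classic ray-casting method. It is an illustration only and not necessarily what GeoPandas or Shapely do internally. It treats lon/lat as flat coordinates and ignores the antimeridian and the pole, and the triangle is made up.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the closed ring.

    ring is a list of (lon, lat) vertices; the last vertex connects to the first.
    Points exactly on an edge are not handled specially.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the east of the point.
            if lon < x_cross:
                inside = not inside
    return inside


# Made-up triangular region in (lon, lat).
triangle = [(0.0, -60.0), (40.0, -60.0), (20.0, -40.0)]

in_center = point_in_polygon(20.0, -50.0, triangle)
in_corner_of_bbox = point_in_polygon(2.0, -42.0, triangle)
```

Each test visits every edge of the polygon, so labeling many points against detailed front polygons grows with both the number of points and the number of vertices. The second point lies inside the triangle's bounding box but outside the triangle, which is the case a rectangular lat/lon filter gets wrong.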

### Southern Ocean Fronts and Sectors
#### Fronts
- "Boundary between two distinct water masses"
- Dividing line where ocean is physically different
- Fronts change over time, but the data uses composites to pick one line for a span of time (e.g. 1 year)
- Get raw data as a series of points over time, convert it to a shapefile (GeoJSON?), and load it into Python with GeoPandas
- Then users can add new front data if they want by simply replacing the file and calling an update on the database
<p align="center">
  <img 
    width="350"
    src="img/fronts.png"
>
</p>
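
The raw-points-to-GeoPandas step could be sketched as below. The front coordinates are made up, and closing the shape along latitude -90 is an assumption made only so that the example produces a polygon; the real workflow may close zones between neighboring fronts instead.

```python
import json

import geopandas as gpd
from shapely.geometry import Polygon

# Made-up front positions as (lon, lat); a real front has many more points.
front_points = [(-180, -55), (-90, -57), (0, -52), (90, -50), (180, -55)]

# Close the shape along a southern latitude so the zone becomes a polygon.
# The -90 cutoff is an assumption made for this sketch.
ring = front_points + [(180, -90), (-180, -90)]
zone = Polygon(ring)

zones_gdf = gpd.GeoDataFrame(
    {"zone": ["south_of_front"]}, geometry=[zone], crs="EPSG:4326"
)

# Serialize to GeoJSON text and read it back, as a file round trip would.
geojson_text = zones_gdf.to_json()
reloaded = gpd.GeoDataFrame.from_features(
    json.loads(geojson_text)["features"], crs="EPSG:4326"
)
```

Because the zones live in a plain GeoJSON document, replacing the front data amounts to swapping that document and rerunning the labeling step.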

#### Sectors
- Commonly used fixed sectors of the Southern Ocean
- The boundaries are arbitrary, so other people use different definitions
- Similar to fronts, define a shape file and allow users to upload one of their own
<p align="center">
  <img 
    width="350"
    src="img/sectors.png"
  >
</p>
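
If sectors are treated as simple longitude ranges, labeling can be sketched without any polygons. This is a simplification of the shapefile approach described above. The sector names and bounds below are hypothetical placeholders, not the definitions shown in the figure; swapping in another definition only means replacing the `SECTORS` list.

```python
# Hypothetical sector bounds in degrees east, as (name, west, east).
# A sector with west > east wraps across the antimeridian.
SECTORS = [
    ("sector_a", -60.0, 20.0),
    ("sector_b", 20.0, 150.0),
    ("sector_c", 150.0, -60.0),
]


def label_sector(lon, sectors=SECTORS):
    """Return the name of the sector containing lon (west-inclusive), or None."""
    lon = (lon + 180.0) % 360.0 - 180.0  # normalize to [-180, 180)
    for name, west, east in sectors:
        if west <= east:
            if west <= lon < east:
                return name
        elif lon >= west or lon < east:
            # Wrapping sector: covers [west, 180) and [-180, east).
            return name
    return None
```

For example, `label_sector(170.0)` and `label_sector(-170.0)` both fall in the wrapping `sector_c` under these placeholder bounds.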

### Defining geospatial regions in the Southern Ocean
- We want to be able to label rows with a region like the ones shown in the map:
<p align="center">
  <img 
    width="350"
    src="img/southern_ocean.jpg"
>
</p>

- Simple lat/lon filters will not work, since they are rectangular and our regions are not
- Solutions include an ArcGIS filter/map or polygons that predefine these regions
- The main point is that there must be a mapping from latitude/longitude to region
- If one is not available, we can explore how to create that mapping
- Mapping can and should be done at insert time, not at runtime, to speed up queries that use this data
- Some options are GeoPandas, [GeoJSON](https://handsondataviz.org/geojsonio.html), ArcGIS, PostGIS, pyshp, PyGIS, and arcpy
- An issue with the Arc family (ArcGIS, arcpy) is that it is not open source, unlike GeoJSON and the other options