# Capstone Project - Predicting Commercial Fishing Habits

## Overview of Process - CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

# 1. Business Understanding
According to NOAA, illegal, unreported, and unregulated (`IUU`) fishing violates both national and international fishing regulations. IUU fishing is a global problem that threatens ocean ecosystems and sustainable fisheries. It also threatens economic security and the natural resources that are critical to global food security, and it puts law-abiding fishing operations at a disadvantage.

Illegal fishing refers to fishing activities conducted in contravention of applicable laws and regulations, including those laws and rules adopted at the regional and international level. 

Unreported fishing refers to fishing activities that are not reported or are misreported to relevant authorities in contravention of national laws and regulations or reporting procedures of a relevant regional fisheries management organization.

Unregulated fishing occurs in areas or for fish stocks for which there are no applicable conservation or management measures and where such fishing activities are conducted in a manner inconsistent with State responsibilities for the conservation of living marine resources under international law. Fishing activities are also unregulated when occurring in an RFMO-managed area and conducted by vessels without nationality, or by those flying a flag of a State or fishing entity that is not party to the RFMO in a manner that is inconsistent with the conservation measures of that RFMO.

Source: https://www.fisheries.noaa.gov/insight/understanding-illegal-unreported-and-unregulated-fishing

AIS stands for Automatic Identification System and is used to track marine vessel traffic. AIS data is collected by the US Coast Guard through an onboard navigation safety device that transmits and monitors the location and characteristics of large vessels in US and international waters in real time. In the United States, the Coast Guard and commercial vendors collect AIS data, which can also be used for a variety of coastal planning initiatives. Source: https://marinecadastre.gov/ais/

AIS is a maritime navigation safety communications system standardized by the International Telecommunication Union (ITU) and adopted by the International Maritime Organization (IMO). It automatically provides vessel information, including the vessel's identity, type, position, course, speed, navigational status, and other safety-related information, to appropriately equipped shore stations, other ships, and aircraft; automatically receives such information from similarly fitted ships; monitors and tracks ships; and exchanges data with shore-based facilities. More information is available at https://www.navcen.uscg.gov/?pageName=AISFAQ#1

# 2. Data Understanding

1. What data is available to us? Where does it come from?
2. Who controls the data and what steps are needed to access the data?
3. What is our target?
4. What predictors are available to us?
5. What data types are the predictors?
6. What is the distribution of our data?
7. How big is our data?
8. Do we have enough data to build a model? Will we need to use resampling?
9. How do we know the data is correct? Is there a chance the data is wrong?

Two primary datasets will be used for this process:
1. Vessel AIS data sourced from Global Fishing Watch
2. Various ocean measurements and Ocean Station Data sourced from NOAA and the World Ocean Database

In [8]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime

# set visualization style
plt.style.use('ggplot')

## AIS Data (Global Fishing Watch)
The Global Fishing Watch dataset was sourced from `https://globalfishingwatch.org/data-download/`. Seven separate files were downloaded, each corresponding to a type of fishing vessel:
1. `drifting_longlines.csv`
2. `fixed_gear.csv`
3. `pole_and_line.csv`
4. `purse_seines.csv`
5. `trawlers.csv`
6. `trollers.csv`
7. `unknown.csv`

In [9]:
# load datasets
drifting_longlines = pd.read_csv('datasets/drifting_longlines.csv')
fixed_gear = pd.read_csv('datasets/fixed_gear.csv')
pole_and_line = pd.read_csv('datasets/pole_and_line.csv')
purse_seines = pd.read_csv('datasets/purse_seines.csv')
trawlers = pd.read_csv('datasets/trawlers.csv')
trollers = pd.read_csv('datasets/trollers.csv')
unknown = pd.read_csv('datasets/unknown.csv')

Per Global Fishing Watch, each dataset contains the following columns:
* `mmsi` - anonymized vessel identifier
* `timestamp` - unix timestamp
* `distance_from_shore` - distance from shore in meters
* `distance_from_port` - distance from port in meters
* `speed` - vessel speed in knots
* `course` - vessel course
* `lat` - latitude in decimal degrees
* `lon` - longitude in decimal degrees
* `source` - the training data batch. Data was prepared by GFW, Dalhousie, and a crowdsourcing campaign. False positives are marked as `false_positives`
* `vessel_type` - type of vessel
* `is_fishing` - label indicating fishing activity
    * `0` = not fishing
    * `>0` = fishing. Data values between 0 and 1 indicate the average score for the position if scored by multiple people
    * `-1` = no data. 
    
`is_fishing` will be our primary target variable; the other available columns will be used as features.
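
Because `is_fishing` mixes three kinds of values (unlabeled, not fishing, and an averaged score), it needs some preparation before it can serve as a binary target. The following is a minimal sketch, assuming unlabeled rows (`-1`) are dropped and an averaged score of 0.5 or more counts as fishing. The 0.5 cutoff and the helper name `binarize_is_fishing` are illustrative choices, not something the Global Fishing Watch documentation prescribes.

```python
import pandas as pd


def binarize_is_fishing(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop unlabeled rows and turn averaged scores into a 0/1 target.

    `threshold` is an assumed cutoff, not a value taken from the data docs.
    """
    # -1 means "no data", so keep only rows with a score of 0 or more
    labeled = df[df["is_fishing"] >= 0].copy()
    # Scores at or above the threshold become 1 (fishing), the rest 0
    labeled["is_fishing"] = (labeled["is_fishing"] >= threshold).astype(int)
    return labeled


# Tiny synthetic frame for illustration only (not real AIS data)
sample = pd.DataFrame({"is_fishing": [-1.0, 0.0, 0.25, 0.75, 1.0]})
prepared = binarize_is_fishing(sample)
print(prepared["is_fishing"].value_counts().sort_index())
```

Passing a very small positive `threshold` instead would treat any score above zero as fishing, which matches the `>0` description in the column list more literally.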

### AIS - Drifting Longlines

In [13]:
# display top 5 rows
drifting_longlines.head()

| | mmsi | timestamp | distance_from_shore | distance_from_port | speed | course | lat | lon | is_fishing | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12639560000000.0 | 1327137000.0 | 232994.28125 | 311748.65625 | 8.2 | 230.5 | 14.865583 | -26.853662 | -1.0 | dalhousie_longliner |
| 1 | 12639560000000.0 | 1327137000.0 | 233994.265625 | 312410.34375 | 7.3 | 238.399994 | 14.86387 | -26.8568 | -1.0 | dalhousie_longliner |
| 2 | 12639560000000.0 | 1327137000.0 | 233994.265625 | 312410.34375 | 6.8 | 238.899994 | 14.861551 | -26.860649 | -1.0 | dalhousie_longliner |
| 3 | 12639560000000.0 | 1327143000.0 | 233994.265625 | 315417.375 | 6.9 | 251.800003 | 14.822686 | -26.865898 | -1.0 | dalhousie_longliner |
| 4 | 12639560000000.0 | 1327143000.0 | 233996.390625 | 316172.5625 | 6.1 | 231.100006 | 14.821825 | -26.867579 | -1.0 | dalhousie_longliner |


In [14]:
# display info
drifting_longlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13968727 entries, 0 to 13968726
Data columns (total 10 columns):
 #   Column               Dtype  
---  ------               -----  
 0   mmsi                 float64
 1   timestamp            float64
 2   distance_from_shore  float64
 3   distance_from_port   float64
 4   speed                float64
 5   course               float64
 6   lat                  float64
 7   lon                  float64
 8   is_fishing           float64
 9   source               object 
dtypes: float64(9), object(1)
memory usage: 1.0+ GB
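
The `info()` output shows `mmsi` and `timestamp` stored as `float64`. The following is a hedged sketch of tidying those types, assuming `timestamp` holds Unix seconds (as the column description states) and that the anonymized IDs are whole numbers with no missing values. `tidy_ais_dtypes` is a hypothetical helper name.

```python
import pandas as pd


def tidy_ais_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with datetime timestamps and integer vessel IDs."""
    out = df.copy()
    # Unix seconds -> timezone-aware UTC datetimes
    out["timestamp"] = pd.to_datetime(out["timestamp"], unit="s", utc=True)
    # IDs are identifiers, not measurements; this cast fails if any are NaN
    out["mmsi"] = out["mmsi"].astype("int64")
    return out


# Values as displayed in the head() output (display rounding applies)
sample = pd.DataFrame({
    "mmsi": [12639560000000.0, 12639560000000.0],
    "timestamp": [1327137000.0, 1327143000.0],
})
tidy = tidy_ais_dtypes(sample)
print(tidy.dtypes)
print(tidy["timestamp"].iloc[0])
```

Datetime timestamps make it straightforward to derive features such as hour of day or month later on.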


In [16]:
# display summary statistics for continuous columns
drifting_longlines.describe()

| | mmsi | timestamp | distance_from_shore | distance_from_port | speed | course | lat | lon | is_fishing |
|---|---|---|---|---|---|---|---|---|---|
| count | 13968730.0 | 13968730.0 | 13968730.0 | 13968730.0 | 13968630.0 | 13968630.0 | 13968730.0 | 13968730.0 | 13968730.0 |
| mean | 129385000000000.0 | 1434290000.0 | 584531.1 | 789750.5 | 5.464779 | 181.4876 | -8.997629 | 3.758693 | -0.9743015 |
| std | 78873570000000.0 | 39842750.0 | 542006.8 | 691543.8 | 4.043567 | 105.0503 | 24.39311 | 109.5971 | 0.2119947 |
| min | 5601266000000.0 | 1325376000.0 | 0.0 | 0.0 | 0.0 | 0.0 | -75.19017 | -180.0 | -1.0 |
| 25% | 62603840000000.0 | 1410706000.0 | 101909.2 | 213020.6 | 2.1 | 90.7 | -26.0155 | -88.08668 | -1.0 |
| 50% | 118485900000000.0 | 1447302000.0 | 457639.3 | 637524.9 | 5.5 | 181.1 | -14.97954 | -1.716495 | -1.0 |
| 75% | 198075800000000.0 | 1466506000.0 | 960366.4 | 1210432.0 | 8.5 | 271.1 | 4.48579 | 100.9811 | -1.0 |
| max | 281205800000000.0 | 1480032000.0 | 4430996.0 | 7181037.0 | 102.3 | 511.0 | 83.33266 | 179.9938 | 1.0 |
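
Two maxima in the summary statistics stand out: a `speed` of 102.3 knots and a `course` of 511. In the AIS message format, 102.3 knots is the reserved "speed not available" code and 511 is the reserved "heading not available" code, so these values are likely placeholders rather than measurements. The following is a sketch of masking them, assuming that interpretation holds for this dataset (worth confirming against the Global Fishing Watch documentation). `mask_ais_sentinels` and the two constants are hypothetical names.

```python
import numpy as np
import pandas as pd

SPEED_NOT_AVAILABLE = 102.3  # knots; assumed to be the AIS "not available" code
COURSE_UPPER_BOUND = 360.0   # degrees; valid compass values lie below this


def mask_ais_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where placeholder speed/course values become NaN."""
    out = df.copy()
    # Compare with a tolerance because the values may carry float rounding
    speed_missing = np.isclose(out["speed"], SPEED_NOT_AVAILABLE, atol=0.01)
    out.loc[speed_missing, "speed"] = np.nan
    # Anything at or above 360 degrees cannot be a real compass bearing
    out.loc[out["course"] >= COURSE_UPPER_BOUND, "course"] = np.nan
    return out


# Synthetic rows for illustration only
sample = pd.DataFrame({"speed": [5.5, 102.3, 8.5], "course": [181.1, 90.7, 511.0]})
cleaned = mask_ais_sentinels(sample)
print(cleaned.isna().sum())
```

Converting the placeholders to `NaN` keeps them from inflating the mean and maximum, and lets them be handled together with the small number of rows where `speed` and `course` are already missing.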


In [17]:
# display number of unique vessels
drifting_longline_ids = drifting_longlines['mmsi'].unique()
print(f'There are {len(drifting_longline_ids)} unique anonymized vessel IDs')

There are 110 unique anonymized vessel IDs


In [18]:
# concatenate - add vessel type
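
The concatenation step is left as a placeholder. One way it could look, as a sketch only: each per-gear DataFrame gets a `vessel_type` column and the frames are stacked into one. The helper name `combine_vessel_frames` and the tiny synthetic frames are illustrative assumptions; in the notebook the inputs would be the seven DataFrames loaded earlier.

```python
import pandas as pd


def combine_vessel_frames(frames: dict) -> pd.DataFrame:
    """Stack per-gear frames, tagging each row with its gear name."""
    tagged = [df.assign(vessel_type=name) for name, df in frames.items()]
    # ignore_index gives the combined frame a fresh 0..n-1 index
    return pd.concat(tagged, ignore_index=True)


# Synthetic stand-ins for the real per-gear DataFrames
demo = {
    "drifting_longlines": pd.DataFrame({"mmsi": [1, 1], "speed": [8.2, 7.3]}),
    "trawlers": pd.DataFrame({"mmsi": [2], "speed": [3.1]}),
}
vessels = combine_vessel_frames(demo)
print(vessels)
```

With roughly 14 million rows for drifting longlines alone, converting `vessel_type` to a categorical dtype after concatenation may help keep memory usage down.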

## NOAA Ocean Station Data (OSD)
Ocean station data was sourced from the following URL: `https://www.ncei.noaa.gov/access/world-ocean-database-select/dbsearch.html`.

The WODselect retrieval system lets a user search the World Ocean Database and newly added data using user-specified criteria.

The World Ocean Database is encoded per the following documentation: `https://www.ncei.noaa.gov/data/oceans/woa/WOD/DOC/wodreadme.pdf`

As a result, downloaded files are returned in a native `.OSD` format. To read and import the `.OSD` data, I used the `wodpy` package. More information about `wodpy` can be found at `https://github.com/IQuOD/wodpy`.

After successful loading of NOAA OSD, we have access to the following additional data: 
* `z`: level depths in meters
* `z_level_qc`: level depth qc flags (0 == all good)
* `z_unc`: depth uncertainty
* `t`: level temperature in Celsius
* `t_level_qc`: level temperature qc flags (0 == all good)
* `t_unc`: temperature uncertainty
* `s`: level salinities
* `s_level_qc`: level salinity qc flags (0 == all good)
* `s_unc`: salinity uncertainty
* `oxygen`: oxygen content (mL / L)
* `phosphate`: phosphate content (uM / L)
* `silicate`: silicate content (uM / L)
* `pH`: pH levels
* `p`: pressure (decibar)
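
Reading a downloaded file with `wodpy` might look like the sketch below. It relies on `wod.WodProfile`, `is_last_profile_in_file`, and the per-profile accessors (`latitude()`, `longitude()`, `z()`, `t()`, `s()`) described in the `wodpy` README. The helper names `iter_profiles` and `profile_to_frame` are hypothetical, and the demonstration uses a stand-in object so the flattening step can run without a downloaded file.

```python
import numpy as np
import pandas as pd


def iter_profiles(path):
    """Yield every profile in a WOD-format file (requires the wodpy package)."""
    from wodpy import wod  # imported lazily so the rest runs without wodpy

    with open(path) as fid:
        while True:
            profile = wod.WodProfile(fid)
            yield profile
            if profile.is_last_profile_in_file(fid):
                break


def _filled(values):
    """Convert a (possibly masked) array to floats, with NaN for masked levels."""
    return np.ma.filled(np.ma.asarray(values).astype(float), np.nan)


def profile_to_frame(profile) -> pd.DataFrame:
    """Flatten one profile into one row per depth level."""
    frame = pd.DataFrame({
        "z": _filled(profile.z()),
        "t": _filled(profile.t()),
        "s": _filled(profile.s()),
    })
    # Position is per profile, so it repeats on every level row
    frame["lat"] = profile.latitude()
    frame["lon"] = profile.longitude()
    return frame


class FakeProfile:
    """Stand-in with the same accessor names, for illustration only."""

    def latitude(self):
        return 14.87

    def longitude(self):
        return -26.85

    def z(self):
        return np.ma.array([0.0, 10.0, 20.0])

    def t(self):
        return np.ma.array([24.1, 23.8, 0.0], mask=[False, False, True])

    def s(self):
        return np.ma.array([36.2, 36.2, 36.3])


demo = profile_to_frame(FakeProfile())
print(demo)
```

With a real file, the per-profile frames could be combined with `pd.concat(profile_to_frame(p) for p in iter_profiles(path))`, where `path` points at the downloaded `.OSD` file. The `wodpy` README also describes a `profile.df()` method that returns a per-profile DataFrame directly, which may be simpler if its column layout suits the merge.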

NOAA OSD will be merged with vessel AIS data to provide additional features for classification.
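
How the two sources will be joined is not specified here. One simple possibility, shown purely as a sketch, is to bin both datasets onto a coarse latitude/longitude grid and join on the grid cell. The 1-degree cell size, the choice to average temperature and salinity per cell, and the names `add_grid_cell` and `merge_ais_with_osd` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def add_grid_cell(df: pd.DataFrame, cell_deg: float = 1.0) -> pd.DataFrame:
    """Return a copy with integer grid-cell indices derived from lat/lon."""
    out = df.copy()
    out["lat_cell"] = np.floor(out["lat"] / cell_deg).astype(int)
    out["lon_cell"] = np.floor(out["lon"] / cell_deg).astype(int)
    return out


def merge_ais_with_osd(ais: pd.DataFrame, osd: pd.DataFrame,
                       cell_deg: float = 1.0) -> pd.DataFrame:
    """Left-join per-cell mean ocean measurements onto AIS positions."""
    ais_cells = add_grid_cell(ais, cell_deg)
    # Average all station readings that fall in the same cell
    osd_cells = (
        add_grid_cell(osd, cell_deg)
        .groupby(["lat_cell", "lon_cell"], as_index=False)[["t", "s"]]
        .mean()
    )
    # AIS rows without a nearby station keep NaN for t and s
    return ais_cells.merge(osd_cells, on=["lat_cell", "lon_cell"], how="left")


# Synthetic rows for illustration only
ais = pd.DataFrame({"lat": [14.86, -8.99], "lon": [-26.85, 3.75], "speed": [8.2, 5.5]})
osd = pd.DataFrame({"lat": [14.2, 14.9], "lon": [-26.1, -26.9],
                    "t": [24.0, 25.0], "s": [36.0, 36.4]})
merged = merge_ais_with_osd(ais, osd)
print(merged)
```

A time component (for example, the month of each observation) could be added to the join key in the same way, and a nearest-station match is an alternative if grid cells prove too coarse.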
