# Capstone Project - Predict Fishing Habits

## Overview of Process - CRISP-DM:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

## MVP:
Create a classifier that predicts whether a commercial fishing vessel is fishing, based on AIS data.

## Level Up: 
Predict further carrier behavior and the type of vessel from the same information.

# 1. Business Understanding

According to NOAA, illegal, unreported, and unregulated (IUU) fishing activities violate both national and international fishing regulations. IUU fishing is a global problem that threatens ocean ecosystems and sustainable fisheries. It also threatens our economic security and the natural resources that are critical to global food security, and it puts law-abiding fishing operations at a disadvantage.

Illegal fishing refers to fishing activities conducted in contravention of applicable laws and regulations, including those laws and rules adopted at the regional and international level.

Unreported fishing refers to fishing activities that are not reported or are misreported to relevant authorities in contravention of national laws and regulations or reporting procedures of a relevant regional fisheries management organization.

Unregulated fishing occurs in areas or for fish stocks for which there are no applicable conservation or management measures and where such fishing activities are conducted in a manner inconsistent with State responsibilities for the conservation of living marine resources under international law. Fishing activities are also unregulated when occurring in an RFMO-managed area and conducted by vessels without nationality, or by those flying a flag of a State or fishing entity that is not party to the RFMO in a manner that is inconsistent with the conservation measures of that RFMO. `https://www.fisheries.noaa.gov/insight/understanding-illegal-unreported-and-unregulated-fishing`

AIS stands for Automatic Identification System and is used to track marine vessel traffic. AIS data is collected by the US Coast Guard through an onboard navigation safety device that transmits and monitors the location and characteristics of large vessels in US and international waters in real time. In the United States, the Coast Guard and commercial vendors collect AIS data, which can also be used for a variety of coastal planning initiatives. `https://marinecadastre.gov/ais/`

AIS is a maritime navigation safety communications system standardized by the International Telecommunication Union (ITU) and adopted by the International Maritime Organization (IMO). It automatically provides vessel information, including the vessel's identity, type, position, course, speed, navigational status, and other safety-related information, to appropriately equipped shore stations, other ships, and aircraft. It also automatically receives such information from similarly fitted ships, monitors and tracks ships, and exchanges data with shore-based facilities. More information can be found here: `https://www.navcen.uscg.gov/?pageName=AISFAQ#1`

# 2. Data Understanding

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime

# set visualization style 
plt.style.use('ggplot')

In [2]:
# load datasets
drifting_longlines = pd.read_csv('datasets/drifting_longlines.csv')
fixed_gear = pd.read_csv('datasets/fixed_gear.csv')
pole_and_line = pd.read_csv('datasets/pole_and_line.csv')
purse_seines = pd.read_csv('datasets/purse_seines.csv')
trawlers = pd.read_csv('datasets/trawlers.csv')
trollers = pd.read_csv('datasets/trollers.csv')
unknown = pd.read_csv('datasets/unknown.csv')

In [3]:
fishing_vessels = ['drifting_longlines', 
                   'fixed_gear',
                   'pole_and_line',
                   'purse_seines',
                   'trawlers',
                   'trollers',
                   'unknown']
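
The list of names above could also drive the loading and tagging steps, so the same cells do not have to be repeated for each gear type. A minimal sketch, assuming every file is named `<name>.csv` inside one folder; the helper `load_vessel_frames` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_vessel_frames(names, data_dir='datasets'):
    """Read `<data_dir>/<name>.csv` for each name and tag its rows with a vessel_type column."""
    frames = {}
    for name in names:
        df = pd.read_csv(Path(data_dir) / f'{name}.csv')
        df['vessel_type'] = name  # same label the per-type cells add by hand
        frames[name] = df
    return frames
```

Calling `load_vessel_frames(fishing_vessels)` would return a dictionary keyed by gear type, and `pd.concat(frames.values(), ignore_index=True)` would then give one consolidated frame.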

## Drifting Longlines

In [4]:
# display top 5 rows
drifting_longlines.head()

Unnamed: 0,mmsi,timestamp,distance_from_shore,distance_from_port,speed,course,lat,lon,is_fishing,source
0,12639560000000.0,1327137000.0,232994.28125,311748.65625,8.2,230.5,14.865583,-26.853662,-1.0,dalhousie_longliner
1,12639560000000.0,1327137000.0,233994.265625,312410.34375,7.3,238.399994,14.86387,-26.8568,-1.0,dalhousie_longliner
2,12639560000000.0,1327137000.0,233994.265625,312410.34375,6.8,238.899994,14.861551,-26.860649,-1.0,dalhousie_longliner
3,12639560000000.0,1327143000.0,233994.265625,315417.375,6.9,251.800003,14.822686,-26.865898,-1.0,dalhousie_longliner
4,12639560000000.0,1327143000.0,233996.390625,316172.5625,6.1,231.100006,14.821825,-26.867579,-1.0,dalhousie_longliner


In [5]:
# display info
drifting_longlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13968727 entries, 0 to 13968726
Data columns (total 10 columns):
 #   Column               Dtype  
---  ------               -----  
 0   mmsi                 float64
 1   timestamp            float64
 2   distance_from_shore  float64
 3   distance_from_port   float64
 4   speed                float64
 5   course               float64
 6   lat                  float64
 7   lon                  float64
 8   is_fishing           float64
 9   source               object 
dtypes: float64(9), object(1)
memory usage: 1.0+ GB


In [6]:
# display summary stats for continuous columns
drifting_longlines.describe()

KeyboardInterrupt: 

In [None]:
# add vessel_type column to assist with concatenation
drifting_longlines['vessel_type'] = 'drifting_longlines'

In [None]:
# number of unique vessels
drifting_longline_ids = drifting_longlines['mmsi'].unique()
print(f'There are {len(drifting_longline_ids)} unique anonymized vessel IDs')

## Fixed Gear

In [None]:
fixed_gear.head()

In [None]:
fixed_gear.info()

In [None]:
fixed_gear.describe()

In [None]:
# add vessel_type column to assist with concatenation
fixed_gear['vessel_type'] = 'fixed_gear'

In [None]:
# number of unique vessels
fixed_gear_ids = fixed_gear['mmsi'].unique()
print(f'There are {len(fixed_gear_ids)} unique anonymized vessel IDs')

## Pole and Line

In [None]:
pole_and_line.head()

In [None]:
pole_and_line.info()

In [None]:
pole_and_line.describe()

In [None]:
# add vessel_type column to assist with concatenation
pole_and_line['vessel_type'] = 'pole_and_line'

In [None]:
# number of unique vessels
pole_and_line_ids = pole_and_line['mmsi'].unique()
print(f'There are {len(pole_and_line_ids)} unique anonymized vessel IDs')

## Purse Seines

In [None]:
purse_seines.head()

In [None]:
purse_seines.info()

In [None]:
purse_seines.describe()

In [None]:
# add vessel_type column to assist with concatenation
purse_seines['vessel_type'] = 'purse_seines'

In [None]:
# number of unique vessels
purse_seines_ids = purse_seines['mmsi'].unique()
print(f'There are {len(purse_seines_ids)} unique anonymized vessel IDs')

## Trawlers


In [None]:
trawlers.head()

In [None]:
trawlers.info()

In [None]:
trawlers.describe()

In [None]:
# add vessel_type column to assist with concatenation
trawlers['vessel_type'] = 'trawlers'

In [None]:
# number of unique vessels
trawlers_ids = trawlers['mmsi'].unique()
print(f'There are {len(trawlers_ids)} unique anonymized vessel IDs')

## Trollers

In [None]:
trollers.head()

In [None]:
trollers.info()

In [None]:
trollers.describe()

In [None]:
# add vessel_type column to assist with concatenation
trollers['vessel_type'] = 'trollers'

In [None]:
# number of unique vessels
trollers_ids = trollers['mmsi'].unique()
print(f'There are {len(trollers_ids)} unique anonymized vessel IDs')

## Unknown

In [None]:
unknown.head()

In [None]:
unknown.info()

In [None]:
unknown.describe()

In [None]:
# add vessel_type column to assist with concatenation
unknown['vessel_type'] = 'unknown'

In [None]:
# number of unique vessels
unknown_ids = unknown['mmsi'].unique()
print(f'There are {len(unknown_ids)} unique anonymized vessel IDs')

## All Fishing Vessels

In [None]:
# create consolidated dataset
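# ignore_index=True rebuilds a unique 0..n-1 index instead of repeating each file's own row numbers
boats_df = pd.concat([drifting_longlines,
                      fixed_gear,
                      pole_and_line,
                      purse_seines,
                      trawlers,
                      trollers,
                      unknown], axis=0, ignore_index=True)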

In [None]:
boats_df.info()

Exploring our dataset, we can see that there are 6.8 million entries and 10 columns. Our target variable can be found in the `is_fishing` column. We will move forward with further exploration of the target variable.

In [None]:
# breakdown of target variable
boats_df['is_fishing'].value_counts()

Based on the source of our dataset (Global Fishing Watch), we know the following: 
* `0` = not fishing
* `>0` = fishing. Values between 0 and 1 are the average score for a position that was scored by multiple people
* `-1` = no data

Knowing this, we see the majority of entries are missing fishing labels, marked as `-1`. Despite the number of entries without helpful labels, we still see a large number of `1` and `0` labels. We will likely want to remove the entries with `-1` labels as we move forward with training.
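
The label scheme above can be turned into a training filter along the following lines. This is a minimal sketch on made-up values; the helper name `label_fishing` and the 0.5 cut-off for averaged scores are assumptions, not something the source data prescribes.

```python
import pandas as pd


def label_fishing(df, threshold=0.5):
    """Drop unlabeled rows (-1) and binarize the remaining scores.

    `threshold` is an assumed cut-off for averaged scores between 0 and 1.
    """
    labeled = df.loc[df['is_fishing'] >= 0].copy()
    labeled['is_fishing'] = (labeled['is_fishing'] >= threshold).astype(int)
    return labeled


# toy frame with hypothetical label values
toy = pd.DataFrame({'is_fishing': [-1.0, 0.0, 1.0, 0.25, 0.75]})
labeled_toy = label_fishing(toy)
print(labeled_toy['is_fishing'].value_counts())
```

A stricter alternative is to keep only rows labeled exactly `0` or `1`, which discards positions where the scorers disagreed.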

In [None]:
boats_df.columns

Other columns in our dataset are :
* `mmsi` - anonymized vessel identifier
* `timestamp` - unix timestamp
* `distance_from_shore` - distance from shore in meters
* `distance_from_port` - distance from port in meters
* `speed` - vessel speed in knots
* `course` - vessel course in degrees
* `lat` - latitude in decimal degrees
* `lon` - longitude in decimal degrees
* `source` - the training data batch. Data was prepared by GFW, Dalhousie, and a crowdsourcing campaign. False positives are marked as `false_positives`
* `vessel_type` - type of vessel

In [None]:
# number of unique vessels
all_ids = boats_df['mmsi'].unique()
print(f'There are {len(all_ids)} unique anonymized vessel IDs')

In [None]:
# print unique timestamp
time_stamps = boats_df['timestamp'].unique()
print(f'There are {len(time_stamps)} unique time stamps')

In [None]:
# pull out example unix timestamp
unix_timestamp = boats_df['timestamp'].iloc[2]
unix_timestamp

In [None]:
# convert unix timestamp so easier to understand
converted_timestamp = datetime.utcfromtimestamp(unix_timestamp).strftime('%Y-%m-%d %H:%M:%S')
converted_timestamp

In [None]:
# create converted column
# boats_df['updated_timestamp'] = boats_df['timestamp'].apply(lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S'))
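
For the full column, a vectorized conversion is usually much faster than a row-wise `apply`. The sketch below uses `pd.to_datetime` with `unit='s'` on a small made-up frame; the sample values are for illustration only, and the same call should work on `boats_df['timestamp']`.

```python
import pandas as pd

# hypothetical sample of unix timestamps (seconds since 1970-01-01 UTC)
sample = pd.DataFrame({'timestamp': [0.0, 86400.0, 1327137000.0]})

# unit='s' tells pandas the numbers are seconds; utc=True makes the result timezone-aware
sample['updated_timestamp'] = pd.to_datetime(sample['timestamp'], unit='s', utc=True)

# format as strings only for display; keep the datetime column for analysis
readable = sample['updated_timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist()
print(readable)
```

Keeping the column as a datetime type rather than a string makes it possible to pull out the hour, weekday, or month later with the `.dt` accessor.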

In [None]:
# visualize distribution of continuous columns
boats_df.hist(figsize=(15, 8), bins='auto')
plt.tight_layout()
plt.show()

In [None]:
# better understand the timeframe of data we have
min_time = boats_df['timestamp'].min()
max_time = boats_df['timestamp'].max()

# convert to more readable format
converted_min_time = datetime.utcfromtimestamp(min_time).strftime('%Y-%m-%d %H:%M:%S')
converted_max_time = datetime.utcfromtimestamp(max_time).strftime('%Y-%m-%d %H:%M:%S')

print(f'Dataset ranges from {converted_min_time} to {converted_max_time}')

In [None]:
# load NOAA ocean station data - pickle file
import pickle
with open('noaa_data.pickle', 'rb') as file:
    ods_df = pickle.load(file)

In [None]:
# drop unnecessary columns
clean_ods = ods_df.drop(['z_unc', 't_unc', 's_unc', 'p',
                         'z_level_qc', 't_level_qc', 's_level_qc'], axis=1)

clean_ods.info()

In [None]:
boats_df.info()

In [None]:
test_lat = boats_df['lat'].iloc[1]
test_lon = boats_df['lon'].iloc[1]
test_lat_2 = boats_df['lat'].iloc[2]
test_lon_2 = boats_df['lon'].iloc[2]

In [None]:
# use haversine distance to get the great-circle distance between two points on a sphere

In [None]:
from sklearn.metrics.pairwise import haversine_distances
from math import radians

In [None]:
test_point = [test_lat, test_lon]
test_in_radians = [radians(_) for _ in test_point]

test_point2 = [test_lat_2, test_lon_2]
test_in_radians2 = [radians(_) for _ in test_point2]

In [None]:
result = haversine_distances([test_in_radians, test_in_radians2])
result * 6371000/1000  # multiply by Earth radius to get kilometers

In [None]:
result[1][0] * 6371000/1000
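
As a cross-check on the sklearn result, the haversine formula can be written out directly. This is a sketch using only the standard library; the function name `haversine_km` is an assumption, and the Earth radius of 6371 km is the same constant used above. The two test points are rows 1 and 2 of the drifting longlines preview.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    # haversine of the central angle between the two points
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# rows 1 and 2 from the drifting longlines preview
d = haversine_km(14.86387, -26.8568, 14.861551, -26.860649)
print(f'{d:.3f} km')
```

The value printed here should agree with `result[1][0] * 6371000/1000` up to floating-point rounding.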

In [None]:
clean_ods['coords'] = clean_ods[['lat', 'lon']].values.tolist()
clean_ods['coords']

In [None]:
boats_df['coords'] = boats_df[['lat', 'lon']].values.tolist()

In [None]:
# convert [lat, lon] pairs to radians in one vectorized step (much faster than a row-wise apply)
clean_ods['radians'] = np.radians(clean_ods[['lat', 'lon']].to_numpy()).tolist()
boats_df['radians'] = np.radians(boats_df[['lat', 'lon']].to_numpy()).tolist()

In [None]:
clean_ods['radians'].head()

In [None]:
# for each radian coord in boats_df, find closest coord in clean_ods and use that to pull in remaining info

In [None]:
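def find_min_dist(point, coords):
    """
    Return the minimum haversine distance (in km) between a point and a series of points,
    along with the index of the closest point. All coordinates are [lat, lon] in radians.
    """
    min_dist = float('inf')
    min_dist_idx = None
    
    for idx, coord in enumerate(coords): 
        # haversine_distances returns a 2x2 matrix; [1][0] is the point-to-coord distance in radians
        hav_dist = haversine_distances([point, coord])[1][0]
        if hav_dist < min_dist:
            min_dist = hav_dist
            min_dist_idx = idx
    
    # multiply by Earth radius (6371 km) to convert radians to kilometers
    return (min_dist * 6371000/1000, min_dist_idx)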

In [None]:
test_in_radians

In [None]:
find_min_dist(test_in_radians2, clean_ods['radians'])
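
Looping over every station for every boat position gets slow on millions of rows. One possible alternative is a `BallTree` with the haversine metric, which answers all nearest-station queries in a single call. This is a sketch on made-up coordinates, not the notebook's method; the helper name `nearest_station` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def nearest_station(boat_deg, station_deg):
    """Return (index, distance_km) of the nearest station for every boat position.

    Both inputs are arrays of [lat, lon] pairs in decimal degrees.
    """
    # the haversine metric expects [lat, lon] in radians
    tree = BallTree(np.radians(station_deg), metric='haversine')
    dist_rad, idx = tree.query(np.radians(boat_deg), k=1)
    return idx[:, 0], dist_rad[:, 0] * EARTH_RADIUS_KM


# hypothetical coordinates for illustration
stations = np.array([[0.0, 0.0], [10.0, 10.0], [-30.0, 50.0]])
boat_positions = np.array([[1.0, 1.0], [9.0, 11.0], [-29.0, 49.0]])
nearest_idx, nearest_km = nearest_station(boat_positions, stations)
print(nearest_idx, nearest_km.round(1))
```

With the real data, the inputs would be `boats_df[['lat', 'lon']].to_numpy()` and `clean_ods[['lat', 'lon']].to_numpy()`, and `nearest_idx` could index straight into `clean_ods` with `.iloc`.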

In [None]:
clean_ods.info()

In [None]:
boats_df.info()

In [None]:
# add empty columns to bring in values
updated_boats_df = boats_df.copy()
updated_boats_df['depth'] = 0
updated_boats_df['temp'] = 0
updated_boats_df['salinity'] = 0
updated_boats_df['oxygen'] = 0
updated_boats_df['phosphate'] = 0
updated_boats_df['silicate'] = 0
updated_boats_df['pH'] = 0

In [None]:
updated_boats_df.info()

In [None]:
# keep only rows with a definite label (0 = not fishing, 1 = fishing); drops -1 and fractional scores
boats = updated_boats_df.loc[updated_boats_df['is_fishing'].isin([0, 1])].copy()

In [None]:
boats.info()

In [None]:
import time, sys
from IPython.display import clear_output

def update_progress(progress):
    """
    Function to build progress bar.
    Prints the percent of calculations completed along with a visual of progress so far.
    `progress` is a fraction between 0 and 1.
    For use in loops.
    """
    bar_length = 20
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
    if progress < 0:
        progress = 0
    if progress >= 1:
        progress = 1
        
    block = int(round(bar_length * progress))
    clear_output(wait = True)
    text = "Progress: [{0}] {1:.1f}%".format( "#" * block + "-" * (bar_length - block), progress * 100)
    print(text) 

In [None]:
# for each entry in boats bring in z, t, s, oxygen, phosphate, silicate, pH
# will not worry about year, month, day, or time right now
# note: assigning to `row` inside iterrows() only changes a temporary copy,
# so collect the nearest station index per row and assign whole columns instead
nearest_idx = []
for i, point in enumerate(boats['radians']):
    nearest_idx.append(find_min_dist(point, clean_ods['radians'])[1])
    update_progress((i + 1) / len(boats))

# copy the matched station values into boats, column by column
matched = clean_ods.iloc[nearest_idx]
boats = boats.copy()
for new_col, ods_col in [('depth', 'z'), ('temp', 't'), ('salinity', 's'), ('oxygen', 'oxygen'),
                         ('phosphate', 'phosphate'), ('silicate', 'silicate'), ('pH', 'pH')]:
    boats[new_col] = matched[ods_col].to_numpy()