# Table of Contents

- [Libraries](#Libraries)
- [Data](#Data)

# Part 1 - Cleaning of Pirate Data

__Within this notebook we did the following:__

1. Imported descriptive accounts of specific hostile activity against ships and mariners from the [Anti-Shipping Activity](https://livingatlas-dcdev.opendata.arcgis.com/datasets/esri2::anti-shipping-activity-messages/about) dataset.

2. Identified the closest country for each incident by reverse geocoding its coordinates.

3. Performed cleaning by identifying any null values and imputing any missing values we deemed necessary. 

4. Created a dictionary that maps country names to numeric geography codes.

# Libraries

In [1]:
import numpy as np
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
import time

# Data

In [2]:
df = pd.read_csv('../datasets/Anti-Shipping_Activity_Messages.csv')
df.dropna(inplace = True)

* Created a single coordinate column to pass into the closest-country function.

In [3]:
df['Y'] = [str(x) for x in df['Y']]
df['X'] = [str(x) for x in df['X']]

In [4]:
df['coords'] = df['Y'] + ',' + df['X']
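
The same column can also be built without list comprehensions. The sketch below uses toy values (not rows from the dataset) and assumes, as in the cells above, that `Y` holds latitude and `X` holds longitude.

```python
import pandas as pd

# Toy longitude (X) and latitude (Y) values standing in for the real dataset.
toy = pd.DataFrame({"X": [103.85, -77.32], "Y": [1.29, 25.03]})

# The reverse geocoder is given a "latitude,longitude" string, so Y comes first.
toy["coords"] = toy["Y"].astype(str).str.cat(toy["X"].astype(str), sep=",")
print(toy["coords"].tolist())
```

Unlike cells 3 and 4, this leaves `X` and `Y` numeric, which may be convenient if they are needed for plotting later.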

In [5]:
gloc = Nominatim(user_agent = 'Anything')

In [6]:
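def closest_country(c):
    # Pause between requests to respect Nominatim's one-request-per-second limit.
    time.sleep(1)
    try:
        location = gloc.reverse(c, language = 'en')
        return location.raw['address']['country']
    except Exception:
        # Any failure (no match, no 'country' key, request error) is labelled as open sea.
        return 'International Waters'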

[source](https://www.geeksforgeeks.org/get-the-city-state-and-country-names-from-latitude-and-longitude-using-python/)
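
Many incidents may share identical coordinates, so one way to cut down on requests is to cache results per coordinate string. The sketch below is an illustration, not part of the original notebook: `make_country_lookup`, `FakeLocation` and `fake_reverse` are hypothetical names, and the fake geocoder only mimics the `raw['address']['country']` structure used above so that the example runs offline.

```python
from functools import lru_cache


def make_country_lookup(reverse_fn, fallback="International Waters"):
    """Return a cached lookup so each distinct coordinate string is geocoded once."""

    @lru_cache(maxsize=None)
    def lookup(coords):
        try:
            location = reverse_fn(coords)
            return location.raw["address"]["country"]
        except Exception:
            # No result (None), a missing 'country' key, or a service error.
            return fallback

    return lookup


class FakeLocation:
    """Hypothetical stand-in for a geocoder result."""

    def __init__(self, country):
        self.raw = {"address": {"country": country}}


calls = []


def fake_reverse(coords):
    # Record every real lookup so the effect of the cache is visible.
    calls.append(coords)
    return FakeLocation("Singapore") if coords == "1.29,103.85" else None


lookup = make_country_lookup(fake_reverse)
results = [lookup(c) for c in ["1.29,103.85", "0.0,0.0", "1.29,103.85"]]
print(results, len(calls))
```

With the real geocoder, `reverse_fn` could be a small function that sleeps for a second and then calls `gloc.reverse(c, language = 'en')`. One caveat of this design is that a fallback caused by a temporary network error is cached too.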

* Called the closest-country function on all rows. Because of the time this step takes, we broke the process down into smaller batches. We exported the data to _'esri_w_country_columns'_ to avoid having to perform this step again.

In [7]:
#df_locations_1_250 = df['coords'][:250].apply(lambda x: closest_country(x))
#df_locations_250_500 = df['coords'][250:500].apply(lambda x: closest_country(x))
#df_locations_500_750 = df['coords'][500:750].apply(lambda x: closest_country(x))
#df_locations_750_1000 = df['coords'][750:1000].apply(lambda x: closest_country(x))
#df_locations_1000_1500 = df['coords'][1000:1500].apply(lambda x: closest_country(x))
#df_locations_1500_2500 = df['coords'][1500:2500].apply(lambda x: closest_country(x))
#df_locations_2500_end = df['coords'][2500:].apply(lambda x: closest_country(x))

In [8]:
#df_loc_all = pd.concat([df_locations_1_250,
#                         df_locations_250_500, 
#                         df_locations_500_750, 
#                         df_locations_750_1000,
#                         df_locations_1000_1500,
#                         df_locations_1500_2500,
#                         df_locations_2500_end
#                         ])

#df['country'] = df_loc_all

#df.to_csv('../datasets/esri_w_country_columns')
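
The manual slices above can be generalized with a small helper. This is a sketch under the assumption that fixed-size batches are acceptable; `apply_in_chunks` is a hypothetical name, and `str.upper` stands in for `closest_country` so that the example runs without network access.

```python
import pandas as pd


def apply_in_chunks(series, func, chunk_size=250):
    """Apply func to a Series in fixed-size slices and stitch the results back together."""
    parts = []
    for start in range(0, len(series), chunk_size):
        chunk = series.iloc[start:start + chunk_size]
        parts.append(chunk.apply(func))
        # A checkpoint (for example, writing parts[-1] to disk) could go here.
    return pd.concat(parts)


coords = pd.Series(["a", "b", "c", "d", "e"])
labelled = apply_in_chunks(coords, str.upper, chunk_size=2)
print(labelled.tolist())
```

Because `pd.concat` keeps the original index, the result can be assigned straight back to the data frame, just as `df_loc_all` is above.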

* Separated the _'dateofocc'_ column into year, month, day and time for further analysis.

In [9]:
df = pd.read_csv('../datasets/esri_w_country_columns')

In [10]:
df['year'] = [x[0:4] for x in df.dateofocc]
df['month'] = [x[5:7] for x in df.dateofocc]
df['day'] = [x[8:10] for x in df.dateofocc]
df['time'] = [x[11:] for x in df.dateofocc]
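
As an alternative to slicing strings by position, the date part can be parsed with `pd.to_datetime`. The sketch below assumes the `dateofocc` values look like `'2010-03-15 00:00:00+00'`, which is inferred from the slice offsets above and the time string checked in the next cell; the two rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({"dateofocc": ["2010-03-15 00:00:00+00", "2015-11-02 14:30:00+00"]})

# Parse only the date part (first 10 characters); the time part is kept as text.
parsed = pd.to_datetime(toy["dateofocc"].str[:10], format="%Y-%m-%d")

toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month
toy["day"] = parsed.dt.day
toy["time"] = toy["dateofocc"].str[11:]
print(toy[["year", "month", "day", "time"]])
```

This yields integers rather than zero-padded strings, so dummy columns created from `month` would be named `month_3` instead of `month_03`, and `year` would not need the later `astype(int)`.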

In [11]:
df[df['time'] == '00:00:00+00'].count()
#Most rows have the placeholder time '00:00:00+00', so the time of day is effectively missing

Unnamed: 0         7792
X                  7792
Y                  7792
OBJECTID           7792
reference          7792
dateofocc          7792
subreg             7792
hostility_d        7792
victim_d           7792
description        7792
hostilitytype_l    7792
victim_l           7792
navarea            7792
coords             7792
country            7792
year               7792
month              7792
day                7792
time               7792
dtype: int64
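
Counting every column gives the same number nineteen times. A more compact check is to build a boolean mask and take its sum and mean, as in this sketch with made-up values:

```python
import pandas as pd

# Made-up time strings; three of the four are the midnight placeholder.
times = pd.Series(["00:00:00+00", "00:00:00+00", "14:30:00+00", "00:00:00+00"])

is_placeholder = times == "00:00:00+00"
count = int(is_placeholder.sum())    # number of rows with the placeholder
share = float(is_placeholder.mean()) # fraction of rows with the placeholder
print(count, share)
```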

* We dropped the following columns as we assessed they weren't necessary for our analysis moving forward.

In [12]:
#Dropping time column
df.drop(columns = 'time', inplace = True)

In [13]:
#Dropping original date of occurrence column
df.drop(columns = 'dateofocc', inplace = True)

In [14]:
#Dropping 'OBJECTID'
df.drop(columns = 'OBJECTID', inplace = True)

In [15]:
#Lowercase all text
for i in df.columns:
    if df[i].dtypes == 'O':
        df[i] = df[i].str.lower()

* The dataset provides an account of the different types of piracy attacks (_'hostilitytype_l'_) that can occur, and the types of vessels (_'victim_l'_) that have been attacked. We first replaced the numeric codes in both columns with readable labels, and then dummified them so that we had numeric data to analyze.

In [16]:
df.hostilitytype_l.value_counts()

1.0    6971
3.0     537
4.0     163
6.0     124
2.0      47
5.0      40
9.0       6
7.0       1
Name: hostilitytype_l, dtype: int64

In [17]:
host_lst = ['pirate_assaults', 
            'navel_engagement', 
            'suspicious_approach', 
            'kidnapping', 
            'unknown', 
            'other', 
            'hijacking', 
            'no entries', 
            'attempted_boarding']

df.hostilitytype_l = [host_lst[int(x) - 1] for x in df.hostilitytype_l]

In [18]:
victim_lst = ['anchored_vessel', 
            'barge', 
            'cargo_ship', 
            'fishing_vessel', 
            'merchant_vessel', 
            'offshore_vessel', 
            'passenger_ship', 
            'sailing_vessel', 
            'tanker',
            'tugboat',
            'vessel',
            'unknown',
            'other']

df.victim_l = [victim_lst[int(x) - 1] for x in df.victim_l]
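
List indexing works here because every code falls inside the list, but a code of `0` would silently pick the last label (`int(0) - 1` is `-1`) and a code beyond the list would raise an error. A dictionary lookup is one way to make unexpected codes visible as `NaN` instead. The sketch below reuses `host_lst` from above; the code `12.0` is invented to show the behavior.

```python
import pandas as pd

host_lst = ['pirate_assaults', 'navel_engagement', 'suspicious_approach',
            'kidnapping', 'unknown', 'other', 'hijacking', 'no entries',
            'attempted_boarding']

# Codes start at 1, so enumerate from 1 instead of subtracting 1 at lookup time.
code_to_label = dict(enumerate(host_lst, start=1))

# 12.0 is not a valid code and is included only to show what happens to it.
codes = pd.Series([1.0, 3.0, 9.0, 12.0])
labels = codes.astype(int).map(code_to_label)
print(labels.tolist())
```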

* Additionally, we dummified the subregion (_'subreg'_) column, the navigational area (_'navarea'_) column, and the _'month'_ column, which was derived when we separated the date column above.

In [19]:
#Dummy all categorical columns
to_dummy = ['subreg', 'hostilitytype_l', 'victim_l', 'navarea', 'month']

for i in to_dummy:
    df = pd.concat([df, pd.get_dummies(df[i], prefix = i)], axis=1)
#df.drop(columns = to_dummy, axis = 1)
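
The loop can also be written as a single `pd.get_dummies` call followed by one `pd.concat`, which keeps the original columns just as the loop does. The sketch below uses a toy frame with made-up values.

```python
import pandas as pd

toy = pd.DataFrame({
    "subreg": [57, 61, 57],
    "month": ["01", "12", "01"],
    "year": [2010, 2011, 2012],
})
to_dummy = ["subreg", "month"]

# Passing columns= forces numeric codes such as subreg to be encoded too.
dummies = pd.get_dummies(toy[to_dummy], columns=to_dummy, prefix=to_dummy)
toy = pd.concat([toy, dummies], axis=1)
print(toy.columns.tolist())
```

Keeping the original columns matters here because `hostilitytype_l` is still needed for the success mapping in the next step.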

* We created a binarized column from the _'hostilitytype_l'_ column that indicates pirate success ('1') or failure ('0').

In [20]:
success_dict = {
    'pirate_assaults':1,      
    'suspicious_approach':0,  
    'kidnapping':1,
    'other':0,
    'navel_engagement':0, 
    'unknown':0,
    'attempted_boarding':0,
    'hijacking':1
}

df['pirate_success'] = df.hostilitytype_l.map(success_dict)
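
`Series.map` returns `NaN` for any label that is missing from the dictionary. `success_dict` has no entry for `'no entries'`, which does not matter for the counts shown earlier (code 8 never appears), but a quick check like the following sketch would flag it if that changed. The four labels in the example are chosen for illustration.

```python
import pandas as pd

success_dict = {
    'pirate_assaults': 1, 'suspicious_approach': 0, 'kidnapping': 1,
    'other': 0, 'navel_engagement': 0, 'unknown': 0,
    'attempted_boarding': 0, 'hijacking': 1,
}

labels = pd.Series(['pirate_assaults', 'hijacking', 'no entries', 'other'])
success = labels.map(success_dict)

# Labels that did not receive a success flag.
unmapped = sorted(labels[success.isna()].unique())
print(unmapped)
```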

In [21]:
df['year'] = df['year'].astype(int)

* Due to the sporadic nature of the UN SDG data, we decided to home in on observations from 2010 onwards. This decision reduced the number of NaN values we would encounter and allowed for better analysis across all countries and all years within our scope.

In [22]:
#Incorporate only observations from 2010 on.
df = df[df['year'] >= 2010]

* Created a dictionary that maps each country name to a numeric geocode (stored in the _'country_code'_ column) and used it, together with the year, as a unique identifier in order to merge separate datasets.

In [23]:
df.drop(columns = ['reference', 'hostility_d', 'victim_d', 'description'], inplace = True)

In [24]:
countries = ['The Bahamas', 'Indonesia','International Waters', 'Eritrea', 'India', 'Brazil',
 'Somalia', 'Ecuador', 'Philippines', 'Malta', 'China', 'Cameroon', 'Sri Lanka','Nicaragua',
 'Senegal', 'Bangladesh', 'Vietnam', 'Malaysia', 'Mozambique', 'Guyana', 'Algeria', 'Tanzania',
 'Lebanon', 'Visayas', 'Colombia', 'Nigeria', 'Egypt', 'Thailand', 'Russia', 'Guinea',
 'Morocco', "Côte d'Ivoire", 'Portugal', 'Japan', 'Myanmar', 'Mindanao',
 'Dominican Republic', 'Iran', 'Venezuela', 'Ghana', 'Angola', 'Sierra Leone', 
 'Democratic Republic of the Congo', 'Madagascar', 'Turkey', 'Peru', 'Italy', 'Oman', 'Djibouti',
 'North Korea', 'Greece', 'Yemen', 'Taiwan', 'Comoros', 'Papua New Guinea', 'Jamaica', 'Saudi Arabia',
 'Netherlands', 'Panama', 'Singapore', 'Kenya', 'France', 'Pakistan', 'United States', 'Gabon',
 'Congo-Brazzaville', 'Belgium', 'Brunei', 'Cyprus', 'Haiti', 'Liberia', 'Belize', 'Qatar', 'Solomon Islands',
 'Equatorial Guinea', 'Guatemala', 'Fiji', 'South Africa', 'Tunisia', 'Mauritania', 'United Arab Emirates',
 'Germany', 'Mexico', 'Montenegro', 'Togo', 'Honduras', 'United Kingdom', 'Benin', 'Trinidad and Tobago',
 'Bulgaria', 'Georgia', 'Cuba', 'Iraq', 'Suriname', 'Australia', 'El Salvador', 'Romania', 'Saint Lucia',
 'Uruguay', 'British Virgin Islands', 'Saint Vincent and the Grenadines', 'Sudan', 'Dominica',
 'Spain', 'Costa Rica', 'Antigua and Barbuda', 'Grenada', 'South Korea', 'Cape Verde', 'Seychelles',
 'Libya', 'Cayman Islands', 'Saint Kitts and Nevis']

countries = [x.lower() for x in countries]

In [25]:
lis = [44, 360, 1, 232, 356, 76, 706, 218, 608, 470, 156, 120, 144, 558, 686, 50, 704, 458, 508, 328, 
       12, 834, 422, 608, 170, 566, 818, 764, 643, 324, 504, 384, 620, 392, 104, 608, 214, 364, 862, 
       288, 24, 694, 180, 450, 792, 604, 380, 512, 262, 408, 300, 887, 156, 174, 598, 388, 682, 528,
       591, 702, 404, 250, 586, 840, 266, 178, 56, 96, 196, 332, 430, 84, 634, 90, 226, 320, 242, 710, 
       788, 478, 784, 276, 484, 499, 768, 340, 826, 204, 780, 100, 268, 192, 368, 740, 36, 222, 642, 
       662, 858, 92, 670, 729, 212, 724, 188, 28, 308, 410, 132, 690, 434, 136, 659]

In [26]:
dic = {countries[i]: lis[i] for i in range(len(countries))}
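
Because the names and codes live in two parallel lists, a length check before pairing them guards against the lists drifting out of sync. The sketch below uses a handful of pairs copied from the lists above and also shows that the mapping is many-to-one: the geocoder's region names 'visayas' and 'mindanao' share a code with 'philippines', and 'taiwan' shares one with 'china'.

```python
# A few (name, code) pairs copied from the lists above.
names = ['philippines', 'visayas', 'mindanao', 'china', 'taiwan', 'international waters']
codes = [608, 608, 608, 156, 156, 1]

# Fail early if the two lists drift out of sync.
assert len(names) == len(codes), "countries and codes must have the same length"
name_to_code = dict(zip(names, codes))

# Invert the mapping to see which names share a code.
code_to_names = {}
for name, code in name_to_code.items():
    code_to_names.setdefault(code, []).append(name)

print(code_to_names)
```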

In [27]:
#The geocoded column is named 'country' (see the column list above)
df['country_code'] = df['country'].map(dic)

In [28]:
df['join_key'] = list(zip(df.country_code, df.year))
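
A tuple key works in memory, but it is written to CSV as text such as `(360, 2010)`, so it would need to be rebuilt after reloading. Merging on the two columns directly avoids that. In the sketch below, the `indicators` frame and its `indicator_value` column are hypothetical stand-ins for the second dataset, and all values are made up.

```python
import pandas as pd

piracy = pd.DataFrame({
    "country_code": [360, 566, 360],
    "year": [2010, 2010, 2011],
    "pirate_success": [1, 0, 1],
})

# Hypothetical second dataset keyed by the same two columns.
indicators = pd.DataFrame({
    "country_code": [360, 566],
    "year": [2010, 2010],
    "indicator_value": [0.5, 0.7],
})

# A left merge keeps every piracy row, with NaN where no indicator exists.
merged = piracy.merge(indicators, on=["country_code", "year"], how="left")
print(merged)
```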

* Saved the final changes made in this first cleaning notebook to the CSV file below.

In [29]:
df.to_csv('../datasets/cleaned_pirate_activity_eda.csv')
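
By default `to_csv` also writes the index, which is where the `Unnamed: 0` column seen earlier comes from when a file is read back. If the index carries no information, passing `index=False` is one option. The sketch below demonstrates the difference with an in-memory buffer instead of a file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"country": ["indonesia"], "year": [2010]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = pd.read_csv(no_index).columns.tolist()

print(cols_with_index, cols_no_index)
```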