# Matthew Moskal Final Project Milestone 3

### Overview of Cleaning My Dataset:
#### From the start of this project, I wanted to spend time picking a dataset that was already in good shape for my purposes, rather than struggle with a poorly maintained one and the technical errors that come with it. My dataset, acquired from the organization Global Fishing Watch (GFW), was constructed using machine learning models, vessel registry databases, and manual review by GFW and regional experts. Because it was carefully curated and cross-checked by experts in the field, it is reliable and has very few gaps that need cleaning before I can answer my questions.

#### However, there are still a few issues. The first is the flags. Each fishing vessel is required to sail under a nation's flag to comply with maritime law, but these flags are often draped off the back of the ship and are not readily visible to satellites. As a result, my dataset is full of NAs in two columns, flag_ais and flag_registry (I drop the flag_registry column later in the cleaning). Since some of my questions concern the fishing efforts of individual nations, I want a category for fishing done by vessels with no flag recorded in the dataset. My cleaning therefore replaces the NAs in those two columns with the group identifier "NO FLAG", so that these otherwise neglected rows can be plotted easily.

#### I also wanted to deal with the maritime registry columns. Since most of the vessels in this dataset come from countries with no centralized ship reporting database, the columns based on registry information are predominantly empty and of little use to me. I can instead rely on the vessel information in the other columns, which comes from more robust and centralized sources like the machine learning model.
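
One way to check how empty the registry columns are before dropping them is to compute the share of missing values per column. This is only a minimal sketch: the toy DataFrame and its values are invented for illustration, and only the column names are taken from my dataset.

```python
import pandas as pd

# Hypothetical toy rows: the values are made up, only the column names match the dataset
toy = pd.DataFrame({
    "flag_ais": ["CHN", None, "ESP", "CHN"],
    "flag_registry": [None, None, "ESP", None],
    "length_m_registry": [None, None, 32.5, None],
})

# isna() marks missing cells as True; mean() turns that into a fraction per column
missing_share = toy.isna().mean()
print(missing_share)
```

Running the same two lines on the real `df` would show which columns are mostly empty and could support the decision to drop them.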

In [7]:
import pandas as pd #importing pandas

#reading in my dataset; low_memory=False makes pandas read each column in one pass instead of in chunks,
#which avoids the mixed-type problems I ran into because of the sheer size of my set
df = pd.read_csv("fishing-vessels-v3.csv", low_memory=False)

In [8]:
#filling the NAs in the two flag columns with "NO FLAG" as their new identifier
df['flag_ais'] = df['flag_ais'].fillna("NO FLAG")
df['flag_registry'] = df['flag_registry'].fillna("NO FLAG")
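
The reason for using a placeholder can be sketched with a small example. By default, `value_counts()` leaves missing values out of its result, so unflagged vessels would not show up in a per-flag count or plot unless they carry a label. The toy data below is invented for illustration and is not taken from my dataset.

```python
import pandas as pd

# Hypothetical toy column: two vessels have no flag recorded
toy = pd.DataFrame({"flag_ais": ["CHN", None, "ESP", None, "CHN"]})

# Same fill as in the cleaning step above
toy["flag_ais"] = toy["flag_ais"].fillna("NO FLAG")

# "NO FLAG" is now an ordinary category and is counted alongside the real flags
flag_counts = toy["flag_ais"].value_counts()
print(flag_counts)
```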

In [13]:
registry_columns = [ #placing the registry columns in a list for use in the .drop command
    'flag_registry',
    'vessel_class_registry',
    'length_m_registry',
    'engine_power_kw_registry',
    'tonnage_gt_registry',
    'registries_listed'
]

df = df.drop(columns=registry_columns, errors='ignore') #dropping the registry columns from the dataset
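
Because `errors='ignore'` silently skips any name that is not in the DataFrame, a typo in the list would go unnoticed. A minimal sketch of a check that could be run alongside the drop is shown below. The toy DataFrame, its values, and the variable names `to_drop`, `not_found`, and `trimmed` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with only one of the registry columns present
toy = pd.DataFrame({"mmsi": [1, 2], "flag_registry": ["ESP", None]})

to_drop = ["flag_registry", "tonnage_gt_registry"]  # second name is absent from the toy frame

# List the requested columns that do not exist, so typos or missing columns are visible
not_found = [col for col in to_drop if col not in toy.columns]
print("Not found:", not_found)

# The drop itself still succeeds because absent names are ignored
trimmed = toy.drop(columns=to_drop, errors="ignore")
print(list(trimmed.columns))
```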

In [14]:
df.to_csv("Fishing Vessels - Cleaned.csv", index=False) #saving the cleaned data to the new dataset