# Downloading your own ship dataset from Planet.com

The code here is an example of how to download satellite imagery from planet.com to build a dataset. Each image in the resulting dataset either contains a ship or does not, so the dataset can be used to train a ship recognition model.

The thought process is also explained in my blog post here: 



In [None]:
import subprocess
import numpy as np
import pandas as pd
import datetime
import os
import geopandas as gpd
from shapely.geometry import Point, Polygon
import shapely.wkt
import random

## The data sources

Our satellite imagery comes from planet.com. We're using the PlanetScope 4-band images, which have RGB + IR channels and a resolution of 3 m × 3 m. They are downloaded using the porder command-line tool, which is explained here (https://medium.com/@samapriyaroy/order-up-using-and-building-with-planet-s-new-ordersv2-api-ba2fe14eac8e).

The second data source is historical AIS data downloaded from https://marinecadastre.gov/ais/.

The third thing we need is the .geojson file that describes the region of interest. The one used here was created with http://geojson.io/#map=2/20.0/0.0 and can be found in the folder /geojson/miami_to_palm.geojson .

So first we have to set some variables:

In [None]:
month = 12
year = 2017
num_days_in_month = 31
geojson = "miami_to_palm.geojson"
ais_file = "AIS_2017_12_Zone17.csv"
path_extension = file_extension = ais_file.split('.')[0]

As you can see we're creating our dataset month-by-month. This means you have to run this multiple times, but it helps if you are on a quota constraint.
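Since the notebook is rerun for every month, the number of days does not have to be typed in by hand. A minimal sketch, assuming you are happy to derive it with the standard library (the helper name `days_in_month` is made up for this example):

```python
import calendar


def days_in_month(year, month):
    # monthrange returns (weekday of the first day, number of days in the month)
    return calendar.monthrange(year, month)[1]


# Same values as in the settings cell above
num_days_in_month = days_in_month(2017, 12)
print(num_days_in_month)
```

This also takes care of February in leap years when you move on to other months.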

## How to use:
The notebook is meant to be run from top to bottom. The steps have a fixed order: for example, you can't download the data before ordering it. There's an extra reason for caution after ordering: Planet takes a while to process your orders, so the code fails if you don't wait long enough before running the downloading cell!

**Don't use "run all"!**

## Reducing the original AIS file

Because the original AIS file is giant, we first reduce it to only include ships that were tracked in the area inside our .geojson.
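To make the idea of the region filter concrete, here is a small standard-library sketch of a point-in-polygon test. The notebook itself relies on shapely's `within`; the ray-casting helper, the bounding-box pre-check, the rectangle and the sample rows below are all assumptions made up for illustration.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle standing in for the real geojson shape
ring = [(-80.3, 25.7), (-80.0, 25.7), (-80.0, 26.8), (-80.3, 26.8)]
minx = min(x for x, _ in ring)
maxx = max(x for x, _ in ring)
miny = min(y for _, y in ring)
maxy = max(y for _, y in ring)

# Made-up AIS rows with the LON / LAT column names used in the notebook
rows = [
    {"MMSI": 1, "LON": -80.1, "LAT": 26.0},
    {"MMSI": 2, "LON": -79.5, "LAT": 26.0},
]

kept = [
    r for r in rows
    # Cheap bounding-box check first, exact polygon test second
    if minx <= r["LON"] <= maxx and miny <= r["LAT"] <= maxy
    and point_in_polygon(r["LON"], r["LAT"], ring)
]
print([r["MMSI"] for r in kept])
```

The bounding-box check is only there to skip the exact test for rows that are clearly outside the region.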

In [None]:
def reduce_ais_csv(filename, geojson_name):
    # reduce the ais csv to contain only the region of interest -- takes forever!
    print('Reducing the original AIS file - this might take a while...')
    data_dir = os.path.join(os.getcwd(), 'data')
    ais_org_dir = os.path.join(data_dir, 'original_ais')
    csv_path = os.path.join(ais_org_dir, filename)
    df = pd.read_csv(csv_path)
    def to_coordinate(row):
        # We convert the longitude and latitude to a Point so we can handle it with geopandas
        coordinate = Point(row.loc['LON'], row.loc['LAT'])
        return coordinate
    # raw=False (the default) passes each row as a Series, which .loc needs
    df['coordinate'] = df.apply(to_coordinate, axis=1)
    geojson_dir = os.path.join(data_dir, "geojson")
    path_to_geojson = os.path.join(geojson_dir, geojson_name)
    gjson = gpd.read_file(path_to_geojson)
    # We select the geometry of the first shape (our actual geojson shape we placed in the file)
    region = gjson.geometry.iloc[0]
    # We check that the ship coordinates are within the .geojson
    reduced_index = df.loc[df.loc[:,'coordinate'].apply(lambda x: x.within(region)),:].index
    # And reduce the whole thing (copy, so we don't modify a view of df)
    reduced_df = df.loc[reduced_index,:].copy()
    # we convert the date to datetime to handle it more easily
    reduced_df['datetime'] = reduced_df.apply(lambda x: datetime.datetime.fromisoformat(x.loc['BaseDateTime']), axis=1)
    # and save the file (creating the target folder if it is missing)
    ais_reduced_dir = os.path.join(data_dir, 'reduced_ais')
    os.makedirs(ais_reduced_dir, exist_ok=True)
    reduced_csv_path = os.path.join(ais_reduced_dir, filename)
    reduced_df.to_csv(reduced_csv_path)
    return

In [None]:
# First we load or construct the reduced df
try:
    data_dir = os.path.join(os.getcwd(), 'data')
    ais_reduced_dir = os.path.join(data_dir, 'reduced_ais')
    reduced_csv_path = os.path.join(ais_reduced_dir, ais_file)
    reduced_df = pd.read_csv(reduced_csv_path)
    print('Dataframe loaded')
except FileNotFoundError:
    print('Reduced Version not found, creating it ....')
    reduce_ais_csv(ais_file, geojson)
    reduced_df = pd.read_csv(reduced_csv_path)
    print('Dataframe created and loaded')
reduced_df['datetime'] = reduced_df.apply(lambda x: datetime.datetime.fromisoformat(x.loc['BaseDateTime']), axis=1)

### Getting the satellite times

Next we query the timestamps of the satellite images available for that month.
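The `get_time` and `get_ships` helpers in this notebook boil down to two steps: read the acquisition time out of an image name and keep AIS entries that lie within one minute of it. The sketch below shows both steps with the standard library only. The ID `20171203_153012_0f2d` is made up, and its `<date>_<time>_<suffix>` layout is an assumption based on how `get_time` splits the name.

```python
from datetime import datetime, timedelta


def parse_item_time(item_id):
    # Assumed layout: "<YYYYMMDD>_<HHMMSS>_<suffix>"
    date_part, time_part = item_id.split("_")[:2]
    # Keep minute resolution only, as get_time does
    return datetime.strptime(date_part + time_part[:4], "%Y%m%d%H%M")


def matches(ais_time, image_time, tolerance=timedelta(minutes=1)):
    # True if the AIS entry was logged within the tolerance of the image time
    return abs(ais_time - image_time) < tolerance


image_time = parse_item_time("20171203_153012_0f2d")  # made-up ID
ais_time = datetime.fromisoformat("2017-12-03T15:30:40")
print(image_time, matches(ais_time, image_time))
```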

In [None]:
def create_id_list(geojson_name, date):
    # Creates an id list of all data that is available in the geojson for this date. We need this, so we can 
    # check where there were boats while the satellite took its picture.
    data_dir = os.path.join(os.getcwd(), 'data')
    geo_dir = os.path.join(data_dir, 'geojson')
    id_dir = os.path.join(data_dir, 'idlists')
    in_file = os.path.join(geo_dir, geojson_name)
    out_file = os.path.join(id_dir, str(date) + "idlist.csv")
    end_date = date + datetime.timedelta(days=1)
    # We create a command to query the idlist using the geojson, max 50% cloud coverage and at least 80% overlap with our geojson
    command_1 = "porder idlist --input \"" + str(in_file) + "\" "
    command_2 = "--start \"" + str(date) + "\" --end \"" + str(end_date) + "\" --item \"PSScene4Band\" --asset \"analytic\""
    command_3 = " --cmax 0.5 --outfile \"" + str(out_file) +  "\" --overlap 80" 
    command = command_1 + command_2 + command_3
    query = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True, stderr = subprocess.PIPE)
    result = query.stdout.read().decode("utf-8")
    # We compute the estimated cost (in km2)
    try:
        estimated_cost = float(result[result.find('clipped:')+9:].split(' ')[0].replace(',',''))
    except ValueError:
        estimated_cost = 0
    return estimated_cost, out_file

In [None]:
def get_time(file_name):
    # extract the time from the filename given in the id_list
    time_string = file_name.split('_')[1]
    date_string = file_name.split('_')[0]
    # np.int was removed from NumPy; the built-in int does the same job
    start_time = datetime.datetime(year=int(date_string[0:4]), month=int(date_string[4:6]),
                                   day=int(date_string[6:8]), hour=int(time_string[0:2]),
                                   minute=int(time_string[2:4]))
    return start_time

def get_unique_times(file_name):
    # read the id list file and extract the times when the pictures were taken.
    file = pd.read_csv(file_name, names=['file_name'])
    if file.shape[0] > 0:
        # map over the single column instead of the deprecated DataFrame.applymap
        file['time'] = file['file_name'].map(get_time)
        times = file.loc[:,'time'].unique()
    else:
        times=[]
    return times

In [None]:
def get_ships(time_list, reduced_df):
    # Extract the ships from the reduced ais file using the times gathered from the id list.
    entries=pd.DataFrame()
    for time in time_list:
        # We use a delta of 1 minute since the AIS logging resolution is 1 minute
        entries = pd.concat([entries,reduced_df.loc[abs(reduced_df['datetime'] - time) < 
                                                        datetime.timedelta(minutes=1),:]])
    return entries

In [None]:
# Now we check the full month
ship_df = pd.DataFrame()
# iterating over all days
for day in range(1,num_days_in_month+1):
    date = datetime.date(year=year, month=month, day=day)
    # we query the id_list - so all available filenames
    estimated_cost, id_list_file = create_id_list(geojson, date)
    print('Imagery for ',  str(date) ,' is around ',  str(estimated_cost), ' sqkm')
    # we extract the available times from the id list
    times = get_unique_times(id_list_file)
    print('Extracted ' , str(len(times)) , ' unique time slots.')
    # We intersect this with the AIS data to only select ships that are inside the geojson at the time an image was taken
    day_ship_df = get_ships(times, reduced_df)
    print('Found ', str(day_ship_df.shape[0]), ' ships.')
    ship_df = pd.concat([ship_df, day_ship_df])
print(ship_df)

In [None]:
ship_path = os.path.join(data_dir, 'ships' + ais_file)
ship_df.to_csv(ship_path)

### Ordering the ship files

Now we have all our possible ship candidates saved in ship_df. Next we make sure our coordinates are points, so we can access them more readily.

In [None]:
def to_coordinate(df):
    # Converts the Lon and LAT information in the dataframe to a geo point
    coordinate = Point(df.loc['LON'], df.loc['LAT'])
    return coordinate

In [None]:
# We convert the coordinates in ship df into points so we can handle them.
# raw=False (the default) passes each row as a Series, which to_coordinate needs for .loc
ship_df['coordinate'] = ship_df.apply(to_coordinate, axis=1)

Next we set the size of the rectangle that we want to draw around each ship and download.
It is specified in degrees (the coordinates are plain longitude/latitude, presumably WGS 84 / EPSG:4326), not in pixels or meters, so it's hard to find a correct size. A buffer of 0.02 works fine and covers around 1200 pixels.
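To get a feeling for how a buffer in degrees translates into pixels, the following back-of-the-envelope sketch may help. It assumes roughly 111.32 km per degree of latitude, shrinks the east-west extent by the cosine of the latitude, and uses the 3 m pixel size quoted earlier. The latitude of 26° is an assumed value for the Miami area, and the helper name is made up.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # rough average, an approximation
PIXEL_SIZE_M = 3.0               # PlanetScope resolution quoted in this notebook


def envelope_size_px(buffer_deg, lat_deg):
    # The envelope of a buffered point spans twice the buffer in each direction
    side_deg = 2 * buffer_deg
    height_m = side_deg * METERS_PER_DEGREE_LAT
    # One degree of longitude gets shorter towards the poles
    width_m = height_m * math.cos(math.radians(lat_deg))
    return width_m / PIXEL_SIZE_M, height_m / PIXEL_SIZE_M


width_px, height_px = envelope_size_px(0.02, 26.0)
print(round(width_px), round(height_px))
```

The estimate lands in the low thousands of pixels per side, the same ballpark as the figure mentioned above. It is only a rough guide, since the real footprint depends on the projection of the delivered image.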

In [None]:
gdf = gpd.GeoDataFrame(geometry = ship_df.loc[:,'coordinate'])
# buffer 0.02 corresponds to 1200 pixels
# buffer 0.01 is too small to download
buffer = gdf.buffer(0.02)
ship_polygon = buffer.envelope
ship_df['envelope'] = ship_polygon

Now we can start ordering the images. We also keep track of the associated cost in km², since Planet has a quota on it for most users.

We iterate over all ships and create a geojson for the rectangle around each ship that we designed above. Then we query Planet for this small image and, if it exists, we order it (the download follows later). Not all images are available, because we have not yet taken into account that the satellite does not always cover the full geojson. Sometimes it covers only a small part of it, and not all boats fall inside that coverage.
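The order request is a `porder order` call assembled from the ship name, the id list and the small geojson. As a hedged alternative to concatenating one long shell string, the command can be built as an argument list, which avoids quoting problems with paths that contain spaces. The flags are the ones used in this notebook; the helper name `build_order_command` and all example values are made up.

```python
import shlex


def build_order_command(name, idlist_path, boundary_path):
    # One list entry per argument, so no manual quoting is needed
    return [
        "porder", "order",
        "--name", name,
        "--idlist", idlist_path,
        "--item", "PSScene4Band",
        "--bundle", "analytic",
        "--boundary", boundary_path,
        "--op", "clip",
    ]


cmd = build_order_command(
    "2017-12-03T15:30:40 367000000",        # made-up BaseDateTime + MMSI
    "data/idlists/2017-12-03idlist.csv",    # made-up id list path
    "data/geojson/test.geojson",
)
# shlex.join shows the equivalent, safely quoted shell line
print(shlex.join(cmd))
```

A list like this can be passed to `subprocess.run(cmd, capture_output=True, text=True)` without `shell=True`.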

In [None]:
def order_ship(ship, id_list_filename):
    # Ordering a ship from planet.com using porder API
    data_dir = os.path.join(os.getcwd(), 'data')
    try:
        name = ship['BaseDateTime'] + str(ship['MMSI'])
    except KeyError:
        name = str(ship['BaseDateTime']) + str(ship['ship_idx'])
    idlist = id_list_filename
    geojson_dir = os.path.join(data_dir, 'geojson')
    geojson = os.path.join(geojson_dir, 'test.geojson')
    # We order the clipped image around our boat
    command_1 = "porder order --name \"" + str(name) + "\" "
    command_2 = "--idlist \"" + str(idlist) + "\" --item \"PSScene4Band\" --bundle \"analytic\""
    command_3 = " --boundary \"" + str(geojson) + "\" --op clip" 
    command = command_1 + command_2 + command_3
    #print(command)
    query = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True, stderr = subprocess.PIPE)
    result = query.stdout.read().decode("utf-8")
    url = result[result.find('https:') :].split(' ')[0]
    # Setting out the order
    print(result)
    return url

In [None]:
total_cost = 0
downloaded_ships = pd.DataFrame()
data_dir = os.path.join(os.getcwd(), 'data')
geojson_dir = os.path.join(data_dir, 'geojson')
geojson = os.path.join(geojson_dir, 'test.geojson')
 
for index, ship in ship_df.iterrows():
    # We iterate over all possible ships and check whether the satellite took a picture of the small ship region,
    # i.e. whether there's a quota cost associated with our requested ship.
    # We safeguard our downloaded ships info every so often, in case something goes wrong / crashes
    if index%100 == 0:
        downloaded_ships.to_csv('ship_urls' + str(index) + '.csv')
    current_ship = ship.copy()
    # we create the geojson
    gpd.GeoSeries([ship['envelope']]).to_file(geojson, driver='GeoJSON')
    try:
        date = datetime.datetime.fromisoformat(ship['datetime']).date()
    except TypeError:
        date = ship['datetime'].date()
    # we query if there's data available for this ship / date
    cost, filename = create_id_list(geojson, date)
    if cost == 0:
        continue
    else:
        # if so, we make sure it's the correct timing (in case there's multiple satellite images in a day)
        times = get_unique_times(filename)
        if len(times) > 1:
            # this could be done nicer... like using the extra data and not throwing it away
            continue
        try:
            time_dif = datetime.datetime.utcfromtimestamp(times[0].astype('O')/1e9) - datetime.datetime.fromisoformat(ship['datetime'])
        except TypeError:
            time_dif = datetime.datetime.utcfromtimestamp(times[0].astype('O')/1e9) - ship['datetime']
        # We only order if the time is correct
        if abs(time_dif) < datetime.timedelta(minutes=1):       
            # if we catch a ship we request and download the asset.
            # we order the asset
            url = order_ship(ship, filename)
            current_ship['url'] = url
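            # DataFrame.append was removed in pandas 2.0; pd.concat keeps the ship's index label
            downloaded_ships = pd.concat([downloaded_ships, current_ship.to_frame().T])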
            total_cost +=cost
print(total_cost)
print(downloaded_ships)

In [None]:
data_dir = os.path.join(os.getcwd(), 'data')
data_set_dir = os.path.join(data_dir, 'ship_downloads')
downloaded_ships.to_csv(ship_path)

**Don't run the next cells right away!**

Read first and check if your order has been completed!

### Download the requested ships

Before we download the data we have to wait a while. Planet has to process our order, which sometimes takes a few minutes. You can check whether your data is ready by opening the download URL displayed while ordering; it shows the current status.
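If you would rather not check by hand, the waiting can be wrapped in a small polling loop. The sketch below is generic: `is_ready` stands for whatever check you use (it is a hypothetical callable, not part of porder), and the interval and number of checks are arbitrary defaults.

```python
import time


def wait_until_ready(is_ready, interval_s=60, max_checks=30, sleep=time.sleep):
    """Call is_ready() until it returns True; return the attempt number or None."""
    for attempt in range(1, max_checks + 1):
        if is_ready():
            return attempt
        if attempt < max_checks:
            # Pause between checks, but not after the last one
            sleep(interval_s)
    return None


# Demo with a fake check that succeeds on the third call and a no-op sleep
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


attempts = wait_until_ready(fake_check, sleep=lambda seconds: None)
print(attempts)
```

Passing the sleep function in as a parameter keeps the demo instant; in real use the default `time.sleep` applies.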

In [None]:
def download_ship(url, index, file_path):
    #filepath = 'downloaded_ships\\AIS_Filename'
    data_dir = os.path.join(os.getcwd(), 'data')
    save_path = os.path.join(data_dir, file_path)
    idx_save_path = os.path.join(save_path, str(index))
    
    # We create a directory for the downloaded files (including missing parent folders)
    os.makedirs(idx_save_path, exist_ok=True)
    
    # We construct the download command
    command_1 = "porder download --url \"" + str(url) + "\" "
    command_2 = "--local \"" + str(idx_save_path) + "\""
    command = command_1 + command_2 

    query = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True, stderr = subprocess.PIPE)
    result = query.stdout.read().decode("utf-8")
    print(result)
    #extract the correct basefilename
    try:
        filename = result[result.find('Downloading:') :].split(' ')[1].split('.')[0]
    except IndexError:
        try:
            # porder prints a different message when the file is already downloaded
            filename = result[result.find('SKIPPING:') :].split(' ')[1].split('.')[0]
        except IndexError:
            filename = 'Error'
    return filename

In [None]:
base_path = 'downloaded_ships'
base_path = os.path.join(base_path, file_extension)
downloaded_ships['base_name'] = None  # object column, so the file names (strings) fit in later
# iterate over all ships
for index, row in downloaded_ships.iterrows():
    print(index)
    url = row.loc['url']
    base_name = download_ship(url, index, base_path)
    downloaded_ships.loc[index, 'base_name'] = base_name
    
downloaded_ships

In [None]:
downloaded_ships.to_csv(ship_path)

### Creating non-ship databits

We create the non-ship data by iterating over the ship dataset: for every ship we pick a random point inside the geojson and request an envelope of the same size. We check that no ship is inside the selected region. If this check fails, we repeat until we have a successful candidate or give up after a fixed number of trials. For each order a dataframe entry is created and the corresponding ship is marked.
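The core of this step is rejection sampling: draw a random box near the ship and keep it only if it does not overlap any ship envelope. The following standard-library sketch shows just that part with plain `(minx, miny, maxx, maxy)` tuples. It leaves out the checks against the geojson and against Planet's availability that the notebook also performs, and all helper names and coordinates are made up.

```python
import random


def box_around(lon, lat, half_size):
    # Axis-aligned box as (minx, miny, maxx, maxy)
    return (lon - half_size, lat - half_size, lon + half_size, lat + half_size)


def boxes_overlap(a, b):
    # Two boxes overlap if they overlap on both axes
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_non_ship_box(ship, ship_boxes, search=0.06, half_size=0.02,
                        max_trials=60, rng=random):
    lon, lat = ship
    for _ in range(max_trials):
        # Random centre inside the search region around the ship
        candidate = box_around(rng.uniform(lon - search, lon + search),
                               rng.uniform(lat - search, lat + search),
                               half_size)
        if not any(boxes_overlap(candidate, b) for b in ship_boxes):
            return candidate
    # Give up after max_trials, like the skip in the notebook
    return None


rng = random.Random(0)
ship = (-80.1, 26.0)                      # made-up ship position
ship_boxes = [box_around(*ship, 0.02)]    # envelope of that ship
box = sample_non_ship_box(ship, ship_boxes, rng=rng)
print(box)
```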

In [None]:
downloaded_ships['non_ship'] = False

In [None]:
def download_non_ship(url, index, file_path):
    #filepath = 'downloaded_non_ships\\AIS_Filename'
    data_dir = os.path.join(os.getcwd(), 'data')
    save_path = os.path.join(data_dir, file_path)
    idx_save_path = os.path.join(save_path, str(index))
    
    # create a folder for the downloaded file (including missing parent folders)
    os.makedirs(idx_save_path, exist_ok=True)

    # construct the porder command
    command_1 = "porder download --url \"" + str(url) + "\" "
    command_2 = "--local \"" + str(idx_save_path) + "\""
    command = command_1 + command_2 

    query = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True, stderr = subprocess.PIPE)
    result = query.stdout.read().decode("utf-8")
    print(result)
    #extract the correct basefilename
    try:
        filename = result[result.find('Downloading:') :].split(' ')[1].split('.')[0]
    except IndexError:
        try:
            filename = result[result.find('SKIPPING:') :].split(' ')[1].split('.')[0]
        except IndexError:
            filename = 'Error'
    return filename

def generate_random(number, polygon):
    # we generate random points inside a polygon
    list_of_points = []
    minx, miny, maxx, maxy = polygon.bounds
    counter = 0
    while counter < number:
        pnt = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
        if polygon.contains(pnt):
            list_of_points.append(pnt)
            counter += 1
    return list_of_points


In [None]:
# Specify the geojson and the directories
geojson_name = 'miami_to_palm.geojson'
data_dir = os.path.join(os.getcwd(), 'data')
geojson_dir = os.path.join(data_dir, "geojson")#
path_to_geojson = os.path.join(geojson_dir, geojson_name)
gjson = gpd.read_file(path_to_geojson)
gjson = gjson.iloc[0,0]

# we create a non_ship_df
non_ship_df = pd.DataFrame(columns=['coordinate','envelope','ship_base_name', 'ship_idx','BaseDateTime'])

# Iterate over all filenames
for index, ship in downloaded_ships.iterrows():
    # if that ship already has a non_ship we don't need to create a new one
    if ship['non_ship']:
        continue
    name = '_'.join(ship['base_name'].split('\\')[-1].split('_')[:3])
    ships = downloaded_ships.loc[downloaded_ships['base_name'].str.contains(name)]
    base_name = ship['base_name']
    
    # we set a counter
    non_ship_not_found = True
    trials = 0

    # we create random points near the original ship, check if they contain ships and if they are downloadable from planet
    while non_ship_not_found:
        trials += 1
        try:
            ship_point = shapely.wkt.loads(ship['coordinate'])
        except AttributeError:
            ship_point = ship['coordinate']
        # We have a 0.06 degree region around the original ship
        buffer_ship = ship_point.buffer(0.06)
        big_ship_envelope = buffer_ship.envelope
        while True:
            # generate a random point and its buffer
            x = generate_random(1, big_ship_envelope )
            point = gpd.GeoDataFrame(geometry = x)
            buffer = point.buffer(0.02)
            # we check if the generated point is in the geojson
            if x[0].within(gjson):
                break
        non_ship_envelope = buffer.envelope
        non_ship_gdf = gpd.GeoDataFrame(geometry = non_ship_envelope)
        # extract ship envelopes
        try:
            ship_envelopes = [shapely.wkt.loads(x) for x in ships['envelope']]
        except AttributeError:
            ship_envelopes = [x for x in ships['envelope']]
        ship_gdf = gpd.GeoDataFrame(geometry = ship_envelopes)
        # get intersection - see if there's any ship inside our non ship data
        intersection = gpd.overlay(ship_gdf, non_ship_gdf, how='intersection')
        if len(intersection) < 1:
            # if no intersection
            # check if this image is available
            non_ship_envelope.to_file(os.path.join(geojson_dir,'nonship.geojson'), driver='GeoJSON')
            try:
                cost, file = create_id_list('nonship.geojson', datetime.datetime.fromisoformat(ships.iloc[0,:]['datetime']).date())
            except TypeError:
                cost, file = create_id_list('nonship.geojson', ships.iloc[0,:]['datetime'].date())
            # if it's available save it
            if cost > 0:
                # concatenate if it is
                this_non_ship_df = pd.DataFrame({'coordinate':x,'envelope':[non_ship_envelope[0]], 'ship_base_name':base_name})
                this_non_ship_df['ship_idx']= index
                this_non_ship_df['BaseDateTime']=ship['BaseDateTime']
                non_ship_df = pd.concat([non_ship_df, this_non_ship_df])
                # stop the loop
                trials = 0
                non_ship_not_found = False
                print('found this one ', base_name)
        if trials > 60:
            # After 60 trials we skip this ship and
            # restart the loop
            trials = 0
            non_ship_not_found = False
            print('skipped this one ', base_name)
print(non_ship_df)

### Ordering and downloading the non ships

This works just as it did for the ships.

In [None]:
downloaded_non_ships = pd.DataFrame()
data_dir = os.path.join(os.getcwd(), 'data')
geojson_dir = os.path.join(data_dir, 'geojson')
geojson = os.path.join(geojson_dir, 'test.geojson')

for index, ship in non_ship_df.iterrows():
    current_ship = ship.copy()
    gpd.GeoSeries([ship['envelope']]).to_file(geojson, driver='GeoJSON')
    date = datetime.datetime.fromisoformat(ship['BaseDateTime']).date()
    cost, filename = create_id_list(geojson, date)
    if cost == 0:
        continue
    else:
        times = get_unique_times(filename)
        if len(times) > 1:
            # this could be done nicer... like using the extra data
            continue

        time_dif = datetime.datetime.utcfromtimestamp(times[0].astype('O')/1e9) - datetime.datetime.fromisoformat(ship['BaseDateTime'])
        # We only order if the time is correct
        if abs(time_dif) < datetime.timedelta(minutes=1):       
            # if we catch a ship we request and download the asset.
            # we order the asset
            url = order_ship(ship, filename)
            current_ship['url'] = url
            # DataFrame.append was removed in pandas 2.0, so we concatenate instead
            downloaded_non_ships = pd.concat([downloaded_non_ships, current_ship.to_frame().T])
downloaded_non_ships

In [None]:
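# non_ship_path was never defined above; this file name is an assumption that mirrors ship_path
non_ship_path = os.path.join(data_dir, 'non_ships' + ais_file)
downloaded_non_ships.to_csv(non_ship_path)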

**Wait before running the next cells!**

### Downloading the non ships

Again just as for the ships. Take care to wait before running this!

In [None]:
downloaded_non_ships['base_name'] = None  # object column, so the file names (strings) fit in later
downloaded_non_ships.index = pd.RangeIndex(downloaded_non_ships.shape[0])
base_path = 'downloaded_non_ships'
base_path = os.path.join(base_path, file_extension)

for index, row in downloaded_non_ships.iterrows():
    url = row.loc['url']
    base_name = download_non_ship(url, index, base_path)
    downloaded_non_ships.loc[index, 'base_name'] = base_name
    
downloaded_non_ships

### Gathering missing non ships

We probably have slightly fewer non-ships than ships, which we don't want because the dataset should be balanced. So we mark the ships that already have a non-ship counterpart and restart the process from creating non-ships for the rest.

You will probably have to run the following cells a couple of times to get as many balanced samples as possible.

In [None]:
for index, row in downloaded_non_ships.iterrows():
    downloaded_ships.loc[row['ship_idx'], 'non_ship'] = True

In [None]:
# Specify the geojson and the directories
geojson_name = 'miami_to_palm.geojson'
data_dir = os.path.join(os.getcwd(), 'data')
geojson_dir = os.path.join(data_dir, "geojson")#
path_to_geojson = os.path.join(geojson_dir, geojson_name)
gjson = gpd.read_file(path_to_geojson)
gjson = gjson.iloc[0,0]
left_over_count = 0


next_non_ship_df = pd.DataFrame(columns=['coordinate','envelope','ship_base_name', 'ship_idx','BaseDateTime'])

# Iterate over all filenames
for index, ship in downloaded_ships.iterrows():
    if ship['non_ship']:
        # we only act if there is no corresponding non ship yet
        continue
    left_over_count += 1
    name = '_'.join(ship['base_name'].split('\\')[-1].split('_')[:3])
    ships = downloaded_ships.loc[downloaded_ships['base_name'].str.contains(name)]
    base_name = ship['base_name']
    non_ship_not_found = True
    trials = 0
    #print(ship)

    while non_ship_not_found:
        trials += 1
        base_name=ship['base_name']
        try:
            ship_point = shapely.wkt.loads(ship['coordinate'])
        except AttributeError:
            ship_point = ship['coordinate']
        buffer_ship = ship_point.buffer(0.06)
        big_ship_envelope = buffer_ship.envelope
        #print(big_ship_envelope)
        while True:
            # generate a random point and its buffer
            x = generate_random(1, big_ship_envelope )
            point = gpd.GeoDataFrame(geometry = x)
            buffer = point.buffer(0.02)
            # we check if the generated point is in the geojson
            if x[0].within(gjson):
                break
        non_ship_envelope = buffer.envelope
        non_ship_gdf = gpd.GeoDataFrame(geometry = non_ship_envelope)
        # extract ship envelopes
        try:
            ship_envelopes = [shapely.wkt.loads(x) for x in ships['envelope']]
        except AttributeError:
            ship_envelopes = [x for x in ships['envelope']]
        ship_gdf = gpd.GeoDataFrame(geometry = ship_envelopes)
        # get intersection
        intersection = gpd.overlay(ship_gdf, non_ship_gdf, how='intersection')
        if len(intersection) < 1:
            # check if this is available
            #print('no intersection')
            non_ship_envelope.to_file(os.path.join(geojson_dir,'nonship.geojson'), driver='GeoJSON')
            #cost = 0
            try:
                cost, file = create_id_list('nonship.geojson', datetime.datetime.fromisoformat(ships.iloc[0,:]['datetime']).date())
            except TypeError:
                cost, file = create_id_list('nonship.geojson', ships.iloc[0,:]['datetime'].date())
            #print(cost)
            if cost > 0:
                # concatenate if it is
                this_non_ship_df = pd.DataFrame({'coordinate':x,'envelope':[non_ship_envelope[0]], 'ship_base_name':base_name})
                this_non_ship_df['ship_idx']= index
                this_non_ship_df['BaseDateTime']=ship['BaseDateTime']
                next_non_ship_df = pd.concat([next_non_ship_df, this_non_ship_df])
                # restart the loop
                trials = 0
                non_ship_not_found = False
                print('found this one ')
        #else:
            #print('point in intersection')
        if trials > 100:
            # After 100 trials we skip this ship
            # and move on to the next one
            trials = 0
            non_ship_not_found = False
            print('skipped this one ')
print(next_non_ship_df)

print(left_over_count)
print(next_non_ship_df.shape)

In [None]:
next_downloaded_non_ships = pd.DataFrame()
data_dir = os.path.join(os.getcwd(), 'data')
geojson_dir = os.path.join(data_dir, 'geojson')
geojson = os.path.join(geojson_dir, 'test.geojson')

for index, ship in next_non_ship_df.iterrows():
    current_ship = ship.copy()
    gpd.GeoSeries([ship['envelope']]).to_file(geojson, driver='GeoJSON')
    date = datetime.datetime.fromisoformat(ship['BaseDateTime']).date()
    cost, filename = create_id_list(geojson, date)
    if cost == 0:
        continue
    else:
        times = get_unique_times(filename)
        if len(times) > 1:
            # this could be done nicer... like using the extra data
            continue

        time_dif = datetime.datetime.utcfromtimestamp(times[0].astype('O')/1e9) - datetime.datetime.fromisoformat(ship['BaseDateTime'])
        # We only order if the time is correct
        if abs(time_dif) < datetime.timedelta(minutes=1):       
            # if we catch a ship we request and download the asset.
            # we order the asset
            url = order_ship(ship, filename)
            current_ship['url'] = url
            # DataFrame.append was removed in pandas 2.0, so we concatenate instead
            next_downloaded_non_ships = pd.concat([next_downloaded_non_ships, current_ship.to_frame().T])
next_downloaded_non_ships

**Again! Wait! Check if the order is ready**

In [None]:
next_downloaded_non_ships['base_name'] = None  # object column, so the file names (strings) fit in later
all_non_ships = downloaded_non_ships.shape[0] + next_downloaded_non_ships.shape[0]
next_downloaded_non_ships.index = pd.RangeIndex(start=downloaded_non_ships.shape[0], stop=all_non_ships)
base_path = 'downloaded_non_ships'
base_path = os.path.join(base_path, file_extension)


for index, row in next_downloaded_non_ships.iterrows():
    url = row.loc['url']
    base_name = download_non_ship(url, index, base_path)
    next_downloaded_non_ships.loc[index, 'base_name'] = base_name
    downloaded_ships.loc[row['ship_idx'], 'non_ship'] = True
    
next_downloaded_non_ships

In [None]:
downloaded_non_ships = pd.concat([downloaded_non_ships, next_downloaded_non_ships])
downloaded_non_ships.to_csv(non_ship_path)

### Getting the oversampled non-ships

Most of the non-ships can be created in the way shown above, but for some ships no valid non-ship region can be found. To fill our dataset despite those, we oversample non-ships around some other ships to get an even dataset.
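How many extra non-ships to request follows directly from the gap between the two classes. A minimal sketch, assuming you have the list of ship indices and the current number of non-ships at hand (the helper name and the numbers are made up):

```python
import random


def pick_ships_to_resample(ship_ids, n_non_ships, rng=random):
    # Number of non-ships still missing for a balanced dataset
    deficit = len(ship_ids) - n_non_ships
    if deficit <= 0:
        return []
    # Sample without replacement, so no ship is picked twice in one round
    return rng.sample(list(ship_ids), deficit)


ship_ids = [0, 1, 2, 3, 4, 5, 6, 7]   # made-up ship indices
picked = pick_ships_to_resample(ship_ids, n_non_ships=5, rng=random.Random(1))
print(len(picked), picked)
```

The length of the returned list can serve as the `num_sample` value instead of a hand-picked number.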

In [None]:
# Specify the geojson and the directories
geojson_name = 'miami_to_palm.geojson'
data_dir = os.path.join(os.getcwd(), 'data')
geojson_dir = os.path.join(data_dir, "geojson")#
path_to_geojson = os.path.join(geojson_dir, geojson_name)
gjson = gpd.read_file(path_to_geojson)
gjson = gjson.iloc[0,0]
left_over_count = 0
# Add how many you would like to sample
num_sample = 6


next_non_ship_df = pd.DataFrame(columns=['coordinate','envelope','ship_base_name', 'ship_idx','BaseDateTime'])

# Iterate over all filenames
for index, ship in downloaded_ships.sample(num_sample, axis=0).iterrows():
    #if ship['non_ship']:
    #    continue
    left_over_count += 1
    name = '_'.join(ship['base_name'].split('\\')[-1].split('_')[:3])
    ships = downloaded_ships.loc[downloaded_ships['base_name'].str.contains(name)]
    base_name = ship['base_name']
    non_ship_not_found = True
    trials = 0
    #print(ship)

    while non_ship_not_found:
        trials += 1
        base_name=ship['base_name']
        try:
            ship_point = shapely.wkt.loads(ship['coordinate'])
        except AttributeError:
            ship_point = ship['coordinate']
        buffer_ship = ship_point.buffer(0.06)
        big_ship_envelope = buffer_ship.envelope
        #print(big_ship_envelope)
        while True:
            # generate a random point and its buffer
            x = generate_random(1, big_ship_envelope )
            point = gpd.GeoDataFrame(geometry = x)
            buffer = point.buffer(0.02)
            # we check if the generated point is in the geojson
            if x[0].within(gjson):
                break
        non_ship_envelope = buffer.envelope
        non_ship_gdf = gpd.GeoDataFrame(geometry = non_ship_envelope)
        # extract ship envelopes
        try:
            ship_envelopes = [shapely.wkt.loads(x) for x in ships['envelope']]
        except AttributeError:
            ship_envelopes = [x for x in ships['envelope']]
        ship_gdf = gpd.GeoDataFrame(geometry = ship_envelopes)
        # get intersection
        intersection = gpd.overlay(ship_gdf, non_ship_gdf, how='intersection')
        if len(intersection) < 1:
            # check if this is available
            #print('no intersection')
            non_ship_envelope.to_file(os.path.join(geojson_dir,'nonship.geojson'), driver='GeoJSON')
            #cost = 0
            try:
                cost, file = create_id_list('nonship.geojson', datetime.datetime.fromisoformat(ships.iloc[0,:]['datetime']).date())
            except TypeError:
                cost, file = create_id_list('nonship.geojson', ships.iloc[0,:]['datetime'].date())
            #print(cost)
            if cost > 0:
                # concatenate if it is
                this_non_ship_df = pd.DataFrame({'coordinate':x,'envelope':[non_ship_envelope[0]], 'ship_base_name':base_name})
                this_non_ship_df['ship_idx']= index
                this_non_ship_df['BaseDateTime']=ship['BaseDateTime']
                next_non_ship_df = pd.concat([next_non_ship_df, this_non_ship_df])
                # restart the loop
                trials = 0
                non_ship_not_found = False
                print('found this one ')
        #else:
            #print('point in intersection')
        if trials > 100:
            # After 100 trials we skip this ship
            # and move on to the next one
            trials = 0
            non_ship_not_found = False
            print('skipped this one ')
print(next_non_ship_df)

print(left_over_count)
print(next_non_ship_df.shape)

You have to order and download those non-ships using the code from above. :)

Now we have all the data available for the whole month. We could clean up the file system and combine some things...
But it's the backbone of a decent satellite imagery ship dataset :)

## Known issues:
- Some wrongly labeled non-ships are created and downloaded. They were found by visual inspection and were pretty obvious.
- We do have some cloud-covered data points.
- In some ship samples we cannot see the ship. This can be due to the ship's size, wrong AIS data, or the ship having moved too far within the one-minute window.