## iNatAtor Data Extractor

(also provided as .py)

This notebook walks you through gathering data for fine-tuning. If you run the database locally, start the database instance first. If you connect to the Azure instance of the database, specify the connection parameters in `.env`, as explained below.

In [8]:
import pandas as pd
import sqlalchemy
from dotenv import load_dotenv
import os
import h3
import shapely
import pyproj
import numpy as np
import json

import datetime

load_dotenv()

True

### Parameters

You need to supply a `.env` file that contains the necessary database secrets, or hardcode them here.
`max_amount` determines how many points are sampled from each hexagon.
The number of points is `sampling_amount = max(max_amount - hex_resolution, 0) + 1`, so finer (higher-resolution) hexagons receive fewer points and every hexagon receives at least one.
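As a worked example of the formula, the sketch below tabulates the sample count per H3 resolution for `max_amount = 7`. The helper name `sampling_amount` is only illustrative and does not appear in the notebook.

```python
def sampling_amount(max_amount: int, hex_resolution: int) -> int:
    """Number of points to draw from one hexagon (illustrative helper)."""
    return max(max_amount - hex_resolution, 0) + 1

# H3 resolutions run from 0 (coarsest) to 15 (finest).
counts = {res: sampling_amount(7, res) for res in range(0, 16)}
for res, n in counts.items():
    print(f"resolution {res:2d}: {n} point(s)")
```

With `max_amount = 7`, a resolution-4 hexagon yields 4 points, and any hexagon at resolution 7 or finer yields exactly 1.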

`sampling_mode` selects the sampling strategy. `polygon` draws random points from the hexagon's bounding box and keeps only those that fall inside the hexagon. `circle` takes the distance from the hexagon's center to its nearest corner as a radius and samples points inside that circle, which eliminates the repeated point-in-polygon checks. `circle` is slightly faster than `polygon`, but some of its points can fall just outside the hexagon.
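To get a feel for how large the `circle` error can be, here is a rough planar sketch, not the notebook's geodesic code. It assumes an idealized regular hexagon and a circle through its corners, samples points in the circle with the same `sqrt` trick used by `generate_random_points_in_circle`, and counts how many land inside the hexagon. All function names are illustrative.

```python
import math
import random

def sample_in_circle(R, rng):
    """Uniform point in a disc of radius R centered at the origin."""
    r = R * math.sqrt(rng.random())      # sqrt keeps the density uniform over the area
    theta = rng.uniform(0, 2 * math.pi)  # random angle in radians
    return r * math.cos(theta), r * math.sin(theta)

def in_regular_hexagon(x, y, R):
    """True if (x, y) lies inside a regular hexagon with circumradius R
    and corners at angles 0, 60, 120, ... degrees."""
    x, y = abs(x), abs(y)
    return y <= math.sqrt(3) / 2 * R and math.sqrt(3) * x + y <= math.sqrt(3) * R

rng = random.Random(0)
R = 1.0
n = 20000
inside = sum(in_regular_hexagon(*sample_in_circle(R, rng), R) for _ in range(n))
fraction_inside = inside / n
print(f"fraction of circle samples inside the hexagon: {fraction_inside:.3f}")
```

In this idealized setting the expected fraction is the area ratio (3√3/2)/π ≈ 0.83, so roughly one in six circle samples can land outside the hexagon. Real H3 cells are not perfectly regular and the notebook measures geodesic distances, so the actual share will differ somewhat.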

In [47]:
params = {
    'db': os.getenv('POSTGRES_DB'),
    'user': os.getenv('POSTGRES_USER'),
    'password': os.getenv('POSTGRES_PASSWORD'),
    'url': os.getenv('DATABASE_URL'),
    'sampling_mode': 'circle', # polygon | circle
    'max_amount': 7
}

params

{'db': 'inat',
 'user': 'postgres',
 'password': 'inat4cg',
 'url': 'postgresql+psycopg2://postgres:inat4cg@localhost:5433/inat',
 'sampling_mode': 'circle',
 'max_amount': 7}
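Because `os.getenv` returns `None` for a missing variable, a typo in `.env` only surfaces later as a connection error. A minimal sketch of an early check follows; `missing_env_vars` is a hypothetical helper, not part of the notebook.

```python
import os

# Variable names taken from the parameter cell above.
REQUIRED_ENV_VARS = ("POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD", "DATABASE_URL")

def missing_env_vars(environ=os.environ):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```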

You can try any query you want. However, to use the result with the fine-tuner, the query must return `taxa_id`, `hex_index`, and `hex_type`. These three fields are what the model uses for fine-tuning.

Some example queries are:

`SELECT an."taxa_id", ah."hex_index", ah."hex_type" FROM "annotation" AS an INNER JOIN "annotation_hexagon" AS ah ON an."annotation_id"=ah."annotation_id" WHERE an."taxa_id" = 12345`
- This query gets annotations for a single taxon.
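If you want several taxa at once, one option is to build the `WHERE` clause from a list of IDs. The sketch below is only illustrative: `query_for_taxa` is a hypothetical helper, the column names follow the `QUERY` cell in this notebook, and the IDs are coerced to `int` so that nothing other than numbers reaches the SQL string. For untrusted input, bound parameters would be the safer route.

```python
BASE_QUERY = (
    'SELECT an."taxa_id", ah."hex_index", ah."hex_type" '
    'FROM "annotation" AS an '
    'INNER JOIN "annotation_hexagon" AS ah '
    'ON an."annotation_id"=ah."annotation_id"'
)

def query_for_taxa(taxa_ids):
    """Append a WHERE clause restricting the base query to the given taxa."""
    ids = [int(t) for t in taxa_ids]  # int() rejects anything that is not a number
    if not ids:
        return BASE_QUERY
    id_list = ", ".join(str(t) for t in ids)
    return f'{BASE_QUERY} WHERE an."taxa_id" IN ({id_list})'

print(query_for_taxa([5165, 5174]))
```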

In [48]:
QUERY = 'SELECT an."taxa_id", ah."hex_index", ah."hex_type" FROM "annotation" AS an INNER JOIN "annotation_hexagon" AS ah ON an."annotation_id"=ah."annotation_id"'

This block connects to the database, runs the query, and loads the result into a DataFrame.

In [49]:
engine = sqlalchemy.engine.create_engine(url=params['url'])
df = pd.read_sql(QUERY, engine)
df.head()

Unnamed: 0,taxa_id,hex_index,hex_type
0,5165,84962e7ffffffff,presence
1,5165,84975d7ffffffff,presence
2,5165,8497517ffffffff,presence
3,5165,84962cbffffffff,presence
4,5165,8497537ffffffff,presence


In [50]:
len(df)

2404

This block defines helper functions for sampling points from a hexagon.

In [51]:
def calculate_geo_distance(loc1, loc2):
    lat1, lng1 = loc1
    lat2, lng2 = loc2

    geod = pyproj.Geod(ellps="WGS84")
    _, _, distance = geod.inv(lons1=lng1, lats1=lat1, lons2=lng2, lats2=lat2)
    return distance


def generate_random_points_in_polygon(boundary, N):
    # boundary holds (lat, lng) pairs, so shapely's x axis is latitude and y is longitude
    polygon = shapely.Polygon(boundary)
    min_lat, min_lng, max_lat, max_lng = polygon.bounds

    random_points = []
    while len(random_points) < N:
        # draw a candidate from the bounding box and keep it only if it is inside the hexagon
        lat = np.random.uniform(min_lat, max_lat)
        lng = np.random.uniform(min_lng, max_lng)

        point = shapely.Point(lat, lng)
        if polygon.contains(point):
            random_points.append((lat, lng))

    return random_points

def generate_random_points_in_circle(lat, lng, R, N):
    random_points = []
    while len(random_points) < N:
        r = R * np.sqrt(np.random.uniform(0, 1)) # random distance from center; sqrt keeps points uniform over the area
        theta = np.random.uniform(0, 2 * np.pi) # random angle in radians

        # One degree of latitude is roughly 111320 m; one degree of longitude
        # shrinks by cos(latitude) toward the poles.
        x = lng + r * np.cos(theta) / (111320 * np.cos(lat * np.pi / 180)) # east-west offset converted from meters to degrees of longitude
        y = lat + r * np.sin(theta) / 111320 # north-south offset converted from meters to degrees of latitude
        
        random_points.append((y, x))
    
    return random_points
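The meter-to-degree conversion inside `generate_random_points_in_circle` can be looked at in isolation. The sketch below uses the same constant of roughly 111320 m per degree and shows how the same eastward offset maps to more degrees of longitude at higher latitudes. `offset_to_degrees` is an illustrative name, and the conversion is a local flat-earth approximation that degrades near the poles.

```python
import math

METERS_PER_DEGREE = 111320  # approximate length of one degree of latitude

def offset_to_degrees(lat, east_m, north_m):
    """Convert a local offset in meters to (delta_lat, delta_lng) in degrees."""
    delta_lat = north_m / METERS_PER_DEGREE
    # A degree of longitude spans fewer meters away from the equator.
    delta_lng = east_m / (METERS_PER_DEGREE * math.cos(math.radians(lat)))
    return delta_lat, delta_lng

for lat in (0, 30, 60):
    d_lat, d_lng = offset_to_degrees(lat, east_m=10000, north_m=10000)
    print(f"lat {lat:2d}: 10 km north = {d_lat:.4f} deg, 10 km east = {d_lng:.4f} deg")
```

At latitude 60 the cosine is 0.5, so the same eastward distance covers twice as many degrees of longitude as at the equator.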

In [52]:
hex_resolution = [h3.h3_get_resolution(hex_id) for hex_id in df['hex_index']]
df['hex_resolution'] = hex_resolution

hex_boundary = [h3.h3_to_geo_boundary(hex_id, geo_json=False) for hex_id in df['hex_index']]
df['hex_boundary'] = hex_boundary

if params['sampling_mode'] == 'polygon':
    pass
else:
    center_point = [h3.h3_to_geo(hex_id) for hex_id in df['hex_index']]
    df['center_point'] = center_point

    radius = [min([calculate_geo_distance(r['center_point'], loc) for loc in r['hex_boundary']]) for _, r in df.iterrows()]
    df['R'] = radius

In [53]:
df.head()

Unnamed: 0,taxa_id,hex_index,hex_type,hex_resolution,hex_boundary,center_point,R
0,5165,84962e7ffffffff,presence,4,"((-14.973196837524473, 27.789938965859054), (-...","(-14.769088700222758, 27.62719511680403)",28519.239849
1,5165,84975d7ffffffff,presence,4,"((-17.736216364299104, 23.30326368482732), (-1...","(-17.5302380946822, 23.143965361646774)",28170.591584
2,5165,8497517ffffffff,presence,4,"((-17.913493465296703, 25.757271715344977), (-...","(-17.708921620453513, 25.594704115579415)",28293.89978
3,5165,84962cbffffffff,presence,4,"((-16.669469802640553, 26.809174339152214), (-...","(-16.46511955136319, 26.646279688575284)",28420.970601
4,5165,8497537ffffffff,presence,4,"((-16.938454389750355, 26.43976487571977), (-1...","(-16.73395359665564, 26.277087683934333)",28392.018654


In [54]:
df.tail()

Unnamed: 0,taxa_id,hex_index,hex_type,hex_resolution,hex_boundary,center_point,R
2399,5174,84bc447ffffffff,absence,4,"((-34.18267871344204, 23.829406513185848), (-3...","(-33.99821041999803, 23.656778590222167)",24856.871403
2400,5174,84bc445ffffffff,absence,4,"((-34.34708962764286, 24.289870021150534), (-3...","(-34.16352079371115, 24.116443398782966)",24847.296287
2401,5174,84bc407ffffffff,absence,4,"((-33.23663321785284, 23.255461102402393), (-3...","(-33.04949250162404, 23.084506231542456)",25093.853527
2402,5174,84bc401ffffffff,absence,4,"((-33.6274772906967, 23.31276536178162), (-33....","(-33.44122544070739, 23.141383249306227)",24980.174438
2403,5174,84bc409ffffffff,absence,4,"((-34.0163963486983, 23.37030977336294), (-33....","(-33.831046627187575, 23.198498678885336)",24865.336236


In [44]:
start_time = datetime.datetime.now()

psuedo_points = []
for i, r in df.iterrows():
    random_n_points = None
    N = max(params['max_amount'] - r['hex_resolution'], 0) + 1
    if params['sampling_mode'] == 'polygon':
        random_n_points = generate_random_points_in_polygon(r['hex_boundary'], N)
    else:
        lat, lng = r['center_point']
        random_n_points = generate_random_points_in_circle(lat, lng, r['R'], N)

    for random_lat, random_lng in random_n_points:
        psuedo_point = {
            'taxon_id': r['taxa_id'],
            'hex_type': 1 if r['hex_type'] == 'presence' else 0,
            'latitude': random_lat,
            'longitude': random_lng
        }

        psuedo_points.append(psuedo_point)

df_psuedo_points = pd.DataFrame(psuedo_points)

end_time = datetime.datetime.now()
print('Executed in: ', (end_time - start_time))

Executed in:  0:00:00.472690


In [45]:
df_psuedo_points.head(), len(df_psuedo_points)

(   taxon_id  hex_type   latitude  longitude
 0      5165         1 -14.636833  27.610717
 1      5165         1 -14.618621  27.511168
 2      5165         1 -14.790290  27.693306
 3      5165         1 -14.689536  27.622481
 4      5165         1 -17.750372  23.167425,
 9616)
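The row count can be checked against the sampling formula. The sketch below assumes every one of the 2404 hexagons is at resolution 4, as in the rows displayed earlier; `expected_point_count` is an illustrative helper.

```python
def expected_point_count(resolutions, max_amount):
    """Total number of points the sampling loop should produce."""
    return sum(max(max_amount - res, 0) + 1 for res in resolutions)

# Assumption: all 2404 hexagons are at resolution 4, as in the rows shown above.
resolutions = [4] * 2404
total = expected_point_count(resolutions, max_amount=7)
print(total)
```

Under that assumption the total is 2404 × 4 = 9616, which matches the length reported above. In the notebook itself, the same check could be run on `df['hex_resolution']` and `params['max_amount']`.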

The final cell saves your annotations to a `.csv` file named with a timestamp.

You can use the extracted annotation data for fine-tuning. See `fine_tune_main.py` for instructions on how to set the parameters and start fine-tuning a geomodel.

In [46]:
with open("paths.json", 'r') as f:
    paths = json.load(f)

date_now = datetime.datetime.now()
df_psuedo_points.to_csv(os.path.join(paths['annotation'], str(date_now)+'.csv'))
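`str(date_now)` yields a name containing a space and colons, and colons are not allowed in Windows file names. If portability matters, a formatted timestamp is one alternative. The sketch below is a suggestion rather than the notebook's behavior; `timestamped_csv_path` is a hypothetical helper and `"annotations"` is a placeholder directory.

```python
import datetime
import os

def timestamped_csv_path(directory, now=None):
    """Build '<directory>/<YYYYmmdd_HHMMSS>.csv' (hypothetical helper)."""
    now = now or datetime.datetime.now()
    return os.path.join(directory, now.strftime("%Y%m%d_%H%M%S") + ".csv")

example = timestamped_csv_path("annotations", datetime.datetime(2024, 1, 31, 14, 5, 9))
print(example)
```

In the notebook, the result of `timestamped_csv_path(paths['annotation'])` could be passed to `to_csv` in place of the current path expression.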