## Code Setup

In [19]:
import os
import pandas as pd
import scgt
import sys
sys.path.append(os.path.join(os.getcwd(), "birdmaps"))

In [20]:
# If True, assumes everything is running locally.
IS_LOCAL = True
# We should not do the run locally, except in rare cases in testing.
DO_RUN = False

# Path to main directory
LOCAL_PATH = os.path.join(os.getcwd(), "data/CA-Final")
DATA_PATH = LOCAL_PATH

In [21]:
import ebird_db, bird_runs
from scgt import GeoTiff

## Bird Run Definition

In [22]:
bird_run = bird_runs.BirdRun(DATA_PATH)

NUM_SIMULATIONS = 400
HOPS = [1, 2, 3, 4, 5, 6]
SPREADS = [2, 3, 4, 6, 8, 10, 15, 20, 30, 40]
RUN_NAME = "Paper"


First, we define the runs that use 400 simulations each: one run per species for every combination of hop distance and number of spreads.

In [23]:

birds = []

for hop_distance in HOPS:
    for num_spreads in SPREADS:
        birds.append(bird_run.get_bird_run(
            "acowoo", "Acorn Woodpecker",
            do_validation=True, run_name=RUN_NAME,
            hop_distance=hop_distance, num_spreads=num_spreads,
            num_simulations=NUM_SIMULATIONS))

for hop_distance in HOPS:
    for num_spreads in SPREADS:
        birds.append(bird_run.get_bird_run(
            "stejay", "Steller's Jay",
            do_validation=True, run_name=RUN_NAME,
            hop_distance=hop_distance, num_spreads=num_spreads,
            num_simulations=NUM_SIMULATIONS))
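
The two nested loops above enumerate the full `HOPS` × `SPREADS` grid once per species. As a quick sanity check on how many runs that defines, the same grid can be sketched with `itertools.product`. This stand-alone snippet only counts parameter combinations; it does not call `bird_runs`, and the `SPECIES` list is a name introduced here for illustration.

```python
from itertools import product

HOPS = [1, 2, 3, 4, 5, 6]
SPREADS = [2, 3, 4, 6, 8, 10, 15, 20, 30, 40]
SPECIES = ["acowoo", "stejay"]  # illustrative list, not defined in the notebook

# Same iteration order as the nested loops: species, then hop distance, then spreads.
grid = [(nickname, hop, spread)
        for nickname in SPECIES
        for hop, spread in product(HOPS, SPREADS)]

print(len(grid))          # number of (species, hop_distance, num_spreads) runs
print(grid[0], grid[-1])  # first and last configuration
```

With 6 hop distances and 10 spread counts, each species contributes 60 runs, so `birds` should hold 120 entries.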


Then we define the runs that use 10000 simulations each, for a few selected parameter settings.

In [24]:
birds10000 = []

birds10000.append(bird_run.get_bird_run(
    "acowoo", "Acorn Woodpecker",
    do_validation=True, run_name="Paper10000",
    hop_distance=2, num_spreads=20,
    num_simulations=10000))

birds10000.append(bird_run.get_bird_run(
    "stejay", "Steller's Jay",
    do_validation=True, run_name="Paper10000",
    hop_distance=2, num_spreads=3,
    num_simulations=10000))

birds10000.append(bird_run.get_bird_run(
    "stejay", "Steller's Jay",
    do_validation=True, run_name="Paper10000",
    hop_distance=1, num_spreads=6,
    num_simulations=10000))


For each bird, we compute a Pandas dataframe with data for each square where checklists
have been recorded. To do so, we read the CSV produced by `GenerateValidationData.ipynb` into
a Pandas dataframe and, for each row of the dataframe, add information about the
amounts of habitat and repopulation.

This is a time-consuming operation, as we need to access the repopulation
file for each of the squares. You only need to run it once per bird run;
after that, you can analyze the resulting data as often as you like.
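
One simple way to follow the run-once advice is to skip any run whose output CSV already exists. The sketch below is an assumption about how that could be done, not something the notebook does: `pending_outputs` is a hypothetical helper, and the temporary files stand in for paths such as `bird.obs_csv_path`.

```python
import os
import tempfile

def pending_outputs(csv_paths):
    """Return only the output paths that do not exist on disk yet."""
    return [path for path in csv_paths if not os.path.exists(path)]

with tempfile.TemporaryDirectory() as tmp:
    done = os.path.join(tmp, "done.csv")
    todo = os.path.join(tmp, "todo.csv")
    open(done, "w").close()  # pretend this run was already validated
    remaining = pending_outputs([done, todo])
    print([os.path.basename(p) for p in remaining])
```

Only the path without an existing file is reported as still pending.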

In [25]:
max_distance = 2
date_range = ("2012-01-01", "2018-12-31")
num_sample_squares = 20000 # Sampling number for the squares.

def compute_validation(bird_list):
    for bird in bird_list:
        if bird.do_validation:
            repop = GeoTiff.from_file(bird.repopulation_fn)
            hab = GeoTiff.from_file(bird.habitat_fn)
            obs_fn = bird_run.get_observations_all_fn(
                bird.obs_path, max_distance=max_distance,
                date_range="-".join(date_range),
                num_squares=num_sample_squares)
            # Here it's better to cache for the habitat_fn, as there is only one per species.
            validation = ebird_db.Validation(obs_fn, bird.habitat_fn)
    
            # Augments the dataframe with the values for each square of repopulation and habitat.
            df = validation.get_repop_ratios(repop, hab, tile_scale=3)
            # Computes a repopulation range.
            df["RepopRange"] = df.apply(lambda row : int(row["avg_repop"] * 10) / 10, axis=1)
            # Computes birds and sightings per checklist.
            df["ObsRatio"] = df["NumBirdChecklists"] / df["NumChecklists"]
            df["BirdRatio"] = df["NumBirds"] / df["NumChecklists"]
    
            # Writes the resulting dataset.
            df.to_csv(bird.obs_csv_path)
            print("Done with", bird.nickname, bird.hop_distance, bird.num_spreads)
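
To make the derived columns concrete, here is a minimal sketch on a toy dataframe. The column names match those used in `compute_validation`; the numbers are made up for illustration and are not results from any run.

```python
import pandas as pd

# Toy per-square data (invented values).
df = pd.DataFrame({
    "avg_repop": [0.05, 0.37, 0.92],
    "NumChecklists": [4, 10, 5],
    "NumBirdChecklists": [0, 3, 5],
    "NumBirds": [0, 7, 12],
})

# Truncate avg_repop to one decimal, so 0.37 falls in the bin labeled 0.3.
df["RepopRange"] = df["avg_repop"].apply(lambda x: int(x * 10) / 10)
# Fraction of checklists that report the species.
df["ObsRatio"] = df["NumBirdChecklists"] / df["NumChecklists"]
# Average number of birds of the species per checklist.
df["BirdRatio"] = df["NumBirds"] / df["NumChecklists"]

print(df[["RepopRange", "ObsRatio", "BirdRatio"]])
```

`RepopRange` therefore groups squares into bins of width 0.1, while `ObsRatio` and `BirdRatio` normalize the sightings by observer effort in each square.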

In [26]:
compute_validation(birds)

Done with acowoo 1 2
Done with acowoo 1 3
Done with acowoo 1 4
Done with acowoo 1 6
Done with acowoo 1 8
Done with acowoo 1 10
Done with acowoo 1 15
Done with acowoo 1 20
Done with acowoo 1 30
Done with acowoo 1 40
Done with acowoo 2 2
Done with acowoo 2 3
Done with acowoo 2 4
Done with acowoo 2 6
Done with acowoo 2 8
Done with acowoo 2 10
Done with acowoo 2 15
Done with acowoo 2 20
Done with acowoo 2 30
Done with acowoo 2 40
Done with acowoo 3 2
Done with acowoo 3 3
Done with acowoo 3 4
Done with acowoo 3 6
Done with acowoo 3 8
Done with acowoo 3 10
Done with acowoo 3 15
Done with acowoo 3 20
Done with acowoo 3 30
Done with acowoo 3 40
Done with acowoo 4 2
Done with acowoo 4 3
Done with acowoo 4 4
Done with acowoo 4 6
Done with acowoo 4 8
Done with acowoo 4 10
Done with acowoo 4 15
Done with acowoo 4 20
Done with acowoo 4 30
Done with acowoo 4 40
Done with acowoo 5 2
Done with acowoo 5 3
Done with acowoo 5 4
Done with acowoo 5 6
Done with acowoo 5 8
Done with acowoo 5 10
Done with aco

In [27]:
compute_validation(birds10000)

Done with acowoo 2 20
Done with stejay 2 3
Done with stejay 1 6


The dataframe contains, for each square:
* Number of checklists
* Number of checklists containing the bird
* Total number of birds (of the given species) seen
* Average habitat around the square (counting 1 as habitat and 0 as non-habitat)
* Max habitat around the square
* Average repopulation around the square (counting repopulation as 0 outside of habitat)
* Max repopulation around the square
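
The "around the square" averages and maxima in the list above can be pictured as window statistics on a raster. The sketch below uses a plain NumPy array as a stand-in for the repopulation layer. `window_stats` and its `half_width` parameter are hypothetical names, and the actual computation inside `get_repop_ratios` may differ.

```python
import numpy as np

def window_stats(raster, row, col, half_width=1):
    """Mean and max of the cells in a square window centered on (row, col).

    The window is clipped at the raster edges.
    """
    r0 = max(row - half_width, 0)
    r1 = min(row + half_width + 1, raster.shape[0])
    c0 = max(col - half_width, 0)
    c1 = min(col + half_width + 1, raster.shape[1])
    window = raster[r0:r1, c0:c1]
    return float(window.mean()), float(window.max())

# Toy repopulation raster: zeros are cells outside of habitat.
repop = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

avg, mx = window_stats(repop, row=1, col=1)
print(avg, mx)
```

Cells outside of habitat pull the average down but leave the maximum untouched.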

We look for the correlation between:
* Average number of birds seen per checklist (the BirdRatio),
* Max repopulation (MaxRepopRange).
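
As a rough sketch of this comparison, the snippet below bins a max-repopulation column and relates it to `BirdRatio` on toy data. The column name `max_repop` is an assumption, and `MaxRepopRange` is built here by analogy with the `RepopRange` column computed earlier from `avg_repop`; the values are invented and are not results.

```python
import pandas as pd

# Toy per-square data (invented values); real data would come from the validation CSV.
df = pd.DataFrame({
    "max_repop": [0.05, 0.12, 0.33, 0.38, 0.71, 0.77, 0.94],  # assumed column name
    "NumBirds": [0, 1, 2, 4, 6, 9, 12],
    "NumChecklists": [5, 10, 4, 8, 6, 9, 8],
})
df["BirdRatio"] = df["NumBirds"] / df["NumChecklists"]
# Bin max repopulation into ranges of width 0.1 (assumed definition).
df["MaxRepopRange"] = df["max_repop"].apply(lambda x: int(x * 10) / 10)

# Mean BirdRatio within each repopulation bin.
per_bin = df.groupby("MaxRepopRange")["BirdRatio"].mean()
# Pearson correlation between the unbinned quantities.
r = df["BirdRatio"].corr(df["max_repop"])

print(per_bin)
print("Pearson r:", round(r, 3))
```

Looking at the per-bin means alongside a single correlation coefficient helps show whether the relationship is roughly monotonic or driven by a few bins.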

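Why max repopulation and not average? Because average repopulation mixes two concerns: (a) how much habitat there is around the square, and (b) how high the repopulation is within that habitat. A square with little habitat but high repopulation can have the same average as a square with abundant habitat but low repopulation, which confounds the signal. It is much cleaner to look at the correlation between BirdRatio and MaxRepopRange.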
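
A small numeric illustration of this confound, using two invented 3×3 neighborhoods (the tiles and their values are assumptions made up for this example):

```python
import numpy as np

# Habitat everywhere, but low repopulation in every cell.
uniform_low = np.full((3, 3), 0.3)

# Only one third of the cells are habitat, but repopulation there is high.
patchy_high = np.zeros((3, 3))
patchy_high[0, :] = 0.9

for name, tile in [("uniform_low", uniform_low), ("patchy_high", patchy_high)]:
    print(name, "avg:", round(float(tile.mean()), 3), "max:", float(tile.max()))
```

Both tiles average about 0.3, so the average cannot tell them apart, while the maxima (0.3 versus 0.9) separate the two situations.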