## Code Setup

In [None]:
import os
import pandas as pd
import scgt
import sys
sys.path.append(os.path.join(os.getcwd(), "birdmaps"))

In [None]:
# If True, assumes everything is running locally.
IS_LOCAL = True
# We should not do the run locally, except in rare cases in testing.
DO_RUN = False

# Path to main directory
LOCAL_PATH = os.path.join(os.getcwd(), "data/CA-EcoScape-Paper")
DATA_PATH = LOCAL_PATH

In [None]:
import ebird_db, bird_runs
from scgt import GeoTiff

## Bird Run Definition

In [None]:
bird_run = bird_runs.BirdRun(DATA_PATH)

NUM_SIMULATIONS = 400
HOPS = [1, 2, 3, 4, 5, 6]
SPREADS = [2, 4, 6, 8, 10, 15, 20, 30, 40]
RUN_NAME = "test"  # You can use "Jun8" if you want to reuse our results.

birds = []

for hop_distance in HOPS:
    for num_spreads in SPREADS:
        birds.append(bird_run.get_bird_run(
            "acowoo", "Acorn Woodpecker",
            do_validation=True, run_name=RUN_NAME,
            hop_distance=hop_distance, num_spreads=num_spreads,
            num_simulations=NUM_SIMULATIONS))

for hop_distance in HOPS:
    for num_spreads in SPREADS:
        birds.append(bird_run.get_bird_run(
            "stejay", "Steller's Jay",
            do_validation=True, run_name=RUN_NAME,
            hop_distance=hop_distance, num_spreads=num_spreads,
            num_simulations=NUM_SIMULATIONS))

for hop_distance in range(1, 5):
    for num_spreads in HOPS:
        birds.append(bird_run.get_bird_run(
            "oaktit", "Oak Titmouse",
            do_validation=True, run_name=RUN_NAME,
            hop_distance=hop_distance, num_spreads=num_spreads,
            num_simulations=NUM_SIMULATIONS))

For each bird run, we build a pandas dataframe with one row per square in which checklists
were recorded. We read the CSV produced by `GenerateValidationData.ipynb` into
a dataframe and, for each row, add the amounts of habitat and repopulation
around that square.

This is time-consuming, because we need to access the repopulation
file for each square. You only need to run it once per bird run;
after that you can analyze the resulting data as often as you like.
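
To make "habitat and repopulation around a square" concrete, here is a minimal sketch of how an average and a maximum over a small window of a raster could be computed. The helper `neighborhood_stats` and its `half_width` parameter are hypothetical and only for illustration; the notebook itself delegates this work to `validation.get_repop_ratios`, whose internals may differ.

```python
import numpy as np

def neighborhood_stats(raster, row, col, half_width):
    """Return (mean, max) of the raster values in a window centered at (row, col).

    Hypothetical helper for illustration; not part of ebird_db.
    The window is clipped at the raster borders.
    """
    r0, r1 = max(row - half_width, 0), min(row + half_width + 1, raster.shape[0])
    c0, c1 = max(col - half_width, 0), min(col + half_width + 1, raster.shape[1])
    window = raster[r0:r1, c0:c1]
    return float(window.mean()), float(window.max())

# Toy 4x4 habitat raster: 1 = habitat, 0 = non-habitat.
habitat = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

# 3x3 window around the pixel at row 1, column 1.
avg_hab, max_hab = neighborhood_stats(habitat, row=1, col=1, half_width=1)
```

In this toy raster, three of the nine pixels in the window are habitat, so the average is one third while the maximum is 1.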

In [None]:
# These values must match the ones in GenerateValidationData.ipynb
max_distance = 2
date_range = ("2012-01-01", "2018-12-31")
num_sample_squares = 20000 # Sampling number for the squares.

validation = ebird_db.Validation()
for bird in birds:

    repop = GeoTiff.from_file(bird.repopulation_fn)
    hab = GeoTiff.from_file(bird.habitat_fn)

    obs_fn = bird_run.get_observations_all_fn(
        bird.obs_path, max_distance=max_distance,
        date_range="-".join(date_range),
        num_squares=num_sample_squares)

    # This reads information on each square: how many checklists, birds, etc.
    df = pd.read_csv(obs_fn)

    # Augments the dataframe with the values for each square of repopulation and habitat.
    validation.get_repop_ratios(df, repop, hab, tile_scale=3)
    # Computes a repopulation range.
    df["RepopRange"] = df.apply(lambda row : int(row["avg_repop"] * 10) / 10, axis=1)
    # Computes birds and sightings per checklist.
    df["ObsRatio"] = df["NumBirdChecklists"] / df["NumChecklists"]
    df["BirdRatio"] = df["NumBirds"] / df["NumChecklists"]

    # Writes the resulting dataset.
    df.to_csv(bird.obs_csv_path)
    print("Done with", bird.nickname, bird.hop_distance, bird.num_spreads)


The dataframe contains, for each square:
* Number of checklists
* Number of checklists containing the bird
* Total number of birds (of the given species) seen
* Average habitat around the square (counting 1 as habitat and 0 as non-habitat)
* Max habitat around the square
* Average repopulation around the square (repopulation counts as 0 outside habitat)
* Max repopulation around the square
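
As a small worked example of the derived columns, the sketch below applies the same formulas to a toy dataframe. The column names come from the code above; the numeric values are made up for illustration. The vectorized binning is assumed to be equivalent to the `apply`-based version for non-missing values.

```python
import pandas as pd

# Toy data for three squares; values are invented for illustration.
df = pd.DataFrame({
    "NumChecklists": [10, 4, 8],
    "NumBirdChecklists": [5, 0, 2],
    "NumBirds": [12, 0, 3],
    "avg_repop": [0.37, 0.0, 0.92],
})

# Fraction of checklists that report the bird.
df["ObsRatio"] = df["NumBirdChecklists"] / df["NumChecklists"]
# Average number of birds seen per checklist.
df["BirdRatio"] = df["NumBirds"] / df["NumChecklists"]
# Repopulation truncated down to one decimal place (bins of width 0.1).
df["RepopRange"] = (df["avg_repop"] * 10).astype(int) / 10
```

The first square, for instance, has 12 birds over 10 checklists, giving a `BirdRatio` of 1.2, and its average repopulation of 0.37 falls in the 0.3 bin.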

We look for the correlation between:
* Average number of birds seen per checklist (the BirdRatio),
* Max repopulation (MaxRepopRange).
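
One possible way to inspect this relationship with pandas is sketched below on toy data. The column name `max_repop` is an assumption (by analogy with `avg_repop`), `MaxRepopRange` is assumed to be binned the same way as `RepopRange`, and all values are invented.

```python
import pandas as pd

# Toy data; "max_repop" is an assumed column name, values are invented.
df = pd.DataFrame({
    "max_repop": [0.05, 0.12, 0.18, 0.55, 0.58, 0.93],
    "BirdRatio": [0.0, 0.2, 0.4, 1.0, 1.4, 2.5],
})

# Bin max repopulation into ranges of width 0.1.
df["MaxRepopRange"] = (df["max_repop"] * 10).astype(int) / 10

# Mean BirdRatio within each repopulation range.
by_range = df.groupby("MaxRepopRange")["BirdRatio"].mean()

# Pearson correlation between BirdRatio and (unbinned) max repopulation.
corr = df["BirdRatio"].corr(df["max_repop"])
```

Looking at `by_range` shows how the average number of birds per checklist changes across repopulation ranges, while `corr` summarizes the relationship in a single number.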

Why max repopulation and not average? Because average repopulation mixes two concerns: (a) how much habitat there is around the square, and (b) how high the repopulation is in that habitat. This confounds the signal. It is much cleaner to look at the correlation between BirdRatio and MaxRepopRange.
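
The confounding effect can be seen in a small numeric sketch. The two toy 3x3 neighborhoods below are invented for illustration: both have the same repopulation level inside habitat, but differ in how much habitat they contain.

```python
import numpy as np

# Repopulation is 0 outside habitat and 0.8 inside habitat in both cases.
mostly_habitat = np.full((3, 3), 0.8)
mostly_habitat[0, 0] = 0.0          # 8 of 9 pixels are habitat
little_habitat = np.zeros((3, 3))
little_habitat[1, 1] = 0.8          # 1 of 9 pixels is habitat

avg_a, max_a = mostly_habitat.mean(), mostly_habitat.max()
avg_b, max_b = little_habitat.mean(), little_habitat.max()

print(f"mostly habitat: avg={avg_a:.3f}, max={max_a:.1f}")
print(f"little habitat: avg={avg_b:.3f}, max={max_b:.1f}")
```

The averages differ by a factor of eight purely because of the amount of habitat, whereas the maximum is identical in both neighborhoods and reflects only the repopulation level within habitat.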