## Validation for Omniscape

This notebook generates the dataframes used for Omniscape validation from the observation data.

In [9]:
import os
import pandas as pd
from birdmaps import ebird_db
from scgt import GeoTiff

In [10]:
# Point this to the directory where you expanded the Omniscape.zip file.
DATA_PATH = "data/CA-Final/Omniscape"

In [11]:
# These are the runs that correspond to the EcoScape runs. 
birds = [("acowoo", "acowoo_CA"), 
         ("stejay", "stejay_CA_h2_S3")]

For each bird, we compute a Pandas dataframe with one row for each square in which
checklists were recorded.  To do so, we read the CSV produced by `GenerateValidationData.ipynb`
into a Pandas dataframe and add to each row the amounts of habitat and repopulation
around the square.

This is a time-consuming operation, as we need to read the repopulation
file for each of the squares.  You only need to run it once per bird run;
afterwards you can analyze the resulting data as often as you like.

In [12]:
max_distance = 2
date_range = ("2012-01-01", "2018-12-31")
num_sample_squares = 20000 # Sampling number for the squares.

for bird, bird_out in birds:
    for fn in ["cum_currmap"]: 

        hab_fn = os.path.join(DATA_PATH, bird, "OriginalLayers", "habitat.tif")
        repop = GeoTiff.from_file(os.path.join(DATA_PATH, "output", bird_out, fn + ".tif"))
        hab = GeoTiff.from_file(hab_fn)
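        # Build the observation file name from the sampling parameters defined above.
        obs_basename = f"CA_all_len_{max_distance}_{date_range[0]}-{date_range[1]}_{num_sample_squares}.csv"
        obs_fn = os.path.join(DATA_PATH, bird, obs_basename)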
        result_fn = os.path.join(DATA_PATH, "output", bird_out, "validation_" + fn + ".csv")
        
        validation = ebird_db.Validation(obs_fn, hab_fn)

        # Augment the dataframe with the repopulation and habitat values for each square.
        df = validation.get_repop_ratios(repop, hab, tile_scale=3)
        # Bin the average repopulation into ranges of width 0.1 (truncating to one decimal).
        df["RepopRange"] = (df["avg_repop"] * 10).astype(int) / 10
        # Computes birds and sightings per checklist.
        df["ObsRatio"] = df["NumBirdChecklists"] / df["NumChecklists"]
        df["BirdRatio"] = df["NumBirds"] / df["NumChecklists"]

        # Writes the resulting dataset.
        df.to_csv(result_fn)
        print("Done with", bird, fn)

Done with acowoo cum_currmap
Done with stejay cum_currmap
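
The derived columns can be checked on a toy dataframe. The sketch below is not part of the notebook run: the values are made up, and only the column names are taken from the cell above. It also shows that the binning truncates rather than rounds, so an average repopulation of 0.87 lands in the 0.8 range.

```python
import pandas as pd

# Toy stand-in for the dataframe returned by get_repop_ratios; values are made up.
toy = pd.DataFrame({
    "NumChecklists": [10, 4, 8],
    "NumBirdChecklists": [5, 1, 0],
    "NumBirds": [12, 1, 0],
    "avg_repop": [0.87, 0.349, 0.0],
})

# Truncate avg_repop to one decimal place to obtain the range it falls in.
toy["RepopRange"] = (toy["avg_repop"] * 10).astype(int) / 10
# Fraction of checklists that report the bird.
toy["ObsRatio"] = toy["NumBirdChecklists"] / toy["NumChecklists"]
# Average number of birds reported per checklist.
toy["BirdRatio"] = toy["NumBirds"] / toy["NumChecklists"]

print(toy[["RepopRange", "ObsRatio", "BirdRatio"]])
```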


For each square, the dataframe contains:
* Number of checklists
* Number of checklists containing the bird
* Total number of birds (of the given species) seen
* Average habitat around the square (counting 1 as habitat and 0 as non-habitat)
* Max habitat around the square
* Average repopulation around the square (counting repopulation as 0 outside habitat)
* Max repopulation around the square
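
To make the habitat and repopulation columns concrete, here is a minimal sketch of how such per-square statistics could be computed with NumPy on small in-memory rasters. It is not the `birdmaps` implementation: the helper `square_stats`, its `half_width` parameter, and the key names other than `avg_repop` are assumptions made for illustration, and the raster values are made up.

```python
import numpy as np

def square_stats(hab, repop, row, col, half_width=1):
    """Average and max of habitat and repopulation in a window centered on (row, col).

    Hypothetical helper: the window shape and the key names are assumptions.
    """
    r0, r1 = max(row - half_width, 0), row + half_width + 1
    c0, c1 = max(col - half_width, 0), col + half_width + 1
    hab_win = hab[r0:r1, c0:c1]
    # Repopulation counts as 0 outside habitat.
    repop_win = repop[r0:r1, c0:c1] * hab_win
    return {
        "avg_hab": float(hab_win.mean()),
        "max_hab": float(hab_win.max()),
        "avg_repop": float(repop_win.mean()),
        "max_repop": float(repop_win.max()),
    }

# Toy rasters: 1 marks habitat, 0 non-habitat.
hab = np.array([[1, 1, 0],
                [0, 1, 0],
                [0, 0, 0]])
repop = np.array([[0.4, 0.8, 0.9],
                  [0.2, 0.6, 0.1],
                  [0.0, 0.3, 0.5]])

stats = square_stats(hab, repop, row=1, col=1)
print(stats)
```

In this toy window only three of the nine pixels are habitat, so the average repopulation is pulled down by the zeros while the max is unaffected.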

We look for the correlation between:
* Average number of birds seen per checklist (`BirdRatio`),
* Max repopulation (`MaxRepopRange`).
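
A minimal sketch of such a comparison on made-up data follows. The column name `max_repop` is an assumption (only `avg_repop` appears in the code above), and `MaxRepopRange` is built here by analogy with `RepopRange`; the actual analysis may differ.

```python
import pandas as pd

# Made-up data; "max_repop" is a hypothetical column name.
toy = pd.DataFrame({
    "NumChecklists": [10, 5, 8, 4, 6],
    "NumBirds": [1, 1, 4, 3, 6],
    "max_repop": [0.12, 0.35, 0.58, 0.74, 0.91],
})
toy["BirdRatio"] = toy["NumBirds"] / toy["NumChecklists"]
# Bin max repopulation the same way RepopRange bins the average.
toy["MaxRepopRange"] = (toy["max_repop"] * 10).astype(int) / 10

# Mean BirdRatio within each repopulation range.
by_bin = toy.groupby("MaxRepopRange")["BirdRatio"].mean()
# Pearson correlation between the two quantities.
corr = toy["BirdRatio"].corr(toy["MaxRepopRange"])

print(by_bin)
print(f"Pearson correlation: {corr:.3f}")
```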

Why max repopulation and not average?  Because average repopulation mixes two concerns: (a) how much habitat surrounds the square, and (b) how high the repopulation is within that habitat.  Mixing the two confounds the signal, so it is much cleaner to look at the correlation between BirdRatio and MaxRepopRange.
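
The confounding can be seen in a small, made-up example: two squares whose habitat pixels all have the same repopulation but which differ in how much habitat they contain. The arrays and the value 0.8 below are assumptions chosen for illustration only.

```python
import numpy as np

# Same repopulation value on every habitat pixel in both squares.
in_habitat_repop = 0.8

# Square A: 2 of 9 pixels are habitat. Square B: 8 of 9 pixels are habitat.
hab_a = np.array([[1, 1, 0], [0, 0, 0], [0, 0, 0]])
hab_b = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 0]])

# Repopulation counts as 0 outside habitat.
repop_a = in_habitat_repop * hab_a
repop_b = in_habitat_repop * hab_b

print("avg:", repop_a.mean(), repop_b.mean())
print("max:", repop_a.max(), repop_b.max())
```

The averages differ only because the amount of habitat differs, while the max is identical for both squares.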