We're assuming that `mongod` is running and pointing to a database with the BIRT data. In my case, that's `mongod --dbpath /Volumes/Transcend/data/db`.

In [1]:
from pymongo import MongoClient

In [2]:
client = MongoClient()
db = client.birt

In [3]:
# Make sure we've got the thing hooked up right.
db.list_collection_names()  # collection_names() was deprecated in PyMongo 3.7 and removed in 4.0

['migrations', 'birds']

In [4]:
birds = db.birds

So, now we have `birds`, which is a collection of eBird species records (one taxonomy entry per bird, judging by the document below). The documentation for the `Collection()` class is [here](https://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection). Called with no filter, `birds.find_one()` returns a single document, the first one in the collection, so we can take a look at the structure.

In [5]:
birds.find_one()

{'_id': 'rhodostethia_rosea',
 'category': 'species',
 'family_name': 'Laridae (Gulls- Terns- and Skimmers)',
 'genus_name': 'Rhodostethia',
 'order_name': 'Charadriiformes',
 'primary_com_name': "Ross's_Gull",
 'species_name': 'rosea',
 'subfamily_name': None,
 'taxon_order': 4292.0}

In [7]:
migrations = db.migrations

Below, we examine a single `migrations` document. Gosh, I think this data is pretty poorly organized. How can I index this? There's no "sightings" array to query. Each document just holds all the variables as top-level fields, plus one field per observed species whose value is the number of birds at that location.

So, for each bird, I should:

1. Find all documents with that species name.
2. Extract all covariates for those locations; the "target", in `scikit-learn` terminology, is the number of birds.
3. Load all background covariates.
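
The steps above can be sketched roughly as follows. This is only an illustration, not a settled design: the helper names `species_query` and `to_xy` and the `COVARIATES` list are my own placeholders, and treating a missing species field as a count of 0 is an assumption about the data. The sample document is a trimmed copy of the `migrations` record shown in this notebook, so the block runs without a database.

```python
# Hypothetical list of covariate fields; the real set is still to be decided.
COVARIATES = ["caus_prec", "caus_snow", "caus_temp_avg", "elev_ned", "effort_hrs", "day"]


def species_query(species):
    """Build a MongoDB filter matching documents that have a field named after the species."""
    return {species: {"$exists": True}}


def to_xy(docs, species, covariates=COVARIATES):
    """Split documents into covariate rows (X) and bird counts (y).

    Assumption: a document without the species field counts as 0 birds,
    which is how background (absence) documents would be handled.
    """
    X, y = [], []
    for doc in docs:
        X.append([doc.get(name) for name in covariates])
        y.append(doc.get(species, 0))
    return X, y


# Trimmed copy of the sample `migrations` document, used in place of a live query.
sample = {
    "_id": "S4080710",
    "actitis_macularius": 1,
    "caus_prec": 6,
    "caus_snow": 1,
    "caus_temp_avg": 8,
    "elev_ned": 2.33,
    "effort_hrs": 3.58,
    "day": 277,
}

X, y = to_xy([sample], "actitis_macularius")
print(X, y)
```

Against the live database, the same helper would presumably be fed a cursor instead of a list, for example `to_xy(migrations.find(species_query("actitis_macularius")), "actitis_macularius")`. Missing covariates come back as `None`, so they would still need handling before anything is passed to `scikit-learn`.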

In [8]:
migrations.find_one()

{'_id': 'S4080710',
 'actitis_macularius': 1,
 'antrostomus_carolinensis': 2,
 'bailey_ecoregion': None,
 'bcr': None,
 'bubulcus_ibis': 1,
 'buteo_platypterus': 1,
 'catharus_ustulatus': 3,
 'caus_prec': 6,
 'caus_snow': 1,
 'caus_temp_avg': 8,
 'caus_temp_max': 7,
 'caus_temp_min': 9,
 'circus_cyaneus': 1,
 'count_type': 'P20',
 'country': 'United_States',
 'date': datetime.datetime(2002, 10, 4, 0, 0),
 'day': 277,
 'dolichonyx_oryzivorus': 8,
 'dumetella_carolinensis': 1,
 'effort_area_ha': 0.0,
 'effort_distance_km': 0.0,
 'effort_hrs': 3.58,
 'elev_gt': None,
 'elev_ned': 2.33,
 'empidonax_sp': 2,
 'falco_columbarius': 2,
 'geothlypis_trichas': 14,
 'group_id': None,
 'hirundo_rustica': 3,
 'housing_density': 591.8242830994509,
 'housing_percent_vacant': 0.4845360824742268,
 'leucophaeus_atricilla': 1,
 'loc': {'coordinates': [-81.8106, 24.5463], 'type': 'Point'},
 'megaceryle_alcyon': 2,
 'melanerpes_carolinus': 3,
 'mimus_polyglottos': 5,
 'mniotilta_varia': 2,
 'month': 10,
 'n