## Load a single bird's data from Mongo

We're assuming that `mongod` is running and pointing to a database with the BIRT data. In my case, that's `mongod --dbpath /Volumes/Transcend/data/db`.
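Everything below depends on that server being reachable, so a quick check up front can save a confusing stack trace later. This is only a minimal sketch: `mongo_is_reachable` is a hypothetical helper, not part of pymongo, and it relies on nothing more than the `ping` admin command.

```python
def mongo_is_reachable(client):
    """Return True if the server behind `client` answers a ping.

    `client` is expected to be a pymongo MongoClient (or anything with
    the same `client.admin.command(...)` interface).
    """
    try:
        client.admin.command('ping')
    except Exception:
        # pymongo raises a ConnectionFailure subclass when no server answers;
        # catching broadly keeps this sketch free of a pymongo import.
        return False
    return True
```

With a real server this could be called as `mongo_is_reachable(MongoClient(serverSelectionTimeoutMS=2000))`. The short timeout is my own choice, so that a dead server fails fast instead of waiting out the 30-second default.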

In [1]:
# ! mongod --dbpath /Volumes/Transcend/data/db --fork --logpath ~/Library/Logs/mongodb.log

In [18]:
from pymongo import MongoClient
import pandas as pd
import os

In [3]:
client = MongoClient()
db = client.birt

In [4]:
# Make sure we've got the thing hooked up right.
# (`collection_names()` was removed in PyMongo 4; `list_collection_names()` replaces it.)
db.list_collection_names()

['migrations', 'birds', 'halunka:i18n']

In [5]:
birds = db.birds

So, now we have `birds`, which is a collection of the eBird sightings. The documentation for the `Collection()` class is [here](https://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection). `birds.find_one()` will get us the first record to take a look at the structure.

In [6]:
birds.find_one()

{'_id': 'abeillia_abeillei',
 'category': 'species',
 'family_name': 'Trochilidae (Hummingbirds)',
 'genus_name': 'Abeillia',
 'order_name': 'Apodiformes',
 'primary_com_name': 'Emerald-chinned_Hummingbird',
 'species_name': 'abeillei',
 'subfamily_name': None,
 'taxon_order': 9016.0}

In [7]:
migrations = db.migrations

Below, we examine a single `migrations` document. Honestly, this data is awkwardly organized, and it's not obvious how to index it. At first glance there's no "sightings" array: every covariate is a top-level field, and each species gets its own field holding the number of birds counted at that location.

So, for each bird, I should:

1. Find all documents with that species name.
2. Extract all covariates for those locations; the "target", in `scikit-learn` terminology, is the number of birds.
3. Load all background covariates.
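The first two steps can be sketched on plain dictionaries, without touching Mongo. This is an illustration only: `split_features_target` is a hypothetical helper, the toy documents are invented to mirror the flat layout, and treating a missing species field as a count of zero is an assumption, not something the data confirms.

```python
def split_features_target(documents, species, covariates):
    """Split flat documents into covariate rows (X) and bird counts (y)."""
    X, y = [], []
    for doc in documents:
        # Keep only the requested covariates; absent ones become None.
        X.append({name: doc.get(name) for name in covariates})
        # Assumption: a document without the species field means zero birds.
        y.append(doc.get(species, 0))
    return X, y


# Toy documents (invented values) with the same flat shape as `migrations`.
docs = [
    {'_id': 'A', 'bcr': 12, 'elev_gt': 182, 'zenaida_macroura': 2},
    {'_id': 'B', 'bcr': 30, 'elev_gt': 29, 'larus_argentatus': 3},
]
X, y = split_features_target(docs, 'zenaida_macroura', ['bcr', 'elev_gt'])
print(X)
print(y)
```

Passing `X` to `pd.DataFrame` would then give a feature matrix with one column per covariate.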

In [8]:
migrations.find_one()

{'_id': 'S10000010',
 'agelaius_phoeniceus': 2,
 'baeolophus_bicolor': 4,
 'bailey_ecoregion': '-222J',
 'bcr': 12,
 'bucephala_clangula': 9,
 'cardinalis_cardinalis': 1,
 'caus_prec': 3,
 'caus_snow': 3,
 'caus_temp_avg': 2,
 'caus_temp_max': 1,
 'caus_temp_min': 2,
 'corvus_brachyrhynchos': 2,
 'count_type': 'P22',
 'country': 'United_States',
 'cyanocitta_cristata': 7,
 'date': datetime.datetime(2012, 2, 23, 0, 0),
 'day': 54,
 'effort_area_ha': 0.0,
 'effort_distance_km': 1.931,
 'effort_hrs': 0.83,
 'elev_gt': 182,
 'elev_ned': None,
 'group_id': None,
 'haemorhous_mexicanus': 2,
 'haliaeetus_leucocephalus': 3,
 'housing_density': None,
 'housing_percent_vacant': None,
 'larus_argentatus': 3,
 'larus_delawarensis': 81,
 'loc': {'coordinates': [-83.911171, 43.6727141], 'type': 'Point'},
 'lophodytes_cucullatus': 4,
 'mergus_merganser': 87,
 'month': 2,
 'nlcd2001_fs_c11_7500_pland': 38.0488,
 'nlcd2001_fs_c12_7500_pland': 0.0,
 'nlcd2001_fs_c21_7500_pland': 6.928,
 'nlcd2001_fs_c22

In [9]:
migrations.find_one(filter={'sightings.bird_id': 'zenaida_macroura'},
                    projection=['sightings.$'])

# And then I'll just add the core_covariates to `projection`!!!

{'_id': 'S10000010',
 'sightings': [{'bird_id': 'zenaida_macroura', 'count': 1}]}

In [10]:
# So, I need to read in the list of core covariates.
core_covariates = open('../data/core-covariates.names').readlines()

In [11]:
# Keep only the name before the colon on each line, lowercased.
with open('../data/core-covariates.names') as names_file:
    core_covariates = [line.split(":")[0].lower() for line in names_file]

## Planning

We need to decide which bird to use as a demo. The `Species_Analysis_Matrix_V1` document lists species and various properties.

In [12]:
core_covariates

['sampling_event_id',
 'pop00_sqmi',
 'housing_density',
 'housing_percent_vacant',
 'elev_gt',
 'elev_ned',
 'bcr',
 'bailey_ecoregion',
 'omernik_l3_ecoregion',
 'caus_temp_avg',
 'caus_temp_min',
 'caus_temp_max',
 'caus_prec',
 'caus_snow',
 'nlcd2001_fs_c11_7500_pland',
 'nlcd2001_fs_c12_7500_pland',
 'nlcd2001_fs_c21_7500_pland',
 'nlcd2001_fs_c22_7500_pland',
 'nlcd2001_fs_c23_7500_pland',
 'nlcd2001_fs_c24_7500_pland',
 'nlcd2001_fs_c31_7500_pland',
 'nlcd2001_fs_c41_7500_pland',
 'nlcd2001_fs_c42_7500_pland',
 'nlcd2001_fs_c43_7500_pland',
 'nlcd2001_fs_c52_7500_pland',
 'nlcd2001_fs_c71_7500_pland',
 'nlcd2001_fs_c81_7500_pland',
 'nlcd2001_fs_c82_7500_pland',
 'nlcd2001_fs_c90_7500_pland',
 'nlcd2001_fs_c95_7500_pland',
 'nlcd2006_fs_c11_7500_pland',
 'nlcd2006_fs_c12_7500_pland',
 'nlcd2006_fs_c21_7500_pland',
 'nlcd2006_fs_c22_7500_pland',
 'nlcd2006_fs_c23_7500_pland',
 'nlcd2006_fs_c24_7500_pland',
 'nlcd2006_fs_c31_7500_pland',
 'nlcd2006_fs_c41_7500_pland',
 'nlcd2006_

In [13]:
migrations.find_one(filter={'sightings.bird_id' : 'zenaida_macroura'},
                    projection=['sightings.$'] + core_covariates)

{'_id': 'S10000010',
 'bailey_ecoregion': '-222J',
 'bcr': 12,
 'caus_prec': 3,
 'caus_snow': 3,
 'caus_temp_avg': 2,
 'caus_temp_max': 1,
 'caus_temp_min': 2,
 'elev_gt': 182,
 'elev_ned': None,
 'housing_density': None,
 'housing_percent_vacant': None,
 'nlcd2001_fs_c11_7500_pland': 38.0488,
 'nlcd2001_fs_c12_7500_pland': 0.0,
 'nlcd2001_fs_c21_7500_pland': 6.928,
 'nlcd2001_fs_c22_7500_pland': 9.4952,
 'nlcd2001_fs_c23_7500_pland': 4.6392,
 'nlcd2001_fs_c24_7500_pland': 1.5504,
 'nlcd2001_fs_c31_7500_pland': 1.4828,
 'nlcd2001_fs_c41_7500_pland': 4.1796,
 'nlcd2001_fs_c42_7500_pland': 0.1372,
 'nlcd2001_fs_c43_7500_pland': 0.2728,
 'nlcd2001_fs_c52_7500_pland': 0.0636,
 'nlcd2001_fs_c71_7500_pland': 0.8444,
 'nlcd2001_fs_c81_7500_pland': 2.272,
 'nlcd2001_fs_c82_7500_pland': 22.8724,
 'nlcd2001_fs_c90_7500_pland': 5.65,
 'nlcd2001_fs_c95_7500_pland': 1.5636,
 'nlcd2006_fs_c11_7500_pland': 38.3464,
 'nlcd2006_fs_c12_7500_pland': 0.0,
 'nlcd2006_fs_c21_7500_pland': 7.15,
 'nlcd2006_fs

In [14]:
# Alternately, using the flat bird name:
migrations.find_one(filter={'zenaida_macroura' : {'$gt' : 0}},
                   projection=['zenaida_macroura'] + core_covariates)
# Equivalent here, assuming a species field is only stored when its count is positive:
migrations.find_one(filter={'zenaida_macroura' : {'$exists':True}},
                   projection=['zenaida_macroura'] + core_covariates)

{'_id': 'S10000010',
 'bailey_ecoregion': '-222J',
 'bcr': 12,
 'caus_prec': 3,
 'caus_snow': 3,
 'caus_temp_avg': 2,
 'caus_temp_max': 1,
 'caus_temp_min': 2,
 'elev_gt': 182,
 'elev_ned': None,
 'housing_density': None,
 'housing_percent_vacant': None,
 'nlcd2001_fs_c11_7500_pland': 38.0488,
 'nlcd2001_fs_c12_7500_pland': 0.0,
 'nlcd2001_fs_c21_7500_pland': 6.928,
 'nlcd2001_fs_c22_7500_pland': 9.4952,
 'nlcd2001_fs_c23_7500_pland': 4.6392,
 'nlcd2001_fs_c24_7500_pland': 1.5504,
 'nlcd2001_fs_c31_7500_pland': 1.4828,
 'nlcd2001_fs_c41_7500_pland': 4.1796,
 'nlcd2001_fs_c42_7500_pland': 0.1372,
 'nlcd2001_fs_c43_7500_pland': 0.2728,
 'nlcd2001_fs_c52_7500_pland': 0.0636,
 'nlcd2001_fs_c71_7500_pland': 0.8444,
 'nlcd2001_fs_c81_7500_pland': 2.272,
 'nlcd2001_fs_c82_7500_pland': 22.8724,
 'nlcd2001_fs_c90_7500_pland': 5.65,
 'nlcd2001_fs_c95_7500_pland': 1.5636,
 'nlcd2006_fs_c11_7500_pland': 38.3464,
 'nlcd2006_fs_c12_7500_pland': 0.0,
 'nlcd2006_fs_c21_7500_pland': 7.15,
 'nlcd2006_fs
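The `$gt` and `$exists` filters only agree if a species field never holds a zero or a null. A pure-Python imitation of the two conditions makes the difference visible; the helper names are hypothetical, and the zero and null documents are invented edge cases, not records I have seen in `migrations`.

```python
def matches_gt_zero(doc, field):
    """Mimic {field: {'$gt': 0}}: the field must be a number above zero."""
    value = doc.get(field)
    return isinstance(value, (int, float)) and value > 0


def matches_exists(doc, field):
    """Mimic {field: {'$exists': True}}: the key is present, whatever its value."""
    return field in doc


docs = [
    {'_id': 'A', 'zenaida_macroura': 2},     # positive count
    {'_id': 'B', 'zenaida_macroura': 0},     # explicit zero (hypothetical)
    {'_id': 'C', 'zenaida_macroura': None},  # present but null (hypothetical)
    {'_id': 'D'},                            # field absent
]
gt_ids = [d['_id'] for d in docs if matches_gt_zero(d, 'zenaida_macroura')]
exists_ids = [d['_id'] for d in docs if matches_exists(d, 'zenaida_macroura')]
print(gt_ids)      # ['A']
print(exists_ids)  # ['A', 'B', 'C']
```

If the two filters return the same count on the real collection, that is decent evidence that zeros and nulls are never stored.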

In [15]:
# Getting all of them:
zenaida_macroura = migrations.find(filter={'zenaida_macroura' : {'$exists':True}},
                                   projection=['zenaida_macroura'] + core_covariates)

In [16]:
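# `Cursor.count()` was removed in PyMongo 4; count on the collection instead.
migrations.count_documents({'zenaida_macroura': {'$exists': True}})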

1542192

In [53]:
projection = dict.fromkeys(['zenaida_macroura'] + core_covariates, 1)

In [55]:
# `find()` gave us a pymongo cursor, which we need to turn into a DataFrame.
# 1.5 million documents is too many for that, so sample server-side with `$sample`.
zenaida_macroura = migrations.aggregate(
    [
        {'$match': {'zenaida_macroura' : {'$exists' : True}}},
        {'$project' : projection},
        {'$sample' : {'size' : 13}}
    ]
)
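Where `$sample` isn't available (it needs MongoDB 3.2 or newer), a uniform sample can also be drawn client-side in a single pass over a cursor. A minimal sketch using reservoir sampling (Algorithm R): `reservoir_sample` is a hypothetical helper, and `range(1000)` stands in for the cursor.

```python
import random


def reservoir_sample(iterable, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 13, seed=42)
print(len(sample))
```

Unlike `$sample`, this still pulls every document over the wire, so it only saves memory, not transfer time.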

In [56]:
zenaida_macroura_df = pd.DataFrame(list(zenaida_macroura))

In [57]:
zenaida_macroura_df

|  | _id | bailey_ecoregion | bcr | caus_prec | caus_snow | caus_temp_avg | caus_temp_max | caus_temp_min | elev_gt | elev_ned | ... | nlcd2006_fs_c43_7500_pland | nlcd2006_fs_c52_7500_pland | nlcd2006_fs_c71_7500_pland | nlcd2006_fs_c81_7500_pland | nlcd2006_fs_c82_7500_pland | nlcd2006_fs_c90_7500_pland | nlcd2006_fs_c95_7500_pland | omernik_l3_ecoregion | pop00_sqmi | zenaida_macroura |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | S8675086 | -322C | 33 | 1 |  | 9 | 9 | 9 | -66.0 |  | ... | 0.0 | 21.2564 | 0.1116 | 0.5084 | 12.1964 | 0.11 | 0.0824 | 81 |  | 4 |
| 1 | S2320053 | -231A | 30 | 6 |  | 6 | 6 | 7 | 29.0 |  | ... | 1.8104 | 0.2704 | 0.5696 | 2.006 | 0.6952 | 3.9564 | 0.004 | 65 |  | 3 |
| 2 | S7410669 | M231A | 25 | 6 | 1.0 | 4 | 4 | 4 | 243.0 |  | ... | 8.934 | 2.9836 | 8.064 | 5.1336 | 0.0 | 0.5232 | 0.0088 | 36 | 2.6 | 15 |
| 3 | S12125682 | M221D | 29 | 6 | 1.0 | 4 | 4 | 5 |  |  | ... | 0.5632 | 0.5204 | 0.1468 | 33.5364 | 6.39 | 0.4504 | 0.2236 | 64 | 36.6 | 1 |
| 4 | S8183633 | -222J | 23 | 6 |  | 5 | 5 | 6 | 309.0 |  | ... | 0.0612 | 0.4204 | 0.5372 | 25.0164 | 33.5528 | 13.8504 | 0.4636 | 56 |  | 7 |
| 5 | S11183614 | -321A | 34 | 7 |  | 7 | 7 | 8 | 1503.0 | 1478.44 | ... | 0.6008 | 54.2852 | 0.0 | 0.0 | 0.0 | 0.1664 | 0.0 | 79 | 1.7 | 3 |
| 6 | S6715331 | -212F | 13 | 6 |  | 6 | 6 | 7 | 116.0 | 116.21 | ... | 1.426 | 9.7668 | 2.9168 | 16.0016 | 9.2592 | 9.2576 | 0.9576 | 83 | 82.2 | 3 |
| 7 | S5337147 | -232C | 27 | 6 |  | 7 | 7 | 8 | 3.0 | 2.58 | ... | 0.652 | 4.7496 | 1.9364 | 0.7136 | 2.5556 | 3.7336 | 32.8048 | 75 | 152.6 | 6 |
| 8 | S3492209 | -231A | 29 | 6 | 1.0 | 4 | 4 | 4 | 107.0 | 111.12 | ... | 2.074 | 0.3492 | 2.0752 | 1.7784 | 0.2 | 5.536 | 0.0912 | 45 | 1921.8 | 6 |
| 9 | S10828465 | -222E | 24 | 7 |  | 6 | 6 | 7 | 274.0 |  | ... | 3.2096 | 1.8724 | 1.1692 | 44.3992 | 7.1388 | 0.9204 | 0.1408 | 71 |  | 4 |


## Read in Core Covariates CSV

In [19]:
datadir = '/Volumes/Transcend/birt data/eBird raw data'

filename = 'srd_point_data_30km_v3.0.csv'
path_to_file = os.path.join(datadir, filename)

# Missing values seem to be encoded as "?", so we pass that as `na_values`.
srd30km = pd.read_csv(path_to_file, na_values='?')

In [66]:
cov_samp = srd30km.sample(13)

In [85]:
# Make sure the covariate sample's column names are lowercase.
cov_samp.columns = cov_samp.columns.str.lower()
cov_samp.columns

Index(['decimal_latitude', 'decimal_longitude', 'pop00_sqmi',
       'housing_density', 'housing_percent_vacant', 'elev_gt', 'elev_ned',
       'subnational2_code', 'bcr', 'bailey_ecoregion',
       ...
       'nlcd2006_fs_c82_7500_pd', 'nlcd2006_fs_c82_7500_pland',
       'nlcd2006_fs_c90_7500_ed', 'nlcd2006_fs_c90_7500_lpi',
       'nlcd2006_fs_c90_7500_pd', 'nlcd2006_fs_c90_7500_pland',
       'nlcd2006_fs_c95_7500_ed', 'nlcd2006_fs_c95_7500_lpi',
       'nlcd2006_fs_c95_7500_pd', 'nlcd2006_fs_c95_7500_pland'],
      dtype='object', length=496)

In [110]:
# Joining method one: keep only the covariate columns that also appear in
# `migrations`, then stack the rows, taking the union of the columns.
cov_samp2 = cov_samp.loc[:, cov_samp.columns.isin(core_covariates)]
pd.concat([cov_samp2, zenaida_macroura_df], join='outer')

|  | _id | bailey_ecoregion | bcr | caus_prec | caus_snow | caus_temp_avg | caus_temp_max | caus_temp_min | elev_gt | elev_ned | ... | nlcd2006_fs_c43_7500_pland | nlcd2006_fs_c52_7500_pland | nlcd2006_fs_c71_7500_pland | nlcd2006_fs_c81_7500_pland | nlcd2006_fs_c82_7500_pland | nlcd2006_fs_c90_7500_pland | nlcd2006_fs_c95_7500_pland | omernik_l3_ecoregion | pop00_sqmi | zenaida_macroura |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 32695 |  | -251E | 22.0 |  |  |  |  |  | 213.0 | 230.1 | ... | 0.6184 | 2.1636 | 10.09 | 17.0428 | 2.7488 | 1.264 | 0.0048 | 38.0 | 18.6 |  |
| 68746 |  | M242B | 5.0 |  |  |  |  |  | 732.0 | 734.44 | ... | 0.0516 | 13.452 | 0.0564 | 0.0 | 0.0 | 0.0 | 0.0032 | 4.0 | 2.1 |  |
| 128007 |  | M332C | 10.0 |  |  |  |  |  | 1219.0 | 1212.5 | ... | 0.0 | 2.758 | 57.6612 | 9.2064 | 24.0184 | 2.88 | 0.3436 | 42.0 | 0.7 |  |
| 29001 |  | -311A | 19.0 |  |  |  |  |  | 416.0 | 424.1 | ... | 0.0684 | 1.0256 | 28.9216 | 0.0 | 63.3956 | 0.0 | 0.0028 | 27.0 | 3.3 |  |
| 43294 |  | M262A | 32.0 |  |  |  |  |  | 217.0 | 267.39 | ... | 1.216 | 12.6848 | 55.446 | 3.5676 | 6.8312 | 1.5436 | 0.9028 | 6.0 | 11.3 |  |
| 6625 |  | -222F | 24.0 |  |  |  |  |  | 256.0 | 250.24 | ... | 0.0056 | 0.3696 | 1.5724 | 19.4912 | 2.218 | 0.0168 | 0.0552 | 71.0 | 2175.1 |  |
| 99303 |  | -331F | 17.0 |  |  |  |  |  | 989.0 | 1003.47 | ... | 0.0 | 1.9836 | 87.9428 | 0.1796 | 6.6748 | 1.1168 | 1.2748 | 43.0 | 2.2 |  |
| 16912 |  | -341A | 9.0 |  |  |  |  |  | 1449.0 | 1448.02 | ... | 0.0 | 93.6348 | 3.832 | 0.2168 | 0.0 | 0.0 | 0.0052 | 13.0 | 0.8 |  |
| 9441 |  | -341B | 16.0 |  |  |  |  |  | 2218.0 | 2239.91 | ... | 1.3704 | 19.2196 | 4.048 | 5.6012 | 3.9464 | 1.974 | 0.0 | 21.0 | 0.5 |  |
| 55833 |  | -231A | 29.0 |  |  |  |  |  | 213.0 | 200.34 | ... | 1.9304 | 2.5608 | 4.3172 | 11.2244 | 0.4012 | 1.6208 | 0.0036 | 45.0 | 22.0 |  |


In [111]:
# Joining method two: add a zero-occurrence column to a copy of the covariate
# DataFrame, then stack the rows and keep only the shared columns (inner join).
cov_samp3 = cov_samp.copy()  # copy, so `cov_samp` itself is left unchanged
cov_samp3['zenaida_macroura'] = 0
pd.concat([cov_samp3, zenaida_macroura_df], join='inner')

|  | bailey_ecoregion | bcr | elev_gt | elev_ned | housing_density | housing_percent_vacant | nlcd2001_fs_c11_7500_pland | nlcd2001_fs_c12_7500_pland | nlcd2001_fs_c21_7500_pland | nlcd2001_fs_c22_7500_pland | ... | nlcd2006_fs_c43_7500_pland | nlcd2006_fs_c52_7500_pland | nlcd2006_fs_c71_7500_pland | nlcd2006_fs_c81_7500_pland | nlcd2006_fs_c82_7500_pland | nlcd2006_fs_c90_7500_pland | nlcd2006_fs_c95_7500_pland | omernik_l3_ecoregion | pop00_sqmi | zenaida_macroura |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 32695 | -251E | 22.0 | 213.0 | 230.1 | 7.971644 | 0.09772 | 9.5512 | 0.0 | 4.6116 | 0.8348 | ... | 0.6184 | 2.1636 | 10.09 | 17.0428 | 2.7488 | 1.264 | 0.0048 | 38.0 | 18.6 | 0 |
| 68746 | M242B | 5.0 | 732.0 | 734.44 | 0.676357 | 0.099768 | 0.0568 | 0.0 | 0.1088 | 0.002 | ... | 0.0516 | 13.452 | 0.0564 | 0.0 | 0.0 | 0.0 | 0.0032 | 4.0 | 2.1 | 0 |
| 128007 | M332C | 10.0 | 1219.0 | 1212.5 | 0.474449 | 0.437795 | 1.3456 | 0.0 | 1.2768 | 0.3556 | ... | 0.0 | 2.758 | 57.6612 | 9.2064 | 24.0184 | 2.88 | 0.3436 | 42.0 | 0.7 | 0 |
| 29001 | -311A | 19.0 | 416.0 | 424.1 | 1.624412 | 0.169492 | 0.2908 | 0.0 | 4.1136 | 0.0412 | ... | 0.0684 | 1.0256 | 28.9216 | 0.0 | 63.3956 | 0.0 | 0.0028 | 27.0 | 3.3 | 0 |
| 43294 | M262A | 32.0 | 217.0 | 267.39 | 4.028901 | 0.056901 | 0.0 | 0.0 | 9.4512 | 1.2268 | ... | 1.216 | 12.6848 | 55.446 | 3.5676 | 6.8312 | 1.5436 | 0.9028 | 6.0 | 11.3 | 0 |
| 6625 | -222F | 24.0 | 256.0 | 250.24 | 1061.005249 | 0.035611 | 1.2848 | 0.0 | 15.4444 | 12.7916 | ... | 0.0056 | 0.3696 | 1.5724 | 19.4912 | 2.218 | 0.0168 | 0.0552 | 71.0 | 2175.1 | 0 |
| 99303 | -331F | 17.0 | 989.0 | 1003.47 | 1.068509 | 0.163767 | 0.2952 | 0.0 | 0.5812 | 0.0228 | ... | 0.0 | 1.9836 | 87.9428 | 0.1796 | 6.6748 | 1.1168 | 1.2748 | 43.0 | 2.2 | 0 |
| 16912 | -341A | 9.0 | 1449.0 | 1448.02 | 0.359748 | 0.248214 | 1.6276 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 93.6348 | 3.832 | 0.2168 | 0.0 | 0.0 | 0.0052 | 13.0 | 0.8 | 0 |
| 9441 | -341B | 16.0 | 2218.0 | 2239.91 | 0.269576 | 0.312605 | 0.1104 | 0.0 | 2.5896 | 0.0812 | ... | 1.3704 | 19.2196 | 4.048 | 5.6012 | 3.9464 | 1.974 | 0.0 | 21.0 | 0.5 | 0 |
| 55833 | -231A | 29.0 | 213.0 | 200.34 | 10.770011 | 0.154964 | 1.6256 | 0.0 | 10.3872 | 7.6192 | ... | 1.9304 | 2.5608 | 4.3172 | 11.2244 | 0.4012 | 1.6208 | 0.0036 | 45.0 | 22.0 | 0 |


Now *that's* the kind of thing we can give to `scikit-learn`.
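The two joining methods differ only in how `pd.concat` treats columns that one frame lacks. Here is a toy sketch of that difference; the values are invented, and the small frames `background` and `sightings` are stand-ins for `cov_samp` and `zenaida_macroura_df`, not the real data.

```python
import pandas as pd

# Background covariates: an extra column ('lat') and no bird count.
background = pd.DataFrame({
    'bcr': [22, 5],
    'elev_gt': [213.0, 732.0],
    'lat': [40.1, 44.2],
})
# Sightings: an '_id' and a count, but no 'lat'.
sightings = pd.DataFrame({
    '_id': ['S1', 'S2'],
    'bcr': [33, 30],
    'elev_gt': [-66.0, 29.0],
    'zenaida_macroura': [4, 3],
})

# Method one: union of columns; cells missing from one frame become NaN.
outer = pd.concat([background[['bcr', 'elev_gt']], sightings], join='outer')

# Method two: give the background a zero count, then keep shared columns only.
background_zero = background.assign(zenaida_macroura=0)
inner = pd.concat([background_zero, sightings], join='inner')

print(outer)
print(inner)
```

Method two leaves no missing counts, which is why its result is the one that can go straight into an estimator. Whether zero is the right label for a background point is a modelling choice, not something `concat` decides.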

Outstanding questions:

- Why are some covariates missing from each dataset? Which ones are included in the migrations table and which aren't?
- Is it more correct to include *all* of the migrations table as well as the background covariates, or should I only subsample the observations of a bird and compare that to the background covariates?

I'll research these tomorrow with some lit review.