# Data processing pipeline for eBird data

> Walkthrough for data processing steps to build the Birds of a Feather birding partner recommender from eBird observation data.

## Contents:
1. Read relevant columns from the raw eBird data (available at https://ebird.org/science/download-ebird-data-products) <a href='#step1'> [step 1]</a>
2. Group observations by user and extract features for each user <a href='#step2'> [step 2]</a>
3. Extract pairs of users <a href='#step3'> [step 3]</a>
4. Create georeferenced shapefile with users <a href='#step4'> [step 4]</a>
5. Find user names with the eBird API <a href='#step5'> [step 5]</a>
6. Scrape user profiles from eBird with a webbot <a href='#step6'> [step 6]</a>

## 1. Read raw eBird data <a id='step1'></a>

> Reads the eBird data *.txt* file in chunks using pandas and writes each chunk to a *.csv* file, with one observation per row and only the subset of columns that the data processing script uses for feature extraction. Usage:

In [4]:
!python utils/data_processing/read_ebird.py -h

usage: eBird database .txt file muncher [-h] [--input_txt INPUT_TXT]
                                        [--period PERIOD] [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --input_txt INPUT_TXT, -i INPUT_TXT
                        path to eBird database file
  --period PERIOD, -p PERIOD
                        start year to end year separated by a dash
  --output OUTPUT, -o OUTPUT
                        path to output csv file
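
The chunked read described above could look roughly like the sketch below. It is not the code of `read_ebird.py`: the column list, the function name `read_in_chunks`, the chunk size, and the idea that `--period` filters on **OBSERVATION DATE** are all assumptions made for illustration.

```python
import pandas as pd

# Assumed subset of eBird columns; the list the real script keeps may differ.
COLUMNS = ["OBSERVER ID", "GROUP IDENTIFIER", "SAMPLING EVENT IDENTIFIER",
           "COMMON NAME", "LATITUDE", "LONGITUDE", "OBSERVATION DATE"]


def read_in_chunks(input_txt, output_csv, period, columns=COLUMNS, chunksize=100_000):
    """Stream a tab-separated eBird file into a smaller CSV, one chunk at a time."""
    # The period is "start year to end year separated by a dash", e.g. "2015-2018".
    start_year, end_year = (int(year) for year in period.split("-"))
    reader = pd.read_csv(input_txt, sep="\t", usecols=columns,
                         chunksize=chunksize, dtype=str)
    rows_written = 0
    for i, chunk in enumerate(reader):
        # Keep only observations dated within the requested period (assumed behavior).
        years = pd.to_datetime(chunk["OBSERVATION DATE"]).dt.year
        chunk = chunk[(years >= start_year) & (years <= end_year)]
        # Write the header with the first chunk, then append the later chunks.
        chunk.to_csv(output_csv, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)
        rows_written += len(chunk)
    return rows_written
```

Only one chunk is held in memory at a time, which is what makes a multi-gigabyte download manageable on an ordinary machine.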


## 2. Process eBird data <a id='step2'></a>
> Reads the observations *.csv* from the previous step, sorts the observations by the **OBSERVER ID** column, chunks them by **OBSERVER ID**, and compiles all observation rows for a user into a single row of features for that user. Finding the centroid for a user takes $O(n^{2})$ time, where $n$ is the number of observations for that user, so expect long run times for users with more than 100,000 observations. See usage below:

In [7]:
!python utils/data_processing/process_ebird.py -h

usage: Script to process eBird observations into user data [-h]
                                                           [--input_csv INPUT_CSV]
                                                           [--cores CORES]
                                                           [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --input_csv INPUT_CSV, -i INPUT_CSV
                        path to observations .csv file
  --cores CORES, -c CORES
                        number of cores for parallel processing
  --output OUTPUT, -o OUTPUT
                        path to output csv file
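
The walkthrough does not say how the centroid is defined. A quadratic cost is consistent with comparing every observation against every other one, as a medoid does, so the sketch below uses a medoid as a stand-in. The function names, the feature names, and the use of plain Euclidean distance on latitude and longitude are assumptions, not the logic of `process_ebird.py`.

```python
import math
from itertools import groupby
from operator import itemgetter


def medoid(points):
    """Return the point with the smallest total distance to all other points.

    Every point is compared with every other point, so the cost is O(n^2).
    """
    best_point, best_total = None, math.inf
    for candidate in points:
        total = sum(math.dist(candidate, other) for other in points)
        if total < best_total:
            best_point, best_total = candidate, total
    return best_point


def user_features(observations):
    """Collapse (observer_id, latitude, longitude) rows into one record per observer."""
    features = {}
    # groupby only merges adjacent rows, so sort by observer ID first.
    rows = sorted(observations, key=itemgetter(0))
    for observer_id, group in groupby(rows, key=itemgetter(0)):
        points = [(lat, lon) for _, lat, lon in group]
        features[observer_id] = {
            "n_observations": len(points),
            "centroid": medoid(points),
        }
    return features
```

Because each observer is processed independently after the sort, the per-user work can be handed to separate processes, which is presumably what the `--cores` option controls.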


## 3. Extract pairs of users <a id='step3'></a>
> Reads the observations *.csv* file from step 1 and the user features from step 2 to create a *.csv* with the subset of users that have paired eBird activity. Pairs are found by looking for users that share the same unique **GROUP IDENTIFIER** in the observations data. Usage:

In [11]:
!python utils/data_processing/extract_pairs.py -h

usage: Script to get all pairs of users within observations
       [-h] [--input_obs INPUT_OBS] [--input_users INPUT_USERS]
       [--cores CORES] [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --input_obs INPUT_OBS, -i INPUT_OBS
                        path to observations .csv file
  --input_users INPUT_USERS, -u INPUT_USERS
                        path to users .csv file
  --cores CORES, -c CORES
                        number of cores for parallel processing
  --output OUTPUT, -o OUTPUT
                        path to output csv file
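
The pairing rule above can be sketched in a few lines: collect the observers seen under each group identifier, then emit every two-observer combination within a group. The function name `find_pairs` and the tuple-based input are hypothetical, chosen to keep the example self-contained rather than to mirror `extract_pairs.py`.

```python
from collections import defaultdict
from itertools import combinations


def find_pairs(observations):
    """Return the set of observer pairs that share a group identifier.

    `observations` is an iterable of (observer_id, group_identifier) tuples.
    Rows with an empty group identifier belong to checklists that were not shared.
    """
    observers_by_group = defaultdict(set)
    for observer_id, group_id in observations:
        if group_id:
            observers_by_group[group_id].add(observer_id)

    pairs = set()
    for observers in observers_by_group.values():
        # Sorting first makes (a, b) and (b, a) collapse into a single pair.
        pairs.update(combinations(sorted(observers), 2))
    return pairs
```

A group with a single observer produces no combinations, so it drops out without special handling.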


## 4. Create georeferenced dataset <a id='step4'></a>
> Converts the latitude and longitude columns of the user features *.csv* from step 2 into shapely Points and writes the resulting data frame as a *.shp* file readable by GIS software and geopandas. The shapefile is used to filter matches by distance in the app. See usage:

In [13]:
!python utils/data_processing/get_shapefile.py -h

usage: copies a .csv dataframe with latitude and longitude columns into a GIS shapefile
       [-h] [--input_csv INPUT_CSV] [--output_shp OUTPUT_SHP]

optional arguments:
  -h, --help            show this help message and exit
  --input_csv INPUT_CSV, -i INPUT_CSV
                        Path to .csv file.
  --output_shp OUTPUT_SHP, -o OUTPUT_SHP
                        path to output shapefile
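
To make the distance filtering concrete, here is a standard-library sketch that keeps users within a given radius of a point using the haversine formula. It swaps a pure-Python calculation in for the geopandas and shapely workflow the app relies on, and the function names and the `users` mapping are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(origin, users, max_km):
    """Keep the users whose (latitude, longitude) lies within max_km of origin."""
    return [name for name, (lat, lon) in users.items()
            if haversine_km(origin[0], origin[1], lat, lon) <= max_km]
```

Storing the users as georeferenced points lets a GIS library answer the same question with spatial operations instead of a scan over every user.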


## 5. Find user names using eBird API <a id='step5'></a>
> Uses checklist identifiers from user features (step 2) to find user profile names with the eBird API and add them to the georeferenced dataset (step 4). See usage:

In [14]:
!python utils/data_processing/add_user_names.py -h

usage: Adds users ID column to users shapefile [-h] [--users_shp USERS_SHP]
                                               [--counties_shp COUNTIES_SHP]

optional arguments:
  -h, --help            show this help message and exit
  --users_shp USERS_SHP, -u USERS_SHP
                        path to users shapefile
  --counties_shp COUNTIES_SHP, -c COUNTIES_SHP
                        path to counties shapefile
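
A lookup of this kind might be sketched as below. It assumes the eBird API 2.0 "view checklist" endpoint, the `X-eBirdApiToken` request header, and a `userDisplayName` field in the response; check all three against the current eBird API documentation before relying on them. The function names are hypothetical and not taken from `add_user_names.py`.

```python
import json
import urllib.request

# Assumed endpoint: eBird API 2.0 "view checklist".
CHECKLIST_URL = "https://api.ebird.org/v2/product/checklist/view/{checklist_id}"


def build_request(checklist_id, api_token):
    """Build (but do not send) the request for one checklist."""
    return urllib.request.Request(
        CHECKLIST_URL.format(checklist_id=checklist_id),
        headers={"X-eBirdApiToken": api_token},  # assumed header name
    )


def display_name(payload):
    """Pull the observer's display name from a decoded response, if present."""
    return payload.get("userDisplayName")  # assumed field name


def fetch_display_name(checklist_id, api_token):
    """Send the request and return the display name.

    Needs network access and a valid eBird API key.
    """
    with urllib.request.urlopen(build_request(checklist_id, api_token)) as response:
        return display_name(json.load(response))
```

Keeping request construction and response parsing in separate functions makes both easy to check without touching the network.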


## 6. Scrape user profiles from eBird <a id='step6'></a>
> Uses webbot, the checklist identifiers from step 2, and the user profile names from step 5 to find a link to the public profile of each user. Defaults to the unique checklist ID when a profile is not found (only ~25% of eBird users currently have public profiles). The profile column is added to the *.shp* file from step 4 and is passed on to the recommendations. See usage:

In [15]:
!python utils/data_processing/get_ebird_profile.py -h

usage: Uses webbot to extract user profile urls from ebird [-h]
                                                           [--input_users INPUT_USERS]
                                                           [--output_txt OUTPUT_TXT]

optional arguments:
  -h, --help            show this help message and exit
  --input_users INPUT_USERS, -i INPUT_USERS
                        path to users dataframe with checklist IDs to search
                        for profiles
  --output_txt OUTPUT_TXT, -o OUTPUT_TXT
                        path to .txt file where user profile urls will be
                        written to
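
The fallback rule described above, a profile link when one exists and the checklist ID otherwise, can be sketched independently of the scraping itself. The function name, the two input mappings, and the profile URL in the usage example are made up for illustration.

```python
def resolve_profiles(users, found_profiles):
    """Map each observer to a profile URL, or to their checklist ID as a fallback.

    `users` maps observer ID -> checklist ID.
    `found_profiles` maps checklist ID -> profile URL, and only contains the
    checklists for which the scraper found a public profile.
    """
    return {
        observer: found_profiles.get(checklist_id) or checklist_id
        for observer, checklist_id in users.items()
    }
```

For example, `resolve_profiles({"obs1": "S1", "obs2": "S2"}, {"S1": "https://ebird.org/profile/abc"})` keeps the URL for `obs1` and falls back to `"S2"` for `obs2`.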
