# Pull Sort

This code is used to make pull requests from inaturalist with authorization. It takes a list of taxonIds makes a dictionary out of it with taxonId as the key and the scientific name as the value. Then it gets the authorization via getaccesstoken to allow for pulling of all data rather than being capped. These two functions feed their data into dataPuller to iterate through the dictionary and use the authorization to pull each species data and store it in a list before writing it out to a taxonID titled csv with headers of id,location,observed_on, and time_observed_at as this is the only data we are interested in.

The Code Works with the functions: 
* listOfIdNames 
* getAccessToken
* dataPuller

imports:
* requests for http requests
* json to parse jason return
* csv to write csv files
* subprocess to allow clearing of screen during runs


In [1]:
import requests #required for http requests
import json #used to translate returned data
import csv #used to write out csvs
import subprocess as sp #used for clearning console during runs
import os #used for directory creation

## listOfIdsNames

This function works similar to data puller. You have a two column text file that has taxonID first then species scientific name second that a dictionary is created from and then returned to main to be used in the cleaner function.

*file - takes the file called taxon-Ids.txt

returns the created dictionary with the taxonID as the key and species scientific name as the value

In [2]:
def listOfIdsNames(filename):
    file = open(filename).readlines()
    dictionary = {}
    count = 0
    for data in file:
        data = data.strip().split()
        if(count != 0):
            dictionary[data[0]] = data[1]+" "+data[2] #key is taxon ID and value is scientific name
        else:       
            count = 1;
    return dictionary

## getAccessToken
This is the key needed to allow for uncapped data pulls.
* app_id and app_secret are the 24 hour tokens generated by the user on inaturalist
* username and password of an inaturalist user
* payload - is the dictionary of all the above values to allow for pass in for response
* response -  uses requests to get post value from the site with payload values to get the Bearer token
* headers - the actual authorization that is returned to be used in the dataPuller function

In [3]:
def getAccessToken(app_id,app_secret,username,password):
    
    site = "https://www.inaturalist.org"
    app_id = app_id #granted by inaturalist after logging in
    app_secret = app_secret #granted by inaturalist after logging in
    username = username #your username
    password = password #your password

    #payload is all required data changed into a dictionary to allow access with requests
    payload = {
        'client_id': app_id,
        'client_secret': app_secret,
        'grant_type': "password",
        'username': username,
        'password': password
    }

    response = requests.post(("%s/oauth/token" % site), payload) #the 24 hour token is generated 
    token = response.json()["access_token"] #access token is stored
    headers = {"Authorization": "Bearer %s" % token} #token is used for headers to finally allow access

    return headers

## dataPuller
This function takes a dictionary and authorization string and then begains to make get requests from inaturalists servers. There is a loading screen set up to let the user know the code is running and where it is currently at. The first get request provides how many pages are available for the requested taxonId. The pages are stored so a for loop can run that many times to make the request for each page, then runs a loop for each entry and saves the data in a list.Once all pages are completed it writes it out to a csv with the taxonId as the filename. id,location,observed_on, and time_observed_at are keys within the data that we pull from but you are able to get more data by simply adding another line with a valid key and appending it to the list. Ensure if you do so the header for the csv is updated with this new key value. 

### parameters and variables

* species - a dictionary passed it that is set up with the taxonID as the key and the scientific name as the value
* headers - the authorization string required to get uncapped data requests

* obs_data - note that we use a get requests using the inaturlist url and specify we want observations for taxonIDs that is then              concatinated in and that we also only accept research grade observations at this time
* jData - converts the obs_data into a json format (like a dictionary) so that we can iterate through
* templist - is a list of lists of all data from a species. This format is used as csv writer needs to iterate through data.                  Once a species taxonID is complete and written to a csv the list empties for the next species.
* holder - this is a list that holds data for one observation and is then appended to the templist 


** The if statments that creates our page amount is needed in the event that there are observations counts that are say 32 observations. If this is the case there are two pages that need to be pulled from where as if there are 29 observations only one page needs to be pulled from







### Data Struct
This function takes a dictionary provided either by the user in main or from listOfIdsNames and then begins going through all data within the directory with taxonIds that are within the dictionarys key values. The data structure works in a dictionary of dictionaries. 

* data -> 
        * id ->
                * observationData = [id,lat,long]
                
Once complete a triple nested loop runs with the following structure

* id ->
        * years ->
                    * months ->
                                *geoloation data write out for month
                                

As you can see each month for the taxonID is written out as well as yearly and total but can be modified for just monthly or yearly etc.
                                
                


In [7]:
def dataPuller(species,headers1):
    print("Running...")

    run = 1
    for butterfly in species: #iterate through the dictionary pulling the taxon id observations
        
        percent = round((run/len(species))*100,1) #used to track progress of total pull
        
        print(str(percent)+"%")
       
        #used to get the requested data, obs_data allows us to determine how many pages there are so we can iterate through them all
        obs_data = requests.get(("http://api.inaturalist.org/v1/observations?taxon_id=" + str(butterfly) +"&quality_grade=research&page=1"), headers=headers1)
        jData = json.loads(obs_data.text)
        total_Observations = int(jData["total_results"])
        pages=0
        if( total_Observations % 30 != 0): #needed if there is exactly 30 per page or if there is an odd count say 31 therefore 2 pages
            pages = (total_Observations//30)+1
        else:
            pages = total_Observations//30

        run2 = 1
        tempList = [] #tempList will store all requested data for one species before it is reset for the next
        for i in range(1,pages): #pulls each page up one at a time 
            
            percent2 = round((run2/pages)*100,1)
            print(str(percent)+"%")
            print("... "+str(run2)+" of "+str(pages)+" pages")

            obs_data = requests.get(("http://api.inaturalist.org/v1/observations?taxon_id=" + str(butterfly) +"&quality_grade=research&page="+str(i)), headers=headers1) #TODO ids

            data = json.loads(obs_data.text)
            #works through each record per page
            for records in data['results']:

                holder = []# stores taxonID, lat,long, and date of observation per entry before being cleared

                
                holder.append(str(butterfly))
                
                
                if(records["location"] is not None):
                    latlong = records["location"].split(",")

                    holder.append(latlong[0])
                    holder.append(latlong[1])
                else:
                    holder.append("")
                    holder.append("")
                holder.append(records["observed_on"])
                
                tempList.append(holder) #temp list takes holders data as a list of lists

            run2+=1

            sp.call('cls',shell=True)
        run+=1
        print('Pulling Complete')
#With the pull complete for a species it is then written out to total, yearly, and month observation csv files
#---------DATA STRUCT----------DATA STRUCT---------------------
        data[butterfly] = {} #this will be a double nested dictionary containg year info, and month info.
        print(" Creating data structure")
        total = len(tempList)
        upper = 1
        for date_info in tempList: #a for loop to prep the tempList into organized data by year and month
            print(upper/total)
            upper +=1
            print(date_info)

            dates = date_info[3].split("-") #dates are in a yyyy-mm-dd format and thus are split to better sort the data

            if dates[0] not in data[butterfly]: #if the year isn't present, create a key with the year value and an empty dictionary for months
                data[butterfly][dates[0]] = {}
            if dates[1] not in data[butterfly][dates[0]]: #if the month in that year with data isn't present, create a key of that month with a list value
                data[butterfly][dates[0]][dates[1]]=[]

            observationData = [] #list to store observation recoreds of taxonid lat and long data
            observationData.append(butterfly)
            observationData.append(date_info[1])
            observationData.append(date_info[2])
            data[butterfly][dates[0]][dates[1]].append(observationData)
        sp.call('cls',shell=True)#clear the console
        print("Complete")

        masterWrite = [] #a list is required to write out a csv, will hold all needed data

        directory = 'data/inaturalist/'+str(butterfly)

        if not os.path.exists(directory):#checks for taxon id, year and month folders, creates them as needed
            os.makedirs(directory)

        print("Placing data into data structure")
        total = len(data[butterfly])
        upper = 1
        for years in data[butterfly]: #for every year store the data of the butter fly by each month
            print(upper/total)
            upper+=1
            directory = 'data/inaturalist/'+str(butterfly)
            directory+='/'+str(years)

            if not os.path.exists(directory):
                os.makedirs(directory)

            for months in data[butterfly][years]:

                directory = 'data/inaturalist/'+str(butterfly)+'/'+str(years)
                directory += "/"+str(months)

                if not os.path.exists(directory):
                    os.makedirs(directory)

                outtie = open('data/inaturalist/'+str(butterfly)+'/'+str(years)+'/'+str(months)+'/'+str(butterfly)+"-"+str(years)+"-"+str(months)+".csv",'w',newline='',encoding='utf-8')
                headers = ["taxonID", "latitude", "longitude"]
                writer = csv.writer(outtie)
                writer.writerow(headers)

                for geo in data[butterfly][years][months]:
                    masterWrite.append(geo)
                    writer.writerow(geo) #writes the csv for specified year and month
                outtie.close()
            sp.call('cls',shell=True)

            outtie = open('data/inaturalist/'+str(butterfly)+"/" + str(years)+'/'+str(butterfly)+"-"+str(years)+".csv",'w',newline='',encoding='utf-8')
            headers = ["taxonID", "latitude", "longitude"]
            writer = csv.writer(outtie)
            writer.writerow(headers)
            print("Final write outs")
            for all_data in masterWrite:
                writer.writerow(all_data) #writes data for entire year
            outtie.close()
            masterWrite = [] #resets masterWrite for next year

#---------DATA STRUCT----------DATA STRUCT---------------------
        with open('data/inaturalist/'+str(butterfly)+'/'+str(butterfly)+".csv", "w",encoding='utf-8') as file:
            writer = csv.writer(file)
            headers = ["taxonID", "latitude", "longitude","date"]
            writer.writerow(headers)
            writer.writerows(tempList) #writes a master file for the species of all data
        tempList = [] #resets templist for next species


In [5]:
def main():

    app_id = '69345738565c2bd88f2dafa49857e426ad01918d5e5a72fcdde40d258f22b49c'
    app_secret = '62899ac1d355f1743b84db1e21e94f2bc40de4915cb7a2cb2afaeab41dfb0de8'
    username = 'ornelaseduardo'
    password = 'qb7A1PAl4eRp6rPh'
    file = "taxon-Ids.txt"
    print("Running")
    file = "taxon-ids.txt"
    butterflys = listOfIdsNames(file)
    key = getAccessToken(app_id,app_secret,username,password)
    dataPuller(butterflys,key)
    print("Complete")

In [8]:
main()

Running
Running...
0.1%
0.1%
... 1 of 2 pages
Pulling Complete
 Creating data structure
0.03333333333333333
['58505', '48.7779153812', '-93.7311719839', '2015-07-21']
0.06666666666666667
['58505', '44.48526993', '-77.2924661636', '2017-08-11']
0.1
['58505', '38.8009463', '-76.6928959', '2017-09-23']
0.13333333333333333
['58505', '40.3734690873', '-79.387854784', '2017-09-18']
0.16666666666666666
['58505', '39.3154149', '-76.8752726', '2017-09-04']
0.2
['58505', '35.8865083333', '-79.0173333333', '2017-08-12']
0.23333333333333334
['58505', '42.340885', '-73.2342783333', '2017-08-02']
0.26666666666666666
['58505', '43.3475488623', '-80.1188340041', '2017-07-27']
0.3
['58505', '43.4054625847', '-76.3467807996', '2017-07-03']
0.3333333333333333
['58505', '40.0989178682', '-83.1606474989', '2017-07-24']
0.36666666666666664
['58505', '42.6410016667', '-72.2234116667', '2017-07-19']
0.4
['58505', '43.3293297558', '-80.1221752167', '2017-07-11']
0.43333333333333335
['58505', '40.481637589', '-

['81559', '36.08784', '-87.025505', '2017-07-23']
0.2872340425531915
['81559', '31.6235285', '-91.2505447', '2016-04-02']
0.2875886524822695
['81559', '41.1389694214', '-81.5744171143', '2017-07-25']
0.28794326241134754
['81559', '38.5897861228', '-92.232603728', '2017-07-25']
0.28829787234042553
['81559', '43.3741860884', '-80.3736348265', '2017-07-25']
0.2886524822695035
['81559', '42.4555462045', '-70.9887209693', '2017-07-09']
0.28900709219858156
['81559', '36.1799366667', '-82.7793283333', '2017-07-15']
0.28936170212765955
['81559', '36.1804166667', '-82.77938', '2017-07-20']
0.2897163120567376
['81559', '44.9619400279', '-73.1596553558', '2015-06-19']
0.2900709219858156
['81559', '41.7832522', '-87.578253', '2015-07-03']
0.2904255319148936
['81559', '41.7832522', '-87.578253', '2015-07-03']
0.2907801418439716
['81559', '36.092', '-86.557', '2017-07-24']
0.29113475177304965
['81559', '41.7832522', '-87.578253', '2015-06-16']
0.29148936170212764
['81559', '38.8848467963', '-90.0173

0.7304964539007093
['81559', '41.142654', '-81.60762', '2016-07-03']
0.7308510638297873
['81559', '38.7203616667', '-77.2109616667', '2016-07-10']
0.7312056737588652
['81559', '40.32232', '-75.190267', '2016-07-05']
0.7315602836879432
['81559', '36.766029', '-79.965718', '2016-07-09']
0.7319148936170212
['81559', '40.3082166667', '-77.0275916667', '2016-07-02']
0.7322695035460993
['81559', '40.3055883333', '-77.007705', '2016-07-02']
0.7326241134751773
['81559', '32.9944687839', '-97.7788768169', '2015-08-23']
0.7329787234042553
['81559', '36.18430627', '-97.17797402', '2016-07-01']
0.7333333333333333
['81559', '42.3230164462', '-84.2899417877', '2016-06-29']
0.7336879432624114
['81559', '42.2189816667', '-71.1188733333', '2016-07-06']
0.7340425531914894
['81559', '40.493796', '-74.419584', '2015-06-23']
0.7343971631205674
['81559', '42.4293866667', '-71.2255583333', '2016-07-06']
0.7347517730496453
['81559', '41.104191', '-81.902282', '2016-06-29']
0.7351063829787234
['81559', '36.408

Final write outs
0.29411764705882354
Final write outs
0.35294117647058826
Final write outs
0.4117647058823529
Final write outs
0.47058823529411764
Final write outs
0.5294117647058824
Final write outs
0.5882352941176471
Final write outs
0.6470588235294118
Final write outs
0.7058823529411765
Final write outs
0.7647058823529411
Final write outs
0.8235294117647058
Final write outs
0.8823529411764706
Final write outs
0.9411764705882353
Final write outs
1.0
Final write outs
0.4%
0.4%
... 1 of 33 pages
0.4%
... 2 of 33 pages
0.4%
... 3 of 33 pages
0.4%
... 4 of 33 pages
0.4%
... 5 of 33 pages
0.4%
... 6 of 33 pages
0.4%
... 7 of 33 pages
0.4%
... 8 of 33 pages
0.4%
... 9 of 33 pages
0.4%
... 10 of 33 pages
0.4%
... 11 of 33 pages
0.4%
... 12 of 33 pages
0.4%
... 13 of 33 pages
0.4%
... 14 of 33 pages
0.4%
... 15 of 33 pages
0.4%
... 16 of 33 pages
0.4%
... 17 of 33 pages
0.4%
... 18 of 33 pages
0.4%
... 19 of 33 pages
0.4%
... 20 of 33 pages


KeyboardInterrupt: 