# Parquet Example: GBIF Occurrence Data on AWS

[Apache Parquet](https://parquet.apache.org) is a columnar file format. [Dask](https://docs.dask.org/en/stable/) is a Python library for parallel computing. This example uses a Dask DataFrame to work with Parquet data from the [Global Biodiversity Information Facility (GBIF) Species Occurrences dataset on AWS](https://aws.amazon.com/marketplace/pp/prodview-dvyemtksskta2?sr=0-1&ref_=beagle&applicationId=AWSMPContessa#resources). For details on using Dask DataFrames with Parquet data, see the [Dask Parquet documentation](https://docs.dask.org/en/latest/dataframe-parquet.html).

1. Load dependencies and configure Dask to use the multithreaded scheduler

In [5]:
import dask
import dask.dataframe as dd


dask.config.set(scheduler='threads')

<dask.config.set at 0x134b03da0>

2. Create a Dask DataFrame from the GBIF occurrence data on AWS for June 2021

In [6]:
df = dd.read_parquet(
    "s3://gbif-open-data-af-south-1/occurrence/2021-06-01/occurrence.parquet/",
    storage_options={"anon": True},
    engine="pyarrow",
    parquet_file_extension=""
)
print(f"Number of partitions: {df.npartitions}")
print(f"Columns: {df.columns.tolist()}")
df.head()


Number of partitions: 930
Columns: ['gbifid', 'datasetkey', 'occurrenceid', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'infraspecificepithet', 'taxonrank', 'scientificname', 'verbatimscientificname', 'verbatimscientificnameauthorship', 'countrycode', 'locality', 'stateprovince', 'occurrencestatus', 'individualcount', 'publishingorgkey', 'decimallatitude', 'decimallongitude', 'coordinateuncertaintyinmeters', 'coordinateprecision', 'elevation', 'elevationaccuracy', 'depth', 'depthaccuracy', 'eventdate', 'day', 'month', 'year', 'taxonkey', 'specieskey', 'basisofrecord', 'institutioncode', 'collectioncode', 'catalognumber', 'recordnumber', 'identifiedby', 'dateidentified', 'license', 'rightsholder', 'recordedby', 'typestatus', 'establishmentmeans', 'lastinterpreted', 'mediatype', 'issue']


| | gbifid | datasetkey | occurrenceid | kingdom | phylum | class | order | family | genus | species | ... | identifiedby | dateidentified | license | rightsholder | recordedby | typestatus | establishmentmeans | lastinterpreted | mediatype | issue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1321272647 | 821cc27a-e3bb-4bc5-ac34-89ada245069d | http://n2t.net/ark:/65665/3b40089e7-fe0c-438f-... | Animalia | Chordata | Actinopterygii | Characiformes | Curimatidae | Curimatella | Curimatella immaculata | ... | | | CC0_1_0 | | R. Vari et al. | | | 2021-05-28T10:40:34.061Z | [] | [OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_CO... |
| 1 | 1321281052 | 821cc27a-e3bb-4bc5-ac34-89ada245069d | http://n2t.net/ark:/65665/3b45f294d-215e-42ea-... | Animalia | Sipuncula | | | | | | ... | Ward, L. A. | | CC0_1_0 | | Texas Instruments For BLM / MMS | | | 2021-05-28T10:39:17.926Z | [] | [OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_CO... |
| 2 | 1321283384 | 821cc27a-e3bb-4bc5-ac34-89ada245069d | http://n2t.net/ark:/65665/3b47b2c59-b3ec-4c64-... | Plantae | Tracheophyta | Magnoliopsida | Asterales | Asteraceae | Piptocarpha | Piptocarpha tetrantha | ... | | | CC0_1_0 | | A. H. Liogier | | | 2021-05-28T10:39:55.596Z | [StillImage] | [OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_CO... |
| 3 | 1321284755 | 821cc27a-e3bb-4bc5-ac34-89ada245069d | http://n2t.net/ark:/65665/3b489464f-dcad-40f8-... | Plantae | Tracheophyta | Magnoliopsida | Vitales | Vitaceae | Cissus | Cissus polita | ... | | | CC0_1_0 | | J. Wen & H. Tombondray | | | 2021-05-28T10:40:01.495Z | [StillImage] | [OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_CO... |
| 4 | 1321286153 | 821cc27a-e3bb-4bc5-ac34-89ada245069d | http://n2t.net/ark:/65665/3b49c6db1-18bd-41ba-... | Animalia | Arthropoda | Malacostraca | Decapoda | Paguridae | Pseudopagurodes | Pseudopagurodes piliferus | ... | McLaughlin, Patsy A., Shannon Point Marine Center | | CC0_1_0 | | United States Fish Commission | | | 2021-05-28T10:39:17.939Z | [StillImage] | [OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_CO... |


3. Count the number of occurrences by country

(This takes several minutes to run.)

In [None]:
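# Count occurrence records per country code, largest first.
# Select only the needed columns before grouping; the result is a pandas Series.
counts_by_country = (
    df[['countrycode', 'specieskey']]
    .groupby('countrycode')
    .size()
    .compute()
    .sort_values(ascending=False)
)

print(counts_by_country)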

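Dask DataFrames implement a large subset of the pandas `DataFrame` API, so most pandas operations carry over, but they are evaluated lazily until `.compute()` is called. Read more in the [Dask DataFrame documentation](https://docs.dask.org/en/stable/dataframe.html) and see the [Dask DataFrame examples](https://examples.dask.org/dataframe.html).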