# Parquet Example: GBIF Occurrence Data on AWS

[Apache Parquet](https://parquet.apache.org) is a column-based file format. [Dask](https://docs.dask.org/en/stable/) is a Python package for parallel computation. This example uses a Dask DataFrame to interact with Parquet data from the [Global Biodiversity Information Facility (GBIF) Species Occurrences dataset on AWS](https://aws.amazon.com/marketplace/pp/prodview-dvyemtksskta2?sr=0-1&ref_=beagle&applicationId=AWSMPContessa#resources). Details on using Dask DataFrames with Parquet data can be found [here](https://docs.dask.org/en/latest/dataframe-parquet.html).

1. Load dependencies and configure Dask to use its multithreaded scheduler

In [None]:
import dask
import dask.dataframe as dd


dask.config.set(scheduler='threads')

2. Create a DataFrame from the GBIF occurrence snapshot on AWS dated 2021-06-01

In [None]:
df = dd.read_parquet(
    "s3://gbif-open-data-af-south-1/occurrence/2021-06-01/occurrence.parquet/",
    storage_options={"anon": True},
    engine="pyarrow",
    parquet_file_extension=""
)
print(f"Number of partitions: {df.npartitions}")
print(f"Columns: {df.columns.tolist()}")
df.head()


3. Count the number of occurrences by country

(This takes several minutes to run.)

In [None]:
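# Keep only the columns of interest, then count rows per country.
country_counts = (
    df[['countrycode', 'specieskey']]
    .groupby(['countrycode'])
    .size()
    .compute()  # runs the computation and returns a pandas Series
    .sort_values(ascending=False)
)

print(country_counts)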

The Dask `DataFrame` mirrors a large part of the pandas `DataFrame` API, although not every pandas method is available. Read more about Dask DataFrames [here](https://docs.dask.org/en/stable/dataframe.html) and see example scripts [here](https://examples.dask.org/dataframe.html).