# Quickstart


Let's package the "[Trends in Atmospheric Carbon Dioxide](https://gml.noaa.gov/ccgg/trends/global.html)" dataset. It is published as a CSV file and updated daily. We can either package the external dataset where it lives, or copy the data and package the copy instead.

In [3]:
from frictionless import Package


## Packaging External Data

Let's start with the simplest option: packaging the external dataset. To do that, we wrap the dataset in a [Frictionless Data Package](https://specs.frictionlessdata.io/data-package/) by creating a `datapackage.yaml` file and adding the dataset as a resource.

In [4]:
%%writefile /tmp/external_data_datapackage.yaml
name: co2-mm-mlo
title: Trends in Atmospheric Carbon Dioxide
resources:
  - name: co2_trend_gl
    type: table
    path: https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_trend_gl.csv
    scheme: https
    format: csv
    mediatype: text/csv
    encoding: utf-8
    dialect:
      headerRows:
        - 42
    schema:
      fields:
        - name: year
          type: string
        - name: month
          type: string
        - name: day
          type: string
        - name: smoothed
          type: string
        - name: trend
          type: string

Overwriting /tmp/external_data_datapackage.yaml
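The `headerRows` entry in the dialect says that the column names sit on line 42 of the remote file rather than on line 1. As a rough illustration of that idea (not how Frictionless implements it), the standard-library sketch below reads a tiny made-up CSV whose header is on line 3. Both the sample text and the `read_with_header_row` helper are assumptions for illustration.

```python
import csv
import io


def read_with_header_row(text, header_row):
    """Return (header, records), treating the 1-based `header_row` line as the header.

    Lines before the header are discarded, which mimics the effect of a
    `headerRows` dialect setting. Hypothetical helper, not a Frictionless API.
    """
    lines = text.splitlines()
    # Keep the header line and everything after it.
    body = "\n".join(lines[header_row - 1:])
    reader = csv.reader(io.StringIO(body))
    header = next(reader)
    # Skip empty rows; values stay as strings, like the schema above declares.
    records = [dict(zip(header, row)) for row in reader if row]
    return header, records


# Made-up sample: two preamble lines, then the header on line 3.
SAMPLE = """# preamble line one
# preamble line two
year,month,day,smoothed,trend
2013,1,1,395.18,394.28
2013,1,2,395.20,394.29
"""

header, records = read_with_header_row(SAMPLE, header_row=3)
print(header)
print(records[0])
```

If the publisher adds or removes preamble lines, the header row number in the descriptor has to change with it.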


Now, any tool compatible with the Frictionless Data Package Specs can use the dataset.

In [5]:
p = Package("/tmp/external_data_datapackage.yaml")


In [6]:
p.resources[0].to_pandas().head()


Unnamed: 0,year,month,day,smoothed,trend
0,2013,1,1,395.18,394.28
1,2013,1,2,395.2,394.29
2,2013,1,3,395.22,394.29
3,2013,1,4,395.24,394.3
4,2013,1,5,395.27,394.31


As you can see, the package simply points to the URL of the dataset. The data is not duplicated, so the package is not self-contained. This is the simplest option, but it has some drawbacks:

- If the original dataset is deleted, the package will break.
- If the original dataset is updated in a way that changes its location or layout, the package will break, and you need to manually update it to point to the new dataset.
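One way to notice that an upstream file has changed is to record a content hash when you create the package and compare it later. The sketch below uses only `hashlib`; the sample bytes and the helper names `sha256_of` and `has_changed` are made up for illustration, and in practice the bytes would come from downloading the resource.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def has_changed(data: bytes, recorded_hash: str) -> bool:
    """True if `data` no longer matches the hash recorded earlier."""
    return sha256_of(data) != recorded_hash


# Made-up stand-in for the bytes downloaded when the package was created.
original = b"year,month,day,smoothed,trend\n2013,1,1,395.18,394.28\n"
recorded = sha256_of(original)

# Made-up stand-in for a later download with one more row.
updated = original + b"2013,1,2,395.20,394.29\n"

print(has_changed(original, recorded))  # False
print(has_changed(updated, recorded))   # True
```

A hash only tells you that something changed, not whether the change breaks the schema, so it works best as a trigger for re-checking the package.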

This could be a great starting point to make the dataset available to the ecosystem. That said, for most use cases, you will want to duplicate the data itself and package that instead.

## Saving the Data

Most datasets are updated infrequently or are relatively small. A good approach for these is to use GitHub Actions to bake the data into the GitHub repository as a CSV. That gives you versioned data alongside a versioned package.

This repository serves as an example of how to do that. You can check the GitHub Actions workflow and the `run.py` script to see how it's done. Basically, a job runs every day that downloads the dataset and overwrites the stored copy in full. This is a simple approach, but it works well for most use cases.
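The repository's actual `run.py` is not reproduced here. As a minimal sketch of what a download-and-overwrite job could look like, assuming only the standard library, the code below fetches a URL and replaces a local file. The `fetch` and `refresh` names and the output path are hypothetical, and the demonstration at the end uses a stub fetcher so it runs without network access.

```python
import os
import tempfile
import urllib.request

# URL taken from the package descriptor earlier in this guide.
DATA_URL = "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_trend_gl.csv"


def fetch(url: str) -> bytes:
    """Download the raw bytes at `url`."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def refresh(dest: str, url: str = DATA_URL, fetcher=fetch) -> int:
    """Download `url` and overwrite `dest`; return the number of bytes written."""
    data = fetcher(url)
    directory = os.path.dirname(os.path.abspath(dest))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file first, then swap it into place, so a
    # failed write never leaves a half-written dataset behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return len(data)


# Offline demonstration with a stub fetcher and a throwaway directory.
# A scheduled job would call refresh() with a path inside the repository
# and let it use the real `fetch`.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "data", "co2_trend_gl.csv")
written = refresh(demo_path, fetcher=lambda url: b"year,month,day\n2013,1,1\n")
print(written, "bytes written to", demo_path)
```

Committing the refreshed file after each run is what gives the data a version history alongside the package descriptor.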

## Creating a Catalog

A catalog is a collection of packages. It's a convenient way to organize and group packages, link similar ones together, and make them easy to discover.

In [7]:
from frictionless import Catalog


In [8]:
%%writefile /tmp/datapackage_catalog.yaml
datasets:
  - name: co2_trend_gl
    package: /tmp/external_data_datapackage.yaml
  - name: oil
    package: https://raw.githubusercontent.com/datasets/oil-prices/main/datapackage.json
  - name: rotten_tomatoes
    package:
      name: rotten_tomatoes
      resources:
        - name: rotten_tomatoes
          path: https://huggingface.co/datasets/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/rotten_tomatoes-train.parquet
  - name: ipfs_co2
    package: https://bafybeierqai7xkaxkyakdynw5uq7f2g4o5uz3kzvnh55thmazcff3bwgse.ipfs.w3s.link/ipfs_datapackage.yaml


Writing /tmp/datapackage_catalog.yaml
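The catalog above refers to packages in three ways: a local descriptor path, a remote descriptor URL, and an inline package definition. The sketch below mirrors part of that descriptor as a plain dictionary and classifies each entry. It is only an illustration of the structure; `describe_package_ref` is a hypothetical helper, not part of Frictionless.

```python
from urllib.parse import urlparse


def describe_package_ref(ref):
    """Classify how a catalog entry refers to its package."""
    if isinstance(ref, dict):
        return "inline"  # the package descriptor is embedded in the catalog
    if urlparse(ref).scheme in ("http", "https"):
        return "remote"  # descriptor is fetched from a URL
    return "local"  # descriptor is read from the filesystem


# Plain-dict mirror of three entries from the catalog above.
# The inline package's resources are left out here for brevity.
catalog = {
    "datasets": [
        {"name": "co2_trend_gl", "package": "/tmp/external_data_datapackage.yaml"},
        {
            "name": "oil",
            "package": "https://raw.githubusercontent.com/datasets/oil-prices/main/datapackage.json",
        },
        {
            "name": "rotten_tomatoes",
            "package": {"name": "rotten_tomatoes", "resources": []},
        },
    ]
}

kinds = {d["name"]: describe_package_ref(d["package"]) for d in catalog["datasets"]}
print(kinds)
# {'co2_trend_gl': 'local', 'oil': 'remote', 'rotten_tomatoes': 'inline'}
```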


In [9]:
c = Catalog("/tmp/datapackage_catalog.yaml")


In [15]:
c.get_dataset("oil").package.get_resource("brent-daily").to_pandas().tail(5)


Unnamed: 0,Date,Price
9101,2023-03-28,78.07
9102,2023-03-29,77.51
9103,2023-03-30,78.45
9104,2023-03-31,79.19
9105,2023-04-03,85.81
