# Frictionless Data Packages

This notebook walks through a few interesting things we can do with a Data Package using the [Frictionless Data](https://frictionlessdata.io/) library.

In [4]:
%pip install frictionless duckdb --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [24]:
from frictionless import Package, Resource
import duckdb

import pandas as pd

## Using an existing Data Package

Here we load the [CO2 PPM - Trends in Atmospheric Carbon Dioxide](https://datahub.io/core/co2-ppm) package, which is published on DataHub.


In [26]:
package = Package('https://datahub.io/core/co2-ppm/datapackage.json')

In [27]:
resource = package.get_table_resource('co2-mm-mlo')

In [29]:
duckdb.sql(f"select * from '{resource.path}' order by Date desc limit 5")

┌─────────┬──────────────┬─────────┬──────────────┬────────┬────────────────┐
│  Date   │ Decimal Date │ Average │ Interpolated │ Trend  │ Number of Days │
│ varchar │    double    │ double  │    double    │ double │     int64      │
├─────────┼──────────────┼─────────┼──────────────┼────────┼────────────────┤
│ 2018-09 │     2018.708 │  405.51 │       405.51 │ 409.02 │             29 │
│ 2018-08 │     2018.625 │  406.99 │       406.99 │  408.9 │             30 │
│ 2018-07 │     2018.542 │  408.71 │       408.71 │ 408.32 │             27 │
│ 2018-06 │     2018.458 │  410.79 │       410.79 │ 408.49 │             29 │
│ 2018-05 │     2018.375 │  411.24 │       411.24 │ 407.91 │             24 │
└─────────┴──────────────┴─────────┴──────────────┴────────┴────────────────┘
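
A lookup such as `get_table_resource('co2-mm-mlo')` is, conceptually, a search by name through the descriptor's `resources` list. The sketch below illustrates that idea with the standard library only. The descriptor is a hypothetical, trimmed-down stand-in (its `path` values are invented), not the actual file served by DataHub, and `find_resource` is an illustrative helper rather than part of Frictionless.

```python
import json

# Hypothetical, trimmed-down descriptor for illustration only.
# The real datapackage.json on DataHub has more fields and resources.
DESCRIPTOR_JSON = """
{
  "name": "co2-ppm",
  "resources": [
    {"name": "co2-mm-mlo", "path": "data/co2-mm-mlo.csv", "format": "csv"},
    {"name": "co2-annmean-mlo", "path": "data/co2-annmean-mlo.csv", "format": "csv"}
  ]
}
"""


def find_resource(descriptor: dict, name: str) -> dict:
    """Return the first resource whose name matches, or raise KeyError."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == name:
            return resource
    raise KeyError(f"no resource named {name!r}")


descriptor = json.loads(DESCRIPTOR_JSON)
resource_names = [r["name"] for r in descriptor["resources"]]
selected = find_resource(descriptor, "co2-mm-mlo")
print(resource_names)
print(selected["path"])
```

Listing the names first is a cheap way to discover what a package contains before asking for a specific resource.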

## Packaging external data

In this case, NOAA's monthly mean CO2 measurements from Mauna Loa. NOAA maintains the CSV file; we're going to package it!

In [38]:
resource = Resource('https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv')

Starting with the simplest possible Package:

In [39]:
package = Package(
  name="co2-mm-mlo",
  title="Trends in Atmospheric Carbon Dioxide",
  resources=[resource]
)

In [40]:
print(package.to_yaml())

$frictionless: package/v2
name: co2-mm-mlo
title: Trends in Atmospheric Carbon Dioxide
resources:
  - name: co2_mm_mlo
    type: table
    path: https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv
    scheme: https
    format: csv
    mediatype: text/csv
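
Most of the fields in this output were never passed to `Package` or `Resource`; they appear to be derived from the URL itself. The following is a rough sketch of that kind of inference using only the standard library. It is an assumption about the general idea, not Frictionless's actual implementation, and `infer_resource_fields` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse


def infer_resource_fields(url: str) -> dict:
    """Guess name, scheme and format from a URL (illustrative helper only)."""
    parsed = urlparse(url)
    filename = posixpath.basename(parsed.path)      # "co2_mm_mlo.csv"
    stem, extension = posixpath.splitext(filename)  # ("co2_mm_mlo", ".csv")
    return {
        "name": stem.lower(),
        "path": url,
        "scheme": parsed.scheme,
        "format": extension.lstrip(".").lower(),
    }


inferred = infer_resource_fields(
    "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv"
)
print(inferred)
```

Seen this way, only `name`, `title` and the resource URL had to be supplied by hand, which is why the package above could be declared in a few lines.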



In [42]:
duckdb.sql(f"select * from '{resource.path}' order by year desc limit 5")

┌───────┬───────┬──────────────┬─────────┬────────────────┬───────┬────────┬────────┐
│ year  │ month │ decimal date │ average │ deseasonalized │ ndays │  sdev  │  unc   │
│ int64 │ int64 │    double    │ double  │     double     │ int64 │ double │ double │
├───────┼───────┼──────────────┼─────────┼────────────────┼───────┼────────┼────────┤
│  2023 │     1 │    2023.0417 │  419.47 │         419.14 │    31 │    0.4 │   0.14 │
│  2023 │     2 │     2023.125 │  420.41 │         419.49 │    25 │   0.64 │   0.25 │
│  2022 │     1 │    2022.0417 │  418.19 │         417.86 │    29 │   0.73 │   0.26 │
│  2022 │     2 │     2022.125 │  419.28 │         418.36 │    27 │   0.92 │   0.34 │
│  2022 │     3 │    2022.2083 │  418.81 │         417.32 │    30 │   0.78 │   0.27 │
└───────┴───────┴──────────────┴─────────┴────────────────┴───────┴────────┴────────┘
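
Because the query orders by `year` only, rows within the same year come back in no guaranteed order, so the five rows shown are not necessarily the five most recent months. Adding `month` as a second sort key fixes that. The sketch below demonstrates the idea with Python's built-in `sqlite3` module (swapped in for DuckDB so it runs without extra dependencies or network access) on a handful of rows copied from the output above; the `ORDER BY` clause itself is standard SQL.

```python
import sqlite3

# A few (year, month, average) rows copied from the query output above.
rows = [
    (2023, 1, 419.47),
    (2023, 2, 420.41),
    (2022, 1, 418.19),
    (2022, 2, 419.28),
    (2022, 3, 418.81),
]

con = sqlite3.connect(":memory:")
con.execute("create table co2 (year integer, month integer, average real)")
con.executemany("insert into co2 values (?, ?, ?)", rows)

# Sort on both keys so the newest month always comes first.
latest = con.execute(
    "select year, month, average from co2 "
    "order by year desc, month desc limit 3"
).fetchall()
con.close()
print(latest)
```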

For this to become an actual package, its descriptor needs to be published somewhere others can load it from.

In [45]:
package.to_yaml("/tmp/datapackage.yaml")

'$frictionless: package/v2\nname: co2-mm-mlo\ntitle: Trends in Atmospheric Carbon Dioxide\nresources:\n  - name: co2_mm_mlo\n    type: table\n    path: https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv\n    scheme: https\n    format: csv\n    mediatype: text/csv\n'

In [50]:
%%bash --out temp_file_path
curl --upload-file /tmp/datapackage.yaml https://transfer.sh/datapackage.yaml

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   306  100    43  100   263     45    280 --:--:-- --:--:-- --:--:--   325


In [53]:
temp_file_path

'https://transfer.sh/M5Ower/datapackage.yaml'
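
The `%%bash --out temp_file_path` magic stores the cell's standard output in the Python variable `temp_file_path`. Outside a notebook, a similar capture can be sketched with `subprocess.run`. To keep the example runnable offline, the command below is a stand-in that merely prints a placeholder URL (an assumption for illustration) instead of uploading anything. Stripping the captured text guards against a trailing newline before the value is handed to `Package`.

```python
import subprocess
import sys

# Stand-in for the upload command: a child Python process that prints a
# placeholder URL. In practice this would be the curl invocation.
command = [sys.executable, "-c", "print('https://example.invalid/datapackage.yaml')"]

completed = subprocess.run(command, capture_output=True, text=True, check=True)

# Captured stdout usually ends with a newline; strip it before reuse.
uploaded_url = completed.stdout.strip()
print(uploaded_url)
```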

In [52]:
package = Package(temp_file_path)

In [49]:
duckdb.sql(f"select * from '{package.get_resource('co2_mm_mlo').path}' order by year desc limit 5")

┌───────┬───────┬──────────────┬─────────┬────────────────┬───────┬────────┬────────┐
│ year  │ month │ decimal date │ average │ deseasonalized │ ndays │  sdev  │  unc   │
│ int64 │ int64 │    double    │ double  │     double     │ int64 │ double │ double │
├───────┼───────┼──────────────┼─────────┼────────────────┼───────┼────────┼────────┤
│  2023 │     1 │    2023.0417 │  419.47 │         419.14 │    31 │    0.4 │   0.14 │
│  2023 │     2 │     2023.125 │  420.41 │         419.49 │    25 │   0.64 │   0.25 │
│  2022 │     1 │    2022.0417 │  418.19 │         417.86 │    29 │   0.73 │   0.26 │
│  2022 │     2 │     2022.125 │  419.28 │         418.36 │    27 │   0.92 │   0.34 │
│  2022 │     3 │    2022.2083 │  418.81 │         417.32 │    30 │   0.78 │   0.27 │
└───────┴───────┴──────────────┴─────────┴────────────────┴───────┴────────┴────────┘