## Purpose
This lesson teaches how to use Frictionless to create a reproducible data workflow. You will start with a dataset that is missing some important information, and will end with a data package that can be published!

## Outline
- Start by looking at the dataset & discussing the metadata. Are there things we need to change with the dataset?
- Describe the data (infer metadata + a schema; edit the metadata + schema)
- Extract (read in the dataset according to the schema)
- Validate (check the data for errors)
- Transform (clean the data)
- Package (containerize the data + metadata/schema)

## Resources
- Dataset: https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459?file=10717186
- Code documentation: https://framework.frictionlessdata.io/
- Frictionless website: https://frictionlessdata.io/
- Frictionless Slack if you want to join :-) https://frictionlessdata.slack.com/messages/general
- Jupyter Notebook intro: https://datacarpentry.org/python-ecology-lesson/jupyter_notebooks/

So, what are we going to do today? We just saw how this dataset could be improved by making its metadata more machine-readable and more easily accessible. We'll be packaging the data, doing some light cleaning, and then getting the data ready to publish!

In [None]:
!pip install frictionless

# Describe

### Discussion

Let's start by looking at the dataset & discussing the metadata. Are there things we need to change with the dataset?

(Everyone opens the dataset to look at it.) What kind of metadata is missing from this? What info would we need to know to be able to use this data? For example: What are the units for weight? What does the species code mean? Are specific values used to mark missing data? (Note that the missing values are kind of interesting, but confusing.)
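
To make the missing-value question concrete, here is a minimal sketch in plain Python (no Frictionless involved) that counts empty cells per column. The sample rows are made up for illustration and only mimic the column layout of the dataset; the helper name `count_empty_cells` is hypothetical.

```python
import csv
import io

# Made-up sample rows that mimic the column layout of the dataset
# (the values are illustrative only, not real records)
sample = """record_id,species_id,sex,hindfoot_length,weight
1,NL,M,32,
2,NL,M,33,
3,DM,F,37,
4,DM,M,36,40
5,,,,
"""

def count_empty_cells(text):
    """Return a dict mapping each column name to its number of empty cells."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return counts

empty_counts = count_empty_cells(sample)
print(empty_counts)
```

A count like this shows where values are missing, but not why they are missing or how they are encoded. That is exactly the kind of information metadata should record.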

Can you find the metadata? https://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm (Walk them through how I found the metadata from the software carpentries link, which has the data archive link (https://esapubs.org/archive/ecol/E090/118/), which has a metadata file!) 

Does it have the metadata we need? (e.g. units for weight)

### Question: Is it easy to find the info you need in this file? Do you think it would be easy for a computer to parse/find info in this file?

We'll now use Frictionless to add some of the missing metadata.

In [None]:
from frictionless import describe # import the describe function
from pprint import pprint # pretty print

# describe is the function that reads in data and automatically infers metadata & a schema
resource = describe('combined.csv')
# NOTE: during intro, tell them to save this csv in the right place while showing jupyter

# resource is Frictionless terminology for 'file'

In [None]:
# let's look at our resource descriptor
pprint(resource)
# this is JSON...

### Questions:
- Look at the resource - what is it?
- What metadata has been automatically inferred?
- Is there other info that would be helpful to future researchers? Let's add more metadata to it manually.

In [None]:
# first we'll look at just the schema
resource.schema

We can access specific fields from the schema to edit. Let's look at the hindfoot_length field.

In [None]:
resource.schema.get_field('hindfoot_length')
# get_field is one way to access information inside the schema
# see more examples here https://framework.frictionlessdata.io/docs/guides/framework/schema-guide#field-management

What does hindfoot length mean? What are the units? Let's add some metadata here as a description to make this data more reusable in the future.

In [None]:
# now we can add a description to that field
resource.schema.get_field('hindfoot_length').description = "Hindfoot length measured in millimeters"

In [None]:
pprint(resource.schema)

We can also update other aspects of the schema, like missing values or constraints. Here's the full API for reference: https://framework.frictionlessdata.io/docs/references/api-reference#field
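
As a rough picture of where this information ends up, the sketch below builds a schema as a plain Python dictionary, following the Table Schema layout (`fields`, `constraints`, `missingValues`). The `minimum` rule and the `"NA"` missing-value marker are assumptions made for illustration; they are not taken from the dataset's metadata.

```python
import json

# Hypothetical field descriptor, laid out the way Table Schema describes a field
weight_field = {
    "name": "weight",
    "type": "integer",  # assumed type for this illustration
    "description": "The weight in grams",
    "constraints": {"minimum": 0},  # assumed rule, not from the real metadata
}

schema = {
    "fields": [weight_field],
    "missingValues": ["", "NA"],  # assumed markers for missing cells
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Descriptions, constraints, and missing-value markers all live in the same descriptor, which is why editing the schema is enough to carry this information along with the data.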

### On your own

What other descriptions should we update? Spend a few minutes updating the description for other columns using the metadata https://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm

In [None]:
# things that Lilly will manipulate on screen while everyone else works on their own
resource.schema.get_field('weight').description = "The weight in grams"
resource.schema.get_field('plot_type').description = "Describes the experimental condition"
resource.schema.get_field('species_id').description = "See table 2 https://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm"

In [None]:
pprint(resource.schema)

(Question for helpers to discuss before the workshop: Should we put people in breakout rooms to work together or just give people like 7 to 10 min in the main room to work silently?)

### Question: 
Share what you edited (2 - 3 min)

In [None]:
# now we'll save the resource descriptor
resource.to_yaml("resource.yaml")
# you can also use JSON if you want with '.to_json'

Check out the saved YAML file in the Jupyter Notebook Home directory

# Extract

Now let's look at the Extract function. Extract reads in a data set, and can also manipulate the data in a few ways by forcing it to conform to a schema. To do this, we'll create a new schema in a resource descriptor and then extract the data from that descriptor.

In [None]:
# first we will create a new descriptor called resource_string
# then we will change the schema so the data type for 'plot_id' is a string
# then we save this descriptor to a yaml file
from frictionless import extract

resource_string = describe('combined.csv')
resource_string.schema.get_field('plot_id').type = "string"
resource_string.to_yaml("string_resource.yaml")

In [None]:
# now we will extract (aka read) the data inside the descriptor file
data = extract("string_resource.yaml")

In [None]:
# let's take a look at the first few rows of the read-in dataset
# what do we see?
data[0:5]

You can see that plot_id is now a string.
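
The idea of reading data "according to a schema" can be pictured with plain Python. This is only a conceptual sketch, not how Frictionless implements Extract; the sample rows and the `field_types` mapping are made up for illustration.

```python
import csv
import io

# Made-up rows using column names from the dataset
sample = """record_id,plot_id,weight
1,2,40
2,3,48
"""

# Hypothetical mapping of column name to the Python type we want on read
field_types = {"record_id": int, "plot_id": str, "weight": int}

def read_with_types(text, types):
    """Read CSV text and cast every cell with the type given for its column."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({name: types[name](value) for name, value in raw.items()})
    return rows

rows = read_with_types(sample, field_types)
print(rows[0])
```

Here `plot_id` stays a string because the mapping says so, even though its values look like numbers.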

### Question: 
So, what are some instances when you might want to do this type of manipulation? (Examples: the data type was inferred incorrectly; you could replace missing values; you could read in only a few lines of the data; you could read just the headers; etc.)

More info: https://framework.frictionlessdata.io/docs/guides/extracting-data 

DAY 1 DONE

# Validate

Take some time to talk about data validation: what it means, why it is important.

What kinds of things can be validated? content + structure

Examples of both...
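
As one possible example of a *structure* check, the plain-Python sketch below looks for duplicate headers and for rows with the wrong number of cells. It is a simplified illustration, not Frictionless's own validation logic; the sample text and the helper name `structure_problems` are made up. A *content* check would instead look at the values inside the cells, for example whether a value is one of the allowed codes.

```python
import csv
import io

# Made-up CSV text with two structural problems:
# a duplicated header ("sex") and a row that is missing a cell
broken = """record_id,sex,sex,weight
1,M,40
2,F,F,48
"""

def structure_problems(text):
    """Return a list of human-readable structural problems found in CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    seen = set()
    for name in header:
        if name in seen:
            problems.append(f"duplicate header: {name}")
        seen.add(name)
    # Data rows start on line 2 of the file (line 1 is the header)
    for number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} cells, found {len(row)}"
            )
    return problems

problems = structure_problems(broken)
for problem in problems:
    print(problem)
```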

In [None]:
from frictionless import validate

# create a report variable to store the validation report 
report = validate(resource)

In [None]:
pprint(report)
# look at the scope here to see all the built-in validation checks

Let's purposefully create an error now so we can see how the validation report changes.

Make a change to the data file (e.g. remove a value & comma, or duplicate a header) & validate again.

Note that this time we are using the data file as the input, and Frictionless is automatically inferring the metadata from that dataset.

In [None]:
report_invalid = validate('combined.csv')
# It isn't using the schema we have edited.

In [None]:
pprint(report_invalid)

### Question: 
What has changed in the report? Is it what you expected?

We can also create data constraints that limit the *content* of the data.
https://specs.frictionlessdata.io/table-schema/#constraints
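
To show what an `enum` constraint means in practice, here is a minimal plain-Python sketch that flags values outside an allowed list. It only illustrates the idea and is not how Frictionless performs the check; the rows and the helper name `enum_violations` are made up.

```python
# Made-up rows using column names from the dataset
rows = [
    {"record_id": "1", "sex": "M"},
    {"record_id": "2", "sex": "F"},
    {"record_id": "3", "sex": "M"},
]

def enum_violations(rows, field, allowed):
    """Return (row number, value) pairs where the value is not in the allowed list."""
    return [
        (number, row[field])
        for number, row in enumerate(rows, start=1)
        if row[field] not in allowed
    ]

violations = enum_violations(rows, "sex", ["M"])
print(violations)
```

With only `"M"` allowed, the second row is reported because its value is `"F"`.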

In [None]:
constrained_resource = describe('combined.csv')
constrained_resource.schema.get_field('sex').constraints["enum"] = ["M"]
# this means that only values of "M" are acceptable for the "sex" column

In [None]:
# save this descriptor
constrained_resource.to_yaml("constrained_resource.yaml")

In [None]:
# create a new validation report
report = validate(constrained_resource)
pprint(report)

### On your own

What other things can you validate for? (e.g. make sure record_id is unique) Play around with the data and the schema to create validation errors!

We won't get into this today, but Frictionless also has a tool for continuous data validation, [Repository](https://repository.frictionlessdata.io/). A use case for Repository is if you host a dataset on GitHub: every time you push changes to that dataset, Repository will run validation checks via a GitHub action and will alert you if there are any errors.

In [None]:
# another example: record_id must be unique
# remember to change a record_id value to be a duplicate in the data file
constrained2_resource = describe('combined.csv')
constrained2_resource.schema.get_field('record_id').constraints["unique"] = True
report = validate(constrained2_resource)
pprint(report)

In [None]:
# bonus example
# you can selectively print out parts of the report if you don't want it all
# https://framework.frictionlessdata.io/docs/guides/validation-guide#validation-report
report = validate(constrained2_resource, pick_errors=['unique-error'])
pprint(report.flatten(["rowPosition", "fieldPosition", "code", "message"]))

### Question: 
What else did you validate? Share with the group.

# Transform

Now we'll look at the Transform function, which transforms (or cleans) the data set and metadata too.
https://framework.frictionlessdata.io/docs/guides/transform-guide

Let's look at the date columns. Having 3 separate columns for dates is not standard. (What is standard?) Let's combine those columns into a new column to be more standard. We'll keep the original columns, which is a best practice.
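
One common answer to "what is standard?" is the ISO 8601 format, `YYYY-MM-DD`. The sketch below uses only Python's `datetime` module to show how three separate values combine into one such date; it is an illustration of the target format, not the Frictionless transform itself, and the helper name `to_iso_date` and the sample values are made up.

```python
from datetime import date

def to_iso_date(year, month, day):
    """Combine year, month, and day into an ISO 8601 date string."""
    # date() raises ValueError for impossible dates such as April 31
    return date(int(year), int(month), int(day)).isoformat()

iso = to_iso_date("1977", "7", "16")
print(iso)  # 1977-07-16
```

Note that the month and day are zero-padded in the result, which keeps the dates sortable as plain text.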

In [None]:
from frictionless import Resource, transform, steps

# Define source resource
source = 'resource.yaml'

# Apply transform steps
target = transform(
    source,
    steps=[
        steps.field_add(name="date", type="integer", formula="year+'-'+month+'-'+day"),
        steps.field_update(name="date", type="date"),
    ],
)

# the first step creates a new field that has year, month, and day combined
# the second step changes the data type from integer to date
# (we need to use this order so we can "add" the 3 column names)

In [None]:
# check out the new field in the schema
pprint(target.schema)

In [None]:
# print out some of the data rows to check the new column 
# note: this file is HUGE, so stop this cell from running forever with the STOP square button on the menu
pprint(target.read_rows())

### Question: 
What are some other things you might want to transform in this dataset?

In [None]:
# check the validity 
target_report = validate(target)
pprint(target_report)

In [None]:
# write the data to a file called transformed.csv
# then we can look at the data file in whole
target.write('transformed.csv')

In [None]:
# and we'll save the descriptor of this datafile too
target.to_yaml("transformed_descriptor.yaml")

Now we can package it up!

### Question: 
Why do we want to package it? (note: we will have talked about this during the intro on day 1, so this will be a good reminder). We package it so that we have the metadata, schema + data all together in 1 place!

In [None]:
from frictionless import Package
package = Package(resources=Resource(path='transformed.csv'), descriptor='transformed_descriptor.yaml') 
# this package contains the data file + the descriptor file

In [None]:
# let's look at the package
pprint(package)

In [None]:
# save the package
package.to_yaml('package.yaml')
# package.to_json('package.json') will save as JSON instead

Now we can publish this machine-readable packaged (contained) information in 1 place.
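
For orientation, a package descriptor has roughly the shape sketched below: a list of resources, each pointing at a data file and carrying its schema. This is a hand-written illustration following the Data Package layout, not the output of the cells above; the package name and the shortened field list are assumptions.

```python
import json

# Hypothetical, shortened package descriptor written by hand for illustration
package_descriptor = {
    "name": "portal-teaching-data",  # assumed name
    "resources": [
        {
            "name": "transformed",
            "path": "transformed.csv",
            "schema": {
                "fields": [
                    {"name": "record_id", "type": "integer"},
                    {"name": "date", "type": "date"},
                ]
            },
        }
    ],
}

package_json = json.dumps(package_descriptor, indent=2)
print(package_json)
```

Anyone who receives the descriptor and the data file together can see what each column means without hunting for a separate metadata page.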

## Conclusion

To recap, we started with a pretty clean data file that was missing some important metadata. To make the data file more reusable, we added metadata with Frictionless Describe. Then we saw how we could use Extract to read in data according to a schema. Next, we validated the dataset and metadata/schema to check for data content and structure errors. After that, we transformed the data to add a new, standardized date column. And finally, we packaged the data and metadata/schema together so we can publish it!

### Bonus Example
Dr. Katerina Drakoulaki will tell you all about her recent experience using the Frictionless Framework with Byzantine music data.