# Reshaping open data
---
## Learning Goals:
- Learn how to clean data with Transform
- Learn how to create metadata and a schema with Describe
- Learn how to edit metadata
- Prerequisite: this tutorial assumes introductory knowledge of the [Frictionless Framework](https://framework.frictionlessdata.io/docs/guides/introduction) and Python

## Introduction
For this tutorial, we are going to use some open data to show how we can use Frictionless to **transform** (aka clean) data and also infer metadata to **describe** the data. 

**Transforming** open data is a common task: data often arrives in a messy format that is hard to use. In this example, we are using a dataset that has been formatted for human reading rather than for machine readability. By the end of this tutorial, the dataset will be machine readable!

We will also be adding metadata to this dataset using Frictionless **Describe**. Adding metadata is important for understanding what the data is. In our example dataset, the authors have provided some metadata, but it is contained within three separate datasets, so there is some friction involved in accessing and understanding the metadata. We'll use Frictionless Describe to quickly infer some metadata and then edit the metadata to add more information.

For this tutorial, we are going to use open data from a recent publication about sourdough starters, as several of the Frictionless team are avid bakers. The publication is titled *The diversity and function of sourdough starter microbiomes* and can be accessed at this link [https://doi.org/10.7554/eLife.61644](https://doi.org/10.7554/eLife.61644). We will be looking at the data from Figure 2, source data 1: *The most abundant bacterial and fungal taxa across the 500 sourdough starter samples that are not typically considered an active part of starter communities (e.g. excluding yeasts, lactic acid bacteria, and acetic acid bacteria).* You can access or download the data from eLife's website at the link above (yay for open science!).

## Step 0: Installation

In [None]:
!pip install frictionless
!pip install frictionless[excel] 
# we need this plugin for working with the excel data file
# Need help? Check out the Quick Start Guide: https://framework.frictionlessdata.io/docs/guides/quick-start

## Step 1: Import & read data

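I have downloaded the data as an Excel file, so I will upload it to the notebook session using the following code. Note that the google.colab module is only available when running in Google Colab.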




In [None]:
# this is to upload a local file
from google.colab import files
uploaded = files.upload()
# this is the data I'm using: https://cdn.elifesciences.org/articles/61644/elife-61644-fig2-data1-v1.xlsx

You can also load data from a URL like this: 
```
source = Resource(
    "https://cdn.elifesciences.org/articles/61644/elife-61644-fig2-data1-v1.xlsx"
)
```

Now, we will use Resource to read in the Excel file. See the [Excel tutorial](https://framework.frictionlessdata.io/docs/tutorials/formats/excel-tutorial/) for more information.

In [None]:
from frictionless import Resource
from pprint import pprint

resource = Resource(path='elife-61644-fig2-data1-v1.xlsx')

Let's look at our resource with to_view(). 

(You can see more about this at the [resource.to_view API documentation](https://framework.frictionlessdata.io/docs/references/api-reference/#resourceto_view))

In [None]:
resource.to_view(type="lookall") 
#note: you can also print all the rows using resource.read_rows()

We can see that the first two rows don't contain usefully formatted data: one is empty and the other has a single populated cell. We'll need the value in that cell ('Bacteria' in field1) later, because it labels the rows beneath it. A similar section-header row, 'Fungi', appears further down.

To use this data, we will first transform (aka clean) the data so it is more useful.

## Step 2: Transform our data

We need to create a new column called "Organism" and assign a value to it:
  - if the organism is a bacterium, the value should be 1
  - if the organism is a fungus, the value should be 2

We also need to rename the first column to "Id", and we'll apply table_normalize so the rows are normalized against the updated schema.

This example shows how we can write custom transformation steps, but we can also use some of Frictionless' built-in transformation steps. Read more about the [Transform steps here](https://framework.frictionlessdata.io/docs/guides/transform-steps).

In [None]:
from frictionless import Resource, Field, transform, steps

# we'll create a function called 'clean' to clean our data
def clean(resource):
    current = resource.to_copy()

    # Data
    def data():
        with current:
            organism = 0  # stays 0 until the first section header is seen
            for row in current.row_stream:  # stream the rows one at a time
                record = row.to_dict()  # convert the row to a dictionary
                if record["field1"] == "Bacteria":
                    organism = 1  # header row: the rows that follow are bacteria
                elif record["field1"] == "Fungi":
                    organism = 2  # header row: the rows that follow are fungi
                elif record["field1"]:
                    # data row: tag it with the current organism code
                    record["Organism"] = organism
                    yield record
                # rows with an empty first cell are dropped

    # Metadata
    resource.data = data
    resource.schema.get_field("field1").name = "Id"  # rename the 1st column
    resource.schema.add_field(Field(name="Organism", type="integer"))  # add the Organism column


source = Resource(resource)
target = transform(source, steps=[clean, steps.table_normalize()])

Now we can view our cleaned data!

In [None]:
target.to_view(type = "lookall")

We can see that we have transformed our data by renaming the 1st column to "Id" and adding a new column called "Organism". Rows that are bacteria have the value 1 in that column, and rows that are fungi have the value 2. We also removed the empty rows and the 'Bacteria' and 'Fungi' header rows. Our data is now much more machine readable!
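
The section-tracking logic inside `clean` does not depend on Frictionless itself. As a minimal sketch, assuming rows arrive as plain dictionaries keyed by `field1`, the same idea can be tried on a few made-up rows. The helper name `tag_organisms` and the sample values are hypothetical and are not taken from the dataset; the column rename to "Id" is left out here.

```python
def tag_organisms(rows):
    """Yield data rows tagged with the most recent section header.

    Hypothetical helper that mirrors the loop in `clean` on plain dicts.
    """
    codes = {"Bacteria": 1, "Fungi": 2}  # same coding as the tutorial
    organism = 0  # 0 means no section header has been seen yet
    for row in rows:
        label = row.get("field1")
        if label in codes:
            organism = codes[label]  # header row: remember it, emit nothing
        elif label:
            yield {**row, "Organism": organism}  # data row: tag and emit
        # rows with an empty first cell are skipped


# Made-up rows that imitate the spreadsheet layout
sample = [
    {"field1": None, "Genus": None},
    {"field1": "Bacteria", "Genus": None},
    {"field1": "b1", "Genus": "Example genus A"},
    {"field1": "Fungi", "Genus": None},
    {"field1": "f1", "Genus": "Example genus B"},
]
cleaned = list(tag_organisms(sample))
for row in cleaned:
    print(row)
```

Keeping the header-to-code mapping in one dictionary makes it easy to add another section label later without touching the loop.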


## Step 3: Describe our data


Next, we can generate metadata and a schema to describe our data. We'll use the 'Describe' function for this. Adding metadata helps other people understand our data (sometimes it helps us in the future too!).

We'll use describe_resource to describe our data. This creates basic metadata about the data file content and structure.

In [None]:
from frictionless import describe_resource
from pprint import pprint

In [None]:
data_descriptor = describe_resource(target)
pprint(data_descriptor)
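
To build intuition for what inferring a schema involves, here is a deliberately simplified sketch of type inference over plain rows. It is not how Frictionless implements `describe_resource`: the function names `infer_type` and `infer_schema` and the sample rows are hypothetical, and only two types (`integer` and `string`) are considered.

```python
def infer_type(values):
    """Return "integer" if every non-empty value parses as an int, else "string"."""
    seen = False
    for value in values:
        if value is None or value == "":
            continue  # empty cells carry no type information
        seen = True
        try:
            int(str(value))
        except ValueError:
            return "string"
    return "integer" if seen else "string"


def infer_schema(rows):
    """Build a minimal Table Schema-like dict from a list of row dicts."""
    names = list(rows[0].keys())
    return {
        "fields": [
            {"name": name, "type": infer_type(row.get(name) for row in rows)}
            for name in names
        ]
    }


# Made-up rows shaped like the cleaned table
sample = [
    {"Id": "b1", "Genus": "Example genus A", "Organism": 1},
    {"Id": "f1", "Genus": "Example genus B", "Organism": 2},
]
schema = infer_schema(sample)
print(schema)
```

A real inference engine handles many more types and edge cases, which is why the inferred descriptor is still worth reviewing by hand.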

In [None]:
# save the description to a yaml file
data_descriptor.to_yaml("descriptor.yaml")

Next, we can add a few more details to our metadata so that future users can understand this data.

In [None]:
data_descriptor.schema.get_field("Id").description = "identifier for each organism"
data_descriptor.schema.get_field("Class").description = "taxonomic class"
data_descriptor.schema.get_field("Order").description = "taxonomic order"
data_descriptor.schema.get_field("Family").description = "taxonomic family"
data_descriptor.schema.get_field("Genus").description = "taxonomic genus"

#resave our descriptor file
data_descriptor.to_yaml("descriptor.yaml")

In [None]:
# we can also view this file using cat
!cat "descriptor.yaml"
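
When many fields need descriptions, a mapping keeps the edits in one place. The sketch below works on a plain dictionary shaped like a Table Schema descriptor rather than on a Frictionless object, so it runs without the library. `add_descriptions` is a hypothetical helper, and the descriptor is a hand-written stand-in for the inferred one.

```python
import json


def add_descriptions(descriptor, descriptions):
    """Set `description` on each named field; return the names not found."""
    fields = {field["name"]: field for field in descriptor["schema"]["fields"]}
    missing = []
    for name, text in descriptions.items():
        if name in fields:
            fields[name]["description"] = text
        else:
            missing.append(name)  # report it rather than failing silently
    return missing


# Hand-written stand-in for an inferred descriptor
descriptor = {
    "name": "example",
    "schema": {
        "fields": [
            {"name": "Id", "type": "string"},
            {"name": "Genus", "type": "string"},
        ]
    },
}
missing = add_descriptions(
    descriptor,
    {
        "Id": "identifier for each organism",
        "Genus": "taxonomic genus",
        "Species": "not present in this stand-in schema",
    },
)
print(json.dumps(descriptor, indent=2))
print("fields not found:", missing)
```

With the Frictionless objects used in this tutorial, the same mapping could be applied in a loop that calls `data_descriptor.schema.get_field(name)` for each entry.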

Great work! Now you have specific metadata that describes your data. 

## Conclusion
To recap, we started this tutorial with a messy data file and used Frictionless to transform (aka clean) the data and then describe it, generating metadata that we could edit. Now we have a data descriptor file that holds the metadata and schema for our cleaned data.

## Related Reference Documentation
- Frictionless-py code repository: https://github.com/frictionlessdata/frictionless-py
- FAQ:
- Confused? Ask us questions on [Discord](https://discord.gg/j9DNFNw)!