# Importing data

Importing data means taking data in some form and preparing it so that we can express it as nodes and edges. On its own, this is not too challenging; it mostly means converting data formats. The harder part is harmonizing the data, so that the fields used across imported databases are consistent enough that we can link consumers and suppliers.

Let's make this more concrete with an example. In the file `lci-carbon-fiber.xlsx` we have data from the publication [Ecological assessment of fuel cell electric vehicles with special focus on type IV carbon fiber hydrogen tank](https://www.sciencedirect.com/science/article/abs/pii/S0959652620333229). Because this data comes from Excel, it is tabular, and so on the surface it looks different from a graph:

<img src="images/spreadsheet.png">

However, this difference is mostly cosmetic. Both the _document_ and _graph_ perspectives show the same information, but with a different emphasis and organizing structure. In the graph perspective, edges are independent objects with their own metadata, and their sources and targets are given as [pointers](https://en.wikipedia.org/wiki/Pointer_(computer_programming)) to the node objects. In the document perspective, edges are subsumed in the definition of the nodes, and because most input data formats don't have pointers, references to input or output flows are defined by the attributes of those flows.
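To make the two perspectives concrete, here is a minimal sketch in plain Python. The process names, the `code` identifiers, and the `to_graph` helper are hypothetical and chosen only for illustration; this is not how Brightway stores data internally.

```python
# Document perspective: each node carries its own list of exchanges, and each
# exchange describes the other flow only by its attributes (here: its name).
documents = [
    {
        "code": "cf-production",
        "name": "carbon fiber production",
        "exchanges": [
            {"name": "electricity", "unit": "kilowatt hour", "amount": 2.5},
        ],
    },
    {"code": "elec", "name": "electricity", "unit": "kilowatt hour", "exchanges": []},
]


def to_graph(documents):
    """Split documents into nodes and edges (hypothetical helper)."""
    # Nodes keep their metadata but lose the embedded exchanges.
    nodes = {
        doc["code"]: {k: v for k, v in doc.items() if k != "exchanges"}
        for doc in documents
    }
    # Assumption for this sketch: names are unique, so a name identifies a node.
    by_name = {node["name"]: code for code, node in nodes.items()}
    edges = []
    for doc in documents:
        for exc in doc["exchanges"]:
            edges.append(
                {
                    "source": by_name[exc["name"]],  # pointer to the supplier node
                    "target": doc["code"],  # pointer to the consumer node
                    "amount": exc["amount"],
                }
            )
    return nodes, edges


nodes, edges = to_graph(documents)
print(edges)
```

The only step that requires real work is resolving `exc["name"]` to a node; everything else is a change of layout.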

Because we only have flow attributes, we need to define a way to associate those attributes with nodes in our existing databases. This is trickier than you might think, as there is no guarantee that two data providers will use the same labels for things like locations or units; indeed, sometimes we even find different labels for the same attributes.
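The following sketch shows why label differences matter. The two flow descriptions, the `UNIT_LABELS` table, and the `normalize` function are all hypothetical; they only illustrate that a naive comparison fails until the labels are harmonized.

```python
# Two descriptions of the same flow from two hypothetical data providers.
provider_a = {"name": "Carbon dioxide", "unit": "kg", "categories": ("air",)}
provider_b = {"name": "carbon dioxide", "unit": "kilogram", "categories": ("air",)}

# Hypothetical lookup table of unit labels; a real one would be much larger.
UNIT_LABELS = {"kg": "kilogram", "kWh": "kilowatt hour"}


def normalize(flow):
    """Return a copy of the flow with harmonized name and unit labels."""
    return {
        "name": flow["name"].strip().lower(),
        "unit": UNIT_LABELS.get(flow["unit"], flow["unit"]),
        "categories": tuple(flow["categories"]),
    }


naive_match = provider_a == provider_b
normalized_match = normalize(provider_a) == normalize(provider_b)
print(naive_match, normalized_match)
```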

Therefore, Brightway treats IO as a classic [ETL pipeline](https://en.wikipedia.org/wiki/Extract,_transform,_load), and applies a series of transformation functions to prepare the data and find the correct flows. Let's look at our real-world example:

In [None]:
import bw2data as bd
import bw2io as bi
from pathlib import Path

The example data is built on top of ecoinvent. You should update the project name to a project with ecoinvent 3.10 already installed.

In [None]:
bd.projects.set_current("Durian fruit is controversial")

In [None]:
imp = bi.ExcelImporter(Path.cwd() / "lci-carbon-fiber.xlsx")

Before we make any changes, let's see what the data looks like in its raw form:

In [None]:
imp.data[0]

This is actually already quite close to the final form. In this case we are lucky, as the import data was designed to be used in Brightway. Normally we would need to apply transformation functions; let's apply the default ones and see what they are:

In [None]:
imp.apply_strategies()
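Conceptually, applying transformation functions means passing the data through a list of functions in order, each one taking the data and returning a modified version. The sketch below shows that pattern in plain Python; the functions `strip_names`, `normalize_units`, and `apply_all` are hypothetical and are not part of `bw2io`.

```python
def strip_names(data):
    """Remove leading and trailing whitespace from dataset names."""
    for ds in data:
        ds["name"] = ds["name"].strip()
    return data


def normalize_units(data):
    """Replace unit abbreviations using a small, hypothetical lookup table."""
    labels = {"kg": "kilogram"}
    for ds in data:
        ds["unit"] = labels.get(ds["unit"], ds["unit"])
    return data


def apply_all(data, strategies):
    """Apply each transformation in turn, feeding the output into the next."""
    for strategy in strategies:
        data = strategy(data)
    return data


raw = [{"name": " carbon fiber ", "unit": "kg"}]
cleaned = apply_all(raw, [strip_names, normalize_units])
print(cleaned)
```

Keeping each transformation small makes it easy to reorder steps or add a custom one for a single data source.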

We can look at the imported data statistics:

In [None]:
imp.statistics()

We can iterate over the unlinked edges to get a sense for what we are missing:

In [None]:
for edge, _ in zip(imp.unlinked, range(5)):
    print(edge)

OK, some unlinked exchanges are clearly from ecoinvent. Let's try to link those.

In [None]:
imp.match_database("ecoinvent-3.10-cutoff", fields=('name', 'reference product', 'unit', 'location'))
imp.statistics()
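Linking by fields can be pictured as building an index over the target database, keyed by the chosen fields, and then looking up each unlinked edge in it. This is a simplified sketch under that assumption; `link_by_fields`, the sample nodes, and their codes are hypothetical, and the real `match_database` may behave differently in detail (for example, when several nodes share the same field values).

```python
def link_by_fields(exchanges, database, fields):
    """Add an 'input' pointer to every exchange whose fields match one node."""
    # Index the target nodes by the values of the chosen fields.
    index = {tuple(node.get(f) for f in fields): node["code"] for node in database}
    for exc in exchanges:
        if "input" in exc:
            continue  # already linked
        key = tuple(exc.get(f) for f in fields)
        if key in index:
            exc["input"] = index[key]
    return exchanges


# Hypothetical target database with two nodes.
database = [
    {"code": "abc123", "name": "electricity", "unit": "kilowatt hour", "location": "DE"},
    {"code": "def456", "name": "electricity", "unit": "kilowatt hour", "location": "FR"},
]
exchanges = [
    {"name": "electricity", "unit": "kilowatt hour", "location": "DE"},
    {"name": "electricity", "unit": "kilowatt hour", "location": "CH"},
]

link_by_fields(exchanges, database, fields=("name", "unit", "location"))
unlinked = [exc for exc in exchanges if "input" not in exc]
print(unlinked)
```

Matching on fewer fields links more edges but raises the risk of linking to the wrong node, so the choice of `fields` is a trade-off.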

Let's check the unlinked edges:

In [None]:
for edge in imp.unlinked:
    print(edge)

Let's look at the missing argon flow first. To find where it should come from, we check our database layout:

In [None]:
bd.databases

That should be in the `ecoinvent-3.10-biosphere` database. Let's search for argon:

In [None]:
[(x['code'], x) for x in bd.Database("ecoinvent-3.10-biosphere") if "argon" in x["name"].lower()]

OK, so we have the following. In our imported data:

```python
{
    'name': 'Argon-40', 
    'unit': 'kilogram', 
    'categories': ('air',),
}
```

And in our `ecoinvent-3.10-biosphere` database:

```python
{
    'name': 'Argon', 
    'unit': 'kilogram', 
    'categories': ('air',),
}
```

We can patch this manually:

In [None]:
migration = {
    "fields": ["name", "categories"],
    "data": [
        (
            ("Argon-40", ("air",)),
            {
                "name": "Argon",
            },
        )
    ],
}

In [None]:
bi.Migration(name="ei3.9-3.10").write(data=migration, description="ei 3.9 to 3.10")

In [None]:
imp.data = bi.strategies.migrate_exchanges(
    db=imp.data,
    migration="ei3.9-3.10"
)
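To see what a migration of this shape does, here is a simplified sketch in plain Python. The `apply_migration` function is hypothetical and only mimics the idea: match each edge on the listed `fields`, then overwrite the attributes given in the replacement dictionary. The real `migrate_exchanges` may handle additional cases that are not shown here.

```python
migration = {
    "fields": ["name", "categories"],
    "data": [
        (("Argon-40", ("air",)), {"name": "Argon"}),
    ],
}


def apply_migration(exchanges, migration):
    """Update every exchange whose field values match a migration entry."""
    fields = migration["fields"]
    lookup = {tuple(key): changes for key, changes in migration["data"]}
    for exc in exchanges:
        key = tuple(exc.get(f) for f in fields)
        if key in lookup:
            exc.update(lookup[key])
    return exchanges


exchanges = [
    {"name": "Argon-40", "unit": "kilogram", "categories": ("air",)},
    {"name": "Argon-40", "unit": "kilogram", "categories": ("water",)},
]
migrated = apply_migration(exchanges, migration)
print([exc["name"] for exc in migrated])
```

Only the first edge is renamed, because the second one differs in `categories` and so does not match the migration key.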

For other changes, we have some built-in migrations to take care of exactly these kinds of discrepancies. These come from the [randonneur](https://github.com/brightway-lca/randonneur) tool and its accompanying [randonneur_data](https://github.com/brightway-lca/randonneur_data) set of pre-computed migrations. Let's look and see what is available:

In [None]:
import randonneur_data as rd
registry = rd.Registry()
list(registry)

We can sample the data in each migration:

In [None]:
registry.sample('ecoinvent-3.9.1-biosphere-ecoinvent-3.10-biosphere')

We should also look at the technosphere migrations:

In [None]:
registry.sample('ecoinvent-3.9.1-cutoff-ecoinvent-3.10-cutoff')

In this specific case, the remaining name differences come from the change from ecoinvent version 3.9 to 3.10. We can apply the migration `ecoinvent-3.9.1-cutoff-ecoinvent-3.10-cutoff`, but need to be careful, as our unit labels don't match exactly. That's OK: we don't need to match against the unit, so we leave it out of `fields`.

In [None]:
imp.randonneur(
    label='ecoinvent-3.9.1-cutoff-ecoinvent-3.10-cutoff',
    fields=['name', 'location', 'reference product'],
)

In [None]:
for edge in imp.unlinked:
    print(edge)

In [None]:
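# The unit labels on the remaining unlinked edges don't match exactly,
# so set them to 'kilogram' before matching again.
for edge in imp.unlinked:
    edge['unit'] = 'kilogram'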

In [None]:
imp.match_database("ecoinvent-3.10-cutoff", fields=('name', 'reference product', 'location'))
imp.match_database("ecoinvent-3.10-biosphere", fields=('name', 'unit', 'categories'))
imp.statistics()

In [None]:
assert imp.all_linked
imp.write_database()