# What happens when a database is processed?

## What is processing?

Processing a `Database` means taking the *quantitative* data and saving it as a set of [numpy files](https://numpy.org/doc/1.13/neps/npy-format.html) using the [bw_processing](https://github.com/brightway-lca/bw_processing) library and the [data package](https://specs.frictionlessdata.io/data-package/) specification.

## What does a `Database` mean technically?

A `Database` in Brightway is a collection of nodes which are useful to keep together. These collections can be big (e.g. ecoinvent, EXIOBASE) or small (a case study with a few activities). In the database schema, each `Node` stores the label of its database as a string in the `database` column:

```python
class ActivityDataset(Model):
    database = TextField()
    ... # other attributes
```

## Why isn't `database` a `ForeignKeyField`?

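The default database schema is deliberately simple: it has only two tables, because it is designed for use by people who don't know SQL or database normalization. Storing the database label as plain text avoids the extra table and the joins that a foreign key would require.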
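To make the consequence of a plain-text label concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `node`, its columns, and the sample rows are hypothetical and chosen only for illustration; this is not the actual Brightway schema.

```python
import sqlite3

# In-memory database with a hypothetical single-table layout (assumption, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE node (id INTEGER PRIMARY KEY, "database" TEXT, name TEXT)')
conn.executemany(
    'INSERT INTO node ("database", name) VALUES (?, ?)',
    [
        ("background", "steel production"),
        ("case study", "bike assembly"),
        ("case study", "bike use"),
    ],
)

# Selecting all nodes of one database is a string equality filter; no join is needed.
rows = conn.execute(
    'SELECT name FROM node WHERE "database" = ? ORDER BY name', ("case study",)
).fetchall()
case_study_nodes = [name for (name,) in rows]
conn.close()
```

The trade-off is that nothing at the SQL level enforces that a label refers to a registered database; that bookkeeping has to happen in application code.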

## When is a `Database` processed?

Processing can be manually initiated, but is also done automatically when needed. Processing is needed when the data stored in the SQLite database changes; currently, every time we save a change to (or delete anything from) the SQLite database we set a `dirty` flag in the `databases` metadata storage:

```python
Database.set_dirty("<database_label>")
```

## When is processing done automatically?

Processing is done automatically when the function `bw2data.compat.prepare_lca_inputs` is called, either manually or by other code. This function is also called when `bw2calc.lca.LCA` is instantiated, if you do not pass `data_objs`. `data_objs` are the data packages; you don't need to call `prepare_lca_inputs` to get data packages if you already have them.

`prepare_lca_inputs` calls `databases.clean()`, which is a wrapper around `Database.clean_all()`. This method checks the `dirty` flag and calls `.process` for each database which is `dirty` (i.e. each database whose processed data package is obsolete):

```python
class Database(Model):
    @classmethod
    def clean_all(cls):
        for db in cls.select().where(cls.dirty == True):
            db.process()
```

## What happens when `.process` is called?

Each `Database` is processed separately.

* We construct a filepath for a zipped data package from the name of the database
* We create the data for `inv_geomapping_matrix`, which is only used in regionalized LCA. It links node ids to their locations.
* We create the data for `biosphere_matrix`:
    - We only consider exchanges whose `output_database` string is the same as the database we are processing
    - We only consider exchanges whose `type` is `biosphere`
* We create the data for the `technosphere_matrix`. This is split into three sections:
    - First, we find nodes which don't have an explicit production exchange. If there is no explicit production exchange, we assume that the node produces one unit of itself. The filters for finding nodes without an explicit production exchange are:
        - We only consider exchanges whose `output_database` string is the same as the database we are processing
        - We only consider exchanges whose `output` activity `type` is `"process"` or `None`
        - There are no exchanges with this node as `output` whose `type` is in `("production", "generic production")`
        - If you don't want a production exchange for a given node, just create a self-referential production exchange with amount zero
    - Second, we get technosphere exchanges which have positive values. "Positive" here means that we can insert these values directly into the matrix without changing their sign, **not** that the numbers are necessarily positive. These are *outputs* of an activity, and are production or production-like exchanges. The filters here are:
        - We only consider exchanges whose `output_database` string is the same as the database we are processing
        - We only consider exchanges whose `type` is in `('production', 'substitution', 'generic production')`
    - Finally, we get technosphere exchanges which have negative values. "Negative" here means that we have to flip the sign of these values when inserting them into the matrix, **not** that the numbers are negative (they usually aren't). These are *inputs consumed* by an activity. The filters here are:
        - We only consider exchanges whose `output_database` string is the same as the database we are processing
        - We only consider exchanges whose `type` is in `('technosphere', 'generic consumption')`
* The created data is added to the zipped data package and its `metadata.json` file, and the file object is closed
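
The filters above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the `bw2data` code: nodes and exchanges are represented as hypothetical dictionaries, `matrix_entries` is an invented helper, and matrix entries are returned as `(row, column, value)` tuples with the exchange input as row and the exchange output as column.

```python
# Exchange type labels taken from the filters described above.
PRODUCTION_TYPES = {"production", "generic production"}
POSITIVE_TYPES = {"production", "substitution", "generic production"}
NEGATIVE_TYPES = {"technosphere", "generic consumption"}

# Hypothetical toy data: plain dicts instead of Brightway nodes and exchanges.
nodes = [
    {"id": 1, "database": "demo", "type": "process"},
    {"id": 2, "database": "demo", "type": "process"},
]
exchanges = [
    {"input": 1, "output": 1, "output_database": "demo", "type": "production", "amount": 2.0},
    {"input": 1, "output": 2, "output_database": "demo", "type": "technosphere", "amount": 0.5},
    {"input": 99, "output": 2, "output_database": "demo", "type": "biosphere", "amount": 3.0},
    {"input": 1, "output": 7, "output_database": "other", "type": "technosphere", "amount": 4.0},
]


def matrix_entries(database, nodes, exchanges):
    """Return (technosphere, biosphere) lists of (row, column, value) tuples."""
    # Every section only considers exchanges whose output is in this database.
    own = [e for e in exchanges if e["output_database"] == database]

    biosphere = [(e["input"], e["output"], e["amount"]) for e in own if e["type"] == "biosphere"]

    # Section 1: nodes without an explicit production exchange produce one unit of themselves.
    explicit = {e["output"] for e in own if e["type"] in PRODUCTION_TYPES}
    technosphere = [
        (n["id"], n["id"], 1.0)
        for n in nodes
        if n["database"] == database
        and n["type"] in ("process", None)
        and n["id"] not in explicit
    ]
    # Section 2: production-like exchanges keep their sign.
    technosphere += [
        (e["input"], e["output"], e["amount"]) for e in own if e["type"] in POSITIVE_TYPES
    ]
    # Section 3: consumed inputs have their sign flipped.
    technosphere += [
        (e["input"], e["output"], -e["amount"]) for e in own if e["type"] in NEGATIVE_TYPES
    ]
    return technosphere, biosphere


technosphere, biosphere = matrix_entries("demo", nodes, exchanges)
```

In this toy data, node 2 has no explicit production exchange and therefore receives an implicit diagonal entry of one, while the exchange whose `output_database` is `"other"` is ignored entirely because it belongs to a different database's processing run.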