# Fixing large uncertainty distributions

The notebook [Finding out why Monte Carlo results are significantly different than static ones](https://github.com/brightway-lca/brightway2/blob/master/notebooks/Investigating%20interesting%20Monte%20Carlo%20results.ipynb) showed that some uncertainty distributions can be unpredictably large or one-tailed.

In this notebook I show how these distributions can be truncated so that the results seem more reasonable.

One difficulty here is that our database is not well-normalized - all the uncertainty data is [stored as a binary blob](https://github.com/brightway-lca/brightway2-data/blob/0759011516ab02e601ad6f1f57424d935eca994b/bw2data/backends/schema.py#L21). This means we will have to load (deserialize) all the exchange data to see if it lies outside our accepted bounds.

What is reasonable for a triangular distribution, like the ones found in the earlier notebook? We can think about a number of attributes:

* The difference in left and right sides versus the mean
* The difference between the mode and the mean or median
* The absolute or relative ratio of the upper bound to the mode
* Whether the left side crosses zero

For this example, we will consider the ratio of the mean to the mode. Let's look at how this ratio is distributed in the database used in the cells below: ecoinvent 3.8, consequential.
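As a rough illustration of the attributes listed above, the following sketch computes them for a single triangular distribution from its minimum, mode, and maximum. The function name `triangular_diagnostics` and the example numbers are made up for illustration; only the closed-form mean and median of the triangular distribution are standard.

```python
from math import sqrt


def triangular_diagnostics(minimum, mode, maximum):
    """Summarize the shape of a triangular distribution (hypothetical helper)."""
    if not minimum <= mode <= maximum or minimum == maximum:
        raise ValueError("Need minimum <= mode <= maximum and a non-zero width")
    mean = (minimum + mode + maximum) / 3
    width = maximum - minimum
    # Closed-form median of the triangular distribution
    if mode >= (minimum + maximum) / 2:
        median = minimum + sqrt(width * (mode - minimum) / 2)
    else:
        median = maximum - sqrt(width * (maximum - mode) / 2)
    return {
        "mean": mean,
        "median": median,
        "left_width": mode - minimum,
        "right_width": maximum - mode,
        # Ratios are undefined when the mode is zero
        "mean_to_mode": mean / mode if mode else None,
        "max_to_mode": maximum / mode if mode else None,
        "crosses_zero": minimum < 0 < maximum,
    }


# Made-up example: a long right tail relative to the mode
example = triangular_diagnostics(minimum=0.5, mode=1.0, maximum=34.5)
print(example["mean_to_mode"], example["max_to_mode"], example["crosses_zero"])
```

Returning all attributes in one dictionary makes it easy to try a different cutoff criterion later without rewriting the loop over exchanges.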

In [10]:
import bw2data as bd
from bw2data.backends.schema import ExchangeDataset as ED
import bw2calc as bc
import bw2analyzer as ba
import bw2io as bi
import numpy as np
import pandas as pd
import seaborn as sb
from matplotlib import pyplot as plt
import stats_arrays as sa

In [None]:
bd.projects.set_current("ecoinvent 3.8 consequential 25")

Create a copy of the database that we will modify

In [7]:
bd.Database("ecoinvent 3.8 consequential").copy("ecoinvent 3.8 consequential adjusted")

Not able to determine geocollections for all datasets. This database is not ready for regionalization.


Writing activities to SQLite3 database:
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:44


Title: Writing activities to SQLite3 database:
  Started: 07/04/2022 14:06:00
  Finished: 07/04/2022 14:06:45
  Total time elapsed: 00:00:44
  CPU %: 91.40
  Memory %: 30.40


Brightway2 SQLiteBackend: ecoinvent 3.8 consequential adjusted

In [9]:
queryset = ED.select().where(ED.output_database=="ecoinvent 3.8 consequential adjusted")
queryset.count()

629959

Compute the mean-to-mode ratio for exchanges with a triangular distribution (uncertainty type 5):

In [15]:
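def get_triangular_ratio(exc):
    """Return mean / mode for a triangular distribution, else None."""
    data = exc.data
    if data.get('uncertainty type') != 5:  # 5 is the triangular ID in stats_arrays
        return None
    mode, minimum, maximum = data.get('loc'), data.get('minimum'), data.get('maximum')
    if not mode or minimum is None or maximum is None:
        return None  # ratio is undefined for a zero mode or missing bounds
    mean = (mode + minimum + maximum) / 3
    return mean / mode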

In [19]:
# Compute each ratio once; exchanges without a triangular distribution are skipped
ratios = [(ratio, exc) for exc in queryset if (ratio := get_triangular_ratio(exc)) is not None]
ratios.sort(reverse=True, key=lambda x: x[0])

In [17]:
len(ratios)

177

In [21]:
ratios[:20]

[(2061.727333333331, <ExchangeDataset: 855054>),
 (985.6604534251463, <ExchangeDataset: 855055>),
 (667.0000000000001, <ExchangeDataset: 855052>),
 (666.9999999999999, <ExchangeDataset: 855050>),
 (185.97474811092914, <ExchangeDataset: 903711>),
 (85.94296951819075, <ExchangeDataset: 722572>),
 (85.94296951819075, <ExchangeDataset: 722573>),
 (11.306426553672315, <ExchangeDataset: 846346>),
 (11.306426553672315, <ExchangeDataset: 846347>),
 (11.306426553672315, <ExchangeDataset: 846348>),
 (11.306426553672315, <ExchangeDataset: 846349>),
 (6.672670348612234, <ExchangeDataset: 1081814>),
 (5.3982683982683985, <ExchangeDataset: 976534>),
 (5.3982683982683985, <ExchangeDataset: 760226>),
 (5.392131431041937, <ExchangeDataset: 976527>),
 (5.392131431041937, <ExchangeDataset: 976531>),
 (5.392131431041937, <ExchangeDataset: 760229>),
 (5.392131431041937, <ExchangeDataset: 760233>),
 (5.341880341880342, <ExchangeDataset: 976523>),
 (5.341880341880342, <ExchangeDataset: 760231>)]

In [24]:
[bd.backends.proxies.Exchange(b) for a, b in ratios if a > 10]

[Exchange: 0.0129362298845668 kilogram 'market for rosin size, for paper production' (kilogram, RER, None) to 'tissue paper production, virgin' (kilogram, GLO, None)>,
 Exchange: 0.0270637701154332 kilogram 'market for rosin size, for paper production' (kilogram, RoW, None) to 'tissue paper production, virgin' (kilogram, GLO, None)>,
 Exchange: 0.01 kilogram 'market for chemical, inorganic' (kilogram, GLO, None) to 'tissue paper production, virgin' (kilogram, GLO, None)>,
 Exchange: 0.001 kilogram 'market for chemical, organic' (kilogram, GLO, None) to 'tissue paper production, virgin' (kilogram, GLO, None)>,
 Exchange: 2143.47 kilogram 'market for sodium hydroxide, without water, in 50% solution state' (kilogram, GLO, None) to 'treatment of sulfidic tailing, off-site, high gold content' (kilogram, ZA, None)>,
 Exchange: 67800000.0 square meter 'Transformation, from unspecified' (square meter, None, ('natural resource', 'land')) to 'mine construction, gold' (unit, ZA, None)>,
 Exchange

It seems like 11.3 might be a reasonable cutoff. For every exchange whose ratio exceeds 12, let's cap the maximum at 12 times the mode:

In [26]:
def fix_triangular_upper_limits(exc):
    ratio = get_triangular_ratio(exc)
    if ratio and ratio > 12:
        exc.data['maximum'] = 12 * exc.data['loc']
        exc.save()

In [27]:
for ratio, exc in ratios:
    fix_triangular_upper_limits(exc)
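
To see what this adjustment does to the ratio itself, here is a small self-contained check on plain numbers rather than database exchanges. `ratio_after_cap` mirrors the rule above (maximum set to 12 times the mode), while `maximum_for_ratio` is an alternative that is not used in this notebook: it solves (mode + minimum + maximum) / 3 = target × mode for the maximum. All helper names are hypothetical, and the sketch assumes a positive mode.

```python
def mean_to_mode(minimum, mode, maximum):
    """Mean-to-mode ratio of a triangular distribution."""
    return (minimum + mode + maximum) / 3 / mode


def ratio_after_cap(minimum, mode, factor=12):
    """Ratio after setting maximum = factor * mode (the rule used above)."""
    return mean_to_mode(minimum, mode, factor * mode)


def maximum_for_ratio(minimum, mode, target=12):
    """Maximum that yields exactly the target ratio (alternative, not used above)."""
    return 3 * target * mode - mode - minimum


# Made-up exchange: mode 1.0, minimum 0.5
capped_ratio = ratio_after_cap(0.5, 1.0)      # (0.5 + 1 + 12) / 3 = 4.5
new_max = maximum_for_ratio(0.5, 1.0)         # 36 - 1 - 0.5 = 34.5
print(capped_ratio, new_max, mean_to_mode(0.5, 1.0, new_max))
```

Capping the maximum is therefore the stricter of the two options: the adjusted exchanges end up with a mean-to-mode ratio well below 12 rather than exactly at it.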

In [28]:
bd.Database("ecoinvent 3.8 consequential adjusted").process()

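Is this enough to fix the Monte Carlo results? From first principles we should still expect a substantial difference between the static and stochastic scores, since many distributions remain skewed, but a smaller one than before.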

In [35]:
tissue = bd.Database("ecoinvent 3.8 consequential adjusted").get(name='tissue paper production, virgin')
ipcc = ('IPCC 2013', 'climate change', 'GWP 100a')

Create static and stochastic LCA instances

In [36]:
static = bc.LCA({tissue: 1}, ipcc)
static.lci()
static.lcia()

In [37]:
mc = bc.LCA({tissue: 1}, ipcc, use_distributions=True)
mc.lci()
mc.lcia()

Compare the static score with the mean of 20 Monte Carlo iterations:

In [38]:
static.score, np.mean([mc.score for _ in zip(range(20), mc)])

(3.9271282663369003, 5.2474693807935004)
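
Twenty iterations is a small sample, so it can help to look at the spread as well as the mean. The sketch below uses only the standard library and a hypothetical helper `summarize_scores`; the sample values are invented, and in this notebook they would come from something like `[mc.score for _ in zip(range(n), mc)]`.

```python
from math import sqrt
from statistics import mean, median, stdev


def summarize_scores(static_score, samples):
    """Compare Monte Carlo samples against a static score (hypothetical helper)."""
    mc_mean = mean(samples)
    return {
        "mean": mc_mean,
        "median": median(samples),
        # Standard error of the mean: sample standard deviation / sqrt(n)
        "standard_error": stdev(samples) / sqrt(len(samples)),
        "mean_to_static": mc_mean / static_score,
    }


# Invented samples for illustration only
summary = summarize_scores(4.0, [3.0, 4.0, 5.0, 8.0])
print(summary)
```

A median close to the static score combined with a much higher mean would suggest that a few extreme draws, rather than a general shift, drive the difference.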

The Monte Carlo mean is still roughly a third higher than the static score. Similar checks and adjustments can be done for other distribution types and cutoff criteria.
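
As one example for another distribution type, a lognormal distribution with underlying standard deviation σ has a mean-to-median ratio of exp(σ²/2). The sketch below applies the same kind of cutoff to plain dictionaries. It assumes the stats_arrays conventions, where the lognormal distribution has uncertainty type 2 and `scale` holds σ; the helper names and the cutoff of 12 are illustrative, and the code does not touch the database.

```python
from math import exp, log, sqrt

LOGNORMAL_ID = 2  # assumed: lognormal uncertainty type ID in stats_arrays


def lognormal_mean_to_median(data):
    """Return mean / median for a lognormal exchange dict, else None."""
    if data.get("uncertainty type") != LOGNORMAL_ID or data.get("scale") is None:
        return None
    return exp(data["scale"] ** 2 / 2)


def cap_lognormal_scale(data, max_ratio=12):
    """Shrink sigma so that mean / median does not exceed max_ratio."""
    ratio = lognormal_mean_to_median(data)
    if ratio is not None and ratio > max_ratio:
        # Invert exp(sigma**2 / 2) = max_ratio for sigma
        data["scale"] = sqrt(2 * log(max_ratio))
    return data


# Invented example with a very wide distribution
exc_data = {"uncertainty type": 2, "loc": 0.0, "scale": 3.0}
cap_lognormal_scale(exc_data)
print(exc_data["scale"], lognormal_mean_to_median(exc_data))
```

To apply this to real exchanges, the same logic would be run on `exc.data`, followed by `exc.save()` and a final `process()` call on the database, as done for the triangular case.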