# Match - Identical names in water

If the name is exactly the same, then we can be pretty confident there is a one-to-one correspondence. We need to handle the different base contexts separately, as there can be small implementation differences or other gotchas in each one.

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone

Get paths of input and output directories

In [2]:
input_data_dir = (Path.cwd().parent / "Mapping" / "Input" / "Flowlists").resolve()
existing_matches_dir = (Path.cwd().parent / "Mapping" / "Output" / "Mapped_files").resolve()
output_dir = (Path.cwd().parent / "Contribute").resolve()

Read input dataframes

In [3]:
sp = pd.read_csv(input_data_dir / 'SimaProv9.4.csv')

In [4]:
ei = pd.read_csv(input_data_dir / 'ecoinventEFv3.7.csv')

In [5]:
sp.Context.unique()

array(['Raw materials', 'Airborne emissions', 'Waterborne emissions',
       'Final waste flows', 'Emissions to soil', 'Non material emissions',
       'Social issues', 'Economic issues'], dtype=object)

In [6]:
sorted(ei.Context.unique())

['air/indoor',
 'air/low population density, long-term',
 'air/lower stratosphere + upper troposphere',
 'air/non-urban air or from high stacks',
 'air/unspecified',
 'air/urban air close to ground',
 'natural resource/biotic',
 'natural resource/fossil well',
 'natural resource/in air',
 'natural resource/in ground',
 'natural resource/in water',
 'natural resource/land',
 'soil/agricultural',
 'soil/forestry',
 'soil/industrial',
 'soil/unspecified',
 'water/fossil well',
 'water/ground-',
 'water/ground-, long-term',
 'water/ocean',
 'water/surface water',
 'water/unspecified']

# Dealing with different `Context` values and available levels

The names might match, but we also need the `Context` to match. To do this we need to align the `Context` systems of Simapro and ecoinvent. We could normalize to either system, but as we are matching to ecoinvent, we normalize to the ecoinvent `Context` values.

In this notebook we look only at emissions to water, so restrict ourselves to these contexts:

In [7]:
sp = sp[sp.Context == 'Waterborne emissions']
ei = ei[ei.Context.str.startswith("water/")]

The Simapro flows only have one `Context`, but we need all the subcontexts available in ecoinvent. They are also available in Simapro, but not given in our master flow list. Therefore, we can use a cross [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) (a Cartesian product) to expand the Simapro contexts, so that one Simapro flow can be matched against all relevant ecoinvent flows:

In [8]:
sp[sp.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CAS No,Formula,Synonyms,Unit,Class,Context,Flow UUID,Description
5734,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O


In [9]:
ei[ei.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CASNo,Formula,Synonyms,Unit,Class,ExternalReference,Preferred,Context,FlowUUID,AltUnit,Unnamed: 11,Second CAS
16,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,water/ground-,5074e239-b510-49aa-928c-fcdb462481d8,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
17,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,"water/ground-, long-term",cfa50eaf-a817-4352-b9fe-aa834240d269,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
18,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,water/ocean,8d28a5b3-1b1c-41e9-9eba-90c607aad7db,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
19,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,water/surface water,e4526360-b2a1-4e77-9f00-57dbfe228bde,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
20,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,water/unspecified,070dc6b3-0976-45a0-803e-0a87d7e96959,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...


## Merging the two `Contexts`

In [10]:
WATER_CONTEXT_CORRESPONDENCE = pd.DataFrame([
    # ecoinvent context, simapro context, match condition ei to sp, match condition sp to ei
    ('water/fossil well', 'Emissions to water/groundwater, long-term', "~", "~"),
    ('water/ground-', 'Emissions to water/groundwater', "=", "="),
    ('water/ground-, long-term', 'Emissions to water/groundwater, long-term', "=", "="),
    ('water/ocean', 'Emissions to water/ocean', "=", "="),
    ('water/surface water', 'Emissions to water/lake', ">", "<"),
    ('water/surface water', 'Emissions to water/river', ">", "<"),
    ('water/surface water', 'Emissions to water/river, long-term', ">", "<"),
    ('water/unspecified', 'Emissions to water/(unspecified)', "=", "="),
], columns=["EI_Context", "SP_Context", "MatchCondition_EI_to_SP", "MatchCondition_SP_to_EI"])
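Before relying on the correspondence table, it may be worth checking that it covers every ecoinvent water context and references none that do not exist. The following is a minimal sketch of such a check; the context names are copied from the outputs above, and the names `ei_water_contexts`, `missing`, and `unknown` are illustrative only.

```python
import pandas as pd

# ecoinvent water contexts, copied from the `sorted(ei.Context.unique())` output above
ei_water_contexts = {
    "water/fossil well",
    "water/ground-",
    "water/ground-, long-term",
    "water/ocean",
    "water/surface water",
    "water/unspecified",
}

# Only the ecoinvent side of the correspondence table is needed for this check
correspondence = pd.DataFrame(
    {
        "EI_Context": [
            "water/fossil well",
            "water/ground-",
            "water/ground-, long-term",
            "water/ocean",
            "water/surface water",
            "water/surface water",
            "water/surface water",
            "water/unspecified",
        ]
    }
)

# Contexts present in ecoinvent but absent from the table, and vice versa
missing = ei_water_contexts - set(correspondence["EI_Context"])
unknown = set(correspondence["EI_Context"]) - ei_water_contexts
print("Not covered:", missing, "| Unknown:", unknown)
```

In the notebook itself the same comparison could be made between `set(ei.Context.unique())` and `set(WATER_CONTEXT_CORRESPONDENCE.EI_Context)`.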

We can just expand each Simapro flow by the number of possible complete `Context` values using a cross product:

In [11]:
sp_expanded = sp.merge(WATER_CONTEXT_CORRESPONDENCE, how="cross")
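To make the effect of the cross merge concrete, here is a small sketch on toy data (the two dataframes are made up for illustration). Every row on the left is paired with every row on the right, so the result has `len(left) * len(right)` rows. Note that `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd

# Toy stand-ins for the Simapro flows and the context correspondence table
flows = pd.DataFrame({"Flowable": ["1-Pentanol", "Water"]})
contexts = pd.DataFrame(
    {"EI_Context": ["water/ocean", "water/unspecified", "water/ground-"]}
)

# Cartesian product: each flow is repeated once per context
expanded = flows.merge(contexts, how="cross")
print(expanded)
```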

Drop the original, coarse `Context` column and rename `SP_Context` to `Context`:

In [12]:
sp_expanded.drop(columns=["Context"], inplace=True)
sp_expanded.rename(columns={"SP_Context": "Context"}, inplace=True)

In [13]:
sp_expanded[sp_expanded.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CAS No,Formula,Synonyms,Unit,Class,Flow UUID,Description,EI_Context,Context,MatchCondition_EI_to_SP,MatchCondition_SP_to_EI
544,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/fossil well,"Emissions to water/groundwater, long-term",~,~
545,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/ground-,Emissions to water/groundwater,=,=
546,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,"water/ground-, long-term","Emissions to water/groundwater, long-term",=,=
547,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/ocean,Emissions to water/ocean,=,=
548,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/surface water,Emissions to water/lake,>,<
549,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/surface water,Emissions to water/river,>,<
550,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/surface water,"Emissions to water/river, long-term",>,<
551,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/unspecified,Emissions to water/(unspecified),=,=


# Merging based on identical names and `Context`

Once we have the `Context` systems aligned, it is quite simple to merge the two dataframes and take results when the `Flowable` and `Context` match exactly.

In [14]:
df = sp_expanded.merge(ei, how="inner", left_on=["Flowable", "EI_Context"], right_on=["Flowable", "Context"])
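The column names produced by this merge matter later on, so here is a toy sketch of what happens (the rows and the `uuid-a`/`uuid-b` values are placeholders, not real data). Because `Flowable` is a join key with the same name on both sides, it appears only once in the result; `Context` exists on both sides but is only a key on the right, so it is split into `Context_x` (Simapro) and `Context_y` (ecoinvent).

```python
import pandas as pd

left = pd.DataFrame(
    {
        "Flowable": ["1-Pentanol", "1-Pentanol"],
        "EI_Context": ["water/ocean", "water/fossil well"],
        "Context": [
            "Emissions to water/ocean",
            "Emissions to water/groundwater, long-term",
        ],
    }
)
right = pd.DataFrame(
    {
        "Flowable": ["1-Pentanol", "Water"],
        "Context": ["water/ocean", "water/fossil well"],
        "FlowUUID": ["uuid-a", "uuid-b"],  # placeholder identifiers
    }
)

# Keep only rows where both the name and the ecoinvent context agree
merged = left.merge(
    right,
    how="inner",
    left_on=["Flowable", "EI_Context"],
    right_on=["Flowable", "Context"],
)
print(merged.columns.tolist())
print(len(merged))
```

Only the `water/ocean` row survives here: there is no toy ecoinvent flow for 1-Pentanol in `water/fossil well`, so the inner merge drops that pair.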

In [15]:
df[df.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Flow UUID,Description,EI_Context,Context_x,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
0,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/ground-,Emissions to water/groundwater,...,amyl alcohol,kg,chemical,,,water/ground-,5074e239-b510-49aa-928c-fcdb462481d8,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
1,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,"water/ground-, long-term","Emissions to water/groundwater, long-term",...,amyl alcohol,kg,chemical,,,"water/ground-, long-term",cfa50eaf-a817-4352-b9fe-aa834240d269,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
2,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/ocean,Emissions to water/ocean,...,amyl alcohol,kg,chemical,,,water/ocean,8d28a5b3-1b1c-41e9-9eba-90c607aad7db,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
3,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/surface water,Emissions to water/lake,...,amyl alcohol,kg,chemical,,,water/surface water,e4526360-b2a1-4e77-9f00-57dbfe228bde,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
4,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/surface water,Emissions to water/river,...,amyl alcohol,kg,chemical,,,water/surface water,e4526360-b2a1-4e77-9f00-57dbfe228bde,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
5,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/surface water,"Emissions to water/river, long-term",...,amyl alcohol,kg,chemical,,,water/surface water,e4526360-b2a1-4e77-9f00-57dbfe228bde,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
6,1-Pentanol,000071-41-0,,1-Pentanol,kg,Waterborne emissions,CBD72D82-FB5E-49B5-821E-A808A79E0A2B,Formula: C5H12O,water/unspecified,Emissions to water/(unspecified),...,amyl alcohol,kg,chemical,,,water/unspecified,070dc6b3-0976-45a0-803e-0a87d7e96959,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...


# Fixing different units

There are only a few cases where this is an issue, and only one conversion factor to add: Simapro gives water in kg while ecoinvent uses m3.

In [16]:
df[df.Unit_x != df.Unit_y]

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Flow UUID,Description,EI_Context,Context_x,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
1745,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/fossil well,"Emissions to water/groundwater, long-term",...,,m3,water,,,water/fossil well,2256a142-8242-4b4f-b9aa-a167803989ca,,,13670-17-2; 7732-18-5
1746,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/ground-,Emissions to water/groundwater,...,,m3,water,,,water/ground-,51254820-3456-4373-b7b4-056cf7b16e01,,,13670-17-2; 7732-18-5
1747,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,"water/ground-, long-term","Emissions to water/groundwater, long-term",...,,m3,water,,,"water/ground-, long-term",06d4812b-6937-4d64-8517-b69aabce3648,,,13670-17-2; 7732-18-5
1748,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/ocean,Emissions to water/ocean,...,,m3,water,,,water/ocean,4f0f15b3-b227-4cdc-b0b3-6412d55695d5,,,13670-17-2; 7732-18-5
1749,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/surface water,Emissions to water/lake,...,,m3,water,,,water/surface water,db4566b1-bd88-427d-92da-2d25879063b9,,,13670-17-2; 7732-18-5
1750,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/surface water,Emissions to water/river,...,,m3,water,,,water/surface water,db4566b1-bd88-427d-92da-2d25879063b9,,,13670-17-2; 7732-18-5
1751,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/surface water,"Emissions to water/river, long-term",...,,m3,water,,,water/surface water,db4566b1-bd88-427d-92da-2d25879063b9,,,13670-17-2; 7732-18-5
1752,Water,007732-18-5,,Water,kg,Waterborne emissions,E0F21491-61BA-44CF-B0A1-0A3670B63F8D,,water/unspecified,Emissions to water/(unspecified),...,,m3,water,,,water/unspecified,2404b41a-2eed-4e9d-8ab6-783946fdf5d6,,,13670-17-2; 7732-18-5


In [17]:
# Use a float default so the fractional factor assigned below does not clash with an integer dtype
df['ConversionFactor'] = 1.0

In [18]:
water_mask = (df.Flowable == "Water") & (df.Unit_x == "kg") & (df.Unit_y == "m3")
water_mask.sum()

8

In [19]:
df.loc[water_mask, 'ConversionFactor'] = 1e-3
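The unit fix can be sketched on toy data as follows. The factor of `1e-3` assumes a water density of 1000 kg per m3; the `unresolved` check at the end is an addition for illustration, intended to flag any unit mismatch that still has the default factor.

```python
import pandas as pd

# Toy stand-in for the merged dataframe: one mismatched row, one matching row
df_demo = pd.DataFrame(
    {
        "Flowable": ["Water", "1-Pentanol"],
        "Unit_x": ["kg", "kg"],   # Simapro unit
        "Unit_y": ["m3", "kg"],   # ecoinvent unit
    }
)

# Float default, so a fractional factor can be assigned later
df_demo["ConversionFactor"] = 1.0

# Combine boolean masks with `&` (element-wise AND)
water_mask = (
    (df_demo.Flowable == "Water")
    & (df_demo.Unit_x == "kg")
    & (df_demo.Unit_y == "m3")
)
df_demo.loc[water_mask, "ConversionFactor"] = 1e-3  # assumes 1000 kg of water per m3

# Any remaining unit mismatch without a conversion factor would show up here
unresolved = df_demo[
    (df_demo.Unit_x != df_demo.Unit_y) & (df_demo.ConversionFactor == 1.0)
]
print(unresolved)
```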

# Finalize export

Adjust columns to match expected format:

In [20]:
def fix_names_after_merge(df):
    mapping = {
        'Flow UUID': 'SourceFlowUUID', 
        'FlowUUID': 'TargetFlowUUID',  # Incorrect column header in provided ecoinvent data
        'Flowable_x': 'SourceFlowName', 
        'Flowable_y': 'TargetFlowName',
        'Unit_x': 'SourceUnit',
        'Unit_y': 'TargetUnit',
        'Context_x': 'SourceFlowContext',
        'Context_y': 'TargetFlowContext',
    }
    return df.rename(columns={k: v for k, v in mapping.items() if k in df.columns})

In [21]:
df = fix_names_after_merge(df)

Add some useful columns.

* `author` is your name
* `notebook_name` is the name of this notebook; we can't figure this out automatically. It should normally start with `Match -`.
* `default_match_condition` is one of `=`, `~`, `<`, or `>`.

In [22]:
def add_common_columns(df, author, notebook_name, default_match_condition="="):
    df['SourceListName'] = 'SimaPro9.4'
    df['TargetListName'] = 'ecoinventEFv3.7'
    df['MatchCondition'] = default_match_condition
    df['Mapper'] = author
    df['MemoMapper'] = f'Automated match. Notebook: {notebook_name}'
    df['MemoSource'] = ''
    df['MemoTarget'] = ''
    df['MemoVerifier'] = ''
    df['LastUpdated'] = datetime.now(timezone.utc).astimezone().isoformat()
    df['Verifier'] = ''
    return df

In [23]:
df = add_common_columns(df, "Chris Mutel", "Match - Identical names in water")
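Note that the call above writes the default `=` into `MatchCondition` for every row, including the contexts that the correspondence table marks with `~`, `<`, or `>`. If the per-context conditions should be exported instead, one possible approach is to copy them over afterwards. This is only a sketch on toy data, and it assumes that `MatchCondition_SP_to_EI` is the right direction for a Simapro source and ecoinvent target.

```python
import pandas as pd

# Toy rows carrying the per-context conditions from the correspondence table
demo = pd.DataFrame(
    {
        "EI_Context": ["water/fossil well", "water/ocean", "water/surface water"],
        "MatchCondition_SP_to_EI": ["~", "=", "<"],
    }
)

# What a single default would produce
demo["MatchCondition"] = "="

# Assumption: the source-to-target condition is the one wanted in the export
demo["MatchCondition"] = demo["MatchCondition_SP_to_EI"]
print(demo[["EI_Context", "MatchCondition"]])
```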

Make sure the required columns are present

In [24]:
def check_required_columns(df):
    expected = set([     
        "SourceListName", "SourceFlowName", "SourceFlowUUID", "SourceFlowContext", "SourceUnit", 
        "MatchCondition", "TargetListName", "TargetFlowName", "TargetFlowUUID", 
        "TargetFlowContext", "TargetUnit", "Mapper", "Verifier", "LastUpdated", "MemoMapper", 
        "MemoVerifier", "MemoSource", "MemoTarget"
    ])
    given = set(df.columns)
    difference = expected.difference(given)
    if difference:
        print("Missing the following required columns:", difference)
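A variant that returns the missing names instead of printing them can be easier to use in automated checks. The sketch below uses a hypothetical helper, `missing_columns`, and a shortened list of expected columns purely for illustration.

```python
import pandas as pd


def missing_columns(df, expected):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return set(expected) - set(df.columns)


# Toy dataframe that, like the merged result, has `Flowable` but no source/target names
demo = pd.DataFrame(columns=["SourceFlowUUID", "TargetFlowUUID", "Flowable"])
missing = missing_columns(
    demo,
    ["SourceFlowName", "SourceFlowUUID", "TargetFlowName", "TargetFlowUUID"],
)
print(sorted(missing))
```

An empty set then means all required columns are present, which can be tested directly with `assert not missing`.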

In [25]:
check_required_columns(df)

Missing the following required columns: {'TargetFlowName', 'SourceFlowName'}


The names are exactly the same, so we can just duplicate them:

In [26]:
df['SourceFlowName'] = df['TargetFlowName'] = df['Flowable']

In [27]:
check_required_columns(df)

Export the dataframe to the `Contribute` directory. Please make your filename meaningful.

In [28]:
def export_dataframe(df, name):
    SPEC_COLUMNS = [
        "SourceListName", "SourceFlowName", "SourceFlowUUID", "SourceFlowContext", "SourceUnit", 
        "MatchCondition", "ConversionFactor", "TargetListName", "TargetFlowName", "TargetFlowUUID", 
        "TargetFlowContext", "TargetUnit", "Mapper", "Verifier", "LastUpdated", "MemoMapper", 
        "MemoVerifier", "MemoSource", "MemoTarget"
    ]
    
    df = df[[col for col in SPEC_COLUMNS if col in df.columns]]
    
    if not name.lower().endswith(".csv"):
        name += ".csv"
    
    df.to_csv(output_dir / name, index=False)

In [29]:
export_dataframe(df, 'identical-names-in-water')
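After exporting, reading the file back is a quick way to confirm that the columns and values survived the round trip. The sketch below writes a toy dataframe to a temporary directory rather than to the real output directory; the three columns are a small subset chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy export with a few of the spec columns
demo = pd.DataFrame(
    {
        "SourceFlowName": ["Water"],
        "TargetFlowName": ["Water"],
        "ConversionFactor": [1e-3],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "identical-names-in-water.csv"
    demo.to_csv(path, index=False)   # same call as in export_dataframe
    reloaded = pd.read_csv(path)     # read back while the file still exists

print(reloaded)
```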