# Match - Identical names in air

If the name is exactly the same, then we can be pretty confident there is a one-to-one correspondence. We handle each base context separately, as there can be small implementation differences or other gotchas in each one.

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone

Get paths of input and output directories

In [2]:
input_data_dir = (Path.cwd().parent / "Mapping" / "Input" / "Flowlists").resolve()
existing_matches_dir = (Path.cwd().parent / "Mapping" / "Output" / "Mapped_files").resolve()
output_dir = (Path.cwd().parent / "Contribute").resolve()

Read input dataframes

In [3]:
sp = pd.read_csv(input_data_dir / 'SimaProv9.4.csv')

In [4]:
ei = pd.read_csv(input_data_dir / 'ecoinventEFv3.7.csv')

# Dealing with different `Context` values and available levels

The names might match, but we also need the `Context` to match. To do this we need to align the `Context` systems of SimaPro and ecoinvent. We could normalize to either system, but as we are matching to ecoinvent, we normalize to the ecoinvent `Context` values.

In this notebook we look only at emissions to air, so restrict ourselves to these contexts:

In [5]:
sp = sp[sp.Context == 'Airborne emissions']
ei = ei[ei.Context.str.startswith("air/")]

The SimaPro flows only have one `Context`, but we need all the subcontexts available in ecoinvent. These subcontexts also exist in SimaPro, but they are not given in our master flow list. Therefore, we can use an [outer](https://www.ionos.com/digitalguide/hosting/technical-matters/sql-outer-join/) [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) on `Context` to expand each SimaPro flow into one row per relevant ecoinvent subcontext, so that one SimaPro flow can be matched against all relevant ecoinvent flows:

In [6]:
sp[sp.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CAS No,Formula,Synonyms,Unit,Class,Context,Flow UUID,Description
858,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,


In [7]:
ei[ei.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CASNo,Formula,Synonyms,Unit,Class,ExternalReference,Preferred,Context,FlowUUID,AltUnit,Unnamed: 11,Second CAS
11,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,"air/low population density, long-term",9dd01d5b-3677-4822-9cd4-36d21b0e23d1,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
12,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,air/lower stratosphere + upper troposphere,b78e77cb-7636-4420-855e-17239984f8b3,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
13,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,air/non-urban air or from high stacks,cc9a442f-c96a-4bdc-990d-8b58f72b4e07,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
14,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,air/unspecified,048baf1e-6cdc-44a5-92e2-32d15ff54885,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
15,1-Pentanol,000071-41-0,,amyl alcohol,kg,chemical,,,air/urban air close to ground,541a823c-0aad-4dc4-9123-d4af4647d942,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...


## Merging the two `Contexts`

In [8]:
air_categories = pd.DataFrame([
    ('Airborne emissions', 'air/indoor',),
    ('Airborne emissions', 'air/low population density, long-term',),
    ('Airborne emissions', 'air/lower stratosphere + upper troposphere',),
    ('Airborne emissions', 'air/non-urban air or from high stacks',),
    ('Airborne emissions', 'air/unspecified',),
    ('Airborne emissions', 'air/urban air close to ground'),
], columns=["Context", "EI_Context"])    

In [9]:
sp_expanded = sp.merge(air_categories, how="outer", on="Context")
sp_expanded[sp_expanded.Flowable == '1-Pentanol']

Unnamed: 0,Flowable,CAS No,Formula,Synonyms,Unit,Class,Context,Flow UUID,Description,EI_Context
864,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/indoor
865,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,"air/low population density, long-term"
866,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/lower stratosphere + upper troposphere
867,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/non-urban air or from high stacks
868,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/unspecified
869,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Airborne emissions,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/urban air close to ground


## Fixing the SimaPro `Context`

In the master flow list, we only have base `Context` values, like `"Airborne emissions"`. However, we know that SimaPro CSV exports contain more specific contexts, which we can match one-to-one with the ecoinvent `Context` values.

**Note**: This won't be perfect. In the latest data we have available, the SimaPro CSVs use slightly different values depending on the kind of CSV data being exported. For example, you might find:

* In the master flow list: `Airborne emissions`
* In an LCI file: `Emissions to air`
* In an LCIA file: `Air`

There is no *right* answer, but we will use the LCI variant, as that is the most common import type. We can correct these later in the import step if necessary.
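
A minimal sketch of how such a correction could look at import time. The helper name `normalize_sp_context` and the mapping constant are hypothetical, not part of this notebook or of any SimaPro tooling, and the sketch assumes that the base label and the subcontext are separated by `/`, as in the values used here.

```python
# Base labels seen in different SimaPro sources (taken from the list above),
# all mapped to the LCI variant used in this notebook.
SP_AIR_BASE_VARIANTS = {
    "Airborne emissions": "Emissions to air",
    "Emissions to air": "Emissions to air",
    "Air": "Emissions to air",
}


def normalize_sp_context(context):
    """Hypothetical helper: rewrite the base part of a context to the LCI variant.

    Assumes the format "<base>/<subcontext>"; unknown base labels are left unchanged.
    """
    base, sep, sub = context.partition("/")
    base = SP_AIR_BASE_VARIANTS.get(base.strip(), base)
    return f"{base}{sep}{sub}"


print(normalize_sp_context("Air/high. pop."))
print(normalize_sp_context("Airborne emissions"))
```

Keeping the variants in one dictionary means that a newly discovered label only needs one extra entry.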

In [10]:
ei_to_sp_context = {
    'air/indoor': 'Emissions to air/indoor',
    'air/low population density, long-term': 'Emissions to air/low. pop., long-term',
    'air/lower stratosphere + upper troposphere': 'Emissions to air/stratosphere + troposphere',
    'air/non-urban air or from high stacks': 'Emissions to air/low. pop.',
    'air/unspecified': 'Emissions to air/(unspecified)',
    'air/urban air close to ground': 'Emissions to air/high. pop.',
}
mapped_context = sp_expanded.EI_Context.replace(ei_to_sp_context)
sp_expanded.Context = mapped_context

# Merging based on identical names and `Context`

Once we have the `Context` systems aligned, it is quite simple to merge the two dataframes and take results when the `Flowable` and `Context` match exactly.

In [11]:
df = sp_expanded.merge(ei, how="inner", left_on=["Flowable", "EI_Context"], right_on=["Flowable", "Context"])

In [12]:
df

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Context_x,Flow UUID,Description,EI_Context,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
0,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,"Emissions to air/low. pop., long-term",5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,"air/low population density, long-term",...,amyl alcohol,kg,chemical,,,"air/low population density, long-term",9dd01d5b-3677-4822-9cd4-36d21b0e23d1,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
1,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Emissions to air/stratosphere + troposphere,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/lower stratosphere + upper troposphere,...,amyl alcohol,kg,chemical,,,air/lower stratosphere + upper troposphere,b78e77cb-7636-4420-855e-17239984f8b3,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
2,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Emissions to air/low. pop.,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/non-urban air or from high stacks,...,amyl alcohol,kg,chemical,,,air/non-urban air or from high stacks,cc9a442f-c96a-4bdc-990d-8b58f72b4e07,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
3,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Emissions to air/(unspecified),5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/unspecified,...,amyl alcohol,kg,chemical,,,air/unspecified,048baf1e-6cdc-44a5-92e2-32d15ff54885,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
4,1-Pentanol,000071-41-0,C10H12N2O5 ,1-Pentanol,kg,Airborne emissions,Emissions to air/high. pop.,5E6F7EF4-1C2C-414A-BB7D-D29AE6450363,,air/urban air close to ground,...,amyl alcohol,kg,chemical,,,air/urban air close to ground,541a823c-0aad-4dc4-9123-d4af4647d942,,,158778-85-9; 71-41-0; 64118-19-0; 30899-19-5; ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1561,Zirconium-95,013967-71-0,,Zirconium-95,kBq,Airborne emissions,"Emissions to air/low. pop., long-term",97A3FD41-4B43-4BFB-9054-12F3CDA6B78B,Formula: Zr-95,"air/low population density, long-term",...,,kBq,chemical,,,"air/low population density, long-term",7e59174a-d66b-4304-81dc-e0aeb3c58ae2,,,13967-71-0
1562,Zirconium-95,013967-71-0,,Zirconium-95,kBq,Airborne emissions,Emissions to air/stratosphere + troposphere,97A3FD41-4B43-4BFB-9054-12F3CDA6B78B,Formula: Zr-95,air/lower stratosphere + upper troposphere,...,,kBq,chemical,,,air/lower stratosphere + upper troposphere,ae941326-4981-44fa-8be1-8757aab10d00,,,13967-71-0
1563,Zirconium-95,013967-71-0,,Zirconium-95,kBq,Airborne emissions,Emissions to air/low. pop.,97A3FD41-4B43-4BFB-9054-12F3CDA6B78B,Formula: Zr-95,air/non-urban air or from high stacks,...,,kBq,chemical,,,air/non-urban air or from high stacks,fa260f53-4850-4585-9b67-5cdbc603c5ef,,,13967-71-0
1564,Zirconium-95,013967-71-0,,Zirconium-95,kBq,Airborne emissions,Emissions to air/(unspecified),97A3FD41-4B43-4BFB-9054-12F3CDA6B78B,Formula: Zr-95,air/unspecified,...,,kBq,chemical,,,air/unspecified,88a02d30-d9fe-4dc1-b1e5-e95745df0956,,,13967-71-0
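
As a sanity check on the merge above, here is a sketch using toy data (the `Made-up flow` row and the `uuid-*` values are placeholders, not real flows). `validate="many_to_one"` makes pandas raise a `MergeError` if a `Flowable`/`Context` pair occurs more than once on the ecoinvent side, and `indicator=True` with a left merge shows which expanded SimaPro rows found no partner.

```python
import pandas as pd

# Toy stand-ins for `sp_expanded` and `ei`; values are illustrative only
source = pd.DataFrame({
    "Flowable": ["1-Pentanol", "1-Pentanol", "Made-up flow"],
    "EI_Context": ["air/unspecified", "air/urban air close to ground", "air/unspecified"],
})
target = pd.DataFrame({
    "Flowable": ["1-Pentanol", "1-Pentanol"],
    "Context": ["air/unspecified", "air/urban air close to ground"],
    "FlowUUID": ["uuid-a", "uuid-b"],  # placeholder identifiers
})

checked = source.merge(
    target,
    how="left",  # keep every source row, matched or not
    left_on=["Flowable", "EI_Context"],
    right_on=["Flowable", "Context"],
    validate="many_to_one",  # raise if target keys are duplicated
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Source rows without an identically named target flow in the same context
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Flowable", "EI_Context"]])
```

The unmatched rows are the natural input for other matching strategies, since an inner merge drops them silently.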


# Fixing different units

Units differ in only a few rows, all of them for `Water` (kg in SimaPro, m3 in ecoinvent), so there is only one conversion factor to add:

In [13]:
df[df.Unit_x != df.Unit_y]

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Context_x,Flow UUID,Description,EI_Context,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
1500,Water,007732-18-5,,Water,kg,Airborne emissions,"Emissions to air/low. pop., long-term",C7E61CA8-9E6E-4743-B1F8-1CE425AFAE83,Formula: H2O,"air/low population density, long-term",...,,m3,water,,,"air/low population density, long-term",f977a02e-3564-4798-843c-9fb9a18bc18b,,,13670-17-2; 7732-18-5
1501,Water,007732-18-5,,Water,kg,Airborne emissions,Emissions to air/stratosphere + troposphere,C7E61CA8-9E6E-4743-B1F8-1CE425AFAE83,Formula: H2O,air/lower stratosphere + upper troposphere,...,,m3,water,,,air/lower stratosphere + upper troposphere,f14b59ff-d438-442d-8bad-b53694b8263a,,,13670-17-2; 7732-18-5
1502,Water,007732-18-5,,Water,kg,Airborne emissions,Emissions to air/low. pop.,C7E61CA8-9E6E-4743-B1F8-1CE425AFAE83,Formula: H2O,air/non-urban air or from high stacks,...,,m3,water,,,air/non-urban air or from high stacks,09872080-d143-4fb1-a3a5-647b077107ff,,,13670-17-2; 7732-18-5
1503,Water,007732-18-5,,Water,kg,Airborne emissions,Emissions to air/(unspecified),C7E61CA8-9E6E-4743-B1F8-1CE425AFAE83,Formula: H2O,air/unspecified,...,,m3,water,,,air/unspecified,075e433b-4be4-448e-9510-9a5029c1ce94,,,13670-17-2; 7732-18-5
1504,Water,007732-18-5,,Water,kg,Airborne emissions,Emissions to air/high. pop.,C7E61CA8-9E6E-4743-B1F8-1CE425AFAE83,Formula: H2O,air/urban air close to ground,...,,m3,water,,,air/urban air close to ground,5d368100-b1bc-4456-8420-e469edccf349,,,13670-17-2; 7732-18-5


In [14]:
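# Default: no conversion. Use a float so that fractional factors can be assigned later without changing the column dtype.
df['ConversionFactor'] = 1.0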

In [15]:
# Combine the boolean masks with `&` (element-wise AND)
water_mask = (df.Flowable == "Water") & (df.Unit_x == "kg") & (df.Unit_y == "m3")
water_mask.sum()

5

In [16]:
df.loc[water_mask, 'ConversionFactor'] = 1e-3
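
The factor `1e-3` can be checked by hand. This sketch assumes a density of about 1000 kg/m3 for water and that `ConversionFactor` multiplies the source amount (kg) to give the target amount (m3); the function name is hypothetical.

```python
# Assumption: liquid water at roughly 1000 kg per cubic meter
WATER_DENSITY_KG_PER_M3 = 1000.0


def kg_water_to_m3(mass_kg, conversion_factor=1 / WATER_DENSITY_KG_PER_M3):
    """Convert a mass of water in kg to a volume in m3 using the given factor."""
    return mass_kg * conversion_factor


# 2500 kg of water corresponds to about 2.5 m3
print(kg_water_to_m3(2500.0))
```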

# Finalize export

Adjust columns to match expected format:

In [17]:
def fix_names_after_merge(df):
    mapping = {
        'Flow UUID': 'SourceFlowUUID', 
        'FlowUUID': 'TargetFlowUUID',  # Incorrect column header in provided ecoinvent data
        'Flowable_x': 'SourceFlowName', 
        'Flowable_y': 'TargetFlowName',
        'Unit_x': 'SourceUnit',
        'Unit_y': 'TargetUnit',
        'Context_x': 'SourceFlowContext',
        'Context_y': 'TargetFlowContext',
    }
    return df.rename(columns={k: v for k, v in mapping.items() if k in df.columns})

In [18]:
df = fix_names_after_merge(df)

Add some useful columns.

* `author` is your name
* `notebook_name` is the name of this notebook; we can't figure this out automatically. It should normally start with `Match -`.
* `default_match_condition` is one of `=`, `~`, `<`, or `>`.

In [19]:
def add_common_columns(df, author, notebook_name, default_match_condition="="):
    df['SourceListName'] = 'SimaPro9.4'
    df['TargetListName'] = 'ecoinventEFv3.7'
    df['MatchCondition'] = default_match_condition
    df['Mapper'] = author
    df['MemoMapper'] = f'Automated match. Notebook: {notebook_name}'
    df['MemoSource'] = ''
    df['MemoTarget'] = ''
    df['MemoVerifier'] = ''
    df['LastUpdated'] = datetime.now(timezone.utc).astimezone().isoformat()
    df['Verifier'] = ''
    return df

In [20]:
df = add_common_columns(df, "Chris Mutel", "Match - Identical names in air")

Make sure the required columns are present

In [21]:
def check_required_columns(df):
    expected = set([     
        "SourceListName", "SourceFlowName", "SourceFlowUUID", "SourceFlowContext", "SourceUnit", 
        "MatchCondition", "TargetListName", "TargetFlowName", "TargetFlowUUID", 
        "TargetFlowContext", "TargetUnit", "Mapper", "Verifier", "LastUpdated", "MemoMapper", 
        "MemoVerifier", "MemoSource", "MemoTarget"
    ])
    given = set(df.columns)
    difference = expected.difference(given)
    if difference:
        print("Missing the following required columns:", difference)

In [22]:
check_required_columns(df)

Missing the following required columns: {'TargetFlowName', 'SourceFlowName'}


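Because we merged on `Flowable`, the result has a single `Flowable` column instead of `Flowable_x` and `Flowable_y`, so the rename step did not create the two name columns. The names are exactly the same, so we can just duplicate them: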

In [23]:
df['SourceFlowName'] = df['TargetFlowName'] = df['Flowable']

Export the dataframe to the `Contribute` directory. Please make your filename meaningful.

In [24]:
def export_dataframe(df, name):
    SPEC_COLUMNS = [
        "SourceListName", "SourceFlowName", "SourceFlowUUID", "SourceFlowContext", "SourceUnit", 
        "MatchCondition", "ConversionFactor", "TargetListName", "TargetFlowName", "TargetFlowUUID", 
        "TargetFlowContext", "TargetUnit", "Mapper", "Verifier", "LastUpdated", "MemoMapper", 
        "MemoVerifier", "MemoSource", "MemoTarget"
    ]
    
    df = df[[col for col in SPEC_COLUMNS if col in df.columns]]
    
    if not name.lower().endswith(".csv"):
        name += ".csv"
    
    df.to_csv(output_dir / name, index=False)

In [25]:
export_dataframe(df, 'identical-names-in-air')