# Match - Identical names in air

If the name is exactly the same, then we can be pretty confident there is a one-to-one correspondence. We need to handle each base context separately, as there can be small implementation differences or other gotchas in each one.

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime, timezone

Get paths of input and output directories

In [2]:
input_data_dir = (Path.cwd().parent / "Mapping" / "Input" / "Flowlists").resolve()
existing_matches_dir = (Path.cwd().parent / "Mapping" / "Output" / "Mapped_files").resolve()
output_dir = (Path.cwd().parent / "Contribute").resolve()

Read input dataframes

In [3]:
sp = pd.read_csv(input_data_dir / 'SimaProv9.4.csv')

In [4]:
ei = pd.read_csv(input_data_dir / 'ecoinventEFv3.7.csv')

In [5]:
sorted(sp.Context.unique())

['Airborne emissions',
 'Economic issues',
 'Emissions to soil',
 'Final waste flows',
 'Non material emissions',
 'Raw materials',
 'Social issues',
 'Waterborne emissions']

In [6]:
sorted(ei.Context.unique())

['air/indoor',
 'air/low population density, long-term',
 'air/lower stratosphere + upper troposphere',
 'air/non-urban air or from high stacks',
 'air/unspecified',
 'air/urban air close to ground',
 'natural resource/biotic',
 'natural resource/fossil well',
 'natural resource/in air',
 'natural resource/in ground',
 'natural resource/in water',
 'natural resource/land',
 'soil/agricultural',
 'soil/forestry',
 'soil/industrial',
 'soil/unspecified',
 'water/fossil well',
 'water/ground-',
 'water/ground-, long-term',
 'water/ocean',
 'water/surface water',
 'water/unspecified']
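
The two lists clearly organize contexts differently: SimaPro uses a single flat label, while ecoinvent uses `compartment/subcompartment`. As a rough sketch of how the ecoinvent side could be summarized, the snippet below splits the label on the first `/`. The dataframe `ei_demo` is a toy stand-in with a handful of the contexts listed above, not the real flow list.

```python
import pandas as pd

# Toy stand-in for the ecoinvent flow list; only the `Context` column matters here.
ei_demo = pd.DataFrame({
    "Context": [
        "air/unspecified",
        "natural resource/land",
        "natural resource/in ground",
        "soil/forestry",
        "water/ocean",
    ]
})

# Split "compartment/subcompartment" once and keep the part before the slash.
compartments = ei_demo["Context"].str.split("/", n=1).str[0]
top_level = sorted(compartments.unique())
print(top_level)
```

Grouping by the top-level compartment makes it easier to see which SimaPro context each block of ecoinvent contexts has to be compared against.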

# Matching the contexts

Only look at land occupation and transformation flows, and match names that are identical.

In [10]:
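# Take explicit copies so that later column assignments do not act on a view of the full list
sp = sp[sp.Context == 'Raw materials'].copy()
ei = ei[ei.Context == 'natural resource/land'].copy()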

## Fixing the Simapro `Context`

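The master flow list files these flows under `Raw materials`, but the LCI exports use `Resources/land`, so we relabel every remaining row. Raw materials that are not land flows get relabelled too, but they should have no identically named counterpart in `natural resource/land` and so should drop out of the inner merge later on.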

In [13]:
sp.Context = "Resources/land"

# Merging based on identical names

Once we have the `Context` systems aligned, it is quite simple to merge the two dataframes and take results when the `Flowable` matches exactly.

In [16]:
# Inner join on the shared name column; other overlapping columns get the _x / _y suffixes
df = sp.merge(ei, how="inner", on="Flowable")
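
An exact-name merge only gives a one-to-one mapping if no name repeats within either list. As an optional sanity check (not part of the original workflow), pandas can enforce this through the `validate` argument of `merge`. The frames below are toy data; the names `Only in source` and `Only in target` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "Flowable": ["Occupation, annual crop", "Transformation, to unspecified", "Only in source"],
    "Unit": ["m2a", "m2", "m2"],
})
right = pd.DataFrame({
    "Flowable": ["Occupation, annual crop", "Transformation, to unspecified", "Only in target"],
    "Unit": ["m2*a", "m2", "m2"],
})

# validate="one_to_one" raises MergeError if a key repeats on either side.
matched = left.merge(right, how="inner", on="Flowable", validate="one_to_one")
print(matched)

# Duplicate one target row to show the check failing.
right_with_duplicate = pd.concat([right, right.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    left.merge(right_with_duplicate, how="inner", on="Flowable", validate="one_to_one")
except pd.errors.MergeError:
    duplicate_detected = True
print("Duplicate detected:", duplicate_detected)
```

Without such a check, a repeated name would silently produce extra rows in the merged dataframe.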

In [17]:
df

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Context_x,Flow UUID,Description,CASNo,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
0,"Occupation, annual crop",,,"Occupation, annual crop",m2a,Raw materials,Resources/land,322AC138-364B-4E92-8BDB-0A86A4BD95C6,,,...,,m2*a,land,,,natural resource/land,c5aafa60-495c-461c-a1d4-b262a34c45b9,,,
1,"Occupation, annual crop, flooded crop",,,"Occupation, annual crop, flooded crop",m2a,Raw materials,Resources/land,3AB83A38-BB37-478D-8E23-71ED99C52254,,,...,,m2*a,land,,,natural resource/land,7956039f-1181-42ab-b03b-ba9992733394,,,
2,"Occupation, annual crop, greenhouse",,,"Occupation, annual crop, greenhouse",m2a,Raw materials,Resources/land,E51A65A6-0FE4-4D81-9011-E08CB30AC361,,,...,,m2*a,land,,,natural resource/land,9e80f7cd-47fa-4c7f-8f2c-bdb9731b3196,,,
3,"Occupation, annual crop, irrigated",,,"Occupation, annual crop, irrigated",m2a,Raw materials,Resources/land,FB4F29FA-EFF3-4D25-91E2-FE5F1740DC21,,,...,,m2*a,land,,,natural resource/land,c4a82f46-381f-474c-a362-3363064b9c33,,,
4,"Occupation, annual crop, irrigated, extensive",,Pd,"Occupation, annual crop, irrigated, extensive",m2a,Raw materials,Resources/land,D2E60B56-EE72-43F7-80F0-22991F396C17,,,...,,m2*a,land,,,natural resource/land,12c7671c-e4aa-46c6-93c5-b6f9ac1c453b,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,"Transformation, to unspecified",,C5H8O2 ,"Transformation, to unspecified",m2,Raw materials,Resources/land,D05B135A-A729-493D-A35E-8C5AAFB84645,,,...,,m2,land,,,natural resource/land,512a5356-8059-4772-a43f-42e3c4f3d299,,,
138,"Transformation, to unspecified, natural (non-use)",,,"Transformation, to unspecified, natural (non-use)",m2,Raw materials,Resources/land,64B35EA9-14CF-40FE-B86B-13EA5988E6E4,,,...,,m2,land,,,natural resource/land,46cfaeaf-f124-409f-998d-47b159051cec,,,
139,"Transformation, to urban, continuously built",,,"Transformation, to urban, continuously built",m2,Raw materials,Resources/land,952D05EA-6D59-44AC-A6DC-A29550056CDB,,,...,,m2,land,,,natural resource/land,66f25f1d-1898-4827-bcbb-ca82f15c4d02,,,
140,"Transformation, to urban, discontinuously built",,C5H10O ,"Transformation, to urban, discontinuously built",m2,Raw materials,Resources/land,A3DEA065-9BDB-4564-A1B0-334C3A7A2079,,,...,,m2,land,,,natural resource/land,55beee8d-d04e-4307-bb0e-4e113dc07ee7,,,


# Check different units

The two lists label the units differently (`m2a` versus `m2*a`), but both labels denote the same unit, so no conversion is needed.

In [22]:
df.groupby(["Unit_x", "Unit_y"]).size().reset_index()

Unnamed: 0,Unit_x,Unit_y,0
0,m2,m2,98
1,m2a,m2*a,44
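
If one wanted to confirm this programmatically rather than by eye, a small alias table could normalize the labels before comparing them. This is only a sketch: `UNIT_ALIASES` and `normalize_unit` are hypothetical helpers, and the alias table covers just the one label pair seen in the output above.

```python
import pandas as pd

# Hypothetical lookup: maps a label variant to one canonical spelling.
UNIT_ALIASES = {"m2a": "m2*a"}


def normalize_unit(label):
    """Return the canonical spelling of a unit label, or the label itself if unknown."""
    return UNIT_ALIASES.get(label, label)


# The two unit pairs reported by the groupby above.
pairs = pd.DataFrame({"Unit_x": ["m2", "m2a"], "Unit_y": ["m2", "m2*a"]})

# After normalization, source and target units should agree row by row.
same_unit = pairs["Unit_x"].map(normalize_unit) == pairs["Unit_y"].map(normalize_unit)
all_units_agree = bool(same_unit.all())
print(all_units_agree)
```

Any pair that still disagrees after normalization would need a closer look before being exported with the default `=` match condition.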


# Finalize export

Adjust columns to match expected format:

In [23]:
def fix_names_after_merge(df):
    mapping = {
        'Flow UUID': 'SourceFlowUUID', 
        'FlowUUID': 'TargetFlowUUID',  # Incorrect column header in provided ecoinvent data
        'Flowable_x': 'SourceFlowName', 
        'Flowable_y': 'TargetFlowName',
        'Unit_x': 'SourceUnit',
        'Unit_y': 'TargetUnit',
        'Context_x': 'SourceFlowContext',
        'Context_y': 'TargetFlowContext',
    }
    return df.rename(columns={k: v for k, v in mapping.items() if k in df.columns})

In [24]:
df = fix_names_after_merge(df)

Add some useful columns.

* `author` is your name
* `notebook_name` is the name of this notebook; we can't figure this out automatically. It should normally start with `Match -`.
* `default_match_condition` is one of `=`, `~`, `<`, or `>`.
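
The accepted values for `default_match_condition` are not enforced anywhere in this notebook. A minimal guard could look like the following; `ALLOWED_MATCH_CONDITIONS` and `validate_match_condition` are hypothetical names introduced here for illustration.

```python
# The four match conditions listed above.
ALLOWED_MATCH_CONDITIONS = {"=", "~", "<", ">"}


def validate_match_condition(condition):
    """Return `condition` unchanged, or raise ValueError if it is not an accepted symbol."""
    if condition not in ALLOWED_MATCH_CONDITIONS:
        allowed = ", ".join(sorted(ALLOWED_MATCH_CONDITIONS))
        raise ValueError(f"Unknown match condition {condition!r}; expected one of: {allowed}")
    return condition


checked = validate_match_condition("=")
print(checked)
```

Calling such a guard before filling the `MatchCondition` column would catch typos like `==` early.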

In [25]:
def add_common_columns(df, author, notebook_name, default_match_condition="="):
    df['SourceListName'] = 'SimaPro9.4'
    df['TargetListName'] = 'ecoinventEFv3.7'
    df['MatchCondition'] = default_match_condition
    df['Mapper'] = author
    df['MemoMapper'] = f'Automated match. Notebook: {notebook_name}'
    df['MemoSource'] = ''
    df['MemoTarget'] = ''
    df['MemoVerifier'] = ''
    df['LastUpdated'] = datetime.now(timezone.utc).astimezone().isoformat()
    df['Verifier'] = ''
    return df

In [26]:
df = add_common_columns(df, "Chris Mutel", "Match - Identical names in air")

Make sure the required columns are present

In [27]:
def check_required_columns(df):
    expected = {
        "SourceListName", "SourceFlowName", "SourceFlowUUID", "SourceFlowContext", "SourceUnit",
        "MatchCondition", "TargetListName", "TargetFlowName", "TargetFlowUUID",
        "TargetFlowContext", "TargetUnit", "Mapper", "Verifier", "LastUpdated", "MemoMapper",
        "MemoVerifier", "MemoSource", "MemoTarget"
    }
    # Report any spec column that the dataframe still lacks
    missing = expected.difference(df.columns)
    if missing:
        print("Missing the following required columns:", missing)

In [28]:
check_required_columns(df)

Missing the following required columns: {'SourceFlowName', 'TargetFlowName'}


The names are exactly the same, so we can just duplicate them:

In [29]:
df['SourceFlowName'] = df['TargetFlowName'] = df['Flowable']
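
The two name columns were missing because pandas does not add the `_x` / `_y` suffixes to the merge key itself, so `Flowable_x` and `Flowable_y` never exist and the rename mapping has nothing to act on. The toy example below illustrates this under the assumption of a single shared key column; the frames are made up and much smaller than the real lists.

```python
import pandas as pd

left = pd.DataFrame({"Flowable": ["Occupation, annual crop"], "Unit": ["m2a"]})
right = pd.DataFrame({"Flowable": ["Occupation, annual crop"], "Unit": ["m2*a"]})

merged = left.merge(right, how="inner", on="Flowable")

# Only the overlapping non-key column (`Unit`) is suffixed; the key stays as `Flowable`.
has_suffixed_name = "Flowable_x" in merged.columns
print(list(merged.columns), has_suffixed_name)

# Copy the shared name into both spec columns, as done in the cell above.
merged["SourceFlowName"] = merged["Flowable"]
merged["TargetFlowName"] = merged["Flowable"]

missing = {"SourceFlowName", "TargetFlowName"}.difference(merged.columns)
print(missing)
```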

Export the dataframe to the `Contribute` directory. Please make your filename meaningful.

In [30]:
def export_dataframe(df, name):
    SPEC_COLUMNS = [
        "SourceListName", "SourceFlowName", "SourceFlowUUID", "SourceFlowContext", "SourceUnit", 
        "MatchCondition", "ConversionFactor", "TargetListName", "TargetFlowName", "TargetFlowUUID", 
        "TargetFlowContext", "TargetUnit", "Mapper", "Verifier", "LastUpdated", "MemoMapper", 
        "MemoVerifier", "MemoSource", "MemoTarget"
    ]
    
    df = df[[col for col in SPEC_COLUMNS if col in df.columns]]
    
    if not name.lower().endswith(".csv"):
        name += ".csv"
    
    df.to_csv(output_dir / name, index=False)

In [31]:
export_dataframe(df, 'resources-land')
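
After exporting, it can be worth reading the file back to confirm that names containing commas survived the CSV round trip. The sketch below does this with a one-row toy dataframe and a temporary directory instead of the real `Contribute` folder; the column subset is chosen only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "SourceFlowName": ["Occupation, annual crop"],
    "TargetFlowName": ["Occupation, annual crop"],
    "SourceUnit": ["m2a"],
    "TargetUnit": ["m2*a"],
    "MatchCondition": ["="],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "resources-land.csv"
    demo.to_csv(path, index=False)   # same call as in export_dataframe
    reloaded = pd.read_csv(path)     # read it back while the file still exists

# Compare column order and cell values of the written and reloaded data.
columns_match = list(reloaded.columns) == list(demo.columns)
values_match = reloaded.values.tolist() == demo.values.tolist()
print(columns_match, values_match)
```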