# Match - Identical names in soil

If the name is exactly the same, then we can be pretty confident there is a one-to-one correspondence. We need to handle each base context separately, as there can be small implementation differences or other gotchas in each one.

In [1]:
import pandas as pd
from pathlib import Path
from notebook_utils import finish_notebook

Get paths of input and output directories

In [2]:
input_data_dir = (Path.cwd().parent / "Mapping" / "Input" / "Flowlists").resolve()
existing_matches_dir = (Path.cwd().parent / "Mapping" / "Output" / "Mapped_files").resolve()

Read input dataframes

In [3]:
sp = pd.read_csv(input_data_dir / 'SimaProv9.4.csv')

In [4]:
ei = pd.read_csv(input_data_dir / 'ecoinventEFv3.7.csv')

# Dealing with different `Context` values and available levels

The names might match, but we also need the `Context` to match. To do this we need to align the `Context` systems from SimaPro and ecoinvent. We could normalize to either system, but as we are matching to ecoinvent, we normalize to the ecoinvent `Context` values.

In this notebook we look only at emissions to soil, so we restrict ourselves to these contexts:

In [5]:
sp = sp[sp.Context == 'Emissions to soil']
ei = ei[ei.Context.str.startswith("soil/")]

The SimaPro flows only have one `Context`, but we need all the subcontexts available in ecoinvent. These subcontexts also exist in SimaPro, but they are not given in our master flow list. Therefore, we can use an [outer](https://www.ionos.com/digitalguide/hosting/technical-matters/sql-outer-join/) [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) against a small lookup table of contexts to expand each SimaPro flow into one row per relevant ecoinvent subcontext:

In [6]:
sp[sp.Flowable == 'Zinc']

Unnamed: 0,Flowable,CAS No,Formula,Synonyms,Unit,Class,Context,Flow UUID,Description
13252,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn


In [7]:
ei[ei.Flowable == 'Zinc']

Unnamed: 0,Flowable,CASNo,Formula,Synonyms,Unit,Class,ExternalReference,Preferred,Context,FlowUUID,AltUnit,Unnamed: 11,Second CAS
4266,Zinc,007440-66-6,,,kg,chemical,,,soil/agricultural,84aa799e-9d98-4d34-85e0-516d28ab1be9,,,7440-66-6
4267,Zinc,007440-66-6,,,kg,chemical,,,soil/forestry,8d226423-1351-4366-b09f-d16c9e38683c,,,7440-66-6
4268,Zinc,007440-66-6,,,kg,chemical,,,soil/industrial,b88999ec-84fd-462d-9dec-7e20a4636a58,,,7440-66-6
4269,Zinc,007440-66-6,,,kg,chemical,,,soil/unspecified,77887584-ddca-4920-952c-3609730e0c13,,,7440-66-6


## Merging the two `Contexts`

In [8]:
soil_categories = pd.DataFrame([
    ('Emissions to soil', 'soil/agricultural'),
    ('Emissions to soil', 'soil/forestry'),
    ('Emissions to soil', 'soil/industrial'),
    ('Emissions to soil', 'soil/urban, non industrial'),
    ('Emissions to soil', 'soil/unspecified'),
], columns=["Context", "EI_Context"])

In [9]:
sp_expanded = sp.merge(soil_categories, how="outer", on="Context")
sp_expanded[sp_expanded.Flowable == 'Zinc']

Unnamed: 0,Flowable,CAS No,Formula,Synonyms,Unit,Class,Context,Flow UUID,Description,EI_Context
18025,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,soil/agricultural
18026,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,soil/forestry
18027,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,soil/industrial
18028,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,"soil/urban, non industrial"
18029,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,soil/unspecified
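To make the row expansion above easier to follow, here is a minimal sketch on made-up data. The frames `toy_sp` and `toy_categories` are hypothetical stand-ins, not the real flow lists. Because every row on both sides shares the same `Context` value, the outer merge pairs each flow with each subcontext in the lookup table.

```python
import pandas as pd

# Hypothetical stand-in for the SimaPro flow list (not real data)
toy_sp = pd.DataFrame({
    "Flowable": ["Zinc", "Copper"],
    "Context": ["Emissions to soil", "Emissions to soil"],
})

# Lookup table: one base context fans out to several subcontexts
toy_categories = pd.DataFrame(
    [
        ("Emissions to soil", "soil/agricultural"),
        ("Emissions to soil", "soil/forestry"),
    ],
    columns=["Context", "EI_Context"],
)

# Each of the 2 flows is paired with each of the 2 subcontexts
toy_expanded = toy_sp.merge(toy_categories, how="outer", on="Context")
print(toy_expanded)
```

With two flows and two subcontexts, the result has four rows, which mirrors how the single `Zinc` row above becomes five.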


## Fixing the SimaPro `Context`

In the master flow list, we only have base `Context` values, like `"Emissions to soil"`. However, we know that SimaPro CSV exports use more specific contexts, and we can match these one-to-one with the ecoinvent `Context` values.

**Note**: This won't be perfect. In the latest data we have available, the SimaPro CSVs use slightly different values depending on the kind of CSV data being exported. For example, you might find:

* In the master flow list: `Emissions to soil`
* In an LCI file: `Emissions to soil`
* In an LCIA file: `Soil`

There is no *right* answer, but we will use the LCI variant, as that is the most common import type. We can correct these later in the import step if necessary.

In [10]:
ei_to_sp_context = {
    'soil/agricultural': 'Emissions to soil/agricultural',
    'soil/forestry': 'Emissions to soil/forestry',
    'soil/industrial': 'Emissions to soil/industrial',
    'soil/unspecified': 'Emissions to soil/(unspecified)',
}
mapped_context = sp_expanded.EI_Context.replace(ei_to_sp_context)
sp_expanded.Context = mapped_context
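One thing worth checking here: `Series.replace` with a dictionary leaves any value without a dictionary entry unchanged. The mapping above has four entries, while `soil_categories` lists five subcontexts, so `soil/urban, non industrial` passes through untranslated. The sketch below reproduces this on a standalone series (the `ei_contexts` series is an illustrative stand-in for `sp_expanded.EI_Context`); whether it matters in practice depends on whether the ecoinvent list contains flows in that context, which is not shown in this notebook.

```python
import pandas as pd

ei_to_sp_context = {
    'soil/agricultural': 'Emissions to soil/agricultural',
    'soil/forestry': 'Emissions to soil/forestry',
    'soil/industrial': 'Emissions to soil/industrial',
    'soil/unspecified': 'Emissions to soil/(unspecified)',
}

# Illustrative stand-in for sp_expanded.EI_Context: the five subcontexts
ei_contexts = pd.Series([
    'soil/agricultural',
    'soil/forestry',
    'soil/industrial',
    'soil/urban, non industrial',
    'soil/unspecified',
])

# Values missing from the dictionary are passed through unchanged
mapped = ei_contexts.replace(ei_to_sp_context)

# List the subcontexts that have no SimaPro translation
unmapped = sorted(set(ei_contexts) - set(ei_to_sp_context))
print(unmapped)
```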

# Merging based on identical names and `Context`

Once we have the `Context` systems aligned, it is quite simple to merge the two dataframes and take results when the `Flowable` and `Context` match exactly.

In [11]:
df = sp_expanded.merge(ei, how="inner", left_on=["Flowable", "EI_Context"], right_on=["Flowable", "Context"])

In [12]:
df

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Context_x,Flow UUID,Description,EI_Context,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
0,"2-Amino-3-chloro-1,4-naphthoquinone",002797-51-5,,"2-Amino-3-chloro-1,4-naphthoquinone",kg,Emissions to soil,Emissions to soil/agricultural,DC8E14C2-D172-4120-BF06-972470A77881,,soil/agricultural,...,,kg,chemical,,,soil/agricultural,f9ba52c4-ffe0-53fc-9724-6c1daebcbc1c,,,2797-51-5
1,2-Phenylphenol,000090-43-7,,2-Phenylphenol,kg,Emissions to soil,Emissions to soil/agricultural,766541E6-BBD1-453C-958C-6EB5EC51A2C0,Formula: C12H10OSynonyms: orthohydroxydipbeny...,soil/agricultural,...,,kg,chemical,,,soil/agricultural,5357f101-f2df-5f6e-b534-f1705b597773,,,61788-42-9; 90-43-7; 90-43-7
2,"2,4-D",000094-75-7,,"2,4-D",kg,Emissions to soil,Emissions to soil/agricultural,0266AEAA-9C73-4874-A015-A20DF754060C,Formula: C8H6Cl2O3,soil/agricultural,...,"(2,4-dichlorophenoxy)acetic acid",kg,chemical,,,soil/agricultural,f681eb3c-854a-4f78-bcfe-76dfbcf9df3c,,,94-75-7
3,"2,4-D ester",,,"2,4-D ester",kg,Emissions to soil,Emissions to soil/agricultural,C91A0AC1-85AE-4AE3-9346-AD358C6C91EA,,soil/agricultural,...,"2,4-D polypropoxybutyl ester",kg,chemical,,,soil/agricultural,6986913c-284b-4173-95fe-4a242498b1bc,,,
4,8-Quinolinol,000148-24-3,,8-Quinolinol,kg,Emissions to soil,Emissions to soil/agricultural,26C96A47-D123-4390-9EAF-762E2F573769,,soil/agricultural,...,,kg,chemical,,,soil/agricultural,3d071bc4-855b-52d7-baa8-2cacbf536777,,,148-24-3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
598,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil/industrial,E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,soil/industrial,...,,kg,chemical,,,soil/industrial,b88999ec-84fd-462d-9dec-7e20a4636a58,,,7440-66-6
599,Zinc,007440-66-6,,Zinc,kg,Emissions to soil,Emissions to soil/(unspecified),E993F6E9-0409-4C02-A9AD-02C64644ED43,Formula: Zn,soil/unspecified,...,,kg,chemical,,,soil/unspecified,77887584-ddca-4920-952c-3609730e0c13,,,7440-66-6
600,Zineb,012122-67-7,,Zineb,kg,Emissions to soil,Emissions to soil/agricultural,49B6CAB9-F212-4B89-AA70-DB7FABC68F0D,Formula: C4H6N2S4Zn ,soil/agricultural,...,,kg,chemical,,,soil/agricultural,02156abf-3839-5778-897a-4c06444701d4,,,9006-42-2; 12122-67-7
601,Ziram,000137-30-4,,Ziram,kg,Emissions to soil,Emissions to soil/agricultural,E5C0FEDE-88E8-4D4A-B420-5CFA5C15BEA6,Formula: C6H12N2S4Zn Synonyms: aaprotect; aav...,soil/agricultural,...,,kg,chemical,,,soil/agricultural,fb4f9b12-7ec0-59aa-8087-c6115d1fcd31,,,137-30-4
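The `_x` and `_y` suffixes in the output come from pandas: columns that exist in both frames but are not shared join keys get `_x` (left, SimaPro) and `_y` (right, ecoinvent). A minimal sketch on made-up data shows the same pattern; the frames and the flow `Madeupium` are hypothetical and only serve to show that unmatched flows are dropped by the inner merge.

```python
import pandas as pd

# Hypothetical expanded SimaPro rows (not real data)
toy_sp = pd.DataFrame({
    "Flowable": ["Zinc", "Zinc", "Madeupium"],
    "Context": [
        "Emissions to soil/agricultural",
        "Emissions to soil/forestry",
        "Emissions to soil/agricultural",
    ],
    "EI_Context": ["soil/agricultural", "soil/forestry", "soil/agricultural"],
    "Unit": ["kg", "kg", "kg"],
})

# Hypothetical ecoinvent rows (not real data)
toy_ei = pd.DataFrame({
    "Flowable": ["Zinc", "Zinc"],
    "Context": ["soil/agricultural", "soil/forestry"],
    "Unit": ["kg", "kg"],
})

# Keep only rows where both the name and the ecoinvent context agree
toy_df = toy_sp.merge(
    toy_ei,
    how="inner",
    left_on=["Flowable", "EI_Context"],
    right_on=["Flowable", "Context"],
)
print(toy_df.columns.tolist())
print(len(toy_df))
```

Here `Context_x` holds the SimaPro-style context and `Context_y` the ecoinvent one, as in the real output above.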


# Fixing different units

Units already agree for every matched flow: the check below returns no rows, so there is nothing to fix here.

In [13]:
df[df.Unit_x != df.Unit_y]

Unnamed: 0,Flowable,CAS No,Formula_x,Synonyms_x,Unit_x,Class_x,Context_x,Flow UUID,Description,EI_Context,...,Synonyms_y,Unit_y,Class_y,ExternalReference,Preferred,Context_y,FlowUUID,AltUnit,Unnamed: 11,Second CAS
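For comparison, this is roughly what the same filter would return if some units did disagree. The data is invented for illustration (the `m3` unit and the flow `Madeupium` do not come from the real lists).

```python
import pandas as pd

# Hypothetical merged result with one deliberate unit mismatch
toy_merged = pd.DataFrame({
    "Flowable": ["Zinc", "Madeupium"],
    "Unit_x": ["kg", "kg"],
    "Unit_y": ["kg", "m3"],
})

# Same filter as in the notebook: keep rows whose units differ
mismatched = toy_merged[toy_merged.Unit_x != toy_merged.Unit_y]
print(mismatched)
```

Any rows surfaced this way would need a unit conversion or manual review before export.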


# Finalize export

In [14]:
df['SourceFlowName'] = df['TargetFlowName'] = df['Flowable']

In [15]:
finish_notebook(
    df=df,
    author="Chris Mutel",
    notebook_name="Match - Identical names in soil",
    filename='identical-names-in-soil',
)