# Generating ecoSpold2 files for WWT

Pascal Lesage notes for ICRA

## NOTES TO SELF: Ctrl-F zzz

## Description

The WWT LCI tool must generate importable ecoSpold2 files.  
An example ecoSpold2 file already in ecoinvent is found [here](zzz).  
This document shows:
  - how to fill in all the required fields  
  - how to generate the ecoSpold2 files

## Standard inputs

In [35]:
import os
import pandas as pd
import pickle # Used temporarily to access a MasterData dictionary - check if still useful at the end of the project.
from lxml import objectify # Parse XML into Python objects

## Guillaume Bourgault (GB) code

This document relies heavily on the code prepared by GB and distributed in July (`spold2_writer_use.py`).  
The GB code is meant to be executed from a .py file: it uses `os.path.realpath(__file__)` as the starting point for locating files, and `__file__` is not defined inside a notebook.  
I'll therefore make heavy use of the [%run](http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-run) magic command, which runs a .py file as a script from within the notebook.

The functions written in `spold2_writer_use.py` are imported in the following cell.

In [64]:
os.chdir(r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool')
from spold2_writer_use import *

## Master data

The ecoinvent database contains master data for the following entities: Activity Names, Classifications, Companies, Compartments, Exchanges (Elementary and Intermediate), Geographies, Languages, Market Models, Parameters, Persons, Properties, Scenarios, Sources, Tags and Units. 

Discussions are underway about having the tool access the master data on the ecoinvent/IFU server. This has not yet been resolved, and many at the ecoinvent Center do not consider it a priority: the WWT datasets use only a small amount of master data, and that data could be stored on the server that will host the WWT tool and updated regularly.

For now, I will use the master data downloaded to my computer via the [ecoEditor](http://www.ecoinvent.org/data-provider/data-provider-toolkit/ecoeditor/ecoeditor.html).  
GB has written the following code to help **find the master data**:

In [65]:
master_data_folder = find_current_MD_path()
master_data_folder

'C://Users\\Pascal Lesage\\Documents\\ecoinvent\\EcoEditor\\xml\\MasterData\\Production'

Here are the **contents of the master data directory**:

In [66]:
os.listdir(master_data_folder)

['ActivityIndex.xml',
 'ActivityNames.xml',
 'Classifications.xml',
 'Companies.xml',
 'Compartments.xml',
 'Context.xml',
 'DeletedMasterData.xml',
 'ElementaryExchanges.xml',
 'ExchangeActivityIndex.xml',
 'Geographies.xml',
 'IntermediateExchanges.xml',
 'Languages.xml',
 'MacroEconomicScenarios.xml',
 'Parameters.xml',
 'Persons.xml',
 'Properties.xml',
 'Sources.xml',
 'SystemModels.xml',
 'Tags.xml',
 'UnitConversions.xml',
 'Units.xml',
 'user']

There are multiple ways of looking at this data: directly with a text editor, or via Python using `lxml` or ElementTree.  
I'm not sure which you are more comfortable with, so from now on I will simply paste the XML as it appears in a text editor. Here is what that looks like for the first elementary exchange (the ecoinvent term for an "elementary flow", i.e. a flow of a substance between a unit process, called an *activity* in ecoinvent, and the environment):

```XML
<?xml version="1.0" encoding="utf-8"?>
<validElementaryExchanges majorRelease="3" minorRelease="0" majorRevision="0" minorRevision="38930" xmlns="http://www.EcoInvent.org/EcoSpold02">
  <elementaryExchange id="38a622c6-f086-4763-a952-7c6b3b1c42ba" unitId="487df68b-4994-4027-8fdc-a4dc298257b7" casNumber="000110-63-4">
    <name xml:lang="en">1,4-Butanediol</name>
    <unitName xml:lang="en">kg</unitName>
    <compartment subcompartmentId="e8d7772c-55ca-4dd7-b605-fee5ae764578">
      <compartment xml:lang="en">air</compartment>
      <subcompartment xml:lang="en">urban air close to ground</subcompartment>
    </compartment>
    <synonym xml:lang="en">Butylene glycol</synonym>
    <property propertyId="6393c14b-db78-445d-a47b-c0cb866a1b25" amount="0" />
    <property propertyId="6d9e1462-80e3-4f10-b3f4-71febd6f1168" amount="0" />
    <property propertyId="a9358458-9724-4f03-b622-106eda248916" amount="0" />
    <property propertyId="c74c3729-e577-4081-b572-a283d2561a75" amount="0.533098393070742" />
    <property propertyId="3a0af1d6-04c3-41c6-a3da-92c4f61e0eaa" amount="1" />
    <property propertyId="67f102e2-9cb6-4d20-aa16-bf74d8a03326" amount="1" />
  </elementaryExchange>
```
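For readers who prefer the Python route, here is a minimal sketch of reading the same excerpt with the standard library's ElementTree. The function name `read_elementary_exchanges` and the abridged `SNIPPET` string are illustrative assumptions, not part of GB's code; the real file would be read from `ElementaryExchanges.xml` instead.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element of the master data files.
NS = {"es2": "http://www.EcoInvent.org/EcoSpold02"}

# Abridged copy of the excerpt above (only two of the six properties kept).
SNIPPET = """
<validElementaryExchanges xmlns="http://www.EcoInvent.org/EcoSpold02">
  <elementaryExchange id="38a622c6-f086-4763-a952-7c6b3b1c42ba" unitId="487df68b-4994-4027-8fdc-a4dc298257b7" casNumber="000110-63-4">
    <name xml:lang="en">1,4-Butanediol</name>
    <unitName xml:lang="en">kg</unitName>
    <compartment subcompartmentId="e8d7772c-55ca-4dd7-b605-fee5ae764578">
      <compartment xml:lang="en">air</compartment>
      <subcompartment xml:lang="en">urban air close to ground</subcompartment>
    </compartment>
    <property propertyId="c74c3729-e577-4081-b572-a283d2561a75" amount="0.533098393070742" />
    <property propertyId="3a0af1d6-04c3-41c6-a3da-92c4f61e0eaa" amount="1" />
  </elementaryExchange>
</validElementaryExchanges>
"""


def read_elementary_exchanges(xml_text):
    """Hypothetical helper: return one dict per <elementaryExchange>."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for exc in root.findall("es2:elementaryExchange", NS):
        # The outer <compartment> wraps the compartment and subcompartment names.
        comp = exc.find("es2:compartment", NS)
        records.append({
            "id": exc.get("id"),
            "name": exc.findtext("es2:name", namespaces=NS),
            "unit": exc.findtext("es2:unitName", namespaces=NS),
            "compartment": comp.findtext("es2:compartment", namespaces=NS),
            "subcompartment": comp.findtext("es2:subcompartment", namespaces=NS),
            # Map each propertyId to its amount as a float.
            "properties": {p.get("propertyId"): float(p.get("amount"))
                           for p in exc.findall("es2:property", NS)},
        })
    return records


exchanges = read_elementary_exchanges(SNIPPET)
print(exchanges[0]["name"], "->", exchanges[0]["compartment"])
```

Every tag lookup has to carry the ecoSpold2 namespace, which is why the `es2:` prefix appears in each path.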

GB has also written some code to **assemble all the master data in one dictionary**, where the keys of the dictionary are the names of the files above (`ActivityIndex`, `ActivityNames`, etc.) and the values are the contents of the master data xml assembled as **pandas dataframes**.  
Here are some details:

`get_current_MD()`:   
  - Finds the master data folder
  - Generates a list of all files in the master data folder
  - Finds the youngest (most recently modified) file and notes its date
  - Compares it with a `master_data_dictionary` stored on the system, if available
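One way to read the last two steps is as a cache-freshness check: the stored dictionary only needs rebuilding when some master data file is newer than it. The sketch below illustrates that idea under this assumption; `newest_xml_mtime` and `cache_is_stale` are hypothetical names, and this is not the logic of `get_current_MD()` itself, which I have not reproduced here.

```python
import os


def newest_xml_mtime(folder):
    """Return the latest modification time among the .xml files in folder,
    or None if the folder contains no .xml file."""
    xml_files = [os.path.join(folder, name)
                 for name in os.listdir(folder)
                 if name.lower().endswith(".xml")]
    if not xml_files:
        return None
    return max(os.path.getmtime(path) for path in xml_files)


def cache_is_stale(folder, cache_path):
    """Return True when the stored dictionary is missing or older than
    the youngest master data file."""
    newest = newest_xml_mtime(folder)
    if newest is None:
        return False  # nothing to rebuild from
    if not os.path.exists(cache_path):
        return True
    return os.path.getmtime(cache_path) < newest
```

With the paths used in this notebook, the call would look like `cache_is_stale(master_data_folder, os.path.join(MD_dir, 'MD.pkl'))`.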

In [56]:
os.chdir(r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool')

In [59]:
%run run_magic_test.py

7


In [60]:
devil2

443556

In [None]:
def build_MD(folder, version, system_model, basepath, master_data_folder = ''):
    print('reading master data from %s' % folder)
    MD = {}
    if master_data_folder == '':
        master_data_folder = os.path.join(folder, 'MasterData')
    p = os.path.join(os.path.dirname(os.path.realpath(__file__)), 
        'documentation', 'MasterData_fields.xlsx')
    MD_fields = pandas.read_excel(p, 'fields')
    MD_tags = pandas.read_excel(p, 'tags').set_index('file')
    properties = {}
    grouped = MD_fields.groupby('file')
    for filename, group in grouped:
        if 'Exchange' in filename:
            properties[filename] = []
        df = []
        with open(os.path.join(master_data_folder, '%s.xml' % filename), encoding = 'utf8') as f:
            root = objectify.parse(f).getroot()
        fields = list(zip(list(group['field type']), list(group['field name'])))
        if filename == 'Classifications':
            for child in root.iterchildren(tag = tag_prefix + MD_tags.loc[filename, 'tag']):
                for c in child.iterchildren(tag = tag_prefix + 'classificationValue'):
                    to_add = {'classificationSystemId': child.get('id'), 
                              'classificationSystemName': child.name.text}
                    to_add = store_fields(fields, c, to_add = to_add)
                    df.append(copy(to_add))
        elif filename == 'Compartments':
            for child in root.iterchildren(tag = tag_prefix + MD_tags.loc[filename, 'tag']):
                for c in child.iterchildren(tag = tag_prefix + 'subcompartment'):
                    to_add = {'compartmentId': child.get('id'), 
                              'compartmentName': child.name.text, 
                              'subcompartmentId': c.get('id'), 
                                'subcompartmentName': c.name.text, 
                                'comment': c.comment.text}
                    df.append(copy(to_add))
        else:
            for child in root.iterchildren(tag = tag_prefix + MD_tags.loc[filename, 'tag']):
                to_add = store_fields(fields, child, to_add = {})
                if hasattr(child, 'compartment'):
                    to_add['compartment'] = child.compartment.compartment.text
                    to_add['subcompartment'] = child.compartment.subcompartment.text
                df.append(copy(to_add))
                for p in child.iterchildren(tag = tag_prefix + 'property'):
                    to_add_ = copy(to_add)
                    to_add_['propertyId'] = p.get('propertyId')
                    to_add_['amount'] = p.get('amount')
                    if is_empty(to_add_['amount']):
                        to_add_['amount'] = 0.
                    else:
                        to_add_['amount'] = float(to_add_['amount'])
                    properties[filename].append(copy(to_add_))
        MD[filename] = list_to_df(df)
        for col in list(group['field name']):
            if col not in MD[filename]:
                MD[filename][col] = ''
    for filename in properties:
        properties[filename] = list_to_df(properties[filename])
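        if 'id' not in properties[filename]:
            # Fail with an explicit message instead of a deliberate 1/0
            raise ValueError("no 'id' column in the properties of %s" % filename)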
    MD, properties = join_info(MD, properties)
    filename = 'ExchangeActivityIndex.xml'
    filelist = build_file_list(master_data_folder, 'xml')
    if filename in filelist:
        with open(os.path.join(master_data_folder, filename), encoding='utf8') as f:
            root = objectify.parse(f).getroot()
        df = []
        for exchangeActivityIndexEntry in root.iterchildren():
            for o in exchangeActivityIndexEntry.output.iterchildren():
                to_add = {'id': exchangeActivityIndexEntry.get('validIntermediateExchangeId'), 
                          'activityIndexEntryId': o.get('activityIndexEntryId')}
                df.append(to_add)
        df = list_to_df(df)
        df = df.set_index('id')
        tab = 'IntermediateExchanges'
        df = df.join(MD[tab].set_index('id')[['name']])
        df = df.reset_index().rename(columns = {'id': 'intermediateExchangeId'})
        tab = 'ActivityIndex'
        df = df.set_index('activityIndexEntryId').join(MD[tab].set_index('id')[['activityName', 'geography', 
                          'startDate', 'endDate']]).reset_index()
        MD['ExchangeActivityIndex'] = df.rename(columns = {'index': 'activityIndexEntryId'})
        
    MD = to_excel(folder, MD, properties)
    folder = os.path.join(os.path.dirname(folder), 'pkl')
    pkl_dump(folder, 'MD', MD)

In [50]:
os.getcwd()

'C:\\Users\\Pascal Lesage\\Documents\\ecoinvent\\EcoEditor\\xml\\MasterData\\Production'

In [52]:
os.path.realpath(__file__)

NameError: name '__file__' is not defined
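The `NameError` above is the reason the GB code cannot simply be pasted into a cell. As an alternative to `%run`, a small fallback could be used; the helper below is a hypothetical sketch (`code_directory` does not exist in `spold2_writer_use.py`) that returns the script folder when `__file__` is available and the current working directory otherwise.

```python
import os


def code_directory(namespace=None):
    """Return the folder of the running script, or the working directory
    when no __file__ is defined (as in a notebook cell)."""
    namespace = globals() if namespace is None else namespace
    script = namespace.get("__file__")
    if script is None:
        # Notebook case: rely on a prior os.chdir() to the code folder.
        return os.getcwd()
    return os.path.dirname(os.path.realpath(script))


print(code_directory())
```

In a notebook this only gives the right answer if `os.chdir` has already pointed at the code folder, as is done earlier in this document.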

In [36]:
with open(os.path.join(master_data_folder, 'ElementaryExchanges.xml'), encoding = 'utf8') as f:
    root = objectify.parse(f).getroot()

In [41]:
for child in root.iterchildren():
    print(child)

In [24]:
os.chdir('C://Users\\Pascal Lesage\\Documents\\ecoinvent\\EcoEditor\\xml\\MasterData\\Production')

In [3]:
MD_dir = r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool\pkl'

In [4]:
with open(os.path.join(MD_dir, 'MD.pkl'), 'rb') as f:
    MD = pickle.load(f)

In [44]:
type(MD)

dict

In [45]:
MD_as_string = str(MD)

In [49]:
if 'Undefined' in MD_as_string:
    print('yeah')

yeah
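Searching `str(MD)` works here, but the string form of a DataFrame is a truncated display (long tables and long cell values are abbreviated), so a value may be missed. A more reliable check is to search each column directly. The sketch below assumes the dictionary maps names to DataFrames with unique column names; `find_value` and the toy data are illustrative only.

```python
import pandas as pd


def find_value(tables, needle):
    """Return (table name, column name, hit count) for every column
    that contains needle as a substring."""
    hits = []
    for table_name, df in tables.items():
        for col in df.columns:
            # Cast to str so numeric and missing values can be searched too.
            mask = df[col].astype(str).str.contains(needle, regex=False)
            if mask.any():
                hits.append((table_name, col, int(mask.sum())))
    return hits


# Toy stand-in for MD, for illustration only.
toy_MD = {
    "Compartments": pd.DataFrame({
        "compartmentName": ["soil", "soil"],
        "subcompartmentName": ["unspecified", "forestry"],
    }),
    "Units": pd.DataFrame({"name": ["kg"]}),
}
print(find_value(toy_MD, "unspecified"))
```

Applied to the real dictionary, `find_value(MD, 'Undefined')` would also say which tables and columns contain the value, which the `'yeah'` check cannot.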


In [12]:
list(MD.keys())

['ActivityIndex',
 'ActivityNames',
 'Classifications',
 'Companies',
 'Compartments',
 'ElementaryExchanges',
 'Geographies',
 'IntermediateExchanges',
 'Languages',
 'MacroEconomicScenarios',
 'Parameters',
 'Persons',
 'Properties',
 'Sources',
 'SystemModels',
 'Tags',
 'UnitConversions',
 'Units',
 'ExchangeActivityIndex',
 'IntermediateExchanges prop.',
 'ElementaryExchanges prop.']

In [53]:
MD['Compartments']

| | comment | compartmentId | compartmentName | subcompartmentId | subcompartmentName | nan |
|---|---|---|---|---|---|---|
| 0 | Only used if no specific information available. | 269f1973-46b4-4092-aaa7-36b5b8eb7b3e | soil | dbeb0ac7-0dec-439e-887a-9924cc8005dd | unspecified | |
| 1 | Emission to soil that is used for plant produc... | 269f1973-46b4-4092-aaa7-36b5b8eb7b3e | soil | 15f47463-77ea-40d0-bfe8-ca632819f556 | forestry | |
| 2 | Emission to soil that is used for or is suitab... | 269f1973-46b4-4092-aaa7-36b5b8eb7b3e | soil | e1bc9a16-5b6a-494f-98ef-49f461b1a11e | agricultural | |
| 3 | Emission to soil used for industry, manufactur... | 269f1973-46b4-4092-aaa7-36b5b8eb7b3e | soil | 912f1ae3-734e-4cc6-bbf7-0f36843cd7de | industrial | |
| 4 | Natural resources in air, e.g. argon, carbon d... | 9ee6ba06-4401-409c-ac4e-e8ec188aa512 | natural resource | 45bb416c-a63b-429f-8754-b3f76a069c43 | in air | |
| 5 | Land occupation and transformation. | 9ee6ba06-4401-409c-ac4e-e8ec188aa512 | natural resource | 7d704b6f-d455-4f41-9c28-50b4f372f315 | land | |
| 6 | Natural resource in soil e.g. ores; landfill v... | 9ee6ba06-4401-409c-ac4e-e8ec188aa512 | natural resource | 6a098164-9f04-4f65-8104-ffab7f2677f3 | in ground | |
| 7 | Natural resource in water, e.g. magnesium, water. | 9ee6ba06-4401-409c-ac4e-e8ec188aa512 | natural resource | 30347aef-a90b-46ba-8746-b53741aa779d | in water | |
| 8 | Biogenic resource, e.g. wood | 9ee6ba06-4401-409c-ac4e-e8ec188aa512 | natural resource | 2d0acbd3-2083-4011-9a29-20c626b23dc3 | biotic | |
| 9 | Labour cost, net tax, net operating surplus, r... | cc6b14b8-9e1c-423e-afc8-3ec5fbe5230c | economic | afa7ae6d-bbd9-4d9d-8d5a-a55f815c2d05 | primary production factor | |
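When writing an exchange, the tool will typically need the `subcompartmentId` that matches a compartment and subcompartment name. A minimal sketch of that lookup follows; `subcompartment_id` is a hypothetical helper, and the small DataFrame reuses three rows from the output above so that the example runs without the pickled master data.

```python
import pandas as pd

# Three rows copied from the Compartments table shown above.
compartments = pd.DataFrame([
    {"compartmentName": "soil", "subcompartmentName": "unspecified",
     "subcompartmentId": "dbeb0ac7-0dec-439e-887a-9924cc8005dd"},
    {"compartmentName": "soil", "subcompartmentName": "agricultural",
     "subcompartmentId": "e1bc9a16-5b6a-494f-98ef-49f461b1a11e"},
    {"compartmentName": "natural resource", "subcompartmentName": "in water",
     "subcompartmentId": "30347aef-a90b-46ba-8746-b53741aa779d"},
])


def subcompartment_id(df, compartment, subcompartment):
    """Return the id of the single row matching both names."""
    match = df[(df["compartmentName"] == compartment)
               & (df["subcompartmentName"] == subcompartment)]
    if len(match) != 1:
        # Zero or several matches both indicate a problem with the inputs.
        raise LookupError("expected exactly one match for %s / %s, found %d"
                          % (compartment, subcompartment, len(match)))
    return match["subcompartmentId"].iloc[0]


print(subcompartment_id(compartments, "soil", "agricultural"))
```

With the real data the first argument would be `MD['Compartments']`. Matching on both names matters because a subcompartment name on its own need not be unique across compartments.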


In [14]:
MD['ActivityIndex'].shape

(15112, 11)

In [17]:
MD['ActivityIndex'].columns

Index(['systemModelId', 'activityNameId', 'geographyId', 'endDate', 'id',
       'specialActivityType', 'startDate', 'geography', 'activityName', 'name',
       'systemModelName'],
      dtype='object')