# Generating ecoSpold2 files for WWT

Pascal Lesage notes for ICRA

## 1) Description

The WWT LCI tool must generate importable ecoSpold2 files.  
An example ecoSpold already in ecoinvent is found [here](zzz).  
This document shows: 
  - how to fill in all the required fields  
  - how to generate the ecoSpold files  
Note that **two** ecoSpold files may need to be generated for a given situation: one for WW treatment, and one for WW discharged to the environment without treatment, which also covers the particulate content and dissolved substances flushed during hydraulic overload episodes (**see emails on the subject**).

## 2) Standard imports

In [1]:
import os
import pandas as pd
import pickle # Used temporarily to access a MasterData dictionary - check if still useful at the end of the project.
from lxml import objectify #Convert XML to dict

In [86]:
# Due to some pickle files having been generated with an older version of Pandas
import pandas.core.indexes
import sys
sys.modules['pandas.indexes'] = pandas.core.indexes

## 3) Guillaume Bourgault (GB) code and adaptations/additions

This document relies heavily on the code prepared by GB and distributed in July (`spold2_writer_use.py`).  

Some slight modifications were made to make it easier to use with the WWT tool ==> see `spold2_writer_functions.py`

To run the .py file from the Notebooks, one can use the [%run](http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-run) magic command

In [95]:
os.chdir(r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool')
%run spold2_writer_functions.py

## 4) Master data and the generation of the `MD` dictionary

The ecoinvent database contains master data for the following entities: Activity Names, Classifications, Companies, Compartments, Exchanges (Elementary and Intermediate), Geographies, Languages, Market Models, Parameters, Persons, Properties, Scenarios, Sources, Tags and Units. 

There are discussions underway to have the tool access the master data on the ecoinvent/IFU server. This has not yet been resolved, and many at the ecoinvent Center do not consider it a priority, because the amount of master data used for the WWT datasets is small, and because the master data could be stored on the server that will host the WWT tool and easily be updated regularly.

For now, I will use the master data that is downloaded on my computer via the [ecoEditor](http://www.ecoinvent.org/data-provider/data-provider-toolkit/ecoeditor/ecoeditor.html).  
Guillaume of the ecoinvent Center (henceforth GB) has written the following code to help **find the master data**:

In [4]:
master_data_folder = find_current_MD_path()
master_data_folder

'C://Users\\Pascal Lesage\\Documents\\ecoinvent\\EcoEditor\\xml\\MasterData\\Production'

Here are the **contents of the master data directory**:

In [5]:
os.listdir(master_data_folder)

['ActivityIndex.xml',
 'ActivityNames.xml',
 'Classifications.xml',
 'Companies.xml',
 'Compartments.xml',
 'Context.xml',
 'DeletedMasterData.xml',
 'ElementaryExchanges.xml',
 'ExchangeActivityIndex.xml',
 'Geographies.xml',
 'IntermediateExchanges.xml',
 'Languages.xml',
 'MacroEconomicScenarios.xml',
 'Parameters.xml',
 'Persons.xml',
 'Properties.xml',
 'Sources.xml',
 'SystemModels.xml',
 'Tags.xml',
 'UnitConversions.xml',
 'Units.xml',
 'user']

The py file includes code to **assemble all the master data in one dictionary**, **`MD`**, where:  
  - the keys of the dictionary are the names of the files above (`ActivityIndex`, `ActivityNames`, etc.)  
  - the values are the contents of the master data xml assembled as **pandas dataframes**.  

Here are some details:  

`get_current_MD(master_data_folder=None, pkl_folder=None, return_MD=False)`:   
  - Arguments:  
    - `master_data_folder` = dir of master data. If `None`, `find_current_MD_path` is used  
    - `pkl_folder` = directory of previously built master data dictionary. If `None` passed, the function will look where it expects it to be, i.e. ` os.path.join(os.path.dirname(os.path.realpath(__file__)),'pkl')`  
    - `return_MD`: if False, the function returns None, else it returns the MD
  - Compares the age of the existing master data dictionary MD (if it exists) with that of the actual master data to determine whether the dictionary can be used as-is or whether it needs to be created/updated.  
  - If it needs to be created or updated, the function `build_MD` is called.
  - Returns `MD` if `return_MD` is True, else `None`.
  
`build_MD(md_fields_xls=None, master_data_folder=None, pickle_dump_folder=None, xls_dump_folder=None)`:
  - Called from `get_current_MD`, if needed. 
  - Arguments:  
    - `md_fields_xls`: path to the file `MasterData_fields.xlsx`, by default in `root_dir/documentation`. Default used if argument not passed.  
    - `master_data_folder` = dir of master data. If `None`, `find_current_MD_path` is used  
    - `pickle_dump_folder` = Directory where the pickled **`MD`** should be stored. If `None`, the MD pickle is not stored.
    - `xls_dump_folder` = Directory where the xls version of the master data should be stored. If `None`, the xls is not generated.
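
The age comparison described for `get_current_MD` could look roughly like the sketch below. This is an illustration only, not GB's actual implementation: the helper name `md_pickle_is_stale` is hypothetical, and it assumes that file modification times are an acceptable proxy for the "age" of the master data.

```python
import os

def md_pickle_is_stale(pkl_path, master_data_folder):
    """Return True if the pickled MD is missing or older than any master data XML file.

    Hypothetical helper: the real get_current_MD may compare ages differently.
    """
    if not os.path.isfile(pkl_path):
        return True  # nothing pickled yet, so MD must be built
    pkl_mtime = os.path.getmtime(pkl_path)
    xml_mtimes = [
        os.path.getmtime(os.path.join(master_data_folder, f))
        for f in os.listdir(master_data_folder)
        if f.lower().endswith('.xml')
    ]
    # Stale if at least one XML file was modified after the pickle was written
    return any(m > pkl_mtime for m in xml_mtimes)
```

When this returns True, the dictionary would be rebuilt (the role played by `build_MD`); otherwise the existing pickle can be loaded as-is.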

**Use in our case**: the `MD.pkl` dictionary is required later, and so needs to be generated.

In [65]:
MD = get_current_MD(return_MD=True)

In [66]:
# Names of dataframes:
[*MD.keys()]

['ActivityIndex',
 'ActivityNames',
 'Classifications',
 'Companies',
 'Compartments',
 'ElementaryExchanges',
 'Geographies',
 'IntermediateExchanges',
 'Languages',
 'MacroEconomicScenarios',
 'Parameters',
 'Persons',
 'Properties',
 'Sources',
 'SystemModels',
 'Tags',
 'UnitConversions',
 'Units',
 'ExchangeActivityIndex',
 'IntermediateExchanges prop.',
 'ElementaryExchanges prop.']

Here is a sample DF, for geographies.

In [71]:
MD['Geographies'].head()

| shortname | id | latitude | longitude | name | uNCode | uNRegionCode | uNSubregionCode |
|---|---|---|---|---|---|---|---|
| AD | 10033c48-7d7e-11de-9ae2-0019e336be3a | 42.549 | 1.576 | Andorra | 0 | 0 | 0 |
| AE | 14f9e5d0-7d7e-11de-9ae2-0019e336be3a | 23.549 | 54.163 | United Arab Emirates | 0 | 0 | 0 |
| AF | 0c608726-7d7e-11de-9ae2-0019e336be3a | 33.677 | 65.216 | Afghanistan | 0 | 0 | 0 |
| AG | 09581fbc-7d7e-11de-9ae2-0019e336be3a | 17.078 | -61.783 | Antigua and Barbuda | 0 | 0 | 0 |
| AI | 0fefde28-7d7e-11de-9ae2-0019e336be3a | 18.237 | -63.032 | Anguilla | 0 | 0 | 0 |


## 5) High level parameters needed to generate the ecoSpold files

### 5.1) Wastewater properties  
  - BOD5, metal content, etc. 
  - For now, I have stored these in an Excel file (WW_properties.xlsx, in the documentation folder), with random values. 
  - The tool will need to collect this information (ICRA). 
  - The Excel file also contains (random) data on the **dissolved and particulate fractions**, used later in calculations. 
  - I store this information in a pandas dataframe: I suggest the tool does the same thing, since many functions below use this object.

In [73]:
def get_WW_properties(xls=None):
    # Read from Excel for now; the tool will eventually supply these values
    if xls is None:
        xls = os.path.join(root_dir, 'Documentation', 'WW_properties.xlsx')
    return pd.read_excel(xls, sheet_name='Sheet1', index_col=1)

In [75]:
WW_prop_df = get_WW_properties()
WW_prop_df.head().T #Transposed for easier viewing. NaN means the cell was empty in the Excel file. 

| name | BOD5, mass per volume | COD, mass per volume | mass concentration, DOC | mass concentration, TOC | mass concentration, dissolved sulfate SO4 as S |
|---|---|---|---|---|---|
| id | dd13a45c-ddd8-414d-821f-dfe31c7d2868 | 3f469e9e-267a-4100-9f43-4297441dc726 | efe22a60-b1a3-4b33-a5ba-4bf575e0a889 | a547f885-601d-4d52-9bf9-60f0cef06269 | 1e4ef691-c7d3-49fc-9aee-6d77575a7b8a |
| unitName | kg/m3 | kg/m3 | kg/m3 | kg/m3 | kg/m3 |
| comment | Biological Oxygen Demand BOD5, as O2 | Chemical Oxygen Demand as O2 | Mass concentration of Dissolved Organic Carbon | Mass concentration of Total Organic Carbon | Mass concentration of dissolved sulfate SO4 (C... |
| Amount | 0.33673 | 0.242895 | 0.151075 | 0.416505 | 0.770858 |
| Dissolved | 0.0019646 | 0.421685 | 0.674033 | 0.814631 | 0.0278536 |
| Particulate | 0.998035 | 0.578315 | 0.325967 | 0.185369 | 0.972146 |
| Variance | 0.0847729 | 0.518737 | 0.599728 | 0.0945057 | 0.261701 |
| pedigree1 | 2 | 4 | 2 | 3 | 5 |
| pedigree2 | 4 | 4 | 5 | 2 | 2 |
| pedigree3 | 3 | 1 | 5 | 2 | 3 |


### 5.2) Fraction of wastewater discharged to sewer but not treated
This is the unconnected fraction (see emails on the subject).  
It is assumed here that the (ideally weighted) average of countries is OK to use in larger geographies.

In [76]:
# Example value
untreated_fraction = 0.3
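
For a geography covering several countries, the weighted average mentioned above could be computed as in this minimal sketch. The function name, the country values and the choice of weights (for example, yearly wastewater volumes) are all assumptions made for illustration.

```python
def weighted_untreated_fraction(fractions, weights):
    """Weighted average of per-country untreated fractions.

    fractions: dict {country shortname: untreated fraction, between 0 and 1}
    weights:   dict {country shortname: weight}; the weighting basis is an assumption
    """
    total_weight = sum(weights[c] for c in fractions)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    return sum(fractions[c] * weights[c] for c in fractions) / total_weight

# Made-up values, for illustration only
example_fractions = {'ES': 0.10, 'FR': 0.20, 'PT': 0.40}
example_weights = {'ES': 50.0, 'FR': 30.0, 'PT': 20.0}
regional_untreated_fraction = weighted_untreated_fraction(example_fractions, example_weights)
```

With equal weights this reduces to the simple average of the country values.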

## 6) Generating the ecoSpold file - treatment

This section documents the data that the tool must generate. It does not discuss *how* this data is generated (underlying models, averaging, etc.) 

The data/documentation/naming requirements are taken from several sources:  
- The [Data Quality Guidelines](http://www.ecoinvent.org/files/dataqualityguideline_ecoinvent_3_20130506.pdf)  
- Notes found in the [ecoEditor](http://www.ecoinvent.org/data-provider/data-provider-toolkit/ecoeditor/ecoeditor.html) itself
- A [dataset documentation](http://www.ecoinvent.org/files/dataset_documentation_ecoinvent_3.pdf) document

### 6.1) Generate an empty dataset

Until it is rendered (see below), the dataset will be a dictionary.  
Use the function `create_empty_dataset`  
Some fields which will remain constant are pre-filled.  
One modification remains to be done in the tool: the default author needs to be changed (I put myself as a placeholder). I recommend putting the tool developer here.

In [78]:
treatment_dataset = create_empty_dataset()

### 6.2) Add ActivityIndex
The ActivityIndex is a collection of fields that need to be generated from user input.  
There are a series of functions to run (see below), and then the final function to put it all together is `generate_activityIndex(dataset)`

#### 6.2.1) Activity name

From ecoEditor:
>Activity Name  
>The name describes the activity that is represented by this dataset. The activity name can only be edited when a new dataset is created. If you want to use this dataset under a new activity name, you need to create a new dataset with the desired name, using the current dataset as a template (menu File..., New..., FromExistingDataset).  
>Length: 120  
>Required: Yes  
>EcoSpold02 FieldId: 100  

From DQG:
>Activity names are spelled with lower case starting letter, i.e. “lime production”, not “Lime production”. 

In the case of the WWT datasets, the name will depend on the situation: 

- CASE 1: treatment of the wastewater from a specific source in an "average" WWTP  
- CASE 2: treatment of the wastewater from a specific source in a "specific" WWTP technology/capacity  
- CASE 3: treatment of average wastewater in an "average" WWTP  
- CASE 4: treatment of average wastewater in a "specific" WWTP technology/capacity  

I have written a function `create_WWT_activity_name` that generates a valid name based on three arguments that the tool will need to get from the user:  
- `WW_type` = two choices only: average, or "from x" (e.g. "from steel production", "from residence")
- `technology`: TBD  
- `capacity` = two choices only: 'average' or int representing the yearly capacity in l/year. A check on type should be done.

This function is later used by a second function, `generate_WWT_activity_name`, see below.

In [79]:
def create_WWT_activity_name(WW_type, technology, capacity):
    if WW_type == 'average':
        WW_type_str = ", average"
    else:
        WW_type_str = " {}".format(WW_type)
    
    if technology == 'average':
        technology_str = ""
    else:
        technology_str = "{}, ".format(technology)
    
    if capacity == 'average':
        capacity_str = "average capacity"
    else:
        capacity_str = "capacity {:.1E}l/year".format(capacity).replace('+', '').replace('E0', 'E').replace('.0', '')
    
    return "treatment of wastewater{}, {}{}".format(WW_type_str, technology_str, capacity_str)

Examples:

In [80]:
print(create_WWT_activity_name("average", "technology A", 1e9))
print(create_WWT_activity_name("from steel production", "technology A", 1e9))
print(create_WWT_activity_name("average", "average", 1e9))
print(create_WWT_activity_name("from steel production", "average", 1.1e9))
print(create_WWT_activity_name("average", "average", "average"))
print(create_WWT_activity_name("from steel production", "average", "average"))

treatment of wastewater, average, technology A, capacity 1E9l/year
treatment of wastewater from steel production, technology A, capacity 1E9l/year
treatment of wastewater, average, capacity 1E9l/year
treatment of wastewater from steel production, capacity 1.1E9l/year
treatment of wastewater, average, average capacity
treatment of wastewater from steel production, average capacity
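
The check on the type of `capacity` mentioned above is not implemented in `create_WWT_activity_name`. A possible sketch follows; `check_capacity` is a hypothetical helper, and accepting floats as well as ints is an assumption based on the examples, which pass values such as `1e9`.

```python
import numbers

def check_capacity(capacity):
    """Return capacity unchanged if it is 'average' or a positive number, else raise.

    Hypothetical helper, meant to be called before create_WWT_activity_name.
    """
    if capacity == 'average':
        return capacity
    # bool is a subclass of int, so it is excluded explicitly
    if (isinstance(capacity, numbers.Real)
            and not isinstance(capacity, bool)
            and capacity > 0):
        return capacity
    raise ValueError(
        "capacity must be 'average' or a positive number (l/year), got {!r}".format(capacity)
    )
```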


Tests

In [81]:
existing_names = list(MD['ActivityNames'].index)

In [82]:
[create_WWT_activity_name("from black chrome coating", "average", 1.1e10) in existing_names,
create_WWT_activity_name("from lorry production", "average", 4.7e10) in existing_names,
create_WWT_activity_name("average", "average", 1.6e8) in existing_names]

[True, True, True]

`generate_WWT_activity_name` has four arguments:  
- `dataset`: the dataset (dictionary) that the name should be added to  
- Same three arguments as the `create_WWT_activity_name`  

It adds the name to the dataset.  
It also adds the WW_type to the dataset (because it is needed later in the creation of the reference flow name).

In [87]:
treatment_dataset = generate_WWT_activity_name(treatment_dataset, 'from ceramic production', 'average', 5e9)

In [89]:
treatment_dataset['WW_type']

'from ceramic production'

#### 6.2.2) ActivityNameID
Use `ActivityNameID` if the activity name already exists, else generate a new `ActivityNameID`.

To check if the Master data already exists, we:  
   (1) Get the `ActivityIndex` dataframe from `MD`   
   (2) Make the name the index of the dataframe  
   (3) Check if our name is in the dataframe.  

If the name does not exist:  
   (1) Generate a UUID  
   (2) Add a "generic object" to the dataset.
   
I created the function `generate_activityNameId` that takes as argument the `dataset` and the `MD` dictionary:

In [96]:
generate_activityNameId(treatment_dataset, MD)
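
The lookup-or-create logic listed above can be illustrated with a simplified stand-in. This sketch is not the code of `generate_activityNameId`: the function name is hypothetical, a plain dictionary replaces the master data dataframe, and the step that adds a "generic object" to the dataset is left out.

```python
import uuid

def get_or_create_activity_name_id(activity_name, known_ids):
    """Return (id, is_new) for an activity name.

    known_ids: dict {activity name: UUID string}, standing in for the master data lookup.
    """
    if activity_name in known_ids:
        return known_ids[activity_name], False
    # Name not in the master data: generate a fresh random UUID
    return str(uuid.uuid4()), True
```

The `is_new` flag tells the caller whether new master data would have to be written alongside the dataset.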

#### 6.2.3) Geography  
The geography chosen should correspond to a geography already in the master data. It would therefore probably be a good idea to have a drop-down list to choose the geography, and a message saying that the data supplier should communicate with ecoinvent if the geography they want is not in the master data (email address: data@ecoinvent.org).  
I have written a function `generate_geography` that takes as arguments the dataset, the `MD` dictionary and the geography **shortname**.  

In [97]:
MD['Geographies'].head()

| shortname | id | latitude | longitude | name | uNCode | uNRegionCode | uNSubregionCode |
|---|---|---|---|---|---|---|---|
| AD | 10033c48-7d7e-11de-9ae2-0019e336be3a | 42.549 | 1.576 | Andorra | 0 | 0 | 0 |
| AE | 14f9e5d0-7d7e-11de-9ae2-0019e336be3a | 23.549 | 54.163 | United Arab Emirates | 0 | 0 | 0 |
| AF | 0c608726-7d7e-11de-9ae2-0019e336be3a | 33.677 | 65.216 | Afghanistan | 0 | 0 | 0 |
| AG | 09581fbc-7d7e-11de-9ae2-0019e336be3a | 17.078 | -61.783 | Antigua and Barbuda | 0 | 0 | 0 |
| AI | 0fefde28-7d7e-11de-9ae2-0019e336be3a | 18.237 | -63.032 | Anguilla | 0 | 0 | 0 |


In [101]:
generate_geography(treatment_dataset, MD, 'GLO')
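
Before calling `generate_geography`, the tool could check the user's choice against the master data and show the message suggested above. A minimal sketch, where `check_geography` is a hypothetical helper and the list of shortnames is a stand-in:

```python
def check_geography(shortname, known_shortnames):
    """Return shortname if it is in the master data, else raise with guidance."""
    if shortname not in known_shortnames:
        raise ValueError(
            "Geography {!r} is not in the master data; "
            "please contact ecoinvent (data@ecoinvent.org).".format(shortname)
        )
    return shortname

# Stand-in for the real list, which could be built with sorted(MD['Geographies'].index)
example_shortnames = ['AD', 'AE', 'AF', 'AG', 'AI', 'GLO']
```

The same list could feed the drop-down menu, so that invalid choices cannot be made in the first place.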

#### 6.2.4) TimePeriod (Start and end date)
Period for which the dataset is meant to be valid.  Supplied by user. It would be useful to supply default values.    

Format: 'YYYY-MM-DD'. The function does not validate the dates for now.

I have written a function `generate_time_period` that takes as argument the dataset, a start date and an end date. 

In [102]:
generate_time_period(treatment_dataset, start='1995-01-31', end='2020-12-31')
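
Since `generate_time_period` does not validate its inputs, the tool could check the dates first. The sketch below is one way to do it with the standard library; `check_time_period` is a hypothetical helper and not part of `spold2_writer_functions.py`.

```python
from datetime import datetime

def _parse_date(text):
    """Parse a strict, zero-padded 'YYYY-MM-DD' string into a datetime."""
    parsed = datetime.strptime(text, '%Y-%m-%d')  # ValueError if not a real date
    # strptime tolerates '1995-1-31', so round-trip to enforce zero padding
    if parsed.strftime('%Y-%m-%d') != text:
        raise ValueError("date must be written as YYYY-MM-DD, got {!r}".format(text))
    return parsed

def check_time_period(start, end):
    """Validate both dates and their order; return them unchanged."""
    if _parse_date(end) < _parse_date(start):
        raise ValueError("end date {} is before start date {}".format(end, start))
    return start, end
```

Usage could be `generate_time_period(treatment_dataset, *check_time_period('1995-01-31', '2020-12-31'))`, assuming the function accepts the dates positionally.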

#### 6.2.5) Dataset ID
The dataset UUID is generated from the dataset['activityName'], dataset['geography'], dataset['startDate'], dataset['endDate']

In [103]:
generate_dataset_id(treatment_dataset)
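
One way to derive a reproducible UUID from these four fields is a name-based UUID, as sketched below. This is an assumption made for illustration: the actual `generate_dataset_id` may combine the fields differently, and `sketch_dataset_id`, the separator and the namespace are all hypothetical choices.

```python
import uuid

def sketch_dataset_id(dataset):
    """Derive a reproducible UUID string from the four identifying fields.

    Illustration only: the real generate_dataset_id may use another scheme.
    """
    key = '|'.join(
        str(dataset[field])
        for field in ('activityName', 'geography', 'startDate', 'endDate')
    )
    # uuid5 is name-based: the same key always yields the same UUID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))
```

With such a scheme, regenerating a dataset with the same name, geography and time period gives the same ID, while changing any of the four fields gives a different one.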

#### Putting it all together

In [104]:
generate_activityIndex(treatment_dataset)

### 6.3) Activity description

#### 6.3.1) includedActivitiesStart, includedActivitiesEnd

Two text fields. Describes the boundaries of the unit process.

##### Included Activities Start
Suggested text for industrial wastewater:

In [105]:
includedActivitiesStartText = "From the discharge of wastewater {} to the sewer grid.".format(treatment_dataset['WW_type'])
includedActivitiesStartText

'From the discharge of wastewater from ceramic production to the sewer grid.'

Suggested text for municipal wastewater.

In [106]:
includedActivitiesStartText = "From the discharge of municipal wastewater to the sewer grid."

##### Included Activities End  
Based on the [dataset documentation](http://www.ecoinvent.org/files/dataset_documentation_ecoinvent_3.pdf) document, this section has three parts:  
>(i) what is the last activity covered resp. what is the point of delivery of this dataset?  
>(ii) what activities are included (and not obvious from the name of the activity)  
>(iii) what activities are intentionally excluded from this activity (Among other things, if the activity is a service like e.g. spinning of bast fibres, that does not include the product used in the process (i.e. the bast fibres), this information will be included here).  

Each part has a specific mandatory wording. I suggest providing default text for the three sections separately and allowing the user to modify as required.  

**Note** The suggested text below includes many specifications about how the tool should work based on discussions with ecoinvent and from reviewing the existing tool. Please read this carefully. 

In [107]:
includedActivitiesEndText_last =\
"This activity ends with the discharge of treated wastewater to the natural environment."

includedActivitiesEndText_included =\
"This activity includes the transportation of wastewater via the sewer grid, "\
"and the treatment of the wastewater in the wastewater treatment plant. "\
"The amounts of infrastructure and consumables are also included as inputs to the activity."

includedActivitiesEndText_excluded = \
"By definition, wastewater not sent to the sewer grid is also excluded. "\
"The fraction of wastewater discharged to the sewer grid but ultimately not treated because the sewer is "\
"unconnected and direct emissions due to hydraulic overload are also excluded. "\
"These are included in another dataset specifically covering the discharge of untreated wastewater. "\
"The production of sludge is included, but its treatment is covered by another treatment activity."

In [108]:
"{} {} {}".format(
    includedActivitiesEndText_last,
    includedActivitiesEndText_included,
    includedActivitiesEndText_excluded
)

'This activity ends with the discharge of treated wastewater to the natural environment. This activity includes the transportation of wastewater via the sewer grid, and the treatment of the wastewater in the wastewater treatment plant. The amounts of infrastructure and consumables are also included as inputs to the activity. By definition, wastewater not sent to the sewer grid is also excluded. The fraction of wastewater discharged to the sewer grid but ultimately not treated because the sewer is unconnected and direct emissions due to hydraulic overload are also excluded. These are included in another dataset specifically covering the discharge of untreated wastewater. The production of sludge is included, but its treatment is covered by another treatment activity.'

##### Adding the includedActivitiesStart, includedActivitiesEnd

In [109]:
generate_activity_boundary_text(treatment_dataset,
                               includedActivitiesStartText,
                               includedActivitiesEndText_last,
                               includedActivitiesEndText_included,
                               includedActivitiesEndText_excluded)

#### 6.3.2) Technology level

From the Data Quality Guidelines:

>The technology level of a transforming activity is classified in one of these five classes:  
0=Undefined. For market activities that do not have a technology level.  
1=New. For a technology assumed to be on some aspects technically superior to modern technology, but not yet the most commonly installed when investment is based on purely economic considerations.  
2=Modern. For a technology currently used when installing new capacity, when investment is based on purely economic considerations (most competitive technology).  
3=Current (default). For a technology in between modern and old.  
4=Old. For a technology that is currently taken out of use, when decommissioning is based on purely economic considerations (least competitive technology).  
5=Outdated. For a technology no longer in use.  
The terms used does not necessarily reflect the age of the technologies. A modern technology can be a century old, if it is still the most competitive technology, and an old technology can be relatively young, if it is one that has quickly become superseded by other more competitive ones. The technology level is relative to the year for which the data are valid, as given under Time Period. In a time series, the same technology can move between different technology levels over time. The same technology can also be given different technology levels in different geographical locations, even in the same year.  

The user should be able to choose the most relevant technology level from a drop-down menu, have access to descriptions and have the tool default to an appropriate value (current).

In [110]:
generate_technology_level(treatment_dataset, 'Current')
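
The drop-down menu could be driven by a simple lookup of the six levels quoted above. In this sketch the names `TECHNOLOGY_LEVELS` and `technology_level_code` are hypothetical; whether `generate_technology_level` expects the label (as in the call above) or the numeric code should be checked in `spold2_writer_functions.py`.

```python
# Labels and codes as listed in the Data Quality Guidelines excerpt above
TECHNOLOGY_LEVELS = {
    'Undefined': 0,
    'New': 1,
    'Modern': 2,
    'Current': 3,
    'Old': 4,
    'Outdated': 5,
}

def technology_level_code(label='Current'):
    """Return the numeric code for a technology level label (default: 'Current')."""
    try:
        return TECHNOLOGY_LEVELS[label]
    except KeyError:
        raise ValueError(
            "Unknown technology level {!r}; choose one of {}".format(
                label, sorted(TECHNOLOGY_LEVELS, key=TECHNOLOGY_LEVELS.get))
        )
```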

#### 6.3.3) Activity-level comment fields

This is a list of text cells. (Images are also legal, but I'm not sure how we could handle that, nor if they are useful)  

We should supply default text and allow the users to change the text if necessary.  

We should also determine how many cells we want to provide (different cells are listed one after another in ecoEditor, and this separation is used to split different subjects).

I have written a generic function to add comment fields, `generate_comment` that takes the following arguments:  
- the `dataset`  
- the `comment_type`. Valid types are 'allocationComment','generalComment','geographyComment','technologyComment' and 'timePeriodComment'.  
- the list of text comments (one per cell). 

##### 6.3.3.1) Technology description
>Text (_and image, but I'm not sure how we'd handle that_) field to describe the technology of the activity. The text should cover information necessary to identify the properties and particularities of the technology(ies) underlying the activity data. Describes the technological properties of the unit process. If the activity comprises several subactivities, the corresponding technologies should be reported as well. Professional nomenclature should be used for the description.  

**We should discuss how default text can be generated here based on the modelled technologies/regions. Some of this is still up in the air as of today (August 18), see emails**  

In [111]:
tech_comment_1 = 'The technologies modelled are x and y'
tech_comment_2 = 'They were averaged based on z'
tech_comment_3 = 'These technologies rock'

generate_comment(treatment_dataset, 'technologyComment', [tech_comment_1, tech_comment_2, tech_comment_3])

##### 6.3.3.2) generalComment
From the [dataset documentation](http://www.ecoinvent.org/files/dataset_documentation_ecoinvent_3.pdf) document:  
>Information that concerns the construction of the inventory (details about the Functional Unit, background, etc.) shall be entered in the General Comment field. Actually, this field can be compared to the abstract of a scientific article – i.e. the field should offer to the user a first, rough overview of the dataset.  
>Please start the text always with "This dataset represents [the production]/ [the service of] ...."  

We should provide default text and allow the user to modify. Again, multiple cells are possible. 

In [112]:
general_comment_1 = 'This dataset represents the treatment of wastewater discharged to the sewer grid {}'.format(
                            treatment_dataset['WW_type'])
general_comment_2 = 'It includes the transportation of the wastewater to the wastewater treatment plant and the actual treatment.'
general_comment_3 = 'It was modelled using XYZ'

generate_comment(treatment_dataset, 'generalComment', [general_comment_1, general_comment_2, general_comment_3])

#####  6.3.3.3) 'geographyComment', 'timePeriodComment'
The user _could_ want to include a comment. 
>**'timePeriodComment'** Text and image field for additional explanations concerning the temporal validity of the flow data reported. It may e.g. include information about:
- how strong the temporal correlation is for the unit process at issue (e.g., are four year old data still adequate for the activity operated today?)  

> **'geographyComment'**  Especially for area descriptions, the nature of the geographical delimitation may be given, especially when this is not an administrative area.  

Let's suppose no such comment now.

In [113]:
generate_comment(treatment_dataset, 'timePeriodComment', [''])
generate_comment(treatment_dataset, 'geographyComment', [''])

##### 6.3.3.4) 'allocationComment'
I would leave this out - I don't see the user needing this.

### 6.4) 'modellingAndValidation', Representativeness
There are multiple fields that are filled in by default (by function). The tool/user should provide, insofar as possible information on three specific items:  
 - 'samplingProcedure'  
>Text describing the sampling and calculation procedures applied for quantifying the exchanges. Reports whether the sampling procedure for particular elementary and intermediate exchanges differ from the general procedure. Mentions possible problems in combining different sampling procedures.  

I will let ICRA generate default text for this field.

 - 'Extrapolations'
 
> Describes extrapolations of data from another time period, another geographical area or another technology and the way these extrapolations have been carried out. It should be reported whether different extrapolations have been done on the level of individual exchanges. If data representative for a activity operated in one country is used for another country's activity, its original representativity can be indicated here. Changes in mean values due to extrapolations may also be reported here.

We should talk about the text to include here.

 - 'percent'
> Percent of data sampled out of the total that the activity is intended to represent (as given by the fields geography, technology and time period).

Perhaps blank as default (allowed), with the option to add information if available. 

Putting it all together in a function `generate_representativeness(dataset, samplingProcedure_text, extrapolations_text, percent)`

In [114]:
samplingProcedure_text = 'This is a description of the sampling procedure, and it should be changed by ICRA'
extrapolation_text = 'This is a description of the extrapolations, and it should be changed by ICRA'
percent = 80

generate_representativeness(treatment_dataset, samplingProcedure_text, extrapolation_text, percent)

### 6.5) Data entry section:  
For now, filled in with dummy data ([Current User]). Gets populated by ecoEditor.  I moved the whole section to `create_empty_dataset`

### 6.6) DataGeneratorAndPublication
The data here should basically be the reference of the tool. We will need to include the following information:
  - Author (called dataGenerator)
  - PublishedSource (reference, if there is a report or paper coming out of this work). If we don't publish, this is empty and the `dataPublishedIn` is set to 0.  
  - pageNumber  
  - ... see ecoEditor. 
For now, I'll assume no publication and put myself as author (this will need to change).

The users will have the possibility to change all this in the ecoEditor, which is probably easier than doing it in the tool.

### 6.7) Reference exchange (reference flow)
This is the amount of wastewater treated in the WWTP. 
Here are some key things to know about this exchange:  
  - It needs to be expressed in m3  
  - Its amount is -1: the magnitude is 1 because it is the common denominator for the whole dataset, and the sign is negative because, by ecoinvent convention, this identifies treated exchanges  
  - The name is auto-generated by the `generate_reference_exchange` function  
  - The uncertainty of the reference exchange is null (it is the only exchange without uncertainty). 
  - The reference exchange shall also be accompanied by a comment. I propose a default comment below.  
  - A production volume needs to be defined. It is equal to the total amount of the WW of interest sent to the sewer in the regional scope of the dataset minus the amount discharged to the environment due to the unconnected fraction. It is expressed in m3/year. For "average" WW, we can possibly find default numbers. For industrial WW, we should provide guidance on how to generate this value (total production volume of the production dataset * the amount of WW generated per unit produced * (1-untreated fraction))  
  - The production volume needs to be accompanied by a comment. We could determine what default comment would be appropriate once we determine how the default value for the production volume will be calculated.  
  - The production volume also needs to be accompanied with an uncertainty dictionary. The uncertainty dictionary has the following format: `{'variance':basic uncertainty with a default value of 0.0006, 'pedigreeMatrix':[scores from 1-5 for five indicators]}`. The default pedigree scores will depend on how we generate the default production volume amount. The scores refer to the following (to be included in the tool for all exchanges):
  ![Pedigree scores](pix\pedigree_scores.PNG)
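
The production volume formula for industrial WW and the uncertainty dictionary format given in the list above can be sketched as follows. Both function names are hypothetical and the numbers are made up; only the formula, the dictionary keys and the default basic uncertainty of 0.0006 come from the notes above.

```python
def production_volume(product_volume, ww_per_unit, untreated_fraction):
    """Yearly volume of wastewater treated, in m3/year.

    product_volume:     yearly output of the producing activity (units/year)
    ww_per_unit:        wastewater generated per unit produced (m3/unit)
    untreated_fraction: fraction sent to the sewer but not treated (0 to 1)
    """
    return product_volume * ww_per_unit * (1 - untreated_fraction)

def uncertainty_dict(pedigree_scores, variance=0.0006):
    """Build the uncertainty dictionary from five pedigree scores (each 1 to 5)."""
    scores = list(pedigree_scores)
    if len(scores) != 5 or any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("pedigree_scores must be five integers between 1 and 5")
    return {'variance': variance, 'pedigreeMatrix': scores}

# Made-up numbers: 2e6 units/year, 0.5 m3 of wastewater per unit, 30% untreated
PV_example = production_volume(2e6, 0.5, 0.3)
PV_uncertainty_example = uncertainty_dict([3, 3, 3, 3, 3])
```

Here the made-up inputs give a production volume of about 700 000 m3/year.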

The reference exchange also needs to be associated with its properties. 
**Note that the properties of the water sent to the WWTP are NOT the same as those of the WW discharged to the sewer due to some losses associated with hydraulic overload.**

I propose here a **dummy** function to calculate the amount lost to hydraulic overload. It is based on factors used in the existing ecoinvent WWT tool: 2% loss for dissolved fraction, and 1% for particulates. This will probably be better calculated in the tool.

In [115]:
def dummy_loss_to_hydraulic_overload(df, loss_particulates, loss_dissolved):
    return {prop: df.loc[prop, 'Amount'] * (1\
                                            - df.loc[prop, 'Dissolved']*loss_dissolved\
                                            - df.loc[prop, 'Particulate']*loss_particulates)
           for prop in df.index}

In [118]:
dummy_property_amounts_after_loss_to_hydraulic_overload_dict = dummy_loss_to_hydraulic_overload(WW_prop_df, 0.01, 0.02)

The DataFrame `WW_prop_df` needs to be converted to a list of tuples that are used to generate the ecoSpold2 file.  

The uncertainty of the ratio between particulates and dissolved fractions and of the losses due to hydraulic overloads are ignored **for now** (this may change, questions with ecoinvent pending).

**Important**: For now, it is assumed that the properties are not given a "variable name", which would allow them to easily be used in equations within ecoEditor. If we decide to use equations, we should add this field to the tuple. There are **numerous** advantages to doing this (two most important: transparency, and propagation of uncertainty). 

In [122]:
def convert_WW_prop_to_list(df, new_amount_dict):
    # Tuple order as built here: (property_name, amount, comment, unit, uncertainty)
    return [(i,
             new_amount_dict[i],
             df.loc[i, 'comment'] + ". Accounts for mass lost in sewer due to hydraulic overloads.",
             df.loc[i, 'unitName'],
             {
                 'variance': df.loc[i, 'Variance'],
                 'pedigreeMatrix': [
                     df.loc[i, 'pedigree1'],
                     df.loc[i, 'pedigree2'],
                     df.loc[i, 'pedigree3'],
                     df.loc[i, 'pedigree4'],
                     df.loc[i, 'pedigree5'],
                 ]
             }
            ) for i in df.index]

In [123]:
convert_WW_prop_to_list(WW_prop_df, dummy_property_amounts_after_loss_to_hydraulic_overload_dict)

[('BOD5, mass per volume',
  0.33335652826327788,
  'Biological Oxygen Demand BOD5, as O2. Accounts for mass lost in sewer due to hydraulic overloads.',
  'kg/m3',
  {'pedigreeMatrix': [2, 4, 3, 5, 4], 'variance': 0.084772935248650816}),
 ('COD, mass per volume',
  0.23944212752988905,
  'Chemical Oxygen Demand as O2. Accounts for mass lost in sewer due to hydraulic overloads.',
  'kg/m3',
  {'pedigreeMatrix': [4, 4, 1, 5, 4], 'variance': 0.51873718055969376}),
 ('mass concentration, DOC',
  0.14854549373081755,
  'Mass concentration of Dissolved Organic Carbon. Accounts for mass lost in sewer due to hydraulic overloads.',
  'kg/m3',
  {'pedigreeMatrix': [2, 5, 5, 3, 4], 'variance': 0.59972841525803788}),
 ('mass concentration, TOC',
  0.40894667565425985,
  'Mass concentration of Total Organic Carbon. Accounts for mass lost in sewer due to hydraulic overloads.',
  'kg/m3',
  {'pedigreeMatrix': [3, 2, 2, 4, 5], 'variance': 0.094505715185841388}),
 ('mass concentration, dissolved sulfat

#### Other inputs required for the reference exchange

It is assumed the tool will generate these. Code for the default comments could be reused:

In [127]:
ref_exchange_comment = "Refers to the amount of wastewater treated in the wastewater treatment plant."
if untreated_fraction != 0:
    ref_exchange_comment += " Excludes fraction ({:.0f}%) of wastewater sent to sewer grid but not treated in a wastewater treatment plant.".format(
                                untreated_fraction*100)

PV = 1000000 # Some fake number, we need to determine how to calculate this.

PV_comment = "Yearly volume of wastewater treated."
if untreated_fraction != 0:
    PV_comment += " Excludes the fraction that is discharged directly to the environment ({:.0f}%).".format(
        untreated_fraction*100)

PV_uncertainty = {'variance':0.01, 'pedigreeMatrix':[2,4,3,2,4]} # Fake numbers for now.  

In [128]:
ref_exchange_comment

'Refers to the amount of wastewater treated in the wastewater treatment plant. Excludes fraction (30%) of wastewater sent to sewer grid but not treated in a wastewater treatment plant.'

In [129]:
PV_comment

'Yearly volume of wastewater treated. Excludes the fraction that is discharged directly to the environment (30%).'

#### Using the function to generate the reference exchange

In [130]:
treatment_dataset, MD = generate_reference_exchange(treatment_dataset,
                                                    exc_comment=ref_exchange_comment,
                                                    PV=PV,
                                                    PV_comment=PV_comment,
                                                    PV_uncertainty=PV_uncertainty,
                                                    MD=MD)

going to create new property


### 6.8) Byproducts/wastes  
There are two types of wastes/byproducts to be considered:  
1) Sludge  
2) Grit  

#### 6.8.1) Sludge
Sludge treatment is outside the scope of this project. 
However, the tool must generate a "sludge exchange" that indicates how much sludge is generated (per reference exchange) and what its composition is (via its "properties").  
The properties of the sludge will be dictated by the transfer coefficients the tool will calculate.   
There are three approaches here: 
- Enter the transfer coefficients in the dataset, and let the sludge composition be calculated within the ecoSpold file itself, by ecoinvent (better for transparency and uncertainty propagation)  
- Calculate the sludge properties and enter them as static values, but include the transfer coefficients in the comments (good for transparency).  
- Calculate the sludge properties and enter them as static values with generic comments (worst for transparency)  

I'll assume for now that option 2 is chosen. 
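
A minimal sketch of what option 2 could look like for one substance. The function name, the transfer coefficient value and the comment wording are all hypothetical; the actual coefficients will come from the tool. Note that the sketch only computes the amount transferred to sludge per m3 of wastewater. Turning this into a sludge property (e.g. per kg of sludge) would also require the amount of sludge generated, which is not covered here.

```python
def sludge_amount_with_comment(property_name, amount_in_ww, transfer_coeff):
    """Return (static amount sent to sludge, comment documenting the coefficient).

    amount_in_ww: amount of the substance in the wastewater, per m3
    transfer_coeff: fraction of that amount transferred to sludge (0 to 1)
    """
    assert 0 <= transfer_coeff <= 1, "transfer_coeff must be between 0 and 1"
    amount_to_sludge = amount_in_ww * transfer_coeff
    comment = ("Static value for '{}'. Calculated from the amount in the wastewater "
               "({} per m3) and a transfer coefficient to sludge of {:.2f}.".format(
                   property_name, amount_in_ww, transfer_coeff))
    return amount_to_sludge, comment

# Hypothetical numbers, for illustration only
sludge_amount, sludge_comment = sludge_amount_with_comment('mass concentration, TOC', 0.4, 0.25)
print(sludge_comment)
```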

##### Transfer coefficients  
These will be calculated by the tool, but for now I store them in a table in the Excel spreadsheet `sludge_transfer_coeff.xlsx`.  

In [42]:
r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool\documentation\sludge_transfer_coeff.xlsx'

'C:\\mypy\\code\\wastewater_treatment_tool\\waste_water_tool\\documentation\\sludge_transfer_coeff.xlsx'

In [43]:
def get_WW_properties(xls=None):
    # From Excel for now
    if xls is None:
        xls = os.path.join(root_dir, 'Documentation', 'sludge_transfer_coeff.xlsx')
    return pd.read_excel(xls, sheet_name='Sheet1', index_col=1)

**REST DEPENDS ON ADDING PROPERTIES ON THE FLY**

#### 6.8.2) Grit
I will assume here, like in ecoinvent, that there are two types of grit: 
  - plastics  
  - biomass, modelled as paper  
  
The tool should provide default values for amounts of grit removed, as well as the uncertainty for these.  
The default values in ecoinvent v2.2 are 15.5 g/m3 of each, and the basic uncertainty is 0.0006.  
If we use these values, the pedigree scores should be: [1,3,5,5,1]  
It would be MUCH better to use other data for this.

In [134]:
grit_default_total_amount = 0.031 #kg/m3 in WWTP - ideally this value would be updated, and in any case the user can override it
grit_default_plastic_ratio = 0.5
grit_default_biomass_ratio = 0.5
grit_uncertainty = {'variance':0.0006, 'pedigreeMatrix':[1,3,5,5,1]} #If we use the ecoinvent v2.2 data.
grit_plastics_comment_default = "Amount of plastics removed from wastewater. Based on an assumed {} kg/m3 of grit removed, "\
                                "and an assumed {:.0f}% of the grit that is plastics.".format(grit_default_total_amount,
                                                                                grit_default_plastic_ratio*100)
grit_biomass_comment_default = "Amount of biomass removed from wastewater. Based on an assumed {} kg/m3 of grit removed, "\
                                "and an assumed {:.0f}% of the grit that is biomass. "\
                                "Biomass waste management modelled as paper waste management.".format(
                                    grit_default_total_amount,
                                    grit_default_biomass_ratio*100)

In [135]:
grit_plastics_comment_default

'Amount of plastics removed from wastewater. Based on an assumed 0.031 kg/m3 of grit removed, and an assumed 50% of the grit that is plastics.'

In [136]:
grit_biomass_comment_default

'Amount of biomass removed from wastewater. Based on an assumed 0.031 kg/m3 of grit removed, and an assumed 50% of the grit that is biomass. Biomass waste management modelled as paper waste management.'

In [47]:
treatment_dataset = add_grit(treatment_dataset,
                              grit_default_total_amount,
                              grit_default_plastic_ratio,
                              grit_default_biomass_ratio,
                              grit_uncertainty,
                              grit_plastics_comment_default,
                              grit_biomass_comment_default,
                              PV,
                              MD)
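
For reference, the per-type amounts implied by the default values can be checked as follows. This only illustrates the arithmetic and does not reproduce `add_grit`, whose implementation is not shown here; the variable names are introduced for the example.

```python
# Default values quoted in this section (ecoinvent v2.2)
total_grit_kg_per_m3 = 0.031
plastic_ratio = 0.5
biomass_ratio = 0.5

# The two ratios should cover all of the grit
assert abs(plastic_ratio + biomass_ratio - 1) < 1e-12

plastic_grit_kg_per_m3 = total_grit_kg_per_m3 * plastic_ratio
biomass_grit_kg_per_m3 = total_grit_kg_per_m3 * biomass_ratio

# Expressed in g/m3, each should match the 15.5 g/m3 quoted for ecoinvent v2.2
print(plastic_grit_kg_per_m3 * 1000, biomass_grit_kg_per_m3 * 1000)
```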

### Inputs from the technosphere  
Inputs from the technosphere correspond to consumables, energy and infrastructure inputs. Here is the list in the current WWT datasets:  
- Consumables:  
    - iron (III) chloride, without water, in 40% solution state  

- Energy:  
    - heat, district or industrial, natural gas  
    - heat, central or small-scale, other than natural gas  
    - electricity, low voltage  

- Infrastructure - not yet resolved, see email:  
    - wastewater treatment facility, capacity XXl/year  
    - sewer grid, XXl/year, YY km    

#### 1. Consumables:

From Yves' list: 
  - aluminium sulfate, powder  
  - aluminium sulfate, without water, in 4.33% aluminium solution state  
  - lime  
  - iron(III) chloride, without water, in 14% iron solution state  
  - polyelectrolyte (No proxy found yet)
  
Are there others you would like to include?  

Would you like to allow the user to add inputs themselves? If so, they should be based on the known list of products, see [this excel file](http://www.ecoinvent.org/files/activity_overview_for_users_3.3_undefined_1.xlsx), tab "intermediate exchanges".

**Mandatory inputs**: amount per treated m3 only. Calculated by the tool. It can possibly be overridden, but a comment should then explain why.
Note as well that the amount is "per m3 treated", while the amount we want in the dataset is "per m3 discharged to the sewers". This conversion is included in the function.
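
The conversion can be sketched as follows, assuming that of each m3 discharged to the sewers only the treated share reaches the WWTP, so that the amount per m3 discharged is the amount per m3 treated multiplied by (1 - untreated fraction). The function name and the numbers are illustrative.

```python
def per_m3_discharged(amount_per_m3_treated, untreated_fraction):
    """Convert an amount per m3 treated into an amount per m3 discharged to the sewers."""
    assert 0 <= untreated_fraction <= 1, "untreated_fraction must be between 0 and 1"
    treated_fraction = 1 - untreated_fraction
    return amount_per_m3_treated * treated_fraction

# Illustrative: 0.42 units per m3 treated, 30% of the wastewater never reaches the WWTP
converted_amount = per_m3_discharged(0.42, 0.3)
print(round(converted_amount, 3))
```

If all the wastewater is treated, the two amounts are identical; if none is treated, the input drops to zero.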

**Other elective inputs to override default values**: comment (*mandatory if default amount not used*), uncertainty.  

**Units used should be the default units**, see `MD['IntermediateExchanges'].loc[exchange_name, 'unitName']`

In [48]:
def generate_consumables(dataset,
                         exchange_name,
                         amount,
                         WW_discharged_without_treatment,
                         uncertainty,
                         comment,
                         MD):
    exc = create_empty_exchange()
    exc.update({'group': 'FromTechnosphere',
                'name': exchange_name,
                'unitName': MD['IntermediateExchanges'].loc[exchange_name, 'unitName'],
                # Convert from "per m3 treated" to "per m3 discharged to the sewers"
                'amount': amount * (1 - WW_discharged_without_treatment),
                'comment': comment
               })
    dataset, _ = append_exchange(exc, dataset, MD, uncertainty=uncertainty)
    return dataset

In [49]:
consumable_example_exchange_name = 'lime'
consumable_example_amount = 0.42 # Placeholder, per m3 treated, in the default unit of the exchange
consumable_example_uncertainty = {'variance': 0.0006, 'pedigreeMatrix': [2, 4, 3, 3, 1]}
consumable_example_comment = "Amount calculated based on technology mix, wastewater properties and fraction of wastewater discharged to sewers actually treated."

In [50]:
dataset = generate_consumables(dataset,
                               consumable_example_exchange_name,
                               consumable_example_amount,
                               WW_discharged_without_treatment,
                               consumable_example_uncertainty,
                               consumable_example_comment,
                               MD)

#### 2. Energy inputs

##### 2.1 Heat
Heat inputs are entered as MJ of heat provided by a combustion process, and **NOT** as MJ of fuel input or physical quantity of fuel.  
The ecoinvent database distinguishes between heat from natural gas and heat from other sources. If this is not known, a fraction coming from each must be estimated.  
Heat from the onsite combustion of sludge should be modelled directly in the tool and should not be considered here.  
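
The split between the two heat exchanges can be illustrated with a small stand-alone sketch. The helper name is hypothetical, the exchange names are those used in the function further down, and the amounts are per m3 treated (before any conversion to "per m3 discharged to the sewers").

```python
def split_heat(total_heat_mj, fraction_from_natural_gas):
    """Split a total heat requirement (MJ) between natural gas and other sources."""
    assert 0 <= fraction_from_natural_gas <= 1, "fraction must be between 0 and 1"
    return {
        'heat, district or industrial, natural gas':
            total_heat_mj * fraction_from_natural_gas,
        'heat, district or industrial, other than natural gas':
            total_heat_mj * (1 - fraction_from_natural_gas),
    }

# Illustrative: 10 MJ in total, 80% assumed to come from natural gas
heat_split = split_heat(10, 0.8)
for name, amount_mj in heat_split.items():
    print(name, round(amount_mj, 3))
```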

In [54]:
def generate_heat_inputs(dataset,
                         total_heat,
                         fraction_from_natural_gas,
                         WW_discharged_without_treatment,
                         heat_uncertainty,
                         heat_NG_comment,
                         heat_other_comment,
                         MD):
    assert 0 <= fraction_from_natural_gas <= 1, "fraction_from_natural_gas must be a number between 0 and 1"

    # total_heat is per m3 treated; the dataset needs amounts per m3 discharged to the sewers
    treated_fraction = 1 - WW_discharged_without_treatment

    # Natural gas
    if fraction_from_natural_gas > 0:
        if heat_NG_comment == 'default':
            heat_NG_comment = "Heat input associated with wastewater treatment."
            if fraction_from_natural_gas != 1:
                heat_NG_comment += " Based on total heat requirement ({:.2E} MJ, calculated) and an assumed "\
                                   "split between heat from natural gas ({:.2f}) and from other sources ({:.2f}).".format(
                                       total_heat, fraction_from_natural_gas, 1-fraction_from_natural_gas)
            if WW_discharged_without_treatment > 0:
                heat_NG_comment += " Accounts for the {:.0f}% of discharged wastewater assumed not to "\
                                   "be treated at the wastewater treatment plant.".format(
                                       WW_discharged_without_treatment*100)

        exc = create_empty_exchange()
        exc.update({'group': 'FromTechnosphere',
                    'name': 'heat, district or industrial, natural gas',
                    'unitName': 'MJ',
                    'amount': total_heat * fraction_from_natural_gas * treated_fraction,
                    'comment': heat_NG_comment
               })
        # Append inside the `if`, so that nothing is appended when there is no natural gas heat
        dataset, _ = append_exchange(exc, dataset, MD, uncertainty=heat_uncertainty)

    # Other sources
    if fraction_from_natural_gas < 1:
        if heat_other_comment == 'default':
            heat_other_comment = "Heat input associated with wastewater treatment."
            if fraction_from_natural_gas != 0:
                heat_other_comment += " Based on total heat requirement ({:.2E} MJ, calculated) and an assumed "\
                                      "split between heat from natural gas ({:.2f}) and from other sources ({:.2f}).".format(
                                          total_heat, fraction_from_natural_gas, 1-fraction_from_natural_gas)
            if WW_discharged_without_treatment > 0:
                heat_other_comment += " Accounts for the {:.0f}% of discharged wastewater assumed not to "\
                                      "be treated at the wastewater treatment plant.".format(
                                          WW_discharged_without_treatment*100)

        exc = create_empty_exchange()
        exc.update({'group': 'FromTechnosphere',
                    'name': 'heat, district or industrial, other than natural gas',
                    'unitName': 'MJ',
                    'amount': total_heat * (1-fraction_from_natural_gas) * treated_fraction,
                    'comment': heat_other_comment
               })
        # Append inside the `if`, so that the natural gas exchange is not appended a second time
        dataset, _ = append_exchange(exc, dataset, MD, uncertainty=heat_uncertainty)

    return dataset

In [55]:
example_total_heat = 10
example_fraction_from_natural_gas = 0.8
example_heat_uncertainty = {'variance': 0.0006, 'pedigreeMatrix': [2, 4, 3, 3, 1]} # Placeholder
example_heat_NG_comment = "default" # Automatically generate comment, can be overridden
example_heat_other_comment = "default" # Automatically generate comment, can be overridden

In [56]:
dataset = generate_heat_inputs(dataset,
                               example_total_heat,
                               example_fraction_from_natural_gas,
                               WW_discharged_without_treatment,
                               example_heat_uncertainty,
                               heat_NG_comment=example_heat_NG_comment,
                               heat_other_comment=example_heat_other_comment,
                               MD=MD)

##### 2.2 Electricity  
In kWh. Low voltage assumed. 

In [57]:
def generate_electricity_input(dataset,
                               amount,
                               WW_discharged_without_treatment,
                               uncertainty,
                               comment,
                               MD):

    if comment == 'default':
        comment = "Electricity input associated with wastewater treatment."
        if WW_discharged_without_treatment > 0:
            comment += " Accounts for the {:.0f}% of discharged wastewater assumed not to "\
                       "be treated at the wastewater treatment plant.".format(
                                      WW_discharged_without_treatment*100)

    exc = create_empty_exchange()
    exc.update({'group': 'FromTechnosphere',
                'name': 'electricity, low voltage',
                'unitName': 'kWh',
                # Convert from "per m3 treated" to "per m3 discharged to the sewers"
                'amount': amount * (1 - WW_discharged_without_treatment),
                'comment': comment
               })
    dataset, _ = append_exchange(exc, dataset, MD, uncertainty=uncertainty)
    return dataset

In [58]:
dataset = generate_electricity_input(dataset,
                                     amount=2,
                                     WW_discharged_without_treatment=WW_discharged_without_treatment,
                                     uncertainty={'variance': 0.0006, 'pedigreeMatrix': [2, 4, 3, 3, 1]},
                                     comment='default',
                                     MD=MD)

#### 3. Infrastructure

##### Sewer grids: 
- There are currently five sewer grid construction/repair/end-of-life datasets in ecoinvent, each associated with different WWTP capacities (smaller grids have smaller diameters, are longer per capita and transport less WW over their lifetime).  
- In order to know how to model these, we need to decide whether:  
  - We want to create new datasets for this (I suppose not).  
  - We want to 

##### WWTP

### Direct emissions to water  
Direct emissions associated with:  
1) the share of wastewater discharged to the sewer grid but not treated (unconnected sewers or hydraulic overload). Chemical reactions of the wastewater within the sewer are not considered.   
2) the share of chemicals not removed by the treatment before the treated water is released.
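
The two contributions can be combined in a rough sketch, per m3 of wastewater discharged to the sewer. Everything here is an assumption made for illustration: the function name, the single `removal_fraction` parameter standing in for whatever the tool calculates, and the numbers.

```python
def direct_emission_to_water(concentration, untreated_fraction, removal_fraction):
    """Emission to water per m3 discharged to the sewer (same unit as concentration).

    concentration: amount of the substance per m3 of wastewater
    untreated_fraction: share of wastewater discharged without treatment (0 to 1)
    removal_fraction: share of the substance removed from the treated water (0 to 1)
    """
    assert 0 <= untreated_fraction <= 1 and 0 <= removal_fraction <= 1
    # 1) Untreated share: released as is (no reactions in the sewer are considered)
    from_untreated = concentration * untreated_fraction
    # 2) Treated share: only what the treatment does not remove is released
    from_treated = concentration * (1 - untreated_fraction) * (1 - removal_fraction)
    return from_untreated + from_treated

# Illustrative: 0.3 kg/m3, 30% untreated, 90% removal in the WWTP
example_emission = direct_emission_to_water(0.3, 0.3, 0.9)
print(round(example_emission, 3))
```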

# TBC

In [None]:
generate_ecoSpold2(dataset,
                   r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool\templates',
                   'test_1605.spold',
                   r'C:\mypy\code\wastewater_treatment_tool\waste_water_tool\result_folder')