(download_additional_information_data)=

# Download Additional Information data

This tutorial shows how to access the additional information data stored in each scenario of a given package. We use the EAWAG-SLUDGE package as an example and finish with a visualization of the data stored in that package.

We begin by importing the enviPath objects needed for this tutorial.

In [1]:
from enviPath_python.enviPath import enviPath
from enviPath_python.objects import *

import pandas as pd

As in other tutorials, we define the host and the package we want to work with.

In [2]:
INSTANCE_HOST = "envipath.org"
EAWAG_SLUDGE_DATA_PACKAGE = "https://envipath.org/package/7932e576-03c7-4106-819d-fe80dc605b8a"

In the following lines of code, we will:

1. Instantiate the enviPath client and the package
2. Declare a `data` list where we will store all the information retrieved
3. Loop over each node of each pathway in the package
    1. Extract all the scenarios
    2. For each scenario, get all the experimental data (additional information) and store it in the `data` list together with its SMILES, the node, scenario and pathway IDs, and the scenario description
4. Create a pandas DataFrame and use it to generate a .csv file with all the extracted data

To reduce the number of requests and the computation time, we have saved the output of the commented-out lines of code in a tab-separated .csv file in the assets/ folder.
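The caching idea described above can also be written as a small helper that only runs the expensive loop when the file is missing. This is a minimal sketch, not part of enviPath-python: `load_or_build` and `fake_download` are hypothetical names, and `fake_download` merely stands in for the loop over pathways, nodes and scenarios.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_or_build(csv_path, build_rows):
    """Read the cached table if it exists; otherwise build it and write the cache."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path, sep="\t")
    table = pd.DataFrame(build_rows())
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    table.to_csv(csv_path, sep="\t", index=False)
    return table


def fake_download():
    # Hypothetical stand-in for the loop over pathways, nodes and scenarios.
    return [{"smiles": "CCO", "scenario_id": "scenario-1"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "additional_information_data.csv"
    first = load_or_build(cache, fake_download)   # builds the table and writes the file
    second = load_or_build(cache, fake_download)  # reads the file back, no rebuild
```

With a helper like this, deleting the .csv file is enough to force a fresh download on the next run.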

In [6]:
eP = enviPath(INSTANCE_HOST)
pkg = Package(eP.requester, id=EAWAG_SLUDGE_DATA_PACKAGE)
# data = []

# for path in pkg.get_pathways():
#     for node in path.get_nodes():
#         scenarios = node.get_scenarios()
#         for scenario in scenarios:
#             temp_data = {"smiles": node.get_smiles(), "node_id": node.get_id(), 
#                          "scenario_id": scenario.get_id(), "scenario_description": scenario.get_description(),
#                          "pathway_id": path.get_id()}
#             temp_add_info = scenario.get_additional_information()
#             for ai in temp_add_info:
#                 add_info = {ai.name + "_" + key: value for (key,value) in ai.params.items()}
#                 temp_data.update(add_info)
#             data.append(temp_data)
            
# # save data
# raw_data = pd.DataFrame(data)
# raw_data.to_csv("../assets/additional_information_data.csv", sep='\t', index=False)
raw_data = pd.read_csv("../assets/additional_information_data.csv", sep="\t")
raw_data.head()
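The column names in `raw_data` follow from the dictionary comprehension in the commented-out loop: each parameter key is prefixed with the name of its additional-information object, which is why the location ends up in a column called `location_location`. The sketch below reproduces that flattening step offline. The `SimpleNamespace` objects are hypothetical stand-ins for what `scenario.get_additional_information()` returns, and the `acidity` entry with its keys and values is made up for illustration.

```python
from types import SimpleNamespace

import pandas as pd


def flatten_additional_information(infos):
    """Merge the params of several additional-information objects into one flat dict."""
    flat = {}
    for ai in infos:
        for key, value in ai.params.items():
            # Prefix each key with the object's name, as in the loop above.
            flat[f"{ai.name}_{key}"] = value
    return flat


# Hypothetical stand-ins; only the `name` and `params` attributes are used.
mock_infos = [
    SimpleNamespace(name="location", params={"location": "WWTP Duebendorf (ARA Neugut), Switzerland"}),
    SimpleNamespace(name="acidity", params={"lowPh": 7.0, "highPh": 7.5}),
]

row = {"smiles": "CCO", "scenario_id": "scenario-1"}
row.update(flatten_additional_information(mock_infos))
example = pd.DataFrame([row])
print(list(example.columns))
```

Scenarios that lack a given type of additional information simply have no key for it, so pandas fills the corresponding cells with `NaN` when the list of dictionaries is turned into a DataFrame.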

Finally, we use the extracted data to analyze the location of each experiment in EAWAG-SLUDGE. To do this, we map similar location strings to a common name, e.g. ("Dübendorf", "WWTP Duebendorf (ARA Neugut), Switzerland", ...) -> "WWTP Duebendorf (ARA Neugut), Switzerland".

In [22]:
import plotly.express as px

def process_location(location):
    """Map the different spellings of a location to one common name."""
    if pd.notna(location):
        if "Duebendorf" in location or "Dübendorf" in location:
            return "WWTP Duebendorf (ARA Neugut), Switzerland"
        elif "IND" in location or "DOM" in location:
            return "Switzerland (IND1, IND2, IND3, IND4, IND5, DOM1, DOM2, DOM3, DOM4, DOM5)"
    return location

# Work on a copy so that raw_data keeps the original location strings
plot_df = raw_data.copy()
plot_df["location_location"] = plot_df["location_location"].apply(process_location)
# Count rows per (scenario, location) pair, then sum those counts per location
plot_df = (
    plot_df[["smiles", "scenario_id", "location_location"]]
    .groupby(["scenario_id", "location_location"]).count().reset_index()
    [["location_location", "smiles"]]
    .groupby("location_location").sum().reset_index()
)
plot_df = plot_df.rename(columns={"location_location": "location", "smiles": "count"})
px.pie(plot_df, names="location", values="count", title="Location of experiments in EAWAG-SLUDGE")

We see that Dübendorf, Switzerland is the predominant location in our dataset. In the same way, one could analyze other relevant features, such as temperature, pH or half-lives.
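As a starting point for such an analysis, the sketch below counts how many distinct scenarios report each value of a column. It is an assumption-laden example: `scenarios_per_value` is a hypothetical helper, the toy table is invented, and `acidity_ph` is a made-up column name, so check `raw_data.columns` for the names actually present in the .csv file. Note that it counts each scenario once, whereas the pie chart code sums table rows.

```python
import pandas as pd


def scenarios_per_value(df, column):
    """Count distinct scenarios for each value of `column`, ignoring missing values."""
    return (
        df.dropna(subset=[column])
        .groupby(column)["scenario_id"]
        .nunique()
        .sort_values(ascending=False)
    )


# Invented toy data; "acidity_ph" is a hypothetical column name.
toy = pd.DataFrame({
    "scenario_id": ["s1", "s1", "s2", "s3", "s4"],
    "acidity_ph": [7.0, 7.0, 7.0, 8.0, None],
})

counts = scenarios_per_value(toy, "acidity_ph")
print(counts)
```

The resulting Series can be passed to a plotting function after `reset_index()`, in the same way as `plot_df` above.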