# ThermoML Data Analysis Workflow

This notebook demonstrates two ways to extract and analyze ThermoML data with the `thermoml-fair` package:

1.  **Using the Command-Line Interface (CLI)**: A straightforward method for generating CSV files directly.
2.  **Calling the Python Function Directly**: A more flexible approach that loads data directly into pandas DataFrames in your notebook for immediate analysis.

---

## Method 1: Using the Command-Line Interface (CLI)

This method is recommended for its simplicity and robustness. The CLI processes every XML file in a directory and skips any file that fails to parse, so you always get CSV output for all the data that parsed successfully. Pass `--show-failed-files` to get a summary of the files that were skipped.
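The skip-and-report behaviour described above can be illustrated with a small, generic sketch. It is not the `thermoml-fair` implementation: `parse_all` is a hypothetical helper, and the standard-library `xml.etree.ElementTree` parser stands in for the package's own parser.

```python
import xml.etree.ElementTree as ET


def parse_all(documents):
    """Parse every document, skipping the ones that fail.

    `documents` maps a file name to its XML text. Returns the parsed roots
    plus the list of names that could not be parsed.
    (Hypothetical helper, for illustration only.)
    """
    parsed, failed = {}, []
    for name, text in documents.items():
        try:
            parsed[name] = ET.fromstring(text)
        except ET.ParseError:
            # Record the failure and keep going instead of aborting the run.
            failed.append(name)
    return parsed, failed


docs = {
    "good.xml": "<DataReport><Version/></DataReport>",
    "broken.xml": "<DataReport>",  # unclosed element, cannot be parsed
}
parsed, failed = parse_all(docs)
print("parsed:", sorted(parsed))
print("failed:", failed)
```

Collecting failures in a list, rather than raising on the first one, is what lets a long batch run finish and still tell you which inputs need attention.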

In [13]:
# The `build-dataframe` command is the primary entry point for the CLI.
# It scans a directory for ThermoML files (.xml or cached .parsed.pkl), processes them,
# and saves the extracted data into three separate CSV files.
# No --input-dir is passed here, so the default directory is used (see the output below);
# pass --input-dir to read from a different location.
# Adding --show-failed-files prints a summary of any files that couldn't be processed.
!thermoml-fair build-dataframe

No input directory specified, using default: C:\Users\angel\.thermoml\extracted_xml
Building DataFrames from data in: C:\Users\angel\.thermoml\extracted_xml
Found 11923 XML file(s) and 11624 .parsed.pkl file(s) in C:\Users\angel\.thermoml\extracted_xml.
The process will prioritize .parsed.pkl files.
Starting DataFrame construction process... This may take a while depending on data size and if parsing is needed.

Parsing XML files... ------------------------   0% (0 of 11923) -:--:-- 0:00:00

Default schema not found at C:\Users\angel\.ai-navigator\micromamba\envs\cpu\Lib\site-packages\thermoml_fair\data\ThermoML.xsd. Please ensure it's correctly installed or set THERMOML_SCHEMA_PATH.
Failed to parse XML file C:\Users\angel\.thermoml\extracted_xml\10.1016\j.fluid.2018.07.020.xml: 'utf-8' codec can't decode byte 0xe2 in position 10143: invalid continuation byte
Traceback (most recent call last):
  File "C:\Users\angel\.ai-navigator\micromamba\envs\cpu\Lib\site-packages\thermoml_fair\core\utils.py", line 66, in parse_one
    return parse_thermoml_xml(file_path, xsd_path_or_obj=schema_for_parsing)
  File "C:\Users\angel\.ai-navigator\micromamba\envs\cpu\Lib\site-packages\thermoml_fair\core\parser.py", line 52, in parse_thermoml_xml
    data_dict = xmltodict.parse(f.read())
                                ~~~~~~^^
  File "<frozen codecs>", line 325, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10143: invalid continuation byte

Failed to parse X

After the command finishes, three CSV files are written to the current working directory: `thermoml_data.csv`, `thermoml_compounds.csv`, and `thermoml_properties.csv`. Let's load them with pandas to inspect the contents.

In [14]:
import pandas as pd

# Load the main data file
data_df = pd.read_csv('thermoml_data.csv')
data_df.head()


Unnamed: 0,material_id,components,thermoml_fair_version,property,value,phase,method,uncertainty,source_file,doi,...,Final mole fraction of solute,"Amount concentration (molarity), mol/dm3_4","Ratio of amount of solute to mass of solution, mol/kg_2","Ratio of amount of solute to mass of solution, mol/kg_3","Solvent: Ratio of amount of component to mass of solvent, mol/kg","Lower temperature, K","Upper temperature, K","Amount concentration (molarity), mol/dm3_3",Volume ratio of solute to solvent,formula
0,1,"1,2-ethanediol",1.0.14,"Thermal conductivity, W/m/K",0.2496,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,...,,,,,,,,,,C2H6O2
1,1,"1,2-ethanediol",1.0.14,"Thermal conductivity, W/m/K",0.2518,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,...,,,,,,,,,,C2H6O2
2,1,"1,2-ethanediol",1.0.14,"Thermal conductivity, W/m/K",0.2539,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,...,,,,,,,,,,C2H6O2
3,1,"1,2-ethanediol",1.0.14,"Thermal conductivity, W/m/K",0.2561,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,...,,,,,,,,,,C2H6O2
4,1,"1,2-ethanediol",1.0.14,"Thermal conductivity, W/m/K",0.258,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,...,,,,,,,,,,C2H6O2


In [15]:
# Load the compounds file
compounds_df = pd.read_csv('thermoml_compounds.csv')
compounds_df.head()

Unnamed: 0,nOrgNum,sCommonName,sFormulaMolec,source_file,sCASRegistryNum
0,1,"1,2-ethanediol",C2H6O2,C:\Users\angel\.thermoml\extracted_xml\10.1007...,
1,2,diethylene glycol,C4H10O3,C:\Users\angel\.thermoml\extracted_xml\10.1007...,
2,3,triethylene glycol,C6H14O4,C:\Users\angel\.thermoml\extracted_xml\10.1007...,
3,4,tetraethylene glycol,C8H18O5,C:\Users\angel\.thermoml\extracted_xml\10.1007...,
4,1,butan-1-ol,C4H10O,C:\Users\angel\.thermoml\extracted_xml\10.1007...,


In [16]:
# Load the properties file
properties_df = pd.read_csv('thermoml_properties.csv')
properties_df.head()

Unnamed: 0,sPropName
0,"2nd virial coefficient, m3/mol"
1,"3rd virial coefficient, m6/mol2"
2,"Amount density, mol/m3"
3,"Apparent molar volume, m3/mol"
4,"Binary diffusion coefficient, m2/s"
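The `Unnamed: 0` column in the previews above is the DataFrame index that was written into the CSV. A minimal sketch of reading it back as the index follows. The in-memory sample reuses two rows from the preview above with most columns dropped, and `sample_csv` and `sample_df` are illustrative names rather than part of the package.

```python
import io

import pandas as pd

# Tiny stand-in for thermoml_data.csv: two rows copied from the preview above,
# keeping only a few of the many columns.
sample_csv = io.StringIO(
    ',material_id,components,property,value,phase\n'
    '0,1,"1,2-ethanediol","Thermal conductivity, W/m/K",0.2496,Liquid\n'
    '1,1,"1,2-ethanediol","Thermal conductivity, W/m/K",0.2518,Liquid\n'
)

# index_col=0 treats the first (unnamed) column as the index, not as data.
# low_memory=False makes pandas infer each column's dtype from the whole file,
# which avoids mixed-type warnings on wide, sparse tables.
sample_df = pd.read_csv(sample_csv, index_col=0, low_memory=False)

print(list(sample_df.columns))
print(sample_df["value"].mean())
```

With the real files, the same keyword arguments can be passed to `pd.read_csv('thermoml_data.csv', ...)`.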


## Method 2: Using the Python API

This method provides more flexibility and allows you to integrate the data extraction directly into your Python scripts or notebooks. The core function is `build_pandas_dataframe`.

In [18]:
from pathlib import Path
from thermoml_fair.core.utils import build_pandas_dataframe

# Define the path to the XML data files
data_dir = Path('thermoml_fair/data/')

In [20]:
# build_pandas_dataframe expects a list of XML file paths, not a directory string.
# (Passing the string itself makes every character be treated as a file name.)
xml_files = [str(p) for p in data_dir.rglob('*.xml')]

# Run the extraction. The function returns a dict-like bundle rather than a
# 4-tuple, so read each result by key instead of unpacking.
bundle = build_pandas_dataframe(xml_files=xml_files)
api_data_df = bundle.get('data')
api_compounds_df = bundle.get('compounds')
api_properties_df = bundle.get('properties')
failed_files = bundle.get('failed_files', [])

print("Data extraction complete.")

Now, let's inspect the DataFrames and the list of failed files.

In [21]:
print("Main Data:")
api_data_df.head()

In [None]:
print("Compounds Data:")
api_compounds_df.head()

In [None]:
print("Properties Data:")
api_properties_df.head()

In [None]:
print("Failed Files:")
if failed_files:
    # Each entry is a file path, so iterate over the entries directly
    for file in failed_files:
        print(f"- {file}")
else:
    print("No files failed to parse.")

## Bonus: Visualizing the Data

Now that we have the data loaded into pandas DataFrames, we can create some visualizations to explore it. An interactive plot is a great way to showcase the richness of the dataset.

Below, we will create a scatter plot of Density vs. Temperature. Each point will represent a measurement, colored by the compound's name. You can hover over any point to see more details, like the pressure and the source publication (DOI).

In [None]:
import plotly.express as px
import pandas as pd

# Ensure the data from the API call is available.
# If you haven't run the API cells yet, please run them first.
if 'api_data_df' in locals():
    # Column names follow the data preview shown earlier in this notebook:
    # 'property', 'value', 'components' and 'doi'. The compound names are
    # already in 'components', so no merge with the compounds table is needed.

    # Filter for density measurements, a common and interesting property.
    # Matching on the name and unit suffix also selects "Mass density, kg/m3".
    is_density = api_data_df['property'].str.contains('density, kg/m3', case=False, na=False)
    density_df = api_data_df[is_density].copy()

    # Convert the property value column to numeric type for plotting
    density_df['value'] = pd.to_numeric(density_df['value'], errors='coerce')

    # Only request hover columns that exist in this dataset
    hover_cols = [c for c in ['Pressure, kPa', 'doi'] if c in density_df.columns]

    if not density_df.empty and 'Temperature, K' in density_df.columns:
        # Create the interactive scatter plot
        fig = px.scatter(
            density_df,
            x='Temperature, K',
            y='value',
            color='components',  # Color points by the name of the compound
            hover_data=hover_cols,  # Show these details on hover
            title='Density of Compounds vs. Temperature'
        )

        # Update axis and legend labels for clarity
        fig.update_layout(
            yaxis_title="Density, kg/m3",
            xaxis_title="Temperature, K",
            legend_title="Compound"
        )

        fig.show()
    else:
        print("No density data found to plot.")
else:
    print("Please run the Python API cells above to load the data before creating the plot.")


## Method 1: Using the Command-Line Interface (CLI)

This is the simplest way to get started. The `thermoml-fair build-dataframe` command will scan the default data directory for XML files, process them, and save the results into the specified CSV files.

The `--show-failed-files` flag is recommended to get a list of any XML files that could not be parsed, which is useful for debugging.
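One failure mode visible in the CLI output earlier in this notebook is a `UnicodeDecodeError` on a file that is not valid UTF-8. The sketch below shows one way to find such files before parsing. `find_non_utf8_files` is a hypothetical helper, not part of `thermoml-fair`, and the demo runs on temporary files rather than real ThermoML data.

```python
import tempfile
from pathlib import Path


def find_non_utf8_files(directory):
    """Return the .xml files under `directory` that do not decode as UTF-8."""
    bad = []
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad


# Demo on throwaway files: one valid UTF-8 file and one with a stray 0xe2 byte.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ok.xml").write_bytes("<a>µ</a>".encode("utf-8"))
    (Path(tmp) / "bad.xml").write_bytes(b"<a>\xe2 </a>")
    bad_names = [p.name for p in find_non_utf8_files(tmp)]

print(bad_names)
```

Files flagged this way can then be inspected or re-encoded separately, since the list says nothing about which encoding they actually use.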

In [None]:
# This command will create the CSV files in the same directory as this notebook.
# It will process all XML files found in the default cache directory (~/.thermoml/extracted_xml).
!thermoml-fair build-dataframe --output-data-file thermoml_data.csv --output-compounds-file thermoml_compounds.csv --output-properties-file thermoml_properties.csv --show-failed-files

After running the cell above, you should see `thermoml_data.csv`, `thermoml_compounds.csv`, and `thermoml_properties.csv` in your project directory.

---

## Method 2: Calling the Python Function Directly

This method provides much more flexibility. By calling the `build_pandas_dataframe` function from the `thermoml_fair` library, you can load the data directly into memory as pandas DataFrames without first saving them to disk. This is ideal for interactive analysis and allows for more advanced configuration.

The steps are:
1.  Import the necessary functions.
2.  Locate the XML files in the `thermoml-fair` cache directory.
3.  Call the function to build the data bundle.
4.  Extract the DataFrames from the returned bundle.
5.  Optionally, save the DataFrames to files and inspect them.

In [2]:
import os
from pathlib import Path
import pandas as pd
from thermoml_fair.core.utils import build_pandas_dataframe, get_cache_dir
from thermoml_fair.scripts.cli import get_schema_object

# 1. Set up paths and parameters
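# get_cache_dir() returns the package cache directory. In the run recorded below
# it resolved to ~/.thermoml/thermoml_cache, which held no XML files.
# The CLI's default input directory is ~/.thermoml/extracted_xml, so if nothing
# is found here, try the commented-out line instead.
xml_directory = get_cache_dir()
# xml_directory = Path.home() / ".thermoml" / "extracted_xml"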
output_data_file = 'thermoml_data_from_py.csv'
output_compounds_file = 'thermoml_compounds_from_py.csv'
output_properties_file = 'thermoml_properties_from_py.csv'

# Find all XML files in the directory
xml_files = list(xml_directory.rglob("*.xml"))
print(f"Found {len(xml_files)} XML files in {xml_directory}")

# Load the ThermoML schema for validation during parsing
schema_obj = get_schema_object()

if xml_files and schema_obj:
    print("Starting DataFrame construction...")
    # 2. Call the core function
    # This will parse the XMLs (using cached .pkl files if available) and build the DataFrames
    data_bundle = build_pandas_dataframe(
        xml_files=[str(f) for f in xml_files],
        xsd_path_or_obj=schema_obj,
        normalize_alloys=True, # Optional: Set to True if you need normalized formulas for alloys
        max_workers=None # Optional: Uses all available CPU cores by default
    )

    # 3. Extract the results
    df_data = data_bundle.get('data')
    df_compounds = data_bundle.get('compounds')
    df_properties = data_bundle.get('properties')
    failed_files = data_bundle.get('failed_files', [])

    # 4. Save the DataFrames to CSV files
    if df_data is not None and not df_data.empty:
        df_data.to_csv(output_data_file, index=False)
        print(f"Successfully saved {len(df_data)} records to {output_data_file}")

    if df_compounds is not None and not df_compounds.empty:
        df_compounds.to_csv(output_compounds_file, index=False)
        print(f"Successfully saved {len(df_compounds)} unique compounds to {output_compounds_file}")
        
    if df_properties is not None and not df_properties.empty:
        df_properties.to_csv(output_properties_file, index=False)
        print(f"Successfully saved {len(df_properties)} unique properties to {output_properties_file}")

    # 5. Report any failures
    if failed_files:
        print(f"\n[WARNING] {len(failed_files)} file(s) failed to parse and were skipped.")
        print("Failed files:")
        for f in failed_files[:10]: # Print the first 10 failed files
            print(f"  - {Path(f).name}")
        if len(failed_files) > 10:
            print(f"  ... and {len(failed_files) - 10} more.")
    
    print("\nDataFrame construction complete.")
    
    # You can now work with the DataFrame directly
    if df_data is not None:
        print("\n**Preview of the main data:**")
        display(df_data.head())

else:
    if not xml_files:
        print("No XML files found. Please run 'thermoml-fair update-archive' first.")
    if not schema_obj:
        print("Could not load the ThermoML schema. Please ensure the package is installed correctly.")

Found 0 XML files in C:\Users\angel\.thermoml\thermoml_cache
No XML files found. Please run 'thermoml-fair update-archive' first.
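The output above reports zero XML files under `thermoml_cache`, whereas the CLI run earlier in this notebook read its files from `extracted_xml`. A small sketch for picking the first candidate directory that actually contains XML files follows. `first_dir_with_xml` is a hypothetical helper, and the demo uses temporary directories whose names only mirror the two folders mentioned above.

```python
import tempfile
from pathlib import Path


def first_dir_with_xml(candidates):
    """Return the first directory in `candidates` that contains any .xml file."""
    for candidate in candidates:
        directory = Path(candidate)
        if directory.is_dir() and next(directory.rglob("*.xml"), None) is not None:
            return directory
    return None


# Demo: an empty cache folder next to a folder that holds one XML file.
with tempfile.TemporaryDirectory() as tmp:
    empty_dir = Path(tmp) / "thermoml_cache"
    full_dir = Path(tmp) / "extracted_xml"
    empty_dir.mkdir()
    full_dir.mkdir()
    (full_dir / "example.xml").write_text("<DataReport/>", encoding="utf-8")

    chosen = first_dir_with_xml([empty_dir, full_dir])
    chosen_name = chosen.name
    nothing = first_dir_with_xml([empty_dir])

print(chosen_name, nothing)
```

In the notebook, the candidates could be `get_cache_dir()` followed by `Path.home() / ".thermoml" / "extracted_xml"`, assuming the files were downloaded to the CLI's default location.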
