# ThermoML Data Analysis

This notebook demonstrates how to use the `thermoml-fair` package to load, process, and analyze data from ThermoML archives.

We will perform the following steps:
1.  Generate standardized data files (CSVs) from the raw ThermoML XML files.
2.  Load the data into pandas DataFrames.
3.  Explore the structure of the data.
4.  Filter the data to find specific properties, focusing on **thermal conductivity of crystalline solids**.
5.  Visualize the results.

In [18]:
# First, ensure the required packages are installed
import sys
!{sys.executable} -m pip install pandas plotly



In [None]:
# Step 1: Generate the data files using the thermoml-fair CLI
# This command processes the XML files and creates CSVs for easier analysis.
# It's equivalent to running `thermoml-fair build-dataframe` in your terminal.
# This parses over 10,000 ThermoML XML files, which can take some time depending on your system's performance.
!thermoml-fair build-dataframe --output-data-file thermoml_data.csv --output-compounds-file thermoml_compounds.csv

^C


No input directory specified, using default: C:\Users\angel\.thermoml\extracted_xml
Building DataFrames from data in: C:\Users\angel\.thermoml\extracted_xml
Found 11923 XML file(s) and 11624 .parsed.pkl file(s) in C:\Users\angel\.thermoml\extracted_xml.
The process will prioritize .parsed.pkl files.
Starting DataFrame construction process... This may take a while depending on data size and if parsing is needed.

Parsing XML files... ------------------------   0% (0 of 11923) -:--:-- 0:00:00
Parsing XML files... ------------------------   0% (0 of 11923) -:--:-- 0:00:01
Parsing XML files... ---

Failed to parse XML file C:\Users\angel\.thermoml\extracted_xml\10.1007\s10765-013-1404-4.xml with schema XMLSchema10(name='ThermoML.xsd', namespace='http://www.iupac.org/namespaces/ThermoML'): C:\Users\angel\.thermoml\extracted_xml\10.1007\s10765-013-1404-4.xml is not valid against the provided ThermoML schema.
Failed to parse XML file C:\Users\angel\.thermoml\extracted_xml\10.1007\s10765-013-1468-1.xml with schema XMLSchema10(name='ThermoML.xsd', namespace='http://www.iupac.org/namespaces/ThermoML'): C:\Users\angel\.thermoml\extracted_xml\10.1007\s10765-013-1468-1.xml is not valid against the provided ThermoML schema.
Failed to parse XML file C:\Users\angel\.thermoml\extracted_xml\10.1007\s10765-016-2150-1.xml with schema XMLSchema10(name='ThermoML.xsd', namespace='http://www.iupac.org/namespaces/ThermoML'): C:\Users\angel\.thermoml\extracted_xml\10.1007\s10765-016-2150-1.xml is not valid against the provided ThermoML schema.
Failed to parse XML file C:\Users\angel\.thermoml\extracte
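
The log above reports files that fail schema validation. As a rough sketch, the same build can be launched from Python and the failing paths collected for later inspection. The helper names (`build_command`, `failed_files`, `run_build`) are hypothetical and not part of `thermoml-fair`. The regular expression assumes the "Failed to parse XML file ... with schema ..." wording shown above, and since it is not known here whether the CLI writes those lines to stdout or stderr, the sketch scans both.

```python
import re
import shutil
import subprocess

# Matches the "Failed to parse XML file <path> with schema ..." lines shown above.
FAILED_RE = re.compile(r"^Failed to parse XML file (?P<path>.+?\.xml) with schema ")


def build_command(data_file, compounds_file):
    """Return the CLI invocation from the cell above as an argument list."""
    return [
        "thermoml-fair", "build-dataframe",
        "--output-data-file", data_file,
        "--output-compounds-file", compounds_file,
    ]


def failed_files(log_text):
    """Extract the paths of XML files reported as failing to parse."""
    return [
        match.group("path")
        for line in log_text.splitlines()
        if (match := FAILED_RE.match(line))
    ]


def run_build(data_file="thermoml_data.csv", compounds_file="thermoml_compounds.csv"):
    """Run the build and return (exit code, list of files that failed to parse)."""
    if shutil.which("thermoml-fair") is None:
        raise RuntimeError("thermoml-fair CLI not found on PATH")
    proc = subprocess.run(
        build_command(data_file, compounds_file),
        capture_output=True,
        text=True,
    )
    # Assumption: failure messages may appear on either stream, so scan both.
    return proc.returncode, failed_files(proc.stdout + "\n" + proc.stderr)
```

Calling `run_build()` captures the output rather than streaming it, so the progress bar is not shown while the build runs.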

In [None]:
# Step 2: Load the generated data into pandas DataFrames
import pandas as pd
import os

data_file = 'thermoml_data.csv'
compounds_file = 'thermoml_compounds.csv'

if os.path.exists(data_file) and os.path.exists(compounds_file):
    df_data = pd.read_csv(data_file)
    df_compounds = pd.read_csv(compounds_file)
    print("DataFrames loaded successfully.")
else:
    print("Error: Data files not found. Make sure the previous cell ran successfully.")


DataFrames loaded successfully.
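
The loading cell prints a message when a file is missing, so later cells can still fail with a less obvious `NameError`. A minimal sketch of a stricter alternative follows. `load_csv_checked` is a hypothetical helper, not part of `thermoml-fair`, and the columns passed to it are whichever ones the later steps rely on.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path, required_columns=()):
    """Read a CSV, raising early if the file or expected columns are missing.

    Hypothetical helper for illustration; not part of thermoml-fair.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; run the build step first.")
    df = pd.read_csv(path)
    # Report every missing column at once rather than failing on the first.
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise KeyError(f"{path} is missing expected columns: {missing}")
    return df
```

For example, `df_data = load_csv_checked('thermoml_data.csv', ['property', 'phase', 'value'])` would stop the notebook at this step if the build had not produced the file.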


In [None]:
# Step 3: Explore the data
# Let's look at the first few rows and the columns of the main data file.
if 'df_data' in locals():
    print("Columns in the main data file:")
    print(df_data.columns)
    print("\nFirst 5 rows of the main data:")
    display(df_data.head())

Columns in the main data file:
Index(['material_id', 'components', 'thermoml_fair_version', 'property',
       'value', 'phase', 'method', 'uncertainty', 'source_file', 'doi',
       'publication_year', 'title', 'author', 'journal'],
      dtype='object')

First 5 rows of the main data:


Unnamed: 0,material_id,components,thermoml_fair_version,property,value,phase,method,uncertainty,source_file,doi,publication_year,title,author,journal
0,1,"1,2-ethanediol",1.0.12,"Thermal conductivity, W/m/K",0.2496,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,2005,Application of the Multi-Current Transient Hot...,"Khayet, M.",Int. J. Thermophys.
1,1,"1,2-ethanediol",1.0.12,"Thermal conductivity, W/m/K",0.2518,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,2005,Application of the Multi-Current Transient Hot...,"Khayet, M.",Int. J. Thermophys.
2,1,"1,2-ethanediol",1.0.12,"Thermal conductivity, W/m/K",0.2539,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,2005,Application of the Multi-Current Transient Hot...,"Khayet, M.",Int. J. Thermophys.
3,1,"1,2-ethanediol",1.0.12,"Thermal conductivity, W/m/K",0.2561,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,2005,Application of the Multi-Current Transient Hot...,"Khayet, M.",Int. J. Thermophys.
4,1,"1,2-ethanediol",1.0.12,"Thermal conductivity, W/m/K",0.258,Liquid,Hot disk method,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-005-5568-4,2005,Application of the Multi-Current Transient Hot...,"Khayet, M.",Int. J. Thermophys.
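
In the rows above, the `property` column carries both the property name and its unit in one string ("Thermal conductivity, W/m/K"). The sketch below separates the two, assuming the unit always follows the last ", " in the label. That assumption is based only on the rows shown here and may not hold for every property, for example a dimensionless property whose name contains a comma. `split_property` is a hypothetical helper.

```python
def split_property(label):
    """Split 'Thermal conductivity, W/m/K' into ('Thermal conductivity', 'W/m/K').

    Assumes the unit follows the last ', ' in the label; labels without
    that separator are returned with an empty unit.
    """
    name, sep, unit = label.rpartition(", ")
    if not sep:
        # No separator found: treat the whole label as the property name.
        return label.strip(), ""
    return name.strip(), unit.strip()


print(split_property("Thermal conductivity, W/m/K"))
```

With the DataFrame loaded, `df_data['property'].dropna().map(split_property)` would apply the same split to every non-missing row.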


In [None]:
# Step 4: Filter for Thermal Conductivity of Crystalline Solids

# The data is in a simple "long" format (one measurement per row), so no pivoting is needed.
# We filter on the 'property' and 'phase' columns from the column list printed in Step 3.

if 'df_data' in locals():
    # Column names used for filtering (see the column list printed in Step 3):
    property_col = 'property'
    phase_col = 'phase'
    value_col = 'value'

    # Check if these columns exist
    if property_col in df_data.columns and phase_col in df_data.columns:
        
        # 1. Keep rows whose property contains 'Thermal conductivity'.
        # .str.contains(..., case=False) is a case-insensitive substring match; na=False treats missing values as non-matches.
        conductivity_df = df_data[df_data[property_col].str.contains('Thermal conductivity', case=False, na=False)].copy()

        # 2. From those results, filter for crystalline phases
        crystalline_df = conductivity_df[conductivity_df[phase_col].str.contains('crystal', case=False, na=False)]

        print(f"Found {len(crystalline_df)} records for thermal conductivity in crystalline phases.")
        
        # Display the filtered data
        if not crystalline_df.empty:
            print("Displaying filtered results:")
            display(crystalline_df.head())
        else:
            print("No matching records found after filtering.")

    else:
        print(f"Error: The required columns ('{property_col}' and/or '{phase_col}') were not found in the DataFrame.")
        print("Please check the column list printed in Step 3.")


Found 318 records for thermal conductivity in crystalline phases.
Displaying filtered results:


Unnamed: 0,material_id,components,thermoml_fair_version,property,value,phase,method,uncertainty,source_file,doi,publication_year,title,author,journal
40565,1,iron,1.0.12,"Thermal conductivity, W/m/K",82.5,Crystal 4,,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-019-2568-3,2019,Intercomparison of Thermophysical Property Mea...,"Ebert, H.-P.[Hans-Peter]",Int. J. Thermophys.
40566,1,iron,1.0.12,"Thermal conductivity, W/m/K",73.4,Crystal 4,,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-019-2568-3,2019,Intercomparison of Thermophysical Property Mea...,"Ebert, H.-P.[Hans-Peter]",Int. J. Thermophys.
40567,1,iron,1.0.12,"Thermal conductivity, W/m/K",65.5,Crystal 4,,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-019-2568-3,2019,Intercomparison of Thermophysical Property Mea...,"Ebert, H.-P.[Hans-Peter]",Int. J. Thermophys.
40568,1,iron,1.0.12,"Thermal conductivity, W/m/K",58.2,Crystal 4,,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-019-2568-3,2019,Intercomparison of Thermophysical Property Mea...,"Ebert, H.-P.[Hans-Peter]",Int. J. Thermophys.
40569,1,iron,1.0.12,"Thermal conductivity, W/m/K",52.0,Crystal 4,,,C:\Users\angel\.thermoml\extracted_xml\10.1007...,10.1007/s10765-019-2568-3,2019,Intercomparison of Thermophysical Property Mea...,"Ebert, H.-P.[Hans-Peter]",Int. J. Thermophys.
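
The two-stage filter above can be wrapped in a reusable function and followed by a per-material summary. This is a sketch under stated assumptions: `filter_property_phase` is a hypothetical helper, the column names are those printed in Step 3, and the demo rows are synthetic, not taken from the ThermoML archive. It passes `regex=False` so the search text is matched literally instead of being interpreted as a regular expression.

```python
import pandas as pd


def filter_property_phase(df, property_text, phase_text):
    """Return rows whose 'property' and 'phase' contain the given text.

    Matching is a case-insensitive literal substring match; rows with
    missing values in either column are excluded.
    """
    prop_mask = df["property"].str.contains(
        property_text, case=False, na=False, regex=False
    )
    phase_mask = df["phase"].str.contains(
        phase_text, case=False, na=False, regex=False
    )
    return df[prop_mask & phase_mask].copy()


# Synthetic rows for illustration only; not taken from the ThermoML archive.
demo = pd.DataFrame({
    "components": ["A", "A", "B", "C"],
    "property": ["Thermal conductivity, W/m/K"] * 3 + ["Mass density, kg/m3"],
    "phase": ["Crystal 1", "Liquid", "Crystal", None],
    "value": [1.0, 2.0, 3.0, 4.0],
})

crystals = filter_property_phase(demo, "thermal conductivity", "crystal")
# Record count and value range per material in the filtered rows.
summary = crystals.groupby("components")["value"].agg(["count", "min", "max"])
print(summary)
```

On the real data, `filter_property_phase(df_data, 'Thermal conductivity', 'crystal')` should select the same rows as the cell above, since neither search string contains regex metacharacters.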
