### Download & Unzip Codebase Repository


In [None]:
!wget -O ALIGNN-BERT-TL-crystal.zip https://figshare.com/ndownloader/files/50344371

In [None]:
!unzip "./ALIGNN-BERT-TL-crystal.zip" -d .

### Install Python3.9

In [None]:
# Skip this cell if Python 3.9 is already installed
# Install Python 3.9 (-y answers the apt confirmation prompt automatically)
!sudo apt-get install -y python3.9
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
# Install pip
!sudo apt-get install -y python3.9-distutils
!wget https://bootstrap.pypa.io/get-pip.py
!python get-pip.py

In [None]:
# Check the Python version
!python --version

Install the library dependencies. **IMPORTANT: ignore the restart warning in the pop-up window and do not restart. Click CANCEL.**

In [None]:
!cd ALIGNN-BERT-TL-crystal && pip install -r requirements.txt

### Feature Extraction and Concatenation

#### 1. Download ALIGNN embeddings for the 75k dft-3d dataset (source model: ALIGNN formation energy model trained on the MP project dataset). Alternatively, follow the instructions in [ALIGNNTL: Feature Extraction](https://github.com/NU-CUCIS/ALIGNNTL.git) to extract ALIGNN-based embeddings from a pre-trained ALIGNN model.

In [None]:
!cd ALIGNN-BERT-TL-crystal/data/embeddings && wget -O data0.csv https://figshare.com/ndownloader/files/49434619

#### 2. Download the MatBERT pretrained model following the [instructions](https://github.com/lbnlp/MatBERT) and save it under /matbert (skip this step if using a BERT model instead of MatBERT)

In [None]:
!export MODEL_PATH="/content/ALIGNN-BERT-TL-crystal/matbert" && mkdir -p $MODEL_PATH/matbert-base-cased

!export MODEL_PATH="/content/ALIGNN-BERT-TL-crystal/matbert" && curl -# -o $MODEL_PATH/matbert-base-cased/config.json https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/config.json
!export MODEL_PATH="/content/ALIGNN-BERT-TL-crystal/matbert" && curl -# -o $MODEL_PATH/matbert-base-cased/vocab.txt https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/vocab.txt
!export MODEL_PATH="/content/ALIGNN-BERT-TL-crystal/matbert" && curl -# -o $MODEL_PATH/matbert-base-cased/pytorch_model.bin https://cedergroup-share.s3-us-west-2.amazonaws.com/public/MatBERT/model_2Mpapers_cased_30522_wd/pytorch_model.bin
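
Before moving on, it can help to confirm that the three MatBERT files landed where later steps expect them. The sketch below is only a convenience check and is not part of the repository: the helper name `missing_model_files` is hypothetical, and the directory is assumed to be the `MODEL_PATH` used in the download commands.

```python
from pathlib import Path

# Files fetched by the curl commands in the download cell.
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in model_dir."""
    model_dir = Path(model_dir)
    missing = []
    for name in expected:
        path = model_dir / name
        # A zero-byte file usually means the download was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Assumed location, taken from the MODEL_PATH in the download commands (Colab layout).
model_dir = "/content/ALIGNN-BERT-TL-crystal/matbert/matbert-base-cased"
print("Missing files:", missing_model_files(model_dir) or "none")
```

If the list is not empty, re-running the corresponding `curl` command is usually enough.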




#### 3. Generate and concatenate LLM features.

In [None]:

# 3.a Generate text descriptions for samples
# Example command for the Robocrystallographer text source (first 20 samples only)
!cd ALIGNN-BERT-TL-crystal && python generater.py --text robo --end 20
# Use --text to specify the text generator source: robo/chemnlp
# Use --end k to select a small subset (the first k samples); omit it to use the full dataset in a production run

# For Robocrystallographer, generating text for all 75k samples can take hours.
# (Recommended for production runs) A staged CSV of the Robocrystallographer output from this step is available. To download:
# !cd ALIGNN-BERT-TL-crystal/data/text && wget -O robo_0_75993_skip_none.csv https://figshare.com/ndownloader/files/49576959


In [None]:
# Optional: check the text content of generated text description
import pandas as pd
df = pd.read_csv("ALIGNN-BERT-TL-crystal/data/text/robo_0_20_skip_none.csv", index_col=0) # for example samples (first 20)
# df = pd.read_csv("ALIGNN-BERT-TL-crystal/data/text/robo_0_75993_skip_none.csv", index_col=0) # for production run
df.head()

In [None]:
# 3.b Generate LLM embeddings
# Example command for the Robocrystallographer text source and the matbert-base-cased model
# This step can take hours for the entire dataset
!cd ALIGNN-BERT-TL-crystal && python preprocess.py \
--llm matbert-base-cased --text robo --cache_csv "./data/text/robo_0_20_skip_none.csv" # for first 20 samples only
# Use --text to specify the text generator source: robo/chemnlp
# Use --llm to select the language model: matbert-base-cased/bert-base-uncased
# Use --cache_csv to specify the staged text description file (output of the previous step)

# Alternatively, a staged CSV of the embeddings from Robocrystallographer + matbert-base-cased (the output of this step) is available. To download:
# !cd ALIGNN-BERT-TL-crystal/data/embeddings && wget -O embeddings_matbert-base-cased_robo_75966.csv https://figshare.com/ndownloader/files/50342622

In [None]:
# 3.c Prepare the dataset for training and evaluating the property predictor model (concatenate embeddings if needed)
# Example command for the Robocrystallographer text source and the matbert-base-cased model with full text
# --gnn_file_path: path to the GNN embeddings
# --split_dir: JSON file containing the dataset split information for the train, validation and test sets
# --prop: name of the property to predict; use "all" for all 7 properties
# --input_dir: LLM embeddings folder; the program searches this path for all embedding CSV files whose names match the pattern for the LLM model, text source and skip words
# NOTE: use the production-run (full) dataset from this point onward
# The program checks whether the provided embedding CSV files cover all train/val/test sample ids; if not, it skips the property

# 1. Download the staged embedding CSV file in case the production run (full dataset) was not used in the previous steps
!cd ALIGNN-BERT-TL-crystal/data/embeddings && wget -O embeddings_matbert-base-cased_robo_75966.csv https://figshare.com/ndownloader/files/50342622

# 2. Concatenate the LLM and GNN embeddings and prepare the train/val/test datasets
!cd ALIGNN-BERT-TL-crystal && python features.py \
--input_dir "./data/embeddings" \
--gnn_file_path "./data/embeddings/data0.csv" \
--split_dir "./data/split/" \
--llm matbert-base-cased \
--text robo --prop mbj_bandgap
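
Conceptually, step 3.c joins the two embedding tables on a shared sample id and keeps a property only if every id in its train/val/test split is covered. A minimal pandas sketch of that idea follows. The column name `id`, the feature column names, the toy ids and the helper names are illustrative assumptions, not the actual schema or code used by `features.py`.

```python
import pandas as pd


def concat_embeddings(gnn_df, llm_df, id_col="id"):
    """Join GNN and LLM embeddings side by side, keeping ids present in both."""
    return pd.merge(gnn_df, llm_df, on=id_col, how="inner")


def uncovered_ids(split_ids, merged_df, id_col="id"):
    """Return the split ids that have no row in the merged embedding table."""
    available = set(merged_df[id_col])
    return sorted(set(split_ids) - available)


# Toy tables standing in for data0.csv and the LLM embedding CSV.
gnn = pd.DataFrame({"id": ["s1", "s2", "s3"], "gnn_0": [0.1, 0.2, 0.3]})
llm = pd.DataFrame({"id": ["s1", "s2"], "llm_0": [1.0, 2.0]})

merged = concat_embeddings(gnn, llm)
print(merged.shape)                         # (2, 3)
print(uncovered_ids(["s1", "s3"], merged))  # ['s3']
```

A non-empty result from `uncovered_ids` corresponds to the case in which the property would be skipped.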

### Predictor Model Training

1. Create a config file that specifies the dataset paths, model architecture, hyperparameters and other settings in `./CrossPropertyTL/elemnet/sample/`. An example config file is provided for the mbj_bandgap property with merged embeddings from ALIGNN and MatBERT and text from Robocrystallographer: `./CrossPropertyTL/elemnet/sample/example_alignn_matbert-base-cased_robo_prop_mbj_bandgap.config`

2. Make sure the file paths of the train/val/test datasets generated in the last step are correctly entered in the config file:

```
{
   ...
   "train_data_path": "../../data/dataset_alignn_matbert-base-cased_robo_prop_mbj_bandgap_train.csv",
   "val_data_path": "../../data/dataset_alignn_matbert-base-cased_robo_prop_mbj_bandgap_val.csv",
   "test_data_path": "../../data/dataset_alignn_matbert-base-cased_robo_prop_mbj_bandgap_test.csv",
   ...
}
```

3. Pass the config file to `dl_regressors_tf2.py` to start model training. The example command below trains a predictor model for mbj_bandgap with GNN and MatBERT embeddings generated from Robocrystallographer text. The test error (MAE) is printed at the end of the training process. The log is also saved under `ALIGNN-BERT-TL-crystal/CrossPropertyTL/elemnet/log/alignn_matbert-base-cased_robo_prop_mbj_bandgap.log`
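
A quick way to catch a mistyped path from step 2 before launching training is to load the config and test each dataset path. This sketch assumes the config file is plain JSON and that relative paths are resolved from the `CrossPropertyTL/elemnet` directory, where the training command is run. `missing_data_paths` and `check_config_file` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path

DATA_KEYS = ("train_data_path", "val_data_path", "test_data_path")


def missing_data_paths(config, base_dir="."):
    """Return {key: path} for dataset entries that are unset or not on disk."""
    base_dir = Path(base_dir)
    missing = {}
    for key in DATA_KEYS:
        value = config.get(key)
        if not value or not (base_dir / value).is_file():
            missing[key] = value
    return missing


def check_config_file(config_path, base_dir="."):
    """Load a JSON config (assumed format) and report its missing dataset paths."""
    with open(config_path) as fh:
        config = json.load(fh)
    return missing_data_paths(config, base_dir)
```

Run from `ALIGNN-BERT-TL-crystal/CrossPropertyTL/elemnet`, `check_config_file("sample/example_alignn_matbert-base-cased_robo_prop_mbj_bandgap.config")` should return an empty dict once all three CSV files exist.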

In [None]:
# GPU env only: install CUDA 11 to match TensorFlow 2.7
# Google Colab notebooks default to CUDA 12
!apt-get update
!apt-get install -y cuda-toolkit-11-8

In [None]:
# Example command for the Robocrystallographer text source and the matbert-base-cased model with full text
!export LD_LIBRARY_PATH=/usr/local/cuda-11/lib64:$LD_LIBRARY_PATH && \
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH && \
cd ALIGNN-BERT-TL-crystal/CrossPropertyTL/elemnet && python dl_regressors_tf2.py \
--config_file sample/example_alignn_matbert-base-cased_robo_prop_mbj_bandgap.config
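
The reported test error is the mean absolute error (MAE), the average of |y_true - y_pred| over the test samples. For reference, here is a minimal plain-Python sketch of that metric. The function name and the toy values are illustrative only; the training script computes the metric itself.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, not results from the model.
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5
```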

### Ablation Study

 We can optionally use embeddings from a single source for the purpose of an ablation study.

*   ALIGNN embeddings only: pass the `--gnn_only` flag to `features.py` (see the command below), then proceed with the Predictor Model Training section:
  1. Create a new config file for the GNN-embeddings-only test by copying the example one under a new name such as `./CrossPropertyTL/elemnet/sample/example_alignn_prop_mbj_bandgap.config`
  2. Make sure the file paths of the train/val/test datasets generated in the last step are correctly entered in the config file, e.g. `./data/dataset_alignn_only_prop_mbj_bandgap_test.csv`
  3. Run the model training python command with the new config file as the argument

In [None]:
# 3.c(GNN only) Add --gnn_only flag to use GNN embeddings only
!cd ALIGNN-BERT-TL-crystal && python features.py \
--input_dir "./data/embeddings" \
--gnn_only \
--gnn_file_path "./data/embeddings/data0.csv" \
--split_dir "./data/split/" \
--llm matbert-base-cased \
--text robo \
--prop mbj_bandgap

*   LLM embeddings only: omit the `--gnn_file_path` argument when running `features.py`, then proceed with the Predictor Model Training section:
  1. Create a new config file for the LLM-embeddings-only test by copying the example one under a new name such as `./CrossPropertyTL/elemnet/sample/example_matbert-base-cased_robo_prop_mbj_bandgap.config`
  2. Make sure the file paths of the train/val/test datasets generated in step 3.c are correctly entered in the config file, e.g. `./data/dataset_matbert-base-cased_robo_prop_mbj_bandgap_val.csv`
  3. Run the model training python command with the new config file as the argument

In [None]:
# 3.c (LLM only) Omit the --gnn_file_path flag to use LLM embeddings only
!cd ALIGNN-BERT-TL-crystal && python features.py \
--input_dir "./data/embeddings" --split_dir "./data/split/" --llm matbert-base-cased --text robo --prop mbj_bandgap
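
Creating the per-variant config files described in step 1 of each list above can be scripted. The sketch below assumes the config is plain JSON and that the generated dataset files follow the `{prefix}_train.csv`, `{prefix}_val.csv`, `{prefix}_test.csv` pattern seen in the example paths. `with_dataset_prefix` and `write_variant` are hypothetical helpers, and any other fields that differ between variants still need to be edited by hand.

```python
import copy
import json


def with_dataset_prefix(config, prefix, data_dir="../../data"):
    """Return a copy of config whose train/val/test paths use the given prefix."""
    updated = copy.deepcopy(config)
    for split in ("train", "val", "test"):
        updated[f"{split}_data_path"] = f"{data_dir}/{prefix}_{split}.csv"
    return updated


def write_variant(src_config, dst_config, prefix):
    """Copy a JSON config (assumed format) to dst_config with dataset paths rewritten."""
    with open(src_config) as fh:
        config = json.load(fh)
    with open(dst_config, "w") as fh:
        json.dump(with_dataset_prefix(config, prefix), fh, indent=3)


# Toy config; "other_setting" is a placeholder, not a real config key.
base = {"train_data_path": "old_train.csv", "other_setting": 1}
gnn_only = with_dataset_prefix(base, "dataset_alignn_only_prop_mbj_bandgap")
print(gnn_only["test_data_path"])
# ../../data/dataset_alignn_only_prop_mbj_bandgap_test.csv
```

Returning a modified copy leaves the example config untouched, so each ablation variant starts from the same baseline.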

### Text Representation Analysis

 We can optionally remove the sentences that belong to a specific topic. Available removable topics:

*   Generated text from Robocrystallographer is categorized into five classes: summary, structure coordination, site info, bond length, and bond angle.
*   Generated text from ChemNLP is categorized into three classes: chemical info, structure info and bond length.

To do this, pass the `--skip_sentence` argument to `preprocess.py` and `features.py` using the following values, then proceed normally with Predictor Model Training:

* Robocrystallographer: ['summary', 'site', 'bond', 'length', 'angle']
* ChemNLP: ['structure', 'chemical', 'bond']
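
To make the idea concrete, the sketch below drops every sentence that mentions a given keyword. It is a simplified illustration only: the naive split on `. `, the keyword match and the helper name `drop_sentences` are assumptions, and the repository's own topic classification may work differently.

```python
def drop_sentences(text, keyword):
    """Remove sentences containing keyword (case-insensitive) from text."""
    # Naive sentence split; adequate for short, templated descriptions.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    kept = [s for s in sentences if keyword.lower() not in s.lower()]
    # Re-join and restore the final period lost by the split.
    result = ". ".join(s.rstrip(".") for s in kept)
    return result + "." if result else ""


# Made-up description, loosely in the style of generated text.
description = (
    "NaCl crystallizes in the cubic Fm-3m space group. "
    "All Na-Cl bond lengths are 2.8 A. "
    "The Cl-Na-Cl bond angles are 90 degrees."
)
print(drop_sentences(description, "angle"))
# NaCl crystallizes in the cubic Fm-3m space group. All Na-Cl bond lengths are 2.8 A.
```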


In [None]:
# 3.b (with text removal) Generate LLM embeddings
# Example command for the Robocrystallographer text source and the matbert-base-cased model with "summary" text removed
!cd ALIGNN-BERT-TL-crystal && python preprocess.py \
--skip_sentence "summary" --llm matbert-base-cased --text robo --cache_csv "./data/text/robo_0_75993_skip_none.csv"

In [None]:
# 3.c (with text removal) Prepare the dataset for training and evaluating the property predictor model
# Example command for the Robocrystallographer text source and the matbert-base-cased model with "summary" text removed
# NOTE: the program checks whether the provided embedding CSV files cover all sample ids.
!cd ALIGNN-BERT-TL-crystal && python features.py \
--skip_sentence summary --input_dir "./data/embeddings" \
--gnn_file_path "./data/embeddings/data0.csv" \
--split_dir "./data/split/" \
--llm matbert-base-cased \
--text robo \
--prop mbj_bandgap

Then proceed with the Predictor Model Training section:
1. Create a new config file or modify the example one. Make sure the file paths of the train/val/test datasets generated in the last step are correctly entered in the config file, e.g. ./data/dataset_matbert-base-cased_robo_prop_skip_summary_mbj_bandgap_val.csv
2. Run the model training python command with the new config file as the argument