# VIP Fall 2024
# Fine-tuning LLMs for Materials
Our sub-group will focus on fine-tuning LLMs to generate crystal structures. The key idea is that, given a prompt, the fine-tuned LLM outputs a textual representation of a crystal structure, which we can then decode into an actual 3D structure if needed.

# Tutorial I

# 1. Textual Representations of Crystal Materials
Crystal structures are commonly stored in the [CIF](https://en.wikipedia.org/wiki/Crystallographic_Information_File) format. You can view a CIF file with the [VESTA](https://jp-minerals.org/vesta/en/download.html) software, or upload it to this [website](https://next-gen.materialsproject.org/toolkit) for an interactive visualization (login required).

This notebook explores 2 different text-based representations of crystal materials:

1. CIF
2. SLICES

## 1.1 Setting Up

In [None]:
# The Atomic Simulation Environment (ASE) is a package for setting up,
# manipulating, running, visualizing and analyzing atomistic simulations.
# see more: https://wiki.fysik.dtu.dk/ase/index.html
!pip install ase

Collecting ase
  Downloading ase-3.23.0-py3-none-any.whl.metadata (3.8 kB)
Downloading ase-3.23.0-py3-none-any.whl (2.9 MB)
Installing collected packages: ase
Successfully installed ase-3.23.0


In [None]:
import ase
import ase.io
import numpy as np
from glob import glob

In [None]:
# download zip file containing some CIF files
!wget https://github.com/shuyijia/crystal-data/raw/main/MP-20/raw_val.zip

# unzip silently
!unzip -q raw_val.zip

--2024-09-30 02:07:04--  https://github.com/shuyijia/crystal-data/raw/main/MP-20/raw_val.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/shuyijia/crystal-data/main/MP-20/raw_val.zip [following]
--2024-09-30 02:07:05--  https://raw.githubusercontent.com/shuyijia/crystal-data/main/MP-20/raw_val.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5224067 (5.0M) [application/zip]
Saving to: ‘raw_val.zip’


2024-09-30 02:07:05 (59.8 MB/s) - ‘raw_val.zip’ saved [5224067/5224067]



We have downloaded a folder of CIF files in the current directory. You can view it by using the file explorer on the left panel. Alternatively, we can also use `glob`. Let us count the number of CIF files we have downloaded.

In [None]:
cif_path = "raw_val"
cif_files = glob(f"{cif_path}/*.cif")
cif_files.sort()

print(f"Number of CIF files: {len(cif_files)}")

Number of CIF files: 9047


## 1.2 Looking at a single CIF file

In [None]:
cif_file = cif_files[0]
print(f"CIF file: {cif_file}")

CIF file: raw_val/mp-10015.cif
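
`cif_files.sort()` orders paths lexicographically, which is why a five-digit ID such as `mp-10015` can come first: as a string, `"mp-10015"` sorts before `"mp-2"`. If numeric ordering by ID is preferred, a key function along the following lines could be used. This is only a sketch: the helper name `mp_id_key` and the sample file names are made up for illustration, and it assumes IDs of the form `<prefix>-<integer>`.

```python
import re
from pathlib import Path

def mp_id_key(path):
    """Sort key (prefix, numeric id) parsed from a name like 'mp-10015.cif'."""
    stem = Path(path).stem  # e.g. 'mp-10015'
    match = re.fullmatch(r"([A-Za-z]+)-(\d+)", stem)
    if match is None:
        # Unknown naming scheme: fall back to ordering by the raw stem.
        return (stem, -1)
    prefix, number = match.groups()
    return (prefix, int(number))

# Hypothetical file names, for illustration only.
sample_files = ["raw_val/mp-10015.cif", "raw_val/mp-2.cif", "raw_val/mp-999.cif"]

lexicographic = sorted(sample_files)           # plain string order
numeric = sorted(sample_files, key=mp_id_key)  # ordered by integer ID

print(lexicographic)
print(numeric)
```

With the real data, the same key could be passed as `cif_files.sort(key=mp_id_key)`.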


In [None]:
# load it with ase
# crystal structure is loaded as an ase.atoms.Atoms object
atoms = ase.io.read(cif_file)

print(type(atoms))

<class 'ase.atoms.Atoms'>


In [None]:
# view the structure with ase
# you can drag the visualizer!

from ase.visualize import view

# Replicate the unit cell twice along each of the 3 lattice directions.
# Pass `atoms_repeated` instead of `atoms` below to view the 2x2x2 supercell.
atoms_repeated = atoms.repeat((2, 2, 2))
view(atoms, viewer='x3d')

Each crystal structure is defined (completely) by the following 3 attributes:
1. Atomic numbers, $(N,)$
2. Atomic coordinates, $(N,3)$
3. Lattice parameters, $(6,)$

Let's get these attributes from the `atoms` object.

In [None]:
# atomic numbers
atomic_numbers = atoms.get_atomic_numbers()
# atomic symbols
atomic_symbols = atoms.get_chemical_symbols()

print(f"Atomic numbers:\n{atomic_numbers}")
print(f"Atomic symbols:\n{atomic_symbols}")

Atomic numbers:
[14 76]
Atomic symbols:
['Si', 'Os']


In [None]:
# atomic coordinates
atomic_coordinates = atoms.get_positions()
fractional_coordinates = atoms.get_scaled_positions() # scaled positions in [0,1)
print(f"Atomic coordinates:\n{atomic_coordinates}")
print(f"Fractional coordinates:\n{fractional_coordinates}")

Atomic coordinates:
[[1.479645 1.479645 1.479645]
 [0.       0.       0.      ]]
Fractional coordinates:
[[0.5 0.5 0.5]
 [0.  0.  0. ]]


In [None]:
# lattice parameters
lattice_parameters = atoms.cell.cellpar()

print(f"Lattice lengths:\n{lattice_parameters[:3]}")
print(f"Lattice angles:\n{lattice_parameters[3:]}")

Lattice lengths:
[2.95929 2.95929 2.95929]
Lattice angles:
[90. 90. 90.]
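
The three attributes are linked: a Cartesian position is the fractional position expressed in the basis of the lattice vectors, and those vectors can be built from the six lattice parameters. The sketch below shows this with plain Python for the Si/Os cell above. The function names `cellpar_to_matrix` and `frac_to_cart` are illustrative, and the orientation (a along x, b in the xy-plane) is one common convention that I am assuming here, not necessarily the one ASE uses internally.

```python
import math

def cellpar_to_matrix(a, b, c, alpha, beta, gamma):
    """Build the 3 lattice vectors (as rows) from lengths and angles in degrees.

    Assumed convention: a along x, b in the xy-plane.
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    cy = (ca - cb * cg) / sg
    cz = math.sqrt(max(0.0, 1.0 - cb * cb - cy * cy))
    return [
        [a, 0.0, 0.0],
        [b * cg, b * sg, 0.0],
        [c * cb, c * cy, c * cz],
    ]

def frac_to_cart(frac, cell):
    """Cartesian position = sum over i of frac[i] * lattice_vector[i]."""
    return [sum(frac[i] * cell[i][k] for i in range(3)) for k in range(3)]

# Lattice parameters and fractional coordinates of Si from the cells above.
cell = cellpar_to_matrix(2.95929, 2.95929, 2.95929, 90.0, 90.0, 90.0)
si_cart = frac_to_cart([0.5, 0.5, 0.5], cell)
print([round(x, 6) for x in si_cart])
```

For this cubic cell the result should agree with the Cartesian coordinates printed by `atoms.get_positions()`, up to floating-point rounding.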


# 2. CIF Representation used by Crystal-text-LLM (Meta Research)

The leftmost panel in the following figure shows the textual representation used by [Gruver et al. (2024)](https://arxiv.org/abs/2402.04379).

![](https://github.com/facebookresearch/crystal-text-llm/blob/main/assets/crystal_llm_graphic.png?raw=true)

Upon inspection, you will notice that their representation is fairly close to the raw CIF representation:
- the first line encodes lattice lengths,
- the second line encodes lattice angles,
- every two lines after that encode one atom: its element symbol on the first line, then its fractional coordinates on the second.
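
Going the other way (from text back to the three attributes) is what we need in order to decode an LLM's output. A minimal parsing sketch, assuming exactly the line layout listed above, might look like this. The function name `parse_crystal_text` and the example string are illustrative and are not taken from the Crystal-text-LLM code base.

```python
def parse_crystal_text(text):
    """Split a lengths / angles / atoms string into its components."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) < 2 or (len(lines) - 2) % 2 != 0:
        raise ValueError("expected 2 lattice lines followed by symbol/coordinate pairs")

    lengths = [float(x) for x in lines[0].split()]  # line 1: a, b, c
    angles = [float(x) for x in lines[1].split()]   # line 2: alpha, beta, gamma

    symbols, frac_coords = [], []
    # Remaining lines alternate: element symbol, then fractional coordinates.
    for symbol_line, coord_line in zip(lines[2::2], lines[3::2]):
        symbols.append(symbol_line)
        frac_coords.append([float(x) for x in coord_line.split()])

    return {
        "lengths": lengths,
        "angles": angles,
        "symbols": symbols,
        "frac_coords": frac_coords,
    }

# Illustrative input: a two-atom cubic cell written in the layout described above.
example = "3.0 3.0 3.0\n90 90 90\nSi\n0.50 0.50 0.50\nOs\n0.00 0.00 0.00"
parsed = parse_crystal_text(example)
print(parsed["symbols"], parsed["frac_coords"])
```

A generated string that does not follow the layout raises a `ValueError` or a `float` conversion error, which is one simple way to flag invalid model outputs.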

## TODO: Exercise I
Complete the following function which takes in an `ase.atoms.Atoms` object and outputs a string representation similar to that in Crystal-text-LLM (figure above).

Note the following:

1. Lengths should be rounded to 1 decimal place
2. Angles should be rounded to the nearest integer
3. Coordinates should be rounded to 2 decimal places
4. Output a single string; for a new line, use `\n`.

In [None]:
def cif_to_text(atoms):
  # cellpar() returns [a, b, c, alpha, beta, gamma]
  cellpar = atoms.cell.cellpar()
  lattice_lengths = cellpar[:3]
  lattice_angles = cellpar[3:]

  lines = []
  # lengths: 1 decimal place; angles: nearest integer
  lines.append(' '.join(f"{x:.1f}" for x in lattice_lengths))
  lines.append(' '.join(f"{x:.0f}" for x in lattice_angles))

  atom_symbols = atoms.get_chemical_symbols()
  atom_fc = atoms.get_scaled_positions()

  # one line for the element symbol, one for its fractional coordinates (2 decimals)
  for symbol, coords in zip(atom_symbols, atom_fc):
    lines.append(symbol)
    lines.append(' '.join(f"{x:.2f}" for x in coords))

  return '\n'.join(lines)

In [None]:
# test your function
# does it match the Crystal-text-LLM format?
cif_string = cif_to_text(atoms)
print(cif_string)

3.0 3.0 3.0
90 90 90
Si
0.50 0.50 0.50
Os
0.00 0.00 0.00



# 3. SLICES Representation
Simplified Line-Input Crystal-Encoding System ([SLICES](https://github.com/xiaohang007/SLICES.git)) is an invertible and invariant crystal representation.


In [None]:
# install
# you might see an error on pip's dependency solver;
# it should be fine for the purpose of this notebook.
!pip install slices pymatgen --quiet

In [None]:
from slices.core import SLICES
from pymatgen.core.structure import Structure

# obtaining the pymatgen Structure of the cif
original_structure = Structure.from_file(filename=cif_file)

Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


In [None]:
# create SLICES backend
backend = SLICES()
# converting a crystal structure to its SLICES string
slices_string = backend.structure2SLICES(original_structure)

CHGNet v0.3.0 initialized with 412,525 parameters
CHGNet will run on cpu




In [None]:
print(slices_string)

Si Os 0 1 +o+ 0 1 ++o 0 1 o++ 0 1 +oo 0 1 o+o 0 1 oo+ 0 1 ooo 0 1 +++ 
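
To see what this string contains, it can be split into its parts. My reading of the format, which should be checked against the SLICES documentation, is that the element symbols come first, followed by a list of edges; each edge consists of two atom indices and a three-character label whose characters `-`, `o`, `+` are interpreted here as a shift of -1, 0, +1 unit cells along each lattice direction. The helper `parse_slices` below is a hypothetical sketch under that assumption and is not part of the `slices` package.

```python
# Assumed meaning of the edge-label characters (cell shift per lattice direction).
OFFSET = {"-": -1, "o": 0, "+": 1}

def parse_slices(slices_string):
    """Split a SLICES string into element symbols and (i, j, shift) edges."""
    tokens = slices_string.split()
    # Element symbols come first; the edge list starts at the first integer token.
    n_atoms = next((i for i, tok in enumerate(tokens) if tok.isdigit()), len(tokens))
    symbols = tokens[:n_atoms]
    edge_tokens = tokens[n_atoms:]
    if len(edge_tokens) % 3 != 0:
        raise ValueError("edge tokens should come in (i, j, label) triples")

    edges = []
    for k in range(0, len(edge_tokens), 3):
        i, j, label = edge_tokens[k:k + 3]
        edges.append((int(i), int(j), tuple(OFFSET[ch] for ch in label)))
    return symbols, edges

symbols, edges = parse_slices(
    "Si Os 0 1 +o+ 0 1 ++o 0 1 o++ 0 1 +oo 0 1 o+o 0 1 oo+ 0 1 ooo 0 1 +++ "
)
print(symbols)
print(len(edges), "edges; first:", edges[0])
```

Under this reading, the string above describes two atoms (Si and Os) connected by eight edges, each pointing into a different neighbouring cell.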


In [None]:
reconstructed_structure, _ = backend.SLICES2structure(slices_string)

In [None]:
_atoms = reconstructed_structure.to_ase_atoms()

In [None]:
view(_atoms, viewer='x3d')

Note that SLICES' decoder automatically relaxes the reconstructed structure, so the decoded structure might not match the original structure exactly.