# PWscf input generation with LLMs

This notebook generates PWscf input files in order to compare different language models. </br>
No tools are used and no structure is provided: the models rely purely on their internal knowledge.

In [6]:
import numpy as np
import pandas as pd
import json
import re
import os

Specify the compound and see what the different models produce. </br>
Remember to save the API key for each provider as an environment variable.
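
Before running the cells, it can help to check which keys are actually set. The following is a minimal sketch; the variable names are the ones read later in this notebook, while the helper `missing_keys` is a hypothetical name introduced only for this illustration.

```python
import os

# Environment variable names read by the cells in this notebook.
REQUIRED_KEYS = ["OPENAI_API_KEY", "HUGGINGFACE_TOKEN", "GROQ_API_KEY"]

def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    # Hypothetical helper: defaults to the real process environment.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Only the keys for the providers you intend to run need to be present.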

In [7]:
compound="Ca O"

In [8]:
output_dir="generated_files/"

## OpenAI models

In [9]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

OPENAI_MODEL="gpt-4" # alternatives: "gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"
# see the list of available models: https://platform.openai.com/docs/models
openai_api_key = os.environ.get('OPENAI_API_KEY')

In [66]:
llm = ChatOpenAI(
    model=OPENAI_MODEL,
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=openai_api_key
)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant specialising in generate Quantum Espresso input file for single point energy calculations.",
        ),
        ("human", "Please, generate an input file for {compound}"),
    ]
)
chain = prompt | llm
ai_msg = chain.invoke(
    {
        "compound": compound,
    }
)
print(ai_msg.content)

Sure, here is a basic Quantum Espresso input file for a single point energy calculation of a CaO molecule:

```
&CONTROL
  calculation = 'scf',
  prefix = 'CaO',
  pseudo_dir = './',
  outdir = './',
  verbosity = 'high',
/

&SYSTEM
  ibrav = 1,
  celldm(1) = 10.0,
  nat = 2,
  ntyp = 2,
  ecutwfc = 30.0,
  ecutrho = 240.0,
/

&ELECTRONS
  conv_thr = 1.0d-8,
  mixing_beta = 0.7,
/

ATOMIC_SPECIES
  Ca  40.08  Ca.pbe-n-kjpaw_psl.1.0.0.UPF
  O   15.999 O.pbe-n-kjpaw_psl.1.0.0.UPF

ATOMIC_POSITIONS (angstrom)
  Ca  0.0  0.0  0.0
  O   1.5  0.0  0.0

K_POINTS (automatic)
  1 1 1 0 0 0
```

Please note that you need to adjust the pseudopotential files (`pseudo_dir` and `ATOMIC_SPECIES` sections) and the atomic positions (`ATOMIC_POSITIONS` section) according to your system. The `celldm(1)` parameter in the `&SYSTEM` section defines the size of the simulation cell, and you may need to adjust it to avoid interactions between periodic images. The `ecutwfc` and `ecutrho` parameters define the k
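
A generated file like the one above can be given a quick sanity check before it is passed to `pw.x`. The sketch below only tests whether the three namelists and three cards that a plain `scf` input normally contains start some line of the text. It does not validate any values, and it does not check `CELL_PARAMETERS`, which is additionally needed when `ibrav = 0`. The function name `missing_sections` is hypothetical.

```python
import re

REQUIRED_NAMELISTS = ["&CONTROL", "&SYSTEM", "&ELECTRONS"]
REQUIRED_CARDS = ["ATOMIC_SPECIES", "ATOMIC_POSITIONS", "K_POINTS"]

def missing_sections(pw_input):
    """List required namelists and cards that do not start any line of the input."""
    missing = []
    for name in REQUIRED_NAMELISTS + REQUIRED_CARDS:
        # Match the keyword at the beginning of a line, ignoring indentation and case.
        pattern = r"^\s*" + re.escape(name) + r"\b"
        if not re.search(pattern, pw_input, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing

example = (
    "&CONTROL\n/\n&SYSTEM\n/\n&ELECTRONS\n/\n"
    "ATOMIC_SPECIES\nATOMIC_POSITIONS (angstrom)\n"
)
print(missing_sections(example))
```

An empty list means that all six sections are present, not that the input is physically sensible.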

In [67]:
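# Ask the model to strip the surrounding prose and return only the fenced block.
mes=ai_msg.content
messages = [
    ("system", "You are clever parser now"),
    ("human", "Extract information between ```. Do not output any other messages or ```. The text is:" + mes)
]
ai_msg = llm.invoke(messages)
mes=ai_msg.content
# Remove any fences the model left in despite the instruction.
mes=mes.replace("```\n","")
mes=mes.replace("```","")
# "Ca O" -> "CaO_scf.in", prefixed with the model name.
compound_name=compound.replace(" ","")
file_name=compound_name+"_scf.in"
os.makedirs(output_dir, exist_ok=True)  # the output directory may not exist yet
with open(output_dir+OPENAI_MODEL+"_"+file_name, "w") as text_file:
    text_file.write(mes)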
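
The extraction step could also be done locally with the `re` module imported at the top of the notebook, which avoids a second model call. This is a minimal sketch, not the approach used above: `extract_fenced_block` is a hypothetical helper, and it assumes the reply contains a complete fenced block. If the closing fence is missing (for example because the reply was truncated), it falls back to returning the stripped text.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the cell readable

def extract_fenced_block(text):
    """Return the body of the first fenced block, or the stripped text if there is none."""
    # Skip an optional language tag after the opening fence (e.g. "fortran").
    pattern = re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else text.strip()

reply = (
    "Sure, here is a file:\n\n"
    + FENCE + "\n&CONTROL\n  calculation = 'scf',\n/\n" + FENCE
    + "\n\nPlease note ..."
)
print(extract_fenced_block(reply))
```

The trade-off is that a regular expression is deterministic and free, but it cannot recover an input file that the model wrote without fences.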

## Llama (Meta) models

Here I use free models available on Hugging Face.

For usage-restricted (gated) Hugging Face models: </br>
(1) log in to Hugging Face </br>
(2) go to the model repo on Hugging Face and accept the usage conditions </br>
(3) generate a free access token in your Hugging Face profile </br>
(4) save it as the environment variable HUGGINGFACE_TOKEN

Note also that you can't use the Hugging Face Inference API, because all of these models are larger than 10 GB.
If you download them locally, keep in mind that you need enough memory and a GPU to run inference. It is therefore recommended to use Colab (where I'm afraid you will need to pay for sufficient resources) or Kaggle.
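
As a rough back-of-the-envelope check of the memory requirement, the weights alone take about (number of parameters) × (bytes per parameter). The sketch below assumes roughly 8e9 parameters for an "8B" model, which is an approximation rather than an exact count, and it ignores activations and the attention cache, so actual usage during inference is higher.

```python
def weights_size_gb(n_params, bytes_per_param):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumption: about 8e9 parameters; bfloat16 stores each one in 2 bytes.
print(weights_size_gb(8e9, 2))  # 16.0
# The same model in float32 (4 bytes per parameter) would need twice as much.
print(weights_size_gb(8e9, 4))  # 32.0
```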

In [14]:
from huggingface_hub import login

# In Colab the token is read from the notebook secrets; elsewhere fall back to
# the HUGGINGFACE_TOKEN environment variable.
try:
    from google.colab import userdata
    HUGGINGFACE_TOKEN = userdata.get('HUGGINGFACE_TOKEN')
except ImportError:
    HUGGINGFACE_TOKEN = os.environ.get('HUGGINGFACE_TOKEN')
login(token=HUGGINGFACE_TOKEN)

In [1]:
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

def string_with_placeholder(message, value):
  # Substitute the literal "{'input'}" marker in the prompt with the compound name.
  return message.replace("{'input'}", value)

messages = [
    {"role": "system", "content": "You are a helpful assistant specialising in generate Quantum Espresso input file for single point energy calculations."},
    {"role": "user", "content": string_with_placeholder("Please, generate an input file for {'input'}", compound)},
]

outputs = pipeline(
    messages
)

# The pipeline returns the whole chat; the last message is the assistant's reply.
mes=outputs[0]["generated_text"][-1]
name_compound=compound.replace(' ','_')
os.makedirs(output_dir, exist_ok=True)  # the output directory may not exist yet
with open(output_dir+'Meta-Llama-3-8B-Instruct_'+name_compound+'_output.txt','w') as f:
  f.write(mes['content'])


## Groq models

In [1]:
GROQ_API_KEY = os.environ.get('GROQ_API_KEY')
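
The Groq cells that follow differ only in the model name, so the same request can be wrapped in a small loop. This is a sketch under assumptions: `build_messages` and `generate_for_models` are hypothetical helpers, the system prompt is copied verbatim from this notebook, and the client is passed in as an argument so the code itself makes no network call. It relies only on the `client.chat.completions.create(...)` call and the `choices[0].message.content` access already used in this notebook.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant specialising in generate Quantum Espresso "
    "input file for single point energy calculations."
)

def build_messages(compound):
    """Build the system/user message pair used throughout this notebook."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Please, generate an input file for {compound}"},
    ]

def generate_for_models(client, models, compound):
    """Return {model_name: reply_text}, sending the same prompt to each model."""
    replies = {}
    for model in models:
        completion = client.chat.completions.create(
            messages=build_messages(compound),
            model=model,
        )
        replies[model] = completion.choices[0].message.content
    return replies

# Example call (needs the groq package and a valid key, so it is left commented out):
# from groq import Groq
# replies = generate_for_models(Groq(api_key=GROQ_API_KEY),
#                               ["llama-3.1-70b-versatile", "gemma2-9b-it"], "CaO")
```

Model names on hosted services change over time, so the list passed in should be checked against the provider's current catalogue.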

In [7]:
import os

from groq import Groq

client = Groq(
    # This is the default and can be omitted
    api_key=GROQ_API_KEY,
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant specialising in generate Quantum Espresso input file for single point energy calculations."
        },
        {
            "role": "user",
            # Use the compound chosen at the top of the notebook ("Ca O" -> "CaO").
            "content": "Please, generate an input file for " + compound.replace(" ", ""),
        }
    ],
    model="llama-3.1-70b-versatile",
)

print(chat_completion.choices[0].message.content)

Here's a sample Quantum Espresso input file for a single-point energy calculation of CaO:

```fortran
&CONTROL
  calculation = 'scf',
  title = 'CaO single point energy calculation',
  verbosity = 'high'
  tprnfor = .true.
  tstress = .true.
  tefield = .false.
 restart_mode = 'from_scratch',
  prefix = 'CaO',
  pseudo_dir = './',
  outdir = './',
  etot_conv_thr = 1.0d-6,
  forc_conv_thr = 1.0d-4,
/

&SYSTEM
  ibrav = 2,            ! Face-centered cubic (FCC)
  A = 4.866,            ! lattice constant
  nat = 2,               ! Number of atoms in the primitive cell
  ntyp = 2,              ! Number of atom types
  nbnd = 12,            ! Number of electronic bands
  ecutwfc = 30.0,       ! kinetic energy cutoff for wave functions
  ecutrho = 200.0,      ! kinetic energy cutoff for charge density
  occupations = 'fixed',
  smearing = 'mp',
  degauss = 0.005,
  ecfixed = 0.0000,
/

&ELECTRONS
  mixing_beta = 0.7,    ! mixing factor
  conv_thr = 1.0d-6,    ! convergence threshold for sel

In [12]:
import os

from groq import Groq

client = Groq(
    # This is the default and can be omitted
    api_key=GROQ_API_KEY,
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant specialising in generate Quantum Espresso input file for single point energy calculations."
        },
        {
            "role": "user",
            # Use the compound chosen at the top of the notebook ("Ca O" -> "CaO").
            "content": "Please, generate an input file for " + compound.replace(" ", ""),
        }
    ],
    model="gemma2-9b-it",
)

print(chat_completion.choices[0].message.content)

```
&CONTROL
  calculation = 'scf'
  pseudo_dir = '/path/to/your/pseudo/directory' 
  outdir = 'CaO_output'  
  verbose = 'high'
/

&SYSTEM
  ibrav = 0     ! Rock salt structure
  a = 4.18       ! Lattice constant (angstrom)
  
  nat = 2        ! Number of atoms per cell
  ntyp = 2

  &ATOMIC_SPECIES
   Ca  = 40.08   ! Atomic mass of Ca
    O  = 16.00   ! Atomic mass of O
  /

  &CELL
   ! Define atomic positions (in fractional coordinates)
    0.0, 0.0, 0.0 | Ca
    0.5, 0.5, 0.5 | O

  /
/



&ELECTRONS
  ecutrho = 500.0   ! Cut-off energy (Ry)
  conv_thr = 1.0e-6  ! Convergence threshold
  mixing_beta = 0.2    ! Mixing parameter
/
```

**Explanation:**

- **&CONTROL:** This section controls overall calculation parameters:
    - **calculation = 'scf'**: Indicates a self-consistent field (SCF) calculation for single-point energy.
    - **pseudo_dir**: Specify the directory containing pseudopotentials. 
    - **outdir**: Set the output directory for calculation results.
    - **verbose