# A note on how to develop the PFAS dataset 

Xiping Gong (xipinggong@uga.edu), [Jack Huang's Lab](https://site.caes.uga.edu/huanglab/)

Date: 4/23/2025

# Introduction

This notebook demonstrates how a PFAS dataset can be built and used to benchmark the docking performance of computational tools such as AlphaFold 3 and AutoDock Vina. The dataset is collected from the RCSB PDB, so the native (experimentally determined) structures of the protein-PFAS systems can serve as references for measuring the success rate of each docking tool, for example by calculating the RMSD between a predicted structure and its native structure.
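As a minimal sketch of the RMSD comparison mentioned above, the snippet below computes the root-mean-square deviation between two lists of ligand atom coordinates. It assumes that the native and predicted structures are already in the same reference frame and that the atoms are matched one-to-one in the same order. The function name `ligand_rmsd` and the coordinates are illustrative only and are not part of the repository.

```python
import math

def ligand_rmsd(native_xyz, predicted_xyz):
    """RMSD between matched atoms, in the units of the input coordinates."""
    if len(native_xyz) != len(predicted_xyz) or not native_xyz:
        raise ValueError("coordinate lists must be non-empty and equally long")
    # Sum of squared distances over all matched atom pairs
    squared = sum(
        (a - b) ** 2
        for atom_n, atom_p in zip(native_xyz, predicted_xyz)
        for a, b in zip(atom_n, atom_p)
    )
    return math.sqrt(squared / len(native_xyz))

# Hypothetical coordinates for a three-atom ligand fragment
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
print(f"RMSD = {ligand_rmsd(native, predicted):.2f}")
```

A threshold such as 2 Å is commonly used to call a docked pose successful, although the cutoff used for this benchmark is not stated in this note.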


# Methodology

We used the OECD (2021) definition of PFAS to collect the protein-PFAS systems (see Appendix). 

```
“PFASs are defined as fluorinated substances that contain at least one fully fluorinated methyl or methylene carbon atom (without any H/Cl/Br/I atom attached to it), i.e. with a few noted exceptions, any chemical with at least a perfluorinated methyl group (−CF3) or a perfluorinated methylene group (−CF2−) is a PFAS.”
```
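To make the definition concrete, here is a small sketch that applies the same rule to a toy molecular graph. It is not the RDKit SMARTS matching used later in this note: the graph representation, the function names, and the two example molecules are assumptions made for illustration, and the "few noted exceptions" in the OECD text are not handled. With every hydrogen listed explicitly, a carbon qualifies when it has four neighbours, at least two of them fluorine, and none of them H, Cl, Br, or I.

```python
# Toy molecular graph: atoms are element symbols, bonds are index pairs.
# Hydrogens are listed explicitly, so a carbon with four neighbours is saturated.
EXCLUDED = {"H", "Cl", "Br", "I"}

def neighbors(atoms, bonds, idx):
    """Element symbols of the atoms bonded to atom `idx`."""
    out = []
    for i, j in bonds:
        if i == idx:
            out.append(atoms[j])
        elif j == idx:
            out.append(atoms[i])
    return out

def is_oecd_pfas(atoms, bonds):
    """True if any carbon is a fully fluorinated methyl or methylene carbon."""
    for idx, element in enumerate(atoms):
        if element != "C":
            continue
        nbrs = neighbors(atoms, bonds, idx)
        # Four neighbours, at least two F, and no H/Cl/Br/I attached
        if len(nbrs) == 4 and nbrs.count("F") >= 2 and EXCLUDED.isdisjoint(nbrs):
            return True
    return False

# Trifluoroacetic acid, CF3-COOH: contains a -CF3 group
tfa_atoms = ["C", "F", "F", "F", "C", "O", "O", "H"]
tfa_bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6), (6, 7)]

# Difluoromethane, CH2F2: the fluorinated carbon still carries hydrogens
dfm_atoms = ["C", "F", "F", "H", "H"]
dfm_bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]

print(is_oecd_pfas(tfa_atoms, tfa_bonds), is_oecd_pfas(dfm_atoms, dfm_bonds))
```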

# Tutorial

In this section, we show how to collect the PFAS dataset from the RCSB PDB.

## Download this repository
First, download this repository to your local computer, so that the PFAS dataset can be obtained efficiently by running the provided scripts.

```bash
# Download this repository to the local computer.
$ git clone https://github.com/XipingGong/pfas_docking.git
$ cd pfas_docking # go to the downloaded folder
```

## Download the PFAS dataset from the existing data files

We created two PFAS datasets, the "before_set" and the "after_set". The "before_set" includes the PDB entries released before the cutoff date (9/30/2021), while the "after_set" includes those released after it.
Their data files can be found in the "data" folder. Here, we show how to obtain the native structures with the downloaded scripts.

**Algorithm**

Given the data files in the "data" folder, we can use the "get_native_pdb.sh" script to obtain a PDB file from its "PDBID" and "LigandID". A second script ("run.sh", see below) then handles batch processing.

```bash
# Show the "before_set" and "after_set"
$ ls -lrt tutorials/pfas_dataset/
# >> before_set.dat
# >> after_set.dat

# Download the PDB files for both datasets, taking 7FEU_4I6 as an example
$ mkdir -p test/dock_dir/7FEU_4I6 # Create a test folder
$ cd test/dock_dir/7FEU_4I6 # go to this created folder
#
# Please modify this "../../../scripts/get_native_pdb.sh" file.
#
$ bash ../../../scripts/get_native_pdb.sh --pdbid 7FEU --ligandid 4I6 --output_pdb 7FEU_4I6.pdb # It will create two pdb files
$ ls -ltr 
#>> 7FEU_4I6_ori.pdb
#>> 7FEU_4I6.pdb # this only includes two chains: chain A for protein and chain B for ligand
```
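A quick way to check that a cleaned file really has the protein in chain A and the ligand in chain B is to count the coordinate records per chain and residue name. The sketch below does this with the fixed-width PDB columns (residue name in columns 18-20, chain ID in column 22). The helper names and the four example records are made up for illustration and are not output of "get_native_pdb.sh".

```python
from collections import Counter

def pdb_line(record, serial, name, resname, chain, resseq, x, y, z):
    """Format one fixed-width PDB coordinate record (columns 1-54)."""
    return (f"{record:<6}{serial:>5}  {name:<3} {resname:>3} {chain}"
            f"{resseq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")

def summarize_chains(pdb_text):
    """Count ATOM/HETATM records per (chain ID, residue name) pair."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            resname = line[17:20].strip()  # columns 18-20
            chain = line[21]               # column 22
            counts[(chain, resname)] += 1
    return counts

# Hypothetical two-chain file: protein atoms in chain A, ligand 4I6 in chain B
example = "\n".join([
    pdb_line("ATOM", 1, "N", "MET", "A", 1, 11.104, 6.134, -6.504),
    pdb_line("ATOM", 2, "CA", "MET", "A", 1, 11.639, 6.071, -5.147),
    pdb_line("HETATM", 3, "C1", "4I6", "B", 1, 20.000, 10.000, 5.000),
    pdb_line("HETATM", 4, "F1", "4I6", "B", 1, 21.200, 10.500, 5.300),
])
summary = summarize_chains(example)
print(summary)
```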

For batch processing, we can write a bash script and run it.

```bash
$ cd ../ # go to the "dock_dir" directory, where we will do the same for all samples
$ bash run.sh | tee run.out # run a bash script to download all PDB files
$ ls -lrt */ # see all downloaded PDB files
# >> 7FEU_4I6/:
# >> 7FEU_4I6_ori.pdb
# >> 7FEU_4I6.pdb
# >> 
# >> 7RHN_QNG/:
# >> 7RHN_QNG_ori.pdb
# >> 7RHN_QNG.pdb
# >> ...
```

The "run.sh" script is shown below:

```bash
#!/bin/bash

scripts_dir='/home/xg69107/program/pfas_docking/scripts' # change this to your local path
input_file="../../tutorials/pfas_dataset/*_set.dat" # the pattern matches both datasets

cat $input_file | awk '{print $1}' | grep -v '^#' > tmp.dat
head tmp.dat
# >> 7FEU_4I6
# >> 7RHN_QNG
# >> 7RJ2_QNG

while IFS=_ read -r pdbid ligandid; do
    # Skip empty lines and lines starting with #
    [[ -z "$pdbid" || "$pdbid" == \#* ]] && continue

    echo "Processing $pdbid and $ligandid..."
    mkdir -p "${pdbid}_${ligandid}"
    cd "${pdbid}_${ligandid}" || continue # skip this sample if the folder cannot be entered
    bash "$scripts_dir/get_native_pdb.sh" \
        --pdbid "$pdbid" \
        --ligandid "$ligandid"
    cd ../ # go back to the main directory
done < "tmp.dat"

rm -f tmp.dat # remove the temporary data file
```
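The parsing step of "run.sh" can also be sketched in Python, which makes it easy to test before anything is downloaded. The snippet below reads the first column of a data file, skips blank and comment lines, splits each "PDBID_LigandID" entry at the first underscore, and builds the same command that the loop above runs. The function names, the example text, and the "/path/to/scripts" placeholder are assumptions for illustration; only the `--pdbid` and `--ligandid` options come from the script above.

```python
def parse_samples(text):
    """Return (pdbid, ligandid) pairs from the first column of a data file."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        sample = line.split()[0]  # first whitespace-separated column
        pdbid, _, ligandid = sample.partition("_")
        if pdbid and ligandid:
            samples.append((pdbid, ligandid))
    return samples

def build_command(scripts_dir, pdbid, ligandid):
    """Argument list matching the call made inside the loop of run.sh."""
    return ["bash", f"{scripts_dir}/get_native_pdb.sh",
            "--pdbid", pdbid, "--ligandid", ligandid]

# Hypothetical data file content
example_text = """# PDBID_LigandID
7FEU_4I6
7RHN_QNG

7RJ2_QNG
"""
for pdbid, ligandid in parse_samples(example_text):
    print(" ".join(build_command("/path/to/scripts", pdbid, ligandid)))
```

To actually run a command, each list could be passed to `subprocess.run(cmd, cwd=sample_dir, check=True)`, where `sample_dir` is the per-sample folder.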

## Fetch the PFAS dataset from scratch

The RCSB PDB is continuously updated, so it is useful to be able to obtain the protein-PFAS systems directly from the PDB. The following shows how to build the dataset from scratch. The goal is to find the PDBID_LigandID samples that contain both a protein and a PFAS ligand.

**Algorithm**

+ **Step 1: Download the ligand info from the "components.cif"**

This file can be downloaded from the following link:
```bash
https://files.wwpdb.org/pub/pdb/data/monomers/components.cif
```
This file describes every ligand in the PDB.
We can extract the ligands of interest from it, such as the F-containing ligands and PFAS, to obtain the key ligand information directly.

```bash
$ cd test # go to the test directory
$ wget https://files.wwpdb.org/pub/pdb/data/monomers/components.cif # download this cif file from the wwPDB file server
```
+ **Step 2: Filter out the fluorine-containing ligand info from the components.cif**

```bash
$ python ../scripts/get_ligand_id_from_ccdcif.py components.cif 1 | tee ligand_id_from_ccdcif.log # the output is also written to the log file; "1" is the minimum number of F atoms
# $ python ../scripts/get_ligand_id_from_ccdcif.py -h # check the usage of a python script
# Note: every Python script supports the -h option for checking its usage.
#
$ head -3 ligand_id_from_ccdcif.log # as of 2/14/2025
>> # Ligand IDs - Found 6908 ligands with at least 1 fluorine atoms:
>> Ligand ID: W1Z ; SMILES: OC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 23 ;
>> Ligand ID: W10 ; SMILES: OC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F;  Fluorine Count: 19 ;
```
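For readers who want to see the idea behind this step without the repository script, the sketch below counts fluorine atoms per component from the `_chem_comp.formula` item of each `data_` block. It is only an assumption about one possible approach, not a description of how "get_ligand_id_from_ccdcif.py" works, and the CIF excerpt is a hand-written, heavily shortened imitation of "components.cif".

```python
import re

def fluorine_counts(cif_text):
    """Map component ID -> number of F atoms, read from _chem_comp.formula."""
    counts = {}
    current_id = None
    for line in cif_text.splitlines():
        if line.startswith("data_"):
            current_id = line[len("data_"):].strip()
        elif line.startswith("_chem_comp.formula ") and current_id:
            formula = line.split(None, 1)[1].strip().strip('"\'')
            n_f = 0
            for token in formula.split():
                # Match "F" or "F17", but not "Fe" or other elements
                m = re.fullmatch(r"F(\d*)", token)
                if m:
                    n_f += int(m.group(1) or 1)
            counts[current_id] = n_f
    return counts

# Hand-written excerpt in the style of components.cif (not the real file)
example_cif = """data_4I6
_chem_comp.id              4I6
_chem_comp.formula         "C9 H F17 O2"
_chem_comp.formula_weight  464.08
data_HEM
_chem_comp.id              HEM
_chem_comp.formula         "C34 H32 Fe N4 O4"
"""
counts = fluorine_counts(example_cif)
fluorinated = [cid for cid, n in counts.items() if n >= 1]
print(counts, fluorinated)
```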

+ **Step 3: Identify PFAS ligands from the obtained data file: "ligand_id_from_ccdcif.log"**

```bash
# Select the ligands that match the OECD definition of PFAS
# OECD definition of PFAS: '[$([#6X4]([!#1;!Cl;!Br;!I])(F)(F)(F)),$([#6X4](F)(F)([!#1;!Cl;!Br;!I])([!#1;!Cl;!Br;!I]))]'
$ python ../scripts/filter_ligands_by_smiles.py ligand_id_from_ccdcif.log --pattern '[$([#6X4]([!#1;!Cl;!Br;!I])(F)(F)(F)),$([#6X4](F)(F)([!#1;!Cl;!Br;!I])([!#1;!Cl;!Br;!I]))]' | tee pfas_oecd_no_pdbid.log 

# Add PDBID info into the "pfas_oecd_no_pdbid.log"
$ awk '/^#/ {print} /Ligand ID:/ {print $3}' pfas_oecd_no_pdbid.log  > x1.log # store the input arguments
$ python ../scripts/get_pdbid_by_ligand_id.py x1.log > x2.log # computation-intensive, ~2 hours
$ grep 'Ligand' x2.log > x1.log; paste pfas_oecd_no_pdbid.log  x1.log > pfas_oecd.log # Save the results into a final log file
# >> Ligand ID: W1Z ; SMILES: OC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 23 ;      Ligand ID: W1Z ; PDBID: 5B2W ;
# >> Ligand ID: W10 ; SMILES: OC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 19 ;    Ligand ID: W10 ; PDBID: 5B2Y ;
# >> Ligand ID: 4I6 ; SMILES: OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 17 ; Ligand ID: 4I6 ; PDBID: 7FEU ;
$ rm -f x1.log x2.log # remove the temporary files
# Now we have all PDB IDs whose entries contain an OECD-defined PFAS ligand.

# Clean up the ligand and PDB ID data file
$ python ../scripts/clean_pfas_oecd_log.py pfas_oecd.log | tee pdbid_ligandid_pfas_oecd.dat
$ head pdbid_ligandid_pfas_oecd.dat # show the head of this data file
# >> 5B2W_W1Z
# >> 5B2Y_W10
# >> 7FEU_4I6
# This data file can be directly used for the batch processing we mentioned above.
```
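The clean-up at the end of Step 3 turns each combined log line into a "PDBID_LigandID" sample name. A minimal sketch of that transformation is shown below. It assumes one PDB ID per line, as in the example output above; how "clean_pfas_oecd_log.py" treats other cases is not shown in this note. The second ligand in the example ("ABC") is hypothetical and has no PDB ID, so it is skipped.

```python
import re

# Ligand ID at the start of the line, PDB ID in the pasted second half
LINE_RE = re.compile(r"^Ligand ID:\s*(\S+)\s*;.*PDBID:\s*(\S+)\s*;")

def to_sample_ids(log_text):
    """Convert combined log lines into PDBID_LigandID strings."""
    ids = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            ligand_id, pdbid = m.groups()
            ids.append(f"{pdbid}_{ligand_id}")
    return ids

example_log = """# header line
Ligand ID: 4I6 ; SMILES: OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 17 ; Ligand ID: 4I6 ; PDBID: 7FEU ;
Ligand ID: ABC ; SMILES: CC(F)(F)F ; Fluorine Count: 3 ;
"""
sample_ids = to_sample_ids(example_log)
print(sample_ids)
```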


# Appendix

## Definition of PFAS

**Table 1**: Definitions of PFAS included in analysis

| Definition                        | Formal definition verbatim from organization                                                                                                                                                                                                                                                                                                     | Informal interpretation                                                                                                                                                                                    |
|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Buck et al. (2011)**            | “Aliphatic substances containing one or more C atoms on which all the H substituents present in the nonfluorinated analogues from which they are notionally derived have been replaced by F atoms, in such a manner that PFASs contain the perfluoroalkyl moiety CnF2n+1−.”                                                                | Compounds that contain at least one carbon atom that is bound to three fluorine atoms (-CF3). The structure must be saturated with no double or triple bonds (the only definition with this restriction). |
| **OECD (2018)**                   | “PFASs, including perfluorocarbons, that contain a perfluoroalkyl moiety with three or more carbons (i.e. –CnF2n−, n ≥ 3) or a perfluoroalkylether moiety with two or more carbons (i.e. –CnF2nOCmF2m−, n and m ≥ 1).”                                                                                    | Compounds with at least three carbons on which all the hydrogens have been replaced by a fluorine atom, so as to form a three-carbon unit with subunits (-CF2-). Also includes oxygen-linked compounds. |
| **OECD (2021)**                   | “PFASs are defined as fluorinated substances that contain at least one fully fluorinated methyl or methylene carbon atom (without any H/Cl/Br/I atom attached to it), i.e. with a few noted exceptions, any chemical with at least a perfluorinated methyl group (−CF3) or a perfluorinated methylene group (−CF2−) is a PFAS.” | Compounds containing at least one carbon that has three fluorine atoms attached (-CF3). Also includes compounds where a carbon has at least two fluorine atoms (-CF2-).                                   |
| **Glüge et al. (2020)**           | In addition to substances containing CnF2n+1 where n ≥ 1, it also “includes (i) substances where a perfluorocarbon chain is connected with functional groups on both ends, (ii) aromatic substances that have perfluoroalkyl moieties on the side chains, and (iii) fluorinated cycloaliphatic substances.”                                               | Excludes compounds with a single -CF2-, but includes compounds with two or more -CF2- or -CF3- groups. Also includes compounds with carbon-fluorine units linked by an oxygen atom (-CF2OCF2-).          |
| **TURA (2021a)**                  | “Those PFAS that contain a perfluoroalkyl moiety with three or more carbons (e.g., –CnF2n−, n ≥ 3; or CF3−CnF2n−, n ≥ 2) or a perfluoroalkylether moiety with two or more carbons (–CnF2nOCmF2m− or –CnF2nOCmF2m−, n and m ≥ 1).”                                                                           | Key to this definition is that the compound must contain a string of at least three carbon atoms, each containing two or more fluorine atoms. Includes various chemical structures.                      |
| **TURA (2021b)**                  | “Certain PFAS not otherwise listed includes those PFAS that contain a perfluoroalkyl moiety with three or more carbons (e.g., –CnF2n−, n ≥ 3; or CF3−CnF2n−, n ≥ 2) or a perfluoroalkylether moiety with two or more carbons, where the carbon structures shown the dash (-) is not a bond to a hydrogen and may represent a straight or branched structure.” | Clarifies that in TURA 2021a the (-) does not include a bond to hydrogen.                                                                                        |
| **U.S. EPA OPPT (2021)**          | “… a structure that contains the unit R-CF2-CF(R’)(R’’), where R, R’, and R’’ do not equal ‘H’ and the carbon-carbon bond is saturated (note: branching, heteroatoms, and cyclic structures are included).”                                                                                                   | Compounds that contain a string of two adjacent carbon atoms, one containing at least two fluorine atoms and the other containing at least one fluorine atom, neither carbon bound to hydrogen.          |
| **≥1 Fully Fluorinated Carbon**<sup>a</sup> | Organic chemicals containing “at least one fully fluorinated carbon atom.” | A compound with at least one carbon where all the hydrogen atoms have been replaced by fluorine atoms. The number of bonds on the carbon is not specified. |
| **All Organofluorine**<sup>b</sup> | All organic compounds containing at least one fluorine atom should be classified as PFAS. | Any compound whose structure contains a carbon attached to a fluorine atom. |

**Footnotes:**  
- **a** Authorities whose legislation defines PFAS as a class of fluorinated organic chemicals containing at least one fully fluorinated carbon atom (WA, VT, ME, CA, NDAA).  
- **b** NGOs that advocate for broader definitions of PFAS to include all organofluorines.

**Reference**. Hammel, E., Webster, T. F., Gurney, R., & Heiger-Bernays, W. (2022). Implications of PFAS definitions using fluorinated pharmaceuticals. iScience, 25(4).


## PFAS ligand visualization

### Define a Python function: ligand_display

```python
# Python function: ligand_display(log_file, smarts_str, num_ligands_display=200)
#
import re
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw
from IPython.display import display

def ligand_display(log_file, smarts_str, num_ligands_display=200):
    """
    Reads a .log file to parse ligand IDs and SMILES, keeps the ligands
    that match the specified SMARTS pattern (e.g., fully fluorinated
    methyl/methylene units), and displays them in batches.

    Parameters
    ----------
    log_file : str
        Path to the log file that contains lines with ligand ID and SMILES.
    smarts_str : str
        A SMARTS pattern used to detect the desired substructure.
        Example:
        '[$([#6X4]([!#1;!Cl;!Br;!I])(F)(F)(F)),$([#6X4](F)(F)([!#1;!Cl;!Br;!I])([!#1;!Cl;!Br;!I]))]'
        which matches either:
        - a CF3 carbon whose fourth substituent is not H/Cl/Br/I
        - a CF2 carbon whose two remaining substituents are not H/Cl/Br/I
    num_ligands_display : int
        Maximum number of ligands to display (in batches of 10).

    Returns
    -------
    pandas.DataFrame
        The rows of the ligands that match the pattern.
    """

    # Read the file lines
    with open(log_file, "r") as file:
        data = file.readlines()

    # Extract Ligand ID and SMILES using regex
    ligands = []
    for line in data:
        # Looks for lines matching:
        #  Ligand ID: X123 ; SMILES: CCCC... ; Fluorine Count: ...
        match = re.search(r'Ligand ID:\s*(\S+)\s*;\s*SMILES:\s*([^;]+)', line)
        if match:
            ligand_id = match.group(1).strip()
            smiles = match.group(2).strip()
            ligands.append((ligand_id, smiles))

    # Convert to DataFrame
    df = pd.DataFrame(ligands, columns=["Ligand ID", "SMILES"])

    # Convert SMILES to RDKit Molecule objects (None for invalid SMILES)
    df["Molecule"] = df["SMILES"].apply(Chem.MolFromSmiles)

    # Drop invalid molecules
    df = df.dropna(subset=["Molecule"])

    # Convert the SMARTS string into an RDKit pattern
    pfas_pattern = Chem.MolFromSmarts(smarts_str)
    if pfas_pattern is None:
        raise ValueError("Invalid SMARTS pattern. Check syntax.")

    # Check if the molecule has the specified substructure
    def has_fully_fluorinated_unit(mol):
        return mol.HasSubstructMatch(pfas_pattern)

    # Apply filtering to identify ligands that match
    df["Has_Fully_Fluorinated_Unit"] = df["Molecule"].apply(has_fully_fluorinated_unit)
    df_fully_fluorinated = df[df["Has_Fully_Fluorinated_Unit"] == True]

    # Print summary
    print(f"Number of ligands matching the pattern {smarts_str}: {len(df_fully_fluorinated)} (Total = {len(df)})")
    print("Ligand IDs that match this pattern:")
    print(df_fully_fluorinated["Ligand ID"].tolist())

    # Display molecules in batches, never showing more than num_ligands_display
    batch_size = 10
    df_shown = df_fully_fluorinated.head(num_ligands_display)
    for i in range(0, len(df_shown), batch_size):
        batch_df = df_shown.iloc[i : i + batch_size]
        print(f"Showing molecules {i + 1} to {i + len(batch_df)} of {len(df_fully_fluorinated)}:")
        display(
            Draw.MolsToGridImage(
                batch_df["Molecule"].tolist(),
                molsPerRow=5,
                legends=batch_df["Ligand ID"].tolist(),
                subImgSize=(400, 400)
            )
        )

    return df_fully_fluorinated
```

### Display the PFAS ligands

```python
# PFAS ligand visualization
log_file = 'pfas_oecd.log'
# $ head pfas_oecd.log
# # Ligand IDs - Found 6936 ligands with at least 1 fluorine atoms:       # Ligand IDs - Found 6936 ligands with at least 1 fluorine atoms:
# Ligand ID: W1Z ; SMILES: OC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 23 ;     Ligand ID: W1Z ; PDBID: 5B2W ;
# Ligand ID: W10 ; SMILES: OC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 19 ;   Ligand ID: W10 ; PDBID: 5B2Y ;
# Ligand ID: 4I6 ; SMILES: OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F ; Fluorine Count: 17 ; Ligand ID: 4I6 ; PDBID: 7FEU ;

#ligand_smarts_pattern = 'F' # All ligands (F-containing)
#ligand_smarts_pattern = "FC(F)F" # -CF3 ligands
#ligand_smarts_pattern = 'Oc1cccc(F)c1'  # phenolic fluorine (-Ph-F) ligands
#ligand_smarts_pattern = "[CX4](F)-[CX4](F)-[CX4](F)(F)(F)"  # -(CF2)n-CF3 for n >= 1 ligands
ligand_smarts_pattern = '[CX4]([!#1])(F)(F)-[CX4](F)([!#1])([!#1])' # R-CF2-CF(R')(R") where R, R', R" ≠ H; US EPA OPPT definition
#ligand_smarts_pattern = '[$([#6X4]([!#1;!Cl;!Br;!I])(F)(F)(F)),$([#6X4](F)(F)([!#1;!Cl;!Br;!I])([!#1;!Cl;!Br;!I]))]' # CF3 or CF2 pattern with no H/Cl/Br/I substituents; OECD definition

df = ligand_display(log_file, ligand_smarts_pattern, num_ligands_display=50)
```


### Get the release dates of the PDB IDs

```bash
$ awk -F'_' '{print $1}' pdbid_ligandid_pfas_oecd.dat  > x1.log # store the input arguments
$ python ../scripts/get_date_for_pdbids.py x1.log 
# >> PDBID: 5B2W ; Deposited date: 2016-02-03 ; Initially released date: 2017-02-08 ;
# >> PDBID: 5B2Y ; Deposited date: 2016-02-07 ; Initially released date: 2017-02-08 ;
# >> ...

# To see what these ligands look like, please see the "PFAS ligand visualization" section in the Appendix.

# Note that a downloaded PDB file can contain multiple proteins or ligands, so we still need to identify which PDB files are suitable for molecular docking benchmarks.

```
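Once the dates are available, the samples can be divided at the 9/30/2021 cutoff used for the "before_set" and "after_set". The sketch below parses lines in the output format shown above and splits the PDB IDs by date. It assumes that the initial release date (rather than the deposition date) is the one compared with the cutoff and that an entry dated exactly on the cutoff goes into the "before" group; neither choice is confirmed by this note. The "XXXX" entry is a hypothetical placeholder.

```python
import re
from datetime import date

CUTOFF = date(2021, 9, 30)
LINE_RE = re.compile(
    r"PDBID:\s*(\S+)\s*;.*Initially released date:\s*(\d{4}-\d{2}-\d{2})"
)

def split_by_release(lines, cutoff=CUTOFF):
    """Return (before, after) lists of PDB IDs split at the cutoff date."""
    before, after = [], []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that do not carry a release date
        pdbid = m.group(1)
        released = date.fromisoformat(m.group(2))
        (after if released > cutoff else before).append(pdbid)
    return before, after

example_lines = [
    "PDBID: 5B2W ; Deposited date: 2016-02-03 ; Initially released date: 2017-02-08 ;",
    "PDBID: 5B2Y ; Deposited date: 2016-02-07 ; Initially released date: 2017-02-08 ;",
    # Hypothetical entry released after the cutoff
    "PDBID: XXXX ; Deposited date: 2022-01-10 ; Initially released date: 2022-06-15 ;",
]
before_ids, after_ids = split_by_release(example_lines)
print(before_ids, after_ids)
```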
