## Construction of the Structured Antibiotic Resistance Database (SARD): Class-Level Classification

This tutorial walks through the construction of the Structured Antibiotic Resistance Database (SARD) step by step.
The same procedure can serve as a template for building other datasets.

Before starting, place the following files in the './Tutorials/Data' directory. They can be downloaded from https://drive.google.com/drive/folders/1ZM0p5YHCg2FBTQwBCHyl11L-0fzNEk_A?usp=drive_link :

- **embARG-Full-V1.0-2021.7.csv**: Contains the mapping of IDs to categories in the Expanded Antibiotic Resistance Genes (ARGs) dataset.

- **embARG-Full-V1.0-2021.7.fasta**: Stores IDs and their corresponding amino acid sequences in the Expanded ARGs dataset.

- **uniprot-reviewed+NOT+KW-0046-filtered-full.fasta**: Negative dataset built from Swiss-Prot by filtering out sequences with 100% sequence identity to the ARGs dataset.

- **uniprot-reviewed+NOT+KW-0046-filtered-id80cov80.fasta**: Negative dataset built from Swiss-Prot by filtering out sequences with 80% sequence identity to the ARGs dataset.

- **uniprot-reviewed+NOT+KW-0046-filtered-id50cov80.fasta**: Negative dataset built from Swiss-Prot by filtering out sequences with 50% sequence identity to the ARGs dataset.

- **uniprot-reviewed+NOT+KW-0046-filtered-id30cov80.fasta**: Negative dataset built from Swiss-Prot by filtering out sequences with 30% sequence identity to the ARGs dataset.
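
Before running the cells that follow, it can help to confirm that all six files are in place. The snippet below is a minimal sketch and not part of the original pipeline: the helper name `missing_files` is hypothetical, and the default directory `Data` assumes the notebook is run from the `Tutorials` folder.

```python
from pathlib import Path

# File names exactly as listed above.
REQUIRED_FILES = [
    "embARG-Full-V1.0-2021.7.csv",
    "embARG-Full-V1.0-2021.7.fasta",
    "uniprot-reviewed+NOT+KW-0046-filtered-full.fasta",
    "uniprot-reviewed+NOT+KW-0046-filtered-id80cov80.fasta",
    "uniprot-reviewed+NOT+KW-0046-filtered-id50cov80.fasta",
    "uniprot-reviewed+NOT+KW-0046-filtered-id30cov80.fasta",
]


def missing_files(data_dir="Data", names=REQUIRED_FILES):
    """Return the required file names that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


absent = missing_files()
print("All files found." if not absent else f"Missing: {absent}")
```

An empty list means every input is available; otherwise the returned names show which downloads are still needed.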


In [2]:
import gc
import os
import numpy as np
from sklearn.model_selection import StratifiedKFold
from typing import Union
from pathlib import Path
from Bio import SeqIO
from sklearn.model_selection import train_test_split

## (1) Expanded ARGs dataset
Load the Expanded Antibiotic Resistance Genes (ARGs) dataset and split it 8:2 into training and test sets, stratified by class.
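
To make the idea of a stratified 8:2 split concrete, here is a minimal pure-Python sketch. The notebook itself relies on scikit-learn's `train_test_split` with the `stratify` argument; the function `stratified_split` and the toy gene IDs below are illustrative assumptions, and the rounding and shuffling will not match scikit-learn exactly.

```python
import random
from collections import defaultdict


def stratified_split(ids, labels, train_frac=0.8, seed=2021):
    """Split ids so that each class keeps about train_frac of its members in the training set."""
    by_label = defaultdict(list)
    for sample_id, label in zip(ids, labels):
        by_label[label].append(sample_id)

    rng = random.Random(seed)
    train, test = [], []
    for label, members in by_label.items():
        rng.shuffle(members)                      # shuffle within the class
        cut = round(len(members) * train_frac)    # per-class training size
        train.extend((m, label) for m in members[:cut])
        test.extend((m, label) for m in members[cut:])
    return train, test


# Toy data: 10 sequences of one class and 5 of another (hypothetical IDs).
ids = [f"gene{i}" for i in range(15)]
labels = ["Betalactams"] * 10 + ["MLS"] * 5
train, test = stratified_split(ids, labels)
print(len(train), len(test))
```

Because the split is done class by class, a rare class cannot end up entirely in one partition, which is the property the stratified split is used for here.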

In [3]:
os.chdir("../Tutorials")
Id_to_class = {}
Id_to_class_little = {}
with open("Data/embARG-Full-V1.0-2021.7.csv") as file:
    for i,line in enumerate(file):
        if i == 0:
            pass
        else:
            content = line.strip().split(",")
            # The Elfamycins category has only one sequence and cannot be trained and verified.
            # This category is not considered in the 19 categories
            if content[7] != 'Elfamycins':
                Id_to_class[content[3]] = content[7]
                Id_to_class_little[content[3]] = content[12]
Id_to_seq = {}
for seq_record in SeqIO.parse("Data/embARG-Full-V1.0-2021.7.fasta", "fasta"):
    Id_to_seq[seq_record.id] = seq_record.seq

all_data = list(Id_to_class.keys())
all_label = list(Id_to_class.values())
data_train,data_test,label_train,label_test = train_test_split(all_data,all_label,train_size=0.8,random_state=2021,stratify=all_label)

train_number = 0
with open("Data/positive_data_train.txt","w") as write:
    for i in data_train:
        write.write(i + "\t" + str(Id_to_seq[i]) + "\t" + Id_to_class[i] + "\t" + Id_to_class_little[i] +"\n")
        train_number += 1
print(f"Training numbers: {train_number}")

test_number = 0
with open("Data/positive_data_test.txt","w") as write:
    for i in data_test:
        write.write(i + "\t" + str(Id_to_seq[i]) + "\t" + Id_to_class[i] + "\t" + Id_to_class_little[i] +"\n")
        test_number += 1
print(f"Test numbers: {test_number}")

# Output: per-class counts for the test, training, and full label sets
category = {}
num = 0
for item in label_test:
    num += 1
    if item not in category:
        category[item] = 1
    else:
        category[item] += 1
print(category)
print(num)

category = {}
num = 0
for item in label_train:
    num += 1
    if item not in category:
        category[item] = 1
    else:
        category[item] += 1
print(category)
print(num)

category = {}
num = 0
for item in all_label:
    num += 1
    if item not in category:
        category[item] = 1
    else:
        category[item] += 1
print(category)
print(num)



Training numbers: 49498
Test numbers: 12375
{'Multi-drug resistance': 7073, 'Peptide': 655, 'Mupirocin': 115, 'MLS': 463, 'Betalactams': 1182, 'Aminoglycosides': 697, 'Rifampin': 518, 'Fosfomycin': 63, 'Tetracyclines': 410, 'Nucleosides': 52, 'Fluoroquinolones': 134, 'Glycopeptides': 407, 'Aminocoumarins': 238, 'Bacitracin': 123, 'Trimethoprim': 46, 'Phenicol': 160, 'Sulfonamide': 25, 'Fusidic acid': 4, 'Triclosan': 10}
12375
{'Multi-drug resistance': 28289, 'Peptide': 2621, 'Betalactams': 4727, 'MLS': 1854, 'Aminoglycosides': 2786, 'Rifampin': 2074, 'Phenicol': 640, 'Glycopeptides': 1630, 'Mupirocin': 459, 'Bacitracin': 493, 'Fluoroquinolones': 536, 'Tetracyclines': 1638, 'Trimethoprim': 182, 'Aminocoumarins': 950, 'Nucleosides': 206, 'Fosfomycin': 250, 'Sulfonamide': 103, 'Fusidic acid': 17, 'Triclosan': 43}
49498
{'Betalactams': 5909, 'Trimethoprim': 228, 'MLS': 2317, 'Fusidic acid': 21, 'Fosfomycin': 313, 'Aminoglycosides': 3483, 'Fluoroquinolones': 670, 'Multi-drug resistance': 35
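
The per-class tallies printed above can also be produced more compactly with `collections.Counter`. This is a sketch of an alternative, not the notebook's own code; `class_counts` is a hypothetical helper and the toy labels are made up. The last lines check the split sizes reported in the output above.

```python
from collections import Counter


def class_counts(labels):
    """Return (per-class counts as a dict, total number of labels)."""
    counts = Counter(labels)
    return dict(counts), sum(counts.values())


toy_labels = ["Betalactams", "MLS", "Betalactams", "Triclosan"]
counts, total = class_counts(toy_labels)
print(counts, total)

# Sanity check using the sizes printed above: 49498 training and 12375 test sequences.
test_fraction = 12375 / (49498 + 12375)
print(f"test fraction: {test_fraction:.4f}")
```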

## (2) Negative datasets (non-ARGs datasets)

1) Load four distinct negative datasets.

`identity 0 (ID ≤0%, sequences: 453,481), identity 30 (ID ≤30%, sequences: 470,358),
identity 50 (ID ≤50%, sequences: 474,570), and identity 80 (ID ≤80%, sequences: 475,049)`

2) Partition the four negative datasets into training and testing sets at an 80:20 ratio, and subsequently merge
them with the positive dataset to form the final training and testing sets.
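
The preprocessing applied to each negative FASTA file in the next cell (truncate to 1022 residues, drop exact duplicates, then label every remaining sequence as `Others`) can be sketched without Biopython as follows. The function names are hypothetical and the input is a plain list of strings rather than a FASTA file.

```python
def dedupe_truncated(sequences, max_length=1022):
    """Truncate each sequence to max_length and keep the first occurrence of each result."""
    seen = set()
    kept = []
    for seq in sequences:
        clipped = seq[:max_length]
        if clipped not in seen:      # duplicates are detected after truncation
            seen.add(clipped)
            kept.append(clipped)
    return kept


def as_negative_line(sequence):
    """Format a negative sequence in the 'sequence<TAB>class<TAB>subclass' layout."""
    return f"{sequence}\tOthers\tOthers\n"


toy = ["MKT", "MKT", "A" * 1030, "A" * 1022]
unique = dedupe_truncated(toy)
lines = [as_negative_line(s) for s in unique]
print(len(unique))
```

Note that two sequences differing only beyond position 1022 collapse into one entry, since the comparison happens after truncation.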

In [None]:
class BigLabelClassify(object):
    BigLabel_list = ['Betalactams', 'Trimethoprim', 'MLS', 'Fusidic acid', 'Fosfomycin', 'Aminoglycosides', 'Fluoroquinolones', 'Multi-drug resistance', 'Glycopeptides', 'Phenicol', 'Rifampin', 'Tetracyclines', 'Peptide', 'Bacitracin', 'Sulfonamide', 'Nucleosides', 'Aminocoumarins', 'Triclosan', 'Mupirocin',"Others"]


def processing_negative(filename,outfile,outfilename):
    num = 0
    sequence_all = {}
    try:
        if not os.path.exists(outfile):
            os.makedirs(outfile)
        with open(outfile+outfilename, "w") as write:
            for seq_record in SeqIO.parse(filename, "fasta"):
                sequence = str(seq_record.seq)
                # Truncate the sequence to at most 1022 residues.
                max_length = 1022
                sequence = sequence[:max_length]
                if sequence not in sequence_all:
                    num += 1
                    write.write(sequence + "\n")
                    sequence_all[sequence] = 1
        print(filename, str(num))
    except FileNotFoundError:
        print(f"File '{filename}' not found.")

def Split_train_test(output,input,name):
    data_negative = []
    with open(output+"/"+input,"r") as read:
        for item in read:
            data_negative.append(item.strip())

    # Split the negative data into training and test sets
    data_full_train, data_full_test = train_test_split(data_negative, train_size=0.8, random_state=2021)
    print(output+"/"+input,"train_number: ",str(len(data_full_train)),"test_number: ",str(len(data_full_test)))
    number = 0
    with open(output+"train_"+str(name)+".txt","w") as train:
        for item in data_full_train:
            number += 1
            train.write(item + "\t" + "Others" + "\t" + "Others" + "\n")
        with open("Data/positive_data_train.txt","r") as p:
            for item in p:
                _,seq,category_big,category_little = item.strip().split("\t")
                train.write(seq + "\t" + category_big +"\t"+ category_little +"\n")
                number += 1
    print(f"all_train {number}")
    number = 0
    with open(output+"test_"+str(name)+".txt", "w") as train:
        for item in data_full_test:
            number += 1
            train.write(item + "\t" + "Others" + "\t" + "Others" + "\n")
        with open("Data/positive_data_test.txt", "r") as p:
            for item in p:
                _, seq, category_big, category_little = item.strip().split("\t")
                train.write(seq + "\t" + category_big + "\t" + category_little + "\n")
                number += 1
    print(f"all_test {number}")


def processingTestData(dir,file_name):
    with open(dir+"finally_test.txt","w") as write:
        with open(dir+file_name,"r") as read:
            for item in read:
                seq, label_big, label_small = item.strip().split("\t")
                write.write(seq + "\t" + str(BigLabelClassify.BigLabel_list.index(label_big)) + "\n")
    test_category = {}
    with open(dir+"finally_test.txt","r") as read:
        for item in read:
            _,category = item.strip().split("\t")
            cate = int(category)
            if cate not in test_category:
                test_category[cate] = 1
            else:
                test_category[cate] += 1
    print(test_category)

processing_negative("Data/uniprot-reviewed+NOT+KW-0046-filtered-full.fasta","Data/full/","full_negative.txt")
processing_negative("Data/uniprot-reviewed+NOT+KW-0046-filtered-id30cov80.fasta","Data/30/","30_negative.txt")
processing_negative("Data/uniprot-reviewed+NOT+KW-0046-filtered-id50cov80.fasta","Data/50/","50_negative.txt")
processing_negative("Data/uniprot-reviewed+NOT+KW-0046-filtered-id80cov80.fasta","Data/80/","80_negative.txt")


Split_train_test("Data/30/","30_negative.txt",30)
Split_train_test("Data/50/","50_negative.txt", 50)
Split_train_test("Data/80/","80_negative.txt", 80)
Split_train_test("Data/full/","full_negative.txt", "full")


processingTestData("Data/30/","test_30.txt")
processingTestData("Data/50/","test_50.txt")
processingTestData("Data/80/","test_80.txt")
processingTestData("Data/full/","test_full.txt")

Data/uniprot-reviewed+NOT+KW-0046-filtered-full.fasta 453481
Data/uniprot-reviewed+NOT+KW-0046-filtered-id30cov80.fasta 470358
Data/uniprot-reviewed+NOT+KW-0046-filtered-id50cov80.fasta 474570
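
`processingTestData` converts each class name into its position in `BigLabel_list`, so the negative class `Others` becomes index 19. The sketch below shows the same encoding with a dictionary lookup instead of `list.index`; the names `LABEL_TO_INDEX` and `encode_line` are hypothetical, while the label order is copied from the cell above.

```python
BIG_LABELS = [
    "Betalactams", "Trimethoprim", "MLS", "Fusidic acid", "Fosfomycin",
    "Aminoglycosides", "Fluoroquinolones", "Multi-drug resistance",
    "Glycopeptides", "Phenicol", "Rifampin", "Tetracyclines", "Peptide",
    "Bacitracin", "Sulfonamide", "Nucleosides", "Aminocoumarins",
    "Triclosan", "Mupirocin", "Others",
]

# Dictionary lookup avoids a linear scan of the list for every line.
LABEL_TO_INDEX = {name: index for index, name in enumerate(BIG_LABELS)}


def encode_line(line):
    """Convert 'sequence<TAB>class<TAB>subclass' into 'sequence<TAB>class_index'."""
    seq, label_big, _label_small = line.rstrip("\n").split("\t")
    return f"{seq}\t{LABEL_TO_INDEX[label_big]}"


print(encode_line("MKT\tOthers\tOthers\n"))
```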


## (4) Conduct 5-fold cross-validation on the training dataset and implement oversampling to maintain balanced category distribution
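
The oversampling idea can be pictured with a small sketch: every positive class is topped up, by repeating its own sequences in order, until it matches the largest positive class, while the negative class (index 19, `Others`) is left untouched. The helper `oversample_positives` and the toy data are assumptions made for illustration; the notebook's cell appends the repeated lines to the `train_padding_*.txt` files instead of returning them.

```python
from collections import Counter
from itertools import cycle

NEGATIVE_LABEL = 19  # index of "Others" in the notebook's label list


def oversample_positives(samples, negative_label=NEGATIVE_LABEL):
    """Return extra (sequence, label) pairs so that each positive class
    reaches the size of the largest positive class."""
    by_label = {}
    for seq, label in samples:
        if label != negative_label:
            by_label.setdefault(label, []).append(seq)
    if not by_label:
        return []

    target = max(len(seqs) for seqs in by_label.values())
    extra = []
    for label, seqs in by_label.items():
        source = cycle(seqs)                      # repeat the class in its original order
        for _ in range(target - len(seqs)):
            extra.append((next(source), label))
    return extra


toy = [("AAA", 0), ("AAC", 0), ("AAG", 0), ("CCC", 2), ("GGG", 19), ("GGT", 19)]
balanced = toy + oversample_positives(toy)
print(Counter(label for _, label in balanced))
```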

In [None]:
def KFoldProcessing(dir_name:Union[str,Path]):
    if not os.path.exists(dir_name):
        raise ValueError("dir_name is not correct! please confirm the correct dir_name!")
    all_data = np.loadtxt(dir_name, delimiter='\t', dtype=list)
    sequence = all_data[:,0]
    labelBig = all_data[:,1]
    labelSmall = all_data[:,2]

    # big label -> index
    Index_BigLabel = [BigLabelClassify.BigLabel_list.index(x) for x in labelBig]
    print("All Data length: {}, Using {} Fold to process data... ".format(len(all_data),5))

    # 5-fold
    KfoldData = StratifiedKFold(n_splits=5,random_state=2021,shuffle=True)
    Data_Fold = {}
    folder_name = "/".join(dir_name.split("/")[:-1])
    for index,(train_index,test_index) in enumerate(KfoldData.split(sequence,Index_BigLabel)):
        Data_Fold[index] = {"train":{
                                     "sequence": sequence[train_index], "label": np.array(Index_BigLabel)[train_index]},
                             "vail":{
                                     "sequence": sequence[test_index], "label": np.array(Index_BigLabel)[test_index]}
                            }

        with open(folder_name + "/train_padding_" + str(index + 1) + ".txt", "w") as write:
            for seq_write, label_write in zip(Data_Fold[index]["train"]["sequence"],
                                              Data_Fold[index]["train"]["label"]):
                write.write(seq_write + "\t" + str(label_write) + "\n")
        with open(folder_name + "/train_" + str(index + 1) + ".txt", "w") as write:
            for seq_write, label_write in zip(Data_Fold[index]["train"]["sequence"],
                                              Data_Fold[index]["train"]["label"]):
                write.write(seq_write + "\t" + str(label_write) + "\n")
        with open(folder_name + "/Vail_" + str(index + 1) + ".txt", "w") as write:
            for seq_write, label_write in zip(Data_Fold[index]["vail"]["sequence"],
                                              Data_Fold[index]["vail"]["label"]):
                write.write(seq_write + "\t" + str(label_write) + "\n")

    """
    Balancing the data only for the positive category
    """
    MAX = {}
    Category_number = {}
    for item in Data_Fold:
        train_data_label = Data_Fold[item]["train"]["label"]
        category = {}
        for label in train_data_label:
            if label not in category:
                category[label] = 1
            else:
                category[label] += 1
        # Drop the negative class (index 19, "Others") so the maximum is taken over positive classes only.
        del category[19]
        MAX[item] = max(category.values())
        Category_number[item] = category
    print("Balance data for positive data...")
    for item in Data_Fold:
        max_fold = MAX[item]
        origin_category = Category_number[item]
        sequence_fold = Data_Fold[item]["train"]["sequence"]
        label_fold = Data_Fold[item]["train"]["label"]

        num_padding = {}
        zero_num_padding = {}
        for i in origin_category:
            num_padding[i] = max_fold - origin_category[i]
            zero_num_padding[i] = 0

        with open(folder_name + "/train_padding_" + str(item + 1) + ".txt", "a+") as write:
            while zero_num_padding != num_padding:
                for seq,label_big in zip(sequence_fold,label_fold):
                    if label_big != 19 and zero_num_padding[label_big] < num_padding[label_big]:
                        write.write(seq +"\t"+str(label_big)+"\n")
                        zero_num_padding[label_big] += 1
        balance_category = {}
        with open(folder_name + "/train_padding_" + str(item + 1) + ".txt", "r") as read:
            for line in read:  # renamed to avoid shadowing the outer loop variable `item`
                _,category = line.strip().split("\t")
                if int(category) not in balance_category:
                    balance_category[int(category)] = 1
                else:
                    balance_category[int(category)] += 1
    gc.collect()

'''
KFoldProcessing("Data/30/train_30.txt")
KFoldProcessing("Data/50/train_50.txt")
KFoldProcessing("Data/80/train_80.txt")
'''
KFoldProcessing("Data/full/train_full.txt")

All Data length: 412282, Using 5 Fold to process data... 
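
For readers unfamiliar with stratified k-fold splitting, the following pure-Python sketch shows the principle: indices are dealt out class by class so that every fold receives a similar share of each class. The notebook uses scikit-learn's `StratifiedKFold`; `stratified_folds` is a hypothetical stand-in whose fold membership will differ from scikit-learn's.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=2021):
    """Assign sample indices to n_splits folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)  # round-robin within the class
    return folds


toy_labels = [0] * 10 + [19] * 20
folds = stratified_folds(toy_labels)
for fold_number, fold in enumerate(folds, start=1):
    positives = sum(1 for i in fold if toy_labels[i] == 0)
    print(fold_number, len(fold), positives)
```

Each fold serves once as the validation set, and the remaining four folds together form the corresponding training set.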


🎉 Now that all the data needed for training is prepared, please refer to 'FunGeneTyper/README.md' for guidance on model training.

## (5) Constructing the final dataset using the 'uniprot-reviewed+NOT+KW-0046-filtered-full.fasta' as the negative dataset.


After validation, 'uniprot-reviewed+NOT+KW-0046-filtered-full.fasta' proved to be the most effective negative dataset for training. Consequently,
the data built from 'uniprot-reviewed+NOT+KW-0046-filtered-full.fasta' is divided into training, validation, and test sets in a 6:2:2 ratio.
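
The 6:2:2 ratio follows from two successive splits: the earlier 80:20 train/test split, and a further split of the training portion with `train_size=0.75` in `SplitTrainToVail`. A short worked check, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction

first_split_train = Fraction(8, 10)   # train_size=0.8 in the earlier train/test split
second_split_train = Fraction(3, 4)   # train_size=0.75 when carving out the validation set

train_share = first_split_train * second_split_train
validation_share = first_split_train * (1 - second_split_train)
test_share = 1 - first_split_train

print(train_share, validation_share, test_share)  # prints: 3/5 1/5 1/5
```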

In [None]:
def SplitTrainToVail(dir_train:Union[str,Path],dir_test:Union[str,Path],):
    filename_write = "Data/ARGs_ClassLevel"
    if not os.path.exists(filename_write):
        os.makedirs(filename_write)
    all_data = np.loadtxt(dir_train, delimiter='\t', dtype=list)
    sequence = all_data[:,0]
    labelBig = all_data[:,1]
    labelSmall = all_data[:,2]

    # label -> index
    Index_BigLabel = [BigLabelClassify.BigLabel_list.index(x) for x in labelBig]
    print("All Data length: {}, splitting data into training and validation sets. ".format(len(all_data)))

    data_train, data_test, label_train, label_test = train_test_split(sequence, Index_BigLabel, train_size=0.75,
                                                                      random_state=2021, stratify=Index_BigLabel)
    data_train = list(data_train)
    data_test = list(data_test)
    label_train = list(label_train)
    label_test = list(label_test)
    with open(filename_write+"/"+"Train_full.txt","w") as write:
        for seq,categogy in zip(data_train,label_train):
            write.write(seq +"\t"+ str(categogy) +"\n")
    with open(filename_write+"/"+"TrainPaddingAll_full.txt","w") as write:
        for seq,categogy in zip(data_train,label_train):
            write.write(seq +"\t"+ str(categogy) +"\n")
    with open(filename_write+"/"+"Vail_full.txt", "w") as write:
        for seq, categogy in zip(data_test, label_test):
            write.write(seq + "\t" + str(categogy) + "\n")
    with open(filename_write+"/"+"Test_full.txt", "w") as write:
        with open(dir_test,"r") as read:
            for item in read:
                seq,big,little = item.strip().split("\t")
                write.write(seq +"\t"+ str(BigLabelClassify.BigLabel_list.index(big)) +"\n")
    out = {}
    with open(filename_write+"/"+"Train_full.txt", "r") as read:
        for item in read:
            seq_, category_ = item.strip().split("\t")
            if int(category_) not in out:
                out[int(category_)] = 1
            else:
                out[int(category_)] += 1
    print(out)
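    # Note: MAX is computed before class 19 ("Others") is removed, so it may equal the negative-class count.
    MAX = max(out.values())
    del out[19]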
    Category_positive = out
    num_padding = {}
    zero_num_padding = {}
    for i in Category_positive:
        num_padding[i] = MAX - Category_positive[i]
        zero_num_padding[i] = 0
    with open(filename_write+"/"+"TrainPaddingAll_full.txt", "a+") as write:
        while zero_num_padding != num_padding:
            for seq, label_big in zip(data_train, label_train):
                if label_big != 19 and zero_num_padding[label_big] < num_padding[label_big]:
                    write.write(seq + "\t" + str(label_big) + "\n")
                    zero_num_padding[label_big] += 1
                    print(zero_num_padding)
    """
    just output and see
    """
    ##########################
    out = {}
    with open(filename_write+"/"+"Train_full.txt", "r") as read:
        for item in read:
            seq_,category_ = item.strip().split("\t")
            if int(category_) not in out:
                out[int(category_)] = 1
            else:
                out[int(category_)] += 1
    print(out)
    out = {}
    with open(filename_write+"/"+"TrainPaddingAll_full.txt", "r") as read:
        for item in read:
            seq_, category_ = item.strip().split("\t")
            if int(category_) not in out:
                out[int(category_)] = 1
            else:
                out[int(category_)] += 1
    print(out)
    out = {}
    with open(filename_write+"/"+"Vail_full.txt", "r") as read:
        for item in read:
            seq_,category_ = item.strip().split("\t")
            if int(category_) not in out:
                out[int(category_)] = 1
            else:
                out[int(category_)] += 1
    print(out)
    out = {}
    with open(filename_write+"/"+"Test_full.txt", "r") as read:
        for item in read:
            seq_, category_ = item.strip().split("\t")
            if int(category_) not in out:
                out[int(category_)] = 1
            else:
                out[int(category_)] += 1
    print(out)

SplitTrainToVail("Data/full/train_full.txt","Data/full/test_full.txt")

👏 You can use the training, validation, and test sets to reproduce the class-level ARGTyper results presented in our paper.
These datasets are also available for direct download: https://drive.google.com/drive/folders/1uKP9-IIkOXqgQYSSfruycdCyl0otY41J?usp=drive_link
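
The files written to 'Data/ARGs_ClassLevel' store one `sequence<TAB>class_index` pair per line. A minimal sketch of reading them back is shown below; `read_labelled_sequences` is a hypothetical helper and the two example lines are invented for illustration.

```python
def read_labelled_sequences(lines):
    """Parse 'sequence<TAB>class_index' lines into (sequence, int_label) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:                 # skip blank lines
            continue
        seq, label = line.split("\t")
        records.append((seq, int(label)))
    return records


example = ["MKTAYIAK\t0\n", "MSLLQ\t19\n"]
records = read_labelled_sequences(example)
print(records)
```

The same function accepts an open file handle, for example the one returned by `open("Data/ARGs_ClassLevel/Train_full.txt")`, because it only iterates over lines.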






