# DNA Sequence Classification

DNA sequence prediction is a crucial task in bioinformatics, enabling researchers to analyze genetic patterns, predict mutations, and model gene structures. The dataset can be used to implement three machine learning approaches to predict nucleotide sequences: N-Gram, LSTM, and Transformer models. This notebook instead uses it for classification: predicting a gene's type (`GeneType`) from its nucleotide sequence.

We use nucleotide sequences of human genes from the NCBI Gene Database. The dataset consists of:

1. Gene symbols, descriptions, and types.
2. Nucleotide sequences represented as A, T, C, G.
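3. A train/validation/test split. The dataset description states 80% training and 20% testing, although the CSV files loaded in this notebook are split three ways (train, validation, test).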

Source: https://www.kaggle.com/datasets/harshvardhan21/dna-sequence-prediction

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import os

data_dir = "Data"  # folder containing the CSV files

for file_name in os.listdir(data_dir):
    full_path = os.path.join(data_dir, file_name)

    # keep only regular files with a .csv extension
    if os.path.isfile(full_path) and file_name.endswith(".csv"):
        print(full_path)  # full path
        print(f"=== {file_name} ===")  # file name only

        df = pd.read_csv(full_path)

        print(f"{file_name}: {df.shape}")  # (rows, columns)
        print("-" * 40)

Data\test.csv
=== test.csv ===
test.csv: (8326, 7)
----------------------------------------
Data\train.csv
=== train.csv ===
train.csv: (22593, 7)
----------------------------------------
Data\validation.csv
=== validation.csv ===
validation.csv: (4577, 7)
----------------------------------------
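
The shapes printed above give the actual size of each split. As a quick arithmetic check (a sketch that simply hard-codes those three row counts), the shares work out to roughly 64% training, 13% validation, and 23% test, which differs from the 80/20 figure quoted in the dataset description.

```python
# Row counts copied from the shapes printed above.
rows = {"train": 22593, "validation": 4577, "test": 8326}

total = sum(rows.values())
fractions = {name: count / total for name, count in rows.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {rows[name]:>6} rows  {frac:.1%}")
```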


### Data Cleaning and Preprocessing

In [3]:
test_df = pd.read_csv("Data/test.csv", encoding='ascii')
train_df = pd.read_csv("Data/train.csv", encoding='ascii')
val_df = pd.read_csv("Data/validation.csv", encoding='ascii')

train_df.head()

| Unnamed: 0.1 | Unnamed: 0 | NCBIGeneID | Symbol | Description | GeneType | GeneGroupMethod | NucleotideSequence |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 106481178 | RNU4-21P | RNA, U4 small nuclear 21, pseudogene | PSEUDO | NCBI Ortholog | `<AGCTTAGCACAGTGGCAGTATCATAGGCAGTGAGGTTTATCCGAG...` |
| 1 | 1 | 123477792 | LOC123477792 | Sharpr-MPRA regulatory region 12926 | BIOLOGICAL_REGION | NCBI Ortholog | `<CTGGAGCGGCCACGATGTGAACTGTCACCGGCCACTGCTGCTCCG...` |
| 2 | 2 | 113174975 | LOC113174975 | Sharpr-MPRA regulatory region 7591 | BIOLOGICAL_REGION | NCBI Ortholog | `<TTCCCAATTTTTCCTCTGCTTTTTAATTTTCTAGTTTCCTTTTTC...` |
| 3 | 3 | 116216107 | LOC116216107 | CRISPRi-validated cis-regulatory element chr10... | BIOLOGICAL_REGION | NCBI Ortholog | `<CGCCCAGGCTGGAGTGCAGTGGCGCCATCTCGGCTCACTGCAGGC...` |
| 4 | 4 | 28502 | IGHD2-21 | immunoglobulin heavy diversity 2-21 | OTHER | NCBI Ortholog | `<AGCATATTGTGGTGGTGACTGCTATTCC>` |


In [4]:
#train_df = train_df.drop(columns=["Unnamed: 0"])
#test_df = test_df.drop(columns=["Unnamed: 0"])
#val_df = val_df.drop(columns=["Unnamed: 0"])

In [5]:
print(type(train_df))

<class 'pandas.core.frame.DataFrame'>


The dataset is already clean and ready for training. It consists of three CSV files:
- train.csv
- test.csv
- validation.csv
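
One way to spot-check the "already clean" claim is to look for characters outside the expected A, C, G, T alphabet. The sketch below is illustrative only: `invalid_characters` is a hypothetical helper, not part of the original notebook, and it assumes the `<` and `>` wrappers seen in the `head()` output should be ignored. The second and third samples are made up to show what a problem would look like.

```python
def invalid_characters(seq: str, alphabet: str = "ACGT") -> set:
    """Return the characters in seq that fall outside the alphabet.

    The < and > wrappers seen in the dataset are stripped first.
    """
    cleaned = str(seq).upper().strip().replace("<", "").replace(">", "")
    return set(cleaned) - set(alphabet)


samples = [
    "<AGCATATTGTGGTGGTGACTGCTATTCC>",  # taken from the head() output
    "<ACGTNNACGT>",                    # hypothetical: contains the ambiguity code N
    "<ACGU>",                          # hypothetical: RNA-style U
]

for s in samples:
    print(s, "->", invalid_characters(s) or "ok")
```

On the real data, the same helper could be applied with `train_df["NucleotideSequence"].map(invalid_characters)` and the non-empty results inspected.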

### Classification Models

In [6]:
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from itertools import product

# features and labels
X_train = train_df["NucleotideSequence"].astype(str)
y_train = train_df["GeneType"]

X_val = val_df["NucleotideSequence"].astype(str)
y_val = val_df["GeneType"]

X_test = test_df["NucleotideSequence"].astype(str)
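
Before modelling, it can be worth checking whether identical sequences appear in more than one split, since exact duplicates would inflate validation scores. The following is a minimal sketch under stated assumptions: `normalise` and `shared_sequences` are hypothetical helpers, and the toy lists stand in for the real columns.

```python
def normalise(seq: str) -> str:
    """Upper-case a sequence and drop the < and > wrappers."""
    return str(seq).upper().strip().replace("<", "").replace(">", "")


def shared_sequences(first, second) -> set:
    """Return the normalised sequences that occur in both collections."""
    return {normalise(s) for s in first} & {normalise(s) for s in second}


# Hypothetical toy data, not taken from the dataset.
train_toy = ["<ACGTACGT>", "<TTTTCCCC>", "<GGGAAACC>"]
val_toy = ["<acgtacgt>", "<CCCCGGGG>"]

overlap = shared_sequences(train_toy, val_toy)
print(len(overlap), "shared sequence(s):", overlap)
```

With the variables defined in the cell above, the call would be `shared_sequences(X_train, X_val)`; an empty set means no exact duplicates across the two splits.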

### Sequence cleaning and k-mer generation

In [7]:
def clean_sequence(seq: str) -> str:
    seq = str(seq).upper().strip()
    seq = seq.replace("<","").replace(">", "")
    return seq

def get_kmers(seq, k=3):
    seq = clean_sequence(seq)
    if len(seq) < k:
        return []
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def kmer_analyzer(seq):
    return get_kmers(seq, k=3)
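
To make the behaviour of these helpers concrete, here is a small usage example. The two functions are copied from the cell above so the block runs on its own; the input strings are made up for illustration.

```python
def clean_sequence(seq: str) -> str:
    seq = str(seq).upper().strip()
    seq = seq.replace("<", "").replace(">", "")
    return seq


def get_kmers(seq, k=3):
    seq = clean_sequence(seq)
    if len(seq) < k:
        return []
    # slide a window of width k over the sequence, one base at a time
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(get_kmers("<ACGTAC>"))       # ['ACG', 'CGT', 'GTA', 'TAC']
print(get_kmers("<acgt>", k=2))    # ['AC', 'CG', 'GT']
print(get_kmers("<AC>"))           # [] because the sequence is shorter than k
```

A sequence of length n yields n - k + 1 overlapping k-mers, so very short sequences contribute few or no features.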

#### Classic model pipeline (k-mer + TF-IDF + Random Forest)

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

pipeline = Pipeline([
    ("cv", CountVectorizer(
        analyzer=kmer_analyzer,
        max_features=5000,       # cap the number of features
        min_df=5)),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(
        n_estimators=200,        # number of trees
        max_depth=None,          # leave as None for now (could be capped, e.g. at 20)
        n_jobs=-1,               # use all CPU cores
        random_state=42
    ))
])
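
It may help to see how large the k-mer vocabulary can get. Assuming the cleaned sequences contain only A, C, G, and T (an assumption, since other symbols such as N would add entries), there are at most 4^k distinct k-mers. The sketch below uses a hypothetical helper, `max_vocabulary_size`, to compare that bound with the `max_features=5000` setting.

```python
from itertools import product


def max_vocabulary_size(k: int, alphabet: str = "ACGT") -> int:
    """Upper bound on distinct k-mers over the given alphabet."""
    return len(alphabet) ** k


# Enumerate every possible 3-mer explicitly to confirm the bound.
all_3mers = ["".join(p) for p in product("ACGT", repeat=3)]
print(len(all_3mers), max_vocabulary_size(3))

# Find the k at which the bound first exceeds the 5000-feature cap.
for k in range(3, 8):
    size = max_vocabulary_size(k)
    print(f"k={k}: up to {size} k-mers, exceeds 5000: {size > 5000}")
```

Under that assumption, k=3 gives at most 64 features, so the 5000-feature cap would only start to matter at k=7 and above.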

### Results

In [9]:
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)

print("Validation accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

Validation accuracy: 0.9997815162770374
                   precision    recall  f1-score   support

BIOLOGICAL_REGION       1.00      1.00      1.00      1398
            OTHER       1.00      1.00      1.00        88
   PROTEIN_CODING       1.00      1.00      1.00       101
           PSEUDO       1.00      1.00      1.00      2133
            ncRNA       1.00      1.00      1.00       516
             rRNA       1.00      1.00      1.00         8
            snRNA       0.96      1.00      0.98        22
           snoRNA       1.00      1.00      1.00       239
             tRNA       1.00      1.00      1.00        72

         accuracy                           1.00      4577
        macro avg       1.00      1.00      1.00      4577
     weighted avg       1.00      1.00      1.00      4577
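
The rounded report hides how many validation samples were actually misclassified. A short arithmetic sketch, using only the numbers printed above, recovers that count from the accuracy and checks which false-positive count is consistent with the snRNA precision of 0.96.

```python
accuracy = 0.9997815162770374   # validation accuracy printed above
n_val = 4577                    # validation support from the report

# Number of misclassified validation samples implied by the accuracy.
errors = round((1 - accuracy) * n_val)
print("misclassified samples:", errors)

# snRNA has recall 1.00 on 22 samples, so all 22 are true positives.
# Precision = TP / (TP + FP); try a few false-positive counts.
true_positives = 22
for false_positives in range(4):
    precision = true_positives / (true_positives + false_positives)
    print(false_positives, round(precision, 2))
```

The accuracy works out to one error in 4577 samples, and a single false positive gives a precision of 22/23, which rounds to 0.96. This is consistent with one validation sample from another class being predicted as snRNA, though only the raw predictions can confirm it.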



In [10]:
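import joblib  # usually installed alongside scikit-learn

# assuming the pipeline has already been fitted, e.g.:
# pipeline.fit(X_train_full, y_train_full)

# Save the model to a file.
# Note: the saved pipeline refers to kmer_analyzer by name, so that function
# must be defined and importable wherever the model is loaded.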
joblib.dump(pipeline, "dna_classifier_pipeline.joblib", compress=3)

print("Model saved as dna_classifier_pipeline.joblib")

Model saved as dna_classifier_pipeline.joblib


In [11]:
# Test Model
seq_list = [
    "<AGCTTAGCAAGTCCGATC>",
    "<TTTCCCGGGAAA>",
    "<ACGTACGTACGT>"
]

preds = pipeline.predict(seq_list)
for s, p in zip(seq_list, preds):
    print(s, "→", p)


<AGCTTAGCAAGTCCGATC> → ncRNA
<TTTCCCGGGAAA> → ncRNA
<ACGTACGTACGT> → BIOLOGICAL_REGION
