Build a multi-class DNA sequence classifier in pure PHP using PHP-ML, from raw data to predictions.
This tutorial demonstrates how to use the PHP-ML library to build a machine learning model that classifies DNA sequences into:
- Bacteria
- Animal
- Fungi
- Virus
- Plant
You'll go through the complete pipeline:
- Data preparation
- Exploratory Data Analysis (EDA)
- Model training
- Evaluation
- Prediction
All examples are located in:
example/dna/
├── eda.php
├── train.php
└── predict.php
DNA sequences contain patterns that can be used to identify their biological origin. Instead of binary promoter detection, this project performs multi-class classification across five organism types.
Machine learning helps by:
- Automatically discovering patterns in DNA sequences
- Scaling to large biological datasets
- Providing fast and accurate classification
Ensure you have:
- PHP ≥ 8.2
- Composer
- Install PHP-ML:
  composer require "ghostjat/pml:*"
- Basic command-line knowledge
- Total Samples: 244,447
- Features: 256 (k-mer frequencies)
- Classes: 5
  - bacteria
  - animal
  - fungi
  - virus
  - plant
datasets/train_*.csv
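A feature count of 256 is consistent with 4-mers, since there are 4^4 = 256 possible length-4 words over A, C, G and T. The sketch below shows one way such a vector could be computed from a raw sequence. The function name `kmerFrequencies`, the lexicographic ordering, the normalisation to relative frequencies, and the skipping of ambiguous bases are all assumptions made for illustration; the dataset's actual preprocessing is not described here.

```php
<?php
/**
 * Count overlapping k-mers in a DNA string and return relative frequencies.
 * Keys are all 4^k k-mers over A, C, G, T in lexicographic order (assumed).
 * Windows containing any other character (for example N) are skipped.
 *
 * @return array<string, float>
 */
function kmerFrequencies(string $sequence, int $k = 4): array
{
    $alphabet = ['A', 'C', 'G', 'T'];

    // Build every possible k-mer so the vector always has 4^k entries.
    $kmers = [''];
    for ($i = 0; $i < $k; $i++) {
        $next = [];
        foreach ($kmers as $prefix) {
            foreach ($alphabet as $base) {
                $next[] = $prefix . $base;
            }
        }
        $kmers = $next;
    }
    $counts = array_fill_keys($kmers, 0);

    // Slide a window of length k over the sequence, one base at a time.
    $sequence = strtoupper($sequence);
    $total = 0;
    for ($i = 0, $last = strlen($sequence) - $k; $i <= $last; $i++) {
        $window = substr($sequence, $i, $k);
        if (isset($counts[$window])) {
            $counts[$window]++;
            $total++;
        }
    }

    // Normalise counts to frequencies; empty input yields an all-zero vector.
    return array_map(
        static fn (int $c): float => $total > 0 ? $c / $total : 0.0,
        $counts
    );
}

$features = kmerFrequencies('ACGTACGT');
printf("%d features, ACGT = %.2f\n", count($features), $features['ACGT']);
// prints: 256 features, ACGT = 0.40
```

Calling `array_values($features)` gives a plain 256-element row that could be written to a CSV file in the same shape as the training data, provided the column order matches.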
The eda.php script loads and inspects the dataset.
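// Collect every training file that matches the pattern.
$trainFiles = glob(__DIR__ . '/datasets/train_*.csv');

// Load the first file, then stack the remaining ones onto it.
$dataset = loadDna($trainFiles[0]);
for ($i = 1; $i < count($trainFiles); $i++) {
    $dataset = $dataset->stack(loadDna($trainFiles[$i]));
}

// Read the class names from the last column of the first file.
$df0 = DataFrame::fromCSV($trainFiles[0], false);
$cols0 = $df0->columns();
$classes = $df0->categories(end($cols0));

- Loads multiple CSV files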
- Merges them into one dataset
- Extracts class distribution
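To make the class-distribution step concrete, here is a library-free sketch that counts labels with plain PHP. It assumes the label sits in the last column of each row, as the snippet above suggests. `classDistribution` and `readCsvRows` are hypothetical helpers written for this example, not part of PHP-ML.

```php
<?php
/**
 * Count how many rows carry each label, assuming the label is the last column.
 *
 * @param iterable<array<int, string|null>> $rows
 * @return array<int|string, int> label => count, most frequent first
 */
function classDistribution(iterable $rows): array
{
    $counts = [];
    foreach ($rows as $row) {
        if ($row === [] || $row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $label = (string) $row[array_key_last($row)];
        $counts[$label] = ($counts[$label] ?? 0) + 1;
    }
    arsort($counts);
    return $counts;
}

/** Lazily yield CSV rows from a file (hypothetical helper for this sketch). */
function readCsvRows(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($row = fgetcsv($handle, null, ',', '"', '')) !== false) {
            yield $row;
        }
    } finally {
        fclose($handle);
    }
}

// Tiny in-memory example; real rows would come from readCsvRows($file).
$sample = [
    ['0.1', '0.2', 'plant'],
    ['0.3', '0.4', 'virus'],
    ['0.5', '0.6', 'plant'],
];
print_r(classDistribution($sample));
```

A heavily skewed distribution is worth knowing about before training, because overall accuracy can look good even when a rare class is mostly misclassified.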
Train a neural network using MLPClassifier.
$pipeline = new Pipeline(
    [new NumericStringConverter(), new ZScaleStandardizer()],
    new MLPClassifier(
        architecture: [32, 16],
        epochs: 10,
        learningRate: 0.01,
        batchSize: 32
    )
);

// Fix the seed so the shuffle and the split are reproducible.
Dataset::seed(42);
$dataset->randomize();

// 80% of the rows for training, 20% for validation.
[$train, $val] = $dataset->split(0.8);

$pipeline->train($train);
$valPreds = $pipeline->predict($val);
$valAcc = (new Accuracy())->score($valPreds, $val->labels());

- Train Samples: 195,558
- Validation Samples: 48,889
- Validation Accuracy: ~90.07%
- Training Time: ~20 seconds
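The sample counts follow from the 0.8 split ratio: 244,447 × 0.8 = 195,557.6, which rounds to 195,558 training rows and leaves 48,889 for validation. The sketch below reproduces that arithmetic with a seeded shuffle in plain PHP. It illustrates the idea only and is not a description of how `Dataset::seed()`, `randomize()` or `split()` work internally; `seededSplit` is a hypothetical helper, and the use of rounding is an assumption that happens to match the reported counts.

```php
<?php
/**
 * Shuffle rows with a fixed seed, then cut them into train and validation parts.
 *
 * @param list<mixed> $rows
 * @return array{0: list<mixed>, 1: list<mixed>}
 */
function seededSplit(array $rows, float $ratio = 0.8, int $seed = 42): array
{
    // Mt19937 with an explicit seed gives the same permutation on every run.
    $randomizer = new \Random\Randomizer(new \Random\Engine\Mt19937($seed));
    $shuffled = $randomizer->shuffleArray($rows);

    // Rounding is an assumption; it matches the counts reported above.
    $cut = (int) round(count($shuffled) * $ratio);

    return [array_slice($shuffled, 0, $cut), array_slice($shuffled, $cut)];
}

// Row indices stand in for the real samples.
[$train, $val] = seededSplit(range(1, 244447));
printf("train=%d val=%d\n", count($train), count($val));
// prints: train=195558 val=48889
```

Shuffling before the cut matters when the CSV files are grouped by class; without it, the validation set could end up containing only one or two organism types.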
Use a trained model to classify new DNA sequences.
// ── 1. Load model + class map ──────────────────────────────────────
$logger->info('Loading model …');
$pipeline = Pipeline::load($modelDir);
$classes = json_decode(file_get_contents($modelDir . '/classes.json'), true);
$logger->info('Model loaded', ['classes' => $classes]);

// ── 2. Load unknown CSV ────────────────────────────────────────────
$logger->info('Loading unknown data …');
$df = DataFrame::fromCSV($unknownCsv, false);
$cols = $df->columns();
// Check if last col is a label (STRING) or a feature (float32)
$dtypes = $df->dtypes();
$lastCol = end($cols);
$hasLabels = ($dtypes[$lastCol] === 'string');
$X = $df->drop($hasLabels ? [$lastCol] : [])->toTensor();
$dataset = new Dataset($X);
$logger->info('Data ready', ['rows' => $dataset->numRows(), 'features' => $dataset->numColumns()]);

// ── 3. Predict ─────────────────────────────────────────────────────
$logger->info('Predicting …');
$predIndices = $pipeline->predict($dataset)->toFlatArray(); // [N] class indices

// ── 4. Evaluate if labels available ────────────────────────────────
if ($hasLabels) {
    $yTrue = $df->castToFloat($lastCol)->col($lastCol)->squeeze();
    $predT = \Pml\Tensor::fromArray($predIndices);
    $acc = (new Accuracy())->score($predT, $yTrue);
    $logger->info(sprintf('Test accuracy: %.4f (%.2f%%)', $acc, $acc * 100));
}

php eda.php
php train.php      # softmax
php trainMLP.php
php predict.php

- Accuracy: Overall correctness
- Multi-class Predictions: Output label among 5 classes
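Both metrics can be illustrated without the library. Accuracy is the fraction of predictions that match the true class, and a multi-class prediction is a class index that is translated back to a name. The sketch below assumes that `classes.json` decodes to a list whose position is the class index; that layout is a guess, and the class order shown is illustrative only. `accuracy` and `toLabels` are hypothetical helpers.

```php
<?php
/** Fraction of positions where prediction and ground truth agree. */
function accuracy(array $predicted, array $actual): float
{
    if ($predicted === [] || count($predicted) !== count($actual)) {
        throw new InvalidArgumentException('Inputs must be non-empty and of equal length.');
    }
    $hits = 0;
    foreach ($predicted as $i => $p) {
        if ((int) $p === (int) $actual[$i]) {
            $hits++;
        }
    }
    return $hits / count($predicted);
}

/** Translate class indices to names; unknown indices are kept as "#<index>". */
function toLabels(array $indices, array $classes): array
{
    return array_map(
        static fn ($i): string => $classes[(int) $i] ?? '#' . (int) $i,
        $indices
    );
}

// Illustrative values only; the real order would come from classes.json.
$classes = ['bacteria', 'animal', 'fungi', 'virus', 'plant'];
$pred = [0, 3, 4, 1, 2];
$true = [0, 3, 4, 1, 1];

printf("accuracy = %.2f\n", accuracy($pred, $true)); // prints: accuracy = 0.80
echo implode(', ', toLabels($pred, $classes)), "\n";
// prints: bacteria, virus, plant, animal, fungi
```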
- Increase epochs beyond the 10 used here, which may improve accuracy
- Try deeper architectures
- Experiment with other classifiers
- Add cross-validation
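As a starting point for the cross-validation idea, the sketch below partitions row indices into k folds in plain PHP. It works on indices only, so it does not depend on any PHP-ML call; whether the library ships its own cross-validation helper is not covered here. `kFoldIndices` is a hypothetical name.

```php
<?php
/**
 * Partition row indices 0..n-1 into k contiguous folds of near-equal size.
 * Shuffle the rows beforehand if the data is ordered by class.
 *
 * @return list<array{train: list<int>, validation: list<int>}>
 */
function kFoldIndices(int $n, int $k = 5): array
{
    if ($k < 2 || $k > $n) {
        throw new InvalidArgumentException('k must be between 2 and n.');
    }
    $folds = [];
    for ($fold = 0; $fold < $k; $fold++) {
        // Integer boundaries spread the remainder across the folds.
        $start = intdiv($fold * $n, $k);
        $end   = intdiv(($fold + 1) * $n, $k);

        // Everything outside [start, end) is used for training.
        $validation = range($start, $end - 1);
        $train = array_merge(
            $start > 0 ? range(0, $start - 1) : [],
            $end < $n ? range($end, $n - 1) : []
        );
        $folds[] = ['train' => $train, 'validation' => $validation];
    }
    return $folds;
}

foreach (kFoldIndices(10, 3) as $i => $fold) {
    printf("fold %d: %d train, %d validation\n", $i, count($fold['train']), count($fold['validation']));
}
// prints:
// fold 0: 7 train, 3 validation
// fold 1: 7 train, 3 validation
// fold 2: 6 train, 4 validation
```

Training once per fold and averaging the validation accuracies gives a steadier estimate than a single 80/20 split, at the cost of k times the training time.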
You now have a complete workflow for building a multi-class DNA classifier in PHP.
Push PHP beyond traditional limits, even into machine learning.
Happy coding!