Skip to content

ghostjat/DNA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

5 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿงฌ PHP-ML DNA Classification Tutorial

Build a multi-class DNA sequence classifier in pure PHP using PHP-ML โ€” from raw data to predictions.


๐Ÿ“Œ Introduction

This tutorial demonstrates how to use the PHP-ML library to build a machine learning model that classifies DNA sequences into:

  • ๐Ÿฆ  Bacteria
  • ๐Ÿพ Animal
  • ๐Ÿ„ Fungi
  • ๐Ÿงซ Virus
  • ๐ŸŒฟ Plant

Youโ€™ll go through the complete pipeline:

  • Data preparation
  • Exploratory Data Analysis (EDA)
  • Model training
  • Evaluation
  • Prediction

All examples are located in:

example/dna/
โ”œโ”€โ”€ eda.php
โ”œโ”€โ”€ train.php
โ””โ”€โ”€ predict.php

๐Ÿงช Problem Overview

DNA sequences contain patterns that can be used to identify their biological origin. Instead of binary promoter detection, this project performs multi-class classification across five organism types.


๐Ÿค– Why Machine Learning?

Machine learning helps by:

  • Automatically discovering patterns in DNA sequences
  • Scaling to large biological datasets
  • Providing fast and accurate classification

โš™๏ธ Prerequisites

Ensure you have:

  • PHP โ‰ฅ 8.2
  • Composer
  • Install PHP-ML:
composer require ghostjat/pml:*
  • Basic command-line knowledge

๐Ÿ“‚ Dataset Overview

๐Ÿ“Š Summary

  • Total Samples: 244,447
  • Features: 256 (k-mer frequencies)
  • Classes: 5

๐Ÿงฌ Classes

  • bacteria
  • animal
  • fungi
  • virus
  • plant

๐Ÿ“ Storage

datasets/train_*.csv

๐Ÿ” Step 1: Exploratory Data Analysis (eda.php)

This script loads and inspects the dataset.

$trainFiles = glob(__DIR__ . '/datasets/train_*.csv');
$dataset = loadDna($trainFiles[0]);

for ($i = 1; $i < count($trainFiles); $i++) {
    $dataset = $dataset->stack(loadDna($trainFiles[$i]));
}

$df0 = DataFrame::fromCSV($trainFiles[0], false);
$cols0 = $df0->columns();
$classes = $df0->categories(end($cols0));

๐Ÿ”Ž What it does

  • Loads multiple CSV files
  • Merges them into one dataset
  • Extracts class distribution

๐Ÿง  Step 2: Training & Evaluation (train.php)

Train a neural network using MLPClassifier.

$pipeline = new Pipeline(
    [new NumericStringConverter(), new ZScaleStandardizer()],
    new MLPClassifier(
        architecture: [32, 16],
        epochs: 10,
        learningRate: 0.01,
        batchSize: 32
    )
);

Dataset::seed(42);
$dataset->randomize();

[$train, $val] = $dataset->split(0.8);

$pipeline->train($train);

$valPreds = $pipeline->predict($val);
$valAcc = (new Accuracy())->score($valPreds, $val->labels());

โšก Training Details

  • Train Samples: 195,558
  • Validation Samples: 48,889
  • Validation Accuracy: ~90.07%
  • Training Time: ~20 seconds

๐Ÿ”ฎ Step 3: Prediction (predict.php)

Use a trained model to classify new DNA sequences.

// โ”€โ”€ 1. Load model + class map โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
$logger->info('Loading model โ€ฆ');
$pipeline = Pipeline::load($modelDir);
$classes  = json_decode(file_get_contents($modelDir . '/classes.json'), true);
$logger->info('Model loaded', ['classes' => $classes]);

// โ”€โ”€ 2. Load unknown CSV โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
$logger->info('Loading unknown data โ€ฆ');
$df   = DataFrame::fromCSV($unknownCsv, false);
$cols = $df->columns();

// Check if last col is a label (STRING) or a feature (float32)
$dtypes      = $df->dtypes();
$lastCol     = end($cols);
$hasLabels   = ($dtypes[$lastCol] === 'string');

$X       = $df->drop($hasLabels ? [$lastCol] : [])->toTensor();
$dataset = new Dataset($X);
$logger->info('Data ready', ['rows' => $dataset->numRows(), 'features' => $dataset->numColumns()]);

// โ”€โ”€ 3. Predict โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
$logger->info('Predicting โ€ฆ');
$predIndices = $pipeline->predict($dataset)->toFlatArray();   // [N] class indices

// โ”€โ”€ 4. Evaluate if labels available โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
if ($hasLabels) {
    $yTrue = $df->castToFloat($lastCol)->col($lastCol)->squeeze();
    $predT = \Pml\Tensor::fromArray($predIndices);
    $acc   = (new Accuracy())->score($predT, $yTrue);
    $logger->info(sprintf('Test accuracy: %.4f  (%.2f%%)', $acc, $acc * 100));
}

โ–ถ๏ธ Running the Example

๐Ÿ” EDA

php eda.php

๐Ÿง  Training

php train.php  //softmax

php trainMLP.php

๐Ÿ”ฎ Prediction

php predict.php

๐Ÿ“Š Interpreting Results

  • Accuracy โ†’ Overall correctness
  • Multi-class Predictions โ†’ Output label among 5 classes

๐Ÿš€ Extending the Tutorial

  • Increase epochs for better accuracy
  • Try deeper architectures
  • Experiment with other classifiers
  • Add cross-validation

๐Ÿ Conclusion

You now have a complete workflow for building a multi-class DNA classifier in PHP.


โค๏ธ Final Note

Push PHP beyond traditional limits โ€” even into machine learning.

Happy coding! ๐Ÿš€

About

An example project demonstrating the use of machine learning to identify microbe taxonomies from their DNA sequences.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages