Skip to content

Train your Miner with previously entered data. Start collecting data from them automatically. Supports semi-structured data-sctructures (such as PDF invoices) and unstuctured data (free text like emails)

License

Notifications You must be signed in to change notification settings

apajo/php-data-miner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PHP Data Miner

Introduction

php-data-miner extracts data from structured data formats (such as PDF documents).

Prerequisites

  • Unix-like OS
  • GNU Make
  • PHP (>=7.4)
  • NodeJS (>=8)

Installation

$ sudo apt-get install make gcc gfortran php-dev libopenblas-dev liblapacke-dev re2c build-essential

Usage

Annotation

Annotate your model with @Model() and properties with @Property() annotations.

use PhpDataMiner\Model\Annotation\Model;
use PhpDataMiner\Model\Annotation\Property;

/**
 * @Model()
 */
class Invoice
{
    /**
     * @var string
     * @Property()
     */
    protected string $number;
}

Create your miner

$miner = $this->miner->create($entity, [
    'storage' => new CustomStorage(),
    'property_types' => [
        new FloatProperty(),
        new IntegerProperty(),
        new DateProperty(),
        new Property(),
    ]
]);

$pdfContents = shell_exec('pdftotext -layout incoice.pdf -');

$doc = $miner->normalize($pdfContents, [
    'filters' => [
        DateFilter::class,
        ColonFilter::class,
        Section::class,
        WordTree::class,
    ]
]);

$entity = new Invoice();

You need to have pdftotext installed to read PDF contents like shown above

  • filters (or transformers) transform and normalize the content
  • WordTree filter is as special kind of tokenizer for nesting and grouping the contents (by rows, columns, sentences etc)

It's recommended that you place your tokenizers as the last ones in the filters list

Training

Train your model with data you've already entered (supervised learning):

...

$trainedProperties = $miner->train($entity, $doc);

Mining (or predicting)

Apply predicted data to your model:

...

$predictedProperties = $miner->predict($entity, $doc);

Entry discrimination (filtering)

Edit your storage model PhpDataMiner\Storage\Model\Model::createEntryDiscriminator() method to set entry filter:

use PhpDataMiner\Storage\Model\Model;
use PhpDataMiner\Storage\Model\ModelInterface;

class InvoiceModel extends Model implements ModelInterface
{
    public static function createEntryDiscriminator($invoice): DiscriminatorInterface
    {
        return new Discriminator([
            $invoice->getClient() ? $entity->getClient()->getId() : null,
            $invoice->getId(),
        ]);
    }
}

Versioning

Version numbering is done following the semantic versioning

TODO

  • Natural language toolkit (NLTK) support
  • Feature vectors for properties

Testing

$ make tests [test_name]

About

Train your Miner with previously entered data. Start collecting data from them automatically. Supports semi-structured data-sctructures (such as PDF invoices) and unstuctured data (free text like emails)

Topics

Resources

License

Stars

Watchers

Forks