Add relation duplicate checks and creation strategies #11

Closed
jedateach opened this issue Mar 31, 2015 · 1 comment

Comments

@jedateach
Member

Sometimes you may want to check if a record is a duplicate based on a relation being the same. E.g. ProductSelection is a duplicate when the ProductID is the same.

We don't always want relation objects to be created, so the behaviour should be configurable.

The existing behaviour does not allow duplicate checks based on a relation ID, and relation objects are always created.

Current duplicate check / loading issues

The current loading approach is:

  1. Find (via the duplicate check) or create a new $dataobject
  2. Loop over each record field to find or create and set any relations (known as the first run)
    • Dot notation or a callback can be used to reference the relation object
    • Write new relation objects, if not already written
  3. Loop over each field and set data on the $dataobject using ->update() (known as the second run)
    • When dot notation is used a write() will be performed on the $dataobject and new relation objects.

Reasoning for the two-phase (first/second run) approach, as given in the existing code comments:

//find/create any relations and store them on the object
//we can't combine runs, as other columns might rely on the relation being present

Cyclic dependency prevents simple solution

Relation callbacks currently run after the duplicate checks. To introduce relation-based duplicate checks, we need to fire the developer-configured relation/dot notation callbacks, which reside either on a subclass of BulkLoader or on a singleton of the relation object class. This puts us in a chicken-and-egg situation: we need the relation in order to perform the duplicate check, but the duplicate check runs before the relation callbacks, because callbacks may need to be fired on an existing object.

Update() method is not flexible

The ->update() method has fixed behaviour that can't really be customised, which makes it limiting.

Proposed solution

A new approach could be:

  1. Loop over fields in columnMap, populating a placeholder DataObject with relation ids, and fields from record data. Callbacks can transform data or retrieve relation objects.
  2. Find existing objects matching the specified duplicateCheck fields on the placeholder. Depending on configuration, either update the existing object, save the placeholder, or do nothing. Duplicate checks could span multiple fields, e.g. ProductID && Size.

Because we loop over the columnMap, rather than the record itself, we can configure the order in which fields are imported. So if importing one field relies on another, there is no need for the two-phase approach.
If a columnMap is not provided, then the mappable columns need to be scaffolded.
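
The two proposed steps can be sketched in plain PHP. This is only an illustration under stated assumptions, not the module's implementation: the `$existing` array stands in for the DataObject table, and `buildPlaceholder()`, `findDuplicate()`, `importRow()` and the `$onDuplicate` option are hypothetical names.

```php
<?php
// Hypothetical stand-in for the stored DataObjects: rows keyed by ID.
$existing = array(
    1 => array("ProductID" => 10, "Size" => "M", "Quantity" => 1),
);

// Column map, listed in the order the fields should be imported.
$columnMap = array(
    "product" => "ProductID",
    "size"    => "Size",
    "qty"     => "Quantity",
);

// A record is a duplicate only if ALL of these fields match.
$duplicateChecks = array("ProductID", "Size");

// Step 1: populate a placeholder by looping over the columnMap, not the row.
function buildPlaceholder(array $row, array $columnMap)
{
    $placeholder = array();
    foreach ($columnMap as $column => $field) {
        if (array_key_exists($column, $row)) {
            $placeholder[$field] = $row[$column];
        }
    }
    return $placeholder;
}

// Step 2a: return the ID of the first existing row matching every check field.
function findDuplicate(array $placeholder, array $existing, array $checks)
{
    if (empty($checks)) {
        return null; // no checks configured: nothing counts as a duplicate
    }
    foreach ($existing as $id => $candidate) {
        $matches = true;
        foreach ($checks as $field) {
            if (!isset($placeholder[$field], $candidate[$field])
                || $candidate[$field] != $placeholder[$field]) {
                $matches = false;
                break;
            }
        }
        if ($matches) {
            return $id;
        }
    }
    return null;
}

// Step 2b: update, create, or skip depending on configuration.
function importRow(array $row, array $columnMap, array $checks, array &$existing, $onDuplicate = "update")
{
    $placeholder = buildPlaceholder($row, $columnMap);
    $id = findDuplicate($placeholder, $existing, $checks);
    if ($id === null) {
        $existing[] = $placeholder;
        return "created";
    }
    if ($onDuplicate === "update") {
        $existing[$id] = array_merge($existing[$id], $placeholder);
        return "updated";
    }
    return "skipped";
}
```

Calling `importRow(array("product" => "10", "size" => "M", "qty" => "3"), $columnMap, $duplicateChecks, $existing)` would update row 1 rather than add a second ProductSelection, because both ProductID and Size match.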

We would need to somehow ensure that callbacks don't try to write the placeholder object. Otherwise, DataObjects that should never be persisted could end up saved.
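
One possible way to guard against this, sketched here as an assumption rather than a decided design, is to hand callbacks a placeholder whose write() refuses to persist. `PlaceholderRecord` is a hypothetical class and does not extend DataObject.

```php
<?php
/**
 * Hypothetical placeholder: stores field values in memory and
 * throws if anything tries to persist it.
 */
class PlaceholderRecord
{
    private $fields = array();

    public function __set($name, $value)
    {
        $this->fields[$name] = $value;
    }

    public function __get($name)
    {
        return isset($this->fields[$name]) ? $this->fields[$name] : null;
    }

    public function __isset($name)
    {
        return isset($this->fields[$name]);
    }

    // Export the collected values so the loader can copy them to a real object.
    public function toMap()
    {
        return $this->fields;
    }

    public function write()
    {
        throw new LogicException("Placeholder records must not be written");
    }
}
```

A callback that sets `$placeholder->FirstName` works as usual, while a callback that calls `$placeholder->write()` fails loudly instead of silently saving a half-built record.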

To tidy up the configuration system, I think that all of the callbacks should be anonymous functions (Closures), instead of strings naming functions on $obj and subclasses of BulkLoader.
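
A rough comparison of the two configuration styles, using plain PHP only. `ExampleLoader`, `splitName` and the `($value, $obj)` argument order are hypothetical and chosen for illustration.

```php
<?php
// String style: the loader has to find a method with this name on some object.
class ExampleLoader
{
    public function splitName($value, $obj)
    {
        $parts = explode(" ", $value, 2);
        $obj->FirstName = $parts[0];
        $obj->Surname = isset($parts[1]) ? $parts[1] : "";
    }
}
$loader = new ExampleLoader();
$stringConfig = array("Name" => "splitName");

// Closure style: the behaviour lives in the configuration itself.
$closureConfig = array(
    "Name" => function ($value, $obj) {
        $parts = explode(" ", $value, 2);
        $obj->FirstName = $parts[0];
        $obj->Surname = isset($parts[1]) ? $parts[1] : "";
    },
);

// The string must be resolved against an object before it can be called...
$viaString = new stdClass();
call_user_func(array($loader, $stringConfig["Name"]), "joe bloggs", $viaString);

// ...whereas the closure is directly callable.
$viaClosure = new stdClass();
call_user_func($closureConfig["Name"], "joe bloggs", $viaClosure);
```

Both calls produce the same result; the closure version simply removes the need to decide which object a method name should be looked up on.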

Whilst the ->update() function may continue to be used, by the time it is reached, it will not contain any dot notation fields that would trigger a potential relation creation. Relation creation will be handled separately to make it configurable.

Here is some pseudo code demonstrating how data could go through a new process:

<?php

/**
 * process:
 * raw data is extracted using BulkLoaderSource as iterable rows
 * row data is mapped into a standardised form
 * standard form is transformed into a placeholder dataobject
 */

//raw data
$rawdata = "name,age,country
            joe bloggs,62,NZ
            alice smith,24,AU
            ";

//CSVBulkLoaderSource parses raw into records
$rows = array(
    array("name" => "joe bloggs", "age" => "62", "country" => "NZ"),
    array("name" => "alice smith", "age" => "24", "country" => "AU")
);

//mapping for getting data into a standard form
//(either hard-coded, or defined by user)
$mapping = array(
    "first name" => "FirstName",
    "last name" => "Surname",
    "name" => "Name",
    "age" => "Age",
    "country" => "Country.Code",
);

//first record after mapping has been performed
$record = array(
    "Name" => "joe bloggs",
    "Age" => "62",
    "Country.Code" => "NZ"
);

//define how data will be transformed
$transforms = array(
    "Name" => array(
        'callback' => function($value, $obj){
           //split "joe bloggs" into a capitalised first name and surname
           $name = explode(" ", $value, 2);
           $obj->FirstName = ucfirst($name[0]);
           $obj->Surname = isset($name[1]) ? ucfirst($name[1]) : "";
        }
    ),
    "Country.Code" => array(
        "link" => true, //link up relations
        "create" => false //don't create new relation objects
    )
);

//dataobject record after transformation
//(234 is an example ID of the existing Country with Code "NZ")
$dataobj->record = array(
    "FirstName" => "Joe",
    "Surname" => "Bloggs",
    "Age" => "62",
    "CountryID" => 234
);
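
For illustration, here is a runnable sketch of how the `link` and `create` options above might be interpreted. It is an assumption, not the behaviour that shipped: `applyTransforms()` is a hypothetical helper, `$relations` is an in-memory stand-in for looking up relation objects, and a stdClass takes the place of the placeholder DataObject.

```php
<?php
// Hypothetical relation lookup: relation name => (lookup value => ID).
$relations = array(
    "Country" => array("NZ" => 234),
);

$transforms = array(
    "Name" => array(
        "callback" => function ($value, $obj) {
            $name = explode(" ", $value, 2);
            $obj->FirstName = ucfirst($name[0]);
            $obj->Surname = isset($name[1]) ? ucfirst($name[1]) : "";
        },
    ),
    "Country.Code" => array("link" => true, "create" => false),
);

function applyTransforms(array $record, array $transforms, array &$relations)
{
    $obj = new stdClass();
    foreach ($record as $field => $value) {
        $options = isset($transforms[$field]) ? $transforms[$field] : array();
        if (isset($options["callback"])) {
            // A callback takes full responsibility for the field.
            call_user_func($options["callback"], $value, $obj);
            continue;
        }
        if (strpos($field, ".") === false) {
            $obj->$field = $value; // plain field: copy as-is
            continue;
        }
        // Dot notation, e.g. "Country.Code" refers to the "Country" relation.
        list($relation) = explode(".", $field, 2);
        if (empty($options["link"])) {
            continue; // linking not enabled for this relation
        }
        $known = isset($relations[$relation]) ? $relations[$relation] : array();
        if (isset($known[$value])) {
            $obj->{$relation . "ID"} = $known[$value];
        } elseif (!empty($options["create"])) {
            // Only create a new relation entry when explicitly allowed.
            $newId = $known ? max($known) + 1 : 1;
            $relations[$relation][$value] = $newId;
            $obj->{$relation . "ID"} = $newId;
        }
    }
    return $obj;
}

$record = array("Name" => "joe bloggs", "Age" => "62", "Country.Code" => "NZ");
$obj = applyTransforms($record, $transforms, $relations);
```

With `"create" => false`, a row whose country code is unknown is left without a CountryID and no new Country entry is added.
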
@jedateach jedateach added this to the 0.1.x milestone Mar 31, 2015
@jedateach jedateach changed the title Allow duplicate checks on relation ID Relation duplicate checks and creation strategies Apr 1, 2015
@jedateach jedateach changed the title Relation duplicate checks and creation strategies Add relation duplicate checks and creation strategies Apr 1, 2015
jedateach added a commit that referenced this issue Apr 2, 2015
@jedateach
Member Author

This has been added in 0.2.0
