
Groovity data module

Groovity-data is a data access module offering domain-agnostic conventions for developing scalable groovity applications on top of arbitrary and disparate data sources. While groovity-core contains all the logical building blocks you need to develop highly scalable applications, groovity-data goes a step further and defines fundamental patterns for organizing your code in a consistent fashion to achieve excellent testability, reusability and maintainability.

Data Types: Domain classes and Data Models

As you model an application, you typically design a number of domain classes. For example, groovity's sample webapp has a User class to represent users, and a Note class to represent notes that users have made.

To make domain classes available in groovity-data, we define one factory script per domain class under /data/types/ that can produce new instances of that domain class, or data models. In the sample webapp, that means there are "/data/types/note.grvt" and "/data/types/user.grvt" scripts, each of which returns from its script body newly constructed Note and User DataModels, respectively.
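
To give a sense of the shape of a type script, here is a minimal sketch of what a /data/types/user.grvt might contain; the configuration values and fields shown here are illustrative rather than the sample webapp's exact implementation (the conf options are covered in detail below).

static conf = [
	source : 'sql',
	ttl : 60,
	refresh : 30,
	'sql.tableName' : 'users',
	'sql.dataSource' : 'sampleDB'
]

class User implements DataModel, Stored{
	String userName
	String digest
}

//the script body returns a bare, newly constructed DataModel
new User()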

Users of the data model do not load these data type factory scripts directly; instead data models are always acquired via a global data factory, which in turn automatically delegates to type factory scripts.

So for example, to create and store a new user DataModel in the sample application:

load '/data/factory'

def newUser = factory('user')
newUser.userName="Mickey"
newUser.digestPassword("Minnie123")
newUser.store()

Calling the factory with a single string argument is how you construct new, bare data models from any domain class that has a factory under /data/types/. You can also call the factory with a Pointer, or with a second string argument, an id, which is used to load raw data from a data source and ingest it into the DataModel via the Ingest interface.
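
For example, assuming a user DataModel has already been stored and assigned an id by its data source (the '123' below is a hypothetical id), it can be retrieved by type and id, or re-resolved from an existing Pointer:

load '/data/factory'

def existingUser = factory('user', '123')
//a DataModel's pointer can be passed back to the factory to resolve it again
def reloaded = factory(existingUser.pointer)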

Model

The Model interface defines map-style put and putAll methods for absorbing loosely structured data into a domain model, for example from JSON calls or SQL result sets. It also defines a groovy each{} iterator for visiting all the fields in a Model with a ModelConsumer or closure. Groovity automatically discovers all the properties of a class using Groovy's MetaClass and MetaProperties, and uses them to provide a default implementation of the each() and put() methods for all Models. Model also defines static each and put methods that provide the same default behavior for arbitrary objects that do not implement Model, for example java beans or raw Maps and Lists; this can be used to create deep copies of arbitrary object graphs using Model.copy.
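
As a minimal sketch of this default behavior, assuming a simple class that implements Model and relies entirely on the property discovery described above:

class Book implements Model{
	String title
	String author
}

def book = new Book(title: 'Sample Title', author: 'Sample Author')
book.put('author', 'Another Author')             //map-style put backed by MetaProperties
book.each{ k, v -> log(info: "${k} = ${v}") }    //visit every discovered field
def copied = Model.copy(book)                    //deep copy via the static Model.copy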

ModelVisitor provides an extension of ModelConsumer that in addition to accepting key/value pairs, can visit the other 4 high-level types in a model:

* visitList(Iterable list)
* visitListMember(Object o)
* visitNull()
* visitObject(Object o)
* visitObjectField(String name, Object value)  // aliased to ModelConsumer.call()

ModelWalker provides a concrete base class for ModelVisitors that keeps track of field state, breaks circular object references, and allows setting a list of ModelFilter implementations to dynamically alter the model as it is being walked. ModelFilter defines a number of static convenience filter builders, e.g. ModelFilter.include('title','author','author.name'), but simple filters can also be defined using groovy closures that accept 3 arguments: key, value, and consumer. A pass-through filter would simply be { k, v, c -> c(k, v) }; a field-lowercasing filter would be { k, v, c -> c(k.toLowerCase(), v) }. ModelFilter also provides a full java API covering all 5 visitor methods with corresponding filter methods:

* filterList(Iterable list, ModelVisitor visitor)
* filterListMember(Object member, ModelVisitor visitor)
* filterNull(ModelVisitor visitor)
* filterObject(Object obj, ModelVisitor visitor)
* filterObjectField(String name, Object value, ModelVisitor visitor)

A groovy closure used as a filter is automatically mapped to the fifth method, filterObjectField.

ModelCollector, ModelJsonWriter and ModelXmlWriter are subclasses of ModelWalker that provide implementations supporting filtered reduction and serialization of object models. The Model writers are also used as the default JSON and XML serializers in groovity, so you can pass filters to the write tag to create custom output.

write(
	value : myModel,
	filter : [
		ModelFilter.include('title', 'description', 'date'),
		ModelFilter.transform('date', { it.format 'yyyy-MM-dd' }),
		{ k, v, c ->
			if(k.length() > 6 && !v.endsWith(' copyright myOrg')){
				v += ' copyright myOrg'
			}
			c(k, v)
		}
	]
)

ModelCollector is used by the map() function automatically inherited by all Model implementations, and can also be used with filters.

def flattenedMap = myModel.map(
	ModelFilter.collapse('pointer'),
	ModelFilter.promote('place.name')
)

DataModel

DataModel is a groovy trait that applies both the Model and Ingest APIs to a class, along with a pointer field; the Pointer contains the type and id of the DataModel, constituting a reference that can be used to retrieve, and possibly store and delete, the model via the common data factory. DataModels can be composed from multiple traits, whose fields and methods are merged together into the class.

Data models MUST implement DataModel in order to work as a data factory type. Additional behavioral traits are then stacked on top of that foundation. Here is an example domain class representing an Author that implements the Stored trait, and an IsPerson trait that adds name fields to the type.

class Author implements DataModel, Stored, IsPerson{
	String twitter
	String facebook
}

Ingest

Groovity-data defines an Ingest API that is used to absorb raw data into a Model. There are two ingest methods defined, one that takes a single key/value pair, and one that takes an entire map of data. These methods are analogous to put and putAll in Model, and by default ingest delegates to put; Ingest offers an extension point for traits and DataModels that consume raw data that does not map cleanly to the java object model, providing an opportunity to convert or pre-process the raw data before applying it to the model.

For example, here is a base trait for Rss elements that defines well-known rss fields along with a flexible map of attributes; at ingest time it picks a single title from feeds that might have several title elements, and copies unknown properties (those that fail to put) into its attributes map.

trait RssElement implements Model, Ingest{
	String title
	String link
	String description
	Map<String, Object> attributes = [:]

	boolean ingest(String key, Object value){
		if(key=='title' && value instanceof Collection){
			Collection c = (Collection) value
			title = c.isEmpty() ? null: c.first()
			return true
		}
		if(!put(key, value)){
			attributes.put(key, value)
		}
		true
	}
}

Here, then, is an Item domain class that extends RssElement with a date and a guid; it has ingest logic to parse the textual date format used in RSS, and delegates other fields to the ingest method defined in the RssElement trait. Because RssElement's ingest method delegates to put, it will automatically handle the guid field defined in this inheriting class.

class Item implements Ingest, RssElement{
	String guid
	Date pubDate

	boolean ingest(String key, Object value){
		if(key=='pubDate' && value instanceof String){
			pubDate = Date.parse("EEE', 'dd' 'MMM' 'yyyy' 'HH:mm:ss' 'Z", value)
			return true
		}
		RssElement.super.ingest(key,value)
	}
}

Finally, here is a DataModel that ingests the full raw RSS feed and transforms its "channel" into an array of Items.

class Rss implements DataModel, RssElement{
	Item[] items

	boolean ingest(String k, Object v){
		if(k=='channel'){
			items = v.item.collect{ new Item().ingest(it) }.toArray(new Item[0])
			this.title = v.title
			this.link = v.link
			this.description = v.description
			return true
		}
		RssElement.super.ingest(k, v)
	}
}

Data Factory

Because Rss implements DataModel, it is ready to be accessed via the data factory. It just needs to be placed in a type script, e.g. /data/types/rss.grvt, that defines its data source, cache TTL and refresh interval (in seconds), and returns a bare instance. That script would contain the above classes along with this simple configuration:

public static conf = [
	source : 'http',
	ttl : 60,
	refresh : 15
]

new Rss()

Now to retrieve a fully populated Rss DataModel, you would simply call the factory with the type and an http url.

load '/data/factory'
def rss = factory('rss','http://some.domain/some.rss')
rss.items.each{
	...
}

Repeated calls to the factory for the same URL will hit the factory cache; cache hits after the refresh interval has passed, but before the TTL removes the DataModel from cache, will trigger a background refresh of the data, so that frequently accessed DataModels are almost always refreshed in the background.

Store and Stored

Store is an interface with store() and delete() methods that traits can implement to extend those operations. Stored is the base trait for persistent data models, on top of which additional traits implementing Store can be applied; traits implementing Store allow the store and delete behavior of a Stored data model to be chained among the traits. This is useful for traits that need to perform before or after actions during store and delete operations, for example validation, defaulting, or cascading updates and deletes. Here is a pair of handy traits that implement modified and created fields; the HasCreated trait automatically sets the created date the first time the DataModel is stored, and the HasModified trait updates the modified date every time the DataModel is stored.

trait HasModified implements Store{
	Date modified

	Store store(){
		modified = new Date()
		super.store()
	}
}


trait HasCreated implements Store{
	Date created

	Store store(){
		if(!created){
			created = new Date()
		}
		super.store()
	}
}

And here is a HasRowId trait; it takes care of keeping the data pointer and the "id" field in the model synchronized both ways, by passing updates to the ID through to the pointer, and pulling the ID from the pointer after a store() operation that might generate a new primary key.

trait HasRowId implements Store{
	long id

	def setId(Number number){
		id = number.toLong()
		setPointer(new Pointer(pointer?.type, id.toString()))
	}

	def setId(String str){
		if(str.isLong()){
			setId(str.toLong())
		}
		else{
			setId(0)
		}
	}

	Store store(){
		super.store()
		//grab ID from the pointer AFTER store is complete in case this was an insert
		id = pointer.id.find(/\d+/).toLong()
		this
	}

}

And here is an example Stored DataModel that applies these three traits along with its own domain-specific fields. If the domain class has no store- or delete-time behavior of its own, it does not have to redeclare those methods; they are automatically chained through the traits from right to left. It is therefore important that the Stored trait come immediately after DataModel and before any traits implementing Store; Stored implements the terminal store and delete methods that actually reach out to the underlying data source after the other traits have called super.store().

class Foo implements DataModel, Stored, HasModified, HasCreated, HasRowId{
	String name
	String description
}

Pointers

DataModels retrieved from the data factory always carry a Pointer that describes the data type and ID that were used to retrieve the model from the factory, or in the case of newly created and stored models, the pointer allows your application to discover the ID generated for a DataModel by its data source. The pointer allows stored data models to be updated or deleted in the underlying data source; the pointer can also be used to model foreign-key style relationships between domain classes.

//example of how we could store a collection of pointers
//and at runtime offer resolution to DataModels

Pointer[] references
public Collection<DataModel> resolveReferences(){
	load('/data/factory')(references)
}

Data sources that support queries may actually return a list of pointers; in this case the "ID" in the factory call is some type of source-specific query. If the source returns a list of pointers (i.e. a list of primary keys for DataModels that match the query), the factory caches the list of pointers in association with the query, and each time you run the query the pointers are dynamically dereferenced from the cache. While the list of pointers itself may become stale, the DataModels in the list you retrieve are always the same as if you had queried the factory with a primary key.

A Pointer or type and ID can also be used to request a cache refresh or invalidation. Normally the factory caches DataModels according to rules defined in the type configuration, and automatically invalidates the cache for the primary key when a Stored DataModel is stored or deleted. You may want to add logic to your application to invalidate queries as well, to force an update to the list of pointers returned. For example, if you query for a DataModel and get no results, then create a DataModel, you might invalidate the query cache to make sure the next query gets the new result and not the cached empty list.

load '/data/factory'

def mUser = factory('user','userName=Mickey')?.find()
if(!mUser){
	mUser = factory('user')
	mUser.userName='Mickey'
	mUser.store()
	factory.invalidate('user','userName=Mickey')
}

Data Sources

As far as groovity-data is concerned the id in a pointer or factory call is an opaque string, but it should have very specific meaning to the data source that resolves it. For example, groovity-data comes with an "http" data source that takes URLs as data ids; it also provides an "http" data type that provides convenient default handling of well-formed XML and JSON responses.

load '/data/factory'

def posts = factory('http','https://jsonplaceholder.typicode.com/posts')
posts.findAll{ it.userId==2 && it.body.contains('quia') }.each{
	log(info:"Found quia post ${it}")
}

Data sources are associated with data types in the conf map of the data type; the type configuration is also passed through to the data source, so types may carry datasource-specific configuration, such as the table name for a SQL data source. Using conf for this purpose keeps groovity-data naturally flexible, since data source configuration can be managed with standard, per-environment configuration mechanisms. Here is an example from the groovity sample webapp; the Note class is configured by default to acquire data from the "notes" table in the "sampleDB" SQL DataSource using the general purpose SQL data loader (located at /data/sources/sql.grvt) built into the groovity-sql module.

static conf=[
	source:'sql',
	ttl:60,
	refresh:30,
	'sql.tableName':'notes',
	'sql.dataSource' : 'sampleDB'
]

@ModelOrder(['message','id','userId','worldRead','created','modified'])
public class Note implements DataModel, Stored, HasRowId {
	long userId;
	String message;
	boolean worldRead;
	Date created;
	Date modified;

	@ModelSkip public String getUserName(){
		if(userId){
			return load('/data/factory')('user',userId.toString())?.userName
		}
	}

}

new Note()

The factory calls upon the data source to produce raw data for a given id. If the raw data is in Map form, the map is ingested into a DataModel constructed by the type script. If the raw data is a List of Maps, the factory produces a list of DataModels by running the type script repeatedly to create each element in the list, unless the DataModel type is itself a List, in which case a single instance of the DataModel list is created and the raw values are added to it.

The Note example shows how a "virtual" property of a DataModel can perform a factory lookup under the hood, in this case to resolve a user's name from their ID. The Note class makes use of two special groovity annotations: @ModelSkip is used to prevent the computed property "userName" from being iterated in the default each() implementation; this annotation can be applied either to a field or a getter. @ModelOrder allows the class to control the iteration order of fields in the default each() method (which is otherwise alphabetical by field name), and may refer to trait fields as well as locally declared fields.
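
As a quick illustration of the effect of these annotations (the note id '1' here is hypothetical), iterating or mapping a Note follows the declared order and skips the computed userName:

load '/data/factory'

def note = factory('note', '1')
note.each{ k, v -> log(info: "${k} = ${v}") }    //fields visited in @ModelOrder order; userName is skipped
def noteMap = note.map()                         //the collected map follows the same ordering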

If you have virtual properties in your model that you would like to iterate for serialization purposes, but which need to be filtered out before storage to a database backend, you can define a storeFilters() method on your Stored DataModel to control how the model is mapped back to storage. Here is an example from the portal application: a Delivery contains dynamic references to both the recipient, getPerson(), and the message, getNotice(), which do not belong in the delivery table; the table just stores the foreign keys of those references in the personId and noticeId fields. The storeFilters method can be implemented by traits and chained; because it provides a List of ModelFilters, traits can add additional filters that are specific to the fields and methods in that trait.

static conf=[
	source:'sql',
	ttl:120,
	refresh:90,
	'sql.dataSource':'portalDB',
	'sql.tableName':'delivery',
	'sql.dateCol':'delivered'
]

class Delivery implements DataModel, Stored, HasRowId{
	long personId
	long noticeId
	Date delivered

	void storeFilters(List<ModelFilter> filters){
		filters.add(ModelFilter.exclude('person','notice'))
		Stored.super.storeFilters(filters)
	}

	Delivery store(){
		if(!delivered){
			delivered = new Date()
		}
		HasRowId.super.store()
		load('/data/factory').refresh('delivery',"personId=${personId}")
		this
	}

	public void delete(){
		Stored.super.delete()
		load('/data/factory').invalidate('delivery',"personId=${personId}")
	}

	public Object getPerson(){
		load('/data/factory')('person',personId.toString())
	}

	public Object getNotice(){
		load('/data/factory')('notice',noticeId.toString())
	}
}

new Delivery()

Shared

By default, the data factory makes a deep copy of DataModels upon retrieval; this way models can be mutated without concern for polluting the cache or encountering threading issues. However, for high-throughput applications that use complex read-only data models, the overhead of making these deep copies can become a noticeable bottleneck. So Groovity provides a Shared interface to mark a DataModel as shared, which means the factory should return the cached DataModel without making a copy, and entrust the application either NOT to perform mutations, OR to explicitly call Model.copy() BEFORE making mutations.

For example, consider a sample /data/types/story.grvt

public static conf = [
	source : 'memory',
	ttl : '30',
	refresh : '15'
]

class Story implements DataModel, Stored, Shared{
	String headline
	String body

	public int getWordCount(){
		if(!body){
			return 0
		}
		body.tokenize().size()
	}

	void storeFilters(List<ModelFilter> filters){
		filters.add(ModelFilter.exclude('wordCount'))
		Stored.super.storeFilters(filters)
	}
}

new Story()

Now when you retrieve a story from the factory it will be a single, shared instance; to make a copy safe for mutations you can manually copy it.

load '/data/factory'

def story = factory('story','12345').copy()
story.headline = "My New Headline"
story.store()

Watching Data

If you are using a data source that supports date range queries, such as a SQL or ElasticSearch database, you can use the data factory to "watch" for updates to data, and react to them programmatically. You can watch either a type, to react to any updated data of that type, or you can watch a type+key, for example to only watch a specific DataModel or DataModels matching a specific query. Your watcher will receive a pointer to every DataModel that triggers the watch, so that depending on your use case you may ask the factory to invalidate or refresh that pointer before retrieving the data.

Creating a watch returns a ScheduledFuture you can use to cancel your watch later.

static ScheduledFuture myWatcher

static start(){
	def factory = load('/data/factory')
	myWatcher = factory.watch('widget'){ ptr->
		factory.invalidate(ptr)
		def widget = factory(ptr)
		//now do something with that fresh widget ...
	}
}

static destroy(){
	myWatcher.cancel(true)
}

Under the hood, data watchers poll data sources with a frequency defaulting to once a second; you can override this by passing in a different polling interval and time unit. Here is an example that watches a certain query once a minute:

factory.watch('widget','color=blue', 1, TimeUnit.MINUTES){ ptr->
	//perhaps widgets have a 30 minute cache refresh interval,
	//but blue widget users can't wait that long
	factory.refresh(ptr)
}

Both the SQL and ElasticSearch data sources support watching; data types are configured with the name of the field that holds the date to be used as the trigger. For SQL data sources, configure sql.dateCol; for ElasticSearch, use es.date:

//SQL data type configuration
static conf=[
	source:'sql',
	ttl:'10',
	refresh:'7',
	'sql.dataSource' : 'wagonDB',
	'sql.tableName':'wagon',
	'sql.dateCol':'modified'
]
//ElasticSearch data type configuration
static conf=[
	source:'elasticsearch',
	ttl:60,
	refresh:45,
	'es.index':'unit_test_shoe_inventory',
	'es.type':'shoe',
	'es.date':'modified'
]

Under the hood, watches use dateRange queries; you can also manually fire a single date-range query via the factory. This will run in the calling thread.

load '/data/factory'

factory.dateRange('shoe', 'mens=true', afterTimeMillis, beforeTimeMillis){ ptr->
	offer(channel:'new-shoe'){ factory(ptr) }
}

However, be aware that to guarantee freshness, dateRange queries are never cached, so generally you are best advised to run them in the background via a watch(). The dateRange function is provided primarily to facilitate unit testing against canned data sources, which don't function in real time.