Data Science tools for MarkLogic

This project provides a baseline of add-on functionality for MarkLogic Server. It implements a range of statistical and data science algorithms, most of them in several variants so that you can pick the one that suits your needs.

The algorithms can be invoked through a REST API. They can also be called as a library by code that already runs inside MarkLogic Server.

Motivation

Various customers, partners, prospects, and MarkLogic employees are interested in "Doing more with their data". Running the analysis inside MarkLogic Server lets them minimise the number of tools they use, and lets them use the shared compute power of a MarkLogic Server cluster to process datasets they could not feasibly process on their own machines.

Others need to take snapshots of data, prepare them for statistical analysis, generate temporary fields, and then perform matching or other analysis, with the modified dataset being made available to other data scientists for similar but subtly different needs. One example is analysing several different treatment effects within the same data set using propensity score matching (PSM).

Implementation

Several implementations of each method are provided. This is partly because each user's needs vary, and partly so that the caller can take advantage of MarkLogic Server features where they are enabled.

One good example is the use of range indexes. These allow less-than/greater-than comparisons, and also fast lexicon lookups and distributed analysis using MarkLogic User Defined Functions (UDFs). In operational systems, range indexes are usually configured for frequently used fields.

In a data science context, however, range indexes may not be configured for the many derived fields in a prepared data set. In that case a different implementation is needed, and the caller chooses the appropriate method. Leaving those fields unindexed is normally not a problem in a non-operational, batch analysis use case, where the extra storage that indexing requires may not be a good trade-off for rarely used indexes.
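
As a rough illustration of the non-indexed path, the sketch below computes a mean by visiting every document and reading the field directly, which is the kind of work a brute-force implementation has to do when no range index exists. It is plain JavaScript over an in-memory array, not this project's code; `bruteForceMean`, the `docs` array, and the `age` field are hypothetical names chosen for the example.

```javascript
// Hypothetical sketch: mean of a field computed by scanning every document.
// No index is consulted, so the cost grows with the number of documents read.
function bruteForceMean(docs, field) {
  let sum = 0;
  let count = 0;
  for (const doc of docs) {
    const value = doc[field];
    // Skip documents where the field is missing or not a finite number.
    if (typeof value === "number" && Number.isFinite(value)) {
      sum += value;
      count += 1;
    }
  }
  // Return null rather than NaN when no document carries the field.
  return count === 0 ? null : sum / count;
}

const docs = [{ age: 30 }, { age: 40 }, { name: "no age" }, { age: 50 }];
console.log(bruteForceMean(docs, "age")); // 40
```

An index-backed variant would instead read the values from a lexicon, which is why the caller has to know whether the field is indexed before choosing a method.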

Implemented functions

Currently the following are available out of the box in MarkLogic Server, and are not part of this project:-

sum, count, mean, mode, median, standard deviation, standard deviation p, variance, variance p
linear-model (i.e. linear regression)
value lexicon lookups
tuples for co-occurrence
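
The "p" variants in the list above are the population forms of each statistic: the plain variance divides the sum of squared deviations by n - 1, while the population variance divides by n. The sketch below shows the difference in plain JavaScript. The function name `variance` and its `population` flag are illustrative only and are not MarkLogic's API.

```javascript
// Illustrative only: sample variance (divide by n - 1) versus
// population variance (divide by n).
function variance(values, population = false) {
  const n = values.length;
  const divisor = population ? n : n - 1;
  // Undefined for an empty list, or for a single value in the sample case.
  if (divisor <= 0) return null;
  const mean = values.reduce((acc, v) => acc + v, 0) / n;
  const sumSquares = values.reduce((acc, v) => acc + (v - mean) ** 2, 0);
  return sumSquares / divisor;
}

const sample = [2, 4, 4, 4, 5, 5, 7, 9]; // mean 5, squared deviations sum to 32
console.log(variance(sample).toFixed(3));      // "4.571" (32 / 7)
console.log(variance(sample, true));           // 4      (32 / 8)
console.log(Math.sqrt(variance(sample, true))); // 2, the population standard deviation
```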

The following are implemented on top of MarkLogic Server by this project:-

An XQuery mean brute force function
Log-linear regression in JavaScript, XQuery, and as a UDF
Logistic regression in XQuery and as a UDF
kNN in XQuery using search scoring to determine Euclidean distance (requires range indexes)
kNN in XQuery using manual Euclidean calculation
kNN in XQuery using manual Euclidean calculation, but parallelised to match more ‘treated’ results at once on multi-core machines
kNN in XQuery using UDF Euclidean calculation, with the outer loop parallelised in XQuery, as above
A group-by UDF that calculates Mean and Sum summarised by category
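
The "manual Euclidean calculation" kNN variants listed above amount to computing the distance from a treated record to every candidate record and keeping the k closest. A minimal sketch of that idea in plain JavaScript follows. It is not this project's XQuery or UDF code; `euclidean`, `nearestNeighbours`, and the field names `age` and `income` are hypothetical, and real use would normally scale the fields first so that no single field dominates the distance.

```javascript
// Euclidean distance between two records over the named numeric fields.
function euclidean(a, b, fields) {
  let sumSquares = 0;
  for (const f of fields) {
    const diff = a[f] - b[f];
    sumSquares += diff * diff;
  }
  return Math.sqrt(sumSquares);
}

// Return the k candidates closest to the target, nearest first.
function nearestNeighbours(target, candidates, fields, k) {
  return candidates
    .map((c) => ({ record: c, distance: euclidean(target, c, fields) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

const treated = { id: "t1", age: 40, income: 50 };
const controls = [
  { id: "c1", age: 43, income: 54 }, // distance 5
  { id: "c2", age: 40, income: 51 }, // distance 1
  { id: "c3", age: 60, income: 20 }, // much further away
];

const matches = nearestNeighbours(treated, controls, ["age", "income"], 2);
console.log(matches.map((m) => m.record.id)); // [ 'c2', 'c1' ]
```

Each treated record is matched independently of the others, which is what makes the outer loop a natural candidate for the parallelised variants.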

Geospatial analytics functions are currently out of scope for this project, but may be added in future.

If there are any functions you wish were implemented, please add an Issue on GitHub.

Design

There is not a lot of design documentation yet. Below is what is currently available:-

REST Stored Procedure Pattern
