
GSoC2014 Proposal: Type Hinting from the start (rajul)

Rajul edited this page Apr 1, 2014 · 102 revisions

Type Hinting from the start

Project Goals

The goal of this project is to implement support for sources and parsers to mark a particular value with a type hint, which will then flow through the whole system. To accomplish this, one needs to find the right place in the code to implement the feature, without a noticeable loss of performance.

Benefits to the community

With type hinting, logging becomes self-defining and self-documenting, and parameters are automatically validated at the time of logging, as their data types are already known. Once sources and parsers can mark a particular value with a type hint, the redundant coding and processing at each destination is reduced. The data type of a value is tagged at the source itself and then passes cleanly on to the destination without much further processing.

Background

Syslog-ng has support for various logging destinations, such as data stores and message queues: for example SQL, MongoDB, JSON, and AMQP.

Syslog-ng Sources and Destinations

Syslog-ng collects its log messages from various sources: internal messages, text files, named pipes, process accounting logs on Linux, external applications, STREAMS devices on Sun Solaris, the IETF syslog protocol, the system-specific log messages of a platform, remote hosts using the BSD syslog protocol, and UNIX domain sockets.

A destination is where a log message is sent if the filtering rules match. Like sources, destinations consist of one or more drivers, each defining where and how messages are sent. Syslog-ng can store messages in a number of destinations: plain-text files, a MongoDB database, named pipes, external applications, a SQL database, a remote logserver using the IETF-syslog protocol, a remote logserver using the legacy BSD-syslog protocol, UNIX domain sockets, and a user terminal (the usertty() destination).

Current Type Hinting Support in Syslog-ng

Logging used to mean simply writing strings to the logging destination. However, many of the destinations listed above now support numerous data types besides strings, so it makes sense to take full advantage of them. Hence the concept of type hinting was introduced in syslog-ng: templates annotated with type hints can be passed to some destinations and template functions that support multiple data types, so that the destination driver can use them optimally.

At present, type hinting is implemented for the mongodb() destination and the $(format-json) template function, in /modules/afmongodb/afmongodb.c and /modules/json/format-json.c respectively. The core type hinting code lives in lib/type-hinting.c and lib/type-hinting.h.

The syntax for adding hints is fairly simple; we simply wrap the template in the hinted data type inside the key-value pair:

mongodb(
    value-pairs(
        pair("date", datetime("$UNIXTIME"))
        pair("pid", int64("$PID"))
    )
);

The types currently supported for type hinting are: boolean ("1" or anything starting with 't' or 'T' is TRUE; "0" or anything starting with 'f' or 'F' is FALSE), string, literal, int32 (common int), int64, and datetime (only UNIX timestamps are supported as of now; anything else results in a casting error).
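The casting rules listed above can be modelled in a few lines. The following Python sketch is only an illustration of those rules, not the C code in lib/type-hinting.c; the names `CastError`, `cast` and `CASTS` are hypothetical, and treating "1" and "0" as exact matches (rather than prefixes) is an assumption.

```python
from datetime import datetime, timezone

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


class CastError(ValueError):
    """Raised when a string value cannot be cast to the hinted type."""


def cast_boolean(value: str) -> bool:
    # Assumption: "1"/"0" are exact matches, 't'/'T'/'f'/'F' are prefixes.
    if value == "1" or value[:1] in ("t", "T"):
        return True
    if value == "0" or value[:1] in ("f", "F"):
        return False
    raise CastError(f"not a boolean: {value!r}")


def _cast_int(value: str, low: int, high: int, name: str) -> int:
    try:
        number = int(value, 10)
    except ValueError as exc:
        raise CastError(f"not an integer: {value!r}") from exc
    if not low <= number <= high:
        raise CastError(f"{value!r} is out of range for {name}")
    return number


def cast_datetime(value: str) -> datetime:
    # Only UNIX timestamps (seconds since the epoch) are accepted.
    try:
        return datetime.fromtimestamp(int(value, 10), tz=timezone.utc)
    except (ValueError, OverflowError, OSError) as exc:
        raise CastError(f"not a UNIX timestamp: {value!r}") from exc


# Hypothetical dispatch table: hint name -> casting function.
CASTS = {
    "boolean": cast_boolean,
    "int32": lambda v: _cast_int(v, INT32_MIN, INT32_MAX, "int32"),
    "int64": lambda v: _cast_int(v, INT64_MIN, INT64_MAX, "int64"),
    "datetime": cast_datetime,
    "string": str,
    "literal": str,
}


def cast(value: str, hint: str):
    """Cast the internal string representation according to its hint."""
    return CASTS[hint](value)
```

For example, `cast("1234", "int64")` yields the integer 1234, while `cast("3000000000", "int32")` raises `CastError` because the value does not fit into 32 bits.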

In case of type casting failures, one of three actions can be specified: syslog-ng can drop the whole message, drop the property (the syslog-ng application converts every message it receives to a set of name-value pairs called properties), or fall back to string. Each of these actions can also be specified to work silently.

options {
  typecast(on-error(silently-drop-property));
};
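To make the three failure actions concrete, here is a minimal Python sketch of how such a policy could be applied to one message. It is a model of the behaviour described above, not syslog-ng code. The function `apply_hints` and the policy names other than silently-drop-property are assumptions chosen to mirror the option shown in the configuration snippet.

```python
import logging

log = logging.getLogger("typecast-sketch")

ACTIONS = ("drop-message", "drop-property", "fallback-to-string")


def to_int(value: str) -> int:
    """Example caster: raises ValueError for non-integer strings."""
    return int(value, 10)


def apply_hints(properties, hints, on_error="drop-property"):
    """Cast hinted properties; return None if the message is dropped.

    properties: dict of name -> string value
    hints:      dict of name -> casting function (hypothetical shape)
    on_error:   one of ACTIONS, optionally prefixed with "silently-"
    """
    silent = on_error.startswith("silently-")
    action = on_error[len("silently-"):] if silent else on_error
    if action not in ACTIONS:
        raise ValueError(f"unknown on-error action: {on_error!r}")

    typed = {}
    for name, value in properties.items():
        caster = hints.get(name)
        if caster is None:
            typed[name] = value          # no hint: keep the string
            continue
        try:
            typed[name] = caster(value)
        except ValueError:
            if not silent:
                log.warning("casting %s=%r failed", name, value)
            if action == "drop-message":
                return None
            if action == "fallback-to-string":
                typed[name] = value
            # "drop-property": the property is simply left out
    return typed
```

With the message `{"PID": "abc", "MSG": "hello"}` and an integer hint on PID, silently-drop-property keeps only MSG, while the fallback action keeps PID as the string "abc".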

Type hints can be used with $(format-json) in the same way: $(format-json date=datetime("$UNIXTIME") pid=int64("$PID"))

This feature enables us to store non-string values with their proper types.

Project Idea and Design

Syslog-ng can log directly to SQL databases and TLS-encrypted message queues, and it can also modify the content of log messages, that is, parse and identify messages based on a pattern database. It can also correlate log messages to identify events, serving as a log analysis engine at the same time as serving as the log daemon.

Object definitions in syslog-ng configuration files (also called statements; each is one of source, destination, log, filter, parser, rewrite rule, or template) have the following syntax: object_type object_id {<options>};

Type Hinting

Type hinting was introduced in syslog-ng for a destination and a template function (mongodb() and $(format-json)). It makes use of their ability to store data types other than strings, so that values from log messages can be stored in their original data types, which is both more economical and more accurate.

At present, however, logs are passed as name-value pairs to these destinations, and the data type has to be spelled out explicitly within each pair, for each destination. A lot of redundant coding and processing would be saved if a value could be marked with a data type once, at the source or parser level, and then passed through the whole system to its destination without much further processing.

Syslog-ng has an internal representation for every value. This internal representation is always in the form of a string. Type-hinting will mark these values as the type that we desire them to have when stored at a particular destination. Type-hinting shall let the destination drivers know that we'd like a different representation, and they can then do the conversion themselves, if needed.

For example, when $(format-json) encounters a value with an integer type hint, it prints the value without quotes (but does not perform any type conversion). So if syslog-ng has a "foo": "1" key-value pair internally, with the value tagged as int, format-json will print it as {"foo":1} instead of {"foo":"1"}. mongodb, however, will convert it to an integer first. This is because it is a type hint and is not enforced: it is up to each destination to decide how to handle the hint.

Destinations need to deal with these hints in their own way. There are also helper functions for converting one type to another, which destinations can reuse. It is important that different destinations are free to deal with hints differently, because they have different requirements for how conversion works: format-json does no conversion at all and only changes the formatting, while mongodb does convert.
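The difference between the two destinations can be illustrated with a short Python sketch. This is a simplified model under stated assumptions, not the actual driver code: values are represented as hypothetical `(string, hint)` tuples, and `format_json` and `to_document` are illustrative names standing in for $(format-json) and mongodb().

```python
import json

INT_HINTS = ("int32", "int64")


def format_json(pairs):
    """Formatting only: integer-hinted values are printed unquoted, as-is."""
    parts = []
    for name, (value, hint) in pairs.items():
        if hint in INT_HINTS:
            rendered = value              # no conversion, just no quotes
        else:
            rendered = json.dumps(value)  # quoted and escaped JSON string
        parts.append(f"{json.dumps(name)}:{rendered}")
    return "{" + ",".join(parts) + "}"


def to_document(pairs):
    """Conversion: integer-hinted values become real integers."""
    document = {}
    for name, (value, hint) in pairs.items():
        document[name] = int(value, 10) if hint in INT_HINTS else value
    return document
```

The same hinted pair therefore ends up as unquoted text in one destination and as a native integer in the other, which is why the hint is advisory and each destination decides how to honour it.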

At present, to use type hinting, we need to tell the destination that we want a particular value with a different type. This becomes very verbose once more than a handful of properties would benefit from type hinting. For example, we currently have to write statements like $(format-json PID=int($PID) DATE=datetime($UNIXTIME) ...), whereas we aim to write the same thing as $(format-json PID=$PID DATE=$UNIXTIME), without explicitly passing the types int, datetime, and so on. For that, the sources that make $PID and $UNIXTIME available need to be able to set a type hint.

Sources work with a LogMessage object, which has no type-hinting support. Destinations, however, work with LogTemplates (formatted LogMessages), and those do support hints. A LogMessage is roughly the source of a LogTemplate, and a LogTemplate is more like a formatter: from a single LogMessage we can create many LogTemplates. For example, $(format-json PID=$PID DATE=$UNIXTIME MESSAGE=$MSG) creates 4 LogTemplates: one for $PID, one for $UNIXTIME, one for $MSG, and a fourth that combines all of these into JSON.

The goal of the project is to teach sources to set the hint themselves, and push that hint through the whole of syslog-ng.

Possible ways of doing that are:

  1. Store the type hint alongside values such as $PID and $UNIXTIME, as an attribute in the LogMessage object, which needs to be extended. LogMessage uses a structure called nvtable (name-value table) to store key-value pairs, and that is the best place to add a type hint. The LogTemplate formatting functionality is then extended to look at the type hints within nvtable.
  • There may be some corner cases. Take a template like $FOO $BAR, where $FOO has a type hint of int while $BAR is a string. We can't type-cast the whole template to int: type hints need to be template-bound when they get to the destinations, but they are property-bound when they leave the parser or source, and a template can reference multiple properties (name-value pairs within the LogMessage).
  • We can deal with cases like $FOO $BAR by specifying a priority list: for example, if $FOO is int and $BAR is string, treat the entire template as string-hinted. For a template that references multiple data types, the overall data type is determined from this priority table. This is similar to automatic type conversion in programming languages, where dividing an integer by a float yields a float.
  2. A simpler alternative is to wrap the value and its type in a structure and pass that to the destination. The destination can then read the type hint and take appropriate action.
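The first approach, with its priority table, could look roughly like the Python sketch below. It is a toy model rather than the real nvtable: the `LogMessage` class, the `PRIORITY` ordering and `template_hint` are all hypothetical, and the ordering itself is only one possible choice.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LogMessage:
    """Toy stand-in for a LogMessage whose name-value table stores hints."""
    values: dict = field(default_factory=dict)   # name -> string value
    hints: dict = field(default_factory=dict)    # name -> type hint

    def set_value(self, name, value, hint="string"):
        self.values[name] = value
        self.hints[name] = hint


# Hypothetical priority table: later entries win over earlier ones.
PRIORITY = ["boolean", "int32", "int64", "datetime", "string"]

MACRO = re.compile(r"\$(\w+)")


def template_hint(template, msg):
    """Derive one template-bound hint from the property-bound hints."""
    names = MACRO.findall(template)
    hints = [msg.hints.get(name, "string") for name in names]
    if not hints:
        return "string"                # pure literal text
    return max(hints, key=PRIORITY.index)


msg = LogMessage()
msg.set_value("FOO", "1", "int32")
msg.set_value("BAR", "text")           # defaults to a string hint
msg.set_value("BAZ", "2", "int64")
```

Here `template_hint("$FOO", msg)` keeps the int32 hint, while `template_hint("$FOO $BAR", msg)` falls back to string. A real implementation would probably also want literal text between macros to force a string hint; this sketch does not model that.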

Data Type Validation

Data validation means making sure that all data (whether user input, read from a file, or read from a database) is valid for its intended data type, and stays valid throughout the application that processes it.

The level of Data validation of a field will be based on two criteria:

  • degree of importance (can be specified alongside the type hint; defaults to least critical)
  • actual type of the data

Different levels of validation required in syslog-ng

  • Log Template (Field) Level Validation
  • Log Message Level Validation
  • Data Saving Validation: This type of validation is performed at the routine that will be performing the actual logging of the information to the file or database record or some other destination.

Depending on the data type at hand, different kinds of field-specific validation can be made available for different purposes. The techniques below are listed together with the kind of data they apply to:

  1. Range Validation: This usually applies to numeric values or dates. It tests that a value lies within a specific range of values. Note that this could apply to characters as well.
  2. Lookup Validation: Typically, this type of validation is done when a value needs to be compared against a list of possible values.
  3. Masked Input Validation: Values such as e-mail addresses, URLs, and IP addresses have one thing in common: they follow a specific input pattern, which should be respected both when the value is read from the source and when it is written to the destination.
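The three techniques above can each be expressed as a small predicate. The following Python sketch is illustrative only; the function names and the `IPV4_MASK` pattern are assumptions, and the mask checks only the shape of an IPv4 address, not that each octet is at most 255.

```python
import re


def in_range(value, low, high):
    """Range validation: works for numbers, dates, and single characters."""
    return low <= value <= high


def in_lookup(value, allowed):
    """Lookup validation: the value must be one of a known set."""
    return value in allowed


def matches_mask(value, pattern):
    """Masked input validation: the whole value must match the pattern."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical mask for the dotted-quad shape of an IPv4 address.
IPV4_MASK = r"[0-9]{1,3}(\.[0-9]{1,3}){3}"

# Hypothetical lookup list of facility names.
FACILITIES = {"auth", "cron", "daemon", "kern", "mail", "user"}
```

A PID could be validated with `in_range(pid, 1, 65535)`, a facility name with `in_lookup(name, FACILITIES)`, and a host address with `matches_mask(addr, IPV4_MASK)`; the bounds and the list here are placeholders, not values taken from syslog-ng.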

Different types of checks:

Check | Description
--- | ---
Allowed character checks | Ascertains that only expected characters are present in a field
Check digits | Used for numerical data
Data type checks | Checks the data type of the input
Format or picture check | Checks that the data is in a specified format (template). Regular expressions should be considered for this type of validation
Length check | Checks that the data isn't too short or too long
Limit check | Unlike range checks, data is checked against one limit only, upper OR lower
Logic check | Checks that an input does not yield a logical error
Presence check | Checks that important data is actually present and has not been missed out
Range check | Checks that the data lies within a specified range of values
Table look-up check | Takes the entered data item and compares it to a valid list of entries

Field types that already exist for type hinting: boolean, string, literal, int32, int64, datetime. Examples of validation checks that can be performed on these existing type hints are:

Data Types | Validation Checks
--- | ---
boolean | Return TRUE for "1" and anything starting with "t" or "T". Return FALSE for "0" and anything starting with "f" or "F". Return NULL for all non-boolean values.
string and literal | Length checks, allowed character checks, presence checks, table look-up checks
int32 and int64 | Range checks, limit checks, logic checks, check digits, data type checks, table look-up checks
datetime | Range checks, limit checks, table look-up checks, format or picture checks

Other examples of desirable field types that can be implemented in the future: float, URL, e-mail address, IP address, regular expression, list, map/dictionary, array, etc.

Casting Failure Mechanism Enhancement (in case of casting error)

In case of casting errors, one of three actions can be specified. Each of them can also be specified to work silently:

  • drop the whole message
  • drop the property (the syslog-ng application converts every message it receives to a set of name-value pairs called properties)
  • fall back to string

A feature that can facilitate failure handling: we can implement a way to specify the criticality (severity level) of a particular property alongside its type hint. Specifying the criticality will be optional, and it will default to the lowest severity level. Based on that criticality we can decide which actions need to be taken.

Options that can be specified alongside type hinting

  • The ability to specify either the exact type of a value or an approximate type. For example, one option would be to accept anything int-looking (to encompass all types of integers), while another would restrict the value to a particular kind of integer (signed or unsigned, how big, etc.).
  • For booleans, the ability to specify which values map to True, which map to False, and which map to NULL.
  • For strings, options to specify limits on the string length, the words in the string, the allowed characters, etc.
  • For integers and dates, the ability to specify a range within which the values must lie.
  • The ability to specify the criticality (severity level) of a particular property alongside its type hint. Specifying the criticality will be optional, and it will default to the lowest severity level. Based on that criticality we can decide which actions need to be taken.
  • All of the above can be specified for individual values, or placed in a main configuration file to apply to all values.

Additional failure mechanisms that can be implemented besides the ones stated above:

  • For boolean values, log a NULL value if the passed value translates to neither True nor False.
  • For int values, if int32 casting fails, try casting to int64 before falling back to string.
  • Take any of the three existing actions and additionally log a message stating that casting failed, with the logging level based on the criticality of the field.
  • If the data is marked as non-critical, fall back to string and pass the data to the destination unchanged, but take advisory action and send a message to the source actor describing the validation issues that were encountered.
  • If the data is marked as critical, drop the property or the entire message, depending on the criticality level of the property, and send a message to the source asking for a change that brings the data into compliance.
  • Offer a verification option. When it is enabled, the check step suggests an alternative and the source actor is asked to verify that the data is what they really want to enter. The source can either accept the recommendation or keep its own version.
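A few of the mechanisms above can be sketched in Python as follows. This is only one possible reading of the list: the criticality levels "low", "medium" and "high", their mapping to actions and log levels, and all function names are hypothetical.

```python
import logging

log = logging.getLogger("cast-failure-sketch")

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def cast_boolean_or_null(value: str):
    """Return True/False, or None (NULL) for non-boolean values."""
    if value == "1" or value[:1] in ("t", "T"):
        return True
    if value == "0" or value[:1] in ("f", "F"):
        return False
    return None


def cast_int_with_widening(value: str):
    """Try int32 first, then int64, then fall back to string."""
    try:
        number = int(value, 10)
    except ValueError:
        return value, "string"
    if INT32_MIN <= number <= INT32_MAX:
        return number, "int32"
    if INT64_MIN <= number <= INT64_MAX:
        return number, "int64"
    return value, "string"


# Hypothetical mapping from criticality to action and log level.
ACTIONS = {
    "low": "fallback-to-string",
    "medium": "drop-property",
    "high": "drop-message",
}
LOG_LEVELS = {
    "low": logging.INFO,
    "medium": logging.WARNING,
    "high": logging.ERROR,
}


def on_cast_failure(name, value, criticality="low"):
    """Log the failure at a criticality-based level and pick an action."""
    log.log(LOG_LEVELS[criticality], "casting %s=%r failed", name, value)
    return ACTIONS[criticality]
```

The default criticality is the lowest one, matching the proposal that unspecified properties are treated as least critical and simply fall back to string.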

Project Deliverables

  1. Support in the core of syslog-ng for type hinting settable by sources and parsers.
  • Do away with the need to specify the data types of all values every time we log to a destination; values should be marked internally with their hinted data types.
  • This can be achieved by implementing type hinting in the LogMessage object itself, as described above in the Project Idea and Design section.
  • When values are passed to the destination driver, it can act on them in whatever way it wants. We still want to be able to override the type hints set by sources. For example, if $PID is marked int, we may want to store it as a string for whatever reason, so string($PID) should override the type hint there.
  • If a template references values of more than one data type, we need a priority table to settle on a single data type for the whole template.
  2. Type hints set by sources and parsers flow through the program, and need not be set on the destination side.
  • At present, to use type hinting, we need to tell the destination that we want a particular value with a different type. This becomes very verbose once more than a handful of properties would benefit from type hinting.
  • For example, we currently have to write statements like: $(format-json PID=int($PID) DATE=datetime($UNIXTIME) ...)
  • We aim to write the same thing without explicitly passing the types int, datetime, etc., as: $(format-json PID=$PID DATE=$UNIXTIME)
  • For that, the sources that make $PID and $UNIXTIME available need to be able to set a type hint.
  3. Improving PatternDB to support type hints
  • PatternDB classifies log messages by comparing them to pre-defined patterns. It should also be able to set type hints.
  • Once a string segment is parsed, patterndb should be able to set a type hint on that segment. For example, given program[1234], patterndb can split it into "program" and "1234", where "1234" is the pid, and it could mark "1234" as int. This needs a syntax extension for patterndb, as it can't figure out the type automatically.
  • The user can also tell patterndb the type of a particular segment. For example, the message above is currently parsed with @ESTRING:program:[@@ESTRING:pid:]@; being able to write @ESTRING:program:[@@EINT:pid:]@ instead would be very desirable.
  • PatternDB already has a few typed parsers, so it can figure out the type of an item if we use the proper parser. But we need a couple of new parsers and a few improvements to existing ones.
  4. An improved JSON parser that can set type hints
  • The json-parser should be able to set type hints on LogMessage objects.
  • For example, take an incoming {"foo":1}. Currently, the 1 is stored as the string "1", with no type hint. Once sources and parsers can set type hints, json-parser could store the hint that this value is an int.
  • The json-parser already "knows" the intended type of a value, but it currently has nowhere to store it. It should be able to do so.
  5. Implement type validation and casting error failure mechanisms
  • In case of casting errors, one of three actions can be specified, as described in the Casting Failure Mechanism Enhancement section above. Implement more refined and fine-grained casting mechanisms and casting error handling.
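The two parser-side deliverables (PatternDB and the JSON parser) can be sketched together in Python. Both functions are simplified models under stated assumptions: `parse_pattern` only handles patterns made up entirely of @ESTRING@ segments and the proposed, not yet existing, @EINT@ segments, and `parse_json` only handles flat JSON objects. Values are kept as strings internally, with the hint stored next to them as a hypothetical `(string, hint)` tuple.

```python
import json
import re

# Matches one @PARSER:name:end-string@ segment of a pattern.
SEGMENT = re.compile(r"@([A-Z]+):([^:@]+):([^@]*)@")

# EINT is the parser proposed above; it does not exist in patterndb yet.
PARSER_HINTS = {"ESTRING": "string", "EINT": "int64"}


def parse_pattern(pattern, text):
    """Return {name: (value, hint)}, or None if the text does not match."""
    result = {}
    pos = 0
    for parser, name, end in SEGMENT.findall(pattern):
        stop = text.find(end, pos)
        if stop < 0:
            return None                       # end string not found
        value = text[pos:stop]
        if parser == "EINT" and re.fullmatch(r"[0-9]+", value) is None:
            return None                       # segment is not an integer
        result[name] = (value, PARSER_HINTS[parser])
        pos = stop + len(end)
    return result


def parse_json(payload):
    """Return {name: (string value, hint)} for a flat JSON object."""
    result = {}
    for name, value in json.loads(payload).items():
        if isinstance(value, bool):           # check bool before int
            result[name] = ("true" if value else "false", "boolean")
        elif isinstance(value, int):
            result[name] = (str(value), "int64")
        elif isinstance(value, str):
            result[name] = (value, "string")
        else:                                  # floats, null, nested values
            result[name] = (json.dumps(value), "string")
    return result
```

Parsing `sshd[1234]` with `@ESTRING:program:[@@EINT:pid:]@` gives a string-hinted program and an int64-hinted pid, and parsing `{"foo":1}` keeps "1" as a string internally while recording that it is an integer.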

Brief Timeline

Time Period | Task Description
--- | ---
21 April to 18 May | Community bonding period. I would also like to use this time to come up to speed with the documentation, design, and codebase of syslog-ng
19 May to 31 May | Comprehensively study and document the areas of the codebase that need to be overhauled to teach sources to set the hint themselves, and to push that hint through the whole of syslog-ng
1 June to 7 June | Modify the source/parser handling drivers to set type hints
8 June to 17 June | Enhance the LogMessage object to include type hints, so that type hints can be set at the source/parser itself, for sources that support it
18 June to 27 June | Modify destination drivers that can support and benefit from type hinting. 23 June to 27 June: Mid-Term Evaluation
28 June to 30 June | Implement the priority table to deal with conflicts when a template references values of more than one data type, to convert them all to one data type
1 July to 10 July | Implement type validation and casting error failure mechanisms
11 July to 17 July | Improve PatternDB to support type hints
18 July to 20 July | Improve the JSON parser to enable it to set type hints
21 July to 27 July | Buffer time for final catch-up on any left-over work before testing begins
28 July to 5 August | Write unit tests and comprehensively test the code I develop
6 August to 12 August | Document my work in both developer and user guides. Integrate the code into the mainline code branch
13 August to 22 August | Buffer time. 18 August to 22 August: Final Evaluation

Future Ideas for extending Type Hinting

Include type-hinting support for more data types and also implement user-defined type hinting facility for destinations that support it.

  • At present, we have support for 6 basic data types. There is a need to add more.
  • Support for user-defined data types may be desirable for drivers that support custom type definitions. The user could then define the desired data type there, and pass objects marked with that custom data type to that destination.
  • One example is the Lua destination that syslog-ng supports, where we can write our destination in Lua instead of C. Lua is high-level enough to support custom data types in a reasonable way.

Ultimately, we plan to achieve complete type-hinting support, for example for all parsers (such as the csv parser) and anything else that can set a value in the log message.

Contact information

Contact Field | Details
--- | ---
Name | Rajul
E-mail | rajul.iitkgp@gmail.com
IRC | rajul@irc.freenode.net
Instant Messaging | GTalk: rajul09@gmail.com
Phone Number | +91 80166 17078
Physical Address | A – 270, L.I.G., Govindpur Colony, Allahabad, Uttar Pradesh – 211004 India

My past work experience:

I have always been interested in programming. I participated in Google Summer of Code 2012 with the Network Time Foundation, working on the project "improving the Logging/Debugging System of Network Time Protocol Software". I also interned in the Global Technology division of Barclays during the summer of 2013, working with the Market Risk IT team. Besides this, I have worked on a few research projects in the fields of Computational Finance, Complex Networks, and Computational Chemistry. I am currently working on my thesis in the field of Computational Sciences, on a project titled "Network Analysis of Chemical Reactions". I have taken courses in Programming and Data Structures, Complex Networks, Distributed Systems, Algorithms, and Operations Research. I am proficient in the programming languages C/C++, Python, Java, Groovy, and Ruby. I am very interested in the fields of Algorithms, Computer Organisation and Architecture, and Operating Systems.

The school I am attending

I am presently a final year undergraduate student at the Indian Institute of Technology Kharagpur. My major is Chemistry. As I am pursuing an Integrated Master’s degree, I expect to graduate in 2014.

How much time I will have during GSoC to work on my project?

My main focus during the GSoC period will be development for GSoC itself. I expect to be able to devote my full attention and time to it, and can comfortably work on the project for 35-40 hours per week. However, I am ready to put in any extra time and effort that the project might demand.

What other things I will be doing during GSoC (vacation, exams, travel)

As far as I can anticipate, I will have all my time free to work on the project for most of GSoC. I do not have any travel plans as of yet.

Acknowledgement:

I hereby thank Mr. Gergely Nagy (algernon), Mr. Fabien Wernli (faxm0dem) and Mr. Viktor Tusa (talien), who gave me extremely valuable suggestions that enabled me to come up with this proposal.

