Aggregations: Add serial differencing aggregation #10190

Closed
wants to merge 279 commits into
base: feature/aggs_2_0
from

@polyfractal
Member

polyfractal commented Mar 20, 2015

This is still a WIP, just putting it up for discussion. We may want to roll this functionality into a different Agg.

Serial Differencing

Serial differencing (or just differencing) is a technique where each value in a time series has an earlier value of the same series subtracted from it, at a chosen time lag or period. For example, the differenced datapoint is f(x) = f(x_t) - f(x_{t-n}), where n is the period being used.

A period of 1 is equivalent to a derivative: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
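As a minimal sketch of the definition above (not the code in this PR), lag-n differencing over a plain list could look like the following in Python. The function name `serial_diff` and the choice to emit `None` for the first `lag` points are assumptions made for illustration.

```python
from typing import List, Optional


def serial_diff(values: List[float], lag: int = 1) -> List[Optional[float]]:
    """Return values[t] - values[t - lag] for each t.

    The first `lag` positions have no earlier value to subtract,
    so they are reported as None (an illustrative choice).
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [
        None if t < lag else values[t] - values[t - lag]
        for t in range(len(values))
    ]


# A constant linear trend with slope 3: the first difference is flat.
series = [2.0, 5.0, 8.0, 11.0, 14.0]
print(serial_diff(series, lag=1))  # [None, 3.0, 3.0, 3.0, 3.0]
```

With a lag of 1 the linear trend collapses to a constant, which is the sense in which a single period removes constant, linear trends.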

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

But once we plot the first difference, the series looks stationary: the first difference is randomly distributed around zero and doesn't seem to exhibit any pattern or behavior. The transformation suggests that the dataset follows a random-walk model, which opens it up to further analysis.

[Screenshot: Dow Jones over ~250 days, raw data and first difference]

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.
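The two-step transformation described above can be sketched in Python. This is an illustration under assumptions, not the data behind the screenshots: the trend slope, amplitude, and baseline are made-up numbers, the helper names are hypothetical, and the noise term defaults to zero so the result is easy to check.

```python
import math
import random


def serial_diff(values, lag):
    """Lag-n difference; None where either operand is unavailable."""
    out = []
    for t in range(len(values)):
        if t < lag or values[t] is None or values[t - lag] is None:
            out.append(None)
        else:
            out.append(values[t] - values[t - lag])
    return out


def lemmings(days, noise=0.0, seed=42):
    """Synthetic series: baseline + linear trend + 30-day sine + Gaussian noise."""
    rng = random.Random(seed)
    return [
        100.0 + 2.0 * t + 50.0 * math.sin(2.0 * math.pi * t / 30.0)
        + rng.gauss(0.0, noise)
        for t in range(days)
    ]


population = lemmings(120)                    # noise-free for clarity
first = serial_diff(population, 1)            # removes the linear trend
thirtieth = serial_diff(first, 30)            # removes the 30-day cycle
residual = [v for v in thirtieth if v is not None]

# Without noise, only floating-point rounding error remains.
print(max(abs(v) for v in residual) < 1e-9)   # True
```

Passing a non-zero `noise` leaves a residual that is a combination of the noise terms only, which is the stationary series the text refers to.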

[Screenshot: synthetic lemming population, its first difference, and the 30th difference of the first difference]

API

{
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "lemmings"
               }
            },
            "first_difference": {
               "diff": {
                  "bucketsPath": "the_sum",
                  "periods" : 1
               }
            },
            "thirtieth_difference": {
               "diff": {
                  "bucketsPath": "first_difference",
                  "periods" : 30
               }
            }
         }
      }
   }
}
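For anyone scripting this, the request above could be assembled programmatically. The Python sketch below only builds the JSON body; the aggregation name `diff` and the parameters `bucketsPath` and `periods` are copied from this work-in-progress PR and may not match any released API, and the helper names `diff_agg` and `build_request` are hypothetical.

```python
import json


def diff_agg(buckets_path: str, periods: int) -> dict:
    """One differencing step, using the WIP parameter names from this PR."""
    return {"diff": {"bucketsPath": buckets_path, "periods": periods}}


def build_request(date_field: str, metric_field: str, interval: str = "day") -> dict:
    """Date histogram with a sum, its first difference, and the 30th
    difference of that first difference (chained via bucketsPath)."""
    return {
        "aggs": {
            "my_date_histo": {
                "date_histogram": {"field": date_field, "interval": interval},
                "aggs": {
                    "the_sum": {"sum": {"field": metric_field}},
                    "first_difference": diff_agg("the_sum", 1),
                    "thirtieth_difference": diff_agg("first_difference", 30),
                },
            }
        }
    }


body = build_request("timestamp", "lemmings")
print(json.dumps(body, indent=3))
```

Chaining works because the second differencing step points its `bucketsPath` at the output of the first one rather than at the raw sum.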

TODO

  • Tests :)
  • Javadocs, cleanup, etc
  • How does this interact with Derivative? The first difference is technically a derivative. We could roll this behavior into Deriv, but if Deriv ever gets time normalization this will get weird. We could also tell people to just use Diff with a period of 1, but I quite like having a single, simple Deriv agg.
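
To make the last TODO item concrete, here is a small Python sketch of why a time-normalized derivative would stop matching a plain first difference. Time normalization is only a possibility raised above, so `normalized_derivative` and its parameters are purely hypothetical.

```python
def first_difference(values):
    """Plain lag-1 difference between consecutive buckets."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


def normalized_derivative(values, bucket_hours, unit_hours):
    """Hypothetical: rescale the bucket-to-bucket change to a rate per `unit_hours`."""
    return [
        None if d is None else d * unit_hours / bucket_hours
        for d in first_difference(values)
    ]


daily_sums = [240.0, 264.0, 312.0]            # one value per day bucket

print(first_difference(daily_sums))            # [None, 24.0, 48.0]
print(normalized_derivative(daily_sums, bucket_hours=24, unit_hours=1))
# [None, 1.0, 2.0]
```

The two outputs agree only when the normalization unit equals the bucket width, which is why folding differencing into Deriv could become confusing.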
@polyfractal

Member

polyfractal commented Mar 24, 2015

Just a note: I think I want to rename the periods parameter to lags. More standardized naming, and I think a bit more descriptive.

@colings86

Member

colings86 commented Mar 25, 2015

+1 to renaming periods to lags, but what is the effect of having multiple lags? Or is that not the reason why it's plural? We can still only get one output per bucket, right?

@polyfractal

Member

polyfractal commented Mar 25, 2015

Oh, good point. It should be lag. I think supporting multiple lags would be very confusing and unnecessary.

@colings86

Member

colings86 commented Mar 25, 2015

Agreed, we should not try to support multiple lags in one reducer

bleskes and others added some commits Mar 27, 2015

Decouple recoveries from engine flush
In order to safely complete recoveries / relocations we have to keep all operations done since the recovery start available for replay. At the moment we do so by preventing the engine from flushing, thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations, but if another recovery starts concurrently it may have an unnecessarily long translog to replay. Also, if we shut down the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back.

To avoid this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold of the files they need while allowing the engine to flush and do a Lucene commit (which will create new translog files under the hood).

Change highlights:
- Refactor Translog file management to allow for multiple files.
- Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene.
- A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations.
- Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also sends all operations from the start of the recovery to the recovering shard. Replaying all ops in the view acquired at recovery start is enough to guarantee no operation is lost.
- IndexShard now creates the translog together with the engine. The translog is closed by the engine on close. ShadowIndexShards do not open the translog.
- Moved the ownership of translog fsyncing to the translog itself, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`)

Closes #10624
Mappings: Remove file based default mappings
Using files that must be specified on each node is an anti-pattern
from the API based goal of ES. This change removes the ability
to specify the default mapping with a file on each node.

closes #10620
Scripting: Add Field Methods
Added infrastructure to allow basic member methods in the expressions
language to be called.  The methods must have a signature with no arguments.  Also
added the following member methods for date fields (and it should be easy to add more)
* getYear
* getMonth
* getDayOfMonth
* getHourOfDay
* getMinutes
* getSeconds

Allow fields to be accessed without using the member variable [value].
(Note that both ways can be used to access fields for back-compat.)

closes #10890
Add span within/containing queries.
Expose new span queries from https://issues.apache.org/jira/browse/LUCENE-6083

Within returns matches from 'little' that are enclosed inside of a match from 'big'.
Containing returns matches from 'big' that enclose matches from 'little'.
Merge pull request #10913 from rmuir/spanspanspanspanspan
Add span within/containing queries.
Exclude jackson-databind dependency
The Jackson YAML data format pulls in the databind dependency; it's important that we exclude it so we won't use any of its classes by mistake.
Merge pull request #10924 from kimchy/exclude_jackson_ann
Exclude jackson-databind dependency
Trimmed the main `elasticsearch.yml` configuration file
The main `elasticsearch.yml` file mixed configuration, documentation
and advice together.

Due to much-improved documentation at <http://www.elastic.co/guide/>,
the content has been trimmed, and only the essential settings have
been left, to prevent the urge to over-configure.

Related: 8d0f1a7

rmuir and others added some commits May 14, 2015

Merge pull request #11163 from rmuir/jna_nosys
Use our provided JNA library, versus one installed on the system
Scripting: Add Multi-Valued Field Methods to Expressions
Add methods to operate on multi-valued fields in the expressions language.
Note that users will still not be able to access individual values
within a multi-valued field.

The following methods will be included:

* min
* max
* avg
* median
* count
* sum

Additionally, changes have been made to MultiValueMode to support the
new median method.

closes #11105
Mappings: Add back support for enabled/includes/excludes in _source
This adds back the ability to disable _source, as well as set includes
and excludes. However, it also restricts these settings to not be
updateable. enabled was actually already not modifiable, but no
conflict was previously given if an attempt was made to change it.

This also adds a check that can be made on the source mapper to
know if the source is "complete" and can be used for
purposes other than returning in search or get requests. There is
one example use here in highlighting, but more need to be added
in a follow up issue (eg in the update API).

closes #11116
Re-structure collate option in PhraseSuggester to only collate on local shard.

Previously, the collate feature would be executed on all shards of an index using the client.
This led to a deadlock when concurrent collate requests were run from the _search API,
due to the fact that both the external request and internal collate requests use the
same search threadpool.

As phrase suggestions are generated from the terms of the local shard, in most cases the
generated suggestion, which does not yield a hit for the collate query on the local shard
would not yield a hit for collate query on non-local shards.

Instead of using the client for collating suggestions, collate query is executed against
the ContextIndexSearcher. This PR removes the ability to specify a preference for a collate
query, as the collate query is only run on the local shard.

closes #9377
Merge pull request #11171 from rjernst/fix/11116
Mappings: Add back support for enabled/includes/excludes in _source
Change includes/excludes back to null based for now, since it
complicates serialization and causes a number of test failures.
Add index name to log statements when settings update fails
When an index setting is invalid and fails to be set, a WARN statement
is logged but it doesn't contain the index name, making tracking down
and fixing the problem more difficult. This commit adds the index name
to the log statement.
HttpServer: Support relative plugin paths in configuration
When specifying relative paths on startup, handling plugin
paths failed due to recently added security fix. This fix
ensures normalization of the plugin path as well.

In addition a new matcher has been added to easily check for a
status code of an HTTP response, like this:

assertThat(response, hasStatus(OK));

Closes #10958
Analysis: Add multi-valued text support
Add support for array text as a multi-valued input in AnalyzeRequestBuilder
Add support for array text as a multi-valued input in the Analyze REST API
Add docs

Closes #3023
Removed `id_cache` from stats and cat apis.
Also removed the `id_cache` option from the clear cache api.

Closes #5269
Merge pull request #11183 from martijnvg/parent-child/remove_id_cache_from_stats_and_clear_cache_apis

Removed `id_cache` from stats and cat apis.
Merge pull request #11144 from jpountz/fix/remove_hppc_esoteric_dep
Internal: remove dependency on hppc:esoteric.
Aggs: Make it possible to configure missing values.
Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now
support a new `missing` option which defines the value to consider when a
field does not have a value. This can be handy if, e.g., you want a terms
aggregation to treat documents that have "N/A" and documents that have no value
for a `tag` field the same way.

This works in a very similar way to the `missing` option on the `sort`
element.

One known issue is that this option sometimes cannot make the right decision
in the unmapped case: it needs to replace all values with the `missing` value
but might not know what kind of values source should be produced (numerics,
strings, geo points?). For this reason, we might want to add an `unmapped_type`
option in the future like we did for sorting.

Related to #5324
Merge pull request #11042 from jpountz/feature/aggs_missing
Aggs: Make it possible to configure missing values.
Merge pull request #11141 from jpountz/fix/fieldnameanalyzer_leniency
Mappings: Make FieldNameAnalyzer less lenient.
Search: Make SCAN faster.
When scrolling, SCAN previously collected documents until it reached where it
had stopped on the previous iteration. This makes pagination slower and slower
as you request deep pages. With this change, SCAN now directly jumps to the
doc ID where it had previously stopped.
Aggregations improvement: exclude clauses with a medium/large number of clauses fail.

The underlying automaton-backed implementation throws an error if there are too many states.

This fix changes to using an implementation based on Set lookups for lists of excluded terms.
If the global-ordinals execution mode is in effect this implementation also addresses the slowness identified in issue 11181 which is caused by traversing the TermsEnum - instead the excluded terms’ global ordinals are looked up individually and unset the bits of acceptable terms. This is significantly faster.

Closes #11176
Highlighting: nuke XPostingsHighlighter
Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.

One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, terms are pulled out of the automata for multi term queries).

Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.

Closes #10625
Closes #11077
@polyfractal

Member

polyfractal commented May 15, 2015

Closing since this is against an outdated branch (feature/aggs_2_0)

@kingaj

kingaj commented Feb 25, 2016

Is there a Java example? If so, please give me a link.
