[FEATURE][ML] A measure of strength of relationship between RVs plus better seeding of feature sample probabilities for boosted tree #488

tveasey · 2019-05-29T12:49:23Z

This implements the (refined) "maximal information coefficient" measure of the strength of the relationship between two variables and uses it to initialise feature sample probabilities for the boosted tree. It also puts in place a mechanism which, if there are insufficient training data, restricts the features used to those variables with the strongest relationship with the dependent variable.
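To illustrate the idea described above, here is a minimal sketch of how per-feature dependence scores could be turned into sample probabilities, with a cap on the number of features retained. It is not the code in this PR: the function names `featureSampleProbabilities` and `maxFeaturesForRows`, the proportional normalisation, and the "rows per feature" rule are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not this PR's API): map non-negative dependence scores,
// e.g. the MICe of each feature with the dependent variable, to sampling
// probabilities. Only the maxFeatures highest scoring features are retained;
// all other features get probability zero.
std::vector<double> featureSampleProbabilities(const std::vector<double>& scores,
                                               std::size_t maxFeatures) {
    assert(maxFeatures > 0);

    // Order the feature indices by decreasing score.
    std::vector<std::size_t> order(scores.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&scores](std::size_t lhs, std::size_t rhs) {
                         return scores[lhs] > scores[rhs];
                     });

    const std::size_t selected = std::min(maxFeatures, scores.size());
    double total = 0.0;
    for (std::size_t i = 0; i < selected; ++i) {
        total += scores[order[i]];
    }

    std::vector<double> probabilities(scores.size(), 0.0);
    for (std::size_t i = 0; i < selected; ++i) {
        // Fall back to uniform sampling if every retained score is zero.
        probabilities[order[i]] = total > 0.0
                                      ? scores[order[i]] / total
                                      : 1.0 / static_cast<double>(selected);
    }
    return probabilities;
}

// Hypothetical rule of thumb (an assumption, not taken from this PR): allow
// one feature per rowsPerFeature training rows, and always at least one.
std::size_t maxFeaturesForRows(std::size_t rows, std::size_t rowsPerFeature) {
    assert(rowsPerFeature > 0);
    return std::max<std::size_t>(1, rows / rowsPerFeature);
}
```

For example, with scores `{0.6, 0.1, 0.2, 0.0}` and a cap of two features, only the first and third features are kept, and their probabilities are proportional to their scores. The resulting vector could be passed as weights to `std::discrete_distribution` to draw the features used for a tree.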

valeriy42

Good job on implementing MICe! 👏 I have a number of comments which aim to improve readability.

include/maths/CMic.h

lib/maths/CDataFrameUtils.cc

lib/maths/CMic.cc

lib/maths/unittest/CDataFrameUtilsTest.cc

lib/maths/unittest/CMicTest.cc

tveasey · 2019-07-05T11:21:46Z

Thanks for the review @valeriy42! I think I've addressed all your comments. Can you take another look?

valeriy42

LGTM. Good job!

tveasey added 5 commits May 25, 2019 15:48

MIC variable dependence measure

24a20b1

Compute MIC on a data frame. Wire in to boosted tree regression

cd3a503

Merge branch 'feature/regression' into regression-tune-feature-weights

d02e9fd

Some more testing, fixes and tidy ups

9f8e85c

Comments

db3a634

tveasey added review >non-issue :ml labels May 29, 2019

tveasey added 4 commits May 29, 2019 16:14

Build test fixes

8b08090

Tweak test threshold

e95cebf

Merge branch 'feature/regression' into regression-tune-feature-weights

545bd12

Typo

e994799

tveasey requested a review from valeriy42 July 2, 2019 15:28

valeriy42 reviewed Jul 4, 2019

tveasey added 7 commits July 4, 2019 14:06

Some review comments, mainly improve naming, also IMPORTANT change calculation of k in main search when there are duplicates

d51a59c

More explanation of optimizeXAxis

7e1a03c

Explanation and better assert

f4aceda

Cache normalisation factors (saves around 5% of the runtime)

e9a441e

Missing test assertion

fb55088

Factor in memory and on disk from micWithColumn to their own functions

b078457

Comments

dde73c9

valeriy42 approved these changes Jul 5, 2019

Typo in comment

b486d14

tveasey merged commit 8bfe832 into elastic:feature/regression Jul 5, 2019

tveasey deleted the regression-tune-feature-weights branch July 5, 2019 12:06

tveasey mentioned this pull request Jul 22, 2019

[FEATURE][ML] Introducing factory for boosted tree creation #552

Closed
