Skip to content

Conversation

valeriy42
Copy link
Contributor

@valeriy42 valeriy42 commented Dec 16, 2019

In this PR I introduce a new class CDataFrameAnalysisState, which is responsible for centralized collection of statistics related to the data frame analytics jobs, i.e. progress, memory usage, etc. Hence, the related variables are also moved out of CDataFrameAnalysisRunner class.

Right now we have two callback functions related to the job state: progressRecorder and memoryUsageRecorder. However, since we can expect that every new kind of statistics (quality of results, time, parameters) would require another update callback function in maths classes, I pass the reference to the State object. To avoid dependency from maths to api, I introduce a state interface class owned by maths.

Related to #976

for (const auto& model : m_Models) {
model.addOutlierScores(points, scores, m_RecordMemoryUsage);
state.nextStep(step++);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, this constitutes a single step of the outlier detection job. WDYT @tveasey ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This seems like a natural place to me as well.

@valeriy42 valeriy42 removed the WIP label Dec 18, 2019
@valeriy42 valeriy42 requested a review from tveasey December 18, 2019 15:51
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I've done a first pass through. I really like the design here. I've left some minor comments. My two larger comments are I have some reservations about the naming and it feels like CDataFrameAnalysisStateInterface is superfluous (and it's slightly odd it is a concrete type). I'd rejig things slightly to remove and factor out CStubAnalysisState and use throughout the unit tests. Overall good job!


private:
SInternalState m_InternalState;
core::CConcurrentQueue<SInternalState, 10> m_StateQueue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is going to be a block if we try and push more than 10 state documents. Since we pop at fixed intervals in time we need to be aware of this behaviour. For example, if this gets updated many times it will potentially mean the working threads are blocked. I don't think this will cause a problem as it stands, but think we should document this behaviour in the class description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added that elements can be dropped if the queue is full.

for (const auto& model : m_Models) {
model.addOutlierScores(points, scores, m_RecordMemoryUsage);
state.nextStep(step++);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This seems like a natural place to me as well.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through those comments. I've done another pass. It now seems to me that we can ditch the queue altogether: as currently implemented it isn't serving a purpose, but in fact the concurrent line writer avoids the need for any synchronisation that this would be used for.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has come together well. The unit test failure is caused by requiring that the training is instrumented. I think this is too severe. Aside from that it's LGTM.

@droberts195 droberts195 added v7.7.0 and removed v7.6.0 labels Jan 15, 2020
# Conflicts:
#	lib/api/unittest/CDataFrameAnalyzerFeatureImportanceTest.cc
# Conflicts:
#	docs/CHANGELOG.asciidoc
#	include/maths/CBoostedTreeFactory.h
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment. Otherwise, LGTM. Let's get this in!

@valeriy42 valeriy42 merged commit 2b05f39 into elastic:master Jan 29, 2020
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Jan 29, 2020
…o Java backend (elastic#906)

In this PR I introduce a new class CDataFrameAnalysisState, which is responsible for centralized collection of statistics related to the data frame analytics jobs, i.e. progress, memory usage, etc. Hence, the related variables are also moved out of CDataFrameAnalysisRunner class.

Right now we have two callback functions related to the job state: progressRecorder and memoryUsageRecorder. However, since we can expect that every new kind of statistics (quality of results, time, parameters) would require another update callback function in maths classes, I pass the reference to the State object. To avoid dependency from maths to api, I introduce a state interface class owned by maths.
@valeriy42 valeriy42 deleted the feature/instrumentalize branch January 29, 2020 12:10
@valeriy42 valeriy42 restored the feature/instrumentalize branch January 29, 2020 12:11
valeriy42 added a commit that referenced this pull request Jan 29, 2020
@valeriy42 valeriy42 deleted the feature/instrumentalize branch May 6, 2020 11:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants