Skip to content
This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-120] Refactor on LDA/pLSA's mini-batch & buffered iteration logic #89

Closed
wants to merge 6 commits into from

Conversation

takuti
Copy link
Member

@takuti takuti commented Jun 23, 2017

What changes were proposed in this pull request?

Refactor LDA/pLSA implementation for better mini-batch & buffered iteration logic

What type of PR is it?

Refactoring

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-120

How was this patch tested?

Unit test & manual test on EMR

How to use this feature?

Nothing has been changed

@takuti
Copy link
Member Author

takuti commented Jun 23, 2017

At first, I planned to replace String[][] miniBatch with things like FeatureValue[][] miniBatch similarly to the generic predictor implementation; that is, instead of dealing word counts as String e.g. "foo:10," I tried to represent them as FeatureValue for readability. However, since PLSAPredictionUDAF and LDAPredictionUDAF need to create OI for such word counts, Java standard string object is much more handy rather than FeatureValue objects.

Thus, document is still represented as an array (or list) of String for now.

@takuti
Copy link
Member Author

takuti commented Jun 23, 2017

@myui How do you think about adding -tol option as alias of the -eps option as I mentioned #87
?

Also, if you have further ideas which improve the topic modeling UDFs, let you share and we discuss them here!

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 40.236% when pulling bc4b9d3 on takuti:topicmodel-refactor into c06378a on apache:master.

@coveralls
Copy link

coveralls commented Jun 23, 2017

Coverage Status

Coverage decreased (-0.06%) to 39.945% when pulling bc4b9d3 on takuti:topicmodel-refactor into c06378a on apache:master.

@myui
Copy link
Member

myui commented Jun 23, 2017

@takuti Adding an alias -tol for -eps is a good idea.

@myui
Copy link
Member

myui commented Jun 23, 2017

No need to replace String[][] miniBatch. It already uses FeatureVector for parsing feature vectors as thus it accept the input format of Hivemall's feature vector.

@takuti
Copy link
Member Author

takuti commented Jun 23, 2017

Adding an alias -tol for -eps is a good idea.

Sure.

it accept the input format of Hivemall's feature vector

Oh, it's not format matter by the way; String or FeatureValue in terms of type.

Current implementation repeatedly parses feature-value-formatted String at HERE for every time mini-batch is passed. If we directly pass a two-dimensional array of FeatureValue as word counts, we can avoid the unnecessary parse operations.

@myui
Copy link
Member

myui commented Jun 24, 2017

ah, parsing multiple times could be avoided.

@asfgit asfgit closed this in 0495ffa Jun 27, 2017
@myui
Copy link
Member

myui commented Jun 27, 2017

Merged this PR. Create [HIVEMALL-120-2] for -tol alias for -eps.

@takuti
Copy link
Member Author

takuti commented Jun 27, 2017

@myui Oh, you move very fast :)

-tol alias

Sure~

@takuti takuti deleted the topicmodel-refactor branch June 27, 2017 18:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants