Lda snapping to template #3442

sfilipi · 2019-04-19T20:52:24Z

towards #3204. LDA

codecov · 2019-04-19T21:34:24Z

Codecov Report

Merging #3442 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3442      +/-   ##
==========================================
+ Coverage   72.76%   72.76%   +<.01%     
==========================================
  Files         808      808              
  Lines      145452   145452              
  Branches    16244    16244              
==========================================
+ Hits       105839   105843       +4     
+ Misses      35193    35189       -4     
  Partials     4420     4420

| Flag | Coverage Δ | |
|---|---|---|
| #Debug | `72.76% <ø> (ø)` | ⬆️ |
| #production | `68.27% <ø> (ø)` | ⬆️ |
| #test | `89.04% <ø> (ø)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Microsoft.ML.Transforms/Text/TextCatalog.cs | `41.66% <ø> (ø)` | ⬆️ |
| ...ML.Data/Transforms/ConversionsExtensionsCatalog.cs | `64.07% <ø> (ø)` | ⬆️ |
| ...t.ML.Data/Transforms/ValueToKeyMappingEstimator.cs | `88.67% <ø> (ø)` | ⬆️ |
| src/Microsoft.ML.Transforms/Text/LdaTransform.cs | `89.89% <ø> (ø)` | ⬆️ |
| src/Microsoft.ML.Maml/MAML.cs | `24.75% <0%> (-1.46%)` | ⬇️ |
| ...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs | `84.7% <0%> (-0.21%)` | ⬇️ |
| ...ML.Transforms/Text/StopWordsRemovingTransformer.cs | `86.1% <0%> (-0.16%)` | ⬇️ |
| ...StandardTrainers/Standard/LinearModelParameters.cs | `60.31% <0%> (+0.26%)` | ⬆️ |
| ...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs | `100% <0%> (+20.51%)` | ⬆️ |

shmoradims · 2019-04-19T21:40:07Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    /// |  |  |
+    /// | -- | -- |
+    /// | Does this estimator need to look at the data to train its parameters? | Yes |
+    /// | Input column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType) data types|


data types [](start = 80, length = 11)

just 'key type'

#Resolved

Why not just use only xref? That would make references to Key consistent.

In reply to: 277098008 [](ancestors = 277098008)

shmoradims · 2019-04-19T21:40:33Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    /// | -- | -- |
+    /// | Does this estimator need to look at the data to train its parameters? | Yes |
+    /// | Input column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType) data types|
+    /// | Output column data type | Vector or <xref:System.Single>|


or [](start = 43, length = 2)

or -> of ?? #Resolved

shmoradims · 2019-04-19T21:40:55Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  It can be used to featurize any text fields as low-dimensional topical vectors.
+    ///  LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
+    ///  optimization techniques.
+    ///  With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary


/ [](start = 6, length = 3)

newline

?

In reply to: 277098147 [](ancestors = 277098147)

shmoradims · 2019-04-19T21:41:12Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
+    ///  whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
+    ///
+    ///  In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.


Ml.Net [](start = 15, length = 6)

ML.NET #Resolved

shmoradims · 2019-04-19T21:42:06Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  If we have the following three lines of text, as data points:
+    ///  * I like to eat bananas.
+    ///  * I eat bananas everyday.
+    ///  * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,


[](start = 8, length = 3)

Should this * be removed? #Resolved

no, that is the third sentence.

In reply to: 277098342 [](ancestors = 277098342)

A shorter example might be clearer here :) #Pending

I tried a bunch, but none of them gave nice numbers, like this one.

In reply to: 277138988 [](ancestors = 277138988)

changed, at the end.

In reply to: 277150472 [](ancestors = 277150472,277138988)

shmoradims · 2019-04-19T21:42:51Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// Uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform a document (represented as a vector of floats)
-        /// into a vector of floats over a set of topics.
+        /// Create a <see cref="LatentDirichletAllocationEstimator"/>, which uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform text (represented as a vector of floats)
+        /// into a vector of floats indicating the similarity of the text with each topic identified.


floats [](start = 29, length = 6)

single #Resolved

Should this also be cref'd?

In reply to: 277098467 [](ancestors = 277098467)

shmoradims

singlis · 2019-04-19T22:57:15Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  Latent Dirichlet Allocation is a well-known [topic modeling](https://en.wikipedia.org/wiki/Topic_model) algorithm that infers semantic structure from text data,
+    ///  and ultimately helps answer the question on "what is this document about?".
+    ///  It can be used to featurize any text fields as low-dimensional topical vectors.
+    ///  LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of


developed in MSR-Asia [](start = 66, length = 21)

Does this matter? We could probably just say "implementation of LDA that incorporates..."
#Resolved

singlis · 2019-04-19T23:01:02Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
+    ///
+    ///  In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
+    ///  A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.


than [](start = 129, length = 4)

suggestion: n-grams to supply to the LDA transformer. #Resolved

singlis · 2019-04-19T23:01:46Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
+    ///  See the example usage in the SeeAlso section for usage suggestions.
+    ///
+    ///  If we have the following three lines of text, as data points:


three [](start = 34, length = 5)

it looks like a line is missing? #Resolved

It is the third bullet point. I substituted the line with 'example'.

In reply to: 277109119 [](ancestors = 277109119)

singlis · 2019-04-19T23:02:29Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  * I eat bananas everyday.
+    ///  * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
+    ///  and allows a small cluster of machines to tackle very large data and model sizes based on the model scheduling
+    ///  and data parallelism capabilities of the DMTK parameter server.(quoted from [LightLDA](http://www.dmtk.io/lightlda.html))


run-on sentence, can this be reworked? #Resolved

It is just an example sentence, but I see how it can be confusing. Let me pick something else.

In reply to: 277109187 [](ancestors = 277109187)

singlis · 2019-04-19T23:05:14Z

    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>

Expected input column type and expected output column type? #Resolved

Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745. [](commit_id = 4797745, deletion_comment = False)

singlis

singlis · 2019-04-19T23:06:28Z

Looks good, I left some feedback.

sfilipi · 2019-04-20T06:29:48Z

    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>

?

In reply to: 485034243 [](ancestors = 485034243)

Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745. [](commit_id = 4797745, deletion_comment = False)

wschin · 2019-04-20T16:04:37Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
+    ///
+    ///  In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
+    ///  A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.


For the suggested steps, please add xref to them.

The reason I didn't do it is that I don't know how to format extension methods in xref format.
I did point them to the sample, which contains the same steps.

In reply to: 277138943 [](ancestors = 277138943)

natke · 2019-04-20T15:59:59Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  It can be used to featurize any text fields as low-dimensional topical vectors.
+    ///  LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
+    ///  optimization techniques.
+    ///  With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary


1-million-word vocabulary? #Resolved

natke · 2019-04-20T16:00:44Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
+    ///
+    ///  In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
+    ///  A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.


If you're in here editing anyway, you could remove the "performing" #Resolved

natke · 2019-04-20T16:06:34Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  If we have the following three lines of text, as data points:
+    ///  * I like to eat bananas.
+    ///  * I eat bananas everyday.
+    ///  * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,


A shorter example might be clearer here :) #Pending

wschin · 2019-04-20T17:23:55Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///  If we have the following three lines of text, as data points:
+    ///  * I like to eat bananas.
+    ///  * I eat bananas everyday.
+    ///  * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,


O(1) [](start = 87, length = 4)

$O(1)$. It's mathematical equation. #Resolved

wschin · 2019-04-20T17:28:57Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///
+    ///  If we have the following three lines of text, as data points:
+    ///  * I like to eat bananas.
+    ///  * I eat bananas everyday.


Are those sentences required? You provide some input at the very beginning of this transform's description and then switch to algorithm details. I feel there might be a missing bridge between them.

Also, the description of the training algorithm should be kept in one single place. This section somewhat repeats information described above.

The above means:

/// on a 1-billion-token document set one a single machine in a few hours(typically, LDA at this scale takes days and requires large clusters).
/// The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).

#Resolved

I used the sentence as example text, but I am realizing it is confusing. Let me try to find something unrelated.

In reply to: 277141226 [](ancestors = 277141226)

wschin · 2019-04-20T17:38:08Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+    ///
+    ///  To illustrate the effect of this estimator on text, notice the similarity in values of the first and second row, compared to the third,
+    ///  and see how those values are indicative of semantic similarities between those lines.
+    ///


Context is missing.

What is topic?

What are the values of a Topic?

What's the relation between those values and the two inputs "I like to eat bananas." and "I eat bananas everyday."?

There is an SOP for describing an operation: first, describe the input; second, describe the output; finally, describe how to compute the output from the input (at least conceptually, if writing equations is not doable). #Pending

The illustration reads fine to me. I got that the first two were related to the topic of bananas and the other wasn't. #Resolved

leaving it then. thanks @natke

In reply to: 277152635 [](ancestors = 277152635)

wschin · 2019-04-20T17:40:49Z

    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>

We need to specify the accepted input type, such as a vector of <see cref="System.Single"/>.

In reply to: 485062948 [](ancestors = 485062948,485034243)

Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745. [](commit_id = 4797745, deletion_comment = False)

sfilipi requested review from Ivanidzo4ka, natke, shmoradims and artidoro April 19, 2019 20:52

sfilipi self-assigned this Apr 19, 2019

sfilipi added the documentation Related to documentation of ML.NET label Apr 19, 2019

shmoradims reviewed Apr 19, 2019

shmoradims approved these changes Apr 19, 2019

singlis reviewed Apr 19, 2019

singlis approved these changes Apr 19, 2019

sfilipi mentioned this pull request Apr 19, 2019

API reference - XML documentation template for transforms #3204

Closed

wschin reviewed Apr 20, 2019

natke reviewed Apr 20, 2019

wschin reviewed Apr 20, 2019

Lda snapping to template

56df0cd

sfilipi force-pushed the lda branch from 67d5669 to 56df0cd Compare April 21, 2019 04:35

sfilipi merged commit b301aec into dotnet:master Apr 21, 2019

sfilipi deleted the lda branch April 21, 2019 06:09

dotnet locked as resolved and limited conversation to collaborators Mar 22, 2022
