Lda snapping to template #3442

Merged
merged 1 commit into from Apr 21, 2019

Conversation

Member

@sfilipi sfilipi commented Apr 19, 2019

towards #3204. LDA

@sfilipi sfilipi self-assigned this Apr 19, 2019
@sfilipi sfilipi added the documentation Related to documentation of ML.NET label Apr 19, 2019
codecov bot commented Apr 19, 2019

Codecov Report

Merging #3442 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3442      +/-   ##
==========================================
+ Coverage   72.76%   72.76%   +<.01%     
==========================================
  Files         808      808              
  Lines      145452   145452              
  Branches    16244    16244              
==========================================
+ Hits       105839   105843       +4     
+ Misses      35193    35189       -4     
  Partials     4420     4420
| Flag | Coverage Δ |
| -- | -- |
| #Debug | 72.76% <ø> (ø) ⬆️ |
| #production | 68.27% <ø> (ø) ⬆️ |
| #test | 89.04% <ø> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
| -- | -- |
| src/Microsoft.ML.Transforms/Text/TextCatalog.cs | 41.66% <ø> (ø) ⬆️ |
| ...ML.Data/Transforms/ConversionsExtensionsCatalog.cs | 64.07% <ø> (ø) ⬆️ |
| ...t.ML.Data/Transforms/ValueToKeyMappingEstimator.cs | 88.67% <ø> (ø) ⬆️ |
| src/Microsoft.ML.Transforms/Text/LdaTransform.cs | 89.89% <ø> (ø) ⬆️ |
| src/Microsoft.ML.Maml/MAML.cs | 24.75% <0%> (-1.46%) ⬇️ |
| ...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs | 84.7% <0%> (-0.21%) ⬇️ |
| ...ML.Transforms/Text/StopWordsRemovingTransformer.cs | 86.1% <0%> (-0.16%) ⬇️ |
| ...StandardTrainers/Standard/LinearModelParameters.cs | 60.31% <0%> (+0.26%) ⬆️ |
| ...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs | 100% <0%> (+20.51%) ⬆️ |

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType) data types|
Contributor

@shmoradims shmoradims Apr 19, 2019

data types

just 'key type'

#Resolved

Member

Why not just use only xref? That would make references to Key consistent.

/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType) data types|
/// | Output column data type | Vector or <xref:System.Single>|
Contributor

@shmoradims shmoradims Apr 19, 2019

or

or -> of ?? #Resolved

/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
/// optimization techniques.
/// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
Contributor

/

newline

Member Author

?

/// The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
Contributor

@shmoradims shmoradims Apr 19, 2019

Ml.Net

ML.NET #Resolved

/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
Contributor

@shmoradims shmoradims Apr 19, 2019

*

Should this * be removed? #Resolved

Member Author

no, that is the third sentence.

Contributor

@natke natke Apr 20, 2019

A shorter example might be clearer here :) #Pending

Member Author

I tried a bunch, but none of them gave nice numbers like this one.

Member Author

changed, at the end.

/// Uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform a document (represented as a vector of floats)
/// into a vector of floats over a set of topics.
/// Create a <see cref="LatentDirichletAllocationEstimator"/>, which uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform text (represented as a vector of floats)
/// into a vector of floats indicating the similarity of the text with each topic identified.
Contributor

@shmoradims shmoradims Apr 19, 2019

floats

single #Resolved

Member

Should this also be crefd?

Contributor

@shmoradims shmoradims left a comment

:shipit:

/// Latent Dirichlet Allocation is a well-known [topic modeling](https://en.wikipedia.org/wiki/Topic_model) algorithm that infers semantic structure from text data,
/// and ultimately helps answer the question on "what is this document about?".
/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
Member

@singlis singlis Apr 19, 2019

developed in MSR-Asia

Does this matter? We could probably just say "implementation of LDA that incorporates...".
#Resolved

/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
Member

@singlis singlis Apr 19, 2019

than

suggestion: n-grams to supply to the LDA transformer. #Resolved
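
For context, the preprocessing chain named in the doc comment under review (text normalization, tokenization, n-gram production, then LDA) could be sketched roughly as below. This is a minimal sketch, assuming the Microsoft.ML NuGet package is referenced. The names `LdaPipelineSketch`, `TextData`, `TransformedTextData`, and `Featurize`, the column names, the topic count of 3, and the third sample sentence are placeholders chosen for illustration and are not part of this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class LdaPipelineSketch
{
    // Hypothetical row types; only the property (column) names matter.
    private class TextData
    {
        public string Text { get; set; }
    }

    private class TransformedTextData : TextData
    {
        public float[] Features { get; set; }
    }

    // Returns one vector of per-topic scores for each input line.
    public static float[][] Featurize(IReadOnlyList<string> lines, int numberOfTopics)
    {
        var mlContext = new MLContext(seed: 0);
        var samples = lines.Select(l => new TextData { Text = l }).ToList();
        var dataView = mlContext.Data.LoadFromEnumerable(samples);

        // Normalize -> tokenize -> drop stop words -> map tokens to keys
        // -> produce n-grams -> LDA over the n-gram counts.
        var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
            .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                "Features", "Tokens", numberOfTopics: numberOfTopics));

        // LDA is a trainable estimator, so the pipeline must be fitted first.
        var transformer = pipeline.Fit(dataView);
        var engine = mlContext.Model
            .CreatePredictionEngine<TextData, TransformedTextData>(transformer);

        return samples.Select(s => engine.Predict(s).Features).ToArray();
    }

    public static void Main()
    {
        var lines = new[]
        {
            "I like to eat bananas.",
            "I eat bananas everyday.",
            // Placeholder third line on an unrelated subject (an assumption).
            "The weather forecast predicts heavy rain tomorrow."
        };

        foreach (var vector in Featurize(lines, numberOfTopics: 3))
            Console.WriteLine(string.Join(", ", vector.Select(v => v.ToString("F4"))));
    }
}
```

Each printed row holds one value per topic; the exact numbers depend on the data and the sampler, so none are shown here.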

/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
/// See the example usage in the SeeAlso section for usage suggestions.
///
/// If we have the following three lines of text, as data points:
Member

@singlis singlis Apr 19, 2019

three

it looks like a line is missing? #Resolved

Member Author

It is the third bullet point. I substituted the line with 'example'.

/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
/// and allows a small cluster of machines to tackle very large data and model sizes based on the model scheduling
/// and data parallelism capabilities of the DMTK parameter server.(quoted from [LightLDA](http://www.dmtk.io/lightlda.html))
Member

@singlis singlis Apr 19, 2019

run-on sentence, can this be reworked? #Resolved

Member Author

It is just an example sentence, but I see how it can be confusing. Let me pick something else.

Member

singlis commented Apr 19, 2019

    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>

Expected input column type and expected output column type? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745.

Member

@singlis singlis left a comment

:shipit:

Member

singlis commented Apr 19, 2019

Looks good, I left some feedback.

Member Author

sfilipi commented Apr 20, 2019

    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>

?


Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745.

/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
Member

For the suggested steps, please add xref to them.

Member Author

I didn't do it because I don't know how to format extension methods in xref format.
I did point them to the sample, which contains the same steps.

/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
/// optimization techniques.
/// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
Contributor

@natke natke Apr 20, 2019

1-million-word vocabulary? #Resolved

/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
Contributor

@natke natke Apr 20, 2019

If you're in here editing anyway, you could remove the "performing" #Resolved

/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
Member

@wschin wschin Apr 20, 2019

O(1)

$O(1)$. It's a mathematical expression. #Resolved

///
/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
Member

@wschin wschin Apr 20, 2019

Are those sentences required? You provide some input at the very beginning of this transform and then switch to algorithm details. I feel a bridge between them might be missing.

Also, the description of the training algorithm should be kept in one place. This section somewhat repeats information described above.

The above means:

    ///  on a 1-billion-token document set one a single machine in a few hours(typically, LDA at this scale takes days and requires large clusters).
    ///  The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
    ///  whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).

#Resolved

Member Author

I used the sentence as example text, but I am realizing it is confusing. Let me try to find something unrelated.

///
/// To illustrate the effect of this estimator on text, notice the similarity in values of the first and second row, compared to the third,
/// and see how those values are indicative of semantic similarities between those lines.
///
Member

@wschin wschin Apr 20, 2019

Context is missing.

  1. What is a topic?
  2. What are the values of a topic?
  3. What is the relation between those values and the two inputs "I like to eat bananas." and "I eat bananas everyday."?
  4. There is a standard procedure for describing an operation: first describe the input, then describe the output, and finally describe how to compute the output from the input (at least conceptually, if writing equations is not feasible). #Pending

Contributor

@natke natke Apr 21, 2019

The illustration read fine to me. I got that the first two were related to the topic of bananas and the other wasn't. #Resolved

Member Author

Leaving it then. Thanks @natke

Member

wschin commented Apr 20, 2019

    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>

We need to specify the accepted input type, such as a vector of <see cref="System.Single"/>.


Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745.
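
To make the request above concrete, the parameter documentation might state the expected column types along the following lines. This is only a sketch of possible wording: `DocCommentSketch` and `LdaStub` are hypothetical names used so the fragment compiles on its own, and the "vector of Single" wording follows the reviewer's suggestion rather than the text actually merged.

```csharp
public static class DocCommentSketch
{
    /// <summary>Hypothetical stub that exists only to carry the parameter documentation.</summary>
    /// <param name="outputColumnName">Name of the column resulting from the transformation of
    /// <paramref name="inputColumnName"/>. This estimator outputs a vector of <see cref="System.Single"/>.</param>
    /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>,
    /// the value of the <paramref name="outputColumnName"/> will be used as source.
    /// This estimator operates over a vector of <see cref="System.Single"/>.</param>
    public static void LdaStub(string outputColumnName, string inputColumnName = null)
    {
        // Intentionally empty: only the XML documentation above is of interest.
    }
}
```

Stating the type in the `<param>` text keeps it visible in IntelliSense, where the remarks table is not shown.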

@sfilipi sfilipi merged commit b301aec into dotnet:master Apr 21, 2019
@sfilipi sfilipi deleted the lda branch April 21, 2019 06:09
@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 22, 2022
Labels
documentation Related to documentation of ML.NET
5 participants