Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made 'StopWordsRemover' in TextFeaturizer configurable again. #2962

Merged
merged 6 commits into from Mar 18, 2019

Conversation

Projects
None yet
3 participants
@zeahmed
Copy link
Member

zeahmed commented Mar 14, 2019

This PR addresses the StopWordsRemover issue in #838

This PR make available two options to remove stop words.

  1. PredefinedStopWordsRemover: removes stops words using built in technique in ML.NET.
  2. UseCustomStopWordsRemover: User provides the list of stop words to remove. Currently string[] is exposed to collect that list.

No support is added to load stop words from file. I think if we make No.2 take list as IEnumerable<string> instead of string[] then user can load list on their own. Let me know if reviewers have any thought.

@zeahmed zeahmed requested review from Ivanidzo4ka , wschin , artidoro and sfilipi Mar 14, 2019

@codecov

This comment has been minimized.

Copy link

codecov bot commented Mar 14, 2019

Codecov Report

Merging #2962 into master will decrease coverage by 0.01%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master    #2962      +/-   ##
==========================================
- Coverage   72.26%   72.25%   -0.02%     
==========================================
  Files         802      800       -2     
  Lines      142901   142894       -7     
  Branches    16134    16137       +3     
==========================================
- Hits       103272   103245      -27     
- Misses      35210    35228      +18     
- Partials     4419     4421       +2
Flag Coverage Δ
#Debug 72.25% <50%> (-0.02%) ⬇️
#production 67.98% <48.83%> (-0.02%) ⬇️
#test 88.48% <100%> (ø) ⬆️
Impacted Files Coverage Δ
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs 98.08% <100%> (ø) ⬆️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 85.78% <100%> (+0.25%) ⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 83.2% <43.58%> (-7.12%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
...OnnxTransformer.StaticPipe/OnnxStaticExtensions.cs
...r.StaticPipe/DnnImageFeaturizerStaticExtensions.cs
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.9% <0%> (+0.2%) ⬆️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.89% <0%> (+0.62%) ⬆️
... and 1 more
@@ -96,8 +129,42 @@ public sealed class Options : TransformInputBase
[Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)]
public Language Language = DefaultLanguage;

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Mar 14, 2019

Member

Language [](start = 28, length = 8)

it needs to be part of StopWordsRemovingTransformer #Resolved

This comment has been minimized.

@zeahmed

zeahmed Mar 15, 2019

Author Member

I think Language parameter is used by a couple components in TextTransform e.g. it used by tokenizer also. I know currently we are using very naïve whitespace tokenizer which does not use it. However, in future if we develop a sophisticated tokenizer for different language then this parameter will be needed. What do you say?


In reply to: 265808271 [](ancestors = 265808271)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Mar 18, 2019

Member

I don't know what kind of future lays ahead of us, but for now, this option just look silly.


In reply to: 265812513 [](ancestors = 265812513,265808271)

This comment has been minimized.

@zeahmed

zeahmed Mar 18, 2019

Author Member

hmm...so you think that it should not be here? it should instead be in PredefinedStopWordsRemoverOptions?


In reply to: 266564835 [](ancestors = 266564835,265812513,265808271)

/// <summary>
/// Defines the different type of stop words remover supported.
/// </summary>
public enum StopWordsRemoverType

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Mar 14, 2019

Member

StopWordsRemoverType [](start = 20, length = 20)

I think it would be much cleaner if we just introduce IStopWordRemover interface and inherit both stop word removers from it and make it part of public interface. #Resolved

This comment has been minimized.

@zeahmed

zeahmed Mar 15, 2019

Author Member

Interface make it difficult to find out different type of StopWordsRemover in the system at API level. When user will use this enum they immediately know that what type of StopWordsRemover are available and how to use them.


In reply to: 265808886 [](ancestors = 265808886)

This comment has been minimized.

@zeahmed

zeahmed Mar 16, 2019

Author Member

I have done what you proposed for the interface thing.


In reply to: 265811781 [](ancestors = 265811781,265808886)

/// <summary>
/// List of stop words. It is used when <see cref="StopWordsRemover"/> is equal to <see cref="StopWordsRemoverType.UseCustomStopWordsRemover"/>.
/// </summary>
public string[] StopWords;

This comment has been minimized.

@artidoro

artidoro Mar 15, 2019

Contributor

string [](start = 19, length = 6)

Why not IEnumerable<string> ? #Resolved

This comment has been minimized.

@zeahmed

zeahmed Mar 16, 2019

Author Member

I doubt that EntryPoint will break because of this. The actual interface is still using string[]

I have mentioned that in the comments in the my PR. I am waiting for the people to comment on it. So that I can decide what to do.


In reply to: 266099423 [](ancestors = 266099423)

@artidoro

This comment has been minimized.

Copy link
Contributor

artidoro commented Mar 15, 2019

Could you add a test with a custom list of stop words? #Resolved

@zeahmed

This comment has been minimized.

Copy link
Member Author

zeahmed commented Mar 15, 2019

I had opened an issue already for adding more tests for TextFeaturizer. The rationale behind this is to test all the options in the text featurizer. So, your concern can be addressed in that issue.
#2967


In reply to: 473395783 [](ancestors = 473395783)

/// <summary>
/// Use stop words remover that can removes language-specific list of stop words (most common words) already defined in the system.
/// </summary>
public sealed class CustomStopWordsRemoverOptions : IStopWordsRemoverOptions

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Mar 18, 2019

Member

CustomStopWordsRemoverOptions [](start = 28, length = 29)

Why this interface and class is part of text transform and not part of StopWordRemover itself? #Resolved

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Mar 18, 2019

Member

My question is why it's not implemented same way as word ngram and char ngram implemented?


In reply to: 266565265 [](ancestors = 266565265)

This comment has been minimized.

@zeahmed

zeahmed Mar 18, 2019

Author Member

It can be part of StopWordRemover. I will move it.
Word ngram and char ngram are do different options that can be exclusively applied in Text transform (two parameter for two different options). On the contrary, only one type of StopWordRemover can be applied (one parameter to take two different options). Any idea if it can be done better than this?


In reply to: 266565861 [](ancestors = 266565861,266565265)

@Ivanidzo4ka
Copy link
Member

Ivanidzo4ka left a comment

:shipit:

@artidoro
Copy link
Contributor

artidoro left a comment

:shipit:

@zeahmed zeahmed merged commit 6733515 into dotnet:master Mar 18, 2019

3 checks passed

MachineLearning-CI #20190318.27 succeeded
Details
MachineLearning-CodeCoverage #20190318.27 succeeded
Details
license/cla All CLA requirements met.
Details
@zeahmed

This comment has been minimized.

Copy link
Member Author

zeahmed commented Mar 18, 2019

Thanks all for the useful feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.