Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made 'StopWordsRemover' in TextFeaturizer configurable again. #2962

Merged
merged 6 commits into from Mar 18, 2019
@@ -25,6 +25,7 @@
namespace Microsoft.ML.Transforms.Text
{
using CaseMode = TextNormalizingEstimator.CaseMode;
using StopWordsCol = StopWordsRemovingTransformer.Column;
// A transform that turns a collection of text documents into numerical feature vectors. The feature vectors are counts
// of (word or character) ngrams in a given text. It offers ngram hashing (finding the ngram token string name to feature
// integer index mapping through hashing) as an option.
@@ -85,6 +86,38 @@ internal bool TryUnparse(StringBuilder sb)
}
}

/// <summary>
/// Defines the different type of stop words remover supported.
/// </summary>
public enum StopWordsRemoverType

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Mar 14, 2019

Member

StopWordsRemoverType [](start = 20, length = 20)

I think it would be much cleaner if we just introduce IStopWordRemover interface and inherit both stop word removers from it and make it part of public interface. #Resolved

This comment has been minimized.

Copy link
@zeahmed

zeahmed Mar 15, 2019

Author Member

Interface make it difficult to find out different type of StopWordsRemover in the system at API level. When user will use this enum they immediately know that what type of StopWordsRemover are available and how to use them.


In reply to: 265808886 [](ancestors = 265808886)

This comment has been minimized.

Copy link
@zeahmed

zeahmed Mar 16, 2019

Author Member

I have done what you proposed for the interface thing.


In reply to: 265811781 [](ancestors = 265811781,265808886)

{
/// <summary>
/// Use stop words remover that can removes language-specific list of stop words (most common words) already defined in the system.
/// </summary>
UsePredefinedStopWordsRemover,

/// <summary>
/// Use custom stop words remover that removes stop words based on custom defined list.
/// </summary>
UseCustomStopWordsRemover
}

/// <summary>
/// Defines the different type of stop words remover supported.
/// </summary>
public sealed class StopWordsRemoverOption
{
/// <summary>
/// Type of stop words remover to use.
/// </summary>
public StopWordsRemoverType StopWordsRemover;

/// <summary>
/// List of stop words. It is used when <see cref="StopWordsRemover"/> is equal to <see cref="StopWordsRemoverType.UseCustomStopWordsRemover"/>.
/// </summary>
public string[] StopWords;

This comment has been minimized.

Copy link
@artidoro

artidoro Mar 15, 2019

Contributor

string [](start = 19, length = 6)

Why not IEnumerable<string> ? #Resolved

This comment has been minimized.

Copy link
@zeahmed

zeahmed Mar 16, 2019

Author Member

I doubt that EntryPoint will break because of this. The actual interface is still using string[]

I have mentioned that in the comments in the my PR. I am waiting for the people to comment on it. So that I can decide what to do.


In reply to: 266099423 [](ancestors = 266099423)

}

/// <summary>
/// Advanced options for the <see cref="TextFeaturizingEstimator"/>.
/// </summary>
@@ -96,8 +129,42 @@ public sealed class Options : TransformInputBase
[Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)]
public Language Language = DefaultLanguage;

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Mar 14, 2019

Member

Language [](start = 28, length = 8)

it needs to be part of StopWordsRemovingTransformer #Resolved

This comment has been minimized.

Copy link
@zeahmed

zeahmed Mar 15, 2019

Author Member

I think Language parameter is used by a couple components in TextTransform e.g. it used by tokenizer also. I know currently we are using very naïve whitespace tokenizer which does not use it. However, in future if we develop a sophisticated tokenizer for different language then this parameter will be needed. What do you say?


In reply to: 265808271 [](ancestors = 265808271)

This comment has been minimized.

Copy link
@Ivanidzo4ka

Ivanidzo4ka Mar 18, 2019

Member

I don't know what kind of future lays ahead of us, but for now, this option just look silly.


In reply to: 265812513 [](ancestors = 265812513,265808271)

This comment has been minimized.

Copy link
@zeahmed

zeahmed Mar 18, 2019

Author Member

hmm...so you think that it should not be here? it should instead be in PredefinedStopWordsRemoverOptions?


In reply to: 266564835 [](ancestors = 266564835,265812513,265808271)


[Argument(ArgumentType.Multiple, HelpText = "Use stop remover or not.", ShortName = "remover", SortOrder = 4)]
public bool UsePredefinedStopWordRemover = false;
[Argument(ArgumentType.Multiple, Name = "StopWordsRemover", HelpText = "Stopwords remover.", ShortName = "remover", NullName = "<None>", SortOrder = 4)]
internal IStopWordsRemoverFactory StopWordsRemover;

/// <summary>
/// The underlying state of <see cref="StopWordsRemover"/> and <see cref="StopWordsRemoverOptions"/>.
/// </summary>
private StopWordsRemoverOption _stopWordsRemoverOptions;

/// <summary>
/// Option to set type of stop word remover to use.
/// A stop word remover removes the language specific list of stop words from the input.
/// A custom list of stop words can also be defined instead of using predefined list in the system.
/// Setting this to 'null' does not remove stop words from the input.
/// </summary>
public StopWordsRemoverOption StopWordsRemoverOptions
{
get { return _stopWordsRemoverOptions; }
set
{
_stopWordsRemoverOptions = value;
IStopWordsRemoverFactory options = null;
if (_stopWordsRemoverOptions != null)
{
if (_stopWordsRemoverOptions.StopWordsRemover == StopWordsRemoverType.UsePredefinedStopWordsRemover)
options = new PredefinedStopWordsRemoverFactory();
else if (_stopWordsRemoverOptions.StopWordsRemover == StopWordsRemoverType.UseCustomStopWordsRemover)
{
options = new CustomStopWordsRemovingTransformer.LoaderArguments()
{
Stopwords = _stopWordsRemoverOptions.StopWords
};
}
}
StopWordsRemover = options;
}
}

[Argument(ArgumentType.AtMostOnce, HelpText = "Casing text using the rules of the invariant culture.", Name="TextCase", ShortName = "case", SortOrder = 5)]
public CaseMode CaseMode = TextNormalizingEstimator.Defaults.Mode;
@@ -203,6 +270,7 @@ public Options()

// These parameters are hardcoded for now.
// REVIEW: expose them once sub-transforms are estimators.
private IStopWordsRemoverFactory _stopWordsRemover;
private TermLoaderArguments _dictionary;
private INgramExtractorFactoryFactory _wordFeatureExtractor;
private INgramExtractorFactoryFactory _charFeatureExtractor;
@@ -220,7 +288,7 @@ private sealed class TransformApplierParams

public readonly NormFunction Norm;
public readonly Language Language;
public readonly bool UsePredefinedStopWordRemover;
public readonly IStopWordsRemoverFactory StopWordsRemover;
public readonly CaseMode TextCase;
public readonly bool KeepDiacritics;
public readonly bool KeepPunctuations;
@@ -252,7 +320,9 @@ internal LpNormNormalizingEstimatorBase.NormFunction LpNorm

// These properties encode the logic needed to determine which transforms to apply.
#region NeededTransforms
public bool NeedsWordTokenizationTransform { get { return WordExtractorFactory != null || UsePredefinedStopWordRemover || OutputTextTokens; } }
public bool NeedsWordTokenizationTransform { get { return WordExtractorFactory != null || NeedsRemoveStopwordsTransform || OutputTextTokens; } }

public bool NeedsRemoveStopwordsTransform { get { return StopWordsRemover != null; } }

public bool NeedsNormalizeTransform
{
@@ -298,7 +368,7 @@ public TransformApplierParams(TextFeaturizingEstimator parent)
CharExtractorFactory = parent._charFeatureExtractor?.CreateComponent(host, parent._dictionary);
Norm = parent.OptionalSettings.Norm;
Language = parent.OptionalSettings.Language;
UsePredefinedStopWordRemover = parent.OptionalSettings.UsePredefinedStopWordRemover;
StopWordsRemover = parent._stopWordsRemover;
TextCase = parent.OptionalSettings.CaseMode;
KeepDiacritics = parent.OptionalSettings.KeepDiacritics;
KeepPunctuations = parent.OptionalSettings.KeepPunctuations;
@@ -340,6 +410,7 @@ internal TextFeaturizingEstimator(IHostEnvironment env, string name, IEnumerable
if (options != null)
OptionalSettings = options;

_stopWordsRemover = null;
_dictionary = null;
_wordFeatureExtractor = OptionalSettings.WordFeatureExtractorFactory;
_charFeatureExtractor = OptionalSettings.CharFeatureExtractorFactory;
@@ -402,21 +473,23 @@ public ITransformer Fit(IDataView input)
view = new WordTokenizingEstimator(h, xfCols).Fit(view).Transform(view);
}

if (tparams.UsePredefinedStopWordRemover)
if (tparams.NeedsRemoveStopwordsTransform)
{
Contracts.Assert(wordTokCols != null, "StopWords transform requires that word tokenization has been applied to the input text.");
var xfCols = new StopWordsRemovingEstimator.ColumnOptions[wordTokCols.Length];
var xfCols = new StopWordsCol[wordTokCols.Length];
var dstCols = new string[wordTokCols.Length];
for (int i = 0; i < wordTokCols.Length; i++)
{
var tempName = GenerateColumnName(view.Schema, wordTokCols[i], "StopWordsRemoverTransform");
var col = new StopWordsRemovingEstimator.ColumnOptions(tempName, wordTokCols[i], tparams.StopwordsLanguage);
dstCols[i] = tempName;
tempCols.Add(tempName);
var col = new StopWordsCol();
col.Source = wordTokCols[i];
col.Name = GenerateColumnName(view.Schema, wordTokCols[i], "StopWordsRemoverTransform");
dstCols[i] = col.Name;
tempCols.Add(col.Name);
col.Language = tparams.StopwordsLanguage;

xfCols[i] = col;
}
view = new StopWordsRemovingEstimator(h, xfCols).Fit(view).Transform(view);
view = tparams.StopWordsRemover.CreateComponent(h, view, xfCols);
wordTokCols = dstCols;
}

@@ -443,7 +516,7 @@ public ITransformer Fit(IDataView input)
if (tparams.CharExtractorFactory != null)
{
{
var srcCols = tparams.UsePredefinedStopWordRemover ? wordTokCols : textCols;
var srcCols = tparams.NeedsRemoveStopwordsTransform ? wordTokCols : textCols;
charTokCols = new string[srcCols.Length];
var xfCols = new (string outputColumnName, string inputColumnName)[srcCols.Length];
for (int i = 0; i < srcCols.Length; i++)
@@ -568,6 +641,7 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema)
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView data)
{
var estimator = new TextFeaturizingEstimator(env, args.Columns.Name, args.Columns.Source ?? new[] { args.Columns.Name }, args);
estimator._stopWordsRemover = args.StopWordsRemover;
estimator._dictionary = args.Dictionary;
// Review: I don't think the following two lines are needed.
estimator._wordFeatureExtractor = args.WordFeatureExtractorFactory;
"Default": "English"
},
{
"Name": "UsePredefinedStopWordRemover",
"Type": "Bool",
"Desc": "Use stop remover or not.",
"Name": "StopWordsRemover",
"Type": {
"Kind": "Component",
"ComponentKind": "StopWordsRemover"
},
"Desc": "Stopwords remover.",
"Aliases": [
"remover"
],
"Required": false,
"SortOrder": 4.0,
"IsNullable": false,
"Default": false
"Default": null
},
{
"Name": "TextCase",
@@ -103,7 +103,7 @@ public void TrainSentiment()
{
OutputTokens = true,
KeepPunctuations = false,
UsePredefinedStopWordRemover = true,
StopWordsRemoverOptions = new TextFeaturizingEstimator.StopWordsRemoverOption() { StopWordsRemover = TextFeaturizingEstimator.StopWordsRemoverType.UsePredefinedStopWordsRemover },
Norm = TextFeaturizingEstimator.NormFunction.None,
CharFeatureExtractor = null,
WordFeatureExtractor = null,
@@ -973,7 +973,10 @@ public void EntryPointPipelineEnsembleText()
{
data = new TextFeaturizingEstimator(Env, "Features", new List<string> { "Text" },
new TextFeaturizingEstimator.Options {
UsePredefinedStopWordRemover = true,
StopWordsRemoverOptions = new TextFeaturizingEstimator.StopWordsRemoverOption()
{
StopWordsRemover = TextFeaturizingEstimator.StopWordsRemoverType.UsePredefinedStopWordsRemover
},
}).Fit(data).Transform(data);
}
else
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.