Add flexible group-term-matrix vectorization class #156
Description
Adds a GroupVectorizer class, a child of the existing Vectorizer class, as an extension of typical document-term matrix vectorization, where terms are grouped by the documents in which they co-occur. It allows for customized grouping, such as by a shared author or publication year, that may span multiple documents, without forcing users to merge those documents themselves.
Renames attributes on Vectorizer for clarity and consistency with GroupVectorizer: vocabulary => vocabulary_terms and feature_names => terms_list. GroupVectorizer also includes vocabulary_grps and grps_list attrs.
Motivation and Context
The more flexible vectorization is a fairly niche need, since grouping terms by document covers most use cases, but you can imagine grouping terms by, say, year or speaker, and comparing these groups by their similarity to each other. Another possibility is generating semantic networks where nodes are groups and edges are weighted by the similarity of term usage between each pair of groups.
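To make the idea concrete, here is a minimal standard-library sketch of grouping terms by speaker and comparing groups by cosine similarity. It does not use the GroupVectorizer API from this PR; the helper names (group_term_counts, cosine_similarity) and the toy corpus are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def group_term_counts(tokenized_docs, grps):
    """Sum term counts over all docs that share a group label (hypothetical helper)."""
    counts = defaultdict(Counter)
    for tokens, grp in zip(tokenized_docs, grps):
        counts[grp].update(tokens)
    return dict(counts)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count mappings (hypothetical helper)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy corpus: three tokenized docs, two of them from the same speaker.
tokenized_docs = [
    ["tax", "budget", "tax"],
    ["budget", "health"],
    ["health", "care", "health"],
]
grps = ["alice", "alice", "bob"]

# One row of counts per group, without merging the docs by hand.
counts = group_term_counts(tokenized_docs, grps)

# Edges of a group-level semantic network: one similarity score per group pair.
edges = {
    (g1, g2): cosine_similarity(counts[g1], counts[g2])
    for g1, g2 in combinations(sorted(counts), 2)
}
```

Here the two "alice" docs collapse into a single row of term counts, and edges holds one weighted edge per pair of groups, which is the structure a group-level semantic network would be built from.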
How Has This Been Tested?
Existing tests for Vectorizer were updated to reflect the new attribute names, and those tests all pass. New tests were added for both Vectorizer and GroupVectorizer, and they also pass.
Types of changes
Checklist: