Thanks again for all your great work on textacy
    import spacy
    from spacy.tokens import Doc
    from textacy.text_stats import TextStats

    class TextStatsComponent(object):
        name = 'textacy_text_stats'

        def __init__(self):
            Doc.set_extension('n_long_words', default=False)
            Doc.set_extension('n_syllables', default=False)
            Doc.set_extension('n_unique_words', default=False)

        def __call__(self, doc):
            ts = TextStats(doc)
            doc._.n_long_words = ts.n_long_words
            doc._.n_syllables = ts.n_syllables
            doc._.n_unique_words = ts.n_unique_words
            return doc

    nlp = spacy.load('en')
    text_stats_component = TextStatsComponent()
    nlp.add_pipe(text_stats_component)
    doc = nlp(u"This is a test test someverylongword")
    print(doc._.n_syllables, doc._.n_unique_words, doc._.n_long_words)
A more elegant version of this could define the attributes as a list of strings and then populate them programmatically (or even let the user choose the attributes they need when initialising the component). For example:
    class TextStatsComponent(object):
        name = 'textacy_text_stats'

        def __init__(self):
            self.attrs = ('n_chars', 'n_long_words', 'n_syllables', 'n_unique_words')  # etc.
            for attr in self.attrs:
                Doc.set_extension(attr, default=False)

        def __call__(self, doc):
            ts = TextStats(doc)
            for attr in self.attrs:
                doc._.set(attr, getattr(ts, attr))
            return doc
I used the
What do you think?
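To make the "let the user define the attributes" idea a bit more concrete, here is a minimal sketch of that pattern. It is deliberately library-free so it runs anywhere: `FakeTextStats` and `FakeDoc` are hypothetical stand-ins for textacy's `TextStats` and spaCy's `Doc`, the `stats` dict plays the role of `doc._`, and the 7-character threshold for "long" words is an assumption made only for this example. In a real component, the loop would call `Doc.set_extension` and `doc._.set` as in the snippet above.

```python
class FakeTextStats(object):
    """Hypothetical stand-in for textacy's TextStats (not the real API)."""

    def __init__(self, text):
        words = text.split()
        self.n_chars = sum(len(w) for w in words)
        self.n_words = len(words)
        self.n_unique_words = len({w.lower() for w in words})
        # Assumption for this sketch: "long" means 7 or more characters.
        self.n_long_words = sum(1 for w in words if len(w) >= 7)


class FakeDoc(object):
    """Hypothetical stand-in for a spaCy Doc; `stats` mimics `doc._`."""

    def __init__(self, text):
        self.text = text
        self.stats = {}


class ConfigurableStatsComponent(object):
    name = 'configurable_text_stats'
    # Attributes this component knows how to compute.
    available = ('n_chars', 'n_words', 'n_unique_words', 'n_long_words')

    def __init__(self, attrs=None):
        # Default to every available attribute if the user passes nothing.
        attrs = tuple(attrs) if attrs is not None else self.available
        # Fail early on typos instead of raising AttributeError per document.
        unknown = [a for a in attrs if a not in self.available]
        if unknown:
            raise ValueError('Unknown attributes: {}'.format(unknown))
        self.attrs = attrs

    def __call__(self, doc):
        ts = FakeTextStats(doc.text)
        for attr in self.attrs:
            doc.stats[attr] = getattr(ts, attr)
        return doc


component = ConfigurableStatsComponent(attrs=['n_unique_words', 'n_long_words'])
doc = component(FakeDoc("This is a test test someverylongword"))
print(doc.stats)  # {'n_unique_words': 5, 'n_long_words': 1}
```

Validating the names in `__init__` means a misspelled attribute is reported once, when the component is created, rather than on every document that passes through it.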
I felt inspired, so here's another one!
The function wraps spaCy's
    import spacy
    from textacy.preprocess import normalize_whitespace, preprocess_text

    def insert_preprocessor(nlp, normalize_ws=True, fix_unicode=True):  # etc.
        tokenizer = nlp.Defaults.create_tokenizer(nlp)

        def preprocess(text):
            if normalize_ws:
                text = normalize_whitespace(text)
            text = preprocess_text(text, fix_unicode=fix_unicode)  # etc.
            return tokenizer(text)

        nlp.tokenizer = preprocess
        return nlp

    nlp = insert_preprocessor(spacy.load('en'))
    doc = nlp(u"This is a text about Â£\n\n")
    print(doc.text, [t.text for t in doc])
    # This is a text about £ ['This', 'is', 'a', 'text', 'about', '£']
This is actually incredibly useful already and something I'll definitely add to my list of helper functions. (This type of preprocessing also comes up a lot in discussions around spaCy – so I'll start recommending this solution from now on.)
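The wrapping pattern itself does not depend on spaCy or textacy, so here is a rough standalone sketch of it using only the standard library. The name `make_preprocessing_tokenizer` is hypothetical, `str.split` stands in for a real tokenizer, and `unicodedata.normalize('NFKC', ...)` is swapped in as a simple Unicode cleanup step. Note that NFKC is not equivalent to textacy's `fix_unicode`: it folds compatibility characters (non-breaking spaces, fullwidth letters) but does not repair mojibake such as `Â£`.

```python
import re
import unicodedata


def make_preprocessing_tokenizer(tokenizer, normalize_ws=True, normalize_unicode=True):
    """Return a callable that cleans text before handing it to `tokenizer`."""

    def preprocess(text):
        if normalize_unicode:
            # Fold compatibility characters, e.g. U+00A0 -> ' ', U+FF21 -> 'A'.
            text = unicodedata.normalize('NFKC', text)
        if normalize_ws:
            # Collapse runs of whitespace and trim both ends.
            text = re.sub(r'\s+', ' ', text).strip()
        return tokenizer(text)

    return preprocess


# str.split is only a placeholder for a real tokenizer here.
tokenize = make_preprocessing_tokenizer(str.split)
tokens = tokenize(u"This  is\ta text\u00a0about \uff21\n\n")
print(tokens)  # ['This', 'is', 'a', 'text', 'about', 'A']
```

Because the wrapper closes over the original tokenizer, each preprocessing step can be switched on or off per pipeline without touching the tokenizer itself.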
Hi @ines! Thanks a ton for the general nudge and specific example. Update: Examples. I've been thinking about this since 2.0 was released but have, I'll admit, punted on committing to anything thus far. Shame on me.
One ill-defined and ill-investigated source of caution is that many of the tasks in
Thanks – and nice to hear that you've had similar ideas!
Yes, I really love the ease of use and just being able to plug in spaCy
In some cases, you might still prefer using the
In general, custom components can maintain state. Something like this would be totally fine:
    class Component(object):
        def __init__(self):
            self.count = 0

        def __call__(self, doc):
            self.count += 1
            print('Documents processed:', self.count)
            return doc
However, for the use case you describe, I don't think it's a good idea to attach all of this logic to the