
Idea: spaCy pipeline components for textacy features #168

Closed
ines opened this issue Feb 26, 2018 · 6 comments

@ines ines commented Feb 26, 2018

Thanks again for all your great work on textacy 🙏 I was just playing around with the latest version and had an idea: Now that spaCy v2.0 supports custom pipeline components and attributes, it might be cool to expose some of textacy's features as custom attributes, via functions that users can add to their spaCy pipeline. Here's a proof of concept:

import spacy
from spacy.tokens import Doc
from textacy.text_stats import TextStats


class TextStatsComponent(object):
    name = 'textacy_text_stats'

    def __init__(self):
        Doc.set_extension('n_long_words', default=False)
        Doc.set_extension('n_syllables', default=False)
        Doc.set_extension('n_unique_words', default=False)

    def __call__(self, doc):
        ts = TextStats(doc)
        doc._.n_long_words = ts.n_long_words
        doc._.n_syllables = ts.n_syllables
        doc._.n_unique_words = ts.n_unique_words
        return doc


nlp = spacy.load('en')
text_stats_component = TextStatsComponent()
nlp.add_pipe(text_stats_component)

doc = nlp(u"This is a test test someverylongword")
print(doc._.n_syllables, doc._.n_unique_words, doc._.n_long_words)

A more elegant solution of this could define a list of the attributes as strings, and then populate them programmatically (or even let the user define the attributes they need when initialising the component). For example:

class TextStatsComponent(object):
    name = 'textacy_text_stats'

    def __init__(self):
        self.attrs = ('n_chars', 'n_long_words', 'n_syllables', 'n_unique_words')  # etc.
        for attr in self.attrs:
            Doc.set_extension(attr, default=False)

    def __call__(self, doc):
        ts = TextStats(doc)
        for attr in self.attrs:
            doc._.set(attr, getattr(ts, attr))
        return doc
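To sketch the "let the user define the attributes" variant without requiring spaCy or textacy to be installed, here's the same getattr-based pattern in plain Python – the FakeStats class and the dict-returning component are stand-ins for illustration only:

```python
# Hypothetical sketch of the "user picks attributes at init" idea;
# FakeStats stands in for textacy's TextStats.
class FakeStats(object):
    def __init__(self, text):
        words = text.split()
        self.n_chars = sum(len(w) for w in words)
        self.n_long_words = sum(1 for w in words if len(w) >= 7)
        self.n_unique_words = len(set(w.lower() for w in words))


class StatsComponent(object):
    name = 'stats'

    def __init__(self, attrs=None):
        # the caller chooses which attributes to expose
        self.attrs = attrs or ('n_chars', 'n_long_words', 'n_unique_words')

    def __call__(self, doc_text):
        ts = FakeStats(doc_text)
        # populate the attributes programmatically, as in the spaCy version
        return {attr: getattr(ts, attr) for attr in self.attrs}


component = StatsComponent(attrs=('n_long_words', 'n_unique_words'))
print(component("This is a test test someverylongword"))
# {'n_long_words': 1, 'n_unique_words': 5}
```

In the real spaCy version, the `__call__` would set the same values via `doc._.set(attr, ...)` and return the `doc`.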

I used TextStats because it was the simplest and most straightforward example – but I'm sure there are a lot of other features that could be added in a similar way. It would be cool to hear your ideas on which features are the most popular and what would make sense to experiment with. The pipeline extensions could be part of the spaCy utils included in textacy, or maybe a separate little package that depends on both libraries.

What do you think?

@ines ines commented Feb 26, 2018

I felt inspired, so here's another one! 😄

The function wraps spaCy's nlp object and adds textacy's preprocessing features before the tokenizer. I've only added two options here, but it could easily be extended to allow all available parameters of preprocess_text.

import spacy
from textacy.preprocess import normalize_whitespace, preprocess_text


def insert_preprocessor(nlp, normalize_ws=True, fix_unicode=True):  # etc.
    tokenizer = nlp.Defaults.create_tokenizer(nlp)

    def preprocess(text):
        if normalize_ws:
            text = normalize_whitespace(text)
        text = preprocess_text(text, fix_unicode=fix_unicode)  # etc.
        return tokenizer(text)

    nlp.tokenizer = preprocess
    return nlp


nlp = insert_preprocessor(spacy.load('en'))
doc = nlp(u"This   is a text about Â£\n\n")
print(doc.text, [t.text for t in doc])
# This is a text about £ ['This', 'is', 'a', 'text', 'about', '£']
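The pass-through of additional options could look roughly like this. To keep the sketch runnable without spaCy or textacy, the tokenizer and the preprocessing functions here are simplified stand-ins – the forwarding pattern is the point:

```python
# Sketch of forwarding extra preprocessing options into the wrapped
# tokenizer; normalize_whitespace and str.split are simplified stand-ins
# for textacy's preprocessing and spaCy's tokenizer.
def normalize_whitespace(text):
    return ' '.join(text.split())


def insert_preprocessor(tokenizer, normalize_ws=True, **preprocess_kwargs):
    def preprocess(text):
        if normalize_ws:
            text = normalize_whitespace(text)
        # in the real version, preprocess_kwargs would be forwarded as
        # preprocess_text(text, **preprocess_kwargs)
        if preprocess_kwargs.get('lowercase'):
            text = text.lower()
        return tokenizer(text)
    return preprocess


tok = insert_preprocessor(str.split, lowercase=True)
print(tok("This   is a TEST\n\n"))
# ['this', 'is', 'a', 'test']
```

The real wrapper would replace `nlp.tokenizer` with `preprocess`, exactly as in the snippet above.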

This is actually incredibly useful already and something I'll definitely add to my list of helper functions. (This type of preprocessing also comes up a lot in discussions around spaCy – so I'll start recommending this solution from now on 😃)

@bdewilde bdewilde commented Feb 26, 2018

Hi @ines! Thanks a ton for the general nudge and specific examples. I've been thinking about this since 2.0 was released, but have, I'll admit, punted on committing to anything thus far. Shame on me. 😅

I've given textacy lots of entry points that accept spacy Docs, and I've availed myself of Doc.user_data for stashing document metadata. I do intend to continue building in better interoperability with spacy, so this definitely seems like something to try soon.

One ill-defined and little-investigated source of caution: many of the tasks in textacy involve inputs across multiple documents, as well as maintaining shared state. For example, the Vectorizer class(es) build a document-term matrix from a collection of tokenized documents (where the tokens have already been converted into strings). Is this paradigm handled by spaCy's custom components/extensions?

@ines ines commented Feb 26, 2018

Thanks – and nice to hear that you've had similar ideas!

I've given textacy lots of entry points that accept spacy Docs, and I've availed myself of Doc.user_data for stashing document metadata.

Yes, I really love the ease of use and just being able to plug in spaCy Doc objects 💕

In some cases, you might still prefer using user_data directly, especially for data that's not fully exposed to the user. Internally, the new Underscore class, which handles the custom attributes, also proxies to the Doc.user_data attribute. doc._ is mostly a convenient access point for the user.
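As a toy illustration of that proxying idea (not spaCy's actual implementation, which uses a more involved key scheme), an extension-style attribute layer can simply read and write a shared user_data dict:

```python
# Toy mimic of an extension-attribute proxy backed by a user_data dict;
# spaCy's real Underscore class is more involved than this.
class Underscore(object):
    def __init__(self, user_data):
        object.__setattr__(self, '_data', user_data)

    def __getattr__(self, name):
        return self._data[('._.', name)]

    def __setattr__(self, name, value):
        self._data[('._.', name)] = value


user_data = {}
doc_ = Underscore(user_data)
doc_.n_syllables = 9
print(user_data)
# {('._.', 'n_syllables'): 9}
```

The point is that the attribute access is just sugar: everything ultimately lives in the one dict, which is why user_data remains a perfectly good place to stash data you don't want to expose as attributes.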

Is this paradigm handled by spaCy's custom components/extensions?

In general, custom components can maintain state. Something like this would be totally fine:

class Component(object):
    def __init__(self):
        self.count = 0

    def __call__(self, doc):
        self.count += 1
        print('Documents processed:', self.count)
        return doc

However, for the use case you describe, I don't think it's a good idea to attach all of this logic to the nlp object. Some things are just better handled outside of the pipeline (and that's totally fine).
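For instance, building a document-term matrix fits naturally as a separate pass over already-processed, tokenized docs rather than as a pipeline component. A minimal stdlib sketch of that corpus-level step (illustrative only – not textacy's actual Vectorizer API):

```python
from collections import Counter


# Minimal sketch of corpus-level state kept outside the pipeline:
# tokenized docs go in, a shared vocabulary and doc-term counts come out.
def fit_transform(tokenized_docs):
    vocab = sorted(set(tok for doc in tokenized_docs for tok in doc))
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix


docs = [['a', 'test', 'test'], ['a', 'text']]
vocab, matrix = fit_transform(docs)
print(vocab)   # ['a', 'test', 'text']
print(matrix)  # [[1, 2, 0], [1, 0, 1]]
```

Because the vocabulary is shared across all documents, this kind of object is exactly the cross-document state that's easier to manage outside the per-doc pipeline.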

@bdewilde bdewilde commented Feb 26, 2018

This is great, thanks for clarifying what's possible on my side! I'll probably work on some custom spacy components once I put together a few end-to-end usage examples in the docs. I should've learned that lesson from you folks – spaCy's examples are 💯

@bdewilde bdewilde commented Apr 5, 2018

Update: I'm finally getting started on this. 72e97fd

Thanks again for the feature request. :)

@bdewilde bdewilde commented Apr 8, 2018

Hey @ines, I'm going to close this issue out – I've got a down payment on this feature request now in master. I plan to add more custom components to textacy.spacier over time. 👍

@bdewilde bdewilde closed this Apr 8, 2018