
Context for ngrams? #171

Open
DTchebotarev opened this issue Feb 28, 2018 · 3 comments

Comments

@DTchebotarev

Is it possible to add context to ngram extraction?

For example, currently running

list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3, as_strings=True))

returns a list

['-PRON- like green', 'like green egg', 'egg and ham']

But I would ideally like to have the option to specify something like

list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3, as_strings=True, left_pad=True, right_pad=True))

and have it return something along the lines of

['<s2> <s1> -PRON-', '<s1> -PRON- like', '-PRON- like green', 'like green egg', 'egg and ham', 'and ham </s1>', 'ham </s1> </s2>']

I don't think this is possible in textacy currently, so I guess this is a feature request.

Also any ideas for a workaround are greatly appreciated :)
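
One possible workaround, sketched in plain Python: pad the lemmatized token list yourself and build the n-gram strings manually, instead of going through to_terms_list(). The helper name padded_ngrams and the <s>/</s> pad markers are made up for this sketch and are not part of textacy's API; the lemma list here stands in for something like [tok.lemma_ for tok in doc] from spaCy.

```python
def padded_ngrams(tokens, n, pad_left="<s>", pad_right="</s>"):
    """Return n-grams (as strings) over a token list padded on both sides.

    Each side gets n - 1 pad markers, so the first and last tokens each
    appear in n different n-grams, mirroring the padding in the request above.
    """
    padded = [pad_left] * (n - 1) + list(tokens) + [pad_right] * (n - 1)
    return [" ".join(padded[i : i + n]) for i in range(len(padded) - n + 1)]

# Stand-in for lemmas extracted with spaCy, e.g. [tok.lemma_ for tok in doc]
lemmas = ["-PRON-", "like", "green", "egg", "and", "ham"]
print(padded_ngrams(lemmas, 3))
```

Note this sketch does not apply textacy's usual edge-stopword filtering, so it also emits n-grams like 'green egg and' that to_terms_list() drops by default.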

@bdewilde
Collaborator

bdewilde commented Mar 1, 2018

Hi @DTchebotarev , this is not currently a feature, but I appreciate that padding sequences is a common task in deep learning. I've been dragging my feet on getting DL models into textacy, but when I do, I'd expect to include useful adjacent functionality like this as well.

@jnothman

jnothman commented Jul 4, 2019

Padding sequences is common even outside deep learning. It gives more context to an n-gram (e.g. marking that it is text-initial).

@bdewilde
Collaborator

bdewilde commented Jul 4, 2019

I recently implemented something like this in a keyterm extraction algorithm:

textacy/textacy/keyterms.py

Lines 247 to 251 in 794be59

```python
for sent_idx, sent in enumerate(doc.sents):
    padding = [None] * window_size
    sent_padded = itertoolz.concatv(padding, sent, padding)
    for window in itertoolz.sliding_window(1 + (2 * window_size), sent_padded):
        lwords, word, rwords = window[:window_size], window[window_size], window[window_size + 1:]
```

Unlike extract.ngrams(), this method produces Tuple[Token] rather than Span objects, so it doesn't work in the context of to_terms_list(). But maybe it's helpful.
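
For illustration, here is a pure-Python sketch of that sliding-window idea, with itertoolz.concatv and itertoolz.sliding_window replaced by stdlib-only equivalents so it runs without cytoolz. The tokens list stands in for a spaCy sentence, and window_size = 2 is an arbitrary choice for the example.

```python
def sliding_window(n, seq):
    """Return overlapping length-n tuples, like itertoolz.sliding_window."""
    seq = list(seq)
    return [tuple(seq[i : i + n]) for i in range(len(seq) - n + 1)]

window_size = 2
tokens = ["I", "like", "green", "eggs"]  # stand-in for a spaCy sentence
padding = [None] * window_size
sent_padded = padding + tokens + padding  # like itertoolz.concatv

for window in sliding_window(1 + (2 * window_size), sent_padded):
    # Split each window into left context, center word, right context,
    # as in the keyterms.py snippet above; None marks padding.
    lwords = window[:window_size]
    word = window[window_size]
    rwords = window[window_size + 1:]
    print(lwords, word, rwords)
```

Because the padding slots are None rather than real tokens, every word (including the first and last of the sentence) gets a full-width context window.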
