Character n-grams #103
Comments
One way of doing this is to use each bigram as a feature key, with the
boolean True as its value:
features['fo'] = True
features['oo'] = True
features['od'] = True
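A minimal sketch of the same idea, generalized to any n. The helper name `char_ngram_features` is hypothetical and not part of sklearn-crfsuite; it only builds the kind of dict shown above:

```python
def char_ngram_features(token, n=2):
    """Return a dict with one boolean entry per character n-gram of `token`."""
    features = {}
    # Slide a window of width n over the token; tokens shorter than n yield {}.
    for i in range(len(token) - n + 1):
        features[token[i:i + n]] = True
    return features


print(char_ngram_features("food"))
# {'fo': True, 'oo': True, 'od': True}
```

If bare keys such as 'fo' could clash with other feature names, prepending a label of your own choosing (for example 'bigram=fo') is one way to keep them apart.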
If you also want to encode the position of the bigram within the word, it
would be something like:
features['fo_word_prefix'] = True
features['oo_word_middle'] = True
features['od_word_suffix'] = True
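The positional variant could be sketched as follows. The function name `positional_bigram_features` is hypothetical, and treating a two-character token as a prefix only is an assumption of this sketch:

```python
def positional_bigram_features(token):
    """Tag each character bigram with its position inside the token."""
    features = {}
    last = len(token) - 2  # start index of the final bigram
    for i in range(last + 1):
        bigram = token[i:i + 2]
        if i == 0:
            # Assumption: a two-character token counts as a prefix only.
            position = "word_prefix"
        elif i == last:
            position = "word_suffix"
        else:
            position = "word_middle"
        features[f"{bigram}_{position}"] = True
    return features


print(positional_bigram_features("food"))
# {'fo_word_prefix': True, 'oo_word_middle': True, 'od_word_suffix': True}
```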
For reference, see https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb
and have a look at features['BOS'] = True in the function word2features.
On Fri, Jul 6, 2018 at 5:09 PM, yamivicen wrote:
I have training data where each token is a word, and I've already
extracted a few features like NER, POS and CHUNK for each token. But I have
a problem when I try to extract character n-gram features. Since these
features are computed at the character level, I don't know how to represent
their values in the attribute-value format. For example, if the
current token is "food", then its character bigram feature will be something
like "fo, oo, od". So how should I format that feature? By writing
something like bigram[0]=fo, oo, od?
Take a look at the Stanford NLP NER features. These features are quite useful in morphologically rich languages like Finnish, Turkish, Russian and others. You can write the prefixes of the word "food" like:
And the suffixes:
I don't remember the exact start and end flags, but you get the idea.
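The exact notation is not preserved in this thread, so the following is only a rough sketch of prefix and suffix features in the same key/True style. The function name `affix_features`, the `prefix=`/`suffix=` key labels and the maximum length of 3 are all assumptions, not the format Stanford NER actually uses:

```python
def affix_features(token, max_len=3):
    """Return boolean features for the leading and trailing substrings of `token`."""
    features = {}
    # Take affixes of length 1..max_len, capped at the token length.
    for k in range(1, min(max_len, len(token)) + 1):
        features[f"prefix={token[:k]}"] = True
        features[f"suffix={token[-k:]}"] = True
    return features


print(affix_features("food"))
# {'prefix=f': True, 'suffix=d': True, 'prefix=fo': True,
#  'suffix=od': True, 'prefix=foo': True, 'suffix=ood': True}
```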