Use it as a library #7
Hi,

Thank you! You should be able to use SEM as a library once you install it. Nonetheless, let's say you want NER with character offsets. The following:

```python
from __future__ import print_function
from __future__ import unicode_literals

import os.path

import sem.modules.tagger
import sem.storage

# Load the pipeline, plus some objects used to write the output in the
# right format (unused here). This initial loading takes a few seconds.
pipeline, options, exporter, couples = sem.modules.tagger.load_master(
    os.path.expanduser("~/sem_data/resources/master/fr/NER.xml")
)

if __name__ == "__main__":
    document = sem.storage.Document(name="test.txt", content="Je suis chez ce cher Serge.", encoding="utf-8")
    pipeline.process_document(document)
    for annotation in document.annotation("NER").get_reference_annotations():
        print(annotation, document.content[annotation.lb : annotation.ub])
```

`get_reference_annotations` will give you triplets `(value, start, end)` where `start` and `end` are character offsets in the text; the script should print each annotation together with the matching text span.
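As a SEM-free illustration of what those `(value, start, end)` triplets mean, slicing the original text with the offsets recovers the annotated span. The annotation value and offsets below are made-up example data, not actual SEM output:

```python
# Hypothetical annotations as (value, start, end) character offsets;
# these values are illustrative, not produced by SEM.
content = "Je suis chez ce cher Serge."
annotations = [("Person", 21, 26)]

for value, lb, ub in annotations:
    # slicing the content with the offsets recovers the annotated span
    print(value, content[lb:ub])  # -> Person Serge
```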
OK, I just tried it now (I tried the POS tagging), it works perfectly! Thank you ;)
Hi,

which is pretty fast! Though when I test multiple raws in a for loop, the duration is relatively high. I got something like:

```python
from __future__ import print_function
from __future__ import unicode_literals

import os.path
import time

import pandas as pd
import numpy as np

import sem.modules.tagger
import sem.storage

def sem_result(raws):
    pipeline, options, exporter, couples = sem.modules.tagger.load_master(
        os.path.expanduser("~/sem_data/resources/master/fr/pos.xml")
    )
    df = []
    for raw in raws:
        start_time = time.time()
        document = sem.storage.Document(name="test.txt", content=raw, encoding="utf-8")
        pipeline.process_document(document)
        token = []
        pos = []
        for annotation in document.annotation("POS").get_reference_annotations():
            token.append(document.content[annotation.lb : annotation.ub])
            pos.append(annotation.value)
        duration = time.time() - start_time
        df.append([token, pos, duration])
    tokens = []
    pos_tag = []
    duration = []
    for elem in df:
        tokens.append(elem[0])
        pos_tag.append(elem[1])
        duration.append(elem[2])
    d = {'araws': raws, 'btokens': tokens, 'cpostag': pos_tag, 'duration': duration}
    dfSem = pd.DataFrame(data=d)
    dfSem.to_csv('sem.csv')
    return dfSem
```

I think I load the library for each raw. But how can I extract it from the loop?
The first log you sent:

is for segmentation only, as indicated by the second column. Furthermore: you currently load the pipeline once at every call of `sem_result`.
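The fix being suggested, loading the pipeline once instead of at every call, can be sketched generically. The `expensive_load` and `PIPELINE` names below are illustrative stand-ins, not SEM's API:

```python
import time

def expensive_load():
    """Stand-in for a slow one-time initialization, such as loading a model."""
    time.sleep(0.01)   # simulate the loading cost
    return str.upper   # pretend "pipeline": uppercases its input

# Load once, at module level, so every call and every row reuses it.
PIPELINE = expensive_load()

def process_all(raws):
    results = []
    for raw in raws:
        # only the cheap per-row work stays inside the loop
        results.append(PIPELINE(raw))
    return results
```

With this layout, `process_all(["abc", "def"])` pays the loading cost zero times per row, instead of once per call (or per row) as in the original function.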
Thanks! I moved the pipeline loading up and I installed the Wapiti library; the duration is much better now. Also, your POS tagging beats NLTK's POS tagging (the Stanford one) on many points. Really impressive and useful.
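To compare durations like this in a repeatable way, a tiny helper built on `time.perf_counter` (a more suitable clock for benchmarking than `time.time`) can wrap any call. The `timed` helper below is our own sketch, not part of SEM:

```python
import time

def timed(fn, *args):
    """Return (result, seconds elapsed) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# e.g. wrap a per-sentence tagging call; sorted() is just a stand-in workload
result, seconds = timed(sorted, [3, 1, 2])
```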
Thanks a lot for the feedback and thank you very much for using SEM! You made a developer really happy. :)
With pleasure! :) Do you have an email address where I can reach you?
You can find it at the end of the …
Hi,

Thanks for the project, it's pretty impressive!
I would like to know whether I can use it directly as a library, like nltk or spacy do.
I didn't find how to do that in the README.
Thanks!