
Use it as a library #7

Closed
Raphencoder opened this issue Dec 3, 2018 · 8 comments

@Raphencoder

Hi,
Thanks for this project, it's pretty impressive!
I would like to know whether I can use it directly as a library, the way nltk or spacy work.
I didn't find how to do that in the README.
Thanks!

@YoannDupont
Owner

YoannDupont commented Dec 3, 2018

Hi,

Thank you !

You should be able to use SEM as a library if you install it via setup.py. I'm still cleaning things up left and right and improving a couple of others, so be aware that some things might change in the future.
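
For reference, a minimal install sketch, assuming the standard setuptools workflow run from the SEM source directory (SEM may have extra steps, check its README):

python setup.py install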

Nonetheless, let's say you want NER with character offsets. The following:

from __future__ import print_function
from __future__ import unicode_literals
import os.path
import sem.modules.tagger
import sem.storage

pipeline, options, exporter, couples = sem.modules.tagger.load_master(
    os.path.expanduser("~/sem_data/resources/master/fr/NER.xml")
)
# here we load the pipeline, plus some objects used to write the output in the right format (not needed here)
# the initial loading of the pipeline takes a few seconds.

if __name__ == "__main__":
    document = sem.storage.Document(name="test.txt", content="Je suis chez ce cher Serge.", encoding="utf-8")
    pipeline.process_document(document)
    for annotation in document.annotation("NER").get_reference_annotations():
        print(annotation, document.content[annotation.lb : annotation.ub])
    # get_reference_annotations gives you annotations whose value, lb and ub are the label and the start/end character offsets in the text.

This should output Person,[21:26] Serge.
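(Person is the annotation's value, [21:26] its lb:ub character span, and Serge the corresponding substring of document.content.)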

@Raphencoder
Author

OK, I just tried it (I tried the POS tagging), and it works perfectly! Thank you ;)

@Raphencoder
Author

Raphencoder commented Dec 6, 2018

Hi,
I tested SEM for POS tagging, and the log showed something like:

INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        "test.txt" segmented in 1 sentences, 6 tokens
INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        in 0:00:00.000317

which is pretty fast!

However, when I process multiple raw texts in a for loop, the duration per text is relatively high; I got something like:
0.310899257659912
How can I improve the speed?
I used the code you wrote as the basis for this one:

from __future__ import print_function
from __future__ import unicode_literals
import os.path
import sem.modules.tagger
import sem.storage
import pandas as pd
import numpy as np
import time

def sem_result(raws):
    # the pipeline is loaded here, on every call to sem_result
    pipeline, options, exporter, couples = sem.modules.tagger.load_master(
        os.path.expanduser("~/sem_data/resources/master/fr/pos.xml")
    )
    df = []
    for raw in raws:
        start_time = time.time()
        document = sem.storage.Document(name="test.txt", content=raw, encoding="utf-8")
        pipeline.process_document(document)
        token = []
        pos = []
        for annotation in document.annotation("POS").get_reference_annotations():
            token.append(document.content[annotation.lb : annotation.ub])
            pos.append(annotation.value)
        duration = time.time() - start_time
        df.append([token, pos, duration])
    tokens = []
    pos_tag = []
    duration = []
    for elem in df:
        tokens.append(elem[0])
        pos_tag.append(elem[1])
        duration.append(elem[2])
    d = {'araws': raws, 'btokens': tokens, 'cpostag': pos_tag, 'duration': duration }
    dfSem = pd.DataFrame(data=d)
    dfSem.to_csv('sem.csv')
    return dfSem

I think I load the library for each raw text... but how can I extract that from the loop?

@YoannDupont
Owner

The first log you sent:

INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        "test.txt" segmented in 1 sentences, 6 tokens
INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        in 0:00:00.000317

is for segmentation only, as indicated by the second column: sem.segmentation. There are a couple more modules that are launched after this one for each document. To avoid loading the Wapiti models each time you call process_document, you can install python-wapiti.

Furthermore, you currently load the pipeline at every call of sem_result. You could load it only once by moving the load_master line to module level, as in the sketch below.
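
For illustration, a minimal sketch of that change, reusing the names and the path from your snippet (the CSV-building part is unchanged and omitted):

from __future__ import print_function
from __future__ import unicode_literals
import os.path
import time
import sem.modules.tagger
import sem.storage

# the pipeline is now loaded once, when the module is imported
pipeline, options, exporter, couples = sem.modules.tagger.load_master(
    os.path.expanduser("~/sem_data/resources/master/fr/pos.xml")
)

def sem_result(raws):
    results = []
    for raw in raws:
        start_time = time.time()
        document = sem.storage.Document(name="test.txt", content=raw, encoding="utf-8")
        pipeline.process_document(document)  # reuses the already-loaded pipeline
        tokens, pos = [], []
        for annotation in document.annotation("POS").get_reference_annotations():
            tokens.append(document.content[annotation.lb : annotation.ub])
            pos.append(annotation.value)
        results.append((tokens, pos, time.time() - start_time))
    return results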

@Raphencoder
Author

Raphencoder commented Dec 7, 2018

Thanks! I moved the pipeline loading up and installed the wapiti library; processing is much faster now. Also, your POS tagging beats the NLTK POS tagging (the Stanford one) on many points. Really impressive and useful.

@YoannDupont
Owner

Thanks a lot for the feedback and thank you very much for using SEM! You made a developer really happy. :)

@Raphencoder
Author

Raphencoder commented Dec 7, 2018

With pleasure! :) Do you have an email address where I can reach you?

@YoannDupont
Owner

You can find it at the end of the setup.py file (I do not like to copy/paste this kind of information directly into discussions).
