
Use it as a library #7

Closed
Raphencoder opened this issue Dec 3, 2018 · 8 comments

@Raphencoder

Hi,
Thanks for this project, it's pretty impressive!
I would like to know whether I can use it directly as a library, the way nltk or spacy work.
I didn't find how to do that in the README.
Thanks!

@YoannDupont
Owner

YoannDupont commented Dec 3, 2018

Hi,

Thank you !

You should be able to use SEM as a library if you install it via setup.py. I'm still cleaning things up left and right and improving a couple of others, so be aware that some things might change in the future.
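
For reference, a minimal install sketch, assuming the standard setuptools workflow run from the SEM source directory (SEM may have extra steps, check its README):

python setup.py install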

Nonetheless, let's say you want NER with character offsets. The following:

from __future__ import print_function
from __future__ import unicode_literals
import os.path
import sem.modules.tagger
import sem.storage

pipeline, options, exporter, couples = sem.modules.tagger.load_master(
    os.path.expanduser("~/sem_data/resources/master/fr/NER.xml")
)
# here we load the pipeline, plus some objects used to write the output in the right format (not needed here)
# the initial loading of the pipeline takes a few seconds.

if __name__ == "__main__":
    document = sem.storage.Document(name="test.txt", content="Je suis chez ce cher Serge.", encoding="utf-8")
    pipeline.process_document(document)
    for annotation in document.annotation("NER").get_reference_annotations():
        print(annotation, document.content[annotation.lb : annotation.ub])
    # get_reference_annotations gives you annotations whose value, lb and ub are the label and the start/end character offsets in the text.

This should output Person,[21:26] Serge.
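(Person is the annotation's value, [21:26] its lb:ub character span, and Serge the corresponding substring of document.content.)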

@Raphencoder
Author

OK, I just tried it (I tried the POS tagging), and it works perfectly! Thank you ;)

@Raphencoder
Author

Raphencoder commented Dec 6, 2018

Hi,
I tested SEM for POS tagging, and the log showed something like:

INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        "test.txt" segmented in 1 sentences, 6 tokens
INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        in 0:00:00.000317

which is pretty fast!

However, when I process multiple raw texts in a for loop, the duration per text is relatively high; I got something like:
0.310899257659912
How can I improve the speed?
I used the code you wrote as the basis for this one:

from __future__ import print_function
from __future__ import unicode_literals
import os.path
import sem.modules.tagger
import sem.storage
import pandas as pd
import numpy as np
import time

def sem_result(raws):
    # the pipeline is loaded here, on every call to sem_result
    pipeline, options, exporter, couples = sem.modules.tagger.load_master(
        os.path.expanduser("~/sem_data/resources/master/fr/pos.xml")
    )
    df = []
    for raw in raws:
        start_time = time.time()
        document = sem.storage.Document(name="test.txt", content=raw, encoding="utf-8")
        pipeline.process_document(document)
        token = []
        pos = []
        for annotation in document.annotation("POS").get_reference_annotations():
            token.append(document.content[annotation.lb : annotation.ub])
            pos.append(annotation.value)
        duration = time.time() - start_time
        df.append([token, pos, duration])
    tokens = []
    pos_tag = []
    duration = []
    for elem in df:
        tokens.append(elem[0])
        pos_tag.append(elem[1])
        duration.append(elem[2])
    d = {'araws': raws, 'btokens': tokens, 'cpostag': pos_tag, 'duration': duration }
    dfSem = pd.DataFrame(data=d)
    dfSem.to_csv('sem.csv')
    return dfSem

I think I load the library for each raw text... but how can I extract that from the loop?

@YoannDupont
Owner

The first log you sent:

INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        "test.txt" segmented in 1 sentences, 6 tokens
INFO    2018-12-06 18:51:19,675 sem.segmentation        process_document        in 0:00:00.000317

is for segmentation only, as indicated by the second column: sem.segmentation. There are a couple more modules that are launched after this one for each document. To avoid loading the Wapiti models each time you call process_document, you can install python-wapiti.

Furthermore, you currently load the pipeline at every call of sem_result. You could load it only once by moving the load_master line to module level, as in the sketch below.
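
For illustration, a minimal sketch of that change, reusing the names and the path from your snippet (the CSV-building part is unchanged and omitted):

from __future__ import print_function
from __future__ import unicode_literals
import os.path
import time
import sem.modules.tagger
import sem.storage

# the pipeline is now loaded once, when the module is imported
pipeline, options, exporter, couples = sem.modules.tagger.load_master(
    os.path.expanduser("~/sem_data/resources/master/fr/pos.xml")
)

def sem_result(raws):
    results = []
    for raw in raws:
        start_time = time.time()
        document = sem.storage.Document(name="test.txt", content=raw, encoding="utf-8")
        pipeline.process_document(document)  # reuses the already-loaded pipeline
        tokens, pos = [], []
        for annotation in document.annotation("POS").get_reference_annotations():
            tokens.append(document.content[annotation.lb : annotation.ub])
            pos.append(annotation.value)
        results.append((tokens, pos, time.time() - start_time))
    return results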

@Raphencoder
Author

Raphencoder commented Dec 7, 2018

Thanks! I moved the pipeline loading up and installed the wapiti library; processing is much faster now. Also, your POS tagging beats the NLTK POS tagging (the Stanford one) on many points. Really impressive and useful.

@YoannDupont
Owner

Thanks a lot for the feedback and thank you very much for using SEM! You made a developer really happy. :)

@Raphencoder
Author

Raphencoder commented Dec 7, 2018

With pleasure! :) Do you have an email address where I can reach you?

@YoannDupont
Owner

You can find it at the end of the setup.py file (I do not like to copy/paste this kind of information directly into discussions).
