korp_mono.py list index out of range #3

Open
Phaqui opened this issue May 15, 2023 · 2 comments

Phaqui commented May 15, 2023

Something seems to be going wrong somewhere. The file I am trying to convert with korp_mono.py contains analyses like this:

"<1024x768>"
	"1024x" Err/MissingSpace"768" Num @HNOUN #2->0

This makes the script crash with the following message (the first three lines are ones I printed myself, to find out what the input looked like):

anders@debian:~/corpus/corpus-fao$ korp_mono --skip-existing --ncpus most analysed/blogs/web_mix.txt.xml
--skip-existing given. Skipping 0 files that are already processed
Processing 1 files in parallel (9 workers)
word_form='1024x768'
lemma='1024x_∞_@HNOUN #2->0'
rest_cohort='\t"1024x" Err/MissingSpace"768" Num @HNOUN #2->0'
[1/1 FAILED: /home/anders/corpus/corpus-fao/analysed/blogs/web_mix.txt.xml
list index out of range
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/anders/.pyenv/versions/3.11.1/lib/python3.11/concurrent/futures/process.py", line 256, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 528, in process_file
    make_vrt_xml(file, analysed_file.lang),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 547, in make_vrt_xml
    make_sentences(valid_sentences(old_root.find(".//body/dependency").text), lang)
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 888, in make_sentences
    return [
           ^
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 889, in <listcomp>
    make_sentence(current_sentence, current_lang) for current_sentence in sentences
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 879, in make_sentence
    [
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 880, in <listcomp>
    make_analysis_tuple(word_form, rest_cohort, current_lang)
  File "/home/anders/projects/CorpusTools/corpustools/korp_mono.py", line 840, in make_analysis_tuple
    maybe_pos = parts[1].replace("_∞_", "").strip()
                ~~~~~^^^
IndexError: list index out of range

Phaqui commented May 15, 2023

One workaround I found for the error, at least temporarily, was to change the regex used for the split to:

(line 834) r"(_∞_\w+\s?|_∞_@\s?|_∞_\?\s?|_∞_\<ehead>\s?|_∞_#|_∞_\<mv>\s?\|_∞_\<aux>\s?)"

(from r"(_∞_\w+\s?|_∞_\?\s?|_∞_\<ehead>\s?|_∞_#|_∞_\<mv>\s?\|_∞_\<aux>\s?)")

In other words, I added _∞_@\s?, because my input to .split() was "1024x_∞_@HNOUN #2->0".
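To make the difference concrete, here is a minimal sketch that runs both patterns against the value printed in the log. It assumes the split is done with `re.split` (or an equivalent compiled pattern), which is an assumption on my part; the variable names are illustrative only and not taken from korp_mono.py.

```python
import re

# Pattern quoted above as the original one at line 834 of korp_mono.py.
ORIGINAL = r"(_∞_\w+\s?|_∞_\?\s?|_∞_\<ehead>\s?|_∞_#|_∞_\<mv>\s?\|_∞_\<aux>\s?)"
# Same pattern with the extra `_∞_@\s?` alternative added.
PATCHED = r"(_∞_\w+\s?|_∞_@\s?|_∞_\?\s?|_∞_\<ehead>\s?|_∞_#|_∞_\<mv>\s?\|_∞_\<aux>\s?)"

lemma = "1024x_∞_@HNOUN #2->0"  # value printed in the log

# Assumption: the real code splits the lemma with re.split or similar.
original_parts = re.split(ORIGINAL, lemma)
patched_parts = re.split(PATCHED, lemma)

# No alternative matches "_∞_@", so the string comes back as a single element
# and parts[1] raises IndexError.
print(original_parts)  # ['1024x_∞_@HNOUN #2->0']

# With the extra alternative the separator is captured as its own element.
print(patched_parts)  # ['1024x', '_∞_@', 'HNOUN #2->0']

# The expression from the traceback, applied to the patched result.
maybe_pos = patched_parts[1].replace("_∞_", "").strip()
print(repr(maybe_pos))  # '@'
```

With the extra alternative the crash goes away, but `maybe_pos` ends up as just "@", which does not look like a part-of-speech tag.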

...But that is probably not correct. I also don't know whether this actually stems from an error somewhere earlier in the pipeline (in which case the fix belongs there), or whether something is wrong with my language model itself.

albbas commented May 16, 2023

You will probably have to look at what produces the lines containing _∞_, and follow what happens from the analysis input up to the point where you are now, to see whether it makes sense.
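One possible starting point for that tracing is to list the reading lines in the analysed text that look like the one in the report. The sketch below flags indented reading lines containing more than one quoted lemma, as in `"1024x" Err/MissingSpace"768"`. Both the helper name `suspicious_readings` and the idea that a reading line normally carries a single quoted lemma are assumptions made for illustration; neither comes from CorpusTools.

```python
import re

# Matches one double-quoted token such as "1024x" or "768".
QUOTED_LEMMA = re.compile(r'"[^"]*"')


def suspicious_readings(dependency_text):
    """Yield (line_number, line) for reading lines with more than one quoted lemma.

    Hypothetical helper: it assumes reading lines are indented and that a
    normal reading line contains exactly one quoted lemma.
    """
    for number, line in enumerate(dependency_text.splitlines(), start=1):
        # Cohort lines ("<word>") start in column 0; reading lines are indented.
        if line[:1].isspace() and len(QUOTED_LEMMA.findall(line)) > 1:
            yield number, line


# The cohort from the report, used as sample input.
sample = '"<1024x768>"\n\t"1024x" Err/MissingSpace"768" Num @HNOUN #2->0\n'

for number, line in suspicious_readings(sample):
    print(number, repr(line))
```

Running this over the text of the dependency element (the traceback shows it being read via `old_root.find(".//body/dependency").text`) would show how many such cohorts the file contains and which ones to follow through the pipeline.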
