<unk> and <eps> #3

Open
chenguoguo opened this issue Jan 28, 2020 · 1 comment
@chenguoguo

Hey guys, I finally got some spare time to look into this. Thanks a lot for putting this together!

I'm looking at the symbol tables for words and characters. I noticed that 0 is reserved for <eps> in words.txt, but is used for <unk> in characters.txt. As a result, in the resulting SG.fst graph, the output side has separate <eps> and <unk> symbols, while the input side has a single symbol that mixes <eps> and <unk>. This is because OpenFST treats 0 as epsilon in all algorithms by default.

Shall we reserve 0 for <eps> as long as OpenFST is involved? This requires changes to both Athena and Athena-decoder. Correct me if I'm wrong though. @tjadamlee @godjealous
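
A quick way to check the concern above is to read each text symbol table and look at what sits at ID 0. The sketch below is only illustrative: it assumes the usual OpenFST text format of one `symbol id` pair per line, and the table contents are made-up examples, not the actual Athena files.

```python
def read_symbol_table(lines):
    """Parse OpenFST-style text symbol table lines ("symbol id") into {id: symbol}."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        symbol, idx = parts
        table[int(idx)] = symbol
    return table


def zero_is_epsilon(table, eps="<eps>"):
    """Return True if ID 0 exists and is mapped to the epsilon symbol."""
    return table.get(0) == eps


# Hypothetical contents, for illustration only.
words = read_symbol_table(["<eps> 0", "<unk> 1", "hello 2"])
chars = read_symbol_table(["<unk> 0", "a 1", "b 2"])

for name, table in (("words", words), ("chars", chars)):
    status = "ok" if zero_is_epsilon(table) else "ID 0 is %r, not <eps>" % table.get(0)
    print(name, status)
```

With real files, the same functions can be fed `open(path, encoding="utf-8")` instead of the inline lists.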

@chenguoguo added the bug label Jan 28, 2020
@godjealous
Collaborator

Thanks for your interest in the athena-decoder project.

Actually, we always reserve 0 for epsilon on both the input side and the output side of the WFST. As you mentioned, symbol 0 is reserved in words.txt. Symbol 0 is also reserved in characters_disambig.txt.

The input symbol table for the SG.fst graph is "characters_disambig.txt" rather than "characters.txt". The output symbol table for the SG.fst graph is "words.txt".

Compared with "characters.txt", "characters_disambig.txt" contains some extra entries, including the epsilon symbol and some disambiguation symbols.
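
To illustrate the relationship between the two files, here is a minimal sketch of how a table with epsilon at 0 and trailing disambiguation symbols could be derived from a plain character list. This is an assumption about the layout, not the actual athena-decoder script: the function name `build_disambig_table`, the Kaldi-style `#0`, `#1` naming, and the sample characters are all hypothetical.

```python
def build_disambig_table(symbols, num_disambig, eps="<eps>"):
    """Return {symbol: id} with eps at 0, the original symbols next,
    and num_disambig disambiguation symbols (#0, #1, ...) at the end."""
    ordered = [eps] + [s for s in symbols if s != eps]
    ordered += ["#%d" % i for i in range(num_disambig)]
    return {sym: idx for idx, sym in enumerate(ordered)}


# Hypothetical character inventory, for illustration only.
table = build_disambig_table(["<unk>", "a", "b"], num_disambig=2)
for sym, idx in sorted(table.items(), key=lambda kv: kv[1]):
    print(sym, idx)
```

Under these assumptions every original character is shifted up by one, so ID 0 stays free for epsilon on the input side of the graph.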

@godjealous godjealous reopened this Feb 6, 2020