
[Transcription System] X-SAMPA and/or Kirshenbaum #84

Closed
tresoldi opened this issue Jan 5, 2018 · 10 comments


@tresoldi (Contributor) commented Jan 5, 2018

Should these systems be considered? They would be good candidates for inclusion if we plan to develop a method for recognizing an unknown transcription system. They might also serve as a fall-back in situations where Unicode is still not acceptable.

Inclusion would be straightforward; a preliminary mapping could be built directly from the Wikipedia articles on X-SAMPA (https://en.wikipedia.org/wiki/X-SAMPA) and Kirshenbaum (https://en.wikipedia.org/wiki/Kirshenbaum).
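For illustration, such a preliminary mapping could start out as plain dictionaries; the entries below are a small, hand-picked excerpt from the published correspondence tables, not a complete mapping:

```python
# A few X-SAMPA and Kirshenbaum correspondences to Unicode IPA; this is an
# illustrative excerpt only, not a full mapping.
XSAMPA_TO_IPA = {
    "S": "ʃ",     # voiceless postalveolar fricative
    "T": "θ",     # voiceless dental fricative
    "N": "ŋ",     # velar nasal
    "@": "ə",     # schwa
    "t_h": "tʰ",  # X-SAMPA marks aspiration with the sub-character "_h"
}

KIRSHENBAUM_TO_IPA = {
    "S": "ʃ",
    "T": "θ",
    "N": "ŋ",
    "@": "ə",
}
```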

@LinguList (Collaborator)

Yes, I was thinking about that. But given that there is a JS-SAMPA application which I regularly use (and which people could use to insert data), I was asking myself whether it is needed in the end, since SAMPA is more a system that renders to a certain subset of Unicode symbols using ASCII characters, right? One could, however, also just test how far we can go with this. For rendering, however, I recommend checking both the lingpy SAMPA symbols and BXS.vim, which I have further extended over the last years:

bxs.vim.txt

In fact, one could probably use this to replicate BIPA in SAMPA form; maybe that would be a good idea?

@tresoldi (Contributor, Author) commented Jan 5, 2018 via email

@LinguList (Collaborator)

Yes, excellent! I agree that having this would also facilitate handling for those who use the python interface.

@LinguList (Collaborator)

The more I think about it, the more I think that SAMPA is not a transcription system but a transliteration of IPA. What we could consider instead, maybe, is to take the sampa2uni function from LingPy and plant it into util or another part of the package, to make clear that SAMPA conversion is a task for CLTS without treating SAMPA as a transcription system. Since SAMPA can be further extended to cover more than the usual symbols, we could then even add a sampa keyword to all transcription-system methods, so that strings could be queried using SAMPA without forcing it to serve as a full-fledged way to transcribe things. LingPy's sampa2uni function in fact handles most cases we would expect, so it would be straightforward to take it from there and later kick it out of LingPy.
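A minimal sketch of what such a sampa keyword could look like; the names SAMPA_TO_IPA, resolve_grapheme, and system are hypothetical placeholders, not existing CLTS or LingPy API:

```python
# Hypothetical sketch: interpret the query string as SAMPA, convert it to
# Unicode IPA first, then run the regular transcription-system lookup.
SAMPA_TO_IPA = {
    "t_h": "tʰ",
    "S": "ʃ",
    "N": "ŋ",
    "@": "ə",
}

def resolve_grapheme(system, grapheme, sampa=False):
    """Look up a grapheme in `system` (any dict-like mapping of graphemes
    to sounds), optionally interpreting the input as SAMPA."""
    if sampa:
        # SAMPA handling is a plain substitution step that happens *before*
        # the transcription-system lookup, not a separate system of its own.
        grapheme = SAMPA_TO_IPA.get(grapheme, grapheme)
    return system.get(grapheme)

# Example: resolve_grapheme(bipa_like_mapping, "t_h", sampa=True)
# first rewrites "t_h" to "tʰ" and then queries the mapping as usual.
```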

@xrotwang (Collaborator)

That's a bit philosophical, isn't it? What would be the difference between "transliteration of IPA" and "transcription system"?

When you say

LingPy's sampa2uni-function in fact handles most cases we would expect

then that is exactly the problem CLTS should solve, i.e. turning implicit, possibly complex code into declarative, transparent data. So resorting to the old code after all, when the aim was to describe what it actually is that "we expect", seems like a failure.

@LinguList (Collaborator)

But the essence of SAMPA is completely different from the essence of transcription systems. The parsing algorithm needs to be different, since SAMPA does not show the distinction between diacritics and base characters, but instead uses sub-characters to turn a base character into a diacritic. As a result, you cannot parse the grapheme t_h in SAMPA using our current transcription-system code (which works well for other systems), simply because SAMPA was created to translate from ASCII glyphs (as opposed to graphemes) to Unicode glyphs. For SAMPA, all you need is a look-up table, or an orthography profile (maybe even better!), to turn it into Unicode IPA. Our pre-defined transcription-system code, however, won't work on it unless we spell out all characters.
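For concreteness, the look-up-table approach amounts to nothing more than greedy longest-match substitution; a minimal sketch with a toy table (not a complete SAMPA mapping):

```python
# Toy excerpt of a SAMPA-to-IPA look-up table; "t_h" shows how the
# sub-character "_h" turns the base character "t" into an aspirated sound.
SAMPA_TABLE = {
    "t_h": "tʰ",
    "t": "t",
    "S": "ʃ",
    "N": "ŋ",
    "@": "ə",
}

def sampa_to_ipa(text):
    """Convert a SAMPA string to Unicode IPA by greedy longest-match lookup."""
    keys = sorted(SAMPA_TABLE, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(SAMPA_TABLE[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # characters not in the table pass through
            i += 1
    return "".join(out)

print(sampa_to_ipa("t_hIN"))  # -> "tʰIŋ" ("I" is not in the toy table)
```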

@xrotwang (Collaborator)

Ah, ok, I see. So basically, SAMPA is an orthography and should thus be handled via orthography profiles. That's ok, since this uses a different, but also well-described and transparent mechanism :)
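A short sketch of how this could look with the segments package, following the usage shown in its README; the SAMPA entries and the IPA column name below are illustrative only, not an actual profile:

```python
# Treat SAMPA graphemes as an orthography profile: segmentation and
# conversion to Unicode IPA are then handled by segments, not by the
# transcription-system parser.
from segments import Profile, Tokenizer

profile = Profile(
    {'Grapheme': 't_h', 'IPA': 'tʰ'},
    {'Grapheme': 'S', 'IPA': 'ʃ'},
    {'Grapheme': 'I', 'IPA': 'ɪ'},
    {'Grapheme': 'N', 'IPA': 'ŋ'},
)
tokenizer = Tokenizer(profile=profile)

print(tokenizer('t_hIN'))                # segmentation: "t_h I N"
print(tokenizer('t_hIN', column='IPA'))  # conversion:   "tʰ ɪ ŋ"
```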

@LinguList (Collaborator)

Yes, I think this is the best way to go: we make a huge orthography profile (no need to use LingPy's algorithm) with all 6,000+ symbols, converting them to SAMPA where possible, provide it as an orthography profile, and allow it to be loaded quickly. I am just wondering: as orthography profiles are in some way important for CLTS, should we consider putting the SAMPA profile into the segments package, or rather into the clts package?
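A rough sketch of how such a profile could be generated; the file name, column names, grapheme list, and mapping below are illustrative stand-ins for the real 6,000+ entries:

```python
# Build a TSV orthography profile with one row per BIPA grapheme and a SAMPA
# column filled where a conversion is known ("converting them to sampa where
# possible"); graphemes without a known equivalent get an empty cell.
import csv

IPA_TO_SAMPA = {"tʰ": "t_h", "ʃ": "S", "ŋ": "N", "ə": "@"}
BIPA_GRAPHEMES = ["tʰ", "ʃ", "ŋ", "ə", "ǂ"]  # "ǂ" has no entry in the toy mapping

with open("sampa-profile.tsv", "w", encoding="utf8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Grapheme", "SAMPA"])
    for grapheme in BIPA_GRAPHEMES:
        writer.writerow([grapheme, IPA_TO_SAMPA.get(grapheme, "")])
```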

@xrotwang (Collaborator)

I guess it would make sense in the segments package, considering that SAMPA is one of the most prominent ways to write IPA. This would increase the immediate usefulness of the segments package, beyond "just" proper Unicode tokenization.

@LinguList (Collaborator)

Okay. I suppose we transfer this issue now and make the argument that, with the help of our ~6,000 segments, we could easily provide a huge number of possible segmentations even for SAMPA (maybe excluding clusters, as they would mess things up), and that this can be included in the next release of segments. At the same time, we could also use the 6,000+ graphemes of our BIPA system to produce an orthography profile that could at some point be used instead of LingPy's segmentation algorithm.
