Add Tiberian Transcription Schema #45
Comments
I tried with tiberian schema (
|
Sample for what we should accomplish:
|
Thanks for all this! In the branch with the new callback function for additional features, the callback gives access to the Right now, I'm running into a bit of a wall. Calling something like {
FEATURE: "syllable",
HEBREW: "\u{05B0}",
TRANSLITERATION: (syllable, hebrew, schema) => {
const next = syllable.next;
if(next && next.vowelName) {
// renamed function below from additionalFeatureTransliteration
return replaceAndTransliterate(syllable.text, new RegExp(hebrew, "u"), schema[next.vowelName], schema);
}
}
} The problem, however, is this Not totally sure how to resolve other than merging these two packages into a monorepo or heavily refactoring the schema interface — probably the latter |
Probably the latter I think too. |
Refactoring allows for something a little more elegant: const heb = require("./dist/index");
const rules = require("./dist/rules");
const result = heb.transliterate("בְּרֵאשִׁ֖ית וַיַּבְדֵּל", {
ADDITIONAL_FEATURES: [
{
// matches any sheva in a syllable that is NOT preceded by a vowel character
HEBREW: "(?<![\u{05B1}-\u{05BB}\u{05C7}].*)\u{05B0}",
FEATURE: "syllable",
TRANSLITERATION: function (syllable, _hebrew, schema) {
const next = syllable.next;
// discrepancy here: in havarotjs SHEVA is simply the character
// whereas transliteration is concerned with a specific sheva, a vocal sheva
// optional chaining guards against the last syllable, where next is null
const nextVowel = next?.vowelName === "SHEVA" ? "VOCAL_SHEVA" : next?.vowelName;
if (next && nextVowel) {
const vowel = schema[nextVowel] || "";
// replaceAndTransliterate is an internal helper function
return rules.replaceAndTransliterate(syllable.text, new RegExp("\u{05B0}", "u"), vowel, schema);
}
return syllable.text;
}
}
]
});
// bērēʾšît wayyabdēl Though the regex is a little more complicated, it ensures that the sheva being matched is likely a vocal one. thinking out loud: the |
|
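To make the lookbehind above easier to check in isolation, here is a minimal sketch that runs the same pattern against single syllables. The helper name `hasLeadingSheva` is hypothetical and not part of either package; the pattern is copied from the `HEBREW` key in the snippet above.

```javascript
// Same pattern as the HEBREW key above: a sheva (U+05B0) that is NOT preceded,
// anywhere earlier in the string, by a vowel character (U+05B1-U+05BB or U+05C7).
const SHEVA_WITHOUT_PRIOR_VOWEL = /(?<![\u{05B1}-\u{05BB}\u{05C7}].*)\u{05B0}/u;

// hasLeadingSheva is a hypothetical helper used only for illustration.
function hasLeadingSheva(syllableText) {
  return SHEVA_WITHOUT_PRIOR_VOWEL.test(syllableText);
}

// bet + sheva + dagesh: no vowel before the sheva, so it is a vocal sheva candidate
console.log(hasLeadingSheva("\u05D1\u05B0\u05BC"));
// yod + patach + bet + sheva: the patach precedes the sheva, so no match
console.log(hasLeadingSheva("\u05D9\u05B7\u05D1\u05B0"));
```

The dagesh (U+05BC) and the accents fall outside the vowel range, so they do not block a match. That is why the pattern only says the sheva is likely vocal rather than proving it.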
Check out this branch for tiberian <https://github.com/charlesLoder/hebrew-transliteration/tree/tiberian>. If you could, look through the tests and let me know what is incorrect. Feel free to push changes or just comment here |
Ok. I'll do that
|
Quite close!!! Need some more little work but we are almost there: hebrew-transliteration output:
Geoffrey Khan:
NOTES: We should:
|
Let me take these a little at a time.
Ok, that one is easy.
I think I got this correct, see test
That makes sense. See tests on the following lines, and let me know if they're correct at least in regards to the aleph:
and
The rest will take a little more time to get to. |
Yes, you are correct! Forgot about |
See test:
That one is easy enough: hebrew-transliteration/src/schemas/tiberian.ts Lines 66 to 67 in da40956
Still have to work on the long vowels and sheva. Had a baby a few months ago, hence the stop-and-go work on this |
Splendid! Now we are even closer. Good work @charlesLoder Congratulations on the baby! |
Just realizing I forgot to add a test for שִׂיחַ [ˈsiːjaħ] |
Take a look at all these, and let me know if I'm missing something. hebrew-transliteration/test/schemas/tiberian.test.ts Lines 52 to 64 in 16480b5
What about a vav/yod before a he (not even sure if that happens)? |
All seem right, besides the long vowels of course. Summary:
NOTE: Many words carry a secondary stress in addition to the main stress (fortunately this is noted with the cantillation marks), e.g. הָ֣אָדָ֔ם [ˌhɔːʔɔːˈðɔːm] ‘the man’ (Gen. 2.19), נִֽתְחַכְּמָ֖ה [ˌniːθḥakkaˈmɔː] ‘let us deal wisely’ (Exod. 1.10). |
The furtive patach tests have a vav or yod before a chet or ayin. I'm trying to think if there are any words with a furtive patach before a he (e.g. גָּבֹ֗הַּ), where the he is preceded by a vav or yod.
This would be a feature to build out. I also really need to update the Will look at vowel length next |
|
Also, don't forget about the SHEVA rules when you got time.
|
Another round of work. More furtive tests Epenthetic vowel
These long vowels are going to be tricky.... See the updated tests here Could you comment on each line whether it is correct or not. A simple 👍 if it's correct, and if it's not correct, then comment with the correct value. |
I have commented on not correct ones, I hope I didn't make any mistakes, I could ask Khan to correct but maybe a little later. |
What's the latest branch with Tiberian Schema? |
Tried on the latest. Genesis 1
Khan:
|
Just updated the branch. I'm struggling a bit with the vowel length stuff. The most recent commit for Gen 1:1-5 produces:
In some ways it's closer, in other ways it's way off |
Yes, way closer! We are on the right path :) |
Same branch gave me this for Gen 1:1-5:
One note (or two), the prolonged vowel appears only in accented closed syllable so not in |
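A rough sketch of the vowel length rule as it shows up in the expected values in this thread (e.g. `ʕal-ˈkʰeːen`, `loː`): a vowel in an accented closed syllable is prolonged with a short copy, a vowel in an open syllable is long, and a vowel in an unaccented closed syllable stays short. The function name and the options object are assumptions for illustration, and sheva and hatef vowels are deliberately ignored.

```javascript
// Hypothetical helper: vowel is a bare IPA vowel such as "e" or "ɔ".
function realizeVowel(vowel, { isAccented, isClosed }) {
  if (isClosed) {
    // accented closed syllable: long vowel plus a short copy, e.g. "eːe"
    // unaccented closed syllable: vowel stays short
    return isAccented ? `${vowel}ː${vowel}` : vowel;
  }
  // open syllable: long vowel
  return `${vowel}ː`;
}

console.log(realizeVowel("e", { isAccented: true, isClosed: true }));
console.log(realizeVowel("a", { isAccented: false, isClosed: true }));
console.log(realizeVowel("o", { isAccented: false, isClosed: false }));
```

This is only as good as the syllable data fed into it, so wrong output here usually points back to accent detection in the syllabification package.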
Ok, some more progress is being made, but now I'm hitting up against some deeper issues related to the syllabification package: And some other issues I'm still trying to figure out. I'm going to remove this from the Once I make more substantial changes to the syllabification package, I'll return to this. It is, however, getting much closer! For Gen 1:1-5 I'm seeing a lot of the same issues occur, so much of it should be resolved soon. I'm also working on a book project soon so that may take time away from this (too many irons in the fire! 🔥 ) |
@johnlockejrr , nah, the holidays are kicking my butt |
If you haven't noticed, I work in starts and stops :) This issue <charlesLoder/havarotjs#147> in the syllabification package has been the blocker. Had to dig a little deeper into the accents. Unblocked this. Going forward, I'm only going to be using Hebrew text from Sefaria for testing as it has the best ta'amim. The text they use, Miqra `al pi ha-Mesorah, has accent helpers which allow for more accurate stress marker placement. Obviously, I'm taking the happiest path and not dealing with edge cases, but I'd rather move forward. Hopefully, I'll incorporate Sefaria's text into the web app one day. Moving back to this repo when I get some time! |
Glad to hear it! Yes, the MAM text is pretty good. It has some typos nonetheless, but it is an open community and improving. The text is based on what is left of the Aleppo Codex, with other Firkovich fragments filling the gaps. Mikraot Gedolot HaKeter (https://www.mgketer.org/) also has a good text online based on the Aleppo Codex. If in the future you need any help with that let me know, I have it and other versions in SQL databases, csv etc.
|
The latest version fixes a few issues including alephs being doubled. As for the resh, this package doesn't work with any lexical information. Honestly, I didn't even know there were that many occurrences of a resh with a dagesh. |
@johnlockejrr when you get a chance, let me know if you see any issues with the latest version. |
Sure, I'll get back to you |
The issues I mentioned above are still there, in Obadiah for example: I'll do more tests. |
Gen. 1:7 ʔɛθ-ˌhɔːʀ̟ɔːq̟iːˈjaʕ should be ʔɛθ-ˌhɔːʀ̟ɔːˈq̟iːjaʕ; the furtive patach is never accented. אֶת־הָֽרָקִיעַ֒ |
Regarding "2 vaħaʁveː-sˈsɛːlaʕ should be vaħaʁveː-sˈsɛːlaʕ because the maqqef binds the words together that should be treated as one word and the samekh has dagesh forte. (בְחַגְוֵי־סֶּ֖לַע)": Ok, that was mentioned back here too. Let me take a look...
Regarding "5 ʔim-ˈʃoːoððeː should be ʔim-ˈʃoːoðaðeː because the sheva under the first ð is vocal (אִם־שֹׁ֣ודְדֵי).": I'll have to research more, but I thought in Tiberian the sheva would still be silent.
Regarding "Note: I'm not sure why here SEGOLTA is sitting on the AYN because the furtive patach can't get accentuated, MAM has here 2 SEGOLTA, one above the QOF and one above the AYN (אֶת־הָרָקִ֒יעַ֒), I think is a typo because I never seen any manuscript having 2 SEGOLTAS here.": The segolta accent is always postpositive, see this helpful article <https://assets.cambridge.org/97811084/79936/excerpt/9781108479936_excerpt.pdf>. MAM adds accent helpers so when a ta'am falls on an unaccented syllable, another is added for clarity. I added some pretty extensive tests <https://github.com/charlesLoder/havarotjs/blob/main/test/syllable.isAccented.test.ts> in the syllabification package for all this. These little helpers are one of the reasons I decided to use MAM as the primary text for testing |
5 ʔim-ˈʃoːoððeː should be ʔim-ˈʃoːoðaðeː because the sheva under the first ð is vocal (אִם־שֹׁ֣ודְדֵי). In Hebrew you can't double a spirant, and here that would be the output. Remember that the sheva between two identical consonants is almost always vocal, with some exceptions.
|
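For anyone following the accent helper point: a minimal sketch of how a doubled postpositive accent could be read, assuming (as described above for MAM) that the extra mark sits on the stressed syllable and the final one stays in its postpositive position. The function name `helperAccentIndex` is hypothetical and is not how havarotjs implements it.

```javascript
const SEGOLTA = "\u0592"; // HEBREW ACCENT SEGOL

// Returns the character index of the first (helper) mark when the accent is
// written twice in the word, or -1 when it appears fewer than two times.
function helperAccentIndex(word, accent = SEGOLTA) {
  const chars = Array.from(word);
  const first = chars.indexOf(accent);
  const last = chars.lastIndexOf(accent);
  return first !== -1 && first !== last ? first : -1;
}

// הָרָקִ֒יעַ֒ as in MAM, with one segolta on the qof and one on the ayin
const withHelper = "\u05D4\u05B8\u05E8\u05B8\u05E7\u05B4\u0592\u05D9\u05E2\u05B7\u0592";
console.log(helperAccentIndex(withHelper));
```

With only the single postpositive mark the function returns -1, which matches the underlying problem: the mark alone does not say where the stress falls.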
Thanks for the heads up about MAM. I'm not really sure that what they did is very Masoretic :) I mean, it breaks all the rules of the Masora. I think I'll move to something more reliable like Mikraot Gedolot HaKeter, which, like MAM, is based on what's left of the Aleppo Codex and other old manuscripts of the same (Ben-Asher) family. |
About sheva na in Obadia 5, the Mikraot Gedolot of mg.alhatorah.org says is |
Yup! I just wasn't sure if that was one of those rules that was taught in Hebrew classes that didn't correspond to actual Tiberian (e.g. like distinguishing between qamets qatan and qamets gadol, which Tiberian does not do). But, in Khan I.2.5.7.3: "One notable case is a shewa under the first of a pair of identical consonants, which was vocalic if the preceding vowel was long," So I've got to make some updates, but I think it will be simple |
Yes, you are right. The output overall is amazing and accurate, only some minor tweaks.
|
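The Khan rule quoted above can be sketched as a simple scan for a sheva sitting under the first of two identical letters. This is only an illustration under stated assumptions: the function name is hypothetical, it does not check whether the preceding vowel is long (the second half of the rule), and it treats final and non-final letter forms as different letters.

```javascript
const SHEVA = "\u05B0";
const isHebrewLetter = (ch) => ch >= "\u05D0" && ch <= "\u05EA";

// Returns the indexes of every sheva whose nearest letter before it and
// nearest letter after it are the same consonant.
function shevaBetweenIdenticalLetters(word) {
  const chars = Array.from(word);
  const hits = [];
  chars.forEach((ch, i) => {
    if (ch !== SHEVA) return;
    // nearest letter before the sheva (skips dagesh, accents, other marks)
    let prev = i - 1;
    while (prev >= 0 && !isHebrewLetter(chars[prev])) prev--;
    // nearest letter after the sheva
    let next = i + 1;
    while (next < chars.length && !isHebrewLetter(chars[next])) next++;
    if (prev >= 0 && next < chars.length && chars[prev] === chars[next]) hits.push(i);
  });
  return hits;
}

// שֹׁודְדֵי: the sheva under the first dalet is followed by a second dalet
console.log(shevaBetweenIdenticalLetters("\u05E9\u05C1\u05B9\u05D5\u05D3\u05B0\u05D3\u05B5\u05D9"));
```

A real implementation would combine this with the vowel length of the preceding syllable before deciding that the sheva is vocal.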
Ok! Got אִם־שֹׁ֣ודְדֵי as ʔim-ˈʃoːðaðeː and בְחַגְוֵי־סֶּ֖לַע as vaħaʁveː-sˈsɛːlaʕ 🎉 Just so we don't have to find all the links:
- the Tiberian branch <https://github.com/charlesLoder/hebrew-transliteration/tree/tiberian> with latest at 0d8b5ed
- the latest tiberian release on npm <https://www.npmjs.com/package/hebrew-transliteration/v/2.5.1-tiberian.10> (v2.5.1-tiberian.10)
- the web app <https://deploy-preview-77--hebrewtransliteration.netlify.app/#> with the latest tiberian version
Thanks for all the help testing btw! It feels like this is getting closer to done |
Awesome, thanks for the update! I'll do more tests and get back to you. I'm really happy, very close indeed
|
So testing the Genesis text has remained the same. Psalm 1, however, presents some more problems: ʰalˈʁeː',
received: 'ʕaːal-pʰalˈʁeː'
},
{
text: 'אֲשֶׁ֣ר־יַעֲשֶׂ֣ה',
expected: 'ʔaʃɛʀ̟-jaːʕaˈsɛː',
received: 'ʔaʃɛːɛʀ̟-jaːʕaˈsɛː'
},
{
text: 'יַצְלִֽיחַ׃',
expected: 'jɑsˁˈliːjaħ',
received: 'jɑsˁˈliːħaː'
},
{
text: 'לֹֽא־כֵ֥ן',
expected: 'loː-ˈχeːen',
received: 'ˌloː-ˈχeːen'
},
{
text: 'הָרְשָׁ֑עִים',
expected: 'hɔːʀ̟aʃɔːˈʕiːim',
received: 'hɔːɔʀ̟ˈʃɔːʕiːm'
},
{
text: 'אֲשֶׁר־תִּדְּפֶ֗נּוּ',
expected: 'ˌʔaˑʃɛʀ̟-tʰiddaˈfɛːɛnnuː',
received: 'ʔaʃɛʀ̟-tʰiddaˈfɛːɛnnuː'
},
{ text: 'רֽוּחַ׃', expected: 'ˈʀ̟uːwaħ', received: 'ˈʀ̟uːħaː' },
{
text: 'לֹֽא־יָקֻ֡מוּ',
expected: 'loː-jɔːˈq̟uːmuː',
received: 'ˌloː-jɔːˈq̟uːmuː'
},
{
text: 'רְשָׁ֤עִים',
expected: 'ʀ̟aʃɔːˈʕiːim',
received: 'ʀ̟aˈʃɔːʕiːm'
},
{
text: 'וְֽחַטָּאִ֥ים',
expected: 'vaħɑttˁɔːˈʔiːim',
received: 'vaħɑtˁtˁɔːˈʔiːim'
},
{
text: 'צַ֝דִּיקִ֗ים',
expected: 'sˁɑddiːˈq̟iːim',
received: 'ˈsˁɑːɑddiːˈq̟iːim'
},
{
text: 'כִּ֤י־יוֹדֵ֥עַ',
expected: 'ˌkʰiː-joːˈðeːjaʕ',
received: 'kʰiː-joːˈðeːaʕ'
},
{
text: 'צַ֥דִּיקִים',
expected: 'sˁɑddiːˈq̟iːim',
received: 'ˈsˁɑːɑddiːq̟iːm'
},
{ text: 'תֹּ֭אבֵד׃', expected: 'tʰoːˈveːeð', received: 'ˈtʰoːveð' }
] A few of the issues are related to how the taamim characters function differently in poetry |
Job, Proverbs and Psalms are a different story, they have different taamim |
Yeah, the difficulty is that when the syllabification package encounters a mark, like a tipcha, it doesn't know if it is a tipcha or a dechi. |
I was wrong! Yay! The Tipcha and dechi are encoded with different characters. So I just need to come up with some better logic |
Glad to hear it! Yes, they are identical in form but encoded differently in
the Unicode good fonts (not all)
|
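Since the two marks have separate code points in Unicode (U+0596 HEBREW ACCENT TIPEHA and U+05AD HEBREW ACCENT DEHI), telling them apart can be as simple as a lookup. A small sketch; the `namedAccents` helper is hypothetical and only covers these two accents.

```javascript
// Tipcha (prose books) and dechi (Job, Proverbs, Psalms) look alike
// but are distinct characters.
const ACCENT_NAMES = new Map([
  ["\u0596", "tipcha"], // HEBREW ACCENT TIPEHA
  ["\u05AD", "dechi"], // HEBREW ACCENT DEHI
]);

// Returns the names of the known accents found in the word, in order.
function namedAccents(word) {
  return Array.from(word)
    .filter((ch) => ACCENT_NAMES.has(ch))
    .map((ch) => ACCENT_NAMES.get(ch));
}

// תֹּ֭אבֵד from Psalm 1 carries U+05AD
console.log(namedAccents("\u05EA\u05B9\u05BC\u05AD\u05D0\u05D1\u05B5\u05D3"));
```

This only works when the source text actually uses the distinct code points, which is another reason to test against a carefully encoded text.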
Latest diffs of Psalm 1: [
{
text: 'אַ֥שְֽׁרֵי-הָאִ֗ישׁ',
expected: 'ˌʔaːˌʃaˑʀ̟eː-hɔːˈʔiːiʃ',
received: 'ˈʔaːʃaʀ̟eː-hɔːˈʔiːiʃ'
},
{
text: 'וּֽבְתוֹרָת֥וֹ',
expected: 'ˌwuˑvθoːʀ̟ɔːˈθoː',
received: 'wuvθoːʀ̟ɔːˈθoː'
},
{ text: 'וְהָיָ֗ה', expected: 'ˌvɔˑhɔːˈjɔː', received: 'vɔhɔːˈjɔː' },
{
text: 'עַ֫ל־פַּלְגֵ֥י',
expected: 'ˌʕaˑl-pʰalˈʁeː',
received: 'ʕaːal-pʰalˈʁeː'
},
{
text: 'אֲשֶׁ֣ר־יַעֲשֶׂ֣ה',
expected: 'ʔaʃɛʀ̟-jaːʕaˈsɛː',
received: 'ʔaʃɛːɛʀ̟-jaːʕaˈsɛː'
},
{
text: 'לֹֽא־כֵ֥ן',
expected: 'loː-ˈχeːen',
received: 'ˌloː-ˈχeːen'
},
{
text: 'הָרְשָׁ֑עִים',
expected: 'hɔːʀ̟aʃɔːˈʕiːim',
received: 'hɔːɔʀ̟ˈʃɔːʕiːm'
},
{
text: 'אֲשֶׁר־תִּדְּפֶ֗נּוּ',
expected: 'ˌʔaˑʃɛʀ̟-tʰiddaˈfɛːɛnnuː',
received: 'ʔaʃɛʀ̟-tʰiddaˈfɛːɛnnuː'
},
{
text: 'עַל־כֵּ֤ן ׀',
expected: 'ʕal-ˈkʰeːen',
received: 'ʕal-ˈkʰeːen '
},
{
text: 'כִּ֤י־יוֹדֵ֥עַ',
expected: 'ˌkʰiː-joːˈðeːjaʕ',
received: 'kʰiː-joːˈðeːaʕ'
},
{
text: 'צַ֥דִּיקִים',
expected: 'sˁɑddiːˈq̟iːim',
received: 'ˈsˁɑːɑddiːq̟iːm'
}
] A lot of issues are related to secondary stress related to the minor gaya, which is really difficult to figure out for the syllabification package. |
Seems like it |
Alright! Made some updates and adjusted the text I was using to compare Psa 1, I get this: [
{
text: 'אַ֥שְֽׁרֵי-הָאִ֗ישׁ',
expected: 'ˌʔaːˌʃaˑʀ̟eː-hɔːˈʔiːiʃ',
received: 'ˈʔaːˌʃaˑʀ̟eː-hɔːˈʔiːiʃ'
},
{
text: 'הָרְשָׁעִ֑ים',
expected: 'hɔːʀ̟aʃɔːˈʕiːim',
received: 'hɔːɔʀ̟ʃɔːˈʕiːim'
},
{
text: 'אֲֽשֶׁר־תִּדְּפֶ֥נּוּ',
expected: 'ˌʔaˑʃɛʀ̟-tʰiddaˈfɛːɛnnuː',
received: 'ʔaʃɛʀ̟-tʰiddaˈfɛːɛnnuː'
},
{
text: 'כִּֽי־יוֹדֵ֣עַ',
expected: 'ˌkʰiː-joːˈðeːjaʕ',
received: 'ˌkʰiˑ-joːˈðeːaʕ'
}
] At this point, getting any more accurate results really means digging even deeper into the weeds. I don't think I have it in me! :) But, this honestly feels like it's in a good place. I may close it this week |
For now you can publish it like this. Maybe later, in time, we can find something to tweak. I’m very happy with it anyway; it is one of a kind.
|
CLOSED! 🎉 |
If you shoot me an email (see my profile for my address), I can add you to the email updates and give credit |
See discussion here
Will definitely need a test under test/schemas.