Language Identification #50

Closed
jag3773 opened this issue Aug 15, 2019 · 8 comments

@jag3773
Collaborator

jag3773 commented Aug 15, 2019

We have defined a basic system of language identification in the documentation. It's likely that this fills the need, but if there are critical pieces missing we should ask a group to discuss them.

@jag3773 jag3773 added this to the SB 0.1.0-RC milestone Aug 15, 2019
@mvahowe
Contributor

mvahowe commented Aug 17, 2019

The main challenge I see right now is turning non-BCP47 data into BCP47. For example, for DBL we have quite a long list of fields that, together, would feed into BCP47, but it's far from trivial to do this: it's not as simple as concatenating fields, because BCP47 assumes that we know, for instance, the default script for the language.
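
To illustrate why plain concatenation falls short, here is a minimal Python sketch. The function `build_tag`, its parameters, and the three-entry `SUPPRESS_SCRIPT` table are hypothetical stand-ins, not DBL fields or Scripture Burrito code; real Suppress-Script data would come from the IANA Language Subtag Registry.

```python
# Hypothetical excerpt of Suppress-Script data; the authoritative values
# live in the IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn", "ru": "Cyrl", "ar": "Arab"}


def build_tag(language, script=None, region=None):
    """Join subtags into a language tag, omitting a suppressed script."""
    parts = [language.lower()]
    if script:
        script = script.title()  # script subtags are conventionally title case
        # Only keep the script when it is not the language's default one.
        if SUPPRESS_SCRIPT.get(parts[0]) != script:
            parts.append(script)
    if region:
        parts.append(region.upper())  # region subtags are conventionally upper case
    return "-".join(parts)


print(build_tag("en", "Latn", "US"))  # script is the default, so it is dropped
print(build_tag("ru", "Latn"))        # non-default script is kept
```

Naive concatenation would give `en-Latn-US` for the first call, which is well-formed but not the recommended form of the tag.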

@jonathanrobie
Collaborator

We should probably discuss how much of LDML we need:

https://www.unicode.org/reports/tr35/

We should also discuss whether we want to use the same BCP 47 conformance as LDML:

https://www.unicode.org/reports/tr35/#BCP_47_Conformance

@rdb
Collaborator

rdb commented Aug 22, 2019

Re our discussion, this is the standard endpoint for obtaining the full list of tags including Suppress-Script information:
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Libraries to deal with BCP 47 tags with understanding of this registry seem fairly ubiquitous. From a cursory look:
https://www.npmjs.com/package/language-tags
https://pypi.org/project/language-tags/
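
For anyone who would rather read the registry file directly than use a library, a rough Python sketch of a parser follows. It assumes the registry's record-jar layout (records separated by `%%` lines, `Key: Value` fields, folded lines starting with whitespace). `SAMPLE` is a hand-written excerpt in that layout rather than a copy of the live file, and the function names are made up for illustration.

```python
SAMPLE = """\
%%
Type: language
Subtag: en
Description: English
Suppress-Script: Latn
%%
Type: language
Subtag: ru
Description: Russian
Suppress-Script: Cyrl
%%
Type: language
Subtag: sr
Description: Serbian
"""


def parse_registry(text):
    """Parse record-jar text into a list of dicts, one per record."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store what we have and start a new record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1].isspace() and last_key:
            # Folded line: continue the previous field's value.
            current[last_key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            # Simplification: a repeated field keeps only its last value.
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


def suppress_scripts(records):
    """Map each language subtag to its Suppress-Script, where one is defined."""
    return {
        r["Subtag"]: r["Suppress-Script"]
        for r in records
        if r.get("Type") == "language" and "Suppress-Script" in r
    }


print(suppress_scripts(parse_registry(SAMPLE)))
```

The same two functions should work on the downloaded registry text, although that has not been checked against every record type in the file.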

@mandolyte

For ad-hoc searching the IANA registry: https://r12a.github.io/app-subtags/

@jag3773 jag3773 self-assigned this Oct 17, 2019
@jag3773 jag3773 modified the milestones: SB 0.1.0-RC, SB 0.2.0-beta Oct 30, 2019
@jag3773
Collaborator Author

jag3773 commented Dec 19, 2019

I'm happy with what we have defined for language identification in https://github.com/bible-technology/scripture-burrito/blob/develop/schema/common.schema.json#L34-L39 .

However, @rdb I notice that we don't have scriptDirection like we did in the XML schema. Is that intentional?

@rdb
Collaborator

rdb commented Dec 19, 2019

@jag3773 we do, it's just defined inline because it's not used anywhere else:

"scriptDirection": {
  "type": "string",
  "allowedValues": ["ltr", "rtl"]
}
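
As a rough illustration of what that constraint means for consumers, here is a small Python check. The function name is hypothetical and this is not the project's validator. Note that the snippet as quoted uses `allowedValues`, while standard JSON Schema spells this kind of constraint `enum`, so the sketch tests membership directly instead of relying on a schema library.

```python
ALLOWED_SCRIPT_DIRECTIONS = ("ltr", "rtl")


def check_script_direction(value):
    """Return value unchanged if it is a permitted scriptDirection, else raise."""
    if not isinstance(value, str):
        raise TypeError("scriptDirection must be a string")
    if value not in ALLOWED_SCRIPT_DIRECTIONS:
        raise ValueError(f"scriptDirection must be one of {ALLOWED_SCRIPT_DIRECTIONS}")
    return value


print(check_script_direction("rtl"))
```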

@jag3773
Collaborator Author

jag3773 commented Dec 19, 2019

Excellent, I was looking in the wrong place, but that's clearly where I should have been looking.

I'm happy with the current implementation and will close this issue.

If anyone has specific language related items that they think are not covered please create a new issue for them.

@jag3773 jag3773 closed this as completed Dec 19, 2019
@rdb
Collaborator

rdb commented Dec 19, 2019

Note, for the record, that I renamed the field from "bcp47" to "tag" (since that describes what it is, and "(IETF) language tag" is widely understood by that name rather than by the document that defines it).

I also want to note that we don't currently have a "numeral system" field but expect it to be added to the IETF language tag using the Unicode extension syntax. I'm personally not 100% certain about this, but when I talked about it with Mark he seemed to prefer putting everything in the tag that can go in the tag, and there is some sense in that.
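
If the numeral system does end up in the tag, it would presumably use the `nu` key of the Unicode `-u-` extension, as in `ar-u-nu-latn`. The Python sketch below shows reading and appending that key. Both function names are hypothetical, and the appending helper deliberately refuses tags that already carry an extension or private-use section rather than trying to merge them.

```python
def numbering_system(tag):
    """Return the value of the `nu` key in the tag's -u- extension, or None."""
    subtags = tag.lower().split("-")
    in_u = False
    for i, sub in enumerate(subtags[1:], start=1):
        if len(sub) == 1:
            if sub == "x":
                break  # private use: nothing after this is interpreted
            in_u = sub == "u"  # a singleton starts a new extension
        elif in_u and sub == "nu" and i + 1 < len(subtags):
            value = subtags[i + 1]
            if len(value) >= 3:  # key values are 3 to 8 characters long
                return value
    return None


def with_numbering_system(tag, system):
    """Append -u-nu-<system> to a tag that has no extensions or private use."""
    if any(len(s) == 1 for s in tag.split("-")[1:]):
        raise ValueError("sketch only handles tags without extensions or private use")
    return f"{tag}-u-nu-{system.lower()}"


print(with_numbering_system("ar-EG", "latn"))
print(numbering_system("th-u-ca-buddhist-nu-thai"))
```

Keeping the numeral system in the tag means a consumer needs no separate field, at the cost of having to parse the extension as above.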


5 participants