Data by Nishi1999 #77

Closed
3 of 4 tasks
LinguList opened this issue Nov 1, 2016 · 28 comments

@LinguList
Collaborator

LinguList commented Nov 1, 2016

Nishi is another dataset that, like Mann1998, consists of morphemes. We need to do the following:

  • link data to concepticon
  • write orthography profile
  • check whether the original source gives phoneme inventories and, if so, type them up
  • provide a small description of the dataset

The data are on pages 100-107 of Nishi 1999 (Four Papers on Burmese: Toward the history of Burmese (the Myanmar language). Tokyo: Institute for the Study of Languages and Cultures of Asia and Africa (ILCAA), Tokyo University of Foreign Studies). These data amount to 359 proposed cognates among eight languages, viz. Written Burmese, Spoken Burmese, Achang, Xiandao, Zaiwa, Leqi, Langsu, and Bola. The non-Burmese data are cited from "ZYC, except for a few Achang and Zaiwa forms which are supplied from (Xu and Xu 1984) and (Dai and Cai 1985). Note that entries of all the four Burmish languages, Burmese (and Mod. WB. transliterated by the Beijing method), Achang, Zaiwa and Langsu contained in ZYHC are supplied by the same authors as those in ZYC" (Nishi 1999: 96). (It looks like he also cites from Dai et al. 1991.)

The phoneme inventories are from the same work pp. 90-94. Nishi uses ñ in place of a sign the Chinese use for palatal n (not the usual one) and he uses ï for the apical vowel. He marks irregular cognates in bold, and he notes with the abbreviations x/x, d/c, and d.
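
To make the two substitutions concrete, here is a minimal sketch of how they might be undone when converting the data. Only the source symbols ñ and ï come from the description above; the target symbols (ȵ and ɿ) and the names `PROFILE` and `convert` are assumptions for illustration, not part of any existing profile.

```python
import unicodedata

# Hypothetical mini-profile: Nishi's symbol -> assumed target symbol.
PROFILE = {
    "ñ": "ȵ",  # palatal n as written in Chinese sources (assumed target)
    "ï": "ɿ",  # apical vowel (assumed target)
}

def convert(form, profile=PROFILE):
    """Replace Nishi's substitute symbols, leaving all other characters alone."""
    # Normalise first so that "n" + combining tilde is treated like "ñ".
    form = unicodedata.normalize("NFC", form)
    return "".join(profile.get(ch, ch) for ch in form)
```

Normalising to NFC first matters because the same letter can be typed either precomposed or as a base letter plus a combining mark.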

@LinguList
Collaborator Author

This is a first automatically created profile of sequences that need to be explained.

Nishi1999.xlsx

Unfortunately, the data looks rather messy, as there seem to be many idiosyncratic characters.

@nh36
Collaborator

nh36 commented Nov 4, 2016

Nathan has made a new version of Nishi. The data are now cleaner. Mattis must recompile the orthography profile for Nathan to check.

@nh36
Collaborator

nh36 commented Nov 4, 2016

Nishi.ods.zip

@LinguList
Collaborator Author

Short question: the bold things in Nishi, are they meaningful? If so, I'd try to search-and-replace them by * and code them differently, maybe adding a note for those words...
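
A rough sketch of what I have in mind, assuming the bold forms have already been search-and-replaced so that they start with `*`. The function name, the note text, and the two example forms are made up for illustration.

```python
def split_irregular(form, marker="*"):
    """Split a possibly marked form into (plain form, note)."""
    if form.startswith(marker):
        return form[len(marker):], "irregular (bold in Nishi 1999)"
    return form, ""

# Made-up placeholder forms, not taken from the dataset.
rows = ["*ka", "ta"]
parsed = [split_irregular(r) for r in rows]
```

This keeps the form itself clean for the orthography profile while the information from the bold print survives in a separate note column.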

@LinguList
Collaborator Author

Looks much nicer, I am just preparing the concept list. There was one hidden row, though, which you should have a look at (but I resolved it): line 367 was hidden, and so it was exported, but with many blank lines. I have now moved it up to where it belongs, and also corrected one Nishi gloss: "to dye" instead of "dye (cloth)", line 100. BTW: it's good to have the original source noted there, as this way we can trace back to Sun 1991 (that is ZMYYC, right?).

@LinguList
Collaborator Author

So here is the currently mapped data for Nishi (automatic mapping, percentage: 0.79), not bad actually:

Nishi-1999-mapped.ods.zip

I leave it to @nh36 to have a closer look, but will double-check your cleaned version later. The algorithm is better now but also yields a lot of possibilities; I consider this important, as we should be as strict as possible with those mappings.
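
For reference, a minimal sketch of how such a coverage figure could be computed, assuming it is simply the share of glosses that received a concept. The function name and the convention that an empty string or `"???"` means "unmapped" are assumptions, not the actual mapping code.

```python
def mapping_coverage(mapping, unmapped=("", "???")):
    """Share of glosses that were assigned a concept (0.0 for an empty mapping)."""
    if not mapping:
        return 0.0
    mapped = sum(1 for concept in mapping.values() if concept not in unmapped)
    return mapped / len(mapping)

# Placeholder glosses and concept labels, for illustration only.
example = {"gloss 1": "CONCEPT A", "gloss 2": "CONCEPT B",
           "gloss 3": "???", "gloss 4": "CONCEPT C"}
coverage = mapping_coverage(example)
```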

@LinguList
Collaborator Author

And here's the new test for the profile. Not much changed, to be honest, but it looks clearer now. I suppose it's time to just work with the data as is; there are some five exceptions with tones, but I will handle them explicitly once I run the profile to re-create the data.

Nishi1999-prf.xlsx

@nh36
Collaborator

nh36 commented Nov 5, 2016

The things in bold are those that Nishi himself identified to be irregular.
ZMYYC is Sun1991, it is his usual source. But note that he uses several
other sources, when he thought he got better data there.


@LinguList
Collaborator Author

Just saw: even better, as you halved the number of rows, so this is really working nicely now!

@LinguList
Collaborator Author

Once we have linked Sun1991 to concepticon, we can directly compare across the sources (also provided we have orthography profiles for Sun1991).

BTW: the workflow we are following now could definitely be optimized. I think it is time I started thinking about a script that creates an initial orthography profile for a given dataset. I'll make that an issue, and I'll probably handle it by writing a new function for either lingpy or the original orthography profile code, as it is of general interest to users, I'd say.
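
A first rough sketch of such a script, using only the standard library; this is not an existing lingpy function, and the names `graphemes` and `draft_profile` are hypothetical. It splits each form into base characters with their combining marks attached and counts how often each unit occurs, which gives a list of sequences to review by hand.

```python
import unicodedata
from collections import Counter

def graphemes(form):
    """Split a form into base characters with following combining marks attached."""
    clusters = []
    for ch in unicodedata.normalize("NFD", form):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach diacritic to the preceding base
        else:
            clusters.append(ch)
    return [unicodedata.normalize("NFC", c) for c in clusters]

def draft_profile(forms):
    """Return (grapheme, frequency) pairs, most frequent first."""
    counts = Counter(g for form in forms for g in graphemes(form))
    return counts.most_common()
```

Rare units at the bottom of the list are usually the idiosyncratic characters that need a decision, so sorting by frequency makes checking faster.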

@nh36
Collaborator

nh36 commented Nov 5, 2016 via email

@LinguList
Collaborator Author

Every source stands in its own right, and as far as I can see, we don't know whether Sun1991 uses the same concept labels and the same orthography. Nishi may well have adjusted those. And since Sun1991 is also originally Chinese, there may be some divergences in translation. So I prefer to do the work twice, once for Nishi and once for Sun, and then check the overlap, which will also be interesting as a scientific study on the sociology of research, as I guess we may find some coding errors, and it is interesting to see how they could influence an analysis.
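
The overlap check could look roughly like this, assuming both sources have been converted with their profiles and linked to concepticon, so that each is a dictionary from (language, concept) to form. The function name and the example entries are placeholders.

```python
def compare_sources(source_a, source_b):
    """Return the shared (language, concept) keys and the entries that differ."""
    shared = sorted(set(source_a) & set(source_b))
    mismatches = [(key, source_a[key], source_b[key])
                  for key in shared if source_a[key] != source_b[key]]
    return shared, mismatches

# Placeholder data, not real forms from either source.
a = {("LangA", "HAND"): "la", ("LangA", "EYE"): "mi", ("LangB", "HAND"): "lo"}
b = {("LangA", "HAND"): "la", ("LangA", "EYE"): "me"}
shared, mismatches = compare_sources(a, b)
```

The mismatches are the candidates for coding errors or deliberate adjustments, and each one still needs to be checked against the printed sources.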

@nh36
Collaborator

nh36 commented Nov 5, 2016

Ok, in that case I will do the Nishi. You can work on automating things,
but at least the Nishi will be done already.


@LinguList
Collaborator Author

LinguList commented Nov 5, 2016 via email

@nh36
Collaborator

nh36 commented Nov 5, 2016

Here is the Nishi concepticon mapping. In many cases I have left some
ambiguities; generally this is where Nishi seems to want to combine two
concepticon concepts. In some cases I changed the automatic map to '???'
because none of the concepticon concepts seems to work (e.g. dream vi,
which is certainly not the same as dream (something)).

I also attach the Nishi orthography profile, but I am not sure it is done
correctly. I have fixed mistakes where I have seen them (mostly changing
"t s" into "ts" and things like that), but I find it odd that whole words
come up unsegmented into initial and final.
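
The kind of fix described above can be sketched as follows: adjacent segments are joined when together they form a known cluster. The function name and the default cluster list are assumptions for illustration, and this is not how the profile itself was produced.

```python
def merge_clusters(segments, clusters=("ts",)):
    """Join adjacent segments that together form one of the given clusters."""
    merged = []
    i = 0
    while i < len(segments):
        pair = "".join(segments[i:i + 2])
        if i + 1 < len(segments) and pair in clusters:
            merged.append(pair)   # e.g. "t", "s" -> "ts"
            i += 2
        else:
            merged.append(segments[i])
            i += 1
    return merged
```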


On Sat, Nov 5, 2016 at 10:19 AM, Johann-Mattis List <
notifications@github.com> wrote:

yep, exactly what I was thinking. There is a possibility that we are
doing more than necessary here, but I prefer taking the risk over
risking to further change the data in any way. Sun1991 is extremely
interesting for us, but Nishi is a lower-hanging fruit and also
important for the QPA, as with this source, and with Mann, we have then
concrete tests where we can compare with your analysis of Huang1992.
Already that comparison will be some research that has not been carried
out yet, I'd say.



@LinguList
Collaborator Author

LinguList commented Nov 5, 2016 via email

@LinguList
Collaborator Author

Ah: could you upload the profile and the mapping on the website? If you attach it in an email, it does not get submitted...

@nh36
Collaborator

nh36 commented Nov 5, 2016

@nh36
Collaborator

nh36 commented Nov 5, 2016

@nh36
Collaborator

nh36 commented Dec 11, 2016

Nathan still needs to:

  • check whether the original source gives phoneme inventories and, if so, type them up
  • provide a small description of the dataset

@nh36
Collaborator

nh36 commented Jan 2, 2017

Here is the phoneme inventory for Nishi 1999. I am not sure it is in the format you want, but it should work one way or another.

x means 'doesn't have'
check means 'has'
airplane means 'only in loans'
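
In case it helps with reading the file, here is a minimal sketch of how the three markers could be interpreted. It assumes the sheet is loaded as a dictionary from phoneme to {language: marker}; that layout, the names, and the example rows are assumptions, not the actual structure of the spreadsheet.

```python
# Marker vocabulary as explained above.
MARKERS = {"x": "absent", "check": "present", "airplane": "loans only"}

def inventory(table, language):
    """Return (native, loan_only) phoneme sets for one language."""
    native, loans = set(), set()
    for phoneme, row in table.items():
        status = MARKERS[row[language]]
        if status == "present":
            native.add(phoneme)
        elif status == "loans only":
            loans.add(phoneme)
    return native, loans

# Placeholder rows with placeholder language names.
table = {"p": {"A": "check", "B": "check"},
         "f": {"A": "airplane", "B": "x"}}
```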

@LinguList
Collaborator Author

Nishi phoneme inventory.xlsx

Excellent, I just uploaded it here, but have it locally as well. I'll change tone letters to upper case, but otherwise the format is very convenient, and it probably qualifies directly as an orthography profile (but I will need to test this).

@nh36
Collaborator

nh36 commented Jan 3, 2017

So, this issue can be closed, right? Although my data description, at the top, probably needs to be moved somewhere else.

@LinguList LinguList self-assigned this Jan 3, 2017
@LinguList
Collaborator Author

Don't close it right now, as I'll need to add the profile to the repository. I just assigned myself to get this finalized.

@nh36
Collaborator

nh36 commented Feb 11, 2017

Please send an update on this thread.

@LinguList
Collaborator Author

Okay, Nishi1999 is the next target, as Mann1998, Nishi1999 and Huang1992 seem to be central (and the other Chinese source, whose name I keep forgetting...).

@nh36
Collaborator

nh36 commented Feb 17, 2017

This issue may now be superseded. Please review and confirm.

@LinguList
Collaborator Author

Yes, the follow-up issue is the wrong concept list in the csv-file, #90; all relevant data should be there. Please look in the folder called "raw" in Nishi for the csv-file I have been using (downloading and opening it in OpenOffice should be straightforward, I hope).
