How to download a csv file that contains all colexifications? #17

Closed
ianjoo opened this issue Mar 8, 2021 · 6 comments

ianjoo commented Mar 8, 2021

What is the terminal command that allows me to download all the colexifications, containing:

| ID A | Concept A | ID B | Concept B | Families | Languages | Words |
|-------:|:-------------------------|-------:|:---------------------------|-----------:|------------:|--------:|
| 906 | TREE | 1803 | WOOD | 59 | 348 | 361 |
...

chrzyki commented Mar 8, 2021

What you can do, for instance, while building the colexification network (see `clics colexification -h`) is redirect the output to a file, for example:

clics -t 3 -f families colexification --show 3000 --format tsv > out.tsv
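
Once `out.tsv` exists, it can be read back with Python's standard `csv` module. This is a minimal sketch, assuming the file is tab-separated and has a header row with the same column names as the table in the question (`ID A`, `Concept A`, `ID B`, `Concept B`, `Families`, `Languages`, `Words`). The real header may differ, so check the first line of the file before relying on it. The helper name `read_colexifications` is made up for this example.

```python
import csv
import io


def read_colexifications(handle):
    """Parse tab-separated colexification rows into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    # Strip whitespace in case the cells were padded for alignment.
    header = [name.strip() for name in next(reader)]
    rows = []
    for fields in reader:
        if not any(field.strip() for field in fields):
            continue  # skip blank lines
        row = dict(zip(header, (field.strip() for field in fields)))
        # Assumed numeric columns, named as in the table in the question.
        for key in ("Families", "Languages", "Words"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Inline sample standing in for out.tsv; the single data row is the
# TREE/WOOD row quoted in the question.
sample = (
    "ID A\tConcept A\tID B\tConcept B\tFamilies\tLanguages\tWords\n"
    "906\tTREE\t1803\tWOOD\t59\t348\t361\n"
)
rows = read_colexifications(io.StringIO(sample))
print(len(rows), rows[0]["Concept A"], rows[0]["Families"])  # 1 TREE 59
```

For the real file, open it with `open("out.tsv", encoding="utf-8", newline="")` and pass the handle to the same function; `len(rows)` then gives the number of colexifications that were written.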


ianjoo commented Mar 8, 2021

Thanks. But why 3000? What is the total number?


LinguList commented Mar 8, 2021 via email


chrzyki commented Mar 8, 2021

> Thanks. But why 3000? What is the total number?

No particular reason, other than that there are roughly 3000 concepts in CLICS and that, generally speaking, the less frequent colexifications tend to be less reliable. (Note, of course, that the number of concepts is not the same as the number of colexifications in CLICS.) `network-3-families.gml` has 4228 edges in total (this is before clustering with Infomap), so there would be 4228 colexifications. The blog post that Mattis mentioned is a very good introduction to programmatically accessing the network data. Here's also a small snippet that shows how to access the data using igraph.

Note that the snippet is also based on @LinguList's and @tresoldi's blog posts.


tresoldi commented Mar 8, 2021

There is also some code from the "semantic distance" work that I presented at SLE2019 and discussed in another CALC blog post: https://github.com/tresoldi/semantic_distance

I think what you want is something similar to the full list (https://github.com/tresoldi/semantic_distance/blob/master/data/colexifications.tsv), but you should really compute it yourself, and @chrzyki's snippet is clear. The data in this repository is outdated and includes all possible colexifications, including those found only between a single pair of languages, so there is a lot of noise in it.
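
One way to cut that noise is to keep only colexifications attested in a minimum number of language families. This is a rough sketch, not the CLICS implementation: it assumes each colexification is a dict with an integer `Families` count (the column name used in the table in the question), and the default of 3 mirrors what `-t 3 -f families` appears to mean in the command above, which is an assumption about those flags. The function name `filter_by_families` is hypothetical.

```python
def filter_by_families(rows, min_families=3):
    """Keep colexifications attested in at least `min_families` families."""
    return [row for row in rows if row["Families"] >= min_families]


rows = [
    # Row taken from the table in the question.
    {"Concept A": "TREE", "Concept B": "WOOD", "Families": 59},
    # Made-up row, for illustration only: a colexification seen in one family.
    {"Concept A": "X", "Concept B": "Y", "Families": 1},
]
kept = filter_by_families(rows)
print([(row["Concept A"], row["Concept B"]) for row in kept])
# [('TREE', 'WOOD')]
```

Raising `min_families` trades coverage for reliability; a threshold of 1 keeps everything, including pairs found in a single family.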


chrzyki commented Mar 16, 2021

Closing this for now. Feel free to reopen should any other questions arise.

chrzyki closed this as completed Mar 16, 2021