# lgfam-newick: Language family classifications as Newick trees

## Summary

This repository contains the data, R code, outputs and description of a flexible method for generating standardized [Newick](http://evolution.genetics.washington.edu/phylip/newicktree.html) language family trees with branch lengths from the four most used language classification databases: [Ethnologue](http://www.ethnologue.com/), [WALS](http://wals.info/), [AUTOTYP](http://www.autotyp.uzh.ch/) and [Glottolog](http://glottolog.org/).
The code is released under [GPL v2](http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html), but the various pieces of input data might be governed by different licenses (specified in the respective folders).

The aims of this project are to:

a) provide several well-known linguistic (genealogical) classifications (currently [WALS](http://wals.info/), [Ethnologue](http://www.ethnologue.com/), [Glottolog](http://glottolog.org/) and [AUTOTYP](http://www.autotyp.uzh.ch/)) in the *de facto* standard [Newick format](https://en.wikipedia.org/wiki/Newick_format), and
b) offer a set of [`R`](http://www.r-project.org/) `S3` classes and functions for reading, converting, writing and working with language family trees.

Also included is code by [Balthasar Bickel](http://www.linguistik.uzh.ch/en/about/mitglieder/bickel.html) that matches tree nodes to datasets and prunes the trees to keep only the nodes that have matching data (the `./code/MatchTreesToData.R` script).

## Accompanying paper, outputs and acknowledging this work

The **accompanying paper** (in the `./paper/` directory) describes in detail the data sources and the conversion process.
The paper itself is written in [`R Markdown`](http://rmarkdown.rstudio.com/) and can be compiled to PDF (the primary output in the `family-trees-with-brlength.pdf` file) or HTML (the `family-trees-with-brlength.html` file).

The actual Newick trees with branch lengths are in the `./output/` directory and can be used directly. The file formats are described in the **accompanying paper**; briefly, they come as **TAB-separated CSV files** and equivalent **Nexus files** that contain the language family trees in the **Newick format**. The file name gives details about the classification, method and parameters used to compute the topology and branch lengths.
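
For readers unfamiliar with the notation, the following minimal sketch shows what a Newick string with branch lengths looks like. It is purely illustrative and is not the code used in this repository: the function `to_newick()`, the nested-list tree representation, and the language names and branch lengths are all made up for this example.

```r
# Hypothetical helper (not part of this repository): serialize a tree stored
# as a nested list into a Newick string with branch lengths.
# Each node is list(name = <string>, length = <number or NULL>,
#                   children = <list of nodes>).
to_newick <- function(node, is_root = TRUE) {
  out <- node$name
  if (length(node$children) > 0) {
    # Recursively serialize the children and wrap them in parentheses.
    kids <- vapply(node$children, to_newick, character(1), is_root = FALSE)
    out <- paste0("(", paste(kids, collapse = ","), ")", node$name)
  }
  # The branch length (if any) follows the node label after a colon.
  if (!is.null(node$length)) out <- paste0(out, ":", node$length)
  # A Newick tree is terminated by a semicolon.
  if (is_root) out <- paste0(out, ";")
  out
}

# Made-up toy family with one sub-branch and three languages.
toy <- list(name = "Family", length = NULL, children = list(
  list(name = "LangA", length = 2, children = list()),
  list(name = "BranchX", length = 1, children = list(
    list(name = "LangB", length = 1, children = list()),
    list(name = "LangC", length = 1, children = list())
  ))
))

cat(to_newick(toy), "\n", sep = "")
# prints (LangA:2,(LangB:1,LangC:1)BranchX:1)Family;
```

The number after each colon is the length of the branch leading to that node; labels placed after a closing parenthesis name internal nodes.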

**Note**: when using these trees from `R`, the recommended way to read them is with the function `languageclassification()` (in file `FamilyTrees.R`), which returns an `S3` object of type `languageclassification` containing the list of trees and giving access to various useful things such as pretty printing, collapsing and restoring single nodes, etc. Besides, those trees extend the standard `phylo` class, so most usual things should work out-of-the-box. Definitely do *not* use `ape`'s `read.tree()` (it is known to be pretty fussy, especially when it comes to single nodes); if you must use something else, please use `phytools`'s `read.newick()` instead!
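
As a minimal sketch of the fallback mentioned in the note, the snippet below reads a Newick string containing a single node (an internal node with exactly one child) with `phytools::read.newick()`. The Newick string is a made-up toy example, and the call is guarded so that the read is simply skipped if `phytools` is not installed. The interface of `languageclassification()` is deliberately not shown; see `FamilyTrees.R` for it.

```r
# Made-up toy tree: the group that contains only LangA is a single node
# (an internal node with exactly one child).
newick <- "((LangA:1):1,(LangB:1,LangC:1):1);"

if (requireNamespace("phytools", quietly = TRUE)) {
  # read.newick() accepts the Newick string directly via its `text` argument.
  tree <- phytools::read.newick(text = newick)
  print(tree$tip.label)
} else {
  message("Package 'phytools' is not installed; skipping the example.")
}
```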

If you use (parts of) the `R` scripts and/or the generated Newick trees, please cite this work and provide a link to this repository ([https://github.com/ddediu/lgfam-newick](https://github.com/ddediu/lgfam-newick))!

## Releases

"Official" releases can be found in the `./releases` directory.

## Running the `R` code

If you are **trying to run the `R` code yourself**, please note that I have removed some of the large cached intermediary results (in order to save space).
Thus, you must first generate these cached data, as follows.

Run the `./input/distances/WALS/process-wals-distances.R` script to generate the WALS-based distance matrices.

Run the `./input/distances/ASJP/process-asjp16-distances.R` script to generate the ASJP16-based distance matrix.

Run the `./code/StandardizedTrees.R` main `R` script with the following parameters set to `TRUE`: `MATCH_CODES` (compute the equivalences between the ISO, WALS, AUTOTYP and GLOTTOLOG codes and generate the UULIDs), `PREOPTIMIZE_DISTS` (pre-optimize the distance matrices for fast loading when required) and `COMPUTE_GEO_DISTS` (compute the geographic distances between languages).
For later runs (after these data have been generated and cached) these parameters can be safely set to `FALSE` (this pre-processing is computationally very expensive).
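
The order of the three steps above can be sketched as a small driver. This is only an illustration under stated assumptions: `run_steps()` is a hypothetical helper that is not part of this repository, and the sketch assumes that the scripts can be sourced from the repository root and that the flags have already been set in `./code/StandardizedTrees.R`.

```r
# Hypothetical helper (not part of this repository): run a sequence of R
# scripts in order, stopping before anything is run if one of them is missing.
# `runner` is the function applied to each path; it defaults to base R's source().
run_steps <- function(scripts, runner = source) {
  missing_scripts <- scripts[!file.exists(scripts)]
  if (length(missing_scripts) > 0) {
    stop("Missing script(s): ", paste(missing_scripts, collapse = ", "))
  }
  for (script in scripts) {
    message("Running ", script)
    runner(script)
  }
  invisible(scripts)
}

# The three scripts named in this README, in the order given above.
first_run <- c(
  "./input/distances/WALS/process-wals-distances.R",
  "./input/distances/ASJP/process-asjp16-distances.R",
  "./code/StandardizedTrees.R"
)

# Assumption: the working directory is the repository root.
# Uncomment to actually run the (computationally expensive) pipeline:
# run_steps(first_run)
```

If a script expects its own directory as the working directory, passing `runner = function(p) source(p, chdir = TRUE)` is one option, since `source()` can temporarily switch to the directory of the sourced file.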
The parameters `TRANSFORM_TREES` (transform the trees from their original specific representation to the Newick notation without branch lengths), `EXPORT_NEXUS` (export the trees to a NEXUS file), `EXPORT_NEXUS_TRANSLATE_BLOCK` (when exporting NEXUS files, generate a TRANSLATE block; useful when using programs such as BayesTraits that have issues parsing complicated taxa names) and `EXPORT_CSV` (export the trees to a CSV file) can be left on `TRUE` (except perhaps the first, as the tree topologies will probably not change very often in the original databases).
Please note that the first time the Ethnologue tree topologies are transformed to Newick, they will be downloaded from the Ethnologue website and cached locally.
The last two parameters are `COMPUTE_BRLEN` (apply the various branch length methods to the Newick topologies) and `COMPARE_TREES` (compute the distance between equivalent trees).
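
To illustrate what a TRANSLATE block is (the feature controlled by `EXPORT_NEXUS_TRANSLATE_BLOCK`), here is a minimal sketch of a NEXUS TREES block in which the taxa are replaced by numbers. The function `nexus_with_translate()` and the taxa names are hypothetical and made up for this example; the exact layout of the files produced by this repository may differ.

```r
# Hypothetical helper (not part of this repository): build a minimal NEXUS
# TREES block with a TRANSLATE table that maps numbers to taxon names.
# `newick_ids` must already use the numbers 1..length(taxa) as tip labels.
# Names containing single quotes are not escaped in this sketch.
nexus_with_translate <- function(taxa, newick_ids, tree_name = "tree1") {
  ids <- seq_along(taxa)
  entries <- paste0("    ", ids, " '", taxa, "'")
  c(
    "#NEXUS",
    "BEGIN TREES;",
    "  TRANSLATE",
    paste0(paste(entries, collapse = ",\n"), ";"),
    paste0("  TREE ", tree_name, " = ", newick_ids),
    "END;"
  )
}

# Made-up taxa names containing spaces and dashes.
taxa <- c("Lang A-north", "Lang B-south", "Lang C-west")
writeLines(nexus_with_translate(taxa, "(1:2,(2:1,3:1):1);"))
```

The tree itself then only contains the short numeric labels, while the complicated names appear once, quoted, in the TRANSLATE table.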
Finally, `CPU_CORES` controls multi-core processing (using `mclapply`, which might not work on Windows!).
(It is a good idea to leave `quotes="'"`.)
Parameters `CLASSIFICATIONS`, `METHODS`, `CONSTANT` and `DISTS.CODES` control which classifications, methods and parameters to use for generating the Newick trees.
These are very specific to the current implementation but can be used to extend this work to other classifications or branch length methods.
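
The Windows caveat for `CPU_CORES` mentioned above can be handled along the following lines. This is a minimal sketch and not the repository's code: the value of `CPU_CORES` and the toy computation are stand-ins chosen for the example.

```r
library(parallel)

# Stand-in value for the CPU_CORES parameter described in this README.
CPU_CORES <- 2

# mclapply() relies on forking, which is not available on Windows, where
# mc.cores > 1 results in an error; fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else CPU_CORES

# Toy computation standing in for the real per-tree work.
squares <- mclapply(1:4, function(i) i^2, mc.cores = cores)
print(unlist(squares))
```

With a single core, `mclapply()` behaves like `lapply()`, so the results are the same and only the running time differs.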

## Possible bugs! Please report them!

Please note that even if the `R` code is relatively well-tested there might be bugs or other issues!
So, please use the code and its outputs with caution; any comments, suggestions or bug reports are welcome, either through GitHub's own issue reporting facilities or by e-mail to <Dan.Dediu@mpi.nl>.

## Thank you

Dan Dediu

The Netherlands

October 2015