---
title: Construct the Glottolog tree
format: html
---

Setting the working directory (in case the IDE starts somewhere else), activating the local Julia environment, and loading packages.

In [1]:
cd(@__DIR__)
using Pkg; Pkg.activate("."); Pkg.instantiate()
using CSV, DataFrames
using Pipe
using ProgressMeter


[32m[1m  Activating[22m[39m project at `/localscratch/nwsja01/projects/research/msa_vs_cognates/code`


Pointing PyCall at the Python interpreter on the `PATH` (which needs to have `ete3` installed), rebuilding PyCall, and loading `ete3` into Julia.

In [2]:
ENV["PYTHON"] = chomp(read(pipeline(`which python`), String))
Pkg.build("PyCall")
using PyCall

ete3 = pyimport("ete3")

[32m[1m    Building[22m[39m Conda ─→ `/localscratch/nwsja01/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/51cab8e982c5b598eea9c8ceaced4b58d9dd37c9/build.log`
[32m[1m    Building[22m[39m PyCall → `/localscratch/nwsja01/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/9816a3826b0ebf49ab4926e2b18842ad8b5c8f04/build.log`


PyObject <module 'ete3' from '/localscratch/nwsja01/miniconda3/envs/msa/lib/python3.10/site-packages/ete3/__init__.py'>

Downloading the Glottolog tree from the MPI website, unless a local copy already exists.


In [3]:
glottologF = "../data/tree_glottolog_newick.txt"

isfile(glottologF) || download(
    "https://cdstar.eva.mpg.de//bitstreams/EAEA0-CFBC-1A89-0C8C-0/tree_glottolog_newick.txt",
    glottologF
)

"../data/tree_glottolog_newick.txt"

Parsing the Glottolog family trees and cleaning them up.

In [4]:
raw = readlines(glottologF);

In [5]:
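# One ete3 tree per line of the file.
trees = []

for ln in raw
    ln = strip(ln)
    # Remove quote characters and square brackets from the node labels.
    ln = replace(ln, r"\'" => "")
    ln = replace(ln, r"\'[A-ZÄÖÜ][^[]*\[" => "[")
    ln = replace(ln, r"\][^']*\'" => "]")
    ln = replace(ln, r"\[|\]" => "")
    # Drop the ":1" branch lengths.
    ln = replace(ln, ":1" => "")
    # format = 1 tells ete3 to read internal node names.
    tree = ete3.Tree(ln, format = 1)
    for nd in tree.traverse()
        # Keep the first 8 characters of the last whitespace-separated token.
        nd.name = split(nd.name)[end][1:8]
    end
    push!(trees, tree)
end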
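To make the effect of the cleanup visible, the sketch below applies the same replacements to a single hand-written line. The sample line is an assumption about the label layout in the Glottolog file (`'Name [glottocode][iso]-l-'` followed by `:1`), not a line copied from it, and `clean_newick` and `glottocode` are hypothetical helper names. No Python is needed for this part.

```julia
# Hypothetical helper: the same replacements as in the loop above, in the same order.
function clean_newick(ln)
    ln = strip(ln)
    ln = replace(ln, r"\'" => "")                    # removes every single quote
    ln = replace(ln, r"\'[A-ZÄÖÜ][^[]*\[" => "[")
    ln = replace(ln, r"\][^']*\'" => "]")
    ln = replace(ln, r"\[|\]" => "")                 # removes square brackets
    ln = replace(ln, ":1" => "")                     # removes the ":1" branch lengths
    return ln
end

# Hypothetical helper: last whitespace-separated token, first 8 characters.
glottocode(label) = split(label)[end][1:8]

# Assumed sample line, written by hand for illustration.
sample = "('Abaza [abaz1241][abq]-l-':1,'Abkhaz [abkh1244][abk]-l-':1)'Abkhaz-Abaza [abkh1243]':1;"

cleaned = clean_newick(sample)
println(cleaned)
# → (Abaza abaz1241abq-l-,Abkhaz abkh1244abk-l-)Abkhaz-Abaza abkh1243;

println(glottocode("Abaza abaz1241abq-l-"))
# → abaz1241
```

Under this assumed layout, the Glottocode is always the start of the last token of a label, which is why slicing `[1:8]` is enough to drop the ISO code and the `-l-` suffix.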

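Combining the trees into a single tree. Every named internal node then gets a new leaf child carrying its name, and the internal node itself is left unnamed, so that each Glottocode ends up on a leaf.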

In [6]:
glot = ete3.Tree()

for t in trees
    glot.add_child(t)
end

nonLeaves = [nd.name for nd in glot.traverse()
             if nd.name != "" && !nd.is_leaf()]

@showprogress for nm in nonLeaves
    nd = (glot & nm)
    nd.name = ""
    nd.add_child(name = nm)
end


Progress: 100%|█████████████████████████████████████████| Time: 0:02:09
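The renaming step can be illustrated without ete3 on a toy tree. The sketch below is not the code used above: it swaps the name lookup with `glot & nm` for a single recursive pass, and `Node`, `push_names_to_leaves!`, and `leafnames` are hypothetical names introduced only for this example. The Glottocodes in the toy tree are made up.

```julia
# Hypothetical minimal tree type, used only for this illustration (not ete3).
mutable struct Node
    name::String
    children::Vector{Node}
end
Node(name::String) = Node(name, Node[])

isleaf(nd::Node) = isempty(nd.children)

# Give every named internal node a new leaf child carrying its name,
# and blank the name of the internal node itself.
function push_names_to_leaves!(nd::Node)
    # Recurse first, so the freshly added leaves are not visited again.
    foreach(push_names_to_leaves!, nd.children)
    if !isleaf(nd) && nd.name != ""
        push!(nd.children, Node(nd.name))
        nd.name = ""
    end
    return nd
end

# Collect leaf names from left to right.
leafnames(nd::Node) = isleaf(nd) ? [nd.name] : reduce(vcat, leafnames.(nd.children))

# Toy example: "lang1234" is an internal node with two dialects below it.
toy = Node("fami1234", [
    Node("lang1234", [Node("dial1111"), Node("dial2222")]),
    Node("lang5678"),
])

push_names_to_leaves!(toy)
println(leafnames(toy))
# → ["dial1111", "dial2222", "lang1234", "lang5678", "fami1234"]
```

After the pass, a language that has dialects below it is itself available as a leaf, so pruning by leaf name can keep it.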


Pruning the tree to the taxa in the Lexibank dataset. Glottocodes that do not occur in the Glottolog tree are first attached directly to the root.

In [7]:
lxb = @pipe CSV.File("../data/wordlist_complete.csv") |> 
    DataFrame |>
    dropmissing(_, [:ASJP, :Cognateset_ID]) |>
    filter(x -> x.ASJP != "", _)
taxa = lxb.Glottocode |> unique

glot_taxa = [nd.name for nd in glot.traverse() if nd.name != ""]

missing_taxa = [x for x in taxa if x ∉ glot_taxa]

for l in missing_taxa
    glot.add_child(name = l)
end

glot.prune([(glot & l) for l in taxa])
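As a side note on the membership test, the sketch below shows the same comprehension on toy data with a `Set` in place of a vector. This is a possible variation, not what the cell above does; the `toy_` names and the Glottocodes are made up for illustration.

```julia
# Toy inputs standing in for `taxa` and `glot_taxa`; the codes are made up.
toy_taxa = ["abaz1241", "abkh1244", "xxxx0000"]
toy_glot_taxa = ["abaz1241", "abkh1244", "adyg1241"]

# Membership in a Set is a hash lookup, whereas `∉` on a Vector scans it.
toy_glot_set = Set(toy_glot_taxa)
toy_missing = [x for x in toy_taxa if x ∉ toy_glot_set]

println(toy_missing)
# → ["xxxx0000"]
```

With a few thousand taxa either form should be fast enough; the `Set` mainly matters when both lists are large.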

Saving the tree to a file.

In [10]:
open("../data/glottolog.tre", "w") do io
    write(io, glot.write(format = 9))
end

14481

Creating an individual tree for each dataset with at least 10 taxa.

In [11]:
dbs = unique(lxb.db)
mkpath("../data/glottolog_trees")
for db in dbs
    db_taxa = filter(x -> x.db == db, lxb).Glottocode |> unique
    if length(db_taxa) >= 10
        db_glot = deepcopy(glot)
        db_glot.prune([(db_glot & l) for l in db_taxa])
        open("../data/glottolog_trees/$(db)_glottolog.tre", "w") do io
            write(io, db_glot.write(format = 9))
        end
    end
end
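The selection of datasets can be sketched in plain Julia on toy rows. The helper `taxa_by_dataset` and the dataset names are hypothetical; the sketch groups the rows in one pass rather than filtering the table once per dataset, and the threshold is lowered to 2 so that the toy data shows both outcomes.

```julia
# Toy rows standing in for `lxb`; dataset names and codes are made up.
rows = [
    (db = "dsA", Glottocode = "aaaa0001"),
    (db = "dsA", Glottocode = "aaaa0002"),
    (db = "dsA", Glottocode = "aaaa0001"),   # repeated language
    (db = "dsB", Glottocode = "bbbb0001"),
]

# Hypothetical helper: unique Glottocodes per dataset, keeping only datasets
# with at least `min_taxa` languages (the loop above uses 10).
function taxa_by_dataset(rows; min_taxa = 10)
    groups = Dict{String, Vector{String}}()
    for r in rows
        codes = get!(groups, r.db, String[])
        r.Glottocode in codes || push!(codes, r.Glottocode)
    end
    return filter(kv -> length(kv.second) >= min_taxa, groups)
end

selected = taxa_by_dataset(rows; min_taxa = 2)
println(selected)
```

Here only `dsA` survives, with its two distinct Glottocodes; `dsB` has a single language and is dropped.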