# Phylogenetic Factor Analysis Pipeline

## 1. Setup Data Files

This analysis requires two input files, both located in the __data__ directory of this repository:
 - a CSV file containing the trait data
 - a text file containing the phylogenetic tree in Newick format

Here are the files in the __data__ directory.

In [1]:
cd("data") # setting the working directory to the "data" folder
readdir() # showing the files in the directory

3-element Array{String,1}:
 "mammals_labels.csv"
 "mammals_log_data.csv"
 "mammals_trimmed_newick.txt"

Right now, we care about the __mammals_log_data.csv__ file (which stores the trait data) and the __mammals_trimmed_newick.txt__ file (which contains the phylogenetic tree).

### Trait Data

The file containing the data must have labeled columns, with the first column labeled "taxon".

In [3]:
using CSV, DataFrames # importing packages to work with CSV files and data frames
CSV.read("mammals_log_data.csv", DataFrame) # recent CSV.jl versions require the sink type (here DataFrame)

|  | taxon | body_mass | age_at_first_birth | gestation_length | litter_size |
|---|---|---|---|---|---|
|  | String | Float64⍰ | Float64⍰ | Float64⍰ | Float64⍰ |
| 1 | Camelus_dromedarius | 5.6926 | 3.21791 | 2.58716 | -0.00877392 |
| 2 | Canis_adustus | 4.01672 | missing | 1.81291 | 0.653213 |
| 3 | Canis_aureus | 3.98492 | missing | 1.78704 | 0.572872 |
| 4 | Canis_latrans | 4.07879 | 2.56229 | 1.79057 | 0.757396 |
| 5 | Canis_lupus | 4.50183 | 2.73838 | 1.80277 | 0.697229 |
| 6 | Bos_frontalis | 5.90317 | missing | 2.43735 | 0.0863598 |
| 7 | Bos_grunniens | 5.69897 | missing | 2.43735 | 0.0 |
| 8 | Bos_javanicus | 5.80344 | 2.96023 | 2.47243 | 0.0863598 |
| 9 | Callicebus_cupreus | 3.04806 | missing | 2.11143 | 0.00432137 |
| 10 | Callicebus_donacophilus | 2.95312 | missing | missing | 0.00860017 |
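
The header requirement described in this section (first column labeled "taxon") can be checked before running anything else. The following is a minimal sketch using only base Julia; `trait_columns` is a hypothetical helper, not part of the pipeline, and it assumes a plain comma-separated header with no commas inside column names.

```julia
# Hypothetical helper (not part of the pipeline): reads only the header line
# of a trait CSV and checks that the first column is labeled "taxon".
# Assumes a simple comma-separated header without commas inside column names.
function trait_columns(path::AbstractString)
    header = open(readline, path)  # first line of the file
    # split on commas and strip surrounding spaces or double quotes
    columns = [String(strip(c, ['"', ' '])) for c in split(header, ',')]
    first(columns) == "taxon" ||
        error("first column must be labeled \"taxon\", found \"$(first(columns))\"")
    return columns[2:end]  # the remaining columns are the trait names
end

# Example call (path assumed relative to the current working directory):
# trait_columns("mammals_log_data.csv")
```

The returned trait names are also what the labels file has to cover later on, so keeping them in a variable can be convenient.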


### Phylogenetic Tree

The tree must be in Newick format.

In [14]:
newick = read("mammals_trimmed_newick.txt", String)
newick[1:100] # only showing the beginning of the newick as it is very large

"(((Tachyglossus_aculeatus:13.7,Zaglossus_bruijni:13.7):49.9,Ornithorhynchus_anatinus:63.6):102.6,((("
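
Because the tree and the trait data refer to the same taxa, it can help to confirm that the tip names in the Newick string match the entries of the "taxon" column. The sketch below is an illustration under stated assumptions, not the pipeline's own check: `newick_tip_labels` is a hypothetical helper, and it assumes unquoted tip labels without spaces, as in the output above.

```julia
# Hypothetical helper (not part of the pipeline): extracts tip labels from a
# Newick string. In Newick, a tip label directly follows "(" or ",";
# internal node labels and branch lengths follow ")" or ":" and are skipped.
# Assumes labels are unquoted and contain no spaces.
function newick_tip_labels(newick::AbstractString)
    return [String(m[1]) for m in eachmatch(r"[(,]\s*([^(),:;\s]+)", newick)]
end

# Toy tree built from the first clade shown in the output above.
toy_tree = "((Tachyglossus_aculeatus:13.7,Zaglossus_bruijni:13.7):49.9,Ornithorhynchus_anatinus:63.6);"
tips = newick_tip_labels(toy_tree)

# Taxa that appear in the data but not in the tree (illustrative values).
taxa = ["Zaglossus_bruijni", "Canis_lupus"]
not_in_tree = setdiff(taxa, tips)
```

For the real files, `tips` would come from the full `newick` string and `taxa` from the "taxon" column of the trait data.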

### <font color='red'>Set the data and tree file names in the `instructions.jl` file </font>

In the `instructions.jl` file, set the following variables, replacing `mammals_example`, `mammals_log_data.csv`, and `mammals_trimmed_newick.txt` with the values / files relevant to your analysis.

In [20]:
name = "mammals_example" # name of the folder where the results are stored;
                         # also used as the name of the final xml and log files

data_filename = "mammals_log_data.csv" # csv file where the data are stored (must be in ./data directory)
newick_filename = "mammals_trimmed_newick.txt" # file where the newick tree is stored (must be in ./data directory)
;
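
Since both file names must point into the `./data` directory, a quick existence check can catch typos before the analysis starts. This is a minimal sketch; `missing_inputs` is a hypothetical helper, and the `"data"` path assumes the code runs from the repository root.

```julia
# Hypothetical helper (not part of the pipeline): returns the names of the
# input files that cannot be found in `data_dir`.
function missing_inputs(data_dir::AbstractString, filenames::AbstractString...)
    return String[f for f in filenames if !isfile(joinpath(data_dir, f))]
end

data_filename = "mammals_log_data.csv"
newick_filename = "mammals_trimmed_newick.txt"

# Assumes the current working directory is the repository root.
absent = missing_inputs("data", data_filename, newick_filename)
isempty(absent) || @warn "Input files not found in ./data" absent
```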

## 2. Setup Model Selection

Before running the final analysis, we must first determine the most appropriate number of factors.
We do this indirectly, using the prior on the loadings matrix to shrink the number of factors.

### Trait labels (for plotting)

The labels file (here __mammals_labels.csv__) maps trait names to plot labels. Its first column should contain the column names of the data file (excluding the first "taxon" column). The rows can be in any order, as long as every trait is present.

The second column should contain the labels that you want to appear on the final plot.

In [16]:
CSV.read("mammals_labels.csv", DataFrame) # recent CSV.jl versions require the sink type (here DataFrame)

|  | label | pretty | cat |
|---|---|---|---|
|  | String | String | String |
| 1 | body_mass | body mass |  |
| 2 | neonate_body_mass | neonate body mass |  |
| 3 | gestation_length | gestation length |  |
| 4 | weaning_age | weaning age |  |
| 5 | age_at_first_birth | age at first birth |  |
| 6 | reproductive_lifespan | reproductive lifespan |  |
| 7 | litter_size | litter size |  |
| 8 | litters_per_year | litters per year |  |
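
The rule that the labels must cover the data file's trait columns, in any order, can be expressed as a set comparison. The sketch below is only an illustration: `compare_labels` is a hypothetical helper, and the example vectors are short stand-ins for the real column names.

```julia
# Hypothetical helper (not part of the pipeline): compares the trait columns
# of the data file with the entries in the first column of the labels file.
# Order does not matter; only membership is compared.
function compare_labels(trait_columns, label_entries)
    unlabeled = setdiff(trait_columns, label_entries)  # traits with no label row
    unknown = setdiff(label_entries, trait_columns)    # label rows with no matching trait
    return (unlabeled = unlabeled, unknown = unknown)
end

# Illustrative stand-ins: same entries in a different order, so both
# result vectors are empty.
traits = ["body_mass", "gestation_length", "litter_size"]
labels = ["litter_size", "body_mass", "gestation_length"]
result = compare_labels(traits, labels)
```

A non-empty `unlabeled` vector would point to traits that still need a row in the labels file.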
