#  Documentation for "Documenting nature: recreating published data"

Authors: 

**Objective:** Recreate results from Leininger et al. 2014, "Developmental gene expression provides clues to relationships between sponge and eumetazoan body plans."

# Data storage: 

**_All files and data are housed on GitHub where possible. <br> Visit https://github.com/aweeger/BCB546X_FinalGroup.git. <br> For questions, please email the main author, Axelle Weeger, at: aweeger@iastate.edu._**

# Downloading the data:

###  Download DNA sequences:

Navigate to: http://www.ebi.ac.uk/ena/data/view/HG973349-HG973421

Download the single file of DNA sequences generated in this study, deposited at the European Nucleotide Archive under the accession codes __HG973349__ to __HG973421__. <br> Save the file to GitHub.


**Data description:** _.fasta_ file, containing 73 developmental gene sequences from _Sycon ciliatum_.
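As a quick sanity check on the download, the accession range can be compared with the number of records in the file. The sketch below is illustrative only: the function names are hypothetical, and it assumes the accessions are consecutive and that every FASTA record starts with a `>` header line.

```python
def accession_range(prefix, first, last):
    """List accession codes from prefix+first to prefix+last, inclusive."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def count_fasta_records(fasta_text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Accession range stated above: HG973349 to HG973421.
expected = accession_range("HG", 973349, 973421)
print(len(expected))  # 73, the number of sequences the file should contain
```

Calling `count_fasta_records` on the text of the downloaded file should return the same number as `len(expected)`.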

### Download the RNA sequences:

Navigate to: www.ebi.ac.uk/arrayexpress.

Download the RNA-seq data, deposited in the ArrayExpress database under accession codes __E-MTAB-2430__ and __E-MTAB-2431__. <br> Save file to GitHub.

**Data description**: Sequences are housed in two folders (**_E-MTAB-2430_**: localization in the adult form, and **_E-MTAB-2431_**: different developmental stages), containing RNA expression data for the developmental genes identified above. <br> Each folder contains 2 technical replicates and is used for differential expression analysis of each gene.

_Note: Make sure enough storage is available, as the files are large. The 22 files in E-MTAB-2430 average 3 GB each, and the 44 files in E-MTAB-2431 average 3.5 GB each._
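The file counts and average sizes in the note give a rough total to budget for. The following is a minimal sketch of that arithmetic plus a free-space check; the helper names are hypothetical, and the sizes are the approximate averages quoted in the note, not measured values.

```python
import shutil

# (number of files, average size in GB), taken from the note above.
datasets = {
    "E-MTAB-2430": (22, 3.0),
    "E-MTAB-2431": (44, 3.5),
}

# Approximate storage needed per dataset and in total.
needed_gb = {name: count * size for name, (count, size) in datasets.items()}
total_gb = sum(needed_gb.values())


def has_room(path, required_gb):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print(needed_gb)
print(total_gb)
```

For example, `has_room("/ptmp", total_gb)` would check the storage location used in this analysis before starting the download.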

For this analysis, files were stored in the /ptmp folder of the remote computer "hpc-class" and accessed by group members as needed. 

**Obtaining RNA sequence files**:

A.W. wrote scripts to batch download the _.fastq_ files: <br> _2430_fieldnld.sh_ for E-MTAB-2430, and _2431_fieldnld.sh_ for E-MTAB-2431.

The download list was built as follows:

1. Download the summary files for both _E-MTAB-2430_ and _E-MTAB-2431_, available through the ArrayExpress link.
2. Extract the column containing the hyperlink text from each summary file.
3. Use that column as the input list: the scripts run the _wget_ command on each link and download the files into their designated folders.
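The same extract-then-download idea can be sketched in Python. This is not the authors' script: it uses `urllib` in place of _wget_, the function names are hypothetical, and the column name `Comment[FASTQ_URI]` is an assumption about the layout of the tab-separated summary file that should be checked against the downloaded file.

```python
import csv
import io
import urllib.request
from pathlib import Path


def extract_urls(summary_text, column="Comment[FASTQ_URI]"):
    """Return the non-empty values of the link column of a tab-separated summary file."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    return [row[column] for row in reader if row.get(column)]


def download_all(urls, dest_dir):
    """Download each URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))
```

A typical call would read one summary file, pass its text to `extract_urls`, and hand the result to `download_all` with the folder for that accession as `dest_dir`.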

# Data Preparation:

Align the sequences with Clustal (browser GUI, accessed 12/2019) so that all sequences are the same length. <br>
Obtain the _.aln_ file from Clustal. <br>
Save the _.aln_ file.
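Before moving on, it can be worth confirming that the saved alignment really has equal-length sequences. The sketch below parses Clustal-format text and checks this; the function names are hypothetical, and it assumes the usual Clustal layout (a header line starting with `CLUSTAL`, blocks of `name sequence` lines, and conservation lines that begin with whitespace).

```python
def read_clustal(text):
    """Join the per-block chunks of each sequence in Clustal-format text."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and conservation lines (leading whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


def is_aligned(seqs):
    """Return True if at least one sequence exists and all have the same length."""
    return len({len(seq) for seq in seqs.values()}) == 1
```

If `is_aligned(read_clustal(text))` is false for the saved _.aln_ file, the alignment step should be repeated before converting formats.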

Convert the aligned file ( _.aln_ ) to NEXUS format ( _.nex_ ) using ALTER (browser GUI, accessed 12/2019). <br>
Save the _.nex_ file.
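For readers unfamiliar with the target format, the sketch below shows roughly what a non-interleaved NEXUS data block looks like. It is not ALTER's implementation: `to_nexus` is a hypothetical helper, and `datatype=protein` is an assumption based on the amino acid model used later in this workflow.

```python
def to_nexus(seqs, datatype="protein"):
    """Write aligned sequences (name -> sequence) as a simple NEXUS data block."""
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    nchar = lengths.pop()
    width = max(len(name) for name in seqs) + 2  # pad names into a column
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(seqs)} nchar={nchar};",
        f"  format datatype={datatype} gap=- missing=?;",
        "  matrix",
    ]
    lines += [f"  {name.ljust(width)}{seq}" for name, seq in seqs.items()]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


print(to_nexus({"seqA": "MKT-A", "seqB": "MKTLA"}))
```

The `ntax` and `nchar` values in the `dimensions` line are the number of sequences and the alignment length, which is why all sequences must have equal length before conversion.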

Use the ProtTest 3.4.2 package (https://github.com/ddarriba/prottest3/releases) to obtain model parameters for the MrBayes analysis. <br> 
_Note: ProtTest 3.4.2 was installed on the local computer of A.A._ <br>
1. Open the software. <br> 
2. Load the _.nex_ alignment. <br>
3. Set up the maximum likelihood score calculation, selecting the LG model, as this was the model specified in the original paper. <br>
4. Compute. _Note: compute time is about 10 minutes._

**Results**: LG model, 143 parameters (0 + 143 branch length estimates), lnL = 211196.10
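The reported parameter count can be cross-checked against the size of the data set: a fully resolved unrooted tree with n taxa has 2n - 3 branches. The sketch below assumes the alignment contains all 73 sequences and that the fixed LG model contributes no free parameters (the "0" in the result above).

```python
def unrooted_branch_count(n_taxa):
    """Number of branches in a fully resolved (bifurcating) unrooted tree."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return 2 * n_taxa - 3


n_sequences = 73       # sequences in the alignment (assumed)
model_parameters = 0   # fixed LG model, no additional free parameters (assumed)
total_parameters = model_parameters + unrooted_branch_count(n_sequences)
print(total_parameters)  # 143
```

The result matches the 143 parameters reported by ProtTest, which suggests that every sequence was included in the alignment.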

# Phylogenetic Tree Generation:

Download the **MrBayes v.3.2.7** package (used on a local computer of a group member).

_Note: Place the aligned .nex file in the same folder as MrBayes so that it can be loaded without relative path information._

Start MrBayes (the executable is typically named _mb_) and enter the commands below at the MrBayes prompt. These are MrBayes commands, not Python, so they cannot be run in a notebook code cell.

Load the alignment:

    execute Sponge_Mrbayes.nex

Change the amino acid model to LG, based on the ProtTest results:

    prset aamodelpr=fixed(lg)

Verify the model change:

    showmodel

Start a log file:

    log start filename=Sponge_Mrbayes.txt

<font color = red> Save file: </font>

<font color = red> Open file in Tracer: </font>

<font color = red> Open file in **FigTree v.1.4.4** (used on a local computer of a group member) </font>

<font color = red> Edit and save image (.ext) </font>

# RNA seq analysis:

To generate the count data for the heatmap with RSEM, log in to your cluster, navigate to the directory containing the appropriate .fasta and .fastq files, and run the RSEM commands in the UNIX terminal. These commands are listed in the RSEM.md file in this repository's parent directory.

<font color = red> **Please list any additional programs used to visualize the data.** </font>

<font color = red> Please include any instructions or code written for this task as well. </font>

Read counts averaged across replicates are stored in the **_counts_** folder. Read counts for each individual replicate are stored in the **_individual_** folder.
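The relationship between the two folders can be illustrated with a short sketch that averages per-replicate counts. It is not the code used in this project: the function names are hypothetical, and it assumes RSEM-style tab-separated gene results with `gene_id` and `expected_count` columns and a plain arithmetic mean over replicates.

```python
import csv
import io


def read_expected_counts(results_text):
    """Map gene_id to expected_count from RSEM-style tab-separated text."""
    reader = csv.DictReader(io.StringIO(results_text), delimiter="\t")
    return {row["gene_id"]: float(row["expected_count"]) for row in reader}


def average_replicates(replicates):
    """Average counts over replicates for genes present in every replicate."""
    shared = set.intersection(*(set(rep) for rep in replicates))
    return {
        gene: sum(rep[gene] for rep in replicates) / len(replicates)
        for gene in sorted(shared)
    }
```

Each file in the **_individual_** folder would be read with `read_expected_counts`, and the replicates for one sample passed together to `average_replicates`.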

# Gene expression analysis and heatmap:

The entire workflow and code are stored in **_RNAseq_GeneExpression/final.Rmd_**. The folder **_RNAseq_GeneExpression_** also includes metadata (**_genes.csv_**: list of gene names and their sequence IDs, and **_samples_**: list of read count result file names and their developmental stages), the significantly differentially expressed (DE) genes in the apical region (**_DEG top_bottom.csv_**: DE genes between the top and bottom parts, and **_DEG top_middle.csv_**: DE genes between the top and middle parts), and the heatmap (**_heatmap.jpg_**).