
Function: repnodeTrim

G. Kenney edited this page Jul 28, 2023 · 6 revisions

repnodeTrim

Generally, after running prepNeighbors, I submit the trimmed protein sequences for my genes of interest to the EFI-EST server via option C (a user-uploaded FASTA file). The EFI-EST toolsets are designed to work with UniProt data, so genes are assigned faux-UniProt IDs, which can complicate tying the output back to analyses that use the IMG gene_oid values. This accessory function connects the two IDs, taking the node metadata file for the full network as an input.

Additionally, during EFI-EST SSN setup, sequences below or above the length cutoffs are often trimmed, and a %ID cutoff is established, above which highly similar sequences are grouped and a representative node is chosen for SSN visualization. These "repnodes" are also useful for avoiding over-weighting of a network towards data from very heavily sequenced species and genera (e.g. E. coli) and for working with very large and computationally intensive networks. It's therefore often helpful to proceed using only the representative nodes, so this function also provides trimmed versions of the gene and neighbor metadata and sequences, taking the node metadata file for the repnode network as an additional input.

Note that this function is not required for use of analyzeNeighbors or prettyClusterDiagrams.
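The ID bookkeeping described above can be pictured with a toy example in base R. This is only a sketch of the idea, not the function's actual implementation: the data frames, the column names (`efi_id`, `product`), and the IDs are all made up for illustration and do not reflect the real IMG or EFI-EST export formats.

```r
# Toy data only: every column name and ID below is hypothetical.
imgGenes <- data.frame(
  gene_oid = c("1001", "1002", "1003", "1004"),
  product  = c("GenE", "GenE", "GenE-like", "GenE"),
  stringsAsFactors = FALSE
)

# Full-network node table: one row per submitted sequence, linking the
# faux-UniProt ID assigned by EFI-EST to the original IMG gene_oid.
efiFull <- data.frame(
  efi_id   = c("ZAAAAA", "ZAAAAB", "ZAAAAC", "ZAAAAD"),
  gene_oid = c("1001", "1002", "1003", "1004"),
  stringsAsFactors = FALSE
)

# Repnode-network node table: only the representative nodes remain.
efiRepnodes <- data.frame(
  efi_id = c("ZAAAAA", "ZAAAAC"),
  stringsAsFactors = FALSE
)

# Step 1: attach the EFI ID to every gene of interest.
genesWithEfi <- merge(imgGenes, efiFull, by = "gene_oid")

# Step 2: flag which genes were chosen as representative nodes.
genesWithEfi$isRepnode <- genesWithEfi$efi_id %in% efiRepnodes$efi_id

# Step 3: keep only the repnodes for downstream analyses.
repGenes <- genesWithEfi[genesWithEfi$isRepnode, ]
```

In this sketch, `genesWithEfi` plays the role of the full metadata table with IDs added, and `repGenes` the repnode-only table.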

Before using, you'll need to export the node metadata tables for your full network and the network at your desired repnode %ID. By default, these will be .csv files. Make sure that you do this without putting the networks in the same "Network Collection" in Cytoscape - this can affect the metadata export!
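Before passing the two exported tables to the function, it can be worth checking that they look sensible. The helper below is a hypothetical convenience sketch (`checkNodeTables` is not part of the package), and it assumes only that both exports are ordinary comma-separated files with a header row.

```r
# Hypothetical helper, not part of the package: basic sanity checks on the
# two node metadata tables exported from Cytoscape.
checkNodeTables <- function(fullPath, repPath) {
  stopifnot(file.exists(fullPath), file.exists(repPath))
  fullTab <- read.csv(fullPath, stringsAsFactors = FALSE)
  repTab  <- read.csv(repPath, stringsAsFactors = FALSE)
  list(
    fullNodes = nrow(fullTab),
    repNodes  = nrow(repTab),
    # A repnode network should never contain more nodes than the full network.
    repSmallerOrEqual = nrow(repTab) <= nrow(fullTab),
    # Columns present in both exports.
    sharedColumns = intersect(names(fullTab), names(repTab))
  )
}

# Example call, using the filenames from this page:
# checkNodeTables("20210101_genE_efiMetadata_Full.csv",
#                 "20210101_genE_efiMetadata_Repnodes95.csv")
```

If the two tables share very few columns, or the repnode table is larger than the full one, the exports are worth re-checking (for example, for the "Network Collection" issue mentioned above).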

Use of repnodeTrim

repnodeTrimOut <- repnodeTrim(imgGenes = "20210101_neighborTrim_genE_genes.txt", imgNeighbors = "20210101_neighborTrim_genE_neighbors.txt", imgGeneSeqs = "20210101_neighborTrim_genE_geneSeqs.fa", imgNeighborSeqs = "20210101_neighborTrim_genE_neighborSeqs.fa", geneName = "genE", efiFullMetadata = "20210101_genE_efiMetadata_Full.csv", efiFinalMetadata = "20210101_genE_efiMetadata_Repnodes95.csv")

Note that if you use this function, you'll want to specify whether downstream analyses use repnodes or not (this is an option in all functions)! Also, even when working with the full dataset, you'll want to use the metadata files produced by this function, since they include the EFI IDs and can therefore be mapped back onto your SSNs.

Required input

  • imgGenes Filename. IMG-formatted metadata file for genes of interest. Required.
  • imgNeighbors Filename. IMG-formatted metadata file for neighbors of genes of interest. Required.
  • imgGeneSeqs Filename. FASTA-formatted sequence file for genes of interest. Required.
  • imgNeighborSeqs Filename. FASTA-formatted sequence file for neighbors of genes of interest. Required.
  • geneName Text. The name of your gene family of interest, used for filenames and figure labels. Required.
  • efiFullMetadata Filename. The node metadata file, exported for the full network from Cytoscape. Note for large networks: layout and visualization of the network are not necessary to export this file! Required.
  • efiFinalMetadata Filename. A second node metadata file, exported from the network at the appropriate repnode cutoff. Required.

Output

  • 20210101_repnodeTrim_genE_imgGenes.txt File. Tab-delimited table, contains metadata for all input genes, but with EFI and repnode IDs added.
  • 20210101_repnodeTrim_genE_imgNeighbors.txt File. Tab-delimited table, contains metadata for all input neighbors, but with EFI and repnode ID for the associated genes of interest added.
  • 20210101_repnodeTrim_genE_repnodeGenes.txt File. Tab-delimited table, contains metadata (including EFI IDs) for repnodes only.
  • 20210101_repnodeTrim_genE_repnodeNeighbors.txt File. Tab-delimited table, contains metadata for neighbors of repnodes only.
  • 20210101_repnodeTrim_genE_repnodeGeneSeqs.fa File. FASTA-formatted file with amino acid sequences for repnodes only.
  • 20210101_repnodeTrim_genE_repnodeNeighborSeqs.fa File. FASTA-formatted file with amino acid sequences for neighbors of repnodes only.
  • repnodeTrimOut List. Contains repnodeTrimOut$repGenesTrimmed (data frame containing the metadata for the repnodes) and repnodeTrimOut$repNeighborsTrimmed (data frame containing the metadata for the neighbors of repnodes).
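As a rough illustration of working with the returned list, the sketch below builds a mock stand-in for `repnodeTrimOut` and pulls out its two data frames. The element names `repGenesTrimmed` and `repNeighborsTrimmed` come from the output description above; the columns (`efi_id`, `source_gene_oid`) and all values are hypothetical placeholders, so check the column names in your own output before adapting this.

```r
# Mock stand-in for the list returned by repnodeTrim. Element names follow
# the output description; the columns and values are hypothetical.
repnodeTrimOut <- list(
  repGenesTrimmed = data.frame(
    gene_oid = c("1001", "1003"),
    efi_id   = c("ZAAAAA", "ZAAAAC"),
    stringsAsFactors = FALSE
  ),
  repNeighborsTrimmed = data.frame(
    gene_oid        = c("2001", "2002", "2003"),
    source_gene_oid = c("1001", "1001", "1003"),
    stringsAsFactors = FALSE
  )
)

# Pull the two data frames out of the list.
repGenes     <- repnodeTrimOut$repGenesTrimmed
repNeighbors <- repnodeTrimOut$repNeighborsTrimmed

# Count how many neighbors are associated with each repnode.
neighborsPerRepnode <- table(repNeighbors$source_gene_oid)
```

The same data are also written to the repnodeGenes and repnodeNeighbors files listed above, so the list is mainly a convenience for continuing in the same R session.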