beiko-lab/DNA_encoders

DNA encoders

This package can encode DNA sequences into:

  • Pc3mer
  • Pc3mer stats
  • PseKNC
  • K-mers

The input must be a FASTA file (extension .fna or .fasta) in which each sequence occupies a single line of 58 nucleotides. Example:

>Sigma++
GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG
>Sigma24
TGCCCTGACTTCACCCCGCTGTGTCTGCTTTTCCCGACTATTCTTAATGAGCTTCGAT
>Sigma--
AATGTGGATAGATATGAATTATTTTTCTCCTTAAGGATCATCCGTTATTTGGGTCGTT
>Sigma70
CAGTTATTTACCTTACTTTACGCGCGCGTAACTCTGGCAACATCACTACAGGATAGCG
>Sigma++
AAAAAGTTATACGCGGTGGAAACATTGCCCGGATAGTCTATAGTCACTAAGCATTAAA
...

After encoding the sequences, the algorithm stores them at folder_for_output/output/encoder_name/.
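
A minimal sketch of reading the input layout shown above, assuming exactly one sequence line per header. The function name read_fasta and its expected_length parameter are illustrative and are not part of this package.

```python
import io


def read_fasta(handle, expected_length=58):
    """Return a list of (class_label, sequence) pairs from a FASTA handle.

    Assumes each header line (">label") is followed by a single sequence line.
    """
    records = []
    label = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]
        elif label is not None:
            if expected_length is not None and len(line) != expected_length:
                raise ValueError(
                    f"sequence for {label!r} has {len(line)} nucleotides, "
                    f"expected {expected_length}"
                )
            records.append((label, line))
            label = None
    return records


example = io.StringIO(
    ">Sigma++\n"
    "GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG\n"
    ">Sigma24\n"
    "TGCCCTGACTTCACCCCGCTGTGTCTGCTTTTCCCGACTATTCTTAATGAGCTTCGAT\n"
)
records = read_fasta(example)
print(records[0][0], len(records[0][1]))
```

The header text is kept as the sample class, which is how the class label appears at the end of each encoded sequence in the examples below.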

Encoders

Pc3mer

Using a table [1] of twelve physicochemical property values for each 3-mer, we standardize the values and compute pc3mer by decomposing the input sequence into 3-mers and replacing each 3-mer, in the order it appears in the sequence, with its value for a given physicochemical property.

The physicochemical properties are Bendability-DNAse, Bendability-consensus, Trinucleotide GC Content, Nucleosome positioning, Consensus_roll, Consensus_Rigid, Dnase I, Dnase I-Rigid, MW-Daltons, MW-kg, Nucleosome, and Nucleosome-Rigid [1].

Example 1:

  • Sequence: GGGA...
  • (Algorithm's step 1) It decomposes the sequence into 3-mers: GGG, GGA, ...
  • Standardized physicochemical properties for GGG:
    • 'Bendability-DNAse': 0.07230364
    • 'Bendability-consensus': 0.3577835
    • 'Trinucleotide GC Content': 1.73205081
    • Etc.
  • Standardized physicochemical properties for GGA:
    • 'Bendability-DNAse': 0.26511335
    • 'Bendability-consensus': -0.0969693
    • 'Trinucleotide GC Content': 0.57735027
    • Etc.
  • (Algorithm's step 2) For each property, it replaces each 3-mer with its value for that property and stores the result as Pc3mer/<property_name>.md.
    • For Bendability-DNAse, the encoded sequence will look like: [0.07230364, 0.26511335, ..., sample_class]
    • For Bendability-consensus, the encoded sequence will look like: [0.3577835, -0.0969693, ..., sample_class]
    • For Trinucleotide GC Content, the encoded sequence will look like: [1.73205081, 0.57735027, ..., sample_class]
    • Etc.
  • When combining the individual files, make sure to delete the sample_class column for all properties but the last one, so that the encoded sequence looks like:
     [0.07230364, 0.26511335, ..., 0.3577835, -0.0969693, ..., 1.73205081, 0.57735027, ..., sample_class]
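
The two steps above can be sketched for one property whose raw values are computable without the table in [1]: Trinucleotide GC Content. This sketch assumes that "standardize" means a z-score over all 64 possible 3-mers using the population standard deviation, an assumption that reproduces the values 1.73205081 (GGG) and 0.57735027 (GGA) listed above. The helper names are illustrative and not part of this package.

```python
from itertools import product
from statistics import mean, pstdev

# All 64 possible 3-mers over the DNA alphabet.
ALL_3MERS = ["".join(p) for p in product("ACGT", repeat=3)]


def gc_count(kmer):
    """Raw property value: number of G or C nucleotides in the 3-mer."""
    return sum(base in "GC" for base in kmer)


def standardize(raw_values):
    """Z-score a {3-mer: value} table (population standard deviation assumed)."""
    mu = mean(raw_values.values())
    sigma = pstdev(raw_values.values())
    return {kmer: (value - mu) / sigma for kmer, value in raw_values.items()}


def encode_property(sequence, table):
    """Step 1: overlapping 3-mers; step 2: replace each with its table value."""
    kmers = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    return [table[kmer] for kmer in kmers]


STANDARDIZED_GC = standardize({kmer: gc_count(kmer) for kmer in ALL_3MERS})

encoded = encode_property("GGGA", STANDARDIZED_GC)
print(encoded)  # approximately [1.7320508, 0.5773503]
```

The other eleven properties would follow the same pattern with their own raw values taken from the table in [1].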
    

Example 2:

  • Entry in the fasta file:
>Sigma++
GCTGAAAATACGTTGAACGCTTACCGTCGCGATCTGTCAATGATGGTGGAGTGGTTGC
  • Sequence encoded with Bendability-consensus:
0.919537039821516,0.919537039821516,1.3475397297384397,-0.605222559882525,-2.745236029467144,-2.745236029467144,-2.584735019498298,0.5717848498890153,-0.0702191899863703,0.06353164998766837,0.06353164998766837,..., Sigma++

Usage:

from pc3mer import Pc3mer
import os

input_fasta = "input_file.fasta"  # path + file name
folder_for_output = os.path.join(os.getcwd(), "output")  # path to store the output

encoder = Pc3mer(folder_for_output=folder_for_output)
encoder.encode_fasta_file(input_fasta, store_encode_by_indiv_prop=True)

Output: It creates the folder "Pc3mer" in <folder_for_output> and twelve .md files, each one containing the encoded sequences for one of the properties. Examples: output/Pc3mer/Bendability-consensus.md and output/Pc3mer/Dnase I.md.
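
If the per-property encodings need to be merged into one feature vector per sample, as described in Example 1, the sample_class column must be kept only once. The sketch below works on rows already loaded into memory; how the .md files are read is left out, and the function name combine_encodings is hypothetical rather than part of this package.

```python
def combine_encodings(rows_per_property):
    """Concatenate per-property rows for each sample.

    rows_per_property: one list of rows per property, all in the same sample
    order; every row is [value, value, ..., sample_class].
    The class label is dropped from every property except the last one.
    """
    combined = []
    for sample_rows in zip(*rows_per_property):
        merged = []
        for row in sample_rows[:-1]:
            merged.extend(row[:-1])  # drop sample_class
        merged.extend(sample_rows[-1])  # keep sample_class once, at the end
        combined.append(merged)
    return combined


bendability_dnase = [[0.07230364, 0.26511335, "Sigma++"]]
bendability_consensus = [[0.3577835, -0.0969693, "Sigma++"]]
gc_content = [[1.73205081, 0.57735027, "Sigma++"]]

combined = combine_encodings([bendability_dnase, bendability_consensus, gc_content])
print(combined[0])
```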

Pc3mer stats

Encodes a sequence into pc3mer and then computes a set of statistics over the encoded sequence. The statistics are:

  • minimum,
  • maximum,
  • mean,
  • standard deviation,
  • median, and
  • variance.
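
A minimal sketch of these six statistics for one encoded sequence, using only the Python standard library. Whether the package uses the population or the sample form of the standard deviation and variance is not stated here; the sketch assumes the population form. The function name pc3mer_stats is illustrative.

```python
from statistics import mean, median, pstdev, pvariance


def pc3mer_stats(encoded_values):
    """Summarize one pc3mer-encoded sequence (numeric values only, no class label)."""
    return {
        "minimum": min(encoded_values),
        "maximum": max(encoded_values),
        "mean": mean(encoded_values),
        "standard deviation": pstdev(encoded_values),  # population form (assumption)
        "median": median(encoded_values),
        "variance": pvariance(encoded_values),  # population form (assumption)
    }


stats = pc3mer_stats([1.0, 2.0, 3.0, 4.0])
print(stats)
```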

Usage:

from pc3mer import Pc3mer
import os

input_fasta = "input_file.fasta"  # path + file name
folder_for_output = os.path.join(os.getcwd(), "output")  # path to store the output

encoder = Pc3mer(folder_for_output=folder_for_output)
encoder.convert_fasta_file_to_pc3mer_stats(input_fasta)

PseKNC

This implementation of PseKNC I [1], [2] decomposes the sequence into 3-mers and maps each 3-mer to its physicochemical property values, which are used to calculate correlation scores. The scores, called Theta_{i}, are appended to the 3-mer counts; each Theta_{i} refers to all pairs of 3-mers that are i nucleotides apart, for i in [1, 2]. The final array is divided by the sum of the 3-mer counts and the Theta scores.
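
The description above can be sketched as follows. This is an illustration under several assumptions, not this package's code: it uses a single stand-in property (the GC fraction of each 3-mer) instead of the property table in [1], normalized 3-mer frequencies rather than raw counts, a mean squared difference as the pairwise score, and a hypothetical weight parameter for the Theta terms. The name pseknc_sketch is illustrative.

```python
from itertools import product

ALL_3MERS = ["".join(p) for p in product("ACGT", repeat=3)]


def gc_fraction(kmer):
    """Stand-in physicochemical property: fraction of G/C in the 3-mer."""
    return sum(base in "GC" for base in kmer) / 3


def pseknc_sketch(sequence, max_distance=2, weight=0.5, properties=(gc_fraction,)):
    """Return 64 normalized 3-mer terms followed by max_distance Theta terms."""
    words = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    if len(words) <= max_distance:
        raise ValueError("sequence too short for the requested distances")

    # 3-mer composition (sequence is assumed to contain only A, C, G, T).
    counts = dict.fromkeys(ALL_3MERS, 0)
    for word in words:
        counts[word] += 1
    frequencies = [counts[kmer] / len(words) for kmer in ALL_3MERS]

    # Theta_i: average property difference between 3-mers i positions apart.
    thetas = []
    for distance in range(1, max_distance + 1):
        pairs = list(zip(words, words[distance:]))
        total = sum(
            sum((prop(a) - prop(b)) ** 2 for prop in properties) / len(properties)
            for a, b in pairs
        )
        thetas.append(total / len(pairs))

    # Divide everything by the sum of the composition and weighted Theta terms.
    denominator = sum(frequencies) + weight * sum(thetas)
    return [f / denominator for f in frequencies] + [
        weight * t / denominator for t in thetas
    ]


vector = pseknc_sketch("GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG")
print(len(vector))  # 64 composition terms + 2 Theta terms = 66
```

Because every term shares the same denominator, the resulting vector sums to 1 in this sketch.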

Usage:

from my_pseknc import Pseknc
import os

input_fasta = "input_file.fasta" # path + file name
folder_for_output = os.path.join(os.getcwd(), "output") # path to store the output
output_file = os.path.join(folder_for_output, "pseknc.csv")

encoder = Pseknc()
encoder.encode_fasta_into_pseknc(input_fasta, output_file)

k-mers

It counts the occurrences in the DNA sequence of every k-mer, that is, every possible “word” of length k, for each k in a given interval.

Example 1:

  • For k in [2, 2] there are 4^k = 4^2 = 16 possible words: {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}
  • Sequence: GGGA
  • Decomposing into k-mers: GG, GG, GA
  • Algorithm's output: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]

Example 2:

  • For k in [1, 2] there are 4^1 + 4^2 = 4 + 16 possible words: {A, C, G, T, AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}
  • Sequence: GGGA
  • Decomposing into k-mers: G, G, G, A, GG, GG, GA
  • Algorithm's output: [1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
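
Both examples can be reproduced with a short sketch. It assumes overlapping k-mers and a vocabulary ordered first by k and then alphabetically, which matches the outputs above. The function name kmer_counts is illustrative and not part of this package.

```python
from itertools import product


def kmer_counts(sequence, list_of_ks):
    """Count overlapping k-mers for every k in list_of_ks.

    The result follows the vocabulary order: all words of the first k in
    alphabetical order (A, C, G, T), then all words of the next k, and so on.
    """
    vocabulary = [
        "".join(letters) for k in list_of_ks for letters in product("ACGT", repeat=k)
    ]
    counts = dict.fromkeys(vocabulary, 0)
    for k in list_of_ks:
        for start in range(len(sequence) - k + 1):
            word = sequence[start:start + k]
            if word in counts:  # ignore words with non-ACGT characters
                counts[word] += 1
    return [counts[word] for word in vocabulary]


print(kmer_counts("GGGA", [2]))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
print(kmer_counts("GGGA", [1, 2]))
# [1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
```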

Usage:

from kmers import Kmers
import os

# defining list of ks
k_start = 1
k_end = 5
k_values = list(range(k_start, k_end + 1))

input_fasta = "input_file.fasta"  # path + file name
folder_for_output = os.path.join(os.getcwd(), "output")  # path to store the output
output_file = os.path.join(folder_for_output, f"{k_start}_to_{k_end}_mers.csv")  # no leading "/", which would make os.path.join discard folder_for_output

encoder = Kmers()
encoder.encode_fasta_file(fastafile=input_fasta, list_of_ks=k_values, outputfile=output_file)

References

[1] W. Chen, T. Y. Lei, D. C. Jin, H. Lin, and K. C. Chou, “PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition,” Anal. Biochem., vol. 456, no. 1, pp. 53–60, 2014, doi: 10.1016/j.ab.2014.04.001.

[2] W. Chen, X. Zhang, J. Brooker, H. Lin, L. Zhang, and K.-C. Chou, “PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions,” Bioinformatics, vol. 31, no. 1, pp. 119–120, Jan. 2015, doi: 10.1093/bioinformatics/btu602.
