<div style="font-size: 26px;
            font-weight: bold;
            text-transform: uppercase;
            color: green">
Biotext: a Python library to work with natural language like biological sequences
</div>
<div style="font-size: 16px;
            font-weight: normal;
            font-style: italic;">
Diogo de J. S. Machado
</div>

---

Biotext is a package that provides resources to encode texts written in natural language into a format based on FASTA files, typically used for representing biological sequences. In addition, it offers other tools that support text mining operations using encoded strings.

---
# Quick start

## Encoding strings in a FASTA-like format

Biotext has two methods to encode strings: **AMINOcode** and **DNAbits**.

---
### AMINOcode

#### Encode

AMINOcode applies a character substitution through a dictionary that reduces the text to the 20 letters used to represent amino acids in the FASTA format. To cover a larger number of characters, vowels and vowel-sounding letters (Y and W) are represented by combinations of 2 letters, while consonants are represented by single letters. In the base version, all digits are generalized to one fixed pair of characters, as are all punctuation marks. Optional detail settings represent digits and/or punctuation marks with 3 characters each, which allows them to be told apart.
<br><br>
The encoding is called through the "aminocode.encode_string" function, and the level of detail is set by the optional "detail" parameter: use "d" to detail digits, "p" to detail punctuation, and "dp" or "pd" for both.

In [1]:
# import biotext lib
import biotext as bt

string = "Hello world! 1, 2, 3..."

# encoding with base AMINOcode using no detail
encoded_string_ac = bt.aminocode.encode_string(string)
print("String encoded with AMINOcode using no detail:\n%s\n" % encoded_string_ac)

# encoding with AMINOcode using digits detail
encoded_string_ac_d = bt.aminocode.encode_string(string, detail="d")
print("String encoded with AMINOcode using digits detail:\n%s\n" % encoded_string_ac_d)

# encoding with AMINOcode using points (punctuation) detail
encoded_string_ac_p = bt.aminocode.encode_string(string, detail="p")
print("String encoded with AMINOcode using points detail:\n%s\n" % encoded_string_ac_p)

# encoding with AMINOcode using digits and points detail
encoded_string_ac_db = bt.aminocode.encode_string(string, detail="dp")
print("String encoded with AMINOcode using digits and points detail:\n%s\n" % encoded_string_ac_db)

String encoded with AMINOcode using no detail:
HYELLYQYSYWYQRLDYPYSYDYPYSYDYPYSYDYPYPYP

String encoded with AMINOcode using digits detail:
HYELLYQYSYWYQRLDYPYSYDQYPYSYDTYPYSYDHYPYPYP

String encoded with AMINOcode using points detail:
HYELLYQYSYWYQRLDYPWYSYDYPCYSYDYPCYSYDYPEYPEYPE

String encoded with AMINOcode using digits and points detail:
HYELLYQYSYWYQRLDYPWYSYDQYPCYSYDTYPCYSYDHYPEYPEYPE
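
The substitution idea behind the base version can be sketched in a few lines of plain Python. This is only an illustration and not Biotext's implementation: `TOY_TABLE`, `TOY_CONSONANTS` and `toy_encode` are hypothetical names, and the table holds only the pairs that can be read off the example outputs above (YE, YQ and YW for "e", "o" and "w"; YS for the space; YD and YP for generalized digits and punctuation). Characters outside that small set are rejected rather than guessed.

```python
import string

# Hypothetical partial table, read off the example outputs above.
# It is NOT the dictionary shipped with Biotext.
TOY_TABLE = {"e": "YE", "o": "YQ", "w": "YW", " ": "YS"}
# Consonants seen in the example; they stay as single letters.
TOY_CONSONANTS = set("dhlr")


def toy_encode(text):
    """Encode text with the toy table, generalizing digits and punctuation."""
    encoded = []
    for char in text.lower():
        if char in TOY_TABLE:
            encoded.append(TOY_TABLE[char])
        elif char in TOY_CONSONANTS:
            encoded.append(char.upper())
        elif char.isdigit():
            encoded.append("YD")  # every digit collapses to the same pair
        elif char in string.punctuation:
            encoded.append("YP")  # every punctuation mark collapses to the same pair
        else:
            raise ValueError("character not covered by the toy table: %r" % char)
    return "".join(encoded)


print(toy_encode("Hello world! 1, 2, 3..."))
```

Because every digit becomes YD and every punctuation mark becomes YP, the sketch cannot tell "1" from "2" or "!" from ","; the 3-character detail codes exist to keep that distinction.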



#### Decode

Encoded strings can also be decoded; however, the details of generalized characters are lost. For decoding, use the same detail specification that was used for encoding.

In [2]:
# decoding with base AMINOcode using no detail
decoded_string_ac = bt.aminocode.decode_string(encoded_string_ac)
print("String decoded with base AMINOcode using no detail:\n%s\n" % decoded_string_ac)

# decoding with AMINOcode using digits detail
decoded_string_ac_d = bt.aminocode.decode_string(encoded_string_ac_d,detail="d")
print("String decoded with AMINOcode using digits detail:\n%s\n" % decoded_string_ac_d)

# decoding with AMINOcode using points detail
decoded_string_ac_p = bt.aminocode.decode_string(encoded_string_ac_p,detail="p")
print("String decoded with AMINOcode using points detail:\n%s\n" % decoded_string_ac_p)

# decoding with AMINOcode using digits and points detail
decoded_string_ac_dp = bt.aminocode.decode_string(encoded_string_ac_db,detail="dp")
print("String decoded with AMINOcode using digits and points detail:\n%s\n" % decoded_string_ac_dp)

String decoded with base AMINOcode using no detail:
hello world. 9. 9. 9...

String decoded with AMINOcode using digits detail:
hello world. 1. 2. 3...

String decoded with AMINOcode using points detail:
hello world! 9, 9, 9...

String decoded with AMINOcode using digits and points detail:
hello world! 1, 2, 3...
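
Why the generalized details are lost can be seen in a small decoding sketch. As with any toy model, this is an illustration under assumptions and not the library's code: `TOY_REVERSE` and `toy_decode` are hypothetical names, the table covers only the codes that appear in the example, and the placeholders "9" and "." for generalized digits and punctuation are taken from the decoded outputs above.

```python
# Hypothetical reverse table for the base (no detail) codes seen in the example.
# "YD" and "YP" each stand for a whole class of characters, so decoding can
# only return one placeholder per class.
TOY_REVERSE = {"YE": "e", "YQ": "o", "YW": "w", "YS": " ", "YD": "9", "YP": "."}


def toy_decode(encoded):
    """Decode a string made of single consonant letters and Y-prefixed pairs."""
    decoded = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "Y":
            pair = encoded[i:i + 2]
            if pair not in TOY_REVERSE:
                raise ValueError("code not covered by the toy table: %r" % pair)
            decoded.append(TOY_REVERSE[pair])
            i += 2
        else:
            # single letters are consonants and map back to themselves
            decoded.append(encoded[i].lower())
            i += 1
    return "".join(decoded)


# Input written by hand using the generalized codes YD and YP.
print(toy_decode("HYELLYQYSYWYQRLDYPYSYDYPYSYDYPYSYDYPYPYP"))
```

The original case is also not recoverable in this sketch, since upper- and lower-case letters share the same code.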



---
### DNAbits

#### Encode

The DNAbits method works in two steps: first, each character is replaced by its ASCII code in binary form; then the binary digits are replaced, two at a time, by the characters A, C, T or G according to a fixed rule.

In [3]:
# import biotext lib
import biotext as bt

string = "Hello world! 1, 2, 3..."

# encoding with DNAbits
encoded_string_db = bt.dnabits.encode_string(string)
print("String encoded with base DNAbits:\n%s\n" % encoded_string_db)

String encoded with base DNAbits:
AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGAAAGACATAATGAAAGAGATAATGAAAGATATAGTGAGTGAGTGA
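
The two steps can also be sketched without the library. The code below is not taken from Biotext's source: `PAIR_TO_BASE` and `toy_dnabits_encode` are hypothetical names, and the bit-pair rule (00 to A, 01 to C, 10 to G, 11 to T, reading each byte from its least significant pair upward) is an assumption inferred from the example output above, which it reproduces for this input.

```python
# Assumed bit-pair rule, inferred from the example output above;
# it is not confirmed against the library's source.
PAIR_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}


def toy_dnabits_encode(text):
    """Map each ASCII byte to four bases, least significant bit pair first."""
    bases = []
    for byte in text.encode("ascii"):  # step 1: characters to 8-bit codes
        for shift in (0, 2, 4, 6):     # step 2: walk the byte two bits at a time
            bases.append(PAIR_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


print(toy_dnabits_encode("Hello world! 1, 2, 3..."))
```

Each character always yields exactly four bases, so the encoded length is four times the input length.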



#### Decode

Decoding with DNAbits preserves all ASCII characters.

In [4]:
# import biotext lib
import biotext as bt

string = "Hello world! 1, 2, 3..."

# decoding with DNAbits
decoded_string_db = bt.dnabits.decode_string(encoded_string_db)
print("String decoded with base DNAbits:\n%s\n" % decoded_string_db)

String decoded with base DNAbits:
Hello world! 1, 2, 3...
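
The reason nothing is lost is that the mapping from bit pairs to bases is one-to-one, so it can be run backwards. A minimal sketch of that inverse follows, under the same assumed rule as in the encoding example (A, C, G, T for 00, 01, 10, 11, least significant pair first); `BASE_TO_PAIR` and `toy_dnabits_decode` are hypothetical names, not part of Biotext.

```python
# Assumed inverse of the bit-pair rule inferred from the encoding example.
BASE_TO_PAIR = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def toy_dnabits_decode(sequence):
    """Rebuild one ASCII character from every group of four bases."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    chars = []
    for start in range(0, len(sequence), 4):
        byte = 0
        for position, base in enumerate(sequence[start:start + 4]):
            # the first base holds the two least significant bits
            byte |= BASE_TO_PAIR[base] << (2 * position)
        chars.append(chr(byte))
    return "".join(chars)


encoded = (
    "AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGAAAGACATA"
    "ATGAAAGAGATAATGAAAGATATAGTGAGTGAGTGA"
)
print(toy_dnabits_decode(encoded))
```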

