<div style="font-size: 26px;
            font-weight: bold;
            text-transform: uppercase;
            color: green">
Biotext: a Python library to work with natural language like biological sequence
</div>
<div style="font-size: 16px;
            font-weight: normal;
            font-style: italic;">
Diogo de J. S. Machado
</div>

---

Biotext is a Python package that encodes natural-language text into a FASTA-like format, the format typically used to represent biological sequences. It also offers tools that support text mining operations on the encoded strings.

---
# Introduction

---
# Quick start

## Encode strings in a FASTA-like format

Biotext has two methods to encode strings: **AMINOcode** and **DNAbits**.

### AMINOcode

#### Encode

AMINOcode applies a dictionary-based character substitution that reduces the text to the 20 letters used to represent amino acids in the FASTA format. To cover a larger character set, vowels and vowel-sounding letters (Y and W) are represented by two-letter combinations, while consonants are represented by single letters. In the base version, all digits are generalized to one fixed pair of characters, as are the punctuation marks. Optional detail modes represent digits and/or punctuation marks with three characters each, which allows them to be differentiated.
<br><br>
The encoding is performed by the "aminocode.encode_string" function, and the level of detail is set by the optional "detail" parameter. Use "d" for detail in digits; "p" for detail in punctuation; "dp" or "pd" for both.
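
To make the substitution idea concrete, here is a minimal sketch that does not use biotext. The table is a partial one inferred from the "Hello world! 1, 2, 3..." example output in this section, not the complete AMINOcode dictionary. The names `TOY_TABLE`, `DIGIT_CODE`, `PUNCT_CODE`, and `toy_encode` are hypothetical, and lowercasing the input first is an assumption.

```python
# Partial table inferred from the "Hello world! 1, 2, 3..." example output;
# this is NOT the complete AMINOcode dictionary shipped with biotext.
TOY_TABLE = {
    "h": "H", "l": "L", "r": "R", "d": "D",  # consonants: one letter
    "e": "YE", "o": "YQ", "w": "YW",          # vowels and W: two letters
    " ": "YS",                                # space
}
DIGIT_CODE = "YD"  # base mode: every digit shares one pair
PUNCT_CODE = "YP"  # base mode: these punctuation marks share one pair
TOY_PUNCTUATION = ".,!"  # only the marks that appear in the example


def toy_encode(text):
    """Encode text with the toy table (base mode only)."""
    encoded = []
    for char in text.lower():  # assumption: input is lowercased first
        if char in TOY_TABLE:
            encoded.append(TOY_TABLE[char])
        elif char.isdigit():
            encoded.append(DIGIT_CODE)
        elif char in TOY_PUNCTUATION:
            encoded.append(PUNCT_CODE)
        else:
            raise KeyError(f"character {char!r} is not in the toy table")
    return "".join(encoded)


print(toy_encode("Hello world! 1, 2, 3..."))
# HYELLYQYSYWYQRLDYPYSYDYPYSYDYPYSYDYPYPYP
```

Because every digit maps to the same pair and every listed punctuation mark to another, the base encoding cannot distinguish "1" from "2" or "!" from ","; the detail modes exist to address this.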

In [1]:
# import biotext lib
import biotext as bt

string = "Hello world! 1, 2, 3..."

# encoding with base AMINOcode
encoded_string = bt.aminocode.encode_string(string)
print("String encoded with base AMINOcode:\n%s\n" % encoded_string)

# encoding with AMINOcode using digit detail
encoded_string_d = bt.aminocode.encode_string(string,detail="d")
print("String encoded with AMINOcode using digits detail:\n%s\n" % encoded_string_d)

# encoding with AMINOcode using punctuation detail
encoded_string_p = bt.aminocode.encode_string(string,detail="p")
print("String encoded with AMINOcode using points detail:\n%s\n" % encoded_string_p)

# encoding with AMINOcode using digit and punctuation detail
encoded_string_db = bt.aminocode.encode_string(string,detail="dp")
print("String encoded with AMINOcode using digits and points detail:\n%s\n" % encoded_string_db)

String encoded with base AMINOcode:
HYELLYQYSYWYQRLDYPYSYDYPYSYDYPYSYDYPYPYP

String encoded with AMINOcode using digits detail:
HYELLYQYSYWYQRLDYPYSYDQYPYSYDTYPYSYDHYPYPYP

String encoded with AMINOcode using points detail:
HYELLYQYSYWYQRLDYPWYSYDYPCYSYDYPCYSYDYPEYPEYPE

String encoded with AMINOcode using digits and points detail:
HYELLYQYSYWYQRLDYPWYSYDQYPCYSYDTYPCYSYDHYPEYPEYPE



#### Decode

It is also possible to decode the strings; however, the details of any generalized characters are lost. When decoding, use the same detail specification that was used for encoding. In the example below, letter case is not restored either.
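
The loss of detail can be illustrated with a minimal sketch that does not use biotext. The reverse table is a partial one inferred from the example outputs in this section, not the complete AMINOcode dictionary, and `TOY_REVERSE` and `toy_decode` are hypothetical names. It handles the base mode only.

```python
# Partial reverse table inferred from the example outputs in this section;
# this is NOT the complete AMINOcode dictionary shipped with biotext.
TOY_REVERSE = {
    "H": "h", "L": "l", "R": "r", "D": "d",
    "YE": "e", "YQ": "o", "YW": "w", "YS": " ",
    "YD": "9",  # every generalized digit comes back as 9
    "YP": ".",  # every generalized punctuation mark comes back as a period
}


def toy_decode(encoded):
    """Decode a base-mode string produced with the toy table."""
    decoded = []
    i = 0
    while i < len(encoded):
        # In this toy table, codes that start with Y are two letters long.
        width = 2 if encoded[i] == "Y" else 1
        decoded.append(TOY_REVERSE[encoded[i:i + width]])
        i += width
    return "".join(decoded)


print(toy_decode("HYELLYQYSYWYQRLDYPYSYDYPYSYDYPYSYDYPYPYP"))
# hello world. 9. 9. 9...
```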

In [2]:
# decoding with base AMINOcode
decoded_string = bt.aminocode.decode_string(encoded_string)
print("String decoded with base AMINOcode:\n%s\n" % decoded_string)

# decoding with AMINOcode using digits detail
decoded_string_d = bt.aminocode.decode_string(encoded_string_d,detail="d")
print("String decoded with AMINOcode using digits detail:\n%s\n" % decoded_string_d)

# decoding with AMINOcode using points detail
decoded_string_p = bt.aminocode.decode_string(encoded_string_p,detail="p")
print("String decoded with AMINOcode using points detail:\n%s\n" % decoded_string_p)

# decoding with AMINOcode using digits and points detail
decoded_string_db = bt.aminocode.decode_string(encoded_string_db,detail="dp")
print("String decoded with AMINOcode using digits and points detail:\n%s\n" % decoded_string_db)

String decoded with base AMINOcode:
hello world. 9. 9. 9...

String decoded with AMINOcode using digits detail:
hello world. 1. 2. 3...

String decoded with AMINOcode using points detail:
hello world! 9, 9, 9...

String decoded with AMINOcode using digits and points detail:
hello world! 1, 2, 3...



---
# Pipeline examples

## Word Embedding

### Preparation of work session

#### Load and prepare dataset

In [3]:
# load an example dataset with nltk lib
import nltk

nltk.download('twitter_samples', quiet=True)

positive_tweets = nltk.corpus.twitter_samples.strings('positive_tweets.json')
negative_tweets = nltk.corpus.twitter_samples.strings('negative_tweets.json')

# concatenate in a pandas series
import pandas as pd
dataset = pd.concat([pd.Series(positive_tweets),pd.Series(negative_tweets)])

# show in notebook
dataset

0       #FollowFriday @France_Inte @PKuchly57 @Milipol...
1       @Lamb2ja Hey James! How odd :/ Please call our...
2       @DespiteOfficial we had a listen last night :)...
3                                    @97sides CONGRATS :)
4       yeaaaah yippppy!!!  my accnt verified rqst has...
                              ...                        
4995                 I wanna change my avi but uSanele :(
4996                           MY PUPPY BROKE HER FOOT :(
4997             where's all the jaebum baby pictures :((
4998    But but Mr Ahmad Maslan cooks too :( https://t...
4999    @eawoman As a Hull supporter I am expecting a ...
Length: 10000, dtype: object

### Encode the whole dataset

In [4]:
# encode with AMINOcode using digit and punctuation detail
dataset_encoded = bt.aminocode.encode_list(dataset,detail='dp')

# show in notebook
pd.Series(dataset_encoded)

0       YKFYQLLYQYWFRYIDYAYYYSYKFRYANCYEYKYINTYEYSYKPK...
1       YKLYAMEYDTIYAYSHYEYYYSIYAMYESYPWYSHYQYWYSYQDDY...
2       YKDYESPYITYEYQFFYICYIYALYSYWYEYSHYADYSYAYSLYIS...
3                      YKYDNYDESYIDYESYSCYQNGRYATSYSYPTYK
4       YYYEYAYAYAYAHYSYYYIPPPPYYYPWYPWYPWYSYSMYYYSYAC...
                              ...                        
9995    YIYSYWYANNYAYSCHYANGYEYSMYYYSYAVYIYSEYVTYSYVSY...
9996           MYYYSPYVPPYYYSERYQKYEYSHYERYSFYQYQTYSYPTYK
9997    YWHYERYEYKSYSYALLYSTHYEYSIYAYEEYVMYSEYAEYYYSPY...
9998    EYVTYSEYVTYSMRYSYAHMYADYSMYASLYANYSCYQYQKSYSTY...
9999    YKYEYAYWYQMYANYSYASYSYAYSHYVLLYSSYVPPYQRTYERYS...
Length: 10000, dtype: object

### Vectorize encoded text

Convert the encoded texts to a numeric format. The fastatools.fasta_to_mat function uses the SWeeP method, a vectorization approach for representing biological sequences, described at <https://doi.org/10.1038/s41598-019-55627-4>.
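
As a rough intuition for how a sequence of letters can become a fixed-length numeric vector, the sketch below counts overlapping k-mers and sums one pseudo-random vector per k-mer. This is a simplified stand-in, assuming plain k-mer counts and a seeded Gaussian projection; it is not the SWeeP algorithm as published and not how `fasta_to_mat` is implemented. The names `kmer_counts` and `toy_vectorize` and the parameters `k`, `dim`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def kmer_counts(sequence, k=2):
    """Count the overlapping substrings of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def toy_vectorize(sequence, dim=8, k=2, seed=0):
    """Map a sequence to a dim-long vector (illustration only, not SWeeP)."""
    vector = [0.0] * dim
    for kmer, count in kmer_counts(sequence, k).items():
        # Each k-mer gets its own reproducible pseudo-random direction.
        rng = random.Random(f"{seed}:{kmer}")
        for j in range(dim):
            vector[j] += count * rng.gauss(0.0, 1.0)
    return vector


vector = toy_vectorize("HYELLYQYSYWYQRLD")
print(len(vector))  # 8
```

Sequences of any length end up with the same number of columns, which is what allows the encoded texts to be stacked into a single matrix.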

In [5]:
dataset_sweep = bt.fastatools.fasta_to_mat(dataset_encoded)

# show in notebook
pd.DataFrame(dataset_sweep)

index,0,1,2,3,4,5,6,7,8,9,...,590,591,592,593,594,595,596,597,598,599
0,-0.002328,0.023102,-0.014384,-0.036543,-0.018060,0.023961,0.029076,-0.017473,-0.039471,0.040876,...,-0.019379,0.026551,-0.052771,0.010473,-0.018344,-0.023809,0.051650,-0.031817,0.011466,-0.010741
1,-0.015432,-0.041320,0.001287,-0.068287,0.008015,0.030535,0.006891,-0.012919,0.034569,-0.049851,...,0.008598,0.036700,-0.001661,0.004249,-0.024264,0.018404,0.015891,-0.003265,-0.007884,-0.036205
2,-0.007596,0.008367,-0.012659,0.010581,-0.037480,0.007217,0.029441,0.012859,0.023102,-0.008360,...,0.011164,-0.014868,0.019819,0.003339,-0.021636,0.053620,0.004361,0.000278,-0.039269,0.058259
3,-0.023509,0.034010,0.008730,-0.007518,0.003712,-0.002125,-0.023288,-0.007280,0.003361,0.007839,...,-0.003928,-0.006537,-0.019817,-0.004727,-0.021227,0.001414,0.000095,-0.001293,0.007453,-0.004028
4,-0.047918,0.005773,-0.016394,-0.017167,-0.052931,-0.048789,-0.030509,0.021524,-0.055104,-0.018012,...,0.045387,0.032044,-0.037745,0.030751,0.027297,0.015442,0.005044,-0.034193,0.005183,-0.003996
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.007721,-0.004124,-0.015232,-0.015072,-0.017652,-0.005108,0.011864,-0.001966,0.007916,0.015417,...,-0.001735,0.002821,0.009910,-0.019164,-0.027337,0.012220,0.014049,-0.009250,-0.030730,0.006045
9996,0.016431,0.035704,0.023352,0.010773,0.008974,0.015355,-0.007993,0.018179,0.012453,0.003939,...,-0.009947,-0.010105,0.024959,0.008928,-0.008494,0.022613,0.026873,-0.032197,-0.013963,-0.014925
9997,-0.001092,0.000099,-0.011842,0.023119,0.001569,-0.021873,-0.022073,0.012582,0.027584,0.024823,...,-0.004693,-0.023730,0.013644,-0.010343,-0.021744,0.008313,0.029703,-0.011363,-0.021206,-0.008892
9998,-0.019776,-0.011744,-0.009749,-0.002626,0.038103,-0.001726,-0.017776,-0.007080,-0.013517,-0.007983,...,0.031971,0.045779,-0.029432,-0.027250,0.016877,0.005745,-0.004432,-0.011870,-0.006805,-0.021913


### Get a single vector for each word

One way to represent each word by a single vector is to select, for each word, the texts in which it occurs, take the corresponding text vectors, and average them.
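
The averaging step can be sketched on a toy example using only the standard library. The tokens and the two-dimensional text vectors below are made-up values for illustration, and `word_vector` is a hypothetical helper, not part of biotext.

```python
from statistics import fmean

# Made-up tokenized texts and made-up 2-D vectors, one vector per text.
texts = [
    ["happy", "birthday"],
    ["happy", "friday"],
    ["sad", "friday"],
]
text_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def word_vector(word, texts, text_vectors):
    """Average the vectors of all texts that contain the word."""
    rows = [vec for tokens, vec in zip(texts, text_vectors) if word in tokens]
    if not rows:
        raise ValueError(f"{word!r} does not occur in any text")
    # zip(*rows) iterates over columns, so each mean is taken per dimension.
    return [fmean(column) for column in zip(*rows)]


print(word_vector("happy", texts, text_vectors))   # [0.5, 0.5]
print(word_vector("friday", texts, text_vectors))  # [0.5, 1.0]
```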

In [6]:
# Split texts into word tokens
import biotext as bt
dataset_splited=dataset.apply(bt.nltools.word_tokenize)
dataset_splited

0       [followfriday, france_inte, pkuchly57, milipol...
1       [lamb2ja, hey, james, how, odd, please, call, ...
2       [despiteofficial, we, had, a, listen, last, ni...
3                                     [97sides, congrats]
4       [yeaaaah, yippppy, my, accnt, verified, rqst, ...
                              ...                        
4995            [i, wanna, change, my, avi, but, usanele]
4996                        [my, puppy, broke, her, foot]
4997          [where's, all, the, jaebum, baby, pictures]
4998    [but, but, mr, ahmad, maslan, cooks, too, http...
4999    [eawoman, as, a, hull, supporter, i, am, expec...
Length: 10000, dtype: object

In [7]:
# Concatenate all unique words in a pd.Series
all_words = pd.Series(list(set(dataset_splited.explode())))
all_words

0          erinmonzon
1             warlock
2        iyah_mohamad
3                ribs
4                rich
             ...     
21402         flippin
21403          juggle
21404        himseek8
21405     ghilalcelik
21406         sinhala
Length: 21407, dtype: object

In [8]:
# Count the number of texts in which each word occurs
import numpy as np
all_words_count = all_words.apply(lambda word: np.size(dataset_sweep[dataset_splited.apply(lambda x: word in x)],axis=0))

In [9]:
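# Filter words

# select words found in more texts than 0.1% of the vocabulary size
# (len(all_words_count) is the number of unique words, so the threshold is about 21.4 here)
more_than=all_words_count > (len(all_words_count)*0.001)

# select non-stopwords (needs the NLTK 'stopwords' corpus)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
english_stopwords=set(stopwords.words('english'))
no_stop=all_words.apply(lambda y:y not in english_stopwords)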

# filter words with selected criteria
sel_idx=more_than&no_stop
sel_words=all_words[sel_idx].reset_index(drop=True)

# show in notebook
sel_words

0       people
1         feel
2         made
3          ugh
4          try
        ...   
415         dm
416      wanna
417    mention
418       sick
419       trip
Length: 420, dtype: object

In [10]:
# get the mean vector of the texts that contain each selected word
words_vect = sel_words.apply(lambda word: np.mean(dataset_sweep[dataset_splited.apply(lambda x: word in x)],axis=0))

# stack the per-word vectors into a 2-D matrix (one row per word)
words_vect = np.concatenate([[i.T] for i in words_vect])

# show in notebook
pd.DataFrame(words_vect)

index,0,1,2,3,4,5,6,7,8,9,...,590,591,592,593,594,595,596,597,598,599
0,-0.027617,-0.023483,0.007975,-0.011718,-0.007753,0.024794,-0.010555,0.015880,0.003329,-0.015585,...,0.025162,0.001042,-0.001714,0.000274,-0.009457,-0.021509,0.010038,0.001340,0.008563,-0.025802
1,-0.010406,-0.001565,-0.002841,-0.014637,-0.010447,0.014594,0.004256,0.009785,0.014966,-0.002221,...,0.007063,-0.008410,-0.003814,-0.006069,-0.016237,-0.009062,-0.002260,-0.000340,-0.003404,-0.009667
2,-0.013074,-0.007395,-0.011770,-0.013526,-0.003187,0.016551,0.000437,0.013823,0.004412,-0.005903,...,0.005018,-0.006633,-0.000040,-0.023097,-0.011210,-0.007113,0.000848,0.022296,0.009534,-0.022621
3,-0.008354,-0.001957,-0.010077,-0.003069,-0.005184,0.019249,-0.012942,0.015220,0.003939,0.009509,...,-0.000191,-0.011919,-0.007323,0.001629,0.001055,-0.009597,0.014188,0.004262,0.001097,-0.007742
4,-0.006847,-0.014580,0.000550,-0.004993,0.002618,0.016761,0.002461,0.012116,0.006022,-0.016684,...,0.006172,-0.003472,0.000952,-0.006941,-0.013639,-0.007347,0.010891,-0.007996,0.010816,-0.016213
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
415,0.002148,-0.005928,-0.011965,-0.016136,-0.001994,0.031082,0.000101,0.008423,0.009281,-0.014132,...,0.014485,0.002956,-0.018088,-0.002144,-0.009764,-0.005839,0.007664,-0.001438,0.011634,-0.008742
416,0.001630,-0.005954,-0.008945,-0.019034,-0.006545,0.008849,0.000131,0.011829,0.013539,-0.007387,...,-0.000679,0.006979,-0.007217,-0.005517,-0.010866,-0.005958,0.007698,-0.007198,0.003236,-0.016729
417,-0.017965,-0.002529,-0.000866,-0.007931,-0.005262,0.034953,-0.007836,0.019373,-0.004654,-0.011026,...,0.003599,0.011708,-0.003718,-0.011226,-0.002982,-0.000907,0.024297,-0.022260,0.016344,-0.003315
418,-0.008884,0.009202,-0.013656,-0.018825,-0.022114,0.009963,-0.006074,0.016599,0.005721,0.003410,...,-0.000241,-0.003330,-0.010067,-0.006945,-0.010327,-0.005977,0.005096,-0.002827,-0.008434,-0.012847


### Check results

The purpose of word embedding is to generate vectors such that words that are semantically close to each other are also close in vector space. One way to check this empirically is to take the vector of a word, find the nearest vectors according to a distance metric, and return the corresponding words.
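
The nearest-neighbour check can be sketched with the standard library alone. The word vectors below are made-up values for illustration, and `closest_words` is a hypothetical helper; the Euclidean metric is used here through `math.dist` (Python 3.8 or later).

```python
import math

# Made-up 2-D word vectors, for illustration only.
word_vectors = {
    "happy": [0.0, 0.0],
    "birthday": [0.1, 0.0],
    "sad": [1.0, 1.0],
    "bad": [0.9, 1.0],
}


def closest_words(word, vectors, n):
    """Return the n words nearest to `word`, with their Euclidean distances."""
    target = vectors[word]
    distances = {w: math.dist(target, vec) for w, vec in vectors.items()}
    ranked = sorted(distances, key=distances.get)
    return [(w, distances[w]) for w in ranked[:n]]


for neighbour, distance in closest_words("happy", word_vectors, 2):
    print(neighbour, round(distance, 3))
```

The query word itself always comes first with distance 0, which is why the results in this section start with the word that was searched for.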

In [11]:
# Declare function to support result check
from scipy.spatial.distance import pdist,squareform
import numpy as np
from scipy.sparse import csr_matrix

# returns the pairwise distances between all rows of the input matrix
def mat_to_dist(mat,metric='euclidean'):
    dist=csr_matrix(squareform(pdist(mat,metric=metric)))
    return dist

# returns the indices and distances of the n vectors closest to a row, based on a distance matrix (dist)
def get_closest(dist,row,n):
    s = dist[:,row]
    idx = s.todense().argsort(axis=0)
    idx_s=idx[0:n]
    idx_s=list(np.array(idx_s.T)[0])
    dist_s=np.array(np.concatenate(s.todense()[idx_s])).T[0]
    return idx_s,dist_s

In [12]:
dist = mat_to_dist(words_vect)

In [13]:
idx_s,dist_s=get_closest(dist,np.where(sel_words=="happy")[0][0],21)
pd.DataFrame(dist_s,sel_words[idx_s])

word,distance
happy,0.0
birthday,0.153875
friday,0.165669
x,0.205993
day,0.206166
today,0.207498
xx,0.208979
may,0.210282
oh,0.211543
im,0.213425


In [14]:
idx_s,dist_s=get_closest(dist,np.where(sel_words=="sad")[0][0],21)
pd.DataFrame(dist_s,sel_words[idx_s])

word,distance
sad,0.0
im,0.132266
x,0.14487
oh,0.149166
i'm,0.151018
bad,0.153942
lol,0.158612
p,0.159307
got,0.160059
man,0.163793
