<img width=150 src=https://raw.githubusercontent.com/autonomio/signs/master/logo.png><center><font size=3>Signs is a set of tools for text preparation, vectorization, and processing. The examples below cover many commonly used workflows.</font></center>

In [None]:
import signs

First, we will read some data.

In [None]:
import dedomena as da
docs = da.apis.pubmed('cervical cancer', 500, True)

### Transformation | `signs.Transform()`

Next, let's clean up the data. `signs.Transform()` creates all of the important data formats from a single class object. If `clean=True`, the following preprocessing tasks are performed:

- force lower case
- remove urls
- remove emojis
- remove punctuation
- remove linebreaks
- remove leading and trailing whitespace
- comprehensive stopwords filtering
- decode from binary
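
To make the steps listed above concrete, here is a minimal sketch of what such a cleaning pass could look like in plain Python. It is not the code `signs.Transform()` runs internally; the function name `clean_text`, the tiny `STOPWORDS` set, and the regular expressions are all assumptions made for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real list is much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

# Assumed patterns: simple URL matcher and a few common emoji ranges only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Apply the cleaning steps from the list above to one document."""
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="ignore")  # decode from binary
    text = text.lower()                               # force lower case
    text = URL_PATTERN.sub(" ", text)                 # remove urls
    text = EMOJI_PATTERN.sub(" ", text)               # remove emojis
    text = text.translate(PUNCT_TABLE)                # remove punctuation
    words = text.split()                              # drops linebreaks and stray whitespace
    return " ".join(w for w in words if w not in STOPWORDS)


print(clean_text("  The HPV vaccine:\nsee https://example.org NOW!  "))
```

URLs are removed before punctuation so that the URL pattern still sees the `://` it needs to match.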

In [None]:
docs = signs.Transform(docs)

In [None]:
# return the original docs
docs.docs()

# return the original docs cleaned
docs.docs(True)

# return original docs but flattened
docs.docs_flat()

# return original docs flattened and clean
docs.docs_flat(True)

# return original docs in a single string blob
docs.docs_string()

# return original docs in single string blob cleaned
docs.docs_string(True)

# return tokenized version of docs
docs.tokens()

# return tokenized version cleaned
docs.tokens(True)

# return tokenized and flattened
docs.tokens_flat()

# return tokenized and flattened clean
docs.tokens_flat(True)

All of the following examples use one of these data formats by calling the `docs` class object created above. It is best to always ingest the original docs into `signs.Transform()` first, which minimizes overhead and ensures format compliance.
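
As a rough mental model of how these formats relate to each other, the sketch below builds them by hand from two toy documents. The exact shapes returned by `signs.Transform()` are not spelled out here, so treat the structures below (one token list per document, one flat token list, one string blob) as an assumed reading of the method names rather than the library's guaranteed output.

```python
from itertools import chain

# Two toy documents standing in for the PubMed abstracts.
raw_docs = ["Cervical cancer screening", "HPV vaccination programs"]

# Tokenized: one list of tokens per document (whitespace split for simplicity).
tokens = [doc.split() for doc in raw_docs]

# Tokenized and flattened: a single list of tokens across all documents.
tokens_flat = list(chain.from_iterable(tokens))

# Single string blob: all documents joined into one string.
docs_string = " ".join(raw_docs)

print(tokens)
print(tokens_flat)
print(docs_string)
```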

### Stopwords | `signs.Stopwords()`

**Signs** supports stopword removal against an arbitrarily sized list of stopwords at roughly 10,000 documents per second.

In [None]:
filtered_tokens = signs.Stopwords(docs.tokens())
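
For intuition on why filtering can stay fast even with a very long stopword list, here is a small sketch in plain Python. It does not reproduce how `signs.Stopwords()` is implemented; the helper name `remove_stopwords` and the sample data are hypothetical. The key idea is converting the stopword list to a `set`, so that each membership check takes roughly constant time regardless of list size.

```python
def remove_stopwords(token_docs, stopwords):
    """Drop stopwords from each tokenized document.

    token_docs: list of token lists, one per document.
    stopwords:  any iterable of words to remove.
    """
    stopword_set = set(stopwords)  # set lookup is O(1) on average
    return [[tok for tok in doc if tok not in stopword_set]
            for doc in token_docs]


sample_docs = [["the", "cell", "is", "abnormal"], ["a", "study", "of", "hpv"]]
sample_stopwords = ["the", "is", "a", "of"]
print(remove_stopwords(sample_docs, sample_stopwords))
```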

### Text Statistics | `signs.Describe()`
**Signs** provides common text analytics functionality under the `signs.Describe()` class.

In [None]:
# build the Describe object from the flattened tokens
desc = signs.Describe(docs.tokens_flat())

In [None]:
desc.get_counts()
desc.get_gram_counts(3, 1)
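
The kind of frequency statistics involved can be sketched with the standard library alone. This is an illustration of token and gram counting in general, not the implementation or the output format of `get_counts()` or `get_gram_counts()`; the sample tokens are made up.

```python
from collections import Counter

flat_tokens = ["cell", "growth", "cell", "cycle", "cell", "growth"]

# Token frequencies, most common first.
token_counts = Counter(flat_tokens)

# Bigram frequencies: pair each token with its successor.
bigram_counts = Counter(zip(flat_tokens, flat_tokens[1:]))

print(token_counts.most_common(2))
print(bigram_counts.most_common(1))
```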

### Grams | `signs.Grams()`
**Signs** provides access to ngrams and skipgrams through `signs.Grams()`. 

In [None]:
# bigrams
signs.Grams(docs.tokens_flat()).ngrams(2)

# trigrams
signs.Grams(docs.tokens_flat()).ngrams(3)

# trigram with 2-step skipgram
signs.Grams(docs.tokens_flat()).ngrams(3, 2)
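
To show what n-grams and skipgrams are, here is a minimal standalone sketch. The function `make_ngrams` is hypothetical, and it assumes a fixed-gap definition of skipping, where `skip` tokens are jumped over between consecutive members of each gram. `signs.Grams()` may define its skip argument differently, so compare outputs before relying on this reading.

```python
def make_ngrams(tokens, n, skip=0):
    """Return n-grams as tuples, skipping `skip` tokens between members."""
    step = skip + 1                 # distance between consecutive members
    span = (n - 1) * step + 1       # number of tokens one gram stretches over
    return [tuple(tokens[i + j * step] for j in range(n))
            for i in range(len(tokens) - span + 1)]


words = ["a", "b", "c", "d", "e"]
print(make_ngrams(words, 2))       # plain bigrams
print(make_ngrams(words, 2, 1))    # bigrams that skip one token
print(make_ngrams(words, 3, 1))    # trigrams that skip one token
```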

### Snippets of text | `signs.Verbatims()`
Another helpful feature is extracting verbatims based on a keyword and boundary through `signs.Verbatims()`.

In [None]:
signs.Verbatims(docs.tokens_flat()).verbatims('cell')
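
The idea of a keyword plus a boundary can be sketched as a simple window around each match. The helper `extract_verbatims` and its `window` parameter are assumptions for illustration; the actual arguments and boundary semantics of `signs.Verbatims()` may differ.

```python
def extract_verbatims(tokens, keyword, window=2):
    """Return text snippets of `window` tokens on each side of `keyword`."""
    snippets = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            start = max(0, i - window)          # do not run past the start
            snippets.append(" ".join(tokens[start:i + window + 1]))
    return snippets


sample = "squamous cell carcinoma arises from cell layers".split()
print(extract_verbatims(sample, "cell", window=1))
```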

### Document Vectors | `signs.TrainDoc2Vec()`

In [None]:
# train the Doc2Vec model on the first 450 documents
model, train_corpus = signs.TrainDoc2Vec(docs.tokens()[:450])

### Document Similarity | `signs.DocSimilarity()`

There are several document similarity options available. Examples for each are provided below.

- similarity matrix for seen documents
- similarity matrix for unseen documents
- similarity between a single unseen document and seen docs
- spatial distance between two documents, seen or unseen

In [None]:
sims = signs.DocSimilarity(model, docs)

There are several options for getting the similarities:

- `similar_docs()` for any document to all training documents
- `spatial_distance()` for any document to any document
- `seen_matrix()` for a 2d similarity matrix for all training documents
- `unseen_matrix()` for a 2d similarity matrix for any set of documents

Note that `unseen_matrix()` can take a long time as the matrix grows. Example usage:

In [None]:
sims.similar_docs(docs.tokens()[1])

sims.spatial_distance(doc1=docs.tokens()[451],
                      doc2=docs.tokens()[452])

sims.seen_matrix()

sims.unseen_matrix(docs.tokens())
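
For readers unfamiliar with similarity matrices, the sketch below computes pairwise cosine similarity over a few hand-written vectors. Cosine similarity is used here only as a common choice for comparing document vectors; this text does not state which metric `signs.DocSimilarity()` uses, and the function names and vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0   # treat zero vectors as dissimilar


def similarity_matrix(vectors):
    """2d matrix where entry [i][j] compares vector i with vector j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(doc_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

The matrix has one row and one column per document, so its size grows quadratically with the number of documents, which is why large matrices take noticeably longer to compute.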

Finally, `sims.preview_results()` previews the most and least similar documents for reference.