<img width=150 src=https://raw.githubusercontent.com/autonomio/signs/master/logo.png><center><font size=3>Signs is a set of tools for text preparation, vectorization and processing. The examples below cover many of the commonly used workflows. </font></center>

In [None]:
import signs

### Cleaning | `signs.Clean()`

While `signs.Transform()` has `auto=True` for automatic cleaning, sometimes it's useful to explicitly clean documents. Let's start with an automated example.

In [None]:
doc = ' Jack is a green  😂😂😂 cat... \n with a hat \n  '

In [None]:
# create the object
cleaned = signs.Clean(doc)

# access the text
cleaned.text

# you could of course also directly do
signs.Clean(doc).text

All the cleaning operations can be accessed individually, and not all cleaning operations are included in the automatic processing.

In [None]:
# create the object
cleaned = signs.Clean(doc, auto=False)

In [None]:
# make text all caps
cleaned.caps()

# make text all lower
cleaned.low()

# decode the text
cleaned.decod()

# remove emojis
cleaned.emoji()

# remove leading and trailing whitespace
cleaned.leadtrail()

# remove all whitespace
cleaned.whitespace()

# remove linebreaks
cleaned.linebreaks()

# remove links
cleaned.links()

# remove punctuation
cleaned.punct()

# remove arbitrary string
cleaned.string('is a green')
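
To make the individual steps more concrete, here is a minimal plain-Python sketch of a few comparable operations (emoji removal, punctuation removal, whitespace collapsing, lowercasing). It is not how `signs.Clean()` is implemented; the helper names `strip_emoji`, `strip_punctuation` and `squeeze_whitespace` are hypothetical, and the emoji pattern only covers a couple of common Unicode ranges.

```python
import re
import string

doc = ' Jack is a green  😂😂😂 cat... \n with a hat \n  '

# Assumption: a simplified emoji pattern covering two common Unicode ranges only.
EMOJI = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')


def strip_emoji(text):
    # remove every character that matches the simplified emoji ranges
    return EMOJI.sub('', text)


def strip_punctuation(text):
    # remove ASCII punctuation characters such as . , ! ?
    return text.translate(str.maketrans('', '', string.punctuation))


def squeeze_whitespace(text):
    # collapse runs of spaces and linebreaks, and trim both ends
    return ' '.join(text.split())


# chain the steps in a fixed order and lowercase the result
result = squeeze_whitespace(strip_punctuation(strip_emoji(doc))).lower()
print(result)
```

The order matters: collapsing whitespace last cleans up the gaps left behind by the removed emojis and punctuation.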

### Remove common words | `signs.Stopwords()`

Another common operation for data cleaning involves removing a list of words from the documents. 

For this purpose we have to transform the documents into a list-of-lists where each sublist consists of a tokenized document. This is easily done with `signs.Transform()`. Because `signs.Transform()` expects a set of documents as input, and we only have one, we have to wrap the document in a list.
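
The list-of-lists shape can be illustrated without Signs at all. The sketch below uses a plain lowercase-and-split tokenizer, which is an assumption made for illustration; the tokens produced by `signs.Transform()` may differ.

```python
docs = ['Jack is a green cat', 'with a hat']

# one sublist of tokens per document
tokens = [d.lower().split() for d in docs]
print(tokens)

# a single document still has to be wrapped in a list,
# so the result is a list containing one sublist
single = [d.lower().split() for d in ['Jack is a green cat']]
print(single)
```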

In [None]:
# transform doc/s to the right format
tokens = signs.Transform([doc]).tokens()

# filter the docs
filtered_tokens = signs.Stopwords(tokens)

# then access the filtered docs
filtered_tokens.docs

`signs.Stopwords()` allows several options for customization.

In [None]:
# set minimum length for words
signs.Stopwords(tokens, min_length=3)

# set maximum threshold for words (accept all words above this)
signs.Stopwords(tokens, max_threshold=8)

# add custom words
signs.Stopwords(tokens, add_stopwords=['jack'])

# just use custom words
signs.Stopwords(tokens, common_stopwords=False, add_stopwords=['jack']).docs
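
As a rough mental model of what these options do, the following plain-Python sketch filters tokenized documents by a word list and a minimum length. The function `filter_tokens` and the tiny `COMMON` set are hypothetical and are not part of Signs; in particular, the sketch assumes a minimum length means "keep words with at least this many characters".

```python
# Assumption: a tiny illustrative word list, not the one Signs uses.
COMMON = {'is', 'a', 'with', 'the', 'and'}


def filter_tokens(docs, stopwords=COMMON, min_length=1):
    # keep tokens that are not stopwords and are at least min_length long
    return [
        [t for t in doc if t not in stopwords and len(t) >= min_length]
        for doc in docs
    ]


tokens = [['jack', 'is', 'a', 'green', 'cat'], ['with', 'a', 'hat']]

# default word list
filtered = filter_tokens(tokens)

# only custom words
custom = filter_tokens(tokens, stopwords={'jack'})

# default word list plus a minimum length of 4 characters
longer = filter_tokens(tokens, min_length=4)

print(filtered, custom, longer)
```

Note that a document can end up as an empty sublist when every token is filtered out, as happens to the second document in the last call.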

Bear in mind that all operations in `signs.Stopwords()` are destructive.
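
If the original tokens are still needed afterwards, one simple safeguard is to keep an independent copy before filtering. The sketch below uses a hypothetical in-place filter, `drop_short_in_place`, as a stand-in for a destructive operation; whether `signs.Stopwords()` modifies its input in exactly this way is an assumption and not something this notebook confirms.

```python
import copy

tokens = [['jack', 'is', 'a', 'green', 'cat']]

# keep an independent copy of the nested lists before filtering
original = copy.deepcopy(tokens)


def drop_short_in_place(docs, min_length):
    # hypothetical destructive filter: overwrites each sublist in place
    for doc in docs:
        doc[:] = [t for t in doc if len(t) >= min_length]


drop_short_in_place(tokens, 3)

print(tokens)    # filtered in place
print(original)  # untouched copy
```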