# Preprocessing

This notebook provides an overview of common techniques for processing text data.

In [None]:
import nltk

## Tokenization

Tokenization is the process of breaking a stream of text into words, terms, sentences, symbols, or other meaningful elements called tokens.

---

We will explore differences among tokenizers using the sentence provided in the following cell.

In [17]:
inp = "I'm loving NLP—it's amazing!"

Simple **whitespace** tokenization splits text on whitespace, so punctuation stays attached to the neighboring word.

In [13]:
inp.split()

["I'm", 'loving', "NLP—it's", 'amazing!']
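As a small aside, the sketch below shows why `split()` without arguments is usually the safer choice for whitespace tokenization: it collapses runs of spaces and tabs, whereas `split(" ")` splits on every single space. The sample string `messy` is a made-up example, not part of the notebook's data.

```python
# Hypothetical input with a double space and a tab between words.
messy = "I'm  loving\tNLP"

# split() with no argument treats any run of whitespace as one separator.
by_any_whitespace = messy.split()

# split(" ") splits on each single space only, so it yields an empty
# string for the double space and leaves the tab inside a token.
by_single_space = messy.split(" ")

print(by_any_whitespace)  # ["I'm", 'loving', 'NLP']
print(by_single_space)    # ["I'm", '', 'loving\tNLP']
```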

**Symbol** tokenization treats each character, including spaces and punctuation, as a separate token:

In [14]:
print(list(inp))

['I', "'", 'm', ' ', 'l', 'o', 'v', 'i', 'n', 'g', ' ', 'N', 'L', 'P', '—', 'i', 't', "'", 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '!']


The **word** tokenizer separates text into words. Unlike the whitespace tokenizer, it assumes that tokens can be separated by more than just whitespace, so punctuation and contractions are split off as well. The following cell shows how it transforms the example sentence; note that the em dash in the example is not treated as a separator.

In [15]:
nltk.tokenize.word_tokenize(inp)

['I', "'m", 'loving', 'NLP—it', "'s", 'amazing', '!']
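To make the idea of "more than just whitespace" concrete, here is a minimal regex-based sketch. It is not how `nltk.tokenize.word_tokenize` works internally, and its output differs from the one above: it splits every punctuation mark into its own token, including the apostrophes and the dash. The function name `simple_word_tokenize` is a hypothetical helper introduced only for illustration.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Illustrative only: a rough approximation of word tokenization,
    not a reimplementation of NLTK's tokenizer.
    """
    # \w+      -> one or more letters, digits, or underscores
    # [^\w\s]  -> any single character that is neither a word character
    #             nor whitespace (i.e. punctuation and symbols)
    return re.findall(r"\w+|[^\w\s]", text)

inp = "I'm loving NLP—it's amazing!"
regex_tokens = simple_word_tokenize(inp)
print(regex_tokens)
# ['I', "'", 'm', 'loving', 'NLP', '—', 'it', "'", 's', 'amazing', '!']
```

Compared with the NLTK output, this sketch breaks `'m` and `'s` apart, which is one reason dedicated tokenizers with language-specific rules are usually preferred.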

## Stop words

Many words in a text carry little meaning on their own. They are commonly called stop words and are often removed before further processing.

---

The following cell prints the English stop word list provided by the `nltk` library.

In [25]:
print(nltk.corpus.stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
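A minimal sketch of stop word removal follows. To stay self-contained it uses a small hand-picked subset of the list printed above rather than the full `nltk` list; the names `STOP_WORDS` and `remove_stop_words` and the sample sentence are assumptions made for illustration.

```python
# Illustrative subset of the English stop word list shown above.
# Assumption: a real pipeline would use the full list from nltk instead.
STOP_WORDS = {"i", "this", "is", "an", "of", "the", "for", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

# Whitespace tokenization is enough for this punctuation-free sample.
tokens = "This is an overview of the common techniques for processing text".split()
filtered = remove_stop_words(tokens)
print(filtered)
# ['overview', 'common', 'techniques', 'processing', 'text']
```

Lowercasing before the lookup matters because the list contains only lowercase entries, so `This` would otherwise slip through.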

## Stemming