# Intro

Data comes in many different forms: *time stamps, sensor readings, images, categorical labels, and so much more*. But text is still some of the most valuable data out there for those who know how to use it.

In this course about Natural Language Processing (NLP), you will use spaCy, a leading NLP library, to take on some of the most important tasks in working with text.

By the end, you will be able to use spaCy for:

* Basic text processing and pattern matching
* Building machine learning models with text
* Representing text with word embeddings that numerically capture the meaning of words and documents


# NLP with spaCy

spaCy is a leading library for NLP, and it has quickly become one of the most popular Python frameworks for working with text. Most people find it intuitive, and it has excellent documentation.

spaCy relies on models that are language-specific and come in different sizes. You can load a spaCy model with `spacy.load`.

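For example, here's how you would load the small English model, `en_core_web_sm`.


In [3]:
import spacy
# Assumes the model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')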

With the model loaded, you can process text like this:


In [4]:
doc = nlp("Tea is healthy and calming, don't you think?")

There's a lot you can do with the `doc` object you just created.

# Tokenizing

Calling `nlp` on a string returns a document object that contains **tokens**. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.

In [5]:
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?
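
To make the idea of a token concrete, here is a minimal sketch of a toy tokenizer built on a regular expression. It is not how spaCy works: spaCy's tokenizer uses language-specific rules and exceptions and handles far more cases. The names `TOKEN_PATTERN` and `toy_tokenize` are made up for this illustration, and the pattern only aims to reproduce the split shown above for simple sentences with straight apostrophes.

```python
import re

# Toy pattern, NOT spaCy's algorithm. Alternatives are tried in order:
#   1. a word stem that is immediately followed by "n't" (e.g. "do" in "don't")
#   2. the contraction piece "n't" itself
#   3. any other run of word characters
#   4. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = toy_tokenize("Tea is healthy and calming, don't you think?")
for token in tokens:
    print(token)
```

For this sentence the toy tokenizer prints the same eleven tokens as the spaCy output above, including the split of "don't" into "do" and "n't".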


Iterating through a document gives you token objects. Each token carries additional information as attributes. In most cases, the important ones are `token.lemma_` and `token.is_stop`.

# Text preprocessing

A few types of preprocessing can improve how we model text. The first is **lemmatizing**. The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word "walking", you convert it to "walk".

It's also common to remove stopwords. **Stopwords** are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

With a spaCy token, `token.lemma_` returns the lemma, while `token.is_stop` returns a boolean `True` if the token is a stopword (and `False` otherwise).

In [6]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{token.text}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False
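
A common next step is to keep only the lemmas of tokens that are neither stopwords nor punctuation. The sketch below shows one way to do that. The helper name `content_lemmas` is an assumption made for this example, and so that it runs without a model installed, it uses stand-in objects whose values are copied from the table above. The `is_punct` flag is filled in by hand here; spaCy tokens expose an attribute with the same name, so passing a real `doc` should work the same way.

```python
from types import SimpleNamespace


def content_lemmas(tokens):
    """Return lemmas of tokens that are neither stopwords nor punctuation."""
    return [t.lemma_ for t in tokens if not t.is_stop and not t.is_punct]


# (text, lemma, is_stop) values copied from the output table above.
rows = [
    ("Tea", "tea", False), ("is", "be", True), ("healthy", "healthy", False),
    ("and", "and", True), ("calming", "calm", False), (",", ",", False),
    ("do", "do", True), ("n't", "not", True), ("you", "-PRON-", True),
    ("think", "think", False), ("?", "?", False),
]

# Stand-in objects that mimic the three token attributes used by the helper.
stand_in_doc = [
    SimpleNamespace(text=text, lemma_=lemma, is_stop=is_stop,
                    is_punct=text in {",", "?"})
    for text, lemma, is_stop in rows
]

print(content_lemmas(stand_in_doc))  # ['tea', 'healthy', 'calm', 'think']
```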


Why do lemmatizing and identifying stopwords matter? Language data has a lot of noise mixed in with informative content. In the sentence above, the important words are "tea", "healthy", and "calming". Removing stopwords might help a predictive model focus on the relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form ("calming", "calms", and "calmed" would all change to "calm").
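
The effect of combining word forms can be seen with a small counting example. This is a sketch under stated assumptions: `TOY_LEMMAS` is a hand-written lookup table invented for this illustration (spaCy derives lemmas from its language data instead), and the word list is made up.

```python
from collections import Counter

# Hypothetical lemma lookup for illustration only.
TOY_LEMMAS = {"calming": "calm", "calms": "calm", "calmed": "calm"}

# Made-up word list containing three forms of "calm".
words = ["calming", "tea", "calms", "me", "tea", "calmed", "her"]

# Counting raw words treats each form as a separate feature.
surface_counts = Counter(words)

# Counting lemmas merges the three forms into a single "calm" feature.
lemma_counts = Counter(TOY_LEMMAS.get(word, word) for word in words)

print(len(surface_counts), len(lemma_counts))        # 6 4
print(surface_counts["calm"], lemma_counts["calm"])  # 0 3
```

After lemmatizing, the vocabulary shrinks from six distinct entries to four, and the three occurrences spread across "calming", "calms", and "calmed" are counted together.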

However, lemmatizing and dropping stopwords can also make your models perform worse, so you should treat this preprocessing as part of your hyperparameter optimization process.
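
One way to treat preprocessing as a hyperparameter is to put each step behind a flag and generate every combination. The sketch below does this for the first five tokens of the example sentence. The function name `preprocess` and its flags are assumptions made for this illustration, and the model training and scoring step that would compare the variants is left out.

```python
from itertools import product

# (text, lemma, is_stop) values for "Tea is healthy and calming",
# copied from the example output earlier in this lesson.
ROWS = [
    ("Tea", "tea", False),
    ("is", "be", True),
    ("healthy", "healthy", False),
    ("and", "and", True),
    ("calming", "calm", False),
]


def preprocess(rows, lemmatize, drop_stopwords):
    """Apply the optional preprocessing steps selected by the two flags."""
    kept = [row for row in rows if not (drop_stopwords and row[2])]
    return [lemma if lemmatize else text for text, lemma, _ in kept]


# Build one token list per (lemmatize, drop_stopwords) setting.
variants = {
    (lemmatize, drop_stopwords): preprocess(ROWS, lemmatize, drop_stopwords)
    for lemmatize, drop_stopwords in product([False, True], repeat=2)
}

for config, tokens in variants.items():
    print(config, tokens)
```

Each of the four variants would then feed the same model, and the setting with the best validation score is the one to keep.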
