<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Text Pre-processing
Text is messy, and it usually needs substantial pre-processing before it is useful for modeling.  A text pre-processing pipeline generally includes at least the following steps:  
- Tokenize the text: split it into words and punctuation  
- Remove stop words and punctuation  
- Convert words to their root forms using lemmatization or stemming  

This notebook walks through a basic example of how to perform those steps using two common NLP libraries: [NLTK](https://www.nltk.org) and [spaCy](https://spacy.io).
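
Before turning to the libraries, the three steps can be sketched with the standard library alone. This is only a toy illustration: the stop-word set `TOY_STOP_WORDS`, the lookup table `TOY_LEMMAS`, and the function `toy_preprocess` are hypothetical names that cover just the example sentence, and the regular expression is a crude stand-in for a real tokenizer. Root forms are looked up before filtering so that a root such as "see" can be matched against the stop list.

```python
import re
import string

# Hypothetical, deliberately tiny resources; real libraries ship far larger ones.
TOY_STOP_WORDS = {"i", "see", "some", "the", "then", "they", "take", "off"}
TOY_LEMMAS = {"saw": "see", "took": "take", "flying": "fly"}
PUNCTUATION = set(string.punctuation)


def toy_preprocess(text):
    """Tokenize, map words to root forms, then drop stop words and punctuation."""
    # 1. Tokenize: runs of word characters, or single non-space symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # 2. Root words: lowercase, then use the lookup table where it has an entry.
    roots = [TOY_LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]
    # 3. Filter out stop words and single punctuation characters.
    return [r for r in roots if r not in TOY_STOP_WORDS and r not in PUNCTUATION]


print(toy_preprocess("I saw some geese near the pond. Then they took off flying."))
```

A lookup table this small obviously does not generalize, which is the main reason to rely on a library's tokenizer, lemmatizer, and stop-word list instead.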


In [1]:
import string

# Import spaCy. The small English model must be downloaded once before it can be loaded;
# uncomment the line below to download it from within the notebook.
import spacy
#!python -m spacy download en_core_web_sm

import warnings
warnings.filterwarnings('ignore')



In [2]:
example_doc = '''I saw some geese near the pond. Then they took off flying.'''

## spaCy
Let's now walk through our simple example using spaCy.  As with NLTK, we first tokenize the text.  spaCy's tokens differ from NLTK's, though: NLTK produces plain string tokens, while each spaCy token is an object that carries additional information about the word, such as its part of speech and its root form (lemma).  We therefore use the spaCy tokens to extract the lemmas first, and then remove stop words and punctuation from the resulting list of lemma strings.
### Tokenization

In [3]:
# Load the small English pipeline and process the example text
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_doc)
# Collect the spaCy Token objects into a list
tokens = list(doc)

print(tokens)

[I, saw, some, geese, near, the, pond, ., Then, they, took, off, flying, .]


### Lemmatization

In [4]:
# Extract the lemmas for each token
tokens = [token.lemma_.lower().strip() for token in tokens]
print(tokens)

['i', 'see', 'some', 'geese', 'near', 'the', 'pond', '.', 'then', 'they', 'take', 'off', 'fly', '.']
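
The introduction names stemming as the alternative to lemmatization. As a rough illustration of the difference, the sketch below applies a naive suffix-stripping function to three words from the example and prints the result next to the lemma taken from the spaCy output above. `naive_stem` and its suffix list are hypothetical and much cruder than a real stemmer such as Porter's; the point is only that rule-based stripping cannot recover irregular forms.

```python
# Hypothetical suffix list; real stemmers apply many more rules and conditions.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Lemmas copied from the spaCy output above, used here as a plain lookup table.
LEMMAS = {"saw": "see", "took": "take", "flying": "fly"}

for word in ["saw", "took", "flying"]:
    # Columns: original word, naive stem, lemma
    print(word, naive_stem(word), LEMMAS[word])
```

The regular form "flying" reduces to "fly" either way, while "saw" and "took" pass through the suffix rules unchanged and only the lemma maps them to "see" and "take".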


### Remove stop words and punctuation

In [5]:
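from spacy.lang.en.stop_words import STOP_WORDS
stopwords = set(STOP_WORDS)
# Use a set so that `in` matches whole tokens rather than substrings of string.punctuation
punctuations = set(string.punctuation)

# Keep non-empty lemmas that are neither stop words nor punctuation marks
tokens = [token for token in tokens if token and token.lower() not in stopwords and token not in punctuations]
print(tokens)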

['geese', 'near', 'pond', 'fly']
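
One detail of the punctuation filter is worth spelling out. `string.punctuation` is a single string, so testing a token against it directly with `in` is a substring test, whereas testing against a set of its characters matches single characters only. Neither form recognizes a multi-character punctuation token such as "...". The sketch below compares the two tests alongside a hypothetical helper, `is_punct_token`, which is not part of spaCy or the standard library.

```python
import string

PUNCT_STRING = string.punctuation    # a str: `in` performs substring matching
PUNCT_SET = set(string.punctuation)  # a set: `in` matches whole elements only


def is_punct_token(token):
    """True for non-empty tokens made up entirely of punctuation characters."""
    return bool(token) and all(ch in PUNCT_SET for ch in token)


for tok in [".", "...", "", "pond"]:
    # Columns: token, substring test, set test, helper
    print(repr(tok), tok in PUNCT_STRING, tok in PUNCT_SET, is_punct_token(tok))
```

For the example sentence all three tests agree, because its only punctuation token is a single period. They diverge on empty strings, which the substring test treats as punctuation, and on runs such as "...", which only the helper catches.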


In [6]:
# Combine the filtered lemmas back into a string
doc_processed = " ".join(tokens)

print('Original:')
print(example_doc)
print('Processed:')
print(doc_processed)

Original:
I saw some geese near the pond. Then they took off flying.
Processed:
geese near pond fly
