# Cleaning and processing text data

## Objectives

This document shows:

- How to clean text data with regular expressions
- How to process text data with NLP toolkits (NLTK and spaCy)


## Cleaning textual data

### Demo data


In [None]:
sent1 = "Two explosions were just reported at a flood-damaged chemical plant near\u00a0Houston https://t.co/jTIZXoKr2B https://t.co/u0gTIsaHIy"

### Remove URLs

In [None]:
import re

# \S matches any non-whitespace character, so http\S+ consumes a URL up to the next whitespace
result = re.sub(r"http\S+", "", sent1)
print(result)

Two explosions were just reported at a flood-damaged chemical plant near Houston  

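The output above still carries trailing spaces where the URLs used to be, and the input contains a non-breaking space (`\u00a0`). The following is a minimal sketch of a helper that removes URLs and then collapses whitespace; the name `strip_urls` is hypothetical and not part of the original notebook. It relies on the fact that, for Python 3 `str` patterns, `\s` matches Unicode whitespace, including the non-breaking space.

```python
import re

URL_PATTERN = re.compile(r"http\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def strip_urls(text):
    """Remove http(s) URLs, then collapse whitespace runs to single spaces."""
    without_urls = URL_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", without_urls).strip()


sent1 = (
    "Two explosions were just reported at a flood-damaged chemical plant "
    "near\u00a0Houston https://t.co/jTIZXoKr2B https://t.co/u0gTIsaHIy"
)
cleaned = strip_urls(sent1)
print(cleaned)
```

The pattern `http\S+` is deliberately loose: it also removes any non-URL token that happens to start with `http`.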

### Remove punctuation from text

In [None]:
import string

In [None]:
help(str.maketrans)

Help on built-in function maketrans:

maketrans(x, y=None, z=None, /)
    Return a translation table usable for str.translate().
    
    If there is only one argument, it must be a dictionary mapping Unicode
    ordinals (integers) or characters to Unicode ordinals, strings or None.
    Character keys will be then converted to ordinals.
    If there are two arguments, they must be strings of equal length, and
    in the resulting dictionary, each character in x will be mapped to the
    character at the same position in y. If there is a third argument, it
    must be a string, whose characters will be mapped to None in the result.



In [None]:
sent2 = "Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - https://t.co/j5igfpvLJu https://t.co/8gsUpD8jsa"
sent2_ = re.sub(r"http\S+", "", sent2)
print(sent2_)

# With two empty strings, only the third argument matters: every character in it is deleted
translator = str.maketrans('', '', string.punctuation)

Stories of Pittsburghers trapped in #Houston flooding!!!! @@ -  


In [None]:
sent2_pun = sent2_.translate(translator)
print(sent2_pun)

Stories of Pittsburghers trapped in Houston flooding    

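Deleting everything in `string.punctuation` also strips the `#` from hashtags, which may carry meaning in tweets. As a sketch of one alternative, the translation table below keeps `#` and `@` and then drops tokens made only of those symbols. The choice of characters to keep and the variable names are assumptions for illustration.

```python
import string

# Assumption: '#' and '@' are kept so hashtags and mentions survive.
KEEP = "#@"
to_delete = "".join(ch for ch in string.punctuation if ch not in KEEP)
translator_keep_tags = str.maketrans("", "", to_delete)

sent2_ = "Stories of Pittsburghers trapped in #Houston flooding!!!! @@ - "
no_punct = sent2_.translate(translator_keep_tags)

# split() drops the extra whitespace; tok.strip(KEEP) is empty for tokens
# such as "@@" that contain nothing but the kept symbols.
tidy = " ".join(tok for tok in no_punct.split() if tok.strip(KEEP))
print(tidy)
```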

## Processing text data with NLP


### Sentence tokenization

In text mining tasks, we often need to split long articles into sentences. Both NLTK and spaCy can do this.

Demo data

In [None]:
para = "Hello World. It's good to see you. Thanks for buying this book."

#### Using NLTK

If you have not yet downloaded the tokenizer model, do so before processing.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import sent_tokenize
sent_tokenize(para)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

#### Using spaCy

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp(para)
sents = [sent.text for sent in doc.sents]
print(sents)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

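For comparison with the two toolkits, here is a naive regex-based splitter. It is not how NLTK or spaCy work internally; it is only a baseline sketch (the function name `naive_sent_split` is hypothetical) that shows why a plain punctuation rule is fragile.

```python
import re

# Naive rule (an assumption for illustration): a sentence ends at '.', '!'
# or '?' followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_sent_split(text):
    """Split text at whitespace that follows sentence-final punctuation."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


para = "Hello World. It's good to see you. Thanks for buying this book."
print(naive_sent_split(para))

# Abbreviations break the naive rule: "Dr." is treated as a full sentence.
tricky = "Dr. Smith arrived. He was late."
print(naive_sent_split(tricky))
```

The rule handles the demo paragraph but splits after the abbreviation in the second example, which is the kind of case trained sentence tokenizers are designed to handle better.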

### Word tokenization

Demo data

In [None]:
sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'

#### Using NLTK

In [None]:
from nltk.tokenize import word_tokenize
print(word_tokenize(sent))

['The', 'history', 'of', 'NLP', 'generally', 'starts', 'in', 'the', '1950s', ',', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', '.']


#### Using spaCy

In [None]:
doc = nlp(sent)
tokens = [x.text for x in doc]
print(tokens)

['The', 'history', 'of', 'NLP', 'generally', 'starts', 'in', 'the', '1950s', ',', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', '.']


### POS Tagging


Demo data

In [None]:
sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'

#### Using NLTK

First, download the POS tagger model.

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
tokens = word_tokenize(sent)
nltk.pos_tag(tokens)

[('The', 'DT'),
 ('history', 'NN'),
 ('of', 'IN'),
 ('NLP', 'NNP'),
 ('generally', 'RB'),
 ('starts', 'VBZ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('1950s', 'CD'),
 (',', ','),
 ('although', 'IN'),
 ('work', 'NN'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('found', 'VBN'),
 ('from', 'IN'),
 ('earlier', 'JJR'),
 ('periods', 'NNS'),
 ('.', '.')]

#### Using spaCy

spaCy exposes coarse-grained Universal POS tags through `token.pos_` and fine-grained tags (Penn Treebank style for English models) through `token.tag_`. See [https://spacy.io/api/annotation](https://spacy.io/api/annotation)

In [None]:
doc = nlp(sent)
[(x.text, x.pos_) for x in doc]

[('The', 'DET'),
 ('history', 'NOUN'),
 ('of', 'ADP'),
 ('NLP', 'PROPN'),
 ('generally', 'ADV'),
 ('starts', 'VERB'),
 ('in', 'ADP'),
 ('the', 'DET'),
 ('1950s', 'NOUN'),
 (',', 'PUNCT'),
 ('although', 'SCONJ'),
 ('work', 'NOUN'),
 ('can', 'VERB'),
 ('be', 'AUX'),
 ('found', 'VERB'),
 ('from', 'ADP'),
 ('earlier', 'ADJ'),
 ('periods', 'NOUN'),
 ('.', 'PUNCT')]

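The two outputs above use different tagsets: NLTK prints Penn Treebank tags, while spaCy's `pos_` prints Universal POS tags. As a rough illustration of how they relate, the sketch below applies a hand-written, partial lookup table to a few of the NLTK pairs shown earlier. The table is an assumption for illustration, not spaCy's internal mapping, and a plain tag-to-tag lookup is lossy: Penn `IN` covers both prepositions and subordinating conjunctions, so it cannot reproduce the `SCONJ` that spaCy assigns to "although".

```python
# Hypothetical, partial Penn Treebank -> Universal POS lookup (illustration only).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "IN": "ADP",  # lossy: IN also covers subordinating conjunctions
    "RB": "ADV",
    "JJR": "ADJ",
    "VB": "VERB",
    "VBZ": "VERB",
    "VBN": "VERB",
    "CD": "NUM",
    ",": "PUNCT",
    ".": "PUNCT",
}

# A subset of the (token, tag) pairs printed by nltk.pos_tag above.
penn_tagged = [
    ("The", "DT"), ("history", "NN"), ("of", "IN"), ("NLP", "NNP"),
    ("generally", "RB"), ("starts", "VBZ"), ("periods", "NNS"), (".", "."),
]

# Tags missing from the table fall back to "X" (the Universal tag for "other").
coarse = [(tok, PENN_TO_UNIVERSAL.get(tag, "X")) for tok, tag in penn_tagged]
print(coarse)
```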
### Word Lemmatization


#### Using NLTK

First, download the WordNet data.

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'

tokens = nltk.word_tokenize(sent)
tags = nltk.pos_tag(tokens)

# Penn Treebank tags start with N (noun), V (verb), or J (adjective).
# Map each to the matching WordNet POS and leave all other tokens unchanged.
for token, pos_tag in tags:
    if pos_tag.startswith('N'):
        lemma = lemmatizer.lemmatize(token, pos=NOUN)
    elif pos_tag.startswith('V'):
        lemma = lemmatizer.lemmatize(token, pos=VERB)
    elif pos_tag.startswith('J'):
        lemma = lemmatizer.lemmatize(token, pos=ADJ)
    else:
        lemma = token
    print(f"{token} - {lemma}")

The - The
history - history
of - of
NLP - NLP
generally - generally
starts - start
in - in
the - the
1950s - 1950s
, - ,
although - although
work - work
can - can
be - be
found - find
from - from
earlier - early
periods - period
. - .

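The tag-prefix checks in the loop above can be factored into a small helper. The sketch below is one possible version; the name `penn_to_wordnet` is hypothetical. It returns the one-letter codes that WordNet uses for parts of speech (`'n'`, `'v'`, `'a'`, `'r'`, the values behind NLTK's `NOUN`, `VERB`, `ADJ`, and `ADV` constants). Handling adverbs is an extension beyond the original loop.

```python
from typing import Optional


def penn_to_wordnet(penn_tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    if penn_tag.startswith("N"):
        return "n"  # noun
    if penn_tag.startswith("V"):
        return "v"  # verb
    if penn_tag.startswith("J"):
        return "a"  # adjective
    # Check "RB" rather than "R" so the particle tag RP is not treated as an adverb.
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return None


for tag in ["NNS", "VBZ", "JJR", "RB", "RP", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

With the `lemmatizer` created earlier, the returned code would be passed as the `pos` argument of `lemmatize` whenever it is not `None`, and the token kept unchanged otherwise.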

#### Using spaCy

Lemmatization is simpler in spaCy: each token exposes its lemma through the `lemma_` attribute.

In [None]:
doc = nlp(sent)
[(x.text, x.lemma_) for x in doc]

[('The', 'the'),
 ('history', 'history'),
 ('of', 'of'),
 ('NLP', 'NLP'),
 ('generally', 'generally'),
 ('starts', 'start'),
 ('in', 'in'),
 ('the', 'the'),
 ('1950s', '1950'),
 (',', ','),
 ('although', 'although'),
 ('work', 'work'),
 ('can', 'can'),
 ('be', 'be'),
 ('found', 'find'),
 ('from', 'from'),
 ('earlier', 'early'),
 ('periods', 'period'),
 ('.', '.')]

### Filtering stop words

#### Using NLTK

First, download the list of stop words.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))

sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'

words = nltk.word_tokenize(sent)
[word for word in words if word not in english_stops]

['The',
 'history',
 'NLP',
 'generally',
 'starts',
 '1950s',
 ',',
 'although',
 'work',
 'found',
 'earlier',
 'periods',
 '.']

#### Using spaCy

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = set(STOP_WORDS)
sent = 'The history of NLP generally starts in the 1950s, although work can be found from earlier periods.'
doc = nlp(sent)
words = [x.text for x in doc]
[word for word in words if word not in spacy_stopwords]

['The',
 'history',
 'NLP',
 'generally',
 'starts',
 '1950s',
 ',',
 'work',
 'found',
 'earlier',
 'periods',
 '.']

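In both outputs above, `'The'` survives the filter because the stop word lists are lowercase and the comparison is case-sensitive; punctuation tokens survive as well. The sketch below lowercases each token before the lookup and drops punctuation-only tokens. The tiny `demo_stops` set is a stand-in chosen for illustration so that the example runs without any downloads; in practice, substitute `english_stops` or `spacy_stopwords`.

```python
import string

# Assumption: a small hand-picked stop list, used only to keep this example
# self-contained. Replace it with english_stops or spacy_stopwords.
demo_stops = {"the", "of", "in", "can", "be", "from"}

tokens = ['The', 'history', 'of', 'NLP', 'generally', 'starts', 'in', 'the',
          '1950s', ',', 'although', 'work', 'can', 'be', 'found', 'from',
          'earlier', 'periods', '.']


def is_punctuation(token):
    """True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


filtered = [
    tok for tok in tokens
    if tok.lower() not in demo_stops and not is_punctuation(tok)
]
print(filtered)
```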
