# Text processing methods 

The most basic text processing methods in current use are:

- tokenization,
- PoS tagging,
- lemmatization,
- stemming,
- noun chunks,
- named entity recognition.

In this notebook we go through each method and show how to use it.

## Tokenization

Tokenization splits text into smaller units called tokens, in the simplest case words. A first approximation is to split the text on whitespace.

In [1]:
example_text = "Dow Jones is a fintech company."

import re

pattern = "\\s+"
words = re.split(pattern, example_text)
print(words)

['Dow', 'Jones', 'is', 'a', 'fintech', 'company.']
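
Splitting on whitespace leaves the final period attached to `company.`. The following is a minimal sketch of one way to separate punctuation using only the standard library. The pattern is an illustrative assumption, not the rule set that NLTK or spaCy use.

```python
import re

example_text = "Dow Jones is a fintech company."

# Match either a run of word characters or a single character
# that is neither a word character nor whitespace.
token_pattern = r"\w+|[^\w\s]"
simple_tokens = re.findall(token_pattern, example_text)
print(simple_tokens)
```

A pattern like this is enough for plain sentences, but it would also split abbreviations, decimal numbers, and URLs, which is one reason to prefer a dedicated tokenizer.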


NLP tools such as NLTK or spaCy can tokenize a sentence into a list of tokens that is more useful for researchers; for example, punctuation is separated from the words.

In [2]:
import nltk

tokens = nltk.word_tokenize(example_text)
print("Tokens: " + str(tokens))

Tokens: ['Dow', 'Jones', 'is', 'a', 'fintech', 'company', '.']


Another task related to tokenization is sentence splitting. Again, many NLP tools support it.

In [5]:
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

example_sentences = "Dow Jones is a fintech company. We analyze news with NLP."

sent_tokenize(example_sentences)

[nltk_data] Downloading package punkt to /home/codete/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Dow Jones is a fintech company.', 'We analyze news with NLP.']
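
To see why a dedicated sentence tokenizer is useful, compare it with a naive rule that splits after every `.`, `!`, or `?` followed by whitespace. The helper `naive_sent_split` below is a hypothetical name introduced for illustration only; it is not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works for simple sentences.
simple = naive_sent_split("Dow Jones is a fintech company. We analyze news with NLP.")
print(simple)

# Breaks on abbreviations: "Mr." is treated as the end of a sentence.
with_abbreviation = naive_sent_split("Mr. Vinken will join the board. He is 61 years old.")
print(with_abbreviation)
```

The second call returns three pieces instead of two, which is the kind of case that trained sentence tokenizers such as Punkt are designed to handle.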

Both examples work well for typical documents, but text from other sources, such as social media, needs more care. Such text contains additional symbols like emoticons and hashtags, and a regular tokenizer breaks them into separate characters. Customized tokenizers solve this problem; NLTK already ships one for tweets, `TweetTokenizer`.

In [6]:
from nltk.tokenize import word_tokenize, TweetTokenizer  # TweetTokenizer was designed for tweets

text = "This is a #hashtag: :-) :-P <3 and some arrows < > -> <--"
tokens = word_tokenize(text)

print(tokens)

twitter_tokenizer = TweetTokenizer()
tokens = twitter_tokenizer.tokenize(text)

print(tokens)

['This', 'is', 'a', '#', 'hashtag', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--']
['This', 'is', 'a', '#hashtag', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
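
The idea behind a social-media tokenizer can be sketched with a single regular expression that tries multi-character symbols before falling back to single characters. The pattern below is a simplified assumption written for this one sentence; it is not the pattern that `TweetTokenizer` uses internally.

```python
import re

text = "This is a #hashtag: :-) :-P <3 and some arrows < > -> <--"

# Alternatives are tried left to right, so multi-character symbols come first.
tweet_pattern = re.compile(
    r"""
    \#\w+              # hashtags
    | [:;=]-?[)(DPp]   # a few western-style emoticons such as :-) and :-P
    | <3               # the heart symbol
    | <-+ | -+>        # arrows such as <-- and ->
    | \w+              # ordinary words
    | [^\w\s]          # any other single symbol
    """,
    re.VERBOSE,
)

toy_tweet_tokens = tweet_pattern.findall(text)
print(toy_tweet_tokens)
```

The order of the alternatives is the design choice that matters here: if the single-symbol fallback came first, every emoticon would be split into characters again.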


### PoS tagging

Part-of-speech (PoS) tagging assigns a part of speech to each token. A PoS tag is a short label such as JJ (adjective) or NN (singular noun) that encodes the part of speech and, in some cases, further grammatical detail such as number or tense.

In [7]:
example_text = "Dow Jones is a fintech company."
tokens = nltk.word_tokenize(example_text)
tags = nltk.pos_tag(tokens)
print("Tagged: " + str(tags))

Tagged: [('Dow', 'NNP'), ('Jones', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('fintech', 'JJ'), ('company', 'NN'), ('.', '.')]
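
The tags in this output follow the Penn Treebank tag set. As a reading aid, the sketch below maps the tags that appear above to short descriptions; the dictionary `PENN_TAGS` is a hand-written helper for this notebook, not an NLTK object, and it covers only these six tags.

```python
# Descriptions of the Penn Treebank tags that appear in the tagged sentence.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# The tagged sentence, copied from the output of nltk.pos_tag.
tagged = [("Dow", "NNP"), ("Jones", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("fintech", "JJ"), ("company", "NN"), (".", ".")]

for word, tag in tagged:
    print(f"{word:10} {tag:4} {PENN_TAGS.get(tag, 'unknown')}")
```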


spaCy provides the same information: each token exposes a fine-grained tag (`tag_`) and a coarse-grained part of speech (`pos_`). The example below tags the beginning of the Penn Treebank sample that ships with NLTK.

In [10]:
import spacy
from nltk.corpus import treebank_raw

# Take the first 100 characters of the raw Treebank sample and drop the .START marker
raw = treebank_raw.raw()[0:100].replace('.START', '').rstrip("\r\n")

nlp = spacy.load("en_core_web_sm")
doc = nlp(raw)

# Print the index, text, fine-grained tag and coarse-grained part of speech of each token
for token in doc:
    print(token.i, token.text, token.tag_, token.pos_)

0  

 _SP SPACE
1 Pierre NNP PROPN
2 Vinken NNP PROPN
3 , , PUNCT
4 61 CD NUM
5 years NNS NOUN
6 old JJ ADJ
7 , , PUNCT
8 will MD VERB
9 join VB VERB
10 the DT DET
11 board NN NOUN
12 as IN ADP
13 a DT DET
14 nonexecutive JJ ADJ
15 director NN NOUN
16 Nov. NNP PROPN
17 29 CD NUM
18 . . PUNCT
19 
  SPACE
20 Mr. NNP PROPN
21 Vi NNP PROPN


### Lemmatization

A word can take different forms depending on its inflection. All inflected forms of a word with a given part of speech, such as a noun or an adjective, share one base form, the lemma. Lemmatization is the process of finding that base form for a word and a specific part of speech.

![timeline](images/lemmatization.png)

In [None]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize('do',pos='v'))
print(wordnet_lemmatizer.lemmatize('does',pos='v'))
print(wordnet_lemmatizer.lemmatize('doing',pos='v'))

The lemma also depends on the part of speech assumed for the word. In NLTK this is controlled by the **pos** parameter, which takes one of the values ``a`` (adjective), ``s`` (satellite adjective), ``r`` (adverb), ``n`` (noun), or ``v`` (verb).

In [None]:
wordnet_lemmatizer.lemmatize('are',pos='a')
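
The effect of the part of speech can be sketched with a miniature dictionary-based lemmatizer. Everything in this block (the `TOY_LEMMAS` table and the `toy_lemmatize` function) is hypothetical and far smaller than WordNet; it only mirrors the behaviour that a lookup with an unsuitable part of speech returns the word unchanged.

```python
# Hypothetical miniature vocabulary: (inflected form, part of speech) -> lemma.
TOY_LEMMAS = {
    ("does", "v"): "do",
    ("doing", "v"): "do",
    ("are", "v"): "be",
    ("is", "v"): "be",
    ("companies", "n"): "company",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself when the pair is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("are", pos="v"))  # looked up as a verb
print(toy_lemmatize("are", pos="a"))  # no adjective entry, so the word is returned as is
```

This is why choosing the right part of speech matters: with the wrong one, the lookup simply misses and the inflected form comes back untouched.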

### Stemming

Stemming also reduces a word to a root form. Unlike lemmatization, it applies a set of rules to strip the word down instead of looking it up in a vocabulary, so the resulting stem does not have to be an existing word.

In [6]:
from nltk import PorterStemmer, LancasterStemmer, word_tokenize
from nltk.stem.snowball import SnowballStemmer

sample = "This is a research paper on natural language processing"

tokens = word_tokenize(sample)

porter = PorterStemmer()
p_stem = [porter.stem(t) for t in tokens]
print(p_stem)

lancaster = LancasterStemmer()
l_stem = [lancaster.stem(t) for t in tokens]
print(l_stem)

snowball = SnowballStemmer('english')
s_stem = [snowball.stem(t) for t in tokens]
print(s_stem)

['thi', 'is', 'a', 'research', 'paper', 'on', 'natur', 'languag', 'process']
['thi', 'is', 'a', 'research', 'pap', 'on', 'nat', 'langu', 'process']
['this', 'is', 'a', 'research', 'paper', 'on', 'natur', 'languag', 'process']
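
To make the rule-based nature of stemming concrete, here is a toy suffix stripper. The suffix list and the minimum stem length are assumptions chosen for this one sentence; the real Porter, Lancaster, and Snowball algorithms use many more rules and conditions.

```python
SUFFIXES = ("ing", "al", "e", "s")  # hypothetical rule list, checked in this order
MIN_STEM_LENGTH = 3                 # do not strip if the remaining stem would be shorter

def toy_stem(word):
    """Strip the first matching suffix, without consulting any vocabulary."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

sample = "This is a research paper on natural language processing"
toy_stems = [toy_stem(t) for t in sample.split()]
print(toy_stems)
```

Stems such as `natur` and `languag` are not English words, which illustrates the difference from lemmatization: the rules never check whether the result exists in a vocabulary.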


### Noun chunks

Noun chunks are the noun phrases of a text: a noun together with the words that describe it. A list of noun chunks often makes a text much easier to grasp, so the method is useful for retrieving information from a text.

In [7]:
doc = nlp("This is a Dow Jones research paper on natural language processing")

for chunk in doc.noun_chunks:
    print(chunk)

a Dow Jones research paper
natural language processing
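
Noun chunking builds on PoS tags. A rough sketch of the idea is to collect maximal runs of determiners, adjectives, and nouns and keep the runs that contain at least one noun. The tags below are written by hand for illustration, and the grouping rule is a simplification; spaCy derives its noun chunks from a dependency parse, not from this rule.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
CHUNK_TAGS = NOUN_TAGS | {"DT", "JJ"}

def toy_noun_chunks(tagged_tokens):
    """Group consecutive DT/JJ/NN* tokens and keep groups that contain a noun."""
    chunks, current = [], []
    # The sentinel entry flushes the last run at the end of the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in CHUNK_TAGS:
            current.append((word, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

# Hand-assigned Penn Treebank tags (an assumption, not tagger output).
tagged_sample = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("Dow", "NNP"),
                 ("Jones", "NNP"), ("research", "NN"), ("paper", "NN"), ("on", "IN"),
                 ("natural", "JJ"), ("language", "NN"), ("processing", "NN")]

toy_chunks = toy_noun_chunks(tagged_sample)
print(toy_chunks)
```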


### Named Entity Recognition

Named entity recognition (NER) is a handy method that gives an even better understanding of a text than noun chunks. Like almost all text processing methods, NER models are trained separately for each language, using annotated datasets and machine learning methods.

![timeline](images/noun_chunks.png)

In [3]:
sample = "This is a Dow Jones research paper on natural language processing"

nlp = spacy.load("en_core_web_sm")
doc = nlp(sample)

# Named entities with their labels
for entity in doc.ents:
    print(entity.label_, entity.text)

# Noun chunks of the same sentence, for comparison
for chunk in doc.noun_chunks:
    print(chunk)

ORG Dow Jones
a Dow Jones research paper
natural language processing
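
Annotated NER datasets are often stored as one label per token. A common convention is the BIO scheme, where `B-` marks the beginning of an entity, `I-` its continuation, and `O` a token outside any entity. The sketch below decodes such labels back into entities; the labels are written by hand for illustration, and whether a given spaCy model was trained on data in exactly this format is an assumption.

```python
def decode_bio(tokens, labels):
    """Turn per-token BIO labels into a list of (entity_type, text) pairs."""
    entities, words, entity_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and entity_type == label[2:]:
            # Continuation of the entity that is currently open.
            words.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if words:
            entities.append((entity_type, " ".join(words)))
        if label.startswith("B-"):
            words, entity_type = [token], label[2:]
        else:
            words, entity_type = [], None
    if words:
        entities.append((entity_type, " ".join(words)))
    return entities

tokens = "This is a Dow Jones research paper on natural language processing".split()
labels = ["O", "O", "O", "B-ORG", "I-ORG", "O", "O", "O", "O", "O", "O"]

decoded = decode_bio(tokens, labels)
print(decoded)
```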
