## Tokenization and Stop Word Removal

**Author: Abhishek Dey**

Tokenization in NLP means splitting a stream of text into smaller units called tokens.

Tokens can be at sentence level, word level or character level.
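To make the three levels concrete, here is a minimal sketch that uses only the Python standard library. The regular expressions and the `sample` string are assumptions chosen for illustration; they are naive stand-ins and not how NLTK tokenizes internally.

```python
import re

sample = "Hello there. How are you?"

# Sentence level: split after '.', '!' or '?' when followed by whitespace (naive rule).
sentence_tokens = re.split(r"(?<=[.!?])\s+", sample)

# Word level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Character level: every character, including spaces.
char_tokens = list(sample)

print(sentence_tokens)
print(word_tokens)
print(char_tokens[:5])
```

Rules this simple break on abbreviations such as "Dr." or on decimal numbers, which is one reason to prefer a library tokenizer for real text.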

### Tokenization using NLTK Library

In [7]:
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
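# Fetch the tokenizer models and the stop word lists (packages already present are skipped)
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by the tokenizers in recent NLTK releases
nltk.download('stopwords')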

[nltk_data] Downloading package punkt to /home/abhishek/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/abhishek/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/abhishek/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### NLTK sentence tokenizer

In [13]:
text = "This is my first sentence. How does that look ? Do you have any questions ? Yeah ! I do have one : How old are you ? It doen't sound good ; Alright , I am leaving this conversation"

In [9]:
sentences = sent_tokenize(text)

sentences

['This is my first sentence.',
 'How does that look ?',
 'Do you have any questions ?',
 'Yeah !',
 'I do have one : How old are you ?',
 "It doen't sound good ; Alright , I am leaving this conversation"]

### NLTK Word tokenizer

In [14]:
words = word_tokenize(text)

print(words)

['This', 'is', 'my', 'first', 'sentence', '.', 'How', 'does', 'that', 'look', '?', 'Do', 'you', 'have', 'any', 'questions', '?', 'Yeah', '!', 'I', 'do', 'have', 'one', ':', 'How', 'old', 'are', 'you', '?', 'It', 'doe', "n't", 'sound', 'good', ';', 'Alright', ',', 'I', 'am', 'leaving', 'this', 'conversation']


In [15]:
len(words)

42

### Word Tokenization using transformers

In [16]:
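from transformers import AutoTokenizer
# Load the tokenizer that ships with the bert-base-uncased checkpoint ("uncased" means it lowercases the text)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")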

In [17]:
words = tokenizer.tokenize(text)

print(words)

['this', 'is', 'my', 'first', 'sentence', '.', 'how', 'does', 'that', 'look', '?', 'do', 'you', 'have', 'any', 'questions', '?', 'yeah', '!', 'i', 'do', 'have', 'one', ':', 'how', 'old', 'are', 'you', '?', 'it', 'doe', '##n', "'", 't', 'sound', 'good', ';', 'alright', ',', 'i', 'am', 'leaving', 'this', 'conversation']


In [18]:
len(words)

44
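The count rises from 42 to 44 because the BERT tokenizer splits the misspelled word "doen't" into four pieces (`'doe'`, `'##n'`, `"'"`, `'t'`), where NLTK produced two (`'doe'`, `"n't"`). The `##` prefix marks a piece that continues the previous token. The sketch below shows one way to glue such pieces back together; `merge_wordpieces` is a hypothetical helper written for this illustration, not part of the `transformers` library.

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed continuation pieces onto the preceding token."""
    merged = []
    for tok in tokens:
        if tok.startswith("##") and merged:
            # Continuation piece: append it, without the marker, to the last token.
            merged[-1] += tok[2:]
        else:
            merged.append(tok)
    return merged


pieces = ["it", "doe", "##n", "'", "t", "sound", "good"]
print(merge_wordpieces(pieces))
```

The apostrophe and the trailing `t` stay separate because they carry no `##` marker; only marked pieces are rejoined.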

## Stopwords

Stop words are common words in a language that are generally filtered out before text is processed.

They are removed because they carry little information for many NLP tasks.

Removing stop words shrinks the vocabulary (reducing dimensionality) and can speed up processing. Whether it also improves model performance depends on the task: the NLTK list contains words such as "not", which matter for tasks like sentiment analysis.

Examples: ["a", "an", "is", "are", "to", "by", "the", "this", "that", "these"]
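As a rough illustration of the dimensionality point, the sketch below counts distinct tokens before and after filtering. The sentence and the `example_stop_words` set are assumptions made up for this example; the set is just the short list of examples above, not the full NLTK list.

```python
from collections import Counter

# Small illustrative stop list taken from the examples above (not the full NLTK list).
example_stop_words = {"a", "an", "is", "are", "to", "by", "the", "this", "that", "these"}

tokens = "this is a cat and that is the dog that ran to the cat".split()

# Vocabulary with and without stop words.
counts = Counter(tokens)
kept = Counter(t for t in tokens if t not in example_stop_words)

print(len(counts), "distinct tokens before filtering")
print(len(kept), "distinct tokens after filtering")
print(kept.most_common())
```

In a bag-of-words representation each distinct token becomes a feature, so a smaller vocabulary means fewer feature columns.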

### Define English stop words

In [20]:
stop_words=set(stopwords.words('english'))

In [21]:
len(stop_words)

198

In [22]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [19]:
text = "List of stop words: a, an, is, are, to, by, the, this, that, these"

In [25]:
words = word_tokenize(text)

In [27]:
print(words)

['List', 'of', 'stop', 'words', ':', 'a', ',', 'an', ',', 'is', ',', 'are', ',', 'to', ',', 'by', ',', 'the', ',', 'this', ',', 'that', ',', 'these']


In [29]:
# Keep only tokens that are not in the stop word set (the comparison is case-sensitive)
filtered_words = [w for w in words if w not in stop_words]

filtered_words

['List', 'stop', 'words', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',']
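The result above still contains `':'` and `','` because punctuation is not part of the stop word list. The membership test is also case-sensitive, so a capitalized token such as `'This'` would survive even though `'this'` is a stop word. A minimal sketch of a stricter filter follows; `remove_stop_words` and `example_stop_words` are hypothetical names introduced here, and the small set stands in for `set(stopwords.words('english'))` so the snippet runs without NLTK.

```python
import string

# Stand-in stop list built from the examples in this notebook, plus 'of';
# in the notebook itself this would be set(stopwords.words('english')).
example_stop_words = {"a", "an", "is", "are", "to", "by", "the", "this", "that", "these", "of"}


def remove_stop_words(tokens, stop_words):
    """Drop stop words (case-insensitively) and tokens made only of punctuation."""
    return [
        t for t in tokens
        if t.lower() not in stop_words
        and not all(ch in string.punctuation for ch in t)
    ]


tokens = ["This", "is", "a", "list", "of", "stop", "words", ":", "the", ",", "These"]
cleaned = remove_stop_words(tokens, example_stop_words)
print(cleaned)
```

Lowercasing only for the comparison keeps the original casing of the surviving tokens, which can matter for later steps such as named entity recognition.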