## Remove Stop Words Using NLTK

NLTK ships with stop word lists for a number of languages. To get the English stop words, you can use this code:

In [6]:
from nltk.corpus import stopwords
 
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

Now, let’s modify our code and clean the tokens before plotting the graph.

First, we will make a copy of the list, then we will iterate over the tokens and remove the stop words:

In [7]:
text = 'This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences'
tokens = text.split()

# Make a real copy; `clean_tokens = tokens` would only alias the same list,
# and removing items from a list while iterating over it skips elements
clean_tokens = tokens[:]
sr = stopwords.words('english')
 
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)

In [11]:
clean_tokens

['This',
 'tokenizer',
 'divides',
 'text',
 'list',
 'sentences',
 'using',
 'unsupervised',
 'algorithm',
 'build',
 'model',
 'abbreviation',
 'words,',
 'collocations,',
 'words',
 'start',
 'sentences']
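
Note that 'This' and 'words,' survive the filter: the comparison is case-sensitive, and split() leaves punctuation attached to the tokens. One possible refinement is sketched below. It is only an illustration: `STOP_WORDS` is a hypothetical hand-picked subset of the list shown earlier (so the snippet runs without downloading NLTK data), and `remove_stop_words` is a helper name introduced here, not part of NLTK.

```python
import string

# Hypothetical subset of the English stop word list shown above;
# in practice you would probably use set(stopwords.words('english')).
STOP_WORDS = {"this", "a", "into", "of", "by", "an", "to", "for", "and", "that"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased tokens of `text` that are not stop words."""
    cleaned = []
    for raw in text.split():
        # Strip leading/trailing punctuation and lowercase before comparing
        token = raw.strip(string.punctuation).lower()
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned


text = ('This tokenizer divides a text into a list of sentences by using an '
        'unsupervised algorithm to build a model for abbreviation words, '
        'collocations, and words that start sentences')
clean_tokens = remove_stop_words(text)
print(clean_tokens)
```

Building a new list instead of removing items from an existing one avoids mutating a list during iteration, and a set makes each membership test cheap.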

## Tokenize Text Using NLTK

We saw how to split text into tokens using the split function. Now we will see how to tokenize text using NLTK.

Tokenization means splitting larger units of text into smaller ones, and most text processing starts with it.

You can tokenize paragraphs into sentences and sentences into words, depending on your needs. NLTK ships with both a sentence tokenizer and a word tokenizer.

Let’s assume that we have a sample text like the following:

In [12]:

from nltk.tokenize import sent_tokenize
 
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
 
print(sent_tokenize(mytext))

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']


You may say that this is an easy job: I don’t need NLTK tokenization, and I can split sentences using regular expressions, since every sentence ends with punctuation followed by a space.

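Well, take a look at the following text, where the abbreviation Mr. contains a period and a naive split on periods cuts the sentence in the wrong place: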

In [11]:
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
 
# Naive approach: split on every period
print(mytext.split("."))


['Hello Mr', ' Adam, how are you? I hope everything is going well', ' Today is a good day, see you dude', '']
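
The regular-expression idea mentioned above runs into the same problem. The snippet below is a minimal sketch of that approach, assuming a sentence ends at '.', '?' or '!' followed by whitespace; the pattern is an illustration chosen here, not something NLTK uses.

```python
import re

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."

# Split on whitespace that directly follows '.', '?' or '!'
naive_sentences = re.split(r"(?<=[.?!])\s+", mytext)
print(naive_sentences)
# ['Hello Mr.', 'Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
```

The pattern still breaks the first sentence after "Mr.", because it cannot tell an abbreviation from the end of a sentence. NLTK's sentence tokenizer is designed to handle such abbreviations.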


OK, let’s try the word tokenizer and see how it works.


In [12]:

from nltk.tokenize import word_tokenize
 
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
 
print(word_tokenize(mytext))

['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']
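
In the output above, punctuation marks become separate tokens while the abbreviation 'Mr.' stays in one piece. For contrast, here is a small sketch of what plain whitespace splitting does with the same text; it uses only Python's built-in str.split, and the variable name `whitespace_tokens` is just illustrative.

```python
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."

# Whitespace splitting leaves punctuation glued to the neighbouring word
whitespace_tokens = mytext.split()
print(whitespace_tokens[:6])
# ['Hello', 'Mr.', 'Adam,', 'how', 'are', 'you?']
```

Tokens such as 'Adam,' and 'you?' would not match 'Adam' or 'you' in later steps like stop word removal or frequency counting, which is one reason to prefer a real word tokenizer.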


## Tokenize Non-English Text

To tokenize text in other languages, pass the language name to the tokenizer like this:

In [13]:
from nltk.tokenize import sent_tokenize
 
mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
 
print(sent_tokenize(mytext, language="french"))

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]




```
Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter
Portuguese
Romanian
Russian
Spanish
Swedish
```