In [5]:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/biyichen/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### Tokenization means splitting a large piece of text into smaller units (tokens), such as sentences or words. Here we use NLTK's tokenizers to split the text. You might think this is too simple to need NLTK, and that a regular expression could split sentences, since every sentence ends with punctuation followed by a space.

#### Then look at the following text: 'Hello Mr. Adam, how are you?'

#### If you split on punctuation, 'Hello Mr', 'Adam', and 'how are you' will each be treated as a separate sentence. NLTK's sentence tokenizer handles this case correctly:

In [6]:
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."

In [7]:
print(sent_tokenize(mytext))

['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
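
#### For comparison, here is a minimal sketch of the naive regular-expression approach, using only Python's standard `re` module. The pattern is an illustrative assumption (split on whitespace that follows `.`, `?`, or `!`), not something NLTK uses. It breaks the sentence after "Mr.", which is the failure described above.

```python
import re

text = "Hello Mr. Adam, how are you? I hope everything is going well."

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

# The abbreviation "Mr." is wrongly treated as the end of a sentence.
print(naive_sentences)
```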


#### You can also split the text into words. In the output below, the word "Mr." is kept as a single token. For sentence splitting, NLTK uses the PunktSentenceTokenizer from the punkt module, which is part of nltk.tokenize. Pretrained models for this tokenizer are available for multiple languages.

In [8]:
from nltk.tokenize import word_tokenize

In [9]:
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))

['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']
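
#### To see what the word tokenizer adds, compare it with plain whitespace splitting. This is a small sketch using only the built-in `str.split()`; the variable names are illustrative. Whitespace splitting leaves punctuation glued to the words ("Adam,", "you?"), whereas the `word_tokenize` output above returns the punctuation marks as separate tokens.

```python
text = "Hello Mr. Adam, how are you?"

# Splitting on whitespace keeps punctuation attached to the neighbouring word.
whitespace_tokens = text.split()
print(whitespace_tokens)
```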


#### Tokenizing non-English text

In [11]:
from nltk.tokenize import sent_tokenize

In [12]:
mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."

In [13]:
print(sent_tokenize(mytext,"french"))

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]


#### Synonym processing

#### Calling nltk.download() opens NLTK's installation interface, where WordNet is one of the available packages. WordNet is a lexical database built for natural language processing. It contains groups of synonyms (synsets) together with short definitions.

In [None]:
from nltk.corpus import wordnet
import nltk
# Opens the interactive downloader; to fetch only WordNet, call nltk.download('wordnet') instead.
nltk.download()

In [None]:
syn = wordnet.synsets("pain")

In [None]:
print(syn[0].definition())

In [None]:
print(syn[0].examples())

#### You can use WordNet to get synonyms like this:

In [None]:
from nltk.corpus import wordnet
  
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
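
#### The same lemma name can appear in more than one synset, so the list built above may contain duplicates, and multi-word lemma names use underscores instead of spaces. The sketch below shows one way to tidy such a list. The sample list is hand-written for illustration and stands in for the loop's output; it is not actual WordNet output.

```python
# Hypothetical sample standing in for the `synonyms` list built above.
raw_names = ["computer", "computing_machine", "data_processor", "calculator", "computer"]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_names = list(dict.fromkeys(raw_names))

# Replace underscores so multi-word names read naturally.
readable_names = [name.replace("_", " ") for name in unique_names]
print(readable_names)
```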

#### Antonym processing

In [None]:
from nltk.corpus import wordnet
  
antonyms = []
for syn in wordnet.synsets("small"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print(antonyms)

#### Stemming

#### In linguistic morphology and information retrieval, stemming is the process of removing affixes from a word to get its stem. For example, the stem of "working" is "work". Search engines use this technique when indexing pages, because people write many different forms of the same word. There are many stemming algorithms; the most common is the Porter stemming algorithm. NLTK has a class called PorterStemmer that implements it:

In [None]:
from nltk.stem import PorterStemmer
  
stemmer = PorterStemmer()
print(stemmer.stem('working'))
print(stemmer.stem('worked'))
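
#### To make the indexing idea concrete, here is a toy sketch of how a search index can group different forms of a word under one stem. The `naive_stem` function, its suffix list, and the sample pages are all hypothetical and written for this illustration; this is a deliberately crude suffix stripper, not the Porter algorithm.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")  # toy list, far smaller than Porter's rule set


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical pages, already split into lowercase words.
pages = {
    "page1": ["working", "late"],
    "page2": ["worked", "yesterday"],
    "page3": ["works", "fine"],
}

# Inverted index: stem -> set of pages containing any form of that stem.
index = defaultdict(set)
for page, words in pages.items():
    for word in words:
        index[naive_stem(word)].add(page)

# A query for "work" now matches all three pages.
print(sorted(index["work"]))
```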

#### Non-English stemming

#### In addition to English, SnowballStemmer supports 13 other languages (the exact list depends on your NLTK version). You can list the supported languages like this:

In [None]:
from nltk.stem import SnowballStemmer
  
print(SnowballStemmer.languages)
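
#### Once you know the language name, you pass it to the SnowballStemmer constructor. The sketch below stems a German word; the choice of German and of the sample word is only for illustration, and any name listed in SnowballStemmer.languages should work the same way.

```python
from nltk.stem import SnowballStemmer

# Choose a language by name; it must be one of SnowballStemmer.languages.
german_stemmer = SnowballStemmer("german")

stem = german_stemmer.stem("Autobahnen")
print(stem)
```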

#### Lemmatization

#### Lemmatization is similar to stemming, but the difference is that the result of lemmatization is a real word. Stemming, by contrast, can return a truncated string that is not a real word:

In [None]:
from nltk.stem import PorterStemmer
  
stemmer = PorterStemmer()
  
print(stemmer.stem('increases'))

#### Now, if you lemmatize the same word with NLTK's WordNetLemmatizer, you get the correct result:

In [None]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
  
print(lemmatizer.lemmatize('increases'))


#### The result may be a synonym or a different word with the same meaning. Sometimes lemmatizing a word returns the word unchanged. This is because the default part of speech is noun. To lemmatize a verb, pass pos="v":

In [None]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
  
print(lemmatizer.lemmatize('playing', pos="v"))

#### Lemmatization is also a useful way to compress text, reducing it to roughly 50% to 60% of its original size. The pos argument can be a verb (v), a noun (n), an adjective (a), or an adverb (r):

In [None]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
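
#### In practice the pos value often comes from a part-of-speech tagger rather than being typed by hand. The helper below is a hypothetical sketch that assumes Penn Treebank style tags (such as "VBG" or "JJ") as input and maps them to the four pos letters listed above. Both the function name and the choice of tag set are assumptions made for this illustration.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (assumed input format) to a WordNet pos letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # nouns and anything else use the lemmatizer's default


for tag in ["NN", "VBG", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

#### The returned letter can then be passed as the pos argument of lemmatize().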

#### The difference between stemming and lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
  
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('bedroom'))
print(stemmer.stem('jokes'))
print(stemmer.stem('lisa'))
print(stemmer.stem('purple'))
print('----------------------')
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))