In [0]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.....

True

In [0]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [0]:
nltk.download('genesis')

[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.


True

In [0]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [0]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
print(whitman)

['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]


## Text Preprocessing

This section covers the basic steps of text preprocessing, which convert human-language text into a machine-readable form for further processing, along with the tools used for each step.

After obtaining a text, we start with text normalization, which includes:

*    removing punctuation, accent marks and other diacritics
*    removing whitespace
*    expanding abbreviations
*    removing stop words, sparse terms, and particular words
*    text canonicalization
*    converting all letters to lower or upper case
*    converting numbers into words or removing numbers
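
Several of the steps listed above can be chained into one function. The following is a minimal sketch using only the standard library; the name `normalize` and the order of the steps are assumptions made for illustration, and a real pipeline would choose its steps to suit the task.

```python
import re
import string

def normalize(text):
    """Apply a few normalization steps in a fixed order (illustrative only)."""
    text = text.lower()                      # convert to lower case
    text = re.sub(r"\d+", "", text)          # remove runs of digits
    # Remove ASCII punctuation; string.punctuation does not cover Unicode symbols.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = " ".join(text.split())            # collapse and trim whitespace
    return text

print(normalize("  The 3 Quick, brown foxes!!  "))
# the quick brown foxes
```

Collapsing whitespace last cleans up the gaps left behind by the earlier removals.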



### Removing punctuation, accent marks, special symbols and diacritics

In [0]:
# Sample code to remove every match of a regex pattern
import re

def remove_regex(input_text, regex_pattern):
    # Delete all matches of the pattern in a single pass
    return re.sub(regex_pattern, '', input_text)

regex_pattern = r"#\w*"  # a '#' followed by any number of word characters

remove_regex("remove this #hashtag from my given string object", regex_pattern)

'remove this  from my given string object'
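
The cell above removes hashtags only. Punctuation and diacritics could be handled along the following lines; this is a sketch using the standard library, and the helper names `strip_diacritics` and `strip_punctuation` are hypothetical.

```python
import string
import unicodedata

def strip_diacritics(text):
    # NFD decomposition splits "é" into "e" plus a combining accent;
    # dropping the combining characters leaves the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_punctuation(text):
    # string.punctuation covers ASCII punctuation only.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_diacritics("café naïve résumé"))
# cafe naive resume
print(strip_punctuation("Hello, world! (test)"))
# Hello world test
```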

### Remove whitespace

In [0]:
input_str = " \t a string example\t "
input_str = input_str.strip()
input_str

'a string example'
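
`strip()` only trims the ends of a string. If runs of whitespace inside the text should also be reduced to single spaces, either of the two approaches sketched below would do it; the variable names are illustrative.

```python
import re

input_str = " \t a   string\n example\t "

# split() with no arguments splits on any whitespace run and drops empty pieces.
collapsed = " ".join(input_str.split())

# Equivalent regex form: replace each whitespace run with one space, then trim.
also_collapsed = re.sub(r"\s+", " ", input_str).strip()

print(collapsed)
# a string example
```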

### Remove Numbers

In [0]:
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r"\d+", "", input_str)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
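
Instead of removing numbers, they can be converted into words. The sketch below spells out integers from 0 to 99 and leaves larger numbers untouched; `number_to_words` and `spell_out_numbers` are hypothetical helpers written for this example, not part of any library.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

def spell_out_numbers(text):
    def replace(match):
        value = int(match.group())
        # Numbers outside the supported range are left as they are.
        return number_to_words(value) if value < 100 else match.group()
    return re.sub(r"\d+", replace, text)

print(spell_out_numbers("Box A contains 3 red and 15 white balls, 42 in total since 2017."))
# Box A contains three red and fifteen white balls, forty-two in total since 2017.
```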


### Convert Case

In [0]:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


### Tokenization

Tokenization is the process of splitting a text into smaller pieces called tokens. Words, numbers, and punctuation marks can all be tokens.

In [0]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
input_str = "NLTK is a leading platform for building Python programs to work with human language data."

from nltk.tokenize import word_tokenize
tokens = word_tokenize(input_str)
print(tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


### Remove stop words

“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. They carry little meaning on their own and are often removed from texts before further processing.

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'needn', 'against', 'at', 'other', 'those', 'same', "she's", 'for', 'doing', 'be', "mightn't", 'who', 'that', 'couldn', 'once', 'such', 'him', 'ours', 'ain', 'out', 'aren', 'through', 'where', 'under', 'too', 'before', 'when', "don't", 'yourselves', "should've", 'do', 'this', 'between', 'so', 'are', 'is', 'mustn', 'them', 'will', 'more', 'further', 'her', 's', 'yours', 'have', 'off', 'i', 'its', 'over', 'in', 'while', 'above', 'can', 'mightn', "you'll", 'after', 'didn', 'don', 'herself', 'with', "shouldn't", "wasn't", 'me', 'not', 'should', 'shouldn', 'each', 'did', 'most', 't', "shan't", 'below', "haven't", "you've", "that'll", 'isn', 'any', 'had', 'haven', 'themselves', 'o', "didn't", 'myself', "needn't", 'weren', 'doesn', 'all', "aren't", 'these', 're', 'only', 'than', 'you', 'which', 'ourselves', "isn't", 'ma', 'theirs', 'y', "it's", 'how', "won't", 'hadn', "couldn't", 'into', 'hasn', 'then', 'there', 'shan', 'wouldn', 'your', 'as', 'down', 'here', 'd', "weren't", "you're", 'on', 

In [0]:
input_str = "All work and no play makes jack dull boy. Its good to go out and have fun at times."
tokens = word_tokenize(input_str)
# The comparison is case-sensitive, so the capitalized 'All' and 'Its' are kept
result = [i for i in tokens if i not in stop_words]
print(result)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'Its', 'good', 'go', 'fun', 'times', '.']
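
To filter stop words regardless of capitalization, each token can be lowercased before the lookup. The sketch below uses a tiny hand-picked stop list (an assumption, chosen only to keep the example self-contained) in place of the NLTK list.

```python
# A small illustrative stop list; in practice use the full NLTK or scikit-learn set.
small_stop_words = {"all", "and", "no", "its", "to", "out", "have", "at"}

sample_tokens = ["All", "work", "and", "no", "play", "makes", "jack", "dull", "boy", ".",
                 "Its", "good", "to", "go", "out", "and", "have", "fun", "at", "times", "."]

# Lowercase only for the membership test so the kept tokens retain their case.
filtered = [t for t in sample_tokens if t.lower() not in small_stop_words]
print(filtered)
# ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'good', 'go', 'fun', 'times', '.']
```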


In [0]:
# scikit-learn also provides a standard English stop word list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'against', 'whenever', 'due', 'across', 'at', 'noone', 'seemed', 'other', 'same', 'those', 'for', 'three', 'fifteen', 'be', 'still', 'somewhere', 'that', 'who', 'towards', 'however', 'once', 'since', 'such', 'him', 'ours', 'cry', 'six', 'toward', 'out', 'upon', 'fire', 'already', 'amoungst', 'anyhow', 'latterly', 'through', 'where', 'neither', 'detail', 'before', 'get', 'might', 'afterwards', 'per', 'thereafter', 'too', 'anyone', 'when', 'yourselves', 'do', 'between', 'show', 'so', 'this', 'among', 'co', 'last', 'top', 'are', 'is', 'twenty', 'back', 'whole', 'them', 'wherever', 'will', 'perhaps', 'more', 'thereby', 'move', 'fill', 'further', 'describe', 'around', 'her', 'yours', 'have', 'off', 'i', 'its', 'over', 'in', 'above', 'can', 'while', 'nine', 'after', 'everyone', 'rather', 'ten', 'hundred', 'indeed', 'herself', 'with', 'someone', 'me', 'not', 'should', 'mill', 'whereupon', 'each', 'most', 'meanwhile', 'below', 'much', 'although', 'fifty', 'besides', 'everything', 'l

Most of what we do with language relies on first separating out words from running text (splitting the text into minimal meaningful units), a task known as tokenization.

English words are often separated from each other by whitespace, but whitespace is not always sufficient. “New York” and “rock ’n’ roll” are sometimes treated as single words even though they contain spaces, while at other times we need to split “I’m” into the two words “I” and “am”.

To process tweets or text messages, we also need to tokenize emoticons like “:)” and hashtags like “#nlproc”.
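
A regular expression is one way to keep emoticons and hashtags together as single tokens. The pattern below is a minimal sketch covering only a few simple emoticons and ASCII apostrophes; `TOKEN_PATTERN` and `tokenize` are names made up for this example. It keeps “I'm” as one token rather than splitting it.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_PATTERN = re.compile(r"""
    [:;=][-']?[)(DPp]      # simple emoticons such as :) ;-) :D
  | \#\w+                  # hashtags such as #nlproc
  | @\w+                   # user mentions
  | \w+(?:'\w+)*           # words, keeping internal apostrophes
  | [^\w\s]                # any other single non-space symbol
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("I'm loving #nlproc :) thanks @nltk!"))
# ["I'm", 'loving', '#nlproc', ':)', 'thanks', '@nltk', '!']
```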

### Stemming

Stemming reduces a word to its stem by stripping suffixes; the result need not be a dictionary word.

In [0]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms for Natural languages"
tokens = word_tokenize(input_str)
for word in tokens:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
for
natur
languag


In [0]:
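from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
# With ignore_stopwords=True, stop words such as "having" are left unstemmed
# (this option relies on the NLTK stopwords corpus)
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer.stem("having"))
print(stemmer2.stem("having"))

# The "english" and "porter" algorithms can produce different stems
print(SnowballStemmer("english").stem("generously"))
print(SnowballStemmer("porter").stem("generously"))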

have
having
generous
gener


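### Lemmatization

Lemmatization maps a word to its dictionary form (lemma) using a vocabulary, here WordNet.

In [0]: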
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
lemma.lemmatize('article')  # returns 'article'; only the cell's last expression is displayed
lemma.lemmatize('leaves')

'leaf'

### Expanding abbreviations

In [0]:
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love'}

def lookup_words(input_text):
    # Replace each whitespace-separated word found in lookup_dict (case-insensitive)
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # Join after the loop so that empty input yields an empty string
    return " ".join(new_words)

print(lookup_words("RT We are going to CCD @ MG Road!! dm for more info.!!"))

Retweet We are going to CCD @ MG Road!! direct message for more info.!!
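
Splitting on whitespace leaves punctuation attached to words, so a token such as “dm!” would not be found in the dictionary. One possible alternative, sketched here under the assumption that abbreviations should match as whole words regardless of case, is a single regex built from the dictionary keys; `ABBREVIATIONS` and `expand_abbreviations` are hypothetical names.

```python
import re

ABBREVIATIONS = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}

# One alternation of all keys, matched on word boundaries and ignoring case.
ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)

def expand_abbreviations(text):
    # Look up each match by its lowercased form; surrounding punctuation is untouched.
    return ABBREVIATION_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

print(expand_abbreviations("RT: luv this, dm! It was awsm."))
# Retweet: love this, direct message! It was awesome.
```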
