***Sentence Tokenize***

Tokenizing is the process of breaking a large body of text into smaller meaningful chunks such as sentences, words, or phrases. The NLTK library provides sent_tokenize for sentence-level tokenizing. It uses a pre-trained PunktSentenceTokenizer model to decide which punctuation and characters mark the end of a sentence; pre-trained models are available for a number of European languages.
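
To see why a trained model is useful here, the sketch below splits sentences with a plain regular expression. It is only an illustration and not what NLTK does internally; the helper name `naive_sent_split` and the sample sentence are made up for this example. A rule like this breaks on abbreviations such as "Dr.", which is the kind of mistake Punkt is designed to reduce by learning abbreviations from data.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "Dr. Smith teaches statistics. He also likes travelling."

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_sent_split(sample))
# ['Dr.', 'Smith teaches statistics.', 'He also likes travelling.']
```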

In [7]:
%matplotlib inline
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/kazi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
text='Statistics skills, and programming skills are equally important for analytics. Statistics skills, and domain knowledge are important for analytics. I like reading books and travelling.'

In [9]:
sent_tokenize_list = sent_tokenize(text)
print(sent_tokenize_list)

['Statistics skills, and programming skills are equally important for analytics.', 'Statistics skills, and domain knowledge are important for analytics.', 'I like reading books and travelling.']


In [10]:
# There are 17 European languages that NLTK supports for sentence tokenizing
# Let's try loading the Spanish model
import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola. Esta es una frase espanola.')

['Hola.', 'Esta es una frase espanola.']

***Word Tokenize***

word_tokenize is a wrapper function: it first splits the text into sentences and then tokenizes each sentence with a Treebank-style word tokenizer.

In [11]:
from nltk.tokenize import word_tokenize
print (word_tokenize(text))

['Statistics', 'skills', ',', 'and', 'programming', 'skills', 'are', 'equally', 'important', 'for', 'analytics', '.', 'Statistics', 'skills', ',', 'and', 'domain', 'knowledge', 'are', 'important', 'for', 'analytics', '.', 'I', 'like', 'reading', 'books', 'and', 'travelling', '.']


In [12]:
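# Calling TreebankWordTokenizer directly. The text is not split into sentences first,
# so a period inside the text stays attached to its word ('analytics.')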
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print (tokenizer.tokenize(text))

['Statistics', 'skills', ',', 'and', 'programming', 'skills', 'are', 'equally', 'important', 'for', 'analytics.', 'Statistics', 'skills', ',', 'and', 'domain', 'knowledge', 'are', 'important', 'for', 'analytics.', 'I', 'like', 'reading', 'books', 'and', 'travelling', '.']


In [13]:
# Besides TreebankWordTokenizer, there are alternative word tokenizers such as WordPunctTokenizer
# PunktWordTokenizer split on punctuation but kept it with the word; it was removed
# from later NLTK releases, so the example is left commented out
# from nltk.tokenize import PunktWordTokenizer
# punkt_word_tokenizer = PunktWordTokenizer()
# print(punkt_word_tokenizer.tokenize(text))

# WordPunctTokenizer splits all punctuation into separate tokens
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print (word_punct_tokenizer.tokenize(text))

['Statistics', 'skills', ',', 'and', 'programming', 'skills', 'are', 'equally', 'important', 'for', 'analytics', '.', 'Statistics', 'skills', ',', 'and', 'domain', 'knowledge', 'are', 'important', 'for', 'analytics', '.', 'I', 'like', 'reading', 'books', 'and', 'travelling', '.']
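
WordPunctTokenizer is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, so its behaviour can be reproduced with the standard `re` module. The sketch below is an illustration only; the helper name `word_punct_split` and the sample sentence are made up. Note how a contraction is split into three tokens.

```python
import re

# Runs of word characters, or runs of characters that are neither word characters nor whitespace
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def word_punct_split(text):
    """Return alternating word and punctuation tokens."""
    return WORD_PUNCT_PATTERN.findall(text)

print(word_punct_split("I can't wait, really!"))
# ['I', 'can', "'", 't', 'wait', ',', 'really', '!']
```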


***PoS tagging***

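The default POS tagger used by nltk.pos_tag in current NLTK releases is the averaged perceptron tagger (averaged_perceptron_tagger, downloaded below); older releases used the maxent_treebank_pos_tagger model.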

In [19]:
from nltk import chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
tagged_sent = nltk.pos_tag(nltk.word_tokenize('This is a sample English sentence'))
print (tagged_sent)

tree = chunk.ne_chunk(tagged_sent)
tree.draw()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/kazi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/kazi/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/kazi/nltk_data...
[nltk_data]   Package words is already up-to-date!


[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('English', 'JJ'), ('sentence', 'NN')]


In [21]:
# To get help about tags
nltk.download('tagsets')
nltk.help.upenn_tagset('NNP')

[nltk_data] Downloading package tagsets to /home/kazi/nltk_data...


NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


[nltk_data]   Unzipping help/tagsets.zip.


In [23]:
from nltk.tag.perceptron import PerceptronTagger

PT = PerceptronTagger()
print (PT.tag('This is a sample English sentence'.split()))

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('English', 'JJ'), ('sentence', 'NN')]


***Remove stopwords***

In [25]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Function to remove stop words
def remove_stopwords(text, lang='english'):
    words = nltk.word_tokenize(text)
    lang_stopwords = stopwords.words(lang)
    stopwords_removed = [w for w in words if w.lower() not in lang_stopwords]
    return " ".join(stopwords_removed)
print (remove_stopwords('This is a sample English sentence'))

[nltk_data] Downloading package stopwords to /home/kazi/nltk_data...


sample English sentence


[nltk_data]   Unzipping corpora/stopwords.zip.


***Remove punctuations***

In [27]:
import string 

# Function to remove punctuations
def remove_punctuations(text):
    words = nltk.word_tokenize(text)
    # Drop tokens made up entirely of punctuation characters (e.g. ',', '!', '...')
    punct_removed = [w for w in words if not all(ch in string.punctuation for ch in w)]
    return " ".join(punct_removed)

print (remove_punctuations('This is a sample English sentence, with punctuations!'))

This is a sample English sentence with punctuations
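
As an alternative that needs no tokenizer, punctuation can also be deleted character by character with `str.translate`. This is a minimal sketch, not a replacement for the function above: the helper name `strip_punctuation_chars` is made up, and because it works on characters it also removes apostrophes and hyphens inside words.

```python
import string

# Translation table that maps every ASCII punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation_chars(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation_chars('This is a sample English sentence, with punctuations!'))
# This is a sample English sentence with punctuations
```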


***Remove whitespace & numbers***

In [28]:
import re

# Function to remove whitespace
def remove_whitespace(text):
    return " ".join(text.split())

# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

text = 'This 	is a     sample  English   sentence, \n with whitespace and numbers 1234!'
print ('Original Text: ', text)
print ('Removed whitespace: ', remove_whitespace(text))
print ('Removed numbers: ', remove_numbers(text))

Original Text:  This 	is a     sample  English   sentence, 
 with whitespace and numbers 1234!
Removed whitespace:  This is a sample English sentence, with whitespace and numbers 1234!
Removed numbers:  This 	is a     sample  English   sentence, 
 with whitespace and numbers !


***Stemming***


Stemming reduces a word to its root form. A stemmer applies rules that strip common English word endings such as “ly”, “es”, “ed” and “s”. For example, in an analysis you may want to treat “carefully”, “cared”, “cares” and “caringly” as “care” rather than as separate words.
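
The idea of suffix stripping can be sketched in a few lines of plain Python. This toy version is not the Porter, Lancaster or Snowball algorithm; the suffix list, the minimum stem length and the name `toy_stem` are assumptions chosen only for illustration. It shows why rule-based stems are often not real words.

```python
# Hypothetical suffix list, checked in order (longer suffixes first)
SUFFIXES = ("ingly", "ing", "ly", "ed", "es", "s")
MIN_STEM_LEN = 3  # do not strip if fewer than 3 characters would remain

def toy_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in "caring cares cared caringly carefully".split()])
# ['car', 'car', 'car', 'car', 'careful']
```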

In [29]:
from nltk import PorterStemmer, LancasterStemmer, SnowballStemmer

In [34]:
# Function to apply stemming to a list of words
def words_stemmer(words, type="PorterStemmer", lang="english"):
    # Map each supported stemmer name to a callable that builds the stemmer
    supported_stemmers = {
        "PorterStemmer": PorterStemmer,
        "LancasterStemmer": LancasterStemmer,
        "SnowballStemmer": lambda: SnowballStemmer(lang),
    }
    # Unknown or disabled stemmer: return the words unchanged
    if type not in supported_stemmers:
        return words
    stemmer = supported_stemmers[type]()
    return " ".join(stemmer.stem(word) for word in words)

words = 'caring cares cared caringly carefully'

print("Original: ", words)
print("Porter: ", words_stemmer(nltk.word_tokenize(words), "PorterStemmer"))
print("Lancaster: ", words_stemmer(nltk.word_tokenize(words), "LancasterStemmer"))
print("Snowball: ", words_stemmer(nltk.word_tokenize(words), "SnowballStemmer"))


Original:  caring cares cared caringly carefully
Porter:  care care care caringli care
Lancaster:  car car car car car
Snowball:  care care care care care


***Lemmatizer***

Lemmatization transforms a word to its dictionary base form (its lemma), taking the word's part of speech into account.
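
Unlike a stemmer, a lemmatizer looks words up instead of cutting off endings, and the answer can depend on the part of speech. The sketch below mimics that with a tiny hand-written table. The table entries and the name `toy_lemmatize` are hypothetical and stand in for the WordNet dictionary used later in this section.

```python
# Hypothetical mini dictionary: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("cares", "v"): "care",
    ("cared", "v"): "care",
    ("caring", "v"): "care",
    ("cares", "n"): "care",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is not in the table."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("caring", "v"))  # care
print(toy_lemmatize("caring", "n"))  # caring (no noun entry in the table)
print(toy_lemmatize("better", "a"))  # good
```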

In [41]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization to every word in a text
def words_lemmatizer(text):
    words = nltk.word_tokenize(text)
    lemma_words = []
    for word in words:
        # The lemma depends on the part of speech, so look it up first
        pos = find_pos(word)
        lemma_words.append(wordnet_lemmatizer.lemmatize(word, pos))
    return " ".join(lemma_words)

# Function to find part of speech tag for a word
def find_pos(word):
    # Part of Speech constants
    # ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    # You can learn more about these at http://wordnet.princeton.edu/wordnet/man/wndb.5WN.html#sect3
    # You can learn more about all the Penn Treebank tags at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    pos = nltk.pos_tag(nltk.word_tokenize(word))[0][1]
    # Adjective tags - 'JJ', 'JJR', 'JJS'
    if pos.lower()[0] == 'j':
        return 'a'
    # Adverb tags - 'RB', 'RBR', 'RBS'
    elif pos.lower()[0] == 'r':
        return 'r'
    # Verb tags - 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'
    elif pos.lower()[0] == 'v': 
        return 'v'
    # Noun tags - 'NN', 'NNS', 'NNP', 'NNPS'
    else:
        return 'n'

word = 'caring cares cared caringly carefully'
print("Lemmatized: ", words_lemmatizer(word))


Lemmatized:  care care care caringly carefully


In [44]:
from nltk.corpus import wordnet

syns = wordnet.synsets("good")
print ("Definition: ", syns[0].definition())
print ("Example: ", syns[0].examples())

synonyms = []
antonyms = []

# Collect synonyms and antonyms (words with the opposite meaning)
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print ("synonyms: \n", set(synonyms))
print ("antonyms: \n", set(antonyms))

Definition:  benefit
Example:  ['for your own good', "what's the good of worrying?"]
synonyms: 
 {'serious', 'expert', 'right', 'practiced', 'honorable', 'adept', 'sound', 'unspoiled', 'soundly', 'unspoilt', 'commodity', 'skilful', 'effective', 'in_effect', 'thoroughly', 'estimable', 'goodness', 'just', 'skillful', 'good', 'honest', 'salutary', 'undecomposed', 'secure', 'safe', 'respectable', 'proficient', 'well', 'trade_good', 'near', 'in_force', 'upright', 'ripe', 'full', 'dear', 'beneficial', 'dependable'}
antonyms: 
 {'badness', 'bad', 'evil', 'evilness', 'ill'}


***N-grams***

In [45]:
from nltk.util import ngrams
from collections import Counter

# Function to extract n-grams from text
def get_ngrams(text, n):
    n_grams = ngrams(nltk.word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]  

text = 'This is a sample English sentence'

print ("1-gram: ", get_ngrams(text, 1))
print ("2-gram: ", get_ngrams(text, 2))
print ("3-gram: ", get_ngrams(text, 3))
print ("4-gram: ", get_ngrams(text, 4))

1-gram:  ['This', 'is', 'a', 'sample', 'English', 'sentence']
2-gram:  ['This is', 'is a', 'a sample', 'sample English', 'English sentence']
3-gram:  ['This is a', 'is a sample', 'a sample English', 'sample English sentence']
4-gram:  ['This is a sample', 'is a sample English', 'a sample English sentence']
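
For intuition, the same sliding-window idea can be written without NLTK by zipping a token list against shifted copies of itself. This is a minimal sketch; the name `simple_ngrams` is made up, and whitespace splitting is assumed instead of `word_tokenize`.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens, each joined into one string."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] zipped together give the sliding windows
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = "This is a sample English sentence".split()
print(simple_ngrams(tokens, 2))
# ['This is', 'is a', 'a sample', 'sample English', 'English sentence']
```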


In [46]:
text = 'Statistics skills, and programming skills are equally important for analytics. Statistics skills, and domain knowledge are important for analytics'

# remove punctuations
text = remove_punctuations(text)

# Extracting bigrams
result = get_ngrams(text,2)

# Counting bigrams
result_count = Counter(result)

print ("Words: ", result_count.keys()) # Bigrams
print ("\nFrequency: ", result_count.values()) # Bigram frequency

Words:  dict_keys(['Statistics skills', 'skills and', 'and programming', 'programming skills', 'skills are', 'are equally', 'equally important', 'important for', 'for analytics', 'analytics Statistics', 'and domain', 'domain knowledge', 'knowledge are', 'are important'])

Frequency:  dict_values([2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1])


In [47]:
# Converting the result to a data frame
import pandas as pd
df = pd.DataFrame.from_dict(result_count, orient='index')
df = df.rename(columns={0: 'frequency'}) # Name the count column; the bigrams stay in the index
df

Unnamed: 0,frequency
Statistics skills,2
skills and,2
and programming,1
programming skills,1
skills are,1
are equally,1
equally important,1
important for,2
for analytics,2
analytics Statistics,1


***Bag of Words (BoW)***

In [66]:
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create a dictionary holding the text of every file in a folder ('docs') and the matching file names ('ColNames')
def CorpusFromDir(dir_path):
    file_names = os.listdir(dir_path)  # list the folder once so texts and names stay aligned
    docs = []
    for f in file_names:
        with open(os.path.join(dir_path, f)) as fh:  # the file is closed after reading
            docs.append(fh.read())
    return dict(docs=docs, ColNames=file_names)

docs = CorpusFromDir('Data/text_files/')
print(docs)

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(docs.get('docs'))

# Create DataFrame
df = pd.DataFrame(doc_vec.toarray().transpose(), index=vectorizer.get_feature_names_out())

# Change column headers to be file names
df.columns = docs.get('ColNames')
df

{'docs': ['Statistics skills, and domain knowledge are important for analytics.', 'I like reading books and travelling.', 'Statistics skills, and programming skills are equally important for analytics.'], 'ColNames': ['Doc_2.txt', 'Doc_3.txt', 'Doc_1.txt']}


Unnamed: 0,Doc_2.txt,Doc_3.txt,Doc_1.txt
analytics,1,0,1
and,1,1,1
are,1,0,1
books,0,1,0
domain,1,0,0
equally,0,0,1
for,1,0,1
important,1,0,1
knowledge,1,0,0
like,0,1,0
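
What the count matrix contains can be reproduced by hand with `collections.Counter`. The sketch below is an approximation, not scikit-learn's implementation: it lowercases the text and uses the same default token pattern as CountVectorizer (words of two or more characters), and the name `bag_of_words` is made up. The three documents are typed in directly instead of being read from files.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: words of at least two characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(documents):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

documents = [
    'Statistics skills, and domain knowledge are important for analytics.',
    'I like reading books and travelling.',
    'Statistics skills, and programming skills are equally important for analytics.',
]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, which is the transpose of the data frame shown above.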


***TF-IDF***

In information retrieval, TF-IDF is a statistical measure of how relevant a term is to a document within a collection of documents (a corpus). Let’s break TF-IDF down and apply it to an example to understand it better.

$$\mathrm{TF}(\text{term}) = \frac{\text{Number of times term appears in a document}}{\text{Total number of terms in the document}}$$

$$\mathrm{IDF}(\text{term}) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with a given term in it}}\right)$$
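
The two definitions can be checked by hand with a few lines of plain Python. This is a sketch under simple assumptions: tokens are produced by lowercasing and whitespace splitting after removing commas and periods, the natural logarithm is used, and `idf` assumes the term occurs in at least one document. The helper names `tf` and `idf` are made up. scikit-learn's TfidfVectorizer, used in the next cell, by default applies a smoothed IDF and L2-normalizes each document vector, so its numbers will not match these raw values.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that equal the term."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

corpus = [
    'Statistics skills, and programming skills are equally important for analytics.',
    'Statistics skills, and domain knowledge are important for analytics.',
    'I like reading books and travelling.',
]
corpus_tokens = [doc.lower().replace(',', '').replace('.', '').split() for doc in corpus]

# "skills" is 2 of the 10 tokens in the first document and occurs in 2 of 3 documents
print(tf('skills', corpus_tokens[0]))   # 0.2
print(idf('skills', corpus_tokens))     # log(3/2)
# "and" occurs in every document, so its IDF (and hence its TF-IDF) is zero
print(tf('and', corpus_tokens[0]) * idf('and', corpus_tokens))
```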

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import pandas as pd

# Create a dictionary holding the text of every file in a folder ('docs') and the matching file names ('ColNames')
def CorpusFromDir(dir_path):
    file_names = os.listdir(dir_path)  # list the folder once so texts and names stay aligned
    docs = []
    for f in file_names:
        with open(os.path.join(dir_path, f)) as fh:  # the file is closed after reading
            docs.append(fh.read())
    return dict(docs=docs, ColNames=file_names)

docs = CorpusFromDir('Data/text_files/')

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(docs.get('docs'))

# Create DataFrame
df = pd.DataFrame(doc_vec.toarray().transpose(), index=vectorizer.get_feature_names_out())

# Change column headers to be file names
df.columns = docs.get('ColNames')
df

Unnamed: 0,Doc_2.txt,Doc_3.txt,Doc_1.txt
analytics,0.315269,0.0,0.276703
and,0.244835,0.283217,0.214884
are,0.315269,0.0,0.276703
books,0.0,0.479528,0.0
domain,0.414541,0.0,0.0
equally,0.0,0.0,0.363831
for,0.315269,0.0,0.276703
important,0.315269,0.0,0.276703
knowledge,0.414541,0.0,0.0
like,0.0,0.479528,0.0
