# Text processing with NLTK

NLTK (Natural Language Toolkit) is a Python library that can be used for several kinds of preprocessing on text data:

- Tokenization = splitting text into a list of sentences or words
- Morphological analysis = converting a word into its root form
- PoS tagging = tagging each word in a sentence with its part of speech
- Spelling correction = suggesting the right spelling for English words

In [1]:
import nltk

In [2]:
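nltk.download("punkt")                       # sentence and word tokenizer models
nltk.download("wordnet")                     # lexical database used by WordNetLemmatizer
nltk.download("averaged_perceptron_tagger")  # model used by nltk.pos_tag
nltk.download("tagsets")                     # tag documentation for nltk.help.upenn_tagset
# Note: recent NLTK releases may instead ask for "punkt_tab" and
# "averaged_perceptron_tagger_eng"; download those if a LookupError names them.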

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anshu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anshu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\anshu\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\anshu\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [3]:
data = "Kuala Lumpur is the capital of Malaysia. Its modern skyline is dominated by the 451m-tall Petronas Twin Towers, a pair of glass-and-steel-clad skyscrapers with Islamic motifs. The towers also offer a public skybridge and observation deck. The city is also home to British colonial-era landmarks such as the Kuala Lumpur Railway Station and the Sultan Abdul Samad Building."
print(data)

Kuala Lumpur is the capital of Malaysia. Its modern skyline is dominated by the 451m-tall Petronas Twin Towers, a pair of glass-and-steel-clad skyscrapers with Islamic motifs. The towers also offer a public skybridge and observation deck. The city is also home to British colonial-era landmarks such as the Kuala Lumpur Railway Station and the Sultan Abdul Samad Building.


In [4]:
# sentence tokenization
nltk.sent_tokenize(data)

['Kuala Lumpur is the capital of Malaysia.',
 'Its modern skyline is dominated by the 451m-tall Petronas Twin Towers, a pair of glass-and-steel-clad skyscrapers with Islamic motifs.',
 'The towers also offer a public skybridge and observation deck.',
 'The city is also home to British colonial-era landmarks such as the Kuala Lumpur Railway Station and the Sultan Abdul Samad Building.']
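
For comparison, a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace is easy to fool. The sketch below is not how `nltk.sent_tokenize` works; `naive_sent_split` and the sample sentence are hypothetical, used only to illustrate why a trained tokenizer is usually preferable.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Made-up sample containing an abbreviation
sample = "Dr. Smith visited Kuala Lumpur. He liked the skybridge."
for piece in naive_sent_split(sample):
    print(repr(piece))
```

The abbreviation `Dr.` is cut off as its own "sentence" here. Handling cases like this is what the pretrained punkt model downloaded earlier is meant for.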

In [5]:
# word tokenization
nltk.word_tokenize(data)

['Kuala',
 'Lumpur',
 'is',
 'the',
 'capital',
 'of',
 'Malaysia',
 '.',
 'Its',
 'modern',
 'skyline',
 'is',
 'dominated',
 'by',
 'the',
 '451m-tall',
 'Petronas',
 'Twin',
 'Towers',
 ',',
 'a',
 'pair',
 'of',
 'glass-and-steel-clad',
 'skyscrapers',
 'with',
 'Islamic',
 'motifs',
 '.',
 'The',
 'towers',
 'also',
 'offer',
 'a',
 'public',
 'skybridge',
 'and',
 'observation',
 'deck',
 '.',
 'The',
 'city',
 'is',
 'also',
 'home',
 'to',
 'British',
 'colonial-era',
 'landmarks',
 'such',
 'as',
 'the',
 'Kuala',
 'Lumpur',
 'Railway',
 'Station',
 'and',
 'the',
 'Sultan',
 'Abdul',
 'Samad',
 'Building',
 '.']

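### Morphological Analysis
- Converting a word to its root form
    - cars -> car
    - boxes -> box
    - went -> go

- Stemming: faster, less accurate (strips suffixes by rule, so the result may not be a real word)
- Lemmatization: slower, more accurate (looks the word up in a dictionary, WordNet here, and returns a real word)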

In [6]:
# stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("cars")

'car'

In [7]:
ps.stem("boxes")

'box'

In [8]:
ps.stem("wives")

'wive'

In [9]:
# lemmatization
from nltk.stem import WordNetLemmatizer
wd = WordNetLemmatizer()
wd.lemmatize("wives")

'wife'

In [10]:
wd.lemmatize("children")

'child'

In [11]:
wd.lemmatize("went", 'v')  # 'v' = verb; without it the word is treated as a noun

'go'

### PoS Tagging

In [12]:
data = "Hello all, I like python and how about you?"
nltk.pos_tag(nltk.word_tokenize(data))

[('Hello', 'NNP'),
 ('all', 'DT'),
 (',', ','),
 ('I', 'PRP'),
 ('like', 'VBP'),
 ('python', 'NNS'),
 ('and', 'CC'),
 ('how', 'WRB'),
 ('about', 'IN'),
 ('you', 'PRP'),
 ('?', '.')]

In [13]:
nltk.help.upenn_tagset("NNP")

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
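
`nltk.pos_tag` returns Penn Treebank tags (`VBP`, `NNS`, ...), while `WordNetLemmatizer.lemmatize` expects a single letter such as `'v'` or `'n'`. A minimal sketch of a bridge between the two, assuming a simple prefix rule is good enough; `penn_to_wordnet` is a hypothetical helper, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default

# A few (word, tag) pairs taken from the pos_tag output above
tagged = [("I", "PRP"), ("like", "VBP"), ("python", "NNS")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With the lemmatizer from the earlier cell, each token could then be lemmatized as `wd.lemmatize(word, penn_to_wordnet(tag))`.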


### Spelling Correction

Jaccard distance compares the sets of characters in two words: the higher the distance, the less similar the words.

In [16]:
nltk.jaccard_distance(set("orange"),set("help"))

0.8888888888888888

In [17]:
nltk.jaccard_distance(set("orange"),set("orenge"))

0.16666666666666666
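
The two values above can be checked by hand. Jaccard distance between sets A and B is 1 - |A ∩ B| / |A ∪ B|. A minimal pure-Python sketch of that formula, assuming non-empty inputs; `char_jaccard_distance` is a hypothetical name, not an NLTK function.

```python
def char_jaccard_distance(a, b):
    """Jaccard distance: (|A ∪ B| - |A ∩ B|) / |A ∪ B|. Assumes the union is non-empty."""
    union = a | b
    intersection = a & b
    return (len(union) - len(intersection)) / len(union)

# "orange" and "help" share only 'e' out of 9 distinct letters -> 8/9
print(char_jaccard_distance(set("orange"), set("help")))
# "orange" and "orenge" share 5 of 6 distinct letters -> 1/6
print(char_jaccard_distance(set("orange"), set("orenge")))
```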

In [18]:
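dictionary = ['grapes', 'orange', 'banana', 'apple']

def recommend(word):
    """Return the dictionary word closest to `word` by Jaccard distance."""
    ans = ""
    dist_score = 1  # maximum possible Jaccard distance
    for w in dictionary:
        # compare the sets of characters in the two words
        dist = nltk.jaccard_distance(set(w), set(word))
        if dist < dist_score:
            dist_score = dist
            ans = w
    # ans stays "" if no dictionary word shares a character with `word`
    return ans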

In [19]:
recommend("applo")

'apple'

In [20]:
recommend("orangoo")

'orange'
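
One limitation of comparing character sets is that order and repeated letters are ignored, so anagrams get a distance of 0. A possible variation, sketched here in pure Python, compares sets of character bigrams instead. This is a swapped-in technique, not the approach used in the cells above; `char_ngrams`, `jaccard` and `recommend_ngrams` are hypothetical helpers.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams in `word` (hypothetical helper)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard distance between two sets; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)

def recommend_ngrams(word, dictionary, n=2):
    """Pick the dictionary word whose n-gram set is closest to that of `word`."""
    return min(dictionary,
               key=lambda w: jaccard(char_ngrams(w, n), char_ngrams(word, n)))

words = ['grapes', 'orange', 'banana', 'apple']
print(recommend_ngrams("applo", words))
print(recommend_ngrams("orangoo", words))

# Character sets cannot tell anagrams apart; bigrams can
print(jaccard(set("listen"), set("silent")))                  # 0.0
print(jaccard(char_ngrams("listen"), char_ngrams("silent")))  # greater than 0
```

Bigrams keep some information about letter order at the cost of slightly more work per comparison. For very short words the bigram sets are tiny, so the character-set version may still behave better there.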