<a href="https://colab.research.google.com/github/bjohn22/Natural-Language-Processing/blob/main/NLP_Processing_and_Understanding_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To carry out different operations on text and analyze it, you need to process and parse textual data into formats that are easier to interpret.
> Machine learning (ML) algorithms, whether supervised or unsupervised, usually work with input features that are numeric in nature.
> Input text is usually raw and highly unstructured.

Main objectives:
> Clean, normalize, and pre-process the initial textual data.

Pre-processing:
> Using a variety of techniques to convert raw text into well-defined sequences of linguistic components that have a standard structure and notation.

List of common tasks:
1. Tokenization
1. Tagging
1. Chunking
1. Stemming
1. Lemmatization
1. Misspelled text correction
1. Stopwords removal



# Text Tokenization
Tokens are independent, minimal textual components that have a definite syntax and semantics. Two levels are covered here:
> sentence and word tokenization

## Sentence Tokenization (Sentence Segmentation)
Sentence tokenization splits a text corpus into sentences, which act as the first level of tokens that make up the corpus.
* A text corpus is a body of text in which each paragraph comprises several sentences.

Basic techniques:
* Look for specific delimiters between sentences, e.g. a period (.), a newline character (\n), and sometimes a semicolon (;).
* From the NLTK framework, use any of:
> * sent_tokenize
> * PunktSentenceTokenizer
> * RegexpTokenizer
> * Pre-trained sentence tokenization models
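
The delimiter-based technique in the first bullet can be sketched with the standard `re` module alone. This is only an illustration: the function name `naive_sent_split` and the sample text are made up here, and the approach is deliberately naive.

```python
import re

def naive_sent_split(text):
    # Split on whitespace that follows ., !, ? or ;, or on runs of newlines
    parts = re.split(r'(?<=[.!?;])\s+|\n+', text)
    # Drop empty pieces and surrounding whitespace
    return [part.strip() for part in parts if part and part.strip()]

text = "Dr. Smith arrived late. He apologized!\nThe meeting went on; nobody minded."
print(naive_sent_split(text))
# ['Dr.', 'Smith arrived late.', 'He apologized!', 'The meeting went on;', 'nobody minded.']
```

Note how the abbreviation `Dr.` is wrongly treated as a sentence of its own. Handling cases like this is the reason to prefer the NLTK tokenizers listed above.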

Let's see how these work!

In [6]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint
import re
import string

In [9]:
nltk.download('gutenberg')
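# Newer NLTK releases look for the 'punkt_tab' resource instead of 'punkt';
# download that one if sent_tokenize raises a LookupError.
nltk.download('punkt')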

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
# Load Alice in Wonderland, a part of the Gutenberg corpus that is available in NLTK itself.
# A short sample text is defined in the next cell.
alice = gutenberg.raw(fileids='carroll-alice.txt')

In [11]:
sample_text = 'We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'

In [12]:
#check the length of the Alice in Wonderland corpus and also the first few lines
# Total characters in Alice in Wonderland
print('length is', len(alice),':', '1st 100 is', alice[0:100])


length is 144395 : 1st 100 is [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


`nltk.sent_tokenize` is the default sentence tokenization function that NLTK recommends. It uses an instance of the `PunktSentenceTokenizer` class internally.

In [13]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
print ('Total sentences in sample_text:', len(sample_sentences))
print ('Sample text sentences :-' )
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
print ('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the t

The tokenizer is quite intelligent: it does not rely on periods alone to delimit sentences, but also considers other punctuation and the capitalization of words.

In [14]:
# Using PunktSentenceTokenizer directly; the output is the same as with sent_tokenize.
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']


In [15]:
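# Sentence tokenization with RegexpTokenizer; the output is the same as before.
# The pattern matches whitespace that follows ., ? or !, unless the text just before it
# looks like an abbreviation (e.g. "i.e.", "Mr." or a single capital initial such as "J.").
# gaps=True means the pattern describes the separators between tokens, not the tokens themselves.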
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps = True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)



['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
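
To see what the lookbehind assertions in that pattern do, it can be tried on text containing abbreviations. The sketch below uses `re.split` from the standard library as a stand-in for `RegexpTokenizer(..., gaps=True)`; the sample sentence is invented for illustration.

```python
import re

# Same pattern as in the cell above
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

text = "Mr. Smith uses NLP, i.e. text processing. It works! Does it?"
sentences = re.split(SENTENCE_TOKENS_PATTERN, text)
print(sentences)
# ['Mr. Smith uses NLP, i.e. text processing.', 'It works!', 'Does it?']
```

The space after `Mr.` is rejected by `(?<![A-Z][a-z]\.)` and the space after `i.e.` by `(?<!\w\.\w.)`, so neither abbreviation ends a sentence.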


## Word Tokenization
* Word tokenization is the process of splitting or segmenting sentences into their constituent words.
* A sentence is a collection of words.
> With tokenization we essentially split a sentence into a list of words that can be used to reconstruct the sentence.

* Word tokenization is an important step in cleaning and normalizing text,
> because operations such as stemming and lemmatization are performed on individual words.

From the nltk package:
* word_tokenize (the default, recommended go-to function)
* TreebankWordTokenizer
* RegexpTokenizer
* Inherited tokenizers from RegexpTokenizer

In [16]:
#word tokenization for a single sentence
sentence = "The brown fox wasn't that quick and he couldn't win the race"
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
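
A regex-based word tokenizer, in the spirit of `RegexpTokenizer`, can be sketched with `re.findall` alone. The two patterns below are illustrative choices, not NLTK's own rules; they show how the pattern decides whether a contraction is split.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Runs of word characters, or any single non-space symbol
simple_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Same idea, but an apostrophe inside a word keeps the contraction together
contraction_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

print(simple_tokens)
# ['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']
print(contraction_tokens)
# ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
```

Neither result matches `word_tokenize`, which splits `wasn't` into `was` and `n't`. Which behavior is preferable depends on how contractions are handled later in the pipeline.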


# Text Normalization (text cleansing or wrangling)
Text normalization is a series of steps for wrangling, cleaning, and standardizing textual data into a form that other NLP and analytics systems and applications can consume as input.
* Techniques include: cleaning text, case conversion, correcting spellings,
removing stopwords and other unnecessary terms, ***stemming, and lemmatization.***
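
Case conversion is the simplest of these techniques. A minimal sketch using only built-in string methods (the sample strings are made up for illustration):

```python
text = "Python is an AMAZING Language"

lowered = text.lower()        # usual choice before matching against lowercase word lists
uppered = text.upper()
folded = "Straße".casefold()  # casefold() is a more aggressive lower() for caseless matching

print(lowered)  # python is an amazing language
print(uppered)  # PYTHON IS AN AMAZING LANGUAGE
print(folded)   # strasse
```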


In [17]:
corpus = ["The brown fox wasn't that quick and he couldn't win the race",
"Hey that's a great deal! I just bought a phone for $199",
"@@You'll (learn) a **lot** in the book. Python is an amazing language !@@"]

### Cleaning Text
Often the textual data we want to use or analyze contains a lot of extraneous and unnecessary tokens and characters that should be removed before performing any further operations like tokenization or other normalization techniques. This includes *extracting meaningful text from data sources like HTML*, which contains unnecessary HTML tags, or from XML and JSON feeds. There are many ways to parse and clean this data to remove unnecessary tags. For HTML, ***the BeautifulSoup library*** is the usual choice; the `clean_html()` function from older nltk releases is no longer implemented in NLTK 3, which directs you to BeautifulSoup instead. You can also use your own custom logic, including regexes, XPath, and the lxml library, to parse XML data. Getting data from JSON is substantially easier because it has definite key-value annotations.
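
As a rough illustration of stripping tags, the sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, plus `json` for a JSON record. The class name `TextExtractor`, the helper `strip_html`, and the sample inputs are all hypothetical.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text pieces and collapse repeated whitespace
    return ' '.join(' '.join(parser.chunks).split())

html_doc = "<html><body><h1>Python</h1><p>An <b>amazing</b> language!</p></body></html>"
print(strip_html(html_doc))  # Python An amazing language!

# JSON already has key-value structure, so the text field can be read directly
record = json.loads('{"title": "Python", "body": "An amazing language!"}')
print(record["body"])  # An amazing language!
```

Joining the pieces with a space is a simplification: it would also insert a space where an inline tag splits a single word.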

## Tokenizing Text
Usually, we tokenize text before or after removing unnecessary characters and symbols
from the data. This choice depends on the problem you are trying to solve and the data
you are dealing with.



In [18]:
#We will define a generic tokenization function here and run the same on our corpus mentioned earlier.
#It takes in textual data, extracts sentences from it, and 
#finally splits each sentence into further tokens, which could be words or special characters and punctuation.

def tokenize_text(text):
  sentences = nltk.sent_tokenize(text)
  word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
  return word_tokens

In [19]:
token_list = [tokenize_text(text) for text in corpus]
pprint(token_list)

[[['The',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book',
   '.'],
  ['Python', 'is', 'an', 'amazing', 'language', '!'],
  ['@', '@']]]


## Removing Special Characters
Special characters may be symbols or punctuation that occur in sentences.
This step is performed either before or after tokenization. The main reason for doing it is that punctuation and special characters often do not have much significance when we
analyze the text and use it to extract features or information based on NLP and ML.

In [20]:
#remove special characters after tokenization:
#We use the string.punctuation attribute, which contains all the ASCII punctuation characters, and build a regex pattern from it.
#We use the pattern to strip those characters from every token.
#filter(None, ...) then drops the tokens that became empty; list(...) materializes the result, because filter returns a lazy iterator in Python 3.
def remove_characters_after_tokenization(tokens):
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = list(filter(None, [pattern.sub('', token) for token in tokens]))
    return filtered_tokens

In [21]:
filtered_list_1 = [list(filter(None, [remove_characters_after_tokenization(tokens) for tokens in sentence_tokens])) for sentence_tokens in token_list]
print(filtered_list_1)

[[['The', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book'], ['Python', 'is', 'an', 'amazing', 'language']]]


In [23]:
#remove special characters before tokenization:
def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # Remove only the listed symbols (add other characters here to remove them);
        # inside a character class the pipe is a literal character, so one is enough
        PATTERN = r'[?|$&*%@()~]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        # Keep only alphanumeric characters and spaces
        PATTERN = r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    return filtered_sentence

In [24]:
filtered_list_2 = [remove_characters_before_tokenization(sentence) for sentence in corpus]
print (filtered_list_2)

['The brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal I just bought a phone for 199', 'Youll learn a lot in the book Python is an amazing language ']


In [25]:
cleaned_corpus = [remove_characters_before_tokenization(sentence, keep_apostrophes=True) for sentence in corpus]
print (cleaned_corpus)

["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for 199", "You'll learn a lot in the book. Python is an amazing language !"]


The preceding outputs show two different ways of removing special characters before tokenization using regular expressions: removing all special characters, versus retaining apostrophes and sentence periods. Regular expressions are a powerful tool for this kind of cleanup, as mentioned in Chapter 2. After removing these characters, you can usually take the clean text and tokenize it or apply other normalization operations to it. Sometimes we want to preserve the apostrophes in the sentences so that we can track contractions and expand them if needed.
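
The same two behaviors can also be obtained without regular expressions. The sketch below swaps in `str.translate` with a deletion table built from `string.punctuation`; the helper name `strip_punctuation` is made up for this example.

```python
import string

def strip_punctuation(text, keep_apostrophes=False):
    # Characters to delete: all ASCII punctuation, optionally sparing the apostrophe
    to_remove = string.punctuation
    if keep_apostrophes:
        to_remove = to_remove.replace("'", "")
    return text.translate(str.maketrans('', '', to_remove))

sentence = "Hey that's a great deal! I just bought a phone for $199"
print(strip_punctuation(sentence))
# Hey thats a great deal I just bought a phone for 199
print(strip_punctuation(sentence, keep_apostrophes=True))
# Hey that's a great deal I just bought a phone for 199
```

Unlike the regex with `keep_apostrophes=True` above, this variant also removes `!` and `.`, so it is not a drop-in replacement when sentence boundaries still matter.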

#### Expanding Contractions
* Contractions are shortened versions of words or syllables.
* Examples are `is not` contracted to `isn’t` and `will not` contracted to `won’t`.

By nature, contractions do pose a problem for NLP and text analytics because, to
start with, we have a special apostrophe character in the word. Plus we have two or more words represented by a contraction, and this opens a whole new can of worms when we try to tokenize this or even standardize the words.

###### Approach to dealing with contractions
* have a proper mapping for contractions and their corresponding expansions
* then use it to expand all the contractions in the text
* I have created a vocabulary for contractions and their corresponding expanded forms that you can access in the file **`contractions.py`** in a Python dictionary (available along with the code files for this chapter). Part of the contractions dictionary is shown below


In [35]:
CONTRACTION_MAP = {
"isn't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
"wasn't": "was not",
"couldn't": "could not",
"that's": "that is",
"you'll": "you will"
}

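In [34]:
# CONTRACTION_MAP is defined inline in the cell above, so no import is needed here.
# If contractions.py from the chapter's code files is on your path, the full mapping
# could presumably be loaded instead with: from contractions import CONTRACTION_MAP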

* The `expand_contractions` function uses the nested helper `expand_match` to handle each contraction matched by the regex pattern built from all the keys in our ***CONTRACTION_MAP*** dictionary. Each matched contraction is substituted with its expanded version, and the case of its first character is retained.

In [36]:
#Implementing the contractions map

def expand_contractions(sentence, contraction_mapping):
    # Try longer contractions first so that "can't've" is not cut short at "can't"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in keys)),
                                      flags=re.IGNORECASE | re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the exact match first, then fall back to its lowercase form
        expanded_contraction = (contraction_mapping.get(match)
                                or contraction_mapping.get(match.lower()))
        # Keep the case of the first character of the original contraction
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction
    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    return expanded_sentence


In [37]:
expanded_corpus = [expand_contractions(sentence, CONTRACTION_MAP) for sentence in cleaned_corpus]
expanded_corpus

['The brown fox was not that quick and he could not win the race',
 'Hey that is a great deal! I just bought a phone for 199',
 'You will learn a lot in the book. Python is an amazing language !']
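
One subtlety when building a pattern from dictionary keys: regex alternation tries the alternatives from left to right, so when one contraction is a prefix of another (such as `can't` and `can't've` in the map above), the order in which the keys are joined matters. A minimal sketch with a two-entry mapping made up for illustration:

```python
import re

mapping = {"can't": "cannot", "can't've": "cannot have"}

# Keys joined in dictionary order: "can't" wins before "can't've" is tried
naive_pattern = re.compile('|'.join(mapping))
# Keys joined longest first: the longer contraction gets the first chance to match
ordered_pattern = re.compile('|'.join(sorted(mapping, key=len, reverse=True)))

def expand(match):
    return mapping[match.group(0)]

text = "I can't've done it"
print(naive_pattern.sub(expand, text))    # I cannot've done it
print(ordered_pattern.sub(expand, text))  # I cannot have done it
```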

#### Removing Stopwords
* These are words that have little or no significance.
> They are usually removed from text during processing so as to retain the words that carry the most significance and context. Stopwords are usually the words that turn out to occur most often if you aggregate a corpus of text by single tokens and check their frequencies. Words like *a*, *the*, *me*, and so on are stopwords.

* One important thing to remember is that negations like `not` and `no` are part of NLTK's English stopword list, so they are removed as well.
> It is often essential to preserve them so the actual context of the sentence is not lost in applications like *sentiment analysis*,
> so you would need to make sure ***you do not remove such words in those scenarios***.
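
The frequency-based view of stopwords can be checked with `collections.Counter`. The sentence below is a toy example invented for illustration; on a real corpus the same counting would be done over its tokens.

```python
from collections import Counter

text = "the cat sat on the mat and the dog lay on the rug"
counts = Counter(text.split())

# The most frequent tokens are function words, not content words
print(counts.most_common(2))  # [('the', 4), ('on', 2)]
```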

In [41]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [47]:
#See the list of stopwords in nltk's vocabulary
nltk.corpus.stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [42]:
# We use NLTK's list of English stopwords to filter out all tokens that are stopwords.
# Note: the list is lowercase and the comparison is case-sensitive,
# so capitalized tokens such as 'The' or 'I' are kept.

def remove_stopwords(tokens):
    # A set gives constant-time membership checks
    stopword_list = set(nltk.corpus.stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

In [43]:
expanded_corpus_tokens = [tokenize_text(text)
                          for text in expanded_corpus]

In [45]:
filtered_list_3 = [[remove_stopwords(tokens)
                  for tokens in sentence_tokens]
                  for sentence_tokens in expanded_corpus_tokens
                   ]
filtered_list_3

[[['The', 'brown', 'fox', 'quick', 'could', 'win', 'race']],
 [['Hey', 'great', 'deal', '!'], ['I', 'bought', 'phone', '199']],
 [['You', 'learn', 'lot', 'book', '.'],
  ['Python', 'amazing', 'language', '!']]]
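
In the output above, `not` has been removed from the first sentence, which changes its meaning. One way to preserve negations is to subtract them from the stopword set before filtering. The sketch below uses a small hand-picked stopword set so that it runs without NLTK data; in practice the set would start from `nltk.corpus.stopwords.words('english')`. The function name and the `NEGATIONS` set are hypothetical.

```python
# Illustrative subset only, not NLTK's full list
STOPWORDS = {'the', 'was', 'not', 'no', 'that', 'and', 'he', 'a', 'is', 'an'}
NEGATIONS = {'not', 'no'}

def remove_stopwords_keep_negations(tokens, stopwords=STOPWORDS, keep=NEGATIONS):
    # Negations are taken out of the stopword set, so they survive filtering
    effective_stopwords = set(stopwords) - set(keep)
    # Compare in lowercase so that 'The' is treated like 'the'
    return [token for token in tokens if token.lower() not in effective_stopwords]

tokens = ['The', 'brown', 'fox', 'was', 'not', 'that', 'quick']
print(remove_stopwords_keep_negations(tokens))
# ['brown', 'fox', 'not', 'quick']
```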

#### Stemming
* We start by explaining morphemes.
* Morphemes are the smallest independent units in any natural language.
* Morphemes consist of units that are stems and affixes.
* Affixes are units like prefixes, suffixes, and so on, which are attached to a word stem to change its meaning or create a new word altogether.
* Word stems are also often known as the *base form* of a word, and we can create new words by attaching affixes to them in a process known as *inflection*.
* The reverse of this is obtaining the base form of a word from its inflected form, and this is known as ***stemming***.
* The *nltk* package has several stemmer implementations. They live in the *stem module*, and each stemmer class implements the *StemmerI* interface defined in the *nltk.stem.api* module.
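
To make the idea of stripping affixes concrete, here is a toy suffix-stripping stemmer. It is not the Porter, Lancaster, or Snowball algorithm; the suffix list and the minimum stem length are arbitrary assumptions chosen for illustration.

```python
# Hypothetical, tiny rule set; real stemmers use many ordered rules with conditions
SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word, min_stem_len=3):
    for suffix in SUFFIXES:
        # Strip the first suffix that matches, as long as enough of the word remains
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for word in ['jumping', 'jumps', 'jumped', 'houses', 'running']:
    print(word, '->', toy_stem(word))
# jumping -> jump
# jumps -> jump
# jumped -> jump
# houses -> hous
# running -> runn
```

The last two results show that blind suffix stripping can yield stems that are not dictionary words.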

In [48]:
# Porter Stemmer
from nltk.stem import PorterStemmer

In [49]:
ps = PorterStemmer()

In [50]:
print (ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))

jump jump jump


In [52]:
print (ps.stem('lying'), ps.stem('strange'))

lie strang


In [53]:
# An alternative stemmer that gives different results for some words
# Lancaster Stemmer

from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

In [54]:
print(ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped'), ls.stem('lying'), ls.stem('strange'))

jump jump jump lying strange


In [55]:
# Yet another alternative stemmer, again with slightly different results
# Snowball Stemmer
from nltk.stem import SnowballStemmer

In [56]:
ss = SnowballStemmer("english")

In [57]:
print(ss.stem('jumping'), ss.stem('jumps'), ss.stem('jumped'), ss.stem('lying'), ss.stem('strange'))

jump jump jump lie strang


#### Lemmatization
* The process of *lemmatization* is very similar to stemming: you remove word affixes to get to a base form of the word. In this case, however, the base form is the *root word* rather than the *root stem*.
* The difference is that the root stem may not always be a lexicographically correct word; that is, it may not be present in the dictionary. The *root word*, also known as the *lemma*, will always be present in the dictionary.


The lemmatization process is considerably slower than stemming because an additional step is involved: the root form or lemma is formed by removing the affix from the word if and only if the resulting lemma is present in the dictionary.



In [58]:
from nltk.stem import WordNetLemmatizer

In [61]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [62]:
wnl = WordNetLemmatizer()

In [63]:
# lemmatize nouns
print (wnl.lemmatize('cars', 'n'), wnl.lemmatize('houses', 'n'))

car house


In [64]:
# lemmatize verbs
print (wnl.lemmatize('running', 'v'), wnl.lemmatize('ate', 'v'), wnl.lemmatize('thought', 'v'))

run eat think


In [67]:
# lemmatize adjectives
print (wnl.lemmatize('saddest', 'a'), wnl.lemmatize('better', 'a'), wnl.lemmatize('tallest', 'a'), wnl.lemmatize('fancier', 'a'))

sad good tall fancy


The preceding code leverages the `WordNetLemmatizer` class, which internally uses the `morphy()` function belonging to the `WordNetCorpusReader` class. Given a word and its part of speech, this function finds the base form or lemma by checking the WordNet corpus, recursively removing affixes from the word until a match is found in WordNet. If no match is found, the input word itself is returned unchanged.
* ***The part of speech is extremely important here because if that is wrong, the lemmatization will not be effective.***
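
The role of the dictionary check and of the part of speech can be illustrated with a toy lemmatizer. This is only a sketch under stated assumptions: the vocabulary, exception table, and suffix rules below are invented stand-ins, not WordNet data or the actual `morphy()` rules.

```python
# Hypothetical mini-lexicon per part of speech; WordNet itself is far larger
VOCAB = {'n': {'car', 'house', 'run'}, 'v': {'run', 'eat', 'jump'}}
# Irregular forms that no suffix rule can recover
EXCEPTIONS = {'v': {'ate': 'eat'}}
# (suffix, replacement) pairs tried in order for each part of speech
RULES = {
    'n': [('s', ''), ('es', '')],
    'v': [('ning', ''), ('ing', ''), ('ed', ''), ('s', '')],
}

def toy_lemmatize(word, pos='n'):
    word = word.lower()
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    vocab = VOCAB.get(pos, set())
    if word in vocab:
        return word
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Accept the candidate only if it is a known dictionary word
            if candidate in vocab:
                return candidate
    # No dictionary match: return the input unchanged
    return word

print(toy_lemmatize('cars', 'n'))     # car
print(toy_lemmatize('running', 'v'))  # run
print(toy_lemmatize('ate', 'v'))      # eat
print(toy_lemmatize('ate', 'n'))      # ate
```

With the wrong part of speech (`'ate'` treated as a noun), no rule or exception applies and the word comes back unchanged.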