In [1]:
# Upgrading dependencies
!pip install --upgrade pip
!pip install --upgrade scikit-learn




In [2]:
import re, string
import nltk


In [3]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer


In [5]:
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

# Working with simple text cleaning processes

In [6]:
txt = "   This is a message to be cleaned. It might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     .  "
print(txt)


   This is a message to be cleaned. It might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     .  


## Changing the text so that it is all in lowercase.

In [7]:
txt = txt.lower()
print(txt)


   this is a message to be cleaned. it might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     .  


## Removing any leading and trailing whitespaces

In [8]:
txt = txt.strip()
print(txt)


this is a message to be cleaned. it might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     .


## Using Regular Expressions to remove any HTML tags or Markup
The regular expression `<.*?>` is commonly used to match HTML or XML tags. It is a great example of how greedy vs non-greedy quantifiers work in regex.
<br />
**Breakdown of `<.*?>`**<br />
`<` - Matches a literal `<` character. This is the opening of a tag. <br />
`.*?` - This is the key part. <br />
  `.` - Matches any character except newline <br />
  `*` - Means 'zero or more' of the preceding element <br />
  `?` - Makes the `*` non-greedy (i.e., match as little as possible) <br />
`>` - Matches a literal `>` character. This is the closing of a tag. <br />
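
To make the greedy vs non-greedy difference concrete, here is a minimal sketch that applies both forms of the pattern to the same string. The `sample` string is a made-up input for illustration, not part of the notebook's data.

```python
import re

# Hypothetical input containing several tags on one line
sample = "line one<br>line two<b>bold</b>"

# Greedy: `.*` runs to the LAST `>` on the line, so the text between tags is lost
greedy = re.sub(r'<.*>', '', sample)

# Non-greedy: `.*?` stops at the FIRST `>`, so each tag is removed on its own
non_greedy = re.sub(r'<.*?>', '', sample)

print(repr(greedy))      # 'line one'
print(repr(non_greedy))  # 'line oneline twobold'
```

Note that the non-greedy version removes only the tags themselves, so words on either side of a tag can end up joined together.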



In [9]:
txt = re.compile('<.*?>').sub('', txt)
print(txt)


this is a message to be cleaned. it might involve some things like: , ?, :, ''  adjacent spaces, and tabs     .


## Trying re.sub() instead of re.compile()

In [10]:
txt1 = "this is a message to be cleaned. it might involve some things like: <br>, ?, :, ''  adjacent spaces, and tabs     ."
txt1 = re.sub(r'<.*?>', '', txt1)
print(txt1)


this is a message to be cleaned. it might involve some things like: , ?, :, ''  adjacent spaces, and tabs     .
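
Both forms give the same result. As a rough sketch of when each is convenient: a compiled pattern is handy when the same expression is applied to many strings, while `re.sub()` is shorter for a one-off call. The `messages` list and the name `TAG_RE` below are hypothetical and used only for illustration.

```python
import re

# Compile once, then reuse the pattern object for every string
TAG_RE = re.compile(r'<.*?>')

# Hypothetical inputs
messages = ["first<br>line", "<p>second</p>", "no tags here"]

with_compiled = [TAG_RE.sub('', m) for m in messages]
with_module_fn = [re.sub(r'<.*?>', '', m) for m in messages]

print(with_compiled)                    # ['firstline', 'second', 'no tags here']
print(with_compiled == with_module_fn)  # True
```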


## Replacing punctuation with space.
We should be careful with this task, as depending on the application, punctuation can actually be useful. For instance, punctuation might affect the positive or negative meaning of a sentence. <br />
`string.punctuation` - This is a predefined string in the `string` module that contains all standard punctuation characters.

```
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```
<br />

The `re.escape()` function escapes the characters that could be interpreted as special regex characters. On Python 3.7 and later, the result looks like this: <br />

```
!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~
```
This makes sure each punctuation character is treated literally in the regex. Older Python versions escape more of these characters, which does not change what the pattern matches. <br />



```
'[%s]' % re.escape(string.punctuation)
```
This creates a regex character class. <br />
`'[%s]'` is a format string in which `%s` is a placeholder that the `%` operator fills with the escaped punctuation string. <br />
The `[` and `]` are literal characters in the final string. <br />
The resulting string is used as a regular expression pattern that matches any one of those punctuation characters.
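
As a small sketch of the caution above, the same construction can be used with a reduced punctuation set when some marks should survive. The `sample` string and the choice to keep `!` and `?` are assumptions made for illustration only.

```python
import re
import string

# Pattern that matches any single punctuation character
punct_pattern = re.compile('[%s]' % re.escape(string.punctuation))

# Hypothetical input
sample = "wait, what?! that's great..."
all_removed = punct_pattern.sub(' ', sample)
print(all_removed.split())   # ['wait', 'what', 'that', 's', 'great']

# Hypothetical variant: keep '!' and '?' because they may carry sentiment
keep = "!?"
reduced = "".join(ch for ch in string.punctuation if ch not in keep)
keep_pattern = re.compile('[%s]' % re.escape(reduced))
some_kept = keep_pattern.sub(' ', sample)
print(some_kept.split())     # ['wait', 'what?!', 'that', 's', 'great']
```

The outputs are shown after `split()` only to make them easier to read; the substituted strings still contain runs of spaces that a later step has to collapse.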




In [11]:
txt = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', txt)
print(txt)


this is a message to be cleaned  it might involve some things like              adjacent spaces  and tabs      


## Removing any extra space and tabs

In [12]:
txt = re.sub(r'\s+', ' ', txt)
print(txt)


this is a message to be cleaned it might involve some things like adjacent spaces and tabs 
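
The output above still ends with a single space, because a trailing run of whitespace is collapsed to one space rather than removed. The sketch below, using a made-up `sample` string, shows that `\s+` also covers tabs and newlines and that a final `strip()` removes the leftover space.

```python
import re

# Hypothetical input mixing spaces, a tab, and newlines
sample = "adjacent   spaces,\ttabs\n\nand newlines  "

# `\s` matches any whitespace character, so every run becomes one space
collapsed = re.sub(r'\s+', ' ', sample)
print(repr(collapsed))          # 'adjacent spaces, tabs and newlines '

# strip() drops the single space left at the end
print(repr(collapsed.strip()))  # 'adjacent spaces, tabs and newlines'

# split() + join() collapses and strips in one step
print(" ".join(sample.split()) == collapsed.strip())  # True
```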


# Working with lexicon-based text processing
Lexicon-based text processing methods are applied after the common text-processing methods. They are used to normalize sentences in the dataset. Normalization means putting words into a common form, so that sentences with similar content also look more alike. <br />
These methods rely on some NLTK resources, which were downloaded at the top of the notebook with `nltk.download()`: <br />


*   `punkt` - This is a pretrained sentence tokenizer for the English language.
*   `averaged_perceptron_tagger` - This is a part-of-speech tagger
*   `wordnet` - This is a large database of English words that can be used to find the meanings of words, synonyms, antonyms, and more.



## Stopword removal
Some words occur very frequently in sentences and contribute little to their overall meaning. Typically, we keep a list of such words and remove them from each sentence. For example, stopwords include "a", "an", "the", "this", "that", "is", "it", "to", and "and".

In [13]:
filtered_sentence = []

# We can adjust the stopwords to suit our problem
stopwords = ['a', 'an', 'the', 'this', 'that', 'is', 'it', 'to', 'and']

# Tokenizing the sentence
words = word_tokenize(txt)

for w in words:
  if w not in stopwords:
    filtered_sentence.append(w)

text_ = " ".join(filtered_sentence)


In [14]:
print(text_)


message be cleaned might involve some things like adjacent spaces tabs
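
The same filtering can be sketched more compactly with a set and a generator expression. This is only an illustrative variant: the function name `remove_stopwords` is hypothetical, and it assumes the text is already cleaned so that `str.split()` can stand in for `word_tokenize`.

```python
# Same stopwords as above, stored in a set for fast membership tests
STOPWORDS = {'a', 'an', 'the', 'this', 'that', 'is', 'it', 'to', 'and'}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from whitespace-separated, already-cleaned text."""
    return " ".join(w for w in text.split() if w not in stopwords)

# The cleaned sentence produced by the earlier steps
cleaned = "this is a message to be cleaned it might involve some things like adjacent spaces and tabs "
print(remove_stopwords(cleaned))
# message be cleaned might involve some things like adjacent spaces tabs
```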


## Stemming words
Stemming is a rule-based system for converting words into their root form. It removes suffixes from words. This process helps enhance similarities (if any) between sentences. For example, "jumping", "jumped" -> "jump" or "cars" -> "car".

In [15]:
# Initializing the stemmer
sstemmer = SnowballStemmer('english')


In [16]:
stemmed_sentence = []

# Tokenizing the sentence
words = word_tokenize(text_)

for w in words:
  stemmed_sentence.append(sstemmer.stem(w))

stemmed_text = " ".join(stemmed_sentence)


In [17]:
print(stemmed_text)


messag be clean might involv some thing like adjac space tab


This stemming operation is not perfect. It produced stems that are not real words, such as `involv`, `messag`, and `adjac`. Stemming is a rule-based method, so it sometimes strips suffixes that should have been kept. Nevertheless, it runs quickly.
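
To see the intended behaviour and the over-stemming side by side, here is a short sketch that runs the same Snowball stemmer on the example words from the text and on three words from the sentence. It needs only the `nltk` package, with no extra downloads.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# First three words: examples from the text; last three: words from the sentence
for word in ["jumping", "jumped", "cars", "message", "involve", "adjacent"]:
    print(word, "->", stemmer.stem(word))
```

The first three words reduce to the real words `jump` and `car`, while the last three give the truncated stems seen in the output above.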

## Lemmatizing words
Since the result of stemming was not satisfactory, we can use lemmatization instead. It usually requires more work, but it gives better results.

In [18]:
# Initializing the lemmatizer
word_lemmatizer = WordNetLemmatizer()


In [19]:
# This is a helper function to map NLTK part-of-speech (POS) tags to WordNet POS constants
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

def get_wordnet_pos(tag):
  if tag.startswith('J'):
    return wordnet.ADJ
  elif tag.startswith('V'):
    return wordnet.VERB
  elif tag.startswith('N'):
    return wordnet.NOUN
  elif tag.startswith('R'):
    return wordnet.ADV
  else:
    return wordnet.NOUN


In [20]:
lemmatized_sentence = []

# Tokenize the sentence
words = word_tokenize(text_)


In [21]:
words


['message',
 'be',
 'cleaned',
 'might',
 'involve',
 'some',
 'things',
 'like',
 'adjacent',
 'spaces',
 'tabs']

In [22]:
# Getting part-of-speech (POS) tags
word_pos_tags = nltk.pos_tag(words)


In [23]:
# List of tuples
word_pos_tags


[('message', 'NN'),
 ('be', 'VB'),
 ('cleaned', 'VBN'),
 ('might', 'MD'),
 ('involve', 'VB'),
 ('some', 'DT'),
 ('things', 'NNS'),
 ('like', 'IN'),
 ('adjacent', 'JJ'),
 ('spaces', 'NNS'),
 ('tabs', 'VBP')]

In [24]:
# Mapping each POS tag to WordNet's format and lemmatizing the token
for word, tag in word_pos_tags:
  lemmatized_sentence.append(word_lemmatizer.lemmatize(word, get_wordnet_pos(tag)))

lemmatized_text = " ".join(lemmatized_sentence)


The above code iterates through a list of word-POS tag pairs. For each word:

*   It converts the POS tag to a WordNet-compatible format,
*   It uses a lemmatizer to find the base form (lemma) of the word, using the POS tag to improve accuracy,
*   It adds the lemmatized word to a list called lemmatized_sentence.

After the loop finishes, `lemmatized_sentence` holds the lemmatized words from the original sentence, which are then joined into the string `lemmatized_text`.
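
The tag-conversion step can be looked at in isolation. The sketch below is a table-driven variant of the helper, not the notebook's own function: the names `PENN_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical, and WordNet's POS constants are written out as their single-letter string values so that the example runs without downloading any NLTK data.

```python
# WordNet's POS constants are single-letter strings; written out here so the
# sketch does not depend on the WordNet corpus being available
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# First letter of the Penn Treebank tag -> WordNet POS
PENN_PREFIX_TO_WORDNET = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], NOUN)

# A few tagged tokens copied from the tagger output shown earlier
tagged = [('message', 'NN'), ('cleaned', 'VBN'), ('might', 'MD'),
          ('adjacent', 'JJ'), ('tabs', 'VBP')]

for word, tag in tagged:
    print(word, tag, to_wordnet_pos(tag))
```

Here `tabs` was tagged `VBP`, so it is looked up as a verb rather than a noun. That is consistent with it staying `tabs` in the final output instead of becoming `tab`, and it shows that lemmatization quality depends on the tagger.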


In [25]:
print(lemmatized_text)


message be clean might involve some thing like adjacent space tabs
