# Natural Language Preprocessing

Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to comprehend, generate, and manipulate human language. NLP makes it possible to query data using natural-language text or voice.

In [314]:
import pandas as pd

In [315]:
text = "Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!"
text

"Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!"

# Preprocessing
1. ### To LowerCase
   Python string comparison is case-sensitive: to Python, `A` and `a` are different characters. Converting all text to lowercase means that words such as `The` and `the` are treated as the same token.

In [317]:
text_lower = text.lower()
print(f"This is the Original Text: \n{text}\n\nThis one is the Lower Text: \n{text_lower}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Lower Text: 
heyi there! have yu visited https://example.com recently? it's amazing! i bought 2 new gadgets from their sale last week (crazy deals!). the total cost was $199.99â€”what a bargain! ðŸ˜Š but iâ€™m a bit concerned about my data privacy... ðŸ¤” are they really protecting my information? by the way, do you know if johnâ€™s email is john.doe123@example.com? i need to contact him asap. see you at 5:00 pm tomorrow!
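For text that may contain non-English characters, `str.casefold()` is a more aggressive alternative to `str.lower()`. The following is a minimal sketch; the sample words are illustrative and do not come from the text above.

```python
# Sample words are illustrative and not taken from the notebook's text.
words = ["Hello", "HELLO", "Straße", "STRASSE"]

# lower() maps cased characters to lowercase but keeps "ß" as it is.
lowered = [w.lower() for w in words]

# casefold() additionally folds "ß" to "ss", so both spellings compare equal.
folded = [w.casefold() for w in words]

print(lowered)
print(folded)
```

For plain English text the two methods give the same result, so `lower()` is usually enough.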


2. ### Handling URLs
   URLs are usually unnecessary when training an NLP model, because they rarely carry meaning in the text. Removing them cuts redundant data, reduces the computational resources needed, and can improve model performance.

In [319]:
import re
# Use a regular expression; re.compile() builds a reusable pattern object.
# Note: this pattern matches only the 'https://' prefix, so the domain name stays in the text.
url_format = re.compile('https://')

def url_handle(text):
    return url_format.sub('', text)

URLHandle = url_handle(text)

print(f"This is the Original Text: \n{text}\n\nThis one is the Modified Text: \n{URLHandle}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified Text: 
Heyi there! Have yu visited example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!
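The output above still contains `example.com`, because only the `https://` prefix is removed. A possible way to drop the whole URL is sketched below. The pattern and the helper name `remove_urls` are assumptions for illustration: a URL is taken to start with `http://`, `https://` or `www.` and to run until the next whitespace.

```python
import re

# Assumed pattern: a URL starts with http://, https:// or www. and runs to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(s):
    """Delete whole URLs, then collapse the extra spaces they leave behind."""
    without_urls = URL_PATTERN.sub("", s)
    return re.sub(r"\s{2,}", " ", without_urls).strip()

sample = "Have yu visited https://example.com recently?"
print(remove_urls(sample))  # Have yu visited recently?
```

Because the pattern stops only at whitespace, punctuation glued to the end of a URL is removed with it, which may or may not be wanted.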


3. ### Non-Word/Non-Whitespace
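   Characters that are neither word characters nor whitespace (punctuation and symbols) are often removed, together with stray leading and trailing whitespace. The regular-expression building blocks for this are listed below; the cell that follows handles the whitespace part with `str.strip()`.

- `\w`: Matches any word character (equivalent to `[a-zA-Z0-9_]` for ASCII text; in Python 3 it also matches Unicode letters and digits).
- `\s`: Matches any whitespace character (spaces, tabs, line breaks).
- `^`: A caret at the start of a character class negates it, so `[^\w\s]` matches anything that is NOT a word character or a whitespace character.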

In [321]:
whiteText = "     This is the White Text in the Start!!  "
noWhiteText = whiteText.strip()
print(f"This is the Original Text: \n{whiteText}\n\nThis one is the Modified: \n{noWhiteText}")

This is the Original Text: 
     This is the White Text in the Start!!  

This one is the Modified: 
This is the White Text in the Start!!
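`str.strip()` only trims the two ends of a string. The sketch below, on an invented sample string, shows which characters the `[^\w\s]` class matches and one way to normalize whitespace inside the string as well.

```python
import re

# Invented sample string with extra spaces, a tab and some punctuation.
sample = "   Crazy   deals!!  (50% off)\t today  "

# [^\w\s] matches every character that is neither a word character nor whitespace.
non_word_chars = re.findall(r"[^\w\s]", sample)
print(non_word_chars)  # ['!', '!', '(', '%', ')']

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces removes inner runs too, not just the ends.
normalized = " ".join(sample.split())
print(normalized)  # Crazy deals!! (50% off) today
```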


4. ### Digits
    Numbers often carry little meaning for text-based models, so in many cases they are removed before training. Note that removing every digit also alters prices, times, and email addresses.

In [323]:
noDigits = re.sub(r'\d', '', text)
print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{noDigits}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought  new gadgets from their sale last week (crazy deals!). The total cost was $.â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe@example.com? I need to contact him ASAP. See you at : PM tomorrow!
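Deleting digits leaves fragments such as `$.` and `: PM` in the output above. An alternative, sketched here under assumptions, is to replace each number with a placeholder token so the model still sees that a number was present. The token `<NUM>` and the helper `mask_numbers` are hypothetical names chosen for this example.

```python
import re

# Assumed placeholder token; any string that cannot occur as a normal word would work.
NUM_TOKEN = "<NUM>"

# One or more digits, optionally followed by groups such as ".99", ":00" or ",000".
NUMBER_PATTERN = re.compile(r"\d+(?:[.,:]\d+)*")

def mask_numbers(s):
    """Replace each number (including decimals and clock times) with NUM_TOKEN."""
    return NUMBER_PATTERN.sub(NUM_TOKEN, s)

sample = "The total cost was $199.99. See you at 5:00 PM!"
print(mask_numbers(sample))  # The total cost was $<NUM>. See you at <NUM> PM!
```

Digits inside email addresses or identifiers would be masked as well, so such items are better handled before this step.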


5. ### Tokenization
    Tokenization is the process of breaking large blocks of text into smaller, more manageable units. For example, a sentence is divided into words, with each word treated as a unit. It is a standard step in NLP because later steps operate on tokens rather than on raw strings, which leads to more accurate results.

In [325]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\faizr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [326]:
Tokens = word_tokenize(text)
print(f"This is the Original Text: \n{text}\n\nThis one is the Tokens: \n{Tokens}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Tokens: 
['Heyi', 'there', '!', 'Have', 'yu', 'visited', 'https', ':', '//example.com', 'recently', '?', 'It', "'s", 'amazing', '!', 'I', 'bought', '2', 'new', 'gadgets', 'from', 'their', 'sale', 'last', 'week', '(', 'crazy', 'deals', '!', ')', '.', 'The', 'total', 'cost', 'was', '$', '199.99â€”what', 'a', 'bargain', '!', 'ðŸ˜Š', 'But', 'I', 'â€™', 'm', 'a', 'bit', 'concerned', 'about', 'my', 'data', 'privacy', '...', 'ðŸ¤”', 'Are', 'they', 'really', 'protecting', 'my', 'information', '?', 'By', 'the', 'way', ',', 'do', 'you', 'know', '
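To make the idea concrete without NLTK, here is a rough regex-based tokenizer. It is a simplified sketch, not the algorithm behind `word_tokenize`; the pattern and the name `simple_tokenize` are assumptions for illustration.

```python
import re

# Word characters (plus internal apostrophe parts) form one token;
# any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(s):
    """Return word and punctuation tokens in order of appearance."""
    return TOKEN_PATTERN.findall(s)

print(simple_tokenize("It's amazing! I bought 2 new gadgets."))
# ["It's", 'amazing', '!', 'I', 'bought', '2', 'new', 'gadgets', '.']
```

Unlike the NLTK output above, this sketch keeps `It's` as a single token instead of splitting it into `It` and `'s`.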

6. ### StopWord
   Stopwords are the most commonly occurring words in a language. Each language has its own list; Urdu and English, for example, have different stopwords.
   These words are so frequent that they add little information for the model, so training on them is largely redundant. Removing them reduces the size of the data, which in turn lowers training time and the computational resources required.

In [328]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\faizr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [329]:
stopWords = set(stopwords.words('english'))
noStopWords = [word for word in Tokens if word not in stopWords]
print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{noStopWords}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
['Heyi', '!', 'Have', 'yu', 'visited', 'https', ':', '//example.com', 'recently', '?', 'It', "'s", 'amazing', '!', 'I', 'bought', '2', 'new', 'gadgets', 'sale', 'last', 'week', '(', 'crazy', 'deals', '!', ')', '.', 'The', 'total', 'cost', '$', '199.99â€”what', 'bargain', '!', 'ðŸ˜Š', 'But', 'I', 'â€™', 'bit', 'concerned', 'data', 'privacy', '...', 'ðŸ¤”', 'Are', 'really', 'protecting', 'information', '?', 'By', 'way', ',', 'know', 'John', 'â€™', 'email', 'john.doe123', '@', 'example.com', '?', 'I', 'need', 'contact', 'ASAP', 
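In the output above, capitalized words such as `Have`, `The` and `But` survive because the NLTK list is lowercase and the membership test is case-sensitive. A small sketch of a case-insensitive filter follows; the stopword set here is a tiny hand-written stand-in, not the NLTK list.

```python
# A tiny hand-written stopword set for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "have", "was", "but", "i", "from", "their"}

tokens = ["The", "total", "cost", "was", "a", "bargain", "But", "I", "bought", "gadgets"]

# Compare the lowercased token, but keep the original spelling in the result.
filtered = [tok for tok in tokens if tok.lower() not in STOP_WORDS]
print(filtered)  # ['total', 'cost', 'bargain', 'bought', 'gadgets']
```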

7. ### Stemming/Lemmatization
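   Stemming and lemmatization are similar but not the same. Stemming strips suffixes by rule to reach a root form, which need not be a real word: `running` becomes `run`, but `amazing` becomes `amaz`. Lemmatization maps a word to its dictionary form (its lemma) using a vocabulary, e.g. `running` is lemmatized to `run` when it is treated as a verb. NLTK's `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed, so verbs such as `visited` are left unchanged without it.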

In [331]:
# Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [332]:
stem = PorterStemmer()

stemWords = [stem.stem(word) for word in Tokens]
print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{stemWords}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
['heyi', 'there', '!', 'have', 'yu', 'visit', 'http', ':', '//example.com', 'recent', '?', 'it', "'s", 'amaz', '!', 'i', 'bought', '2', 'new', 'gadget', 'from', 'their', 'sale', 'last', 'week', '(', 'crazi', 'deal', '!', ')', '.', 'the', 'total', 'cost', 'wa', '$', '199.99â€”what', 'a', 'bargain', '!', 'ðŸ˜Š', 'but', 'i', 'â€™', 'm', 'a', 'bit', 'concern', 'about', 'my', 'data', 'privaci', '...', 'ðŸ¤”', 'are', 'they', 'realli', 'protect', 'my', 'inform', '?', 'by', 'the', 'way', ',', 'do', 'you', 'know', 'if', 'john', 'â€™',

In [333]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\faizr\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\faizr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [334]:
lem = WordNetLemmatizer()
lemWords = [lem.lemmatize(word) for word in Tokens]

print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{lemWords}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
['Heyi', 'there', '!', 'Have', 'yu', 'visited', 'http', ':', '//example.com', 'recently', '?', 'It', "'s", 'amazing', '!', 'I', 'bought', '2', 'new', 'gadget', 'from', 'their', 'sale', 'last', 'week', '(', 'crazy', 'deal', '!', ')', '.', 'The', 'total', 'cost', 'wa', '$', '199.99â€”what', 'a', 'bargain', '!', 'ðŸ˜Š', 'But', 'I', 'â€™', 'm', 'a', 'bit', 'concerned', 'about', 'my', 'data', 'privacy', '...', 'ðŸ¤”', 'Are', 'they', 'really', 'protecting', 'my', 'information', '?', 'By', 'the', 'way', ',', 'do', 'you', 'know', 'if
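The difference between the two approaches can be illustrated with a deliberately tiny toy version. This is not the Porter algorithm and not WordNet: the suffix list, the lookup table and both function names are invented for this sketch.

```python
# Toy illustration only: real stemmers (such as Porter) apply many more rules,
# and real lemmatizers look words up in a full dictionary such as WordNet.
SUFFIXES = ("ing", "ed", "s")
LEMMA_TABLE = {"was": "be", "ran": "run", "bought": "buy", "gadgets": "gadget"}

def toy_stem(word):
    """Strip the first matching suffix, without checking the result is a real word."""
    for suffix in SUFFIXES:
        # The length guard keeps very short words such as "was" intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a small dictionary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["amazing", "was", "bought", "gadgets"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stem of `amazing` is the non-word `amaz`, while the dictionary approach can map irregular forms such as `bought` to `buy`, which no suffix rule would find.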

8. ### Special Characters
   Special characters and punctuation are commonly handled during text preprocessing for NLP. Depending on the data and the task, one may choose to replace them or remove them.

In [336]:
clean = re.sub(r'[^\w\s]', '', text)
print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{clean}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
Heyi there Have yu visited httpsexamplecom recently Its amazing I bought 2 new gadgets from their sale last week crazy deals The total cost was 19999what a bargain  But Im a bit concerned about my data privacy  Are they really protecting my information By the way do you know if Johns email is johndoe123examplecom I need to contact him ASAP See you at 500 PM tomorrow
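Removing punctuation outright fuses neighbouring pieces, as in `httpsexamplecom` and `19999what` above. One alternative, sketched below with a hypothetical helper name, is to replace each such character with a space and then tidy the whitespace.

```python
import re

def strip_punctuation(s):
    """Replace each non-word, non-space character with a space, then tidy whitespace."""
    spaced = re.sub(r"[^\w\s]", " ", s)
    return " ".join(spaced.split())

sample = "Email john.doe123@example.com (crazy deals!)."
print(strip_punctuation(sample))  # Email john doe123 example com crazy deals
```

Which variant is better depends on the data: replacing keeps word boundaries, while removing keeps items such as `19999` in one piece.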


9. ### Spelling Correction
    Spelling correction matters in natural-language text. Humans can usually recover the meaning of slightly misspelled words, but a model responds according to what it was trained on, so correcting spelling is an important step in NLP tasks.

In [338]:
# SpellChecker is provided by the pyspellchecker package (pip install pyspellchecker)
from spellchecker import SpellChecker

In [339]:
spell = SpellChecker()
notCorrect = spell.unknown(Tokens)
correctWords = [spell.correction(word) if word in notCorrect else word for word in Tokens]
correctWords = [word if word is not None else "" for word in correctWords]
correctText = ' '.join(correctWords)
print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{correctWords}\n\nThis is the corrected text: \n{correctText}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
['Heyi', 'there', '!', 'Have', 'you', 'visited', 'hates', ':', '', 'recently', '?', 'It', 'is', 'amazing', '!', 'I', 'bought', '2', 'new', 'gadgets', 'from', 'their', 'sale', 'last', 'week', '(', 'crazy', 'deals', '!', ')', '.', 'The', 'total', 'cost', 'was', '$', '', 'a', 'bargain', '!', 'i', 'But', 'I', 'i', 'm', 'a', 'bit', 'concerned', 'about', 'my', 'data', 'privacy', '', 'i', 'Are', 'they', 'really', 'protecting', 'my', 'information', '?', 'By', 'the', 'way', ',', 'do', 'you', 'know', 'if', 'John', 'i', 's', 'email', 'i
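In the output above, non-word tokens are also "corrected": `https` becomes `hates`, and some symbol tokens become `i` or an empty string. A common precaution is to pass only alphabetic tokens to the corrector. The sketch below shows that idea using the standard library's `difflib` similarity matching in place of `SpellChecker`; the mini-vocabulary and the function name are hypothetical.

```python
import difflib

# Hypothetical mini-vocabulary; a real spell checker ships a large word-frequency list.
VOCABULARY = ["have", "you", "visited", "there", "recently"]

def correct_token(token):
    """Correct alphabetic tokens only; punctuation, numbers and symbols pass through."""
    if not token.isalpha() or token.lower() in VOCABULARY:
        return token
    # Pick the single closest vocabulary word, if any is similar enough.
    matches = difflib.get_close_matches(token.lower(), VOCABULARY, n=1, cutoff=0.6)
    return matches[0] if matches else token

tokens = ["Have", "yu", "visited", "!", "$", "2"]
print([correct_token(t) for t in tokens])
# ['Have', 'you', 'visited', '!', '$', '2']
```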

10. ### POS
    POS stands for part-of-speech tagging: labelling each token with its grammatical role (noun, verb, adjective, and so on). It matters because grammar is central to natural language; a model that produces grammatically wrong output is of little use in a company setting.

In [341]:
from nltk.tag import pos_tag

In [342]:
pos = pos_tag(Tokens)
print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{pos}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
[('Heyi', 'NNP'), ('there', 'EX'), ('!', '.'), ('Have', 'VBP'), ('yu', 'VBN'), ('visited', 'VBN'), ('https', 'NN'), (':', ':'), ('//example.com', 'NN'), ('recently', 'RB'), ('?', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('amazing', 'JJ'), ('!', '.'), ('I', 'PRP'), ('bought', 'VBD'), ('2', 'CD'), ('new', 'JJ'), ('gadgets', 'NNS'), ('from', 'IN'), ('their', 'PRP$'), ('sale', 'NN'), ('last', 'JJ'), ('week', 'NN'), ('(', '('), ('crazy', 'JJ'), ('deals', 'NNS'), ('!', '.'), (')', ')'), ('.', '.'), ('The', 'DT'), ('total', 'JJ'), ('cost
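The tags above follow the Penn Treebank scheme (`NN` noun, `VBD` past-tense verb, `JJ` adjective, `RB` adverb). One common use for them is to tell a lemmatizer which part of speech a word has. The helper below is an illustrative sketch with an assumed name; it maps Penn tags to the single-letter codes that WordNet uses.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun is the fallback)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun and everything else

tagged = [("bought", "VBD"), ("new", "JJ"), ("gadgets", "NNS"), ("recently", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('bought', 'v'), ('new', 'a'), ('gadgets', 'n'), ('recently', 'r')]
```

The resulting letter can be passed as the `pos` argument of `WordNetLemmatizer.lemmatize(word, pos=...)`.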

11. ### Contractions
    Handling contractions is an important step in text preprocessing. Contractions are shortened forms of words or phrases, often formed by combining two words, and they are commonly used in everyday language.

In [344]:
contractions_dict = {
    "isn't": "is not",
    "don't": "do not",
    "aren't": "are not",
    "can't": "cannot",
    "couldn't": "could not",
    "didn't": "did not",
    "I'm": "I am",
    "Iâ€™m": "I am",  # Adding the contraction with the curly apostrophe
    "it's": "it is",
    "Itâ€™s": "it is"  # Adding the contraction with the curly apostrophe
}

modified_text = text
for contraction, expansion in contractions_dict.items():
    modified_text = modified_text.replace(contraction, expansion)

print(f"This is the Original Text: \n{text}\n\nThis one is the Modified: \n{modified_text}")

This is the Original Text: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But Iâ€™m a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!

This one is the Modified: 
Heyi there! Have yu visited https://example.com recently? It's amazing! I bought 2 new gadgets from their sale last week (crazy deals!). The total cost was $199.99â€”what a bargain! ðŸ˜Š But I am a bit concerned about my data privacy... ðŸ¤” Are they really protecting my information? By the way, do you know if Johnâ€™s email is john.doe123@example.com? I need to contact him ASAP. See you at 5:00 PM tomorrow!
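In the output above, `It's` is left unexpanded because the dictionary key `it's` is lowercase and `str.replace` is case-sensitive. A possible case-insensitive variant is sketched below. The dictionary is a shortened stand-in, and the function name and the capital-preserving rule are assumptions made for this example.

```python
import re

# Shortened stand-in dictionary with lowercase keys and straight apostrophes.
CONTRACTIONS = {"i'm": "I am", "it's": "it is", "don't": "do not", "can't": "cannot"}

# One alternation of all keys, matched on word boundaries and ignoring case.
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(s):
    """Expand known contractions, keeping a leading capital letter."""
    # Normalize the curly apostrophe (U+2019) to a straight one first.
    s = s.replace("\u2019", "'")

    def replace_match(match):
        expansion = CONTRACTIONS[match.group(0).lower()]
        if match.group(0)[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return CONTRACTION_PATTERN.sub(replace_match, s)

print(expand_contractions("It's fine, but I\u2019m late"))  # It is fine, but I am late
```

Normalizing the apostrophe first means the dictionary needs only one spelling of each contraction.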
