<h1 style="color: #003366; font-family: Arial, sans-serif; font-weight: bold;">NLP Preprocessing</h1>


#### **Objective**

The primary goal of the preprocessing steps was to **clean and standardize textual data** from various corpora, preparing them for effective analysis and modeling. Each dataset required unique preprocessing techniques tailored to its specific challenges and characteristics, including:

- **Uniform Formatting**: Ensuring consistency in text formatting across different datasets, such as converting text to lowercase and handling contractions.

- **Noise Removal**: Eliminating non-essential elements like punctuation, special characters, HTML tags, and URLs to focus on the core content.

- **Text Normalization**: Standardizing text by techniques such as stemming, lemmatization, and correcting spelling errors to consolidate different forms of the same word.

- **Data Structuring**: Tokenizing text, generating n-grams, and vectorizing text to prepare it for analytical and modeling tasks.

- **Contextual Handling**: Addressing dataset-specific issues, such as removing emojis from chat data or expanding acronyms in formal texts.


In [139]:
import random
import re
import warnings

import emoji
import nltk
import spacy
from nltk.corpus import brown, gutenberg, inaugural, stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

#### **Preprocessing Steps for the Brown Corpus Dataset**

1. **Lowercasing:** 
   Convert all characters in the text to lowercase, ensuring uniformity and reducing case-related discrepancies.

In [140]:
warnings.filterwarnings("ignore", category=UserWarning, module='nltk')
nltk.download('brown', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
sentences = brown.sents()

lowercased_sentences = [[word.lower() for word in sentence] for sentence in sentences]
print("Lowercased Sentence:", ' '.join(lowercased_sentences[0]))

Lowercased Sentence: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .


2. **Tokenization:** 
   Split the text into individual words or tokens, facilitating easier analysis and processing.
3. **Removing Punctuation:** 
   Eliminate non-essential symbols from the text to focus on the words themselves and clean the dataset.

In [141]:
tokenizer = RegexpTokenizer(r'\w+')
tokenized_sentences = [tokenizer.tokenize(' '.join(sentence)) for sentence in lowercased_sentences]
print("Tokenized Sentence:", tokenized_sentences[0])
# Punctuation removal is already handled here: RegexpTokenizer(r'\w+') keeps only runs of word characters

Tokenized Sentence: ['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', 'atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place']


4. **Removing Stop Words:** 
   Filter out common words like "and" or "the" that carry less semantic weight, enhancing the relevance of the remaining content.


In [142]:
stop_words = set(stopwords.words('english'))
filtered_sentences = [[word for word in sentence if word not in stop_words] for sentence in tokenized_sentences]
print("Filtered Sentence (Stop Words Removed):", ' '.join(filtered_sentences[0]))

Filtered Sentence (Stop Words Removed): fulton county grand jury said friday investigation atlanta recent primary election produced evidence irregularities took place


5. **Stemming:** 
   Reduce words to their base or root forms to consolidate different inflections into a single root, improving text analysis efficiency.

In [143]:
stemmer = PorterStemmer()
stemmed_sentences = [[stemmer.stem(word) for word in sentence] for sentence in filtered_sentences]
print("Stemmed Sentence:", ' '.join(stemmed_sentences[0]))

Stemmed Sentence: fulton counti grand juri said friday investig atlanta recent primari elect produc evid irregular took place


#### **Preprocessing Steps for the Gutenberg Corpus Dataset**

6. **Lemmatization:** 
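   Convert words to their dictionary form (e.g., "lived" to "live") to standardize word usage and improve text consistency. The helper below tries only verb and noun readings, so adjective forms such as "better" are not mapped to "good".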

In [144]:
nltk.download('gutenberg', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
text = gutenberg.raw('austen-emma.txt')  

In [145]:
lemmatizer = WordNetLemmatizer()  # created once and reused for every token

def lemmatize_word(word):
    # Treat the word as a verb if WordNet lists any verb sense for it, otherwise as a noun
    pos = wordnet.VERB if wordnet.synsets(word, pos=wordnet.VERB) else wordnet.NOUN
    return lemmatizer.lemmatize(word, pos=pos)

7. **Removing Numbers:** 
   Eliminate numerical digits from the text to focus on the linguistic content and remove non-textual elements.

In [146]:
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

8. **Removing Whitespace:** 
   Trim unnecessary spaces from the text to clean up formatting and ensure a consistent text structure.

In [147]:
def remove_whitespace(text):
    return ' '.join(text.split())


9. **Handling Contractions:** 
   Expand contractions (e.g., "can't" to "cannot") to ensure that all text is in its full form, enhancing readability and uniformity.

In [148]:
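def expand_contractions(text):
    # Note: "'s" also rewrites possessives, and "'d" can mean "would" or "had"
    contractions = {
        "can't": "cannot", "won't": "will not", "n't": " not",
        "'m": " am", "'re": " are", "'s": " is", "'d": " would",
        "'ll": " will", "'t": " not", "'ve": " have", "'y": " you"
    }
    # Longest keys first; no leading \b, since suffixes such as "n't" begin inside a word
    keys = sorted(contractions, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(k) for k in keys) + r")\b", flags=re.IGNORECASE)
    return pattern.sub(lambda x: contractions[x.group().lower()], text)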

10. **Text Normalization:** 
   Normalize different spellings of the same word (e.g., "color" and "colour") to unify the text and reduce variability in word forms.

In [149]:
def normalize_text(text):
    return text.lower().replace('colour', 'color')
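
The single `replace` call above covers one spelling pair. A table-driven variant might look like the sketch below; the `SPELLING_VARIANTS` table and the `normalize_spelling` name are illustrative assumptions, not part of the original notebook, and the table would need to be extended for a real corpus.

```python
import re

# Hypothetical British -> American mapping; extend as needed for the corpus.
SPELLING_VARIANTS = {
    "colour": "color",
    "favour": "favor",
    "honour": "honor",
    "neighbour": "neighbor",
}

# One compiled alternation so the text is scanned a single time.
_VARIANT_RE = re.compile("|".join(map(re.escape, SPELLING_VARIANTS)))

def normalize_spelling(text):
    """Lowercase the text and map known spelling variants to one form."""
    return _VARIANT_RE.sub(lambda m: SPELLING_VARIANTS[m.group()], text.lower())

print(normalize_spelling("Her Favourite Colour"))  # her favorite color
```

Because the match is on substrings, derived forms such as "neighbours" or "dishonourable" are normalized as well.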

In [150]:
text = expand_contractions(text)
text = remove_numbers(text)
text = remove_whitespace(text)
text = normalize_text(text)

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
tokens = [lemmatize_word(token) for token in tokens if token.isalpha()]  # Lemmatize and remove non-alphabetic tokens

print("Sample tokens after preprocessing:", tokens[:50])

Sample tokens after preprocessing: ['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', 'seem', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessing', 'of', 'existence', 'and', 'have', 'live', 'nearly', 'twenty', 'one', 'year', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', 'she']


#### **Preprocessing Steps for the Movie Reviews Corpus**

11. **Removing Special Characters:** 
   Remove special characters (e.g., `@`, `#`, `$`) from the text to focus on the linguistic content and eliminate non-textual elements.

In [151]:
text = """
Check out this great deal @ http://example.com! Contact us via email@example.com. 
Our latest offer is #awesome. Don't miss this! <b>Special offer</b> just for you.
"""

In [152]:
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

12. **Spelling Correction:** 
   Automatically correct spelling mistakes in the text to enhance readability and uniformity by fixing common typos and misspellings.

In [153]:
def correct_spelling(text):
    blob = TextBlob(text)
    return str(blob.correct())

13. **Removing HTML Tags:** 
   Remove HTML tags from the text to clean up formatting and retain only the visible content, eliminating unnecessary HTML elements.

In [154]:
def remove_html_tags(text):
    return re.sub(r'<.*?>', '', text)


14. **Removing URLs:** 
   Remove URLs (web links) from the text to avoid including web addresses that are irrelevant to the textual analysis.

In [155]:
def remove_urls(text):
    return re.sub(r'http[s]?://\S+', '', text)


15. **Removing Email Addresses:** 
   Remove email addresses from the text to exclude personal or contact information that may not be pertinent to the analysis.

In [156]:
def remove_email_addresses(text):
    return re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

In [157]:
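# Note: special characters are stripped first, so the tag, URL, and email patterns applied afterwards find nothing to match
text_no_special_chars = remove_special_characters(text)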
text_corrected_spelling = correct_spelling(text_no_special_chars)
text_no_html = remove_html_tags(text_corrected_spelling)
text_no_urls = remove_urls(text_no_html)
text_no_emails = remove_email_addresses(text_no_urls)

print("Original Text:", text)
print("Text after Removing Special Characters:", text_no_special_chars)
print("Text after Spelling Correction:", text_corrected_spelling)
print("Text after Removing HTML Tags:", text_no_html)
print("Text after Removing URLs:", text_no_urls)
print("Text after Removing Email Addresses:", text_no_emails)

Original Text: 
Check out this great deal @ http://example.com! Contact us via email@example.com. 
Our latest offer is #awesome. Don't miss this! <b>Special offer</b> just for you.

Text after Removing Special Characters: 
Check out this great deal  httpexamplecom Contact us via emailexamplecom 
Our latest offer is awesome Dont miss this bSpecial offerb just for you

Text after Spelling Correction: 
Check out this great deal  httpexamplecom Contact us via emailexamplecom 
Our latest offer is awesome Wont miss this special offer just for you

Text after Removing HTML Tags: 
Check out this great deal  httpexamplecom Contact us via emailexamplecom 
Our latest offer is awesome Wont miss this special offer just for you

Text after Removing URLs: 
Check out this great deal  httpexamplecom Contact us via emailexamplecom 
Our latest offer is awesome Wont miss this special offer just for you

Text after Removing Email Addresses: 
Check out this great deal  httpexamplecom Contact us via emailexa
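
The recorded output above still contains `httpexamplecom` and `emailexamplecom` because special characters are stripped before the tag, URL, and email patterns run, so those patterns no longer find `<`, `://`, or `@`. A minimal sketch of one possible reordering follows; the name `clean_text` is hypothetical, and spelling correction is left out to keep the example dependency-free.

```python
import re

def clean_text(text):
    """Remove structured noise first, then stray symbols (illustrative order)."""
    text = re.sub(r'<[^>]+>', '', text)                               # HTML tags
    text = re.sub(r'https?://\S+', '', text)                          # URLs
    text = re.sub(r'\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b', '', text)   # email addresses
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)                        # remaining special characters
    return ' '.join(text.split())                                     # collapse whitespace

sample = """
Check out this great deal @ http://example.com! Contact us via email@example.com.
Our latest offer is #awesome. Don't miss this! <b>Special offer</b> just for you.
"""
print(clean_text(sample))
```

With this order the URL, the email address, and the `<b>` tags are removed as units instead of being reduced to letter runs.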

#### **Preprocessing Steps for the Inaugural Address Corpus**

16. **Removing Non-ASCII Characters:** 
   Remove characters that are not part of the ASCII character set to ensure uniformity and avoid issues with non-standard characters.

In [158]:
warnings.filterwarnings("ignore", category=UserWarning, module='nltk')
nltk.download('inaugural', quiet=True)
nltk.download('punkt', quiet=True)

True

In [159]:
def load_dataset():
    return inaugural.raw()

In [160]:
def remove_non_ascii(text):
    return ''.join(char for char in text if ord(char) < 128)

17. **Removing Repeated Characters:** 
   Shorten runs of a repeated character to two (e.g., "soooo" to "soo") so that exaggerated spellings are reduced while legitimate double letters are kept.

In [161]:
def remove_repeated_characters(text):
    return re.sub(r'(.)\1+', r'\1\1', text)
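
The pattern above leaves two copies of any repeated character. The sketch below makes that limit a parameter to show why two is the safer choice; `squeeze_repeats` and its `keep` argument are hypothetical names introduced for illustration.

```python
import re

def squeeze_repeats(text, keep=2):
    """Shorten any run of the same character to at most `keep` copies."""
    pattern = r'(.)\1{%d,}' % keep  # a character followed by `keep` or more copies of itself
    return re.sub(pattern, lambda m: m.group(1) * keep, text)

print(squeeze_repeats("soooo goooood"))     # soo good
print(squeeze_repeats("soooo goooood", 1))  # so god
```

Keeping a single copy turns "good" into "god", which is why collapsing to two characters and leaving the rest to a spelling step is a common compromise.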

18. **Expanding Acronyms:** 
   Expand commonly used acronyms (e.g., "AI" to "Artificial Intelligence") to enhance clarity and ensure that all abbreviations are fully spelled out.

In [162]:
def expand_acronyms(text):
    acronyms = {
        'AI': 'Artificial Intelligence',
        'USA': 'United States of America',
    }
    for acronym, expansion in acronyms.items():
        # \b limits the match to whole words, so "AI" inside "AID" is left alone
        text = re.sub(r'\b' + re.escape(acronym) + r'\b', expansion, text)
    return text

19. **Removing Affixes:** 
   Remove prefixes or suffixes from words to focus on the root forms and improve the consistency of the text.

In [163]:
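def remove_affixes(text):
    # Naive stripping that can over-trim ("under" -> "der"); the remaining stem must keep at least three letters
    prefixes = r'\b(?:un|re)(?=[a-z]{3,})'
    suffixes = r'(?<=[a-z]{3})(?:ing|ed|ly|es|s)\b'
    text = re.sub(prefixes, '', text, flags=re.IGNORECASE)
    return re.sub(suffixes, '', text, flags=re.IGNORECASE)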

20. **Removing Extra Spaces:** 
   Remove extra spaces between words to clean up formatting and ensure a consistent text structure.


In [164]:
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()


In [165]:
def preprocess_text(text):
    text = remove_non_ascii(text)
    text = remove_repeated_characters(text)
    text = expand_acronyms(text)
    text = remove_affixes(text)
    text = remove_extra_spaces(text)
    return text

text = load_dataset()
preprocessed_text = preprocess_text(text)

print("Original Text Sample:")
print(text[:1000])  

print("\nPreprocessed Text Sample:")
print(preprocessed_text[:1000])  

Original Text Sample:
Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualif

#### **Text Processing Steps for the NPS Chat Corpus**

21. **Handling Emojis:** 
   Remove emojis from the text to ensure uniformity.

In [166]:
nlp = spacy.load("en_core_web_sm")

In [167]:
def handle_emojis(text):
    return emoji.replace_emoji(text, replace='')

22. **Sentence Splitting:** 
   Split the text into individual sentences using regular expressions.


In [168]:
def split_sentences(text):
    return re.split(r'(?<=[.!?]) +', text)

23. **Creating N-grams:** 
   Generate bigrams and trigrams by tokenizing the text and combining tokens.

In [169]:
def create_ngrams(text, n):
    tokens = text.split()
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return list(ngrams)
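
Splitting on whitespace leaves punctuation attached to tokens, so n-grams built this way contain items such as `'you?'`. A minimal sketch that extracts word tokens with a regular expression first is shown below; `word_ngrams` is a hypothetical helper, and the token pattern is an assumption that suits simple English text.

```python
import re

def word_ngrams(text, n):
    """Build n-grams from lowercase word tokens, ignoring punctuation."""
    # Letters and digits, optionally followed by an apostrophe suffix ("let's")
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    return list(zip(*(tokens[i:] for i in range(n))))

print(word_ngrams("Hello! How are you?", 2))
# [('hello', 'how'), ('how', 'are'), ('are', 'you')]
```

The n-grams still cross sentence boundaries; splitting into sentences first and building n-grams per sentence would avoid pairs such as `('hello', 'how')`.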

24. **Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging:** 
   Identify named entities and assign part-of-speech tags to words using spaCy.

In [170]:
def perform_ner_and_pos(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    pos_tags = [(token.text, token.pos_) for token in doc]
    return entities, pos_tags

In [171]:
text = "Hello 😃! How are you? 🏡 Let's meet at 5:00 PM."

text_no_emojis = handle_emojis(text)
print("Text after removing emojis:")
print(text_no_emojis)

sentences = split_sentences(text_no_emojis)
print("\nSample Sentences:")
print(sentences)

try:
    bigrams = create_ngrams(text_no_emojis, 2)
    trigrams = create_ngrams(text_no_emojis, 3)
    print("\nBigrams:")
    print(bigrams)
    print("\nTrigrams:")
    print(trigrams)
except Exception as e:
    print("\nError creating n-grams:", e)

entities, pos_tags = perform_ner_and_pos(text_no_emojis)
print("\nNamed Entities:")
print(entities)
print("\nPart-of-Speech Tags:")
print(pos_tags)

Text after removing emojis:
Hello ! How are you?  Let's meet at 5:00 PM.

Sample Sentences:
['Hello !', 'How are you?', "Let's meet at 5:00 PM."]

Bigrams:
[('Hello', '!'), ('!', 'How'), ('How', 'are'), ('are', 'you?'), ('you?', "Let's"), ("Let's", 'meet'), ('meet', 'at'), ('at', '5:00'), ('5:00', 'PM.')]

Trigrams:
[('Hello', '!', 'How'), ('!', 'How', 'are'), ('How', 'are', 'you?'), ('are', 'you?', "Let's"), ('you?', "Let's", 'meet'), ("Let's", 'meet', 'at'), ('meet', 'at', '5:00'), ('at', '5:00', 'PM.')]

Named Entities:
[('5:00 PM', 'TIME')]

Part-of-Speech Tags:
[('Hello', 'INTJ'), ('!', 'PUNCT'), ('How', 'SCONJ'), ('are', 'AUX'), ('you', 'PRON'), ('?', 'PUNCT'), (' ', 'SPACE'), ('Let', 'VERB'), ("'s", 'PRON'), ('meet', 'VERB'), ('at', 'ADP'), ('5:00', 'NUM'), ('PM', 'NOUN'), ('.', 'PUNCT')]


#### **Preprocessing Steps for the Movie Reviews Corpus**

25. **Text Vectorization (e.g., TF-IDF):** 
   Convert text into numerical features using methods like TF-IDF.


In [172]:
text = """
This is a sample text. Here is an example of CamelCase format: CamelCase.
HTML entities like &amp; should be converted. 
Let's also remove line breaks and convert multi-line text into a single line.
"""

In [173]:
def vectorize_text(text):
    """Convert text into numerical features using TF-IDF."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([text])
    return X, vectorizer.get_feature_names_out()
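
When `fit_transform` receives a single document, every term occurs in the same number of documents, so the inverse-document-frequency factor is identical for all terms and the weights reflect term counts only. The hand-rolled sketch below uses several documents to show how IDF discounts common terms. It uses the textbook `tf * log(N / df)` weighting, which is an assumption for illustration and differs from scikit-learn's smoothed, normalized variant; `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the film was great", "the film was dull", "the acting was great"]
for row in tfidf(docs):
    print(row)
```

Terms that occur in every document ("the", "was") receive a weight of zero, while a term unique to one document ("dull") receives the highest weight.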

26. **Removing HTML Entities:** 
   Remove named HTML entities (e.g., `&amp;`) from the text.

In [174]:
def remove_html_entities(text):
    """Remove named HTML entities (e.g., `&amp;`) from the text."""
    return re.sub(r'&\w+;', '', text)
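
The regular expression above deletes named entities and does not match numeric ones such as `&#38;`. If the goal is to turn entities into the characters they stand for, the standard library's `html.unescape` is one option. The wrapper name `convert_html_entities` below is hypothetical.

```python
import html

def convert_html_entities(text):
    """Replace HTML entities with the characters they stand for."""
    return html.unescape(text)

print(convert_html_entities("Fish &amp; chips &lt;3 &#38; more"))
# Fish & chips <3 & more
```

Converting rather than deleting keeps the sentence readable, at the cost of reintroducing symbols that a later special-character step may need to handle.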

27. **Lowercasing the First Word of Sentences:** 
   Lowercase the first letter of each sentence, leaving the rest of the sentence unchanged.


In [175]:
def lowercase_first_word(text):
    """Lowercase the first character of each sentence (sentences are split on spaces that follow ., ! or ?)."""
    sentences = re.split(r'(?<=[.!?]) +', text)
    sentences = [s[0].lower() + s[1:] if s else '' for s in sentences]
    return ' '.join(sentences)
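
Because the split pattern looks only for spaces, a sentence that starts after a line break, or a text that begins with whitespace, keeps its capital letter. The sketch below handles both cases with a substitution callback that preserves the original whitespace; `lowercase_sentence_starts` is a hypothetical name.

```python
import re

# Group 1: start of text plus any whitespace, or end punctuation plus whitespace.
# Group 2: the capital letter that opens the sentence.
_SENTENCE_START = re.compile(r'(^\s*|[.!?]\s+)([A-Z])')

def lowercase_sentence_starts(text):
    """Lowercase the first letter of the text and of each sentence after . ! or ?"""
    return _SENTENCE_START.sub(lambda m: m.group(1) + m.group(2).lower(), text)

print(repr(lowercase_sentence_starts("\nThis is one. Here is two.\nAnd three? Yes.")))
# '\nthis is one. here is two.\nand three? yes.'
```

Like the original function, this also lowercases sentence-initial acronyms and proper nouns, so it suits case-insensitive analyses only.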


28. **Splitting Words from CamelCase:** 
   Split words that are in CamelCase format (e.g., "CamelCase" to "Camel Case").

In [176]:
def split_camel_case(text):
    """Split words that are in CamelCase format (e.g., "CamelCase" to "Camel Case")."""
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', text)

29. **Removing Line Breaks:** 
   Remove line breaks and convert multi-line text into a single line.

In [177]:
def remove_line_breaks(text):
    """Remove line breaks and convert multi-line text into a single line."""
    return text.replace('\n', ' ').replace('\r', '')

In [178]:
text_no_html = remove_html_entities(text)
print("After removing HTML entities:")
print(text_no_html)

text_lowercased = lowercase_first_word(text_no_html)
print("\nAfter lowercasing the first word of each sentence:")
print(text_lowercased)

text_split_camel_case = split_camel_case(text_lowercased)
print("\nAfter splitting CamelCase words:")
print(text_split_camel_case)

text_single_line = remove_line_breaks(text_split_camel_case)
print("\nAfter removing line breaks:")
print(text_single_line)

tfidf_matrix, feature_names = vectorize_text(text_single_line)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

print("\nFeature Names:")
print(feature_names)

After removing HTML entities:

This is a sample text. Here is an example of CamelCase format: CamelCase.
HTML entities like  should be converted. 
Let's also remove line breaks and convert multi-line text into a single line.


After lowercasing the first word of each sentence:

This is a sample text. here is an example of CamelCase format: CamelCase.
HTML entities like  should be converted. 
Let's also remove line breaks and convert multi-line text into a single line.


After splitting CamelCase words:

This is a sample text. here is an example of Camel Case format: Camel Case.
HTML entities like  should be converted. 
Let's also remove line breaks and convert multi-line text into a single line.


After removing line breaks:
 This is a sample text. here is an example of Camel Case format: Camel Case. HTML entities like  should be converted.  Let's also remove line breaks and convert multi-line text into a single line. 

TF-IDF Matrix:
[[0.14586499 0.14586499 0.14586499 0.14586499 0.145

#### **Preprocessing Steps for the State of the Union Corpus**

30. **Token Normalization:**
   Normalize different forms of the same token (e.g., "u.s.a." and "usa").

In [179]:
text = """
The U.S.A. is also known as USA. Dr. Smith and Mr. Jones are attending the meeting. The terrorist attack was carried out by a terrorist organization. 
The quick brown fox jumps over the lazy dog. The service was not good and the experience was not happy.
"""

In [180]:
def normalize_tokens(text):
    """Normalize different forms of the same token."""
    text = re.sub(r'\b(u\.s\.a\.|usa)\b', 'usa', text, flags=re.IGNORECASE)
    return text
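
A word boundary (`\b`) needs a word character on one side, so it cannot match after the final period of "U.S.A." when a space follows. As a result the pattern above leaves "U.S.A." untouched, as the recorded output at the end of this section shows. A sketch that uses lookarounds in place of `\b` follows; `normalize_usa` is a hypothetical name.

```python
import re

# (?<!\w) and (?!\w) stand in for \b so that the match may end on a period
_USA_RE = re.compile(r'(?<!\w)(?:u\.s\.a\.|usa)(?!\w)', flags=re.IGNORECASE)

def normalize_usa(text):
    """Map "U.S.A." and "USA" (any case) to the single token "usa"."""
    return _USA_RE.sub('usa', text)

print(normalize_usa("The U.S.A. is also known as USA."))
# The usa is also known as usa.
```

One side effect to be aware of: a sentence-final "U.S.A." loses its closing period, because that period is part of the matched token.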


31. **Handling Abbreviations:**
   Expand or normalize abbreviations (e.g., "Dr." to "Doctor").

In [181]:
def expand_abbreviations(text):
    """Expand or normalize abbreviations."""
    abbreviations = {
        'Dr.': 'Doctor',
        'Mr.': 'Mister',
        'Mrs.': 'Missus',
        'Gov.': 'Governor'
    }
    for abbr, full in abbreviations.items():
        text = text.replace(abbr, full)
    return text

32. **Context-Based Replacement:**
   Replace certain words or phrases based on their context.

In [182]:
def context_based_replacement(text):
    """Replace certain words or phrases based on context."""
    text = re.sub(r'\bterrorist\b(?=.*attack)', 'extremist', text, flags=re.IGNORECASE)
    return text

33. **Removing Stop Words with Custom List:**
   Remove stop words using a custom list of stop words.

In [183]:
def remove_custom_stop_words(text, custom_stop_words):
    """Remove stop words using a custom list."""
    words = text.lower().split()
    words = [word for word in words if word not in custom_stop_words]
    return ' '.join(words)

34. **Handling Negations:**
   Process sentences with negations (e.g., "not good" to "bad").

In [184]:
def handle_negations(text):
    """Process sentences with negations."""
    negation_patterns = {
        r'\bnot good\b': 'bad',
        r'\bnot happy\b': 'unhappy'
    }
    for pattern, replacement in negation_patterns.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

35. **Text Augmentation:**
   Generate new text data by making slight alterations to existing text.

In [185]:
def text_augmentation(text):
    """Generate new text data by making slight alterations."""
    augmentations = [
        lambda x: x + " Additionally, the text can be altered.",
        lambda x: "Note: " + x,
        lambda x: x.upper()
    ]
    return random.choice(augmentations)(text)
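
Since `random.choice` picks one augmentation per call, the result changes from run to run. Passing in a seeded generator makes the choice reproducible. The sketch below assumes a module-level `AUGMENTATIONS` list and an `augment` helper, both hypothetical names that mirror the function above.

```python
import random

AUGMENTATIONS = [
    lambda x: x + " Additionally, the text can be altered.",
    lambda x: "Note: " + x,
    lambda x: x.upper(),
]

def augment(text, rng):
    """Apply one augmentation chosen by the supplied random generator."""
    return rng.choice(AUGMENTATIONS)(text)

first = augment("sample text", random.Random(42))
second = augment("sample text", random.Random(42))
print(first == second)  # True: the same seed selects the same augmentation
```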

36. **Removing Duplicates:**
   Remove repeated words from the text, keeping only the first occurrence of each word (compared in lowercase).

In [186]:
def remove_duplicates(text):
    """Remove duplicate words or phrases in the text."""
    words = text.lower().split()
    seen = set()
    return ' '.join([word for word in words if not (word in seen or seen.add(word))])
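
Dropping every later occurrence of a word also removes legitimate repeats such as "was" or "and" in later sentences. If only accidental doubling ("the the") is the target, a narrower variant can compare each word with its predecessor. The helper name `remove_consecutive_duplicates` is hypothetical.

```python
def remove_consecutive_duplicates(text):
    """Drop a word only when it repeats the word directly before it (case-insensitive)."""
    words = text.split()
    kept = [
        word for i, word in enumerate(words)
        if i == 0 or word.lower() != words[i - 1].lower()
    ]
    return ' '.join(kept)

print(remove_consecutive_duplicates("The the service was was good and the food was good"))
# The service was good and the food was good
```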


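37. **Text Clustering:**
   Group text segments into clusters. The example below is a placeholder that deals sentences out to the clusters in turn rather than comparing their content.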

In [187]:
def simple_text_clustering(text, num_clusters=2):
    """Assign sentences to clusters in round-robin order (a placeholder, not content-based)."""
    # Split on '. ' and deal the sentences out to the clusters in turn
    sentences = text.split('. ')
    clusters = {i: [] for i in range(num_clusters)}
    for i, sentence in enumerate(sentences):
        cluster_id = i % num_clusters
        clusters[cluster_id].append(sentence.strip())
    return clusters
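
The function above assigns sentences by position, not by content. As a rough illustration of content-based grouping, the sketch below greedily places each sentence in the first cluster whose seed sentence shares enough words with it, measured by Jaccard similarity. The names `jaccard` and `cluster_by_overlap` and the `threshold` value are assumptions chosen for this example, not a recommended configuration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_overlap(sentences, threshold=0.2):
    """Greedily group sentences whose words overlap with a cluster's first member."""
    clusters = []  # each cluster is a list of sentences; the first one is its seed
    for sentence in sentences:
        words = set(sentence.lower().split())
        for cluster in clusters:
            seed_words = set(cluster[0].lower().split())
            if jaccard(words, seed_words) >= threshold:
                cluster.append(sentence)
                break
        else:
            # No existing cluster is similar enough, so start a new one
            clusters.append([sentence])
    return clusters

sentences = [
    "the service was bad",
    "the fox jumps over the dog",
    "the service was slow",
    "a dog chased the fox",
]
print(cluster_by_overlap(sentences))
```

Here the two service sentences land in one cluster and the two fox sentences in another. For larger corpora, vectorizing with TF-IDF and applying a standard clustering algorithm would be the more usual route.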

In [188]:
def preprocess_text(text):
    text = normalize_tokens(text)
    text = expand_abbreviations(text)
    text = context_based_replacement(text)
    custom_stop_words = set(['the', 'over'])  # Custom stop words list
    text = remove_custom_stop_words(text, custom_stop_words)
    text = handle_negations(text)
    text = text_augmentation(text)
    text = remove_duplicates(text)
    clusters = simple_text_clustering(text)
    return text, clusters


preprocessed_text, text_clusters = preprocess_text(text)

print("Preprocessed Text:")
print(preprocessed_text)

print("\nText Clusters:")
for cluster_id, sentences in text_clusters.items():
    print(f"Cluster {cluster_id}:")
    for sentence in sentences:
        print(f"  - {sentence}")

Preprocessed Text:
note: u.s.a. is also known as usa. doctor smith and mister jones are attending meeting. extremist attack was carried out by a terrorist organization. quick brown fox jumps lazy dog. service bad experience unhappy.

Text Clusters:
Cluster 0:
  - note: u.s.a
  - doctor smith and mister jones are attending meeting
  - quick brown fox jumps lazy dog
Cluster 1:
  - is also known as usa
  - extremist attack was carried out by a terrorist organization
  - service bad experience unhappy.
