
- This section focuses on preparing raw text data for text classification tasks by cleaning, normalizing, and extracting features that can be used as input for machine learning and deep learning models.
- Text preprocessing is a crucial step that ensures the quality and relevance of the data fed into classification algorithms, directly impacting model performance.



#### 2.1 **Text Cleaning and Normalization**


- **Tokenization**:
  - Split the text into individual units, such as words (word tokenization) or sentences (sentence tokenization).
  - Use NLTK's `word_tokenize` or `sent_tokenize` methods for tokenization.


In [None]:
import nltk
nltk.download('punkt')  # Download the Punkt tokenizer models

# Example text
text = "Hello there! How are you doing today? Let's learn some NLP."

# Word tokenization
word_tokens = nltk.word_tokenize(text)
print("Word Tokenization:", word_tokens)

# Sentence tokenization
sentence_tokens = nltk.sent_tokenize(text)
print("Sentence Tokenization:", sentence_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word Tokenization: ['Hello', 'there', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'Let', "'s", 'learn', 'some', 'NLP', '.']
Sentence Tokenization: ['Hello there!', 'How are you doing today?', "Let's learn some NLP."]


- **Lowercasing**:
  - Convert all text to lowercase so that case variants map to the same token; otherwise a vocabulary-based model treats "Word" and "word" as two different features.


In [None]:
# Lowercasing
lowercase_text = text.lower()
print("Lowercased Text:", lowercase_text)


Lowercased Text: hello there! how are you doing today? let's learn some nlp.
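For text that is not purely English, `str.casefold()` is a more aggressive alternative to `str.lower()`. The sketch below uses made-up sample words (not from the notebook) to show the difference:

```python
# Compare lower() and casefold() on a German word containing the sharp s
samples = ["Straße", "STRASSE", "NLP"]

lowered = [s.lower() for s in samples]
folded = [s.casefold() for s in samples]

print("lower():   ", lowered)
print("casefold():", folded)
```

With `casefold()`, both spellings of the German word collapse to the same string, which `lower()` alone does not achieve.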


- **Removing Special Characters and Punctuation**:
  - Eliminate non-alphanumeric characters to reduce noise (e.g., special symbols, hashtags, or emojis, if not relevant to the task).
  - Keep punctuation if it carries meaning for the task (e.g., sentence segmentation).


In [None]:
import re

# Removing special characters and punctuation
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print("Text without Special Characters:", cleaned_text)


Text without Special Characters: Hello there How are you doing today Lets learn some NLP
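In the output above, "Let's" became "Lets" because the apostrophe was deleted. As a minimal sketch of two common variations, the pattern can be adjusted to keep apostrophes, or punctuation can be replaced with a space instead of being deleted. The sample strings are illustrative only:

```python
import re

sample = "Hello there! How are you doing today? Let's learn some NLP."

# Keep letters, digits, whitespace, and apostrophes; drop everything else
keep_apostrophes = re.sub(r"[^a-zA-Z0-9\s']", "", sample)
print(keep_apostrophes)

# Replace punctuation with a space instead of deleting it, then collapse
# repeated whitespace, so "art,really" does not fuse into "artreally"
spaced = re.sub(r"[^a-zA-Z0-9\s]", " ", "state-of-the-art,really")
spaced = re.sub(r"\s+", " ", spaced).strip()
print(spaced)
```

Which variant is appropriate depends on whether contractions and hyphenated terms matter for the task.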


- **Stop Word Removal**:
  - Remove common words like "the," "is," "and," which may not add meaningful information.
  - Use NLTK's built-in list of stop words or customize your own list based on the task.


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# List of English stop words
stop_words = set(stopwords.words('english'))
print("Stop Words:", stop_words)

# Removing stop words from word tokens
filtered_tokens = [word for word in word_tokens if word.lower() not in stop_words]
print("Tokens after Stop Word Removal:", filtered_tokens)


Stop Words: {'as', 'there', 'most', 'those', 'she', 'some', "you'll", 'them', 'by', "doesn't", 'didn', 'yourselves', 'while', "you're", "isn't", 'do', 'their', 'wouldn', 'does', 'yours', 'being', 'of', 'own', 'into', 'until', 'other', 'or', 'ours', "weren't", 'any', 'm', 'needn', 'hasn', 'just', 'been', 'll', 'which', 'over', 'have', 'isn', 'you', 'below', "you've", 'on', 'haven', 'the', 'down', 'doesn', "you'd", 'above', 'having', 'where', 'should', 'he', 'now', 'same', 'was', 'has', "shouldn't", 's', 'these', 'out', 'so', 'ourselves', 'hadn', 'myself', 'y', 'each', 'but', 'am', 'were', 'after', 'at', 'nor', 'aren', 'if', 'an', 'can', 'yourself', 'such', "shan't", 'only', 'o', "aren't", "needn't", "mustn't", 'here', 'him', 'itself', 'we', 'not', 'under', 'then', 'with', 'both', 'herself', 'mightn', "wasn't", 'for', 'its', 'again', "it's", 'your', 'few', "wouldn't", 'against', 'between', 'will', "that'll", 'theirs', 'before', 't', 'hers', 'to', 'during', 'a', 'that', 'than', 'in', "she

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
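The printed list above contains "not", which often carries the signal in sentiment tasks. One way to customize the list is to subtract the words you want to keep. The sketch below uses a small hand-written set as a stand-in for the NLTK list so that it runs without any download:

```python
# Hypothetical, hand-written subset standing in for stopwords.words('english')
base_stop_words = {"the", "is", "and", "a", "this", "not", "was"}

# Keep negations, since removing them can flip the meaning of a sentence
negations = {"not"}
custom_stop_words = base_stop_words - negations

tokens = ["this", "movie", "was", "not", "good"]
default_filtered = [t for t in tokens if t not in base_stop_words]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print("Default list:", default_filtered)
print("Custom list: ", custom_filtered)
```

The same set subtraction works on the `stop_words` set built from NLTK in the cell above.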


- **Spelling Correction**:
  - Correct misspelled words to improve text quality.
  - Tools like `autocorrect` or `pyspellchecker` can help with spelling correction.


In [None]:
!pip install autocorrect


In [None]:
from autocorrect import Speller

# Initialize spell checker
spell = Speller()

# Correct spelling in each word
corrected_text = " ".join([spell(word) for word in word_tokens])
print("Corrected Text:", corrected_text)


Corrected Text: Hello there ! How are you doing today ? Let 's learn some LP .
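Note that the corrector changed the acronym "NLP" to "LP" in the output above. A possible guard is to skip protected terms, all-caps acronyms, and punctuation before correcting. In the sketch below, `toy_correct` is a hypothetical stand-in for the real spell checker, built on the standard library's `difflib` with a made-up mini dictionary:

```python
import difflib

# Hypothetical mini dictionary; a real corrector ships a much larger word list
known_words = ["hello", "learn", "some", "today"]

def toy_correct(word):
    """Return the closest known word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), known_words, n=1, cutoff=0.8)
    return matches[0] if matches else word

def correct_tokens(tokens, protected=frozenset()):
    corrected = []
    for token in tokens:
        # Leave protected terms, all-caps acronyms, and non-alphabetic tokens untouched
        if token in protected or token.isupper() or not token.isalpha():
            corrected.append(token)
        else:
            corrected.append(toy_correct(token))
    return corrected

result = correct_tokens(["helo", "NLP", "lern", "!", "PyTorch"], protected={"PyTorch"})
print(result)
```

Replacing `toy_correct` with the `spell` object from the cell above would apply the same guard to `autocorrect`.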


- **Handling Abbreviations and Contractions**:
  - Expand abbreviations (e.g., "etc." → "et cetera") and contractions (e.g., "don't" → "do not") for consistency.



In [None]:
!pip install contractions


In [None]:
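from contractions import contractions_dict  # A dictionary of contractions and their expansions

# Example function to expand contractions
def expand_contractions(text, contractions_dict):
    # Index expansions by lowercased key so capitalized forms such as "Let's" are found
    lookup = {key.lower(): value for key, value in contractions_dict.items()}
    # Escape regex metacharacters and try longer keys first,
    # so a short key cannot match inside a longer one
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keys)) + r')\b', re.IGNORECASE)

    def replace(match):
        expansion = lookup[match.group().lower()]
        # Preserve the capitalization of the first letter
        if match.group()[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return pattern.sub(replace, text)

# Expanding contractions
expanded_text = expand_contractions(text, contractions_dict)
print("Expanded Text:", expanded_text)


Expanded Text: Hello there! How are you doing today? Let us learn some NLP.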


#### 2.2 **Text Normalization**


- **Stemming**:
  - Reduce words to their root form by removing suffixes (e.g., "running" → "run").
  - Use algorithms like Porter Stemmer or Snowball Stemmer in NLTK.


In [None]:
from nltk.stem import PorterStemmer

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Stemming each word
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Words:", stemmed_words)


Stemmed Words: ['hello', '!', 'today', '?', 'let', "'s", 'learn', 'nlp', '.']


- **Lemmatization**:
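  - More sophisticated than stemming, it converts words to their base or dictionary form (e.g., "better" → "good" when the word is tagged as an adjective).
  - Use NLTK's `WordNetLemmatizer` for lemmatization; it treats every word as a noun unless a `pos` argument is passed.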


In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')  # For WordNet lemmatizer dependencies

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatizing each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Lemmatized Words: ['Hello', '!', 'today', '?', 'Let', "'s", 'learn', 'NLP', '.']


- **Handling Accents and Diacritics**:
  - Normalize text by removing accents (e.g., "café" → "cafe").
  - Use libraries like `unidecode` for accent removal.


In [None]:
!pip install unidecode # Install the unidecode module




In [None]:
import unidecode

# Example text with accented characters
accented_text = "Café and résumé are common words with accents."

# Removing accents
normalized_text = unidecode.unidecode(accented_text)
print("Normalized Text:", normalized_text)


Normalized Text: Cafe and resume are common words with accents.
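If installing an extra package is not an option, a similar result can be sketched with the standard library's `unicodedata` module. This is an alternative technique, not what `unidecode` does internally: it only removes combining marks, so characters without a decomposition (such as "ø") pass through unchanged.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

stripped = strip_accents("Café and résumé are common words with accents.")
print(stripped)
```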


#### 2.3 **Feature Extraction Techniques**


- **Bag-of-Words (BoW)**:
  - Represent the text as a multiset ("bag") of words, keeping word counts but disregarding grammar and word order.
  - Create a matrix where each row represents a document and each column represents a word, with cell values indicating word frequency.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus (a list of documents)
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Create a CountVectorizer instance
# This will convert the text data into a matrix of token counts
vectorizer = CountVectorizer()

# Fit the vectorizer to the corpus and transform the corpus to a BoW representation
bow_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (i.e., the vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert the BoW matrix to an array for better readability
bow_array = bow_matrix.toarray()

print("Feature Names (Vocabulary):", feature_names)  # Displays the vocabulary
print("Bag-of-Words Matrix:\n", bow_array)  # Displays the BoW matrix where rows represent documents and columns represent word counts


Feature Names (Vocabulary): ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Bag-of-Words Matrix:
 [[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
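To make the mechanics concrete, the same matrix can be rebuilt with the standard library alone. The sketch below assumes `CountVectorizer`'s default behavior (lowercasing and keeping tokens of two or more word characters); the helper name `tokenize` is introduced here for illustration:

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word
bow_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

The rows match the matrix printed above, which shows that BoW is nothing more than per-document word counts over a shared, sorted vocabulary.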


- **Term Frequency-Inverse Document Frequency (TF-IDF)**:
  - Weight each word by its importance: a word scores high when it appears frequently in a document but rarely across the rest of the corpus.
  - Use `TfidfVectorizer` from `scikit-learn` to implement TF-IDF.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer instance
# This will convert the text data into a matrix of TF-IDF features
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the corpus and transform the corpus to a TF-IDF representation
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Get the feature names (i.e., the vocabulary)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to an array for better readability
tfidf_array = tfidf_matrix.toarray()

print("TF-IDF Feature Names (Vocabulary):", tfidf_feature_names)  # Displays the vocabulary
print("TF-IDF Matrix:\n", tfidf_array)  # Displays the TF-IDF matrix where rows represent documents and columns represent TF-IDF scores


TF-IDF Feature Names (Vocabulary): ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
TF-IDF Matrix:
 [[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
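The numbers above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It recomputes the first row using only the standard library:

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(corpus)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))  # L2 norm of the row
    return [value / norm for value in raw]

first_row = tfidf_row(tokenized[0])
print([round(value, 8) for value in first_row])
```

Words that occur in every document ("is", "the", "this") receive the minimum IDF of 1, which is why they end up with the lowest non-zero weights in each row.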


- **Word Embeddings**:
  - Represent words in a continuous vector space, capturing semantic relationships.
  - Use pre-trained embeddings like **Word2Vec**, **GloVe**, or **FastText** for capturing word similarities.
  - Fine-tune embeddings on your dataset using deep learning frameworks like PyTorch.


In [None]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# Example sentences to train Word2Vec
sentences = [
    "This is the first sentence.",
    "Here we have another sentence.",
    "Word embeddings capture word meanings.",
    "This is a different sentence."
]

# Tokenize each sentence into words
# Convert all text to lowercase to avoid case sensitivity issues
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train a Word2Vec model on the tokenized sentences
# Word2Vec parameters:
#   vector_size: size of the embedding vectors
#   window: max distance between the current and predicted word
#   min_count: ignore words with total frequency below this
#   workers: number of threads for training
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector representation for the word 'sentence'
# This vector represents the word's meaning in a numerical form
word_vector = word2vec_model.wv['sentence']

print("Vector for the word 'sentence':\n", word_vector)  # Displays the embedding vector for the word 'sentence'


Vector for the word 'sentence':
 [-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7104042e-03
 -8.5343346e-03  3.2071066e-03 -4.6379971e-03 -5.088
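"Capturing semantic relationships" usually means that related words have vectors pointing in similar directions, which is measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration. With a trained gensim model, `word2vec_model.wv.similarity(...)` computes the same quantity, although a four-sentence corpus is far too small to yield meaningful similarities.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only
vec_cat = [0.9, 0.1, 0.0]
vec_dog = [0.8, 0.2, 0.1]
vec_car = [0.0, 0.1, 0.9]

print("cat vs dog:", cosine_similarity(vec_cat, vec_dog))
print("cat vs car:", cosine_similarity(vec_cat, vec_car))
```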

- **Sentence Embeddings**:
  - Encode entire sentences into vectors, preserving context.
  - Use models like **BERT**, **RoBERTa**, or **Sentence-BERT** for better contextual representations.


In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load a pre-trained BERT model and its corresponding tokenizer
# 'bert-base-uncased' is a commonly used variant of BERT where all text is lowercased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentence to encode
sentence = "Text classification with BERT embeddings."

# Tokenize the input sentence and convert it into token IDs
# return_tensors='pt' returns PyTorch tensors
inputs = tokenizer(sentence, return_tensors='pt')

# Pass the tokenized inputs to the BERT model to obtain outputs
# with torch.no_grad() ensures no gradients are computed, saving memory
with torch.no_grad():
    outputs = model(**inputs)

# Get the sentence embedding by averaging the last hidden states of all tokens
# outputs.last_hidden_state contains the hidden states of each token in the input sentence
sentence_embedding = outputs.last_hidden_state.mean(dim=1).squeeze()

print("Sentence Embedding for the example sentence:\n", sentence_embedding)  # Displays the vector representation for the entire sentence



Sentence Embedding for the example sentence:
 tensor([-4.9895e-01, -3.5024e-02, -3.5461e-01, -2.8169e-02,  1.0095e-01,
         2.0884e-02, -2.5413e-01,  2.7742e-01, -4.6405e-02,  4.4192e-02,
        -2.8290e-01,  1.9983e-01, -2.5800e-01, -2.4812e-01,  1.5008e-02,
         2.6797e-01, -1.7945e-01,  2.4348e-01, -2.2559e-01,  6.5268e-02,
         2.4007e-01, -1.1787e-01, -4.0681e-01,  9.2509e-02,  2.6697e-01,
        -1.2055e-01, -2.9749e-02, -1.4366e-01, -4.2196e-01, -2.2909e-01,
         1.8353e-01,  4.2590e-01,  2.1517e-01, -3.6770e-01, -4.9305e-01,
        -1.7630e-01,  3.1828e-02, -8.9287e-02, -3.1037e-01,  4.3864e-01,
        -7.9540e-01, -1.2213e-01,  7.0302e-02, -1.8723e-01,  1.7915e-01,
        -2.4740e-01, -4.3649e-01, -7.1884e-02, -3.6788e-01, -3.8521e-01,
        -8.9820e-01,  3.1086e-01,  5.0078e-02, -3.1777e-02,  3.0968e-01,
         5.5267e-01,  3.5910e-02, -7.9101e-01,  6.3835e-01,  2.8195e-01,
        -2.6005e-02, -2.4286e-01, -3.6507e-01, -5.7928e-01,  3.6649e-01,
     

- **N-grams**:
  - Use sequences of n consecutive words to capture local word dependencies (e.g., bi-grams, tri-grams).
  - Combine n-grams with BoW or TF-IDF for richer feature sets.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Example text for demonstration
text = ["This is a simple example to demonstrate n-grams feature extraction."]

# Create a CountVectorizer with n-gram range (1, 2)
# This will generate both unigrams (single words) and bigrams (pairs of consecutive words)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit the vectorizer to the text and transform it into an n-gram matrix
ngram_matrix = ngram_vectorizer.fit_transform(text)

# Get the feature names (i.e., the generated n-grams)
ngram_feature_names = ngram_vectorizer.get_feature_names_out()

# Convert the n-gram matrix to an array for better readability
ngram_array = ngram_matrix.toarray()

print("N-gram Feature Names:\n", ngram_feature_names)  # Displays the generated n-grams
print("N-gram Matrix:\n", ngram_array)  # Displays the matrix where rows represent documents and columns represent n-gram counts


N-gram Feature Names:
 ['demonstrate' 'demonstrate grams' 'example' 'example to' 'extraction'
 'feature' 'feature extraction' 'grams' 'grams feature' 'is' 'is simple'
 'simple' 'simple example' 'this' 'this is' 'to' 'to demonstrate']
N-gram Matrix:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
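The feature list above contains "is simple" and "demonstrate grams" because, with default settings, `CountVectorizer` drops single-character tokens ("a" and the "n" of "n-grams") before building n-grams. A minimal sketch of the same sliding-window idea in plain Python, assuming that default token pattern (the helper `ngrams` is introduced here for illustration):

```python
import re

def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is a simple example to demonstrate n-grams feature extraction."
tokens = re.findall(r"\b\w\w+\b", sentence.lower())

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print("Tokens: ", tokens)
print("Bigrams:", bigrams)
print("Total features:", len(unigrams) + len(bigrams))
```

The total of 17 features matches the width of the n-gram matrix printed above.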



#### 2.4 **Advanced Feature Engineering**


- **Domain-Specific Features**:
  - Extract features based on domain knowledge (e.g., the presence of URLs in spam detection, or special symbols in technical or financial documents).
  - The example below detects URLs in a text and counts them.



In [None]:
import re

# Sample text containing a URL
text_with_url = "Visit our website at https://www.example.com for more information."

# Regular expression pattern to detect URLs
url_pattern = r'https?://\S+|www\.\S+'

# Extract URLs from the text
urls = re.findall(url_pattern, text_with_url)
print("Detected URLs:", urls)

# Feature: Count the number of URLs in the text
num_urls = len(urls)
print("Number of URLs in the text:", num_urls)


Detected URLs: ['https://www.example.com']
Number of URLs in the text: 1
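In practice, several such hand-crafted signals are usually collected into one feature dictionary per document. The sketch below is a hypothetical spam-oriented feature set. The feature names and the sample message are assumptions for illustration, and the URL pattern is the one used above.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def domain_features(text):
    # Hypothetical spam-oriented feature set; choose features to suit the task
    return {
        "num_urls": len(URL_PATTERN.findall(text)),
        "num_exclamations": text.count("!"),
        "num_digits": sum(ch.isdigit() for ch in text),
        "has_currency_symbol": int(any(ch in "$€£" for ch in text)),
        "uppercase_ratio": sum(ch.isupper() for ch in text) / max(len(text), 1),
    }

features = domain_features("WIN $1000 now!!! Visit https://www.example.com today")
print(features)
```

A list of such dictionaries can be turned into a numeric matrix and combined with BoW or TF-IDF features.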


- **Part-of-Speech (POS) Tagging**:
  - Use POS tags as features to incorporate syntactic information.
  - Leverage NLTK's `pos_tag` function for POS tagging.


In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')  # Download the POS tagger models

# Example sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)

# Extracting only the POS tags as features
pos_features = [tag for word, tag in pos_tags]
print("POS Features:", pos_features)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
POS Features: ['DT', 'JJ', 'NN', 'NN', 'VBZ', 'IN', 'DT', 'JJ', 'NN', '.']
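A raw list of tags has a different length for every sentence, so it cannot be fed to a classifier directly. One simple option is to convert it into relative tag frequencies over a fixed tag order. The sketch below hardcodes the tagged tokens from the output above so that it runs without NLTK; the choice of `tag_set` is an assumption.

```python
from collections import Counter

# Tagged tokens copied from the output above
pos_tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
            ('dog', 'NN'), ('.', '.')]

tag_counts = Counter(tag for _, tag in pos_tags)
total = len(pos_tags)

# Fixed tag order so every document yields a vector of the same length
tag_set = ["DT", "IN", "JJ", "NN", "VBZ"]
pos_vector = [tag_counts[tag] / total for tag in tag_set]

print("Tag counts:", dict(tag_counts))
print("POS feature vector:", pos_vector)
```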


- **Named Entity Recognition (NER)**:
  - Identify named entities (e.g., person names, locations) and use them as features.
  - Implement NER using NLTK or pre-trained models like spaCy.


In [None]:
import nltk
nltk.download('maxent_ne_chunker')  # Download the NER chunker
nltk.download('words')  # Download word corpus needed for NER

# Example text
text = "Barack Obama was born in Hawaii and was the president of the United States."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform POS tagging before NER
pos_tags = nltk.pos_tag(tokens)

# Perform named entity recognition
named_entities = nltk.ne_chunk(pos_tags)
print("Named Entities:", named_entities)

# Extracting named entity chunks
entities = []
for chunk in named_entities:
    if hasattr(chunk, 'label'):  # Checking if it's a named entity chunk
        entity_name = " ".join(c[0] for c in chunk)
        entity_type = chunk.label()  # The entity type (e.g., PERSON, GPE)
        entities.append((entity_name, entity_type))

print("Extracted Named Entities:", entities)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...


Named Entities: (S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  and/CC
  was/VBD
  the/DT
  president/NN
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.)
Extracted Named Entities: [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('Hawaii', 'GPE'), ('United States', 'GPE')]


[nltk_data]   Unzipping corpora/words.zip.


- **Syntactic Parsing**:
  - Extract syntactic features from parse trees (e.g., noun phrases, verb phrases).


In [None]:
from nltk import CFG
from nltk.parse.generate import generate

# Example CFG (Context-Free Grammar)
grammar = CFG.fromstring("""
  S -> NP VP
  NP -> DT NN | DT NNS
  VP -> VBZ NP
  DT -> 'the'
  NN -> 'cat' | 'dog'
  NNS -> 'cats'
  VBZ -> 'chases'
""")

# Generate up to 10 sentences from the grammar (this grammar yields 9)
for sentence in generate(grammar, n=10):
    print(' '.join(sentence))

# Parse a sentence using the CFG
from nltk.parse import RecursiveDescentParser
parser = RecursiveDescentParser(grammar)

sentence = "the cat chases the dog".split()
print("\nParsing Sentence:")
for tree in parser.parse(sentence):
    print(tree)


the cat chases the cat
the cat chases the dog
the cat chases the cats
the dog chases the cat
the dog chases the dog
the dog chases the cats
the cats chases the cat
the cats chases the dog
the cats chases the cats

Parsing Sentence:
(S (NP (DT the) (NN cat)) (VP (VBZ chases) (NP (DT the) (NN dog))))


- **Sentiment Scores**:
  - Use sentiment analysis tools (e.g., VADER) to add sentiment-related features.


In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # Download the VADER lexicon for sentiment analysis

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Example text
text = "I love this product! It works amazingly well and is the best thing I've bought."

# Get the sentiment scores
sentiment_scores = sid.polarity_scores(text)
print("Sentiment Scores:", sentiment_scores)

# Extract specific features from the sentiment scores
positive_score = sentiment_scores['pos']
neutral_score = sentiment_scores['neu']
negative_score = sentiment_scores['neg']
compound_score = sentiment_scores['compound']

print("Positive Score:", positive_score)
print("Neutral Score:", neutral_score)
print("Negative Score:", negative_score)
print("Compound Score:", compound_score)


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Sentiment Scores: {'neg': 0.0, 'neu': 0.498, 'pos': 0.502, 'compound': 0.9019}
Positive Score: 0.502
Neutral Score: 0.498
Negative Score: 0.0
Compound Score: 0.9019
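The four scores can be used directly as numeric features, or the compound score can be mapped to a coarse label. The sketch below reuses the scores printed above and assumes the commonly used VADER thresholds of ±0.05; the function name `sentiment_label` is introduced here for illustration.

```python
def sentiment_label(compound, threshold=0.05):
    # Thresholds of +/-0.05 are the ones commonly used with VADER
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above
sentiment_scores = {'neg': 0.0, 'neu': 0.498, 'pos': 0.502, 'compound': 0.9019}

# Fixed key order so every document yields the same feature layout
feature_vector = [sentiment_scores[key] for key in ("neg", "neu", "pos", "compound")]
label = sentiment_label(sentiment_scores["compound"])

print("Feature vector:", feature_vector)
print("Label:", label)
```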


- **Topic Modeling**:
  - Use techniques like **Latent Dirichlet Allocation (LDA)** to capture underlying topics as features.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Example corpus
documents = [
    "I love to play football.",
    "The game of football is exciting.",
    "Artificial intelligence is a rapidly growing field.",
    "Machine learning is a subset of artificial intelligence."
]

# Convert documents to a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(documents)

# Perform Latent Dirichlet Allocation (LDA) for topic modeling
lda_model = LatentDirichletAllocation(n_components=2, random_state=0)
lda_model.fit(document_term_matrix)

# Display the topics and top words in each topic
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda_model.components_):
    print(f"Topic {idx+1}:")
    print(" ".join([words[i] for i in topic.argsort()[-5:]]))  # 5 highest-weight words, listed from lowest to highest weight


Topic 1:
subset learning machine artificial intelligence
Topic 2:
play love game exciting football


#### 2.5 **Practical Implementations Using NLTK and PyTorch**


##### 1. **Basic Preprocessing Pipeline with NLTK**:
  - Implement tokenization, stop word removal, stemming/lemmatization, and feature extraction using NLTK.


In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text data
text = "Natural Language Processing (NLP) is an exciting area of machine learning and artificial intelligence."

# 1. Tokenization
# Split the text into individual words
word_tokens = nltk.word_tokenize(text)
print("Word Tokens:", word_tokens)

# 2. Lowercasing
# Convert all tokens to lowercase
lowercase_tokens = [word.lower() for word in word_tokens]
print("Lowercased Tokens:", lowercase_tokens)

# 3. Removing Special Characters and Punctuation
# Filter out non-alphanumeric characters
cleaned_tokens = [re.sub(r'[^a-zA-Z0-9]', '', token) for token in lowercase_tokens if re.sub(r'[^a-zA-Z0-9]', '', token)]
print("Cleaned Tokens:", cleaned_tokens)

# 4. Stop Word Removal
# Remove common stop words like "is", "an", "of"
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in cleaned_tokens if word not in stop_words]
print("Tokens after Stop Word Removal:", filtered_tokens)

# 5. Stemming
# Reduce words to their root form
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

# 6. Lemmatization
# Convert words to their base form
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)


Word Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'an', 'exciting', 'area', 'of', 'machine', 'learning', 'and', 'artificial', 'intelligence', '.']
Lowercased Tokens: ['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'an', 'exciting', 'area', 'of', 'machine', 'learning', 'and', 'artificial', 'intelligence', '.']
Cleaned Tokens: ['natural', 'language', 'processing', 'nlp', 'is', 'an', 'exciting', 'area', 'of', 'machine', 'learning', 'and', 'artificial', 'intelligence']
Tokens after Stop Word Removal: ['natural', 'language', 'processing', 'nlp', 'exciting', 'area', 'machine', 'learning', 'artificial', 'intelligence']
Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'excit', 'area', 'machin', 'learn', 'artifici', 'intellig']
Lemmatized Tokens: ['natural', 'language', 'processing', 'nlp', 'exciting', 'area', 'machine', 'learning', 'artificial', 'intelligence']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


##### 2. **Combining NLTK with Scikit-learn for BoW and TF-IDF**:
  - Use NLTK for text cleaning and scikit-learn's vectorizers for feature extraction.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data for multiple documents
documents = [
    "Natural Language Processing is fun!",
    "Machine learning makes NLP more interesting.",
    "Python is a great language for text processing.",
]

# Preprocess each document using NLTK
def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Lowercasing
    tokens = [word.lower() for word in tokens]
    # Removing special characters
    tokens = [re.sub(r'[^a-zA-Z0-9]', '', word) for word in tokens if re.sub(r'[^a-zA-Z0-9]', '', word)]
    # Stop word removal
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back to a single string
    return ' '.join(tokens)

# Apply preprocessing to each document
preprocessed_documents = [preprocess_text(doc) for doc in documents]
print("Preprocessed Documents:", preprocessed_documents)

# 1. Bag-of-Words Representation
# Create a CountVectorizer instance
bow_vectorizer = CountVectorizer()
# Fit and transform the documents
bow_features = bow_vectorizer.fit_transform(preprocessed_documents)
print("Bag-of-Words Feature Array:\n", bow_features.toarray())
print("Feature Names (BoW):", bow_vectorizer.get_feature_names_out())

# 2. TF-IDF Representation
# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the documents
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_documents)
print("TF-IDF Feature Array:\n", tfidf_features.toarray())
print("Feature Names (TF-IDF):", tfidf_vectorizer.get_feature_names_out())


Preprocessed Documents: ['natural language processing fun', 'machine learning make nlp interesting', 'python great language text processing']
Bag-of-Words Feature Array:
 [[1 0 0 1 0 0 0 1 0 1 0 0]
 [0 0 1 0 1 1 1 0 1 0 0 0]
 [0 1 0 1 0 0 0 0 0 1 1 1]]
Feature Names (BoW): ['fun' 'great' 'interesting' 'language' 'learning' 'machine' 'make'
 'natural' 'nlp' 'processing' 'python' 'text']
TF-IDF Feature Array:
 [[0.5628291  0.         0.         0.42804604 0.         0.
  0.         0.5628291  0.         0.42804604 0.         0.        ]
 [0.         0.         0.4472136  0.         0.4472136  0.4472136
  0.4472136  0.         0.4472136  0.         0.         0.        ]
 [0.         0.49047908 0.         0.37302199 0.         0.
  0.         0.         0.         0.37302199 0.49047908 0.49047908]]
Feature Names (TF-IDF): ['fun' 'great' 'interesting' 'language' 'learning' 'machine' 'make'
 'natural' 'nlp' 'processing' 'python' 'text']
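
To see where the TF-IDF values above come from, the following sketch recomputes the first row by hand. It assumes scikit-learn's default `TfidfVectorizer` settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper names `document_frequency` and `tfidf_row` are only for illustration.

```python
import math

# Preprocessed documents copied from the output above
docs = [
    "natural language processing fun",
    "machine learning make nlp interesting",
    "python great language text processing",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized)

def tfidf_row(tokens):
    """TF-IDF weights for one document: raw count * smoothed IDF, then L2-normalized."""
    raw = {}
    for term in set(tokens):
        idf = math.log((1 + n_docs) / (1 + document_frequency(term))) + 1
        raw[term] = tokens.count(term) * idf
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row = tfidf_row(tokenized[0])
for term in sorted(row):
    print(f"{term:12s} {row[term]:.4f}")
```

Words that occur in only one document (`fun`, `natural`) receive a higher weight than words shared with another document (`language`, `processing`), which is the pattern visible in the first row of the TF-IDF array.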


##### 3. **Building Word Embeddings with PyTorch**:
  - Use pre-trained embeddings (e.g., GloVe) with PyTorch's `torchtext` library.


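The example below does not load pre-trained vectors. Instead, it preprocesses text with NLTK, builds a small vocabulary, and passes the token indices through a randomly initialized `nn.Embedding` layer to show how tokens are mapped to embedding vectors. In practice, the layer's weights would be learned during training or initialized from pre-trained vectors such as GloVe.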



In [None]:
import torch
import torch.nn as nn

# Preprocess text using NLTK (reusing the preprocess_text function from above)
text_for_embeddings = "Deep learning models require high-quality embeddings."
# preprocess_text returns a single space-joined string, so split it back into tokens
processed_text = preprocess_text(text_for_embeddings).split()
print("Preprocessed Text for Embeddings:", processed_text)

# processed_text is now a list of tokens and can be used directly
tokens = processed_text
print("Tokens:", tokens)

# Create a vocabulary mapping each distinct word to an index
# (set() has no guaranteed order, so the indices can differ between runs)
vocab = {word: i for i, word in enumerate(set(tokens))}
print("Vocabulary:", vocab)

# Create an embedding layer with the size of the vocabulary
# Define embedding dimension (e.g., 10)
embedding_dim = 10
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

# Convert tokens into indices based on the vocabulary
indices = torch.tensor([vocab[token] for token in tokens], dtype=torch.long)
print("Indices:", indices)

# Pass the indices through the embedding layer to get word embeddings
embedded_tokens = embedding_layer(indices)
print("Embedded Tokens:\n", embedded_tokens)

Preprocessed Text for Embeddings: ['deep', 'learning', 'model', 'require', 'highquality', 'embeddings']
Tokens: ['deep', 'learning', 'model', 'require', 'highquality', 'embeddings']
Vocabulary: {'deep': 0, 'learning': 1, 'require': 2, 'embeddings': 3, 'model': 4, 'highquality': 5}
Indices: tensor([0, 1, 4, 2, 5, 3])
Embedded Tokens:
 tensor([[-0.6018, -0.6115,  0.6035,  0.0116, -0.0642,  0.1093,  1.5187,  1.8751,
         -0.6289, -0.1515],
        [-0.4774,  1.2844, -0.9774,  0.6586,  0.2338,  1.3237,  1.1129, -0.7060,
          1.6030, -0.8359],
        [-1.1107, -0.5906,  0.2105, -1.1151, -0.4692, -0.1182, -0.4299, -0.5282,
         -1.8053, -0.6475],
        [ 0.6377,  1.6067, -1.4270,  1.0003, -0.1413, -0.4395, -0.1992, -0.9898,
          2.4948, -1.2688],
        [ 0.9818,  1.9060, -0.1517,  1.5721,  0.9548,  1.5889,  0.3732, -0.2467,
          1.5822,  0.2485],
        [ 0.9142,  0.2455,  0.4955,  0.4970,  0.1109,  0.0704,  0.1963,  0.1257,
          0.4097, -0.0279]], grad_fn=<
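
The vocabulary in the cell above is built from `set(tokens)`, so the index assigned to each word can change between runs, and looking up a word that was not in the vocabulary raises a `KeyError`. A minimal sketch of one common remedy follows, assuming reserved `<pad>` and `<unk>` entries. These token names and the helpers `build_vocab` and `encode` are illustrative and not part of the notebook.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical reserved tokens

def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    """Convert tokens to indices, mapping unseen words to UNK and padding to max_len."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

train_tokens = ['deep', 'learning', 'model', 'require', 'highquality', 'embeddings']
vocab = build_vocab(train_tokens)
print(vocab)

# 'network' was never seen, so it falls back to the <unk> index
encoded = encode(['deep', 'network', 'model'], vocab, max_len=5)
print(encoded)
```

The resulting indices can be wrapped in `torch.tensor(...)` and passed to an `nn.Embedding` layer created with `num_embeddings=len(vocab)`. `nn.Embedding` also accepts a `padding_idx` argument so that the padding row stays at zero.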

##### 4. **Integrating BERT for Feature Extraction in PyTorch**:
  - Use Hugging Face's `transformers` library to extract contextual embeddings for each text.



In [None]:
from transformers import BertTokenizer, BertModel

# Preprocess text using NLTK (preprocess_text returns a single space-joined string)
text_for_bert = "Text classification tasks can benefit from pre-trained models like BERT."
processed_text = preprocess_text(text_for_bert)
print("Processed Text for BERT:", processed_text)

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the processed text
input_ids = tokenizer.encode(processed_text, return_tensors='pt')
print("Tokenized Input IDs:", input_ids)

# Extract embeddings using BERT
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs.last_hidden_state

print("BERT Embeddings:\n", last_hidden_states)


Processed Text for BERT: text classification task benefit pretrained model like bert
Tokenized Input IDs: tensor([[  101,  3793,  5579,  4708,  5770,  3653, 23654,  2098,  2944,  2066,
         14324,   102]])
BERT Embeddings:
 tensor([[[-0.4142, -0.1928, -0.0709,  ..., -0.5888, -0.0326,  0.4164],
         [-0.4245,  0.1284,  0.1029,  ..., -0.0319,  0.4030,  0.0015],
         [-0.2798,  0.0493,  0.0273,  ..., -0.7371, -0.1523,  0.1787],
         ...,
         [-0.4364, -0.1351,  0.3804,  ..., -0.6504,  0.2804, -0.1594],
         [-0.4033, -0.7611,  0.0645,  ...,  0.2308,  0.0244,  0.0938],
         [ 0.8104,  0.0061, -0.5499,  ...,  0.3899, -0.5547, -0.1938]]])
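
`last_hidden_state` holds one vector per token (here 12 positions of size 768), whereas a classifier usually needs one fixed-size vector per text. Two common choices are the vector at position 0 (the `[CLS]` token) and the average over the non-padding tokens. The sketch below illustrates masked mean pooling on plain Python lists with made-up numbers; the function name `masked_mean_pool` is hypothetical.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy stand-in for one sequence of model outputs: 4 positions, hidden size 3
token_vectors = [
    [1.0, 0.0, 2.0],   # [CLS]
    [3.0, 2.0, 0.0],   # word piece
    [2.0, 4.0, 4.0],   # [SEP]
    [9.0, 9.0, 9.0],   # padding, excluded by the mask
]
attention_mask = [1, 1, 1, 0]

cls_vector = token_vectors[0]
mean_vector = masked_mean_pool(token_vectors, attention_mask)
print("CLS vector:", cls_vector)
print("Mean-pooled vector:", mean_vector)
```

With the tensors from the cell above, the `[CLS]` vector corresponds to `last_hidden_states[:, 0, :]`.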


#### 2.6 **Challenges in Text Preprocessing and Feature Extraction**


##### 1. **Handling Noisy Data**:
  - Develop strategies to clean and normalize text while preserving useful information.


In [None]:
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

# Example text with noise (misspellings, special characters, repeated characters)
noisy_text = "Thisss is an examplle of noooiiisy texttt!! #NLP @OpenAI :)"

# Step 1: Remove special characters (strips symbols such as '#', '@', '!' and ':)';
# the words attached to hashtags and mentions are kept)
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', noisy_text)
print("Text after removing special characters:", cleaned_text)

# Step 2: Handle repeated characters (e.g., 'sss' in 'Thisss')
def remove_repeated_characters(text):
    # Collapse any run of the same character down to two occurrences
    return re.sub(r'(.)\1+', r'\1\1', text)

cleaned_text = remove_repeated_characters(cleaned_text)
print("Text after handling repeated characters:", cleaned_text)

# Step 3: Tokenize and remove stop words
# Build the stop word set once instead of reloading the list for every token
english_stop_words = set(stopwords.words('english'))
word_tokens = nltk.word_tokenize(cleaned_text)
filtered_tokens = [word for word in word_tokens if word.lower() not in english_stop_words]
print("Tokens after stop word removal:", filtered_tokens)


Text after removing special characters: Thisss is an examplle of noooiiisy texttt NLP OpenAI 
Text after handling repeated characters: Thiss is an examplle of nooiisy textt NLP OpenAI 
Tokens after stop word removal: ['Thiss', 'examplle', 'nooiisy', 'textt', 'NLP', 'OpenAI']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
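
Stripping every non-alphanumeric character, as above, keeps the words attached to hashtags and mentions but would also break URLs into fragments. A possible alternative is to handle these patterns before removing punctuation. The sketch below assumes that mentions and URLs should become placeholder tokens and that hashtags should keep their word; the placeholder names (`<user>`, `<url>`), the function name, and the sample text are invented for illustration.

```python
import re

def normalize_social_text(text):
    """Replace URLs and mentions with placeholders, unwrap hashtags, squeeze repeats."""
    text = re.sub(r'https?://\S+', ' <url> ', text)    # URLs -> placeholder
    text = re.sub(r'@\w+', ' <user> ', text)           # mentions -> placeholder
    text = re.sub(r'#(\w+)', r'\1', text)              # '#NLP' -> 'NLP'
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)         # runs of 3+ -> 2 ('Sooo' -> 'Soo')
    text = re.sub(r'[^a-zA-Z0-9<>\s]', ' ', text)      # drop remaining punctuation
    return ' '.join(text.lower().split())              # lowercase, tidy whitespace

sample = "Sooo coool!!! Thanks @OpenAI, see https://example.com #NLP"
normalized = normalize_social_text(sample)
print(normalized)
```

Whether placeholders, removal, or keeping the raw token works best depends on the classification task, so this ordering is one option rather than a fixed recipe.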


##### 2. **Overfitting in Feature-rich Representations**:
  - With a large number of features (e.g., n-grams), a model can overfit: it performs well on the training data but poorly on unseen data.
  - Dimensionality reduction or regularization can mitigate this risk when working with high-dimensional features.




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Example corpus with feature-rich text
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast, brown fox leaps over a sleepy dog.",
]

# Step 1: Convert the text into TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # Using unigrams, bigrams, and trigrams
tfidf_matrix = vectorizer.fit_transform(corpus)
print("Original TF-IDF Matrix Shape:", tfidf_matrix.shape)

# Step 2: Perform dimensionality reduction using TruncatedSVD
# Reducing the number of features from original dimension to 2 components
svd = TruncatedSVD(n_components=2)
reduced_matrix = svd.fit_transform(tfidf_matrix)
print("Reduced TF-IDF Matrix Shape:", reduced_matrix.shape)

# The original TF-IDF matrix has far more features than documents. Projecting it onto
# 2 components keeps only the strongest latent directions, which can reduce the risk of overfitting.


Original TF-IDF Matrix Shape: (3, 45)
Reduced TF-IDF Matrix Shape: (3, 2)
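
Besides projecting onto a few SVD components, another way to shrink an n-gram feature space is to discard n-grams that occur in too few documents, since features seen only once are easy for a model to memorize. The following sketch shows this idea (document-frequency pruning, a different technique from the SVD above) in plain Python. The threshold `min_df=2` and the helper names are assumptions chosen for the toy corpus, and the token pattern mimics scikit-learn's default of words with at least two characters. scikit-learn's vectorizers expose a comparable `min_df` parameter.

```python
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast, brown fox leaps over a sleepy dog.",
]

def ngrams(text, max_n=3):
    """Lowercased word n-grams for n = 1..max_n (words of 2+ characters)."""
    words = re.findall(r'\b\w\w+\b', text.lower())
    return [' '.join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def prune_by_document_frequency(docs, min_df=2):
    """Keep only n-grams that appear in at least min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc)))  # count each n-gram once per document
    all_features = sorted(df)
    kept = [feature for feature in all_features if df[feature] >= min_df]
    return all_features, kept

all_features, kept = prune_by_document_frequency(corpus)
print("Features before pruning:", len(all_features))
print("Features after pruning:", len(kept))
print(kept)
```

Unlike SVD, pruning keeps the surviving features interpretable, because each remaining column still corresponds to a concrete n-gram.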


##### 3. **Choosing the Right Features**:
  - Perform error analysis to refine the feature set based on task requirements.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Sample data for classification
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast, brown fox leaps over a sleepy dog.",
    "I love NLP and machine learning.",
    "NLP is fascinating and fun!",
    "I dislike boring tasks.",
    "Tasks like these are very interesting.",
]
labels = [0, 0, 0, 1, 1, 2, 2]  # Three classes: 0, 1, 2

# Step 1: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.33, random_state=42)

# Step 2: Convert the text data into TF-IDF features
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Step 3: Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

# Step 4: Make predictions and evaluate the model
y_pred = classifier.predict(X_test_tfidf)
print("Classification Report:\n", classification_report(y_test, y_pred))

# Step 5: Error Analysis using Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# If there are many misclassifications in certain classes, we can perform feature selection
# or add domain-specific features (e.g., specific keywords for each class) to improve the model.


Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         1

    accuracy                           0.33         3
   macro avg       0.33      0.17      0.22         3
weighted avg       0.67      0.33      0.44         3

Confusion Matrix:
 [[1 1 0]
 [0 0 0]
 [0 1 0]]
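
A confusion matrix shows which classes are confused, but refining features usually also requires reading the individual misclassified texts. A minimal sketch of that step follows, using invented texts and predictions rather than the actual split above; the helper `misclassified_examples` is hypothetical.

```python
from collections import Counter

def misclassified_examples(texts, y_true, y_pred):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    return [(text, true, pred)
            for text, true, pred in zip(texts, y_true, y_pred)
            if true != pred]

# Invented evaluation data for illustration only
texts = [
    "The lazy dog sleeps.",
    "NLP is fun.",
    "I dislike boring tasks.",
    "A fox jumps quickly.",
]
y_true = [0, 1, 2, 0]
y_pred = [0, 1, 1, 1]

errors = misclassified_examples(texts, y_true, y_pred)
for text, true, pred in errors:
    print(f"true={true} predicted={pred}: {text}")

# Count which (true, predicted) pairs occur most often to decide where to look first
confusions = Counter((true, pred) for _, true, pred in errors)
print(confusions.most_common())
```

Reading the listed texts often suggests which words or patterns the current features fail to capture for a given class.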


##### 4. **Computational Efficiency**:
  - Balance the complexity of feature extraction techniques with the computational resources available.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Example large text corpus
large_corpus = [
    "Natural language processing is a fascinating field.",
    "Machine learning is revolutionizing many industries.",
    "Deep learning and neural networks are advancing rapidly.",
    "There are numerous applications of artificial intelligence.",
    "The quick brown fox jumps over the lazy dog.",
] * 1000  # Simulate a larger corpus by repeating entries

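# Step 1: Use CountVectorizer with a capped vocabulary size for computational efficiency
# max_features=100 keeps at most the 100 most frequent terms; this toy corpus contains
# only 32 distinct terms, so the cap is not reached here
vectorizer = CountVectorizer(max_features=100)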
vectorized_data = vectorizer.fit_transform(large_corpus)

print("Shape of Vectorized Data with Limited Vocabulary:", vectorized_data.shape)

# Step 2: Tokenizing using NLTK's RegexpTokenizer for efficiency
from nltk.tokenize import RegexpTokenizer

# Initialize the RegexpTokenizer with a pattern for word characters only
tokenizer = RegexpTokenizer(r'\w+')

# Tokenize one of the sample texts
sample_text = "Natural language processing and machine learning are closely related fields."
tokens = tokenizer.tokenize(sample_text)
print("Tokens:", tokens)

# Limiting the vocabulary size and using efficient tokenizers can significantly reduce
# processing time and memory usage when dealing with large datasets.


Shape of Vectorized Data with Limited Vocabulary: (5000, 32)
Tokens: ['Natural', 'language', 'processing', 'and', 'machine', 'learning', 'are', 'closely', 'related', 'fields']
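
Capping the vocabulary still requires a full pass over the corpus to build and store the vocabulary. An alternative that avoids storing a vocabulary at all is feature hashing, where each token is mapped to one of a fixed number of buckets by a hash function. The sketch below is a simplified pure-Python illustration of the idea; the bucket count of 16, the use of `zlib.crc32`, and the function name are arbitrary choices for the example. scikit-learn provides a full implementation as `HashingVectorizer`.

```python
import re
import zlib
from collections import Counter

def hashed_counts(text, n_buckets=16):
    """Count tokens into a fixed number of buckets without storing a vocabulary."""
    counts = Counter()
    for token in re.findall(r'\w+', text.lower()):
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        bucket = zlib.crc32(token.encode('utf-8')) % n_buckets
        counts[bucket] += 1
    # Dense vector of fixed length, regardless of how many distinct words appear
    return [counts[i] for i in range(n_buckets)]

vector = hashed_counts("Natural language processing is a fascinating field.")
print(len(vector), sum(vector))
```

Different tokens can collide in the same bucket; using more buckets reduces collisions at the cost of memory.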


#### 2.7 **Transition to the Next Section**
   - This section has covered various techniques for preparing text data and extracting meaningful features. These steps are crucial for ensuring the input to machine learning and deep learning models is both relevant and well-structured.
   - In the next section, we will explore "Classical Machine Learning Approaches," where the processed text data will be used to train traditional models like Naive Bayes, Decision Trees, and Logistic Regression. The concepts learned in this section will lay the groundwork for implementing these algorithms effectively.