# 1. Key Preprocessing Techniques for Text Data
## 1.1 Tokenization
#### 🔹 Splits text into sentences, words, or smaller units (subwords or characters).
#### 🔹 Turns raw text into a sequence of units that later steps (filtering, counting, embedding) can process.


In [5]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt_tab')

text = "Natural Language Processing (NLP) is amazing! Let's learn it."

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

# Sentence Tokenization
sent_tokens = sent_tokenize(text)
print("Sentence Tokens:", sent_tokens)



Word Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'it', '.']
Sentence Tokens: ['Natural Language Processing (NLP) is amazing!', "Let's learn it."]



#### 🔹 Comparison:

| Method | Pros | Cons |
| --- | --- | --- |
| Word Tokenization | Simple, easy to use | Doesn't handle multi-word expressions |
| Subword Tokenization | Useful for rare words | Increases complexity |
| Sentence Tokenization | Preserves sentence context | Not useful for word-based models |

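
The cell above only shows word and sentence tokenization. As a rough illustration of the subword and character levels, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The function name `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and made up for this example; real subword tokenizers learn their vocabulary from a large corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}

subword_tokens = wordpiece_tokenize("tokenization", toy_vocab)
char_tokens = list("NLP")  # character-level tokenization

print("Subword Tokens:", subword_tokens)
print("Character Tokens:", char_tokens)
```

A rare word such as "tokenization" is covered by two known pieces instead of becoming a single unknown token, which is the main appeal of subword vocabularies.
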
## 1.2 Removing Stopwords
#### 🔹 Stopwords (e.g., "the", "is", "in") carry little meaning on their own in most cases and are often removed to reduce noise.



In [6]:

from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

filtered_words = [word for word in word_tokens if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)


Filtered Words: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'amazing', '!', 'Let', "'s", 'learn', '.']



#### 🔹 Comparison:

| Approach | Pros | Cons |
| --- | --- | --- |
| Remove Stopwords | Reduces noise and model size | Some stopwords are important for meaning |
| Keep Stopwords | Retains full meaning | May add unnecessary complexity |

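
To make the "some stopwords are important for meaning" point concrete, the sketch below removes stopwords with and without protecting negations. The list `toy_stopwords` and the helper `remove_stopwords` are hypothetical and only for illustration; NLTK's English list is much longer and also contains negations such as "not".

```python
# Hypothetical miniature stopword list, for illustration only.
toy_stopwords = {"the", "is", "this", "a", "not"}


def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words unless they are listed in keep."""
    return [
        t for t in tokens
        if t.lower() not in stop_words or t.lower() in keep
    ]


tokens = ["This", "movie", "is", "not", "good"]

plain = remove_stopwords(tokens, toy_stopwords)
with_negation = remove_stopwords(tokens, toy_stopwords, keep={"not"})

print("Plain filtering:", plain)             # the negation is lost
print("Negation protected:", with_negation)  # the negation survives
```

For tasks such as sentiment analysis, keeping a small allow-list of negations is one possible compromise between the two rows of the table.
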
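## 1.3 Stemming and Lemmatization
#### 🔹 Stemming: Strips suffixes with heuristic rules to reach a stem, which may not be a real word (e.g., "running" → "run").
#### 🔹 Lemmatization: Maps a word to its dictionary form (lemma) using a vocabulary and the word's part of speech (e.g., "better" → "good" as an adjective).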



In [15]:

from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

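word1 = "running"
word2 = "best"

# pos='v' tells the lemmatizer to treat the word as a verb
print("Stemmed:", stemmer.stem(word1))
print("Lemmatized:", lemmatizer.lemmatize(word1, pos='v'))

# With pos='v', "best" is looked up as a verb and comes back unchanged;
# adjectives need pos='a'
print("Stemmed:", stemmer.stem(word2))
print("Lemmatized:", lemmatizer.lemmatize(word2, pos='v'))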

Stemmed: run
Lemmatized: run
Stemmed: best
Lemmatized: best



#### 🔹 Comparison:

| Method | Pros | Cons |
| --- | --- | --- |
| Stemming | Fast, simple | Can create non-existent words ("goose" → "goos") |
| Lemmatization | More accurate | Slower, needs POS tagging |

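
The difference between the two rows can be sketched without any library. The code below is not the Porter algorithm or WordNet: `toy_stem` is a hypothetical suffix stripper and `toy_lemmas` is a hypothetical lookup table, both made up to show why stemming can produce non-words while lemmatization depends on the part of speech.

```python
def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word, part of speech).
toy_lemmas = {
    ("better", "a"): "good",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("running", "v"): "run",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form if it is known, else the word unchanged."""
    return toy_lemmas.get((word, pos), word)


for word, pos in [("studies", "n"), ("better", "a"), ("running", "v")]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word, pos))
```

The stems "stud" and "runn" are not words, while the lemmas are. The lookup also only succeeds when the part of speech matches, which is why lemmatizers are usually paired with a POS tagger.
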
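## 1.4 Vectorization (Converting Text to Numeric Form)
#### Since ML models require numerical input, text is converted into numbers using one of the following:

#### 🔹 Bag of Words (BoW)
#### 🔹 TF-IDF (Term Frequency-Inverse Document Frequency)
#### 🔹 Word Embeddings (Word2Vec, GloVe, FastText)
#### 🔹 Transformers (BERT, GPT)
### 1.4.1 Bag of Words (BoW)
#### BoW converts a collection of texts into a count matrix: each row is a document, each column is a vocabulary word, and each cell is how often that word occurs in that document. With its default settings, `CountVectorizer` lowercases the text and drops single-character tokens, which is why "I" is missing from the vocabulary in the output.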



In [16]:

from sklearn.feature_extraction.text import CountVectorizer

text_data = ["I love natural language processing", "NLP is great"]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(text_data)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_matrix.toarray())


Vocabulary: ['great' 'is' 'language' 'love' 'natural' 'nlp' 'processing']
BoW Matrix:
 [[0 0 1 1 1 0 1]
 [1 1 0 0 0 1 0]]



#### 🔹 Pros & Cons:
#### ✅ Simple, effective for small datasets
#### ❌ Ignores word order and meaning
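
To show what the count matrix contains, here is a minimal hand-rolled sketch using only the standard library. It assumes the same tokenization as `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters); the helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter


def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


docs = ["I love natural language processing", "NLP is great"]

# One Counter of term frequencies per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# Sorted vocabulary across all documents defines the column order.
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing terms, which fills the empty cells.
bow = [[c[term] for term in vocabulary] for c in counts]

print("Vocabulary:", vocabulary)
print("BoW Matrix:", bow)
```

Swapping the two words of any document leaves its row unchanged, which is exactly the "ignores word order" limitation.
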

### 1.4.2 TF-IDF
#### TF-IDF weights a word by how often it occurs in a document, scaled down by how many documents in the collection contain it, so words that appear everywhere get low weights.


In [11]:
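from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
# Note: fitting on a single document gives every term the same IDF, so all
# weights come out equal; pass several documents to see terms weighted differently.
tfidf_matrix = tfidf_vectorizer.fit_transform([text])

print("TF-IDF Matrix:\n", tfidf_matrix.toarray())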


TF-IDF Matrix:
 [[0.33333333 0.33333333 0.33333333 0.33333333 0.33333333 0.33333333
  0.33333333 0.33333333 0.33333333]]


#### 🔹 Pros & Cons:
#### ✅ Down-weights words common to many documents, which often works better than raw counts
#### ❌ Still ignores word order
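
The uniform 0.333 values in the output come from fitting on a single document. The sketch below recomputes TF-IDF by hand, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is what `TfidfVectorizer` uses by default. The helper names `tokenize` and `tfidf` and the two-document example are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


def tfidf(docs):
    """Smoothed TF-IDF with L2-normalized rows, as one dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Single document: every term has the same IDF, so all weights are equal.
single = tfidf(["Natural Language Processing (NLP) is amazing! Let's learn it."])

# Two documents: terms shared by both get a lower weight than unique terms.
pair = tfidf(["NLP is great", "NLP is hard"])

print(single[0])
print(pair[0])
```

In the two-document case, "great" outweighs "nlp" and "is" because it appears in only one of the documents.
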

### 1.4.3 Word Embeddings (Word2Vec, GloVe)
#### Embeddings map each word to a dense vector so that words used in similar contexts end up close together in the vector space.



In [12]:

from gensim.models import Word2Vec

# Example sentences (already tokenized)
# Toy corpus: vectors trained on two sentences are close to random
# and only demonstrate the API
sentences = [["I", "love", "NLP"], ["NLP", "is", "amazing"]]

# Train a simple Word2Vec model
model = Word2Vec(sentences, vector_size=10, min_count=1, workers=4)

# Get embedding for 'NLP'
print("Word Vector for 'NLP':", model.wv['NLP'])


Word Vector for 'NLP': [-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]


#### 🔹 Pros & Cons:
#### ✅ Captures semantic meaning, works well for deep learning
#### ❌ Needs large datasets, computationally expensive
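
Closeness in the vector space is usually measured with cosine similarity. The sketch below implements it with the standard library; the three-dimensional vectors in `toy_vectors` are hypothetical values made up for illustration, not output from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, made up for illustration.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print("king vs queen:", round(sim_royal, 3))
print("king vs apple:", round(sim_fruit, 3))
```

With embeddings trained on a large corpus, related words tend to score close to 1 and unrelated words noticeably lower. The two-sentence model above is too small to show that pattern.
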

### 1.4.4 Transformer-Based Embeddings (BERT)
#### 🔹 Pretrained transformer models (e.g., BERT, GPT) generate context-aware embeddings: the vector for a word depends on the sentence it appears in.



In [14]:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input text
inputs = tokenizer("Natural Language Processing is fun!", return_tensors="pt")

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Extract last hidden state: (batch size, number of tokens, hidden size)
embeddings = outputs.last_hidden_state
print("BERT Embedding Shape:", embeddings.shape)


BERT Embedding Shape: torch.Size([1, 8, 768])



#### 🔹 Pros & Cons:
#### ✅ Context-aware, powerful for NLP
#### ❌ Computationally expensive
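
The shape `[1, 8, 768]` means one sentence, eight tokens (including the special `[CLS]` and `[SEP]` tokens), and a 768-dimensional vector per token. One common way to turn these per-token vectors into a single sentence vector is mean pooling over the non-padding positions. The sketch below shows the idea on plain Python lists; `mean_pool`, `toy_hidden`, and `toy_mask` are hypothetical names with made-up values, not output from BERT.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), skipping padding."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical hidden states: 4 token positions, hidden size 3.
# The last position stands for padding and is masked out.
toy_hidden = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(toy_hidden, toy_mask)
print("Sentence vector:", sentence_vector)
```

Masking matters once several sentences of different lengths are batched together, because the shorter ones are padded and the padding vectors should not influence the average.
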

