<a href="https://colab.research.google.com/github/VasrshiniLekkala/FMML-LABS-/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. First, we look at some NLP techniques for text classification and the tools used for NLP in Python.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form suitable for machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words (like images) cannot be fed directly into algorithms; they must first be converted into vectors, a form that algorithms can operate on.
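The first two cleaning steps can be sketched with the standard library alone. This is a simplified illustration rather than the lab's `cleanText` function; the helper name `basic_clean` and the sample sentence are made up for the example.

```python
import re

def basic_clean(text):
    """Lowercase the text and keep only alphabetic tokens (hypothetical helper)."""
    # Replace every non-letter character (digits, punctuation) with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase, then split on whitespace to get a list of tokens.
    return letters_only.lower().split()

print(basic_clean("Great phone!!! Bought 2 in 2021, LOVE it."))
# → ['great', 'phone', 'bought', 'in', 'love', 'it']
```

Stemming or lemmatization would then be applied to each token in the resulting list.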



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # unknown tag: fall back to the lemmatizer's default (noun)

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # on any lookup error, fall back to the default (noun) lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

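    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatize nor stemmer was requested: return the cleaned tokens
    # unchanged instead of implicitly returning None.
    return word_tokenize(text)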

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
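The idea can be sketched in a few lines of plain Python. This is a minimal illustration on two made-up sentences, not the `CountVectorizer` implementation used in the lab; the names `docs`, `vocab`, and `to_counts` are hypothetical.

```python
from collections import Counter

docs = ["the movie was good", "the movie was bad bad"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Return the word-count vector of `doc` over `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [to_counts(doc) for doc in docs]
print(vocab)   # → ['bad', 'good', 'movie', 'the', 'was']
print(matrix)  # → [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]

# Word order is discarded: a shuffled sentence gets the same vector.
print(to_counts("good was the movie") == to_counts("the movie was good"))  # → True
```

Each row is one document and each column one vocabulary word, which is the document-term matrix layout that the lab's functions return.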

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
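TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag-of-Words: raw counts give frequent but uninformative words as much weight as rare, informative ones. TF-IDF weights each term by how often it occurs in a document and by how rare it is across the whole collection of documents.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) decreases as DF grows (a common definition is log(N / DF) for a collection of N documents) and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product TF × IDF.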
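A minimal sketch of these definitions in plain Python, using the textbook form IDF = log(N / DF) on three made-up sentences. Library implementations, including the `TfidfVectorizer` used in this lab, apply smoothing and normalization by default, so their numbers will differ from this sketch; the names `docs`, `doc_freq`, `idf`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "the food was good",
    "the service was bad",
    "the food was cold",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Textbook inverse document frequency: log(N / DF)."""
    return math.log(n_docs / doc_freq[term])

def tfidf(tokens):
    """Map each term of one document to TF * IDF, with TF as the raw count."""
    term_freq = Counter(tokens)
    return {term: count * idf(term) for term, count in term_freq.items()}

weights = tfidf(tokenized[0])
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

In the first document, "the" and "was" occur in every document and get weight 0, "food" occurs in two of three documents and gets a small weight, and "good" occurs only here and gets the largest weight.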




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
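To see why the choice of distance metric matters for count vectors, here is a minimal sketch of k-nearest-neighbor voting with a pluggable distance function. It is a toy illustration in plain Python, not scikit-learn's `KNeighborsClassifier`; the data and the helper names are made up, and the vectors are assumed to be non-zero.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up count vectors over the vocabulary ["good", "bad"].
# The positive reviews are long (large counts), the negative ones short.
train_X = [[5, 0], [6, 1], [4, 0], [0, 1], [1, 2], [0, 2]]
train_y = ["pos", "pos", "pos", "neg", "neg", "neg"]
query = [1, 0]  # a short review that says "good" once

print(knn_predict(train_X, train_y, query, k=3, distance=euclidean))        # → neg
print(knn_predict(train_X, train_y, query, k=3, distance=cosine_distance))  # → pos
```

Euclidean distance places the short query near the other short documents regardless of their content, while cosine distance compares only the direction of the vectors and so ignores document length.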

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()



KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

| | Category | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

The **TF-IDF** (Term Frequency-Inverse Document Frequency) approach often results in better accuracy than the **Bag-of-Words (BoW)** model for text classification tasks. This is because TF-IDF is more **sensitive to the importance of words** in a given document, whereas the BoW model treats all words as equally important, leading to several key differences. Below are the main reasons why TF-IDF generally performs better than Bag-of-Words:

### 1. **Weighting Words by Importance**
   - **BoW**: In the Bag-of-Words model, each word in the vocabulary is treated equally, meaning that the frequency of each word is simply counted across the document. The model doesn’t differentiate between common words (e.g., "the", "is", "and") and rare but informative words (e.g., "algorithm", "machine learning").
   - **TF-IDF**: TF-IDF adjusts the raw word frequencies by two components:
     - **Term Frequency (TF)**: Measures how often a word appears in a document.
     - **Inverse Document Frequency (IDF)**: Measures how important a word is within the entire corpus. Words that appear frequently across many documents (e.g., stop words like "the", "is") are downweighted, while words that are rare in the corpus (but appear often in a specific document) are given more importance.
   - **Impact**: TF-IDF helps to reduce the weight of common words and increase the weight of words that are more distinctive to a particular document. This makes the representation more meaningful and often leads to better accuracy because the model focuses more on important, discriminative words rather than overemphasizing common terms that don't contribute much to classification.

---

### 2. **Downweighting Common Words (Stop Words)**
   - **BoW**: The Bag-of-Words model counts all occurrences of words, including high-frequency words that appear in nearly every document (like stop words: "the", "a", "in", "on").
   - **TF-IDF**: The **IDF component** of TF-IDF penalizes words that appear frequently across many documents, thus reducing their influence on the model. Common words such as "the", "of", and "in" are assigned a low IDF score because they occur in nearly all documents, making them less important for distinguishing documents.
   - **Impact**: By downweighting common words, TF-IDF helps avoid overfitting to those words and focuses more on terms that are meaningful for the specific topic or context of the document, leading to better classification accuracy.

---

### 3. **Capturing Rare but Informative Words**
   - **BoW**: In the Bag-of-Words model, a word's weight depends only on its count within the document, so a term that is rare across the corpus gets no extra emphasis.
   - **TF-IDF**: Rare words, which are **informative** but not universally common across the corpus, are given a higher weight by the **IDF** component. This is especially helpful in domains where certain terms (e.g., domain-specific terms like "neural network" or "polynomial") carry a lot of meaning for classification, but they might not appear in many documents.
   - **Impact**: TF-IDF helps the classifier prioritize rare but important words, improving the model’s ability to correctly classify documents based on meaningful features.

---

### 4. **Improved Discriminative Power**
   - **BoW**: Since every word is treated equally, the Bag-of-Words model often has poor discriminative power, especially when words are frequent and not informative. For example, frequent words like "book", "price", or "happy" could appear in many documents regardless of the actual content or topic, leading to suboptimal classification.
   - **TF-IDF**: By adjusting for the document frequency of terms, TF-IDF emphasizes terms that provide more **discriminative** power. A word that is unique to a small set of documents will have a higher TF-IDF score, making it a more useful feature for distinguishing between classes or topics.
   - **Impact**: TF-IDF enhances the classifier's ability to differentiate between documents based on distinctive words, improving overall classification accuracy.

---

### 5. **Handling Long and Short Documents**
   - **BoW**: In the BoW approach, the frequency of a word in a document is simply counted, and the representation does not account for document length. A long document therefore has a much larger count vector than a short one on the same topic, which distorts distance-based comparisons such as Euclidean KNN.
   - **TF-IDF**: The IDF weighting does not by itself correct for **document length**. The length handling comes from normalization: scikit-learn's `TfidfVectorizer` scales each document vector to unit L2 norm by default (`norm='l2'`), so documents of different lengths become comparable.
   - **Impact**: TF-IDF can provide a more **balanced** representation for long and short documents, leading to better performance in tasks that involve varying document lengths.
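
The effect of document length can be illustrated with L2 normalization, which scales a vector to unit length. The sketch below uses made-up count vectors and a hypothetical helper `l2_normalize`; it shows the normalization step in isolation, not a full TF-IDF pipeline.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

short_doc = [1, 2, 0]    # term counts for a short document
long_doc = [10, 20, 0]   # the same text repeated ten times

normalized_short = l2_normalize(short_doc)
normalized_long = l2_normalize(long_doc)

# Up to floating-point rounding, both documents map to the same unit vector.
same = all(math.isclose(a, b) for a, b in zip(normalized_short, normalized_long))
print(same)  # → True
```

Before normalization the two vectors are far apart in Euclidean distance even though the documents have identical content; after normalization they coincide.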

---

### 6. **Robustness Against Overfitting**
   - **BoW**: Because Bag-of-Words counts every word occurrence in a document, including frequent but uninformative words, the model might overfit to these non-discriminative words.
   - **TF-IDF**: By downweighting frequent words that appear across many documents, TF-IDF reduces the likelihood of overfitting to irrelevant features. This results in a more **generalizable model** that performs better on unseen data.
   - **Impact**: TF-IDF typically leads to better generalization, especially in cases where a large number of documents are available.

---

### 7. **Better Suitability for Text Classification Tasks**
   - **BoW**: While Bag-of-Words is a simple and intuitive approach, its performance can suffer when dealing with large corpora with varying document lengths and common words. BoW typically works well for simple tasks but struggles with complex text classification problems, especially those that require distinguishing subtle differences in meaning.
   - **TF-IDF**: TF-IDF is often more suitable for text classification because it gives more weight to important and distinctive terms while reducing the influence of noise. It’s especially effective for tasks such as **topic classification**, **spam detection**, and **sentiment analysis**, where identifying key, meaningful words is crucial.
   - **Impact**: TF-IDF’s ability to focus on important words leads to improved accuracy, particularly in more challenging text classification scenarios.

---

### Conclusion: Why TF-IDF is Better than BoW

- **BoW treats all words equally** (including common, non-informative ones), which can lead to **overfitting** and poor performance in real-world applications.
- **TF-IDF adjusts for the frequency of words across documents**, assigning higher importance to words that are **rare but informative** within a document, while downweighting common terms that don't contribute much to the classification task.
- As a result, TF-IDF is generally **more robust, discriminative, and accurate** than Bag-of-Words, especially in real-world text classification tasks where distinguishing between meaningful terms and noise is critical.

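Overall, **TF-IDF improves the representation of text by making the model more sensitive to important features**, which often yields higher accuracy in text-based tasks. Note that in this lab the two KNN models also differ in distance metric (`euclidean` vs. `cosine`) and neighbor weighting (`uniform` vs. `distance`), so the accuracy gap cannot be attributed to TF-IDF alone.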

2. Can you think of techniques that are better than both BoW and TF-IDF?

Yes, while **Bag-of-Words (BoW)** and **TF-IDF** are foundational techniques in natural language processing (NLP), there are more advanced and often more powerful methods for text representation and understanding that address some of their limitations. Below are some techniques that typically perform better than BoW and TF-IDF, especially in complex NLP tasks:

### 1. **Word Embeddings (Word2Vec, GloVe, FastText)**
   - **Overview**: Word embeddings represent words as dense, continuous vectors, typically with a few hundred dimensions. Unlike BoW and TF-IDF, which use sparse vectors with one dimension per vocabulary word and rely on word counts, word embeddings capture the **semantic meaning** of words by considering their context in large corpora.
   - **Examples**:
     - **Word2Vec** (Skip-gram and CBOW): Trains word vectors by predicting a word's context (or vice versa).
     - **GloVe** (Global Vectors for Word Representation): Captures the global statistical information of a corpus and factors it into the vector representation of each word.
     - **FastText**: Builds upon Word2Vec by representing words as **bags of character n-grams**, improving the handling of rare and out-of-vocabulary words.
   - **Advantages over BoW and TF-IDF**:
     - Word embeddings encode **semantic similarity** (e.g., "king" and "queen" are closer than "king" and "car") rather than treating words independently.
     - **Dense vectors** (compared to sparse BoW and TF-IDF) lead to **more efficient storage and computation**.
     - Better at handling **synonyms**, context, and word relationships (e.g., analogies like "man" is to "woman" as "king" is to "queen").
   - **Limitations**: Requires large corpora for training, and pre-trained models might not capture domain-specific language effectively.
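
Semantic similarity between embeddings is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors invented for illustration; real Word2Vec or GloVe vectors are learned from data and are much larger.

```python
import math

# Hypothetical embeddings, made up for this example only.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "car":   [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_car = cosine_similarity(embeddings["king"], embeddings["car"])
print(f"king-queen: {sim_king_queen:.2f}")
print(f"king-car:   {sim_king_car:.2f}")
```

With these toy vectors, "king" is far closer to "queen" than to "car". In a BoW or TF-IDF vocabulary, by contrast, any two distinct words occupy separate dimensions, so their similarity is always zero.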

---

### 2. **Contextualized Word Embeddings (ELMo, BERT, GPT, RoBERTa)**
   - **Overview**: Unlike traditional word embeddings (like Word2Vec or GloVe), which map each word to a single vector, **contextualized embeddings** generate different representations for the same word depending on its **context** in a sentence. This is a significant advancement because many words have different meanings depending on context (e.g., "bank" as a financial institution vs. "bank" as the side of a river).
   - **Examples**:
     - **ELMo** (Embeddings from Language Models): Provides context-dependent word embeddings by using deep bidirectional LSTMs (Long Short-Term Memory networks).
     - **BERT** (Bidirectional Encoder Representations from Transformers): A transformer-based model that captures the context of words by training on both the left and right context of a word.
     - **GPT** (Generative Pre-trained Transformer): A unidirectional language model but still highly effective for generating and understanding context.
     - **RoBERTa**: A variant of BERT that has improved training procedures and typically performs better on a wide range of NLP tasks.
   - **Advantages over BoW and TF-IDF**:
     - **Context-sensitive representations**: Words can have different meanings in different contexts, which are captured by models like BERT and GPT.
     - **Pre-trained models** like BERT and RoBERTa can be fine-tuned for specific tasks with smaller datasets, providing state-of-the-art performance.
     - **Better handling of polysemy**, idiomatic expressions, and complex sentence structures.
   - **Limitations**:
     - Computationally expensive and require significant resources to fine-tune, especially for large datasets.
     - Still need labeled data for fine-tuning on specific tasks, although usually far less than training a model from scratch.

---

### 3. **Transformers for Sequence Encoding (BERT, RoBERTa, T5, XLNet)**
   - **Overview**: **Transformers** are a class of deep learning models that use self-attention mechanisms to weigh the importance of different words in a sequence, regardless of their position. These models, such as **BERT**, **T5**, **XLNet**, and **RoBERTa**, have set new standards for NLP performance in tasks like sentiment analysis, machine translation, and question answering.
   - **Examples**:
     - **BERT**: Pretrained on a large corpus and can be fine-tuned for a variety of downstream tasks.
     - **T5** (Text-to-Text Transfer Transformer): Treats every NLP task as a text generation problem, allowing for more flexibility across tasks.
     - **XLNet**: Uses permutation-based autoregressive pretraining on top of the Transformer-XL architecture, which lets it use bidirectional context and helps it capture long-range dependencies in the text.
   - **Advantages over BoW and TF-IDF**:
     - **Captures global context**: Unlike BoW, which ignores word order, transformers consider the relationships between words across the entire sequence.
     - **Better performance**: Transformers have consistently outperformed traditional techniques like BoW and TF-IDF on most NLP tasks (e.g., named entity recognition, sentiment classification, etc.).
     - **Task-agnostic**: They can be fine-tuned for different tasks, making them versatile.
   - **Limitations**:
     - **Resource-intensive**: Training and even fine-tuning large models like BERT requires substantial computational resources.
     - **Complexity**: The architecture is much more complex compared to traditional models.
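
The self-attention idea mentioned in the overview can be sketched in plain Python. This is a heavily simplified illustration: it uses the token vectors themselves as queries, keys, and values, whereas real transformers apply learned projections, multiple heads, and many layers. The token vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = `vectors`."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is the attention-weighted average of all token vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d token vectors
contextual = self_attention(tokens)
for vector in contextual:
    print([round(x, 3) for x in vector])
```

Each output vector mixes information from the whole sequence, which is how the same word can receive a different representation in a different sentence.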

---

### 4. **Sentence Embeddings (Sentence-BERT, Universal Sentence Encoder)**
   - **Overview**: Sentence-level embeddings represent entire sentences (or even paragraphs) as a single vector, capturing the overall meaning or semantic information of the sentence rather than focusing on individual words. This is particularly useful for tasks like **semantic textual similarity**, **paraphrase detection**, and **information retrieval**.
   - **Examples**:
     - **Sentence-BERT**: A modification of BERT that is fine-tuned to generate semantically meaningful sentence embeddings for tasks like sentence similarity and clustering.
     - **Universal Sentence Encoder**: A model that generates sentence-level embeddings optimized for tasks such as semantic textual similarity and sentence classification.
   - **Advantages over BoW and TF-IDF**:
     - **Captures sentence-level meaning**: Instead of representing documents as a collection of individual word counts, sentence embeddings capture the broader meaning of a sentence.
     - **More efficient for tasks like clustering** or comparing sentences for similarity, where BoW or TF-IDF might struggle.
   - **Limitations**:
     - Still relies on deep learning models, which are resource-intensive to train and fine-tune.
     - Not as interpretable as simpler methods like BoW or TF-IDF.

---

### 5. **Topic Modeling (Latent Dirichlet Allocation - LDA)**
   - **Overview**: Topic modeling techniques, such as **Latent Dirichlet Allocation (LDA)**, provide a probabilistic framework for uncovering topics within a collection of text. LDA assumes each document is a mixture of topics, and each topic is a mixture of words. By identifying these topics, LDA can provide a more abstract representation of a document's content.
   - **Advantages over BoW and TF-IDF**:
     - **Captures themes**: Instead of focusing on individual words, LDA captures the underlying topics that span multiple words, leading to a higher-level understanding of the document.
     - **Reduces dimensionality**: The model represents each document by a few topics, rather than by a sparse vector of word frequencies.
   - **Limitations**:
     - Requires careful tuning of hyperparameters (like the number of topics).
     - May not perform as well on small datasets or datasets with noisy text.

---

### 6. **Graph-Based Representations (TextRank, Graph Neural Networks)**
   - **Overview**: Graph-based methods, such as **TextRank** (an algorithm similar to PageRank) and **Graph Neural Networks (GNNs)**, model text as a graph of words or sentences. This can capture **semantic relationships** between words and allow for more complex reasoning about the text.
   - **Advantages over BoW and TF-IDF**:
     - **Captures relationships** between words and entities beyond linear sequences, helping to capture context and meaning in a more dynamic way.
     - Can be applied to document-level tasks (e.g., summarization) as well as word-level tasks (e.g., word sense disambiguation).
   - **Limitations**:
     - Requires significant computational power for graph construction and processing.
     - May be more complex to implement and understand compared to traditional methods.

---

### Conclusion:
While **BoW** and **TF-IDF** are widely used due to their simplicity and effectiveness in certain tasks, more advanced techniques like **word embeddings** (Word2Vec, GloVe), **contextualized embeddings** (BERT, GPT), and **transformers** (BERT, T5, RoBERTa) are typically **more accurate** for a wider range of NLP tasks. These methods capture deeper semantic relationships, handle polysemy, and provide better context understanding, making them more suitable for complex and nuanced NLP problems.

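For tasks such as text classification, sentiment analysis, or paraphrase detection, these newer techniques often outperform traditional methods, though they come with higher computational costs. On small or simple datasets the gap can be narrow: in this lab, TF-IDF with KNN already reaches about 98.6% accuracy on the spam dataset.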


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

**Stemming** and **Lemmatization** are two widely used text preprocessing techniques in Natural Language Processing (NLP) that reduce words to a base form. Both simplify words, but they do so in different ways, with distinct advantages and disadvantages.

### **Stemming**

**Definition**: Stemming is the process of reducing a word to a **root form** by stripping affixes, in practice mostly suffixes. The resulting string is called a **stem**. The process is rule-based and typically removes common suffixes such as "-ing," "-ed," and "-ly." Stemming doesn't always produce a valid word, but it maps regular inflections of a word (such as "running" and "runs") to the same stem ("run"). Irregular forms such as "ran" are usually left unchanged, because no suffix rule applies to them.

**Common Stemming Algorithms**:
- **Porter Stemmer**: One of the most commonly used stemming algorithms, known for its efficiency but sometimes producing non-standard stems.
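- **Snowball Stemmer**: A stemming framework from the author of the Porter Stemmer. Its English stemmer (often called Porter2) refines the original algorithm, and stemmers for several other languages are available.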
- **Lancaster Stemmer**: Another stemmer that is more aggressive and might result in overly shortened forms.
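
To illustrate what "rule-based suffix stripping" means, here is a deliberately tiny stemmer written from scratch. It is a sketch under stated assumptions, not the Porter, Snowball, or Lancaster algorithm: the rule list `SUFFIX_RULES`, the function name `toy_stem`, and the minimum stem length are all made up for illustration.

```python
# Hypothetical rule list: suffixes are tried in order, first match wins.
SUFFIX_RULES = ["ness", "ing", "ed", "ly", "s"]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, leaving at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stem = word[: len(word) - len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "ll", "ss", "zz").
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "jumped", "quickly", "happiness", "ran", "sing"]:
    print(w, "->", toy_stem(w))
```

Two things are visible even in this toy version: the output can be a non-word ("happi"), and an irregular form like "ran" is left untouched because no suffix rule applies. In practice you would use a tested implementation such as the stemmers in `nltk.stem` rather than hand-written rules.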

---

#### **Pros of Stemming:**
1. **Faster Processing**: Stemming algorithms are typically simpler and faster than lemmatization. Since they follow a set of rules for removing suffixes, they require less computational overhead.
2. **Simple to Implement**: Many stemming algorithms, like the Porter Stemmer, are relatively easy to implement and are available directly in NLP libraries such as NLTK. (spaCy, by contrast, provides lemmatization but no built-in stemmer.)
3. **Works Well for Information Retrieval**: Stemming can be effective in information retrieval or search-based tasks where exact word forms are less important than the root word. For example, searching for "run" could also retrieve results for "running" or "runs."
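
The retrieval benefit can be sketched with a small inverted index keyed by stems. Everything here is a hypothetical illustration: `crude_stem` handles only "-ing" and "-s" and stands in for a real stemmer, and `build_index` and `search` are made-up helper names.

```python
from collections import defaultdict


def crude_stem(word):
    """Stand-in stemmer for illustration: strips only '-ing' and a final '-s'."""
    word = word.lower().strip(".,!?")
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[crude_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(crude_stem(query), set()))


docs = ["She is running late", "He runs every morning", "They walk to work"]
index = build_index(docs)
print(search(index, "run"))      # documents containing "running" or "runs"
print(search(index, "walking"))  # the query is stemmed too
```

The key design point is that documents and queries go through the same normalization, so different surface forms meet at the same index key.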

#### **Cons of Stemming:**
1. **Non-Standardized Output**: The result of stemming may not always be a valid word. For example, "running" becomes "run," but "happiness" becomes "happi," which is not a proper word in English.
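2. **Aggressive and Inaccurate**: Stemming may produce overly aggressive reductions (over-stemming), where unrelated words collapse to the same stem. For instance, the Porter Stemmer reduces both "university" and "universe" to "univers," which erases the difference in meaning.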
3. **Loss of Semantic Meaning**: Since stemming is based on heuristic rules, it does not consider the **context** or meaning of the word, which can lead to misinterpretation in certain applications, such as sentiment analysis.

---

### **Lemmatization**

**Definition**: Lemmatization is the process of reducing a word to its **lemma**, which is its base or dictionary form. Unlike stemming, lemmatization uses **morphological analysis** and usually the word's part of speech to decide its lemma. Lemmatization normally results in a valid word that can be found in the dictionary. For example, "running" (verb) becomes "run," and "better" (adjective) becomes "good."

**Common Lemmatization Algorithms**:
- **WordNet Lemmatizer**: A widely used lemmatizer that relies on the **WordNet lexical database** to look up words and reduce them to their base form. It takes the part of speech (POS) of the word into account, but in NLTK the POS must be passed in explicitly; without it, every word is treated as a noun.
- **spaCy Lemmatizer**: A lemmatizer built into **spaCy** pipelines that, depending on the language and configuration, uses lookup tables and rules driven by the pipeline's POS tags, or a trainable model.
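
The role of the part of speech can be shown with a toy dictionary-based lemmatizer. This is only a sketch: `LEMMA_TABLE` is a hand-written, hypothetical lookup table standing in for a real lexicon such as WordNet, and `toy_lemmatize` is a made-up name. Its noun default mirrors the default of NLTK's `WordNetLemmatizer.lemmatize(word, pos='n')`.

```python
# Hypothetical mini-lexicon: (surface form, POS tag) -> lemma.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("mice", "NOUN"): "mouse",
}


def toy_lemmatize(word, pos="NOUN"):
    """Look up (word, pos); fall back to the lowercased word if it is unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("better", "ADJ"))    # good
print(toy_lemmatize("better"))           # better (treated as a noun, no entry)
print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("meeting", "VERB"))  # meet
print(toy_lemmatize("ran", "VERB"))      # run
```

The second call shows why a missing or wrong POS tag matters: the same surface form is left unchanged when it is looked up under the wrong part of speech.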

---

#### **Pros of Lemmatization:**
1. **Accurate and Meaningful Lemmas**: Lemmatization produces valid words, which are typically closer to the base form or dictionary version of the word (e.g., "better" becomes "good"), making it semantically meaningful.
2. **Context-Aware**: Lemmatization is more context-sensitive than stemming. It uses **POS tagging** (e.g., verb, noun, adjective) to determine the correct lemma, reducing ambiguity. For instance, "meeting" is kept as "meeting" when tagged as a noun but reduced to "meet" when tagged as a verb.
3. **Better for Meaning-Dependent Tasks**: Lemmatization works well for tasks that require understanding the meaning of words, such as **text classification**, **sentiment analysis**, and **machine translation**.

#### **Cons of Lemmatization:**
1. **Slower and More Resource-Intensive**: Lemmatization algorithms, especially those using WordNet or other lexicons, tend to be more computationally expensive and slower than stemming. This is because they require the additional step of looking up the word in a dictionary and applying context-based rules.
2. **Complex to Implement**: Lemmatization is more complex to implement than stemming, as it often requires access to resources like **WordNet** or **part-of-speech tagging**, which may need additional processing.
3. **Requires Contextual Information**: Lemmatization algorithms require contextual information (like POS tagging) to work properly. If POS tagging is inaccurate or unavailable, the lemmatization process may produce incorrect results.

---

### **Comparison of Stemming and Lemmatization**

| **Aspect**                | **Stemming**                                        | **Lemmatization**                                   |
|---------------------------|-----------------------------------------------------|-----------------------------------------------------|
| **Output**                | May not be a valid word (e.g., "happi," "univers"). | Normally a valid dictionary word (e.g., "good," "run").|
| **Algorithm Complexity**  | Simple, fast, rule-based approach.                  | More complex, uses lexicons and context analysis.    |
| **Context Awareness**     | No context awareness; applies heuristics based on suffixes. | Considers context (e.g., POS tagging) to determine lemma.|
| **Accuracy**              | Less accurate, may lead to over-aggressive reductions. | More accurate and context-sensitive.                |
| **Speed**                 | Faster, more efficient.                            | Slower, more resource-intensive.                    |
| **Use Cases**             | Works well for information retrieval, search engines, and large datasets where computational efficiency is critical. | Works well for tasks requiring semantic understanding, such as sentiment analysis, text classification, and machine translation. |

---

### **When to Use Stemming vs. Lemmatization**

- **Use Stemming when**:
  - Speed is crucial, and you're working with large amounts of text (e.g., search engines, information retrieval, document indexing).
  - You're less concerned with preserving exact meanings and more focused on **identifying broad themes** or keywords.
  - The task is tolerant of rough, somewhat distorted words (such as keyword extraction or clustering).
  
- **Use Lemmatization when**:
  - You need more **accurate and semantically meaningful representations** of text (e.g., sentiment analysis, document classification).
  - The task requires **understanding the context** of words, such as in **machine translation** or **question answering**.
  - You want a more sophisticated approach to preprocessing that takes the full meaning and role of words into account.

---

### **Conclusion**

- **Stemming** is generally faster and easier to implement, but it can produce non-standard and sometimes misleading word forms. It's useful when speed is a priority and exact word forms are not crucial.
- **Lemmatization** is more accurate and context-sensitive, producing semantically correct words, but is slower and requires more resources. It is typically better suited for tasks where understanding the **meaning** of words is important.

In summary, the choice between stemming and lemmatization depends on the **specific task** and the **trade-off between computational efficiency and accuracy**. For simple tasks or large datasets, stemming might be sufficient, while for more advanced NLP tasks that require semantic understanding, lemmatization is usually the better choice.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
