# * NLP Pipeline

An NLP pipeline is a sequence of steps used to process and analyze text data. Here’s a detailed breakdown of each step:

#### 1. Text Preprocessing

**Text preprocessing** is the first step in the NLP pipeline, crucial for cleaning and preparing raw text data for further analysis. This involves several sub-steps:

1. **Tokenization**:
   - **Word Tokenization**: Splitting a sentence into individual words. For example, "I love NLP" becomes `["I", "love", "NLP"]`.
   - **Sentence Tokenization**: Splitting a paragraph into individual sentences. For example, "I love NLP. It is fascinating." becomes `["I love NLP.", "It is fascinating."]`.

2. **Lowercasing**:
   - Converting all characters in the text to lowercase to ensure uniformity. For example, "I Love NLP" becomes "i love nlp".

3. **Stop Words Removal**:
   - Removing common words that do not contribute significant meaning to the text. These include words like "and", "the", "is". For example, "I love NLP and it is fascinating" becomes "love NLP fascinating".

4. **Stemming and Lemmatization**:
   - **Stemming**: Reducing words to a root form by heuristically stripping suffixes; the result is not always a dictionary word. For example, "running" becomes "run".
   - **Lemmatization**: Reducing words to their dictionary form (lemma), taking the word's part of speech and context into account. For example, "better" becomes "good".

5. **Punctuation Removal**:
   - Removing punctuation marks from the text. For example, "Hello, world!" becomes "Hello world".

6. **Handling Special Characters**:
   - Removing or converting special characters (like emojis, currency symbols) as per the requirement of the analysis.

7. **Text Normalization**:
   - Converting different forms of text into a standard format. This can include converting all numbers to a specific format or expanding contractions (e.g., "don't" to "do not").
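The sub-steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set and the `CONTRACTIONS` table are tiny hypothetical stand-ins for the curated lists a library such as NLTK or spaCy would provide, and stemming and lemmatization are left out because they need such a library.

```python
import re
import string

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"i", "and", "it", "is", "the", "a", "an"}
CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def sentence_tokenize(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def preprocess(text):
    """Lowercase, expand contractions, strip punctuation, tokenize, drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(sentence_tokenize("I love NLP. It is fascinating."))
print(preprocess("I love NLP and it is fascinating"))
print(preprocess("Hello, world! Don't panic."))
```

Because this sketch lowercases before removing stop words, the stop-word example comes out as `['love', 'nlp', 'fascinating']` rather than keeping "NLP" in capitals.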

#### 2. Feature Extraction

Feature extraction transforms the cleaned text into numerical representations.

1. **Bag of Words (BoW)**:
   - Representing each text by the counts of the words it contains, without considering grammar or word order.

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**:
   - Measuring the importance of a word in a document relative to a collection of documents.

3. **Word Embeddings**:
   - Using pre-trained models like Word2Vec, GloVe, or BERT to convert words into numerical vectors that capture semantic meaning.

#### 3. Model Building

Building models to perform specific NLP tasks using the extracted features.

1. **Supervised Learning**:
   - Using labeled data to train models for specific tasks like sentiment analysis or Named Entity Recognition (NER).

2. **Unsupervised Learning**:
   - Finding patterns in unlabeled data, such as clustering similar documents.

3. **Deep Learning**:
   - Employing neural networks for tasks like machine translation and text generation.

#### 4. Model Evaluation

Evaluating the performance of the built models.

1. **Accuracy**:
   - The percentage of correctly predicted instances.

2. **Precision and Recall**:
   - **Precision**: Of the instances the model predicted as positive, the fraction that are actually positive: TP / (TP + FP).
   - **Recall**: Of the instances that are actually positive, the fraction the model identified: TP / (TP + FN).

3. **F1 Score**:
   - The harmonic mean of precision and recall.
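As a small illustration of these metrics, the sketch below uses scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` on made-up labels. The `y_true` and `y_pred` lists are hypothetical and chosen only so the counts are easy to check by hand.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
# By hand: TP = 2, FN = 2, FP = 1, TN = 5.

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # 2PR / (P + R) = 4 / 7

print(accuracy, precision, recall, f1)
```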

#### 5. Deployment

Deploying the NLP model for real-world applications.

1. **API Integration**:
   - Making the NLP model available through an API for other applications to use.

2. **User Interface**:
   - Creating interfaces like chatbots or dashboards for end-users to interact with the NLP system.

#### 6. Monitoring and Maintenance

Ensuring the NLP model remains effective over time.

1. **Performance Monitoring**:
   - Continuously tracking the model's performance in production.

2. **Retraining**:
   - Updating the model with new data to maintain its accuracy and relevance.

### Text Preprocessing Examples

**Tokenization**:
- Word Tokenization: "I love NLP" → `["I", "love", "NLP"]`
- Sentence Tokenization: "I love NLP. It is fascinating." → `["I love NLP.", "It is fascinating."]`

**Lowercasing**:
- "I Love NLP" → "i love nlp"

**Stop Words Removal**:
- "I love NLP and it is fascinating" → "love NLP fascinating"

**Stemming and Lemmatization**:
- Stemming: "running" → "run"
- Lemmatization: "better" → "good"

**Punctuation Removal**:
- "Hello, world!" → "Hello world"

**Handling Special Characters**:
- Removing emojis and currency symbols as needed.

**Text Normalization**:
- Converting "don't" to "do not"
- Standardizing different forms of text to a uniform format.

***
***
***

# * Encoding in Natural Language Processing (NLP)

Encoding in Natural Language Processing (NLP) refers to the process of transforming text data into numerical representations that can be processed by machine learning models. The primary goal of encoding is to convert the unstructured data of natural language into a structured format that algorithms can understand and work with. There are several encoding methods, each with its strengths and applications.

### Types of Encoding in NLP

1. **One-Hot Encoding**
   - **Definition**: Represents each word as a binary vector where only one element is "1" (indicating the presence of the word) and the rest are "0".
   - **Advantages**: Simple and easy to implement.
   - **Disadvantages**: Results in very high-dimensional sparse vectors, which are inefficient and may lead to poor performance for large vocabularies.
   - **Use Case**: Small vocabulary and simple tasks.

2. **Bag of Words (BoW)**
   - **Definition**: Represents text as a collection of word counts or frequencies, disregarding grammar and word order.
   - **Advantages**: Simple and effective for basic text classification.
   - **Disadvantages**: Ignores word order and semantics; large feature vectors for large vocabularies.
   - **Use Case**: Text classification, document categorization.

3. **Term Frequency-Inverse Document Frequency (TF-IDF)**
   - **Definition**: Weighs the frequency of a word in a document against its frequency across all documents, highlighting words that are important in specific documents.
   - **Advantages**: Balances word frequency and importance; more informative than raw frequency counts.
   - **Disadvantages**: Still ignores word order and semantics.
   - **Use Case**: Information retrieval, document classification.

4. **Word Embeddings**
   - **Definition**: Represents words in continuous vector space where semantically similar words are closer together.
   - **Popular Models**:
     - **Word2Vec**: Uses neural networks to generate word vectors based on context within a fixed window size.
     - **GloVe (Global Vectors for Word Representation)**: Uses statistical information from the entire text corpus to learn word vectors.
   - **Advantages**: Captures semantic relationships; dense and low-dimensional vectors.
   - **Disadvantages**: Fixed vocabulary; doesn't handle out-of-vocabulary words well.
   - **Use Case**: Sentiment analysis, text classification, named entity recognition.

5. **Contextual Embeddings**
   - **Definition**: Generates word representations that change based on context, capturing nuanced meanings of words in different situations.
   - **Popular Models**:
     - **ELMo (Embeddings from Language Models)**: Uses bi-directional LSTM to create context-aware embeddings.
     - **BERT (Bidirectional Encoder Representations from Transformers)**: Uses transformers to generate deep contextualized embeddings.
   - **Advantages**: Handles polysemy (multiple meanings of words); captures complex dependencies in text.
   - **Disadvantages**: Computationally intensive; requires large amounts of data and computing power.
   - **Use Case**: Question answering, machine translation, text generation.

6. **Sentence and Document Embeddings**
   - **Definition**: Extends word embeddings to entire sentences or documents to capture the overall meaning.
   - **Popular Models**:
     - **Doc2Vec**: Extension of Word2Vec for documents.
     - **Universal Sentence Encoder**: Encodes sentences into high-dimensional vectors suitable for various NLP tasks.
   - **Advantages**: Captures sentence or document-level semantics; useful for tasks requiring understanding of larger text chunks.
   - **Disadvantages**: May lose fine-grained word-level details.
   - **Use Case**: Text similarity, document classification, summarization.
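To make the sparsity drawback of one-hot encoding (the first method in the list above) concrete, here is a minimal NumPy sketch. The three-word vocabulary and the helper name `one_hot` are illustrative assumptions; a real vocabulary would have thousands of entries, and every vector would be that long.

```python
import numpy as np

# Hypothetical toy vocabulary; index = position of the single 1 in the vector.
vocab = ["cats", "dogs", "love"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dogs"))  # [0 1 0]

# Any two different words are orthogonal, so one-hot vectors carry
# no notion of similarity between words.
print(int(one_hot("cats") @ one_hot("dogs")))  # 0
```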

### Detailed Explanation of Key Methods

#### Word2Vec
- **Architecture**: Two main models—CBOW (Continuous Bag of Words) and Skip-Gram.
- **CBOW**: Predicts a target word from its context.
- **Skip-Gram**: Predicts context words given a target word.
- **Training**: Uses neural networks; optimized using methods like negative sampling or hierarchical softmax.
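The difference between the two architectures shows up in how training examples are built. The sketch below generates them from a tokenized sentence in plain Python; the function names and the window size of 2 are illustrative choices, and the neural network that would be trained on these examples is omitted.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: one example per position, (context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one example per (target word, single context word) pair."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

tokens = "the quick brown fox jumps".split()
print(cbow_examples(tokens)[3])       # (['quick', 'brown', 'jumps'], 'fox')
print(skipgram_examples(tokens)[:2])  # [('the', 'quick'), ('the', 'brown')]
```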

#### BERT
- **Architecture**: Transformer-based model with multiple layers of self-attention mechanisms.
- **Training**: Pre-trained on large corpora using masked language modeling and next sentence prediction tasks.
- **Usage**: Fine-tuned on specific tasks, making it versatile for various NLP applications.

#### TF-IDF
- **Formula**: 
  - Term Frequency (TF): \( \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \)
  - Inverse Document Frequency (IDF): \( \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) \)
  - TF-IDF: \( \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \)

### Applications of Encoding in NLP

- **Text Classification**: Assigning predefined categories to text documents (e.g., spam detection, sentiment analysis).
- **Named Entity Recognition (NER)**: Identifying and classifying named entities in text (e.g., names of people, organizations).
- **Machine Translation**: Translating text from one language to another.
- **Question Answering**: Building systems that can answer questions posed in natural language.
- **Text Generation**: Generating human-like text based on a given input.

### Conclusion

Encoding methods are fundamental to NLP, transforming text data into numerical representations suitable for machine learning models. Each method has its strengths and weaknesses, and the choice of encoding depends on the specific application and requirements. From simple one-hot encoding to sophisticated contextual embeddings like BERT, these techniques enable machines to understand and process human language effectively.

## Q) What is Vectorization?

Vectorization in Natural Language Processing (NLP) refers to the process of converting text data into numerical vectors so that it can be processed by machine learning algorithms. Since most machine learning models work with numerical data, vectorization is essential for enabling these models to understand and analyze text. Here are some common methods of vectorization in NLP:

### 1. **Bag of Words (BoW)**
- **Description:** This approach represents text as a collection of words (or tokens) without considering the order. Each unique word in the corpus becomes a feature, and each document is represented as a vector of word counts or binary values (indicating the presence or absence of words).
- **Example:**
  - Document 1: "I love cats"
  - Document 2: "I love dogs"
  - Vocabulary: ["I", "love", "cats", "dogs"]
  - Vector for Document 1: [1, 1, 1, 0]
  - Vector for Document 2: [1, 1, 0, 1]
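
The vectors above can be reproduced in a few lines of plain Python. This sketch splits on whitespace and keeps the vocabulary in first-seen order so that it matches the example; note that scikit-learn's `CountVectorizer`, with default settings, would instead lowercase the text, sort the vocabulary alphabetically, and drop single-character tokens such as "I".

```python
docs = ["I love cats", "I love dogs"]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Count how often each vocabulary word occurs in each document.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)    # ['I', 'love', 'cats', 'dogs']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```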

### 2. **Term Frequency-Inverse Document Frequency (TF-IDF)**
- **Description:** This method enhances the BoW model by weighting words based on their frequency in a document relative to their frequency in the entire corpus. The idea is to emphasize words that are more informative (frequent in one document but rare in others).
- **Formula:** 
  \[
  \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right)
  \]
  Where:
  - \( \text{TF}(t, d) \) = Term frequency of term \( t \) in document \( d \)
  - \( N \) = Total number of documents
  - \( \text{DF}(t) \) = Number of documents containing term \( t \)

### 3. **Word Embeddings**
- **Description:** Word embeddings represent words as dense vectors in a continuous vector space, capturing semantic meanings and relationships. Popular techniques include:
  - **Word2Vec:** Uses skip-gram or continuous bag of words (CBOW) models to learn word representations based on context.
  - **GloVe (Global Vectors for Word Representation):** Factorizes the word co-occurrence matrix to create word vectors.
  - **FastText:** Similar to Word2Vec but considers subword information, making it effective for morphologically rich languages.

### 4. **Sentence and Document Embeddings**
- **Description:** These techniques extend word embeddings to entire sentences or documents, capturing context and meaning more effectively. Common methods include:
  - **Doc2Vec:** Extends Word2Vec to generate vector representations for larger blocks of text.
  - **Universal Sentence Encoder:** Provides sentence-level embeddings trained on a variety of tasks.

### 5. **Transformers and BERT**
- **Description:** Modern NLP uses transformer models like BERT (Bidirectional Encoder Representations from Transformers) to generate contextualized embeddings for the words in a sentence. Each embedding depends on the entire surrounding context, rather than each word having a single fixed vector.
- **Output:** Each token in a sentence is represented as a vector that captures its meaning based on the surrounding words.

### Summary
Vectorization is a critical step in NLP, allowing models to interpret and analyze text data effectively. The choice of vectorization method depends on the specific application, the nature of the text data, and the desired level of complexity and context understanding.

***

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


In [2]:
data = pd.read_csv("spam.csv", encoding='latin-1')
data


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [3]:
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True) #column rename
data['label'] = data['label'].map({'ham': 0, 'spam': 1}) #label encoding
data

Unnamed: 0,label,text,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,0,"Go until jurong point, crazy.. Available only ...",,,
1,0,Ok lar... Joking wif u oni...,,,
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,0,U dun say so early hor... U c already then say...,,,
4,0,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,,,
5568,0,Will Ì_ b going to esplanade fr home?,,,
5569,0,"Pity, * was in mood for that. So...any other s...",,,
5570,0,The guy did some bitching but I acted like i'd...,,,


In [4]:
data = data.drop(columns=["Unnamed: 2","Unnamed: 3","Unnamed: 4"])
data #column 2,3,4 drop

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [5]:
data.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


### preprocessing
    1- tokens
    2- lowercase
    3- stopwords
    4- punctuation
    5- stemming and lemmatization
    6- join tokens back
 ##### unique vocab

In [6]:
data['text'].head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: text, dtype: object

In [7]:
data['label'].head()

0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64

In [8]:
data.shape

(5572, 2)

In [9]:
# vectorizer = CountVectorizer()
# vectorizer.fit(data['text'])

In [10]:
# # Printing the identified Unique words along with their indices
# print("Vocabulary: ", vectorizer.vocabulary_)

In [11]:
# # Encode the Document
# vector = vectorizer.transform(data['text'])
# print(vector)

In [12]:
# # Value to find
# value_to_find = 1069

# # Find the key for the given value
# key = None
# for k, v in vectorizer.vocabulary_.items():
#     if v == value_to_find:
#         key = k
#         break

# if key is not None:
#     print(f"The key for the value {value_to_find} is '{key}'.")
# else:
#     print(f"The value {value_to_find} is not found in the dictionary.")

In [13]:
# data['text'][0]

In [14]:
# vectorizer.vocabulary_["amore"]

In [15]:
# # Summarizing the Encoded Texts
# print("Encoded Document is:")
# print(vector.toarray())

***

In [None]:
# Vectorization with CountVectorizer (bag-of-words counts)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])
y = data['label']


In [16]:
# Vectorization with TfidfVectorizer (TF-IDF weights)

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(data['text'])

# Get the labels
y = data['label']

# Convert the TF-IDF matrix to a dense array for easier manipulation (optional)
X_dense = X.toarray()

print("TF-IDF matrix (dense array):\n", X_dense)
print("Labels:\n", y)


TF-IDF matrix (dense array):
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Labels:
 0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: label, Length: 5572, dtype: int64


In [17]:
# print(X[2])
X

<5572x8672 sparse matrix of type '<class 'numpy.float64'>'
	with 73916 stored elements in Compressed Sparse Row format>

In [18]:
# print(X[100])

In [19]:
y.head()

0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64

In [20]:
type(X)

scipy.sparse._csr.csr_matrix

In [21]:
type(y)

pandas.core.series.Series

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [23]:
X_train.shape

(4179, 8672)

In [24]:
X_test.shape

(1393, 8672)

In [25]:
y_train.shape

(4179,)

In [26]:
y_test.shape

(1393,)

In [27]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

In [28]:
# Check the label distribution: the counts below show the dataset is imbalanced (far more ham than spam)

data["label"].value_counts()

label
0    4825
1     747
Name: count, dtype: int64

In [29]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.955491744436468
Confusion Matrix:
 [[1202    0]
 [  62  129]]
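
Because the dataset is imbalanced, accuracy alone hides how the model treats the spam class. The sketch below derives precision, recall, and F1 for spam directly from the confusion matrix printed above; the matrix values are copied from that output, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = np.array([[1202, 0],
               [62, 129]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.955 precision=1.000 recall=0.675 f1=0.806
```

So although accuracy is above 95%, roughly a third of the spam messages in the test set (62 of 191) are classified as ham.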


In [30]:
def classify_message(model, vectorizer, message):
    message_vect = vectorizer.transform([message])
    prediction = model.predict(message_vect)
    return "ham" if prediction[0] == 0 else "spam"

# Example of using the function
message = "this is a spam email"
print(classify_message(model, vectorizer, message))


ham
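
One thing to be aware of in the workflow above is that the vectorizer was fitted on the full dataset before the train/test split, so the test messages influenced the vocabulary and the IDF weights. A common way to avoid this is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and fit it on the training split only. The sketch below shows the pattern on a handful of made-up messages (hypothetical data, not the `spam.csv` file), so its predictions are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy messages: 0 = ham, 1 = spam.
texts = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "urgent call now to win", "see you at lunch", "are we still meeting today",
    "call me when you get home", "thanks for the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only.
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=42)),
])
spam_pipeline.fit(X_train, y_train)

def classify_message(pipeline, message):
    """Vectorize and classify a raw message in one step."""
    return "ham" if pipeline.predict([message])[0] == 0 else "spam"

print(classify_message(spam_pipeline, "free prize"))
```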


## Vectorisation
Vectorization is a fundamental step in natural language processing (NLP) that involves converting text data into numerical vectors that can be processed by machine learning algorithms. Here are some common vectorization techniques and their typical use cases:

# 1. Count Vectorizer
CountVectorizer creates a matrix in which each unique word is a column and each text sample in the corpus is a row. Each cell holds the number of times that word occurs in that text sample.

1. Bag of Words (BoW)

Description:

- Converts text into a vector of word frequencies.
- Each unique word in the text is represented as a feature.

When to Use:

- Simple and efficient for smaller datasets.
- Works well for basic text classification and clustering tasks.
- Suitable when word order and context are not important.

Count vectorization, implemented in scikit-learn as `CountVectorizer`, is a popular technique in Natural Language Processing (NLP) for converting a collection of text documents into a matrix of token counts. It is a form of feature extraction: the text becomes numerical data that can be used as input for machine learning algorithms.

Here's a detailed explanation of Count Vectorization:

### Process of Count Vectorization

1. **Tokenization**: This is the first step where the text is split into individual words (tokens). For instance, the sentence "The cat sat on the mat" would be tokenized into ["The", "cat", "sat", "on", "the", "mat"].

2. **Vocabulary Building**: Once the text is tokenized, a vocabulary is built, which is a list of unique words (tokens) from the entire text corpus. For example, if we have the following two sentences:
    - "The cat sat on the mat"
    - "The dog lay on the rug"

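   The vocabulary would be ["The", "cat", "sat", "on", "the", "mat", "dog", "lay", "rug"]. Here "The" and "the" count as separate entries because the text has not been lowercased; with lowercasing they would merge into a single entry.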

3. **Count Encoding**: After building the vocabulary, each document is converted into a vector of counts of each word in the vocabulary. The length of the vector is equal to the size of the vocabulary. For the above example, the count vectors would be:
    - "The cat sat on the mat": [1, 1, 1, 1, 1, 1, 0, 0, 0]
    - "The dog lay on the rug": [1, 0, 0, 1, 1, 0, 1, 1, 1]

### Example in Python

Using Python's `sklearn` library, you can perform Count Vectorization as follows:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = [
    'The cat sat on the mat',
    'The dog lay on the rug'
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Get the vocabulary
vocabulary = vectorizer.get_feature_names_out()

# Convert the matrix to an array and print the result
print(X.toarray())
print(vocabulary)
```

Output:
```
[[1 0 0 1 1 0 1 2]
 [0 1 1 0 1 1 0 2]]
['cat' 'dog' 'lay' 'mat' 'on' 'rug' 'sat' 'the']
```

### Important Points

1. **Case Sensitivity**: By default, CountVectorizer converts all characters to lowercase before tokenizing. This can be controlled using the `lowercase` parameter.

2. **Stop Words**: Common words like "and", "the", "is", etc., can be excluded from the vocabulary using the `stop_words` parameter.

3. **N-grams**: CountVectorizer can also create n-grams (sequences of n words) instead of single words. This can be set using the `ngram_range` parameter.

4. **Sparse Matrix**: The output of CountVectorizer is a sparse matrix, which is efficient in terms of memory usage for large text corpora.

5. **Feature Names**: The order of features (words) in the matrix can be accessed using the `get_feature_names_out()` method.

### Advantages and Disadvantages

**Advantages**:
- Simple and easy to understand.
- Provides a straightforward way to convert text into numerical data.

**Disadvantages**:
- The resulting matrix can be very large and sparse, especially for large vocabularies.
- Does not capture the semantics or meaning of words; just the frequency.
- Sensitive to vocabulary size and may require feature selection techniques to reduce dimensionality.

Count Vectorization is a foundational technique in text processing and often serves as a stepping stone to more advanced methods like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks"]

In [None]:
# Create a Vectorizer Object
vectorizer = CountVectorizer()

vectorizer.fit(document)

In [None]:
# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)

In [None]:
# Encode the Document
vector = vectorizer.transform(document)
print(vector)

In [None]:
# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vector.toarray())

**Key Observations:**

- There are 12 unique words in the document, represented as columns of the table.
- There are 3 text samples in the document, each represented as a row of the table.
- Every cell contains a number that represents the count of the word in that particular text.
- All words have been converted to lowercase.
- The words in the columns are arranged alphabetically.
- Inside CountVectorizer, these words are not stored as strings. Rather, each is given a particular index value. In this case, ‘at’ has index 0, ‘each’ has index 1, ‘four’ has index 2, and so on. The actual representation is shown in the table below:

![image.png](attachment:image.png)

***

# 2. Term Frequency-Inverse Document Frequency (TF-IDF)
Description:

TF-IDF, or Term Frequency–Inverse Document Frequency, is a numerical statistic intended to reflect how important a word is to a document. Although it is another frequency-based method, it is not as naive as Bag of Words.

How does TF-IDF improve over Bag of Words?

In Bag of Words, vectorization is concerned only with the frequency of vocabulary words in a given document. As a result, articles, prepositions, and conjunctions, which contribute little to the meaning, get as much importance as, say, adjectives.

TF-IDF helps overcome this issue: words that are repeated very often do not overpower less frequent but important words.

It has two parts:

TF

TF stands for Term Frequency. It can be understood as a normalized frequency score: the number of times a term appears in a document divided by the total number of terms in that document. It is calculated via the following formula:
![image.png](attachment:image.png)

This number always stays ≤ 1, so we judge how frequent a word is relative to all of the words in a document.

IDF

IDF stands for Inverse Document Frequency, but before going into IDF we must make sense of DF, Document Frequency. It is given by the following formula:
![image-2.png](attachment:image-2.png)

DF tells us the proportion of documents that contain a certain word. So what is IDF? It is based on the reciprocal of the Document Frequency, and the final IDF score comes from the following formula:
![image-3.png](attachment:image-3.png)

Why invert the DF?

As discussed above, the intuition is that the more common a word is across all documents, the less important it is for the current document.

A logarithm is taken to dampen the effect of IDF in the final calculation.

The final TF-IDF score is:

\( \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \)

This is how TF-IDF incorporates the significance of a word. The higher the score, the more important that word is to the document.

To recap, TF-IDF:

- Weighs the frequency of a word by how rare it is across all documents.
- Reduces the impact of frequently occurring words that are less informative.

When to Use:

- Useful for text classification, information retrieval, and document clustering.
- Better than BoW for handling larger datasets with varying document lengths.

Let’s get our hands dirty now and see how TF-IDF looks in practice.

***
***

### Basic Idea

TF-IDF helps to find out which words in a document are important and how relevant they are across a collection of documents. In information retrieval, like in a search engine, TF-IDF can be used to rank documents by their relevance to a search query.

### Step-by-Step Example

Imagine you have three documents and a search query.

#### Documents
1. Document 1: "the cat in the hat"
2. Document 2: "the quick brown fox"
3. Document 3: "the cat and the hat"

#### Query
- Query: "cat hat"

### Step 1: Calculate Term Frequency (TF)

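Term Frequency (TF) counts how often each word appears in each document. To keep the arithmetic simple, this example uses raw counts rather than dividing by the document length.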

**Document 1:**
- "the": 2 times
- "cat": 1 time
- "in": 1 time
- "hat": 1 time

**Document 2:**
- "the": 1 time
- "quick": 1 time
- "brown": 1 time
- "fox": 1 time

**Document 3:**
- "the": 2 times
- "cat": 1 time
- "and": 1 time
- "hat": 1 time

### Step 2: Calculate Inverse Document Frequency (IDF)

IDF measures how rare a word is across all documents. Words that appear in many documents get a lower score. The logarithms below are base 10.

**Formula:**
\[
\text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
\]

- Total number of documents = 3

Calculate IDF for each word:
- "the": \(\log(3/3) = \log(1) = 0\) (appears in all documents, so it's less important)
- "cat": \(\log(3/2) = \log(1.5) \approx 0.18\)
- "hat": \(\log(3/2) = \log(1.5) \approx 0.18\)
- "in": \(\log(3/1) = \log(3) \approx 0.48\)
- "quick": \(\log(3/1) = \log(3) \approx 0.48\)
- "brown": \(\log(3/1) = \log(3) \approx 0.48\)
- "fox": \(\log(3/1) = \log(3) \approx 0.48\)
- "and": \(\log(3/1) = \log(3) \approx 0.48\)

### Step 3: Calculate TF-IDF

Multiply TF by IDF for each term in each document.

**Document 1:**
- "the": \(2 \times 0 = 0\)
- "cat": \(1 \times 0.18 = 0.18\)
- "in": \(1 \times 0.48 = 0.48\)
- "hat": \(1 \times 0.18 = 0.18\)

**Document 2:**
- "the": \(1 \times 0 = 0\)
- "quick": \(1 \times 0.48 = 0.48\)
- "brown": \(1 \times 0.48 = 0.48\)
- "fox": \(1 \times 0.48 = 0.48\)

**Document 3:**
- "the": \(2 \times 0 = 0\)
- "cat": \(1 \times 0.18 = 0.18\)
- "and": \(1 \times 0.48 = 0.48\)
- "hat": \(1 \times 0.18 = 0.18\)

### Step 4: Calculate Query TF-IDF

For the query "cat hat":
- "cat": \(1 \times 0.18 = 0.18\)
- "hat": \(1 \times 0.18 = 0.18\)

### Step 5: Rank Documents by Relevance

Sum the TF-IDF scores for the query terms in each document.

**Document 1:**
- TF-IDF for "cat": 0.18
- TF-IDF for "hat": 0.18
- Total = 0.18 + 0.18 = 0.36

**Document 2:**
- TF-IDF for "cat": 0
- TF-IDF for "hat": 0
- Total = 0

**Document 3:**
- TF-IDF for "cat": 0.18
- TF-IDF for "hat": 0.18
- Total = 0.18 + 0.18 = 0.36

### Result

- Document 1 and Document 3 are equally relevant to the query "cat hat" with a TF-IDF score of 0.36.
- Document 2 is not relevant with a TF-IDF score of 0.

In this way, TF-IDF helps in ranking documents based on their relevance to a search query by considering both the frequency of the query terms in each document and the importance of those terms across all documents.
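
The arithmetic in this example can be checked with a short script. It is a sketch under the same assumptions as the walkthrough: raw counts for TF, base-10 logarithms for IDF, and a relevance score that sums the TF-IDF values of the query terms. The function names are illustrative. Without intermediate rounding, the scores for Documents 1 and 3 come out at about 0.35 rather than 0.36, because 0.18 is itself a rounded value.

```python
import math

docs = {
    "Document 1": "the cat in the hat",
    "Document 2": "the quick brown fox",
    "Document 3": "the cat and the hat",
}
query = "cat hat"

tokenized = {name: text.split() for name, text in docs.items()}

def idf(term):
    """log10(total documents / documents containing the term)."""
    containing = sum(term in tokens for tokens in tokenized.values())
    return math.log10(len(tokenized) / containing)

def score(tokens, query_terms):
    """Sum of raw-count TF x IDF over the query terms present in the document."""
    return sum(tokens.count(t) * idf(t) for t in query_terms if t in tokens)

scores = {name: score(tokens, query.split()) for name, tokens in tokenized.items()}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```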

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

sents = ['coronavirus is a highly infectious disease',
         'coronavirus affects older people the most',
         'older people are at high risk due to this disease']

In [None]:
tfidf = TfidfVectorizer()
transformed = tfidf.fit_transform(sents)

In [None]:
# Get the feature names/words
feature_names = tfidf.get_feature_names_out()

# Convert the TF-IDF matrix to a dense array for easier manipulation (optional)
X_dense = transformed.toarray()

In [None]:
# Print the TF-IDF matrix and feature words
print("TF-IDF Matrix:")
print(X_dense)
print("\nFeature Names:")
print(feature_names)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

sents = [
    'coronavirus is a highly infectious disease',
    'coronavirus affects older people the most',
    'older people are at high risk due to this disease'
]

# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer()

# Fit and transform the sentences
transformed = tfidf.fit_transform(sents)

# Get the feature names/words
feature_names = tfidf.get_feature_names_out()

# Convert the TF-IDF matrix to a dense array for easier manipulation (optional)
X_dense = transformed.toarray()

print("Feature names:", feature_names)
print("TF-IDF matrix (dense array):\n", X_dense)


# 3. Word2Vec
Word2Vec is a popular word embedding technique in NLP. A word embedding is a type of word vector that captures semantic and syntactic similarity between words. Word2Vec was developed by Tomas Mikolov and his team at Google in 2013, and it represents each word as a dense, continuous vector in a multi-dimensional space.

The goal is to position these vectors so that words with related meanings end up close to each other.

---

![image.png](attachment:image.png)
Example: the vectors for ‘cat’ and ‘dog’ would be closer to each other than the vectors for ‘cat’ and ‘girl’.

---

https://geekflare.com/nlp-simplified-vectorization-techniques/

Word2Vec can use either of two model architectures to create word embeddings.

CBOW: Continuous Bag of Words (CBOW) predicts a word from the average of its neighbors' embeddings. It takes a fixed-size window of words around the target word, converts each of them to its embedding, averages those embeddings, and feeds the average to a neural network that predicts the target word.

Example: predict the target ‘fox’.

Context words: ‘The’, ‘quick’, ‘brown’, ‘jumps’, ‘over’, ‘the’

---
![image-2.png](attachment:image-2.png)

CBOW works as follows:

1. Take a fixed-size window of context words, for example 2 (2 to the left and 2 to the right of the target).
2. Convert each context word to its word embedding.
3. Average the embeddings of the context words.
4. Use the averaged vector to predict the target word with a neural network.
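
The CBOW steps above can be sketched as a single forward pass in NumPy. This is an untrained toy, not the gensim implementation: the vocabulary, the embedding size, and the random weight matrices `W_in` and `W_out` are assumptions made for illustration, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over"]
word_to_id = {word: i for i, word in enumerate(vocab)}
embedding_dim = 8

# Randomly initialised weights; training would adjust these.
W_in = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(embedding_dim, len(vocab)))  # output layer


def cbow_predict(context_words):
    """Return a probability distribution over the vocabulary for the target word."""
    ids = [word_to_id[word] for word in context_words]
    hidden = W_in[ids].mean(axis=0)              # average the context embeddings
    scores = hidden @ W_out                      # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())   # numerically stable softmax
    return exp_scores / exp_scores.sum()


# Window of 2 words on each side of the target "fox".
probs = cbow_predict(["quick", "brown", "jumps", "over"])
print(dict(zip(vocab, probs.round(3))))
```

Because the weights are random, the probabilities are close to uniform. Training would raise the probability assigned to the true target word.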
Now, let's see how skip-gram differs from CBOW.

Skip-gram: Skip-gram is also a word embedding model, but it works in the opposite direction. Instead of predicting the target word from its context, skip-gram predicts the context words given the target word.

Skip-gram is often reported to capture semantic relationships better than CBOW, especially for rare words, while CBOW is faster to train.

Example: ‘King’ – ‘Man’ + ‘Woman’ ≈ ‘Queen’
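
The analogy arithmetic can be illustrated with hand-made toy vectors. The two-dimensional vectors below are invented for this sketch (one axis loosely standing for "royalty", the other for "gender") and are not taken from any trained model. Real embeddings have many more dimensions and only approximately satisfy such relations.

```python
import numpy as np

# Hypothetical 2-D embeddings: [royalty, gender]
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {
        word: np.linalg.norm(vec - target)
        for word, vec in vectors.items()
        if word not in (a, b, c)  # exclude the input words themselves
    }
    return min(candidates, key=candidates.get)


result = analogy("king", "man", "woman")
print(result)
```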

If you want to work with Word2Vec, you have two choices: train your own model or use a pre-trained one. The code later in this section trains a small model on a toy corpus.

Word2Vec is a popular technique in Natural Language Processing (NLP) that transforms words into numerical vectors in such a way that words with similar meanings are mapped to nearby points in the vector space. This allows algorithms to understand words not just as unique symbols but in terms of their meanings and relationships to other words.

### Key Concepts

1. **Word Embedding**: Word2Vec creates a mapping of words to vectors of real numbers. These vectors capture semantic meanings of words based on their context.

2. **Vector Space**: Words are represented as points in a high-dimensional space. Words with similar meanings are close to each other in this space.

3. **Context**: The technique uses the context in which words appear to determine their vector representations. Words that frequently appear in similar contexts will have similar vectors.

### How Word2Vec Works

Word2Vec comes in two main architectures: Continuous Bag of Words (CBOW) and Skip-Gram.

1. **Continuous Bag of Words (CBOW)**:
   - Predicts the target word from the surrounding context words.
   - Example: In the sentence "the cat sat on the mat," CBOW would use "the," "cat," "on," "the," and "mat" to predict "sat."

2. **Skip-Gram**:
   - Predicts the surrounding context words from the target word.
   - Example: In the sentence "the cat sat on the mat," Skip-Gram would use "sat" to predict "the," "cat," "on," "the," and "mat."
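
The difference between the two architectures shows up in how training examples are built from a sentence. The helpers below are a minimal sketch with hypothetical function names; a window of 3 is assumed so that the output for "sat" matches the example above.

```python
def cbow_pairs(tokens, window=3):
    """Build (context_words, target_word) examples for CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


def skipgram_pairs(tokens, window=3):
    """Build (target_word, context_word) examples for Skip-Gram."""
    return [
        (target, context_word)
        for context, target in cbow_pairs(tokens, window)
        for context_word in context
    ]


tokens = "the cat sat on the mat".split()

cbow_example = cbow_pairs(tokens)[2]  # the example whose target is "sat"
skipgram_examples = [pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"]

print(cbow_example)
print(skipgram_examples)
```

CBOW produces one example per target word with the whole context as input, while Skip-Gram produces one example per (target, context word) pair.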

### Example

Imagine you have the following sentences:
- "The cat sat on the mat."
- "The dog lay on the rug."

Using Word2Vec, you would transform these sentences into vectors. After training, you might find that the vectors for "cat" and "dog" are close to each other, because they often appear in similar contexts ("sat on the mat" and "lay on the rug").

### Why Use Word2Vec?

1. **Captures Semantic Meaning**: Unlike traditional bag-of-words models that treat words as independent tokens, Word2Vec captures the meanings and relationships between words.
2. **Dimensionality Reduction**: It represents each word as a dense vector of a few hundred dimensions instead of a sparse, vocabulary-sized vector, making the data more manageable for machine learning algorithms.
3. **Improves Performance**: Using word vectors often improves the performance of NLP tasks like text classification, sentiment analysis, and machine translation.

### Summary

In simple terms, Word2Vec is a technique that transforms words into vectors based on their meanings and contexts. This helps computers understand and process human language more effectively by capturing the relationships between words.

In [None]:
# !pip install gensim

In [None]:
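# Import necessary libraries
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data (newer NLTK releases use 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')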

# Sample sentences
sentences = [
    "I love thor",
    "Hulk is an important members of Avengers",
    "Ironman helps Spiderman so Ironman is an Avengers",
    "Spiderman is one of the popular members of Avengers",
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

In [None]:
# Train a Word2Vec model
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, sg=0)

# Find similar words
similar_words = model.wv.most_similar("avengers")

In [None]:
# Print similar words
print("Similar words to 'avengers':")

for word, score in similar_words:
    print(f"{word}: {score}")
    

These are the words most similar to “avengers” according to the Word2Vec model, along with their similarity scores.

The model computes the cosine similarity between the vector for “avengers” and the vectors of the other words in its vocabulary. The score indicates how closely related two words are in the vector space.

Example:

Suppose the word ‘helps’ has a cosine similarity of -0.005911458611011982 with the word ‘avengers’. A value this close to 0 suggests that the two words are essentially unrelated rather than opposed. With a corpus this small, the scores mostly reflect random initialization and may differ between runs.

Cosine similarity values range from -1 to 1, where:

- A value of 1 indicates that the two vectors point in exactly the same direction (perfect positive similarity).
- Values close to 1 indicate high positive similarity.
- Values close to 0 indicate that the vectors are not strongly related.
- Values close to -1 indicate high dissimilarity.
- A value of -1 indicates that the two vectors point in exactly opposite directions (perfect negative similarity).
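
The range described above follows directly from the definition, the dot product of two vectors divided by the product of their lengths. The function below is a small NumPy sketch of that formula; the example vectors are arbitrary.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_similarity([1, 2], [2, 4])    # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 3])        # unrelated directions
opposite = cosine_similarity([1, 2], [-1, -2])        # opposite directions

print(same_direction, orthogonal, opposite)
```

Note that the first pair scores 1 even though the vectors have different lengths: cosine similarity compares direction only.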

When to Use:

- Ideal for capturing word context and semantics.
- Useful in tasks like word similarity, sentiment analysis, and more complex NLP tasks.
- Suitable when the dataset is large enough to train meaningful embeddings or when pre-trained embeddings are available.

***
***

**What is Clustering in NLP?**

    Clustering in NLP groups similar text documents together based on their content.

Here's a simple and short explanation:

1. **Prepare Text Data**: Clean the text by removing unnecessary parts (e.g., punctuation, stopwords), and convert it to lowercase.
2. **Convert Text to Numbers**: Use techniques like TF-IDF or word embeddings to turn text into numerical vectors.
3. **Choose a Clustering Algorithm**: Common ones are K-Means and Hierarchical Clustering.
4. **Apply the Algorithm**: Run the algorithm on the numerical vectors to group similar texts together.

    For example, if you have a collection of news articles, clustering can help group articles about sports, politics, and technology together.
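
The four steps above can be sketched with scikit-learn. The toy documents, the choice of two clusters, and the `random_state` value are assumptions made for this example; real corpora usually need more preprocessing and some experimentation with the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the team won the football match",
    "the striker scored a goal in the football match",
    "parliament passed the new election law",
    "the election law was debated in parliament",
]

# Steps 1 and 2: lowercase, drop English stop words, and convert to TF-IDF vectors.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

# Steps 3 and 4: choose K-Means and assign each document to one of two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

for doc, cluster_id in zip(docs, cluster_ids):
    print(cluster_id, doc)
```

The cluster numbers themselves are arbitrary. What matters is which documents share a number.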

***

## * Synthetic Data Generation in NLP

Synthetic data generation in Natural Language Processing (NLP) involves creating artificial text data that mimics real-world text data. This is useful in various scenarios, such as when there's insufficient data to train models, when privacy concerns prevent the use of real data, or when creating balanced datasets for training.

### Types of Synthetic Data in NLP

1. **Text Augmentation**: Modifying existing text data to create new examples.
2. **Text Generation**: Creating entirely new text data from scratch using models.
3. **Back-Translation**: Translating text to another language and back to create paraphrases.
4. **Template-Based Generation**: Using predefined templates to generate text.
5. **Gibbs Sampling**: Generating text by sampling from a probability distribution.

### How Synthetic Data is Generated in NLP

#### 1. Text Augmentation
Text augmentation involves making slight modifications to existing text data. Common methods include:

- **Synonym Replacement**: Replacing words with their synonyms.
- **Random Insertion**: Adding random words into sentences.
- **Random Deletion**: Removing words from sentences.
- **Random Swap**: Swapping the positions of words in sentences.

Example using Python and `nlpaug` library:
```python
import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog."
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)
print(augmented_text)
```
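
Random deletion and random swap from the list above are simple enough to sketch without a library. The function names, the deletion probability `p`, and the seed are illustrative assumptions, not part of `nlpaug`.

```python
import random


def random_deletion(words, p=0.2, rng=random):
    """Drop each word with probability p, keeping at least one word."""
    kept = [word for word in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, rng=random):
    """Swap the positions of two randomly chosen words."""
    words = words[:]  # work on a copy so the input is not modified
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


rng = random.Random(42)  # fixed seed so the example is repeatable
tokens = "The quick brown fox jumps over the lazy dog".split()

deleted = random_deletion(tokens, p=0.2, rng=rng)
swapped = random_swap(tokens, rng=rng)

print(" ".join(deleted))
print(" ".join(swapped))
```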

#### 2. Text Generation
Text generation creates new text data using models like:

- **Language Models**: Pre-trained generative models such as GPT-2 or GPT-3, which can produce human-like text from an input prompt. (BERT is a masked language model and is mainly used for understanding tasks rather than free-form generation.)
- **Recurrent Neural Networks (RNNs)**: Traditional models for sequence data.
- **Transformers**: The architecture behind models such as GPT-3 and BERT.

Example using `transformers` library from Hugging Face:
```python
from transformers import pipeline

# GPT-3 is not hosted on the Hugging Face Hub, so this uses the open GPT-2 checkpoint.
generator = pipeline('text-generation', model='gpt2')
generated_text = generator("Once upon a time,")[0]['generated_text']
print(generated_text)
```

#### 3. Back-Translation
Back-translation involves translating text to another language and then back to the original language to create paraphrases.

Example using `transformers` library:
```python
from transformers import MarianMTModel, MarianTokenizer

text = "The quick brown fox jumps over the lazy dog"

# English to French
tokenizer_fr = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-fr')
model_fr = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-fr')
translated_ids = model_fr.generate(**tokenizer_fr(text, return_tensors="pt", padding=True))
# Decode the generated token IDs into French text before translating back
french_text = tokenizer_fr.batch_decode(translated_ids, skip_special_tokens=True)

# French to English
tokenizer_en = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-fr-en')
model_en = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-fr-en')
back_translated_ids = model_en.generate(**tokenizer_en(french_text, return_tensors="pt", padding=True))
paraphrased_text = tokenizer_en.batch_decode(back_translated_ids, skip_special_tokens=True)
print(paraphrased_text)
```

#### 4. Template-Based Generation
Template-based generation uses predefined templates to create text.

Example:
```python
import random

templates = ["The {} is {}.", "{} is very {}."]
subjects = ["cat", "dog"]
adjectives = ["cute", "adorable"]

# Pick one template and fill its slots with a random subject and adjective
template = random.choice(templates)
subject = random.choice(subjects)
adjective = random.choice(adjectives)
generated_text = template.format(subject, adjective)
print(generated_text)
```

#### 5. Gibbs Sampling
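Gibbs sampling generates text by repeatedly resampling one position at a time from its conditional probability distribution given the words at the other positions. The snippet below is a deliberately simplified placeholder: it resamples each position uniformly at random and does not condition on the neighboring words.

Example:
```python
import random

def gibbs_sampling(words, num_iterations=1000):
    # Start from the input sequence and repeatedly resample every position.
    current_sample = words[:]
    for _ in range(num_iterations):
        for i in range(len(words)):
            # Simplification: uniform draw. A real Gibbs step would sample
            # from P(word_i | all other words in the sequence).
            current_sample[i] = random.choice(words)
    return ' '.join(current_sample)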

words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
generated_text = gibbs_sampling(words)
print(generated_text)
```
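
The snippet above draws every word uniformly, so it never uses a conditional distribution. A closer sketch of Gibbs sampling is shown below under stated assumptions: a smoothed bigram model estimated from a tiny made-up corpus, a uniform prior on the first word, and each position resampled in proportion to P(word | previous word) × P(next word | word). The corpus, the smoothing constant `alpha`, and all function names are illustrative.

```python
import random
from collections import Counter

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps under the brown tree",
    "the quick fox runs over the hill",
]
sentences = [line.split() for line in corpus]
vocab = sorted({word for sent in sentences for word in sent})

# Bigram counts and how often each word appears as the first element of a bigram.
bigram_counts = Counter((a, b) for sent in sentences for a, b in zip(sent, sent[1:]))
context_counts = Counter(a for sent in sentences for a in sent[:-1])


def bigram_prob(prev, word, alpha=0.1):
    """Add-alpha smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + alpha) / (context_counts[prev] + alpha * len(vocab))


def gibbs_sample_sentence(length=6, num_sweeps=50, seed=0):
    rng = random.Random(seed)
    sample = [rng.choice(vocab) for _ in range(length)]  # random starting state
    for _ in range(num_sweeps):
        for i in range(length):
            # Unnormalised conditional of position i given its two neighbours.
            weights = []
            for candidate in vocab:
                weight = 1.0
                if i > 0:
                    weight *= bigram_prob(sample[i - 1], candidate)
                if i < length - 1:
                    weight *= bigram_prob(candidate, sample[i + 1])
                weights.append(weight)
            sample[i] = rng.choices(vocab, weights=weights, k=1)[0]
    return sample


generated = gibbs_sample_sentence()
print(" ".join(generated))
```

With such a small corpus the output is only loosely fluent. The point is the update rule: each word is redrawn conditioned on the current values of its neighbours.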

### Applications of Synthetic Data in NLP

1. **Data Augmentation**: Enhancing training datasets with more variations to improve model robustness.
2. **Privacy Preservation**: Creating synthetic datasets to protect sensitive information.
3. **Testing and Validation**: Generating edge cases or balanced datasets for rigorous testing.
4. **Domain Adaptation**: Creating domain-specific data when transferring models to new domains.
5. **Handling Imbalanced Data**: Generating data for underrepresented classes to balance the dataset.

### Challenges and Considerations

- **Quality of Synthetic Data**: The generated data must closely mimic real data in terms of semantics and structure.
- **Bias Introduction**: Synthetic data might introduce or amplify biases present in the original data.
- **Model Dependency**: The quality and nature of synthetic data depend heavily on the models and methods used.

Synthetic data generation in NLP is a powerful tool for enhancing datasets and improving model performance, but it requires careful consideration of the methods used and the quality of the generated data.



***

## * How to Handle Imbalanced Text Data in a Text Classification Problem

Handling imbalanced text data in a text classification problem is crucial to ensure that your model does not become biased towards the majority class. Here are several strategies to address this issue:

### 1. Resampling Techniques

#### a. Oversampling the Minority Class

Oversampling increases the number of instances in the minority class. A common method is the Synthetic Minority Over-sampling Technique (SMOTE), which creates synthetic samples by interpolating between a minority-class sample and its nearest minority-class neighbors in feature space.

Example using `imblearn` library:
```python
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
texts = ["text sample 1", "text sample 2", "text sample 3", ...]
labels = [0, 1, 0, ...]

# Vectorize the text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Apply SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, labels)
```
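
To make the idea behind SMOTE more concrete, the sketch below generates one synthetic point by interpolating between a minority sample and one of its nearest minority neighbors. It is a simplified illustration, not the `imblearn` implementation; the function name, the toy 2-D data, and the parameter `k` are assumptions.

```python
import numpy as np


def smote_like_sample(X_minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbor."""
    if rng is None:
        rng = np.random.default_rng()
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # Distances from x to every minority sample; position 0 after sorting is x itself.
    distances = np.linalg.norm(X_minority - x, axis=1)
    neighbor_ids = np.argsort(distances)[1:k + 1]
    neighbor = X_minority[rng.choice(neighbor_ids)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return x + lam * (neighbor - x)


X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_minority, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

The synthetic point always lies on the line segment between two existing minority samples, which is why SMOTE operates on numeric features such as TF-IDF vectors rather than on raw text.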

#### b. Undersampling the Majority Class

Undersampling reduces the number of instances in the majority class to balance the dataset.

Example using `imblearn` library:
```python
from imblearn.under_sampling import RandomUnderSampler

# Apply undersampling
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, labels)
```

### 2. Data Augmentation

Augment the minority class by generating new samples. Techniques include synonym replacement, back-translation, and other text augmentation methods.

Example using `nlpaug` library:
```python
import nlpaug.augmenter.word as naw

# Augmenting the minority class
texts_minority = ["minority text sample 1", "minority text sample 2", ...]

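aug = naw.SynonymAug(aug_src='wordnet')
augmented_texts = []
for text in texts_minority:
    result = aug.augment(text)
    # Recent nlpaug versions return a list of strings; older ones return a single string
    augmented_texts.extend(result if isinstance(result, list) else [result])
texts.extend(augmented_texts)
labels.extend([1] * len(augmented_texts))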
```

### 3. Class Weight Adjustment

Modify the class weights in the loss function to give more importance to the minority class.

Example using `sklearn`:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# compute_class_weight expects the classes as a NumPy array
classes = np.unique(labels)
class_weights = compute_class_weight('balanced', classes=classes, y=labels)
model = LogisticRegression(class_weight=dict(zip(classes, class_weights)))
model.fit(X, labels)
```
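
The `'balanced'` setting weights each class by `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The sketch below reproduces that formula with NumPy on a made-up label array of 8 majority and 2 minority samples.

```python
import numpy as np

labels_example = np.array([0] * 8 + [1] * 2)  # hypothetical imbalanced labels

classes, counts = np.unique(labels_example, return_counts=True)
weights = len(labels_example) / (len(classes) * counts)

class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Here the minority class ends up with a weight four times larger than the majority class, matching the 8:2 imbalance.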

### 4. Ensemble Methods

Use ensemble techniques such as boosting or bagging to improve the model's performance on imbalanced data.

#### a. Balanced Random Forest
Example using `imblearn` library:
```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier()
model.fit(X, labels)
```

#### b. EasyEnsemble
EasyEnsemble combines multiple undersampled datasets to train separate classifiers and then aggregates their predictions.

Example using `imblearn` library:
```python
from imblearn.ensemble import EasyEnsembleClassifier

model = EasyEnsembleClassifier()
model.fit(X, labels)
```

### 5. Evaluation Metrics

Use appropriate evaluation metrics that give a better understanding of the model's performance on imbalanced data, such as:

- Precision
- Recall
- F1 Score
- ROC-AUC
- Confusion Matrix

Example using `sklearn`:
```python
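from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# X_test and y_test are assumed to come from a held-out split
y_pred = model.predict(X_test)
# ROC-AUC should be computed from probabilities or scores, not hard class predictions
y_score = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_score))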
```
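
A small worked example shows why accuracy alone is misleading on imbalanced data. The labels below are made up: 95 negatives, 5 positives, and a model that always predicts the majority class. The helper function is a plain-Python sketch of the standard precision, recall, and F1 definitions.

```python
y_true_example = [0] * 95 + [1] * 5
y_pred_example = [0] * 100  # a model that always predicts the majority class


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


accuracy = sum(t == p for t, p in zip(y_true_example, y_pred_example)) / len(y_true_example)
precision, recall, f1 = precision_recall_f1(y_true_example, y_pred_example)

print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

Accuracy looks high even though the model never finds a single minority example, which the recall and F1 of zero make obvious.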

### 6. Custom Loss Function

Implement a custom loss function that penalizes the model more for misclassifying the minority class.

Example:
```python
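import tensorflow as tf

def custom_loss(y_true, y_pred):
    # Weighted cross-entropy: errors on the positive (minority) class cost pos_weight times more.
    # Assumes y_pred holds raw logits, i.e. the final layer has no sigmoid activation.
    y_true = tf.cast(y_true, y_pred.dtype)
    loss = tf.nn.weighted_cross_entropy_with_logits(y_true, y_pred, pos_weight=class_weights[1])
    return loss

# `model` here must be a Keras model, not the scikit-learn estimator used earlier
model.compile(optimizer='adam', loss=custom_loss)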
```
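
The weighted cross-entropy used above multiplies only the positive-class term by `pos_weight`. The NumPy sketch below spells out that formula so its effect can be checked by hand; the function name and the example values are assumptions for illustration.

```python
import numpy as np


def weighted_bce_with_logits(y_true, logits, pos_weight):
    """pos_weight * y * -log(sigmoid(x)) + (1 - y) * -log(1 - sigmoid(x))."""
    y_true = np.asarray(y_true, dtype=float)
    logits = np.asarray(logits, dtype=float)
    log_p = -np.logaddexp(0.0, -logits)     # log(sigmoid(x)), computed stably
    log_not_p = -np.logaddexp(0.0, logits)  # log(1 - sigmoid(x))
    return -(pos_weight * y_true * log_p + (1.0 - y_true) * log_not_p)


# Same uncertain prediction (logit 0, probability 0.5) for a positive and a negative example.
losses = weighted_bce_with_logits([1, 0], [0.0, 0.0], pos_weight=3.0)
print(losses)
```

With `pos_weight=3.0`, the same uncertain prediction costs three times as much on the positive example as on the negative one.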

### 7. Cross-Validation

Use stratified cross-validation to ensure that each fold has a representative distribution of classes.

Example using `sklearn`:
```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold

y = np.asarray(labels)  # index arrays below do not work on a plain Python list

skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
```

By employing these techniques, you can address the issue of imbalanced text data, leading to more robust and fair models.