## **Common Terms**


# 📘 1. Corpus
**🔹 Easy Meaning:**
A corpus is a big collection of texts (many documents or messages).
Think of it like a folder full of files you want to study or train a model on.

**🧠 Example:**

- A folder with 1,000 emails = Corpus
- A collection of tweets = Corpus

# 📝 2. Vocabulary
**🔹 Easy Meaning:**
Vocabulary is the list of unique words found in your corpus.
It shows which words are used, not how often.

**🧠 Example:**
If your texts are:

“I love NLP”, “NLP is fun”, “I love Python”

Vocabulary = {I, love, NLP, is, fun, Python}

✅ No repeats, just the full word list.
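As a minimal sketch, the vocabulary for the three example texts can be built in plain Python. It assumes words are separated by spaces and keeps the original casing; the helper name `build_vocabulary` is made up for illustration.

```python
# Toy corpus: each string is one document (the three example texts).
corpus = ["I love NLP", "NLP is fun", "I love Python"]


def build_vocabulary(documents):
    """Return the unique words in first-seen order (whitespace tokenization)."""
    vocabulary = []
    seen = set()
    for document in documents:
        for word in document.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


vocabulary = build_vocabulary(corpus)
print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'fun', 'Python']
```

Real pipelines usually also lowercase the text and strip punctuation before building the vocabulary.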



# 📄 3. Document
**🔹 Easy Meaning:**
A document is one piece of text inside your corpus.

**🧠 Example:**
If the corpus is 1,000 emails,
then each email = one document.

**Another example:**

- One tweet
- One review
- One paragraph

All are “documents”.

# 🔤 4. Word
**🔹 Easy Meaning:**
Just a single word in a document.

**🧠 Example:**
In the sentence “I love Python”,
the words are: I, love, Python

---

# **Important NLP Feature Extraction Techniques**

## **1. One-Hot Encoding (OHE)**  
**🔹 Easy Meaning:**  
Turn each word into a vector (a row of 0s and 1s) as long as the vocabulary, where only that word's position is 1 and every other position is 0.  

**🧠 Example:**  
Let’s say vocabulary = ["cat", "dog", "fish"]  

- "cat" → [1, 0, 0]  
- "dog" → [0, 1, 0]  
- "fish" → [0, 0, 1]  

✅ Simple but inefficient for large vocabularies, because every vector is as long as the vocabulary.  
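The mapping above can be sketched in a few lines of plain Python. The function name `one_hot` is an illustrative choice, not part of any library.

```python
vocabulary = ["cat", "dog", "fish"]


def one_hot(word, vocabulary):
    """Return a list of 0s with a single 1 at the word's vocabulary position."""
    if word not in vocabulary:
        raise ValueError(f"{word!r} is not in the vocabulary")
    return [1 if entry == word else 0 for entry in vocabulary]


for word in vocabulary:
    print(word, one_hot(word, vocabulary))
```

With a vocabulary of 50,000 words, each vector would hold 49,999 zeros, which is why this encoding does not scale well.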

---

## **🧺 2. BoW (Bag of Words)**  
**🔹 Easy Meaning:**  
Counts how many times each word appears in the sentence or document.  

**🧠 Example:**  
**Sentences:**  
1. "I love NLP"  
2. "I love Python and NLP"  

**Vocabulary** = [I, love, NLP, Python, and]  

**Now count how many times each word appears:**  

![Bag of Words Example](Bag%20of%20Words%20Example.png)  

✅ Good for basic text analysis and ML models.  
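The counting step for the two example sentences can be sketched with the standard library. This assumes simple whitespace tokenization and the vocabulary order listed above; `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

sentences = ["I love NLP", "I love Python and NLP"]
vocabulary = ["I", "love", "NLP", "Python", "and"]


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


bow_rows = [bag_of_words(sentence, vocabulary) for sentence in sentences]
for sentence, row in zip(sentences, bow_rows):
    print(f"{sentence!r}: {row}")
# 'I love NLP': [1, 1, 1, 0, 0]
# 'I love Python and NLP': [1, 1, 1, 1, 1]
```

Note that scikit-learn's `CountVectorizer` lowercases text and, by default, ignores one-character tokens such as "I", so it would produce a slightly different vocabulary for these sentences.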


**✅ Advantages:**

- Simple to understand and implement, great for beginners.
- Works well with traditional machine learning models.
- Fast to compute with small vocabulary sizes.
- Gives a solid baseline for text classification tasks.

**❌ Disadvantages:**

- Ignores word order, so it loses meaning (e.g., "dog bites man" and "man bites dog" get identical vectors).
- Vocabulary can become very large, especially with big datasets.
- Sparse vectors (mostly zeros) waste memory when stored as dense arrays.
- Does not capture context or relationships between words.
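The word-order limitation is easy to see in a small sketch. The two sentences below are illustrative and not from the dataset used in this document.

```python
from collections import Counter


def bow_counts(text):
    """Bag-of-Words as a word -> count mapping (lowercased, split on spaces)."""
    return Counter(text.lower().split())


first = bow_counts("dog bites man")
second = bow_counts("man bites dog")

# Opposite meanings, identical bags of words.
same_representation = first == second
print(same_representation)  # True
```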

```python
# Bag of Words (BoW)
import numpy as np
import pandas as pd

# Create a DataFrame with text data and output labels
df = pd.DataFrame({
    'text': [
        'people watch campusx',
        'campusx watch campusx',
        'people write comment',
        'campusx write comment'
    ],
    'output': [1, 1, 0, 0]
})

# Display the DataFrame
print(df)
```

```
                    text  output
0   people watch campusx       1
1  campusx watch campusx       1
2   people write comment       0
3  campusx write comment       0
```


```python
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
cv = CountVectorizer()

# Create the Bag-of-Words representation (a sparse document-term matrix)
bow = cv.fit_transform(df['text'])

# Show the learned vocabulary and the dense count matrix
print("\nVocabulary:", cv.get_feature_names_out())

print("Bag-of-Words matrix:\n", bow.toarray())
```

```
Vocabulary: ['campusx' 'comment' 'people' 'watch' 'write']
Bag-of-Words matrix:
 [[1 0 1 1 0]
 [2 0 0 1 0]
 [0 1 1 0 1]
 [1 1 0 0 1]]
```


---

## **📦 3. N-gram**  
**🔹 Easy Meaning:**  
An N-gram is a sequence of N consecutive words, treated as a single unit to capture local context.  

**🧠 Examples:**  
- **Unigram (1 word):** ["I", "love", "NLP"]  
- **Bigram (2 words):** ["I love", "love NLP"]  
- **Trigram (3 words):** ["I love NLP"]  

✅ Helps the model pick up phrases, not just single words.  
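A minimal sketch of how the unigrams, bigrams, and trigrams above can be generated with a sliding window. It assumes whitespace tokenization; `ngrams` is a hypothetical helper, not a library function.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in the text, joined by spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I love NLP"
print(ngrams(sentence, 1))  # ['I', 'love', 'NLP']
print(ngrams(sentence, 2))  # ['I love', 'love NLP']
print(ngrams(sentence, 3))  # ['I love NLP']
```

If `n` is larger than the number of words, the function returns an empty list.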

---

```python
import numpy as np
import pandas as pd

# Create a DataFrame with text data and output labels
df = pd.DataFrame({
    'text': [
        'people watch campusx',
        'campusx watch campusx',
        'people write comment',
        'campusx write comment'
    ],
    'output': [1, 1, 0, 0]
})

# Display the DataFrame
print(df)
```

```
                    text  output
0   people watch campusx       1
1  campusx watch campusx       1
2   people write comment       0
3  campusx write comment       0
```


```python
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer to extract bigrams only
cv = CountVectorizer(ngram_range=(2, 2))

# Create the bigram count matrix
bow = cv.fit_transform(df['text'])
```

```python
# Bigram vocabulary: maps each bigram to its column index
print(cv.vocabulary_)
print("Length = ", len(cv.vocabulary_))
```

```
{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}
Length =  6
```

```python
# Bigram counts for the first two documents
print(bow[0].toarray())
print(bow[1].toarray())
```

```
[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
```


# **🤔 Why move from N-gram to TF-IDF?**

**🔹 First, understand what N-gram does:**

It joins N consecutive words together to capture short phrases.

**For example:**

**Bigram:** "I love" or "love NLP"  
**Trigram:** "I love NLP"

✅ It’s better than just counting single words because it starts to capture context (phrases).

---

### ❌ But N-gram has some problems:

---

| ❌ Problem               | 📖 What it means                                                                 |
|-------------------------|----------------------------------------------------------------------------------|
| **Too many combinations** | More words = more N-grams = bigger feature space (high memory usage)             |
| **Still just counting**   | It doesn't know which words are important; all words/phrases are treated equally |
| **Common words dominate** | Common words like "the", "and", "is" appear a lot but don't add much meaning     |

---

# ✅ So, we move to **TF-IDF** (Term Frequency – Inverse Document Frequency)

### 🔹 Easy Meaning:

> “If a word appears a lot in **one document** but not much in others, it’s probably important.”

TF-IDF gives **higher weight** to words that are distinctive for a document and **lower weight** to words that are common everywhere.

---

### ⚖️ N-gram vs TF-IDF

| Feature               | N-gram (raw counts) | TF-IDF                         |
|-----------------------|-------------------|--------------------------------|
| Understand phrases?   | ✅ Yes             | ✅ Yes (when combined with N-grams) |
| Weigh word importance?| ❌ No              | ✅ Yes                          |
| Handles common words? | ❌ No              | ✅ Yes (reduces their weight)   |
| Feature size          | 📈 Grows quickly with N | Same as the vocabulary it is applied to (values are re-weighted, not removed) |

---

### 🧠 In Simple Words:

- Use **N-gram** to capture phrases.
- But use **TF-IDF** to make the model **focus on important words**, not just frequent ones.

> 👉 Best of both worlds:  
> Use `TfidfVectorizer(ngram_range=(1,2))` for **weighted n-grams**.


---


# **📚 What is TF-IDF?**
**TF-IDF stands for:**

👉 Term Frequency – Inverse Document Frequency

It’s a method to score words based on how important they are in a document compared to other documents.

---

## **🔹 Easy Meaning:**

- **TF** = How often the word appears in one document  
- **IDF** = How rare that word is across all documents

---

## **🧠 Think like this:**

- If a word appears many times in one document, it's probably important (TF).
- But if that same word appears in every document, it's probably not special (IDF).

✅ So TF-IDF helps us find the **important and unique words** in each document.

---

## **📦 Example:**

Imagine we have 3 sentences (documents):

1. "I love NLP"  
2. "NLP is fun"  
3. "I love Python and NLP"

**Now:**
- The word **"NLP"** appears in all 3 documents → common → **lower IDF**
- The word **"Python"** appears in only 1 document → rare → **higher IDF**

🔸 So, TF-IDF gives:
- **Lower score** to common words (like "NLP")
- **Higher score** to rare/unique words (like "Python")
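To make the comparison concrete, here is a small sketch that computes IDF for the three sentences by hand. It assumes the plain textbook definition, IDF = log(total documents / documents containing the word), with the natural logarithm and whitespace tokenization. Libraries often use variants of this formula, so their numbers can differ.

```python
import math

documents = ["I love NLP", "NLP is fun", "I love Python and NLP"]
tokenized = [document.split() for document in documents]


def idf(term, tokenized_docs):
    """Plain IDF: log(total documents / documents containing the term)."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(tokenized_docs) / containing)


def tf_idf(term, tokens, tokenized_docs):
    """Raw term count in one document multiplied by the term's IDF."""
    return tokens.count(term) * idf(term, tokenized_docs)


idf_nlp = idf("NLP", tokenized)        # log(3 / 3) = 0.0
idf_python = idf("Python", tokenized)  # log(3 / 1), about 1.0986
print(idf_nlp, round(idf_python, 4))
print(tf_idf("Python", tokenized[2], tokenized))
```

"NLP" occurs in every document, so its plain IDF is exactly 0, while "Python" occurs in only one document and gets the highest IDF in this corpus.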

---

# 📸 TF-IDF Diagram 1 (Formula + Table View):

![TF-IDF Diagram](TF-IDF%20Diagram.png)

---

## 🎯 Why use TF-IDF?

| ❌ Without TF-IDF (just count) | ✅ With TF-IDF                    |
|-------------------------------|----------------------------------|
| All words treated equally     | Important words get higher scores |
| Common words dominate         | Rare but meaningful words stand out |
| No sense of word importance   | Smarter and more focused features  |

---

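## 🧪 Simple Formula:

**TF-IDF = TF × IDF**

TF = number of times the word appears in the document

IDF = log(Total Documents / Documents containing the word)

Libraries often use variants of this formula (for example, a smoothed IDF or length-normalized document vectors), so their numbers can differ from a hand calculation.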



```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer()

# Fit and transform the text data to TF-IDF features
tfidf_matrix = tfidf.fit_transform(df['text']).toarray()

# Display the TF-IDF matrix
print("TF-IDF Matrix:")
print(tfidf_matrix)

# Display feature names (vocabulary)
print("\nFeature Names (Vocabulary):")
print(tfidf.get_feature_names_out())
```

```
TF-IDF Matrix:
[[0.49681612 0.         0.61366674 0.61366674 0.        ]
 [0.8508161  0.         0.         0.52546357 0.        ]
 [0.         0.57735027 0.57735027 0.         0.57735027]
 [0.49681612 0.61366674 0.         0.         0.61366674]]

Feature Names (Vocabulary):
['campusx' 'comment' 'people' 'watch' 'write']
```
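These values do not come straight from the simple TF × IDF formula. With its default settings, scikit-learn's `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each document row to unit Euclidean (L2) length. The sketch below reproduces the matrix in plain Python under those assumptions; the helper names are illustrative.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    raw = [tokens.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocabulary)
for row in matrix:
    print([round(value, 4) for value in row])
```

Because "campusx" appears in three of the four documents, it gets the lowest IDF, which is why it scores below "people" and "watch" in the first row even though all three words occur once.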


---

## **✍️ 4. Custom Features**  
**🔹 Easy Meaning:**  
You create your own features from the text, based on your own logic or goals.  

**🧠 Example:**  
From this review: *"I LOVED the movie!!! 😍"*  
You can create custom features like:  
- Number of uppercase words → **1** ("LOVED", not counting the pronoun "I")  
- Has emojis → **Yes**  
- Contains “loved” → **Yes** (ignoring case)  
- Word count → **5** (splitting on spaces, so the emoji counts as a token)  

✅ You choose what matters most. Great for ML models.  
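A rough sketch of how these four features could be extracted. Two choices here are assumptions made for illustration: single-letter words such as "I" are not counted as uppercase words, and the emoji check only looks for characters in the range U+1F300 to U+1FAFF, which covers many but not all emojis.

```python
import string


def custom_features(review):
    """Extract a few hand-crafted features from one review."""
    tokens = review.split()
    # Strip punctuation so "movie!!!" is treated as "movie".
    words = [token.strip(string.punctuation) for token in tokens]
    uppercase_words = sum(1 for w in words if len(w) > 1 and w.isupper())
    # Crude heuristic (assumption): common emoji code point range only.
    has_emoji = any(0x1F300 <= ord(ch) <= 0x1FAFF for ch in review)
    return {
        "uppercase_words": uppercase_words,
        "has_emoji": has_emoji,
        "contains_loved": "loved" in review.lower(),
        "word_count": len(tokens),
    }


# "\U0001F60D" is the heart-eyes emoji from the example review.
features = custom_features("I LOVED the movie!!! \U0001F60D")
print(features)
```

Features like these can be placed alongside BoW or TF-IDF columns as extra inputs to a model.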

---

## **🧠 5. Word2Vec**  
**🔹 Easy Meaning:**  
Converts each word into a dense vector (a list of numbers) learned from the contexts the word appears in, so words with similar meanings get similar vectors.  

**🧠 Example:**  
- Word2Vec("king") → [0.12, 0.65, -0.33, …]  
- Word2Vec("queen") → [0.14, 0.68, -0.30, …]  

And guess what?  
**king - man + woman ≈ queen**  

✅ Word2Vec captures semantic similarity between words.  
It is usually a better choice than BoW or OHE when word meaning matters, such as in deep learning NLP models.  
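"Similar vectors" is usually measured with cosine similarity. The sketch below uses toy 3-dimensional vectors: the first three numbers shown for "king" and "queen" above, plus a made-up "banana" vector for contrast. None of these are real Word2Vec output, and real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only.
king = [0.12, 0.65, -0.33]
queen = [0.14, 0.68, -0.30]
banana = [-0.70, 0.05, 0.60]  # hypothetical unrelated word

sim_king_queen = cosine_similarity(king, queen)
sim_king_banana = cosine_similarity(king, banana)
print(round(sim_king_queen, 3), round(sim_king_banana, 3))
```

The related pair scores close to 1, while the unrelated pair scores much lower. One-hot vectors cannot show this, because any two different one-hot vectors have a cosine similarity of 0.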

---

## **📊 Summary of important NLP feature extraction techniques**  
![Summary](<Summary important NLP feature extraction techniques.png>)


