
# Vocabulary

In Natural Language Processing (NLP), a **vocabulary** refers to the set of all unique words, tokens, or terms encountered in a corpus or a specific dataset. It's essentially the dictionary of distinct linguistic units that an NLP model or system is aware of and can process.

Here are some key aspects:

*   **Tokens**: The 'words' in a vocabulary are often referred to as 'tokens'. These can be actual words, but also punctuation, numbers, or sub-word units depending on the tokenization strategy.
*   **Size**: The size of the vocabulary (the number of unique tokens) can vary greatly depending on the size and diversity of the corpus it's derived from. Larger corpora generally lead to larger vocabularies.
*   **Importance**: A vocabulary is fundamental for many NLP tasks:
    *   **Text Representation**: It forms the basis for converting text into numerical representations that machine learning models can understand (e.g., one-hot encoding, word embeddings).
    *   **Model Training**: NLP models like neural networks (e.g., recurrent neural networks, Transformers) operate on the IDs of vocabulary tokens and learn associations between tokens from how they occur in the training text.
    *   **Out-of-Vocabulary (OOV) words**: Words encountered during inference that were not present in the training vocabulary are called OOV words and can pose a challenge to models.
*   **Creation**: Vocabularies are typically built by:
    *   **Tokenization**: Breaking down raw text into individual tokens.
    *   **Counting Frequencies**: Tallying the occurrences of each unique token.
    *   **Filtering**: Often, very rare words are removed to keep the vocabulary size manageable and prevent noise, or common words (stop words) are also removed depending on the task.

**Example**: If your corpus contains the sentences "The quick brown fox jumps over the lazy dog." and "A brown fox and a lazy dog ran.", and you keep case distinctions and treat the period as its own token, your vocabulary is {'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'A', 'and', 'a', 'ran'}. Lowercasing first would merge 'The' with 'the' and 'A' with 'a', leaving 12 tokens.

In essence, the vocabulary defines the scope of linguistic understanding for an NLP system.
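As a rough illustration of the example above, the sketch below collects the unique tokens from the two sentences, once case-sensitively and once after lowercasing. The regex tokenizer and the names `tokenize`, `cased_vocab`, and `lowercased_vocab` are assumptions made for this sketch, not a standard API.

```python
import re

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A brown fox and a lazy dog ran.",
]

def tokenize(text):
    # Assumed tokenizer: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

# Vocabulary = set of unique tokens across the whole corpus
cased_vocab = {tok for s in sentences for tok in tokenize(s)}
lowercased_vocab = {tok for s in sentences for tok in tokenize(s.lower())}

print(len(cased_vocab), sorted(cased_vocab))
print(len(lowercased_vocab), sorted(lowercased_vocab))
```

Lowercasing merges 'The' with 'the' and 'A' with 'a', so the second vocabulary is smaller by two entries.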

# Building a vocabulary from a corpus

Building a vocabulary from a corpus generally involves several key steps. The goal is to extract all unique words or tokens and prepare them for use in NLP models.

Here's a breakdown of the typical process:

1.  **Text Collection (Corpus)**: First, you need a corpus, a collection of text documents relevant to your task. This could be articles, books, social media posts, etc.
2.  **Text Normalization (Preprocessing)**:
    *   **Lowercasing**: Convert all text to lowercase to treat words like "The" and "the" as the same token.
    *   **Removing Punctuation**: Decide whether to remove punctuation or treat it as separate tokens. For most tasks, it's removed or handled separately.
    *   **Removing Numbers**: Similar to punctuation, decide if numbers are relevant or should be removed/replaced.
    *   **Removing Stop Words (Optional)**: For some tasks (like topic modeling), common words like "a," "an," "the," "is" (stop words) might be removed as they often don't carry significant meaning.
    *   **Stemming or Lemmatization (Optional)**: Reduce words to a base form. Stemming strips affixes with heuristic rules (e.g., "running" and "runs" become "run"), while lemmatization uses a dictionary and can also map irregular forms such as "ran" to "run". Lemmatization is generally preferred for better accuracy.
3.  **Tokenization**: This is the process of breaking down the normalized text into individual units called "tokens." Tokens are usually words, but can also be sub-word units (like 'un-', '-ing') or characters, depending on the tokenization strategy. Common tokenizers include `word_tokenize` from NLTK and spaCy's tokenizer.
4.  **Counting Token Frequencies**: After tokenization, you'll have a list of tokens. Count the occurrences of each unique token in the entire corpus. This step helps you understand which words are most frequent and which are very rare.
5.  **Filtering (Optional but common)**: Often, you'll filter the vocabulary based on token frequency:
    *   **Minimum Frequency**: Remove tokens that appear very infrequently (e.g., only once or twice). These rare words are often typos or noise and can inflate vocabulary size without providing much useful information for models.
    *   **Maximum Frequency**: Remove tokens that appear extremely frequently (e.g., in more than 90% of documents). These might be domain-specific stop words.
    *   **Vocabulary Size Limiting**: For practical reasons (memory, computational cost), you might decide to limit the vocabulary to the N most frequent tokens. All other tokens become 'out-of-vocabulary' (OOV) and are often replaced with a special `<unk>` (unknown) token.
6.  **Creating a Mapping (Token-to-ID and ID-to-Token)**: Once the final set of unique tokens is determined, assign a unique integer ID to each token. This creates a mapping from words to numbers, which is essential for machine learning models.
    *   You'll typically create two mappings: one from token (string) to ID (integer) and another from ID (integer) back to token (string).
    *   Special tokens like `<pad>` (for padding sequences), `<unk>` (for unknown words), and `<s>`, `</s>` (for start/end of sentence) are often added to the vocabulary as well.
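The filtering step described above can be sketched as follows. This is a minimal illustration on a toy, pre-tokenized corpus; the thresholds `min_count`, `max_doc_ratio`, and `max_size` are hypothetical names and values chosen for the example, not recommended settings.

```python
from collections import Counter

# Toy corpus, already tokenized (one list of tokens per document)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

# Total occurrences of each token across the corpus
token_counts = Counter(tok for doc in docs for tok in doc)
# Number of documents each token appears in (document frequency)
doc_freq = Counter(tok for doc in docs for tok in set(doc))

min_count = 2        # drop tokens seen fewer than this many times
max_doc_ratio = 0.9  # drop tokens present in more than 90% of documents
max_size = 5         # keep at most this many tokens

# most_common() yields tokens from most to least frequent,
# so slicing afterwards keeps the top-N survivors
kept = [
    tok
    for tok, count in token_counts.most_common()
    if count >= min_count and doc_freq[tok] / len(docs) <= max_doc_ratio
][:max_size]

print(kept)  # ['sat']
```

Here "the" is dropped because it occurs in every document, and the words seen only once fall below the minimum count.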

In [1]:
from collections import Counter
import re

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A brown fox and a lazy dog ran very fast."
]

# 1. Text Normalization and Tokenization
all_tokens = []
for text in corpus:
    text = text.lower() # Lowercasing
    text = re.sub(r'[^a-z\s]', '', text) # Keep only letters and whitespace (drops punctuation and digits)
    tokens = text.split() # Simple tokenization by space
    all_tokens.extend(tokens)

# 2. Counting Token Frequencies
token_counts = Counter(all_tokens)

# 3. Filtering (keep only tokens that occur at least min_freq times)
min_freq = 2
filtered_tokens = {token for token, count in token_counts.items() if count >= min_freq}

# Add special tokens
special_tokens = ['<pad>', '<unk>']
vocabulary = sorted(list(filtered_tokens)) # Sort for consistency
vocabulary = special_tokens + vocabulary

# 4. Creating Mappings
token_to_id = {token: i for i, token in enumerate(vocabulary)}
id_to_token = {i: token for token, i in token_to_id.items()}

print("Vocabulary size:", len(vocabulary))
print("Token to ID mapping (sample):")
for token, idx in list(token_to_id.items())[:10]:
    print(f"  '{token}': {idx}")

print("ID to Token mapping (sample):")
for idx, token in list(id_to_token.items())[:10]:
    print(f"  {idx}: '{token}'")

Vocabulary size: 8
Token to ID mapping (sample):
  '<pad>': 0
  '<unk>': 1
  'a': 2
  'brown': 3
  'dog': 4
  'fox': 5
  'lazy': 6
  'the': 7
ID to Token mapping (sample):
  0: '<pad>'
  1: '<unk>'
  2: 'a'
  3: 'brown'
  4: 'dog'
  5: 'fox'
  6: 'lazy'
  7: 'the'
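Once the mappings exist, new text can be turned into a fixed-length sequence of IDs. The sketch below is one possible way to do this, assuming the same normalization as the cell above; the helper name `encode` and the `max_len` parameter are introduced here for illustration only. The mapping is copied from the output above so that the block runs on its own.

```python
import re

# Mapping copied from the output above
token_to_id = {'<pad>': 0, '<unk>': 1, 'a': 2, 'brown': 3,
               'dog': 4, 'fox': 5, 'lazy': 6, 'the': 7}

def encode(text, token_to_id, max_len):
    """Map text to a list of exactly max_len token IDs."""
    # Same normalization as when the vocabulary was built
    tokens = re.sub(r'[^a-z\s]', '', text.lower()).split()
    unk_id = token_to_id['<unk>']
    # Out-of-vocabulary tokens fall back to the <unk> ID; long inputs are truncated
    ids = [token_to_id.get(tok, unk_id) for tok in tokens][:max_len]
    # Short inputs are right-padded with the <pad> ID
    ids += [token_to_id['<pad>']] * (max_len - len(ids))
    return ids

encoded = encode("The lazy cat!", token_to_id, max_len=5)
print(encoded)  # [7, 6, 1, 0, 0]
```

The word "cat" is not in the vocabulary, so it maps to the `<unk>` ID 1, and the two trailing zeros are padding.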


# Corpus

In Natural Language Processing (NLP), a **corpus** (plural: corpora) is a large and structured set of texts or speech data. It's essentially a collection of written or spoken documents that are used for linguistic research, analysis, or to train NLP models.

Here's a breakdown of its key characteristics and uses:

*   **Size and Scope**: Corpora are typically large, containing millions or even billions of words, allowing for statistical analysis of language patterns.
*   **Authenticity**: They consist of real-world language use, such as books, articles, web pages, social media posts, transcripts of conversations, etc.
*   **Annotation**: Many corpora are annotated, meaning they have additional linguistic information added, such as part-of-speech tags (noun, verb, adjective), syntactic parses, named entity recognition (person, organization, location), or sentiment labels. This enriches the data for specific NLP tasks.
*   **Structure**: While 'large collection of text' is a good start, a corpus implies a certain level of organization. It might be structured by author, date, genre, domain, etc.

**Why are corpora important in NLP?**

1.  **Training Data**: They serve as the primary source of data for training various NLP models, such as machine translation systems, sentiment analyzers, chatbots, and speech recognition systems.
2.  **Linguistic Analysis**: Researchers use corpora to study how language is used, identify common phrases, analyze grammatical structures, or track linguistic changes over time.
3.  **Lexicography**: They are invaluable for creating dictionaries and thesauri by providing evidence of word usage, frequency, and collocations.
4.  **Language Learning**: They can help in understanding common errors or typical usage for language learners.

**Examples of well-known corpora include:**

*   **Brown Corpus**: One of the earliest computer-readable corpora, consisting of 1 million words of American English texts.
*   **Penn Treebank**: A widely used corpus annotated with syntactic structures.
*   **Google Books Ngram Corpus**: A massive dataset of n-gram frequency counts extracted from digitized books, used to study word frequencies and linguistic trends over time.

In essence, a corpus provides the raw material that allows NLP researchers and developers to understand, analyze, and build systems that can process and generate human language.

# One-Hot Encoding

One-hot encoding is a common technique in Natural Language Processing (NLP), and in machine learning generally, for converting categorical data into a numerical format that algorithms can process. It's particularly useful for nominal categorical features, meaning categories that don't have any inherent order or ranking.

**What is One-Hot Encoding?**

At its core, one-hot encoding transforms a categorical variable with `N` distinct values into `N` binary features (columns). For each instance (e.g., a word, a category), exactly one of these `N` features will have a value of 1, and all others will have a value of 0.

**Why is it Used?**

Machine learning algorithms, especially those that rely on mathematical operations like distance calculations (e.g., k-Nearest Neighbors, Support Vector Machines) or gradient descent (e.g., neural networks), cannot directly work with categorical labels (like 'red', 'green', 'blue' or 'cat', 'dog', 'bird'). If we simply assigned numerical labels (e.g., red=1, green=2, blue=3), the algorithm might incorrectly infer an ordinal relationship (e.g., that green is 'more' than red, or closer to blue), which doesn't exist for nominal categories.

One-hot encoding addresses this by:

*   **Representing Categorical Data Numerically**: It provides a numerical representation that algorithms can process.
*   **Avoiding Ordinal Relationships**: By creating binary vectors, it ensures that no artificial ordering or magnitude is imposed on the categories. Each category is treated as equally distinct from others.
*   **Increasing Dimensionality**: As a side effect, it expands the feature space, creating a new dimension for each unique category.

**How it Works (Step-by-Step):**

Let's say you have a list of words from a vocabulary:
`['cat', 'dog', 'bird', 'fish']`

1.  **Identify Unique Categories**: First, determine all unique categories in your feature. In this case, 'cat', 'dog', 'bird', 'fish'. Let's say there are `N` unique categories.
2.  **Create Binary Columns**: Create `N` new binary columns (or features), one for each unique category.
3.  **Assign 1 or 0**: For each data instance:
    *   Set the value of the column corresponding to its category to 1.
    *   Set the values of all other `N-1` columns to 0.

**Example:**

Consider the vocabulary `['cat', 'dog', 'bird', 'fish']`.

*   **'cat'**: would be encoded as `[1, 0, 0, 0]`
*   **'dog'**: would be encoded as `[0, 1, 0, 0]`
*   **'bird'**: would be encoded as `[0, 0, 1, 0]`
*   **'fish'**: would be encoded as `[0, 0, 0, 1]`

In NLP, if you have a vocabulary of, say, 10,000 unique words, each word would be represented by a vector of 10,000 dimensions, with a single '1' at the index corresponding to that word and '0's elsewhere.

**Advantages:**
*   Simple and intuitive to understand.
*   Prevents algorithms from assuming false ordinal relationships.

**Disadvantages:**
*   **High Dimensionality**: For features with many unique categories (like a large vocabulary in NLP), one-hot encoding can lead to a very high-dimensional and sparse feature space (most values are zero). This can increase computational cost and memory usage, and sometimes lead to the "curse of dimensionality."
*   **No Semantic Relationship**: It doesn't capture any semantic relationships between words. For example, 'king' is exactly as far from 'queen' as it is from 'apple' in a one-hot encoded space, even though 'king' and 'queen' are semantically related.

Because of the high dimensionality and lack of semantic information, in modern NLP, one-hot encoding for words has largely been replaced by more sophisticated techniques like **word embeddings** (e.g., Word2Vec, GloVe, FastText), which represent words in dense, lower-dimensional vectors that capture semantic relationships. However, one-hot encoding remains valuable for other categorical features in various machine learning tasks.
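The lack of semantic information can be checked numerically. In the sketch below (a toy three-word vocabulary chosen for illustration), every pair of distinct one-hot vectors has the same Euclidean distance and a dot product of zero, so no pair of words looks more related than any other.

```python
import numpy as np

vocab = ["king", "queen", "apple"]
# Row i of the identity matrix is the one-hot vector of vocab[i]
one_hot = np.eye(len(vocab))
king, queen, apple = one_hot

dist_king_queen = np.linalg.norm(king - queen)
dist_king_apple = np.linalg.norm(king - apple)
dot_king_queen = king @ queen

print(dist_king_queen, dist_king_apple)  # both equal sqrt(2)
print(dot_king_queen)                    # 0.0, the vectors are orthogonal
```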

Let's demonstrate how to build a vocabulary from a larger corpus. We will use a more extensive set of sentences to see how the vocabulary expands.

The process will involve the same steps as before:
1.  **Text Normalization and Tokenization**: Convert text to lowercase and remove punctuation, then split into words.
2.  **Counting Token Frequencies**: Count the occurrences of each unique token.
3.  **Filtering**: Remove infrequent tokens to manage vocabulary size.
4.  **Creating Mappings**: Assign unique integer IDs to each token and create inverse mappings.

In [2]:
from collections import Counter
import re

# A larger sample corpus
larger_corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A brown fox and a lazy dog ran very fast.",
    "The dog barks loudly at the cat.",
    "The cat quickly climbs the tree.",
    "Dogs and cats are common pets.",
    "The quick brown fox is a clever animal.",
    "Many animals live in the forest.",
    "Birds sing in the morning and fly high."
]

# 1. Text Normalization and Tokenization
all_tokens_large = []
for text in larger_corpus:
    text = text.lower() # Lowercasing
    text = re.sub(r'[^a-z\s]', '', text) # Remove punctuation
    tokens = text.split() # Simple tokenization by space
    all_tokens_large.extend(tokens)

# 2. Counting Token Frequencies
token_counts_large = Counter(all_tokens_large)

# 3. Filtering (e.g., keep words with freq >= 2)
min_freq_large = 2
filtered_tokens_large = {token for token, count in token_counts_large.items() if count >= min_freq_large}

# Add special tokens
special_tokens_large = ['<pad>', '<unk>']
vocabulary_large = sorted(list(filtered_tokens_large)) # Sort for consistency
vocabulary_large = special_tokens_large + vocabulary_large

# 4. Creating Mappings
token_to_id_large = {token: i for i, token in enumerate(vocabulary_large)}
id_to_token_large = {i: token for token, i in token_to_id_large.items()}

print("Larger Vocabulary size:", len(vocabulary_large))
print("Larger Token to ID mapping (sample):")
for token, idx in list(token_to_id_large.items())[:15]:
    print(f"  '{token}': {idx}")

print("Larger ID to Token mapping (sample):")
for idx, token in list(id_to_token_large.items())[:15]:
    print(f"  {idx}: '{token}'")


Larger Vocabulary size: 12
Larger Token to ID mapping (sample):
  '<pad>': 0
  '<unk>': 1
  'a': 2
  'and': 3
  'brown': 4
  'cat': 5
  'dog': 6
  'fox': 7
  'in': 8
  'lazy': 9
  'quick': 10
  'the': 11
Larger ID to Token mapping (sample):
  0: '<pad>'
  1: '<unk>'
  2: 'a'
  3: 'and'
  4: 'brown'
  5: 'cat'
  6: 'dog'
  7: 'fox'
  8: 'in'
  9: 'lazy'
  10: 'quick'
  11: 'the'


In [3]:
# Assuming 'token_to_id' and 'vocabulary' are already defined from the previous steps.
# If not, run the vocabulary building cell first.

# Let's choose a word from our vocabulary to one-hot encode
word_to_encode = 'fox'

# Get the size of the vocabulary
vocabulary_size = len(vocabulary)

# Find the index of the word in the vocabulary
if word_to_encode in token_to_id:
    word_index = token_to_id[word_to_encode]

    # Create a one-hot vector of zeros
    one_hot_vector = [0] * vocabulary_size

    # Set the value at the word's index to 1
    one_hot_vector[word_index] = 1

    print(f"The word '{word_to_encode}' has index {word_index} in the vocabulary.")
    print(f"Its one-hot encoding is: {one_hot_vector}")
else:
    print(f"The word '{word_to_encode}' is not in the current vocabulary.")
    print(f"Vocabulary: {vocabulary}")

The word 'fox' has index 5 in the vocabulary.
Its one-hot encoding is: [0, 0, 0, 0, 0, 1, 0, 0]


In [4]:
# Ensure larger_corpus, token_to_id_large, and vocabulary_large are defined
# (from the previous cells that built the vocabulary from the larger corpus).

encoded_sentences = []
vocabulary_size_large = len(vocabulary_large)
unk_token_id = token_to_id_large['<unk>']

for text in larger_corpus:
    # Normalization and Tokenization (same as used for vocabulary creation)
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = text.split()

    sentence_encoded = []
    for token in tokens:
        # Get the ID for the token, or the UNK ID if not in vocabulary
        token_id = token_to_id_large.get(token, unk_token_id)

        # Create the one-hot vector
        one_hot_vector = [0] * vocabulary_size_large
        one_hot_vector[token_id] = 1
        sentence_encoded.append(one_hot_vector)
    encoded_sentences.append(sentence_encoded)

print("One-hot encoding for the first sentence in larger_corpus:")
print(f"Original sentence: '{larger_corpus[0]}'")
# For brevity, print only the first 5 tokens' encodings
for vec in encoded_sentences[0][:5]:
    # Each vector contains exactly one 1; its position is the token ID
    word = id_to_token_large[vec.index(1)]
    print(f"  '{word}': {vec}")
if len(encoded_sentences[0]) > 5:
    print("  ...")

print(f"\nTotal {len(encoded_sentences)} sentences encoded. Each list contains one-hot vectors for its words.")
# print("Encoded sentences (full):")
# for i, sentence in enumerate(encoded_sentences):
#     print(f"Sentence {i+1}: {sentence}")


One-hot encoding for the first sentence in larger_corpus:
Original sentence: 'The quick brown fox jumps over the lazy dog.'
  'the': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
  'quick': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
  'brown': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
  'fox': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
  '<unk>': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  ...

Total 8 sentences encoded. Each list contains one-hot vectors for its words.


# OneHotEncoder (sklearn, categorical data)

In [5]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

data = np.array([["Red"], ["Blue"], ["Green"]])

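# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2;
# older versions use sparse=False). Columns follow the sorted categories: Blue, Green, Red.
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data)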

print(encoded)


[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
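Categories that were not seen during fitting play the same role as out-of-vocabulary words. The following sketch shows one way to handle them with `OneHotEncoder`, using its `handle_unknown="ignore"` option; the category "Purple" is a made-up unseen value for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# With handle_unknown="ignore", unseen categories become all-zero rows
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Red"], ["Blue"], ["Green"]]))

# Learned categories are stored in sorted order
categories = list(encoder.categories_[0])

# The default output is a sparse matrix, so convert it for printing
encoded_new = encoder.transform(np.array([["Blue"], ["Purple"]])).toarray()

print(categories)
print(encoded_new)
```

The row for "Blue" has a single 1 in the first column, while the row for the unseen "Purple" is all zeros.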


# Bag of Words

In Natural Language Processing (NLP), the **Bag of Words (BoW)** model is a simplified representation of text. It's a way to extract features from text so that machine learning algorithms can use them. The name "bag of words" comes from the idea that grammar and word order are disregarded; only the frequency of each word is kept.

Here's how it works:

1.  **Vocabulary Creation**: First, a vocabulary of all unique words from the entire corpus (collection of documents) is created.
2.  **Document Representation**: Each document (or piece of text) is then represented as a vector where each dimension corresponds to a word in the vocabulary. The value in each dimension is typically the frequency of that word in the document. This can be raw counts, binary (1 if the word is present, 0 if not), or TF-IDF (Term Frequency-Inverse Document Frequency) scores.

**Example:**

Consider these two simple sentences:
*   Document 1: "I love dogs and cats."
*   Document 2: "I love cats and birds."

1.  **Vocabulary**: {"I", "love", "dogs", "and", "cats", "birds"}
    *   Let's assign an index to each word: I=0, love=1, dogs=2, and=3, cats=4, birds=5.

2.  **Vector Representation**:
    *   Document 1: `[1, 1, 1, 1, 1, 0]` (I:1, love:1, dogs:1, and:1, cats:1, birds:0)
    *   Document 2: `[1, 1, 0, 1, 1, 1]` (I:1, love:1, dogs:0, and:1, cats:1, birds:1)
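The example above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper `bow_vector` and its `binary` flag are names invented here, and the tokenization (strip the final period, split on spaces) is deliberately simplistic.

```python
from collections import Counter

# Vocabulary in the index order used above
vocab = ["I", "love", "dogs", "and", "cats", "birds"]

def bow_vector(text, vocab, binary=False):
    # Naive tokenization: drop the trailing period and split on whitespace
    counts = Counter(text.rstrip(".").split())
    if binary:
        # Presence/absence instead of counts
        return [1 if counts[word] > 0 else 0 for word in vocab]
    return [counts[word] for word in vocab]

doc1 = bow_vector("I love dogs and cats.", vocab)
doc2 = bow_vector("I love cats and birds.", vocab)
print(doc1)  # [1, 1, 1, 1, 1, 0]
print(doc2)  # [1, 1, 0, 1, 1, 1]

# Counts and binary presence differ once a word repeats
repeated = "I love dogs and dogs love cats."
repeated_counts = bow_vector(repeated, vocab)
repeated_binary = bow_vector(repeated, vocab, binary=True)
print(repeated_counts)  # [1, 2, 2, 1, 1, 0]
print(repeated_binary)  # [1, 1, 1, 1, 1, 0]
```

Note that scikit-learn's `CountVectorizer`, with its default settings, lowercases text and skips single-character tokens such as "I", so its vocabulary for these sentences would differ slightly.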

**Advantages of Bag of Words:**
*   **Simplicity**: It's a straightforward and easy-to-understand model.
*   **Effectiveness**: Despite its simplicity, it can be quite effective for various text classification and clustering tasks, especially with sufficient data.
*   **Computational Efficiency**: When used with sparse matrix representations, it can be computationally efficient for large datasets.

**Disadvantages of Bag of Words:**
*   **Loss of Word Order/Context**: It completely ignores the grammatical structure and the order of words. "Dog bites man" and "Man bites dog" would have the same BoW representation, even though their meanings are entirely different.
*   **High Dimensionality**: For large vocabularies (common in real-world text), the feature vectors can become very large and sparse (mostly zeros), leading to the "curse of dimensionality" and increased computational cost.
*   **Semantic Ambiguity**: It treats every word as independent and doesn't capture semantic relationships or synonyms (e.g., "good" and "great" are treated as distinct words).
*   **Out-of-Vocabulary (OOV) Words**: Words not in the vocabulary during training cannot be directly represented.

Because of these limitations, especially the loss of semantic and syntactic information, more advanced techniques like **word embeddings** (e.g., Word2Vec, GloVe, FastText) have largely replaced BoW for many modern NLP tasks, as they can capture the meaning and context of words in lower-dimensional, dense vectors.

Let's compare Bag of Words (BoW) and One-Hot Encoding (OHE) in the context of Natural Language Processing. They are closely related but serve different purposes in text representation.

### Relationship:

*   **One-Hot Encoding as a Building Block**: You can think of one-hot encoding as a fundamental step or a component used within the Bag of Words model (and many other NLP techniques) for representing individual words. When you assign a unique integer ID to each word in your vocabulary and then convert that ID into a binary vector (where only the dimension corresponding to that ID is 1), you are essentially one-hot encoding the *word itself*.
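One concrete way to see this relationship: summing the one-hot vectors of all tokens in a document gives that document's count-based BoW vector. The sketch below uses a small hypothetical vocabulary and a helper `one_hot` defined only for this illustration.

```python
import numpy as np

vocab = ["bad", "good", "movie", "was"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    # Vector of zeros with a single 1 at the token's index
    vec = np.zeros(len(vocab), dtype=int)
    vec[token_to_id[token]] = 1
    return vec

document = "good movie was good".split()

# Adding up the per-token one-hot vectors yields the word counts
bow = np.sum([one_hot(tok) for tok in document], axis=0)
print(bow)  # [0 2 1 1]
```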

### One-Hot Encoding (OHE):

*   **Purpose**: Primarily used to convert individual categorical features (like a single word, a color, a city) into a numerical format that machine learning algorithms can process. It avoids implying any ordinal relationship between categories.
*   **Representation**: For a vocabulary of size `V`, a single word is represented by a vector of length `V`, with a `1` at the index corresponding to that word and `0`s elsewhere.
*   **Scope**: Focuses on representing a *single item* from a set of discrete categories.
*   **Example**: If vocabulary is `{'cat':0, 'dog':1, 'bird':2}`, then 'cat' is `[1, 0, 0]`, 'dog' is `[0, 1, 0]`, 'bird' is `[0, 0, 1]`.

### Bag of Words (BoW):

*   **Purpose**: Represents an *entire document or text snippet* as a vector. It focuses on the occurrence of words within a document, disregarding grammar and word order.
*   **Representation**: For a vocabulary of size `V`, a document is represented by a vector of length `V`. Each dimension of this vector corresponds to a word in the vocabulary, and its value typically indicates the *frequency* (count) of that word in the document, or simply its presence (binary).
*   **Scope**: Focuses on representing a *collection of items* (a document) based on the counts/presence of words from the vocabulary.
*   **Example**: Suppose the vocabulary places 'I' at index 0, 'love' at 1, 'cats' at 2, 'and' at 3, and 'dogs' at 4, followed by other words that do not occur in the document.
    *   Document 1: "I love cats and dogs."
    *   BoW representation for Document 1: `[1, 1, 1, 1, 1, 0, ..., 0]`. Every word occurs once, so binary presence and raw counts coincide here; if 'and' appeared twice, the count vector would hold `2` at index 3.

### Key Differences Summarized:

| Feature             | One-Hot Encoding                                | Bag of Words                                  |
| :------------------ | :---------------------------------------------- | :-------------------------------------------- |
| **Unit of Encoding**| A single categorical item (e.g., a word)      | An entire document or text segment            |
| **Vector Content**  | Binary (0 or 1) for a single item             | Counts/Frequencies of words within a document |
| **Information**     | Identity/presence of *one* specific item      | Word distribution/composition of a *document* |
| **Output Shape**    | Vector for *one word* (size = vocab size)     | Vector for *one document* (size = vocab size) |

### Common Limitations:

Both One-Hot Encoding (when applied to words) and Bag of Words models share some significant drawbacks:

*   **High Dimensionality**: For large vocabularies, the resulting vectors are very long and sparse (mostly zeros), which can lead to computational inefficiency and the "curse of dimensionality."
*   **Lack of Semantic Meaning**: They don't capture any semantic relationships between words (e.g., 'king' and 'queen' are just as far apart as 'king' and 'apple'). They treat words as independent tokens.
*   **Loss of Context/Order (BoW only)**: BoW specifically loses the order and grammatical structure of words within a sentence, which is crucial for understanding nuanced meaning.

In modern NLP, these techniques are often superseded by **word embeddings** (like Word2Vec, GloVe, BERT embeddings), which represent words and documents in dense, lower-dimensional vectors that *do* capture semantic and contextual relationships.

# Basic Bag of Words (Single Step)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "movie was good",
    "movie was bad",
    "good movie"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", X.toarray())


Vocabulary: ['bad' 'good' 'movie' 'was']
BoW Matrix:
 [[0 1 1 1]
 [1 0 1 1]
 [0 1 1 0]]


# Bag of Words with Stopword Removal

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "this movie was very good",
    "this movie was very bad",
    "good movie"
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())


['bad' 'good' 'movie']
[[0 1 1]
 [1 0 1]
 [0 1 1]]


# Bag of Words with Vocabulary Limit

In [8]:
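# Reuses `corpus` and the CountVectorizer import from the previous cell.
# max_features keeps only the terms with the highest total counts in the corpus.
vectorizer = CountVectorizer(max_features=3)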
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())


['good' 'movie' 'this']
[[1 1 1]
 [0 1 1]
 [1 1 0]]


In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_text = [
    "movie was good",
    "movie was bad",
    "excellent movie",
    "terrible movie",
    "good acting",
    "bad acting"
]

y = [1, 0, 1, 0, 1, 0]  # 1=Positive, 0=Negative

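# Note: the vectorizer is fit on all documents before the split, so the test
# documents contribute to the vocabulary. On real data, split first and fit on
# the training portion only.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_text)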

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Accuracy:", model.score(X_test, y_test))


Accuracy: 1.0


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('bow', CountVectorizer(
        stop_words='english',
        ngram_range=(1,2),
        max_features=5000
    )),
    ('clf', LinearSVC())
])

pipeline.fit(X_text, y)

print("Prediction:", pipeline.predict(["very good movie"]))


Prediction: [1]


# Feature Extraction

In [11]:
import pandas as pd
import numpy as np


In [15]:
import pandas as pd

df = pd.DataFrame({
    "text": [
        'people watch campusx',
        'campusx watch on zoom',
        'people write comment on campusx',
        'zoom write comment on campusx',
        'students watch lecture on campusx',
        'people like campusx content',
        'zoom host live class',
        'students write notes',
        'people attend class on zoom',
        'campusx upload new lecture'
    ],
    "output": [1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
})

df


|     | text                              | output |
| :-- | :-------------------------------- | :----- |
| 0   | people watch campusx              | 1      |
| 1   | campusx watch on zoom             | 1      |
| 2   | people write comment on campusx   | 0      |
| 3   | zoom write comment on campusx     | 0      |
| 4   | students watch lecture on campusx | 1      |
| 5   | people like campusx content       | 1      |
| 6   | zoom host live class              | 0      |
| 7   | students write notes              | 0      |
| 8   | people attend class on zoom       | 1      |
| 9   | campusx upload new lecture        | 1      |



In [16]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bow = cv.fit_transform(df['text'])

#vocab

In [17]:
print(cv.vocabulary_)

{'people': 12, 'watch': 15, 'campusx': 1, 'on': 11, 'zoom': 17, 'write': 16, 'comment': 3, 'students': 13, 'lecture': 6, 'like': 7, 'content': 4, 'host': 5, 'live': 8, 'class': 2, 'notes': 10, 'attend': 0, 'upload': 14, 'new': 9}


In [18]:
print(bow[0].toarray())
print(bow[1].toarray())

[[0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0]]
[[0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1]]


In [19]:
cv.transform(["campusx watch lecture on zoom"]).toarray()

array([[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1]])

In [20]:
# Get feature names (vocabulary) from the CountVectorizer
feature_names = cv.get_feature_names_out()

# Create a DataFrame from the Bag of Words matrix
bow_df = pd.DataFrame(bow.toarray(), columns=feature_names)

print("Bag of Words DataFrame for the 'text' column:")
display(bow_df)

print("\nNow you have a DataFrame where each row represents a document and each column represents a word from the vocabulary, with values indicating word counts.")
print("You can now use this 'bow_df' for various NLP tasks, such as text classification, by combining it with your 'output' column if it's a supervised learning problem.")

Bag of Words DataFrame for the 'text' column:


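|     | attend | campusx | class | comment | content | host | lecture | like | live | new | notes | on | people | students | upload | watch | write | zoom |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 8 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |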



Now you have a DataFrame where each row represents a document and each column represents a word from the vocabulary, with values indicating word counts.
You can now use this 'bow_df' for various NLP tasks, such as text classification, by combining it with your 'output' column if it's a supervised learning problem.


## N-grams

In Natural Language Processing (NLP), an **n-gram** is a contiguous sequence of `n` items from a given sample of text or speech. The items can be words, syllables, or characters, but most commonly, n-grams refer to sequences of words.

*   **Unigram (n=1)**: A single word. This is what the basic Bag of Words model uses.
    *   Example: "The", "quick", "brown"

*   **Bigram (n=2)**: A sequence of two consecutive words.
    *   Example: "The quick", "quick brown", "brown fox"

*   **Trigram (n=3)**: A sequence of three consecutive words.
    *   Example: "The quick brown", "quick brown fox"

And so on, for higher values of `n`.
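The items do not have to be words. As a minimal sketch, assuming a hypothetical helper `char_ngrams` that is not part of the notebook, character-level n-grams can be extracted by slicing the string directly:

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("quick", 2))  # ['qu', 'ui', 'ic', 'ck']
print(char_ngrams("quick", 3))  # ['qui', 'uic', 'ick']
```

The same sliding-window idea applies to word n-grams; only the unit being sliced changes.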

### Why are N-grams Important?

The main limitation of the basic Bag of Words (BoW) model is that it treats each word independently and discards the order of words. This means "good movie" and "movie good" or "man bites dog" and "dog bites man" would have identical BoW representations, despite having different meanings.
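The order-blindness described above can be checked by hand. The sketch below uses plain whitespace tokenization and a hypothetical helper `ngram_counts` (not part of the notebook) to compare the two sentences, first as unigram counts and then as bigram counts:

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the word n-grams in text using simple whitespace tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "man bites dog"
b = "dog bites man"

# Unigram counts ignore order, so the two sentences look identical.
print(ngram_counts(a, 1) == ngram_counts(b, 1))  # True

# Bigram counts keep local order, so the two sentences differ.
print(ngram_counts(a, 2) == ngram_counts(b, 2))  # False
```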

N-grams address this limitation by capturing some of the local word order and context:

1.  **Contextual Information**: Bigrams and trigrams can capture common phrases or expressions where the individual words' meanings are altered by their combination (e.g., "not good" is different from "good").
2.  **Improved Performance**: For tasks like text classification, sentiment analysis, or machine translation, incorporating n-grams often leads to better model performance because the model can learn patterns from word sequences rather than just individual words.
3.  **Language Modeling**: N-grams are fundamental in language modeling, where they are used to predict the next word in a sequence based on the preceding `n-1` words.
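To make the language-modeling point concrete, here is a minimal sketch of a bigram model that predicts the next word from the single preceding word. The tiny corpus and the helper name `predict_next` are assumptions made for illustration; a real language model would add smoothing and use far more data.

```python
from collections import Counter, defaultdict

corpus = [
    "i love cats and dogs",
    "dogs are loyal pets",
    "cats are playful and independent",
    "i love loyal dogs",
]

# For every word, count which words follow it.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        next_word_counts[prev][nxt] += 1


def predict_next(word):
    """Return the most frequent follower of word, or None if word was never seen."""
    counts = next_word_counts.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(predict_next("i"))      # love
print(predict_next("zebra"))  # None
```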

### How N-grams are Used with Bag of Words:

When you use `CountVectorizer` (or similar tools) to create a Bag of Words representation, you can specify an `ngram_range`. This allows the vectorizer to count not only individual words (unigrams) but also sequences of words (bigrams, trigrams, etc.). The vocabulary will then consist of both single words and these multi-word phrases.

For example, if `ngram_range=(1, 2)`, the vocabulary would include both unigrams (single words) and bigrams (two-word sequences). Each document would then be represented by a vector counting the occurrences of both unigrams and bigrams within it.

### Example:

Let's consider the sentence: "The quick brown fox."

*   **Unigrams**: {"The", "quick", "brown", "fox"}
*   **Bigrams**: {"The quick", "quick brown", "brown fox"}
*   **Trigrams**: {"The quick brown", "quick brown fox"}

Including n-grams helps capture more nuanced meaning and relationships between words that a simple Bag of Words model would miss. However, it also significantly increases the dimensionality of your feature space, which can lead to higher memory consumption and computational cost.
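The growth in dimensionality can be seen even on a four-sentence corpus. The sketch below counts distinct n-grams as the maximum `n` increases. It uses whitespace tokens and a hypothetical helper `vocabulary_size`, so the numbers differ slightly from `CountVectorizer`, which by default drops one-character tokens such as "i".

```python
corpus = [
    "i love cats and dogs",
    "dogs are loyal pets",
    "cats are playful and independent",
    "i love loyal dogs",
]


def vocabulary_size(documents, max_n):
    """Count distinct word n-grams for every n from 1 to max_n."""
    vocab = set()
    for doc in documents:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(tuple(tokens[i:i + n]))
    return len(vocab)


sizes = {max_n: vocabulary_size(corpus, max_n) for max_n in (1, 2, 3)}
print(sizes)  # {1: 10, 2: 23, 3: 33}
```

The feature count more than triples between unigrams only and unigrams through trigrams, while the number of documents stays at four.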

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "I love cats and dogs.",
    "Dogs are loyal pets.",
    "Cats are playful and independent.",
    "I love loyal dogs."
]

# Using CountVectorizer to generate unigrams and bigrams
# ngram_range=(1,1) for unigrams only (default Bag of Words)
# ngram_range=(2,2) for bigrams only
# ngram_range=(1,2) for unigrams and bigrams

vectorizer_ngrams = CountVectorizer(ngram_range=(1, 2))
X_ngrams = vectorizer_ngrams.fit_transform(corpus)

# Get the feature names (vocabulary including n-grams)
ngram_feature_names = vectorizer_ngrams.get_feature_names_out()

# Create a DataFrame for better visualization
ngram_bow_df = pd.DataFrame(X_ngrams.toarray(), columns=ngram_feature_names)

print("Vocabulary with Unigrams and Bigrams:")
print(ngram_feature_names)

print("\nBag of Words DataFrame with Unigrams and Bigrams:")
display(ngram_bow_df)

Vocabulary with Unigrams and Bigrams:
['and' 'and dogs' 'and independent' 'are' 'are loyal' 'are playful' 'cats'
 'cats and' 'cats are' 'dogs' 'dogs are' 'independent' 'love' 'love cats'
 'love loyal' 'loyal' 'loyal dogs' 'loyal pets' 'pets' 'playful'
 'playful and']

Bag of Words DataFrame with Unigrams and Bigrams:


Unnamed: 0,and,and dogs,and independent,are,are loyal,are playful,cats,cats and,cats are,dogs,...,independent,love,love cats,love loyal,loyal,loyal dogs,loyal pets,pets,playful,playful and
0,1,1,0,0,0,0,1,1,0,1,...,0,1,1,0,0,0,0,0,0,0
1,0,0,0,1,1,0,0,0,0,1,...,0,0,0,0,1,0,1,1,0,0
2,1,0,1,1,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,0,0,0,0,1,...,0,1,0,1,1,1,0,0,0,0


In [22]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Recreate ngram_bow_df using the 'text' column from the original df
vectorizer_ngrams_full = CountVectorizer(ngram_range=(1, 2))
X_ngrams_full = vectorizer_ngrams_full.fit_transform(df['text'])

# Get the feature names (vocabulary including n-grams)
ngram_feature_names_full = vectorizer_ngrams_full.get_feature_names_out()

# Create a DataFrame for better visualization (optional, but good for understanding)
ngram_bow_df_full = pd.DataFrame(X_ngrams_full.toarray(), columns=ngram_feature_names_full)

print("Recreated N-gram Bag of Words DataFrame (first 5 rows):")
display(ngram_bow_df_full.head())

# Define features (X) and target (y)
X = ngram_bow_df_full
y = df['output']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# Initialize and train a Logistic Regression model
model = LogisticRegression(max_iter=1000) # Increased max_iter for convergence
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"\nModel Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(report)

print("\nThis demonstrates how you can use the N-gram Bag of Words features to train a text classification model and evaluate its performance.")

Recreated N-gram Bag of Words DataFrame (first 5 rows):


Unnamed: 0,attend,attend class,campusx,campusx content,campusx upload,campusx watch,class,class on,comment,comment on,...,watch,watch campusx,watch lecture,watch on,write,write comment,write notes,zoom,zoom host,zoom write
0,0,0,1,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1,0,0,1,0,0,1,0,0,0,0,...,1,0,0,1,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,1,1,...,0,0,0,0,1,1,0,0,0,0
3,0,0,1,0,0,0,0,0,1,1,...,0,0,0,0,1,1,0,1,0,1
4,0,0,1,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0



Training set size: 7 samples
Test set size: 3 samples

Model Accuracy: 0.67

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3


This demonstrates how you can use the N-gram Bag of Words features to train a text classification model and evaluate its performance.




#Implementation of N-grams



## Explanation of `generate_ngrams` function

The `generate_ngrams` function is a custom utility to illustrate how n-grams are extracted from a given text. It takes a `text` string and an integer `n` (representing the size of the n-gram) as input.

### How it works:
1.  **Tokenization**: First, the input `text` is split into individual words using `text.split()`. This creates a list of tokens.
2.  **N-gram Generation**: It then iterates through this list of tokens, creating contiguous sequences of `n` tokens. Each sequence is stored as a tuple.

### Usage for Unigrams, Bigrams, and Trigrams:
*   **Unigrams (n=1)**: When `n=1`, the function generates single words. For the example text "Geeks for Geeks Community", this will produce `('Geeks',), ('for',), ('Geeks',), ('Community',)`.
*   **Bigrams (n=2)**: When `n=2`, it generates pairs of consecutive words. From "Geeks for Geeks Community", the bigrams will be `('Geeks', 'for'), ('for', 'Geeks'), ('Geeks', 'Community')`.
*   **Trigrams (n=3)**: When `n=3`, it generates sequences of three consecutive words. For the same text, the trigrams will be `('Geeks', 'for', 'Geeks'), ('for', 'Geeks', 'Community')`.

This function provides a clear, step-by-step understanding of how to manually generate different levels of n-grams, which are crucial for capturing word order and contextual information in NLP tasks.


In [1]:
def generate_ngrams(text, n):
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams


text = "Geeks for Geeks Community"

unigrams = generate_ngrams(text, 1)
bigrams = generate_ngrams(text, 2)
trigrams = generate_ngrams(text, 3)

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Unigrams: [('Geeks',), ('for',), ('Geeks',), ('Community',)]
Bigrams: [('Geeks', 'for'), ('for', 'Geeks'), ('Geeks', 'Community')]
Trigrams: [('Geeks', 'for', 'Geeks'), ('for', 'Geeks', 'Community')]
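Two edge cases are worth noting. When `n` exceeds the number of tokens, the slicing logic yields an empty list. When `n` is less than 1, the original function returns empty tuples rather than failing. The variant below, named `generate_ngrams_checked` here purely for illustration, adds a guard for the second case:

```python
def generate_ngrams_checked(text, n):
    """Variant of generate_ngrams that rejects n < 1 (hypothetical addition)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Fewer than 3 tokens, so no trigram can be formed.
print(generate_ngrams_checked("Geeks for", 3))  # []
```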


In [2]:
import pandas as pd

df = pd.DataFrame({
    "text": [
        'people watch campusx',
        'campusx watch on zoom',
        'people write comment on campusx',
        'zoom write comment on campusx',
        'students watch lecture on campusx',
        'people like campusx content'
    ],
    "output": [1, 1, 0, 0, 1, 1]
})


#🟢 LEVEL 1 — UNIGRAMS (Basic BoW)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

cv_uni = CountVectorizer(ngram_range=(1,1))
X_uni = cv_uni.fit_transform(df['text'])

uni_df = pd.DataFrame(
    X_uni.toarray(),
    columns=cv_uni.get_feature_names_out()
)

final_uni = pd.concat([uni_df, df['output']], axis=1)
final_uni


Unnamed: 0,campusx,comment,content,lecture,like,on,people,students,watch,write,zoom,output
0,1,0,0,0,0,0,1,0,1,0,0,1
1,1,0,0,0,0,1,0,0,1,0,1,1
2,1,1,0,0,0,1,1,0,0,1,0,0
3,1,1,0,0,0,1,0,0,0,1,1,0
4,1,0,0,1,0,1,0,1,1,0,0,1
5,1,0,1,0,1,0,1,0,0,0,0,1


#🟡 LEVEL 2 — BIGRAMS (Word Pairs)

In [4]:
cv_bi = CountVectorizer(ngram_range=(2,2))
X_bi = cv_bi.fit_transform(df['text'])

bi_df = pd.DataFrame(
    X_bi.toarray(),
    columns=cv_bi.get_feature_names_out()
)

final_bi = pd.concat([bi_df, df['output']], axis=1)
final_bi


Unnamed: 0,campusx content,campusx watch,comment on,lecture on,like campusx,on campusx,on zoom,people like,people watch,people write,students watch,watch campusx,watch lecture,watch on,write comment,zoom write,output
0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1
1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
2,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0
3,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1
5,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1


#🟠 LEVEL 3 — UNIGRAM + BIGRAM (Most Used)

In [5]:
cv_uni_bi = CountVectorizer(ngram_range=(1,2))
X_uni_bi = cv_uni_bi.fit_transform(df['text'])

uni_bi_df = pd.DataFrame(
    X_uni_bi.toarray(),
    columns=cv_uni_bi.get_feature_names_out()
)

final_uni_bi = pd.concat([uni_bi_df, df['output']], axis=1)
final_uni_bi


Unnamed: 0,campusx,campusx content,campusx watch,comment,comment on,content,lecture,lecture on,like,like campusx,...,students watch,watch,watch campusx,watch lecture,watch on,write,write comment,zoom,zoom write,output
0,1,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,1
1,1,0,1,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,1
2,1,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
3,1,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
4,1,0,0,0,0,0,1,1,0,0,...,1,1,0,1,0,0,0,0,0,1
5,1,1,0,0,0,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1


#🔵 LEVEL 4 — TRIGRAMS (Advanced)

In [6]:
cv_tri = CountVectorizer(ngram_range=(3,3))
X_tri = cv_tri.fit_transform(df['text'])

tri_df = pd.DataFrame(
    X_tri.toarray(),
    columns=cv_tri.get_feature_names_out()
)

final_tri = pd.concat([tri_df, df['output']], axis=1)
final_tri


Unnamed: 0,campusx watch on,comment on campusx,lecture on campusx,like campusx content,people like campusx,people watch campusx,people write comment,students watch lecture,watch lecture on,watch on zoom,write comment on,zoom write comment,output
0,0,0,0,0,0,1,0,0,0,0,0,0,1
1,1,0,0,0,0,0,0,0,0,1,0,0,1
2,0,1,0,0,0,0,1,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,1,1,0
4,0,0,1,0,0,0,0,1,1,0,0,0,1
5,0,0,0,1,1,0,0,0,0,0,0,0,1


#🔴 LEVEL 5 — N-grams with Stopword Removal

In [7]:
cv_clean = CountVectorizer(
    ngram_range=(1,2),
    stop_words='english'
)

X_clean = cv_clean.fit_transform(df['text'])

clean_df = pd.DataFrame(
    X_clean.toarray(),
    columns=cv_clean.get_feature_names_out()
)

final_clean = pd.concat([clean_df, df['output']], axis=1)
final_clean


Unnamed: 0,campusx,campusx content,campusx watch,comment,comment campusx,content,lecture,lecture campusx,like,like campusx,...,students watch,watch,watch campusx,watch lecture,watch zoom,write,write comment,zoom,zoom write,output
0,1,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,1
1,1,0,1,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,1
2,1,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
3,1,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
4,1,0,0,0,0,0,1,1,0,0,...,1,1,0,1,0,0,0,0,0,1
5,1,1,0,0,0,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
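The columns above include bigrams such as `comment campusx` and `watch zoom`, which never occur as adjacent words in the original sentences. This is consistent with stopwords being dropped before n-grams are formed, so the words on either side of a removed stopword become neighbours. A minimal sketch of that order of operations, using a one-word stopword set and a hypothetical helper name:

```python
# Tiny illustrative stopword set; scikit-learn's 'english' list is much larger.
STOP_WORDS = {"on"}


def bigrams_after_stopword_removal(text, stop_words=STOP_WORDS):
    """Drop stopwords first, then build bigrams from the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


print(bigrams_after_stopword_removal("people write comment on campusx"))
# ['people write', 'write comment', 'comment campusx']
```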


# Task
Train and evaluate Logistic Regression models using various n-gram features (unigrams, bigrams, unigrams+bigrams, trigrams, and unigrams+bigrams with stopword removal) from the provided `df` data, and then summarize their classification performance, including accuracy and classification reports, to understand the effectiveness of each approach.

## Evaluate Unigrams (Basic BoW)

The cell below trains a Logistic Regression model on the unigram features in `final_uni`, with `output` as the target. It separates features from the target, splits the data, fits the model, and prints the accuracy and classification report.



In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Separate features (X) and target (y)
X_uni = final_uni.drop('output', axis=1)
y_uni = final_uni['output']

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_uni, y_uni, test_size=0.3, random_state=42, stratify=y_uni
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# 3. Initialize a Logistic Regression model
model_uni = LogisticRegression(max_iter=1000)

# 4. Fit the model to the training data
model_uni.fit(X_train, y_train)

# 5. Make predictions on the test set
y_pred_uni = model_uni.predict(X_test)

# 6. Calculate and print the accuracy score
accuracy_uni = accuracy_score(y_test, y_pred_uni)
print(f"\nModel Accuracy (Unigrams): {accuracy_uni:.2f}")

# 7. Generate and print the classification report
print("\nClassification Report (Unigrams):")
print(classification_report(y_test, y_pred_uni))

Training set size: 4 samples
Test set size: 2 samples

Model Accuracy (Unigrams): 0.50

Classification Report (Unigrams):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





## Evaluate Bigrams (Word Pairs)

In the unigram run, the model never predicted class 0 on the two-sample test set, so precision for that class is undefined and scikit-learn reports it as 0.00 with a warning. The next cell repeats the same evaluation using the bigram features in `final_bi`.



In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Separate features (X) and target (y) for bigrams
X_bi = final_bi.drop('output', axis=1)
y_bi = final_bi['output']

# 2. Split the data into training and testing sets
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(
    X_bi, y_bi, test_size=0.3, random_state=42, stratify=y_bi
)

print(f"Training set size (Bigrams): {X_train_bi.shape[0]} samples")
print(f"Test set size (Bigrams): {X_test_bi.shape[0]} samples")

# 3. Initialize a Logistic Regression model for bigrams
model_bi = LogisticRegression(max_iter=1000)

# 4. Fit the model to the training data
model_bi.fit(X_train_bi, y_train_bi)

# 5. Make predictions on the test set
y_pred_bi = model_bi.predict(X_test_bi)

# 6. Calculate and print the accuracy score
accuracy_bi = accuracy_score(y_test_bi, y_pred_bi)
print(f"\nModel Accuracy (Bigrams): {accuracy_bi:.2f}")

# 7. Generate and print the classification report
print("\nClassification Report (Bigrams):")
print(classification_report(y_test_bi, y_pred_bi))

Training set size (Bigrams): 4 samples
Test set size (Bigrams): 2 samples

Model Accuracy (Bigrams): 0.50

Classification Report (Bigrams):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





## Evaluate Unigrams + Bigrams

The next cell repeats the evaluation using the combined unigram and bigram features in `final_uni_bi`.



In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Separate features (X) and target (y) for unigram + bigram
X_uni_bi = final_uni_bi.drop('output', axis=1)
y_uni_bi = final_uni_bi['output']

# 2. Split the data into training and testing sets
X_train_uni_bi, X_test_uni_bi, y_train_uni_bi, y_test_uni_bi = train_test_split(
    X_uni_bi, y_uni_bi, test_size=0.3, random_state=42, stratify=y_uni_bi
)

print(f"Training set size (Unigram + Bigram): {X_train_uni_bi.shape[0]} samples")
print(f"Test set size (Unigram + Bigram): {X_test_uni_bi.shape[0]} samples")

# 3. Initialize a Logistic Regression model for unigram + bigram
model_uni_bi = LogisticRegression(max_iter=1000)

# 4. Fit the model to the training data
model_uni_bi.fit(X_train_uni_bi, y_train_uni_bi)

# 5. Make predictions on the test set
y_pred_uni_bi = model_uni_bi.predict(X_test_uni_bi)

# 6. Calculate and print the accuracy score
accuracy_uni_bi = accuracy_score(y_test_uni_bi, y_pred_uni_bi)
print(f"\nModel Accuracy (Unigram + Bigram): {accuracy_uni_bi:.2f}")

# 7. Generate and print the classification report
print("\nClassification Report (Unigram + Bigram):")
print(classification_report(y_test_uni_bi, y_pred_uni_bi))

Training set size (Unigram + Bigram): 4 samples
Test set size (Unigram + Bigram): 2 samples

Model Accuracy (Unigram + Bigram): 1.00

Classification Report (Unigram + Bigram):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



## Evaluate Trigrams

The next cell repeats the evaluation using the trigram features in `final_tri`.



In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Separate features (X) and target (y) for trigrams
X_tri = final_tri.drop('output', axis=1)
y_tri = final_tri['output']

# 2. Split the data into training and testing sets
X_train_tri, X_test_tri, y_train_tri, y_test_tri = train_test_split(
    X_tri, y_tri, test_size=0.3, random_state=42, stratify=y_tri
)

print(f"Training set size (Trigrams): {X_train_tri.shape[0]} samples")
print(f"Test set size (Trigrams): {X_test_tri.shape[0]} samples")

# 3. Initialize a Logistic Regression model for trigrams
model_tri = LogisticRegression(max_iter=1000)

# 4. Fit the model to the training data
model_tri.fit(X_train_tri, y_train_tri)

# 5. Make predictions on the test set
y_pred_tri = model_tri.predict(X_test_tri)

# 6. Calculate and print the accuracy score
accuracy_tri = accuracy_score(y_test_tri, y_pred_tri)
print(f"\nModel Accuracy (Trigrams): {accuracy_tri:.2f}")

# 7. Generate and print the classification report
print("\nClassification Report (Trigrams):")
print(classification_report(y_test_tri, y_pred_tri))

Training set size (Trigrams): 4 samples
Test set size (Trigrams): 2 samples

Model Accuracy (Trigrams): 0.50

Classification Report (Trigrams):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





## Evaluate N-grams with Stopword Removal

The final cell repeats the evaluation using the unigram and bigram features with English stopwords removed, taken from `final_clean`.



In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Separate features (X) and target (y) for unigram + bigram with stopword removal
X_clean = final_clean.drop('output', axis=1)
y_clean = final_clean['output']

# 2. Split the data into training and testing sets
X_train_clean, X_test_clean, y_train_clean, y_test_clean = train_test_split(
    X_clean, y_clean, test_size=0.3, random_state=42, stratify=y_clean
)

print(f"Training set size (N-grams w/ Stopword Removal): {X_train_clean.shape[0]} samples")
print(f"Test set size (N-grams w/ Stopword Removal): {X_test_clean.shape[0]} samples")

# 3. Initialize a Logistic Regression model
model_clean = LogisticRegression(max_iter=1000)

# 4. Fit the model to the training data
model_clean.fit(X_train_clean, y_train_clean)

# 5. Make predictions on the test set
y_pred_clean = model_clean.predict(X_test_clean)

# 6. Calculate and print the accuracy score
accuracy_clean = accuracy_score(y_test_clean, y_pred_clean)
print(f"\nModel Accuracy (N-grams w/ Stopword Removal): {accuracy_clean:.2f}")

# 7. Generate and print the classification report
print("\nClassification Report (N-grams w/ Stopword Removal):")
print(classification_report(y_test_clean, y_pred_clean))

Training set size (N-grams w/ Stopword Removal): 4 samples
Test set size (N-grams w/ Stopword Removal): 2 samples

Model Accuracy (N-grams w/ Stopword Removal): 0.50

Classification Report (N-grams w/ Stopword Removal):
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2






## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

### 1. Term Frequency (TF)

**Definition**: Term Frequency (TF) measures how frequently a term (word) appears in a document. Since some documents are longer than others, it's common to normalize the TF by dividing the raw count of a term by the total number of terms in the document. This helps prevent a bias towards longer documents.

**Calculation**: There are several ways to calculate TF:

*   **Raw Count**: `TF(t, d) = (Number of times term t appears in document d)`
*   **Normalized Count**: `TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)`
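*   **Log Normalization**: `TF(t, d) = 1 + log(Number of times term t appears in document d)` when the term appears at least once, and `0` otherwise

The normalized count is a common choice when document lengths vary. scikit-learn's `TfidfVectorizer`, for example, starts from raw counts and normalizes each document vector afterwards.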

**Purpose**: A higher TF value for a term in a document suggests that the term is more relevant to that specific document's content.
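The three TF variants can be compared side by side. This is a minimal sketch using whitespace tokens; the helper name `term_frequencies` and the example sentence are assumptions for illustration.

```python
import math
from collections import Counter


def term_frequencies(document):
    """Return raw, length-normalized, and log-scaled TF for each term."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: {
            "raw": count,
            "normalized": count / total,
            "log": 1 + math.log(count),  # count >= 1 for every term present
        }
        for term, count in counts.items()
    }


tf = term_frequencies("campusx watch campusx lecture")
print(tf["campusx"])  # raw 2, normalized 0.5, log about 1.693
print(tf["watch"])    # raw 1, normalized 0.25, log 1.0
```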

### 2. Inverse Document Frequency (IDF)

**Definition**: Inverse Document Frequency (IDF) measures how unique or rare a term is across the entire corpus. Words that appear frequently in many documents (like "the", "a", "is" - stopwords) carry less importance, while words that appear in only a few documents are likely to be more significant. IDF aims to diminish the weight of common terms and increase the weight of rare terms.

**Calculation**: IDF is typically calculated as follows:

`IDF(t, D) = log_e(Total number of documents in corpus D / Number of documents where term t appears)`

Sometimes a `+1` is added to the denominator so that a term appearing in no document does not cause a division by zero: `IDF(t, D) = log_e(Total number of documents in corpus D / (1 + Number of documents where term t appears))`. With this variant, a term that appears in every document gets a slightly negative IDF.

**Purpose**: A higher IDF value indicates that a term is rare and thus potentially more informative across the entire collection of documents.
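As a rough sketch of both IDF formulas, the function below computes IDF over the six example sentences used earlier in this notebook. The function name `idf` and its `plus_one` flag are hypothetical; they only mirror the two formulas written above.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx",
    "students watch lecture on campusx",
    "people like campusx content",
]


def idf(term, documents, plus_one=False):
    """Natural-log IDF; plus_one=True adds 1 to the document frequency."""
    n_docs = len(documents)
    doc_freq = sum(term in doc.split() for doc in documents)
    if plus_one:
        return math.log(n_docs / (1 + doc_freq))
    return math.log(n_docs / doc_freq)  # raises ZeroDivisionError for unseen terms


print(round(idf("campusx", docs), 3))  # 0.0, appears in all 6 documents
print(round(idf("zoom", docs), 3))     # 1.099, appears in 2 documents
print(round(idf("content", docs), 3))  # 1.792, appears in 1 document
print(round(idf("campusx", docs, plus_one=True), 3))  # -0.154
```

The term that appears in every document gets an IDF of zero, and the rarest term gets the largest value.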

### 3. TF-IDF Weighting Scheme

**Combination**: TF-IDF is calculated by multiplying the TF and IDF values for each term:

`TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)`
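Putting the two pieces together, the sketch below computes TF-IDF with the normalized-count TF and the unsmoothed natural-log IDF given above. The helper name `tf_idf` and the three-sentence corpus are assumptions for illustration, not a library API.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (normalized TF times log IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "people watch campusx",
    "campusx watch on zoom",
    "people like campusx content",
]
scores = tf_idf(docs)
print(scores[0]["campusx"])            # 0.0, the term is in every document
print(round(scores[0]["people"], 3))   # 0.135
print(round(scores[2]["content"], 3))  # 0.275
```

A word shared by every document contributes nothing, while a word unique to one document receives the highest weight within it.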

**Purpose and Benefits in NLP**:
*   **Feature Extraction**: TF-IDF is a powerful technique for converting text into a numerical representation (vector space model) that can be used by machine learning algorithms.
*   **Keyword Extraction**: Terms with high TF-IDF scores are often excellent indicators of the main topics or keywords within a document, as they are frequent in that document but rare across the entire corpus.
*   **Information Retrieval**: When searching a corpus, documents containing terms with high TF-IDF scores are considered more relevant to a query.
*   **Text Classification**: By representing documents as TF-IDF vectors, classification algorithms can learn to categorize texts based on their most distinctive words.
*   **Noise Reduction**: TF-IDF does not reduce the number of dimensions, but it shifts weight toward distinctive terms, which reduces the 'noise' from common, less informative words.
*   **Handles Stopwords**: Automatically down-weights common words (stopwords) without explicit removal, as they will have a low IDF score.


### Summary of N-gram Model Performance

We trained and evaluated Logistic Regression models using various n-gram feature sets. Because the dataset has only 6 samples, each test set contains just 2 samples. Metrics computed on so few samples are unstable, and scikit-learn raises `UndefinedMetricWarning` for precision when a class is never predicted. The accuracies below are therefore illustrative only.

Here's a summary of the model accuracies:

*   **Unigrams (Basic BoW)**:
    *   Accuracy: `0.50`
    *   *Note: The model predicted class 1 for both test samples, so precision and recall for class 0 were reported as 0.00.*

*   **Bigrams (Word Pairs)**:
    *   Accuracy: `0.50`
    *   *Note: Similar to unigrams, the classification report showed poor precision/recall for class 0.*

*   **Unigrams + Bigrams (Most Used)**:
    *   Accuracy: `1.00`
    *   *This combination classified both test samples correctly. With a test set of two samples, a perfect score can easily occur by chance, so it should not be read as evidence that the combined features separate the classes in general.*

*   **Trigrams (Advanced)**:
    *   Accuracy: `0.50`
    *   *The model performed similarly to unigrams and bigrams, with issues in predicting class 0 due to data scarcity.*

*   **N-grams with Stopword Removal (Unigrams + Bigrams)**:
    *   Accuracy: `0.50`
    *   *Removing stopwords in conjunction with unigrams and bigrams did not improve performance over basic unigrams/bigrams, potentially because the removed words were important for context in this specific, small dataset, or simply due to the tiny test set.*


### Key Takeaways:

*   For this extremely small dataset, the **Unigrams + Bigrams** model achieved the highest accuracy (1.00). Combining single words with word pairs may capture more context than either alone, but a two-sample test set cannot confirm this.
*   The `UndefinedMetricWarning` is expected given the `test_size=0.3` on a `df` of 6 samples, leading to a test set of only 2 samples (1 from each class due to `stratify=y_uni`). This makes the evaluation metrics very sensitive and not truly representative of real-world performance.
*   This exercise effectively demonstrates the process of feature extraction using different n-gram ranges and training/evaluating classification models, even if the absolute performance metrics are unreliable due to data size.
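
The effect of the split size can be checked directly. The sketch below is a minimal illustration, assuming a hypothetical stand-in dataset of 6 samples with 3 per class (the actual `df` is not reproduced here, and `random_state=42` is an arbitrary choice); it shows that `test_size=0.3` with stratification yields a 2-sample test set, so accuracy can only take three values.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 6-sample dataset: 3 samples per class.
texts = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"]
labels = [0, 0, 0, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

# ceil(0.3 * 6) = 2 test samples, one per class because of stratification.
print(len(X_train), len(X_test), sorted(y_test))

# With 2 test samples, accuracy can only be 0.0, 0.5, or 1.0.
possible_accuracies = [correct / len(y_test) for correct in range(len(y_test) + 1)]
print(possible_accuracies)
```

A single misclassified sample therefore moves accuracy by 0.50, which is why the differences between the models above should not be read as meaningful.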


## Explain TF-IDF

### Subtask:
Add a markdown cell with a detailed explanation of Term Frequency (TF) and Inverse Document Frequency (IDF), defining each concept and illustrating how they are used together in the TF-IDF weighting scheme.


---

### Term Frequency-Inverse Document Frequency (TF-IDF)

**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. It is often used as a weighting factor in information retrieval and text mining. The TF-IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

### 1. Term Frequency (TF)

**Definition**: Term Frequency (TF) measures how frequently a term (word) appears in a document. Because documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length to normalize for this.

**Calculation**: There are several ways to calculate TF, but common approaches include:
*   **Raw Count**: The number of times a term `t` appears in document `d` (`count(t, d)`).
*   **Normalized Frequency**: `count(t, d) / total_number_of_terms_in_d`
*   **Log Normalization**: `log(1 + count(t, d))`

**Purpose**: Higher TF values indicate that a word is more relevant or characteristic of that specific document.
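
The three TF variants listed above can be compared with a short sketch. This is an illustration only: the helper name `term_frequencies`, the example sentence, and the whitespace tokenization are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter

def term_frequencies(document):
    """Return raw, length-normalized, and log-scaled TF for each term."""
    tokens = document.lower().split()  # simple whitespace tokenization (an assumption)
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: {
            "raw": count,                  # count(t, d)
            "normalized": count / total,   # count(t, d) / number of terms in d
            "log": math.log(1 + count),    # log(1 + count(t, d))
        }
        for term, count in counts.items()
    }

# Hypothetical example sentence: "people" occurs twice among six tokens.
tf = term_frequencies("people watch campusx and people comment")
print(tf["people"])
```

The raw count grows without bound in long documents, while the normalized and log-scaled variants dampen that growth in different ways.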

### 2. Inverse Document Frequency (IDF)

**Definition**: Inverse Document Frequency (IDF) measures how rare, and therefore how informative, a term is across the entire corpus. While TF increases with the number of times a word appears in a document, IDF scales down the impact of words that appear in many documents and are therefore less informative (e.g., "the", "a", "is").

**Calculation**: The IDF for a term `t` is calculated as:

`IDF(t) = log_e(Total_number_of_documents / Number_of_documents_containing_term_t)`

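This formula divides by zero for a term that appears in no document. One common adjustment adds 1 to the denominator:

`IDF(t) = log_e(Total_number_of_documents / (Number_of_documents_containing_term_t + 1))`

Note that this variant becomes negative for a term that appears in every document. Libraries often use a different smoothing: scikit-learn's `TfidfVectorizer`, with its default `smooth_idf=True`, computes `log_e((1 + Total_number_of_documents) / (1 + Number_of_documents_containing_term_t)) + 1`, which is always positive.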

**Purpose**: Rare words have a high IDF score, while common words that appear in many documents have a low IDF score.
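
As a rough illustration of the two IDF formulas above, the sketch below evaluates both on the four-sentence corpus used in this notebook's code cells. The helper names (`document_frequency`, `idf_plain`, `idf_plus_one`) and the whitespace tokenization are assumptions made for this example.

```python
import math

corpus = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx",
]
tokenized = [set(doc.split()) for doc in corpus]  # unique terms per document
n_docs = len(tokenized)

def document_frequency(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in tokenized)

def idf_plain(term):
    """log_e(N / df); undefined for terms absent from the corpus."""
    return math.log(n_docs / document_frequency(term))

def idf_plus_one(term):
    """log_e(N / (df + 1)); avoids division by zero."""
    return math.log(n_docs / (document_frequency(term) + 1))

for term in ("campusx", "on", "people"):
    print(term, document_frequency(term),
          round(idf_plain(term), 4), round(idf_plus_one(term), 4))
```

Here `campusx` appears in all four documents, so its plain IDF is 0, while the rarer `people` (two documents) receives the highest score of the three. With the `+ 1` variant, the score for `campusx` drops below zero.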

### 3. TF-IDF Weighting Scheme

**Calculation**: The TF-IDF score is the product of the Term Frequency and Inverse Document Frequency:

`TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)`

Where:
*   `t` is the term
*   `d` is the document
*   `D` is the corpus

**How they are used together**: TF-IDF assigns a weight to each term in a document based on its frequency within that document and its rarity across the entire corpus. A high TF-IDF score for a term in a document means that the term appears frequently in that specific document (high TF) but rarely in other documents in the corpus (high IDF). This combination highlights terms that are distinctive and important to a particular document.
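
A minimal sketch of the product `TF(t, d) * IDF(t, D)`, assuming length-normalized TF and the unsmoothed IDF formula, applied to the corpus from this notebook's code cells. The function names are illustrative; library implementations such as scikit-learn use a smoothed IDF and normalize each row, so their numbers differ.

```python
import math
from collections import Counter

corpus = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    """Length-normalized term frequency of `term` in one tokenized document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Unsmoothed IDF; assumes the term occurs in at least one document."""
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score every term of the first document, "people watch campusx".
scores = {term: tf_idf(term, docs[0], docs) for term in docs[0]}
print(scores)
```

Under these assumptions `campusx` scores 0 because it occurs in every document, while `people` and `watch`, which each occur in only two documents, receive equal positive weights.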

### Purpose and Benefits in NLP

*   **Feature Extraction**: TF-IDF is widely used for converting text into a numerical representation that can be used by machine learning algorithms (e.g., for text classification, clustering).
*   **Keyword Extraction**: It helps identify terms that are most relevant to a document, effectively extracting keywords.
*   **Information Retrieval**: Search engines can use TF-IDF weights to score how well a document matches a user's query.
*   **Document Similarity**: TF-IDF vectors can be used to calculate the similarity between documents (e.g., using cosine similarity).
*   **Handling Stop Words**: By penalizing terms that appear frequently across the corpus, TF-IDF inherently reduces the weight of common stop words, making them less influential without explicit removal.
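
The document-similarity use case can be sketched with scikit-learn, here on the corpus used in this notebook's code cells. This is one possible approach, not the only one; the variable name `similarity` is chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx",
]

# Rows of X are TF-IDF vectors, one per document.
X = TfidfVectorizer().fit_transform(corpus)

# Pairwise cosine similarity: entry [i, j] compares document i with document j.
similarity = cosine_similarity(X)
print(similarity.round(3))
```

The last two sentences share four of their five words, so their similarity is much higher than that of the first and last sentences, which share only `campusx`.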

**In essence, TF-IDF provides a robust way to quantify the importance of words in a document relative to a collection of documents, making it a foundational concept in many NLP applications.**

## Summary:

### Data Analysis Key Findings

*   **TF-IDF Explained**: A comprehensive markdown explanation of Term Frequency (TF) and Inverse Document Frequency (IDF) was successfully added. This included definitions, various calculation methods (e.g., raw count, normalized count for TF; logarithmic formula for IDF), and a clear illustration of how they combine into the TF-IDF weighting scheme. The explanation also detailed TF-IDF's benefits in NLP, such as feature extraction, keyword identification, information retrieval, and implicit handling of stopwords.
*   **N-gram Model Performance Summary**: A markdown cell summarizing the performance of Logistic Regression models with different n-gram features was added.
    *   The **Unigrams + Bigrams** model achieved the highest accuracy of 1.00 on the small test set.
    *   All other models (Unigrams, Bigrams, Trigrams, and N-grams with Stopword Removal) yielded an accuracy of 0.50.
    *   The perfect accuracy of the 'Unigrams + Bigrams' model is not reliable evidence of a better model: with a test set of only 2 samples, it may reflect **overfitting** or chance.
    *   The `UndefinedMetricWarning` for precision/recall arises because, with so few test samples, a model can fail to predict one of the classes at all; the stratified test set itself is balanced (1 sample per class).

### Insights or Next Steps

*   The detailed explanation of TF-IDF serves as a valuable foundational resource for understanding text feature engineering.
*   The performance results from the n-gram models highlight the critical importance of sufficient data for reliable model evaluation and generalization; the current results are not indicative of real-world performance due to the minuscule dataset.


In [14]:
import pandas as pd

df = pd.DataFrame({
    "text": [
        'people watch campusx',
        'campusx watch on zoom',
        'people write comment on campusx',
        'zoom write comment on campusx'
    ],
    "output": [1, 1, 0, 0]
})


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X = tfidf.fit_transform(df['text'])


In [16]:
tfidf_df = pd.DataFrame(
    X.toarray(),
    columns=tfidf.get_feature_names_out()
)

final_df = pd.concat([tfidf_df, df['output']], axis=1)
final_df


| | campusx | comment | on | people | watch | write | zoom | output |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.423897 | 0.0 | 0.0 | 0.640434 | 0.640434 | 0.0 | 0.0 | 1 |
| 1 | 0.376321 | 0.0 | 0.460295 | 0.0 | 0.568556 | 0.0 | 0.568556 | 1 |
| 2 | 0.327142 | 0.494255 | 0.400142 | 0.494255 | 0.0 | 0.494255 | 0.0 | 0 |
| 3 | 0.327142 | 0.494255 | 0.400142 | 0.0 | 0.0 | 0.494255 | 0.494255 | 0 |
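
These weights do not match the textbook formula because `TfidfVectorizer()` with default settings uses raw counts for TF, a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below recomputes the first row by hand under those defaults; the helper names are illustrative, and plain whitespace splitting is assumed to match the vectorizer's tokenization for this corpus.

```python
import math

corpus = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalized across the row."""
    weights = [doc.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

row0 = dict(zip(vocabulary, tfidf_row(docs[0])))
print({term: round(w, 6) for term, w in row0.items() if w > 0})
```

The nonzero entries should agree with row 0 of the table: about 0.423897 for `campusx` and 0.640434 for `people` and `watch`.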


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx"
]

tfidf = TfidfVectorizer(ngram_range=(1,2))
X = tfidf.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
df


| | campusx | campusx watch | comment | comment on | on | on campusx | on zoom | people | people watch | people write | watch | watch campusx | watch on | write | write comment | zoom | zoom write |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.27832 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.420493 | 0.533343 | 0.0 | 0.420493 | 0.533343 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.235195 | 0.450701 | 0.0 | 0.0 | 0.287677 | 0.0 | 0.450701 | 0.0 | 0.0 | 0.0 | 0.355338 | 0.0 | 0.450701 | 0.0 | 0.0 | 0.355338 | 0.0 |
| 2 | 0.224372 | 0.0 | 0.338987 | 0.338987 | 0.274439 | 0.338987 | 0.0 | 0.338987 | 0.0 | 0.429962 | 0.0 | 0.0 | 0.0 | 0.338987 | 0.338987 | 0.0 | 0.0 |
| 3 | 0.224372 | 0.0 | 0.338987 | 0.338987 | 0.274439 | 0.338987 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.338987 | 0.338987 | 0.338987 | 0.429962 |
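
The extra columns come from `ngram_range=(1, 2)`, which adds every pair of adjacent words to the vocabulary alongside the single words. A minimal sketch of that expansion, assuming plain whitespace tokenization (the helper `word_ngrams` is hypothetical and not part of scikit-learn):

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(word_ngrams("campusx watch on zoom"))
```

For the second sentence this yields the four unigrams followed by `campusx watch`, `watch on`, and `on zoom`, which are exactly the columns with nonzero weights in row 1.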


In [18]:
tfidf = TfidfVectorizer(
    stop_words='english'
)

X = tfidf.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
df


| | campusx | comment | people | watch | write | zoom |
|---|---|---|---|---|---|---|
| 0 | 0.423897 | 0.0 | 0.640434 | 0.640434 | 0.0 | 0.0 |
| 1 | 0.423897 | 0.0 | 0.0 | 0.640434 | 0.0 | 0.640434 |
| 2 | 0.356966 | 0.539313 | 0.539313 | 0.0 | 0.539313 | 0.0 |
| 3 | 0.356966 | 0.539313 | 0.0 | 0.0 | 0.539313 | 0.539313 |


In [19]:
tfidf = TfidfVectorizer(
    max_features=5
)

X = tfidf.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
df


| | campusx | comment | on | people | watch |
|---|---|---|---|---|---|
| 0 | 0.423897 | 0.0 | 0.0 | 0.640434 | 0.640434 |
| 1 | 0.457453 | 0.0 | 0.55953 | 0.0 | 0.691131 |
| 2 | 0.376321 | 0.568556 | 0.460295 | 0.568556 | 0.0 |
| 3 | 0.457453 | 0.691131 | 0.55953 | 0.0 | 0.0 |


In [20]:
tfidf = TfidfVectorizer(
    min_df=2,     # keep only terms that appear in at least 2 documents
    max_df=0.8    # drop terms that appear in more than 80% of documents
)

X = tfidf.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
df


| | comment | on | people | watch | write | zoom |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.707107 | 0.707107 | 0.0 | 0.0 |
| 1 | 0.0 | 0.496816 | 0.0 | 0.613667 | 0.0 | 0.613667 |
| 2 | 0.523035 | 0.423442 | 0.523035 | 0.0 | 0.523035 | 0.0 |
| 3 | 0.523035 | 0.423442 | 0.0 | 0.0 | 0.523035 | 0.523035 |
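
The missing `campusx` column follows from the document-frequency thresholds. The sketch below applies the same two rules by hand, assuming whitespace tokenization: an integer `min_df` is an absolute document count, and a float `max_df` is a proportion of the corpus. The variable names are chosen for this example.

```python
from collections import Counter

corpus = [
    "people watch campusx",
    "campusx watch on zoom",
    "people write comment on campusx",
    "zoom write comment on campusx",
]
n_docs = len(corpus)

# Document frequency: in how many documents each term occurs at least once.
doc_freq = Counter(term for doc in corpus for term in set(doc.split()))

min_df = 2      # absolute count: keep terms found in at least 2 documents
max_df = 0.8    # proportion: drop terms found in more than 80% of documents

kept = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)
dropped = sorted(term for term in doc_freq if term not in kept)
print(kept)
print(dropped)
```

With four documents, `max_df=0.8` allows a term in at most 3.2 documents, so `campusx` (present in all four) is removed, while every other term appears in at least two documents and passes `min_df=2`.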
