# Bag of Words (BoW)
This notebook explains the Bag of Words method in Natural Language Processing (NLP), covering both mathematical foundations and practical implementation in Python.

## 1. What is Bag of Words?
The Bag of Words (BoW) model is a way to represent text data for use in machine learning models. It converts each document into a fixed-length vector by:
1. Creating a vocabulary of all unique words in the corpus.
2. Representing each document as a vector that contains the frequency (or presence) of each word from the vocabulary.

### Example
Corpus:
- "I love NLP."
- "NLP is fun!"

Vocabulary: ['I', 'love', 'NLP', 'is', 'fun']

Vectors:
- Doc1: [1, 1, 1, 0, 0]
- Doc2: [0, 0, 1, 1, 1]
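
As a minimal sketch of the two steps above, the following code builds the vocabulary in order of first appearance and then counts each vocabulary word per document. The `tokenize` helper and its punctuation-stripping rule are assumptions made for this illustration, not part of any library.

```python
import re

corpus = ["I love NLP.", "NLP is fun!"]

def tokenize(text):
    # Assumption for this sketch: a token is any run of word characters,
    # so trailing punctuation such as "." and "!" is dropped. Case is kept.
    return re.findall(r"\w+", text)

# Step 1: vocabulary of unique words, in order of first appearance
vocabulary = []
for doc in corpus:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: one count vector per document, aligned with the vocabulary
vectors = [[tokenize(doc).count(word) for word in vocabulary] for doc in corpus]

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```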

## 2. Mathematical Foundation
Let:
- $D = \{d_1, d_2, \ldots, d_N\}$ be a collection of $N$ documents.
- $V = \{w_1, w_2, \ldots, w_M\}$ be the vocabulary of $M$ unique words.

Then each document $d_i$ can be represented as a vector $\mathbf{x}_i \in \mathbb{R}^M$, where $x_{ij} = \text{count}(w_j, d_i)$ is the frequency of word $w_j$ in document $d_i$.

The length of $\mathbf{x}_i$ equals the vocabulary size $M$. This representation ignores grammar and word order, treating each document as a multiset of words.
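
To make the definition concrete, here is a small sketch that builds the $N \times M$ count matrix directly from $x_{ij} = \text{count}(w_j, d_i)$. The helper name `count_matrix` and the toy documents are hypothetical, chosen only to show that reordering the words of a document leaves its vector unchanged.

```python
from collections import Counter

def count_matrix(docs, vocab):
    """Return rows x_i with x_ij = count(w_j, d_i) for pre-tokenized docs."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)                 # multiset of words in this document
        rows.append([counts[w] for w in vocab])  # Counter gives 0 for absent words
    return rows

# Toy documents: the first two contain the same words in a different order
docs = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "sleeps"],
]
vocab = sorted({w for tokens in docs for w in tokens})  # the M unique words

X = count_matrix(docs, vocab)

print(vocab)  # ['bites', 'dog', 'man', 'sleeps']
for row in X:
    print(row)

# Every vector has length M, and word order does not affect the result
print(all(len(row) == len(vocab) for row in X))  # True
print(X[0] == X[1])                              # True
```

Because each row only stores counts, the sum of a row equals the number of tokens in the corresponding document.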

## 3. Practical Example in Python

In [1]:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus
corpus = [
    "I love NLP, because NLP is very interesting",
    "NLP is fun",
    "I love machine learning and NLP"
]

# Create the BoW model
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)

# Show the vocabulary and feature vectors
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

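|   | and | because | fun | interesting | is | learning | love | machine | nlp | very |
|---|-----|---------|-----|-------------|----|----------|------|---------|-----|------|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |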


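The word "I" is missing from the output of `CountVectorizer`. This is not caused by stop-word removal: `stop_words` defaults to `None`, which is why "is" still appears in the output. It happens because the default `token_pattern`, `r"(?u)\b\w\w+\b"`, only matches tokens of two or more characters. We can override `token_pattern` to keep single-character tokens such as "I" (which then appears as the lowercased feature `i`):

`vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")`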
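
The difference between the patterns can be checked with Python's `re` module alone. scikit-learn documents `r"(?u)\b\w\w+\b"` as the default `token_pattern`, which requires at least two word characters per token. The sketch below lowercases the text by hand and applies each regular expression; it approximates the tokenization step and is not the library's actual implementation.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"    # documented scikit-learn default: 2+ word characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"  # override: 1+ word characters

# Lowercasing by hand mimics CountVectorizer's default lowercase=True
text = "I love NLP, because NLP is very interesting".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
override_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)
# ['love', 'nlp', 'because', 'nlp', 'is', 'very', 'interesting']
print(override_tokens)
# ['i', 'love', 'nlp', 'because', 'nlp', 'is', 'very', 'interesting']
```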

## 4. Implementing BoW from Scratch

- Step 1: Tokenization & Lowercase

- Step 2: Build Vocabulary

- Step 3: Create BoW Vectors


In [None]:
import re
from collections import defaultdict

import numpy as np

# Step 1: Tokenization & Lowercase

# Lowercase the text, then extract runs of word characters,
# which drops punctuation such as "," and "."
def tokenize(text):
    return re.findall(r"\w+", text.lower())

tokenized_docs = [tokenize(doc) for doc in corpus]

# Step 2: Build Vocabulary
vocab = set()
# Add every token from every document to the vocabulary set
for doc in tokenized_docs:
    vocab.update(doc)

# Sort the vocabulary so the column order is deterministic
vocab = sorted(vocab)


# Step 3: Create BoW Vectors
bow_vectors = []

for doc in tokenized_docs:
    word_count = defaultdict(int)
    # Count how many times each word occurs in the document
    for word in doc:
        word_count[word] += 1
    # Build the vector in vocabulary order (absent words count as 0)
    vector = [word_count[word] for word in vocab]
    bow_vectors.append(vector)


# Show the vocabulary and feature vectors
df = pd.DataFrame(np.asarray(bow_vectors), columns=vocab)
df


|   | and | because | fun | i | interesting | is | learning | love | machine | nlp | very |
|---|-----|---------|-----|---|-------------|----|----------|------|---------|-----|------|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |


## 5. Binarized Bag of Words
Instead of counting word frequencies, we can use binary indicators (1 if the word appears, 0 otherwise).

In [4]:
# Binarized version
vectorizer_bin = CountVectorizer(binary=True)
X_bin = vectorizer_bin.fit_transform(corpus)
df_bin = pd.DataFrame(X_bin.toarray(), columns=vectorizer_bin.get_feature_names_out())
df_bin

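|   | and | because | fun | interesting | is | learning | love | machine | nlp | very |
|---|-----|---------|-----|-------------|----|----------|------|---------|-----|------|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |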
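
Binarization can also be applied after the fact to an existing count matrix. The sketch below takes the counts produced by the `CountVectorizer` example in Section 3 and maps every positive count to 1. The `binarize` helper is a hypothetical name used only for illustration.

```python
def binarize(count_rows):
    """Map every positive count to 1 and leave zeros unchanged."""
    return [[1 if count > 0 else 0 for count in row] for row in count_rows]

# Count matrix from the CountVectorizer example in Section 3
# columns: and, because, fun, interesting, is, learning, love, machine, nlp, very
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 2, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0],
]

binary = binarize(counts)
for row in binary:
    print(row)
# Only the count of 2 for "nlp" in the first row changes (it becomes 1)
```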


## 6. Advantages and Disadvantages
### Advantages
- Simple to implement and interpret
- Works well with simpler models (e.g., Naive Bayes)

### Disadvantages
- Ignores word order and context
- High dimensionality with large vocabulary
- Doesn’t handle synonyms or polysemy well
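
The synonym problem can be seen in a small sketch: two phrases that mean nearly the same thing but share no words get a cosine similarity of zero under BoW, while any similarity that does appear comes only from literally shared words. The phrases and the `bow` and `cosine` helpers are made up for this illustration.

```python
import math
from collections import Counter

def bow(tokens, vocab):
    """Count vector for one tokenized document, aligned with vocab."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = "great film".split()
doc_b = "excellent movie".split()  # near-synonym of doc_a, no shared words
doc_c = "great movie".split()      # shares one word with each of the others

vocab = sorted(set(doc_a) | set(doc_b) | set(doc_c))
vec_a, vec_b, vec_c = (bow(d, vocab) for d in (doc_a, doc_b, doc_c))

sim_synonyms = cosine(vec_a, vec_b)
sim_overlap = cosine(vec_a, vec_c)

print(sim_synonyms)  # 0.0, because the two documents share no vocabulary entries
print(sim_overlap)   # about 0.5, driven only by the shared word "great"
```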

## 7. References

1. **Jurafsky, D., & Martin, J. H. (2023).**
   *Speech and Language Processing* (3rd ed. draft).
   [https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/)

   * 📌 Chapter 2 and Chapter 4 provide detailed explanations of the Bag of Words model, tokenization, and vector representations of text.

2. **Manning, C. D., Raghavan, P., & Schütze, H. (2008).**
   *Introduction to Information Retrieval*. Cambridge University Press.
   [https://nlp.stanford.edu/IR-book/](https://nlp.stanford.edu/IR-book/)

   * 📌 Chapter 6 ("Scoring, term weighting and the vector space model") and Chapter 2 ("The term vocabulary and postings lists") explain the mathematical foundations and use of BoW in retrieval and classification.

---

### 📄 **Peer-Reviewed Article**

3. **Harris, Z. S. (1954).**
   *Distributional Structure*. Word, 10(2–3), 146–162.
   DOI: [10.1080/00437956.1954.11659520](https://doi.org/10.1080/00437956.1954.11659520)

   * 📌 This foundational work underlies the idea that meaning is derived from word context, which is often cited as the conceptual background of bag-of-words approaches.