#### The Bag of Words (BoW) model is a simple and commonly used technique for representing text in Natural Language Processing (NLP). It represents each document by the words it contains and how often each one occurs, disregarding word order. Below is a from-scratch implementation of the BoW model.
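
As a quick illustration of the order-insensitivity described above, here is a minimal sketch that uses `collections.Counter` from the standard library rather than the class developed below. The two sentences are made up for this example.

```python
from collections import Counter

# Two hypothetical sentences containing the same words in a different order.
first = "the dog chased the cat"
second = "the cat chased the dog"

# Counting whitespace-separated tokens discards word order entirely.
counts_first = Counter(first.split())
counts_second = Counter(second.split())

print(counts_first["the"])             # 2
print(counts_first == counts_second)   # True: both sentences map to the same bag
```

Because only the counts are kept, the two sentences become indistinguishable, which is the trade-off the model makes for simplicity.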

In [1]:
class BagOfWords:
    def __init__(self):
        """
        Initialize the Bag of Words model.
        """
        self.vocabulary = {}  # Dictionary to store the word-to-index mapping.
        self.word_count = {}  # Dictionary of per-word counts; fit() sets every entry to 0.

    def fit(self, documents):
        """
        Build the vocabulary from the provided documents and reset all word counts to zero.
        
        Parameters:
        documents (list of str): List of documents (strings) to build the vocabulary.
        """
        word_set = set()  # Set to hold unique words.

        # Iterate over each document
        for doc in documents:
            words = doc.split()  # Split document into words based on whitespace.
            word_set.update(words)  # Update the set with words from the current document.

        # Create a vocabulary mapping each unique word to an index
        self.vocabulary = {word: idx for idx, word in enumerate(word_set)}

        # Initialize the word count dictionary
        self.word_count = {word: 0 for word in self.vocabulary}

    def transform(self, documents):
        """
        Convert documents into a Bag of Words representation.
        
        Parameters:
        documents (list of str): List of documents (strings) to convert.
        
        Returns:
        list of lists: Each inner list is a vector representing a document.
        """
        rows = []  # List to store the BoW representation of each document.

        # Iterate over each document
        for doc in documents:
            row = [0] * len(self.vocabulary)  # Initialize a zero vector for the current document.
            words = doc.split()  # Split the document into words.
            
            # Count the frequency of each word in the document
            for word in words:
                if word in self.vocabulary:
                    index = self.vocabulary[word]  # Get the index of the word from the vocabulary.
                    row[index] += 1  # Increment the count for this word in the document's vector.
            
            rows.append(row)  # Append the vector representation of the document to the rows list.

        return rows

    def fit_transform(self, documents):
        """
        Fit the model and transform the documents in one step.
        
        Parameters:
        documents (list of str): List of documents (strings) to fit and transform.
        
        Returns:
        list of lists: Each inner list is a vector representing a document.
        """
        self.fit(documents)  # Build the vocabulary and word count from the documents.
        return self.transform(documents)  # Convert the documents into BoW representation.

In [2]:
# Example usage
if __name__ == "__main__":
    # Sample documents
    documents = [
        "the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food"
    ]

    # Initialize the Bag of Words model
    bow = BagOfWords()
    
    # Fit the model and transform the documents
    bag_of_words = bow.fit_transform(documents)
    
    # Print the Bag of Words representation
    print("Vocabulary:", bow.vocabulary)
    print("Bag of Words Representation:")
    for i, vector in enumerate(bag_of_words):
        print(f"Document {i + 1}: {vector}")

Vocabulary: {'cat': 0, 'my': 1, 'the': 2, 'on': 3, 'sat': 4, 'food': 5, 'homework': 6, 'mat': 7, 'dog': 8, 'ate': 9}
Bag of Words Representation:
Document 1: [1, 0, 2, 1, 1, 0, 0, 1, 0, 0]
Document 2: [0, 1, 1, 0, 0, 0, 1, 0, 1, 1]
Document 3: [1, 0, 2, 0, 0, 1, 0, 0, 1, 1]
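
The indices in the output above come from iterating over a Python `set`, and the iteration order of a set of strings is not guaranteed to be the same from one interpreter run to the next, so rerunning the cell may print different indices. If reproducible indices matter, one option is to sort the words before numbering them. The sketch below assumes that approach; `build_sorted_vocabulary` is a hypothetical helper, not part of the class above.

```python
def build_sorted_vocabulary(documents):
    """Map each unique whitespace-separated word to an index in sorted order."""
    unique_words = set()
    for doc in documents:
        unique_words.update(doc.split())
    # Sorting makes the word-to-index mapping independent of set iteration order.
    return {word: idx for idx, word in enumerate(sorted(unique_words))}


documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the dog food",
]
vocabulary = build_sorted_vocabulary(documents)
print(vocabulary)
```

With this change the same corpus always yields the same column order, which makes vectors comparable across runs.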


In [4]:
# Additional example usage
if __name__ == "__main__":
    # Sample documents
    documents = [
        "I love programming in Python",
        "Machine learning is fun",
        "Python is a versatile language",
        "Learning new skills is always beneficial"
    ]

    # Initialize the Bag of Words model
    bow = BagOfWords()
    
    # Fit the model and transform the documents
    bag_of_words = bow.fit_transform(documents)
    
    # Print the Bag of Words representation
    print("Vocabulary:", bow.vocabulary)
    print("Bag of Words Representation:")
    for i, vector in enumerate(bag_of_words):
        print(f"Document {i + 1}: {vector}")

    # More example documents with mixed content
    more_documents = [
        "the quick brown fox jumps over the lazy dog",
        "a journey of a thousand miles begins with a single step",
        "to be or not to be that is the question",
        "the rain in Spain stays mainly in the plain",
        "all human beings are born free and equal in dignity and rights"
    ]

    # Fit the model and transform the new set of documents
    bow_more = BagOfWords()
    bag_of_words_more = bow_more.fit_transform(more_documents)
    
    # Print the Bag of Words representation for the new documents
    print("\nVocabulary for new documents:", bow_more.vocabulary)
    print("Bag of Words Representation for new documents:")
    for i, vector in enumerate(bag_of_words_more):
        print(f"Document {i + 1}: {vector}")

Vocabulary: {'love': 0, 'learning': 1, 'Learning': 2, 'a': 3, 'beneficial': 4, 'new': 5, 'Machine': 6, 'fun': 7, 'always': 8, 'I': 9, 'language': 10, 'Python': 11, 'programming': 12, 'skills': 13, 'versatile': 14, 'is': 15, 'in': 16}
Bag of Words Representation:
Document 1: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
Document 2: [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
Document 3: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0]
Document 4: [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]

Vocabulary for new documents: {'fox': 0, 'human': 1, 'and': 2, 'Spain': 3, 'all': 4, 'born': 5, 'over': 6, 'rain': 7, 'lazy': 8, 'dignity': 9, 'stays': 10, 'dog': 11, 'are': 12, 'jumps': 13, 'in': 14, 'that': 15, 'rights': 16, 'is': 17, 'begins': 18, 'to': 19, 'quick': 20, 'single': 21, 'a': 22, 'equal': 23, 'with': 24, 'mainly': 25, 'question': 26, 'not': 27, 'be': 28, 'miles': 29, 'brown': 30, 'thousand': 31, 'or': 32, 'beings': 33, 'free': 34, 'journey': 35, 'the': 36, 'st
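
The first vocabulary printed above contains both `'learning'` and `'Learning'`, because `split()` keeps the original casing and the two strings are treated as different words. A minimal sketch of one possible fix, lowercasing before splitting, is shown below. The `tokenize` helper is a hypothetical addition, not part of the class above.

```python
def tokenize(document):
    """Lowercase a document and split it on whitespace."""
    return document.lower().split()


documents = [
    "Machine learning is fun",
    "Learning new skills is always beneficial",
]

# Collect the unique tokens after normalization.
unique_tokens = set()
for doc in documents:
    unique_tokens.update(tokenize(doc))

print("learning" in unique_tokens)   # True
print("Learning" in unique_tokens)   # False: the capitalized form was folded in
```

Using the same helper in both `fit` and `transform` would keep the two steps consistent; whether case folding is desirable depends on the task.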

### Explanation

1. **Initialization (`__init__` Method):**
   - `self.vocabulary` is a dictionary to store the mapping of each unique word to an index.
   - `self.word_count` is a dictionary intended to hold the frequency of each word in the documents. In this implementation, `fit` sets every entry to zero and nothing updates it afterwards.

2. **Fitting (`fit` Method):**
   - `word_set` is used to gather all unique words from the documents.
   - Split each document on whitespace and add its words to `word_set`.
   - Create the vocabulary dictionary, assigning each unique word an index.
   - Initialize `self.word_count` with a zero entry for every vocabulary word.

3. **Transforming (`transform` Method):**
   - Convert each document into a count vector with one position per vocabulary word.
   - For each word in a document that appears in the vocabulary, increment the position given by the word's index. Words that were not seen during `fit` are ignored.

4. **Fit and Transform (`fit_transform` Method):**
   - Calls `fit` and then `transform` on the same documents in one step.

5. **Example Usage:**
   - Create a list of sample documents.
   - Initialize the `BagOfWords` object.
   - Fit the model and transform the documents.
   - Print the vocabulary and the Bag of Words representation for each document.
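
As written, `fit` only sets every entry of `self.word_count` to zero. If corpus-wide frequencies are wanted, they could be accumulated while the documents are scanned. The sketch below assumes `collections.Counter` is acceptable for this; `count_words` is a hypothetical function, not part of the class above.

```python
from collections import Counter


def count_words(documents):
    """Return the total number of occurrences of each word across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.split())  # Counter.update adds to existing counts.
    return dict(totals)


documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the dog food",
]
word_count = count_words(documents)
print(word_count["the"])  # 5
print(word_count["cat"])  # 2
```

The same loop could live inside `fit`, replacing the zero-initialized dictionary with the accumulated totals.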

This implementation demonstrates the basic concept of the Bag of Words model. It converts text documents into numerical vectors based on word frequency, which can be used as input features for machine learning models.
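
As one small illustration of using these vectors downstream, the sketch below computes the cosine similarity between the count vectors printed for the first example. Cosine similarity is chosen here purely for illustration; the text above does not prescribe any particular model or metric, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_u * norm_v)


# Count vectors copied from the first example's printed output.
doc1 = [1, 0, 2, 1, 1, 0, 0, 1, 0, 0]  # "the cat sat on the mat"
doc2 = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1]  # "the dog ate my homework"
doc3 = [1, 0, 2, 0, 0, 1, 0, 0, 1, 1]  # "the cat ate the dog food"

print(round(cosine_similarity(doc1, doc3), 3))  # 0.625
print(round(cosine_similarity(doc1, doc2), 3))  # 0.316
```

The first and third documents score higher because they share "cat" and two occurrences of "the", while the first and second share only a single "the".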