__Bag of words__

Bag of Words (BoW) is a simple and widely used technique in Natural Language Processing (NLP) for representing text as numerical data. It extracts features from text by turning each document into a set of word counts, a format that machine learning algorithms can work with directly.

The Bag of Words model represents text as an unordered collection of words: grammar and word order are ignored, but the number of times each word appears is kept. It builds a vocabulary of the unique words in the entire corpus and generates a document-term matrix with one row per document and one column per vocabulary word, where each cell holds the frequency of that word in that document.
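As a rough illustration of the vocabulary and document-term matrix described above, the following minimal sketch builds both using only the standard library. The helper names `tokenize` and `document_term_matrix` and the three-sentence corpus are made up for this example, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only
    return text.lower().split()


def document_term_matrix(corpus):
    # Count the words of each document separately
    counts = [Counter(tokenize(doc)) for doc in corpus]

    # The vocabulary is the sorted set of unique words across the whole corpus
    vocabulary = sorted(set().union(*counts))

    # One row per document, one column per vocabulary word;
    # a Counter returns 0 for words the document does not contain
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical toy corpus; the first two documents differ only in word order
corpus = [
    "the dog chased the cat",
    "the cat chased the dog",
    "the dog slept",
]

vocabulary, matrix = document_term_matrix(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

The first two documents produce identical rows even though they mean different things, which is what ignoring word order implies in practice.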

The Bag of Words model has many applications, including sentiment analysis, topic modeling, and text classification. It is often used as a baseline model in NLP because of its simplicity and ease of implementation.

In [9]:
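import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

# Download the tokenizer and stop word data (only needed once;
# newer NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)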

# Sample text
text = "The quick brown fox jumps over the lazy dog"

# Tokenize the text into words
words = word_tokenize(text.lower())

# Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# Create bag of words
bow = Counter(words)

# Print bag of words
print(bow)

Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1})


In [10]:
import spacy

# Load the small English NLP model
nlp = spacy.load('en_core_web_sm')

# Define a sample text to process
text = "The quick brown fox jumped over the lazy dog. The dog slept over the verandah."

# Process the text using the Spacy model
doc = nlp(text)

# Define a list of stop words to exclude from the bag of words
stop_words = ['a', 'an', 'the', 'over']

# Create a dictionary to store the bag of words
bag_of_words = {}

# Loop through the tokens in the document
for token in doc:

    # Convert the token to lower case once, for both the check and the count
    word = token.text.lower()

    # Keep only alphabetic tokens that are not stop words
    if token.is_alpha and word not in stop_words:

        # Start from 0 if the word is new, then increment its count
        bag_of_words[word] = bag_of_words.get(word, 0) + 1

# Print the bag of words dictionary
print(bag_of_words)


{'quick': 1, 'brown': 1, 'fox': 1, 'jumped': 1, 'lazy': 1, 'dog': 2, 'slept': 1, 'verandah': 1}
