
The idea behind **Bag of Words** is to turn a text into a collection (a “bag”) of words, ignoring both the order in which the words appear and the grammatical structure of the sentences.

In other words, the model does not care whether the sentence is *“The cat sleeps on the bed”* or *“On the bed, the cat sleeps”*; all that matters is which words occur and how often each one appears.
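
As a minimal sketch of this order-invariance, the snippet below counts the words of two orderings of the same sentence and compares the resulting bags. The helper name `bag_of_words` and the simple lowercase/strip-punctuation/split tokenization are assumptions made for illustration, not part of any library.

```python
from collections import Counter
import string


def bag_of_words(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, and count words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


# Two orderings of the same sentence
bag_a = bag_of_words("The cat sleeps on the bed")
bag_b = bag_of_words("On the bed, the cat sleeps")

print(bag_a == bag_b)  # True: word order does not affect the bag
print(bag_a["the"])    # 2: only the frequency of each word is kept
```

Because a `Counter` stores only word-to-count pairs, any reordering of the same words produces an equal bag.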



In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download stopwords and tokenizer if you haven't already
nltk.download("punkt")
nltk.download('punkt_tab')
nltk.download("stopwords")

# Example sentence
sentence = "This is an example showing how to remove stop words from a sentence."

# Tokenize the sentence into words
words = word_tokenize(sentence)

# Get the list of stop words in English
stop_words = set(stopwords.words("english"))

# Remove stop words (punctuation tokens such as "." are not stop words, so they are kept)
filtered_words = [word for word in words if word.lower() not in stop_words]

# Join the remaining words back into a sentence
filtered_sentence = " ".join(filtered_words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
print(filtered_sentence)

example showing remove stop words sentence .
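
Note that the trailing `.` survives in the output above, because the filter only checks tokens against the stop-word list. One possible way to drop punctuation as well is sketched below. To stay self-contained it does not use NLTK: the tiny `STOP_WORDS` set and the regex tokenizer are hypothetical stand-ins chosen for illustration, not the NLTK stop-word list or `word_tokenize`.

```python
import re

# Hand-picked stop words covering only this example sentence (an assumption;
# the NLTK English list is much larger)
STOP_WORDS = {"this", "is", "an", "how", "to", "from", "a"}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Keep alphabetic tokens that are not stop words; punctuation is never matched."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return " ".join(tok for tok in tokens if tok.lower() not in stop_words)


result = remove_stop_words(
    "This is an example showing how to remove stop words from a sentence."
)
print(result)  # example showing remove stop words sentence
```

With the NLTK tokens from the cell above, a similar effect could be obtained by also requiring `word.isalpha()` in the list comprehension.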


## **Doing It by Hand**

## **Step 1:** Preprocessing the Text Data

In [5]:
from collections import defaultdict
import string

# Sample text data: sentences
corpus = [
    "Python is amazing and fun.",
    "Python is not just fun but also powerful.",
    "Learning Python is fun!",
]
# Function to preprocess text
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize: split the text into words
    tokens = text.split()
    return tokens

# Apply preprocessing to the sample corpus
processed_corpus = [preprocess(sentence) for sentence in corpus]
print(processed_corpus)

[['python', 'is', 'amazing', 'and', 'fun'], ['python', 'is', 'not', 'just', 'fun', 'but', 'also', 'powerful'], ['learning', 'python', 'is', 'fun']]


## **Step 2:** Build Vocabulary

In [6]:
# Initialize an empty set for the vocabulary
vocabulary = set()

# Build the vocabulary
for sentence in processed_corpus:
    vocabulary.update(sentence)

# Convert to a sorted list (sorted() accepts any iterable and returns a list)
vocabulary = sorted(vocabulary)
print("Vocabulary:", vocabulary)

Vocabulary: ['also', 'amazing', 'and', 'but', 'fun', 'is', 'just', 'learning', 'not', 'powerful', 'python']


## **Step 3:** Calculate Word Frequencies and Vectorize

In [7]:
def create_bow_vector(sentence, vocab):
    vector = [0] * len(vocab)  # Initialize a vector of zeros
    for word in sentence:
        if word in vocab:
            idx = vocab.index(word)  # Find the index of the word in the vocabulary
            vector[idx] += 1  # Increment the count at that index
    return vector


# Create BoW vector for each sentence in the processed corpus
bow_vectors = [create_bow_vector(sentence, vocabulary) for sentence in processed_corpus]
print("Bag of Words Vectors:")
for vector in bow_vectors:
    print(vector)

Bag of Words Vectors:
[0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]
[1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
[0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
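
`vocab.index(word)` scans the vocabulary list on every lookup, which is fine for eleven words but grows costly for large vocabularies. A possible variant, sketched below, precomputes a word-to-index dictionary once. The names `build_index` and `create_bow_vector_fast` are hypothetical helpers introduced here for illustration.

```python
def build_index(vocab):
    """Map each vocabulary word to its column position."""
    return {word: i for i, word in enumerate(vocab)}


def create_bow_vector_fast(tokens, word_to_index):
    """Count tokens using dictionary lookups; out-of-vocabulary words are ignored."""
    vector = [0] * len(word_to_index)
    for word in tokens:
        idx = word_to_index.get(word)
        if idx is not None:
            vector[idx] += 1
    return vector


vocabulary = ['also', 'amazing', 'and', 'but', 'fun', 'is', 'just',
              'learning', 'not', 'powerful', 'python']
word_to_index = build_index(vocabulary)

# Third sentence of the corpus: same vector as the list-based version
print(create_bow_vector_fast(['learning', 'python', 'is', 'fun'], word_to_index))
# [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]

# A new sentence: 'fast' is not in the vocabulary and is skipped, 'python' counts twice
print(create_bow_vector_fast(['python', 'is', 'fast', 'python'], word_to_index))
# [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]
```

The second call also shows what happens to unseen words: with a fixed vocabulary they simply contribute nothing to the vector.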


## Using Scikit-learn’s CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Original corpus
corpus = [
    "Python is amazing and fun.",
    "Python is not just fun but also powerful.",
    "Learning Python is fun!",
]
# CountVectorizer turns the text into a matrix of word counts.
vectorizer = CountVectorizer()

# fit_transform() learns the vocabulary from the corpus and converts the texts
# into a sparse matrix (a matrix in which most entries are zero), where each
# row represents a document and each column represents a unique word in the
# vocabulary.
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() returns the unique words extracted from the corpus.
print("Vocabulary:", vectorizer.get_feature_names_out())

# X.toarray() converts the sparse matrix to a dense NumPy array so it can be printed like an ordinary matrix.
print("BoW Representation:")
print(X.toarray())

Vocabulary: ['also' 'amazing' 'and' 'but' 'fun' 'is' 'just' 'learning' 'not'
 'powerful' 'python']
BoW Representation:
[[0 1 1 0 1 1 0 0 0 0 1]
 [1 0 0 1 1 1 1 0 1 1 1]
 [0 0 0 0 1 1 0 1 0 0 1]]


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the vocabulary as an array of feature names (in the column order of the matrix)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)

Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay'
 'terrible' 'the' 'was']
Document-Term Matrix:
 [[0 1 0 0 1 1 1 0 0 0 1 1]
 [1 0 1 0 0 0 1 1 1 0 1 1]
 [0 0 0 1 1 0 1 0 0 1 1 1]]
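
The word "I" appears in two of the reviews but not in the vocabulary above. This is consistent with `CountVectorizer`'s default settings, which lowercase the text and keep only tokens of two or more word characters (default `token_pattern` of `(?u)\b\w\w+\b`). The sketch below reproduces that tokenization with the standard `re` module only; it is an approximation for illustration, not scikit-learn's actual implementation.

```python
import re

# Same pattern as CountVectorizer's default token_pattern:
# a token is a run of at least two word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


reviews = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

vocab = sorted({token for review in reviews for token in tokenize(review)})
print(vocab)
print("i" in vocab)  # False: single-character tokens are dropped
```

If single-character words matter for a task, `CountVectorizer` accepts a custom `token_pattern` argument.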
