### Bag of Words

The Bag of Words model is a way to convert text into numerical features. The basic idea is to represent a text document as a collection of its words, disregarding grammar and word order but keeping multiplicity. Here are the steps to build and use a BoW model:

1. **Tokenization**: Split the text into words (tokens).
2. **Vocabulary Creation**: Create a list of unique words (vocabulary) from all documents.
3. **Vectorization**: Create a vector for each document in which each element corresponds to a word in the vocabulary. The value is typically the number of times that word occurs in the document.
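
The three steps can be sketched with the standard library alone. This is a minimal illustration, not the NLTK pipeline built in the cells that follow: tokenization here is a plain whitespace split, and the names `docs`, `tokenized`, `vocab`, and `bow_vectors` are made up for the example.

```python
from collections import Counter

# Toy documents (illustrative only)
docs = ["the cat sat on the mat", "the dog sat"]

# 1. Tokenization: naive lowercase + whitespace split
tokenized = [doc.lower().split() for doc in docs]

# 2. Vocabulary creation: unique words, sorted for a stable column order
vocab = sorted({word for tokens in tokenized for word in tokens})

# 3. Vectorization: one count per vocabulary word, per document
bow_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_vectors.append([counts[word] for word in vocab])

print(vocab)        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Note the `2` in the first vector: "the" occurs twice in the first document, which is the multiplicity that BoW keeps while discarding word order.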


In [24]:
# Import necessary libraries
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# Download the stop word list used during preprocessing
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Step 1: Preprocessing

Before tokenization, it is often useful to preprocess the text to improve the quality of the data.

   1. Lowercasing: Convert all characters to lowercase so that "Bag" and "bag" count as the same word.
   2. Removing Punctuation: Punctuation marks generally carry no useful signal for BoW.
   3. Removing Stop Words: Drop common words such as "and", "the", and "is", which contribute little to the meaning of the document.
   4. Stemming/Lemmatization: Reduce words to their root form, for example "words" to "word".
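
To see what the first three steps do in isolation, here is a small sketch that uses only the standard library. It makes two assumptions that differ from the NLTK-based cells that follow: `TOY_STOP_WORDS` is a hypothetical, deliberately tiny stop word list, and tokenization is a plain whitespace split. Stemming is left out because it needs a real stemmer such as NLTK's `PorterStemmer`.

```python
import string

# Hypothetical, deliberately tiny stop word list (NLTK's English list is much longer)
TOY_STOP_WORDS = {"a", "is", "of", "the", "this", "and"}

def toy_preprocess(text):
    # 1. Lowercasing
    text = text.lower()
    # 2. Removing punctuation: translate() deletes every character in string.punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Naive whitespace tokenization stands in for a real tokenizer
    tokens = text.split()
    # 3. Removing stop words
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

print(toy_preprocess("Hello world! This is a Bag of Words example."))
# ['hello', 'world', 'bag', 'words', 'example']
```

Without stemming, "words" and "example" stay in their surface form; a stemmer would reduce them further.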


In [25]:
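# Initialize the stop word set, the stemmer, and the lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # not used below; preprocess() stems instead

# Download the tokenizer data required by nltk.word_tokenize
# (stopwords was already downloaded in the first cell).
# Recent NLTK releases may additionally need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')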

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
len(stop_words)

179

In [27]:


# stop_words and stemmer were initialized in the previous cell

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and stem."""
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)

    # Tokenize text
    tokens = nltk.word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Stem the tokens
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens

text = "Hello world! This is a Bag of Words example./////.........#########"
print(preprocess(text))


['hello', 'world', 'bag', 'word', 'exampl']


### Step 2: Vocabulary Creation


In [28]:
corpus = [
    "Hello world! This is a Bag of Words example.",
    "Bag of Words is a simple model.",
    "We are learning Bag of Words."
]

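# Preprocess every document, then collect the unique tokens
processed_corpus = [preprocess(text) for text in corpus]
vocabulary = set()
for tokens in processed_corpus:
    vocabulary.update(tokens)

# A set has no guaranteed order, so the printed order can differ between runs
print(vocabulary)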

{'hello', 'model', 'simpl', 'world', 'exampl', 'word', 'learn', 'bag'}
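
Because a Python `set` is unordered, the column order of any vectors built from it can change from one run to the next. One common way around this, sketched here as an assumption rather than as part of the original notebook, is to sort the vocabulary and keep a word-to-index mapping. The token lists are hard-coded to match the stems in the output shown here; `sorted_vocab` and `word_to_index` are illustrative names.

```python
# Token lists consistent with the preprocessing output for the three-document corpus
processed = [
    ["hello", "world", "bag", "word", "exampl"],
    ["bag", "word", "simpl", "model"],
    ["learn", "bag", "word"],
]

# Sorting turns the unordered set into a list with a fixed, reproducible order
sorted_vocab = sorted(set().union(*processed))

# Map each word to its column index
word_to_index = {word: i for i, word in enumerate(sorted_vocab)}

print(sorted_vocab)
# ['bag', 'exampl', 'hello', 'learn', 'model', 'simpl', 'word', 'world']
print(word_to_index["bag"], word_to_index["world"])  # 0 7
```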


### Step 3: Vectorization

In [29]:

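def vectorize(tokens, vocabulary):
    # Count how often each token occurs in the document
    token_count = Counter(tokens)
    # Emit one count per vocabulary word; Counter returns 0 for absent words
    vector = [token_count[word] for word in vocabulary]
    return vector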

vectors = [vectorize(doc, vocabulary) for doc in processed_corpus]
vectors


[[1, 0, 0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 1, 1, 1]]
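
Two properties of this kind of vectorization are easy to miss: repeated words raise the count, and words outside the vocabulary are silently dropped. The sketch below illustrates both with a hypothetical new document. It assumes a sorted vocabulary so the column order is fixed; `vectorize_ordered`, `ordered_vocab`, and `new_doc` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Vocabulary in a fixed (sorted) order; the words are the stems from the small corpus
ordered_vocab = ["bag", "exampl", "hello", "learn", "model", "simpl", "word", "world"]

def vectorize_ordered(tokens, vocab):
    # Counter returns 0 for words that never occur in the document
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Hypothetical new document: "bag" appears twice, "neural" is not in the vocabulary
new_doc = ["bag", "neural", "bag", "model"]
new_vector = vectorize_ordered(new_doc, ordered_vocab)
print(new_vector)  # [2, 0, 0, 0, 1, 0, 0, 0]
```

The vector sums to 3 although the document has 4 tokens, because "neural" has no column to land in.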

### Scikit-Learn for Bag of Words

In [30]:
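# CountVectorizer tokenizes, builds the vocabulary, and counts in one step
vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the raw corpus and returns a sparse count matrix
vectors = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)               # word -> column index
print(vectorizer.get_feature_names_out())   # words in column order
print(vectors.toarray())                    # dense view of the counts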

{'hello': 3, 'world': 12, 'this': 9, 'is': 4, 'bag': 1, 'of': 7, 'words': 11, 'example': 2, 'simple': 8, 'model': 6, 'we': 10, 'are': 0, 'learning': 5}
['are' 'bag' 'example' 'hello' 'is' 'learning' 'model' 'of' 'simple'
 'this' 'we' 'words' 'world']
[[0 1 1 1 1 0 0 1 0 1 0 1 1]
 [0 1 0 0 1 0 1 1 1 0 0 1 0]
 [1 1 0 0 0 1 0 1 0 0 1 1 0]]
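
The scikit-learn result has 13 features rather than 8 because `CountVectorizer` with default settings lowercases the text but applies neither stop word removal nor stemming, and its default `token_pattern` only keeps tokens of two or more word characters, which is why "a" is missing. The sketch below approximates that default tokenization with the standard `re` module; `default_like_tokenize` is an illustrative name, and this is a rough reproduction rather than scikit-learn's actual code path.

```python
import re

corpus = [
    "Hello world! This is a Bag of Words example.",
    "Bag of Words is a simple model.",
    "We are learning Bag of Words.",
]

# Default token_pattern documented for CountVectorizer: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokenize(text):
    # Lowercase first, then extract tokens with the regex
    return TOKEN_PATTERN.findall(text.lower())

features = sorted({tok for doc in corpus for tok in default_like_tokenize(doc)})
print(features)
print(len(features))  # 13
```

Passing `stop_words='english'` to `CountVectorizer` would bring its output closer to the hand-written pipeline, though the stop word list differs from NLTK's and stemming would still need a custom tokenizer or analyzer.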


### Putting It All Together on a Real Dataset

In [32]:
# Mount Google Drive so the dataset is accessible from Colab

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [34]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Deep_learning_model/datasets/quora.csv')

In [38]:
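# Combine both question columns into one text per row.
# Note: there is no separator here, so the last word of question1 can fuse
# with the first word of question2; df.question1 + ' ' + df.question2 avoids that.
questions = df.question1 + df.question2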

In [44]:
# Check the dtype of the combined column ('O' means generic Python objects)

questions.dtype


dtype('O')

In [49]:
# Keep the first 5000 rows and cast them to str
# (missing values become the literal string 'nan')

questions = questions[:5000].astype(str)


In [50]:
# Preprocess the questions
processed_questions = [preprocess(question) for question in questions]

# Create a vocabulary
vocabulary = set()
for tokens in processed_questions:
    vocabulary.update(tokens)

# Vectorize the questions
vectors = [vectorize(doc, vocabulary) for doc in processed_questions]

# Convert the vectors to a dataframe (columns are unlabeled integer positions)
df_vectors = pd.DataFrame(vectors)

# Print the first five rows
print(df_vectors.head())


   0     1     2     3     4     5     6     7     8     9     ...  2402  \
0     0     0     0     0     0     0     0     0     0     0  ...     0   
1     0     0     0     0     0     0     0     0     0     0  ...     0   
2     0     0     0     0     0     0     0     0     0     0  ...     0   
3     0     0     0     0     0     0     0     0     0     0  ...     0   
4     0     0     0     0     0     0     0     0     0     0  ...     0   

   2403  2404  2405  2406  2407  2408  2409  2410  2411  
0     0     0     0     0     0     0     0     0     0  
1     0     0     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  
3     0     0     0     0     0     0     0     0     0  
4     0     0     0     0     0     0     0     0     0  

[5 rows x 2412 columns]
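
Every visible cell in this preview is zero, which is typical for BoW: each document uses only a handful of the vocabulary's words. This is also why scikit-learn's `CountVectorizer` returns a sparse matrix that has to be converted with `.toarray()` for display. The idea behind a sparse representation can be sketched with plain dictionaries. The example below reuses the small three-document corpus rather than the Quora data, and `sparse_rows`, `index`, and the other names are illustrative assumptions.

```python
from collections import Counter

# Token lists consistent with the small three-document corpus (illustrative)
docs_tokens = [
    ["hello", "world", "bag", "word", "exampl"],
    ["bag", "word", "simpl", "model"],
    ["learn", "bag", "word"],
]

vocab = sorted(set().union(*docs_tokens))
index = {word: i for i, word in enumerate(vocab)}

# Sparse form: per document, store only {column_index: count} for words that occur
sparse_rows = [
    {index[word]: n for word, n in Counter(tokens).items()}
    for tokens in docs_tokens
]

dense_cells = len(docs_tokens) * len(vocab)
stored_cells = sum(len(row) for row in sparse_rows)
print(sparse_rows[2])
print(f"{stored_cells} stored values instead of {dense_cells}")
```

On three short documents the saving is modest, but with thousands of rows and columns the dense table is dominated by zeros that never need to be stored.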
