# Bag of Words Representation/Model

<b>The Bag of Words (BoW)</b> is a simple and commonly used model in natural language processing (NLP) for representing text as numeric features for machine learning. It converts each document into a vector of word counts over a vocabulary shared by all documents in the set.

Key Concepts of Bag of Words:

- <b>Vocabulary:</b> This refers to the set of unique words that appear in the corpus (collection of documents). Each word in the vocabulary becomes a column in the resulting matrix.

- <b>Document Representation:</b> Each document in the corpus is represented as a vector, where each element corresponds to the frequency of a word from the vocabulary. If a word does not appear in the document, its frequency will be zero.

- <b>Order Ignorance:</b> The Bag of Words model ignores the order of words in the text. It only counts the occurrence of each word without considering the context or sequence in which the words appear.
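
The three ideas above can be sketched in plain Python with `collections.Counter`. This is only a minimal illustration, not how scikit-learn implements the model; the two sentences and the helper name `bow_vector` are made up for the example.

```python
from collections import Counter

# Two toy documents with the same words in a different order (assumed example data)
docs = ["the dog chased the cat", "the cat chased the dog"]

# Vocabulary: the sorted set of unique words across the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocabulary):
    """Return the word counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc.split())
    # A Counter returns 0 for words that do not occur in the document
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['cat', 'chased', 'dog', 'the']
print(vectors)     # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

Both sentences map to the same vector even though they describe opposite events, which is exactly what ignoring word order means.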

In [1]:
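# import required libraries
# (NLTK needs its data downloaded once: nltk.download('stopwords') and
#  nltk.download('punkt'), or 'punkt_tab' on newer NLTK releases)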
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

### Basic Bag of Words Example

In [2]:
documents = ["Gangs of Wasseypur is a great movie.", "The success of a movie depends on the performance of the actors.", "There are no new movies releasing this week."]
print(documents)

['Gangs of Wasseypur is a great movie.', 'The success of a movie depends on the performance of the actors.', 'There are no new movies releasing this week.']


#### Let us define a function for preprocessing each document. The steps are:

- convert the text to lower/upper case (preferred lower case) to make the words case insensitive
- tokenise the text into words
- remove the stop words
- re-join the tokenised words into a sentence

In [3]:
def preprocess(document):

    # convert document text to lower
    document = document.lower()

    # tokenise the text into words
    words = word_tokenize(document)

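    # remove the stop words (build the set once instead of calling
    # stopwords.words() again for every token)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]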

    # rejoin the words to make sentence without stop words
    document = ' '.join(words)

    return document

In [4]:
# Let us pre-process the documents
documents = [preprocess(document) for document in documents]
print(documents)

['gangs wasseypur great movie .', 'success movie depends performance actors .', 'new movies releasing week .']
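
The preprocessed strings still contain a standalone `.` token. That does no harm in the next step, because the default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which keeps only tokens made of two or more word characters. The sketch below applies that same pattern with the standard `re` module to show which tokens survive; it imitates only the default tokenization step, not the full vectorizer.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

preprocessed = "gangs wasseypur great movie ."
tokens = TOKEN_PATTERN.findall(preprocessed)
print(tokens)  # ['gangs', 'wasseypur', 'great', 'movie']

# Single-character tokens are dropped as well as punctuation
sms_tokens = TOKEN_PATTERN.findall("ok lar ... joking wif u oni ...")
print(sms_tokens)  # ['ok', 'lar', 'joking', 'wif', 'oni']
```

Single-letter words such as `u` are therefore silently excluded from the vocabulary unless a different `token_pattern` is passed.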


#### Creating a bag of words with CountVectorizer

In [5]:
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(documents)
print(bow_model)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 13 stored elements and shape (3, 12)>
  Coords	Values
  (0, 2)	1
  (0, 10)	1
  (0, 3)	1
  (0, 4)	1
  (1, 4)	1
  (1, 9)	1
  (1, 1)	1
  (1, 7)	1
  (1, 0)	1
  (2, 6)	1
  (2, 5)	1
  (2, 8)	1
  (2, 11)	1


In [6]:
# convert the sparse matrix to a dense array and print it
print(bow_model.toarray())

[[0 0 1 1 1 0 0 0 0 0 1 0]
 [1 1 0 0 1 0 0 1 0 1 0 0]
 [0 0 0 0 0 1 1 0 1 0 0 1]]
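
The coordinate listing and the dense array describe the same matrix: each `(row, column)  value` line is one non-zero count, and every cell that is not listed is zero. As a rough illustration (plain Python, not SciPy's actual storage format), the dense rows can be rebuilt from the 13 stored entries printed earlier:

```python
# (row, column) -> count pairs copied from the printed sparse matrix
stored = {
    (0, 2): 1, (0, 10): 1, (0, 3): 1, (0, 4): 1,
    (1, 4): 1, (1, 9): 1, (1, 1): 1, (1, 7): 1, (1, 0): 1,
    (2, 6): 1, (2, 5): 1, (2, 8): 1, (2, 11): 1,
}
n_rows, n_cols = 3, 12  # shape reported by the sparse matrix

# Start from all zeros and fill in only the stored entries
dense = [[0] * n_cols for _ in range(n_rows)]
for (row, col), count in stored.items():
    dense[row][col] = count

for row in dense:
    print(row)
```

The three printed rows match the output of `bow_model.toarray()`.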


In [7]:
print(bow_model.shape)

(3, 12)


In [8]:
print(vectorizer.get_feature_names_out())

['actors' 'depends' 'gangs' 'great' 'movie' 'movies' 'new' 'performance'
 'releasing' 'success' 'wasseypur' 'week']
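
The position of a word in this array is its column index in the matrix. To make that link concrete, the sketch below pairs the feature names with the dense rows printed earlier (both copied from the outputs, so nothing is recomputed) and lists the words counted in each document.

```python
# Feature names and dense rows copied from the outputs above
feature_names = ['actors', 'depends', 'gangs', 'great', 'movie', 'movies',
                 'new', 'performance', 'releasing', 'success', 'wasseypur', 'week']
rows = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1],
]

# Column i of every row holds the count of feature_names[i]
word_counts = [
    {name: count for name, count in zip(feature_names, row) if count}
    for row in rows
]

for doc_index, counts in enumerate(word_counts):
    print(doc_index, counts)
```

In a notebook the same labelling is often done with `pd.DataFrame(bow_model.toarray(), columns=vectorizer.get_feature_names_out())`, which shows the counts as a table with one named column per word.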


### Create Bag of Words for Spam-Ham Dataset 

In [9]:
# read the file
spam_ham_text = pd.read_csv('SMSSpamCollection.txt', delimiter='\t', header=None)

In [10]:
spam_ham_text.head()

,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [11]:
spam_ham_text.columns = ['label','message']

In [12]:
spam_ham_text.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [13]:
spam_ham_text.shape

(5572, 2)

In [14]:
spam_ham_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [15]:
# keep only the first 100 records of the data frame
spam_ham_text = spam_ham_text.iloc[0:100,:]

In [16]:
spam_ham_text.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [17]:
messages = spam_ham_text.message

In [18]:
type(messages)

pandas.core.series.Series

In [19]:
# convert the messages to a list for preprocessing
messages = messages.tolist()

In [20]:
type(messages)

list

In [21]:
# Let us pre-process the messages
messages = [preprocess(message) for message in messages]
print(messages)

['go jurong point , crazy .. available bugis n great world la e buffet ... cine got amore wat ...', 'ok lar ... joking wif u oni ...', "free entry 2 wkly comp win fa cup final tkts 21st may 2005. text fa 87121 receive entry question ( std txt rate ) & c 's apply 08452810075over18 's", 'u dun say early hor ... u c already say ...', "nah n't think goes usf , lives around though", "freemsg hey darling 's 3 week 's word back ! 'd like fun still ? tb ok ! xxx std chgs send , £1.50 rcv", 'even brother like speak . treat like aids patent .', "per request 'melle melle ( oru minnaminunginte nurungu vettam ) ' set callertune callers . press * 9 copy friends callertune", 'winner ! ! valued network customer selected receivea £900 prize reward ! claim call 09061701461. claim code kl341 . valid 12 hours .', 'mobile 11 months ? u r entitled update latest colour mobiles camera free ! call mobile update co free 08002986030', "'m gon na home soon n't want talk stuff anymore tonight , k ? 've cried enoug

In [22]:
# vectorize the messages
vectorizer = CountVectorizer()
bow_spam_ham = vectorizer.fit_transform(messages)

In [23]:
print(bow_spam_ham)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 884 stored elements and shape (100, 640)>
  Coords	Values
  (0, 237)	1
  (0, 294)	1
  (0, 419)	1
  (0, 150)	1
  (0, 75)	1
  (0, 93)	1
  (0, 247)	1
  (0, 623)	1
  (0, 302)	1
  (0, 92)	1
  (0, 121)	1
  (0, 243)	1
  (0, 62)	1
  (0, 600)	1
  (1, 389)	1
  (1, 303)	1
  (1, 293)	1
  (1, 611)	1
  (1, 391)	1
  (2, 222)	1
  (2, 196)	2
  (2, 616)	1
  (2, 138)	1
  (2, 612)	1
  (2, 201)	2
  :	:
  (97, 469)	1
  (97, 320)	1
  (97, 321)	1
  (98, 389)	1
  (98, 266)	1
  (98, 634)	1
  (98, 373)	1
  (98, 232)	1
  (98, 85)	1
  (98, 262)	1
  (98, 615)	1
  (98, 265)	1
  (98, 458)	1
  (98, 220)	1
  (98, 252)	1
  (98, 70)	1
  (98, 221)	1
  (98, 484)	1
  (98, 83)	1
  (98, 112)	1
  (98, 424)	1
  (99, 154)	1
  (99, 469)	1
  (99, 130)	1
  (99, 64)	1


In [24]:
print(bow_spam_ham.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
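
The dense view shows almost nothing but zeros, which is why the vectorizer returns a sparse matrix in the first place. A quick calculation with the numbers reported above (884 stored elements, shape 100 x 640) shows how small the non-zero share is:

```python
stored_elements = 884               # non-zero counts reported by the sparse matrix
n_documents, n_features = 100, 640  # reported shape

total_cells = n_documents * n_features
density = stored_elements / total_cells
avg_per_message = stored_elements / n_documents

print(f"{total_cells} cells, {density:.2%} non-zero")
print(f"about {avg_per_message:.1f} distinct vocabulary words per message")
```

Only about 1.4% of the 64,000 cells hold a count, so the dense array spends nearly all of its memory on zeros.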


In [25]:
print(bow_spam_ham.shape)
print(vectorizer.get_feature_names_out())

(100, 640)
['000' '07732584351' '0800' '08000930705' '08002986030'
 '08452810075over18' '09061209465' '09061701461' '09066364589' '10' '100'
 '1000' '10am' '11' '12' '1500' '150p' '150pm' '16' '169' '18' '20' '2005'
 '21st' '2nd' '3aj' '4403ldnw1a7rw18' '450ppw' '4txt' '50' '5000' '5249'
 '530' '5we' '6031' '6days' '81010' '85069' '87077' '87121' '87575' '8am'
 '900' '92h' '9pm' 'abiola' 'abt' 'ac' 'accomodations' 'aco' 'actin'
 'advise' 'aft' 'afternoon' 'ah' 'ahead' 'ahhh' 'aids' 'almost' 'already'
 'alright' 'always' 'amore' 'amp' 'animation' 'another' 'anymore'
 'anything' 'apologetic' 'apply' 'appointment' 'arabian' 'ard' 'around'
 'ask' 'available' 'awarded' 'babe' 'back' 'badly' 'barbie' 'becoz' 'bed'
 'beforehand' 'best' 'bit' 'blessing' 'bonus' 'box' 'breather' 'britney'
 'brother' 'buffet' 'bugis' 'burger' 'burns' 'bus' 'buy' 'bx420' 'ca'
 'call' 'callers' 'callertune' 'calls' 'camcorder' 'came' 'camera' 'car'
 'cash' 'casualty' 'catch' 'caught' 'cause' 'cave' 'chances' 'char

#### The vocabulary contains many near-duplicate tokens, such as 'win' and 'winner', 'reply' and 'replying', and 'want' and 'wanted'.
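
To see why such variants inflate the vocabulary, the toy sketch below strips a few common suffixes before counting unique tokens. The function `crude_stem` is hypothetical and deliberately naive; it is not a real stemming algorithm and is only meant to show the effect of merging word forms. The pair 'movie' and 'movies' is taken from the first example.

```python
def crude_stem(token):
    """Hypothetical, deliberately naive suffix stripper (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Strip the suffix only if at least three characters remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["reply", "replying", "want", "wanted", "movie", "movies", "win", "winner"]
stems = [crude_stem(token) for token in tokens]

print(stems)
print(len(set(tokens)), "unique tokens ->", len(set(stems)), "unique stems")
```

Eight distinct tokens collapse to five. 'win' and 'winner' stay separate here, which is one reason a proper stemmer or lemmatizer, rather than ad hoc suffix rules, is normally used for this step.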