## Representing text data as a Bag of Words

Computing the bag-of-words representation for a corpus of documents consists of the following three steps:
<li><i>Tokenization.</i> Split each document into the words that appear in it (called <i>tokens</i>), for example by splitting it on whitespace and punctuation.</li>
<li><i>Vocabulary building.</i> Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).</li>
<li><i>Encoding.</i> For each document, count how often each of the words in the vocabulary appears in this document.</li>
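
The three steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements them: the helper names `tokenize`, `build_vocabulary`, and `encode` and the toy corpus are hypothetical, and the regular expression (lowercase the text, keep runs of two or more word characters) is chosen to resemble the default behaviour of `CountVectorizer`.

```python
import re
from collections import Counter


def tokenize(document):
    # Step 1: lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", document.lower())


def build_vocabulary(documents):
    # Step 2: number every distinct token in alphabetical order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(document, vocabulary):
    # Step 3: count how often each vocabulary word occurs in the document.
    counts = Counter(tokenize(document))
    return [counts[word] for word in sorted(vocabulary, key=vocabulary.get)]


# Hypothetical toy corpus, used only for this illustration.
toy_docs = ["The cat sat on the mat.", "The dog saw the cat."]
vocabulary = build_vocabulary(toy_docs)
vectors = [encode(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
print(vectors)
```

Each document becomes a vector with one entry per vocabulary word, so all vectors have the same length regardless of how long the documents are.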

In [1]:
bards_words = ["In language processing, the vectors x are derived from textual data,", 
               "in order to reflect various linguistic properties of the text."]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# fit() tokenizes the documents and builds the vocabulary
vect = CountVectorizer()
vect.fit(bards_words)

CountVectorizer()

In [3]:
print(f"Vocabulary size: {len(vect.vocabulary_)}")
print(f"Vocabulary content: \n {vect.vocabulary_}")

Vocabulary size: 18
Vocabulary content: 
 {'in': 4, 'language': 5, 'processing': 9, 'the': 14, 'vectors': 17, 'are': 0, 'derived': 2, 'from': 3, 'textual': 13, 'data': 1, 'order': 8, 'to': 15, 'reflect': 11, 'various': 16, 'linguistic': 6, 'properties': 10, 'of': 7, 'text': 12}


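The vocabulary consists of 18 words, numbered in alphabetical order from "are" (index 0) to "vectors" (index 17). The dictionary itself is printed in the order in which the words were first encountered, which is why it starts with "in" and ends with "text". Note that the single-character token "x" from the first sentence is missing: by default, `CountVectorizer` lowercases the text and keeps only tokens of two or more characters.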
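
To list the words in index order, the mapping can be sorted by its values. The sketch below uses a literal copy of the dictionary printed above instead of the fitted `vect`, so it runs on its own; the variable names are illustrative.

```python
# Literal copy of the mapping printed from vect.vocabulary_ above.
vocabulary = {
    'in': 4, 'language': 5, 'processing': 9, 'the': 14, 'vectors': 17,
    'are': 0, 'derived': 2, 'from': 3, 'textual': 13, 'data': 1,
    'order': 8, 'to': 15, 'reflect': 11, 'various': 16, 'linguistic': 6,
    'properties': 10, 'of': 7, 'text': 12,
}

# Sort the words by their assigned index.
words_by_index = sorted(vocabulary, key=vocabulary.get)

print(words_by_index[0], words_by_index[-1])
# The index order coincides with plain alphabetical order.
print(words_by_index == sorted(vocabulary))
```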

In [4]:
# transform() encodes each document as a row of word counts
bag_of_words = vect.transform(bards_words)
print(f"Bag-of-words: {repr(bag_of_words)}")

Bag-of-words: <2x18 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>
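
The result is a sparse matrix: only the nonzero counts are stored. As a rough illustration of the Compressed Sparse Row idea (a hand-written sketch on a hypothetical toy matrix, not SciPy's implementation; the function name `to_csr` is an assumption), the nonzero values are kept together with their column indices and a pointer marking where each row starts.

```python
def to_csr(dense_rows):
    # data: nonzero values, indices: their column positions,
    # indptr: offsets into data/indices at which each row begins.
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


# Hypothetical 2x4 matrix with three nonzero entries.
dense = [
    [1, 0, 0, 2],
    [0, 0, 3, 0],
]
data, indices, indptr = to_csr(dense)

print(data)     # [1, 2, 3]
print(indices)  # [0, 3, 2]
print(indptr)   # [0, 2, 3]
```

The entries of row `i` are `data[indptr[i]:indptr[i + 1]]`, so the zeros never need to be stored.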


In [5]:
print(f"Dense representation of bag_of_words:\n {bag_of_words.toarray()}")

Dense representation of bag_of_words:
 [[1 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 1]
 [0 0 0 0 1 0 1 1 1 0 1 1 1 0 1 1 1 0]]
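
Each row can be read back as word counts by pairing it with the vocabulary in index order. The sketch below hard-codes the vocabulary and the two rows printed above so that it runs without scikit-learn; `words_in_row` is a hypothetical helper.

```python
# Vocabulary in index order, copied from the output above.
feature_names = [
    'are', 'data', 'derived', 'from', 'in', 'language', 'linguistic', 'of',
    'order', 'processing', 'properties', 'reflect', 'text', 'textual',
    'the', 'to', 'various', 'vectors',
]

# The two rows of the dense bag-of-words matrix printed above.
rows = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0],
]


def words_in_row(row, names):
    # Keep only the words that occur at least once in this document.
    return {name: count for name, count in zip(names, row) if count > 0}


decoded = [words_in_row(row, feature_names) for row in rows]
for doc_counts in decoded:
    print(doc_counts)

# Words that occur in both documents.
shared = set(decoded[0]) & set(decoded[1])
print(sorted(shared))
```

Word order within each sentence is lost in this representation, which is where the name "bag" of words comes from.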
