# Notes on Text Pre-processing for Machine Learning Algorithms

Raw text cannot be fed directly to most text-processing or NLP algorithms. It needs to be preprocessed into a numeric form that the algorithms can work with. The Bag of Words (BoW) model is a common way to do this in machine learning. BoW focuses on the occurrence of words rather than their order. Each word in a known vocabulary is assigned a unique number, and any text can then be encoded as a fixed-size vector with one position per vocabulary word. For example:

    Vocab -           am, new, boy, a, and....
    Assigned Number - 1,  2,   3,  4,   5, ..........
    Initial Vector  { 0 , 0 , 0 , 0 , 0 , 0.....}

    Now the text - I am a boy and I am new
    In this text, 'am' occurs twice, so the first position of the vector (the one assigned to 'am') is incremented to 2. 'new', 'boy', 'a' and 'and' each occur once, so their positions are set to 1. ('I' also occurs twice, but it is not among the vocabulary words shown here.)
             am  new  boy  a  and  ....
    Vector {  2,  1,    1,  1,  1 }
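
The counting above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the helper name `bag_of_words` is hypothetical, and splitting on whitespace is an assumed, simplified tokenizer.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Return a count vector for `text` with one position per word in `vocab`."""
    # Lowercase and split on whitespace (a real tokenizer would also handle punctuation).
    counts = Counter(text.lower().split())
    # Words that are not in the vocabulary are simply ignored.
    return [counts[word] for word in vocab]

vocab = ["am", "new", "boy", "a", "and"]
vector = bag_of_words("I am a boy and I am new", vocab)
print(vector)  # [2, 1, 1, 1, 1]
```

The vector always has the same length as the vocabulary, whatever the length of the text.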

In practice, the vocabulary is very large and any single document uses only a small part of it, so most entries of the vector are 0. The scikit-learn library provides 3 different schemes that we can use.

# 1) Word Counts with CountVectorizer

### Convert a collection of text documents to a matrix of token counts

You can use it as follows:

    1) Create an instance of the CountVectorizer class.
    2) Call the fit() function in order to learn a vocabulary from one or more documents.
    3) Call the transform() function on one or more documents as needed to encode each as a vector.
    
Because these vectors contain a lot of zeros, we call them sparse. The scipy.sparse package provides an efficient way of handling sparse vectors in Python.
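
As a rough illustration of why a sparse format helps, the sketch below converts a mostly-zero vector to a `scipy.sparse.csr_matrix`, which stores only the non-zero entries and their positions. The 10-word vector is made-up example data, not output from any real corpus.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A hypothetical count vector over a 10-word vocabulary: mostly zeros.
dense = np.array([[0, 2, 0, 0, 1, 0, 0, 0, 0, 3]])

# The CSR format keeps only the non-zero values and their column positions.
sparse = csr_matrix(dense)

print(sparse.nnz)        # number of stored (non-zero) entries
print(sparse.indices)    # column positions of those entries
print(sparse.data)       # the non-zero counts themselves
print(sparse.toarray())  # convert back to a dense array when needed
```

Here only 3 of the 10 values are stored. With a vocabulary of many thousands of words, the saving is much larger.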

    from sklearn.feature_extraction.text import CountVectorizer

    # list of text documents
    text = ["I am a boy and I am new."]

    # create the transform
    vectorizer = CountVectorizer()

    # tokenize and build vocab
    vectorizer.fit(text)

    # summarize
    print(vectorizer.vocabulary_)

    # encode document
    vector = vectorizer.transform(text)

    # summarize encoded vector
    print(vector.shape)
    print(type(vector))
    print(vector.toarray())

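Output:

    {'am': 0, 'boy': 2, 'and': 1, 'new': 3}
    (1, 4)
    <class 'scipy.sparse.csr.csr_matrix'>
    [[2 1 1 1]]

The vocabulary has only four entries because CountVectorizer, by default, lowercases the text and keeps only tokens of two or more alphanumeric characters, so 'I' and 'a' are dropped. The numbers in vocabulary_ are column indices (assigned in alphabetical order), not counts, so the encoded vector reads am=2, and=1, boy=1, new=1. The exact module path shown in the class name may differ between SciPy versions.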
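
Once fitted, the same vectorizer can encode documents it has not seen before. The sketch below shows this with a made-up second sentence, and also shows one way to keep single-character words such as 'I' and 'a' by passing a custom `token_pattern`. The pattern used here is an assumption chosen for illustration, not a recommended setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["I am a boy and I am new."]

# Default settings: single-character tokens are dropped.
vectorizer = CountVectorizer()
vectorizer.fit(text)

# Encode a document that was not seen during fit().
# "girl" is not in the learned vocabulary, so it contributes nothing.
new_vector = vectorizer.transform(["I am a new girl"])
print(new_vector.toarray())  # [[1 0 0 1]]

# Assumed alternative pattern: accept tokens of one or more characters.
keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
short_vector = keep_short.fit_transform(text)
print(keep_short.vocabulary_)
print(short_vector.toarray())  # [[1 2 1 1 2 1]]
```

With the alternative pattern the vocabulary grows to six words (a, am, and, boy, i, new), and both 'i' and 'am' get a count of 2.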
