# Notes on Text pre-processing for Machine Learning Algorithms

Raw text cannot be fed directly to most machine learning or NLP algorithms. It first needs to be preprocessed into a numeric form that the algorithms can work with. A common choice is the Bag of Words (BoW) model, which focuses on the occurrence of words rather than their order. Each word in a known vocabulary is assigned a unique number, and any text can then be encoded as a fixed-size vector with one position per vocabulary word. For example:

    Vocab           - am, new, boy, a, and, ....
    Assigned Number - 1,  2,   3,   4, 5,   ....
    Initial Vector  { 0,  0,   0,   0, 0,   .... }

    Now the text - I am a boy and I am new
    In this text, 'am' occurs twice, so the first position of the vector is incremented to 2. 'new', 'boy', 'a' and 'and' each occur once, so their positions are set to 1. The same thing is done for any other word in the vocab.
             am  new  boy  a  and  ....
    Vector {  2,  1,   1,  1,  1,  .... }
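The counting step above can be sketched in a few lines of plain Python. This is only an illustration: the `encode` helper and its whitespace tokenizer are assumptions made for this example and are not part of any library.

```python
from collections import Counter

# Vocabulary from the example above; position i of the vector belongs to vocab[i].
vocab = ["am", "new", "boy", "a", "and"]

def encode(text, vocab):
    """Return a count vector with one entry per vocabulary word."""
    # Naive tokenizer (assumption): lowercase, then split on whitespace.
    counts = Counter(text.lower().split())
    # Words that are not in the vocab are simply ignored.
    return [counts[word] for word in vocab]

vector = encode("I am a boy and I am new", vocab)
print(vector)  # [2, 1, 1, 1, 1]
```

The vector length depends only on the vocabulary, not on the text, which is why every document ends up with the same fixed size.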

In practice the vocabulary is very large, so the vector for any single text is mostly zeros. The scikit-learn library provides three different schemes that we can use:

# 1) Word Counts with CountVectorizer

### Convert a collection of text documents to a matrix of token counts

You can use it as follows:

    1) Create an instance of the CountVectorizer class.
    2) Call the fit() function in order to learn a vocabulary from one or more documents.
    3) Call the transform() function on one or more documents as needed to encode each as a vector.
    
Because these vectors will contain a lot of zeros, we call them sparse. The scipy.sparse package provides an efficient way of handling sparse vectors in Python, and transform() returns its result in that form.
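As a rough illustration of why the sparse form is efficient, the sketch below builds a small count vector by hand (the numbers are made up for this example) and shows that a CSR matrix stores only the non-zero entries and their column positions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A hand-made count vector: 8 vocabulary positions, only 3 of them non-zero.
dense = np.array([[2, 0, 0, 1, 0, 0, 0, 1]])
sparse = csr_matrix(dense)

print(sparse.shape)      # logical shape is unchanged
print(sparse.nnz)        # number of stored (non-zero) entries
print(sparse.indices)    # column index of each stored entry
print(sparse.data)       # value of each stored entry
print(sparse.toarray())  # convert back to a dense array for inspection
```

Calling toarray() is convenient for small examples, but for a real vocabulary it is usually better to keep the matrix sparse.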

    from sklearn.feature_extraction.text import CountVectorizer

    # list of text documents
    text = ["I am a boy and I am new."]

    # create the transform
    vectorizer = CountVectorizer()

    # tokenize and build the vocab; the default tokenizer lowercases the text
    # and drops one-letter tokens, so "I" and "a" do not appear in it
    vectorizer.fit(text)

    # summarize the learned vocab (word -> column index)
    print(vectorizer.vocabulary_)

    # encode the document
    vector = vectorizer.transform(text)

    # "good" is not in the vocab, so it is ignored
    vector2 = vectorizer.transform(["I am a good boy"])

    # summarize the encoded vectors
    print(vector.shape)
    print(type(vector))
    print(vector.toarray())
    print(vector2.toarray())

Output:

    {'am': 0, 'boy': 2, 'and': 1, 'new': 3}
    (1, 4)
    <class 'scipy.sparse.csr.csr_matrix'>
    [[2 1 1 1]]
    [[1 0 1 0]]
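If one-letter words such as "I" and "a" should be counted as well, one option is to pass a custom `token_pattern`. The default pattern only accepts tokens of two or more word characters; the pattern below, chosen for this sketch, accepts one or more. It also uses fit_transform(), which combines the fit() and transform() steps.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["I am a boy and I am new."]

# Accept tokens of one or more word characters instead of two or more.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vector = vectorizer.fit_transform(text)  # fit() and transform() in one call

# List the vocab words in column order.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(vector.toarray())
```

With this pattern the vocabulary has six entries, and "i" (lowercased) and "am" both get a count of 2.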


# 2) Word Frequencies with TfidfVectorizer (Term Frequency – Inverse Document Frequency)

tf–idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus; TfidfVectorizer is the scikit-learn class that computes it. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes; one survey reported that 83% of text-based recommender systems in digital libraries used it.

Words like 'the' and 'and' are repeated many times in most texts, so raw counts give them large values even though they carry little information. tf–idf addresses this with two components:

    Term Frequency: As the name suggests, this counts how often a word occurs in a document.
    Inverse Document Frequency: Words like 'the', 'am' and 'and' are so common that they appear in almost every document. Such words are less informative, so they are down-weighted, which shifts the focus to the more distinctive words.
    

    from sklearn.feature_extraction.text import TfidfVectorizer

    # list of text documents
    text = ["The quick brown fox jumped over the lazy dog.",
            "The dog.",
            "The fox"]

    # create the transform
    vectorizer = TfidfVectorizer()

    # tokenize, build the vocab and learn the idf weights
    vectorizer.fit(text)

    # summarize: vocab (word -> column index) and one idf weight per column
    print(vectorizer.vocabulary_)
    print(vectorizer.idf_)

    # encode the first document only
    vector = vectorizer.transform([text[0]])

    # summarize the encoded vector
    print(vector.shape)
    print(vector.toarray())

Output:

    {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
    [1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
     1.69314718 1.        ]
    (1, 8)
    [[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
      0.36388646 0.42983441]]
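The numbers printed above can be reproduced by hand, which makes the weighting easier to follow. The sketch below assumes scikit-learn's default settings: a smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, followed by L2 normalization of each document vector. The simple split() tokenizer is an assumption that happens to be sufficient for these three sentences.

```python
import math
from collections import Counter

docs = ["the quick brown fox jumped over the lazy dog", "the dog", "the fox"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: the number of documents that contain each word.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed idf: words found in every document get the minimum weight of 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

# tf-idf for the first document, followed by L2 normalization.
tf = Counter(tokenized[0])
raw = [tf[w] * idf[w] for w in vocab]
norm = math.sqrt(sum(x * x for x in raw))
tfidf = [x / norm for x in raw]

for w, value in zip(vocab, tfidf):
    print(f"{w:7s} idf={idf[w]:.8f} tfidf={value:.8f}")
```

'the' occurs in all three documents, so its idf is the minimum value of 1. It still has the largest entry in the first document's vector because it occurs twice there.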


# 3) Hashing with HashingVectorizer

The limitation of the two methods above is that they need to build and store a vocabulary. For a large corpus the vocabulary can become very long, which costs memory and makes the vectors large. An efficient alternative is hashing: a hash function maps each word directly to an integer column index, which eliminates the need for a vocabulary. The drawback of this method is that hashing is one-way, so a column index cannot be converted back to the original word.

In the example below, an arbitrary vector length of 40 was chosen. The length should be set with utmost care: too small a value causes frequent hash collisions, where different words land in the same position, while a larger value lowers the collision rate at the cost of bigger vectors. For a real corpus the length should be far larger than 40; the scikit-learn default is 2**20.

    from sklearn.feature_extraction.text import HashingVectorizer

    # list of text documents
    text = ["I am a bow and I am new"]

    # create the transform; there is no vocab to learn, so fit() is not needed
    vectorizer = HashingVectorizer(n_features=40)

    # encode the document
    vector = vectorizer.transform(text)

    # summarize the encoded vector; by default the values are L2-normalized
    # and the hash also decides the sign, hence the fractions and negatives
    print(vector.shape)
    print(vector.toarray())

Output:

    (1, 40)
    [[-0.37796447  0.          0.          0.          0.         -0.37796447
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.37796447
       0.          0.          0.          0.          0.          0.
       0.          0.          0.          0.          0.          0.75592895
       0.          0.          0.          0.        ]]
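To make the hashing idea concrete, here is a minimal plain-Python sketch of it. It is not how scikit-learn implements HashingVectorizer: the `hashed_counts` helper is hypothetical, MD5 from hashlib stands in for the MurmurHash3 function that scikit-learn uses, and there is no sign flipping or normalization, so the column positions and values will not match the output above.

```python
import hashlib

def hashed_counts(text, n_features=40):
    """Count tokens into a fixed-length vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():  # naive whitespace tokenizer (assumption)
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # word -> column, no lookup table
        vector[index] += 1  # colliding words share a column
    return vector

vector = hashed_counts("I am a boy and I am new")
print(len(vector))  # always n_features, whatever the text contains
print(sum(vector))  # every token is counted, even if some columns collide
```

Because only the column index is kept, there is no table to map a column back to a word, and two different words that hash to the same column become indistinguishable. This is the reason for choosing a generous vector length.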
