# Basics
> Text cannot be fed directly to most ML algorithms, because they expect fixed-size numerical input

> **Vectorization -**  
> It's the process of turning text into numerical feature vectors  
> It consists of **Tokenization, Counting & Normalization**  
> This strategy is called **Bag of Words**
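
The tokenization and counting steps above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements it: the helper name `bag_of_words` is made up here, and the regex is assumed to mirror CountVectorizer's default rule of keeping lowercase words with 2 or more characters. Normalization is left out of this sketch.

```python
import re
from collections import Counter

# Assumption: words of 2+ word characters, similar to CountVectorizer's default
TOKEN_RE = re.compile(r"\b\w\w+\b")

def bag_of_words(docs):
    """Hypothetical helper: return (sorted vocabulary, list of count rows)."""
    # Tokenization: lowercase each document and split it into tokens
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)                      # Counting
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, rows = bag_of_words(['call me back', 'i will call you back'])
print(vocab)   # ['back', 'call', 'me', 'will', 'you']
print(rows)    # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

Note that the single-character token 'i' is dropped, and every message becomes a row of the same length regardless of how many words it has.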

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
X_train = ['call me back', 'i will call you back', 'do not call me', 'urgent calls only']

# 1. Count Vectorizer

In [3]:
vect = CountVectorizer()

In [4]:
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
# Tokens extracted from the messages
# (scikit-learn >= 1.2 removed this method; use get_feature_names_out() there)
vect.get_feature_names()

['back', 'call', 'calls', 'do', 'me', 'not', 'only', 'urgent', 'will', 'you']

### Document Term Matrix
> 4 messages x 10 unique features  
> It's stored as a sparse matrix because each message uses only a few of the unique features, so most entries are zero

In [6]:
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4x10 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

### Visualizing a DTM
This can't be done for an extremely big sparse DTM, because the dense representation may eat the whole RAM!

In [7]:
pd.DataFrame(X_train_dtm.toarray(), columns = vect.get_feature_names())

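| | back | call | calls | do | me | not | only | urgent | will | you |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |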
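
To make the RAM concern concrete, the dense size of a DTM can be estimated before calling `.toarray()`. The sketch below is an assumption-laden illustration: the matrix is synthetic (100,000 documents, 50,000 terms, 5 random tokens per document) and the helper names `dense_nbytes` and `sparse_nbytes` are made up for this example.

```python
import numpy as np
from scipy import sparse

def dense_nbytes(m):
    """Bytes needed if the sparse matrix were converted with .toarray()."""
    return m.shape[0] * m.shape[1] * m.dtype.itemsize

def sparse_nbytes(m):
    """Bytes actually held by the three CSR arrays."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Synthetic DTM (hypothetical sizes): each document gets 5 random term hits
rng = np.random.default_rng(0)
n_docs, n_terms, tokens_per_doc = 100_000, 50_000, 5
row_idx = np.repeat(np.arange(n_docs), tokens_per_doc)
col_idx = rng.integers(0, n_terms, size=row_idx.size)
values = np.ones(row_idx.size, dtype=np.int64)
dtm = sparse.csr_matrix((values, (row_idx, col_idx)), shape=(n_docs, n_terms))

print(f"sparse: {sparse_nbytes(dtm) / 1e6:.1f} MB")
print(f"dense : {dense_nbytes(dtm) / 1e9:.1f} GB")   # 40.0 GB

# Densify only a small slice when a visual check is needed
preview = dtm[:5].toarray()
print(preview.shape)   # (5, 50000)
```

Slicing a few rows before densifying keeps the preview small while the full matrix stays sparse.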


# 2. TF-IDF Vectorizer
> Occurrence count is a good start but there is an issue:

> Longer documents will have higher average count values than shorter documents, even though they might talk about the same topics

> **TF - Term Frequency -** Count of a word in a document divided by the total number of words in that document  
> **IDF - Inverse Document Frequency -** Down-weights words that occur in many documents, typically as log(number of documents / number of documents containing the word)  
> **TF-IDF -** TF multiplied by IDF
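
As a rough check of these definitions, the sketch below recomputes the weights for 'call me back' by hand. It assumes scikit-learn's default settings, which differ slightly from the textbook definitions: raw counts are used as TF, IDF is smoothed as ln((1 + n) / (1 + df)) + 1, and each row is L2-normalized. The document frequencies are counted from the four training messages.

```python
import numpy as np

# Non-zero features of 'call me back', in the order: back, call, me
counts = np.array([1.0, 1.0, 1.0])    # raw term counts in this message
n_docs = 4                            # number of training messages
doc_freq = np.array([2, 3, 2])        # messages containing back, call, me

# Assumption: scikit-learn's default smoothed IDF
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf)   # L2 normalization of the row

print(tfidf.round(6))   # approximately 0.613667, 0.496816, 0.613667
```

'call' appears in three of the four messages, so it ends up with a lower weight than 'back' and 'me', which appear in two.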

> sklearn's feature extraction module has 2 TF-IDF classes -  
1. TfidfTransformer - Transforms an already counted DTM
1. TfidfVectorizer - Does both the Count Vectorizer's task and the Transformer's task at once
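
The relationship between the two classes can be checked with a small sketch: counting first and then applying TfidfTransformer should give the same matrix as TfidfVectorizer alone, assuming default parameters for all three classes. The variable names `two_step` and `one_step` are just for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

X_train = ['call me back', 'i will call you back',
           'do not call me', 'urgent calls only']

# Two steps: build the count DTM, then re-weight it
counts = CountVectorizer().fit_transform(X_train)
two_step = TfidfTransformer().fit_transform(counts)

# One step: tokenize, count and re-weight at once
one_step = TfidfVectorizer().fit_transform(X_train)

print(np.allclose(two_step.toarray(), one_step.toarray()))   # True
```

The two-step route is handy when the raw counts are needed elsewhere; otherwise TfidfVectorizer is the shorter path.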

In [8]:
vect = TfidfVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4x10 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [9]:
pd.DataFrame(X_train_dtm.toarray(), columns = vect.get_feature_names())

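| | back | call | calls | do | me | not | only | urgent | will | you |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.613667 | 0.496816 | 0.0 | 0.0 | 0.613667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.453005 | 0.366747 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57458 | 0.57458 |
| 2 | 0.0 | 0.366747 | 0.0 | 0.57458 | 0.453005 | 0.57458 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.0 | 0.0 |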


# Test Process
> New tokens (e.g. don't) are ignored at test time, because the vocabulary is fixed in the fit stage

> A model can't make use of features it hasn't seen during training

> vect.fit(X_train) **Learns the Vocabulary** of training data  
> vect.transform(X_train) **Builds Document Term Matrix** from the learnt vocab (which will be fed to a model)  
> vect.fit_transform(X_train) **2 in 1**  
> vect.transform(X_test) builds DTM for testing, **Rejects New Vocab**

In [10]:
X_test = ["please don't call me"]
X_test_dtm = vect.transform(X_test)
X_test_dtm    # 1 x 10, because of the single message and the 10 unique training tokens

<1x10 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [11]:
pd.DataFrame(X_test_dtm.toarray(), columns = vect.get_feature_names())

| | back | call | calls | do | me | not | only | urgent | will | you |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.629228 | 0.0 | 0.0 | 0.777221 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
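
To see which test tokens are dropped, one option is to run the vectorizer's own analyzer on the test message and compare the tokens against the learnt vocabulary. This is a minimal sketch assuming default TfidfVectorizer settings; the variable names `known` and `unseen` are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = ['call me back', 'i will call you back',
           'do not call me', 'urgent calls only']
X_test = ["please don't call me"]

vect = TfidfVectorizer().fit(X_train)

# build_analyzer() returns the same preprocessing + tokenization that transform uses
analyze = vect.build_analyzer()
tokens = analyze(X_test[0])

known = [t for t in tokens if t in vect.vocabulary_]
unseen = [t for t in tokens if t not in vect.vocabulary_]

print(known)    # ['call', 'me']
print(unseen)   # ['please', 'don']
```

With the default token pattern, "don't" is split at the apostrophe into 'don' and 't', and the single-character 't' is discarded. Only 'call' and 'me' survive, which matches the 2 stored elements in the test DTM.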
