# Binary Term Frequency 

* We will use Sklearn's __CountVectorizer()__ with the __binary=True__ parameter.
* Let's talk about the __fit()__ and __transform()__ functions.
    * __fit()__ "fits" the text data: it tokenizes the texts and builds the vocabulary from them.
    * __transform()__ maps each text to a vector over that vocabulary. In this case the values are binary: 1 if the word occurs in the text, 0 otherwise.
* We can get the feature vectors with __features.toarray()__. __transform()__ returns a sparse matrix, and this call converts it into a dense 2-D NumPy array (one row per text, one column per vocabulary word).
* Then, we will put the result into a data frame to make it easier to read.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["good movie", "bad acting", "it was boring movie"]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(texts)
features = vectorizer.transform(texts)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

print("Texts:", texts)
print("--------------------------------------------------------")
print(df)
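
To make the split between __fit()__ and __transform()__ more concrete, here is a minimal sketch that fits on the same texts and then transforms a sentence the vectorizer has not seen. The sentence "good good great movie" is a made-up example, not part of the original data.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["good movie", "bad acting", "it was boring movie"]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(train_texts)  # only learns the vocabulary

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
vocab = vectorizer.vocabulary_
print(vocab)

# Hypothetical unseen sentence: "good" repeats and "great" is not in the vocabulary
new_features = vectorizer.transform(["good good great movie"]).toarray()[0]
print(new_features)
```

With __binary=True__ the repeated "good" still produces a 1, and "great" is ignored because it was not seen during __fit()__.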

# Word counts

* We will use Sklearn's __CountVectorizer()__ with its default parameters.
* __fit()__ "fits" the text data and builds the vocabulary from the texts.
* __transform()__ maps each text to a vector over that vocabulary. In this example the values are raw token/word counts.
* We can get the feature vectors with __features.toarray()__. This converts the sparse result into a dense 2-D NumPy array.
* Then, we will put the result into a data frame to make it easier to read.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["good movie", "bad acting", "it was boring movie"]

vectorizer = CountVectorizer()
vectorizer.fit(texts)
features = vectorizer.transform(texts)

df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

print("Texts:", texts)
print("--------------------------------------------------------")
print(df)

# N-grams 
We can use N-grams in our feature vectors to capture short token sequences in addition to single tokens. The example below uses 1-grams (regular tokens) and 2-grams (2 consecutive tokens). Pay attention to our new vocabulary: it gets __BIGGER__ because of the additional 2-grams. 

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["good movie", "bad acting", "it was boring movie"]

# CountVectorizer(ngram_range = (ngram_low_limit, ngram_up_limit)) 
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(texts)
features = vectorizer.transform(texts)

df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

print("Texts:", texts)
print("--------------------------------------------------------")
print(df)

# Term Frequencies

* We will use Sklearn's __TfidfVectorizer()__ with the __use_idf=False__ parameter.
* __fit()__ "fits" the text data and builds the vocabulary from the texts.
* __transform()__ calculates term frequencies. By default (__norm='l2'__), Sklearn applies __L2 normalization__ to each vector, so each row holds the raw counts divided by their Euclidean length.
* We can get the feature vectors with __features.toarray()__. This converts the sparse result into a dense 2-D NumPy array.
* Then, we will put the result into a data frame to make it easier to read.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = ["good movie", "bad acting", "it was boring movie"]

vectorizer = TfidfVectorizer(use_idf=False)
vectorizer.fit(texts)
features = vectorizer.transform(texts)

df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

print("Texts:", texts)
print("--------------------------------------------------------")
print(df)
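
The L2 normalization can be checked by hand. This sketch, assuming the default settings of both vectorizers, divides the raw counts of each text by the Euclidean length of its row and compares the result with the output of __TfidfVectorizer(use_idf=False)__.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["good movie", "bad acting", "it was boring movie"]

# Raw word counts, one row per text
counts = CountVectorizer().fit_transform(texts).toarray().astype(float)

# L2 normalization: divide each row by its Euclidean length
norms = np.linalg.norm(counts, axis=1, keepdims=True)
manual_tf = counts / norms

sklearn_tf = TfidfVectorizer(use_idf=False).fit_transform(texts).toarray()

print(np.round(manual_tf, 4))
print("same as sklearn:", np.allclose(manual_tf, sklearn_tf))
```

For "good movie" both words get 1/sqrt(2), about 0.7071, and for "it was boring movie" each of the four words gets 0.5.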

# TF - IDF

* We will use Sklearn's __TfidfVectorizer()__ with its default parameters.
* __fit()__ "fits" the text data, builds the vocabulary from the texts, and learns the inverse document frequency (IDF) weight of each word.
* __transform()__ calculates term frequency - inverse document frequency (TF-IDF) values. By default (__norm='l2'__), Sklearn applies __L2 normalization__ to each vector.
* We can get the feature vectors with __features.toarray()__. This converts the sparse result into a dense 2-D NumPy array.
* Then, we will put the result into a data frame to make it easier to read.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = ["good movie", "bad acting", "it was boring movie"]

vectorizer = TfidfVectorizer()
vectorizer.fit(texts)
features = vectorizer.transform(texts)

df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

print("Texts:", texts)
print("--------------------------------------------------------")
print(df)
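
The TF-IDF values can also be reproduced step by step. This sketch assumes the default settings of __TfidfVectorizer()__ (__smooth_idf=True__, __sublinear_tf=False__, __norm='l2'__), for which the IDF of a word is ln((1 + n) / (1 + df)) + 1, where n is the number of texts and df is the number of texts containing the word.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["good movie", "bad acting", "it was boring movie"]

counts = CountVectorizer().fit_transform(texts).toarray().astype(float)

n_texts = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of texts containing each word

# Smoothed IDF, matching the default smooth_idf=True
manual_idf = np.log((1 + n_texts) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then L2-normalize each row
weighted = counts * manual_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

vectorizer = TfidfVectorizer()
sklearn_tfidf = vectorizer.fit_transform(texts).toarray()

print(np.round(manual_idf, 4))
print("idf matches:", np.allclose(manual_idf, vectorizer.idf_))
print("tf-idf matches:", np.allclose(manual_tfidf, sklearn_tfidf))
```

"movie" appears in two of the three texts, so it gets a lower IDF than the words that appear in only one text.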

# Vectorizing with limited feature size (smaller vocabulary)

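Sometimes we may need to reduce the size of our feature array for faster training and better generalization. In this case, we can use the __max_features__ parameter. It keeps only the given number of words, chosen as the most frequent ones across the whole corpus. This also holds for __TfidfVectorizer()__: the ranking uses raw word counts, not TF-IDF scores.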

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["good movie", "bad acting", "it was boring movie"]

vectorizer = TfidfVectorizer(max_features=3)    # TF-IDF 
#vectorizer = CountVectorizer(max_features=3)  # Word counts

vectorizer.fit(texts)
features = vectorizer.transform(texts)

df = pd.DataFrame(features.toarray(), columns=vectorizer.get_feature_names_out())

print("Texts:", texts)
print("--------------------------------------------------------")
print(df)