# CountVectorizer vs TFIDF

description.__doc__ <br>
""" A simple way to compare CountVectorizer and term frequency-inverse document frequency (TF-IDF).
<br>
Args:
  CountVectorizer, TFIDF, Pandas and Random Text
<br>  
Returns:
  Easy overview of the difference between CountVectorizer and TFIDF<br>
"""

In [110]:
text = [["Five little monkeys sitting in a tree","One monkey fell down from the tree and hurt her knee","Why me? Oh the humanity!"],["Six large monkeys eating in a tree","Two monkey fell down and hurt their knees","Why me? Oh the humanity!"]]
text2 = "This string is already flatter than the earth!"

In [126]:
import itertools
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
flat_text = itertools.chain.from_iterable(text)
print(list(flat_text))

['Five little monkeys sitting in a tree', 'One monkey fell down from the tree and hurt her knee', 'Why me? Oh the humanity!', 'Six large monkeys eating in a tree', 'Two monkey fell down and hurt their knees', 'Why me? Oh the humanity!']


In [127]:
# Crude way of OOP for NLP
class WordProcess:
    
    def __init__(self, text):
        self.text = text
        self.word_count = defaultdict(int) # dict with default value 0 for new keys
        
    def tokenize_list_of_list(self):  # Brute-force tokenizer: split on single spaces
        flat_text = list(itertools.chain.from_iterable(self.text))
        flat_text = " ".join(flat_text)
        for word in flat_text.split(" "):
            self.word_count[word] += 1
             
my_text = WordProcess(text)    
my_text.tokenize_list_of_list()

In [128]:
print(my_text.word_count) # Basically we are getting the counts for each word. That is it.

defaultdict(<class 'int'>, {'Five': 1, 'little': 1, 'monkeys': 2, 'sitting': 1, 'in': 2, 'a': 2, 'tree': 3, 'One': 1, 'monkey': 2, 'fell': 2, 'down': 2, 'from': 1, 'the': 3, 'and': 2, 'hurt': 2, 'her': 1, 'knee': 1, 'Why': 2, 'me?': 2, 'Oh': 2, 'humanity!': 2, 'Six': 1, 'large': 1, 'eating': 1, 'Two': 1, 'their': 1, 'knees': 1})


In [129]:
# Crude Pandas Preprocessing
text_names = []
texts = []
for i in range(len(text)):
    text_names.append("text" + str(i + 1))
    texts.append(" ".join(text[i]))  # one joined string per document
print(texts)
print(text_names)

['Five little monkeys sitting in a tree One monkey fell down from the tree and hurt her knee Why me? Oh the humanity!', 'Six large monkeys eating in a tree Two monkey fell down and hurt their knees Why me? Oh the humanity!']
['text1', 'text2']


In [130]:
df = pd.DataFrame(zip(text_names, texts), columns=['Document','Text'])
print(df.head())

  Document                                               Text
0    text1  Five little monkeys sitting in a tree One monk...
1    text2  Six large monkeys eating in a tree Two monkey ...


# Differences in preprocessing for NLP
The main difference is that a better processor might be able to lemmatize and stem words.
Both bring certain words in our texts down to a base form: lemmatization maps fell to fall, and stemming reduces eating to eat.
We would end up with fewer distinct words in total. Computationally that is better; model-wise it might not be.
The CountVectorizer below does not do this, and for this particular text it is almost as crude as the WordProcess class above. It is, however, far better on longer texts. The point, though, is that a lot of data will be generated. The data is high in cardinality up to a point; as the corpus grows toward some limit of the language, its vocabulary will overlap well with the real language in question. This resembles the law of large numbers applied to language, except that it may take far fewer texts to recover most of the vocabulary than a normal distribution would suggest.
The problem for NLP, though, is that MEANING isn't extracted at all. It has to be inferred.
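
To make the "fewer words in total" point concrete, here is a minimal sketch with a toy suffix stripper. `crude_stem` is a hypothetical helper written for this illustration, not a real stemmer from any library, and the two documents are the ones used above, lowercased with punctuation removed by hand.

```python
from collections import Counter

# The two documents from above, lowercased and with punctuation removed by hand.
docs = [
    "five little monkeys sitting in a tree one monkey fell down from the tree "
    "and hurt her knee why me oh the humanity",
    "six large monkeys eating in a tree two monkey fell down and hurt their "
    "knees why me oh the humanity",
]


def crude_stem(word):
    """Hypothetical toy stemmer: strip a trailing 'ing' or 's' from longer words."""
    if len(word) > 5 and word.endswith("ing"):
        stem = word[:-3]
        # Undo a doubled final consonant, e.g. "sitt" -> "sit".
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


raw_tokens = " ".join(docs).split()
stemmed_tokens = [crude_stem(w) for w in raw_tokens]

raw_vocab = set(raw_tokens)
stemmed_vocab = set(stemmed_tokens)

# Vocabulary size before and after stemming.
print(len(raw_vocab), len(stemmed_vocab))
# "monkeys" and "monkey" now share one entry.
print(Counter(stemmed_tokens)["monkey"])
```

The vocabulary shrinks only slightly on a text this small, because just the monkeys/monkey and knees/knee pairs merge. A rule this crude cannot map fell to fall; that needs a lemmatizer with a dictionary behind it.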

In [131]:
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(df["Text"])
# The columns, if you will. Older scikit-learn releases used get_feature_names() instead.
print(count_vectorizer.get_feature_names_out().tolist())

['eating', 'fell', 'humanity', 'hurt', 'knee', 'knees', 'large', 'little', 'monkey', 'monkeys', 'oh', 'sitting', 'tree']


In [132]:
count_train.toarray()  # The counts of each word, one vector per document: the count vectors

array([[0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]], dtype=int64)

In [133]:
tfidf_vectorizer = TfidfVectorizer(stop_words="english")

In [134]:
float_formatter = "{:.2f}".format
np.set_printoptions(formatter={'float_kind':float_formatter})
tfidf_train = tfidf_vectorizer.fit_transform(df["Text"])
print(tfidf_train.toarray())  # TF-IDF vector weights

[[0.00 0.25 0.25 0.25 0.35 0.00 0.00 0.35 0.25 0.25 0.25 0.35 0.50]
 [0.39 0.28 0.28 0.28 0.00 0.39 0.39 0.00 0.28 0.28 0.28 0.00 0.28]]


# TF-IDF formula

The idea is to find words that are important by frequency in certain documents but not across all documents. If, say, Zebra is mentioned 5 times in one text and in no other, it should be relevant.
To do this, the term frequency (count / total words in the document) is multiplied by log(N / d), the log of the ratio of the number of documents N to the number of documents d that contain the term. In practice, scikit-learn's TfidfVectorizer smooths this by default: it adds 1 to both N and d, as if one extra document contained every term, which avoids division by zero, and then adds 1 to the log so that words present in every document are not dropped entirely. It also uses the raw count as the term frequency and rescales each document vector to unit length.
So what is returned is a set of weighted vectors. Like space-time, or a circle on a sphere, or any number of things that are adjusted and weighted in some manner.
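
As a check on the formula, the weights can be recomputed by hand from the count matrix printed earlier. This is a sketch in plain Python, assuming scikit-learn's documented defaults (smoothed idf of ln((1 + N) / (1 + d)) + 1, raw counts as term frequency, L2 normalization per document). The names `counts`, `idf` and `tfidf_row` are made up for this example.

```python
import math

# Count matrix copied from the CountVectorizer output above (one row per document).
counts = [
    [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents in which each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print(" ".join(f"{v:.2f}" for v in row))
```

Rounded to two decimals, the two printed rows agree with the TfidfVectorizer matrix shown above.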

# The chance for the TF-IDF formula to be something different - non-linear and ReLU
The first time I examined the formula I really liked the log term, since log(1) = 0 would remove a word completely from being used. This is a familiar pattern I see in stronger models: the ability to ignore things.
In the first part of the equation, a count of zero sets the weight to zero. But so does presence in all documents, where N / d = 1 and the log term vanishes. Note that the TfidfVectorizer output above does not go all the way to zero: words that appear in both documents still get 0.25, because scikit-learn adds 1 to the idf by default.
To me this looks a lot like the rectified linear unit (ReLU) in neural networks, the support vectors of an SVM, and other machine learning regularization strategies. That is, we can ignore large parts of our data, which is similar to life: pick out the reducible parts we can do something with, if possible, to understand the phenomenon.
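
The zeroing effect of the plain log(N / d) term can be seen directly on the vocabulary from above. This is a small sketch of the textbook, unsmoothed formula, not of what TfidfVectorizer computes by default; `plain_idf` and `surviving` are illustrative names.

```python
import math

# Vocabulary and count matrix copied from the CountVectorizer output above.
terms = ["eating", "fell", "humanity", "hurt", "knee", "knees", "large",
         "little", "monkey", "monkeys", "oh", "sitting", "tree"]
counts = [
    [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
]
n_docs = len(counts)

# Document frequency d for each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Plain idf: log(N / d). A term found in every document gets log(1) = 0.
plain_idf = [math.log(n_docs / d) for d in doc_freq]

# Terms whose weight is not switched off by the log term.
surviving = [t for t, w in zip(terms, plain_idf) if w > 0]
print(surviving)
```

With only two documents, every word they share (fell, hurt, monkey, tree and so on) is switched off, and only the words unique to one document keep a non-zero weight.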

In practice, however, the TfidfVectorizer can be instantiated with a min_df and max_df range that removes words that are not common enough or too common. It can also be limited with max_features, which caps how many terms of the vocabulary are used.

These are all trying to accomplish the same thing. Reduction. 
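
The three limits mentioned above can be sketched as a plain filter over the count matrix. `select_terms` is a hypothetical helper that mimics the documented meaning of min_df, max_df and max_features (floats as a proportion of documents, integers as absolute counts). It is not scikit-learn's implementation, and ties under max_features are broken alphabetically here, which may differ from the library.

```python
# Vocabulary and count matrix copied from the CountVectorizer output above.
terms = ["eating", "fell", "humanity", "hurt", "knee", "knees", "large",
         "little", "monkey", "monkeys", "oh", "sitting", "tree"]
counts = [
    [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
]


def select_terms(terms, counts, min_df=1, max_df=1.0, max_features=None):
    """Hypothetical sketch of vocabulary pruning by document frequency."""
    n_docs = len(counts)
    # Floats are read as a proportion of documents, ints as absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    columns = range(len(terms))
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in columns]
    totals = [sum(row[j] for row in counts) for j in columns]
    keep = [j for j in columns if low <= doc_freq[j] <= high]
    if max_features is not None:
        # Keep the most frequent terms across the corpus; the sort is stable,
        # so ties stay in alphabetical (column) order.
        keep = sorted(keep, key=lambda j: -totals[j])[:max_features]
        keep.sort()
    return [terms[j] for j in keep]


print(select_terms(terms, counts, max_df=0.7))      # drop terms found in both documents
print(select_terms(terms, counts, min_df=2))        # keep only terms found in both
print(select_terms(terms, counts, max_features=3))  # keep the three most frequent terms
```

With two documents, max_df=0.7 works out to a threshold of 1.4 documents, so every term that appears in both texts is dropped.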

Some ideas on this field:
- one: build bootstrap vectors from the corpus and compare them to our corpus # Answers: is it more meaningful than random?
- two: split up the data with cross validation rather than train test split
- three: funnel several models

In [None]:
TfidfVectorizer(stop_words="english", max_df=0.7)  # Clipping off non-linearities
"""
:param: max_df
float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
"""