# 0. Dependências

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math 
import json

%matplotlib inline

# 1. Introdução 

O valor **tf–idf** (abreviação do inglês term frequency–inverse document frequency, que significa frequência do termo–inverso da frequência nos documentos),
é uma medida estatística que tem o intuito de indicar a importância de uma palavra de um documento em relação a uma coleção de documentos ou em um [corpus linguístico](https://pt.wikipedia.org/wiki/Corpus_lingu%C3%ADstico).
Ela é frequentemente utilizada como fator de ponderação na recuperação de informações e na mineração de dados.

O valor **tf–idf** de uma palavra aumenta proporcionalmente à medida que aumenta o número de ocorrências dela em um documento, no entanto, esse valor é equilibrado pela 
frequência da palavra no corpus. Isso auxilia a distinguir o fato da ocorrência de algumas palavras serem geralmente mais comuns que outras.

<img src="formula.png" align="center"/>

<img src="pseudocode.png" align="center"/>

## Frequência do termo (tf)

Suponha que foram selecionados uma coleção de documento de textos em português e que nós desejamos determinar qual deles tem maior relação com a frase "uma vaca amarela". Uma maneira simples de iniciar essa análise seria simplesmente descartar todos os documentos que não contém as palavras "uma", "vaca" e "amarela", mas apenas esse procedimento não seria suficiente para completar a análise, pois muitos documentos provavelmente possuem as três palavras. Assim, para melhorar a distinção entre elas, nós podemos **contar o número de vezes que um dos termos ocorre em cada documento e somar esse valor; o número de vezes que um termo ocorre em um documento é a frequência do termo**.

A primeira forma de ponderação de termos é atribuída a Hans Peter Luhn (1957) e se baseia na suposição de Luhn:

- O peso de um termo que ocorre em um documento é diretamente proporcional à sua frequência.

TF = (Frequency of the word in the sentence) / (Total number of words in the sentence)

<p align="center">
  <img  src="tf.png">
</p>

## Inverso da frequência nos documentos (idf)
No entanto, como o termo "uma" é muito comum, isso vai dar ênfase em documentos que utilizam essa palavra com mais frequência, sem dar a ênfase apropriada para termos com mais 
significado como "vaca" e "amarela". O termo "uma" não é uma boa palavra-chave para distinguir documentos relevantes de não-relevantes em comparação com as palavras "vaca" e "amarela". 
Assim, o inverso da frequência do termo nos documentos é incorporado para diminuir o peso dos termos que ocorrem mais frequentemente no conjunto de textos selecionados, ao mesmo tempo
que aumenta o peso daqueles que ocorrem raramente.

Karen Spärck Jones (1972) concebeu uma interpretação estatística do termo **IDF**, que se tornou um conceito base para a ponderação de termos:

- A especificidade de um termo pode ser quantificada por uma **função inversa** do número de documentos em que ele ocorre.

IDF: (Total number of sentences (documents))/(Number of sentences (documents) containing the word)

<p align="center">
  <img  src="idf.png">
</p>

# 2. Dados

In [63]:
corpus = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document amesterda'
]

# 3. Implementação 

In [75]:
class Tfidf():
    
    def __init__(self, corpus):
        self.dictionary = set()
        self.bow_list = []
        self.word_dict_list = []
        self.tf_list = []
        
        self.idfs = {}
        self.tf_idfs = []
        
        for doc in corpus:
            wordlist = doc.split(" ")
            self.bow_list.append(wordlist)
            self.dictionary= self.dictionary.union(set(wordlist))
            
        for doc in corpus:    
            self.word_dict_list.append(dict.fromkeys(self.dictionary, 0))
    
        for bow, wordDict  in zip(self.bow_list, self.word_dict_list) :
            for word in bow:
                wordDict[word]+=1
        
    def computeTF(self,wordDict, bow):
        import math
        """
        TF = (Frequency of the word in the sentence) / (Total number of words in the sentence)
        """
        tfDict = {}
        bowCount = len(bow)
        for word, count in wordDict.items():
            tfDict[word] = count/float(bowCount)
        return tfDict
    
    def computeTFIDF(self, tfBow, idfs):
        tfidf = {}
        for word, val in tfBow.items():
            tfidf[word] = val*idfs[word]
        return tfidf
        
    def computeIDF(self, docList):
        """
        IDF: (Total number of sentences (documents))/(Number of sentences (documents) containing the word)
        """
        import math
        idfDict = {}
        N = len(docList)

        idfDict = dict.fromkeys(docList[0].keys(), 0)
        for doc in docList:
            for word, val in doc.items():
                if val > 0:
                    idfDict[word] += 1

        for word, val in idfDict.items():
            idfDict[word] = math.log( 1+ N / 1+ float(val) ) +1
        
        return idfDict

    def fit(self):
        
        self.idfs = self.computeIDF(self.word_dict_list)
        
        for bow, wordDict  in zip(self.bow_list, self.word_dict_list) :
            tfbow = self.computeTF(wordDict, bow)
            self.tf_list.append(self.computeTF(wordDict, bow))
            self.tf_idfs.append(self.computeTFIDF(tfbow, self.idfs))
            
        return  self.tf_idfs  

# 4. Teste 

In [76]:
tfidf = Tfidf(corpus)
print(f"dictionary :{tfidf.dictionary}")

dictionary :{'one', 'first', 'third', 'document', 'is', 'amesterda', 'the', 'and', 'as', 'second', 'this', 'pipoca'}


In [77]:
print(f"Bag Of words:{tfidf.bow_list}")

Bag Of words:[['this', 'is', 'the', 'first', 'document', 'pipoca', 'pipoca', 'pipoca'], ['this', 'document', 'is', 'the', 'second', 'document', 'pipoca', 'pipoca', 'pipoca'], ['and', 'this', 'is', 'the', 'third', 'one', 'pipoca', 'pipoca', 'pipoca'], ['as', 'this', 'the', 'first', 'document', 'amesterda']]


**Word Ocurrences**

In [78]:
pd.DataFrame(tfidf.word_dict_list)

Unnamed: 0,amesterda,and,as,document,first,is,one,pipoca,second,the,third,this
0,0,0,0,1,1,1,0,3,0,1,0,1
1,0,0,0,2,0,1,0,3,1,1,0,1
2,0,1,0,0,0,1,1,3,0,1,1,1
3,1,0,1,1,1,0,0,0,0,1,0,1


In [79]:
tfidfs = tfidf.fit()

**TF**

In [80]:
pd.DataFrame(tfidf.tf_list)

Unnamed: 0,amesterda,and,as,document,first,is,one,pipoca,second,the,third,this
0,0.0,0.0,0.0,0.125,0.125,0.125,0.0,0.375,0.0,0.125,0.0,0.125
1,0.0,0.0,0.0,0.222222,0.0,0.111111,0.0,0.333333,0.111111,0.111111,0.0,0.111111
2,0.0,0.111111,0.0,0.0,0.0,0.111111,0.111111,0.333333,0.0,0.111111,0.111111,0.111111
3,0.166667,0.0,0.166667,0.166667,0.166667,0.0,0.0,0.0,0.0,0.166667,0.0,0.166667


**IDF**

In [81]:
pd.DataFrame([tfidf.idfs])

Unnamed: 0,amesterda,and,as,document,first,is,one,pipoca,second,the,third,this
0,2.791759,2.791759,2.791759,3.079442,2.94591,3.079442,2.791759,3.079442,2.791759,3.197225,2.791759,3.197225


In [82]:
pd.DataFrame(tfidfs)

Unnamed: 0,amesterda,and,as,document,first,is,one,pipoca,second,the,third,this
0,0.0,0.0,0.0,0.38493,0.368239,0.38493,0.0,1.154791,0.0,0.399653,0.0,0.399653
1,0.0,0.0,0.0,0.68432,0.0,0.34216,0.0,1.026481,0.310195,0.355247,0.0,0.355247
2,0.0,0.310195,0.0,0.0,0.0,0.34216,0.310195,1.026481,0.0,0.355247,0.310195,0.355247
3,0.465293,0.0,0.465293,0.51324,0.490985,0.0,0.0,0.0,0.0,0.532871,0.0,0.532871


## Comparação com o Scikit-learn

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names()[:5]

features_array = np.array(vectorizer.get_feature_names())
features_array[:5]

data = X.todense().tolist()


In [73]:
df = pd.DataFrame(data, columns=features_array)
df

Unnamed: 0,amesterda,and,as,document,first,is,one,pipoca,second,the,third,this
0,0.0,0.0,0.0,0.268583,0.331753,0.268583,0.0,0.805749,0.0,0.219584,0.0,0.219584
1,0.0,0.0,0.0,0.474161,0.0,0.23708,0.0,0.711241,0.371432,0.193829,0.0,0.193829
2,0.0,0.362292,0.0,0.0,0.0,0.231246,0.362292,0.693738,0.0,0.189059,0.362292,0.189059
3,0.528987,0.0,0.528987,0.337645,0.417059,0.0,0.0,0.0,0.0,0.276047,0.0,0.276047


In [74]:
for i in df.iterrows():
    values = i[1].values.tolist()
    print([features_array[i] for i, v in enumerate(values) if 0.0 < v < 0.6])

['document', 'first', 'is', 'the', 'this']
['document', 'is', 'second', 'the', 'this']
['and', 'is', 'one', 'the', 'third', 'this']
['amesterda', 'as', 'document', 'first', 'the', 'this']


## Comparação com o Pyspark

In [14]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer,CountVectorizer
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('abc').getOrCreate()

sentenceData = spark.createDataFrame([
    (0.0, 'this is the first document pipoca pipoca pipoca'),
    (0.0, 'this document is the second document pipoca pipoca pipoca'),
    (1.0, 'and this is the third one pipoca pipoca pipoca'),
    (1.0, 'as this the first document')
    
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(truncate=False)

+-----+---------------------------------------------------------+-------------------------------------------------------------------+
|label|sentence                                                 |words                                                              |
+-----+---------------------------------------------------------+-------------------------------------------------------------------+
|0.0  |this is the first document pipoca pipoca pipoca          |[this, is, the, first, document, pipoca, pipoca, pipoca]           |
|0.0  |this document is the second document pipoca pipoca pipoca|[this, document, is, the, second, document, pipoca, pipoca, pipoca]|
|1.0  |and this is the third one pipoca pipoca pipoca           |[and, this, is, the, third, one, pipoca, pipoca, pipoca]           |
|1.0  |as this the first document                               |[as, this, the, first, document]                                   |
+-----+-------------------------------------------------------

In [15]:
cv = CountVectorizer(inputCol="words", outputCol="cv_features") #vocabSize=3, minDF=2.0
model = cv.fit(wordsData)

featurizedData = model.transform(wordsData)
featurizedData.show(truncate=False)

+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+
|label|sentence                                                 |words                                                              |cv_features                                       |
+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+
|0.0  |this is the first document pipoca pipoca pipoca          |[this, is, the, first, document, pipoca, pipoca, pipoca]           |(11,[0,1,2,3,4,5],[3.0,1.0,1.0,1.0,1.0,1.0])      |
|0.0  |this document is the second document pipoca pipoca pipoca|[this, document, is, the, second, document, pipoca, pipoca, pipoca]|(11,[0,1,2,3,4,10],[3.0,1.0,1.0,2.0,1.0,1.0])     |
|1.0  |and this is the third one pipoca pipoca pipoca           |[and, this

In [16]:
print(model.vocabulary)

['pipoca', 'this', 'the', 'document', 'is', 'first', 'third', 'as', 'one', 'and', 'second']


In [17]:
featurizedData.printSchema()

root
 |-- label: double (nullable = true)
 |-- sentence: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cv_features: vector (nullable = true)



In [18]:
vector_udf = F.udf(lambda vector: vector.toArray().tolist(),ArrayType(DoubleType()))

colvalues = featurizedData.select(vector_udf('cv_features').alias('features')).collect()

spark.createDataFrame(list(map(lambda x:x.features,colvalues)),model.vocabulary).show()

+------+----+---+--------+---+-----+-----+---+---+---+------+
|pipoca|this|the|document| is|first|third| as|one|and|second|
+------+----+---+--------+---+-----+-----+---+---+---+------+
|   3.0| 1.0|1.0|     1.0|1.0|  1.0|  0.0|0.0|0.0|0.0|   0.0|
|   3.0| 1.0|1.0|     2.0|1.0|  0.0|  0.0|0.0|0.0|0.0|   1.0|
|   3.0| 1.0|1.0|     0.0|1.0|  0.0|  1.0|0.0|1.0|1.0|   0.0|
|   0.0| 1.0|1.0|     1.0|0.0|  1.0|  0.0|1.0|0.0|0.0|   0.0|
+------+----+---+--------+---+-----+-----+---+---+---+------+



In [19]:
idf = IDF(inputCol="cv_features", outputCol="idf_features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.show(truncate=False)

+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|label|sentence                                                 |words                                                              |cv_features                                       |idf_features                                                                                                                  |
+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|0.0  |this is the first document pipoca pipoca pipoca          

In [20]:
colvalues = rescaledData.select(vector_udf('idf_features').alias('features')).collect()

spark.createDataFrame(list(map(lambda x:x.features,colvalues)),model.vocabulary).show()

+------------------+----+---+-------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|            pipoca|this|the|           document|                 is|             first|             third|                as|               one|               and|            second|
+------------------+----+---+-------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|0.6694306539426294| 0.0|0.0|0.22314355131420976|0.22314355131420976|0.5108256237659907|               0.0|               0.0|               0.0|               0.0|               0.0|
|0.6694306539426294| 0.0|0.0|0.44628710262841953|0.22314355131420976|               0.0|               0.0|               0.0|               0.0|               0.0|0.9162907318741551|
|0.6694306539426294| 0.0|0.0|                0.0|0.22314355131420976|           

## 5. Referências

https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4494134497577204/2690176961360097/6933319862459084/latest.html

https://stackoverflow.com/questions/39546671/handle-unseen-categorical-string-spark-countvectorizer

http://www.tfidf.com/https://codelabs.developers.google.com/codelabs/spark-nlp/#7

http://www.tfidf.com/