# 0. Dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math 

%matplotlib inline

# 1. Introduction

The **tf–idf** value (short for term frequency–inverse document frequency)
is a statistical measure intended to indicate how important a word in a document is relative to a collection of documents, or [text corpus](https://pt.wikipedia.org/wiki/Corpus_lingu%C3%ADstico).
It is frequently used as a weighting factor in information retrieval and data mining.

The **tf–idf** value of a word increases in proportion to the number of times it occurs in a document, but it is offset by the
frequency of the word in the corpus. This helps account for the fact that some words are simply more common than others.

## Term frequency (tf)

Suppose we have a collection of text documents in Portuguese and want to determine which one is most related to the phrase "uma vaca amarela" ("a yellow cow"). A simple way to start would be to discard every document that does not contain the words "uma", "vaca", and "amarela". That alone is not enough, though, because many documents probably contain all three words. To distinguish between them further, we can **count the number of times each term occurs in each document and sum those counts; the number of times a term occurs in a document is its term frequency**.

The first form of term weighting is attributed to Hans Peter Luhn (1957) and is based on Luhn's assumption:

- The weight of a term that occurs in a document is directly proportional to its frequency.

TF = (Frequency of the word in the sentence) / (Total number of words in the sentence)

<p align="center">
  <img  src="tf.png">
</p>
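
As a minimal sketch of the ratio above, the snippet below computes term frequencies for a single tokenized document using `collections.Counter`. The helper name `term_frequency` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical example document, tokenized on whitespace.
tokens = "uma vaca amarela e uma vaca branca".split()
tf = term_frequency(tokens)
print(tf["vaca"])     # 2 occurrences out of 7 tokens
print(tf["amarela"])  # 1 occurrence out of 7 tokens
```

Because every count is divided by the same document length, the values for one document always sum to 1.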

## Inverse document frequency (idf)

However, because the term "uma" is very common, term frequency alone will emphasize documents that happen to use this word more often, without giving appropriate weight to more
meaningful terms such as "vaca" and "amarela". Unlike "vaca" and "amarela", the term "uma" is not a good keyword for distinguishing relevant documents from non-relevant ones.
The inverse document frequency is therefore incorporated to reduce the weight of terms that occur very frequently across the selected set of texts, while
increasing the weight of those that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity, called **IDF**, which became a cornerstone of term weighting:

- The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

IDF = (Total number of sentences (documents)) / (Number of sentences (documents) containing the word)

In practice this ratio is log-scaled; the implementation in Section 3 uses `log10(ratio) + 1`.

<p align="center">
  <img  src="idf.png">
</p>
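
The sketch below applies the same variant as the implementation in Section 3 of this notebook, `log10(N / df) + 1`, to a tiny corpus. The function name `inverse_document_frequency` and the three toy documents are assumptions made for illustration.

```python
import math


def inverse_document_frequency(tokenized_docs):
    """Return {term: log10(N / df) + 1}, where df counts documents containing the term."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log10(n_docs / df) + 1 for term, df in doc_freq.items()}


# Hypothetical toy corpus: "uma" appears everywhere, "amarela" only once.
docs = [
    "uma vaca amarela".split(),
    "uma vaca".split(),
    "uma casa".split(),
]
idf = inverse_document_frequency(docs)
print(idf["uma"])      # in all 3 documents: log10(3 / 3) + 1 = 1.0
print(idf["amarela"])  # in 1 of 3 documents: larger weight than "uma"
```

A term present in every document keeps a weight of 1 rather than 0, so it is down-weighted but not discarded.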

## TF-IDF pseudocode

<p align="center">
  <img width="460" height="300" src="pseudocode.png">
</p>

<img src="formula.png" align="center"/>
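
Because the formula above is only available as an image, here is a sketch of the combined weight in LaTeX, written to match the variant implemented in Section 3 (base-10 logarithm plus one). The symbols $f_{t,d}$ (count of term $t$ in document $d$), $N$ (number of documents), and $n_t$ (number of documents containing $t$) are notation chosen here and may differ from the image.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \cdot \left( \log_{10} \frac{N}{n_t} + 1 \right)
$$

For example, "pipoca" occurs 3 times among the 8 words of the first document in Section 3 and appears in 3 of the 4 documents, so

$$
\mathrm{tfidf}(\text{pipoca}, d_A) = \frac{3}{8} \cdot \left( \log_{10} \frac{4}{3} + 1 \right) \approx 0.375 \times 1.12494 \approx 0.42185
$$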

# 2. Data

In [22]:
corpus = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document'
]

# 3. Implementation

In [120]:
docA = 'this is the first document pipoca pipoca pipoca'
docB = 'this document is the second document pipoca pipoca pipoca'
docC = 'and this is the third one pipoca pipoca pipoca'
docD = 'as this the first document'

In [121]:
# wordSet = set()
# for doc in corpus:
#     wordSet= wordSet.union(set(doc.split(" ")))
    
bowA = docA.split(" ")
bowB = docB.split(" ")
bowC = docC.split(" ")
bowD = docD.split(" ")

In [124]:
wordSet = set(bowA).union(set(bowB)).union(set(bowC)).union(set(bowD))

In [125]:
wordSet

{'and',
 'as',
 'document',
 'first',
 'is',
 'one',
 'pipoca',
 'second',
 'the',
 'third',
 'this'}

In [126]:
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0)
wordDictC = dict.fromkeys(wordSet, 0)
wordDictD = dict.fromkeys(wordSet, 0)

In [127]:
wordDictA

{'one': 0,
 'third': 0,
 'second': 0,
 'as': 0,
 'pipoca': 0,
 'first': 0,
 'and': 0,
 'this': 0,
 'document': 0,
 'is': 0,
 'the': 0}

In [128]:
for word in bowA:
    wordDictA[word]+=1
    
for word in bowB:
    wordDictB[word]+=1
    
for word in bowC:
    wordDictC[word]+=1
    
for word in bowD:
    wordDictD[word]+=1

In [129]:
wordDictA

{'one': 0,
 'third': 0,
 'second': 0,
 'as': 0,
 'pipoca': 3,
 'first': 1,
 'and': 0,
 'this': 1,
 'document': 1,
 'is': 1,
 'the': 1}

In [130]:
pd.DataFrame([wordDictA, wordDictB, wordDictC, wordDictD])

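|   | and | as | document | first | is | one | pipoca | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 3 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 2 | 0 | 1 | 0 | 3 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |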
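
The four hand-written dictionaries above can also be built in a loop. The following is a sketch, assuming the same whitespace tokenization; `docs`, `bags`, `vocabulary`, and `count_rows` are names introduced here for illustration.

```python
from collections import Counter

docs = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document',
]

# Tokenize every document and collect the sorted vocabulary.
bags = [doc.split(" ") for doc in docs]
vocabulary = sorted(set(word for bag in bags for word in bag))

# One count dictionary per document, with explicit zeros for absent words.
count_rows = []
for bag in bags:
    counts = Counter(bag)
    count_rows.append({word: counts[word] for word in vocabulary})

print(count_rows[0]["pipoca"])  # 3
print(count_rows[3]["as"])      # 1
```

Passing `count_rows` to `pd.DataFrame` should give the same table as above, since each dictionary uses the same sorted keys.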


In [131]:
def computeTF(wordDict, bow):
    """
    TF = (Frequency of the word in the sentence) / (Total number of words in the sentence)
    """
    tfDict = {}
    bowCount = len(bow)  # total number of words in the document
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [132]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
tfBowC = computeTF(wordDictC, bowC)
tfBowD = computeTF(wordDictD, bowD)

In [133]:
tfBowA

{'one': 0.0,
 'third': 0.0,
 'second': 0.0,
 'as': 0.0,
 'pipoca': 0.375,
 'first': 0.125,
 'and': 0.0,
 'this': 0.125,
 'document': 0.125,
 'is': 0.125,
 'the': 0.125}

In [165]:
def computeIDF(docList):
    """
    IDF = log10((Total number of sentences (documents)) / (Number of sentences (documents) containing the word)) + 1
    """
    import math
    N = len(docList)

    # Document frequency: in how many documents each word occurs at least once
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Convert the document frequencies into IDF weights in place
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val)) + 1
        print(f"word: {word},val: {val}, IDF: {idfDict[word]}")

    return idfDict

In [166]:
idfs = computeIDF([wordDictA, wordDictB, wordDictC, wordDictD])

word: one,val: 1, IDF: 1.6020599913279625
word: third,val: 1, IDF: 1.6020599913279625
word: second,val: 1, IDF: 1.6020599913279625
word: as,val: 1, IDF: 1.6020599913279625
word: pipoca,val: 3, IDF: 1.1249387366083
word: first,val: 2, IDF: 1.3010299956639813
word: and,val: 1, IDF: 1.6020599913279625
word: this,val: 4, IDF: 1.0
word: document,val: 3, IDF: 1.1249387366083
word: is,val: 3, IDF: 1.1249387366083
word: the,val: 4, IDF: 1.0


In [167]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        print(f"word :{word} , TF :{val}, IDF: {idfs[word]}")
        tfidf[word] = val*idfs[word]
    return tfidf

In [168]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)
tfidfBowC = computeTFIDF(tfBowC, idfs)
tfidfBowD = computeTFIDF(tfBowD, idfs)

word :one , TF :0.0, IDF: 1.6020599913279625
word :third , TF :0.0, IDF: 1.6020599913279625
word :second , TF :0.0, IDF: 1.6020599913279625
word :as , TF :0.0, IDF: 1.6020599913279625
word :pipoca , TF :0.375, IDF: 1.1249387366083
word :first , TF :0.125, IDF: 1.3010299956639813
word :and , TF :0.0, IDF: 1.6020599913279625
word :this , TF :0.125, IDF: 1.0
word :document , TF :0.125, IDF: 1.1249387366083
word :is , TF :0.125, IDF: 1.1249387366083
word :the , TF :0.125, IDF: 1.0
word :one , TF :0.0, IDF: 1.6020599913279625
word :third , TF :0.0, IDF: 1.6020599913279625
word :second , TF :0.1111111111111111, IDF: 1.6020599913279625
word :as , TF :0.0, IDF: 1.6020599913279625
word :pipoca , TF :0.3333333333333333, IDF: 1.1249387366083
word :first , TF :0.0, IDF: 1.3010299956639813
word :and , TF :0.0, IDF: 1.6020599913279625
word :this , TF :0.1111111111111111, IDF: 1.0
word :document , TF :0.2222222222222222, IDF: 1.1249387366083
word :is , TF :0.1111111111111111, IDF: 1.1249387366083
wor

In [169]:
pd.DataFrame([tfidfBowA, tfidfBowB, tfidfBowC , tfidfBowD])

|   | and | as | document | first | is | one | pipoca | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.140617 | 0.162629 | 0.140617 | 0.0 | 0.421852 | 0.0 | 0.125 | 0.0 | 0.125 |
| 1 | 0.0 | 0.0 | 0.249986 | 0.0 | 0.124993 | 0.0 | 0.37498 | 0.178007 | 0.111111 | 0.0 | 0.111111 |
| 2 | 0.178007 | 0.0 | 0.0 | 0.0 | 0.124993 | 0.178007 | 0.37498 | 0.0 | 0.111111 | 0.178007 | 0.111111 |
| 3 | 0.0 | 0.320412 | 0.224988 | 0.260206 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.2 |


In [162]:
class Tfidf():
    
    def __init__(self, corpus):
        '''add the necessary parameters to __init__'''
        
        pass
    
    def fit(self, x, y):
        '''DO NOT CHANGE THE PARAMETERS OF THE FIT METHOD'''
        pass
        
    def predict(self, x):
        '''DO NOT CHANGE THE PARAMETERS OF THE PREDICT METHOD'''        
        pass
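
One possible way to fill in the template, shown only as a sketch: it keeps the `fit(x, y)` and `predict(x)` signatures, ignores `y`, and reuses the `log10(N / df) + 1` weighting from the functions above. The class name `SimpleTfidf`, the attribute `idf_`, the parameterless `__init__`, and the choice to return one dictionary per document from `predict` are all assumptions, not requirements stated in the notebook.

```python
import math
from collections import Counter


class SimpleTfidf:
    """Hypothetical TF-IDF transformer following the template's method names."""

    def __init__(self):
        self.idf_ = {}

    def fit(self, x, y):
        """Learn IDF weights from a list of strings; `y` is unused."""
        n_docs = len(x)
        doc_freq = Counter()
        for doc in x:
            doc_freq.update(set(doc.split(" ")))  # one vote per document
        self.idf_ = {w: math.log10(n_docs / df) + 1 for w, df in doc_freq.items()}
        return self

    def predict(self, x):
        """Return one {word: tf * idf} dictionary per input document."""
        rows = []
        for doc in x:
            bag = doc.split(" ")
            counts = Counter(bag)
            rows.append({
                w: (counts[w] / len(bag)) * idf for w, idf in self.idf_.items()
            })
        return rows


corpus = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document',
]
weights = SimpleTfidf().fit(corpus, None).predict(corpus)
print(round(weights[0]["pipoca"], 6))  # 0.421852, matching the table above
```

Words seen only at prediction time are silently ignored in this sketch, because `predict` iterates over the vocabulary learned in `fit`.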

# 4. Test

## Comparison with scikit-learn

In [163]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document'
]

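vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available since 1.0) returns the vocabulary as a NumPy array
features_array = vectorizer.get_feature_names_out()
features_array[:5]

# Dense list-of-lists view of the sparse TF-IDF matrix
data = X.toarray().tolist()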


In [164]:
# pandas is already imported as pd in the dependencies cell
df = pd.DataFrame(data, columns=features_array)
df

|   | and | as | document | first | is | one | pipoca | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.268583 | 0.331753 | 0.268583 | 0.0 | 0.805749 | 0.0 | 0.219584 | 0.0 | 0.219584 |
| 1 | 0.0 | 0.0 | 0.474161 | 0.0 | 0.23708 | 0.0 | 0.711241 | 0.371432 | 0.193829 | 0.0 | 0.193829 |
| 2 | 0.362292 | 0.0 | 0.0 | 0.0 | 0.231246 | 0.362292 | 0.693738 | 0.0 | 0.189059 | 0.362292 | 0.189059 |
| 3 | 0.0 | 0.623342 | 0.397871 | 0.49145 | 0.0 | 0.0 | 0.0 | 0.0 | 0.325285 | 0.0 | 0.325285 |
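
The scikit-learn values differ from the manual ones because `TfidfVectorizer`, with its default settings, uses raw counts as the term frequency, a smoothed natural-log IDF of the form `ln((1 + N) / (1 + df)) + 1`, and then scales each row to unit L2 norm. The sketch below reproduces the first row using only the standard library; the variable names are illustrative, and whitespace splitting stands in for the vectorizer's tokenizer (which gives the same tokens on this corpus).

```python
import math

corpus = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document',
]
n_docs = len(corpus)
bags = [doc.split(" ") for doc in corpus]
vocabulary = sorted(set(w for bag in bags for w in bag))

# Document frequency: number of documents containing each word.
doc_freq = {w: sum(w in bag for bag in bags) for w in vocabulary}

# Smoothed IDF with the natural logarithm (scikit-learn's default behavior).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

# Raw count times IDF for the first document, then L2 normalization.
raw = {w: bags[0].count(w) * idf[w] for w in vocabulary}
norm = math.sqrt(sum(v * v for v in raw.values()))
row0 = {w: v / norm for w, v in raw.items()}

print(round(row0["pipoca"], 6))    # 0.805749
print(round(row0["document"], 6))  # 0.268583
```

The L2 normalization is the main reason the scikit-learn weights are larger than the manual ones: each row is rescaled so that its squared entries sum to 1.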


In [97]:
# For each document, list the words whose weight is non-zero and below 0.6
for _, row in df.iterrows():
    values = row.values.tolist()
    print([features_array[j] for j, v in enumerate(values) if 0.0 < v < 0.6])

['document', 'first', 'is', 'the', 'this']
['document', 'is', 'second', 'the', 'this']
['and', 'is', 'one', 'the', 'third', 'this']
['document', 'first', 'the', 'this']


## Comparison with PySpark

In [109]:
from pyspark.ml.feature import IDF, Tokenizer, CountVectorizer
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType
spark = SparkSession.builder.appName('abc').getOrCreate()

sentenceData = spark.createDataFrame([
    (0.0, 'this is the first document pipoca pipoca pipoca'),
    (0.0, 'this document is the second document pipoca pipoca pipoca'),
    (1.0, 'and this is the third one pipoca pipoca pipoca'),
    (1.0, 'as this the first document')
    
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(truncate=False)

+-----+---------------------------------------------------------+-------------------------------------------------------------------+
|label|sentence                                                 |words                                                              |
+-----+---------------------------------------------------------+-------------------------------------------------------------------+
|0.0  |this is the first document pipoca pipoca pipoca          |[this, is, the, first, document, pipoca, pipoca, pipoca]           |
|0.0  |this document is the second document pipoca pipoca pipoca|[this, document, is, the, second, document, pipoca, pipoca, pipoca]|
|1.0  |and this is the third one pipoca pipoca pipoca           |[and, this, is, the, third, one, pipoca, pipoca, pipoca]           |
|1.0  |as this the first document                               |[as, this, the, first, document]                                   |
+-----+-------------------------------------------------------

In [105]:
cv = CountVectorizer(inputCol="words", outputCol="cv_features") #vocabSize=3, minDF=2.0
model = cv.fit(wordsData)

featurizedData = model.transform(wordsData)
featurizedData.show(truncate=False)

+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+
|label|sentence                                                 |words                                                              |cv_features                                       |
+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+
|0.0  |this is the first document pipoca pipoca pipoca          |[this, is, the, first, document, pipoca, pipoca, pipoca]           |(11,[0,1,2,3,4,5],[3.0,1.0,1.0,1.0,1.0,1.0])      |
|0.0  |this document is the second document pipoca pipoca pipoca|[this, document, is, the, second, document, pipoca, pipoca, pipoca]|(11,[0,1,2,3,4,8],[3.0,2.0,1.0,1.0,1.0,1.0])      |
|1.0  |and this is the third one pipoca pipoca pipoca           |[and, this

In [106]:
print(model.vocabulary)

['pipoca', 'document', 'the', 'this', 'is', 'first', 'one', 'third', 'second', 'and', 'as']


In [107]:
featurizedData.printSchema()

root
 |-- label: double (nullable = true)
 |-- sentence: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cv_features: vector (nullable = true)



In [110]:
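from pyspark.sql.functions import udf

# Convert Spark ML vectors to plain Python lists so they can be shown as columns
vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))

# Use featurizedData here: rescaledData is only created in a later cell
colvalues = featurizedData.select(vector_udf('cv_features').alias('features')).collect()

spark.createDataFrame(list(map(lambda x: x.features, colvalues)), model.vocabulary).show()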

+------+--------+---+----+---+-----+---+-----+------+---+---+
|pipoca|document|the|this| is|first|one|third|second|and| as|
+------+--------+---+----+---+-----+---+-----+------+---+---+
|   3.0|     1.0|1.0| 1.0|1.0|  1.0|0.0|  0.0|   0.0|0.0|0.0|
|   3.0|     2.0|1.0| 1.0|1.0|  0.0|0.0|  0.0|   1.0|0.0|0.0|
|   3.0|     0.0|1.0| 1.0|1.0|  0.0|1.0|  1.0|   0.0|1.0|0.0|
|   0.0|     1.0|1.0| 1.0|0.0|  1.0|0.0|  0.0|   0.0|0.0|1.0|
+------+--------+---+----+---+-----+---+-----+------+---+---+



In [111]:
idf = IDF(inputCol="cv_features", outputCol="idf_features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.show(truncate=False)

+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|label|sentence                                                 |words                                                              |cv_features                                       |idf_features                                                                                                                  |
+-----+---------------------------------------------------------+-------------------------------------------------------------------+--------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|0.0  |this is the first document pipoca pipoca pipoca          

In [112]:
colvalues = rescaledData.select(vector_udf('idf_features').alias('features')).collect()

spark.createDataFrame(list(map(lambda x:x.features,colvalues)),model.vocabulary).show()

+------------------+-------------------+---+----+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|            pipoca|           document|the|this|                 is|             first|               one|             third|            second|               and|                as|
+------------------+-------------------+---+----+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|0.6694306539426294|0.22314355131420976|0.0| 0.0|0.22314355131420976|0.5108256237659907|               0.0|               0.0|               0.0|               0.0|               0.0|
|0.6694306539426294|0.44628710262841953|0.0| 0.0|0.22314355131420976|               0.0|               0.0|               0.0|0.9162907318741551|               0.0|               0.0|
|0.6694306539426294|                0.0|0.0| 0.0|0.22314355131420976|           
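
The PySpark numbers follow yet another convention. Spark's `IDF` estimator documents its weight as `ln((N + 1) / (df + 1))`, with no added 1 and no normalization, and here it is multiplied by the raw counts from `CountVectorizer`. That is why words present in every document ("the", "this") get a weight of 0. Below is a standard-library sketch that recomputes some of the visible values of the first row; the function and variable names are illustrative.

```python
import math

corpus = [
    'this is the first document pipoca pipoca pipoca',
    'this document is the second document pipoca pipoca pipoca',
    'and this is the third one pipoca pipoca pipoca',
    'as this the first document',
]
n_docs = len(corpus)
bags = [doc.split(" ") for doc in corpus]


def spark_style_idf(word):
    """ln((N + 1) / (df + 1)), the formula documented for Spark's IDF."""
    doc_freq = sum(word in bag for bag in bags)
    return math.log((n_docs + 1) / (doc_freq + 1))


# Raw count in the first document times the Spark-style IDF.
first = bags[0]
for word in ["pipoca", "document", "the", "first"]:
    print(word, first.count(word) * spark_style_idf(word))
```

The printed values should agree with the first row of the table above up to floating-point rounding.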

# 5. References

https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4494134497577204/2690176961360097/6933319862459084/latest.html

https://stackoverflow.com/questions/39546671/handle-unseen-categorical-string-spark-countvectorizer

https://codelabs.developers.google.com/codelabs/spark-nlp/#7