## Reproducing the TF-IDF vectorization performed by scikit-learn

Textbooks usually introduce TF-IDF text vectorization with formulas that are not always the ones
NLP *frameworks* use in practice.

This short *notebook* presents the topic concisely: it starts from the classic expressions and then
shows how TF-IDF is actually computed in scikit-learn.

### IDF

The inverse document frequency (*IDF*) of a term (or word) across the documents of a *corpus* is a
weight that penalizes words that appear in many documents and rewards those that occur rarely in the
collection.

Although implementations vary slightly, the formula textbooks commonly give for the IDF of a word $w$
with respect to a *corpus* is:

$$\text{IDF}_{w} = \log\frac{N}{n_w},$$

where $n_w$ is the number of documents that contain the word $w$ and $N$ is the total number of documents in the *corpus*.

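The textbook expression above can be sketched as a small function. This is only an illustration: the name `classic_idf` and its argument names are hypothetical, and the natural logarithm is assumed because `np.log` is what the cells later in this notebook use.

```python
import math


def classic_idf(n_docs: int, doc_freq: int) -> float:
    """Textbook IDF: log(N / n_w), using the natural logarithm."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log(n_docs / doc_freq)


# A term present in all 6 documents gets weight 0.
print(classic_idf(6, 6))
# A term present in a single document gets log(6), about 1.79.
print(classic_idf(6, 1))
```

With six documents these are the two extremes of the classic IDF: every other term falls between 0 and $\log 6$.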
### TF-IDF

The TF-IDF of a word $w$ in a document $d$ of a *corpus* is then computed as:

$$\text{TF-IDF}_{w,d} = \text{TF}_{w,d} \times \text{IDF}_w,$$

where $\text{TF}_{w,d}$ is the number of times $w$ occurs in $d$.

> **note**: under the formulas above, a term that appears in every document ($n_w=N$) has
> $\text{IDF}_w=0$, so its $\text{TF-IDF}$ is penalized all the way down to zero. Conversely, a term $w$
> that occurs in a single document has $\text{IDF}_{w}=\log N$, the largest possible IDF, which gives
> $\text{TF-IDF}_{w,d} = \text{TF}_{w,d} \times \log N$.

<br>

#### Example

Consider, for example, a database of contracts signed by a public administration agency. It could
include an "Object description" field holding a brief description of what each contract covers, as follows:

|Contract no.| Object description (original text)    | English gloss                      |
|:-----------|:--------------------------------------|:-----------------------------------|
|01          |"manutenção de ar condicionado"        |air-conditioning maintenance        |
|02          |"contratação de serviço"               |hiring of a service                 |
|03          |"contratação de pintor"                |hiring of a painter                 |
|04          |"serviço de hemodiálise"               |hemodialysis service                |
|05          |"contratação de serviço de pintor"     |hiring of a painter's service       |
|06          |"aquisição de peças de ar condicionado"|purchase of air-conditioning parts  |

In the steps below we use the classic formulas to compute the $\text{TF-IDF}$ vectorization of this *corpus*.

#### Step 1:

Compute the frequency of each term (word) in each document of the *corpus*:

In [28]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["manutenção de ar condicionado",
          "contratação de serviço",
          "contratação de pintor",
          "serviço de hemodiálise",
          "contratação de serviço de pintor",
          "aquisição de peças de ar condicionado"]

tf_vectorizer = CountVectorizer()

# term counts per document: TF(w,d)
tf_w_d = tf_vectorizer.fit_transform(corpus).toarray()

# number of documents in the corpus
N = len(corpus)

# vocabulary learned from the corpus
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0)
vocab = tf_vectorizer.get_feature_names_out()

# LaTeX labels for the documents that make up the corpus
rotulos = [r'$\text{doc}_' + str(i+1) + '$' for i in range(N)]

# show the TF vectorization as a pandas DataFrame
pd.DataFrame(tf_w_d, columns=vocab, index=rotulos)

| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0|1|1|0|1|0|1|0|0|0|
|$\text{doc}_2$|0|0|0|1|1|0|0|0|0|1|
|$\text{doc}_3$|0|0|0|1|1|0|0|0|1|0|
|$\text{doc}_4$|0|0|0|0|1|1|0|0|0|1|
|$\text{doc}_5$|0|0|0|1|2|0|0|0|1|1|
|$\text{doc}_6$|1|1|1|0|2|0|0|1|0|0|

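To make clear what the term-frequency matrix contains, the sketch below counts the words of each document with the standard library only. Whitespace tokenization is an assumption made for this illustration. `CountVectorizer` uses its own tokenizer, which by default also drops single-character tokens, but for this *corpus* both approaches yield the same counts.

```python
from collections import Counter

corpus = ["manutenção de ar condicionado",
          "contratação de serviço",
          "contratação de pintor",
          "serviço de hemodiálise",
          "contratação de serviço de pintor",
          "aquisição de peças de ar condicionado"]

# Assumption: lowercase the text and split on whitespace.
term_counts = [Counter(doc.lower().split()) for doc in corpus]

for i, counts in enumerate(term_counts, start=1):
    print(f"doc_{i}:", dict(counts))
```

Each `Counter` corresponds to one row of the table above, restricted to the words that actually occur in that document.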

#### Step 2:

Compute the inverse document frequency of each vocabulary term across the documents of the *corpus*,
using the formula $\text{IDF}_w = \log\tfrac{N}{n_w}$:

In [29]:
# n_w: number of documents that contain each term
n_w = (tf_w_d > 0).sum(axis=0)

# IDF(w)
idf_w = np.log(N / n_w)
pd.DataFrame(idf_w, index=vocab, columns=[r'$\text{IDF}_w$'])

| |$\text{IDF}_w$|
|:--|:--|
|aquisição|1.791759|
|ar|1.098612|
|condicionado|1.098612|
|contratação|0.693147|
|de|0.0|
|hemodiálise|1.791759|
|manutenção|1.791759|
|peças|1.791759|
|pintor|1.098612|
|serviço|0.693147|


#### Step 3:

Compute the $\text{TF-IDF}$ vectorization of the documents of the *corpus*,
using the formula $\text{TF-IDF}_{w,d} = \text{TF}_{w,d} \times \text{IDF}_{w}$:


In [30]:
# TF-IDF(w,d): each column of counts is scaled by the IDF of its term
tfidf_w_d = tf_w_d * idf_w

# show the TF-IDF vectorization as a pandas DataFrame
pd.DataFrame(tfidf_w_d, columns=vocab, index=rotulos)

| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0.0|1.098612|1.098612|0.0|0.0|0.0|1.791759|0.0|0.0|0.0|
|$\text{doc}_2$|0.0|0.0|0.0|0.693147|0.0|0.0|0.0|0.0|0.0|0.693147|
|$\text{doc}_3$|0.0|0.0|0.0|0.693147|0.0|0.0|0.0|0.0|1.098612|0.0|
|$\text{doc}_4$|0.0|0.0|0.0|0.0|0.0|1.791759|0.0|0.0|0.0|0.693147|
|$\text{doc}_5$|0.0|0.0|0.0|0.693147|0.0|0.0|0.0|0.0|1.098612|0.693147|
|$\text{doc}_6$|1.791759|1.098612|1.098612|0.0|0.0|0.0|0.0|1.791759|0.0|0.0|


#### $\text{TF-IDF}$ vectorization of the *corpus* according to scikit-learn

Now that we have computed $\text{TF-IDF}$ by hand with the classic formulas,
we instantiate an object of the class `TfidfVectorizer` to obtain the scikit-learn
version, $\text{TF-IDF}_{sk}$.

<br>

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# default parameters are used throughout
tfidf_vectorizer = TfidfVectorizer()

tfidf_sk = tfidf_vectorizer.fit_transform(corpus).toarray()

# show the TF-IDF vectorization as a pandas DataFrame
pd.DataFrame(tfidf_sk, columns=vocab, index=rotulos)

| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0.0|0.514331|0.514331|0.0|0.278423|0.0|0.627222|0.0|0.0|0.0|
|$\text{doc}_2$|0.0|0.0|0.0|0.644007|0.412927|0.0|0.0|0.0|0.0|0.644007|
|$\text{doc}_3$|0.0|0.0|0.0|0.59612|0.382222|0.0|0.0|0.0|0.706079|0.0|
|$\text{doc}_4$|0.0|0.0|0.0|0.0|0.342849|0.772358|0.0|0.0|0.0|0.534713|
|$\text{doc}_5$|0.0|0.0|0.0|0.445109|0.570793|0.0|0.0|0.0|0.527212|0.445109|
|$\text{doc}_6$|0.491887|0.403355|0.403355|0.0|0.436697|0.0|0.0|0.491887|0.0|0.0|


Comparing the results of *Step 3* with those returned by scikit-learn, the difference is evident.
It is explained by the fact that scikit-learn uses alternative formulas to compute the
$\text{TF-IDF}$ vectorization.

According to the documentation of the class
[TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer),
the formula used to compute the $\text{TF-IDF}_{w,d}$ of a term $w$ in a document $d$ of a document set
is:

$$\text{TF-IDF}_{w, d} = \text{TF}_{w,d} \times \text{IDF}_w,$$

where $\text{IDF}_w$ is computed as:

- $\text{IDF}_w = \log\tfrac{1+N}{1+n_w} + 1$, if `smooth_idf = True`
- $\text{IDF}_w = \log\tfrac{N}{n_w} + 1$, if `smooth_idf = False`

> $\text{obs}_1$: $N$ is the total number of documents in the document set;
>
> $\text{obs}_2$: $n_w$ is the document frequency of $w$, that is, the number of documents in the *corpus*
> that contain the term $w$; and
>
> $\text{obs}_3$: `smooth_idf` is a parameter of the `TfidfTransformer` constructor (also accepted by
> `TfidfVectorizer`), and its default value is `True`.

The effect of adding "1" to $\text{IDF}_w$ in the expressions above is that terms occurring in every document,
such as the word "de" in the example *corpus*, are not ignored entirely.

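As a quick check of the two expressions, the sketch below compares them with the `idf_` attribute that a fitted `TfidfVectorizer` exposes. The variable names are chosen for this illustration only, and the block rebuilds the *corpus* so that it can run on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["manutenção de ar condicionado",
          "contratação de serviço",
          "contratação de pintor",
          "serviço de hemodiálise",
          "contratação de serviço de pintor",
          "aquisição de peças de ar condicionado"]

# Document frequency n_w of every vocabulary term.
counts = CountVectorizer().fit_transform(corpus).toarray()
N = counts.shape[0]
n_w = (counts > 0).sum(axis=0)

# The two IDF expressions quoted from the documentation.
idf_smooth = np.log((1 + N) / (1 + n_w)) + 1
idf_plain = np.log(N / n_w) + 1

# IDF weights learned by scikit-learn under each setting.
sk_smooth = TfidfVectorizer(smooth_idf=True).fit(corpus).idf_
sk_plain = TfidfVectorizer(smooth_idf=False).fit(corpus).idf_

print(np.allclose(idf_smooth, sk_smooth))
print(np.allclose(idf_plain, sk_plain))
```

Both comparisons are expected to print `True`. Note that the word "de", present in all six documents, receives a weight of exactly 1 under either setting rather than 0.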
#### Redoing Steps 2 and 3

We can now redo *Steps 2 and 3* and check whether we reproduce the scikit-learn results. As a first
attempt, we use the `smooth_idf = False` expression:

In [32]:
# IDF(w) without smoothing, plus one
idf_w = np.log(N / n_w) + 1

# TF-IDF(w,d)
tfidf_w_d = tf_w_d * idf_w

# show the TF-IDF vectorization as a pandas DataFrame
pd.DataFrame(tfidf_w_d, columns=vocab, index=rotulos)

| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0.0|2.098612|2.098612|0.0|1.0|0.0|2.791759|0.0|0.0|0.0|
|$\text{doc}_2$|0.0|0.0|0.0|1.693147|1.0|0.0|0.0|0.0|0.0|1.693147|
|$\text{doc}_3$|0.0|0.0|0.0|1.693147|1.0|0.0|0.0|0.0|2.098612|0.0|
|$\text{doc}_4$|0.0|0.0|0.0|0.0|1.0|2.791759|0.0|0.0|0.0|1.693147|
|$\text{doc}_5$|0.0|0.0|0.0|1.693147|2.0|0.0|0.0|0.0|2.098612|1.693147|
|$\text{doc}_6$|2.791759|2.098612|2.098612|0.0|2.0|0.0|0.0|2.791759|0.0|0.0|


Comparing these results with those produced by `TfidfVectorizer()`, we still see discrepancies. What
is missing? Just one more detail: scikit-learn normalizes the vector representation of each
document so that its Euclidean norm is one: $||doc_n||_2 = 1$.

As an example, suppose we want to normalize the vector representation of the first document (the first
row of the last table shown).

To avoid confusion, in the explanation below we call the non-normalized representation of this document
$d_1$ and its normalized version $doc_1$. We then have:

$$\textit{doc}_1 = \frac{d_1}{||d_1||_2}, \text{ where}$$

$$||d_1||_2 = \sqrt{0.00^2+2.09^2+2.09^2+0.00^2+1.00^2+0.00^2+2.79^2+0.00^2+0.00^2+0.00^2} \approx 4.19$$

hence:

$$\textit{doc}_{1} \approx \begin{bmatrix} \tfrac{0.00}{4.19} & \tfrac{2.09}{4.19} & \tfrac{2.09}{4.19} & \tfrac{0.00}{4.19} & \tfrac{1.00}{4.19} & \tfrac{0.00}{4.19} & \tfrac{2.79}{4.19} & \tfrac{0.00}{4.19} & \tfrac{0.00}{4.19} & \tfrac{0.00}{4.19} \end{bmatrix}$$

$$\textit{doc}_{1} \approx \begin{bmatrix} 0.00 & 0.50 & 0.50 & 0.00 & 0.23 & 0.00 & 0.66 & 0.00 & 0.00 & 0.00 \end{bmatrix}$$

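The hand calculation above can be repeated at full precision for this single vector. In the sketch below the entries of `d_1` are copied from the first row of the last table. The names `d_1`, `norm_d1` and `doc_1` simply mirror the notation of the text.

```python
import numpy as np

# First row of the non-normalized TF-IDF table (smooth_idf = False attempt).
d_1 = np.array([0.0, 2.098612, 2.098612, 0.0, 1.0,
                0.0, 2.791759, 0.0, 0.0, 0.0])

# Euclidean (L2) norm of the vector.
norm_d1 = np.linalg.norm(d_1)

# Dividing by the norm yields a vector of unit length.
doc_1 = d_1 / norm_d1

print(norm_d1)
print(np.round(doc_1, 6))
print(np.linalg.norm(doc_1))
```

The last line confirms the defining property of the normalized vector: its Euclidean norm is 1, up to floating-point rounding.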
In [33]:
# divide each row (document) by its own Euclidean norm
tfidf_norm = tfidf_w_d / np.linalg.norm(tfidf_w_d, axis=1, keepdims=True)

pd.DataFrame(tfidf_norm, index=rotulos, columns=vocab)


| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0.0|0.500205|0.500205|0.0|0.23835|0.0|0.665417|0.0|0.0|0.0|
|$\text{doc}_2$|0.0|0.0|0.0|0.652491|0.385372|0.0|0.0|0.0|0.0|0.652491|
|$\text{doc}_3$|0.0|0.0|0.0|0.588732|0.347715|0.0|0.0|0.0|0.729718|0.0|
|$\text{doc}_4$|0.0|0.0|0.0|0.0|0.292845|0.817554|0.0|0.0|0.0|0.49583|
|$\text{doc}_5$|0.0|0.0|0.0|0.450304|0.531914|0.0|0.0|0.0|0.55814|0.450304|
|$\text{doc}_6$|0.523899|0.393824|0.393824|0.0|0.375318|0.0|0.0|0.523899|0.0|0.0|


These normalized values are closer to the scikit-learn output but still do not match it: for "ar" in
$\text{doc}_1$ we obtain 0.500205, whereas scikit-learn returned 0.514331. The remaining difference
comes from the IDF expression. The calculation above used the `smooth_idf = False` form, while
`TfidfVectorizer()` was instantiated with its default, `smooth_idf = True`.

#### Redoing Steps 2 and 3 with `smooth_idf = True`

We can now redo *Steps 2 and 3* and check whether we reproduce the scikit-learn results. This time we
use the $\text{IDF}_w$ expression that matches how the `TfidfVectorizer` object was configured: it was
created with default parameters, so `smooth_idf = True` applies:

In [39]:
# IDF(w) with smoothing (smooth_idf = True), plus one
idf_w = np.log((1 + N) / (1 + n_w)) + 1

# TF-IDF(w,d)
tfidf_w_d = tf_w_d * idf_w

# show the TF-IDF vectorization as a pandas DataFrame
pd.DataFrame(tfidf_w_d, columns=vocab, index=rotulos)

| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0.0|1.847298|1.847298|0.0|1.0|0.0|2.252763|0.0|0.0|0.0|
|$\text{doc}_2$|0.0|0.0|0.0|1.559616|1.0|0.0|0.0|0.0|0.0|1.559616|
|$\text{doc}_3$|0.0|0.0|0.0|1.559616|1.0|0.0|0.0|0.0|1.847298|0.0|
|$\text{doc}_4$|0.0|0.0|0.0|0.0|1.0|2.252763|0.0|0.0|0.0|1.559616|
|$\text{doc}_5$|0.0|0.0|0.0|1.559616|2.0|0.0|0.0|0.0|1.847298|1.559616|
|$\text{doc}_6$|2.252763|1.847298|1.847298|0.0|2.0|0.0|0.0|2.252763|0.0|0.0|


These values still differ from the `TfidfVectorizer()` output because one step is pending:
scikit-learn normalizes the vector representation of each document so that its Euclidean norm is
one: $||doc_n||_2 = 1$.

As an example, take the first document (the first row of the last table shown). To avoid confusion,
we call the non-normalized representation of this document $d_1$ and its normalized version $doc_1$.
We then have:

$$\textit{doc}_1 = \frac{d_1}{||d_1||_2}, \text{ where}$$

$$||d_1||_2 = \sqrt{0.00^2+1.84^2+1.84^2+0.00^2+1.00^2+0.00^2+2.25^2+0.00^2+0.00^2+0.00^2} \approx 3.59$$

hence:

$$\textit{doc}_{1} \approx \begin{bmatrix} \tfrac{0.00}{3.59} & \tfrac{1.84}{3.59} & \tfrac{1.84}{3.59} & \tfrac{0.00}{3.59} & \tfrac{1.00}{3.59} & \tfrac{0.00}{3.59} & \tfrac{2.25}{3.59} & \tfrac{0.00}{3.59} & \tfrac{0.00}{3.59} & \tfrac{0.00}{3.59} \end{bmatrix}$$

$$\textit{doc}_{1} \approx \begin{bmatrix} 0.00 & 0.51 & 0.51 & 0.00 & 0.27 & 0.00 & 0.62 & 0.00 & 0.00 & 0.00 \end{bmatrix}$$

> **note**: throughout the calculation of the vector $\textit{doc}_{1}$, the numerical results are computed
> at full precision and shown *truncated* to two decimal places.

In [40]:
# divide each row (document) by its own Euclidean norm
tfidf_norm = tfidf_w_d / np.linalg.norm(tfidf_w_d, axis=1, keepdims=True)

pd.DataFrame(tfidf_norm, index=rotulos, columns=vocab)

| |aquisição|ar|condicionado|contratação|de|hemodiálise|manutenção|peças|pintor|serviço|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|$\text{doc}_1$|0.0|0.514331|0.514331|0.0|0.278423|0.0|0.627222|0.0|0.0|0.0|
|$\text{doc}_2$|0.0|0.0|0.0|0.644007|0.412927|0.0|0.0|0.0|0.0|0.644007|
|$\text{doc}_3$|0.0|0.0|0.0|0.59612|0.382222|0.0|0.0|0.0|0.706079|0.0|
|$\text{doc}_4$|0.0|0.0|0.0|0.0|0.342849|0.772358|0.0|0.0|0.0|0.534713|
|$\text{doc}_5$|0.0|0.0|0.0|0.445109|0.570793|0.0|0.0|0.0|0.527212|0.445109|
|$\text{doc}_6$|0.491887|0.403355|0.403355|0.0|0.436697|0.0|0.0|0.491887|0.0|0.0|


The result above matches, to the six decimal places shown, the vectorization produced by scikit-learn!
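
Comparing two tables by eye only goes as far as the printed digits, so a numerical comparison is a useful complement. The sketch below repeats the whole pipeline in one self-contained block and compares it with `TfidfVectorizer` using `np.allclose`. It uses `sklearn.preprocessing.normalize` for the row-wise L2 normalization, as an alternative to dividing by `np.linalg.norm`. The variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

corpus = ["manutenção de ar condicionado",
          "contratação de serviço",
          "contratação de pintor",
          "serviço de hemodiálise",
          "contratação de serviço de pintor",
          "aquisição de peças de ar condicionado"]

# Step 1: raw term counts TF(w,d).
tf = CountVectorizer().fit_transform(corpus).toarray()

# Step 2: smoothed IDF plus one (the smooth_idf = True expression).
N = tf.shape[0]
n_w = (tf > 0).sum(axis=0)
idf = np.log((1 + N) / (1 + n_w)) + 1

# Step 3: TF-IDF, then scale every row to unit Euclidean norm.
manual = normalize(tf * idf, norm="l2", axis=1)

# Reference: scikit-learn with default parameters.
sk = TfidfVectorizer().fit_transform(corpus).toarray()

print(np.allclose(manual, sk))
print(np.abs(manual - sk).max())
```

The first line is expected to print `True`. The second shows the largest absolute difference between the two matrices, which should be zero or at the level of floating-point rounding.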