# TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF consists of two parts: `term frequency` and `inverse document frequency`.

### $\cdot$ Text Quantification (Vectorization)
* One-Hot Encoding
* Bag of Words(BoW)
* Bag of N-Grams (BoN)
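
To make these representations concrete, here is a minimal sketch of a Bag of Words count vector and an n-gram helper. The function names `bag_of_words` and `ngrams`, and the lowercase-plus-whitespace tokenization, are assumptions made for illustration only. A one-hot style encoding would record presence (0/1) instead of counts.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples (illustrative helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
print(ngrams(["the", "cat", "sat"], 2))  # [('the', 'cat'), ('cat', 'sat')]
```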

## `Term Frequency (TF)`

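The frequency with which a word appears in a document. The words in the document must first be normalized, and the raw count is then divided by the length of the document.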

$tf(t, d) := \dfrac{f_{t,d}}{\Sigma_{t' \in d} f_{t',d}} = \dfrac{(\textit{Number of occurrences of term } t \textit{ in document } d)}{(\textit{Total number of terms in the document }d)}$

$f_{t,d}$ denotes the number of times term $t$ occurs in document $d$.

※ There is more than one way to define how important a term is to a document.
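
The term-frequency definition can be sketched in a few lines of plain Python. The function name `term_frequency` and the pre-tokenized input are assumptions for illustration, not part of any library.

```python
from collections import Counter

def term_frequency(term, tokens):
    """tf(t, d): occurrences of `term` divided by the total number of tokens in the document."""
    if not tokens:
        return 0.0  # assumption: an empty document gives tf = 0
    return Counter(tokens)[term] / len(tokens)

doc = "this is the first sentence".split()
print(term_frequency("first", doc))  # 0.2, i.e. 1 occurrence out of 5 tokens
```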

## `Inverse Document Frequency (IDF)`

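Some words have a high term frequency yet carry little importance, so we also need to consider how important a word is to the corpus as a whole.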

$idf(t, D) := \ln{\left(\dfrac{N}{1 + |\{d \in D \mid t \in d\}|}\right)} = \ln{\left(\dfrac{\textit{Total number of documents in the corpus}}{1 + \textit{Number of documents containing term } t}\right)}$

$D$ denotes the corpus, whose elements are the documents $d$; $N = |D|$ is the total number of documents.

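> * Adding 1 to the denominator avoids division by zero when the term does not appear anywhere in the corpus $\Rightarrow$ well-defined
> * Larger idf: the term is concentrated in only a few documents, so it is more important with respect to the whole corpus.
> * Smaller idf: the term appears in a large number of documents, so it is considered more generic (less important).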

※ There is more than one way to define how important a term is to the whole corpus.
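
A minimal sketch of this particular idf definition follows. The name `inverse_document_frequency` and the tokenized toy corpus are assumptions for illustration. Note that with the $+1$ in the denominator, a term that occurs in every document gets a slightly negative idf.

```python
import math

def inverse_document_frequency(term, corpus_tokens):
    """idf(t, D) = ln(N / (1 + number of documents containing t))."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        raise ValueError("corpus must contain at least one document")
    doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(n_docs / (1 + doc_freq))

corpus_tokens = [
    "this is the first sentence".split(),
    "this is my second sentence".split(),
    "is this my third sentence".split(),
]
print(round(inverse_document_frequency("first", corpus_tokens), 6))     # 0.405465 = ln(3/2)
print(round(inverse_document_frequency("sentence", corpus_tokens), 6))  # -0.287682 = ln(3/4)
```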

### TF-IDF 

$\text{tf-idf}(t, d, D) := tf(t, d) \cdot idf(t, D)$
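
As a worked instance of this product, assume a corpus of $N = 3$ documents in which the term "first" occurs once in a five-term document $d_1$ and in no other document (these counts are illustrative assumptions):

$tf(\text{first}, d_1) = \dfrac{1}{5} = 0.2$

$idf(\text{first}, D) = \ln{\left(\dfrac{3}{1 + 1}\right)} \approx 0.4055$

$\text{tf-idf}(\text{first}, d_1, D) = 0.2 \cdot 0.4055 \approx 0.0811$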

Example: tf-idf scores and the Term-Document Matrix:

In [4]:
import preprocessing

# sample documents
document_1 = "This is the first sentence!"
document_2 = "This is my second sentence."
document_3 = "Is this my third sentence?"

# corpus of documents
corpus = [document_1, document_2, document_3]
print(corpus)

# preprocess documents
# processed_corpus = [preprocess_text(doc) for doc in corpus]

['This is the first sentence!', 'This is my second sentence.', 'Is this my third sentence?']


In [29]:
# Standard libraries
import os
import re
import string
from typing import List, Optional, Union, Callable

# Third party libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, PunktSentenceTokenizer
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer
from spellchecker import SpellChecker
from names_dataset import NameDataset

from text_preprocessing import to_lower, remove_url, remove_email, remove_phone_number, remove_special_character
from text_preprocessing import remove_itemized_bullet_and_numbering, expand_contraction, remove_punctuation, remove_whitespace
from text_preprocessing import remove_stopword, check_spelling, normalize_unicode, substitute_token

def preprocess_text(input_text: str, processing_function_list: Optional[List[Callable]] = None) -> str:
    """ Preprocess an input text by executing a series of preprocessing functions specified in functions list """
    if processing_function_list is None:
        processing_function_list = [to_lower,
                                    remove_url,
                                    remove_email,
                                    remove_phone_number,
                                    remove_itemized_bullet_and_numbering,
                                    expand_contraction,
                                    check_spelling,
                                    remove_special_character,
                                    remove_punctuation,
                                    remove_whitespace,
                                    normalize_unicode,
                                    remove_stopword,
                                    substitute_token]
    # apply each preprocessing step in order
    for func in processing_function_list:
        input_text = func(input_text)
    # some steps return a list of tokens; join them back into a single string
    if isinstance(input_text, str):
        processed_text = input_text
    else:
        processed_text = ' '.join(input_text)
    return processed_text

processed_corpus = [preprocess_text(doc) for doc in corpus]
print(processed_corpus)

['first sentence', 'second sentence', 'third sentence']


`TfidfVectorizer` yields the weight of each term with respect to each document.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

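# initialise TfidfVectorizer; norm=None disables the default L2 normalisation of each row
vectoriser = TfidfVectorizer(norm = None)

# obtain weights of each term to each document in corpus (ie, tf-idf scores)
# note: this fits on the raw `corpus`, not `processed_corpus`, so stop words stay in the vocabulary
tf_idf_scores = vectoriser.fit_transform(corpus)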

The term-document matrix, built from the tf-idf scores, expresses how important each term is to each document across the whole corpus.

In [31]:
# get vocabulary of terms
feature_names = vectoriser.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
corpus_index = [n for n in processed_corpus]

import pandas as pd

# create pandas DataFrame with tf-idf scores: Term-Document Matrix
df_tf_idf = pd.DataFrame(tf_idf_scores.T.todense(), index = feature_names, columns = corpus_index)
print(df_tf_idf)

          first sentence  second sentence  third sentence
first           1.693147         0.000000        0.000000
is              1.000000         1.000000        1.000000
my              0.000000         1.287682        1.287682
second          0.000000         1.693147        0.000000
sentence        1.000000         1.000000        1.000000
the             1.693147         0.000000        0.000000
third           0.000000         0.000000        1.693147
this            1.000000         1.000000        1.000000
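
The scores in this matrix do not follow the $\ln(N / (1 + df))$ definition given earlier. With its default `smooth_idf=True`, scikit-learn documents the idf as $\ln\left(\frac{1 + N}{1 + df}\right) + 1$ and uses the raw term count as tf; with `norm=None` no further scaling is applied. The sketch below reproduces the printed values under that assumption. The helper name `sklearn_style_idf` is illustrative, not a scikit-learn function.

```python
import math

def sklearn_style_idf(n_docs, doc_freq):
    """Smoothed idf as documented for scikit-learn: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
# (term, number of documents containing it); every term occurs once per document here,
# so the tf-idf score equals 1 * idf
for term, doc_freq in [("first", 1), ("my", 2), ("sentence", 3)]:
    print(term, round(sklearn_style_idf(n_docs, doc_freq), 6))
# first 1.693147
# my 1.287682
# sentence 1.0
```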


