## TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a statistical measure used in natural language processing and information retrieval to evaluate how important a word is to a document relative to a larger collection of documents (the corpus).

It combines two components:

#### 1. Term Frequency
Measures how often a word appears in a document. A higher frequency suggests greater importance.
$$TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$
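
As a minimal sketch of the formula above, the helper below counts a term's occurrences and divides by the document length. The function name `term_frequency` and the lowercase, whitespace-based tokenization are assumptions made for illustration; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def term_frequency(term, document):
    """Share of the document's tokens that are equal to `term`."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document contains no terms
    return Counter(tokens)[term] / len(tokens)


# "cats" is one of the three tokens in this document
print(term_frequency("cats", "cats and dogs"))
```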

#### 2. Inverse Document Frequency
Reduces the weight of words that are common across many documents and increases the weight of rare ones. A term that appears in fewer documents is more likely to be meaningful and specific.
$$IDF(t, D) = \log \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}$$
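
The IDF formula can be sketched in the same way. The function name `inverse_document_frequency`, the whitespace tokenization and the choice of the natural logarithm are assumptions here, since the formula above does not fix a base. A term that occurs in no document would cause a division by zero, so this sketch raises an error in that case.

```python
import math


def inverse_document_frequency(term, corpus):
    """log(N / df), where df is the number of documents containing `term`."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    tokenized = [doc.lower().split() for doc in corpus]
    n_containing = sum(term in tokens for tokens in tokenized)
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)  # natural log (assumed)


corpus = ["cats and dogs", "cats only", "no"]
# "cats" is in two of three documents, "dogs" in only one,
# so "dogs" receives the larger weight.
print(inverse_document_frequency("cats", corpus))
print(inverse_document_frequency("dogs", corpus))
```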

#### Combination

Multiplying the two components highlights terms that are both frequent within a specific document and distinctive across the corpus, which makes TF-IDF a useful tool for tasks like search ranking, text classification and keyword extraction.

For each term $t$ in a document $d$, the final TF-IDF score is calculated as:
$$\text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)$$
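
To make the product concrete, here is a small sketch that scores every term of one document against a toy corpus. The name `tf_idf_scores`, the whitespace tokenization and the natural logarithm are illustrative assumptions, and the document is assumed to be part of the corpus so that every term has a document frequency of at least one.

```python
import math
from collections import Counter


def tf_idf_scores(document, corpus):
    """Return {term: TF-IDF score} for every term in `document`."""
    # Assumptions: whitespace tokenization, natural log,
    # and `document` is itself one of the documents in `corpus`.
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    tokens = document.lower().split()
    scores = {}
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)
        n_containing = sum(term in doc_tokens for doc_tokens in tokenized_corpus)
        idf = math.log(len(corpus) / n_containing)
        scores[term] = tf * idf
    return scores


corpus = ["cats and dogs", "cats only", "no"]
scores = tf_idf_scores(corpus[0], corpus)
# Print terms from highest to lowest score
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.4f}")
```

In this corpus `cats` ends up with the lowest score in the first document because it also appears in the second one, while `and` and `dogs` each occur in a single document.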

### Applications
- **Document Similarity and Clustering**: By converting documents into numerical vectors, TF-IDF enables comparison and grouping of related texts. This is valuable for clustering news articles, research papers or customer support tickets into meaningful categories.
- **Text Classification**: It helps identify patterns in text for spam filtering, sentiment analysis and topic classification.
- **Keyword Extraction**: It ranks words by importance, making it possible to automatically highlight key terms, generate document tags or create concise summaries.
- **Recommendation Systems**: By comparing textual descriptions, TF-IDF can help suggest related articles, videos or products, which can improve user engagement.
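
The document-similarity use case can be sketched with scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The three sentences are made-up illustrative data, and the vectorizer runs with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up example documents: two about pets, one about finance
docs = [
    "cats and dogs are popular pets",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
]

# Each row of `vectors` is the TF-IDF vector of one document
vectors = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all documents (3 x 3 matrix)
similarity = cosine_similarity(vectors)
print(similarity.round(2))
```

The two pet-related documents share several terms and therefore get a positive similarity, while the finance document shares no terms with either of them and gets a similarity of zero.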

In [2]:
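from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three short documents
d0 = 'cats and dogs'
d1 = 'cats only'
d2 = 'no'
documents = [d0, d1, d2]

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
tfidf = TfidfVectorizer()
result = tfidf.fit_transform(documents)

# idf_ follows the order of get_feature_names_out() (terms sorted alphabetically)
print('\nidf values:')
for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(term, ':', idf)

# vocabulary_ maps each term to its column index in the matrix
print('\nWord indexes:')
print(tfidf.vocabulary_)

# One row per document, one column per term
print('\ntf-idf values in matrix form:')
print(result.toarray())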


idf values:
and : 1.6931471805599454
cats : 1.2876820724517808
dogs : 1.6931471805599454
no : 1.6931471805599454
only : 1.6931471805599454

Word indexes:
{'cats': 1, 'and': 0, 'dogs': 2, 'only': 4, 'no': 3}

tf-idf values in matrix form:
[[0.62276601 0.4736296  0.62276601 0.         0.        ]
 [0.         0.60534851 0.         0.         0.79596054]
 [0.         0.         0.         1.         0.        ]]
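
The IDF values printed above differ from what the plain formula $\log(N / df)$ would give: for `cats` it would be $\ln(3/2) \approx 0.405$, not $1.288$. With its default settings, `TfidfVectorizer` uses a smoothed IDF, $\ln\frac{1 + N}{1 + df(t)} + 1$, multiplies it by the raw term count instead of the length-normalized TF, and then scales each row to unit L2 norm. The sketch below reproduces the printed numbers in plain Python. It assumes whitespace tokenization, which matches scikit-learn's tokenizer for this toy corpus but not in general.

```python
import math
from collections import Counter

docs = ["cats and dogs", "cats only", "no"]
tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Raw count times IDF for each vocabulary term, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]

for term in vocabulary:
    print(term, ":", idf[term])
for row in matrix:
    print([round(v, 8) for v in row])
```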
