#### TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a statistical measure of how important a word is to a document within a collection (corpus). It reduces the weight of words such as "the," "is," and "and," which occur in most documents and therefore carry little information about any one of them.

#### 1. Term Frequency (TF):
The number of times a term appears in a document, divided by the total number of terms in that document.

$ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $
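
The TF formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` and `term_frequency` helpers are hypothetical names, and the tokenizer (lowercase, split on non-word characters) is an assumption, since the definition above does not prescribe one.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def term_frequency(term, document):
    """Occurrences of `term` divided by the total number of tokens."""
    tokens = tokenize(document)
    if not tokens:
        return 0.0  # assumption: an empty document has TF 0 for every term
    return tokens.count(term.lower()) / len(tokens)


# "the" accounts for 2 of the 6 tokens in this made-up sentence
print(term_frequency("the", "The cat sat on the mat."))
```

Because the result is a ratio, a long document does not score higher simply for containing more words.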

#### 2. Inverse Document Frequency (IDF):

The logarithm of the total number of documents divided by the number of documents that contain the term. Terms that appear in many documents get a low IDF, because they do little to distinguish one document from another.

$ \text{IDF}(t,D) = \log \left( \frac{\text{Total number of documents in } D}{\text{Number of documents containing term } t} \right) $
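
A small sketch of the IDF formula follows. It assumes the natural logarithm (the formula above does not fix a base) and the same simple tokenizer as before; the helper names and the three-sentence corpus are made up for illustration. Returning 0 for a term that occurs in no document is also an assumption, made only to avoid dividing by zero.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def inverse_document_frequency(term, documents):
    """log(total documents / documents containing `term`), natural log."""
    term = term.lower()
    containing = sum(1 for doc in documents if term in tokenize(doc))
    if containing == 0:
        return 0.0  # assumption: unseen terms get IDF 0 instead of an error
    return math.log(len(documents) / containing)


# Hypothetical corpus, not taken from the text
corpus = ["the cat sat", "the dog barked", "a bird sang"]
print(inverse_document_frequency("the", corpus))   # in 2 of 3 documents
print(inverse_document_frequency("bird", corpus))  # in 1 of 3 documents
```

The rarer term ("bird") receives the larger value, which is the behavior the definition describes.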

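#### 3. TF-IDF:
The product of TF and IDF, used to weight each term in each document. A term scores highly when it is frequent in the document but rare in the rest of the collection.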

$ \text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D) $

___
___

### Example:

Let’s take the following three documents:

1. "I love programming in Python."
2. "Python is a great programming language."
3. "I love learning Python."

We will calculate the TF-IDF for the word "Python" in each document.

#### Step 1: Calculate Term Frequency (TF)
For each document, we calculate the frequency of the term "Python."

- **Document 1**: "I love programming in Python."
  - Term frequency of "Python": $ \frac{1}{5} = 0.2 $

- **Document 2**: "Python is a great programming language."
  - Term frequency of "Python": $ \frac{1}{6} \approx 0.167 $

- **Document 3**: "I love learning Python."
  - Term frequency of "Python": $ \frac{1}{4} = 0.25 $

#### Step 2: Calculate Inverse Document Frequency (IDF)
Next, we measure how rare the term "Python" is across the whole collection.

The word "Python" appears in all three documents. So, the IDF will be:

$ \text{IDF(Python)} = \log \left( \frac{3}{3} \right) = 0 $

Since the IDF of "Python" is 0 (it appears in all documents), it will have no distinguishing power.
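
For contrast, here is the same calculation for two terms that do not appear in every document. The numeric values assume the natural logarithm, since the formula above does not fix a base:

- "love" appears in documents 1 and 3: $ \text{IDF(love)} = \log \left( \frac{3}{2} \right) \approx 0.405 $
- "learning" appears only in document 3: $ \text{IDF(learning)} = \log \left( \frac{3}{1} \right) \approx 1.099 $

The fewer documents a term appears in, the larger its IDF.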

#### Step 3: Calculate TF-IDF
Now, multiply the TF and IDF for each document.

- **Document 1**: TF-IDF = $ 0.2 \times 0 = 0 $
- **Document 2**: TF-IDF = $ 0.167 \times 0 = 0 $
- **Document 3**: TF-IDF = $ 0.25 \times 0 = 0 $

Since the IDF for "Python" is 0, its TF-IDF score is also 0 in all documents. This shows that "Python" is not a good keyword for distinguishing these documents.
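
The three steps can be checked with a short script. This is a minimal sketch, assuming a simple tokenizer (lowercase, split on non-word characters) and the natural logarithm; the helper names `tokenize`, `tf`, `idf`, and `tf_idf` are illustrative. It also scores "learning", which occurs only in document 3, to show a non-zero result next to the zeros for "python".

```python
import math
import re

documents = [
    "I love programming in Python.",
    "Python is a great programming language.",
    "I love learning Python.",
]


def tokenize(text):
    """Lowercase the text and return its word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def tf(term, tokens):
    """Step 1: share of the document's tokens that equal `term`."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Step 2: log of (number of documents / documents containing `term`)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0  # assumption: unseen terms get IDF 0
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """Step 3: product of TF and IDF."""
    return tf(term, tokens) * idf(term, corpus)


corpus = [tokenize(doc) for doc in documents]

for term in ("python", "learning"):
    scores = [round(tf_idf(term, tokens, corpus), 3) for tokens in corpus]
    print(term, scores)
# python [0.0, 0.0, 0.0]
# learning [0.0, 0.0, 0.275]
```

"learning" is absent from documents 1 and 2, so it scores 0 there, while in document 3 its TF of 0.25 is multiplied by an IDF of about 1.099.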


___
___
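
The scikit-learn example below reports non-zero weights for "python", unlike the hand calculation above. With its default settings, `TfidfVectorizer` lowercases the text, ignores single-character tokens such as "I" and "a", uses raw counts as TF, applies a smoothed IDF of $ \ln \left( \frac{1 + n}{1 + \text{df}(t)} \right) + 1 $, and scales each document vector to unit L2 length. The following plain-Python sketch mimics those defaults to show where the numbers come from; the helper names are illustrative and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

documents = [
    "I love programming in Python.",
    "Python is a great programming language.",
    "I love learning Python.",
]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


corpus = [tokenize(doc) for doc in documents]
n_docs = len(corpus)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in corpus for term in set(tokens))


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, never zero."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalized per document."""
    counts = Counter(tokens)
    weights = {term: n * smoothed_idf(term) for term, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Weights of "python" per document; these agree with the "python"
# column of the scikit-learn output to six decimals.
for tokens in corpus:
    print(round(tfidf_row(tokens)["python"], 6))
```

Because of the added 1, a term that occurs in every document keeps an IDF of 1 rather than 0, so it still contributes to the vector, only with the smallest weight per occurrence.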

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = [
    "I love programming in Python.",
    "Python is a great programming language.",
    "I love learning Python.",
]

# Initialize the vectorizer with scikit-learn's default settings
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a DataFrame: one row per document, one column per term
df_tfidf = pd.DataFrame(
    tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out()
)

print(df_tfidf)


```

Output (scikit-learn's defaults use a smoothed IDF and L2 normalization, so these weights differ from the hand calculation above and "python" is not 0):

```text
      great        in        is  language  learning      love  programming    python
0  0.000000  0.631745  0.000000  0.000000  0.000000  0.480458     0.480458  0.373119
1  0.504611  0.000000  0.504611  0.504611  0.000000  0.000000     0.383770  0.298032
2  0.000000  0.000000  0.000000  0.000000  0.720333  0.547832     0.000000  0.425441
```
