## Understanding TfidfVectorizer

`TfidfVectorizer` is a feature-extraction class in the scikit-learn library that converts a collection of text documents into a numerical matrix of TF-IDF features, suitable for use in machine learning models.

- **TF (Term Frequency)**: This quantifies the frequency of a term within a single document, giving higher weight to terms that occur more frequently.

- **IDF (Inverse Document Frequency)**: This measures how informative a term is across the entire document corpus. It reduces the weight of terms that appear in many documents and so raises the relative importance of rarer terms. In its basic form, the IDF of a term is the logarithm of the ratio of the total number of documents to the number of documents containing the term. scikit-learn's default applies a smoothed variant, `ln((1 + N) / (1 + df)) + 1`, where `N` is the number of documents and `df` is the number of documents containing the term.

- **TF-IDF**: This metric is the product of TF and IDF, reflecting the term's significance within a particular document relative to the entire collection of documents. A term's TF-IDF score increases with its frequency in a document but is balanced by its commonality across all documents.
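
The effect of IDF can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a corpus of five documents and uses the document frequencies that four terms have in the example corpus that follows (`columbia` in 4 documents, `missouri` in 3, `vibrant` in 2, `city` in 1). The names `n_docs`, `doc_freq`, `idf_textbook`, and `idf_smooth` are made up for this example.

```python
import math

# Assumed corpus size and per-term document frequencies (illustrative values)
n_docs = 5
doc_freq = {"columbia": 4, "missouri": 3, "vibrant": 2, "city": 1}

idf_textbook = {}
idf_smooth = {}
for term, df in doc_freq.items():
    # Basic definition: log of (total documents / documents containing the term)
    idf_textbook[term] = math.log(n_docs / df)
    # Smoothed variant used by scikit-learn's default settings
    idf_smooth[term] = math.log((1 + n_docs) / (1 + df)) + 1
    print(f"{term:<10} df={df}  textbook={idf_textbook[term]:.3f}  "
          f"smoothed={idf_smooth[term]:.3f}")
```

In both variants the rarest term (`city`) receives the largest IDF and the most widespread term (`columbia`) the smallest, which is the behavior the definitions above describe.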

<font color='Blue'><b>Example:</b></font>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs_ext = ['Columbia, Missouri is known for its vibrant college town atmosphere.',
            'The University of Missouri in Columbia is a major research institution.',
            "Columbia's weather can be unpredictable, especially in spring.",
            'Columbia, Columbia, a city so vibrant, so vibrant.',
            'The Missouri River, the Missouri River, so scenic, so scenic.']

# Initialize the TfidfVectorizer with English stop words to be filtered out
vectorizer = TfidfVectorizer(stop_words="english")
analyzer = vectorizer.build_analyzer()

# Preview each document after lowercasing, tokenizing, and stop-word removal
docs_ext_no_stopwords = [" ".join(analyzer(doc)) for doc in docs_ext]

print(docs_ext_no_stopwords)

# Fit the vectorizer to the documents to learn the vocabulary
vectorizer.fit(docs_ext)

# Display the size and content of the learned vocabulary
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary content:")
print(vectorizer.vocabulary_)

# Transform the documents into a TF-IDF-weighted term-document matrix
tfidf_matrix = vectorizer.transform(docs_ext)

# Retrieve and display the feature names (vocabulary)
print("\nFeature names:")
print(vectorizer.get_feature_names_out())

# Display the TF-IDF matrix as a dense array
# print("\nTF-IDF matrix:")
# print(tfidf_matrix.toarray())

# Convert the TF-IDF matrix into a DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())

# Display the DataFrame
print("\nDataFrame representation of the TF-IDF matrix:")
display(tfidf_df.round(3))

The numbers in the table are the **TF-IDF (Term Frequency-Inverse Document Frequency) scores** of each term in each document. Here is how to read them:

- **Columns**: Each column header is a unique term from the learned vocabulary (e.g., 'atmosphere', 'city', 'college', etc.).
- **Rows**: Each row corresponds to one document (e.g., Document 0, Document 1, etc.).
- **Values**: The numerical values are the TF-IDF scores, which measure how important a term is to a document within the collection. A score of **0** means the term does not appear in the document, while higher scores indicate greater importance.

For example, in Document 0 the terms 'atmosphere', 'college', 'known', and 'town' occur in no other document, so they share the highest score in that row. The term 'columbia' also appears in Document 0, but because it occurs in four of the five documents it receives a lower score. The term 'city' has a score of **0**, indicating it does not appear in Document 0.

In [None]:
import math
from collections import defaultdict
import pandas as pd

def compute_tf(documents):
    """Compute term frequency (count / document length) for each term in each document."""
    tf_list = []
    for doc in documents:
        words = doc.split()
        total_terms = len(words)
        tf = defaultdict(float)
        for word in words:
            tf[word] += 1
        for word in tf:
            tf[word] /= total_terms
        tf_list.append(tf)
    return tf_list

def compute_idf(documents, smooth=True):
    """Compute inverse document frequency for each term."""
    N = len(documents)
    doc_freq = defaultdict(int)
    for doc in documents:
        unique_terms = set(doc.split())
        for term in unique_terms:
            doc_freq[term] += 1
    idf_dict = {}
    for term, df in doc_freq.items():
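        if smooth:
            # Smoothed IDF: acts as if one extra document contained every term
            idf_dict[term] = math.log((1 + N) / (1 + df)) + 1
        else:
            # Basic IDF: log of (total documents / documents containing the term)
            idf_dict[term] = math.log(N / df) if df != 0 else 0.0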
    return idf_dict

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (an all-zero vector stays zero)."""
    norm = math.sqrt(sum(x**2 for x in vec))
    return [x / norm if norm != 0 else 0 for x in vec]

def compute_tfidf(documents, smooth=True):
    """Compute both non-normalized and normalized TF-IDF matrices."""
    tf_list = compute_tf(documents)
    idf = compute_idf(documents, smooth=smooth)
    vocab = sorted(idf.keys())
    tfidf_matrix = []
    tfidf_matrix_norm = []
    for tf in tf_list:
        tfidf = [tf.get(word, 0) * idf[word] for word in vocab]
        tfidf_matrix.append(tfidf)
        tfidf_matrix_norm.append(l2_normalize(tfidf))
    return vocab, tfidf_matrix, tfidf_matrix_norm

# Example usage
docs = [
    'columbia missouri known vibrant college town atmosphere',
    'university missouri columbia major research institution',
    'columbia weather unpredictable especially spring',
    'columbia columbia city vibrant vibrant',
    'missouri river missouri river scenic scenic'
]

# TF Table
tf_list = compute_tf(docs)
vocab = sorted(set(word for doc in docs for word in doc.split()))
tf_table = []
for tf in tf_list:
    tf_table.append([tf.get(word, 0) for word in vocab])
tf_df = pd.DataFrame(tf_table, columns=vocab)
print("TF (Term Frequency) Table:")
display(tf_df.round(3))

# IDF with smoothing
idf_smooth = compute_idf(docs, smooth=True)
idf_smooth_df = pd.DataFrame(list(idf_smooth.items()), columns=["Term", "IDF (Smooth)"])
print("\nIDF (Inverse Document Frequency, smooth=True):")
display(idf_smooth_df.sort_values("Term").reset_index(drop=True))

# IDF without smoothing
idf_nosmooth = compute_idf(docs, smooth=False)
idf_nosmooth_df = pd.DataFrame(list(idf_nosmooth.items()), columns=["Term", "IDF (No Smoothing)"])
print("\nIDF (Inverse Document Frequency, smooth=False):")
display(idf_nosmooth_df.sort_values("Term").reset_index(drop=True))

# TF-IDF Table (with smoothing, non-normalized and normalized)
vocab, tfidf_matrix, tfidf_matrix_norm = compute_tfidf(docs, smooth=True)
tfidf_df = pd.DataFrame(tfidf_matrix, columns=vocab)
tfidf_df_norm = pd.DataFrame(tfidf_matrix_norm, columns=vocab)
print("\nTF-IDF Table (smooth=True, non-normalized):")
display(tfidf_df.round(3))
print("\nTF-IDF Table (smooth=True, L2 normalized):")
display(tfidf_df_norm.round(3))

# TF-IDF Table (without smoothing, non-normalized and normalized)
vocab, tfidf_matrix_nosmooth, tfidf_matrix_nosmooth_norm = compute_tfidf(docs, smooth=False)
tfidf_nosmooth_df = pd.DataFrame(tfidf_matrix_nosmooth, columns=vocab)
tfidf_nosmooth_df_norm = pd.DataFrame(tfidf_matrix_nosmooth_norm, columns=vocab)
print("\nTF-IDF Table (smooth=False, non-normalized):")
display(tfidf_nosmooth_df.round(3))
print("\nTF-IDF Table (smooth=False, L2 normalized):")
display(tfidf_nosmooth_df_norm.round(3))
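
When comparing these hand-computed tables with scikit-learn, two details matter. scikit-learn uses raw term counts for TF rather than dividing by document length, so only the L2-normalized tables above are expected to line up with `TfidfVectorizer` output (the per-document scaling cancels during normalization). Also, with `smooth_idf=False` scikit-learn still adds 1 to the logarithm, unlike the `smooth=False` branch above. The sketch below checks the normalization behavior on the same pre-tokenized documents; `docs_check`, `default_matrix`, and `raw_matrix` are names chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs_check = [
    "columbia missouri known vibrant college town atmosphere",
    "university missouri columbia major research institution",
    "columbia weather unpredictable especially spring",
    "columbia columbia city vibrant vibrant",
    "missouri river missouri river scenic scenic",
]

# Default settings: smoothed IDF and L2 normalization of each row
default_matrix = TfidfVectorizer().fit_transform(docs_check).toarray()
row_norms = np.linalg.norm(default_matrix, axis=1)
print("Row norms with norm='l2':", row_norms.round(6))

# norm=None returns the unnormalized count * IDF products
raw_matrix = TfidfVectorizer(norm=None).fit_transform(docs_check).toarray()

# Normalizing the raw rows by hand should reproduce the default output
renormalized = raw_matrix / np.linalg.norm(raw_matrix, axis=1, keepdims=True)
print("Manual L2 normalization matches default:",
      np.allclose(renormalized, default_matrix))
```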


## Expanding Bag-of-Words with n-Grams

The Bag-of-Words (BoW) model is a simple yet powerful way to represent text data for machine learning. However, it has a significant limitation: it ignores word order. For example, "it's bad, not good at all" and "it's good, not bad at all" would have identical BoW representations despite their opposite meanings. This is where n-grams come into play.
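
The limitation is easy to verify. The sketch below builds a word-count bag for each of the two sentences using only the standard library; `bag_of_words` is a helper written for this illustration, and its simple regular-expression tokenizer is an assumption, not the tokenizer scikit-learn uses.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bow_a = bag_of_words("it's bad, not good at all")
bow_b = bag_of_words("it's good, not bad at all")

print(bow_a)
print(bow_b)
print("Identical bags:", bow_a == bow_b)
```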

### Capturing Context with n-Grams

To capture more context, we can extend the BoW model to consider sequences of words:
- **Bigrams**: Pairs of consecutive words.
- **Trigrams**: Triplets of consecutive words.
- **n-Grams**: Sequences of 'n' consecutive words.
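
As a rough illustration of these definitions, the sketch below generates n-grams from a token list with a sliding window and applies it to the two sentences from the word-order example. `make_ngrams` is a hypothetical helper for this illustration, and splitting on whitespace after removing commas is a simplification of real tokenization.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens_a = "it's bad, not good at all".replace(",", "").split()
tokens_b = "it's good, not bad at all".replace(",", "").split()

bigrams_a = make_ngrams(tokens_a, 2)
bigrams_b = make_ngrams(tokens_b, 2)

print("Same unigrams:", sorted(tokens_a) == sorted(tokens_b))
print("Bigrams A:", bigrams_a)
print("Bigrams B:", bigrams_b)
print("Same bigrams:", sorted(bigrams_a) == sorted(bigrams_b))
```

The two sentences share the same unigrams, but their bigrams differ (for example, "not good" versus "not bad"), so a bigram representation can tell them apart.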


### Implementing n-Grams with CountVectorizer

The `CountVectorizer` and `TfidfVectorizer` classes in scikit-learn can be configured to use n-grams by setting the `ngram_range` parameter. Here's how you can do it:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Columbia, Missouri is known for its vibrant college town atmosphere.",
        "The University of Missouri in Columbia is a major research institution.",
        "Columbia's weather can be unpredictable, especially in spring."
        ]

# Unigrams (standard BoW)
uni_vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words="english")
uni_matrix = uni_vectorizer.fit_transform(docs)
print("Unigram features:", uni_vectorizer.get_feature_names_out())

# Bigrams
bi_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
bi_matrix = bi_vectorizer.fit_transform(docs)
print("Bigram features:", bi_vectorizer.get_feature_names_out())

In [None]:
docs = ["Columbia, Missouri is known for its vibrant college town atmosphere.",
        "The University of Missouri in Columbia is a major research institution.",
        "Columbia's weather can be unpredictable, especially in spring."
        ]

vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words="english").fit(docs)
# ngram_range: tuple (min_n, max_n), default=(1, 1)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

ngram_words = vectorizer.transform(docs)
pd.DataFrame(ngram_words.toarray(), columns= vectorizer.get_feature_names_out())

In [None]:
from pprint import pprint
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english").fit(docs)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary:")
pprint(vectorizer.get_feature_names_out())

ngram_words = vectorizer.transform(docs)
pd.DataFrame(ngram_words.toarray(), columns= vectorizer.get_feature_names_out())

In [None]:
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit(docs)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

ngram_words = vectorizer.transform(docs)
pd.DataFrame(ngram_words.toarray(), columns= vectorizer.get_feature_names_out())

## TF-IDF with n-Grams

In [None]:
docs = ["Columbia, Missouri is known for its vibrant college town atmosphere.",
        "The University of Missouri in Columbia is a major research institution.",
        "Columbia's weather can be unpredictable, especially in spring."]
vect = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
vect.fit(docs)
tfidf_words = vect.transform(docs)
df = pd.DataFrame(tfidf_words.toarray(), columns=vect.get_feature_names_out())
display(df.round(3))