# TF-IDF Vectorizer Implementation with Python

In this notebook, we will implement **TF-IDF (Term Frequency - Inverse Document Frequency)** using Python and the scikit-learn (`sklearn`) library. TF-IDF scores how important a word is to a document relative to a collection of documents (the corpus): a word scores high when it appears often in one document but rarely in the rest of the corpus.

We will follow these steps:
1. Load the spam dataset
2. Clean and preprocess the text (lowercase, remove stopwords, apply lemmatization)
3. Convert the corpus into TF-IDF vectors
4. Explore n-grams with TF-IDF
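
Before working through these steps, it may help to see the arithmetic behind a TF-IDF score. The sketch below is a plain-Python illustration, not scikit-learn's code. It assumes the smoothed formula that `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The helper name `tfidf_vectors` and the three sample messages are made up for this example, and tokenization is a simple whitespace split.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency (assumed to match sklearn's default settings)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[term] * idf[term] for term in vocab]
        # Scale each vector to unit length; guard against an all-zero vector
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Hypothetical toy corpus, not taken from spam.csv
docs = ["free prize call now", "call me later", "free free prize"]
vocab, vectors = tfidf_vectors(docs)

for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

In the first message, `now` receives a higher weight than `call` because `now` occurs in only one document while `call` occurs in two.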


In [None]:
# Step 1: Load the dataset
import pandas as pd

# Keep only the label (v1) and message text (v2) columns, then give them readable names
df = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']
df.head()

### Step 2: Text Cleaning and Lemmatization

In [None]:
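import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources needed for stopword removal and lemmatization
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
corpus = []

for text in df['text']:
    # Replace every non-letter character with a space, then lowercase and tokenize
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower().split()
    # Drop stopwords and lemmatize the remaining words
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))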
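
To see what the cleaning loop does to a single message, here is a minimal standalone sketch of the same regex, lowercase, and stopword steps. So that it runs without any NLTK downloads, it assumes a tiny hand-written stopword set (`SAMPLE_STOPWORDS`, a hypothetical stand-in for NLTK's English list) and skips lemmatization. The function name `clean_message` and the sample text are also made up for illustration.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list, kept tiny for illustration
SAMPLE_STOPWORDS = {"you", "have", "a", "to", "now", "the"}

def clean_message(text, stop_words=SAMPLE_STOPWORDS):
    # Replace digits, punctuation, and symbols with spaces
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase and split on whitespace
    tokens = letters_only.lower().split()
    # Keep only the tokens that are not stopwords
    return ' '.join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_message("WINNER!! You have won a $1000 prize. Call 09061701461 now!")
print(cleaned)  # winner won prize call
```

Note that the regex discards numbers entirely, so phone numbers and prize amounts do not reach the vectorizer.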

### Step 3: Apply TF-IDF Vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100)
X = tfidf.fit_transform(corpus).toarray()
X[:5]  # Display first 5 vectors

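As you can see, each message is converted into a 100-dimensional vector holding the TF-IDF scores of the 100 most frequent terms in the corpus; `max_features=100` restricts the vocabulary to the top terms ranked by their total frequency across all messages.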
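
The effect of `max_features` can be pictured as a frequency ranking followed by a cutoff. The sketch below is a plain-Python approximation of that idea, not scikit-learn's internal code. The helper `top_terms`, the toy corpus, and the alphabetical tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter

def top_terms(corpus, max_features):
    """Return the max_features most frequent terms in the corpus, sorted alphabetically."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    # Rank by total count (descending); break ties alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept = [term for term, _ in ranked[:max_features]]
    return sorted(kept)

# Hypothetical toy corpus, not taken from spam.csv
sample_corpus = ["free prize call", "call later", "free free prize win"]
kept_terms = top_terms(sample_corpus, max_features=3)
print(kept_terms)  # ['call', 'free', 'prize']
```

Terms that miss the cutoff (`later` and `win` here) get no column at all, so a message made up only of rare words would become an all-zero vector.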

### Step 4: Use TF-IDF with n-grams (bigrams)

In [None]:
tfidf_bigram = TfidfVectorizer(max_features=100, ngram_range=(2, 2))
X_bigram = tfidf_bigram.fit_transform(corpus).toarray()
X_bigram[:5]  # Display first 5 bigram vectors
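
With `ngram_range=(2, 2)`, the features are pairs of adjacent words rather than single words. As a rough picture of what gets extracted, the sketch below builds word bigrams from a cleaned message in plain Python. The helper `word_ngrams` is a hypothetical name for this example and is not part of scikit-learn.

```python
def word_ngrams(text, n=2):
    """Return all runs of n adjacent whitespace-separated tokens, joined by a space."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("free prize call now")
print(bigrams)  # ['free prize', 'prize call', 'call now']
```

Because the corpus was cleaned first, a bigram can join two words that were separated by stopwords or punctuation in the original message.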

### View Vocabulary
You can view the vocabulary used in the TF-IDF representation using the `get_feature_names_out()` method:

In [None]:
tfidf_bigram.get_feature_names_out()[:10]  # Display first 10 bigram features

## Summary
- We implemented TF-IDF using sklearn
- Preprocessed the text using lemmatization
- Explored both word-level and bigram-level TF-IDF vectors

The next step is to split the data into train/test sets and train a machine learning classifier on these vectors!