1. The Symbolic Era (One-Hot, Bag of Words, TF-IDF)

Sklearn modules:

CountVectorizer → Bag of Words

TfidfVectorizer → TF-IDF

OneHotEncoder → One-hot vectors for categorical features

OrdinalEncoder, LabelEncoder → Integer codes (for feature columns and target labels, respectively)

In [2]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import pandas as pd

# Example categorical data
data = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'orange']})

# One-hot encoding
ohe = OneHotEncoder()
onehot = ohe.fit_transform(data[['fruit']])
print("One-hot:\n", onehot)

# Ordinal encoding (maps to integers)
ord_enc = OrdinalEncoder()
ordinal = ord_enc.fit_transform(data[['fruit']])
print("Ordinal:\n", ordinal)


One-hot:
 <Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (4, 3)>
  Coords	Values
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 0)	1.0
  (3, 2)	1.0
Ordinal:
 [[0.]
 [1.]
 [0.]
 [2.]]


In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Cats like milk",
    "Dogs like bones",
    "Cats and dogs are pets",
]

# Bag-of-words (count vectorizer)
cv = CountVectorizer()
X_count = cv.fit_transform(corpus)
print("Count vectors:\n", X_count.toarray())

# TF-IDF vectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF vectors:\n", X_tfidf.toarray())


Count vectors:
 [[0 0 0 1 0 1 1 0]
 [0 0 1 0 1 1 0 0]
 [1 1 0 1 1 0 0 1]]
TF-IDF vectors:
 [[0.         0.         0.         0.51785612 0.         0.51785612
  0.68091856 0.        ]
 [0.         0.         0.68091856 0.         0.51785612 0.51785612
  0.         0.        ]
 [0.49047908 0.49047908 0.         0.37302199 0.37302199 0.
  0.         0.49047908]]
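To see where the TF-IDF numbers come from, the sketch below recomputes them by hand from the raw counts. It assumes the TfidfVectorizer defaults (smoothed IDF, raw term counts, L2 row normalization); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Cats like milk",
    "Dogs like bones",
    "Cats and dogs are pets",
]

# Raw term counts, one row per document.
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the smooth_idf=True default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print("Matches TfidfVectorizer:", np.allclose(manual, reference))
```

Terms that occur in only one document ("milk", "bones") get a larger IDF than terms shared by two documents ("cats", "like"), which is why they carry the largest weight in their rows.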


2. The Statistical Era (LSA, LDA)

Sklearn modules:

TruncatedSVD → LSA (dimensionality reduction on term-document matrix)

LatentDirichletAllocation → LDA topic modeling

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Sample corpus
corpus = [
    "Cats like milk",
    "Dogs like bones",
    "Cats and dogs are pets"
]

# Convert text to TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Apply LSA
svd = TruncatedSVD(n_components=2)  # Keep 2 latent components ("topics")
X_lsa = svd.fit_transform(X)

print("Original shape:", X.shape)  # (3 docs x vocab size)
print("LSA shape:", X_lsa.shape) 

Original shape: (3, 8)
LSA shape: (3, 2)


In [2]:
print(X_lsa)

[[ 0.71975512 -0.34064651]
 [ 0.71975512 -0.34064651]
 [ 0.63428041  0.77310307]]


In [5]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
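# LDA models word counts, so it is fit on the bag-of-words matrix X_count
# from the CountVectorizer cell above, not on the TF-IDF matrix.
lda = LatentDirichletAllocation(n_components=2)
X_lda = lda.fit_transform(X_count)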

In [9]:
print(X_lda)

[[0.1491743  0.8508257 ]
 [0.14917387 0.85082613]
 [0.90304224 0.09695776]]


3. The Embedding Era (Word2Vec, GloVe, FastText)

Sklearn does not implement Word2Vec, GloVe, or FastText, but you can use:

The gensim library for Word2Vec/FastText

Or sklearn's CountVectorizer + TruncatedSVD as a rough, count-based approximation of learned embeddings

Example (gensim):

In [11]:
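# Requires the gensim package (pip install gensim)
from gensim.models import Word2Vec

# Each sentence is a list of tokens
sentences = [["I", "love", "cats"], ["I", "love", "dogs"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=4)
vector = model.wv['cats']  # 50-dimensional vector for the word "cats"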

ModuleNotFoundError: No module named 'gensim'

In [12]:
import gensim
print(gensim.__version__)


ModuleNotFoundError: No module named 'gensim'

4. The Contextual Era (ELMo, Seq2Seq + Attention)

Sklearn has no models that produce deep contextual embeddings.

Use tensorflow.keras or huggingface/transformers for contextual embeddings.

You can approximate word order in sklearn pipelines with n-gram features, but the context is limited to the n-gram window.

5. The Transformer Era (BERT, GPT-2)

Sklearn cannot train transformers.

Use the transformers library from Hugging Face: BertModel, GPT2Model, DistilBertModel.

Sklearn can still be used on top of the embeddings these models produce, for classification or regression. Extracting the embeddings:

In [13]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("I love cats", return_tensors="pt")
with torch.no_grad():  # no gradients needed for feature extraction
    outputs = model(**inputs)
# One 768-dimensional vector per token: shape (1, num_tokens, 768)
embedding = outputs.last_hidden_state


ModuleNotFoundError: No module named 'transformers'
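Sklearn estimators expect one fixed-length vector per sample, while a transformer's last hidden state holds one vector per token. A common way to bridge the two is mask-aware mean pooling, sketched below in NumPy. The `mean_pool` helper and the small integer arrays are hypothetical stand-ins for real model output, so the example runs without transformers installed.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    attention_mask:    array of shape (batch, seq_len), 1 = real token
    """
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    return summed / counts

# Placeholder "model output": 2 sentences, 3 tokens each, hidden size 4.
hidden = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
# The second sentence has one padded position at the end.
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)  # one row per sentence
print(sentence_vectors)
```

The resulting 2D array has the (n_samples, n_features) layout that sklearn classifiers and regressors take as input.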

6. The Scale Era (GPT-3/4/5, LLaMA)

These models are typically accessed through an API (OpenAI, Hugging Face Inference API).

Use embeddings as input to sklearn models for downstream tasks: clustering, classification, semantic search.
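A minimal sketch of the downstream step, assuming the embeddings have already been fetched and stacked into a NumPy array. The vectors here are synthetic placeholders (two well-separated random groups), not real API output, and the labels are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder embeddings: 40 vectors of dimension 16 in two groups.
# In practice these rows would come from an embedding API.
group_a = rng.normal(loc=1.0, scale=0.3, size=(20, 16))
group_b = rng.normal(loc=-1.0, scale=0.3, size=(20, 16))
embeddings = np.vstack([group_a, group_b])
labels = np.array([0] * 20 + [1] * 20)  # hypothetical class labels

# Classification: a linear model on top of the embeddings.
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
train_acc = clf.score(embeddings, labels)
print("Training accuracy:", train_acc)

# Clustering: group the same vectors without using the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
cluster_ids = kmeans.labels_
print("Cluster assignments:", cluster_ids)
```

Training accuracy is reported only to show the API; a real evaluation would hold out a test set.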

7. The Multimodal Era (CLIP, GPT-4V, Gemini)

Sklearn has no built-in support for multimodal inputs such as image-text pairs.

Use OpenAI CLIP, Hugging Face CLIP models, or other vision+text embeddings.

Once embeddings are obtained, sklearn can handle clustering, similarity search, and classification.
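Similarity search over such embeddings could be sketched as follows, assuming image and text vectors live in the same space (as they do with CLIP-style models). The arrays are random placeholders: the "text query" is constructed as a slightly perturbed copy of one image vector so that the expected match is known.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder image embeddings: 5 images, 8 dimensions each.
image_embeddings = rng.normal(size=(5, 8))

# Placeholder text-query embedding, built to lie close to image 3.
text_query = image_embeddings[3] + 0.01 * rng.normal(size=8)

# Index the images and retrieve the 2 nearest by cosine distance.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(image_embeddings)
distances, indices = index.kneighbors(text_query.reshape(1, -1))

print("Closest images:", indices[0])
print("Cosine distances:", distances[0].round(4))
```

For large collections, an approximate nearest-neighbor library may scale better; NearestNeighbors with the cosine metric performs an exact brute-force search.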