
# 🧠 Part 3: Vectorization in Practice

**Goal:** Convert cleaned text into numbers so we can analyze it and feed it into ML models.  
We’ll build **Document–Term Matrices (DTM)** using:
- **CountVectorizer** (word counts / bag-of-words)
- **TfidfVectorizer** (weighted by term importance)

**Context:** Travel agency & hostels — short reviews and booking-like messages.


In [None]:

# Core imports
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample travel texts (already 'lightly cleaned' for the demo)
docs = [
    "great location very clean dorm near shibuya",
    "private room at hostel in asakusa budget under 8000 jpy per night",
    "cheap hostel in barcelona close to la rambla arrival 10 02 leaving 10 06",
    "recommend capsule hotel in osaka near namba under 40 usd",
    "stay in kyoto during golden week 2 adults 1 child total budget 60000 jpy",
    "dorm bed tokyo 3 nights shibuya area",
    "love the staff but not happy with noisy street",
    "room was clean and location perfect near station",
]
df = pd.DataFrame({"doc_id": range(len(docs)), "text": docs})
df


,doc_id,text
0,0,great location very clean dorm near shibuya
1,1,private room at hostel in asakusa budget under...
2,2,cheap hostel in barcelona close to la rambla a...
3,3,recommend capsule hotel in osaka near namba un...
4,4,stay in kyoto during golden week 2 adults 1 ch...
5,5,dorm bed tokyo 3 nights shibuya area
6,6,love the staff but not happy with noisy street
7,7,room was clean and location perfect near station



## Why Vectorize?
Most ML algorithms need **numeric inputs**. Vectorization turns text into numbers.  
Two classic approaches:
- **Counts**: how many times a term appears.
- **TF–IDF**: how often a term appears in a document, down-weighted by how many documents in the corpus contain it.
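
For reference, the weighting can be written out explicitly. The formula below assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); other libraries and textbooks use slightly different variants. Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in document $d$:

$$
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'}\bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
$$

With the eight sample texts above ($n = 8$), a term found in a single document gets $\mathrm{idf} = \ln(9/2) + 1 \approx 2.504$, while "in", which occurs in four documents, gets $\ln(9/5) + 1 \approx 1.588$.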


In [None]:

# --- CountVectorizer: Bag of Words ---
cv = CountVectorizer()
X_counts = cv.fit_transform(df["text"])

# Convert to a DataFrame for readability
counts_df = pd.DataFrame(X_counts.toarray(), columns=cv.get_feature_names_out(), index=df["doc_id"])
counts_df.head()


doc_id,02,06,10,40,60000,8000,adults,and,area,arrival,...,the,to,tokyo,total,under,usd,very,was,week,with
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,1,1,2,0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
4,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [None]:

# Inspect an example row (choose a doc)
row_id = 1  # try changing to explore
doc_counts = counts_df.loc[row_id].sort_values(ascending=False)
doc_counts[doc_counts > 0]


term,count
8000,1
at,1
asakusa,1
budget,1
hostel,1
jpy,1
in,1
room,1
private,1
per,1


In [None]:

# --- TF-IDF Vectorizer ---
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["text"])

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out(), index=df["doc_id"])
tfidf_df.head()


doc_id,02,06,10,40,60000,8000,adults,and,area,arrival,...,the,to,tokyo,total,under,usd,very,was,week,with
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.433046,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.317597,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.266171,0.0,0.0,0.0,0.0,0.0
2,0.257305,0.257305,0.514609,0.0,0.0,0.0,0.0,0.0,0.0,0.257305,...,0.0,0.257305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.340454,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.285327,0.340454,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.304194,0.0,0.304194,0.0,0.0,0.0,...,0.0,0.0,0.0,0.304194,0.0,0.0,0.0,0.0,0.304194,0.0


In [None]:

# Compare counts vs TF-IDF for the same document
row_id = 1
c_top = counts_df.loc[row_id][counts_df.loc[row_id] > 0].sort_values(ascending=False)
t_top = tfidf_df.loc[row_id][tfidf_df.loc[row_id] > 0].sort_values(ascending=False)

print("=== COUNTS (doc", row_id, ") ===")
print(c_top.head(15))
print("\n=== TF-IDF (doc", row_id, ") ===")
print(t_top.head(15))


=== COUNTS (doc 1 ) ===
8000       1
asakusa    1
at         1
budget     1
hostel     1
in         1
jpy        1
night      1
per        1
private    1
room       1
under      1
Name: 1, dtype: int64

=== TF-IDF (doc 1 ) ===
8000       0.317597
asakusa    0.317597
at         0.317597
night      0.317597
private    0.317597
per        0.317597
budget     0.266171
hostel     0.266171
room       0.266171
jpy        0.266171
under      0.266171
in         0.201382
Name: 1, dtype: float64


In [None]:

# Add useful options:
# - stop_words='english': drop common English function words
#   (scikit-learn's built-in list also contains "not")
# - min_df=2: ignore terms that appear in fewer than 2 docs
# - ngram_range=(1,2): include unigrams and bigrams (captures short phrases)
# On this tiny corpus no bigram occurs in 2 docs, so only unigrams survive min_df=2.
tfidf_cfg = TfidfVectorizer(stop_words='english', min_df=2, ngram_range=(1,2))
X_tfidf_cfg = tfidf_cfg.fit_transform(df["text"])
tfidf_cfg_df = pd.DataFrame(X_tfidf_cfg.toarray(), columns=tfidf_cfg.get_feature_names_out(), index=df["doc_id"])
tfidf_cfg_df.head()


doc_id,budget,clean,dorm,hostel,jpy,location,near,room,shibuya
0,0.0,0.459091,0.459091,0.0,0.0,0.459091,0.396158,0.0,0.459091
1,0.5,0.0,0.0,0.5,0.5,0.0,0.0,0.5,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.707107,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0


In [None]:

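# Find the most similar documents using cosine similarity on TF-IDF vectors
cos = cosine_similarity(X_tfidf)
sim_df = pd.DataFrame(cos, index=df["doc_id"], columns=df["doc_id"])

# For each doc, keep its k most similar *other* docs
def top_match(sim_row, k=2):
    # Exclude self (similarity=1 on the diagonal)
    order = sim_row.drop(sim_row.name).sort_values(ascending=False)
    return order.head(k)

# apply() aligns the per-row results on doc_id, so the output is a wide frame:
# columns are the matched doc_ids, and NaN means "not among this row's top k".
# A score of 0.0 means the two docs share no terms at all.
matches = sim_df.apply(top_match, axis=1)
matches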


doc_id,0,1,3,4,5,7
0,,,,,0.261665,0.324209
1,,,0.11942,0.174558,,
2,,0.090254,0.035221,,,
3,0.077108,0.11942,,,,
4,,0.174558,0.041639,,,
5,0.261665,0.0,,,,
6,0.0,0.0,,,,
7,0.324209,0.086633,,,,
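
The wide table above is mostly empty because each row keeps only two matches. A long format, with one row per (document, match) pair, is often easier to read. The sketch below is one possible way to build it; `top_matches_long` is a hypothetical helper written for this demo, and the sample texts are re-declared so the block runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "great location very clean dorm near shibuya",
    "private room at hostel in asakusa budget under 8000 jpy per night",
    "cheap hostel in barcelona close to la rambla arrival 10 02 leaving 10 06",
    "recommend capsule hotel in osaka near namba under 40 usd",
    "stay in kyoto during golden week 2 adults 1 child total budget 60000 jpy",
    "dorm bed tokyo 3 nights shibuya area",
    "love the staff but not happy with noisy street",
    "room was clean and location perfect near station",
]

X = TfidfVectorizer().fit_transform(docs)
sim = pd.DataFrame(cosine_similarity(X))  # rows and columns are doc positions 0..7


def top_matches_long(sim_df, k=2):
    """Return one row per (doc, match) pair; hypothetical helper for this demo."""
    rows = []
    for doc_id, sim_row in sim_df.iterrows():
        # Drop the diagonal entry (self-similarity), then keep the k largest scores
        best = sim_row.drop(doc_id).nlargest(k)
        for match_id, score in best.items():
            rows.append({"doc_id": doc_id, "match_id": match_id, "similarity": score})
    return pd.DataFrame(rows)


matches_long = top_matches_long(sim)
print(matches_long.round(3))
```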


In [None]:

# Bag-of-words loses word order: 'happy' vs 'not happy'
# (the default tokenizer keeps only tokens of 2+ characters, so 'i' is dropped)
toy = pd.Series(["i am happy", "i am not happy", "i am very happy"])

cv_toy = CountVectorizer()
toy_counts = cv_toy.fit_transform(toy)
toy_df = pd.DataFrame(toy_counts.toarray(), columns=cv_toy.get_feature_names_out())

tfidf_toy = TfidfVectorizer()
toy_tfidf = tfidf_toy.fit_transform(toy)
toy_tfidf_df = pd.DataFrame(toy_tfidf.toarray(), columns=tfidf_toy.get_feature_names_out())

print("COUNTS (order ignored):\n", toy_df, "\n")
print("TF-IDF (order ignored):\n", toy_tfidf_df.round(3))


COUNTS (order ignored):
    am  happy  not  very
0   1      1    0     0
1   1      1    1     0
2   1      1    0     1 

TF-IDF (order ignored):
       am  happy    not   very
0  0.707  0.707  0.000  0.000
1  0.453  0.453  0.767  0.000
2  0.453  0.453  0.000  0.767


In [None]:

# Bigrams help partially: 'not happy' becomes its own feature,
# but the unigram 'happy' is still counted in the negated sentence
cv_bi = CountVectorizer(ngram_range=(1,2))
toy_bi = cv_bi.fit_transform(toy)
toy_bi_df = pd.DataFrame(toy_bi.toarray(), columns=cv_bi.get_feature_names_out())
toy_bi_df[["happy","not","not happy"]]


,happy,not,not happy
0,1,0,0
1,1,1,1
2,1,0,0
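
To put a number on "helps partially", the sketch below compares the cosine similarity of "i am happy" and "i am not happy" with and without bigrams. It uses raw counts only, and `pair_similarity` is a hypothetical helper for this demo.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy = ["i am happy", "i am not happy"]


def pair_similarity(ngram_range):
    """Cosine similarity between the two toy sentences for a given n-gram range."""
    X = CountVectorizer(ngram_range=ngram_range).fit_transform(toy)
    return float(cosine_similarity(X)[0, 1])


sim_uni = pair_similarity((1, 1))  # unigrams only
sim_bi = pair_similarity((1, 2))   # unigrams + bigrams
print(round(sim_uni, 3), round(sim_bi, 3))  # 0.816 0.516
```

Adding bigrams pulls the two sentences apart, but they still look fairly similar because "am" and "happy" remain shared features.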


In [None]:

# Optional: export matrices to CSV for modeling/dashboards
counts_df.to_csv("counts_dtm.csv")
tfidf_df.to_csv("tfidf_dtm.csv")
tfidf_cfg_df.to_csv("tfidf_dtm_cfg.csv")
print("Saved counts_dtm.csv, tfidf_dtm.csv, tfidf_dtm_cfg.csv")


Saved counts_dtm.csv, tfidf_dtm.csv, tfidf_dtm_cfg.csv



## Key Takeaways
- **CountVectorizer** builds a **Document–Term Matrix** with **word counts** (bag-of-words).
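- **TfidfVectorizer** reweights counts so that terms appearing in many documents (like "in") score lower than terms specific to a few documents.
- **Order is lost** in bag-of-words; **n‑grams** recover short phrases such as "not happy", while contextual embeddings generally capture more of the surrounding context.
- Use **stop_words, min_df, ngram_range** to shrink and denoise the vocabulary, and check what they remove: on this small corpus `min_df=2` left only nine terms.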
