# Video: Testing Hashed Document Vectors for Classification

A key concern with any hashed or otherwise compressed representation is whether it is faithful to the underlying data.
Can it be used for the same tasks as the original data?
This video will compare document vectors and their hashed versions using logistic regression.

Script:
* We previously evaluated document vectors for classification where each feature in the vector corresponded to a word in the training vocabulary.
* We saw decent results, but a concern with these methods is that the vectors will keep getting larger as the system trains on more data and gradually expands its vocabulary.
* One proposed alternative to word features is to hash them to a fixed-size representation.
* In this video, we will extend the previous evaluation to cover hashed document vectors.

In [None]:
import pandas as pd

In [None]:
recipes = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx704-examples/refs/heads/main/data/recipes.tsv.gz", sep="\t")
recipe_tags = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx704-examples/refs/heads/main/data/recipe-tags.tsv.gz", sep="\t")

In [None]:
recipes = recipes.set_index("recipe_slug")
recipes

recipe_slug,recipe_title,recipe_introduction,recipe_ingredients,recipe_instructions,recipe_conclusion,recipe_related_slugs,recipe_ts
spiced-pear-and-walnut-salad,Spiced Pear And Walnut Salad,Spiced pear and walnut salad is a delicious an...,"[""2 ripe pears, thinly sliced"", ""4 cups mixed ...","[""In a small bowl, whisk together the olive oi...",\N,"[""pear-and-blue-cheese-salad"", ""walnut-and-cra...",2023-06-17 22:10:35.744536+00
roasted-pear-and-butternut-squash-soup,Roasted Pear And Butternut Squash Soup,Roasted pear and butternut squash soup is a cr...,"[""2 medium-sized butternut squash, peeled and ...","[""Preheat the oven to 400°F."", ""In a large bow...",\N,"[""roasted-butternut-squash-and-apple-soup"", ""p...",2023-06-17 22:10:46.428069+00
peach-clafoutis,Peach Clafoutis,Peach clafoutis is a classic French dessert th...,"[""4 ripe peaches, peeled and sliced"", ""3 eggs""...","[""Preheat the oven to 375°F."", ""Grease a 9-inc...",\N,"[""cherry-clafoutis"", ""blueberry-clafoutis"", ""a...",2023-06-17 19:05:50.44248+00
plum-clafoutis,Plum Clafoutis,Plum clafoutis is a classic French dessert mad...,"[""4-5 ripe plums, pitted and sliced"", ""3 eggs""...","[""Preheat the oven to 375°F (190°C) and butter...",\N,"[""cherry-clafoutis"", ""apple-clafoutis"", ""blueb...",2023-06-17 19:05:42.705122+00
pear,Pear,Pears are a sweet and juicy fruit that come in...,"[""1 sheet of puff pastry"", ""2 ripe pears, peel...","[""Preheat the oven to 400°F."", ""Roll out the p...",\N,"[""pear-and-goat-cheese-salad"", ""pear-and-ginge...",2023-06-17 22:11:13.760378+00
...,...,...,...,...,...,...,...
pear-coffee-cake,Pear Coffee Cake,Description: \nPear coffee cake is a moist an...,"[""2 cups all-purpose flour"", ""1 cup granulated...","[""Preheat your oven to 350°F (175°C). Grease a...",\N,"[""apple-cinnamon-coffee-cake"", ""banana-nut-bre...",2025-07-16 22:22:02.711138+00
elegant-pear-coffee-loaf,Elegant Pear Coffee Loaf,Description: \nThe Elegant Pear Coffee Loaf i...,"[""2 ripe pears, peeled, cored, and chopped"", ""...","[""Preheat your oven to 350°F (175°C). Grease a...",\N,"[""pear-and-walnut-coffee-cake"", ""coffee-infuse...",2025-07-16 22:22:18.463873+00
halibut,Halibut,"Description: \nHalibut is a large, flat fish ...","[""4 halibut fillets (6 oz each)"", ""Salt and fr...","[""Pat the halibut fillets dry and season gener...",\N,"[""lemon-herb-baked-halibut"", ""grilled-halibut-...",2025-07-17 20:40:05.664215+00
arayes,Arayes,Description: \nArayes are a delicious Middle ...,"[""4 large pita bread pockets"", ""1 pound ground...","[""In a mixing bowl, combine the ground meat, c...",\N,"[""shawarma"", ""kebabs"", ""falafel"", ""manakish-mi...",2025-07-22 21:30:10.41649+00


In [None]:
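# Recipes carrying the "dessert" tag, indexed by slug
desserts = recipe_tags.query("recipe_tag == 'dessert'").set_index("recipe_slug")
# Binary target: 1 if the recipe is tagged as a dessert, otherwise 0
dessert_target = pd.DataFrame({"dessert": recipes.index.isin(desserts.index).astype(int)}, index=recipes.index)
dessert_target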

recipe_slug,dessert
spiced-pear-and-walnut-salad,0
roasted-pear-and-butternut-squash-soup,0
peach-clafoutis,1
plum-clafoutis,1
pear,1
...,...
pear-coffee-cake,0
elegant-pear-coffee-loaf,1
halibut,0
arayes,0


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(recipes["recipe_introduction"])
X_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 5533341 stored elements and shape (97047, 7918)>

Script:
* I have already run the same feature extraction as in the previous video on testing document vectors.
* Let's set up the feature hashing now.

In [None]:
import numpy as np

In [None]:
# One row of 100 random signs per vocabulary word: draw 0 or 1, then map to -1 or +1
word_hash = np.random.randint(0, 2, size=(X_bow.shape[1], 100)) * 2 - 1
word_hash

array([[-1, -1, -1, ..., -1,  1, -1],
       [-1,  1, -1, ..., -1, -1,  1],
       [-1,  1, -1, ..., -1,  1,  1],
       ...,
       [ 1, -1, -1, ..., -1, -1, -1],
       [-1,  1,  1, ...,  1, -1, -1],
       [ 1,  1,  1, ...,  1, -1,  1]])

Script:
* To make the computation clear, I created an explicit hash matrix that shows how the hashed vector relates to the original vector.
* Each of the 7918 vocabulary words gets one row of 100 random signs.
* The hashed vector is just the original document vector multiplied by this hash matrix of -1s and 1s.
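
To make the matrix product concrete, here is a toy sketch with made-up sizes (2 documents, 4 words, 3 hashed features). None of these numbers come from the recipe data, and the names `toy_counts` and `toy_hash` are hypothetical. It checks that each hashed row is the count-weighted sum of the sign rows for the words in that document.

```python
import numpy as np

# Toy setup (hypothetical sizes): 2 documents, 4-word vocabulary, 3 hashed features.
rng = np.random.default_rng(0)
toy_counts = np.array([[2, 0, 1, 0],
                       [0, 1, 0, 3]])
toy_hash = rng.integers(0, 2, size=(4, 3)) * 2 - 1  # entries are -1 or +1

toy_hashed = toy_counts @ toy_hash

# Document 0 contains word 0 twice and word 2 once, so its hashed row
# should equal 2 * (sign row of word 0) + 1 * (sign row of word 2).
manual_row0 = 2 * toy_hash[0] + 1 * toy_hash[2]

print(toy_hashed)
print(np.array_equal(toy_hashed[0], manual_row0))
```

The same identity holds for the full `X_bow @ word_hash` product; only the sizes differ.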


In [None]:
X_bow_hash = X_bow @ word_hash
X_bow_hash

array([[-21,  -5,  11, ...,  -1,  15,  -1],
       [-19, -21,   5, ...,   5,  11,   3],
       [ -6, -16,   6, ...,  -2,   2,  -6],
       ...,
       [  1, -15,   9, ...,  -1,   7, -15],
       [-16,   0,  -4, ...,  12,  -2,   4],
       [-22, -10,  -4, ...,  14,  -4,  -6]])

In [None]:
X_bow_hash.shape

(97047, 100)

Script:
* A real implementation would probably skip building this array to save memory and just compute individual rows as needed.
* Let's test the resulting features.
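
One way to read "compute individual rows as needed" is to derive each word's row of signs from a hash function instead of storing a matrix. The sketch below is an assumption about how that could look, not the method used in this notebook: the helper names `word_signs` and `hashed_vector` are hypothetical, the signs come from SHA-256 rather than `np.random.randint`, and tokenization is a plain whitespace split rather than `CountVectorizer`.

```python
import hashlib

N_HASHED = 8  # hypothetical number of hashed features, kept small for display


def word_signs(word, n_hashed=N_HASHED):
    """Derive a repeatable row of -1/+1 values for a word from its SHA-256 digest."""
    if n_hashed > 256:
        raise ValueError("a single SHA-256 digest only provides 256 bits")
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # Use one bit of the digest per hashed feature.
    return [1 if (digest[i // 8] >> (i % 8)) & 1 else -1 for i in range(n_hashed)]


def hashed_vector(text, n_hashed=N_HASHED):
    """Sum the sign rows of every word occurrence, without a stored hash matrix."""
    total = [0] * n_hashed
    for word in text.lower().split():
        for i, sign in enumerate(word_signs(word, n_hashed)):
            total[i] += sign
    return total


print(hashed_vector("peach clafoutis is a classic french dessert"))
```

Because the signs are recomputed from the word itself, nothing grows as new words appear, and a word never seen in training still maps to a well-defined row.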


In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_bow_hash = LogisticRegression()
model_bow_hash.fit(X_bow_hash, dessert_target["dessert"])

Script:
* This time, the logistic regression converged without any warnings.
* Is that good or bad?

In [None]:
# Accuracy on the same data the model was trained on (no held-out split)
bow_hash_predictions = model_bow_hash.predict(X_bow_hash)
(bow_hash_predictions == dessert_target["dessert"]).mean()

np.float64(0.9326305810586623)

Script:
* Previously, we saw 96% accuracy with logistic regression on bag-of-words features, while always predicting "not dessert" would be 90% accurate.
* So this result, at about 93%, is in the middle.
* Will it do better with more hashed features?
* I separately tested with 200 and 500 hashed features and got essentially the same result.
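
A comparison across hash sizes can be run as a simple loop. The sketch below uses synthetic data so that it runs on its own; the matrix `X_toy`, the labels `y_toy`, and the sizes 5, 20, and 40 are all made up for illustration and say nothing about the recipe results. In the notebook, the same loop would use `X_bow`, `dessert_target["dessert"]`, and sizes such as 100, 200, and 500.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the count matrix: 200 documents, 50 "words".
X_toy = rng.poisson(0.3, size=(200, 50))
# Hypothetical label that depends on the counts of the first five words.
y_toy = (X_toy[:, :5].sum(axis=1) > 1).astype(int)

results = {}
for n_hashed in (5, 20, 40):
    # Fresh random sign matrix for each hash size
    signs = rng.integers(0, 2, size=(X_toy.shape[1], n_hashed)) * 2 - 1
    X_hashed = X_toy @ signs
    model = LogisticRegression(max_iter=1000).fit(X_hashed, y_toy)
    # Training accuracy, matching how accuracy is measured in this notebook
    results[n_hashed] = model.score(X_hashed, y_toy)

for n_hashed, accuracy in results.items():
    print(n_hashed, round(accuracy, 3))
```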

Script: (faculty on screen)
* As we just saw, feature hashing lets us trade off accuracy for speed.
* We shrank the number of features by a factor of 79, from 7918 to 100, but lost about half of our accuracy gain over always predicting "not dessert".
* If you need the speed in your system, this may be a good option, but test carefully first so you know what you are paying for that speed.
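
The two figures in this summary can be checked with a few lines of arithmetic. The accuracies below are the rounded values quoted in the script (90% baseline, 96% bag of words) plus the 0.9326 measured above, so the result is approximate.

```python
n_vocab, n_hashed = 7918, 100

# Rounded figures quoted in the script, plus the hashed accuracy measured above
acc_baseline, acc_bow, acc_hash = 0.90, 0.96, 0.9326

reduction = n_vocab / n_hashed            # how many times fewer features
gain_bow = acc_bow - acc_baseline         # gain of bag of words over the baseline
gain_hash = acc_hash - acc_baseline       # gain of hashed features over the baseline
retained = gain_hash / gain_bow           # share of the gain that survives hashing

print(f"{reduction:.1f}x fewer features")
print(f"{retained:.0%} of the accuracy gain retained")
```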