[{"href":"https://bart.degoe.de/ai-agent-dungeon-crawl/","title":"Your OpenClaw can book flights. But can it survive a dungeon crawl?","categories":["ai","python","games"],"content":" Listen to this article instead Your browser does not support the audio element AI agents are having a moment. OpenClaw (nΓ©e Clawdbot, nΓ©e Moltbot1) just hit 200k GitHub stars. People are driving a Mac Mini shortage just to manage their email, book flights, and order groceries. Someone launched a fake $CLAWD crypto token that hit $16 million before crashing 90%. Its creator joined OpenAI. There\u0026rsquo;s a social network for AI agents now.\nThese agents can do genuinely useful and awesome things. But let\u0026rsquo;s be honest: sending emails and checking you in for flights is not exactly exciting. What if, instead of a boarding pass, we gave our AI agent a sword?\nMy buddy and I have been building CrawlerVerse, a roguelike dungeon crawler designed specifically for AI agents. Think Dungeon Crawl Stone Soup but it also has an API, and 3d dice if you want to use your wetware/brain to play yourself. You or your agent get dropped into a procedurally generated dungeon, receive observations about what is visible (there\u0026rsquo;s monsters, items, walls, stairs), and has to decide what to do each turn. Kill monsters. Pick up loot. Don\u0026rsquo;t die2.\nWe just open-sourced the Python SDK, and there\u0026rsquo;s a public leaderboard. So now you could build an agent and let it loose in a dungeon, and give your Clawdbot something to do while it waits for your texts about ordering DoorDash.\nA dumbass bot Here\u0026rsquo;s the simplest possible agent:\nfrom crawlerverse import CrawlerClient, run_game, Wait, Observation, Action def my_agent(observation: Observation) -\u0026gt; Action: # it waits, that all it does return Wait() # request an API key at https://www.crawlerver.se/agent-api/waitlist # ping me if I don\u0026#39;t approve your key fast enough with CrawlerClient(api_key=\u0026#34;cra_...\u0026#34;) as client: result = run_game(client, my_agent, model_id=\u0026#34;my-monster-slaying-bot\u0026#34;) print(f\u0026#34;Floor {result.outcome.floor}, turns: {result.outcome.turns}\u0026#34;) That\u0026rsquo;s it. Your agent receives an observation, returns an action. The SDK handles the game loop, API calls, retries, all that jazz. This particular agent waits every turn and will get eaten by the first roaming monster that happens to walk into the room its crawler is waiting in, but it works. It \u0026ldquo;plays\u0026rdquo; the game. It will show up on the leaderboard3.\nThe observation tells you everything your agent can see: visible tiles, monster positions and health, items on the ground, your inventory, player stats (HP, attack, defense), equipped gear, and which directions you can move. The action is one of: move, attack, wait, pickup, drop, use, equip, enter portal, or ranged attack. This is exactly the same you would see as a human playing the game.\nLet\u0026rsquo;s make it less suicidal.\nGiving it some brains The SDK has some example agents for Anthropic, OpenAI, and local models. Let\u0026rsquo;s walk through the Claude one, because that\u0026rsquo;s what I had lying around4.\nStep 1: Tell the bot what it can see First, we need to turn the observation into something an LLM can understand. 
The SDK gives you typed Python objects; the LLM needs text.\ndef format_observation(obs: Observation) -\u0026gt; str: p = obs.player # Basic stats the LLM needs to make decisions lines = [ f\u0026#34;Turn {obs.turn} | Floor {obs.floor}\u0026#34;, f\u0026#34;HP: {p.hp}/{p.max_hp} | ATK: {p.attack} | DEF: {p.defense}\u0026#34;, f\u0026#34;Position: ({p.position[0]}, {p.position[1]})\u0026#34;, ] # What gear we\u0026#39;re wearing (if any) if p.equipped_weapon: lines.append(f\u0026#34;Weapon: {p.equipped_weapon}\u0026#34;) if p.equipped_armor: lines.append(f\u0026#34;Armor: {p.equipped_armor}\u0026#34;) # What we\u0026#39;re carrying if obs.inventory: inv = \u0026#34;, \u0026#34;.join(f\u0026#34;{i.name} ({i.type})\u0026#34; for i in obs.inventory) lines.append(f\u0026#34;Inventory: {inv}\u0026#34;) # Which directions aren\u0026#39;t blocked by walls passable = [d.value for d in Direction if obs.can_move(d)] lines.append(f\u0026#34;Passable directions: {\u0026#39;, \u0026#39;.join(passable)}\u0026#34;) # Everything we can see: tiles, monsters, and items on the ground lines.append(\u0026#34;\\nVisible tiles:\u0026#34;) for tile in obs.visible_tiles: parts = [f\u0026#34; ({tile.x},{tile.y}) {tile.type}\u0026#34;] if tile.monster: m = tile.monster parts.append(f\u0026#34;[MONSTER: {m.type} HP:{m.hp}/{m.max_hp}]\u0026#34;) if tile.items: parts.append(f\u0026#34;[ITEMS: {\u0026#39;, \u0026#39;.join(tile.items)}]\u0026#34;) lines.append(\u0026#34; \u0026#34;.join(parts)) return \u0026#34;\\n\u0026#34;.join(lines) Each turn, the LLM gets something like this:\nTurn 14 | Floor 1 HP: 8/10 | ATK: 3 | DEF: 1 Position: (5, 3) Weapon: short-sword Passable directions: north, east, southeast Visible tiles: (5,2) floor (6,3) floor [ITEMS: health-potion] (6,2) floor [MONSTER: goblin HP:4/6] (4,3) wall Step 2: Tell the bot what it can do The system prompt is where the strategy lives. You can get surprisingly far with a basic set of instructions:\nSYSTEM_PROMPT = \u0026#34;\u0026#34;\u0026#34;\\ You are an AI agent playing Crawlerver.se, a roguelike dungeon game. Each turn you receive an observation and must choose ONE action. Respond with a JSON object (no markdown, no explanation). ## Actions {\u0026#34;action\u0026#34;: \u0026#34;move\u0026#34;, \u0026#34;direction\u0026#34;: \u0026#34;\u0026lt;dir\u0026gt;\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;attack\u0026#34;, \u0026#34;direction\u0026#34;: \u0026#34;\u0026lt;dir\u0026gt;\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;ranged_attack\u0026#34;, \u0026#34;direction\u0026#34;: \u0026#34;\u0026lt;dir\u0026gt;\u0026#34;, \u0026#34;distance\u0026#34;: \u0026lt;1-15\u0026gt;} {\u0026#34;action\u0026#34;: \u0026#34;pickup\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;drop\u0026#34;, \u0026#34;itemType\u0026#34;: \u0026#34;\u0026lt;item\u0026gt;\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;use\u0026#34;, \u0026#34;itemType\u0026#34;: \u0026#34;\u0026lt;item\u0026gt;\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;equip\u0026#34;, \u0026#34;itemType\u0026#34;: \u0026#34;\u0026lt;item\u0026gt;\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;wait\u0026#34;} {\u0026#34;action\u0026#34;: \u0026#34;enter_portal\u0026#34;} Directions: north, south, east, west, northeast, northwest, southeast, southwest ## Strategy Tips - Kill monsters to clear the path. Attack adjacent monsters. - Pick up items (potions, weapons, armor), they help you survive. - Equip weapons and armor for better stats. - Use health potions when HP is low. 
- Find stairs down to descend to the next floor. - Explore systematically; avoid getting surrounded. Always include a \u0026#34;reasoning\u0026#34; field explaining your decision.\u0026#34;\u0026#34;\u0026#34; This is the part to mess around with. More on that later.\nStep 3: Parse the response LLMs are not known for consistently producing valid JSON5, so we need some defensive parsing:\ndef parse_action(raw: str) -\u0026gt; Action: text = raw.strip() # Strip markdown code fences if the LLM ignores our instructions # Gemini does this A LOT if text.startswith(\u0026#34;```\u0026#34;): text = text.split(\u0026#34;\\n\u0026#34;, 1)[1] if \u0026#34;\\n\u0026#34; in text else text[3:] if text.endswith(\u0026#34;```\u0026#34;): text = text[:-3] # Find JSON if buried in other text if not text.startswith(\u0026#34;{\u0026#34;): start = text.find(\u0026#34;{\u0026#34;) if start \u0026gt;= 0: end = text.rfind(\u0026#34;}\u0026#34;) if end \u0026gt; start: text = text[start : end + 1] try: data = json.loads(text) except json.JSONDecodeError: return Wait(reasoning=\u0026#34;Failed to parse response\u0026#34;) action_type = data.get(\u0026#34;action\u0026#34;, \u0026#34;wait\u0026#34;) cls = ACTION_MAP.get(action_type) if cls is None: return Wait(reasoning=f\u0026#34;Unknown action: {action_type}\u0026#34;) # Build the action from the response fields valid_fields = set(cls.model_fields.keys()) kwargs = {k: v for k, v in data.items() if k != \u0026#34;action\u0026#34; and k in valid_fields} try: return cls(**kwargs) except Exception: return Wait(reasoning=f\u0026#34;Failed to construct {action_type}\u0026#34;) The fallback is always Wait(). Better to skip a turn than to crash the game because your AI of choice decided to write a haiku instead of JSON.\nStep 4: Wire it up Now we connect the pieces. The agent keeps a conversation history so the LLM \u0026ldquo;remembers\u0026rdquo; what happened on previous turns; this is what gives it \u0026ldquo;memory\u0026rdquo; across the game.\ndef make_agent(model: str = \u0026#34;claude-haiku-4-5-20251001\u0026#34;): client = Anthropic() messages: list[dict] = [] def agent(obs: Observation) -\u0026gt; Action: prompt = format_observation(obs) messages.append({\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: prompt}) # Prefill with \u0026#34;{\u0026#34; to force JSON output # This is another stupid trick I learned while doing this prefill = {\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;{\u0026#34;} response = client.messages.create( model=model, system=SYSTEM_PROMPT, messages=[*messages, prefill], temperature=0.3, max_tokens=200, ) reply = \u0026#34;{\u0026#34; + response.content[0].text messages.append({\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: reply}) return parse_action(reply) return agent The \u0026quot;{\u0026quot; prefill trick is worth calling out. By starting the assistant\u0026rsquo;s response with {, we\u0026rsquo;re nudging the LLM to continue with JSON rather than English (or Chinese if you\u0026rsquo;re on GLM). It\u0026rsquo;s not bulletproof (hence the defensive parser), but it works okay.\nRun it:\nagent = make_agent() with CrawlerClient() as client: result = run_game(client, agent, model_id=\u0026#34;claude-haiku-4.5\u0026#34;) print(f\u0026#34;Game over! 
Floor {result.outcome.floor}\u0026#34;) print(f\u0026#34;Watch replay: {result.spectator_url}\u0026#34;) The full example with error handling, debug output, and game resumption is in examples/anthropic_agent.py. There are equivalent examples for OpenAI (works with any OpenAI-compatible API, including Ollama and LMStudio) and local models.\nIf there are currently bots playing, you can see them strugg.. crushing their dungeons live. Every game also produces shareable replay links to show off to your friends how great your bot is doing!\nThe leaderboard The CrawlerVerse leaderboard. Your bot\u0026#39;s name could be here. The CrawlerVerse leaderboard tracks the best run for each model ID. When you call run_game, you pass a model_id string that identifies your agent on the leaderboard. So claude-haiku-4.5 and gpt-4o show up as separate entries, and your fine-tuned my-custom-llama-v3 would get its own row too.\nThis is almost like a benchmark, but not the boring kind. Nobody\u0026rsquo;s picking between four multiple choice options on a standardized test. Your model has to explore a dungeon it\u0026rsquo;s never seen before, manage health and inventory, decide when to fight and when to run, and not walk into walls. The levels are procedurally generated, so you can\u0026rsquo;t memorize solutions.\nIf you\u0026rsquo;re in the fine-tuning or RL space and you\u0026rsquo;re tired of optimizing for MTEB, I think this could be more interesting (definitely not biased). The signal is clean (floor reached, turns survived) and the leaderboard is public.\nBuild your own Three ways in, depending on who you are:\n\u0026ldquo;I just want to try it\u0026rdquo; pip install crawlerverse Grab an API key from crawlerver.se, copy one of the example agents, and run it. You\u0026rsquo;ll have a bot on the leaderboard in five minutes.\nThe SDK supports both sync and async clients, so if you want to run multiple games concurrently:\nfrom crawlerverse import AsyncCrawlerClient, async_run_game async with AsyncCrawlerClient() as client: result = await async_run_game(client, my_agent) N.B.: we do have some basic rate limiting set up, and we\u0026rsquo;re just nerds trying to have fun, so be nice.\n\u0026ldquo;I have an OpenClaw running on a Mac Mini\u0026rdquo; If you\u0026rsquo;re one of the people that managed to snag a Mac Mini and set up OpenClaw, you can wire up a CrawlerVerse skill. Create a folder at ~/.openclaw/skills/crawlerverse/ with a SKILL.md:\n--- name: crawlerverse description: Play CrawlerVerse, a roguelike dungeon crawler game. Use when the user wants to play a dungeon game, fight monsters, or compete on the CrawlerVerse leaderboard. tools: Bash metadata: {\u0026#34;openclaw\u0026#34;:{\u0026#34;requires\u0026#34;:{\u0026#34;env\u0026#34;:[\u0026#34;CRAWLERVERSE_API_KEY\u0026#34;]}}} --- # CrawlerVerse Dungeon Crawler Play a roguelike dungeon game via the CrawlerVerse API. ## Setup Run `pip install crawlerverse` if not already installed. ## How to Play Run the game script: ```bash python ~/.openclaw/skills/crawlerverse/scripts/play.py ``` The script will start a game and ask you to make decisions each turn. Each turn you\u0026#39;ll see what\u0026#39;s around you (monsters, items, walls) and need to choose an action: move, attack, pickup items, use potions, etc. 
## Strategy - Attack adjacent monsters to clear the path - Pick up and equip weapons and armor - Use health potions when HP is low - Find stairs to descend to the next floor - Explore systematically, don\u0026#39;t get surrounded Then add a scripts/play.py that uses the SDK\u0026rsquo;s run_game with a callback that asks OpenClaw for each decision β same pattern as the Claude example above, but using OpenClaw\u0026rsquo;s LLM instead. The anthropic_agent.py example is a good starting point to adapt.\nYour Mac Mini can fight monsters while you sleep. Each game takes a few minutes and a handful of cents in tokens, so you could leave it grinding the leaderboard overnight and check the results in the morning.\n\u0026ldquo;I want to train a model for this\u0026rdquo; The game API is a pretty clean RL environment. Observations in, actions out. Discrete action space (9 actions, 8 directions). Clear reward signal (floor reached, monsters killed, turns survived). Episodes are short enough to iterate quickly.\nThe API docs cover everything you need. The Python SDK is MIT licensed. The leaderboard is waiting.\nSome ideas to get you started:\nPrompt engineering: the system prompt in the example is basic. There\u0026rsquo;s a lot of room for better strategic instructions, few-shot examples, or chain-of-thought reasoning. Fine-tuning: collect a dataset of game transcripts from good runs and fine-tune a smaller model on it. RL from game outcomes: use floor reached as a reward signal and train directly on gameplay. Come beat the high score The SDK is at github.com/crawlerverse/crawlerverse-sdks. The API docs are at crawlerver.se/docs/agent-api. The leaderboard is at crawlerver.se/leaderboard.\nCurrent best run is literally floor 1. Come take it.\nThe name changes are a whole saga on their own. Steinberger originally called it Clawdbot (a play on Claude bc he loved Claude and Claude Code), Anthropic sent a cease-and-desist, it became Moltbot, then OpenClaw. The crypto scammers didn\u0026rsquo;t get the memo and launched a $CLAWD token under the original name.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nSo at time of writing we built basic ranged combat and a (de)buff system, that will be the foundation for the magic system. I.e. the magic system doesn\u0026rsquo;t exist yet. All I\u0026rsquo;m saying is maybe don\u0026rsquo;t pick the mage class right now.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nDead last, but still technically on there. You could be first for a bit, depending on how quickly you get a key and run this script.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nGotta use those Claude subscription tokens for something. Your mileage (and invoice) may vary. Any OpenAI compatible API will work too.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nUnderstatement of the year. Hi Gemini.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/building-a-semantic-search-engine-in-250-lines-of-python/","title":"Building a semantic search engine in Β±250 lines of Python","categories":["how-to","search","full-text search","python","machine learning","ai"],"content":" Listen to this article instead Your browser does not support the audio element Once upon a time I wrote a post about building a toy TF-IDF keyword search engine. 
It has been one of the more popular posts I\u0026rsquo;ve written, and in this age of AI I felt a sequel has been long overdue.\nIt\u0026rsquo;s pretty fast (even though it\u0026rsquo;s written in pure Python), it ranks results with TF-IDF, and it can rank 6.4 million Wikipedia articles for a given query in milliseconds. But it has absolutely no context of what words mean.\n\u0026gt;\u0026gt;\u0026gt; index.search(\u0026#39;alcoholic beverage disaster in England\u0026#39;) search took 0.12 milliseconds [] Not a single result, even though the London Beer Flood is sitting right there. For context, the London Beer Flood is an accident that occurred at a London brewery in 1814 where a giant vat of beer burst and flooded the surrounding streets. That\u0026rsquo;s very much an \u0026ldquo;alcoholic beverage disaster in England\u0026rdquo;, but our keyword search engine doesn\u0026rsquo;t know that \u0026ldquo;beer\u0026rdquo; and \u0026ldquo;alcoholic beverage\u0026rdquo; are similar concepts, or that \u0026ldquo;London\u0026rdquo; is in \u0026ldquo;England,\u0026rdquo; or that a \u0026ldquo;flood\u0026rdquo; is a type of \u0026ldquo;disaster.\u0026rdquo; It only knows how to match strings 1.\nIn the original post, we built an inverted index with TF-IDF ranking, and recently the project underwent some modernization with Hugging Face datasets and proper tooling. The core search engine hasn\u0026rsquo;t changed: tokenize, stem, look up matching documents, rank by term frequency weighted by inverse document frequency. It works great when your query uses the same words as the documents. But when you don\u0026rsquo;t know the right words to search for, you end up with empty results and sad emojis.\nBy the end of this post, we\u0026rsquo;ll have a vector-based search mode that captures meaning, in around 200 lines or so of Python 2:\n\u0026gt;\u0026gt;\u0026gt; vector_index.search(\u0026#39;alcoholic beverage disaster in England\u0026#39;, k=3) [(\u0026#39;1900 English beer poisoning\u0026#39;, 0.6237), (\u0026#39;United States Industrial Alcohol Company\u0026#39;, 0.5980), (\u0026#39;Timeline of British breweries\u0026#39;, 0.5732)] All the code is on GitHub.\nWhat are embeddings? The TF-IDF search engine already represents documents as vectors, in a way. Each document becomes a sparse vector with one dimension for every unique word (or token really) in the corpus. If the corpus has 500,000 unique words, each document can be represented as a vector with 500,000 dimensions, most of which are zeroes. One way to think about this is that every unique word has an index on an array of length 500,000 (say beer is index 42). Then every document that has the word beer in it, would have a 1 (or the number of times the word beer occurs in the document, i.e. term frequency) on index 42 (and zeroes for all the words that aren\u0026rsquo;t in the document). The problem is that \u0026ldquo;beer\u0026rdquo; and \u0026ldquo;alcoholic beverage\u0026rdquo; end up on completely random positions/dimensions with no relationship encoded between them.\nEmbeddings take a different approach. Instead of one dimension per word, a neural network learns to compress text into a dense vector of a few hundred dimensions, where every value is meaningful.\nThe main idea is that if you just look at enough data, you\u0026rsquo;ll observe similar things having similar words co-occur more often. The model is trained on billions of sentences, and it learns that similar meanings should end up near each other in this vector space. 
\u0026ldquo;King\u0026rdquo; and \u0026ldquo;queen\u0026rdquo; are close together. \u0026ldquo;King\u0026rdquo; and \u0026ldquo;bicycle\u0026rdquo; are far apart. \u0026ldquo;Beer\u0026rdquo; and \u0026ldquo;alcoholic beverage\u0026rdquo; are neighbors3.\nThe key insight is that these vectors capture semantic relationships, not just lexical ones. The model has never seen our specific query or our specific documents, but it has learned enough about language to know that a query about \u0026ldquo;alcoholic beverage disaster in England\u0026rdquo; should be close to a document about a beer flood in London.\nIf you want to go deep on how these models work beyond this very high level gist, I highly recommend Jay Alammar\u0026rsquo;s Illustrated Word2Vec blog post for the foundational intuition, and the Sentence-BERT paper for the specific architecture we\u0026rsquo;ll use. You don\u0026rsquo;t need to understand the internals to use them, just like you don\u0026rsquo;t need to understand combustion engines to drive a car.\nGenerating embeddings We\u0026rsquo;ll use the sentence-transformers library for this; essentially there\u0026rsquo;s a whole bunch of open models available on the internet, trained on different datasets and languages, output different dimensions, use different model architectures, etc. We\u0026rsquo;re not going too worry too much about that and just load a pre-trained model and call encode:\nfrom sentence_transformers import SentenceTransformer # the first time you do this it\u0026#39;ll download the model weights model = SentenceTransformer(\u0026#39;all-MiniLM-L6-v2\u0026#39;) vector = model.encode(\u0026#39;alcoholic beverage disaster in England\u0026#39;) # vector is a numpy array of 384 floats len(vector) # =\u0026gt; 384 # array([ 4.01491821e-02, -1.56203713e-02, 3.25193480e-02, -4.24171740e-04, # 7.81598538e-02, 4.59311642e-02, -1.13611883e-02, 1.07017970e-02 # ... # 8.66697952e-02, -7.92134833e-03, 4.44980636e-02, -1.87412277e-02], # dtype=float32) That\u0026rsquo;s it. The model takes a string and returns a 384-dimensional vector that captures its meaning. The all-MiniLM-L6-v2 model is small (80MB), fast, and surprisingly good for general-purpose semantic search4.\nIf you prefer a hosted API, OpenAI offers an embeddings endpoint:\nfrom openai import OpenAI client = OpenAI() # or if you prefer OpenRouter # client = OpenAI(base_url=https://openrouter.ai/api/v1, api_key=\u0026#34;sk-or-v1-...\u0026#34;) response = client.embeddings.create( input=\u0026#39;alcoholic beverage disaster in England\u0026#39;, model=\u0026#39;text-embedding-3-small\u0026#39; ) vector = response.data[0].embedding len(vector) # =\u0026gt; 1536 # [0.0035473527386784554, # 0.030878795310854912, # 0.03244471549987793, # 0.03686774522066116, # -0.02967001684010029, # -0.018530024215579033, # 0.027856849133968353, # 0.010542471893131733, # -0.03247218579053879, # -0.0033705001696944237, # ... # N.B.: note that this is a list of floats, not a NumPy array! N.B.: pro-tip, OpenRouter.ai has tons of models available and has an OpenAI-compatible API if you want to compare different models. Many of the models they offer even have free tiers, although at time of writing there seem to be no free embedding models.\nThe tradeoffs are straightforward: sentence-transformers is free because it runs locally, and produces 384-dimensional vectors. OpenAI costs money, requires a network call, and produces 1536-dimensional vectors. 
For this project, free and local wins (free is my favorite price), and the smaller vectors will be easier on our memory budget later.\nIt also comes with batch encoding, which we\u0026rsquo;re going to need when we\u0026rsquo;re embedding 6.4 million documents:\nvectors = model.encode( [doc.fulltext for doc in documents], batch_size=256, show_progress_bar=True ) This will embed our documents in batches of 256, showing a pretty progress bar so you know roughly how long your coffee break should be5.\nWe\u0026rsquo;re going to process our 6.4 million documents in chunks rather than loading everything into memory at once. The implementation on GitHub processes documents in groups of 10,000, saving intermediate embeddings as checkpoints. I got sick of having to start over while I was writing this because something crashed halfway through.\nIf you\u0026rsquo;d rather skip the multi-hour encoding step, I\u0026rsquo;ve uploaded the pre-computed embeddings to Hugging Face π€ so you can download them and start searching right away. Just move the JSON and .npy files into data/checkpoints and run uv run python run_semantic.py.\nThe memory problem Let\u0026rsquo;s do some napkin math (it\u0026rsquo;s the best math). We have 6.4 million documents, each represented by a 384-dimensional vector of 32-bit floats:\n$$6{,}400{,}000 \\times 384 \\times 4 \\text{ bytes} \\approx 9.2 \\text{ GB}$$\nThat\u0026rsquo;s not an insignificant amount of RAM, and that stuff comes at a premium these days.\nOne easy win we can have here: we\u0026rsquo;ll store the vectors as 16 bit floats instead of 32 bit. The precision loss is negligible for ranking because we\u0026rsquo;re just comparing relative similarity scores, not doing stuff that\u0026rsquo;s sensitive to rounding errors like launching rockets:\n$$6{,}400{,}000 \\times 384 \\times 2 \\text{ bytes} \\approx 4.6 \\text{ GB}$$\nBetter, but still a lot. Add the documents themselves and you\u0026rsquo;re past what a typical laptop has to spare. My MacBook would start sweating just thinking about it.\nThe solution is numpy.memmap. Instead of loading the entire matrix into RAM, we memory-map the file: the operating system maps it into our address space but only loads pages from disk as we access them, and evicts them when memory gets tight. We get the programming model of \u0026ldquo;everything in RAM\u0026rdquo; without actually needing all that RAM. Also maybe don\u0026rsquo;t do this in production systems.\nimport numpy as np # create a memory-mapped file to write embeddings into as we go matrix = np.lib.format.open_memmap( \u0026#34;vectors.npy\u0026#34;, mode=\u0026#34;w+\u0026#34;, dtype=np.float16, shape=(num_documents, 384), ) # write chunks of embeddings directly to disk for i in range(0, num_documents, chunk_size): chunk_vectors = embed_batch(model, chunk_texts) matrix[i:i+len(chunk_vectors)] = chunk_vectors.astype(np.float16) matrix.flush() # later, load it back memory-mapped for search matrix = np.load(\u0026#34;vectors.npy\u0026#34;, mmap_mode=\u0026#39;r\u0026#39;) # matrix looks, acts and quacks like a normal numpy array # lives on disk, the OS pages in what we need This works because the OS virtual memory system is really good at this. It maps the file into the process\u0026rsquo;s address space, loads 4KB pages on demand when you touch them, and transparently evicts pages under memory pressure. For our search workload, where each query touches every row but only briefly, the OS can page data in and out efficiently. 
We don\u0026rsquo;t need to write any caching logic ourselves.\nCosine similarity Now we have vectors for our query and for every document. How do we find the most similar ones? We need a way to measure how \u0026ldquo;close\u0026rdquo; two vectors are in our 384-dimensional space. The standard measure for this is cosine similarity.\nThe formula Nicked from https://aitechtrend.com/how-cosine-similarity-can-improve-your-machine-learning-models/ Cosine similarity measures the cosine of the angle between two vectors. Two vectors pointing in the same direction have a cosine of 1. Two perpendicular vectors have a cosine of 0. Two vectors pointing in opposite directions have a cosine of -1.\n$$\\cos(\\theta) = \\frac{\\mathbf{A} \\cdot \\mathbf{B}}{|\\mathbf{A}| |\\mathbf{B}|}= \\frac{\\sum_{i=1}^{n} A_i B_i}{\\sqrt{\\sum_{i=1}^{n} A_i^2} \\times \\sqrt{\\sum_{i=1}^{n} B_i^2}}$$\nDon\u0026rsquo;t let the formula intimidate you. It has three parts:\nNumerator (the dot product): multiply each pair of corresponding elements and add them up. If two vectors have large values in the same dimensions, this number will be big. Denominator (the magnitudes): compute the \u0026ldquo;length\u0026rdquo; of each vector and multiply them together. This normalizes the result so it doesn\u0026rsquo;t depend on how long the vectors are, only their direction. Result: a number between -1 and 1, where 1 means identical direction, 0 means unrelated, and -1 means opposite 6. A naive implementation Let\u0026rsquo;s build it from scratch in plain Python, because I think seeing the pieces helps:\nimport math def dot_product(a, b): return sum(ai * bi for ai, bi in zip(a, b)) def magnitude(v): return math.sqrt(sum(vi ** 2 for vi in v)) def cosine_similarity(a, b): return dot_product(a, b) / (magnitude(a) * magnitude(b)) And it does what we\u0026rsquo;d expect:\n\u0026gt;\u0026gt;\u0026gt; cosine_similarity([1, 2, 3], [1, 2, 3]) 1.0 \u0026gt;\u0026gt;\u0026gt; cosine_similarity([1, 2, 3], [-1, -2, -3]) -1.0 \u0026gt;\u0026gt;\u0026gt; cosine_similarity([1, 0, 0], [0, 1, 0]) 0.0 Identical vectors, opposite vectors, and perpendicular vectors. The math checks out.\nThe problem at scale Here\u0026rsquo;s the thing: we need to compute cosine similarity between our query vector and every single document vector. That\u0026rsquo;s 6.4 million similarity computations, each involving 384 multiplications and additions. That\u0026rsquo;s about 2.5 billion floating-point operations per query. That naive Python implementation would take\u0026hellip; a while.\nThe normalization trick There\u0026rsquo;s a pretty elegant optimization trick we can do though. If we normalize all vectors to unit length (magnitude = 1) at index time, the denominator of the cosine similarity formula becomes $1 \\times 1 = 1$. The entire formula can be simplified to just the dot product:\n$$\\cos(\\theta) = \\mathbf{A} \\cdot \\mathbf{B} \\quad \\text{(when both vectors have unit length)}$$\nAnd a dot product between a matrix and a vector? That\u0026rsquo;s exactly what NumPy was made for. 
It drops into optimized BLAS routines7 that use SIMD instructions, cache-friendly memory access patterns, and all the other tricks that make NumPy fast:\nimport numpy as np # normalize all document vectors once at index time norms = np.linalg.norm(matrix, axis=1, keepdims=True) matrix /= norms # at search time, normalize the query and dot product query = query_vector / np.linalg.norm(query_vector) scores = matrix @ query # cosine similarity with ALL 6.4M documents That matrix @ query line computes cosine similarity against every document in one shot. NumPy upcasts the float16 matrix to float32 on the fly during the multiplication, so we get full precision where it matters (the computation) while keeping the storage compact. On my laptop, this takes about 2 seconds for 6.4 million documents. Not bad for a brute-force search on a MacBook.\nThe VectorIndex class Here\u0026rsquo;s the full class that ties everything together:\nclass VectorIndex: def __init__(self, dimensions=384): self.dimensions = dimensions self.documents = {} self._matrix = None def build(self, documents, vectors): for i, doc in enumerate(documents): self.documents[i] = doc self._matrix = np.array(vectors, dtype=np.float32) # normalize to unit vectors so dot product = cosine similarity norms = np.linalg.norm(self._matrix, axis=1, keepdims=True) norms[norms == 0] = 1 # avoid division by zero self._matrix /= norms def search(self, query_vector, k=10): # keep the query in float32, numpy will upcast the # matrix (which may be float16) automatically during matmul query = np.array(query_vector, dtype=np.float32) query = query / np.linalg.norm(query) scores = self._matrix @ query # find top-k results without sorting the entire array k = min(k, len(self.documents)) top_k = np.argpartition(scores, -k)[-k:] top_k = top_k[np.argsort(scores[top_k])[::-1]] return [(self.documents[int(i)], float(scores[i])) for i in top_k] Most of this should look familiar from the explanation above, but there\u0026rsquo;s one trick worth calling out: np.argpartition. When you only want the top 10 results out of 6.4 million, sorting the entire array is wasteful. argpartition runs in $O(n)$ time and gives us the indices of the top-k elements (unordered), and then we only sort those k elements. It\u0026rsquo;s the difference between sorting a phone book and just pulling out the top 10 entries 8.\nKeyword vs. semantic: where each wins Now that we have both search modes, let\u0026rsquo;s see how they compare:\nQuery Keyword search Semantic search \u0026ldquo;London Beer Flood\u0026rdquo; Exact match, instant Also finds it (0.74) \u0026ldquo;alcoholic beverage disaster in England\u0026rdquo; Nothing 1900 English beer poisoning, Timeline of British breweries \u0026ldquo;python programming language\u0026rdquo; Lots of results Python syntax and semantics, CPython, Python compiler \u0026ldquo;large constricting reptiles\u0026rdquo; Nothing Outline of reptiles, Giant lizard, Squamata \u0026ldquo;color\u0026rdquo; vs \u0026ldquo;colour\u0026rdquo; Only matches exact spelling9 Understands they mean the same thing The pattern is clear. Keyword search is precise \u0026ndash; when your query matches the words used in the documents, it\u0026rsquo;s fast and reliable. Semantic search understands meaning \u0026ndash; it bridges the vocabulary gap between how you phrase your query and how the document was written. \u0026ldquo;Alcoholic beverage disaster\u0026rdquo; finds beer poisoning articles. \u0026ldquo;Large constricting reptiles\u0026rdquo; finds squamates. 
Neither query shares a single word with the documents it found.\nThey\u0026rsquo;re not competing approaches. They\u0026rsquo;re complementary. So what\u0026rsquo;s next?\nWhat\u0026rsquo;s next: hybrid search We now have two search engines: one that matches keywords, and one that understands meaning. What if we combined them? Use keyword search for precision on exact terms, semantic search for understanding meaning, and blend the scores together.\nThat\u0026rsquo;s actually what modern search engine implementations like Elasticsearch, Vespa, and Pinecone do under the hood. It\u0026rsquo;s called hybrid search, and it\u0026rsquo;s what we\u0026rsquo;ll build in part 3. Stay tuned.\nWe applied some standard tricks to increase the likelihood of strings matching, but that only gets us so far, at the cost of reducing precision.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nTbf it\u0026rsquo;s 200 lines that plumb a bunch of libraries together; I\u0026rsquo;m not reimplementing the transformers library here.
The goal is to have an illustrative example of how this technology can be applied in a search engine context.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe original insight dates back to Word2Vec (2013), where researchers at Google showed you could learn word vectors by predicting surrounding words. The famous example: vector(\u0026quot;king\u0026quot;) - vector(\u0026quot;man\u0026quot;) + vector(\u0026quot;woman\u0026quot;) β vector(\u0026quot;queen\u0026quot;). Modern sentence-transformers build on this idea but operate on entire sentences rather than individual words.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe all-MiniLM-L6-v2 model is a good balance of speed and quality. I\u0026rsquo;m also on an M4 MacBook so there\u0026rsquo;s only so much I can ask of the processor to embed 6.4m documents in a reasonable amount of time. There\u0026rsquo;s tons of models out there, with various different specializations (trained on multiple languages like bge-m3 or much higher dimensionality to capture wider ranges of meaning better like the Qwen3 family of embedding models). The MTEB leaderboard ranks models by various benchmarks, so which one to pick depends on your use case and how many GPUs you can afford.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOn a CPU, encoding 6.4 million documents takes a few hours. On a GPU, it could be more like 20 minutes. If you have a CUDA-capable GPU, sentence-transformers will use it automatically. I\u0026rsquo;m on Apple Silicon, and it\u0026rsquo;ll use the mps backend leveraging the Apple GPU, but while faster than CPU it\u0026rsquo;s still significantly slower than an Nvidia GPU.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIn practice, sentence-transformer embeddings rarely produce negative cosine similarities. The values tend to cluster between 0 and 1, with most unrelated pairs landing around 0.1-0.3. Don\u0026rsquo;t expect to see many -1.0 values in the wild.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nBLAS (or Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) are linear algebra libraries, often written in Fortran, that have highly optimized low-level routines for doing linear algebra. These libraries have been around for ages, and are what makes NumPy so fast. It\u0026rsquo;s like calling upon the Wisdom of the Ancientsβ’\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nTechnically, np.argpartition uses the introselect algorithm, which is a hybrid of quickselect and median-of-medians. It guarantees $O(n)$ worst-case, which is nice.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOur keyword search does have stemming, so it would catch \u0026ldquo;colors\u0026rdquo; and \u0026ldquo;coloring,\u0026rdquo; but not the British/American spelling difference. The stemmer maps \u0026ldquo;colour\u0026rdquo; to \u0026ldquo;colour\u0026rdquo; and \u0026ldquo;color\u0026rdquo; to \u0026ldquo;color\u0026rdquo;; different stems.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/modernizing-python-search-engine/","title":"Modernizing my 150-line Python search engine: Yahoo! dumps -> Hugging Face π€","categories":["how-to","search","python","open-source"],"content":"A few years ago I wrote a full-text search engine in 150 lines of Python. The Wikipedia data source it relied on has since been discontinued, and the tooling around it was showing its age. I wanted to (finally) write a follow-up about semantic search, but I realized that I had to get the old repository in a working state first. 
It\u0026rsquo;s now using Hugging Face (π€) datasets, uv, ruff, pytest, and GitHub Actions, without touching the core search logic.\nListen to this article instead Your browser does not support the audio element The problem The original project downloaded Wikipedia abstracts from an XML dump hosted at dumps.wikimedia.org. This was a convenient Β±800mb gzipped XML file containing about 6.3 million article titles, URLs, and abstracts. The code used lxml to stream-parse the XML and requests to download it.\nTurns out these dumps were specifically made for Yahoo!, so at some point, Wikimedia proposed sunsetting these abstract dumps. They required them maintaining a dedicated MediaWiki extension called ActiveAbstract, and the same data could be extracted from the main article dumps. Fair enough, but that means the convenient data dump for this little project\u0026rsquo;s data pipeline is now busted.\nHugging Face datasets Thankfully, Wikimedia publishes1 a Wikipedia dataset on Hugging Face that contains all English Wikipedia articles with their titles, URLs, and full text. The 20231101.en config has 6.4 million articles, roughly the same size as the old abstract dump, and has 2.5 years more articles.\nThe Hugging Face datasets library handles downloading, caching, and efficient iteration out of the box. That means I delete both my questionable code in download.py (the HTTP download logic) and the XML parsing in load.py with essentially one function call.\nHere\u0026rsquo;s what the old manual load.py looked like:\nimport gzip from lxml import etree from search.documents import Abstract def load_documents(): with gzip.open(\u0026#39;data/enwiki-latest-abstract.xml.gz\u0026#39;, \u0026#39;rb\u0026#39;) as f: doc_id = 0 for _, element in etree.iterparse(f, events=(\u0026#39;end\u0026#39;,), tag=\u0026#39;doc\u0026#39;): title = element.findtext(\u0026#39;./title\u0026#39;) url = element.findtext(\u0026#39;./url\u0026#39;) abstract = element.findtext(\u0026#39;./abstract\u0026#39;) yield Abstract(ID=doc_id, title=title, url=url, abstract=abstract) doc_id += 1 element.clear() And here\u0026rsquo;s the new version:\nfrom datasets import load_dataset from tqdm import tqdm from search.documents import Abstract DATASET = \u0026#34;wikimedia/wikipedia\u0026#34; DATASET_CONFIG = \u0026#34;20231101.en\u0026#34; def load_documents(): ds = load_dataset(DATASET, DATASET_CONFIG, split=\u0026#34;train\u0026#34;) # this library is used for ML training for doc_id, row in enumerate(tqdm(ds, desc=\u0026#34;Loading documents\u0026#34;)): title = row[\u0026#34;title\u0026#34;] url = row[\u0026#34;url\u0026#34;] # extract first paragraph as abstract text = row[\u0026#34;text\u0026#34;] abstract = text.split(\u0026#34;\\n\\n\u0026#34;)[0] if text else \u0026#34;\u0026#34; yield Abstract(ID=doc_id, title=title, url=url, abstract=abstract) The major difference is that the HF dataset contains the full article text, not just the abstract. So we extract the first paragraph with a simple split(\u0026quot;\\n\\n\u0026quot;)[0]. The rest of the code (the Abstract dataclass, the inverted index, the analysis pipeline, the TF-IDF ranking) remains completely unchanged. The load_documents() function still yields the same Abstract objects.\nThe datasets library also handles caching to ~/.cache/huggingface/, so the Β±20GB download only happens once. 
That replaces the old download.py that did HTTP chunked downloads with progress tracking, and the library gives us progress bars for free (and free is my favorite price).\nLoading progress bars! While we\u0026rsquo;re at it: modern Python tooling Since we\u0026rsquo;re here, I figured I\u0026rsquo;d drag the rest of the project into 2026 too. The original setup was the classic requirements.txt with pinned dependencies and not much else. No tests, no linting, no CI. And GitHub Actions are also free.\npyproject.toml and uv I replaced requirements.txt with a pyproject.toml and switched to uv for dependency management:\n[project] name = \u0026#34;python-searchengine\u0026#34; version = \u0026#34;0.1.0\u0026#34; requires-python = \u0026#34;\u0026gt;=3.10\u0026#34; dependencies = [ \u0026#34;datasets\u0026#34;, \u0026#34;PyStemmer\u0026#34;, ] [dependency-groups] dev = [ \u0026#34;pytest\u0026#34;, \u0026#34;ruff\u0026#34;, ] The dependencies went from three (lxml, PyStemmer, requests) to two (datasets, PyStemmer). The datasets library brings in its own HTTP and caching machinery, so we don\u0026rsquo;t need requests or lxml anymore.\nruff I added ruff for linting with a minimal set of rules (pyflakes, pycodestyle, import sorting). Running it immediately caught a forgotten import requests in run.py and some unsorted imports. It also surfaced a subtle bug in the original index.py:\n# before (bug: second `if` always evaluates, even after AND branch) if search_type == \u0026#39;AND\u0026#39;: documents = [self.documents[doc_id] for doc_id in set.intersection(*results)] if search_type == \u0026#39;OR\u0026#39;: documents = [self.documents[doc_id] for doc_id in set.union(*results)] # after if search_type == \u0026#39;AND\u0026#39;: documents = [self.documents[doc_id] for doc_id in set.intersection(*results)] elif search_type == \u0026#39;OR\u0026#39;: documents = [self.documents[doc_id] for doc_id in set.union(*results)] That second if should have been an elif. It doesn\u0026rsquo;t really matter for correctness, because the early return for invalid search types meant the code paths were mutually exclusive, but it was wasteful and I just wanted a happy linter.\nTests I added pytest with tests for the core search logic, which is basically the analysis pipeline (tokenization, filtering, stemming), the Abstract dataclass, and the Index class (indexing, AND/OR search, ranked results). 
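To give a flavor of what these tests look like, here is a hypothetical sketch (not the actual test file from the repository); it assumes the Abstract dataclass and the Index API from the original post, i.e. an index_document() method and a search() method with search_type and rank arguments, so the import paths and signatures are illustrative rather than exact:

```python
# hypothetical test sketch; module paths and the Index/Abstract API are assumed
from search.documents import Abstract
from search.index import Index


def test_search_and_vs_or():
    # two tiny in-memory documents, no Wikipedia download needed
    index = Index()
    index.index_document(
        Abstract(ID=1, title="London Beer Flood", url="https://example.org/1",
                 abstract="A vat of porter burst and flooded the street")
    )
    index.index_document(
        Abstract(ID=2, title="London Fire", url="https://example.org/2",
                 abstract="A great fire spread through the city")
    )

    # AND requires every query term to match; OR matches any of them
    assert len(index.search("london beer", search_type="AND", rank=False)) == 1
    assert len(index.search("london beer", search_type="OR", rank=False)) == 2
```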
All tests use tiny in-memory data, so they run in about 30ms and don\u0026rsquo;t require downloading Wikipedia:\n$ uv run pytest -v tests/test_analysis.py::test_tokenize PASSED tests/test_analysis.py::test_lowercase_filter PASSED tests/test_analysis.py::test_punctuation_filter PASSED tests/test_analysis.py::test_stopword_filter PASSED tests/test_analysis.py::test_stem_filter PASSED tests/test_analysis.py::test_analyze_full_pipeline PASSED tests/test_analysis.py::test_analyze_filters_empty_tokens PASSED tests/test_index.py::TestAbstract::test_fulltext PASSED tests/test_index.py::TestAbstract::test_term_frequency PASSED tests/test_index.py::TestIndex::test_index_document PASSED tests/test_index.py::TestIndex::test_index_document_no_duplicate PASSED tests/test_index.py::TestIndex::test_document_frequency PASSED tests/test_index.py::TestIndex::test_search_and PASSED tests/test_index.py::TestIndex::test_search_or PASSED tests/test_index.py::TestIndex::test_search_invalid_type PASSED tests/test_index.py::TestIndex::test_search_ranked PASSED tests/test_index.py::TestIndex::test_search_ranked_ordering PASSED 17 passed in 0.03s GitHub Actions Finally, a CI workflow that runs lint and tests across Python 3.10 through 3.13:\nname: CI on: push: branches: [master] pull_request: branches: [master] jobs: lint: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: astral-sh/setup-uv@v5 - run: uv sync - run: uv run ruff check . test: runs-on: ubuntu-latest strategy: matrix: python-version: [\u0026#34;3.10\u0026#34;, \u0026#34;3.11\u0026#34;, \u0026#34;3.12\u0026#34;, \u0026#34;3.13\u0026#34;] steps: - uses: actions/checkout@v4 - uses: astral-sh/setup-uv@v5 with: python-version: ${{ matrix.python-version }} - run: uv sync - run: uv run pytest -v What stayed the same The nice thing about all of this is that the core search engine (the inverted index, the analysis pipeline, the TF-IDF scoring) didn\u0026rsquo;t change at all. The Abstract dataclass, the Index class, the tokenizer, the stemmer, the stopword filter are also all identical to the original post. No rewriting old blog posts! This is entirely about the plumbing around it: how the data gets in, how dependencies are managed, and how correctness is verified.\nIf you want to try it yourself, the updated code is on GitHub:\nuv sync uv run python run.py The first run will download the Wikipedia dataset from Hugging Face (Β±20GB, cached after that). If you want faster downloads, you can set a Hugging Face token:\nexport HF_TOKEN=hf_... Also do note that it wants to load a lot of data, and that takes a long time, especially if your laptop isn\u0026rsquo;t as well-endowed in the RAM department π
\nWikipedia is awesome, and you should donate to them. Go do it now.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/migrating-hugo-blog-with-claude-code/","title":"Migrating my hopelessly outdated Hugo blog with Claude Code","categories":["hugo","ai","how-to","blog"],"content":"I haven\u0026rsquo;t really touched my blog since 2019. The theme was ancient, jQuery was everywhere, and I kept putting off the migration. Then I decided to let an LLM do it. Here\u0026rsquo;s what happened when Claude Code spent an evening trying to modernize my setup.\nListen to this article instead Your browser does not support the audio element The situation It\u0026rsquo;s October 2025, and my blog is running on a Hugo theme from 2018 (the inimitable hyde-x) that was last updated when people still thought cryptocurrency was going to revolutionize everything1. The site worked fine (Google Analytics tells me there were at least three people who stumbled on my posts every month), but every time I thought about writing something new, I\u0026rsquo;d open the repo and would immediately run head-first into approximately all of the tech debt.\n2019 called, they want their blog theme back If you want to see the old site in all its pre-covid glory, the Internet Archive has you covered.\nThe problems were\u0026hellip; numerous:\nHugo straight up didn\u0026rsquo;t build the site anymore jQuery 3.x doing things that vanilla JavaScript could handle Font Awesome 4.3.0 (current version: 6.x) A custom Lunr.js search implementation that I\u0026rsquo;d hacked together in 2018 Hugo configuration using deprecated syntax An interactive bloom filter example that is exactly the type of jank you\u0026rsquo;d expect from a non-frontender Zero responsive design considerations beyond \u0026ldquo;it sort of works on mobile if you squint\u0026rdquo; I\u0026rsquo;d been meaning to fix this for years. I\u0026rsquo;ve tried. And then one change broke three other things and it just became a complete PITA.\nWe could let\u0026hellip; an LLM do it? Clever little hobbitses? I\u0026rsquo;ve been using Claude Code at work a lot, and this seems like a great project for an LLM to solve. It can deal with all the nitty-gritty fixes and find random changes I made to a template 5 years ago.\nSo I kicked it off with this prompt:\nYou are a principal engineer level expert on using static sites to write blogs like these; simple, quick to write posts in markdown, and publish them on the internet. However, as you can see, this repository is years outdated. What I need you to do is come up with a migration plan. We want to migrate to something more modern but with the following parameters: static site deployed on github pages, existing blog posts need to be migrated under the same URL, we should maintain custom functionality like the Lunr.js search we currently have, as well as any other custom code we have laying around. We can change the theme (https://adityatelange.github.io/hugo-PaperMod/ seems nice, for example), but that\u0026rsquo;s not a requirement. Search the internet, figure out what the current best practices for something like this are, and give me a couple of options. Make sure to check the existing code base for quirks that are built in, and make sure to ultrathink on this.\nWhat went well The analysis phase Claude started by reading the codebase (gasp). 
It found my custom audio shortcode, identified the interactive JavaScript examples, discovered the text-to-speech Python script, and even noticed the opensearch.xml integration for browser search bars.\nI find telling it to \u0026ldquo;search the internet\u0026rdquo; a pretty useful hack. It helps avoiding using older versions of libraries (i.e. that were current when it was trained), and makes it more likely to find viable alternatives to outdated stuff. It came back with actual data about Hugo still being viable, PaperMod being the most popular theme, and Pagefind being a more modern alternative to Lunr.js (ended up using Fuse.js for simplicity).\nURL preservation One of the non-negotiables was keeping all existing URLs intact. There\u0026rsquo;s links to these posts scattered across the internet, and link rot is already bad enough without me contributing to it.\nClaude did its job and preserved the permalink structure (/:slug/) and verified that every single post URL remained identical. No redirects needed, no broken links, no SEO penalty.\nThe custom features This is where I expected things to fall apart. I have:\nA custom audio player shortcode for text-to-speech versions of posts An interactive bloom filter visualization using MurmurHash and janky jQuery Custom CSS for code blocks and figures Claude preserved most of this, and improved some of it:\nFrom jQuery to vanilla JavaScript My bloom filter example was written in jQuery because that\u0026rsquo;s what I learned back in 2012, and I had Claude rewrite it into just vanilla JavaScript:\n// Before (jQuery) $(\u0026#39;#bits #\u0026#39; + a).css({ \u0026#39;background-color\u0026#39;: \u0026#39;#ac4142\u0026#39; }).addClass(\u0026#39;set\u0026#39;); // After (vanilla JS) const cell = document.querySelector(`#bits td[data-index=\u0026#34;${index}\u0026#34;]`); if (cell) { cell.classList.add(\u0026#39;set\u0026#39;); } No more 30KB jQuery dependency. The interactive example still works. Nice\nTry the interactive bloom filter example yourself You can try the interactive bloom filter example yourself to see it in action.\nWhere it got stuck As much as I\u0026rsquo;d liked to have called it a day there, it did skip a whole bunch of things, and made up a bunch of stuff that I definitely didn\u0026rsquo;t tell it to do.\nGaslighting After switching to the 1M context window model to fit everything in, I asked it to write an outline for a post (it wrote a whole post), and specifically to list what it did and didn\u0026rsquo;t do well.\nOh boy.\nHomepage search The old version had a basic search bar on the homepage, and Claude helpfully decided to create a separate search page instead. I think it may have gotten that from the PaperMod examples and confused my instructions with the context it loaded from there, but I was pretty explicit so slightly disappointed.\nIt confidently claimed that it suggested fixes that it didn\u0026rsquo;t. For example, it wrote the following:\nThe old site had a dedicated search page. Claude suggested putting the search bar directly on the homepage with live filtering. As you type, posts filter in real-time using Fuse.js fuzzy matching. It\u0026rsquo;s faster and more intuitive than the old implementation.\nFavicon fiasco Claude confidently removed my existing favicon.png and replaced it with the generic Hugo favicon. This is how it described its actions:\nClaude noticed the favicon was missing and helpfully copied one from the old theme files. 
Unfortunately, it copied the wrong oneβsome generic \u0026ldquo;H\u0026rdquo; icon instead of my Bart Simpson as Pablo Escobar masterpiece. Then it created a favicon.ico that was actually a PNG file (which technically works, but come on). Then it forgot to push all the icon sizes. Then the paths were wrong. We went back and forth on this like four times before I finally said \u0026ldquo;bro, the PNG is correct, the link is just wrong\u0026rdquo; and we got it sorted.\nGot hung up on git submodules It spent a lot of time figuring out how git submodules and the deployment worked (even though I had it write a CLAUDE.md ahead of time). So much so there was a 5 paragraph explanation about a \u0026ldquo;detached HEAD\u0026rdquo; situation, where it went into excruciating detail describing all the git commands it tried to resolve this.\nLearnings? It needs clear goals \u0026ldquo;Migrate my blog\u0026rdquo; is too vague. \u0026ldquo;Migrate to a modern Hugo theme while preserving all URLs and custom features\u0026rdquo; worked much better. The more specific the requirements, the better Claude performed.\nIt does catch things I didn\u0026rsquo;t mention the opensearch.xml file. Claude found it, migrated it, and integrated it with the new theme. It also noticed I had a text-to-speech Python script in the repo and made sure the audio player shortcode still worked.\nIt most definitely needs supervision It confidently told me it fixed stuff that it didn\u0026rsquo;t, and I repeatedly had to tell it to validate and check it\u0026rsquo;s work. The \u0026ldquo;buy a coffee\u0026rdquo; button was missing, and it blindly added footer partials that didn\u0026rsquo;t work, added the button twice, then proceeded to remove it completely and \u0026ldquo;fix it at a later time\u0026rdquo; before finally landing on putting it at the bottom of posts.\nIt really likes sing its own praises. For example, it really likes to commit everything all at once, so after I told it to break down the changes into sensible commits, it decided to describe what it had done as follows:\nThe best part? Claude created sensible git commits with clear messages: - \u0026ldquo;Add PaperMod theme and convert config to YAML\u0026rdquo; - \u0026ldquo;Modernize bloom filter JavaScript\u0026rdquo; - \u0026ldquo;Add custom CSS styling\u0026rdquo; - \u0026ldquo;Improve deploy script\u0026rdquo; Not a single \u0026ldquo;fix stuff\u0026rdquo; or \u0026ldquo;more changes\u0026rdquo; commit in sight. It even wrote a comprehensive PR description.\nIt did write a comprehensive PR description, I guess I\u0026rsquo;ll have to give it that.\nAm I happy with the result? Absolutely. It worked2. It still took a couple of hours. Just don\u0026rsquo;t expect miracles from an LLM, it\u0026rsquo;s still a next-token predictor after all.\nTo be fair, some people still think this. I\u0026rsquo;m not one of them.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe PR is on GitHub if you want to see the actual changes.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/","title":"Building a full-text search engine in 150 lines of Python code","categories":["how-to","search","full-text search","python"],"content":"Full-text search is everywhere. From finding a book on Scribd, a movie on Netflix, toilet paper on Amazon, or anything else on the web through Google (like how to do your job as a software engineer), you\u0026rsquo;ve searched vast amounts of unstructured data multiple times today. 
What\u0026rsquo;s even more amazing is that even though you searched millions (or billions) of records, you got a response in milliseconds. In this post, we are going to explore the basic components of a full-text search engine, and use them to build one that can search across millions of documents and rank them according to their relevance in milliseconds, in less than 150 lines of Python code!\nListen to this article instead Your browser does not support the audio element Data All the code in this blog post can be found on GitHub. I\u0026rsquo;ll provide links with the code snippets here, so you can try running this yourself. You can run the full example by installing the requirements (pip install -r requirements.txt) and running python run.py. This will download all the data and execute the example query with and without rankings.\nBefore we jump into building a search engine, we first need some full-text, unstructured data to search. We are going to be searching abstracts of articles from the English Wikipedia, which is currently a gzipped XML file of about 785mb and contains about 6.27 million abstracts1. I\u0026rsquo;ve written a simple function to download the gzipped XML, but you can also just manually download the file.\nData preparation The file is one large XML file that contains all abstracts. One abstract in this file is contained by a \u0026lt;doc\u0026gt; element, and looks roughly like this (I\u0026rsquo;ve omitted elements we\u0026rsquo;re not interested in):\n\u0026lt;doc\u0026gt; \u0026lt;title\u0026gt;Wikipedia: London Beer Flood\u0026lt;/title\u0026gt; \u0026lt;url\u0026gt;https://en.wikipedia.org/wiki/London_Beer_Flood\u0026lt;/url\u0026gt; \u0026lt;abstract\u0026gt;The London Beer Flood was an accident at Meux \u0026amp; Co\u0026#39;s Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.\u0026lt;/abstract\u0026gt; ... \u0026lt;/doc\u0026gt; The bits we\u0026rsquo;re interested in are the title, the url and the abstract text itself. We\u0026rsquo;ll represent documents with a Python dataclass for convenient data access. We\u0026rsquo;ll add a property that concatenates the title and the contents of the abstract. You can find the code here.\nfrom dataclasses import dataclass @dataclass class Abstract: \u0026#34;\u0026#34;\u0026#34;Wikipedia abstract\u0026#34;\u0026#34;\u0026#34; ID: int title: str abstract: str url: str @property def fulltext(self): return \u0026#39; \u0026#39;.join([self.title, self.abstract]) Then, we\u0026rsquo;ll want to extract the abstract data from the XML and parse it so we can create instances of our Abstract object. We are going to stream through the gzipped XML without loading the entire file into memory first2. We\u0026rsquo;ll assign each document an ID in order of loading (i.e. the first document will have ID=1, the second one will have ID=2, etcetera). 
You can find the code here.\nimport gzip from lxml import etree from search.documents import Abstract def load_documents(): # open a filehandle to the gzipped Wikipedia dump with gzip.open(\u0026#39;data/enwiki.latest-abstract.xml.gz\u0026#39;, \u0026#39;rb\u0026#39;) as f: doc_id = 1 # iterparse will yield the entire `doc` element once it finds the # closing `\u0026lt;/doc\u0026gt;` tag for _, element in etree.iterparse(f, events=(\u0026#39;end\u0026#39;,), tag=\u0026#39;doc\u0026#39;): title = element.findtext(\u0026#39;./title\u0026#39;) url = element.findtext(\u0026#39;./url\u0026#39;) abstract = element.findtext(\u0026#39;./abstract\u0026#39;) yield Abstract(ID=doc_id, title=title, url=url, abstract=abstract) doc_id += 1 # the `element.clear()` call will explicitly free up the memory # used to store the element element.clear() Indexing We are going to store this in a data structure known as an \u0026ldquo;inverted index\u0026rdquo; or a \u0026ldquo;postings list\u0026rdquo;. Think of it as the index in the back of a book that has an alphabetized list of relevant words and concepts, and on what page number a reader can find them.\nBack of the book index Practically, what this means is that we\u0026rsquo;re going to create a dictionary where we map all the words in our corpus to the IDs of the documents they occur in. That will look something like this:\n{ ... \u0026#34;london\u0026#34;: [5245250, 2623812, 133455, 3672401, ...], \u0026#34;beer\u0026#34;: [1921376, 4411744, 684389, 2019685, ...], \u0026#34;flood\u0026#34;: [3772355, 2895814, 3461065, 5132238, ...], ... } Note that in the example above the words in the dictionary are lowercased; before building the index we are going to break down or analyze the raw text into a list of words or tokens. The idea is that we first break up or tokenize the text into words, and then apply zero or more filters (such as lowercasing or stemming) on each token to improve the odds of matching queries to text.\nTokenization Analysis We are going to apply very simple tokenization, by just splitting the text on whitespace. Then, we are going to apply a couple of filters on each of the tokens: we are going to lowercase each token, remove any punctuation, remove the 25 most common words in the English language (and the word \u0026ldquo;wikipedia\u0026rdquo; because it occurs in every title in every abstract) and apply stemming to every word (ensuring that different forms of a word map to the same stem, like brewery and breweries3).\nThe tokenization and lowercase filter are very simple:\nimport Stemmer STEMMER = Stemmer.Stemmer(\u0026#39;english\u0026#39;) def tokenize(text): return text.split() def lowercase_filter(tokens): return [token.lower() for token in tokens] def stem_filter(tokens): return STEMMER.stemWords(tokens) Punctuation is nothing more than a regular expression on the set of punctuation:\nimport re import string PUNCTUATION = re.compile(\u0026#39;[%s]\u0026#39; % re.escape(string.punctuation)) def punctuation_filter(tokens): return [PUNCTUATION.sub(\u0026#39;\u0026#39;, token) for token in tokens] Stopwords are words that are very common and we would expect to occcur in (almost) every document in the corpus. As such, they won\u0026rsquo;t contribute much when we search for them (i.e. (almost) every document will match when we search for those terms) and will just take up space, so we will filter them out at index time. 
The Wikipedia abstract corpus includes the word \u0026ldquo;Wikipedia\u0026rdquo; in every title, so we\u0026rsquo;ll add that word to the stopword list as well. We drop the 25 most common words in English.\n# top 25 most common words in English and \u0026#34;wikipedia\u0026#34;: # https://en.wikipedia.org/wiki/Most_common_words_in_English STOPWORDS = set([\u0026#39;the\u0026#39;, \u0026#39;be\u0026#39;, \u0026#39;to\u0026#39;, \u0026#39;of\u0026#39;, \u0026#39;and\u0026#39;, \u0026#39;a\u0026#39;, \u0026#39;in\u0026#39;, \u0026#39;that\u0026#39;, \u0026#39;have\u0026#39;, \u0026#39;I\u0026#39;, \u0026#39;it\u0026#39;, \u0026#39;for\u0026#39;, \u0026#39;not\u0026#39;, \u0026#39;on\u0026#39;, \u0026#39;with\u0026#39;, \u0026#39;he\u0026#39;, \u0026#39;as\u0026#39;, \u0026#39;you\u0026#39;, \u0026#39;do\u0026#39;, \u0026#39;at\u0026#39;, \u0026#39;this\u0026#39;, \u0026#39;but\u0026#39;, \u0026#39;his\u0026#39;, \u0026#39;by\u0026#39;, \u0026#39;from\u0026#39;, \u0026#39;wikipedia\u0026#39;]) def stopword_filter(tokens): return [token for token in tokens if token not in STOPWORDS] Bringing all these filters together, we\u0026rsquo;ll construct an analyze function that will operate on the text in each abstract; it will tokenize the text into individual words (or rather, tokens), and then apply each filter in succession to the list of tokens. The order is important, because we use a non-stemmed list of stopwords, so we should apply the stopword_filter before the stem_filter.\ndef analyze(text): tokens = tokenize(text) tokens = lowercase_filter(tokens) tokens = punctuation_filter(tokens) tokens = stopword_filter(tokens) tokens = stem_filter(tokens) return [token for token in tokens if token] Indexing the corpus We\u0026rsquo;ll create an Index class that will store the index and the documents. The documents dictionary stores the dataclasses by ID, and the index keys will be the tokens, with the values being the document IDs the token occurs in:\nclass Index: def __init__(self): self.index = {} self.documents = {} def index_document(self, document): if document.ID not in self.documents: self.documents[document.ID] = document for token in analyze(document.fulltext): if token not in self.index: self.index[token] = set() self.index[token].add(document.ID) Searching Now we have all tokens indexed, searching for a query becomes a matter of analyzing the query text with the same analyzer as we applied to the documents; this way we\u0026rsquo;ll end up with tokens that should match the tokens we have in the index. For each token, we\u0026rsquo;ll do a lookup in the dictionary, finding the document IDs that the token occurs in. We do this for every token, and then find the IDs of documents in all these sets (i.e. for a document to match the query, it needs to contain all the tokens in the query). We will then take the resulting list of document IDs, and fetch the actual data from our documents store4.\ndef _results(self, analyzed_query): return [self.index.get(token, set()) for token in analyzed_query] def search(self, query): \u0026#34;\u0026#34;\u0026#34; Boolean search; this will return documents that contain all words from the query, but not rank them (sets are fast, but unordered). 
\u0026#34;\u0026#34;\u0026#34; analyzed_query = analyze(query) results = self._results(analyzed_query) documents = [self.documents[doc_id] for doc_id in set.intersection(*results)] return documents In [1]: index.search(\u0026#39;London Beer Flood\u0026#39;) search took 0.16307830810546875 milliseconds Out[1]: [Abstract(ID=1501027, title=\u0026#39;Wikipedia: Horse Shoe Brewery\u0026#39;, abstract=\u0026#39;The Horse Shoe Brewery was an English brewery in the City of Westminster that was established in 1764 and became a major producer of porter, from 1809 as Henry Meux \u0026amp; Co. It was the site of the London Beer Flood in 1814, which killed eight people after a porter vat burst.\u0026#39;, url=\u0026#39;https://en.wikipedia.org/wiki/Horse_Shoe_Brewery\u0026#39;), Abstract(ID=1828015, title=\u0026#39;Wikipedia: London Beer Flood\u0026#39;, abstract=\u0026#34;The London Beer Flood was an accident at Meux \u0026amp; Co\u0026#39;s Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.\u0026#34;, url=\u0026#39;https://en.wikipedia.org/wiki/London_Beer_Flood\u0026#39;)] Now, this will make our queries very precise, especially for long query strings (the more tokens our query contains, the less likely it\u0026rsquo;ll be that there will be a document that has all of these tokens). We could optimize our search function for recall rather than precision by allowing users to specify that only one occurrence of a token is enough to match our query:\ndef search(self, query, search_type=\u0026#39;AND\u0026#39;): \u0026#34;\u0026#34;\u0026#34; Still boolean search; this will return documents that contain either all words from the query or just one of them, depending on the search_type specified. We are still not ranking the results (sets are fast, but unordered). \u0026#34;\u0026#34;\u0026#34; if search_type not in (\u0026#39;AND\u0026#39;, \u0026#39;OR\u0026#39;): return [] analyzed_query = analyze(query) results = self._results(analyzed_query) if search_type == \u0026#39;AND\u0026#39;: # all tokens must be in the document documents = [self.documents[doc_id] for doc_id in set.intersection(*results)] if search_type == \u0026#39;OR\u0026#39;: # only one token has to be in the document documents = [self.documents[doc_id] for doc_id in set.union(*results)] return documents In [2]: index.search(\u0026#39;London Beer Flood\u0026#39;, search_type=\u0026#39;OR\u0026#39;) search took 0.02816295623779297 seconds Out[2]: [Abstract(ID=5505026, title=\u0026#39;Wikipedia: Addie Pryor\u0026#39;, abstract=\u0026#39;| birth_place = London, England\u0026#39;, url=\u0026#39;https://en.wikipedia.org/wiki/Addie_Pryor\u0026#39;), Abstract(ID=1572868, title=\u0026#39;Wikipedia: Tim Steward\u0026#39;, abstract=\u0026#39;|birth_place = London, United Kingdom\u0026#39;, url=\u0026#39;https://en.wikipedia.org/wiki/Tim_Steward\u0026#39;), Abstract(ID=5111814, title=\u0026#39;Wikipedia: 1877 Birthday Honours\u0026#39;, abstract=\u0026#39;The 1877 Birthday Honours were appointments by Queen Victoria to various orders and honours to reward and highlight good works by citizens of the British Empire. The appointments were made to celebrate the official birthday of the Queen, and were published in The London Gazette on 30 May and 2 June 1877.\u0026#39;, url=\u0026#39;https://en.wikipedia.org/wiki/1877_Birthday_Honours\u0026#39;), ... 
In [3]: len(index.search(\u0026#39;London Beer Flood\u0026#39;, search_type=\u0026#39;OR\u0026#39;)) search took 0.029065370559692383 seconds Out[3]: 49627 Relevancy We have implemented a pretty quick search engine with just some basic Python, but there\u0026rsquo;s one aspect that\u0026rsquo;s obviously missing from our little engine, and that\u0026rsquo;s the idea of relevance. Right now we just return an unordered list of documents, and we leave it up to the user to figure out which of those (s)he is actually interested in. Especially for large result sets, that is painful or just impossible (in our OR example, there are almost 50,000 results).\nThis is where the idea of relevancy comes in; what if we could assign each document a score that would indicate how well it matches the query, and just order by that score? A naive and simple way of assigning a score to a document for a given query is to just count how often that document mentions that particular word. After all, the more that document mentions that term, the more likely it is that it is about our query!\nTerm frequency Let\u0026rsquo;s expand our Abstract dataclass to compute and store its term frequencies when we index it. That way, we\u0026rsquo;ll have easy access to those numbers when we want to rank our unordered list of documents:\n# in documents.py from collections import Counter from .analysis import analyze @dataclass class Abstract: # snip def analyze(self): # Counter will create a dictionary counting the unique values in an array: # {\u0026#39;london\u0026#39;: 12, \u0026#39;beer\u0026#39;: 3, ...} self.term_frequencies = Counter(analyze(self.fulltext)) def term_frequency(self, term): return self.term_frequencies.get(term, 0) We need to make sure to generate these frequency counts when we index our data:\n# in index.py we add `document.analyze()` def index_document(self, document): if document.ID not in self.documents: self.documents[document.ID] = document document.analyze() We\u0026rsquo;ll modify our search function so we can apply a ranking to the documents in our result set. We\u0026rsquo;ll fetch the documents using the same Boolean query from the index and document store, and then, for every document in that result set, we\u0026rsquo;ll simply sum up how often each term occurs in that document:\ndef search(self, query, search_type=\u0026#39;AND\u0026#39;, rank=True): # snip if rank: return self.rank(analyzed_query, documents) return documents def rank(self, analyzed_query, documents): results = [] if not documents: return results for document in documents: score = sum([document.term_frequency(token) for token in analyzed_query]) results.append((document, score)) return sorted(results, key=lambda doc: doc[1], reverse=True) Inverse Document Frequency That\u0026rsquo;s already a lot better, but there are some obvious shortcomings. We\u0026rsquo;re considering all query terms to be of equivalent value when assessing the relevancy for the query. However, it\u0026rsquo;s likely that certain terms have very little to no discriminating power when determining relevancy; for example, a collection with lots of documents about beer would be expected to have the term \u0026ldquo;beer\u0026rdquo; appear often in almost every document (in fact, we\u0026rsquo;re already trying to address that by dropping the 25 most common English words from the index). 
Searching for the word \u0026ldquo;beer\u0026rdquo; in such a case would essentially do another random sort.\nIn order to address that, we\u0026rsquo;ll add another component to our scoring algorithm that will reduce the contribution of terms that occur very often in the index to the final score. We could use the collection frequency of a term (i.e. how often does this term occur across all documents), but in practice the document frequency is used instead (i.e. how many documents in the index contain this term). We\u0026rsquo;re trying to rank documents after all, so it makes sense to have a document level statistic.\nWe\u0026rsquo;ll compute the inverse document frequency for a term by dividing the number of documents (N) in the index by the amount of documents that contain the term, and take a logarithm of that.\nIDF; taken from https://moz.com/blog/inverse-document-frequency-and-the-importance-of-uniqueness We\u0026rsquo;ll then simply multiple the term frequency with the inverse document frequency during our ranking, so matches on terms that are rare in the corpus will contribute more to the relevancy score5. We can easily compute the inverse document frequency from the data available in our index:\n# index.py import math def document_frequency(self, token): return len(self.index.get(token, set())) def inverse_document_frequency(self, token): # Manning, Hinrich and SchΓΌtze use log10, so we do too, even though it # doesn\u0026#39;t really matter which log we use anyway # https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html return math.log10(len(self.documents) / self.document_frequency(token)) def rank(self, analyzed_query, documents): results = [] if not documents: return results for document in documents: score = 0.0 for token in analyzed_query: tf = document.term_frequency(token) idf = self.inverse_document_frequency(token) score += tf * idf results.append((document, score)) return sorted(results, key=lambda doc: doc[1], reverse=True) Future Workβ’ And that\u0026rsquo;s a basic search engine in just a few lines of Python code! You can find all the code on Github, and I\u0026rsquo;ve provided a utility function that will download the Wikipedia abstracts and build an index. Install the requirements, run it in your Python console of choice and have fun messing with the data structures and searching.\nNow, obviously this is a project to illustrate the concepts of search and how it can be so fast (even with ranking, I can search and rank 6.27m documents on my laptop with a \u0026ldquo;slow\u0026rdquo; language like Python) and not production grade software. It runs entirely in memory on my laptop, whereas libraries like Lucene utilize hyper-efficient data structures and even optimize disk seeks, and software like Elasticsearch and Solr scale Lucene to hundreds if not thousands of machines.\nThat doesn\u0026rsquo;t mean that we can\u0026rsquo;t think about fun expansions on this basic functionality though; for example, we assume that every field in the document has the same contribution to relevancy, whereas a query term match in the title should probably be weighted more strongly than a match in the description. Another fun project could be to expand the query parsing; there\u0026rsquo;s no reason why either all or just one term need to match. Why not exclude certain terms, or do AND and OR between individual terms? 
Can we persist the index to disk and make it scale beyond the confines of my laptop RAM?\nAn abstract is generally the first paragraph or the first couple of sentences of a Wikipedia article. The entire dataset is currently about Β±796mb of gzipped XML. There\u0026rsquo;s smaller dumps with a subset of articles available if you want to experiment and mess with the code yourself; parsing XML and indexing will take a while, and require a substantial amount of memory.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWe\u0026rsquo;re going to have the entire dataset and index in memory as well, so we may as well skip keeping the raw data in memory.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWhether or not stemming is a good idea is subject of debate. It will decrease the total size of your index (ie fewer unique words), but stemming is based on heuristics; we\u0026rsquo;re throwing away information that could very well be valuable. For example, think about the words university, universal, universities, and universe that are stemmed to univers. We are losing the ability to distinguish between the meaning of these words, which would negatively impact relevance. For a more detailed article about stemming (and lemmatization), read this excellent article.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWe obviously just use our laptop\u0026rsquo;s RAM for this, but it\u0026rsquo;s a pretty common practice to not store your actual data in the index. Elasticsearch stores it\u0026rsquo;s data as plain old JSON on disk, and only stores indexed data in Lucene (the underlying search and indexing library) itself, and many other search engines will simply return an ordered list of document IDs which are then used to retrieve the data to display to users from a database or other service. This is especially relevant for large corpora, where doing a full reindex of all your data is expensive, and you generally only want to store data relevant to relevancy in your search engine (and not attributes that are only relevant for presentation purposes).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nFor a more in-depth post about the algorithm, I recommend reading https://monkeylearn.com/blog/what-is-tf-idf/ and https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts/","title":"Use Google Cloud Text-to-Speech to create an audio version of your blog posts","categories":["hugo","blog","text-to-speech","how-to"],"content":"Audio is big. Like, really big, and growing fast, to the tune of \u0026ldquo;two-thirds of the population listens to online audio\u0026rdquo; and \u0026ldquo;weekly online listeners reporting an average nearly 17 hours of listening in the last week\u0026rdquo;1. These numbers include all kinds of audio, from online radio stations, audiobooks, streaming services and podcasts (hi Spotify!). It makes sense too. Consuming audio content is easier to consume and more engaging than written content while you\u0026rsquo;re on the go, exercising, commuting or doing household chores2. 
But what do you do if you\u0026rsquo;re like me and don\u0026rsquo;t have the time or recording equipment to ride this podcasting wave, and just write the occasional blog post?\nListen to this article instead Your browser does not support the audio element Well, you can always use a sophisticated deep learning text-to-speech model, train it on thousands of hours of audio content and endlessly tweak the model parameters, create an audio version of those occassional blog posts and host them on your website. Or, you know, you use the Google one3. The Cloud Text-to-Speech API is priced by character, and the first 1 million characters are free4! In this post, we\u0026rsquo;ll go over how to set up a Google API, write a Python script to extract text from a Markdown file, and create a Hugo shortcode5 to include the generated files in your static website.\nSet up a Google API In order to get started, we have to jump through a couple of hoops to create a text-to-speech API. Most of these are pretty straightforward, and are easiest to follow when you\u0026rsquo;re signed in to your Google account. It will be even easier if you\u0026rsquo;ve enabled billing which should be the default on a personal account (although you may have to add a payment method)6.\nThere\u0026#39;s some steps you\u0026#39;ll have to follow to set up an API. If you click that \u0026ldquo;Enable the API\u0026rdquo; button, you\u0026rsquo;ll be taken to the project creation page. This project basically functions as a label and administrative container for everything ranging from authentication (API keys) to billing (so you can see what you spend on each Google product you\u0026rsquo;re using). Give it a name (leave \u0026ldquo;organization\u0026rdquo; set to \u0026ldquo;no organization\u0026rdquo;).\nCreate a new Google Cloud project This will trigger some background jobs where Google will provision the resources necessary to run your very own text-to-speech API. Finally, we\u0026rsquo;ll want to set up some authentication, so we can interact with this API. Create a service account for this project:\nCreate a service account for our text-to-speech project This will download a JSON file with API credentials to your computer. Do not throw away this file. You\u0026rsquo;ll want to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of this JSON file. Keep in mind that this will set the variable for the duration of your terminal session, so you\u0026rsquo;ll have to set the variable again if you\u0026rsquo;re opening a new session.\n# in this case, the file was downloaded to my Downloads folder; you may want to put it elsewhere $ export GOOGLE_APPLICATION_CREDENTIALS=~/Downloads/text-to-speech-123456.json With all that, we can get started on our script to process some Markdown!\nWrite a script to transform text to audio I\u0026rsquo;m using Python for most of my scripting and hacking, so I\u0026rsquo;ve got virtualenv set up on my machine; this essentially installs a Python interpreter for every project, so I can keep dependencies separated. For this project we have the following requirements; install them with pip install -r requirements.txt, or individually (i.e. 
pip install Click==7.0 etc.):\nbeautifulsoup4==4.8.1 Click==7.0 pydub==0.23.1 Markdown==3.1.1 google-cloud-texttospeech==0.5.0 Click is a great library for building CLI\u0026rsquo;s in Python, and besides giving us some nice features, it will make it easy to convert this script to some executable later on.\nWe\u0026rsquo;ll be calling the script like this:\n$ python text_to_speech.py path/to/markdown.md The script will take the path to a Markdown file on disk, and do a couple of things to it:\nRead the file into memory Do some clean up, and convert it to plain text Send the text to our Google Cloud Text-to-Speech API, and write the audio from the response to disk Read file from disk # text_to_speech.py import click import logging import os logging.basicConfig(level=logging.INFO, format=\u0026#39;%(asctime)s %(levelname)s %(message)s\u0026#39;) @click.command() @click.argument(\u0026#39;filename\u0026#39;, type=click.File(\u0026#39;rb\u0026#39;)) def text_to_speech(filename): name = os.path.basename(filename.name).replace(\u0026#39;.md\u0026#39;, \u0026#39;\u0026#39;) data = filename.read() if __name__ == \u0026#39;__main__\u0026#39;: text_to_speech() This snippet defines a Click command, sets up logging, will try to open it\u0026rsquo;s argument as a file, stores the name of the file in a variable for later use, and reads and stores the contents of the file in a variable data. Step 1, check.\nConvert to plain text We need to send Google plain text as input for their model, so the next step is to add a function that will do some cleanup and extract the text from the Markdown-formatted file. In order to do that, we\u0026rsquo;ll apply some regular expressions and convert the Markdown to HTML first, and use BeautifulSoup to extract the text.\n# text_to_speech.py import click import os import re from bs4 import BeautifulSoup from markdown import markdown def clean_text(text): # get rid of the Hugo preamble text = \u0026#39;\u0026#39;.join(text.decode(\u0026#39;utf8\u0026#39;).split(\u0026#39;---\u0026#39;)[2:]).strip() # get rid of superfluous newlines, as that counts towards our API limits text = re.sub(\u0026#39;\\n+\u0026#39;, \u0026#39; \u0026#39;, text) # we\u0026#39;re hacking our way around the markdown by converting to html first, # just because BeautifulSoup makes life so easy html = markdown(text) html = re.sub(r\u0026#39;\u0026lt;pre\u0026gt;(.*?)\u0026lt;/pre\u0026gt;\u0026#39;, \u0026#39; \u0026#39;, html) # this removes some artifacts from Hugo shortcodes html = re.sub(r\u0026#39;{{}}\u0026#39;, \u0026#39;\u0026#39;, html) html = re.sub(r\u0026#39;\\[\\^.*?\\]\u0026#39;, \u0026#39; \u0026#39;, html) soup = BeautifulSoup(html, \u0026#34;html.parser\u0026#34;) text = \u0026#39;\u0026#39;.join(soup.findAll(text=True)) # get rid of superfluous whitespace return re.sub(r\u0026#39;\\s+\u0026#39;, \u0026#39; \u0026#39;, text) @click.command() @click.argument(\u0026#39;filename\u0026#39;, type=click.File(\u0026#39;rb\u0026#39;)) def text_to_speech(filename): name = os.path.basename(filename.name).replace(\u0026#39;.md\u0026#39;, \u0026#39;\u0026#39;) data = filename.read() text = clean_text(data) Now we extracted the plain text from our Markdown file, we can send it to Google:\nfrom google.cloud import texttospeech ... 
@click.command() @click.argument(\u0026#39;filename\u0026#39;, type=click.File(\u0026#39;rb\u0026#39;)) def text_to_speech(filename): name = os.path.basename(filename.name).replace(\u0026#39;.md\u0026#39;, \u0026#39;\u0026#39;) data = filename.read() text = clean_text(data) # initialize the API client client = texttospeech.TextToSpeechClient() # we can send up to 5000 characters per request, so split up the text step = 5000 for j, i in enumerate(range(0, len(text), step)): synthesis_input = texttospeech.types.SynthesisInput(text=text[i:i+step]) voice = texttospeech.types.VoiceSelectionParams( language_code=\u0026#39;en-US\u0026#39;, name=\u0026#39;en-US-Wavenet-B\u0026#39; ) audio_config = texttospeech.types.AudioConfig( audio_encoding=texttospeech.enums.AudioEncoding.MP3 ) logging.info(f\u0026#39;Synthesizing speech for {name}_{j}\u0026#39;) response = client.synthesize_speech(synthesis_input, voice, audio_config) with open(f\u0026#39;{name}_{j}.mp3\u0026#39;, \u0026#39;wb\u0026#39;) as out: # Write the response to the output file. out.write(response.audio_content) logging.info(f\u0026#39;Audio content written to file \u0026#34;{name}_{j}.mp3\u0026#34;\u0026#39;) Now, this is where we run into the first quirk of the API; it will only accept snippets of up to 5000 characters. My blog posts generally range between 12k and 15k characters, so I had to add some code that will chunk up the text into bits of 5000 characters each. Note that I don\u0026rsquo;t make any effort to detect word boundaries, so it can happen that a chunk will end with half a word; I\u0026rsquo;ll leave it up to the reader to improve upon my implementation7.\nWe provide some configuration (I like the voice of robot en-US-Wavenet-B, but there are loads of other voices and languages to choose from), specify we want to receive an MP3 back, and write out the response MP3 into separate chunks in the current working directory.\nNext, we need to stitch the temporary MP3 chunks together (using the excellent pydub library), write the completed file to a sensible directory and clean up after ourselves.\nimport functools from glob import glob from pydub import AudioSegment ... 
mp3_segments = sorted(glob(f\u0026#39;{name}_*.mp3\u0026#39;)) segments = [AudioSegment.from_mp3(f) for f in mp3_segments] logging.info(f\u0026#39;Stitching together {len(segments)} mp3 files for {name}\u0026#39;) audio = functools.reduce(lambda a, b: a + b, segments) logging.info(f\u0026#39;Exporting {name}.mp3\u0026#39;) audio.export(f\u0026#39;static/audio/{name}.mp3\u0026#39;, format=\u0026#39;mp3\u0026#39;) logging.info(f\u0026#39;Exporting {name}.ogg\u0026#39;) audio.export(f\u0026#39;static/audio/{name}.ogg\u0026#39;, format=\u0026#39;ogg\u0026#39;) logging.info(\u0026#39;Removing intermediate files\u0026#39;) for f in mp3_segments: os.remove(f) This will stitch the MP38 segments together (functools.reduce), write out an MP3 and an OGG file (with the same filename as the blog post Markdown file) to the static/audio directory I use (change to a destination folder of your liking if necessary), and deletes the intermediate files from the current directory.\nYou can find the complete script here.\nRunning the script for this article generates output that looks like this:\n$ python scripts/text_to_speech.py content/post/2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts.md 2019-10-29 23:19:59,995 INFO Synthesizing speech for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_0 2019-10-29 23:20:08,044 INFO Audio content written to file \u0026#34;2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_0.mp3\u0026#34; 2019-10-29 23:20:08,045 INFO Synthesizing speech for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_1 2019-10-29 23:20:13,709 INFO Audio content written to file \u0026#34;2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_1.mp3\u0026#34; 2019-10-29 23:20:13,709 INFO Synthesizing speech for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_2 2019-10-29 23:20:18,576 INFO Audio content written to file \u0026#34;2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_2.mp3\u0026#34; 2019-10-29 23:20:19,830 INFO Stitching together 3 mp3 files for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts 2019-10-29 23:20:19,880 INFO Exporting 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts.mp3 2019-10-29 23:20:23,353 INFO Exporting 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts.ogg 2019-10-29 23:20:26,744 INFO Removing intermediate files Include audio in the post With this, we end up with a bunch of audio files a directory. In order to display them properly to our users so that they can actually consume the content, we have to do a little more work. Hugo provides shortcodes, which are effectively parameterized macro\u0026rsquo;s that expand into snippets of HTML that get embedded in your posts. There are many shortcodes included with standard Hugo (like figure, gist or tweet), but you can also create your own. We\u0026rsquo;ll leverage that to include some swanky HTML5 audio tags in our blog posts9.\n\u0026lt;audio controls class=\u0026#34;audio_controls {{ .Get \u0026#34;class\u0026#34; }}\u0026#34; {{ with .Get \u0026#34;id\u0026#34; }}id=\u0026#34;{{ . }}\u0026#34;{{ end }} {{ with .Get \u0026#34;preload\u0026#34; }}preload=\u0026#34;{{ . 
}}\u0026#34;{{ else }}preload=\u0026#34;metadata\u0026#34;{{ end }} style=\u0026#34;{{ with .Get \u0026#34;style\u0026#34; }}{{ . | safeCSS }}; {{ end }}\u0026#34; {{ with .Get \u0026#34;title\u0026#34; }}data-info-title=\u0026#34;{{ . }}\u0026#34;{{ end }} \u0026gt; {{ if .Get \u0026#34;src\u0026#34; }} \u0026lt;source {{ with .Get \u0026#34;src\u0026#34; }}src=\u0026#34;{{ . }}\u0026#34;{{ end }} {{ with .Get \u0026#34;type\u0026#34; }}type=\u0026#34;audio/{{ . }}\u0026#34;{{ end }}\u0026gt; {{ else if .Get \u0026#34;backup_src\u0026#34; }} \u0026lt;source src=\u0026#34;{{ .Get \u0026#34;backup_src\u0026#34; }}\u0026#34; {{ with .Get \u0026#34;backup_type\u0026#34; }}type=\u0026#34;audio/{{ . }}\u0026#34;{{ end }} {{ with .Get \u0026#34;backup_codec\u0026#34; }}codecs=\u0026#34;{{ . }}\u0026#34;{{ end }} \u0026gt; {{ end }} Your browser does not support the audio element \u0026lt;/audio\u0026gt; This snippet will give us access to a shortcode that injects some HTML into our post, and accepts a couple of parameters so we can include the appropriate audio file, the appropriate backup file, and override styling should we so choose. This file should be stored in layouts/shortcodes/audio.html, and can be included in your posts as follows:\nLorem ipsum dolor sit amet. {{\u0026lt;audio src=\u0026#34;/audio/name-of-your-audio-file.mp3\u0026#34; type=\u0026#34;mp3\u0026#34; backup_src=\u0026#34;/audio/name-of-your-audio-file.ogg\u0026#34; backup_type=\u0026#34;ogg\u0026#34;\u0026gt;}} Some more words, this time not in Latin. This will include an audio player looking like this in your blog post. I\u0026rsquo;ve added some bells and whistles to mine, with some additional styling for all the UX points.\nBasic HTML5 audio player https://www.edisonresearch.com/infinite-dial-2019/\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNot to mention that audio content is more easily accessible for people that suffer from dyslexia or poor sight, or seems to be a lot better for user engagement.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIt has all kinds of funky stuff, like multiple languages, API clients in your favorite programming language, pitch, speaking rates and volume controls, and even optimizations around where your audio is going to play, such as headphones or phone lines.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nFree is my favorite price.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nMy website is a static website generated from Markdown files with Hugo, hosted on GitHub Pages, so you may need to make some small changes to make it work with Jekyll, Next.js or whichever other static site generator you\u0026rsquo;re using.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nDon\u0026rsquo;t worry, the API is free for the first 4 million characters per month for standard voices, or 1 million characters for fancy WaveNet voices.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nPull requests are always welcome! 
π\u0026#160;\u0026#x21a9;\u0026#xfe0e;\npydub requires ffmpeg or libav to be installed to open, convert and save non-WAV files (such as MP3 or OGG)\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNot all browsers support all file types and audio codecs, which is why we\u0026rsquo;ve generated the OGG files as backup.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/use-hugo-output-formats-to-generate-lunr-index-files/","title":"Use Hugo Output Formats to generate Lunr index files for your static site search","categories":["hugo","search","lunr","how-to"],"content":"I\u0026rsquo;ve been using Lunr.js to enable some basic site search on this blog. Lunr.js requires an index file that contains all the content you want to make available for search. In order to generate that file, I had a kind of hacky setup, depending on running a Grunt script on every deploy, which introduces a dependency on node, and nobody really wants any of that for just a static HTML website.\nListen to this article instead Your browser does not support the audio element I have been wanting forever to have Hugo build that file for me instead1. As it turns out, Output Formats2 make building that index file very easy. Output formats let you generate your content in other formats than HTML, such as AMP or XML for an RSS feed, and it also speaks JSON.\nThe search on my blog lives on the homepage, where some (very ugly) Javascript downloads the index file, parses it contents into an inverted index, and replaces the content on the page with search results whenever someone starts typing. Essentially, I want to create some JSON output on my homepage (index.json instead of index.html).\nI added the following snippet to my config.toml, that says that besides HTML, the homepage also has JSON output:\n[outputs] home = [\u0026#34;HTML\u0026#34;, \u0026#34;JSON\u0026#34;] page = [\u0026#34;HTML\u0026#34;] N.B.: this means that there won\u0026rsquo;t be a JSON version of the other pages; I just need it on my homepage, because that serves as the search results page too.\nNow, I don\u0026rsquo;t want that index.json file to basically be the list of links it is in the HTML version and in the RSS feed, so I added an index.json file in my layouts folder with the following content:\n[ {{ range $index, $page := .Site.Pages }} {{- if eq $page.Type \u0026#34;post\u0026#34; -}} {{- if $page.Plain -}} {{- if and $index (gt $index 0) -}},{{- end }} { \u0026#34;href\u0026#34;: \u0026#34;{{ $page.Permalink }}\u0026#34;, \u0026#34;title\u0026#34;: \u0026#34;{{ htmlEscape $page.Title }}\u0026#34;, \u0026#34;categories\u0026#34;: [{{ range $tindex, $tag := $page.Params.categories }}{{ if $tindex }}, {{ end }}\u0026#34;{{ $tag| htmlEscape }}\u0026#34;{{ end }}], \u0026#34;content\u0026#34;: {{$page.Plain | jsonify}} } {{- end -}} {{- end -}} {{- end -}} ] This will render a JSON file (named index.json) with an array in the root directory of my site, and every item in that array is one of the .Site.Pages (i.e. my posts), whenever that page has text in it and it\u0026rsquo;s not the homepage. I didn\u0026rsquo;t bother with minification, because the file is tiny and will be served nicely gzipped by Cloudflare anyway. Whenever Hugo builds the site, it will reindex all the data (i.e. 
rebuild this file), and I don\u0026rsquo;t have a dependency on Node and Grunt scripts anymore.\nEver since someone opened a GitHub issue about it π\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nShips with Hugo version 0.20.0 or greater.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/tab-plus-search-from-your-url-bar-with-opensearch/","title":"Custom OpenSearch: search from your URL bar","categories":["search","opensearch","how-to"],"content":"Almost all modern browsers enable websites to customize the built-in search feature to let the user access their search features directly, without going to your website first and finding the search input box. If your website has search functionality accessible through a basic GET request, it\u0026rsquo;s surprisingly simple to enable this for your website too.\nListen to this article instead Your browser does not support the audio element Typing \u0026#39;bart\u0026#39; and hitting tab in my Chrome browser lets me search the website directly. Some browsers do it automatically If your users are on Chrome, chances are this already works! Chromium tries really hard to figure out where your search page is and how to access it. A strong hint you can give it is to change the type of the \u0026lt;input\u0026gt; element to \u0026quot;search\u0026quot;1:\n\u0026lt;input autocapitalize=\u0026#34;off\u0026#34; autocorrect=\u0026#34;off\u0026#34; autocomplete=\u0026#34;off\u0026#34; name=\u0026#34;q\u0026#34; placeholder=\u0026#34;Search\u0026#34; type=\u0026#34;search\u0026#34;\u0026gt; The \u0026quot;name\u0026quot; attribute gives the browser a hint as to what HTTP parameter will hold the query (it is a good idea to configure your Google Analytics to pick this up as well!).\nThis will let the browser add some nice UI elements to the search input box, like a small \u0026ldquo;x\u0026rdquo; button on the right to clear the search input in Safari and Chrome. Enabling the \u0026quot;autocapitalize\u0026quot;, \u0026quot;autocorrect\u0026quot; and \u0026quot;autocomplete\u0026quot; attributes will instruct your browser to modify and correct the user input even further (think of the iOS autocorrect feature, for example).\nJust by changing the input type you can hook in to the browsers\u0026#39; native UX. Word of warning Because once upon a time apple.com relied on the type attribute to give their search box a more \u0026ldquo;Mac-like\u0026rdquo; feel, Safari will basically ignore any CSS applied to \u0026lt;input type=\u0026quot;search\u0026quot;\u0026gt; elements. If you need Safari to treat your search field like any other input field for display purposes, you can add the following to your CSS:\ninput[type=\u0026#34;search\u0026#34;] { -webkit-appearance: textfield; } This will let you apply your own styles to the input box.\nOthers don\u0026rsquo;t Not all browsers do this out of the box, so you need to provide them with a more formalized configuration. Most browsers find out about the search functionality of a website through an OpenSearch XML file that directs them to the right page.\nOpenSearch OpenSearch is a standard that was developed by A9, an Amazon subsidiary developing search engine and search advertising technology, and has been around since Jeff Bezos unveiled it in 2005 at a conference on emerging technologies.\nIt is nothing more than an XML specification that lets a website describe a search engine for itself, and where a user or browser might find and use it. 
Firefox, Chrome, Edge, Internet Explorer and Safari all support the OpenSearch standard, with Firefox even supporting features that are not in the standard, such as search suggestions.\nXML All you need is a small XML file. Below is an example of the one we have at work:\n\u0026lt;OpenSearchDescription xmlns=\u0026#34;http://a9.com/-/spec/opensearch/1.1/\u0026#34; xmlns:moz=\u0026#34;http://www.mozilla.org/2006/browser/search/\u0026#34;\u0026gt; \u0026lt;ShortName\u0026gt;Scribd.com\u0026lt;/ShortName\u0026gt; \u0026lt;Description\u0026gt;Scribd\u0026#39;s mission is to create the world\u0026#39;s largest open library of documents. Search it.\u0026lt;/Description\u0026gt; \u0026lt;Url type=\u0026#34;text/html\u0026#34; method=\u0026#34;get\u0026#34; template=\u0026#34;https://www.scribd.com/search?query={searchTerms}\u0026#34; /\u0026gt; \u0026lt;Image height=\u0026#34;32\u0026#34; width=\u0026#34;32\u0026#34; type=\u0026#34;image/x-icon\u0026#34;\u0026gt;https://www.scribd.com/favicon.ico\u0026lt;/Image\u0026gt; \u0026lt;/OpenSearchDescription\u0026gt; It provides a \u0026lt;ShortName\u0026gt; (there\u0026rsquo;s a \u0026lt;LongName\u0026gt; element too, that\u0026rsquo;s mostly used for aggregators or automatically generated search plugins), a \u0026lt;Description\u0026gt; of what the search will let you do, and most importantly, the \u0026lt;Url\u0026gt; where you can do it.\nIt tells the browser there\u0026rsquo;s a text/html page that can process an HTTP GET request, and has a template for the browser. {searchTerms} will be interpolated with the query terms the user will type in the browser. You need to host this file somewhere with the rest of your web pages.\nBut what if you don\u0026rsquo;t have a dedicated search engine for your website? Well, just use Google! Replace the value of the \u0026quot;template\u0026quot; attribute with something like this2:\n\u0026lt;Url type=\u0026#34;text/html\u0026#34; method=\u0026#34;get\u0026#34; template=\u0026#34;https://www.google.com/search?q=site:bart.degoe.de {searchTerms}\u0026#34;\u0026gt; This will redirect your user to the Google search results, but those will only display matches from content on your site. That\u0026rsquo;s a lot cheaper than employing a bunch of engineers to build and maintain a custom search engine!\nTurn on autodiscovery! Now we need to activate the automatic discovery of search engines in the browsers of your users. That sounds a lot cooler and more complicated than it actually is; the only thing you have to do is provide a \u0026lt;link\u0026gt; somewhere in the \u0026lt;head\u0026gt; of your webpages:\n\u0026lt;link rel=\u0026#34;search\u0026#34; href=\u0026#34;https://bart.degoe.de/opensearch.xml\u0026#34; type=\u0026#34;application/opensearchdescription+xml\u0026#34; title=\u0026#34;Search bart.degoe.de\u0026#34;\u0026gt; This will alert browsers that load the page that there is a search feature available, described in the linked XML file. Make sure your OpenSearch XML file is available and can be loaded from your webserver, and refresh the page containing the \u0026lt;link\u0026gt;. This will tell the browser where to look, and enable custom search!\nNow tab-searching from the Safari URL bar works too! 
The OpenSearch specification supports a lot more features than this, ranging from \u0026lt;Tags\u0026gt; to help plugins generated from these standardized descriptions be found better in search plugin aggregators, what \u0026lt;Language\u0026gt; the search engine supports, or whether the search results may contain \u0026lt;AdultContent\u0026gt;. There are many ways to configure and customize OpenSearch that go way beyond the basic example described here, but for my little blog this is more than enough π.\nThe other attributes are to dis-/enable features certain other browsers like Safari have that automatically correct what you type into the search box.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nYes, you could absolutely point your search input to my website, but that\u0026rsquo;s not a requirement π\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/github-pages-and-lets-encrypt/","title":"Free SSL on Github Pages with a custom domain: Part 2 - Let's Encrypt","categories":["ssl","hugo","how-to","gh-pages","https","lets-encrypt"],"content":"GitHub Pages has just become even more awesome. Since yesterday1, GitHub Pages supports HTTPS for custom domains. And yes, it is still free!\nListen to this article instead Your browser does not support the audio element Let\u0026rsquo;s Encrypt GitHub has partnered with Let\u0026rsquo;s Encrypt, which is a free, open and automated certificate authority (CA). It is run by the Internet Security Research Group (ISRG), which is a public benefit corporation2 funded by donations and a bunch of large corporations and non-profits.\nThe goal of this initiative is to secure the web by making it very easy to obtain a free, trusted SSL certificate. Moreover, it lets web servers run a piece of software that not only gets a valid SSL certificate, but will also configure your web server and automatically renew the certificate when it expires.\nHow does it do that? It works by running a bit of software on your web server, a certificate management agent. This agent software has two tasks: it proves to the Let\u0026rsquo;s Encrypt certificate authority that it controls the domain, and it requests, renews and revokes certificates for the domain it controls.\nValidating a domain Similar to a traditional process of obtaining a certificate for a domain, where you create an account with the CA and add domains you control, the certificate management agent needs to perform a test to prove that it controls the domain.\nThe agent will ask the Let\u0026rsquo;s Encrypt CA what it needs to do to prove that it is, effectively, in control of the domain. The CA will look at the domain, and issue one or more challenges to the agent it needs to complete to prove that it has control over the domain. For example, it can ask the agent to provision a particular DNS record under the domain, or make an HTTP resource available under a particular URL. With these challenges, it provides the agent with a nonce (some random number that can only be used once for verification purposes).\nCA issuing a challenge to the certificate management agent (image taken from https://letsencrypt.org/how-it-works/) In the image above, the agent creates a file on a specified path on the web server (in this case, on https://example.com/8303). It creates a key pair it will use to identify itself with the CA, and signs the nonce received from the CA with the private key. Then, it notifies the CA that it has completed the challenge by sending back the signed nonce and is ready for validation. 
The CA then validates the completion of the challenge by attempting to download the file from the web server and verify that it contains the expected content.\nCertificate management agent completing a challenge (image taken from https://letsencrypt.org/how-it-works/) If the signed nonce is valid, and the challenge is completed successfully, the agent identified by the public key is officially authorized to manage valid SSL certificates for the domain.\nCertificate management So, what does that mean? By having validated the agent by its public key, the CA can now validate that messages sent to the CA are actually sent by the certificate management agent.\nIt can send a Certificate Signing Request (CSR) to the CA to request it to issue a SSL certificate for the domain, signed with the authorized key. Let\u0026rsquo;s Encrypt will only have to validate the signatures, and if those check out, a certificate will be issued.\nIssuing a certificate (image taken from https://letsencrypt.org/how-it-works/) Let\u0026rsquo;s Encrypt will add the certificate to the appropriate channels, so that browsers will know that the CA has validated the certificate, and will display that coveted green lock to your users!\nSo, GitHub Pages Right, that\u0026rsquo;s how we got started. The awesome thing about Let\u0026rsquo;s Encrypt is that it is automated, so all this handshaking and verifying happens behind the scenes, without you having to be involved.\nIn the previous post we saw how to set up a CNAME file for your custom domain. That\u0026rsquo;s it. Done. Works out of the box.\nOptionally, you can enforce HTTPS in the settings of your repository. This will upgrade all users requesting stuff from your site over HTTP to be automatically redirected to HTTPS.\nIf you use A records to route traffic to your website, you need to update your DNS settings at your registrar. These IP addresses are new, and have an added benefit of putting your static site behind a CDN (just like we did with Cloudflare in the previous post).\nSSL all the things Let\u0026rsquo;s Encrypt makes securing the web easy. More and more websites are served over HTTPS only, so it is getting increasingly difficult for script kiddies to sniff your web traffic on free WiFi networks. Moreover, they provide this service world-wide, to anyone, for free. Help them help you (and the rest of the world), and buy them a coffee!\nAt time of writing, yesterday is May 1, 2018.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOne in California, to be specific.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/free-ssl-on-github-pages-with-a-custom-domain/","title":"Free SSL with a custom domain on GitHub Pages","categories":["ssl","hugo","how-to","gh-pages","https"],"content":"GitHub Pages is pretty awesome. It lets you push a bunch of static HTML (and/or CSS and Javascript) to a GitHub repository, and they\u0026rsquo;ll host and serve it for you. For free!\nListen to this article instead Your browser does not support the audio element You basically set up a specific repository (you have to name it \u0026lt;your_username\u0026gt;.github.io), you push your HTML there, and they will be available at https://\u0026lt;your_username\u0026gt;.github.io. Did I mention that this is free?\nWhile you can perfectly write and push HTML files straight to your GitHub repository, there\u0026rsquo;s a whole bunch of open source static site generators available that provide a structured way of organising content, in formats (Markdown π) that are easier to work with1. 
GitHub even supports one of them (Jekyll) out of the box, so you can just push your project as is and they\u0026rsquo;ll take care of building of your HTML too2.\nYou can even set up your own custom domain! Register your domain at your favourite registrar, and change a setting for your repository:\nThere, you fill out the custom domain you want your site to be available at (in my case that\u0026rsquo;s bart.degoe.de).\nBefore you rush off to your registrar to point your domain (or subdomain, in my case3), make sure you add a CNAME file to the root of your repository. The CNAME file should contain the URL your website should be displaying in the browser (this is important for redirects). In my case, the file contains bart.degoe.de, because that\u0026rsquo;s the URL I want my site to be published under.\nSetting up CloudFlare and SSL Then, all you need to do is add a CNAME entry to your domain settings settings. Right? Well, yes and no. Yes, setting up a CNAME DNS record will get your website working under the proper URL (it might take a while for the DNS change to propagate).\nHowever, serving your static files from GitHub under your own domain name does pose a problem; GitHub Pages only supports SSL for the github.io domain, not for custom domains (they have a wildcard certificate for their own domain, but supporting HTTPS on custom domains is not trivial4).\nThat means that your website can\u0026rsquo;t take advantage of HTTP/2 speedups, it will have negative impact on your Google ranking, Chrome will show your visitors that your website is not secure and even for your static site with fancy Javascript features you do want to protect your users when they\u0026rsquo;re reading your posts on unsecured Wi-Fi networks.\nCloudFlare Fortunately, there\u0026rsquo;s a way to get this coveted green secure lock on your static website. CloudFlare5 provides the (free) feature \u0026ldquo;Universal SSL\u0026rdquo; that will allow your users to access your website over SSL. Sign up for a free account, and enter the (non-SSL-ized) domain name of your website in their scanning tool:\nCloudFlare will fetch your current DNS configuration, and will provide you with instructions on how to enable CloudFlare for your (sub-)domain(s). The idea is that CloudFlare will act as a proxy between your GitHub hosted site and the user. This will allow them to encrypt traffic between their servers and your users (the traffic between GitHub and CloudFlare is also encrypted, but doesn\u0026rsquo;t require you to install an SSL certificate on the GitHub servers; added bonus is that they can cache your content on servers close to your visitors increasing the page speed of your website).\nEnable CloudFlare for the (sub)domain you\u0026rsquo;re hosting your website on:\nEnabling SSL CloudFlare\u0026rsquo;s Universal SSL lets you provide your website\u0026rsquo;s users with a valid signed SSL certificate. There\u0026rsquo;s several configuration options for Universal SSL (find it in the \u0026ldquo;Crypto\u0026rdquo; tab), and make sure your SSL mode is set to Full SSL (but not Full SSL (Strict)!).\nDo note it may take a while (up to 24 hours) for CloudFlare to set you up with your SSL certificates. They will send you an email once they\u0026rsquo;re provisioned and ready to go.\nNext, create a Page Rule. Page rules are, surprisingly, rules that apply to a page or a collection of pages. 
Next, create a Page Rule. Page rules are, unsurprisingly, rules that apply to a page or a collection of pages. These rules can do a lot of cool things, such as automatically obfuscating email addresses on the page, controlling cache settings, or adding geolocation information to requests. The rule you\u0026rsquo;re looking for is \u0026ldquo;Always Use HTTPS\u0026rdquo;, which will force all requests for pages matching the URL pattern you provide to use HTTPS:\nIn my case, I only have one URL for my website. However, if you use the www subdomain (i.e. www.example.com), you might want to add a Page Rule that redirects users that type example.com to www.example.com, where you enforce HTTPS to ensure all users benefit from encrypted requests. If you add more Page Rules, make sure that the HTTPS rule is the primary (first) page rule. Only one rule will trigger per URL, so you\u0026rsquo;ll want to make sure that this one is listed first!\nProfit! Right? This article has gotten quite meaty for the steps you have to follow, so if you\u0026rsquo;re looking for a more concise set of steps, this Gist by @cvan is great:\nThere\u0026rsquo;s a lot more you can do with CloudFlare and your static site (you could set up caching on CloudFlare\u0026rsquo;s content distribution network, for example), but be aware that even though you\u0026rsquo;ve encrypted your traffic, you should still be careful about submitting sensitive data to (third-party) APIs with Javascript; \u0026ldquo;GitHub Pages sites shouldn\u0026rsquo;t be used for sensitive transactions like sending passwords or credit card numbers\u0026rdquo;. Your website\u0026rsquo;s source code is publicly available in your GitHub repository, so be mindful of any scripts and content you publish there.\nI use Hugo for this website, which is written in Golang (\u0026ldquo;fast\u0026rdquo; and \u0026ldquo;easy\u0026rdquo; are keywords I like). There are a lot of different static site generators out there, each with their own focus, advantages and trade-offs.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIn my setup, I have two separate repositories, where I maintain the Hugo project structure in one (the blog repository), and build and push the static files to the other (the bartdegoede.github.io repository). What I like about that is that it gives me a \u0026ldquo;deploy\u0026rdquo; step, so I don\u0026rsquo;t accidentally push something that\u0026rsquo;s not finished yet.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nSkipping this step took me a lot longer to figure out than I\u0026rsquo;m willing to admit.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThere have been discussions about this for a while.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nCloudFlare is a company that provides a content-delivery network (CDN), DDoS protection services, DNS and a whole slew of other services for websites.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/bloom-filters-bit-arrays-recommendations-caches-bitcoin/","title":"Bloom filters, using bit arrays for recommendations, caches and Bitcoin","categories":["python","bloom filter","how-to"],"content":"Bloom filters are cool. In my experience, it\u0026rsquo;s a somewhat underestimated data structure that sounds more complex than it actually is. In this post I\u0026rsquo;ll go over what they are, how they work (I\u0026rsquo;ve hacked together an interactive example to help visualise what happens behind the scenes) and cover some of their use cases in the wild.\nListen to this article instead Your browser does not support the audio element What is a Bloom filter?
A Bloom filter is a data structure designed to quickly tell you whether an element is definitely not in a set. What\u0026rsquo;s even nicer, it does so within the memory constraints you specify. It doesn\u0026rsquo;t actually store the data itself, only a trimmed-down version of it. This gives it the desirable property that it has a constant time complexity1 for both adding a value to the filter and for checking whether a value is present in the filter. The cool part is that this is independent of how many elements are already in the filter.\nAs with most things that offer great benefits, there is a trade-off: Bloom filters are probabilistic in nature. On rare occasions, it will answer yes for an element that isn\u0026rsquo;t actually in the set (false positives are a possibility), although it will never answer no if the value is actually present (false negatives can\u0026rsquo;t happen).\nYou can actually control how rare those occasions are, by choosing the size of the Bloom filter bit array and the number of hash functions based on the number of elements you expect to add2. Also, note that you can\u0026rsquo;t remove items from a Bloom filter.\nHow does it work? An empty Bloom filter is a bit array of a particular size (let\u0026rsquo;s call that size m) where all the bits are set to 0. In addition, there must be a number (let\u0026rsquo;s call the number k) of hashing functions defined. Each of these functions hashes a value to one of the m positions in our array, distributing the values uniformly over the array.\nWe\u0026rsquo;ll do a very simple Python implementation3 of a Bloom filter. For simplicity\u0026rsquo;s sake, we\u0026rsquo;ll use a bit array4 with 15 bits (m=15) and 3 hashing functions (k=3) for the running example.\nimport mmh3 class Bloomfilter(object): def __init__(self, m=15, k=3): self.m = m self.k = k # we use a list of Booleans to represent our # bit array for simplicity self.bit_array = [False for i in range(self.m)] def add(self, element): ... def check(self, element): ... To add elements to the array, our add method needs to run k hashing functions on the input, each of which will pick a (more or less random) index in our bit array. We\u0026rsquo;ll use the mmh3 library to hash our element, and use the index of each hash function as a seed to give us a different hash for each of them. Finally, we compute the remainder of the hash divided by the size of the bit array to obtain the position we want to set.5\ndef add(self, element): \u0026#34;\u0026#34;\u0026#34; Add an element to the filter. Murmurhash3 gives us hash values distributed uniformly enough we can use different seeds to represent different hash functions \u0026#34;\u0026#34;\u0026#34; for i in range(self.k): # this will give us a number between 0 and m - 1 digest = mmh3.hash(element, i, signed=False) % self.m self.bit_array[digest] = True In our case (m=15 and k=3), we would set the bits at index 1, 7 and 10 to one for the string hello.\nIn [1]: mmh3.hash(\u0026#39;hello\u0026#39;, 0, signed=False) % 15 Out[1]: 1 In [2]: mmh3.hash(\u0026#39;hello\u0026#39;, 1, signed=False) % 15 Out[2]: 7 In [3]: mmh3.hash(\u0026#39;hello\u0026#39;, 2, signed=False) % 15 Out[3]: 10 Now, to determine if an element is in the Bloom filter, we apply the same hash functions to the element, and see whether the bits at the resulting indices are all 1.
If one of them is not 1, then the element has not been added to the filter (because otherwise we\u0026rsquo;d see a value of 1 for all hash functions!).\ndef check(self, element): \u0026#34;\u0026#34;\u0026#34; To check whether element is in the filter, we hash the element with the same hash functions as the add function (using the seed). If one of the resulting bits isn\u0026#39;t set in our bit_array, the element is not in there (we only return True for a value that hashes to indices that have all been set before). \u0026#34;\u0026#34;\u0026#34; for i in range(self.k): digest = mmh3.hash(element, i, signed=False) % self.m if self.bit_array[digest] == False: # if any of the bits hasn\u0026#39;t been set, then it\u0026#39;s not in # the filter return False return True You can see how this approach guarantees that there will be no false negatives, but that there might be false positives; especially in our toy example with the small bit array, the more elements you add to the filter, the more likely it gets that the three bits we hash an element to are already set by other elements (one of the hash functions maps the string world to index 7, a bit that was already set by hello):\nIn [4]: mmh3.hash(\u0026#39;world\u0026#39;, 0, signed=False) % 15 Out[4]: 7 In [5]: mmh3.hash(\u0026#39;world\u0026#39;, 1, signed=False) % 15 Out[5]: 4 In [6]: mmh3.hash(\u0026#39;world\u0026#39;, 2, signed=False) % 15 Out[6]: 9
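To see the round trip in action, here\u0026rsquo;s a quick session with the class above (assuming the add and check methods have been filled in as shown, and mmh3 is installed); the indices line up with the hashes we just computed for hello and world:

bf = Bloomfilter(m=15, k=3)
bf.add('hello')    # sets bits 1, 7 and 10

bf.check('hello')  # True: bits 1, 7 and 10 are all set
bf.check('world')  # False: bit 7 is set (shared with 'hello'),
                   # but bits 4 and 9 are not

bf.add('world')    # sets bits 7, 4 and 9; the array now has 1, 4, 7, 9 and 10 set
bf.check('world')  # True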
We can actually compute the probability of our Bloom filter returning a false positive: roughly speaking, it is the fraction of bits in the array that are set to 1, raised to the power of the number of hash functions we\u0026rsquo;re using (k). We\u0026rsquo;ll leave the details for a future post, but the intuition is clear: the more values we add, the higher the probability of false positives becomes.\nInteractive example To further drive home how Bloom filters work, I\u0026rsquo;ve hacked together a Bloom filter in JavaScript that uses the cells in the table below as a \u0026ldquo;bit array\u0026rdquo; to visualise how adding more values will fill up the filter and increase the probability of a false positive (a full Bloom filter will always return \u0026ldquo;yes\u0026rdquo; for whatever value you throw at it).\n(Interactive widget: add values, watch the bits fill up, and test whether a value is in the filter.)\nWhat can I use it for? Given that a Bloom filter is really good at telling you whether something is in a set or not, caching is a prime candidate for using a Bloom filter. CDN providers like Akamai6 use it to optimise their disk caches; nearly 75% of the URLs that are accessed in their web caches are accessed only once and then never again. To avoid caching these \u0026ldquo;one-hit wonders\u0026rdquo; (massively reducing disk space requirements), Akamai uses a Bloom filter to store all URLs that are accessed. If a URL is found in the Bloom filter, it means it was requested before, and should be stored in their disk cache.\nBlogging platform Medium uses Bloom filters7 to filter out posts that users have already read from their personalised reading lists. They create a Bloom filter for every user, and add every article they read to the filter. When a reading list is generated, they can check the filter to see whether the user has already seen an article. The trade-off for false positives (i.e. an article they haven\u0026rsquo;t read before being flagged as already read) is more than acceptable, because in that case the user won\u0026rsquo;t be shown an article that they haven\u0026rsquo;t read yet (so they will never know).\nQuora does something similar to filter out stories users have seen before, and Facebook and LinkedIn use Bloom filters in their typeahead searches (it basically provides a fast and memory-efficient way to filter out documents that can\u0026rsquo;t match on the prefix of the query terms).\nBitcoin relies strongly on a peer-to-peer style of communication, unlike the client-server architecture of the examples above. Every node in the network is a server, and everyone in the network has a copy of everyone else\u0026rsquo;s transactions. For big beefy servers in a data center that\u0026rsquo;s fine, but what if you don\u0026rsquo;t necessarily care about all transactions? Think of a mobile wallet application, for example: you don\u0026rsquo;t want all transactions on the blockchain, especially when you have to download them over a mobile connection. To address this, Bitcoin has an option called Simplified Payment Verification (SPV) which lets your (mobile) node request only the transactions it\u0026rsquo;s interested in (i.e. payments from or to your wallet address). The SPV client calculates a Bloom filter for the transactions it cares about, so the \u0026ldquo;full node\u0026rdquo; has an efficient way to answer \u0026ldquo;is this client interested in this transaction?\u0026rdquo;. The cost of false positives (i.e. a client is actually not interested in a transaction) is minimal, because when the client processes the transactions returned by the full node it can simply discard the ones it doesn\u0026rsquo;t care about.\nClosing thoughts There are a lot more applications for Bloom filters out there, and I can\u0026rsquo;t list them all here. I hope I gave you a whirlwind overview of how Bloom filters work and how they might be useful to you.\nFeel free to drop me a line or comment below if you have nice examples of where they\u0026rsquo;re used, or if you have any feedback, comments, or just want to say hi :-)\nThe runtime for both inserting and checking is defined by the number of hash functions (k) we have to execute. So, O(k). Space complexity is more difficult to quantify, because that depends on how many false positives you\u0026rsquo;re willing to tolerate; allocating more space will lower the false positive rate.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nGoing over the math is a bit much for this post, so check Wikipedia for all the formulas.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nFull implementation on GitHub.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nOur implementation won\u0026rsquo;t use an actual bit array but a Python list containing Booleans for the sake of readability.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNote that there\u0026rsquo;s a slight difference between the Python and Javascript Murmurhash implementations in the libraries I\u0026rsquo;ve used; the Javascript library I used returns a 32 bit unsigned integer, whereas the Python library returns a 32 bit signed integer by default. To keep the Python example consistent with the Javascript, I opted to use unsigned integers there too; this has no impact on the working of the Bloom filter.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nMaggs, Bruce M.; Sitaraman, Ramesh K.
(July 2015), \u0026ldquo;Algorithmic nuggets in content delivery\u0026rdquo;, SIGCOMM Computer Communication Review, New York, NY, USA: ACM, 45 (3): 52-66, doi:10.1145/2805789.2805800\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nRead the article. It\u0026rsquo;s really good.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"href":"https://bart.degoe.de/searching-your-hugo-site-with-lunr/","title":"Searching your Hugo site with Lunr","categories":["hugo","search","lunr","javascript","how-to"],"content":"Like many software engineers, I figured I needed a blog of sorts, because it would give me a place for my own notes on \u0026ldquo;How To Do Things™\u0026rdquo;, let me have a URL to give people, and share my ramblings about Life, the Universe and Everything Else with whoever wants to read them.\nListen to this article instead Your browser does not support the audio element Because I\u0026rsquo;m trying to get more familiar with Go, I opted to use the awesome Hugo1 framework to build myself a static site hosted on Github Pages.\nIn my day job I work on our search engine, so the first thing that I wanted to have was some basic search functionality for all the blog posts I haven\u0026rsquo;t written yet, preferably something that I can mess with and that is extensible and configurable.\nThere are three options if you want to add search functionality to a static website, each with their pros and cons:\nThird-party service (i.e. Google CSE): There are a bunch of services that provide basic search widgets for your site, such as Google Custom Search Engine (CSE). Those are difficult to customise, break your UI with their Google-styled widgets, and (in some cases) will display ads on your website2. Run a server-side search engine: You can set up a backend that indexes your data and can process the queries your users submit in the search box on your website. The obvious downside is that you throw away all the benefits of having a static site (free hosting, no complex infrastructure). Search client-side: Having a static site, it makes sense to move all the user interaction to the client. We depend on the users\u0026rsquo; browser to run Javascript3 and download the searchable data in order to run queries against it, but the upside is that you can control how data is processed and how that data is queried. Fortunately for us, Atwood\u0026rsquo;s Law holds true; there\u0026rsquo;s a full-text search library inspired by Lucene/Solr written in Javascript we can use to implement our search engine: Lunr.js. Relevance When thinking about search, the most important question is what users want to find. This sounds obvious, but you\u0026rsquo;d be surprised how often it gets overlooked; what are we looking for (tweets, products, (the fastest route to) a destination?), who is doing the search (lawyers, software engineers, my mom?), what do we hope to get out of it (money, page views?).\nIn our case, we\u0026rsquo;re searching blog posts that have titles, tags and content (in decreasing order of importance to relevance); queries matching titles should be more important than matches in post content4.\nIndexing The project folder for my blog5 looks roughly like this:\nblog/ \u0026lt;= Hugo project root folder |- content/ \u0026lt;- this is where the pages I want to be searchable live |- about.md |- post/ |- 2018-01-01-first-post.md |- 2018-01-15-second-post.md |- ...
|- layout/ |- partials/ \u0026lt;- these contain the templates we need for search |- search.html |- search_scripts.html |- static/ |- js/ |- search/ \u0026lt;- Where we generate the index file |- vendor/ |- lunrjs.min.js \u0026lt;- lunrjs library; https://cdnjs.com/libraries/lunr.js/ |- ... |- config.toml |- ... |- Gruntfile.js \u0026lt;- This will build our index |- ... The idea is that we build an index at site generation time, and fetch that file when a user loads the page.\nI use Gruntjs6 to build the index file, along with some dependencies that make life a little easier. Install them with npm:\n$ npm install --save-dev grunt string gray-matter This is my Gruntfile.js that lives in the root of my project. It will walk through the content/ directory and parse all the markdown files it finds. It will parse out title, categories and href (this will be the reference to the post; i.e. the URL of the page we want to point to) from the front matter, and the content from the rest of the post. It also skips posts that are labeled draft, because I don\u0026rsquo;t want the posts I\u0026rsquo;m still working on to already show up in the search results.\nvar matter = require(\u0026#39;gray-matter\u0026#39;); var S = require(\u0026#39;string\u0026#39;); var CONTENT_PATH_PREFIX = \u0026#39;content\u0026#39;; module.exports = function(grunt) { grunt.registerTask(\u0026#39;search-index\u0026#39;, function() { grunt.log.writeln(\u0026#39;Build pages index\u0026#39;); var indexPages = function() { var pagesIndex = []; grunt.file.recurse(CONTENT_PATH_PREFIX, function(abspath, rootdir, subdir, filename) { grunt.verbose.writeln(\u0026#39;Parse file:\u0026#39;, abspath); var d = processMDFile(abspath, filename); if (d !== undefined) { pagesIndex.push(d); } }); return pagesIndex; }; var processMDFile = function(abspath, filename) { var content = matter(grunt.file.read(abspath, filename)); if (content.data.draft) { // don\u0026#39;t index draft posts return; } return { title: content.data.title, categories: content.data.categories, href: content.data.slug, content: S(content.content).trim().stripTags().stripPunctuation().s }; }; grunt.file.write(\u0026#39;static/js/search/index.json\u0026#39;, JSON.stringify(indexPages())); grunt.log.ok(\u0026#39;Index built\u0026#39;); }); }; To run this task, simply run grunt search-index in the directory where Gruntfile.js is located7. This will generate a JSON index file looking like this:\n[ { \u0026#34;content\u0026#34;: \u0026#34;Hi My name is Bart de Goede and ...\u0026#34;, \u0026#34;href\u0026#34;: \u0026#34;about\u0026#34;, \u0026#34;title\u0026#34;: \u0026#34;About\u0026#34; }, { \u0026#34;content\u0026#34;: \u0026#34;Like many software engineers, I figured I needed a blog of sorts...\u0026#34;, \u0026#34;href\u0026#34;: \u0026#34;Searching-your-hugo-site-with-lunr\u0026#34;, \u0026#34;title\u0026#34;: \u0026#34;Searching your Hugo site with Lunr\u0026#34;, \u0026#34;categories\u0026#34;: [ \u0026#34;hugo\u0026#34;, \u0026#34;search\u0026#34;, \u0026#34;lunr\u0026#34;, \u0026#34;javascript\u0026#34; ] }, ... ]
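As an aside, if you\u0026rsquo;d rather not pull in Node and Grunt, the same index file can be produced by a short Python script. This is only a rough sketch of the same idea, not the setup used here: it assumes YAML front matter, the third-party python-frontmatter package, and that every post sets a slug field, mirroring what the Grunt task above does:

import json
import re
from pathlib import Path

import frontmatter  # pip install python-frontmatter

CONTENT_PATH_PREFIX = Path('content')

def build_index():
    pages = []
    for path in CONTENT_PATH_PREFIX.rglob('*.md'):
        post = frontmatter.load(str(path))
        if post.metadata.get('draft'):
            # skip posts that are still drafts
            continue
        # crude plain-text version of the post body: drop HTML tags,
        # collapse whitespace
        text = re.sub(r'<[^>]+>', ' ', post.content)
        text = re.sub(r'\s+', ' ', text).strip()
        pages.append({
            'title': post.metadata.get('title'),
            'categories': post.metadata.get('categories'),
            'href': post.metadata.get('slug'),
            'content': text,
        })
    return pages

if __name__ == '__main__':
    Path('static/js/search/index.json').write_text(json.dumps(build_index()))

Run it before deploying, just like the Grunt task.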
Querying Now we\u0026rsquo;ve built the index, we need a way of obtaining it client-side, and then querying it. To do that, I have two partials that include the markup for the search input box and the links to the relevant Javascript:\n\u0026lt;script type=\u0026#34;text/javascript\u0026#34; src=\u0026#34;https://code.jquery.com/jquery-2.1.3.min.js\u0026#34;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;script type=\u0026#34;text/javascript\u0026#34; src=\u0026#34;js/vendor/lunr.min.js\u0026#34;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;script type=\u0026#34;text/javascript\u0026#34; src=\u0026#34;js/search/search.js\u0026#34;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;!-- js/search/search.js contains the code that downloads and initialises the index --\u0026gt; ... \u0026lt;input type=\u0026#34;text\u0026#34; id=\u0026#34;search\u0026#34;\u0026gt; For my blog, I have one search.js file that will download the index file, initialise the UI, and run the searches. For the sake of readability, I\u0026rsquo;ve split up the relevant functions below and added some comments to the code.\nThis function fetches the index file we\u0026rsquo;ve generated with the Grunt task, initialises the relevant fields, and then adds each of the documents to the index. The pagesIndex variable will store the documents as we indexed them, and the searchIndex variable will store the statistics and data structures we need to rank our documents for a query efficiently.\nfunction initSearchIndex() { // this file is built by the Grunt task $.getJSON(\u0026#39;js/search/index.json\u0026#39;) .done(function(documents) { pagesIndex = documents; searchIndex = lunr(function() { this.field(\u0026#39;title\u0026#39;); this.field(\u0026#39;categories\u0026#39;); this.field(\u0026#39;content\u0026#39;); this.ref(\u0026#39;href\u0026#39;); // This will add all the documents to the index. This is // different compared to older versions of Lunr, where // documents could be added after index initialisation for (var i = 0; i \u0026lt; documents.length; ++i) { this.add(documents[i]) } }); }) .fail(function(jqxhr, textStatus, error) { var err = textStatus + \u0026#39;, \u0026#39; + error; console.error(\u0026#39;Error getting index file:\u0026#39;, err); } ); } initSearchIndex(); Then, we need to sprinkle some jQuery magic on the input box. In my case, I want to start searching once a user has typed at least two characters, and support a typeahead style of searching, so every time a character is entered, I want to empty the current search results (if any), run the searchSite function with whatever is in the input box, and render the results.\nfunction initUI() { $results = $(\u0026#39;.posts\u0026#39;); // or whatever element is supposed to hold your results $(\u0026#39;#search\u0026#39;).keyup(function() { $results.empty(); // only search when query has 2 characters or more var query = $(this).val(); if (query.length \u0026lt; 2) { return; } var results = searchSite(query); renderResults(results); }); } $(document).ready(function() { initUI(); }); The searchSite function will take the query_string the user typed in, build a lunr.Query object, and run it against the index (stored in the searchIndex variable). The lunr index will return a ranked list of refs (these are the identifiers we assigned to the documents in the Gruntfile).
The second part of this method maps these identifiers to the original documents we stored in the pagesIndex variable.\n// this function will parse the query_string, which allows you // to run queries like \u0026#34;title:lunr\u0026#34; (search the title field), // \u0026#34;lunr^10\u0026#34; (boost hits with this term by a factor 10) or // \u0026#34;lunr~2\u0026#34; (will match anything within an edit distance of 2, // e.g. \u0026#34;losr\u0026#34; will also match) function simpleSearchSite(query_string) { return searchIndex.search(query_string).map(function(result) { return pagesIndex.filter(function(page) { return page.href === result.ref; })[0]; }); } // I want a typeahead search, so if a user types a query like // \u0026#34;pyth\u0026#34;, it should show results that contain the word \u0026#34;Python\u0026#34;, // rather than only matching on the entire word. function searchSite(query_string) { return searchIndex.query(function(q) { // look for an exact match and give that a massive positive boost q.term(query_string, { usePipeline: true, boost: 100 }); // prefix matches should not use stemming, and get a lower positive boost q.term(query_string, { usePipeline: false, boost: 10, wildcard: lunr.Query.wildcard.TRAILING }); }).map(function(result) { return pagesIndex.filter(function(page) { return page.href === result.ref; })[0]; }); } The snippet above lists two functions. The first shows an example of a search using the default lunr.Index#search method, which uses the lunr query syntax.\nIn my case, I want to support a typeahead search, where we show the user results for partial queries too; if the user types \u0026quot;pyth\u0026quot;, we should display results that have the word \u0026quot;python\u0026quot; in the post. To do that, we tell Lunr to combine two queries: the first q.term provides exact matches with a high boost to relevance (because it\u0026rsquo;s likely that these matches are relevant to the user), and the second appends a trailing wildcard to the query8, providing prefix matches with a (lower) boost.\nFinally, given the ranked list of results (containing all pages in the content/ directory), we want to render those somewhere on the page. The renderResults method slices the result list to the first ten results, creates a link to the appropriate post based on the href, and creates a (crude) snippet based on the first 100 characters of the content.\nfunction renderResults(results) { if (!results.length) { return; } results.slice(0, 10).forEach(function(hit) { var $result = $(\u0026#39;\u0026lt;li\u0026gt;\u0026#39;); $result.append($(\u0026#39;\u0026lt;a\u0026gt;\u0026#39;, { href: hit.href, text: \u0026#39;» \u0026#39; + hit.title })); $result.append($(\u0026#39;\u0026lt;p/\u0026gt;\u0026#39;, { text: hit.content.slice(0, 100) + \u0026#39;...\u0026#39; })); $results.append($result); }); } This is a pretty naive approach to introducing full-text search to a static site (I use Hugo, but this will work with static site generators like Jekyll or Hyde too); it completely ignores languages other than English (there\u0026rsquo;s support for other languages too), let alone non-whitespace languages like Chinese, and it requires users to download the full index that contains all your searchable pages, so it won\u0026rsquo;t scale as nicely if you have thousands of pages.
For my personal blog though, it\u0026rsquo;s good enough.\nIt\u0026rsquo;s fast, it\u0026rsquo;s written in Golang, it supports fancy themes, and it\u0026rsquo;s open source!\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nYou can make money off these ads, but the question is whether you want to show ads on your personal blog or not.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI\u0026rsquo;m assuming that the audience that\u0026rsquo;ll land on these pages will have Javascript enabled in their browser.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIn this case, I\u0026rsquo;m totally assuming that matches of query words in the title or the manually assigned tags of a post are way more relevant than matches in the content of a post, if only because there are a lot more words in post content, so there\u0026rsquo;s a higher probability of matching any word in the query.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIt\u0026rsquo;s also on GitHub.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nA port of this script to Golang is in the works.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe idea is to run the task before you deploy the latest version of your site. In my case, I have a deploy.sh script that runs Hugo to build my static pages, runs grunt search-index and pushes the result to GitHub.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nLunr uses tries to represent terms internally, giving us an efficient way of doing fast prefix lookups.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"}]