"href": "/building-a-full-text-search-engine-150-lines-of-code/",
"title": "Building a full-text search engine in 150 lines of Python code",
"categories": ["how-to", "search", "full-text search", "python"],
"content": "Full-text search is everywhere. From finding a book on Scribd, a movie on Netflix, toilet paper on Amazon, or anything else on the web through Google (like how to do your job as a software engineer), you\u0026rsquo;ve searched vast amounts of unstructured data multiple times today. What\u0026rsquo;s even more amazing is that even though you searched millions (or billions) of records, you got a response in milliseconds. In this post, we are going to explore the basic components of a full-text search engine, and use them to build one that can search across millions of documents and rank them according to their relevance in milliseconds, in less than 150 lines of Python code! Data All the code in this blog post can be found on GitHub. I\u0026rsquo;ll provide links with the code snippets here, so you can try running this yourself. You can run the full example by installing the requirements (pip install -r requirements.txt) and running the example script. This will download all the data and execute the example query with and without rankings.\nBefore we jump into building a search engine, we first need some full-text, unstructured data to search. We are going to be searching abstracts of articles from the English Wikipedia, which is currently a gzipped XML file of about 785mb and contains about 6.27 million abstracts1. I\u0026rsquo;ve written a simple function to download the gzipped XML, but you can also just manually download the file.\nData preparation The file is one large XML file that contains all abstracts. 
One abstract in this file is contained by a \u0026lt;doc\u0026gt; element, and looks roughly like this (I\u0026rsquo;ve omitted elements we\u0026rsquo;re not interested in):\n\u0026lt;doc\u0026gt; \u0026lt;title\u0026gt;Wikipedia: London Beer Flood\u0026lt;/title\u0026gt; \u0026lt;url\u0026gt;\u0026lt;/url\u0026gt; \u0026lt;abstract\u0026gt;The London Beer Flood was an accident at Meux \u0026amp; Co\u0026#39;s Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.\u0026lt;/abstract\u0026gt; ... \u0026lt;/doc\u0026gt; The bits we\u0026rsquo;re interested in are the title, the url and the abstract text itself. We\u0026rsquo;ll represent documents with a Python dataclass for convenient data access. We\u0026rsquo;ll add a property that concatenates the title and the contents of the abstract. You can find the code here.\nfrom dataclasses import dataclass @dataclass class Abstract: \u0026#34;\u0026#34;\u0026#34;Wikipedia abstract\u0026#34;\u0026#34;\u0026#34; ID: int title: str abstract: str url: str @property def fulltext(self): return \u0026#39; \u0026#39;.join([self.title, self.abstract]) Then, we\u0026rsquo;ll want to extract the abstract data from the XML and parse it so we can create instances of our Abstract object. We are going to stream through the gzipped XML without loading the entire file into memory first2. We\u0026rsquo;ll assign each document an ID in order of loading (ie the first document will have ID=1, the second one will have ID=2, etcetera). 
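Unflattened, the dataclass above is only a few lines; here it is as a runnable snippet (the example instance is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Abstract:
    """Wikipedia abstract"""
    ID: int
    title: str
    abstract: str
    url: str

    @property
    def fulltext(self):
        # the searchable text: title and abstract joined by a space
        return ' '.join([self.title, self.abstract])

doc = Abstract(ID=1, title='Wikipedia: London Beer Flood',
               abstract='An accident at a London brewery in 1814.', url='')
print(doc.fulltext)
# → Wikipedia: London Beer Flood An accident at a London brewery in 1814.
```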
You can find the code here.\nimport gzip from lxml import etree from search.documents import Abstract def load_documents(): # open a filehandle to the gzipped Wikipedia dump with gzip.open(\u0026#39;data/enwiki.latest-abstract.xml.gz\u0026#39;, \u0026#39;rb\u0026#39;) as f: doc_id = 1 # iterparse will yield the entire `doc` element once it finds the # closing `\u0026lt;/doc\u0026gt;` tag for _, element in etree.iterparse(f, events=(\u0026#39;end\u0026#39;,), tag=\u0026#39;doc\u0026#39;): title = element.findtext(\u0026#39;./title\u0026#39;) url = element.findtext(\u0026#39;./url\u0026#39;) abstract = element.findtext(\u0026#39;./abstract\u0026#39;) yield Abstract(ID=doc_id, title=title, url=url, abstract=abstract) doc_id += 1 # the `element.clear()` call will explicitly free up the memory # used to store the element element.clear() Indexing We are going to store the words from our corpus in a data structure known as an \u0026ldquo;inverted index\u0026rdquo; or a \u0026ldquo;postings list\u0026rdquo;. Think of it as the index in the back of a book that has an alphabetized list of relevant words and concepts, and on what page number a reader can find them.\n Back of the book index Practically, what this means is that we\u0026rsquo;re going to create a dictionary where we map all the words in our corpus to the IDs of the documents they occur in. That will look something like this:\n{ ... \u0026#34;london\u0026#34;: [5245250, 2623812, 133455, 3672401, ...], \u0026#34;beer\u0026#34;: [1921376, 4411744, 684389, 2019685, ...], \u0026#34;flood\u0026#34;: [3772355, 2895814, 3461065, 5132238, ...], ... } Note that in the example above the words in the dictionary are lowercased; before building the index we are going to break down or analyze the raw text into a list of words or tokens. 
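The dictionary-of-postings idea can be sketched in a few lines (the document texts and IDs are illustrative):

```python
from collections import defaultdict

# map each token to the set of document IDs it occurs in
index = defaultdict(set)

docs = {
    1: "london beer flood",
    2: "london fog",
    3: "beer brewing",
}

for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

print(index['london'])  # → {1, 2}
print(index['beer'])    # → {1, 3}
```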
The idea is that we first break up or tokenize the text into words, and then apply zero or more filters (such as lowercasing or stemming) on each token to improve the odds of matching queries to text.\n Tokenization Analysis We are going to apply very simple tokenization, by just splitting the text on whitespace. Then, we are going to apply a couple of filters on each of the tokens: we are going to lowercase each token, remove any punctuation, remove the 25 most common words in the English language (and the word \u0026ldquo;wikipedia\u0026rdquo; because it occurs in the title of every abstract) and apply stemming to every word (ensuring that different forms of a word map to the same stem, like brewery and breweries3).\nThe tokenization, lowercase and stemming filters are very simple:\nimport Stemmer STEMMER = Stemmer.Stemmer(\u0026#39;english\u0026#39;) def tokenize(text): return text.split() def lowercase_filter(tokens): return [token.lower() for token in tokens] def stem_filter(tokens): return STEMMER.stemWords(tokens) Removing punctuation is nothing more than applying a regular expression over the set of punctuation characters:\nimport re import string PUNCTUATION = re.compile(\u0026#39;[%s]\u0026#39; % re.escape(string.punctuation)) def punctuation_filter(tokens): return [PUNCTUATION.sub(\u0026#39;\u0026#39;, token) for token in tokens] Stopwords are words that are very common and we would expect to occur in (almost) every document in the corpus. As such, they won\u0026rsquo;t contribute much when we search for them (i.e. (almost) every document will match when we search for those terms) and will just take up space, so we will filter them out at index time. The Wikipedia abstract corpus includes the word \u0026ldquo;Wikipedia\u0026rdquo; in every title, so we\u0026rsquo;ll add that word to the stopword list as well. 
We drop the 25 most common words in English.\n# top 25 most common words in English and \u0026#34;wikipedia\u0026#34;: STOPWORDS = set([\u0026#39;the\u0026#39;, \u0026#39;be\u0026#39;, \u0026#39;to\u0026#39;, \u0026#39;of\u0026#39;, \u0026#39;and\u0026#39;, \u0026#39;a\u0026#39;, \u0026#39;in\u0026#39;, \u0026#39;that\u0026#39;, \u0026#39;have\u0026#39;, \u0026#39;I\u0026#39;, \u0026#39;it\u0026#39;, \u0026#39;for\u0026#39;, \u0026#39;not\u0026#39;, \u0026#39;on\u0026#39;, \u0026#39;with\u0026#39;, \u0026#39;he\u0026#39;, \u0026#39;as\u0026#39;, \u0026#39;you\u0026#39;, \u0026#39;do\u0026#39;, \u0026#39;at\u0026#39;, \u0026#39;this\u0026#39;, \u0026#39;but\u0026#39;, \u0026#39;his\u0026#39;, \u0026#39;by\u0026#39;, \u0026#39;from\u0026#39;, \u0026#39;wikipedia\u0026#39;]) def stopword_filter(tokens): return [token for token in tokens if token not in STOPWORDS] Bringing all these filters together, we\u0026rsquo;ll construct an analyze function that will operate on the text in each abstract; it will tokenize the text into individual words (or rather, tokens), and then apply each filter in succession to the list of tokens. The order is important, because we use a non-stemmed list of stopwords, so we should apply the stopword_filter before the stem_filter.\ndef analyze(text): tokens = tokenize(text) tokens = lowercase_filter(tokens) tokens = punctuation_filter(tokens) tokens = stopword_filter(tokens) tokens = stem_filter(tokens) return [token for token in tokens if token] Indexing the corpus We\u0026rsquo;ll create an Index class that will store the index and the documents. 
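As a quick check of the filter order, here is what the chain produces on a sample title (a minimal sketch: the stemming step is left out and the stopword list is abbreviated):

```python
import re
import string

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
STOPWORDS = {'the', 'a', 'of', 'in', 'and', 'wikipedia'}  # abbreviated list

def analyze(text):
    tokens = text.split()                               # tokenize on whitespace
    tokens = [t.lower() for t in tokens]                # lowercase filter
    tokens = [PUNCTUATION.sub('', t) for t in tokens]   # punctuation filter
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword filter
    return [t for t in tokens if t]                     # drop empty tokens

print(analyze('Wikipedia: The London Beer Flood!'))
# → ['london', 'beer', 'flood']
```

Note that "Wikipedia:" only matches the stopword list because punctuation is stripped first; the ordering of the filters matters.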
The documents dictionary stores the dataclasses by ID, and the index keys will be the tokens, with the values being the document IDs the token occurs in:\nclass Index: def __init__(self): self.index = {} self.documents = {} def index_document(self, document): if document.ID not in self.documents: self.documents[document.ID] = document for token in analyze(document.fulltext): if token not in self.index: self.index[token] = set() self.index[token].add(document.ID) Searching Now that we have all tokens indexed, searching for a query becomes a matter of analyzing the query text with the same analyzer as we applied to the documents; this way we\u0026rsquo;ll end up with tokens that should match the tokens we have in the index. For each token, we\u0026rsquo;ll do a lookup in the dictionary, finding the document IDs that the token occurs in. We do this for every token, and then find the IDs of documents that appear in all these sets (i.e. for a document to match the query, it needs to contain all the tokens in the query). We will then take the resulting list of document IDs, and fetch the actual data from our documents store4.\ndef _results(self, analyzed_query): return [self.index.get(token, set()) for token in analyzed_query] def search(self, query): \u0026#34;\u0026#34;\u0026#34; Boolean search; this will return documents that contain all words from the query, but not rank them (sets are fast, but unordered). \u0026#34;\u0026#34;\u0026#34; analyzed_query = analyze(query) results = self._results(analyzed_query) documents = [self.documents[doc_id] for doc_id in set.intersection(*results)] return documents In [1]: index.search(\u0026#39;London Beer Flood\u0026#39;) search took 0.16307830810546875 milliseconds Out[1]: [Abstract(ID=1501027, title=\u0026#39;Wikipedia: Horse Shoe Brewery\u0026#39;, abstract=\u0026#39;The Horse Shoe Brewery was an English brewery in the City of Westminster that was established in 1764 and became a major producer of porter, from 1809 as Henry Meux \u0026amp; Co. 
It was the site of the London Beer Flood in 1814, which killed eight people after a porter vat burst.\u0026#39;, url=\u0026#39;\u0026#39;), Abstract(ID=1828015, title=\u0026#39;Wikipedia: London Beer Flood\u0026#39;, abstract=\u0026#34;The London Beer Flood was an accident at Meux \u0026amp; Co\u0026#39;s Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.\u0026#34;, url=\u0026#39;\u0026#39;)] Now, this will make our queries very precise, especially for long query strings (the more tokens our query contains, the less likely it\u0026rsquo;ll be that there will be a document that has all of these tokens). We could optimize our search function for recall rather than precision by allowing users to specify that a match on just one of the query tokens is enough for a document to match:\ndef search(self, query, search_type=\u0026#39;AND\u0026#39;): \u0026#34;\u0026#34;\u0026#34; Still boolean search; this will return documents that contain either all words from the query or just one of them, depending on the search_type specified. We are still not ranking the results (sets are fast, but unordered). 
\u0026#34;\u0026#34;\u0026#34; if search_type not in (\u0026#39;AND\u0026#39;, \u0026#39;OR\u0026#39;): return [] analyzed_query = analyze(query) results = self._results(analyzed_query) if search_type == \u0026#39;AND\u0026#39;: # all tokens must be in the document documents = [self.documents[doc_id] for doc_id in set.intersection(*results)] if search_type == \u0026#39;OR\u0026#39;: # only one token has to be in the document documents = [self.documents[doc_id] for doc_id in set.union(*results)] return documents In [2]: index.search(\u0026#39;London Beer Flood\u0026#39;, search_type=\u0026#39;OR\u0026#39;) search took 0.02816295623779297 seconds Out[2]: [Abstract(ID=5505026, title=\u0026#39;Wikipedia: Addie Pryor\u0026#39;, abstract=\u0026#39;| birth_place = London, England\u0026#39;, url=\u0026#39;\u0026#39;), Abstract(ID=1572868, title=\u0026#39;Wikipedia: Tim Steward\u0026#39;, abstract=\u0026#39;|birth_place = London, United Kingdom\u0026#39;, url=\u0026#39;\u0026#39;), Abstract(ID=5111814, title=\u0026#39;Wikipedia: 1877 Birthday Honours\u0026#39;, abstract=\u0026#39;The 1877 Birthday Honours were appointments by Queen Victoria to various orders and honours to reward and highlight good works by citizens of the British Empire. The appointments were made to celebrate the official birthday of the Queen, and were published in The London Gazette on 30 May and 2 June 1877.\u0026#39;, url=\u0026#39;\u0026#39;), ... In [3]: len(index.search(\u0026#39;London Beer Flood\u0026#39;, search_type=\u0026#39;OR\u0026#39;)) search took 0.029065370559692383 seconds Out[3]: 49627 Relevancy We have implemented a pretty quick search engine with just some basic Python, but there\u0026rsquo;s one aspect that\u0026rsquo;s obviously missing from our little engine, and that\u0026rsquo;s the idea of relevance. Right now we just return an unordered list of documents, and we leave it up to the user to figure out which of those (s)he is actually interested in. 
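The difference between the two search types comes down to set.intersection versus set.union over the postings sets (the document IDs here are illustrative):

```python
# postings sets for the analyzed query tokens (illustrative)
results = [
    {1, 2, 3},   # documents containing "london"
    {2, 3, 5},   # documents containing "beer"
    {3, 5, 8},   # documents containing "flood"
]

# AND: a document must appear in every postings set
print(set.intersection(*results))  # → {3}

# OR: appearing in any postings set is enough
print(set.union(*results))         # → {1, 2, 3, 5, 8}
```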
Especially for large result sets, that is painful or just impossible (in our OR example, there are almost 50,000 results).\nThis is where the idea of relevancy comes in; what if we could assign each document a score that would indicate how well it matches the query, and just order by that score? A naive and simple way of assigning a score to a document for a given query is to just count how often that document mentions that particular word. After all, the more that document mentions that term, the more likely it is that it is about our query!\nTerm frequency Let\u0026rsquo;s expand our Abstract dataclass to compute and store its term frequencies when we index it. That way, we\u0026rsquo;ll have easy access to those numbers when we want to rank our unordered list of documents:\n# in from collections import Counter from .analysis import analyze @dataclass class Abstract: # snip def analyze(self): # Counter will create a dictionary counting the unique values in an array: # {\u0026#39;london\u0026#39;: 12, \u0026#39;beer\u0026#39;: 3, ...} self.term_frequencies = Counter(analyze(self.fulltext)) def term_frequency(self, term): return self.term_frequencies.get(term, 0) We need to make sure to generate these frequency counts when we index our data:\n# in we add `document.analyze()` def index_document(self, document): if document.ID not in self.documents: self.documents[document.ID] = document document.analyze() We\u0026rsquo;ll modify our search function so we can apply a ranking to the documents in our result set. 
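In isolation, the Counter-based bookkeeping looks like this (the analyzed tokens are illustrative):

```python
from collections import Counter

# token counts for one document's analyzed full text
term_frequencies = Counter(['london', 'beer', 'beer', 'flood'])

def term_frequency(term):
    # 0 for terms the document doesn't contain
    return term_frequencies.get(term, 0)

print(term_frequency('beer'))    # → 2
print(term_frequency('porter'))  # → 0
```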
We\u0026rsquo;ll fetch the documents using the same Boolean query from the index and document store, and then, for every document in that result set, we\u0026rsquo;ll simply sum up how often each term occurs in that document:\ndef search(self, query, search_type=\u0026#39;AND\u0026#39;, rank=True): # snip if rank: return self.rank(analyzed_query, documents) return documents def rank(self, analyzed_query, documents): results = [] if not documents: return results for document in documents: score = sum([document.term_frequency(token) for token in analyzed_query]) results.append((document, score)) return sorted(results, key=lambda doc: doc[1], reverse=True) Inverse Document Frequency That\u0026rsquo;s already a lot better, but there are some obvious shortcomings. We\u0026rsquo;re considering all query terms to be of equivalent value when assessing the relevancy for the query. However, it\u0026rsquo;s likely that certain terms have very little to no discriminating power when determining relevancy; for example, a collection with lots of documents about beer would be expected to have the term \u0026ldquo;beer\u0026rdquo; appear often in almost every document (in fact, we\u0026rsquo;re already trying to address that by dropping the 25 most common English words from the index). Searching for the word \u0026ldquo;beer\u0026rdquo; in such a case would essentially produce another random ordering.\nIn order to address that, we\u0026rsquo;ll add another component to our scoring algorithm that will reduce the contribution of terms that occur very often in the index to the final score. We could use the collection frequency of a term (i.e. how often does this term occur across all documents), but in practice the document frequency is used instead (i.e. how many documents in the index contain this term). 
We\u0026rsquo;re trying to rank documents after all, so it makes sense to have a document level statistic.\nWe\u0026rsquo;ll compute the inverse document frequency for a term by dividing the number of documents (N) in the index by the number of documents that contain the term, and take a logarithm of that: idf(t) = log10(N / df(t)).\n IDF; taken from We\u0026rsquo;ll then simply multiply the term frequency by the inverse document frequency during our ranking, so matches on terms that are rare in the corpus will contribute more to the relevancy score5. We can easily compute the inverse document frequency from the data available in our index:\n# import math def document_frequency(self, token): return len(self.index.get(token, set())) def inverse_document_frequency(self, token): # Manning, Raghavan and Schütze use log10, so we do too, even though it # doesn\u0026#39;t really matter which log we use anyway return math.log10(len(self.documents) / self.document_frequency(token)) def rank(self, analyzed_query, documents): results = [] if not documents: return results for document in documents: score = 0.0 for token in analyzed_query: tf = document.term_frequency(token) idf = self.inverse_document_frequency(token) score += tf * idf results.append((document, score)) return sorted(results, key=lambda doc: doc[1], reverse=True) Future Work™ And that\u0026rsquo;s a basic search engine in just a few lines of Python code! You can find all the code on GitHub, and I\u0026rsquo;ve provided a utility function that will download the Wikipedia abstracts and build an index. Install the requirements, run it in your Python console of choice and have fun messing with the data structures and searching.\nNow, obviously this is a project to illustrate the concepts of search and how it can be so fast (even with ranking, I can search and rank 6.27m documents on my laptop with a \u0026ldquo;slow\u0026rdquo; language like Python) and not production grade software. 
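Putting term frequency and inverse document frequency together, a worked example with illustrative counts (using log10, as in the snippet above):

```python
import math

N = 1_000_000  # total documents in the index (illustrative)

# document frequencies: "beer" is rare, "london" is common (illustrative)
df = {'beer': 1_000, 'london': 100_000}

def idf(token):
    return math.log10(N / df[token])

# one mention of rare "beer" outscores two mentions of common "london"
score_beer = 1 * idf('beer')      # 1 * log10(1000) = 3.0
score_london = 2 * idf('london')  # 2 * log10(10)   = 2.0
print(score_beer, score_london)
```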
It runs entirely in memory on my laptop, whereas libraries like Lucene utilize hyper-efficient data structures and even optimize disk seeks, and software like Elasticsearch and Solr scale Lucene to hundreds if not thousands of machines.\nThat doesn\u0026rsquo;t mean that we can\u0026rsquo;t think about fun expansions on this basic functionality though; for example, we assume that every field in the document has the same contribution to relevancy, whereas a query term match in the title should probably be weighted more strongly than a match in the description. Another fun project could be to expand the query parsing; there\u0026rsquo;s no reason why either all or just one term need to match. Why not exclude certain terms, or do AND and OR between individual terms? Can we persist the index to disk and make it scale beyond the confines of my laptop RAM?\n An abstract is generally the first paragraph or the first couple of sentences of a Wikipedia article. The entire dataset is currently about ±796mb of gzipped XML. There\u0026rsquo;s smaller dumps with a subset of articles available if you want to experiment and mess with the code yourself; parsing XML and indexing will take a while, and require a substantial amount of memory.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n We\u0026rsquo;re going to have the entire dataset and index in memory as well, so we may as well skip keeping the raw data in memory.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Whether or not stemming is a good idea is subject of debate. It will decrease the total size of your index (ie fewer unique words), but stemming is based on heuristics; we\u0026rsquo;re throwing away information that could very well be valuable. For example, think about the words university, universal, universities, and universe that are stemmed to univers. We are losing the ability to distinguish between the meaning of these words, which would negatively impact relevance. 
For a more detailed article about stemming (and lemmatization), read this excellent article.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n We obviously just use our laptop\u0026rsquo;s RAM for this, but it\u0026rsquo;s a pretty common practice to not store your actual data in the index. Elasticsearch stores its data as plain old JSON on disk, and only stores indexed data in Lucene (the underlying search and indexing library) itself, and many other search engines will simply return an ordered list of document IDs which are then used to retrieve the data to display to users from a database or other service. This is especially relevant for large corpora, where doing a full reindex of all your data is expensive, and you generally only want to store data relevant to relevancy in your search engine (and not attributes that are only relevant for presentation purposes).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n For a more in-depth post about the algorithm, I recommend reading and\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
"href": "/use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts/",
"title": "Use Google Cloud Text-to-Speech to create an audio version of your blog posts",
"categories": ["hugo", "blog", "text-to-speech", "how-to"],
"content": "Audio is big. Like, really big, and growing fast, to the tune of \u0026ldquo;two-thirds of the population listens to online audio\u0026rdquo; and \u0026ldquo;weekly online listeners reporting an average nearly 17 hours of listening in the last week\u0026rdquo;1. These numbers include all kinds of audio, from online radio stations, audiobooks, streaming services and podcasts (hi Spotify!). It makes sense too. Audio content is easier to consume and more engaging than written content while you\u0026rsquo;re on the go, exercising, commuting or doing household chores2. But what do you do if you\u0026rsquo;re like me and don\u0026rsquo;t have the time or recording equipment to ride this podcasting wave, and just write the occasional blog post? Well, you can always use a sophisticated deep learning text-to-speech model, train it on thousands of hours of audio content and endlessly tweak the model parameters, create an audio version of those occasional blog posts and host them on your website. Or, you know, you use the Google one3. The Cloud Text-to-Speech API is priced by character, and the first 1 million characters are free4! In this post, we\u0026rsquo;ll go over how to set up a Google API, write a Python script to extract text from a Markdown file, and create a Hugo shortcode5 to include the generated files in your static website.\nSet up a Google API In order to get started, we have to jump through a couple of hoops to create a text-to-speech API. Most of these are pretty straightforward, and are easiest to follow when you\u0026rsquo;re signed in to your Google account. It will be even easier if you\u0026rsquo;ve enabled billing which should be the default on a personal account (although you may have to add a payment method)6.\n There are some steps you\u0026#39;ll have to follow to set up an API. 
If you click that \u0026ldquo;Enable the API\u0026rdquo; button, you\u0026rsquo;ll be taken to the project creation page. This project basically functions as a label and administrative container for everything ranging from authentication (API keys) to billing (so you can see what you spend on each Google product you\u0026rsquo;re using). Give it a name (leave \u0026ldquo;organization\u0026rdquo; set to \u0026ldquo;no organization\u0026rdquo;).\n Create a new Google Cloud project This will trigger some background jobs where Google will provision the resources necessary to run your very own text-to-speech API. Finally, we\u0026rsquo;ll want to set up some authentication, so we can interact with this API. Create a service account for this project:\n Create a service account for our text-to-speech project This will download a JSON file with API credentials to your computer. Do not throw away this file. You\u0026rsquo;ll want to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of this JSON file. Keep in mind that this will set the variable for the duration of your terminal session, so you\u0026rsquo;ll have to set the variable again if you\u0026rsquo;re opening a new session.\n# in this case, the file was downloaded to my Downloads folder; you may want to put it elsewhere $ export GOOGLE_APPLICATION_CREDENTIALS=~/Downloads/text-to-speech-123456.json With all that, we can get started on our script to process some Markdown!\nWrite a script to transform text to audio I\u0026rsquo;m using Python for most of my scripting and hacking, so I\u0026rsquo;ve got virtualenv set up on my machine; this essentially installs a Python interpreter for every project, so I can keep dependencies separated. For this project we have the following requirements; install them with pip install -r requirements.txt, or individually (i.e. 
pip install Click==7.0 etc.):\nbeautifulsoup4==4.8.1 Click==7.0 pydub==0.23.1 Markdown==3.1.1 google-cloud-texttospeech==0.5.0 Click is a great library for building CLIs in Python, and besides giving us some nice features, it will make it easy to convert this script to some executable later on.\nWe\u0026rsquo;ll be calling the script like this:\n$ python path/to/ The script will take the path to a Markdown file on disk, and do a couple of things to it:\n Read the file into memory Do some clean up, and convert it to plain text Send the text to our Google Cloud Text-to-Speech API, and write the audio from the response to disk Read file from disk # import click import logging import os logging.basicConfig(level=logging.INFO, format=\u0026#39;%(asctime)s %(levelname)s %(message)s\u0026#39;) @click.command() @click.argument(\u0026#39;filename\u0026#39;, type=click.File(\u0026#39;rb\u0026#39;)) def text_to_speech(filename): name = os.path.basename(filename.name).replace(\u0026#39;.md\u0026#39;, \u0026#39;\u0026#39;) data = filename.read() if __name__ == \u0026#39;__main__\u0026#39;: text_to_speech() This snippet defines a Click command, sets up logging, will try to open its argument as a file, stores the name of the file in a variable for later use, and reads and stores the contents of the file in a variable data. Step 1, check.\nConvert to plain text We need to send Google plain text as input for their model, so the next step is to add a function that will do some cleanup and extract the text from the Markdown-formatted file. 
In order to do that, we\u0026rsquo;ll apply some regular expressions and convert the Markdown to HTML first, and use BeautifulSoup to extract the text.\n# import click import os import re from bs4 import BeautifulSoup from markdown import markdown def clean_text(text): # get rid of the Hugo preamble text = \u0026#39;\u0026#39;.join(text.decode(\u0026#39;utf8\u0026#39;).split(\u0026#39;---\u0026#39;)[2:]).strip() # get rid of superfluous newlines, as that counts towards our API limits text = re.sub(\u0026#39;\\n+\u0026#39;, \u0026#39; \u0026#39;, text) # we\u0026#39;re hacking our way around the markdown by converting to html first, # just because BeautifulSoup makes life so easy html = markdown(text) html = re.sub(r\u0026#39;\u0026lt;pre\u0026gt;(.*?)\u0026lt;/pre\u0026gt;\u0026#39;, \u0026#39; \u0026#39;, html) # this removes some artifacts from Hugo shortcodes html = re.sub(r\u0026#39;{{}}\u0026#39;, \u0026#39;\u0026#39;, html) html = re.sub(r\u0026#39;\\[\\^.*?\\]\u0026#39;, \u0026#39; \u0026#39;, html) soup = BeautifulSoup(html, \u0026#34;html.parser\u0026#34;) text = \u0026#39;\u0026#39;.join(soup.findAll(text=True)) # get rid of superfluous whitespace return re.sub(r\u0026#39;\\s+\u0026#39;, \u0026#39; \u0026#39;, text) @click.command() @click.argument(\u0026#39;filename\u0026#39;, type=click.File(\u0026#39;rb\u0026#39;)) def text_to_speech(filename): name = os.path.basename(filename.name).replace(\u0026#39;.md\u0026#39;, \u0026#39;\u0026#39;) data = filename.read() text = clean_text(data) Now that we\u0026rsquo;ve extracted the plain text from our Markdown file, we can send it to Google:\nfrom google.cloud import texttospeech ... 
@click.command() @click.argument(\u0026#39;filename\u0026#39;, type=click.File(\u0026#39;rb\u0026#39;)) def text_to_speech(filename): name = os.path.basename(filename.name).replace(\u0026#39;.md\u0026#39;, \u0026#39;\u0026#39;) data = filename.read() text = clean_text(data) # initialize the API client client = texttospeech.TextToSpeechClient() # we can send up to 5000 characters per request, so split up the text step = 5000 for j, i in enumerate(range(0, len(text), step)): synthesis_input = texttospeech.types.SynthesisInput(text=text[i:i+step]) voice = texttospeech.types.VoiceSelectionParams( language_code=\u0026#39;en-US\u0026#39;, name=\u0026#39;en-US-Wavenet-B\u0026#39; ) audio_config = texttospeech.types.AudioConfig( audio_encoding=texttospeech.enums.AudioEncoding.MP3 ) logging.info(f\u0026#39;Synthesizing speech for {name}_{j}\u0026#39;) response = client.synthesize_speech(synthesis_input, voice, audio_config) with open(f\u0026#39;{name}_{j}.mp3\u0026#39;, \u0026#39;wb\u0026#39;) as out: # Write the response to the output file. out.write(response.audio_content) logging.info(f\u0026#39;Audio content written to file \u0026#34;{name}_{j}.mp3\u0026#34;\u0026#39;) Now, this is where we run into the first quirk of the API; it will only accept snippets of up to 5000 characters. My blog posts generally range between 12k to 15k characters, so I had to add some code that will chunk up the text into bits of 5000 characters each. 
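The chunking itself is just slicing in a loop; a sketch of the approach (the 5000-character limit comes from the API):

```python
def chunk(text, step=5000):
    # consecutive slices of at most `step` characters; no attempt
    # is made to respect word boundaries
    return [text[i:i + step] for i in range(0, len(text), step)]

pieces = chunk('a' * 12_000)
print([len(p) for p in pieces])  # → [5000, 5000, 2000]
```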
Note that I don\u0026rsquo;t make any effort to detect word boundaries, so it can happen that a chunk will end with half a word; I\u0026rsquo;ll leave it up to the reader to improve upon my implementation7.\nWe provide some configuration (I like the en-US-Wavenet-B robot voice, but there are loads of other voices and languages to choose from), specify we want to receive an MP3 back, and write out the response MP3 into separate chunks in the current working directory.\nNext, we need to stitch the temporary MP3 chunks together (using the excellent pydub library), write the completed file to a sensible directory and clean up after ourselves.\nimport functools from glob import glob from pydub import AudioSegment ... mp3_segments = sorted(glob(f\u0026#39;{name}_*.mp3\u0026#39;)) segments = [AudioSegment.from_mp3(f) for f in mp3_segments] logging.info(f\u0026#39;Stitching together {len(segments)} mp3 files for {name}\u0026#39;) audio = functools.reduce(lambda a, b: a + b, segments) logging.info(f\u0026#39;Exporting {name}.mp3\u0026#39;) audio.export(f\u0026#39;static/audio/{name}.mp3\u0026#39;, format=\u0026#39;mp3\u0026#39;) logging.info(f\u0026#39;Exporting {name}.ogg\u0026#39;) audio.export(f\u0026#39;static/audio/{name}.ogg\u0026#39;, format=\u0026#39;ogg\u0026#39;) logging.info(\u0026#39;Removing intermediate files\u0026#39;) for f in mp3_segments: os.remove(f) This will stitch the MP38 segments together (functools.reduce), write out an MP3 and an OGG file (with the same filename as the blog post Markdown file) to the static/audio directory I use (change to a destination folder of your liking if necessary), and delete the intermediate files from the current directory.\nYou can find the complete script here.\nRunning the script for this article generates output that looks like this:\n$ python scripts/ content/post/ 2019-10-29 23:19:59,995 INFO Synthesizing speech for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_0 2019-10-29 23:20:08,044 INFO Audio content written to file 
\u0026#34;2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_0.mp3\u0026#34; 2019-10-29 23:20:08,045 INFO Synthesizing speech for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_1 2019-10-29 23:20:13,709 INFO Audio content written to file \u0026#34;2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_1.mp3\u0026#34; 2019-10-29 23:20:13,709 INFO Synthesizing speech for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_2 2019-10-29 23:20:18,576 INFO Audio content written to file \u0026#34;2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts_2.mp3\u0026#34; 2019-10-29 23:20:19,830 INFO Stitching together 3 mp3 files for 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts 2019-10-29 23:20:19,880 INFO Exporting 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts.mp3 2019-10-29 23:20:23,353 INFO Exporting 2019-10-29-use-google-cloud-text-to-speech-to-create-an-audio-version-of-your-blog-posts.ogg 2019-10-29 23:20:26,744 INFO Removing intermediate files Include audio in the post With this, we end up with a bunch of audio files in a directory. In order to display them properly to our users so that they can actually consume the content, we have to do a little more work. Hugo provides shortcodes, which are effectively parameterized macros that expand into snippets of HTML that get embedded in your posts. There are many shortcodes included with standard Hugo (like figure, gist or tweet), but you can also create your own. We\u0026rsquo;ll leverage that to include some swanky HTML5 audio tags in our blog posts9.\n\u0026lt;audio controls class=\u0026#34;audio_controls {{ .Get \u0026#34;class\u0026#34; }}\u0026#34; {{ with .Get \u0026#34;id\u0026#34; }}id=\u0026#34;{{ .
}}\u0026#34;{{ end }} {{ with .Get \u0026#34;preload\u0026#34; }}preload=\u0026#34;{{ . }}\u0026#34;{{ else }}preload=\u0026#34;metadata\u0026#34;{{ end }} style=\u0026#34;{{ with .Get \u0026#34;style\u0026#34; }}{{ . | safeCSS }}; {{ end }}\u0026#34; {{ with .Get \u0026#34;title\u0026#34; }}data-info-title=\u0026#34;{{ . }}\u0026#34;{{ end }} \u0026gt; {{ if .Get \u0026#34;src\u0026#34; }} \u0026lt;source {{ with .Get \u0026#34;src\u0026#34; }}src=\u0026#34;{{ . }}\u0026#34;{{ end }} {{ with .Get \u0026#34;type\u0026#34; }}type=\u0026#34;audio/{{ . }}\u0026#34;{{ end }}\u0026gt; {{ else if .Get \u0026#34;backup_src\u0026#34; }} \u0026lt;source src=\u0026#34;{{ .Get \u0026#34;backup_src\u0026#34; }}\u0026#34; {{ with .Get \u0026#34;backup_type\u0026#34; }}type=\u0026#34;audio/{{ . }}\u0026#34;{{ end }} {{ with .Get \u0026#34;backup_codec\u0026#34; }}codecs=\u0026#34;{{ . }}\u0026#34;{{ end }} \u0026gt; {{ end }} Your browser does not support the audio element \u0026lt;/audio\u0026gt; This snippet will give us access to a shortcode that injects some HTML into our post, and accepts a couple of parameters so we can include the appropriate audio file, the appropriate backup file, and override styling should we so choose. This file should be stored in layouts/shortcodes/audio.html, and can be included in your posts as follows:\nLorem ipsum dolor sit amet. {{\u0026lt;audio src=\u0026#34;/audio/name-of-your-audio-file.mp3\u0026#34; type=\u0026#34;mp3\u0026#34; backup_src=\u0026#34;/audio/name-of-your-audio-file.ogg\u0026#34; backup_type=\u0026#34;ogg\u0026#34;\u0026gt;}} Some more words, this time not in Latin. This will include an audio player looking like this in your blog post. 
I\u0026rsquo;ve added some bells and whistles to mine, with some additional styling for all the UX points.\n Basic HTML5 audio player\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Not to mention that audio content is more easily accessible for people who suffer from dyslexia or poor sight, and seems to be a lot better for user engagement.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n It has all kinds of funky stuff, like multiple languages, API clients in your favorite programming language, pitch, speaking rates and volume controls, and even optimizations around where your audio is going to play, such as headphones or phone lines.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Free is my favorite price.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n My website is a static website generated from Markdown files with Hugo, hosted on GitHub Pages, so you may need to make some small changes to make it work with Jekyll, Next.js or whichever other static site generator you\u0026rsquo;re using.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Don\u0026rsquo;t worry, the API is free for the first 4 million characters per month for standard voices, or 1 million characters for fancy WaveNet voices.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Pull requests are always welcome! 😄\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n pydub requires ffmpeg or libav to be installed to open, convert and save non-WAV files (such as MP3 or OGG)\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Not all browsers support all file types and audio codecs, which is why we\u0026rsquo;ve generated the OGG files as backup.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
"href": "/use-hugo-output-formats-to-generate-lunr-index-files/",
"title": "Use Hugo Output Formats to generate Lunr index files for your static site search",
"categories": ["hugo", "search", "lunr", "how-to"],
"content": "I\u0026rsquo;ve been using Lunr.js to enable some basic site search on this blog. Lunr.js requires an index file that contains all the content you want to make available for search. In order to generate that file, I had a kind of hacky setup, depending on running a Grunt script on every deploy, which introduces a dependency on Node, and nobody really wants any of that for just a static HTML website.\nListen to this article instead Your browser does not support the audio element I have been wanting forever to have Hugo build that file for me instead1. As it turns out, Output Formats2 make building that index file very easy. Output formats let you generate your content in formats other than HTML, such as AMP or XML for an RSS feed, and Hugo also speaks JSON.\nThe search on my blog lives on the homepage, where some (very ugly) Javascript downloads the index file, parses its contents into an inverted index, and replaces the content on the page with search results whenever someone starts typing.
Essentially, I want to create some JSON output on my homepage (index.json instead of index.html).\nI added the following snippet to my config.toml, that says that besides HTML, the homepage also has JSON output:\n[outputs] home = [\u0026#34;HTML\u0026#34;, \u0026#34;JSON\u0026#34;] page = [\u0026#34;HTML\u0026#34;] N.B.: this means that there won\u0026rsquo;t be a JSON version of the other pages; I just need it on my homepage, because that serves as the search results page too.\nNow, I don\u0026rsquo;t want that index.json file to basically be the list of links it is in the HTML version and in the RSS feed, so I added an index.json file in my layouts folder with the following content:\n[ {{ range $index, $page := .Site.Pages }} {{- if eq $page.Type \u0026quot;post\u0026quot; -}} {{- if $page.Plain -}} {{- if and $index (gt $index 0) -}},{{- end }} { \u0026quot;href\u0026quot;: \u0026quot;{{ $page.Permalink }}\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;{{ htmlEscape $page.Title }}\u0026quot;, \u0026quot;categories\u0026quot;: [{{ range $tindex, $tag := $page.Params.categories }}{{ if $tindex }}, {{ end }}\u0026quot;{{ $tag| htmlEscape }}\u0026quot;{{ end }}], \u0026quot;content\u0026quot;: {{$page.Plain | jsonify}} } {{- end -}} {{- end -}} {{- end -}} ] This will render a JSON file (named index.json) with an array in the root directory of my site, and every item in that array is one of the .Site.Pages (i.e. my posts), whenever that page has text in it and it\u0026rsquo;s not the homepage. I didn\u0026rsquo;t bother with minification, because the file is tiny and will be served nicely gzipped by Cloudflare anyway. Whenever Hugo builds the site, it will reindex all the data (i.e. rebuild this file), and I don\u0026rsquo;t have a dependency on Node and Grunt scripts anymore.\n Ever since someone opened a GitHub issue about it 😄\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Ships with Hugo version 0.20.0 or greater.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
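The (very ugly) JavaScript mentioned above essentially turns this file into an inverted index; the same idea can be sketched in Python (a toy illustration on my part, not the actual site code — only the href and content field names come from the template above):

```python
import re
from collections import defaultdict

def build_inverted_index(entries):
    """Map every lowercased word to the set of hrefs whose content
    contains it, i.e. a minimal inverted index."""
    index = defaultdict(set)
    for entry in entries:
        for token in re.findall(r'\w+', entry['content'].lower()):
            index[token].add(entry['href'])
    return index

def search(index, query):
    """Return the hrefs that contain *all* query terms (AND semantics)."""
    terms = re.findall(r'\w+', query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# entries would come from the generated file, e.g.:
# entries = json.load(open('index.json'))
```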
"href": "/tab-plus-search-from-your-url-bar-with-opensearch/",
"title": "Custom OpenSearch: search from your URL bar",
"categories": ["search", "opensearch", "how-to"],
"content": "Almost all modern browsers enable websites to customize the built-in search feature to let the user access their search features directly, without going to your website first and finding the search input box. If your website has search functionality accessible through a basic GET request, it\u0026rsquo;s surprisingly simple to enable this for your website too.\nListen to this article instead Your browser does not support the audio element Typing \u0026#39;bart\u0026#39; and hitting tab in my Chrome browser lets me search the website directly. Some browsers do it automatically If your users are on Chrome, chances are this already works! Chromium tries really hard to figure out where your search page is and how to access it. A strong hint you can give it is to change the type of the \u0026lt;input\u0026gt; element to \u0026quot;search\u0026quot;1:\n\u0026lt;input autocapitalize=\u0026quot;off\u0026quot; autocorrect=\u0026quot;off\u0026quot; autocomplete=\u0026quot;off\u0026quot; name=\u0026quot;q\u0026quot; placeholder=\u0026quot;Search\u0026quot; type=\u0026quot;search\u0026quot;\u0026gt; The \u0026quot;name\u0026quot; attribute gives the browser a hint as to what HTTP parameter will hold the query (it is a good idea to configure your Google Analytics to pick this up as well!).\nThis will let the browser add some nice UI elements to the search input box, like a small \u0026ldquo;x\u0026rdquo; button on the right to clear the search input in Safari and Chrome. Enabling the \u0026quot;autocapitalize\u0026quot;, \u0026quot;autocorrect\u0026quot; and \u0026quot;autocomplete\u0026quot; attributes will instruct your browser to modify and correct the user input even further (think of the iOS autocorrect feature, for example).\n Just by changing the input type you can hook in to the browsers\u0026#39; native UX. 
Word of warning Because Apple once upon a time relied on the type attribute to give their search box a more \u0026ldquo;Mac-like\u0026rdquo; feel, Safari will basically ignore any CSS applied to \u0026lt;input type=\u0026quot;search\u0026quot;\u0026gt; elements. If you need Safari to treat your search field like any other input field for display purposes, you can add the following to your CSS:\ninput[type=\u0026quot;search\u0026quot;] { -webkit-appearance: textfield; } This will let you apply your own styles to the input box.\nOthers don\u0026rsquo;t Not all browsers do this out of the box, so you need to provide them with a more formalized configuration. Most browsers find out about the search functionality of a website through an OpenSearch XML file that directs them to the right page.\nOpenSearch OpenSearch is a standard that was developed by A9, an Amazon subsidiary developing search engine and search advertising technology, and has been around since Jeff Bezos unveiled it in 2005 at a conference on emerging technologies.\nIt is nothing more than an XML specification that lets a website describe a search engine for itself, and where a user or browser might find and use it. Firefox, Chrome, Edge, Internet Explorer and Safari all support the OpenSearch standard, with Firefox even supporting features that are not in the standard, such as search suggestions.\nXML All you need is a small XML file. Below is an example of the one we have at work:\n\u0026lt;OpenSearchDescription xmlns=\u0026quot;\u0026quot; xmlns:moz=\u0026quot;\u0026quot;\u0026gt; \u0026lt;ShortName\u0026gt;\u0026lt;/ShortName\u0026gt; \u0026lt;Description\u0026gt;Scribd's mission is to create the world's largest open library of documents.
Search it.\u0026lt;/Description\u0026gt; \u0026lt;Url type=\u0026quot;text/html\u0026quot; method=\u0026quot;get\u0026quot; template=\u0026quot;{searchTerms}\u0026quot; /\u0026gt; \u0026lt;Image height=\u0026quot;32\u0026quot; width=\u0026quot;32\u0026quot; type=\u0026quot;image/x-icon\u0026quot;\u0026gt;\u0026lt;/Image\u0026gt; \u0026lt;/OpenSearchDescription\u0026gt; It provides a \u0026lt;ShortName\u0026gt; (there\u0026rsquo;s a \u0026lt;LongName\u0026gt; element too, that\u0026rsquo;s mostly used for aggregators or automatically generated search plugins), a \u0026lt;Description\u0026gt; of what the search will let you do, and most importantly, the \u0026lt;Url\u0026gt; where you can do it.\nIt tells the browser there\u0026rsquo;s a text/html page that can process an HTTP GET request, and has a template for the browser. {searchTerms} will be interpolated with the query terms the user will type in the browser. You need to host this file somewhere with the rest of your web pages.\nBut what if you don\u0026rsquo;t have a dedicated search engine for your website? Well, just use Google! Replace the value of the \u0026quot;template\u0026quot; attribute with something like this2:\n\u0026lt;Url type=\u0026quot;text/html\u0026quot; method=\u0026quot;get\u0026quot; template=\u0026quot; {searchTerms}\u0026quot;\u0026gt; This will redirect your user to the Google search results, but those will only display matches from content on your site. That\u0026rsquo;s a lot cheaper than employing a bunch of engineers to build and maintain a custom search engine!\nTurn on autodiscovery! Now we need to activate the automatic discovery of search engines in the browsers of your users. 
That sounds a lot cooler and more complicated than it actually is; the only thing you have to do is provide a \u0026lt;link\u0026gt; somewhere in the \u0026lt;head\u0026gt; of your webpages:\n\u0026lt;link rel=\u0026quot;search\u0026quot; href=\u0026quot;\u0026quot; type=\u0026quot;application/opensearchdescription+xml\u0026quot; title=\u0026quot;Search\u0026quot;\u0026gt; This will alert browsers that load the page that there is a search feature available, described in the linked XML file. Make sure your OpenSearch XML file is available and can be loaded from your webserver, and refresh the page containing the \u0026lt;link\u0026gt;. This will tell the browser where to look, and enable custom search!\n Now tab-searching from the Safari URL bar works too! The OpenSearch specification supports a lot more features than this, ranging from \u0026lt;Tags\u0026gt; to help plugins generated from these standardized descriptions be found better in search plugin aggregators, what \u0026lt;Language\u0026gt; the search engine supports, or whether the search results may contain \u0026lt;AdultContent\u0026gt;. There are many ways to configure and customize OpenSearch that go way beyond the basic example described here, but for my little blog this is more than enough 😄.\n The other attributes are to dis-/enable features certain other browsers like Safari have that automatically correct what you type into the search box.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Yes, you could absolutely point your search input to my website, but that\u0026rsquo;s not a requirement 😉\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
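The {searchTerms} substitution the browser performs boils down to URL-encoding the query and splicing it into the template; a minimal sketch (the template URL below is a made-up example, not one from this post):

```python
from urllib.parse import quote_plus

def fill_template(template, query):
    """Interpolate the OpenSearch {searchTerms} placeholder with the
    URL-encoded query, roughly as a browser would before navigating."""
    return template.replace('{searchTerms}', quote_plus(query))

# hypothetical example template
url = fill_template('https://example.com/search?q={searchTerms}', 'london beer flood')
```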
"href": "/github-pages-and-lets-encrypt/",
"title": "Free SSL on Github Pages with a custom domain: Part 2 - Let's Encrypt",
"categories": ["ssl", "hugo", "how-to", "gh-pages", "https", "lets-encrypt"],
"content": "GitHub Pages has just become even more awesome. Since yesterday1, GitHub Pages supports HTTPS for custom domains. And yes, it is still free!\nListen to this article instead Your browser does not support the audio element Let\u0026rsquo;s Encrypt GitHub has partnered with Let\u0026rsquo;s Encrypt, which is a free, open and automated certificate authority (CA). It is run by the Internet Security Research Group (ISRG), which is a public benefit corporation2 funded by donations and a bunch of large corporations and non-profits.\nThe goal of this initiative is to secure the web by making it very easy to obtain a free, trusted SSL certificate. Moreover, it lets web servers run a piece of software that not only gets a valid SSL certificate, but will also configure your web server and automatically renew the certificate when it expires.\nHow does it do that? It works by running a bit of software on your web server, a certificate management agent. This agent software has two tasks: it proves to the Let\u0026rsquo;s Encrypt certificate authority that it controls the domain, and it requests, renews and revokes certificates for the domain it controls.\nValidating a domain Similar to a traditional process of obtaining a certificate for a domain, where you create an account with the CA and add domains you control, the certificate management agent needs to perform a test to prove that it controls the domain.\nThe agent will ask the Let\u0026rsquo;s Encrypt CA what it needs to do to prove that it is, effectively, in control of the domain. The CA will look at the domain and issue one or more challenges that the agent needs to complete to prove it has control over the domain. For example, it can ask the agent to provision a particular DNS record under the domain, or make an HTTP resource available under a particular URL.
With these challenges, it provides the agent with a nonce (some random number that can only be used once for verification purposes).\n CA issuing a challenge to the certificate management agent (image taken from In the image above, the agent creates a file on a specified path on the web server (in this case, on It creates a key pair it will use to identify itself with the CA, and signs the nonce received from the CA with the private key. Then, it notifies the CA that it has completed the challenge by sending back the signed nonce and is ready for validation. The CA then validates the completion of the challenge by attempting to download the file from the web server and verify that it contains the expected content.\n Certificate management agent completing a challenge (image taken from If the signed nonce is valid, and the challenge is completed successfully, the agent identified by the public key is officially authorized to manage valid SSL certificates for the domain.\nCertificate management So, what does that mean? By having validated the agent by its public key, the CA can now validate that messages sent to the CA are actually sent by the certificate management agent.\nIt can send a Certificate Signing Request (CSR) to the CA to request that it issue an SSL certificate for the domain, signed with the authorized key. Let\u0026rsquo;s Encrypt will only have to validate the signatures, and if those check out, a certificate will be issued.\n Issuing a certificate (image taken from Let\u0026rsquo;s Encrypt will add the certificate to the appropriate channels, so that browsers will know that the CA has validated the certificate, and will display that coveted green lock to your users!\nSo, GitHub Pages Right, that\u0026rsquo;s how we got started.
The awesome thing about Let\u0026rsquo;s Encrypt is that it is automated, so all this handshaking and verifying happens behind the scenes, without you having to be involved.\nIn the previous post we saw how to set up a CNAME file for your custom domain. That\u0026rsquo;s it. Done. Works out of the box.\nOptionally, you can enforce HTTPS in the settings of your repository. This will upgrade all users requesting stuff from your site over HTTP to be automatically redirected to HTTPS.\n If you use A records to route traffic to your website, you need to update your DNS settings at your registrar. These IP addresses are new, and have an added benefit of putting your static site behind a CDN (just like we did with Cloudflare in the previous post).\nSSL all the things Let\u0026rsquo;s Encrypt makes securing the web easy. More and more websites are served over HTTPS only, so it is getting increasingly difficult for script kiddies to sniff your web traffic on free WiFi networks. Moreover, they provide this service world-wide, to anyone, for free. Help them help you (and the rest of the world), and buy them a coffee!\n At time of writing, yesterday is May 1, 2018.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n One in California, to be specific.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
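To make the handshake described in this post concrete, here is a toy model of the challenge round-trip. It is a deliberate simplification on my part: real ACME agents sign with an asymmetric private key and the CA verifies with the public key, while this sketch uses a shared-secret HMAC purely to show the sign-and-verify shape:

```python
import hashlib
import hmac
import secrets

# Toy model only: real ACME uses asymmetric signatures (the agent
# signs with its private key, the CA verifies with the public key).
# HMAC with a shared secret stands in for that here.

def issue_challenge():
    """CA side: hand the agent a single-use nonce."""
    return secrets.token_hex(16)

def complete_challenge(nonce, agent_key):
    """Agent side: sign the nonce to prove control of the key."""
    return hmac.new(agent_key, nonce.encode(), hashlib.sha256).hexdigest()

def validate(nonce, signature, agent_key):
    """CA side: check the signed nonce before authorizing the key
    to request certificates for the domain."""
    expected = hmac.new(agent_key, nonce.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The real protocol additionally ties the challenge to a resource the CA can fetch over HTTP or DNS, which this sketch leaves out.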
"href": "/free-ssl-on-github-pages-with-a-custom-domain/",
"title": "Free SSL with a custom domain on GitHub Pages",
"categories": ["ssl", "hugo", "how-to", "gh-pages", "https"],
"content": "GitHub Pages is pretty awesome. It lets you push a bunch of static HTML (and/or CSS and Javascript) to a GitHub repository, and they\u0026rsquo;ll host and serve it for you. For free!\nListen to this article instead Your browser does not support the audio element You basically set up a specific repository (you have to name it \u0026lt;your_username\u0026gt;), you push your HTML there, and they will be available at https://\u0026lt;your_username\u0026gt; Did I mention that this is free?\nWhile you can perfectly write and push HTML files straight to your GitHub repository, there\u0026rsquo;s a whole bunch of open source static site generators available that provide a structured way of organising content, in formats (Markdown 🙌) that are easier to work with1. GitHub even supports one of them (Jekyll) out of the box, so you can just push your project as is and they\u0026rsquo;ll take care of building your HTML too2.\nYou can even set up your own custom domain! Register your domain at your favourite registrar, and change a setting for your repository:\nThere, you fill out the custom domain you want your site to be available at (in my case that\u0026rsquo;s\nBefore you rush off to your registrar to point your domain (or subdomain, in my case3), make sure you add a CNAME file to the root of your repository. The CNAME file should contain the URL your website should be displaying in the browser (this is important for redirects). In my case, the file contains, because that\u0026rsquo;s the URL I want my site to be published under.\nSetting up CloudFlare and SSL Then, all you need to do is add a CNAME entry to your domain\u0026rsquo;s DNS settings. Right? Well, yes and no.
Yes, setting up a CNAME DNS record will get your website working under the proper URL (it might take a while for the DNS change to propagate).\nHowever, serving your static files from GitHub under your own domain name does pose a problem; GitHub Pages only supports SSL for the domain, not for custom domains (they have a wildcard certificate for their own domain, but supporting HTTPS on custom domains is not trivial4).\nThat means that your website can\u0026rsquo;t take advantage of HTTP/2 speedups, it will have a negative impact on your Google ranking, Chrome will show your visitors that your website is not secure, and even for a static site without fancy Javascript features you do want to protect your users when they\u0026rsquo;re reading your posts on unsecured Wi-Fi networks.\nCloudFlare Fortunately, there\u0026rsquo;s a way to get this coveted green secure lock on your static website. CloudFlare5 provides the (free) feature \u0026ldquo;Universal SSL\u0026rdquo; that will allow your users to access your website over SSL. Sign up for a free account, and enter the (non-SSL-ized) domain name of your website in their scanning tool:\nCloudFlare will fetch your current DNS configuration, and will provide you with instructions on how to enable CloudFlare for your (sub-)domain(s). The idea is that CloudFlare will act as a proxy between your GitHub hosted site and the user. This will allow them to encrypt traffic between their servers and your users (the traffic between GitHub and CloudFlare is also encrypted, but doesn\u0026rsquo;t require you to install an SSL certificate on the GitHub servers; an added bonus is that they can cache your content on servers close to your visitors, increasing the page speed of your website).\nEnable CloudFlare for the (sub)domain you\u0026rsquo;re hosting your website on:\nEnabling SSL CloudFlare\u0026rsquo;s Universal SSL lets you provide your website\u0026rsquo;s users with a valid signed SSL certificate.
There are several configuration options for Universal SSL (you can find them in the \u0026ldquo;Crypto\u0026rdquo; tab); make sure your SSL mode is set to Full SSL (but not Full SSL (Strict)!).\nDo note it may take a while (up to 24 hours) for CloudFlare to set you up with your SSL certificates. They will send you an email once they\u0026rsquo;re provisioned and ready to go.\nNext, create a Page Rule. Page rules are, surprisingly, rules that apply to a page or a collection of pages. These rules can do a lot of cool things, such as automatically obfuscating emails on the page, controlling cache settings or adding geolocation information to the requests. The rule you\u0026rsquo;re looking for is \u0026ldquo;Always Use HTTPS\u0026rdquo;, which will force all requests for pages matching the URL pattern you provide to use SSL:\nIn my case, I only have one URL for my website. However, if you use the www subdomain (i.e., you might want to add a Page Rule that redirects users that type to, where you enforce HTTPS to ensure all users benefit from encrypted requests. However, if you add more Page Rules, make sure that the HTTPS rule is the primary (first) page rule. Only one rule will trigger per URL, so you\u0026rsquo;ll want to make sure that this one is listed first!\nProfit! Right? This article has gotten quite meaty for the steps you have to follow, so if you\u0026rsquo;re looking for a more concise set of steps, this Gist by @cvan is great:\n There\u0026rsquo;s a lot more you can do with CloudFlare and your static site (you could set up caching on CloudFlare\u0026rsquo;s content distribution network, for example), but be aware that even though you\u0026rsquo;ve encrypted your traffic, you should still be careful in submitting sensitive data to (third-party) APIs with Javascript; \u0026ldquo;GitHub Pages sites shouldn\u0026rsquo;t be used for sensitive transactions like sending passwords or credit card numbers\u0026rdquo;.
Your website\u0026rsquo;s source code is publicly available in your GitHub repository, so be mindful of any scripts and content you publish there.\n I use Hugo for this website, which is written in Golang (\u0026ldquo;fast\u0026rdquo; and \u0026ldquo;easy\u0026rdquo; are keywords I like). There are a lot of different static site generators out there, each with their own focuses, advantages and trade-offs.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n In my setup, I have two separate repositories, where I maintain the Hugo project structure in one (the blog repository), and build and push the static files to the other (the repository). What I like about that is that it gives me a \u0026ldquo;deploy\u0026rdquo; step, so I don\u0026rsquo;t accidentally push something that\u0026rsquo;s not finished yet.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Skipping this step took me a lot longer to figure out than I\u0026rsquo;m willing to admit.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n There have been discussions about this for a while.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n CloudFlare is a company that provides a content-delivery network (CDN), DDoS protection services, DNS and a whole slew of other services for websites.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
"href": "/bloom-filters-bit-arrays-recommendations-caches-bitcoin/",
"title": "Bloom filters, using bit arrays for recommendations, caches and Bitcoin",
"categories": ["python", "bloom filter", "how-to"],
"content": "Bloom filters are cool. In my experience, it\u0026rsquo;s a somewhat underestimated data structure that sounds more complex than it actually is. In this post I\u0026rsquo;ll go over what they are, how they work (I\u0026rsquo;ve hacked together an interactive example to help visualise what happens behind the scenes) and go over some of their use cases in the wild.\nListen to this article instead Your browser does not support the audio element What is a Bloom filter? A Bloom filter is a data structure designed to quickly tell you whether an element is not in a set. What\u0026rsquo;s even nicer, it does so within the memory constraints you specify. It doesn\u0026rsquo;t actually store the data itself, only a trimmed-down version of it. This gives it the desirable property that it has a constant time complexity1 for both adding a value to the filter and for checking whether a value is present in the filter. The cool part is that this is independent of how many elements are already in the filter.\nLike with most things that offer great benefits, there is a trade-off: Bloom filters are probabilistic in nature. On rare occasions, it will respond with yes to the question whether the element is in the set (false positives are a possibility), although it will never respond with no if the value is actually present (false negatives can\u0026rsquo;t happen).\nYou can actually control how rare those occasions are, by setting the size of the Bloom filter bit array and the number of hash functions depending on the number of elements you expect to add2. Also, note that you can\u0026rsquo;t remove items from a Bloom filter.\nHow does it work? An empty Bloom filter is a bit array of a particular size (let\u0026rsquo;s call that size m) where all the bits are set to 0. In addition, there must be a number (let\u0026rsquo;s call the number k) of hashing functions defined.
Each of these functions hashes a value to one of the positions in our array m, distributing the values uniformly over the array.\nWe\u0026rsquo;ll do a very simple Python implementation3 of a Bloom filter. For simplicity\u0026rsquo;s sake, we\u0026rsquo;ll use a bit array4 with 15 bits (m=15) and 3 hashing functions (k=3) for the running example.\nimport mmh3 class Bloomfilter(object): def __init__(self, m=15, k=3): self.m = m self.k = k # we use a list of Booleans to represent our # bit array for simplicity self.bit_array = [False for i in range(self.m)] def add(self, element): ... def check(self, element): ... To add elements to the array, our add method needs to run k hashing functions on the input, each of which picks a (seemingly random) index in our bit array. We\u0026rsquo;ll use the mmh3 library to hash our element, and use the index of each hash function as a seed to give us a different hash for each of them. Finally, we compute the remainder of the hash divided by the size of the bit array to obtain the position we want to set.5\ndef add(self, element): \u0026quot;\u0026quot;\u0026quot; Add an element to the filter. Murmurhash3 gives us hash values distributed uniformly enough that we can use different seeds to represent different hash functions \u0026quot;\u0026quot;\u0026quot; for i in range(self.k): # this will give us a number between 0 and m - 1 digest = mmh3.hash(element, i, signed=False) % self.m self.bit_array[digest] = True In our case (m=15 and k=3), we would set the bits at index 1, 7 and 10 to one for the string hello.\nIn [1]: mmh3.hash('hello', 0, signed=False) % 15 Out[1]: 1 In [2]: mmh3.hash('hello', 1, signed=False) % 15 Out[2]: 7 In [3]: mmh3.hash('hello', 2, signed=False) % 15 Out[3]: 10 Now, to determine if an element is in the Bloom filter, we apply the same hash functions to the element, and see whether the bits at the resulting indices are all 1.
If one of them is not 1, then the element has not been added to the filter (because otherwise we\u0026rsquo;d see a value of 1 for all hash functions!).\ndef check(self, element): \u0026quot;\u0026quot;\u0026quot; To check whether element is in the filter, we hash the element with the same hash functions as the add function (using the seed). If one of those bits isn't set in our bit_array, the element is not in there (a value can only be in there if all of the indices it hashes to have been set before). \u0026quot;\u0026quot;\u0026quot; for i in range(self.k): digest = mmh3.hash(element, i, signed=False) % self.m if self.bit_array[digest] == False: # if any of the bits hasn't been set, then it's not in # the filter return False return True You can see how this approach guarantees that there will be no false negatives, but that there might be false positives; especially in our toy example with the small bit array, the more elements you add to the filter, the more likely it gets that the three bits we hash an element to are set by other elements (running one of the hash functions on the string world will also set the bit at index 7 to 1):\nIn [4]: mmh3.hash('world', 0, signed=False) % 15 Out[4]: 7 In [5]: mmh3.hash('world', 1, signed=False) % 15 Out[5]: 4 In [6]: mmh3.hash('world', 2, signed=False) % 15 Out[6]: 9 We can actually compute the probability of our Bloom filter returning a false positive, as it is a function of the number of bits set in the bit array divided by the length of the bit array (m), raised to the power of the number of hash functions we\u0026rsquo;re using (k) (we\u0026rsquo;ll leave that for a future post though).
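Putting the snippets from this post together, here is a self-contained variant of the filter. As a portability assumption on my part it swaps mmh3 for the standard library's hashlib (mixing the hash function index into the input to simulate different seeds), so the indices it produces won't match the 1, 7 and 10 from the example above; it also adds the common false-positive approximation (1 - e^(-kn/m))^k:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m=15, k=3):
        self.m = m
        self.k = k
        self.n = 0  # number of elements added so far
        self.bit_array = [False] * m

    def _indices(self, element):
        # mix the hash function index into the input to simulate
        # k independent hash functions (mmh3's seed in the post)
        for i in range(self.k):
            digest = hashlib.sha256(f'{i}:{element}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.m

    def add(self, element):
        for idx in self._indices(element):
            self.bit_array[idx] = True
        self.n += 1

    def check(self, element):
        # True means "probably in the set", False means "definitely not"
        return all(self.bit_array[idx] for idx in self._indices(element))

    def expected_fpp(self):
        # standard approximation of the false positive probability
        return (1 - math.exp(-self.k * self.n / self.m)) ** self.k
```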
The more values we add, the higher the probability of false positives becomes.\nInteractive example To further drive home how Bloom filters work, I\u0026rsquo;ve hacked together a Bloom filter in JavaScript that uses the cells in the table below as a \u0026ldquo;bit array\u0026rdquo; to visualise how adding more values will fill up the filter and increase the probability of a false positive (a full Bloom filter will always return \u0026ldquo;yes\u0026rdquo; for whatever value you throw at it).\nWhat can I use it for? Given that a Bloom filter is really good at telling you whether something is in a set or not, caching is a prime candidate for using a Bloom filter. CDN providers like Akamai6 use it to optimise their disk caches; nearly 75% of the URLs that are accessed in their web caches are accessed only once and never again. To avoid caching these \u0026ldquo;one-hit wonders\u0026rdquo; and massively reduce disk space requirements, Akamai uses a Bloom filter to store all URLs that are accessed. If a URL is found in the Bloom filter, it means it was requested before, and should be stored in their disk cache.\nBlogging platform Medium uses Bloom filters7 to filter out posts that users have already read from their personalised reading lists. They create a Bloom filter for every user, and add every article they read to the filter. When a reading list is generated, they can check the filter to see whether the user has already seen an article. The trade-off for false positives (i.e. an article incorrectly flagged as read) is more than acceptable, because in the worst case the user simply isn\u0026rsquo;t shown an article they haven\u0026rsquo;t read yet (and they will never know).\nQuora does something similar to filter out stories users have seen before, and Facebook and LinkedIn use Bloom filters in their typeahead searches (it basically provides a fast and memory-efficient way to filter out documents that can\u0026rsquo;t match on the prefix of the query terms).\nBitcoin relies strongly on a peer-to-peer style of communication, instead of the client-server architecture in the examples above. Every node in the network is a server, and everyone in the network has a copy of everyone else\u0026rsquo;s transactions. For big beefy servers in a data center that\u0026rsquo;s fine, but what if you don\u0026rsquo;t necessarily care about all transactions? Think of a mobile wallet application, for example: you don\u0026rsquo;t want all transactions on the blockchain, especially when you have to download them on a mobile connection. To address this, Bitcoin has an option called Simplified Payment Verification (SPV) which lets your (mobile) node request only the transactions it\u0026rsquo;s interested in (i.e. payments from or to your wallet address). The SPV client calculates a Bloom filter for the transactions it cares about, so the \u0026ldquo;full node\u0026rdquo; has an efficient way to answer \u0026ldquo;is this client interested in this transaction?\u0026rdquo;. The cost of false positives (i.e. a client is actually not interested in a transaction) is minimal, because when the client processes the transactions returned by the full node it can simply discard the ones it doesn\u0026rsquo;t care about.\nClosing thoughts There are a lot more applications for Bloom filters out there, and I can\u0026rsquo;t list them all here. 
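To tie the snippets from this post together, here is a compact, self-contained version of the filter. Note that this sketch swaps mmh3 for the standard library hashlib (seeded by prefixing the element), so it runs without dependencies but produces different bit positions than the examples above:

```python
import hashlib

class BloomFilter:
    # minimal Bloom filter; hashlib with per-function seed prefixes
    # stands in for mmh3 seeded with 0..k-1

    def __init__(self, m=15, k=3):
        self.m = m
        self.k = k
        self.bit_array = [False] * m

    def _indices(self, element):
        # derive k different hash functions by prefixing a seed
        for seed in range(self.k):
            digest = hashlib.sha256(f'{seed}:{element}'.encode()).digest()
            yield int.from_bytes(digest, 'big') % self.m

    def add(self, element):
        for i in self._indices(element):
            self.bit_array[i] = True

    def check(self, element):
        # no false negatives: if element was added, all k bits are set
        return all(self.bit_array[i] for i in self._indices(element))
```

A quick roundtrip: after bf.add('hello'), bf.check('hello') always returns True, while a value that was never added usually (but, as discussed above, not always!) returns False.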
I hope I gave you a whirlwind overview of how Bloom filters work and how they might be useful to you.\nFeel free to drop me a line or comment below if you have nice examples of where they\u0026rsquo;re used, or if you have any feedback, comments, or just want to say hi :-)\n The runtime for both inserting and checking is defined by the number of hash functions (k) we have to execute. So, O(k). Space complexity is more difficult to quantify, because that depends on how many false positives you\u0026rsquo;re willing to tolerate; allocating more space will lower the false positive rate.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Going over the math is a bit much for this post, so check Wikipedia for all the formulas 😄.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Full implementation on GitHub.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Our implementation won\u0026rsquo;t use an actual bit array but a Python list containing Booleans for the sake of readability.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Note that there\u0026rsquo;s a slight difference between the Python and Javascript Murmurhash implementations in the libraries I\u0026rsquo;ve used; the Javascript library I used returns a 32 bit unsigned integer, whereas the Python library returns a 32 bit signed integer by default. To keep the Python example consistent with the Javascript, I opted to use unsigned integers there too; this has no impact on the workings of the Bloom filter.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), \u0026ldquo;Algorithmic nuggets in content delivery\u0026rdquo;, SIGCOMM Computer Communication Review, New York, NY, USA: ACM, 45 (3): 52–66, doi:10.1145/2805789.2805800\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Read the article. It\u0026rsquo;s really good.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "
"href": "/searching-your-hugo-site-with-lunr/",
"title": "Searching your Hugo site with Lunr",
"categories": ["hugo", "search", "lunr", "javascript", "how-to"],
"content": "Like many software engineers, I figured I needed a blog of sorts, because it would give me a place for my own notes on \u0026ldquo;How To Do Things™\u0026rdquo;, let me have a URL to give people, and share my ramblings about Life, the Universe and Everything Else with whoever wants to read them.\nBecause I\u0026rsquo;m trying to get more familiar with Go, I opted to use the awesome Hugo1 framework to build myself a static site hosted on Github Pages.\nIn my day job I work on our search engine, so the first thing that I wanted to have was some basic search functionality for all the blog posts I haven\u0026rsquo;t written yet, preferably something that I can mess with, and that is extensible and configurable.\nThere are three options if you want to add search functionality to a static website, each with their pros and cons:\n Third-party service (e.g. Google CSE): There are a bunch of services that provide basic search widgets for your site, such as Google Custom Search Engine (CSE). Those are difficult to customise, break your UI with their Google-styled widgets, and (in some cases) will display ads on your website2. Run a server-side search engine: You can set up a backend that indexes your data and can process the queries your users submit in the search box on your website. The obvious downside is that you throw away all the benefits of having a static site (free hosting, no infrastructure to maintain). Search client-side: Having a static site, it makes sense to move all the user interaction to the client. We depend on the users' browser to run Javascript3 and download the searchable data in order to run queries against it, but the upside is that you can control how data is processed and how that data is queried. 
Fortunately for us, Atwood\u0026rsquo;s Law holds true; there\u0026rsquo;s a full-text search library inspired by Lucene/Solr written in Javascript we can use to implement our search engine: Lunr.js. Relevance When thinking about search, the most important question is what users want to find. This sounds like stating the obvious, but you\u0026rsquo;d be surprised how often this gets overlooked; what are we looking for (tweets, products, (the fastest route to) a destination?), who is doing the search (lawyers, software engineers, my mom?), what do we hope to get out of it (money, page views?).\nIn our case, we\u0026rsquo;re searching blog posts that have titles, tags and content (in decreasing order of value to relevance); queries matching titles should be more important than matches in post content4.\nIndexing The project folder for my blog5 looks roughly like this:\nblog/ \u0026lt;= Hugo project root folder |- content/ \u0026lt;- this is where the pages I want to be searchable live |- |- post/ |- |- |- ... |- layout/ |- partials/ \u0026lt;- these contain the templates we need for search |- search.html |- search_scripts.html |- static/ |- js/ |- search/ \u0026lt;- Where we generate the index file |- vendor/ |- lunrjs.min.js \u0026lt;- lunrjs library; |- ... |- config.toml |- ... |- Gruntfile.js \u0026lt;- This will build our index |- ... The idea is that we build the index at site generation time, and fetch that file when a user loads the page.\nI use Gruntjs6 to build the index file, plus some dependencies that make life a little easier. Install them with npm:\n$ npm install --save-dev grunt string gray-matter This is my Gruntfile.js that lives in the root of my project. It will walk through the content/ directory and parse all the markdown files it finds. It will parse out the title, categories and href (this will be the reference to the post; i.e. the URL of the page we want to point to) from the front matter, and the content from the rest of the post. 
It also skips posts that are labeled draft, because I don\u0026rsquo;t want the posts I\u0026rsquo;m still working on to already show up in the search results.\nvar matter = require('gray-matter'); var S = require('string'); var CONTENT_PATH_PREFIX = 'content'; module.exports = function(grunt) { grunt.registerTask('search-index', function() { grunt.log.writeln('Build pages index'); var indexPages = function() { var pagesIndex = []; grunt.file.recurse(CONTENT_PATH_PREFIX, function(abspath, rootdir, subdir, filename) { grunt.verbose.writeln('Parse file:', abspath); var d = processMDFile(abspath, filename); if (d !== undefined) { pagesIndex.push(d); } }); return pagesIndex; }; var processMDFile = function(abspath, filename) { var content = matter(grunt.file.read(abspath)); if (content.data.draft) { // don't index draft posts return; } return { title: content.data.title, categories: content.data.categories, href: content.data.href, content: S(content.content).trim().stripTags().stripPunctuation().s }; }; grunt.file.write('static/js/search/index.json', JSON.stringify(indexPages())); grunt.log.ok('Index built'); }); }; To run this task, simply run grunt search-index in the directory where Gruntfile.js is located7. This will generate a JSON index file looking like this:\n[ { \u0026quot;content\u0026quot;: \u0026quot;Hi My name is Bart de Goede and ...\u0026quot;, \u0026quot;href\u0026quot;: \u0026quot;about\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;About\u0026quot; }, { \u0026quot;content\u0026quot;: \u0026quot;Like many software engineers, I figured I needed a blog of sorts...\u0026quot;, \u0026quot;href\u0026quot;: \u0026quot;Searching-your-hugo-site-with-lunr\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;Searching your Hugo site with Lunr\u0026quot;, \u0026quot;categories\u0026quot;: [ \u0026quot;hugo\u0026quot;, \u0026quot;search\u0026quot;, \u0026quot;lunr\u0026quot;, \u0026quot;javascript\u0026quot; ] }, ... ] Querying Now that we\u0026rsquo;ve built the index, we need a way of obtaining it client-side, and a way to query it. 
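As an aside, if you would rather generate the index without a Node toolchain, the walk-and-parse step of the Grunt task can be sketched in plain Python. The tiny front matter parser below is a stand-in for gray-matter (it only handles flat key: value pairs), and the paths mirror the project layout above:

```python
import json
import os

def parse_front_matter(text):
    # tiny stand-in for gray-matter: expects the file to start with
    # '---', followed by flat `key: value` lines, and a closing '---'
    if not text.startswith('---'):
        return {}, text
    _, raw, body = text.split('---', 2)
    data = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(':')
        data[key.strip()] = value.strip()
    return data, body

def build_index(content_dir):
    pages = []
    for root, _dirs, files in os.walk(content_dir):
        for name in sorted(files):
            if not name.endswith('.md'):
                continue
            with open(os.path.join(root, name)) as f:
                data, body = parse_front_matter(f.read())
            if data.get('draft') == 'true':
                continue  # skip drafts, like the Grunt task does
            pages.append({
                'title': data.get('title', ''),
                'href': data.get('href', ''),
                'content': body.strip(),
            })
    return pages

# write the JSON file the client-side code will fetch, e.g.:
# json.dump(build_index('content'), open('static/js/search/index.json', 'w'))
```

Same idea, same output shape; you would then run this script in your deploy step instead of grunt search-index.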
To do that, I have two partials that include the markup for the search input box and the links to the relevant Javascript:\n\u0026lt;script type=\u0026quot;text/javascript\u0026quot; src=\u0026quot;\u0026quot;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;script type=\u0026quot;text/javascript\u0026quot; src=\u0026quot;js/vendor/lunr.min.js\u0026quot;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;script type=\u0026quot;text/javascript\u0026quot; src=\u0026quot;js/search/search.js\u0026quot;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;!-- js/search/search.js contains the code that downloads and initialises the index --\u0026gt; ... \u0026lt;input type=\u0026quot;text\u0026quot; id=\u0026quot;search\u0026quot;\u0026gt; For my blog, I have one search.js file that will download the index file, initialise the UI, and run the searches. For the sake of readability, I\u0026rsquo;ve split up the relevant functions below and added some comments to the code.\nThis function fetches the index file we\u0026rsquo;ve generated with the Grunt task, initialises the relevant fields, and then adds each of the documents to the index. The pagesIndex variable will store the documents as we indexed them, and the searchIndex variable will store the statistics and data structures we need to efficiently rank our documents for a query.\nfunction initSearchIndex() { // js/search/index.json is the file built by the Grunt task $.getJSON('js/search/index.json') .done(function(documents) { pagesIndex = documents; searchIndex = lunr(function() { this.field('title'); this.field('categories'); this.field('content'); this.ref('href'); // This will add all the documents to the index. 
This is // different compared to older versions of Lunr, where // documents could be added after index initialisation for (var i = 0; i \u0026lt; documents.length; ++i) { this.add(documents[i]) } }); }) .fail(function(jqxhr, textStatus, error) { var err = textStatus + ', ' + error; console.error('Error getting index file:', err); } ); } initSearchIndex(); Then, we need to sprinkle some jQuery magic on the input box. In my case, I want to start searching once a user has typed at least two characters, and support a typeahead style of searching, so every time a character is entered, I want to empty the current search results (if any), run the searchSite function with whatever is in the input box, and render the results.\nfunction initUI() { $results = $('.posts'); // or whatever element is supposed to hold your results $('#search').keyup(function() { $results.empty(); // only search when query has 2 characters or more var query = $(this).val(); if (query.length \u0026lt; 2) { return; } var results = searchSite(query); renderResults(results); }); } $(document).ready(function() { initUI(); }); The searchSite function will take the query_string the user typed in, build a lunr.Query object and run it against the index (stored in the searchIndex variable). The lunr index will return a ranked list of refs (these are the identifiers we assigned to the documents in the Gruntfile). The second part of this method maps these identifiers to the original documents we stored in the pagesIndex variable.\n// this function will parse the query_string, which allows you // to run queries like \u0026quot;title:lunr\u0026quot; (search the title field), // \u0026quot;lunr^10\u0026quot; (boost hits with this term by a factor 10) or // \u0026quot;lunr~2\u0026quot; (will match anything within an edit distance of 2, // i.e. \u0026quot;losr\u0026quot; will also match) function simpleSearchSite(query_string) { return searchIndex.search(query_string).map(function(result) { return pagesIndex.filter(function(page) { return page.href === result.ref; })[0]; }); } // I want a typeahead search, so if a user types a query like // \u0026quot;pyth\u0026quot;, it should show results that contain the word \u0026quot;Python\u0026quot;, // rather than only matching the entire word. function searchSite(query_string) { return searchIndex.query(function(q) { // look for an exact match and give that a massive positive boost q.term(query_string, { usePipeline: true, boost: 100 }); // prefix matches should not use stemming, and get a lower positive boost q.term(query_string, { usePipeline: false, boost: 10, wildcard: lunr.Query.wildcard.TRAILING }); }).map(function(result) { return pagesIndex.filter(function(page) { return page.href === result.ref; })[0]; }); } The snippet above lists two functions. The first shows an example of a search using the default lunr.Index#search method, which uses the lunr query syntax.\nIn my case, I want to support a typeahead search, where we show the user results for partial queries too; if the user types \u0026quot;pyth\u0026quot;, we should display results that have the word \u0026quot;python\u0026quot; in the post. To do that, we tell Lunr to combine two queries: the first q.term provides exact matches with a high boost to relevance (because it\u0026rsquo;s likely that these matches are relevant to the user), the second appends a trailing wildcard to the query8, providing prefix matches with a (lower) boost.\nFinally, given the ranked list of results (containing all pages in the content/ directory), we want to render those somewhere on the page. 
The renderResults method slices the result list to the first ten results, creates a link to the appropriate post based on the href, and creates a (crude) snippet based on the first 100 characters of the content.\nfunction renderResults(results) { if (!results.length) { return; } results.slice(0, 10).forEach(function(hit) { var $result = $('\u0026lt;li\u0026gt;'); $result.append($('\u0026lt;a\u0026gt;', { href: hit.href, text: '» ' + hit.title })); $result.append($('\u0026lt;p/\u0026gt;', { text: hit.content.slice(0, 100) + '...' })); $results.append($result); }); } This is a pretty naive approach to introducing full-text search to a static site (I use Hugo, but this will work with static site generators like Jekyll or Hyde too); it completely ignores languages other than English (though Lunr does have support for other languages), let alone non-whitespace languages like Chinese, and it requires users to download the full index that contains all your searchable pages, so it won\u0026rsquo;t scale as nicely if you have thousands of pages. 
For my personal blog though, it\u0026rsquo;s good enough 😇.\n It\u0026rsquo;s fast, it\u0026rsquo;s written in Golang, it supports fancy themes, and it\u0026rsquo;s open source!\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n You can make money off these ads, but the question is whether you want to show ads on your personal blog or not.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n I\u0026rsquo;m assuming that the audience that\u0026rsquo;ll land on these pages will have Javascript enabled in their browser 😄\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n In this case, I\u0026rsquo;m totally assuming that matches in the title or the manually assigned tags of a post are way more relevant than matches in the content of a post, if only because there are a lot more words in post content, so there\u0026rsquo;s a higher probability of matching any word in the query.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n It\u0026rsquo;s also on GitHub.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n A port of this script to Golang is in the works.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n The idea is to run the task before you deploy the latest version of your site. In my case, I have a script that runs Hugo to build my static pages, runs grunt search-index and pushes the result to GitHub.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n Lunr uses tries to represent terms internally, giving us an efficient way of doing fast prefix lookups.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n "