Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
38 lines (37 sloc) 46.4 KB
[
{
"href": "https://bart.degoe.de/use-hugo-output-formats-to-generate-lunr-index-files/",
"title": "Use Hugo Output Formats to generate Lunr index files for your static site search",
"categories": ["hugo", "search", "lunr", "how-to"],
"content": "I\u0026rsquo;ve been using Lunr.js to enable some basic site search on this blog. Lunr.js requires an index file that contains all the content you want to make available for search. In order to generate that file, I had a kind of hacky setup, depending on running a Grunt script on every deploy, which introduces a dependency on node, and nobody really wants any of that for just a static HTML website.\nI have been wanting forever to have Hugo build that file for me instead1. As it turns out, Output Formats2 make building that index file very easy. Output formats let you generate your content in other formats than HTML, such as AMP or XML for an RSS feed, and it also speaks JSON.\nThe search on my blog lives on the homepage, where some (very ugly) Javascript downloads the index file, parses it contents into an inverted index, and replaces the content on the page with search results whenever someone starts typing. Essentially, I want to create some JSON output on my homepage (index.json instead of index.html).\nI added the following snippet to my config.toml, that says that besides HTML, the homepage also has JSON output:\n[outputs] home = [\u0026quot;HTML\u0026quot;, \u0026quot;JSON\u0026quot;] page = [\u0026quot;HTML\u0026quot;] N.B.: this means that there won\u0026rsquo;t be a JSON version of the other pages; I just need it on my homepage, because that serves as the search results page too.\nNow, I don\u0026rsquo;t want that index.json file to basically be the list of links it is in the HTML version and in the RSS feed, so I added an index.json file in my layouts folder with the following content:\n[ {{ range $index, $page := .Site.Pages }} {{- if eq $page.Type \u0026quot;post\u0026quot; -}} {{- if $page.Plain -}} {{- if and $index (gt $index 0) -}},{{- end }} { \u0026quot;href\u0026quot;: \u0026quot;{{ $page.Permalink }}\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;{{ htmlEscape $page.Title }}\u0026quot;, \u0026quot;categories\u0026quot;: [{{ range $tindex, $tag := $page.Params.categories }}{{ if $tindex }}, {{ end }}\u0026quot;{{ $tag| htmlEscape }}\u0026quot;{{ end }}], \u0026quot;content\u0026quot;: {{$page.Plain | jsonify}} } {{- end -}} {{- end -}} {{- end -}} ] This will render a JSON file (named index.json) with an array in the root directory of my site, and every item in that array is one of the .Site.Pages (i.e. my posts), whenever that page has text in it and it\u0026rsquo;s not the homepage. I didn\u0026rsquo;t bother with minification, because the file is tiny and will be served nicely gzipped by Cloudflare anyway. Whenever Hugo builds the site, it will reindex all the data (i.e. rebuild this file), and I don\u0026rsquo;t have a dependency on Node and Grunt scripts anymore.\n Ever since someone opened a GitHub issue about it 😄 [return] Ships with Hugo version 0.20.0 or greater. [return] "
},
{
"href": "https://bart.degoe.de/tab-plus-search-from-your-url-bar-with-opensearch/",
"title": "Custom OpenSearch: search from your URL bar",
"categories": ["search", "opensearch", "how-to"],
"content": "Almost all modern browsers enable websites to customize the built-in search feature to let the user access their search features directly, without going to your website first and finding the search input box. If your website has search functionality accessible through a basic GET request, it\u0026rsquo;s surprisingly simple to enable this for your website too.\n Typing \u0026#39;bart\u0026#39; and hitting tab in my Chrome browser lets me search the website directly. Some browsers do it automatically If your users are on Chrome, chances are this already works! Chromium tries really hard to figure out where your search page is and how to access it. A strong hint you can give it is to change the type of the \u0026lt;input\u0026gt; element to \u0026quot;search\u0026quot;1:\n\u0026lt;input autocapitalize=\u0026quot;off\u0026quot; autocorrect=\u0026quot;off\u0026quot; autocomplete=\u0026quot;off\u0026quot; name=\u0026quot;q\u0026quot; placeholder=\u0026quot;Search\u0026quot; type=\u0026quot;search\u0026quot;\u0026gt; The \u0026quot;name\u0026quot; attribute gives the browser a hint as to what HTTP parameter will hold the query (it is a good idea to configure your Google Analytics to pick this up as well!).\nThis will let the browser add some nice UI elements to the search input box, like a small \u0026ldquo;x\u0026rdquo; button on the right to clear the search input in Safari and Chrome. Enabling the \u0026quot;autocapitalize\u0026quot;, \u0026quot;autocorrect\u0026quot; and \u0026quot;autocomplete\u0026quot; attributes will instruct your browser to modify and correct the user input even further (think of the iOS autocorrect feature, for example).\n Just by changing the input type you can hook in to the browsers\u0026#39; native UX. Word of warning Because once upon a time apple.com relied on the type attribute to give their search box a more \u0026ldquo;Mac-like\u0026rdquo; feel, Safari will basically ignore any CSS applied to \u0026lt;input type=\u0026quot;search\u0026quot;\u0026gt; elements. If you need Safari to treat your search field like any other input field for display purposes, you can add the following to your CSS:\ninput[type=\u0026quot;search\u0026quot;] { -webkit-appearance: textfield; } This will let you apply your own styles to the input box.\nOthers don\u0026rsquo;t Not all browsers do this out of the box, so you need to provide them with a more formalized configuration. Most browsers find out about the search functionality of a website through an OpenSearch XML file that directs them to the right page.\nOpenSearch OpenSearch is a standard that was developed by A9, an Amazon subsidiary developing search engine and search advertising technology, and has been around since Jeff Bezos unveiled it in 2005 at a conference on emerging technologies.\nIt is nothing more than an XML specification that lets a website describe a search engine for itself, and where a user or browser might find and use it. Firefox, Chrome, Edge, Internet Explorer and Safari all support the OpenSearch standard, with Firefox even supporting features that are not in the standard, such as search suggestions.\nXML All you need is a small XML file. Below is an example of the one we have at work:\n\u0026lt;OpenSearchDescription xmlns=\u0026quot;http://a9.com/-/spec/opensearch/1.1/\u0026quot; xmlns:moz=\u0026quot;http://www.mozilla.org/2006/browser/search/\u0026quot;\u0026gt; \u0026lt;ShortName\u0026gt;Scribd.com\u0026lt;/ShortName\u0026gt; \u0026lt;Description\u0026gt;Scribd's mission is to create the world's largest open library of documents. Search it.\u0026lt;/Description\u0026gt; \u0026lt;Url type=\u0026quot;text/html\u0026quot; method=\u0026quot;get\u0026quot; template=\u0026quot;https://www.scribd.com/search?query={searchTerms}\u0026quot; /\u0026gt; \u0026lt;Image height=\u0026quot;32\u0026quot; width=\u0026quot;32\u0026quot; type=\u0026quot;image/x-icon\u0026quot;\u0026gt;https://www.scribd.com/favicon.ico\u0026lt;/Image\u0026gt; \u0026lt;/OpenSearchDescription\u0026gt; It provides a \u0026lt;ShortName\u0026gt; (there\u0026rsquo;s a \u0026lt;LongName\u0026gt; element too, that\u0026rsquo;s mostly used for aggregators or automatically generated search plugins), a \u0026lt;Description\u0026gt; of what the search will let you do, and most importantly, the \u0026lt;Url\u0026gt; where you can do it.\nIt tells the browser there\u0026rsquo;s a text/html page that can process an HTTP GET request, and has a template for the browser. {searchTerms} will be interpolated with the query terms the user will type in the browser. You need to host this file somewhere with the rest of your web pages.\nBut what if you don\u0026rsquo;t have a dedicated search engine for your website? Well, just use Google! Replace the value of the \u0026quot;template\u0026quot; attribute with something like this2:\n\u0026lt;Url type=\u0026quot;text/html\u0026quot; method=\u0026quot;get\u0026quot; template=\u0026quot;https://www.google.com/search?q=site:bart.degoe.de {searchTerms}\u0026quot;\u0026gt; This will redirect your user to the Google search results, but those will only display matches from content on your site. That\u0026rsquo;s a lot cheaper than employing a bunch of engineers to build and maintain a custom search engine!\nTurn on autodiscovery! Now we need to activate the automatic discovery of search engines in the browsers of your users. That sounds a lot cooler and more complicated than it actually is; the only thing you have to do is provide a \u0026lt;link\u0026gt; somewhere in the \u0026lt;head\u0026gt; of your webpages:\n\u0026lt;link rel=\u0026quot;search\u0026quot; href=\u0026quot;https://bart.degoe.de/opensearch.xml\u0026quot; type=\u0026quot;application/opensearchdescription+xml\u0026quot; title=\u0026quot;Search bart.degoe.de\u0026quot;\u0026gt; This will alert browsers that load the page that there is a search feature available, described in the linked XML file. Make sure your OpenSearch XML file is available and can be loaded from your webserver, and refresh the page containing the \u0026lt;link\u0026gt;. This will tell the browser where to look, and enable custom search!\n Now tab-searching from the Safari URL bar works too! The OpenSearch specification supports a lot more features than this, ranging from \u0026lt;Tags\u0026gt; to help plugins generated from these standardized descriptions be found better in search plugin aggregators, what \u0026lt;Language\u0026gt; the search engine supports, or whether the search results may contain \u0026lt;AdultContent\u0026gt;. There are many ways to configure and customize OpenSearch that go way beyond the basic example described here, but for my little blog this is more than enough 😄.\n The other attributes are to dis-/enable features certain other browsers like Safari have that automatically correct what you type into the search box. [return] Yes, you could absolutely point your search input to my website, but that\u0026rsquo;s not a requirement 😉 [return] "
},
{
"href": "https://bart.degoe.de/github-pages-and-lets-encrypt/",
"title": "Free SSL on Github Pages with a custom domain: Part 2 - Let's Encrypt",
"categories": ["ssl", "hugo", "how-to", "gh-pages", "https", "lets-encrypt"],
"content": "GitHub Pages has just become even more awesome. Since yesterday1, GitHub Pages supports HTTPS for custom domains. And yes, it is still free!\nLet\u0026rsquo;s Encrypt GitHub has partnered with Let\u0026rsquo;s Encrypt, which is a free, open and automated certificate authority (CA). It is run by the Internet Security Research Group (ISRG), which is a public benefit corporation2 funded by donations and a bunch of large corporations and non-profits.\nThe goal of this initiative is to secure the web by making it very easy to obtain a free, trusted SSL certificate. Moreover, it lets web servers run a piece of software that not only gets a valid SSL certificate, but will also configure your web server and automatically renew the certificate when it expires.\nHow does it do that? It works by running a bit of software on your web server, a certificate management agent. This agent software has two tasks: it proves to the Let\u0026rsquo;s Encrypt certificate authority that it controls the domain, and it requests, renews and revokes certificates for the domain it controls.\nValidating a domain Similar to a traditional process of obtaining a certificate for a domain, where you create an account with the CA and add domains you control, the certificate management agent needs to perform a test to prove that it controls the domain.\nThe agent will ask the Let\u0026rsquo;s Encrypt CA what it needs to do to prove that it is, effectively, in control of the domain. The CA will look at the domain, and issue one or more challenges to the agent it needs to complete to prove that it has control over the domain. For example, it can ask the agent to provision a particular DNS record under the domain, or make an HTTP resource available under a particular URL. With these challenges, it provides the agent with a nonce (some random number that can only be used once for verification purposes).\n CA issuing a challenge to the certificate management agent (image taken from https://letsencrypt.org/how-it-works/) In the image above, the agent creates a file on a specified path on the web server (in this case, on https://example.com/8303). It creates a key pair it will use to identify itself with the CA, and signs the nonce received from the CA with the private key. Then, it notifies the CA that it has completed the challenge by sending back the signed nonce and is ready for validation. The CA then validates the completion of the challenge by attempting to download the file from the web server and verify that it contains the expected content.\n Certificate management agent completing a challenge (image taken from https://letsencrypt.org/how-it-works/) If the signed nonce is valid, and the challenge is completed successfully, the agent identified by the public key is officially authorized to manage valid SSL certificates for the domain.\nCertificate management So, what does that mean? By having validated the agent by its public key, the CA can now validate that messages sent to the CA are actually sent by the certificate management agent.\nIt can send a Certificate Signing Request (CSR) to the CA to request it to issue a SSL certificate for the domain, signed with the authorized key. Let\u0026rsquo;s Encrypt will only have to validate the signatures, and if those check out, a certificate will be issued.\n Issuing a certificate (image taken from https://letsencrypt.org/how-it-works/) Let\u0026rsquo;s Encrypt will add the certificate to the appropriate channels, so that browsers will know that the CA has validated the certificate, and will display that coveted green lock to your users!\nSo, GitHub Pages Right, that\u0026rsquo;s how we got started. The awesome thing about Let\u0026rsquo;s Encrypt is that it is automated, so all this handshaking and verifying happens behind the scenes, without you having to be involved.\nIn the previous post we saw how to set up a CNAME file for your custom domain. That\u0026rsquo;s it. Done. Works out of the box.\nOptionally, you can enforce HTTPS in the settings of your repository. This will upgrade all users requesting stuff from your site over HTTP to be automatically redirected to HTTPS.\n If you use A records to route traffic to your website, you need to update your DNS settings at your registrar. These IP addresses are new, and have an added benefit of putting your static site behind a CDN (just like we did with Cloudflare in the previous post).\nSSL all the things Let\u0026rsquo;s Encrypt makes securing the web easy. More and more websites are served over HTTPS only, so it is getting increasingly difficult for script kiddies to sniff your web traffic on free WiFi networks. Moreover, they provide this service world-wide, to anyone, for free. Help them help you (and the rest of the world), and buy them a coffee!\n At time of writing, yesterday is May 1, 2018. [return] One in California, to be specific. [return] "
},
{
"href": "https://bart.degoe.de/free-ssl-on-github-pages-with-a-custom-domain/",
"title": "Free SSL with a custom domain on GitHub Pages",
"categories": ["ssl", "hugo", "how-to", "gh-pages", "https"],
"content": "GitHub Pages is pretty awesome. It lets you push a bunch of static HTML (and/or CSS and Javascript) to a GitHub repository, and they\u0026rsquo;ll host and serve it for you. For free!\nYou basically set up a specific repository (you have to name it \u0026lt;your_username\u0026gt;.github.io), you push your HTML there, and they will be available at https://\u0026lt;your_username\u0026gt;.github.io. Did I mention that this is free?\nWhile you can perfectly write and push HTML files straight to your GitHub repository, there\u0026rsquo;s a whole bunch of open source static site generators available that provide a structured way of organising content, in formats (Markdown 🙌) that are easier to work with1. GitHub even supports one of them (Jekyll) out of the box, so you can just push your project as is and they\u0026rsquo;ll take care of building of your HTML too2.\nYou can even set up your own custom domain! Register your domain at your favourite registrar, and change a setting for your repository:\nThere, you fill out the custom domain you want your site to be available at (in my case that\u0026rsquo;s bart.degoe.de).\nBefore you rush off to your registrar to point your domain (or subdomain, in my case3), make sure you add a CNAME file to the root of your repository. The CNAME file should contain the URL your website should be displaying in the browser (this is important for redirects). In my case, the file contains bart.degoe.de, because that\u0026rsquo;s the URL I want my site to be published under.\nSetting up CloudFlare and SSL Then, all you need to do is add a CNAME entry to your domain settings settings. Right? Well, yes and no. Yes, setting up a CNAME DNS record will get your website working under the proper URL (it might take a while for the DNS change to propagate).\nHowever, serving your static files from GitHub under your own domain name does pose a problem; GitHub Pages only supports SSL for the github.io domain, not for custom domains (they have a wildcard certificate for their own domain, but supporting HTTPS on custom domains is not trivial4).\nThat means that your website can\u0026rsquo;t take advantage of HTTP/2 speedups, it will have negative impact on your Google ranking, Chrome will show your visitors that your website is not secure and even for your static site with fancy Javascript features you do want to protect your users when they\u0026rsquo;re reading your posts on unsecured Wi-Fi networks.\nCloudFlare Fortunately, there\u0026rsquo;s a way to get this coveted green secure lock on your static website. CloudFlare5 provides the (free) feature \u0026ldquo;Universal SSL\u0026rdquo; that will allow your users to access your website over SSL. Sign up for a free account, and enter the (non-SSL-ized) domain name of your website in their scanning tool:\nCloudFlare will fetch your current DNS configuration, and will provide you with instructions on how to enable CloudFlare for your (sub-)domain(s). The idea is that CloudFlare will act as a proxy between your GitHub hosted site and the user. This will allow them to encrypt traffic between their servers and your users (the traffic between GitHub and CloudFlare is also encrypted, but doesn\u0026rsquo;t require you to install an SSL certificate on the GitHub servers; added bonus is that they can cache your content on servers close to your visitors increasing the page speed of your website).\nEnable CloudFlare for the (sub)domain you\u0026rsquo;re hosting your website on:\nEnabling SSL CloudFlare\u0026rsquo;s Universal SSL lets you provide your website\u0026rsquo;s users with a valid signed SSL certificate. There\u0026rsquo;s several configuration options for Universal SSL (find it in the \u0026ldquo;Crypto\u0026rdquo; tab), and make sure your SSL mode is set to Full SSL (but not Full SSL (Strict)!).\nDo note it may take a while (up to 24 hours) for CloudFlare to set you up with your SSL certificates. They will send you an email once they\u0026rsquo;re provisioned and ready to go.\nNext, create a Page Rule. Page rules are, surprisingly, rules that apply to a page or a collection of pages. These rules can do a lot of cool things, such as automatically obfuscating emails on the page, control cache settings or add geolocation information to the requests. The rule you\u0026rsquo;re looking for is \u0026ldquo;Always Use HTTPS\u0026rdquo;, which will enforce all requests for pages matching the URL pattern you provide to use SSL:\nIn my case, I only have one URL for my website. However, if you use the www subdomain (i.e. www.example.com), you might want to add a Page Rule that redirects users that type example.com to www.example.com, where you enforce HTTPS to ensure all users benefit from encrypted requests. However, if you add more Page Rules, make sure that the HTTPS rule is the primary (first) page rule. Only one rule will trigger per URL, so you\u0026rsquo;ll want to make sure that this one is listed first!\nProfit! Right? This article has gotten quite meaty for the steps you have to follow, so if you\u0026rsquo;re looking for a more concise set of steps, this Gist by @cvan is great:\n There\u0026rsquo;s a lot more you can do with CloudFlare and your static site (you could set up caching on CloudFlare\u0026rsquo;s content distribution network, for example), but be aware that even though you\u0026rsquo;ve encrypted your traffic, you should still be careful in submitting sensitive data to (third-party) APIs with Javascript; \u0026ldquo;GitHub Pages sites shouldn\u0026rsquo;t be used for sensitive transactions like sending passwords or credit card numbers\u0026rdquo;. Your website\u0026rsquo;s source code is publicly available in your GitHub repository, so be mindful of any scripts and content you publish there.\n I use Hugo for this website, which is written in Golang (\u0026ldquo;fast\u0026rdquo; and \u0026ldquo;easy\u0026rdquo; are keywords I like). There\u0026rsquo;s a lot of different static site generators out there, each with their own focuses, advantages and trade-offs. [return] In my setup, I have two separate repositories, where I maintain the Hugo project structure in one (the blog repository), and build and push the static files to the other (the bartdegoede.github.io repository). What I like about that is that it gives me a \u0026ldquo;deploy\u0026rdquo; step, so I don\u0026rsquo;t accidentally push something that\u0026rsquo;s not finished yet. [return] Skipping this step took me a lot longer to figure out than I\u0026rsquo;m willing to admit. [return] There\u0026rsquo;s been disscusions about this for a while. [return] CloudFlare is a company that provides a content-delivery network (CDN), DDoS protection services, DNS and a whole slew of other services for websites. [return] "
},
{
"href": "https://bart.degoe.de/bloom-filters-bit-arrays-recommendations-caches-bitcoin/",
"title": "Bloom filters, using bit arrays for recommendations, caches and Bitcoin",
"categories": ["python", "bloom filter", "how-to"],
"content": "Bloom filters are cool. In my experience, it\u0026rsquo;s a somewhat underestimated data structure that sounds more complex than it actually is. In this post I\u0026rsquo;ll go over what they are, how they work (I\u0026rsquo;ve hacked together an interactive example to help visualise what happens behind the scenes) and go over some of their usecases in the wild.\nWhat is a Bloom filter? A Bloom filter is a data structure designed to quickly tell you whether an element is not in a set. What\u0026rsquo;s even nicer, it does so within the memory constraints you specify. It doesn\u0026rsquo;t actually store the data itself, only trimmed down version of it. This gives it the desirable property that it has a constant time complexity1 for both adding a value to the filter and for checking whether a value is present in the filter. The cool part is that this is independent of how many elements already in the filter.\nLike with most things that offer great benefits, there is a trade-off: Bloom filters are probabilistic in nature. On rare occassions, it will respond with yes to the question if the element is in the set (false positives are a possibility), although it will never respond with no if the value is actually present (false negatives can\u0026rsquo;t happen).\nYou can actually control how rare those occassions are, by setting the size of the Bloom filter bit array and the amount of hash functions depending on the amount of elements you expect to add2. Also, note that you can\u0026rsquo;t remove items from a Bloom filter.\nHow does it work? An empty Bloom filter is a bit array of a particular size (let\u0026rsquo;s call that size m) where all the bits are set to 0. In addition, there must be a number (let\u0026rsquo;s call the number k) of hashing functions defined. Each of these functions hashes a value to one of the positions in our array m, distributing the values uniformly over the array.\nWe\u0026rsquo;ll do a very simple Python implementation3 of a Bloom filter. For simplicity\u0026rsquo;s sake, we\u0026rsquo;ll use a bit array4 with 15 bits (m=15) and 3 hashing functions (k=3) for the running example.\nimport mmh3 class Bloomfilter(object): def __init__(self, m=15, k=3): self.m = m self.k = k # we use a list of Booleans to represent our # bit array for simplicity self.bit_array = [False for i in range(self.m)] def add(self, element): ... def check(self, element): ... To add elements to the array, our add method needs to run k hashing functions on the input that each will almost randomly pick an index in our bit array. We\u0026rsquo;ll use the mmh3 library to hash our element, and use the amount of hash functions we want to apply as a seed to give us different hashes for each of them. Finally, we compute the remainder of the hash divided by the size of the bit array to obtain the position we want to set.5\ndef add(self, element): \u0026quot;\u0026quot;\u0026quot; Add an element to the filter. Murmurhash3 gives us hash values distributed uniformly enough we can use different seeds to represent different hash functions \u0026quot;\u0026quot;\u0026quot; for i in range(self.k): # this will give us a number between 0 and m - 1 digest = mmh3.hash(element, i, signed=False) % self.m self.bit_array[digest] = True In our case (m=15 and k=3), we would set the bits at index 1, 7 and 10 to one for the string hello.\nIn [1]: mmh3.hash('hello', 0, signed=False) % 15 Out[1]: 1 In [2]: mmh3.hash('hello', 1, signed=False) % 15 Out[2]: 7 In [3]: mmh3.hash('hello', 2, signed=False) % 15 Out[3]: 10 Now, to determine if an element is in the bloom filter, we apply the same hash functions to the element, and see whether the bits at the resulting indices are all 1. If one of them is not 1, then the element has not been added to the filter (because otherwise we\u0026rsquo;d see a value of 1 for all hash functions!).\ndef check(self, element): \u0026quot;\u0026quot;\u0026quot; To check whether element is in the filter, we hash the element with the same hash functions as the add functions (using the seed). If one of them doesn't occur in our bit_array, the element is not in there (only a value that hashes to all of the same indices we've already seen before). \u0026quot;\u0026quot;\u0026quot; for i in range(self.k): digest = mmh3.hash(element, i, signed=False) % self.m if self.bit_array[digest] == False: # if any of the bits hasn't been set, then it's not in # the filter return False return True You can see how this approach guarantuees that there will be no false negatives, but that there might be false positives; especially in our toy example with the small bit array, the more elements you add to the filter, the more likely it gets that the three bits we hash an element to are set other elements (running one of the hash functions on the string world will also set the bit at index 6 to 1):\nIn [4]: mmh3.hash('world', 0, signed=False) % 15 Out[4]: 7 In [5]: mmh3.hash('world', 1, signed=False) % 15 Out[5]: 4 In [6]: mmh3.hash('world', 2, signed=False) % 15 Out[6]: 9 We can actually compute the probability of our Bloom filter returning a false positive, as it is a function of the number of bits used in the bit array divided by the length of the bit array (m) to the power of hash functions we\u0026rsquo;re using k (we\u0026rsquo;ll leave that for a future post though). The more values we add, the higher the probability of false positives becomes.\nInteractive example To further drive home how Bloom filters work, I\u0026rsquo;ve hacked together a Bloom filter in JavaScript that uses the cells in the table below as a \u0026ldquo;bit array\u0026rdquo; to visualise how adding more values will fill up the filter and increase the probability of a false positive (a full Bloom filter will always return \u0026ldquo;yes\u0026rdquo; for whatever value you throw at it).\n Add Hash value 1: \nHash value 2: \nHash value 3: \nElements in the filter: []\nProbability of false positives: 0% Test In Bloom filter: \nWhat can I use it for? Given that a Bloom filter is really good at telling you whether something is in a set or not, caching is a prime candidate for using a Bloom filter. CDN providers like Akamai6 use it to optimise their disk caches; nearly 75% of the URLs that are accessed in their web caches is accessed only once and then never again. To prevent caching these \u0026ldquo;one-hit wonders\u0026rdquo; and massively saving disk space requirements, Akamai uses a Bloom filter to store all URLs that are accessed. If a URL is found in the Bloom filter, it means it was requested before, and should be stored in their disk cache.\nBlogging platform Medium uses Bloom filters7 to filter out posts that users have already read from their personalised reading lists. They create a Bloom filter for every user, and add every article they read to the filter. When a reading list is generated, they can check the filter whether the user has seen the article. The trade-off for false positives (i.e. an article they haven\u0026rsquo;t read before) is more than acceptable, because in that case the user won\u0026rsquo;t be shown an article that they haven\u0026rsquo;t read yet (so they will never know).\nQuora does something similar to filter out stories users have seen before, and Facebook and LinkedIn use Bloom filters in their typeahead searches (it basically provides a fast and memory-efficient way to filter out documents that can\u0026rsquo;t match on the prefix of the query terms).\nBitcoin relies strongly on a peer-to-peer style of communication, instead of a client-server architecture in the examples above. Every node in the network is a server, and everyone in the network has a copy of everone else\u0026rsquo;s transactions. For big beefy servers in a data center that\u0026rsquo;s fine, but what if you don\u0026rsquo;t necessarily care about all transactions? Think of a mobile wallet application for example, you don\u0026rsquo;t want all transactions on the blockchain, especially when you have to download them on a mobile connection. To address this, Bitcoin has an option called Simplified Payment Verification (SPV) which lets your (mobile) node request only the transactions it\u0026rsquo;s interested in (i.e. payments from or to your wallet address). The SPV client calculates a Bloom filter for the transactions it cares about, so the \u0026ldquo;full node\u0026rdquo; has an efficient way to answer \u0026ldquo;is this client interested in this transation?\u0026rdquo;. The cost of false positives (i.e. a client is actually not interested in a transaction) is minimal, because when the client processes the transactions returned by the full node it can simply discard the ones it doesn\u0026rsquo;t care about.\nClosing thoughts There are a lot more applications for Bloom filters out there, and I can\u0026rsquo;t list them all here. I hope a gave you a whirlwind overview of how Bloom filters work and how they might be useful to you.\nFeel free to drop me a line or comment below if you have nice examples of where they\u0026rsquo;re used, or if you have any feedback, comments, or just want to say hi :-)\n The runtime for both inserting and checking is defined by the number of hash functions (k) we have to execute. So, O(k). Space complexity is more difficult to quantify, because that depends on how many false positives you\u0026rsquo;re willing to tolerate; allocating more space will lower the false positive rate. [return] Going over the math is a bit much for this post, so check Wikipedia for all the formulas 😄. [return] Full implementation on GitHub. [return] Our implementation won\u0026rsquo;t use an actual bit array but a Python list containing Booleans for the sake of readability. [return] Note that there\u0026rsquo;s a slight difference between the Python and Javascript Murmurhash implementation in the libraries I\u0026rsquo;ve used; the Javascript library I used returns a 32 bit unsigned integer, where the Python library returns a 32 bit signed integer by default. To keep the Python example consistent with the Javascript, I opted to use unsigned integers there too; there is no impact for the working of the Bloom filter. [return] Maggs, Bruce M.; Sitaraman, Ramesh K. (July 2015), \u0026ldquo;Algorithmic nuggets in content delivery\u0026rdquo;, SIGCOMM Computer Communication Review, New York, NY, USA: ACM, 45 (3): 52–66, doi:10.1145/2805789.2805800 [return] Read the article. It\u0026rsquo;s really good. [return] "
},
{
"href": "https://bart.degoe.de/searching-your-hugo-site-with-lunr/",
"title": "Searching your Hugo site with Lunr",
"categories": ["hugo", "search", "lunr", "javascript", "how-to"],
"content": "Like many software engineers, I figured I needed a blog of sorts, because it would give me a place for my own notes on \u0026ldquo;How To Do Things™\u0026rdquo;, let me have a URL to give people, and share my ramblings about Life, the Universe and Everything Else with whoever wants to read them.\nBecause I\u0026rsquo;m trying to get more familiar with Go, I opted to use the awesome Hugo1 framework to build myself a static site hosted on Github Pages.\nIn my day job I work on our search engine, so the first thing that I wanted to have was some basic search functionality for all the blog posts I haven\u0026rsquo;t written yet, preferably something that I can mess with is extensible and configurable.\nThere are three options if you want to add search functionality to a static website, each with their pros and cons:\n Third-party service (i.e. Google CSE): There are a bunch of services that provide basic search widgets for your site, such as Google Custom Search Engine (CSE). Those are difficult to customise, break your UI with their Google-styled widgets, and (in some cases) will display ads on your website2. Run a server-side search engine: You can set up a backend that indexes your data and can process the queries your users submit in the search box on your website. The obvious downside is that you throw away all the benefits of having a static site (free hosting, complex infrastructure). Search client-side: Having a static side, it makes sense to move all the user interaction to the client. We depend on the users\u0026rsquo; browser to run Javascript3 and download the searchable data in order to run queries against it, but the upside is that you can control how data is processed and how that data is queried. Fortunately for us, Atwood\u0026rsquo;s Law holds true; there\u0026rsquo;s a full-text search library inspired by Lucene/Solr written in Javascript we can use to implement our search engine: Lunr.js. Relevance When thinking about search, the most important question is what users want to find. This sounds very much like an open door, but you\u0026rsquo;d be surprised how often this gets overlooked; what are we looking for (tweets, products, (the fastest route to) a destination?), who is doing the search (lawyers, software engineers, my mom?), what do we hope to get out of it (money, page views?).\nIn our case, we\u0026rsquo;re searching blog posts that have titles, tags and content (in decreasing order of value to relevance); queries matching titles should be more important than matches in post content4.\nIndexing The project folder for my blog5 looks roughly like this:\nblog/ \u0026lt;= Hugo project root folder |- content/ \u0026lt;- this is where the pages I want to be searchable live |- about.md |- post/ |- 2018-01-01-first-post.md |- 2018-01-15-second-post.md |- ... |- layout/ |- partials/ \u0026lt;- these contain the templates we need for search |- search.html |- search_scripts.html |- static/ |- js/ |- search/ \u0026lt;- Where we generate the index file |- vendor/ |- lunrjs.min.js \u0026lt;- lunrjs library; https://cdnjs.com/libraries/lunr.js/ |- ... |- config.toml |- ... |- Gruntfile.js \u0026lt;- This will build our index |- ... The idea is that we build an index on site generation time, and fetch that file when a user loads the page.\nI use Gruntjs6 to build the index file, and some dependencies that make life a little easier. Install them with npm:\n$ npm install \u0026ndash;save-dev grunt string gray-matter \nThis is my Gruntfile.js that lives in the root of my project. It will walk through the content/ directory and parse all the markdown files it finds. It will parse out title, categories and href (this will be the reference to the post; i.e. the URL of the page we want to point to) from the front matter, and the content from the rest of the post. It also skips posts that are labeled draft, because I don\u0026rsquo;t want the posts I\u0026rsquo;m still working on to already show up in the search results.\nvar matter = require('gray-matter'); var S = require('string'); var CONTENT_PATH_PREFIX = 'content'; module.exports = function(grunt) { grunt.registerTask('search-index', function() { grunt.log.writeln('Build pages index'); var indexPages = function() { var pagesIndex = []; grunt.file.recurse(CONTENT_PATH_PREFIX, function(abspath, rootdir, subdir, filename) { grunt.verbose.writeln('Parse file:', abspath); d = processMDFile(abspath, filename); if (d !== undefined) { pagesIndex.push(d); } }); return pagesIndex; }; var processMDFile = function(abspath, filename) { var content = matter(grunt.file.read(abspath, filename)); if (content.data.draft) { // don't index draft posts return; } var pageIndex; return { title: content.data.title, categories: content.data.categories, href: content.data.slug, content: S(content.content).trim().stripTags().stripPunctuation().s }; }; grunt.file.write('static/js/search/index.json', JSON.stringify(indexPages())); grunt.log.ok('Index built'); }); }; To run this task, simply run grunt search-index in the directory where Gruntfile.js is located7. This will generate a JSON index file looking like this:\n[ { \u0026quot;content\u0026quot;: \u0026quot;Hi My name is Bart de Goede and ...\u0026quot;, \u0026quot;href\u0026quot;: \u0026quot;about\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;About\u0026quot; }, { \u0026quot;content\u0026quot;: \u0026quot;Like many software engineers, I figured I needed a blog of sorts...\u0026quot;, \u0026quot;href\u0026quot;: \u0026quot;Searching-your-hugo-site-with-lunr\u0026quot;, \u0026quot;title\u0026quot;: \u0026quot;Searching your Hugo site with Lunr\u0026quot;, \u0026quot;categories\u0026quot;: [ \u0026quot;hugo\u0026quot;, \u0026quot;search\u0026quot;, \u0026quot;lunr\u0026quot;, \u0026quot;javascript\u0026quot; ] }, ... ] Querying Now we\u0026rsquo;ve built the index, we need a way of obtaining it client-side, and then query it. To do that, I have two partials that include the markup for the search input box and the links to the relevant Javascript:\n\u0026lt;script type=\u0026quot;text/javascript\u0026quot; src=\u0026quot;https://code.jquery.com/jquery-2.1.3.min.js\u0026quot;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;script type=\u0026quot;text/javascript\u0026quot; src=\u0026quot;js/vendor/lunr.min.js\u0026quot;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;script type=\u0026quot;text/javascript\u0026quot; src=\u0026quot;js/search/search.js\u0026quot;\u0026gt;\u0026lt;/script\u0026gt; \u0026lt;!-- js/search/search.js contains the code that downloads and initialises the index --\u0026gt; ... \u0026lt;input type=\u0026quot;text\u0026quot; id=\u0026quot;search\u0026quot;\u0026gt; For my blog, I have one search.js file that will download the index file, initialise the UI, and run the searches. For the sake of readability, I\u0026rsquo;ve split up the relevant functions below and added some comments to the code.\nThis function fetches the index file we\u0026rsquo;ve generated with the Grunt task, initialises the relevant fields, and then adds the each of the documents to the index. The pagesIndex variable will store the documents as we indexed them, and the searchIndex variable will store the statistics and data structures we need to rank our documents for a query efficiently.\nfunction initSearchIndex() { // this file is built by the Grunt task, and $.getJSON('js/search/index.json') .done(function(documents) { pagesIndex = documents; searchIndex = lunr(function() { this.field('title'); this.field('categories'); this.field('content'); this.ref('href'); // This will add all the documents to the index. This is // different compared to older versions of Lunr, where // documents could be added after index initialisation for (var i = 0; i \u0026lt; documents.length; ++i) { this.add(documents[i]) } }); }) .fail(function(jqxhr, textStatus, error) { var err = textStatus + ', ' + error; console.error('Error getting index file:', err); } ); } initSearchIndex(); Then, we need to sprinkle some jQuery magic on the input box. In my case, I want to start searching once a user has typed at least two characters, and support a typeahead style of searching, so everytime a character is entered, I want to empty the current search results (if any), run the searchSite function with whatever is in the input box, and render the results.\nfunction initUI() { $results = $('.posts'); // or whatever element is supposed to hold your results $('#search').keyup(function() { $results.empty(); // only search when query has 2 characters or more var query = $(this).val(); if (query.length \u0026lt; 2) { return; } var results = searchSite(query); renderResults(results); }); } $(document).ready(function() { initUI(); }); The searchSite function will take the query_string the user typed in and build a lunr.Query object and run it against the index (stored in the searchIndex variable). The lunr index will return a ranked list of refs (these are the identifiers we assigned to the documents in the Gruntfile). The second part of this method maps these identifiers to the original documents we stored in the pagesIndex variable.\n// this function will parse the query_string, which will you // to run queries like \u0026quot;title:lunr\u0026quot; (search the title field), // \u0026quot;lunr^10\u0026quot; (boost hits with this term by a factor 10) or // \u0026quot;lunr~2\u0026quot; (will match anything within an edit distance of 2, // i.e. \u0026quot;losr\u0026quot; will also match) function simpleSearchSite(query_string) { return searchIndex.search(query_string).map(function(result) { return pagesIndex.filter(function(page) { return page.href === result.ref; })[0]; }); } // I want a typeahead search, so if a user types a query like // \u0026quot;pyth\u0026quot;, it should show results that contain the word \u0026quot;Python\u0026quot;, // rather than just the entire word. function searchSite(query_string) { return searchIndex.query(function(q) { // look for an exact match and give that a massive positive boost q.term(query_string, { usePipeline: true, boost: 100 }); // prefix matches should not use stemming, and lower positive boost q.term(query_string, { usePipeline: false, boost: 10, wildcard: lunr.Query.wildcard.TRAILING }); }).map(function(result) { return pagesIndex.filter(function(page) { return page.href === result.ref; })[0]; }); } The snippet above lists two methods. The first shows an example of a search using the default lunr.Index#search method, which uses the lunr query syntax.\nIn my case, I want to support a typeahead search, where we show the user results for partial queries too; if the user types \u0026quot;pyth\u0026quot;, we should display results that have the word \u0026quot;python\u0026quot; in the post. To do that, we tell Lunr to combine two queries: the first q.term provides exact matches with a high boost to relevance (because we it\u0026rsquo;s likely that these matches are relevant to the user), the second appends a trailing wildcard to the query8, providing prefix matches with a (lower) boost.\nFinally, given the ranked list of results (containing all pages in the content/ directory), we want to render those somewhere on the page. The renderResults method slices the result list to the first ten results, creates a link to the appropriate post based on the href, and creates a (crude) snippet based on the 100 first characters of the content.\nfunction renderResults(results) { if (!results.length) { return; } results.slice(0, 10).forEach(function(hit) { var $result = $('\u0026lt;li\u0026gt;'); $result.append($('\u0026lt;a\u0026gt;', { href: hit.href, text: '» ' + hit.title })); $result.append($('\u0026lt;p/\u0026gt;', { text: hit.content.slice(0, 100) + '...' })); $results.append($result); }); } This is a pretty naive approach to introducing full-text search to a static site (I use Hugo, but this will work with static site generators like Jekyll or Hyde too); it completely ignores other languages than English (there\u0026rsquo;s support for other languages too), let alone non whitespace languages like Chinese, and it requires users to download the full index that contains all your searchable pages, so it won\u0026rsquo;t scale as nicely if you have thousands of pages. For my personal blog though, it\u0026rsquo;s good enough 😇.\n It\u0026rsquo;s fast, it\u0026rsquo;s written in Golang, it supports fancy themes, and it\u0026rsquo;s open source! [return] You can make money off theses ads, but the question is whether you want to show ads on your personal blog or not. [return] I\u0026rsquo;m assuming that the audience that\u0026rsquo;ll land on these pages will have Javascript enabled in their browser 😄 [return] In this case, I\u0026rsquo;m totally assuming that if words from the query occur in the title or the manually assigned tags of a post are way more relevant than matches in the content of a post, if only because there\u0026rsquo;s a lot more words in post content, so there\u0026rsquo;s a higher probability of matching any word in the query. [return] It\u0026rsquo;s also on GitHub. [return] A port of this script to Golang is in the works. [return] The idea is to run the task before you deploy the latest version of your site. In my case, I have a deploy.sh script that runs Hugo to build my static pages, runs grunt search-index and pushes the result to GitHub. [return] Lunr uses tries to represent terms internally, giving us an efficient way of doing fast prefix lookups. [return] "
}]
You can’t perform that action at this time.