Skip to content
Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Цитатник Леся Подерв'янського


The site is implemented as a simple Flask-based web app. Core components are the actual app code (v2/app/) and an engine that is responsible for matching (v2/app/ Run-time data is stored in an SQLite DB and is never modified.

Previous (v1) implementation is a PHP script in the v1 directory.

Search implementation

Since the corpus is relatively small and we require fuzzy matching, searching is done by matching the query against all phrases (properly cleaned up and tokenized) and selecting those with the highest match score.

At the bottom level, we use a concept of a word difference score, which measures the difference between two words. We define it as follows:


where L is Levenshtein edit distance and l is length of each word. The reason for scaling by the longer word length is to reflect the fact that for example, an edit distance of 2 between two 4-letter words is not the same as an edit distance of 2 in 15-letter ones.

Defined this way, the difference score is a real number between 0 and 1, where it is 0 for two identical words (or, strictly speaking, an insignificant difference between two infinite-length words).

With this score defined, we calculate the phrase similarity score as follows:


where Nq is number of unmatched words in the query and Np is number of unmatched words in candidate quote. The rationale for such definition is given below:

  • The more unmatched words in the query the less good the match is, since the user obvious considred those words important.

  • On the other hand, a candidate quote in most cases has many more words than a query given (since the query will only have a handful of most important words). Therefore we don't want the number of unmatched words in the candidate quote to affect the score in a major way. We use it scaled down by 1000 to break a tie when two quotes have a good match to query words. In this case a shorter quote would score slightly higher.

The similarity score defined this way is a real number from 0 to 1 which is 1 for nearly exact match between two phrases.

The algorithm is tuned using two parameters (both defined in FuzzyMatchingEngine class):

  • WORD_SIMILARITY_THRESHOLD: cut-off value for difference score that determines whether two words are similar or different.

  • PHRASE_SIMILARITY_THRESHOLD: we only report matches that have a similarity score higher than this value.

Source data

As a source data we used the text of plays from . Most of the texts are HTML files (with little document structure), so the parser does its best trying to make sense from it. A couple of playes were available eslewhere as text documents, so the parser is capable of handling them as well.

Although we tried to avoid modifying the source document, in several cases it was unavoidable, so for now one is advised to use the documents stored within this project instead of re-downloading the files.



  • Web hosting with Python/WSGI support
  • Python
  • Korn shell (required for database creation)
  • SQLite3
  • Flask

Deployment procedure

  • Build the database
cd tools

This step builds the database in data/habib.db.

  • Deploy the database file.
  • Deploy the app in a manner prescribed by your provider.
  • Copy v2/app/app-template.cfg to v2/app/app.cfg and modify the latter file to accommodate your setup.


Цитатник Леся Подерв'янського






No packages published