A proof-of-concept decoder for the CAPTCHAs used by the Information System of Masaryk University.
On 12 April 2014 the staff of the Information System of Masaryk University (IS) introduced a new anti-scraping policy to prevent user-created programs from overloading the IS with excessive automatic requests (students would often use automatic page reloaders to increase their chances of getting a preferred seminar group or exam date during time competitions, and leave them running for too long). The policy maintains per-user counters for separate parts of the IS; each counter increases with every web request/operation performed by the user and decreases by a specific amount each minute. If one or more counters exceed their limits, all of the user's further requests are redirected to a page where they can reset all counters by decoding a simple CAPTCHA.
This project demonstrates the weakness of this type of CAPTCHA by providing a simple utility written in Python 2.7 (using the numpy and scikit-image libraries) that is able to automatically recognize the symbols in the CAPTCHA.
DISCLAIMER: While I do use certain automated tools myself, I do not intend to and do not need to circumvent the imposed limits by means of automated CAPTCHA decoding. I respect the new anti-scraping policy and I think it leaves sufficient capacity for responsibly written tools designed for legitimate purposes, such as gaining advantage in seminar group enrollment. I wrote this software merely for fun and to show how easy it is to crack this kind of CAPTCHA.
The CAPTCHAs can be generated by refreshing this link (it is conveniently accessible without authentication :) ). Each CAPTCHA is a 6-bit grayscale image, 20 pixels high and about 100 pixels wide. The background is completely white (#ffffff), covered with glyphs and noise, both of a (relatively) random non-white color for each pixel. Glyphs are 17 pixels high, with widths varying from glyph to glyph. There seem to be only 23 different glyphs in use, corresponding to the characters "3467ABDEHIJKLMNPRTUVWXY". The noise is randomly distributed, forming mostly one-pixel, rarely larger, clusters.
First, you need to generate patterns for the different glyphs from a random selection of CAPTCHA samples (50-100 samples should be enough). These must be placed in a directory on the filesystem. A pre-generated set of glyph patterns is included in the repository (the glyphs directory).
# Download the samples (in bash):
mkdir ./samples
for (( i = 0; i < 100; i++ )); do
    wget -O "$(printf './samples/%06d.gif' $i)" https://is.muni.cz/system/vstkod.pl
    sleep 0.5s
done
$ python captcha_breaker.py -d ./glyphs categorize --max-samples 100 ./samples
Categorizing '000000.gif'... OK!
Categorizing '000001.gif'... OK!
Categorizing '000002.gif'... OK!
Categorizing '000003.gif'... OK!
...
Then you need to assign the corresponding names (characters) to the glyphs. The rename subcommand will display each glyph and ask you to enter its new name. Some glyphs may appear multiple times - just give them the same name and the duplicates will be removed automatically. If you enter an empty name, the glyph will be removed. You can also rename glyphs manually by renaming the subdirectories in the glyph data directory (./glyphs in our example) - see the pattern_image.png file in each subdirectory.
$ python captcha_breaker.py -d ./glyphs rename
Enter new name for {__unnamed_0}: A
Enter new name for {__unnamed_1}:
Deleting glyph pattern...
Enter new name for {__unnamed_2}: B
...
Now you can run the recognize subcommand to decode a CAPTCHA:
$ python captcha_breaker.py -d ./glyphs recognize --show https://is.muni.cz/system/vstkod.pl
ABCDEF
The --show flag tells the program to display the image that it is decoding.
So far I haven't found a single sample that my program gets wrong, so the success rate is at least close to 100%. If you happen to find a sample that is recognized incorrectly, please contact me ;)
Before a CAPTCHA image is processed, it is converted into a 1-bit bitmap (all white pixels remain white, the rest becomes black).
Then it undergoes a simple transformation which aims to remove the random noise sprayed across the image (see function flatten in captcha.py). The transformation iteratively performs the following:
- From each row of the bitmap, keep only the sequences of black pixels longer than a specified limit (the "magic" number for this CAPTCHA is 2; it can be changed with the --level option).
- Same as above, but by columns instead of rows.
- Take the pixel-wise logical conjunction (AND) of the outputs from steps 1 and 2 as the new bitmap.
- If the new bitmap is different from the previous one, go back to step 1. Otherwise, the new bitmap is the final result of the transformation.
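The steps above can be sketched in Python with numpy (which the project already uses). This is only an illustration of the description, not the actual flatten code from captcha.py; the function names binarize, keep_long_runs and flatten are mine:

```python
import numpy as np

def binarize(image):
    """Convert a grayscale image (values in 0.0-1.0) to a boolean bitmap.

    True marks a black pixel; only pure-white pixels stay False, matching
    the "all white pixels remain white, the rest becomes black" rule.
    """
    return image < 1.0

def keep_long_runs(bitmap, limit):
    """Keep only horizontal runs of True pixels longer than `limit`."""
    result = np.zeros_like(bitmap)
    for y, row in enumerate(bitmap):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start > limit:          # run survives only if long enough
                    result[y, start:x] = True
            else:
                x += 1
    return result

def flatten(bitmap, limit=2):
    """Iterate the row/column run filter until the bitmap stops changing."""
    while True:
        rows = keep_long_runs(bitmap, limit)         # step 1: by rows
        cols = keep_long_runs(bitmap.T, limit).T     # step 2: by columns
        new = rows & cols                            # step 3: pixel-wise AND
        if np.array_equal(new, bitmap):              # step 4: fixed point reached
            return new
        bitmap = new
```

With the default limit of 2, one- and two-pixel noise clusters fail both the row and the column pass and disappear, while solid glyph areas survive.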
First, the bitmap is split into 'spots' (separate contiguous black areas) using a flood-fill algorithm. Excess white around each spot is cropped away. Note that glyphs may sometimes touch each other, so a single spot may contain more than one glyph (these cases are dealt with later).
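A minimal sketch of the spot extraction might look like the following. It uses a plain stack-based flood fill over 4-connected pixels; the function name extract_spots and the (x offset, cropped spot) return format are illustrative, not the program's actual interface:

```python
import numpy as np

def extract_spots(bitmap):
    """Split a boolean bitmap into cropped, contiguous black areas ('spots').

    Returns a list of (leftmost x coordinate, cropped boolean spot) pairs,
    keeping the x offset so the spots can later be ordered left to right.
    """
    height, width = bitmap.shape
    visited = np.zeros_like(bitmap)
    spots = []
    for sy in range(height):
        for sx in range(width):
            if bitmap[sy, sx] and not visited[sy, sx]:
                # Flood-fill one contiguous black area.
                stack = [(sy, sx)]
                visited[sy, sx] = True
                pixels = []
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                # Crop away the excess white around the spot.
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                spot = np.zeros((max(ys) - min(ys) + 1,
                                 max(xs) - min(xs) + 1), dtype=bool)
                for y, x in pixels:
                    spot[y - min(ys), x - min(xs)] = True
                spots.append((min(xs), spot))
    return spots
```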
For each spot extracted from the bitmap, the following is done:
- A new category is created with the spot as the glyph pattern.
- The new category is compared to the categories already stored in the database. The comparison is done by computing the average pixel-wise absolute difference between the pattern and the tested spot. If the compared images are of different sizes, the comparison is done in all possible overlap positions.
- If the average difference falls below a specified threshold (0.075 works fine; it can be changed with the --threshold option) at some position, it is considered a match - the categories are combined together (both patterns are cropped to the overlapping part), the original category is removed from the database, the combined category is used as the new category, and matching starts over (back to step 2).
- If no match was found, the new category is added to the database.
Note that when a "double glyph" pattern gets matched with a spot containing just one glyph, the extra glyph is just discarded. This would be a problem if some glyphs only appeared fused with another glyph, but this doesn't happen in a large enough sample selection.
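The comparison used in step 2 can be sketched as below. This simplified version slides the smaller image over the larger one only at positions where it fits entirely inside, rather than over all partial-overlap positions; match_score is an illustrative name, not the function in the actual code:

```python
import numpy as np

def match_score(pattern, spot):
    """Mean pixel-wise absolute difference between two boolean images,
    minimized over overlap positions.

    Returns (best difference, (x, y) offset of the smaller image inside
    the larger one), or None if neither image contains the other.
    """
    a, b = pattern, spot
    if a.size < b.size:
        a, b = b, a                      # make `a` the larger image
    ah, aw = a.shape
    bh, bw = b.shape
    if bh > ah or bw > aw:
        return None                      # neither image fits inside the other
    best = None
    for y in range(ah - bh + 1):
        for x in range(aw - bw + 1):
            window = a[y:y + bh, x:x + bw]
            diff = np.abs(window.astype(float) - b.astype(float)).mean()
            if best is None or diff < best[0]:
                best = (diff, (x, y))
    return best

# A position counts as a match when the difference falls below the threshold:
THRESHOLD = 0.075
```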
Each category in the database is matched against the input image (using the same technique as in categorization) and both the position and the matched category are recorded. After all categories have been checked, the matches are sorted by the horizontal coordinate of the match position and their names are concatenated in this order.
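This final step can be sketched as follows, reusing the sliding-window comparison from categorization. The recognize function and the categories dict layout (glyph name mapped to a boolean pattern) are assumptions for illustration, not the program's actual interface:

```python
import numpy as np

def recognize(bitmap, categories, threshold=0.075):
    """Read the glyphs in `bitmap` left to right.

    `categories` maps a glyph name to its boolean pattern.  Each pattern is
    slid across the bitmap; its best-scoring position counts as a match when
    the mean absolute difference there falls below `threshold`.
    """
    h, w = bitmap.shape
    matches = []
    for name, pattern in categories.items():
        ph, pw = pattern.shape
        best = None
        for y in range(h - ph + 1):
            for x in range(w - pw + 1):
                window = bitmap[y:y + ph, x:x + pw]
                diff = np.abs(window.astype(float) - pattern.astype(float)).mean()
                if best is None or diff < best[0]:
                    best = (diff, x)
        if best is not None and best[0] < threshold:
            matches.append((best[1], name))
    # Sort by the horizontal coordinate and concatenate the names.
    return ''.join(name for x, name in sorted(matches))
```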