Skip to content

4Chan Captcha! Data Set and Data Generating Procedures

Notifications You must be signed in to change notification settings

afogarty85/4ChanCaptcha

Repository files navigation

4ChanCaptcha

This repository contains a 4chan captcha data set amounting to 16242 foreground and background images and the procedures needed to produce or extend the captcha data set on your own.

Background

Foreground

Included Data

  1. base64 dataset -- the raw data amounting to 16242 foreground and background images

  2. fourCharacters -- a set of extracted characters, derived from Character Extractor below.

It also contains the following modules that aim to exploit the data:

Character Extractor

This module is the main workhorse which populates the foundation necessary to generate synthetic training data. Using a set of hyperparameters derived from optimize_character_extractor, character_extractor.py isolates and extracts individual characters and digits from the image, the atomic unit of our synthetic training data generator.

Classify Character

classify_character.py is a straight forward TorchVision implementation of ResNet18 that classifies extracted characters.

Generate Synthetic Data Set

generate_synthetic_dataset.py aims to artificially generate 4chan captchas for the main image model to train on, for example:

Synthetic Captcha

Train Synthetic Data Set

train_synthetic_dataset.py uses a CRNN model to train on the synthetically generated data set. It also includes a small prediction function as well as a lightly modified model to generate an ONNX export model capable of running in the browser.

CRNN Browser Powered Inference

The ultimate consequence of this repository is a capable model that is agile enough to be run from your local browser. Using Tampermonkey or Requestly, you can load the model (onnx_model.onnx) into your own browser using browser_inference.

Browser Inference

Some Starting Code

import cv2
import base64
import numpy as np
import pandas as pd
from PIL import Image


# b64 processing
def prepare_b64_image(background, foreground):
    '''
    params:
    background: string  - base64 image as text
    foreground: string  - base64 image as text
    '''
    # bg
    background = base64.b64decode(str(background))
    background = np.frombuffer(background, np.uint8)
    background = cv2.imdecode(background, cv2.IMREAD_UNCHANGED)

    # fg
    foreground = base64.b64decode(str(foreground))
    foreground = np.frombuffer(foreground, np.uint8)
    foreground = cv2.imdecode(foreground, cv2.IMREAD_UNCHANGED)
    return background, foreground


# get data
df = pd.read_csv(r'data\captcha_training_set.csv')

# get the first image
bg_ = df.iloc[0]['bg']
fg_ = df.iloc[0]['fg']

# transform
background, foreground = prepare_b64_image(bg_, fg_)

# show image
Image.fromarray(np.uint8(background))
Image.fromarray(np.uint8(foreground))

About

4Chan Captcha! Data Set and Data Generating Procedures

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published