4ChanCaptcha

This repository contains a 4chan captcha data set of 16242 foreground and background images, together with the procedures needed to produce or extend the data set yourself.

Included Data

base64 dataset -- the raw data amounting to 16242 foreground and background images
fourCharacters -- a set of extracted characters, derived from Character Extractor below.

The repository also contains the following modules, which build on the data:

Character Extractor

This module is the main workhorse: it builds the foundation needed to generate synthetic training data. Using a set of hyperparameters derived from optimize_character_extractor, character_extractor.py isolates and extracts the individual letters and digits from each image. These extracted characters are the atomic units of the synthetic training data generator.
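
To make the idea of character extraction concrete, here is a minimal sketch of one common approach: threshold the image, then cut it wherever a run of inked columns ends. It is not the logic of character_extractor.py. The function name, the `threshold` and `min_width` hyperparameters, and the assumption of dark glyphs on a light background are all hypothetical.

```python
import numpy as np


def extract_characters(gray, threshold=128, min_width=3):
    """Split a grayscale captcha into per-character crops.

    gray:      2D uint8 array, dark glyphs on a light background (assumed)
    threshold: pixels darker than this count as ink (hypothetical hyperparameter)
    min_width: column runs narrower than this are discarded as noise
    """
    ink = gray < threshold                 # boolean mask of glyph pixels
    has_ink = ink.any(axis=0)              # True for every column containing ink

    # Locate the columns where a run of inked columns starts or stops.
    padded = np.concatenate(([False], has_ink, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = changes[0::2], changes[1::2]

    crops = []
    for start, stop in zip(starts, stops):
        if stop - start < min_width:
            continue                       # too narrow to be a character
        # Trim empty rows above and below the glyph.
        rows = np.flatnonzero(ink[:, start:stop].any(axis=1))
        crops.append(gray[rows[0]:rows[-1] + 1, start:stop])
    return crops
```

Touching or overlapping characters would defeat this simple column split, which is presumably one reason the real extractor relies on tuned hyperparameters.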

Classify Character

classify_character.py is a straightforward TorchVision implementation of ResNet18 that classifies extracted characters.

Generate Synthetic Data Set

generate_synthetic_dataset.py artificially generates 4chan captchas for the main image model to train on.
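
One simple way to build a synthetic captcha from extracted characters is to paste one glyph per label character onto a blank canvas at a jittered position. The sketch below shows that idea only. The `glyphs` dictionary layout, the canvas size, and the `compose_captcha` name are hypothetical and do not describe what generate_synthetic_dataset.py actually does.

```python
import numpy as np


def compose_captcha(glyphs, text, height=80, width=300, rng=None):
    """Paste one glyph per character of `text` onto a white canvas.

    glyphs: dict mapping a character to a list of 2D uint8 crops (hypothetical layout)
    Returns the canvas and the label string.
    """
    if rng is None:
        rng = np.random.default_rng()
    canvas = np.full((height, width), 255, dtype=np.uint8)
    slot = width // len(text)              # horizontal slot reserved per character

    for i, ch in enumerate(text):
        options = glyphs[ch]
        glyph = options[rng.integers(len(options))]
        h, w = glyph.shape
        # Jitter the position inside the slot so samples are not identical.
        x = i * slot + rng.integers(0, max(1, slot - w + 1))
        y = rng.integers(0, max(1, height - h + 1))
        # Clip to the canvas and keep the darker pixel at each position.
        region = canvas[y:y + h, x:x + w]
        clipped = glyph[:region.shape[0], :region.shape[1]]
        canvas[y:y + h, x:x + w] = np.minimum(region, clipped)
    return canvas, text
```

Because the label is known at generation time, every synthetic image comes with a ground-truth string for free.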

Train Synthetic Data Set

train_synthetic_dataset.py trains a CRNN model on the synthetically generated data set. It also includes a small prediction function and a lightly modified version of the model for ONNX export, so that the model can run in the browser.
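
CRNN models are commonly trained with a CTC loss, in which case turning the per-timestep output into text means collapsing repeated indices and dropping blanks. The sketch below shows that greedy decoding step. Whether train_synthetic_dataset.py uses CTC at all is an assumption here, as are the alphabet and the convention that index 0 is the blank.

```python
import numpy as np

# Hypothetical label set; index 0 of the model output is assumed to be the CTC blank,
# so class index k (k >= 1) maps to ALPHABET[k - 1].
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def ctc_greedy_decode(logits, alphabet=ALPHABET):
    """Decode a (time_steps, len(alphabet) + 1) score matrix into a string."""
    best = logits.argmax(axis=1)           # most likely class at each time step
    chars = []
    previous = 0                           # start as if the previous step was blank
    for idx in best:
        # Emit a character only when it is not blank and not a repeat.
        if idx != 0 and idx != previous:
            chars.append(alphabet[idx - 1])
        previous = idx
    return "".join(chars)
```

A blank between two identical indices is what allows a doubled character such as "11" to survive the collapse.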

CRNN Browser Powered Inference

The end result of this repository is a capable model that is light enough to run in your local browser. Using Tampermonkey or Requestly, you can load the model (onnx_model.onnx) into your own browser via browser_inference.

Some Starting Code

import cv2
import base64
import numpy as np
import pandas as pd
from PIL import Image


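# b64 processing
def prepare_b64_image(background, foreground):
    '''
    Decode a base64 background/foreground pair into OpenCV arrays.

    params:
    background: string  - base64 image as text
    foreground: string  - base64 image as text

    returns:
    (background, foreground) as numpy arrays in OpenCV channel order
    (BGR, or BGRA when an alpha channel is present)
    '''
    # bg: base64 text -> raw bytes -> uint8 buffer -> decoded image
    background = base64.b64decode(str(background))
    background = np.frombuffer(background, np.uint8)
    background = cv2.imdecode(background, cv2.IMREAD_UNCHANGED)

    # fg: same steps; IMREAD_UNCHANGED keeps any alpha channel
    foreground = base64.b64decode(str(foreground))
    foreground = np.frombuffer(foreground, np.uint8)
    foreground = cv2.imdecode(foreground, cv2.IMREAD_UNCHANGED)
    return background, foreground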


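# get data (a forward slash works on Windows, Linux and macOS)
df = pd.read_csv('data/captcha_training_set.csv')

# get the first image
bg_ = df.iloc[0]['bg']
fg_ = df.iloc[0]['fg']

# transform
background, foreground = prepare_b64_image(bg_, fg_)

# show image: in a notebook the last expression of a cell renders inline;
# in a script, call .show() on the result instead
# note: OpenCV decodes colour images as BGR(A), so colours appear swapped in PIL
Image.fromarray(np.uint8(background))
Image.fromarray(np.uint8(foreground))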
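
Once both arrays are decoded, a natural next step is to view the foreground on top of the background. The sketch below alpha-blends the two, assuming the foreground has four channels with transparency in the last one and that the background is at least as wide as the foreground plus the chosen offset. The `overlay` helper and its `offset` parameter are hypothetical and are not part of the repository.

```python
import numpy as np


def overlay(background, foreground, offset=0):
    """Alpha-blend a 4-channel foreground over a window of the background.

    offset: horizontal position (in pixels) of the background window to use.
    """
    h, w = foreground.shape[:2]
    # Take the part of the background that sits behind the foreground.
    window = background[:h, offset:offset + w, :3].astype(np.float32)
    # Scale alpha to [0, 1]; keep a trailing axis so it broadcasts over channels.
    alpha = foreground[..., 3:4].astype(np.float32) / 255.0
    blended = alpha * foreground[..., :3] + (1.0 - alpha) * window
    return blended.astype(np.uint8)
```

Calling `overlay(background, foreground, offset=10)` on the arrays returned by prepare_b64_image would show how the two layers line up at that offset, provided the shape assumptions above hold for the data.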


afogarty85/4ChanCaptcha

Folders and files

Latest commit

History

Repository files navigation

4ChanCaptcha

Included Data

Character Extractor

Classify Character

Generate Synthetic Data Set

Train Synthetic Data Set

CRNN Browser Powered Inference

Some Starting Code

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages