# Memes Library Builder

This library contains tens of thousands of memes organized by topic into hundreds of folders. This notebook builds the master JSON file, which lists every topic, every meme in each topic, and any metadata files associated with each meme.


In [1]:
## Import Libraries

import os
import json
from glob import glob
from pathlib import Path


In [2]:
## Define Constants

MEMES_ROOT = Path('memes')


# Create missing first-seen files

Iterate recursively through all the subdirectories of the ./memes folder. For every file with one of the extensions listed below, check whether the same directory already contains a file with the same name plus the suffix .first-seen.txt. For example, meme.jpg should be accompanied by meme.jpg.first-seen.txt.  

If it is missing, create it and write the meme's file modification time into it, converted from a Python datetime to an ISO 8601 string in UTC. For example, for meme.jpg we create meme.jpg.first-seen.txt containing the modification time of meme.jpg.  

- .gif
- .jfif
- .jpeg
- .jpg
- .mp4
- .png
- .svg
- .webp
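
The naming rule above appends to the full file name rather than replacing the extension. A minimal sketch of the convention, and of reading a stored timestamp back, assuming the file holds an ISO 8601 string; the helper name `first_seen_path` is hypothetical:

```python
from datetime import datetime, timezone
from pathlib import Path


def first_seen_path(meme: Path) -> Path:
    """Return the sidecar path: the full file name plus '.first-seen.txt'."""
    return meme.with_name(meme.name + ".first-seen.txt")


meme = Path("memes") / "Topic 1" / "meme.jpg"
sidecar = first_seen_path(meme)
print(sidecar.name)  # meme.jpg.first-seen.txt

# with_suffix() would replace '.jpg' instead of appending to it.
print(meme.with_suffix(".first-seen.txt").name)  # meme.first-seen.txt

# An ISO 8601 string round-trips through datetime.fromisoformat().
stamp = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
assert datetime.fromisoformat(stamp.isoformat()) == stamp
```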




In [3]:
from pathlib import Path
from datetime import datetime, timezone


TRACKED_EXTS = {
    ".gif", ".jfif", ".jpeg", ".jpg",
    ".mp4", ".png", ".svg", ".webp",
}

def ensure_first_seen_files(root: Path, exts: set[str]) -> dict[str, int]:
    created = 0
    skipped = 0

    for file in root.rglob("*"):
        if file.suffix.lower() not in exts or not file.is_file():
            continue

        meta_path = file.with_name(file.name + ".first-seen.txt")

        if meta_path.exists():
            skipped += 1
            continue

        # ▶ grab mtime and convert to UTC datetime
        mod_time = datetime.fromtimestamp(file.stat().st_mtime,
                                          tz=timezone.utc)
        # ▶ write ISO-8601 string
        meta_path.write_text(mod_time.isoformat())

        created += 1

    return {"created": created, "skipped": skipped}

summary = ensure_first_seen_files(MEMES_ROOT, TRACKED_EXTS)
print(f"First-seen files created: {summary['created']}")
print(f"Already present / skipped : {summary['skipped']}")

First-seen files created: 0
Already present / skipped : 203


# Build the master memes.json file

The memes are organized as follows:
- /memes/Topic 1
- /memes/Topic 2

Memes can be images or videos. Assume all the common file extensions will be present. For each meme, a number of metadata files may be present. These should be included if present.

For example:
- memefilename.jpg
- memefilename.jpg.txt <- Canonical Tesseract OCR of the meme. It might be nonsense and we probably won't need it, but it exists, so let's include it in the JSON file.
- memefilename.jpg.llama3.2-vision.txt <- High-quality transformer analysis of the image, with a detailed explanation of its visual elements, including any text, but probably without awareness of social context or current events.

Example output:

'Topic 1' => {
    1 => {
        'file' => 'memes/topic/filename.jpg',
        'filemtime' => 'date the file was last modified',
        'metadata' => {
            'tesseract-ocr' => 'memes/topic/filename.jpg.txt',
            'llama-3.2-vision' => 'memes/topic/filename.jpg.llama3.2-vision.txt'
        }
    }
}
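
The example above uses integer positions as keys. As a small illustration using only the standard `json` module and made-up sample data, note that JSON object keys are always strings, so the positions come back as `"1"`, `"2"` and so on after a round trip:

```python
import json

# Made-up sample data in the shape described above.
master = {
    "Topic 1": {
        1: {
            "file": "memes/Topic 1/filename.jpg",
            "filemtime": "2024-01-02T03:04:05+00:00",
            "metadata": {"tesseract-ocr": "memes/Topic 1/filename.jpg.txt"},
        }
    }
}

# json.dumps converts the integer key 1 to the string "1".
restored = json.loads(json.dumps(master, indent=4))
print(list(restored["Topic 1"].keys()))  # ['1']

# Convert back to integers when numeric ordering matters.
by_position = {int(k): v for k, v in restored["Topic 1"].items()}
```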

The list of memes in each topic must be ordered by filemtime descending, such that the most recently added item is number 1. 
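
A minimal sketch of that ordering rule with made-up timestamps. It parses each ISO 8601 string before comparing, since plain string comparison only matches chronological order when every timestamp uses the same format and UTC offset. The helper name `number_newest_first` is hypothetical, and all timestamps are assumed to carry an offset:

```python
from datetime import datetime


def number_newest_first(memes: list[dict]) -> dict[int, dict]:
    """Sort by parsed 'filemtime' (newest first) and number from 1."""
    ordered = sorted(
        memes,
        key=lambda m: datetime.fromisoformat(m["filemtime"]),
        reverse=True,
    )
    return dict(enumerate(ordered, start=1))


sample = [
    {"file": "a.jpg", "filemtime": "2023-12-31T23:00:00+00:00"},
    # 22:30 UTC on 31 Dec, so older than a.jpg despite the later local date.
    {"file": "b.jpg", "filemtime": "2024-01-01T00:30:00+02:00"},
    {"file": "c.jpg", "filemtime": "2024-06-01T12:00:00+00:00"},
]

numbered = number_newest_first(sample)
print([m["file"] for m in numbered.values()])  # ['c.jpg', 'a.jpg', 'b.jpg']
```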

In [4]:
# Helper functions
import html


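def read_text_multi(path: Path, encodings=('utf-8-sig', 'cp1252', 'latin-1')):
    """Try several encodings in order; replace undecodable bytes as a last resort.

    'utf-8-sig' decodes plain UTF-8 too and also strips a leading BOM.
    """
    for enc in encodings:
        try:
            return path.read_text(encoding=enc)
        except UnicodeDecodeError:
            pass
    # Only reached if the caller passes a list without a catch-all such as 'latin-1'.
    return path.read_text(encoding=encodings[0], errors='replace')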

def escape_entities(text: str) -> str:
    "Escape HTML special characters."
    return html.escape(text, quote=True).replace('\n', ' ')

def get_file_timestamp(file: Path) -> str:
    """Return the meme's first-seen time as an ISO 8601 string."""
    file = Path(file)
    meta_path = file.with_name(file.name + ".first-seen.txt")

    if meta_path.exists():
        # Read the timestamp recorded in the first-seen file
        return read_text_multi(meta_path).strip()

    # Fall back to the file's current modification time (UTC)
    return datetime.fromtimestamp(file.stat().st_mtime, tz=timezone.utc).isoformat()
    
def get_topics():
    """Return list of topic folders in the memes directory"""
    return [p.name for p in MEMES_ROOT.iterdir() if p.is_dir()]

def process_topic(topic: str) -> dict[int, dict]:
    topic_path = Path(MEMES_ROOT) / topic
    memes: list[dict] = []

    for meme_file in topic_path.iterdir():
        if meme_file.suffix.lower() not in TRACKED_EXTS or not meme_file.is_file():
            continue

        memes.append({
            "file": str(meme_file),
            "filemtime": get_file_timestamp(meme_file),   # ISO 8601 string
            "metadata": {
                k: str(meme_file.with_name(meme_file.name + suffix))
                for k, suffix in {
                    "tesseract-ocr": ".txt",
                    "llama-3.2-vision": ".llama3.2-vision.txt",  # same suffix the generator cell writes
                }.items()
                if meme_file.with_name(meme_file.name + suffix).exists()
            }
        })

    # Newest first. The timestamps are ISO 8601 strings, so this comparison is
    # lexicographic; it matches chronological order when all share one UTC offset.
    memes.sort(key=lambda m: m["filemtime"], reverse=True)

    # Re-index so 1 == newest. "filemtime" is already a string, so the
    # entries are JSON-serialisable as they are.
    return dict(enumerate(memes, 1))

def build_master_json():
    """Build the master JSON file with all topics and all memes"""
    # process_topic() already returns each topic's memes numbered newest-first,
    # so no further sorting is needed here.
    return {topic: process_topic(topic) for topic in get_topics()}

master_json_data = build_master_json()


## Save JSON File

with open('memes.json', 'w', encoding='utf-8') as json_file:
    json.dump(master_json_data, json_file, indent=4)


# Generate missing llama-3.2-vision.txt files



In [5]:
import base64, requests, textwrap, time
from pathlib import Path
from requests.exceptions import RequestException

MODEL_NAME   = "llama3.2-vision:11b"
OLLAMA_URL   = "http://docker-ai:11434/api/generate"
PROMPT       = (
    "In 2-3 sentences, describe this meme for someone who cannot see it. "
    "Include any text that appears in the image."
)
IMAGE_EXTS   = {".gif", ".jfif", ".jpeg", ".jpg", ".png", ".svg", ".webp"}

MAX_RETRIES  = 3        # total attempts per image
INITIAL_WAIT = 5        # seconds before first retry (doubles each time)

def _meta_path(img: Path) -> Path:
    return img.with_name(f"{img.name}.llama3.2-vision.txt")

def _summarise_image(img: Path) -> str:
    """Call Ollama with retries; raise after MAX_RETRIES failures."""
    img_b64  = base64.b64encode(img.read_bytes()).decode()
    payload  = {
        "model": MODEL_NAME,
        "prompt": PROMPT,
        "stream": False,
        "images": [img_b64],
    }

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            r = requests.post(OLLAMA_URL, json=payload, timeout=300)
            r.raise_for_status()
            summary = r.json().get("response", "").strip()
            if not summary:
                raise ValueError("API returned empty 'response'")
            return summary

        except (RequestException, ValueError) as err:
            wait = INITIAL_WAIT * 2 ** (attempt - 1)
            print(f"[{img.name}] attempt {attempt}/{MAX_RETRIES} failed: {err}")
            if attempt < MAX_RETRIES:
                print(f"   → retrying in {wait}s …")
                time.sleep(wait)
            else:
                raise  # bubbled up to main loop

def create_all_summaries(root=MEMES_ROOT):
    skipped, made = 0, 0
    for img in Path(root).rglob("*"):
        if img.suffix.lower() not in IMAGE_EXTS or _meta_path(img).exists():
            continue

        try:
            summary = _summarise_image(img)
            meta    = _meta_path(img)
            meta.write_text(summary + "\n", encoding="utf-8")
            made += 1

            print(f"\n⟹  {img.relative_to(root)}")
            print(textwrap.fill(summary, width=88))
            print(f"— saved to {meta.name} —\n")

        except Exception as e:
            skipped += 1
            print(f"[skip] {img.relative_to(root)} → {e}")

    print(f"\n✓ Done. {made} files written, {skipped} skipped after retries.")

create_all_summaries()


✓ Done. 0 files written, 0 skipped after retries.
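
For reference, a small standalone sketch of the wait schedule implied by `MAX_RETRIES` and `INITIAL_WAIT` in the cell above. The helper name `backoff_waits` is hypothetical; no wait follows the final attempt because the error is re-raised instead:

```python
def backoff_waits(max_retries: int = 3, initial_wait: int = 5) -> list[int]:
    """Seconds slept after each failed attempt except the last one."""
    return [initial_wait * 2 ** (attempt - 1) for attempt in range(1, max_retries)]


print(backoff_waits())       # [5, 10]
print(sum(backoff_waits()))  # 15
```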


# Generate a markdown file for each meme
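
The pages generated below start with a YAML front matter block whose title is the meme's file name. File names are arbitrary, so a name containing a colon or `#` could produce invalid YAML if written unquoted. One possible safeguard, sketched here with a hypothetical `front_matter` helper, is to emit values as JSON strings, which are generally also valid YAML double-quoted scalars:

```python
import json


def front_matter(title: str, category: str, layout: str = "default") -> str:
    """Build a YAML front matter block with safely quoted values."""
    fields = {"layout": layout, "title": title, "category": category}
    body = "".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}\n"
        for key, value in fields.items()
    )
    return f"---\n{body}---\n\n"


print(front_matter("why: not #1.jpg", "Aesthetics - Cottagecore"))
```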

In [6]:
from pathlib import Path

META_ORDER = ["llama-3.2-vision", "first-seen", "tesseract"]
META_SUFFIXES = {
    "llama-3.2-vision": ".llama3.2-vision.txt",
    "tesseract": ".txt",
    "first-seen": ".first-seen.txt",
}

def create_markdown_files(root=MEMES_ROOT):
    root = Path(root)
    for meme in root.rglob('*'):
        if meme.suffix.lower() not in TRACKED_EXTS or not meme.is_file():
            continue

        md_file = meme.with_name(meme.name + '.md')
        lines = [
            '---\n',
            'layout: default\n',
            f'title: {meme.name}\n',
            f'category: {meme.parent.name}\n',
            '---\n\n',
        ]

        lines.append(f'<div markdown="0">')
        if meme.suffix.lower() in IMAGE_EXTS:
            lines.append(f'<a href="{meme.name}"><img class="photo" src="{meme.name}" /></a>\n\n')
        else:
            # Markdown link syntax is not processed inside the markdown="0" div, so emit HTML.
            lines.append(f'<a href="{meme.name}">Download {meme.name}</a>\n')

        for meta in META_ORDER:
            
            meta_path = meme.with_name(meme.name + META_SUFFIXES[meta])
            
            if meta_path.exists():
                content = read_text_multi(meta_path).strip()
                lines.append(f'<h2>{meta}</h2>\n')

                if meta == "llama-3.2-vision":
                    lines.append(f'<p><i>Llama-3.2-Vision-11B is a really good model that probably gets the visual details right but doesn\'t understand literary or media references, and often fails to accurately represent the physical arrangement of objects and the implied relationships between the objects.</i></p>\n')
                elif meta == "first-seen":
                    lines.append(f'<p><i>Because Git doesn\'t preserve file modification times, this metadata file contains the file\'s modification time when it was added to the library.</i></p>\n')
                elif meta == "tesseract":
                    lines.append(f'<p><i>Tesseract is often terrible and just gives a lot of nonsense characters, but it used to be the state of the art, and usually it is better at correctly representing text than llama-3.2-vision-11b.</i></p>\n')


                lines.append(f'<p>{escape_entities(content)}</p>\n\n')

        lines.append('</div>\n\n')
        md_file.write_text(''.join(lines), encoding='utf-8')

create_markdown_files()


# Generate new markdown index files for each category directory

In [7]:
def create_category_indexes(root=MEMES_ROOT):
    root = Path(root)
    for category in root.iterdir():
        if not category.is_dir():
            continue
        index_md = category / 'index.md'
        entries = []
        for meme in category.iterdir():
            if meme.suffix.lower() not in TRACKED_EXTS or not meme.is_file():
                continue
            fs_path = meme.with_name(meme.name + META_SUFFIXES['first-seen'])
            llama_path = meme.with_name(meme.name + META_SUFFIXES['llama-3.2-vision'])
            first_seen = read_text_multi(fs_path).strip() if fs_path.exists() else ''
            llama = read_text_multi(llama_path).strip() if llama_path.exists() else ''
            html = meme.name + '.html'
            entries.append((first_seen, meme.name, html, llama))
        entries.sort(key=lambda e: e[0])
        lines = [
            '---\n',
            'layout: default\n',
            f'title: "Category: {category.name}"\n',
            '---\n\n',
        ]
        for _fs, img, html, llama in entries:
            alt = escape_entities(llama)
            lines.append(f'<div markdown="0">')
            lines.append(f'<a href=\"{html}\"><img loading=\"lazy\" src=\"{img}\" alt=\"{alt}\" /></a>\n')
            if llama:
                lines.append(f'<p>{escape_entities(llama)}</p>\n')
            lines.append('</div>\n\n')
        index_md.write_text(''.join(lines), encoding='utf-8')

create_category_indexes()

# Generate main index markdown

In [8]:
def create_main_index(root=MEMES_ROOT, out_file=Path("index.md")):
    root = Path(root)
    entries = []
    for category in root.iterdir():
        if not category.is_dir():
            continue
        for meme in category.iterdir():
            if meme.suffix.lower() not in TRACKED_EXTS or not meme.is_file():
                continue
            fs_path = meme.with_name(meme.name + META_SUFFIXES['first-seen'])
            llama_path = meme.with_name(meme.name + META_SUFFIXES['llama-3.2-vision'])
            first_seen = read_text_multi(fs_path).strip() if fs_path.exists() else ''
            llama = read_text_multi(llama_path).strip() if llama_path.exists() else ''
            html = f"{category.name}/{meme.name}.html"
            img  = f"{category.name}/{meme.name}"
            entries.append((first_seen, category.name, img, html, llama))
    entries.sort(key=lambda e: e[0])
    # Every entry ends with a newline: the list is joined with '' below, and
    # front matter needs each key on its own line.
    lines = [
        '---\n',
        'layout: default\n',
        'title: Home\n',
        '---\n\n',
    ]
    for fs, cat, img, html, llama in entries:
        alt = escape_entities(llama)
        lines.append('<div markdown="0">\n')
        lines.append(f'<div class="card mb-4" data-category="{cat}" data-pubdate="{fs}">\n')
        lines.append(f'  <a href="{html}"><img class="card-img-top" loading="lazy" src="{img}" alt="{alt}" /></a>\n')
        lines.append('  <div class="card-body">\n')
        if fs:
            lines.append(f'    <p class="card-text text-muted small">{fs}</p>\n')
        #if llama:
            #lines.append(f'    <p class="card-text">{llama}</p>\n')
        lines.append('  </div>\n')   # closes card-body
        lines.append('</div>\n')     # closes card
        lines.append('</div>\n\n')   # closes the markdown="0" wrapper
    Path(out_file).write_text(''.join(lines), encoding='utf-8')

create_main_index()


# Build like Jekyll
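
The build step below walks the repository with `rglob`, skips tooling directories, and writes each page next to its source. A minimal sketch of those two path operations, assuming a hypothetical helper `is_ignored`; note that for an absolute path `parts[0]` is the filesystem root, so the check here is made on the path relative to the repository root:

```python
from pathlib import PurePosixPath

IGNORED_PREFIXES = (".venv", ".git", ".ipynb_checkpoints", "_site")


def is_ignored(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True if any component below root starts with an ignored prefix."""
    return any(
        part.startswith(IGNORED_PREFIXES)
        for part in path.relative_to(root).parts
    )


root = PurePosixPath("/repo")
page = root / "memes" / "Topic 1" / "meme.jpg.md"

print(page.parts[0])                                           # /
print(is_ignored(page, root))                                  # False
print(is_ignored(root / ".venv" / "lib" / "README.md", root))  # True

# with_suffix() swaps only the final '.md', keeping the image extension.
print(page.with_suffix(".html").name)                          # meme.jpg.html
```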

In [9]:
# ╔══════════════════════════════════════════════════════════════════════╗
# ║  Build Markdown → HTML in-place with pure Python                    ║
# ╚══════════════════════════════════════════════════════════════════════╝
import sys, subprocess, importlib, textwrap, shutil
from pathlib import Path

REPO_ROOT   = Path.cwd()
MD_EXT      = ".md"
HTML_EXT    = ".html"

# --------------------------------------------------------------------- #
# 1. Ensure required libraries are present (installs once, then imports)
# --------------------------------------------------------------------- #
PKGS = {"python-frontmatter": "frontmatter",
        "markdown": "markdown",
        "python-liquid": "liquid"}   # comment this line if you don’t need Liquid tags

def _pip_install(pkg):
    print(f"▶ installing {pkg} …")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet", pkg])

for pip_name, mod_name in PKGS.items():
    try:
        importlib.import_module(mod_name)
    except ModuleNotFoundError:
        _pip_install(pip_name)

import frontmatter, markdown
try:
    import liquid
    HAVE_LIQUID = True
except ModuleNotFoundError:
    HAVE_LIQUID = False

# --------------------------------------------------------------------- #
# 2. Minimal HTML template (keep it tiny & self-contained)
# --------------------------------------------------------------------- #
TEMPLATE = textwrap.dedent("""\
    <!doctype html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>{title}</title>
      <meta name="viewport" content="width=device-width,initial-scale=1">
      <style>
        body{{font-family:system-ui, sans-serif; line-height:1.5; margin:2rem auto; max-width:60ch;}}
        img,video{{max-width:100%; height:auto;}}
        pre{{background:#f6f8fa; padding:1em; overflow:auto;}}
      </style>
    </head>
    <body>
    {body}
    </body>
    </html>
""")

if HAVE_LIQUID:
    env = liquid.Environment()

def render_html(md_path: Path) -> str:
    """Return full HTML for a single Markdown file."""
    post       = frontmatter.load(md_path)
    md_html    = markdown.markdown(
        post.content,
        extensions=["extra", "codehilite", "toc", "tables", "sane_lists"],
    )
    if HAVE_LIQUID:
        md_html = env.from_string(md_html).render(**post.metadata)

    title = post.get("title") or md_path.stem
    return TEMPLATE.format(title=title, body=md_html)

# --------------------------------------------------------------------- #
# 3. Walk the repo and convert every *.md → *.html
# --------------------------------------------------------------------- #
converted, skipped = 0, 0
for md_file in REPO_ROOT.rglob(f"*{MD_EXT}"):
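    # ignore potential notebook checkpoints, virtual-envs, etc.
    # rglob() on an absolute root yields absolute paths, so test the relative parts.
    rel_parts = md_file.relative_to(REPO_ROOT).parts
    if any(part.startswith((".venv", ".git", ".ipynb_checkpoints", "_site")) for part in rel_parts):
        continue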

    html_path = md_file.with_suffix(HTML_EXT)
    # Re-build only if source is newer than output
    if html_path.exists() and html_path.stat().st_mtime >= md_file.stat().st_mtime:
        skipped += 1
        continue

    html_path.write_text(render_html(md_file), encoding="utf-8")
    converted += 1
    print("✓", html_path.relative_to(REPO_ROOT))

print(f"\n🎉  Done. {converted} file(s) converted, {skipped} up-to-date.")


✓ index.html
✓ README.html
✓ memes\Aesthetics - Cottagecore\1.JPG.html
✓ memes\Aesthetics - Cottagecore\10.JPG.html
✓ memes\Aesthetics - Cottagecore\11.JPG.html
✓ memes\Aesthetics - Cottagecore\11.webp.html
✓ memes\Aesthetics - Cottagecore\148491414_1395406877518752_1482888999266907316_n.jpg.html
✓ memes\Aesthetics - Cottagecore\1672721101277-noauth.jpeg.html
✓ memes\Aesthetics - Cottagecore\239153374_10160343231922871_2866747252492966655_n.jpg.html
✓ memes\Aesthetics - Cottagecore\267402505_4838520402836003_2742750630610652629_n.jpg.html
✓ memes\Aesthetics - Cottagecore\271050482_10228032638146999_2828551779419707699_n.jpg.html
✓ memes\Aesthetics - Cottagecore\278470936_5703598449654469_7763925991469591123_n.jpg.html
✓ memes\Aesthetics - Cottagecore\280503331_8089799731045844_9025300622612021248_n.jpg.html
✓ memes\Aesthetics - Cottagecore\292099097_3185065228411383_525233966772296549_n.jpg.html
✓ memes\Aesthetics - Cottagecore\298750477_564804355327962_842271975289137657_n.jpg.html
✓ 