HTML Checker is a web app that cleans HTML code containing unwanted text such as cite_start or cite markers.

algsoch/html-checker

🧹 HTML Citation Cleaner

A powerful web application built with FastAPI that intelligently removes citation markers from HTML files while preserving the complete HTML structure, formatting, and LaTeX content. Perfect for cleaning academic content, research papers, and educational materials generated by AI tools like Gemini.

✨ Features

Core Functionality

  • 🧹 Smart Citation Removal - Removes [cite: N] markers, including comma-separated lists and dash-separated ranges (e.g., [cite: 124, 125], [cite: 105-107])
  • 🗑️ Empty Tag Cleanup - Removes tags that contain only citation markers (e.g., [cite_start])
  • 📐 LaTeX Support - Preserves and renders mathematical formulas with MathJax
  • 🎯 Dual Mode - Paste HTML code OR upload files
  • 📊 Real-time Statistics - Live count of removed citations

Advanced UI Features

  • 🎨 Modern Glass Morphism Design - Beautiful gradient backgrounds with glass effects
  • 📱 Fully Responsive - Works perfectly on desktop, tablet, and mobile
  • 👁️ Live Preview - See rendered HTML before and after cleaning with LaTeX rendering
  • 💻 Code Comparison - Syntax-highlighted before/after code view
  • 📋 6 Copy Methods - Clipboard, Rich Text, Formatted, Select All, Download, Reset
  • 🎬 Smooth Animations - Engaging transitions and micro-interactions
  • 🌈 Interactive Demo - Live example with real academic content
  • ⚡ Auto-Update Preview - Real-time preview updates as you type (500ms debounce)

🚀 Live Demo

Try it now:

  • Render: https://html-checker-1.onrender.com ✅ Primary
  • Status: Running on free tier (may have cold starts after 15 min inactivity)
  • Response Time: < 200ms (warm) | ~50s (cold start)

⚠️ Note: This instance runs on Render's free tier. See RENDER_TROUBLESHOOTING.md if you encounter issues.

Deploy Your Own Instance:

This application can be easily deployed to various cloud platforms including Render, Azure, DigitalOcean, Railway, Heroku, and more. See the Deployment section below for detailed instructions.

🛠️ Installation

Prerequisites

  • Python 3.11 or higher
  • pip (Python package manager)

Local Setup

  1. Clone the repository:
    git clone https://github.com/algsoch/html-checker.git
    cd html-checker
  2. Install dependencies:
    pip install -r requirements.txt
  3. Run the application:
    python main.py
  4. Open your browser:
    http://localhost:8000

📖 Usage

Method 1: Paste HTML Code

  1. Navigate to the Paste Mode tab
  2. Paste your HTML code into the textarea
  3. Click "✨ Clean HTML Code"
  4. View the before/after comparison and live preview
  5. Copy or download the cleaned HTML

Method 2: Upload HTML File

  1. Navigate to the Upload Mode tab
  2. Drag & drop your HTML file OR click "Browse Files"
  3. Click "Clean & Download"
  4. Your cleaned file will be downloaded automatically

Live Preview Features

  • Code View: Syntax-highlighted before/after comparison
  • Rendered View: See actual HTML output with LaTeX formulas
  • Statistics: Total citations removed, breakdown by type
  • 6 Copy Options: Choose your preferred copy method
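
The "breakdown by type" statistic could be computed along the following lines. This is a minimal sketch and not the project's actual code: the function name `citation_stats`, the two type labels, and the regular expressions are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed marker types; the app may group its statistics differently.
PATTERNS = {
    "cite_start": re.compile(r"\[cite_start\]"),
    "cite_numbers": re.compile(r"\[cite:\s*\d+(?:\s*[-,]\s*\d+)*\s*\]"),
}


def citation_stats(html: str) -> Counter:
    """Count citation markers in html, keyed by (assumed) marker type."""
    stats = Counter({name: len(p.findall(html)) for name, p in PATTERNS.items()})
    stats["total"] = sum(stats.values())
    return stats


stats = citation_stats("[cite_start]A [cite: 1] B [cite: 2, 3] C [cite: 4-6]")
print(dict(stats))
```

Counting before cleaning, as here, keeps the statistics independent of how the markers are later removed.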

🏗️ Project Structure

html-citation-cleaner/
├── main.py                     # FastAPI backend with cleaning logic
├── requirements.txt            # Python dependencies
├── render.yaml                 # Render deployment config
├── startup.sh                  # Azure startup script
├── Procfile                    # Heroku deployment config
├── .github/
│   └── workflows/
│       ├── main_html-checker.yml    # Azure GitHub Actions CI/CD
│       ├── azure-deploy.yml         # Alternative Azure deployment
│       └── keep-alive.yml           # Render keep-alive (optional)
├── templates/
│   ├── index.html             # Main UI with live preview
│   └── public/
│       └── images/
│           └── filter.png      # App favicon
├── static/
│   ├── style.css              # Advanced styling (glass morphism)
│   └── script.js              # Frontend logic + MathJax integration
├── outputs/                    # Cleaned files directory
├── sample.html                 # Sample citation-filled HTML
└── sample1.html                # Demo content (Chemistry Question)

🔧 How It Works

The application uses a two-step intelligent cleaning algorithm:

Step 1: Remove Empty Citation Tags

Removes entire HTML tags that contain ONLY cite markers (no other text):

| Before | After |
|---|---|
| <p>[cite_start]</p> | (removed) |
| <p>[cite: 123]</p> | (removed) |
| <div>[cite_start]</div> | (removed) |
| <span>[cite: 456, 789]</span> | (removed) |

Step 2: Remove Inline Citation Markers

Removes cite markers from tags that contain other content:

| Before | After |
|---|---|
| <p>Some text [cite: 123]</p> | <p>Some text </p> |
| [cite_start]Other text | Other text |
| Text [cite: 124, 125] more | Text more |
| Formula \(C_6H_{14}\) [cite: 130] | Formula \(C_6H_{14}\) |

✅ The HTML structure and LaTeX formulas remain completely intact!
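
The two steps above can be sketched with regular expressions. This is an illustration under stated assumptions, not the code in main.py: the names `CITE_MARKER`, `EMPTY_CITE_TAG`, `INLINE_CITE`, and `clean_html` are hypothetical, and the real patterns may differ (for example in how they treat whitespace or nested tags).

```python
import re

# Hypothetical pattern for both marker kinds: [cite_start] and
# [cite: N], [cite: N, M], [cite: N-M].
CITE_MARKER = r"\[cite_start\]|\[cite:\s*\d+(?:\s*[-,]\s*\d+)*\s*\]"

# Step 1: a tag whose whole body is one or more markers (plus whitespace).
# A trailing newline is consumed so no blank line is left behind.
EMPTY_CITE_TAG = re.compile(
    rf"<(?P<tag>[A-Za-z][A-Za-z0-9]*)\b[^>]*>\s*(?:(?:{CITE_MARKER})\s*)+</(?P=tag)>\n?"
)

# Step 2: any remaining marker inside other content.
INLINE_CITE = re.compile(CITE_MARKER)


def clean_html(html: str) -> tuple[str, int]:
    """Return the cleaned HTML and the number of removals performed."""
    html, tags_removed = EMPTY_CITE_TAG.subn("", html)
    html, inline_removed = INLINE_CITE.subn("", html)
    return html, tags_removed + inline_removed


sample = (
    "<p>[cite_start]</p>\n"
    "<div>Diagrams [cite: 124, 125] show structure.</div>\n"
    "<p>\\(HClO_{3}\\) is the stronger acid [cite: 130].</p>\n"
)
cleaned, removed = clean_html(sample)
print(cleaned)
print(removed)
```

Running the empty-tag pass first matters: if the inline pass ran first, `<p>[cite_start]</p>` would be reduced to an empty `<p></p>` instead of being dropped. The patterns never touch `\(...\)` sequences, which is why LaTeX passes through unchanged in this sketch.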

🌐 Deployment

This application can be deployed to multiple cloud platforms. Choose the one that best fits your needs:

Deploy to Render (Free Tier Available)

Render offers a generous free tier and simple deployment process.

Deployment Steps:

  1. Fork this repository
  2. Go to Render Dashboard
  3. Click "New +" → "Web Service"
  4. Connect your GitHub repo
  5. Render will auto-detect the render.yaml configuration
  6. Click "Create Web Service"

Configuration (already in render.yaml):

plan: free  # or 'starter' ($7/month) for always-on
startCommand: gunicorn main:app --workers 1 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:$PORT --timeout 120

Free Tier Features:

  • 750 hours/month
  • Service sleeps after 15 minutes of inactivity
  • ~50 second cold start on first request
  • Perfect for personal projects and learning

Upgrade to Starter ($7/month) for:

  • Always-on service (no sleep)
  • No cold starts
  • Unlimited hours

Keep-Alive Option:

  • The repository includes .github/workflows/keep-alive.yml that pings your service every 5 minutes
  • Warning: This uses ~360 hours/month of your free tier limit
  • Recommended to disable for free tier users (see RENDER_TROUBLESHOOTING.md)

Troubleshooting: See RENDER_TROUBLESHOOTING.md for detailed solutions to common issues.


Deploy to Azure App Service (Enterprise Option)

Azure offers reliable hosting with good free tier options and easy GitHub integration.

Using GitHub Actions (Automated):

The repository includes a pre-configured workflow (.github/workflows/main_html-checker.yml) that automatically deploys to Azure on every push to main branch.

Setup:

  1. Create an Azure App Service with Python runtime
  2. Configure deployment credentials in GitHub repository secrets
  3. Push to main branch - auto-deploys!

Using Azure CLI (Manual):

# Login to Azure
az login

# Create resource group
az group create --name html-checker-rg --location eastus

# Create App Service Plan (choose tier based on needs)
# Free F1: Good for testing ($0/month, has limitations)
# Basic B1: Recommended for production ($13/month)
az appservice plan create --name html-checker-plan --resource-group html-checker-rg --sku B1 --is-linux

# Create Web App
az webapp create --resource-group html-checker-rg --plan html-checker-plan --name your-unique-app-name --runtime "PYTHON:3.11"

# Configure startup command
az webapp config set --resource-group html-checker-rg --name your-unique-app-name --startup-file "gunicorn main:app --workers 2 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 --timeout 120"

# Deploy code
az webapp up --name your-unique-app-name --resource-group html-checker-rg

Azure Pricing:

  • Free F1: $0/month (60 CPU minutes/day, 1GB RAM) - Good for testing
  • Basic B1: ~$13/month (Unlimited, 1.75GB RAM) - Recommended for production
  • Standard S1: ~$70/month (Better performance, auto-scaling)

Deploy to DigitalOcean App Platform

DigitalOcean offers simple deployment with competitive pricing.

Deployment Steps:

  1. Go to DigitalOcean App Platform
  2. Click "Create App" → Connect your GitHub repository
  3. Configure:
    • Build Command: pip install -r requirements.txt
    • Run Command: gunicorn main:app --workers 2 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:$PORT --timeout 120
    • HTTP Port: 8080 (or use environment variable $PORT)
  4. Choose plan and deploy

DigitalOcean Pricing:

  • Basic: $5/month (512MB RAM, 1 vCPU)
  • Professional: $12/month (1GB RAM, 1 vCPU)
  • Pro+: $24/month (2GB RAM, 2 vCPU)

Deploy to Railway

Railway provides a modern deployment experience with generous free tier.

Deployment Steps:

  1. Go to Railway
  2. Click "New Project" → "Deploy from GitHub repo"
  3. Select your repository
  4. Railway auto-detects Python and installs dependencies
  5. Add start command in settings:
    gunicorn main:app --workers 2 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:$PORT --timeout 120
    

Railway Pricing:

  • Hobby: $5/month (512MB RAM, shared CPU)
  • Pro: Starting at $20/month (more resources)

Deploy to Heroku

Note: Heroku discontinued its free tier in November 2022.

Deployment Steps:

  1. Create a Procfile in your repository (already included):
    web: gunicorn main:app --workers 2 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:$PORT --timeout 120
    
  2. Install Heroku CLI
  3. Deploy:
    heroku login
    heroku create your-app-name
    git push heroku main

Heroku Pricing:

  • Basic: $7/month per dyno
  • Standard: $25-50/month per dyno

Deploy to Google Cloud Run

Serverless deployment with pay-per-use pricing.

Deployment Steps:

  1. Create a Dockerfile (if not exists):
    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    COPY . .
    CMD ["gunicorn", "main:app", "--workers", "2", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8080", "--timeout", "120"]
  2. Deploy:
    gcloud run deploy html-checker --source . --platform managed --region us-central1 --allow-unauthenticated

Google Cloud Run Pricing:

  • Pay per use (free tier: 2 million requests/month)
  • ~$0.24 per million requests after free tier

Deploy to Fly.io

Modern platform with global deployment.

Deployment Steps:

  1. Install Fly CLI
  2. Deploy:
    fly launch
    fly deploy

Fly.io Pricing:

  • Free: 3 shared-cpu-1x VMs with 256MB RAM
  • Paid: Starting at $1.94/month per VM

🎯 Deployment Comparison

| Platform | Entry Price | Free Tier | Auto-Sleep | Best For |
|---|---|---|---|---|
| Render | $7/month (Starter) | 750 hrs/mo | After 15 min | Free tier, easy setup |
| Azure | $13/month (B1) | Limited F1 | No | Enterprise, Microsoft ecosystem |
| DigitalOcean | $5/month | No | No | Simple, predictable pricing |
| Railway | $5/month | 500 hours/mo | No | Modern development |
| Heroku | $7/month | No (removed) | No | Quick deployment |
| Google Cloud Run | Pay-per-use | 2M req/month | Yes | Serverless, variable traffic |
| Fly.io | $1.94/month | Limited | No | Global deployment |

💡 Recommendation

For Students/Learning:

  • Render: Best free tier (750 hours/month), easy setup
  • Railway: Good trial, modern platform
  • Azure F1: Free tier available (with limitations)
  • Fly.io: Generous free tier

For Production:

  • Render Starter: $7/month, always-on, no cold starts
  • DigitalOcean: Simple pricing, $5-12/month
  • Azure B1: Reliable, good performance, $13/month
  • Railway: Modern platform, starting at $5/month

For Variable Traffic:

  • Google Cloud Run: Pay only for what you use

🔌 API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Main web interface |
| POST | /upload | Upload and clean HTML file |
| GET | /download/{filename} | Download cleaned file |
| GET | /static/* | Serve static assets (CSS/JS) |
| GET | /templates/public/* | Serve public assets (images) |
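
As a rough illustration of calling the upload endpoint from a script, the sketch below builds a multipart POST with only the standard library. Several details are assumptions, since the README does not document them: the form field name `file`, the helper name `build_upload_request`, and the base URL (taken from the local setup above). The response format is not specified here, so the sketch stops at constructing the request.

```python
import urllib.request
import uuid


def build_upload_request(base_url: str, filename: str, html: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST for the /upload endpoint."""
    boundary = uuid.uuid4().hex
    # "file" is an assumed form field name; check main.py for the real one.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: text/html\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/upload",
        data=head + html + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


req = build_upload_request("http://localhost:8000", "sample.html", b"<p>Text [cite: 1]</p>")

# To send it against a running instance:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```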

💡 Example Transformation

Input HTML:

<p>[cite_start]</p>
<div>The Lewis electron-dot diagrams [cite: 124, 125] show molecular structure.</div>
<p>\(HClO_{3}\) is the stronger acid [cite: 130] due to electronegativity.</p>

Output HTML:

<div>The Lewis electron-dot diagrams  show molecular structure.</div>
<p>\(HClO_{3}\) is the stronger acid  due to electronegativity.</p>

✨ Notice: LaTeX \(HClO_{3}\) is perfectly preserved!

🛡️ Technology Stack

  • Backend: FastAPI (Python 3.11+)
  • ASGI Server: Uvicorn with Gunicorn workers
  • Frontend: HTML5, CSS3 (Glass Morphism), Vanilla JavaScript
  • Math Rendering: MathJax 3
  • Deployment: Render, Azure, DigitalOcean, Railway, Heroku, Google Cloud Run, Fly.io
  • CI/CD: GitHub Actions
  • Version Control: Git/GitHub

📊 Performance

  • ⚡ Fast Processing: Handles large HTML files (100KB+) in milliseconds
  • 🎯 Accuracy: 100% citation removal without structure damage
  • 📱 Responsive: Works on devices from 320px to 4K displays
  • 🔄 Auto-Deploy: GitHub push → Live in 2 minutes
  • 💾 Memory Efficient: Minimal server resource usage

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

vicky kumar

🙏 Acknowledgments

  • FastAPI for the amazing web framework
  • MathJax for LaTeX rendering support
  • Render, Azure, and other cloud platforms for making deployment accessible

📞 Support

If you found this project helpful, please give it a ⭐️!

For issues or questions, please open an issue.


Made with ❤️ for the academic community
