# Automated Metadata Generation

This notebook demonstrates extracting text from several document types (PDF, DOCX, TXT, and scanned images) and generating metadata from that text with an LLM (Mistral-7B via OpenRouter).

In [21]:
import os
from pdfminer.high_level import extract_text as extract_pdf_text
import docx
import pytesseract
from PIL import Image
from openai import OpenAI
import json
import logging

# Text Extraction Functions (with OCR)

In [22]:
def extract_text_from_pdf(filepath):
    return extract_pdf_text(filepath)

def extract_text_from_docx(filepath):
    doc = docx.Document(filepath)
    return '\n'.join(para.text for para in doc.paragraphs)

def extract_text_from_txt(filepath):
    with open(filepath, 'r', encoding='utf-8', errors='replace') as file:
        return file.read()

def extract_text_with_ocr(filepath):
    return pytesseract.image_to_string(Image.open(filepath))

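def extract_text(filepath):
    """Pick an extractor from the file extension and return the extracted text."""
    ext = os.path.splitext(filepath)[1].lower()
    try:
        if ext == '.pdf':
            try:
                return extract_text_from_pdf(filepath)
            except Exception:
                # NOTE: Pillow cannot open PDF files, so this fallback only helps
                # when the file is really an image saved with a .pdf extension;
                # otherwise the outer handler below reports the error.
                return extract_text_with_ocr(filepath)
        elif ext == '.docx':
            return extract_text_from_docx(filepath)
        elif ext == '.txt':
            return extract_text_from_txt(filepath)
        else:
            # Any other extension is treated as an image and sent to OCR
            return extract_text_with_ocr(filepath)
    except Exception as e:
        # Errors are returned as a string rather than raised
        return f"Error extracting text: {str(e)}"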
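A text extractor may return an empty or near-empty string for a scanned PDF instead of raising, in which case the `except` fallback in `extract_text` would never trigger. A minimal sketch of a content-based check follows. The helper name `needs_ocr` and the threshold of 20 characters are illustrative assumptions, not part of the notebook.

```python
def needs_ocr(text, min_alnum_chars=20):
    """Heuristic: treat a near-empty extraction result as a sign of a scanned document."""
    # Count letters and digits only, so whitespace and form feeds do not count as content
    alnum = sum(ch.isalnum() for ch in text)
    return alnum < min_alnum_chars


print(needs_ocr("\x0c\n  "))                                         # mostly whitespace
print(needs_ocr("This is a normal paragraph of extracted text."))   # real content
```

A caller could run the regular extractor first and switch to an OCR path only when `needs_ocr` returns `True`. The right threshold depends on the documents being processed.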

# Metadata Generation using TF-IDF

In [32]:
import string

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

# Download the stopword list if it is not already available
nltk.download('stopwords')

def preprocess_text(text):
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    return ' '.join(word for word in text.split() if word not in stop_words)

# Initialize summarizer once
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def generate_metadata(text, n_top_words=5):
    processed_text = preprocess_text(text)

    # Keyword extraction with TF-IDF. With a single document every term gets the
    # same IDF weight, so this effectively ranks terms by normalized frequency.
    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        tfidf_matrix = vectorizer.fit_transform([processed_text])
        feature_names = vectorizer.get_feature_names_out()
        scores = tfidf_matrix.toarray()[0]
        # Indices of the n_top_words highest-scoring terms
        top_indices = scores.argsort()[::-1][:n_top_words]
        topics_str = ', '.join(feature_names[top_indices])
    except ValueError:
        # Raised when the vocabulary is empty (for example, only stopwords)
        topics_str = ""

    # Summarize only the first max_chunk_length characters (characters, not tokens)
    max_chunk_length = 1024
    if len(text) > max_chunk_length:
        text = text[:max_chunk_length]
    summary = summarizer(text, max_length=60, min_length=20, do_sample=False)[0]['summary_text']

    # NOTE: topics_str is computed above but not returned; only the summary is
    return {
        "summary": summary
    }


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anjan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Device set to use cpu
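When `TfidfVectorizer` is fitted on a single document with its default settings, every term receives the same IDF weight, so the ranking essentially reduces to term frequency. The sketch below makes that explicit using only the standard library. The function `top_keywords` and its tiny stopword set are hypothetical stand-ins for illustration, and its tokenization is simpler than scikit-learn's.

```python
from collections import Counter
import string

# Tiny illustrative stopword set; an assumption, not NLTK's full English list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def top_keywords(text, n=5):
    """Return the n most frequent non-stopword terms in text."""
    # Lowercase and strip punctuation, mirroring preprocess_text
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    words = [w for w in cleaned.split() if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(n)]


print(top_keywords("The plate and the bioplastic plate: a bioplastic plate.", n=2))
# → ['plate', 'bioplastic']
```

For IDF to carry information, the vectorizer would need to be fitted on a collection of documents rather than on one.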


In [33]:
file_path = "Dona Pattal.pdf"  
text = extract_text(file_path)

In [34]:
meta = preprocess_text(text)
print(meta)

introduction 1 dona pattal initiative ongoing project puranpur village focused creating ecofriendly disposable plates traditionally dona pattal plates coated plastic easily degradable contributes environmental pollution address issue propose using bioplastic laminating paper plates offering sustainable alternative environmentally friendly 2 literature review found method preparing bioplastic research paper outlines bioplastic made using cornstarch vinegar glycerin water combination ingredients allows creation sustainable material suitable disposable plates 3 methodology make bioplastic laminating paper plates start gathering necessary ingredients tools need 1 teaspoon cornstarch 1 teaspoon vinegar 1 teaspoon glycerin 4 teaspoons water begin combining ingredients pot stirring mixture becomes smooth milky white next place pot mediumlow heat continue stirring mixture heats course 1015 minutes mixture begin thicken turn translucent cautious avoid overheating may cause lumps form mixture re

In [35]:
metadata = generate_metadata(meta, n_top_words=5)
print(metadata)

{'summary': ' Puranpur village focused creating ecofriendly disposable plates . traditional dona pattal plates coated plastic easily degradable contributes environmental pollution . Bioplastic laminating paper plates offer sustainable alternative environmentally friendly .'}
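`generate_metadata` summarizes only the first 1024 characters of its input, so the summary above reflects just the opening of the document. One possible way to cover the full text, sketched here as an assumption rather than something the notebook does, is to split the text into chunks on word boundaries and summarize each chunk. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, max_chars=1024):
    """Split text into chunks of at most max_chars characters, breaking on whitespace.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # The extra 1 accounts for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(' '.join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(' '.join(current))
    return chunks


print(chunk_text("aa bb cc dd", max_chars=5))
# → ['aa bb', 'cc dd']
```

Each chunk could then be passed to `summarizer` separately and the partial summaries joined. Whether that yields a better overall summary than truncation would need to be checked on real documents.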


# Metadata Generation using LLM via OpenRouter

In [29]:

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
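# Read the key from the environment instead of hardcoding it in the notebook.
# The variable name OPENROUTER_API_KEY is an assumed convention.
api_key = os.environ.get("OPENROUTER_API_KEY", "")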

def generate_metadata_via_api(text, api_key, model="mistralai/mistral-7b-instruct:free"):
    if not text.strip():
        return {"error": "No text extracted from document"}

    truncated_text = text[:5000] + "..." if len(text) > 5000 else text

    prompt = f"""
    You are a metadata generator. Given the following document content, generate:
    - Title
    - 3-sentence summary
    - 5 keywords
    - Topics
    - Category
    - Document type

    Return only a JSON object with these keys:
    {{
        "title": "...",
        "summary": "...",
        "keywords": ["...", "...", "...", "...", "..."],
        "topics": "...",
        "category": "...",
        "document_type": "report"
    }}

    Document Content:
    {truncated_text}
    """

    try:
        client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
            timeout=30,
            default_headers={
                "HTTP-Referer": "https://metadata-generator-app.com",
                "X-Title": "Automated Metadata Generator"
            }
        )
        
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            max_tokens=500
        )
        
        content = response.choices[0].message.content
        metadata = json.loads(content)
        
        required_keys = ["title", "summary", "keywords", "topics", "category"]
        if not all(key in metadata for key in required_keys):
            missing = [k for k in required_keys if k not in metadata]
            return {"error": f"Missing keys: {', '.join(missing)}"}

        return metadata
    
    except json.JSONDecodeError as e:
        logger.error(f"JSON error: {e} | Response: {content}")
        return {"error": "Invalid JSON response from API"}
    except Exception as e:
        logger.exception("API request failed")
        return {"error": f"API request failed: {str(e)}"}
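`generate_metadata_via_api` passes the model's reply straight to `json.loads`, which fails if the model wraps the JSON in a Markdown code fence or adds text around it. A more tolerant parser could look like the following sketch. The helper name `parse_json_response` is hypothetical, and the regular expression assumes the reply contains a single top-level JSON object.

```python
import json
import re


def parse_json_response(content):
    """Parse a JSON object from an LLM reply, tolerating code fences and surrounding text."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        pass
    # Fall back to the span from the first '{' to the last '}'
    match = re.search(r'\{.*\}', content, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


reply = 'Here is the metadata:\n{"title": "Example", "keywords": ["a", "b"]}\nDone.'
print(parse_json_response(reply))
# → {'title': 'Example', 'keywords': ['a', 'b']}
```

The second `json.loads` call can still raise `json.JSONDecodeError`, so the existing `except` clause in `generate_metadata_via_api` would remain relevant.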

# Test with a local file

In [31]:

file_path = "Dona Pattal.pdf"  
text = extract_text(file_path)
metadata = generate_metadata_via_api(text, api_key)
metadata

INFO:httpx:HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"


{'title': 'Creating Bioplastic for Eco-Friendly Disposable Plates: A Sustainable Alternative',
 'summary': 'This report presents a method for creating bioplastic using cornstarch, vinegar, glycerin, and water as ingredients. The bioplastic was successfully produced, demonstrating waterproof properties, but its tensile strength was lower than conventional plastics. The bioplastic began to dissolve when exposed to water for an extended period, likely due to insufficient heating during the preparation process. Future work should focus on optimizing the preparation method and testing thermal stability to improve durability and overall suitability for practical applications.',
 'keywords': ['bioplastic',
  'sustainable',
  'disposable plates',
  'cornstarch',
  'vinegar',
  'glycerin'],
 'topics': 'Environmental Science, Sustainability, Material Science',
 'category': 'Research',
 'document_type': 'report'}
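The response above contains six keywords even though the prompt asked for five, which shows that the model does not always follow the requested shape. A small post-processing step could enforce it. The sketch below is an illustration under that assumption; `normalize_metadata` and its `max_keywords` parameter are hypothetical names, not part of the notebook.

```python
def normalize_metadata(metadata, max_keywords=5):
    """Return a copy of metadata with keywords coerced to a list and capped in length."""
    result = dict(metadata)
    keywords = result.get("keywords", [])
    if isinstance(keywords, str):
        # Handle the case where keywords arrive as one comma-separated string
        keywords = [k.strip() for k in keywords.split(",") if k.strip()]
    result["keywords"] = list(keywords)[:max_keywords]
    return result


raw = {"keywords": ["bioplastic", "sustainable", "disposable plates",
                    "cornstarch", "vinegar", "glycerin"]}
print(normalize_metadata(raw)["keywords"])
# → ['bioplastic', 'sustainable', 'disposable plates', 'cornstarch', 'vinegar']
```

Truncating keeps the first keywords in the order the model returned them. Whether those are the most relevant ones is not guaranteed.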

## ✅ Conclusion

This notebook demonstrates a complete pipeline for **automated metadata generation** from unstructured documents. It extracts the text, generates a summary and keywords, and returns structured metadata (title, summary, keywords, topics, category) that can support document discovery and classification. The solution supports file formats such as **PDF, DOCX, and TXT**, and uses **OCR** for image files.

➡️ This functionality has been integrated into a simple and intuitive web application using **Flask**, allowing users to upload documents and instantly view their metadata.

---

### 🔗 Project Links

- 💻 **GitHub Repository:** [View Code](https://github.com/anjaninandan001/mars_auto_metadata_generator_project/tree/main)  
- 🌐 **Live Deployed Web App:** [Try it here](https://metagenerator-4.onrender.com/)  
- 🎥 **Demo Video (2 min):** [Watch on YouTube](https://youtube.com/your-demo-video)

---

Feel free to explore the live app, test different document formats, and review the extracted metadata in real time!
