Automatically generates structured notes from educational PDFs using advanced NLP techniques.
Transform your educational PDFs into intelligent, structured notes with extractive and abstractive summarization, keyword extraction, and chapter segmentation capabilities.
- Interactive Streamlit Web Interface - User-friendly UI for PDF upload and processing
- Dual Summarization Modes:
  - Extractive: TextRank and LSA algorithms for fast summarization
  - Abstractive: Transformer models (T5, BART) for human-like summaries
- Advanced Keyword Extraction - YAKE algorithm with TF-IDF fallback
- Intelligent Chapter Segmentation - Automatic detection and organization
- Multi-Format Export - Export to .docx, .txt, and .md formats
- Comprehensive Testing - Full test suite for reliability
- Modular Architecture - Easy to extend and maintain
smart_notes_generator/
├── app.py                  # Streamlit web application (main entry point)
├── summarizer.py           # NLP summarization engine
├── pdf_handler.py          # PDF text extraction & processing
├── exporter.py             # Multi-format export functionality
├── example_usage.py        # Usage examples and demos
├── requirements.txt        # Python dependencies
├── .gitignore              # Git ignore rules
├── utils/                  # Utility modules
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   └── text_processing.py  # Text cleaning & preprocessing
└── tests/                  # Comprehensive test suite
    ├── __init__.py
    ├── test_pdf_handler.py # PDF processing tests
    ├── test_summarizer.py  # Summarization tests
    └── test_exporter.py    # Export functionality tests
- PyMuPDF (fitz) >=1.24 - Primary PDF text extraction
- PyPDF2 - Backup PDF processing
- pdfplumber >=0.11 - Advanced PDF analysis
- spaCy >=3.7 - Industrial-strength NLP library
- Transformers (Hugging Face) >=4.43 - State-of-the-art NLP models
- YAKE >=0.4.8 - Keyword extraction algorithm
- Sumy >=0.11 - Automatic text summarization
- PyTorch >=2.3 - Deep learning framework
- scikit-learn - Machine learning utilities
- pandas >=2.2 - Data manipulation and analysis
- NumPy >=1.26 - Numerical computing
- python-docx >=1.1 - Microsoft Word document generation
- sentencepiece >=0.2 - Text tokenization
- regex >=2024.5 - Advanced regular expressions
- requests >=2.32 - HTTP library
- Python: 3.10 or higher
- Operating System: Windows, macOS, or Linux
- Memory: 4GB RAM minimum (8GB recommended for large PDFs)
- Storage: 2GB free space (for models and dependencies)
- Network: Internet connection for initial model downloads
git clone <repository-url>
cd smart_notes_generator
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Download spaCy English model
python -m spacy download en_core_web_sm
# Download NLTK data (automatically handled on first run)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords')"
python example_usage.py
streamlit run app.py
Then open your browser and navigate to http://localhost:8501
from pdf_handler import PDFHandler
from summarizer import SmartSummarizer
from exporter import NotesExporter
# Initialize components
pdf_handler = PDFHandler()
summarizer = SmartSummarizer()
exporter = NotesExporter()
# Process PDF
result = pdf_handler.extract_text("your_document.pdf")
summary = summarizer.generate_summary(result['text'], mode="extractive")
keywords = summarizer.extract_keywords(result['text'])
# Export notes
notes_data = {
    'summary': summary,
    'keywords': keywords,
    'metadata': result['metadata']
}
exported_content = exporter.export_notes(notes_data, format_type='.docx')
python example_usage.py [optional_pdf_path]
- Upload PDF through web interface or specify file path
- Validate file format and integrity
- Extract metadata (title, author, pages, etc.)
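The validation step above could be sketched as a cheap check on the file's leading and trailing bytes before any parsing is attempted. This is a minimal sketch, not the project's actual validator: the function name `looks_like_pdf` and the 1024-byte windows are assumptions made for illustration.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Heuristic check: a PDF header near the start and an %%EOF marker near the end."""
    p = Path(path)
    if not p.is_file() or p.suffix.lower() != ".pdf":
        return False
    with p.open("rb") as fh:
        head = fh.read(1024)          # the %PDF- header sits at the start of the file
        fh.seek(0, 2)                 # jump to the end to learn the file size
        size = fh.tell()
        fh.seek(max(size - 1024, 0))  # the %%EOF marker sits near the end
        tail = fh.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```

A file that passes this check can still be corrupt, so a full parse remains the real integrity test.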
- Extract raw text using PyMuPDF
- Clean and preprocess text (remove artifacts, fix formatting)
- Segment into sentences and paragraphs
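The cleaning and segmentation steps above can be illustrated with a few regular expressions. This is a simplified sketch under stated assumptions: the function names are hypothetical, the rules cover only common extraction artifacts, and the sentence splitter is a naive stand-in for the NLTK or spaCy tokenizers the project installs.

```python
import re


def clean_text(raw):
    """Remove common extraction artifacts from raw PDF text."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)          # re-join words hyphenated across lines
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$", "", text)      # drop lines holding only a page number
    text = re.sub(r"[ \t]+", " ", text)                    # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                 # keep at most one blank line
    return text.strip()


def split_paragraphs(text):
    """Split on blank lines and unwrap the line breaks inside each paragraph."""
    return [p.replace("\n", " ").strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive rule: break after ., ! or ? when followed by whitespace and a capital letter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph) if s.strip()]
```

The naive splitter will break on abbreviations such as "Dr. Smith", which is one reason a trained tokenizer is preferable in practice.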
- Chapter Detection: Identify chapter boundaries and titles
- Text Statistics: Calculate word count, reading time, etc.
- Structure Analysis: Identify bullet points, lists, and sections
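Chapter detection and text statistics, as listed above, might look like the following. This is a heuristic sketch rather than the project's detector: it only recognises lines that begin with "Chapter" followed by a number, and the 200 words-per-minute reading speed is an assumed default.

```python
import math
import re

# Matches lines such as "Chapter 3: Title" or "CHAPTER 3 - Title".
CHAPTER_RE = re.compile(r"^chapter[ \t]+(\d+)[ \t:.\-]*(.*)$", re.IGNORECASE | re.MULTILINE)


def detect_chapters(text):
    """Return one dict per detected chapter heading, with the text that follows it."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        # A chapter body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append({
            "number": int(m.group(1)),
            "title": m.group(2).strip(),
            "body": text[m.end():end].strip(),
        })
    return chapters


def text_statistics(text, words_per_minute=200):
    """Word count plus an estimated reading time in whole minutes."""
    words = re.findall(r"\b\w+\b", text)
    minutes = math.ceil(len(words) / words_per_minute) if words else 0
    return {"word_count": len(words), "reading_time_min": minutes}
```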
- Extractive Mode:
  - Use TextRank or LSA algorithms
  - Select most important sentences
  - Fast processing, no GPU required
- Abstractive Mode:
  - Use transformer models (BART/T5)
  - Generate new sentences
  - Higher quality, requires more resources
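To make the extractive mode concrete, here is a small pure-Python sketch of TextRank-style sentence ranking. It is an illustration only, not the code in `summarizer.py`: it uses a set-based variant of the word-overlap similarity, a fixed number of iterations, and hypothetical function names.

```python
import math
import re


def _tokens(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def _similarity(a, b):
    # Shared words, normalised by the log-lengths of both sentences.
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_summary(sentences, max_sentences=3, damping=0.85, iterations=50):
    """Pick the highest-ranked sentences and return them in document order."""
    n = len(sentences)
    if n <= max_sentences:
        return list(sentences)
    toks = [_tokens(s) for s in sentences]
    # Weighted, undirected sentence graph stored as an adjacency matrix.
    weights = [[_similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # PageRank-style update: each sentence receives score from its neighbours.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(top)]
```

Because this selects existing sentences and only needs word overlap, it runs quickly on a CPU, which matches the trade-off described for extractive mode.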
- Apply YAKE algorithm for key phrase extraction
- Fallback to TF-IDF if needed
- Filter and rank keywords by relevance
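The fallback behaviour described above can be sketched as a wrapper that tries a primary extractor and drops back to TF-IDF when it fails or returns nothing. Everything here is an assumption for illustration: the `primary` callable stands in for a YAKE-based extractor, the stopword list is deliberately tiny, and each sentence is treated as one document for the IDF term.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for",
             "on", "with", "that", "this", "it", "as", "by", "from"}


def tfidf_keywords(text, top_n=10):
    """Rank single words by TF-IDF, treating each sentence as a document."""
    docs = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    docs = [d for d in docs if d]
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    scores = Counter()
    for d in docs:
        for word, count in Counter(d).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1  # smoothed IDF
            scores[word] += (count / len(d)) * idf
    return [word for word, _ in scores.most_common(top_n)]


def extract_keywords(text, top_n=10, primary=None):
    """Use `primary(text, top_n)` when it works; otherwise fall back to TF-IDF."""
    if primary is not None:
        try:
            keywords = primary(text, top_n)
            if keywords:
                return list(keywords)[:top_n]
        except Exception:
            pass  # any failure in the primary extractor triggers the fallback
    return tfidf_keywords(text, top_n)
```

Passing the extractor in as a callable keeps the fallback logic testable without the optional dependency installed.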
- Generate structured notes in chosen format
- Include summary, keywords, and metadata
- Provide download link for immediate access
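As an illustration of the export step, the sketch below renders a notes dictionary with the `summary`, `keywords`, and `metadata` keys used earlier into Markdown text. The function name `render_markdown` and the exact layout are assumptions, not the output format of `exporter.py`.

```python
def render_markdown(notes):
    """Render a notes dictionary (summary, keywords, metadata) as Markdown text."""
    meta = notes.get("metadata", {})
    lines = [f"# {meta.get('title', 'Untitled')}", ""]
    for key in ("author", "pages"):
        if key in meta:
            lines.append(f"- {key.capitalize()}: {meta[key]}")
    lines += ["", "## Summary", "", notes.get("summary", ""), ""]
    keywords = notes.get("keywords", [])
    if keywords:
        lines += ["## Keywords", ""] + [f"- {k}" for k in keywords]
    # Trim trailing blank lines and end the file with a single newline.
    return "\n".join(lines).rstrip() + "\n"
```

The returned string can be written to a `.md` file or handed to a download widget as-is.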
- Mode: Extractive (fast) or Abstractive (high-quality)
- Length: 50-500 words
- Algorithm: TextRank or LSA (extractive mode)
- Model: BART or T5 (abstractive mode)
- Count: 5-50 keywords
- Method: YAKE or TF-IDF
- Phrase Length: 1-3 words per phrase
- Format: DOCX, TXT, or Markdown
- Include Chapters: Yes/No
- Include Keywords: Yes/No
- Custom Filename: User-defined names
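The options above map naturally onto a validated settings object. The following is a hypothetical sketch, not the contents of `utils/config.py`: the class name, field names, and default values are assumptions, while the allowed ranges and choices are taken from the list above.

```python
from dataclasses import dataclass


def _check_choice(name, value, allowed):
    if value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")


def _check_range(name, value, low, high):
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass
class NotesConfig:
    mode: str = "extractive"        # "extractive" or "abstractive"
    summary_words: int = 200        # allowed range: 50-500
    algorithm: str = "textrank"     # "textrank" or "lsa" (extractive mode)
    model: str = "bart"             # "bart" or "t5" (abstractive mode)
    keyword_count: int = 10         # allowed range: 5-50
    keyword_method: str = "yake"    # "yake" or "tfidf"
    max_phrase_words: int = 3       # allowed range: 1-3
    export_format: str = "docx"     # "docx", "txt" or "md"
    include_chapters: bool = True
    include_keywords: bool = True
    filename: str = "notes"         # user-defined output name

    def __post_init__(self):
        # Reject invalid settings at construction time instead of mid-pipeline.
        _check_choice("mode", self.mode, {"extractive", "abstractive"})
        _check_choice("algorithm", self.algorithm, {"textrank", "lsa"})
        _check_choice("model", self.model, {"bart", "t5"})
        _check_choice("keyword_method", self.keyword_method, {"yake", "tfidf"})
        _check_choice("export_format", self.export_format, {"docx", "txt", "md"})
        _check_range("summary_words", self.summary_words, 50, 500)
        _check_range("keyword_count", self.keyword_count, 5, 50)
        _check_range("max_phrase_words", self.max_phrase_words, 1, 3)
```

For example, `NotesConfig(mode="abstractive", model="t5")` is accepted, while `NotesConfig(summary_words=10)` raises `ValueError`.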
Run the comprehensive test suite:
# Run all tests
python -m pytest tests/ -v
# Run specific test modules
python -m pytest tests/test_pdf_handler.py -v
python -m pytest tests/test_summarizer.py -v
python -m pytest tests/test_exporter.py -v
# Run with coverage report
pip install pytest-cov
python -m pytest tests/ --cov=. --cov-report=html
- Create feature branch: git checkout -b feature/new-feature
- Implement changes in appropriate modules
- Add comprehensive tests
- Update documentation
- Submit pull request
- Add new algorithms in summarizer.py
- Implement in _generate_extractive_summary() or _generate_abstractive_summary()
- Update configuration options in utils/config.py
- Extend exporter.py with new format methods
- Update supported_formats list
- Add corresponding tests
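One way the three steps for a new export format could fit together is shown below, using JSON as the added format. This is a hypothetical sketch: `ExporterSketch` and its `_export_<format>` method naming are assumptions and may not match how `NotesExporter` is organised; only the `supported_formats` name and the `export_notes(notes_data, format_type=...)` call shape come from this README.

```python
import json


class ExporterSketch:
    # ".json" is the newly added entry; the others stand in for existing formats.
    supported_formats = [".txt", ".json"]

    def export_notes(self, notes_data, format_type=".txt"):
        if format_type not in self.supported_formats:
            raise ValueError(f"Unsupported format: {format_type}")
        # Dispatch to a method named after the extension, e.g. _export_json.
        handler = getattr(self, "_export_" + format_type.lstrip("."))
        return handler(notes_data)

    def _export_txt(self, notes):
        keywords = ", ".join(notes.get("keywords", []))
        return f"SUMMARY\n{notes.get('summary', '')}\n\nKEYWORDS\n{keywords}\n"

    def _export_json(self, notes):
        # The new format method: serialise the notes dictionary unchanged.
        return json.dumps(notes, indent=2, ensure_ascii=False)
```

A matching test would call `export_notes(notes, format_type=".json")` and check that the result parses back to the original dictionary.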
1. spaCy Model Not Found
python -m spacy download en_core_web_sm
2. NLTK Data Missing
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords')"
3. PyTorch Installation Issues
Visit the PyTorch installation guide for platform-specific instructions.
4. Memory Issues with Large PDFs
- Reduce summary length
- Use extractive mode instead of abstractive
- Process PDFs in smaller chunks
- Use extractive mode for faster processing
- Enable GPU acceleration for abstractive summarization
- Adjust chunk sizes for large documents
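The chunking advice above can be sketched as grouping sentences under a word budget, so that each chunk is summarized separately and the partial summaries are combined afterwards. The function name `chunk_sentences` and the 400-word default are assumptions for illustration, not project settings.

```python
def chunk_sentences(sentences, max_words=400):
    """Group consecutive sentences into chunks of roughly `max_words` words.

    A single sentence longer than the limit becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # close the chunk before it overflows
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than fixed character offsets avoids cutting a sentence in half, which matters more for abstractive models than for extractive ranking.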
- Notion API Integration - Direct export to Notion databases
- Multi-language Support - Process PDFs in multiple languages
- Custom Model Fine-tuning - Domain-specific summarization
- Batch Processing - Handle multiple PDFs simultaneously
- REST API - Headless operation for integration
- Mobile App - React Native mobile application
We welcome contributions! Please see our contribution guidelines:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass
- Submit a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
- Issues: Report bugs and feature requests on GitHub Issues
- Documentation: Check the /docs folder for detailed guides
- Examples: See example_usage.py for usage examples
Made with ❤️ by the Smart Notes Generator Team