A tool for estimating the Lexile® level of English PDF books.
Lexiler extracts text from PDFs and calculates an approximate Lexile measure based on sentence length and word frequency statistics. It uses an adaptive calibration approach that adjusts to different text formats and writing styles.
# Create and activate a virtual environment, then install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Basic usage:
# After installing dependencies and activating the venv
python main.py path/to/book.pdf
With a custom calibration configuration:
python main.py path/to/book.pdf path/to/calibration_config.json
The program:
- Extracts text from a PDF using pdfminer.six
- Segments the text into sentences and words
- Calculates mean sentence length (MSL) and mean log word frequency (MLWF)
- Uses these metrics in the Lexile formula with adaptive calibration from a config file
- Outputs the estimated Lexile score along with supporting metrics
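The two metrics in the steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in main.py: the function names, the regex-based sentence splitter, the use of the natural logarithm, and the `word_freq` dictionary (standing in for whatever frequency source the program queries) are all assumptions.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Lowercased alphabetic tokens; apostrophes are kept so "don't" stays one word.
    return re.findall(r"[a-z']+", sentence.lower())


def text_metrics(text, word_freq, floor=1e-9):
    """Return (MSL, MLWF) for a block of text.

    word_freq is a hypothetical mapping from word to relative frequency.
    Words missing from it are clamped to `floor` so that log() stays defined.
    """
    sentences = split_sentences(text)
    words = [w for s in sentences for w in tokenize(s)]
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")
    msl = len(words) / len(sentences)
    mlwf = sum(math.log(max(word_freq.get(w, 0.0), floor)) for w in words) / len(words)
    return msl, mlwf
```

Calling `text_metrics("The cat sat. The dog ran away!", freqs)` treats the input as two sentences with seven words in total, so the mean sentence length is 3.5. Real PDF text usually needs more careful sentence splitting (abbreviations, dialogue, page headers) than this splitter provides.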
Lexiler uses a dynamic calibration approach based on text characteristics. The program has been calibrated using a diverse set of books with known Lexile measures:
- Harry Potter series (Books 1-8): 500L to 1030L
- The Kite Runner: 840L
- The Martian: 680L
The formula selects an appropriate calibration constant based primarily on mean sentence length (MSL) with minor adjustments for word frequency (MLWF):
| Sentence Length Category | MSL Range | Approx. Constant | Example Books |
|---|---|---|---|
| Very short sentences | < 10 | ~9.6 | HP Book 2 (500L) |
| Short sentences | 10-15 | ~13.2 | The Martian (680L) |
| Medium-short sentences | 15-20 | ~14.1 | The Kite Runner (840L) |
| Medium sentences | 20-24 | ~17.3 | HP Book 8 (880L) |
| Medium-long sentences | 24-30 | ~18.8 | HP Books 1,3,4,7 (880L-1030L) |
| Long sentences | 30-35 | ~20.8 | HP Books 5,6 (870L-1030L) |
| Very long sentences | > 35 | ~21.7+ | Advanced texts |
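The table lookup can be expressed as a simple threshold search. The sketch below is only an illustration of that idea: the function name is hypothetical, the constants are the approximate values from the table, boundary values (for example an MSL of exactly 15) are assumed to fall into the higher category, and the minor MLWF adjustment is left out because its form is not described here.

```python
from bisect import bisect_right

# Upper MSL bound of each category except the last, taken from the table above.
MSL_UPPER_BOUNDS = [10, 15, 20, 24, 30, 35]

# Approximate calibration constant per category, in the same order as the table.
CONSTANTS = [9.6, 13.2, 14.1, 17.3, 18.8, 20.8, 21.7]


def calibration_constant(msl):
    """Pick the approximate constant for a given mean sentence length."""
    # bisect_right returns how many bounds are <= msl, which is the category index.
    return CONSTANTS[bisect_right(MSL_UPPER_BOUNDS, msl)]
```

With these values, `calibration_constant(12.5)` selects the "short sentences" row and `calibration_constant(40)` falls through to the last row.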
You can generate a custom calibration configuration by using the included analyze.py script:
python analyze.py [directory_with_calibration_books]
This will:
- Search for PDFs with Lexile levels in their filenames (format: "*-XXXL.pdf")
- Extract text and compute metrics for each book
- Calculate ideal calibration constants for each book
- Group books by sentence length categories
- Output a calibration configuration file (lexile_calibration.json)
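The filename convention in the first step can be matched with a regular expression. The following is a minimal sketch under the assumption that the Lexile level is the run of digits between the last hyphen and the trailing `L.pdf`. The helper names are hypothetical and are not taken from analyze.py.

```python
import re
from pathlib import Path

# Matches names such as "the-martian-680L.pdf" and captures the digits.
LEXILE_SUFFIX = re.compile(r"-(\d+)L\.pdf$")


def lexile_from_filename(path):
    """Return the Lexile level encoded in a filename, or None if absent."""
    match = LEXILE_SUFFIX.search(Path(path).name)
    return int(match.group(1)) if match else None


def find_calibration_books(directory):
    """Map each matching PDF filename in a directory to its known Lexile level."""
    books = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        level = lexile_from_filename(pdf)
        if level is not None:
            books[pdf.name] = level
    return books
```

PDFs without the suffix are skipped, so a calibration directory can also hold books with unknown levels.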
You can then use this configuration file with main.py:
python main.py your_book.pdf lexile_calibration.json
Dependencies:
- pdfminer.six (PDF text extraction)
- wordfreq (word frequency statistics)
Limitations:
- This is an approximation of the official Lexile measure
- Results may vary based on PDF formatting and text extraction quality
- For certified scores, use MetaMetrics' official Lexile Analyzer
The Lexile® Framework is a registered trademark of MetaMetrics, Inc.