aztack/Lexiler


Lexiler

A tool for estimating the Lexile® level of English PDF books.

Overview

Lexiler extracts text from PDFs and calculates an approximate Lexile measure based on sentence length and word frequency statistics. It uses an adaptive calibration approach that adjusts to different text formats and writing styles.

Installation

# Create and activate a virtual environment, then install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Basic usage:

# After installing dependencies and activating the venv
python main.py path/to/book.pdf

With a custom calibration configuration:

python main.py path/to/book.pdf path/to/calibration_config.json

How It Works

The program:

  1. Extracts text from a PDF using pdfminer.six
  2. Segments the text into sentences and words
  3. Calculates mean sentence length (MSL) and mean log word frequency (MLWF)
  4. Feeds MSL and MLWF into a Lexile-style formula, using calibration constants read from a config file
  5. Outputs the estimated Lexile score along with supporting metrics
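
As a rough illustration of steps 2 and 3, the sketch below computes MSL and MLWF from plain text using only the standard library. It is not Lexiler's actual code: the function names, the regex-based sentence splitter, the use of the natural logarithm, and the frequency floor are all assumptions made for the example. The `freq` argument is any callable that maps a word to a relative frequency; with the wordfreq dependency, something like `lambda w: word_frequency(w, "en")` could be passed in.

```python
import math
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def text_metrics(
    text: str,
    freq: Callable[[str], float],
    floor: float = 1e-9,  # assumed floor so unknown words do not give log(0)
) -> Tuple[float, float]:
    """Return (MSL, MLWF) for the given text.

    MSL  = total words / number of sentences
    MLWF = mean of log(frequency) over all words (natural log, an assumption)
    """
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]  # drop sentences with no words
    words = [w for s in sentences for w in s]
    if not words:
        raise ValueError("no words found in text")
    msl = len(words) / len(sentences)
    mlwf = sum(math.log(max(freq(w), floor)) for w in words) / len(words)
    return msl, mlwf


if __name__ == "__main__":
    # Constant stand-in frequency, used here only to keep the demo self-contained.
    msl, mlwf = text_metrics("The cat sat. The cat sat on the mat!", lambda w: 0.01)
    print(f"MSL={msl:.2f} MLWF={mlwf:.2f}")
```

Passing the frequency source as a parameter keeps the metric code independent of any particular word list, which also makes it easy to test with a fixed table.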

Calibration

Default Calibration

Lexiler uses a dynamic calibration approach based on text characteristics. The default calibration was derived from books with known Lexile measures:

  • Harry Potter series (Books 1-8): 500L to 1030L
  • The Kite Runner: 840L
  • The Martian: 680L

The formula selects an appropriate calibration constant based primarily on mean sentence length (MSL) with minor adjustments for word frequency (MLWF):

Sentence Length Category   MSL Range   Approx. Constant   Example Books
Very short sentences       < 10        ~9.6               HP Book 2 (500L)
Short sentences            10-15       ~13.2              The Martian (680L)
Medium-short sentences     15-20       ~14.1              The Kite Runner (840L)
Medium sentences           20-24       ~17.3              HP Book 8 (880L)
Medium-long sentences      24-30       ~18.8              HP Books 1,3,4,7 (880L-1030L)
Long sentences             30-35       ~20.8              HP Books 5,6 (870L-1030L)
Very long sentences        > 35        ~21.7+             Advanced texts
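
The MSL-based lookup described by the table can be sketched as follows. This is an illustration only, not the code in main.py: the names are hypothetical, the constants are the approximate values from the table, and the boundary handling (each range includes its lower bound) is an assumption, since the table does not say which category a value such as 15 falls into. The minor MLWF adjustment is left out because its form is not documented here.

```python
from bisect import bisect_right

# Upper bounds of the MSL ranges from the table; the last category is open-ended.
MSL_UPPER_BOUNDS = [10, 15, 20, 24, 30, 35]

# Approximate constants, one per category (one more entry than the bounds list).
APPROX_CONSTANTS = [9.6, 13.2, 14.1, 17.3, 18.8, 20.8, 21.7]


def approx_constant(msl: float) -> float:
    """Return the approximate calibration constant for a mean sentence length.

    Assumption: ranges are half-open, so an MSL of exactly 15 is treated as
    "medium-short" (15-20) rather than "short" (10-15).
    """
    if msl <= 0:
        raise ValueError("MSL must be positive")
    return APPROX_CONSTANTS[bisect_right(MSL_UPPER_BOUNDS, msl)]


if __name__ == "__main__":
    for msl in (8.0, 12.5, 22.0, 40.0):
        print(msl, approx_constant(msl))
```

Keeping the bounds and constants in two parallel lists means a custom calibration could replace them without touching the lookup logic.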

Custom Calibration

To generate a custom calibration configuration, run the included analyze.py script:

python analyze.py [directory_with_calibration_books]

This will:

  1. Search for PDFs with Lexile levels in their filenames (format: "*-XXXL.pdf")
  2. Extract text and compute metrics for each book
  3. Calculate ideal calibration constants for each book
  4. Group books by sentence length categories
  5. Output a calibration configuration file (lexile_calibration.json)
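
Step 1 relies on the filename convention "*-XXXL.pdf". A minimal sketch of how such a filename could be parsed is shown below; the function name, the regular expression, and the example filenames are assumptions for illustration and may differ from what analyze.py actually does.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a trailing "-<digits>L.pdf", e.g. "the-martian-680L.pdf".
LEXILE_SUFFIX = re.compile(r"-(\d+)L\.pdf$")


def lexile_from_filename(path: str) -> Optional[int]:
    """Return the Lexile level encoded in the filename, or None if absent."""
    match = LEXILE_SUFFIX.search(Path(path).name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical filenames, used only to show the expected naming pattern.
    for name in ("books/the-martian-680L.pdf", "books/notes.pdf"):
        print(name, lexile_from_filename(name))
```

Files that do not match the pattern return None, so a caller can skip them instead of failing.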

You can then use this configuration file with main.py:

python main.py your_book.pdf lexile_calibration.json

Dependencies

  • pdfminer.six (PDF text extraction)
  • wordfreq (word frequency statistics)

Limitations

  • The output is an estimate, not an official Lexile measure
  • Results may vary based on PDF formatting and text extraction quality
  • For certified scores, use MetaMetrics' official Lexile Analyzer

Notes

The Lexile® Framework is a registered trademark of MetaMetrics, Inc.
