A Python tool that discovers, extracts, and validates replication and reproduction studies for the FLoRA database.
Part of the FORRT project.
Starting from keyword searches of academic databases, FLoRA Extractor:
- Discovers candidate replication/reproduction papers from OpenAlex and curated lists
- Filters false positives using rule-based and LLM classification
- Extracts the target study and replication outcome from each paper
- Validates results through a crowdsourced voting web interface
Stage 1: search/ → data/candidates.csv (discover candidates)
Stage 2: filter/ → data/filtered.csv (remove false positives)
Stage 3: extract/ → data/extracted.csv (link original + code outcome)
Stage 4: validate/ → Flask web app (human voting, export)
Each stage is independently runnable. See CLAUDE.md for full technical details.
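Because each stage reads and writes a known file, the hand-off between stages can be sketched as a small runner. This is only an illustration: the script and output paths are the ones listed in this README, while `pending_stages` and `run_pending` are hypothetical helpers that are not part of the repository, and the rule "a stage is pending when its output file is missing" is an assumption.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts and their outputs, as listed in this README.
# The runner around them is a hypothetical sketch, not project code.
STAGES = [
    ("search/run_search.py", "data/candidates.csv"),
    ("filter/run_filter.py", "data/filtered.csv"),
    ("extract/run_extract.py", "data/extracted.csv"),
]


def pending_stages(root: Path) -> list[str]:
    """Return the scripts whose output file does not exist yet under root."""
    return [script for script, output in STAGES if not (root / output).exists()]


def run_pending(root: Path) -> None:
    """Run each pending stage in order, stopping at the first failure."""
    for script in pending_stages(root):
        # check=True raises CalledProcessError if a stage exits non-zero.
        subprocess.run([sys.executable, script], cwd=root, check=True)
```

Calling `run_pending(Path("."))` from the repository root would then run only the stages that have not produced their CSV yet.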
Note: The commands below show the target state once all stages are implemented. Currently,
flora_selected.csv can be used to seed Stage 4 directly.
# 1. Clone and setup
git clone https://github.com/forrtproject/flora-extractor.git
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env # fill in your API keys
# 2. Run the pipeline
python search/run_search.py # → data/candidates.csv
python filter/run_filter.py # → data/filtered.csv
python extract/run_extract.py # → data/extracted.csv
# 3. Start the validation web app
python -m validate.import_csv # load into SQLite
python -m validate.app # → http://localhost:5001
Add to your .env file (copy from .env.example):
RESEARCHER_EMAIL=you@example.com # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=... # primary LLM
GEMINI_API_KEY_2=... # optional: rotate for higher quota
OPENAI_API_KEY=... # fallback LLM (optional)
GROBID_URL=http://localhost:8070 # local GROBID server (optional, for full-text extraction)
Get a free Gemini API key at aistudio.google.com.
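The optional second Gemini key suggests simple key rotation. A minimal sketch of how that could work, assuming the variables above are already present in the process environment (how the project loads `.env` is not shown here); `gemini_keys` and `key_rotation` are hypothetical names, and round-robin is an assumed strategy rather than the project's confirmed behaviour:

```python
import os
from itertools import cycle
from typing import Iterator, Mapping


def gemini_keys(env: Mapping[str, str]) -> list[str]:
    """Collect the configured Gemini keys, skipping unset or empty ones."""
    names = ("GEMINI_API_KEY", "GEMINI_API_KEY_2")
    return [env[name] for name in names if env.get(name)]


def key_rotation(env: Mapping[str, str] = os.environ) -> Iterator[str]:
    """Return an endless round-robin iterator over the configured keys."""
    keys = gemini_keys(env)
    if not keys:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError("Set GEMINI_API_KEY in .env")
    return cycle(keys)
```

With both keys set, successive `next(...)` calls on the returned iterator alternate between them; with only `GEMINI_API_KEY` set, the same key is returned every time.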
Bibliographic databases (primary):
| Source | Coverage |
|---|---|
| OpenAlex | Broad academic literature, free API |
| Semantic Scholar | Supplementary coverage |
| Crossref | DOI resolution and reference lists |
| OpenCitations | Reference lists (where OpenAlex coverage is thin) |
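To illustrate how RESEARCHER_EMAIL ties into the OpenAlex search, here is a sketch that builds a keyword query URL for the OpenAlex works endpoint. The `search`, `per-page`, and `mailto` parameters reflect the public OpenAlex API as we understand it; `openalex_search_url` is a hypothetical helper, and the actual queries used in Stage 1 may differ.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def openalex_search_url(query: str, email: str, per_page: int = 25) -> str:
    """Build a works-search URL that identifies the caller by email."""
    params = {
        "search": query,       # full-text keyword search
        "per-page": per_page,  # results per page
        "mailto": email,       # contact address for API politeness
    }
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


# Example (no request is sent here; fetch the URL with any HTTP client):
url = openalex_search_url("replication study", "you@example.com")
```

The sketch only constructs the URL, so it can be tested offline and reused with whichever HTTP client the search stage prefers.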
Curated lists (secondary, pluggable):
| Source | Coverage |
|---|---|
| Bob Reed's Replication Network | Economics |
| I4R | Institute for Replication reports |
Full-text acquisition (for Stage 3): Unpaywall, CORE, arXiv, OSF.
Each extracted record contains:
| Field | Description |
|---|---|
| doi_r | Replication paper DOI |
| doi_o | Original target study DOI |
| title_o | Original target study title |
| outcome | success / failure / mixed / uninformative / descriptive |
| outcome_phrase | Supporting quote from the paper |
| link_evidence | Evidence used to identify the original |
| validation_status | confirmed / rejected / pending / needs_review |
Full schema: shared/schema.py
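For orientation, the fields above could be modelled roughly as follows. This is a hedged sketch, not the contents of shared/schema.py: the class name `ExtractedRecord`, the use of plain strings for every field, and the default status of "pending" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Allowed values as listed in the field table above.
OUTCOMES = {"success", "failure", "mixed", "uninformative", "descriptive"}
STATUSES = {"confirmed", "rejected", "pending", "needs_review"}


@dataclass
class ExtractedRecord:
    """Hypothetical model of one extracted record."""

    doi_r: str
    doi_o: str
    title_o: str
    outcome: str
    outcome_phrase: str
    link_evidence: str
    validation_status: str = "pending"  # assumed default before human voting

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.validation_status not in STATUSES:
            raise ValueError(f"unknown validation_status: {self.validation_status!r}")
```

Validating at construction time means a misspelled outcome fails when the record is built rather than surfacing later in the voting interface.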
| Team | Stage | Branch | Docs |
|---|---|---|---|
| Team Search | Stage 1 | feature/search | docs/STAGE1_SEARCH.md |
| Team Filter | Stage 2 | feature/filter | docs/STAGE2_FILTER.md |
| Team Extract | Stage 3 | feature/extract | docs/STAGE3_EXTRACT.md |
| Team Validate | Stage 4 | feature/validate | docs/STAGE4_VALIDATE.md |
New team member? Read CLAUDE.md first — it contains architecture, schema, and coding rules.
AI coding agent? Read CLAUDE.md (Claude Code) or AGENTS.md (all others).
Working in R? See the R note in CLAUDE.md.
- Branch from dev using your team's branch name (feature/search, etc.)
- Use sample data in misc/ to develop and test independently
- Open a PR to dev when a feature is stable; don't wait until the end
- main and dev are branch-protected; all merges require a PR review
- flora_search_approaches — original R-based pathway pipeline (reference implementation)
- FLoRA database — the database this tool feeds into
MIT