
FLoRA Extractor

A Python tool that discovers, extracts, and validates replication and reproduction studies for the FLoRA database.

Part of the FORRT project.


What It Does

Starting from keyword searches of academic databases, FLoRA Extractor:

  1. Discovers candidate replication/reproduction papers from OpenAlex and curated lists
  2. Filters false positives using rule-based and LLM classification
  3. Extracts the target study and replication outcome from each paper
  4. Validates results through a crowdsourced voting web interface

Architecture

Stage 1: search/      → data/candidates.csv   (discover candidates)
Stage 2: filter/      → data/filtered.csv     (remove false positives)
Stage 3: extract/     → data/extracted.csv    (link original + code outcome)
Stage 4: validate/    → Flask web app         (human voting, export)

Each stage is independently runnable. See CLAUDE.md for full technical details.
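The CSV-in/CSV-out contract between stages can be sketched as a small shared runner. This is an illustrative sketch only; `run_stage` and its field names are hypothetical and not part of the actual codebase (see CLAUDE.md and shared/schema.py for the real contract):

```python
import csv
from pathlib import Path

def run_stage(in_path: str, out_path: str, transform) -> int:
    """Read input CSV rows, apply a stage-specific transform, write survivors.

    `transform` returns a dict for rows to keep, or None to drop a row
    (e.g. a Stage 2 filter rejecting a false positive).
    Returns the number of rows written.
    """
    rows_out = []
    with open(in_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            result = transform(row)
            if result is not None:
                rows_out.append(result)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        if rows_out:
            writer = csv.DictWriter(f, fieldnames=rows_out[0].keys())
            writer.writeheader()
            writer.writerows(rows_out)
    return len(rows_out)
```

Because every stage reads and writes plain CSV, each one can be developed and re-run independently, which is what makes the four-team split above workable.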


Quick Start

Note: The commands below are the target state once all stages are implemented. Currently, flora_selected.csv can be used to seed Stage 4 directly.

# 1. Clone and setup
git clone https://github.com/forrtproject/flora-extractor.git
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env   # fill in your API keys

# 2. Run the pipeline
python search/run_search.py        # → data/candidates.csv
python filter/run_filter.py        # → data/filtered.csv
python extract/run_extract.py      # → data/extracted.csv

# 3. Start the validation web app
python -m validate.import_csv      # load into SQLite
python -m validate.app             # → http://localhost:5001

API Keys Required

Add to your .env file (copy from .env.example):

RESEARCHER_EMAIL=you@example.com      # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=...                    # primary LLM
GEMINI_API_KEY_2=...                  # optional: rotate for higher quota
OPENAI_API_KEY=...                    # fallback LLM (optional)
GROBID_URL=http://localhost:8070      # local GROBID server (optional, for full-text extraction)

Get a free Gemini API key at aistudio.google.com.
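One way the optional second Gemini key could be consumed is simple round-robin rotation across whatever keys are set. This is a hedged sketch, not the project's actual key-handling code; it assumes the `.env` values are already loaded into the environment (e.g. via python-dotenv), and `gemini_keys`/`next_key` are hypothetical names:

```python
import os
from itertools import cycle

def gemini_keys() -> list:
    """Collect GEMINI_API_KEY plus numbered variants (GEMINI_API_KEY_2, ...)."""
    keys = [os.environ[k] for k in sorted(os.environ) if k.startswith("GEMINI_API_KEY")]
    if not keys:
        raise RuntimeError("Set GEMINI_API_KEY in .env (copy from .env.example)")
    return keys

_rotation = None

def next_key() -> str:
    """Return the next key in round-robin order to spread requests over quotas."""
    global _rotation
    if _rotation is None:
        _rotation = cycle(gemini_keys())
    return next(_rotation)
```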


Data Sources

Bibliographic databases (primary):

Source             Coverage
OpenAlex           Broad academic literature, free API
Semantic Scholar   Supplementary coverage
Crossref           DOI resolution and reference lists
OpenCitations      Reference lists (where OpenAlex coverage is thin)

Curated lists (secondary, pluggable):

Source                           Coverage
Bob Reed's Replication Network   Economics
I4R                              Institute for Replication reports

Full-text acquisition (for Stage 3): Unpaywall, CORE, arXiv, OSF.
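As an example of the "API politeness" mentioned above: OpenAlex routes requests that include a `mailto` parameter to its faster polite pool, which is why RESEARCHER_EMAIL is required. A minimal sketch of building such a query URL (the helper name is illustrative; the actual search code may differ):

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

def build_search_url(query: str, email: str, per_page: int = 25) -> str:
    """Build an OpenAlex works search URL; `mailto` identifies the caller
    so requests are served from the polite pool."""
    params = {"search": query, "per-page": per_page, "mailto": email}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"
```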


Output Schema

Each extracted record contains:

Field               Description
doi_r               Replication paper DOI
doi_o               Original target study DOI
title_o             Original target study title
outcome             success / failure / mixed / uninformative / descriptive
outcome_phrase      Supporting quote from the paper
link_evidence       Evidence used to identify the original
validation_status   confirmed / rejected / pending / needs_review

Full schema: shared/schema.py
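The record above could be modelled roughly as follows. This is an illustrative sketch of the fields listed in the table, not the real definition in shared/schema.py; the class name and validation logic are assumptions:

```python
from dataclasses import dataclass

OUTCOMES = {"success", "failure", "mixed", "uninformative", "descriptive"}
VALIDATION_STATUSES = {"confirmed", "rejected", "pending", "needs_review"}

@dataclass
class ExtractedRecord:
    doi_r: str                          # replication paper DOI
    doi_o: str                          # original target study DOI
    title_o: str                        # original target study title
    outcome: str                        # one of OUTCOMES
    outcome_phrase: str = ""            # supporting quote from the paper
    link_evidence: str = ""             # evidence linking to the original
    validation_status: str = "pending"  # one of VALIDATION_STATUSES

    def __post_init__(self):
        # Reject values outside the controlled vocabularies above.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.validation_status not in VALIDATION_STATUSES:
            raise ValueError(f"unknown validation_status: {self.validation_status!r}")
```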


Team Guide

Team            Stage     Branch             Docs
Team Search     Stage 1   feature/search     docs/STAGE1_SEARCH.md
Team Filter     Stage 2   feature/filter     docs/STAGE2_FILTER.md
Team Extract    Stage 3   feature/extract    docs/STAGE3_EXTRACT.md
Team Validate   Stage 4   feature/validate   docs/STAGE4_VALIDATE.md

New team member? Read CLAUDE.md first — it contains architecture, schema, and coding rules.
AI coding agent? Read CLAUDE.md (Claude Code) or AGENTS.md (all others).
Working in R? See the R note in CLAUDE.md.


Contributing

  1. Branch from dev using your team's branch name (feature/search, etc.)
  2. Use sample data in misc/ to develop and test independently
  3. Open a PR to dev when a feature is stable — don't wait until the end
  4. main and dev are branch-protected; all merges require a PR review

Related Projects


License

MIT
