Automatically version-control files from Google Drive in a git repository with meaningful content diffs.
Files dropped in a Drive folder appear in git with extracted text alongside the originals. git diff shows actual content changes, not binary blobs. Commits are attributed to the person who edited the file in Drive.
- Legal teams — Track changes to contracts and NDAs with full redline history. `git blame` shows who changed what.
- Compliance & regulatory — Immutable audit trail for policy documents. Every version is hashed and timestamped.
- Consulting / client deliverables — Version-control proposals and reports that clients edit in Drive.
- Research & academia — Track revisions to papers and grant applications across collaborators.
- Finance & accounting — Diff quarterly reports, invoices, and spreadsheets to catch changes between versions.
- HR & operations — Version employee handbooks, SOPs, and training materials edited by non-technical staff.
- Any team using Drive — Get git-grade version history for people who will never touch a terminal.
File added/edited in Drive folder
↓
Drive push notification (webhook)
↓
Cloud Function processes changes:
• Downloads files from Drive
• Extracts text (docx→markdown, pdf→text)
• Detects renames/moves/deletes
• Commits per author (attributed to the actual Drive editor)
• Pushes to any git host
↓
Git repo has originals + diffable text side by side
docs/
├── Contracts/
│ ├── Contract_v2.docx # original binary
│ └── Contract_v2.docx.md # pandoc-extracted markdown (diffable)
├── Reports/
│ ├── Q4_Report.pdf # original binary
│ └── Q4_Report.pdf.txt # pdfplumber-extracted text (diffable)
- `git diff` on `.md`/`.txt` files shows meaningful content changes
- Track changes in docx files are preserved (insertions/deletions with author/date)
- `git log --author="Jane Smith"` shows changes by the actual editor
- `git log --follow` tracks file renames
- Works with GitHub, GitLab, Bitbucket, or any git host
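The per-author commit step can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: the `Change` record, `group_by_author`, and `commit_commands` are hypothetical names, and the commit message format is invented for illustration. The only external behavior it relies on is git's standard `--author="Name <email>"` option.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One changed file as a sync run might record it (hypothetical shape)."""

    path: str
    author_name: str
    author_email: str


def group_by_author(changes: list[Change]) -> dict[tuple[str, str], list[str]]:
    """Group changed paths by (name, email), keeping first-seen author order."""
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for change in changes:
        grouped[(change.author_name, change.author_email)].append(change.path)
    return dict(grouped)


def commit_commands(changes: list[Change]) -> list[list[str]]:
    """Build one `git add` + `git commit --author` pair per Drive editor."""
    commands: list[list[str]] = []
    for (name, email), paths in group_by_author(changes).items():
        commands.append(["git", "add", "--", *paths])
        commands.append(
            [
                "git",
                "commit",
                f"--author={name} <{email}>",
                "-m",
                f"Sync {len(paths)} file(s) edited by {name}",  # illustrative message
            ]
        )
    return commands
```

Each inner list could be passed to `subprocess.run(..., check=True)` inside the cloned repository. Grouping before committing is what makes `git log --author="Jane Smith"` return only Jane's edits.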
- GCP project with billing enabled
- Google Drive folder to monitor
- Git repository (any host supporting HTTPS push)
- Personal access token for git push
- `gcloud`, `terraform`, and `git` CLI tools (setup will offer to install these via brew)
make setup
`make setup` is a thin wrapper for `./scripts/setup.sh`.
The interactive setup installs missing tools via brew, creates your .env, creates a GCP project (or uses an existing one), links billing, enables APIs, deploys infrastructure, and stores your git token. It's crash-safe and idempotent — if anything fails, re-run and it picks up where it left off.
Dry run — preview every step without executing anything:
./scripts/setup.sh --dry-run
Non-interactive / agent mode — for CI or AI-agent-driven setup:
cp .env.example .env # fill in values first
GIT_TOKEN_VALUE=ghp_xxx ./scripts/setup.sh --non-interactive
Requires `.env` and GCP auth to exist beforehand. Auto-installs missing tools and prints a machine-readable summary of remaining manual steps.
To redeploy after code changes:
make deploy
`make deploy` is a thin wrapper for `./scripts/deploy.sh`.
Manual setup (step-by-step)
cp .env.example .env
# Edit .env with your values
GCP_PROJECT=my-project ./scripts/bootstrap.sh
# GitHub: fine-grained token with Contents read/write on target repo
echo -n "github_pat_XXXX" | gcloud secrets versions add git-token --data-file=-
make deploy
Drive webhooks require proving ownership of the webhook URL:
- Copy the `sync_handler_url` from the deploy output
- Go to Google Search Console → Add Property → URL Prefix → paste the URL
- Choose "HTML file" verification (the function serves it automatically via the `GOOGLE_VERIFICATION_TOKEN` env var)
- Go to Google API Console → Domain Verification → Add Domain → paste the domain
- Now Drive webhooks will accept your function URL
Alternative: Map a custom domain to Cloud Run and verify via DNS TXT record.
Share your target Drive folder with the service account email (shown in deploy output) with Editor access.
# Create watch channel and optionally do an initial sync
curl -X POST "$(terraform -chdir=infra output -raw setup_watch_url)?initial_sync=true" \
  -H "Authorization: bearer $(gcloud auth print-identity-token)"

| Variable | Required | Default | Description |
|---|---|---|---|
| `GCP_PROJECT` | Yes | — | GCP project ID |
| `DRIVE_FOLDER_ID` | Yes | — | Root Drive folder to monitor |
| `GIT_REPO_URL` | Yes | — | Git repository HTTPS URL |
| `GIT_BRANCH` | Yes | — | Branch to push to |
| `GIT_TOKEN_SECRET` | Yes | — | Secret Manager secret name |
| `EXCLUDE_PATHS` | No | (empty) | Glob patterns to skip, comma-separated |
| `SKIP_EXTENSIONS` | No | `.zip,.exe,.dmg,.iso` | Extensions to skip |
| `MAX_FILE_SIZE_MB` | No | `100` | Skip files larger than this |
| `COMMIT_AUTHOR_NAME` | No | `Drive Sync Bot` | Fallback commit author |
| `COMMIT_AUTHOR_EMAIL` | No | `sync@example.com` | Fallback commit email |
| `FIRESTORE_COLLECTION` | No | `drive_sync_state` | Firestore collection name |
| `DOCS_SUBDIR` | No | `docs` | Subdirectory in git repo |
| `SYNC_TRIGGER_SECRET` | No | auto-generated in `make deploy` if unset | Required for channel-less manual/scheduler sync triggers |
| Source | Extracted as | Tool | Notes |
|---|---|---|---|
| `.docx` | `.docx.md` | pandoc | Track changes preserved with `--track-changes=all` |
| `.pdf` | `.pdf.txt` | pdfplumber | Tables formatted as markdown, scanned pages warned |
| `.csv` | `.csv.txt` | built-in | Markdown pipe table |
| Google Docs | export→docx→`.md` | pandoc | |
| Google Sheets | export→csv→`.txt` | built-in | |
| Google Slides | export→pdf→`.txt` | pdfplumber | |
- Webhook + polling: Drive push notifications for speed (~30s), safety-net poll every 4 hours for reliability
- Resync on contention: If a webhook arrives during an active sync, it flags a re-run instead of silently dropping
- Watch renewal: Automatic every 6 days (channels expire after 7)
- Concurrency: Triple protection (max_instances=1, max_concurrency=1, Firestore distributed lock with 10-min TTL)
- Deduplication: md5 checksums prevent redundant commits
- Python 3.12+
- uv
- Terraform (for infrastructure)
git clone https://github.com/gbasin/gdrive-git-sync.git
cd gdrive-git-sync
# Install all dependencies (runtime + dev) in a virtual env
make install
# Run the full CI suite locally
make ci

| Command | What it does |
|---|---|
| `make install` | Install all deps via uv (creates `.venv` automatically) |
| `make lint` | Run shellcheck + ruff lint + ruff format check |
| `make format` | Auto-format code (ruff + terraform fmt) |
| `make typecheck` | Run mypy type checker |
| `make test` | Run pytest with coverage |
| `make ci` | Run lint + typecheck + test (same as CI) |
| `make setup` | Interactive first-time setup (guided) |
| `make deploy` | Package and deploy to GCP |
| `make clean` | Remove caches and build artifacts |
uv run pre-commit install
`make install` now installs these hooks automatically. Once installed, each commit auto-runs Ruff lint autofixes (`--fix`) and Ruff formatting, plus mypy.
functions/ # Cloud Function source (deployed to GCP)
├── main.py # 3 HTTP entry points
├── sync_engine.py # Core orchestration
├── drive_client.py
├── git_ops.py
├── text_extractor.py
├── pandoc_postprocess.py
├── state_manager.py
└── config.py
infra/ # Terraform (Cloud Functions, Scheduler, Firestore, IAM)
scripts/ # setup.sh (guided onboarding), bootstrap.sh, deploy.sh, verify.sh
tests/ # pytest suite (~190 test cases)
pyproject.toml is the source of truth. uv.lock pins exact versions for reproducible local dev.
functions/requirements.txt is a separate runtime manifest for Cloud Functions deployment — Google's buildpack doesn't support uv, so it needs a plain requirements file. Keep both in sync when adding dependencies.
GitHub Actions runs on every push/PR to main:
- Lint: ruff check + format check (via astral-sh/ruff-action)
- Typecheck: mypy via uv
- Test: pytest with coverage threshold (60% minimum)
- Terraform: format check
Every GCP service this project uses has a free tier. Typical small-team usage stays well within it:
| Service | Free tier | What this project uses it for |
|---|---|---|
| Cloud Functions (2nd gen) | 2M invocations/month | Webhook handler, watch renewal, setup |
| Firestore | 1 GiB storage, 50K reads/day | Page tokens, lock state, watch channel info |
| Secret Manager | 10K access operations/month | Git push token |
| Cloud Scheduler | 3 jobs free | Watch renewal (1 job), safety-net poll (1 job) |
| Cloud Build | 120 build-minutes/day | Function deployments |
Beyond free tier, costs scale with Drive activity. See GCP pricing for details.
- Service account — Gets read-only access to the monitored Drive folder plus Firestore read/write. No broader GCP permissions.
- Git token — Stored in Secret Manager, never in environment variables or source code. The Cloud Function reads it at runtime.
- Webhook endpoint — Public for Drive push notifications, but sync requests without Drive channel headers are gated by `X-Sync-Trigger-Secret` to prevent unauthenticated trigger abuse.
- No data storage beyond the git repo — Firestore holds only page tokens, lock state, and watch channel metadata. File contents pass through the function transiently and land only in the git repo.
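The gating rule for the webhook endpoint could be expressed along these lines. This is a sketch under assumptions, not the handler in `functions/main.py`: `is_authorized` is a hypothetical helper, and it treats the presence of `X-Goog-Channel-ID` (a header Drive sends with push notifications) as the marker for a Drive request. Checking that channel ID against the stored watch channel metadata is assumed to happen elsewhere.

```python
import hmac

DRIVE_CHANNEL_HEADER = "x-goog-channel-id"  # sent by Drive push notifications
TRIGGER_SECRET_HEADER = "x-sync-trigger-secret"


def is_authorized(headers: dict[str, str], expected_secret: str) -> bool:
    """Allow Drive notifications; require the shared secret for everything else."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value for name, value in headers.items()}
    if DRIVE_CHANNEL_HEADER in normalized:
        return True  # assumption: the channel ID itself is validated separately
    if not expected_secret:
        return False  # never accept manual triggers when no secret is configured
    supplied = normalized.get(TRIGGER_SECRET_HEADER, "")
    # Constant-time comparison avoids leaking the secret through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

Rejecting requests when the configured secret is empty guards against a misconfigured deployment where an empty header would otherwise match.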
- Scanned/image-only PDFs — pdfplumber extracts text from text-based PDFs only. Scanned pages produce a warning and no extracted text.
- Binary formats beyond docx/pdf/csv — Committed as-is without text extraction. `git diff` won't show meaningful changes for these.
- Single Drive folder — Monitors one folder (including subfolders). For multiple unrelated folders, deploy separate instances.
- Google Workspace restrictions — Workspace admins can restrict Drive API access or webhook delivery. Check with your admin if webhooks don't arrive.
- Large files — Files over `MAX_FILE_SIZE_MB` (default 100 MB) are skipped to stay within Cloud Function memory/timeout limits.
- Webhook delivery — Google doesn't guarantee webhook delivery. The 4-hour safety-net poll catches anything missed.
To remove all deployed resources:
cd infra
terraform destroy
Then clean up:
- Revoke the Drive share — Remove the service account from your Drive folder
- Delete the git token — Revoke the personal access token from your git host
- Delete the GCP project (optional) — If you created a project just for this:
gcloud projects delete <project-id>
Watch channel not receiving notifications
- Verify domain ownership is complete in both Search Console and API Console
- Check that the service account has access to the Drive folder
- Run the safety-net sync manually:
curl -X POST <sync_handler_url> -H "X-Sync-Trigger-Secret: $SYNC_TRIGGER_SECRET"
Files not syncing
- Check Cloud Function logs: `gcloud functions logs read drive-sync-handler --gen2 --limit=50`
- Verify page token exists: check Firestore `drive_sync_state/config/settings/page_token`
- Check if lock is stuck: Firestore `drive_sync_state/config/settings/sync_lock` (the lock auto-expires after 10 minutes)
Git push failures
- Verify token in Secret Manager has push access to the repo
- Check that the branch exists on remote
- Ensure token hasn't expired (GitHub fine-grained tokens have expiry dates)
Extraction quality issues
- docx track changes not showing: verify that your pandoc version supports `--track-changes=all`
- PDF tables garbled: pdfplumber works best with text-based PDFs, not scanned images
- Large files timing out: lower `MAX_FILE_SIZE_MB` so oversized files are skipped, or increase the function timeout
MIT