SocFlow is a unified framework for collecting and analyzing public discourse from multiple social platforms such as Reddit, Bluesky, and Mastodon.
It helps researchers and developers build large-scale social datasets for sentiment analysis, topic modeling, and behavioral studies.
Real-time data collection from Reddit, Bluesky, and Mastodon with split-screen monitoring
- Terminal User Interface (TUI): Real-time split-screen monitoring of all platforms
- Continuous Collection: Collect data continuously until interrupted
- Keyword-Based Collection: Filter posts by keywords and hashtags
- Real-Time Statistics: Live updates of collection progress and metrics
- Data Deduplication: Prevents storing duplicate posts across collection cycles
- Parallel Processing: Simultaneous collection from multiple platforms
- Object-Oriented Design: Clean, modular architecture with reusable components
- Hierarchical Configuration: Dev- and user-level configuration management
- Database Flexibility: Choose between a single database or separate databases per platform
- Pydantic Validation: Type-safe data models with automatic validation
- Multiple Platforms: Reddit, Bluesky, and Mastodon support
- Unified Schema: Consistent data structure across all platforms
- CLI Interface: Easy-to-use command-line interface
- Export Options: JSON, CSV, and Parquet export formats
# Clone the repository
git clone https://github.com/gauravfs-14/socflow.git
cd socflow
# Complete setup with one command
make setup

# Create a virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv sync
# Setup environment and configuration
make setup-env
make setup-config

Click to expand Quick Start Guide
# Copy environment template
cp .env.example .env
# Edit .env with your API credentials
nano .env

# Set up credentials interactively
make setup-credentials

# Create database tables
make setup-db

# Launch TUI for real-time collection
make collect-tui
# Or collect from specific platforms
make collect-reddit
make collect-bluesky
make collect-mastodon

# Show collection statistics
make stats
# Export data
make export-json

The TUI provides a real-time split-screen interface for monitoring data collection across all platforms:
- Live Statistics: Real-time post counts and collection status
- Timestamps: Last update times for each platform
- Continuous Updates: Automatic refresh of collection progress
- Graceful Shutdown: Ctrl+C to stop all collection processes
- Progress Tracking: Visual indicators for collection status
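The split-screen monitor above can be sketched as worker threads updating a shared stats table that the main thread reads for display. This is an illustration only: the real TUI renders with Rich, and the collector names, batch sizes, and `stats` layout here are assumptions.

```python
# Illustrative sketch of parallel collection feeding live statistics.
import threading
import time
from datetime import datetime, timezone

stats = {p: {"posts": 0, "last_update": None} for p in ("reddit", "bluesky", "mastodon")}
lock = threading.Lock()

def collector(platform: str, batches: int, batch_size: int) -> None:
    """Simulate a collector: each cycle adds one batch and stamps the time."""
    for _ in range(batches):
        time.sleep(0.01)  # stand-in for a network call
        with lock:
            stats[platform]["posts"] += batch_size
            stats[platform]["last_update"] = datetime.now(timezone.utc)

threads = [
    threading.Thread(target=collector, args=("reddit", 3, 15)),
    threading.Thread(target=collector, args=("bluesky", 2, 50)),
    threading.Thread(target=collector, args=("mastodon", 2, 25)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The TUI would refresh a panel per platform from this table.
for platform, s in stats.items():
    print(f"{platform:>9}: {s['posts']} posts, last update {s['last_update']}")
```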
# Launch TUI
make collect-tui
# The TUI will show:
# - Reddit Collector: Status, posts collected, last update
# - Bluesky Collector: Status, posts collected, last update
# - Mastodon Collector: Status, posts collected, last update

Click to expand Architecture Details
- Dev Config: `config/settings.yml` - development settings
- User Config: `~/.socflow/config.yml` - user-specific settings
- Environment Variables: Override any setting via environment variables
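The precedence implied by this hierarchy (environment variables over user config over dev config) can be sketched as a recursive merge. The `merge` helper and the `SOCFLOW_DATABASE_PATH` variable name are illustrative assumptions, not SocFlow's actual API.

```python
# Sketch of hierarchical config resolution: later layers win.
import os

def merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override values win."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

dev_config = {"database": {"type": "sqlite", "path": "data/socflow.db"}}
user_config = {"database": {"path": "~/research/socflow.db"}}  # made-up override

settings = merge(dev_config, user_config)

# Environment variables take highest precedence (variable name is hypothetical).
env_path = os.environ.get("SOCFLOW_DATABASE_PATH")
if env_path:
    settings["database"]["path"] = env_path

print(settings["database"])
```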
- Single Database: All platforms in one database (default)
- Separate Databases: One database per platform
- Supported Types: SQLite (default), PostgreSQL, MySQL
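A factory honoring the `separate_databases` flag might map platforms to connection targets like this; the function name and file layout are illustrative, not the project's actual implementation.

```python
# Sketch: resolve one shared database file or one file per platform.
def database_paths(platforms: list, separate: bool, base_dir: str = "data") -> dict:
    """Map each platform to the database file it should write to."""
    if separate:
        return {p: f"{base_dir}/socflow_{p}.db" for p in platforms}
    shared = f"{base_dir}/socflow.db"
    return {p: shared for p in platforms}

print(database_paths(["reddit", "bluesky", "mastodon"], separate=True))
print(database_paths(["reddit", "bluesky", "mastodon"], separate=False))
```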
- BasePost: Common interface for all platforms
- Platform-Specific Models: RedditPost, BlueskyPost, MastodonPost
- Pydantic Validation: Automatic data validation and serialization
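The model hierarchy can be sketched with stdlib dataclasses standing in for Pydantic (the project itself uses Pydantic); field names follow the unified schema, while the extra `RedditPost` fields and the validation rule shown here are illustrative.

```python
# Stdlib-dataclass analogue of BasePost and a platform-specific subclass.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BasePost:
    platform: str
    object_id: str
    author_handle: str
    text: str
    created_at: datetime
    tags: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    url: str = ""

    def __post_init__(self):
        # Minimal validation; Pydantic would also coerce and type-check fields.
        if not self.object_id:
            raise ValueError("object_id is required")

@dataclass
class RedditPost(BasePost):
    subreddit: str = ""
    title: str = ""

post = RedditPost(
    platform="reddit",
    object_id="t3_abc123",
    author_handle="example_user",
    text="Hello world",
    created_at=datetime.now(timezone.utc),
    metrics={"upvotes": 10, "score": 9},
    subreddit="MachineLearning",
    title="Hello",
)
print(post.platform, post.subreddit)
```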
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Reddit    │    │   Bluesky   │    │  Mastodon   │
│  Collector  │    │  Collector  │    │  Collector  │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                   ┌──────▼──────┐
                   │  Database   │
                   │   Manager   │
                   └─────────────┘
Click to expand Configuration Options
database:
  type: "sqlite"             # sqlite, postgresql, mysql
  path: "data/socflow.db"
  separate_databases: false  # true for separate DBs per platform

collectors:
  reddit:
    enabled: true
    subreddits: ["all", "MachineLearning", "worldnews", "politics"]
    max_posts_per_subreddit: 999999
    sort_by: "hot"           # hot, new, top, rising
    time_filter: "day"       # hour, day, week, month, year, all

collectors:
  bluesky:
    enabled: true
    keywords: []             # Empty for timeline collection
    max_posts: 999999

collectors:
  mastodon:
    enabled: true
    instances: ["https://mastodon.social", "https://mastodon.technology"]
    hashtags: []             # Empty for public timeline collection
    max_posts_per_instance: 999999

# Reddit API
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
# Bluesky
BLUESKY_HANDLE=your_handle.bsky.social
BLUESKY_PASSWORD=your_password
# Mastodon
MASTODON_ACCESS_TOKEN=your_access_token

Click to expand Data Schema Details
Each collected post is normalized into a common structure:
| Field | Description |
|---|---|
| `platform` | Source platform (reddit, bluesky, etc.) |
| `object_id` | Unique ID per platform |
| `author_handle` | Username or handle |
| `text` | Post or comment text |
| `created_at` | Timestamp |
| `tags` | Hashtags or communities |
| `metrics` | Likes, shares, upvotes, etc. |
| `url` | Link to the original post |
Reddit: `subreddit`, `title`, `flair`, `is_self`, `is_nsfw`; metrics: `upvotes`, `downvotes`, `score`, `gilded`

Bluesky: `handle`, `display_name`, `is_reply`, `is_repost`; metrics: `likes`, `reposts`, `replies`, `quotes`

Mastodon: `instance`, `is_reblog`, `is_sensitive`; metrics: `favourites`, `reblogs`, `replies`, `bookmarks`
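Putting the common schema and the platform-specific extras together, a single normalized Reddit post might look like the record below. All values are made up for illustration.

```python
# A made-up Reddit post expressed in the unified schema, with the
# platform-specific fields carried alongside the common ones.
import json

record = {
    "platform": "reddit",
    "object_id": "t3_abc123",
    "author_handle": "example_user",
    "text": "New results on transformer scaling laws",
    "created_at": "2024-05-01T12:00:00Z",
    "tags": ["MachineLearning"],
    "metrics": {"upvotes": 128, "downvotes": 4, "score": 124, "gilded": 0},
    "url": "https://www.reddit.com/r/MachineLearning/comments/abc123/",
    "subreddit": "MachineLearning",
    "title": "New results on transformer scaling laws",
    "is_self": True,
    "is_nsfw": False,
}

# Such a record round-trips cleanly through JSON for export.
assert json.loads(json.dumps(record)) == record
```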
Click to expand Usage Examples
The project includes a comprehensive Makefile for easy usage:
# Show all available commands
make help
# Complete project setup
make setup
# Data collection
make collect-tui # Launch TUI for real-time collection
make collect-reddit # Collect from Reddit
make collect-bluesky # Collect from Bluesky
make collect-mastodon # Collect from Mastodon
make collect-all # Collect from all platforms
# Data management
make stats # Show statistics
make export-json # Export as JSON
make export-csv # Export as CSV
make export-parquet # Export as Parquet
# Configuration
make config # Show current config
make setup-env # Setup environment file
make setup-credentials # Setup API credentials
make setup-db          # Create database tables

# Collect data from Reddit
python -m src.main collect --platforms reddit --subreddits all
# Collect data from Bluesky
python -m src.main collect --platforms bluesky --keywords AI machinelearning
# Collect data from Mastodon
python -m src.main collect --platforms mastodon --hashtags AI tech
# Collect from all platforms
python -m src.main collect --platforms reddit bluesky mastodon
# Show statistics
python -m src.main stats --platform reddit
# Export data
python -m src.main export --output data/export.json --platform reddit
# Show configuration
python -m src.main config

from src.app import SocFlowApp
# Initialize app
app = SocFlowApp("config/settings.yml")
# Create tables
app.create_tables()
# Collect data
results = app.collect_data(platforms=["reddit"])
# Get statistics
stats = app.get_stats()
# Export data
app.export_data("data/export.json")
# Cleanup
app.close()

Click to expand Development Guide
src/
├── app.py            # Main application and CLI
├── tui.py            # Terminal User Interface
├── config/           # Configuration management
│   └── settings.py
├── models/           # Pydantic data models
│   ├── base.py
│   ├── reddit.py
│   ├── bluesky.py
│   └── mastodon.py
├── database/         # Database abstraction
│   ├── base.py
│   ├── sqlite.py
│   └── factory.py
├── collectors/       # Data collectors
│   ├── base.py
│   ├── reddit.py
│   ├── bluesky.py
│   └── mastodon.py
└── utils/            # Utilities
    └── logger.py
- Create a new collector class inheriting from `BaseCollector`
- Create platform-specific data models
- Update the database schema if needed
- Register the collector in the main application
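Those steps might look like the following sketch. `BaseCollector`'s exact interface is not shown in this README, so the method names, the `ExampleCollector` class, and the registration mapping here are assumptions.

```python
# Sketch of extending the framework with a new platform collector.
from abc import ABC, abstractmethod

class BaseCollector(ABC):
    """Hypothetical common interface shared by all collectors."""
    def __init__(self, config: dict):
        self.config = config

    def is_enabled(self) -> bool:
        return bool(self.config.get("enabled", False))

    @abstractmethod
    def collect(self) -> list:
        """Return a batch of normalized post dicts."""

class ExampleCollector(BaseCollector):
    def collect(self) -> list:
        # A real collector would call the platform API here.
        return [{"platform": "example", "object_id": "1", "text": "hi"}]

# Registration: the app keeps a mapping from platform name to collector.
collectors = {"example": ExampleCollector({"enabled": True})}
posts = collectors["example"].collect()
print(len(posts), collectors["example"].is_enabled())
```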
- Create a new database manager inheriting from `DatabaseManager`
- Implement all abstract methods
- Update the factory function
- Add configuration options
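A minimal sketch of the first three steps, using an in-memory toy backend; the abstract method set, the `BACKENDS` registry, and the factory name are all illustrative, not SocFlow's actual code.

```python
# Sketch: subclass an abstract DatabaseManager and register it with the factory.
from abc import ABC, abstractmethod

class DatabaseManager(ABC):
    @abstractmethod
    def insert_posts(self, posts: list) -> int: ...

    @abstractmethod
    def close(self) -> None: ...

class MemoryDatabaseManager(DatabaseManager):
    """A toy backend that stores posts in a plain list."""
    def __init__(self):
        self.rows = []

    def insert_posts(self, posts: list) -> int:
        self.rows.extend(posts)
        return len(posts)

    def close(self) -> None:
        self.rows.clear()

# Factory update: map a config `type` string to a manager class.
BACKENDS = {"memory": MemoryDatabaseManager}

def create_database_manager(db_type: str) -> DatabaseManager:
    try:
        return BACKENDS[db_type]()
    except KeyError:
        raise ValueError(f"unsupported database type: {db_type}")

db = create_database_manager("memory")
print(db.insert_posts([{"object_id": "1"}, {"object_id": "2"}]))
```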
# Install development dependencies
make install
# Run tests
make test
# Format code
make format
# Run linting
make lint
# Clean up
make clean

Click to expand Performance Details
- Reddit: ~15 posts per 2-3 seconds
- Bluesky: ~50-100 posts per collection cycle
- Mastodon: ~20-30 posts per collection cycle
- Batch Processing: Efficient batch collection and database insertion
- Deduplication: Prevents storing duplicate posts
- Parallel Collection: Simultaneous collection from multiple platforms
- Memory Management: Optimized memory usage for large datasets
- Indexed Fields: Optimized database indexes for fast queries
- Connection Pooling: Efficient database connection management
- Transaction Batching: Batch database operations for better performance
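Deduplication and transaction batching can be combined at the SQL layer. The sketch below assumes a unique index on `(platform, object_id)` and an illustrative table layout; with `INSERT OR IGNORE`, repeated collection cycles can safely re-insert posts they have already seen.

```python
# Sketch: batched inserts that silently skip already-collected posts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE posts (
        platform  TEXT NOT NULL,
        object_id TEXT NOT NULL,
        text      TEXT,
        UNIQUE (platform, object_id)
    )"""
)

with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT OR IGNORE INTO posts VALUES (?, ?, ?)",
        [("reddit", "t3_abc", "first post"),
         ("reddit", "t3_def", "second post")],
    )

with conn:  # a second cycle re-collects an old post plus a new one
    conn.executemany(
        "INSERT OR IGNORE INTO posts VALUES (?, ?, ?)",
        [("reddit", "t3_abc", "first post"),
         ("bluesky", "at_123", "new post")],
    )

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(count)  # the duplicate t3_abc was ignored
```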
Click to expand Troubleshooting Guide
Reddit not collecting:
# Check Reddit credentials
make setup-credentials
# Test Reddit collector directly
python -c "from src.app import SocFlowApp; app = SocFlowApp(); print(app.collectors['reddit'].is_enabled())"

Database connection issues:
# Recreate database
rm -rf data/socflow.db
make setup-db

TUI not displaying properly:
# Check terminal size
echo $COLUMNS $LINES
# Try different terminal
export TERM=xterm-256color

# Enable debug logging
export SOCFLOW_DEBUG=1
make collect-tui

# Check system resources
htop
# Monitor database size
ls -lh data/socflow.db
# Check collection logs
tail -f logs/socflow.log

We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
Please read our Code of Conduct to understand our community guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- Reddit API: PRAW for Reddit data access
- Bluesky API: atproto for Bluesky integration
- Mastodon API: Mastodon.py for Mastodon support
- TUI Framework: Rich for beautiful terminal interfaces
- Data Validation: Pydantic for type-safe data models
- Issues: GitHub Issues
- Discussions: GitHub Discussions
