🧩 SocFlow

Python 3.8+ | License: MIT | Code style: black

SocFlow is a unified framework for collecting and analyzing public discourse from multiple social platforms such as Reddit, Bluesky, and Mastodon.
It helps researchers and developers build large-scale social datasets for sentiment analysis, topic modeling, and behavioral studies.

🖥️ TUI Demo

[Screenshot: SocFlow TUI. Real-time data collection from Reddit, Bluesky, and Mastodon with split-screen monitoring]

🚀 Features

  • 🖥️ Terminal User Interface (TUI): Real-time split-screen monitoring of all platforms
  • 🔄 Continuous Collection: Collect data continuously until interrupted
  • 🎯 Keyword-Based Collection: Filter posts by keywords and hashtags
  • 📊 Real-Time Statistics: Live updates of collection progress and metrics
  • 🛡️ Data Deduplication: Prevents storing duplicate posts across collection cycles
  • ⚡ Parallel Processing: Simultaneous collection from multiple platforms
  • 🏗️ Object-Oriented Design: Clean, modular architecture with reusable components
  • ⚙️ Hierarchical Configuration: Dev and user-level configuration management
  • 🗄️ Database Flexibility: Choose between single or separate databases per platform
  • ✅ Pydantic Validation: Type-safe data models with automatic validation
  • 📱 Multiple Platforms: Reddit, Bluesky, and Mastodon support
  • 🔗 Unified Schema: Consistent data structure across all platforms
  • 🖥️ CLI Interface: Easy-to-use command-line interface
  • 📤 Export Options: JSON, CSV, and Parquet export formats

⚙️ Installation

Quick Setup (Recommended)

# Clone the repository
git clone https://github.com/gauravfs-14/socflow.git
cd socflow

# Complete setup with one command
make setup

Manual Setup

# Create a virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv sync

# Setup environment and configuration
make setup-env
make setup-config

🧠 Quick Start

1. Environment Setup

# Copy environment template
cp .env.example .env

# Edit .env with your API credentials
nano .env

2. API Credentials Setup

# Setup credentials interactively
make setup-credentials

3. Database Setup

# Create database tables
make setup-db

4. Start Data Collection

# Launch TUI for real-time collection
make collect-tui

# Or collect from specific platforms
make collect-reddit
make collect-bluesky
make collect-mastodon

5. View Results

# Show collection statistics
make stats

# Export data
make export-json

🖥️ Terminal User Interface (TUI)

The TUI provides a real-time split-screen interface for monitoring data collection across all platforms:

Features

  • 📊 Live Statistics: Real-time post counts and collection status
  • ⏱️ Timestamps: Last update times for each platform
  • 🔄 Continuous Updates: Automatic refresh of collection progress
  • 🛑 Graceful Shutdown: Ctrl+C to stop all collection processes
  • 📈 Progress Tracking: Visual indicators for collection status

Usage

# Launch TUI
make collect-tui

# The TUI will show:
# - Reddit Collector: Status, posts collected, last update
# - Bluesky Collector: Status, posts collected, last update  
# - Mastodon Collector: Status, posts collected, last update

🏗️ Architecture

Configuration Management

  • Dev Config: config/settings.yml - Development settings
  • User Config: ~/.socflow/config.yml - User-specific settings
  • Environment Variables: Override any setting via environment variables
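
The three layers above can be read as a precedence chain: dev config first, then user config, then environment variables. The following is a minimal sketch of that merge, with plain dictionaries standing in for the parsed YAML files. The `SOCFLOW_` prefix and the double-underscore nesting convention are assumptions made for illustration, not documented SocFlow behavior.

```python
from typing import Any, Dict, Mapping


def deep_merge(base: Dict[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def env_overrides(environ: Mapping[str, str], prefix: str = "SOCFLOW_") -> Dict[str, Any]:
    """Turn SOCFLOW_DATABASE__TYPE=postgresql into {"database": {"type": "postgresql"}}.

    Hypothetical naming convention; values are kept as strings.
    """
    result: Dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return result


# Stand-ins for config/settings.yml, ~/.socflow/config.yml, and os.environ
dev_config = {"database": {"type": "sqlite", "path": "data/socflow.db"}}
user_config = {"database": {"path": "/tmp/my_socflow.db"}}
environ = {"SOCFLOW_DATABASE__TYPE": "postgresql", "HOME": "/home/me"}

settings = deep_merge(deep_merge(dev_config, user_config), env_overrides(environ))
print(settings)
# {'database': {'type': 'postgresql', 'path': '/tmp/my_socflow.db'}}
```

Merging into a fresh dictionary at each level keeps the dev defaults untouched, so the same defaults can be reused for several runs.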

Database Options

  • Single Database: All platforms in one database (default)
  • Separate Databases: One database per platform
  • Supported Types: SQLite (default), PostgreSQL, MySQL

Data Models

  • BasePost: Common interface for all platforms
  • Platform-Specific Models: RedditPost, BlueskyPost, MastodonPost
  • Pydantic Validation: Automatic data validation and serialization
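
As a rough picture of this hierarchy, here is a dependency-free sketch that uses standard-library dataclasses as a stand-in for Pydantic. The field names follow the unified schema described in this README; the validation rules and the choice of which Reddit fields to show are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class BasePost:
    """Fields shared by every platform."""
    platform: str
    object_id: str
    author_handle: str
    text: str
    created_at: datetime
    url: str
    tags: List[str] = field(default_factory=list)
    metrics: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Stand-in for checks that Pydantic would run automatically.
        if not self.object_id:
            raise ValueError("object_id must not be empty")
        if not isinstance(self.created_at, datetime):
            raise TypeError("created_at must be a datetime")


@dataclass
class RedditPost(BasePost):
    """Reddit-specific extras; only a few of the listed fields are shown."""
    subreddit: str = ""
    title: str = ""
    is_nsfw: bool = False


post = RedditPost(
    platform="reddit",
    object_id="abc123",
    author_handle="example_user",
    text="Hello",
    created_at=datetime(2024, 1, 1, 12, 0),
    url="https://www.reddit.com/r/example/comments/abc123/",
    subreddit="example",
    title="Hello",
)
print(post.platform, post.subreddit, post.tags)
```

With Pydantic, the explicit `__post_init__` checks would be replaced by field types and validators on a `BaseModel` subclass.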

Collection Flow

┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Reddit    │    │   Bluesky    │    │  Mastodon   │
│  Collector  │    │  Collector   │    │  Collector  │
└─────┬───────┘    └──────┬───────┘    └─────┬───────┘
      │                   │                  │
      └───────────────────┼──────────────────┘
                          │
                   ┌──────▼──────┐
                   │  Database   │
                   │   Manager   │
                   └─────────────┘
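
The fan-in shape of this flow, several collectors feeding one storage layer, can be sketched with a thread pool. This is only an illustration: the collector functions below are placeholders, and whether SocFlow uses threads, processes, or asyncio is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

Post = Dict[str, str]


def collect_reddit() -> List[Post]:
    # Placeholder: a real collector would call the Reddit API here.
    return [{"platform": "reddit", "object_id": "r1"}]


def collect_bluesky() -> List[Post]:
    return [{"platform": "bluesky", "object_id": "b1"}]


def collect_mastodon() -> List[Post]:
    return [{"platform": "mastodon", "object_id": "m1"}]


def collect_all(collectors: Dict[str, Callable[[], List[Post]]]) -> List[Post]:
    """Run every collector in its own thread and funnel results into one list."""
    stored: List[Post] = []
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = {pool.submit(fn): name for name, fn in collectors.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                stored.extend(future.result())
            except Exception as exc:
                # One failing platform should not stop the others.
                print(f"{name} collector failed: {exc}")
    return stored


posts = collect_all({
    "reddit": collect_reddit,
    "bluesky": collect_bluesky,
    "mastodon": collect_mastodon,
})
print(sorted(p["platform"] for p in posts))
# ['bluesky', 'mastodon', 'reddit']
```

Results are appended only in the calling thread, so the shared list needs no lock.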

🧩 Configuration

Database Configuration

database:
  type: "sqlite"  # sqlite, postgresql, mysql
  path: "data/socflow.db"
  separate_databases: false  # true for separate DBs per platform

Reddit Configuration

collectors:
  reddit:
    enabled: true
    subreddits: ["all", "MachineLearning", "worldnews", "politics"]
    max_posts_per_subreddit: 999999
    sort_by: "hot"  # hot, new, top, rising
    time_filter: "day"  # hour, day, week, month, year, all

Bluesky Configuration

collectors:
  bluesky:
    enabled: true
    keywords: []  # Empty for timeline collection
    max_posts: 999999

Mastodon Configuration

collectors:
  mastodon:
    enabled: true
    instances: ["https://mastodon.social", "https://mastodon.technology"]
    hashtags: []  # Empty for public timeline collection
    max_posts_per_instance: 999999
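
In the Bluesky and Mastodon settings above, an empty `keywords` or `hashtags` list means "collect everything". Whether SocFlow filters locally or passes these terms to each platform's search API is not stated; the sketch below shows the local-filter reading, and the `matches` helper is hypothetical.

```python
import re
from typing import Sequence


def matches(text: str, keywords: Sequence[str] = (), hashtags: Sequence[str] = ()) -> bool:
    """True if no filters are set, or if any keyword or hashtag occurs in `text`."""
    if not keywords and not hashtags:
        return True  # empty filters: plain timeline collection
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))   # whole words, so "AI" does not match "aim"
    tags = set(re.findall(r"#(\w+)", lowered))  # hashtags without the leading '#'
    return any(k.lower() in words for k in keywords) or any(
        h.lower().lstrip("#") in tags for h in hashtags
    )


print(matches("New #AI paper on machinelearning", keywords=["machinelearning"]))  # True
print(matches("aim high", keywords=["AI"]))                                       # False
print(matches("I love #Tech", hashtags=["tech"]))                                 # True
print(matches("anything at all"))                                                 # True
```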

Environment Variables

# Reddit API
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret

# Bluesky
BLUESKY_HANDLE=your_handle.bsky.social
BLUESKY_PASSWORD=your_password

# Mastodon
MASTODON_ACCESS_TOKEN=your_access_token
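
Before starting a long collection run it can help to check which of these variables are actually set. A small sketch follows; the variable names come from the list above, while the `missing_credentials` helper is hypothetical and not part of SocFlow.

```python
import os
from typing import Dict, List, Mapping

REQUIRED_VARS: Dict[str, List[str]] = {
    "reddit": ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET"],
    "bluesky": ["BLUESKY_HANDLE", "BLUESKY_PASSWORD"],
    "mastodon": ["MASTODON_ACCESS_TOKEN"],
}


def missing_credentials(environ: Mapping[str, str] = os.environ) -> Dict[str, List[str]]:
    """Map each platform to the variables that are unset or empty."""
    missing: Dict[str, List[str]] = {}
    for platform, names in REQUIRED_VARS.items():
        absent = [name for name in names if not environ.get(name)]
        if absent:
            missing[platform] = absent
    return missing


example_env = {
    "REDDIT_CLIENT_ID": "id",
    "REDDIT_CLIENT_SECRET": "secret",
    "BLUESKY_HANDLE": "me.bsky.social",
}
print(missing_credentials(example_env))
# {'bluesky': ['BLUESKY_PASSWORD'], 'mastodon': ['MASTODON_ACCESS_TOKEN']}
```

Calling `missing_credentials()` with no argument checks the real process environment.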

🗃️ Unified Schema

Each collected post is normalized into a common structure:

Field           Description
--------------  ---------------------------------------
platform        Source platform (reddit, bluesky, etc.)
object_id       Unique ID per platform
author_handle   Username or handle
text            Post or comment text
created_at      Timestamp
tags            Hashtags or communities
metrics         Likes, shares, upvotes, etc.
url             Link to the original post

Platform-Specific Fields

Reddit:

  • subreddit, title, flair, is_self, is_nsfw
  • upvotes, downvotes, score, gilded

Bluesky:

  • handle, display_name, is_reply, is_repost
  • likes, reposts, replies, quotes

Mastodon:

  • instance, is_reblog, is_sensitive
  • favourites, reblogs, replies, bookmarks
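
To make the relationship between platform-specific fields and the unified structure concrete, here is a sketch of two normalizers that produce records with identical keys. The keys read from the raw input (`id`, `created_utc`, `acct`, and so on) are illustrative guesses at what an API payload might contain, not SocFlow's actual input shape.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def normalize_reddit(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Reddit-style record onto the unified fields."""
    return {
        "platform": "reddit",
        "object_id": str(raw["id"]),
        "author_handle": raw.get("author", ""),
        "text": raw.get("selftext") or raw.get("title", ""),
        "created_at": datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc),
        "tags": [raw["subreddit"]],  # the community plays the role of a tag
        "metrics": {"score": raw.get("score", 0)},
        "url": raw.get("url", ""),
    }


def normalize_mastodon(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Mastodon-style record onto the same unified fields."""
    return {
        "platform": "mastodon",
        "object_id": str(raw["id"]),
        "author_handle": raw.get("acct", ""),
        "text": raw.get("content", ""),
        "created_at": datetime.fromisoformat(raw["created_at"]),
        "tags": list(raw.get("tags", [])),
        "metrics": {
            "favourites": raw.get("favourites", 0),
            "reblogs": raw.get("reblogs", 0),
        },
        "url": raw.get("url", ""),
    }


reddit_record = normalize_reddit({
    "id": "abc123", "author": "example_user", "title": "Hello",
    "created_utc": 1704110400, "subreddit": "example", "score": 3,
})
mastodon_record = normalize_mastodon({
    "id": 42, "acct": "someone@example.social", "content": "Hi",
    "created_at": "2024-01-01T12:00:00+00:00", "tags": ["ai"], "favourites": 1,
})
print(sorted(reddit_record) == sorted(mastodon_record))  # True: same keys on both
```

Platform-specific counters stay inside `metrics`, so downstream code can rely on the top-level keys being the same everywhere.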

🚀 Usage

Makefile Commands (Recommended)

The project includes a comprehensive Makefile for easy usage:

# Show all available commands
make help

# Complete project setup
make setup

# Data collection
make collect-tui          # Launch TUI for real-time collection
make collect-reddit       # Collect from Reddit
make collect-bluesky      # Collect from Bluesky  
make collect-mastodon     # Collect from Mastodon
make collect-all          # Collect from all platforms

# Data management
make stats                # Show statistics
make export-json          # Export as JSON
make export-csv           # Export as CSV
make export-parquet       # Export as Parquet

# Configuration
make config               # Show current config
make setup-env            # Setup environment file
make setup-credentials    # Setup API credentials
make setup-db             # Create database tables

CLI Commands

# Collect data from Reddit
python -m src.main collect --platforms reddit --subreddits all

# Collect data from Bluesky
python -m src.main collect --platforms bluesky --keywords AI machinelearning

# Collect data from Mastodon
python -m src.main collect --platforms mastodon --hashtags AI tech

# Collect from all platforms
python -m src.main collect --platforms reddit bluesky mastodon

# Show statistics
python -m src.main stats --platform reddit

# Export data
python -m src.main export --output data/export.json --platform reddit

# Show configuration
python -m src.main config
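
The subcommands and flags used above map naturally onto an `argparse` parser with one subparser per command. The sketch below mirrors only the invocations shown in this README; defaults, allowed values, and any other options are assumptions rather than the project's actual CLI definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="socflow")
    sub = parser.add_subparsers(dest="command", required=True)

    # collect --platforms ... [--subreddits ...] [--keywords ...] [--hashtags ...]
    collect = sub.add_parser("collect", help="collect posts")
    collect.add_argument("--platforms", nargs="+",
                         choices=["reddit", "bluesky", "mastodon"])
    collect.add_argument("--subreddits", nargs="*", default=[])
    collect.add_argument("--keywords", nargs="*", default=[])
    collect.add_argument("--hashtags", nargs="*", default=[])

    # stats [--platform NAME]
    stats = sub.add_parser("stats", help="show statistics")
    stats.add_argument("--platform")

    # export --output PATH [--platform NAME]
    export = sub.add_parser("export", help="export collected data")
    export.add_argument("--output")
    export.add_argument("--platform")

    # config takes no arguments
    sub.add_parser("config", help="show configuration")
    return parser


args = build_parser().parse_args(
    ["collect", "--platforms", "bluesky", "--keywords", "AI", "machinelearning"]
)
print(args.command, args.platforms, args.keywords)
# collect ['bluesky'] ['AI', 'machinelearning']
```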

Programmatic Usage

from src.app import SocFlowApp

# Initialize the app from the dev config
app = SocFlowApp("config/settings.yml")

try:
    # Create tables
    app.create_tables()

    # Collect data
    results = app.collect_data(platforms=["reddit"])

    # Get statistics
    stats = app.get_stats()

    # Export data
    app.export_data("data/export.json")
finally:
    # Always clean up, even if collection or export fails
    app.close()
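
For a sense of what exporting unified records involves, here is a standard-library sketch of JSON and CSV output. It is not the body of `export_data`; the `to_json` and `to_csv` helpers are hypothetical. The main design point is that CSV has no nested types, so list and dict fields such as `tags` and `metrics` are serialized as JSON strings.

```python
import csv
import io
import json
from typing import Any, Dict, List

posts: List[Dict[str, Any]] = [
    {"platform": "reddit", "object_id": "abc", "text": "Hello",
     "tags": ["example"], "metrics": {"score": 3}},
]


def to_json(records: List[Dict[str, Any]]) -> str:
    """JSON keeps nested lists and dicts as they are."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: List[Dict[str, Any]]) -> str:
    """Flatten nested values to JSON strings so each post fits in one CSV row."""
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()
        })
    return buffer.getvalue()


print(to_json(posts))
print(to_csv(posts))
```

Parquet output would need a third-party library such as pyarrow and is left out of this sketch.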

🔧 Development

Project Structure

src/
├── app.py              # Main application and CLI
├── tui.py              # Terminal User Interface
├── config/             # Configuration management
│   └── settings.py
├── models/             # Pydantic data models
│   ├── base.py
│   ├── reddit.py
│   ├── bluesky.py
│   └── mastodon.py
├── database/           # Database abstraction
│   ├── base.py
│   ├── sqlite.py
│   └── factory.py
├── collectors/         # Data collectors
│   ├── base.py
│   ├── reddit.py
│   ├── bluesky.py
│   └── mastodon.py
└── utils/              # Utilities
    └── logger.py

Adding New Platforms

  1. Create a new collector class inheriting from BaseCollector
  2. Create platform-specific data models
  3. Update the database schema if needed
  4. Register the collector in the main application
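
The steps above might look roughly like the following. The `BaseCollector` defined here is a hypothetical stand-in so the example runs on its own; the real base class in `src/collectors/base.py` may expose a different interface. Only `is_enabled()` is taken from elsewhere in this README. The `register` decorator and the `exampleforum` platform are invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, Type


class BaseCollector(ABC):
    """Hypothetical base class; the project's real interface may differ."""

    platform: str = ""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    def is_enabled(self) -> bool:
        return bool(self.config.get("enabled", False))

    @abstractmethod
    def collect(self) -> Iterator[Dict[str, Any]]:
        """Yield posts normalized to the unified schema."""


# Step 4: a simple registry keyed by platform name.
COLLECTORS: Dict[str, Type[BaseCollector]] = {}


def register(cls: Type[BaseCollector]) -> Type[BaseCollector]:
    COLLECTORS[cls.platform] = cls
    return cls


# Step 1: the new collector.
@register
class ExampleForumCollector(BaseCollector):
    platform = "exampleforum"

    def collect(self) -> Iterator[Dict[str, Any]]:
        # Placeholder data; a real collector would page through an API here.
        for i in range(self.config.get("max_posts", 0)):
            yield {
                "platform": self.platform,
                "object_id": f"post-{i}",
                "text": f"Example post {i}",
            }


collector = COLLECTORS["exampleforum"]({"enabled": True, "max_posts": 2})
if collector.is_enabled():
    print([post["object_id"] for post in collector.collect()])
# ['post-0', 'post-1']
```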

Adding New Database Types

  1. Create a new database manager inheriting from DatabaseManager
  2. Implement all abstract methods
  3. Update the factory function
  4. Add configuration options
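
A sketch of the same idea for storage backends: an abstract manager, one toy implementation, and a factory that picks a backend from the `type` setting. The method names `create_tables` and `close` echo the programmatic usage shown in this README; `insert_posts`, the `memory` backend, and the `BACKENDS` table are assumptions made so the example is self-contained.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List


class DatabaseManager(ABC):
    """Hypothetical abstract interface; the project's real method names may differ."""

    @abstractmethod
    def create_tables(self) -> None: ...

    @abstractmethod
    def insert_posts(self, posts: List[Dict[str, Any]]) -> int: ...

    @abstractmethod
    def close(self) -> None: ...


class MemoryDatabase(DatabaseManager):
    """Toy backend that keeps rows in a list, standing in for a new database type."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config
        self.rows: List[Dict[str, Any]] = []

    def create_tables(self) -> None:
        self.rows.clear()

    def insert_posts(self, posts: List[Dict[str, Any]]) -> int:
        self.rows.extend(posts)
        return len(posts)

    def close(self) -> None:
        pass


# Step 3: the factory only needs a new entry per backend.
BACKENDS: Dict[str, Callable[[Dict[str, Any]], DatabaseManager]] = {
    "memory": MemoryDatabase,
}


def create_database(config: Dict[str, Any]) -> DatabaseManager:
    kind = config.get("type", "sqlite")
    if kind not in BACKENDS:
        raise ValueError(f"unsupported database type: {kind!r}")
    return BACKENDS[kind](config)


db = create_database({"type": "memory"})
db.create_tables()
inserted = db.insert_posts([{"platform": "reddit", "object_id": "abc"}])
db.close()
print(inserted)  # 1
```

Because every method on the base class is abstract, a backend that forgets one of them fails at construction time rather than midway through a collection run.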

Development Commands

# Install development dependencies
make install

# Run tests
make test

# Format code
make format

# Run linting
make lint

# Clean up
make clean

📊 Performance & Optimization

⚑ Click to expand Performance Details

Collection Performance

  • Reddit: ~15 posts per 2-3 seconds
  • Bluesky: ~50-100 posts per collection cycle
  • Mastodon: ~20-30 posts per collection cycle
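
As a back-of-the-envelope reading of the Reddit figure, the quoted batch rate can be converted to an hourly ceiling. This is plain arithmetic on the numbers above and assumes the rate is sustained; it ignores API rate limits, duplicates, and pauses between cycles. The Bluesky and Mastodon figures are per cycle, and since cycle length is not given they cannot be converted the same way.

```python
def posts_per_hour(posts_per_batch: float, seconds_per_batch: float) -> float:
    """Sustained hourly throughput for a given batch size and batch duration."""
    return posts_per_batch / seconds_per_batch * 3600


for seconds in (2, 3):
    print(f"{seconds}s per batch: about {posts_per_hour(15, seconds):,.0f} posts/hour")
# 2s per batch: about 27,000 posts/hour
# 3s per batch: about 18,000 posts/hour
```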

Optimization Features

  • Batch Processing: Efficient batch collection and database insertion
  • Deduplication: Prevents storing duplicate posts
  • Parallel Collection: Simultaneous collection from multiple platforms
  • Memory Management: Optimized memory usage for large datasets

Database Optimization

  • Indexed Fields: Optimized database indexes for fast queries
  • Connection Pooling: Efficient database connection management
  • Transaction Batching: Batch database operations for better performance
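
Deduplication, indexing, and transaction batching can all be illustrated with SQLite from the standard library. The table layout below is an assumption based on the unified schema, not the project's actual DDL: a unique key on `(platform, object_id)` makes repeated inserts harmless, and each batch is written in a single transaction.

```python
import sqlite3
from typing import Dict, List

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE posts (
        platform   TEXT NOT NULL,
        object_id  TEXT NOT NULL,
        text       TEXT,
        created_at TEXT,
        UNIQUE (platform, object_id)  -- the deduplication key
    )
    """
)
# Index a column that queries are likely to filter or sort on.
conn.execute("CREATE INDEX idx_posts_created_at ON posts (created_at)")


def count_posts(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


def insert_batch(conn: sqlite3.Connection, posts: List[Dict[str, str]]) -> int:
    """Insert a batch in one transaction and return how many rows were new."""
    before = count_posts(conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO posts (platform, object_id, text, created_at) "
            "VALUES (:platform, :object_id, :text, :created_at)",
            posts,
        )
    return count_posts(conn) - before


cycle_1 = [
    {"platform": "reddit", "object_id": "a", "text": "first", "created_at": "2024-01-01T12:00:00"},
    {"platform": "reddit", "object_id": "b", "text": "second", "created_at": "2024-01-01T12:01:00"},
]
cycle_2 = [
    {"platform": "reddit", "object_id": "b", "text": "second", "created_at": "2024-01-01T12:01:00"},
    {"platform": "bluesky", "object_id": "b", "text": "other", "created_at": "2024-01-01T12:02:00"},
]

first = insert_batch(conn, cycle_1)
second = insert_batch(conn, cycle_2)
print(first, second, count_posts(conn))
# 2 1 3
```

The second cycle re-sends one Reddit post, which is skipped, while the Bluesky post with the same `object_id` is kept because the key includes the platform.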

🐛 Troubleshooting

Common Issues

Reddit not collecting:

# Check Reddit credentials
make setup-credentials

# Test Reddit collector directly
python -c "from src.app import SocFlowApp; app = SocFlowApp(); print(app.collectors['reddit'].is_enabled())"

Database connection issues:

# Recreate database (this deletes all collected data)
rm -f data/socflow.db
make setup-db

TUI not displaying properly:

# Check terminal size
echo $COLUMNS $LINES

# Try different terminal
export TERM=xterm-256color

Debug Mode

# Enable debug logging
export SOCFLOW_DEBUG=1
make collect-tui

Performance Issues

# Check system resources
htop

# Monitor database size
ls -lh data/socflow.db

# Check collection logs
tail -f logs/socflow.log

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📜 Code of Conduct

Please read our Code of Conduct to understand our community guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Reddit API: PRAW for Reddit data access
  • Bluesky API: atproto for Bluesky integration
  • Mastodon API: Mastodon.py for Mastodon support
  • TUI Framework: Rich for beautiful terminal interfaces
  • Data Validation: Pydantic for type-safe data models

📞 Support
