# Overview:

This utility script processes and loads large volumes of financial data into a ScyllaDB database. It primarily targets minute-resolution, split-adjusted financial bars but can be adapted to other data structures. It also offers the `fetch_minute_data` function for retrieving specific minute-level datasets from the database. To keep loading and retrieval fast, the script relies on multiprocessing, batching, and asynchronous fetching.

---

## Features:

- **Parallel Processing:** Uses Python's `multiprocessing` library to process symbol files in parallel across a configurable number of worker processes.
- **Automatic Retries:** Retries an insertion when the database reports a write timeout, so a load can survive transient database errors.
- **Progress Monitoring:** Offers a real-time progress bar to keep track of the data loading process.
- **Error Logging:** Captures and logs errors encountered during data processing and loading for easier troubleshooting.
- **Data Preprocessing:** Preprocesses raw data files to extract relevant information and organize it into a more database-friendly format.
- **Custom Date Ranges:** Users can set specific start and end times using `fetch_minute_data`.
- **Trading Hours Filter:** Provides an option to retrieve data only from standard trading hours.
- **Month-based Bucketing System:** Organizes data for efficient retrieval.
- **Asynchronous Data Fetching:** Enables parallel data retrieval across different time intervals.
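
As a rough illustration of the preprocessing and month-based bucketing listed above, the sketch below turns one raw, headerless line into the row layout that the loader writes to its temporary file. The helper names `bucket_for` and `to_db_row` are hypothetical, and the column order (timestamp, open, high, low, close, volume) is assumed from the sample data shown later in this notebook.

```python
from datetime import datetime


def bucket_for(timestamp: str) -> str:
    """Return the 'YYYY-MM' bucket for a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return f"{parsed.year}-{parsed.month:02}"


def to_db_row(symbol: str, raw_line: str) -> str:
    """Prefix a raw bar with its symbol and monthly bucket (hypothetical helper)."""
    fields = raw_line.strip().split(",")
    timestamp = fields[0]
    return ",".join([symbol, bucket_for(timestamp), timestamp, *fields[1:]])


# One line taken from the sample file for symbol 'A'
print(to_db_row("A", "2005-01-03 09:30:00,17.2389,17.2389,17.2318,17.2389,173492.0"))
```

Keeping the bucket in the partition key means each (symbol, month) pair forms its own partition, which bounds partition size for long minute-level histories.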

---

## Setup and Usage:

### Prerequisites:
- Linux
- Python 3.11.4
- ScyllaDB
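- Python standard library modules: `os`, `sys`, `shutil`, `subprocess`, `time`, `multiprocessing`, `threading`, `datetime`
- Third-party Python packages: `cassandra-driver`, `tqdm`, `pandas`
- `cqlsh` (used for the `COPY ... FROM` bulk load)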

### Configuration:
1. Adjust the constants at the beginning of the script, such as `KEYSPACE`, `TABLE`, and `CSV_PATH`, to match your setup.
2. Ensure the ScyllaDB cluster is up and accessible from the script execution environment.
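
One quick way to check the second point before running the loader is to open a plain TCP connection to the node's CQL port. This is only a sketch: the function name `can_reach` is hypothetical, the address below is a placeholder, and an open port does not prove that the node is healthy or that authentication will succeed.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False


# Placeholder address; substitute your own SCYLLA_NODE_IP and port.
print(can_reach("127.0.0.1", 9042, timeout=1.0))
```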

### Execution:
- Run the cells sequentially. By default, the script processes the first 100 `.txt` files (sorted by symbol) in the directory specified by `CSV_PATH`. Change the slice applied to `files_list` to process a different number of files.

---

## Function Descriptions:

- `divide_list_into_chunks()`: Splits a list into approximately equal-sized chunks for parallel processing.
- `list_files()`: Lists all '.txt' files in the specified directory.
- `create_keyspace_and_table()`: Establishes a connection to ScyllaDB and creates a keyspace and table if not already present.
- `clear_temp_folder()`: Clears any existing temporary files before processing starts.
- `log_error()`: Logs errors encountered during data processing.
- `execute_with_retry()`: Retries data insertion in case of transient database errors.
- `get_bucket_from_timestamp()`: Extracts year-month info from timestamps, useful for bucketing data.
- `preprocess_and_load()`: Processes each financial data file and loads the data into ScyllaDB.
- `process_chunk()`: Handles a chunk of data files, invoking preprocessing and loading for each.
- `monitor_progress()`: A thread-safe function that updates the progress bar as data files are processed.
- `fetch_minute_data()`: Retrieves minute-level financial data from the database based on specified criteria.
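
To make "approximately equal-sized chunks" concrete: the notebook's `divide_list_into_chunks()` uses floor division and puts any remainder in the last chunk. The sketch below reproduces those chunk sizes and, for comparison, shows an alternative that spreads the remainder across the first chunks. Both function names are hypothetical, and `split_evenly` is not part of the notebook.

```python
def chunk_sizes_floor(n_items: int, n_chunks: int) -> list:
    """Chunk sizes when the last chunk absorbs the remainder (notebook behavior)."""
    size = n_items // n_chunks
    return [size] * (n_chunks - 1) + [n_items - size * (n_chunks - 1)]


def split_evenly(items: list, n_chunks: int) -> list:
    """Alternative: spread the remainder so chunk sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


symbols = [f"SYM{i}" for i in range(13)]
print(chunk_sizes_floor(len(symbols), 5))          # [2, 2, 2, 2, 5]
print([len(c) for c in split_evenly(symbols, 5)])  # [3, 3, 3, 2, 2]
```

With 100 files and 5 processes, as in the default run, both approaches give five chunks of 20, so the difference only matters when the file count is not a multiple of the process count.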

---

### XFS Filesystem:

ScyllaDB, being a high-performance NoSQL database, has specific requirements when it comes to disk I/O. It's crucial to choose the right filesystem to achieve optimal performance. Here's why XFS is recommended for ScyllaDB:

- **High Throughput:** XFS is designed for high parallelism, which is in line with ScyllaDB's architecture. This ensures the database can manage multiple read/write operations efficiently.

- **Delayed Allocation:** XFS improves disk performance with its delayed allocation feature, helping in reducing fragmentation.

- **Scalability:** XFS supports filesystems up to 8 exabytes, which makes it suitable for ScyllaDB deployments that might need to handle large volumes of data.

- **Journaling Capability:** XFS has a robust journaling mechanism that ensures data integrity even in case of system crashes. This feature is crucial for databases like ScyllaDB, where data integrity is paramount.

- **Optimized for Large Files:** ScyllaDB often deals with large SSTable files. XFS's design is optimized for handling large files, ensuring quick reads and writes.

- **Built-in Utilities:** XFS comes with a set of administrative utilities like `xfs_repair` (for repairing the filesystem) and `xfs_growfs` (for resizing the filesystem), which can be beneficial for database management.

When deploying ScyllaDB, using the XFS filesystem ensures you are aligning with ScyllaDB’s design principles and maximizing the database's performance capabilities.
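
To confirm which filesystem backs a given directory, one option on Linux is to look up its mount point in `/proc/mounts`. The sketch below parses text in that format and returns the filesystem type of the longest matching mount point. The function name `filesystem_type` is hypothetical, the sample mount table is made up for illustration, and `/var/lib/scylla` is assumed here as the data directory; adjust it to your installation.

```python
import os
from typing import Optional


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the mount point that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        mount_point, fs_type = parts[1], parts[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the deepest mount point that is a parent of (or equal to) path
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Made-up mount table for illustration only
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/nvme0n1 /var/lib/scylla xfs rw,noatime 0 0
"""
print(filesystem_type("/var/lib/scylla/data", SAMPLE_MOUNTS))
```

On a real host, pass the contents of `/proc/mounts` instead of `SAMPLE_MOUNTS`. Mount points containing spaces are escaped in that file and are not handled by this sketch.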

**Note:** Always ensure you have backups of your data and that you've set up appropriate monitoring and alerting for your ScyllaDB cluster. Regularly check the error log for any issues during data loading.


In [1]:
import os
import sys
import shutil
import subprocess
import time
import multiprocessing as mp
from cassandra.cluster import Cluster
from tqdm.notebook import tqdm
from datetime import datetime
from threading import Thread


In [2]:
# Constants for database, paths, and retries
KEYSPACE = 'financial_data'
TABLE = 'test_data_bars_1m_adjsplit'
CSV_PATH = '/home/jj/anaconda3/envs/stocks/Database/1M/data'
CSV_PATH_TEMP = '/home/jj/anaconda3/envs/stocks/Database/1M/temp'
CSV_PATH_LOG = '/home/jj/anaconda3/envs/stocks/Database/1M/log'
MAX_RETRIES = 5
RETRY_PAUSE = 60  # Duration in seconds to pause between retries
SCYLLA_NODE_IP = '192.168.3.41' # Node IP address
SCYLLA_NODE_PORT = 9042 # Node port (integer, as expected by the driver)

def divide_list_into_chunks(lst, m):
    # Split a list into m chunks of n // m items; the last chunk absorbs any remainder
    n = len(lst)
    chunk_size = n // m
    for i in range(0, m - 1):
        yield lst[i * chunk_size : (i + 1) * chunk_size]
    yield lst[(m - 1) * chunk_size:]

def list_files(path):
    # Lists all '.txt' files in the provided directory path
    all_files = os.listdir(path)
    files = list(filter(lambda f: f.endswith('.txt'), all_files))
    return files
    

In [3]:
import pandas as pd

symbol = 'A'
sample_data = pd.read_csv(f"{CSV_PATH}/{symbol}_full_1min_adjsplit.txt", header=None)
sample_data

Unnamed: 0,0,1,2,3,4,5
0,2005-01-03 09:30:00,17.2389,17.2389,17.2318,17.2389,173492.0
1,2005-01-03 09:31:00,17.2389,17.2389,17.2246,17.2246,9646.0
2,2005-01-03 09:32:00,17.2246,17.2389,17.2246,17.2318,34810.0
3,2005-01-03 09:33:00,17.2318,17.2389,17.2318,17.2389,11464.0
4,2005-01-03 09:34:00,17.2389,17.2389,17.2318,17.2318,12302.0
...,...,...,...,...,...,...
1835693,2023-10-05 15:56:00,110.3200,110.3400,110.2800,110.3200,8780.0
1835694,2023-10-05 15:57:00,110.3100,110.4350,110.3100,110.3800,13658.0
1835695,2023-10-05 15:58:00,110.3750,110.4200,110.3650,110.4200,12518.0
1835696,2023-10-05 15:59:00,110.4200,110.4200,110.3100,110.3200,30497.0


In [4]:
def clear_temp_folder():
    """Purges the contents of the designated temporary directory."""
    
    for filename in os.listdir(CSV_PATH_TEMP):
        file_path = os.path.join(CSV_PATH_TEMP, filename)
        
        try:
            # Identify/remove files or symbolic links
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            # Recognize/delete directories with their contents
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            # Report any deletion issues
            print(f'Failed to delete {file_path}. Reason: {e}')

def log_error(symbol, error_message):
    """
    Logs error messages for specific financial symbols to a log file.

    Parameters:
        symbol (str): The financial instrument identifier.
        error_message (str): Description of the issue.
    """
    
    log_file = os.path.join(CSV_PATH_LOG, "error_log.txt")
    
    with open(log_file, "a") as file:
        file.write(f"Symbol: {symbol} - Error: {error_message}\n")

def execute_with_retry(command, symbol):
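    """
    Executes a command and retries when its stderr reports a WriteTimeout
    or when running the command raises an exception.
    
    Parameters:
        command (list): The command as a list of strings.
        symbol (str): The financial symbol for the command.
        
    Returns:
        subprocess.CompletedProcess or None: The executed command result,
        or None if every attempt failed. Errors other than WriteTimeout
        are not retried; the result is returned to the caller as-is.
    """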
    
    # Initialize retry count
    retries = 0
    
    # Keep trying until max retries
    while retries < MAX_RETRIES:
        try:
            # Run the command, capture output
            result = subprocess.run(command, capture_output=True, text=True)
            
            # Return result if no "WriteTimeout" in error
            if "WriteTimeout" not in result.stderr:
                return result
            else:
                # Else, trigger a retry
                raise Exception("WriteTimeout encountered")
        except Exception as e:
            # Increment retry count if exception
            retries += 1
            
            # Log the error for the symbol
            log_error(symbol, str(e))
            
            # Pause before next retry
            time.sleep(RETRY_PAUSE)
    
    # If here, max retries attempted and failed
    print(f"Max retries reached for {symbol}. Moving on...")
    return None


In [5]:
def create_keyspace_and_table():
    """
    Establishes a connection to ScyllaDB and creates the keyspace and table 
    if they are not already present. This is the initial setup step that 
    prepares the database for incoming data. The table stores financial data 
    such as stock prices and is laid out for efficient time-based queries.
      
    Steps:
    1. Connect to the Scylla cluster.
    2. Create keyspace if absent. It uses a simple replication strategy with 
       a factor of 1, implying data on one node. This may not be optimal for 
       production systems where redundancy is vital.
    3. Set active keyspace for the session.
    4. Design and initialize the table. Schema has:
       - Composite primary key for efficient time-based queries.
       - 'TimeWindowCompactionStrategy' for time series data in 30-day windows.
    5. Terminate the session and connection after operations.
    """
    
    cluster = Cluster([SCYLLA_NODE_IP], port=SCYLLA_NODE_PORT)
    session = cluster.connect()
    
    session.execute(f"CREATE KEYSPACE IF NOT EXISTS {KEYSPACE} "
                    f"WITH replication = {{'class': 'SimpleStrategy', "
                    f"'replication_factor' : 1}};")
    session.set_keyspace(KEYSPACE)
    
    session.execute(f"""
    CREATE TABLE IF NOT EXISTS {KEYSPACE}.{TABLE} (
        symbol text,
        bucket text,
        timestamp text,
        open float,
        high float,
        low float,
        close float,
        volume float,
        PRIMARY KEY ((symbol, bucket), timestamp)
    ) WITH compaction = {{
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 30
    }};
    """)
    
    # The decision to not use the default_time_to_live property in the table 
    # definition was made to ensure continuous access to historical financial data.
    
    session.shutdown()
    cluster.shutdown()

def get_bucket_from_timestamp(timestamp):
    dt_obj = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    return f"{dt_obj.year}-{dt_obj.month:02}"

def preprocess_and_load(symbol):
    """
    Preprocesses and loads data for a symbol into the database.
    
    Parameters:
        symbol (str): The symbol to be processed.
        
    Returns:
        tuple: (line_count, time_taken), the number of lines processed for
        the symbol and the seconds spent loading them (COPY plus temp-file
        cleanup).
    """
    input_file = os.path.join(CSV_PATH, f"{symbol}_full_1min_adjsplit.txt")
    temp_file = os.path.join(CSV_PATH_TEMP, f"{symbol}_temp.txt")
    
    line_count = 0
    
    with open(input_file, 'r') as source, open(temp_file, 'w') as target:
        for line in source:
            line_count += 1
            data = line.strip().split(',')
            timestamp = data[0]
            bucket = get_bucket_from_timestamp(timestamp)
            target.write(f"{symbol},{bucket},{timestamp},{','.join(data[1:])}\n")
    
    start_time = time.time()
    copy_cmd = ['/home/jj/anaconda3/envs/stocks/bin/cqlsh', '-e',
                f"COPY {KEYSPACE}.{TABLE} FROM '{temp_file}' "
                f"WITH DELIMITER=',' AND HEADER=FALSE"]
    
    result = execute_with_retry(copy_cmd, symbol)
    os.remove(temp_file)
    end_time = time.time()
    time_taken = end_time - start_time
    
    return line_count, time_taken

def process_chunk(chunk):
    """
    Process a subset (chunk) of financial symbol data files.
    Preprocesses and loads data for each symbol in the chunk.
    """
    total_rows = 0
    total_time = 0
    chunk_start_time = time.time()  # Mark the start time for the chunk processing

    for symbol in chunk:
        rows, time_taken = preprocess_and_load(symbol)
        total_rows += rows
        total_time += time_taken
        progress_queue.put(1)

    chunk_end_time = time.time()  # Mark the end time for the chunk processing
    chunk_total_time = chunk_end_time - chunk_start_time  # Calculate total time taken for the chunk

    return total_rows, total_time, round(chunk_total_time, 2)  # Return the total time taken for the chunk


def monitor_progress():
    """
    Monitors the progress of the data loading process.
    Counts processed symbols and updates the progress bar.
    """
    processed = 0
    while processed < len(files_list):
        _ = progress_queue.get()
        pbar.update(1)
        processed += 1


In [6]:
if __name__ == "__main__":
    # Main script execution starts.
    
    # Clear any existing temporary files.
    clear_temp_folder()
    # Create the required keyspace and table in ScyllaDB.
    create_keyspace_and_table()
    # List all the financial data files.
    files = list_files(CSV_PATH)
    # Extract symbol names from file names.
    files_list = [element.replace('_full_1min_adjsplit.txt', '') for element in files]
    # Sort the file list for consistent processing order.
    files_list.sort()
    # Initialize a multiprocessing queue to track progress.
    progress_queue = mp.Queue()
    processes = 5
    # (Optional) Limit to the first 100 files for processing.
    files_list = files_list[0:100]
    # Divide the file list into chunks for parallel processing.
    chunks = divide_list_into_chunks(files_list, processes)
    # Initialize a progress bar using tqdm.
    pbar = tqdm(total=len(files_list))
    # Start a separate thread to monitor processing progress.
    t = Thread(target=monitor_progress)
    t.start()
    with mp.Pool(processes=processes) as pool:
        global_start_time = time.time()
        results = []
        for chunk in chunks:
            results.append(pool.apply_async(process_chunk, (chunk,)))
        results = [res.get() for res in results]
        global_end_time = time.time()
        
        # Aggregate the number of rows processed and time taken across all chunks
        total_rows_aggregated = sum([r[0] for r in results])
        total_time_aggregated = sum([r[1] for r in results])
        process_times = [r[2] for r in results]  # List of total execution times for each process

        print(f"\nTotal rows inserted across processes: {total_rows_aggregated:,.0f} rows")
        print(f"\nTotal time for all processes: {global_end_time - global_start_time:.2f} seconds")
        print(f"Average insertion rate for all processes combined: {total_rows_aggregated / (global_end_time - global_start_time):,.0f} rows/sec")
        print(f"Average individual insertion rate: {total_rows_aggregated / total_time_aggregated:,.0f} rows/sec")
        print(f"\nIndividual process times: {process_times}")

    t.join()

  0%|          | 0/100 [00:00<?, ?it/s]


Total rows inserted across processes: 57,091,172 rows

Total time for all processes: 229.69 seconds
Average insertion rate for all processes combined: 248,552 rows/sec
Average individual insertion rate: 80,665 rows/sec

Individual process times: [229.69, 166.77, 172.61, 148.56, 206.05]
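
The figures above can be cross-checked with a little arithmetic. The sketch below recomputes the combined rate from the printed totals and derives two quantities that the run does not print directly: the summed COPY time implied by the "individual" rate, and how long each worker waited for the slowest one. The second quantity assumes all workers started at roughly the same time. Small differences from the printed values are expected because the printed elapsed time is rounded to two decimals.

```python
# Figures copied from the run output above
total_rows = 57_091_172
wall_seconds = 229.69
process_seconds = [229.69, 166.77, 172.61, 148.56, 206.05]
per_copy_rate = 80_665  # rows/sec, the "average individual insertion rate"

combined_rate = total_rows / wall_seconds
copy_seconds = total_rows / per_copy_rate   # implied sum of per-symbol load times
busy_seconds = sum(process_seconds)         # summed per-process wall time
idle_tail = [round(wall_seconds - s, 2) for s in process_seconds]

print(f"combined rate: {combined_rate:,.0f} rows/sec")
print(f"implied load time: {copy_seconds:,.0f} s of {busy_seconds:,.2f} s total process time")
print(f"seconds each worker waited for the slowest one: {idle_tail}")
```

The spread in per-process times suggests that equal file counts per chunk do not guarantee equal work, since symbols differ widely in row count.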


## Data Fetcher from the Database

### `fetch_minute_data` Function Overview:

The `fetch_minute_data` function is designed to pull minute-level financial data from a ScyllaDB (or Cassandra) database. With the assistance of pandas, it provides several functionalities:

- **Custom Date Ranges:** Users can set specific start and end times for fetching the dataset.
- **Trading Hours Filter:** Provides an option to retrieve data only from standard trading hours.
- **Month-based Bucketing System:** Organizes data for efficient retrieval.
- **Asynchronous Data Fetching:** Enables parallel data retrieval across different time intervals.
- **Data Rounding:** Users have the option to round numerical values in the dataset to a precision of four decimal places.

For a smooth operation, ensure there's an operational ScyllaDB or Cassandra cluster, an appropriate Python environment, and the necessary keyspace and table references.
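
Because the table is partitioned by `(symbol, bucket)`, a query over a date range has to address every monthly bucket that the range touches. The sketch below shows one way to enumerate those buckets with integer month arithmetic. The name `month_buckets` is hypothetical and this is not the helper used inside `fetch_minute_data`.

```python
from datetime import datetime


def month_buckets(start_time: str, end_time: str) -> list:
    """List 'YYYY-MM' buckets from the start month through the end month."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(start_time, fmt)
    end = datetime.strptime(end_time, fmt)
    # Count months since year 0 so that stepping by one never skips a month
    first = start.year * 12 + (start.month - 1)
    last = end.year * 12 + (end.month - 1)
    return [f"{m // 12}-{m % 12 + 1:02}" for m in range(first, last + 1)]


print(month_buckets("2022-11-30 09:30:00", "2023-02-01 16:00:00"))
# ['2022-11', '2022-12', '2023-01', '2023-02']
```

Each bucket then maps to one single-partition query, which is what makes issuing the queries asynchronously worthwhile.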


In [26]:
from datetime import datetime, timedelta
import pandas as pd
from cassandra.cluster import Cluster

def fetch_minute_data(symbol, start_time, end_time, KEYSPACE, TABLE, trading_hours_only, rounding):
    """
    Fetch minute-level financial data for a symbol and timeframe from ScyllaDB/Cassandra.
    
    Parameters:
    - symbol (str): Ticker symbol for data retrieval.
    - start_time (str): Start of desired timeframe 'YYYY-MM-DD HH:MM:SS'.
    - end_time (str): End of desired timeframe 'YYYY-MM-DD HH:MM:SS'.
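    - KEYSPACE (str): ScyllaDB/Cassandra keyspace to connect to.
    - TABLE (str): Table within the keyspace to query.
    - trading_hours_only (bool): If True, keep only bars between 09:30 and 16:00.
    - rounding (bool): If True, round numeric values to four decimal places.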
    
    Returns:
    - dataframe (pd.DataFrame): pandas DataFrame with queried financial data.
    """
    
    # Helper to generate monthly bucket strings for data partitioning
    def generate_monthly_buckets(start_time, end_time):
        start_date = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S')
        end_date = datetime.strptime(end_time, '%Y-%m-%d %H:%M:%S')
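        # Start on the first day of the month; otherwise adding 31 days to a
        # late-month date (e.g. Jan 31) would jump past the following month
        current_date = datetime(start_date.year, start_date.month, 1)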
        buckets = []
        
        # Loop through each month to generate bucket strings
        while current_date <= end_date:
            buckets.append(f"{current_date.year}-{current_date.month:02}")
            
            # Move to next month
            current_date = current_date + timedelta(days=31)
            current_date = datetime(current_date.year, current_date.month, 1)
        return buckets

    # Generate necessary bucket strings for timeframe
    buckets = generate_monthly_buckets(start_time, end_time)
    
    # List to store dataframes for each bucket
    dfs = []
    
    # Connect to ScyllaDB/Cassandra cluster and keyspace
    cluster = Cluster([SCYLLA_NODE_IP], port=SCYLLA_NODE_PORT)
    session = cluster.connect(KEYSPACE)
    
    # Prepare data fetch query
    query = f"""SELECT timestamp, open, high, low, close, volume 
                FROM {TABLE} WHERE symbol = ? AND bucket = ? AND 
                \"timestamp\" >= ? AND \"timestamp\" <= ?
            """
    prepared = session.prepare(query)
    futures = []

    # Asynchronously fetch data for each bucket
    for bucket in buckets:
        future = session.execute_async(prepared, (symbol, bucket, start_time, end_time))
        futures.append(future)

    # Convert query results to pandas dataframes
    for future in futures:
        rows = future.result()
        dfs.append(pd.DataFrame(list(rows)))

    # Close session and cluster connection after fetching
    session.shutdown()
    cluster.shutdown()
    
    # Combine all dataframes; return early when no rows matched
    dataframe = pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
    if dataframe.empty:
        return dataframe
    
    # Set timestamp as index and remove timestamp column
    dataframe.index = dataframe['timestamp'].astype('datetime64[ns]') # type: ignore
    dataframe.drop('timestamp', inplace=True, axis=1)

    if trading_hours_only:
        dataframe = dataframe.between_time('09:30:00', '16:00:00')
    if rounding:
        dataframe = dataframe.round(4)

    return dataframe


In [27]:
symbol = 'A'
KEYSPACE = 'financial_data'
TABLE = 'test_data_bars_1m_adjsplit'
ohlc_df = fetch_minute_data(
    symbol, 
    '2005-01-01 00:00:00', 
    '2023-10-05 23:59:00',                      
    KEYSPACE, 
    TABLE, 
    trading_hours_only=False, 
    rounding=True
)
ohlc_df

timestamp,open,high,low,close,volume
2005-01-03 09:30:00,17.2389,17.2389,17.2318,17.2389,173492.0
2005-01-03 09:31:00,17.2389,17.2389,17.2246,17.2246,9646.0
2005-01-03 09:32:00,17.2246,17.2389,17.2246,17.2318,34810.0
2005-01-03 09:33:00,17.2318,17.2389,17.2318,17.2389,11464.0
2005-01-03 09:34:00,17.2389,17.2389,17.2318,17.2318,12302.0
...,...,...,...,...,...
2023-10-05 15:56:00,110.3200,110.3400,110.2800,110.3200,8780.0
2023-10-05 15:57:00,110.3100,110.4350,110.3100,110.3800,13658.0
2023-10-05 15:58:00,110.3750,110.4200,110.3650,110.4200,12518.0
2023-10-05 15:59:00,110.4200,110.4200,110.3100,110.3200,30497.0


In [23]:
sample_data # To visually compare with original data

Unnamed: 0,0,1,2,3,4,5
0,2005-01-03 09:30:00,17.2389,17.2389,17.2318,17.2389,173492.0
1,2005-01-03 09:31:00,17.2389,17.2389,17.2246,17.2246,9646.0
2,2005-01-03 09:32:00,17.2246,17.2389,17.2246,17.2318,34810.0
3,2005-01-03 09:33:00,17.2318,17.2389,17.2318,17.2389,11464.0
4,2005-01-03 09:34:00,17.2389,17.2389,17.2318,17.2318,12302.0
...,...,...,...,...,...,...
1835693,2023-10-05 15:56:00,110.3200,110.3400,110.2800,110.3200,8780.0
1835694,2023-10-05 15:57:00,110.3100,110.4350,110.3100,110.3800,13658.0
1835695,2023-10-05 15:58:00,110.3750,110.4200,110.3650,110.4200,12518.0
1835696,2023-10-05 15:59:00,110.4200,110.4200,110.3100,110.3200,30497.0
