# VN30 to Qlib Format Conversion

This notebook converts raw VN30 CSV price data into qlib's binary format so that it can be loaded and processed quickly.

## Overview
- **Input**: Raw VN30 CSV data in `data/symbols/{symbol}/raw/historical_price.csv`
- **Output**: Qlib format data in `data/symbols/{symbol}/qlib/`
- **Process**: Data validation → Cleaning → Qlib format conversion → Binary format

## Requirements
- qlib library installed
- VN30 raw data available
- Python 3.12+ environment
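
The requirements above can be checked before running the rest of the notebook. The following is a minimal sketch; `check_environment` is a hypothetical helper and not part of the project. It assumes the library is importable under the module name `qlib`.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 12), package="qlib"):
    """Report whether the interpreter and an importable package meet the requirements.

    Hypothetical helper for illustration; not part of the converter module.
    """
    return {
        # Compare only (major, minor) against the required minimum version.
        "python_ok": sys.version_info[:2] >= min_python,
        # find_spec returns None when a top-level module cannot be located.
        "package_installed": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```

If either value is `False`, fix the environment first; the import in the next cell would otherwise fail.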

In [1]:
# Import required libraries
import sys
from pathlib import Path
import pandas as pd
import logging

# Add src to path for imports
project_root = Path.cwd().parent.parent
src_path = project_root / 'src'
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

# Import VN30 to Qlib converter
from data_processing.vn30_to_qlib_converter import Vn30ToQlibConverter

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print(f"Project root: {project_root}")

Project root: e:\finance\StocketAI


In [3]:
# Configuration
SYMBOLS_DIR = project_root / "data/symbols"
VN30_SYMBOLS_FILE = project_root / "data/symbols/vn30_constituents.csv"

# Load VN30 symbols
symbols_df = pd.read_csv(VN30_SYMBOLS_FILE)
symbols = symbols_df['symbol'].tolist()

print(f"Loaded {len(symbols)} VN30 symbols")
print(f"Sample symbols: {symbols[:5]}")

Loaded 30 VN30 symbols
Sample symbols: ['ACB', 'BCM', 'BID', 'CTG', 'DGC']


In [4]:
# Initialize converter
converter = Vn30ToQlibConverter(SYMBOLS_DIR)

print(f"Converter initialized with symbols directory: {SYMBOLS_DIR}")
print(f"Validation rules: {converter.validation_rules}")

Converter initialized with symbols directory: e:\finance\StocketAI\data\symbols
Validation rules: {'required_columns': ['time', 'open', 'high', 'low', 'close', 'volume'], 'date_format': '%Y-%m-%d', 'min_data_points': 100, 'max_missing_ratio': 0.05}
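
To make the printed rules concrete, here is a sketch of how such a rule set could be applied to a DataFrame. The converter's actual logic is not shown in this notebook, so `check_rules` is hypothetical. In particular, treating `max_missing_ratio` as the worst per-column share of missing values is an assumption.

```python
import pandas as pd

# Rule values copied from the converter output above.
RULES = {
    "required_columns": ["time", "open", "high", "low", "close", "volume"],
    "date_format": "%Y-%m-%d",
    "min_data_points": 100,
    "max_missing_ratio": 0.05,
}


def check_rules(df, rules=RULES):
    """Hypothetical validator returning (is_valid, error_message)."""
    missing = [c for c in rules["required_columns"] if c not in df.columns]
    if missing:
        return False, f"missing columns: {missing}"
    if len(df) < rules["min_data_points"]:
        return False, f"only {len(df)} rows, need {rules['min_data_points']}"
    # Dates that do not match the expected format are coerced to NaT.
    dates = pd.to_datetime(df["time"], format=rules["date_format"], errors="coerce")
    if dates.isna().any():
        return False, "unparseable dates"
    # Assumption: the ratio is evaluated per column and the worst column decides.
    ratio = df[rules["required_columns"]].isna().mean().max()
    if ratio > rules["max_missing_ratio"]:
        return False, f"missing ratio {ratio:.2%} exceeds limit"
    return True, ""
```

Returning a `(bool, str)` pair mirrors the way `validate_raw_data` is consumed in the next cell.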


In [5]:
# Test data validation for first symbol
test_symbol = symbols[0]
is_valid, error_msg = converter.validate_raw_data(test_symbol)

print(f"Data validation for {test_symbol}:")
print(f"Valid: {is_valid}")
if not is_valid:
    print(f"Error: {error_msg}")
else:
    print("✅ Data validation passed")

2025-10-06 12:57:22,414 - data_processing.vn30_to_qlib_converter - INFO - Data validation passed for ACB: 2750 records


Data validation for ACB:
Valid: True
✅ Data validation passed


In [6]:
# Test data cleaning for first symbol
cleaned_df = converter.clean_raw_data(test_symbol)

if cleaned_df is not None:
    print(f"Data cleaning successful: {len(cleaned_df)} records")
    print(f"Columns: {list(cleaned_df.columns)}")
    print(f"Date range: {cleaned_df['time'].min()} to {cleaned_df['time'].max()}")
    print(f"Missing values: {cleaned_df.isnull().sum().sum()}")
else:
    print("❌ Data cleaning failed")

  df = df.fillna(method='ffill', limit=5)  # Fill up to 5 consecutive missing days
2025-10-06 12:57:28,408 - data_processing.vn30_to_qlib_converter - INFO - Cleaned data for ACB: 3622 records


Data cleaning successful: 3622 records
Columns: ['time', 'open', 'high', 'low', 'close', 'volume', 'data_source']
Date range: 2015-10-05 00:00:00 to 2025-10-03 00:00:00
Missing values: 0
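
The stray `df = df.fillna(method='ffill', limit=5)` line in the output above appears to be the tail of a pandas warning. Recent pandas versions deprecate the `method=` argument of `fillna` in favour of `DataFrame.ffill`. The sketch below uses made-up prices and a limit of 2 (instead of 5) to keep the example short, and shows the replacement call and what `limit` does.

```python
import numpy as np
import pandas as pd

# Toy series with a gap of three consecutive missing values (illustrative data).
prices = pd.DataFrame(
    {"close": [10.0, np.nan, np.nan, np.nan, 11.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# ffill(limit=n) carries the last valid value forward over at most n
# consecutive gaps; anything beyond that stays missing.
filled = prices.ffill(limit=2)
print(filled)
```

With `limit=2`, the first two gaps are filled with `10.0` and the third stays `NaN`, so long gaps remain visible instead of being silently bridged.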


In [7]:
# Test qlib format preparation
if cleaned_df is not None:
    qlib_df = converter.prepare_qlib_format(test_symbol, cleaned_df)
    
    if qlib_df is not None:
        print(f"Qlib format preparation successful: {len(qlib_df)} records")
        print(f"Columns: {list(qlib_df.columns)}")
        print("Sample data:")
        print(qlib_df.head())
    else:
        print("❌ Qlib format preparation failed")
else:
    print("❌ Cannot test format preparation: cleaned data is None")

2025-10-06 12:57:36,789 - data_processing.vn30_to_qlib_converter - INFO - Prepared qlib format for ACB: 3622 records


Qlib format preparation successful: 3622 records
Columns: ['time', 'symbol', 'open', 'high', 'low', 'close', 'volume']
Sample data:
        time symbol  open  high   low  close    volume
0 2015-10-05    ACB  3.11  3.14  3.09   3.13  220204.0
1 2015-10-06    ACB  3.14  3.27  3.16   3.22  895315.0
2 2015-10-07    ACB  3.26  3.22  3.14   3.18  188078.0
3 2015-10-08    ACB  3.16  3.28  3.14   3.22  541780.0
4 2015-10-09    ACB  3.27  3.31  3.26   3.29  557772.0
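
In the sample above, two of the five rows have an open price outside the day's `[low, high]` range (rows 1 and 2). Whether this comes from the raw source or from a later adjustment is not visible here, so it is worth checking explicitly. The sketch below copies the five printed rows; `ohlc_violations` is a hypothetical helper.

```python
import pandas as pd

# The five rows printed by qlib_df.head() above.
sample = pd.DataFrame({
    "open": [3.11, 3.14, 3.26, 3.16, 3.27],
    "high": [3.14, 3.27, 3.22, 3.28, 3.31],
    "low": [3.09, 3.16, 3.14, 3.14, 3.26],
    "close": [3.13, 3.22, 3.18, 3.22, 3.29],
})


def ohlc_violations(df):
    """Return rows where high is below, or low is above, the open/close range."""
    body_max = df[["open", "close"]].max(axis=1)
    body_min = df[["open", "close"]].min(axis=1)
    bad = (df["high"] < body_max) | (df["low"] > body_min)
    return df[bad]


print(ohlc_violations(sample))
```

Running the same check on the full `qlib_df` would show how widespread the inconsistency is before the data is used for modelling.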


In [8]:
# Convert single symbol to qlib format
print(f"Converting {test_symbol} to qlib format...")
success = converter.convert_symbol_to_qlib(test_symbol)

print(f"Conversion result: {'✅ Success' if success else '❌ Failed'}")

# Check output files
qlib_dir = Path(SYMBOLS_DIR) / test_symbol / 'qlib'
if qlib_dir.exists():
    output_files = list(qlib_dir.glob('*'))
    print(f"Output files created: {len(output_files)}")
    for file in output_files:
        size_mb = file.stat().st_size / (1024 * 1024)
        print(f"  - {file.name} ({size_mb:.2f} MB)")
else:
    print("❌ Qlib directory not created")

2025-10-06 12:57:45,099 - data_processing.vn30_to_qlib_converter - INFO - Converting ACB to qlib format...
2025-10-06 12:57:45,111 - data_processing.vn30_to_qlib_converter - INFO - Data validation passed for ACB: 2750 records
  df = df.fillna(method='ffill', limit=5)  # Fill up to 5 consecutive missing days
2025-10-06 12:57:45,139 - data_processing.vn30_to_qlib_converter - INFO - Cleaned data for ACB: 3622 records
2025-10-06 12:57:45,141 - data_processing.vn30_to_qlib_converter - INFO - Prepared qlib format for ACB: 3622 records
2025-10-06 12:57:45,212 - data_processing.vn30_to_qlib_converter - INFO - Saved qlib binary format data for ACB to e:\finance\StocketAI\data\symbols\ACB\qlib\acb.bin (binary fallback)


Converting ACB to qlib format...
Conversion result: ✅ Success
Output files created: 2
  - acb.bin (0.20 MB)
  - acb_temp.csv (0.16 MB)
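
The log labels this output a "binary fallback", and the result is a single `acb.bin` file next to a leftover `acb_temp.csv`. For comparison, qlib's own `dump_bin` script, as far as I understand it, writes one file per field as little-endian float32, with the first value holding the start index into the trading calendar. The sketch below round-trips that layout with NumPy. The file name `close.day.bin` and both helper functions are illustrative assumptions, and this is not necessarily what the converter's fallback produces.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_feature_bin(path, start_index, values):
    """Write one field as little-endian float32: [start_index, v0, v1, ...]."""
    np.hstack([start_index, values]).astype("<f").tofile(str(path))


def read_feature_bin(path):
    """Read the file back and split off the leading calendar start index."""
    raw = np.fromfile(str(path), dtype="<f")
    return int(raw[0]), raw[1:]


with tempfile.TemporaryDirectory() as tmp:
    bin_path = Path(tmp) / "close.day.bin"  # illustrative file name
    write_feature_bin(bin_path, 0, [3.13, 3.22, 3.18])
    start, closes = read_feature_bin(bin_path)

print(start, closes)
```

If the fallback format differs from this layout, the files may need to be re-dumped before qlib's data loader can read them.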


In [9]:
# Convert all VN30 symbols
print(f"Starting conversion of {len(symbols)} VN30 symbols to qlib format...")
print("=" * 60)

results = converter.convert_all_symbols(symbols)

print("=" * 60)
print("Conversion completed!")

2025-10-06 12:57:51,434 - data_processing.vn30_to_qlib_converter - INFO - Starting conversion of 30 symbols to qlib format...
2025-10-06 12:57:51,435 - data_processing.vn30_to_qlib_converter - INFO - Converting ACB to qlib format...
2025-10-06 12:57:51,446 - data_processing.vn30_to_qlib_converter - INFO - Data validation passed for ACB: 2750 records
  df = df.fillna(method='ffill', limit=5)  # Fill up to 5 consecutive missing days
2025-10-06 12:57:51,469 - data_processing.vn30_to_qlib_converter - INFO - Cleaned data for ACB: 3622 records
2025-10-06 12:57:51,472 - data_processing.vn30_to_qlib_converter - INFO - Prepared qlib format for ACB: 3622 records
2025-10-06 12:57:51,511 - data_processing.vn30_to_qlib_converter - INFO - Saved qlib binary format data for ACB to e:\finance\StocketAI\data\symbols\ACB\qlib\acb.bin (binary fallback)
2025-10-06 12:57:51,513 - data_processing.vn30_to_qlib_converter - INFO - Converting BCM to qlib format...
2025-10-06 12:57:51,526 - data_processing.vn30_t

Starting conversion of 30 VN30 symbols to qlib format...


  df = df.fillna(method='ffill', limit=5)  # Fill up to 5 consecutive missing days
2025-10-06 12:57:51,642 - data_processing.vn30_to_qlib_converter - INFO - Cleaned data for BID: 3624 records
2025-10-06 12:57:51,645 - data_processing.vn30_to_qlib_converter - INFO - Prepared qlib format for BID: 3624 records
2025-10-06 12:57:51,685 - data_processing.vn30_to_qlib_converter - INFO - Saved qlib binary format data for BID to e:\finance\StocketAI\data\symbols\BID\qlib\bid.bin (binary fallback)
2025-10-06 12:57:51,687 - data_processing.vn30_to_qlib_converter - INFO - Converting CTG to qlib format...
2025-10-06 12:57:51,701 - data_processing.vn30_to_qlib_converter - INFO - Data validation passed for CTG: 2750 records
  df = df.fillna(method='ffill', limit=5)  # Fill up to 5 consecutive missing days
2025-10-06 12:57:51,724 - data_processing.vn30_to_qlib_converter - INFO - Cleaned data for CTG: 3624 records
2025-10-06 12:57:51,727 - data_processing.vn30_to_qlib_converter - INFO - Prepared qlib f

Conversion completed!
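
The batch call returns a mapping from symbol to a success flag, which the next cell summarizes. A minimal sketch of such a loop follows. It assumes each symbol is converted independently and that an exception in one symbol should not abort the rest; `convert_all` and `convert_one` are hypothetical names, not the converter's API.

```python
import logging

logger = logging.getLogger(__name__)


def convert_all(symbols, convert_one):
    """Run convert_one(symbol) for each symbol and collect {symbol: bool}."""
    results = {}
    for symbol in symbols:
        try:
            results[symbol] = bool(convert_one(symbol))
        except Exception:
            # Log the traceback and continue with the remaining symbols.
            logger.exception("Conversion failed for %s", symbol)
            results[symbol] = False
    return results
```

For example, `convert_all(symbols, converter.convert_symbol_to_qlib)` would produce the same kind of dictionary that `results` holds here.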


In [10]:
# Analyze conversion results
successful = sum(results.values())
total = len(results)
failed = total - successful

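print("Conversion Summary:")
print(f"Total symbols: {total}")
print(f"Successful: {successful}")
print(f"Failed: {failed}")
# Guard against an empty results dict to avoid ZeroDivisionError
print(f"Success rate: {successful / total:.1%}" if total else "Success rate: n/a")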

if failed > 0:
    failed_symbols = [symbol for symbol, success in results.items() if not success]
    print("\nFailed symbols:")
    for symbol in failed_symbols:
        print(f"  - {symbol}")
else:
    print("\n✅ All symbols converted successfully!")

Conversion Summary:
Total symbols: 30
Successful: 30
Failed: 0
Success rate: 100.0%

✅ All symbols converted successfully!


In [11]:
# Generate conversion report
report = converter.generate_conversion_report(results)
print(report)

# Save report to file
report_file = Path(SYMBOLS_DIR) / 'conversion_report.txt'
with open(report_file, 'w', encoding='utf-8') as f:  # explicit encoding; the platform default varies
    f.write(report)
    
print(f"\nReport saved to: {report_file}")


VN30 Data Conversion Report

Total symbols processed: 30
Successful conversions: 30
Failed conversions: 0

Success rate: 100.0%

Failed symbols:


Report saved to: e:\finance\StocketAI\data\symbols\conversion_report.txt


## Next Steps

All 30 VN30 symbols were converted without errors; note that the logs show the converter writing its "binary fallback" output. Next steps:

1. **Feature Engineering**: Add technical indicators using qlib expressions
2. **Model Training**: Use processed data for baseline model training
3. **Performance Testing**: Verify data loading and processing speeds
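
As a preview of the feature engineering step, the sketch below computes two simple indicators in pandas from the sample closes printed earlier. The comments name the qlib expressions they roughly correspond to; handling of the warm-up rows may differ in qlib, and the window length of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

# First five ACB closes from the sample output earlier in this notebook.
close = pd.Series([3.13, 3.22, 3.18, 3.22, 3.29], name="close")

features = pd.DataFrame({
    # Rolling mean over 3 days, roughly Mean($close, 3) as a qlib expression.
    "ma3": close.rolling(3).mean(),
    # One-day simple return, roughly $close / Ref($close, 1) - 1.
    "ret1": close / close.shift(1) - 1,
})

print(features)
```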

The converted data is now ready for Task 04 (Baseline Model Training).