# Accelerating Data Processing with Parquet Format

## Objective
I'm running this on an old 2014 Mac mini. I'm hoping to speed up data loading by converting the raw data from Excel to the Parquet file format.

### Why Parquet?
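- **Data Integrity**: Parquet stores the schema and column data types in the file itself, so columns come back with the same types on every load.
- **Efficiency**: Data is compressed column by column, which keeps files small and reduces the amount read from disk.
- **Speed**: Columnar storage lets pandas read only the columns it needs, and the binary format avoids the parsing overhead of Excel sheets.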


In [1]:
import os
import pandas as pd
import json

# Path to the raw Excel workbook in ../data/raw
excel_path = os.path.join(os.getcwd(), '..', 'data', 'raw', 'case_study_data.xlsx')

# Path to the per-sheet column data types (JSON) defined from the DQR analysis
config_path = os.path.join(os.getcwd(), 'config', 'raw_data_types.json')

# Read column types config
with open(config_path, 'r') as file:
    column_types_config = json.load(file)

# Make sure the output directory exists before writing
interim_dir = os.path.join(os.getcwd(), '..', 'data', 'interim')
os.makedirs(interim_dir, exist_ok=True)

# Open the workbook once and convert each sheet to its own Parquet file
with pd.ExcelFile(excel_path) as xls:
    for sheet in xls.sheet_names:
        # Look up the data types configured for this sheet (empty dict if none)
        column_types = column_types_config.get(sheet, {})

        # Read the sheet with the specified data types
        df = pd.read_excel(xls, sheet_name=sheet, dtype=column_types)

        # Write to Parquet (requires pyarrow or fastparquet)
        parquet_path = os.path.join(interim_dir, f'{sheet}_data.parquet')
        df.to_parquet(parquet_path)

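The conversion cell looks up each sheet name in `raw_data_types.json` and passes the result to `dtype`, which implies a mapping of sheet name to `{column: dtype}`. The sketch below shows that assumed shape and checks that every dtype string is one pandas understands. The sheet names, column names, and the `find_invalid_dtypes` helper are hypothetical and are not taken from the actual config.

```python
import json

from pandas.api.types import pandas_dtype

# Hypothetical config contents: {sheet name: {column name: dtype string}}
example_config = json.loads("""
{
  "orders": {"order_id": "int64", "order_total": "float64", "status": "string"},
  "customers": {"customer_id": "int64", "segment": "category"}
}
""")


def find_invalid_dtypes(config):
    """Return (sheet, column, dtype) triples whose dtype string pandas cannot interpret."""
    invalid = []
    for sheet, columns in config.items():
        for column, dtype_name in columns.items():
            try:
                pandas_dtype(dtype_name)
            except (TypeError, ValueError):
                invalid.append((sheet, column, dtype_name))
    return invalid


invalid_entries = find_invalid_dtypes(example_config)
print(invalid_entries)
```

Running a check like this before the conversion catches a mistyped dtype name up front instead of partway through the workbook.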