# Read KFF Data Files

This notebook reads all data files in the `2433_p3_data/KFF_data/` directory and displays the first 5 rows of each file.

The KFF (Kaiser Family Foundation) files hold healthcare-related statistics, one file per year from 2020 to 2026 (`raw_data_2020.csv` through `raw_data_2026.csv`), each with 56 rows and 5 columns.

In [1]:
# Import required libraries
import pandas as pd
from pathlib import Path
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

# pandas display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 50)

In [2]:
# Define KFF data directory
KFF_DATA_DIR = Path('2433_p3_data/KFF_data')

# Check directory exists
if not KFF_DATA_DIR.exists():
    print(f"ERROR: directory {KFF_DATA_DIR} not found!")
else:
    print(f"✓ Found data directory: {KFF_DATA_DIR}")
    print(f"  Absolute path: {KFF_DATA_DIR.absolute()}")

✓ Found data directory: 2433_p3_data/KFF_data
  Absolute path: /Users/mac/Desktop/Lecture_2433/group_project_251117/2433_p3_data/KFF_data


In [None]:
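# Read all CSV files into a dictionary (robust reading: skip metadata rows and auto-detect the header)
kff_data = {}
errors = {}

# Collect the CSV files to read (assumes they sit directly in KFF_DATA_DIR)
data_files = sorted(KFF_DATA_DIR.glob('*.csv'))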

def robust_read_csv(path):
    """Try robust reading: skip leading metadata lines, detect header line, then read with pandas."""
    # Read first few lines to detect header position
    with open(path, 'r', encoding='utf-8') as fh:
        preview = []
        for _ in range(40):
            try:
                preview.append(next(fh))
            except StopIteration:
                break

    header_idx = None
    for i, line in enumerate(preview):
        # header likely contains Location or multiple commas
        if 'Location' in line or line.count(',') >= 2:
            header_idx = i
            break

    # If header_idx found and >0, skip that many rows so pandas uses next as header
    if header_idx is not None and header_idx > 0:
        df = pd.read_csv(path, skiprows=header_idx, low_memory=False)
    else:
        # Try a normal read; if it fails, let the python engine sniff the delimiter
        try:
            df = pd.read_csv(path, low_memory=False)
        except Exception:
            # low_memory is not supported by the python engine, so it is omitted here
            df = pd.read_csv(path, sep=None, engine='python')
    return df

for file_path in data_files:
    file_name = file_path.name
    try:
        print(f"Reading: {file_name}...", end=' ')
        df = robust_read_csv(file_path)
        kff_data[file_name] = df
        print(f"✓ Success (shape: {df.shape})")
    except Exception as e:
        errors[file_name] = str(e)
        print(f"✗ Failed: {e}")

print(f"\nTotal: Successfully read {len(kff_data)} files, {len(errors)} failed")

Reading: raw_data_2020.csv... ✓ Success (shape: (56, 5))
Reading: raw_data_2021.csv... ✓ Success (shape: (56, 5))
Reading: raw_data_2022.csv... ✓ Success (shape: (56, 5))
Reading: raw_data_2023.csv... ✓ Success (shape: (56, 5))
Reading: raw_data_2024.csv... ✓ Success (shape: (56, 5))
Reading: raw_data_2025.csv... ✓ Success (shape: (56, 5))
Reading: raw_data_2026.csv... ✓ Success (shape: (56, 5))

Total: Successfully read 7 files, 0 failed
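
To make the header-detection rule in `robust_read_csv` concrete, here is a minimal sketch that applies the same test (the first line containing `Location` or at least two commas) to an in-memory sample. The sample text, its column names, and the helper name `find_header_index` are hypothetical and are not taken from the KFF files.

```python
import io
import pandas as pd

# Hypothetical sample: two metadata lines followed by the actual table
SAMPLE = (
    '"Title: Example Indicator"\n'
    '"Timeframe: 2020"\n'
    'Location,Employer,Medicaid,Uninsured\n'
    'Alabama,0.47,0.20,0.10\n'
    'Alaska,0.48,0.21,0.12\n'
)

def find_header_index(lines, keyword='Location', min_commas=2):
    """Return the index of the first line that looks like a header, or None."""
    for i, line in enumerate(lines):
        if keyword in line or line.count(',') >= min_commas:
            return i
    return None

header_idx = find_header_index(SAMPLE.splitlines())

# Skip the metadata lines so that pandas treats the detected line as the header
df_sample = pd.read_csv(io.StringIO(SAMPLE), skiprows=header_idx or 0)
print(header_idx)
print(df_sample.columns.tolist())
```

One limitation of this rule is that a metadata line containing two or more commas would itself be picked as the header, so it may be worth inspecting the first lines of any file whose columns look wrong.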


## Data Overview Summary

In [None]:
# Create data overview summary table
summary_data = []

for file_name, df in sorted(kff_data.items()):
    summary_data.append({
        'Filename': file_name,
        'Rows': df.shape[0],
        'Columns': df.shape[1],
        'Total Missing': df.isnull().sum().sum(),
        'Missing Rate(%)': f"{(df.isnull().sum().sum() / (df.shape[0] * df.shape[1]) * 100):.2f}%",
        'Memory(MB)': f"{df.memory_usage(deep=True).sum() / 1024 / 1024:.2f}"
    })

summary_df = pd.DataFrame(summary_data)

print("="*100)
print("KFF Data Files Summary")
print("="*100)
display(summary_df)

print(f"\nTotal rows: {summary_df['Rows'].sum():,}")
print(f"Average columns: {summary_df['Columns'].mean():.1f}")

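KFF Data Files Summary

| | Filename | Rows | Columns | Total Missing | Missing Rate(%) | Memory(MB) |
|---|---|---|---|---|---|---|
| 0 | raw_data_2020.csv | 56 | 5 | 16 | 5.71% | 0.01 |
| 1 | raw_data_2021.csv | 56 | 5 | 16 | 5.71% | 0.01 |
| 2 | raw_data_2022.csv | 56 | 5 | 16 | 5.71% | 0.01 |
| 3 | raw_data_2023.csv | 56 | 5 | 16 | 5.71% | 0.01 |
| 4 | raw_data_2024.csv | 56 | 5 | 16 | 5.71% | 0.01 |
| 5 | raw_data_2025.csv | 56 | 5 | 16 | 5.71% | 0.01 |
| 6 | raw_data_2026.csv | 56 | 5 | 16 | 5.71% | 0.01 |

Total rows: 392
Average columns: 5.0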
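
The summary reports only a total missing count per file. A per-column breakdown can show where the gaps sit. The sketch below uses a small made-up dictionary named `frames` as a stand-in for `kff_data`; the helper name `missing_by_column` and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the kff_data dictionary (values are made up)
frames = {
    'example_a.csv': pd.DataFrame({'Location': ['A', 'B', None], 'Value': [1.0, np.nan, 3.0]}),
    'example_b.csv': pd.DataFrame({'Location': ['A', 'B', 'C'], 'Value': [np.nan, np.nan, 3.0]}),
}

def missing_by_column(frames):
    """Return missing-value counts with one row per file and one column per column name."""
    counts = {name: df.isnull().sum() for name, df in frames.items()}
    return pd.DataFrame(counts).T

missing_table = missing_by_column(frames)
print(missing_table)
```

With the real data, the same function could be called as `missing_by_column(kff_data)`.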


## Check Data Consistency

In [None]:
# Check if all files have the same column structure
if len(kff_data) > 0:
    first_file = list(kff_data.keys())[0]
    first_columns = set(kff_data[first_file].columns)
    
    print("Column structure consistency check:")
    all_same = True
    
    for file_name, df in kff_data.items():
        current_columns = set(df.columns)
        if current_columns == first_columns:
            print(f"  ✓ {file_name}: Consistent column structure")
        else:
            all_same = False
            print(f"  ✗ {file_name}: Different column structure")
            missing = first_columns - current_columns
            extra = current_columns - first_columns
            if missing:
                print(f"    Missing columns: {missing}")
            if extra:
                print(f"    Extra columns: {extra}")
    
    if all_same:
        print("\n✓ All files have consistent column structure, safe to merge!")
    else:
        print("\n⚠ Files have inconsistent column structure, alignment needed before merging.")

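Column structure consistency check:
  ✓ raw_data_2020.csv: Consistent column structure
  ✓ raw_data_2021.csv: Consistent column structure
  ✓ raw_data_2022.csv: Consistent column structure
  ✓ raw_data_2023.csv: Consistent column structure
  ✓ raw_data_2024.csv: Consistent column structure
  ✓ raw_data_2025.csv: Consistent column structure
  ✓ raw_data_2026.csv: Consistent column structure

✓ All files have consistent column structure, safe to merge!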
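
The check above compares column names as sets, so it does not notice differences in column order or data type. A stricter variant might compare ordered `(column, dtype)` pairs, as in this sketch. The `frames` dictionary and the helper name `schema_of` are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Hypothetical frames: same column names, but reordered and with Value stored as text
frames = {
    'example_a.csv': pd.DataFrame({'Location': ['A'], 'Value': [1.0]}),
    'example_b.csv': pd.DataFrame({'Value': ['1.0'], 'Location': ['A']}),
}

def schema_of(df):
    """Return the ordered (column, dtype) pairs of a DataFrame."""
    return [(col, str(dtype)) for col, dtype in df.dtypes.items()]

# Use the first file as the reference schema
reference_name = next(iter(frames))
reference = schema_of(frames[reference_name])

# Collect every file whose ordered schema differs from the reference
mismatches = {
    name: schema_of(df)
    for name, df in frames.items()
    if schema_of(df) != reference
}
print(mismatches)
```

Here the set-based comparison would report both files as consistent, while the ordered comparison flags `example_b.csv`.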


## Optional: Merge All Years Data

In [None]:
# If all files have consistent column structure, merge them
try:
    combined_df = pd.concat(kff_data.values(), ignore_index=True)
    print("✓ Successfully merged all files")
    print(f"  Combined shape: {combined_df.shape[0]:,} rows × {combined_df.shape[1]} columns")
    
    # If there are year columns, display year distribution
    year_cols = [col for col in combined_df.columns if 'year' in col.lower()]
    if year_cols:
        print(f"\nYear distribution (based on column '{year_cols[0]}'):")
        print(combined_df[year_cols[0]].value_counts().sort_index())
    
    # Keep the merged data under the name kff_combined for later cells
    kff_combined = combined_df
    print("\n✓ Merged data saved to variable 'kff_combined'")
    
except Exception as e:
    print(f"✗ Merge failed: {e}")
    print("  Files may have inconsistent column structure, check column structure first.")

✓ Successfully merged all files
  Combined shape: 392 rows × 5 columns

✓ Merged data saved to variable 'kff_combined'
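
The merge output prints no year distribution, which suggests that the files carry no year column, so rows from different years cannot be told apart after concatenation. One possible remedy is to tag each frame with the year parsed from its file name before merging. In this sketch the `frames` dictionary, the helper `year_from_filename`, and the column name `Year` are all assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical stand-in for kff_data (values are made up)
frames = {
    'raw_data_2020.csv': pd.DataFrame({'Location': ['Alabama', 'Alaska'], 'Value': [1.0, 2.0]}),
    'raw_data_2021.csv': pd.DataFrame({'Location': ['Alabama', 'Alaska'], 'Value': [1.5, 2.5]}),
}

def year_from_filename(file_name):
    """Extract a four-digit year from a name like 'raw_data_2020.csv'; return None if absent."""
    match = re.search(r'(\d{4})', file_name)
    return int(match.group(1)) if match else None

# Add an assumed 'Year' column to each frame, then concatenate
tagged = [
    df.assign(Year=year_from_filename(name))
    for name, df in sorted(frames.items())
]
combined_with_year = pd.concat(tagged, ignore_index=True)
print(combined_with_year['Year'].value_counts().sort_index())
```

`DataFrame.assign` returns a copy, so the original frames stay unchanged.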


## Export Data (Optional)

In [None]:
# Export kff_combined to CSV file
if 'kff_combined' in globals():
    output_dir = Path('2433_p3_data/KFF_data/exports')
    output_dir.mkdir(parents=True, exist_ok=True)
    
    output_file = output_dir / 'kff_combined_2020_2026.csv'
    kff_combined.to_csv(output_file, index=False)
    
    file_size_mb = output_file.stat().st_size / 1024 / 1024
    print(f"✓ kff_combined exported to: {output_file}")
    print(f"  Rows: {len(kff_combined):,}")
    print(f"  Columns: {kff_combined.shape[1]}")
    print(f"  File size: {file_size_mb:.2f} MB")
else:
    print("⚠ kff_combined not found. Please run the merge cell first.")
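
After an export, reading the file back is a quick way to confirm that nothing was lost. The following sketch writes a small made-up frame to a temporary directory and compares shape, column names, and missing-value count after the round trip. The frame `df_out` and the file name `example_export.csv` are hypothetical.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Hypothetical frame standing in for kff_combined
df_out = pd.DataFrame({'Location': ['Alabama', 'Alaska'], 'Value': [1.5, None]})

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'example_export.csv'
    df_out.to_csv(out_path, index=False)
    df_back = pd.read_csv(out_path)

same_shape = df_back.shape == df_out.shape
same_columns = list(df_back.columns) == list(df_out.columns)
same_missing = int(df_back.isnull().sum().sum()) == int(df_out.isnull().sum().sum())
print(same_shape, same_columns, same_missing)
```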

In [None]:
# kff_data is already a notebook-level variable; print a reminder of what is available
print("✓ All KFF data saved to variable 'kff_data' (dictionary type)")
print(f"  Available files: {list(kff_data.keys())}")
print(f"\nUsage examples:")
print("  - Access 2020 data: kff_data['raw_data_2020.csv']")
print("  - Access merged data: kff_combined (if merge has been executed)")

✓ All KFF data saved to variable 'kff_data' (dictionary type)
  Available files: ['raw_data_2020.csv', 'raw_data_2021.csv', 'raw_data_2022.csv', 'raw_data_2023.csv', 'raw_data_2024.csv', 'raw_data_2025.csv', 'raw_data_2026.csv']

Usage examples:
  - Access 2020 data: kff_data['raw_data_2020.csv']
  - Access merged data: kff_combined (if merge has been executed)
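
As a further usage example, the first 5 rows of each file can be previewed by looping over the dictionary. This is a minimal sketch under stated assumptions: `frames` is a made-up stand-in for `kff_data`, and the helper name `preview` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for kff_data (values are made up)
frames = {
    'example_2020.csv': pd.DataFrame({'Location': list('ABCDEFG'), 'Value': range(7)}),
    'example_2021.csv': pd.DataFrame({'Location': list('ABC'), 'Value': range(3)}),
}

def preview(frames, n=5):
    """Return the first n rows of every frame, keyed by file name in sorted order."""
    return {name: df.head(n) for name, df in sorted(frames.items())}

previews = preview(frames)
for name, head in previews.items():
    print(f"--- {name} ---")
    print(head)
```

Inside the notebook, `display(head)` from `IPython.display` could replace `print(head)` to render each preview as a table.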
