# NACE Data Propagation Notebook

## 1. Purpose

This notebook is designed to enhance a dataset of economic indicators classified by NACE (Nomenclature of Economic Activities) codes. The primary goal is to propagate data from higher, more aggregated NACE levels to lower, more granular levels where data might be missing. This process helps create a more complete dataset for analysis by filling gaps in the NACE hierarchy.

The key metrics targeted for propagation are:
*   Average wages by NACE (`avg_wages_by_nace`)
*   Number of employees by NACE (`no_of_employees_by_nace`)
*   Producer Price Index by NACE (`ppi_by_nace`)

## 2. Input Data

The notebook relies on two main input files:

*   **`data/source_cleaned/t_nace_matching.parquet`**: This file contains the NACE hierarchy. It maps NACE codes (`czso_code`) to their respective levels (0 through 5) and provides parent codes at each level (e.g., `level1_code`, `level2_code`, etc.). This structure is crucial for understanding the relationships between different NACE categories.
*   **`data/source_cleaned/data_by_nace_annual_tidy.parquet`**: This file contains the curated NACE data, with various economic indicators, their values, units, and original sources, all organized in a tidy format.
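
Before running the propagation it can help to confirm that both inputs expose the columns the later cells rely on. The following is a minimal sketch, not part of the notebook itself: `missing_columns` is a hypothetical helper, the column lists are copied from the notebook's printed output, and the toy frame stands in for the real Parquet file.

```python
import pandas as pd

# Column names copied from the notebook's printed column lists
HIERARCHY_COLS = ["czso_code", "level", "level1_code", "level2_code",
                  "level3_code", "level4_code", "level5_code", "magnus_nace"]
DATA_COLS = ["czso_code", "level", "name_cs", "name_en", "year",
             "metric", "value", "unit", "source"]


def missing_columns(df, expected):
    """Hypothetical helper: return the expected column names absent from df, sorted."""
    return sorted(set(expected) - set(df.columns))


# Toy frame standing in for the real tidy data (values are made up)
toy_data = pd.DataFrame({"czso_code": ["A"], "year": [2020], "value": [1.0]})
gaps = missing_columns(toy_data, DATA_COLS)
print(gaps)
```

An empty list means every expected column is present; the same helper can be called with `HIERARCHY_COLS` on the matching table.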

## 3. Propagation Logic

The core of the notebook is the `propagate_nace_data` function, which implements the following logic:

1.  **Full Hierarchy Creation**: It first creates a complete grid of all possible NACE codes (from `t_nace_matching.parquet`), years (from the input data), and target metrics.
2.  **Identifying Missing Data**: It merges this full grid with the existing NACE data to identify combinations of NACE code, year, and metric for which data is missing (i.e., `value` is NaN).
3.  **Hierarchical Search**: For each missing data point, the function attempts to find a suitable value by searching up the NACE hierarchy:
    *   It starts from the level immediately above the missing NACE code and moves upwards (e.g., for a level 3 code, it checks its level 2 parent, then its level 1 parent).
    *   The parent codes are determined using the `levelX_code` columns from the `t_nace_matching` table.
4.  **Special "Industry" Aggregate**: The industry aggregate `B+C+D+E` is a level 0 umbrella code. It is not handled by a separate branch inside the function. Instead, a preprocessing cell expands it into separate rows for `B`, `C`, `D`, and `E`, so codes in the manufacturing sector (level 1 code `C`) can pick up the aggregate's value through the ordinary hierarchical search when `C` has no record of its own.
5.  **Level 0 "Umbrella" Codes**: More generally, every level 0 code containing "+" (for example `B+C+D+E` for industry or `014+015+017+031` for specific agricultural aggregations) is split on "+" before propagation. Each component gets its own row that keeps the umbrella's value and `level` 0. If a component code also has its own record for the same year and metric, the record with the higher level number is preferred over the expanded umbrella row.
6.  **Recording Propagation**: When a value is successfully propagated, a new record is created. The `source` column for this new record is updated to indicate that the data was propagated and from which NACE code and level it originated (e.g., "PROPAGATED from level 1 (A)").
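
The hierarchical search in step 3 can be sketched in isolation. This is an illustrative reduction rather than the notebook's function: `find_nearest_parent` is a hypothetical name, the toy row and the `available` lookup are made up, and it assumes each `levelX_code` column holds a code that can be matched directly against the codes that have data.

```python
import pandas as pd


def find_nearest_parent(row, available):
    """Return (parent_code, entry) for the closest ancestor with data, or None.

    row       -- a hierarchy row with 'level' and 'levelX_code' fields
    available -- dict mapping czso_code to its data for one year and metric
    """
    # Walk from the immediate parent level up to level 1
    for lvl in range(int(row["level"]) - 1, 0, -1):
        parent = row.get(f"level{lvl}_code")
        if pd.notna(parent) and parent in available:
            return parent, available[parent]
    return None


# Toy level 3 code whose level 2 parent has no data but whose level 1 parent does
row = pd.Series({"czso_code": "011", "level": 3,
                 "level1_code": "A", "level2_code": "01"})
available = {"A": {"value": 100.0, "level": 1, "unit": "CZK"}}

result = find_nearest_parent(row, available)
print(result)
```

Because the loop stops at the first match, a closer ancestor always wins over a more aggregated one when both have data.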

## 4. Output Data

The primary output of this notebook is a Parquet file:

*   **`data/source_cleaned/data_by_nace_annual_tidy_propagated.parquet`**: This file contains the original data combined with the newly propagated data. It retains the same structure as the input `data_by_nace_annual_tidy.parquet` but with more records due to the filled gaps.

Key columns in the output include:
*   `czso_code`: The NACE code.
*   `level`: The NACE level of the `czso_code`.
*   `name_cs`, `name_en`: Czech and English names of the NACE category.
*   `year`: The year of the data.
*   `metric`: The economic indicator.
*   `value`: The value of the indicator.
*   `unit`: The unit of the value.
*   `source`: Indicates the origin of the data. For propagated data, this will show "PROPAGATED from..."
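
Downstream users may want to separate original from propagated records, or to recover where a propagated value came from. The sketch below parses the `source` string with a regular expression; the `is_propagated`, `origin_level`, and `origin_code` column names are assumptions introduced here, and `"CZSO"` is only a placeholder for whatever an original source label looks like.

```python
import pandas as pd

# Toy source values; "CZSO" is a placeholder for a non-propagated source label
df = pd.DataFrame({"source": ["CZSO",
                              "PROPAGATED from level 1 (C)",
                              "PROPAGATED from level 2 (19)"]})

# Matches the format written by the propagation step:
# "PROPAGATED from level <level> (<code>)"
pattern = r"^PROPAGATED from level (?P<origin_level>\d+) \((?P<origin_code>[^)]+)\)$"
parsed = df["source"].str.extract(pattern)

df["is_propagated"] = parsed["origin_level"].notna()
df["origin_level"] = pd.to_numeric(parsed["origin_level"]).astype("Int64")
df["origin_code"] = parsed["origin_code"]
print(df)
```

Rows that do not match the pattern keep missing values in the two origin columns, so they remain distinguishable from propagated rows.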

## 5. Notebook Steps

The notebook proceeds through the following main steps:

1.  **Setup**: Imports necessary libraries (pandas, numpy, os, datetime).
2.  **Load Data**: Loads the NACE matching table and the main NACE data.
3.  **Initial Examination**: Prints shapes, column names, and sample data from the loaded tables to understand their structure.
4.  **Filter Data**: Selects only the records corresponding to the `metrics_to_propagate`.
5.  **Define Propagation Function**: Contains the `propagate_nace_data` function with the detailed logic described above.
6.  **Execute Propagation**: Calls the `propagate_nace_data` function with the source data, NACE hierarchy, and target metrics.
7.  **Review Results**: Prints statistics about the propagation, such as the number of added records, source distribution, and metric distribution before and after propagation.
8.  **Save Propagated Data**: Saves the combined original and propagated data to the output Parquet file.
9.  **Final Summary & Quality Checks**: Displays summary statistics of the final dataset (total records, date range, unique NACE codes, levels represented) and performs basic data quality checks (missing values, duplicates).
10. **Sample Propagated Data**: Shows examples of records that were generated through propagation.

This process ensures that the resulting dataset is more comprehensive, facilitating more robust downstream analyses that rely on NACE-classified data.
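
The basic quality checks mentioned in step 9 could look roughly like the following. This is a sketch under assumptions: `basic_quality_checks` is a hypothetical helper, and treating (`czso_code`, `year`, `metric`) as the key that should be unique is a choice made here for illustration.

```python
import pandas as pd


def basic_quality_checks(df, key_cols=("czso_code", "year", "metric")):
    """Hypothetical helper: count missing values and repeated key combinations."""
    return {
        "missing_values": int(df["value"].isna().sum()),
        # Rows whose key repeats an earlier row (the first occurrence is not counted)
        "duplicate_keys": int(df.duplicated(subset=list(key_cols)).sum()),
    }


# Toy data: one repeated key and one missing value
toy = pd.DataFrame({
    "czso_code": ["A", "A", "01"],
    "year": [2020, 2020, 2020],
    "metric": ["ppi_by_nace"] * 3,
    "value": [101.5, 101.5, None],
})

report = basic_quality_checks(toy)
print(report)
```

A non-zero `duplicate_keys` count is worth investigating, since expanded umbrella rows can share a key with a code's own record.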

In [95]:
import os
import pandas as pd
import numpy as np
from datetime import datetime

In [96]:
# Define file paths
script_dir = os.getcwd()  # Current directory in Jupyter
project_root = os.path.abspath(os.path.join(script_dir, ".."))

# Load NACE matching table
nace_matching_file = os.path.join(project_root, "data", "source_cleaned", "t_nace_matching.parquet")
df_nace_matching = pd.read_parquet(nace_matching_file)

# Load the main NACE data file
data_file = os.path.join(project_root, "data", "source_cleaned", "data_by_nace_annual_tidy.parquet")
df_nace_data = pd.read_parquet(data_file)

print(f"NACE matching table shape: {df_nace_matching.shape}")
print(f"NACE data shape: {df_nace_data.shape}")
print(f"\nNACE matching columns: {df_nace_matching.columns.tolist()}")
print(f"NACE data columns: {df_nace_data.columns.tolist()}")

NACE matching table shape: (1717, 12)
NACE data shape: (7700, 9)

NACE matching columns: ['name_czso_cs', 'name_czso_en', 'level', 'czso_code', 'level1_code', 'level2_code', 'level3_code', 'level4_code', 'level5_code', 'full_nace', 'magnus_nace', 'industry_flag']
NACE data columns: ['czso_code', 'level', 'name_cs', 'name_en', 'year', 'metric', 'value', 'unit', 'source']


In [97]:
# Filter data for the metrics we want to propagate
metrics_to_propagate = ['avg_wages_by_nace', 'no_of_employees_by_nace', 'ppi_by_nace']

df_propagation_source = df_nace_data[df_nace_data['metric'].isin(metrics_to_propagate)].copy()

print(f"Data for propagation shape: {df_propagation_source.shape}")
print(f"\nMetrics distribution:")
print(df_propagation_source['metric'].value_counts())
print(f"\nLevel distribution:")
print(df_propagation_source['level'].value_counts().sort_index())
print(f"\nYear range: {df_propagation_source['year'].min()} - {df_propagation_source['year'].max()}")

Data for propagation shape: (4350, 9)

Metrics distribution:
metric
ppi_by_nace                3350
avg_wages_by_nace           500
no_of_employees_by_nace     500
Name: count, dtype: int64

Level distribution:
level
0     150
1    1100
2    1250
3    1850
Name: count, dtype: int64

Year range: 2000 - 2024


In [98]:
# Expand level-0 umbrella codes (czso_code containing "+") into one row per component code.
# For example, a B+C+D+E row becomes four rows for B, C, D and E that keep the same values;
# only czso_code changes. The codes are already clean, so splitting on "+" is sufficient.
umbrella_mask = (df_propagation_source['level'] == 0) & df_propagation_source['czso_code'].str.contains(r'\+')
umbrella_rows = df_propagation_source[umbrella_mask]

# Split the combined code into components and explode
expanded = (
    umbrella_rows
    .assign(czso_code=umbrella_rows['czso_code'].str.split(r'\+'))
    .explode('czso_code')
    .reset_index(drop=True)
)

# Merge the expanded rows back into the original DataFrame
df_propagation_source = pd.concat([
    df_propagation_source[~umbrella_mask],  # Keep non-umbrella rows
    expanded  # Add the expanded rows
], ignore_index=True)
# Ensure czso_code is stripped of whitespace
df_propagation_source['czso_code'] = df_propagation_source['czso_code'].str.strip()

# level is integer
df_propagation_source['level'] = df_propagation_source['level'].astype(int)



In [99]:
# remove rows with missing value in 'value' column
df_propagation_source = df_propagation_source.dropna(subset=['value'])


In [100]:
def propagate_nace_data(df_source, df_nace_hierarchy, metrics_list):
    """
    Propagate data from higher NACE levels to lower levels where data is missing.
    Uses the hierarchical structure from level1_code, level2_code, etc.
    
    Parameters:
    - df_source: DataFrame with NACE data to propagate. 
                 Expected columns: czso_code, level, name_cs, name_en, year, metric, value, unit, source.
    - df_nace_hierarchy: DataFrame with NACE hierarchy information.
                         Expected columns: czso_code, level, name_czso_cs, name_czso_en, levelX_code.
    - metrics_list: List of metrics to propagate
    
    Returns:
    - DataFrame with propagated data
    """
    
    # Create a copy of source data
    df_result = df_source.copy()
    
    # Get all possible combinations of czso_code, year, and metric from hierarchy
    years = df_source['year'].unique()
    
    # Create full hierarchy with all years and metrics
    hierarchy_expanded = []
    for metric in metrics_list:
        for year in years:
            temp_df = df_nace_hierarchy.copy()
            temp_df['year'] = year
            temp_df['metric'] = metric
            hierarchy_expanded.append(temp_df)
    
    df_full_hierarchy = pd.concat(hierarchy_expanded, ignore_index=True)
    
    # Merge with existing data to identify missing combinations
    # df_source columns: czso_code, level, name_cs, name_en, year, metric, value, unit, source
    # df_full_hierarchy columns: czso_code, level, name_czso_cs, name_czso_en, year, metric, levelX_code
    # Overlapping 'level' column will be suffixed.
    # 'name_cs'/'name_en' vs 'name_czso_cs'/'name_czso_en' are distinct.
    # 'unit', 'value', 'source' are only in df_source.
    df_merged = pd.merge(
        df_full_hierarchy,
        df_source,
        on=['czso_code', 'year', 'metric'],
        how='left',
        suffixes=('_hierarchy', '_data') # level -> level_hierarchy, level_data
    )
    
    # Identify missing data (where value is NaN)
    missing_mask = df_merged['value'].isna()
    print(f"Found {missing_mask.sum()} missing data points to potentially propagate")
    
    # Sort so more aggregated levels (lower level numbers) come first - use the hierarchy level
    df_merged = df_merged.sort_values(['metric', 'year', 'level_hierarchy', 'czso_code'])
    
    # Propagation logic: for each missing data point, try to find parent data
    propagated_records = []
    
    for metric in metrics_list:
        print(f"\nProcessing metric: {metric}")
        metric_data = df_merged[df_merged['metric'] == metric].copy()
        
        for year_val in years: # Renamed to avoid conflict with 'year' column name
            year_data = metric_data[metric_data['year'] == year_val].copy()
            
            # Get existing data for this year/metric
            # This data comes from df_merged, so it has 'level_data' (original NACE level from df_source)
            # and 'unit' (original unit from df_source)
            existing_data_for_year_metric = year_data[~year_data['value'].isna()].copy()
            
            # Build a lookup of available data keyed by czso_code. If a code appears more than once
            # (e.g. 'C' at level 1 and 'C' at level 0 from the umbrella expansion), keep the record
            # with the larger level number, i.e. the code's own record rather than the expanded one.
            # 'level_data' is the NACE level from the original df_source.
            existing_data_prepared = existing_data_for_year_metric.sort_values(
                'level_data', ascending=False
            ).drop_duplicates(subset=['czso_code'], keep='first')
            
            existing_data_map = {}
            for _, r_existing in existing_data_prepared.iterrows():
                existing_data_map[r_existing['czso_code']] = {
                    'value': r_existing['value'],
                    'level': r_existing['level_data'], # Actual NACE level of this source data point
                    'unit': r_existing['unit']         # Unit of this source data point
                }
            
            # Try to propagate to missing data points for this year/metric
            missing_data_for_year_metric = year_data[year_data['value'].isna()]
            
            for _, row in missing_data_for_year_metric.iterrows():
                target_code = row['czso_code']
                # target_level is the NACE level of the item we are trying to fill (from hierarchy)
                target_level = row['level_hierarchy'] 
                
                propagated_value = None
                source_code_found = None
                actual_level_of_source_code = None # NACE level of the source_code_found
                propagated_unit = 'unknown'      # Default unit

                # Try to find parent data by going up the hierarchy
                # Loop from parent (target_level-1) up to NACE Level 1 ancestor
                for i in range(target_level - 1, 0, -1):
                    # Ancestor code at level i, read from the matching levelX_code column
                    parent_code_at_level_i = row.get(f'level{i}_code')
                    
                    if pd.notna(parent_code_at_level_i) and parent_code_at_level_i in existing_data_map:
                        parent_data_entry = existing_data_map[parent_code_at_level_i]
                        propagated_value = parent_data_entry['value']
                        actual_level_of_source_code = int(parent_data_entry['level'])
                        propagated_unit = parent_data_entry['unit']
                        source_code_found = parent_code_at_level_i 

                        break # Found data from the closest parent in hierarchy
                
                if propagated_value is not None:
                    new_record = {
                        'czso_code': target_code,
                        'level': target_level, # NACE level of the item being filled
                        'name_cs': row['name_czso_cs'], # Name from hierarchy table
                        'name_en': row['name_czso_en'], # Name from hierarchy table
                        'year': year_val,
                        'metric': metric,
                        'value': propagated_value,
                        'unit': propagated_unit if pd.notna(propagated_unit) else 'unknown',
                        'source': f"PROPAGATED from level {actual_level_of_source_code} ({source_code_found})"
                    }
                    propagated_records.append(new_record)
    
    print(f"\nGenerated {len(propagated_records)} propagated records")
    
    if propagated_records:
        df_propagated_new = pd.DataFrame(propagated_records)
        df_result = pd.concat([df_result, df_propagated_new], ignore_index=True)

    return df_result

In [101]:
# Execute the propagation
print("Starting data propagation...")
print(f"Original data shape: {df_propagation_source.shape}")

df_propagated = propagate_nace_data(
    df_propagation_source, 
    df_nace_matching, 
    metrics_to_propagate
)

print(f"\nPropagated data shape: {df_propagated.shape}")
print(f"Added {df_propagated.shape[0] - df_propagation_source.shape[0]} new records")

# Check the results
print("\n=== Propagation Results ===")
print("Source distribution:")
print(df_propagated['source'].value_counts())

print("\nMetric distribution after propagation:")
for metric in metrics_to_propagate:
    metric_data = df_propagated[df_propagated['metric'] == metric]
    print(f"{metric}: {len(metric_data)} records")
    print(f"  - Original: {len(metric_data[~metric_data['source'].str.contains('PROPAGATED', na=False)])}")
    print(f"  - Propagated: {len(metric_data[metric_data['source'].str.contains('PROPAGATED', na=False)])}")

df_final = df_propagated.copy()

Starting data propagation...
Original data shape: (4713, 9)
Found 124362 missing data points to potentially propagate

Processing metric: avg_wages_by_nace

Processing metric: no_of_employees_by_nace

Processing metric: ppi_by_nace

Generated 117547 propagated records

Propagated data shape: (122260, 9)
Added 117547 new records

=== Propagation Results ===
Source distribution:
source
PROPAGATED from level 1 (C)     29670
PROPAGATED from level 1 (G)     10850
PROPAGATED from level 1 (A)      6650
PROPAGATED from level 0 (G)      5425
PROPAGATED from level 1 (N)      4600
                                ...  
PROPAGATED from level 2 (19)       90
PROPAGATED from level 2 (06)       77
PROPAGATED from level 2 (36)       50
PROPAGATED from level 2 (59)       12
PROPAGATED from level 2 (60)        6
Name: count, Length: 79

In [102]:
# drop rows with czso_code that are not in the hierarchy
df_final = df_final[df_final['czso_code'].isin(df_nace_matching['czso_code'])]

# add magnus_nace column based on the matching table
df_final = df_final.merge(
    df_nace_matching[['czso_code', 'magnus_nace']],
    on='czso_code',
    how='left'
)
# Reorder columns so that magnus_nace is the second column
df_final = df_final[['czso_code', 'magnus_nace', 'level', 'name_cs', 'name_en', 'year', 'metric', 'value', 'unit', 'source']]



In [103]:
# Save the propagated data
output_folder = os.path.join(project_root, "data", "source_cleaned")
os.makedirs(output_folder, exist_ok=True)  # Create the folder if it does not exist yet

output_file = os.path.join(output_folder, "data_by_nace_annual_tidy_propagated.parquet")

df_final.to_parquet(output_file, index=False, engine="pyarrow")
print(f"\nPropagated data saved to: {output_file}")



Propagated data saved to: /Users/adam/Library/Mobile Documents/com~apple~CloudDocs/School/Master's Thesis/Analysis/profit-margins-inflation/data/source_cleaned/data_by_nace_annual_tidy_propagated.parquet


## Data Summary

In [104]:
# Generate summary statistics
print("\n=== Final Summary ===")
print(f"Total records: {len(df_final):,}")
print(f"Date range: {df_final['year'].min()} - {df_final['year'].max()}")
print(f"Unique NACE codes: {df_final['czso_code'].nunique()}")
print(f"Levels represented: {sorted(df_final['level'].unique())}")

print("\nRecords by metric:")
for metric in sorted(df_final['metric'].unique()):
    count = len(df_final[df_final['metric'] == metric])
    propagated_count = len(df_final[
        (df_final['metric'] == metric) & 
        (df_final['source'].str.contains('PROPAGATED', na=False))
    ])
    print(f"  {metric}: {count:,} total ({propagated_count:,} propagated)")

print("\nRecords by level:")
for level in sorted(df_final['level'].unique()):
    count = len(df_final[df_final['level'] == level])
    print(f"  Level {level}: {count:,} records")


=== Final Summary ===
Total records: 122,260
Date range: 2000 - 2024
Unique NACE codes: 1700
Levels represented: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5)]

Records by metric:
  avg_wages_by_nace: 42,600 total (42,025 propagated)
  no_of_employees_by_nace: 42,600 total (42,025 propagated)
  ppi_by_nace: 37,060 total (33,497 propagated)

Records by level:
  Level 0: 640 records
  Level 1: 1,095 records
  Level 2: 6,060 records
  Level 3: 19,170 records
  Level 4: 43,980 records
  Level 5: 51,315 records


In [105]:
def summarize_data_availability(df_data, df_nace_hierarchy, target_metric):
    """
    Summarizes data availability for a given metric across all years.

    For each year, it shows how many NACE codes have data, how many are missing,
    and details for the missing NACE codes including their parent NACE codes.

    Parameters:
    - df_data: DataFrame containing the data (e.g., df_final).
               Expected columns: 'czso_code', 'year', 'metric', 'value', 'level', 
                                 'name_cs', 'name_en'.
    - df_nace_hierarchy: DataFrame with NACE hierarchy information (e.g., df_nace_matching).
                         Expected columns: 'czso_code', 'level', 'name_czso_cs', 
                                           'name_czso_en', 'level1_code', 'level2_code', 
                                           'level3_code', 'level4_code', 'level5_code'.
    - target_metric: String, the name of the metric to summarize.
    """
    print(f"--- Data Availability Summary for Metric: {target_metric} ---")

    metric_data = df_data[df_data['metric'] == target_metric]
    
    if metric_data.empty:
        print(f"No data found for metric: {target_metric}")
        return

    unique_years = sorted(metric_data['year'].unique())

    # Select relevant columns from NACE hierarchy for the full grid
    hierarchy_cols = ['czso_code', 'level', 'name_czso_cs', 'name_czso_en', 
                      'level1_code', 'level2_code', 'level3_code', 
                      'level4_code', 'level5_code']
    df_nace_base = df_nace_hierarchy[hierarchy_cols].copy()
    df_nace_base.rename(columns={'level': 'nace_level_hierarchy', # To avoid potential merge conflicts
                                 'name_czso_cs': 'name_cs_hierarchy',
                                 'name_czso_en': 'name_en_hierarchy'}, inplace=True)


    for year in unique_years:
        print(f"\n--- Year: {year} ---")
        
        # Create a full grid of all NACE codes for this year and metric
        df_year_full_nace = df_nace_base.copy()
        df_year_full_nace['year'] = year
        df_year_full_nace['metric'] = target_metric
        
        # Data for the current year and metric
        current_year_metric_data = metric_data[metric_data['year'] == year]
        
        # Merge the full NACE grid with the actual data
        # We are interested in 'value' from current_year_metric_data
        merged_df = pd.merge(
            df_year_full_nace,
            current_year_metric_data[['czso_code', 'year', 'metric', 'value', 'source']],
            on=['czso_code', 'year', 'metric'],
            how='left'
        )
        
        available_mask = merged_df['value'].notna()
        missing_mask = merged_df['value'].isna()
        
        available_count = available_mask.sum()
        missing_count = missing_mask.sum()
        
        print(f"  NACE codes with data: {available_count}")
        print(f"  NACE codes missing data: {missing_count}")
        
        if missing_count > 0:
            print(f"  Details for missing NACE codes (showing first 5 if many):")
            missing_details_df = merged_df[missing_mask]
            
            for _, row in missing_details_df.head(5).iterrows():
                parent_info = []
                if pd.notna(row['level5_code']): parent_info.append(f"L5P:{row['level5_code']}")
                if pd.notna(row['level4_code']): parent_info.append(f"L4P:{row['level4_code']}")
                if pd.notna(row['level3_code']): parent_info.append(f"L3P:{row['level3_code']}")
                if pd.notna(row['level2_code']): parent_info.append(f"L2P:{row['level2_code']}")
                if pd.notna(row['level1_code']): parent_info.append(f"L1P:{row['level1_code']}")
                
                print(f"    - Code: {row['czso_code']} (Level: {row['nace_level_hierarchy']}), "
                      f"Name: {row['name_en_hierarchy'][:30]}..., " # Truncate name
                      f"Parents: [{', '.join(parent_info)}]")
            if missing_count > 5:
                print(f"    ... and {missing_count - 5} more missing NACE codes.")
    print("\n--- End of Summary ---")



In [106]:
summarize_data_availability(df_final, df_nace_matching, 'ppi_by_nace')

--- Data Availability Summary for Metric: ppi_by_nace ---

--- Year: 2000 ---
  NACE codes with data: 1404
  NACE codes missing data: 317
  Details for missing NACE codes (showing first 5 if many):
    - Code: A (Level: 1), Name: AGRICULTURE, FORESTRY AND FISH..., Parents: [L5P:, L4P:, L3P:, L2P:, L1P:A]
    - Code: 01 (Level: 2), Name: Crop and animal production, hu..., Parents: [L5P:, L4P:, L3P:, L2P:01, L1P:A]
    - Code: 011 (Level: 3), Name: Growing of non-perennial crops..., Parents: [L5P:, L4P:, L3P:1, L2P:01, L1P:A]
    - Code: 0111 (Level: 4), Name: Growing of cereals (except ric..., Parents: [L5P:, L4P:1, L3P:1, L2P:01, L1P:A]
    - Code: 01110 (Level: 5), Name: Growing of cereals (except ric..., Parents: [L5P:0, L4P:1, L3P:1, L2P:01, L1P:A]
    ... and 312 more missing NACE codes.

--- Year: 2001 ---
  NACE codes with data: 1404
  NACE codes missing data: 317
  Details for missing NACE codes (showing first 5 if many):
    - Code: A (Level: 1), Name: AGRICULTURE, FORESTRY AND