### Magnusweb Data Cleaning and Enrichment 
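This notebook cleans the Magnusweb firm-level panel and prepares it for analysis. The steps are:
- correcting column types
- removing rows for firms with fewer than 10 employees
- removing rows with a missing ICO
- adding founding and dissolution years and a dissolved-firm flag
- calculating financial ratios (columns with the `_cal` suffix)
- enriching the panel with NACE level 1 and level 2 codes

Initial structure of the panel: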
```python
Type of panel: <class 'polars.dataframe.frame.DataFrame'>
Static columns (do not change across years for each ICO): ['name', 'main_nace', 'main_nace_code', 'sub_nace_cz', 'sub_nace_cz_code', 'main_okec', 'main_okec_code', 'sub_okec', 'sub_okec_code', 'esa2010', 'esa95', 'locality', 'region', 'num_employees', 'num_employees_cat', 'turnover_cat', 'audit', 'consolidation', 'currency', 'date_founded', 'date_dissolved', 'status', 'legal_form', 'entity_type']
Time-series columns (change across years for each ICO): ['year', 'profit_pre_tax', 'total_liabilities_and_equity', 'oper_profit', 'total_assets', 'profit_net', 'sales_revenue', 'equity', 'turnover', 'costs']
```

In [191]:
import polars as pl
import os

# Path to curated panel file
curated_path = os.path.join("..", "data", "source_cleaned", "magnusweb_panel.parquet")

# Load the curated panel
panel = pl.read_parquet(curated_path)
print(f"Loaded panel shape: {panel.shape}")
#panel.head()

Loaded panel shape: (1754304, 40)


## Correcting types

Summary:  
Six columns needed type corrections:  
- `num_employees_cat`, `status`, `legal_form`, and `entity_type` were converted to categorical type.
- `date_founded` and `date_dissolved` were converted to date type.  


In [192]:
# Check type and first 5 values for each column, correct if needed
expected_types = {
    'ico': pl.String,
    'name': pl.String,
    'main_nace': pl.Categorical,
    'main_nace_code': pl.String,
    'sub_nace_cz': pl.Categorical,
    'sub_nace_cz_code': pl.String,
    'main_okec': pl.Categorical,
    'main_okec_code': pl.String,
    'sub_okec': pl.Categorical,
    'sub_okec_code': pl.String,
    'esa2010': pl.Categorical,
    'esa95': pl.Categorical,
    'locality': pl.Categorical,
    'region': pl.Categorical,
    'num_employees': pl.Int32,
    'num_employees_cat': pl.Categorical,
    'turnover_cat': pl.Categorical,
    'audit': pl.Categorical,
    'consolidation': pl.Categorical,
    'currency': pl.Categorical,
    'date_founded': pl.Date,
    'date_dissolved': pl.Date,
    'status': pl.Categorical,
    'legal_form': pl.Categorical,
    'entity_type': pl.Categorical,
    'year': pl.Int16,
    'profit_pre_tax': pl.Float64,
    'total_liabilities_and_equity': pl.Float64,
    'oper_profit': pl.Float64,
    'total_assets': pl.Float64,
    'profit_net': pl.Float64,
    'sales_revenue': pl.Float64,
    'equity': pl.Float64,
    'turnover': pl.Float64,
    'costs': pl.Float64,
    'debt': pl.Float64,
}

panel_checked = panel.clone()
corrections = {}

for col in panel_checked.columns:
    # print(f"Column: {col}")
    # print(f"  Type: {panel_checked[col].dtype}")
    # print(f"  First 5 values: {panel_checked[col].head(5).to_list()}")
    expected = expected_types.get(col)
    if expected and panel_checked[col].dtype != expected:
        # Try to convert
        try:
            if expected == pl.Categorical:
                panel_checked = panel_checked.with_columns(pl.col(col).cast(pl.Categorical, strict=False))
            elif expected == pl.Date:
                panel_checked = panel_checked.with_columns(pl.col(col).str.strptime(pl.Date, "%Y-%m-%d", strict=False))
            else:
                panel_checked = panel_checked.with_columns(pl.col(col).cast(expected, strict=False))
            corrections[col] = expected
        except Exception as e:
            print(f"  Could not convert {col} to {expected}: {e}")

if corrections:
    print("\nCorrections made:")
    for col, typ in corrections.items():
        print(f"  {col}: {typ}")
        print(f"    New type: {panel_checked[col].dtype}")
        print(f"    First 5 values: {panel_checked[col].head(5).to_list()}")

Corrections made:
  num_employees_cat: Categorical
    New type: Categorical(ordering='physical')
    First 5 values: ['10 - 19 zaměstnanců', 'Neuvedeno', '25 - 49 zaměstnanců', 'Neuvedeno', '25 - 49 zaměstnanců']
  date_founded: Date
    New type: Date
    First 5 values: [datetime.date(2023, 5, 9), datetime.date(2023, 9, 18), datetime.date(2024, 5, 2), datetime.date(1994, 7, 21), datetime.date(1997, 10, 20)]
  date_dissolved: Date
    New type: Date
    First 5 values: [None, None, None, datetime.date(2021, 1, 1), datetime.date(2008, 8, 26)]
  status: Categorical
    New type: Categorical(ordering='physical')
    First 5 values: ['', '', '', 'Zrušený', 'Zrušený']
  legal_form: Categorical
    New type: Categorical(ordering='physical')
    First 5 values: ['Společnost s ručením omezeným', 'Společnost s ručením omezeným', 'Společnost s ručením omezeným', 'Společnost s ručením omezeným', 'Akciová společnost']
  entity_type: Categorical
    N

## Removing rows with 0–9 employees
- Even though the Magnusweb export was filtered for firms with 10+ employees, some rows still report 0–9 employees because of inconsistencies in the source data.
- This step keeps only companies with 10 or more employees.

In [193]:
# Remove rows with 0–9 employees
panel_new = panel_checked.filter(pl.col("num_employees") > 9)
print(f"Filtered panel shape: {panel_new.shape}")
# panel_new.head()

print("Number of unique ICOs:", panel_new["ico"].n_unique())

Filtered panel shape: (1285128, 40)
Number of unique ICOs: 53547


## Remove rows with missing ICO 
- Rows with a missing ICO were checked against the raw data: they are empty apart from the firm name, so they are removed.


In [194]:
# Count rows with a missing or empty ICO
missing_ico = pl.col("ico").is_null() | (pl.col("ico") == "")
missing_ico_count = panel_new.filter(missing_ico).height
print(f"Number of rows with missing or empty ICO: {missing_ico_count}")

# Remove rows with a missing or empty ICO
panel_new = panel_new.filter(~missing_ico)
print(f"Panel shape after removing missing ICO: {panel_new.shape}")

# One ICO value disappears from the count: the empty one
print("Number of unique ICOs:", panel_new["ico"].n_unique())

Number of rows with missing or empty ICO: 24
Panel shape after removing missing ICO: (1285104, 40)
Number of unique ICOs: 53546


## Add year founded and dissolved, and flag for dissolved firms


In [195]:
# Add year founded and dissolved, and flag for dissolved firms
panel_new = panel_new.with_columns([
    # Extract year from date columns
    pl.col("date_founded").dt.year().alias("year_founded"),
    pl.col("date_dissolved").dt.year().alias("year_dissolved"),
    # Flag firms that have a dissolution date
    pl.col("date_dissolved").is_not_null().alias("is_dissolved"),
])

## Financial Metrics Calculation

Next, we calculate financial ratios and metrics for the analysis:
1. **Profitability ratios** - operating margin, net margin, ROA, ROE
2. **Leverage & solvency** - equity ratio (the debt ratios are currently commented out)
3. **Cost structure** - cost ratio
4. **Interest & tax diagnostics** - effective tax rate
5. **Growth dynamics** - year-over-year growth rates
6. **Efficiency & scale** - asset turnover, labor productivity

All calculated columns will have a `_cal` suffix to distinguish them from source data.

In [196]:
# 1. Profitability ratios
print("Calculating profitability ratios...")

panel_with_metrics = panel_new.with_columns([
    # Operating margin - core pricing power from operations
    (pl.col("oper_profit") / pl.col("sales_revenue")).alias("operating_margin_cal"),
    
    # Net margin - all-in profitability after interest & tax
    (pl.col("profit_net") / pl.col("sales_revenue")).alias("net_margin_cal"),
    
    # Return on assets (EBIT) - efficiency of asset base
    (pl.col("oper_profit") / pl.col("total_assets")).alias("roa_ebit_cal"),
    
    # Return on equity - owners' yield
    (pl.col("profit_net") / pl.col("equity")).alias("roe_cal"),

])

print(f"Added profitability ratios. Shape: {panel_with_metrics.shape}")

# Quick check for any new columns
new_profitability_cols = ["operating_margin_cal", "net_margin_cal", "roa_ebit_cal", "roe_cal"]
print("New profitability columns:", new_profitability_cols)

Calculating profitability ratios...
Added profitability ratios. Shape: (1285104, 47)
New profitability columns: ['operating_margin_cal', 'net_margin_cal', 'roa_ebit_cal', 'roe_cal']


In [197]:
# 2. Leverage & solvency ratios
print("Calculating leverage & solvency ratios...")

panel_with_metrics = panel_with_metrics.with_columns([
    # # Debt-to-asset ratio - share of assets financed by liabilities
    # (pl.col("total_liabilities") / pl.col("total_assets")).alias("debt_asset_ratio_cal"),
    
    # # Debt-to-equity ratio - traditional leverage gauge
    # (pl.col("total_liabilities") / pl.col("equity")).alias("debt_equity_ratio_cal"),
    
    # Equity ratio - solvency cushion
    (pl.col("equity") / pl.col("total_assets")).alias("equity_ratio_cal"),
])

print(f"Added leverage ratios. Shape: {panel_with_metrics.shape}")

# Quick check
new_leverage_cols = ["equity_ratio_cal"]
print("New leverage columns:", new_leverage_cols)

Calculating leverage & solvency ratios...
Added leverage ratios. Shape: (1285104, 48)
New leverage columns: ['equity_ratio_cal']


In [198]:
# 3. Cost structure metrics
print("Calculating cost structure metrics...")

panel_with_metrics = panel_with_metrics.with_columns([
    # Cost ratio - average cost share in revenue
    (pl.col("costs") / pl.col("sales_revenue")).alias("cost_ratio_cal"),
])

print(f"Added cost structure metrics. Shape: {panel_with_metrics.shape}")

# Quick check
new_cost_cols = ["cost_ratio_cal"]
print("New cost structure columns:", new_cost_cols)

Calculating cost structure metrics...
Added cost structure metrics. Shape: (1285104, 49)
New cost structure columns: ['cost_ratio_cal']


In [199]:
# 4. Interest & tax diagnostics
print("Calculating interest & tax diagnostics...")

panel_with_metrics = panel_with_metrics.with_columns([
    
    # Effective tax rate - fiscal burden (avoid division by zero)
    pl.when(pl.col("profit_pre_tax") != 0)
      .then((pl.col("profit_pre_tax") - pl.col("profit_net")) / pl.col("profit_pre_tax"))
      .otherwise(None)
      .alias("effective_tax_rate_cal"),
])

print(f"Added interest & tax diagnostics. Shape: {panel_with_metrics.shape}")

# Quick check
new_tax_cols = ["effective_tax_rate_cal"]
print("New interest & tax columns:", new_tax_cols)

Calculating interest & tax diagnostics...
Added interest & tax diagnostics. Shape: (1285104, 50)
New interest & tax columns: ['effective_tax_rate_cal']


In [200]:
# 5. Growth dynamics (year-on-year)
print("Calculating growth dynamics...")

# Sort by ico and year to ensure proper ordering for lag calculations
panel_with_metrics = (
    panel_with_metrics
    .sort(["ico", "year"])
    .with_columns([
        # Revenue growth - revenue momentum
        (pl.col("sales_revenue").pct_change().over("ico")).alias("rev_growth_cal"),
        
        # Cost growth - cost shock size
        (pl.col("costs").pct_change().over("ico")).alias("cost_growth_cal"),
        
        # Operating profit growth - margin expansion driver
        (pl.col("oper_profit").pct_change().over("ico")).alias("op_profit_growth_cal"),
    ])
)

print(f"Added growth dynamics. Shape: {panel_with_metrics.shape}")

# Quick check
new_growth_cols = ["rev_growth_cal", "cost_growth_cal", "op_profit_growth_cal"]
print("New growth columns:", new_growth_cols)

Calculating growth dynamics...
Added growth dynamics. Shape: (1285104, 53)
New growth columns: ['rev_growth_cal', 'cost_growth_cal', 'op_profit_growth_cal']


In [201]:
# 6. Efficiency & scale metrics
print("Calculating efficiency & scale metrics...")

panel_with_metrics = panel_with_metrics.with_columns([
    # Asset turnover - how quickly assets generate revenue
    (pl.col("sales_revenue") / pl.col("total_assets")).alias("asset_turnover_cal"),
    
    # Labor productivity - sales per employee (num_employees is a static column, so the same headcount is used for every year)
    (pl.col("sales_revenue") / pl.col("num_employees")).alias("labor_productivity_cal"),
])

print(f"Added efficiency & scale metrics. Shape: {panel_with_metrics.shape}")

# Quick check
new_efficiency_cols = ["asset_turnover_cal", "labor_productivity_cal"]
print("New efficiency columns:", new_efficiency_cols)

Calculating efficiency & scale metrics...
Added efficiency & scale metrics. Shape: (1285104, 55)
New efficiency columns: ['asset_turnover_cal', 'labor_productivity_cal']


In [202]:
# Quality check on calculated metrics
print("Performing quality checks on calculated metrics...")

# List all calculated columns
calculated_cols = [col for col in panel_with_metrics.columns if col.endswith("_cal")]
print(f"Total calculated columns: {len(calculated_cols)}")
print("Calculated columns:", calculated_cols)

# Check for missing values in calculated columns
print("\nMissing values in calculated columns:")
for col in calculated_cols:
    # null_count() covers nulls only; NaN and inf from division by zero are not counted here
    null_count = panel_with_metrics[col].null_count()
    total_count = panel_with_metrics.height
    null_pct = (null_count / total_count) * 100
    print(f"  {col}: {null_count:,} ({null_pct:.1f}%)")

print(f"\nFinal dataset shape: {panel_with_metrics.shape}")
print(f"Total columns: {len(panel_with_metrics.columns)}")

Performing quality checks on calculated metrics...
Total calculated columns: 12
Calculated columns: ['operating_margin_cal', 'net_margin_cal', 'roa_ebit_cal', 'roe_cal', 'equity_ratio_cal', 'cost_ratio_cal', 'effective_tax_rate_cal', 'rev_growth_cal', 'cost_growth_cal', 'op_profit_growth_cal', 'asset_turnover_cal', 'labor_productivity_cal']

Missing values in calculated columns:
  operating_margin_cal: 671,795 (52.3%)
  net_margin_cal: 667,498 (51.9%)
  roa_ebit_cal: 667,828 (52.0%)
  roe_cal: 695,135 (54.1%)
  equity_ratio_cal: 640,395 (49.8%)
  cost_ratio_cal: 666,520 (51.9%)
  effective_tax_rate_cal: 674,739 (52.5%)
  rev_growth_cal: 486,146 (37.8%)
  cost_growth_cal: 482,646 (37.6%)
  op_profit_growth_cal: 492,050 (38.3%)
  asset_turnover_cal: 666,527 (51.9%)
  labor_productivity_cal: 666,112 (51.8%)

Final dataset shape: (1285104, 55)
Total columns: 55


In [203]:
# Statistical summary of key calculated metrics
print("Statistical summary of key calculated metrics:")
print("=" * 60)

key_metrics = [
    "operating_margin_cal", "net_margin_cal",
    "equity_ratio_cal",
    "cost_ratio_cal", "asset_turnover_cal"
]

for metric in key_metrics:
    if metric in panel_with_metrics.columns:
        summary = panel_with_metrics.select([
            pl.col(metric).count().alias("count"),
            pl.col(metric).mean().alias("mean"),
            pl.col(metric).median().alias("median"),
            pl.col(metric).std().alias("std"),
            pl.col(metric).min().alias("min"),
            pl.col(metric).max().alias("max"),
            pl.col(metric).quantile(0.05).alias("p5"),
            pl.col(metric).quantile(0.95).alias("p95"),
        ]).to_pandas().iloc[0]
        
        print(f"\n{metric}:")
        print(f"  Count: {summary['count']:,}")
        print(f"  Mean: {summary['mean']:.3f}")
        print(f"  Median: {summary['median']:.3f}")
        print(f"  Std: {summary['std']:.3f}")
        print(f"  Min: {summary['min']:.3f}")
        print(f"  Max: {summary['max']:.3f}")
        print(f"  P5-P95: [{summary['p5']:.3f}, {summary['p95']:.3f}]")

Statistical summary of key calculated metrics:

operating_margin_cal:
  Count: 613,309.0
  Mean: nan
  Median: 0.038
  Std: nan
  Min: -inf
  Max: inf
  P5-P95: [-0.220, 0.282]

net_margin_cal:
  Count: 617,606.0
  Mean: nan
  Median: 0.024
  Std: nan
  Min: -inf
  Max: inf
  P5-P95: [-0.234, 0.238]

equity_ratio_cal:
  Count: 644,709.0
  Mean: nan
  Median: 0.443
  Std: nan
  Min: -inf
  Max: inf
  P5-P95: [-0.302, 0.928]

cost_ratio_cal:
  Count: 618,584.0
  Mean: nan
  Median: 1.002
  Std: nan
  Min: -inf
  Max: inf
  P5-P95: [0.812, 2.694]

asset_turnover_cal:
  Count: 618,577.0
  Mean: nan
  Median: 1.635
  Std: nan
  Min: -inf
  Max: inf
  P5-P95: [0.107, 5.114]


In [204]:
# Outlier detection for calculated metrics
print("Detecting extreme outliers in calculated metrics...")
print("=" * 60)

# Check for extreme values that might indicate data quality issues
# (only metrics that are actually calculated above are active)
outlier_checks = {
    "operating_margin_cal": {"min_threshold": -5, "max_threshold": 5},
    "net_margin_cal": {"min_threshold": -5, "max_threshold": 5},
    # "debt_asset_ratio_cal": {"min_threshold": -1, "max_threshold": 10},
    # "debt_equity_ratio_cal": {"min_threshold": -10, "max_threshold": 100},
    "equity_ratio_cal": {"min_threshold": -5, "max_threshold": 5},
    "asset_turnover_cal": {"min_threshold": -1, "max_threshold": 100},
    "labor_productivity_cal": {"min_threshold": -1000000, "max_threshold": 100000000},
}

for metric, thresholds in outlier_checks.items():
    if metric in panel_with_metrics.columns:
        extreme_low = panel_with_metrics.filter(
            pl.col(metric) < thresholds["min_threshold"]
        ).shape[0]
        
        extreme_high = panel_with_metrics.filter(
            pl.col(metric) > thresholds["max_threshold"]
        ).shape[0]
        
        total_obs = panel_with_metrics.select(pl.col(metric).count()).to_numpy()[0, 0]
        
        print(f"{metric}:")
        print(f"  Extreme low (< {thresholds['min_threshold']}): {extreme_low:,} ({extreme_low/total_obs*100:.2f}%)")
        print(f"  Extreme high (> {thresholds['max_threshold']}): {extreme_high:,} ({extreme_high/total_obs*100:.2f}%)")
        print()

Detecting extreme outliers in calculated metrics...
operating_margin_cal:
  Extreme low (< -5): 9,484 (1.55%)
  Extreme high (> 5): 7,533 (1.23%)

net_margin_cal:
  Extreme low (< -5): 9,932 (1.61%)
  Extreme high (> 5): 8,086 (1.31%)

equity_ratio_cal:
  Extreme low (< -5): 4,013 (0.62%)
  Extreme high (> 5): 1,012 (0.16%)

asset_turnover_cal:
  Extreme low (< -1): 89 (0.01%)
  Extreme high (> 100): 1,172 (0.19%)

labor_productivity_cal:
  Extreme low (< -1000000): 7 (0.00%)
  Extreme high (> 100000000): 1,633 (0.26%)




In [205]:
# Sample preview of enriched dataset
print("Sample preview of enriched dataset:")
print("=" * 60)

# Show a sample of firms with their calculated metrics
sample_cols = [
    "ico", "year", "sales_revenue", "costs", "oper_profit", "total_assets", "equity",
    "operating_margin_cal", "roa_ebit_cal",
    "asset_turnover_cal", "rev_growth_cal"
]

sample_data = (
    panel_with_metrics
    .filter((pl.col("year") >= 2020) & (pl.col("sales_revenue").is_not_null()))
    .select([col for col in sample_cols if col in panel_with_metrics.columns])
    .limit(10)
)

print(sample_data)

print("\nDataset ready for analysis!")
print(f"Final shape: {panel_with_metrics.shape}")
print(f"Years covered: {panel_with_metrics['year'].min()} - {panel_with_metrics['year'].max()}")
print(f"Unique companies: {panel_with_metrics['ico'].n_unique():,}")

Sample preview of enriched dataset:
shape: (10, 11)
┌──────────┬──────┬────────────┬───────────┬───┬────────────┬────────────┬────────────┬────────────┐
│ ico      ┆ year ┆ sales_reve ┆ costs     ┆ … ┆ operating_ ┆ roa_ebit_c ┆ asset_turn ┆ rev_growth │
│ ---      ┆ ---  ┆ nue        ┆ ---       ┆   ┆ margin_cal ┆ al         ┆ over_cal   ┆ _cal       │
│ str      ┆ i16  ┆ ---        ┆ f64       ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│          ┆      ┆ f64        ┆           ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
╞══════════╪══════╪════════════╪═══════════╪═══╪════════════╪════════════╪════════════╪════════════╡
│ 00000205 ┆ 2020 ┆ 2.0082e9   ┆ 2.8444e9  ┆ … ┆ 0.024292   ┆ 0.005338   ┆ 0.219762   ┆ 0.011569   │
│ 00000205 ┆ 2021 ┆ 2.8842e9   ┆ 2.7322e9  ┆ … ┆ 0.198764   ┆ 0.054401   ┆ 0.273696   ┆ 0.436204   │
│ 00000205 ┆ 2022 ┆ 3.0298e9   ┆ 2.9313e9  ┆ … ┆ 0.247512   ┆ 0.071189   ┆ 0.287617   ┆ 0.050471   │
│ 00000205 ┆ 2023 ┆ 2.6920e9   ┆ 2.7482

In [206]:
nace_matching_path = os.path.join("..", "data", "source_cleaned", "t_nace_matching.parquet")
nace_matching = pl.read_parquet(nace_matching_path)

# Enrich the panel with NACE level 1 and level 2 codes
print("Enriching panel with NACE level codes...")

# Select relevant columns from the NACE matching table for joining
nace_for_join = nace_matching.select([
    "magnus_nace", 
    "level1_code", 
    "level2_code", 
    "name_czso_en",
    "industry_flag"
]).unique(subset=["magnus_nace"])  # Ensure unique magnus_nace codes

print(f"NACE table for join shape: {nace_for_join.shape}")

# Perform left join to add NACE level codes
panel_enriched = panel_with_metrics.join(
    nace_for_join,
    left_on="main_nace_code",
    right_on="magnus_nace",
    how="left"
)

print(f"Panel after NACE enrichment shape: {panel_enriched.shape}")

# Check how many rows got matched
matched_count = panel_enriched.filter(pl.col("level1_code").is_not_null()).shape[0]
total_count = panel_enriched.shape[0]
match_rate = (matched_count / total_count) * 100

print(f"Successfully matched {matched_count:,} out of {total_count:,} rows ({match_rate:.1f}%)")

# Show sample of enriched data
print("\nSample of enriched data:")
sample_enriched = panel_enriched.select([
    "ico", "year", "main_nace_code", "level1_code", "level2_code", 
    "name_czso_en", "industry_flag"
]).filter(pl.col("level1_code").is_not_null()).head(10)
print(sample_enriched)

Enriching panel with NACE level codes...
NACE table for join shape: (1005, 5)
Panel after NACE enrichment shape: (1285104, 59)
Successfully matched 1,268,976 out of 1,285,104 rows (98.7%)

Sample of enriched data:
shape: (10, 7)
┌──────────┬──────┬────────────────┬─────────────┬─────────────┬──────────────┬───────────────┐
│ ico      ┆ year ┆ main_nace_code ┆ level1_code ┆ level2_code ┆ name_czso_en ┆ industry_flag │
│ ---      ┆ ---  ┆ ---            ┆ ---         ┆ ---         ┆ ---          ┆ ---           │
│ str      ┆ i16  ┆ str            ┆ str         ┆ str         ┆ str          ┆ bool          │
╞══════════╪══════╪════════════════╪═════════════╪═════════════╪══════════════╪═══════════════╡
│ 00000205 ┆ 2000 ┆ 020000         ┆ A           ┆ 02          ┆ Forestry and ┆ false         │
│          ┆      ┆                ┆             ┆             ┆ logging      ┆               │
│ 00000205 ┆ 2001 ┆ 020000         ┆ A           ┆

In [207]:
# Save the enriched dataset with NACE codes
print("Saving enriched dataset with NACE level codes...")

# Path for the output
output_path = os.path.join("..", "data", "source_cleaned", "magnusweb_panel_with_margins.parquet")

# Save the enriched dataset
panel_enriched.write_parquet(output_path, compression="snappy")

print(f"✔ Enriched dataset saved to: {output_path}")
print(f"Final dataset:")
print(f"  - Shape: {panel_enriched.shape}")
print(f"  - Unique companies: {panel_enriched.select(pl.col('ico').n_unique()).to_numpy()[0, 0]:,}")
print(f"  - Years: {panel_enriched['year'].min()} - {panel_enriched['year'].max()}")
print(f"  - Total calculated metrics: {len([col for col in panel_enriched.columns if col.endswith('_cal')])}")
print(f"  - NACE enrichment: Level 1 and Level 2 codes added")

# Show final column list
print(f"\nAll columns ({len(panel_enriched.columns)}):")
for i, col in enumerate(panel_enriched.columns, 1):
    if col.endswith("_cal"):
        suffix = " (calculated)"
    elif col in ["level1_code", "level2_code", "name_czso_en", "industry_flag"]:
        suffix = " (NACE enriched)"
    else:
        suffix = ""
    print(f"  {i:2d}. {col}{suffix}")

print(f"\n🎉 Data cleaning and enrichment complete!")
print(f"📊 Ready for profit margin and inflation analysis")

Saving enriched dataset with NACE level codes...
✔ Enriched dataset saved to: ../data/source_cleaned/magnusweb_panel_with_margins.parquet
Final dataset:
  - Shape: (1285104, 59)
  - Unique companies: 53,546
  - Years: 2000 - 2023
  - Total calculated metrics: 12
  - NACE enrichment: Level 1 and Level 2 codes added

All columns (59):
   1. ico
   2. year
   3. other_liabilities
   4. costs
   5. sales_revenue
   6. equity
   7. profit_net
   8. turnover
   9. current_assets
  10. oper_profit
  11. total_liabilities
  12. total_assets
  13. total_liabilities_and_equity
  14. profit_pre_tax
  15. other_assets
  16. fixed_assets
  17. name
  18. main_nace
  19. main_nace_code
  20. sub_nace_cz
  21. sub_nace_cz_code
  22. main_okec
  23. main_okec_code
  24. sub_okec
  25. sub_okec_code
  26. esa2010
  27. esa95
  28. locality
  29. region
  30. num_employees
  31. num_employees_cat
  32. turnover_cat
  33. audit
  34. consolidation
  35. currency
  36. date_founded
  37. date_dissolved
  