### Magnusweb Data Cleaning and Enrichment 
This notebook cleans the Magnusweb dataset to ensure data quality and prepare it for analysis. The cleaning process includes:
- correcting column types
- removing invalid rows (companies reporting fewer than 10 employees)

Initial panel structure:
```
Type of panel: <class 'polars.dataframe.frame.DataFrame'>
Static columns (do not change across years for each ICO): ['name', 'main_nace', 'main_nace_code', 'sub_nace_cz', 'sub_nace_cz_code', 'main_okec', 'main_okec_code', 'sub_okec', 'sub_okec_code', 'esa2010', 'esa95', 'locality', 'region', 'num_employees', 'num_employees_cat', 'turnover_cat', 'audit', 'consolidation', 'currency', 'date_founded', 'date_dissolved', 'status', 'legal_form', 'entity_type']
Time-series columns (change across years for each ICO): ['year', 'profit_pre_tax', 'total_liabilities', 'oper_profit', 'total_assets', 'profit_net', 'sales_revenue', 'equity', 'turnover', 'costs']
```

In [4]:
import polars as pl
import os

# Path to curated panel file
curated_path = os.path.join("..", "data", "source_cleaned", "magnusweb_panel.parquet")

# Load the curated panel
panel = pl.read_parquet(curated_path)
print(f"Loaded panel shape: {panel.shape}")
#panel.head()

Loaded panel shape: (1754304, 40)


## Correcting types

Summary:  
Six columns needed type corrections:  
- `num_employees_cat`, `status`, `legal_form`, and `entity_type` were converted to categorical type.
- `date_founded` and `date_dissolved` were converted to date type.  


In [None]:
# Check type and first 5 values for each column, correct if needed
expected_types = {
    'ico': pl.String,
    'name': pl.String,
    'main_nace': pl.Categorical,
    'main_nace_code': pl.String,
    'sub_nace_cz': pl.Categorical,
    'sub_nace_cz_code': pl.String,
    'main_okec': pl.Categorical,
    'main_okec_code': pl.String,
    'sub_okec': pl.Categorical,
    'sub_okec_code': pl.String,
    'esa2010': pl.Categorical,
    'esa95': pl.Categorical,
    'locality': pl.Categorical,
    'region': pl.Categorical,
    'num_employees': pl.Int32,
    'num_employees_cat': pl.Categorical,
    'turnover_cat': pl.Categorical,
    'audit': pl.Categorical,
    'consolidation': pl.Categorical,
    'currency': pl.Categorical,
    'date_founded': pl.Date,
    'date_dissolved': pl.Date,
    'status': pl.Categorical,
    'legal_form': pl.Categorical,
    'entity_type': pl.Categorical,
    'year': pl.Int16,
    'profit_pre_tax': pl.Float64,
    'total_liabilities': pl.Float64,
    'oper_profit': pl.Float64,
    'total_assets': pl.Float64,
    'profit_net': pl.Float64,
    'sales_revenue': pl.Float64,
    'equity': pl.Float64,
    'turnover': pl.Float64,
    'costs': pl.Float64,
}

panel_checked = panel.clone()
corrections = {}

for col in panel_checked.columns:
    # print(f"Column: {col}")
    # print(f"  Type: {panel_checked[col].dtype}")
    # print(f"  First 5 values: {panel_checked[col].head(5).to_list()}")
    expected = expected_types.get(col)
    if expected and panel_checked[col].dtype != expected:
        # Try to convert
        try:
            if expected == pl.Categorical:
                panel_checked = panel_checked.with_columns(pl.col(col).cast(pl.Categorical, strict=False))
            elif expected == pl.Date:
                panel_checked = panel_checked.with_columns(pl.col(col).str.strptime(pl.Date, "%Y-%m-%d", strict=False))
            else:
                panel_checked = panel_checked.with_columns(pl.col(col).cast(expected, strict=False))
            corrections[col] = expected
        except Exception as e:
            # Report the failure and leave the column unchanged
            print(f"  Could not convert {col} to {expected}: {e}")

if corrections:
    print("\nCorrections made:")
    for col, typ in corrections.items():
        print(f"  {col}: {typ}")
        print(f"    New type: {panel_checked[col].dtype}")
        print(f"    First 5 values: {panel_checked[col].head(5).to_list()}")

Corrections made:
  num_employees_cat: Categorical
    New type: Categorical(ordering='physical')
    First 5 values: ['25 - 49 zaměstnanců', 'Neuvedeno', '25 - 49 zaměstnanců', 'Neuvedeno', '10 - 19 zaměstnanců']
  date_founded: Date
    New type: Date
    First 5 values: [datetime.date(2008, 1, 30), datetime.date(2008, 1, 17), datetime.date(2018, 7, 27), datetime.date(2018, 7, 25), datetime.date(2018, 7, 29)]
  date_dissolved: Date
    New type: Date
    First 5 values: [None, datetime.date(2021, 7, 7), None, None, None]
  status: Categorical
    New type: Categorical(ordering='physical')
    First 5 values: ['', 'Zrušený', '', '', '']
  legal_form: Categorical
    New type: Categorical(ordering='physical')
    First 5 values: ['Společnost s ručením omezeným', 'Společnost s ručením omezeným', 'Odštěpný závod zahraniční právnické osoby', 'Společnost s ručením omezeným', 'Akciová společnost']
  entity_type: Categorical
    New type: Categorical(

## Removing rows with 0–9 employees
- Even though the Magnusweb export was filtered for 10+ employees, some rows still report 0–9 employees because of data inconsistencies.
- This step retains only companies with 10 or more employees.

In [7]:
# Remove rows with 0–9 employees
panel_new = panel_checked.filter(pl.col("num_employees") > 9)
print(f"Filtered panel shape: {panel_new.shape}")
# panel_new.head()

print("Number of unique ICOs:", panel_new["ico"].n_unique())

Filtered panel shape: (1285128, 40)
Number of unique ICOs: 53547
