### Matching of NACE codes from different sources 

The main source for the matching is the NACE Rev. 2 classification published by CZSO: https://apl2.czso.cz/iSMS/en/klasdata.jsp?kodcis=80004

1. **Ancestry Extraction:**  
   The code builds the hierarchy for each record by following the parent pointer (`nadvaz`) from the current row back to the top-level record (level 1). This produces a lineage (a list of `(uroven, chodnota)` tuples) that starts at the top and ends at the current row. This hierarchical path is used for further transformations.

2. **Separation of Top-Level Letter and Numeric Levels:**  
   - **Top-Level Code (Level 1):** The first element of the ancestry (level 1) is expected to be a letter (e.g., "A", "B", etc.) and is stored as the `level1_code`.  
   - **Numeric Levels:** All subsequent levels (levels 2 and above) contain numeric codes. These codes are collected separately for further processing.

3. **Incremental Code Computation:**  
   For each numeric level, an incremental code is computed. The idea is to capture only the additional characters beyond the parent’s numeric code:
   - For the first numeric level (level 2), the full numeric code is used as is.
   - For each subsequent level, the prefix that matches the parent’s full numeric code is removed. This yields a shorter “incremental” code that highlights the difference between the level and its parent.

4. **Construction of `full_nace`:**  
   The `full_nace` field is constructed by concatenating the top-level letter with the incremental codes from each numeric level, joined with dots. This produces a compact representation that still reflects the hierarchical structure (e.g., the level 5 code `01110` under section `A` becomes `"A.01.1.1.0"`). This format is used by Eurostat.

5. **Magnus NACE Code Calculation:**  
   The `magnus_nace` is generated from the current row’s `chodnota`:
   - For rows at levels 2 through 5, the code is right-padded with zeros to 6 digits to conform to the Magnus format (e.g., `01` becomes `010000`).
   - For level 1 rows, `magnus_nace` is instead filled with the `level1_code` (the top-level letter).

6. **Industry Flag:**  
   An `industry_flag` is set to `True` if the top-level letter (from `level1_code`) belongs to one of the specified sectors (B, C, D, E), indicating that the classification is part of an industry sector.

7. **Final Output:**  
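   The resulting DataFrame includes the transformed columns:  
   - `name_czso_cs` (short Czech description),  
   - `name_czso_en` (short English description, added from the English CZSO file),  
   - `level` (hierarchical level),  
   - `czso_code` (the original CZSO code, `chodnota`),  
   - `level1_code` (top-level letter) and `level2_code` through `level5_code` (incremental codes for each numeric level),  
   - `full_nace` (incremental full hierarchy),  
   - `magnus_nace` (formatted 6-digit code for levels 2 to 5, the section letter for level 1), and  
   - `industry_flag` (industry indicator).  
   This table is then saved to Parquet.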
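
Steps 1 to 4 can be sketched on a toy hierarchy. This is a minimal illustration, not the notebook's implementation: the `TOY_NACE` dictionary and the helper names `lineage` and `full_nace` are hypothetical, and the sample codes are assumed to follow the CZSO pattern of one extra digit per level.

```python
# Toy parent-pointer table: code -> (level, parent code). Values are illustrative.
TOY_NACE = {
    "A": (1, ""),
    "01": (2, "A"),
    "011": (3, "01"),
    "0111": (4, "011"),
    "01110": (5, "0111"),
}

def lineage(code, table):
    """Follow parent pointers upward, then reverse to get top-down order."""
    path = []
    while code in table:
        level, parent = table[code]
        path.append((level, code))
        if not parent or parent == code:
            break
        code = parent
    return path[::-1]

def full_nace(code, table):
    """Join the section letter with the incremental numeric codes."""
    path = lineage(code, table)
    letter = next((c for lvl, c in path if lvl == 1), "")
    numeric = [c for lvl, c in path if lvl != 1]
    # Level 2 keeps its full code; deeper levels keep only the characters
    # that extend the parent's code.
    inc = [c if i == 0 else c[len(numeric[i - 1]):] for i, c in enumerate(numeric)]
    return ".".join([letter] + inc)

print(lineage("0111", TOY_NACE))     # [(1, 'A'), (2, '01'), (3, '011'), (4, '0111')]
print(full_nace("01110", TOY_NACE))  # A.01.1.1.0
print(full_nace("A", TOY_NACE))      # A
```

Using a plain dictionary keeps the sketch independent of the CSV layout; the notebook itself walks the `nadvaz` column of the DataFrame instead.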



In [13]:
import os
import pandas as pd
from datetime import datetime

In [14]:
# -------------------------------------------------------------------------
# 1. Define file paths
# -------------------------------------------------------------------------
script_dir = os.getcwd()  # current directory in Jupyter
project_root = os.path.abspath(os.path.join(script_dir, ".."))
input_file = os.path.join(project_root, "data", "source_raw", "NACE", "KLAS80004_CS_NACE_classification.csv")
output_folder = os.path.join(project_root, "data", "source_cleaned")
os.makedirs(output_folder, exist_ok=True)  # create the folder if it is missing
output_file = os.path.join(output_folder, "t_nace_matching.parquet")

# -------------------------------------------------------------------------
# 2. Load the CSV file
# -------------------------------------------------------------------------
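# Read the code columns as strings so leading zeros (e.g. "01") are preserved
df_czso_nace = pd.read_csv(input_file, sep=",", dtype={"chodnota": str, "nadvaz": str})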

# Expected columns: "kodjaz","akrcis","kodcis","uroven","chodnota",
# "zkrtext","text","admplod","admnepo","nadvaz"

# -------------------------------------------------------------------------
# 3. Preprocess data: ensure string type and fill NaNs
# -------------------------------------------------------------------------
df_czso_nace = df_czso_nace.fillna("")
df_czso_nace["chodnota"] = df_czso_nace["chodnota"].astype(str)
df_czso_nace["nadvaz"] = df_czso_nace["nadvaz"].astype(str)

# Build a lookup: mapping from chodnota value to DataFrame index
chodnota_to_idx = dict(zip(df_czso_nace["chodnota"], df_czso_nace.index))

# -------------------------------------------------------------------------
# 4. Function to retrieve ancestry (from level 1 to current)
# -------------------------------------------------------------------------
def get_ancestry_codes(row, df):
    """
    Returns a list of (uroven, chodnota) tuples from the top ancestor (level 1)
    down to the current row.
    """
    lineage = []
    current_chodnota = row["chodnota"]
    while current_chodnota:
        idx = chodnota_to_idx.get(current_chodnota)
        if idx is None:
            break
        current_row = df.loc[idx]
        lineage.append((int(current_row["uroven"]), current_row["chodnota"]))
        parent = current_row["nadvaz"]
        if not parent or parent == current_chodnota:
            break
        current_chodnota = parent
    lineage.reverse()  # now from top (level 1) to current
    return lineage

# -------------------------------------------------------------------------
# 5. Helper to create a 6-digit Magnus NACE code from a full numeric code
# -------------------------------------------------------------------------
def to_magnus_nace(code_str):
    if not code_str.isdigit():
        return None
    # Pad with trailing zeros to get 6 digits
    return code_str.ljust(6, "0") if len(code_str) < 6 else code_str[:6]

# -------------------------------------------------------------------------
# 6. Build new classification table with adjustments
# -------------------------------------------------------------------------
rows_output = []

for i, row in df_czso_nace.iterrows():
    lineage = get_ancestry_codes(row, df_czso_nace)
    
    # Separate top-level letter and numeric levels (levels>=2)
    letter = ""
    numeric_levels = []  # store full numeric codes as strings
    for lvl, code in lineage:
        if lvl == 1:
            letter = code  # expected to be a letter (e.g. "A", "U")
        else:
            numeric_levels.append(code)
    
    # Always fill level1_code from the letter (even for lower levels)
    level1_code = letter

    # Compute incremental codes for numeric levels:
    # For the first numeric level, incremental = full numeric (since no parent)
    # For subsequent ones, remove the parent's full code prefix.
    inc_codes = []
    for idx_num, num_code in enumerate(numeric_levels):
        if idx_num == 0:
            inc = num_code  # full code for level 2 remains as is
        else:
            # Remove parent's full code (its length) from current full code
            inc = num_code[len(numeric_levels[idx_num - 1]):]
        inc_codes.append(inc)

    # Build full_nace using the top-level letter and the incremental numeric codes:
    full_nace_parts = [letter] + inc_codes
    full_nace = ".".join(full_nace_parts)
    
    # Prepare level2_code to level5_code (if available)
    level2_code = inc_codes[0] if len(inc_codes) >= 1 else ""
    level3_code = inc_codes[1] if len(inc_codes) >= 2 else ""
    level4_code = inc_codes[2] if len(inc_codes) >= 3 else ""
    level5_code = inc_codes[3] if len(inc_codes) >= 4 else ""
    
    # Determine magnus_nace:
    # Use the current row's full numeric code (row["chodnota"]) if level is 2,3,4,5.
    try:
        current_level = int(row["uroven"])
    except (ValueError, TypeError):
        current_level = 0
    if current_level in [2, 3, 4, 5]:
        magnus_nace = to_magnus_nace(row["chodnota"])
    else:
        magnus_nace = ""  # level 1 is filled with the section letter after the loop
    
    # Determine industry_flag based on top-level letter (B, C, D, E)
    industry_flag = (letter in ["B", "C", "D", "E"])
    
    # Build record; full_nace is the section letter plus the incremental numeric codes
    out_record = {
        "name_czso_cs": row["zkrtext"],
        "level": row["uroven"],
        "czso_code": row["chodnota"],
        "level1_code": level1_code,
        "level2_code": level2_code,
        "level3_code": level3_code,
        "level4_code": level4_code,
        "level5_code": level5_code,
        "full_nace": full_nace,
        "magnus_nace": magnus_nace,
        "industry_flag": industry_flag,
    }
    rows_output.append(out_record)

# when level = 1, fill magnus_nace with level1_code
for record in rows_output:
    if record["level"] == 1:
        record["magnus_nace"] = record["level1_code"]

df_result = pd.DataFrame(rows_output)
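
The padding rule and the level-dependent choice of `magnus_nace` can be checked in isolation. The sketch below repeats `to_magnus_nace` so that it runs on its own; `magnus_code` and `INDUSTRY_SECTIONS` are hypothetical names introduced for illustration, and the sample codes are assumed inputs rather than rows read from the CZSO file.

```python
def to_magnus_nace(code_str):
    """Right-pad a purely numeric code with zeros to 6 characters."""
    if not code_str.isdigit():
        return None
    return code_str.ljust(6, "0")[:6]

# Sections treated as industry in the notebook
INDUSTRY_SECTIONS = {"B", "C", "D", "E"}

def magnus_code(level, code, section_letter):
    """Level 1 rows carry the section letter; levels 2 to 5 are zero-padded."""
    if level == 1:
        return section_letter
    if level in (2, 3, 4, 5):
        return to_magnus_nace(code)
    return ""

print(magnus_code(2, "01", "A"))     # 010000
print(magnus_code(5, "01110", "A"))  # 011100
print(magnus_code(1, "A", "A"))      # A
print("C" in INDUSTRY_SECTIONS, "A" in INDUSTRY_SECTIONS)  # True False
```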


In [15]:
# Enrich with English names
input_file = os.path.join(project_root, "data", "source_raw", "NACE", "KLAS80004_EN_NACE_classification.csv")

df_en_nace = pd.read_csv(input_file, sep=",")
df_en_nace = df_en_nace.fillna("")
df_en_nace["chodnota"] = df_en_nace["chodnota"].astype(str)

# Build a lookup from chodnota value to the English short name (zkrtext)
chodnota_to_name_en = dict(zip(df_en_nace["chodnota"], df_en_nace["zkrtext"]))
# Add name_czso_en to df_result; codes without an English match get ""
df_result["name_czso_en"] = df_result["czso_code"].map(chodnota_to_name_en).fillna("")

# count null or empty values
# df_result["name_czso_en"].isnull().sum(), df_result["name_czso_en"].eq("").sum()

# Sanity check: report the number of duplicated czso_code values
print("Duplicated czso_code values:", df_result["czso_code"].duplicated().sum())

# move name_czso_en to the second position
cols = df_result.columns.tolist()
cols.insert(1, cols.pop(cols.index("name_czso_en")))
df_result = df_result[cols]
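
The lookup above leaves `name_czso_en` empty for any code missing from the English file. One way to list such codes is a left merge with `indicator=True`, sketched here on toy data. The frames `df_cs` and `df_en` and their values are placeholders, not the real CZSO tables.

```python
import pandas as pd

# Toy stand-ins for df_result and the English CZSO table (placeholder values)
df_cs = pd.DataFrame({"czso_code": ["A", "01", "00"]})
df_en = pd.DataFrame({
    "chodnota": ["A", "01"],
    "zkrtext": ["name of A (placeholder)", "name of 01 (placeholder)"],
})

# A left merge keeps every row of df_cs; the _merge column records whether
# an English match was found ("both") or not ("left_only").
merged = df_cs.merge(
    df_en.rename(columns={"chodnota": "czso_code", "zkrtext": "name_czso_en"}),
    on="czso_code",
    how="left",
    indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "czso_code"].tolist()
print(unmatched)  # ['00']
```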


In [16]:
# 'Other' row: present in the Magnus data but not in the CZSO data
new_row = {
    "name_czso_cs": "Výroba, obchod a služby neuvedené v přílohách 1 až 3 živnostenského zákona",
    "level": 1,
    "czso_code": "00",
    "level1_code": "00",
    "level2_code": "",
    "level3_code": "",
    "level4_code": "",
    "level5_code": "",
    "full_nace": "00",
    "magnus_nace": "00",
    "industry_flag": False,
    "name_czso_en": "Production, trade and services not listed in Annexes 1 to 3 of the Trade Licensing Act"
}

df_result = pd.concat([df_result, pd.DataFrame([new_row])], ignore_index=True)


In [17]:
# save to parquet
df_result.to_parquet(output_file, index=False)
# Print the output file path
print(f"Data saved to {output_file}")

Data saved to /Users/adam/Library/Mobile Documents/com~apple~CloudDocs/School/Master's Thesis/Analysis/profit-margins-inflation/data/source_cleaned/t_nace_matching.parquet
