---
# Enhancing Data Quality with Business Type Categorization

### Context:
This notebook handles large datasets where the "Business Type" field is missing.
It enriches the main data with the correct "Business Type" from a separate reference table, matching records on their legal form (`LEGAL_FORM`).

### Workflow:
1. **Understand the Goal**: Define classification objective.
2. **Data Loading & Initial Checks**: Import data, perform initial quality assessment.
3. **Data Cleansing**: Prepare datasets for matching.
4. **Design Matching Logic**: Develop exact and partial classification strategies.
5. **Business Type Assignment**: Apply logic to classify records.
6. **Final Summary**: Analyze and present final classification outcomes.

---

In [17]:
import pandas as pd
import numpy as np

from collections import Counter
import string
import re

# 1. Data Loading

Load `Data.csv` and the Business Type reference table as DataFrames:
- `Data.csv`: Main dataset with ~1 million records.
- `BUSINESS_TYPE.xlsx`: Reference table for Business Type mapping.

In [31]:
# Step 1: Load the main dataset
df = pd.read_csv('Data.csv', sep=';', quotechar='"', on_bad_lines='skip',
                 encoding='utf-8', dtype={'UNIQUE_KEY': str, 'REVENUES': str, 'TOTAL ASSETS': str},
                 low_memory=False)

print(f"DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.")
df.dtypes

DataFrame has 999967 rows and 13 columns.


UNIQUE_KEY       object
NAME             object
YEAR            float64
CITY             object
COUNTRY          object
STATUS           object
LEGAL_FORM       object
ISIC            float64
REGISTRATION    float64
LIQUIDATION     float64
TERMINATION     float64
REVENUES         object
TOTAL ASSETS     object
dtype: object

In [15]:
# Step 2: Load the reference Business Type table
business_type = pd.read_excel('BUSINESS_TYPE.xlsx')

print(f"DataFrame has {business_type.shape[0]} rows and {business_type.shape[1]} columns.")
business_type.head(5)

DataFrame has 20 rows and 2 columns.


Unnamed: 0,Legal_Form,Business_Type
0,AG – Aktiengesellschaft,Public
1,SE – Societas Europaea,Public
2,KGaA – Kommanditgesellschaft auf Aktien,Hybrid
3,GmbH – Gesellschaft mit beschränkter Haftung,Private
4,UG (haftungsbeschränkt),Private


---
# 2. Initial Data Quality Checks

## 2.1. Overall Quality Check

Quality Check (QC) on the main data:
- Check data shape, missing values, blank entries, and duplicates.
- Key columns: `UNIQUE_KEY`, `NAME`, and `LEGAL_FORM`.

In [4]:
def basic_qc(df):
    print("=== 📊 DATA OVERVIEW 📊 ===")
    print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"Duplicated rows: {df.duplicated().sum():,}")
    
    print("\n=== 🧬 DATA TYPES 🧬 ===")
    for col in ['UNIQUE_KEY', 'NAME', 'LEGAL_FORM']:
        print(f"{col}: {df[col].dtype}")
    
    print("\n=== ❓ MISSING VALUES ❓ ===")
    missing = df.isnull().sum()
    for col in ['UNIQUE_KEY', 'NAME', 'LEGAL_FORM']:
        count = missing[col]
        percent = (count / len(df)) * 100
        print(f"{col}: {count:,} missing ({percent:.2f}%)")

    print("\n=== ⚠️ (NON-NULL) BLANK ENTRIES ⚠️ ===")
    for col in ['UNIQUE_KEY', 'NAME', 'LEGAL_FORM']:
        if col in df.columns:
            blank_count = df[~df[col].isnull() & (df[col].astype(str).str.strip() == '')].shape[0]
            percent = (blank_count / len(df)) * 100
            print(f"{col}: {blank_count:,} blank ({percent:.2f}%)")

    print("\n=== 🔍 UNIQUE VALUE 🔍 ===")
    for col in ['UNIQUE_KEY', 'NAME', 'LEGAL_FORM']:
        unique_count = df[col].nunique()
        percent = (unique_count / len(df)) * 100
        print(f"{col}: {unique_count:,} unique ({percent:.2f}%)")

basic_qc(df)

=== 📊 DATA OVERVIEW 📊 ===
Shape: 999,967 rows × 13 columns
Duplicated rows: 158

=== 🧬 DATA TYPES 🧬 ===
UNIQUE_KEY: object
NAME: object
LEGAL_FORM: object

=== ❓ MISSING VALUES ❓ ===
UNIQUE_KEY: 0 missing (0.00%)
NAME: 0 missing (0.00%)
LEGAL_FORM: 85,493 missing (8.55%)

=== ⚠️ (NON-NULL) BLANK ENTRIES ⚠️ ===
UNIQUE_KEY: 0 blank (0.00%)
NAME: 0 blank (0.00%)
LEGAL_FORM: 0 blank (0.00%)

=== 🔍 UNIQUE VALUE 🔍 ===
UNIQUE_KEY: 79,532 unique (7.95%)
NAME: 985,153 unique (98.52%)
LEGAL_FORM: 198 unique (0.02%)


### LEGAL_FORM Specific Check

In [5]:
# View unique LEGAL_FORM values
# pd.set_option('display.max_rows', None)
df['LEGAL_FORM'].value_counts(dropna=False).sort_index()

LEGAL_FORM
A/S                25
A/S & Co. KG        5
AB                 26
AB & Co. KG         5
AG               5495
                ...  
ΟΕ                  3
ООО                 2
ПАО                 1
בע"מ                1
NaN             85493
Name: count, Length: 199, dtype: int64

## 2.2. Check for duplicate entries (Optional)

- Check for and address duplicate records, which can skew analysis and lead to incorrect enrichment.
- We'll differentiate between exact row duplicates and duplicates based solely on `UNIQUE_KEY`.
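
To make the distinction between the two duplicate checks concrete, here is a minimal sketch on a hypothetical toy frame (the values are illustrative and not taken from `Data.csv`). It also shows `keep=False`, which can help when inspecting every row that shares a key rather than only the later repeats.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "UNIQUE_KEY": ["A1", "A1", "A1", "B2"],
    "NAME": ["Alpha", "Alpha", "Alpha Holding", "Beta"],
})

# Exact duplicates: every column matches an earlier row.
exact_dupes = int(toy.duplicated().sum())

# Key duplicates: only UNIQUE_KEY matches an earlier row.
key_dupes = int(toy.duplicated(subset=["UNIQUE_KEY"]).sum())

# keep=False marks every member of a duplicated key group, which is
# convenient for viewing the conflicting rows side by side.
conflicts = toy[toy.duplicated(subset=["UNIQUE_KEY"], keep=False)]

print(exact_dupes, key_dupes, len(conflicts))
```

In this toy case a key can repeat while the rows still differ, which is why a high key-duplicate count does not by itself justify dropping rows.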

In [6]:
# Check number of exact row duplicates (all column values are identical)
exact_duplicates_count = df.duplicated().sum()
print(f"Exact row duplicates: {exact_duplicates_count}")

# Check number of duplicates based on UNIQUE_KEY only
key_duplicates_count = df.duplicated(subset=['UNIQUE_KEY']).sum()
print(f"Duplicates by UNIQUE_KEY: {key_duplicates_count}")

Exact row duplicates: 158
Duplicates by UNIQUE_KEY: 920435


---
# 3. Data Cleansing

## 3.1. Handle duplicated entries

In [12]:
# Drop exact duplicate rows to reduce redundancy
df = df.drop_duplicates()

# Check again
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Duplicated rows: {df.duplicated().sum():,}")

Shape: 999,809 rows × 13 columns
Duplicated rows: 0


## 3.2. Clean and prepare the Business Type reference table

1. Simplify legal forms by extracting short codes (e.g., "AG – Aktiengesellschaft" becomes "AG").
2. Handle special cases such as shortening "UG (haftungsbeschränkt)" to "UG".
3. Expand combined entries like 'e.K. / e.Kfm. / e.Kfr.' into separate rows.
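
As an alternative to listing the expanded forms by hand, the steps above could be sketched with `str.split` and `DataFrame.explode`. The two-row `ref_demo` frame below is a hypothetical excerpt, not the real reference table, and this is only one possible approach.

```python
import pandas as pd

# Hypothetical two-row excerpt of the reference table.
ref_demo = pd.DataFrame({
    "Legal_Form": ["AG – Aktiengesellschaft", "e.K. / e.Kfm. / e.Kfr."],
    "Business_Type": ["Public", "Private"],
})

# Keep the text before the en dash, then split combined entries on "/".
short = ref_demo["Legal_Form"].str.split("–").str[0].str.strip()
ref_demo["short_form"] = short.str.split("/")

# explode() emits one row per list element; strip the leftover spaces.
expanded = ref_demo.explode("short_form", ignore_index=True)
expanded["short_form"] = expanded["short_form"].str.strip()

print(expanded[["short_form", "Business_Type"]])
```

Note that this yields `e.K.` without an inner space, whereas the main data appears to use `e. K.`, so a further normalisation step (or the manual entries used in this notebook) would still be needed before matching.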

In [16]:
# Step 1: Copy the reference table so the original stays untouched
ref = business_type.copy()

# Step 2: Clean and simplify the legal forms
ref['short_form'] = ref['Legal_Form'].str.split('–').str[0].str.strip()
ref['short_form'] = ref['short_form'].replace({'UG (haftungsbeschränkt)': 'UG'})

# Step 3: Remove the combined 'e.K. / e.Kfm. / e.Kfr.' entry
ref = ref[ref['short_form'] != 'e.K. / e.Kfm. / e.Kfr.']

# Step 4: Add manual entries for expanded forms
expanded_forms = {
    'e. K.': 'Private',
    'e. Kfm.': 'Private',
    'e. Kfr.': 'Private'
}

manual_df = pd.DataFrame({
    'short_form': list(expanded_forms.keys()),
    'Business_Type': list(expanded_forms.values()),
    'Legal_Form': ['e.K. / e.Kfm. / e.Kfr.'] * len(expanded_forms)
})

# Step 5: Combine cleaned reference and manual rows
ref_cleaned = pd.concat([ref[['Legal_Form', 'short_form', 'Business_Type']], manual_df], ignore_index=True)
ref_cleaned

Unnamed: 0,Legal_Form,short_form,Business_Type
0,AG – Aktiengesellschaft,AG,Public
1,SE – Societas Europaea,SE,Public
2,KGaA – Kommanditgesellschaft auf Aktien,KGaA,Hybrid
3,GmbH – Gesellschaft mit beschränkter Haftung,GmbH,Private
4,UG (haftungsbeschränkt),UG,Private
5,GmbH & Co. KG,GmbH & Co. KG,Hybrid
6,AG & Co. KG,AG & Co. KG,Hybrid
7,SE & Co. KG,SE & Co. KG,Hybrid
8,OHG – Offene Handelsgesellschaft,OHG,Private
9,KG – Kommanditgesellschaft,KG,Private


## 3.3. Create the final lookup dictionary

In [18]:
lookup_dict = dict(zip(ref_cleaned['short_form'], ref_cleaned['Business_Type']))
lookup_dict

{'AG': 'Public',
 'SE': 'Public',
 'KGaA': 'Hybrid',
 'GmbH': 'Private',
 'UG': 'Private',
 'GmbH & Co. KG': 'Hybrid',
 'AG & Co. KG': 'Hybrid',
 'SE & Co. KG': 'Hybrid',
 'OHG': 'Private',
 'KG': 'Private',
 'GbR': 'Private',
 'PartG': 'Private',
 'PartGmbB': 'Private',
 'eG': 'Hybrid',
 'Stiftung': 'Private',
 'VVaG': 'Hybrid',
 'EWIV': 'Hybrid',
 'Freiberufler': 'Private',
 'Other': 'Private',
 'e. K.': 'Private',
 'e. Kfm.': 'Private',
 'e. Kfr.': 'Private'}

---
# 4. Adding Match Quality column

## 4.1. Determine MATCH_QUALITY

For each record in the dataset, assign a 'MATCH_QUALITY' label:

- **❓ Missing**: LEGAL_FORM is missing or blank.
- **✅ Exact**: LEGAL_FORM matches exactly (case insensitive) in reference lookup.
- **❌ Unmatched**: No match found.

This helps identify records needing further attention.
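
For roughly a million rows, a row-wise `apply` can be slow. The sketch below shows how the same three labels might be derived with vectorised operations instead. The names `demo_lookup` and `forms` are hypothetical stand-ins, and the plain-text labels are used only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs; the real notebook builds these from its own data.
demo_lookup = {"GmbH": "Private", "AG": "Public"}
forms = pd.Series(["GmbH", "gmbh", " AG ", "e. V.", None, "  "])

# Normalise once: trim whitespace and lower-case both sides.
normalised = forms.str.strip().str.lower()
known = {key.lower() for key in demo_lookup}

is_missing = forms.isna() | (normalised == "")
is_exact = normalised.isin(known)

# np.select picks the first condition that is True for each row.
labels = np.select([is_missing, is_exact], ["Missing", "Exact"], default="Unmatched")
print(list(labels))
```

Because the missing check comes first in the condition list, blank strings are labelled as missing before any lookup comparison is made.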

In [32]:
def assign_match_quality(df, business_type_lookup):
    def get_match_quality(legal_form):
        if pd.isna(legal_form) or str(legal_form).strip() == '':
            return '❓ Missing'

        legal_form_str = str(legal_form).strip()

        # Exact match (case sensitive)
        if legal_form_str in business_type_lookup:
            return '✅ Exact'

        # Exact match (case insensitive)
        for key in business_type_lookup:
            if key.lower() == legal_form_str.lower():
                return '✅ Exact'

        return '❌ Unmatched'

    df_enriched = df.copy()
    df_enriched['MATCH_QUALITY'] = df_enriched['LEGAL_FORM'].apply(get_match_quality)
    return df_enriched

In [34]:
# Assign mock company names
mock_names = [f"Company {char}" for char in string.ascii_uppercase]
df['NAME'] = df.index.map(lambda i: mock_names[i % len(mock_names)])

# Apply to dataset using the cleaned dictionary
df_enriched = assign_match_quality(df, lookup_dict)

#### Preview the results

In [35]:
print("Preview of the DataFrame with initial 'MATCH_QUALITY' column:")
preview_cols = ['NAME', 'LEGAL_FORM', 'MATCH_QUALITY']
available_cols = [col for col in preview_cols if col in df_enriched.columns]

with pd.option_context('display.max_colwidth', None,
                       'display.max_columns', None,
                       'display.width', None):
    display(df_enriched[available_cols].sample(10))

Preview of the DataFrame with initial 'MATCH_QUALITY' column:


Unnamed: 0,NAME,LEGAL_FORM,MATCH_QUALITY
144454,Company Y,GmbH,✅ Exact
132331,Company R,e. V.,❌ Unmatched
196981,Company F,GmbH & Co. KG,✅ Exact
725184,Company S,e. K.,✅ Exact
819531,Company L,eG,✅ Exact
72101,Company D,e. V.,❌ Unmatched
617199,Company L,GmbH,✅ Exact
404392,Company O,,❓ Missing
547061,Company V,GmbH,✅ Exact
635617,Company V,GmbH & Co. KG,✅ Exact


#### Inspect the results

In [36]:
# Show summary statistics
print(f"Total records processed: {len(df_enriched):,}")

print("\nMatch Quality Distribution:")
match_stats = df_enriched['MATCH_QUALITY'].value_counts()
for match_type, count in match_stats.items():
    percentage = (count / len(df_enriched)) * 100
    print(f"  {match_type}: {count:,} ({percentage:.1f}%)")

Total records processed: 999,967

Match Quality Distribution:
  ✅ Exact: 694,673 (69.5%)
  ❌ Unmatched: 219,801 (22.0%)
  ❓ Missing: 85,493 (8.5%)


## 4.2. Analyze Unmatched patterns

Identify and analyze the '❌ Unmatched' entries to see whether common patterns (typos, variants) emerge that a "partial match" logic can resolve.
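
One way to surface candidate variants before reviewing them by hand is string similarity. The sketch below uses `difflib.get_close_matches` from the standard library. The two lists are hypothetical excerpts and the 0.8 cutoff is an arbitrary assumption, so the suggestions are only a starting point for manual review.

```python
import difflib

# Hypothetical excerpts of the lookup keys and of the unmatched forms.
known_forms = ["GmbH", "UG", "GbR", "PartGmbB", "e. K."]
unmatched_forms = ["PartG mbB", "gGmbH", "e. V."]

# get_close_matches ranks candidates by SequenceMatcher similarity and
# keeps those scoring at least `cutoff` (0.8 here is an arbitrary choice).
suggestions = {
    form: difflib.get_close_matches(form, known_forms, n=3, cutoff=0.8)
    for form in unmatched_forms
}

for form, candidates in suggestions.items():
    print(f"{form!r} -> {candidates}")
```

Similarity alone can mislead: a form such as `e. V.` may score close to `e. K.` even though the two are unrelated legal forms, so each suggestion should be confirmed before it goes into a mapping.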

In [37]:
# Display sample unmatched records
unmatched = df_enriched[df_enriched['MATCH_QUALITY'] == '❌ Unmatched']

with pd.option_context('display.max_colwidth', None,
                       'display.max_columns', None,
                       'display.width', None):
    display(unmatched[['NAME', 'LEGAL_FORM', 'MATCH_QUALITY']].sample(10))

Unnamed: 0,NAME,LEGAL_FORM,MATCH_QUALITY
824702,Company I,e. V.,❌ Unmatched
240753,Company T,e. V.,❌ Unmatched
726834,Company E,e. V.,❌ Unmatched
172613,Company Z,e. V.,❌ Unmatched
354778,Company I,e. V.,❌ Unmatched
525665,Company X,e. V.,❌ Unmatched
935730,Company Q,e. V.,❌ Unmatched
270305,Company J,B.V.,❌ Unmatched
718364,Company K,e. V.,❌ Unmatched
440599,Company D,UG & Co. KG,❌ Unmatched


In [30]:
# Show unique unmatched legal forms
unique_unmatched = unmatched['LEGAL_FORM'].drop_duplicates().sort_values()
print("Total unique unmatched legal forms:")
print(unique_unmatched)

# Check if NaN exists in unique unmatched legal forms
has_nan = unique_unmatched.isnull().any()
print(f"Contains NaN: {has_nan}")

Total unique unmatched legal forms:
66890               A/S
46519      A/S & Co. KG
3072                 AB
22760       AB & Co. KG
10797     AG & Co. KGaA
              ...      
308371              ΕΠΕ
284757               ΟΕ
292620              ООО
991996              ПАО
773570             בע"מ
Name: LEGAL_FORM, Length: 182, dtype: object
Contains NaN: False


In [28]:
# Display the most common unmatched legal forms to prioritize further cleaning efforts
print(f"\n TOP 15 MOST COMMON UNMATCHED LEGAL FORMS:")

unmatched_forms = unmatched['LEGAL_FORM'].value_counts().head(15)
for form, count in unmatched_forms.items():
    percentage = (count / len(unmatched)) * 100
    print(f"  '{form}': {count:,} records ({percentage:.1f}%)")


 TOP 15 MOST COMMON UNMATCHED LEGAL FORMS:
  'e. V.': 183,749 records (83.6%)
  'eGbR': 17,581 records (8.0%)
  'UG & Co. KG': 4,641 records (2.1%)
  'Ltd.': 2,627 records (1.2%)
  'PartG mbB': 2,330 records (1.1%)
  'gGmbH': 1,996 records (0.9%)
  'Ltd. & Co. KG': 1,710 records (0.8%)
  'EI': 1,426 records (0.6%)
  'gUG': 860 records (0.4%)
  'GmbH & Co. OHG': 444 records (0.2%)
  'B.V.': 382 records (0.2%)
  'AG & Co. OHG': 237 records (0.1%)
  'Association': 131 records (0.1%)
  'gesellschaft mbH': 122 records (0.1%)
  'KG & Co. KG': 119 records (0.1%)


## 4.3. Handling Unmatched Legal Forms: Partial vs Truly Unmatched

- Introduce a new layer of matching: **🔍 Partial Match** – for forms that closely resemble known legal forms but may involve typos, variants, or uncommon representations. These are handled via a supplementary mapping dictionary.
- Remaining entries with no reliable mapping stay as **❌ Unmatched**.
---

#### 🔍 Probable Partial Matches:

| Unmatched Form       | Suggested Match | Reason |
|----------------------|------------------|--------|
| `eGbR`               | GbR              | Typo with "e" prefix or legit variant? |
| `PartG mbB`          | PartGmbB         | Variant spelling |
| `gGmbH`              | GmbH             | Typo with "g" prefix or legit variant?  |
| `gUG`                | UG               | Typo with "g" prefix or legit variant?  |
| `gesellschaft mbH`   | GmbH             | Full spelling of GmbH |

➡️ These can be labeled as **'🔍 Partial'** matches using an explicit mapping dictionary.

⚠️ *Some partially matched forms may not be simple errors, but rather reflect legitimate alternative legal representations.
These are flagged for now as partial matches, but further investigation may be required to confirm their accuracy.*  
➡️ The mapping dictionary should remain flexible and extensible, allowing for updates as more cases are encountered. Ongoing validation is encouraged to ensure consistent and accurate classification.
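
As the mapping grows, a small consistency check can catch entries whose target is not actually present in the lookup. The sketch below is one possible check. `find_dangling_targets` is a hypothetical helper, and the `Ltd.` entry is a deliberately wrong example added to show a failure.

```python
# Hypothetical excerpts of the two dictionaries used in this notebook.
demo_lookup = {"GmbH": "Private", "GbR": "Private", "UG": "Private"}
demo_partial = {"gGmbH": "GmbH", "eGbR": "GbR", "gUG": "UG", "Ltd.": "Limited"}

def find_dangling_targets(partial_map, lookup):
    """Return partial-match entries whose target is not a lookup key."""
    known = {key.lower() for key in lookup}
    return {src: tgt for src, tgt in partial_map.items() if tgt.lower() not in known}

dangling = find_dangling_targets(demo_partial, demo_lookup)
print(dangling)
```

An empty result means every corrected form resolves to a Business Type. Any entry returned here would otherwise silently fall back to the default.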

---

#### ❌ Truly Unmatched Forms:

| Unmatched Form       | Reason |
|----------------------|--------|
| `e. V.`              | Verein (association), not a company form |
| `UG & Co. KG`        | Hybrid structure |
| `Ltd.`               | Foreign legal form (UK) |
| `EI`                 | Not a standard form; possibly erroneous |
| `GmbH & Co. OHG`     | Complex hybrid; not cleanly matchable |
| `B.V.`               | Dutch legal form (*Besloten Vennootschap*) |
| `AG & Co. OHG`       | Hybrid structure |
| `Association`        | Generic term, not a legal form |
| `KG & Co. KG`        | Hybrid structure |

➡️ Keep these as **'❌ Unmatched'** for now — may require further discussion with the team before adding special handling, especially for hybrids or international forms.

In [38]:
# Create a new dictionary for partial matched entries
partial_match_map = {
    'eGbR': 'GbR',
    'PartG mbB': 'PartGmbB',
    'gGmbH': 'GmbH',
    'gUG': 'UG',
    'gesellschaft mbH': 'GmbH'
}

In [39]:
# Update the assign_match_quality function
def assign_match_quality(df, lookup_dict, partial_match_map):
    def get_match_quality(legal_form):
        if pd.isna(legal_form) or str(legal_form).strip() == '':
            return '❓ Missing'

        legal_form_str = str(legal_form).strip()

        if legal_form_str in lookup_dict:
            return '✅ Exact'

        for key in lookup_dict:
            if key.lower() == legal_form_str.lower():
                return '✅ Exact'

        # Add partial match check
        if legal_form_str in partial_match_map:
            return '🔍 Partial'

        return '❌ Unmatched'

    df_enriched = df.copy()
    df_enriched['MATCH_QUALITY'] = df_enriched['LEGAL_FORM'].apply(get_match_quality)
    return df_enriched

#### Preview the result

In [40]:
# Apply to our dataset using the cleaned dictionary
df_enriched_partial = assign_match_quality(df, lookup_dict, partial_match_map)

# View the result
preview_cols = ['NAME', 'LEGAL_FORM', 'MATCH_QUALITY']
available_cols = [col for col in preview_cols if col in df_enriched_partial.columns]

with pd.option_context('display.max_colwidth', None,
                       'display.max_columns', None,
                       'display.width', None):
    display(df_enriched_partial[available_cols].sample(10))

Unnamed: 0,NAME,LEGAL_FORM,MATCH_QUALITY
856211,Company F,GmbH & Co. KG,✅ Exact
200024,Company G,e. K.,✅ Exact
57397,Company P,eGbR,🔍 Partial
113968,Company K,GmbH,✅ Exact
216541,Company N,GmbH,✅ Exact
630595,Company R,e. V.,❌ Unmatched
630402,Company G,e. K.,✅ Exact
167158,Company E,GmbH,✅ Exact
8462,Company M,GmbH & Co. KG,✅ Exact
293211,Company J,UG,✅ Exact


In [41]:
# Display partially matched records
partial_matches = df_enriched_partial[df_enriched_partial['MATCH_QUALITY'] == '🔍 Partial']

with pd.option_context('display.max_colwidth', None,
                       'display.max_columns', None,
                       'display.width', None):
    display(partial_matches[['NAME', 'LEGAL_FORM', 'MATCH_QUALITY']].sample(10))

Unnamed: 0,NAME,LEGAL_FORM,MATCH_QUALITY
206933,Company Z,eGbR,🔍 Partial
563016,Company M,eGbR,🔍 Partial
610832,Company O,PartG mbB,🔍 Partial
406400,Company U,gGmbH,🔍 Partial
635344,Company I,eGbR,🔍 Partial
145987,Company X,eGbR,🔍 Partial
793410,Company U,eGbR,🔍 Partial
789219,Company P,eGbR,🔍 Partial
939640,Company A,eGbR,🔍 Partial
348742,Company E,eGbR,🔍 Partial


In [42]:
# Show summary statistics for the partial-match result
print(f"Total records processed: {len(df_enriched_partial):,}")

print("\nMatch Quality Distribution:")
match_stats = df_enriched_partial['MATCH_QUALITY'].value_counts()
for match_type, count in match_stats.items():
    percentage = (count / len(df_enriched_partial)) * 100
    print(f"  {match_type}: {count:,} ({percentage:.1f}%)")

Total records processed: 999,967

Match Quality Distribution:
  ✅ Exact: 694,673 (69.5%)
  ❌ Unmatched: 196,911 (19.7%)
  ❓ Missing: 85,493 (8.5%)
  🔍 Partial: 22,890 (2.3%)


---
# 5. Final Business Type Assignment

Based on MATCH_QUALITY:
- **'✅ Exact'** → Use lookup dictionary directly.
- **'🔍 Partial'** → Use corrected partial match mapping.
- **'❓ Missing'** or **'❌ Unmatched'** → Default to 'Private'.

This step creates a new BUSINESS_TYPE column with consistent classifications.
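
The assignment rules above can also be expressed with `Series.map`, which avoids a Python loop per row. This is a minimal sketch on hypothetical inputs (`demo_lookup`, `demo_partial`, `forms`). It assumes no form appears in both dictionaries, and unlike the function in this notebook it treats the partial-match keys case-insensitively.

```python
import pandas as pd

# Hypothetical inputs mirroring the shapes used in this notebook.
demo_lookup = {"AG": "Public", "GmbH": "Private", "KGaA": "Hybrid"}
demo_partial = {"gGmbH": "GmbH"}
forms = pd.Series(["AG", "kgaa", "gGmbH", "e. V.", None])

# Lower-cased views so the match is case-insensitive.
lookup_lower = {key.lower(): value for key, value in demo_lookup.items()}
partial_lower = {key.lower(): value.lower() for key, value in demo_partial.items()}

cleaned = forms.str.strip().str.lower()
# Replace known variants with their corrected form, keep everything else.
corrected = cleaned.map(partial_lower).fillna(cleaned)
# Unknown or missing forms fall back to 'Private', per the rule above.
business_type = corrected.map(lookup_lower).fillna("Private")
print(list(business_type))
```

Because the default is applied last, a missing or unmatched legal form cannot be told apart from a genuinely private one in the result, so `MATCH_QUALITY` should be kept alongside it.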

In [43]:
def assign_business_type(df, lookup_dict, partial_match_map):
    def get_business_type(row):
        legal_form = row['LEGAL_FORM']
        match_quality = row['MATCH_QUALITY']
        
        if pd.isna(legal_form) or str(legal_form).strip() == '':
            return 'Private'  # Missing
        
        legal_form_str = str(legal_form).strip()
        
        if match_quality == '✅ Exact':
            # Match directly in lookup_dict
            for key in lookup_dict:
                if key.lower() == legal_form_str.lower():
                    return lookup_dict[key]
            return 'Private'  # Fallback
        
        elif match_quality == '🔍 Partial':
            corrected_form = partial_match_map.get(legal_form_str)
            if corrected_form:
                for key in lookup_dict:
                    if key.lower() == corrected_form.lower():
                        return lookup_dict[key]
            return 'Private'  # Fallback if correction fails

        # For ❌ Unmatched or ❓ Missing
        return 'Private'
    
    # Work on a copy so the caller's DataFrame is not modified in place
    df_out = df.copy()
    df_out['BUSINESS_TYPE'] = df_out.apply(get_business_type, axis=1)
    return df_out

In [44]:
# Apply and preview the result
df_fin = assign_business_type(df_enriched_partial, lookup_dict, partial_match_map)

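# Display a sample result: one guaranteed row per (MATCH_QUALITY, BUSINESS_TYPE)
# combination, keeping the original index so these rows can be dropped below
guaranteed_rows = (
    df_fin
    .groupby(['MATCH_QUALITY', 'BUSINESS_TYPE'], group_keys=False)
    .sample(1, random_state=1)
)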

remaining_sample = df_fin.drop(guaranteed_rows.index).sample(15 - len(guaranteed_rows), random_state=1)
final_sample = pd.concat([guaranteed_rows, remaining_sample]).sample(frac=1, random_state=1).reset_index(drop=True)

with pd.option_context('display.max_colwidth', None,
                       'display.max_columns', None,
                       'display.width', None):
    display(final_sample[['NAME', 'LEGAL_FORM', 'MATCH_QUALITY', 'BUSINESS_TYPE']])

Unnamed: 0,NAME,LEGAL_FORM,MATCH_QUALITY,BUSINESS_TYPE
0,Company W,e. V.,❌ Unmatched,Private
1,Company J,PartG mbB,🔍 Partial,Private
2,Company W,,❓ Missing,Private
3,Company K,AG,✅ Exact,Public
4,Company K,e. K.,✅ Exact,Private
5,Company P,,❓ Missing,Private
6,Company B,GmbH,✅ Exact,Private
7,Company A,GmbH,✅ Exact,Private
8,Company L,GmbH & Co. KG,✅ Exact,Hybrid
9,Company Y,e. V.,❌ Unmatched,Private


---
# 6. Analyze Classification Results

In [45]:
print("=== 📊 BUSINESS TYPE DISTRIBUTION ===")
bt_counts = df_fin['BUSINESS_TYPE'].value_counts(dropna=False)
bt_percent_raw = df_fin['BUSINESS_TYPE'].value_counts(normalize=True, dropna=False) * 100
bt_percent = bt_percent_raw.round(1).tolist()

bt_percent[-1] = round(100 - sum(bt_percent[:-1]), 1)

bt_summary = pd.DataFrame({
    'Count': bt_counts.astype(int),
    'Percent': bt_percent
}, index=bt_counts.index)

bt_summary['Count'] = bt_summary['Count'].apply(lambda x: f"{x:,}")
bt_summary['Percent'] = bt_summary['Percent'].apply(lambda x: f"{x:.1f}%")

bt_summary.loc['Total'] = [
    f"{bt_counts.sum():,}",
    "100.0%"
]
print(bt_summary)

print("\n=== 📊 MATCH QUALITY BREAKDOWN 📊 ===")
mq_counts = df_fin['MATCH_QUALITY'].value_counts(dropna=False)
mq_percent_raw = df_fin['MATCH_QUALITY'].value_counts(normalize=True, dropna=False) * 100
mq_percent = mq_percent_raw.round(1).tolist()

mq_percent[-1] = round(100 - sum(mq_percent[:-1]), 1)

mq_summary = pd.DataFrame({
    'Count': mq_counts.astype(int),
    'Percent': mq_percent
}, index=mq_counts.index)

mq_summary['Count'] = mq_summary['Count'].apply(lambda x: f"{x:,}")
mq_summary['Percent'] = mq_summary['Percent'].apply(lambda x: f"{x:.1f}%")

mq_summary.loc['Total'] = [
    f"{mq_counts.sum():,}",
    "100.0%"
]
print(mq_summary)

print("\n=== 📊 MATCH QUALITY vs. BUSINESS TYPE 📊 ===")
crosstab = pd.crosstab(df_fin['MATCH_QUALITY'], df_fin['BUSINESS_TYPE'], margins=True)
print(crosstab)

=== 📊 BUSINESS TYPE DISTRIBUTION ===
                 Count Percent
BUSINESS_TYPE                 
Private        919,701   92.0%
Hybrid          74,521    7.5%
Public           5,745    0.5%
Total          999,967  100.0%

=== 📊 MATCH QUALITY BREAKDOWN 📊 ===
                 Count Percent
MATCH_QUALITY                 
✅ Exact        694,673   69.5%
❌ Unmatched    196,911   19.7%
❓ Missing       85,493    8.5%
🔍 Partial       22,890    2.3%
Total          999,967  100.0%

=== 📊 MATCH QUALITY vs. BUSINESS TYPE 📊 ===
BUSINESS_TYPE  Hybrid  Private  Public     All
MATCH_QUALITY                                 
✅ Exact         74521   614407    5745  694673
❌ Unmatched         0   196911       0  196911
❓ Missing           0    85493       0   85493
🔍 Partial           0    22890       0   22890
All             74521   919701    5745  999967


In [60]:
# Export sample result to CSV file
sample = df_fin.sample(100, random_state=42)
sample.to_csv("BusinessType_SampleResults.csv", index=False)

--- 
# ✅ Suggested next steps

- **Review & Validate Mappings**: Collaborate with domain experts to validate **'🔍 Partial'** matches and ensure business logic accuracy.
- **Deepen '❌ Unmatched' Analysis**:
    - Continue investigating remaining **'❌ Unmatched'** Legal Forms to identify new patterns or refine existing rules.
    - Flag uncommon or foreign Legal Forms for manual review.
- **Audit '❓ Missing' Entries**: Examine blank or missing values to determine if they reflect true absences or data quality issues requiring upstream fixes.
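
To support the manual review suggested above, the unmatched forms could be exported as a frequency table. The sketch below uses a hypothetical stand-in frame (`demo`) with plain-text labels. The non-ASCII flag is a rough heuristic and an assumption of this example, not a rule from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the enriched frame produced earlier.
demo = pd.DataFrame({
    "LEGAL_FORM": ["e. V.", "e. V.", "Ltd.", "ООО", "GmbH", None],
    "MATCH_QUALITY": ["Unmatched", "Unmatched", "Unmatched", "Unmatched",
                      "Exact", "Missing"],
})

# One row per unmatched form with its record count.
review = (
    demo.loc[demo["MATCH_QUALITY"] == "Unmatched", "LEGAL_FORM"]
    .value_counts()
    .rename_axis("LEGAL_FORM")
    .reset_index(name="records")
)

# Flag forms containing non-ASCII characters as a rough hint that the
# form may come from another script or jurisdiction.
review["non_ascii"] = [not form.isascii() for form in review["LEGAL_FORM"]]
print(review)
```

The resulting table could be written out with `review.to_csv(...)` and shared with domain experts, who can then decide which forms deserve their own mapping.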