# MCC-Centered Industry Classification Entity Resolution


## Purpose
This notebook performs entity resolution across industry classification lookup tables, with MCC (Merchant Category Codes) as the primary table. It uses Snowflake AI functions (`AI_SIMILARITY` and `AI_CLASSIFY`) to match industry codes across three classification systems:

### Target Tables
- **MCC** (Merchant Category Codes) - PRIMARY/LEFT TABLE
- **NAICS** (North American Industry Classification System)
- **SIC** (Standard Industrial Classification)  

### Approach
1. Load and explore each classification table
2. Create cross joins with MCC as the leftmost table
3. Apply AI_SIMILARITY to find matching classifications across systems
4. Use AI_CLASSIFY to identify best matches for each MCC code
5. Create unified MCC-centered industry mapping for entity resolution
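
The five steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not the notebook's implementation: the rows are made up, `difflib.SequenceMatcher` stands in for `AI_SIMILARITY`, and the final pick simply takes the top-scored candidate where the notebook asks `AI_CLASSIFY` to choose.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy rows; the real lookup tables live in Snowflake.
mcc_rows = {
    "5411": "Grocery Stores, Supermarkets",
    "7230": "Barber and Beauty Shops",
}
naics_rows = {
    "445110": "Supermarkets and Other Grocery Stores",
    "812111": "Barber Shops",
    "812112": "Beauty Salons",
    "111110": "Soybean Farming",
}


def similarity(a: str, b: str) -> float:
    """Stand-in for AI_SIMILARITY: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Steps 2-3: score every MCC x NAICS pair (the cross join).
scored = [
    (mcc, naics, similarity(mcc_text, naics_text))
    for (mcc, mcc_text), (naics, naics_text) in product(
        mcc_rows.items(), naics_rows.items()
    )
]

# Keep the three best-scoring NAICS candidates per MCC, best first.
top3 = {
    mcc: sorted(
        (row for row in scored if row[0] == mcc),
        key=lambda row: row[2],
        reverse=True,
    )[:3]
    for mcc in mcc_rows
}

# Steps 4-5: reduce to one NAICS code per MCC. Assumption: take the top score;
# the notebook asks AI_CLASSIFY to choose among the three instead.
mapping = {mcc: candidates[0][1] for mcc, candidates in top3.items()}
print(mapping)
```

The resulting `mapping` has exactly one entry per MCC code, which is the shape the notebook aims for.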

### Expected Output
- One record for each MCC code
- Ability to look up the corresponding NAICS and SIC values for any MCC


In [1]:
# =============================================================================
# 🔧 ENTITY RESOLUTION FRAMEWORK
# =============================================================================

# 1. IMPORTS & SETUP
# Python Data 
import pandas as pd
from pydantic import BaseModel, Field
import json

# Python Formatting & Display
import humanize 
from datetime import datetime
from textwrap import dedent

#  Snowpark
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import snowflake.snowpark.window as W
from snowflake.snowpark import Session
from snowflake.snowpark.context import get_active_session

# Cortex
import snowflake.cortex as C

# Helper Functions
def display_df_info(spdf, name="DataFrame"):
    """
    Display first 10 rows and metrics for a Snowpark DataFrame
    
    Args:
        spdf: Snowpark DataFrame to analyze
        name: Name to display for the DataFrame
    """
    # Get row and column counts
    row_count = spdf.count()
    col_count = len(spdf.columns)
    
    print(f"\n📊 {name} Overview:")
    print(f"  • Rows: {humanize.intword(row_count)} ({humanize.intcomma(row_count)})")
    print(f"  • Columns: {col_count}")
    
    print("\n🔍 First 10 rows:")
    spdf.limit(10).show()

def show_full_df(df, num_rows=10):
    """Display DataFrame with full formatting"""
    return df.limit(num_rows).to_pandas().style.set_properties(**{
        'text-align': 'left',
        'white-space': 'pre-wrap'
    }).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

def clean_na_values(df, columns_to_select=None):
    """
    Replace 'NA' string values with None in all string columns of a dataframe
    
    Args:
        df: Snowpark DataFrame to clean
        columns_to_select: Optional list of columns to select in output DataFrame
        
    Returns:
        Snowpark DataFrame with 'NA' values replaced with None
    """
    # Get all string columns
    string_columns = [field.name for field in df.schema.fields 
                     if isinstance(field.datatype, T.StringType)]

    # Create list of column transformations
    column_transformations_list = [
        F.when(F.col(column) == 'NA', None)
         .when((F.col(column) == 'x') & (column == 'FEDTAXID'), None)
         .otherwise(F.col(column))
        for column in string_columns
    ]

    # Apply all transformations at once
    cleaned_df = df.with_columns(string_columns, column_transformations_list)
    
    # Select specified columns if provided
    if columns_to_select:
        cleaned_df = cleaned_df.select(columns_to_select)
        
    return cleaned_df

print("✅ All imports and helper functions loaded successfully")


* 'allow_population_by_field_name' has been renamed to 'validate_by_name'


✅ All imports and helper functions loaded successfully


In [3]:
# 2. SESSION INITIALIZATION
def initialize_session():
    """Initialize Snowflake session with fallback options"""
    try:
        # First check for active session
        session = get_active_session()
        print("🔑 Using existing active Snowflake session")
        return session
    except Exception as e:
        print(f"⚠️  No active session found: {e}")
        try:
            # Try to load local credentials
            with open("/Users/jsoliz/.creds/gpn_connection.json", 'r') as f:
                connection_params = json.load(f)
            session = Session.builder.configs(connection_params).create()
            print("🔑 Local session initialized successfully")
            return session
        except Exception as e2:
            print(f"❌ Session initialization failed: {e2}")
            return None

# Initialize session
session = initialize_session()

if session:
    print(f"📊 Session active: {session.get_current_warehouse()}")
    print(f"🏢 Database: {session.get_current_database()}")
    print(f"📁 Schema: {session.get_current_schema()}")
else:
    print("❌ No session available - please check your connection configuration")


⚠️  No active session found: (1403): No default Session is found. Please create a session before you call function 'udf' or use decorator '@udf'.

### Schemata (Confirm Tables)


In [4]:
# Load the classification tables
print("📥 Loading classification lookup tables...")

# Load all three classification systems
mcc_lu_spdf = session.table("sandbox.javier.LU_MCC").where(~F.col("MCC").startswith("3"))
naics_lu_spdf = session.table("sandbox.javier.LU_NAICS")
sic_lu_spdf = session.table("sandbox.javier.LU_SIC")

# Get record counts
mcc_count = mcc_lu_spdf.count()
naics_count = naics_lu_spdf.count() 
sic_count = sic_lu_spdf.count()

print(f"🎯 MCC Records (Primary): {mcc_count:,} total")
print(f"📋 NAICS Records: {naics_count:,} total")
print(f"🏭 SIC Records: {sic_count:,} total")

print(f"\n🔍 MCC Table Sample (Primary Table):")
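# NOTE: show_full_df returns a pandas Styler, which Jupyter renders only when it is
# the last expression in a cell, so this call displays nothing here.
show_full_df(mcc_lu_spdf.limit(3))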

print(f"\n📋 NAICS Table Sample:")
naics_lu_spdf.limit(3).show()

print(f"\n🏭 SIC Table Sample:")  
sic_lu_spdf.limit(3).show()

# Get schema information
print("\n📊 Schema Summary:")
print("=" * 50)

print(f"\n🎯 MCC Schema (Primary - {len(mcc_lu_spdf.schema.names)} columns):")
for field in mcc_lu_spdf.schema.fields:
    print(f"  - {field.name}: {field.datatype}")

print(f"\n📋 NAICS Schema ({len(naics_lu_spdf.schema.names)} columns):")
for field in naics_lu_spdf.schema.fields:
    print(f"  - {field.name}: {field.datatype}")

print(f"\n🏭 SIC Schema ({len(sic_lu_spdf.schema.names)} columns):")
for field in sic_lu_spdf.schema.fields:
    print(f"  - {field.name}: {field.datatype}")


📥 Loading classification lookup tables...
🎯 MCC Records (Primary): 286 total
📋 NAICS Records: 2,125 total
🏭 SIC Records: 1,005 total

🔍 MCC Table Sample (Primary Table):

📋 NAICS Table Sample:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"CODE"  |"TITLE"                                     |"DESCRIPTION"                                       |"REFERENCE_CODE"  |"REFERENCE_DESCRIPTION"  |"DESCRIPTION_FULL"                                  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|11      |Agriculture, Forestry, Fishing and Hunting  |The Sector as a Whole    The Agriculture, Fores...  |NULL              |NULL                     |The Sector as a Whole    The Agri

# 🎯 MCC → NAICS Mapping


## Objective
Create systematic mappings between MCC and NAICS classification systems using Snowflake's AI functions.
**MCC is the primary/leftmost table** - we want one record per MCC code with corresponding NAICS lookups.

## Step-by-Step Approach

### Step 1: Cross Join and AI Similarity Calculation
- **Goal**: Cross join MCC (left) with NAICS (right) tables
- **AI Function**: Use `AI_SIMILARITY()` to compare names and descriptions between all MCC-NAICS pairs
- **Output**: Every MCC record paired with every NAICS record, including similarity scores
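
Because every MCC row is paired with every NAICS row, the number of similarity calls is the product of the two table sizes. A small sketch of that arithmetic, using the record counts printed when the tables were loaded (`cross_join_size` is a hypothetical helper name):

```python
def cross_join_size(left_rows: int, right_rows: int) -> int:
    """Number of pairs (and similarity calls) a cross join produces."""
    return left_rows * right_rows


# Record counts printed by the table-loading cell.
mcc_count, naics_count, sic_count = 286, 2_125, 1_005

naics_pairs_total = cross_join_size(mcc_count, naics_count)
sic_pairs_total = cross_join_size(mcc_count, sic_count)

print(f"MCC x NAICS: {naics_pairs_total:,} pairs")  # 607,750
print(f"MCC x SIC:   {sic_pairs_total:,} pairs")    # 287,430
print(f"Rows kept after top-3 filtering: {mcc_count * 3:,}")  # 858
```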

### Step 2: Ranking and Top 3 Selection  
- **Goal**: Use ranking function to keep top 3 NAICS matches per MCC code
- **Method**: `ROW_NUMBER()` partitioned by MCC, ordered by similarity score DESC (`RANK()` would keep more than 3 rows when scores tie)
- **Output**: Maximum 3 NAICS candidates per MCC record
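
The choice of ranking function matters when similarity scores tie. The sketch below mimics both functions in plain Python on made-up scores; `row_number` and `rank` are illustrative helpers, not Snowpark calls. Note that SQL's `ROW_NUMBER()` breaks ties in an unspecified order unless the `ORDER BY` is made unique, whereas this sketch breaks them by input order.

```python
def row_number(scores):
    """1-based position after a descending sort; tied scores get distinct numbers."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    numbers = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        numbers[index] = position
    return numbers


def rank(scores):
    """SQL-style RANK(): 1 + count of strictly greater scores, so ties share a rank."""
    return [1 + sum(other > s for other in scores) for s in scores]


scores = [0.91, 0.88, 0.88, 0.88, 0.40]  # hypothetical scores for one MCC

kept_by_row_number = [s for s, n in zip(scores, row_number(scores)) if n <= 3]
kept_by_rank = [s for s, r in zip(scores, rank(scores)) if r <= 3]

print(len(kept_by_row_number))  # 3
print(len(kept_by_rank))        # 4: the three-way tie at rank 2 all survives
```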

### Step 3: AI Classification for Final Selection
- **Goal**: Use `AI_CLASSIFY()` to pick the best match among the top 3 candidates
- **Method**: Let AI choose the most appropriate NAICS match for each MCC
- **Output**: Final `code_map_mcc_naics` table with exactly one NAICS code per MCC (several MCCs may share the same NAICS code)
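
The post-processing for this step (take the first returned label, otherwise fall back to the most similar candidate) can be sketched in plain Python. Assumption, inferred from the `:labels[0]` path the notebook applies to the result: `AI_CLASSIFY` returns a JSON object with a `labels` array. `pick_label` is a hypothetical helper, and the output keys only loosely mirror the notebook's column names.

```python
import json
from typing import List, Optional


def pick_label(classify_result: Optional[str], candidates: List[str]) -> dict:
    """Mimic `result:labels[0]` plus COALESCE(label, candidates[0])."""
    labels = []
    if classify_result:
        labels = json.loads(classify_result).get("labels") or []
    chosen = labels[0] if labels else None
    return {
        "AI_CLASSIFIED_TEXT": chosen,
        # Fall back to the top-similarity candidate when nothing was returned.
        "MOST_LIKELY_TEXT": chosen if chosen is not None else candidates[0],
        "CLASSIFY_HAD_RETURN": chosen is not None,
        "CLASSIFY_RETURNED_MOST_SIMILAR": chosen is not None and chosen == candidates[0],
    }


candidates = ["a", "b", "c"]  # ordered by similarity rank, best first
print(pick_label('{"labels": ["b"]}', candidates)["MOST_LIKELY_TEXT"])  # b
print(pick_label('{"labels": []}', candidates)["MOST_LIKELY_TEXT"])     # a
```

Keeping the two boolean flags makes it easy to see afterwards how often the classifier returned nothing and how often it agreed with the similarity ranking.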

---

## 🚀 Step 1: Cross Join and AI Similarity Calculation


In [6]:
# Step 1: Cross Join MCC and NAICS with AI Similarity Calculation

print("\n🔍 First, let's examine the table structures:")
print("\nMCC Table Columns (Primary):")
print(mcc_lu_spdf.columns)
print("\nNAICS Table Columns:")
print(naics_lu_spdf.columns)

print(f"\n📊 Table Sizes:")
print(f"MCC Records (Primary): {humanize.intcomma(mcc_lu_spdf.count())}")
print(f"NAICS Records: {humanize.intcomma(naics_lu_spdf.count())}")
print(f"Cross Join Size: {humanize.intcomma(mcc_lu_spdf.count() * naics_lu_spdf.count())} total combinations")

# Create cross join with AI similarity calculation
print("\n🤖 Creating cross join with AI_SIMILARITY scores...")

import time
start_time = time.time()

# Build the cross join (Snowpark is lazy: nothing executes until the write below)
mcc_naics_cross_join = (
    mcc_lu_spdf.alias("mcc").cross_join(
        naics_lu_spdf.alias("naics")
    )
)

# Select the MCC and NAICS fields and add the AI similarity score
mcc_naics_cross_join = (
    mcc_naics_cross_join.select(
        # MCC fields (Primary)
        F.col("MCC").alias("MCC_CODE"),
        F.concat(
            F.col("MCC_DESCRIPTIVE_TITLE"),
            F.lit(" - "),
            F.col("INCLUDED_IN_THIS_MCC")
        ).alias("MCC_TEXT"),
        
        # NAICS fields  
        F.col("CODE").alias("NAICS_CODE"),
        F.concat(
            F.col("TITLE"),
            F.lit(" - "),
            F.col("DESCRIPTION_FULL")
        ).alias("NAICS_TEXT"),
        
        # AI Similarity score
        F.call_function("AI_SIMILARITY",
            F.col("MCC_TEXT"),
            F.col("NAICS_TEXT")
        ).alias("SIMILARITY_SCORE")
    )
)

# Write results to Snowflake table
print("\n💾 Writing results to Snowflake table...")
mcc_naics_cross_join.write.mode("overwrite").save_as_table("sandbox.javier.mcc_naics_cross_join")

print("\n✅ Cross join with AI similarity created and saved successfully!")

end_time = time.time()
print(f"⏱️ Total execution time: {round(end_time - start_time, 2)} seconds")



🔍 First, let's examine the table structures:

MCC Table Columns (Primary):
['MCC', 'MCC_DESCRIPTIVE_TITLE', 'INCLUDED_IN_THIS_MCC', 'SIMILAR_MCC_CODES', 'SIMILAR_MCC_CODE_READABLE', 'MCC_ARRAY', 'MCC_ARR_MAPS']

NAICS Table Columns:
['CODE', 'TITLE', 'DESCRIPTION', 'REFERENCE_CODE', 'REFERENCE_DESCRIPTION', 'DESCRIPTION_FULL']

📊 Table Sizes:
MCC Records (Primary): 286
NAICS Records: 2,125
Cross Join Size: 607,750 total combinations

🤖 Creating cross join with AI_SIMILARITY scores...

💾 Writing results to Snowflake table...

✅ Cross join with AI similarity created and saved successfully!
⏱️ Total execution time: 96.73 seconds


In [7]:
# Step 2: Load Cross Join Table and Apply Ranking for MCC-NAICS

print("📂 Loading existing MCC-NAICS cross join table...")
mcc_naics_cross_join = session.table("sandbox.javier.mcc_naics_cross_join")

print("🔍 Examining cross join table structure:")
print("Columns:", mcc_naics_cross_join.columns)
print(f"Total records: {humanize.intcomma(mcc_naics_cross_join.count())}")

print("\n🏆 Applying ranking to get top 3 NAICS matches per MCC...")

# Apply ROW_NUMBER() window function to rank NAICS matches by similarity score
mcc_naics_top_3_matches = mcc_naics_cross_join.select(
    "*",
    F.row_number().over(
        W.Window.partition_by("MCC_CODE").orderBy(F.col("SIMILARITY_SCORE").desc())
    ).alias("SIMILARITY_RANK")
).filter(
    F.col("SIMILARITY_RANK") <= 3
)

print("✅ Ranking applied! Keeping top 3 NAICS matches per MCC code.")
print(f"📊 Reduced from {humanize.intcomma(mcc_naics_cross_join.count())} to {humanize.intcomma(mcc_naics_top_3_matches.count())} records")

# Write results to table for Step 3
mcc_naics_top_3_matches.write.mode("overwrite").save_as_table("sandbox.javier.mcc_naics_cross_join_top3")
print("\n💾 Results saved to table for MCC-NAICS AI classification!")


📂 Loading existing MCC-NAICS cross join table...
🔍 Examining cross join table structure:
Columns: ['MCC_CODE', 'MCC_TEXT', 'NAICS_CODE', 'NAICS_TEXT', 'SIMILARITY_SCORE']
Total records: 607,750

🏆 Applying ranking to get top 3 NAICS matches per MCC...
✅ Ranking applied! Keeping top 3 NAICS matches per MCC code.
📊 Reduced from 607,750 to 858 records

💾 Results saved to table for MCC-NAICS AI classification!


In [8]:
# Step 3: AI Classification for Final MCC-NAICS Selection

print("📂 Loading top 3 MCC-NAICS results table...")
mcc_naics_cross_join_top3_spdf = session.table("sandbox.javier.mcc_naics_cross_join_top3")

print("🤖 Preparing data for AI_CLASSIFY to select best NAICS match for each MCC...")

# Group the top 3 NAICS codes per MCC into arrays for AI_CLASSIFY
mcc_with_naics_categories = mcc_naics_cross_join_top3_spdf.select(
    "MCC_CODE",
    "MCC_TEXT", 
    "NAICS_CODE",
    "NAICS_TEXT",
    "SIMILARITY_SCORE",
    "SIMILARITY_RANK"
).group_by(
    "MCC_CODE", "MCC_TEXT"
).agg(
    # Create array of NAICS codes as classification categories
    F.array_agg(F.expr("NAICS_CODE")).within_group("SIMILARITY_RANK").alias("NAICS_CODES"),
    # Also keep NAICS titles for reference
    F.array_agg(F.expr("NAICS_TEXT")).within_group("SIMILARITY_RANK").alias("NAICS_TEXTS"),
    F.array_agg(F.expr("SIMILARITY_RANK")).within_group("SIMILARITY_RANK").alias("SIMILARITY_RANKS")
)

print("✅ NAICS categories prepared!")

print(f"\n🤖 Using AI_CLASSIFY to select best NAICS matches for {humanize.intcomma(mcc_with_naics_categories.count())} MCC codes...")

# Use AI_CLASSIFY to classify each MCC into one of its 3 NAICS categories
mcc_naics_classified = mcc_with_naics_categories.select(
    "MCC_CODE",
    "MCC_TEXT",
    "NAICS_CODES",
    "NAICS_TEXTS",
   
    # Use AI_CLASSIFY to select best NAICS match from the 3 options
    F.call_function("AI_CLASSIFY",
        F.concat(F.col("MCC_CODE"), F.lit(". "), F.col("MCC_TEXT")),    # input
        F.col("NAICS_TEXTS"),                                           # list_of_categories
    ).alias("AI_CLASSIFIED_NAICS_TEXT")
)

# Save AI classifications
mcc_naics_classified.selectExpr("""MCC_CODE
                                , MCC_TEXT
                                , NAICS_TEXTS
                                , AI_CLASSIFIED_NAICS_TEXT:labels[0]::string                   as AI_CLASSIFIED_NAICS_TEXT_STR
                                , COALESCE(AI_CLASSIFIED_NAICS_TEXT_STR, NAICS_TEXTS[0]::string) as MOST_LIKELY_NAICS_TEXT
                                , AI_CLASSIFIED_NAICS_TEXT_STR is not null as CLASSIFY_HAD_RETURN 
                                ,CLASSIFY_HAD_RETURN AND  AI_CLASSIFIED_NAICS_TEXT_STR = NAICS_TEXTS[0]::string as CLASSIFY_RETURNED_MOST_SIMILAR_NAICS
                              """
).write.mode("overwrite").save_as_table("sandbox.javier.mcc_naics_classified")

print("✅ MCC-NAICS AI classifications complete and saved!")


📂 Loading top 3 MCC-NAICS results table...
🤖 Preparing data for AI_CLASSIFY to select best NAICS match for each MCC...
✅ NAICS categories prepared!

🤖 Using AI_CLASSIFY to select best NAICS matches for 286 MCC codes...
✅ MCC-NAICS AI classifications complete and saved!


In [91]:
# Print schema to see structure
mcc_naics_classified.print_schema()

# Check if any MCC has more than one NAICS code assigned
mcc_counts = mcc_naics_classified.group_by("MCC_CODE").count()

print("\nChecking for MCCs with multiple NAICS assignments:")
mcc_counts.filter(F.col("count") > 1).orderBy(F.col("count").desc()).show()

print("\nSummary of MCC-NAICS assignments:")
print(f"Total unique MCCs: {mcc_counts.count()}")
print(f"MCCs with multiple NAICS: {mcc_counts.filter(F.col('count') > 1).count()}")
print(f"MCCs with single NAICS: {mcc_counts.filter(F.col('count') == 1).count()}")

root
 |-- "MCC_CODE": LongType() (nullable = True)
 |-- "MCC_TEXT": StringType(33554435) (nullable = True)
 |-- "NAICS_TEXTS": ArrayType (nullable = False)
 |   |-- element: StringType()
 |-- "AI_CLASSIFIED_NAICS_TEXT_STR": StringType() (nullable = True)
 |-- "MOST_LIKELY_NAICS_TEXT": StringType() (nullable = True)
 |-- "CLASSIFY_HAD_RETURN": BooleanType() (nullable = True)
 |-- "CLASSIFY_RETURNED_MOST_SIMILAR_NAICS": BooleanType() (nullable = True)

Checking for MCCs with multiple NAICS assignments:
------------------------
|"MCC_CODE"  |"COUNT"  |
------------------------
|            |         |
------------------------


Summary of MCC-NAICS assignments:
Total unique MCCs: 286
MCCs with multiple NAICS: 0
MCCs with single NAICS: 286


In [106]:
naics_pairs.show()

---------------------------------------------------------------------
|"NAICS_CODE"  |"NAICS_TEXT"                                        |
---------------------------------------------------------------------
|11            |Agriculture, Forestry, Fishing and Hunting - Th...  |
|111           |Crop Production - Industries in the Crop Produc...  |
|1111          |Oilseed and Grain Farming - This industry group...  |
|11111         |Soybean Farming - This industry comprises estab...  |
|111110        |Soybean Farming - This industry comprises estab...  |
|11112         |Oilseed (except Soybean) Farming - This industr...  |
|111120        |Oilseed (except Soybean) Farming - This industr...  |
|11113         |Dry Pea and Bean Farming - This industry compri...  |
|111130        |Dry Pea and Bean Farming - This industry compri...  |
|111140        |Wheat Farming - This industry comprises establi...  |
---------------------------------------------------------------------



In [105]:
mcc_naics_classified.show()

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"MCC_CODE"  |"MCC_TEXT"                                          |"NAICS_TEXTS"                                       |"AI_CLASSIFIED_NAICS_TEXT_STR"  |"MOST_LIKELY_NAICS_TEXT"                            |"CLASSIFY_HAD_RETURN"  |"CLASSIFY_RETURNED_MOST_SIMILAR_NAICS"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|5013        |Motor Vehicle Supplies and New Parts - Merchant...  |[                                                   |NULL                            |Motor Vehicle Supplies and New 

In [122]:
# Create MCC to NAICS Mapping Table

print("📥 Loading MCC-NAICS AI classifications...")
mcc_naics_classified = session.table("sandbox.javier.mcc_naics_classified")

# Build distinct NAICS code-text pairs from the NAICS lookup table for joining NAICS codes back
print("📊 Getting distinct NAICS code-text pairs...")
naics_pairs = (
    naics_lu_spdf
    .select(
        F.col("CODE").alias("NAICS_CODE"),
        F.concat(
            F.col("TITLE"),
            F.lit(" - "),
            F.col("DESCRIPTION")  # Do NOT use DESCRIPTION_FULL: this avoids duplicates in the re-join below
        ).alias("NAICS_TEXT")
    )
    .distinct()
)

# Join NAICS codes back to AI classifications based on matched text
print("🔄 Joining NAICS codes to AI classifications...")
mcc_naics_lookup = (
    mcc_naics_classified.join(
        naics_pairs.distinct(),
        mcc_naics_classified["MOST_LIKELY_NAICS_TEXT"] == naics_pairs["NAICS_TEXT"],
        "left"
    )
    .select(
        mcc_naics_classified["MCC_CODE"],
        naics_pairs["NAICS_CODE"].alias("MOST_LIKELY_NAICS_CODE")
    )
)

# Create final MCC to NAICS mapping table
code_map_mcc_naics = mcc_naics_lookup.select(
    "MCC_CODE",
    "MOST_LIKELY_NAICS_CODE"
).distinct().orderBy("MCC_CODE")

code_map_mcc_naics.write.mode("overwrite").save_as_table("sandbox.javier.code_map_mcc_naics")

print(f"✅ MCC to NAICS mapping table created!")
print(f"📊 {humanize.intcomma(code_map_mcc_naics.count())} MCC codes mapped to NAICS")


📥 Loading MCC-NAICS AI classifications...
📊 Getting distinct NAICS code-text pairs...
🔄 Joining NAICS codes to AI classifications...
✅ MCC to NAICS mapping table created!
📊 286 MCC codes mapped to NAICS
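
Joining codes back by text assumes each text belongs to exactly one code. In the `naics_pairs` preview, codes such as 11111 and 111110 show the same truncated text; if their full texts are also identical, the text join would return both codes for a single MCC. A minimal sketch of a check for that situation, using made-up rows (`ambiguous_texts` is a hypothetical helper):

```python
from collections import defaultdict

# Hypothetical rows shaped like (NAICS_CODE, NAICS_TEXT).
pairs = [
    ("11111", "Soybean Farming - sample description"),
    ("111110", "Soybean Farming - sample description"),
    ("111140", "Wheat Farming - sample description"),
]


def ambiguous_texts(code_text_pairs):
    """Return texts that map to more than one code, with their sorted codes."""
    codes_by_text = defaultdict(set)
    for code, text in code_text_pairs:
        codes_by_text[text].add(code)
    return {
        text: sorted(codes)
        for text, codes in codes_by_text.items()
        if len(codes) > 1
    }


print(ambiguous_texts(pairs))
```

Any text reported here would need a tie-break rule (for example, preferring the most specific code) before the join can be relied on to stay one row per MCC.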


In [123]:
code_map_mcc_naics.groupBy("MCC_CODE") \
    .agg(F.count("MOST_LIKELY_NAICS_CODE").alias("NAICS_COUNT")) \
    .orderBy(F.col("NAICS_COUNT").desc()) \
    .show()

------------------------------
|"MCC_CODE"  |"NAICS_COUNT"  |
------------------------------
|8021        |1              |
|5045        |1              |
|7549        |1              |
|7622        |1              |
|5699        |1              |
|5051        |1              |
|5044        |1              |
|5734        |1              |
|5231        |1              |
|5139        |1              |
------------------------------



##### Verify MCC to NAICS mapping is 1:1

In [125]:
# Verify MCC to NAICS mapping is 1:1
print("\n🔍 Verifying MCC to NAICS mapping cardinality...")

mcc_naics_counts = (
    mcc_naics_classified
    .groupBy("MCC_CODE")
    .agg(F.count("*").alias("NAICS_COUNT"))
    .where(F.col("NAICS_COUNT") > 1)
)

multiple_mappings_count = mcc_naics_counts.count()

if multiple_mappings_count > 0:
    print(f"⚠️ Found {multiple_mappings_count} MCC codes with multiple NAICS mappings:")
    mcc_naics_counts.show()
else:
    print("✅ Verified: Each MCC code maps to exactly one NAICS code")

# Create histogram of NAICS mappings per MCC code from classified data
print("\n📊 Distribution of NAICS mappings per MCC code (from classified data):")

mapping_distribution = (
    mcc_naics_classified
    .groupBy("MCC_CODE")
    .agg(F.count("*").alias("NAICS_COUNT"))
    .groupBy("NAICS_COUNT")
    .agg(F.count("*").alias("MCC_COUNT"))
    .orderBy("NAICS_COUNT")
)

mapping_distribution.show()

# Create histogram of NAICS mappings per MCC code from final mapping table
print("\n📊 Distribution of NAICS mappings per MCC code (from final mapping table):")

final_mapping_distribution = (
    code_map_mcc_naics
    .groupBy("MCC_CODE")
    .agg(F.count("*").alias("NAICS_COUNT"))
    .groupBy("NAICS_COUNT")
    .agg(F.count("*").alias("MCC_COUNT"))
    .orderBy("NAICS_COUNT")
)

final_mapping_distribution.show()

# Compare number of MCC codes in mapping vs source table
print("\n🔍 Comparing MCC code coverage...")

mapped_mcc_count = code_map_mcc_naics.select("MCC_CODE").distinct().count()
source_mcc_count = mcc_lu_spdf.select("MCC").distinct().count()

print(f"Source MCC codes: {humanize.intcomma(source_mcc_count)}")
print(f"Mapped MCC codes: {humanize.intcomma(mapped_mcc_count)}")

if mapped_mcc_count < source_mcc_count:
    print(f"⚠️ Missing mappings for {humanize.intcomma(source_mcc_count - mapped_mcc_count)} MCC codes")
    
    # Show unmapped MCCs
    print("\nUnmapped MCC codes:")
    unmapped_mccs = (
        mcc_lu_spdf.select("MCC", "MCC_DESCRIPTIVE_TITLE")
        .join(code_map_mcc_naics, mcc_lu_spdf.MCC == code_map_mcc_naics.MCC_CODE, "left_anti")
    )
    unmapped_mccs.show()
else:
    print("✅ All source MCC codes have been mapped")



🔍 Verifying MCC to NAICS mapping cardinality...
✅ Verified: Each MCC code maps to exactly one NAICS code

📊 Distribution of NAICS mappings per MCC code (from classified data):
-------------------------------
|"NAICS_COUNT"  |"MCC_COUNT"  |
-------------------------------
|1              |286          |
-------------------------------


📊 Distribution of NAICS mappings per MCC code (from final mapping table):
-------------------------------
|"NAICS_COUNT"  |"MCC_COUNT"  |
-------------------------------
|1              |286          |
-------------------------------


🔍 Comparing MCC code coverage...
Source MCC codes: 286
Mapped MCC codes: 286
✅ All source MCC codes have been mapped
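
The cardinality and coverage checks above can be expressed with plain Python collections. The sketch also adds one check that is an assumption on my part rather than something the notebook does: the anti-join compares MCC codes only, so a row whose NAICS code came back NULL from the left join would still count as mapped. `check_mapping` is a hypothetical helper and the data is made up.

```python
from collections import Counter


def check_mapping(source_codes, mapping_rows):
    """Return (codes with more than one row, codes with no usable target)."""
    rows_per_code = Counter(code for code, _ in mapping_rows)
    duplicates = {code: n for code, n in rows_per_code.items() if n > 1}
    # Extra check: a None target is treated as unmapped.
    with_target = {code for code, target in mapping_rows if target is not None}
    unmapped = sorted(set(source_codes) - with_target)
    return duplicates, unmapped


# Hypothetical data: 5013 has two rows, 7549 has a NULL target, 8021 is absent.
source = [742, 5013, 7549, 8021]
rows = [(742, "541940"), (5013, "423120"), (5013, "441310"), (7549, None)]

duplicates, unmapped = check_mapping(source, rows)
print(duplicates)  # {5013: 2}
print(unmapped)    # [7549, 8021]
```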


In [126]:
code_map_mcc_naics.show()

-----------------------------------------
|"MCC_CODE"  |"MOST_LIKELY_NAICS_CODE"  |
-----------------------------------------
|742         |541940                    |
|763         |115116                    |
|780         |561730                    |
|1520        |23611                     |
|1711        |238220                    |
|1731        |238210                    |
|1740        |238140                    |
|1750        |238350                    |
|1761        |238160                    |
|1771        |238140                    |
-----------------------------------------



# 🎯 MCC → SIC Mapping

## Objective
Create systematic mappings between MCC and SIC classification systems using Snowflake's AI functions.
**MCC is the primary/leftmost table** - we want one record per MCC code with corresponding SIC lookups.

## Step-by-Step Approach
Following the same methodology as MCC-NAICS mapping:

1. **Cross Join and AI Similarity**: MCC (left) with SIC (right) 
2. **Ranking and Top 3 Selection**: Keep best SIC matches per MCC
3. **AI Classification**: Use AI_CLASSIFY for final selection

---


In [23]:
# MCC-SIC Mapping Implementation (Following Same Pattern as MCC-NAICS)

print("🏭 Creating MCC-SIC mappings using the same methodology...")

# Step 1: Cross Join MCC and SIC with AI Similarity
print("\n📊 Step 1: Creating MCC-SIC cross join with AI_SIMILARITY...")

mcc_sic_cross_join = (
    mcc_lu_spdf.alias("mcc").cross_join(sic_lu_spdf.alias("sic"))
    .select(
        # MCC fields (Primary)
        F.col("MCC").alias("MCC_CODE"),
        F.concat(F.col("MCC_DESCRIPTIVE_TITLE"), F.lit(" - "), F.col("INCLUDED_IN_THIS_MCC")).alias("MCC_TEXT"),
        # SIC fields  
        F.col("SIC_INDUSTRY_CODE").alias("SIC_CODE"),
        F.concat(F.col("SIC_INDUSTRY_DESCRIPTION"), F.lit(" - "), F.col("SIC_MAJOR_GROUP_DESCRIPTION")).alias("SIC_TEXT"),
        # AI Similarity score
        F.call_function("AI_SIMILARITY", F.col("MCC_TEXT"), F.col("SIC_TEXT")).alias("SIMILARITY_SCORE")
    )
)

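mcc_sic_cross_join.write.mode("overwrite").save_as_table("sandbox.javier.mcc_sic_cross_join")
print("✅ MCC-SIC cross join created!")

# Reload from the saved table so later steps do not re-run AI_SIMILARITY
# (Snowpark DataFrames are lazy and would recompute the cross join otherwise)
mcc_sic_cross_join = session.table("sandbox.javier.mcc_sic_cross_join")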

# Step 2: Ranking - Top 3 SIC matches per MCC
print("\n🏆 Step 2: Applying ranking to get top 3 SIC matches per MCC...")

mcc_sic_top_3 = mcc_sic_cross_join.select(
    "*",
    F.row_number().over(W.Window.partition_by("MCC_CODE").orderBy(F.col("SIMILARITY_SCORE").desc())).alias("SIMILARITY_RANK")
).filter(F.col("SIMILARITY_RANK") <= 3)

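mcc_sic_top_3.write.mode("overwrite").save_as_table("sandbox.javier.mcc_sic_cross_join_top3")
print("✅ Top 3 SIC matches per MCC saved!")

# Reload from the saved table so Step 3 reads stored rows instead of recomputing them
mcc_sic_top_3 = session.table("sandbox.javier.mcc_sic_cross_join_top3")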


# Step 3: AI Classification
print("\n🤖 Step 3: Using AI_CLASSIFY for final SIC selection...")

# Group top 3 SIC options for AI_CLASSIFY
mcc_with_sic_categories = mcc_sic_top_3.group_by("MCC_CODE", "MCC_TEXT").agg(
    F.array_agg(F.expr("SIC_TEXT")).within_group("SIMILARITY_RANK").alias("SIC_TEXTS")
)

# Apply AI_CLASSIFY
mcc_sic_classified = mcc_with_sic_categories.select(
    "MCC_CODE", "MCC_TEXT", "SIC_TEXTS",
    F.call_function("AI_CLASSIFY", F.concat(F.col("MCC_CODE"), F.lit(". "), F.col("MCC_TEXT")), F.col("SIC_TEXTS")).alias("AI_CLASSIFIED_SIC_TEXT")
)


🏭 Creating MCC-SIC mappings using the same methodology...

📊 Step 1: Creating MCC-SIC cross join with AI_SIMILARITY...
✅ MCC-SIC cross join created!

🏆 Step 2: Applying ranking to get top 3 SIC matches per MCC...
✅ Top 3 SIC matches per MCC saved!

🤖 Step 3: Using AI_CLASSIFY for final SIC selection...


In [24]:

# Save and create mapping table
mcc_sic_classified.selectExpr("""MCC_CODE, MCC_TEXT, SIC_TEXTS,
                               AI_CLASSIFIED_SIC_TEXT:labels[0]::string as AI_CLASSIFIED_SIC_TEXT_STR,
                               COALESCE(AI_CLASSIFIED_SIC_TEXT_STR, SIC_TEXTS[0]::string) as MOST_LIKELY_SIC_TEXT"""
).write.mode("overwrite").save_as_table("sandbox.javier.mcc_sic_classified")

session.table("sandbox.javier.mcc_sic_classified").print_schema()
# Create final mapping table
sic_pairs = session.table("sandbox.javier.mcc_sic_cross_join").select("SIC_CODE", "SIC_TEXT").distinct()
mcc_sic_lookup = (
    session.table("sandbox.javier.mcc_sic_classified").join(sic_pairs, F.col("MOST_LIKELY_SIC_TEXT") == F.col("SIC_TEXT"), "left")
    .select("MCC_CODE", F.col("SIC_CODE").alias("MOST_LIKELY_SIC_CODE")).distinct().orderBy("MCC_CODE")
)
mcc_sic_lookup.write.mode("overwrite").save_as_table("sandbox.javier.code_map_mcc_sic")

print("✅ MCC-SIC mapping complete!")
print(f"📊 {humanize.intcomma(mcc_sic_lookup.count())} MCC codes mapped to SIC")


root
 |-- "MCC_CODE": LongType() (nullable = True)
 |-- "MCC_TEXT": StringType(33554435) (nullable = True)
 |-- "SIC_TEXTS": ArrayType (nullable = False)
 |   |-- element: StringType()
 |-- "AI_CLASSIFIED_SIC_TEXT_STR": StringType() (nullable = True)
 |-- "MOST_LIKELY_SIC_TEXT": StringType() (nullable = True)
✅ MCC-SIC mapping complete!
📊 286 MCC codes mapped to SIC


# 🎯 Final MCC-Centered Industry Classification Table

## Objective
Create the final MCC-centered industry classification table by joining all mappings together.
This will provide one record per MCC code with corresponding NAICS and SIC lookups.

## Final Output Structure
- **MCC_CODE** (Primary Key)
- **NAICS_CODE** (Mapped)
- **SIC_CODE** (Mapped)
- Plus descriptive fields from all three classification systems
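
The core of this structure is a full outer join of the two MCC-keyed mappings, which appears to be what the notebook's final join does in Snowpark. A plain-Python sketch with illustrative values only (`build_classification` is a hypothetical helper, and the code pairs are not asserted to be real results):

```python
def build_classification(mcc_to_naics: dict, mcc_to_sic: dict) -> list:
    """Full outer join of two MCC-keyed mappings, one row per MCC code."""
    all_mccs = sorted(set(mcc_to_naics) | set(mcc_to_sic))
    return [
        {
            "MCC_CODE": mcc,
            "NAICS_CODE": mcc_to_naics.get(mcc),  # None when no NAICS mapping exists
            "SIC_CODE": mcc_to_sic.get(mcc),      # None when no SIC mapping exists
        }
        for mcc in all_mccs
    ]


# Illustrative values only.
mcc_to_naics = {7230: "812112", 7534: "811198"}
mcc_to_sic = {7534: "7534", 7542: "7542"}

table = build_classification(mcc_to_naics, mcc_to_sic)
lookup = {row["MCC_CODE"]: row for row in table}
print(lookup[7534])
```

A full outer join keeps MCC codes that appear in only one of the two mappings, so gaps show up as `None` rather than as silently missing rows.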

---


In [128]:
mcc_naics.show()

-----------------------------------------
|"MCC_CODE"  |"MOST_LIKELY_NAICS_CODE"  |
-----------------------------------------
|7216        |8123                      |
|7217        |561740                    |
|7221        |541921                    |
|7230        |812112                    |
|7251        |811430                    |
|7261        |812210                    |
|7273        |812990                    |
|7276        |541213                    |
|7277        |624190                    |
|7278        |8121                      |
-----------------------------------------



In [129]:
mcc_sic.show()

---------------------------------------
|"MCC_CODE"  |"MOST_LIKELY_SIC_CODE"  |
---------------------------------------
|7534        |7534                    |
|7535        |7539                    |
|7538        |5599                    |
|7542        |7542                    |
|7549        |4789                    |
|7622        |7629                    |
|7623        |7623                    |
|7629        |3639                    |
|7631        |7631                    |
|7641        |7641                    |
---------------------------------------



In [138]:
sic_lu_spdf.columns

['SIC_INDUSTRY_CODE',
 'SIC_INDUSTRY_DESCRIPTION',
 'SIC_MAJOR_GROUP_DESCRIPTION',
 'SIC_DIVISION_DESCRIPTION']

In [141]:
# Create lookup tables with text descriptions
mcc_lu_with_text_spdf = mcc_lu_spdf.select(
    "MCC",
    "MCC_DESCRIPTIVE_TITLE",
    F.col("MCC_DESCRIPTIVE_TITLE").alias("MCC_TEXT")
)

sic_lu_with_text_spdf = sic_lu_spdf.select(
    "SIC_INDUSTRY_CODE",
    "SIC_INDUSTRY_DESCRIPTION",
    "SIC_MAJOR_GROUP_DESCRIPTION",
    F.concat(F.col("SIC_INDUSTRY_DESCRIPTION"), F.lit(" - "), F.col("SIC_MAJOR_GROUP_DESCRIPTION")).alias("SIC_TEXT")
)

# Join MCC-NAICS and MCC-SIC lookups to create final classification table
mcc_naics = session.table("sandbox.javier.code_map_mcc_naics").select(
    F.col("MCC_CODE").alias("MCC_CODE_NAICS_SIDE"),
    "MOST_LIKELY_NAICS_CODE"
)
mcc_sic = session.table("sandbox.javier.code_map_mcc_sic").select(
    F.col("MCC_CODE").alias("MCC_CODE_SIC_SIDE"), 
    "MOST_LIKELY_SIC_CODE"
)

# Create final classification table with all descriptive fields
mcc_classifications = (
    mcc_naics
    .join(mcc_sic, mcc_naics.MCC_CODE_NAICS_SIDE == mcc_sic.MCC_CODE_SIC_SIDE, "full")
    .join(mcc_lu_with_text_spdf, F.coalesce(F.col("MCC_CODE_NAICS_SIDE"), F.col("MCC_CODE_SIC_SIDE")) == mcc_lu_with_text_spdf.MCC, "left")
    .join(
        naics_lu_spdf.select("CODE", F.col("DESCRIPTION").alias("NAICS_TEXT")), 
        F.col("MOST_LIKELY_NAICS_CODE") == naics_lu_spdf.CODE,
        "left"
    )
    .join(
        sic_lu_with_text_spdf,
        F.col("MOST_LIKELY_SIC_CODE") == sic_lu_with_text_spdf.SIC_INDUSTRY_CODE,
        "left"
    )
    .select(
        F.coalesce(F.col("MCC_CODE_NAICS_SIDE"), F.col("MCC_CODE_SIC_SIDE")).alias("MCC_CODE"),
        "MCC_TEXT",
        F.col("MOST_LIKELY_NAICS_CODE").alias("NAICS_CODE"),
        "NAICS_TEXT", 
        F.col("MOST_LIKELY_SIC_CODE").alias("SIC_CODE"),
        "SIC_TEXT"
    )
    .orderBy("MCC_CODE")
)

# Save final classification table
mcc_classifications.write.mode("overwrite").save_as_table("sandbox.javier.mcc_industry_classifications")
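
The join above is a full outer join, so an MCC that appears in only one of the two mappings still gets a row, and `F.coalesce` picks whichever side carries the key. The same logic can be sketched in plain Python. The dictionaries and the function name `full_outer_merge` below are hypothetical stand-ins for the Snowflake tables.

```python
def full_outer_merge(naics_map, sic_map):
    """Merge two {mcc: code} dicts, keeping keys found on either side."""
    merged = []
    for mcc in sorted(set(naics_map) | set(sic_map)):
        merged.append({
            "MCC_CODE": mcc,
            "NAICS_CODE": naics_map.get(mcc),  # None when only the SIC side has this MCC
            "SIC_CODE": sic_map.get(mcc),      # None when only the NAICS side has this MCC
        })
    return merged

# Hypothetical mappings for illustration only
naics_map = {1: 111, 2: 222}
sic_map = {2: 20, 3: 30}

for row in full_outer_merge(naics_map, sic_map):
    print(row)
```

Taking the union of keys plays the role of the coalesced join key: every MCC yields exactly one output row, with `None` marking the side that has no mapping.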


In [142]:
# Final Statistics and Verification

print("📈 Final Statistics and Verification")
print("=" * 50)


# Verify one record per MCC code
mcc_classifications = session.table("sandbox.javier.mcc_industry_classifications")

# Get comprehensive statistics
total_mcc_records = mcc_classifications.count()
records_with_naics = mcc_classifications.filter(F.col("NAICS_CODE").isNotNull()).count()
records_with_sic = mcc_classifications.filter(F.col("SIC_CODE").isNotNull()).count()
records_with_both = mcc_classifications.filter(
    F.col("NAICS_CODE").isNotNull() & F.col("SIC_CODE").isNotNull()
).count()

print(f"\n🎯 MCC-Centered Classification Summary:")
print(f"Total MCC records:                 {total_mcc_records:>6}")
print(f"MCC records with NAICS mapping:    {records_with_naics:>6} ({(records_with_naics/total_mcc_records*100):>5.1f}%)")
print(f"MCC records with SIC mapping:      {records_with_sic:>6} ({(records_with_sic/total_mcc_records*100):>5.1f}%)")
print(f"MCC records with both mappings:    {records_with_both:>6} ({(records_with_both/total_mcc_records*100):>5.1f}%)")

# Verify exactly one record per MCC code
print(f"\n🔍 Verification - One Record Per MCC Code:")
records_per_mcc = mcc_classifications.group_by("MCC_CODE").count()
max_records_per_mcc = records_per_mcc.agg(F.max("count")).collect()[0][0]
min_records_per_mcc = records_per_mcc.agg(F.min("count")).collect()[0][0]

print(f"Max records per MCC: {max_records_per_mcc}")
print(f"Min records per MCC: {min_records_per_mcc}")

if max_records_per_mcc == 1 and min_records_per_mcc == 1:
    print("✅ SUCCESS: Exactly one record per MCC code confirmed!")
else:
    print("⚠️  WARNING: Multiple records found for some MCC codes")

# Show unique counts
unique_mcc = records_per_mcc.count()  # one group per distinct MCC_CODE
unique_naics = mcc_classifications.select("NAICS_CODE").distinct().count()
unique_sic = mcc_classifications.select("SIC_CODE").distinct().count()

print(f"\n📊 Unique Code Counts:")
print(f"Unique MCC codes:                  {unique_mcc:>6}")
print(f"Unique NAICS codes mapped:         {unique_naics:>6}")
print(f"Unique SIC codes mapped:           {unique_sic:>6}")

print(f"\n✅ MCC-centered industry classification lookup complete!")
print(f"💡 Table: 'mcc_industry_classifications' - One record per MCC with NAICS and SIC lookups")


📈 Final Statistics and Verification

🎯 MCC-Centered Classification Summary:
Total MCC records:                    286
MCC records with NAICS mapping:       286 (100.0%)
MCC records with SIC mapping:         286 (100.0%)
MCC records with both mappings:       286 (100.0%)

🔍 Verification - One Record Per MCC Code:
Max records per MCC: 1
Min records per MCC: 1
✅ SUCCESS: Exactly one record per MCC code confirmed!

📊 Unique Code Counts:
Unique MCC codes:                     286
Unique NAICS codes mapped:            233
Unique SIC codes mapped:              218

✅ MCC-centered industry classification lookup complete!
💡 Table: 'mcc_industry_classifications' - One record per MCC with NAICS and SIC lookups
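
The verification above compares the minimum and maximum group sizes. When that check fails, it helps to see which keys are duplicated. Below is a small plain-Python sketch using `collections.Counter` on hypothetical rows; `duplicated_keys` is an illustrative name, not a function from the notebook.

```python
from collections import Counter

def duplicated_keys(rows, key="MCC_CODE"):
    """Return {key value: row count} for every key that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical rows for illustration only
rows = [
    {"MCC_CODE": 1, "SIC_CODE": 10},
    {"MCC_CODE": 2, "SIC_CODE": 20},
    {"MCC_CODE": 2, "SIC_CODE": 21},  # duplicate MCC
]

print(duplicated_keys(rows))
# {2: 2}
```

In Snowpark, filtering `records_per_mcc` on a count greater than 1 should produce the equivalent list, assuming the grouped count column keeps its default name.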


In [143]:
# Create comprehensive industry classification table with descriptions

print("🔄 Creating comprehensive MCC-centered industry classification table...")

# NOTE: This assumes the mapping tables code_map_mcc_naics and code_map_mcc_sic have been created
# by following the same methodology as the original notebook but with MCC as the primary table

try:
    # Load final mappings (these would be created by the full implementation)
    code_map_mcc_naics = session.table("sandbox.javier.code_map_mcc_naics")
    code_map_mcc_sic = session.table("sandbox.javier.code_map_mcc_sic")
    
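    # Join the MCC-NAICS and MCC-SIC mappings.
    # A left join keeps every MCC in the NAICS mapping; an MCC present only in the SIC mapping is dropped.
    # Sorting is deferred to the final write, since ordering applied here would not survive the later joins.
    mcc_industry_mappings = code_map_mcc_naics.join(
        code_map_mcc_sic,
        on="MCC_CODE",
        how="left"
    )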

    # Join with lookup tables to get comprehensive descriptions
    mcc_industry_classifications = \
    (mcc_industry_mappings
        .join(mcc_lu_spdf, mcc_industry_mappings.MCC_CODE == mcc_lu_spdf.MCC, "left")
        .join(naics_lu_spdf, mcc_industry_mappings.MOST_LIKELY_NAICS_CODE == naics_lu_spdf.CODE, "left") 
        .join(sic_lu_spdf, mcc_industry_mappings.MOST_LIKELY_SIC_CODE == sic_lu_spdf.SIC_INDUSTRY_CODE, "left")
        .selectExpr(
            """ MCC_CODE,
                MOST_LIKELY_NAICS_CODE AS NAICS_CODE,
                MOST_LIKELY_SIC_CODE AS SIC_CODE,
                -- MCC fields (Primary)
                MCC_DESCRIPTIVE_TITLE AS MCC_TITLE,
                INCLUDED_IN_THIS_MCC AS MCC_INCLUDED_MERCHANTS,
                -- NAICS fields  
                TITLE AS NAICS_TITLE,
                DESCRIPTION_FULL AS NAICS_DESCRIPTION,
                -- SIC fields
                SIC_INDUSTRY_DESCRIPTION AS SIC_TITLE,
                SIC_MAJOR_GROUP_DESCRIPTION AS SIC_MAJOR_GROUP_TITLE,
                SIC_DIVISION_DESCRIPTION AS SIC_DIVISION_TITLE
            """
        )
    )

    print("🔍 Sample of MCC-centered classification table:")
    mcc_industry_classifications.limit(5).show()

    # Save the comprehensive table.
    # This overwrites the table of the same name written earlier, replacing its columns with the set selected above.
    print("\n💾 Writing comprehensive classifications to 'mcc_industry_classifications' table...")
    mcc_industry_classifications.orderBy("MCC_CODE").write.mode("overwrite").save_as_table("sandbox.javier.mcc_industry_classifications")

    print(f"\n📊 Final table contains {humanize.intcomma(mcc_industry_classifications.count())} MCC records")
    print("\n🎯 Final MCC-centered industry classification table created!")
    print("💡 You can now lookup NAICS and SIC values for any MCC code using this table")

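except Exception as e:
    # Any failure lands here, not only missing tables, so report the actual error first
    print(f"⚠️ Could not build the comprehensive table: {e}")
    print("   If the mapping tables do not exist yet, complete the full implementation first.")
    print("   The expected final structure is shown below.")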
    
    # Show expected structure with sample data
    print(f"\n📋 Expected Final Table Structure:")
    print("MCC_CODE | NAICS_CODE | SIC_CODE | MCC_TITLE | NAICS_TITLE | SIC_TITLE")
    print("---------|------------|----------|-----------|-------------|----------")
    print("5411     | 445110     | 5411     | Grocery   | Supermarket | Grocery  ")
    print("5812     | 722511     | 5812     | Restaurant| Full-Service| Restaurant")
    print("...")
    print("\n🎯 Each MCC code will have exactly one record with corresponding NAICS and SIC lookups")


🔄 Creating comprehensive MCC-centered industry classification table...
🔍 Sample of MCC-centered classification table:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"MCC_CODE"  |"NAICS_CODE"  |"SIC_CODE"  |"MCC_TITLE"  |"MCC_INCLUDED_MERCHANTS"  |"NAICS_TITLE"  |"NAICS_DESCRIPTION"  |"SIC_TITLE"  |"SIC_MAJOR_GROUP_TITLE"  |"SIC_DIVISION_TITLE"  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [157]:
# Reload and display the saved table
print("\n📋 Displaying saved MCC industry classifications ordered by MCC code:")
mcc_industry_classifications = session.table("sandbox.javier.mcc_industry_classifications")
mcc_industry_classifications.orderBy("MCC_CODE").show()
# Count distinct MCC codes
distinct_mcc_count = mcc_industry_classifications.select("MCC_CODE").distinct().count()
print(f"\n🔢 Total distinct MCC codes: {humanize.intcomma(distinct_mcc_count)}")



📋 Displaying saved MCC industry classifications ordered by MCC code:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"MCC_CODE"  |"NAICS_CODE"  |"SIC_CODE"  |"MCC_TITLE"  |"MCC_INCLUDED_MERCHANTS"  |"NAICS_TITLE"  |"NAICS_DESCRIPTION"  |"SIC_TITLE"  |"SIC_MAJOR_GROUP_TITLE"  |"SIC_DIVISION_TITLE"  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
# Close session
session.close()
