# Chapter 7 Lab


## 🧪 Lab: Cleanin' Up TheraGPT's CRM

TheraGPT is scaling fast thanks to the AI boom. But their CRM is a mess: nested customer profiles, mangled strings, duplicate records, and timezone-jumbled transactions. Your job: clean it up, enrich it, and build a foundation for B2B growth.

This lab builds directly on the skills you practiced in this chapter: handling nested data structures, complex text processing with regex, string normalization and entity resolution, and time series transformations. Your goal is to write both Python solutions and AI-driven solutions that produce the same results.

You'll be working with three files:
- `setup/incoming_customers.json` - Deeply nested customer data from mobile onboarding
- `setup/crm_customers.csv` - Messy existing CRM with duplicates and inconsistencies  
- `setup/transactions.csv` - Transaction data with messy timestamps and NET terms

---

## 1. Handling Hierarchical and Nested Data Structures

**Scenario:** TheraGPT's incoming customer data is a heavily nested JSON (e.g., from a mobile onboarding API).

**Goal:** Flatten it using AI and Python.

**Input:** `incoming_customers.json`

**AI Task:** Extract first name, last name, email, nested address details, subscription info, and therapy preferences.

**Python Parallel:** Use recursive extraction or json_normalize.

✅ **Try It Now:** Add a new nested field like `insurance_info` and extract it with minimal schema changes.
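One way to approach the `json_normalize` option is sketched below. It is a minimal illustration, not the lab's reference solution: the record is a trimmed copy of the sample shown later in this section, and the `insurance_info` block with its `provider` and `plan` keys is invented purely to show that new nested fields are picked up without changing the extraction code.

```python
import pandas as pd

# Hypothetical record modelled on the lab sample; `insurance_info` and its
# keys are made up for illustration and are not part of the lab data.
record = {
    "customer_id": "c001",
    "profile": {
        "personal": {"first_name": "Alex", "last_name": "Johnson"},
        "address": {"details": {"apartment": "4B"}},
        "therapy_preferences": {"topics": ["anxiety", "career"]},
        "insurance_info": {"provider": "ExampleCare", "plan": {"tier": "basic"}},
    },
}

# json_normalize walks nested dicts and joins each key path with `sep`.
flat = pd.json_normalize([record], sep="_")

# Drop the shared "profile_" prefix so column names stay short.
prefix = "profile_"
flat.columns = [c[len(prefix):] if c.startswith(prefix) else c for c in flat.columns]

# Lists are left untouched by json_normalize, so join them explicitly.
flat["therapy_preferences_topics"] = flat["therapy_preferences_topics"].str.join(", ")

print(sorted(flat.columns))
```

The trade-off is that column names follow the JSON paths (for example `personal_first_name`), so a rename step may still be needed to match a fixed schema.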


### Step 1: Flatten Using Python (json_normalize)


In [21]:
import json
import pandas as pd
from pandas import json_normalize

# Load the nested JSON data
with open('setup/incoming_customers.json', 'r') as f:
    nested_data = json.load(f)

print("Sample nested record:")
print(json.dumps(nested_data[0], indent=2))

# Function to manually flatten nested JSON
def flatten_customer_record(record):
    """Manually flatten a single customer record."""
    flattened = {
        'customer_id': record['customer_id'],
        'first_name': record['profile']['personal']['first_name'],
        'last_name': record['profile']['personal']['last_name'],
        'email': record['profile']['personal']['email'],
        'address_line': record['profile']['address']['address_line'],
        'apartment': record['profile']['address']['details'].get('apartment'),
        'building': record['profile']['address']['details'].get('building'),
        'subscription_code': record['profile']['subscription']['subscription_code'],
        'subscription_start': record['profile']['subscription']['start_date'],
        'preferred_time': record['profile']['therapy_preferences']['preferred_time'],
        'therapist_gender': record['profile']['therapy_preferences']['therapist_gender'],
        'therapy_topics': ', '.join(record['profile']['therapy_preferences']['topics'])
    }
    return flattened

# Flatten all records
flattened_records = [flatten_customer_record(record) for record in nested_data]
df_python = pd.DataFrame(flattened_records)

print(f"\nFlattened {len(df_python)} customer records using Python:")
display(df_python)


Sample nested record:
{
  "customer_id": "c001",
  "profile": {
    "personal": {
      "first_name": "Alex",
      "last_name": "Johnson",
      "email": "alex.johnson@gmail.com"
    },
    "address": {
      "address_line": "123mainstreet,newyorkcity,ny,10001",
      "details": {
        "apartment": "4B",
        "building": "Sunrise Towers"
      }
    },
    "subscription": {
      "subscription_code": "GOLD2xONLINE",
      "start_date": "2024-03-01"
    },
    "therapy_preferences": {
      "preferred_time": "Evenings",
      "therapist_gender": "Any",
      "topics": [
        "anxiety",
        "career"
      ]
    }
  }
}

Flattened 15 customer records using Python:


Unnamed: 0,customer_id,first_name,last_name,email,address_line,apartment,building,subscription_code,subscription_start,preferred_time,therapist_gender,therapy_topics
0,c001,Alex,Johnson,alex.johnson@gmail.com,"123mainstreet,newyorkcity,ny,10001",4B,Sunrise Towers,GOLD2xONLINE,2024-03-01,Evenings,Any,"anxiety, career"
1,c002,Jordan,Smith,jordan.smith@techcorp.io,"456 Elm St, Boston, Massachusetts, 02118",,,PLATINUM1xINPERSON,2024-02-15,Mornings,Female,stress
2,c003,Casey,Brown,casey_brown@innovate.com,"789 Oak Ave,Los Angeles,CA,90001",2A,Palm Court,SILVER1xONLINE,2024-01-20,Afternoons,Male,relationships
3,c004,Taylor,Davis,taylor.davis@gmail.com,"321 Pine Rd,Miami,FL,33101",,Ocean View,GOLD1xINPERSON,2024-04-10,Evenings,Any,depression
4,c005,Morgan,Wilson,morgan.wilson@globaltech.com,"555 Maple Blvd,Seattle,WA,98101",10C,,GOLD2xONLINE,2024-03-15,Mornings,Male,"work stress, family"
5,c006,Riley,Anderson,riley.anderson@healthplus.org,"777 Cedar St, Denver, Colorado, 80203",5F,Mountain View,PLATINUM2xHYBRID,2024-01-05,Afternoons,Female,"anxiety, self-esteem"
6,c007,Avery,Thompson,avery.thompson@yahoo.com,"888spruceway,phoenix,az,85001",,Desert Oaks,SILVER2xONLINE,2024-02-28,Evenings,Any,"grief, trauma"
7,c008,Cameron,Lee,cameron.lee@dataflow.net,"999 Birch Ln, Portland, Oregon, 97201",12A,,ENTERPRISE3xHYBRID,2024-03-20,Mornings,Female,"leadership, burnout"
8,c009,Quinn,Martinez,q.martinez@outlook.com,"111 Willow Dr,Austin,TX,73301",3C,Riverside Plaza,GOLD1xONLINE,2024-04-01,Afternoons,Male,addiction recovery
9,c010,Sage,Robinson,sage.robinson@medtech.com,"222 Aspen Ave, Nashville, Tennessee, 37201",,Music City Towers,PLATINUM1xINPERSON,2024-01-15,Evenings,Any,"performance anxiety, creativity"


### Step 2: Flatten Using AI (Structured Response)


In [22]:
import openai
import os
from dotenv import load_dotenv
from pydantic import BaseModel
from typing import Optional, List
from tqdm.notebook import tqdm

# Load API key
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# Define the flattened structure
class FlattenedCustomer(BaseModel):
    customer_id: str
    first_name: str
    last_name: str
    email: str
    address_line: str
    apartment: Optional[str]
    building: Optional[str]
    subscription_code: str
    subscription_start: str
    preferred_time: str
    therapist_gender: str
    therapy_topics: str  # Comma-separated string

# System prompt for flattening
system_prompt = """
You are a data extraction assistant. Extract and flatten the nested customer data into the following structure:
- customer_id: The customer ID
- first_name: From profile.personal.first_name
- last_name: From profile.personal.last_name  
- email: From profile.personal.email
- address_line: From profile.address.address_line
- apartment: From profile.address.details.apartment (null if not present)
- building: From profile.address.details.building (null if not present)
- subscription_code: From profile.subscription.subscription_code
- subscription_start: From profile.subscription.start_date
- preferred_time: From profile.therapy_preferences.preferred_time
- therapist_gender: From profile.therapy_preferences.therapist_gender
- therapy_topics: Join the topics array into a comma-separated string

Return the result as a JSON object matching the FlattenedCustomer structure.
"""

# Process each record with AI
ai_flattened = []
for record in tqdm(nested_data, desc="AI Flattening"):
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                # Send valid JSON rather than the Python repr of the dict
                {"role": "user", "content": json.dumps(record)}
            ],
            response_format=FlattenedCustomer
        )
        ai_flattened.append(completion.choices[0].message.parsed.dict())
    except Exception as e:
        print(f"Error processing record: {e}")
        continue

df_ai = pd.DataFrame(ai_flattened)
print(f"\nFlattened {len(df_ai)} customer records using AI:")
display(df_ai)



Flattened 15 customer records using AI:


Unnamed: 0,customer_id,first_name,last_name,email,address_line,apartment,building,subscription_code,subscription_start,preferred_time,therapist_gender,therapy_topics
0,c001,Alex,Johnson,alex.johnson@gmail.com,"123mainstreet,newyorkcity,ny,10001",4B,Sunrise Towers,GOLD2xONLINE,2024-03-01,Evenings,Any,"anxiety,career"
1,c002,Jordan,Smith,jordan.smith@techcorp.io,"456 Elm St, Boston, Massachusetts, 02118",,,PLATINUM1xINPERSON,2024-02-15,Mornings,Female,stress
2,c003,Casey,Brown,casey_brown@innovate.com,"789 Oak Ave,Los Angeles,CA,90001",2A,Palm Court,SILVER1xONLINE,2024-01-20,Afternoons,Male,relationships
3,c004,Taylor,Davis,taylor.davis@gmail.com,"321 Pine Rd,Miami,FL,33101",,Ocean View,GOLD1xINPERSON,2024-04-10,Evenings,Any,depression
4,c005,Morgan,Wilson,morgan.wilson@globaltech.com,"555 Maple Blvd,Seattle,WA,98101",10C,,GOLD2xONLINE,2024-03-15,Mornings,Male,"work stress,family"
5,c006,Riley,Anderson,riley.anderson@healthplus.org,"777 Cedar St, Denver, Colorado, 80203",5F,Mountain View,PLATINUM2xHYBRID,2024-01-05,Afternoons,Female,"anxiety,self-esteem"
6,c007,Avery,Thompson,avery.thompson@yahoo.com,"888spruceway,phoenix,az,85001",,Desert Oaks,SILVER2xONLINE,2024-02-28,Evenings,Any,"grief,trauma"
7,c008,Cameron,Lee,cameron.lee@dataflow.net,"999 Birch Ln, Portland, Oregon, 97201",12A,,ENTERPRISE3xHYBRID,2024-03-20,Mornings,Female,"leadership,burnout"
8,c009,Quinn,Martinez,q.martinez@outlook.com,"111 Willow Dr,Austin,TX,73301",3C,Riverside Plaza,GOLD1xONLINE,2024-04-01,Afternoons,Male,addiction recovery
9,c010,Sage,Robinson,sage.robinson@medtech.com,"222 Aspen Ave, Nashville, Tennessee, 37201",,Music City Towers,PLATINUM1xINPERSON,2024-01-15,Evenings,Any,"performance anxiety, creativity"
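The two outputs are close but not identical: the Python version joins topics with `", "`, while most of the AI rows use `","` (and the last AI row uses `", "` again). Since the lab asks for both approaches to produce the same results, a small comparison helper can make such differences visible. The sketch below uses two tiny stand-in frames copied from the outputs above; `normalize_topics` and `align` are hypothetical helper names.

```python
import pandas as pd

# Stand-ins for df_python and df_ai, using values from the outputs above.
py = pd.DataFrame({
    "customer_id": ["c001", "c002"],
    "therapy_topics": ["anxiety, career", "stress"],
})
ai = pd.DataFrame({
    "customer_id": ["c001", "c002"],
    "therapy_topics": ["anxiety,career", "stress"],
})

def normalize_topics(value: str) -> str:
    """Split on commas, trim whitespace, and rejoin with ', '."""
    return ", ".join(part.strip() for part in value.split(","))

def align(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize the topics separator and put rows in a stable order."""
    out = df.copy()
    out["therapy_topics"] = out["therapy_topics"].map(normalize_topics)
    return out.sort_values("customer_id").reset_index(drop=True)

raw_equal = py.equals(ai)                      # separators differ
normalized_equal = align(py).equals(align(ai))  # same content once normalized
print(raw_equal, normalized_equal)
```

Tightening the system prompt (for example, stating the exact separator) is another option, but a deterministic post-processing step is easier to verify.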


---

## 2. Complex Text Processing with Regular Expressions

**Scenario:** Some flattened fields are still messy, such as `address_line` (`123mainstreet,newyorkcity,ny,10001`) or `subscription_code` (`GOLD2xONLINE`).

**Goal:** Use regex (and AI-assisted regex) to parse those into structured fields.

**Input:** DataFrame from Step 1.

**AI Task:** Use AI to extract address components and subscription details (Tier, Frequency, Channel).

**Python Parallel:** Build regex manually and validate with test cases.

✅ **Try It Now:** Add a subscription like `PLATINUM1xINPERSON` and update the logic.


### Step 1: Parse Complex Fields Using Python Regex


In [23]:
import re

# Use the Python flattened data for regex processing
df_regex = df_python.copy()

# Function to parse address_line into components
def parse_address(address_line):
    """Parse address line into street, city, state, zip."""
    if pd.isna(address_line):
        return None, None, None, None
    
    # Handle comma-separated format: "123mainstreet,newyorkcity,ny,10000"
    if ',' in address_line:
        parts = [part.strip() for part in address_line.split(',')]
        if len(parts) >= 4:
            return parts[0], parts[1], parts[2], parts[3]
        elif len(parts) == 3:
            return parts[0], parts[1], parts[2], None
    
    # Handle space-separated format: "456 Elm St Boston Massachusetts 02118"
    # Use regex to extract zip code first, then work backwards
    zip_match = re.search(r'\b\d{5}\b', address_line)
    if zip_match:
        zip_code = zip_match.group()
        remaining = address_line.replace(zip_code, '').strip()
        
        # Split remaining into parts
        parts = remaining.split()
        if len(parts) >= 3:
            # Last part is likely state, rest is street + city
            state = parts[-1]
            city_parts = []
            street_parts = []
            
            # Simple heuristic: if a part is a known state, everything before is city
            for i, part in enumerate(parts[:-1]):
                if part.upper() in ['NY', 'CA', 'FL', 'MA', 'WA', 'NEW', 'MASSACHUSETTS', 'CALIFORNIA']:
                    street_parts = parts[:i]
                    city_parts = parts[i:-1]
                    break
            
            if not city_parts:  # Fallback: assume last 2 parts before state are city
                street_parts = parts[:-2]
                city_parts = parts[-2:-1]
            
            street = ' '.join(street_parts)
            city = ' '.join(city_parts)
            return street, city, state, zip_code
    
    return address_line, None, None, None

# Function to parse subscription_code
def parse_subscription(sub_code):
    """Parse subscription code like GOLD2xONLINE into tier, frequency, channel."""
    if pd.isna(sub_code):
        return None, None, None
    
    # Pattern: TIER + NUMBER + x + CHANNEL
    pattern = r'^([A-Z]+)(\d+)x([A-Z]+)$'
    match = re.match(pattern, sub_code)
    
    if match:
        tier = match.group(1)
        frequency = f"{match.group(2)}x per week"
        channel = match.group(3)
        return tier, frequency, channel
    
    return sub_code, None, None

# Apply parsing functions
print("Parsing address lines...")
address_parsed = df_regex['address_line'].apply(parse_address)
df_regex[['street', 'city', 'state', 'zip_code']] = pd.DataFrame(address_parsed.tolist(), index=df_regex.index)

print("Parsing subscription codes...")
subscription_parsed = df_regex['subscription_code'].apply(parse_subscription)
df_regex[['subscription_tier', 'subscription_frequency', 'subscription_channel']] = pd.DataFrame(subscription_parsed.tolist(), index=df_regex.index)

print("\nParsed data using Python regex:")
display(df_regex[['customer_id', 'first_name', 'last_name', 'street', 'city', 'state', 'zip_code', 
                  'subscription_tier', 'subscription_frequency', 'subscription_channel']])


Parsing address lines...
Parsing subscription codes...

Parsed data using Python regex:


Unnamed: 0,customer_id,first_name,last_name,street,city,state,zip_code,subscription_tier,subscription_frequency,subscription_channel
0,c001,Alex,Johnson,123mainstreet,newyorkcity,ny,10001,GOLD,2x per week,ONLINE
1,c002,Jordan,Smith,456 Elm St,Boston,Massachusetts,2118,PLATINUM,1x per week,INPERSON
2,c003,Casey,Brown,789 Oak Ave,Los Angeles,CA,90001,SILVER,1x per week,ONLINE
3,c004,Taylor,Davis,321 Pine Rd,Miami,FL,33101,GOLD,1x per week,INPERSON
4,c005,Morgan,Wilson,555 Maple Blvd,Seattle,WA,98101,GOLD,2x per week,ONLINE
5,c006,Riley,Anderson,777 Cedar St,Denver,Colorado,80203,PLATINUM,2x per week,HYBRID
6,c007,Avery,Thompson,888spruceway,phoenix,az,85001,SILVER,2x per week,ONLINE
7,c008,Cameron,Lee,999 Birch Ln,Portland,Oregon,97201,ENTERPRISE,3x per week,HYBRID
8,c009,Quinn,Martinez,111 Willow Dr,Austin,TX,73301,GOLD,1x per week,ONLINE
9,c010,Sage,Robinson,222 Aspen Ave,Nashville,Tennessee,37201,PLATINUM,1x per week,INPERSON
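The task description asks for the regex to be validated with test cases, which the cell above does not do. A minimal sketch of such a check follows. It reuses the same pattern as `parse_subscription`, but this simplified variant returns the frequency as an integer and `None` for non-matching codes, which is an assumption made here to keep the assertions short.

```python
import re

# Same pattern as the lab's parse_subscription: TIER + NUMBER + "x" + CHANNEL.
SUBSCRIPTION_RE = re.compile(r"^([A-Z]+)(\d+)x([A-Z]+)$")

def parse_subscription_strict(code):
    """Return (tier, sessions, channel), or None if the code does not match."""
    match = SUBSCRIPTION_RE.match(code)
    if match is None:
        return None
    tier, sessions, channel = match.groups()
    return tier, int(sessions), channel

# Expected results, including a few codes that should be rejected.
cases = {
    "GOLD2xONLINE": ("GOLD", 2, "ONLINE"),
    "PLATINUM1xINPERSON": ("PLATINUM", 1, "INPERSON"),
    "ENTERPRISE3xHYBRID": ("ENTERPRISE", 3, "HYBRID"),
    "gold2xonline": None,   # lowercase is not accepted by [A-Z]+
    "GOLDxONLINE": None,    # missing frequency digit
}

for code, expected in cases.items():
    assert parse_subscription_strict(code) == expected, code
print("all subscription test cases passed")
```

Keeping the negative cases in the table makes it obvious when a later change to the pattern starts accepting malformed codes.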


### Step 2: Parse Complex Fields Using AI


In [24]:
# Define structure for parsed fields
class ParsedFields(BaseModel):
    street: Optional[str]
    city: Optional[str]
    state: Optional[str]
    zip_code: Optional[str]
    subscription_tier: Optional[str]
    subscription_frequency: Optional[str]
    subscription_channel: Optional[str]

# System prompt for parsing
parse_prompt = """
You are a text parsing assistant. Parse the given address_line and subscription_code fields:

For address_line:
- Extract street, city, state, and zip_code
- Handle formats like "123mainstreet,newyorkcity,ny,10000" or "456 Elm St Boston Massachusetts 02118"

For subscription_code:
- Extract tier, frequency, and channel from codes like "GOLD2xONLINE"
- Format frequency as "2x per week" style
- Tier should be the first part (e.g., GOLD, SILVER, PLATINUM, ENTERPRISE)
- Channel should be the last part (e.g., ONLINE, INPERSON, HYBRID)

Return the result as a JSON object matching the ParsedFields structure.
"""

# Process each record with AI
ai_parsed = []
for _, row in tqdm(df_ai.iterrows(), total=len(df_ai), desc="AI Parsing"):
    input_data = {
        "address_line": row['address_line'],
        "subscription_code": row['subscription_code']
    }
    
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": parse_prompt},
                {"role": "user", "content": str(input_data)}
            ],
            response_format=ParsedFields
        )
        parsed = completion.choices[0].message.parsed.dict()
        ai_parsed.append(parsed)
    except Exception as e:
        print(f"Error parsing record: {e}")
        ai_parsed.append({
            'street': None, 'city': None, 'state': None, 'zip_code': None,
            'subscription_tier': None, 'subscription_frequency': None, 'subscription_channel': None
        })

# Add parsed fields to AI dataframe
df_ai_parsed = df_ai.copy()
for i, parsed in enumerate(ai_parsed):
    for key, value in parsed.items():
        df_ai_parsed.loc[i, key] = value

print("\nParsed data using AI:")
display(df_ai_parsed[['customer_id', 'first_name', 'last_name', 'street', 'city', 'state', 'zip_code', 
                      'subscription_tier', 'subscription_frequency', 'subscription_channel']])



Parsed data using AI:


Unnamed: 0,customer_id,first_name,last_name,street,city,state,zip_code,subscription_tier,subscription_frequency,subscription_channel
0,c001,Alex,Johnson,123mainstreet,newyorkcity,ny,10001,GOLD,2x per week,ONLINE
1,c002,Jordan,Smith,456 Elm St,Boston,Massachusetts,2118,PLATINUM,1x per week,INPERSON
2,c003,Casey,Brown,789 Oak Ave,Los Angeles,CA,90001,SILVER,1x per week,ONLINE
3,c004,Taylor,Davis,321 Pine Rd,Miami,FL,33101,GOLD,1x per week,INPERSON
4,c005,Morgan,Wilson,555 Maple Blvd,Seattle,WA,98101,GOLD,2x per week,ONLINE
5,c006,Riley,Anderson,777 Cedar St,Denver,Colorado,80203,PLATINUM,2x per week,HYBRID
6,c007,Avery,Thompson,888 Spruce Way,Phoenix,AZ,85001,SILVER,2x per week,ONLINE
7,c008,Cameron,Lee,999 Birch Ln,Portland,Oregon,97201,ENTERPRISE,3x per week,HYBRID
8,c009,Quinn,Martinez,111 Willow Dr,Austin,TX,73301,GOLD,1x per week,ONLINE
9,c010,Sage,Robinson,222 Aspen Ave,Nashville,Tennessee,37201,PLATINUM,1x per week,INPERSON


---

## 3. String Normalization and Entity Resolution

This is where things get fun and complicated. We have a CRM database where each row is treated as a unique customer initially, but we suspect there are duplicates (same person with slightly different information). We'll use AI to identify these duplicates and create golden records.

### Part A: Normalize Existing CRM

**Goal:** Clean up the CRM CSV with string normalization: normalize states, validate email addresses, parse domains, and format names consistently. Each row starts as a "unique" customer with its own ID.

### Part B: AI-Powered Entity Resolution to Find Duplicates

**Goal:** Use AI to compare each customer record against all others to identify potential duplicates. The AI will look for similar names, emails, addresses, and other identifying information to find customers who are likely the same person.
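Comparing every record against all others means the prompt grows with the size of the CRM, so a cheap pre-filter may be worth running first. The sketch below is one possible Python parallel, not the lab's method: it shortlists candidate pairs using exact email matches or a standard-library name similarity within the same zip code. The rows, the field names, and the `0.7` threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical rows modelled on the CRM sample; field names are assumptions.
customers = [
    {"id": "crm001", "name": "Alexander Johnson",
     "email": "alex.johnson@gmail.com", "zip": "10001"},
    {"id": "crm002", "name": "Alex Johnson",
     "email": "alex.johnson@gmail.com", "zip": "10001"},
    {"id": "crm007", "name": "Taylor Davis",
     "email": "taylor.davis@gmail.com", "zip": "33101"},
]

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_pairs(rows, name_threshold=0.7):
    """Return ID pairs worth a closer look (by a model or a human)."""
    pairs = []
    for left, right in combinations(rows, 2):
        same_email = left["email"].lower() == right["email"].lower()
        same_zip = left["zip"] == right["zip"]
        name_score = similarity(left["name"], right["name"])
        if same_email or (same_zip and name_score >= name_threshold):
            pairs.append((left["id"], right["id"]))
    return pairs

pairs = candidate_pairs(customers)
print(pairs)
```

Only the shortlisted pairs would then need to be sent to the model, which keeps each prompt small and the reasoning easier to audit.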

### Part C: Create Golden Records from Duplicates

**Goal:** Take the identified duplicate groups and merge them into single "golden records" with the best information from each source. Non-duplicates become their own golden records.
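Pairwise matches tend to come back in both directions (A matches B, and B matches A) and can chain (A matches B, B matches C), so it helps to collapse them into groups before merging. A minimal sketch using a union-find structure is shown below; `group_duplicates` is a hypothetical helper, and the IDs and pairs are illustrative rather than actual lab results.

```python
def group_duplicates(all_ids, pairs):
    """Collapse pairwise matches into groups of connected customer IDs."""
    parent = {cid: cid for cid in all_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for cid in all_ids:
        groups.setdefault(find(cid), []).append(cid)
    return list(groups.values())

# Illustrative IDs and matches; symmetric pairs are handled naturally.
ids = ["crm001", "crm002", "crm003", "crm004", "crm007"]
pairs = [("crm001", "crm002"), ("crm002", "crm001"), ("crm003", "crm004")]

groups = group_duplicates(ids, pairs)
golden = {f"GOLDEN_{i:03d}": members for i, members in enumerate(groups, start=1)}
print(golden)
```

Records with no match end up as single-member groups, which mirrors the rule that non-duplicates become their own golden records.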

### Part D: Match Incoming Customers to Golden Records

**Goal:** Finally, match incoming customers to the golden records in the database. While matching, flag whether each incoming customer's email is a business email.
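The business-email flag can also be computed deterministically as a cross-check on the AI result. The sketch below treats any domain outside a small set of free webmail providers as a business domain. The set itself is an assumption and would need extending for real data, and `email_domain` and `is_business_email` are hypothetical helper names.

```python
# Assumed list of free webmail providers; extend it for real data.
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

def email_domain(email):
    """Return the lowercased domain, or None if there is not exactly one '@'."""
    if not isinstance(email, str) or email.count("@") != 1:
        return None
    domain = email.rsplit("@", 1)[1].strip().lower()
    return domain or None

def is_business_email(email):
    """True when the address has a domain that is not a known free provider."""
    domain = email_domain(email)
    return domain is not None and domain not in FREE_EMAIL_DOMAINS

for address in ["jordan.smith@techcorp.io", "alex.johnson@gmail.com", "not-an-email"]:
    print(address, is_business_email(address))
```

Malformed addresses are reported as non-business here; depending on the use case it may be better to surface them separately for review.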

✅ **Try It Now:** Add a row with a nickname or corporate domain and see if it matches.


### Part A: Normalize Existing CRM Using Python


In [25]:
# Load the CRM data - each row is treated as a unique customer initially
crm_df = pd.read_csv('setup/crm_customers.csv')

print("Raw CRM data (each row treated as unique customer):")
display(crm_df)

# Normalize CRM data using Python
def normalize_crm_python(df):
    """Normalize CRM data using Python pandas operations."""
    df_clean = df.copy()
    
    # Normalize names - title case
    df_clean['first_name'] = df_clean['first_name'].str.title().str.strip()
    df_clean['last_name'] = df_clean['last_name'].str.title().str.strip()
    
    # Normalize email - lowercase and strip
    df_clean['email'] = df_clean['email'].str.lower().str.strip()
    
    # Extract everything after the '@' (also handles multi-label domains such as mail.example.co.uk)
    df_clean['email_domain'] = df_clean['email'].str.extract(r'@([^@\s]+)$', expand=False)
    
    # Validate email format
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    df_clean['is_valid_email'] = df_clean['email'].str.match(email_pattern, na=False)
    
    # Normalize address - lowercase and clean
    df_clean['address_normalized'] = df_clean['address'].str.lower().str.strip()
    
    # Normalize state names to abbreviations
    state_mapping = {
        'new york': 'NY', 'ny': 'NY', 'newyorkcity': 'NY',
        'massachusetts': 'MA', 'ma': 'MA', 'boston': 'MA',
        'california': 'CA', 'ca': 'CA', 'los angeles': 'CA',
        'florida': 'FL', 'fl': 'FL', 'miami': 'FL',
        'washington': 'WA', 'wa': 'WA', 'seattle': 'WA',
        'colorado': 'CO', 'co': 'CO', 'denver': 'CO'
    }
    
    df_clean['state_normalized'] = df_clean['state'].str.lower().map(state_mapping).fillna(df_clean['state'].str.upper())
    
    # Clean zip codes - extract 5 digits and pad with zeros if needed
    df_clean['zip_normalized'] = df_clean['zip'].astype(str).str.extract(r'(\d{4,5})')[0].str.zfill(5)
    
    return df_clean

crm_normalized = normalize_crm_python(crm_df)
print(f"\nNormalized CRM data ({len(crm_normalized)} records):")
display(crm_normalized)


Raw CRM data (each row treated as unique customer):


Unnamed: 0,customer_id,first_name,last_name,email,address,state,zip,subscription_code
crm001,Alexander,Johnson,alex.johnson@gmail.com,123 Main St,New York,NY,10001,GOLD2xONLINE
crm002,Alex,Johnson,alex.johnson@gmail.com,123mainstreet,newyorkcity,ny,10001,GOLD2xONLINE
crm003,Jordan,Smith,jordan.smith@techcorp.io,456 Elm St,Boston,Massachusetts,2118,PLATINUM1xINPERSON
crm004,J,Smith,j.smith@techcorp.io,456 Elm Street,Boston,MA,2118,PLATINUM1xINPERSON
crm005,Casey,Brown,casey@innovate.com,789 Oak Ave,Los Angeles,CA,90001,SILVER1xONLINE
crm006,Casey,Brown,casey_brown@innovate.com,789 Oak Ave,Los Angeles,CA,90001,SILVER1xONLINE
crm007,Taylor,Davis,taylor.davis@gmail.com,321 Pine Rd,Miami,FL,33101,GOLD1xINPERSON
crm008,Morgan,Wilson,morgan.wilson@globaltech.com,555 Maple Blvd,Seattle,WA,98101,GOLD2xONLINE
crm009,Riley,Anderson,riley.anderson@healthplus.org,777 Cedar St,Denver,Colorado,80203,PLATINUM2xHYBRID
crm010,Avery,Thompson,avery.thompson@yahoo.com,888 Spruce Way,Phoenix,AZ,85001,SILVER2xONLINE



Normalized CRM data (28 records):


Unnamed: 0,customer_id,first_name,last_name,email,address,state,zip,subscription_code,email_domain,is_valid_email,address_normalized,state_normalized,zip_normalized
crm001,Alexander,Johnson,Alex.Johnson@Gmail.Com,123 main st,New York,NY,10001,GOLD2xONLINE,,False,new york,NY,10001
crm002,Alex,Johnson,Alex.Johnson@Gmail.Com,123mainstreet,newyorkcity,ny,10001,GOLD2xONLINE,,False,newyorkcity,NY,10001
crm003,Jordan,Smith,Jordan.Smith@Techcorp.Io,456 elm st,Boston,Massachusetts,2118,PLATINUM1xINPERSON,,False,boston,MA,2118
crm004,J,Smith,J.Smith@Techcorp.Io,456 elm street,Boston,MA,2118,PLATINUM1xINPERSON,,False,boston,MA,2118
crm005,Casey,Brown,Casey@Innovate.Com,789 oak ave,Los Angeles,CA,90001,SILVER1xONLINE,,False,los angeles,CA,90001
crm006,Casey,Brown,Casey_Brown@Innovate.Com,789 oak ave,Los Angeles,CA,90001,SILVER1xONLINE,,False,los angeles,CA,90001
crm007,Taylor,Davis,Taylor.Davis@Gmail.Com,321 pine rd,Miami,FL,33101,GOLD1xINPERSON,,False,miami,FL,33101
crm008,Morgan,Wilson,Morgan.Wilson@Globaltech.Com,555 maple blvd,Seattle,WA,98101,GOLD2xONLINE,,False,seattle,WA,98101
crm009,Riley,Anderson,Riley.Anderson@Healthplus.Org,777 cedar st,Denver,Colorado,80203,PLATINUM2xHYBRID,,False,denver,CO,80203
crm010,Avery,Thompson,Avery.Thompson@Yahoo.Com,888 spruce way,Phoenix,AZ,85001,SILVER2xONLINE,,False,phoenix,AZ,85001
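The raw and normalized outputs above look shifted by one column: the `crm...` identifier appears as the index, first names sit under `customer_id`, street addresses sit under `email`, and `is_valid_email` is `False` for every row. That pattern is what pandas produces when the header row has one fewer name than the data rows have fields. The sketch below reproduces the effect and shows one possible fix with explicit column names. The header line, the extra `city` column, and the leading-zero zip are assumptions inferred from the displayed output, not confirmed contents of `crm_customers.csv`.

```python
import io
import pandas as pd

# Rows modelled on the raw output above. The header (eight names for nine
# fields) and the "02118" zip are assumptions made for this illustration.
raw_csv = """customer_id,first_name,last_name,email,address,state,zip,subscription_code
crm001,Alexander,Johnson,alex.johnson@gmail.com,123 Main St,New York,NY,10001,GOLD2xONLINE
crm003,Jordan,Smith,jordan.smith@techcorp.io,456 Elm St,Boston,Massachusetts,02118,PLATINUM1xINPERSON
"""

# Default load: with one extra field per row, pandas uses the first column
# as the index and every remaining value lands one column to the left.
shifted = pd.read_csv(io.StringIO(raw_csv))

# Explicit names (including the assumed `city` column) restore the alignment;
# reading zip as a string keeps leading zeros.
columns = ["customer_id", "first_name", "last_name", "email",
           "address", "city", "state", "zip", "subscription_code"]
fixed = pd.read_csv(io.StringIO(raw_csv), header=0, names=columns,
                    dtype={"zip": str})

print(shifted.loc["crm001", "customer_id"])   # a first name, not an ID
print(fixed.loc[0, ["customer_id", "email", "city", "zip"]].tolist())
```

If the real file does have this shape, loading it with explicit names before calling `normalize_crm_python` should let the email, state, and zip steps operate on the intended columns.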


### Part B: AI-Powered Entity Resolution to Find Duplicates


In [26]:
# Define structure for entity resolution results
class EntityMatch(BaseModel):
    customer_id: str
    similar_customers: List[str]  # List of customer IDs that are likely the same person
    match_confidence: float
    match_reasoning: str
    is_likely_duplicate: bool

# System prompt for entity resolution
entity_resolution_prompt = """
You are an entity resolution specialist. You will receive one customer record and a list of all other customers in the database. Your task is to identify if this customer appears to be the same person as any other customers in the database.

Consider these matching factors:
1. Email similarity (exact match is strongest indicator)
2. Name similarity (nicknames, abbreviations, typos)
3. Address similarity (same location, minor formatting differences)
4. Phone number similarity (if available)

Rules for matching:
- Exact email matches are very strong indicators
- Similar names + similar addresses are strong indicators  
- Consider common nicknames (Dave/David, Janie/Jane, Johnny/Jonny)
- Consider typos and formatting differences
- Be conservative - only flag as duplicate if confidence is high (>0.7)

Return:
- customer_id: The ID of the customer being evaluated
- similar_customers: List of customer IDs that appear to be the same person
- match_confidence: Score from 0.0-1.0 indicating confidence in the match
- match_reasoning: Explanation of why you think they match (or don't)
- is_likely_duplicate: True if confidence > 0.7

Return a JSON object matching the EntityMatch structure.
"""

# Perform entity resolution using AI
entity_matches = []
all_customers = crm_normalized.to_dict('records')

for i, customer in enumerate(tqdm(all_customers, desc="AI Entity Resolution")):
    # Create a list of other customers to compare against
    other_customers = [c for j, c in enumerate(all_customers) if j != i]
    
    input_data = {
        "target_customer": customer,
        "other_customers": other_customers
    }
    
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": entity_resolution_prompt},
                {"role": "user", "content": str(input_data)}
            ],
            response_format=EntityMatch
        )
        match_result = completion.choices[0].message.parsed.dict()
        entity_matches.append(match_result)
    except Exception as e:
        print(f"Error in entity resolution for {customer['customer_id']}: {e}")
        continue

df_entity_matches = pd.DataFrame(entity_matches)

# Show the entity resolution results
print(f"\nEntity Resolution Results:")
display(df_entity_matches)

# Show potential duplicates
duplicates = df_entity_matches[df_entity_matches['is_likely_duplicate'] == True]
print(f"\n🔍 Found {len(duplicates)} potential duplicate groups:")
for _, match in duplicates.iterrows():
    print(f"Customer {match['customer_id']} matches: {match['similar_customers']}")
    print(f"   Confidence: {match['match_confidence']:.2f}")
    print(f"   Reasoning: {match['match_reasoning']}")
    print()



Entity Resolution Results:


Unnamed: 0,customer_id,similar_customers,match_confidence,match_reasoning,is_likely_duplicate
0,Alexander,[Alex],0.8,The target customer and 'Alex' both have 'John...,True
1,Alex,[Alexander],0.9,The first name 'Alex' is a common short form o...,True
2,Jordan,[J],0.85,Both customers have the same first and last na...,True
3,J,[Jordan],0.85,The target customer 'J' and customer 'Jordan' ...,True
4,Casey,[Casey],0.95,"The target customer, customer ID 'Casey', shar...",True
5,Casey,[Casey],0.9,The target customer 'Casey' has an exact match...,True
6,Taylor,[],0.0,No exact email matches found for Taylor.Davis@...,False
7,Morgan,[],0.0,No exact email match found. The customer's nam...,False
8,Riley,[],0.0,No exact email match found for the target cust...,False
9,Avery,[],0.0,No exact email match found across other custom...,False



🔍 Found 14 potential duplicate groups:
Customer Alexander matches: ['Alex']
   Confidence: 0.80
   Reasoning: The target customer and 'Alex' both have 'Johnson' as the last name and similar email addresses 'Alex.Johnson@Gmail.Com' with minor formatting differences in addresses '123 main st' vs '123mainstreet'. The addresses seem to describe the same location and both are normalized to 'new york'. The ZIP codes also match perfectly (10001). The name 'Alexander' and 'Alex' are often interchangeable as one is a short form of the other.

Customer Alex matches: ['Alexander']
   Confidence: 0.90
   Reasoning: The first name 'Alex' is a common short form of 'Alexander'. Both customers have the same last name and an exact match in the email address, both of which are 'Alex.Johnson@Gmail.Com'. The address components are also strongly similar, with '123mainstreet, newyorkcity, ny, 10001' being effectively equivalent to '123 main st, New York, NY, 10001'. Both records share the same subscript

### Part C: Create Golden Records from Duplicates


In [27]:
# Define golden record structure
class GoldenRecord(BaseModel):
    golden_id: str  # New unified ID
    customer_ids: List[str]  # Original IDs that were merged
    first_name: str
    last_name: str
    email: str
    email_domain: str
    address: str
    state: str
    zip_code: str
    subscription_code: str
    confidence_score: float
    source_records: int
    is_business_email: bool

# System prompt for creating golden records from duplicates
golden_prompt = """
You are a data consolidation assistant. You will receive a group of customer records that have been identified as likely duplicates (same person). Your task is to merge them into a single "golden record" with the best information from each source.

Rules for consolidation:
- Use the most complete and accurate information available
- For names: prefer full names over nicknames, but use consistent capitalization
- For emails: prefer the most complete/valid email address
- For addresses: use the most complete address information
- For states: normalize to 2-letter abbreviations (NY, CA, FL, etc.)
- For zip codes: use 5-digit format
- Extract email domain from the email address
- Determine if this is a business email (corporate domain vs gmail/yahoo/hotmail)
- Assign a confidence_score (0.0-1.0) based on data completeness and consistency
- Count the number of source_records used
- Generate a new golden_id in format "GOLDEN_001", "GOLDEN_002", etc.

Return a JSON object matching the GoldenRecord structure.
"""

# Create golden records from identified duplicates
golden_records = []
processed_customers = set()
golden_counter = 1

# Process duplicates first
for _, match in duplicates.iterrows():
    customer_id = match['customer_id']
    similar_ids = match['similar_customers']
    
    # Skip if already processed
    if customer_id in processed_customers:
        continue
    
    # Get all records for this duplicate group
    all_ids = [customer_id] + similar_ids
    duplicate_records = []
    
    for cid in all_ids:
        if cid not in processed_customers:
            record = crm_normalized[crm_normalized['customer_id'] == cid].iloc[0].to_dict()
            duplicate_records.append(record)
            processed_customers.add(cid)
    
    if duplicate_records:
        input_data = {
            "golden_id": f"GOLDEN_{golden_counter:03d}",
            "customer_records": duplicate_records
        }
        
        try:
            completion = openai.beta.chat.completions.parse(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": golden_prompt},
                    {"role": "user", "content": str(input_data)}
                ],
                response_format=GoldenRecord
            )
            golden_record = completion.choices[0].message.parsed.dict()
            golden_record['customer_ids'] = all_ids
            golden_records.append(golden_record)
            golden_counter += 1
        except Exception as e:
            print(f"Error creating golden record for {all_ids}: {e}")

# Process remaining unique customers
for _, customer in crm_normalized.iterrows():
    if customer['customer_id'] not in processed_customers:
        # Create golden record for unique customer
        golden_record = {
            'golden_id': f"GOLDEN_{golden_counter:03d}",
            'customer_ids': [customer['customer_id']],
            'first_name': customer['first_name'],
            'last_name': customer['last_name'],
            'email': customer['email'],
            'email_domain': customer['email_domain'],
            'address': customer['address'],
            'state': customer['state_normalized'],
            'zip_code': customer['zip_normalized'],
            'subscription_code': customer['subscription_code'],
            'confidence_score': 1.0,  # High confidence for unique records
            'source_records': 1,
            # Compare case-insensitively so 'Gmail.Com' is still treated as a personal domain
            'is_business_email': str(customer['email_domain']).lower() not in ['gmail.com', 'yahoo.com', 'hotmail.com'] if pd.notna(customer['email_domain']) else False
        }
        golden_records.append(golden_record)
        golden_counter += 1

df_golden = pd.DataFrame(golden_records)
print(f"\n✨ Created {len(df_golden)} golden records from {len(crm_normalized)} original records:")
display(df_golden)

# Summary
duplicates_merged = len(crm_normalized) - len(df_golden)
print(f"\n📊 Summary:")
print(f"Original CRM records: {len(crm_normalized)}")
print(f"Golden records created: {len(df_golden)}")
print(f"Duplicate records merged: {duplicates_merged}")
print(f"Business emails identified: {len(df_golden[df_golden['is_business_email'] == True])}")



✨ Created 21 golden records from 28 original records:


Unnamed: 0,golden_id,customer_ids,first_name,last_name,email,email_domain,address,state,zip_code,subscription_code,confidence_score,source_records,is_business_email
0,GOLDEN_001,"[Alexander, Alex]",Alexander,Johnson,alex.johnson@gmail.com,gmail.com,"123 Main St, New York",NY,10001,GOLD2xONLINE,0.8,2,False
1,GOLDEN_002,"[Jordan, J]",Smith,Jordan-Smith,Jordan.Smith@Techcorp.Io,techcorp.io,"456 Elm St, Boston",MA,2118,PLATINUM1xINPERSON,0.9,2,True
2,GOLDEN_003,"[Casey, Casey]",Brown,Casey,Casey@Innovate.Com,innovate.com,"789 Oak Ave, Los Angeles",CA,90001,SILVER1xONLINE,0.7,1,True
3,GOLDEN_001,"[Cameron, Cameron]",Cameron,Lee,Cameron.Lee@Dataflow.Net,dataflow.net,"999 Birch Ln, Portland",OR,97201,ENTERPRISE3xHYBRID,0.8,1,True
4,GOLDEN_006,[Sage],Sage,Robinson,sage.robinson@medtech.com,medtech.com,"222 Aspen Ave, Nashville",TN,37201,PLATINUM1xINPERSON,0.8,1,True
5,GOLDEN_001,[Emery],Emery,Hall,Emery.Hall@Startup.Co,Startup.Co,"777 Cedar Blvd, Raleigh",NC,27601,PLATINUM3xONLINE,0.75,1,True
6,GOLDEN_008,"[Dakota, Dakota]",Green,Dakota,Dakota.Green@gmail.com,gmail.com,"999 Maple Ave, Richmond",VA,23220,SILVER1xONLINE,0.8,1,False
7,GOLDEN_008,[Taylor],Davis,Taylor.Davis@Gmail.Com,321 pine rd,,Miami,FL,33101,GOLD1xINPERSON,1.0,1,False
8,GOLDEN_009,[Morgan],Wilson,Morgan.Wilson@Globaltech.Com,555 maple blvd,,Seattle,WA,98101,GOLD2xONLINE,1.0,1,False
9,GOLDEN_010,[Riley],Anderson,Riley.Anderson@Healthplus.Org,777 cedar st,,Denver,CO,80203,PLATINUM2xHYBRID,1.0,1,False



📊 Summary:
Original CRM records: 28
Golden records created: 21
Duplicate records merged: 7
Business emails identified: 5
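
In the recorded output above, some `golden_id` values repeat (for example, `GOLDEN_001` appears on three rows). The likely cause is that the ID in each merged record comes from the model's response rather than from `golden_counter`. The following is a minimal sketch of a deterministic alternative, assuming each record is a plain dict with `first_name`, `last_name`, and `email` keys. The names `merge_group` and `PERSONAL_DOMAINS` are hypothetical and not part of the lab code, and "longest name wins" is only a rough stand-in for the "prefer full names over nicknames" rule in `golden_prompt`.

```python
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}  # same domains the lab code checks

def merge_group(records, golden_number):
    """Merge duplicate customer dicts into one golden record (hypothetical helper)."""
    # Prefer the longest name, so 'Alexander' wins over 'Alex'
    first_name = max((r["first_name"] for r in records), key=len)
    last_name = max((r["last_name"] for r in records), key=len)
    # Lowercase emails so 'Alex.Johnson@Gmail.Com' and 'alex.johnson@gmail.com' collapse to one
    emails = sorted({r["email"].strip().lower() for r in records})
    email = emails[0]
    domain = email.split("@")[-1]
    return {
        "golden_id": f"GOLDEN_{golden_number:03d}",  # assigned in code, never by the model
        "first_name": first_name.title(),
        "last_name": last_name.title(),
        "email": email,
        "email_domain": domain,
        "is_business_email": domain not in PERSONAL_DOMAINS,
        "source_records": len(records),
    }

# Illustrative duplicate group based on the Alexander/Alex example above
group = [
    {"first_name": "Alexander", "last_name": "Johnson", "email": "Alex.Johnson@Gmail.Com"},
    {"first_name": "Alex", "last_name": "Johnson", "email": "alex.johnson@gmail.com"},
]
golden = merge_group(group, golden_number=1)
print(golden)
```

A hybrid design is also possible: let the model pick the best field values, then overwrite `golden_id` and `source_records` in code, since both are known before the call.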


### Part D: Match Incoming Customers to Golden Records


In [28]:
# Define structure for matching incoming customers to golden records
class CustomerMatch(BaseModel):
    incoming_customer_id: str
    matched_golden_id: Optional[str]
    match_confidence: float
    is_business_email: bool
    match_reasoning: str
    is_new_customer: bool

# System prompt for matching incoming customers
match_prompt = """
You are an entity resolution assistant. You will receive an incoming customer record and a list of existing golden CRM records. Your task is to:

1. Determine if the incoming customer matches any existing golden record
2. Consider email similarity, name similarity, and other identifying information
3. Flag whether the incoming customer's email is a business email (corporate domain vs personal like gmail.com)
4. Provide a confidence score (0.0-1.0) for the match
5. Explain your reasoning
6. Mark as new customer if no good match found (confidence < 0.7)

Business email indicators:
- Corporate domains (not gmail.com, yahoo.com, hotmail.com, etc.)
- Company names in domain
- Professional email formats

Return a JSON object matching the CustomerMatch structure.
"""

# Match incoming customers to golden records
matches = []
for _, incoming in tqdm(df_ai_parsed.iterrows(), total=len(df_ai_parsed), desc="Matching Incoming Customers"):
    input_data = {
        "incoming_customer": incoming.to_dict(),
        "golden_records": df_golden.to_dict('records')
    }
    
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": match_prompt},
                {"role": "user", "content": str(input_data)}
            ],
            response_format=CustomerMatch
        )
        match_result = completion.choices[0].message.parsed.dict()
        matches.append(match_result)
    except Exception as e:
        print(f"Error matching customer: {e}")
        matches.append({
            'incoming_customer_id': incoming['customer_id'],
            'matched_golden_id': None,
            'match_confidence': 0.0,
            'is_business_email': False,
            'match_reasoning': f"Error: {e}",
            'is_new_customer': True
        })

df_matches = pd.DataFrame(matches)
print(f"\n🎯 Incoming Customer Matching Results:")
display(df_matches)

# Summary statistics
print(f"\n📊 Matching Summary:")
print(f"Total incoming customers: {len(df_matches)}")
print(f"Matched to existing golden records: {len(df_matches[df_matches['is_new_customer'] == False])}")
print(f"New customers: {len(df_matches[df_matches['is_new_customer'] == True])}")
print(f"Business emails identified: {len(df_matches[df_matches['is_business_email'] == True])}")

# Show specific matches
existing_matches = df_matches[df_matches['is_new_customer'] == False]
if len(existing_matches) > 0:
    print(f"\n🔗 Existing Customer Matches:")
    for _, match in existing_matches.iterrows():
        print(f"Incoming {match['incoming_customer_id']} → Golden {match['matched_golden_id']} (confidence: {match['match_confidence']:.2f})")
        print(f"   Reasoning: {match['match_reasoning']}")
        print()




🎯 Incoming Customer Matching Results:


Unnamed: 0,incoming_customer_id,matched_golden_id,match_confidence,is_business_email,match_reasoning,is_new_customer
0,c001,GOLDEN_001,0.9,False,The incoming customer matches Golden Record GO...,False
1,c002,GOLDEN_002,0.95,True,The incoming customer matches the golden recor...,False
2,c003,GOLDEN_003,0.95,True,The incoming customer's email 'casey_brown@inn...,False
3,c004,GOLDEN_008,1.0,False,The incoming customer matches on all available...,False
4,c005,GOLDEN_009,0.95,True,The incoming customer's details closely match ...,False
5,c006,GOLDEN_010,0.95,True,The incoming customer 'Riley Anderson' has a s...,False
6,c007,GOLDEN_011,1.0,False,The incoming customer 'Avery Thompson' matches...,False
7,c008,GOLDEN_001,0.95,True,The incoming customer record matches the golde...,False
8,c009,GOLDEN_012,1.0,False,The incoming customer matches golden record GO...,False
9,c010,GOLDEN_006,1.0,True,The incoming customer matches GOLDEN_006 based...,False



📊 Matching Summary:
Total incoming customers: 15
Matched to existing golden records: 15
New customers: 0
Business emails identified: 9

🔗 Existing Customer Matches:
Incoming c001 → Golden GOLDEN_001 (confidence: 0.90)
   Reasoning: The incoming customer matches Golden Record GOLDEN_001 based on exact matches in email (alex.johnson@gmail.com), last name (Johnson), and address components (123 Main St, New York, NY 10001). The first name 'Alex' also aligns as a common variant of 'Alexander'. The subscription_code also matches exactly. The email is from gmail.com, a personal email provider.

Incoming c002 → Golden GOLDEN_002 (confidence: 0.95)
   Reasoning: The incoming customer matches the golden record GOLDEN_002 based on several factors: the email address jordan.smith@techcorp.io matches exactly, indicating it's the same business email. The name Jordan Smith matches closely with Smith Jordan-Smith, accommodating potential variations in order and hyphenation in last names. The
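
As a Python parallel to the AI matching above, exact email matches can be settled before any model call. The following is a minimal sketch, assuming golden records are dicts with `golden_id` and `email` keys. `normalize_email` and `match_by_email` are hypothetical helpers, the two golden records are an illustrative subset, and the 0.7 threshold mirrors the cutoff in `match_prompt`. Character-level similarity from `difflib.SequenceMatcher` is a swapped-in stand-in for the model's judgment, not an equivalent.

```python
from difflib import SequenceMatcher

def normalize_email(email):
    """Lowercase and trim so formatting differences do not block a match."""
    return email.strip().lower()

def match_by_email(incoming, golden_records, threshold=0.7):
    """Return (golden_id, score); golden_id is None when nothing clears the threshold."""
    target = normalize_email(incoming["email"])
    best_id, best_score = None, 0.0
    for golden in golden_records:
        candidate = normalize_email(golden["email"])
        # 1.0 for identical addresses, otherwise a character-level similarity ratio
        if candidate == target:
            score = 1.0
        else:
            score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best_id, best_score = golden["golden_id"], score
    if best_score < threshold:
        return None, best_score  # treat as a new customer
    return best_id, best_score

# Illustrative subset of golden records
golden_records = [
    {"golden_id": "GOLDEN_001", "email": "alex.johnson@gmail.com"},
    {"golden_id": "GOLDEN_002", "email": "Jordan.Smith@Techcorp.Io"},
]
print(match_by_email({"email": "jordan.smith@techcorp.io"}, golden_records))
print(match_by_email({"email": "q.martinez@outlook.com"}, golden_records))
```

Records that fall below the threshold here are good candidates to send to the model, which keeps the expensive call for the ambiguous cases.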

In [29]:
import openai
import os
from dotenv import load_dotenv
from pydantic import BaseModel
from typing import Optional, List
from tqdm.notebook import tqdm

# Load API key
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# Define the flattened structure
class FlattenedCustomer(BaseModel):
    customer_id: str
    first_name: str
    last_name: str
    email: str
    address_line: str
    apartment: Optional[str]
    building: Optional[str]
    subscription_code: str
    subscription_start: str
    preferred_time: str
    therapist_gender: str
    therapy_topics: str  # Comma-separated string

# System prompt for flattening
system_prompt = """
You are a data extraction assistant. Extract and flatten the nested customer data into the following structure:
- customer_id: The customer ID
- first_name: From profile.personal.first_name
- last_name: From profile.personal.last_name  
- email: From profile.personal.email
- address_line: From profile.address.address_line
- apartment: From profile.address.details.apartment (null if not present)
- building: From profile.address.details.building (null if not present)
- subscription_code: From profile.subscription.subscription_code
- subscription_start: From profile.subscription.start_date
- preferred_time: From profile.therapy_preferences.preferred_time
- therapist_gender: From profile.therapy_preferences.therapist_gender
- therapy_topics: Join the topics array into a comma-separated string

Return the result as a JSON object matching the FlattenedCustomer structure.
"""

# Process each record with AI
ai_flattened = []
for record in tqdm(nested_data, desc="AI Flattening"):
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": str(record)}
            ],
            response_format=FlattenedCustomer
        )
        ai_flattened.append(completion.choices[0].message.parsed.dict())
    except Exception as e:
        print(f"Error processing record: {e}")
        continue

df_ai = pd.DataFrame(ai_flattened)
print(f"\nFlattened {len(df_ai)} customer records using AI:")
display(df_ai)




Flattened 15 customer records using AI:


Unnamed: 0,customer_id,first_name,last_name,email,address_line,apartment,building,subscription_code,subscription_start,preferred_time,therapist_gender,therapy_topics
0,c001,Alex,Johnson,alex.johnson@gmail.com,"123mainstreet,newyorkcity,ny,10001",4B,Sunrise Towers,GOLD2xONLINE,2024-03-01,Evenings,Any,"anxiety,career"
1,c002,Jordan,Smith,jordan.smith@techcorp.io,"456 Elm St, Boston, Massachusetts, 02118",.,.,PLATINUM1xINPERSON,2024-02-15,Mornings,Female,stress
2,c003,Casey,Brown,casey_brown@innovate.com,"789 Oak Ave,Los Angeles,CA,90001",2A,Palm Court,SILVER1xONLINE,2024-01-20,Afternoons,Male,relationships
3,c004,Taylor,Davis,taylor.davis@gmail.com,"321 Pine Rd,Miami,FL,33101",,Ocean View,GOLD1xINPERSON,2024-04-10,Evenings,Any,depression
4,c005,Morgan,Wilson,morgan.wilson@globaltech.com,"555 Maple Blvd,Seattle,WA,98101",10C,,GOLD2xONLINE,2024-03-15,Mornings,Male,"work stress,family"
5,c006,Riley,Anderson,riley.anderson@healthplus.org,"777 Cedar St, Denver, Colorado, 80203",5F,Mountain View,PLATINUM2xHYBRID,2024-01-05,Afternoons,Female,"anxiety,self-esteem"
6,c007,Avery,Thompson,avery.thompson@yahoo.com,"888spruceway,phoenix,az,85001",,Desert Oaks,SILVER2xONLINE,2024-02-28,Evenings,Any,"grief,trauma"
7,c008,Cameron,Lee,cameron.lee@dataflow.net,"999 Birch Ln, Portland, Oregon, 97201",12A,,ENTERPRISE3xHYBRID,2024-03-20,Mornings,Female,"leadership,burnout"
8,c009,Quinn,Martinez,q.martinez@outlook.com,"111 Willow Dr,Austin,TX,73301",3C,Riverside Plaza,GOLD1xONLINE,2024-04-01,Afternoons,Male,addiction recovery
9,c010,Sage,Robinson,sage.robinson@medtech.com,"222 Aspen Ave, Nashville, Tennessee, 37201",,Music City Towers,PLATINUM1xINPERSON,2024-01-15,Evenings,Any,"performance anxiety,creativity"


---

## 4. Time Series and Date-Time Transformations

**Scenario:** You have a messy `transactions.csv` with raw ISO-8601 timestamps (some in UTC, some with other offsets) and payment terms like NET30.

**Goal:** Standardize datetime formats, calculate due dates, aggregate spend by time window.

**Input:** `transactions.csv`

**AI Task:** Extract date-only, timezones (PST/EST/GST), fiscal quarter, business-day due date, and contribution to account balance.

✅ **Try It Now:**
- Add a new timezone like JST
- Change fiscal calendar: Q1 = Mar–May
- Add a "day_of_week" field
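
One way to keep these three changes small is to parameterize the fiscal calendar instead of hard-coding month lists. The following is a minimal sketch using only the standard library. The names `fiscal_quarter`, `JST`, and `DAY_NAMES` are hypothetical, and the default `q1_start_month=2` mirrors the lab's Q1 = Feb–Apr calendar. JST is modeled as a fixed UTC+9 offset, which is safe because Japan does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), "JST")  # fixed offset, no daylight saving in Japan
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def fiscal_quarter(month, q1_start_month=2):
    """Return 'Q1'..'Q4' for a fiscal year whose first quarter starts in q1_start_month."""
    months_into_year = (month - q1_start_month) % 12  # 0 for the first fiscal month
    return f"Q{months_into_year // 3 + 1}"

# Timestamp of transaction T001, written with an explicit UTC offset
ts = datetime.fromisoformat("2024-03-01T15:00:00+00:00")

print(ts.astimezone(JST).isoformat())       # 2024-03-02T00:00:00+09:00
print(DAY_NAMES[ts.weekday()])              # Friday
print(fiscal_quarter(ts.month))             # Q1 under the Feb-Apr calendar
print(fiscal_quarter(2, q1_start_month=3))  # Q4 under a Mar-May calendar
```

With this shape, switching calendars is a one-argument change, and the same function can be described to the model in the AI prompt as "Q1 starts in month N".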


### Step 1: Time Series Transformations Using Python


In [30]:
import pytz
from datetime import datetime
from pandas.tseries.offsets import BDay

# Load transaction data
transactions_df = pd.read_csv('setup/transactions.csv')
print("Raw transaction data:")
display(transactions_df)

# Convert transaction_date to datetime with timezone awareness
transactions_df['transaction_date'] = pd.to_datetime(transactions_df['transaction_date'], utc=True)

# A: Plain date without time
transactions_df['date_only'] = transactions_df['transaction_date'].dt.date

# B-D: Timestamps converted to PST, EST, and GST respectively
transactions_df['timestamp_pst'] = transactions_df['transaction_date'].dt.tz_convert('US/Pacific')
transactions_df['timestamp_est'] = transactions_df['transaction_date'].dt.tz_convert('US/Eastern')
transactions_df['timestamp_gst'] = transactions_df['transaction_date'].dt.tz_convert('Asia/Dubai')

# E: Extracted month and year for reporting breakdowns
transactions_df['month'] = transactions_df['transaction_date'].dt.month
transactions_df['year'] = transactions_df['transaction_date'].dt.year

# F: Custom fiscal quarter based on internal calendar (Q1 = Feb–Apr)
def get_fiscal_quarter(date):
    """Calculate fiscal quarter where Q1 = Feb-Apr, Q2 = May-Jul, Q3 = Aug-Oct, Q4 = Nov-Jan"""
    month = date.month
    if month in [2, 3, 4]:
        return "Q1"
    elif month in [5, 6, 7]:
        return "Q2"
    elif month in [8, 9, 10]:
        return "Q3"
    else:  # Nov, Dec, Jan
        return "Q4"

transactions_df['fiscal_quarter'] = transactions_df['transaction_date'].apply(get_fiscal_quarter)

# G: Due date calculation using business days only
def calculate_due_date(row):
    """Calculate due date by adding NET terms as business days."""
    # Extract number from NET terms (e.g., NET30 -> 30)
    net_days = int(row['terms'].replace('NET', ''))
    # Add business days to transaction date
    due_date = row['transaction_date'] + BDay(net_days)
    return due_date

transactions_df['due_date'] = transactions_df.apply(calculate_due_date, axis=1)

# H: Percent contribution of each transaction to its account's total balance
account_totals = transactions_df.groupby('account')['amount_due'].sum()
transactions_df['account_total'] = transactions_df['account'].map(account_totals)
transactions_df['contribution_pct'] = (transactions_df['amount_due'] / transactions_df['account_total'] * 100).round(2)

print(f"\nTransformed transaction data using Python ({len(transactions_df)} records):")
display(transactions_df[['transaction_id', 'account', 'date_only', 'fiscal_quarter', 'due_date', 'contribution_pct']])


Raw transaction data:


Unnamed: 0,transaction_id,account,user_email,transaction_date,amount_due,terms
0,T001,innovate,casey@innovate.com,2024-03-01T15:00:00Z,200.0,NET30
1,T002,innovate,casey_brown@innovate.com,2024-03-15T18:30:00Z,150.0,NET60
2,T003,techcorp,jordan.smith@techcorp.io,2024-02-20T09:00:00Z,300.0,NET30
3,T004,techcorp,j.smith@techcorp.io,2024-03-05T10:00:00Z,250.0,NET90
4,T005,globaltech,morgan.wilson@globaltech.com,2024-04-01T20:00:00Z,400.0,NET30
5,T006,globaltech,morgan.wilson@globaltech.com,2024-04-15T21:00:00+09:00,350.0,NET60
6,T007,johnson,alex.johnson@gmail.com,2024-03-10T12:00:00Z,180.0,NET30
7,T008,johnson,alex.johnson@gmail.com,2024-03-25T13:00:00-05:00,220.0,NET60
8,T009,davis,taylor.davis@gmail.com,2024-04-05T08:00:00Z,275.0,NET30
9,T010,healthplus,riley.anderson@healthplus.org,2024-03-20T16:00:00-07:00,500.0,NET90



Transformed transaction data using Python (34 records):


Unnamed: 0,transaction_id,account,date_only,fiscal_quarter,due_date,contribution_pct
0,T001,innovate,2024-03-01,Q1,2024-04-12 15:00:00+00:00,57.14
1,T002,innovate,2024-03-15,Q1,2024-06-07 18:30:00+00:00,42.86
2,T003,techcorp,2024-02-20,Q1,2024-04-02 09:00:00+00:00,54.55
3,T004,techcorp,2024-03-05,Q1,2024-07-09 10:00:00+00:00,45.45
4,T005,globaltech,2024-04-01,Q1,2024-05-13 20:00:00+00:00,53.33
5,T006,globaltech,2024-04-15,Q1,2024-07-08 12:00:00+00:00,46.67
6,T007,johnson,2024-03-10,Q1,2024-04-19 12:00:00+00:00,45.0
7,T008,johnson,2024-03-25,Q1,2024-06-17 18:00:00+00:00,55.0
8,T009,davis,2024-04-05,Q1,2024-05-17 08:00:00+00:00,100.0
9,T010,healthplus,2024-03-20,Q1,2024-07-24 23:00:00+00:00,51.02
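
To see what the business-day arithmetic does, the following sketch reproduces the due-date step with the standard library only. It is an illustration, not a replacement for `BDay`: `parse_net_days` and `add_business_days` are hypothetical helpers, and holidays are ignored here.

```python
import re
from datetime import date, timedelta

def parse_net_days(terms):
    """Extract the day count from a term such as 'NET30'; raise on anything else."""
    match = re.fullmatch(r"NET(\d+)", terms.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized payment terms: {terms!r}")
    return int(match.group(1))

def add_business_days(start, days):
    """Step forward one calendar day at a time, counting only Monday-Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current

# T001: Friday 2024-03-01 with NET30 lands six business weeks later
print(add_business_days(date(2024, 3, 1), parse_net_days("NET30")))   # 2024-04-12
# T007: Sunday 2024-03-10 with NET30; counting starts on the following Monday
print(add_business_days(date(2024, 3, 10), parse_net_days("NET30")))  # 2024-04-19
```

Both results agree with the `due_date` column in the pandas output above, which makes this a handy reference when checking the AI results in Step 2.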


### Step 2: Time Series Transformations Using AI


In [31]:
# Define structure for time series transformations
class TransformedTransaction(BaseModel):
    transaction_id: str
    account: str
    user_email: str
    date_only: str
    timestamp_pst: str
    timestamp_est: str
    timestamp_gst: str
    month: int
    year: int
    fiscal_quarter: str
    due_date: str
    contribution_pct: float

# Load fresh transaction data for AI processing
transactions_raw = pd.read_csv('setup/transactions.csv')

# Calculate account totals for contribution percentage
account_totals_dict = transactions_raw.groupby('account')['amount_due'].sum().to_dict()

# System prompt for time series transformations
time_prompt = """
You are a data transformation assistant specializing in time series and financial data. You will receive one transaction at a time and must return a single object with the following transformations:

Input fields:
- transaction_date: ISO-8601 timestamp (may include timezone info)
- terms: Payment terms like 'NET30', 'NET60', 'NET90' 
- amount_due: Transaction amount
- account: Account name
- account_total: Total balance for this account (for percentage calculation)

Required transformations:
- date_only: Extract just the date (YYYY-MM-DD format)
- timestamp_pst: Convert to Pacific Standard Time
- timestamp_est: Convert to Eastern Standard Time  
- timestamp_gst: Convert to Gulf Standard Time (Asia/Dubai)
- month: Extract month number (1-12)
- year: Extract year (YYYY)
- fiscal_quarter: Calculate using custom fiscal calendar where Q1=Feb-Apr, Q2=May-Jul, Q3=Aug-Oct, Q4=Nov-Jan
- due_date: Add the NET term days as BUSINESS DAYS (not calendar days) to the transaction date
- contribution_pct: Calculate (amount_due / account_total) * 100, rounded to 2 decimal places

Return a JSON object matching the TransformedTransaction structure exactly.
"""

# Process each transaction with AI
ai_transformed_transactions = []
for _, row in tqdm(transactions_raw.iterrows(), total=len(transactions_raw), desc="AI Time Transformations"):
    # Prepare input with account total
    input_data = row.to_dict()
    input_data['account_total'] = account_totals_dict[row['account']]
    
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": time_prompt},
                {"role": "user", "content": str(input_data)}
            ],
            response_format=TransformedTransaction
        )
        transformed = completion.choices[0].message.parsed.dict()
        ai_transformed_transactions.append(transformed)
    except Exception as e:
        print(f"Error transforming transaction {row['transaction_id']}: {e}")
        continue

df_ai_transactions = pd.DataFrame(ai_transformed_transactions)
print(f"\nTransformed transaction data using AI ({len(df_ai_transactions)} records):")
display(df_ai_transactions)




Transformed transaction data using AI (34 records):


Unnamed: 0,transaction_id,account,user_email,date_only,timestamp_pst,timestamp_est,timestamp_gst,month,year,fiscal_quarter,due_date,contribution_pct
0,T001,innovate,casey@innovate.com,2024-03-01,2024-03-01T07:00:00-08:00,2024-03-01T10:00:00-05:00,2024-03-01T19:00:00+04:00,3,2024,Q1,2024-04-15,57.14
1,T002,innovate,casey_brown@innovate.com,2024-03-15,2024-03-15T10:30:00-08:00,2024-03-15T13:30:00-05:00,2024-03-15T22:30:00+04:00,3,2024,Q1,2024-06-10,42.86
2,T003,techcorp,jordan.smith@techcorp.io,2024-02-20,2024-02-20T01:00:00-08:00,2024-02-20T04:00:00-05:00,2024-02-20T13:00:00+04:00,2,2024,Q1,2024-04-02,54.55
3,T004,techcorp,j.smith@techcorp.io,2024-03-05,2024-03-05T02:00:00-08:00,2024-03-05T05:00:00-05:00,2024-03-05T14:00:00+04:00,3,2024,Q1,2024-07-09,45.45
4,T005,globaltech,morgan.wilson@globaltech.com,2024-04-01,2024-04-01T13:00:00-07:00,2024-04-01T16:00:00-04:00,2024-04-02T00:00:00+04:00,4,2024,Q1,2024-05-13,53.33
5,T006,globaltech,morgan.wilson@globaltech.com,2024-04-15,2024-04-15T05:00:00-07:00,2024-04-15T08:00:00-04:00,2024-04-15T17:00:00+04:00,4,2024,Q1,2024-07-09,46.67
6,T007,johnson,alex.johnson@gmail.com,2024-03-10,2024-03-10T04:00:00-08:00,2024-03-10T07:00:00-05:00,2024-03-10T16:00:00+04:00,3,2024,Q1,2024-04-19,45.0
7,T008,johnson,alex.johnson@gmail.com,2024-03-25,2024-03-25T11:00:00-07:00,2024-03-25T14:00:00-04:00,2024-03-25T22:00:00+04:00,3,2024,Q1,2024-06-17,55.0
8,T009,davis,taylor.davis@gmail.com,2024-04-05,2024-04-05T01:00:00-07:00,2024-04-05T04:00:00-04:00,2024-04-05T12:00:00+04:00,4,2024,Q1,2024-05-17,100.0
9,T010,healthplus,riley.anderson@healthplus.org,2024-03-20,2024-03-20T16:00:00-07:00,2024-03-20T19:00:00-04:00,2024-03-21T03:00:00+04:00,3,2024,Q1,2024-07-29,51.02
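
Since the lab's goal is for the Python and AI solutions to produce the same results, it helps to diff the two outputs rather than compare them by eye. In the ten rows shown, the AI `due_date` differs from the pandas result on several rows (for example T001: 2024-04-12 vs 2024-04-15). The following is a minimal sketch with values copied by hand from the two outputs above. `find_mismatches` is a hypothetical helper; in the notebook, the two dicts would be built from `transactions_df` and `df_ai_transactions`.

```python
# Due dates copied from the first ten rows of the Python and AI outputs
python_due = {
    "T001": "2024-04-12", "T002": "2024-06-07", "T003": "2024-04-02", "T004": "2024-07-09",
    "T005": "2024-05-13", "T006": "2024-07-08", "T007": "2024-04-19", "T008": "2024-06-17",
    "T009": "2024-05-17", "T010": "2024-07-24",
}
ai_due = {
    "T001": "2024-04-15", "T002": "2024-06-10", "T003": "2024-04-02", "T004": "2024-07-09",
    "T005": "2024-05-13", "T006": "2024-07-09", "T007": "2024-04-19", "T008": "2024-06-17",
    "T009": "2024-05-17", "T010": "2024-07-29",
}

def find_mismatches(expected, actual):
    """Return the sorted keys whose values differ (or are missing) in `actual`."""
    return sorted(key for key in expected if actual.get(key) != expected[key])

mismatches = find_mismatches(python_due, ai_due)
print(mismatches)
print(f"Agreement: {1 - len(mismatches) / len(python_due):.0%}")
```

Date arithmetic like this is a natural fit for a hybrid approach: let the model handle parsing and classification, and compute due dates in code.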


---

## 5. Extra Credit: Build a Golden Account List

**Scenario:** We movin' B2B, cher. Gotta figure out which companies to target.

**Goal:** Create a golden account list that aggregates users, spend, and business opportunity.

**Input:** Golden CRM + Transactions

**AI Task:** Match emails to companies, aggregate counts, and generate golden account records.

✅ **Try It Now:**
- Create a scoring logic (e.g., high-spend, many-users)
- Flag top 5 target accounts


In [32]:
# Define structure for golden account records
class GoldenAccount(BaseModel):
    account_name: str
    company_domain: str
    total_users: int
    total_spend: float
    avg_spend_per_user: float
    primary_location: str
    subscription_tiers: List[str]
    business_opportunity_score: float
    account_status: str  # "High Value", "Growth Potential", "Standard"
    key_contacts: List[str]

# System prompt for creating golden accounts
account_prompt = """
You are a B2B account analysis assistant. You will receive transaction data and CRM data for a specific account/company and need to create a comprehensive golden account record.

Your task:
1. Analyze all users associated with this account
2. Calculate total spend, user count, and average spend per user
3. Identify the primary business location (most common state/city)
4. List unique subscription tiers used by this account
5. Calculate a business opportunity score (0-100) based on:
   - Total spend (higher = better)
   - Number of users (more users = better scaling potential)
   - Subscription diversity (multiple tiers = growth potential)
   - Business email usage (corporate domains = B2B opportunity)
6. Assign account status: "High Value" (score 80+), "Growth Potential" (score 50-79), "Standard" (score <50)
7. Identify key contact emails (business emails preferred)

Consider business indicators:
- Corporate email domains (not gmail/yahoo/hotmail)
- Multiple users from same company
- Higher-tier subscriptions
- Consistent spending patterns

Return a JSON object matching the GoldenAccount structure.
"""

# Group data by account to create golden account records
# First, let's combine our data sources
account_data = {}

# Load original transaction data for spend calculations
transactions_original = pd.read_csv('setup/transactions.csv')

# Add transaction data with correct spend calculations
for _, row in transactions_original.iterrows():
    account = row['account']
    if account not in account_data:
        account_data[account] = {
            'transactions': [],
            'users': set(),
            'total_spend': 0
        }
    
    account_data[account]['transactions'].append(row.to_dict())
    account_data[account]['users'].add(row['user_email'])
    account_data[account]['total_spend'] += float(row['amount_due'])

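# Add CRM data, indexed by lowercased email domain so lookups are case-insensitive
# (golden records can carry mixed-case domains such as 'Startup.Co').
# Note: a later record overwrites an earlier one that shares the same domain.
crm_by_email = {}
for _, row in df_golden.iterrows():
    email_domain = row['email_domain']
    if pd.notna(email_domain):
        crm_by_email[str(email_domain).lower()] = row.to_dict()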

# Create golden account records
golden_accounts = []
for account_name, data in tqdm(account_data.items(), desc="Creating Golden Accounts"):
    # Prepare comprehensive account data
    account_input = {
        'account_name': account_name,
        'transaction_data': data['transactions'],
        'user_emails': list(data['users']),
        'total_spend': data['total_spend'],
        'user_count': len(data['users']),
        'crm_data': [crm_by_email.get(email.split('@')[-1].lower(), {}) for email in data['users']]
    }
    
    try:
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": account_prompt},
                {"role": "user", "content": str(account_input)}
            ],
            response_format=GoldenAccount
        )
        golden_account = completion.choices[0].message.parsed.dict()
        golden_accounts.append(golden_account)
    except Exception as e:
        print(f"Error creating golden account for {account_name}: {e}")
        continue

df_golden_accounts = pd.DataFrame(golden_accounts)

# Sort by business opportunity score
df_golden_accounts = df_golden_accounts.sort_values('business_opportunity_score', ascending=False)

print(f"\nGolden Account List ({len(df_golden_accounts)} accounts):")
display(df_golden_accounts)

# Identify top 5 target accounts
print(f"\n🎯 TOP 5 TARGET ACCOUNTS:")
top_5 = df_golden_accounts.head(5)
for i, (_, account) in enumerate(top_5.iterrows(), 1):
    print(f"{i}. {account['account_name']} - Score: {account['business_opportunity_score']:.1f}")
    print(f"   Users: {account['total_users']}, Spend: ${account['total_spend']:.2f}, Status: {account['account_status']}")
    print(f"   Domain: {account['company_domain']}, Location: {account['primary_location']}")
    print()




Golden Account List (20 accounts):


Unnamed: 0,account_name,company_domain,total_users,total_spend,avg_spend_per_user,primary_location,subscription_tiers,business_opportunity_score,account_status,key_contacts
7,medtech,medtech.com,2,830.0,415.0,"Nashville, TN",[PLATINUM1xINPERSON],90.0,High Value,"[sage.robinson@medtech.com, s.robinson@medtech..."
1,techcorp,techcorp.io,2,550.0,275.0,"Massachusetts, Boston",[PLATINUM1xINPERSON],83.0,High Value,"[jordan.smith@techcorp.io, j.smith@techcorp.io]"
6,dataflow,dataflow.net,2,1450.0,725.0,"Portland, OR",[ENTERPRISE3xHYBRID],75.0,Growth Potential,"[cameron.lee@dataflow.net, cam.lee@dataflow.net]"
0,innovate,innovate.com,2,350.0,175.0,"Los Angeles, CA",[SILVER1xONLINE],70.0,Growth Potential,[casey@innovate.com]
8,fintech,fintech.biz,1,1350.0,1350.0,Unknown,[],68.0,Growth Potential,[reese.walker@fintech.biz]
11,enterprise,enterprise.com,1,1750.0,1750.0,,[Tier 2],65.0,Growth Potential,[skyler.miller@enterprise.com]
9,startup,startup.co,2,600.0,300.0,,[],65.0,Growth Potential,"[e.hall@startup.co, emery.hall@startup.co]"
5,healthplus,healthplus.org,2,980.0,490.0,Not Specified,[Standard],65.0,Growth Potential,"[riley.anderson@healthplus.org, rowan.anderson..."
10,consulting,consulting.com,1,970.0,970.0,Not Provided,[],60.0,Growth Potential,[parker.white@consulting.com]
13,techstart,techstart.io,1,700.0,700.0,Unknown,[Standard],55.0,Growth Potential,[river.adams@techstart.io]



🎯 TOP 5 TARGET ACCOUNTS:
1. medtech - Score: 90.0
   Users: 2, Spend: $830.00, Status: High Value
   Domain: medtech.com, Location: Nashville, TN

2. techcorp - Score: 83.0
   Users: 2, Spend: $550.00, Status: High Value
   Domain: techcorp.io, Location: Massachusetts, Boston

3. dataflow - Score: 75.0
   Users: 2, Spend: $1450.00, Status: Growth Potential
   Domain: dataflow.net, Location: Portland, OR

4. innovate - Score: 70.0
   Users: 2, Spend: $350.00, Status: Growth Potential
   Domain: innovate.com, Location: Los Angeles, CA

5. fintech - Score: 68.0
   Users: 1, Spend: $1350.00, Status: Growth Potential
   Domain: fintech.biz, Location: Unknown
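
The model-assigned scores above are hard to audit: medtech ($830 spend) outranks dataflow ($1,450) and enterprise ($1,750) with no visible formula. For the scoring exercise, the following is a minimal sketch of a transparent alternative. The weights and caps are assumptions chosen for illustration, `opportunity_score` and `account_status` are hypothetical helpers, and the status thresholds are the ones stated in `account_prompt`. The input values are copied from the golden account table.

```python
# Illustrative weights and caps; these are assumptions, not values from the lab
WEIGHTS = {"spend": 0.6, "users": 0.4}
SPEND_CAP = 2000.0  # spend at or above this earns the full spend component
USER_CAP = 5        # user counts at or above this earn the full user component

def opportunity_score(total_spend, total_users):
    """Weighted 0-100 score from capped, normalized spend and user count."""
    spend_part = min(total_spend / SPEND_CAP, 1.0)
    users_part = min(total_users / USER_CAP, 1.0)
    return round(100 * (WEIGHTS["spend"] * spend_part + WEIGHTS["users"] * users_part), 1)

def account_status(score):
    """Map a score to the status labels defined in account_prompt."""
    if score >= 80:
        return "High Value"
    if score >= 50:
        return "Growth Potential"
    return "Standard"

accounts = [
    {"account_name": "medtech", "total_users": 2, "total_spend": 830.0},
    {"account_name": "techcorp", "total_users": 2, "total_spend": 550.0},
    {"account_name": "dataflow", "total_users": 2, "total_spend": 1450.0},
    {"account_name": "innovate", "total_users": 2, "total_spend": 350.0},
    {"account_name": "fintech", "total_users": 1, "total_spend": 1350.0},
    {"account_name": "enterprise", "total_users": 1, "total_spend": 1750.0},
]
for account in accounts:
    account["score"] = opportunity_score(account["total_spend"], account["total_users"])
    account["status"] = account_status(account["score"])

ranked = sorted(accounts, key=lambda a: a["score"], reverse=True)
top_5 = ranked[:5]
for rank, account in enumerate(top_5, 1):
    print(f"{rank}. {account['account_name']} - Score: {account['score']} ({account['status']})")
```

Under these particular weights the ranking changes noticeably from the model's, which is a useful prompt for the discussion question about validating opportunity scores.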



---

## 💬 Discussion Questions

Now that you've completed the full TheraGPT CRM cleanup pipeline, take a moment to reflect:

### **Technical Comparison**
* Which approach was easier to build and test for each section?
* Which was more flexible when you needed to change rules or add new fields?
* How did the AI approaches handle edge cases compared to the Python approaches?
* What was the performance difference between the two approaches?

### **Business Impact**
* How confident are you in the AI-generated entity resolution and golden records?
* Which business emails were flagged correctly? Any false positives/negatives?
* Do the top 5 target accounts make sense based on the data?
* How would you validate the business opportunity scores in a real scenario?

### **Scalability & Production**
* Could you imagine building a reusable framework from these AI patterns?
* What would you need to consider for production deployment?
* How would you handle error cases and data quality monitoring?
* What hybrid approaches (AI + Python) might work best?

### **Try It Now: Advanced Challenges**

1. **Add a new nested field** to `incoming_customers.json` (e.g., `insurance_info` or `emergency_contact`) and update both Python and AI extraction logic.

2. **Create a new subscription format** like `ENTERPRISE3xHYBRID` and update the regex parsing logic.

3. **Add a new timezone** (JST) and change the fiscal calendar (Q1 = Mar–May) in the time series transformations.

4. **Implement a custom scoring algorithm** for the golden accounts that weights different factors (spend, users, engagement, etc.).

5. **Build a data quality dashboard** that shows the confidence scores and match rates across all sections.

---

**🎉 Congratulations!** You've successfully built a comprehensive data transformation pipeline that combines traditional Python techniques with cutting-edge AI approaches. You've seen how AI can simplify complex data engineering tasks while maintaining the precision and control that traditional methods provide.
