# Import ACME Customer Data

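This notebook imports the ACME customer and RFM data from an SPSS .sav file, then builds a CHAID response model on the test-mailing customers and applies it to the remaining customers.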

In [2]:
# Import required libraries
import pandas as pd
import pyreadstat

# Set display options
pd.set_option('display.max_columns', None)

In [4]:
# Read the SPSS .sav data file
data_path = r'ACME customer and rfm data.sav'
df, meta = pyreadstat.read_sav(data_path)

print("Data loaded successfully!")
print(f"Shape: {df.shape} (rows, columns)")
print(f"\nColumns: {df.columns.tolist()}")

Data loaded successfully!
Shape: (30000, 12) (rows, columns)

Columns: ['customer_id', 'gender', 'email_address', 'postal_code', 'monetary_value_01_01_2011', 'frequency_01_01_2011', 'recency_01_01_2011', 'has_received_test_mailing', 'response_to_test_mailing_02_01_2011', 'orderdate', 'number_of_days_between_mailing_and_orderdate', 'ordered_within_month']


In [None]:
# Display first few rows
df.head(10)

In [9]:
# Data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 12 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   customer_id                                   30000 non-null  float64
 1   gender                                        30000 non-null  object 
 2   email_address                                 30000 non-null  object 
 3   postal_code                                   30000 non-null  object 
 4   monetary_value_01_01_2011                     30000 non-null  object 
 5   frequency_01_01_2011                          30000 non-null  object 
 6   recency_01_01_2011                            30000 non-null  object 
 7   has_received_test_mailing                     30000 non-null  object 
 8   response_to_test_mailing_02_01_2011           30000 non-null  object 
 9   orderdate                                     361 non-null   
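
The `df.info()` output above lists `customer_id` as `float64`, which is usually how numeric SPSS fields arrive through `pyreadstat`. Below is a minimal sketch of casting such a column to an integer dtype. It uses a small hypothetical frame (`sample`) in place of the real data, and checks that every value is whole before casting.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data frame
sample = pd.DataFrame({"customer_id": [723.0, 724.0, 725.0]})

# Only cast when every value is a whole number, so nothing is silently truncated
is_whole = (sample["customer_id"] % 1 == 0).all()
if is_whole:
    sample["customer_id"] = sample["customer_id"].astype("int64")

print(sample["customer_id"].tolist())
```

The same check-then-cast pattern could be applied to `df['customer_id']` after loading, provided the column has no missing values.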

## 1. Data Overview

Answer the following questions:
- How many records are in the training dataset?
- How many fields are in the training dataset?

In [10]:
# Data Overview - Answer Questions
print("=" * 80)
print("DATA OVERVIEW - TRAINING DATASET")
print("=" * 80)

# Number of records (rows)
num_records = df.shape[0]
print(f"\n✓ Number of records in the training dataset: {num_records:,}")

# Number of fields (columns)
num_fields = df.shape[1]
print(f"✓ Number of fields in the training dataset: {num_fields}")

print("\n" + "-" * 80)
print("Field Names:")
print("-" * 80)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n" + "=" * 80)

DATA OVERVIEW - TRAINING DATASET

✓ Number of records in the training dataset: 30,000
✓ Number of fields in the training dataset: 12

--------------------------------------------------------------------------------
Field Names:
--------------------------------------------------------------------------------
 1. customer_id
 2. gender
 3. email_address
 4. postal_code
 5. monetary_value_01_01_2011
 6. frequency_01_01_2011
 7. recency_01_01_2011
 8. has_received_test_mailing
 9. response_to_test_mailing_02_01_2011
10. orderdate
11. number_of_days_between_mailing_and_orderdate
12. ordered_within_month



In [11]:
# Data Summary Table (similar to SPSS Table node output)
print("=" * 80)
print("DATA SUMMARY TABLE")
print("=" * 80)

summary_data = {
    'Field Name': df.columns.tolist(),
    'Data Type': [str(dtype) for dtype in df.dtypes],
    'Non-Null Count': [df[col].count() for col in df.columns],
    'Null Count': [df[col].isnull().sum() for col in df.columns],
    'Unique Values': [df[col].nunique() for col in df.columns]
}

summary_df = pd.DataFrame(summary_data)
summary_df.index = range(1, len(summary_df) + 1)

print(summary_df.to_string())
print("\n" + "=" * 80)

DATA SUMMARY TABLE
                                      Field Name Data Type  Non-Null Count  Null Count  Unique Values
1                                    customer_id   float64           30000           0          30000
2                                         gender    object           30000           0              2
3                                  email_address    object           30000           0          30000
4                                    postal_code    object           30000           0           4983
5                      monetary_value_01_01_2011    object           30000           0              3
6                           frequency_01_01_2011    object           30000           0              3
7                             recency_01_01_2011    object           30000           0              3
8                      has_received_test_mailing    object           30000           0              2
9            response_to_test_mailing_02_01_2011    object     

## 2. Test Mailing Customers

Select only customers who were in the test mailing.
- How many customers were included in the test mailing?

In [12]:
# Select customers who were in the test mailing
# Filter where has_received_test_mailing = 'yes'
test_mailing_customers = df[df['has_received_test_mailing'] == 'yes']

print("=" * 80)
print("TEST MAILING CUSTOMERS SELECTION")
print("=" * 80)

# Count customers in test mailing
num_test_mailing = len(test_mailing_customers)
num_no_mailing = len(df[df['has_received_test_mailing'] == 'no'])

print(f"\n✓ Number of customers included in the test mailing: {num_test_mailing:,}")
print(f"  Number of customers NOT in the test mailing: {num_no_mailing:,}")
print(f"  Total customers: {len(df):,}")
print(f"\n  Percentage in test mailing: {(num_test_mailing/len(df)*100):.2f}%")

print("\n" + "=" * 80)

TEST MAILING CUSTOMERS SELECTION

✓ Number of customers included in the test mailing: 10,000
  Number of customers NOT in the test mailing: 20,000
  Total customers: 30,000

  Percentage in test mailing: 33.33%



In [None]:
# Display sample of test mailing customers
print("Sample of customers in test mailing (first 10 records):")
print("\n" + test_mailing_customers.head(10).to_string())

print("\n" + "=" * 80)
print(f"Test mailing dataset shape: {test_mailing_customers.shape}")

In [13]:
# Check the unique values in has_received_test_mailing field
print("Checking 'has_received_test_mailing' field:")
print(f"Data type: {df['has_received_test_mailing'].dtype}")
print(f"\nUnique values:")
print(df['has_received_test_mailing'].unique())
print(f"\nValue counts:")
print(df['has_received_test_mailing'].value_counts())

Checking 'has_received_test_mailing' field:
Data type: object

Unique values:
['yes' 'no']

Value counts:
has_received_test_mailing
no     20000
yes    10000
Name: count, dtype: int64
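
The share of customers in the test mailing can also be read straight from the value counts. The sketch below rebuilds a stand-in series (`flag`) with the same counts as the output above and uses `value_counts(normalize=True)` to get proportions instead of raw counts.

```python
import pandas as pd

# Stand-in series with the same counts as the value_counts output above
flag = pd.Series(["yes"] * 10_000 + ["no"] * 20_000, name="has_received_test_mailing")

# normalize=True returns proportions that sum to 1
share = flag.value_counts(normalize=True)

print(f"Share in test mailing: {share['yes']:.2%}")
```

On the real data the equivalent call would be `df['has_received_test_mailing'].value_counts(normalize=True)`.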


## 3. Predictive Modeling - CHAID Decision Tree

Build a CHAID decision tree model to predict response using:
- gender
- recency_01_01_2011
- frequency_01_01_2011
- monetary_value_01_01_2011

Questions:
- Which field is used as the first split?
- Which group shows the highest response rate? What is the probability of responding for this group?

In [14]:
# Install CHAID library if needed
%pip install CHAID --quiet

Note: you may need to restart the kernel to use updated packages.





In [6]:
# Prepare data for CHAID modeling
# Use only test mailing customers and select relevant features
import numpy as np

# Reselect test mailing customers to ensure we have the data
test_mailing_customers = df[df['has_received_test_mailing'] == 'yes'].copy()
print(f"Test mailing customers: {len(test_mailing_customers):,}")

# Work with test mailing customers
model_data = test_mailing_customers.copy()

# Check the data types and values
print("\nChecking predictor variables:")
print(f"gender unique values: {model_data['gender'].unique()}")
print(f"recency_01_01_2011 unique values: {model_data['recency_01_01_2011'].unique()}")
print(f"frequency_01_01_2011 unique values: {model_data['frequency_01_01_2011'].unique()}")
print(f"monetary_value_01_01_2011 unique values: {model_data['monetary_value_01_01_2011'].unique()}")
print(f"\nTarget variable (response_to_test_mailing_02_01_2011):")
print(model_data['response_to_test_mailing_02_01_2011'].value_counts())

Test mailing customers: 10,000

Checking predictor variables:
gender unique values: ['male' 'female']
recency_01_01_2011 unique values: ['2 medium' '1 low' '3 high']
frequency_01_01_2011 unique values: ['3 high' '1 low' '2 medium']
monetary_value_01_01_2011 unique values: ['2 medium' '1 low' '3 high']

Target variable (response_to_test_mailing_02_01_2011):
response_to_test_mailing_02_01_2011
F    9639
T     361
Name: count, dtype: int64


In [17]:
# Prepare features and target for CHAID
from CHAID import Tree

# Select features
features = ['gender', 'recency_01_01_2011', 'frequency_01_01_2011', 'monetary_value_01_01_2011']
X = model_data[features]
y = model_data['response_to_test_mailing_02_01_2011']

# Convert to appropriate format
# CHAID expects categorical data
X_chaid = X.copy()
for col in features:
    X_chaid[col] = X_chaid[col].astype('category')

y_chaid = y.astype('category')

print("=" * 80)
print("BUILDING CHAID DECISION TREE MODEL")
print("=" * 80)
print(f"\nFeatures used: {features}")
print(f"Target variable: response_to_test_mailing_02_01_2011")
print(f"Training samples: {len(X_chaid):,}")
print(f"Response rate: {(y == 'T').sum() / len(y) * 100:.2f}%")

BUILDING CHAID DECISION TREE MODEL

Features used: ['gender', 'recency_01_01_2011', 'frequency_01_01_2011', 'monetary_value_01_01_2011']
Target variable: response_to_test_mailing_02_01_2011
Training samples: 10,000
Response rate: 3.61%


In [25]:
# Build CHAID tree with more lenient parameters to allow proper splits
# Combine features and target into one dataframe
chaid_df = X_chaid.copy()
chaid_df['response'] = y_chaid

# Define variable types (all are nominal/categorical)
var_types = {
    'gender': 'nominal',
    'recency_01_01_2011': 'nominal',
    'frequency_01_01_2011': 'nominal',
    'monetary_value_01_01_2011': 'nominal'
}

# Build tree with more lenient parameters
tree = Tree.from_pandas_df(
    chaid_df,
    var_types,
    'response',
    dep_variable_type='categorical',
    max_depth=5,
    min_parent_node_size=30,  # Reduced from 100
    min_child_node_size=10,    # Reduced from 30
    alpha_merge=0.05           # Standard significance level
)

print("\n" + "=" * 80)
print("CHAID TREE BUILT SUCCESSFULLY")
print("=" * 80)
print(f"Max depth: 5")
print(f"Min parent node size: 30")
print(f"Min child node size: 10")


CHAID TREE BUILT SUCCESSFULLY
Max depth: 5
Min parent node size: 30
Min child node size: 10
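
All four predictors are declared `nominal` here, which allows CHAID to merge any pair of categories, including non-adjacent bands. The RFM bands are ordered (`1 low`, `2 medium`, `3 high`), so one alternative would be to encode them as integers and declare them with the package's `ordinal` variable type. The sketch below shows only the encoding step on a hypothetical sample (`bands`, `band_rank`); whether an ordinal tree is preferable is not tested in this notebook.

```python
import pandas as pd

# Hypothetical sample of one RFM band column
bands = pd.Series(["2 medium", "1 low", "3 high", "1 low"], name="recency_01_01_2011")

# Map each label to its rank; a label missing from the dict would become NaN
band_rank = {"1 low": 1, "2 medium": 2, "3 high": 3}
recency_rank = bands.map(band_rank)

print(recency_rank.tolist())
```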


In [26]:
# Print the tree structure
# print_tree() writes the tree to stdout itself, so it is not wrapped in print()
print("\nCHAID DECISION TREE STRUCTURE:")
print("=" * 80)
tree.print_tree()

print("\n" + "=" * 80)


CHAID DECISION TREE STRUCTURE:
([], {'F': np.float64(9639.0), 'T': np.float64(361.0)}, (monetary_value_01_01_2011, p=2.0894701847278354e-93, score=426.8070062294727, groups=[['1 low'], ['2 medium'], ['3 high']]), dof=2))
|-- (['1 low'], {'F': np.float64(3340.0), 'T': np.float64(14.0)}, (recency_01_01_2011, p=0.015127117941917219, score=5.901602220910049, groups=[['1 low'], ['2 medium', '3 high']]), dof=1))
|   |-- (['1 low'], {'F': np.float64(2043.0), 'T': np.float64(13.0)}, <Invalid Chaid Split> - the node only contains single category respondents)
|   +-- (['2 medium', '3 high'], {'F': np.float64(1297.0), 'T': np.float64(1.0)}, (frequency_01_01_2011, p=0.7723043062171515, score=0.5167532551754949, groups=[['1 low'], ['2 medium'], ['3 high']]), dof=2))
|       |-- (['1 low'], {'F': np.float64(20.0), 'T': 0}, <Invalid Chaid Split> - the minimum parent node size threshold has been reached)
|       |-- (['2 medium'], {'F': np.float64(422.0), 'T': 0}, <Invalid Chaid Split> - the node onl
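
The `score` and `p` reported for the root split can be checked by hand. The sketch below computes a Pearson chi-square statistic for `monetary_value_01_01_2011` against the response. The per-band counts are obtained by summing the leaf counts listed in this notebook; the variable names are illustrative. With 2 degrees of freedom, the chi-square survival function reduces to `exp(-x / 2)`, so no statistics library is needed.

```python
import math

# (non-responders F, responders T) per monetary band, summed from the leaf counts
observed = {
    "1 low": (3340, 14),
    "2 medium": (3194, 42),
    "3 high": (3105, 305),
}

n_total = sum(f + t for f, t in observed.values())
col_totals = (
    sum(f for f, _ in observed.values()),  # all non-responders
    sum(t for _, t in observed.values()),  # all responders
)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi2 = 0.0
for f, t in observed.values():
    row_total = f + t
    for obs, col_total in zip((f, t), col_totals):
        expected = row_total * col_total / n_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (2 - 1)
# Valid only for dof == 2: survival function of chi-square is exp(-x / 2)
p_value = math.exp(-chi2 / 2)

print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3e}")
```

The result should land close to the `score`, `dof` and `p` printed for the root node, which suggests the package reports an unadjusted chi-square test for this split.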

In [27]:
# Manually extract leaf node information from the tree
# Based on the tree structure output

print("=" * 80)
print("CHAID MODEL ANALYSIS - DETAILED ANSWERS")
print("=" * 80)

# Root split
print(f"\n✓ FIRST SPLIT VARIABLE: monetary_value_01_01_2011")

# Manually identified leaf nodes from the tree output
leaf_nodes_data = [
    {
        'path': 'monetary_value=1 low -> recency=1 low',
        'F': 2043, 'T': 13, 'total': 2056,
        'response_rate': (13/2056)*100
    },
    {
        'path': 'monetary_value=1 low -> recency=2 medium,3 high -> frequency=1 low',
        'F': 20, 'T': 0, 'total': 20,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=1 low -> recency=2 medium,3 high -> frequency=2 medium',
        'F': 422, 'T': 0, 'total': 422,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=1 low -> recency=2 medium,3 high -> frequency=3 high',
        'F': 855, 'T': 1, 'total': 856,
        'response_rate': (1/856)*100
    },
    {
        'path': 'monetary_value=2 medium -> frequency=1 low',
        'F': 731, 'T': 0, 'total': 731,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=2 medium -> frequency=2 medium',
        'F': 1589, 'T': 0, 'total': 1589,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=2 medium -> frequency=3 high -> recency=1 low,3 high',
        'F': 780, 'T': 42, 'total': 822,
        'response_rate': (42/822)*100
    },
    {
        'path': 'monetary_value=2 medium -> frequency=3 high -> recency=2 medium',
        'F': 94, 'T': 0, 'total': 94,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=3 high -> frequency=1 low -> recency=1 low,3 high',
        'F': 1354, 'T': 66, 'total': 1420,
        'response_rate': (66/1420)*100
    },
    {
        'path': 'monetary_value=3 high -> frequency=1 low -> recency=2 medium',
        'F': 754, 'T': 0, 'total': 754,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=3 high -> frequency=2 medium,3 high -> recency=1 low',
        'F': 800, 'T': 77, 'total': 877,
        'response_rate': (77/877)*100
    },
    {
        'path': 'monetary_value=3 high -> frequency=2 medium,3 high -> recency=2 medium',
        'F': 100, 'T': 0, 'total': 100,
        'response_rate': 0
    },
    {
        'path': 'monetary_value=3 high -> frequency=2 medium,3 high -> recency=3 high -> frequency=2 medium',
        'F': 97, 'T': 128, 'total': 225,
        'response_rate': (128/225)*100
    },
    {
        'path': 'monetary_value=3 high -> frequency=2 medium,3 high -> recency=3 high -> frequency=3 high',
        'F': 0, 'T': 34, 'total': 34,
        'response_rate': (34/34)*100
    }
]

# Sort by response rate
leaf_nodes_sorted = sorted(leaf_nodes_data, key=lambda x: x['response_rate'], reverse=True)

print(f"\n✓ NUMBER OF LEAF NODES: {len(leaf_nodes_sorted)}")
print("\nTOP 5 LEAF NODES (by response rate):")
print("-" * 80)

for i, node in enumerate(leaf_nodes_sorted[:5], 1):
    print(f"\nNode {i}:")
    print(f"  Path: {node['path']}")
    print(f"  Size: {node['total']:,} customers")
    print(f"  Responders (T): {int(node['T'])}")
    print(f"  Non-Responders (F): {int(node['F'])}")
    print(f"  Response Rate: {node['response_rate']:.2f}%")

# Best node
best = leaf_nodes_sorted[0]

print("\n" + "=" * 80)
print("FINAL ANSWERS:")
print("=" * 80)
print(f"\n1. Field used as FIRST SPLIT: monetary_value_01_01_2011")
print(f"\n2. Group with HIGHEST RESPONSE RATE:")
print(f"   Path: {best['path']}")
print(f"   Size: {best['total']:,} customers")
print(f"   Responders: {int(best['T'])} out of {best['total']}")
print(f"\n3. PROBABILITY OF RESPONDING for this group: {best['response_rate']:.2f}%")
print(f"   (or {best['response_rate']/100:.4f} as probability)")
print("=" * 80)

CHAID MODEL ANALYSIS - DETAILED ANSWERS

✓ FIRST SPLIT VARIABLE: monetary_value_01_01_2011

✓ NUMBER OF LEAF NODES: 14

TOP 5 LEAF NODES (by response rate):
--------------------------------------------------------------------------------

Node 1:
  Path: monetary_value=3 high -> frequency=2 medium,3 high -> recency=3 high -> frequency=3 high
  Size: 34 customers
  Responders (T): 34
  Non-Responders (F): 0
  Response Rate: 100.00%

Node 2:
  Path: monetary_value=3 high -> frequency=2 medium,3 high -> recency=3 high -> frequency=2 medium
  Size: 225 customers
  Responders (T): 128
  Non-Responders (F): 97
  Response Rate: 56.89%

Node 3:
  Path: monetary_value=3 high -> frequency=2 medium,3 high -> recency=1 low
  Size: 877 customers
  Responders (T): 77
  Non-Responders (F): 800
  Response Rate: 8.78%

Node 4:
  Path: monetary_value=2 medium -> frequency=3 high -> recency=1 low,3 high
  Size: 822 customers
  Responders (T): 42
  Non-Responders (F): 780
  Response Rate: 5.11%

Node 5:
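
The top leaf reports a 100% response rate, but it rests on only 34 customers. One way to express that uncertainty is a Wilson score interval for the leaf's proportion. The sketch below is an illustrative helper (`wilson_interval` is not part of the CHAID package) applied to the two highest-ranked leaves.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Leaf with 34 responders out of 34 customers
low, high = wilson_interval(34, 34)
print(f"34/34 responders: {low:.3f} to {high:.3f}")

# Leaf with 128 responders out of 225 customers
low2, high2 = wilson_interval(128, 225)
print(f"128/225 responders: {low2:.3f} to {high2:.3f}")
```

For the 34-customer leaf the lower bound comes out at roughly 0.90, so the probability for this group is better read as "very high" than as a guaranteed response.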
 

In [None]:
# Visualize the tree (requires the optional graphviz dependencies)
try:
    # render() draws the fitted tree; this can fail if graphviz is not installed
    tree.render()
    print("Tree visualization generated!")
except Exception as e:
    print(f"Tree visualization not available: {e}")
    print("\nTree structure is shown in the text output above.")

## 4. Model Output - New Fields Added by the Model

Apply the model to the data and identify the new fields added by the CHAID model.

In [7]:
# Apply the CHAID model to generate predictions
# Based on the tree rules, assign predictions and probabilities to each customer

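def get_prediction_and_probability(row):
    """Apply the CHAID tree rules to one row.

    Returns (predicted_class, prob_F, prob_T). Leaf probabilities are hard-coded
    from the printed tree; gender is unused because no leaf path splits on it.
    Any unexpected band label falls through to the nearest else branch.
    """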

    monetary = row['monetary_value_01_01_2011']
    frequency = row['frequency_01_01_2011']
    recency = row['recency_01_01_2011']

    # Apply the tree splits based on the CHAID output
    # Root split: monetary_value

    if monetary == '1 low':
        if recency == '1 low':
            # Node: 2043 F, 13 T
            prob_T = 13 / (2043 + 13)
        else:  # recency in ['2 medium', '3 high']
            if frequency == '1 low':
                prob_T = 0 / 20  # 20 F, 0 T
            elif frequency == '2 medium':
                prob_T = 0 / 422  # 422 F, 0 T
            else:  # frequency == '3 high'
                prob_T = 1 / (855 + 1)  # 855 F, 1 T

    elif monetary == '2 medium':
        if frequency == '1 low':
            prob_T = 0 / 731  # 731 F, 0 T
        elif frequency == '2 medium':
            prob_T = 0 / 1589  # 1589 F, 0 T
        else:  # frequency == '3 high'
            if recency in ['1 low', '3 high']:
                prob_T = 42 / (780 + 42)  # 780 F, 42 T
            else:  # recency == '2 medium'
                prob_T = 0 / 94  # 94 F, 0 T

    else:  # monetary == '3 high'
        if frequency == '1 low':
            if recency in ['1 low', '3 high']:
                prob_T = 66 / (1354 + 66)  # 1354 F, 66 T
            else:  # recency == '2 medium'
                prob_T = 0 / 754  # 754 F, 0 T
        else:  # frequency in ['2 medium', '3 high']
            if recency == '1 low':
                prob_T = 77 / (800 + 77)  # 800 F, 77 T
            elif recency == '2 medium':
                prob_T = 0 / 100  # 100 F, 0 T
            else:  # recency == '3 high'
                if frequency == '2 medium':
                    prob_T = 128 / (97 + 128)  # 97 F, 128 T
                else:  # frequency == '3 high'
                    prob_T = 34 / 34  # 0 F, 34 T

    prob_F = 1 - prob_T
    predicted_class = 'T' if prob_T > 0.5 else 'F'

    return predicted_class, prob_F, prob_T

# Apply predictions
results = model_data.apply(get_prediction_and_probability, axis=1, result_type='expand')
results.columns = ['$C-response', '$CC-response-F', '$CC-response-T']

# Create output dataframe
output_df = model_data[['customer_id', 'gender', 'recency_01_01_2011',
                        'frequency_01_01_2011', 'monetary_value_01_01_2011',
                        'response_to_test_mailing_02_01_2011']].copy()
output_df = pd.concat([output_df, results], axis=1)

print("=" * 80)
print("MODEL OUTPUT - NEW FIELDS ADDED")
print("=" * 80)

print("\nOriginal dataset columns (sample):")
original_cols = ['customer_id', 'gender', 'recency_01_01_2011', 'frequency_01_01_2011',
                 'monetary_value_01_01_2011', 'response_to_test_mailing_02_01_2011']
print(original_cols)

print("\n" + "-" * 80)
print("✓ NEW COLUMNS AFTER APPLYING CHAID MODEL:")
print("-" * 80)
new_columns = ['$C-response', '$CC-response-F', '$CC-response-T']
for i, col in enumerate(new_columns, 1):
    print(f"{i}. {col}")

print("\n" + "=" * 80)

MODEL OUTPUT - NEW FIELDS ADDED

Original dataset columns (sample):
['customer_id', 'gender', 'recency_01_01_2011', 'frequency_01_01_2011', 'monetary_value_01_01_2011', 'response_to_test_mailing_02_01_2011']

--------------------------------------------------------------------------------
✓ NEW COLUMNS AFTER APPLYING CHAID MODEL:
--------------------------------------------------------------------------------
1. $C-response
2. $CC-response-F
3. $CC-response-T
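
Because `get_prediction_and_probability` hard-codes counts transcribed by hand from the printed tree, a transcription slip would silently skew every score. The sketch below is a small consistency check: the leaf counts should add up to the root node's 9,639 non-responders and 361 responders. The `leaves` list is copied from the rule function; the variable names are illustrative.

```python
# (F, T) counts for the 14 leaves, copied from the rule function above
leaves = [
    (2043, 13), (20, 0), (422, 0), (855, 1),      # monetary = 1 low
    (731, 0), (1589, 0), (780, 42), (94, 0),      # monetary = 2 medium
    (1354, 66), (754, 0), (800, 77), (100, 0),    # monetary = 3 high
    (97, 128), (0, 34),                           # monetary = 3 high, recency = 3 high
]

total_f = sum(f for f, _ in leaves)
total_t = sum(t for _, t in leaves)

# Customers sitting in leaves whose response rate exceeds the 0.5 cut-off
predicted_t = sum(f + t for f, t in leaves if t / (f + t) > 0.5)

print(total_f, total_t, predicted_t)
```

Only the last two leaves clear the 0.5 cut-off, so `predicted_t` is the number of training customers the rule function will label `T`.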



In [31]:
# Display sample of the output with new fields
print("Sample of model output (first 20 records):")
print("=" * 80)
display_cols = ['customer_id', 'response_to_test_mailing_02_01_2011', '$C-response'] + \
               [col for col in output_df.columns if col.startswith('$CC-response')]

print(output_df[display_cols].head(20).to_string(index=False))
print("\n" + "=" * 80)

Sample of model output (first 20 records):
 customer_id response_to_test_mailing_02_01_2011 $C-response  $CC-response-F  $CC-response-T
       723.0                                   F           F        1.000000        0.000000
       724.0                                   F           F        0.993677        0.006323
       725.0                                   F           F        0.953521        0.046479
       726.0                                   F           F        0.953521        0.046479
       727.0                                   F           F        0.993677        0.006323
       728.0                                   T           F        0.948905        0.051095
       729.0                                   F           F        0.993677        0.006323
       730.0                                   F           F        0.948905        0.051095
       731.0                                   F           F        0.912201        0.087799
       732.0               

In [32]:
# Explain what the new fields represent
print("=" * 80)
print("EXPLANATION OF NEW FIELDS")
print("=" * 80)

print("\n✓ TWO NEW FIELDS ADDED BY THE MODEL:")
print("-" * 80)

print("\n1. $C-response")
print("   - Type: Predicted Class")
print("   - Description: This field contains the PREDICTED response class for each customer")
print("   - Values: 'T' (will respond) or 'F' (will not respond)")
print("   - This is the model's best guess for whether the customer will respond")

print("\n2. $CC-response-[class]")
print("   - Type: Confidence/Probability scores")
print("   - Description: These fields contain the PROBABILITY/CONFIDENCE for each class")
print("   - In this case, two fields are created:")
print("     • $CC-response-F: Probability that customer will NOT respond")
print("     • $CC-response-T: Probability that customer WILL respond")
print("   - Values: Range from 0.0 to 1.0 (or 0% to 100%)")
print("   - These probabilities sum to 1.0 (100%) for each customer")
print("   - Higher probability indicates higher confidence in that prediction")

print("\n" + "-" * 80)
print("NAMING CONVENTION (IBM SPSS Modeler standard):")
print("-" * 80)
print("• $C- prefix: Predicted CLASS/Category")
print("• $CC- prefix: Confidence/probability for each CLASS")
print("• The $C-/$CC- names used here mirror SPSS Modeler's C5.0 output;")
print("  a CHAID model nugget in SPSS Modeler names these fields $R- and $RC- instead")

print("\n" + "=" * 80)

EXPLANATION OF NEW FIELDS

✓ TWO NEW FIELDS ADDED BY THE MODEL:
--------------------------------------------------------------------------------

1. $C-response
   - Type: Predicted Class
   - Description: This field contains the PREDICTED response class for each customer
   - Values: 'T' (will respond) or 'F' (will not respond)
   - This is the model's best guess for whether the customer will respond

2. $CC-response-[class]
   - Type: Confidence/Probability scores
   - Description: These fields contain the PROBABILITY/CONFIDENCE for each class
   - In this case, two fields are created:
     • $CC-response-F: Probability that customer will NOT respond
     • $CC-response-T: Probability that customer WILL respond
   - Values: Range from 0.0 to 1.0 (or 0% to 100%)
   - These probabilities sum to 1.0 (100%) for each customer
   - Higher probability indicates higher confidence in that prediction

--------------------------------------------------------------------------------
NAMING CONVE

In [33]:
# Show statistics of the predictions
print("=" * 80)
print("MODEL PREDICTION STATISTICS")
print("=" * 80)

print("\nPredicted Response Distribution:")
print(output_df['$C-response'].value_counts())
print(f"\nPredicted Response Rate: {(output_df['$C-response'] == 'T').sum() / len(output_df) * 100:.2f}%")

print("\n" + "-" * 80)
print("Probability Distribution for RESPONSE (T):")
print("-" * 80)
print(output_df['$CC-response-T'].describe())

print("\n" + "-" * 80)
print("Comparison with Actual Response:")
print("-" * 80)
comparison = pd.crosstab(
    output_df['response_to_test_mailing_02_01_2011'],
    output_df['$C-response'],
    margins=True
)
print(comparison)

# Calculate accuracy
correct = (output_df['response_to_test_mailing_02_01_2011'] == output_df['$C-response']).sum()
accuracy = (correct / len(output_df)) * 100
print(f"\nModel Accuracy: {accuracy:.2f}%")

print("\n" + "=" * 80)

MODEL PREDICTION STATISTICS

Predicted Response Distribution:
$C-response
F    9741
T     259
Name: count, dtype: int64

Predicted Response Rate: 2.59%

--------------------------------------------------------------------------------
Probability Distribution for RESPONSE (T):
--------------------------------------------------------------------------------
count    10000.000000
mean         0.036100
std          0.102885
min          0.000000
25%          0.000000
50%          0.006323
75%          0.046479
max          1.000000
Name: $CC-response-T, dtype: float64

--------------------------------------------------------------------------------
Comparison with Actual Response:
--------------------------------------------------------------------------------
$C-response                             F    T    All
response_to_test_mailing_02_01_2011                  
F                                    9542   97   9639
T                                     199  162    361
All              
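
With a 3.61% base rate, a model that always predicts `F` would already be 96.39% accurate, so accuracy alone says little about how well responders are found. The sketch below derives precision and recall for the `T` class from the crosstab cells above; the variable names are illustrative.

```python
# Confusion-matrix cells copied from the crosstab above
tn, fp = 9542, 97    # actual F: predicted F, predicted T
fn, tp = 199, 162    # actual T: predicted F, predicted T

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
baseline_accuracy = (tn + fp) / total  # always predicting 'F'
precision = tp / (tp + fp)             # share of predicted T that are real responders
recall = tp / (tp + fn)                # share of real responders that are predicted T

print(f"accuracy={accuracy:.4f} baseline={baseline_accuracy:.4f}")
print(f"precision={precision:.4f} recall={recall:.4f}")
```

Read this way, the model finds fewer than half of the actual responders at the 0.5 cut-off, even though its accuracy is above 97%.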

In [34]:
# Final Summary - Answer to Question 4
print("=" * 80)
print("QUESTION 4 - FINAL ANSWERS")
print("=" * 80)

print("\n✓ THE TWO NEW FIELD TYPES ADDED BY THE MODEL:")
print("-" * 80)

print("\n1. PREDICTED CLASS FIELD: $C-response")
print("   • Contains the predicted response for each customer ('T' or 'F')")
print("   • Based on the CHAID tree rules applied to customer characteristics")
print("   • The model predicts 'T' when probability > 0.5, otherwise 'F'")

print("\n2. CONFIDENCE/PROBABILITY FIELDS: $CC-response-[class]")
print("   • Two fields created:")
print("     - $CC-response-F: Probability customer will NOT respond (0.0 to 1.0)")
print("     - $CC-response-T: Probability customer WILL respond (0.0 to 1.0)")
print("   • These probabilities are calculated from the leaf node statistics")
print("   • They always sum to 1.0 (100%) for each customer")
print("   • Higher probability = higher confidence in that prediction")

print("\n" + "-" * 80)
print("WHAT THESE FIELDS REPRESENT:")
print("-" * 80)
print("• $C-response: The MODEL'S PREDICTION of whether customer will respond")
print("• $CC-response-T: The MODEL'S CONFIDENCE that customer will respond")
print("• $CC-response-F: The MODEL'S CONFIDENCE that customer will NOT respond")

print("\nThese fields allow:")
print("  1. Scoring/classifying new customers")
print("  2. Ranking customers by response probability")
print("  3. Targeting high-probability responders for future campaigns")

print("\n" + "=" * 80)

QUESTION 4 - FINAL ANSWERS

✓ THE TWO NEW FIELD TYPES ADDED BY THE MODEL:
--------------------------------------------------------------------------------

1. PREDICTED CLASS FIELD: $C-response
   • Contains the predicted response for each customer ('T' or 'F')
   • Based on the CHAID tree rules applied to customer characteristics
   • The model predicts 'T' when probability > 0.5, otherwise 'F'

2. CONFIDENCE/PROBABILITY FIELDS: $CC-response-[class]
   • Two fields created:
     - $CC-response-F: Probability customer will NOT respond (0.0 to 1.0)
     - $CC-response-T: Probability customer WILL respond (0.0 to 1.0)
   • These probabilities are calculated from the leaf node statistics
   • They always sum to 1.0 (100%) for each customer
   • Higher probability = higher confidence in that prediction

--------------------------------------------------------------------------------
WHAT THESE FIELDS REPRESENT:
-------------------------------------------------------------------------------

## 5. Applying the Model to Testing Dataset

Apply the model to customers who did NOT receive the test mailing.
- How many of these customers are predicted to respond positively?

In [8]:
# Apply model to testing dataset (customers who did NOT receive test mailing)
testing_data = df[df['has_received_test_mailing'] == 'no'].copy()
testing_results = testing_data.apply(get_prediction_and_probability, axis=1, result_type='expand')
testing_results.columns = ['$C-response', '$CC-response-F', '$CC-response-T']

# Count predicted positive responses
predicted_positive = (testing_results['$C-response'] == 'T').sum()

print("=" * 80)
print("QUESTION 5 - ANSWER")
print("=" * 80)
print(f"\nTesting dataset size: {len(testing_data):,} customers")
print(f"\n✓ Number of customers predicted to respond positively (T): {predicted_positive:,}")
print(f"\nPredicted response rate: {(predicted_positive/len(testing_data)*100):.2f}%")
print("\n" + "=" * 80)

QUESTION 5 - ANSWER

Testing dataset size: 20,000 customers

✓ Number of customers predicted to respond positively (T): 254

Predicted response rate: 1.27%
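
The 0.5 cut-off behind `$C-response` flags only a small share of the testing customers. For campaign selection, an alternative is to rank customers by `$CC-response-T` and pick a cut-off from campaign economics. The sketch below uses a hypothetical five-row frame (`scored`) whose probabilities are leaf rates from this notebook; the `cutoff` value is an assumption for illustration, not a figure from the source.

```python
import pandas as pd

# Hypothetical scored customers; probabilities are leaf rates from this notebook
scored = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "$CC-response-T": [0.006323, 1.0, 0.087799, 0.568889, 0.0],
})

# Assumed break-even response rate, chosen only for illustration
cutoff = 0.05

# Rank by response probability, then keep everyone at or above the cut-off
ranked = scored.sort_values("$CC-response-T", ascending=False)
mailing_list = ranked[ranked["$CC-response-T"] >= cutoff]

print(mailing_list["customer_id"].tolist())
```

On the real data the same two lines could be applied to `testing_results` joined with `testing_data['customer_id']`.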

