## Synthetic Onboarding Dataset Generation

This notebook generates a synthetic dataset that simulates client onboarding behavior
in a B2B software context.

The dataset is intentionally biased to reflect common operational realities:
- Clients are often low to medium in technical capability
- Onboarding ownership is frequently unclear
- Founder involvement compensates for missing processes
- Poor enablement increases friction and delays time-to-value

The dataset is meant for exploratory analysis, not for building or validating predictive models.


In [None]:
import pandas as pd
import numpy as np

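RANDOM_SEED = 42   # fixed seed so the generated dataset is reproducible
NUM_CLIENTS = 120  # number of synthetic client records

# Seed NumPy's global RNG; every np.random call below draws from it.
np.random.seed(RANDOM_SEED)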


### Encoded Bias Assumptions

This synthetic dataset encodes the following assumptions:

1. Most clients are not highly technical.
2. Onboarding ownership is undefined more often than defined.
3. Founder involvement occurs only when ownership is undefined, in roughly 70% of those cases.
4. Missing documentation leads to repeated questions and higher support demand.
5. Early scope clarification reduces post-onboarding friction.

These assumptions reflect operational patterns observed in early-stage and mid-scale B2B teams.


In [None]:
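def generate_client_technical_level():
    """Draw a technical level; Low and Medium together cover 80% of clients."""
    return np.random.choice(
        ["Low", "Medium", "High"],
        p=[0.45, 0.35, 0.20]
    )

def generate_owner_defined():
    """Draw whether an onboarding owner is defined; "No" is more likely (60%)."""
    return np.random.choice(
        ["Yes", "No"],
        p=[0.4, 0.6]
    )

def generate_founder_involvement(owner_defined):
    """Founders step in only when no owner is defined, and then 70% of the time."""
    if owner_defined == "No":
        return "Yes" if np.random.rand() < 0.7 else "No"
    return "No"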
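
The generators above imply marginal rates that can be worked out by hand: ownership is undefined with probability 0.6, and a founder steps in for 70% of those cases, so about 0.6 × 0.7 = 0.42 of all clients should show founder involvement. The sketch below estimates the same rate by simulation. It is illustrative only: it uses a separate `np.random.default_rng` generator so that it does not disturb the seeded global state used by the notebook, and the names `simulate_founder_rate` and `n_samples` are hypothetical.

```python
import numpy as np

def simulate_founder_rate(n_samples=100_000, seed=0):
    """Estimate the overall share of clients with founder involvement.

    Mirrors the notebook's probabilities: owner undefined with p=0.6,
    founder involved with p=0.7 only when the owner is undefined.
    """
    rng = np.random.default_rng(seed)  # independent of np.random.seed(...)
    owner_undefined = rng.random(n_samples) < 0.6
    founder_draw = rng.random(n_samples) < 0.7
    # A founder is involved only if both conditions hold.
    founder_involved = owner_undefined & founder_draw
    return float(founder_involved.mean())

expected_rate = 0.6 * 0.7  # analytic value from the two probabilities
estimated_rate = simulate_founder_rate()
print(f"expected={expected_rate:.2f}, estimated={estimated_rate:.3f}")
```

With only 120 clients, the share observed in the generated dataset can deviate from 0.42 by several percentage points through sampling noise alone.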


In [None]:
records = []

for i in range(1, NUM_CLIENTS + 1):
    technical_level = generate_client_technical_level()
    owner_defined = generate_owner_defined()
    founder_involved = generate_founder_involvement(owner_defined)

    # Enablement assets: a walkthrough (60%) or a single source of truth (40%)
    # can exist only when documentation was provided.
    documentation_provided = np.random.choice(["Yes", "No"], p=[0.45, 0.55])
    recorded_walkthrough = (
        "Yes" if documentation_provided == "Yes" and np.random.rand() < 0.6 else "No"
    )

    single_source_of_truth = (
        "Yes" if documentation_provided == "Yes" and np.random.rand() < 0.4 else "No"
    )

    scope_constraints_discussed_early = np.random.choice(["Yes", "No"], p=[0.45, 0.55])

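    # Friction signals: a Poisson baseline plus fixed penalties for missing
    # documentation and low technical capability.
    repeat_questions = (
        np.random.poisson(2)
        + (3 if documentation_provided == "No" else 0)
        + (2 if technical_level == "Low" else 0)
    )

    # Support demand rises when scope was not discussed early or when the
    # client asked more than four repeat questions.
    support_requests_30_days = (
        np.random.poisson(1)
        + (2 if scope_constraints_discussed_early == "No" else 0)
        + (2 if repeat_questions > 4 else 0)
    )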

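    # np.random.randint excludes the upper bound: 2-6 base calls (plus 2 when a
    # founder is involved) and 2-4 days per call.
    onboarding_calls = np.random.randint(2, 7) + (2 if founder_involved == "Yes" else 0)
    onboarding_duration_days = onboarding_calls * np.random.randint(2, 5)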

    scope_changes_post_onboarding = (
        "Yes"
        if scope_constraints_discussed_early == "No" and np.random.rand() < 0.65
        else "No"
    )

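    # Time to first value: onboarding duration, plus 2 days per support request,
    # plus a flat 5-day penalty for low technical capability.
    time_to_first_value = (
        onboarding_duration_days
        + support_requests_30_days * 2
        + (5 if technical_level == "Low" else 0)
    )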

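    # Satisfaction starts at 5 and loses one point per friction signal;
    # the max/min clamp keeps the score within the 1-5 scale.
    onboarding_satisfaction = max(
        1,
        min(
            5,
            5
            - (1 if documentation_provided == "No" else 0)
            - (1 if scope_changes_post_onboarding == "Yes" else 0)
            - (1 if repeat_questions > 5 else 0),
        ),
    )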

    records.append({
        "client_id": f"C{i:03d}",
        "client_technical_level": technical_level,
        "onboarding_owner_defined": owner_defined,
        "founder_involved": founder_involved,
        "number_of_onboarding_calls": onboarding_calls,
        "onboarding_duration_days": onboarding_duration_days,
        "documentation_provided": documentation_provided,
        "recorded_walkthrough_available": recorded_walkthrough,
        "single_source_of_truth": single_source_of_truth,
        "scope_constraints_discussed_early": scope_constraints_discussed_early,
        "scope_changes_post_onboarding": scope_changes_post_onboarding,
        "repeat_questions_count": repeat_questions,
        "support_requests_first_30_days": support_requests_30_days,
        "time_to_first_value_days": time_to_first_value,
        "onboarding_satisfaction_score": onboarding_satisfaction,
    })


In [None]:
df = pd.DataFrame(records)

df.to_csv(
    "synthetic_onboarding_data.csv",
    index=False
)

df.head()
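
Because the biases are encoded on purpose, it can help to confirm that they actually show up in the generated frame. The helper below is a minimal sketch of such a check, not part of the original notebook: `summarize_encoded_biases` is a hypothetical name, and it relies only on the column names written to the CSV above. A group with no rows yields `NaN` for its statistic.

```python
import pandas as pd

def summarize_encoded_biases(df: pd.DataFrame) -> dict:
    """Return summary statistics that should reflect the encoded assumptions."""
    no_docs = df["documentation_provided"] == "No"
    owner_missing = df["onboarding_owner_defined"] == "No"
    founder = df["founder_involved"] == "Yes"
    return {
        # Assumption 1: highly technical clients should be a minority.
        "share_high_technical": (df["client_technical_level"] == "High").mean(),
        # Assumption 2: ownership should be undefined more often than defined.
        "share_owner_undefined": owner_missing.mean(),
        # Assumption 3: founder involvement should appear only without an owner.
        "founder_rate_when_owner_undefined": founder[owner_missing].mean(),
        "founder_rate_when_owner_defined": founder[~owner_missing].mean(),
        # Assumption 4: missing documentation should raise repeat questions.
        "mean_repeat_questions_no_docs": df.loc[no_docs, "repeat_questions_count"].mean(),
        "mean_repeat_questions_with_docs": df.loc[~no_docs, "repeat_questions_count"].mean(),
    }
```

In the notebook this would be called as `summarize_encoded_biases(df)` right after the frame is built.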


### Notes on Limitations

- This dataset is not representative of all B2B onboarding contexts.
- Relationships are intentionally simplified.
- Biases are introduced consciously to support exploratory analysis.

Future iterations may rebalance distributions or introduce industry-specific effects.
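
One way to make such rebalancing easier is to pull the hard-coded probabilities into a single configuration object. The sketch below is an assumption about how a future iteration might look, not part of the current notebook: `BiasConfig`, its field names, and `sample_client_profile` are hypothetical, and the defaults simply copy the probabilities used above. It also draws from an explicit `np.random.default_rng` generator, so its output will not match the sequence produced by the seeded global RNG.

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class BiasConfig:
    """Hypothetical container for the probabilities hard-coded in the notebook."""
    technical_level_probs: tuple = (0.45, 0.35, 0.20)  # Low, Medium, High
    owner_defined_prob: float = 0.4
    founder_if_no_owner_prob: float = 0.7

    def __post_init__(self):
        # Reject configurations that are not valid probability settings.
        if abs(sum(self.technical_level_probs) - 1.0) > 1e-9:
            raise ValueError("technical_level_probs must sum to 1")
        for p in (self.owner_defined_prob, self.founder_if_no_owner_prob):
            if not 0.0 <= p <= 1.0:
                raise ValueError("probabilities must lie in [0, 1]")

def sample_client_profile(config, rng):
    """Draw technical level, ownership, and founder involvement for one client."""
    technical_level = str(
        rng.choice(["Low", "Medium", "High"], p=config.technical_level_probs)
    )
    owner_defined = "Yes" if rng.random() < config.owner_defined_prob else "No"
    founder_involved = (
        "Yes"
        if owner_defined == "No" and rng.random() < config.founder_if_no_owner_prob
        else "No"
    )
    return {
        "client_technical_level": technical_level,
        "onboarding_owner_defined": owner_defined,
        "founder_involved": founder_involved,
    }

# Example: a rebalanced configuration with uniform technical levels.
rng = np.random.default_rng(42)
balanced = BiasConfig(technical_level_probs=(1 / 3, 1 / 3, 1 / 3), owner_defined_prob=0.5)
profile = sample_client_profile(balanced, rng)
```

Keeping the probabilities in one place means a rebalanced or industry-specific variant only needs a different `BiasConfig` instance rather than edits scattered across the generation loop.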
