## Synthetic Data Validation

### Purpose

This notebook validates whether the synthetic onboarding dataset generated in
`03_02_generate_synthetic_data.ipynb` adheres to the assumptions explicitly
defined in `03_01_data_generation_logic.md`.

The objective is **internal consistency and plausibility**, not statistical rigor.
Validation focuses on whether encoded relationships behave directionally as intended.


### Dataset Overview and Schema Check

We begin by loading the dataset and performing basic schema and completeness checks.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("synthetic_onboarding_data.csv")

df.head()


| | client_id | client_technical_level | onboarding_owner_defined | founder_involved | number_of_onboarding_calls | onboarding_duration_days | documentation_provided | recorded_walkthrough_available | single_source_of_truth | scope_constraints_discussed_early | scope_changes_post_onboarding | repeat_questions_count | support_requests_first_30_days | time_to_first_value_days | onboarding_satisfaction_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C001 | Low | No | No | 3 | 9 | No | No | No | Yes | No | 6 | 5 | 24 | 3 |
| 1 | C002 | Low | Yes | No | 4 | 16 | Yes | Yes | No | Yes | No | 4 | 0 | 21 | 5 |
| 2 | C003 | High | No | No | 3 | 6 | No | No | No | No | Yes | 3 | 4 | 14 | 3 |
| 3 | C004 | Low | No | Yes | 5 | 15 | No | No | No | Yes | No | 6 | 3 | 26 | 3 |
| 4 | C005 | Low | No | Yes | 8 | 16 | Yes | No | No | Yes | No | 7 | 4 | 29 | 4 |


In [2]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 15 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   client_id                          120 non-null    object
 1   client_technical_level             120 non-null    object
 2   onboarding_owner_defined           120 non-null    object
 3   founder_involved                   120 non-null    object
 4   number_of_onboarding_calls         120 non-null    int64 
 5   onboarding_duration_days           120 non-null    int64 
 6   documentation_provided             120 non-null    object
 7   recorded_walkthrough_available     120 non-null    object
 8   single_source_of_truth             120 non-null    object
 9   scope_constraints_discussed_early  120 non-null    object
 10  scope_changes_post_onboarding      120 non-null    object
 11  repeat_questions_count             120 non-null    int64 
 12  support_

In [3]:
df.isnull().sum()


client_id                            0
client_technical_level               0
onboarding_owner_defined             0
founder_involved                     0
number_of_onboarding_calls           0
onboarding_duration_days             0
documentation_provided               0
recorded_walkthrough_available       0
single_source_of_truth               0
scope_constraints_discussed_early    0
scope_changes_post_onboarding        0
repeat_questions_count               0
support_requests_first_30_days       0
time_to_first_value_days             0
onboarding_satisfaction_score        0
dtype: int64

### Boundary and Range Validation

Numerical fields are checked against realistic operational boundaries to ensure
no logically impossible values were generated.


In [4]:

range_checks = {
    "number_of_onboarding_calls": (0, 20),
    "onboarding_duration_days": (0, 120),
    "repeat_questions_count": (0, 50),
    "support_requests_first_30_days": (0, 50),
    "time_to_first_value_days": (0, 90),
    "onboarding_satisfaction_score": (1, 5)
}

for col, (low, high) in range_checks.items():
    invalid_rows = df[(df[col] < low) | (df[col] > high)]
    print(f"{col}: {len(invalid_rows)} invalid rows")


number_of_onboarding_calls: 0 invalid rows
onboarding_duration_days: 0 invalid rows
repeat_questions_count: 0 invalid rows
support_requests_first_30_days: 0 invalid rows
time_to_first_value_days: 0 invalid rows
onboarding_satisfaction_score: 0 invalid rows
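
The loop above reports only violation counts, and a missing value would pass it silently because comparisons against `NaN` evaluate to `False`. A minimal sketch of a summary that also shows observed extremes and missing counts follows. The helper name `summarize_ranges` and the toy frame are hypothetical and used only for illustration; the values are not taken from the dataset.

```python
import pandas as pd

# Toy frame for illustration only; these values are NOT from the real dataset.
toy = pd.DataFrame({
    "number_of_onboarding_calls": [3, 4, 25],
    "onboarding_satisfaction_score": [3, 5, None],
})

toy_ranges = {
    "number_of_onboarding_calls": (0, 20),
    "onboarding_satisfaction_score": (1, 5),
}


def summarize_ranges(frame, ranges):
    """Return one row per column: observed min/max, missing and out-of-range counts."""
    rows = []
    for col, (low, high) in ranges.items():
        values = frame[col]
        # between() is inclusive on both ends; missing values are counted separately.
        out_of_range = values.notna() & ~values.between(low, high)
        rows.append({
            "column": col,
            "observed_min": values.min(),
            "observed_max": values.max(),
            "missing": int(values.isna().sum()),
            "out_of_range": int(out_of_range.sum()),
        })
    return pd.DataFrame(rows).set_index("column")


range_summary = summarize_ranges(toy, toy_ranges)
print(range_summary)
```

Applied to the notebook's data, the call would presumably be `summarize_ranges(df, range_checks)`.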


### Assumption Validation & Structural Consistency Checks


#### Purpose of this Section

This section evaluates whether the synthetically generated onboarding dataset exhibits internally consistent relationships aligned with theoretically motivated onboarding assumptions.

These checks are not intended to establish causal claims, but rather to verify that the generated data respects the conceptual logic used during its construction. Ensuring such alignment is necessary before proceeding to downstream analysis and modeling.

#### Scope of the Assumption Checks

The assumption checks below are strictly bounded to the variables explicitly defined during data generation.
No inferred or proxy variables are introduced, which avoids schema drift and unintended analytical bias.

One encoded assumption, for example, is that missing documentation increases onboarding friction, reflected by a higher
number of repeat questions from clients. It is checked under Assumption 4.

In [5]:
EXPECTED_COLUMNS = [
    "client_id",
    "client_technical_level",
    "onboarding_owner_defined",
    "founder_involved",
    "number_of_onboarding_calls",
    "onboarding_duration_days",
    "documentation_provided",
    "recorded_walkthrough_available",
    "single_source_of_truth",
    "scope_constraints_discussed_early",
    "scope_changes_post_onboarding",
    "repeat_questions_count",
    "support_requests_first_30_days",
    "time_to_first_value_days",
    "onboarding_satisfaction_score",
]

assert set(EXPECTED_COLUMNS).issubset(df.columns)
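
When the assertion above fails, it does not say which columns are at fault. A minimal sketch of a more informative comparison follows. The helper `schema_diff` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def schema_diff(frame, expected):
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = list(frame.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Toy frame for illustration only: one expected column plus a stray index column.
toy = pd.DataFrame({"client_id": ["C001"], "Unnamed: 0": [0]})

missing, unexpected = schema_diff(toy, ["client_id", "founder_involved"])
print("missing:", missing)
print("unexpected:", unexpected)
```

Reporting unexpected columns as well as missing ones is a design choice: `issubset` tolerates extra columns, such as an index column written out by an earlier `to_csv` call.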


#### Helper Utilities

Several assumptions involve binary operational indicators encoded as "Yes" / "No".
To ensure analytical consistency and prevent silent type coercion, a helper function is defined for explicit boolean projection.

In [6]:
def yes_no_flag(series):
    return series.eq("Yes")
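
One caveat: `series.eq("Yes")` maps every value other than the exact string "Yes" to `False`, including variants such as "yes" or missing entries. A stricter variant, sketched below under the assumption that only "Yes" and "No" are legal values, raises instead. The name `strict_yes_no_flag` is hypothetical.

```python
import pandas as pd


def strict_yes_no_flag(series):
    """Map "Yes"/"No" to True/False; raise on any other value or on missing entries."""
    unexpected = set(series.dropna().unique()) - {"Yes", "No"}
    if unexpected or series.isna().any():
        raise ValueError(
            f"{series.name}: unexpected values {sorted(unexpected)} or missing entries"
        )
    return series.eq("Yes")


flags = strict_yes_no_flag(pd.Series(["Yes", "No", "Yes"], name="founder_involved"))
print(flags.tolist())

# A differently cased or padded value is rejected rather than silently read as "No".
try:
    strict_yes_no_flag(pd.Series(["Yes", "yes "], name="founder_involved"))
except ValueError as err:
    print("rejected:", err)
```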


#### Assumption 1: Technical Capability and Onboarding Duration

**Assumption**

Clients with higher technical proficiency are expected to complete onboarding in fewer days due to reduced dependency on support and explanations.

**Validation Goal**

Verify whether onboarding duration decreases as client technical level increases.

In [7]:
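# Order the levels ordinally; sort_index() would sort them alphabetically (High, Low, Medium).
level_order = ["Low", "Medium", "High"]
(
    df.groupby("client_technical_level")["onboarding_duration_days"]
      .mean()
      .reindex(level_order)
)


client_technical_level
Low       13.491525
Medium    15.050000
High      14.047619
Name: onboarding_duration_days, dtype: float64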

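A monotonic or near-monotonic decrease from Low to Medium to High would support the structural plausibility of this assumption.
The observed means do not show one: Low clients average about 13.5 days, Medium about 15.1, and High about 14.0.
Deviations such as this may indicate either:

- intentional noise introduced during synthesis, or
- a misalignment between technical level labeling and onboarding complexity.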
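
The monotonicity reading can also be checked programmatically rather than by eye. The sketch below reuses the three group means printed above and assumes the ordinal order Low < Medium < High. The variable names are illustrative.

```python
import pandas as pd

# Group means copied from the notebook output above.
duration_means = pd.Series(
    {"High": 14.047619, "Low": 13.491525, "Medium": 15.050000},
    name="onboarding_duration_days",
)

# Put the levels in ordinal order before testing the trend.
ordered = duration_means.reindex(["Low", "Medium", "High"])

# True only if each level's mean is <= the mean of the level before it.
decreasing = bool(ordered.is_monotonic_decreasing)

# Step-to-step changes show where the expected direction breaks.
steps = ordered.diff().dropna()

print(ordered)
print("monotonic decrease:", decreasing)
print(steps)
```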

#### Assumption 2: Ownership Clarity and Founder Involvement


**Assumption**

When onboarding ownership is clearly defined, founder involvement tends to be more structured and deliberate rather than reactive.

**Validation Goal**

Compare founder involvement rates across ownership clarity states.

In [8]:
(
    df.assign(founder_flag = yes_no_flag(df["founder_involved"]))
      .groupby("onboarding_owner_defined")["founder_flag"]
      .mean()
)


onboarding_owner_defined
No     0.618421
Yes    0.000000
Name: founder_flag, dtype: float64


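In this output, founders are involved in about 62% of onboardings without a defined owner and in none with one.

For the scope check (Assumption 3, early scope discussion against post-onboarding scope changes), lower scope-change rates under "Yes" indicate improved scope stability.
Inverted results may suggest that early discussions surface complexity rather than eliminate it.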
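
The cell behind the scope-stability reading is not shown in this section. A minimal sketch of what such a check might look like follows. It assumes the check compares the share of `scope_changes_post_onboarding == "Yes"` across `scope_constraints_discussed_early`, which is an inference from the column names and not confirmed by the notebook. The toy values are illustrative only.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "scope_constraints_discussed_early": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "scope_changes_post_onboarding": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# Share of clients with post-onboarding scope changes, per early-discussion state.
scope_change_rate = (
    toy.assign(scope_change_flag=toy["scope_changes_post_onboarding"].eq("Yes"))
       .groupby("scope_constraints_discussed_early")["scope_change_flag"]
       .mean()
)
print(scope_change_rate)
```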

#### Assumption 4: Enablement Materials and Cognitive Load


**Assumption**

Comprehensive enablement, defined as the presence of both documentation and recorded walkthroughs, reduces repeated clarification requests.

**Validation Goal**

Compare average repeat question counts between fully enabled clients (both materials present) and all other clients (one material or none).

In [9]:
df.assign(
    has_docs = yes_no_flag(df["documentation_provided"]),
    has_walkthrough = yes_no_flag(df["recorded_walkthrough_available"]),
    enablement_complete = lambda x: x["has_docs"] & x["has_walkthrough"]
).groupby("enablement_complete")["repeat_questions_count"].mean()


enablement_complete
False    5.482353
True     2.800000
Name: repeat_questions_count, dtype: float64

This composite indicator avoids over-attributing impact to a single enablement artifact and reflects realistic onboarding practice.
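
The composite flag merges three different states into `False`: documentation only, walkthrough only, and neither. A sketch of a four-way breakdown that keeps those states apart is shown below on toy data. The values are illustrative and not from the dataset.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "documentation_provided": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "recorded_walkthrough_available": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "repeat_questions_count": [2, 4, 5, 7, 3, 6],
})

# Mean and group size for each documentation/walkthrough combination.
breakdown = (
    toy.groupby(["documentation_provided", "recorded_walkthrough_available"])
       ["repeat_questions_count"]
       .agg(["mean", "count"])
)
print(breakdown)
```

Including the group count alongside the mean helps flag combinations that rest on very few clients.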

#### Assumption 5: Satisfaction and Early Support Demand

**Assumption**

Lower onboarding satisfaction is associated with increased early-stage support requests.

**Validation Goal**

Observe support request trends across satisfaction score levels.

In [10]:
(
    df.groupby("onboarding_satisfaction_score")["support_requests_first_30_days"]
      .mean()
      .sort_index(ascending=False)
)


onboarding_satisfaction_score
5    1.434783
4    3.214286
3    3.525000
2    4.800000
Name: support_requests_first_30_days, dtype: float64

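Read from score 5 down to score 2, average support requests rise from about 1.4 to 4.8. Equivalently, support demand declines as satisfaction increases, which supports the assumption. No client has a score of 1 in this output.
Flat or noisy trends would suggest that satisfaction is influenced by factors independent of support volume.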
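
A single directional number can complement the grouped means. The sketch below computes a Spearman rank correlation as the Pearson correlation of the ranks, which is its definition, so only pandas is needed. The toy values are illustrative, and a negative coefficient would be consistent with the assumption.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "onboarding_satisfaction_score": [5, 4, 3, 2, 5, 3],
    "support_requests_first_30_days": [1, 3, 4, 5, 0, 3],
})

# Spearman's rho: Pearson correlation computed on average ranks (ties share a rank).
rho = (
    toy["onboarding_satisfaction_score"].rank()
       .corr(toy["support_requests_first_30_days"].rank())
)
print(f"Spearman rho: {rho:.3f}")
```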

#### Assumption 6: Information Centralization and Time to Value

**Assumption**

The presence of a single source of truth accelerates time-to-first-value by reducing information fragmentation.

**Validation Goal**

Compare average time-to-first-value between centralized and non-centralized setups.

In [11]:
(
    df.assign(ssot_flag = yes_no_flag(df["single_source_of_truth"]))
      .groupby("ssot_flag")["time_to_first_value_days"]
      .mean()
)


ssot_flag
False    22.764706
True     23.777778
Name: time_to_first_value_days, dtype: float64

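In this output the centralized group averages about one day longer (23.8 versus 22.8 days), so the expected acceleration does not appear.
This metric reflects operational efficiency rather than perceived satisfaction and should be interpreted independently.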
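
For Yes/No comparisons like this one, the direction of the effect can be reduced to a signed gap between group means. A minimal sketch follows. The helper `mean_gap` and the toy values are hypothetical.

```python
import pandas as pd


def mean_gap(frame, flag_col, value_col):
    """Mean of value_col where flag_col is "Yes" minus the mean where it is "No"."""
    means = frame.groupby(flag_col)[value_col].mean()
    return float(means["Yes"] - means["No"])


# Toy rows for illustration only.
toy = pd.DataFrame({
    "single_source_of_truth": ["Yes", "Yes", "No", "No"],
    "time_to_first_value_days": [18, 22, 25, 27],
})

gap = mean_gap(toy, "single_source_of_truth", "time_to_first_value_days")

# A negative gap means the "Yes" group reaches first value sooner.
print(f"gap (Yes - No): {gap:+.1f} days; direction matches assumption: {gap < 0}")
```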

### Reproducibility and Synthetic Bias Note

All assumption checks are conducted on synthetically generated data derived from predefined business logic and probabilistic rules.

Consequently, observed relationships should be interpreted as structural consistency checks rather than empirical validation of real-world onboarding dynamics.

This section exists to ensure analytical coherence prior to model development, not to assert causal claims.
