## Synthetic Data Validation

### Purpose

This notebook validates whether the synthetic onboarding dataset generated in
`03_02_generate_synthetic_data.ipynb` adheres to the assumptions explicitly
defined in `03_01_data_generation_logic.md`.

The objective is **internal consistency and plausibility**, not statistical rigor.
Validation focuses on whether encoded relationships behave directionally as intended.


### Dataset Overview and Schema Check

We begin by loading the dataset and performing basic schema and completeness checks.


In [6]:
import pandas as pd
import numpy as np

df = pd.read_csv("synthetic_onboarding_data.csv")

df.head()


,client_id,client_technical_level,onboarding_owner_defined,founder_involved,number_of_onboarding_calls,onboarding_duration_days,documentation_provided,recorded_walkthrough_available,single_source_of_truth,scope_constraints_discussed_early,scope_changes_post_onboarding,repeat_questions_count,support_requests_first_30_days,time_to_first_value_days,onboarding_satisfaction_score
0,C001,Low,No,No,3,9,No,No,No,Yes,No,6,5,24,3
1,C002,Low,Yes,No,4,16,Yes,Yes,No,Yes,No,4,0,21,5
2,C003,High,No,No,3,6,No,No,No,No,Yes,3,4,14,3
3,C004,Low,No,Yes,5,15,No,No,No,Yes,No,6,3,26,3
4,C005,Low,No,Yes,8,16,Yes,No,No,Yes,No,7,4,29,4


In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 15 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   client_id                          120 non-null    object
 1   client_technical_level             120 non-null    object
 2   onboarding_owner_defined           120 non-null    object
 3   founder_involved                   120 non-null    object
 4   number_of_onboarding_calls         120 non-null    int64 
 5   onboarding_duration_days           120 non-null    int64 
 6   documentation_provided             120 non-null    object
 7   recorded_walkthrough_available     120 non-null    object
 8   single_source_of_truth             120 non-null    object
 9   scope_constraints_discussed_early  120 non-null    object
 10  scope_changes_post_onboarding      120 non-null    object
 11  repeat_questions_count             120 non-null    int64 
 12  support_

In [8]:
df.isnull().sum()


client_id                            0
client_technical_level               0
onboarding_owner_defined             0
founder_involved                     0
number_of_onboarding_calls           0
onboarding_duration_days             0
documentation_provided               0
recorded_walkthrough_available       0
single_source_of_truth               0
scope_constraints_discussed_early    0
scope_changes_post_onboarding        0
repeat_questions_count               0
support_requests_first_30_days       0
time_to_first_value_days             0
onboarding_satisfaction_score        0
dtype: int64
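
The schema and completeness checks above are read by eye. A minimal sketch of how they could be turned into an automated check is shown below. The helper name `schema_problems` and the two-row stand-in frame are assumptions for illustration; the column names are copied from the `df.isnull().sum()` output above.

```python
import pandas as pd

# Column names copied from the df.isnull().sum() output above.
EXPECTED_COLUMNS = [
    "client_id",
    "client_technical_level",
    "onboarding_owner_defined",
    "founder_involved",
    "number_of_onboarding_calls",
    "onboarding_duration_days",
    "documentation_provided",
    "recorded_walkthrough_available",
    "single_source_of_truth",
    "scope_constraints_discussed_early",
    "scope_changes_post_onboarding",
    "repeat_questions_count",
    "support_requests_first_30_days",
    "time_to_first_value_days",
    "onboarding_satisfaction_score",
]


def schema_problems(frame, expected):
    """Hypothetical helper: list missing, unexpected, and null-containing columns."""
    present = list(frame.columns)
    return {
        "missing": [c for c in expected if c not in present],
        "unexpected": [c for c in present if c not in expected],
        "has_nulls": [c for c in present if frame[c].isnull().any()],
    }


# Stand-in frame: the first two rows of the df.head() output above.
sample = pd.DataFrame(
    [
        ["C001", "Low", "No", "No", 3, 9, "No", "No", "No", "Yes", "No", 6, 5, 24, 3],
        ["C002", "Low", "Yes", "No", 4, 16, "Yes", "Yes", "No", "Yes", "No", 4, 0, 21, 5],
    ],
    columns=EXPECTED_COLUMNS,
)

problems = schema_problems(sample, EXPECTED_COLUMNS)
print(problems)
```

In the notebook itself, the same helper could be called on `df`, with an `assert` that all three lists are empty.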

### Boundary and Range Validation

Numerical fields are checked against realistic operational boundaries to ensure
no logically impossible values were generated.


In [9]:
# Inclusive (low, high) bounds for each numeric column in the dataset
range_checks = {
    "number_of_onboarding_calls": (0, 20),
    "onboarding_duration_days": (0, 120),
    "repeat_questions_count": (0, 50),
    "support_requests_first_30_days": (0, 50),
    "time_to_first_value_days": (0, 90),
    "onboarding_satisfaction_score": (1, 5)
}

for col, (low, high) in range_checks.items():
    invalid_rows = df[(df[col] < low) | (df[col] > high)]
    print(f"{col}: {len(invalid_rows)} invalid rows")


number_of_onboarding_calls: 0 invalid rows
onboarding_duration_days: 0 invalid rows
repeat_questions_count: 0 invalid rows
support_requests_first_30_days: 0 invalid rows
time_to_first_value_days: 0 invalid rows
onboarding_satisfaction_score: 0 invalid rows
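
One detail of the comparison-based check above is that a missing value is neither below `low` nor above `high`, so it would not be counted as invalid. The sketch below shows an alternative based on `Series.between`, which treats missing values as out of range. The helper name `count_out_of_range` and the sample values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd


def count_out_of_range(series, low, high):
    """Count values outside the inclusive interval [low, high].

    Series.between is inclusive on both ends by default and returns False
    for NaN, so missing values are counted as invalid here.
    """
    return int((~series.between(low, high)).sum())


# Illustrative values only, not taken from the dataset.
scores = pd.Series([1, 3, 5, 0, 6, np.nan])
invalid = count_out_of_range(scores, 1, 5)
print(f"invalid rows: {invalid}")
```

Since the null check above reports no missing values, both approaches should agree on this dataset; the difference only matters if nulls appear in a later version.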


### Assumption Check 1: Documentation vs Repeat Questions

Assumption:
Missing documentation increases onboarding friction, reflected by a higher
number of repeat questions from clients.


In [11]:
# Assumption 1: Missing documentation increases repeat questions
df.groupby("documentation_provided")["repeat_questions_count"].mean()

documentation_provided
No     6.00
Yes    2.88
Name: repeat_questions_count, dtype: float64

In [12]:
df.groupby("documentation_provided")["repeat_questions_count"].median()

documentation_provided
No     6.0
Yes    2.0
Name: repeat_questions_count, dtype: float64
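
The comparison above can also be expressed as an explicit directional check, so the notebook fails loudly if a regenerated dataset no longer matches the assumption. This is a minimal sketch: the helper name `mean_gap` is hypothetical and the sample values are illustrative, not taken from the dataset.

```python
import pandas as pd


def mean_gap(frame, group_col, value_col, higher, lower):
    """Return mean(value_col) for group `higher` minus mean for group `lower`."""
    means = frame.groupby(group_col)[value_col].mean()
    return float(means[higher] - means[lower])


# Illustrative values only, not taken from the dataset.
sample = pd.DataFrame({
    "documentation_provided": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "repeat_questions_count": [6, 5, 7, 2, 3, 4],
})

gap = mean_gap(
    sample, "documentation_provided", "repeat_questions_count",
    higher="No", lower="Yes",
)

# The assumption holds directionally when the gap is positive.
assert gap > 0, "clients without documentation should ask more repeat questions"
print(f"mean gap (No - Yes): {gap:.2f}")
```

Applied to the group means reported above, the same subtraction gives 6.00 - 2.88 = 3.12 repeat questions.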

### Assumption Check 2: Onboarding Ownership and Founder Dependency

Assumption:
When no onboarding owner is defined, founders are more likely to step into the onboarding process.

The variable `founder_involved` is encoded categorically (Yes / No).
To avoid implicit numeric coercion, it is converted into a boolean indicator during analysis (Yes = True).

The resulting group-wise mean is therefore the proportion of onboarding cases involving founders, not an arithmetic mean of a numeric field.

This keeps the result directly interpretable as a share of cases and avoids assigning arbitrary numeric codes to the categories.

In [14]:
(
    df
    .assign(founder_involved_flag = df["founder_involved"].eq("Yes"))
    .groupby("onboarding_owner_defined")["founder_involved_flag"]
    .mean()
)


onboarding_owner_defined
No     0.618421
Yes    0.000000
Name: founder_involved_flag, dtype: float64
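
A proportion alone hides how many cases sit in each group. The sketch below uses `pd.crosstab` to show the row-normalised shares next to the raw counts. The stand-in frame holds the five rows shown in the `df.head()` output above, so its numbers differ from the full-dataset result.

```python
import pandas as pd

# Stand-in data: the five rows shown in the df.head() output above.
sample = pd.DataFrame({
    "onboarding_owner_defined": ["No", "Yes", "No", "No", "No"],
    "founder_involved": ["No", "No", "No", "Yes", "Yes"],
})

# Raw counts per (owner defined, founder involved) combination.
counts = pd.crosstab(
    sample["onboarding_owner_defined"],
    sample["founder_involved"],
)

# Row-normalised shares: each row sums to 1.
shares = pd.crosstab(
    sample["onboarding_owner_defined"],
    sample["founder_involved"],
    normalize="index",
)

print(counts)
print(shares)
```

The "Yes" column of `shares` corresponds to the proportion computed with the boolean flag above, while `counts` makes it visible when a group is too small to read much into its share.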

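### Assumption Check 3: Friction vs Time to First Value

Assumption:
Higher friction delays time to first value.

A case is flagged as high friction when its repeat question count exceeds the dataset median or when scope changed after onboarding.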

In [None]:
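# Flag a case as high friction if it has more repeat questions than the
# median case, or if scope changed after onboarding.
df["high_friction"] = (
    (df["repeat_questions_count"] > df["repeat_questions_count"].median()) |
    (df["scope_changes_post_onboarding"] == "Yes")
)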

df.groupby("high_friction")["time_to_first_value_days"].mean()
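
To make the flag definition concrete, the sketch below evaluates its two components separately on the five rows shown in the `df.head()` output above. The median here is that of the five-row sample, not of the full dataset, so the flags are illustrative only.

```python
import pandas as pd

# Stand-in data: the five rows shown in the df.head() output above.
sample = pd.DataFrame({
    "client_id": ["C001", "C002", "C003", "C004", "C005"],
    "repeat_questions_count": [6, 4, 3, 6, 7],
    "scope_changes_post_onboarding": ["No", "No", "Yes", "No", "No"],
})

median_questions = sample["repeat_questions_count"].median()

# Strict comparison: rows equal to the median are not flagged.
many_questions = sample["repeat_questions_count"] > median_questions
scope_changed = sample["scope_changes_post_onboarding"].eq("Yes")

flags = many_questions | scope_changed
print(sample.assign(high_friction=flags))
```

Because the comparison is strict, the two clients whose count equals the sample median are not flagged; only the client above the median and the client with a scope change are.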


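### Assumption Check 4: Friction vs Satisfaction

Assumption:
Higher friction lowers satisfaction, using the `high_friction` flag defined in the previous check.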

In [None]:
# Uses the high_friction flag created in the previous cell
df.groupby("high_friction")["onboarding_satisfaction_score"].mean()


## Distribution Sanity Checks

We verify that the dataset reflects a realistic early-stage environment where:
- Clients with a lower technical level (`client_technical_level`) are more common
- Failed or delayed onboardings exist but do not dominate


In [None]:
df["client_technical_level"].value_counts(normalize=True)


In [None]:
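# NOTE: "onboarding_status" is not among the columns reported by df.isnull().sum() above; this cell raises a KeyError unless that column is added.
df["onboarding_status"].value_counts(normalize=True)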


## Validation Summary

Key observations:

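- Core assumptions encoded during data generation generally hold directionally. For example, clients without documentation average 6.00 repeat questions versus 2.88 for clients with documentation.
- Relationships exhibit noise, reflecting realistic operational variability.
- Founder involvement appears as a compensatory mechanism rather than a scalable solution: founders are involved in about 62% of onboardings without a defined owner and in none with a defined owner.
- Friction indicators align with delayed value realization and reduced satisfaction.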

Limitations:

- Relationships are probabilistic and illustrative, not causal.
- No financial or longitudinal effects are modeled.

Conclusion:

The dataset is internally consistent with documented assumptions and suitable
for exploratory analysis of onboarding failure patterns.
