## Purpose of This Document

This document explains how I built the simulated world: 500 synthetic client scenarios for testing how different onboarding choices (such as providing a setup guide versus jumping straight onto a call) affect how quickly a project delivers value.

### The "Why" 

Real-world data is often noisy and skewed. A simulation built only on fixed averages would look too clean to be believable. To avoid that, I introduced three "Real World" factors:

**The Human Factor:** Not every client reacts the same way. Technical level is drawn independently of the outcome variables, so some "High-Tech" clients still end up high-maintenance, and random noise stands in for personality differences.

**The Founder Tax:** In startups, founders often step in to help. While they are experts, they are also busy. Our data reflects that Founder involvement often causes delays simply because their calendars are, predictably, a *nightmare*.

**The Snowball Effect:** Small problems (like missing a setup guide) snowball into bigger problems (like 10 extra support tickets), which eventually causes the client to get frustrated and take longer to reach "success."


## Core Variables & Logic

* **Time to First Value (TTFV):** This measures the days from the first call to the moment the client actually gets value from the software.
* **Stochastic Noise:** We draw counts from a Poisson distribution and delays from an Exponential distribution. In plain English: question counts cluster around a typical value and most delays are short, but there are always a few "long-tail" outliers (projects that go badly off schedule) to keep the analysis realistic.
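
To make the "long tail" concrete, here is a minimal sketch that draws from the two distributions using the same parameters as the generator in this document (a Poisson mean of 2 for the base question count, an exponential scale of 4 days for the delay). The sample size and the local generator `rng` with seed 0 are assumptions made for illustration only; they are not part of the dataset build.

```python
import numpy as np

# Hypothetical local generator, so the global seed used by the main
# simulation is left untouched.
rng = np.random.default_rng(0)

# Same parameters as the dataset generator: Poisson mean 2, exponential scale 4.
questions = rng.poisson(lam=2, size=100_000)
drag_days = rng.exponential(scale=4, size=100_000)

for name, sample in [("questions", questions), ("drag_days", drag_days)]:
    print(
        f"{name}: mean={sample.mean():.2f}, "
        f"median={np.median(sample):.2f}, "
        f"99th pct={np.percentile(sample, 99):.2f}"
    )
```

A mean that sits above the median, together with a 99th percentile far above both, is the signature of the right-skewed tail described in the bullet above.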

The code below creates our 500 clients. Instead of saying "If A, then B," we tell the simulation: "If A happens, B is more *likely* to happen."

In [1]:
import pandas as pd
import numpy as np

# Setting the stage
RANDOM_SEED = 42
NUM_CLIENTS = 500 
np.random.seed(RANDOM_SEED)

records = []

for i in range(1, NUM_CLIENTS + 1):
    # 1. Who is the client?
    tech_level = np.random.choice(["Low", "Medium", "High"], p=[0.45, 0.35, 0.20])
    
    # 2. Who owns the onboarding? (Owner vs. Founder)
    owner_defined = np.random.choice(["Yes", "No"], p=[0.4, 0.6])
    founder_involved = "Yes" if (owner_defined == "No" and np.random.rand() < 0.8) else "No"
    
    # 3. Did we give them tools to succeed? (Enablement)
    # Docs are more likely (70%) when an owner is defined than when not (30%).
    doc_prob = 0.7 if owner_defined == "Yes" else 0.3
    docs_provided = "Yes" if np.random.rand() < doc_prob else "No"
    
    # 4. The Friction (The "Noise")
    # Instead of a fixed +3 questions, we use a 'shaker' to add randomness
    base_q = 2
    doc_penalty = np.random.randint(2, 6) if docs_provided == "No" else 0
    repeat_questions = np.random.poisson(base_q + doc_penalty)

    # 5. The Outcome: Time to First Value (TTFV)
    # This is how long it takes for the client to actually start using the product.
    # It's a mix of call counts, support delays, and technical hurdles.
    onboarding_calls = np.random.randint(3, 6) + (np.random.randint(2, 5) if founder_involved == "Yes" else 0)
    duration = (onboarding_calls * 3) + int(np.random.exponential(4)) # Added 'calendar drag'
    
    ttfv = duration + (repeat_questions * 1.2)
    
    # 6. Satisfaction (1-5 score)
    # We start at 5 and deduct points for a bad experience.
    satisfaction = 5.0
    if docs_provided == "No":
        satisfaction -= 1.0
    if repeat_questions > 6:
        satisfaction -= 1.0
    # Add a bit of 'human randomness' so it's not a perfect math formula
    satisfaction = round(max(1, min(5, satisfaction + np.random.uniform(-0.4, 0.4))), 1)

    records.append({
        "client_id": f"C{i:03d}",
        "client_technical_level": tech_level,
        "onboarding_owner_defined": owner_defined,
        "founder_involved": founder_involved,
        "docs_provided": docs_provided,
        "repeat_questions_count": repeat_questions,
        "onboarding_duration_days": duration,
        "time_to_first_value_days": int(ttfv),
        "satisfaction_score": satisfaction
    })

df = pd.DataFrame(records)
df.to_csv("synthetic_onboarding_data.csv", index=False)

In [2]:
df.head()

| | client_id | client_technical_level | onboarding_owner_defined | founder_involved | docs_provided | repeat_questions_count | onboarding_duration_days | time_to_first_value_days | satisfaction_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | C001 | Low | No | Yes | No | 4 | 24 | 28 | 4.2 |
| 1 | C002 | High | Yes | No | No | 4 | 12 | 16 | 4.4 |
| 2 | C003 | Low | Yes | No | Yes | 4 | 9 | 13 | 5.0 |
| 3 | C004 | High | No | Yes | Yes | 2 | 21 | 23 | 4.9 |
| 4 | C005 | Low | No | Yes | Yes | 1 | 23 | 24 | 4.9 |
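
The "If A happens, B is more likely" idea can be checked in isolation. The sketch below reproduces only the owner-to-documentation step with the same probabilities as the generator (40% of clients have a defined owner; docs are provided 70% of the time with an owner and 30% without). It is a vectorized stand-in written for illustration, not the loop used to build the dataset, and the seed and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical seed, independent of RANDOM_SEED
n = 20_000

# Step A: is an onboarding owner defined? (40% yes, as in the generator)
owner_defined = rng.random(n) < 0.4

# Step B: docs are more likely with an owner (70%) than without (30%)
doc_prob = np.where(owner_defined, 0.7, 0.3)
docs_provided = rng.random(n) < doc_prob

rate_with_owner = docs_provided[owner_defined].mean()
rate_without_owner = docs_provided[~owner_defined].mean()
print(f"docs rate with owner:    {rate_with_owner:.2f}")
print(f"docs rate without owner: {rate_without_owner:.2f}")
```

Neither rate is 0 or 1: having an owner shifts the odds of receiving documentation without guaranteeing it, which is what keeps the relationship probabilistic rather than rule-based.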


### Understanding the Dataset Limits

While this dataset is designed to be as realistic as possible, it is important to acknowledge its boundaries for the sake of analytical integrity:

1.  **Scope of "Success":** We define success purely through **Time to First Value (TTFV)** and **Satisfaction**. In a real-world scenario, success would also include "Contract Value" or "Churn Risk," which are excluded here to focus on the onboarding process itself.
2.  **Linear Assumptions:** The model assumes that documentation *always* lowers the expected number of repeat questions (individual clients can still ask more by chance). In reality, poorly written documentation can actually increase confusion. We are modeling "Standard Quality" enablement here.
3.  **Data Distributions:** We used an **Exponential distribution** for the calendar drag added to duration and a **Poisson distribution** for question counts. This gives the data a right-skewed tail of unusually slow projects, which is critical for testing how a business handles "nightmare" scenarios. With an exponential scale of only 4 days, though, projects beyond 50 days should be rare in a 500-client sample.
4.  **The "Human" Element:** By adding random noise (a uniform shift of up to ±0.4 points) to the Satisfaction score, we ensure that the data isn't "perfectly correlated." This forces the analysis to look for trends rather than simple mathematical certainties.
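
To see how heavy the duration tail actually is, the following sketch resamples only the duration formula from the generator (3 to 5 base calls, 2 to 4 extra calls when a founder is involved, 3 days per call, plus truncated exponential drag). It is a vectorized approximation for illustration: the seed, the sample size, and the shortcut of drawing founder involvement directly at 0.6 × 0.8 are assumptions, not part of the original build.

```python
import numpy as np

rng = np.random.default_rng(2)  # hypothetical seed for this check only
n = 200_000

# Founder steps in when no owner is defined (60%) and the 80% draw succeeds.
founder_involved = rng.random(n) < 0.6 * 0.8

base_calls = rng.integers(3, 6, size=n)  # 3 to 5 calls (upper bound exclusive)
extra_calls = np.where(founder_involved, rng.integers(2, 5, size=n), 0)
calls = base_calls + extra_calls

# int() in the generator truncates, which equals floor for positive values.
drag = np.floor(rng.exponential(scale=4, size=n)).astype(int)
duration = calls * 3 + drag

share_50_plus = (duration >= 50).mean()
print("median duration:", np.median(duration))
print("99th percentile:", np.percentile(duration, 99))
print(f"share of projects at 50+ days: {share_50_plus:.4%}")
```

If the tail turns out thinner than the analysis needs, raising the exponential scale is the most direct lever, at the cost of shifting the average duration as well.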