# AI ROI Dataset: 200 B2B Deployments Analysis (2022-2025)

**Author:** Denis ATLAN  
**Organization:** ENDKOO / HumaLoop  
**Date:** December 2025  
**Version:** 1.0

## Abstract

This notebook analyzes 200 AI deployments in French B2B companies (2022-2025). It provides statistical analysis of ROI, deployment durations, investment patterns, and failure rates across sectors and company sizes.

**Key Findings:**
- Median ROI: 159.8% (successful projects, 24-month horizon)
- Failure rate: 17.5% (vs. 80-95% market average)
- Human-in-the-Loop adoption: 88.5%
- Median time-to-positive-ROI: 180-270 days depending on company size

## 1. Setup & Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

# Load dataset
df = pd.read_csv('ai_roi_dataset_200_deployments.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst rows:")
df.head()
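
Every later cell assumes that certain columns exist and that `failure` and `human_in_loop` are coded 0/1, since both are averaged to obtain rates. A minimal sketch of a schema check that could run right after loading is shown here. The column list is collected from the cells in this notebook, and `check_schema` is a hypothetical helper name, not part of the dataset repository.

```python
import pandas as pd

# Columns referenced by the analysis cells in this notebook.
REQUIRED_COLUMNS = [
    "year", "sector", "company_size", "failure", "failure_reason",
    "human_in_loop", "investment_eur", "annual_gain_eur", "roi_percent",
    "time_saved_hours_month", "days_diagnostic", "days_poc",
    "days_to_deployment", "days_to_positive_roi",
]


def check_schema(frame: pd.DataFrame) -> list[str]:
    """Return a list of readable problems; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Assumption: binary flags are coded 0/1, because later cells average them.
    for flag in ("failure", "human_in_loop"):
        if flag in frame.columns and not frame[flag].dropna().isin([0, 1]).all():
            problems.append(f"{flag} contains values other than 0/1")
    return problems
```

Calling `check_schema(df)` and stopping when the returned list is non-empty keeps a renamed or missing column from surfacing later as a `KeyError` in the middle of a plot cell.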

## 2. Descriptive Statistics

In [None]:
# Overall statistics
print("=== DATASET OVERVIEW ===")
print(f"\nTotal projects: {len(df)}")
print(f"Date range: {df['year'].min()} - {df['year'].max()}")
print(f"\nSuccess rate: {(1 - df['failure'].mean()):.1%}")
print(f"Failure rate: {df['failure'].mean():.1%}")
print(f"Human-in-the-Loop adoption: {df['human_in_loop'].mean():.1%}")

# Success subset
df_success = df[df['failure'] == 0]

print("\n=== ROI STATISTICS (successful projects) ===")
print(f"Median ROI: {df_success['roi_percent'].median():.1f}%")
print(f"Mean ROI: {df_success['roi_percent'].mean():.1f}%")
print(f"Q1 (25th percentile): {df_success['roi_percent'].quantile(0.25):.1f}%")
print(f"Q3 (75th percentile): {df_success['roi_percent'].quantile(0.75):.1f}%")
print(f"Min ROI: {df_success['roi_percent'].min():.1f}%")
print(f"Max ROI: {df_success['roi_percent'].max():.1f}%")
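
The median ROI above is a point estimate from a sample of at most 200 projects. A percentile bootstrap is one way to attach an uncertainty range to it. The sketch below is illustrative only: `bootstrap_median_ci`, the number of resamples, and the seed are assumptions, not part of the original analysis.

```python
import numpy as np


def bootstrap_median_ci(values, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing ROI values
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row per bootstrap replicate.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    medians = np.median(samples, axis=1)
    low, high = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Usage with the notebook's data (not executed here):
# low, high = bootstrap_median_ci(df_success["roi_percent"])
```

Reporting the interval next to the median makes it easier to judge whether differences between subgroups exceed sampling noise.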

## 3. Distribution Analysis

In [None]:
# Sector distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Pie chart - sectors
sector_counts = df['sector'].value_counts()
axes[0].pie(sector_counts, labels=sector_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Projects Distribution by Sector')

# Bar chart - company size
# Use the same PME / ETI / Grande order as the later sections
size_counts = df['company_size'].value_counts().reindex(['PME', 'ETI', 'Grande'])
axes[1].bar(size_counts.index, size_counts.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[1].set_title('Projects Distribution by Company Size')
axes[1].set_ylabel('Number of Projects')

plt.tight_layout()
plt.savefig('distribution_sectors_size.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: distribution_sectors_size.png")

## 4. ROI Analysis by Sector

In [None]:
# ROI by sector (success only)
roi_by_sector = df_success.groupby('sector')['roi_percent'].agg(['median', 'mean', 'min', 'max', 'count'])
roi_by_sector = roi_by_sector.sort_values('median', ascending=False)

print("=== ROI BY SECTOR (Successful Projects) ===")
print(roi_by_sector.round(1))

# Boxplot ROI by sector
plt.figure(figsize=(14, 6))
sector_order = roi_by_sector.index  # sectors sorted by median ROI, highest first
sns.boxplot(data=df_success, x='sector', y='roi_percent', order=sector_order)
plt.xticks(rotation=45, ha='right')
plt.title('ROI Distribution by Sector (Successful Projects)')
plt.ylabel('ROI (%)')
plt.xlabel('Sector')
plt.tight_layout()
plt.savefig('roi_by_sector_boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: roi_by_sector_boxplot.png")
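
The table and boxplot show that median ROI differs between sectors, but not whether those differences exceed what sampling variation would produce. A Kruskal-Wallis test is one rank-based option that does not assume normally distributed ROI. The sketch below uses `scipy.stats.kruskal`; the helper name `roi_differs_by_group` and the `min_n` threshold for dropping very small sectors are assumptions made for illustration.

```python
import pandas as pd
from scipy import stats


def roi_differs_by_group(frame, group_col="sector", value_col="roi_percent", min_n=5):
    """Kruskal-Wallis H test across groups with at least `min_n` observations."""
    groups = [
        g[value_col].dropna().to_numpy()
        for _, g in frame.groupby(group_col)
        if g[value_col].notna().sum() >= min_n
    ]
    if len(groups) < 2:
        raise ValueError("need at least two groups with enough observations")
    h_stat, p_value = stats.kruskal(*groups)
    return float(h_stat), float(p_value), len(groups)


# Usage with the notebook's data (not executed here):
# h_stat, p_value, n_groups = roi_differs_by_group(df_success)
```

A small p-value only says that at least one sector differs; it does not identify which one, so pairwise follow-up comparisons would need a multiple-testing correction.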

## 5. Investment & Gains Analysis

In [None]:
# Investment by company size
invest_by_size = df.groupby('company_size')['investment_eur'].agg(['median', 'mean', 'min', 'max', 'count'])
invest_by_size = invest_by_size.reindex(['PME', 'ETI', 'Grande'])

print("=== INVESTMENT BY COMPANY SIZE ===")
print(invest_by_size.round(0))

# Gains by company size (success only)
gains_by_size = df_success.groupby('company_size')['annual_gain_eur'].agg(['median', 'mean', 'min', 'max'])
gains_by_size = gains_by_size.reindex(['PME', 'ETI', 'Grande'])

print("\n=== ANNUAL GAINS BY COMPANY SIZE (Successful Projects) ===")
print(gains_by_size.round(0))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Investment
axes[0].bar(invest_by_size.index, invest_by_size['median'], color='steelblue')
axes[0].set_title('Median Investment by Company Size')
axes[0].set_ylabel('Investment (€)')
axes[0].ticklabel_format(style='plain', axis='y')

# Gains
axes[1].bar(gains_by_size.index, gains_by_size['median'], color='seagreen')
axes[1].set_title('Median Annual Gain by Company Size (Success)')
axes[1].set_ylabel('Annual Gain (€)')
axes[1].ticklabel_format(style='plain', axis='y')

plt.tight_layout()
plt.savefig('investment_gains_by_size.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: investment_gains_by_size.png")
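
The notebook reports `roi_percent` on a 24-month horizon but does not state how the column is derived from `investment_eur` and `annual_gain_eur`. The sketch below shows one common definition (net gain over the horizon divided by investment) purely as an assumption. `simple_roi_percent` is a hypothetical helper, and the dataset's own formula may differ, for example by including running costs.

```python
def simple_roi_percent(investment_eur, annual_gain_eur, horizon_months=24):
    """Net gain over the horizon divided by investment, in percent.

    Assumed definition for illustration; not confirmed by the dataset.
    """
    if investment_eur <= 0:
        raise ValueError("investment must be positive")
    total_gain = annual_gain_eur * horizon_months / 12
    return (total_gain - investment_eur) / investment_eur * 100


# Hypothetical project: 20,000 EUR invested, 26,000 EUR gained per year.
example_roi = simple_roi_percent(20_000, 26_000)  # 160.0
```

Comparing this value row by row with the stored `roi_percent` would show quickly whether the column follows this definition or another one.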

## 6. Deployment Durations Analysis

In [None]:
# Duration analysis by company size
duration_metrics = ['days_diagnostic', 'days_poc', 'days_to_deployment', 'days_to_positive_roi']
duration_by_size = df_success.groupby('company_size')[duration_metrics].median()
duration_by_size = duration_by_size.reindex(['PME', 'ETI', 'Grande'])

print("=== MEDIAN DURATIONS BY COMPANY SIZE (Successful Projects) ===")
print(duration_by_size.round(0))

# Stacked bar chart
plt.figure(figsize=(12, 6))
x = np.arange(len(duration_by_size.index))
width = 0.6

# days_to_deployment is treated as cumulative, so the deployment segment is what
# remains after diagnostic and POC. A difference of medians is only approximate,
# so it is clipped at zero to avoid drawing a negative bar.
diag = duration_by_size['days_diagnostic']
poc = duration_by_size['days_poc']
rollout = (duration_by_size['days_to_deployment'] - diag - poc).clip(lower=0)

plt.bar(x, diag, width, label='Diagnostic', color='#1f77b4')
plt.bar(x, poc, width, bottom=diag, label='POC', color='#ff7f0e')
plt.bar(x, rollout, width, bottom=diag + poc, label='Deployment', color='#2ca02c')

plt.xlabel('Company Size')
plt.ylabel('Days')
plt.title('Median Project Durations by Phase and Company Size')
plt.xticks(x, duration_by_size.index)
plt.legend()
plt.tight_layout()
plt.savefig('durations_by_size.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: durations_by_size.png")
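
The stacked chart subtracts median diagnostic and POC days from median `days_to_deployment`, which implies that `days_to_deployment` is cumulative from project start. Because a difference of medians is not the median of the differences, computing the remaining phase per project is a useful cross-check. The sketch below assumes that cumulative interpretation; `add_rollout_days`, `days_rollout`, and `phases_consistent` are hypothetical names.

```python
import pandas as pd


def add_rollout_days(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a per-project `days_rollout` column and flag inconsistent rows."""
    out = frame.copy()
    # Assumption: days_to_deployment already includes diagnostic and POC time.
    out["days_rollout"] = (
        out["days_to_deployment"] - out["days_diagnostic"] - out["days_poc"]
    )
    # A negative value means the cumulative assumption does not hold for that row.
    out["phases_consistent"] = out["days_rollout"] >= 0
    return out


# Usage with the notebook's data (not executed here):
# checked = add_rollout_days(df_success)
# checked.groupby("company_size")["days_rollout"].median()
```

If many rows are flagged as inconsistent, the three duration columns are probably independent phase lengths rather than cumulative, and the chart should stack them directly.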

## 7. Failure Analysis

In [None]:
df_failed = df[df['failure'] == 1]

print("=== FAILURE ANALYSIS ===")
print(f"\nTotal failures: {len(df_failed)} ({df['failure'].mean():.1%})")
print("\nFailure rate by sector:")
failure_by_sector = df.groupby('sector')['failure'].agg(['sum', 'mean']).sort_values('mean', ascending=False)
failure_by_sector.columns = ['Total Failures', 'Failure Rate']
failure_by_sector['Failure Rate'] = failure_by_sector['Failure Rate'].apply(lambda x: f"{x:.1%}")
print(failure_by_sector)

print("\nFailure reasons:")
failure_reasons = df_failed['failure_reason'].value_counts()
print(failure_reasons)

# Pie chart failure reasons
plt.figure(figsize=(10, 8))
plt.pie(failure_reasons, labels=failure_reasons.index, autopct='%1.1f%%', startangle=90)
plt.title(f'Failure Reasons Distribution (n={len(df_failed)})')
plt.tight_layout()
plt.savefig('failure_reasons.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: failure_reasons.png")
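
A failure rate of 17.5% on 200 projects corresponds to 35 failures, and per-sector rates rest on far fewer projects each. A Wilson score interval is one standard way to express the uncertainty of such a proportion. The sketch below uses only the standard library; `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Overall failure rate from the abstract: 35 failures out of 200 projects.
low, high = wilson_interval(35, 200)
```

For the overall rate, the interval runs from roughly 13% to 23%. Applied to a single sector with a handful of projects it becomes much wider, which is worth keeping in mind when ranking sectors by failure rate.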

## 8. Correlation Analysis

In [None]:
# Correlation matrix (numeric columns, success only)
numeric_cols = ['investment_eur', 'annual_gain_eur', 'roi_percent', 'time_saved_hours_month', 'days_to_deployment', 'days_to_positive_roi']
corr_matrix = df_success[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, square=True, linewidths=1)
plt.title('Correlation Matrix (Successful Projects)')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: correlation_matrix.png")

print("\n=== KEY CORRELATIONS ===")
print(f"Investment vs Annual Gain: {corr_matrix.loc['investment_eur', 'annual_gain_eur']:.3f}")
print(f"Investment vs ROI: {corr_matrix.loc['investment_eur', 'roi_percent']:.3f}")
print(f"Deployment Duration vs ROI: {corr_matrix.loc['days_to_deployment', 'roi_percent']:.3f}")
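
`DataFrame.corr()` returns Pearson coefficients without any significance information, and investment and gain figures are often heavily skewed. A rank-based Spearman correlation with its p-value is a possible complement. The sketch below relies on `scipy.stats.spearmanr`; the helper name `spearman_with_pvalue` is an assumption for illustration.

```python
import pandas as pd
from scipy import stats


def spearman_with_pvalue(frame, x_col, y_col):
    """Spearman rank correlation, its p-value, and the number of complete pairs."""
    pair = frame[[x_col, y_col]].dropna()
    rho, p_value = stats.spearmanr(pair[x_col], pair[y_col])
    return float(rho), float(p_value), len(pair)


# Usage with the notebook's data (not executed here):
# rho, p, n = spearman_with_pvalue(df_success, "investment_eur", "roi_percent")
```

Where Pearson and Spearman disagree strongly for a pair of columns, a few extreme projects are probably driving the Pearson value.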

## 9. Human-in-the-Loop Impact

In [None]:
# Compare success rates with/without Human-in-the-Loop
hitl_impact = df.groupby('human_in_loop')['failure'].agg(['count', 'mean'])
hitl_impact['success_rate'] = 1 - hitl_impact['mean']
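# Map labels by value so the result does not depend on group order or presence
hitl_impact = hitl_impact.rename(index={0: 'No HITL', 1: 'With HITL'})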
hitl_impact.columns = ['Total Projects', 'Failure Rate', 'Success Rate']

print("=== HUMAN-IN-THE-LOOP IMPACT ===")
print(hitl_impact.round(3))

# ROI comparison
roi_hitl = df_success.groupby('human_in_loop')['roi_percent'].agg(['median', 'mean', 'count'])
roi_hitl = roi_hitl.rename(index={0: 'No HITL', 1: 'With HITL'})
print("\n=== ROI COMPARISON (Successful Projects) ===")
print(roi_hitl.round(1))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Success rate
axes[0].bar(hitl_impact.index, hitl_impact['Success Rate'], color=['coral', 'seagreen'])
axes[0].set_title('Success Rate: With vs Without Human-in-the-Loop')
axes[0].set_ylabel('Success Rate')
axes[0].set_ylim(0, 1)
for i, v in enumerate(hitl_impact['Success Rate']):
    axes[0].text(i, v + 0.02, f"{v:.1%}", ha='center')

# ROI
axes[1].bar(roi_hitl.index, roi_hitl['median'], color=['coral', 'seagreen'])
axes[1].set_title('Median ROI: With vs Without Human-in-the-Loop')
axes[1].set_ylabel('ROI (%)')
for i, v in enumerate(roi_hitl['median']):
    axes[1].text(i, v + 5, f"{v:.1f}%", ha='center')

plt.tight_layout()
plt.savefig('hitl_impact.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: hitl_impact.png")
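
With 88.5% adoption, only about 23 of the 200 projects ran without Human-in-the-Loop, so the gap between the two success rates rests on a small comparison group. Fisher's exact test on the 2x2 table is one way to check whether the gap is compatible with chance. The sketch below uses `pandas.crosstab` and `scipy.stats.fisher_exact`; `hitl_failure_test` is a hypothetical helper, and it assumes both flags are coded 0/1.

```python
import pandas as pd
from scipy import stats


def hitl_failure_test(frame, group_col="human_in_loop", outcome_col="failure"):
    """Fisher's exact test on the HITL x failure contingency table."""
    table = pd.crosstab(frame[group_col], frame[outcome_col])
    # Force a full 2x2 layout even if one combination never occurs.
    table = table.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
    odds_ratio, p_value = stats.fisher_exact(table.to_numpy())
    return table, float(odds_ratio), float(p_value)


# Usage with the notebook's data (not executed here):
# table, odds_ratio, p_value = hitl_failure_test(df)
```

Even a significant result here describes an association: projects that adopted Human-in-the-Loop may differ from the others in sector, size, or year.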

## 10. Summary Statistics for Report

In [None]:
print("=" * 60)
print("SUMMARY STATISTICS FOR TECHNICAL REPORT")
print("=" * 60)

print("\n1. GLOBAL METRICS")
print(f"   - Total projects analyzed: {len(df)}")
print(f"   - Success rate: {(1 - df['failure'].mean()):.1%}")
print("   - Market benchmark failure rate: 80-95%")
print(f"   - Human-in-the-Loop adoption: {df['human_in_loop'].mean():.1%}")

print("\n2. ROI METRICS (Successful Projects)")
print(f"   - Median ROI: {df_success['roi_percent'].median():.1f}%")
print(f"   - Mean ROI: {df_success['roi_percent'].mean():.1f}%")
print(f"   - 25th percentile: {df_success['roi_percent'].quantile(0.25):.1f}%")
print(f"   - 75th percentile: {df_success['roi_percent'].quantile(0.75):.1f}%")

print("\n3. TOP PERFORMING SECTORS (by median ROI)")
top_sectors = df_success.groupby('sector')['roi_percent'].median().sort_values(ascending=False).head(3)
for sector, roi in top_sectors.items():
    print(f"   - {sector}: {roi:.1f}%")

print("\n4. INVESTMENT RANGES")
for size in ['PME', 'ETI', 'Grande']:
    inv_med = df[df['company_size'] == size]['investment_eur'].median()
    print(f"   - {size}: {inv_med:,.0f}€ (median)")

print("\n5. TIME TO POSITIVE ROI")
for size in ['PME', 'ETI', 'Grande']:
    roi_time = df_success[df_success['company_size'] == size]['days_to_positive_roi'].median()
    print(f"   - {size}: {roi_time:.0f} days")

print("\n" + "=" * 60)

## Conclusion

This analysis indicates:

1. **A substantially lower failure rate (17.5%)** than the 80-95% market benchmark. Given the selection bias noted below, this is consistent with an effective deployment methodology but does not establish it
2. **Strong ROI performance**, with a median of 159.8%, led by sectors such as Retail (240%), Finance (185%), and Manufacturing (170%)
3. **Human-in-the-Loop** adoption correlates with higher success rates; with 88.5% adoption, the comparison group without it is small (about 23 of 200 projects), so this is an association rather than a causal result
4. **Predictable deployment timelines** that vary by company size: PME (SMEs, 90-120 days), ETI (mid-sized companies, 180-270 days), Grande (large enterprises, 365-540 days)
5. **Investment efficiency**: median investments of 19k€ (PME), 84k€ (ETI), and 414k€ (Grande), with proportional gains

### Limitations & Biases

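- The sample reflects deployments run with an established methodology rather than a random sample of the market (selection bias)
- Primarily French B2B market (geographic limitation)
- 24-month ROI horizon (may underestimate long-term value)
- The dataset period (2022-2025) spans rapid GenAI evolution, so early and late projects may not be directly comparable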

### Citation

```
Atlan, D. (2025). AI ROI Dataset: 200 B2B Deployments Analysis (2022-2025). 
ENDKOO / HumaLoop. https://github.com/denisatlan/ai-roi-dataset
```