# OSE Dataset Visualization - New Data Format (CSV-based)

This notebook walks through each of the nine datasets extracted from the OSE project, loading them from the CSV files exported by the sector picker notebook.

Each dataset covers one category of company data and carries the same index columns (`company_name`, `siren`, `siret`), so the datasets can be joined and cross-referenced.

**Key difference from v2:** This notebook loads pre-extracted CSV files instead of running the extraction pipeline.
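
Because every dataset carries the same index columns, two of them can be combined with a plain `merge`. The sketch below uses two toy frames whose values are copied from the previews in this notebook. Treating `siren` as a string key and expecting a one-to-one match are assumptions, not guarantees of the export.

```python
import pandas as pd

# Toy stand-ins for two of the datasets (values taken from the previews in this notebook).
basic = pd.DataFrame({
    "company_name": ["COVI", "API TECH"],
    "siren": ["391892171", "451972483"],
    "ville": ["BRESSUIRE", "SEICHAMPS"],
})
workforce = pd.DataFrame({
    "siren": ["391892171", "451972483"],
    "effectif": [225.0, 375.0],
})

# validate="one_to_one" raises a MergeError if a SIREN is duplicated on either side,
# which catches accidental row multiplication early.
merged = basic.merge(workforce, on="siren", how="left", validate="one_to_one")
print(merged)
```

For the long-format datasets (KPI, signals, articles), where a company may have several rows, `validate="one_to_many"` would be the closer fit.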


In [1]:
# Configuration
import sys
from pathlib import Path
import os

# Add project root to path so we can import from src
# In Jupyter, __file__ is not available, so we use os.getcwd()
cwd = Path(os.getcwd())

# Check if we're in the project root (has src/ and notebooks/ directories)
if (cwd / 'src').exists() and (cwd / 'notebooks').exists():
    project_root = cwd
# Check if we're in notebooks/ directory
elif (cwd.parent / 'src').exists() and (cwd.parent / 'notebooks').exists():
    project_root = cwd.parent
# Fallback: try relative path from notebooks/
else:
    project_root = Path('..').resolve()

sys.path.insert(0, str(project_root))
print(f"Project root: {project_root}")
print(f"src exists: {(project_root / 'src').exists()}")

# Data location and random seed
OUTPUT_DIR = project_root / 'src' / 'ose_core' / 'data' / 'extracted_datasets'
SEED = 42

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
np.random.seed(SEED)

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Configuration loaded successfully!")
print(f"Output directory: {OUTPUT_DIR}")


Project root: /Users/jlb/Documents/Python Course/git_OSE/ose-main
src exists: True
Configuration loaded successfully!
Output directory: /Users/jlb/Documents/Python Course/git_OSE/ose-main/src/ose_core/data/extracted_datasets


In [2]:
# Load datasets from CSV files
print("Loading datasets from CSV files...")
print("="*80)

data = {}
dataset_files = {
    '01_company_basic_info': '01_company_basic_info.csv',
    '02_financial_data': '02_financial_data.csv',
    '03_workforce_data': '03_workforce_data.csv',
    '04_company_structure': '04_company_structure.csv',
    '05_classification_flags': '05_classification_flags.csv',
    '06_contact_metrics': '06_contact_metrics.csv',
    '07_kpi_data': '07_kpi_data.csv',
    '08_signals': '08_signals.csv',
    '09_articles': '09_articles.csv'
}

for name, filename in dataset_files.items():
    filepath = OUTPUT_DIR / filename
    if filepath.exists():
        try:
            df = pd.read_csv(filepath, low_memory=False, na_values=[''])
            data[name] = df
            print(f"  ✓ {name}: {df.shape}")
        except Exception as e:
            print(f"  ✗ {name}: Error loading - {e}")
            data[name] = pd.DataFrame()
    else:
        print(f"  ⚠ {name}: File not found - {filepath}")
        data[name] = pd.DataFrame()

print(f"\n✅ Loaded {sum(not df.empty for df in data.values())} datasets")
print(f"   Total datasets: {len(data)}")


Loading datasets from CSV files...
  ✓ 01_company_basic_info: (18116, 17)
  ✓ 02_financial_data: (18116, 12)
  ✓ 03_workforce_data: (18116, 8)
  ✓ 04_company_structure: (18116, 10)
  ✓ 05_classification_flags: (18116, 17)
  ✓ 06_contact_metrics: (18116, 10)
  ✓ 07_kpi_data: (3779, 28)
  ✓ 08_signals: (2704187, 17)
  ✓ 09_articles: (907270, 15)

✅ Loaded 9 datasets
   Total datasets: 9
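
The previews in the sections below show `siret` rendered as a float (for example `30568940000000.0`) and a seven-digit `siren` (`7280365`), which suggests that type inference is turning the identifiers into numbers. If the CSV files still hold the full digits, one possible remedy is to read both identifiers as strings and left-pad them to their fixed widths (9 digits for SIREN, 14 for SIRET). The sketch below runs on a made-up two-row sample, not on the real files.

```python
import io
import pandas as pd

# Hypothetical sample standing in for one exported CSV file (values are invented).
sample_csv = io.StringIO(
    "company_name,siren,siret\n"
    "EXAMPLE A,007280365,00728036500012\n"
    "EXAMPLE B,305689432,\n"
)

# Read identifiers as text so leading zeros and all digits are preserved.
ID_DTYPES = {"siren": "string", "siret": "string"}
ids = pd.read_csv(sample_csv, dtype=ID_DTYPES, na_values=[""])

# Left-pad to the fixed identifier widths; missing values stay missing.
ids["siren"] = ids["siren"].str.zfill(9)
ids["siret"] = ids["siret"].str.zfill(14)
print(ids)
```

If the export itself already wrote rounded floats, padding cannot recover the lost digits and the fix would have to happen in the sector picker notebook.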


## 1. Company Basic Info

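This dataset contains basic company information:
- Company name and identifiers (`siren`, `siret`), plus the registered name (`raison_sociale`)
- Location: address, postal code (`cp`), city (`ville`), and department (`departement`, empty in this export)
- Activity: NAF code and label (`naf_code`, `naf_label`) and a free-text activity summary (`resume_activite`)
- Record metadata: legal form (`juridic_form`) and timestamps (`last_modified`, `processedAt`, `updatedAt`)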

**Use case:** Company identification, geographic distribution, basic company profiling.


In [3]:
# Load company basic info from CSV
df_basic = data['01_company_basic_info']

print(f"Dataset shape: {df_basic.shape}")
print(f"\nColumns: {list(df_basic.columns)}")
display(df_basic.head(10))


Dataset shape: (18116, 17)

Columns: ['company_name', 'siren', 'siret', 'address', 'cp', 'departement', 'departement_id', 'juridic_form', 'last_modified', 'naf_code', 'naf_label', 'processedAt', 'raison_sociale', 'raison_sociale_keyword', 'resume_activite', 'updatedAt', 'ville']


Unnamed: 0,company_name,siren,siret,address,cp,departement,departement_id,juridic_form,last_modified,naf_code,naf_label,processedAt,raison_sociale,raison_sociale_keyword,resume_activite,updatedAt,ville
0,AVI-CHARENTE,305689432,30568940000000.0,9 rue Galilee ZAC DE BELLE AIRE 17440 AYTRE,17440,,,,2025-09-13T09:38:58+02:00,,,2014-10-13T08:16:59+02:00,AVI-CHARENTE,,L'entreprise se spécialise dans la fabrication...,2025-09-13T09:38:58+02:00,AYTRE
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,rue Pierre Pagenaud ZI SUD 86500 MONTMORILLON,86500,,,,2025-09-13T09:19:15+02:00,,,2015-05-04T09:32:37+02:00,SOCIETE D'ABATTAGE DE MONTMORILLON,ABATTOIR DE MONTMORILLON - 86500 - MONTMORILLON,Cette entreprise se consacre à l'élevage et à ...,2025-09-13T09:19:15+02:00,MONTMORILLON
2,COVI,391892171,39189220000000.0,boulevard du Marechal Foch 79300 BRESSUIRE,79300,,,,2025-09-13T06:55:55+02:00,,,2013-09-12T14:20:49+02:00,COVI,COVI - 79300 - BRESSUIRE,"Fabricant de plats cuisinés, conserves de vian...",2025-09-13T06:55:55+02:00,BRESSUIRE
3,LE COQ NOIR,316203942,31620390000000.0,70 chemin des Jonquiers 84800 L'ISLE-SUR-LA-SO...,84800,,,,2025-09-13T13:09:27+02:00,,,2015-09-24T11:17:36+02:00,LE COQ NOIR,LE COQ NOIR - 84800 - L'ISLE-SUR-LA-SORGUE,Cette entreprise se consacre à la conception e...,2025-09-13T13:09:27+02:00,L'ISLE-SUR-LA-SORGUE
4,API TECH,451972483,45197250000000.0,11 avenue du General de Gaulle 54280 SEICHAMPS,54280,,,,2025-09-13T12:45:39+02:00,,,2018-10-26T14:26:05+02:00,API TECH,,Cette entreprise conçoit et fabrique des distr...,2025-09-13T12:45:39+02:00,SEICHAMPS
5,SOREAL-ILOU,478608037,47860800000000.0,Bois de Teillay PARC D ACTIVITES 35150 BRIE,35150,,,,2025-09-13T11:24:09+02:00,,,2015-04-02T10:15:27+02:00,SOREAL-ILOU,SOREAL ILOU - 35150 - BRIE,Cette entreprise conçoit et fabrique des recet...,2025-09-13T11:24:09+02:00,BRIE
6,SPECIALITES PET FOOD,560500498,56050050000000.0,ZA du Gohelis ZA DU GOHELIS 56250 ELVEN,56250,,,,2025-09-13T11:04:33+02:00,,,2015-02-16T10:34:31+01:00,SPECIALITES PET FOOD,SPECIALITES PET FOOD - 56250 - ELVEN,Cette entreprise développe des formulations d'...,2025-09-13T11:04:33+02:00,ELVEN
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,Espace Docteur Zuckermann MEZIDON 14140 MEZIDO...,14140,,,,2025-09-13T11:55:00+02:00,,,2015-09-10T09:39:03+02:00,WHAT'S COOKING FRANCE,STEFANO TOSELLI - 14270 - MEZIDON CANON,Cette entreprise se spécialise dans la product...,2025-09-13T11:55:00+02:00,MEZIDON VALLEE D'AUGE
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,avenue Nessel 68570 SOULTZMATT,68570,,,,2025-09-13T14:39:47+02:00,,,2015-11-30T14:16:00+01:00,S.E.M. DES SOURCES DE SOULTZMATT,SEM DES SOURCES DE SOULTZMATT - 68570 - SOULTZ...,Cette entreprise se consacre à la production e...,2025-09-13T14:39:47+02:00,SOULTZMATT
9,ATLAGEL,7280365,728036500000.0,rue Nicolas Appert ZAC DE LA BROSSE 44400 REZE,44400,,,,2025-09-13T14:37:20+02:00,,,2017-04-21T10:23:37+02:00,ATLAGEL,,Cette entreprise se consacre à la distribution...,2025-09-13T14:37:20+02:00,REZE


In [4]:
# Summary statistics
print(f"\nSummary:")
print(f"- Total companies: {len(df_basic)}")
if 'siret' in df_basic.columns:
    print(f"- Companies with SIRET: {df_basic['siret'].notna().sum()} ({df_basic['siret'].notna().sum()/len(df_basic)*100:.1f}%)")
if 'siren' in df_basic.columns:
    print(f"- Companies with SIREN: {df_basic['siren'].notna().sum()} ({df_basic['siren'].notna().sum()/len(df_basic)*100:.1f}%)")

if 'departement' in df_basic.columns:
    print(f"- Unique departments: {df_basic['departement'].nunique()}")
if 'resume_activite' in df_basic.columns:
    print(f"- Companies with activity description: {df_basic['resume_activite'].notna().sum()}")



Summary:
- Total companies: 18116
- Companies with SIRET: 17650 (97.4%)
- Companies with SIREN: 18116 (100.0%)
- Unique departments: 0
- Companies with activity description: 18116
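
The summary above reports zero unique departments, so `departement` is empty in this export, while `cp` holds the postal code. As a rough workaround, the department code can usually be read off the postal code prefix. The helper below is hypothetical and approximate: it uses three digits for overseas codes starting with 97 or 98, returns `20` for Corsica without splitting it into 2A and 2B, and ignores the few communes whose postal code belongs to a neighbouring department.

```python
import pandas as pd

def department_from_postal_code(cp):
    """Approximate a French department code from a postal code (hypothetical helper)."""
    if pd.isna(cp):
        return pd.NA
    # Postal codes may arrive as int, float ("17440.0") or str; normalise to 5 digits.
    code = str(cp).split(".")[0].strip().zfill(5)
    if len(code) != 5 or not code.isdigit():
        return pd.NA
    # Overseas departments and territories use a three-digit prefix.
    if code.startswith(("97", "98")):
        return code[:3]
    return code[:2]

postal_codes = pd.Series([17440, 86500.0, "01000", None, 97400, 1000], dtype=object)
departments = postal_codes.map(department_from_postal_code)
print(departments.tolist())
```

With a column derived this way, the commented-out department chart in the next cell would have data to plot.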


In [5]:
# # Visualize department distribution
# if 'departement' in df_basic.columns and df_basic['departement'].notna().any():
#     dept_counts = df_basic['departement'].value_counts().head(15)

#     plt.figure(figsize=(12, 6))
#     dept_counts.plot(kind='bar', color='steelblue')
#     plt.title('Top 15 Departments by Number of Companies', fontsize=14, fontweight='bold')
#     plt.xlabel('Department Code', fontsize=12)
#     plt.ylabel('Number of Companies', fontsize=12)
#     plt.xticks(rotation=45)
#     plt.tight_layout()
#     plt.show()

#     print(f"\nTop 5 departments:")
#     print(dept_counts.head())


## 2. Financial Data

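This dataset contains financial metrics including:
- Revenue from the annual accounts (`caBilan`)
- Consolidated revenue (`caConsolide`)
- Group revenue (`caGroupe`)
- Equity (`fondsPropres`), operating result (`resultatExploitation`), and net result (`resultatNet`)
- Revenue bracket codes (`trancheCaBilan`, `trancheCaConsolide`) and the closing date (`dateConsolide`)

The yearly KPI financial metrics (2014-2025) are not columns of this file; they appear in the KPI dataset (section 7).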

**Use case:** Financial analysis, revenue trends, profitability assessment.


In [6]:
# Load financial data from CSV
df_financial = data['02_financial_data']
print(f"Original Shape: {df_financial.shape}")
df_financial.head(10)


Original Shape: (18116, 12)


Unnamed: 0,company_name,siren,siret,caBilan,caConsolide,caGroupe,dateConsolide,fondsPropres,resultatExploitation,resultatNet,trancheCaBilan,trancheCaConsolide
0,AVI-CHARENTE,305689432,30568940000000.0,67014460.0,,,2023-12-31T09:38:58+01:00,2484255.0,175778.0,34403.0,15.0,
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,3565501.0,,,2023-12-31T09:19:15+01:00,442654.0,-213424.0,-269464.0,12.0,
2,COVI,391892171,39189220000000.0,76888444.0,,,2019-03-31T06:55:55+02:00,23423096.0,134114.0,619734.0,15.0,
3,LE COQ NOIR,316203942,31620390000000.0,4102063.0,,,2024-03-31T13:09:26+02:00,1145201.0,483701.0,29766.0,12.0,
4,API TECH,451972483,45197250000000.0,67275883.0,,,2023-12-31T12:45:38+01:00,2056526.0,1615196.0,10224378.0,15.0,
5,SOREAL-ILOU,478608037,47860800000000.0,24629381.0,,,2017-12-31T11:24:09+01:00,3959054.0,663445.0,457568.0,14.0,14.0
6,SPECIALITES PET FOOD,560500498,56050050000000.0,133000000.0,,,1970-01-01T00:00:00+01:00,,,,15.0,
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,100956809.0,,,2023-12-31T11:55:00+01:00,16900416.0,4544804.0,3781230.0,15.0,
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,10535729.0,,,2020-12-31T14:39:47+01:00,5824418.0,161568.0,220150.0,14.0,
9,ATLAGEL,7280365,728036500000.0,47477328.0,,,2023-12-31T14:37:19+01:00,2572245.0,1350165.0,843777.0,14.0,


In [7]:
print(f"Dataset shape: {df_financial.shape}")
print(f"\nMain financial columns:")

# Select main columns
index_cols = ['company_name', 'siren', 'siret']
main_cols = ['caConsolide', 'caGroupe', 'resultatExploitation', 'dateConsolide']
available_cols = [col for col in main_cols if col in df_financial.columns]

if available_cols:
    display(df_financial[index_cols + available_cols].head(10))


Dataset shape: (18116, 12)

Main financial columns:


Unnamed: 0,company_name,siren,siret,caConsolide,caGroupe,resultatExploitation,dateConsolide
0,AVI-CHARENTE,305689432,30568940000000.0,,,175778.0,2023-12-31T09:38:58+01:00
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,,,-213424.0,2023-12-31T09:19:15+01:00
2,COVI,391892171,39189220000000.0,,,134114.0,2019-03-31T06:55:55+02:00
3,LE COQ NOIR,316203942,31620390000000.0,,,483701.0,2024-03-31T13:09:26+02:00
4,API TECH,451972483,45197250000000.0,,,1615196.0,2023-12-31T12:45:38+01:00
5,SOREAL-ILOU,478608037,47860800000000.0,,,663445.0,2017-12-31T11:24:09+01:00
6,SPECIALITES PET FOOD,560500498,56050050000000.0,,,,1970-01-01T00:00:00+01:00
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,,,4544804.0,2023-12-31T11:55:00+01:00
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,,,161568.0,2020-12-31T14:39:47+01:00
9,ATLAGEL,7280365,728036500000.0,,,1350165.0,2023-12-31T14:37:19+01:00
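
In the preview above, one row has `dateConsolide` set to `1970-01-01T00:00:00+01:00` while its financial values are missing. That looks like a Unix-epoch placeholder rather than a real closing date, although this is an assumption about the source system. A minimal sketch of parsing the column and blanking such placeholders:

```python
import pandas as pd

# Two values copied from the preview above, plus a missing entry.
raw_dates = pd.Series([
    "2023-12-31T09:38:58+01:00",
    "1970-01-01T00:00:00+01:00",
    None,
])

# utc=True converts the mixed +01:00 / +02:00 offsets to a single timezone-aware dtype.
parsed = pd.to_datetime(raw_dates, errors="coerce", utc=True)

# Assumption: any date before 1971 is an epoch placeholder, not a real closing date.
is_placeholder = parsed < pd.Timestamp("1971-01-01", tz="UTC")
closing_date = parsed.mask(is_placeholder)
print(closing_date)
```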


In [8]:
# # Filter out zeros and negative values for better visualization
# # Convert to numeric first (handles string values)
# if 'caConsolide' in df_financial.columns:
#     ca_consolide = pd.to_numeric(df_financial['caConsolide'], errors='coerce').replace(0, np.nan).dropna()
# else:
#     ca_consolide = pd.Series(dtype=float)

# if 'caGroupe' in df_financial.columns:
#     ca_groupe = pd.to_numeric(df_financial['caGroupe'], errors='coerce').replace(0, np.nan).dropna()
# else:
#     ca_groupe = pd.Series(dtype=float)

# # Visualize financial metrics distribution
# fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# if len(ca_consolide) > 0:
#     axes[0].hist(np.log10(ca_consolide + 1), bins=30, color='steelblue', edgecolor='black')
#     axes[0].set_title('Distribution of Consolidated Revenue (log scale)', fontweight='bold')
#     axes[0].set_xlabel('Log10(Revenue + 1)')
#     axes[0].set_ylabel('Frequency')

# if len(ca_groupe) > 0:
#     axes[1].hist(np.log10(ca_groupe + 1), bins=30, color='coral', edgecolor='black')
#     axes[1].set_title('Distribution of Group Revenue (log scale)', fontweight='bold')
#     axes[1].set_xlabel('Log10(Revenue + 1)')
#     axes[1].set_ylabel('Frequency')

# plt.tight_layout()
# plt.show()

# # Summary statistics
# print("\nFinancial Summary:")
# print(f"Companies with consolidated revenue data: {len(ca_consolide)}")
# print(f"Companies with group revenue data: {len(ca_groupe)}")

# if len(ca_consolide) > 0:
#     print(f"\nConsolidated Revenue:")
#     print(f"  Mean: {ca_consolide.mean():,.0f}")
#     print(f"  Median: {ca_consolide.median():,.0f}")
#     print(f"  Max: {ca_consolide.max():,.0f}")


## 3. Workforce Data

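This dataset contains employee and workforce information:
- `effectif`: Workforce
- `effectifConsolide`: Consolidated workforce
- `effectifEstime`: Estimated workforce (not present in this export)
- `effectifGroupe`: Group workforce
- `trancheEffectifConsolide`, `trancheEffectifPrecis`: Workforce bracket codes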

**Use case:** Company size analysis, workforce trends, employee count comparisons.


In [9]:
# Load workforce data from CSV
df_workforce = data['03_workforce_data']
df_workforce.head(10)


Unnamed: 0,company_name,siren,siret,effectif,effectifConsolide,effectifGroupe,trancheEffectifConsolide,trancheEffectifPrecis
0,AVI-CHARENTE,305689432,30568940000000.0,225.0,,,,27.0
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,35.0,,,,25.0
2,COVI,391892171,39189220000000.0,225.0,,,,27.0
3,LE COQ NOIR,316203942,31620390000000.0,35.0,,,,25.0
4,API TECH,451972483,45197250000000.0,375.0,,,,28.0
5,SOREAL-ILOU,478608037,47860800000000.0,150.0,,,27.0,27.0
6,SPECIALITES PET FOOD,560500498,56050050000000.0,375.0,,,,28.0
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,225.0,,,,27.0
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,34.0,,,,25.0
9,ATLAGEL,7280365,728036500000.0,150.0,,,,27.0


In [10]:
# Summary
print(f"Dataset shape: {df_workforce.shape}")
print(f"\nColumns: {list(df_workforce.columns)}")

print(f"\nWorkforce Summary:")
workforce_cols = ['effectif', 'effectifConsolide', 'effectifEstime', 'effectifGroupe']
for col in workforce_cols:
    if col in df_workforce.columns:
        # Convert to numeric first (handles string values)
        col_numeric = pd.to_numeric(df_workforce[col], errors='coerce')
        non_zero = (col_numeric > 0).sum()
        if non_zero > 0:
            mean_val = col_numeric[col_numeric > 0].mean()
            print(f"{col}: {non_zero} companies with data, mean: {mean_val:.1f}")


Dataset shape: (18116, 8)

Columns: ['company_name', 'siren', 'siret', 'effectif', 'effectifConsolide', 'effectifGroupe', 'trancheEffectifConsolide', 'trancheEffectifPrecis']

Workforce Summary:
effectif: 12267 companies with data, mean: 63.3
effectifConsolide: 370 companies with data, mean: 5398.3
effectifGroupe: 370 companies with data, mean: 5398.3
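
The summary above reports the same count and the same mean for `effectifConsolide` and `effectifGroupe`, which raises the question of whether the two columns are duplicates. A quick way to check, sketched on toy data with a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def columns_identical(df, left, right):
    """Return True when two columns hold the same numbers, counting NaN == NaN as equal."""
    a = pd.to_numeric(df[left], errors="coerce")
    b = pd.to_numeric(df[right], errors="coerce")
    both_missing = a.isna() & b.isna()
    return bool(((a == b) | both_missing).all())

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "effectif": [35.0, 120.0, 225.0],
    "effectifConsolide": [np.nan, 120.0, 5400.0],
    "effectifGroupe": [np.nan, 120.0, 5400.0],
})

print(columns_identical(toy, "effectifConsolide", "effectifGroupe"))  # True
print(columns_identical(toy, "effectif", "effectifGroupe"))           # False
```

Running the same helper on `df_workforce` would show whether one of the two columns can be dropped.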


In [11]:
# # Visualize workforce distribution
# workforce_cols = ['effectif', 'effectifConsolide', 'effectifEstime', 'effectifGroupe']
# available_cols = [col for col in workforce_cols if col in df_workforce.columns]

# if available_cols:
#     fig, axes = plt.subplots(2, 2, figsize=(14, 10))
#     axes = axes.flatten()

#     for idx, col in enumerate(workforce_cols):
#         if col in df_workforce.columns:
#             # Convert to numeric first (handles string values)
#             col_numeric = pd.to_numeric(df_workforce[col], errors='coerce')
#             data_col = col_numeric[col_numeric > 0]
#             if len(data_col) > 0:
#                 axes[idx].hist(data_col, bins=30, color=plt.cm.viridis(idx/len(workforce_cols)), edgecolor='black')
#                 axes[idx].set_title(f'{col} Distribution', fontweight='bold')
#                 axes[idx].set_xlabel('Number of Employees')
#                 axes[idx].set_ylabel('Frequency')
#                 axes[idx].set_yscale('log')
#             else:
#                 axes[idx].text(0.5, 0.5, 'No data', ha='center', va='center', transform=axes[idx].transAxes)
#                 axes[idx].set_title(f'{col} Distribution', fontweight='bold')
#         else:
#             axes[idx].text(0.5, 0.5, 'Column not found', ha='center', va='center', transform=axes[idx].transAxes)
#             axes[idx].set_title(f'{col} Distribution', fontweight='bold')

#     plt.tight_layout()
#     plt.show()


## 4. Company Structure

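This dataset contains organizational structure information:
- Number of direct subsidiaries (`nbFilialesDirectes`, not present in this export)
- Number of secondary establishments (`nbEtabSecondaire`, with the flag `hasEtabSecondaire`)
- Number of brands (`nbMarques`)
- Number of shareholders (`nbActionnaires`)
- Group ownership: a flag (`hasGroupOwner`) and the owner's identity (`groupOwnerSiren`, `groupOwnerSocialName`)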

**Use case:** Understanding company complexity, group structures, organizational analysis.


In [12]:
# Load company structure data from CSV
df_structure = data['04_company_structure']

print(f"Dataset shape: {df_structure.shape}")
display(df_structure.head(10))


Dataset shape: (18116, 10)


Unnamed: 0,company_name,siren,siret,groupOwnerSiren,groupOwnerSocialName,hasEtabSecondaire,hasGroupOwner,nbActionnaires,nbEtabSecondaire,nbMarques
0,AVI-CHARENTE,305689432,30568940000000.0,,,False,False,1.0,,
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,,,False,False,,,
2,COVI,391892171,39189220000000.0,,,True,False,,4.0,19.0
3,LE COQ NOIR,316203942,31620390000000.0,848003281,GROUPE NATIMPACT,False,True,,,
4,API TECH,451972483,45197250000000.0,,,True,False,1.0,2.0,2.0
5,SOREAL-ILOU,478608037,47860800000000.0,442449229,YDEO,True,True,1.0,1.0,7.0
6,SPECIALITES PET FOOD,560500498,56050050000000.0,ETR_000149258,Symrise,False,True,1.0,,10.0
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,ETR_000387050,What's Cooking,False,True,,,9.0
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,,,False,False,2.0,,21.0
9,ATLAGEL,7280365,728036500000.0,,,False,False,,,


In [13]:
# Summary
print(f"\nStructure Summary:")
# Convert to numeric first (handles string values)
if 'nbFilialesDirectes' in df_structure.columns:
    nb_filiales = pd.to_numeric(df_structure['nbFilialesDirectes'], errors='coerce')
    print(f"Companies with subsidiaries: {(nb_filiales > 0).sum()}")
if 'nbEtabSecondaire' in df_structure.columns:
    nb_etab = pd.to_numeric(df_structure['nbEtabSecondaire'], errors='coerce')
    print(f"Companies with secondary establishments: {(nb_etab > 0).sum()}")
if 'nbMarques' in df_structure.columns:
    nb_marques = pd.to_numeric(df_structure['nbMarques'], errors='coerce')
    print(f"Companies with brands: {(nb_marques > 0).sum()}")
if 'hasGroupOwner' in df_structure.columns:
    print(f"Companies with group owner: {df_structure['hasGroupOwner'].sum()}")
if 'appartient_groupe' in df_structure.columns:
    print(f"Companies belonging to group: {df_structure['appartient_groupe'].sum()}")



Structure Summary:
Companies with secondary establishments: 2425
Companies with brands: 2115
Companies with group owner: 1543
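
The preview of this dataset shows two kinds of values in `groupOwnerSiren`: nine-digit SIREN numbers and identifiers prefixed with `ETR_`. The prefix may denote owners without a French SIREN, but that reading is an assumption. A small sketch that separates the two cases, using rows copied from the preview:

```python
import pandas as pd

owners = pd.DataFrame({
    "company_name": ["LE COQ NOIR", "SPECIALITES PET FOOD", "AVI-CHARENTE"],
    "groupOwnerSiren": ["848003281", "ETR_000149258", None],
})

owner_id = owners["groupOwnerSiren"].astype("string")

# Label each row by the shape of its owner identifier; missing owners stay "none".
owners["owner_kind"] = "none"
owners.loc[owner_id.str.fullmatch(r"\d{9}", na=False), "owner_kind"] = "siren"
owners.loc[owner_id.str.startswith("ETR_", na=False), "owner_kind"] = "non_siren_id"
print(owners)
```

Only the rows labelled `siren` can be joined back to the other datasets on `siren`.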


In [14]:
# # Visualize company structure
# fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# # Convert to numeric first (handles string values)
# if 'nbFilialesDirectes' in df_structure.columns:
#     nb_filiales = pd.to_numeric(df_structure['nbFilialesDirectes'], errors='coerce')
#     subs_data = nb_filiales[nb_filiales > 0]
#     if len(subs_data) > 0:
#         axes[0].hist(subs_data, bins=20, color='steelblue', edgecolor='black')
#         axes[0].set_title('Distribution of Direct Subsidiaries', fontweight='bold')
#         axes[0].set_xlabel('Number of Subsidiaries')
#         axes[0].set_ylabel('Frequency')

# if 'nbEtabSecondaire' in df_structure.columns:
#     nb_etab = pd.to_numeric(df_structure['nbEtabSecondaire'], errors='coerce')
#     etab_data = nb_etab[nb_etab > 0]
#     if len(etab_data) > 0:
#         axes[1].hist(etab_data, bins=20, color='coral', edgecolor='black')
#         axes[1].set_title('Distribution of Secondary Establishments', fontweight='bold')
#         axes[1].set_xlabel('Number of Establishments')
#         axes[1].set_ylabel('Frequency')

# if 'nbMarques' in df_structure.columns:
#     nb_marques = pd.to_numeric(df_structure['nbMarques'], errors='coerce')
#     brands_data = nb_marques[nb_marques > 0]
#     if len(brands_data) > 0:
#         axes[2].hist(brands_data, bins=20, color='green', edgecolor='black')
#         axes[2].set_title('Distribution of Brands', fontweight='bold')
#         axes[2].set_xlabel('Number of Brands')
#         axes[2].set_ylabel('Frequency')

# plt.tight_layout()
# plt.show()


## 5. Classification Flags

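This dataset contains boolean flags that classify each company: business model (`entreprise_b2b`, `entreprise_b2c`, `site_ecommerce`), company type (`startup`, `fintech`, `entreprise_biotech_medtech`, `entreprise_familiale`, `cac40`), status (`radiee`, `risk`), and presence indicators (`hasBodacc`, `hasBrevets`, `hasMarques`, `hasESV1Contacts`).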

**Use case:** Company categorization, filtering, and segmentation analysis.


In [15]:
# Load classification flags data from CSV
df_flags = data['05_classification_flags']

print(f"Dataset shape: {df_flags.shape}")
print(f"\nColumns: {list(df_flags.columns)}")
display(df_flags.head(10))


Dataset shape: (18116, 17)

Columns: ['company_name', 'siren', 'siret', 'cac40', 'entreprise_b2b', 'entreprise_b2c', 'entreprise_biotech_medtech', 'entreprise_familiale', 'fintech', 'hasBodacc', 'hasBrevets', 'hasESV1Contacts', 'hasMarques', 'radiee', 'risk', 'site_ecommerce', 'startup']


Unnamed: 0,company_name,siren,siret,cac40,entreprise_b2b,entreprise_b2c,entreprise_biotech_medtech,entreprise_familiale,fintech,hasBodacc,hasBrevets,hasESV1Contacts,hasMarques,radiee,risk,site_ecommerce,startup
0,AVI-CHARENTE,305689432,30568940000000.0,False,True,False,False,False,False,False,False,True,False,False,False,False,False
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,False,False,False,False,False,False,False,False,True,False,False,False,False,False
2,COVI,391892171,39189220000000.0,False,True,False,False,False,False,False,False,False,True,False,False,False,False
3,LE COQ NOIR,316203942,31620390000000.0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,API TECH,451972483,45197250000000.0,False,True,True,False,False,False,True,False,True,True,False,False,False,False
5,SOREAL-ILOU,478608037,47860800000000.0,False,True,False,False,False,False,True,True,False,True,False,False,True,False
6,SPECIALITES PET FOOD,560500498,56050050000000.0,False,False,False,False,False,False,False,False,True,True,False,False,False,False
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,False,False,False,False,False,False,False,False,True,True,False,False,True,False
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,False,False,False,False,False,False,True,False,False,True,False,False,False,False
9,ATLAGEL,7280365,728036500000.0,False,False,False,False,False,False,False,False,True,False,False,False,False,False
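
For segmentation, it helps to know how often each flag is set. The sketch below counts `True` values per flag and their share of all rows. It runs on a toy frame built from the first four preview rows and three of the flag columns; treating missing flags as `False` is an assumption.

```python
import pandas as pd

# First four preview rows, restricted to three flags for brevity.
flags = pd.DataFrame({
    "company_name": ["AVI-CHARENTE", "SOCIETE D'ABATTAGE DE MONTMORILLON", "COVI", "LE COQ NOIR"],
    "entreprise_b2b": [True, False, True, True],
    "hasMarques": [False, False, True, False],
    "site_ecommerce": [False, False, False, True],
})

id_cols = ["company_name", "siren", "siret"]
flag_cols = [col for col in flags.columns if col not in id_cols]

# Assumption: a missing flag is treated as False.
bool_flags = flags[flag_cols].fillna(False).astype(bool)

flag_summary = pd.DataFrame({
    "n_true": bool_flags.sum(),
    "share_true": bool_flags.mean(),
}).sort_values("n_true", ascending=False)
print(flag_summary)
```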


## 6. Contact Metrics

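This dataset contains contact details and online presence for each company: a contact email (`emailContact`), a phone number (`telephoneNumber`), a website (`webSite`), social media URLs (`urlFacebook`, `urlLinkedin`, `urlTwitter`), and the number of known contacts (`nbContacts`).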

**Use case:** Contact analysis, communication patterns, engagement metrics.


In [16]:
# Load contact metrics data from CSV
df_contact = data['06_contact_metrics']

print(f"Dataset shape: {df_contact.shape}")
print(f"\nColumns: {list(df_contact.columns)}")
display(df_contact.head(10))


Dataset shape: (18116, 10)

Columns: ['company_name', 'siren', 'siret', 'emailContact', 'nbContacts', 'telephoneNumber', 'urlFacebook', 'urlLinkedin', 'urlTwitter', 'webSite']


Unnamed: 0,company_name,siren,siret,emailContact,nbContacts,telephoneNumber,urlFacebook,urlLinkedin,urlTwitter,webSite
0,AVI-CHARENTE,305689432,30568940000000.0,contact@avi-charente.fr,0.0,33586301140,https://fr-fr.facebook.com/avicharente/,https://www.linkedin.com/company/avicharente/,,http://www.avi-charente.fr/
1,SOCIETE D'ABATTAGE DE MONTMORILLON,752129643,75212960000000.0,,0.0,33549910336,,https://www.linkedin.com/company/t-rh%C3%A9a/,,https://www.t-rhea.fr/
2,COVI,391892171,39189220000000.0,export@covi.com,6.0,33549740377,https://www.facebook.com/covi.groupe,https://www.linkedin.com/company/covi-sas/,,https://www.covi.com/
3,LE COQ NOIR,316203942,31620390000000.0,info@le-coq-noir.com,0.0,33490385964,https://www.facebook.com/pages/Le-Voyage-de-Ma...,https://www.linkedin.com/company/lecoqnoir/,,http://www.le-coq-noir.com/
4,API TECH,451972483,45197250000000.0,contact@apitech-fr.com,28.0,33383495423,https://www.facebook.com/Api-Tech-168242325533...,https://www.linkedin.com/company/api-tech/,,http://www.apitech-solution.com/
5,SOREAL-ILOU,478608037,47860800000000.0,contact@soreal.fr,39.0,33299472121,https://www.facebook.com/Soreal.sauce,https://www.linkedin.com/company/soreal-ilou-sas/,https://twitter.com/@Soreal_Ilou,http://www.soreal.fr/
6,SPECIALITES PET FOOD,560500498,56050050000000.0,contact@spf-diana.com,0.0,33297938080,https://www.facebook.com/symriseag-28794040508...,https://www.linkedin.com/company/specialites-p...,,https://petfood.symrise.com/fr/
7,WHAT'S COOKING FRANCE,322304197,32230420000000.0,commercial@stefano-toselli.com,0.0,33231200596,https://m.facebook.com/tosellien/,https://www.linkedin.com/company/stefano-toselli/,twitter.com/https://twitter.com/@StefanoToselli1,https://whatscooking.group/fr-FR
8,S.E.M. DES SOURCES DE SOULTZMATT,380356436,38035640000000.0,standard@lisbeth.fr,28.0,33389470006,https://www.facebook.com/sourcesdesoultzmatt/,https://www.linkedin.com/company/sources-de-so...,twitter.com/https://twitter.com/@Rivella,https://www.lisbeth.alsace/
9,ATLAGEL,7280365,728036500000.0,,0.0,33240131300,,https://www.linkedin.com/company/relais-d'or-m...,,https://webshop.relaisdor.fr/
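
Some `urlTwitter` values in the preview carry a doubled prefix, such as `twitter.com/https://twitter.com/@StefanoToselli1`. One way to make the column usable is to keep only the account handle. The helper below is a hypothetical sketch; it assumes the handle is the last path segment and follows the usual handle format (up to 15 letters, digits, or underscores).

```python
import re
import pandas as pd

# The handle must directly follow "/" or "@" and sit at the end of the string.
HANDLE_RE = re.compile(r"(?<=[/@])([A-Za-z0-9_]{1,15})/?$")

def twitter_handle(url):
    """Extract the account handle from a (possibly malformed) Twitter URL."""
    if pd.isna(url):
        return pd.NA
    match = HANDLE_RE.search(str(url).strip())
    return match.group(1) if match else pd.NA

# Values copied from the preview above, plus a missing entry.
urls = pd.Series([
    "https://twitter.com/@Soreal_Ilou",
    "twitter.com/https://twitter.com/@StefanoToselli1",
    "twitter.com/https://twitter.com/@Rivella",
    None,
])
handles = urls.map(twitter_handle)
print(handles.tolist())
```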


## 7. KPI Data

This dataset contains Key Performance Indicators (KPIs) expanded by year.

**Note:** KPI data is structured with one row per company per year, so a company may have multiple rows.

**Use case:** Time series analysis, year-over-year comparisons, KPI trend analysis.


In [17]:
# Load KPI data from CSV
df_kpi = data['07_kpi_data']

if df_kpi.empty:
    print("⚠ KPI dataset is empty")
else:
    print(f"Dataset shape: {df_kpi.shape}")
    print(f"\nColumns: {list(df_kpi.columns)}")
    display(df_kpi.head(10))

    # Show sample KPI metrics
    kpi_metric_cols = [col for col in df_kpi.columns if col not in ['company_name', 'siren', 'siret', 'year']]
    if len(kpi_metric_cols) > 0:
        print(f"\nSample KPI metrics (showing first 5): {kpi_metric_cols[:5]}")

        # Show a sample company's KPI over time
        if 'siren' in df_kpi.columns and len(df_kpi) > 0:
            sample_siren = df_kpi['siren'].iloc[0]
            sample_company = df_kpi[df_kpi['siren'] == sample_siren]
            if 'year' in df_kpi.columns:
                sample_company = sample_company.sort_values('year')
            print(f"\nSample: KPI data for company {sample_siren}:")
            # Include `year` in the preview only when the column exists
            preview_cols = (['year'] if 'year' in df_kpi.columns else []) + kpi_metric_cols[:10]
            display(sample_company[preview_cols].head(10))


Dataset shape: (3779, 28)

Columns: ['company_name', 'siren', 'siret', 'year', 'fonds_propres', 'ca_france', 'date_cloture_exercice', 'duree_exercice', 'salaires_traitements', 'charges_financieres', 'impots_taxes', 'ca_bilan', 'resultat_exploitation', 'dotations_amortissements', 'capital_social', 'code_confidentialite', 'resultat_bilan', 'annee', 'effectif', 'effectif_sous_traitance', 'filiales_participations', 'evolution_ca', 'subventions_investissements', 'ca_export', 'evolution_effectif', 'participation_bilan', 'ca_consolide', 'resultat_net_consolide']


Unnamed: 0,company_name,siren,siret,year,fonds_propres,ca_france,date_cloture_exercice,duree_exercice,salaires_traitements,charges_financieres,...,effectif,effectif_sous_traitance,filiales_participations,evolution_ca,subventions_investissements,ca_export,evolution_effectif,participation_bilan,ca_consolide,resultat_net_consolide
0,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2023,2192166.0,6729652.0,2023-01-31,12.0,1394492.0,80993.0,...,,,,,,,,,,
1,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2022,1614077.0,6247357.0,2022-01-31,12.0,1327711.0,81469.0,...,35.0,,,,,,,,,
2,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2021,1497114.0,5296275.0,2021-01-31,12.0,1318083.0,66111.0,...,34.0,,,,,,,,,
3,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2020,1577275.0,5710890.0,2020-01-31,12.0,1380952.0,70953.0,...,45.0,18930.0,1.0,,,,,,,
4,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2019,1348804.0,5221375.0,2019-01-31,12.0,1230571.0,88389.0,...,43.0,15835.0,1.0,,,,,,,
5,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2018,1492199.0,,2018-01-31,12.0,1372333.0,385712.0,...,44.0,,,,,,,,,
6,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2017,1419433.0,5630041.0,2017-01-31,12.0,1394179.0,47878.0,...,44.0,,,,,,,,,
7,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2016,,,2016-01-31,12.0,,,...,,,,,,,,,,
8,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2015,,,2015-01-31,12.0,,,...,,,,,,,,,,
9,PAIN D'EPICES MULOT ET PETITJEAN,15751530,1575153000000.0,2025,,,,,,,...,,,,1.0772,,,,,,



Sample KPI metrics (showing first 5): ['fonds_propres', 'ca_france', 'date_cloture_exercice', 'duree_exercice', 'salaires_traitements']

Sample: KPI data for company 15751530:


Unnamed: 0,year,fonds_propres,ca_france,date_cloture_exercice,duree_exercice,salaires_traitements,charges_financieres,impots_taxes,ca_bilan,resultat_exploitation,dotations_amortissements
10,2014,,,2014-01-31,12.0,,,,6653070.0,,
8,2015,,,2015-01-31,12.0,,,,4905670.0,,
7,2016,,,2016-01-31,12.0,,,,4684680.0,,
6,2017,1419433.0,5630041.0,2017-01-31,12.0,1394179.0,47878.0,78290.0,5630040.0,32986.0,
5,2018,1492199.0,,2018-01-31,12.0,1372333.0,385712.0,81066.0,5971010.0,-47658.0,510525.0
4,2019,1348804.0,5221375.0,2019-01-31,12.0,1230571.0,88389.0,73800.0,5221380.0,-247638.0,606804.0
3,2020,1577275.0,5710890.0,2020-01-31,12.0,1380952.0,70953.0,82085.0,5710890.0,93684.0,581427.0
2,2021,1497114.0,5296275.0,2021-01-31,12.0,1318083.0,66111.0,162691.0,5296275.0,-62109.0,548599.0
1,2022,1614077.0,6247357.0,2022-01-31,12.0,1327711.0,81469.0,133727.0,6247357.0,65908.0,620473.0
0,2023,2192166.0,6729652.0,2023-01-31,12.0,1394492.0,80993.0,150846.0,6729652.0,76546.0,670125.0
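
With one row per company per year, year-over-year comparisons reduce to a grouped `pct_change`. A minimal sketch using three `ca_bilan` values from the sample company above; it assumes the years are consecutive within each company, since `pct_change` compares adjacent rows and does not check for gaps.

```python
import pandas as pd

# Three rows taken from the sample output above, deliberately out of order.
kpi = pd.DataFrame({
    "siren": [15751530, 15751530, 15751530],
    "year": [2023, 2021, 2022],
    "ca_bilan": [6729652.0, 5296275.0, 6247357.0],
})

# Sort chronologically within each company before differencing.
kpi = kpi.sort_values(["siren", "year"])

# Relative change of revenue versus the previous row of the same company.
kpi["ca_bilan_yoy"] = kpi.groupby("siren")["ca_bilan"].pct_change()
print(kpi)
```

The first year of each company has no predecessor, so its growth value is `NaN`.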


## 8. Signals

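This dataset contains company signals and events, each described by its location (`city_label`, `departement`, `country`, `continent`), nature of operation (`natureOp`), type (`type`, `type_id`), status (`statut`), and timestamps (`createdAt`, `publishedAt`).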

**Use case:** Event tracking, signal analysis, company activity monitoring.


In [18]:
# Load signals data from CSV
df_signals = data['08_signals']

if df_signals.empty:
    print("⚠ Signals dataset is empty")
else:
    print(f"Dataset shape: {df_signals.shape}")
    print(f"\nColumns: {list(df_signals.columns)}")
    display(df_signals.head(10))

    # Summary
    print(f"\nSignals Summary:")
    print(f"Total signal records: {len(df_signals)}")
    if 'siren' in df_signals.columns:
        print(f"Unique companies with signals: {df_signals['siren'].nunique()}")
        print(f"Average signals per company: {len(df_signals) / df_signals['siren'].nunique():.1f}")

    if 'type' in df_signals.columns:
        print(f"\nSignal types:")
        print(df_signals['type'].value_counts().head(10))

    if 'country' in df_signals.columns:
        print(f"\nCountries:")
        print(df_signals['country'].value_counts().head(10))


Dataset shape: (2704187, 17)

Columns: ['company_name', 'siren', 'siret', 'city_label', 'city_zip_code', 'companies_count', 'continent', 'country', 'createdAt', 'departement', 'isMain', 'natureOp', 'publishedAt', 'sirets_count', 'statut', 'type', 'type_id']


Unnamed: 0,company_name,siren,siret,city_label,city_zip_code,companies_count,continent,country,createdAt,departement,isMain,natureOp,publishedAt,sirets_count,statut,type,type_id
0,,142697.0,,BEAUMONT-HAMEL,80300.0,1,Europe,France,2021-12-07T03:50:04+01:00,Somme,True,Bâtiment public,2021-12-07T00:00:00+01:00,1,,,
1,,218000677.0,21800070000000.0,BEAUMONT-HAMEL,80300.0,1,Europe,France,2021-12-07T03:50:04+01:00,Somme,True,Bâtiment public,2021-12-07T00:00:00+01:00,1,,,
2,,142668.0,,Saint-Cloud,92210.0,1,Europe,France,2021-11-20T03:50:04+01:00,Hauts de Seine,True,Logements,2021-11-20T00:00:00+01:00,1,,,
3,,845406891.0,84540690000000.0,Saint-Cloud,92210.0,1,Europe,France,2021-11-20T03:50:04+01:00,Hauts de Seine,True,Logements,2021-11-20T00:00:00+01:00,1,,,
4,,33326.0,,,,1,,,2022-11-08T13:37:12+01:00,Charente Maritime,True,R&D,2022-11-09T00:00:00+01:00,1,,,
5,,344436712.0,34443670000000.0,,,1,,,2022-11-08T13:37:12+01:00,Charente Maritime,True,R&D,2022-11-09T00:00:00+01:00,1,,,
6,,33326.0,,,,0,,,2022-11-08T13:45:38+01:00,Charente Maritime,True,Innovation,2022-11-09T00:00:00+01:00,1,,,
7,,57359.0,,,,0,,,2022-11-08T13:45:38+01:00,Charente Maritime,True,Innovation,2022-11-09T00:00:00+01:00,1,,,
8,,344436712.0,34443670000000.0,,,0,,,2022-11-08T13:45:38+01:00,Charente Maritime,True,Innovation,2022-11-09T00:00:00+01:00,1,,,
9,,142639.0,,Montanay,69250.0,1,Europe,France,2021-11-20T03:50:04+01:00,Rhône,True,Logements,2021-11-20T00:00:00+01:00,1,,,



Signals Summary:
Total signal records: 2704187
Unique companies with signals: 593341
Average signals per company: 4.6

Signal types:
Series([], Name: count, dtype: int64)

Countries:
country
France              1382108
États-Unis            20899
Pays non précisé      20576
Allemagne              7911
Royaume-Uni            6795
Espagne                5308
Italie                 4947
Chine                  4529
Belgique               4339
Canada                 3237
Name: count, dtype: int64
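
The empty `Signal types` result above is consistent with the `type` column being present but holding no values, as the preview rows suggest. A minimal sketch for checking which columns carry no data, assuming only pandas; `null_report` is a hypothetical helper name and the toy frame merely stands in for `df_signals`:

```python
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null counts and shares, emptiest columns first."""
    report = pd.DataFrame({
        "null_count": df.isna().sum(),
        "null_share": df.isna().mean(),
    })
    return report.sort_values("null_share", ascending=False)


# Toy stand-in for df_signals (illustrative values only)
toy = pd.DataFrame({
    "siren": [142697.0, 218000677.0, 33326.0],
    "country": ["France", "France", None],
    "type": [None, None, None],
})

report = null_report(toy)
# Columns whose every value is null cannot feed value_counts() or a plot
fully_empty = report.index[report["null_share"] == 1.0].tolist()
print(report)
print("Columns with no data:", fully_empty)
```

Running `null_report(df_signals)` before plotting would show whether `type`, `statut`, and `type_id` are really empty in the full dataset or only in the first rows.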


In [19]:
# Visualize signals
# NOTE: this cell was previously commented out and marked BROKEN. Assumed cause (not
# confirmed by the source): 'type' is entirely null in the output above, so
# value_counts() is empty and plotting an empty Series raises an error.
if not df_signals.empty:
    # (title, column, color) for each top-10 panel
    panels = [
        ('Top 10 Signal Types', 'type', 'steelblue'),
        ('Top 10 Countries by Signal Count', 'country', 'coral'),
    ]
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    for ax, (title, column, color) in zip(axes, panels):
        ax.set_title(title, fontweight='bold')
        has_column = column in df_signals.columns
        counts = df_signals[column].value_counts().head(10) if has_column else None
        if counts is None or counts.empty:
            # Nothing to plot: label the panel instead of plotting empty data
            ax.text(0.5, 0.5, f"No values in '{column}'", ha='center', va='center',
                    transform=ax.transAxes)
            ax.set_xticks([])
            ax.set_yticks([])
        else:
            counts.plot(kind='barh', ax=ax, color=color)
            ax.set_xlabel('Count')

    plt.tight_layout()
    plt.show()

    # Signals per company distribution (groupby drops rows with a missing siren)
    if 'siren' in df_signals.columns and df_signals['siren'].notna().any():
        signals_per_company = df_signals.groupby('siren').size()
        plt.figure(figsize=(10, 6))
        signals_per_company.hist(bins=30, color='green', edgecolor='black')
        plt.title('Distribution of Signals per Company', fontweight='bold')
        plt.xlabel('Number of Signals')
        plt.ylabel('Number of Companies')
        plt.yscale('log')
        plt.tight_layout()
        plt.show()
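
For the activity-monitoring use case, the `publishedAt` strings shown in the signals preview (ISO 8601 with a `+01:00` offset) can be bucketed by month. The sketch below is one possible approach, not part of the original notebook: `monthly_counts` is a hypothetical helper, and converting to `Europe/Paris` is an assumption made so that midnight local timestamps do not slip into the previous month once normalized to UTC.

```python
import pandas as pd


def monthly_counts(df: pd.DataFrame, column: str = "publishedAt") -> pd.Series:
    """Count records per calendar month of a timestamp column."""
    # utc=True gives one timezone-aware dtype even when offsets differ (+01:00 / +02:00)
    stamps = pd.to_datetime(df[column], errors="coerce", utc=True).dropna()
    # Assumption: timestamps are meant as French local time
    local = stamps.dt.tz_convert("Europe/Paris").dt.tz_localize(None)
    return local.dt.to_period("M").value_counts().sort_index()


# Toy stand-in for df_signals (illustrative values only)
toy = pd.DataFrame({"publishedAt": [
    "2021-12-01T00:00:00+01:00",
    "2021-12-07T00:00:00+01:00",
    "2022-11-09T00:00:00+01:00",
    None,
]})

counts = monthly_counts(toy)
print(counts)
```

The resulting Series can be passed to `counts.plot(kind='bar')` in the same style as the other charts in this notebook.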


## 9. Articles

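This dataset contains articles and news related to companies. Each record carries the article `title`, its `sources`, the `publishedAt` date, and, where available, the associated `signalsType` and `signalsStatus`.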

**Use case:** Media analysis, news tracking, company visibility assessment.


In [20]:
# Load articles data from CSV
df_articles = data['09_articles']

if df_articles.empty:
    print("⚠ Articles dataset is empty")
else:
    print(f"Dataset shape: {df_articles.shape}")
    print(f"\nColumns: {list(df_articles.columns)}")
    display(df_articles.head(10))

    # Summary
    print(f"\nArticles Summary:")
    print(f"Total article records: {len(df_articles)}")
    if 'siren' in df_articles.columns:
        print(f"Unique companies with articles: {df_articles['siren'].nunique()}")
        print(f"Average articles per company: {len(df_articles) / df_articles['siren'].nunique():.1f}")


Dataset shape: (907270, 15)

Columns: ['company_name', 'siren', 'siret', 'all_companies_count', 'author', 'cities', 'companies_count', 'country', 'departments', 'publishedAt', 'sectors', 'signalsStatus', 'signalsType', 'sources', 'title']


Unnamed: 0,company_name,siren,siret,all_companies_count,author,cities,companies_count,country,departments,publishedAt,sectors,signalsStatus,signalsType,sources,title
0,FOURVIERE HOTEL LYON - F H L,802731182.0,80273120000000.0,1,,,1,,,2023-11-07T00:00:00+01:00,,,,TRIBUNE DE LYON,"À Avignon, l'hôtel des Monnaies devrait rouvri..."
1,MY TRAINING BOX,891180952.0,89118100000000.0,1,,Grenade,1,,Haute Garonne,2023-11-07T00:00:00+01:00,,,,GAZETTE DU MIDI (La),My Green Training Box veut diminuer son impact...
2,EUROPLASMA,384256095.0,38425610000000.0,0,,Morcenx-la-Nouvelle,0,,Landes,2023-11-07T00:00:00+01:00,,Détecté,Activité internationale (industrie),CENTRAL CHARTS,Europlasma signe officiellement son accord de ...
3,Hangzhou Jinyao New Energy Technology,262734.0,26273400000.0,0,,Morcenx-la-Nouvelle,0,,Landes,2023-11-07T00:00:00+01:00,,Détecté,Activité internationale (industrie),CENTRAL CHARTS,Europlasma signe officiellement son accord de ...
4,COMMUNE DE BENOUVILLE,211400601.0,21140060000000.0,0,,,1,,Calvados,2023-11-07T00:00:00+01:00,,,,LIBERTE LE BONHOMME LIBRE,Le dossier de la ZAC de la Clôture validé à Bé...
5,NETRI,840248744.0,84024870000000.0,0,,,1,,Rhône,2023-11-07T00:00:00+01:00,,,,LYONBIOPOLE,Netri soutenue dans ses actions de formation
6,VINCI,552037806.0,55203780000000.0,0,,,0,,,2023-11-07T00:00:00+01:00,,Suivi,Construction,ECHOS (Groupe Les),Les travaux de requalification de la friche Or...
7,SOCIETE DE LA TOUR EIFFEL,572182269.0,57218230000000.0,0,,,0,,,2023-11-07T00:00:00+01:00,,Suivi,Construction,ECHOS (Groupe Les),Les travaux de requalification de la friche Or...
8,LBMG,922124607.0,92212460000000.0,0,,,1,,,2023-11-07T00:00:00+01:00,,,,DELIBERATIONS et COMPTES RENDUS,Lbmg souhaite créer des meublés de tourisme au...
9,ETABLISSEMENTS LE GOFF,323048751.0,32304880000000.0,1,,Saint-Martin-des-Champs,1,,Finistère,2023-11-07T00:00:00+01:00,,,,,"La biscuiterie Le Goff licencie 14 personnes, ..."



Articles Summary:
Total article records: 907270
Unique companies with articles: 300161
Average articles per company: 3.0
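
The average printed above divides the total row count by the number of distinct non-null `siren` values, so any rows without a `siren` inflate it slightly, and a mean alone hides how skewed the counts are. A hedged sketch of a fuller per-company summary follows; `per_company_stats` is a hypothetical helper and the toy frame stands in for `df_articles` or `df_signals`.

```python
import pandas as pd


def per_company_stats(df: pd.DataFrame, key: str = "siren") -> dict:
    """Summarize how many rows each company has, ignoring rows with a missing key."""
    keyed = df[df[key].notna()]
    counts = keyed.groupby(key).size()
    return {
        "rows_total": len(df),
        "rows_without_key": len(df) - len(keyed),
        "companies": int(counts.size),
        "mean": float(counts.mean()) if len(counts) else float("nan"),
        "median": float(counts.median()) if len(counts) else float("nan"),
        "max": int(counts.max()) if len(counts) else 0,
    }


# Toy stand-in: company 1 has three rows, company 2 has one, one row has no siren
toy = pd.DataFrame({"siren": [1.0, 1.0, 1.0, 2.0, None]})

stats = per_company_stats(toy)
naive_average = len(toy) / toy["siren"].nunique()  # what the summary cell computes
print(stats)
print(f"Naive average: {naive_average:.1f}")
```

On the toy data the naive average is 2.5 while the mean over keyed rows is 2.0; on the real datasets the gap depends on how many rows lack a `siren`.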


In [21]:
# Visualize articles
# NOTE: this cell was previously commented out and marked BROKEN. Assumed cause (not
# confirmed by the source): 'country' looks empty in the preview above, and plotting
# an empty value_counts() result raises an error.
if not df_articles.empty:
    # (title, column, color) for each top-10 panel
    panels = [
        ('Top 10 Article Signal Types', 'signalsType', 'steelblue'),
        ('Top 10 Countries by Article Count', 'country', 'coral'),
    ]
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    for ax, (title, column, color) in zip(axes, panels):
        ax.set_title(title, fontweight='bold')
        has_column = column in df_articles.columns
        counts = df_articles[column].value_counts().head(10) if has_column else None
        if counts is None or counts.empty:
            # Nothing to plot: label the panel instead of plotting empty data
            ax.text(0.5, 0.5, f"No values in '{column}'", ha='center', va='center',
                    transform=ax.transAxes)
            ax.set_xticks([])
            ax.set_yticks([])
        else:
            counts.plot(kind='barh', ax=ax, color=color)
            ax.set_xlabel('Count')

    plt.tight_layout()
    plt.show()

    # Articles per company distribution (groupby drops rows with a missing siren)
    if 'siren' in df_articles.columns and df_articles['siren'].notna().any():
        articles_per_company = df_articles.groupby('siren').size()
        plt.figure(figsize=(10, 6))
        articles_per_company.hist(bins=30, color='purple', edgecolor='black')
        plt.title('Distribution of Articles per Company', fontweight='bold')
        plt.xlabel('Number of Articles')
        plt.ylabel('Number of Companies')
        plt.yscale('log')
        plt.tight_layout()
        plt.show()
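
The previews in this notebook render `siren` and `siret` as floats (for example `218000677.0`), which is awkward when these index columns are used as join keys across datasets. A minimal sketch for converting them to clean string keys is shown below. It assumes the values are integer-valued floats, does not restore any leading zeros lost at load time, and uses a hypothetical helper name, `ids_to_string`.

```python
import pandas as pd


def ids_to_string(df: pd.DataFrame, columns=("siren", "siret")) -> pd.DataFrame:
    """Return a copy with float identifier columns converted to strings without '.0'."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            numeric = pd.to_numeric(out[col], errors="coerce")
            # Nullable Int64 keeps missing values as <NA> instead of forcing floats
            out[col] = numeric.astype("Int64").astype("string")
    return out


# Toy stand-in for a loaded dataset (illustrative values only)
toy = pd.DataFrame({
    "siren": [142697.0, None, 218000677.0],
    "siret": [None, None, 21800067700017.0],
})

clean = ids_to_string(toy)
print(clean)
```

Reading the CSV files with `dtype=str` for these columns in `pd.read_csv` would avoid the float conversion in the first place, provided the exported files still contain the original digits.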
