## Generating Synthetic Data 

**Objective:**
Generate synthetic data for the tables that store Pro Juventute's KPIs: circles, users, KPIs, KPI values, audit logs, and KPI targets.

### **Tables and Constraints**

**Circles Table:**

* Contains unique circle names derived from sample data.
* Timestamps for creation and last update.
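
Every table in this spec carries a creation and a last-update timestamp. If the two are drawn independently, an update can land before its creation. The following is a minimal sketch of one way to keep each pair ordered. It assumes a hypothetical helper `random_timestamps` and a fixed reference time `now`; neither is part of the notebook.

```python
import random
from datetime import datetime, timedelta

def random_timestamps(rng, now, max_age_days=365 * 3):
    """Return (created_at, updated_at) with created_at <= updated_at <= now."""
    created_at = now - timedelta(days=rng.randint(0, max_age_days))
    # Draw the update offset from the time elapsed since creation,
    # so the update can never precede the creation.
    elapsed_days = (now - created_at).days
    updated_at = created_at + timedelta(days=rng.randint(0, elapsed_days))
    return created_at, updated_at

rng = random.Random(1)        # local generator, seeded for repeatable draws
now = datetime(2024, 1, 1)    # fixed reference time (an assumption for this sketch)
pairs = [random_timestamps(rng, now) for _ in range(5)]
```

Using a fixed reference time instead of `datetime.now()` also makes the seeded output identical from run to run.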

**Users Table:**

* Generate data for 10 users.
* Users can belong to any circle.
* Each user has a unique username and email.
* Each user has a role, either 'Admin' or 'User'.
* Timestamps for account creation and last update.

**KPIs Table:**

* Generate data for KPIs related to Pro Juventute's activities.
* KPIs can occasionally be shared across circles.
* Each KPI has a unique name, description, periodicity (Daily, Weekly, etc.), and unit of measurement.
* KPIs that represent percentages have a maximum value of 100.
* KPIs that do not represent percentages have no predefined maximum value (e.g., funds raised, sessions).
* Timestamps for KPI creation and last update.
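
The percentage rule above can be sketched as a small lookup from a KPI's unit to its bounds. The helper `value_bounds` and the KPI "Workshop attendance rate" are hypothetical and only illustrate the rule; "Funds raised" in CHF is taken from the KPI list used in this notebook.

```python
def value_bounds(unit):
    """Return (value_min, value_max) for a unit; value_max is None when unbounded."""
    if unit == "%":
        return 0, 100
    return 0, None

example_kpis = [
    {"name": "Workshop attendance rate", "unit": "%"},  # hypothetical percentage KPI
    {"name": "Funds raised", "unit": "CHF"},            # no predefined maximum
]
for kpi in example_kpis:
    kpi["Value_min"], kpi["Value_max"] = value_bounds(kpi["unit"])
```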

**KPI Values Table:**

* Generate values for each KPI for 3 years.
* The values lie between the minimum and maximum specified for each KPI.
* Timestamps for each value's creation and last update.

**Audit Logs Table:**

* Generate log entries for each KPI value.
* Each log entry has an action type (Create, Update, Delete), table name, record name, user ID, and timestamp.
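
A log entry with the fields listed above might be built as in the following sketch. The helper `audit_entry`, the `log_id` column name, and the two sample rows are assumptions made for illustration; the check on `action` simply enforces the three action types named in the spec.

```python
from datetime import datetime

ALLOWED_ACTIONS = ("Create", "Update", "Delete")

def audit_entry(log_id, action, table_name, record_name, user_id, timestamp):
    """Build one audit-log row, rejecting action types outside the allowed set."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return {
        "log_id": log_id,
        "action": action,
        "table_name": table_name,
        "record_name": record_name,
        "user_id": user_id,
        "timestamp": timestamp,
    }

# Two made-up KPI value rows standing in for the generated table.
kpi_value_rows = [
    {"kpi_value_id": 1, "user_id": 3, "created_at": datetime(2023, 5, 1)},
    {"kpi_value_id": 2, "user_id": 7, "created_at": datetime(2023, 6, 15)},
]
# One 'Create' entry per KPI value, stamped with the value's creation time.
audit_log = [
    audit_entry(i, "Create", "KPI_Values", row["kpi_value_id"], row["user_id"], row["created_at"])
    for i, row in enumerate(kpi_value_rows, start=1)
]
```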

**KPI Targets Table:**

* Generate target values for each KPI.
* Targets have a timeframe (Daily, Weekly, etc.) and are associated with specific KPIs.
* Timestamps for target creation and last update.
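
As a rough illustration of the target rules, the sketch below creates one target per KPI, copies the KPI's periodicity as the timeframe, and keeps the target inside the KPI's range. The helper `make_targets`, the `fallback_max` limit for unbounded KPIs, and the percentage KPI "Workshop attendance rate" are assumptions, not part of the notebook.

```python
import random

def make_targets(kpis, rng, fallback_max=5000):
    """Return one target row per KPI, bounded by the KPI's min/max."""
    targets = []
    for target_id, kpi in enumerate(kpis, start=1):
        upper = fallback_max if kpi["Value_max"] is None else kpi["Value_max"]
        targets.append({
            "target_id": target_id,
            "kpi_id": kpi["kpi_id"],
            "target_value": rng.uniform(kpi["Value_min"], upper),
            "timeframe": kpi["Periodicity"],  # target timeframe mirrors the KPI's periodicity
        })
    return targets

sample_kpis = [
    {"kpi_id": 1, "name": "Funds raised", "Periodicity": "Monthly", "Value_min": 0, "Value_max": None},
    {"kpi_id": 2, "name": "Workshop attendance rate", "Periodicity": "Weekly", "Value_min": 0, "Value_max": 100},
]
targets = make_targets(sample_kpis, random.Random(1))
```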

**Additional Information:**

Each table is exported to its own .csv file. The KPIs for Pro Juventute's activities, with their descriptions and units, are defined as a list of dictionaries. The sample data supplies the circle names, along with some KPI names and their acceptable ranges or descriptions.
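
The per-table export can be sketched with the standard library alone. The helper `export_table`, the temporary output directory, and the placeholder circle names are assumptions for this example; the sketch creates the output directory first so that writing does not fail when the folder is missing.

```python
import csv
import os
import tempfile

def export_table(rows, directory, filename):
    """Write a list of dicts to directory/filename as CSV and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the output folder if it is missing
    path = os.path.join(directory, filename)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path

out_dir = os.path.join(tempfile.mkdtemp(), "synthetic-data")  # throwaway location for the demo
circles = [
    {"Circle_id": 1, "Circle_name": "Circle A"},  # placeholder names
    {"Circle_id": 2, "Circle_name": "Circle B"},
]
path = export_table(circles, out_dir, "circles.csv")

# Read the file back to confirm the round trip.
with open(path, newline="", encoding="utf-8") as handle:
    rows_read = list(csv.DictReader(handle))
```

Note that `csv.DictReader` returns every field as a string, so numeric columns need converting after reading.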

Generated with assistance from ChatGPT.

In [1]:
import pandas as pd
import random
from datetime import datetime, timedelta
import os

random.seed(1)

#sample_size = 

sample_data_path = '../sample-data/pj_sample_value.csv'

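synthetic_data_dir = '../synthetic-data/'
os.makedirs(synthetic_data_dir, exist_ok=True)  # make sure the output directory exists before writing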

# Load the sample data
sample_data_df = pd.read_csv(sample_data_path)

# Generate Circles Table data
circles_df = pd.DataFrame({
    'Circle_id': range(1, sample_data_df['circle'].nunique() + 1),
    'Circle_name': sample_data_df['circle'].unique(),
    'created_at': [datetime.now() - timedelta(days=random.randint(0, 365*3)) for _ in range(sample_data_df['circle'].nunique())],
    'updated_at': [datetime.now() - timedelta(days=random.randint(0, 365*2)) for _ in range(sample_data_df['circle'].nunique())]
})
circles_df.to_csv(os.path.join(synthetic_data_dir, 'circles.csv'), index=False)

# Generate Users Table data (10 users as an example)
num_users = 10
users_df = pd.DataFrame({
    'user_id': range(1, num_users+1),
    'Circle_id': [random.choice(circles_df['Circle_id'].tolist()) for _ in range(num_users)],
    'username': [f"User_{i}" for i in range(1, num_users+1)],
    'email': [f"user{i}@example.com" for i in range(1, num_users+1)],
    'password': ['password'] * num_users,
    'role': ['Admin' if i == 0 else 'User' for i in range(num_users)],
    'created_at': [datetime.now() - timedelta(days=random.randint(0, 365*3)) for _ in range(num_users)],
    'updated_at': [datetime.now() - timedelta(days=random.randint(0, 365*2)) for _ in range(num_users)]
})
users_df.to_csv(os.path.join(synthetic_data_dir, 'users.csv'), index=False)

# Define KPIs related to Pro Juventute's activities
pro_juventute_kpis = [
    {"name": "Number of counseling sessions", "description": "Total counseling sessions held in a period.", "unit": "sessions"},
    {"name": "Active volunteers", "description": "Number of active volunteers for Pro Juventute.", "unit": "people"},
    {"name": "Workshops conducted", "description": "Total number of workshops conducted.", "unit": "workshops"},
    {"name": "Funds raised", "description": "Total funds raised in a period.", "unit": "CHF"},
    {"name": "Children reached", "description": "Number of children reached through programs.", "unit": "children"},
    {"name": "Digital literacy programs", "description": "Number of digital literacy programs conducted.", "unit": "programs"},
    {"name": "Community events", "description": "Number of community events organized.", "unit": "events"},
    {"name": "Feedback received", "description": "Number of feedback entries received from beneficiaries.", "unit": "feedbacks"},
    {"name": "Outreach programs", "description": "Total outreach programs conducted.", "unit": "programs"},
    {"name": "Online safety workshops", "description": "Number of online safety workshops held.", "unit": "workshops"},
    {"name": "Partnerships formed", "description": "Number of new partnerships or collaborations.", "unit": "partnerships"},
    {"name": "Awareness campaigns", "description": "Total awareness campaigns conducted.", "unit": "campaigns"}
]

# Generate KPIs Table data
kpis_data = []
for _, circle in circles_df.iterrows():
    for kpi in pro_juventute_kpis:
        kpis_data.append({
            'kpi_id': len(kpis_data) + 1,
            'name': kpi['name'],
            'description': kpi['description'],
            'Circle_id': circle['Circle_id'],
            'Periodicity': random.choice(['Daily', 'Weekly', 'Monthly', 'Quarterly', 'Yearly']),
            'Value_min': 0,
            'Value_max': 100 if kpi['unit'] == '%' else None,
            'unit': kpi['unit'],
            'created_at': datetime.now() - timedelta(days=random.randint(0, 365*3)),
            'updated_at': datetime.now() - timedelta(days=random.randint(0, 365*2))
        })

kpis_df = pd.DataFrame(kpis_data)
kpis_df.to_csv(os.path.join(synthetic_data_dir, 'kpis.csv'), index=False)

# Generate KPI_Values Table data without anomalies
num_kpi_values = len(sample_data_df) * 3
kpi_values_df = pd.DataFrame({
    'kpi_value_id': range(1, num_kpi_values + 1),
    'kpi_id': [random.choice(kpis_df['kpi_id'].tolist()) for _ in range(num_kpi_values)],
    'Circle_id': [random.choice(circles_df['Circle_id'].tolist()) for _ in range(num_kpi_values)],
    'user_id': [random.choice(users_df['user_id'].tolist()) for _ in range(num_kpi_values)],
    'value': [random.uniform(0, 100) for _ in range(num_kpi_values)],
    'Period_year': [(datetime.now() - timedelta(days=random.randint(0, 365*3))).year for _ in range(num_kpi_values)],
    'period_month': [random.randint(1, 12) for _ in range(num_kpi_values)],
    'created_at': [datetime.now() - timedelta(days=random.randint(0, 365*3)) for _ in range(num_kpi_values)],
    'updated_at': [datetime.now() - timedelta(days=random.randint(0, 365*2)) for _ in range(num_kpi_values)]
})

# Introduce anomalies to the KPI Values table
for _ in range(2):
    idx = random.choice(kpi_values_df.index)
    kpi_values_df.at[idx, 'value'] *= random.choice([10, 0.1])

for _ in range(3):
    idx = random.choice(kpi_values_df.index)
    kpi_values_df.at[idx, 'value'] = random.choice([-500, 5000])

kpi_values_df.to_csv(os.path.join(synthetic_data_dir, 'anomalous_kpi_values.csv'), index=False)

# Generate Audit_Logs Table data
num_audit_logs = len(kpi_values_df)
audit_logs_df = pd.DataFrame({
    'session_id': range(1, num_audit_logs + 1),
    'action': ['Create'] * num_audit_logs,  # Assuming all are create actions for simplicity
    'table_name': ['KPI_Values'] * num_audit_logs,
    'record_name': kpi_values_df['kpi_value_id'].tolist(),
    'user_id': kpi_values_df['user_id'].tolist(),
    'timestamp': kpi_values_df['created_at'].tolist()
})

audit_logs_df.to_csv(os.path.join(synthetic_data_dir, 'audit_logs.csv'), index=False)

# Generate KPI_Targets Table data
num_kpi_targets = len(kpis_df)
kpi_targets_df = pd.DataFrame({
    'target_id': range(1, num_kpi_targets + 1),
    'kpi_id': kpis_df['kpi_id'].tolist(),
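    # pd.notna treats both None and NaN as missing; pandas stores a missing maximum as NaN in numeric columns
    'target_value': [random.uniform(0, 100) if pd.notna(row['Value_max']) else random.uniform(0, 5000)
                     for _, row in kpis_df.iterrows()],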
    'timeframe': kpis_df['Periodicity'].tolist(),
    'created_at': [datetime.now() - timedelta(days=random.randint(0, 365*3)) for _ in range(num_kpi_targets)],
    'updated_at': [datetime.now() - timedelta(days=random.randint(0, 365*2)) for _ in range(num_kpi_targets)]
})

kpi_targets_df.to_csv(os.path.join(synthetic_data_dir, 'kpi_targets.csv'), index=False)


print("Data generation complete")


Data generation complete
