# QA question generator using DeepSeek


**Authors**
1. Alfan Dinda Rahmawan (alfan.d.rahmawan@gdplabs.id)

## Install dependencies

In [1]:
%pip install -q langchain=="0.3.0"
%pip install -q langchain-aws=="0.2.7"
%pip install -q langchain_openai=="0.2.14"
%pip install -q boto3=="1.35.71"
%pip install -q pandas=="2.2.2"
%pip install -q tqdm=="4.66.4"
%pip install google-api-python-client==2.100.0 gspread==5.10.0
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


## Data Preparation

In [2]:
import os
import pandas as pd
import json

from typing import List
from dotenv import load_dotenv

load_dotenv()

GOOGLE_SPREADSHEET_ID: str = "1dDMqrol_DrEMjvLy88IRu2WdHN7T5BU0LrD8ORLuNPI" # put your spreadsheet id here
GOOGLE_SPREADSHEET_URL: str = f"https://docs.google.com/spreadsheets/d/{GOOGLE_SPREADSHEET_ID}/edit?usp=sharing" # put your spreadsheet link here
DATA_TEST_SHEET_NAME: str = "test_data_time_management"

GOOGLE_SHEETS_CLIENT_EMAIL: str = os.getenv('GOOGLE_SHEETS_CLIENT_EMAIL')
GOOGLE_SHEETS_PRIVATE_KEY: str = os.getenv('GOOGLE_SHEETS_PRIVATE_KEY')
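
Both Google credentials are read from the environment, and `os.getenv` returns `None` when a variable is absent, so a missing `.env` entry would otherwise only surface later as an authentication failure. A minimal sketch of an early check follows. The helper name `find_missing_env` is hypothetical and is not part of this notebook's modules.

```python
import os
from typing import Iterable, List, Mapping, Optional


def find_missing_env(names: Iterable[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# Hypothetical usage with the two variables this notebook reads.
required = ["GOOGLE_SHEETS_CLIENT_EMAIL", "GOOGLE_SHEETS_PRIVATE_KEY"]
missing = find_missing_env(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing an explicit mapping as `env` makes the helper easy to exercise without touching the real environment.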

### Google Auth

In [2]:
# Google Authentication
from modules.google_sheets_writer import GoogleUtil

PRIVATE_KEY = GOOGLE_SHEETS_PRIVATE_KEY
google: GoogleUtil = GoogleUtil(PRIVATE_KEY, GOOGLE_SHEETS_CLIENT_EMAIL)


In [3]:
# Load the test data; the first sheet row holds the column names
rows: List[list] = google.retrieve_worksheet(GOOGLE_SPREADSHEET_ID, DATA_TEST_SHEET_NAME)
df_data_test: pd.DataFrame = pd.DataFrame(rows[1:], columns=rows[0])
df_data_test = df_data_test[50:]  # skip the first 50 rows; original index labels are kept
display(df_data_test.head(2))

| | No | Prompt | Category | Expected SQL Query | Expected Query Result | Expected SQL Query (base from catapa) | Expected Query Result (base from catapa) | Status | llm_prompt | Local Query Result | Database |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 51 | Berapa jumlah karyawan baru yang direkrut seti... | | WITH monthly_employees AS (\n SELECT \n DA... | [{'bulan': '2023-10', 'jumlah_karyawan_baru': 2}] | | | | | | |
| 51 | 52 | Berapa jumlah karyawan yang memiliki bawahan l... | | SELECT COUNT(DISTINCT manager_id) AS total_man... | [{'total_managers': 4}] | | | | | | |


## Database Information

In [1]:
import pandas as pd
from pydantic import BaseModel, Field
from typing import List
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from tqdm import tqdm

from modules.database_info.schema import employee_schema, time_management_schema
from modules.database_info.master_data import employee_master_data, time_management_master_data
from modules.database_info.relation import employee_relations, time_management_relations
from modules.database_info.trustee_tables import data_trustee_employee, data_trustee_time_management
from modules.database_info.anonymize_entities import anonymized_entities_description

## Prompt for augmenting SQL questions and queries

In [4]:
SYSTEM_MESSAGE = """<instructions>
You are an HR SQL augmentation specialist with expertise in creating diverse but RELATED business question-query pairs. Your task is to generate HR analytics questions that explore variations and extensions of the original question while using different analytical techniques.

For each synthetic question-query pair, you will provide:
1. BUSINESS_QUESTION: A question in Indonesian that is related to the core theme of the original question
2. SQL_REASONING: Step-by-step explanation connecting the business question to the SQL solution in Indonesian
3. SQL_QUERY: Clean, executable SQL code that answers the business question

The existing question serves as the central theme. Your generated questions should explore different angles, dimensions, or deeper insights related to this theme.
</instructions>"""

USER_MESSAGE = """I need you to generate {total_augmentations} RELATED but DIVERSE HR business questions and SQL queries based on the existing question.

## DATABASE SCHEMA
{schema}

## DATABASE RELATIONS
{relations}

## MASTER DATA
{master_data}

## DATA TRUSTEE (Sensitive Fields)
{data_trustee_tables}

## ANONYMIZED ENTITIES
{anonymized_entities_description}

## EXISTING QUESTION-QUERY PAIR (THIS DEFINES THE THEME)
### Question
{existing_question}
### Query
{existing_query_pairs}

## TASK
Generate EXACTLY {total_augmentations} questions that are RELATED to the existing question's theme but explore different aspects, angles, or deeper insights. Each example must:

1. MAINTAIN THEMATIC CONNECTION: Each question should relate to the core theme of the existing question (e.g., if the original is about age distribution, generated questions should explore different aspects of age demographics)
2. USE DIFFERENT ANALYTICAL APPROACHES: Each generated question should analyze the theme using a different approach
3. VARY IN COMPLEXITY: Include both simpler and more complex variations
4. EXTEND THE ANALYSIS: Go beyond the basic analysis in the original question to provide deeper or different insights on the same theme

## RELATED VARIATIONS APPROACH

For each question, create variations by:
1. ADDING DIMENSIONS: Segment the core metric by additional variables (department, gender, location, etc.)
2. TEMPORAL ANALYSIS: Explore how the core metric changes over time
3. CORRELATION EXPLORATION: Analyze relationships between the core metric and other variables
4. OUTLIER IDENTIFICATION: Find notable extremes or anomalies in the core metric
5. COMPARATIVE STUDY: Compare the core metric across different groups or categories
6. TREND ANALYSIS: Identify patterns or trajectories in the core metric

## ANALYTICAL APPROACHES TO INCLUDE
Each question should use a DIFFERENT analytical approach related to the theme:
1. SEGMENTATION ANALYSIS: Dividing the core metric by different dimensions
2. TIME SERIES ANALYSIS: Analyzing trends in the core metric over time
3. COMPARATIVE ANALYSIS: Comparing the core metric across different groups
4. CORRELATION ANALYSIS: Examining relationships between the core metric and other variables
5. ANOMALY DETECTION: Identifying outliers or unusual patterns in the core metric
6. DISTRIBUTION ANALYSIS: Examining how the core metric is distributed across different ranges
7. RATIO ANALYSIS: Calculating and comparing ratios related to the core metric
8. PREDICTIVE MODELING: Using the core metric to predict future outcomes
9. COHORT ANALYSIS: Comparing groups based on shared characteristics related to the core metric
10. VARIANCE ANALYSIS: Analyzing the spread or variability of the core metric

## SQL TECHNIQUES TO UTILIZE
Each question should use DIFFERENT SQL techniques, appropriate for the analysis:
- Window functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, etc.)
- Common Table Expressions (CTEs) for complex multi-step analysis
- CASE statements for conditional logic and categorization
- Date and time functions for temporal analysis
- Subqueries and correlated subqueries
- Self-joins for comparing records
- Aggregation functions (COUNT, SUM, AVG, MIN, MAX, etc.)
- Statistical functions (PERCENTILE, STDDEV, etc.)
- Pivoting data for different perspectives
- HAVING clauses for filtered aggregations

## FORMAT REQUIREMENTS
- Always include these filtering conditions with placeholders:
  * `organization_id IN ('[ORGANIZATION_IDS]')`
  * `job_level_id IN ('[JOB_LEVEL_IDS]')`
  * `location_id IN ('[LOCATION_IDS]')`
- Use exact format for date functions: `STR_TO_DATE('2023-01-01', '%Y-%m-%d')`
- Include appropriate placeholders in every query

## OUTPUT FORMAT
Provide the output in this JSON structure:

```json
{{
  "augmented_questions": [
    {{
      "question_id": 1,
      "business_question": "<augmented_business_question>",
      "sql_reasoning": "<augmented_sql_reasoning>",
      "required_tables": ["<augmented_required_tables>"],
      "sql_query": "<augmented_sql_query>"
    }},
    // Additional examples with RELATED but DIFFERENT analytical approaches
  ]
}}
```

## VERIFICATION STEPS
Before finalizing your response, verify that:
1. You have generated EXACTLY {total_augmentations} questions
2. EACH question is clearly RELATED to the theme of the existing question
3. EACH question explores a DIFFERENT aspect or dimension of the same theme
4. EACH question uses a DIFFERENT analytical approach
5. EACH question employs DIFFERENT SQL techniques

The goal is to create variations that explore the same theme from different angles, providing a comprehensive set of analytics on the core subject.
"""
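
The user message mixes two kinds of braces: single braces such as `{schema}` mark template variables, while the doubled braces in the JSON example are escapes that render as literal braces. LangChain's default template format follows Python's `str.format` rules for this, so the behavior can be illustrated with plain string formatting. The short template below is made up for the illustration and is not the notebook's prompt.

```python
# Illustrative template: one variable plus an escaped JSON skeleton.
template = 'Generate {total_augmentations} items as {{"augmented_questions": []}}'

# The single-brace variable is substituted; the doubled braces collapse to literal braces.
rendered = template.format(total_augmentations=3)
print(rendered)
```

Forgetting to double a literal brace would make the formatter treat the JSON keys as missing variables, which is why the output-format example in the prompt is written with `{{` and `}}` throughout.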

### Create the data generation pipeline

In [5]:
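class QnAPair(BaseModel):
    # Fields mirror one entry of "augmented_questions" in the prompt's OUTPUT FORMAT.
    question_id: int = Field(description="sequence number of the augmented question")
    business_question: str = Field(description="augmented business question in Indonesian")
    sql_reasoning: str = Field(description="step-by-step reasoning behind the SQL query")
    required_tables: List[str] = Field(description="tables used by the SQL query")
    sql_query: str = Field(description="SQL query that answers the business question")

class QnAPairs(BaseModel):
    augmented_questions: List[QnAPair] = Field(description="list of augmented question-query pairs")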
# DeepSeek, accessed through the OpenAI-compatible chat client
DEEPSEEK_MODEL_NAME = os.getenv("DEEPSEEK_MODEL")
DEEPSEEK_ENDPOINT = os.getenv("DEEPSEEK_ENDPOINT")
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")

llm = ChatOpenAI(
    model_name=DEEPSEEK_MODEL_NAME,
    temperature=0.7,  # Higher temperature (0.7-0.9) for more creative variations
    openai_api_base=DEEPSEEK_ENDPOINT,
    openai_api_key=DEEPSEEK_API_KEY,
    top_p=0.95,  # Keep high top_p for diverse outputs while filtering unlikely tokens
    seed=42  # Optional: set seed for reproducibility
)

# Create prompt template
system_message_prompt = SystemMessagePromptTemplate.from_template(SYSTEM_MESSAGE)
human_message_prompt = HumanMessagePromptTemplate.from_template(USER_MESSAGE)
prompt = ChatPromptTemplate.from_messages([
    system_message_prompt,
    human_message_prompt
])
parser = JsonOutputParser(pydantic_object=QnAPairs)
chain = prompt | llm | parser
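
The last step of the chain turns the model's text reply into a Python dictionary. As a rough illustration of what that step involves (not LangChain's actual implementation), the sketch below pulls a JSON object out of a fenced reply and keeps only the entries that carry the fields the prompt asks for. The function name `extract_augmented_questions` and the sample reply are hypothetical.

```python
import json
import re
from typing import Any, Dict, List

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fenced block
REQUIRED_KEYS = {"business_question", "sql_reasoning", "required_tables", "sql_query"}


def extract_augmented_questions(reply: str) -> List[Dict[str, Any]]:
    """Parse a model reply and return its well-formed augmented questions."""
    # Prefer the contents of a fenced block; fall back to the whole reply.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    data = json.loads(payload)
    questions = data.get("augmented_questions", [])
    # Keep only entries that carry every field requested in the prompt.
    return [q for q in questions if REQUIRED_KEYS.issubset(q)]


# Hypothetical reply: one complete entry and one that lacks most fields.
sample_reply = (
    FENCE + "json\n"
    '{"augmented_questions": [{"question_id": 1, "business_question": "q", '
    '"sql_reasoning": "r", "required_tables": ["employees"], "sql_query": "SELECT 1"}, '
    '{"question_id": 2, "business_question": "incomplete"}]}\n'
    + FENCE
)
print(len(extract_augmented_questions(sample_reply)))
```

A check like this matters because a reply that echoes the `// Additional examples` comment from the prompt's example would not be valid JSON, and `json.loads` would raise on it.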

### Sanity Check

In [7]:
from datetime import datetime

current_date = datetime.now().strftime("%d %B %Y")

database_type = "employee"

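# English: "What is the total value, with no digits after the decimal point, for the custom data LINTING_MEREK_SATU?"
business_question = "Berapa total nilai tanpa angka dibelakang koma untuk custom data LINTING_MEREK_SATU"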

current_sql_query = """SELECT
    SUM(CAST(employee_variables.value AS DECIMAL(10,0))) AS 'total_linting_merek_satu'
FROM
    employee_variables
    JOIN employee_variable_metadata
        ON employee_variables.employee_variable_metadata_id = employee_variable_metadata.id
    JOIN employees
        ON employee_variables.employee_id = employees.id
    JOIN employment_statuses
        ON employees.id = employment_statuses.employee_id
WHERE
    employee_variable_metadata.name = 'LINTING_MEREK_SATU'
    AND employment_statuses.organization_id IN ([ORGANIZATION_IDS])
    AND employment_statuses.job_level_id IN ([JOB_LEVEL_IDS])
    AND employment_statuses.location_id IN ([LOCATION_IDS]);"""


total_augmentations = 3

if database_type == "employee":
    schema = employee_schema
    relations = employee_relations
    master_data = employee_master_data
    data_trustee_tables = data_trustee_employee
else:
    schema = time_management_schema
    relations = time_management_relations
    master_data = time_management_master_data
    data_trustee_tables = data_trustee_time_management

response = chain.invoke({
    "schema": schema,
    "relations": relations,
    "master_data": master_data,
    "data_trustee_tables": data_trustee_tables,
    "anonymized_entities_description": anonymized_entities_description,
    "existing_question": business_question,
    "existing_query_pairs": current_sql_query,
    "total_augmentations": total_augmentations
})

if response["augmented_questions"]:
    for qna_pair in response["augmented_questions"]:
        print(qna_pair["business_question"])
        print(qna_pair["sql_query"])
        print("-"*100)

2025-04-11 02:13:55,825 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"


Bagaimana distribusi nilai custom data LINTING_MEREK_SATU berdasarkan jenis kelamin karyawan?
SELECT
    ed.gender,
    MIN(CAST(ev.value AS DECIMAL(10,0))) AS min_value,
    MAX(CAST(ev.value AS DECIMAL(10,0))) AS max_value,
    AVG(CAST(ev.value AS DECIMAL(10,0))) AS avg_value,
    COUNT(*) AS count_employees
FROM
    employee_variables ev
    JOIN employee_variable_metadata evm ON ev.employee_variable_metadata_id = evm.id
    JOIN employees e ON ev.employee_id = e.id
    JOIN employment_statuses es ON e.id = es.employee_id
    JOIN employee_details ed ON e.id = ed.employee_id
WHERE
    evm.name = 'LINTING_MEREK_SATU'
    AND es.organization_id IN ('[ORGANIZATION_IDS]')
    AND es.job_level_id IN ('[JOB_LEVEL_IDS]')
    AND es.location_id IN ('[LOCATION_IDS]')
GROUP BY
    ed.gender
ORDER BY
    avg_value DESC;
----------------------------------------------------------------------------------------------------
Bagaimana tren nilai rata-rata custom data LINTING_MEREK_SATU per bulan da
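
The prompt requires every generated query to filter on the organization, job level, and location placeholders. A minimal sketch of how that requirement could be checked on generated SQL before it is stored is shown here. `missing_placeholders` is a hypothetical helper, and the check is a plain substring test rather than real SQL parsing.

```python
from typing import List

REQUIRED_PLACEHOLDERS = ("[ORGANIZATION_IDS]", "[JOB_LEVEL_IDS]", "[LOCATION_IDS]")


def missing_placeholders(sql_query: str) -> List[str]:
    """Return the required placeholders that do not appear in the query text."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in sql_query]


# Made-up query that omits the job level filter.
example_sql = """SELECT COUNT(*)
FROM employment_statuses es
WHERE es.organization_id IN ('[ORGANIZATION_IDS]')
  AND es.location_id IN ('[LOCATION_IDS]')"""

print(missing_placeholders(example_sql))
```

Because the test looks only for the bracketed token, it accepts both the quoted form the prompt asks for and the unquoted form used in the original query.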

## Generate synthetic augmentation data

In [8]:
from datetime import datetime
import json
from pathlib import Path
import pandas as pd
from tqdm import tqdm
from modules.constants import ColumnName

def generate_augmentations_in_batches(
    df_data_test: pd.DataFrame,
    batch_size: int = 10,
    total_augmentations: int = 3,
    database_type: str = "employee"
):
    """Generate augmentations for test data in batches and save results.

    Args:
        df_data_test (pd.DataFrame): DataFrame containing test data with questions and SQL queries
        batch_size (int): Number of questions to process in each batch
        total_augmentations (int): Number of augmentations to generate per question
        database_type (str): Which database's schema and metadata to use ("employee" or "time_management")

    Returns:
        pd.DataFrame: DataFrame containing all augmented data
    """
    # Create directory for results
    augmented_data_dir = Path("augmented_sql_data_2")
    augmented_data_dir.mkdir(parents=True, exist_ok=True)

    # Define progress file path
    progress_file = augmented_data_dir / "augmentation_progress.json"

    # Select the schema, relations, master data, and trustee tables for the database type
    if database_type == "employee":
        schema = employee_schema
        relations = employee_relations
        master_data = employee_master_data
        data_trustee_tables = data_trustee_employee
    else:
        schema = time_management_schema
        relations = time_management_relations
        master_data = time_management_master_data
        data_trustee_tables = data_trustee_time_management

    # Initialize results DataFrame
    all_augmentations = pd.DataFrame(
        columns=[
            ColumnName.NO,
            ColumnName.BASE_PROMPT,
            ColumnName.PROMPT,
            ColumnName.SQL_REASONING,
            ColumnName.REQUIRED_TABLES,
            ColumnName.EXPECTED_SQL_QUERY
        ]
    )

    # Try to load existing progress if available
    processed_indices = set()
    if progress_file.exists():
        try:
            with open(progress_file, "r") as f:
                progress_data = json.load(f)
                all_augmentations = pd.DataFrame(progress_data.get("all_augmentations", []))
                processed_indices = set(progress_data.get("processed_indices", []))
                print(f"Loaded existing progress: {len(processed_indices)}/{len(df_data_test)} questions processed")
        except Exception as e:
            print(f"Error loading progress file: {e}")

    # Process data in batches
    total_rows = len(df_data_test)
    for batch_start in range(0, total_rows, batch_size):
        batch_end = min(batch_start + batch_size, total_rows)
        print(f"Processing batch {batch_start//batch_size + 1}: rows {batch_start} to {batch_end-1}")

        batch_augmentations = []

        # Process each row in the current batch
        for idx in tqdm(range(batch_start, batch_end), desc=f"Batch {batch_start//batch_size + 1}"):
            # Skip already processed indices
            if idx in processed_indices:
                continue

            row = df_data_test.iloc[idx]
            no = row[ColumnName.NO]
            base_business_question = row[ColumnName.PROMPT]
            base_sql_query = row[ColumnName.EXPECTED_SQL_QUERY]

            try:
                # Generate augmentations
                response = chain.invoke({
                    "schema": schema,
                    "relations": relations,
                    "master_data": master_data,
                    "data_trustee_tables": data_trustee_tables,
                    "anonymized_entities_description": anonymized_entities_description,
                    "existing_question": base_business_question,
                    "existing_query_pairs": base_sql_query,
                    "total_augmentations": total_augmentations
                })

                # Process augmentations
                if response["augmented_questions"]:
                    for aug_idx, qna_pair in enumerate(response["augmented_questions"]):
                        augmentation_no = f"{no}_{aug_idx + 1}"
                        business_question = qna_pair["business_question"]
                        sql_query = qna_pair["sql_query"]
                        sql_reasoning = qna_pair["sql_reasoning"]
                        required_tables = qna_pair["required_tables"]

                        # Add to batch results
                        batch_augmentations.append({
                            ColumnName.NO: augmentation_no,
                            ColumnName.BASE_PROMPT: base_business_question,
                            ColumnName.PROMPT: business_question,
                            ColumnName.SQL_REASONING: sql_reasoning,
                            ColumnName.REQUIRED_TABLES: required_tables,
                            ColumnName.EXPECTED_SQL_QUERY: sql_query
                        })

                # Mark as processed
                processed_indices.add(idx)

            except Exception as e:
                print(f"Error processing row {idx} ({ColumnName.NO}: {no}): {e}")
                # Log the error but continue with next item
                with open(augmented_data_dir / "augmentation_errors.log", "a") as f:
                    f.write(f"Row {idx} ({ColumnName.NO}: {no}) error: {str(e)}\n")

        # Add batch results to all_augmentations
        if batch_augmentations:
            batch_df = pd.DataFrame(batch_augmentations)
            all_augmentations = pd.concat([all_augmentations, batch_df], ignore_index=True)

            # Save progress
            with open(progress_file, "w") as f:
                json.dump({
                    "all_augmentations": all_augmentations.to_dict("records"),
                    "processed_indices": list(processed_indices),
                    "last_updated": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                }, f, indent=2)

            # Save current results to CSV
            all_augmentations.to_csv(augmented_data_dir / f"augmented_data_{database_type}_{len(df_data_test)}_questions_in_progress.csv", index=False)

    # Save final results
    final_csv_path = augmented_data_dir / f"augmented_data_{database_type}_{len(df_data_test)}_questions_complete.csv"
    all_augmentations.to_csv(final_csv_path, index=False)
    print(f"Successfully generated {len(all_augmentations)} augmentations for {len(processed_indices)} questions")
    print(f"Results saved to {final_csv_path}")

    return all_augmentations

# Run the augmentation process
batch_size = 10  # Process 10 questions at a time
total_augmentations = 3  # Generate 3 augmentations per question
# database_type = "employee"  # Use "employee" or "time_management"
database_type = "time_management"  # Use "employee" or "time_management"

augmented_data = generate_augmentations_in_batches(
    df_data_test,
    batch_size=batch_size,
    total_augmentations=total_augmentations,
    database_type=database_type
)

# Display sample of results
print("\nSample of augmented data:")
display(augmented_data.head(7))
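
The function above combines two ideas: it walks the data in fixed-size batches, and it records which row positions are done so that an interrupted run can resume. Stripped of the LLM call, the pattern looks roughly like the sketch below. `run_with_checkpoint` and `process` are hypothetical names, and the progress file lives in a temporary directory. Unlike the notebook, this sketch saves progress after every batch, even when a batch produced no new results.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_checkpoint(items: List[str], process: Callable[[str], str],
                        progress_file: Path, batch_size: int = 2) -> List[str]:
    """Process items in batches, skipping positions already recorded as done."""
    done: set = set()
    results: List[str] = []
    if progress_file.exists():
        saved = json.loads(progress_file.read_text())
        done = set(saved["processed_indices"])
        results = saved["results"]

    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        for idx in range(start, end):
            if idx in done:
                continue  # finished in an earlier run
            results.append(process(items[idx]))
            done.add(idx)
        # Persist after each batch so a crash loses at most one batch of work.
        progress_file.write_text(
            json.dumps({"processed_indices": sorted(done), "results": results})
        )
    return results


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "progress.json"
    calls: List[str] = []

    def shout(text: str) -> str:
        calls.append(text)
        return text.upper()

    first = run_with_checkpoint(["a", "b", "c"], shout, path)
    second = run_with_checkpoint(["a", "b", "c"], shout, path)  # resumes; nothing left to do
    print(first, second, len(calls))
```

The second call does no work because every position is already listed in the progress file, which is the property that lets a run of several hours be restarted without repeating API calls.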

Processing batch 1: rows 0 to 9


Batch 1:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 02:16:07,108 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1:  10%|█         | 1/10 [00:52<07:55, 52.79s/it]2025-04-11 02:16:59,777 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1:  20%|██        | 2/10 [01:42<06:46, 50.82s/it]2025-04-11 02:17:49,227 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-04-11 02:18:00,942 - INFO - Retrying request to /chat/completions in 0.435330 seconds
2025-04-11 02:18:02,048 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1:  30%|███       | 3/10 [03:04<07:34, 64.97s/it]2025-04-11 02:19:11,075 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1:  40%|████      | 4/10 [03:46<05:35, 55.97s/it]2025-04-11 02:19:53,216 - INFO - HTTP Request: POST https://api.dee

Processing batch 2: rows 10 to 19


Batch 2:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 02:25:43,702 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 2:  10%|█         | 1/10 [00:52<07:48, 52.02s/it]2025-04-11 02:26:35,745 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 2:  20%|██        | 2/10 [01:52<07:37, 57.25s/it]2025-04-11 02:27:36,638 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 2:  30%|███       | 3/10 [02:57<07:05, 60.73s/it]2025-04-11 02:28:41,516 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 2:  40%|████      | 4/10 [04:17<06:48, 68.10s/it]2025-04-11 02:30:01,037 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 2:  50%|█████     | 5/10 [05:07<05:08, 61.63s/it]2025-04-11 02:30:51,060 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 3: rows 20 to 29


Batch 3:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 02:35:32,446 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 3:  10%|█         | 1/10 [00:40<06:02, 40.24s/it]2025-04-11 02:36:12,659 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 3:  20%|██        | 2/10 [01:43<07:09, 53.72s/it]2025-04-11 02:37:15,821 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 3:  30%|███       | 3/10 [02:38<06:21, 54.56s/it]2025-04-11 02:38:11,433 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 3:  40%|████      | 4/10 [03:52<06:11, 61.98s/it]2025-04-11 02:39:24,813 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 3:  50%|█████     | 5/10 [04:39<04:43, 56.66s/it]2025-04-11 02:40:12,101 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 4: rows 30 to 39


Batch 4:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 02:44:03,210 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 4:  10%|█         | 1/10 [00:43<06:35, 43.96s/it]2025-04-11 02:44:47,175 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 4:  20%|██        | 2/10 [01:31<06:06, 45.80s/it]2025-04-11 02:45:34,269 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 4:  30%|███       | 3/10 [02:12<05:06, 43.77s/it]2025-04-11 02:46:15,702 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 4:  40%|████      | 4/10 [02:54<04:18, 43.14s/it]2025-04-11 02:46:57,795 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 4:  50%|█████     | 5/10 [03:58<04:13, 50.69s/it]2025-04-11 02:48:01,882 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 5: rows 40 to 49


Batch 5:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 02:52:51,030 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 5:  10%|█         | 1/10 [01:08<10:17, 68.56s/it]2025-04-11 02:53:59,590 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 5:  20%|██        | 2/10 [01:56<07:33, 56.63s/it]2025-04-11 02:54:47,892 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 5:  30%|███       | 3/10 [02:48<06:20, 54.37s/it]2025-04-11 02:55:39,588 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 5:  40%|████      | 4/10 [03:35<05:08, 51.43s/it]2025-04-11 02:56:26,481 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 5:  50%|█████     | 5/10 [04:15<03:56, 47.26s/it]2025-04-11 02:57:06,362 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 6: rows 50 to 59


Batch 6:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 03:01:51,263 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 6:  10%|█         | 1/10 [01:01<09:09, 61.02s/it]2025-04-11 03:02:52,322 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 6:  20%|██        | 2/10 [02:23<09:47, 73.38s/it]2025-04-11 03:04:14,306 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 6:  30%|███       | 3/10 [03:28<08:07, 69.67s/it]2025-04-11 03:05:19,582 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 6:  40%|████      | 4/10 [04:16<06:07, 61.25s/it]2025-04-11 03:06:07,911 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 6:  50%|█████     | 5/10 [05:01<04:37, 55.48s/it]2025-04-11 03:06:53,151 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 7: rows 60 to 69


Batch 7:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 03:11:36,206 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 7:  10%|█         | 1/10 [00:51<07:42, 51.40s/it]2025-04-11 03:12:27,611 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 7:  20%|██        | 2/10 [01:38<06:30, 48.86s/it]2025-04-11 03:13:15,740 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 7:  30%|███       | 3/10 [02:41<06:26, 55.25s/it]2025-04-11 03:14:17,611 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 7:  40%|████      | 4/10 [03:34<05:26, 54.40s/it]2025-04-11 03:15:10,641 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 7:  50%|█████     | 5/10 [04:16<04:09, 49.87s/it]2025-04-11 03:15:52,477 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 8: rows 70 to 79


Batch 8:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 03:20:57,499 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 8:  10%|█         | 1/10 [01:09<10:28, 69.83s/it]2025-04-11 03:22:07,329 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 8:  20%|██        | 2/10 [02:03<08:03, 60.38s/it]2025-04-11 03:23:01,112 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 8:  30%|███       | 3/10 [02:44<06:01, 51.69s/it]2025-04-11 03:23:42,497 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 8:  40%|████      | 4/10 [03:50<05:42, 57.11s/it]2025-04-11 03:24:47,891 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 8:  50%|█████     | 5/10 [04:44<04:39, 55.90s/it]2025-04-11 03:25:41,648 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 9: rows 80 to 89


Batch 9:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 03:30:34,410 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 9:  10%|█         | 1/10 [01:32<13:55, 92.83s/it]2025-04-11 03:32:07,334 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 9:  20%|██        | 2/10 [02:24<09:10, 68.87s/it]2025-04-11 03:32:59,337 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 9:  30%|███       | 3/10 [03:27<07:42, 66.01s/it]2025-04-11 03:34:01,959 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 9:  40%|████      | 4/10 [04:15<05:52, 58.81s/it]2025-04-11 03:34:49,773 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 9:  50%|█████     | 5/10 [05:09<04:46, 57.27s/it]2025-04-11 03:35:44,270 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Processing batch 10: rows 90 to 99


Batch 10:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 03:41:01,004 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 10:  10%|█         | 1/10 [01:11<10:40, 71.15s/it]2025-04-11 03:42:12,250 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 10:  20%|██        | 2/10 [02:11<08:37, 64.68s/it]2025-04-11 03:43:12,460 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 10:  30%|███       | 3/10 [03:13<07:25, 63.69s/it]2025-04-11 03:44:14,880 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 10:  40%|████      | 4/10 [04:03<05:49, 58.25s/it]2025-04-11 03:45:04,718 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 10:  50%|█████     | 5/10 [05:30<05:43, 68.61s/it]2025-04-11 03:46:31,740 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 11: rows 100 to 109


Batch 11:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 03:52:04,135 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 11:  10%|█         | 1/10 [01:43<15:30, 103.40s/it]2025-04-11 03:53:47,614 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 11:  20%|██        | 2/10 [02:50<10:55, 81.99s/it] 2025-04-11 03:54:54,521 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 11:  30%|███       | 3/10 [04:23<10:10, 87.22s/it]2025-04-11 03:56:27,970 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 11:  40%|████      | 4/10 [05:12<07:12, 72.12s/it]2025-04-11 03:57:16,946 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 11:  50%|█████     | 5/10 [06:12<05:38, 67.69s/it]2025-04-11 03:58:16,803 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completion

Processing batch 12: rows 110 to 119


Batch 12:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 04:03:13,640 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 12:  10%|█         | 1/10 [01:03<09:30, 63.39s/it]2025-04-11 04:04:16,963 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 12:  20%|██        | 2/10 [01:41<06:26, 48.26s/it]2025-04-11 04:04:54,617 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 12:  30%|███       | 3/10 [03:07<07:39, 65.64s/it]2025-04-11 04:06:20,990 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 12:  40%|████      | 4/10 [04:07<06:19, 63.28s/it]2025-04-11 04:07:20,623 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 12:  50%|█████     | 5/10 [05:19<05:32, 66.49s/it]2025-04-11 04:08:32,801 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 13: rows 120 to 129


Batch 13:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 04:14:00,808 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 13:  10%|█         | 1/10 [00:48<07:16, 48.51s/it]2025-04-11 04:14:49,331 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 13:  20%|██        | 2/10 [02:15<09:27, 70.95s/it]2025-04-11 04:16:16,009 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 13:  30%|███       | 3/10 [03:26<08:18, 71.14s/it]2025-04-11 04:17:27,392 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 13:  40%|████      | 4/10 [04:46<07:26, 74.47s/it]2025-04-11 04:18:46,929 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 13:  50%|█████     | 5/10 [05:47<05:48, 69.68s/it]2025-04-11 04:19:48,115 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 14: rows 130 to 139


Batch 14:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 04:24:36,919 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 14:  10%|█         | 1/10 [01:04<09:44, 64.91s/it]2025-04-11 04:25:41,856 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 14:  20%|██        | 2/10 [02:04<08:14, 61.79s/it]2025-04-11 04:26:41,384 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 14:  30%|███       | 3/10 [02:48<06:16, 53.84s/it]2025-04-11 04:27:25,771 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 14:  40%|████      | 4/10 [03:56<05:56, 59.37s/it]2025-04-11 04:28:33,618 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 14:  50%|█████     | 5/10 [04:50<04:47, 57.50s/it]2025-04-11 04:29:27,819 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 15: rows 140 to 149


Batch 15:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 04:33:55,149 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 15:  10%|█         | 1/10 [00:55<08:20, 55.64s/it]2025-04-11 04:34:50,791 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 15:  20%|██        | 2/10 [01:36<06:17, 47.13s/it]2025-04-11 04:35:31,951 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 15:  30%|███       | 3/10 [02:12<04:54, 42.02s/it]2025-04-11 04:36:07,895 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 15:  40%|████      | 4/10 [03:07<04:41, 46.89s/it]2025-04-11 04:37:02,273 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 15:  50%|█████     | 5/10 [03:57<03:59, 48.00s/it]2025-04-11 04:37:52,251 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 16: rows 150 to 159


Batch 16:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 04:42:55,604 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 16:  10%|█         | 1/10 [00:47<07:03, 47.00s/it]2025-04-11 04:43:42,585 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 16:  20%|██        | 2/10 [01:33<06:15, 47.00s/it]2025-04-11 04:44:29,577 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 16:  30%|███       | 3/10 [02:27<05:48, 49.84s/it]2025-04-11 04:45:22,806 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 16:  40%|████      | 4/10 [03:17<04:59, 49.91s/it]2025-04-11 04:46:12,814 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 16:  50%|█████     | 5/10 [04:54<05:35, 67.04s/it]2025-04-11 04:47:50,224 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 17: rows 160 to 169


Batch 17:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 04:52:11,213 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 17:  10%|█         | 1/10 [01:00<09:05, 60.64s/it]2025-04-11 04:53:11,844 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 17:  20%|██        | 2/10 [02:11<08:54, 66.85s/it]2025-04-11 04:54:23,162 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 17:  30%|███       | 3/10 [03:15<07:38, 65.44s/it]2025-04-11 04:55:26,807 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 17:  40%|████      | 4/10 [04:14<06:17, 62.85s/it]2025-04-11 04:56:25,684 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 17:  50%|█████     | 5/10 [04:58<04:40, 56.05s/it]2025-04-11 04:57:09,726 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 18: rows 170 to 179


Batch 18:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 05:01:45,103 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 18:  10%|█         | 1/10 [01:02<09:23, 62.58s/it]2025-04-11 05:02:47,642 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 18:  20%|██        | 2/10 [01:41<06:28, 48.60s/it]2025-04-11 05:03:26,469 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 18:  30%|███       | 3/10 [02:36<06:00, 51.53s/it]2025-04-11 05:04:21,554 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 18:  40%|████      | 4/10 [03:08<04:22, 43.72s/it]2025-04-11 05:04:53,232 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 18:  50%|█████     | 5/10 [03:50<03:36, 43.26s/it]2025-04-11 05:05:35,731 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 19: rows 180 to 189


Batch 19:   0%|          | 0/10 [00:00<?, ?it/s]2025-04-11 05:10:15,384 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 19:  10%|█         | 1/10 [00:57<08:33, 57.07s/it]2025-04-11 05:11:12,586 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 19:  20%|██        | 2/10 [02:06<08:35, 64.39s/it]2025-04-11 05:12:21,966 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 19:  30%|███       | 3/10 [03:02<07:03, 60.47s/it]2025-04-11 05:13:17,778 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 19:  40%|████      | 4/10 [04:02<06:00, 60.16s/it]2025-04-11 05:14:17,470 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 19:  50%|█████     | 5/10 [04:36<04:13, 50.71s/it]2025-04-11 05:14:51,436 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions 

Processing batch 20: rows 190 to 195


Batch 20:   0%|          | 0/6 [00:00<?, ?it/s]2025-04-11 05:19:17,391 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 20:  17%|█▋        | 1/6 [00:42<03:32, 42.55s/it]2025-04-11 05:19:59,965 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 20:  33%|███▎      | 2/6 [01:16<02:31, 37.78s/it]2025-04-11 05:20:34,415 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 20:  50%|█████     | 3/6 [02:09<02:13, 44.66s/it]2025-04-11 05:21:27,101 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 20:  67%|██████▋   | 4/6 [03:12<01:43, 51.92s/it]2025-04-11 05:22:30,289 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 20:  83%|████████▎ | 5/6 [03:50<00:46, 46.64s/it]2025-04-11 05:23:07,680 - INFO - HTTP Request: POST https://api.deepseek.com/v1/chat/completions "HTTP/

Successfully generated 588 augmentations for 196 questions
Results saved to augmented_sql_data_2/augmented_data_time_management_196_questions_complete.csv

Sample of augmented data:


Unnamed: 0,No,Base Prompt,Prompt,SQL Reasoning,Required Tables,Expected SQL Query
0,51_1,Berapa jumlah karyawan baru yang direkrut seti...,Bagaimana distribusi karyawan baru pada tahun ...,Pertanyaan ini memerlukan analisis segmentasi ...,"[employees, employee_details, employment_statu...",WITH monthly_employees AS (\n SELECT \n DA...
1,51_2,Berapa jumlah karyawan baru yang direkrut seti...,Bagaimana tingkat retensi karyawan baru yang d...,Pertanyaan ini memerlukan analisis kohort untu...,"[employees, employment_statuses]","WITH q1_recruits AS (\n SELECT \n e.id,\n ..."
2,51_3,Berapa jumlah karyawan baru yang direkrut seti...,Bagaimana perbandingan tingkat rekrutmen karya...,Pertanyaan ini memerlukan analisis komparatif ...,"[employees, employment_statuses, locations]","SELECT \n l.name AS lokasi_kerja,\n COUNT(e...."
3,52_1,Berapa jumlah karyawan yang memiliki bawahan l...,Bagaimana distribusi jumlah bawahan langsung p...,Pertanyaan ini memperluas analisis dengan meli...,"[employees, employment_statuses, organizations]",WITH manager_subordinates AS (\n SELECT \n ...
4,52_2,Berapa jumlah karyawan yang memiliki bawahan l...,Bagaimana tren perubahan jumlah manager dalam ...,Pertanyaan ini menganalisis tren temporal deng...,"[employees, employment_status_histories, emplo...",WITH monthly_managers AS (\n SELECT \n ...
5,52_3,Berapa jumlah karyawan yang memiliki bawahan l...,Berapa rasio manager terhadap total karyawan d...,Pertanyaan ini melakukan analisis komparatif d...,"[employees, employment_statuses, job_levels]",WITH employee_counts AS (\n SELECT \n ...
6,53_1,Berapa persentase karyawan yang terlambat lebi...,Bagaimana tren keterlambatan karyawan (lebih d...,Pertanyaan ini memerlukan analisis time series...,"[presence_entries, attendance_statuses, employ...",WITH monthly_late_data AS (\n SELECT \n or...


In [9]:
augmented_data

Unnamed: 0,No,Base Prompt,Prompt,SQL Reasoning,Required Tables,Expected SQL Query
0,51_1,Berapa jumlah karyawan baru yang direkrut seti...,Bagaimana distribusi karyawan baru pada tahun ...,Pertanyaan ini memerlukan analisis segmentasi ...,"[employees, employee_details, employment_statu...",WITH monthly_employees AS (\n SELECT \n DA...
1,51_2,Berapa jumlah karyawan baru yang direkrut seti...,Bagaimana tingkat retensi karyawan baru yang d...,Pertanyaan ini memerlukan analisis kohort untu...,"[employees, employment_statuses]","WITH q1_recruits AS (\n SELECT \n e.id,\n ..."
2,51_3,Berapa jumlah karyawan baru yang direkrut seti...,Bagaimana perbandingan tingkat rekrutmen karya...,Pertanyaan ini memerlukan analisis komparatif ...,"[employees, employment_statuses, locations]","SELECT \n l.name AS lokasi_kerja,\n COUNT(e...."
3,52_1,Berapa jumlah karyawan yang memiliki bawahan l...,Bagaimana distribusi jumlah bawahan langsung p...,Pertanyaan ini memperluas analisis dengan meli...,"[employees, employment_statuses, organizations]",WITH manager_subordinates AS (\n SELECT \n ...
4,52_2,Berapa jumlah karyawan yang memiliki bawahan l...,Bagaimana tren perubahan jumlah manager dalam ...,Pertanyaan ini menganalisis tren temporal deng...,"[employees, employment_status_histories, emplo...",WITH monthly_managers AS (\n SELECT \n ...
...,...,...,...,...,...,...
583,245_2,Berapa jumlah karyawan yang memiliki lebih dar...,Apakah ada korelasi antara jumlah bawahan lang...,Pertanyaan ini memerlukan analisis korelasi un...,"[employees, employment_statuses, job_levels]","SELECT jl.name AS job_level, AVG(sub.report_co..."
584,245_3,Berapa jumlah karyawan yang memiliki lebih dar...,Bagaimana tren perubahan jumlah bawahan langsu...,Pertanyaan ini memerlukan analisis time series...,"[employees, employment_statuses, employment_st...",WITH monthly_data AS (SELECT DATE_FORMAT(h.eff...
585,246_1,Bagaimana tren penggunaan cuti karyawan selama...,Bagaimana distribusi penggunaan cuti karyawan ...,Pertanyaan ini memerlukan analisis segmentasi ...,"[attendances, attendance_statuses, employees, ...","SELECT \n DATE_FORMAT(a.date, '%Y-%m') AS m..."
586,246_2,Bagaimana tren penggunaan cuti karyawan selama...,Apa jenis cuti yang paling sering digunakan ol...,Pertanyaan ini memerlukan analisis komparatif ...,"[attendances, attendance_statuses, employees]",WITH employee_tenure AS (\n SELECT \n ...
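
The run above reports 588 augmentations for 196 questions, which works out to three variants per base question. A minimal sketch for confirming that no question came back short, assuming the `No` column always follows the `<base id>_<variant index>` pattern seen in the output (the helper name `augmentations_per_question` and the tiny `sample` frame are illustrative, not part of the notebook):

```python
import pandas as pd


def augmentations_per_question(df: pd.DataFrame, id_column: str = "No") -> pd.Series:
    """Count augmented rows per base question, given ids such as '51_1'."""
    # '51_1' -> ['51', '1']; keep only the base question id.
    base_ids = df[id_column].astype(str).str.rsplit("_", n=1).str[0]
    return base_ids.value_counts().sort_index()


# Small stand-in for `augmented_data`; question 53 is deliberately incomplete.
sample = pd.DataFrame(
    {"No": ["51_1", "51_2", "51_3", "52_1", "52_2", "52_3", "53_1"]}
)

counts = augmentations_per_question(sample)
incomplete = counts[counts != 3]  # base questions without exactly 3 variants
print(incomplete)
```

Running the same helper on `augmented_data` should give an empty `incomplete` series if every base question received all three variants.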


In [2]:
import pandas as pd

# Load a saved augmentation result (employee category) from CSV and display it
ggg = pd.read_csv('augmented_sql_data_2/augmented_data_employee_complete.csv')
ggg

Unnamed: 0,No,Base Prompt,Prompt,SQL Reasoning,Required Tables,Expected SQL Query
0,53_1,Berapa jumlah karyawan aktif saat ini berdasar...,Bagaimana distribusi karyawan aktif saat ini b...,Pertanyaan ini memperluas analisis dengan mena...,"['employees', 'employee_details', 'employment_...","SELECT jl.name AS job_level, ed.gender, COUNT(..."
1,53_2,Berapa jumlah karyawan aktif saat ini berdasar...,Bagaimana tren perubahan jumlah karyawan aktif...,Pertanyaan ini menganalisis perubahan temporal...,"['employees', 'employee_details', 'employment_...",WITH monthly_counts AS (SELECT DATE_FORMAT(esh...
2,53_3,Berapa jumlah karyawan aktif saat ini berdasar...,Apakah ada perbedaan signifikan dalam rasio ge...,Pertanyaan ini membandingkan distribusi gender...,"['employees', 'employee_details', 'employment_...","SELECT et.name AS employment_type, ed.gender, ..."
3,54_1,Apa alasan utama karyawan mengundurkan diri pa...,Bagaimana tren pengunduran diri karyawan berub...,Pertanyaan ini memerlukan analisis temporal de...,"['termination_entries', 'termination_reasons']",SELECT \n QUARTER(termination_entries.effec...
4,54_2,Apa alasan utama karyawan mengundurkan diri pa...,Apakah ada perbedaan alasan pengunduran diri a...,Pertanyaan ini memerlukan analisis segmentasi ...,"['termination_entries', 'termination_reasons',...","SELECT \n employee_details.gender,\n ter..."
...,...,...,...,...,...,...
289,149_2,Berapa jumlah karyawan dengan kontrak yang aka...,Apa distribusi karyawan dengan kontrak yang ak...,Pertanyaan ini memerlukan analisis segmentasi ...,"['employment_statuses', 'employment_types', 'j...","SELECT \n et.name AS contract_type,\n jl..."
290,149_3,Berapa jumlah karyawan dengan kontrak yang aka...,Berapa persentase karyawan dengan kontrak yang...,Pertanyaan ini memerlukan analisis komparatif ...,"['employment_statuses', 'employment_status_typ...",WITH contract_counts AS (\n SELECT \n ...
291,150_1,Tampilkan hierarki manajemen lengkap untuk sel...,Bagaimana distribusi gender di setiap level ja...,Pertanyaan ini memerlukan analisis segmentasi ...,"['employees', 'employment_statuses', 'job_leve...",WITH RECURSIVE org_hierarchy AS (\n SELECT e....
292,150_2,Tampilkan hierarki manajemen lengkap untuk sel...,Bagaimana rata-rata masa kerja (tenure) karyaw...,Pertanyaan ini memerlukan analisis temporal un...,"['employees', 'employment_statuses', 'job_leve...",WITH RECURSIVE org_hierarchy AS (\n SELECT e....
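
In the frame loaded from CSV, `Required Tables` is displayed with quoted items (`['employees', ...]`), which suggests each cell is now a plain string rather than the Python list it was in memory. A hedged sketch of converting such a column back to lists with `ast.literal_eval`, assuming every cell holds a valid list literal and there are no missing values (`parse_list_column` and `sample` are illustrative names):

```python
import ast

import pandas as pd


def parse_list_column(series: pd.Series) -> pd.Series:
    """Turn strings like "['a', 'b']" back into Python lists."""

    def parse(value):
        # Leave values that are already lists untouched.
        if isinstance(value, list):
            return value
        # literal_eval only evaluates Python literals, never arbitrary code.
        return ast.literal_eval(value)

    return series.apply(parse)


# Stand-in for the CSV-loaded frame.
sample = pd.DataFrame(
    {
        "Required Tables": [
            "['employees', 'employment_statuses']",
            "['termination_entries', 'termination_reasons']",
        ]
    }
)

sample["Required Tables"] = parse_list_column(sample["Required Tables"])
print(sample["Required Tables"].iloc[0])
```

The same call could be applied to the loaded frame, for example `ggg["Required Tables"] = parse_list_column(ggg["Required Tables"])`, before any processing that needs real lists.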


## Upload to Google Sheets

In [69]:
# Load the synthetic questions that will be uploaded to Google Sheets
df_syntetic_data = pd.read_csv('syntetics_data/syntetic_data_150_2.csv')
df_syntetic_data.head()

Unnamed: 0,no,category,complexity,question,required_tables,sql_concepts
0,1,TURNOVER ANALYSIS,Intermediate,Berapa persentase karyawan yang mengajukan pen...,"['termination_entries', 'termination_reasons']","['JOIN', 'COUNT', 'WHERE', 'GROUP BY']"
1,2,EMPLOYEE HIRING PATTERNS,Beginner,Berapa jumlah karyawan baru yang direkrut seti...,['employees'],"['COUNT', 'GROUP BY', 'DATE_FORMAT']"
2,3,MANAGEMENT HIERARCHY,Intermediate,Siapa manajer dengan jumlah bawahan langsung t...,['employees'],"['JOIN', 'COUNT', 'GROUP BY', 'ORDER BY']"
3,4,ATTENDANCE AND TIME TRACKING,Advanced,Bagaimana pola ketidakhadiran karyawan setelah...,"['employee_variables', 'employee_variable_meta...","['JOIN', 'WHERE', 'DATE', 'LIKE']"
4,5,EMPLOYMENT STATUS CHANGES,Intermediate,Berapa jumlah karyawan yang dipromosikan pada ...,"['employment_status_histories', 'employment_st...","['JOIN', 'COUNT', 'WHERE']"

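The upload cell that follows passes `batch_size`, `max_retries`, and `batch_delay` to `GoogleSheetsWriter`. The implementation of `modules.google_sheets_writer` is not shown in this notebook, so the following is only a rough sketch of how such parameters are commonly used: rows are written in fixed-size batches, each row is retried a limited number of times, and the loop pauses between batches. The function `write_in_batches`, its `write_row` callback, and the `flaky_write` stand-in are all hypothetical.

```python
import time
from typing import Callable, List, Sequence


def write_in_batches(
    rows: Sequence[str],
    write_row: Callable[[str], None],
    batch_size: int = 10,
    max_retries: int = 5,
    batch_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Write rows batch by batch; return 1-based numbers of rows that failed."""
    failed: List[int] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for offset, row in enumerate(batch):
            row_number = start + offset + 1
            for attempt in range(1, max_retries + 1):
                try:
                    write_row(row)
                    break  # row written, move on to the next one
                except Exception:
                    # Give up on this row once the retry budget is spent.
                    if attempt == max_retries:
                        failed.append(row_number)
        # Pause between batches, but not after the last one.
        if start + batch_size < len(rows):
            sleep(batch_delay)
    return failed


# Hypothetical writer: row 3 fails once, row 7 always fails.
attempts = {}


def flaky_write(row: str) -> None:
    attempts[row] = attempts.get(row, 0) + 1
    if row == "row-3" and attempts[row] < 2:
        raise RuntimeError("transient error")
    if row == "row-7":
        raise RuntimeError("permanent error")


rows = [f"row-{i}" for i in range(1, 26)]
failed = write_in_batches(
    rows, flaky_write, batch_size=10, max_retries=3, batch_delay=0, sleep=lambda s: None
)
print(failed)
```

Injecting `sleep` as a parameter keeps the sketch testable without real delays; a production writer would more likely add backoff between retries as well.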

In [74]:
import logging

from modules.google_sheets_writer import GoogleSheetsWriter

writer = GoogleSheetsWriter(
    google_util=google,  # GoogleUtil instance created in the Google Auth cell
    sheet_id=GOOGLE_SPREADSHEET_ID,
    worksheet_name="catapa_syntetics_question",
    batch_size=10,  # Number of rows per batch
    max_retries=5,  # Maximum retry attempts
    batch_delay=2  # Delay between batches
)

# Write the DataFrame to the worksheet
result = writer.write_dataframe(df_syntetic_data)

# Log a summary, then the details of any rows that failed
logging.info("Successfully wrote %s rows", result.successful_rows)
if result.failed_rows > 0:
    logging.error("Failed to write %s rows", result.failed_rows)
    for error in result.errors:
        logging.error("Row %s: %s", error['row_number'], error['error'])

  0%|          | 0/15 [00:00<?, ?it/s]2025-03-06 13:51:08,571 - INFO - Successfully wrote row 1/150
2025-03-06 13:51:10,832 - INFO - Successfully wrote row 2/150
2025-03-06 13:51:13,200 - INFO - Successfully wrote row 3/150
2025-03-06 13:51:15,741 - INFO - Successfully wrote row 4/150
2025-03-06 13:51:18,000 - INFO - Successfully wrote row 5/150
2025-03-06 13:51:20,113 - INFO - Successfully wrote row 6/150
2025-03-06 13:51:22,396 - INFO - Successfully wrote row 7/150
2025-03-06 13:51:24,574 - INFO - Successfully wrote row 8/150
2025-03-06 13:51:26,962 - INFO - Successfully wrote row 9/150
2025-03-06 13:51:29,229 - INFO - Successfully wrote row 10/150
  7%|▋         | 1/15 [00:25<05:54, 25.35s/it]2025-03-06 13:51:33,455 - INFO - Successfully wrote row 11/150
2025-03-06 13:51:35,713 - INFO - Successfully wrote row 12/150
2025-03-06 13:51:38,071 - INFO - Successfully wrote row 13/150
2025-03-06 13:51:40,249 - INFO - Successfully wrote row 14/150
2025-03-06 13:51:42,474 - INFO - Successful