In [1]:
import boto3
import json

In [33]:
payload = """
Pay Attention, Do not hallucinate, only work on what is there in the below and think deep for all the edge cases for the below requirements.
Do NOT assume any properties (file format, update frequency, row counts, S3 location, etc.) that are not explicitly present in the JSON. If unknown, either omit or mark as "unknown".

You are a Senior Data Engineer Copilot specializing in **AWS Glue and Athena**.

You will receive a schema object extracted directly from the **AWS Glue Data Catalog**.
Your goal is to generate a Data Quality Contract and Post-Load Tests suitable for an AWS Data Lake environment and also confulence style documentation for the table.

The JSON object will look like this (shape, not exact values):

{ddl_obj}

Where:
- table_name: name of the table
- database: database / schema name
- schema: list of columns, each like:
  - name: column name
  - type: data type (string, int, double, date, etc.)
  - nullable: true / false
  - comments: optional free-text description that may contain strict business rules
- column_stats: ONLY for NON-PII columns, something like:
  - min, max, null_pct, distinct_count, top_values
- constraints: optional list of primary keys, unique keys, check constraints
- job_summary: includes inputs, filters, and grain if available
- rule_type MUST be exactly one of the allowed values. 
  Do NOT invent new rule_type names. If unsure, choose the closest one.

IMPORTANT PRIVACY RULES:
- PII columns (name, email, phone_number, salary, etc.) appear in "schema"
  but their stats and values are NOT provided.
- For PII columns:
  - You MAY define structural rules: not_null, not_empty, length, regex.
  - You MUST NOT include any concrete example values (no fake SSNs, emails, phones).
  - Only describe patterns, like "must be 9 digits", "must match email format".
- For non-PII columns:
  - You MAY use column_stats to propose ranges and allowed_values.
  - Still avoid writing specific sample values in descriptions; talk about rules.

----------------------------------------
INTERPRETING COLUMN COMMENTS (BUSINESS RULES)
----------------------------------------

For every column, you MUST read and interpret the "comments" field (if present).

- If comments clearly indicate the column is required or cannot be null
  (examples: "cannot be null", "must not be null", "mandatory", "required"),
  then you MUST treat that column as **business not-null**, even if nullable = true.
  In that case:
    - Generate a not_null rule.
    - spark_exp MUST enforce non-null, for example: "joining_date IS NOT NULL".

- If comments describe value constraints, you MUST turn them into rules:
  Examples:
    - "must be positive" → min rule with spark_exp like "col > 0" (with NULL allowed if column is nullable).
    - "0–100 only" → min + max, or an allowed_values rule.
    - "YYYY-MM-DD" → format / regex rule on a string column, or date validity check.
    - "ISO country code (3-char)" → length or regex rule.

- If there is a conflict between nullable and comments:
    - nullable = true but comments say "cannot be null" → comments WIN, you MUST enforce not_null.
    - nullable = false but comments are silent → follow nullable = false as usual.

- Comments can also add extra logic beyond nullability (e.g. business ranges, relations).
  You MUST convert any clear business rule from comments into either:
    - a data_quality rule (with spark_exp), and/or
    - a post-load test.

Do NOT ignore comments. If comments contain a clear rule, you must reflect it in the JSON.

----------------------------------------
THINKING / COVERAGE REQUIREMENTS
----------------------------------------

You must think column-by-column and constraint-by-constraint. 
Do not skip any column.
For coverage:
- Every column in "schema" (except purely technical partition columns) MUST appear in at least one rule in data_quality.rules.
- Do NOT skip columns just because nullable = true. If a column is nullable, you can still enforce rules like "if present, must not be empty" or "if present, must match pattern".

For every column in "schema":

1. COMPLETENESS
   IMPORTANT:
    - If nullable = true AND comments do NOT say things like "cannot be null" / "mandatory" / "required":
        * You MUST NOT create a not_null rule.
        * Nullable columns must allow NULL in all spark_exp expressions.
    - If nullable = true BUT comments explicitly say the column cannot be null (or equivalent):
        * Treat it as business-required.
        * You MUST create a not_null rule.
        * spark_exp must enforce non-null, e.g. "joining_date IS NOT NULL".
    - If nullable = false → always generate a not_null rule.

   - For ALL string-like columns (string, varchar, char), regardless of nullable:
       * Always generate a not_empty-style rule:
         - If nullable = true: use logic "value IS NULL OR trimmed length > 0".
         - If nullable = false OR business-required via comments: use logic "value IS NOT NULL AND trimmed length > 0".

2. VALIDITY
   Use the combination of:
   - column name
   - data type 
   - column_stats (only non-PII)
   - constraints (CHECK, PK, UNIQUE)
   - comments (business rules)
   to infer validity rules such as:
     * numeric columns >= 0 unless obviously not applicable
     * string columns with stable lengths → infer min_length / max_length or regex
     * year/date columns must not be in the future
     * codes (country_code, dep_code) must be in allowed_values if low-cardinality
     * any explicit range / format / business condition described in comments

3. RANGE RULES (NON-PII ONLY)
   Use column_stats[min_val, max_val, distinct_count, null_pct].
   Create soft WARNING rules with a 20–25% buffer around min/max or p95 if present.

4. ALLOWED VALUES (NON-PII ONLY)
   If distinct_count is small (< 50) AND stable → generate allowed_values.

5. PII COLUMNS
   - PII columns appear in schema but have NO column_stats.
   - For these columns you MUST generate:
       * not_null (if nullable = false OR comments say it is required)
       * not_empty for strings
       * regex or fixed-length patterns inferred ONLY from schema + column name + comments
   - NEVER include example email or SSN values. Only describe patterns.

6. CROSS-COLUMN LOGIC (IF OBVIOUS)
   If year/date columns exist → ensure year <= current year.
   If ID + email exist → email should not be null if ID exists.
   If joining_date and resign_date exist → resign_date >= joining_date.
   If comments describe cross-column relationships, you MUST convert those into rules or tests.

7. TABLE-LEVEL RULES
   - If constraints include primary key → include uniqueness rule.
   - Add table-level rule: row_count > 0.

8. Data Quality (pre-load PySpark)
   For each rule, you MUST output "spark_exp" using **Spark SQL syntax only**, not PySpark API.
    spark_exp MUST be a SQL expression that can be passed directly into:

    df.filter(expr(spark_exp))

    Examples of valid spark_exp:
      "salary >= 0"
      "salary IS NULL OR salary >= 0"
      "name IS NULL OR length(trim(name)) > 0"
      "joining_date <= current_date()"
      "joining_date IS NOT NULL"   -- when comments say 'cannot be null'

    Examples of INVALID spark_exp (do NOT generate these):
      col('salary') >= 0
      F.col("name").isNull()
      dataframe.count() > 0
      salary.notNull()

9. TEST COVERAGE (Post-load SQL)
   You must generate SQL tests for:
       * uniqueness of PK/grain
       * null checks on required columns (nullable=false OR required via comments)
       * each CHECK constraint
       * future-date violations
       * allowed_values validation (for low-cardinality columns)
       * numeric range violations

After generating rules and tests, REVIEW:
- Did you include ALL columns?
- Did you cover ALL non-nullable columns and all columns required by comments?
- Did you enforce ALL constraints + business rules from comments?
- Did you create BOTH rules AND tests?


----------------------------------------
OUTPUT FORMAT (MUST BE VALID JSON)
----------------------------------------
Important: In the final JSON, the set of column names used in data_quality.rules (excluding "__TABLE__") MUST match the set of column names in "schema" (case-insensitive). Do not omit any columns.
           Coverage requirement does NOT override the schema.
           If a column is nullable and comments do not make it required,
           you may generate "if present, must..." rules
           (e.g., not_empty, min/max with NULL allowed),
           but you MUST NOT force mandatory constraints such as not_null.

Return ONLY valid JSON in this exact structure (no extra comments):

{{
  "data_quality": {{
    "rules": [
      {{
        "column": "col_name_or__TABLE__for_table_level",
        "rule_type": "not_null | not_empty | min | max | allowed_values | regex | pk | fk | check_constraint | custom_sql",
        "condition": "value / list / SQL expression / description string",
        "severity": "ERROR | WARNING",
        "action": "FAIL_JOB | DROP_ROW | WARN",
        "description": "Short reasoning for the rule (no concrete example values).",
        "spark_exp": "A Spark SQL boolean expression that returns TRUE for valid rows and can be passed directly to pyspark.sql.functions.expr(). It MUST NOT reference any DataFrame variable and MUST NOT call actions like count(), groupBy(), collect(), etc."
      }}
    ]
  }},
  "tests": [
    {{
      "name": "test_name",
      "sql": "SELECT ...",
      "description": "What this test validates."
    }}
  ],
  "docs_markdown": "# Table Documentation\\n..."
}}

Do NOT include anything outside this JSON object.
"""


In [None]:
s3_client = boto3.client('s3',region_name='us-east-1') 
s3_client.put_object(Bucket = 'de-copilot-s3',Key = 'prompt/llm_prompt.txt',Body = payload,ContentType = 'text/plain')

{'ResponseMetadata': {'RequestId': 'CY45NEAH7VSZF29Q',
  'HostId': 'qB2/19Wj9Ug+tOBfrF2Og3wqhuVbywMN9Im1yq3aBToOFjq7W3YTmeUgv8xwDp6C/RR34BNGTAM=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'qB2/19Wj9Ug+tOBfrF2Og3wqhuVbywMN9Im1yq3aBToOFjq7W3YTmeUgv8xwDp6C/RR34BNGTAM=',
   'x-amz-request-id': 'CY45NEAH7VSZF29Q',
   'date': 'Sun, 23 Nov 2025 20:29:48 GMT',
   'x-amz-server-side-encryption': 'AES256',
   'etag': '"439b816afa33f1031ad050c88bd2dcdb"',
   'x-amz-checksum-crc32': 'dGvKPw==',
   'x-amz-checksum-type': 'FULL_OBJECT',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"439b816afa33f1031ad050c88bd2dcdb"',
 'ChecksumCRC32': 'dGvKPw==',
 'ChecksumType': 'FULL_OBJECT',
 'ServerSideEncryption': 'AES256'}

In [39]:
def build_prompt(cur_ddl):
    s3_client = boto3.client('s3')
    prompt = s3_client.get_object(Bucket = 'de-copilot-s3',Key='prompt/llm_prompt.txt')
    prompt = prompt['Body'].read()
    prompt = prompt.decode('utf-8')
           
    glue_ddl = json.dumps(cur_ddl,indent = 2)
    
    final_prompt = prompt.replace("{ddl_obj}", glue_ddl)
    
    return final_prompt

In [40]:
ddl = {'table_name': 'employees_test',
 'database': 'copilot_demo',
 'schema': [{'name': 'emp_id',
   'type': 'int',
   'nullable': False,
   'partition_key': False,
   'primary_key': True,
   'foregin_key': False,
   'comments': 'primary_key'},
  {'name': 'name',
   'type': 'string',
   'nullable': True,
   'partition_key': False,
   'primary_key': False,
   'foregin_key': False,
   'comments': ''},
  {'name': 'salary',
   'type': 'double',
   'nullable': True,
   'partition_key': False,
   'primary_key': False,
   'foregin_key': False,
   'comments': ''},
  {'name': 'department',
   'type': 'string',
   'nullable': True,
   'partition_key': False,
   'primary_key': False,
   'foregin_key': False,
   'comments': ''},
  {'name': 'joining_date',
   'type': 'date',
   'nullable': True,
   'partition_key': False,
   'primary_key': False,
   'foregin_key': False,
   'comments': 'cannot be null'}],
 'column_stats': {'ROW_COUNT': '261',
  'joining_date': {'min': '2011-01-17',
   'max': '2023-11-30',
   'null_pct': 0.0038314176245210726,
   'distinct_count': 250},
  'department': {'min': 'Finance',
   'max': 'Support',
   'null_pct': 0.0,
   'distinct_count': 6},
  'emp_id': {'min': '1',
   'max': '287',
   'null_pct': 0.0,
   'distinct_count': 258}}}

In [41]:
llm_prompt = build_prompt(ddl)

In [42]:
print(llm_prompt)


Pay Attention, Do not hallucinate, only work on what is there in the below and think deep for all the edge cases for the below requirements.
Do NOT assume any properties (file format, update frequency, row counts, S3 location, etc.) that are not explicitly present in the JSON. If unknown, either omit or mark as "unknown".

You are a Senior Data Engineer Copilot specializing in **AWS Glue and Athena**.

You will receive a schema object extracted directly from the **AWS Glue Data Catalog**.
Your goal is to generate a Data Quality Contract and Post-Load Tests suitable for an AWS Data Lake environment and also confulence style documentation for the table.

The JSON object will look like this (shape, not exact values):

{
  "table_name": "employees_test",
  "database": "copilot_demo",
  "schema": [
    {
      "name": "emp_id",
      "type": "int",
      "nullable": false,
      "partition_key": false,
      "primary_key": true,
      "foregin_key": false,
      "comments": "primary_key"
 

In [52]:
def call_gemini(payload):
    api_key = os.environ['GEMINI_API_KEY']
    url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent?key=AIzaSyBari9dB4rLabMtlEV0sZEPjdks-Vl0_84"
    
    headers = {"Content-Type": "application/json"}

    data = {"contents": [{"parts": [{"text": payload}]}]}

    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.text

In [53]:
import os, requests
llm_response = call_gemini(llm_prompt)

In [54]:
llm_response



In [55]:
import re 
clean_json_str = re.sub(r"```json\n|\n```", "", llm_response).strip()

In [56]:
cleaned_json = json.loads(clean_json_str)

In [57]:
cleaned_json

    'role': 'model'},
   'finishReason': 'STOP',
   'index': 0}],
 'usageMetadata': {'promptTokenCount': 3005,
  'candidatesTokenCount': 2066,
  'totalTokenCount': 11415,
  'promptTokensDetails': [{'modality': 'TEXT', 'tokenCount': 3005}],
  'thoughtsTokenCount': 6344},
 'modelVersion': 'gemini-2.5-pro',
 'responseId': 'yHEjaejaA4etz7IPvOX9gAI'}

In [58]:
[
  {
    "Name": "company_id",
    "Type": "int",
    "Comment": "primary_key "
  },
  {
    "Name": "company_name",
    "Type": "string",
    "Comment": "there will be no company without a name"
  },
  {
    "Name": "head_count",
    "Type": "int",
    "Comment": "total number of employees present in this company"
  },
  {
    "Name": "employee_name",
    "Type": "string",
    "Comment": "this the foreign_key FK : references employees_test(name)"
  },
  {
    "Name": "established",
    "Type": "date",
    "Comment": ""
  }
]

[{'Name': 'company_id', 'Type': 'int', 'Comment': 'primary_key '},
 {'Name': 'company_name',
  'Type': 'string',
  'Comment': 'there will be no company without a name'},
 {'Name': 'head_count',
  'Type': 'int',
  'Comment': 'total number of employees present in this company'},
 {'Name': 'employee_name',
  'Type': 'string',
  'Comment': 'this the foreign_key FK : references employees_test(name)'},
 {'Name': 'established', 'Type': 'date', 'Comment': ''}]