# Data Pipeline Using an LLM

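Go to https://groq.com/ and generate a free API key. In Colab, store it as a secret named Groq_API; the code below reads it with userdata.get("Groq_API").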


1. Data Cleaning:

  Begin by loading the dataset into your Colab environment.
  Use pandas functions like head(), info(), describe(), and value_counts() to explore the structure, data types, and basic statistics of the dataset.

  Identify potential data quality issues such as missing values, inconsistent formats, or incorrect entries.
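As a rough illustration of this exploration step, the sketch below checks a tiny, made-up table for the three kinds of problems named above. The `sample` frame and its values are hypothetical stand-ins for the real 311 data, and the date format string is an assumption about how the source encodes timestamps.

```python
import pandas as pd

# Hypothetical miniature of the 311 data; the lab itself loads `df` from NYC Open Data.
sample = pd.DataFrame({
    "complaint_type": ["Noise - Residential", "NOISE", "Illegal Parking", None],
    "city": ["BROOKLYN", None, "bronx", "Bronx"],
    "created_date": ["2024-01-05T10:15:00.000", "01/06/2024", "2024-01-07T08:00:00.000", None],
})

# Missing values: count nulls per column
missing_counts = sample.isna().sum()

# Inconsistent formats: after lowercasing, variants of one category collapse together
city_variants = sample["city"].dropna().str.lower().value_counts()

# Incorrect entries: strings that do not match the assumed timestamp format become NaT
parsed = pd.to_datetime(
    sample["created_date"], format="%Y-%m-%dT%H:%M:%S.%f", errors="coerce"
)
bad_dates = sample.loc[parsed.isna() & sample["created_date"].notna(), "created_date"]

print(missing_counts)
print(city_variants)
print(bad_dates.tolist())
```

The same three checks can be pointed at the real `df` once it is loaded; the columns worth checking will depend on what `df.info()` reports.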
  Prompt Engineering:

  This is the core of the lab. Your task is to craft a prompt that instructs an LLM (Llama 3, served through Groq) to clean the data.

  The Cleaning Goals: Your prompt should guide the LLM to perform the following tasks:

  * Address missing values: Infer or fill in missing information where possible (e.g., city names from addresses).
  * Standardize text: Correct spelling, apply consistent capitalization, and ensure uniformity in categorical values.
  * Validate and format: Ensure that addresses are in a standard format (e.g., "Street, Borough, NY"), and that dates and times follow ISO 8601.
  * Categorize: Assign clear categories to ambiguous complaint descriptions (e.g., "Noise," "Non-Noise").

  You are not given the prompt used in the example code, but you are given the expected results.

  Iterative Refinement: Start with a basic prompt and refine it based on the LLM's output. Check each response against the cleaning goals above and adjust the prompt wherever the output falls short.
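One possible starting point for the refinement loop is to keep the cleaning goals in a list and build the prompt from it, so each iteration changes one rule at a time. This is only a sketch: `build_cleaning_prompt`, the rule wording, and the `complaint_category` field name are hypothetical and not the prompt used to produce the expected results.

```python
import json

def build_cleaning_prompt(record: dict) -> str:
    """Build a cleaning prompt that asks for one JSON object back (hypothetical helper)."""
    rules = [
        "Fill missing fields only when they can be inferred from other fields; otherwise use null.",
        'Format addresses as "Street, Borough, NY".',
        "Convert dates to ISO 8601 (YYYY-MM-DDTHH:MM:SS).",
        'Set complaint_category to "Noise" or "Non-Noise".',
    ]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "You are a data cleaning assistant.\n"
        f"Rules:\n{numbered}\n"
        "Return ONLY one JSON object with the same keys as the input, plus complaint_category.\n"
        # json.dumps writes None as null, so the model sees valid JSON as input
        f"Record:\n{json.dumps(record, default=str)}"
    )

prompt = build_cleaning_prompt({"incident_address": "123 MAIN ST", "city": None})
print(prompt)
```

Asking for a single JSON object, rather than free text or a table, tends to make the response easier to parse in the later steps.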

2. Data Validation:

  After cleaning the data, write unit tests (using Python's assert statements) to validate the output.
  Your tests should check data types, value ranges, and ensure that required fields are not null.
  Have the LLM generate the test code, then run it and note any problems that come up (for example, tests that reference fields the record does not contain).
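For comparison with LLM-generated tests, a hand-written validator along these lines may be useful. It is a minimal sketch that assumes the cleaned record is a dictionary with the hypothetical keys `address`, `complaint_category`, and `created_date`; adjust the names to whatever your prompt actually returns.

```python
import re
from datetime import datetime

# "Street, Borough, NY": exactly three comma-separated parts, ending in NY
ADDRESS_PATTERN = re.compile(r"^[^,]+, [^,]+, NY$")

def validate_cleaned_record(record: dict) -> None:
    """Raise AssertionError if the record violates one of the cleaning goals."""
    # Required fields must be present and non-empty
    for field in ("address", "complaint_category", "created_date"):
        assert record.get(field) not in (None, ""), f"{field} must not be empty"

    # Data types
    assert isinstance(record["address"], str), "address must be a string"
    assert isinstance(record["created_date"], str), "created_date must be a string"

    # Formats and allowed values
    assert ADDRESS_PATTERN.match(record["address"]), "address must look like 'Street, Borough, NY'"
    assert record["complaint_category"] in {"Noise", "Non-Noise"}, "unexpected category"
    try:
        datetime.fromisoformat(record["created_date"])
    except ValueError:
        raise AssertionError("created_date must be ISO 8601")

validate_cleaned_record({
    "address": "123 Main St, Brooklyn, NY",
    "complaint_category": "Noise",
    "created_date": "2024-01-05T10:15:00",
})
```

A record that passes returns silently; any violation raises an `AssertionError` naming the failed rule.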

Submission: Write your prompts in a text file and upload it to the LMS.

In [17]:
# Groq-Powered Data Engineering Pipeline

# Step 1: Install Required Libraries
!pip install groq itables



In [18]:
# Step 2: Import Libraries
from groq import Groq
import pandas as pd
from itables import init_notebook_mode
from google.colab import userdata
import json
import re
from tqdm import tqdm
import itables

init_notebook_mode(all_interactive=True)

In [19]:
# Load a manageable sample (100 rows) for this lab
url = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=100"
df = pd.read_csv(url)
df

unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,bbl,borough,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 41 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   unique_key                      100 non-null    int64  
 1   created_date                    100 non-null    object 
 2   closed_date                     14 non-null     object 
 3   agency                          100 non-null    object 
 4   agency_name                     100 non-null    object 
 5   complaint_type                  100 non-null    object 
 6   descriptor                      99 non-null     object 
 7   location_type                   94 non-null     object 
 8   incident_zip                    100 non-null    int64  
 9   incident_address                98 non-null     object 
 10  street_name                     98 non-null     object 
 11  cross_street_1                  98 non-null     object 
 12  cross_street_2                  96 no

In [21]:
client = Groq(api_key=userdata.get("Groq_API"))

In [39]:
def llm_complex_clean(record):
    prompt = f"""
    You are a data cleaning assistant.

    Clean the following data record by performing these tasks:

    1. Address missing values: If fields like city are missing, infer them from the address if possible.
    2. Standardize text:
        - Fix spelling mistakes.
        - Apply consistent capitalization.
        - Normalize categorical values (e.g., convert 'NOISE', 'noise complaint' to 'Noise').
    3. Validate and format fields:
        - Ensure address is in the format: "Street, Borough, NY"
        - Convert date and time fields to ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS).
    4. Categorize complaints: Convert ambiguous complaint descriptions into either:
        - "Noise"
        - "Non-Noise"

    Return ONLY the cleaned record as a single JSON object, using double-quoted keys and null for missing values. Do not add any explanation.
    Original Record:
    {record.to_dict()}
    """

    chat_completion = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )


    return chat_completion.choices[0].message.content.strip()

In [40]:
# Feel free to define any number of helper functions.
import ast

def extract_dict_from_response(response_string):
    """
    Extracts the first dictionary-like block from an LLM response.

    Args:
    response_string: The string containing the dictionary representation.

    Returns:
    The extracted dictionary, or None if nothing could be parsed.
    """
    # Greedy match from the first "{" to the last "}" so nested braces stay intact
    match = re.search(r"\{.*\}", response_string, re.DOTALL)
    if match is None:
        print("No dictionary structure found in the response.")
        return None

    candidate = match.group(0)
    try:
        # Preferred path: the model returned valid JSON
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass

    try:
        # Fallback: a Python-style dict (single quotes, None). literal_eval
        # parses it directly, avoiding blanket string replacements that would
        # corrupt values containing apostrophes or the word "None".
        parsed = ast.literal_eval(candidate)
        return parsed if isinstance(parsed, dict) else None
    except (ValueError, SyntaxError) as e:
        print(f"Error parsing response as a dictionary: {e}")
        return None

In [41]:
cleaned_records = []
sample_df = df.head(10)  # Start with 10 rows due to complexity & API limits

for idx, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
    try:
        response_text = llm_complex_clean(row)
        # Parse the raw LLM text into a dictionary so each field becomes a column
        cleaned_record = extract_dict_from_response(response_text)
        if cleaned_record is not None:
            cleaned_records.append(cleaned_record)
    except Exception as e:
        print(f"Error cleaning row {idx}: {e}")

cleaned_df = pd.DataFrame(cleaned_records)
cleaned_df.head()

100%|██████████| 10/10 [00:50<00:00,  5.09s/it]




Data Validation

In [35]:
def generate_complex_validation_tests(record):
    prompt = f"""
    You are a Python code generator.

    Based on the cleaned NYC 311 data record below, return ONLY raw Python assert statements that validate the record.

    DO NOT include any explanation, commentary, or headings. Just the assert statements.

    Validation Rules:
    1. 'Address', 'Complaint Type', and 'Created Date' must not be empty or null.
    2. 'Address' must follow this format: "Street, Borough, NY".
    3. 'Created Date' must follow ISO 8601 format: "YYYY-MM-DD" or "YYYY-MM-DDTHH:MM:SS".
    4. 'Complaint Type' must be either "Noise" or "Non-Noise".
    5. Text fields must be capitalized consistently (first letter uppercase).

    Use Python's assert statement.
    Use record['field_name'] format for field access.
    Again, return ONLY valid Python assert statements. No descriptions.

    Cleaned Record:
    {record}
    """

    chat_completion = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return chat_completion.choices[0].message.content.strip()

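# Generate tests for the first record.
# Note: this passes the raw row from df, not a cleaned record, so the generated
# asserts may reference fields that the row does not contain.
test_code = generate_complex_validation_tests(df.iloc[0].to_dict())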
print(test_code)

assert record['Address'] != '', "Address cannot be empty"
assert record['Complaint Type'] != '', "Complaint Type cannot be empty"
assert record['Created Date'] != '', "Created Date cannot be empty"
assert record['Address'].startswith(record['Street Name']) and record['Address'].endswith(', ' + record['Borough'] + ', NY'), "Address must follow the format: 'Street, Borough, NY'"
assert record['Created Date'].startswith('20') or record['Created Date'].startswith('20T'), "Created Date must follow ISO 8601 format"
assert record['Complaint Type'] in ['Noise', 'Non-Noise'], "Complaint Type must be either 'Noise' or 'Non-Noise'"
assert record['Address'].split()[0].capitalize() == record['Address'].split()[0], "Text fields must be capitalized consistently"
assert record['Complaint Type'].split()[0].capitalize() == record['Complaint Type'].split()[0], "Text fields must be capitalized consistently"


In [34]:
# Evaluate tests programmatically (OPTIONAL)
record = df.iloc[0].to_dict()

# Map the raw column names to the field names used in the generated asserts
record['Complaint Type'] = record['complaint_type']
record['Address'] = record['incident_address']
record['Created Date'] = record['created_date']
# test_code = test_code.replace("assert record['Street Name'] != ''", "")
# test_code = test_code.replace("assert record['Street Name'][0].isupper()", "")

# exec runs the LLM-generated asserts against `record`; the first failing line stops execution
exec(test_code)

KeyError: 'Street Name'
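The `KeyError` above stops at the first problem, which hides any later failures. One way to see every problem at once is to execute the generated asserts one line at a time and collect the errors. The sketch below assumes each assert fits on a single line, as in the output shown earlier; `run_generated_asserts` and the sample `generated` text are hypothetical. Since `exec` runs whatever the model produced, it is worth reading the generated code before running it.

```python
def run_generated_asserts(test_code: str, record: dict) -> list:
    """Execute each generated assert separately and collect (statement, error) pairs."""
    failures = []
    for line in test_code.splitlines():
        statement = line.strip()
        if not statement.startswith("assert "):
            continue  # skip blank lines, Markdown fences, or commentary from the model
        try:
            # Only `record` is exposed to the generated statement
            exec(statement, {"record": record})
        except Exception as exc:  # AssertionError, KeyError, TypeError, ...
            failures.append((statement, f"{type(exc).__name__}: {exc}"))
    return failures

# Hypothetical generated tests, shaped like the output printed earlier
generated = """
assert record['Address'] != '', "Address cannot be empty"
assert record['Address'].startswith(record['Street Name']), "Address must start with the street"
assert record['Complaint Type'] in ['Noise', 'Non-Noise'], "Unexpected complaint type"
"""
record = {"Address": "123 Main St, Brooklyn, NY", "Complaint Type": "Illegal Parking"}

failures = run_generated_asserts(generated, record)
for statement, error in failures:
    print(error)
```

Separating the errors by type makes it easier to tell a genuine data problem (`AssertionError`) from a test that refers to a field the record never had (`KeyError`).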