## Project Milestone 2 - Refine Your Project
### Five prompt experiments to see if your problem has the potential to be solved using the model you chose.

## Notebook overview

This notebook runs a set of prompt experiments to validate an LLM-based approach for your project (Project Milestone 2).

How to run
- Run CELL 2 first to initialize the OpenAI client and model configuration.
- Then run the experiment cells in order (CELL 3 → CELL 4 → CELL 5 → CELL 6), or run individual experiment cells after the client is initialized.
- Do not re-import or reassign shared variables (client, MODEL_NAME, MAX_TOKENS, TEMPERATURE) unless you intend to overwrite them.

What each cell does
- CELL 0: Notebook title / header.
- CELL 2: Initializes OpenAI client and defines query_gpt helper and model config.
- CELL 3: Experiment 1 (code Q&A using a simulated code context).
- CELL 4: ask_gpt helper function used to call the client.
- CELL 5: Experiment 3 (onboarding guidance using simple_context).
- CELL 6: Experiment 5 (debugging guidance using simple_debugging).

Key variables (do not hardcode secrets in shared repos)
- MODEL_NAME, MAX_TOKENS, TEMPERATURE: model configuration.
- client: initialized OpenAI client.
- Note: do not leave API keys in notebook variables; load them from environment variables or a secrets manager before committing.
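
A minimal sketch of reading the key from the environment instead of a notebook variable. The helper name `load_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption about how your shell or secrets manager exposes the key.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a readable message if the variable is unset,
    so the failure shows up before any request is sent.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it in your shell or load it "
            "from a secrets manager before running this notebook."
        )
    return key
```

With this in place, the setup cell could create the client with `OpenAI(api_key=load_api_key())`.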


In [8]:
import os
from openai import OpenAI
import json

# Read the API key from the environment; never hardcode secrets in the notebook
OPENAI_API_KEY1 = os.environ["OPENAI_API_KEY"]

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY1)

# Model configuration
MODEL_NAME = "gpt-3.5-turbo"
TEMPERATURE = 0.3  # Lower temperature for more consistent, factual responses
MAX_TOKENS = 500

# Helper function for chat completions
def query_gpt(system_prompt, user_prompt, context=""):
    """
    Query GPT-3.5-Turbo with system prompt, context, and user query.
    """
    messages = [
        {"role": "system", "content": system_prompt}
    ]
    
    if context:
        messages.append({
            "role": "user", 
            "content": f"Context:\n{context}\n\nQuestion: {user_prompt}"
        })
    else:
        messages.append({"role": "user", "content": user_prompt})
    
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS
    )
    
    return response.choices[0].message.content

print("✅ OpenAI client initialized successfully")


✅ OpenAI client initialized successfully


In [10]:
# Experiment 1: Code Understanding
print("\n" + "="*70)
print("EXPERIMENT 1: Direct Code Q&A (Function Documentation)")
print("="*70)

# Simulated retrieved context (in real system, this comes from vector DB)
code_context = """
File: auth_service.py, lines 45–67

def authenticate_user(username: str, password: str) -> bool:
    '''Authenticate a user against the configured user database.
    
    Args:
        username: The user's login identifier (email or username).
        password: The plaintext password to verify against stored hash.
    
    Returns:
        True if authentication succeeds; False if credentials are invalid.
    
    Side effects:
        - Logs authentication attempts to audit_log table
        - Increments failed_attempts counter (reset on success)
        - Raises ValueError if database connection fails
    '''
    try:
        user = db.query(User).filter_by(username=username).first()
        if not user:
            log_auth_attempt(username, success=False)
            return False
        if verify_password(password, user.password_hash):
            user.failed_attempts = 0
            db.session.commit()
            log_auth_attempt(username, success=True)
            return True
        else:
            user.failed_attempts += 1
            db.session.commit()
            log_auth_attempt(username, success=False)
            return False
    except Exception as e:
        raise ValueError(f"Database error during authentication: {e}")
"""

system_prompt = """You are an expert code documentation assistant. 
Explain code clearly and accurately based ONLY on the provided context. 
Always cite file names and line numbers. If information is not in the context, say so."""

user_prompt = """Based on the code provided, explain what the authenticate_user() function does.
Include:
1. What the function does in 1-2 sentences
2. What each parameter represents
3. What it returns and under what conditions
4. Any important side effects
5. A usage example

Reference the specific file and line numbers."""

# Execute query
response = query_gpt(system_prompt, user_prompt, code_context)

print("\n📝 PROMPT:")
print(user_prompt)
print("\n🤖 MODEL RESPONSE:")
print(response)



EXPERIMENT 1: Direct Code Q&A (Function Documentation)

📝 PROMPT:
Based on the code provided, explain what the authenticate_user() function does.
Include:
1. What the function does in 1-2 sentences
2. What each parameter represents
3. What it returns and under what conditions
4. Any important side effects
5. A usage example

Reference the specific file and line numbers.

🤖 MODEL RESPONSE:
1. The `authenticate_user()` function in auth_service.py (lines 45-67) authenticates a user by checking the provided username and password against a stored hash in the user database.

2. Parameters:
   - `username`: Represents the user's login identifier (email or username).
   - `password`: Represents the plaintext password to verify against the stored hash.

3. Returns:
   - Returns True if the authentication succeeds, indicating valid credentials.
   - Returns False if the authentication fails, indicating invalid credentials.

4. Side Effects:
   - Logs authentication attempts to the `audit_
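
The response above ends mid-sentence. The notebook does not show whether the model hit the `max_tokens` limit or the output was cut off on export. One way to tell the two apart is to inspect `finish_reason` on the returned choice, which the Chat Completions API sets to `"length"` when the token limit stops generation. The sketch below uses a stand-in object (`fake_response`, hypothetical) so it runs without an API call; `is_truncated` is also a hypothetical helper name.

```python
from types import SimpleNamespace


def is_truncated(response) -> bool:
    """Return True if the first choice stopped because it hit max_tokens."""
    return response.choices[0].finish_reason == "length"


# Stand-in shaped like a chat completion response, so this runs offline.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="Logs authentication attempts to the"),
        )
    ]
)

if is_truncated(fake_response):
    print("Response hit the token limit; consider raising MAX_TOKENS.")
```

In the notebook, the same check could run on the real `response` object inside `query_gpt` before returning its content.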

In [12]:
def ask_gpt(question, context):
    """Ask the model a question grounded in the given documentation text."""
    response = client.chat.completions.create(
        model=MODEL_NAME,  # shared config from the setup cell
        messages=[
            {"role": "system", "content": "You are a helpful coding mentor. Guide developers using the provided codebase documentation."},
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=TEMPERATURE,  # shared config from the setup cell
        max_tokens=400  # kept at 400 here (the shared MAX_TOKENS is 500)
    )
    return response.choices[0].message.content

# Simple documentation about the codebase
simple_context = """
PROJECT STRUCTURE:

/data_pipeline/
  ├── ingestion/          # Read data from sources
  ├── validation/         # Check data quality
  │   ├── check_nulls.py
  │   └── check_duplicates.py
  ├── transformation/     # Clean and transform data
  └── alerts/            # Send notifications
      └── slack_alerts.py

HOW TO ADD A NEW FEATURE:

1. Find the right folder for your feature
   - Data checks? → /validation/
   - Notifications? → /alerts/
   - Transformations? → /transformation/

2. Look at existing examples
   - /alerts/slack_alerts.py shows how to send Slack messages
   - All alert functions follow this pattern:
   
   def send_alert(message, severity="INFO"):
       # Connect to notification service
       # Format message
       # Send notification
       return success_status

3. Create your new file
   - Name it clearly (e.g., email_alerts.py)
   - Copy the pattern from slack_alerts.py
   - Change the notification method

4. Connect it to your pipeline
   - Import your function in validation/check_nulls.py
   - Call it when a check fails:
     if data_has_errors:
         send_email_alert("Data quality issue found!")
"""

question = """I'm new to the team. I need to add email alerts when our null checks fail.
Where should I put this code? What files should I look at first?"""

print("="*70)
print("EXPERIMENT 3: Onboarding Guidance")
print("="*70)
print(f"\n❓ QUESTION:\n{question}")

answer = ask_gpt(question, simple_context)

print(f"\n🤖 ANSWER:\n{answer}")


EXPERIMENT 3: Onboarding Guidance

❓ QUESTION:
I'm new to the team. I need to add email alerts when our null checks fail.
Where should I put this code? What files should I look at first?

🤖 ANSWER:
To add email alerts when null checks fail, you should follow these steps:

1. **Find the right folder for your feature**:
   Since email alerts are notifications, you should add your code in the `/alerts/` folder.

2. **Look at existing examples**:
   Take a look at `/alerts/slack_alerts.py` to understand the pattern for sending notifications. All alert functions follow a similar structure.

3. **Create your new file**:
   - Name your new file as `email_alerts.py`.
   - Copy the pattern from `slack_alerts.py` and modify it to send email alerts.

4. **Connect it to your pipeline**:
   - Import your email alert function in `/validation/check_nulls.py`.
   - Call the email alert function when a null check fails.

By following these steps, you can successfully add email alerts to the data p
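
To make the guidance above concrete, here is a rough sketch of what an `email_alerts.py` following the `send_alert(message, severity="INFO")` pattern might look like. Everything specific is an assumption for illustration: the addresses, the SMTP host and port, and the split into `build_alert_email` and `send_email_alert` are not taken from the project documentation.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(message, severity="INFO",
                      sender="pipeline@example.com",
                      recipient="data-team@example.com"):
    """Format the alert as an email message (addresses are placeholders)."""
    email = EmailMessage()
    email["Subject"] = f"[{severity}] Data pipeline alert"
    email["From"] = sender
    email["To"] = recipient
    email.set_content(message)
    return email


def send_email_alert(message, severity="INFO", host="localhost", port=25):
    """Send the alert over SMTP; return True on success, False on failure."""
    email = build_alert_email(message, severity)
    try:
        with smtplib.SMTP(host, port, timeout=10) as server:
            server.send_message(email)
        return True
    except (OSError, smtplib.SMTPException):
        # Connection and protocol errors are reported as a failed send.
        return False
```

Keeping the formatting step separate from the sending step means the message layout can be checked without an SMTP server.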

In [13]:
# Simple architecture decision record
simple_decision = """
DECISION: Use PostgreSQL for our customer database

DATE: January 2024
TEAM: Backend Engineering

WHY WE NEEDED A DATABASE:
- Store customer information (name, email, orders)
- Need to query by customer ID, email, and order history
- Must handle transactions (orders must be consistent)

OPTIONS WE CONSIDERED:

1. PostgreSQL (Relational Database)
   ✓ Good at: Relationships between data (customers → orders)
   ✓ Good at: Complex queries (find all customers who ordered in last 30 days)
   ✓ Good at: Transactions (order + payment must succeed together)
   ✗ Bad at: Flexible schemas (adding fields requires migrations)
   Cost: Free (open source)

2. MongoDB (Document Database)
   ✓ Good at: Flexible schemas (easy to add new fields)
   ✓ Good at: Storing complex nested data
   ✗ Bad at: Relationships across collections
   ✗ Bad at: Complex transactions (weaker than PostgreSQL)
   Cost: Free (open source) but more memory needed

3. DynamoDB (AWS Managed NoSQL)
   ✓ Good at: Scaling automatically
   ✓ Good at: Simple key lookups
   ✗ Bad at: Complex queries (need to design around limitations)
   ✗ Bad at: Cost (expensive at scale)
   Cost: ~$500/month at our scale

WHAT WE CHOSE: PostgreSQL

WHY:
- Our data has clear relationships (customers have many orders)
- We need complex queries (analytics reports)
- Transactions are critical (payment processing)
- Team already knows SQL

TRADE-OFF WE ACCEPTED:
- We gave up flexible schemas (MongoDB's strength)
- We get strong consistency and relations (our priority)
"""

question = """Why did we choose PostgreSQL instead of MongoDB? 
I heard MongoDB is more modern and flexible."""

print("\n" + "="*70)
print("EXPERIMENT 4: Design Decision Explanation")
print("="*70)
print(f"\n❓ QUESTION:\n{question}")

answer = ask_gpt(question, simple_decision)

print(f"\n🤖 ANSWER:\n{answer}")




EXPERIMENT 4: Design Decision Explanation

❓ QUESTION:
Why did we choose PostgreSQL instead of MongoDB? 
I heard MongoDB is more modern and flexible.

🤖 ANSWER:
We chose PostgreSQL over MongoDB for our customer database for several reasons:

1. **Clear Relationships**: Our data has clear relationships where customers are associated with multiple orders. PostgreSQL is a relational database that excels at handling such relationships efficiently.

2. **Complex Queries**: We needed to perform complex queries for generating analytics reports and extracting insights from our data. PostgreSQL is known for its powerful query capabilities, making it a suitable choice for our use case.

3. **Transactions**: Transactions are critical for us, especially in scenarios like payment processing where data consistency is crucial. PostgreSQL provides strong support for transactions, ensuring data integrity and consistency.

4. **Team Familiarity**: Our team already has expertise in SQL, which is th
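
Since the experiments are meant to show whether the model's answers stay grounded in the supplied context, a crude automated check can help when comparing runs. The sketch below counts how many expected phrases appear in an answer. The `coverage` helper, the phrase list, and the sample answer are all assumptions for illustration; substring matching is a rough proxy and does not replace reading the answers.

```python
def coverage(answer: str, expected_points: list[str]) -> float:
    """Fraction of expected phrases found in the answer (case-insensitive)."""
    if not expected_points:
        return 0.0
    text = answer.lower()
    hits = sum(1 for point in expected_points if point.lower() in text)
    return hits / len(expected_points)


# Phrases loosely based on the WHY section of the decision record.
expected = ["relationships", "complex queries", "transactions", "team"]
sample_answer = (
    "PostgreSQL handles relationships, complex queries, and transactions well."
)
score = coverage(sample_answer, expected)
print(f"Coverage: {score:.2f}")
```

In the notebook, `sample_answer` would be replaced by the `answer` returned from `ask_gpt`.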

In [15]:
# Simple debugging guide
simple_debugging = """
CODE THAT'S FAILING:

File: process_orders.py, line 42

def calculate_daily_totals(orders_df):
    # This loads ALL orders into memory at once
    all_orders = orders_df.collect()  # ⚠️ PROBLEM: Loads everything into RAM
    
    daily_totals = {}
    for order in all_orders:
        date = order['order_date']
        if date not in daily_totals:
            daily_totals[date] = 0
        daily_totals[date] += order['total_amount']
    
    return daily_totals

SYSTEM CONFIGURATION:

- Server RAM: 8 GB
- Database: 5 million orders
- Typical query: Processes 100,000 orders (about 2 GB of data)

KNOWN ISSUES:

1. collect() loads ALL data into memory
   - Works fine with 10,000 orders (200 MB)
   - Crashes with 100,000+ orders (2+ GB)
   - Python dict operations slow on huge datasets

2. Processing row-by-row is inefficient
   - Better to use SQL aggregation (database does the work)
   - Better to process in batches (chunk the data)

DEBUGGING STEPS:

1. Check memory usage
   - Run: htop or top command
   - Look for process using >80% RAM
   - Check if process gets "Killed" (out of memory error)

2. Check how much data you're processing
   - Add: print(f"Processing {orders_df.count()} orders")
   - If >50,000 orders, memory issues likely

3. Fix approaches:

   OPTION A - Use SQL aggregation (best):
   query = '''
       SELECT order_date, SUM(total_amount) as daily_total
       FROM orders
       GROUP BY order_date
   '''
   result = database.execute(query)
   
   OPTION B - Process in batches:
   for batch in orders_df.to_batches(size=10000):
       process_batch(batch)
   
   OPTION C - Add more RAM:
   Upgrade server from 8GB → 16GB
   (Not recommended - fix code first!)
"""

question = """Our process_orders.py script keeps crashing with "MemoryError: unable to allocate array".
It works with small data but fails on production data. What's wrong?"""

print("\n" + "="*70)
print("EXPERIMENT 5: Debugging Help")
print("="*70)
print(f"\n❓ QUESTION:\n{question}")

answer = ask_gpt(question, simple_debugging)

print(f"\n🤖 ANSWER:\n{answer}")





EXPERIMENT 5: Debugging Help

❓ QUESTION:
Our process_orders.py script keeps crashing with "MemoryError: unable to allocate array".
It works with small data but fails on production data. What's wrong?

🤖 ANSWER:
The issue with the `calculate_daily_totals` function in the `process_orders.py` script is that it loads all orders into memory at once using the `collect()` method. This approach is not efficient and leads to memory errors when processing a large amount of data.

To address this issue and prevent memory errors, you can follow these steps:

1. **Identify the Problem**:
   - Loading all orders into memory at once is causing memory errors.
   - The script works fine with small data but crashes with production data.

2. **Debugging Steps**:
   - Check memory usage to see if the process is consuming a large amount of RAM.
   - Check the amount of data being processed to identify if it exceeds the system's capacity.

3. **Fix Approaches**:
   - **Option A - Use SQL Aggregation*
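
For reference, Option A from the debugging guide can be tried end to end with a small runnable sketch. It uses an in-memory SQLite database as a stand-in for the real order store, which is an assumption: the guide only refers to a generic `database.execute`. The table layout and sample rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the production order store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.5), ("2024-01-02", 20.0)],
)


def calculate_daily_totals(conn):
    """Let the database aggregate, so only one row per day reaches Python."""
    query = """
        SELECT order_date, SUM(total_amount) AS daily_total
        FROM orders
        GROUP BY order_date
        ORDER BY order_date
    """
    return dict(conn.execute(query).fetchall())


daily_totals = calculate_daily_totals(conn)
print(daily_totals)
```

The result holds one entry per distinct date, so memory use depends on the number of days rather than the number of orders.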