**Chapter 8: Generative AI for Data Preparation**

In [None]:
GEMINI_KEY = "[PUT YOUR GEMINI KEY HERE]"

# Prompt template

In [None]:
# Define a simple function to create a prompt for an LLM
def create_prompt(task, details):
    # Template for the prompt
    prompt = f"Task: {task}\nDetails: {details}\nResponse:"
    return prompt

# Example usage
task = "Explain the difference between supervised and unsupervised learning."
details = "Supervised learning uses labeled data to train models, while unsupervised learning deals with unlabeled data."

# Generate the prompt
prompt = create_prompt(task, details)

# Print the generated prompt
print("Generated Prompt:")
print(prompt)

Generated Prompt:
Task: Explain the difference between supervised and unsupervised learning.
Details: Supervised learning uses labeled data to train models, while unsupervised learning deals with unlabeled data.
Response:


ChatGPT's answer:

Supervised and unsupervised learning are two main types of machine learning, distinguished by the presence or absence of labeled data during the training process.

- Supervised Learning: This approach uses labeled data, meaning that each input is paired with a known output (or label). The model learns to map inputs to the correct outputs by minimizing the error between predictions and the actual labels. Common tasks include classification (e.g., email spam detection) and regression (e.g., predicting house prices).

- Unsupervised Learning: In this approach, the data is unlabeled, and the model tries to uncover patterns or structures within the data without explicit guidance. It focuses on finding similarities, differences, or clusters in the data. Common tasks include clustering (e.g., customer segmentation) and dimensionality reduction (e.g., principal component analysis for data compression).

In essence, supervised learning is about predicting outcomes, while unsupervised learning is about discovering hidden patterns.

# Using Gemini

Gemini provides a free API tier with lower rate limits for testing purposes. You can create a free API key through [Google AI Studio](https://aistudio.google.com/prompts/new_chat) to access this tier.

Note, however, that the free tier has usage limits and that exceeding them may incur costs. For production use, review the pricing details and consider a paid plan.
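Because requests on a rate-limited tier may be rejected from time to time, it can help to retry with exponential backoff. The sketch below is one possible approach, not part of the Gemini SDK: `call_with_backoff` and its parameters are hypothetical names, and treating every exception as retryable is a simplifying assumption. A stand-in function replaces the real API call so the example runs without a key.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:  # assumption: any exception is treated as retryable
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Up to 10% random jitter avoids many clients retrying in lockstep
            delay = base_delay * (2 ** attempt) * (1 + random.uniform(0, 0.1))
            sleep(delay)


# Stand-in for an API call that fails twice, then succeeds
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, base_delay=0.01, sleep=waits.append)
print(result, "after", attempts["count"], "attempts")
```

In the notebook, `func` could be a small lambda wrapping the model call; in practice you would narrow the `except` clause to the error type your client raises for rate limits.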

Configuring and Using the Gemini API:
1. Create a Google AI Studio Account: If you don't have one already, sign up for a free account.
2. Generate an API Key: Once logged in, navigate to the API Key section and create a new key.
3. Store Your API Key Securely: Keep your API key confidential. Avoid sharing it publicly.
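One way to follow step 3 is to read the key from an environment variable instead of pasting it into the notebook. This is a minimal sketch: the variable name `GEMINI_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration, not names required by Google.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY", environ=None):
    """Return the key stored in env_var, or raise a clear error if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Demonstration with a fake environment so the example runs anywhere
demo_key = load_api_key(environ={"GEMINI_API_KEY": "not-a-real-key"})
print("Key loaded:", bool(demo_key))
```

With the variable set in your shell, `GEMINI_KEY = load_api_key()` could replace the hard-coded placeholder, which keeps the key out of any notebook you share.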

Note: the current free-tier limits may change in the future. You can find the most up-to-date information on the Gemini API pricing page: https://ai.google.dev/pricing

In [None]:
import google.generativeai as genai
genai.configure(api_key=GEMINI_KEY)  # authenticate with the key defined in the first cell
model = genai.GenerativeModel("gemini-pro") # or gemini-1.5-flash-002 or gemini-1.5-flash-8b

Simple prompt

In [None]:
original = "My customer is not satisfied"
response = model.generate_content(f"Paraphrase this in English: {original}")
print(response.text)

My client is unhappy


# Using Generative AI to impute missing values

In [None]:
!wget https://raw.githubusercontent.com/datacorner/dataprep-handbook/main/Titanic%20disaster/train.csv

--2024-12-04 15:00:31--  https://raw.githubusercontent.com/datacorner/dataprep-handbook/main/Titanic%20disaster/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘train.csv’


2024-12-04 15:00:31 (28.6 MB/s) - ‘train.csv’ saved [60302/60302]



In [None]:
import pandas as pd
df = pd.read_csv("/content/train.csv")
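Before building prompts, it may be useful to check which columns actually contain missing values. The sketch below uses a small hand-made frame with illustrative values (not rows copied from `train.csv`); `missing_summary` is a hypothetical helper that could be applied to `df` in the same way.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with illustrative values; in the notebook, pass `df` instead
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})


def missing_summary(frame):
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that have at least one missing value
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


print(missing_summary(sample))
```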

In [None]:
import pandas as pd

# Step 1: Split the dataset into manageable chunks
chunk_size = 5
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Step 2: Collect metadata for each chunk
def collect_metadata(chunk):
    metadata = {
        "columns": chunk.columns.tolist(),
        "missing_counts": chunk.isna().sum().to_dict(),
        "column_means": chunk.mean(numeric_only=True).to_dict(),
        "column_medians": chunk.median(numeric_only=True).to_dict(),
        "column_modes": chunk.mode(dropna=True).iloc[0].to_dict()
    }
    return metadata

# Step 3: Build the prompt for an LLM
def build_prompt(chunk, metadata):
    prompt = f"""
You are a data expert helping to clean a dataset. The dataset is structured as follows:
Columns: {metadata['columns']}
Metadata about the columns:
- Missing value counts: {metadata['missing_counts']}
- Column means: {metadata['column_means']}
- Column medians: {metadata['column_medians']}
- Column modes: {metadata['column_modes']}

Here is the dataset with missing values (NaN):
{chunk.to_string(index=False)}

Your task:
1. Identify columns with missing values.
2. Impute missing values using the most appropriate statistical method based on metadata (e.g., mean, median, or mode).
3. Return the updated data for the dataset provided in a JSON format (each row as a node).
Please only provide the result of task 3
"""
    return prompt


# Step 4: Generate prompts for each chunk
prompts = []
for chunk in chunks:
    metadata = collect_metadata(chunk)
    prompt = build_prompt(chunk, metadata)
    prompts.append(prompt)

# Step 5: Example of sending a prompt to an LLM with Gemini. This step requires an LLM endpoint and API key.
def send_to_llm(prompt):
    # Send the prompt to Gemini through the `model` object created earlier
    response = model.generate_content(prompt)
    return response.text

# Process each prompt with the LLM
#for i, prompt in enumerate(prompts):
#    print(f"Processing Chunk {i+1}/{len(prompts)}")
#    response = send_to_llm(prompt)
#    print(f"Response for Chunk {i+1}:\n{response}\n")

# Send only the first chunk's prompt as a demonstration
print(f"Prompt {prompts[0]}")
response = send_to_llm(prompts[0])
print(f"Response for Chunk 0:\n{response}\n")

Prompt 
You are a data expert helping to clean a dataset. The dataset is structured as follows:
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Metadata about the columns:
- Missing value counts: {'PassengerId': 0, 'Survived': 0, 'Pclass': 0, 'Name': 0, 'Sex': 0, 'Age': 0, 'SibSp': 0, 'Parch': 0, 'Ticket': 0, 'Fare': 0, 'Cabin': 3, 'Embarked': 0}
- Column means: {'PassengerId': 3.0, 'Survived': 0.6, 'Pclass': 2.2, 'Age': 31.2, 'SibSp': 0.6, 'Parch': 0.0, 'Fare': 29.521660000000004}
- Column medians: {'PassengerId': 3.0, 'Survived': 1.0, 'Pclass': 3.0, 'Age': 35.0, 'SibSp': 1.0, 'Parch': 0.0, 'Fare': 8.05}
- Column modes: {'PassengerId': 1, 'Survived': 1.0, 'Pclass': 3.0, 'Name': 'Allen, Mr. William Henry', 'Sex': 'female', 'Age': 35.0, 'SibSp': 1.0, 'Parch': 0.0, 'Ticket': '113803', 'Fare': 7.25, 'Cabin': 'C123', 'Embarked': 'S'}

Here is the dataset with missing values (NaN):
 PassengerId  Survived  Pclass  
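The prompt asks the model to return the rows as JSON, so the reply has to be parsed before it can be merged back into the DataFrame. The sketch below assumes the reply is a JSON array of row objects, possibly wrapped in a Markdown code fence; the actual shape of a model's reply is not guaranteed. `parse_llm_json` is a hypothetical helper, and the reply string is hand-written for illustration.

```python
import json
import re

import pandas as pd

FENCE = "`" * 3  # Markdown code-fence marker
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text):
    """Load the JSON found in an LLM reply into a DataFrame (one row per object)."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text  # fall back to the raw text
    data = json.loads(payload)  # raises json.JSONDecodeError on malformed replies
    if isinstance(data, dict):
        data = [data]  # a single object becomes a one-row frame
    return pd.DataFrame(data)


# Hypothetical reply, written by hand for illustration (not real model output)
reply = (
    FENCE
    + 'json\n[{"PassengerId": 1, "Cabin": "C123"}, {"PassengerId": 2, "Cabin": "C85"}]\n'
    + FENCE
)
parsed = parse_llm_json(reply)
print(parsed)
```

Before writing parsed rows back, it is worth checking that the row count and column names match the chunk that was sent, since a model may drop, reorder, or invent rows.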

# Noise reduction and data normalization

In [None]:
def build_normalization_prompt(chunk, metadata):
    prompt = f"""
You are a data expert helping to clean a dataset by performing noise reduction and normalization. The dataset is structured as follows:
Columns: {metadata['columns']}
Metadata about the columns:
- Missing value counts: {metadata['missing_counts']}
- Column means: {metadata['column_means']}
- Column medians: {metadata['column_medians']}
- Column modes: {metadata['column_modes']}

Here is a sample of the data (including some noisy values or outliers) for analysis:
{chunk.head(5).to_string(index=False)}

### Your task:
1. **Noise Reduction**:
   - Identify any columns that have noisy data (e.g., outliers, spikes, extreme values).
   - Apply an appropriate technique to reduce noise, such as:
     - Using a rolling average or median for smoothing.
     - Identifying and handling outliers (e.g., via z-scores or IQR).
     - Handling missing values (e.g., forward filling or imputation with statistical values).

2. **Normalization**:
   - Normalize numerical features to ensure they are on a similar scale. You can apply one or more of the following methods:
     - **Min-Max Scaling**: Scale values between 0 and 1.
     - **Standardization (Z-Score Normalization)**: Adjust the dataset to have a mean of 0 and a standard deviation of 1.
     - **Robust Scaling**: Normalize using median and IQR to handle outliers.

   Provide the updated values after applying these techniques.
"""
    return prompt
metadata = collect_metadata(chunks[0])  # metadata must describe the same chunk that is shown in the prompt
print(build_normalization_prompt(chunks[0], metadata))


You are a data expert helping to clean a dataset by performing noise reduction and normalization. The dataset is structured as follows:
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Metadata about the columns:
- Missing value counts: {'PassengerId': 0, 'Survived': 0, 'Pclass': 0, 'Name': 0, 'Sex': 0, 'Age': 0, 'SibSp': 0, 'Parch': 0, 'Ticket': 0, 'Fare': 0, 'Cabin': 3, 'Embarked': 0}
- Column means: {'PassengerId': 3.0, 'Survived': 0.6, 'Pclass': 2.2, 'Age': 31.2, 'SibSp': 0.6, 'Parch': 0.0, 'Fare': 29.521660000000004}
- Column medians: {'PassengerId': 3.0, 'Survived': 1.0, 'Pclass': 3.0, 'Age': 35.0, 'SibSp': 1.0, 'Parch': 0.0, 'Fare': 8.05}
- Column modes: {'PassengerId': 1, 'Survived': 1.0, 'Pclass': 3.0, 'Name': 'Allen, Mr. William Henry', 'Sex': 'female', 'Age': 35.0, 'SibSp': 1.0, 'Parch': 0.0, 'Ticket': '113803', 'Fare': 7.25, 'Cabin': 'C123', 'Embarked': 'S'}

Here is a sample of the data (including some noisy values or out
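The three normalization methods named in the prompt are simple enough to compute locally, which gives a reference for checking whatever values the LLM returns. The sketch below uses plain pandas on illustrative values; the function names are hypothetical, and none of them guards against constant columns (where the denominator would be zero).

```python
import pandas as pd


def min_max_scale(s):
    """Scale values to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())


def z_score_scale(s):
    """Center on the mean and divide by the population standard deviation."""
    return (s - s.mean()) / s.std(ddof=0)


def robust_scale(s):
    """Center on the median and divide by the interquartile range."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return (s - s.median()) / iqr


values = pd.Series([10.0, 20.0, 30.0, 40.0, 200.0])  # illustrative, with one extreme value

scaled = pd.DataFrame({
    "min_max": min_max_scale(values),
    "z_score": z_score_scale(values),
    "robust": robust_scale(values),
})
print(scaled.round(3))
```

Note how the single extreme value compresses the other min-max results toward 0, while robust scaling keeps the bulk of the data spread out; that is the reason the prompt offers it as the outlier-tolerant option.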

# Data normalization and scaling

In [None]:
def generate_normalization_prompt(dataset_overview, target_columns=None, scaling_techniques=None):
    prompt = (
        "You are an expert in data preprocessing and feature engineering. I need your help in normalizing and scaling "
        "data from a dataset used for machine learning.\n\n"
        "Dataset Context:\n"
        f"{dataset_overview}\n\n"
    )

    if target_columns:
        prompt += (
            f"Focus specifically on the following columns for normalization and scaling: {', '.join(target_columns)}.\n"
        )
    else:
        prompt += "Analyze all numeric columns in the dataset for potential scaling.\n"

    if scaling_techniques:
        prompt += (
            f"Preferably suggest one or more of the following scaling techniques: {', '.join(scaling_techniques)}.\n"
        )
    else:
        prompt += (
            "Recommend suitable scaling methods such as min-max scaling, z-score normalization, or log transformations "
            "based on the data characteristics.\n"
        )

    prompt += (
        "Include the following in your response:\n"
        "- A summary of the dataset's scaling needs based on its metadata.\n"
        "- Specific scaling or normalization methods for each column and why you suggest them.\n"
        "- Any additional preprocessing steps needed to handle outliers or anomalies before scaling.\n\n"
        "If the dataset includes categorical columns or missing values, explain how to handle these issues.\n"
        "Provide a concise yet detailed explanation of your recommendations."
    )

    return prompt

# Example usage:
dataset_overview = """
The Titanic dataset includes the following columns:
- PassengerId (integer): Unique identifier for each passenger.
- Age (float): Passenger's age, with some missing values.
- Fare (float): Ticket price paid by the passenger.
- Embarked (categorical): Port of embarkation (C, Q, S).
- Survived (binary): Target column indicating survival (0 or 1).
"""

prompt = generate_normalization_prompt(dataset_overview, target_columns=["Age", "Fare"], scaling_techniques=["z-score"])
print(prompt)

You are an expert in data preprocessing and feature engineering. I need your help in normalizing and scaling data from a dataset used for machine learning.

Dataset Context:

The Titanic dataset includes the following columns:
- PassengerId (integer): Unique identifier for each passenger.
- Age (float): Passenger's age, with some missing values.
- Fare (float): Ticket price paid by the passenger.
- Embarked (categorical): Port of embarkation (C, Q, S).
- Survived (binary): Target column indicating survival (0 or 1).


Focus specifically on the following columns for normalization and scaling: Age, Fare.
Preferably suggest one or more of the following scaling techniques: z-score.
Include the following in your response:
- A summary of the dataset's scaling needs based on its metadata.
- Specific scaling or normalization methods for each column and why you suggest them.
- Any additional preprocessing steps needed to handle outliers or anomalies before scaling.

If the dataset includes cate
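The prompt also lists log transformations as a candidate technique. As a rough local check of when one might apply, the sketch below log-transforms a column only when its sample skewness exceeds a threshold. The helper `log_if_skewed`, the threshold of 1.0 (a common rule of thumb, taken here as an assumption), and the data values are all illustrative.

```python
import numpy as np
import pandas as pd


def log_if_skewed(series, threshold=1.0):
    """Return (series, applied): log1p-transformed when sample skewness exceeds threshold."""
    if series.min() < 0:
        raise ValueError("this sketch assumes non-negative values")
    if series.skew() > threshold:
        return np.log1p(series), True  # log1p(x) = log(1 + x), safe at x = 0
    return series, False


fares = pd.Series([10.0, 20.0, 30.0, 40.0, 200.0], name="Fare")  # illustrative values
transformed, applied = log_if_skewed(fares)
print("log transform applied:", applied)
print(transformed.round(3).tolist())
```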

# Anomaly detection and outlier treatment

In [None]:
def build_anomaly_outlier_prompt(chunk, metadata):
    prompt = f"""
You are a data expert helping to clean a dataset by detecting and treating anomalies and outliers. The dataset is structured as follows:
Columns: {metadata['columns']}
Metadata about the columns:
- Missing value counts: {metadata['missing_counts']}
- Column means: {metadata['column_means']}
- Column medians: {metadata['column_medians']}
- Column modes: {metadata['column_modes']}

Here is the dataset, including potential anomalies or outliers:
{chunk.head(5).to_string(index=False)}

### Your task:
1. **Anomaly Detection**:
   - Identify any anomalies or outliers in the dataset. These may include:
     - Extreme values (e.g., values significantly larger or smaller than the rest).
     - Unexpected patterns in the data (e.g., missing values in key columns or unusual distributions).
     - Use techniques like z-scores, IQR (Interquartile Range), or visual inspection of statistical distributions to detect these anomalies.

2. **Outlier Treatment**:
   - For each identified outlier or anomaly:
     - Determine if the value should be capped or transformed (e.g., winsorizing or log transformation).
     - Decide if the anomaly should be removed or replaced with a more plausible value (e.g., using the median or mean).
     - Ensure the final dataset is clean and consistent.

Please provide the cleaned and updated values after applying these treatments.
"""
    return prompt

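# Requires the earlier cells to have been run (they define `chunks` and `collect_metadata`);
# in a fresh kernel this cell fails with the NameError shown below
metadata = collect_metadata(chunks[0])
prompt = build_anomaly_outlier_prompt(chunks[0], metadata)
print(prompt)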

NameError: name 'chunks' is not defined
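The IQR rule and the capping (winsorizing) treatment mentioned in the prompt can also be run locally as a baseline against which to compare the LLM's answer. This is a minimal sketch on illustrative values; `iqr_bounds`, `cap_outliers`, and the conventional multiplier `k=1.5` are assumptions, not something the notebook prescribes.

```python
import pandas as pd


def iqr_bounds(s, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(s, k=1.5):
    """Return the series clipped to the IQR fences, plus a boolean outlier mask."""
    lower, upper = iqr_bounds(s, k)
    flagged = (s < lower) | (s > upper)
    return s.clip(lower=lower, upper=upper), flagged


values = pd.Series([10.0, 20.0, 30.0, 40.0, 200.0])  # illustrative, one extreme value
capped, flagged = cap_outliers(values)
print(pd.DataFrame({"original": values, "capped": capped, "outlier": flagged}))
```

Capping keeps every row, which matters for small chunks; removing or replacing flagged values, as the prompt also allows, is a different trade-off and depends on whether the extreme value is an error or a genuine observation.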