## Data Analysis

In [1]:
import pandas as pd

df = pd.read_csv('data_ml.csv')
df.head()

Unnamed: 0,Customer,Customer Lifetime Value,Response,Coverage,Education,Gender,Income,Monthly Premium Auto,Number of Policies,Total Claim Amount,...,Sales Channel_Agent,Sales Channel_Branch,Sales Channel_Call Center,Sales Channel_Web,Vehicle Class_Four-Door Car,Vehicle Class_Luxury Car,Vehicle Class_Luxury SUV,Vehicle Class_SUV,Vehicle Class_Sports Car,Vehicle Class_Two-Door Car
0,BU79786,2763.519279,0,0,2,1,56274,69,1,384.811147,...,1,0,0,0,0,0,0,0,0,1
1,QZ44356,6979.535903,0,1,2,1,0,94,8,1131.464935,...,1,0,0,0,1,0,0,0,0,0
2,AI49188,12887.43165,0,2,2,1,48767,108,2,566.472247,...,1,0,0,0,0,0,0,0,0,1
3,WW63253,7645.861827,0,0,2,0,0,106,7,529.881344,...,0,0,1,0,0,0,0,1,0,0
4,HB64268,2813.692575,0,0,2,0,43836,73,1,138.130879,...,1,0,0,0,1,0,0,0,0,0


## Initialize the Amazon Bedrock runtime client

In [2]:
import boto3
from botocore.config import Config
import sys
sys.path.append("src/")
from claude_llm_invocation import invoke_claude_base

my_config = Config(
    region_name = 'us-east-1',
    signature_version = 'v4',
    retries = {
        'max_attempts': 3,
        'mode': 'standard'
    }
)

client = boto3.client("bedrock-runtime", config = my_config)


## Test Claude LLM invocation function

In [3]:
#model_id = "anthropic.claude-instant-v1"
#model_id = "anthropic.claude-v2:1"
#model_id = "anthropic.claude-3-haiku-20240307-v1:0"
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

prompt = "What is a Large Language Model?"

messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]

output_obj = \
invoke_claude_base(client, 
                   messages,
                   system = None,
                   model_id=model_id, 
                   max_tokens=1024, 
                   temperature = 0.0, 
                   top_k = None, 
                   top_p = None,
                   stop_sequences=["Human:"],
                   use_streaming = True,
                   anthropic_version = "bedrock-2023-05-31",
                   print_details = True)

output_obj

Response(s):
A large language model (LLM) is a type of artificial intelligence system that is trained on vast amounts of text data to understand and generate human-like language. These models use deep learning techniques, specifically transformer-based neural networks, to learn patterns and relationships in the training data, allowing them to perform a wide range of natural language processing tasks.

Some key characteristics of large language models include:

1. Massive scale: LLMs are trained on enormous datasets, often comprising billions or even trillions of words from various sources such as websites, books, and articles. This massive scale allows them to capture a broad understanding of language and knowledge.

2. Generative capabilities: LLMs can generate human-like text, including coherent paragraphs, stories, articles, code, and even creative content like poetry or scripts. They can continue writing based on a given prompt or context.

3. Multi-task capabilities: While initial

{'response_text': 'A large language model (LLM) is a type of artificial intelligence system that is trained on vast amounts of text data to understand and generate human-like language. These models use deep learning techniques, specifically transformer-based neural networks, to learn patterns and relationships in the training data, allowing them to perform a wide range of natural language processing tasks.\n\nSome key characteristics of large language models include:\n\n1. Massive scale: LLMs are trained on enormous datasets, often comprising billions or even trillions of words from various sources such as websites, books, and articles. This massive scale allows them to capture a broad understanding of language and knowledge.\n\n2. Generative capabilities: LLMs can generate human-like text, including coherent paragraphs, stories, articles, code, and even creative content like poetry or scripts. They can continue writing based on a given prompt or context.\n\n3. Multi-task capabilities:
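The `invoke_claude_base` helper is imported from `src/` and its implementation is not shown in this notebook. As a rough, non-streaming sketch of what such a helper might do, the code below builds an Anthropic Messages request body and sends it through the Bedrock runtime `invoke_model` call. The names `build_claude_body` and `invoke_claude_sketch` are hypothetical, and the real helper (which also supports streaming and detail printing) may differ.

```python
import json


def build_claude_body(messages, system=None, max_tokens=1024, temperature=0.0,
                      top_k=None, top_p=None, stop_sequences=None,
                      anthropic_version="bedrock-2023-05-31"):
    """Build the JSON body for an Anthropic Messages request on Bedrock."""
    body = {
        "anthropic_version": anthropic_version,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": messages,
    }
    # Only include optional fields that were actually set
    optional = {"system": system, "top_k": top_k, "top_p": top_p,
                "stop_sequences": stop_sequences}
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


def invoke_claude_sketch(client, messages, model_id, **body_kwargs):
    """Hypothetical non-streaming stand-in for invoke_claude_base."""
    body = build_claude_body(messages, **body_kwargs)
    response = client.invoke_model(modelId=model_id, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    # Concatenate the text blocks returned by the model
    text = "".join(block["text"] for block in payload["content"]
                   if block.get("type") == "text")
    return {"response_text": text}
```

With the `client` and `messages` objects defined above, a call would look like `invoke_claude_sketch(client, messages, model_id, stop_sequences=["Human:"])`.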

## Code to Sample and Prepare Data


In [22]:
import pandas as pd

# Load the dataset
data = pd.read_csv('data_ml.csv')

# Sample 5 rows from the dataset for prompt inclusion
sample_data = data.sample(n=5, random_state=42)

# Display the sample in a format suitable for the prompt
print(sample_data.to_markdown(index=False))


| Customer   |   Customer Lifetime Value |   Response |   Coverage |   Education |   Gender |   Income |   Monthly Premium Auto |   Number of Policies |   Total Claim Amount |   Vehicle Size |   State_Arizona |   State_California |   State_Nevada |   State_Oregon |   State_Washington |   EmploymentStatus_Disabled |   EmploymentStatus_Employed |   EmploymentStatus_Medical Leave |   EmploymentStatus_Retired |   EmploymentStatus_Unemployed |   Location Code_Rural |   Location Code_Suburban |   Location Code_Urban |   Marital Status_Divorced |   Marital Status_Married |   Marital Status_Single |   Policy_Corporate L1 |   Policy_Corporate L2 |   Policy_Corporate L3 |   Policy_Personal L1 |   Policy_Personal L2 |   Policy_Personal L3 |   Policy_Special L1 |   Policy_Special L2 |   Policy_Special L3 |   Renew Offer Type_Offer1 |   Renew Offer Type_Offer2 |   Renew Offer Type_Offer3 |   Renew Offer Type_Offer4 |   Sales Channel_Agent |   Sales Channel_Branch |   Sales Channel_Call Center |   S

In [23]:
sample_data

Unnamed: 0,Customer,Customer Lifetime Value,Response,Coverage,Education,Gender,Income,Monthly Premium Auto,Number of Policies,Total Claim Amount,...,Sales Channel_Agent,Sales Channel_Branch,Sales Channel_Call Center,Sales Channel_Web,Vehicle Class_Four-Door Car,Vehicle Class_Luxury Car,Vehicle Class_Luxury SUV,Vehicle Class_SUV,Vehicle Class_Sports Car,Vehicle Class_Two-Door Car
1263,QI64880,7007.274228,0,0,2,0,0,101,6,973.787176,...,0,1,0,0,0,0,0,1,0,0
93,TV87155,20946.19251,0,0,2,1,69738,74,2,492.127532,...,0,1,0,0,1,0,0,0,0,0
3317,WB43408,4598.235733,0,0,1,1,51523,116,1,738.840818,...,0,1,0,0,0,0,0,1,0,0
6021,CO75086,7269.338677,0,1,1,0,20407,95,8,36.97749,...,0,0,1,0,1,0,0,0,0,0
5694,RA61917,8029.715047,0,0,0,1,76081,68,2,384.210821,...,0,1,0,0,1,0,0,0,0,0


## Load the data

In [26]:
def load_file(filename):
    """Return a 100-row sample of the CSV file as a Markdown table string."""
    import pandas as pd
    # Load the dataset
    data = pd.read_csv(filename)
    
    # Sample 100 rows from the dataset for prompt inclusion
    sample_data = data.sample(n=100, random_state=42)
    
    # Render the sample as a Markdown table suitable for the prompt
    sample_content = sample_data.to_markdown(index=False)
    return sample_content

In [27]:
text = load_file("data_ml.csv")


In [28]:
text

'| Customer   |   Customer Lifetime Value |   Response |   Coverage |   Education |   Gender |   Income |   Monthly Premium Auto |   Number of Policies |   Total Claim Amount |   Vehicle Size |   State_Arizona |   State_California |   State_Nevada |   State_Oregon |   State_Washington |   EmploymentStatus_Disabled |   EmploymentStatus_Employed |   EmploymentStatus_Medical Leave |   EmploymentStatus_Retired |   EmploymentStatus_Unemployed |   Location Code_Rural |   Location Code_Suburban |   Location Code_Urban |   Marital Status_Divorced |   Marital Status_Married |   Marital Status_Single |   Policy_Corporate L1 |   Policy_Corporate L2 |   Policy_Corporate L3 |   Policy_Personal L1 |   Policy_Personal L2 |   Policy_Personal L3 |   Policy_Special L1 |   Policy_Special L2 |   Policy_Special L3 |   Renew Offer Type_Offer1 |   Renew Offer Type_Offer2 |   Renew Offer Type_Offer3 |   Renew Offer Type_Offer4 |   Sales Channel_Agent |   Sales Channel_Branch |   Sales Channel_Call Center |   

## Define Prompt for Text-to-Code generation



In [29]:
PROMPT_code = """
You are a software engineer.
Your task is to generate Python code to build a classification model based on the provided dataset.

You will be given a context paragraph describing the business problem and requirements. You will also be provided with a sample of the dataset structure to inform your understanding of the data.

Here are the steps to follow:

1. Read through the entire paragraph with the business context and dataset sample to understand the task:
<content>
{INPUT_CONTENT}
</content>

2. Read through the function description to understand the function you're going to write:
<function description>
{FUNCTION_DESCRIPTION}
</function description>

3. Generate the Python code according to the function description and the content provided. Please note that:
 a. Only implement the essential steps for building the model.
 b. Code should include data preprocessing, model training, and evaluation steps.
 c. **Use StandardScaler to scale the features before training**.
 d. **Set `max_iter` in LogisticRegression to 500** to avoid convergence issues.
 e. Provide only the code, without analysis or commentary.

4. Provide the final function based on the task requirements.
"""



In [30]:
FUNCTION_DESCRIPTION = """
This function trains a classification model to predict whether a customer will renew their vehicle insurance based on the provided dataset.

Function name: train_insurance_renewal_model

Input variables:
- dataset : The dataset provided is from an automobile insurance business. The objective is to predict whether a customer will renew their vehicle insurance. The target column is Response, where 1 indicates renewal and 0 indicates no renewal. The dataset includes features like Customer Lifetime Value, Coverage, Education, Gender, Income, Monthly Premium Auto, Number of Policies, and encoded fields such as Sales Channel and Vehicle Class.

Output:
Python code that preprocesses the dataset, cleans the data (e.g., handling missing values and ensuring numerical inputs), and then trains a classification model (e.g., logistic regression or decision tree). Ensure scaling is applied to features using StandardScaler, set `max_iter` in LogisticRegression to 500, evaluate on a test set, and print key performance metrics such as accuracy.
"""
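The prompt template is filled with `str.format`, so every `{NAME}` placeholder needs a matching keyword argument. The sketch below lists a template's placeholders with the standard library's `string.Formatter` before formatting. It uses a shortened, hypothetical template and values rather than the full `PROMPT_code`.

```python
from string import Formatter


def template_fields(template):
    """Return the set of placeholder names used in a str.format template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Shortened, hypothetical stand-in for PROMPT_code
demo_template = (
    "<content>\n{INPUT_CONTENT}\n</content>\n"
    "<function description>\n{FUNCTION_DESCRIPTION}\n</function description>"
)
demo_values = {
    "INPUT_CONTENT": "| Customer | Response |",
    "FUNCTION_DESCRIPTION": "Function name: train_insurance_renewal_model",
}

fields = template_fields(demo_template)
missing = fields - set(demo_values)  # placeholders with no supplied value
if missing:
    raise KeyError(f"missing template values: {sorted(missing)}")

demo_prompt = demo_template.format(**demo_values)
```

Substituted values are inserted verbatim, so braces inside the dataset sample are not re-interpreted as placeholders.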


## Call LLM to generate the Python code

In [31]:
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

prompt = PROMPT_code.format(INPUT_CONTENT=text, 
                            FUNCTION_DESCRIPTION=FUNCTION_DESCRIPTION)

messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]

output_obj = \
invoke_claude_base(client, 
                   messages,
                   system = None,
                   model_id=model_id, 
                   temperature = 0.5, 
                   top_k = None, 
                   top_p = None,
                   stop_sequences=["Human:"],
                   use_streaming = True,
                   anthropic_version = "bedrock-2023-05-31",
                   print_details = True)

Response(s):
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train_insurance_renewal_model(dataset):
    # Preprocess data
    X = dataset.drop(['Customer', 'Response'], axis=1)
    y = dataset['Response']
    
    # Handle missing values
    X = X.fillna(X.mean())
    
    # Encode categorical features
    X = pd.get_dummies(X, drop_first=True)
    
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Scale features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Train model
    model = LogisticRegression(max_iter=500)
    model.fit(X_train, y_train)
    
    # Evaluate model
    y_pred = model.pre

## Parse the Python code from the LLM response

In [32]:
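import re

response_text = output_obj['response_text']
# Find the positions of the opening and closing ``` fences of the first code block
fence_positions = [m.start() for m in re.finditer('```', response_text)]
# Skip the opening fence (3 characters) and its 'python\n' language tag
code_start = fence_positions[0] + 3 + len('python\n')
code_end = fence_positions[1]
generated_code = response_text[code_start:code_end]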
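The index-based parsing above assumes the response contains an opening fence tagged exactly `python` and a matching closing fence. A slightly more defensive variant, sketched below, uses one regular expression and raises a clear error when no complete block is found, for example when the response is cut off by the token limit. The function name `extract_python_code` is hypothetical.

```python
import re

FENCE = "`" * 3  # the three-backtick Markdown code fence

# Opening fence with an optional "python"/"py" tag, then the code, then the closing fence
_CODE_BLOCK_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_python_code(response_text):
    """Return the body of the first complete fenced code block in the response."""
    match = _CODE_BLOCK_RE.search(response_text)
    if match is None:
        raise ValueError("no complete fenced code block found in the response")
    return match.group(1)
```

It can be called as `generated_code = extract_python_code(output_obj['response_text'])`.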

In [33]:
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import TerminalFormatter

print(highlight(generated_code, PythonLexer(), TerminalFormatter()))

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train_insurance_renewal_model(dataset):
    # Preprocess data
    X = dataset.drop(['C

In [44]:
# save the code to a file
with open('model_code.py', 'w') as f:
    f.write(generated_code)


In [34]:
exec(generated_code)

In [35]:
train_insurance_renewal_model

<function __main__.train_insurance_renewal_model(dataset)>
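Calling `exec(generated_code)` directly defines whatever the LLM produced in the notebook's global namespace. One possible alternative, sketched below, checks the syntax first with `ast.parse` and then executes the code in a separate dictionary, so only the expected function is pulled out. The helper name `load_generated_function` and the demo code string are hypothetical. This is not a sandbox: the generated code still runs with full privileges and should be reviewed before execution.

```python
import ast


def load_generated_function(code, function_name):
    """Syntax-check generated code, run it in its own namespace, and return one function."""
    ast.parse(code)  # raises SyntaxError if the code is truncated or malformed
    namespace = {}
    exec(code, namespace)  # definitions land in `namespace`, not in globals()
    func = namespace.get(function_name)
    if not callable(func):
        raise NameError(f"{function_name!r} was not defined by the generated code")
    return func


# Hypothetical stand-in for the LLM output
demo_code = "def add_one(x):\n    return x + 1\n"
add_one = load_generated_function(demo_code, "add_one")
print(add_one(2))  # prints 3
```

In this notebook the equivalent call would be `load_generated_function(generated_code, "train_insurance_renewal_model")`.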

In [36]:
import pandas as pd

# Load the dataset
dataset = pd.read_csv('data_ml.csv')

# Call the model training function with the loaded dataset
model = train_insurance_renewal_model(dataset)


Accuracy: 0.8712
Precision: 0.6780
Recall: 0.1575
F1-Score: 0.2556
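The F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above as a consistency check. A recall of 0.1575 means that only about 16% of the customers who actually renewed were identified, which is why F1 is far below the accuracy figure.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the training output above
precision, recall = 0.6780, 0.1575
f1 = f1_from_precision_recall(precision, recall)
print(f"F1-Score: {f1:.4f}")  # prints F1-Score: 0.2556
```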


## Testing the model

In [54]:
test_data_path = 'data_ml_test.csv'
test_data = pd.read_csv(test_data_path)


In [55]:
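from sklearn.preprocessing import StandardScaler

# Separate features and target, dropping the same columns as at training time
X_test = test_data.drop(['Customer', 'Response'], axis=1)
y_test = test_data['Response']

# Caution: train_insurance_renewal_model appears to return only the model, so the
# scaler fitted on the training data is not available here. Fitting a new scaler on
# the test set applies different means and variances than the model was trained
# with, so these predictions should be treated as a rough check only.
scaler = StandardScaler()
X_test_preprocessed = scaler.fit_transform(X_test)

predictions = model.predict(X_test_preprocessed)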

print(predictions)

[0 0 0 0 0 0 0 0 0 0]
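The cell above fits a fresh `StandardScaler` on the test data, whereas the generated training code fitted its scaler on the training split. The pure-Python sketch below shows why that matters, using hypothetical income values. Standardizing new data with its own statistics re-centers it around zero and hides any shift relative to the training distribution. In scikit-learn, bundling the scaler and the classifier in a `sklearn.pipeline.Pipeline` is one way to keep the fitted scaler together with the model.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of the given values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, for illustration only
train_income = [20000, 40000, 60000, 80000]
test_income = [90000, 95000, 100000]

# Consistent with training: reuse the training mean and standard deviation
mu, sigma = fit_scaler(train_income)
scaled_with_train_stats = transform(test_income, mu, sigma)

# What the cell above does: refit on the test data itself
mu_test, sigma_test = fit_scaler(test_income)
scaled_with_test_stats = transform(test_income, mu_test, sigma_test)

print(scaled_with_train_stats)  # all clearly positive: incomes above the training range
print(scaled_with_test_stats)   # centered on zero: the shift is no longer visible
```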
