# Semantic Evaluation of Summary Generation

This demo uses a subset of the CNN/DailyMail newspaper text summarization dataset on Kaggle: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail

### Instantiate Parameters, Variables and Models 

In [1]:
# parameters 

file_name = "articles.txt"

models = ["base", "tuned"] # only options are "base" model or "tuned" model
skip_examples = 150
tuning_examples = 10 # only applicable if tuning
testing_examples = 50
temperature_list = [0.2,0.8]
top_k_list = [40]
top_p_list = [0.8]
max_output_tokens_list = [200,1000]
prompts = ["summarize the following: ", "create a short abstract from the following article: "] # test different variations of your prompts
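The parameter lists above define a grid, and every combination is evaluated once later in the notebook. As a rough sketch (using `itertools.product`, which the notebook itself does not use), the number of combinations can be checked before spending any API calls:

```python
from itertools import product

# Same values as the parameters cell above.
models = ["base", "tuned"]
temperature_list = [0.2, 0.8]
top_k_list = [40]
top_p_list = [0.8]
max_output_tokens_list = [200, 1000]
prompts = [
    "summarize the following: ",
    "create a short abstract from the following article: ",
]

# One tuple per configuration, in the same order as nested for-loops would visit them.
grid = list(
    product(models, temperature_list, top_k_list, top_p_list, max_output_tokens_list, prompts)
)

print(len(grid))  # 2 * 2 * 1 * 1 * 2 * 2 = 16
print(grid[0])
```

Each configuration generates a summary for every test example, so the total number of generation calls is the grid size multiplied by `testing_examples`.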


In [2]:
# variables

TUNED_MODEL_NAME="tuned_summarization_model"

In [3]:
# models

from vertexai.preview.language_models import TextEmbeddingModel
import numpy as np
from vertexai.preview.language_models import TextGenerationModel

llm_model = TextGenerationModel.from_pretrained("text-bison@001")
emb_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

### Read data

In [4]:
# import local dataset

import pandas as pd
headers_tune = ["id", "input_text", "output_text"]
df_tune = pd.read_csv(file_name, nrows=tuning_examples, skiprows=skip_examples,names=headers_tune).drop(columns="id")
headers = ["id", "context", "ground_truth_response"]
df_test = pd.read_csv(file_name, nrows=testing_examples, skiprows=tuning_examples+skip_examples,names=headers).drop(columns="id")
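The two `read_csv` calls rely on `skiprows` and `nrows` to carve out non-overlapping tuning and test slices from the same file. The sketch below illustrates that slicing on a small in-memory CSV; the data, the sizes, and the assumption that the file has no header row are all hypothetical stand-ins for `articles.txt`.

```python
import io

import pandas as pd

# Hypothetical stand-in for articles.txt: 30 headerless rows of id,article,summary.
csv_text = "\n".join(f"{i},article {i},summary {i}" for i in range(30))

skip, n_tune, n_test = 5, 3, 4  # small stand-ins for the notebook's parameters

# Tuning slice: skip the first `skip` rows, then read `n_tune` rows.
tune = pd.read_csv(
    io.StringIO(csv_text),
    names=["id", "input_text", "output_text"],
    skiprows=skip,
    nrows=n_tune,
)

# Test slice: start right after the tuning slice so the two never overlap.
test = pd.read_csv(
    io.StringIO(csv_text),
    names=["id", "context", "ground_truth_response"],
    skiprows=skip + n_tune,
    nrows=n_test,
)

print(tune["id"].tolist())  # [5, 6, 7]
print(test["id"].tolist())  # [8, 9, 10, 11]
```

If the real file does start with a header row, that row counts toward `skiprows`, so the slices shift by one record.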


In [5]:
print(len(df_test))
print(len(df_tune))
df_test.head()

50
10


| | context | ground_truth_response |
|---|---|---|
| 0 | Chelsea will try to add another trophy to thei... | Chelsea's Under 19 side are hoping to become s... |
| 1 | 'Insta-fame' is the latest criterion for model... | Australian model agencies reveal new demand fo... |
| 2 | A selection of thirsty animals were captured d... | A zookeeper at the Zoological Center in Tel Av... |
| 3 | A red hot Novak Djokovic is being tipped to en... | Rafael Nadal lost heavily to Novak Djokovic on... |
| 4 | Rocker Richie Sambora is being investigated by... | Richie Sambora is about to be quizzed by polic... |


### If Applicable, Tune Model (This May Take a Few Hours)

In [119]:
# add instructions from prompts to the tuned data
print(len(df_tune))
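frames = []
for prompt in prompts:
    # Copy so each prompt is prepended to the original text, not to an already-prefixed one.
    df_temp = df_tune.copy()
    df_temp["input_text"] = prompt + df_temp["input_text"]
    frames.append(df_temp)
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
df_tune_final = pd.concat(frames, ignore_index=True)
print(len(df_tune_final))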

10
20


In [120]:
# create tuned model (this can take a few hours)

def tuning(
    training_data: pd.DataFrame,
    train_steps: int = 5,
) -> None:
    """Tune a new model on prompt-response data.

    "training_data" can be either the GCS URI of a file in JSONL format
    (for example: training_data=f'gs://{bucket}/{filename}.jsonl'), or a pandas
    DataFrame. Each training example should be a JSONL record with two keys, for
    example:
      {
        "input_text": <input prompt>,
        "output_text": <associated output>
      },
    or the pandas DataFrame should contain two columns:
      ['input_text', 'output_text']
    with rows for each training example.

    Args:
      training_data: GCS URI of a JSONL file or pandas DataFrame of training data
      train_steps: Number of training steps to use when tuning the model.
    """

    llm_model.tune_model(
        training_data=training_data,
        # Optional:
        train_steps=train_steps,
        tuning_job_location="europe-west4",  # Only supported in europe-west4 for Public Preview
        tuned_model_location="us-central1",
        model_display_name=TUNED_MODEL_NAME
    )

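    # Report the status of the tuning job started above.
    print(llm_model._job.status)

if "tuned" in models:
    # tuning() returns None, so the tuned model is loaded separately in the next cell.
    # Pass the prompt-augmented frame built above rather than the raw df_tune.
    tuning(training_data=df_tune_final)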


Creating PipelineJob
PipelineJob created. Resource name: projects/185246287903/locations/europe-west4/pipelineJobs/tune-large-model-20230724204610
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/185246287903/locations/europe-west4/pipelineJobs/tune-large-model-20230724204610')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west4/pipelines/runs/tune-large-model-20230724204610?project=185246287903
PipelineJob projects/185246287903/locations/europe-west4/pipelineJobs/tune-large-model-20230724204610 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/185246287903/locations/europe-west4/pipelineJobs/tune-large-model-20230724204610 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/185246287903/locations/europe-west4/pipelineJobs/tune-large-model-20230724204610 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/185246287903/locations/europe-west4/pi
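The docstring above notes that the training data may also be supplied as a JSONL file with `input_text` and `output_text` keys. A minimal sketch of producing that format from a DataFrame with pandas is shown below; the two example rows are made up, and uploading the result to a GCS bucket is not covered here.

```python
import json

import pandas as pd

# Made-up rows in the two-column layout the tuning docstring describes.
df_example = pd.DataFrame(
    {
        "input_text": [
            "summarize the following: first article text",
            "summarize the following: second article text",
        ],
        "output_text": ["first summary", "second summary"],
    }
)

# orient="records" with lines=True emits one JSON object per line (JSONL).
jsonl_text = df_example.to_json(orient="records", lines=True)

# Parse it back to confirm each line is a standalone record with the two keys.
records = [json.loads(line) for line in jsonl_text.splitlines() if line]
print(records[0])
```

Passing a file path as the first argument to `to_json` writes the same content to disk instead of returning a string.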

In [6]:
# reinstantiate if needed
tuned_model_path=""
tuned_model = TextGenerationModel.get_tuned_model(tuned_model_path)

In [7]:
tuned_model.predict("hello how are you?")

I am doing well, thank you for asking!

### Evaluation

In [8]:
# define functions

def generate_response(row, prompt_, model_, parameters_):
    # Use the prompt passed in as an argument, not the global loop variable.
    response = model_.predict(
        prompt_ + row["context"],
        **parameters_,
    )
    return response.text

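def evaluate(row):
    # An empty generation gets the lowest score without calling the embedding model.
    if row["generated_response"] == '':
        return 0
    # Embed the reference summary and the generated summary in one request.
    embedding_response = emb_model.get_embeddings([row["ground_truth_response"], row["generated_response"]])
    embeddings = [embedding.values for embedding in embedding_response]
    # The similarity score is the dot product of the two embedding vectors.
    return np.dot(embeddings[0], embeddings[1])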
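The `evaluate` function scores a pair of summaries with a raw dot product, which depends on vector length as well as direction. Whether the embedding model returns unit-length vectors is not established in this notebook, so the sketch below shows a length-independent alternative, cosine similarity, on toy vectors. The helper name `cosine_similarity` is illustrative and not part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # A zero vector has no direction; treat it as no similarity.
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy vectors: v points in the same direction as u but is twice as long.
u = [3.0, 4.0]
v = [6.0, 8.0]
w = [-4.0, 3.0]  # perpendicular to u

print(np.dot(u, v))             # raw dot product grows with vector length
print(cosine_similarity(u, v))  # same direction
print(cosine_similarity(u, w))  # perpendicular
```

For vectors that are already normalized, the two measures coincide, so swapping one for the other only changes the scores when the norms differ from 1.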

In [10]:
# evaluate with different parameters

eval_scores = []
model_number = 1

for model in models:
    for temperature in temperature_list:
        for top_k in top_k_list:
            for top_p in top_p_list:
                for max_output_tokens in max_output_tokens_list:
                    for prompt in prompts:
                        
                        if model == "base":
                            this_model = llm_model
                        elif model == "tuned":
                            this_model = tuned_model
                        else:
                            print("the model " + model + " does not exist")
                            continue
                        
                        parameters = {
                            "temperature": temperature,  # Temperature controls the degree of randomness in token selection.
                            "max_output_tokens": max_output_tokens,  # Token limit determines the maximum amount of text output.
                            "top_p": top_p,  # Tokens are selected from most probable to least until the sum of their probabilities equals the top_p value.
                            "top_k": top_k,  # A top_k of 1 means the selected token is the most probable among all tokens.
                        }
                        df_test["generated_response"] = df_test.apply(lambda x: generate_response(x, prompt, this_model, parameters), axis=1)
                        df_test["similarity_score"] = df_test.apply(evaluate, axis=1)
                        similarity_score = df_test["similarity_score"].mean()

                        eval_scores.append([str(model_number), str(temperature), str(top_k), str(top_p), str(max_output_tokens), prompt, model, str(similarity_score)])

                        print("finished evaluating model: " + str(model_number))
                        model_number += 1

finished evaluating model: 1
finished evaluating model: 2
finished evaluating model: 3
finished evaluating model: 4
finished evaluating model: 5
finished evaluating model: 6
finished evaluating model: 7
finished evaluating model: 8
finished evaluating model: 9
finished evaluating model: 10
finished evaluating model: 11
finished evaluating model: 12
finished evaluating model: 13
finished evaluating model: 14
finished evaluating model: 15
finished evaluating model: 16


In [11]:
df_eval = pd.DataFrame(eval_scores)
df_eval.columns = ["model_number","temperature","top_k","top_p","max_output_tokens", "prompt","model_type", "similarity_score"]
df_eval

| | model_number | temperature | top_k | top_p | max_output_tokens | prompt | model_type | similarity_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.2 | 40 | 0.8 | 200 | summarize the following: | base | 0.5626061567082256 |
| 1 | 2 | 0.2 | 40 | 0.8 | 200 | create a short abstract from the following art... | base | 0.5503191346143873 |
| 2 | 3 | 0.2 | 40 | 0.8 | 1000 | summarize the following: | base | 0.5794346868355766 |
| 3 | 4 | 0.2 | 40 | 0.8 | 1000 | create a short abstract from the following art... | base | 0.5313373976054452 |
| 4 | 5 | 0.8 | 40 | 0.8 | 200 | summarize the following: | base | 0.6311506753516014 |
| 5 | 6 | 0.8 | 40 | 0.8 | 200 | create a short abstract from the following art... | base | 0.5657017010607833 |
| 6 | 7 | 0.8 | 40 | 0.8 | 1000 | summarize the following: | base | 0.5805615349087037 |
| 7 | 8 | 0.8 | 40 | 0.8 | 1000 | create a short abstract from the following art... | base | 0.5617710006675751 |
| 8 | 9 | 0.2 | 40 | 0.8 | 200 | summarize the following: | tuned | 0.5411395024174513 |
| 9 | 10 | 0.2 | 40 | 0.8 | 200 | create a short abstract from the following art... | tuned | 0.4546638799021737 |
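Because the evaluation loop stores every value as a string, `similarity_score` has to be converted to a number before the configurations can be ranked. The sketch below does this on three rows copied from the results above; the reduced column set and the variable names `ranked` and `best` are illustrative.

```python
import pandas as pd

# Three rows copied from the results table, kept as strings as in eval_scores.
df_eval = pd.DataFrame(
    [
        ["1", "0.2", "200", "base", "0.5626061567082256"],
        ["5", "0.8", "200", "base", "0.6311506753516014"],
        ["9", "0.2", "200", "tuned", "0.5411395024174513"],
    ],
    columns=["model_number", "temperature", "max_output_tokens", "model_type", "similarity_score"],
)

# Convert the score to float so sorting is numeric rather than lexicographic.
df_eval["similarity_score"] = df_eval["similarity_score"].astype(float)

# Highest mean similarity first.
ranked = df_eval.sort_values("similarity_score", ascending=False).reset_index(drop=True)
best = ranked.iloc[0]

print(best["model_number"], best["model_type"], best["similarity_score"])
```

Among the rows shown in the table, configuration 5 (base model, temperature 0.8, 200 output tokens, "summarize the following: ") has the highest mean similarity.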
