# Tuned Gemini Flash 2.0 for citing sentence classification

This notebook tests whether a Gemini instance tuned on labeled examples can produce good results when detecting citing sentences.

The Gemini instance has been tuned on 10,000 samples from a dataset of sentences extracted from scientific papers and labeled as citing or non-citing.

## Instantiation

Using Google AI products requires the Google Cloud SDK to be installed on your system.

The following code initializes the Vertex AI project (I chose the project where the tuned model is stored) and selects a region.

In [57]:
import vertexai

# Set up the VertexAI client
vertexai.init(
    project="citingsentececlassifier",
    location="europe-west8",
)

In [77]:
from google import genai
from google.genai import types

# Create the client once so it is reused across all calls.
client = genai.Client(
    vertexai=True,
    project="438747908796",
    location="europe-west8",
)

# Endpoint of the tuned model.
MODEL = "projects/438747908796/locations/europe-west8/endpoints/5478936809951985664"

# Safety filtering is switched off for every category so that no sentence is blocked.
SAFETY_CATEGORIES = [
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_HARASSMENT",
]

GENERATE_CONTENT_CONFIG = types.GenerateContentConfig(
    temperature=0,
    top_p=0.95,
    max_output_tokens=256,
    response_modalities=["TEXT"],
    safety_settings=[
        types.SafetySetting(category=category, threshold="OFF")
        for category in SAFETY_CATEGORIES
    ],
)

def generate(content: str) -> str:
    """Send one prompt to the tuned model and return its complete text answer."""
    # A non-streaming call returns the whole answer at once; returning from
    # inside a streaming loop would only yield the first chunk.
    response = client.models.generate_content(
        model=MODEL,
        contents=[content],
        config=GENERATE_CONTENT_CONFIG,
    )
    # response.text can be None when the model returns no text.
    return response.text or ""

In [60]:
from vertexai.generative_models import GenerativeModel

llm = GenerativeModel("citing_sentence",
                      generation_config={
                      "temperature": 0,
                      "max_output_tokens": 256,
                      })

## Invocation

We'll be using the model to classify a small set of sentences from scientific papers.

In [81]:
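def generate_prediction(sentence: str) -> bool:
    prompt = "Does the given sentence reference a different scientific paper?\n- yes\n- no\n\nPlease only print the answer without anything else."
    messages = prompt + sentence

    answer = generate(messages)
    # Normalize whitespace and case so that an answer such as "Yes\n" still counts as "yes".
    return (answer or "").strip().lower() == "yes"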

In [78]:
import pandas as pd

DATASET_PATH = "C:\\Users\\Adrian\\Documents\\datasets\\citing_test.parquet"

# Load the dataset into a pandas DataFrame
df = pd.read_parquet(DATASET_PATH)

# get first 500 rows
df = df.head(500)

df.describe()

                                                 sentence  citing
count                                                 500     500
unique                                                500       2
top     Under these assumptions, we have the following...   False
freq                                                    1     470
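
The `freq` row above already hints at a strong class imbalance. A minimal sketch of how the label shares could be made explicit; `class_shares` is a hypothetical helper that is not part of the notebook, and it should accept any iterable of labels, for example `df["citing"]`. The example labels are synthetic and only mirror the counts shown above.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples per label (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Synthetic labels with the same distribution as the 500-row sample:
# 470 non-citing and 30 citing sentences.
shares = class_shares([False] * 470 + [True] * 30)
print(shares)
```

With these counts, the non-citing class makes up 94% of the sample, which is worth keeping in mind when reading any accuracy figure.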


In [82]:
predictions = df["sentence"].apply(generate_prediction)

## Results

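Unfortunately, the model seems to have learned that always answering "no" leads to fairly high accuracy: 470 of the 500 test sentences are non-citing, so never predicting a citation already scores 94% accuracy while recall on citing sentences is 0. While this experiment is a failure, it raises the following question: what if we used a more balanced (but less realistic) set of examples to fine-tune?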
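
One possible way to build such a balanced set is to downsample the majority class. The sketch below is an assumption about how this could be done, not the procedure used for the tuning run described here; `balance_by_downsampling` is a hypothetical helper, and the rows are synthetic stand-ins for the real data.

```python
import random

def balance_by_downsampling(rows, label_key="citing", seed=0):
    """Downsample every class to the size of the rarest class (hypothetical helper)."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced

# Synthetic rows: 30 citing and 470 non-citing sentences.
rows = [{"sentence": f"s{i}", "citing": i < 30} for i in range(500)]
balanced = balance_by_downsampling(rows)
print(len(balanced), sum(row["citing"] for row in balanced))
```

With a pandas DataFrame, `df.to_dict("records")` yields rows in the shape this helper expects. The trade-off is the one already noted: the balanced set no longer reflects how rare citing sentences are in real papers.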

In [84]:
from sklearn.metrics import classification_report

report = classification_report(df["citing"], predictions)

print(report)

              precision    recall  f1-score   support

       False       0.94      1.00      0.97       470
        True       0.00      0.00      0.00        30

    accuracy                           0.94       500
   macro avg       0.47      0.50      0.48       500
weighted avg       0.88      0.94      0.91       500
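
To put the report in perspective, it can be compared against a constant baseline that never predicts a citation. The sketch below is pure Python; `positive_class_metrics` is a hypothetical helper, and the label counts (470 non-citing, 30 citing) are taken from the support column of the report.

```python
def positive_class_metrics(y_true, y_pred):
    """Accuracy plus precision and recall for the True class.

    Precision and recall fall back to 0.0 when they are undefined.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    correct = sum(1 for t, p in pairs if bool(t) == bool(p))
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Same label distribution as the report: 470 non-citing, 30 citing.
y_true = [False] * 470 + [True] * 30
# Constant baseline: never predict a citation.
baseline = [False] * len(y_true)

accuracy, precision, recall = positive_class_metrics(y_true, baseline)
print(accuracy, precision, recall)
```

The baseline reaches the same 0.94 accuracy and the same zero precision and recall on the citing class as the tuned model, so on this sample the model is indistinguishable from one that always answers "no".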



