# DSPy - Advanced Prompt Engineering

In the following notebook, we'll explore an introduction to DSPy and what it can do in just a few lines of code!

To begin, we'll grab the only (top level) dependency we'll need - DSPy!

In [1]:
!pip install -qU dspy-ai
!pip install --upgrade pyarrow



DSPy can leverage OpenAI's models under the hood, and still provide an advantage - in order to do so, however, we'll need to provide an OpenAI API Key!

In [2]:
import os
import getpass

from google.colab import userdata
api_key = userdata.get('open_ai_key')

if not api_key:
  api_key = getpass.getpass("Enter your OpenAI API Key: ")

os.environ['OPENAI_API_KEY'] = api_key
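
The cell above depends on `google.colab`, which is only available inside Colab. As a rough sketch of the same "look it up, otherwise ask" logic without that dependency, the hypothetical helper below (`resolve_api_key` is not part of DSPy or Colab) reads the key from any mapping and falls back to a prompt function.

```python
from typing import Callable, Mapping


def resolve_api_key(
    env: Mapping[str, str],
    prompt: Callable[[], str],
    name: str = "OPENAI_API_KEY",
) -> str:
    """Return the key stored under `name` in `env`, or ask for it via `prompt`."""
    key = env.get(name)
    if not key:
        # Nothing stored: fall back to asking the user
        key = prompt()
    if not key:
        raise ValueError(f"{name} was not provided")
    return key


# Demo with a fake environment and a canned prompt, so nothing is read from the user
demo_key = resolve_api_key({}, lambda: "sk-demo")
```

In a real session this could be called as `resolve_api_key(os.environ, lambda: getpass.getpass("Enter your OpenAI API Key: "))`.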

## Model

Now we can set up our OpenAI language model - which we'll use through the remaining cells in the notebook.

In [4]:
from dspy import OpenAI

llm = OpenAI(model='gpt-3.5-turbo', api_key=api_key)

Similar to other libraries, we can call the LLM directly with a string to get a response!

In [5]:
llm("What is the square root of pi?")

['The square root of pi is approximately 1.77245385091.']

We'll also call `dspy.settings.configure` with our OpenAI model in the `lm` (Language Model) field, so that it is the default LM whenever we don't specify one when calling our DSPy `Predictors`.

In [6]:
import dspy

dspy.settings.configure(lm=llm)

## Data

We're going to be using a dataset of stock-related tweets, each labeled with a sentiment: `1` for positive and `0` for negative.

We have a total of 5,791 rows of data, and will be splitting that into a `trainset` and a `valset` - for training and evaluation.

In [7]:
import pandas as pd

dataset = pd.read_csv("https://raw.githubusercontent.com/cjflanagan/cs68/master/stock_data_nlp.csv")
# change sentiment from 1 to "Positive" and 0 to "Negative"
# dataset['Sentiment'] = dataset['Sentiment'].replace([0, 1], ['Negative', 'Positive'])
dataset = dataset.sample(frac=1)  # frac=1 shuffles all rows
dataset.head()

Unnamed: 0,Text,Sentiment
4449,user: 5 Tech Stocks That Typically ally After ...,1
5212,RT @djtgallagher: Samsung - like Micron and Nv...,1
1890,CEN price action looking good going into earni...,1
2391,Buying V on any and all pullbacks. Doubled si...,1
5320,Altria��������s fraught investment in e-cigare...,0


In [None]:
dataset.info()

In [8]:
dataset.to_csv("tweet_sentiment.csv", index=False)

We shuffled the dataset above (`sample(frac=1)`) so that our labels are not clumped together and our `valset` is reasonably representative of our `trainset`.
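
One thing the shuffle in this notebook does not do is fix a seed, so each run yields a different split. Below is a minimal, plain-Python sketch of a reproducible shuffle-and-holdout; `shuffle_and_split` and the toy rows are illustrative assumptions, not code from this notebook.

```python
import random


def shuffle_and_split(rows, holdout, seed=0):
    """Shuffle a copy of `rows` with a fixed seed, then hold out the first `holdout` items."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[holdout:], shuffled[:holdout]


# Toy data with clumped labels: 60 negatives followed by 140 positives
rows = [(f"tweet {i}", 0) for i in range(60)] + [(f"tweet {i}", 1) for i in range(60, 200)]
train, val = shuffle_and_split(rows, holdout=50, seed=42)
```

With pandas, the equivalent would be passing `random_state` to `sample`, for example `dataset.sample(frac=1, random_state=42)`.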

We'll move our `Dataset` into the expected format in DSPy which is the [`Example`](https://dspy-docs.vercel.app/docs/deep-dive/data-handling/examples)!


Our examples will have two keys:

- `sentence`, our input sentence to be rated
- `rating`, our rating label

We'll specify our input as `sentence` to properly leverage the DSPy framework.
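
To illustrate what marking `sentence` as the input means, here is a small stand-in class. `MiniExample` is hypothetical and is not DSPy's implementation; it only shows the idea of separating the fields the program may see (inputs) from the fields kept back for scoring (labels).

```python
class MiniExample:
    """Hypothetical stand-in for an example record with designated input keys."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._input_keys = set()

    def with_inputs(self, *keys):
        """Return a copy in which `keys` are marked as inputs."""
        missing = set(keys) - set(self._fields)
        if missing:
            raise KeyError(f"unknown fields: {sorted(missing)}")
        copy = MiniExample(**self._fields)
        copy._input_keys = set(keys)
        return copy

    def inputs(self):
        # Fields the program is allowed to see
        return {k: v for k, v in self._fields.items() if k in self._input_keys}

    def labels(self):
        # Everything else is treated as a label
        return {k: v for k, v in self._fields.items() if k not in self._input_keys}


example = MiniExample(
    sentence="CEN price action looking good going into earnings tonight",
    rating=1,
).with_inputs("sentence")
```

Here `example.inputs()` holds only the sentence, while `example.labels()` holds only the rating.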

In [9]:
from dspy import Example

trainset = []

# Iterate over rows in the DataFrame
for index, row in dataset.iterrows():
    trainset.append(Example(sentence=row["Text"], rating=row["Sentiment"]).with_inputs("sentence"))

len(trainset)

5791

In [10]:
trainset[0:10]

[Example({'sentence': 'user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE', 'rating': 1}) (input_keys={'sentence'}),
 Example({'sentence': 'RT @djtgallagher: Samsung - like Micron and Nvidia - is getting a nice WFH boost from data center demand. Pain may be coming later, @jackyc�������', 'rating': 1}) (input_keys={'sentence'}),
 Example({'sentence': 'CEN price action looking good going into earnings tonight', 'rating': 1}) (input_keys={'sentence'}),
 Example({'sentence': 'Buying V on any and all pullbacks.  Doubled since IPO, expecting another double sooner than later', 'rating': 1}) (input_keys={'sentence'}),
 Example({'sentence': 'Altria��������s fraught investment in e-cigarette company Juul Labs will take even longer to redeem following a challenge������� https://t.co/AN7j1yYOe9', 'rating': 0}) (input_keys={'sentence'}),
 Example({'sentence': 'TM - ahhh! stupid prelim results! short but wanted to add puts before earnings...', 'rating': 0}) (input_k

We'll then hold out the first 100 examples as our `valset` and keep the remaining 5,691 as our `trainset`.

In [11]:
valset = trainset[0:100]
trainset = trainset[100:]

Let's take a peek at an example from our `trainset` and `valset`!

In [12]:
train_example = trainset[0]
print(f"Sentence: {train_example.sentence}")
print(f"Label: {train_example.rating}")

Sentence: FS daily few like the long call on this one but after chaos and mini sell we got oversold PWmo2!
Label: 1


In [13]:
valset_example = valset[0]
print(f"Sentence: {valset_example.sentence}")
print(f"Label: {valset_example.rating}")

Sentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE
Label: 1


## Signature

The first foundational unit in DSPy is the `Signature`.

In a sense, a `Signature` can be thought of as both a prompt, as well as metadata about that prompt.

Going beyond just a simple `SystemMessage`, as seen in other frameworks, the `Signature` helps DSPy validate datatypes, create examples, and more.
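
To make the "prompt plus metadata" idea concrete, here is a minimal sketch in plain Python. `FieldSpec`, `MiniSignature`, and `coerce` are hypothetical names, not DSPy's API; the sketch only illustrates how a type annotation attached to an output field lets a framework validate what the LLM returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    annotation: type
    is_input: bool


@dataclass(frozen=True)
class MiniSignature:
    instructions: str  # the "prompt" part
    fields: tuple      # the "metadata" part

    def coerce(self, name: str, raw: str):
        """Convert raw LLM text for field `name` into its annotated type (raises ValueError on failure)."""
        spec = next(f for f in self.fields if f.name == name)
        return spec.annotation(raw.strip())


signature = MiniSignature(
    instructions="Rate the input as being either 1 or 0. Only return 1 or 0",
    fields=(FieldSpec("sentence", str, True), FieldSpec("rating", int, False)),
)
rating = signature.coerce("rating", " 1")
```

A completion such as `"positive"` would raise a `ValueError` here, which is the kind of failure a typed output field is meant to surface.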

> NOTE: DSPy's [documentation](https://dspy-docs.vercel.app/docs/deep-dive/signature/understanding-signatures#what-is-a-signature) goes into more detail about what exactly a `Signature` is.

In [30]:
from dspy import Signature, InputField, OutputField

class PositiveOrNegativeSignature(Signature):
  """Rate the input as being either 1 or 0. Only return 1 or 0"""
  sentence: str = InputField()
  rating: int = OutputField(desc='key-value pairs')

## Predictor

Now that we have our `Signature`, we can build a `Predictor` that leverages it.

A `Predictor`, in the simplest terms, is what calls the LLM using our signature. Importantly, the `Predictor` knows how to leverage our signature to call the LLM. From DSPy's documentation, one of the most interesting parts of a `Predictor` is that it can *learn* to become better at the desired task!

Let's take a look at our `TypedPredictor` below to see more.

In [31]:
from dspy.functional import TypedPredictor

generate_label = TypedPredictor(PositiveOrNegativeSignature)

In [32]:
generate_label

TypedPredictor(PositiveOrNegativeSignature(sentence -> rating
    instructions='Rate the input as being either 1 or 0. Only return 1 or 0'
    sentence = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Sentence:', 'desc': '${sentence}'})
    rating = Field(annotation=int required=True json_schema_extra={'desc': 'key-value pairs', '__dspy_field_type': 'output', 'prefix': 'Rating:'})
))

In [33]:
label_prediction = generate_label(sentence=valset_example.sentence)
print(f"Sentence: {valset_example.sentence}")
print(f"Prediction: {label_prediction}")

Sentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE
Prediction: Prediction(
    rating=1
)


We can, at any time, check our LLM's outputs through `inspect_history`.

In [34]:
llm.inspect_history(n=1)




Rate the input as being either 1 or 0. Only return 1 or 0

---

Follow the following format.

Sentence: ${sentence}
Rating: key-value pairs (Respond with a single int value)

---

Sentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE
Rating:[32m 1[0m





'\n\n\nRate the input as being either 1 or 0. Only return 1 or 0\n\n---\n\nFollow the following format.\n\nSentence: ${sentence}\nRating: key-value pairs (Respond with a single int value)\n\n---\n\nSentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE\nRating:\x1b[32m 1\x1b[0m\n\n\n'

Notice how, without any extra effort on our part, the `TypedPredictor` has included format instructions for the LLM to help ensure the returned data has the shape we want.
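
The preamble of the prompt shown above can be reproduced from the instructions, field names, descriptions, and types alone. The function below is an illustrative sketch of that mapping; `render_format` is a hypothetical helper, and DSPy's actual templating is more involved.

```python
def render_format(instructions, fields):
    """Build the instruction-plus-format preamble from (name, description, type) triples."""
    lines = [instructions, "", "---", "", "Follow the following format.", ""]
    for name, desc, annotation in fields:
        # Assumption: int-typed fields get an extra hint, mirroring the prompt above
        hint = " (Respond with a single int value)" if annotation is int else ""
        lines.append(f"{name.capitalize()}: {desc}{hint}")
    return "\n".join(lines)


preamble = render_format(
    "Rate the input as being either 1 or 0. Only return 1 or 0",
    [("sentence", "${sentence}", str), ("rating", "key-value pairs", int)],
)
print(preamble)
```

The point is that the format instructions are derived from the `Signature`, so changing a field's type or description changes the prompt without hand-editing any prompt text.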

Let's look at another example of a `Predictor` - this time with Chain of Thought.

In order to use this - we don't have to do anything with our `Signature`! We can leave it exactly as is - and allow the `Predictor` to adapt to it.
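
With Chain of Thought, the completion carries a reasoning passage before the final answer, so something has to separate the two. A minimal sketch of that step, assuming the answer follows a `Rating:` prefix (`split_cot_completion` is a hypothetical helper, not a DSPy function, and the sample text is illustrative):

```python
def split_cot_completion(completion: str, output_prefix: str = "Rating:"):
    """Split a chain-of-thought completion into (reasoning, rating) at the last output prefix."""
    reasoning, sep, answer = completion.rpartition(output_prefix)
    if not sep:
        raise ValueError(f"completion has no {output_prefix!r} line")
    return reasoning.strip(), int(answer.strip())


completion = (
    "produce the rating. We see that the user is mentioning 5 tech stocks "
    "that typically rally after #SXSW, which indicates a positive trend.\n"
    "Rating: 1"
)
reasoning, rating = split_cot_completion(completion)
```

Splitting at the last occurrence of the prefix keeps the sketch robust if the reasoning text itself happens to mention the word "Rating:".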

> NOTE: We won't be using this predictor going forward - this is just to showcase the ease of using another `Predictor` with a `Signature`.

In [35]:
from dspy.functional import TypedChainOfThought

generate_label_with_chain_of_thought = TypedChainOfThought(PositiveOrNegativeSignature)

label_prediction = generate_label_with_chain_of_thought(sentence=valset_example.sentence)

In [36]:
print(f"Sentence: {valset_example.sentence}")
print(f"Reasoning: {label_prediction.reasoning}")
print(f"Ground Truth Label: {valset_example.rating}")
print(f"Prediction: {label_prediction.rating}")

Sentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE
Reasoning: produce the rating. We see that the user is mentioning 5 tech stocks that typically rally after #SXSW, which indicates a positive trend in the stock market.
Ground Truth Label: 1
Prediction: 1


We can, again, check our LLM's history to see what the actual prompt/response is.


In [37]:
llm.inspect_history(n=1)




Rate the input as being either 1 or 0. Only return 1 or 0

---

Follow the following format.

Sentence: ${sentence}
Reasoning: Let's think step by step in order to ${produce the rating}. We ...
Rating: key-value pairs (Respond with a single int value)

---

Sentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE
Reasoning: Let's think step by step in order to produce the rating. We see that the user is mentioning 5 tech stocks that typically rally after #SXSW, which indicates a positive trend in the stock market.
Rating: 1





"\n\n\nRate the input as being either 1 or 0. Only return 1 or 0\n\n---\n\nFollow the following format.\n\nSentence: ${sentence}\nReasoning: Let's think step by step in order to ${produce the rating}. We ...\nRating: key-value pairs (Respond with a single int value)\n\n---\n\nSentence: user: 5 Tech Stocks That Typically ally After #SXSW  via user  NFX STX PCN AMZN EXPE\nReasoning: Let's think step by step in order to\x1b[32m produce the rating. We see that the user is mentioning 5 tech stocks that typically rally after #SXSW, which indicates a positive trend in the stock market.\nRating: 1\x1b[0m\n\n\n"

## Modules

Now that we have our `TypedPredictor`, we can create a `Module`!

A `Module` is useful because it allows us to interact with the `Predictor` and `Signature` in a way that DSPy can leverage for optimization.

This helps the DSPy framework determine paths through your program - and helps during the `compilation` or optimization steps (formerly `teleprompting`).
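
As a rough picture of why wrapping predictors in a module helps a framework, the sketch below uses hypothetical stand-ins (`MiniPredictor`, `MiniModule`, `KeywordStudent`), not DSPy classes. Calling the module dispatches to `forward`, and the framework can discover the predictors it contains by scanning attributes, which is what lets an optimizer later attach demonstrations to them.

```python
class MiniPredictor:
    """Stand-in predictor: wraps a plain function instead of an LLM call."""

    def __init__(self, rule):
        self.rule = rule
        self.demos = []  # an optimizer could fill this in later

    def __call__(self, sentence):
        return self.rule(sentence)


class MiniModule:
    """Stand-in base class with PyTorch-style call dispatch."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def named_predictors(self):
        # Discover sub-components by scanning instance attributes
        return [(name, value) for name, value in vars(self).items()
                if isinstance(value, MiniPredictor)]


class KeywordStudent(MiniModule):
    def __init__(self):
        # Toy rule standing in for the LLM: "short" signals a negative tweet
        self.generate_rating = MiniPredictor(lambda s: 0 if "short" in s.lower() else 1)

    def forward(self, sentence):
        return {"rating": self.generate_rating(sentence=sentence)}


student = KeywordStudent()
```

Calling `student("...")` runs `forward`, and `student.named_predictors()` lists `generate_rating` without the caller needing to know the module's internals.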

> NOTE: You might notice this looks strikingly similar to PyTorch, and this is by design!

In [38]:
from dspy import Module, Prediction

class PositiveOrNegativeStudent(Module):
  def __init__(self):
    super().__init__()

    self.generate_rating = TypedPredictor(PositiveOrNegativeSignature)

  def forward(self, sentence):
    prediction = self.generate_rating(sentence=sentence)
    return Prediction(rating=prediction.rating)

## Evaluate

As with any good framework, DSPy has the ability to `Evaluate` - we can leverage this to determine how our current DSPy "program" (our `Module` in this case) operates.
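
Conceptually, evaluation is a loop: run the program on each example, score the prediction with a metric, and report the percentage of hits. The sketch below shows that loop in plain Python; `evaluate`, `exact_match`, and `always_positive` are illustrative stand-ins rather than DSPy's `Evaluate`, and the four examples are taken from the dataset shown earlier.

```python
def evaluate(program, devset, metric):
    """Return the percentage of devset examples for which `metric` is truthy."""
    hits = sum(bool(metric(example, program(example["sentence"]))) for example in devset)
    return round(100.0 * hits / len(devset), 2)


def exact_match(example, pred):
    return example["rating"] == pred["rating"]


def always_positive(sentence):
    # Trivial baseline program: predicts 1 for everything
    return {"rating": 1}


devset = [
    {"sentence": "CEN price action looking good going into earnings tonight", "rating": 1},
    {"sentence": "SPW new HOD", "rating": 1},
    {"sentence": "NKD Over 179.35", "rating": 1},
    {"sentence": "TM - ahhh! stupid prelim results! short but wanted to add puts before earnings...", "rating": 0},
]
score = evaluate(always_positive, devset, exact_match)
```

A trivial baseline like `always_positive` is a useful reference point: on an imbalanced dataset it can score well, so a program's accuracy is best read against it.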

> NOTE: DSPy's "program" could be loosely related to a "chain" from the popular LLM Framework LangChain.

In [39]:
from dspy.evaluate.evaluate import Evaluate

evaluate_fewshot = Evaluate(devset=valset, num_threads=1, display_progress=True, display_table=10)

def exact_match_metric(answer, pred, trace=None):
  return answer.rating == pred.rating

evaluate_fewshot(PositiveOrNegativeStudent(), metric=exact_match_metric)

Average Metric: 74 / 100  (74.0): 100%|██████████| 100/100 [00:30<00:00,  3.32it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,user: 5 Tech Stocks That Typically ally After #SXSW via user NFX STX PCN AMZN EXPE,1,1,✔️ [True]
1,"RT @djtgallagher: Samsung - like Micron and Nvidia - is getting a nice WFH boost from data center demand. Pain may be coming later, @jackyc�������",1,1,✔️ [True]
2,CEN price action looking good going into earnings tonight,1,1,✔️ [True]
3,"Buying V on any and all pullbacks. Doubled since IPO, expecting another double sooner than later",1,1,✔️ [True]
4,Altria��������s fraught investment in e-cigarette company Juul Labs will take even longer to redeem following a challenge������� https://t.co/AN7j1yYOe9,0,1,False
5,TM - ahhh! stupid prelim results! short but wanted to add puts before earnings...,0,1,False
6,SPW new HOD,1,1,✔️ [True]
7,Market Wrap Video + Additions to Watch ist including: AMSC KBH MHK TW TTMI,1,1,✔️ [True]
8,NKD Over 179.35,1,1,✔️ [True]
9,"EPS still think there is a better than 50% chance it tags 8, but moving stop on rest to 7.35",1,1,✔️ [True]


74.0

In [40]:
llm.inspect_history(n=1)




Rate the input as being either 1 or 0. Only return 1 or 0

---

Follow the following format.

Sentence: ${sentence}
Rating: key-value pairs (Respond with a single int value)

---

Sentence: HPQ NVDA QCOM umored HP tablet could use Tegra 4 chip and Android as its OS
Rating:[32m 1[0m





'\n\n\nRate the input as being either 1 or 0. Only return 1 or 0\n\n---\n\nFollow the following format.\n\nSentence: ${sentence}\nRating: key-value pairs (Respond with a single int value)\n\n---\n\nSentence: HPQ NVDA QCOM umored HP tablet could use Tegra 4 chip and Android as its OS\nRating:\x1b[32m 1\x1b[0m\n\n\n'

## Program Optimization (the Artist Formerly Known as Teleprompting)

Optimization is the crux of the DSPy framework - it is what allows it to operate at a level beyond traditional prompt engineering.

At a high level, optimization is a way for the DSPy framework to take the program, a training set, and a metric - and make changes/tweaks to our program to improve our metrics on our dataset.

Let's get started with the `LabeledFewShot` optimizer.

The `LabeledFewShot` optimizer very simply provides a sample of the `trainset` as few-shot examples!
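
The idea can be sketched in a few lines: sample `k` labeled examples and place them in the prompt ahead of the new input. The helpers below (`labeled_few_shot`, `build_prompt`) are hypothetical and simplified, not DSPy's implementation; the sentences and labels come from this notebook's dataset.

```python
import random


def labeled_few_shot(trainset, k, seed=0):
    """Pick up to k labeled examples to use as demonstrations."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))


def build_prompt(instructions, demos, sentence):
    """Join instructions, demonstrations, and the new input with '---' separators."""
    blocks = [instructions]
    blocks += [f"Sentence: {d['sentence']}\nRating: {d['rating']}" for d in demos]
    blocks.append(f"Sentence: {sentence}\nRating:")
    return "\n\n---\n\n".join(blocks)


trainset = [
    {"sentence": "Why can't they take AAP out of the index? It's holding everyone back! Slacker !!", "rating": 0},
    {"sentence": "GNC - nice inverse head and shoulders, as well as a nice pullback to support", "rating": 1},
    {"sentence": "AAP Bullish signal here.  Want to see a close (preferably) above 462.60 to confirm.", "rating": 1},
    {"sentence": "CEN price action looking good going into earnings tonight", "rating": 1},
    {"sentence": "TM - ahhh! stupid prelim results! short but wanted to add puts before earnings...", "rating": 0},
    {"sentence": "SPW new HOD", "rating": 1},
]
demos = labeled_few_shot(trainset, k=4)
prompt = build_prompt(
    "Rate the input as being either 1 or 0. Only return 1 or 0",
    demos,
    "NKD Over 179.35",
)
```

No LLM calls are needed to build the demonstrations, which is why this optimizer is cheap to run.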

In [41]:
from dspy.teleprompt import LabeledFewShot

labeled_fewshot_optimizer = LabeledFewShot(k=4)

Once we define our optimizer, we can compile our program!

In [42]:
compiled_dspy = labeled_fewshot_optimizer.compile(student=PositiveOrNegativeStudent(), trainset=trainset)

Let's evaluate!

In [43]:
evaluate_fewshot(compiled_dspy, metric=exact_match_metric)

Average Metric: 77 / 100  (77.0): 100%|██████████| 100/100 [00:30<00:00,  3.33it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,user: 5 Tech Stocks That Typically ally After #SXSW via user NFX STX PCN AMZN EXPE,1,1,✔️ [True]
1,"RT @djtgallagher: Samsung - like Micron and Nvidia - is getting a nice WFH boost from data center demand. Pain may be coming later, @jackyc�������",1,1,✔️ [True]
2,CEN price action looking good going into earnings tonight,1,1,✔️ [True]
3,"Buying V on any and all pullbacks. Doubled since IPO, expecting another double sooner than later",1,1,✔️ [True]
4,Altria��������s fraught investment in e-cigarette company Juul Labs will take even longer to redeem following a challenge������� https://t.co/AN7j1yYOe9,0,1,False
5,TM - ahhh! stupid prelim results! short but wanted to add puts before earnings...,0,0,✔️ [True]
6,SPW new HOD,1,1,✔️ [True]
7,Market Wrap Video + Additions to Watch ist including: AMSC KBH MHK TW TTMI,1,1,✔️ [True]
8,NKD Over 179.35,1,1,✔️ [True]
9,"EPS still think there is a better than 50% chance it tags 8, but moving stop on rest to 7.35",1,1,✔️ [True]


77.0

In [44]:
llm.inspect_history(n=1)




Rate the input as being either 1 or 0. Only return 1 or 0

---

Follow the following format.

Sentence: ${sentence}
Rating: key-value pairs (Respond with a single int value)

---

Sentence: Green Weekly Triangle on HEO,.....Third Scaling and Closed  
Rating: 0

---

Sentence: user: AAP  hod   WAT  CAN I SAY.... A THE BEAS>>OTA THE POO!
Rating: 1

---

Sentence: Why can't they take AAP out of the index? It's holding everyone back! Slacker !!
Rating: 0

---

Sentence: GNC - nice inverse head and shoulders, as well as a nice pullback to support from the previous box-breakout
Rating: 1

---

Sentence: HPQ NVDA QCOM umored HP tablet could use Tegra 4 chip and Android as its OS
Rating:[32m 1[0m





"\n\n\nRate the input as being either 1 or 0. Only return 1 or 0\n\n---\n\nFollow the following format.\n\nSentence: ${sentence}\nRating: key-value pairs (Respond with a single int value)\n\n---\n\nSentence: Green Weekly Triangle on HEO,.....Third Scaling and Closed  \nRating: 0\n\n---\n\nSentence: user: AAP  hod   WAT  CAN I SAY.... A THE BEAS>>OTA THE POO!\nRating: 1\n\n---\n\nSentence: Why can't they take AAP out of the index? It's holding everyone back! Slacker !!\nRating: 0\n\n---\n\nSentence: GNC - nice inverse head and shoulders, as well as a nice pullback to support from the previous box-breakout\nRating: 1\n\n---\n\nSentence: HPQ NVDA QCOM umored HP tablet could use Tegra 4 chip and Android as its OS\nRating:\x1b[32m 1\x1b[0m\n\n\n"

As you can see - with very little effort - we improved our score on the `valset` from 74% to 77%!

Let's try another optimizer - this time: [`BootstrapFewShot`](https://dspy-docs.vercel.app/docs/deep-dive/teleprompter/bootstrap-fewshot).

The key thing to note is that this optimizer works even with very few examples, because it uses the LLM to generate additional demonstrations.
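
A heavily simplified sketch of the bootstrapping idea follows: run a teacher program over training examples, keep the predictions that pass the metric as generated demonstrations, and top up with plain labeled examples. The function and the keyword-based teacher are assumptions for illustration; DSPy's `BootstrapFewShot` differs in its details (for instance, it records full traces and samples the labeled examples).

```python
def bootstrap_few_shot(teacher, trainset, metric, max_bootstrapped_demos, max_labeled_demos):
    """Collect teacher-generated demos that pass `metric`, then top up with labeled examples."""
    bootstrapped, leftovers = [], []
    for example in trainset:
        if len(bootstrapped) < max_bootstrapped_demos:
            pred = teacher(example["sentence"])
            if metric(example, pred):
                bootstrapped.append(
                    {"sentence": example["sentence"], "rating": pred["rating"], "augmented": True}
                )
                continue
        leftovers.append(example)
    room = max(0, max_labeled_demos - len(bootstrapped))
    return bootstrapped + leftovers[:room]


calls = {"n": 0}

def teacher(sentence):
    # Toy stand-in for the LLM-backed program; counts how often it is invoked
    calls["n"] += 1
    return {"rating": 0 if "short" in sentence else 1}

def exact_match(example, pred):
    return example["rating"] == pred["rating"]

trainset = [
    {"sentence": "CEN price action looking good going into earnings tonight", "rating": 1},
    {"sentence": "Why can't they take AAP out of the index? It's holding everyone back! Slacker !!", "rating": 0},
    {"sentence": "VFC - we are short this one again - bulls need to hold the support", "rating": 0},
    {"sentence": "SPW new HOD", "rating": 1},
    {"sentence": "NKD Over 179.35", "rating": 1},
]
demos = bootstrap_few_shot(teacher, trainset, exact_match,
                           max_bootstrapped_demos=2, max_labeled_demos=4)
```

In this toy run the teacher is only called until two demonstrations pass the metric, so the cost of bootstrapping depends on how quickly passing examples are found rather than on the size of the `trainset`.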

In [45]:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=4, max_labeled_demos=12)

compiled_dspy_BOOTSTRAP = optimizer.compile(student=PositiveOrNegativeStudent(), trainset=trainset)

  0%|          | 8/5691 [00:02<27:40,  3.42it/s]


Let's finally evaluate!

In [46]:
eval_output = evaluate_fewshot(compiled_dspy_BOOTSTRAP, metric=exact_match_metric)
eval_output

Average Metric: 74 / 100  (74.0): 100%|██████████| 100/100 [00:36<00:00,  2.71it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,user: 5 Tech Stocks That Typically ally After #SXSW via user NFX STX PCN AMZN EXPE,1,1,✔️ [True]
1,"RT @djtgallagher: Samsung - like Micron and Nvidia - is getting a nice WFH boost from data center demand. Pain may be coming later, @jackyc�������",1,0,False
2,CEN price action looking good going into earnings tonight,1,1,✔️ [True]
3,"Buying V on any and all pullbacks. Doubled since IPO, expecting another double sooner than later",1,1,✔️ [True]
4,Altria��������s fraught investment in e-cigarette company Juul Labs will take even longer to redeem following a challenge������� https://t.co/AN7j1yYOe9,0,1,False
5,TM - ahhh! stupid prelim results! short but wanted to add puts before earnings...,0,0,✔️ [True]
6,SPW new HOD,1,1,✔️ [True]
7,Market Wrap Video + Additions to Watch ist including: AMSC KBH MHK TW TTMI,1,1,✔️ [True]
8,NKD Over 179.35,1,1,✔️ [True]
9,"EPS still think there is a better than 50% chance it tags 8, but moving stop on rest to 7.35",1,1,✔️ [True]


74.0

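In this run, the `BootstrapFewShot` program scores 74% on our evaluation - the same as the unoptimized baseline, and 3 points below the `LabeledFewShot` program. Optimizers do not guarantee an improvement, so it is worth evaluating each compiled program rather than assuming one.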

In [47]:
llm.inspect_history(n=1)




Rate the input as being either 1 or 0. Only return 1 or 0

---

Follow the following format.

Sentence: ${sentence}
Rating: key-value pairs (Respond with a single int value)

---

Sentence: FS daily few like the long call on this one but after chaos and mini sell we got oversold PWmo2!
Rating: 1

---

Sentence: AAP Bullish signal here.  Want to see a close (preferably) above 462.60 to confirm.
Rating: 1

---

Sentence: SETPS some small some big a variety to watch - TZYM .56 JAG .75 C 1.23 AV 1.29 IDIX 4.85 MCP 7.75 MO 34.37 KO 38.75 HF 36.75
Rating: 1

---

Sentence: My setup alerts went bonkers today..One of many that got triggered.Cup+Handle breakout in PFE
Rating: 1

---

Sentence: RBI has cut the liquidity adjustment facility by 90 bps to 4%
Rating: 1

---

Sentence: user: Buffett  isn't interested in AAP at this valuation? too much profit and dead capital sitting in a bank overseas ;)
Rating: 1

---

Sentence: not likin MSFT neg dvrgnce+hammer today goin into quadruple bottom. 

"\n\n\nRate the input as being either 1 or 0. Only return 1 or 0\n\n---\n\nFollow the following format.\n\nSentence: ${sentence}\nRating: key-value pairs (Respond with a single int value)\n\n---\n\nSentence: FS daily few like the long call on this one but after chaos and mini sell we got oversold PWmo2!\nRating: 1\n\n---\n\nSentence: AAP Bullish signal here.  Want to see a close (preferably) above 462.60 to confirm.\nRating: 1\n\n---\n\nSentence: SETPS some small some big a variety to watch - TZYM .56 JAG .75 C 1.23 AV 1.29 IDIX 4.85 MCP 7.75 MO 34.37 KO 38.75 HF 36.75\nRating: 1\n\n---\n\nSentence: My setup alerts went bonkers today..One of many that got triggered.Cup+Handle breakout in PFE\nRating: 1\n\n---\n\nSentence: RBI has cut the liquidity adjustment facility by 90 bps to 4%\nRating: 1\n\n---\n\nSentence: user: Buffett  isn't interested in AAP at this valuation? too much profit and dead capital sitting in a bank overseas ;)\nRating: 1\n\n---\n\nSentence: not likin MSFT neg dvrg

In [48]:
for name, parameter in compiled_dspy_BOOTSTRAP.named_parameters():
  print(f"Parameter {name}: Num Examples: {len(parameter.demos)}, {parameter.demos[0]}")
  print()

Parameter generate_rating.predictor: Num Examples: 12, Example({'augmented': True, 'sentence': 'FS daily few like the long call on this one but after chaos and mini sell we got oversold PWmo2!', 'rating': '1'}) (input_keys=None)



In [49]:
def return_rating(sentence):
  return compiled_dspy_BOOTSTRAP(sentence=sentence).rating

In [50]:
return_rating("This is bad")

0

# Testing classifiers

## Naive Bayes Classifier

In [52]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, classification_report


# Split the data into training and testing sets
X = dataset['Text']
y = dataset['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data into numerical data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)

# Predict class labels (hard 0/1 predictions, not probabilities)
y_pred = nb_classifier.predict(X_test_vec)
# Convert string labels to numerical labels using NumPy's where function
# y_test_num = np.where(y_test == 'Positive', 1, 0)
# y_pred_num = np.where(y_pred == 'Positive', 1, 0)

# Calculate the ROC AUC metric
roc_auc = roc_auc_score(y_test, y_pred)

print("ROC AUC:", roc_auc)
print(classification_report(y_test, y_pred))



ROC AUC: 0.7030251240754688
              precision    recall  f1-score   support

           0       0.71      0.53      0.61       421
           1       0.77      0.87      0.82       738

    accuracy                           0.75      1159
   macro avg       0.74      0.70      0.71      1159
weighted avg       0.74      0.75      0.74      1159
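
Because `roc_auc_score` is given hard 0/1 predictions here rather than probabilities, the ROC curve has a single operating point and the AUC reduces to the mean of the two per-class recalls. With the recalls reported above, (0.87 + 0.53) / 2 = 0.70, which agrees with the printed ROC AUC of about 0.703. The sketch below checks this identity on toy labels; `roc_auc_from_labels` is an illustrative helper, not a scikit-learn function.

```python
def roc_auc_from_labels(y_true, y_pred):
    """ROC AUC when the scores are hard 0/1 labels: the mean of the two per-class recalls."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    recall_pos = tp / positives  # true positive rate
    recall_neg = tn / negatives  # true negative rate
    return 0.5 * (recall_pos + recall_neg)


y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
auc = roc_auc_from_labels(y_true, y_pred)  # 0.5 * (3/4 + 1/2)
```

For a threshold-free AUC from the Naive Bayes model, one option would be to score with `nb_classifier.predict_proba(X_test_vec)[:, 1]` instead of `predict`.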



## LLM Predictor

In [53]:
y_pred_llm = X_test.apply(return_rating)

In [54]:
for i in X_test[0:10].index:
  print(f"Sentence: {X_test[i]}")
  print(f"Prediction: {y_pred_llm[i]}")
  print(f"Ground Truth: {y_test[i]}")
  print()

Sentence: RT @DrewFitzGerald: And then there were 3: AT&amp;T, Verizon and T-Mobile https://t.co/YbaAavirhx
Prediction: 0
Ground Truth: 1

Sentence: ed Daily Triangle on HEO,....Cover Short Position,...Net Profit  76,560.00 (7.46%)  
Prediction: 1
Ground Truth: 1

Sentence: VFC - we are short this one again - bulls need to hold the support it is in now or the break could be significant
Prediction: 0
Ground Truth: 0

Sentence: Heard on the Street: U.S. stock bulls should take a careful look at China, where the post-coronavirus rebound has b������� https://t.co/Ct9zPe4ZGG
Prediction: 1
Ground Truth: 0

Sentence: on our ive Broadcast we got this long signal on PCN and covered discount buy-in ! needs push
Prediction: 1
Ground Truth: 1

Sentence: user: AAP Broke 435 and held above 437 which should lead higher  Since the slide began, the high days have been Tue, lows Fri
Prediction: 1
Ground Truth: 1

Sentence: CMCSA Will gap open higher; No idea where 2 enter. Would love 2 get in at 40.28 b

In [55]:
# Calculate the ROC AUC metric
roc_auc = roc_auc_score(y_test, y_pred_llm)

print("ROC AUC:", roc_auc)

print(classification_report(y_test, y_pred_llm))

ROC AUC: 0.7222431428589822
              precision    recall  f1-score   support

           0       0.66      0.62      0.64       421
           1       0.79      0.82      0.81       738

    accuracy                           0.75      1159
   macro avg       0.73      0.72      0.72      1159
weighted avg       0.75      0.75      0.75      1159

