In [20]:
import pandas as pd
import cohere
import dvc.api as DvcApi
from dotenv import load_dotenv 

In [21]:
#import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [22]:
import sys, os

sys.path.append(os.path.abspath("../.."))
sys.path.append(os.path.abspath("../scripts"))

In [23]:
import prompt_functions as promptf

In [24]:
load_dotenv(encoding='utf-16')

api_key = os.getenv('apikey')

In [25]:
# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)

In [26]:
#pd.set_option('display.max_colwidth', None)

# Project1

### Project Overview
A client has a system that collects news artifacts from web pages, tweets, Facebook posts, etc. The client is interested in scoring a given news artifact against a topic and has hired experts to score a few of these news items. Scores range from 0 to 10 and signify the degree of relevance of the news item to the topic (breaking news that may lead to **public unrest**).

We want to explore how effective GPT-3-like LLMs are at this task. If the recommendation is positive, we must demonstrate that our prompt-design strategies are reproducible and produce consistent results.

Design a pipeline that takes a news item (e.g. title + description + body) and returns a score for the news item.

The columns of this data are as follows:

- Domain: the base URL or a reference to the source the item comes from
- **Title:** title of the item
- **Description:** the content of the item
- **Body:** the content of the item
- Link: URL to the item source (it may no longer be functional)
- Timestamp: time at which the item was collected
- **Analyst_Average_Score:** target variable - the score to be estimated 
- Analyst_Rank: score as rank
- Reference_Final_Score: Not relevant for now - it is a transformed quantity

In [27]:
path = "data/Example_data1.csv"
repo = "../"
version = "v0"

data_url = DvcApi.get_url(path = path, repo = repo, rev = version) #could be tag or git commit
data_news = pd.read_csv(data_url)

In [28]:
data_news.head(2)

| | Domain | Title | Description | Body | Link | timestamp | Analyst_Average_Score | Analyst_Rank | Reference_Final_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | rassegnastampa.news | Boris Johnson using a taxpayer-funded jet for ... | …often trigger a protest vote that can upset…t... | Boris Johnson using a taxpayer-funded jet for ... | https://rassegnastampa.news/boris-johnson-usin... | 2021-09-09T18:17:46.258006 | 0.0 | 4 | 1.96 |
| 1 | twitter.com | Stumbled across an interesting case, a woman f... | Stumbled across an interesting case, a woman f... | Stumbled across an interesting case, a woman f... | http://twitter.com/CoruscaKhaya/status/1435585... | 2021-09-08T13:02:45.802298 | 0.0 | 4 | 12.0 |


In [29]:
data_news['Analyst_Average_Score']

0    0.00
1    0.00
2    0.00
3    0.00
4    0.00
5    1.33
6    0.00
7    1.66
8    0.33
9    0.00
Name: Analyst_Average_Score, dtype: float64

We have a very small sample:

- Ten examples in total.
- Only three have non-zero relevance to public unrest.
- All scores are below 2.
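
These three observations can be checked directly against the scores printed above. A minimal sketch, using a plain list copied from that output rather than the `data_news` DataFrame (the variable names are illustrative):

```python
# Analyst_Average_Score values copied from the output above
scores = [0.00, 0.00, 0.00, 0.00, 0.00, 1.33, 0.00, 1.66, 0.33, 0.00]

n_total = len(scores)                          # number of examples
n_nonzero = sum(1 for s in scores if s > 0)    # examples with some relevance
max_score = max(scores)                        # highest analyst score

print(f"examples: {n_total}, non-zero: {n_nonzero}, max score: {max_score}")
```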

So we will use few-shot classification with the Classify endpoint, as explained in this [blog post](https://txt.cohere.ai/classify-three-options/).


In [30]:
df = pd.DataFrame(columns = ['text', 'Avg_Score','label'], index=range(10))

df['text'] = data_news['Title'] + data_news['Body']
# Truncate the text to 500 characters so it stays within the maximum number of tokens the LLM accepts
df['text'] = df['text'].apply(lambda x: x[:500])
df['Avg_Score']= data_news['Analyst_Average_Score']
df['label'] = data_news['Analyst_Average_Score'].apply(lambda x: 'Relevant' if x > 0 else 'Irrelevant')

df.head(3)

| | text | Avg_Score | label |
|---|---|---|---|
| 0 | Boris Johnson using a taxpayer-funded jet for ... | 0.0 | Irrelevant |
| 1 | Stumbled across an interesting case, a woman f... | 0.0 | Irrelevant |
| 2 | Marché Résines dans les peintures et revêtemen... | 0.0 | Irrelevant |


In [31]:
# 2 classes
df['label'].unique().tolist()

['Irrelevant', 'Relevant']

In [32]:
# sample per class

df['label'].value_counts()

Irrelevant    7
Relevant      3
Name: label, dtype: int64

Note that the two classes would be more accurately described as **Irrelevant** and **(Slightly) Relevant**, since all scores are below 2 and nowhere near 10. The only goal is to divide the sample according to the score.

In few-shot classification with **Cohere’s Classify endpoint**, we need to supply a few examples per class to get a decent classifier. The minimum number of data points per class is five [[reference]](https://txt.cohere.ai/classify-three-options/). The Relevant class has only three examples, which is below this minimum, so we should not expect a good classifier.

- The training dataset is referred to as "examples".
- Each example consists of a text and a label.
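
The five-per-class guideline can be checked before calling the endpoint. A minimal sketch: `MIN_PER_CLASS` and `classes_below_minimum` are illustrative names, not part of the Cohere SDK, and the label list simply mirrors the class counts shown earlier (7 Irrelevant, 3 Relevant).

```python
from collections import Counter

MIN_PER_CLASS = 5  # guideline quoted above; an assumption if the endpoint changes


def classes_below_minimum(labels, minimum=MIN_PER_CLASS):
    """Return {label: count} for every class with fewer than `minimum` examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < minimum}


# Labels mirroring the value counts above
labels = ["Irrelevant"] * 7 + ["Relevant"] * 3
short = classes_below_minimum(labels)
print(short)  # {'Relevant': 3}
```

An empty dictionary would mean every class meets the guideline.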

In [33]:
# Split the dataset into training and test portions
# Training = to train the model
# Test = For evaluating the classifier performance
X, y = df["text"], df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=21)
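
Because the split above is not stratified, the two test items may both come from the same class. A minimal sketch of a holdout that draws one item per class, written in plain Python; `stratified_holdout` is an illustrative helper rather than part of the notebook's scripts, and the label list is derived from the scores shown earlier (rows 5, 7 and 8 are non-zero).

```python
import random
from collections import defaultdict


def stratified_holdout(labels, per_class=1, seed=21):
    """Pick `per_class` test indices from every class; the rest are training indices."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    test_idx = []
    for label in sorted(by_class):  # sorted for a reproducible draw order
        test_idx.extend(rng.sample(by_class[label], per_class))
    test_idx.sort()

    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    return train_idx, test_idx


labels = ["Irrelevant"] * 5 + ["Relevant", "Irrelevant", "Relevant", "Relevant", "Irrelevant"]
train_idx, test_idx = stratified_holdout(labels)
print([labels[i] for i in test_idx])
```

scikit-learn's `train_test_split` also accepts a `stratify` argument that serves the same purpose.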

In [34]:
ex_texts = X_train.tolist()
ex_labels = y_train.tolist()

In [35]:
# Collate the examples via the Example module
from cohere.classify import Example

examples = list()
for txt, lbl in zip(ex_texts,ex_labels):
    examples.append(Example(txt,lbl))

In [36]:
# Perform classification
def classify_text(text,examples):
    classifications = co.classify(model='medium', # model version - medium-20220720
                                  inputs=[text],
                                  examples=examples)
    return classifications.classifications[0].prediction

In [37]:
# Generate classification predictions on the test dataset (this will take a few minutes)
y_pred = X_test.apply(classify_text, args=(examples,)).tolist()
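
The cell above sends one request per test item. Since `inputs` is a list, the same call can in principle carry several texts at once. A minimal sketch under two assumptions: that the endpoint accepts multiple inputs per request, and that results come back in input order. `classify_batch` is an illustrative helper, not a Cohere SDK function; it relies only on the `classify` call and the `classifications[...].prediction` attributes already used above.

```python
def classify_batch(client, texts, examples, model="medium"):
    """Classify all texts in a single request and return the predicted labels.

    `client` is expected to behave like the `co` object above: a `classify`
    method whose response has a `classifications` list of items, each with a
    `prediction` attribute.
    """
    response = client.classify(model=model, inputs=list(texts), examples=examples)
    # Assumption: one classification per input, in the same order
    return [c.prediction for c in response.classifications]


# Hypothetical usage with the notebook's objects:
# y_pred = classify_batch(co, X_test.tolist(), examples)
```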

In [38]:
# Compute metrics on the test dataset
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {100*accuracy:.2f}')
print(f'F1-score: {100*f1:.2f}')

Accuracy: 100.00
F1-score: 100.00


We get 100% accuracy, which is a good sign. However, the test set contains only two items drawn from a ten-item sample, so the result may not generalize to a larger set of cases. More data is needed.
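
To put the two-item test set in perspective, consider a trivial classifier that always predicts "Irrelevant". A minimal sketch, assuming the two test items are drawn uniformly at random from the ten examples (7 Irrelevant, 3 Relevant); it computes how often such a classifier would also score 100%.

```python
from math import comb

n_irrelevant, n_total, n_test = 7, 10, 2

# Probability that a random 2-item test set contains only "Irrelevant" items,
# in which case always predicting "Irrelevant" reaches 100% accuracy.
p_trivial_perfect = comb(n_irrelevant, n_test) / comb(n_total, n_test)
print(f"{p_trivial_perfect:.3f}")  # 0.467
```

Under that assumption, a perfect score on two items is reachable almost half the time without learning anything about the text.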