<a href="https://colab.research.google.com/github/datascisteven/Automated-Hate-Tweet-Detection/blob/main/Steven_Yan_%E2%80%94_Greenflash_Language_Analytics_Research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Greenflash Language Analytics Research

## Purpose

This notebook is a research playground for evaluating natural language analysis techniques, with the ultimate goal of finding the best combination of accuracy, speed, and cost for each metric.

## Project Timeline

**Weeks 1 and 2: Generate data and prepare models**
- Research common structures for chat data from ChatGPT, Llama3, Claude, HuggingFace, Gemini
- Chat with some models! The wider the variety of topics, the better.
- Find existing data sets online
- Convert chats to common structure

**Weeks 3-5: Conduct one analysis below per week**
- Toxicity
- Bias
- Hallucinations


1. Begin by loading chat data into memory at the top of the file. We will use the same data for each analysis. The data should meet the following requirements:

  - Include at least 20 different conversations (but can be 100+ if you find a good pre-made data set).
  - The average conversation length should be no less than 5 prompts and responses (but can be as low as 1 prompt and response for a small number of the conversations).
  - The prompts (user input) and responses (model output) should be clearly marked as such.
  - If known, include the model that was used in the chat.
  - If known, include the system prompt for the chat.
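
The requirements above can be checked mechanically before any analysis runs. The following is a minimal sketch, assuming each conversation is a dict with a `messages` list whose items carry a `sender` field set to `"user"` or `"assistant"` (the shape of the example chat data in the Costs Per Token section). The function name `check_chat_data` and its thresholds as parameters are hypothetical.

```python
from statistics import mean

VALID_SENDERS = {"user", "assistant"}

def check_chat_data(conversations, min_conversations=20, min_avg_turns=5):
    """Return a list of human-readable problems; an empty list means the data passes."""
    problems = []
    if len(conversations) < min_conversations:
        problems.append(
            f"only {len(conversations)} conversations (need at least {min_conversations})"
        )

    turns_per_conversation = []
    for i, conv in enumerate(conversations):
        messages = conv.get("messages", [])
        # Every message must be clearly marked as a prompt or a response.
        if any(m.get("sender") not in VALID_SENDERS for m in messages):
            problems.append(f"conversation {i} has a message with an unmarked sender")
        # Count one turn per user prompt.
        turns_per_conversation.append(sum(m.get("sender") == "user" for m in messages))

    if turns_per_conversation and mean(turns_per_conversation) < min_avg_turns:
        problems.append(
            f"average length is {mean(turns_per_conversation):.1f} turns "
            f"(need at least {min_avg_turns})"
        )
    return problems
```

An empty return value means the data set meets the size, length, and labeling requirements; the model and system prompt fields are optional, so the sketch does not check them.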


## Week 1 To Do's
- [X] Request API keys
- [X] Choose top 3-5 analyses


## Week 2 To Do's
- [X] Figure out how to filter the datasets
- [X] Conduct cluster analysis of 1M Chat Dataset
- [ ] Get all the foundational models working
    - [X] HuggingFace
    - [X] OpenAI
    - [ ] Replicate
- [ ] Pull code for the research (animal-named) models
- [ ] Complete code for calculating cost per token

## Week 3 To Do's
- [ ] Finish setting up the API access
- [ ]

# Importing Packages

In [None]:
! pip install datasets openai tiktoken langdetect replicate tokencost



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import os
import requests

# pd.set_option("display.max_columns", None)
# pd.set_option("display.max_rows", None)
# pd.set_option("display.max_colwidth", None)
# pd.reset_option('^display.', silent=True)

# import tiktoken

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import BertTokenizer, BertForQuestionAnswering
from transformers import RobertaTokenizer, RobertaForQuestionAnswering
from transformers import XLNetTokenizer, XLNetForQuestionAnswering

from datasets import load_dataset

import torch

import openai
from openai import OpenAI

import matplotlib.pyplot as plt


from huggingface_hub import InferenceClient
from huggingface_hub import InferenceApi

from getpass import getpass
import replicate

from tokencost import (
    calculate_prompt_cost,
    calculate_completion_cost,
    count_message_tokens,
    count_string_tokens
)

In [None]:
from google.colab import userdata

hugging_face_api_key = userdata.get('huggingface_token')
openai_api_key = userdata.get('openai_token')
replicate_api_key = userdata.get('replicate_token')

In [None]:
import sys
sys.path.append('/content/drive/MyDrive/Projects/Greenflash')

import modules

# Datasets

Reference: [A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot](https://link.springer.com/article/10.1007/s11390-024-3767-3)


For a more in-depth look at the datasets considered, see Datasets Research Notebook:  

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JmeZFooifQC-vj8thd87c7s3ylZEGI6N?usp=sharing)

## [OpenAssistant Dataset](https://huggingface.co/datasets/OpenAssistant/oasst1)

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

[OpenAssistant Conversations -- Democratizing Large Language Model Alignment](https://arxiv.org/abs/2304.07327)
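
Because the corpus is organized as conversation trees, each message row points to its parent, and a single conversation thread is the path from a root prompt down to one leaf. The sketch below illustrates that idea on plain dicts. It assumes the `message_id`, `parent_id`, `role`, and `text` fields used in the loading code in this section; the toy rows and the helper name `thread_to_root` are invented for illustration.

```python
def thread_to_root(messages_by_id, leaf_id):
    """Walk parent links from a leaf message up to the root and return the thread in order."""
    thread = []
    current = messages_by_id.get(leaf_id)
    while current is not None:
        thread.append((current["role"], current["text"]))
        # The root has parent_id None, so the lookup returns None and the loop ends.
        current = messages_by_id.get(current["parent_id"])
    return list(reversed(thread))

# Toy rows (invented for illustration), using OASST1-style field names.
rows = [
    {"message_id": "m1", "parent_id": None, "role": "prompter", "text": "What is NER?"},
    {"message_id": "m2", "parent_id": "m1", "role": "assistant", "text": "Named entity recognition."},
    {"message_id": "m3", "parent_id": "m2", "role": "prompter", "text": "Give an example."},
    {"message_id": "m4", "parent_id": "m1", "role": "assistant", "text": "A tagging task."},
]
messages_by_id = {row["message_id"]: row for row in rows}

for role, text in thread_to_root(messages_by_id, "m3"):
    print(f"{role}: {text}")
```

Here `m2` and `m4` are two alternative replies to the same prompt, which is why one tree can yield several threads.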



In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
dataset = load_dataset("OpenAssistant/oasst1")
train = dataset["train"]
df = train.to_pandas()
val = dataset["validation"]
val_df = val.to_pandas()

df_eng = df.loc[df.lang == "en"].reset_index()
df_eng = df_eng.rename(columns={"index": "original_index"})

questions = df_eng.loc[df_eng.role == "prompter"][
    ["original_index", "message_id", "message_tree_id", "text"]
]
answers = df_eng.loc[df_eng.role == "assistant"][
    ["original_index", "parent_id", "text"]
]
qa = questions.merge(answers, left_on="message_id", right_on="parent_id")

qa_groupby = (
    qa.groupby(["message_tree_id", "original_index_x", "message_id", "text_x"])
    .agg({"original_index_y": lambda x: list(x), "text_y": lambda x: list(x)})
    .reset_index()
)

questions_df = (
    qa_groupby.groupby("message_tree_id")
    .agg({"text_x": lambda x: list(x)})
    .reset_index()
)

questions = questions_df.text_x.values.tolist()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/39.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/84437 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4401 [00:00<?, ? examples/s]

In [None]:
questions_one = questions[500: 1000]


# Models

For a more in-depth look at the full set of models explored, see Models Research Notebook:  

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1H0qq2YZrpgARSG01tBbMj6qYmBKryqC3?usp=sharing)




Load any models that are used across analyses so that they don't need to be reloaded into memory each time.
  - SpaCy
  - NLTK
  - PyTorch and/or TensorFlow
  - HuggingFace
  - Replicate
  - Llama3
  - OpenAI (or other foundation model APIs)
  - 1-bit models (or other very low memory)
  - Whatever else you find!
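
One way to make sure a shared model is loaded only once is to wrap the loader in a cache. This is a sketch under stated assumptions: `get_model` and `_load` are hypothetical names, and `_load` is a placeholder that would be swapped for a real loader such as `spacy.load(name)` or `AutoModelForCausalLM.from_pretrained(name)`.

```python
from functools import lru_cache

def _load(name):
    # Placeholder loader (hypothetical); replace with the real, expensive load call.
    print(f"loading {name} ...")
    return {"name": name}

@lru_cache(maxsize=None)
def get_model(name):
    """Load a model once per name; later calls return the same cached object."""
    return _load(name)

first = get_model("en_core_web_lg")   # triggers the load
second = get_model("en_core_web_lg")  # served from the cache, no second load
assert first is second
```

Each analysis can then call `get_model(...)` freely without paying the load cost again within the same session.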

## OpenAI API

### ```gpt-4o-2024-05-13```


In [None]:
%%time

client = OpenAI(
    api_key=openai_api_key
)

def ask_gpt4(questions):
    gpt4_answers = []
    prompt_token_cost = 0
    completion_token_cost = 0
    message_token_count = 0
    response_token_count = 0
    for q in questions:
        gpt4_answer_list = []
        for question in q:
            try:
                # Named `messages` rather than `input` to avoid shadowing the builtin
                messages = [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": question},
                ]
                response = client.chat.completions.create(
                    model="gpt-4o-2024-05-13",  # Pinned GPT-4o snapshot
                    messages=messages
                )
                answer = response.choices[0].message.content
                # Accumulate costs and token counts with tokencost
                prompt_token_cost += calculate_prompt_cost(messages, model='gpt-4o-2024-05-13')
                completion_token_cost += calculate_completion_cost(answer, model='gpt-4o-2024-05-13')
                message_token_count += count_message_tokens(messages, model='gpt-4o-2024-05-13')
                response_token_count += count_string_tokens(prompt=answer, model='gpt-4o-2024-05-13')
                gpt4_answer_list.append(answer)
            except Exception as e:
                gpt4_answer_list.append(f"Error: {str(e)}")
        gpt4_answers.append(gpt4_answer_list)
    return gpt4_answers, prompt_token_cost, completion_token_cost, message_token_count, response_token_count

gpt4_answers, gpt4_prompt_cost, gpt4_completion_cost, message_token_count, response_token_count = ask_gpt4(questions_one)

In [None]:
gpt4_answers

In [None]:
print("Prompt Cost: ", gpt4_prompt_cost)
print("Completion Cost: ", gpt4_completion_cost)
print("Message Token Count: ", message_token_count)
print("Response Token Count: ", response_token_count)

## Replicate API

In [None]:
REPLICATE_API_TOKEN = getpass()
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN

··········


### ```meta/meta-llama-3-70b-instruct```

In [None]:
%%time

def run_replicate_llama(questions):

    answers = []
    # prompt_cost = 0
    # completion_cost = 0
    # message_token_count = 0
    # response_token_count = 0
    for q in questions:
        answer_list = []
        for question in q:
            output = replicate.run(
            "meta/meta-llama-3-70b-instruct",
            input={"seed": 42,
                    "top_p": 0.9,
                    "prompt": question,
                    "max_tokens": 512,
                    "min_tokens": 0,
                    "temperature": 0.6,
                    "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
                    "presence_penalty": 1.15}
            )
            answer_list.append(''.join(output))
            # prompt_cost += calculate_prompt_cost(question, model='replicate/meta/meta-llama-3-70b-instruct')
            # completion += calculate_completion_cost(output, model='replicate/meta/meta-llama-3-70b-instruct')
            # message_token_count += count_message_tokens(question, model='replicate/meta/meta-llama-3-70b-instruct')
            # response_token_count += count_string_tokens(output, model='replicate/meta/meta-llama-3-70b-instruct')

        answers.append(answer_list)

    # return answers, prompt_cost, completion_cost, message_token_count, response_token_count
    return answers

# replicate_llama_answers, replicate_llama_prompt_cost, replicate_llama_completion_cost, replicate_llama_message_token_count, replicate_llama_response_token_count = run_replicate_llama(questions)

replicate_llama_answers = run_replicate_llama(questions_one)



### ```mistralai/mistral-7b-instruct-v0.2```



In [None]:
cost_per_token_input = 0.05/10**6
cost_per_token_output = 0.25/10**6

In [None]:
%%time

def run_replicate_mistral(questions):

    answers = []
    prompt_cost = 0
    completion_cost = 0
    message_token_count = 0
    response_token_count = 0

    for q in questions:
        answer_list = []
        for question in q:
            output = replicate.run(
            "mistralai/mistral-7b-instruct-v0.2",
            input={
                "seed": 42,
                "top_k": 50,
                "top_p": 0.9,
                "prompt": question,
                "temperature": 0.6,
                "system_prompt": "You are a very helpful, respectful and honest assistant.",
                "length_penalty": 1,
                "max_new_tokens": 1024,
                "prompt_template": "<s>[INST] {prompt} [/INST] ",
                "presence_penalty": 0}
            )
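            # Join the returned chunks once so the text can be stored and tokenized
            answer = ''.join(output)
            answer_list.append(answer)

            # `tokenizer` must already be loaded (for example, a Mistral tokenizer via AutoTokenizer)
            n_prompt_tokens = len(tokenizer.encode(question))
            n_response_tokens = len(tokenizer.encode(answer))
            message_token_count += n_prompt_tokens
            response_token_count += n_response_tokens
            # Charge only this call's tokens; multiplying the running totals would double count
            prompt_cost += n_prompt_tokens * cost_per_token_input
            completion_cost += n_response_tokens * cost_per_token_output

        answers.append(answer_list)

    return answers, prompt_cost, completion_cost, message_token_count, response_token_count

(
    replicate_mistral_answers,
    replicate_mistral_prompt_cost,
    replicate_mistral_completion_cost,
    replicate_mistral_message_token_count,
    replicate_mistral_response_token_count,
) = run_replicate_mistral(questions_one)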


In [None]:
%%time

def run_replicate_mistral(questions):

    answers = []
    # prompt_cost = 0
    # completion_cost = 0
    # message_token_count = 0
    # response_token_count = 0

    for q in questions:
        answer_list = []
        for question in q:
            output = replicate.run(
            "mistralai/mistral-7b-instruct-v0.2",
            input={
                "seed": 42,
                "top_k": 50,
                "top_p": 0.9,
                "prompt": question,
                "temperature": 0.6,
                "system_prompt": "You are a very helpful, respectful and honest assistant.",
                "length_penalty": 1,
                "max_new_tokens": 1024,
                "prompt_template": "<s>[INST] {prompt} [/INST] ",
                "presence_penalty": 0}
            )
            answer_list.append(''.join(output))
            # prompt_cost += calculate_prompt_cost(question, model='replicate/mistralai/mistral-7b-instruct-v0.2')
            # completion_cost += calculate_completion_cost(output, model='replicate/mistralai/mistral-7b-instruct-v0.2')
            # message_token_count += count_message_tokens(question, model='replicate/mistralai/mistral-7b-instruct-v0.2')
            # response_token_count += count_string_tokens(output, model='replicate/mistralai/mistral-7b-instruct-v0.2')

        answers.append(answer_list)

    # return answers, prompt_cost, completion_cost, message_token_count, response_token_count
    return answers

replicate_mistral_answers = run_replicate_mistral(questions_one)

# replicate_mistral_answers, replicate_mistral_prompt_cost, replicate_mistral_completion_cost, replicate_mistral_message_token_count, replicate_mistral_response_token_count = run_replicate_mistral(questions_100)

### ```google-deepmind/gemma-7b-it```

In [None]:
%%time

def run_replicate_gemma(questions):

    gemma_answers = []
    gemma_prompt_cost = 0
    gemma_completion_cost = 0

    for q in questions:
        gemma_answer_list = []
        for question in q:
            gemma_output = replicate.run("google-deepmind/gemma-7b-it:2790a695e5dcae15506138cc4718d1106d0d475e6dca4b1d43f42414647993d5",
            input={
                "top_k": 50,
                "seed": 42,
                "top_p": 0.95,
                "prompt": question,
                "temperature": 0.7,
                "max_new_tokens": 512,
                "min_new_tokens": -1}
            )
            gemma_answer_list.append(''.join(gemma_output))

        gemma_answers.append(gemma_answer_list)

    return gemma_answers


## Model Comparison

In [None]:
model_df = pd.DataFrame(
    {'question': questions,
     'answer': answers,
     'DialoGPT-medium': dialogpt_answers,
     'GPT-NeoX-20B': eleuther_answers,
     'Meta-Llama-3-8B-Instruct': meta_answers,
     'Mistral-Nemo-Instruct-2407': mistral_answers,
     'Falcon-7B-Instruct': falcon_instruct_answers,
     'Phi-3-mini-4k-instruct': phi_answers,
     'zephyr-7b-beta': zephyr_answers
     }
)

## Costs Per Token


Determine the costs per token of all APIs, and define functions to estimate the cost of an analysis that can be used below.
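
As a starting point, a cost estimator can be a lookup table of per-token prices plus one multiplication per direction. The sketch below assumes prices are quoted in USD per one million tokens with separate input and output rates. The single table entry reuses the Mistral rates set in the Replicate section above (0.05 and 0.25); the key string and the names `PRICE_PER_MILLION` and `estimate_cost` are hypothetical, and other APIs should be added only after checking their current price sheets.

```python
# Prices in USD per one million tokens, stored as (input_rate, output_rate).
PRICE_PER_MILLION = {
    "replicate/mistral-7b-instruct-v0.2": (0.05, 0.25),
}

def estimate_cost(api, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one analysis from its token counts."""
    try:
        input_rate, output_rate = PRICE_PER_MILLION[api]
    except KeyError:
        raise ValueError(f"No pricing defined for API {api!r}") from None
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# Worked example: 200,000 prompt tokens and 400,000 completion tokens.
# 200,000 * 0.05 / 1e6 + 400,000 * 0.25 / 1e6 = 0.01 + 0.10 = 0.11 USD
cost = estimate_cost("replicate/mistral-7b-instruct-v0.2", 200_000, 400_000)
print(f"${cost:.2f}")
```

Keeping input and output rates separate matters because completions are typically priced higher than prompts, as the rates above show.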

In [None]:
# Example chat data
# (feel free to change the shape of these data if needed)
chat_data_example = [
    {
        "conversation_id": "12345",
        "model": "gpt4o",
        "system_prompt": "You are a helpful assistant.",
        "messages": [
            {
                "sender": "user",
                "timestamp": "2024-07-28T10:00:00Z",
                "content": "Hello, how are you?"
            },
            {
                "sender": "assistant",
                "timestamp": "2024-07-28T10:00:10Z",
                "content": "I'm good, thank you! How can I assist you today?"
            },
            {
                "sender": "user",
                "timestamp": "2024-07-28T10:01:00Z",
                "content": "Can you help me with a Python question?"
            },
            {
                "sender": "assistant",
                "timestamp": "2024-07-28T10:01:10Z",
                "content": "Sure, what do you need help with?"
            }
        ]
    }
]
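
One way to consume this structure is to flatten each conversation into prompt/response pairs. This is a minimal sketch, assuming the `sender` and `content` fields shown above; the helper name `to_prompt_response_pairs` is hypothetical, and the sample repeats the example conversation without timestamps so the block runs on its own.

```python
def to_prompt_response_pairs(conversation):
    """Pair each user message with the assistant message that follows it."""
    pairs = []
    pending_prompt = None
    for message in conversation["messages"]:
        if message["sender"] == "user":
            pending_prompt = message["content"]
        elif message["sender"] == "assistant" and pending_prompt is not None:
            pairs.append((pending_prompt, message["content"]))
            pending_prompt = None
    return pairs

sample = {
    "conversation_id": "12345",
    "messages": [
        {"sender": "user", "content": "Hello, how are you?"},
        {"sender": "assistant", "content": "I'm good, thank you! How can I assist you today?"},
        {"sender": "user", "content": "Can you help me with a Python question?"},
        {"sender": "assistant", "content": "Sure, what do you need help with?"},
    ],
}

for prompt, response in to_prompt_response_pairs(sample):
    print(f"USER: {prompt}\nASSISTANT: {response}\n")
```

A user message with no reply is dropped rather than paired with an empty string, so every pair contains real model output.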

In [None]:
# Example of loading an expensive model up front
!python -m spacy download en_core_web_lg

import spacy

# Keep the loaded pipeline in `nlp` so the name `spacy` still refers to the module
nlp = spacy.load("en_core_web_lg")

In [None]:
# Example of token cost function
example_API_cost_per_token = 0.001

def calculate_costs(api, num_tokens):
  match api:
    case "example":
      return num_tokens * example_API_cost_per_token
    case _:
      raise ValueError('API name is not defined')

# Analysis of Metrics






The following categories are available to choose from for analysis:

- **Bias** (Steven)
- **Toxicity** (Steven)
- Politeness
- Prompt leakage
- Response drift
- **Hallucinations** (Steven)
- Political language & content
- Prompt & response specificity
- Grammar, spelling, and fluency
- Contextual completeness
- Coherence
- Linguistic diversity
- Humor & sarcasm detection
- Redundancy & repetition

For each analysis, include:
- A brief synopsis of what it is trying to measure
- Why this metric is important to a customer
- Any other relevant high level context about this field of analysis

Then:
- Research best practices for the analysis. Is there a generally accepted "best"? How about fastest? Lowest memory usage? Etc.
- Perform the analysis with each of the models and techniques researched/loaded above
- Measure the latency when performing the analysis
- Estimate the cost for each analysis based on token count (for NLP APIs) or latency (for locally-computed analyses)
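
Latency can be measured with a small wrapper around whichever analysis function is under test. The sketch below is one possible approach using `time.perf_counter`; the name `measure_latency`, the `repeats` parameter, and the word-count stand-in analysis are all hypothetical.

```python
import time
from statistics import mean

def measure_latency(analysis_fn, inputs, repeats=3):
    """Run analysis_fn on every input `repeats` times and summarize seconds per call."""
    timings = []
    for item in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            analysis_fn(item)
            timings.append(time.perf_counter() - start)
    return {"mean_s": mean(timings), "max_s": max(timings), "calls": len(timings)}

# Stand-in analysis (hypothetical): count the words in a response.
stats = measure_latency(lambda text: len(text.split()), ["a short reply", "another reply"])
print(stats)
```

Repeating each call smooths out one-off delays such as cold caches; for API-backed analyses, `repeats=1` avoids paying for duplicate requests.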


EXAMPLE: Named Entity Recognition

This analysis attempts to recognize and flag known and customer-generated entities such as people, brand names, and products.


Customer Impact

NER will help customers understand whether their users are particularly interested in the customer's own brands and products, competitors' brands and products, or unrelated topics that are outside the scope of the tool.
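
The customer-generated part of this example can be illustrated without a statistical model. The sketch below is a simple dictionary (gazetteer) lookup with regular expressions, not a trained NER model such as spaCy's; the brand names, labels, and the helper name `find_custom_entities` are invented for illustration.

```python
import re

def find_custom_entities(text, entities):
    """Return (entity, label, start, end) for each case-insensitive whole-word match."""
    matches = []
    for name, label in entities.items():
        pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            matches.append((name, label, m.start(), m.end()))
    # Sort by position so matches read in document order.
    return sorted(matches, key=lambda match: match[2])

# Invented brand names, for illustration only.
customer_entities = {"Acme Cloud": "OWN_PRODUCT", "Globex": "COMPETITOR"}
text = "Is Acme Cloud cheaper than Globex? I heard globex raised prices."

for entity in find_custom_entities(text, customer_entities):
    print(entity)
```

Counting matches per label gives a rough signal of how often users mention the customer's own products versus competitors; a trained model would still be needed for entities that are not listed in advance.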

## Bias

## Toxicity

## Hallucinations