<a href="https://colab.research.google.com/github/ankit-kothari/Data-Science-Journey/blob/master/evaluating_llm_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1: Productizing Large Language Models

The focus of this project is model evaluation. We will use a dataset from a code explanation use case and explore some of the evaluation approaches from the written material.

Our objective is to measure the performance of our LLM applications so that we can make informed decisions about quality and maintain performance over time.

**Note:** Please turn off your Colab when you're not using it so that Google doesn't count the idle time against your daily free GPU allowance.



## [Optional] Part 0: Use OpenAI

### Setting up OpenAI access

Before we dive into the exciting world of open source large language models, we can set up a proprietary model service like OpenAI for comparison. This is useful if you're transitioning an application to open source and want to compare the results. That said, using OpenAI is not essential to the project, so feel free to skip ahead.

### OpenAI Python SDK Installation

The first step is to install the `openai` Python library. This library is a convenient tool that allows us to interact with the OpenAI API programmatically.

Please run the following command to install the `openai` library in your Jupyter notebook:


In [None]:
!pip install openai -q

In [None]:
# Before you do this step, please ensure you have created both OpenAI and HuggingFace tokens. Click on the 'Key' Icon in the sidebar on the left to input your tokens.
from google.colab import userdata

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key=userdata.get('course-key'),
)

### Importing the Library
Once installed, we import the library into our notebook environment and create a client, as in the cell above.

### API Key Configuration
To use the API, you'll need an API key from OpenAI. This key is unique to you and serves as your credential when making API requests.

Note: Keep your API key confidential to protect your account and prevent unauthorized usage. If you're viewing this notebook in a shared environment, remember to remove your API key before saving or sharing the notebook.

Once you've entered your API key, you'll be all set to start using the OpenAI API in our following exercises. Let's embark on this journey of learning and discovery together!

## Part 1: Load Our Use Case Data

Now we're starting on the project itself. We will use a dataset of code for our use case on **code explanation**.

### Accessing Code Generation Datasets

In this section of our course, we will explore datasets that are crucial for training and evaluating models specialized in code generation. We will be using the `datasets` library from Hugging Face to load a dataset that is pertinent to our work with CodeParrot—a model that is an adaptation of GPT-2 aimed at generating Python code.

### Loading the Dataset

To start, we'll load the `github-code` dataset, which is a part of the CodeParrot project:

With the streaming=True argument, we load the dataset in a streaming fashion, which is suitable for large datasets.
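To make the idea of streaming concrete, here is a minimal sketch that uses only the standard library. The `fake_stream` generator and its record fields are hypothetical stand-ins for a streamed dataset, and `take_examples` is an illustrative helper rather than part of the `datasets` API. It shows that only the requested records are ever produced.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields one record at a time."""
    index = 0
    while True:  # conceptually unbounded, like a dataset too large to download
        yield {"code": f"print({index})", "language": "Python"}
        index += 1

def take_examples(stream, n):
    """Pull only the first n records from any iterable, leaving the rest unread."""
    return list(islice(stream, n))

samples = take_examples(fake_stream(), 3)
for sample in samples:
    print(sample["code"])
```

A dataset loaded with `streaming=True` is likewise consumed by iterating over it, so the same pattern should carry over.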

In [None]:
!pip install datasets transformers evaluate -qqq

### Alternative Datasets:
As we progress, we should also be aware of the latest data resources available to us. There are a number of other comparable datasets for code generation tasks that we could use. Feel free to explore these if you're interested in this use case.

Here is a list of datasets to explore:

* codeparrot-clean: Dataset for training and evaluating CodeParrot.
* GitHub-Code: A 1TB dataset comprising files in 32 programming languages from GitHub.
* GitHub-Jupyter: A dataset of Jupyter Notebooks mined from BigQuery GitHub.
* APPS: A benchmark for code generation with 10,000 problems.

## Part 2: Load Open source Models

Now we're going to start using open source models! At first, we will just download some models from the [Hugging Face Hub](https://huggingface.co/docs/hub/models-the-hub) and evaluate them. In later weeks we will build on this use case, and these models will serve as our baseline.

### 2.1 Utilizing Transformers for Text-to-Code Generation

As we delve into the fascinating world of code generation and interpretation, we will be using models from the `transformers` library. This powerful library, created by Hugging Face, provides us with pre-trained models that can perform a variety of tasks related to natural language processing and beyond.

In the code cell below, we load a finetuned model named `deepseek-ai/deepseek-coder-1.3b-instruct`. This model is part of the Deepseek family, tailored to generate code snippets.

### AutoTokenizer

First, we will initialize the `AutoTokenizer`:

The AutoTokenizer is responsible for converting the input text into tokens that the model can understand. By using `from_pretrained`, we load a tokenizer that is already trained to work with the Deepseek model. This ensures that the tokenization is aligned with the model's training.
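As a rough illustration of what "converting text into tokens" means, here is a toy word-level tokenizer. It is a sketch only: the real Deepseek tokenizer uses a learned subword vocabulary, and the functions `build_vocab`, `encode`, and `decode` are hypothetical names invented for this example.

```python
import re

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words/numbers, or single punctuation characters

def build_vocab(texts):
    """Assign an integer id to every distinct token seen in the texts."""
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown tokens
    for text in texts:
        for token in re.findall(TOKEN_PATTERN, text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Map text to a list of token ids, using <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in re.findall(TOKEN_PATTERN, text)]

def decode(ids, vocab):
    """Map token ids back to a space-joined string."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return " ".join(id_to_token[token_id] for token_id in ids)

vocab = build_vocab(["[a * 10 for a in range(10)]"])
ids = encode("a * 10", vocab)
print(ids, "->", decode(ids, vocab))
```

The key point is the round trip: the model only ever sees the integer ids, so the tokenizer used at inference time has to match the one used in training.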

### AutoModelForCausalLM
Following the tokenizer, we'll load the actual model:

AutoModelForCausalLM stands for an AutoModel for Causal Language Modeling. In causal language modeling, the model predicts the next token in a sequence given the tokens before it, which is exactly what we need for generating text from code. The model we are loading has been instruction-tuned on coding tasks, so it can follow a request such as explaining a piece of code rather than only continuing it. The `Auto` part means that `transformers` will figure out exactly which model class and configuration you need for the model you are referring to.
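Next-token prediction can be sketched with a toy bigram model. This is a deliberately simplified illustration, not how a transformer works internally: it counts which token follows which in a tiny made-up corpus and then generates greedily. The names `train_bigram` and `generate` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for previous, following in zip(tokens, tokens[1:]):
        counts[previous][following] += 1
    return counts

def generate(counts, start, max_new_tokens):
    """Greedy decoding: repeatedly append the most frequent follower."""
    output = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:  # no known continuation, so stop early
            break
        output.append(followers.most_common(1)[0][0])
    return output

corpus = "the list holds numbers and the list holds strings".split()
bigram_counts = train_bigram(corpus)
generated = generate(bigram_counts, "the", max_new_tokens=2)
print(" ".join(generated))
```

A causal language model follows the same loop of predicting one token, appending it, and repeating, but it conditions on the whole preceding sequence instead of only the last token.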

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct", trust_remote_code=True).cuda()


## 2.2 Load an LLM in a Pipeline

### Pipeline
Finally, we create a pipeline. The Pipeline API is a powerful abstraction in Hugging Face that hides most of the other APIs, making it easier to perform tasks in the Hugging Face ecosystem. (You can learn more about pipelines [here](https://huggingface.co/docs/transformers/main_classes/pipelines/).)
This pipeline will use the tokenizer and model we just loaded to process input data and generate predictions in a streamlined manner:

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)  # device=0 runs the pipeline on the first GPU

## Part 3: Prompt Engineering
### Introducing the Code Explanation Use Case
In the realm of software development, understanding code is as crucial as writing it. Whether for educational purposes, code review, or improving documentation, being able to generate concise explanations of code segments can significantly enhance productivity and comprehension.

To explore this, our next use case involves the task of code explanation. We will employ two different methods to automatically generate explanations for given pieces of Python code. These explanations aim to be understandable by humans, potentially aiding in learning and reviewing processes.

Method 1: OpenAI's Language Model
Our first approach uses OpenAI's language model. We've crafted a prompt that sets the context for the model, informing it that it should take on the role of a Python engineer explaining the supplied code.



Here's the function openai_infer that encapsulates the logic for this:

In [None]:
# TODO: Define prompt to explain python code

OPENAI_PROMPT = "For the code attached to this prompt, I need you to give me a step-by-step breakdown of what it does: {code}"

def openai_infer(code: str) -> str:
    """Ask the OpenAI chat model to explain the given code snippet."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": OPENAI_PROMPT.format(code=code),
            }
        ],
    )
    return completion.choices[0].message.content

In [None]:
code = """[a * 10 for a in range(10)]"""

print(f"{openai_infer(code)}")

1. The code initializes a list comprehension using square brackets [].
2. It uses a for loop to iterate through each value in the range(10), which generates a sequence of integers from 0 to 9.
3. Inside the loop, it multiplies each value 'a' by 10.
4. The result of each multiplication is added to the list comprehension.
5. Once all the iterations are complete, the list comprehension returns a new list containing the results of multiplying each element in the range(10) by 10.
6. The final result is a list of integers [0, 10, 20, 30, 40, 50, 60, 70, 80, 90].


This function sends the code to OpenAI's API, which then returns the model's attempt at explaining the code snippet.

Method 2: Hugging Face's pipeline with Deepseek.
The second approach leverages the pipeline from Hugging Face's transformers library in conjunction with the `deepseek-ai/deepseek-coder-1.3b-instruct` model. The pipeline itself is not trained for code explanation; we steer the instruction-tuned model toward that task through the prompt.

Here's a cell that uses this method:

In [None]:
code = "[a * 10 for a in range(10)]"
OPENSOURCE_PROMPT = f"I need you to give me a step-by-step breakdown of what this code does: {code}"

# Generate a response with a specific max_new_tokens value
generated_response = pipe(OPENSOURCE_PROMPT, max_new_tokens=512)

# Print the generated response
print(generated_response)

Setting `pad_token_id` to `eos_token_id`:32021 for open-end generation.


[{'generated_text': "I need you to give me a step-by-step breakdown of what this code does: [a * 10 for a in range(10)]\n\nThis code is a list comprehension in Python. It's a compact way of creating a list in Python.\n\n1. `a * 10`: This is the expression that gets evaluated for each element in the list. In this case, it's multiplying each number in the range by 10.\n\n2. `for a in range(10)`: This is the loop that's going to be executed for each element in the list. In this case, it's going through each number from 0 to 9.\n\n3. `range(10)`: This is the function that generates the numbers from 0 to 9. It's like a shortcut for creating a list of numbers from 0 to 9.\n\nSo, the whole expression `[a * 10 for a in range(10)]` is essentially creating a new list that contains the numbers from 0 to 9 multiplied by 10.\n\nThis is a very useful feature in Python, especially when you want to perform a similar operation on a large number of elements.\n"}]


In [None]:
# If you would like more control over the model's output, you can call the model directly and set the generation parameters that the pipeline abstracts away, as in the snippet below:


code = """[a * 10 for a in range(10)]"""
OPENSOURCE_PROMPT = f"I need you to give me a step-by-step breakdown of what this code does: {code}"

messages=[
    { 'role': 'user', 'content': OPENSOURCE_PROMPT}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

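# With do_sample=False the model decodes greedily; top_k and top_p only take effect if you set do_sample=True
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32021) # 32021 is the id of <|EOT|> token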

print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32021 for open-end generation.


This code is a list comprehension in Python. It's a compact way of creating a list in Python.

Here's a step-by-step breakdown:

1. `a * 10`: This is the operation being performed on each element in the range. In this case, it's multiplying each number in the range by 10.

2. `for a in range(10)`: This is the loop that's going through each number in the range from 0 to 9. `range(10)` generates a sequence of numbers from 0 to 9.

3. `[a * 10 for a in range(10)]`: This is the list comprehension itself. It's a compact way of creating a new list by performing an operation on each element in a sequence. In this case, it's creating a new list where each element is the result of multiplying the corresponding element in the original sequence by 10.

So, the whole expression `[a * 10 for a in range(10)]` is equivalent to `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`.



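The cells above insert a code snippet into a predefined prompt that asks for a step-by-step breakdown. The model then generates text that seeks to explain the code snippet. Note that the pipeline output repeats the prompt before the explanation, while the second cell decodes only the newly generated tokens.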
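Later parts of this notebook call an `opensource_infer(code)` function that is never defined. Below is one possible sketch of such a wrapper, under the assumption that the generator returns a list of dicts with a `generated_text` key, as the pipeline output above does. `make_opensource_infer`, `OPENSOURCE_TEMPLATE`, and `fake_generator` are hypothetical names; the fake generator stands in for `pipe` so the example runs without a GPU.

```python
OPENSOURCE_TEMPLATE = (
    "I need you to give me a step-by-step breakdown of what this code does: {code}"
)

def make_opensource_infer(generator, max_new_tokens=512):
    """Build an infer(code) function around a text-generation callable."""
    def infer(code):
        prompt = OPENSOURCE_TEMPLATE.format(code=code)
        outputs = generator(prompt, max_new_tokens=max_new_tokens)
        text = outputs[0]["generated_text"]
        # The pipeline output echoes the prompt, so strip it when present
        if text.startswith(prompt):
            text = text[len(prompt):]
        return text.strip()
    return infer

def fake_generator(prompt, max_new_tokens=512):
    """Hypothetical stand-in for `pipe`, returning the same output shape."""
    return [{"generated_text": prompt + "\n\nThis code builds a list."}]

opensource_infer = make_opensource_infer(fake_generator)
explanation = opensource_infer("[a * 10 for a in range(10)]")
print(explanation)
```

In the notebook, passing `pipe` instead of `fake_generator` should give a function with the same `infer(code)` shape as `openai_infer`, which makes the two methods easy to compare side by side.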

## Part 4: Evaluating Language Models

In the following cells, we will test these methods on various code snippets to evaluate their effectiveness. We aim to compare the quality, clarity, and accuracy of the explanations provided by both approaches.

## 4.1 Text quality based metrics for evaluation
In the world of automated text generation, particularly when generating explanations for code, it isn't enough to have the text merely generated; the quality of the generated text is crucial. Quality can encompass various aspects, such as readability, coherence, and overall clarity. This is where our evaluation cell comes into play.

Evaluation Function: compute_text_metrics
To evaluate the generated explanations, we will be using a function called compute_text_metrics. This function utilizes the textdescriptives Python library in combination with spaCy to compute readability and coherence metrics for any given text.

### TASK:
Look up the [text descriptives documentation](https://github.com/HLasse/TextDescriptives), select the metrics you want to use, and implement them in the function below.
### Example of readability metrics

```python
[
    'text', 'flesch_reading_ease', 'flesch_kincaid_grade', 'smog',
    'gunning_fog', 'automated_readability_index', 'coleman_liau_index',
    'lix', 'rix', 'token_length_mean', 'token_length_median',
    'token_length_std', 'sentence_length_mean', 'sentence_length_median',
    'sentence_length_std', 'syllables_per_token_mean',
    'syllables_per_token_median', 'syllables_per_token_std', 'n_tokens',
    'n_unique_tokens', 'proportion_unique_tokens', 'n_characters',
    'n_sentences'
]
```


In [None]:
# If you run into a Colab 'NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968', this code block resolves the problem
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install textdescriptives -qqq


In [None]:
import textdescriptives as td

METRICS_TO_EVALUATE = ["descriptive_stats", "readability",
  "dependency_distance", "pos_proportions", "coherence", "quality",
  "information_theory"] # Select text descriptives metrics that you want to use to evaluate


SPACY_MODEL = "en_core_web_sm"

def compute_text_metrics(text):
    """Compute readability and coherence metrics for a given text."""
    # Extract metrics using textdescriptives
    df = td.extract_metrics(text=text, spacy_model=SPACY_MODEL, metrics=METRICS_TO_EVALUATE)
    # TODO: Add code here to filter down the metrics to your Use case.
    results = df.iloc[0].to_dict()
    return results

In [None]:
example = """
The code you provided is a list comprehension in Python.

In this specific list comprehension, it generates a list of numbers where each element is the result of multiplying a number from 0 to 9 (generated by the `range(10)` function) by 10.

Let's break down the code step by step:

1. `range(10)` generates a sequence of numbers from 0 to 9 (excluding 10).
2. The `for a in range(10)` part assigns each number from the range to the variable `a` during each iteration.
3. `a * 10` multiplies each value of `a` by 10.
4. Finally, `[a * 10 for a in range(10)]` builds a new list by storing the multiplied results in it.

As a result, the code generates the list `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`, where each element is obtained by multiplying a number from 0 to 9 by 10.
"""

compute_text_metrics(example)


{'text': "\nThe code you provided is a list comprehension in Python.\n\nIn this specific list comprehension, it generates a list of numbers where each element is the result of multiplying a number from 0 to 9 (generated by the `range(10)` function) by 10.\n\nLet's break down the code step by step:\n\n1. `range(10)` generates a sequence of numbers from 0 to 9 (excluding 10).\n2. The `for a in range(10)` part assigns each number from the range to the variable `a` during each iteration.\n3. `a * 10` multiplies each value of `a` by 10.\n4. Finally, `[a * 10 for a in range(10)]` builds a new list by storing the multiplied results in it.\n\nAs a result, the code generates the list `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`, where each element is obtained by multiplying a number from 0 to 9 by 10.\n",
 'pos_prop_ADJ': 0.009523809523809525,
 'pos_prop_ADP': 0.12857142857142856,
 'pos_prop_ADV': 0.004761904761904762,
 'pos_prop_AUX': 0.014285714285714285,
 'pos_prop_CCONJ': 0.0,
 'pos_prop_DET':

The function takes in the text and computes the metric groups listed in `METRICS_TO_EVALUATE` using textdescriptives. It returns the first row of the resulting DataFrame as a dictionary. Filtering that dictionary down to the metrics relevant to your use case is left as the TODO in the function.
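One way to approach the filtering TODO is a small helper that keeps only the metric names you care about. This is a sketch: `filter_metrics` and `SELECTED_METRICS` are hypothetical names, the chosen metrics are just an example taken from the readability list above, and the input values are made up for illustration.

```python
SELECTED_METRICS = [
    "flesch_reading_ease",
    "flesch_kincaid_grade",
    "n_tokens",
    "n_sentences",
]

def filter_metrics(results, selected=SELECTED_METRICS):
    """Keep only the selected metric names that are present in the results."""
    return {name: results[name] for name in selected if name in results}

# Made-up values for illustration only, not real TextDescriptives output
raw_results = {
    "text": "example",
    "flesch_reading_ease": 60.0,
    "pos_prop_ADJ": 0.01,
    "n_tokens": 120,
}
filtered = filter_metrics(raw_results)
print(filtered)
```

Skipping names that are missing, rather than raising an error, keeps the helper usable even when a metric group is removed from `METRICS_TO_EVALUATE`.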

#### Context on TextDescriptives
TextDescriptives is a versatile Python library that calculates a wide range of statistics and metrics from texts. It extends the functionality of spaCy, a popular NLP library, allowing users to gather insights into the textual properties that contribute to the text's overall comprehensibility and flow.

With TextDescriptives, we can measure the readability of the text through various established metrics and analyze the coherence by examining how different parts of the text relate to each other, among other things.
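To give a feel for what a readability metric measures, the sketch below computes two of the established scores directly from counts, using the standard published formulas for Flesch Reading Ease and Flesch-Kincaid Grade Level. The counts are made up, and the library's own tokenization and syllable counting may differ, so treat this as an illustration of the formulas rather than a reimplementation of TextDescriptives.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Higher scores mean easier text (shorter sentences, fewer syllables per word)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Approximate US school grade level needed to read the text."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Made-up counts: 100 words in 5 sentences with 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(f"reading ease: {ease:.3f}, grade level: {grade:.2f}")
```

Both scores depend only on average sentence length and average syllables per word, which is why explanations full of code identifiers and long sentences can score as harder to read.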

#### Application
The compute_text_metrics function will be applied to the explanations generated from both models. By doing this, we can quantitatively assess which method provides explanations that are not just correct, but also easier to understand and more coherent in terms of narrative flow.

#### Upcoming Execution
As we run the evaluation, we will compare these objective metrics alongside subjective assessments to get a holistic view of each method's efficacy in explaining code. Stay tuned to see which approach yields the most comprehensible explanations according to both numerical scores and human judgement.

## 4.2 Using Classification based metrics with Hugging Face Evaluate

Next, we'll use classification models to evaluate the generative models. Here we shift focus to another vital aspect: ensuring the language used by the AI is appropriate and respectful. This is where the concept of toxicity in generated text becomes relevant, and the following cell is designed to evaluate just that.

Evaluation Function: compute_toxicity_ratio
To measure the potential toxicity in the generated explanations, we are introducing a function that uses the evaluate library, a library designed to easily evaluate and compare the outputs of AI models in a multitude of ways, including fairness, bias, and, as in this case, toxicity.

Here's the code cell explained:

In [None]:
!pip install evaluate
import evaluate

# Load the toxicity model
toxicity = evaluate.load("toxicity")


In [None]:
def compute_toxicity_ratio(text):
    """Compute the toxicity ratio of generated samples"""
    # Compute the toxicity score of the text
    results = toxicity.compute(predictions=[text], aggregation="ratio")

    return results

In [None]:
not_toxic = """
The code you provided is a list comprehension in Python.

In this specific list comprehension, it generates a list of numbers where each element is the result of multiplying a number from 0 to 9 (generated by the `range(10)` function) by 10.

Let's break down the code step by step:

1. `range(10)` generates a sequence of numbers from 0 to 9 (excluding 10).
2. The `for a in range(10)` part assigns each number from the range to the variable `a` during each iteration.
3. `a * 10` multiplies each value of `a` by 10.
4. Finally, `[a * 10 for a in range(10)]` builds a new list by storing the multiplied results in it.

As a result, the code generates the list `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`, where each element is obtained by multiplying a number from 0 to 9 by 10.
"""

toxic = """The code you provided is a list comprehension in Python. written by an idiot. """

print(f"Not toxic: {compute_toxicity_ratio(not_toxic)}")
print(f"Toxic: {compute_toxicity_ratio(toxic)}")

Not toxic: {'toxicity_ratio': 0.0}
Toxic: {'toxicity_ratio': 1.0}


The cells above calculate a toxicity ratio for generated text, such as explanations of code snippets. Here's what they do:

- They load the toxicity measurement from the evaluate library, which uses a pretrained classifier to score how toxic a piece of text is.
- They define compute_toxicity_ratio, which takes a single text (an explanation) as input and wraps it in a one-element list of predictions.
- They compute the toxicity ratio using the toxicity.compute method with aggregation="ratio", which returns the proportion of predictions that are considered toxic.
- They return the resulting dictionary, whose toxicity_ratio key holds that proportion. With a single prediction, the ratio is either 0.0 or 1.0, as in the output above.
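The ratio aggregation can be sketched in plain Python. This is an illustration under assumptions: the per-text scores below are made up, and the threshold of 0.5 with a greater-than-or-equal comparison is assumed here rather than taken from this notebook. `toxicity_ratio` is a hypothetical helper, not the library function.

```python
def toxicity_ratio(scores, threshold=0.5):
    """Fraction of texts whose toxicity score reaches the threshold."""
    if not scores:
        raise ValueError("scores must contain at least one value")
    flagged = sum(1 for score in scores if score >= threshold)  # assumed rule
    return flagged / len(scores)

# Made-up classifier scores for four generated explanations
scores = [0.01, 0.90, 0.40, 0.75]
ratio = toxicity_ratio(scores)
print(ratio)
```

Seen this way, the ratio becomes more informative once it is computed over a batch of explanations instead of one text at a time.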

#### Application
In practice, this function will be applied to the generated explanations to check that they are not only informative and readable but also free from content that could be considered toxic or harmful. This is an essential check in maintaining ethical AI practices, especially when the model-generated text is intended for public or educational use.

#### Upcoming Execution
When we run this function on our generated code explanations, we aim to confirm that our AI maintains a professional and respectful tone. It's a safeguard against unintended negative outputs that could arise from the complex nature of language generation models. By ensuring a low toxicity ratio, we can trust that the AI's explanations are suitable for a broad audience.

## Part 5: [OPTIONAL] Model-based Evaluation of language models

After discussing the tools for evaluating readability, coherence, and toxicity, we're now introducing a comprehensive model-based evaluation system. This system grades the quality of answers provided by an AI model. It leverages another language model to judge the responses' accuracy and relevance. Below is the Python code cell that encapsulates this process.

In [None]:
%%capture
pip install langchain

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain
from langchain import PromptTemplate

### Define an evaluation prompt to evaluate a language model

The grade_model_answer function utilizes the langchain library to create an evaluation workflow. It uses a GPT-3.5 model with temperature 0, which makes its output close to deterministic, as a judge to assess language model outputs against a set of questions and reference answers. This model-based evaluation method is efficient and reduces reliance on human judgment.

In [None]:
EVALUATION_PROMPT = PromptTemplate.from_template("""
You are an assistant assessing a code snippet generated by an LLM, relative to the true answer based on the provided criteria:

    QUERY: {query}

    GENERATED CODE (RESULT): {result}

    TRUE ANSWER (ANSWER): {answer}

    Criteria:
      relevance:  Is the generated code correct based on your knowledge?
      conciseness:  Is the answer concise and to the point?
      correct: Is the answer correct?

    Does the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print "CORRECT" or "INCORRECT" (without quotes or punctuation) on its own line corresponding to the correct answer.

""")

In [None]:
try:
    # openai_apikey
    LLM = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=userdata.get('OPENAI_API_KEY'))
except NameError:
    raise ValueError("This section requires OpenAI")

def grade_model_answer(predicted_dataset, predictions):

    # Create an evaluation chain
    eval_chain = QAEvalChain.from_llm(
        llm=LLM,
        prompt=EVALUATION_PROMPT
    )

    # Evaluate the predictions and ground truth using the evaluation chain
    graded_outputs = eval_chain.evaluate(
        predicted_dataset,
        predictions,
        question_key="question",
        prediction_key="result"
    )

    return graded_outputs


In [None]:
code = """[a * 10 for a in range(10)]"""
OPENSOURCE_PROMPT = f"I need you to give me a step-by-step breakdown of what this code does: {code}"

answer = """1. The code starts with a list comprehension, denoted by the square brackets `[ ]`, which allows for concise creation of a new list based on an existing list or range.
2. It uses the `range()` function with an argument of `10` to generate a sequence of numbers from `0` to `9`, inclusive. This sequence is represented as an iterable.
3. For each element `a` in this sequence, the code executes the following steps.
4. Multiplies the current value of `a` by `10`, resulting in a new value.
5. Stores this new value in the new list.
6. Once all elements have been processed and the new list is complete, it is returned as the output of the list comprehension.

In simpler terms, the code generates a new list by multiplying each number from `0` to `9` by `10`. The resulting list will have the values `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`.
"""

result = """This code is a list comprehension in Python. It's a compact way of creating a list in Python.

Here's a step-by-step breakdown:

1. `a * 10`: This is the operation being performed on each element in the range. In this case, it's multiplying each number in the range by 10.

2. `for a in range(10)`: This is the loop that's going through each number in the range from 0 to 9. `range(10)` generates a sequence of numbers from 0 to 9.

3. `[a * 10 for a in range(10)]`: This is the list comprehension itself. It's a compact way of creating a new list by performing an operation on each element in a sequence. In this case, it's creating a new list where each element is the result of multiplying the corresponding element in the original sequence by 10.

So, the whole expression `[a * 10 for a in range(10)]` is equivalent to `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`.

"""

predicted_dataset = [
    {
        "question": OPENSOURCE_PROMPT,
        "answer": answer
    }
]

predictions = [
    {
        "result": result
    }
]

In [None]:
grade_model_answer(predicted_dataset, predictions)

[{'results': 'Relevance:\n- The generated code correctly identifies that the provided code is a list comprehension in Python.\n- It correctly explains that a list comprehension is a compact way of creating a list in Python.\n- It correctly explains that the code is multiplying each number in the range by 10.\n\nConciseness:\n- The generated code provides a step-by-step breakdown of what the code does.\n- It explains the operation being performed on each element in the range.\n- It explains the loop that goes through each number in the range.\n- It explains the list comprehension itself and how it creates a new list.\n\nCorrectness:\n- The generated code correctly explains the purpose and functionality of the code.\n- It correctly explains that the code generates a new list by multiplying each number from 0 to 9 by 10.\n- It correctly provides the resulting list as `[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]`.\n\nBased on the above analysis, the submission meets all the criteria and is cor
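Because the evaluation prompt asks the judge to print CORRECT or INCORRECT on its own line, the verdict can be pulled out of the free-text reasoning with a small parser. The helper below is a hypothetical sketch, and the sample judge texts are invented. It returns `None` when the judge did not follow the requested format, which is worth counting separately.

```python
def parse_verdict(judge_text):
    """Return 'CORRECT' or 'INCORRECT' from the last line that states one, else None."""
    for line in reversed(judge_text.strip().splitlines()):
        token = line.strip().strip("\"'.").upper()  # tolerate stray quotes or a period
        if token in ("CORRECT", "INCORRECT"):
            return token
    return None  # the judge did not put a verdict on its own line

# Invented judge outputs for illustration
well_formed = "Relevance: the explanation matches the code.\nCORRECT"
malformed = "Based on the analysis, the submission is correct overall"

print(parse_verdict(well_formed))
print(parse_verdict(malformed))
```

Scanning from the end avoids picking up the word "correct" from the reasoning steps that precede the final verdict.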


### Insights from Model-based Evaluation
The insights table summarizes the benefits and considerations observed from using model-based evaluation. Key points include:


| Insight                        | Description                                                                                          |
|--------------------------------|------------------------------------------------------------------------------------------------------|
| LLM-as-a-Judge Efficiency      | Utilizing LLMs as judges is nearly as effective as human evaluations.                                |
| Cost Efficiency with GPT-3.5   | GPT-3.5 has been shown to be a cost-effective alternative for judging model output.                  |
| Grading Scale Precision        | Lower-precision grading scales tend to be more beneficial for consistency and interpretation.        |
| Specific Benchmarks for RAG    | It's important to use relevant benchmarks when evaluating models for RAG applications.               |




## Part 6: Bring it all together

In this section you will bring together the results of the previous parts into a single dataset.

In [None]:
from pandas import DataFrame

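def evaluate_generated_dataset(sample):
    """Generate explanations for one code sample and compute text metrics for each."""
    # TODO: CODE HERE
    # 1. get code out of the sample
    #    (with input_columns="code", `sample` is already the code string)
    code = sample
    # 2. generate an explanation with each model
    gpt35_explanation = openai_infer(code)
    # NOTE: opensource_infer is not defined in this notebook; it is assumed to
    # wrap the `pipe(...)` call from Part 3 and return the explanation text
    opensource_explanation = opensource_infer(code)
    # 3. compute the metrics for each explanation
    gpt35_metrics = compute_text_metrics(gpt35_explanation)
    opensource_metrics = compute_text_metrics(opensource_explanation)
    # 4. add more metrics if you wish
    # 5. create a dictionary of results to return
    return {
        "opensource_explanation": opensource_explanation,
        "gpt3.5_explanation": gpt35_explanation,
    }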
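One possible shape for the combined result is a flat row per code sample, with one column per model and metric. The sketch below is an assumption about how you might structure it, not the reference solution: `evaluate_sample` is a hypothetical helper, and the explainers and metric are simple stubs so the example runs without any model.

```python
def evaluate_sample(code, explainers, metric_fns):
    """Run every explainer on the code and score each explanation with every metric."""
    row = {"code": code}
    for model_name, explain in explainers.items():
        explanation = explain(code)
        row[f"{model_name}_explanation"] = explanation
        for metric_name, metric_fn in metric_fns.items():
            row[f"{model_name}_{metric_name}"] = metric_fn(explanation)
    return row

# Stub explainers and a stub metric, standing in for the real functions
explainers = {
    "gpt35": lambda code: f"Explains {code} in detail.",
    "opensource": lambda code: f"Explains {code}.",
}
metric_fns = {"n_words": lambda text: len(text.split())}

row = evaluate_sample("x = 1", explainers, metric_fns)
print(row)
```

In the notebook, the stubs would be replaced by the real inference and metric functions, and a list of such rows can be passed to `DataFrame` to build the final table.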

In [None]:
# test the function above

from datasets import load_dataset

ds = load_dataset("codeparrot/github-code", streaming=True, split="train")

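evaluated_dataset = ds.take(1).map(evaluate_generated_dataset, input_columns="code")

# Streaming datasets are lazy, so iterate over the result to actually run the evaluation
for row in evaluated_dataset:
    print(row)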

# The End 💐

That's all for this project. Well done on getting this far.

If you have any time left, I would encourage you to experiment with the code in this notebook and see if you can expand on the evaluation here.