# **Tutorial 4** - Submitting Your Solution

You learned about AI agents in the previous tutorial and about the data in tutorial number 2. Now, it's time to combine this knowledge into one working solution! In this tutorial, we will show you how to create a crew that can effectively utilize the data in the PIRLS database, how to test your code both locally and remotely using AWS, and how to submit your solution to compete with other teams.

As part of this year's GDSC, you are required to evaluate the submissions of other teams. At the end of this tutorial, we will explain why your role as evaluators is essential and how you can participate in the Chatbot Arena!

# Agenda
1. [Project structure](#project-structure) - explains the reasoning behind the code structure and which files are crucial for your submission.
2. [Create your first application](#create-your-first-application) - covers the steps to run your first app.
3. [How to test your code](#how-to-test-your-code) - shows you how to run tests both locally and remotely.
4. [How to submit your code](#how-to-submit-your-code) - shows you how to finally submit your solution to the competition.
5. [How does the evaluation work](#how-does-the-evaluation-work) - explains the logic of automatic evaluation.
6. [How to check the status of your application](#how-to-check-the-status-of-your-application) - access the logs.
7. [What to do if the automatic evaluation fails](#what-to-do-if-the-automatic-evaluation-fails)
8. [Chatbot Arena](#chatbot-arena) - shows you how to use the Arena and explains why it is important to rate battles.
9. [Human evaluation questions](#human-evaluation-questions) - shows you how to add new questions to the competition and explains the benefits of it.

# Notebook setup

In [None]:
pip install -r requirements.txt

# Project structure

## [CodeCommit](https://aws.amazon.com/de/codecommit/)
When you created or joined a team, you gained access to an AWS account. From the AWS Management Console, you can navigate to AWS CodeCommit to view your team's repository.

1. [<img src="../images/t4_code_commit_1.png" width=800/>](../images/t4_code_commit_1.png) $\space\space\space\space$ 2. [<img src="../images/t4_code_commit_2.png" width=800/>](../images/t4_code_commit_2.png)

In CodeCommit, you can see your team's repository with code that has already been prepared for you. We will explain this code in more detail throughout this and the next tutorial.

If you want to use this code in your local IDE, you'll need to install git-remote-codecommit, set up credentials for your AWS account locally, and then use the HTTPS (GRC) clone URL to download the repo. A tutorial for that can be found [here](https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-git-remote-codecommit.html).

## Code structure

Let's get familiar with the GDSC directory structure. In the src folder, there are two main directories: submission and static.

- submission - This is the directory where all of your code will be placed. Here, you can modify and create new crews, tools, agents, and do whatever your heart desires.
- static - This directory contains code that cannot be modified. The static directory is replaced by the GDSC team with each submission, so no changes made here will be reflected in your final submission.

Despite the considerable freedom given to participants in the submission directory, there is one particularly important file to which you must pay extra attention:  [src/submission/create_submission.py](src/submission/create_submission.py). Let's take a look inside!



```python
from src.static.ChatBedrockWrapper import ChatBedrockWrapper
from src.static.submission import Submission

def create_submission(call_id: str) -> Submission:
    ...
```

As you can see, there's a single function defined. This function will be **the entry point** for your submission. It is used to instantiate your submission, and the signature of this function cannot be modified. It must take a string named `call_id` and return an object of type `Submission`. The body of the function and necessary imports are up to your implementation.
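Because the signature is fixed, it can be worth checking it before you push. The following is a minimal sketch using Python's standard `inspect` module; `check_entry_point` is a hypothetical helper and not part of the GDSC code, and the `create_submission` defined here is only a stand-in so the example runs on its own.

```python
import inspect


def check_entry_point(func) -> list[str]:
    """Return a list of problems found in the entry point's signature (empty if none)."""
    problems = []
    params = list(inspect.signature(func).parameters.values())
    if len(params) != 1 or params[0].name != "call_id":
        problems.append("expected exactly one parameter named 'call_id'")
    elif params[0].annotation not in (str, "str", inspect.Parameter.empty):
        problems.append("'call_id' should be annotated as str")
    return problems


# Stand-in entry point used only to demonstrate the check.
def create_submission(call_id: str):
    ...


print(check_entry_point(create_submission))  # prints []
```

In your own repository you would import the real function from `src/submission/create_submission.py` instead of defining the stand-in.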

### What is `Submission`
`Submission` is an abstract class that enforces the implementation of a run method for your solution. This is important because we expect your submission to have a method with that name, which accepts a str and returns a str. This method is how your crew will receive the question (prompt) and how we (the GDSC team) expect to get the answer.

Implementation of this class can be found in the [src/static/submission.py](src/static/submission.py)

```python
from abc import ABC, abstractmethod

class Submission(ABC):
    @abstractmethod
    def run(self, prompt: str) -> str:
        ...
```


As you can see, this file contains nothing more than the abstract class Submission. It serves as a useful interface that enforces your submissions to implement the run method.

It's worth mentioning that due to Python's dynamic typing, any object with a method named `run` that satisfies the signature will work. However, this abstract class is a good programming practice because it clearly defines the expected interface for your submissions.
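To illustrate what the abstract class buys you, here is a small self-contained sketch. It repeats the `Submission` definition so it can run on its own; `EchoSubmission` and `IncompleteSubmission` are hypothetical classes made up for this example.

```python
from abc import ABC, abstractmethod


# Local copy of the interface so the example runs on its own.
class Submission(ABC):
    @abstractmethod
    def run(self, prompt: str) -> str:
        ...


class EchoSubmission(Submission):
    """Hypothetical submission that implements run."""

    def run(self, prompt: str) -> str:
        return f"You asked: {prompt}"


class IncompleteSubmission(Submission):
    """Hypothetical submission that forgets to implement run."""


print(EchoSubmission().run("Hello"))  # prints: You asked: Hello

try:
    IncompleteSubmission()
except TypeError as error:
    # An abstract class cannot be instantiated until every abstract method is implemented.
    print(f"Rejected: {error}")
```

The missing `run` method is reported as soon as the class is instantiated, rather than later when the first question arrives.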

# Create your first application
Let's take a look at the `PythonHelpCrew` defined in the previous tutorial. The code can be found in the [src/submission/crews/python_help_crew.py](src/submission/crews/python_help_crew.py). The main difference now is that this `PythonHelpCrew` class inherits from `Submission`, which requires us to implement the `run` method. Since it was already defined in the previous tutorial, we're good to go.


```python
# ... code cut out

from src.static.submission import Submission


class PythonHelpCrew(Submission):  # PythonHelpCrew inherits from the Submission class
    def __init__(self, llm):
        self.llm = llm
    
    def run(self, prompt: str) -> str:
        return self.crew().kickoff(inputs={"prompt": prompt})

# ... code cut out
```


Now that our `PythonHelpCrew` class is ready, we can modify the `create_submission` function so that it returns an instance of our new class.
```python
from src.submission.crews.python_help_crew import PythonHelpCrew
from src.static.ChatBedrockWrapper import ChatBedrockWrapper
from src.static.submission import Submission

def create_submission(call_id: str) -> Submission:
    llm = ChatBedrockWrapper(
        model_id='anthropic.claude-3-haiku-20240307-v1:0',
        model_kwargs={'temperature': 0},
        call_id=call_id
    )
    crew = PythonHelpCrew(llm=llm)  # instantiate the new class
    return crew
```


One important point to note is the `ChatBedrockWrapper` class, which is used as the llm argument for your crew. This class is an extension of the [`ChatBedrock`](https://python.langchain.com/v0.2/docs/integrations/chat/bedrock/) class you used in the previous tutorial, and it handles communication with [AWS Bedrock](https://aws.amazon.com/bedrock/). The main difference is that it requires an additional argument, `call_id`, due to a technical requirement in this year's GDSC. The key point is that you **must** use this class instead of the standard ChatBedrock. This wrapper provides direct access to the number of tokens used by your submission. (See [src/static/ChatBedrockWrapper.py](src/static/ChatBedrockWrapper.py))

While you're not required to use crewAI, you **must** use this wrapper for all your LLM interactions, as it includes the implementation for token counting and cost tracking.

# How to test your code?
There are two ways to test your code:
- locally, by running the application on your own machine
- remotely, by pushing it to the `test_submission` branch

### Testing locally
Run the application locally and send a few test requests to the localhost URL. To start the application, run the following commands in the terminal.

#### Linux
```bash
# activate python venv if needed (assuming venv is the virtual environment directory)
source venv/bin/activate

# add the current directory to the python path
export PYTHONPATH="$PYTHONPATH:$(pwd)"

# run the application
python src/static/app.py
```

#### Windows
```bat
rem activate python venv if needed (assuming venv is the virtual environment directory)
venv\Scripts\activate

rem add the current directory to the python path
set PYTHONPATH=%PYTHONPATH%;%cd%

rem run the application
python src\static\app.py
```

#### Expected console output:
```
INFO:     Started server process [7956]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

Now we can see that our application runs on localhost. To test it, we need to send a POST request to the `/run` endpoint with a payload that contains the prompt. Here is an example of how to do that:

In [3]:
import requests

def ask_question(question: str, url: str):
    data = {'prompt': question}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, json=data, headers=headers)

    return response.json()

In [None]:
LOCAL_HOST = 'http://127.0.0.1:8000/run'

In [None]:
res = ask_question("How many students participated in the study?", LOCAL_HOST)
print(res['result'])

Here is a list of values returned by your submission:
- result - the answer returned by your crew when its `run` method is called.
- time - the time in seconds that your crew took to answer the question.
- timed_out - whether your submission timed out.
- tokens - the total number of tokens used across all agents' conversations to generate the final answer.
- cost - the cost incurred by your submission to produce the answer.
- token_details - a dictionary with more detailed token usage data. It shows how many prompt and completion tokens each of the models you selected used.
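The fields listed above can be condensed into a short report when you compare several runs. This is only a sketch: `summarize_response` is a hypothetical helper, and the sample dictionary reuses the values from an example response shown later in this tutorial.

```python
def summarize_response(res: dict) -> str:
    """Build a short text summary from the fields returned by the /run endpoint."""
    status = "timed out" if res["timed_out"] else "ok"
    lines = [f"{status} | {res['time']:.2f}s | {res['tokens']} tokens | ${res['cost']:.6f}"]
    # One extra line per model with its prompt and completion token counts.
    for model_id, usage in res["token_details"].items():
        lines.append(
            f"  {model_id}: {usage['prompt_tokens']} prompt + "
            f"{usage['completion_tokens']} completion"
        )
    return "\n".join(lines)


# Sample values copied from an example response in this tutorial.
sample = {
    "result": "The number of students who participated in the PIRLS study is 367575.",
    "time": 2.8910000000614673,
    "timed_out": False,
    "tokens": 917,
    "cost": 0.0003222500000000003,
    "token_details": {
        "anthropic.claude-3-haiku-20240307-v1:0": {"prompt_tokens": 824, "completion_tokens": 93}
    },
}

print(summarize_response(sample))
```

In the sample, the prompt and completion tokens (824 + 93) add up to the overall `tokens` value of 917.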

### Testing using AWS
You can push your code changes to the `test_submission` branch. This automatically starts a pipeline that builds a Docker image containing the code you just pushed and starts the app using the `app.py` script. Afterwards, open the [Elastic Container Service (ECS)](https://us-east-1.console.aws.amazon.com/ecs/v2/clusters?region=us-east-1) console, go to the clusters, select the gdsc cluster, select the test service, and open its task to see the public IP address.

[<img src="../images/t4_ecs_1.png"/>](../images/t4_ecs_1.png)

[<img src="../images/t4_ecs_2.png"/>](../images/t4_ecs_2.png)

Now that we have the IP address of our new submission, we can send requests to this endpoint.

In [4]:
REMOTE_HOST = 'http://100.24.19.134:8000/run'

In [None]:
res = ask_question("How many students participated in the study?", REMOTE_HOST)
print(res['result'])

# How to submit your code?
Submitting your code is similar to testing it using AWS. This time, however, you will be pushing your changes to the `submission` branch. Like the `test_submission` process, this starts the `app.py` script in the Docker instance, but it doesn't stop there. An automatic evaluation is run to test your submission, and you can monitor the evaluation status on your team's page. After a successful evaluation, your submission will be admitted to the competition.

# How does the evaluation work?
After you submit your code, it is tested using a few automatic evaluation questions. Your submission has to yield correct answers, and the response time has to be shorter than the specified timeout. If either of these conditions is not fulfilled for any of the automatic evaluation questions, your submission is not allowed to participate any further. Nonetheless, it still counts toward the total number of submissions your team has made.

We all know how unstable LLMs tend to be. This is why each automatic evaluation question is asked 3 times, and in the worst-case scenario, the total evaluation can take up to 1 hour! Be patient and you'll see your results on your team's page.
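You can rehearse this procedure before submitting. The sketch below asks each question several times and records which attempts passed. It rests on assumptions: the real evaluation questions, expected answers, timeout, and the rule for combining the 3 attempts are not published here, so `evaluate_locally`, the substring check, and the `fake_ask` stub are all hypothetical.

```python
from typing import Callable


def evaluate_locally(
    ask: Callable[[str], dict],
    expected: dict[str, str],
    repeats: int = 3,
) -> dict[str, list[bool]]:
    """Ask every question `repeats` times and record whether each attempt passed.

    An attempt passes when the response did not time out and the expected
    fragment appears in the answer (a hypothetical correctness check).
    """
    report = {}
    for question, expected_fragment in expected.items():
        attempts = []
        for _ in range(repeats):
            res = ask(question)
            passed = (not res["timed_out"]) and expected_fragment in str(res["result"])
            attempts.append(passed)
        report[question] = attempts
    return report


# Stub that stands in for a running application, so the example needs no server.
def fake_ask(question: str) -> dict:
    return {"result": "367575 students participated.", "timed_out": False}


report = evaluate_locally(fake_ask, {"How many students participated in the study?": "367575"})
print(report)
```

To run it against your own application, pass something like `lambda q: ask_question(q, LOCAL_HOST)` in place of `fake_ask`. Unstable answers then show up as a mix of `True` and `False` for the same question.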

# How to check the status of your application?
You can access the logs directly on ECS. Go to the test task and open the currently running application. Click on 'Logs' in the top bar. If you want to follow the logs in real time, click on the button that takes you to CloudWatch.

The first step is the same as [here](#testing-using-aws)

[<img src="../images/t4_ecs_1.png"/>](../images/t4_ecs_1.png)

[<img src="../images/t4_log_1.png"/>](../images/t4_log_1.png)

[<img src="../images/t4_log_2.png"/>](../images/t4_log_2.png)

Logs are only enabled for the testing tasks and the ECS task will automatically **shut down after 30 minutes**. This should be enough time to run your tests but if you need more time, you would have to make a dummy change to the code and push your changes to the **test_submission** branch again.

# What to do if the automatic evaluation fails?
If the automatic evaluation fails, it means that your new submission did not answer correctly, timed out, or ran into another unexpected error. The reason is displayed on the team's website:
- "Unexpected error" - this status means there were issues with starting the `app.py` script. Check your submission implementation and your `create_submission` function implementation.
- "Timeout" - this status means that your submission failed to answer at least one of the automatic evaluation questions in time.
- "Incorrect answers" - this status means that at least one of the answers was not correct.

If the displayed status is "In progress" for a suspiciously long period of time, that being a few hours, please contact GDSC organizers.
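Some start-up failures can be caught before you push. The following sketch instantiates a submission and calls `run` once, reporting the first exception it sees. `preflight`, the `"local-preflight"` call id, and the stub factories are hypothetical; whether the real `ChatBedrockWrapper` accepts an arbitrary `call_id` outside the evaluation environment is an assumption you would need to verify.

```python
def preflight(factory, prompt: str = "ping") -> str:
    """Instantiate a submission and call run once, reporting the first problem found."""
    try:
        submission = factory("local-preflight")  # hypothetical call_id value
    except Exception as error:
        return f"create_submission failed: {error!r}"
    try:
        submission.run(prompt)
    except Exception as error:
        return f"run failed: {error!r}"
    return "ok"


# Stubs that stand in for real submissions, so the example runs on its own.
class WorkingStub:
    def run(self, prompt: str) -> str:
        return "pong"


class CrashingStub:
    def run(self, prompt: str) -> str:
        raise ValueError("tool not configured")


def broken_factory(call_id: str):
    raise RuntimeError("missing credentials")


print(preflight(lambda call_id: WorkingStub()))   # prints: ok
print(preflight(lambda call_id: CrashingStub()))  # reports a failure inside run
print(preflight(broken_factory))                  # reports a failure inside the factory
```

With the real code you would pass your own `create_submission` function as `factory`. A passing pre-flight does not guarantee correct answers or acceptable response times; it only rules out crashes on start-up and on the first call.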

# Better agents
Although our first application worked, we did not receive the right answer. Our current solution has no connection to the database and cannot answer questions that require some insight into the data. Let's fix that and develop our first crew that can extend its knowledge and retrieve additional information! The code for this new crew can be found in [src/submission/crews/student_knowing_crew.py](src/submission/crews/student_knowing_crew.py)

As in the previous example, here we created an agent that can use a tool - this time it's the `query_database` tool. It allows us to query the database using the [sqlalchemy](https://www.sqlalchemy.org/) engine. Now our crew can access additional information!
In order to establish connection to the database we had to include a few more credentials. You've seen those in the very first tutorial regarding the PIRLS data.
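To give a feel for what such a tool does, here is a simplified, self-contained sketch. It deliberately swaps in Python's built-in `sqlite3` module and an in-memory toy table instead of the sqlalchemy engine and the PIRLS database, so it needs no credentials. The function body, the table, and its rows are made up for illustration and are not the code from `student_knowing_crew.py`.

```python
import sqlite3


def query_database(connection: sqlite3.Connection, query: str) -> str:
    """Run a SQL query and return the rows as text, or the error message on failure."""
    try:
        rows = connection.execute(query).fetchall()
    except sqlite3.Error as error:
        # Returning the error as text lets an agent read it and correct its query.
        return f"Query failed: {error}"
    return "\n".join(str(row) for row in rows)


# Toy table with made-up rows; the real PIRLS schema is not reproduced here.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE students (student_id INTEGER, country TEXT)")
connection.executemany("INSERT INTO students VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "A")])

print(query_database(connection, "SELECT COUNT(*) FROM students"))  # prints (3,)
print(query_database(connection, "SELECT * FROM missing_table"))
```

Returning a string in both the success and the failure case keeps the tool's output in a form the LLM can consume directly.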

Now let's try to run our application with this new solution. To do that, we need not only an LLM but also an instance of the sqlalchemy engine. This requires additional credentials, allowing our `StudentKnowingCrew` to connect to the database.

Let's implement all the necessary changes inside our `create_submission` function.

```python
import sqlalchemy
from src.submission.crews.student_knowing_crew import StudentKnowingCrew  # import the new crew
from src.static.ChatBedrockWrapper import ChatBedrockWrapper
from src.static.submission import Submission

def create_submission(call_id: str) -> Submission:
    llm = ChatBedrockWrapper(
        model_id='anthropic.claude-3-haiku-20240307-v1:0',
        model_kwargs={'temperature': 0},
        call_id=call_id
    )
    
    crew = StudentKnowingCrew(llm)  # instantiate the new crew
    return crew

```

Now let's run our app locally and see if it can answer questions about PIRLS data! ([see section "How to test your code - testing locally"](#testing-locally))

In [17]:
res = ask_question("How many students participated in the study?", LOCAL_HOST)
print(res['result'])

{
	'result': 'The number of students who participated in the PIRLS study is 367575.'
	'time': 2.8910000000614673
	'timed_out': False
	'tokens': 917
	'cost': 0.0003222500000000003
	'token_details': {'anthropic.claude-3-haiku-20240307-v1:0': {'prompt_tokens': 824, 'completion_tokens': 93}}
}


In [None]:
res = ask_question("How many countries participated in the study?", LOCAL_HOST)
print(res['result'])

As you can see, we finally get neat answers to real-world questions about PIRLS!

# Chatbot Arena
Evaluating LLM-based solutions is not a trivial task. There are no straightforward metrics, such as accuracy or F1 score, to easily compare different models. This is why most of the time, when dealing with LLM text output, human or semi-automated evaluation is used. This idea of human-based evaluation underlies the concept of the Chatbot Arena.

As a **human evaluator**, you can go to the [arena website](TODO:), choose a question from a list, and see how 2 random submissions responded to this question. Compare the results and decide which chatbot returned a better answer. Perhaps both are acceptable, or maybe both are complete nonsense? Below the text areas, select the appropriate verdict.

It's worth mentioning that everyone can evaluate questions, not just the people taking part in the GDSC.
It is also highly unlikely that you will evaluate the exact same battle twice.

[<img src="../images/t4_arena_1.png"/>](../images/t4_arena_1.png)

[<img src="../images/t4_arena_2.png"/>](../images/t4_arena_2.png)

[<img src="../images/t4_arena_3.png"/>](../images/t4_arena_3.png)

#### Why should I rank battles?
We need human evaluators to assess the quality of the returned answers. Your decision is counted as a win or loss for the competing submissions, and their ranking is updated based on your opinion. Because this is such an important step, we (the GDSC Team) have introduced some constraints on how many submissions your team can make. The first two submissions require no additional effort. However, if your team wants to add more submissions, it is required to rank a specific number of battles. This number starts low for the initial submissions and increases over time to a steady value of 50 ranked battles per submission.

This number is a total for your team and not a requirement for a single team member.

This system had to be implemented because having a large number of submissions requires a lot of battles for adequate ranking. Because there is almost no automatic evaluation in this year's GDSC edition, it's best to submit solutions that are robust and have a real chance of competing with others, rather than focusing on minor improvements.
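This tutorial does not state how a verdict is turned into a ranking change. As an illustration only, the sketch below uses the classic Elo update, a common choice for pairwise comparisons. The function names, the starting ratings, and the `k` factor are assumptions, and the actual GDSC ranking method may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new ratings after one battle.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie
    (for example when both answers are judged equally good or equally bad).
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two equally rated submissions; the evaluator picks A as the winner.
print(update_ratings(1000.0, 1000.0, 1.0))  # prints (1016.0, 984.0)
```

Under a scheme like this, each verdict moves ratings only a little, which is one way to see why many battles are needed before a ranking settles.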

# Human Evaluation Questions
Right now, on the arena website, you can see a bunch of predefined questions that can be asked to assess the quality of submitted crews. Every submission that passes the automatic evaluation is asked this set of questions, and the responses are stored in the database. You can, however, submit your own questions - specifically tough ones that your crew implementation excels at! This could help your submission win more battles and climb the rankings, as long as it can also handle the existing questions. We're seeking a general solution, not a highly specialized one.

Before adding a new question, be sure there is no similar question already in the list. This list will change over time as new, interesting questions pop up either from the GDSC Team's end or from you and other participants.

**All questions added by participants will be verified before being added to the list.**
