### **ArXiv Assistant Project - vLLM Inference Server Endpoint Testing**

This notebook runs simple inference requests against the vLLM inference server endpoint. It demonstrates how to interact with the deployed model through both the OpenAI-compatible client and direct HTTP requests, showing how the model handles arXiv-related queries and generates relevant responses.

The notebook is a practical tool for testing the deployed model's response quality, inference speed, token usage, and API functionality. It provides example prompts and response parsing, which are useful for both development and qualitative evaluation.

The notebook includes the following key components:

1. **Setup and Configuration**:
   - Imports necessary libraries such as openai and requests.
   - Sets up authentication by retrieving API tokens and endpoint URLs.

2. **OpenAI Client Interface**:
   - Demonstrates how to use the OpenAI Python client to interact with the vLLM endpoint.
   - Configures the client with the appropriate API key and base URL.

3. **Chat Completion Request**:
   - Implements a chat completion request using the OpenAI client.
   - Shows how to structure prompts and handle model parameters like temperature and max tokens.

4. **HTTP POST Request Implementation**:
   - Provides a custom function to send HTTP POST requests directly to the vLLM endpoint.
   - Demonstrates how to structure the payload and headers for the request.

5. **Response Handling and Parsing**:
   - Processes and displays the model's responses.
   - Extracts relevant information from the JSON responses.

6. **Example Queries and Outputs**:
   - Includes sample prompts related to arXiv papers and scientific topics.
   - Displays and analyzes the model's responses to these queries.

7. **Performance and Quality Assessment**:
   - Allows for quick testing and evaluation of the model's output quality and relevance.

Author: Amr Sherif  
Created Date: 2024-06-13  
Updated Date: 2024-09-30  
Version: 2.0

In [1]:
!pip install openai



In [None]:
from google.colab import userdata

vllmToken = userdata.get('vllm-arxiv_assistant')
vllmApiUrl = userdata.get('vllm-api-url')
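
The cell above only works inside Google Colab. A minimal sketch of a fallback for running the notebook elsewhere is shown below; the `load_secret` helper and the environment variable names `VLLM_API_TOKEN` and `VLLM_API_URL` are hypothetical and not part of the original notebook.

```python
import os


def load_secret(colab_name: str, env_name: str) -> str:
    """Return a secret from Colab userdata, or from an environment variable.

    Both the helper and the environment variable names are illustrative
    assumptions, not part of the original notebook.
    """
    try:
        # Available only inside Google Colab.
        from google.colab import userdata
    except ImportError:
        value = os.environ.get(env_name)
        if not value:
            raise RuntimeError(f"Set the {env_name} environment variable.")
        return value
    return userdata.get(colab_name)


# Example usage (the environment variable names are hypothetical):
# vllmToken = load_secret("vllm-arxiv_assistant", "VLLM_API_TOKEN")
# vllmApiUrl = load_secret("vllm-api-url", "VLLM_API_URL")
```

Keeping the Colab secret names unchanged means the rest of the notebook can stay as it is.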

#### Test the endpoint with OpenAI Client and Chat Template

In [2]:
prompt = """Answer the following question based on the text: What is the main point discussed? be clear, concise,
and structured.\nThe route requests reach the destination T, which \nconstructs the route replies; it calculates a MAC covering
the \nroute reply contents and returns the packet to S over the \nreverse of the route accumulated in the respective request
\npacket. The destination responds to one or more request \npackets of the same query, so that it provides the source
\nwith an as diverse topology picture as possible.4 The \nquerying node validates the replies and updates its topology \nview.
\nAs an illustrative example, consider the topology of \nFig.1, comprising ten nodes. S queries the network to \ndiscover
one or more routes to T. The nodes M1 and M2 are \ntwo malicious intermediate nodes. We denote the query \nrequest as a
list {QS,T;n1,n2,..,nk}, with QS,T denoting the SRP \nheader for a query searching for T and initiated by S. The ni,
\ni≠{1,k}, are the IP addresses of the traversed intermediate \nnodes and n1=S, nk=T. Similarly, the route reply is denoted
\nas {RS,T;n1,n2,..,nk}.We now consider a number of scenarios \nof possible security attacks by the two malicious nodes.
\n \n \n \n \n \n \n \n \n \nFigure 1. Example Topology: S wishes to discover a route \nto T in the presence of two malicious
nodes, M1 and M2. \n \nScenario 1: Consider the case that when M1 receives \n{QS,T;S}, it attempts to mislead S by generating
\n{RS,T;S,M1,T}. Not only would S accept such a reply, if a \nregular routing protocol were used, but it would most \nprobably
choose this fake route, since {S,M1,T} would have \nfewer hops than any other legitimate reply. It would also be \nreceived
with the least delay, because of the close distance \nbetween M1 and S. The requirement that the request reaches \n
\n4 The number of replies and the time-window the destination \nallocates for replies to a specific query are design
\nparameters. Moreover, the source could provide an indicator \nof the required diversity, so that T can regulate the number
\nof replies.   2 M1 5 \n3 M2 T \n6 S 1 4"""

In [18]:
from openai import OpenAI

openai_api_key = vllmToken
openai_api_base = vllmApiUrl

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="amrachraf/arxiv-assistant-merged_peft_model",
    top_p=0.95,
    temperature=0.9,
    max_tokens=1000,
    # stream=True,
    messages=[
        {"role": "user", "content": prompt},
    ],
    # vLLM-specific sampling parameters that are not part of the OpenAI schema
    extra_body={
        "min_tokens": 1,
        "skip_special_tokens": True,
        "top_k": 50,
    },
)
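
The commented-out `stream=True` option makes the client return an iterator of chunks instead of a single response. A minimal sketch of how those chunks could be assembled is shown below; `collect_stream` is a hypothetical helper, and it assumes each chunk follows the OpenAI streaming format, where the incremental text is in `choices[0].delta.content`.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        # Some chunks (for example a final usage chunk) carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # The first and last chunks may have no content.
        if delta:
            parts.append(delta)
    return "".join(parts)


# Example usage, assuming `client` and `prompt` from the cells above:
# stream = client.chat.completions.create(
#     model="amrachraf/arxiv-assistant-merged_peft_model",
#     messages=[{"role": "user", "content": prompt}],
#     max_tokens=1000,
#     stream=True,
# )
# print(collect_stream(stream))
```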

In [20]:
print(chat_response.choices[0].message.content)


The main point discussed in the text is a method for a querying node (S) to discover one or more routes to a destination (T) in a network, while preventing potential security attacks from malicious intermediate nodes (M1 and M2). The method involves route requests and replies, with the route replies calculated with a MAC (Message Authentication Code) to ensure their authenticity and integrity. The querying node validates the replies and updates its topology view. The text also provides an example topology and discusses possible scenarios of security attacks by the malicious nodes, such as attempting to mislead the source by generating fake routes with fewer hops and shorter delay. The number of replies and the time-window allocated for replies are design parameters that can be regulated by the source to ensure a diverse topology picture. 
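
For the inference speed and token usage checks mentioned in the introduction, a rough sketch is shown below. `summarize_usage` is a hypothetical helper; it assumes the request is timed on the client side (so network latency is included) and that the token counts come from the `usage` field of the response.

```python
def summarize_usage(prompt_tokens: int, completion_tokens: int, elapsed_s: float) -> dict:
    """Compute simple token and throughput statistics for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        # Generated tokens per second, including network latency.
        "completion_tokens_per_s": completion_tokens / elapsed_s,
    }


# Example usage, assuming `client` and `prompt` from the cells above:
# import time
# start = time.perf_counter()
# chat_response = client.chat.completions.create(
#     model="amrachraf/arxiv-assistant-merged_peft_model",
#     messages=[{"role": "user", "content": prompt}],
#     max_tokens=1000,
# )
# elapsed = time.perf_counter() - start
# usage = chat_response.usage
# print(summarize_usage(usage.prompt_tokens, usage.completion_tokens, elapsed))
```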


#### Test the endpoint with a POST request and Chat Template

In [None]:
import requests

def post_http_request(prompt: str,
                      api_url: str,
                      n: int = 1,
                      stream: bool = False) -> requests.Response:
    """Send a chat completion request directly to the vLLM endpoint."""
    headers = {"User-Agent": "Test Client", "Authorization": "Bearer " + openai_api_key}
    pload = {
        # Use the prompt argument; the server applies the model's chat template.
        "messages": [{"role": "user", "content": prompt}],
        "n": n,
        # "use_beam_search": True,
        "temperature": 0.9,
        "max_tokens": 1000,
        "min_tokens": 1,
        # "stream": stream,
        "model": "amrachraf/arxiv-assistant-merged_peft_model",
        "skip_special_tokens": True,
        "top_k": 50,
    }
    response = requests.post(api_url, headers=headers, json=pload)
    return response

In [None]:
response = post_http_request(prompt, openai_api_base + "/chat/completions")
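
Before reading fields out of the JSON body, it can help to check the HTTP status and pull out only the generated text. The sketch below assumes the endpoint returns the OpenAI chat completion format; `extract_chat_content` is a hypothetical helper, not part of the original notebook.

```python
def extract_chat_content(payload: dict) -> list:
    """Return the generated text of every choice in a chat completion payload."""
    choices = payload.get("choices")
    if not choices:
        # Error payloads usually have no "choices" key.
        raise ValueError(f"No choices in response: {payload}")
    return [choice["message"]["content"].strip() for choice in choices]


# Example usage, assuming `response` from the cell above:
# response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
# for text in extract_chat_content(response.json()):
#     print(text)
```

Returning a list keeps the helper usable when `n` is greater than 1 and the server returns several choices.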

In [None]:
response.json()

{'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': '\nThe main point discussed in the text is the process of route discovery and replies in a network, specifically focusing on the potential for malicious nodes to cause security attacks. The process involves a source node (S) querying the network for one or more routes to a destination node (T), with the destination node providing replies that cover the route contents and calculate a MAC to ensure their authenticity. The process is designed to provide the source node with a diverse topology picture.\n\nThe text then presents an example topology with ten nodes, including two malicious intermediate nodes (M1 and M2). It discusses several scenarios of possible security attacks, including the case where M1 attempts to mislead the source node (S) by generating a fake route reply with fewer hops and less delay, with the aim of being chosen as the preferred route by the source node. The text al