# Task
Create a chatbot API in Python using FastAPI and an open-source Mistral LLM.

## Set up the environment

### Subtask:
Install necessary libraries including `fastapi`, `uvicorn`, and libraries for loading and running the Mistral model (e.g., `transformers`, `torch`).


**Reasoning**:
Install the required libraries using pip.



In [None]:
!pip install fastapi uvicorn transformers torch

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

## Load the Mistral model

### Subtask:
Download and load the chosen open-source Mistral model using the appropriate libraries.


**Reasoning**:
Import the necessary classes and load the tokenizer and model for the specified Mistral model.



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

print(f"Model loaded onto device: {device}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-v0.1.
401 Client Error. (Request ID: Root=1-6894532d-0009fc7d4dfcedfd2987e8d3;c1d22e7f-f3b6-44c9-8e83-a5791d6bb34d)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-v0.1/resolve/main/config.json.
Access to model mistralai/Mistral-7B-v0.1 is restricted. You must have access to it and be authenticated to access it. Please log in.

**Reasoning**:
The previous attempt failed because the model is gated and requires authentication. I will try a different Mistral model from Hugging Face, `mistralai/Mistral-7B-Instruct-v0.2`, in case it is not gated, to proceed with the subtask of loading the model.



In [None]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

print(f"Model loaded onto device: {device}")

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.
401 Client Error. (Request ID: Root=1-68945334-343e465761b5e74e262bb8a1;77c955cb-c1fe-4673-a796-3c90fa700ad2)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Access to model mistralai/Mistral-7B-Instruct-v0.2 is restricted. You must have access to it and be authenticated to access it. Please log in.
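
Both attempts above fail with a 401 because the repositories are gated. Besides switching to an ungated model, one option is to authenticate. The following is a minimal sketch, assuming a token has been created at the settings page named in the warning, access has been granted on the model page, and the token is exposed through an `HF_TOKEN` environment variable. The helper names `resolve_hf_token` and `load_gated_model` are hypothetical and not part of the original notebook.

```python
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def load_gated_model(model_name, token):
    """Load a tokenizer and model from a gated repo using an access token."""
    # Imported lazily so the token check above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
    model = AutoModelForCausalLM.from_pretrained(model_name, token=token)
    return tokenizer, model


if resolve_hf_token() is None:
    print("HF_TOKEN is not set; requests to gated repos will be rejected.")
```

With a valid token, `load_gated_model("mistralai/Mistral-7B-Instruct-v0.2", resolve_hf_token())` would replace the unauthenticated `from_pretrained` calls.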

## Load the Mistral model

### Subtask:
Retry loading a Mistral model, this time using a non-gated, publicly available version.


**Reasoning**:
Attempt to load `HuggingFaceH4/zephyr-7b-beta`, a publicly available, non-gated fine-tune of Mistral-7B-v0.1, from Hugging Face.



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_name)

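device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load in half precision on GPU to roughly halve memory use; keep float32 on CPU
dtype = torch.float16 if device == 'cuda' else torch.float32
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)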

print(f"Model loaded onto device: {device}")

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/638 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
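
The shard sizes in the download log hint at how much memory the model needs once loaded. As a rough back-of-the-envelope sketch, the weight footprint can be estimated from the parameter count and the bytes per parameter. The figure of 7 billion parameters is an assumption (a round number for a "7B" model), and the estimate covers weights only, not activations or the generation cache.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for dtype_name, size in [("float32", 4), ("float16", 2)]:
    print(f"{dtype_name}: {weight_memory_gib(NUM_PARAMS, size):.1f} GiB")
# float32: 26.1 GiB
# float16: 13.0 GiB
```

`from_pretrained` accepts a `torch_dtype` argument (for example `torch.float16`) to load the weights in half precision, which matters on GPUs with limited memory.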

## Create FastAPI app and endpoint

### Subtask:
Initialize a FastAPI application and define a POST endpoint to receive user messages.


**Reasoning**:
The subtask is to initialize a FastAPI application and define a POST endpoint. This involves importing FastAPI, creating an app instance, and defining a route with an asynchronous function that returns a placeholder response. These steps can be completed in a single code block.



In [None]:
from fastapi import FastAPI

app = FastAPI()

@app.post("/chat")
async def chat_endpoint(message: str):
    return {"response": "Received your message."}
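
A bare `str` parameter such as `message` above is read by FastAPI from the query string, not from the request body. If the message should travel as JSON in the POST body, a Pydantic model can describe it. This is a minimal sketch, assuming a hypothetical `ChatRequest` model and a hypothetical `create_app` factory, neither of which is part of the original notebook.

```python
def create_app():
    """Build a FastAPI app whose /chat endpoint reads the message from a JSON body."""
    # Imported inside the factory so the sketch can be imported without FastAPI installed.
    from fastapi import FastAPI
    from pydantic import BaseModel

    class ChatRequest(BaseModel):
        message: str  # sent by the client as {"message": "..."}

    app = FastAPI()

    @app.post("/chat")
    async def chat_endpoint(request: ChatRequest):
        # Placeholder response, mirroring the endpoint above.
        return {"response": f"Received your message: {request.message}"}

    return app
```

With this variant a client would POST `{"message": "Hello"}` with a JSON content type, and FastAPI would reject bodies that lack the `message` field with a 422 response.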

## Implement chatbot logic

### Subtask:
Write code within the endpoint to process the incoming message, interact with the loaded Mistral model to generate a response, and return the response.


**Reasoning**:
Replace the placeholder in the chat endpoint with the core logic: build a prompt from the incoming message, generate a response with the loaded model, and return the generated text.



In [None]:
from fastapi import FastAPI, HTTPException

# Re-create the app so this handler replaces the placeholder /chat route;
# if the same path is registered twice, the first handler keeps answering.
app = FastAPI()

@app.post("/chat")
async def chat_endpoint(message: str):
    try:
        # Build the prompt in the Zephyr-7B-beta chat format, where each turn ends with </s>
        prompt = (
            "<|system|>\nYou are a friendly chatbot.</s>\n"
            f"<|user|>\n{message}</s>\n"
            "<|assistant|>\n"
        )

        # Encode the prompt into tensors on the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        # Generate a response; **inputs passes the attention mask along with input_ids
        with torch.no_grad():
            output_sequences = model.generate(
                **inputs,
                max_new_tokens=200,  # Limit the response length (excluding the prompt)
                pad_token_id=tokenizer.eos_token_id,  # Set padding token
                no_repeat_ngram_size=2,  # Prevent repeating ngrams
                do_sample=True,  # Enable sampling
                top_k=50,  # Sample from top 50 tokens
                top_p=0.95,  # Use nucleus sampling
                temperature=0.7,  # Set temperature
            )

        # Keep only the newly generated tokens and decode them into text
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(
            output_sequences[0][prompt_length:], skip_special_tokens=True
        ).strip()

        # Return the extracted response
        return {"response": response}

    except Exception as e:
        # Report failures as a server error instead of a normal response
        raise HTTPException(status_code=500, detail=f"An error occurred: {e}")
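
The prompt construction can also be factored into a small helper that handles any number of turns. This is a sketch assuming the Zephyr-7B-beta turn format (a `<|role|>` tag, a newline, the content, then `</s>`); the function name `build_zephyr_prompt` is hypothetical. In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` builds the prompt from the model's own template and avoids hard-coding the format.

```python
def build_zephyr_prompt(messages):
    """Format a list of {"role", "content"} dicts as a Zephyr-style prompt."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    # The trailing assistant tag asks the model to produce the next turn.
    return "".join(parts) + "<|assistant|>\n"


prompt = build_zephyr_prompt([
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Passing earlier user and assistant turns in `messages` would give the model conversation history, which the single-message endpoint does not keep.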


## Run the FastAPI app

### Subtask:
Add code to run the FastAPI development server using uvicorn.


**Reasoning**:
Add the code to run the FastAPI application using uvicorn within an `if __name__ == "__main__":` block.



In [None]:
import uvicorn

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

RuntimeError: asyncio.run() cannot be called from a running event loop

**Reasoning**:
The error `asyncio.run() cannot be called from a running event loop` occurs because uvicorn tries to start its own event loop while one is already running, which is always the case in an interactive environment such as a Jupyter notebook. Possible fixes are to run uvicorn in a separate process or thread, or to use `nest_asyncio` to allow nested event loops. The simplest approach for development in a notebook is `nest_asyncio`.
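
For comparison, the thread-based alternative can be sketched as follows; the next cell takes the `nest_asyncio` route instead. The helper names `run_in_background` and `start_uvicorn_in_thread` are hypothetical. The sketch relies on `uvicorn.Config` and `uvicorn.Server`, whose `run()` method starts a fresh event loop in whichever thread calls it.

```python
import threading


def run_in_background(target):
    """Start `target` in a daemon thread and return the thread."""
    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread


def start_uvicorn_in_thread(app, host="0.0.0.0", port=8000):
    """Serve `app` with uvicorn without blocking the notebook cell."""
    # Imported lazily so run_in_background can be used on its own.
    import uvicorn

    server = uvicorn.Server(uvicorn.Config(app, host=host, port=port))
    # server.run() creates its own event loop inside the new thread,
    # so it does not clash with the notebook's running loop.
    thread = run_in_background(server.run)
    return server, thread
```

Calling `server, thread = start_uvicorn_in_thread(app)` returns immediately, so later cells can still run; setting `server.should_exit = True` asks the server to shut down.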



In [None]:
import nest_asyncio
import uvicorn

nest_asyncio.apply()

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

INFO:     Started server process [5518]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
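
Once the server is up, the endpoint can be exercised from a separate process (the cell above blocks while the server runs). This is a minimal client sketch using only the standard library. It assumes the server is reachable at `http://localhost:8000` and that `/chat` takes `message` as a query parameter, as in the endpoint signature above; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_chat_request(message, base_url="http://localhost:8000"):
    """Build a POST request for /chat with `message` as a query parameter."""
    query = urllib.parse.urlencode({"message": message})
    return urllib.request.Request(f"{base_url}/chat?{query}", method="POST")


def send_chat(message, base_url="http://localhost:8000"):
    """Send the message to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_chat_request(message, base_url)) as reply:
        return json.load(reply)


request = build_chat_request("Hello there")
print(request.get_method(), request.full_url)
# POST http://localhost:8000/chat?message=Hello+there
```

With the server running, `send_chat("Hello there")` would return a dictionary with a `response` key.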
