## Importing libraries

1) smolagents - a library for building agents. We import the following classes:
   - `HfApiModel` - initializes the agent's brain, the Qwen2.5 model;
   - `ToolCallingAgent` - connects the brain to the tools;
   - `DuckDuckGoSearchTool` - a built-in tool that lets the model search the web for the information it needs;
   - `tool` - used as a decorator for custom tools;
   - `tools.Tool` - the base `Tool` class, used to write a separate tool so that the model produces a proper final answer and does not get stuck in a loop;

2) langchain:
    - `HuggingFaceEmbeddings` - provides ready-made embeddings for text;
    - `InMemoryVectorStore` - an in-memory vector store, used for fast search over the requested embeddings;
    - `RecursiveCharacterTextSplitter` - splits our PDF file into documents so that the file can then be vectorized;

In [49]:
from smolagents import ToolCallingAgent, HfApiModel, DuckDuckGoSearchTool, tool
from smolagents.tools import Tool

from transformers import pipeline, MarianMTModel, MarianTokenizer
from typing import Any, Optional, List, Dict

# For Jupyter
from IPython.display import clear_output
clear_output(wait=True)

from langchain_huggingface import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents.base import Document


# The next two imports are used to get the current time
import datetime
import pytz

# The initial prompt is stored in a YAML file
import yaml

import os

# For HTTP requests and parsing
import requests
from xml.etree import ElementTree as ET
import time
import re
# from PyPDF2 import PdfReader


# A Hugging Face token is required to load any of the models below.
# Read it from the environment rather than hard-coding a secret in the notebook.
TOKEN = os.environ["HF_TOKEN"]  # raises KeyError if the variable is not set

## Defining the tools

### Loading the required models

In [50]:
# Model from Hugging Face: Qwen2.5, the agent's brain
model = HfApiModel(
    max_tokens=5000,
    temperature=0.5,
    model_id='Qwen/Qwen2.5-Coder-32B-Instruct',
    custom_role_conversions=None
)

# Initialize the embeddings used for similarity search
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Model that summarizes text
# We use device='cpu' to avoid conflicts when running on other laptops
# cuda (Windows) or mps (macOS) would be faster, but not everyone has cuda available :(
summarizer = pipeline(task="summarization", model="facebook/bart-large-cnn", device='cpu')

# Splitter for our file. It splits the file into documents so that they can be vectorized
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)


# Initialize the model that translates from Russian to English
model_from_russian_to_english = "Helsinki-NLP/opus-mt-ru-en"
tokenizer_russian = MarianTokenizer.from_pretrained(model_from_russian_to_english)
model_russian = MarianMTModel.from_pretrained(model_from_russian_to_english)

# Initialize the model that translates from English to Russian
model_from_english_to_russian = "Helsinki-NLP/opus-mt-en-ru"
tokenizer_english = MarianTokenizer.from_pretrained(model_from_english_to_russian)
model_english = MarianMTModel.from_pretrained(model_from_english_to_russian)

Device set to use cpu
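
To make `chunk_size=500` and `chunk_overlap=100` more concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a deliberate simplification and not what `RecursiveCharacterTextSplitter` actually does: the real splitter first tries to break the text on the separators listed above and only falls back to finer splits when a piece is still too long. The function name `chunk_with_overlap` and the sample text are hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means the last 100 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.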


### Next, we write the tools themselves and the helper functions

A tool is defined with the `@tool` decorator.

#### Downloading a PDF file for further processing

In [8]:
def download_pdf(pdf_url: str, save_path: str) -> None:
    """
    Download a PDF file from a given URL and save it to the specified path.
    
    Args:
        pdf_url (str): The URL of the PDF file to download.
        save_path (str): The file path where the PDF will be saved.
    
    Raises:
        ValueError: If the response from the server is not successful (status code not 200).
    """
    response = requests.get(pdf_url, timeout=30)  # a timeout keeps a stalled download from hanging forever
    if response.status_code == 200:
        with open(save_path, 'wb') as f:
            f.write(response.content)
    else:
        raise ValueError(f"Could not download PDF: {response.status_code}")

##### Example usage:

```python
pdf_url = "http://arxiv.org/pdf/1909.03550v1"
pdf_filename = pdf_url.split("/")[-1] + ".pdf" # 1909.03550v1.pdf
pdf_path = os.path.join("downloads", pdf_filename)
download_pdf(pdf_url, pdf_path)
```

#### Extracting text from the PDF file and returning it as a vector store

In [31]:
def from_pdf_to_vector(pdf_path: str) -> InMemoryVectorStore:
    """
    Extract text from a given PDF file and convert it into a vector store.

    This function loads text from a PDF file specified by `pdf_path`, splits the text
    into smaller documents, and creates an in-memory vector store using those documents.

    Args:
        pdf_path (str): The path to the PDF file from which to extract text.

    Returns:
        InMemoryVectorStore: An in-memory vector store populated with 
        vectorized representations of the extracted documents.

    Raises:
        FileNotFoundError: If the specified PDF file does not exist.
        ValueError: If the PDF file cannot be processed.
    """
    text_loader = PyPDFLoader(pdf_path)
    text_contents = [doc.page_content for doc in text_loader.load()]

    split_documents = splitter.create_documents(text_contents)
    vector_store = InMemoryVectorStore.from_documents(split_documents, embedding_model)
    
    return vector_store

#### Example usage:

```python
pdf_path = os.path.join("downloads", "1910.14537v3.pdf")
vector = from_pdf_to_vector(pdf_path)
vector.similarity_search("Gaussian-masked Directional Transformer.")
```
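
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with plain Python lists and cosine similarity. It is not the `InMemoryVectorStore` implementation: the helper names `cosine_similarity` and `top_k` and the three-dimensional toy vectors are hypothetical, chosen only to keep the example readable.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[str]:
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical toy store: (chunk text, embedding) pairs
toy_store = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about optimizers", [0.1, 0.9, 0.0]),
    ("chunk about datasets", [0.0, 0.2, 0.9]),
]
best = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(best)
```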

### Summarizing the excerpts retrieved from the documents

In [35]:
def summarize_text_article(documents: List[Document],
                  percent_to_keep: float = 0.2) -> str:
    """
    Generate a summary of the given documents.

    This function extracts text from a list of Document objects, 
    and uses a summarization model to generate a concise summary 
    while keeping a specified percentage of the original text. 
    The percentage to keep can be defined through the 
    `percent_to_keep` parameter, allowing for control over 
    the length of the summary.

    Args:
        documents (List[Document]): A list of Document objects 
            containing the text to be summarized.
        percent_to_keep (float, optional): The fraction of the original 
            word count to retain in the summary (default is 0.2, which 
            means 20%). The resulting target length is clamped to 
            the range [30, 1024].

    Note:
        The summarization model is not a parameter: the function uses 
        the module-level `summarizer` pipeline defined earlier.

    Returns:
        str: The generated summary of the documents, which reflects 
        the most important points based on the provided content.
    """
    
    context = "\n".join([re.sub(r'\s+', ' ', doc.page_content).strip() for doc in documents])
    original_length = len(context.split())
    target_length = int(original_length * percent_to_keep)
    target_length = max(30, min(target_length, 1024))

    summary = summarizer(context, max_length=target_length,
                         min_length=max(30, int(target_length * 0.8)),
                         do_sample=False)
    return summary[0]["summary_text"]

#### Example usage

```python
pdf_path = os.path.join("downloads", "1910.14537v3.pdf")
vector = from_pdf_to_vector(pdf_path)
documents = vector.similarity_search("Gaussian-masked Directional Transformer.")
summarize_text_article(documents, 0.2)
```
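
The length limits passed to the summarizer follow from simple arithmetic on the word count. The sketch below isolates that calculation so it can be checked without loading any model; the helper name `summary_length_bounds` is hypothetical, and the numbers mirror the ones used in `summarize_text_article`. Note that the pipeline interprets `max_length` and `min_length` in tokens, while the input is measured in whitespace-separated words, so the 20% ratio is only approximate.

```python
from typing import Tuple


def summary_length_bounds(word_count: int, percent_to_keep: float = 0.2) -> Tuple[int, int]:
    """Return (min_length, max_length) the way summarize_text_article computes them."""
    target_length = int(word_count * percent_to_keep)
    target_length = max(30, min(target_length, 1024))  # clamp to [30, 1024]
    min_length = max(30, int(target_length * 0.8))     # at least 80% of the target, never below 30
    return min_length, target_length


for words in (50, 1000, 10000):
    print(words, summary_length_bounds(words))
```

For short inputs the lower clamp dominates, so both bounds collapse to 30; for very long inputs the upper clamp of 1024 applies regardless of `percent_to_keep`.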

### Writing a tool that searches for articles by query and returns model-generated summaries



In [93]:
@tool
def search_arxiv(query: str, max_results: int, summarize: bool) -> List:
    """
    A tool that searches for articles on arXiv based on a query and optionally generates summaries.

    This function queries the arXiv API to retrieve scholarly articles matching the provided search query. 
    It supports downloading PDFs of the articles, converting them into vector representations, and generating 
    summaries if requested. The function is designed to handle errors gracefully, including retries for network 
    issues and validation of input parameters.

    Args:
        query: The search query for finding articles on arXiv. This can be a topic, keyword, or phrase 
                     (e.g., "quantum computing", "machine learning applications").
        max_results: The maximum number of articles to return. Must be between 1 and 5 (inclusive). 
                           This limit ensures manageable processing times and avoids overloading the system.
        summarize: A flag indicating whether to generate summaries for the retrieved articles. If True, 
                          the function will process the downloaded PDFs to extract and summarize key information.

    Returns:
        List[Dict]: A list of dictionaries, where each dictionary contains the following keys:
            - 'title' (str): The title of the article.
            - 'pdf_url' (str): The URL of the article's PDF on arXiv.
            - 'local_pdf_path' (str): The local file path where the PDF has been downloaded.
            - 'summary' (str): A summary of the article's content (if `summarize` is True); otherwise, an empty string.

    Error handling:
        Invalid input and arXiv request failures do not raise. Because the agent consumes the result,
        an error message string is returned instead of the list in two cases:
        - `max_results` is less than 1 or greater than 5;
        - the arXiv API request still fails after 5 attempts.

    Notes:
        - The function uses exponential backoff for retrying failed requests to the arXiv API.
        - PDFs are downloaded into a local "downloads" directory, which is created if it does not already exist.
        - Summaries are generated using similarity search over vectorized PDF content, followed by text summarization techniques.
        - If a PDF download fails, the corresponding article is skipped, and the function proceeds with the next result.

    Example Usage:
        # Search for 3 articles on "neural networks" and generate summaries
        results = search_arxiv(query="neural networks", max_results=3, summarize=True)
        for result in results:
            print(f"Title: {result['title']}")
            print(f"PDF URL: {result['pdf_url']}")
            print(f"Summary: {result['summary']}\n")

        # Invalid input returns an error message instead of raising
        results = search_arxiv(query="attention in machine learning", max_results=10, summarize=True)
        print(results)  # Output: max_results must be between 1 and 5.
    """
    print("In process of finding articles...")
    if max_results > 5 or max_results < 1:
        return "max_results must be between 1 and 5."

    base_url = "http://export.arxiv.org/api/query"
    params = {
        "search_query": query,
        "start": 0,
        "max_results": max_results
    }

    for attempt in range(5):
        try:
            response = requests.get(base_url, params=params)
            response.raise_for_status()
            break
        except (requests.exceptions.RequestException, ConnectionResetError) as e:
            if attempt < 4:
                time.sleep(2 ** attempt)
                continue
            return f"Error fetching data from arXiv after {attempt + 1} attempts: {e}"

    root = ET.fromstring(response.content)
    entries = []
    os.makedirs("downloads", exist_ok=True)

    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
        title = entry.find('{http://www.w3.org/2005/Atom}title').text
        pdf_url = entry.find('{http://www.w3.org/2005/Atom}link[@title="pdf"]').attrib['href']

        pdf_filename = pdf_url.split("/")[-1] + ".pdf"
        pdf_path = os.path.join("downloads", pdf_filename)
        try:
            download_pdf(pdf_url, pdf_path)
        except Exception as e:
            print(f"Failed to download PDF: {e}")
            continue
        
        vector = from_pdf_to_vector(pdf_path)
        documents = vector.similarity_search(query)

        summary_text = summarize_text_article(documents) if summarize else ""

        entries.append({
            'title': title,
            'pdf_url': pdf_url,
            'local_pdf_path': pdf_path,
            'summary': summary_text
        })

    return entries

#### Example usage

```python
search_arxiv("Attention in Machine Learning", 2, True)
```
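
The parsing step inside the tool can be tried offline on a small Atom document. The sketch below uses a made-up feed (the titles and the `0000.0000x` identifiers are placeholders, not real arXiv entries) and a hypothetical helper `parse_entries`. Unlike the tool above, it skips entries that have no PDF link instead of failing on them, and it collapses the line breaks that can appear inside titles.

```python
from typing import Dict, List
from xml.etree import ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace prefix used by ElementTree

# Made-up feed for illustration only
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A  Made-Up
      Paper Title</title>
    <link href="http://arxiv.org/abs/0000.00001v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00001v1" rel="related" type="application/pdf"/>
  </entry>
  <entry>
    <title>Entry Without A PDF Link</title>
    <link href="http://arxiv.org/abs/0000.00002v1" rel="alternate" type="text/html"/>
  </entry>
</feed>"""


def parse_entries(xml_text: str) -> List[Dict[str, str]]:
    """Extract title and PDF URL from every entry that has both."""
    root = ET.fromstring(xml_text)
    parsed = []
    for entry in root.findall(f"{ATOM}entry"):
        title_el = entry.find(f"{ATOM}title")
        pdf_el = entry.find(f'{ATOM}link[@title="pdf"]')
        if title_el is None or pdf_el is None:
            continue  # skip entries that lack a title or a PDF link
        title = " ".join((title_el.text or "").split())  # collapse whitespace and line breaks
        parsed.append({"title": title, "pdf_url": pdf_el.attrib["href"]})
    return parsed


entries = parse_entries(SAMPLE_FEED)
print(entries)
```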

### The next function is a test tool for checking that the agent behaves sanely

In [94]:
@tool
def get_current_time_in_timezone(timezone: str) -> str:
    """
    A tool that fetches the current local time in a specified timezone.
    
    Args:
        timezone: A string representing a valid timezone (e.g., 'America/New_York').
    """
    try:
        tz = pytz.timezone(timezone)
        local_time = datetime.datetime.now(tz).strftime("%Y-%m-%d %H:%M:%S")
        return f"The current local time in {timezone} is: {local_time}"
    except Exception as e:
        return f"Error fetching time for timezone '{timezone}': {str(e)}"

#### Example usage

```python
get_current_time_in_timezone("Europe/London")
```

### Defining the class that handles the final answer

In [95]:
class FinalAnswerTool(Tool):
    name = "final_answer"
    description = "Provides a final answer to the given problem."
    inputs = {'answer': {'type': 'any', 'description': 'The final answer to the problem'}}
    output_type = "any"

    def forward(self, answer: Any) -> Any:
        return answer

    def __init__(self, *args, **kwargs):
        self.is_initialized = False

### Reading the prepared file that gives the agent's brain (Qwen2.5) the instructions it needs about our functionality

In [107]:
try:
    with open("prompts.yaml", 'r') as stream:
        prompt_templates = yaml.safe_load(stream)
except FileNotFoundError:
    prompt_templates = {}  # If the file is not found, use an empty dictionary

final_answer = FinalAnswerTool()
web_search = DuckDuckGoSearchTool()
prompt_templates.get('system_prompt', '')[:100]  # .get avoids a KeyError when the file was missing

'You are an expert assistant who can solve any task by using the available tools.\nTo solve the task, '

### Initializing our agent

In [108]:
agent = ToolCallingAgent(
    model=model,
    tools=[final_answer, web_search, get_current_time_in_timezone, search_arxiv],
    max_steps=5,
    verbosity_level=1,
    grammar=None,
    planning_interval=None,
    name="AndrewSolver",
    description=None,
    prompt_templates=prompt_templates
)

### Testing

In [121]:
results = []
for i in range(1):
    message = input()
    try:
        result = agent.run(message)
        results.append(result)
    except Exception as e:
        results.append(f"An error occurred: {e}")

 Send me 3 articles about attention. Also make your own summary and send me some link


In process of finding articles...


In [124]:
text = results[0].replace("\\n", "\n").replace('*', '')

print(text)

Here are 3 articles about attention along with summaries and links:
1. Attention: Theory, Principles, Models and Applications
   Summary: This article discusses theories of attention from which models and principles are derived, focusing on a descriptive model of human attention and performance. It explores how principles of attention can be applied in design to remediate attention-related problems.
   Link: https://www.researchgate.net/publication/349082014_Attention_Theory_Principles_Models_and_Applications \\[PDF]
2. Attention: Multiple types, brain resonances, psychological functions and conscious states
   Summary: This article delves into neural models of attention, explaining how brain processes of consciousness, learning, expectation, and attention interact. It highlights the role of brain resonances in different types of attention and their impact on conscious states.
   Link: https://www.researchgate.net/publication/350586329_Attention_Multiple_types_brain_resonances_psycholo