# 3-better-slack-researcher
> A collaborator has hundreds of Slack messages that have citations to research papers.

Let's see if we can't incrementally improve on the naive approach.

We need a few pip installs to set up the environment:

```bash
python3 -m venv crewai-venv
source crewai-venv/bin/activate
pip install ipykernel "crewai[tools]" python-dotenv langchain-community pymupdf pypdf
```

In [None]:
# Import the necessary libraries
from crewai import Agent, Crew, Task, Process
from crewai_tools import WebsiteSearchTool
from pydantic import Field, BaseModel
from crewai.tools import BaseTool
import requests

from langchain_community.document_loaders import PyPDFLoader

## Creating our first Crew AI Crew

Let's make a Crew whose purpose is to find, download, and summarize the research paper behind a link.

## Instantiate the tools
We'll use the [WebsiteSearchTool](https://docs.crewai.com/tools/websitesearchtool) to navigate and extract information from a website based on semantic search.

In [3]:
# Create the WebsiteSearchTool
website_search_tool = WebsiteSearchTool()

### Create a Custom PDFDownloaderTool
What if we...really did want to download the PDF? Well, turns out we can make our own custom tools.

See more here: [Custom Tools](https://docs.crewai.com/how-to/create-custom-tools)

You could also do this directly using the PyPDFLoader from LangChain, but you would need to do some additional work to download the file first. CrewAI also integrates with the large catalogs of both [LangChain tools](https://docs.crewai.com/concepts/langchain-tools) and [LlamaIndex tools](https://docs.crewai.com/concepts/llamaindex-tools).

In [4]:
class PDFDownloaderTool(BaseTool):
    name: str = "PDFDownloader"
    description: str = "Useful for downloading PDF files from a given URL and returning the path and content"

    def _run(self, uri: str) -> str:
        """Download the PDF at the given URL and return its local path and extracted text"""
        try:
            # A timeout keeps the agent from hanging on an unresponsive server
            response = requests.get(uri, timeout=30)

            if response.status_code != 200:
                return f"Failed to download file: {response.status_code}"

            # Get the end part of the uri as the file name
            file_name = uri.split('/')[-1]

            # If the file name already has a .pdf extension, use it; otherwise add it
            file_path = file_name if file_name.endswith('.pdf') else file_name + '.pdf'

            with open(file_path, 'wb') as file:
                file.write(response.content)

            # Use PyPDFLoader to extract text from the downloaded PDF
            pdf_loader = PyPDFLoader(file_path=file_path,
                                     extract_images=True,
                                     mode="single")
            documents = pdf_loader.load()

            # Return a string, as the signature promises: the path first, then the text
            text_content = "\n".join(doc.page_content for doc in documents)
            return f"file_path: {file_path}\ncontent:\n{text_content}"
        except Exception as e:
            return f"Error downloading PDF: {str(e)}"
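
One fragile spot in the tool is the file name: `uri.split('/')[-1]` keeps any query string and returns an empty name when the URL ends in a slash. Here is a minimal sketch of a sturdier alternative. The helper name `pdf_file_name` and the `download.pdf` fallback are assumptions for illustration, not part of CrewAI or LangChain.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pdf_file_name(uri: str, default: str = "download.pdf") -> str:
    """Derive a local .pdf file name from a URL (hypothetical helper)."""
    # Look only at the URL path, so "?download=1" or "#page=2" never reach the name
    path = urlparse(uri).path
    # .name returns the last path segment and ignores a trailing slash
    name = PurePosixPath(path).name
    if not name:
        return default
    # Keep an existing .pdf extension (case-insensitive); otherwise append one
    return name if name.lower().endswith(".pdf") else name + ".pdf"


print(pdf_file_name("https://arxiv.org/pdf/2501.13533"))
print(pdf_file_name("https://example.com/papers/report.pdf?download=1"))
```

Inside `_run`, the two file-name lines could be swapped for a single call to a helper like this.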

## Formatting the output
We can also use a Pydantic model as the output format if we want a structured result.

In [5]:
class ForumListOutput(BaseModel):
    title: str = Field(description="The title of the forum post")
    file_path: str = Field(description="The path to the downloaded PDF file")
    abstract: str = Field(description="The abstract of the paper")
    summary: str = Field(description="A paragraph summarizing the most important points of the paper to an expert in the field")
    link: str = Field(description="The URL link to the pdf file")
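
To see what the schema buys us, here is a small sketch that builds the model by hand, outside of any crew. The field values are placeholders invented for illustration (only the link comes from this notebook), and the model is repeated so the snippet runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class ForumListOutput(BaseModel):
    title: str = Field(description="The title of the forum post")
    file_path: str = Field(description="The path to the downloaded PDF file")
    abstract: str = Field(description="The abstract of the paper")
    summary: str = Field(description="A paragraph summarizing the most important points of the paper to an expert in the field")
    link: str = Field(description="The URL link to the pdf file")


# Placeholder values, not real results
example = {
    "title": "Example paper title",
    "file_path": "2501.13533.pdf",
    "abstract": "(abstract text)",
    "summary": "(summary text)",
    "link": "https://arxiv.org/pdf/2501.13533",
}
record = ForumListOutput(**example)
print(record.link)

# Leaving fields out fails loudly instead of producing a half-filled record
try:
    ForumListOutput(title="Only a title")
except ValidationError as err:
    missing = sorted(str(e["loc"][0]) for e in err.errors())
    print("Missing fields:", missing)
```

The same validation is what makes a schema useful downstream: every record either has all five fields or is rejected.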

## Define the Agent
Visit https://docs.crewai.com/concepts/agents#direct-code-definition to learn more, including the full list of agent options.
Defining an agent mostly comes down to describing its role, specifically:
* The role title
* The goal of the agent
* The backstory of the agent



In [6]:
# Create an agent, setting the most commonly used parameters
archivist_agent = Agent(
    role="Modern Research Archivist",
    goal="Find and download the latest research articles, blogs, and papers based on the user's query",
    backstory="You are an experienced research archivist who meticulously collects and organizes research papers, articles, and other online media. "
              "You are an expert in using Arxiv, extracting technical writing from blogs and websites, and tracking down references made on twitter and other social media platforms.",
    llm="gpt-4o-mini",  # Default: OPENAI_MODEL_NAME or "gpt-4"
    function_calling_llm=None,  # Optional: Separate LLM for tool calling
    verbose=True,  # *******Default: False
    tools=[website_search_tool, PDFDownloaderTool()],  # *******Optional: List of tools
)
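
Since the agent uses `gpt-4o-mini`, it will need credentials at run time. Below is a rough pre-flight check, assuming the key is supplied through the `OPENAI_API_KEY` environment variable (for example via a `.env` file loaded with python-dotenv). The helper name `missing_env_vars` is made up for this sketch.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in required if not env.get(name)]


# Assumption: the OpenAI-hosted model reads its key from OPENAI_API_KEY
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these before running the crew:", ", ".join(missing))
```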

## Define the task
Learn more here: [Task](https://docs.crewai.com/concepts/tasks)

The purpose of a task is to define the assignment that an agent will complete. Minimally, you need a description of the task to be completed, the expected output, and the agent that is responsible for completing the task.

In [15]:
#%% Define the task
main_task = Task(
    description="Find and download the research articles, blog, paper, or other online technical writing based on the information at the following website URL: {provided_link}",
    expected_output="A paragraph summarizing the most important points of the paper to an expert in the field and a link to the paper",
    agent=archivist_agent,
)

#%% Define a formatting task to separate the output formatting from the main task
formatting_task = Task(
    description="Format the output of the main task based on a pydantic schema to report the results of the fetch, summarization, and download",
    expected_output="An object with the following fields: title, file_path, abstract, summary, link",
    agent=archivist_agent,
    output_pydantic=ForumListOutput  # ******* Note that we have added this pydantic output to the task
)
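
The `{provided_link}` placeholder in the task description is filled in from the inputs passed when the crew is kicked off. The sketch below only imitates that effect with plain `str.format`, which stands in for CrewAI's own interpolation. The shortened template and the helper `placeholder_names` are assumptions for illustration.

```python
from string import Formatter
from typing import Set


def placeholder_names(template: str) -> Set[str]:
    """List the {name} placeholders in a template (hypothetical helper)."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Shortened stand-in for the task description
description = "Find and download the paper at the following website URL: {provided_link}"
inputs = {"provided_link": "https://arxiv.org/pdf/2501.13533"}

# Check that every placeholder has a matching input before rendering
unfilled = placeholder_names(description) - inputs.keys()
rendered = description.format(**inputs)
print(unfilled)
print(rendered)
```

A check like `unfilled` is a cheap way to catch a misspelled input key before spending tokens on a run.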

## Define the Crew

In [16]:
# Create a crew
crew = Crew(
    agents=[archivist_agent],
    tasks=[main_task, formatting_task],
    process=Process.sequential,
    verbose=True
)

## Run the Crew

In [None]:
# Run the crew
result = crew.kickoff(inputs={"provided_link": "https://arxiv.org/pdf/2501.13533"})
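
With hundreds of Slack messages, we would want one input per cited link rather than a single hard-coded URL. Here is a minimal sketch of that step, assuming the messages are available as raw Slack text, where links appear as `<url>` or `<url|label>`. The sample messages and the helper `inputs_from_messages` are invented for illustration.

```python
import re
from typing import Dict, List

# Raw Slack text wraps links as <https://example.com> or <https://example.com|label>
SLACK_LINK = re.compile(r"<(https?://[^|>\s]+)(?:\|[^>]*)?>")


def inputs_from_messages(messages: List[str]) -> List[Dict[str, str]]:
    """Build one kickoff input per unique link, keeping first-seen order (hypothetical helper)."""
    unique_urls = dict.fromkeys(
        url for text in messages for url in SLACK_LINK.findall(text)
    )
    return [{"provided_link": url} for url in unique_urls]


# Made-up message texts
messages = [
    "Worth a read: <https://arxiv.org/pdf/2501.13533>",
    "Same paper again <https://arxiv.org/pdf/2501.13533|arXiv link>",
    "No citation in this one.",
]
batch = inputs_from_messages(messages)
print(batch)

# With the crew defined above, the batch could then be run in one call:
# results = crew.kickoff_for_each(inputs=batch)
```

Deduplicating first matters here, since the same paper tends to get cited more than once and each run costs a download plus LLM calls.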

## Print the result

In [None]:
# Print the result
print(result)