# Agentic System Design

**Concept:** SciDigest AI is a virtual research assistant that helps scientists, students, and industry professionals quickly digest academic literature and technical reports.

**Core capabilities:**
1. Entity Extraction — Identifies key concepts, authors, organizations, datasets, methods, and metrics from research papers.
2. Summarization — Produces concise, coherent summaries of papers so users can quickly assess relevance.

**Real-World Problem Solved:**
Researchers face a flood of new publications (e.g., thousands of papers uploaded to arXiv weekly), and manually reading them to find relevant work takes hours. SciDigest AI:

- Quickly distills content into readable summaries.
- Extracts key metadata for indexing, search, and filtering.
- Enables faster literature review and decision-making.

In [2]:
!pip install google-generativeai --quiet

In [5]:
!pip install spacy --quiet
!python -m spacy download en_core_web_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
from google.colab import userdata
import os

google_api_key = userdata.get('GOOGLE_API_KEY')
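
Outside Colab, `google.colab` cannot be imported, so the cell above fails. A minimal sketch of a fallback, assuming the key is also exported as an environment variable named `GOOGLE_API_KEY`; the helper name `load_api_key` is hypothetical and not part of the notebook:

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the key lives in an environment variable.
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key
```

The result can be passed to `genai.configure(api_key=...)` in the same way as `google_api_key`.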

In [4]:
import google.generativeai as genai

genai.configure(api_key=google_api_key)

In [16]:
import re
import pandas as pd
from transformers import pipeline
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_ner_extraction(texts, batch_size=20):
    """
    Runs spaCy NER on a list of texts and returns entity label counts.
    """
    entity_counter = Counter()
    entity_examples = {}

    for doc in nlp.pipe(texts, batch_size=batch_size):
        for ent in doc.ents:
            entity_counter[ent.label_] += 1
            # Store a small sample of examples per entity type
            if ent.label_ not in entity_examples:
                entity_examples[ent.label_] = set()
            if len(entity_examples[ent.label_]) < 5:
                entity_examples[ent.label_].add(ent.text)

    # Convert examples sets to lists
    entity_examples = {k: list(v) for k, v in entity_examples.items()}

    return entity_counter, entity_examples

def abstractive_summarization(texts, max_length=70, min_length=20):
    """
    Summarizes a list of texts using BART.
    Parameters:
        texts (list[str]): List of documents to summarize
        max_length (int): Maximum length of generated summary tokens
        min_length (int): Minimum length of generated summary tokens
    Returns:
        list[str]: List of summaries
    """
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    summaries = []
    for text in texts:
        # truncation=True lets the tokenizer cut inputs at the model's limit
        # (about 1024 tokens for BART); slicing the string would count characters instead.
        summary = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False, truncation=True)
        summaries.append(summary[0]['summary_text'])
    return summaries

functions = {
    "spacy_ner_extraction": spacy_ner_extraction,
    "abstractive_summarization": abstractive_summarization
}
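
Truncating the input means anything past the model's limit never reaches the summarizer. One common workaround is to split a long document into chunks and summarize each chunk separately. The sketch below is an illustration only: `chunk_by_words` is a hypothetical helper, and the word budget is an assumed, rough stand-in for a token limit, since word counts only approximate BART's tokenization.

```python
def chunk_by_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Tiny demonstration with a 2-word budget (assumed value, for illustration only).
chunks = chunk_by_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `abstractive_summarization` as its own document, at the cost of one summary per chunk rather than one per paper.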

In [17]:
prompt = """You are SciDigest AI, a professional AI research assistant designed to help users process academic and technical content.
You have two core functions:

1. spacy_ner_extraction: Extracts named entities from a given text, including people, organizations, locations, dates, metrics, datasets, and other relevant research-related entities. Output should be structured, accurate, and concise.
2. abstractive_summarization: Generates a clear, coherent, and concise abstractive summary of the given text, preserving the main ideas and omitting irrelevant details. The tone should be formal and informative, suitable for academic or professional contexts.

Always provide outputs that are:
- Accurate: Extract factual information without introducing errors.
- Readable: Ensure clarity and fluency for professional readers.
- Relevant: Focus on research-relevant content, ignoring filler text.

Important:
- Keep the results from the summarization function as they are; there is no need to tweak them.

When responding, ensure your answer is well-structured, professional, and directly aligned with the requested function."""

In [18]:
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash", tools=functions.values(), system_instruction=prompt
)

chat = model.start_chat(enable_automatic_function_calling=True)
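
With `enable_automatic_function_calling=True`, the SDK runs the Python function the model asks for and sends the result back to the model. Conceptually this resembles a lookup in a name-to-callable registry like `functions`. The sketch below is a simplified illustration, not the SDK's implementation: `dispatch`, `word_count`, and the dictionary-shaped `call` are hypothetical stand-ins.

```python
def word_count(texts):
    """Stand-in tool: number of whitespace-separated words per text."""
    return [len(t.split()) for t in texts]


# Same shape as the notebook's `functions` dict: tool name -> callable.
registry = {"word_count": word_count}


def dispatch(call, registry):
    """Run the tool named in `call` with its keyword arguments."""
    name = call["name"]
    if name not in registry:
        raise KeyError(f"unknown tool: {name}")
    return registry[name](**call.get("args", {}))


result = dispatch({"name": "word_count", "args": {"texts": ["a b c", "d e"]}}, registry)
print(result)  # [3, 2]
```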

In [19]:
response = chat.send_message(
    "Awesome! Can you please summarize for me this abstract? This work introduces an AI-driven framework combining named entity recognition and abstractive summarization to process research papers and technical documents. The entity extraction module identifies organizations, people, dates, metrics, and other key terms, while the summarization module condenses lengthy content into concise, coherent abstracts. Together, these capabilities enable faster information retrieval, support academic research, and enhance knowledge accessibility"
)
print(response.text)

This AI framework uses named entity recognition and abstractive summarization to process research papers.  It identifies key entities (organizations, people, dates, metrics) and condenses text into concise summaries, improving information retrieval and knowledge access.
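
A text reply alone does not show whether a tool actually ran. One way to check is to scan the chat history for parts that carry a function call. The sketch below assumes each history entry exposes `parts` and that a part holding a tool request has a `function_call` with a non-empty `name`, as in the `google.generativeai` SDK. `called_tools` is a hypothetical helper, and the `SimpleNamespace` objects only mimic that shape for demonstration.

```python
from types import SimpleNamespace


def called_tools(history):
    """Return the names of all function calls found in a chat history."""
    names = []
    for content in history:
        for part in content.parts:
            call = getattr(part, "function_call", None)
            # Skip parts with no function call or with an empty call name.
            if call and getattr(call, "name", ""):
                names.append(call.name)
    return names


# Mock history with the assumed shape; in the notebook this would be chat.history.
mock_history = [
    SimpleNamespace(role="user", parts=[SimpleNamespace(text="Summarize this.")]),
    SimpleNamespace(
        role="model",
        parts=[SimpleNamespace(function_call=SimpleNamespace(name="abstractive_summarization"))],
    ),
]
print(called_tools(mock_history))  # ['abstractive_summarization']
```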



In [13]:
chat.history

[parts {
   text: "Hello how are you?."
 }
 role: "user",
 parts {
   text: "I\'m doing well, thank you for asking. How can I help you today?\n"
 }
 role: "model",
 parts {
   text: "What are you functionalities?"
 }
 role: "user",
 parts {
   text: "I am SciDigest AI, a research assistant designed to help process academic and technical content.  My core functionalities are:\n\n1. **`spacy_ner_extraction`:** This function extracts named entities from text, such as people, organizations, locations, dates, and other research-relevant information. The output is structured and concise.\n\n2. **`abstractive_summarization`:** This function generates concise and coherent abstractive summaries of text, preserving key ideas while omitting irrelevant details.  The summaries are written in a formal and informative style suitable for academic or professional contexts.\n"
 }
 role: "model",
 parts {
   text: "Can you please get me the main entities in this text: [OpenAI, founded in San Francisco in