# Researchify Backend

- Change the runtime to one with a **GPU**, which ColPali needs for the Vision RAG pipeline (the free tier is sufficient).
- Note: **Ngrok** is used to expose the backend API publicly as a workaround.
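
Before loading the model, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch, assuming an NVIDIA runtime where the `nvidia-smi` tool is on the PATH; the helper name `gpu_runtime_available` is illustrative and not part of the notebook.

```python
import shutil
import subprocess


def gpu_runtime_available() -> bool:
    """Return True if `nvidia-smi` exists and exits successfully.

    Assumption: the GPU runtime is NVIDIA-based and ships `nvidia-smi`.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


print("GPU runtime detected:", gpu_runtime_available())
```

If this prints `False`, switching the runtime type to a GPU before running the remaining cells is probably the first thing to try.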

#### Install Dependencies

In [1]:
!pip install flask flask_cors pyngrok cohere chromadb pyvis PyMuPDF PyPDF2 byaldi

Collecting flask_cors
  Downloading flask_cors-5.0.1-py3-none-any.whl.metadata (961 bytes)
Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Collecting cohere
  Downloading cohere-5.14.0-py3-none-any.whl.metadata (3.4 kB)
Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting PyMuPDF
  Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting byaldi
  Downloading Byaldi-0.0.7-py3-none-any.whl.metadata (20 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting types-requests<3.0.0,>=2.0.0 

In [2]:
!sudo apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 29 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.6 [186 kB]
Fetched 186 kB in 1s (302 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package poppler-utils.
(Reading database ... 126209 

#### Import Modules and Set Up API Keys, Database, and LLM Config

In [3]:
import requests
from typing import List, Optional
from werkzeug.utils import secure_filename
import io
import fitz
import cohere
import chromadb
import json
import os
import time
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network
import base64
from flask import Flask, request, jsonify, send_file
from flask_cors import CORS
from pyngrok import ngrok
from threading import Thread
from byaldi import RAGMultiModalModel
from openai import OpenAI
from google.colab import userdata
from PyPDF2 import PdfReader
from collections import defaultdict

In [4]:
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
COHERE_API_KEY = userdata.get('COHERE_API_KEY')
SS_API_KEY = userdata.get('SS_API_KEY')
NGROK_AUTH_TOKEN = userdata.get("NGROK_AUTH_TOKEN")
co = cohere.ClientV2(COHERE_API_KEY)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="relevant_research_papers")

In [5]:
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2", verbose=1)

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


adapter_config.json:   0%|          | 0.00/750 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/862M [00:00<?, ?B/s]

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/78.6M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/733 [00:00<?, ?B/s]

In [6]:
GPTclient = OpenAI(api_key=OPENAI_API_KEY)

# Utility Functions
- Validate and download a PDF from a URL
- Extract the first page of the PDF
- Extract the abstract from the first page
- Extract the topic from a query
- Extract a summary from a description/abstract
- Fetch paper info from the database
- Fetch the abstract for a given paper from the database

In [7]:
#Future Work: Make scraper more robust

def validate_and_download_pd(
    url: str,
    download: bool):

    """
    Given an Open Access URL from Semantic Scholar, check whether the URL is reachable.
    If download is False, return True or False; if download is True, return the PDF bytes, or None on failure.
    """

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    }

    if not download:
      try:
          response = requests.head(url, headers=headers, timeout=10)
          if response.status_code == 200:
              return True
          response = requests.head(url, timeout=10)
          if response.status_code == 200:
              return True
          return False
      except requests.RequestException as e:
          print(f"Request failed: {e}")
          return False

    else:
      try:
          response = requests.get(url, headers=headers, timeout=10)
          if response.status_code == 200:
            return response.content
          response = requests.get(url, timeout=10)
          if response.status_code == 200:
            return response.content
          return None
      except requests.RequestException as e:
          print(f"Request failed: {e}")
          return None
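
A 200 status does not guarantee that the body is a PDF; some hosts answer with an HTML landing page instead. One way the scraper could be made more robust is to check for the `%PDF-` signature before trusting the download. This is a sketch only: `looks_like_pdf` is a hypothetical helper, not part of the notebook, and the 1024-byte search window is an assumption.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Heuristic check: a PDF file carries the ASCII marker '%PDF-' near its start."""
    if not content:
        return False
    # Some servers prepend whitespace or a BOM, so search the first 1024 bytes
    # rather than requiring the marker at offset 0.
    return b"%PDF-" in content[:1024]


print(looks_like_pdf(b"%PDF-1.7\n%rest of file"))
print(looks_like_pdf(b"<html><body>Access denied</body></html>"))
```

A check like this could run on the bytes returned by `validate_and_download_pd(url, True)` before they are passed to the text extraction step.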

In [8]:
def extract_abstract(text: str) -> str:

    """Extracts the abstract from a given first page of the pdf text"""

    prompt = """
    Extract the full abstract from the following research paper text. The abstract section is always at the start of the research paper.
    Return only the full abstract.

    Research Paper Text:
    """ + text

    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
    )

    return response.message.content[0].text



def extract_first_page_text(pdf_bytes: io.BytesIO) -> str:

    """Extracts the text from the first page of the pdf"""

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    if len(doc) > 0:
        return doc[0].get_text("text")
    return ""

In [9]:
def extract_topic_from_query(query: str) -> str:

    """Identify the central topic of a given sentence"""

    prompt = (
        "Given the following user query, extract the most central topic from it.\n\n"
        f"User Query: {query}\n"
        "Output format: 'topic'"
    )

    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    )
    result = response.message.content[0].text
    return result




def extract_summary_from_desc(query: str) -> str:

    """Produce a short summary of a given description/abstract"""

    # Each fragment ends with a space or newline so the concatenated prompt reads as separate sentences
    prompt = (
        "Given the following abstract, summarize it in no more than 5-6 words. "
        "The extracted summary will be used to query similar papers on the Semantic Scholar API, so ensure that the central theme is captured. "
        "Since the summary will be used as an API query, it should not contain generic terms and should only contain the main topic.\n\n"
        f"Abstract: {query}\n"
        "Output format: 'summary'"
    )

    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    )
    result = response.message.content[0].text
    return result
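
Both prompts ask for the output format `'topic'` or `'summary'`, so the model may return the text wrapped in quotes or followed by a newline. A small post-processing step could normalize that before the value is sent to Semantic Scholar. This is a sketch under that assumption; `clean_llm_label` is a hypothetical helper, not part of the notebook.

```python
def clean_llm_label(text: str) -> str:
    """Strip surrounding whitespace and wrapping quotes from a short LLM answer."""
    stripped = text.strip()
    # Remove quote characters at either end, then any whitespace left inside them.
    return stripped.strip("'\"").strip()


print(clean_llm_label("'graph neural networks'\n"))
print(clean_llm_label('  "RAG" '))
```

Note that this also removes a legitimate leading or trailing apostrophe, which is an acceptable trade-off only for short topic strings.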

In [10]:
def fetch_paper_info(paper_id: str) -> dict:

  """Fetch the info of paper stored in ChromaDB based on a given paper_id """

  data = collection.get(paper_id).get('metadatas')[0]
  return data

In [11]:
def fetch_abstract(paper_id) -> str:

  """Fetch the abstract of paper stored in ChromaDB based on a given paper_id """

  abstract = collection.get(paper_id).get('documents')[0]
  return abstract
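
Both lookups index `[0]` directly, which raises an `IndexError` when the ID is not in the collection. The sketch below shows one way to guard that access, assuming the result of `collection.get(...)` is a dict of lists (as the code above relies on). `first_or_none` is a hypothetical helper, and the two dicts are hand-written stand-ins for real results.

```python
from typing import Any, Optional


def first_or_none(result: dict, key: str) -> Optional[Any]:
    """Return the first entry under `key`, or None if the list is missing or empty."""
    values = result.get(key) or []
    return values[0] if values else None


# Hand-written stand-ins for a hit and a miss.
found = {"ids": ["p1"], "documents": ["An abstract."], "metadatas": [{"title": "T"}]}
missing = {"ids": [], "documents": [], "metadatas": []}

print(first_or_none(found, "documents"))
print(first_or_none(missing, "documents"))
```

Returning `None` lets a route such as `/get_abstract` fall through to its existing 404 branch instead of failing with a server error.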

# Services

- Fetch the relevant papers from Semantic Scholar
- Store the valid papers in Chroma DB
- Rerank the papers to get Top 5 Based on User Input
- Fetch reference information for a paper from Semantic Scholar
- Build a reference Graph

In [12]:
def query_semantic_scholar(
    query: str,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
    limit: int = 100
) -> List[dict]:

    """
    Query the Semantic Scholar API for open-access papers relevant to the given query.
    Filters by publication year range if provided.

    :param query: Search query string.
    :param start_year: Start of publication year range (inclusive).
    :param end_year: End of publication year range (inclusive).
    :param limit: Maximum number of results to return (default is 100, which is also the API maximum per request).
    :return: List of dictionaries containing relevant paper details.
    """

    url = "https://api.semanticscholar.org/graph/v1/paper/search"

    headers = {
        "x-api-key": SS_API_KEY
    }

    params = {
        "query": query,
        "limit": min(limit, 100),  # API max per request is 100
        "openAccessPdf": "true",  # Only retrieve open-access papers
        "fields": "title,authors,year,openAccessPdf,abstract,journal,citationCount,url",
    }

    response = requests.get(url, params=params,headers=headers)

    if response.status_code != 200:
        raise Exception(f"Error {response.status_code}: {response.text}")

    papers = response.json().get("data", [])

    filtered_papers = []
    for paper in papers:
        # openAccessPdf can be null in the response, so guard before reading the URL
        paper['paper_url'] = (paper.pop('openAccessPdf', None) or {}).get('url', '')
        # journal can also be null or missing; fall back to 'Unknown'
        journal = paper.pop('journal', None) or {}
        paper['journal_name'] = journal.get('name') or 'Unknown'
        paper_year = paper.get("year")
        if paper_year and (start_year is None or paper_year >= start_year) \
                        and (end_year is None or paper_year <= end_year):
            filtered_papers.append(paper)

    return filtered_papers
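
The function above filters by year after the request, so up to 100 results are fetched and then discarded client-side. The Semantic Scholar search endpoint also accepts a `year` query parameter with range syntax such as `2016-2020`, `2010-` or `-2015`, which would move the filter to the server. The sketch below only builds that string; `build_year_param` is a hypothetical helper and has not been wired into the notebook.

```python
from typing import Optional


def build_year_param(start_year: Optional[int], end_year: Optional[int]) -> Optional[str]:
    """Build a 'start-end' year range string; either bound may be open."""
    if start_year is None and end_year is None:
        return None  # no year filter requested
    if start_year is not None and end_year is not None and start_year > end_year:
        raise ValueError("start_year must not be greater than end_year")
    start = "" if start_year is None else str(start_year)
    end = "" if end_year is None else str(end_year)
    return f"{start}-{end}"


print(build_year_param(2019, 2021))
print(build_year_param(2019, None))
print(build_year_param(None, 2021))
```

If used, the returned string would be added to `params` under the key `"year"` only when it is not `None`.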

In [13]:
def store_papers_in_chroma(papers: List[dict]) -> None:

    """Store a list of valid papers in ChromaDB."""

    seen_titles = set()
    for paper in papers:
        paper_id = paper.get('paperId', None)
        abstract = paper.get("abstract", "")
        title = paper.get("title", "Unknown Title")
        semantic_scholar_url = paper.get("url", "#")
        paper_url = paper.get("paper_url", "#")
        year = paper.get("year", "Unknown Year")
        journal = paper.get("journal_name", "Unknown Journal")
        authors = ", ".join([author['name'] for author in paper.get("authors", [])])
        citationCount = paper.get("citationCount", 0)

        if not paper_id:
            print(f"Skipping paper with missing ID: {title}")
            continue
        if not validate_and_download_pd(paper_url,False):
            print(f"Skipping paper with invalid URL: {title} with the URL: {paper_url}")
            continue
        if title in seen_titles:
            print(f"Skipping duplicate paper: {title}")
            continue
        seen_titles.add(title)

        if not abstract and paper_url:
            pdf_content = validate_and_download_pd(paper_url,True)
            if pdf_content:
                pdf_bytes = io.BytesIO(pdf_content)
                first_page_text = extract_first_page_text(pdf_bytes)
                if first_page_text:
                    abstract = extract_abstract(first_page_text)
                    print(f"Extracted abstract for: {title}")
                else:
                    print(f"Failed to extract text from the first page of the paper: {title}")
            else:
                print(f"Failed to download the paper: {title}")

        if not abstract:
            print(f"Skipping paper with missing abstract: {title}")
            continue

        collection.add(
            ids=[paper_id],
            documents=[abstract],
            metadatas=[{
                "title": title,
                "paper_url": paper_url,
                "year": year,
                "journal_name": journal,
                "authors": authors,
                "citationCount": citationCount,
                "semantic_scholar_url": semantic_scholar_url
            }]
        )
    print(f"Stored {len(collection.get().get('ids'))} papers in ChromaDB!")

In [14]:
def retrieve_and_rerank(
    query: str,
    top_k: int = 5,
    initial_retrieval_k: int = 100
    ) -> List[dict]:

    """Retrieve candidate papers from ChromaDB and rerank them with Cohere, returning the top_k papers"""

    results = collection.query(
        query_texts=[query],
        n_results=initial_retrieval_k
    )

    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    paper_ids = results["ids"][0]

    rerank_results = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-v3.5"
    )

    reranked_papers = []
    for r in rerank_results.results:
        index = r.index
        print(f"Paper Id: {paper_ids[index]}")
        paper = {
            "paper_id": paper_ids[index],
            "title": metadatas[index]["title"],
            "url": metadatas[index]["paper_url"],
            "year": metadatas[index]["year"],
            "authors": metadatas[index]["authors"],
            # The metadata is stored under "journal_name" in store_papers_in_chroma
            "journal": metadatas[index]["journal_name"],
            "abstract": documents[index],
            "relevance_score": r.relevance_score
        }
        reranked_papers.append(paper)

    return reranked_papers

In [15]:
def fetch_references(paper_id) -> List[tuple]:

    """Fetch references of a given paper using the Semantic Scholar API."""

    API_DELAY = 1.2
    base_url = "https://api.semanticscholar.org/graph/v1/paper/"
    headers = {
        "x-api-key": SS_API_KEY
    }
    time.sleep(API_DELAY)

    response = requests.get(f"{base_url}{paper_id}?fields=references.paperId,references.title,references.url,references.year,references.citationCount,references.authors,references.journal",
                            headers=headers)

    if response.status_code != 200:
        print(f"Error fetching references for {paper_id}: {response.status_code}")
        return []

    references = response.json().get("references", [])
    for ref in references:
      if ref.get("authors"):
        ref['authors'] = ", ".join([author['name'] for author in ref.get("authors", [])])
      if ref.get("journal"):
        ref['journal_name'] = ref.get('journal','N/A').get('name','Unknown Journal')
      else:
        ref['journal_name'] = 'Unknown Journal'
      ref.pop('journal', None)  # tolerate references that lack the journal key


    return [(ref.get("paperId"), ref.get("title","Unknown Title"), ref.get("url","N/A"),
             ref.get("year","N/A"), ref.get("citationCount",0), ref.get("authors",[]), ref.get("journal_name","Unknown Journal"))
              for ref in references if ref.get("paperId")]

In [16]:
def build_reference_graph(paper_id:str) -> dict:

    """Build a reference graph and return its data as nodes & edges."""

    first_level_refs = fetch_references(paper_id)
    fetch_main_paper_infos  =  fetch_paper_info(paper_id)
    fetch_main_paper_infos.pop("paper_url")



    G = nx.DiGraph()
    G.add_node(paper_id, label="Main Paper")

    co_reference_counts = defaultdict(int)
    edges = []
    nodes = [{"id": paper_id,
              "label": fetch_main_paper_infos
              or "Main Paper"
              }]

    for ref_id, ref_title, ref_url, ref_year, ref_citationCount, ref_authors, ref_journal  in first_level_refs:
        G.add_node(ref_id, label=ref_title or "Reference")
        second_level_refs = fetch_references(ref_id)
        edges.append((paper_id, ref_id))  # Main paper → first-level ref
        nodes.append({"id": ref_id,
                      "label": {
                          "title": ref_title,
                          "semantic_scholar_url": ref_url,
                          "year": ref_year,
                          "citationCount": ref_citationCount,
                          "authors": ref_authors,
                          "journal_name": ref_journal

                      }
                      or "Reference"})

        for sec_ref_id, sec_ref_title, sec_ref_url, sec_ref_year, sec_ref_citationCount, sec_ref_authors, sec_ref_journal in second_level_refs:
            if sec_ref_id in [r[0] for r in first_level_refs]:
                co_reference_counts[sec_ref_id] += 1
                edges.append((ref_id, sec_ref_id))  # First-level ref → co-referenced paper

    edges = [{"from": src, "to": dst} for src, dst in edges]

    return {"nodes": nodes, "edges": edges}
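
The edge logic above (main paper to each first-level reference, plus an edge between two first-level references when one cites the other) can be isolated from the API calls. The sketch below injects the fetch step as a function so it runs on toy data. `build_co_reference_edges` and the `toy` dict are illustrative names, and the set lookup is a substitution for the list comprehension used above, not the notebook's own code.

```python
from typing import Callable, Dict, List


def build_co_reference_edges(
    paper_id: str,
    fetch_ref_ids: Callable[[str], List[str]],
) -> List[Dict[str, str]]:
    """Return edges main -> first-level refs, plus ref -> ref co-reference edges."""
    first_level = fetch_ref_ids(paper_id)
    first_level_set = set(first_level)  # constant-time membership checks

    # Main paper -> each first-level reference
    edges = [{"from": paper_id, "to": ref_id} for ref_id in first_level]

    # First-level reference -> another first-level reference it also cites
    for ref_id in first_level:
        for sec_id in fetch_ref_ids(ref_id):
            if sec_id in first_level_set:
                edges.append({"from": ref_id, "to": sec_id})
    return edges


# Toy citation data: "main" cites a and b; a cites b and an outside paper x.
toy = {"main": ["a", "b"], "a": ["b", "x"], "b": []}
edges = build_co_reference_edges("main", lambda pid: toy.get(pid, []))
print(edges)
```

Here `x` produces no edge because it is not among the main paper's own references, which matches the co-reference condition in `build_reference_graph`.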

# MAIN FLASK APP

### Routes

- **"/query"** - Gives the top 5 papers based on user input
- **"/graph"** - Gives the reference graph based on selected paper
- **"/get_abstract"** - Returns the abstract for a given paper
- **"/index_paper"** -  Indexes the paper for Retrieval
- **"/get_pdf"** - Fetches the PDF to be displayed
- **"/search_paper"** - Q/A with the paper using OpenAI
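
To illustrate how a client might call two of these routes, the sketch below builds the requests with the standard library, without sending them. `BASE_URL` is a placeholder for the Ngrok URL printed when the server starts, and the helper names are hypothetical. The parameter names (`id`, `pdf_url`) are the ones the route handlers read.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: replace with the public URL that Ngrok prints at startup.
BASE_URL = "https://example.ngrok-free.app"


def graph_request(paper_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """GET /graph?id=<paper_id>"""
    query = urllib.parse.urlencode({"id": paper_id})
    return urllib.request.Request(f"{base_url}/graph?{query}", method="GET")


def index_paper_request(pdf_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /index_paper with a JSON body containing pdf_url"""
    body = json.dumps({"pdf_url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/index_paper",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = graph_request("abc123")
print(req.get_method(), req.full_url)
```

Passing either object to `urllib.request.urlopen` would send it. Note that `/query` expects form data (`search_type`, `query`, and optionally `file`) rather than JSON.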

In [17]:
%load_ext autoreload
%autoreload 2

In [18]:
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
app = Flask(__name__)
CORS(app)

PDF_STORAGE_PATH = "/tmp/research_paper.pdf"


@app.route('/')
def home():
    return "Welcome to the Research Paper Analysis API!"


@app.route('/query', methods=['POST'])
def query_papers():
    search_type = request.form.get('search_type')

    if search_type not in ['keyword', 'description', 'attachment']:
        return jsonify({'error': 'Invalid search_type'}), 400

    user_input = request.form.get('query')
    uploaded_file = request.files.get('file')

    # Handle 'attachment' search type
    if search_type == 'attachment':
        if not uploaded_file:
            return jsonify({'error': 'No file uploaded'}), 400
        first_page = extract_first_page_text(uploaded_file.stream.read())
        if not first_page:
            return jsonify({'error': 'Failed to extract the first page of the uploaded file.'}), 400
        user_input = extract_abstract(first_page)
        if not user_input:
            return jsonify({'error': 'Failed to extract abstract from the uploaded file.'}), 400
        summary = extract_summary_from_desc(user_input)
        if not summary:
            return jsonify({'error': 'Failed to understand main concept from the uploaded file.'}), 400
        topic = extract_topic_from_query(summary)

    # Handle 'description' and 'keyword' search types
    elif search_type == 'description':
        if not user_input:
            return jsonify({'error': 'Description is required'}), 400
        summary = extract_summary_from_desc(user_input)
        if not summary:
            return jsonify({'error': 'Failed to understand main concept from the description.'}), 400
        topic = extract_topic_from_query(summary)

    elif search_type == 'keyword':
        if not user_input:
            return jsonify({'error': 'Keyword query is required'}), 400
        topic = extract_topic_from_query(user_input)

    if not topic:
        return jsonify({'message': 'No topic found'}), 404
    papers = query_semantic_scholar(topic)
    if not papers:
        return jsonify({'message': 'No papers found'}), 404

    store_papers_in_chroma(papers)
    top_paper = retrieve_and_rerank(user_input)

    if not top_paper:
        return jsonify({'message': 'No top papers found'}), 404

    return jsonify({'papers': top_paper})



@app.route("/graph", methods=["GET"])
def get_graph():
    paper_id = request.args.get("id")
    if not paper_id:
        return jsonify({"error": "Missing paper ID"}), 400

    graph_data = build_reference_graph(paper_id)
    if not graph_data:
        return jsonify({"error": "No graph data found"}), 404
    return jsonify(graph_data)


@app.route("/get_abstract", methods=["GET"])
def get_abstract():
    paper_id = request.args.get("id")
    if not paper_id:
        return jsonify({"error": "Missing paper ID"}), 400

    abstract = fetch_abstract(paper_id)
    if not abstract:
        return jsonify({"error": "No abstract found"}), 404
    return jsonify({"abstract": abstract})


@app.route('/index_paper', methods=['POST'])
def index_paper():

    data = request.json
    pdf_url = data.get('pdf_url')
    if not pdf_url:
        return jsonify({"error": "PDF URL is required"}), 400


    # validate_and_download_pd returns the raw PDF bytes (or None on failure)
    pdf_content = validate_and_download_pd(pdf_url, True)
    if not pdf_content:
        return jsonify({"error": "Failed to download the PDF"}), 400

    with open(PDF_STORAGE_PATH, 'wb') as f:
        f.write(pdf_content)


    RAG.index(
        input_path=PDF_STORAGE_PATH,
        index_name="attention",
        store_collection_with_index=True,
        overwrite=True
    )
    return jsonify({"message": "Paper indexed successfully", "pdf_path": "/get_pdf"})


@app.route('/get_pdf', methods=['GET'])
def get_pdf():
    return send_file(PDF_STORAGE_PATH, as_attachment=False)



@app.route('/search_paper', methods=['POST'])
def search_paper():
    data = request.json
    query = data.get('query', '')
    k = data.get('k', 3)

    results = RAG.search(query, k=k)
    if results:
      print("Results found")
    else:
      return jsonify({"error": "No results found"}), 404

    pages = set()

    for res in results:
        page_number = res['page_num']
        if page_number is not None:
            pages.update([page_number, page_number-2, page_number-1, page_number+1, page_number+2])

    pages = sorted(p for p in pages if p >= 1)
    all_results = RAG.search("", k=100)
    page_images = {res['page_num']: res['base64'] for res in all_results if res['page_num'] in pages}

    image_contents = [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{url}"}}
        for url in page_images.values()
    ]

    messages = [
        {"role": "system", "content": "You are an AI that helps answer questions from research papers."},
        {"role": "user", "content": [{"type": "text", "text": query}] + image_contents}
    ]

    response = GPTclient.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    if response:
      return jsonify({"answer": response.choices[0].message.content}), 200
    else:
      return jsonify({"error": "No response from the API"}), 500


public_url = ngrok.connect(5000)
print(f" * Flask app is running on {public_url}")

def run():
    app.run(port=5000)

Thread(target=run).start()

 * Flask app is running on NgrokTunnel: "https://2ea2-34-125-4-238.ngrok-free.app" -> "http://localhost:5000"
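
The `/search_paper` route widens each retrieved page to a window of two pages on either side before sending the page images to the model. That step can be sketched on its own as below. `expand_pages` is a hypothetical helper, and the optional `max_page` bound is an addition that the route does not currently apply.

```python
from typing import Iterable, List, Optional


def expand_pages(
    hit_pages: Iterable[int],
    radius: int = 2,
    max_page: Optional[int] = None,
) -> List[int]:
    """Return the sorted union of [page - radius, page + radius] for every hit.

    Pages below 1 are dropped; pages above max_page are dropped when it is given.
    """
    pages = set()
    for page in hit_pages:
        pages.update(range(page - radius, page + radius + 1))
    return sorted(
        p for p in pages
        if p >= 1 and (max_page is None or p <= max_page)
    )


print(expand_pages([1, 5]))              # [1, 2, 3, 4, 5, 6, 7]
print(expand_pages([1, 5], max_page=6))  # [1, 2, 3, 4, 5, 6]
```

Overlapping windows are merged by the set, so each page image is sent at most once.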


# Future Work List

- Integrate a separate data pipeline with the Semantic Scholar Database API for scale
- Integrate more LLMs into the Q/A pipeline
- Add chat history and tools