# Paper Similarity Search Application
An application that recommends the top 3 most similar papers based on a given paragraph.


## Table of Contents

1. [Setup and Configuration](#setup-and-configuration)
2. [Data Loading and Preprocessing](#data-loading)
3. [Indexing and Retrieval](#Indexing-and-Retrieval)
4. [Streamlit Application](#Streamlit-Application)



<a name='setup-and-configuration'></a>
## Setup and Configuration

This section installs the required packages, authenticates with GCP, imports the libraries used in later steps, and connects to an embedded Weaviate instance.

In [None]:
!pip install weaviate-client
!pip install llama-index
!pip install streamlit
!pip install pyngrok

In [None]:
# Import Necessary Libraries
from google.colab import auth
import os
import weaviate
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.vector_stores import WeaviateVectorStore
from llama_index import VectorStoreIndex, StorageContext

# Authenticate GCP
auth.authenticate_user()

# Set GCP project ID
!gcloud config set project 'scientific-review-ai-assistant'

# Set up OpenAI API key: prompt for it instead of hardcoding the secret
from getpass import getpass
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Connect to the Weaviate instance
client = weaviate.Client(embedded_options=weaviate.embedded.EmbeddedOptions(),
                         additional_headers={'X-OpenAI-Api-Key': os.environ["OPENAI_API_KEY"]})


Are you sure you wish to set property [core/project] to 
scientific-review-ai-assistant?

Do you want to continue (Y/n)?  Y

Updated property [core/project].
embedded weaviate is already listening on port 6666
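
The setup cell depends on `OPENAI_API_KEY` being present in the environment. As a minimal sketch, a small helper can fail early with a readable message when a required variable is missing. The name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    # Hypothetical helper: treats an unset or blank variable as missing.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; set it before running the notebook."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` before creating the Weaviate client would surface a missing key immediately rather than at the first embedding request.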


<a name='data-loading'></a>
## Data Loading and Preprocessing


In [None]:
# Copy data from the GCP bucket to the local directory
# (create the target directory first so the copy has a valid destination)
!mkdir -p ./data
!gsutil -m cp -r gs://llm-technical-test-data/raw-pdf/* ./data/



def preprocess_data(bucket_path='./data'):
    """Read papers from the provided path and parse them into Node objects."""
    papers = SimpleDirectoryReader(bucket_path).load_data()
    parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
    nodes = parser.get_nodes_from_documents(papers)
    return nodes

# Preprocess data
nodes = preprocess_data()


Copying gs://llm-technical-test-data/raw-pdf/PMC8325057.pdf...
Copying gs://llm-technical-test-data/raw-pdf/Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.pdf...
Copying gs://llm-technical-test-data/raw-pdf/82_2020_217.pdf...
Copying gs://llm-technical-test-data/raw-pdf/Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.pdf...
Copying gs://llm-technical-test-data/raw-pdf/PMC8198544.pdf...
Copying gs://llm-technical-test-data/raw-pdf/mRNA vaccines — a new era(1).pdf...
Copying gs://llm-technical-test-data/raw-pdf/Emerging Frontiers in Drug Delivery.pdf...
/ [0/15 files][    0.0 B/ 32.1 MiB]   0% Done
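
The `chunk_size=1024` and `chunk_overlap=20` arguments passed to `SimpleNodeParser` control how each paper is split before indexing. The sketch below illustrates the general sliding-window idea on a plain list of tokens. It is an assumption-laden illustration, not the library's implementation, which is more elaborate. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=20):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small illustration: 10 tokens, windows of 4 with 1 shared token.
print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1))
```

A small overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of slightly more text to embed.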

<a name='Indexing-and-Retrieval'></a>
## Indexing and Retrieval

In [None]:
# Indexing and Retrieval

def process_and_index_data(nodes):
    """Process the provided nodes and index them using Weaviate."""
    # construct vector store
    vector_store = WeaviateVectorStore(weaviate_client=client, index_name="PaperText", text_key="content")
    # setting up the storage for the embeddings
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # set up the index
    index = VectorStoreIndex(nodes, storage_context=storage_context)
    return index

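def retrieve_similar_papers(index, paragraph, top_k=3):
    """Ask the LLM to name and summarize the top k papers most similar to the paragraph."""
    # Retrieve top_k chunks as context; the default setting may retrieve
    # fewer chunks than the number of papers requested.
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(
        f"give the titles and 1 sentence summary for each of the top {top_k} "
        f"most similar papers in the database to this paragraph: {paragraph}"
    )
    return response.response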

# Process and index the data
index = process_and_index_data(nodes)


Embedded weaviate wasn't listening on port 6666, so starting embedded weaviate again
Started /root/.cache/weaviate-embedded: process ID 78860
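
The index stores one embedding vector per chunk, and retrieval ranks chunks by how close their vectors are to the embedding of the query paragraph. The following is a minimal sketch of that ranking step, assuming cosine similarity as the metric and using tiny hand-written vectors in place of real embeddings. The function names and the `chunk_vecs` mapping are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, score) pairs with the highest similarity to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy 2-dimensional "embeddings" for illustration only.
chunk_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_chunks([1.0, 0.0], chunk_vecs, k=2))
```

In the notebook this work is delegated to the vector store; the sketch only shows what "most similar" means numerically.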


In [None]:
# Test
results = retrieve_similar_papers(index, "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and the resulting coronavirus disease 2019 (Covid-19) have afflicted tens of millions of people in a worldwide pandemic. Safe and effective vaccines are needed urgently.")
print(results)


Embedded weaviate wasn't listening on port 6666, so starting embedded weaviate again
Started /root/.cache/weaviate-embedded: process ID 95683
1. Title: "Development of a safe and effective vaccine against SARS-CoV-2: Challenges and prospects"
   Summary: This paper discusses the challenges and prospects in the development of a safe and effective vaccine against SARS-CoV-2, considering the urgent need for such vaccines due to the worldwide Covid-19 pandemic.

2. Title: "Advances in the development of Covid-19 vaccines: Current status, challenges, and future directions"
   Summary: This paper provides an overview of the current status, challenges, and future directions in the development of Covid-19 vaccines, highlighting the urgent need for safe and effective vaccines to combat the worldwide pandemic caused by SARS-CoV-2.

3. Title: "Emerging strategies for the development of Covid-19 vaccines: A comprehensive review"
   Summary: This comprehensive review explores the emerging strategie
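
Because the answer above is generated by the LLM from retrieved chunks, the titles it prints are not guaranteed to correspond to files in `./data`. One hedged alternative is to rank papers directly from chunk-level similarity scores, keeping the best-scoring chunk per source file. The sketch assumes the scores are already available as `(file_name, score)` pairs; how those pairs are obtained from the index is not shown, and `rank_papers` and the file names are hypothetical.

```python
def rank_papers(chunk_hits, top_k=3):
    """Collapse chunk-level hits into a paper-level ranking.

    chunk_hits: iterable of (file_name, score) pairs, one per retrieved chunk.
    Returns up to top_k (file_name, best_score) pairs, best first.
    """
    best = {}
    for file_name, score in chunk_hits:
        # Keep only the highest-scoring chunk seen for each paper.
        if file_name not in best or score > best[file_name]:
            best[file_name] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical scores for illustration; these are not real results.
hits = [
    ("paper_a.pdf", 0.90),
    ("paper_a.pdf", 0.80),
    ("paper_b.pdf", 0.85),
    ("paper_c.pdf", 0.50),
    ("paper_d.pdf", 0.70),
]
print(rank_papers(hits))
```

Ranking by file name guarantees that every recommendation is a paper that actually exists in the corpus, and several chunks from one paper count as a single result.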

<a name='Streamlit-Application'></a>
## Streamlit Application

In [None]:
%%writefile app.py

import streamlit as st
import os
import weaviate
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.vector_stores import WeaviateVectorStore
from llama_index import VectorStoreIndex, StorageContext
import time


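# The OpenAI API key must already be set in the environment; never hardcode secrets
if not os.environ.get("OPENAI_API_KEY"):
    st.error("Set the OPENAI_API_KEY environment variable before starting the app.")
    st.stop()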

# Connect to the Weaviate instance
client = weaviate.Client(embedded_options=weaviate.embedded.EmbeddedOptions(),
                         additional_headers={'X-OpenAI-Api-Key': os.environ["OPENAI_API_KEY"]})

# Build the index once and reuse it across Streamlit reruns;
# without caching, every button click would re-read and re-embed all papers
@st.cache_resource
def process_and_index_data(bucket_path='./data'):
    """Read the papers, split them into nodes, and index them in Weaviate."""
    # Read papers and parse them
    papers = SimpleDirectoryReader(bucket_path).load_data()
    parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
    nodes = parser.get_nodes_from_documents(papers)

    # Set up the index with Weaviate
    vector_store = WeaviateVectorStore(weaviate_client=client, index_name="PaperText", text_key="content")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(nodes, storage_context=storage_context)

    return index

index = process_and_index_data()

# Streamlit UI
st.title("Similar Paper Recommendation System")

# Input paragraph
input_paragraph = st.text_area("Input a paragraph for recommendations:", "")

if st.button('Get Recommendations'):
    if input_paragraph:

        # Display spinner while fetching results
        with st.spinner('Fetching recommendations...'):

            # Cosmetic progress bar: it animates for about one second before
            # the query runs and does not reflect actual query progress
            latest_iteration = st.empty()
            bar = st.progress(0)

            for i in range(100):
                latest_iteration.text(f'Progress {i+1}%')
                bar.progress(i + 1)
                time.sleep(0.01)

            # Create a query engine that retrieves 3 chunks as context, then fetch results
            query_engine = index.as_query_engine(similarity_top_k=3)
            response = query_engine.query(
                "give the titles and 1 sentence summary for each of the top 3 "
                f"most similar papers in the database to this paragraph: {input_paragraph}"
            )
            st.write(response.response)

    else:
        st.warning('Please input a paragraph.')

Overwriting app.py
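
The notebook and `app.py` both build the same prompt string inline. As a sketch, a shared helper could normalize the user's paragraph and produce that prompt in one place. The name `build_query` is hypothetical; the prompt wording is copied from the cells above.

```python
def build_query(paragraph: str, top_k: int = 3) -> str:
    """Build the recommendation prompt for a user-supplied paragraph."""
    # Collapse runs of whitespace (including newlines from the text area).
    cleaned = " ".join(paragraph.split())
    if not cleaned:
        raise ValueError("paragraph must contain some text")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return (
        f"give the titles and 1 sentence summary for each of the top {top_k} "
        f"most similar papers in the database to this paragraph: {cleaned}"
    )


print(build_query("  Safe and effective vaccines\nare needed urgently. "))
```

Keeping the prompt in one function makes it harder for the notebook test and the Streamlit app to drift apart when the wording changes.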


In [None]:
from pyngrok import ngrok

# Set up a tunnel to the Streamlit port 8501
# (the port is passed positionally because the keyword name differs across pyngrok versions)
public_url = ngrok.connect(8501)
print('Streamlit app is live at:', public_url)

!streamlit run app.py
