In [1]:
import os 
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from IPython.display import Markdown

In [3]:
GOOGLE_API_KEY=os.environ['GOOGLE_API_KEY']

In [4]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA


### Load the Paper

`Set the path to the paper's PDF`

If loading a single document:

In [7]:
loader = PyPDFLoader('./papers/0.pdf')
data = loader.load()

If loading multiple documents:

`Set the directory that contains the papers`

In [8]:
loader = DirectoryLoader(path='./papers/', glob='**/*.pdf', loader_cls=PyPDFLoader)
data = loader.load()
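
As a rough illustration of which files the `**/*.pdf` pattern picks up, the sketch below applies the same pattern with the standard library's `pathlib`. It assumes `DirectoryLoader` resolves `glob` with pathlib-style semantics; the helper `list_pdfs`, the temporary directory, and the file names are made up for the demo.

```python
import tempfile
from pathlib import Path


def list_pdfs(root: str) -> list[str]:
    """Return PDF paths under `root` (relative, sorted), matched recursively."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.glob("**/*.pdf"))


# Hypothetical layout, created only for this demo.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    for name in ["0.pdf", "notes.txt", "sub/1.pdf"]:
        (Path(tmp) / name).touch()
    found = list_pdfs(tmp)

print(found)
```

PDFs in the top-level directory and in nested folders are both matched, while non-PDF files are skipped.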

Data is loaded page by page

In [28]:
data[12].page_content

'S2275 Attention-based graph neural networks: a\xa0survey  \n1 3\nInter-layer GATs: This kind of works usually select features beyond neural network \nlayers with multiple feature spaces, not just local neighborhoods. Across the neural net-\nwork layer, attention in inter-layer GATs can be regarded as an operation of cross-layer \nfusion of different feature spaces with feature fusion attention. In this term, attention-\nbased GNNs dynamically select features from different levels, different channels, dif-\nferent views, or different time slices. Therefore, we further divide these methods into \nfive sub-categories (i.e., multi-level attention (Liu et\xa0 al. 2020; Zhang et\xa0 al. 2022c), \nmulti-channel (Bo et\xa0al. 2021; Luan et\xa0al. 2021), multi-view (Wang et\xa0al. 2020b; Yuan \net\xa0al. 2021b), Spatio-temporal attention (Sankar et\xa0al. 2018; Lu et\xa0al. 2019), and time \nseries attention (Zhang et\xa0al. 2021c; Zhao et\xa0al. 2020)). By considering temporal attrib-\nutes, 

### Split all the docs into Chunks

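Split pages into chunks of at most 1000 characters, with an overlap of 200 characters between consecutive chunks (`RecursiveCharacterTextSplitter` measures length in characters by default, not words)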

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size= 1000, chunk_overlap=200)
texts = text_splitter.split_documents(data)

In [None]:
texts[0].page_content
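
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at separators such as paragraph and line breaks, so its chunks are usually shorter than the limit. The function `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Made-up 2500-character sample.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The last 200 characters of each chunk reappear at the start of the next one, which keeps sentences that straddle a boundary retrievable from at least one chunk.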

Generate embeddings for the chunks and store them in a Chroma vector store

In [10]:
embeddings = GoogleGenerativeAIEmbeddings(model='models/embedding-001', google_api_key=GOOGLE_API_KEY)
vector_store = Chroma.from_documents(documents=texts, embedding=embeddings)


Set up the retriever

In [11]:
retriever = vector_store.as_retriever()
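
Conceptually, the retriever embeds the query and returns the chunks whose embeddings are closest to it. The sketch below shows that idea with cosine similarity over made-up 3-dimensional vectors. It is only an illustration: the distance metric Chroma actually uses depends on how the collection is configured, and the helpers `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings"; real ones have hundreds of dimensions.
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]
best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```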

Set up the LLM and the chain

In [12]:
llm = GoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=GOOGLE_API_KEY)

In [13]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", 
    retriever=retriever,
    return_source_documents=True
)
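
The `"stuff"` chain type places all retrieved chunks into a single prompt. A minimal sketch of that idea follows; the wording of the prompt and the function `build_stuff_prompt` are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved chunks.
retrieved = ["Chunk about attention.", "Chunk about graph networks."]
prompt = build_stuff_prompt("What does the paper propose?", retrieved)
print(prompt)
```

Because everything goes into one call, this approach is simple but only works while the combined chunks fit in the model's context window.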

A query to generate a review of all papers:

`Should contain the names of all your papers`

In [29]:
query = ("""
You are a [Research Assistant] bot. You help with creating [Literature Review]. 

Input :You will be given access to a [research paper]


Task: You have to [extract] the following information:
Information To BE extracted is present in backticks: 

Paper Name
Focus Area of the paper
Date: month/year
Methodology: (eg Qualitative, Quantitative, Review, Conceptual, Report)
Key Findings: in 10 words what the paper has implemented/achieved
Application: real life potential use cases (summarize in 10 words or 1-2 points)
Challenges:  Drawbacks of this paper/approach (summarize 1-2 short points)
Opportunities: Future scope/possibilities of paper (summarize in 1-2 points)
Dataset: (dataset used in the paper if available else write none) 

Note: if including multiple points, insert newline character or write a unordered list in markdown wrapped as a string

Output Format is inside triple backticks: 
```
python
[
        
{
"paper_name": "name of the paper 0",
.
.
.
"opportunities": "..."
"datatset": "name of the dataset"
},
{
"paper_name": "name of the paper 1",
.
.
.
"opportunities": "..."
"datatset": "name of the dataset"
}
...
]

```
         
Names of the papers: 
<<<
A Survey on Graph Neural Networks for Time
Series: Forecasting, Classification, Imputation,
and Anomaly Detection, 
Attention Is All You Need,
Exploring Hierarchical Structures for Recommender Systems,
Heterogeneous Graph Attention Network, 
Hyperbolic Graph Attention Network, 
KGAT: Knowledge Graph Attention Network for Recommendation, 
Knowledge Graph Embedding Based on Graph Neural Network,
Research on the application of Nerual Network model in knowledge graph completion technology,
Attention based graph neural networks: a survey, 
Graph neural networks for visual question answering: a systematic review, 
A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions
>>>
""")

### Run the chain

In [30]:
result = qa_chain.invoke({'query': query})

The result is stored in the `result` object and can be accessed with the `'result'` key:

In [31]:
result['result']

'```python\n[\n{\n"paper_name": "A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection",\n"Focus Area of the paper": "Graph Neural Networks for Time Series",\n"Date": "12/2022",\n"Methodology": "Review",\n"Key Findings": "Surveyed GNNs for time series tasks",\n"Application": "Forecasting, classification, imputation, anomaly detection",\n"Challenges": "Limited data availability, model complexity",\n"Opportunities": "Explore new architectures, improve interpretability",\n"datatset": "None"\n},\n{\n"paper_name": "Attention Is All You Need",\n"Focus Area of the paper": "Attention Mechanism in Neural Networks",\n"Date": "06/2017",\n"Methodology": "Quantitative",\n"Key Findings": "Proposed Transformer architecture based on attention",\n"Application": "Machine translation, text summarization",\n"Challenges": "High computational cost, lack of interpretability",\n"Opportunities": "Explore new attention mechanisms, improve efficiency",\

### Our output is formatted so that it can be parsed as a Python list of dicts/JSON objects

In [32]:
Markdown(result['result'])

```python
[
{
"paper_name": "A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection",
"Focus Area of the paper": "Graph Neural Networks for Time Series",
"Date": "12/2022",
"Methodology": "Review",
"Key Findings": "Surveyed GNNs for time series tasks",
"Application": "Forecasting, classification, imputation, anomaly detection",
"Challenges": "Limited data availability, model complexity",
"Opportunities": "Explore new architectures, improve interpretability",
"datatset": "None"
},
{
"paper_name": "Attention Is All You Need",
"Focus Area of the paper": "Attention Mechanism in Neural Networks",
"Date": "06/2017",
"Methodology": "Quantitative",
"Key Findings": "Proposed Transformer architecture based on attention",
"Application": "Machine translation, text summarization",
"Challenges": "High computational cost, lack of interpretability",
"Opportunities": "Explore new attention mechanisms, improve efficiency",
"datatset": "WMT 2014 English-to-German translation task"
},
{
"paper_name": "Exploring Hierarchical Structures for Recommender Systems",
"Focus Area of the paper": "Hierarchical Structures in Recommender Systems",
"Date": "01/2019",
"Methodology": "Quantitative",
"Key Findings": "Proposed hierarchical attention network for recommendation",
"Application": "Recommending products, services, content",
"Challenges": "Data sparsity, cold-start problem",
"Opportunities": "Explore new hierarchical structures, improve personalization",
"datatset": "MovieLens 100K, Amazon Reviews"
},
{
"paper_name": "Heterogeneous Graph Attention Network",
"Focus Area of the paper": "Heterogeneous Graph Attention Networks",
"Date": "08/2019",
"Methodology": "Quantitative",
"Key Findings": "Proposed HGAN for heterogeneous graph learning",
"Application": "Node classification, link prediction",
"Challenges": "Handling different node types and relations",
"Opportunities": "Explore new attention mechanisms for heterogeneous graphs",
"datatset": "Amazon co-purchasing network, DBLP citation network"
},
{
"paper_name": "Hyperbolic Graph Attention Network",
"Focus Area of the paper": "Hyperbolic Graph Attention Networks",
"Date": "02/2020",
"Methodology": "Quantitative",
"Key Findings": "Proposed HGAT for learning on hyperbolic graphs",
"Application": "Node classification, link prediction",
"Challenges": "Handling non-Euclidean space, model complexity",
"Opportunities": "Explore new hyperbolic embedding techniques",
"datatset": "Cora, PubMed, Amazon"
},
{
"paper_name": "KGAT: Knowledge Graph Attention Network for Recommendation",
"Focus Area of the paper": "Knowledge Graph Attention Networks for Recommendation",
"Date": "06/2019",
"Methodology": "Quantitative",
"Key Findings": "Proposed KGAT for incorporating knowledge graphs into recommendation",
"Application": "Recommending products, services, content",
"Challenges": "Scalability, knowledge graph incompleteness",
"Opportunities": "Explore new ways to integrate knowledge graphs",
"datatset": "Amazon, MovieLens"
},
{
"paper_name": "Knowledge Graph Embedding Based on Graph Neural Network",
"Focus Area of the paper": "Knowledge Graph Embedding using GNNs",
"Date": "03/2020",
"Methodology": "Quantitative",
"Key Findings": "Proposed GNN-based embedding for knowledge graphs",
"Application": "Knowledge graph completion, question answering",
"Challenges": "Data sparsity, model complexity",
"Opportunities": "Explore new GNN architectures for knowledge graphs",
"datatset": "FB15k-237, WN18RR"
},
{
"paper_name": "Research on the application of Nerual Network model in knowledge graph completion technology",
"Focus Area of the paper": "Knowledge Graph Completion using Neural Networks",
"Date": "06/2020",
"Methodology": "Review",
"Key Findings": "Surveyed NN models for knowledge graph completion",
"Application": "Knowledge graph completion, question answering",
"Challenges": "Data sparsity, model complexity",
"Opportunities": "Explore new NN architectures for knowledge graphs",
"datatset": "None"
},
{
"paper_name": "Attention based graph neural networks: a survey",
"Focus Area of the paper": "Attention Mechanisms in Graph Neural Networks",
"Date": "06/2021",
"Methodology": "Review",
"Key Findings": "Surveyed attention-based GNNs",
"Application": "Node classification, link prediction, graph classification",
"Challenges": "Computational cost, interpretability",
"Opportunities": "Explore new attention mechanisms, improve efficiency",
"datatset": "None"
},
{
"paper_name": "Graph neural networks for visual question answering: a systematic review",
"Focus Area of the paper": "Graph Neural Networks for Visual Question Answering",
"Date": "08/2021",
"Methodology": "Review",
"Key Findings": "Surveyed GNNs for visual question answering",
"Application": "Visual question answering, image captioning",
"Challenges": "Data scarcity, model complexity",
"Opportunities": "Explore new GNN architectures for visual tasks",
"datatset": "None"
},
{
"paper_name": "A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions",
"Focus Area of the paper": "Graph Neural Networks",
"Date": "12/2020",
"Methodology": "Review",
"Key Findings": "Comprehensive review of GNNs",
"Application": "Node classification, link prediction, graph classification",
"Challenges": "Scalability, interpretability, data sparsity",
"Opportunities": "Explore new GNN architectures, improve interpretability",
"datatset": "None"
}
]
```
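
One possible way to turn such a reply into Python objects is to strip the surrounding code fence and parse the body as JSON. The sketch below assumes the model emitted valid JSON inside the fence; Python-only literals (single quotes, `None`) would make `json.loads` fail. The function `parse_fenced_json` and the shortened `raw` string are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # built indirectly so this example contains no literal fence


def parse_fenced_json(raw: str):
    """Strip an optional Markdown code fence (any language tag) and parse the body as JSON."""
    pattern = re.compile(
        r"^\s*" + FENCE + r"[A-Za-z]*\s*\n(.*?)\n\s*" + FENCE + r"\s*$",
        re.DOTALL,
    )
    match = pattern.match(raw)
    body = match.group(1) if match else raw
    return json.loads(body)


# Shortened, hypothetical stand-in for result['result'].
raw = FENCE + 'python\n[\n{"paper_name": "Attention Is All You Need", "Date": "06/2017"}\n]\n' + FENCE
papers = parse_fenced_json(raw)
print(papers[0]["paper_name"])
```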

# Final Automated Generation

A cleaner approach is to process the papers one at a time:
- each request stays within the model's limited context
- accuracy and generation quality may improve

In [33]:
system_promp_template = ("""
You are a [Research Assistant] bot. You help with creating [Literature Review]. 

Input :You will be given access to a [research paper]


Task: You have to [extract] the following information:
Information To BE extracted is present in backticks: 

Paper Name
Focus Area of the paper
Date: month/year
Methodology: (eg Qualitative, Quantitative, Review, Conceptual, Report)
Key Findings: in 10 words what the paper has implemented/achieved
Application: real life potential use cases (summarize in 10 words or 1-2 points)
Challenges:  Drawbacks of this paper/approach (summarize 1-2 short points)
Opportunities: Future scope/possibilities of paper (summarize in 1-2 points)
Dataset: (dataset used in the paper if available else write none) 

Note: if including multiple points, insert newline character or write a unordered list in markdown wrapped as a string

Output Format is inside triple backticks: 
```
json      
{
"paper_name": "name of the paper 0",
.
.
.
"opportunities": "...",
"dataset": "name of the dataset"
}
```
              
""")

In [35]:
user_query_template = """
{system_prompt_template}
Generate a literature review based on this research paper
{paper_name}
"""

In [45]:
from langchain.prompts import PromptTemplate

final_prompt_template= PromptTemplate.from_template(user_query_template)
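
The sketch below uses plain `str.format` as a stand-in for `PromptTemplate.format`, assuming the template uses the default f-string style placeholders. It shows why the literal JSON braces in the system prompt are safe: they are passed in as a value, and substituted values are not re-parsed for placeholders. The shortened `system_text` is hypothetical.

```python
# Plain str.format stands in for PromptTemplate.format in this sketch.
user_query = """
{system_prompt_template}
Generate a literature review based on this research paper
{paper_name}
"""

# Hypothetical, shortened system prompt that contains literal JSON braces.
system_text = 'Return JSON like {"paper_name": "..."}'

filled = user_query.format(
    system_prompt_template=system_text,
    paper_name="Attention Is All You Need",
)
print(filled)
```

Had the braces been written directly in the template string instead, they would need to be doubled (`{{` and `}}`) to be treated as literal text.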

**All our processing for the paper consolidated into one function** 

In [47]:
def process_paper(file_path, paper_name: str): 
    loader = PyPDFLoader(file_path)
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size= 1000, chunk_overlap=200)
    texts = text_splitter.split_documents(data)
    
    embeddings = GoogleGenerativeAIEmbeddings(model='models/embedding-001', google_api_key=GOOGLE_API_KEY)
    vector_store = Chroma.from_documents(documents=texts, embedding=embeddings)

    retriever = vector_store.as_retriever()
    
    llm = GoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=GOOGLE_API_KEY)

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff", 
        retriever=retriever,
        return_source_documents=True    
    )


    # Fill the user template with the system prompt defined above and the paper's name
    final_prompt = final_prompt_template.format(
        system_prompt_template=system_promp_template, 
        paper_name=paper_name
    )
    result = qa_chain.invoke({'query': final_prompt})
    
    return result
    

- A simple function to parse out the JSON
- Alternatively, you can set up `JsonOutputParser` from `langchain_core.output_parsers` (requires additional setup)

In [40]:
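import json

def extract_and_append_json(json_string):
    """Parse the model's reply into a dict; return None if it is not valid JSON."""
    # Remove a leading ```json (or bare ```) marker and a trailing ``` marker if present
    json_string = json_string.strip()
    if json_string.startswith("```json"):
        json_string = json_string[7:]
    elif json_string.startswith("```"):
        json_string = json_string[3:]
    if json_string.endswith("```"):
        json_string = json_string[:-3]
    
    # Parse the JSON string and hand the result back to the caller
    try:
        return json.loads(json_string)
    except json.JSONDecodeError as e:
        # Report the problem and signal failure with None
        print(e)
        return None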
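
Since the parser yields `None` when decoding fails, and the model may omit or rename fields, it can help to check each parsed record before using it. This is a minimal sketch: the key set is taken from the sample record shown later in this notebook, and `EXPECTED_KEYS` and `missing_keys` are hypothetical names.

```python
EXPECTED_KEYS = {
    "paper_name", "focus_area", "date", "methodology", "key_findings",
    "application", "challenges", "opportunities", "dataset",
}


def missing_keys(record) -> set[str]:
    """Return the expected keys that a parsed record lacks (all of them if it is not a dict)."""
    if not isinstance(record, dict):
        return set(EXPECTED_KEYS)
    return EXPECTED_KEYS - set(record)


# Hypothetical partial record, for illustration only.
record = {"paper_name": "Attention Is All You Need", "date": "6/2017"}
print(sorted(missing_keys(record)))
```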

Names of all the papers you're working with, in the same order as the numbered PDF files (`0.pdf`, `1.pdf`, ...)

In [39]:

paper_names = [
"Self-Supervised Learning of Graph Neural Networks",
"""A Survey on Graph Neural Networks for Time
Series: Forecasting, Classification, Imputation,
and Anomaly Detection""", 
"""Attention Is All You Need""",
"""Exploring Hierarchical Structures for Recommender Systems""",
"""Heterogeneous Graph Attention Network""", 
"""Hyperbolic Graph Attention Network""", 
"""KGAT: Knowledge Graph Attention Network for Recommendation""", 
"""Knowledge Graph Embedding Based on Graph Neural Network""",
"""Research on the application of Nerual Network model in knowledge graph completion technology""",
"""Attention based graph neural networks: a survey""", 
"""Graph neural networks for visual question answering: a systematic review""", 
"""A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions"""
]


Call the chain on each paper individually and parse the output into dicts, stored in `reslist`

In [None]:
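papers_directory = './papers'

# Stores the parsed result for each paper
reslist = []

# os.listdir returns names in arbitrary order, so sort the PDFs by their numeric file name
pdf_files = sorted(
    (f for f in os.listdir(papers_directory) if f.endswith(".pdf")),
    key=lambda f: int(f[:-4]),
)

for filename in pdf_files:
    file_path = os.path.join(papers_directory, filename)
    print(f"Processing: {file_path}")
    # Files are named 0.pdf, 1.pdf, ...; the number is the index into paper_names
    num = int(filename[:-4])

    result = process_paper(file_path, paper_names[num])

    print(f"Processed Paper: {paper_names[num]}")

    reslist.append(extract_and_append_json(result['result']))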

reslist
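
Because `os.listdir` returns entries in arbitrary order, one way to keep each file matched to the right title is to derive the index from the file name and sort numerically. The sketch below assumes the `<index>.pdf` naming used in this notebook; `pair_files_with_names` and the demo data are hypothetical.

```python
import tempfile
from pathlib import Path


def pair_files_with_names(directory: str, names: list[str]) -> list[tuple[str, str]]:
    """Pair '<index>.pdf' files with names[index], in numeric order."""
    pdfs = [p for p in Path(directory).glob("*.pdf") if p.stem.isdigit()]
    pdfs.sort(key=lambda p: int(p.stem))
    return [(p.name, names[int(p.stem)]) for p in pdfs if int(p.stem) < len(names)]


# Hypothetical directory contents and names, created only for this demo.
demo_names = ["Paper A", "Paper B", "Paper C"]
with tempfile.TemporaryDirectory() as tmp:
    for fname in ["2.pdf", "0.pdf", "1.pdf", "readme.txt"]:
        (Path(tmp) / fname).touch()
    pairs = pair_files_with_names(tmp, demo_names)

print(pairs)
```

Sorting by `int(p.stem)` rather than by the raw string also avoids `10.pdf` being ordered before `2.pdf`.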

`Accessing one object`

In [53]:
reslist[0]

{'paper_name': 'Self-Supervised Learning of Graph Neural Networks',
 'focus_area': 'Graph Neural Networks',
 'date': '12/2020',
 'methodology': 'Review',
 'key_findings': 'Survey of self-supervised learning methods for GNNs',
 'application': 'Improve GNN performance on tasks with limited labeled data',
 'challenges': '- Difficulty in designing effective self-supervision tasks for GNNs\n-  Limited empirical evaluation of self-supervised GNN methods',
 'opportunities': '- Explore new self-supervision tasks for GNNs\n- Develop more robust and scalable self-supervised GNN methods',
 'dataset': 'None'}

In [54]:
Markdown(reslist[0]['opportunities'])

- Explore new self-supervision tasks for GNNs
- Develop more robust and scalable self-supervised GNN methods

### We can also create another chain that takes this list of JSON objects and turns it into Markdown

Why not just return Markdown instead of JSON in the first place? 
- JSON allows each paper to be processed individually
- JSON can be fed into other chains easily
- it is a cleaner, more structured approach


In [55]:
create_markdown_template = """
Given a list of JSON objects inside triple backticks. \
Generate a table in [Markdown] using the keys as Table Columns

```{list_of_papers}```
"""

md_prompt = PromptTemplate.from_template(create_markdown_template)

fin_prompt = md_prompt.format(
    list_of_papers=reslist
)

In [56]:
llm = llm  # no-op: reuses the Gemini LLM instance created earlier

In [58]:
lit_review_md = llm.invoke(fin_prompt)

### Visualize: 

In [60]:
Markdown(lit_review_md)

| paper_name | focus_area | date | methodology | key_findings | application | challenges | opportunities | dataset |
|---|---|---|---|---|---|---|---|---|
| Self-Supervised Learning of Graph Neural Networks | Graph Neural Networks | 12/2020 | Review | Survey of self-supervised learning methods for GNNs | Improve GNN performance on tasks with limited labeled data | - Difficulty in designing effective self-supervision tasks for GNNs<br>-  Limited empirical evaluation of self-supervised GNN methods | - Explore new self-supervision tasks for GNNs<br>- Develop more robust and scalable self-supervised GNN methods | None |
| A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection | Graph Neural Networks for Time Series Analysis | August/2021 | Review | Surveys GNNs for time series analysis tasks | Forecasting, classification, imputation, anomaly detection in various domains | - Limited availability of large-scale time series graph datasets<br>- Difficulty in designing effective graph structures for complex time series data | - Development of novel GNN architectures for specific time series tasks<br>- Exploration of hybrid GNN models that combine different graph architectures | None |
| Attention Is All You Need | Natural Language Processing, Machine Translation | 6/2017 | Quantitative | Transformer model, self-attention mechanism, parallel computation | Machine translation, text summarization, question answering | Computational complexity, training data requirements | Improving efficiency, exploring new applications | WMT 2014 English-to-German, WMT 2014 English-to-French |
| Exploring Hierarchical Structures for Recommender Systems | Recommender Systems | 07/2022 | Conceptual | Exploring implicit hierarchies for recommender systems | Improved recommendation performance in various domains | - Difficulty in identifying implicit hierarchies<br>- Computational complexity of exploring hierarchies | - Development of efficient algorithms for hierarchy exploration<br>- Integration of hierarchical structures with other recommendation techniques | None |
| Heterogeneous Graph Attention Network | Graph Neural Networks | N/A | Quantitative | Proposed a heterogeneous graph attention network | Learning from heterogeneous graphs, real-world data | Scalability, handling large graphs | Further research on heterogeneous graph representation | None |
| Hyperbolic Graph Attention Network | Graph Neural Networks (GNNs) | 04/2022 | Quantitative | Hyperbolic space improves GNNs performance | Social network analysis, recommendation systems | Hyperbolic space requires more computation | Explore different hyperbolic space models | Cora, PubMed, Amazon |
| KGAT: Knowledge Graph Attention Network for Recommendation | Recommendation Systems | 00/2023 | Quantitative | Knowledge graph attention network for recommendation | Personalized recommendation systems, Improved recommendation accuracy | Computational complexity, Data sparsity | Extension to other domains, Integration with other recommendation techniques | None |
| Knowledge Graph Embedding Based on Graph Neural Networks | Knowledge Graph Completion | 12/2019 | Quantitative | Embedding knowledge graphs using graph neural networks | Improving knowledge graph completion accuracy | - Difficulty in handling large graphs<br>- Limited scalability for very large datasets | - Exploring new graph neural network architectures for better performance<br>- Integrating external information into the embedding process | None |
| Knowledge Graph Completion Based on Recurrent Neural Network | Knowledge Graph Completion | None | Quantitative | Proposed a recurrent neural network model for knowledge graph completion | Improve knowledge graph completion accuracy and efficiency | Limited by the size and complexity of the knowledge graph | Explore more complex neural network architectures and integrate external knowledge sources | None |
| Attention-based Graph Neural Networks: A Survey | Attention mechanisms in Graph Neural Networks (GNNs) | 01/2023 | Review | Categorization and analysis of attention-based GNNs | Node classification, link prediction, graph generation | - Lack of standardized evaluation metrics<br>- Scalability issues for large graphs | - Development of novel attention mechanisms<br>- Exploration of attention-based GNNs for complex tasks | None |
| Graph neural networks for visual question answering: a systematic review | Visual Question Answering (VQA) using Graph Neural Networks | November/2023 | Systematic Review | Graph networks enhance VQA performance. | Answering questions about images, Image captioning, Object recognition | Data scarcity and computational complexity | Improving graph network architectures, Exploring new applications | None |
| A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions | Graph Neural Networks | 1/2024 | Review | Survey of graph neural networks and their applications | -  Drug discovery<br>-  Social network analysis | -  Scalability of GNNs<br>-  Overfitting to specific graph structures | -  Development of more efficient GNN architectures<br>-  Application of GNNs to new domains | None | 
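
As an alternative to asking the LLM for the table, the same layout can be produced deterministically in plain Python, which guarantees that no field values are altered along the way. This is a sketch under the assumption that every record is a flat dict with the same keys; it is not the approach used in this notebook, and `to_markdown_table` and the demo records are hypothetical.

```python
def to_markdown_table(records: list[dict]) -> str:
    """Render a list of flat dicts as a Markdown table, using the first record's keys as columns."""
    if not records:
        return ""
    columns = list(records[0].keys())

    def cell(value) -> str:
        # Newlines and pipes would break the table layout, so replace or escape them.
        return str(value).replace("|", "\\|").replace("\n", "<br>")

    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "---|" * len(columns),
    ]
    for rec in records:
        lines.append("| " + " | ".join(cell(rec.get(col, "")) for col in columns) + " |")
    return "\n".join(lines)


# Hypothetical two-column records, for illustration only.
demo = [
    {"paper_name": "Attention Is All You Need", "challenges": "- cost\n- data"},
    {"paper_name": "KGAT", "challenges": "Scalability"},
]
table = to_markdown_table(demo)
print(table)
```

The result can be rendered in the notebook with `Markdown(table)`, just like the LLM-generated version.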


Done