# Notion Data Connector and OpenAI Integration (with LlamaIndex)

This notebook walks through the Notion data connector in LlamaIndex and its integration with OpenAI. We'll cover setup, loading Notion pages, indexing, querying and summarization, writing results back to Notion, and exporting them as PDF.

## 1. Introduction and Setup

### 1.1 Introduction to Data Connectors in LlamaIndex

Data connectors in LlamaIndex import data from external sources into your AI applications. They bridge external data repositories (such as Notion, Google Drive, or Slack) and LlamaIndex by converting each source's content into `Document` objects that can be indexed and queried.

Key benefits of data connectors include:
- Easy access to data from multiple platforms
- Standardized data ingestion process
- Ability to keep your indexed data up to date with the latest information

In this notebook, we'll focus on the Notion data connector, demonstrating how to leverage Notion's rich document structure in AI applications.
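To make the connector idea concrete, here is a minimal sketch of the pattern: a reader takes source identifiers and returns text-plus-metadata objects. `SimpleDocument` and `InMemoryReader` are hypothetical stand-ins written for illustration; they are not LlamaIndex classes, and the page IDs are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class InMemoryReader:
    """Hypothetical connector that 'loads' pages from a dict instead of an API."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self._pages = pages

    def load_data(self, page_ids: List[str]) -> List[SimpleDocument]:
        # One document per requested page, tagged with its source ID.
        return [
            SimpleDocument(text=self._pages[pid], metadata={"page_id": pid})
            for pid in page_ids
        ]


reader = InMemoryReader({"a1": "Project notes", "b2": "Meeting minutes"})
docs = reader.load_data(page_ids=["a1", "b2"])
print([(d.metadata["page_id"], d.text) for d in docs])
```

Every connector exposes the same `load_data`-style entry point, so the indexing code downstream does not need to know where the text came from.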

### 1.2 Setup and Installation

First, let's install the necessary packages and import the required modules.

In [None]:
!pip install llama-index llama-index-readers-notion openai fpdf

import os
import logging
import sys
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.readers.notion import NotionPageReader
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate
from IPython.display import Markdown, display
from fpdf import FPDF


# Set up logging to stdout. basicConfig already attaches a stdout handler,
# so adding a second StreamHandler would print every message twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)



In [None]:
import os
import openai
os.environ["OPENAI_API_KEY"] = "" # Add your API Key here.
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
os.environ["NOTION_INTEGRATION_TOKEN"] = ""  # Add your Notion integration token here; never commit a real token.
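Because both services are configured through environment variables, it can help to fail early when one is missing and to avoid printing secrets in full. The sketch below is one way to do that; `require_env` and `mask` are hypothetical helper names, and `DEMO_TOKEN` is a dummy variable used only for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Show only the last few characters so logs never contain the full secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Demonstration with a dummy value (not a real credential).
os.environ["DEMO_TOKEN"] = "secret_example_1234"
print(mask(require_env("DEMO_TOKEN")))
```

In this notebook the same check could be applied to `OPENAI_API_KEY` and `NOTION_INTEGRATION_TOKEN` before any API call is made.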


## 2. Notion Data Connector

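Now, let's explore the Notion data connector in detail. When no token is passed explicitly, `NotionPageReader` reads it from the `NOTION_INTEGRATION_TOKEN` environment variable set above.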

In [None]:
notion_reader = NotionPageReader()

### 2.1 Loading Data from Notion Pages

We can load data from specific Notion pages using their IDs.

In [None]:
page_ids = ["115eb6c4652280419508d37521969b68", "115eb6c4652280478a32e7ca6b1d82c0"]
page_documents = notion_reader.load_data(page_ids=page_ids)

print(f"Loaded {len(page_documents)} documents from Notion pages")

Loaded 2 documents from Notion pages
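Page IDs are usually copied out of a page's URL, where they appear as the trailing 32 hexadecimal characters. The sketch below assumes that URL layout; `extract_page_id` and `to_dashed` are hypothetical helpers, and the example URL is invented around one of the IDs used above.

```python
import re
import uuid


def extract_page_id(url_or_id: str) -> str:
    """Return the 32-character hex page ID from a Notion URL, dashed UUID, or raw ID."""
    # Drop the query string and fragment, then look only at the last path segment.
    cleaned = url_or_id.split("?")[0].split("#")[0].rstrip("/")
    last_segment = cleaned.rsplit("/", 1)[-1]
    # Remove dashes (title separators or UUID dashes) and keep the trailing 32 characters.
    candidate = last_segment.replace("-", "")[-32:]
    if not re.fullmatch(r"[0-9a-fA-F]{32}", candidate):
        raise ValueError(f"No 32-character hex ID found in: {url_or_id!r}")
    return candidate.lower()


def to_dashed(page_id: str) -> str:
    """Format a 32-character hex ID as a dashed UUID string."""
    return str(uuid.UUID(page_id))


# Hypothetical URL built around a page ID from this notebook.
example_url = "https://www.notion.so/My-Page-Title-115eb6c4652280419508d37521969b68?pvs=4"
page_id = extract_page_id(example_url)
print(page_id)
print(to_dashed(page_id))
```

The resulting IDs can be passed in the `page_ids` list given to `load_data`.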


In [None]:
for doc in page_documents:
    print(f"Document ID: {doc.doc_id}")
    print(f"Document content: {doc.text[:100]}...")  # Print first 100 characters of the text

    if doc.metadata:
        print("Metadata:")
        for key, value in doc.metadata.items():
            print(f"  {key}: {value}")

    # extra_info is a legacy alias for metadata, so this repeats the values above
    if doc.extra_info:
        print("Extra Info:")
        for key, value in doc.extra_info.items():
            print(f"  {key}: {value}")

    print("---")

Document ID: 115eb6c4652280419508d37521969b68
Document content: Multimodal AI: Vision-Language Models
[Last updated: September 28, 2024]
Research Overview
Developin...
Metadata:
  page_id: 115eb6c4652280419508d37521969b68
Extra Info:
  page_id: 115eb6c4652280419508d37521969b68
---
Document ID: 115eb6c4652280478a32e7ca6b1d82c0
Document content: Multimodal AI: Experimental Results and Future Directions
Recent Experimental Findings
Just finished...
Metadata:
  page_id: 115eb6c4652280478a32e7ca6b1d82c0
Extra Info:
  page_id: 115eb6c4652280478a32e7ca6b1d82c0
---


## 3. Data Processing and Indexing

Now that we have our Notion data, let's process and index it for efficient querying.

In [None]:
index = VectorStoreIndex.from_documents(page_documents)
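Conceptually, a vector index stores an embedding for each chunk of text and answers a query by returning the chunks whose embeddings are most similar to the query's. The sketch below illustrates that idea with a toy word-count "embedding" and cosine similarity. It is an illustration only: the real index uses a learned embedding model, and the `embed`, `cosine`, and `top_k` names and the sample chunks are made up.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, int]:
    """Toy embedding: lowercase word counts (a real index uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "contrastive learning aligns image and text features",
    "the paper submission deadline is in october",
]
best = top_k("when is the submission deadline", chunks, k=1)
print(best[0][1])
```

The retrieved chunks are what the LLM later receives as context when a question is asked.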

In [None]:
index.storage_context.persist("notion_index")

In [None]:
storage_context = StorageContext.from_defaults(persist_dir="notion_index")
loaded_index = load_index_from_storage(storage_context)
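Persisting the index pays off when later runs reuse it instead of re-embedding every document. A minimal sketch of that "load if present, otherwise build" pattern follows; `load_or_build` is a hypothetical helper, and the build and load steps are passed in as callables so the example runs without any API access.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str, build: Callable[[], T], load: Callable[[str], T]) -> T:
    """Reuse a persisted index if the directory exists and is non-empty; otherwise build it."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build()


# In this notebook the callables might wrap the calls from the cells above, e.g.
#   build = lambda: VectorStoreIndex.from_documents(page_documents)
#   load = lambda d: load_index_from_storage(StorageContext.from_defaults(persist_dir=d))
# (the build callable would also need to persist the new index).

# Demonstration with stand-in callables and a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, build=lambda: "built", load=lambda d: "loaded")
    # Simulate a persisted index by writing a file into the directory.
    with open(os.path.join(tmp, "docstore.json"), "w") as fh:
        fh.write("{}")
    second = load_or_build(tmp, build=lambda: "built", load=lambda d: "loaded")

print(first, second)
```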

## 4. OpenAI Integration

We'll now set up the OpenAI integration for advanced querying and summarization.

In [None]:
llm = OpenAI(temperature=0.7, model="gpt-3.5-turbo")

In [None]:
query_engine = loaded_index.as_query_engine(
    llm=llm,
    response_mode="tree_summarize"
)
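The `tree_summarize` response mode combines retrieved chunks bottom-up: groups of chunks are summarized, then the summaries are summarized, until a single answer remains. The sketch below shows only that recursive idea with a stand-in summarizer. It is not LlamaIndex's implementation (which, as far as I know, packs chunks by the model's context window rather than a fixed group size), and `tree_summarize`, `fan_in`, and `fake_summarize` are hypothetical names.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarize groups of `fan_in` texts level by level until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        # Each pass replaces every group of texts with one summary.
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        if len(level) == 1:
            return level[0]


def fake_summarize(group: List[str]) -> str:
    """Stand-in for an LLM call: just shows which texts were combined."""
    return "(" + " + ".join(group) + ")"


result = tree_summarize(["a", "b", "c"], fake_summarize, fan_in=2)
print(result)
```

The nesting in the printed result mirrors the tree: leaves are combined first, and the root holds the final answer.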

In [None]:
custom_prompt = PromptTemplate(
    "You are an AI assistant answering questions about Notion documents. "
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
)
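To see what the LLM actually receives, it can help to fill the two placeholders by hand. The sketch below uses plain `str.format` on the same template text; `render_prompt` is a hypothetical helper, the sample chunks are invented, and the real engine performs this substitution internally with the retrieved context.

```python
from typing import List

TEMPLATE = (
    "You are an AI assistant answering questions about Notion documents. "
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
)


def render_prompt(context_chunks: List[str], question: str) -> str:
    """Join the retrieved chunks and substitute both placeholders."""
    context_str = "\n\n".join(context_chunks)
    return TEMPLATE.format(context_str=context_str, query_str=question)


rendered = render_prompt(
    ["Chunk one about model training.", "Chunk two about evaluation."],
    "What was evaluated?",
)
print(rendered)
```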

In [None]:
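# tree_summarize builds its prompts from summary_template, so the custom
# prompt is passed there (text_qa_template is not used by this response mode)
query_engine = loaded_index.as_query_engine(
    llm=llm,
    summary_template=custom_prompt,
    response_mode="tree_summarize"
)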

## 5. Querying the Notion Data

Let's start with some basic queries to our Notion data.

In [None]:
response1 = query_engine.query("What are the main research findings or conclusions from these Notion documents?")
display(Markdown(f"**Key Findings:** {response1}"))

**Key Findings:** The research findings and conclusions from the Notion documents include the development of advanced vision-language models for multimodal AI applications with a focus on improving zero-shot image classification, enhancing cross-modal attention mechanisms, and optimizing fine-tuning for domain-specific tasks. The experiments have shown promising results in utilizing CLIP-inspired contrastive learning, improving alignment between visual and textual features through cross-modal attention, and achieving a performance boost through fine-tuning on domain-specific datasets. Ablation studies have revealed the contributions of different attention mechanisms and pretraining objectives to model performance, while challenges such as computational resources, data quality, and evaluation metrics have been identified. Future research directions involve exploring few-shot learning, multimodal reasoning, temporal understanding, and ethical considerations in AI systems. The goal is to address gaps in performance, scalability, and real-world applicability of vision-language models for multimodal AI tasks.

In [None]:
response2 = query_engine.query("What research methods or approaches are mentioned in these documents?")
display(Markdown(f"**Research Methods:** {response2}"))

**Research Methods:** The research methods or approaches mentioned in these documents include developing advanced vision-language models for multimodal AI applications, improving zero-shot image classification, enhancing cross-modal attention mechanisms, optimizing fine-tuning for domain-specific tasks, contrastive learning approach for training, experimenting with different attention mechanisms (self, cross, multi-head), using a Transformer-based architecture with cross-modal attention layers, implementing curriculum learning for more efficient training, exploring few-shot learning capabilities in new domains, investigating potential for image generation tasks, developing interpretability tools for cross-modal attention patterns, applying the model to video understanding, exploring attention mechanisms for long-range temporal dependencies, and exploring neuro-symbolic approaches to enhance logical reasoning.

## 6. Advanced Querying and Analysis

Now, let's perform some more advanced queries and analysis on our Notion data.

In [None]:
response3 = query_engine.query("Are there any significant dates, deadlines, or timelines mentioned in the research notes?")
display(Markdown(f"**Important Dates:** {response3}"))

**Important Dates:** Important dates mentioned in the research notes include:
- March 15, 2024: Project kickoff
- April 30, 2024: Initial data collection complete
- May 1 - July 31, 2024: First round of model training
- August 1 - September 15, 2024: Comprehensive evaluation and analysis
- October 1, 2024: Target date for paper submission (ICLR deadline)

In [None]:
response4 = query_engine.query("What open questions or areas for further research are identified in these documents?")
display(Markdown(f"**Future Research Directions:** {response4}"))

**Future Research Directions:** Further research areas identified in the documents include exploring few-shot learning in new domains, investigating image generation tasks using the vision-language model, developing interpretability tools for cross-modal attention patterns, potential application of the model to video understanding, addressing challenges related to computational resources and data quality, improving evaluation metrics for multimodal understanding, enhancing multimodal reasoning capabilities, extending the model for temporal understanding in videos, incorporating ethical considerations for bias mitigation in multimodal systems, and exploring new research directions such as pretraining on scientific papers, multi-task learning, and cross-lingual multimodal models.

In [None]:
response5 = query_engine.query("What are the most frequently cited sources or references in these research documents?")
display(Markdown(f"**Key References:** {response5}"))

**Key References:** Radford, A., et al. (2021), Chen, Y., et al. (2020), and Gebru, T., et al. (2020) are among the most frequently cited sources or references in these research documents.
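The five queries above follow the same pattern, so they can also be run in one pass and collected under labels. This is a small sketch of that idea: `run_queries` is a hypothetical helper, the labels are chosen for illustration, and a stand-in `fake_ask` function replaces `query_engine.query` so the example runs offline.

```python
from typing import Callable, Dict

QUESTIONS = {
    "findings": "What are the main research findings or conclusions from these Notion documents?",
    "methods": "What research methods or approaches are mentioned in these documents?",
    "dates": "Are there any significant dates, deadlines, or timelines mentioned in the research notes?",
    "future_research": "What open questions or areas for further research are identified in these documents?",
    "references": "What are the most frequently cited sources or references in these research documents?",
}


def run_queries(questions: Dict[str, str], ask: Callable[[str], object]) -> Dict[str, str]:
    """Ask every question and store the answer text under its label."""
    return {label: str(ask(question)) for label, question in questions.items()}


def fake_ask(question: str) -> str:
    """Stand-in for query_engine.query; returns a placeholder answer."""
    return f"answer to: {question}"


answers = run_queries(QUESTIONS, fake_ask)
print(sorted(answers))
```

With the real engine, the call would be `run_queries(QUESTIONS, query_engine.query)`.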

## 7. Comprehensive Summarization

Let's generate a comprehensive summary of all the Notion documents.

In [None]:

# Create a custom prompt template for the final summary
summary_prompt = PromptTemplate(
    """Based on the following information from research documents, create a comprehensive and insightful summary:

Key Findings: {findings}
Research Methods: {methods}
Important Dates: {dates}
Future Research Directions: {future_research}
Key References: {references}

Please synthesize this information into a coherent summary that highlights the most important aspects of the research, identifies any patterns or connections between different elements, and provides a holistic overview of the work. Structure the summary with appropriate headings and ensure it flows logically.

Summary:
"""
)

# Fill the template with the answers collected in the previous sections
summary_request = summary_prompt.format(
    findings=str(response1),
    methods=str(response2),
    dates=str(response3),
    future_research=str(response4),
    references=str(response5),
)

# Generate the summary directly with the LLM; the earlier answers
# already carry the context retrieved from the Notion documents
response = llm.complete(summary_request)

# Display the generated summary
print("AI-Generated Research Summary:")
print(response)

AI-Generated Research Summary:
The research project focuses on developing advanced vision-language models for multimodal AI applications. The key objectives include improving zero-shot image classification, enhancing cross-modal attention mechanisms, and optimizing fine-tuning for domain-specific tasks. The project utilizes a curated dataset of image-text pairs, a Transformer-based architecture with cross-modal attention layers, and a contrastive learning approach for training. Initial findings show promising results with CLIP-inspired contrastive learning and the significance of cross-modal attention in aligning visual and textual features. The project aims to explore few-shot learning capabilities, investigate image generation tasks, and develop interpretability tools for cross-modal attention patterns. Future directions also include exploring multimodal reasoning, handling temporal dependencies in video inputs, and addressing ethical considerations in multimodal systems. The project

Let's also append this summary to our Notion page.

In [None]:
import requests

# Use the existing Notion integration token
NOTION_API_KEY = os.environ["NOTION_INTEGRATION_TOKEN"]

# Append a paragraph block containing the summary to the end of a Notion page
def update_notion_page(page_id, summary_text):
    url = f"https://api.notion.com/v1/blocks/{page_id}/children"
    headers = {
        "Authorization": f"Bearer {NOTION_API_KEY}",
        "Content-Type": "application/json",
        "Notion-Version": "2022-06-28"  # Pin the API version this code was written against
    }

    # Prepare the block to append. Notion limits each text object to 2000
    # characters, so longer summaries must be split across several items.
    data = {
        "children": [
            {
                "object": "block",
                "type": "paragraph",
                "paragraph": {
                    "rich_text": [{"type": "text", "text": {"content": summary_text}}]
                }
            }
        ]
    }

    # json= serializes the payload; the timeout keeps a stalled request from hanging the cell
    response = requests.patch(url, headers=headers, json=data, timeout=30)
    return response.json()

# Assuming 'response' contains your generated summary
summary_text = str(response)  # Convert the response to a string if it's not already

# Update the first Notion page with the summary
first_page_id = page_ids[0]  # Get the ID of the first page
result = update_notion_page(first_page_id, summary_text)

if 'results' in result:
    print(f"Successfully updated Notion page: {first_page_id}")
else:
    print(f"Failed to update Notion page. Error: {result.get('message', 'Unknown error')}")

Successfully updated Notion page: 115eb6c4652280419508d37521969b68
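A single Notion text object holds at most 2000 characters, so a longer summary has to be split before it is appended. The sketch below builds a list of paragraph blocks that each stay within that limit, breaking at spaces where possible. `split_text` and `paragraph_blocks` are hypothetical helpers, and the sample text is filler.

```python
from typing import Dict, List

MAX_TEXT_CHARS = 2000  # Notion's limit for the content of one text object


def split_text(text: str, limit: int = MAX_TEXT_CHARS) -> List[str]:
    """Split text into pieces of at most `limit` characters, preferring spaces."""
    pieces: List[str] = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the piece within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: fall back to a hard cut
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces


def paragraph_blocks(text: str) -> List[Dict]:
    """Wrap each piece of text in a Notion paragraph block."""
    return [
        {
            "object": "block",
            "type": "paragraph",
            "paragraph": {
                "rich_text": [{"type": "text", "text": {"content": piece}}]
            },
        }
        for piece in split_text(text)
    ]


payload = {"children": paragraph_blocks("word " * 1000)}
print(len(payload["children"]))
```

The resulting `payload` has the same shape as the `data` dictionary sent in the request above, with one paragraph block per piece.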


## 8. Exporting Results as PDF

Now, we'll export our summary as a PDF.

In [None]:
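def export_to_pdf(content, filename="notion_summary.pdf"):
    """Write plain text to a single-page-flow PDF using a built-in font."""
    # The built-in Arial font only covers Latin-1; replace anything else
    # (e.g. curly quotes in LLM output) so pdf.output() does not fail.
    safe_content = content.encode("latin-1", "replace").decode("latin-1")
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    pdf.multi_cell(0, 10, safe_content)
    pdf.output(filename)
    print(f"Content exported as {filename}")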
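The built-in PDF fonts in the classic `fpdf` package cover only Latin-1, as far as I know, while LLM output often contains typographic quotes, dashes, and ellipses. The sketch below maps a few common characters to plain equivalents before falling back to `?` for anything else. `to_latin1_safe` and the `REPLACEMENTS` table are hypothetical and deliberately small.

```python
# Common typographic characters and plain-text stand-ins (not exhaustive).
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
}


def to_latin1_safe(text: str) -> str:
    """Replace known typographic characters, then drop anything outside Latin-1."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    # Characters still outside Latin-1 become '?' instead of raising an error.
    return text.encode("latin-1", errors="replace").decode("latin-1")


sample = "\u201cCLIP\u201d \u2013 r\u00e9sum\u00e9\u2026"
cleaned = to_latin1_safe(sample)
print(cleaned)
```

Passing the cleaned string to `export_to_pdf` keeps accented Latin-1 characters intact while avoiding encoding failures.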

In [None]:
export_to_pdf(str(response), "notion_comprehensive_summary.pdf")

Content exported as notion_comprehensive_summary.pdf


In [None]:
additional_insights = query_engine.query("Provide additional insights, trends, or patterns observed across all documents that weren't included in the main summary.")
export_to_pdf(str(additional_insights), "notion_additional_insights.pdf")

Content exported as notion_additional_insights.pdf



## Conclusion and Next Steps

In this notebook, we've accomplished:
1. Loading and indexing research data from Notion
2. Querying the index and generating comprehensive summaries using LlamaIndex and OpenAI
3. Appending the generated summary to a Notion page
4. Exporting results as a PDF summary

Next steps:
1. Integrate with Overleaf for LaTeX-based report generation
2. Implement automated literature review features
3. Develop interactive visualizations of research findings
4. Explore multi-language support for international research
5. Optimize performance for larger datasets
