# Scholarly Article Analysis with Gemini AI

**Author**: Vincent

This notebook implements a system for analyzing scholarly articles using:
- Gemini AI for analysis and understanding
- PDF extraction for full text analysis
- Local storage for caching results

## Setup
First, install the required packages. The API key is configured in a later cell.

In [4]:
%pip install -U -q google-generativeai pypdf2 requests numpy pandas beautifulsoup4 scholarly fake-useragent pillow

Note: you may need to restart the kernel to use updated packages.


## Import Libraries and Configure API
Import the libraries used throughout the notebook. The Gemini client itself is configured in the next section.

In [5]:
import os
import json
import numpy as np
import pandas as pd
import google.generativeai as genai
from PyPDF2 import PdfReader
import requests
from bs4 import BeautifulSoup
from typing import List, Dict, Optional
import re
from datetime import datetime
from fake_useragent import UserAgent
import io
from PIL import Image
import base64

## Set Up for Gemini
*When running on Kaggle, disable the file-based API key loading in this cell.*

In that case, keep only the `model` assignment so the Gemini model type is still defined.


In [6]:
# Set up Google Generative AI API key
def get_api_key_from_file(filepath='api_key_google'):
    """Extract API key from the api_key_google file"""
    try:
        with open(filepath, 'r') as f:
            content = f.read()
            key_match = re.search(r'key=([^"\s]+)', content)
            if key_match:
                return key_match.group(1)
            raise ValueError("API key not found in file")
    except Exception as e:
        print(f"Error reading API key: {e}")
        return None

# Configure Gemini AI
GOOGLE_API_KEY = get_api_key_from_file()
if not GOOGLE_API_KEY:
    raise ValueError("Could not get API key from file")
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize the model
model = genai.GenerativeModel('gemini-2.0-flash-exp') #DO NOT CHANGE THIS LINE
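
The key lookup above depends on the file containing a `key=...` fragment. As a minimal sketch of that parsing step in isolation, the snippet below applies the same regular expression to an in-memory string. The helper name `parse_api_key` and the sample contents are hypothetical, and the key value is a placeholder.

```python
import re
from typing import Optional


def parse_api_key(content: str) -> Optional[str]:
    """Return the text following 'key=' up to the next quote or whitespace."""
    match = re.search(r'key=([^"\s]+)', content)
    return match.group(1) if match else None


# Hypothetical file contents; PLACEHOLDER_KEY_123 is not a real key.
sample = 'curl "https://example.invalid/v1/models?key=PLACEHOLDER_KEY_123"'
print(parse_api_key(sample))         # the placeholder key, without the closing quote
print(parse_api_key("no key here"))  # None, since there is no key= fragment
```

A file that holds only the bare key, with no `key=` prefix, would not match this pattern, so `get_api_key_from_file` would return `None` for it.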

In [7]:
def get_arxiv_info(url: str) -> Dict[str, str]:
    """Get paper metadata from arXiv URL"""
    try:
        # Extract arXiv ID from URL
        arxiv_id = re.search(r'\d+\.\d+', url)
        if not arxiv_id:
            return {}
            
        # Convert PDF URL to abstract URL
        abstract_url = f"https://arxiv.org/abs/{arxiv_id.group()}"
        
        # Get the abstract page with proper headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(abstract_url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract title and abstract using more robust selectors
        title_elem = soup.find('h1', class_='title mathjax')
        abstract_elem = soup.find('blockquote', class_='abstract mathjax')
        
        if not title_elem or not abstract_elem:
            print(f"Could not find title or abstract elements on page: {abstract_url}")
            return {}
            
        title = title_elem.text.replace('Title:', '').strip()
        abstract = abstract_elem.text.replace('Abstract:', '').strip()
        
        print(f"Successfully fetched metadata for arXiv paper: {title}")
        
        return {
            'title': title,
            'abstract': abstract,
            'pdf_url': url
        }
    except Exception as e:
        print(f"Error fetching arXiv metadata: {e}")
        return {}

class ScholarlyAnalyzer:
    def __init__(self, cache_file="paper_cache.json"):
        self.model = model
        self.cache_file = cache_file
        self.ua = UserAgent()
        self.load_cache()
        
    def load_cache(self):
        """Load cached paper data"""
        try:
            with open(self.cache_file, 'r') as f:
                self.cache = json.load(f)
        except FileNotFoundError:
            self.cache = {}
            
    def save_cache(self):
        """Save paper data to cache"""
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get_pdf_text(self, url: str) -> str:
        """Download and extract text from PDF"""
        try:
            headers = {'User-Agent': self.ua.random}
            response = requests.get(url, headers=headers, timeout=10)
            
            # Content-Type may carry parameters (e.g. "; charset=..."), so test for a substring
            if 'application/pdf' not in response.headers.get('content-type', '').lower():
                print(f"URL does not point to a PDF: {url}")
                return ''
            
            with open('temp.pdf', 'wb') as f:
                f.write(response.content)
            
            reader = PdfReader('temp.pdf')
            text = ''
            for page in reader.pages:
                text += page.extract_text() or ''  # Guard against pages with no extractable text
            
            os.remove('temp.pdf')  # Clean up
            return text
        except Exception as e:
            print(f"Error downloading/processing PDF: {e}")
            return ''

    def extract_images_from_pdf(self, pdf_path: str) -> List[Dict]:
        """Extract images from PDF using PyPDF2 and return them with their captions"""
        images = []
        try:
            reader = PdfReader(pdf_path)
            for page_num, page in enumerate(reader.pages):
                text = page.extract_text()
                
                # Look for image-related text/captions
                captions = []
                caption_patterns = [
                    r"(?:Figure|Fig\.)\s+\d+[.:]\s*[^\n]+",
                    r"Table\s+\d+[.:]\s*[^\n]+"
                ]
                
                for pattern in caption_patterns:
                    matches = re.finditer(pattern, text)
                    captions.extend(match.group() for match in matches)
                
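                # Try to extract images if the page has captions.
                # Image XObjects are listed under the page's /Resources dictionary.
                resources = page["/Resources"] if "/Resources" in page else {}
                if captions and "/XObject" in resources:
                    try:
                        xObject = resources["/XObject"]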
                        for obj in xObject:
                            if xObject[obj]["/Subtype"] == "/Image":
                                try:
                                    image_data = xObject[obj].get_data()
                                    image = Image.frombytes(
                                        mode="RGB",
                                        size=(xObject[obj]["/Width"], xObject[obj]["/Height"]),
                                        data=image_data
                                    )
                                    
                                    # Convert to base64
                                    buffered = io.BytesIO()
                                    image.save(buffered, format="PNG")
                                    img_base64 = base64.b64encode(buffered.getvalue()).decode()
                                    
                                    # Store image with caption if available
                                    images.append({
                                        'image_base64': img_base64,
                                        'page_number': page_num + 1,
                                        'caption': captions[0] if captions else "",  # Use first caption found
                                        'surrounding_text': text[:500]  # Store some context
                                    })
                                except Exception as e:
                                    print(f"Error processing image on page {page_num + 1}: {e}")
                                    continue
                    except Exception as e:
                        print(f"Error extracting images from page {page_num + 1}: {e}")
                        continue
            
            return images
        except Exception as e:
            print(f"Error extracting images from PDF: {e}")
            return []

    def analyze_figure(self, image_data: Dict) -> str:
        """Analyze a figure/image using the flash-exp model"""
        try:
            # Prepare the image and context for analysis
            prompt = f"""Please analyze this academic figure/image in detail:
            
            Caption: {image_data['caption']}
            Context: {image_data['surrounding_text']}
            
            Provide:
            1. Figure Type: What kind of visualization is this?
            2. Key Components: What are the main elements shown?
            3. Main Findings: What is the key message or result being conveyed?
            4. Technical Details: Any specific metrics, measurements, or technical aspects shown?
            5. Relationship to Paper: How does this support the paper's arguments?
            
            Be specific and technical in your analysis."""
            
            # Create multimodal prompt with both image and text
            response = self.model.generate_content([
                {
                    "mime_type": "image/png",
                    "data": image_data['image_base64']
                },
                {
                    "text": prompt
                }
            ])
            
            return response.text
        except Exception as e:
            print(f"Error analyzing figure: {e}")
            return "Could not analyze this figure"

    def analyze_arxiv_paper(self, url: str) -> Dict:
        """Analyze an arXiv paper from its URL"""
        # Get paper metadata
        paper_info = get_arxiv_info(url)
        if not paper_info:
            print("Could not fetch paper metadata from arXiv")
            return {}
        
        # Analyze the paper
        return self.analyze_paper(
            url=paper_info['pdf_url'],
            title=paper_info['title'],
            abstract=paper_info['abstract']
        )

    def analyze_paper(self, url: str, title: str = "", abstract: str = "") -> Dict:
        """Analyze a paper using Gemini AI"""
        # If it's an arXiv URL and no title/abstract provided, try to fetch them
        if 'arxiv.org' in url and not (title and abstract):
            paper_info = get_arxiv_info(url)
            if paper_info:
                title = paper_info['title']
                abstract = paper_info['abstract']
        
        cache_key = url or title
        
        # Check cache first
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        # Get full text and images if URL is provided
        full_text = ""
        figures = []
        figure_analyses = []  # Stays empty when no PDF is processed
        if url and url.lower().endswith('.pdf'):
            try:
                headers = {'User-Agent': self.ua.random}
                response = requests.get(url, headers=headers, timeout=10)
                
                with open('temp.pdf', 'wb') as f:
                    f.write(response.content)
                
                # Extract text
                reader = PdfReader('temp.pdf')
                for page in reader.pages:
                    full_text += page.extract_text() or ''
                
                # Extract and analyze figures
                figures = self.extract_images_from_pdf('temp.pdf')
                figure_analyses = []
                
                for figure in figures:
                    analysis = self.analyze_figure(figure)
                    figure_analyses.append({
                        'page_number': figure['page_number'],
                        'caption': figure['caption'],
                        'analysis': analysis
                    })
                
                os.remove('temp.pdf')  # Clean up
            except Exception as e:
                print(f"Error processing PDF: {e}")
                
        # Prepare text for analysis
        text_to_analyze = full_text if full_text else abstract
        if not text_to_analyze:
            return {}
            
        try:
            # Generate comprehensive analysis using Gemini.
            # Only the first 2000 characters are sent, to limit prompt length.
            prompt = f"""Analyze this scholarly text and provide a structured response with these components:
            Title: {title}
            Text: {text_to_analyze[:2000]}
            
            Please provide:
            1. Key findings (3-5 bullet points)
            2. Main methodology used
            3. Potential applications
            4. Limitations or gaps identified
            5. Technical complexity rating (1-10)
            
            Format the response in a clear, structured way."""
            
            response = self.model.generate_content(prompt)
            
            # Extract entities and relationships
            entities_prompt = f"""From this text, extract:
            1. Author names (if mentioned)
            2. Research institutions (if mentioned)
            3. Key technical terms
            4. Dataset names (if mentioned)
            5. Publication year (if mentioned)
            
            Text: {text_to_analyze[:1000]}"""
            
            entities_response = self.model.generate_content(entities_prompt)
            
            # Combine results
            analysis = {
                'title': title,
                'url': url,
                'abstract': abstract,
                'analysis': response.text,
                'entities': entities_response.text,
                'has_full_text': bool(full_text),
                'figures': figure_analyses if 'figure_analyses' in locals() else [],
                'analyzed_at': datetime.now().isoformat()
            }
            
            # Cache the results
            self.cache[cache_key] = analysis
            self.save_cache()
            
            return analysis
            
        except Exception as e:
            print(f"Error analyzing paper: {e}")
            return {}
    
    def explain_with_analogy(self, text: str) -> str:
        """Use Gemini to explain concepts with analogies"""
        if not text.strip():
            return "No text provided for explanation"
            
        prompt = f"""Please explain this text using simple analogies that make it easy to understand:
        {text}
        Provide 2-3 clear analogies using everyday concepts that most people would understand."""
        
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            return f"Error generating explanation: {str(e)}"
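
The caption matching inside `extract_images_from_pdf` can be tried on plain text without a PDF. The sketch below reuses the two patterns from the class. The helper name `find_captions` and the sample page text are made up for illustration.

```python
import re
from typing import List

# Same two patterns as in ScholarlyAnalyzer.extract_images_from_pdf
CAPTION_PATTERNS = [
    r"(?:Figure|Fig\.)\s+\d+[.:]\s*[^\n]+",
    r"Table\s+\d+[.:]\s*[^\n]+",
]


def find_captions(page_text: str) -> List[str]:
    """Collect caption-like lines, grouped by pattern rather than by page order."""
    captions: List[str] = []
    for pattern in CAPTION_PATTERNS:
        captions.extend(m.group() for m in re.finditer(pattern, page_text))
    return captions


# Hypothetical page text, used only to exercise the patterns
sample_page = (
    "Intro text mentioning Figure 2 in passing.\n"
    "Figure 1: Accuracy versus model size.\n"
    "Table 3. Hyperparameters used in training.\n"
    "Fig. 4: Loss curves.\n"
)

for caption in find_captions(sample_page):
    print(caption)
```

An inline mention such as "Figure 2 in passing" is skipped because the number is not followed by `.` or `:`. Figure captions are listed before table captions regardless of where they sit on the page, and the class attaches only the first match (`captions[0]`) to every image on that page, so the stored caption may not belong to the image it is paired with.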

## Helper Functions
Functions for analyzing papers and displaying results

In [8]:
# Initialize the analyzer
analyzer = ScholarlyAnalyzer()

def analyze_paper_url(url: str, title: str = "", abstract: str = ""):
    """Analyze a paper from its URL"""
    print(f"Analyzing paper from URL: {url}")
    
    # Reuse the module-level analyzer created above so its cache is shared

    # For arXiv papers, use the specialized method
    if 'arxiv.org' in url:
        analysis = analyzer.analyze_arxiv_paper(url)
    else:
        analysis = analyzer.analyze_paper(url, title, abstract)
    
    if not analysis:
        print("Could not analyze paper")
        return
    print("\nAnalysis Results:")
    print(f"Title: {analysis['title']}")
    
    if 'abstract' in analysis and analysis['abstract']:
        print("\nAbstract:")
        print(analysis['abstract'])
        print("\nGenerating explanation with analogies...")
        explain_paper(analysis['abstract'])
    else:
        print("\nNo abstract available")
    
    print(f"\nFull text available: {'Yes' if analysis['has_full_text'] else 'No'}")
    print("\nDetailed Analysis:")
    print(analysis['analysis'])
    print("\nExtracted Entities:")
    print(analysis['entities'])
    
    return analysis

def explain_paper(text: str):
    """Get an explanation with analogies for a paper"""
    if not text.strip():
        print("No text provided for explanation")
        return
        
    explanation = analyzer.explain_with_analogy(text)
    print("\nExplanation with analogies:")
    print(explanation)
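
The arXiv branch relies on pulling an ID out of the URL with the pattern `\d+\.\d+`. The sketch below shows that step on its own and builds both page forms from the ID. The helper name `arxiv_urls` is hypothetical, and the `.pdf` suffix on the PDF link is an assumption that mirrors the example URLs used in this notebook.

```python
import re
from typing import Dict, Optional


def arxiv_urls(url: str) -> Optional[Dict[str, str]]:
    """Derive abstract and PDF URLs from any URL containing an arXiv-style ID."""
    match = re.search(r'\d+\.\d+', url)  # same pattern as get_arxiv_info
    if not match:
        return None
    arxiv_id = match.group()
    return {
        'id': arxiv_id,
        'abs_url': f"https://arxiv.org/abs/{arxiv_id}",
        # Assumed form: ends in ".pdf" like the notebook's example links
        'pdf_url': f"https://arxiv.org/pdf/{arxiv_id}.pdf",
    }


print(arxiv_urls("https://arxiv.org/pdf/2005.11401"))
print(arxiv_urls("https://example.com/paper"))  # None: no ID in the URL
```

Normalizing to a link that ends in `.pdf` matters here because `analyze_paper` only downloads the full text when `url.lower().endswith('.pdf')` is true. A link without the suffix falls back to the abstract alone.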

## Example Usage
Let's analyze a paper by providing its URL

In [9]:
# Example: Analyze an arXiv paper
# Using a real arXiv paper about language models
paper_url = "https://arxiv.org/pdf/2302.09419.pdf"  # A real paper about large language models
analysis = analyze_paper_url(paper_url)

# You can also print specific parts of the analysis if needed
if analysis and 'title' in analysis:
    print("\nPaper Title:", analysis['title'])



Analyzing paper from URL: https://arxiv.org/pdf/2302.09419.pdf


Successfully fetched metadata for arXiv paper: A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

Analysis Results:
Title: A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

Abstract:
Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is trained on large-scale data which provides a reasonable parameter initialization for a wide range of downstream applications. BERT learns bidirectional encoder representations from Transformers, which are trained on large datasets as contextual language models. Similarly, the generative pretrained transformer (GPT) method employs Transformers as the feature extractor and is trained using an autoregressive paradigm on large datasets. Recently, ChatGPT shows promising success on large language models, which applies an autoregressive language model with zero shot or fe

In [10]:
# Test with the RAG paper
paper_url = "https://arxiv.org/pdf/2005.11401"
print("Starting paper analysis...")
analysis = analyze_paper_url(paper_url)

print("\nGenerating analogies and explanations...")
if analysis and analysis.get('abstract'):
    explain_paper(analysis['abstract'])

Starting paper analysis...
Analyzing paper from URL: https://arxiv.org/pdf/2005.11401
Successfully fetched metadata for arXiv paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Analysis Results:
Title: Language Models are Few-Shot Learners

No abstract available

Full text available: No

Detailed Analysis:
Here's a structured analysis of the provided text:

**Title: Language Models are Few-Shot Learners**

**1. Key Findings:**

*   Scaling up language models (specifically GPT-3 with 175 billion parameters) significantly improves task-agnostic, few-shot performance in NLP.
*   GPT-3, without fine-tuning, achieves competitive or even state-of-the-art results on various NLP tasks using only a few examples.
*   GPT-3 demonstrates strong performance in tasks requiring reasoning and domain adaptation, like unscrambling words or performing arithmetic.
*   The model exhibits the ability to generate realistic news articles that are difficult for humans to distinguish from 

In [11]:
# Test with a paper that contains figures
paper_url = "https://arxiv.org/pdf/2303.08774.pdf"  # GPT-4 Technical Report
print("Starting paper analysis with figures...")
analysis = analyze_paper_url(paper_url)

if analysis and analysis.get('figures'):
    print("\nFigure Analysis Results:")
    for idx, figure in enumerate(analysis['figures'], 1):
        print(f"\nFigure {idx} (Page {figure['page_number']}):")
        print(f"Caption: {figure['caption']}")
        print(f"Analysis: {figure['analysis']}")

Starting paper analysis with figures...
Analyzing paper from URL: https://arxiv.org/pdf/2303.08774.pdf
Successfully fetched metadata for arXiv paper: GPT-4 Technical Report

Analysis Results:
Title: 

No abstract available

Full text available: Yes

Detailed Analysis:
Okay, here's a structured analysis of the GPT-4 Technical Report excerpt you provided:

**1. Key Findings:**

*   **GPT-4 is a large-scale, multimodal model:** It accepts both image and text inputs and produces text outputs. This represents an advancement over text-only models.
*   **Human-level performance on benchmarks:** GPT-4 demonstrates human-level performance on professional and academic benchmarks, such as the simulated bar exam (top 10%). This highlights its improved reasoning and problem-solving abilities.
*   **Transformer-based architecture with next token prediction:** GPT-4 is based on the Transformer architecture and pre-trained to predict the next token in a document.
*   **Alignment process enhances perfo