# 📖 Unpaywall Paper Download System with Selenium

**Enhanced version using Selenium for robust paper downloads**

## ✅ **Why Selenium + Unpaywall is Better:**
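- **JavaScript Support**: Handles pages that load their content dynamically
- **Real Browser**: Sends real browser headers and cookies, which some publisher and repository sites require before serving a PDF
- **Better Success Rate**: Falls back from an in-browser fetch to `requests` with the browser's cookies, then to download links and embedded PDFs on the page
- **Handle Redirects**: Follows publisher redirects to the final PDF location
- **Legal Sources Only**: Downloads only copies that Unpaywall lists as open access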

## 🎯 **What This Notebook Does:**
1. Uses Selenium WebDriver for robust web interaction
2. Checks Unpaywall API for open access availability
3. Downloads papers using real browser automation
4. Handles JavaScript-heavy publisher sites
5. Organizes files and tracks download status
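
Steps 2 and 3 depend on choosing which URL to try from an Unpaywall record. The following is a minimal sketch, assuming the record is a dict shaped like the API's JSON with `best_oa_location` and `oa_locations` keys. The helper name `candidate_pdf_urls` and the sample record are hypothetical and are not part of the notebook below.

```python
def candidate_pdf_urls(record):
    """Return de-duplicated candidate URLs, best location first.

    Hypothetical helper: `record` is assumed to look like an Unpaywall
    JSON response that has already been parsed into a dict.
    """
    locations = []
    best = record.get("best_oa_location")
    if best:
        locations.append(best)
    locations.extend(record.get("oa_locations") or [])

    urls = []
    for loc in locations:
        # Prefer the direct PDF link, then the landing page
        for key in ("url_for_pdf", "url"):
            url = loc.get(key)
            if url and url not in urls:
                urls.append(url)
    return urls


# Made-up record for illustration only
sample = {
    "is_oa": True,
    "best_oa_location": {"url_for_pdf": "https://example.org/a.pdf",
                         "url": "https://example.org/a"},
    "oa_locations": [
        {"url_for_pdf": "https://example.org/a.pdf", "url": "https://example.org/a"},
        {"url_for_pdf": None, "url": "https://repo.example.org/b"},
    ],
}
print(candidate_pdf_urls(sample))
```

Trying the direct PDF link before the landing page mirrors the `url_for_pdf or url` preference used in the API cell.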

In [1]:
# Install required packages (run once)
# !pip install selenium webdriver-manager pandas requests

# Import required libraries
import pandas as pd
import requests
import time
import os
import re
from pathlib import Path
import json
from urllib.parse import urlparse, urljoin

# Selenium imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException, WebDriverException, NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager

print("📦 Libraries imported successfully!")
print("🤖 Selenium WebDriver ready for enhanced downloads!")

📦 Libraries imported successfully!
🤖 Selenium WebDriver ready for enhanced downloads!


In [2]:
# Load the dataset
print("📊 Loading dataset...")
data = pd.read_csv(R"E:\ONEDRIVE\OneDrive - BUET\Study\ChemE codes\not_downloaded_papers2.csv")

print(f"✅ Dataset loaded successfully!")
print(f"📚 Total papers: {len(data)}")
print(f"🔍 Papers with DOI: {data['doi'].notna().sum()}")

# Filter for papers with valid DOIs
papers_with_doi = data[
    data['doi'].notna() & 
    data['doi'].str.startswith('10.', na=False)
].copy()

print(f"📋 Papers with valid DOIs: {len(papers_with_doi)}")

# Add tracking columns
papers_with_doi['is_open_access'] = False
papers_with_doi['oa_pdf_url'] = ""
papers_with_doi['downloaded'] = False
papers_with_doi['download_filename'] = ""
papers_with_doi['download_status'] = "Not checked"
papers_with_doi['oa_host_type'] = ""
papers_with_doi['selenium_method'] = ""

print(f"🎯 Ready to check {len(papers_with_doi)} papers for open access availability!")

📊 Loading dataset...
✅ Dataset loaded successfully!
📚 Total papers: 169
🔍 Papers with DOI: 169
📋 Papers with valid DOIs: 169
🎯 Ready to check 169 papers for open access availability!
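
The filter above keeps only rows whose `doi` value starts with `10.`, so a DOI stored as a full URL or with a `doi:` prefix would be dropped. A minimal sketch of a normalizer, assuming such prefixes might appear in the CSV (the dataset here shows no dropped rows); `normalize_doi` is a hypothetical helper, not part of the notebook.

```python
import re

# Prefixes that are sometimes stored in front of the bare DOI (assumption)
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    """Return the bare DOI (starting with '10.') or None if it is not one."""
    if value is None:
        return None
    doi = _DOI_PREFIX.sub("", str(value).strip())
    return doi if doi.startswith("10.") else None


print(normalize_doi("https://doi.org/10.1016/j.desal.2020.114884"))
```

It could be applied with `data['doi'].map(normalize_doi)` before the `notna()` filter.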


In [3]:
# 🔧 SELENIUM SETUP AND CONFIGURATION

def setup_selenium_driver(headless=True):
    """Setup Chrome driver with optimal settings for paper downloads"""
    try:
        chrome_options = Options()
        
        if headless:
            chrome_options.add_argument("--headless")
        
        # Essential options for stability
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        
        # User agent to appear more like a real browser
        chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
        
        # Download settings
        prefs = {
            "profile.default_content_setting_values.notifications": 2,
            "profile.default_content_settings.popups": 0,
            "profile.managed_default_content_settings.images": 2,  # Don't load images for faster loading
        }
        chrome_options.add_experimental_option("prefs", prefs)
        
        # Auto-install ChromeDriver
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)
        
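        # Additional stealth settings. Register the override through CDP so it
        # runs on every new document; a one-off execute_script call would be
        # lost on the first navigation.
        driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
            "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        })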
        driver.execute_cdp_cmd('Network.setUserAgentOverride', {
            "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        })
        
        driver.set_page_load_timeout(30)
        driver.implicitly_wait(10)
        
        print("🤖 Selenium WebDriver setup successful!")
        return driver
        
    except Exception as e:
        print(f"❌ Failed to setup WebDriver: {str(e)}")
        print("💡 Try installing ChromeDriver manually or check your Chrome version")
        return None

# Test driver setup
test_driver = setup_selenium_driver(headless=True)
if test_driver:
    print("✅ WebDriver test successful!")
    test_driver.quit()
else:
    print("❌ WebDriver setup failed - check installation")

🤖 Selenium WebDriver setup successful!
✅ WebDriver test successful!


In [4]:
# 🔍 UNPAYWALL API FUNCTIONS (Enhanced)

def clean_filename(title, max_length=80):
    """Clean paper title to create a valid filename"""
    if pd.isna(title):
        return "Unknown_Title"
    
    title = str(title)
    title = re.sub(r'[<>:"/\\|?*]', '', title)
    title = re.sub(r'[^\w\s\-.]', '', title)
    title = re.sub(r'\s+', '_', title.strip())
    
    if len(title) > max_length:
        title = title[:max_length]
    
    return title if title else "Unknown_Title"

def check_unpaywall_api(doi, email="barnobarnobarno666@gmail.com"):
    """
    Check Unpaywall API for open access information
    Returns detailed information about OA availability
    """
    try:
        url = f"https://api.unpaywall.org/v2/{doi}?email={email}"
        
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=15)
        
        if response.status_code == 200:
            data = response.json()
            
            result = {
                'is_oa': data.get('is_oa', False),
                'oa_date': data.get('oa_date'),
                'journal_is_oa': data.get('journal_is_oa', False),
                'title': data.get('title', ''),
                'journal': data.get('journal_name', ''),
                'year': data.get('year'),
                'pdf_url': None,
                'host_type': None,
                'license': None,
                'all_oa_locations': [],
                'api_response': data
            }
            
            # Collect all OA locations
            oa_locations = data.get('oa_locations', [])
            for location in oa_locations:
                result['all_oa_locations'].append({
                    'url': location.get('url'),
                    'url_for_pdf': location.get('url_for_pdf'),
                    'host_type': location.get('host_type'),
                    'is_best': location.get('is_best', False),
                    'license': location.get('license')
                })
            
            # Get best location
            best_location = data.get('best_oa_location')
            if best_location:
                result['pdf_url'] = best_location.get('url_for_pdf') or best_location.get('url')
                result['host_type'] = best_location.get('host_type', '')
                result['license'] = best_location.get('license', '')
            
            return result
            
        elif response.status_code == 404:
            return {'is_oa': False, 'error': 'DOI not found in Unpaywall database'}
        else:
            return {'is_oa': False, 'error': f'API error: {response.status_code}'}
            
    except requests.exceptions.Timeout:
        return {'is_oa': False, 'error': 'API request timeout'}
    except Exception as e:
        return {'is_oa': False, 'error': f'Error: {str(e)[:50]}'}

print("🔧 Unpaywall API functions loaded!")

🔧 Unpaywall API functions loaded!


In [11]:
# 🚀 SELENIUM-BASED PDF DOWNLOAD FUNCTIONS

import os
import time
import tempfile
import shutil
from urllib.parse import urlparse
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

def selenium_download_paper(driver, pdf_url, title, year, doi, output_folder="papers"):
    """
    Download a PDF using Selenium by navigating to the URL and capturing the content
    """
    try:
        print(f"      🔗 Attempting Selenium download...")
        print(f"      📄 URL: {pdf_url}")
        
        # Ensure output folder exists
        os.makedirs(output_folder, exist_ok=True)
        
        # Clean title for filename
        clean_title = sanitize_filename(title)
        filename = f"{clean_title}_{year}.pdf"
        filepath = os.path.join(output_folder, filename)
        
        # Handle duplicates
        counter = 1
        base_filepath = filepath
        while os.path.exists(filepath):
            name, ext = os.path.splitext(base_filepath)
            filepath = f"{name}_{counter}{ext}"
            counter += 1
        
        print(f"      📁 Target file: {filepath}")
        
        # Navigate to PDF URL
        driver.get(pdf_url)
        
        # Wait a moment for page to load
        time.sleep(2)
        
        # Check if we're looking at a PDF directly
        current_url = driver.current_url
        page_source = driver.page_source.lower()
        
        print(f"      🌐 Current URL: {current_url}")
        print(f"      📋 Page title: {driver.title}")
        
        # Method 1: Check if browser is displaying PDF directly
        if (pdf_url.lower().endswith('.pdf') or 
            'application/pdf' in page_source or 
            'pdf' in driver.title.lower() or
            current_url.lower().endswith('.pdf')):
            
            print(f"      📄 Detected PDF content, attempting to save...")
            
            # Try to get PDF content using JavaScript
            try:
                # Method 1a: Try to get PDF as base64
                pdf_content = driver.execute_script("""
                    return new Promise((resolve, reject) => {
                        fetch(arguments[0])
                            .then(response => response.blob())
                            .then(blob => {
                                const reader = new FileReader();
                                reader.onload = function() {
                                    resolve(reader.result.split(',')[1]);
                                };
                                reader.onerror = reject;
                                reader.readAsDataURL(blob);
                            })
                            .catch(reject);
                    });
                """, pdf_url)
                
                if pdf_content:
                    # Decode base64 content
                    import base64
                    pdf_bytes = base64.b64decode(pdf_content)
                    
                    # Validate it's a PDF
                    if pdf_bytes[:4] == b'%PDF':
                        with open(filepath, 'wb') as f:
                            f.write(pdf_bytes)
                        
                        file_size = len(pdf_bytes) / 1024  # KB
                        print(f"      ✅ Downloaded via JavaScript: {os.path.basename(filepath)} ({file_size:.0f} KB)")
                        return True, os.path.basename(filepath), f"Downloaded from {urlparse(pdf_url).netloc}"
                    else:
                        print(f"      ❌ Invalid PDF content from JavaScript")
                else:
                    print(f"      ❌ No content returned from JavaScript")
                    
            except Exception as js_error:
                print(f"      ⚠️ JavaScript method failed: {str(js_error)}")
            
            # Method 1b: Try using requests as fallback but with browser headers
            try:
                print(f"      🔄 Trying requests with browser headers...")
                
                # Get cookies from selenium session
                cookies = driver.get_cookies()
                cookie_dict = {cookie['name']: cookie['value'] for cookie in cookies}
                
                # Get user agent from browser
                user_agent = driver.execute_script("return navigator.userAgent;")
                
                headers = {
                    'User-Agent': user_agent,
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                    'Accept-Language': 'en-US,en;q=0.5',
                    'Accept-Encoding': 'gzip, deflate',
                    'Connection': 'keep-alive',
                    'Upgrade-Insecure-Requests': '1'
                }
                
                import requests
                response = requests.get(pdf_url, headers=headers, cookies=cookie_dict, timeout=30)
                
                if response.status_code == 200:
                    content = response.content
                    
                    # Validate PDF content
                    if content[:4] == b'%PDF' and len(content) > 1000:
                        with open(filepath, 'wb') as f:
                            f.write(content)
                        
                        file_size = len(content) / 1024  # KB
                        print(f"      ✅ Downloaded via requests+headers: {os.path.basename(filepath)} ({file_size:.0f} KB)")
                        return True, os.path.basename(filepath), f"Downloaded from {urlparse(pdf_url).netloc}"
                    else:
                        print(f"      ❌ Invalid PDF content from requests")
                else:
                    print(f"      ❌ Requests failed: HTTP {response.status_code}")
                    
            except Exception as req_error:
                print(f"      ⚠️ Requests method failed: {str(req_error)}")
        
        # Method 2: Look for download links on the page
        print(f"      🔍 Searching for download links on page...")
        
        # Common download link patterns
        download_selectors = [
            'a[href*=".pdf"]',
            'a[href*="download"]',
            'a[href*="Download"]',
            'a[href*="PDF"]',
            'a[href*="pdf"]',
            'a[href*="Full text"]',
            'a[href*="fulltext"]',
            'button[href*=".pdf"]',
            'input[value*="Download"]',
            '.download-link',
            '.pdf-link',
            '.full-text-link'
        ]
        
        for selector in download_selectors:
            try:
                download_links = driver.find_elements(By.CSS_SELECTOR, selector)
                for link in download_links:
                    href = link.get_attribute('href')
                    if href and ('.pdf' in href.lower() or 'download' in href.lower()):
                        print(f"      🔗 Found download link: {href}")
                        
                        # Try clicking the link
                        try:
                            driver.execute_script("arguments[0].click();", link)
                            time.sleep(3)
                            
                            # Check if download started or new page loaded
                            new_url = driver.current_url
                            if new_url != current_url:
                                print(f"      📍 Redirected to: {new_url}")
                                # If redirected to a PDF, try to download it
                                if new_url.lower().endswith('.pdf'):
                                    return selenium_download_paper(driver, new_url, title, year, doi, output_folder)
                            
                        except Exception as click_error:
                            print(f"      ⚠️ Click failed: {str(click_error)}")
                            
            except Exception as selector_error:
                print(f"      ⚠️ Selector {selector} failed: {str(selector_error)}")
        
        # Method 3: Try to download any PDF content found on the page
        print(f"      🔍 Searching for embedded PDF content...")
        
        # Look for PDF embeds, iframes, or objects
        pdf_elements = driver.find_elements(By.CSS_SELECTOR, 'iframe[src*=".pdf"], object[data*=".pdf"], embed[src*=".pdf"]')
        
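        # Read the sources up front: the elements go stale once the driver navigates
        pdf_sources = [el.get_attribute('src') or el.get_attribute('data') for el in pdf_elements]
        
        for pdf_src in pdf_sources:
            # Skip the URL we are already on to avoid recursing forever
            if pdf_src and pdf_src != pdf_url:
                print(f"      🔗 Found embedded PDF: {pdf_src}")
                # The recursive call reports failure through its return value,
                # so return only on success and otherwise try the next source
                result = selenium_download_paper(driver, pdf_src, title, year, doi, output_folder)
                if result[0]:
                    return result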
        
        print(f"      ❌ No downloadable PDF content found")
        return False, "", "No downloadable PDF content found on page"
        
    except Exception as e:
        print(f"      ❌ Selenium download error: {str(e)}")
        return False, "", f"Selenium download error: {str(e)[:50]}"

def sanitize_filename(filename):
    """Clean filename for safe file system usage"""
    import re
    # Remove or replace invalid characters
    # str() guards against non-string titles such as NaN from pandas
    clean = re.sub(r'[<>:"/\\|?*]', '_', str(filename))
    # Remove extra spaces and truncate
    clean = re.sub(r'\s+', ' ', clean).strip()
    return clean[:100]  # Limit length

def download_pdf_with_requests(pdf_url, title, year, output_folder="papers"):
    """
    Fallback function to download PDF using requests (kept for comparison)
    """
    try:
        print(f"      🔗 Attempting requests download...")
        print(f"      📄 URL: {pdf_url}")
        
        # Ensure output folder exists
        os.makedirs(output_folder, exist_ok=True)
        
        # Clean title for filename
        clean_title = sanitize_filename(title)
        filename = f"{clean_title}_{year}.pdf"
        filepath = os.path.join(output_folder, filename)
        
        print(f"      📁 Target file: {filepath}")
        
        # Handle duplicates
        counter = 1
        base_filepath = filepath
        while os.path.exists(filepath):
            name, ext = os.path.splitext(base_filepath)
            filepath = f"{name}_{counter}{ext}"
            counter += 1
        
        # Download with requests
        import requests
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        response = requests.get(pdf_url, headers=headers, timeout=30)
        
        if response.status_code == 200:
            content = response.content
            content_type = response.headers.get('content-type', '').lower()
            content_length = len(content)
            
            print(f"      📊 Content-Type: {content_type}")
            print(f"      📏 Content-Length: {content_length} bytes")
            
            # Validate PDF content
            if content[:4] == b'%PDF' and content_length > 1000:
                with open(filepath, 'wb') as f:
                    f.write(content)
                
                file_size = len(content) / 1024  # KB
                print(f"      ✅ Downloaded: {os.path.basename(filepath)} ({file_size:.0f} KB)")
                return True, os.path.basename(filepath), f"Downloaded from {urlparse(pdf_url).netloc}"
            else:
                print(f"      ❌ Content validation failed")
                return False, "", f"Content is not a valid PDF (type: {content_type}, size: {content_length})"
        else:
            print(f"      ❌ HTTP error: {response.status_code}")
            return False, "", f"HTTP {response.status_code}"
            
    except Exception as e:
        print(f"      ❌ Download error: {str(e)}")
        return False, "", f"Download error: {str(e)[:50]}"

print("🤖 Enhanced Selenium download functions loaded!")

🤖 Enhanced Selenium download functions loaded!
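
Both download functions above repeat the same duplicate-filename loop. A sketch of how that loop could be factored out, assuming the same `_1`, `_2`, ... suffix scheme; `unique_path` is a hypothetical helper and the notebook does not define it.

```python
import os
import tempfile


def unique_path(folder, filename):
    """Return a path in `folder` that does not exist yet.

    Appends _1, _2, ... before the extension until the name is free.
    """
    name, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{name}_{counter}{ext}")
        counter += 1
    return candidate


# Usage in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    first = unique_path(tmp, "paper_2021.pdf")
    open(first, "wb").close()          # simulate an existing download
    second = unique_path(tmp, "paper_2021.pdf")
    print(os.path.basename(first), os.path.basename(second))
```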


In [12]:
# 🧪 TARGETED TEST: FIND AND DOWNLOAD OPEN ACCESS PAPERS

print("🧪 Testing Selenium download with open access papers...")
print("="*80)

# First, find papers that are actually open access
test_papers = papers_with_doi.head(20)  # Check first 20 papers
driver = setup_selenium_driver(headless=True)

if driver:
    try:
        oa_papers = []
        print("🔍 Scanning for open access papers...")
        
        for i, (_, row) in enumerate(test_papers.iterrows(), 1):
            doi = row['doi']
            title = row.get('title', 'Unknown Title')
            year = row.get('year', 'Unknown')
            
            print(f"[{i}/{len(test_papers)}] Checking: {str(title)[:40]}...")
            
            # Check if paper is open access
            oa_info = check_unpaywall_api(doi)
            
            if oa_info.get('is_oa', False) and oa_info.get('pdf_url'):
                oa_papers.append((doi, title, year, oa_info))
                print(f"   ✅ Found OA: {oa_info.get('host_type', 'Unknown')} | {oa_info['pdf_url'][:50]}...")
                
                if len(oa_papers) >= 3:  # Stop after finding 3 OA papers
                    break
            else:
                print(f"   ❌ Not OA: {oa_info.get('error', 'Unknown error')}")
        
        print(f"\n📚 Found {len(oa_papers)} open access papers!")
        print("="*50)
        
        # Now test downloading these papers
        if oa_papers:
            for i, (doi, title, year, oa_info) in enumerate(oa_papers, 1):
                print(f"\n[{i}/{len(oa_papers)}] Testing download: {title[:40]}...")
                print(f"DOI: {doi}")
                print(f"PDF URL: {oa_info['pdf_url']}")
                print(f"Host: {oa_info.get('host_type', 'Unknown')}")
                print("-" * 50)
                
                # Try Selenium download
                success, filename, status = selenium_download_paper(
                    driver, oa_info['pdf_url'], title, year, doi, "unpaywall_selenium_papers"
                )
                
                if success:
                    print(f"   ✅ SUCCESS: {filename}")
                    print(f"   📄 Status: {status}")
                else:
                    print(f"   ❌ FAILED: {status}")
                    
                    # Try direct download as fallback
                    print(f"   🔄 Trying direct download...")
                    success2, filename2, status2 = download_pdf_with_requests(
                        oa_info['pdf_url'], title, year, "unpaywall_selenium_papers"
                    )
                    
                    if success2:
                        print(f"   ✅ DIRECT SUCCESS: {filename2}")
                    else:
                        print(f"   ❌ DIRECT FAILED: {status2}")
                
                print()
        else:
            print("⚠️  No open access papers found in the first 20 papers!")
            print("This might indicate that most papers in your dataset are not open access.")
            
    except Exception as e:
        print(f"❌ Test error: {e}")
    finally:
        driver.quit()
        print("🔒 WebDriver closed")
else:
    print("❌ Failed to setup WebDriver")

print("\n🎯 Test completed! Check the 'unpaywall_selenium_papers' folder for downloads.")

# 🧪 FOCUSED TEST: ENHANCED SELENIUM DOWNLOAD

print("🧪 Testing enhanced Selenium download with problematic repository...")
print("="*80)

# Test with the known problematic URL
test_url = "https://researchonline.gcu.ac.uk/ws/portalfiles/portal/37004892/Ghazali_Optimal_Author_s_Accepted_Manuscript.pdf"
test_title = "Optimal pressure-driven membrane process for different water recoveries"
test_year = 2021
test_doi = "10.1016/j.desal.2020.114884"

print(f"📄 Testing URL: {test_url}")
print(f"📋 Title: {test_title}")
print(f"📅 Year: {test_year}")
print(f"🔗 DOI: {test_doi}")
print("-" * 80)

# Start a fresh driver: the one used in the previous test was closed in its `finally` block
driver = setup_selenium_driver(headless=True)
if driver:
    print("🔄 Testing with enhanced Selenium download...")
    
    success, filename, status = selenium_download_paper(
        driver, test_url, test_title, test_year, test_doi, "test_enhanced_selenium"
    )
    
    if success:
        print(f"\n✅ SUCCESS!")
        print(f"📄 Filename: {filename}")
        print(f"📊 Status: {status}")
        
        # Verify the file exists and is valid
        import os
        test_path = os.path.join("test_enhanced_selenium", filename)
        if os.path.exists(test_path):
            file_size = os.path.getsize(test_path) / 1024  # KB
            print(f"📏 File size: {file_size:.0f} KB")
            
            # Check if it's a valid PDF
            with open(test_path, 'rb') as f:
                first_bytes = f.read(4)
                if first_bytes == b'%PDF':
                    print(f"✅ Valid PDF file confirmed!")
                else:
                    print(f"❌ Invalid PDF file (starts with: {first_bytes})")
        else:
            print(f"❌ File not found at: {test_path}")
    else:
        print(f"\n❌ FAILED!")
        print(f"📊 Status: {status}")
        
        # Try fallback with requests for comparison
        print(f"\n🔄 Testing fallback with requests for comparison...")
        success2, filename2, status2 = download_pdf_with_requests(
            test_url, test_title, test_year, "test_requests_fallback"
        )
        
        if success2:
            print(f"✅ Requests fallback succeeded: {filename2}")
        else:
            print(f"❌ Requests fallback also failed: {status2}")
    
    driver.quit()  # close the browser once the test is done
    print("\n" + "="*80)
    print("🏁 Test completed!")
    
else:
    print("❌ No driver available. Please run the driver setup cell first.")

🧪 Testing Selenium download with open access papers...
🤖 Selenium WebDriver setup successful!
🔍 Scanning for open access papers...
[1/20] Checking: 1G laboratory-scale shaking table tests ...
   ❌ Not OA: Unknown error
[2/20] Checking: A case study on seismic response analysi...
   ❌ Not OA: DOI not found in Unpaywall database
[3/20] Checking: A catastrophic flowslide that overrides ...
   ❌ Not OA: Unknown error
[4/20] Checking: A new design chart for estimating fricti...
   ❌ Not OA: Unknown error
[5/20] Checking: A novel numerical strategy to analyse th...
   ❌ Not OA: Unknown error
[6/20] Checking: A Numerical Study of Granular Pile Ancho...
   ❌ Not OA: Unknown error
[7/20] Checking: A small-scale experimental investigation...
   ✅ Found OA: repository | https://researchonline.gcu.ac.uk/en/publications/8...
[8/20] Checking: A soil-water coupled analysis of sand co...
   ❌ Not OA: Unknown error
[9/20] Checking: An experimental study on random fiber mi...
   ❌ Not OA: Unknown error


In [9]:
# 🧪 SIMPLE TEST: VERIFY PDF DOWNLOAD FUNCTION

print("🧪 Testing PDF download function with known working URL...")
print("="*80)

# Test with a known working PDF URL (arXiv example)
test_url = "https://arxiv.org/pdf/1706.03762.pdf"  # Attention is All You Need paper
test_title = "Attention Is All You Need"
test_year = 2017

print(f"Testing URL: {test_url}")
print(f"Title: {test_title}")
print(f"Year: {test_year}")
print("-" * 50)

# Test direct download
success, filename, status = download_pdf_with_requests(
    test_url, test_title, test_year, "unpaywall_selenium_papers"
)

if success:
    print(f"✅ SUCCESS: {filename}")
    print(f"📄 Status: {status}")
    
    # Verify file exists and has content
    filepath = os.path.join("unpaywall_selenium_papers", filename)
    if os.path.exists(filepath):
        file_size = os.path.getsize(filepath) / 1024  # KB
        print(f"📁 File size: {file_size:.0f} KB")
        
        # Check if it's a valid PDF
        with open(filepath, 'rb') as f:
            first_bytes = f.read(4)
            if first_bytes == b'%PDF':
                print("✅ Valid PDF file confirmed!")
            else:
                print("❌ File is not a valid PDF")
    else:
        print("❌ File not found on disk")
else:
    print(f"❌ FAILED: {status}")

print("\n" + "="*80)

🧪 Testing PDF download function with known working URL...
Testing URL: https://arxiv.org/pdf/1706.03762.pdf
Title: Attention Is All You Need
Year: 2017
--------------------------------------------------
      ✅ Downloaded: 2017_Attention_Is_All_You_Need.pdf (2163 KB)
✅ SUCCESS: 2017_Attention_Is_All_You_Need.pdf
📄 Status: Downloaded from arxiv.org
📁 File size: 2163 KB
✅ Valid PDF file confirmed!
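
The checks in this notebook treat content as a PDF when it starts with the `%PDF` magic bytes, and the `requests`-based download paths also require more than 1000 bytes. A small sketch that gathers both checks in one place; `looks_like_pdf` is a hypothetical helper, and the 1000-byte default simply mirrors the threshold used above.

```python
def looks_like_pdf(data: bytes, min_size: int = 1000) -> bool:
    """True if `data` starts with the PDF magic bytes and exceeds `min_size` bytes."""
    return data[:4] == b"%PDF" and len(data) > min_size


# An HTML error page is rejected even when it is large
print(looks_like_pdf(b"<html>" + b"x" * 5000))
# A payload with the right header and enough bytes is accepted
print(looks_like_pdf(b"%PDF-1.7\n" + b"0" * 2000))
```

This only screens out obvious non-PDF responses such as HTML login pages; it does not verify that the file is complete or parseable.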



In [None]:
# 🚀 BULK PROCESSING WITH SELENIUM

def bulk_download_with_selenium(papers_df, download_folder="unpaywall_selenium_papers", 
                               batch_size=50, max_papers=None, headless=True):
    """
    Bulk download papers using Selenium for enhanced success rate
    
    Parameters:
    - papers_df: DataFrame with papers to download
    - download_folder: Folder to save papers
    - batch_size: Number of papers to process before restarting driver
    - max_papers: Maximum papers to download (None for all)
    - headless: Run browser in headless mode
    """
    
    os.makedirs(download_folder, exist_ok=True)
    
    # Limit papers if specified
    if max_papers:
        papers_to_process = papers_df.head(max_papers).copy()
        print(f"⚠️  Limited to first {max_papers} papers for testing")
    else:
        papers_to_process = papers_df.copy()
    
    print(f"🚀 Starting Selenium bulk download...")
    print(f"📁 Download folder: {download_folder}")
    print(f"📚 Papers to process: {len(papers_to_process)}")
    print(f"🔄 Batch size: {batch_size}")
    print(f"🤖 Headless mode: {headless}")
    print("="*80)
    
    # Statistics
    total_papers = len(papers_to_process)
    processed = 0
    open_access_found = 0
    successfully_downloaded = 0
    errors = 0
    
    # Process in batches (restart driver periodically)
    for batch_start in range(0, total_papers, batch_size):
        batch_end = min(batch_start + batch_size, total_papers)
        batch_papers = papers_to_process.iloc[batch_start:batch_end]
        
        print(f"\n📦 Processing batch {batch_start//batch_size + 1} (papers {batch_start+1}-{batch_end})")
        
        # Setup driver for this batch
        driver = setup_selenium_driver(headless=headless)
        if not driver:
            print("❌ Failed to setup driver for batch")
            continue
        
        try:
            for idx, (_, row) in enumerate(batch_papers.iterrows()):
                processed += 1
                doi = row['doi']
                title = row.get('title', 'Unknown Title')
                year = row.get('year', 'Unknown')
                
                # Progress indicator
                if processed % 10 == 0 or processed == total_papers:
                    print(f"   📈 Progress: {processed}/{total_papers} ({processed/total_papers*100:.1f}%)")
                
                try:
                    # Check Unpaywall API
                    oa_info = check_unpaywall_api(doi)
                    
                    if oa_info and oa_info.get('is_oa') and oa_info.get('pdf_url'):
                        open_access_found += 1
                        
                        # Update DataFrame
                        papers_df.loc[papers_df['doi'] == doi, 'is_open_access'] = True
                        papers_df.loc[papers_df['doi'] == doi, 'oa_pdf_url'] = oa_info['pdf_url']
                        papers_df.loc[papers_df['doi'] == doi, 'oa_host_type'] = oa_info.get('host_type', '')
                        papers_df.loc[papers_df['doi'] == doi, 'download_status'] = "Open access found"
                        
                        print(f"   ✅ Found OA: {title[:30]}... | Host: {oa_info.get('host_type', 'Unknown')}")
                        
                        # Try Selenium download
                        success, filename, status = selenium_download_paper(
                            driver, oa_info['pdf_url'], title, year, doi, download_folder
                        )
                        
                        if success:
                            successfully_downloaded += 1
                            papers_df.loc[papers_df['doi'] == doi, 'downloaded'] = True
                            papers_df.loc[papers_df['doi'] == doi, 'download_filename'] = filename
                            papers_df.loc[papers_df['doi'] == doi, 'download_status'] = f"Downloaded - {status}"
                            papers_df.loc[papers_df['doi'] == doi, 'selenium_method'] = "Primary URL"
                        else:
                            # Try alternative URLs
                            alt_success = False
                            if oa_info.get('all_oa_locations'):
                                for location in oa_info['all_oa_locations'][:2]:
                                    alt_url = location.get('url_for_pdf') or location.get('url')
                                    if alt_url and alt_url != oa_info['pdf_url']:
                                        alt_success, filename, alt_status = selenium_download_paper(
                                            driver, alt_url, title, year, doi, download_folder
                                        )
                                        if alt_success:
                                            successfully_downloaded += 1
                                            papers_df.loc[papers_df['doi'] == doi, 'downloaded'] = True
                                            papers_df.loc[papers_df['doi'] == doi, 'download_filename'] = filename
                                            papers_df.loc[papers_df['doi'] == doi, 'download_status'] = f"Downloaded - {alt_status}"
                                            papers_df.loc[papers_df['doi'] == doi, 'selenium_method'] = "Alternative URL"
                                            break
                            
                            if not alt_success:
                                papers_df.loc[papers_df['doi'] == doi, 'download_status'] = f"Download failed - {status}"
                                papers_df.loc[papers_df['doi'] == doi, 'selenium_method'] = "Failed"
                    else:
                        # Not open access
                        error_msg = oa_info.get('error', 'Not open access') if oa_info else 'API error'
                        papers_df.loc[papers_df['doi'] == doi, 'download_status'] = error_msg
                        
                except Exception as e:
                    errors += 1
                    papers_df.loc[papers_df['doi'] == doi, 'download_status'] = f"Error: {str(e)[:50]}"
                    print(f"   ❌ Error processing {doi}: {str(e)[:50]}")
                
                # Small delay between papers
                time.sleep(1)
            
            # Save progress after each batch
            papers_df.to_csv("selenium_download_status.csv", index=False)
            print(f"   💾 Batch progress saved")
            
        finally:
            driver.quit()
            print(f"   🔒 Driver closed for batch")
    
    # Final statistics
    print("\n" + "="*80)
    print(f"🎉 SELENIUM BULK DOWNLOAD COMPLETED!")
    print(f"📊 FINAL STATISTICS:")
    print(f"   📚 Total papers processed: {processed}")
    print(f"   🔓 Open access papers found: {open_access_found}")
    print(f"   📥 Successfully downloaded: {successfully_downloaded}")
    print(f"   ❌ Errors encountered: {errors}")
    if open_access_found > 0:
        print(f"   📈 Download success rate: {(successfully_downloaded/open_access_found*100):.1f}%")
    print(f"   📁 Files saved to: {download_folder}/")
    print(f"   📊 Results saved to: selenium_download_status.csv")
    
    return papers_df

print("🔧 Bulk download function ready!")
print("💡 To start bulk processing, run the next cell")

🔧 Bulk download function ready!
💡 To start bulk processing, run the next cell


In [8]:
# 🎯 RUN BULK DOWNLOAD (Configure as needed)

print("🎯 Starting bulk download with Selenium...")
print("⚠️  CONFIGURATION:")
print("   • Change max_papers to None for all papers")
print("   • Set headless=False to see browser in action")
print("   • Adjust batch_size based on your system resources")
print()

# Run bulk download
results = bulk_download_with_selenium(
    papers_with_doi,
    download_folder="unpaywall_selenium_papers",
    batch_size=25,  # Restart driver every 25 papers
    max_papers=None,  # None processes every paper; use an integer (e.g. 10) for a test run
    headless=True  # Set to False to see browser
)

print("\n✅ Bulk download completed!")
print("📁 Check the 'unpaywall_selenium_papers' folder for your downloads")
print("📊 Check 'selenium_download_status.csv' for detailed results")

🎯 Starting bulk download with Selenium...
⚠️  CONFIGURATION:
   • Change max_papers to None for all papers
   • Set headless=False to see browser in action
   • Adjust batch_size based on your system resources

🚀 Starting Selenium bulk download...
📁 Download folder: unpaywall_selenium_papers
📚 Papers to process: 169
🔄 Batch size: 25
🤖 Headless mode: True

📦 Processing batch 1 (papers 1-25)


🤖 Selenium WebDriver setup successful!
   ✅ Found OA: A small-scale experimental inv... | Host: repository
      🌐 Navigating to: researchonline.gcu.ac.uk
      📎 Found PDF link: https://researchonline.gcu.ac.uk/files/44212406/Al_Maadheedi...
      📎 Found PDF link: https://researchonline.gcu.ac.uk/files/44212406/Al_Maadheedi...
      📎 Found PDF link: https://researchonline.gcu.ac.uk/files/44212406/Al_Maadheedi...
      📎 Found PDF link: https://researchonline.gcu.ac.uk/files/44212406/Al_Maadheedi...
      ⏱️  Waiting for dynamic content...
      ⏱️  Waiting for dynamic content...
   📈 Progress: 10/169 (5.9%)
   ✅ Found OA: An experimental study on the r... | Host: publisher
      🌐 Navigating to: doi.org
      📎 Found PDF link: https://www.scien

KeyboardInterrupt: 

In [None]:
# 📊 ANALYZE RESULTS

# Load results if needed
if os.path.exists("selenium_download_status.csv"):
    results_df = pd.read_csv("selenium_download_status.csv")
    
    print("📊 SELENIUM DOWNLOAD RESULTS ANALYSIS")
    print("="*80)
    
    # Basic statistics
    total_papers = len(results_df)
    oa_papers = results_df['is_open_access'].sum()
    downloaded_papers = results_df['downloaded'].sum()
    
    print(f"📚 Total papers: {total_papers}")
    print(f"🔓 Open access found: {oa_papers}")
    print(f"📥 Successfully downloaded: {downloaded_papers}")
    
    if oa_papers > 0:
        print(f"📈 Success rate: {(downloaded_papers/oa_papers*100):.1f}%")
    
    # Download status breakdown
    print(f"\n📈 Download Status Breakdown:")
    status_counts = results_df['download_status'].value_counts()
    for status, count in status_counts.head(10).items():
        print(f"   • {status}: {count}")
    
    # Selenium method breakdown
    if 'selenium_method' in results_df.columns:
        print(f"\n🤖 Selenium Method Breakdown:")
        method_counts = results_df['selenium_method'].value_counts()
        for method, count in method_counts.items():
            if method:
                print(f"   • {method}: {count}")
    
    # Host type breakdown
    if oa_papers > 0:
        print(f"\n🌐 Open Access Host Types:")
        oa_subset = results_df[results_df['is_open_access'] == True]
        host_counts = oa_subset['oa_host_type'].value_counts()
        for host, count in host_counts.items():
            if host:
                print(f"   • {host}: {count}")
    
    # Sample of downloaded papers
    if downloaded_papers > 0:
        print(f"\n📄 Sample Downloaded Papers:")
        downloaded_subset = results_df[results_df['downloaded'] == True]
        for i, (_, paper) in enumerate(downloaded_subset.head(5).iterrows(), 1):
            title = str(paper.get('title', 'Unknown'))[:50]  # str() guards against NaN titles
            filename = paper.get('download_filename', 'Unknown')
            host = paper.get('oa_host_type', 'Unknown')
            method = paper.get('selenium_method', 'Unknown')
            print(f"   {i}. {filename}")
            print(f"      Title: {title}...")
            print(f"      Host: {host} | Method: {method}")
            print()
    
    # File size analysis
    if os.path.exists("unpaywall_selenium_papers"):
        files = os.listdir("unpaywall_selenium_papers")
        pdf_files = [f for f in files if f.lower().endswith('.pdf')]  # match .PDF as well
        
        if pdf_files:
            total_size = 0
            for file in pdf_files:
                filepath = os.path.join("unpaywall_selenium_papers", file)
                total_size += os.path.getsize(filepath)
            
            print(f"\n💾 Download Folder Analysis:")
            print(f"   • PDF files: {len(pdf_files)}")
            print(f"   • Total size: {total_size / (1024*1024):.1f} MB")
            print(f"   • Average size: {total_size / len(pdf_files) / 1024:.1f} KB")
    
    print(f"\n✅ Analysis complete!")
    print(f"📁 Downloads: unpaywall_selenium_papers/")
    print(f"📊 Full results: selenium_download_status.csv")
    
else:
    print("❌ No results file found. Run the bulk download first.")

## 🎯 **Key Advantages of Selenium Version**

### ✅ **Enhanced Success Rate**
- **JavaScript Support**: Handles dynamic content loading
- **Real Browser**: Drives a full Chrome session, so pages that reject plain HTTP clients are more likely to load
- **Publisher-Specific**: Knows how to navigate different journal sites
- **Multiple Strategies**: Tries various methods to find PDFs

### 🔧 **Robust Error Handling**
- **Batch Processing**: Restarts driver periodically to avoid memory issues
- **Fallback Methods**: Tries alternative URLs if primary fails
- **Progress Saving**: Saves progress after each batch
- **Detailed Logging**: Tracks which method successfully downloaded each paper
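
The fallback order can be sketched as a small helper that lists the URLs worth trying, in sequence. This is an illustrative sketch, not the notebook's own code: `candidate_pdf_urls` and `max_alternatives` are hypothetical names, and the `oa_info` keys (`pdf_url`, `all_oa_locations`, `url_for_pdf`, `url`) are assumed to match what `check_unpaywall_api` returns in the bulk download cell.

```python
def candidate_pdf_urls(oa_info, max_alternatives=2):
    """Return the primary PDF URL followed by distinct alternative URLs.

    Hypothetical helper: like the bulk loop, it inspects only the first
    `max_alternatives` entries of `all_oa_locations`.
    """
    urls = []
    primary = oa_info.get("pdf_url")
    if primary:
        urls.append(primary)

    for location in (oa_info.get("all_oa_locations") or [])[:max_alternatives]:
        # Prefer a direct PDF link; fall back to the landing page URL
        alt = location.get("url_for_pdf") or location.get("url")
        if alt and alt not in urls:
            urls.append(alt)
    return urls


# Made-up example data (example.org URLs are placeholders)
example = {
    "pdf_url": "https://example.org/a.pdf",
    "all_oa_locations": [
        {"url_for_pdf": "https://example.org/a.pdf"},
        {"url_for_pdf": None, "url": "https://repo.example.org/a"},
        {"url_for_pdf": "https://ignored.example.org/a.pdf"},
    ],
}
print(candidate_pdf_urls(example))
```

Unlike the bulk loop, which only compares each alternative against the primary URL, this sketch also skips alternatives that repeat each other. Whether that extra deduplication is wanted is a design choice.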

### 🚀 **Configuration Options**
- **Headless Mode**: Run invisibly or watch the browser work
- **Batch Size**: Adjust based on your system resources
- **Max Papers**: Test with small batches first
- **Download Folder**: Organize your downloads
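
To see how `batch_size` translates into driver restarts, the batch boundaries can be computed on their own. The sketch below reuses the same `range` and `min` arithmetic as the bulk loop; `batch_ranges` is a hypothetical helper name, and the numbers (169 papers, batches of 25) come from the run shown above.

```python
def batch_ranges(total_papers, batch_size):
    """Yield (batch_number, start, end) with `end` exclusive, as in the bulk loop."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_papers, batch_size):
        end = min(start + batch_size, total_papers)
        yield start // batch_size + 1, start, end


batches = list(batch_ranges(169, 25))
print(len(batches))   # 7 batches, so the driver is started 7 times
print(batches[-1])    # (7, 150, 169): the final batch holds only 19 papers
```

A smaller `batch_size` restarts Chrome more often, which costs startup time but releases browser memory sooner.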

### 💡 **Usage Tips**
1. **Start Small**: Test with `max_papers=10` first
2. **Monitor Progress**: `selenium_download_status.csv` is rewritten after each batch, so it shows progress up to the last completed batch
3. **System Resources**: Adjust batch size based on your RAM
4. **Network Issues**: If the primary PDF URL fails, the bulk loop tries up to two alternative open access locations reported by Unpaywall
5. **Legal Compliance**: Only papers that Unpaywall reports as open access are downloaded
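
Because the status CSV is saved after every batch, an interrupted run can be continued by skipping DOIs that are already marked as downloaded. The helper below is a minimal sketch under assumptions: `load_downloaded_dois` is a hypothetical name, and it assumes the CSV has the `doi` and `downloaded` columns that the bulk download cell writes.

```python
import csv
import os


def load_downloaded_dois(status_csv="selenium_download_status.csv"):
    """Return the set of DOIs whose `downloaded` column is true in the status CSV."""
    if not os.path.exists(status_csv):
        return set()  # nothing saved yet, so nothing to skip

    done = set()
    with open(status_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # pandas writes booleans as the strings "True" / "False"
            if (row.get("downloaded") or "").strip().lower() == "true":
                done.add(row.get("doi", ""))
    return done
```

With that set in hand, an expression such as `papers_with_doi[~papers_with_doi['doi'].isin(done)]` would leave only the papers still to process before `bulk_download_with_selenium` is called again.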

### 📊 **Expected Results**
- **Higher Success Rate**: Roughly 70-90% of open access papers is the expectation; the interrupted run above does not confirm it
- **Better Publisher Support**: Works with Springer, Elsevier, Nature, etc.
- **Detailed Tracking**: Know exactly which method worked for each paper
- **Resumable**: Progress is saved to `selenium_download_status.csv` after each batch, so a rerun can skip papers already marked as downloaded