# Python Web APIs: Accessing Academic Literature with Semantic Scholar

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Introduction to Semantic Scholar API](#semantic)
2. [Searching for Academic Papers](#search)
3. [Author and Citation Analysis](#authors)
4. [Research Trend Analysis](#trends)
5. [Demo: Citation Network Analysis](#demo)

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import seaborn as sns
import requests
import json
import time

<a id='semantic'></a>

# Semantic Scholar API

Semantic Scholar is a free, AI-powered research tool for scientific literature. It indexes millions of academic papers across various fields including computer science, medicine, biology, physics, and more. The API provides access to:

- Paper metadata (titles, abstracts, authors, citations)
- Author information and profiles
- Citation graphs and academic relationships
- Research trends and influential papers

💡 **Tip**: The Semantic Scholar API is free and works without an API key for basic usage, but requesting a key is recommended if you need higher rate limits.

## Setting Up API Access

While an API key is optional, it's recommended for better rate limits. You can get one at:
https://www.semanticscholar.org/product/api

In [None]:
import configparser
import os
from getpass import getpass

def get_semantic_scholar_key():
    config_file_path = os.path.expanduser("~/.notebook-api-keys")
    config = configparser.ConfigParser(interpolation=None)
    
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    
    # Check if API key exists
    if config.has_option("API_KEYS", "SEMANTIC_SCHOLAR"):
        update_key = input("Semantic Scholar API key found. Update it? (y/n): ").lower()
        if update_key == 'n':
            return config.get("API_KEYS", "SEMANTIC_SCHOLAR")
    
    # Option to skip API key
    use_key = input("Do you have a Semantic Scholar API key? (y/n): ").lower()
    
    if use_key == 'y':
        api_key = getpass("Enter your Semantic Scholar API key: ")
        
        # Save the API key
        if not config.has_section("API_KEYS"):
            config.add_section("API_KEYS")
        config.set("API_KEYS", "SEMANTIC_SCHOLAR", api_key)
        
        with open(config_file_path, "w") as f:
            config.write(f)
        
        return api_key
    else:
        print("No API key provided. Using free tier with lower rate limits.")
        return None

# Get API key (optional)
semantic_key = get_semantic_scholar_key()
print("Semantic Scholar setup complete.")
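
The prompt-based helper above needs someone at the keyboard. For scripts or scheduled runs, one common alternative is to read the key from an environment variable. The sketch below assumes a variable named `SEMANTIC_SCHOLAR_API_KEY`; that name and the helper `key_from_environment` are illustrative choices, not something the API defines.

```python
import os

def key_from_environment(var_name="SEMANTIC_SCHOLAR_API_KEY"):
    """Return the API key stored in an environment variable, or None.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    return key or None  # Treat an empty or whitespace-only value as "no key"
```

A possible use is `semantic_key = key_from_environment() or get_semantic_scholar_key()`, which only prompts when the variable is not set.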

## Setting Up the API Client

In [None]:
class SemanticScholarAPI:
    def __init__(self, api_key=None):
        self.base_url = "https://api.semanticscholar.org/graph/v1"
        self.headers = {}
        if api_key:
            self.headers['x-api-key'] = api_key
        
        # Rate limiting
        self.last_request = 0
        self.min_interval = 1.0 if not api_key else 0.1  # Slower for free tier
    
    def _make_request(self, endpoint, params=None):
        """Make a rate-limited request to the API"""
        # Rate limiting
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        
        url = f"{self.base_url}/{endpoint}"
        # A timeout keeps the notebook from hanging if the server never answers
        response = requests.get(url, headers=self.headers, params=params, timeout=30)
        self.last_request = time.time()
        
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Error {response.status_code}: {response.text}")
            return None
    
    def search_papers(self, query, limit=10, fields=None):
        """Search for papers by query"""
        if fields is None:
            fields = ['title', 'abstract', 'authors', 'year', 'citationCount', 'url', 'venue']
        
        params = {
            'query': query,
            'limit': limit,
            'fields': ','.join(fields)
        }
        
        return self._make_request('paper/search', params)
    
    def get_paper(self, paper_id, fields=None):
        """Get detailed information about a specific paper"""
        if fields is None:
            fields = ['title', 'abstract', 'authors', 'year', 'citationCount', 'referenceCount', 'citations', 'references']
        
        params = {'fields': ','.join(fields)}
        return self._make_request(f'paper/{paper_id}', params)
    
    def get_author(self, author_id, fields=None):
        """Get information about an author"""
        if fields is None:
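            # Request year and citation count for each paper; a bare 'papers' field may only return IDs and titles
            fields = ['name', 'paperCount', 'citationCount', 'hIndex', 'papers.title', 'papers.year', 'papers.citationCount']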
        
        params = {'fields': ','.join(fields)}
        return self._make_request(f'author/{author_id}', params)

# Initialize the API client
ss_api = SemanticScholarAPI(semantic_key)
print("Semantic Scholar API client initialized.")
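
The client above spaces requests out, but it treats every non-200 response the same way: it prints the error and returns `None`. When the server answers with HTTP 429 (too many requests), a common pattern is to wait and try again, doubling the wait each time. The sketch below shows that idea in isolation. `get_with_backoff` and its parameters are hypothetical names introduced here, and the retry counts and delays are arbitrary starting points rather than values recommended by Semantic Scholar.

```python
import time

def get_with_backoff(send, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429, waiting longer each time.

    send is any zero-argument callable that returns an object with a
    .status_code attribute (for example a requests.Response).
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # Waits 1s, 2s, 4s, ... with the defaults
    return response  # Still rate-limited after all retries
```

Inside `_make_request`, the call could be wrapped as `get_with_backoff(lambda: requests.get(url, headers=self.headers, params=params, timeout=30))`.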

<a id='search'></a>

# Searching for Academic Papers

Let's start by searching for papers on a specific topic.

In [None]:
# Search for papers about machine learning
ml_papers = ss_api.search_papers('machine learning', limit=20)

if ml_papers and 'data' in ml_papers:
    print(f"Found {ml_papers['total']} papers about machine learning")
    print(f"Retrieved {len(ml_papers['data'])} papers")
    
    # Look at the first paper
    first_paper = ml_papers['data'][0]
    print(f"\nFirst paper:")
    print(f"Title: {first_paper['title']}")
    print(f"Year: {first_paper['year']}")
    print(f"Citations: {first_paper['citationCount']}")
    print(f"Authors: {[author['name'] for author in first_paper['authors'][:3]]}")
else:
    print("No papers found or API error occurred")
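
The output above usually shows a `total` far larger than the number of papers retrieved, because each request returns a single page. A minimal pagination sketch follows. It assumes the search response carries a `next` offset while more pages remain and that the endpoint accepts an `offset` parameter; `fetch_all` and the `search_page` callable are hypothetical helpers, and the `search_papers` method defined earlier would need an extra `offset` entry in `params` to play that role.

```python
def fetch_all(search_page, query, max_results=250, page_size=100):
    """Collect up to max_results papers by following the 'next' offset.

    search_page(query, offset, limit) is a hypothetical callable that returns
    the parsed JSON of one result page, or None on error.
    """
    papers = []
    offset = 0
    while len(papers) < max_results:
        limit = min(page_size, max_results - len(papers))
        page = search_page(query, offset, limit)
        if not page or not page.get('data'):
            break  # Error or empty page: stop with what we have
        papers.extend(page['data'])
        if 'next' not in page:
            break  # No further pages
        offset = page['next']
    return papers
```

Keeping `max_results` modest matters here: every page is a separate request that counts against the rate limit.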

In [None]:
# Convert to DataFrame for easier analysis
df_papers = pd.DataFrame()  # Stays empty if the search failed, so later `.empty` checks still work
if ml_papers and 'data' in ml_papers:
    papers_data = []
    
    for paper in ml_papers['data']:
        papers_data.append({
            'paperId': paper['paperId'],
            'title': paper['title'],
            'year': paper['year'],
            'citationCount': paper['citationCount'],
            'venue': paper.get('venue') or 'Unknown',  # Also covers empty or null venues
            'abstract': paper.get('abstract', '')[:200] if paper.get('abstract') else '',
            'num_authors': len(paper['authors']),
            'first_author': paper['authors'][0]['name'] if paper['authors'] else 'Unknown'
        })
    
    df_papers = pd.DataFrame(papers_data)
    print(f"Created DataFrame with {len(df_papers)} papers")
    display(df_papers.head())  # A bare expression inside an if-block is not rendered by the notebook

## Analyzing Paper Characteristics

In [None]:
# Basic statistics about the papers
if not df_papers.empty:
    print("Paper Statistics:")
    print(f"Average citations: {df_papers['citationCount'].mean():.1f}")
    print(f"Median citations: {df_papers['citationCount'].median():.1f}")
    print(f"Year range: {df_papers['year'].min()} - {df_papers['year'].max()}")
    print(f"Average authors per paper: {df_papers['num_authors'].mean():.1f}")
    
    # Visualizations
    plt.figure(figsize=(15, 4))
    
    plt.subplot(1, 3, 1)
    df_papers['citationCount'].hist(bins=15, alpha=0.7)
    plt.xlabel('Citation Count')
    plt.ylabel('Frequency')
    plt.title('Distribution of Citation Counts')
    
    plt.subplot(1, 3, 2)
    df_papers['year'].hist(bins=15, alpha=0.7, color='orange')
    plt.xlabel('Publication Year')
    plt.ylabel('Frequency')
    plt.title('Papers by Publication Year')
    
    plt.subplot(1, 3, 3)
    df_papers['num_authors'].hist(bins=10, alpha=0.7, color='green')
    plt.xlabel('Number of Authors')
    plt.ylabel('Frequency')
    plt.title('Distribution of Author Count')
    
    plt.tight_layout()
    plt.show()

## Most Cited Papers and Venues

In [None]:
if not df_papers.empty:
    # Top cited papers
    print("Top 5 Most Cited Papers:")
    top_cited = df_papers.nlargest(5, 'citationCount')[['title', 'year', 'citationCount', 'first_author']]
    for idx, row in top_cited.iterrows():
        print(f"{row['citationCount']} citations: {row['title'][:80]}... ({row['year']})")
    
    print("\nTop Venues:")
    venue_counts = df_papers['venue'].value_counts().head(10)
    print(venue_counts)

## 🥊 Challenge: Compare Research Areas

- Search for papers in 2-3 different research areas
- Compare citation patterns, publication years, and author counts
- Which field tends to have more citations? More collaborators?

In [None]:
# YOUR CODE HERE



<a id='authors'></a>

# Author and Citation Analysis

Let's dive deeper into author information and citation networks.

## Getting Detailed Paper Information

In [None]:
# Get detailed information about the most cited paper
detailed_paper = None  # Defined up front so the next cell can test it safely
if not df_papers.empty:
    most_cited_id = df_papers.loc[df_papers['citationCount'].idxmax(), 'paperId']
    print(f"Getting details for paper ID: {most_cited_id}")
    
    detailed_paper = ss_api.get_paper(most_cited_id)
    
    if detailed_paper:
        print(f"Title: {detailed_paper['title']}")
        print(f"Citations: {detailed_paper['citationCount']}")
        print(f"References: {detailed_paper['referenceCount']}")
        # The abstract key can be present but null, so fall back with `or` before slicing
        print(f"Abstract: {(detailed_paper.get('abstract') or 'No abstract available')[:300]}...")
        
        # Authors information
        print(f"\nAuthors:")
        for author in detailed_paper['authors'][:5]:  # First 5 authors
            print(f"  - {author['name']} (ID: {author['authorId']})")

## Author Profile Analysis

In [None]:
# Analyze the first author of the most cited paper
if detailed_paper and detailed_paper['authors']:
    first_author_id = detailed_paper['authors'][0]['authorId']
    
    if first_author_id:
        author_info = ss_api.get_author(first_author_id)
        
        if author_info:
            print(f"Author: {author_info['name']}")
            print(f"Total papers: {author_info['paperCount']}")
            print(f"Total citations: {author_info['citationCount']}")
            print(f"h-index: {author_info['hIndex']}")
            
            # Analyze their recent papers
            if 'papers' in author_info and author_info['papers']:
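                recent_papers = author_info['papers'][:10]  # First 10 papers returned (not necessarily the newest)
                
                # Keep only papers with a known year so both lists stay the same length
                dated_papers = [p for p in recent_papers if p.get('year')]
                years = [p['year'] for p in dated_papers]
                citations = [p.get('citationCount') or 0 for p in dated_papers]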
                
                plt.figure(figsize=(12, 4))
                
                plt.subplot(1, 2, 1)
                plt.scatter(years, citations, alpha=0.7)
                plt.xlabel('Publication Year')
                plt.ylabel('Citation Count')
                plt.title(f"Recent Papers by {author_info['name']}")
                
                plt.subplot(1, 2, 2)
                year_counts = pd.Series(years).value_counts().sort_index()
                year_counts.plot(kind='bar', alpha=0.7)
                plt.xlabel('Year')
                plt.ylabel('Number of Papers')
                plt.title('Publications by Year')
                plt.xticks(rotation=45)
                
                plt.tight_layout()
                plt.show()
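
The author profile above reports an h-index without showing how the number comes about. By definition, an author has h-index *h* when *h* of their papers have at least *h* citations each. The sketch below computes it from a plain list of citation counts. `h_index` is a helper written for this notebook, and its result may differ from the API's `hIndex` if it is given only a subset of the author's papers.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    counts = sorted((c for c in citation_counts if c is not None), reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank  # The top `rank` papers all have at least `rank` citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have at least 4 citations
```

With the `citations` list from the cell above, `h_index(citations)` gives the h-index of just those ten papers.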

<a id='trends'></a>

# Research Trend Analysis

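Let's analyze how research topics have evolved over time. Each search below retrieves at most 100 papers per topic, so the yearly counts describe that sample of search results rather than the total output of a field.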

In [None]:
def analyze_research_trends(topics, years_back=10):
    """Analyze research trends for multiple topics over time"""
    current_year = datetime.now().year
    start_year = current_year - years_back
    
    trend_data = []
    
    for topic in topics:
        print(f"Analyzing trend for: {topic}")
        
        # Search for papers on this topic
        papers = ss_api.search_papers(topic, limit=100)
        
        if papers and 'data' in papers:
            for paper in papers['data']:
                if paper['year'] and paper['year'] >= start_year:
                    trend_data.append({
                        'topic': topic,
                        'year': paper['year'],
                        'title': paper['title'],
                        'citations': paper['citationCount'],
                        'authors': len(paper['authors'])
                    })
    
    return pd.DataFrame(trend_data)

# Analyze trends for AI-related topics
ai_topics = ['deep learning', 'neural networks', 'computer vision']
trends_df = analyze_research_trends(ai_topics, years_back=8)

print(f"Collected trend data: {len(trends_df)} papers")
if not trends_df.empty:
    display(trends_df.head())  # A bare expression inside an if-block is not rendered by the notebook
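
`analyze_research_trends` downloads 100 results per topic and then throws away the ones older than `start_year`, which can leave few papers per year. A possible refinement is to ask the server for the year window directly. The sketch below only builds the query parameters. It assumes the search endpoint accepts a `year` parameter written as a range such as `2016-2024`, which should be checked against the current API documentation; `search_params_for_recent_years` is a hypothetical helper.

```python
from datetime import datetime

def search_params_for_recent_years(query, years_back, limit=100, current_year=None):
    """Build search parameters restricted to the last `years_back` years.

    The 'year' range parameter is an assumption about the search endpoint.
    """
    if current_year is None:
        current_year = datetime.now().year
    start_year = current_year - years_back
    return {
        'query': query,
        'limit': limit,
        'year': f"{start_year}-{current_year}",
        'fields': 'title,year,citationCount,authors',
    }
```

The resulting dictionary could be passed to the client as `ss_api._make_request('paper/search', params)`.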

In [None]:
# Visualize research trends
if not trends_df.empty:
    # Papers published by year and topic
    yearly_counts = trends_df.groupby(['topic', 'year']).size().reset_index(name='count')
    
    plt.figure(figsize=(15, 5))
    
    # Plot 1: Number of papers over time
    plt.subplot(1, 3, 1)
    for topic in ai_topics:
        topic_data = yearly_counts[yearly_counts['topic'] == topic]
        plt.plot(topic_data['year'], topic_data['count'], marker='o', label=topic, linewidth=2)
    
    plt.xlabel('Year')
    plt.ylabel('Number of Papers')
    plt.title('Research Output Trends')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Plot 2: Average citations by topic and year
    plt.subplot(1, 3, 2)
    avg_citations = trends_df.groupby(['topic', 'year'])['citations'].mean().reset_index()
    
    for topic in ai_topics:
        topic_data = avg_citations[avg_citations['topic'] == topic]
        plt.plot(topic_data['year'], topic_data['citations'], marker='s', label=topic, linewidth=2)
    
    plt.xlabel('Year')
    plt.ylabel('Average Citations')
    plt.title('Citation Trends by Topic')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Plot 3: Collaboration trends (average authors)
    plt.subplot(1, 3, 3)
    avg_authors = trends_df.groupby(['topic', 'year'])['authors'].mean().reset_index()
    
    for topic in ai_topics:
        topic_data = avg_authors[avg_authors['topic'] == topic]
        plt.plot(topic_data['year'], topic_data['authors'], marker='^', label=topic, linewidth=2)
    
    plt.xlabel('Year')
    plt.ylabel('Average Authors per Paper')
    plt.title('Collaboration Trends')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\nTrend Summary:")
    summary = trends_df.groupby('topic').agg({
        'citations': ['mean', 'median'],
        'authors': 'mean',
        'year': ['min', 'max']
    }).round(2)
    print(summary)

<a id='demo'></a>

# 🎬 Demo: Citation Network Analysis

Let's explore the citation network around a highly cited paper to understand academic influence and connections.

In [None]:
def build_citation_network(paper_id, max_citations=10, max_references=10):
    """Build a citation network around a paper"""
    print(f"Building citation network for paper: {paper_id}")
    
    # Get the main paper with citations and references
    fields = ['title', 'year', 'citationCount', 'citations.title', 'citations.year', 
              'citations.citationCount', 'references.title', 'references.year', 'references.citationCount']
    
    paper = ss_api.get_paper(paper_id, fields)
    
    if not paper:
        return None
    
    network_data = {
        'main_paper': {
            'id': paper_id,
            'title': paper['title'],
            'year': paper['year'],
            'citations': paper['citationCount']
        },
        'citing_papers': [],
        'referenced_papers': []
    }
    
    # Process citing papers (papers that cite this one)
    if 'citations' in paper and paper['citations']:
        for citation in paper['citations'][:max_citations]:
            network_data['citing_papers'].append({
                # Fields can be present but null, so use `or` instead of a .get() default
                'title': citation.get('title') or 'No title',
                'year': citation.get('year'),
                'citations': citation.get('citationCount') or 0
            })
    
    # Process referenced papers (papers this one cites)
    if 'references' in paper and paper['references']:
        for reference in paper['references'][:max_references]:
            network_data['referenced_papers'].append({
                # Fields can be present but null, so use `or` instead of a .get() default
                'title': reference.get('title') or 'No title',
                'year': reference.get('year'),
                'citations': reference.get('citationCount') or 0
            })
    
    return network_data

# Build citation network for the most cited paper from our earlier search
citation_network = None  # Defined up front so the next cell can test it safely
if not df_papers.empty:
    target_paper_id = df_papers.loc[df_papers['citationCount'].idxmax(), 'paperId']
    citation_network = build_citation_network(target_paper_id)
    
    if citation_network:
        print(f"Citation network for: {citation_network['main_paper']['title'][:60]}...")
        print(f"Main paper citations: {citation_network['main_paper']['citations']}")
        print(f"Papers citing this one: {len(citation_network['citing_papers'])}")
        print(f"Papers referenced by this one: {len(citation_network['referenced_papers'])}")

In [None]:
# Analyze the citation network
if citation_network:
    # Create DataFrames for analysis
    citing_df = pd.DataFrame(citation_network['citing_papers'])
    referenced_df = pd.DataFrame(citation_network['referenced_papers'])
    
    plt.figure(figsize=(15, 10))
    
    # Plot 1: Citations over time for citing papers
    plt.subplot(2, 3, 1)
    if not citing_df.empty and 'year' in citing_df.columns:
        citing_df = citing_df.dropna(subset=['year'])
        if not citing_df.empty:
            plt.scatter(citing_df['year'], citing_df['citations'], alpha=0.7, color='red')
            plt.xlabel('Publication Year')
            plt.ylabel('Citation Count')
            plt.title('Papers Citing This Work')
    
    # Plot 2: Citations over time for referenced papers
    plt.subplot(2, 3, 2)
    if not referenced_df.empty and 'year' in referenced_df.columns:
        referenced_df = referenced_df.dropna(subset=['year'])
        if not referenced_df.empty:
            plt.scatter(referenced_df['year'], referenced_df['citations'], alpha=0.7, color='blue')
            plt.xlabel('Publication Year')
            plt.ylabel('Citation Count')
            plt.title('Papers Referenced by This Work')
    
    # Plot 3: Distribution of citation counts
    plt.subplot(2, 3, 3)
    all_citations = []
    labels = []
    
    if not citing_df.empty:
        all_citations.extend(citing_df['citations'].tolist())
        labels.extend(['Citing'] * len(citing_df))
    
    if not referenced_df.empty:
        all_citations.extend(referenced_df['citations'].tolist())
        labels.extend(['Referenced'] * len(referenced_df))
    
    if all_citations:
        citation_df = pd.DataFrame({'citations': all_citations, 'type': labels})
        citation_df.boxplot(column='citations', by='type', ax=plt.gca())
        plt.title('Citation Distribution Comparison')
        plt.suptitle('')  # Remove automatic title
    
    # Plot 4: Timeline of citing papers
    plt.subplot(2, 3, 4)
    if not citing_df.empty and 'year' in citing_df.columns:
        citing_df = citing_df.dropna(subset=['year'])
        if not citing_df.empty:
            year_counts = citing_df['year'].value_counts().sort_index()
            year_counts.plot(kind='bar', alpha=0.7, color='red')
            plt.xlabel('Year')
            plt.ylabel('Number of Citing Papers')
            plt.title('Citation Timeline')
            plt.xticks(rotation=45)
    
    # Plot 5: Top citing papers
    plt.subplot(2, 3, 5)
    if not citing_df.empty:
        top_citing = citing_df.nlargest(5, 'citations')
        titles = [title[:20] + '...' for title in top_citing['title']]
        plt.barh(range(len(titles)), top_citing['citations'], alpha=0.7, color='red')
        plt.yticks(range(len(titles)), titles)
        plt.xlabel('Citation Count')
        plt.title('Most Cited Papers That Cite This Work')
    
    # Plot 6: Top referenced papers
    plt.subplot(2, 3, 6)
    if not referenced_df.empty:
        top_referenced = referenced_df.nlargest(5, 'citations')
        titles = [title[:20] + '...' for title in top_referenced['title']]
        plt.barh(range(len(titles)), top_referenced['citations'], alpha=0.7, color='blue')
        plt.yticks(range(len(titles)), titles)
        plt.xlabel('Citation Count')
        plt.title('Most Cited Referenced Papers')
    
    plt.tight_layout()
    plt.show()
    
    # Print network statistics
    print("\nCitation Network Analysis:")
    if not citing_df.empty:
        print(f"Average citations of citing papers: {citing_df['citations'].mean():.1f}")
        print(f"Most cited citing paper: {citing_df['citations'].max()} citations")
    
    if not referenced_df.empty:
        print(f"Average citations of referenced papers: {referenced_df['citations'].mean():.1f}")
        print(f"Most cited referenced paper: {referenced_df['citations'].max()} citations")

## Collecting Data for Your Final Project

Here's a comprehensive template for collecting academic literature data:

In [None]:
def collect_research_data(topics, papers_per_topic=50, include_authors=False, include_citations=False):
    """Collect comprehensive research data for analysis"""
    all_papers = []
    all_authors = []
    all_citations = []
    
    for topic in topics:
        print(f"Collecting papers for: {topic}")
        
        # Search for papers
        papers = ss_api.search_papers(topic, limit=papers_per_topic)
        
        if papers and 'data' in papers:
            for paper in papers['data']:
                paper_data = {
                    'paperId': paper['paperId'],
                    'topic_search': topic,
                    'title': paper['title'],
                    'year': paper['year'],
                    'citationCount': paper['citationCount'],
                    'venue': paper.get('venue', ''),
                    'abstract': paper.get('abstract', ''),
                    'num_authors': len(paper['authors']),
                    'url': paper.get('url', '')
                }
                all_papers.append(paper_data)
                
                # Collect author information if requested
                if include_authors:
                    for author in paper['authors']:
                        all_authors.append({
                            'paperId': paper['paperId'],
                            'authorId': author['authorId'],
                            'name': author['name'],
                            'topic_search': topic
                        })
                
                # Get detailed citation info if requested (warning: slow, one extra request per paper!)
                if include_citations and (paper['citationCount'] or 0) > 10:
                    detailed = ss_api.get_paper(paper['paperId'])
                    if detailed and detailed.get('citations'):
                        for citation in detailed['citations'][:5]:  # First 5 citations returned
                            all_citations.append({
                                'cited_paper_id': paper['paperId'],
                                'citing_paper_title': citation.get('title', ''),
                                'citing_paper_year': citation.get('year', None),
                                'topic_search': topic
                            })
        
        print(f"  Collected {len([p for p in all_papers if p['topic_search'] == topic])} papers")
    
    # Convert to DataFrames
    df_papers = pd.DataFrame(all_papers)
    df_authors = pd.DataFrame(all_authors) if all_authors else None
    df_citations = pd.DataFrame(all_citations) if all_citations else None
    
    return df_papers, df_authors, df_citations

# Example usage for final project
# research_topics = ['natural language processing', 'computer vision', 'robotics']
# papers_df, authors_df, citations_df = collect_research_data(
#     research_topics, 
#     papers_per_topic=30, 
#     include_authors=True
# )
# 
# # Save the data
# papers_df.to_csv('research_papers.csv', index=False)
# if authors_df is not None:
#     authors_df.to_csv('research_authors.csv', index=False)
# 
# print(f"Final dataset: {len(papers_df)} papers from {len(research_topics)} topics")
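
When the topics overlap, the same paper can come back from more than one search, so `papers_df` may contain repeated `paperId` values. The sketch below merges such rows and records every topic that found the paper. `merge_duplicate_papers` and the `topics` column are hypothetical additions for illustration; `papers_df.drop_duplicates(subset='paperId')` is a shorter alternative if the topic information is not needed.

```python
def merge_duplicate_papers(paper_records):
    """Keep one record per paperId and list every topic search that returned it."""
    merged = {}
    for record in paper_records:
        paper_id = record['paperId']
        if paper_id not in merged:
            # Copy the first record seen and start its list of topics
            merged[paper_id] = {**record, 'topics': [record['topic_search']]}
        elif record['topic_search'] not in merged[paper_id]['topics']:
            merged[paper_id]['topics'].append(record['topic_search'])
    return list(merged.values())
```

A possible call is `pd.DataFrame(merge_duplicate_papers(papers_df.to_dict('records')))`, run before saving the data to CSV.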

<div class="alert alert-success">

## ❗ Key Points

* Semantic Scholar API provides access to millions of academic papers across multiple disciplines
* You can search by keywords, get detailed paper information, and analyze citation networks
* Author profiles include metrics like h-index, total citations, and publication history
* Citation networks reveal academic influence and research connections
* Research trend analysis can show how fields evolve over time
* The API is free but rate-limited; consider getting an API key for larger projects
* Academic data is excellent for bibliometric analysis, research impact studies, and trend identification
* Be mindful of rate limits when collecting large datasets
  
</div>