# Python Web APIs: Accessing Academic Research with OpenAlex

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Introduction to OpenAlex API](#openalex)
2. [Searching for Research Works](#works)
3. [Author and Institution Analysis](#authors)
4. [Research Topics and Concepts](#topics)
5. [Data Analysis with Academic Metrics](#analysis)
6. [Demo: Research Network Visualization](#demo)

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import requests
import json
import time
import seaborn as sns

<a id='openalex'></a>

# OpenAlex API

OpenAlex is a free and open catalog of scholarly papers, authors, institutions, and more. It's the successor to Microsoft Academic Graph and provides access to:

- **Works**: Scholarly papers, books, datasets, etc.
- **Authors**: Researcher profiles and affiliations
- **Sources**: Journals, conferences, and repositories
- **Institutions**: Universities, companies, and research organizations
- **Topics**: Research areas and subject classifications
- **Publishers**: Academic publishers
- **Funders**: Funding organizations

💡 **Tip**: OpenAlex is completely free and requires no API key! However, adding your email to requests is recommended for best performance and helps them track usage.

## Setting Up API Access

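OpenAlex doesn't require an API key, but its documentation recommends including your email address in requests. Doing so places you in what OpenAlex calls the "polite pool", which gets faster and more consistent response times, and it helps the maintainers understand usage patterns. The API allows up to 100,000 requests per user per day, at no more than 10 requests per second.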

In [None]:
import configparser
import os
from getpass import getpass

def get_email_for_openalex():
    config_file_path = os.path.expanduser("~/.notebook-api-keys")
    config = configparser.ConfigParser(interpolation=None)
    
    # Try reading the existing config file
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    
    # Check if email is already stored
    if config.has_option("API_KEYS", "OPENALEX_EMAIL"):
        update_email = input("Email for OpenAlex already exists. Do you want to update it? (y/n): ").lower()
        if update_email == 'n':
            return config.get("API_KEYS", "OPENALEX_EMAIL")
    
    # Get email from user
    email = input("Enter your email for OpenAlex API (optional but recommended): ")
    
    if email:
        # Save the email in the config file
        if not config.has_section("API_KEYS"):
            config.add_section("API_KEYS")
        config.set("API_KEYS", "OPENALEX_EMAIL", email)
        
        with open(config_file_path, "w") as f:
            config.write(f)
    
    return email

# Get email for better API performance
user_email = get_email_for_openalex()

print("OpenAlex setup complete.")
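
To make the role of the email concrete, here is a minimal sketch of how the address ends up in a request URL as a `mailto` query parameter. Nothing is sent over the network. The helper name `preview_url` and the address `you@example.org` are placeholders invented for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://api.openalex.org"

def preview_url(endpoint, params=None, email=None):
    """Return the URL a GET request would use, without sending anything."""
    query = dict(params or {})
    if email:
        query["mailto"] = email  # the email travels as a plain query parameter
    return f"{BASE_URL}/{endpoint}?{urlencode(query)}"

# "you@example.org" is a placeholder address
url = preview_url("works", {"search": "artificial intelligence"}, email="you@example.org")
print(url)
print(parse_qs(urlparse(url).query))
```

Printing the URL before sending a request is a quick way to check that a search term or email was encoded the way you expected.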

## Setting Up the OpenAlex API Client

Unlike other APIs we've used, OpenAlex doesn't have an official Python client, but its REST API is straightforward to use directly. We'll create a simple wrapper class to make our requests easier.

In [None]:
class OpenAlexAPI:
    def __init__(self, email=None):
        self.base_url = "https://api.openalex.org"
        self.email = email
        # Rate limiting - be respectful!
        self.last_request = 0
        self.min_interval = 0.1  # 100ms between requests
    
    def _make_request(self, endpoint, params=None):
        """Make a rate-limited request to the API"""
        # Rate limiting
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        
        if params is None:
            params = {}
        
        # Add email if provided
        if self.email:
            params['mailto'] = self.email
        
        url = f"{self.base_url}/{endpoint}"
        response = requests.get(url, params=params, timeout=30)  # timeout avoids hanging on a stalled connection
        self.last_request = time.time()
        
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Error {response.status_code}: {response.text}")
            return None
    
    def search_works(self, query=None, filters=None, per_page=25, page=1):
        """Search for academic works"""
        params = {
            'per-page': per_page,
            'page': page
        }
        
        if query:
            params['search'] = query
        
        if filters:
            params['filter'] = filters
        
        return self._make_request('works', params)
    
    def get_work(self, work_id):
        """Get detailed information about a specific work"""
        return self._make_request(f'works/{work_id}')
    
    def search_authors(self, query=None, filters=None, per_page=25):
        """Search for authors"""
        params = {'per-page': per_page}
        
        if query:
            params['search'] = query
        
        if filters:
            params['filter'] = filters
        
        return self._make_request('authors', params)
    
    def get_author(self, author_id):
        """Get detailed information about a specific author"""
        return self._make_request(f'authors/{author_id}')
    
    def search_institutions(self, query=None, filters=None, per_page=25):
        """Search for institutions"""
        params = {'per-page': per_page}
        
        if query:
            params['search'] = query
        
        if filters:
            params['filter'] = filters
        
        return self._make_request('institutions', params)
    
    def search_topics(self, query=None, per_page=25):
        """Search for research topics"""
        params = {'per-page': per_page}
        
        if query:
            params['search'] = query
        
        return self._make_request('topics', params)

# Initialize the API client
openalex = OpenAlexAPI(user_email)
print("OpenAlex API client initialized.")

Perfect! We are now ready to explore academic research data!
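
The client above prints an error and returns `None` whenever the status code is not 200. For longer data collection runs, one option is to retry temporary failures (such as HTTP 429, "too many requests") with exponential backoff. The sketch below is an assumption about how that could look, not part of the client: `call_with_retry` and its `send` argument are hypothetical names, and `send` stands in for any zero-argument function that returns a `(status_code, payload)` pair.

```python
import time

def call_with_retry(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or retries run out.

    send is a hypothetical zero-argument callable returning (status_code, payload).
    """
    retryable = {429, 500, 502, 503, 504}  # statuses worth trying again
    for attempt in range(max_retries + 1):
        status, payload = send()
        if status == 200:
            return payload
        if status not in retryable or attempt == max_retries:
            return None
        sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts

# Demo with a fake sender: one rate-limit response, then success (no real waiting)
fake_responses = iter([(429, None), (200, {"results": []})])
print(call_with_retry(lambda: next(fake_responses), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the delays instead of actually waiting.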

<a id='works'></a>

# Searching for Research Works

Let's start by searching for academic papers on a specific topic. OpenAlex calls scholarly outputs "works"; the category includes journal articles, conference papers, books, datasets, and more.

In [None]:
# Search for works about artificial intelligence
ai_works = openalex.search_works(query="artificial intelligence", per_page=50)

if ai_works and 'results' in ai_works:
    print(f"Found {ai_works['meta']['count']} total works about artificial intelligence")
    print(f"Retrieved {len(ai_works['results'])} works")
    
    # Look at the first work
    first_work = ai_works['results'][0]
    print(f"\nFirst work:")
    print(f"Title: {first_work['title']}")
    print(f"Publication Year: {first_work['publication_year']}")
    print(f"Citations: {first_work['cited_by_count']}")
    print(f"Type: {first_work['type']}")
    print(f"Open Access: {first_work['open_access']['is_oa']}")
else:
    print("No works found or API error occurred")

Let's examine the structure of a single work to understand what data is available to us:

In [None]:
# Examine the structure of the first work (excluding some verbose fields)
if ai_works and 'results' in ai_works:
    work = ai_works['results'][0]
    
    # Display key information
    print("Available fields:")
    for key in work.keys():
        if key not in ['abstract_inverted_index', 'referenced_works', 'related_works']:
            print(f"  {key}: {type(work[key])}")
    
    print(f"\nAuthors ({len(work['authorships'])}):")    
    for i, authorship in enumerate(work['authorships'][:3]):  # First 3 authors
        author = authorship['author']
        print(f"  {i+1}. {author['display_name']} (ID: {author['id']})")
        if authorship['institutions']:
            inst = authorship['institutions'][0]
            print(f"     Institution: {inst['display_name']}")
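
The cell above skips `abstract_inverted_index` because it is verbose: OpenAlex stores abstracts as a mapping from each word to the positions where it occurs, rather than as plain text. A minimal sketch of turning that mapping back into readable text follows. The function name and the tiny sample index are made up for illustration.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    # Pair every position with its word, then sort by position
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))

# Hypothetical sample in the same shape as the API field
sample_index = {"Deep": [0], "learning": [1, 4], "improves": [2], "machine": [3]}
print(abstract_from_inverted_index(sample_index))
```

With a real response, the same function could be applied to `work['abstract_inverted_index']`, which may be `None` for works without an abstract.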

## Converting API Results to a DataFrame

Let's convert our results into a pandas DataFrame for easier analysis:

In [None]:
def works_to_dataframe(works_response):
    """Convert OpenAlex works response to a clean DataFrame"""
    if not works_response or 'results' not in works_response:
        return pd.DataFrame()
    
    data = []
    for work in works_response['results']:
        # Extract basic information
        row = {
            'id': work['id'],
            'title': work['title'],
            'publication_year': work['publication_year'],
            'cited_by_count': work['cited_by_count'],
            'type': work['type'],
            'is_open_access': work['open_access']['is_oa'],
            'oa_type': work['open_access'].get('oa_status', 'closed'),  # the API field is named oa_status
            'num_authors': len(work['authorships']),
            'num_concepts': len(work['concepts']),
            'primary_location': None,
            'journal': None,
            'publisher': None
        }
        
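        # Extract publication venue information
        if work['primary_location']:
            primary = work['primary_location']
            # primary_location has no display_name of its own; keep its landing page URL
            row['primary_location'] = primary.get('landing_page_url')
            if primary.get('source'):
                row['journal'] = primary['source'].get('display_name')
                row['publisher'] = primary['source'].get('host_organization_name')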
        
        # Extract first author
        if work['authorships']:
            first_author = work['authorships'][0]['author']
            row['first_author'] = first_author['display_name']
            row['first_author_id'] = first_author['id']
        else:
            row['first_author'] = None
            row['first_author_id'] = None
        
        # Extract top concept
        if work['concepts']:
            top_concept = max(work['concepts'], key=lambda x: x['score'])
            row['top_concept'] = top_concept['display_name']
            row['top_concept_score'] = top_concept['score']
        else:
            row['top_concept'] = None
            row['top_concept_score'] = None
        
        data.append(row)
    
    return pd.DataFrame(data)

# Convert our AI works to a DataFrame
if ai_works:
    df_ai = works_to_dataframe(ai_works)
    print(f"Created DataFrame with {len(df_ai)} works")
    # A bare expression inside an if block is not rendered, so call display() explicitly
    display(df_ai.head())

In [None]:
# Inspect the DataFrame metadata
if not df_ai.empty:
    df_ai.info()
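
`works_to_dataframe` picks fields by hand, which gives full control over column names. An alternative is to flatten the nested dictionaries generically, so that `open_access -> is_oa` becomes a column called `open_access.is_oa`. The helper below is a sketch of that idea in plain Python. `flatten` and the sample record are invented for illustration, and lists are left untouched.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))  # recurse into nested dicts
        else:
            flat[full_key] = value  # scalars, lists, and None are kept as-is
    return flat

# Hypothetical, heavily trimmed work record
sample_work = {
    "id": "W1",
    "title": "An example work",
    "open_access": {"is_oa": True, "oa_status": "gold"},
    "primary_location": {"source": {"display_name": "Example Journal"}},
}
print(flatten(sample_work))
```

A list of flattened records can be passed straight to `pd.DataFrame(...)`; pandas also offers `pd.json_normalize` for the same purpose.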

## 🥊 Challenge: Explore a Research Topic

- Choose a research topic that interests you
- Search for works on that topic and convert to a DataFrame
- What are the most cited papers? What's the publication year range?
- How many are open access vs. closed access?

In [None]:
# YOUR CODE HERE



<a id='authors'></a>

# Author and Institution Analysis

Let's explore author information and institutional affiliations. We can search for specific authors or analyze the most prolific researchers in our dataset.

In [None]:
# Get detailed information about the first author of the most cited work in our AI dataset
if not df_ai.empty:
    # Find the work with the most citations
    most_cited_work = df_ai.loc[df_ai['cited_by_count'].idxmax()]
    author_id = most_cited_work['first_author_id']
    
    if author_id:
        # Clean the author ID (remove URL prefix)
        clean_author_id = author_id.split('/')[-1] if '/' in author_id else author_id
        
        author_info = openalex.get_author(clean_author_id)
        
        if author_info:
            print(f"Author: {author_info['display_name']}")
            print(f"Works count: {author_info['works_count']}")
            print(f"Cited by count: {author_info['cited_by_count']}")
            print(f"h-index: {author_info['summary_stats']['h_index']}")
            print(f"i10-index: {author_info['summary_stats']['i10_index']}")
            
            # Show affiliations (this list can include past as well as current ones)
            if author_info.get('affiliations'):
                print(f"\nAffiliations:")
                for affiliation in author_info['affiliations'][:3]:
                    institution = affiliation['institution']
                    print(f"  - {institution['display_name']} ({institution.get('country_code', 'Unknown')})")
            
            # Show research areas (concepts); .get() guards against this legacy field being absent
            if author_info.get('x_concepts'):
                print(f"\nTop research areas:")
                for concept in author_info['x_concepts'][:5]:
                    print(f"  - {concept['display_name']} (score: {concept['score']:.2f})")
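
The h-index and i10-index printed above are computed by OpenAlex, but both are simple to work out from a list of per-paper citation counts. An author has h-index *h* if *h* of their papers have at least *h* citations each; the i10-index counts papers with at least 10 citations. The functions and citation counts below are illustrative only.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)

# Made-up citation counts for six papers
sample_citations = [25, 12, 10, 4, 3, 0]
print(h_index(sample_citations))    # 4
print(i10_index(sample_citations))  # 3
```

Here four papers have at least 4 citations, but there are not five papers with at least 5, so the h-index is 4.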

## Institution Analysis

Let's search for top institutions in artificial intelligence research:

In [None]:
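# Search for institutions linked to an AI-related topic.
# No sort parameter is sent, so results arrive in the API's default order;
# we rank the retrieved page by citation count locally.
top_ai_institutions = openalex.search_institutions(
    filters="topics.id:T11597",  # Topic ID used here for AI; confirm it with search_topics()
    per_page=20
)

if top_ai_institutions and 'results' in top_ai_institutions:
    print("Institutions in AI research (retrieved page, ranked by citation count):")
    print(f"Total found: {top_ai_institutions['meta']['count']}\n")
    
    ranked_institutions = sorted(
        top_ai_institutions['results'],
        key=lambda inst: inst['cited_by_count'],
        reverse=True
    )
    
    institutions_data = []
    for i, institution in enumerate(ranked_institutions[:10]):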
        institutions_data.append({
            'rank': i + 1,
            'name': institution['display_name'],
            'country': institution.get('country_code', 'Unknown'),
            'type': institution.get('type', 'Unknown'),
            'works_count': institution['works_count'],
            'cited_by_count': institution['cited_by_count']
        })
        
        print(f"{i+1:2d}. {institution['display_name']} ({institution.get('country_code', 'Unknown')})")
        print(f"    Works: {institution['works_count']:,}, Citations: {institution['cited_by_count']:,}")
    
    # Convert to DataFrame for analysis
    df_institutions = pd.DataFrame(institutions_data)
else:
    print("No institutions found or API error occurred")
    df_institutions = pd.DataFrame()
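
OpenAlex can also order results on the server through a `sort` query parameter, written as `field:desc` or `field:asc`. The sketch below only builds and prints such a URL, without sending it. The `country_code:ca` filter is an illustrative choice, not the filter used in the cell above, and the `search_institutions` wrapper in this notebook would need an extra argument before it could pass `sort` along.

```python
from urllib.parse import urlencode, urlparse, parse_qs

params = {
    "filter": "country_code:ca",     # illustrative filter (assumption)
    "sort": "cited_by_count:desc",   # most cited first
    "per-page": 20,
}
sorted_url = "https://api.openalex.org/institutions?" + urlencode(params)
print(sorted_url)

# Decode the query string again to confirm what would be sent
print(parse_qs(urlparse(sorted_url).query))
```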

<a id='topics'></a>

# Research Topics and Concepts

OpenAlex automatically classifies research works by topic. Let's explore the topic landscape in our field of interest.

In [None]:
# Search for AI-related topics
ai_topics = openalex.search_topics(query="artificial intelligence", per_page=20)

if ai_topics and 'results' in ai_topics:
    print("AI-related research topics:")
    print(f"Found {ai_topics['meta']['count']} total topics\n")
    
    for i, topic in enumerate(ai_topics['results'][:10]):
        print(f"{i+1:2d}. {topic['display_name']}")
        print(f"    Description: {(topic.get('description') or 'No description')[:100]}...")
        print(f"    Works: {topic['works_count']:,}, Citations: {topic['cited_by_count']:,}")
        print(f"    Field: {(topic.get('field') or {}).get('display_name', 'Unknown')}")
        print()
else:
    print("No topics found or API error occurred")

## Analyzing Concepts in Our Dataset

Let's analyze what concepts appear most frequently in our AI papers:

In [None]:
if not df_ai.empty:
    # Analyze top concepts
    concept_counts = df_ai['top_concept'].value_counts()
    print("Most common research concepts:")
    print(concept_counts.head(10))
    
    # Visualize concept distribution
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    concept_counts.head(10).plot(kind='barh', alpha=0.7)
    plt.title('Top 10 Research Concepts')
    plt.xlabel('Number of Papers')
    
    plt.subplot(1, 2, 2)
    df_ai['top_concept_score'].hist(bins=20, alpha=0.7, color='green')
    plt.title('Distribution of Concept Scores')
    plt.xlabel('Concept Score')
    plt.ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
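
The analysis above looks only at each paper's single highest-scoring concept. To see every concept that clears a confidence threshold, the raw `concepts` lists can be counted directly. The sketch below uses `collections.Counter` on a small invented sample. The function name and the 0.3 threshold are assumptions chosen for illustration.

```python
from collections import Counter

def count_concepts(works, min_score=0.3):
    """Count how many works mention each concept with at least min_score."""
    counts = Counter()
    for work in works:
        for concept in work.get("concepts") or []:
            if concept["score"] >= min_score:
                counts[concept["display_name"]] += 1
    return counts

# Made-up records shaped like the 'concepts' field of a work
sample_works = [
    {"concepts": [{"display_name": "Computer science", "score": 0.8},
                  {"display_name": "Artificial intelligence", "score": 0.6},
                  {"display_name": "Psychology", "score": 0.1}]},
    {"concepts": [{"display_name": "Artificial intelligence", "score": 0.7},
                  {"display_name": "Computer science", "score": 0.5}]},
    {"concepts": [{"display_name": "Medicine", "score": 0.9},
                  {"display_name": "Artificial intelligence", "score": 0.4}]},
]
print(count_concepts(sample_works).most_common(2))
```

With real data, the same function could be called on the list stored under `'results'` in a works response.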

<a id='analysis'></a>

# Data Analysis with Academic Metrics

Let's perform some comprehensive analysis on our research data, looking at trends over time, open access patterns, and citation distributions.

In [None]:
# Get more data for analysis - let's collect 100 AI papers
all_ai_works = []

# Get multiple pages of results
for page in range(1, 5):  # Get 4 pages = 100 works
    print(f"Fetching page {page}...")
    works = openalex.search_works(
        query="artificial intelligence", 
        per_page=25, 
        page=page
    )
    
    if works and 'results' in works:
        all_ai_works.extend(works['results'])
    
    time.sleep(0.1)  # Be nice to the API

print(f"Collected {len(all_ai_works)} total works")

# Convert to DataFrame
df_ai_large = works_to_dataframe({'results': all_ai_works})
print(f"DataFrame shape: {df_ai_large.shape}")
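
Page-number paging works well for a few hundred records, but OpenAlex caps it at the first 10,000 results. For larger collections the API offers cursor paging: the first request sends `cursor=*`, and each response carries a `next_cursor` value in its `meta` block for the following request. The generator below sketches that loop. `iter_cursor_results` and `fetch_page` are hypothetical names, and `fetch_page` is replaced by a fake so the example runs offline.

```python
def iter_cursor_results(fetch_page, max_pages=10):
    """Yield results page by page, following meta.next_cursor until it runs out.

    fetch_page is a hypothetical callable: cursor string in, response dict out.
    """
    cursor = "*"  # the starting cursor
    for _ in range(max_pages):
        response = fetch_page(cursor)
        if not response or not response.get("results"):
            break
        yield from response["results"]
        cursor = response.get("meta", {}).get("next_cursor")
        if not cursor:
            break

# Fake two-page API so the sketch needs no network access
fake_pages = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [{"id": "W3"}], "meta": {"next_cursor": None}},
}
collected = list(iter_cursor_results(fake_pages.get))
print([work["id"] for work in collected])
```

With the client from this notebook, `fetch_page` could be a small function that calls `openalex._make_request('works', {...})` with the `cursor` value added to the parameters.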

In [None]:
# Comprehensive analysis
if not df_ai_large.empty:
    # Basic statistics
    print("Dataset Overview:")
    print(f"Total papers: {len(df_ai_large)}")
    print(f"Year range: {df_ai_large['publication_year'].min()} - {df_ai_large['publication_year'].max()}")
    print(f"Median citations: {df_ai_large['cited_by_count'].median():.0f}")
    print(f"Open access papers: {df_ai_large['is_open_access'].sum()} ({df_ai_large['is_open_access'].mean()*100:.1f}%)")
    print(f"Average authors per paper: {df_ai_large['num_authors'].mean():.1f}")
    
    # Create comprehensive visualizations
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Publications over time
    yearly_counts = df_ai_large.groupby('publication_year').size()
    yearly_counts.plot(ax=axes[0,0], kind='line', marker='o', linewidth=2)
    axes[0,0].set_title('AI Publications Over Time')
    axes[0,0].set_xlabel('Year')
    axes[0,0].set_ylabel('Number of Papers')
    axes[0,0].grid(True, alpha=0.3)
    
    # 2. Citation distribution
    df_ai_large['cited_by_count'].hist(bins=50, ax=axes[0,1], alpha=0.7, color='skyblue')
    axes[0,1].set_title('Citation Distribution')
    axes[0,1].set_xlabel('Number of Citations')
    axes[0,1].set_ylabel('Frequency')
    axes[0,1].set_yscale('log')
    
    # 3. Open Access over time
    oa_by_year = df_ai_large.groupby('publication_year')['is_open_access'].agg(['sum', 'count'])
    oa_rate = (oa_by_year['sum'] / oa_by_year['count'] * 100).dropna()
    oa_rate.plot(ax=axes[0,2], kind='line', marker='s', color='green', linewidth=2)
    axes[0,2].set_title('Open Access Rate Over Time')
    axes[0,2].set_xlabel('Year')
    axes[0,2].set_ylabel('Open Access Rate (%)')
    axes[0,2].grid(True, alpha=0.3)
    
    # 4. Work types
    type_counts = df_ai_large['type'].value_counts()
    type_counts.plot(kind='pie', ax=axes[1,0], autopct='%1.1f%%')
    axes[1,0].set_title('Distribution of Work Types')
    axes[1,0].set_ylabel('')
    
    # 5. Author collaboration (number of authors)
    df_ai_large['num_authors'].hist(bins=range(1, 21), ax=axes[1,1], alpha=0.7, color='orange')
    axes[1,1].set_title('Author Collaboration Patterns')
    axes[1,1].set_xlabel('Number of Authors')
    axes[1,1].set_ylabel('Frequency')
    
    # 6. Citations vs. Year scatter
    scatter = axes[1,2].scatter(df_ai_large['publication_year'], df_ai_large['cited_by_count'], 
                               c=df_ai_large['is_open_access'], alpha=0.6, 
                               cmap='RdYlBu_r')
    axes[1,2].set_title('Citations vs Publication Year')
    axes[1,2].set_xlabel('Publication Year')
    axes[1,2].set_ylabel('Citations')
    axes[1,2].set_yscale('log')
    plt.colorbar(scatter, ax=axes[1,2], label='Open Access')
    
    plt.tight_layout()
    plt.show()
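
One caveat when reading the citations-versus-year plot: older papers have had more time to accumulate citations. A rough way to compare papers of different ages is citations per year since publication. The sketch below shows one possible definition. The function name, the `reference_year` value, and the sample numbers are all assumptions for illustration.

```python
def citations_per_year(cited_by_count, publication_year, reference_year):
    """Average citations per year, counting the publication year as year one."""
    age = max(reference_year - publication_year + 1, 1)  # never divide by zero
    return cited_by_count / age

# Made-up papers: (title, citations, publication year)
sample_papers = [
    ("Older paper", 1200, 2015),
    ("Recent paper", 300, 2023),
]
for title, citations, year in sample_papers:
    rate = citations_per_year(citations, year, reference_year=2024)
    print(f"{title}: {rate:.1f} citations/year")
```

In this made-up case the recent paper has fewer total citations but a higher yearly rate (150.0 versus 120.0), which a raw citation count would hide.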

## 🥊 Challenge: Research Impact Analysis

- Find the most impactful papers (highest cited_by_count)
- Compare open access vs. closed access papers - which get cited more?
- Analyze if there's a correlation between number of authors and citations
- What publication venues (journals) appear most frequently?

In [None]:
# YOUR CODE HERE



<a id='demo'></a>

# 🎬 Demo: Research Network Visualization

Let's create a network visualization showing connections between authors, institutions, and research topics.

In [None]:
%pip install networkx

In [None]:
import networkx as nx
from collections import Counter

def create_research_network(works_data, max_nodes=50):
    """Create a network of authors, institutions, and concepts"""
    G = nx.Graph()
    
    # Collect data for network
    author_papers = {}
    concept_papers = {}
    institution_authors = {}
    
    for work in works_data[:20]:  # Use first 20 works to keep network manageable
        work_id = work['id']
        
        # Process authors and their institutions
        for authorship in work['authorships'][:3]:  # Top 3 authors per paper
            author = authorship['author']
            author_name = author['display_name']
            
            # Add author node
            if author_name not in author_papers:
                author_papers[author_name] = []
            author_papers[author_name].append(work_id)
            
            G.add_node(author_name, type='author', size=10)
            
            # Add institution connections
            for institution in authorship['institutions'][:1]:  # Primary institution
                inst_name = institution['display_name']
                
                # Simplify long institution names
                if len(inst_name) > 30:
                    inst_name = inst_name[:27] + '...'
                
                G.add_node(inst_name, type='institution', size=20)
                G.add_edge(author_name, inst_name, type='affiliation')
        
        # Process concepts
        for concept in work['concepts'][:2]:  # Top 2 concepts
            if concept['score'] > 0.3:  # Only high-confidence concepts
                concept_name = concept['display_name']
                
                if concept_name not in concept_papers:
                    concept_papers[concept_name] = []
                concept_papers[concept_name].append(work_id)
                
                G.add_node(concept_name, type='concept', size=15)
                
                # Connect authors to concepts
                for authorship in work['authorships'][:2]:
                    author_name = authorship['author']['display_name']
                    G.add_edge(author_name, concept_name, type='research')
    
    # Add collaboration edges (authors who co-authored papers)
    for work in works_data[:20]:
        authors = [auth['author']['display_name'] for auth in work['authorships'][:3]]
        for i in range(len(authors)):
            for j in range(i+1, len(authors)):
                if G.has_edge(authors[i], authors[j]):
                    G[authors[i]][authors[j]]['weight'] = G[authors[i]][authors[j]].get('weight', 0) + 1
                else:
                    G.add_edge(authors[i], authors[j], type='collaboration', weight=1)
    
    return G

# Create the network
if all_ai_works:
    network = create_research_network(all_ai_works)
    print(f"Created network with {network.number_of_nodes()} nodes and {network.number_of_edges()} edges")
    
    # Visualize the network
    plt.figure(figsize=(16, 12))
    
    # Create layout
    pos = nx.spring_layout(network, k=1, iterations=50)
    
    # Color nodes by type
    node_colors = []
    node_sizes = []
    for node in network.nodes():
        node_data = network.nodes[node]
        if node_data['type'] == 'author':
            node_colors.append('lightblue')
            node_sizes.append(300)
        elif node_data['type'] == 'institution':
            node_colors.append('lightgreen')
            node_sizes.append(500)
        else:  # concept
            node_colors.append('lightcoral')
            node_sizes.append(400)
    
    # Draw the network
    nx.draw_networkx_nodes(network, pos, node_color=node_colors, node_size=node_sizes, alpha=0.7)
    
    # Draw different types of edges with different colors
    edge_colors = []
    for edge in network.edges():
        edge_data = network.edges[edge]
        if edge_data['type'] == 'collaboration':
            edge_colors.append('blue')
        elif edge_data['type'] == 'affiliation':
            edge_colors.append('green')
        else:  # research
            edge_colors.append('red')
    
    nx.draw_networkx_edges(network, pos, edge_color=edge_colors, alpha=0.5, width=0.8)
    
    # Add labels for important nodes only
    important_nodes = {node: node for node in network.nodes() if network.degree(node) > 2}
    nx.draw_networkx_labels(network, pos, important_nodes, font_size=8, font_weight='bold')
    
    plt.title('AI Research Network: Authors, Institutions, and Concepts', fontsize=16)
    plt.axis('off')
    
    # Create legend
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker='o', color='w', markerfacecolor='lightblue', markersize=10, label='Authors'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='lightgreen', markersize=10, label='Institutions'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='lightcoral', markersize=10, label='Concepts'),
        Line2D([0], [0], color='blue', linewidth=2, label='Collaboration'),
        Line2D([0], [0], color='green', linewidth=2, label='Affiliation'),
        Line2D([0], [0], color='red', linewidth=2, label='Research Area')
    ]
    plt.legend(handles=legend_elements, loc='upper left', bbox_to_anchor=(1, 1))
    
    plt.tight_layout()
    plt.show()
    
    # Network statistics
    print(f"\nNetwork Statistics:")
    print(f"Number of authors: {len([n for n in network.nodes() if network.nodes[n]['type'] == 'author'])}")
    print(f"Number of institutions: {len([n for n in network.nodes() if network.nodes[n]['type'] == 'institution'])}")
    print(f"Number of concepts: {len([n for n in network.nodes() if network.nodes[n]['type'] == 'concept'])}")
    print(f"Average clustering coefficient: {nx.average_clustering(network):.3f}")
    print(f"Network density: {nx.density(network):.3f}")
    
    # Most connected nodes
    degree_centrality = nx.degree_centrality(network)
    top_nodes = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    print(f"\nMost connected nodes:")
    for node, centrality in top_nodes:
        node_type = network.nodes[node]['type']
        print(f"  {node} ({node_type}): {centrality:.3f}")
else:
    print("No data available for network creation")
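
The statistics printed above can feel abstract, so here is a tiny worked example computed by hand in plain Python. For an undirected graph with *n* nodes and *m* edges, degree centrality of a node is its number of neighbors divided by *n* - 1, and density is 2*m* / (*n*(*n* - 1)). These are the same definitions NetworkX documents for `nx.degree_centrality` and `nx.density`. The four-node graph is invented, and because it is built from an edge list, isolated nodes are not represented.

```python
from collections import defaultdict

# Toy undirected graph: a triangle A-B-C plus D attached to C
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def degree_centrality(edges):
    """Fraction of the other nodes that each node is connected to."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}

def density(edges):
    """Share of all possible node pairs that are actually connected."""
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    m = len({frozenset(edge) for edge in edges})  # ignore duplicate edges
    return 2 * m / (n * (n - 1))

print(degree_centrality(edges))
print(round(density(edges), 3))
```

Node C touches all three other nodes, so its centrality is 1.0, while 4 of the 6 possible pairs are connected, giving a density of about 0.667.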

## Advanced Research Data Collection

Here's a template for collecting comprehensive research data using OpenAlex filters and advanced search:

In [None]:
def collect_comprehensive_research_data(topics, years_range=(2020, 2024), min_citations=5):
    """Collect comprehensive research data for multiple topics with filters"""
    all_data = []
    
    for topic in topics:
        print(f"Collecting data for: {topic}")
        
        # Build filters
        filters = f"publication_year:{years_range[0]}-{years_range[1]},cited_by_count:>{min_citations}"
        
        # Collect multiple pages
        for page in range(1, 6):  # 5 pages = 125 works max
            works = openalex.search_works(
                query=topic,
                filters=filters,
                per_page=25,
                page=page
            )
            
            if works and 'results' in works:
                for work in works['results']:
                    work_data = {
                        'search_topic': topic,
                        'id': work['id'],
                        'title': work['title'],
                        'publication_year': work['publication_year'],
                        'cited_by_count': work['cited_by_count'],
                        'type': work['type'],
                        'is_open_access': work['open_access']['is_oa'],
                        'oa_type': work['open_access'].get('oa_status', 'closed'),  # the API field is named oa_status
                        'journal': work['primary_location']['source']['display_name'] if work['primary_location'] and work['primary_location'].get('source') else None,
                        'publisher': work['primary_location']['source'].get('host_organization_name') if work['primary_location'] and work['primary_location'].get('source') else None,
                        'num_authors': len(work['authorships']),
                        'first_author': work['authorships'][0]['author']['display_name'] if work['authorships'] else None,
                        'first_author_institution': work['authorships'][0]['institutions'][0]['display_name'] if work['authorships'] and work['authorships'][0]['institutions'] else None,
                        'top_concept': max(work['concepts'], key=lambda x: x['score'])['display_name'] if work['concepts'] else None,
                        'top_concept_score': max(work['concepts'], key=lambda x: x['score'])['score'] if work['concepts'] else None,
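                        # Word count of the abstract: sum the positions per word (len() alone counts only distinct words)
                        'abstract_length': sum(len(positions) for positions in work['abstract_inverted_index'].values()) if work.get('abstract_inverted_index') else 0,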
                        'has_doi': bool(work.get('doi')),
                        'language': work.get('language', 'unknown')
                    }
                    all_data.append(work_data)
            
            time.sleep(0.1)  # Rate limiting
    
    return pd.DataFrame(all_data)

# Example usage for final project
# research_areas = ['machine learning', 'deep learning', 'natural language processing']
# comprehensive_df = collect_comprehensive_research_data(
#     research_areas,
#     years_range=(2022, 2024),
#     min_citations=10
# )
# 
# # Save the data
# comprehensive_df.to_csv('comprehensive_research_data.csv', index=False)
# print(f"Collected {len(comprehensive_df)} research works")
# print(f"Unique journals: {comprehensive_df['journal'].nunique()}")
# print(f"Unique institutions: {comprehensive_df['first_author_institution'].nunique()}")

print("Template ready for comprehensive data collection!")
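
The template builds its filter string by hand with an f-string. OpenAlex filters follow a compact grammar: `attribute:value` pairs joined by commas (logical AND), with `|` separating alternative values for one attribute (logical OR). A small helper can assemble such strings from a dictionary. `build_filter` is a hypothetical name, and the attributes in the example are illustrative.

```python
def build_filter(conditions):
    """Turn {'attribute': value} pairs into an OpenAlex-style filter string."""
    parts = []
    for attribute, value in conditions.items():
        if isinstance(value, bool):
            value = str(value).lower()                 # True -> "true"
        elif isinstance(value, (list, tuple)):
            value = "|".join(str(v) for v in value)    # OR within one attribute
        parts.append(f"{attribute}:{value}")
    return ",".join(parts)                             # AND across attributes

example_filter = build_filter({
    "publication_year": "2020-2024",
    "cited_by_count": ">5",
    "is_oa": True,
    "type": ["article", "book"],
})
print(example_filter)
```

The resulting string can be passed as the `filters` argument of `openalex.search_works`.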

<div class="alert alert-success">

## ❗ Key Points

* OpenAlex provides free access to millions of scholarly works, authors, institutions, and research topics
* No API key required, but including your email is recommended for better performance
* Rich metadata includes citations, open access status, author affiliations, and research concepts
* Advanced filtering allows you to narrow searches by year, citation count, publication type, and more
* Perfect for bibliometric analysis, research trend studies, and academic network analysis
* Author and institution data includes h-index, total citations, and research area classifications
* Research topics are automatically classified and can reveal interdisciplinary connections
* Network analysis can uncover collaboration patterns and research communities
* Excellent data source for final projects in digital humanities, science studies, or research analytics
  
</div>