# Python Web APIs: Accessing News Data with NewsAPI

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Setting up NewsAPI](#newsapi)
2. [Getting Headlines and Sources](#headlines)
3. [Searching for Specific Topics](#search)
4. [Analyzing News Data](#analysis)
5. [Demo: News Sentiment Over Time](#demo)

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import seaborn as sns
import requests
import json

<a id='newsapi'></a>

# NewsAPI

NewsAPI is an easy-to-use API that returns JSON search results for current and historic news articles published by over 80,000 sources worldwide. It is well suited to tracking news trends, running sentiment analysis, and following specific topics.
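
To make the "returns JSON" idea concrete, here is a minimal sketch of the kind of HTTP request the client library issues on our behalf. It only builds the request and never sends it. It assumes the v2 `top-headlines` endpoint and the `X-Api-Key` header described in the NewsAPI documentation; `build_headlines_request` and the placeholder key are hypothetical names used for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://newsapi.org/v2"  # v2 REST base URL (assumed from the NewsAPI docs)

def build_headlines_request(api_key, country="us", page_size=10):
    """Build, without sending, a GET request for top headlines."""
    params = urlencode({"country": country, "pageSize": page_size})
    url = f"{BASE_URL}/top-headlines?{params}"
    # Sending the key in a header keeps it out of the URL and out of logs
    return Request(url, headers={"X-Api-Key": api_key})

req = build_headlines_request("YOUR_API_KEY")  # placeholder, not a real key
print(req.full_url)
```

In the rest of this notebook the `newsapi-python` client handles these details, so we never have to assemble URLs by hand.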

## Getting Your API Key

Before proceeding, you'll need to:
1. Visit https://newsapi.org/
2. Click "Get API Key" and sign up for a free account
3. Copy your API key from your account dashboard

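⚠️ **Warning**: The free tier has limitations: a daily request cap (100 requests/day at the time of writing) and access only to articles from roughly the last month. Limits change, so check https://newsapi.org/pricing for current numbers. For production use, consider upgrading to a paid plan.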

## Installing Required Libraries

We'll use the `newsapi-python` library for easier interaction with the API:

In [None]:
%pip install newsapi-python

## Handling API Keys

Let's store our NewsAPI key outside the notebook, so it never appears in shared code, and retrieve it when needed:

In [None]:
import configparser
import os
from getpass import getpass

def get_api_key(api_name):
    config_file_path = os.path.expanduser("~/.notebook-api-keys")
    config = configparser.ConfigParser(interpolation=None)
    
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    
    # Check if API key is present
    if config.has_option("API_KEYS", api_name):
        update_key = input(f"An API key for {api_name} already exists. Do you want to update it? (y/n): ").lower()
        if update_key == 'n':
            return config.get("API_KEYS", api_name)
    
    # Get new API key
    api_key = getpass(f"Enter your {api_name} API key: ")

    # Save the API key
    if not config.has_section("API_KEYS"):
        config.add_section("API_KEYS")
    config.set("API_KEYS", api_name, api_key)
    
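    with open(config_file_path, "w") as f:
        config.write(f)
    # Restrict the file to the current user, since it holds a secret in plain text
    os.chmod(config_file_path, 0o600)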
    
    return api_key

# Get NewsAPI key
newsapi_key = get_api_key("NEWSAPI")
print("NewsAPI key retrieved successfully.")
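
When a notebook or script runs somewhere interactive prompts are not possible, an environment variable is a common alternative to a key file. The sketch below is one way to do that; the variable name `NEWSAPI_KEY` and the helper `get_key_from_env` are hypothetical, not something NewsAPI or the client library defines.

```python
import os

def get_key_from_env(var_name="NEWSAPI_KEY"):
    """Return the API key stored in an environment variable, or None."""
    # NEWSAPI_KEY is a hypothetical variable name; choose your own
    key = os.environ.get(var_name, "").strip()
    return key or None

if get_key_from_env() is None:
    print("NEWSAPI_KEY is not set; fall back to get_api_key('NEWSAPI')")
```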

## Initialize NewsAPI Client

In [None]:
from newsapi import NewsApiClient

# Initialize the client
newsapi = NewsApiClient(api_key=newsapi_key)
print("NewsAPI client initialized successfully.")

<a id='headlines'></a>

# Getting Headlines and Sources

Let's start by exploring what NewsAPI offers us.

## Top Headlines

We can get the current top headlines from various countries and categories:

In [None]:
# Get top headlines from the US
top_headlines = newsapi.get_top_headlines(country='us', page_size=10)

print(f"Total results: {top_headlines['totalResults']}")
print(f"Articles retrieved: {len(top_headlines['articles'])}")

# Look at the first article
first_article = top_headlines['articles'][0]
print(f"\nFirst headline: {first_article['title']}")
print(f"Source: {first_article['source']['name']}")
print(f"Published: {first_article['publishedAt']}")

In [None]:
# Examine the structure of an article
first_article
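
Each article is a dictionary with a nested `source` entry. The sketch below uses a made-up article with the same field layout to show how `pd.json_normalize` flattens that nesting into dotted column names such as `source.name`, which later cells rely on. All values are placeholders.

```python
import pandas as pd

# Made-up article with the same field layout as a NewsAPI response item
sample_article = {
    "source": {"id": None, "name": "Example News"},
    "author": "A. Writer",
    "title": "Placeholder headline",
    "description": "Placeholder description.",
    "url": "https://example.com/story",
    "urlToImage": None,
    "publishedAt": "2024-01-15T18:30:00Z",
    "content": "Placeholder content...",
}

# json_normalize flattens the nested "source" dict into dotted column names
df_sample = pd.json_normalize([sample_article])
print(df_sample.columns.tolist())
print(df_sample.loc[0, "source.name"])
```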

## Available News Sources

Let's explore what news sources are available:

In [None]:
# Get all sources
sources = newsapi.get_sources()

print(f"Total sources: {len(sources['sources'])}")

# Convert to DataFrame for easier analysis
df_sources = pd.DataFrame(sources['sources'])
print(f"\nColumns: {df_sources.columns.tolist()}")
df_sources.head()

In [None]:
# Analyze sources by category and country
print("Sources by category:")
print(df_sources['category'].value_counts())

print("\nSources by country:")
print(df_sources['country'].value_counts().head(10))

## Headlines by Category

💡 **Tip**: NewsAPI supports these categories: `business`, `entertainment`, `general`, `health`, `science`, `sports`, `technology`.

In [None]:
# Get technology headlines
tech_headlines = newsapi.get_top_headlines(category='technology', country='us', page_size=15)

# Convert to DataFrame
df_tech = pd.json_normalize(tech_headlines['articles'])
print(f"Retrieved {len(df_tech)} technology articles")

# Show titles and sources
for idx, row in df_tech[['title', 'source.name']].head().iterrows():
    print(f"{idx+1}. {row['title']} - {row['source.name']}")
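
One rule worth knowing before experimenting: according to the NewsAPI documentation, the top-headlines endpoint does not allow `sources` to be combined with `country` or `category`. The helper below is a hypothetical pre-flight check that catches the mix before a request is spent; it is a sketch, not part of the client library.

```python
def check_top_headlines_params(country=None, category=None, sources=None):
    """Raise ValueError for a parameter mix the top-headlines endpoint rejects."""
    if sources and (country or category):
        raise ValueError("'sources' cannot be combined with 'country' or 'category'")
    candidate = {"country": country, "category": category, "sources": sources}
    # Keep only the parameters that were actually supplied
    return {name: value for name, value in candidate.items() if value is not None}

print(check_top_headlines_params(country="us", category="technology"))
```

The returned dictionary could be unpacked into a call such as `newsapi.get_top_headlines(**params)`.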

## 🥊 Challenge: Compare Categories

- Get headlines from 3 different categories
- Compare the number of articles available
- Which category has the most diverse sources?

In [None]:
# YOUR CODE HERE



<a id='search'></a>

# Searching for Specific Topics

The real power of NewsAPI comes from searching for specific topics across all available sources.

## Basic Search

In [None]:
# Search for articles about climate change
climate_articles = newsapi.get_everything(
    q='climate change',
    language='en',
    sort_by='publishedAt',
    page_size=50
)

print(f"Found {climate_articles['totalResults']} articles about climate change")
print(f"Retrieved {len(climate_articles['articles'])} articles")

# Convert to DataFrame
df_climate = pd.json_normalize(climate_articles['articles'])
df_climate.head()
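
Notice that `totalResults` is usually much larger than the number of articles retrieved, because results come back one page at a time. The sketch below shows one way to walk through pages. To keep it runnable offline, it takes a `fetch_page` callable and is demonstrated with fake data; `fetch_all_pages` and `fake_fetch` are hypothetical names.

```python
def fetch_all_pages(fetch_page, page_size=50, max_articles=100):
    """Collect articles page by page until a limit or the end of results.

    fetch_page(page, page_size) must return a dict shaped like a NewsAPI
    response: {'totalResults': int, 'articles': [...]}.
    """
    collected = []
    page = 1
    while len(collected) < max_articles:
        response = fetch_page(page, page_size)
        batch = response["articles"]
        if not batch:
            break  # ran past the last page
        collected.extend(batch)
        if len(collected) >= response["totalResults"]:
            break  # everything available has been retrieved
        page += 1
    return collected[:max_articles]

# Offline stand-in for the API: 120 fake articles served in pages
fake_articles = [{"title": f"Article {i}"} for i in range(120)]

def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return {"totalResults": len(fake_articles),
            "articles": fake_articles[start:start + page_size]}

articles = fetch_all_pages(fake_fetch, page_size=50, max_articles=100)
print(len(articles))
```

With the real client, `fetch_page` could be something like `lambda page, page_size: newsapi.get_everything(q='climate change', language='en', page=page, page_size=page_size)`. Each page spends one request from the daily quota, and the free plan may limit how far you can page.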

## Advanced Search with Date Ranges

Let's search for articles within a specific time period:

In [None]:
# Calculate date range (last 7 days)
end_date = datetime.now()
start_date = end_date - timedelta(days=7)

# Format dates for API (YYYY-MM-DD)
from_date = start_date.strftime('%Y-%m-%d')
to_date = end_date.strftime('%Y-%m-%d')

print(f"Searching from {from_date} to {to_date}")

# Search for AI articles in the last week
ai_articles = newsapi.get_everything(
    q='artificial intelligence OR machine learning',
    from_param=from_date,
    to=to_date,
    language='en',
    sort_by='popularity',
    page_size=30
)

print(f"Found {ai_articles['totalResults']} AI articles in the last week")

# Convert to DataFrame
df_ai = pd.json_normalize(ai_articles['articles'])
df_ai.head()
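
The `q` parameter accepts more than plain keywords: the NewsAPI documentation describes quoted phrases for exact matches and the `AND`, `OR`, and `NOT` keywords, optionally grouped with parentheses. The helper below is a hypothetical convenience for assembling such queries without quoting mistakes; treat it as a sketch rather than a complete query builder.

```python
def build_query(any_of=(), all_of=(), none_of=()):
    """Combine terms into a NewsAPI-style boolean query string."""
    def quote(term):
        # Quote multi-word phrases so they are matched as a unit
        return f'"{term}"' if " " in term else term

    parts = []
    if any_of:
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    parts.extend(quote(t) for t in all_of)
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {quote(term)}"
    return query.strip()

q = build_query(any_of=["artificial intelligence", "machine learning"],
                none_of=["stock"])
print(q)
```

The resulting string can be passed directly as `q=` to `newsapi.get_everything`.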

## Search with Source Filtering

We can also limit our search to specific news sources:

In [None]:
# Get articles from specific sources
sources_list = 'bbc-news,cnn,reuters,the-guardian,the-new-york-times'

politics_articles = newsapi.get_everything(
    q='election OR politics',
    sources=sources_list,
    language='en',
    sort_by='publishedAt',
    page_size=25
)

print(f"Found {politics_articles['totalResults']} political articles from major sources")

df_politics = pd.json_normalize(politics_articles['articles'])

# Show distribution by source
print("\nArticles by source:")
print(df_politics['source.name'].value_counts())

<a id='analysis'></a>

# Analyzing News Data

Now let's perform some analysis on our collected news data.

## Publication Patterns

In [None]:
# Convert publication dates to datetime
df_ai['publishedAt'] = pd.to_datetime(df_ai['publishedAt'])
df_ai['hour'] = df_ai['publishedAt'].dt.hour
df_ai['day'] = df_ai['publishedAt'].dt.day_name()

# Analyze publication timing
plt.figure(figsize=(15, 4))

plt.subplot(1, 3, 1)
df_ai['hour'].hist(bins=range(25), alpha=0.7)  # one bin per hour, 0-23
plt.xlabel('Hour of Day (UTC)')
plt.ylabel('Number of Articles')
plt.title('AI Articles by Hour of Publication')

plt.subplot(1, 3, 2)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_counts = df_ai['day'].value_counts().reindex(day_order, fill_value=0)  # days with no articles count as 0
day_counts.plot(kind='bar', alpha=0.7)
plt.xlabel('Day of Week')
plt.ylabel('Number of Articles')
plt.title('AI Articles by Day of Week')
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
df_ai['source.name'].value_counts().head(10).plot(kind='barh', alpha=0.7)
plt.xlabel('Number of Articles')
plt.title('Top 10 Sources for AI Articles')

plt.tight_layout()
plt.show()
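
The hours plotted above are in UTC, because `publishedAt` timestamps end in `Z`. If you would rather reason in a local time zone, pandas can convert timezone-aware timestamps. The sketch below uses made-up timestamps and `America/New_York` purely as an example zone.

```python
import pandas as pd

# Made-up timestamps in the ISO 8601 UTC format used for publishedAt
published = pd.Series(["2024-01-15T18:30:00Z", "2024-01-16T02:05:00Z"])

utc_times = pd.to_datetime(published)  # timezone-aware, in UTC
local_times = utc_times.dt.tz_convert("America/New_York")  # example zone

print(local_times.dt.hour.tolist())
print(local_times.dt.day_name().tolist())
```

Note that the second article moves to the previous calendar day after conversion, which can change day-of-week counts as well as hourly ones.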

## Content Analysis

Let's analyze the content of headlines and descriptions:

In [None]:
# Install required libraries for text analysis
%pip install wordcloud textblob

In [None]:
from wordcloud import WordCloud
from textblob import TextBlob
import re

# Combine all titles for word cloud
all_titles = ' '.join(df_ai['title'].dropna())

# Clean text: lowercase and strip punctuation (WordCloud drops common stop words itself)
all_titles = re.sub(r'[^\w\s]', '', all_titles.lower())

# Create word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_titles)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in AI Article Headlines', fontsize=16)
plt.show()
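
A word cloud is a picture of word frequencies, so it can help to look at the counts directly. The sketch below does this with the standard library on made-up headlines. The tiny `STOP_WORDS` set is illustrative only and much shorter than the default list WordCloud uses.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; WordCloud ships a much longer default one
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "is", "on"}

def top_words(titles, n=5):
    """Return the n most frequent non-stop-words across the titles."""
    words = []
    for title in titles:
        # Lowercase, strip punctuation, then split on whitespace
        cleaned = re.sub(r"[^\w\s]", "", title.lower())
        words.extend(w for w in cleaned.split() if w not in STOP_WORDS)
    return Counter(words).most_common(n)

# Made-up headlines for illustration
sample_titles = ["AI is changing the newsroom", "The rise of AI in schools",
                 "Schools debate AI tools"]
print(top_words(sample_titles, n=2))
```

To run it on the real data, pass `df_ai['title'].dropna()` instead of `sample_titles`.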

## Sentiment Analysis

In [None]:
# Calculate sentiment for titles and descriptions
def get_sentiment(text):
    if pd.isna(text):
        return 0
    blob = TextBlob(str(text))
    return blob.sentiment.polarity

df_ai['title_sentiment'] = df_ai['title'].apply(get_sentiment)
df_ai['description_sentiment'] = df_ai['description'].apply(get_sentiment)

# Plot sentiment distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
df_ai['title_sentiment'].hist(bins=20, alpha=0.7)
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Sentiment Distribution of Headlines')
plt.axvline(x=0, color='red', linestyle='--', alpha=0.5)

plt.subplot(1, 2, 2)
df_ai['description_sentiment'].hist(bins=20, alpha=0.7, color='orange')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Sentiment Distribution of Descriptions')
plt.axvline(x=0, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print(f"Average title sentiment: {df_ai['title_sentiment'].mean():.3f}")
print(f"Average description sentiment: {df_ai['description_sentiment'].mean():.3f}")
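
TextBlob's polarity score runs from -1 (most negative) to 1 (most positive). For reporting, it is often easier to bucket scores into coarse labels. The sketch below shows one way; the `threshold` of 0.05 is an arbitrary illustrative cutoff, not a TextBlob setting, and the scores are made up.

```python
def label_sentiment(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label."""
    # threshold is an arbitrary illustrative cutoff, not a TextBlob setting
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

scores = [0.6, 0.0, -0.25, 0.03]  # made-up polarity values
print([label_sentiment(s) for s in scores])
```

Applied to the data above, this could look like `df_ai['title_sentiment'].apply(label_sentiment).value_counts()`.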

## 🥊 Challenge: Source Comparison

- Compare sentiment across different news sources
- Which sources tend to be more positive/negative about AI?
- Do you notice any patterns?

In [None]:
# YOUR CODE HERE



<a id='demo'></a>

# 🎬 Demo: News Sentiment Over Time

Let's track how sentiment about a topic changes over time by collecting articles from multiple days.

In [None]:
def collect_daily_sentiment(query, days_back=7):
    """Collect articles for each day and calculate daily sentiment"""
    daily_data = []
    
    for i in range(days_back):
        # Calculate date
        date = datetime.now() - timedelta(days=i)
        date_str = date.strftime('%Y-%m-%d')
        
        try:
            # Get articles for this specific day
            articles = newsapi.get_everything(
                q=query,
                from_param=date_str,
                to=date_str,
                language='en',
                sort_by='publishedAt',
                page_size=20
            )
            
            if articles['articles']:
                # Calculate sentiment for this day
                sentiments = []
                for article in articles['articles']:
                    title_sentiment = get_sentiment(article['title'])
                    desc_sentiment = get_sentiment(article.get('description', ''))
                    # Average of title and description sentiment
                    avg_sentiment = (title_sentiment + desc_sentiment) / 2
                    sentiments.append(avg_sentiment)
                
                daily_data.append({
                    'date': date,
                    'num_articles': len(articles['articles']),
                    'avg_sentiment': np.mean(sentiments),
                    'sentiment_std': np.std(sentiments)
                })
        
        except Exception as e:
            print(f"Error for {date_str}: {e}")
            continue
    
    return pd.DataFrame(daily_data)

# Collect sentiment data for a topic
sentiment_data = collect_daily_sentiment('cryptocurrency', days_back=7)
print(f"Collected data for {len(sentiment_data)} days")
sentiment_data
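
The function above passes the same date-only string as both the start and end of each window. If some days come back empty, one thing to try is passing explicit timestamps for the start and end of the day, since the API documents ISO 8601 date-time values as well as plain dates. The helper below is a hypothetical sketch of that idea and assumes you want UTC calendar days.

```python
from datetime import datetime, timedelta, timezone

def utc_day_window(day):
    """Return (start, end) ISO strings covering one whole UTC calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1) - timedelta(seconds=1)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)

start, end = utc_day_window(datetime(2024, 1, 15))
print(start, end)
```

The two strings could then be passed as `from_param=start` and `to=end` in the `get_everything` call.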

In [None]:
# Plot sentiment over time
if len(sentiment_data) > 0:
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    plt.plot(sentiment_data['date'], sentiment_data['avg_sentiment'], marker='o', linewidth=2)
    plt.fill_between(sentiment_data['date'], 
                     sentiment_data['avg_sentiment'] - sentiment_data['sentiment_std'],
                     sentiment_data['avg_sentiment'] + sentiment_data['sentiment_std'],
                     alpha=0.3)
    plt.xlabel('Date')
    plt.ylabel('Average Sentiment')
    plt.title('Cryptocurrency News Sentiment Over Time')
    plt.xticks(rotation=45)
    plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)
    
    plt.subplot(1, 2, 2)
    plt.bar(sentiment_data['date'], sentiment_data['num_articles'], alpha=0.7)
    plt.xlabel('Date')
    plt.ylabel('Number of Articles')
    plt.title('Daily Article Count')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()
else:
    print("No data collected for analysis")

## Collecting Data for Your Final Project

Here's a comprehensive template for collecting news data for your final project:

In [None]:
def collect_comprehensive_news_data(queries, days_back=30, max_articles_per_query=100):
    """Collect comprehensive news data for multiple topics over time"""
    all_articles = []
    
    # Calculate date range
    end_date = datetime.now()
    start_date = end_date - timedelta(days=days_back)
    
    from_date = start_date.strftime('%Y-%m-%d')
    to_date = end_date.strftime('%Y-%m-%d')
    
    for query in queries:
        print(f"Collecting articles for: {query}")
        
        try:
            # Get articles for this query
            articles = newsapi.get_everything(
                q=query,
                from_param=from_date,
                to=to_date,
                language='en',
                sort_by='publishedAt',
                page_size=min(max_articles_per_query, 100)  # the API allows at most 100 results per page
            )
            
            # Process each article
            for article in articles['articles']:
                article_data = {
                    'query': query,
                    'title': article['title'],
                    'description': article.get('description', ''),
                    'content': article.get('content', ''),
                    'url': article['url'],
                    'source_name': article['source']['name'],
                    'source_id': article['source'].get('id', ''),
                    'author': article.get('author', ''),
                    'published_at': article['publishedAt'],
                    'url_to_image': article.get('urlToImage', ''),
                    'title_sentiment': get_sentiment(article['title']),
                    'description_sentiment': get_sentiment(article.get('description', ''))
                }
                all_articles.append(article_data)
            
            print(f"  Collected {len(articles['articles'])} articles")
            
        except Exception as e:
            print(f"  Error collecting articles for '{query}': {e}")
            continue
    
    # Convert to DataFrame
    df = pd.DataFrame(all_articles)
    
    if not df.empty:
        # Convert date column
        df['published_at'] = pd.to_datetime(df['published_at'])
        
        # Add additional time-based features
        df['date'] = df['published_at'].dt.date
        df['hour'] = df['published_at'].dt.hour
        df['day_of_week'] = df['published_at'].dt.day_name()
        
        # Remove duplicates based on URL
        df = df.drop_duplicates(subset=['url'])
    
    return df

# Example usage for final project data collection
# topics = ['climate change', 'renewable energy', 'carbon emissions', 'environmental policy']
# news_df = collect_comprehensive_news_data(topics, days_back=30, max_articles_per_query=50)
# news_df.to_csv('environmental_news_data.csv', index=False)
# print(f"Collected {len(news_df)} unique articles")
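
Because every call counts against the daily quota, it can pay to save each response to disk and reuse it while you iterate on your analysis. The sketch below is one simple way to do that with JSON files. `cached_fetch` is a hypothetical helper, and the demo uses a fake fetch function so that it runs offline.

```python
import json
import os
import tempfile

def cached_fetch(cache_path, fetch):
    """Return JSON from cache_path if present; otherwise call fetch() and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch()  # the only line that would spend an API request
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

# Offline demo: count how often the (fake) API is actually called
calls = []

def fake_fetch():
    calls.append(1)
    return {"status": "ok", "totalResults": 0, "articles": []}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "climate.json")
    first = cached_fetch(path, fake_fetch)
    second = cached_fetch(path, fake_fetch)  # served from disk, no new call

print(len(calls))
```

With the real client, `fetch` could be `lambda: newsapi.get_everything(q='climate change', language='en')`. Delete the cache file whenever you want fresh results.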

<div class="alert alert-success">

## ❗ Key Points

* NewsAPI provides access to current and recent news articles from thousands of sources worldwide
* You can search by keywords, filter by source, category, language, and date ranges
* The API returns structured data including headlines, descriptions, publication dates, and source information
* News data is excellent for trend analysis, sentiment tracking, and monitoring public discourse
* Free tier limitations mean you should plan your data collection carefully for larger projects
* Combining multiple search queries can give you comprehensive coverage of topics
* Time-based analysis reveals patterns in news coverage and sentiment shifts
  
</div>