# Working with the Hacker News API

This notebook demonstrates how to work with the Hacker News API, which is a RESTful API that provides access to stories, comments, jobs, and more from Hacker News. We'll apply the HTTP and API concepts we've learned to extract, analyse, and visualise data from Hacker News.

In [None]:
%pip install --quiet matplotlib pandas requests IPython seaborn networkx wordcloud scikit-learn textblob

## Understanding the Hacker News API

The Hacker News API is a simple, RESTful API that provides access to Hacker News content. The base URL for the API is `https://hacker-news.firebaseio.com/v0/`.

Here are some of the key endpoints we'll be using:

- `/topstories.json` - Returns the IDs of the top stories on Hacker News
- `/newstories.json` - Returns the IDs of the newest stories on Hacker News
- `/beststories.json` - Returns the IDs of the best stories on Hacker News
- `/item/{id}.json` - Returns details about an item (story, comment, job, etc.)
- `/user/{id}.json` - Returns details about a user
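
As a quick illustration of how these paths combine with the base URL, the sketch below builds a few full request URLs. The helper name `endpoint_url` is an assumption made for this example (it is not part of the API or of `requests`), and the item ID and username are placeholders.

```python
BASE_URL = "https://hacker-news.firebaseio.com/v0/"

def endpoint_url(path: str) -> str:
    """Join an endpoint path such as '/item/8863.json' onto the base URL."""
    return BASE_URL + path.lstrip("/")

print(endpoint_url("/topstories.json"))
print(endpoint_url("/item/8863.json"))  # 8863 is only an illustrative ID
print(endpoint_url("/user/someuser.json"))  # placeholder username
```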

Let's set up our HTTP debugger from the previous notebook:

In [None]:
import requests
from requests import Request, Response, Session
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
import json
import pandas as pd
import matplotlib.pyplot as plt
import time
from datetime import datetime
import seaborn as sns

# Set up HTTP request/response debugger
def print_request(request: Request):
    """
    Print the details of an HTTP request.
    """
    url = urlparse(request.url)
    uri = "?".join([url.path, url.query]) if url.query else url.path

    print(f"\n--> HTTP Request to {url.netloc}")
    print(f"REQUEST: {request.method} {uri}")
    print(f"HEADERS:")

    for key, value in request.headers.items():
        print(f"  {key}: {value}")

    if request.body:
        print(f"BODY: {request.body[:100]}..." if len(request.body) > 100 else f"BODY: {request.body}")

def print_response(response: Response):
    """
    Print the details of an HTTP response.
    """
    url = urlparse(response.url)

    print(f"\n<-- HTTP Response from {url.netloc}")
    print(f"RESPONSE: {response.status_code} {response.reason}")
    print(f"HEADERS:")

    for key, value in response.headers.items():
        print(f"  {key}: {value}")

    if response.text:
        print(f"BODY: {response.text[:100]}..." if len(response.text) > 100 else f"BODY: {response.text}")

def http_logger(response, *args, **kwargs):
    """
    Log the details of an HTTP request and response.
    """
    print_request(response.request)
    print_response(response)

# Create a session to track request/response details
session = Session()
session.hooks["response"] = [http_logger]

# Base URL for the Hacker News API
base_url = "https://hacker-news.firebaseio.com/v0/"

## Getting the Top Stories

Let's start by getting the IDs of the top stories on Hacker News. This endpoint returns an array of story IDs.

In [None]:
# Get the top stories
top_stories_url = f"{base_url}topstories.json"
response = session.get(top_stories_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    top_story_ids = response.json()
    print(f"Retrieved {len(top_story_ids)} top story IDs")
    print(f"First 10 story IDs: {top_story_ids[:10]}")
else:
    print(f"Error: {response.status_code}")

## Fetching Story Details

Now that we have the IDs of the top stories, let's fetch the details of the first 10 stories. Each story has details like title, URL, score, author, etc.

We'll also pause briefly between requests to be respectful of the API.
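
The fixed delay between requests can also be wrapped in a small reusable helper. The following is a minimal sketch, not the approach this notebook uses: the `Throttle` class and its `clock` and `sleep` parameters are assumptions introduced here so the idea can be shown (and tested) in isolation.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(0.1)  # at most one request every 100 ms
for story_id in [1, 2, 3]:  # placeholder IDs
    throttle.wait()
    # a request for story_id would go here
```

Unlike an unconditional `time.sleep(0.1)`, this only waits for whatever part of the interval has not already elapsed while the previous response was being processed.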

In [None]:
def fetch_item(item_id, session=None):
    """
    Fetch the details of an item (story, comment, job, etc.) from the Hacker News API.

    :param item_id: The ID of the item to fetch
    :param session: The session to use for the request (optional)
    :return: The item details as a dictionary
    """
    # Use the provided session or create a new one
    s = session if session else requests.Session()

    # Construct the URL for the item
    item_url = f"{base_url}item/{item_id}.json"

    # Make the request
    response = s.get(item_url)

    # Check if the request was successful
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error fetching item {item_id}: {response.status_code}")
        return None

# Fetch the details of the first 10 top stories with rate limiting
top_stories = []
for i, story_id in enumerate(top_story_ids[:10]):
    # Add a small delay to be respectful of the API
    if i > 0:
        time.sleep(0.1)  # 100ms delay between requests

    story = fetch_item(story_id, session)
    if story:
        top_stories.append(story)

print(f"Retrieved details for {len(top_stories)} top stories")

# Display the first story as an example
if top_stories:
    print(json.dumps(top_stories[0], indent=2))

## Transforming the Data

Let's transform the story data into a pandas DataFrame for easier analysis. We'll extract relevant fields like title, URL, score, author, etc.

In [None]:
# Extract relevant fields from each story and create a DataFrame
story_data = []
for story in top_stories:
    # Skip stories without a title (deleted or invalid)
    if 'title' not in story:
        continue

    # Extract relevant fields
    story_dict = {
        'id': story.get('id'),
        'title': story.get('title'),
        'url': story.get('url', ''),  # Some stories don't have a URL (text posts)
        'score': story.get('score', 0),
        'author': story.get('by', ''),
        'time': story.get('time', 0),  # Unix timestamp
        'num_comments': story.get('descendants', 0),
        'type': story.get('type', ''),
        'has_url': 'url' in story  # Flag to indicate if the story has a URL
    }

    # Convert Unix timestamp to a (naive) UTC datetime, so that the
    # hour-of-day plots labelled "UTC" later on are accurate
    if story_dict['time']:
        story_dict['datetime'] = pd.to_datetime(story_dict['time'], unit='s')

    # Extract domain from URL if available
    if story_dict['url']:
        try:
            parsed_url = urlparse(story_dict['url'])
            story_dict['domain'] = parsed_url.netloc
        except ValueError:  # urlparse raises ValueError for malformed URLs
            story_dict['domain'] = ''
    else:
        story_dict['domain'] = ''

    story_data.append(story_dict)

# Create DataFrame
df_stories = pd.DataFrame(story_data)

# Display the DataFrame
df_stories

## Let's scale up: Getting 30 Top Stories

Now let's fetch more stories for better analysis. We'll fetch the top 30 stories and add them to our DataFrame.
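
Because the same field extraction is needed each time stories are fetched, it can be factored into one function. The sketch below is one possible shape for such a helper. The name `story_to_row` and the sample item are made up for illustration, and storing timezone-aware UTC datetimes is a choice made for this sketch.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

def story_to_row(story: dict) -> dict:
    """Flatten one story item into the columns used in this notebook."""
    url = story.get("url", "")
    ts = story.get("time", 0)
    return {
        "id": story.get("id"),
        "title": story.get("title"),
        "url": url,
        "score": story.get("score", 0),
        "author": story.get("by", ""),
        "time": ts,
        "num_comments": story.get("descendants", 0),
        "type": story.get("type", ""),
        "has_url": "url" in story,
        # Timezone-aware UTC datetime, or None when no timestamp is present
        "datetime": datetime.fromtimestamp(ts, tz=timezone.utc) if ts else None,
        "domain": urlparse(url).netloc if url else "",
    }

# A made-up item, shaped like an API response, for illustration only
sample = {"id": 1, "title": "Example", "url": "https://example.com/post",
          "score": 42, "by": "someone", "time": 1700000000,
          "descendants": 7, "type": "story"}
row = story_to_row(sample)
print(row["domain"], row["datetime"].hour)
```

A list of such rows can be passed straight to `pd.DataFrame`.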

In [None]:
# Clear our existing data and fetch more stories
top_stories = []
for i, story_id in enumerate(top_story_ids[:30]):
    # Add a small delay to be respectful of the API
    if i > 0:
        time.sleep(0.1)  # 100ms delay between requests

    story = fetch_item(story_id)
    if story:
        top_stories.append(story)

print(f"Retrieved details for {len(top_stories)} top stories")

# Extract relevant fields and create DataFrame as before
story_data = []
for story in top_stories:
    if 'title' not in story:
        continue

    story_dict = {
        'id': story.get('id'),
        'title': story.get('title'),
        'url': story.get('url', ''),
        'score': story.get('score', 0),
        'author': story.get('by', ''),
        'time': story.get('time', 0),
        'num_comments': story.get('descendants', 0),
        'type': story.get('type', ''),
        'has_url': 'url' in story
    }

    # Convert Unix timestamp to a (naive) UTC datetime
    if story_dict['time']:
        story_dict['datetime'] = pd.to_datetime(story_dict['time'], unit='s')

    if story_dict['url']:
        try:
            parsed_url = urlparse(story_dict['url'])
            story_dict['domain'] = parsed_url.netloc
        except ValueError:  # urlparse raises ValueError for malformed URLs
            story_dict['domain'] = ''
    else:
        story_dict['domain'] = ''

    story_data.append(story_dict)

df_stories = pd.DataFrame(story_data)

# Display the number of stories and the first few rows
print(f"DataFrame contains {len(df_stories)} stories")
df_stories.head()

## Data Analysis and Visualisation

Now that we have a reasonable amount of data, let's analyse and visualise it.

In [None]:
# Basic statistics
print("Basic statistics for scores:")
print(df_stories['score'].describe())

print("\nBasic statistics for number of comments:")
print(df_stories['num_comments'].describe())

In [None]:
# Correlation between score and number of comments
correlation = df_stories['score'].corr(df_stories['num_comments'])
print(f"Correlation between score and number of comments: {correlation:.2f}")

# Scatter plot of score vs. number of comments
plt.figure(figsize=(10, 6))
sns.scatterplot(x='score', y='num_comments', data=df_stories)
plt.title('Score vs. Number of Comments')
plt.xlabel('Score')
plt.ylabel('Number of Comments')
plt.show()

In [None]:
# Top domains
domain_counts = df_stories['domain'].value_counts().head(10)
print("Top domains:")
print(domain_counts)

# Bar chart of top domains
plt.figure(figsize=(12, 6))
sns.barplot(x=domain_counts.index, y=domain_counts.values)
plt.title('Top 10 Domains in Hacker News Top Stories')
plt.xlabel('Domain')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Distribution of scores
plt.figure(figsize=(10, 6))
sns.histplot(df_stories['score'], bins=20, kde=True)
plt.title('Distribution of Story Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Posts by hour of day
df_stories['hour'] = df_stories['datetime'].dt.hour
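# Count stories per hour, including hours with no stories, so that the bar
# positions line up with the 0-23 tick labels set below
hour_counts = df_stories['hour'].value_counts().reindex(range(24), fill_value=0)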

plt.figure(figsize=(12, 6))
sns.barplot(x=hour_counts.index, y=hour_counts.values)
plt.title('Distribution of Top Stories by Hour of Day (UTC)')
plt.xlabel('Hour of Day (UTC)')
plt.ylabel('Number of Stories')
plt.xticks(range(0, 24))
plt.show()

## Getting and Analysing Comments

Let's dive deeper by fetching and analysing the comments for the top story.
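
A comment thread is a tree: each item lists the IDs of its direct replies under `kids`. The sketch below walks such a tree to a limited depth, using a tiny in-memory stand-in for the API so that it runs without network access. The `FAKE_ITEMS` data and the `walk_comments` name are invented for this illustration.

```python
FAKE_ITEMS = {
    1: {"id": 1, "type": "story", "kids": [2, 3]},
    2: {"id": 2, "type": "comment", "text": "First", "kids": [4]},
    3: {"id": 3, "type": "comment", "text": "Second"},
    4: {"id": 4, "type": "comment", "text": "Reply", "kids": [5]},
    5: {"id": 5, "type": "comment", "text": "Deep reply"},
}

def walk_comments(item_id, fetch, level=0, max_level=1):
    """Yield (level, item) pairs depth-first, going no deeper than max_level."""
    if level > max_level:
        return
    item = fetch(item_id)
    # Skip missing, deleted, or dead items (and everything below them)
    if not item or item.get("deleted") or item.get("dead"):
        return
    yield level, item
    for kid_id in item.get("kids", []):
        yield from walk_comments(kid_id, fetch, level + 1, max_level)

# Treat the story's direct replies as level 0
for kid in FAKE_ITEMS[1]["kids"]:
    for level, comment in walk_comments(kid, FAKE_ITEMS.get):
        print("  " * level + comment["text"])
```

Passing the fetch function in as an argument keeps the traversal logic separate from the network code, which makes it easy to try out on sample data first.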

In [None]:
# Get the top story ID
top_story_id = df_stories.iloc[0]['id']
top_story = fetch_item(top_story_id)

print(f"Analysing comments for the top story: {top_story['title']}")

# Recursive function to fetch all comments and replies
def fetch_comments_tree(item_id, level=0, max_level=2):
    """
    Recursively fetch comments and their replies up to a maximum depth.

    :param item_id: The ID of the item to fetch
    :param level: The current recursion level
    :param max_level: The maximum depth to recurse
    :return: A list of comment dictionaries with metadata
    """
    # Stop recursion if we've reached the maximum level
    if level > max_level:
        return []

    # Fetch the item
    item = fetch_item(item_id)
    if not item or 'deleted' in item or 'dead' in item:
        return []

    # Add level information to the item
    item['level'] = level

    # Base case: no kids/replies
    if 'kids' not in item:
        return [item]

    # Recursive case: fetch replies
    comments = [item]
    for kid_id in item.get('kids', []):
        # Add a small delay to be respectful of the API
        time.sleep(0.1)
        comments.extend(fetch_comments_tree(kid_id, level + 1, max_level))

    return comments

# Fetch comments for the top story (limited to depth 1 for demonstration)
if 'kids' in top_story:
    print(f"Fetching comments for story {top_story_id} (this might take a while)...")
    comments = []
    for i, kid_id in enumerate(top_story.get('kids', [])[:5]):  # Limit to first 5 comments
        if i > 0:
            time.sleep(0.1)  # Be nice to the API
        comments.extend(fetch_comments_tree(kid_id, level=0, max_level=1))

    print(f"Retrieved {len(comments)} comments and replies")

    # Transform comment data into a DataFrame
    comment_data = []
    for comment in comments:
        if 'text' not in comment:
            continue

        comment_dict = {
            'id': comment.get('id'),
            'author': comment.get('by', ''),
            'text': comment.get('text', ''),
            'time': comment.get('time', 0),
            'level': comment.get('level', 0),
            'parent': comment.get('parent'),
            'has_replies': 'kids' in comment,
            'num_replies': len(comment.get('kids', []))
        }

        # Convert timestamp to datetime
        if comment_dict['time']:
            comment_dict['datetime'] = datetime.fromtimestamp(comment_dict['time'])

        comment_data.append(comment_dict)

    df_comments = pd.DataFrame(comment_data)

    # Display the first few comments
    print("\nSample of comments:")
    for i, row in df_comments.head(3).iterrows():
        print(f"\nLevel {row['level']} comment by {row['author']}:")
        # Truncate long comments for display
        text = row['text'][:100] + '...' if len(row['text']) > 100 else row['text']
        print(text)
else:
    print("No comments found for this story.")

## Analysing User Activity

Let's fetch details about some of the most active users in our dataset.

In [None]:
def fetch_user(username):
    """
    Fetch user details from the Hacker News API.

    :param username: The username to fetch
    :return: The user details as a dictionary
    """
    user_url = f"{base_url}user/{username}.json"
    response = requests.get(user_url)

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error fetching user {username}: {response.status_code}")
        return None

# Find the top 5 most frequent authors in our stories
top_authors = df_stories['author'].value_counts().head(5).index.tolist()

print(f"Top 5 most frequent authors in our dataset: {top_authors}")

# Fetch details for these authors
users = []
for author in top_authors:
    time.sleep(0.1)  # Be nice to the API
    user = fetch_user(author)
    if user:
        users.append(user)

# Create DataFrame with user details
user_data = []
for user in users:
    user_dict = {
        'id': user.get('id'),
        'karma': user.get('karma', 0),
        'created': user.get('created', 0),
        'submitted_count': len(user.get('submitted', [])),
        'about': user.get('about', '')
    }

    # Convert timestamp to datetime
    if user_dict['created']:
        user_dict['created_date'] = datetime.fromtimestamp(user_dict['created'])
        user_dict['account_age_days'] = (datetime.now() - user_dict['created_date']).days

    user_data.append(user_dict)

df_users = pd.DataFrame(user_data)

# Display user information
df_users[['id', 'karma', 'created_date', 'account_age_days', 'submitted_count']]

In [None]:
# Visualise user karma vs submitted count
plt.figure(figsize=(10, 6))
sns.scatterplot(x='karma', y='submitted_count', data=df_users, s=100)
for i, row in df_users.iterrows():
    plt.annotate(row['id'], (row['karma'], row['submitted_count']),
                 xytext=(5, 5), textcoords='offset points')
plt.title('User Karma vs Number of Submissions')
plt.xlabel('Karma')
plt.ylabel('Number of Submissions')
plt.show()

## Comparing Different Story Types

Let's compare the top, new, and best stories on Hacker News.
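
The same story can appear in more than one of these lists, so it can be useful to check how much two lists overlap before comparing them. A minimal sketch using Python sets follows. The ID values are made up, and the Jaccard ratio is just one possible way to summarise the overlap.

```python
top_ids = [101, 102, 103, 104]  # made-up IDs for illustration
best_ids = [103, 104, 105]

# Stories present in both lists
shared = set(top_ids) & set(best_ids)

# Jaccard similarity: size of the intersection over size of the union
jaccard = len(shared) / len(set(top_ids) | set(best_ids))

print(sorted(shared), f"{jaccard:.2f}")
```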

In [None]:
# Function to fetch IDs for different story types
def get_story_ids(story_type):
    """
    Fetch story IDs for a specific type (top, new, best).

    :param story_type: The type of stories to fetch ('top', 'new', or 'best')
    :return: A list of story IDs
    """
    url = f"{base_url}{story_type}stories.json"
    response = requests.get(url)

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error fetching {story_type} stories: {response.status_code}")
        return []

# Fetch IDs for new and best stories
new_story_ids = get_story_ids('new')
best_story_ids = get_story_ids('best')

print(f"Retrieved {len(new_story_ids)} new story IDs")
print(f"Retrieved {len(best_story_ids)} best story IDs")

# Fetch details for the first 10 stories of each type.
# We already have top stories, so just fetch new and best
new_stories = []
best_stories = []

print("Fetching new stories...")
for i, story_id in enumerate(new_story_ids[:10]):
    if i > 0:
        time.sleep(0.1)  # Be nice to the API
    story = fetch_item(story_id)
    if story:
        new_stories.append(story)

print("Fetching best stories...")
for i, story_id in enumerate(best_story_ids[:10]):
    if i > 0:
        time.sleep(0.1)  # Be nice to the API
    story = fetch_item(story_id)
    if story:
        best_stories.append(story)

print(f"Retrieved {len(new_stories)} new stories")
print(f"Retrieved {len(best_stories)} best stories")

In [None]:
# Function to convert story data to DataFrame
def stories_to_dataframe(stories, story_type):
    """
    Convert story data to a DataFrame.

    :param stories: List of story dictionaries
    :param story_type: Type of story ('top', 'new', or 'best')
    :return: DataFrame with story data
    """
    story_data = []

    for story in stories:
        if 'title' not in story:
            continue

        story_dict = {
            'id': story.get('id'),
            'title': story.get('title'),
            'score': story.get('score', 0),
            'author': story.get('by', ''),
            'time': story.get('time', 0),
            'num_comments': story.get('descendants', 0),
            'story_type': story_type  # Add the story type as a column
        }

        # Convert timestamp to a (naive) UTC datetime and compute the age in hours
        if story_dict['time']:
            story_dict['datetime'] = pd.to_datetime(story_dict['time'], unit='s')
            story_dict['age_hours'] = (time.time() - story_dict['time']) / 3600

        story_data.append(story_dict)

    return pd.DataFrame(story_data)

# Create DataFrames for each story type
df_top = stories_to_dataframe(top_stories[:10], 'top')
df_new = stories_to_dataframe(new_stories, 'new')
df_best = stories_to_dataframe(best_stories, 'best')

# Combine into a single DataFrame
df_all_stories = pd.concat([df_top, df_new, df_best], ignore_index=True)

# Display the first few rows
df_all_stories.head()

In [None]:
# Compare scores and comments across story types
# Calculate average score and number of comments for each story type
story_type_stats = df_all_stories.groupby('story_type').agg({
    'score': 'mean',
    'num_comments': 'mean',
    'age_hours': 'mean'
}).reset_index()

print("Average score, comments, and age by story type:")
print(story_type_stats)

# Create a bar chart to compare scores
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.barplot(x='story_type', y='score', data=df_all_stories)
plt.title('Mean Score by Story Type')
plt.xlabel('Story Type')
plt.ylabel('Score')

# Create a bar chart to compare number of comments
plt.subplot(1, 2, 2)
sns.barplot(x='story_type', y='num_comments', data=df_all_stories)
plt.title('Mean Number of Comments by Story Type')
plt.xlabel('Story Type')
plt.ylabel('Number of Comments')

plt.tight_layout()
plt.show()

## Analysing Submission Times

Let's analyse when stories are submitted to Hacker News and if there's any relationship between submission time and popularity.
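
The API reports submission times as Unix timestamps (seconds since 1970-01-01 UTC). The sketch below shows one way to turn such a timestamp into an hour of day and weekday in UTC, independent of the machine's local timezone. The function name `utc_hour_and_weekday` is an assumption for this example.

```python
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def utc_hour_and_weekday(unix_seconds: int):
    """Return (hour, weekday name) in UTC for a Unix timestamp."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return dt.hour, WEEKDAYS[dt.weekday()]

print(utc_hour_and_weekday(1700000000))  # 2023-11-14 22:13:20 UTC
```

Note that `datetime.fromtimestamp` without a `tz` argument returns local time, which would shift the hours on any machine not set to UTC.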

In [None]:
# Extract hour and day of week from datetime
df_all_stories['hour'] = df_all_stories['datetime'].dt.hour
df_all_stories['day_of_week'] = df_all_stories['datetime'].dt.day_name()

# Calculate average score and comments by hour of day
hour_stats = df_all_stories.groupby('hour').agg({
    'score': 'mean',
    'num_comments': 'mean',
    'id': 'count'  # Count of stories
}).rename(columns={'id': 'count'}).reset_index()

# Plot average score by hour of day
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.lineplot(x='hour', y='score', data=hour_stats, marker='o')
plt.title('Average Score by Hour of Day (UTC)')
plt.xlabel('Hour of Day')
plt.ylabel('Average Score')
plt.xticks(range(0, 24, 2))

# Plot average comments by hour of day
plt.subplot(1, 2, 2)
sns.lineplot(x='hour', y='num_comments', data=hour_stats, marker='o')
plt.title('Average Comments by Hour of Day (UTC)')
plt.xlabel('Hour of Day')
plt.ylabel('Average Number of Comments')
plt.xticks(range(0, 24, 2))

plt.tight_layout()
plt.show()

## Ask HN vs. Show HN vs. Regular Posts

Hacker News has special post types like "Ask HN" and "Show HN". Let's analyse how these different post types perform.

In [None]:
# Identify post types based on title
def get_post_type(title):
    if title.startswith('Ask HN:'):
        return 'Ask HN'
    elif title.startswith('Show HN:'):
        return 'Show HN'
    elif title.startswith('Tell HN:'):
        return 'Tell HN'
    else:
        return 'Regular'

# Add post type column
df_all_stories['post_type'] = df_all_stories['title'].apply(get_post_type)

# Display counts by post type
post_type_counts = df_all_stories['post_type'].value_counts()
print("Counts by post type:")
print(post_type_counts)

# Calculate statistics by post type
post_type_stats = df_all_stories.groupby('post_type').agg({
    'score': ['mean', 'median'],
    'num_comments': ['mean', 'median'],
    'id': 'count'
}).reset_index()

# Flatten the multi-level columns
post_type_stats.columns = ['_'.join(col).strip('_') for col in post_type_stats.columns.values]

print("\nStatistics by post type:")
print(post_type_stats)

# Visualise post types by score and comments
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
sns.boxplot(x='post_type', y='score', data=df_all_stories)
plt.title('Distribution of Scores by Post Type')
plt.xlabel('Post Type')
plt.ylabel('Score')

plt.subplot(2, 1, 2)
sns.boxplot(x='post_type', y='num_comments', data=df_all_stories)
plt.title('Distribution of Comments by Post Type')
plt.xlabel('Post Type')
plt.ylabel('Number of Comments')

plt.tight_layout()
plt.show()

## Sentiment Analysis of Titles

Let's analyse the sentiment of story titles to see if there's any relationship between sentiment and popularity.

In [None]:
# For a simple sentiment analysis, we use TextBlob, whose default analyser
# scores text against a fixed word lexicon. This is a very basic approach,
# so treat the scores for short titles as rough indications only
from textblob import TextBlob

# Function to get sentiment polarity using TextBlob
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity  # Returns a value between -1 (negative) and 1 (positive)

# Calculate sentiment for each title
df_all_stories['sentiment'] = df_all_stories['title'].apply(get_sentiment)

# Display average sentiment by post type
print("Average sentiment by post type:")
print(df_all_stories.groupby('post_type')['sentiment'].mean())

# Display sentiment distribution
plt.figure(figsize=(10, 6))
sns.histplot(df_all_stories['sentiment'], bins=20, kde=True)
plt.title('Distribution of Title Sentiment')
plt.xlabel('Sentiment Polarity (-1 to 1)')
plt.ylabel('Frequency')
plt.axvline(x=0, color='r', linestyle='--')  # Add line at neutral sentiment
plt.show()

# Check if there's a correlation between sentiment and score/comments
sentiment_score_corr = df_all_stories['sentiment'].corr(df_all_stories['score'])
sentiment_comments_corr = df_all_stories['sentiment'].corr(df_all_stories['num_comments'])

print(f"Correlation between sentiment and score: {sentiment_score_corr:.3f}")
print(f"Correlation between sentiment and comments: {sentiment_comments_corr:.3f}")

# Scatter plot of sentiment vs. score
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='sentiment', y='score', data=df_all_stories)
plt.title('Sentiment vs. Score')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Score')

# Scatter plot of sentiment vs. comments
plt.subplot(1, 2, 2)
sns.scatterplot(x='sentiment', y='num_comments', data=df_all_stories)
plt.title('Sentiment vs. Number of Comments')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Number of Comments')

plt.tight_layout()
plt.show()

## Building a Simple Recommendation System

Let's build a simple content-based recommendation system for Hacker News stories based on title similarity.
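
To make "title similarity" concrete, the sketch below computes cosine similarity between titles in pure Python. It deliberately uses raw word counts rather than TF-IDF weights, so it is a simplified stand-in for the idea and not the method used in this notebook. The titles and the `cosine` helper are made up for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # An empty vector has no direction, so define its similarity as 0
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

titles = [  # made-up titles
    "rust compiler internals",
    "writing a rust compiler",
    "baking sourdough bread",
]
vectors = [Counter(t.lower().split()) for t in titles]

# Similarity of the first title to every title, itself included
scores = [cosine(vectors[0], v) for v in vectors]
print([round(s, 2) for s in scores])
```

Titles that share words score above zero, and titles with no words in common score exactly zero. TF-IDF refines this by down-weighting words that appear in many titles.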

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Get a larger dataset for recommendations.
# Keep only stories with a title and drop duplicates (the same story can
# appear in more than one list), so that list positions match matrix rows
all_stories = list({
    story['id']: story
    for story in top_stories + new_stories + best_stories
    if 'title' in story
}.values())
all_story_titles = [story['title'] for story in all_stories]
all_story_ids = [story['id'] for story in all_stories]

# Create a mapping from ID to index
id_to_index = {story_id: idx for idx, story_id in enumerate(all_story_ids)}

# Convert titles to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(all_story_titles)

# Function to get recommendations based on a story ID
def get_story_recommendations(story_id, top_n=5):
    """
    Get content-based recommendations for a story based on title similarity.

    :param story_id: The ID of the story to get recommendations for
    :param top_n: Number of recommendations to return
    :return: List of recommended story dictionaries
    """
    if story_id not in id_to_index:
        print(f"Story ID {story_id} not found in dataset")
        return []

    # Get the index of the story
    idx = id_to_index[story_id]

    # Calculate similarity scores
    similarity_scores = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).flatten()

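    # Get indices of the most similar stories, explicitly excluding the input
    # story (with tied scores it is not guaranteed to be ranked first)
    ranked_indices = similarity_scores.argsort()[::-1]
    similar_indices = [i for i in ranked_indices if i != idx][:top_n]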

    # Get the recommended stories
    recommendations = [all_stories[i] for i in similar_indices]

    return recommendations

# Get recommendations for the top story
top_story_id = df_stories.iloc[0]['id']
recommended_stories = get_story_recommendations(top_story_id)

print(f"Recommendations for story: {df_stories.iloc[0]['title']}\n")
for i, story in enumerate(recommended_stories):
    print(f"{i+1}. {story.get('title')} (Score: {story.get('score')})")

## Conclusion

In this notebook, we've covered how to use the Hacker News API to fetch and analyse data. We've seen how to:

1. Fetch stories and comments from the API
2. Transform the data into pandas DataFrames
3. Analyse and visualise the data to extract insights
4. Build a simple recommendation system based on content similarity

This demonstrates how APIs can be powerful tools for data scientists, allowing you to access rich, real-time data from the web for analysis and modelling.

## Further Explorations

Here are some additional things you could explore with the Hacker News API:

1. Analyse how quickly stories rise and fall on the front page
2. Build a classifier to predict whether a story will get high engagement
3. Analyse comment threads and conversation patterns
4. Track specific domains or topics over time
5. Build a real-time dashboard to monitor Hacker News activity

Remember to always be respectful of the API by implementing rate limiting and caching where appropriate.
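
As one possible starting point for caching, the sketch below wraps any fetch function so that each ID is requested at most once per session. The names `make_cached_fetcher` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real network call so the example runs offline.

```python
def make_cached_fetcher(fetch):
    """Wrap `fetch(item_id)` so each ID is requested at most once."""
    cache = {}

    def cached(item_id):
        if item_id not in cache:
            cache[item_id] = fetch(item_id)
        return cache[item_id]

    return cached

# Stand-in for a real network call that records how often it is used
calls = []

def fake_fetch(item_id):
    calls.append(item_id)
    return {"id": item_id}

cached_fetch = make_cached_fetcher(fake_fetch)
cached_fetch(1)
cached_fetch(1)  # served from the cache, no second call
cached_fetch(2)
print(calls)
```

Scores and comment counts change over time, so an in-memory cache like this suits a single analysis session better than long-running monitoring.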