# ADS-509 Assignment 1.1
## Collecting Text Data via API and Web Scraping

**Student Version**  

In this assignment you will:
- Collect story metadata via an API (Hacker News).
- Scrape discussion comments from HTML using BeautifulSoup.
- Merge into one tidy dataset (one row per comment with story metadata).
- Save results and run some quick checks/EDA.

The discussion posts are also available via the Hacker News API, but we scrape them here on purpose. This mirrors real-world pipelines where structured APIs don’t expose the exact text you need, so you complement them with carefully designed scrapers.

## General Assignment Instructions

These instructions are included in every assignment to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

Work through this notebook as if it were a worksheet, completing the code sections marked with **TODO** in the cells provided. Similarly, written questions will be marked by a "Q:" and will have a corresponding "A:" spot for you to fill in with your answers. **Make sure to answer every question marked with a Q: for full credit**.

Your code should be easy to read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed and any parts of cells that contain unnecessary code. Remove inessential import statements and make sure that all remaining import statements are moved into the designated cell.

A .pdf of this notebook, with your completed code and written answers, is what you should submit in Canvas for full credit. **DO NOT SUBMIT A NEW NOTEBOOK FILE OR A RAW .PY FILE**. Submitting in a different format makes it difficult to grade your work, and students who have done this in the past inevitably miss some of the required work or written questions.

## Imports and Definitions

First we import our libraries, set up our folder structure, and define some useful functions and variables.

**TODO**
- update the directory names with whatever path you would like to use to store your data
- define a function that will cause your code to "sleep" for a random interval between 1 and 2 seconds
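
As a rough illustration of the second item, a random pause can be built from `random.uniform` and `time.sleep`. The helper name `pause_randomly` and the very short interval are assumptions chosen for this sketch so that it runs quickly; they are not the required solution for `sleep_politely`.

```python
import random
import time

def pause_randomly(low, high):
    """Sleep for a uniformly random number of seconds in [low, high] and return it."""
    delay = random.uniform(low, high)  # random float between low and high
    time.sleep(delay)                  # block the program for that many seconds
    return delay

# Tiny interval so the demo finishes quickly; a polite scraper would wait longer.
waited = pause_randomly(0.01, 0.02)
print(f"waited {waited:.3f} seconds")
```

Returning the delay is optional, but it makes the helper easy to check and to log.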

In [None]:
import os, json, time, random
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
# add any other import statements here


In [None]:
# Polite scraping: set a clear user agent and throttle requests a bit
HEADERS = {"User-Agent": "MADS-LLM-2025-Student/1.0 (+https://example.edu)"}
BASE = "https://hacker-news.firebaseio.com/v0"

def sleep_politely(sleep_range=(1.0,2.0)):
    # TODO: define a random sleep function
    ??

In [None]:
# define your folder structure
# TODO: Update the directory names as needed
DATA_DIR = 'data/module1'
RAW_API_DIR = os.path.join(DATA_DIR, 'raw_api')
RAW_HTML_DIR = os.path.join(DATA_DIR, 'raw_html')
for d in [DATA_DIR, RAW_API_DIR, RAW_HTML_DIR]:
    os.makedirs(d, exist_ok=True)

In [None]:
# add any other helper functions that you define here

## API Data Collection

Here we will use the **requests** library to interact with the Hacker News API. Our ultimate goal is to create a dataset with metadata and discussion posts for 50 of the top stories on Hacker News, so first we will use the API to get the top story IDs.

**TODO**:
- Read through the HackerNews API documentation (https://github.com/HackerNews/API) and use the requests library to pull the IDs for all top news stories. *Be sure to use the headers and timeout arguments*.

**Q:** What do the headers and timeout arguments in the requests library do?

**A:** 

In [None]:
# Get a big list of top story IDs
r = ?? # TODO: Use the requests library to pull the IDs for the top stories
r.raise_for_status()
top_ids = r.json()
print('Total top story ids returned:', len(top_ids))

# For a faster assignment, we’ll just use the first 50 stories
CANDIDATE_N = 50
candidate_ids = top_ids[:CANDIDATE_N]
len(candidate_ids), candidate_ids[:5] # Print 5 of the IDs as a self-check

Next we need to use our list of article IDs to pull and organize the metadata that we want to include in our final dataset.

**TODO**:
- Use the HackerNews API documentation to pull the metadata for each of the stories in your candidate_ids list.
- Filter your stories for those with discussion threads. Hint: look for the "descendants" field in each item.
- Format your stories with their associated metadata into a pandas dataframe. Hint: check out the Pandas `json_normalize()` function
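
To show what `json_normalize()` does, here is a minimal sketch on made-up records. The records in `sample_items` are hypothetical and only loosely shaped like API items, and the filtering rule is one possible approach rather than the expected answer.

```python
import pandas as pd

# Hypothetical records; all values are invented for illustration.
sample_items = [
    {"id": 1, "type": "story", "title": "First", "by": "alice", "descendants": 3, "kids": [11, 12]},
    {"id": 2, "type": "story", "title": "Second", "by": "bob", "descendants": 0},
    {"id": 3, "type": "job", "title": "Hiring", "by": "carol"},  # no "descendants" key
]

# Keep records that report at least one comment; .get() tolerates a missing key.
with_threads = [item for item in sample_items if item.get("descendants", 0) > 0]

# json_normalize turns a list of dicts into a DataFrame with one row per dict.
sample_df = pd.json_normalize(with_threads)

# reindex selects columns and fills any absent column with NaN instead of raising.
sample_df = sample_df.reindex(columns=["id", "title", "by", "descendants"])
print(sample_df)
```

Using `reindex` rather than plain column selection is a design choice: some items may lack a field, and this keeps the column set stable.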

In [None]:
def fetch_item(item_id):
    # cache to disk so re-runs are fast and reproducible
    fp = os.path.join(RAW_API_DIR, f"item_{item_id}.json")
    if os.path.exists(fp):
        with open(fp, 'r', encoding='utf-8') as f:
            return json.load(f)
    r = ?? # TODO: Use the requests library to pull the metadata for a given story ID
    r.raise_for_status()
    data = r.json()
    with open(fp, 'w', encoding='utf-8') as f:
        json.dump(data, f)
    sleep_politely()
    return data

items = []
for idx, i in enumerate(candidate_ids):
    if idx%5==0:
        print(f"{idx}/{len(candidate_ids)}")
    items.append(fetch_item(i))
stories = ?? # TODO: filter your data for stories with comment threads

# TODO: Format your data into a pandas dataframe with one row per story, keeping only the metadata columns listed in keep_cols
keep_cols = ["id", "title", "by", "time", "url", "score", "descendants"]
stories_df = ??

print('Stories w/ discussion:', len(stories_df))
stories_df.head()

## Webscraping

Next, we will use our list of story IDs to scrape the discussion threads from HackerNews using the BeautifulSoup library.

**TODO**:
- Use the requests library to pull the HTML for each story's comment page on Hacker News (https://news.ycombinator.com/). Hint: you'll need to go to the website to see how to format your URL properly.
- Use the BeautifulSoup library to extract the user, comment id, time span, and comment text from each story's HTML. Hint: use your browser's HTML inspector (ctrl+shift+i) to identify the relevant HTML tags.
- Format the comment threads into a pandas dataframe with the story id and one row per comment.
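
The extraction pattern can be sketched on a toy page. The markup and class names below (`comment-row`, `author`, `posted`, `body`) are invented for this example and are not the tags used by Hacker News; you still need the inspector to find the real ones.

```python
from bs4 import BeautifulSoup

# Toy HTML with made-up class names; the second comment has no author link.
toy_html = """
<table>
  <tr class="comment-row" id="101">
    <td><a class="author">alice</a> <span class="posted">2 hours ago</span>
        <div class="body">First <i>comment</i></div></td>
  </tr>
  <tr class="comment-row" id="102">
    <td><span class="posted">1 hour ago</span>
        <div class="body">Second comment</div></td>
  </tr>
</table>
"""

def text_or_none(node):
    """Return the node's text with whitespace collapsed, or None if the node is missing."""
    return node.get_text(" ", strip=True) if node is not None else None

soup = BeautifulSoup(toy_html, "html.parser")
toy_rows = []
for tr in soup.select("tr.comment-row"):  # CSS selector: <tr> tags with this class
    toy_rows.append({
        "comment_id": tr.get("id"),  # attribute lookup; None if absent
        "user": text_or_none(tr.select_one("a.author")),
        "time_text": text_or_none(tr.select_one("span.posted")),
        "comment_text": text_or_none(tr.select_one("div.body")),
    })

print(toy_rows)
```

Guarding against missing nodes matters on real pages, where deleted or collapsed comments may lack an author or a text body.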

**Q**: Find the HackerNews robots.txt. What does it say about scraping, and are we acting within its stated policy?

**A**: 

In [None]:
def get_discussion_html(story_id):
    fp = os.path.join(RAW_HTML_DIR, f"hn_{story_id}.html")
    if os.path.exists(fp):
        with open(fp, 'r', encoding='utf-8') as f:
            return f.read()
    url = ?? # TODO: build the URL for this story's comment page
    r = ?? # TODO: use the requests library to pull the HTML for that URL
    r.raise_for_status()
    html = r.text
    with open(fp, 'w', encoding='utf-8') as f:
        f.write(html)
    sleep_politely(sleep_range=(30.0, 31.0))  # Adjust sleep according to robots.txt
    return html

# Simple HTML -> comments parser
def parse_comments_from_html(html, story_id):
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # TODO: Extract comment_id, user, time span, and comment text from each comment in the html
    for tr in ?? :
        comment_id = ??
        user = ??
        time_text = ??
        comment_text = ??
        if comment_text:
            rows.append({
                'story_id': story_id,
                'comment_id': comment_id,
                'user': user,
                'time_text': time_text,
                'comment_text': comment_text,
            })
    return rows

all_comments = []
for idx, sid in enumerate(stories_df['id'].tolist()):
    if idx%5==0:
        print(f"{idx}/{len(stories_df['id'])}")
    html = get_discussion_html(sid)
    all_comments.extend(parse_comments_from_html(html, sid))

comments_df = pd.DataFrame(all_comments)
print('Total comments scraped:', len(comments_df))
comments_df.head()

## Combine Datasets

Now we have two dataframes that need to be combined into a single dataset. Luckily, these dataframes share a key, which makes combining them relatively simple. That is often not the case, so you may need to be creative about which pieces of data you use to merge data from multiple sources.

**TODO**
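- Combine your `stories_df` and `comments_df` dataframes using the shared key (`id` in `stories_df`, `story_id` in `comments_df`) so that the resulting dataset has one row per comment. The validation cells below expect the key to be named `story_id`.
- As a final cleaning step, convert the story timestamp from Unix epoch format to a pandas datetime format and store it in a `story_time` column.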
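
Both steps can be illustrated on toy dataframes. The frames `toy_stories` and `toy_comments` and their values are made up, and renaming before the merge is only one of several reasonable ways to line up the key names.

```python
import pandas as pd

# Hypothetical stand-ins for the two dataframes; epoch values are in seconds.
toy_stories = pd.DataFrame({
    "id": [1, 2],
    "title": ["First", "Second"],
    "time": [1700000000, 1700003600],
})
toy_comments = pd.DataFrame({
    "story_id": [1, 1, 2],
    "comment_id": ["101", "102", "201"],
    "comment_text": ["a", "b", "c"],
})

# Rename so the key matches and the story timestamp gets a descriptive name,
# then left-join so every comment row is kept exactly once.
toy_merged = toy_comments.merge(
    toy_stories.rename(columns={"id": "story_id", "time": "story_time"}),
    on="story_id",
    how="left",
    validate="many_to_one",  # raises if a story id appears twice in the right frame
)

# unit="s" tells pandas the integers are seconds since the Unix epoch.
toy_merged["story_time"] = pd.to_datetime(toy_merged["story_time"], unit="s", utc=True)
print(toy_merged)
```

The `validate` argument is optional, but it catches duplicated story rows that would otherwise silently multiply comments.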

**Q**: If we didn't have a shared key for the two dataframes in this scenario, what could you use instead to join them?

**A**: 

In [None]:
# TODO: merge your two dataframes
merged_df = ??

# TODO: convert the timestamp to datetime format
merged_df['story_time'] = ??

assert merged_df['comment_text'].notna().all()
print('Rows in merged dataset:', len(merged_df))
merged_df.sample(3)

## Save your dataset for future use

We will be using this dataset in future assignments, so use this code to save it to file.

In [None]:
out_csv = os.path.join(DATA_DIR, 'hn_comments_with_storymeta.csv')
merged_df.to_csv(out_csv, index=False)
out_csv

## Validation

This section is used for grading. Please run these cells before submission, but do not change any of the code.

In [None]:
comments_per_story = (
    merged_df.groupby(['story_id','title']).size()
    .rename('n_comments').reset_index()
)
comments_per_story.sort_values('n_comments', ascending=False).head(10)

In [None]:
ax = (
    comments_per_story.sort_values('n_comments')
    .tail(15)
    .plot(kind='barh', x='title', y='n_comments', figsize=(8,6))
)
ax.set_xlabel('Number of comments')
ax.set_ylabel('Story title')
ax.set_title('Most discussed HN stories (sample)')
plt.tight_layout()
plt.show()

In [None]:
merged_df['user'].fillna('unknown').value_counts().head(10)

<small>This assignment was designed with the assistance of ChatGPT.</small>