<a href="https://colab.research.google.com/github/YanranChen11/yanranchen11.github.io/blob/main/Session1_APIs_WebScraping_Workbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 1 Workbook — Data Collection with APIs & Web Scraping
### Text Analysis in Python for Public Policy / International Affairs


**Structure (2.5h live coding → 30m break → 60m student work)**
1. Token‑less API warm‑up (JSON & endpoints)
2. NewsAPI.org — query 'New York' → JSON → pandas DataFrame → quick clean
3. Web scraping with BeautifulSoup using the URL column
4. Ethics & troubleshooting notes

> Tip: Run cells top‑to‑bottom. If you hit network/rate‑limit issues, use the *Offline Fallback* cells provided.


## Learning Objectives
By the end of this session, you will be able to:
- Explain what an API is (base URL, endpoints, query parameters) and read JSON.
- Make a request to a public API and parse JSON into Python objects.
- Convert nested JSON into a tidy pandas `DataFrame` and do light cleaning.
- Use article URLs from API results to scrape page text with BeautifulSoup.
- Follow basic ethical guidelines for scraping (robots.txt, rate limiting, attribution).


## 0) Environment Setup (Colab‑friendly)

In [40]:
# If you're in Colab, uncomment the next lines to install packages:
# !pip install requests pandas beautifulsoup4 lxml python-dotenv

# Imports used throughout the workbook
import requests            # for HTTP requests to APIs / web pages
import json                # for pretty-printing JSON
import pandas as pd        # for DataFrame work
from bs4 import BeautifulSoup  # for HTML parsing
import time                # for polite sleeping / rate-limiting
import os                  # for reading environment variables


## 1) Warm‑up: Token‑less API (Cat Facts API)
**Concepts:** base URL, endpoint, JSON, key–value pairs.

We’ll hit a simple, no‑auth endpoint to focus on the response shape.
API docs: `https://catfact.ninja/fact` (returns one random cat fact)


In [39]:
# Define the base URL (the main address of the API).
base_url = "https://catfact.ninja"
# Define the endpoint (the specific resource we want under the base URL).
endpoint = "/fact"
# Combine base URL and endpoint into a full URL.
url = f"{base_url}{endpoint}"

# Or can just say url = "https://catfact.ninja/fact"

# Send a GET request to the server to retrieve data.
response = requests.get(url)

# Check the HTTP status code (200 means OK/success).
print("Status code:", response.status_code)

# Convert the response body from JSON text into a Python dict.
data = response.json()

# Pretty-print the JSON so we can see its structure (keys and values).
print("Raw JSON:")
print(json.dumps(data, indent=2))

# Access a value by key from the JSON (dictionary).
print("Just the fact value:")
print(data["fact"])


Status code: 200
Raw JSON:
{
  "fact": "One reason that kittens sleep so much is because a growth hormone is released only during sleep.",
  "length": 96
}
Just the fact value:
One reason that kittens sleep so much is because a growth hormone is released only during sleep.


### Offline Fallback (if the API is down or blocked)

In [41]:
# This cell simulates the same JSON structure returned by the API.
# Use this if you're offline or the API rate-limits during class.
offline_json = {
    "fact": "Cats can rotate their ears 180 degrees.",
    "length": 41
}
print(json.dumps(offline_json, indent=2))
print("Access 'fact':", offline_json["fact"])


{
  "fact": "Cats can rotate their ears 180 degrees.",
  "length": 41
}
Access 'fact': Cats can rotate their ears 180 degrees.


## 2) Real‑world API: NewsAPI.org — `everything` endpoint
We’ll search for the term **“New York”**, retrieve results in JSON, then convert to a pandas `DataFrame`.

> **Setup:** You need a free API key from https://newsapi.org/  
> **Security tip:** Store your key as an environment variable (e.g., `NEWSAPI_KEY`) or use `python-dotenv`.


In [42]:
# Store your API KEY in a variab;e
NEWSAPI_KEY = '48b88b6b4c2e4ad683dbfd41c4f689fb'

# Define the endpoint and query parameters.
news_url = "https://newsapi.org/v2/everything"
params = {
    "q": "New York",   # search query
    "language": "en",  # restrict to English
    "pageSize": 25,    # number of results per page (max 100)
    "sortBy": "relevancy",  # or 'publishedAt' for recency
    "apiKey": NEWSAPI_KEY   # your API key
}

# Make the request with parameters to the NewsAPI endpoint.
news_resp = requests.get(news_url, params=params)

# Inspect status code to ensure the request worked.
print("Status code:", news_resp.status_code)

# Convert to Python objects (dict) from JSON.
news_data = news_resp.json()

# Sanity check: print the top-level keys.
print("Top-level keys:", list(news_data.keys()))

# Inspect the first article's keys to understand the structure.
if news_data.get("articles"):
    print("Article keys:", list(news_data["articles"][0].keys()))
else:
    print("No articles returned. Check your API key or params.")


Status code: 200
Top-level keys: ['status', 'totalResults', 'articles']
Article keys: ['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content']


### Convert JSON → pandas DataFrame & Quick Cleaning

In [43]:
# Convert the list of articles (list of dicts) into a DataFrame.
articles = news_data.get("articles", [])
df = pd.DataFrame(articles) # dataframe: an interactive spreadsheet

# Show the first few rows to confirm shape and columns.
print("DataFrame shape:", df.shape)
df.head() # to see the first top rows
  # in the semi-column, can put exact numbers, e.g. df.tail(10); df.sample(10()
# DataFrame shop (25, 8) correspods with the 25 limit that is set before


DataFrame shape: (25, 8)


Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': 'the-verge', 'name': 'The Verge'}",Elissa Welle,New York bans AI-enabled rent price fixing,"On Thursday, New York Gov. Kathy Hochul signed...",https://www.theverge.com/news/801205/new-york-...,https://platform.theverge.com/wp-content/uploa...,2025-10-16T21:24:52Z,<ul><li></li><li></li><li></li></ul>\r\nIts th...
1,"{'id': None, 'name': 'BBC News'}",,Five takeaways from Mamdani-Cuomo New York may...,The three leading candidates for New York City...,https://www.bbc.com/news/articles/cn8xlx53jn6o,https://ichef.bbci.co.uk/news/1024/branded_new...,2025-10-17T02:43:13Z,Kayla Epstein and Grace Eliza Goodwin at Rocke...
2,"{'id': None, 'name': 'BBC News'}",Tom Rostance,Hodgkinson ends season with Athlos win in New ...,Britain's Keely Hodgkinson ends her 800m seaso...,https://www.bbc.com/sport/athletics/articles/c...,https://ichef.bbci.co.uk/ace/branded_sport/120...,2025-10-11T06:01:45Z,Keely Hodgkinson ended her 800m season with vi...
3,"{'id': 'the-verge', 'name': 'The Verge'}",Emma Roth,Trump’s solution for high drug prices is a dis...,President Donald Trump is launching a new gove...,https://www.theverge.com/news/790156/trump-hea...,https://platform.theverge.com/wp-content/uploa...,2025-10-01T19:45:48Z,<ul><li></li><li></li><li></li></ul>\r\nThe ne...
4,"{'id': None, 'name': 'Gizmodo.com'}",AJ Dellinger,New York City Sues Social Media Companies Over...,The Big Apple bites back.,https://gizmodo.com/new-york-city-sues-social-...,https://gizmodo.com/app/uploads/2020/12/l1qtvm...,2025-10-09T19:25:12Z,Here’s a new element of the East Coast vs. Wes...


In [44]:
# Select a subset of useful columns for our analysis.
keep_cols = ["source", "author", "title", "description", "url", "publishedAt"]
df = df[keep_cols]

# 'source' is a nested dict; flatten it to a simple string (source name).
# We create a new column 'source_name' with the inner 'name' field.
df["source_name"] = df["source"].apply(lambda d: d.get("name") if isinstance(d, dict) else None)

# Drop the original nested 'source' column now that we've extracted the name.
df = df.drop(columns=["source"])

# Convert 'publishedAt' to a proper datetime type for easier filtering/sorting.
df["publishedAt"] = pd.to_datetime(df["publishedAt"], errors="coerce")

# Drop rows with missing URLs or titles (these are critical for scraping/analysis).
df = df.dropna(subset=["url", "title"]).reset_index(drop=True)

# Sort by recency to bring the newest items to the top.
df = df.sort_values("publishedAt", ascending=False).reset_index(drop=True)

# Display a tidy preview
df.head(10)


Unnamed: 0,author,title,description,url,publishedAt,source_name
0,Jay Peters,Wordle has achievements now,Want to flex your Wordle habit beyond just kee...,https://www.theverge.com/news/806578/nyt-games...,2025-10-24 21:28:43+00:00,The Verge
1,Emma Roth,Microsoft Edge’s new Copilot Mode turns on mor...,Microsoft is joining the AI browser wave with ...,https://www.theverge.com/news/805833/microsoft...,2025-10-23 22:00:35+00:00,The Verge
2,Richard Lawler,Amazon claims the headline isn’t robots taking...,A New York Times report on Tuesday cited inter...,https://www.theverge.com/news/805098/amazon-ro...,2025-10-23 00:51:34+00:00,The Verge
3,Jess Weatherbed,"Amazon hopes to replace 600,000 US workers wit...",Amazon is reportedly leaning into automation p...,https://www.theverge.com/news/803257/amazon-ro...,2025-10-21 11:11:19+00:00,The Verge
4,Lucas Ropek,Google’s New York Offices Reportedly Developed...,The tech company needs a literal de-bugger. Mu...,https://gizmodo.com/google-new-york-bed-bugs-2...,2025-10-21 09:30:12+00:00,Gizmodo.com
5,Nilay Patel,Zocdoc CEO: ‘Dr. Google is going to be replace...,Today’s Decoder episode is a special one: I’m ...,https://www.theverge.com/podcast/801767/zocdoc...,2025-10-20 13:59:34+00:00,The Verge
6,Sofia Barnett,AI Is Changing What High School STEM Students ...,A degree in computer science used to promise a...,https://www.wired.com/story/stem-high-school-s...,2025-10-20 09:30:00+00:00,Wired
7,Victoria Song,The future I saw through the Meta Ray-Ban Disp...,Outside a florist-cum-coffee shop in upstate N...,https://www.theverge.com/tech/801684/meta-ray-...,2025-10-17 18:30:04+00:00,The Verge
8,,Five takeaways from Mamdani-Cuomo New York may...,The three leading candidates for New York City...,https://www.bbc.com/news/articles/cn8xlx53jn6o,2025-10-17 02:43:13+00:00,BBC News
9,Elissa Welle,New York bans AI-enabled rent price fixing,"On Thursday, New York Gov. Kathy Hochul signed...",https://www.theverge.com/news/801205/new-york-...,2025-10-16 21:24:52+00:00,The Verge


### Offline Fallback for NewsAPI (sample payload)

In [20]:
# Use a small, hard-coded sample if the API call fails/limits.
offline_news = {
  "status": "ok",
  "totalResults": 2,
  "articles": [
    {
      "source": {
        "id": null,
        "name": "Example Times"
      },
      "author": "Jane Doe",
      "title": "New York expands ferry service for commuters",
      "description": "City officials announce new routes and schedules.",
      "url": "https://www.example.com/ny-ferry",
      "publishedAt": "2025-10-20T10:00:00Z"
    },
    {
      "source": {
        "id": null,
        "name": "Policy Daily"
      },
      "author": "John Smith",
      "title": "Housing advocates push for zoning reform in New York",
      "description": "Debate intensifies over upzoning proposals.",
      "url": "https://www.example.com/ny-zoning",
      "publishedAt": "2025-10-19T09:30:00Z"
    }
  ]
}

offline_df = pd.DataFrame(offline_news["articles"])
offline_df["source_name"] = offline_df["source"].apply(lambda d: d.get("name") if isinstance(d, dict) else None)
offline_df = offline_df.drop(columns=["source"])
offline_df["publishedAt"] = pd.to_datetime(offline_df["publishedAt"], errors="coerce")
offline_df = offline_df.dropna(subset=["url", "title"]).sort_values("publishedAt", ascending=False).reset_index(drop=True)
offline_df


NameError: name 'null' is not defined

## 3) Web Scraping with BeautifulSoup (from the URL column)
**Goal:** Given an article URL, fetch the web page and extract the textual content (paragraphs).

> **Important:** Real news sites often have paywalls or dynamic content loaded by JavaScript. For teaching, start with any URL that returns visible `<p>` text. Otherwise, use the **offline fallback** cell.


In [45]:
# Choose a URL to scrape: try the live df first, fallback to offline_df if needed.
candidate_df = df if not df.empty else offline_df
article_url = candidate_df.loc[8, "url"]
print("Scraping URL:", article_url)
 # change from 0 to 8 STARRED

# Send a GET request to retrieve the raw HTML of the page.
page_resp = requests.get(article_url, timeout=15)

# Create a BeautifulSoup object to parse the HTML document.
soup = BeautifulSoup(page_resp.text, "html.parser")

# Find all paragraph tags <p> and extract text from each.
paragraphs = soup.find_all("p")
# print(paragraph)

# Use a list comprehension to strip whitespace and only keep non-empty paragraphs.
para_text = [p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)]
  # if find whitespace, strip it

# Join paragraphs into a single string for quick inspection (limit output length).
full_text = " ".join(para_text)
print("First 1000 characters of extracted text:\n")
print(full_text[:1000])


Scraping URL: https://www.bbc.com/news/articles/cn8xlx53jn6o
First 1000 characters of extracted text:

The three leading candidates for New York City mayor took the stage at Rockefeller Center in Manhattan on Thursday to make a case to lead America's biggest city. They tangled over housing, Israel and Gaza, and President Donald Trump, with frontrunner Zohran Mamdani pressing main competitor Andrew Cuomo during the heated two-hour debate. With early voting set to begin next week, neither dominated the evening - though both declared victory afterwards. The most recent polling suggests Mamdani has widened his lead to 46%, while Cuomo stands at 33%. The outcome could have political implications beyond the Empire State as Trump looms large, and whoever wins will likely face pressure from Washington in some form. The Democratic Party nationally is likely watching to see if the America's biggest Democratic stronghold chooses an establishment, centrist figure in Cuomo - who is running as an in

### (Optional) More robust scraping: headers + polite delay

In [26]:
# Some sites block default Python requests; set a user-agent header to look like a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
}

# Example loop to scrape first N article URLs with a polite delay.
N = min(3, len(candidate_df))  # limit to 3 for class demo
texts = []

for i in range(N):
    url_i = candidate_df.loc[i, "url"]
    print(f"Fetching ({i+1}/{N}):", url_i)
    try:
        r = requests.get(url_i, headers=headers, timeout=15)
        soup = BeautifulSoup(r.text, "html.parser")
        paras = [p.get_text(strip=True) for p in soup.find_all("p")]
        text_i = " ".join([t for t in paras if t])
        texts.append(text_i)
        # Be polite and avoid hammering servers.
        time.sleep(1.0)
    except Exception as e:
        print("Error fetching:", e)
        texts.append("")

# Add scraped text as a new column aligned to the first N rows.
candidate_df = candidate_df.copy()
candidate_df.loc[:N-1, "scraped_text"] = texts
candidate_df.head(N)


Fetching (1/3): https://www.theverge.com/news/806578/nyt-games-badges-achievements-wordle-spelling-bee-connections
Fetching (2/3): https://www.theverge.com/news/805833/microsoft-edge-copilot-mode-ai-launch
Fetching (3/3): https://www.theverge.com/news/805098/amazon-robots-ai-warehouses


Unnamed: 0,author,title,description,url,publishedAt,source_name,scraped_text
0,Jay Peters,Wordle has achievements now,Want to flex your Wordle habit beyond just kee...,https://www.theverge.com/news/806578/nyt-games...,2025-10-24 21:28:43+00:00,The Verge,Posts from this topic will be added to your da...
1,Emma Roth,Microsoft Edge’s new Copilot Mode turns on mor...,Microsoft is joining the AI browser wave with ...,https://www.theverge.com/news/805833/microsoft...,2025-10-23 22:00:35+00:00,The Verge,Posts from this topic will be added to your da...
2,Richard Lawler,Amazon claims the headline isn’t robots taking...,A New York Times report on Tuesday cited inter...,https://www.theverge.com/news/805098/amazon-ro...,2025-10-23 00:51:34+00:00,The Verge,Posts from this topic will be added to your da...


## 4) Save Scraped Text to Google Drive as `.txt` Files (Colab)
This section lets you export each row's `scraped_text` into a separate `.txt` file on Google Drive.

**Workflow**
1. Mount Drive (Colab)
2. Choose (or create) a destination folder in your Drive
3. Iterate through the DataFrame, sanitize filenames, and write `.txt` files

> If you're running **locally** (not in Colab), skip the mount cell and set `base_dir` to a local path (e.g., `./exports`).

In [71]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [72]:
# --- Colab-only: Mount Google Drive ---
# If running in Colab, uncomment the two lines below.
# from google.colab import drive
# drive.mount('/content/drive')

# Choose a Drive folder (adjust this path). If running locally, use a local path instead.
# Example for Colab:
base_dir = "/content/drive/MyDrive/intro-to-text-analysis"
# Example local fallback:
# base_dir = "./exports"

import os
os.makedirs(base_dir, exist_ok=True)
print("Saving .txt files to:", base_dir)

Saving .txt files to: /content/drive/MyDrive/intro-to-text-analysis


In [89]:
def scrape_article(url):
    response = requests.get(url)
    response.encoding = 'utf-8' # needed handle non-ASCII (American Standard Code for Information Interchange) characters correctly in text
    html_string = response.text
    return html_string

In [90]:
df['text'] = df['url'].apply(scrape_article)

In [91]:
df

Unnamed: 0,author,title,description,url,publishedAt,source_name,text
0,Jay Peters,Wordle has achievements now,Want to flex your Wordle habit beyond just kee...,https://www.theverge.com/news/806578/nyt-games...,2025-10-24 21:28:43+00:00,The Verge,X-Forbidden
1,Emma Roth,Microsoft Edge’s new Copilot Mode turns on mor...,Microsoft is joining the AI browser wave with ...,https://www.theverge.com/news/805833/microsoft...,2025-10-23 22:00:35+00:00,The Verge,X-Forbidden
2,Richard Lawler,Amazon claims the headline isn’t robots taking...,A New York Times report on Tuesday cited inter...,https://www.theverge.com/news/805098/amazon-ro...,2025-10-23 00:51:34+00:00,The Verge,X-Forbidden
3,Jess Weatherbed,"Amazon hopes to replace 600,000 US workers wit...",Amazon is reportedly leaning into automation p...,https://www.theverge.com/news/803257/amazon-ro...,2025-10-21 11:11:19+00:00,The Verge,X-Forbidden
4,Lucas Ropek,Google’s New York Offices Reportedly Developed...,The tech company needs a literal de-bugger. Mu...,https://gizmodo.com/google-new-york-bed-bugs-2...,2025-10-21 09:30:12+00:00,Gizmodo.com,"<!doctype html>\n<html lang=""en-US"">\n <head>..."
5,Nilay Patel,Zocdoc CEO: ‘Dr. Google is going to be replace...,Today’s Decoder episode is a special one: I’m ...,https://www.theverge.com/podcast/801767/zocdoc...,2025-10-20 13:59:34+00:00,The Verge,X-Forbidden
6,Sofia Barnett,AI Is Changing What High School STEM Students ...,A degree in computer science used to promise a...,https://www.wired.com/story/stem-high-school-s...,2025-10-20 09:30:00+00:00,Wired,"<!DOCTYPE html><html lang=""en-US"" dir=""ltr""><h..."
7,Victoria Song,The future I saw through the Meta Ray-Ban Disp...,Outside a florist-cum-coffee shop in upstate N...,https://www.theverge.com/tech/801684/meta-ray-...,2025-10-17 18:30:04+00:00,The Verge,X-Forbidden
8,,Five takeaways from Mamdani-Cuomo New York may...,The three leading candidates for New York City...,https://www.bbc.com/news/articles/cn8xlx53jn6o,2025-10-17 02:43:13+00:00,BBC News,"<!DOCTYPE html><html lang=""en-GB""><head><meta ..."
9,Elissa Welle,New York bans AI-enabled rent price fixing,"On Thursday, New York Gov. Kathy Hochul signed...",https://www.theverge.com/news/801205/new-york-...,2025-10-16 21:24:52+00:00,The Verge,X-Forbidden


In [76]:
for text in df['text']:
    print(text)

X-Forbidden
X-Forbidden
X-Forbidden
X-Forbidden
<!doctype html>
<html lang="en-US">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="icon" type="image/png" href="/favicon.png?v1" sizes="96x96" />
    <link rel="icon" type="image/svg+xml" href="/favicon.svg?v1">
    <link rel="shortcut icon" href="/favicon.ico?v1" />
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v1">
    <link rel="manifest" href="/site.webmanifest">
    <link rel="mask-icon" href="/safari-pinned-tab.svg?v1" color="#004fff">
    <meta name="apple-mobile-web-app-title" content="Gizmodo">
    <meta name="application-name" content="Gizmodo">
    <meta name="msapplication-TileColor" content="#004fff">
    <meta name="theme-color" content="#ffffff">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css">
    <script>window.iconPath = "https://gizmodo.com/app/themes/gi

In [77]:
for text in df['text']:
    soup = BeautifulSoup(text)
    article = soup.get_text()
    print(article)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m





Oct 31
2:44 pm


                  The Best Gadgets of October 2025
                


















Oct 30
5:24 pm


                  Putting a Bluetooth Speaker In an End Table Is a Bad Idea, Actually
                















Oct 29
11:45 am


                  Google Pixel Buds 2a vs. OnePlus Buds 4: Which Wireless Earbuds Win?
                















Oct 28
7:01 am


                  Gizmodo’s Best Tech of 2025 Awards: Our Favorite Phones, Laptops, Gaming Gear, and More
                















Oct 27
5:33 pm


                  14-Inch MacBook Pro (M5) Review: New Soul in an Old Body
                















Oct 26
11:00 am


                  Razer Clio Review: Headphones Are Still Better Than This Headrest Speaker
                















Oct 26
7:00 am


                  These AR Smart Glasses Tested My Patience in a Way I didn’t Think Was Possible
                





In [84]:
with open("all_articles.txt","w") as file:
    for text in df['text']:
        soup = BeautifulSoup(text)
        article = soup.get_text()
        file.write(article)

In [85]:
id = 0
for text in df['text']:
    soup = BeautifulSoup(text)
    article = soup.get_text()

    id += 1
    with open(f"{base_dir}/article_{id}.txt", "w") as file:
        file.write(str(article))

In [32]:
# # --- Export each row's scraped_text to a .txt file ---
# # import os
# # import re
# # import pandas as pd

# def slugify(text, max_len=80):
#     """Create a filesystem-safe slug from any text."""
#     if not isinstance(text, str) or not text.strip():
#         return "untitled"
#     text = text.lower()
#     text = re.sub(r"[^a-z0-9]+", "-", text)
#     text = re.sub(r"-{2,}", "-", text).strip("-")
#     return text[:max_len] if text else "untitled"

# # Pick the first non-empty DataFrame among common variables created earlier.
# df_source = None
# for name in ["df_topic", "candidate_df", "df", "offline_df"]:
#     if name in globals():
#         _df = globals()[name]
#         if isinstance(_df, pd.DataFrame) and not _df.empty:
#             df_source = _df
#             print(f"Using DataFrame: {name} with shape {_df.shape}")
#             break

# if df_source is None:
#     raise ValueError("No DataFrame available. Run earlier cells to create df_topic/df/candidate_df/offline_df.")

# if "scraped_text" not in df_source.columns:
#     print("`scraped_text` column not found. You may need to run the scraping cells first.")
# else:
#     used_names = set()
#     saved = 0
#     for i, row in df_source.iterrows():
#         text = row.get("scraped_text", "")
#         if not isinstance(text, str) or not text.strip():
#             continue  # skip empty text rows

#         # Build a filename using date + title/url slug for disambiguation.
#         title = row.get("title") if "title" in df_source.columns else None
#         url = row.get("url") if "url" in df_source.columns else None

#         # Try to extract a date for filename
#         date_str = ""
#         if "publishedAt" in df_source.columns and pd.notna(row.get("publishedAt")):
#             try:
#                 date_str = pd.to_datetime(row["publishedAt"]).strftime("%Y%m%d")
#             except Exception:
#                 date_str = ""

#         base_name = slugify(title) if title else (slugify(url) if url else f"article-{i}")
#         fname = f"{date_str + '_' if date_str else ''}{base_name}.txt"

#         # Ensure uniqueness
#         original_fname = fname
#         k = 2
#         while fname in used_names or os.path.exists(os.path.join(base_dir, fname)):
#             fname = original_fname.replace(".txt", f"_{k}.txt")
#             k += 1
#         used_names.add(fname)

#         # Write file (UTF-8)
#         path = os.path.join(base_dir, fname)
#         with open(path, "w", encoding="utf-8") as f:
#             f.write(text)

#         saved += 1
#         if saved <= 5:  # print only the first few to keep output tidy
#             print("Saved:", path)
#     print(f"Done. Saved {saved} text files to {base_dir}.")

Using DataFrame: candidate_df with shape (25, 6)
`scraped_text` column not found. You may need to run the scraping cells first.


## 5) Ethics, Legality, and Troubleshooting
- **robots.txt**: Check site’s crawling policy, but note it’s advisory; always follow terms of service.
- **Rate limiting**: Sleep between requests; don’t parallelize aggressively.
- **Attribution**: Cite sources when using scraped content in reports.
- **Paywalls / JS‑rendered sites**: Some pages need tools like `selenium` or `requests_html`. Use sparingly and ethically.
- **Stability**: News sites change their HTML; write resilient, minimal selectors (e.g., `find_all("p")` as a start).
- **Alternatives**: Prefer official APIs when available (structured, stable, legally safer).


## 6) Mini‑Project (60 min post‑break)
**Choose a topic** (e.g., *housing policy*, *AI regulation*, *public transit*, *Ukraine*) and:

1. Modify the NewsAPI query to fetch ~25 English articles from the last week.
2. Convert to a tidy `DataFrame`, keep: `source_name`, `title`, `description`, `url`, `publishedAt`.
3. Scrape the first 2–3 article URLs and add a `scraped_text` column.
4. Save your work to CSV: `results_<topic>.csv`.

> **Stretch goal:** Use `.str.len()` on `scraped_text` to identify the fullest articles; compute basic stats.


In [92]:
# Starter scaffold for the mini-project (students edit this cell).
topic = "Public Education"  # ← change to your chosen topic
NEWSAPI_KEY = os.getenv("NEWSAPI_KEY", "48b88b6b4c2e4ad683dbfd41c4f689fb")

from datetime import datetime, timedelta

# Get ISO 8601 format (e.g., '2025-10-25T12:34:56')
last_week = (datetime.now() - timedelta(days=7)).isoformat(timespec='seconds')
print(last_week)

news_url = "https://newsapi.org/v2/everything"
params = {
    "q": topic,
    "language": "en",
    "pageSize": 25,
    "from": last_week,
    "sortBy": "publishedAt",
    "apiKey": NEWSAPI_KEY
}
resp = requests.get(news_url, params=params)
data = resp.json()
df_topic = pd.DataFrame(data.get("articles", []))
if not df_topic.empty:
    df_topic["source_name"] = df_topic["source"].apply(lambda d: d.get("name") if isinstance(d, dict) else None)
    df_topic = df_topic.drop(columns=["source"])
    df_topic["publishedAt"] = pd.to_datetime(df_topic["publishedAt"], errors="coerce")
    df_topic = df_topic.dropna(subset=["url", "title"]).sort_values("publishedAt", ascending=False).reset_index(drop=True)
    df_topic = df_topic[["source_name", "title", "description", "url", "publishedAt"]]

    # Scrape first 3 URLs
    headers = {"User-Agent": "Mozilla/5.0"}
    texts = []
    # for i in range(min(3, len(df_topic))):
    for i in range(len(df_topic)):
        u = df_topic.loc[i, "url"]
        try:
            r = requests.get(u, headers=headers, timeout=15)
            s = BeautifulSoup(r.text, "html.parser")
            paras = [p.get_text(strip=True) for p in s.find_all("p")]
            texts.append(" ".join([t for t in paras if t]))
            time.sleep(1.0)
        except Exception as e:
            print("Error:", e)
            texts.append("")
    df_topic.loc[:len(texts)-1, "scraped_text"] = texts

    # Save to CSV
    out_name = f"{base_dir}/results_{topic.replace(' ', '_').lower()}.csv"
    df_topic.to_csv(out_name, index=False)
    print("Saved:", out_name)
    df_topic.head()
else:
    print("No results — check your API key or query.")


2025-10-25T18:53:58
Saved: /content/drive/MyDrive/intro-to-text-analysis/results_public_education.csv


## Appendix: Common Errors & Fixes

In [None]:
# 1) If you get a 401 error from NewsAPI -> invalid/expired API key.
#    Fix: double-check your key, or set it explicitly:
# os.environ['NEWSAPI_KEY'] = 'PASTE_YOUR_KEY_HERE'

# 2) If scraping returns empty text:
#    - Try adding headers with a real user-agent.
#    - Try a different URL (some pages are behind paywalls or JS-rendered).
#    - Verify that <p> tags exist by printing a snippet of soup:
# print(soup.prettify()[:1500])

# 3) If you see Unicode errors when saving CSV:
# df.to_csv("file.csv", index=False, encoding="utf-8")

# 4) If you need only recent articles, filter by date:
# cutoff = pd.Timestamp.utcnow() - pd.Timedelta(days=7)
# df = df[df['publishedAt'] >= cutoff]
