# **Download Data**

<br>

---
---
### **Overview**

---

Data for our analysis of presidential texts and declining readability levels was collected from a variety of online resources. This notebook briefly describes each dataset and provides the code used to obtain it.

In [None]:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup
import time
from random import uniform
from tqdm import tqdm

import warnings
warnings.filterwarnings("ignore")

<br>

<br>

---
---
### **NAEP Data**

---

[NAEP](https://nces.ed.gov/nationsreportcard/),
or the National Assessment of Educational Progress, is a national program used to measure student achievement across various core subject areas. Data from these assessments is used to generate
[the Nation's Report Card](https://www.nationsreportcard.gov/),
which can be used to assess how well students are meeting targeted learning goals at national, state, and urban-district levels. For the purposes of this project, we examined results from the NAEP assessments in reading comprehension—NAEP Main and NAEP LTT (Long-Term Trends)—as a proxy for average reading levels of young American adults upon exiting the K-12 education system. Scores for both assessments are given on a
$0-500$
scale.

<br>

---

#### **NAEP Main Data, National Reading, Grade 12**

Since 1992, NAEP Main has been administered to students in grades 4, 8, and 12. For our project, we gathered average NAEP reading scale scores for 12th graders from 1992 to 2024. Data was aggregated at the national level and did not account for demographic or geographic differences.

Visit the [Nation's Report Card](https://nces.ed.gov/nationsreportcard/reading/achieve.aspx#2009ald)
website for more detailed explanations of what the scores mean. Note that although the cutoff scores for performance levels have remained the same since 1992, different descriptors are given for scores from 1992 to 2007 and from 2007 to the present.

| NAEP Main Average Scale Score | Performance Level                            |
|:-----------------------------:|----------------------------------------------|
| $265$                         | NAEP Basic                                   |
| $302$                         | NAEP Proficient                              |
| $346$                         | NAEP Advanced                                |
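
As a small illustration of how the cutoffs in the table can be applied, the sketch below maps an average scale score to a performance level. The function name `naep_main_level` and the label "Below NAEP Basic" are our own placeholders, and treating each cutoff as an inclusive lower bound is an assumption rather than something stated by the source.

```python
# Cutoff scores copied from the table above (grade 12 reading, 0-500 scale),
# ordered from highest to lowest so the first match is the highest level reached.
NAEP_MAIN_CUTOFFS = [
    (346, "NAEP Advanced"),
    (302, "NAEP Proficient"),
    (265, "NAEP Basic"),
]


def naep_main_level(score):
    """Return the performance level for a NAEP Main scale score.

    Assumption: a score equal to a cutoff counts as reaching that level.
    """
    if not 0 <= score <= 500:
        raise ValueError("NAEP scale scores range from 0 to 500")
    for cutoff, label in NAEP_MAIN_CUTOFFS:
        if score >= cutoff:
            return label
    # Hypothetical label for scores under the lowest cutoff in the table.
    return "Below NAEP Basic"
```

For example, `naep_main_level(287)` falls between the Basic and Proficient cutoffs and is therefore reported as "NAEP Basic".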

In [None]:
df = pd.read_excel(
    "data/NAEP-Main, Reading, Grade 12.Xls",
    skiprows = list(range(0, 8)) + list(range(20, 22))
)

# Add column for "Accommodations Allowed"
n_rows = len(df)
accommodations = ["No"] * 3 + ["Yes"] * (n_rows - 3)
df.insert(loc = 1, column = "Accommodations Allowed", value = accommodations)
df = df.iloc[:-2]

# Add "Grade Level"
n_rows = len(df)
grade_level = 12
df.insert(loc = 3, column = "Grade Level", value = grade_level)

# Remove any non-digit characters
df["Year"] = df["Year"].astype(str).str.replace(r"\D", "", regex = True)

# Convert "Year" to integer
df["Year"] = df["Year"].astype(int)

df.to_csv("data/NAEP-Main, Reading, Grade 12.csv", index = False)

<br>

---

#### **NAEP LTT (Long-Term Trends), National Reading, Age 17**

Since 1971, NAEP LTT has been administered roughly every four years to students at ages 9, 13, and 17. For our project, we gathered average NAEP reading scale scores for 17-year-olds from 1971 to 2012. Data was aggregated at the national level and did not account for demographic or geographic differences.

Visit the [Nation's Report Card](https://nces.ed.gov/nationsreportcard/reading/achieve.aspx#2009ald)
website for more detailed explanations of what the scores mean. Because NAEP LTT assessments have remained relatively unchanged since 1992, the performance-level descriptions have not been revised the way those for NAEP Main have.

| **NAEP LTT Average Scale Score**  | **Performance Level**                                    |
|:---------------------------------:|----------------------------------------------------------|
| $350$                             | Learn from Specialized Reading Materials                 |
| $300$                             | Understand Complicated Information                       |
| $250$                             | Interrelate Ideas and Make Generalizations               |
| $200$                             | Demonstrate Partially Developed Skills and Understanding |
| $150$                             | Carry Out Simple, Discrete Reading Tasks                 |
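
The LTT levels form a ladder, so a score can be matched to the highest level it reaches with a sorted lookup. The sketch below is illustrative only: `ltt_level` is a hypothetical helper, and we assume a score equal to a cutoff counts as reaching that level.

```python
from bisect import bisect_right

# Cutoffs and descriptors copied from the table above, in ascending order.
LTT_CUTOFFS = [150, 200, 250, 300, 350]
LTT_LABELS = [
    "Carry Out Simple, Discrete Reading Tasks",
    "Demonstrate Partially Developed Skills and Understanding",
    "Interrelate Ideas and Make Generalizations",
    "Understand Complicated Information",
    "Learn from Specialized Reading Materials",
]


def ltt_level(score):
    """Return the highest LTT performance level reached, or None below 150."""
    # bisect_right counts how many cutoffs are less than or equal to the score.
    reached = bisect_right(LTT_CUTOFFS, score)
    return LTT_LABELS[reached - 1] if reached else None
```

With this lookup, an average score of 285 sits on the "Interrelate Ideas and Make Generalizations" rung, since it is at least 250 but below 300.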

In [None]:
df = pd.read_excel(
    "data/NAEP-LTT, Reading, Age 17.Xls",
    skiprows = list(range(0, 8)) + list(range(24, 26))
)

# Add column for "Original Assessment Format"
n_rows = len(df)
original_assessment_format = ["Yes"] * 11 + ["No"] * (n_rows - 11)
df.insert(loc = 1, column = "Original Assessment Format", value = original_assessment_format)
df = df.iloc[:-2]

# Add "Age"
n_rows = len(df)
age = 17
df.insert(loc = 3, column = "Age", value = age)

# Remove any non-digit characters
df["Year"] = df["Year"].astype(str).str.replace(r"\D", "", regex = True)

# Convert "Year" to integer
df["Year"] = df["Year"].astype(int)

df.to_csv("data/NAEP-LTT, Reading, Age 17.csv", index = False)

<br>

<br>

---
---
### **US Presidents by Political Party and Years in Office (Britannica)**

---

Data on US presidents was taken from this
[Britannica webpage](https://www.britannica.com/topic/Presidents-of-the-United-States-1846696). It contains each president's name, political party, and years in office.

In [None]:
url = "https://www.britannica.com/topic/Presidents-of-the-United-States-1846696"
resp = requests.get(url)
resp.raise_for_status()

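# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
from io import StringIO
tables = pd.read_html(StringIO(resp.text))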
df = tables[0]
df.to_csv("data/presidents.csv", index=False, encoding="utf-8")
presidents_df = df
presidents_df = presidents_df.drop(presidents_df.columns[[0, 1, 3]], axis = 1).iloc[:-2]

# Clean term column
def clean_term(term):
    if pd.isna(term):
        return None

    term = str(term).strip().replace("–", "-").replace("—", "-")
    parts = term.split("-")

    # Strip non-digit characters (footnote markers such as * † a b)
    start = re.sub(r"\D", "", parts[0])

    if start == "":
        return None
    start = int(start)

    # Handle end year
    if len(parts) == 1 or parts[1].strip() == "":
        end = start

    else:
        end = re.sub(r"\D", "", parts[1])

        if end == "":
            end = start

        elif len(end) == 2:
            end = int(str(start)[:2] + end)

        else:
            end = int(end)

    return f"{start}–{end}" if start != end else str(start)

presidents_df["term"] = presidents_df["term"].apply(clean_term)
presidents_df[["term_start", "term_end"]] = (presidents_df["term"].str.split("–", expand=True))
presidents_df["term_start"] = presidents_df["term_start"].astype("Int64")
presidents_df["term_end"] = presidents_df["term_end"].astype("Int64")

presidents_df.to_csv("data/presidents.csv", index = False, encoding = "utf-8")
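
To make the behaviour of the term cleaning above concrete, here is a condensed, standalone restatement of the same rules that returns the two years as integers. `normalize_term` is a hypothetical name, and the sample strings are assumed formats rather than values copied from the Britannica table. Like the cell above, it expands a two-digit end year using the century of the start year, so it assumes terms that cross a century boundary are written out in full.

```python
import re


def normalize_term(term):
    """Return (start_year, end_year) for a term string such as '1789–97'.

    Returns None when no start year can be found.
    """
    # Treat en and em dashes like a plain hyphen before splitting.
    parts = term.strip().replace("–", "-").replace("—", "-").split("-")

    # Drop footnote markers and any other non-digit characters.
    start_digits = re.sub(r"\D", "", parts[0])
    if not start_digits:
        return None
    start = int(start_digits)

    end_digits = re.sub(r"\D", "", parts[1]) if len(parts) > 1 else ""
    if not end_digits:
        # Single-year or open-ended term: reuse the start year.
        end = start
    elif len(end_digits) == 2:
        # Abbreviated end year: borrow the century from the start year.
        end = int(str(start)[:2] + end_digits)
    else:
        end = int(end_digits)
    return start, end
```

Under these rules, `normalize_term("1789–97")` gives `(1789, 1797)`, while a term written in full such as `"1897–1901"` is passed through unchanged.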

<br>

<br>

---
---
### **Presidential Text**

---

To create a corpus of texts to analyze, we scraped a variety of presidential documents from
[The American Presidency Project](https://www.presidency.ucsb.edu/documents).
This allowed us to collect text samples written or delivered by US presidents during their terms in office. Note that, at this time, texts have NOT been labeled as either public-facing or internal. Documents were taken from the following categories:

  * Eulogies
  * Farewell Addresses
  * Fireside Chats
  * Inaugural Addresses
  * Interviews
  * Messages
  * News Conferences
  * Weekly Addresses
  * State of the Union Addresses

Once all documents were collected, we had a sample size of
$18,144$
texts to analyze.
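
One way to check the size of the collected corpus is to read the per-category CSV files written by the cells in this section and count the rows. The sketch below is a minimal version of that check, not the code used to produce the figure above. It uses only the standard library so that it runs on its own; `load_corpus` and `count_by_category` are hypothetical helpers, and skipping files that have not been created yet is an assumption made for convenience.

```python
import csv
from collections import Counter
from pathlib import Path

# File names written by the scraping cells in this notebook.
CATEGORY_FILES = [
    "eulogy.csv",
    "farewell_address.csv",
    "fireside_chat.csv",
    "inaugural_address.csv",
    "interview.csv",
    "message.csv",
    "news_conference.csv",
    "weekly_address.csv",
    "sotu_address.csv",
]

# Full texts can be longer than the csv module's default field size limit.
csv.field_size_limit(10**9)


def load_corpus(data_dir="data", filenames=CATEGORY_FILES):
    """Read every per-category CSV that exists and return one list of row dicts."""
    rows = []
    for name in filenames:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # assumption: categories not scraped yet are skipped
        # utf-8-sig matches the encoding used when the files were saved.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


def count_by_category(rows):
    """Count documents per value of the text_category column."""
    return Counter(row["text_category"] for row in rows)
```

Calling `len(load_corpus())` then gives the total number of documents on disk, and `count_by_category(load_corpus())` breaks that total down by category.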

<br>

---

#### **Eulogies**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/spoken-addresses-and-remarks/presidential/eulogies?items_per_page=60&page="
total_pages = 2
min_delay = 1.5
max_delay = 3.5
text_category = "eulogy"


# ----------------------------------------------
# Step 1: Scrape Eulogy List
# ----------------------------------------------
def scrape_eulogy_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        eulogies = soup.find_all("div", class_="node-documents")

        for item in eulogies:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of eulogies
    eulogies = scrape_eulogy_list()
    df = pd.DataFrame(eulogies)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/eulogy.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **Farewell Addresses**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/spoken-addresses-and-remarks/presidential/farewell-addresses?items_per_page=10&page="
total_pages = 2
min_delay = 1.5
max_delay = 3.5
text_category = "farewell address"


# ----------------------------------------------
# Step 1: Scrape Farewell Address List
# ----------------------------------------------
def scrape_farewell_address_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        farewell_addresses = soup.find_all("div", class_="node-documents")

        for item in farewell_addresses:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of farewell_addresses
    farewell_addresses = scrape_farewell_address_list()
    df = pd.DataFrame(farewell_addresses)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/farewell_address.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **Fireside Chats**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/spoken-addresses-and-remarks/presidential/fireside-chats?items_per_page=20&page="
total_pages = 2
min_delay = 1.5
max_delay = 3.5
text_category = "fireside chat"

# ----------------------------------------------
# Step 1: Scrape Fireside Chat List
# ----------------------------------------------
def scrape_fireside_chat_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        fireside_chats = soup.find_all("div", class_="node-documents")

        for item in fireside_chats:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of fireside_chats
    fireside_chats = scrape_fireside_chat_list()
    df = pd.DataFrame(fireside_chats)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/fireside_chat.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **Inaugural Addresses**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/spoken-addresses-and-remarks/presidential/inaugural-addresses?items_per_page=60&page="
total_pages = 2
min_delay = 1.5
max_delay = 3.5
text_category = "inaugural address"

# ----------------------------------------------
# Step 1: Scrape Inaugural Address List
# ----------------------------------------------
def scrape_inaugural_address_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        inaugural_addresses = soup.find_all("div", class_="node-documents")

        for item in inaugural_addresses:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of inaugural_addresses
    inaugural_addresses = scrape_inaugural_address_list()
    df = pd.DataFrame(inaugural_addresses)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/inaugural_address.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **Interviews**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/presidential/interviews?items_per_page=60&page="
total_pages = 18
min_delay = 1.5
max_delay = 3.5
text_category = "interview"

# ----------------------------------------------
# Step 1: Scrape Interview List
# ----------------------------------------------
def scrape_interview_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        interviews = soup.find_all("div", class_="node-documents")

        for item in interviews:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of interviews
    interviews = scrape_interview_list()
    df = pd.DataFrame(interviews)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/interview.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **Messages**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/citations/presidential/messages?items_per_page=60&page="
total_pages = 212
min_delay = 1.5
max_delay = 3.5
text_category = "message"

# ----------------------------------------------
# Step 1: Scrape Message List
# ----------------------------------------------
def scrape_message_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        messages = soup.find_all("div", class_="node-documents")

        for item in messages:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(
            url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=15
        )
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of messages
    messages = scrape_message_list()
    df = pd.DataFrame(messages)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/message.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **News Conferences**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/presidential/news-conferences?items_per_page=60&page="
total_pages = 43
min_delay = 1.5
max_delay = 3.5
text_category = "news conference"

# ----------------------------------------------
# Step 1: Scrape News Conference List
# ----------------------------------------------
def scrape_news_conference_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        news_conferences = soup.find_all("div", class_="node-documents")

        for item in news_conferences:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)
                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of news_conferences
    news_conferences = scrape_news_conference_list()
    df = pd.DataFrame(news_conferences)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/news_conference.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **Weekly Addresses**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/spoken-addresses-and-remarks/presidential/saturday-weekly-addresses?items_per_page=60&page="
total_pages = 28
min_delay = 1.5
max_delay = 3.5
text_category = "weekly address"

# ----------------------------------------------
# Step 1: Scrape Weekly Address List
# ----------------------------------------------
def scrape_weekly_address_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        weekly_addresses = soup.find_all("div", class_="node-documents")

        for item in weekly_addresses:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)

                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(
            url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=15
        )
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of weekly_addresses
    weekly_addresses = scrape_weekly_address_list()
    df = pd.DataFrame(weekly_addresses)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/weekly_address.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()

<br>

---

#### **State of the Union Addresses**

In [None]:
# ----------------------------------------------
# Config
# ----------------------------------------------
base_url = "https://www.presidency.ucsb.edu"
list_url = f"{base_url}/documents/app-categories/spoken-addresses-and-remarks/presidential/state-the-union-addresses?items_per_page=60&page="
total_pages = 2
min_delay = 1.5
max_delay = 3.5
text_category = "State of the Union Address"

# ----------------------------------------------
# Step 1: Scrape State of the Union Address List
# ----------------------------------------------
def scrape_sotu_address_list():
    all_items = []

    for page in tqdm(range(total_pages), desc="Pages"):
        url = f"{list_url}{page}"
        # Time out rather than hang indefinitely on a stalled connection
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)

        if r.status_code != 200:
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        sotu_addresses = soup.find_all("div", class_="node-documents")

        for item in sotu_addresses:
            try:
                date = item.find("span", {"property": "dc:date"}).get_text(strip=True)

                title_tag = item.find("div", class_="field-title").find("a")
                title = title_tag.get_text(strip=True)
                href = base_url + title_tag["href"]
                president = item.find("div", class_="col-sm-4").find("a").get_text(strip=True)

                all_items.append({
                    "date": date,
                    "title": title,
                    "url": href,
                    "president": president,
                    "text_category": text_category
                })

            except Exception:
                continue

        time.sleep(uniform(min_delay, max_delay))
    return all_items


# ----------------------------------------------
# Step 2: Scrape Full Text
# ----------------------------------------------
def scrape_full_text(url):
    try:
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
        if r.status_code != 200:
            return None

        soup = BeautifulSoup(r.text, "html.parser")
        content = soup.find("div", class_="field-docs-content")
        if not content:
            return None

        paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
        return "\n".join(paragraphs)

    except Exception:
        return None


# ----------------------------------------------
# Step 3: Main
# ----------------------------------------------
def main():
    # Step 3a: Scrape list of sotu_addresses
    sotu_addresses = scrape_sotu_address_list()
    df = pd.DataFrame(sotu_addresses)

    # Step 3b: Scrape full text with progress bar
    full_texts = []
    for row in tqdm(df.itertuples(index=False), total=len(df), desc="Scraping full text"):
        text = scrape_full_text(row.url)
        full_texts.append(text)
        time.sleep(uniform(min_delay, max_delay))

    df["full_text"] = full_texts

    # Save to CSV
    df.to_csv("data/sotu_address.csv", index=False, encoding="utf-8-sig")

if __name__ == "__main__":
    main()
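
The nine scraping cells in this section differ only in the listing path, the page size, the number of pages, and the category label. As a possible simplification, those settings could be collected in one table and the listing URLs generated from it. The sketch below shows that idea under our own naming (`Category`, `CATEGORIES`, and `list_urls` are hypothetical); the paths and page counts are copied from the config blocks above and may need updating as the site grows.

```python
from dataclasses import dataclass

BASE_URL = "https://www.presidency.ucsb.edu"
SPOKEN = "spoken-addresses-and-remarks/presidential"


@dataclass(frozen=True)
class Category:
    label: str           # value written to the text_category column
    path: str            # path under /documents/app-categories/
    items_per_page: int  # page size requested from the listing
    total_pages: int     # number of listing pages to visit


# Values copied from the config blocks of the cells in this section.
CATEGORIES = [
    Category("eulogy", f"{SPOKEN}/eulogies", 60, 2),
    Category("farewell address", f"{SPOKEN}/farewell-addresses", 10, 2),
    Category("fireside chat", f"{SPOKEN}/fireside-chats", 20, 2),
    Category("inaugural address", f"{SPOKEN}/inaugural-addresses", 60, 2),
    Category("interview", "presidential/interviews", 60, 18),
    Category("message", "citations/presidential/messages", 60, 212),
    Category("news conference", "presidential/news-conferences", 60, 43),
    Category("weekly address", f"{SPOKEN}/saturday-weekly-addresses", 60, 28),
    Category("State of the Union Address", f"{SPOKEN}/state-the-union-addresses", 60, 2),
]


def list_urls(category):
    """Build the listing-page URLs for one category (pages are zero-indexed)."""
    prefix = f"{BASE_URL}/documents/app-categories/{category.path}"
    return [
        f"{prefix}?items_per_page={category.items_per_page}&page={page}"
        for page in range(category.total_pages)
    ]
```

A single list-scraping function could then loop over `list_urls(category)` for each entry in `CATEGORIES` instead of being redefined per cell.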