### **Data Collection**

We are going to scrape transcripts of 10 most popular President Joe Biden's speeches in 2023 from the official website of the White House and Miller Center using packages:


*   requests
*   BeautifulSoup
*   lxml



In [11]:
import requests

from bs4 import BeautifulSoup
from lxml import etree

In [10]:
sources = [
    {
        'date': '19.09.2023',
        'title': 'Remarks by President Biden Before the 78th Session of the United Nations General Assembly | New '
                 'York, NY',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/09/19/remarks-by-president-biden'
               '-before-the-78th-session-of-the-united-nations-general-assembly-new-york-ny/',
    },
    {
        'date': '29.03.2023',
        'title': 'Remarks by President Biden at the Summit for Democracy Virtual Plenary on Democracy Delivering on '
                 'Global Challenges',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/03/29/remarks-by-president-biden-at'
               '-the-summit-for-democracy-virtual-plenary-on-democracy-delivering-on-global-challenges/',
    },
    {
        'date': '20.10.2023',
        'title': 'Remarks by President Biden on the United States’ Response to Hamas’s Terrorist Attacks Against '
                 'Israel and Russia’s Ongoing Brutal War Against Ukraine',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/10/20/remarks-by-president-biden-on'
               '-the-unites-states-response-to-hamass-terrorist-attacks-against-israel-and-russias-ongoing-brutal'
               '-war-against-ukraine/',
    },
    {
        'date': '11.07.2023',
        'title': 'Remarks by President Biden and President Gitanas Nausėda of Lithuania Before Bilateral Meeting | '
                 'Vilnius, Lithuania',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/07/11/remarks-by-president-biden-and'
               '-president-gitanas-nauseda-of-lithuania-before-bilateral-meeting-vilnius-lithuania/',
    },
    {
        'date': '21.02.2023',
        'title': 'Remarks on the One-Year Anniversary of the Ukraine War',
        'url': 'https://millercenter.org/the-presidency/presidential-speeches/february-21-2023-remarks-one-year'
               '-anniversary-ukraine-war',
    },
    {
        'date': '20.05.2023',
        'title': 'Remarks by President Biden, Prime Minister Kishida, Prime Minister Modi, and Prime Minister '
                 'Albanese at the Third In-Person Quad Leaders’ Summit',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/05/20/remarks-by-president-biden-prime'
               '-minister-kishida-prime-minister-modi-and-prime-minister-albanese-at-the-third-in-person-quad-leaders'
               '-summit/',
    },
    {
        'date': '20.05.2023',
        'title': 'Remarks by President Biden, Prime Minister Kishida, Prime Minister Modi, and Prime Minister '
                 'Albanese at the Third In-Person Quad Leaders’ Summit',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/05/20/remarks-by-president-biden-prime'
               '-minister-kishida-prime-minister-modi-and-prime-minister-albanese-at-the-third-in-person-quad-leaders'
               '-summit/',
    },
    {
        'date': '16.11.2023',
        'title': 'Remarks by President Biden at the APEC CEO Summit | San Francisco, CA',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/11/16/remarks-by-president-biden-at'
               '-the-apec-ceo-summit-san-francisco-ca/',
    },
    {
        'date': '09.09.2023',
        'title': 'Remarks by President Biden at Meeting for Partnership for Global Infrastructure and Investment',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/09/09/remarks-by-president-biden-at'
               '-meeting-for-partnership-for-global-infrastructure-and-investment/',
    },
    {
        'date': '13.03.2023',
        'title': 'Remarks by President Biden, Prime Minister Albanese of Australia, and Prime Minister Sunak of the '
                 'United Kingdom on the AUKUS Partnership',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/03/13/remarks-by-president-biden-prime'
               '-minister-albanese-of-australia-and-prime-minister-sunak-of-the-united-kingdom-on-the-aukus'
               '-partnership/',
    },
    {
        'date': '16.09.2023',
        'title': 'Remarks by President Biden at the Indo-Pacific Economic Framework | San Francisco, CA',
        'url': 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/11/16/remarks-by-president-biden-at'
               '-the-indo-pacific-economic-framework-san-francisco-ca/',
    }
]

We will use elements' XPATH in order to extract the transcript. Then we will put all transcripts on one list - **transcripts[]**.


For all speeches in the White House website it will be the same.


For the transcript in the Miller Center website it will be another. Moreover, the text there is split into several \<p> elements. That's why firstly we need to count the number of such elements, then extract the data from each of them and then combine into one text.

In order to understand that we have successfully scraped the data, let's print out the first 30 symbols of each speech.


In [49]:
def has_only_row_class(tag):
    return tag.name == 'div' and tag.get('class') == ['row']

In [86]:
transcripts = []

for num, source in enumerate(sources):
    response = requests.get(source['url'])
    soup = BeautifulSoup(response.content, 'lxml')
    dom = etree.HTML(str(soup))

    index = 0

    if 'www.whitehouse.gov' in source['url']:
        outer_element = soup.find_all(has_only_row_class)
        xpath_exp = '//*[@id="content"]/article/section/div/div/p['
        index = 1
    else:
        outer_element = soup.find_all('div', class_='transcript-inner')
        xpath_exp = '//*[@id="dp-expandable-text"]/div[1]/p['

    if not outer_element:
        print('Error occured: outer element not found')
        continue

    p_elements = outer_element[index].find_all('p')

    text = []
    for i in range(1, len(p_elements) + 1):
        element = dom.xpath(xpath_exp + str(i) + ']')

        if element:
            text.append(' '.join(element[0].itertext()))
        else:
            print('Error occurred: element not found')

    print('The text of Speech {}: {}'.format(num, ' '.join(text)[:30]))

    transcripts.append(' '.join(text))

The text of Speech 0: United Nations Headquarters Ne
The text of Speech 1: South Court Auditorium Eisenho
The text of Speech 2: 8:02 P.M. EDT   THE PRESIDENT:
The text of Speech 3: Presidential Palace Vilnius, L
The text of Speech 4: THE PRESIDENT:  Hello, Poland!
The text of Speech 5: 8:43 P.M. JST PRIME MINISTER A
The text of Speech 6: 8:43 P.M. JST PRIME MINISTER A
The text of Speech 7: 11:20 A.M. PST   THE PRESIDENT
The text of Speech 8: International Exhibition-cum-C
The text of Speech 9: Point Loma Naval Base San Dieg
The text of Speech 10: Moscone Convention Center San 


Let's also print out the text of the first speech:

In [87]:
transcripts[0]

'United Nations Headquarters New York, New York 10:17 A.M. EDT THE PRESIDENT:\xa0 Mr. President, Mr. Secretary-General, and my fellow leaders, about a week ago I stood on the other side of the world in Vietnam on soil once bloody with war. And I met a small group of veterans, Americans and Vietnamese, who wit- — and I wa- — I watched an exchange of personal artifacts from that war — identification cards and a diary.\xa0 It was deeply moving to see the reaction of the Vietnamese and American soldiers. A culmination of 50 years of hard work on both sides to address the painful legacies of war and to choose — to choose to work together toward peace and a better future. Nothing about that journey was inevitable.\xa0 For decades, it would have been unthinkable for an American president to stand in Hanoi alongside a Vietnamese leader and announce a mutual commitment to the highest level of countries partnership.\xa0 But it’s a powerful reminder that our history need not dictate our future. W

Everything seems to be ok, so we can now save all the speeches into one **.txt** file:

In [92]:
file_name = 'all_speeches.txt'

with open(file_name, 'w', encoding='utf-8') as file:
    file.write(' '.join(transcripts))

print(f"Text has been saved to {file_name}!")

Text has been saved to all_speeches.txt!


### **Preprocessing and Data Loading**

Firstly, we import all the packages that we are going to use further for data preprocessing.

In [88]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

import re
import nltk
# import networkx as nx
# import matplotlib.pyplot as plt

Then we download the set of stop words (common words that usually do not carry significant meaning) using NLTK package.

In [None]:
nltk.download('punkt')

# Downloading stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Downloading WordNet for lemmatization (reducing the words to their initial form)
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

Here we define several functions that we will use for cleansing the data scraped from the Web.

In [90]:
# Function for removing HTML tags
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, ' ', text)

# Function for removing numbers and non-alphabetic characters
def remove_non_alpha(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

def remove_nbsp(text):
    return text.replace('nbsp', ' ').replace('\xa0', ' ')

Now we will download the speech, delete HTML tags, tokenize the words, filter and lemmatize them.

In [98]:
# Uploading .txt file with all collected speeches
with open('all_speeches.txt', 'r', encoding='utf-8') as file:
    text = file.read()

text = remove_html_tags(text)
text = remove_non_alpha(text)
text = remove_nbsp(text)
text = ' '.join(text.strip().split())
print(text)

# Tokenization of sentences and words
sentences = sent_tokenize(text)
words = [word_tokenize(sentence.lower()) for sentence in sentences]

# Creating words and verbs lists
words_list = []
verbs_list = []

# Extracting all words and verbs
for sentence in words:
    for word, tag in pos_tag(sentence):
        if word not in stop_words and len(word) > 1:
            if tag.startswith('V'):  # We add verbs to a separate list (verbs_list)
                verbs_list.append(word)
            else:
                words_list.append(lemmatizer.lemmatize(word))  # Lemmatization of words

# Lemmatization of verbs
lemmatized_verbs = [lemmatizer.lemmatize(verb, pos='v') for verb in verbs_list]

# Removing the verbs that end with "ing"
lemmatized_verbs = [verb for verb in lemmatized_verbs if not verb.endswith('ing')]

# Removing word "u"
words_list = [word for word in words_list if 'u' not in word]

United Nations Headquarters New York New York AM EDT THE PRESIDENT Mr President Mr SecretaryGeneral and my fellow leaders about a week ago I stood on the other side of the world in Vietnam on soil once bloody with war And I met a small group of veterans Americans and Vietnamese who wit and I wa I watched an exchange of personal artifacts from that war identification cards and a diary It was deeply moving to see the reaction of the Vietnamese and American soldiers A culmination of years of hard work on both sides to address the painful legacies of war and to choose to choose to work together toward peace and a better future Nothing about that journey was inevitable For decades it would have been unthinkable for an American president to stand in Hanoi alongside a Vietnamese leader and announce a mutual commitment to the highest level of countries partnership But its a powerful reminder that our history need not dictate our future With a concerted leadership and careful effort adversaries

Let's look at the list of words and lemmatized verbs:

In [99]:
words_list[-1:-6:-1]

['pst', 'pm', 'minister', 'prime', 'mr']

In [100]:
lemmatized_verbs[-1:-6:-1]

['make', 'turn', 'id', 'join', 'welcome']

Let's look at the characteristics of the collected data:

In [101]:
# Data description
word_freq = nltk.FreqDist(words_list)
verb_freq = nltk.FreqDist(lemmatized_verbs)

print(f"Total number of words: {len(words_list)}")
print(f"Unique words: {len(set(words_list))}")
print(f"The 10 most common words: {word_freq.most_common(10)}")

print()

print(f"Total number of verbs: {len(verbs_list)}")
print(f"Unique verbs: {len(set(lemmatized_verbs))}")
print(f"The 10 most common verbs: {verb_freq.most_common(10)}")

Total number of words: 6732
Unique words: 1534
The 10 most common words: [('world', 119), ('state', 104), ('people', 102), ('year', 82), ('together', 78), ('president', 74), ('today', 74), ('region', 67), ('one', 60), ('democracy', 60)]

Total number of verbs: 2910
Unique verbs: 689
The 10 most common verbs: [('go', 83), ('make', 75), ('work', 59), ('know', 58), ('stand', 54), ('continue', 48), ('thank', 46), ('take', 42), ('say', 40), ('want', 34)]


### **Descriptive statistics and centralities**

TO BE DONE