# Update Dataset

This notebook checks whether new episodes have been released and adds them to the dataset. It performs the following steps:
* Check for new episodes on website
* Download mp3 files and scrape episode information
* Process mp3 with whisper
* Add all relevant updates to the main dataframe

We also check for episodes whose titles changed after publication, since a renamed episode no longer matches the title stored in the dataset.

In [27]:
# base
import re
import os
import pandas as pd

# webscraper
from selenium import webdriver
from selenium.webdriver.common.by import By
import urllib.request

# multiprocessing
import multiprocessing as mp

# slow the scraper down a little
import time


# Load data file

In [28]:
data = pd.read_pickle("../extract_data/data.pickle")


In [29]:
data.sort_values("date", ascending=False)


Unnamed: 0,titles,sources,date,duration,episode,mp3_path,txt_path
0,#326 - De Wet van Faraday,https://anchor.fm/s/21c734c4/podcast/play/6221...,2022-12-15,00:39:03,326,../data/audio/#326 - De Wet van Faraday.mp3,../data/text/file:#326 - De Wet van Faraday.mp...
1,#325 - Bevelhebber van Europa,https://anchor.fm/s/21c734c4/podcast/play/6209...,2022-12-13,00:40:52,325,../data/audio/#325 - Bevelhebber van Europa.mp3,../data/text/file:#325 - Bevelhebber van Europ...
102,#225 - Bevelhebber van Europa,https://anchor.fm/s/21c734c4/podcast/play/6209...,2022-12-13,00:40:52,225,../data/audio/#225 - Bevelhebber van Europa.mp3,../data/text/file:#225 - Bevelhebber van Europ...
2,#324 - Coup,https://anchor.fm/s/21c734c4/podcast/play/6201...,2022-12-11,00:40:56,324,../data/audio/#324 - Coup.mp3,../data/text/file:#324 - Coup.mp3.txt
3,#323 - Uitgeschakeld!,https://anchor.fm/s/21c734c4/podcast/play/6196...,2022-12-10,00:24:52,323,../data/audio/#323 - Uitgeschakeld!.mp3,../data/text/file:#323 - Uitgeschakeld!.mp3.txt
...,...,...,...,...,...,...,...
322,#4 - Maarten van Rossem drinkt elke nacht bier,https://anchor.fm/s/21c734c4/podcast/play/1945...,2020-09-11,00:51:17,4,../data/audio/#4 - Maarten van Rossem drinkt e...,../data/text/file:#4 - Maarten van Rossem drin...
325,#1 - De terminale arrogantie van Wopke Hoekstra,https://anchor.fm/s/21c734c4/podcast/play/1485...,2020-06-07,00:25:14,1,../data/audio/#1 - De terminale arrogantie van...,../data/text/file:#1 - De terminale arrogantie...
324,#2 - Heerlijk! Een zomer zonder festivals,https://anchor.fm/s/21c734c4/podcast/play/1485...,2020-06-07,00:29:11,2,../data/audio/#2 - Heerlijk! Een zomer zonder ...,../data/text/file:#2 - Heerlijk! Een zomer zon...
323,#3 - Je moet strakke dames van middelbare leef...,https://anchor.fm/s/21c734c4/podcast/play/1485...,2020-06-07,00:32:10,3,../data/audio/#3 - Je moet strakke dames van m...,../data/text/file:#3 - Je moet strakke dames v...
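
The output above lists #325 and #225 with the same date and duration, which may indicate a duplicated row. A minimal sketch of how such rows could be flagged is given below; the frame `sample` is a toy stand-in with values copied from the output, and treating `date` plus `duration` as a duplicate key is an assumption, not a rule taken from the dataset.

```python
import pandas as pd

# Toy frame with values copied from the output above.
sample = pd.DataFrame(
    {
        "episode": [326, 325, 225, 324],
        "date": pd.to_datetime(
            ["2022-12-15", "2022-12-13", "2022-12-13", "2022-12-11"]
        ),
        "duration": ["00:39:03", "00:40:52", "00:40:52", "00:40:56"],
    }
)

# keep=False marks every member of a duplicate group, not only the later ones.
suspects = sample[sample.duplicated(subset=["date", "duration"], keep=False)]
print(suspects["episode"].tolist())
```

Rows flagged this way still need a manual look, since two different episodes could in principle share a date and a duration.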


# Check for new episodes

In [30]:
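# Selenium 4 takes the driver path via a Service object (executable_path was removed in 4.10).
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("../dependencies/chromedriver"))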
url = "https://podcastluisteren.nl/pod/Maarten-van-Rossem-De-Podcast"
driver.get(url)
time.sleep(5)  # wait for the browser to fully open and load the website


In [31]:
# Retrieve the relevant html elements.
elements = driver.find_elements(By.XPATH, "//h4[@class='mt-1 text-left']")


In [32]:
# Get the titles from the html elements.
titles = [element.text for element in elements]
titles = [title.replace("/", "-") for title in titles]
data_web = pd.DataFrame()
data_web["titles"] = titles


In [33]:
# Get the date and duration from the html page.
date_duration = driver.find_elements(By.XPATH, "//h4[@class='text-left mb-4']")
date_duration = [element.text for element in date_duration]
data_web["date_and_duration"] = date_duration
temp = data_web["date_and_duration"].str.split("|", n=1, expand=True)
data_web["date"] = temp[0]
data_web["duration"] = temp[1]
data_web = data_web.drop(columns="date_and_duration")
data_web["date"] = pd.to_datetime(data_web["date"])


In [34]:
# data_web["sources"] = sources
episode = data_web["titles"].str.findall(r"(?:#)(\d+)").str[0]
data_web["episode"] = episode
data_web["episode"] = data_web["episode"].fillna(-9999)
data_web["episode"] = data_web["episode"].astype(int)
data_web["mp3_path"] = data_web["titles"].transform(
    lambda title: f"../data/audio/{title}.mp3"
)
data_web["txt_path"] = data_web["titles"].transform(
    lambda title: f"../data/text/file:{title}.mp3.txt"
)
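
As a hedged aside, the episode number could also be pulled out with `str.extract`, which leaves a missing value for titles without a `#<number>` prefix. The sketch below uses two titles from the dataset plus one made-up title without a number; the `-9999` sentinel mirrors the cell above.

```python
import pandas as pd

# Two titles from the dataset plus one hypothetical title without a number.
sample_titles = pd.Series(
    [
        "#326 - De Wet van Faraday",
        "#4 - Maarten van Rossem drinkt elke nacht bier",
        "Bonus episode",
    ]
)

# expand=False returns a Series; titles without a match become missing values.
extracted = sample_titles.str.extract(r"#(\d+)", expand=False)

# Convert to numbers first, then apply the same sentinel as the cell above.
episode_numbers = pd.to_numeric(extracted).fillna(-9999).astype(int)
print(episode_numbers.tolist())
```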


In [35]:
data_web = data_web[
    (data_web.date >= data.date.max()) & ~data_web.titles.isin(data.titles)
]


In [36]:
# Check which episode titles have been updated.
# Currently no action is taken; changed titles could be updated in the main DataFrame.
updated_titles = data_web[
    data_web.date.isin(data.date)
    & data_web.duration.isin(data.duration)
    & data_web.episode.isin(data.episode)
]
updated_titles


Unnamed: 0,titles,date,duration,episode,mp3_path,txt_path
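
No action is taken on changed titles here. A minimal sketch of one possible update is shown below, assuming the episode number is a stable key between the two frames. The names `main_df` and `renamed` and the titles in them are hypothetical toy data; the `mp3_path` and `txt_path` columns and the files on disk would also need renaming, which this sketch does not cover.

```python
import pandas as pd

# Hypothetical stand-ins for the main DataFrame and the scraped, renamed titles.
main_df = pd.DataFrame(
    {"episode": [325, 324], "titles": ["#325 - Old title", "#324 - Coup"]}
)
renamed = pd.DataFrame({"episode": [325], "titles": ["#325 - New title"]})

# Map episode number -> new title, then overwrite only the matching rows.
new_titles = renamed.set_index("episode")["titles"]
main_df["titles"] = main_df["episode"].map(new_titles).fillna(main_df["titles"])
print(main_df)
```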


In [37]:
data_web = data_web.drop(updated_titles.index, axis=0)


# Download mp3 files and scrape episode information

In [38]:
elements = [
    element
    for element in elements
    if element.text.replace("/", "-") in data_web.titles.values
]
buttons = [element.find_element(By.XPATH, "../div/button") for element in elements]


In [39]:
def find_audio_path(button, audio_element):
    """
    Start playing the audio file and retrieve its src attribute.

    The src attribute only becomes available once playback has started,
    so the play button is clicked to expose it.

    Parameters
    ----------
    button : selenium.webdriver.remote.webelement.WebElement
        A play button on the website.
    audio_element : selenium.webdriver.remote.webelement.WebElement
        The element containing the audio src.

    Returns
    -------
    src: str
        The link to the audiofile.
    """
    # Start stream of episode
    button.click()
    # Pause the stream, as we only need it loaded
    button.click()
    time.sleep(0.01)

    src = audio_element.get_attribute("src")
    return src


In [40]:
audio = driver.find_element(By.XPATH, "//audio")
buttons = [element.find_element(By.XPATH, "../div/button") for element in elements]
sources = [find_audio_path(button, audio) for button in buttons]


In [41]:
# Close the selenium browser and end the WebDriver session.
driver.quit()


In [42]:
use_cores = mp.cpu_count()


In [43]:
def download_mp3(source, title):
    """
    Download the audiofile from the source.
    The episode title is used for naming the file.

    Parameters
    ----------
    source : str
        Link to the audiofile.
    title : str
        Title of the episode.
    """
    path = f"../data/audio/{title}.mp3"
    if os.path.exists(path):
        return
    urllib.request.urlretrieve(source, path)
    time.sleep(2)


In [44]:
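# Pair each source with the title of its new episode (both follow the page order);
# the full `titles` list also contains episodes that are already in the dataset.
with mp.Pool(use_cores) as pool:
    result = pool.starmap(download_mp3, zip(sources, data_web["titles"]))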
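
Because downloading is I/O-bound, a thread pool is a possible alternative to a process pool. The sketch below is one way this could look, not the method used in this notebook; `fetch`, `fetch_all`, `target_dir`, and `max_workers` are hypothetical names introduced for illustration.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(source, path):
    """Download `source` to `path` unless the file already exists."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(source, path)
    return path


def fetch_all(sources, titles, target_dir, max_workers=4):
    """Download every (source, title) pair into `target_dir` using threads."""
    paths = [os.path.join(target_dir, f"{title}.mp3") for title in titles]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map preserves input order and re-raises any download error.
        return list(executor.map(fetch, sources, paths))
```

In this notebook it would be called as `fetch_all(sources, data_web["titles"], "../data/audio")`. Threads also avoid the need to pickle the download function, which can be a problem for functions defined inside a notebook.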


# Process mp3 with whisper


In [45]:
data_web


Unnamed: 0,titles,date,duration,episode,mp3_path,txt_path
0,#327 - ChatGPT,2022-12-17,00:36:16,327,../data/audio/#327 - ChatGPT.mp3,../data/text/file:#327 - ChatGPT.mp3.txt


In [46]:
with open("update_mp3.txt", "w") as f:
    for mp3 in data_web["mp3_path"]:
        path = mp3.split("audio/")[1]
        f.write(f"{path}\n")


In [47]:
! ../speech_to_text/transcribe_update.sh

[00:00.000 --> 00:08.600]  Ik voelde mij toch wel enigszins gecorrigeerd, want je denkt ook, nou ja, kijk, je kunt in Nederland aarmoeiers zijn, dat begrijp ik ook best.
[00:08.600 --> 00:18.280]  En dat blijkt ook nu met die verhoogde brandstofkosten, plotseling miljoenen mensen in de problemen geraken, omdat ze dat eigenlijk niet kunnen betalen.
[00:18.280 --> 00:28.280]  Dat vind ik ook al typisch in een van de rijkste EU-landen die er zijn, maar we hebben dat kennelijk ook niet echt goed op orde.
[00:28.280 --> 00:38.160]  Kortom, misschien is het dan wel heel nuttig om ook nog eens weer eens goed te gaan kijken naar de onderkant van de verzorgingstaat.
[00:38.160 --> 00:40.560]  Functioneert dat eigenlijk nog wel?
[00:40.560 --> 00:42.360]  Met Verlossum, kunt u mij horen?
[00:47.200 --> 00:51.600]  ChatGPT, dat is de trend van de laatste tijd, moeten we het vandaag even over hebben.
[00:51.600 --> 00:58.200]  Het is een artificial intelligence tool waarmee je in no time eigenlijk

# Add all relevant updates to the main dataframe

In [None]:
data_web["sources"] = sources
data = pd.concat([data, data_web])


In [None]:
data = data.sort_values("episode", ascending=False).reset_index(drop=True)


In [None]:
data.to_pickle("../extract_data/data.pickle")
