# Update Dataset

This notebook checks whether new episodes have been published and adds them to the dataset. It performs the following steps:
* Check for new episodes on website
* Download mp3 files and scrape episode information
* Process mp3 with whisper
* Add all relevant updates to the main dataframe

As a secondary check, we also look for episodes whose titles have changed since the last update, because a renamed episode would otherwise be mistaken for a new one.

In [8]:
#base
import re
import os
import pandas as pd

#webscraper
from selenium import webdriver
from selenium.webdriver.common.by import By
import urllib.request

#multiprocessing
import multiprocessing as mp

#slow the scraper down a little
import time

# Load data file

In [9]:
data = pd.read_pickle("../extract_data/data.pickle")

In [10]:
data.sort_values("date", ascending=False)

Unnamed: 0,titles,sources,date,duration,episode,mp3_path,txt_path
0,#311 - Maartens wens voor hij doodgaat,https://anchor.fm/s/21c734c4/podcast/play/6088...,2022-11-19,00:37:04,311,../data/audio/#311 - Maartens wens voor hij do...,../data/text (copy)/file:#311 - Maartens wens ...
1,#310 - Aanval op Kremlin,https://anchor.fm/s/21c734c4/podcast/play/6076...,2022-11-17,00:31:36,310,../data/audio/#310 - Aanval op Kremlin.mp3,../data/text (copy)/file:#310 - Aanval op Krem...
2,Maarten en Tom op RTL 4,https://anchor.fm/s/21c734c4/podcast/play/6069...,2022-11-16,00:09:31,-9999,../data/audio/Maarten en Tom op RTL 4.mp3,../data/text (copy)/file:Maarten en Tom op RTL...
3,#309 - VVD-implosie,https://anchor.fm/s/21c734c4/podcast/play/6064...,2022-11-15,00:36:56,309,../data/audio/#309 - VVD-implosie.mp3,../data/text (copy)/file:#309 - VVD-implosie.m...
4,#308 - Trump gaat eraan!,https://anchor.fm/s/21c734c4/podcast/play/6055...,2022-11-13,00:39:50,308,../data/audio/#308 - Trump gaat eraan!.mp3,../data/text (copy)/file:#308 - Trump gaat era...
...,...,...,...,...,...,...,...
320,#4 - Maarten van Rossem drinkt elke nacht bier,https://anchor.fm/s/21c734c4/podcast/play/1945...,2020-09-11,00:51:17,4,../data/audio/#4 - Maarten van Rossem drinkt e...,../data/text (copy)/file:#4 - Maarten van Ross...
321,#3 - Je moet strakke dames van middelbare leef...,https://anchor.fm/s/21c734c4/podcast/play/1485...,2020-06-07,00:32:10,3,../data/audio/#3 - Je moet strakke dames van m...,../data/text (copy)/file:#3 - Je moet strakke ...
322,#2 - Heerlijk! Een zomer zonder festivals,https://anchor.fm/s/21c734c4/podcast/play/1485...,2020-06-07,00:29:11,2,../data/audio/#2 - Heerlijk! Een zomer zonder ...,../data/text (copy)/file:#2 - Heerlijk! Een zo...
323,#1 - De terminale arrogantie van Wopke Hoekstra,https://anchor.fm/s/21c734c4/podcast/play/1485...,2020-06-07,00:25:14,1,../data/audio/#1 - De terminale arrogantie van...,../data/text (copy)/file:#1 - De terminale arr...


# Check for new episodes

In [11]:
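# Selenium 4 takes the driver path through a Service object (executable_path was removed in 4.10)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("../dependencies/chromedriver"))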
url = "https://podcastluisteren.nl/pod/Maarten-van-Rossem-De-Podcast"
driver.get(url)

In [59]:
elements = driver.find_elements(By.XPATH, "//h4[@class='mt-1 text-left']")

In [43]:
titles = [element.text for element in elements]
titles = [title.replace("/", "-") for title in titles]
data_web = pd.DataFrame()
data_web["titles"] = titles

In [44]:
date_duration = driver.find_elements(By.XPATH, "//h4[@class='text-left mb-4']")
date_duration = [element.text for element in date_duration]
data_web["date_and_duration"] = date_duration
temp = data_web["date_and_duration"].str.split("|", n = 1, expand = True)
data_web["date"] = temp[0]
data_web["duration"] = temp[1]
data_web = data_web.drop(columns="date_and_duration")
data_web["date"] = pd.to_datetime(data_web['date'])
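
If the scraped text has spaces around the `|` separator, the split above leaves that whitespace in the `duration` column. The sketch below shows the same split with an explicit `str.strip()`. The sample strings are an assumed format for illustration only; the exact text served by the website is not shown in this notebook.

```python
import pandas as pd

# Assumed "date | duration" strings; the real page text may differ
raw = pd.Series(["2022-11-19 | 00:37:04", "2022-11-17 | 00:31:36"])

# Split once on the separator, then remove surrounding whitespace from each part
parts = raw.str.split("|", n=1, expand=True)
date = pd.to_datetime(parts[0].str.strip())
duration = parts[1].str.strip()

print(date.dt.strftime("%Y-%m-%d").tolist())
print(duration.tolist())
```

Whether stripping is desirable here depends on how `duration` was stored in `data`, since the later title check compares the two columns as strings.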

In [45]:
# data_web["sources"] = sources
# Raw string so that \d is not treated as an (invalid) string escape
episode = data_web["titles"].str.findall(r"#(\d+)").str[0]
data_web["episode"] = episode
data_web["episode"] = data_web["episode"].fillna(-9999)
data_web["episode"] = data_web["episode"].astype(int)
data_web["mp3_path"] = data_web["titles"].transform(lambda title: f"../data/audio/{title}.mp3")
data_web["txt_path"] = data_web["titles"].transform(lambda title: f"../data/text/file:{title}.mp3.txt")
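
The episode-number logic can be checked on a couple of titles in isolation. The sketch below swaps in `str.extract` and `pd.to_numeric` for the `findall(...).str[0]` call, which should give the same result on titles like those above. `extract_episode` is a hypothetical helper name, and the sample titles are taken from the table earlier in this notebook.

```python
import pandas as pd

def extract_episode(titles: pd.Series, missing: int = -9999) -> pd.Series:
    """Return the number after the first '#' in each title, or `missing`."""
    # expand=False yields a Series holding the single capture group (NaN when absent)
    numbers = titles.str.extract(r"#(\d+)", expand=False)
    # Convert to numbers first so the sentinel can be filled in without mixing types
    return pd.to_numeric(numbers).fillna(missing).astype(int)

sample = pd.Series(["#310 - Aanval op Kremlin", "Maarten en Tom op RTL 4"])
print(extract_episode(sample).tolist())
```

Specials without a `#` number, such as "Maarten en Tom op RTL 4", receive the sentinel `-9999`, matching the value already present in `data`.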

In [46]:
data_web = data_web[(data_web.date >= data.date.max()) & ~data_web.titles.isin(data.titles)]

In [47]:
updated_titles = data_web[data_web.date.isin(data.date) & data_web.duration.isin(data.duration) & data_web.episode.isin(data.episode)]
updated_titles

Unnamed: 0,titles,date,duration,episode,mp3_path,txt_path
10,#311 - Kon-Tiki expeditie,2022-11-19,00:37:04,311,../data/audio/#311 - Kon-Tiki expeditie.mp3,../data_web/text (copy)/file:#311 - Kon-Tiki e...
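
Each `isin` call in the filter above tests its column independently, so a row could be flagged when its date, duration, and episode number each occur somewhere in `data` but not on the same row. A stricter check, sketched below on toy frames, joins on all three columns at once. `find_renamed`, `KEYS`, and the `_new`/`_old` suffixes are illustrative choices rather than part of the notebook.

```python
import pandas as pd

KEYS = ["date", "duration", "episode"]

def find_renamed(known: pd.DataFrame, scraped: pd.DataFrame) -> pd.DataFrame:
    """Rows that match a known episode on every key column but differ in title."""
    merged = scraped.merge(known[KEYS + ["titles"]], on=KEYS, suffixes=("_new", "_old"))
    return merged[merged["titles_new"] != merged["titles_old"]]

known = pd.DataFrame({
    "titles": ["#311 - Maartens wens voor hij doodgaat", "#310 - Aanval op Kremlin"],
    "date": pd.to_datetime(["2022-11-19", "2022-11-17"]),
    "duration": ["00:37:04", "00:31:36"],
    "episode": [311, 310],
})
scraped = pd.DataFrame({
    "titles": ["#312 - Matthijs van Nieuwkerk", "#311 - Kon-Tiki expeditie"],
    "date": pd.to_datetime(["2022-11-20", "2022-11-19"]),
    "duration": ["00:32:32", "00:37:04"],
    "episode": [312, 311],
})

renamed = find_renamed(known, scraped)
print(renamed[["titles_old", "titles_new"]])
```

The old and new titles end up side by side, which would also make it possible to update the title in `data` instead of only dropping the row.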


In [54]:
data_web = data_web.drop(updated_titles.index, axis=0)

# Download mp3 files and scrape episode information

In [79]:
elements = [element for element in elements if element.text.replace("/", "-") in data_web.titles.values]
buttons = [element.find_element(By.XPATH, "../div/button") for element in elements]

In [80]:
def find_audio_path(button, audio_element):
    # Start stream of episode
    button.click()
    # Pause the stream, as we only need it loaded
    button.click()
    time.sleep(0.01)
    
    src = audio_element.get_attribute("src")
    return src
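
A fixed `time.sleep(0.01)` may return before the player has set `src`. One possible alternative is to poll the attribute until it is non-empty or a timeout expires. The helper below is only a sketch: `wait_for_attribute` is a hypothetical name, and it assumes nothing more than an object with a `get_attribute` method, which Selenium's `WebElement` provides. `FakeAudio` stands in for the real element so that the example runs without a browser.

```python
import time

def wait_for_attribute(element, name, timeout=5.0, interval=0.1):
    """Poll element.get_attribute(name) until it is truthy or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        value = element.get_attribute(name)
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"attribute {name!r} still empty after {timeout}s")
        time.sleep(interval)

class FakeAudio:
    """Stand-in element whose src only appears on the third lookup."""
    def __init__(self):
        self.calls = 0

    def get_attribute(self, name):
        self.calls += 1
        return "https://example.com/episode.mp3" if self.calls >= 3 else ""

audio_stub = FakeAudio()
print(wait_for_attribute(audio_stub, "src", timeout=1.0, interval=0.01))
```

In this notebook the `<audio>` element is shared between episodes, so waiting until `src` differs from the previous value would be a stricter condition than waiting for any non-empty value.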

In [81]:
audio = driver.find_element(By.XPATH, "//audio")
sources = [find_audio_path(button, audio) for button in buttons]

In [82]:
use_cores = mp.cpu_count()

In [83]:
def download_mp3(source, title):
    path = f"../data/audio/{title}.mp3"
    if os.path.exists(path):
        return
    urllib.request.urlretrieve(source, path)
    time.sleep(2)

In [84]:
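# Pair each source with the title of the same new episode; `titles` still holds every scraped title
with mp.Pool(use_cores) as pool:
    result = pool.starmap(download_mp3, zip(sources, data_web["titles"]))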
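
Downloads spend most of their time waiting on the network, so a thread pool may work as well as a process pool here. The sketch below is an alternative under that assumption, not the notebook's method. `download_all` and `fake_fetch` are hypothetical names, and the fetch function is passed in so that the example runs without network access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def download_all(pairs, target_dir, fetch, max_workers=4):
    """Fetch every (source, title) pair into target_dir, skipping existing files."""
    def job(pair):
        source, title = pair
        path = os.path.join(target_dir, f"{title}.mp3")
        if not os.path.exists(path):
            fetch(source, path)
        return path

    # pool.map keeps the results in the same order as the input pairs
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(job, pairs))

def fake_fetch(source, path):
    # Placeholder for a real downloader: writes the URL into the target file
    with open(path, "w") as f:
        f.write(source)

with tempfile.TemporaryDirectory() as tmp:
    pairs = [("https://example.com/a.mp3", "#1 - a"), ("https://example.com/b.mp3", "#2 - b")]
    paths = download_all(pairs, tmp, fake_fetch)
    print(sorted(os.listdir(tmp)))
```

Passing `urllib.request.urlretrieve` as `fetch` would give behaviour close to `download_mp3`, without the two-second pause after each file.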

# Process mp3 with whisper


In [85]:
data_web

Unnamed: 0,titles,date,duration,episode,mp3_path,txt_path
0,#321 - Energietransitie,2022-12-06,00:34:40,321,../data/audio/#321 - Energietransitie.mp3,../data_web/text (copy)/file:#321 - Energietra...
1,#320 - Vichy-regime,2022-12-04,00:33:42,320,../data/audio/#320 - Vichy-regime.mp3,../data_web/text (copy)/file:#320 - Vichy-regi...
2,#319 - Radioactieve straling,2022-12-03,00:36:12,319,../data/audio/#319 - Radioactieve straling.mp3,../data_web/text (copy)/file:#319 - Radioactie...
3,#318 - Grote loser,2022-12-01,00:37:57,318,../data/audio/#318 - Grote loser.mp3,../data_web/text (copy)/file:#318 - Grote lose...
4,#317 - Slavernij gelul,2022-11-29,00:35:34,317,../data/audio/#317 - Slavernij gelul.mp3,../data_web/text (copy)/file:#317 - Slavernij ...
5,#316 - Eureka!,2022-11-27,00:45:24,316,../data/audio/#316 - Eureka!.mp3,../data_web/text (copy)/file:#316 - Eureka!.mp...
6,#315 - Totale halvegaren,2022-11-26,00:41:09,315,../data/audio/#315 - Totale halvegaren.mp3,../data_web/text (copy)/file:#315 - Totale hal...
7,#314 - Invasie van Taiwan,2022-11-24,00:48:39,314,../data/audio/#314 - Invasie van Taiwan.mp3,../data_web/text (copy)/file:#314 - Invasie va...
8,#313 - Trumps ondergang,2022-11-22,00:40:52,313,../data/audio/#313 - Trumps ondergang.mp3,../data_web/text (copy)/file:#313 - Trumps ond...
9,#312 - Matthijs van Nieuwkerk,2022-11-20,00:32:32,312,../data/audio/#312 - Matthijs van Nieuwkerk.mp3,../data_web/text (copy)/file:#312 - Matthijs v...


In [86]:
with open("update_mp3.txt","w") as f:
    for mp3 in data_web["mp3_path"]:
        path = mp3.split("audio/")[1]
        f.write(f"{path}\n")

In [94]:
! ../speech_to_text/transcribe_update.sh

100%|█████████████████████████████████████| 2.87G/2.87G [04:19<00:00, 11.9MiB/s]
[00:00.000 --> 00:05.000]  Ik had nog een leuk ideetje. In januari bestaat de VVD 75 jaar, hè?
[00:05.000 --> 00:10.000]  Zo, zo lang, met zoveel opgrutswerken. Sorry.
[00:11.000 --> 00:13.000]  Met Verlossum, kunt u mij horen?
[00:17.000 --> 00:22.000]  Het is me opgevallen in de actualiteit, eigenlijk het water.
[00:22.000 --> 00:29.000]  De kwaliteitskrant had een hoofdredactioneel waarin ook aan het water gerefereerd wordt.
[00:29.000 --> 00:34.000]  En aan het feit dat de overheid zonder dat eigenlijk heel erg duidelijk te stellen...
[00:34.000 --> 00:41.000]  maar ja, zo nu en dan, en als je alle onderdeeltjes bij elkaar veegt en naar elkaar klikt...
[00:41.000 --> 00:44.000]  een soort van lego-setje zal ik maar zeggen...
[00:44.000 --> 00:49.000]  dan heeft de overheid besloten om toch West-Nederland op de been te houden...
[00:49.000 --> 00:58.000]  ondanks de stijging van de zeespiegel en ondanks

# Add all relevant updates to the main dataframe

In [105]:
data_web["sources"] = sources
# pd.concat expects a sequence of frames as its first argument
data = pd.concat([data, data_web])



In [111]:
data = data.sort_values("episode", ascending=False).reset_index(drop=True)

In [119]:
data.txt_path = data.txt_path.str.replace("text (copy)", "text", regex=False)
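
The `regex=False` argument matters for this replacement because the parentheses in `text (copy)` would otherwise be read as a regex group. The short sketch below, on a made-up path, shows the difference.

```python
import pandas as pd

# Made-up path in the same shape as the txt_path values above
paths = pd.Series(["../data/text (copy)/file:#1 - example.mp3.txt"])

# Literal replacement: the directory name is found and shortened
literal = paths.str.replace("text (copy)", "text", regex=False)

# Regex replacement: "(copy)" is a group, so the pattern looks for "text copy"
as_regex = paths.str.replace("text (copy)", "text", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

With `regex=True` the pattern does not match the literal parentheses, so the path is left unchanged.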

In [124]:
data.to_pickle("../extract_data/data.pickle")