# Quality Check

In this notebook, we check for episodes whose mp3 download failed. During the analysis, I noticed that some downloaded audio files were incomplete. Detecting these episodes automatically lets us correct the error quickly. We compare the runtime of each file on our filesystem with the runtime listed on the website; a small discrepancy of up to 2 seconds is allowed.

In [None]:
import os
import pandas as pd
import ffmpeg
import multiprocessing as mp
import urllib.request
import time


In [None]:
# Tolerance, expressed in minutes because all durations below are in minutes.
dt = 1 / 60 * 2  # a runtime difference of up to 2 seconds is accepted


# Load data file

In [None]:
data = pd.read_pickle("../extract_data/data.pickle")


In [None]:
data


# Extract file length from .mp3

In [None]:
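# Probe every file with ffprobe (through ffmpeg-python) and convert the
# reported duration from seconds to minutes.
data["duration-m-file"] = data["mp3_path"].map(
    lambda f: float(ffmpeg.probe(f)["format"]["duration"]) / 60
)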


In [None]:
data["duration-s"] = pd.to_timedelta(data["duration"]).dt.total_seconds()
data["duration-m"] = data["duration-s"] / 60.0
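
The conversion above can be illustrated on a couple of made-up values. This is only a sketch: it assumes the `duration` column holds strings in `HH:MM:SS` form, which the notebook does not state, and the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical website runtimes; the real column format is an assumption.
durations = pd.Series(["00:45:30", "01:02:03"])

# Parse the strings as timedeltas, then express them in seconds and minutes.
seconds = pd.to_timedelta(durations).dt.total_seconds()
minutes = seconds / 60.0

print(seconds.tolist())  # [2730.0, 3723.0]
```

Keeping both the website runtime and the probed runtime in minutes is what allows them to be subtracted directly in the next step.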


# Get difference in podcast length and downloaded length

To find the files that were not downloaded correctly, we compare the length displayed on the website with the length measured for the downloaded files.

In [None]:
data["length-diff-m"] = (data["duration-m-file"] - data["duration-m"]).abs()


In [None]:
data_incomplete = data[data["length-diff-m"] > dt]
data_incomplete
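
To make the tolerance check concrete, here is a minimal sketch on a toy table. The three rows and their values are invented; only the column names and the 2-second tolerance mirror the notebook.

```python
import pandas as pd

tolerance_m = 2 / 60  # 2 seconds expressed in minutes

# Invented example rows: one complete file, one truncated download,
# and one file that differs only by a rounding-sized amount.
toy = pd.DataFrame(
    {
        "titles": ["complete", "truncated", "rounding-only"],
        "duration-m": [45.5, 45.5, 45.5],        # runtime listed on the website
        "duration-m-file": [45.5, 30.0, 45.49],  # runtime probed from the file
    }
)

# Absolute difference in minutes, then keep only rows beyond the tolerance.
toy["length-diff-m"] = (toy["duration-m-file"] - toy["duration-m"]).abs()
flagged = toy[toy["length-diff-m"] > tolerance_m]

print(flagged["titles"].tolist())  # ['truncated']
```

The 0.01-minute difference (0.6 seconds) stays below the tolerance, so only the truncated file is flagged.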


# Download incomplete files again

In [None]:
use_cores = mp.cpu_count()


In [None]:
def download_mp3(source, title):
    """
    Download the audio file from the source.
    The episode title is used to name the file.

    Parameters
    ----------
    source : str
        Link to the audio file.
    title : str
        Title of the episode.
    """
    path = f"../data/audio/{title}.mp3"
    urllib.request.urlretrieve(source, path)


In [None]:
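# Download in parallel; the context manager terminates the pool once
# starmap has returned, so no worker processes are left running.
with mp.Pool(use_cores) as pool:
    result = pool.starmap(
        download_mp3, zip(data_incomplete["sources"], data_incomplete["titles"])
    )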
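
Since the whole point of this notebook is to recover from partial downloads, one possible hardening is sketched below. It is not the approach used above: it swaps the process pool for a thread pool (downloads are I/O bound) and writes to a temporary file that is renamed only after `urlretrieve` returns. The names `download_to` and `download_all`, the `.part` suffix, and the `max_workers` default are all hypothetical.

```python
import os
import tempfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download_to(source, path):
    """Download `source` to `path` via a temporary file, then rename it."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    os.close(fd)
    try:
        # urlretrieve raises ContentTooShortError when fewer bytes arrive
        # than the Content-Length header announced.
        urllib.request.urlretrieve(source, tmp_path)
        os.replace(tmp_path, path)  # only a finished download reaches `path`
    finally:
        # Clean up the temporary file if the download or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return path


def download_all(jobs, max_workers=8):
    """Run download_to for every (source, path) pair using threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda job: download_to(*job), jobs))
```

With this sketch, an interrupted download raises an exception instead of leaving a shortened file under the final name, so the failure is visible at download time rather than during a later runtime comparison.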


In [None]:
data_incomplete.to_pickle("incomplete_mp3_files.pickle")


The paths of the incomplete audio files are also written to a plain .txt file, one path per line. When `transcribe_incomplete.sh` is called, it reads this .txt file to determine which files need to be processed again by `whisper`.

In [None]:
with open("incomplete_mp3.txt", "w") as f:
    for mp3 in data_incomplete["mp3_path"]:
        path = mp3.split("audio/")[1]
        f.write(f"{path}\n")
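
The list format can be checked with a small round trip. The sketch below is an illustration only: `write_path_list` and `read_path_list` are hypothetical helpers, and the example paths are invented. It uses `str.partition`, which keeps everything after the first `audio/` and makes a missing marker an explicit error instead of an `IndexError`.

```python
import os
import tempfile


def write_path_list(paths, list_file, marker="audio/"):
    """Write the part of each path after the first `marker`, one per line."""
    with open(list_file, "w") as f:
        for p in paths:
            _, found, relative = p.partition(marker)
            if not found:
                raise ValueError(f"{p!r} does not contain {marker!r}")
            f.write(f"{relative}\n")


def read_path_list(list_file):
    """Read the list back, skipping empty lines."""
    with open(list_file) as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Round trip with invented paths in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    list_file = os.path.join(tmp, "incomplete_mp3.txt")
    write_path_list(["../data/audio/ep1.mp3", "../data/audio/ep2.mp3"], list_file)
    entries = read_path_list(list_file)

print(entries)  # ['ep1.mp3', 'ep2.mp3']
```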
