## Setup

Kaggle offers 20 hours of free [TPU v3-8](https://www.kaggle.com/product-feedback/129828) usage per month, and this notebook needs a TPU to run. That is more than enough to transcribe a podcast. We also install and import the other dependencies.

In [None]:
import jax
jax.devices()

In [None]:
!pip install -U pip

In [None]:
!pip install --quiet git+https://github.com/sanchit-gandhi/whisper-jax.git

In [None]:
from whisper_jax import FlaxWhisperPipline
import jax.numpy as jnp

pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)
#pipeline = FlaxWhisperPipline("openai/whisper-medium", dtype=jnp.bfloat16, batch_size=16)

In [None]:
from jax.experimental.compilation_cache import compilation_cache as cc

cc.initialize_cache("./jax_cache")

In [None]:
!pip install pytube

Test whether PyTube works correctly.

In [None]:
from pytube import YouTube

video_url = 'https://www.youtube.com/watch?v=lT-YD_cm_SA'
ex = True
while ex:
    try:
        yt = YouTube(video_url)
        vid_title = yt.title
        ex = False
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        continue

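# Sorting by 'abr' (audio bitrate) in ascending order and taking the first stream keeps the download small.
audio = yt.streams.filter(only_audio=True).order_by('abr').first().download(filename="audio.mp4")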
print(vid_title)
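
The loop above retries forever, which is fine for a transient hiccup but never ends if the video is permanently unavailable. A minimal sketch of a bounded alternative follows; the helper name `retry` and its parameters `max_attempts` and `delay` are made up for illustration, and `flaky_fetch` only stands in for a network call.

```python
import time

def retry(fn, max_attempts=5, delay=1.0):
    """Call fn() until it succeeds, giving up after max_attempts tries."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:  # not a bare except, so Ctrl-C still interrupts
            last_error = e
            print(f"attempt {attempt} failed: {e}")
            if attempt < max_attempts:
                time.sleep(delay)  # short pause before the next try
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Stand-in for a flaky network call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "video title"

title = retry(flaky_fetch, delay=0)
print(title)
```

In this notebook the helper could wrap the PyTube call, for example `retry(lambda: YouTube(video_url).title)`.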

In [None]:
import re
import os
import pandas as pd
from pytube import Playlist
from tqdm import tqdm

Finally, install ffmpeg, which is needed to decode the .mp4 audio files.

In [None]:
!apt-get update && apt-get install -y ffmpeg

## First model run

As detailed in the [Whisper-JAX](https://github.com/sanchit-gandhi/whisper-jax) repo, the `pmap` function has to be compiled the first time it is called.

> You can expect compilation to take ~2 minutes on a TPU v3-8 with a batch size of 16. Enough time to grab a coffee ☕️

So let's do that. 

In [None]:
# JIT compile the forward call - slow, but we only do this once
%time text = pipeline(audio)

## Transcribe a YouTube podcast

We transcribe [Brains and Gains by Dr. Dave Maconi](https://www.youtube.com/channel/UCW-PI9YMJ6SXPiqXy2FYfLg), a podcast about all things natural bodybuilding, from its YouTube channel.



In [None]:
!mkdir -p BrainsAndGains

PyTube often fails intermittently when fetching a video, so each video is retried in a `while` loop until it succeeds.

In [None]:
def zfill_alternative(x,l=2): return x if len(x) >= l else '0'*(l-len(x))+x

c = Playlist('https://www.youtube.com/playlist?list=PL1WQINeRM1zeBpBpB-tnTDXIG5t4Wbbtf') # Brains

dl_dir = 'BrainsAndGains'

# Titles to skip. Nothing is added to this list yet; failed videos are simply retried.
failsforsomereason = []

arr = os.listdir('/kaggle/working/'+dl_dir)

for url in tqdm(c.video_urls):
    notdone = True
    while notdone:
        try:
            video_url = url
            yt = YouTube(video_url)
            
            # Get the title and discard if already done.
            vid_title = yt.title

            if (vid_title in [x[:-3] for x in arr] or 
                re.sub(r'[^A-Za-z0-9 ]+', '', vid_title) in [x[:-3] for x in arr] or 
                vid_title in failsforsomereason):
                notdone = False
                continue
            
            # Download audio file. 
            audio_file = yt.streams.filter(only_audio=True).order_by('abr').first().download(filename="audio.mp4")
            print(vid_title)

            # Transcribe. 
            transcription = pipeline(audio_file,  task="transcribe", language="en", return_timestamps=True)

            # Take the 'chunks' and save as timestamp-text table. Finally save to .md
            df = pd.DataFrame(transcription['chunks'], columns=['timestamp', 'text'])

            df[['start','end']] = pd.DataFrame(df['timestamp'].tolist(), index=df.index)
            df['minute'] = (df['start']//60).astype(int)
            df['seconds'] = (df['start']-df.minute*60).astype(int)
            df['time'] = '('+df.minute.astype(str).apply(zfill_alternative)+':'+df.seconds.astype(str).apply(zfill_alternative)+')'

            dfgb = df.groupby(df.minute,as_index=False).text.agg(''.join)
            dfmg = pd.merge(df,dfgb,on='minute')

            dffirst = dfmg[['minute','time','text_y']].groupby(dfmg.minute,as_index=False).first()
            dffirst = dffirst.rename(columns={'text_y':'text'})
            dffirst[['time','text']].to_markdown( dl_dir+'/'+re.sub(r'[^A-Za-z0-9 ]+', '', vid_title)
                                                +'.md',index=False)
            
            print('Done')
            notdone = False
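        except Exception as e:
            # Report the error so that repeated failures are visible, then retry.
            print(f'Retrying {url}: {e}')
            continue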
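
The pandas steps in the loop above collapse the transcription chunks into one row per minute: each row is labelled with the `(MM:SS)` start time of the first chunk in that minute and holds the joined text of all chunks starting in that minute. The same idea in plain Python, as a sketch: `chunks_to_minutes` and the example chunks are hypothetical, and the chunk layout (a `timestamp` tuple plus `text`) is assumed to match what the pipeline returned above.

```python
from itertools import groupby

def chunks_to_minutes(chunks):
    """Collapse chunks into one (time_label, text) row per minute."""
    rows = []
    # Chunks are assumed to be in chronological order.
    for minute, group in groupby(chunks, key=lambda c: int(c["timestamp"][0] // 60)):
        group = list(group)
        start = group[0]["timestamp"][0]      # start time of the first chunk in this minute
        seconds = int(start - minute * 60)
        label = f"({minute:02d}:{seconds:02d})"
        rows.append((label, "".join(c["text"] for c in group)))
    return rows


# Made-up chunks for illustration only.
example_chunks = [
    {"timestamp": (0.0, 4.5), "text": " Welcome to the show."},
    {"timestamp": (4.5, 58.0), "text": " Today we talk about training."},
    {"timestamp": (61.2, 70.0), "text": " First question."},
]

for label, text in chunks_to_minutes(example_chunks):
    print(label, text)
```

Grouping by minute keeps the Markdown table short while still giving a timestamp that is close enough to find the passage in the audio.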

In [None]:
!cd BrainsAndGains && tar -zcvf BrainsAndGains.tar.gz *.md

In [None]:
!tar -zcvf BrainsAndGains.tar.gz BrainsAndGains/*.md

Display a clickable link to the tar file. 

In [None]:
from IPython.display import FileLink
FileLink(r'BrainsAndGains.tar.gz')

## Downloading from RSS feed

The second part looks at downloading from an RSS link. 

The podcast in question is [Where Optimal Meets Practical](https://podcasts.apple.com/us/podcast/where-optimal-meets-practical/id1518859017) by Jordan Lips, also about natural bodybuilding (weird!). 


In [None]:
import feedparser
import urllib.request

NewsFeed = feedparser.parse("https://media.rss.com/whereoptimalmeetspractical/feed.xml")
entry = NewsFeed.entries[2]

print(entry.keys())
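
The download loop in the next cell takes the audio URL from `entry['links'][1]`, which assumes the audio file is always the second link of an entry. A slightly more defensive sketch is shown here; it assumes the parsed entry marks the audio link with `rel == 'enclosure'` or an `audio/...` MIME type, and the helper name `find_audio_url` and the example entry are made up for illustration.

```python
def find_audio_url(entry):
    """Return the href of the first link that looks like an audio enclosure, or None."""
    for link in entry.get("links", []):
        is_enclosure = link.get("rel") == "enclosure"
        is_audio = link.get("type", "").startswith("audio/")
        if is_enclosure or is_audio:
            return link.get("href")
    return None


# Made-up entry with the same dictionary shape as a parsed feed entry.
example_entry = {
    "title": "Episode 1",
    "links": [
        {"rel": "alternate", "type": "text/html", "href": "https://example.com/ep1"},
        {"rel": "enclosure", "type": "audio/mpeg", "href": "https://example.com/ep1.mp3"},
    ],
}

print(find_audio_url(example_entry))
```

Returning `None` instead of raising lets the caller decide whether to skip an entry that has no audio attached.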


In [None]:
def zfill_alternative(x,l=2): return x if len(x) >= l else '0'*(l-len(x))+x


#WOMP
NewsFeed = feedparser.parse("https://media.rss.com/whereoptimalmeetspractical/feed.xml")

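dl_dir = 'womp'
# The output directory must exist before it can be listed or written to.
os.makedirs('/kaggle/working/'+dl_dir, exist_ok=True)

# Local file that each episode is downloaded to before transcription.
audio_file = 'audio.mp4'

failsforsomereason = []

arr = os.listdir('/kaggle/working/'+dl_dir)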

for entry in tqdm(NewsFeed.entries):
    notdone = True
    while notdone:
        try:
            vid_title = entry['title']


            if (vid_title in [x[:-3] for x in arr] or 
                re.sub(r'[^A-Za-z0-9 ]+', '', vid_title) in [x[:-3] for x in arr] or 
                vid_title in failsforsomereason):
                notdone = False
                continue

            opener = urllib.request.build_opener()
            opener.addheaders = [('User-agent', 'Mozilla/5.0')]
            urllib.request.install_opener(opener)
            urllib.request.urlretrieve(entry['links'][1]['href'], audio_file)
            print(vid_title)



            transcription = pipeline(audio_file,  task="transcribe", language="en", return_timestamps=True)

            df = pd.DataFrame(transcription['chunks'], columns=['timestamp', 'text'])

            df[['start','end']] = pd.DataFrame(df['timestamp'].tolist(), index=df.index)


            df['minute'] = (df['start']//60).astype(int)
            df['seconds'] = (df['start']-df.minute*60).astype(int)

            df['time'] = '('+df.minute.astype(str).apply(zfill_alternative)+':'+df.seconds.astype(str).apply(zfill_alternative)+')'

            dfgb = df.groupby(df.minute,as_index=False).text.agg(''.join)
            dfmg = pd.merge(df,dfgb,on='minute')

            dffirst = dfmg[['minute','time','text_y']].groupby(dfmg.minute,as_index=False).first()
            dffirst = dffirst.rename(columns={'text_y':'text'})

            dffirst[['time','text']].to_markdown( dl_dir+'/'+re.sub(r'[^A-Za-z0-9 ]+', '', vid_title)
                                                +'.md',index=False)
            print('Done')
            notdone = False
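        except Exception as e:
            # Report the error so that repeated failures are visible, then retry.
            print(f'Retrying {vid_title}: {e}')
            continue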
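
Both loops skip an episode when a transcript named after its title already exists in the output directory, comparing the raw title and a sanitized title against the file names with `.md` stripped. A small sketch of that check in isolation; the helper names `safe_name` and `already_done` and the example file names are made up for illustration, and `os.path.splitext` is used here in place of the `x[:-3]` slice.

```python
import os
import re

def safe_name(title):
    """Drop every character that is not a letter, a digit, or a space."""
    return re.sub(r'[^A-Za-z0-9 ]+', '', title)

def already_done(title, existing_files):
    """True if a transcript named after the title is among existing_files."""
    stems = {os.path.splitext(name)[0] for name in existing_files}
    return title in stems or safe_name(title) in stems


# Made-up directory listing for illustration only.
files = ["Ep 12 Training QA.md", "Intro.md"]

print(safe_name("Ep. 12: Training Q&A"))
print(already_done("Ep. 12: Training Q&A", files))
print(already_done("Ep. 13: Nutrition", files))
```

Because the directory listing is read once before the loop starts, this check only catches transcripts from earlier runs, which is what makes the notebook resumable after a session timeout.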

In [None]:
!tar -zcvf womp.tar.gz womp/*.md

In [None]:
from IPython.display import FileLink
FileLink(r'womp.tar.gz')