# Capstone Project: Podcast Profiler

This is the hands-on session for the capstone project of the workshop.

Copyright (c) 2023 Michael Neumayr

## Setup

### 0. Set up the Colab in your drive

- Load this Colab from GitHub
- Run the first cell to install all required packages (this takes a moment)
- While the packages install, jump to the section "3. OpenAI API key" and replace "PUT_YOUR_KEY_HERE" with the key we provide

### 1. Required Python packages

In [None]:
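# install required packages; this may take a few minutes. Dependency warnings can be ignored.
# openai is pinned below 1.0 because this notebook uses openai.Audio.transcribe, which was removed in openai 1.0
%pip install "openai<1"
# assumption: a pre-0.1 langchain release matches the import paths used in this notebook
%pip install "langchain<0.1"
%pip install pypdf
%pip install tiktoken
%pip install pydub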

### 2. Clone the workshop GitHub repository

In [None]:
!git clone https://github.com/michaelnoi/venture_labs_build.git

In [None]:
%cd venture_labs_build
!git checkout only_static_files

### 3. OpenAI API key

In [None]:
import os

openai_api_key = os.getenv('OPENAI_API_KEY', 'PUT_YOUR_KEY_HERE')
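
A forgotten placeholder only shows up later as an authentication error, so it can help to check the key right away. The following is a minimal sketch; the helper name `check_api_key` is hypothetical and not part of the workshop code.

```python
PLACEHOLDER = "PUT_YOUR_KEY_HERE"  # placeholder string used in this notebook


def check_api_key(key, placeholder=PLACEHOLDER):
    """Return the stripped key, or raise if it is empty or still the placeholder.

    Hypothetical helper for illustration only.
    """
    if not key or key.strip() == placeholder:
        raise ValueError(
            "No OpenAI API key set: replace the placeholder or set OPENAI_API_KEY."
        )
    return key.strip()


# Possible use in the notebook (not executed here):
#   openai_api_key = check_api_key(openai_api_key)
```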

### 4. Optional: Connect to your Google Drive storage to upload your own documents later

In [None]:
# connect to your google drive storage
from google.colab import drive

drive.mount('/content/drive')

## Capstone Project: Podcast Profiler

In [None]:
import openai
# the LLM wrapper lives in langchain.llms
from langchain.llms import OpenAI

### 1. Load the podcast episodes as audio files

We give you three episodes of Venture Labs' [Entrepreneurial Realities](https://open.spotify.com/show/1r0vpcIZomm9W7ozp3qWLV) podcast. Each episode is titled with the name of its guest. Below you see the episodes and their .mp3 file sizes, which will matter when we transcribe them.

<img src="./static/podcast_episode.png" width="700"/>


Let's load an episode into the Colab. Since our transcription API has an upload limit of 25 MB, let's load the first one that stays below this limit:

In [None]:
# the "rb" option is important to (r)ead the file in (b)inary format
audio_file_alexa = open("./static/Alexa Sinyachova.mp3", "rb")
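
The 25 MB limit can also be checked in code before uploading. This is a small sketch using only the standard library. The helper names are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes and that a file must stay strictly below the limit.

```python
import glob
import os

UPLOAD_LIMIT_MB = 25  # upload limit stated in this notebook


def file_size_mb(path):
    """Return the file size in megabytes (assumption: 1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def fits_upload_limit(path, limit_mb=UPLOAD_LIMIT_MB):
    """Return True if the file is strictly below the upload limit."""
    return file_size_mb(path) < limit_mb


# report every .mp3 in the static/ folder (prints nothing if the folder is missing)
for path in sorted(glob.glob("./static/*.mp3")):
    status = "ok" if fits_upload_limit(path) else "too large"
    print(f"{path}: {file_size_mb(path):.1f} MB ({status})")
```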

### 2. Transcribe the podcast episodes

In [None]:
# for transcription we use the Audio part of the openai library
# running this may take a while since we upload and process about half an hour of audio
episode_alexa = openai.Audio.transcribe("whisper-1", audio_file_alexa, api_key=openai_api_key)
episode_alexa = episode_alexa["text"]

# close the file again so that we free up system resources
audio_file_alexa.close()


Let's see the result:

In [None]:
print(episode_alexa)

# load an LLM just to see how many tokens are in the transcript
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

print("Number of tokens:", llm.get_num_tokens(episode_alexa))

Now we have the transcription of the first episode. Note that the transcript exceeds the token limit of the language models we used to summarize text, so we will need to split it up later.
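
If no tokenizer is at hand, the token count can be roughly estimated from the character count. The sketch below uses the rule of thumb from section 4 of this notebook (8000 characters ~ 2000 tokens, about 4 characters per token). The function names and the `token_limit` parameter are hypothetical, and the result is only an approximation of what `llm.get_num_tokens` reports.

```python
CHARS_PER_TOKEN = 4  # rule of thumb from this notebook: 8000 characters ~ 2000 tokens


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count by ceiling division of the character count."""
    return -(-len(text) // chars_per_token)


def needs_splitting(text, token_limit):
    """Return True if the estimated token count exceeds the given limit.

    token_limit is whatever context size your model allows (not fixed here).
    """
    return estimate_tokens(text) > token_limit


# Possible use in the notebook (not executed here):
#   print("Estimated tokens:", estimate_tokens(episode_alexa))
```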

<div class="alert" style="background-color:#151E35; color:#A450E6">
    🎯 <b>TODO</b>
  <p>Transcribe the episode with <b>Tom Eisenmann</b> on your own now and store the result in the variable <code style="color:#A450E6">episode_tom</code>! Don't transcribe the episode with <b>Michael Eckhardt</b> yet, as it would exceed the upload limit of the Whisper API.</p>
</div>

In [None]:
### TODO: open the episode
audio_file_tom = ...

In [None]:
### TODO: do the transcription
episode_tom = ...
episode_tom = episode_tom["text"]


# close the file again so that we free up system resources
audio_file_tom.close()
print("Number of tokens:", llm.get_num_tokens(episode_tom))

### 3. Handle an audio file that is too large

Great! For the episode with Michael Eckhardt, we need a different solution since the upload limit of the Whisper API is 25 MB and the episode is almost 30 MB.

Let's split the episode into two parts and transcribe them separately, then combine the two strings to get the full episode. Let's see how this works:

In [None]:
# we will use this package to segment our audio file into two parts and store them separately
from pydub import AudioSegment

episode_michael = AudioSegment.from_mp3("./static/Michael Eckhardt.mp3")

# PyDub handles time in milliseconds
fifteen_minutes = 15 * 60 * 1000

first_part_episode_michael = episode_michael[:fifteen_minutes]
second_part_episode_michael = episode_michael[fifteen_minutes:]

# save both parts of the episode into the static/ folder
first_part_episode_michael.export("./static/Michael_Eckhardt_1.mp3", format="mp3")
second_part_episode_michael.export("./static/Michael_Eckhardt_2.mp3", format="mp3")
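
A longer episode might need more than two parts. The sketch below generalizes the split by computing the millisecond boundaries of equally long segments. The helper `segment_bounds` is hypothetical; with pydub, `len()` of an `AudioSegment` gives its duration in milliseconds and slicing works as in the cell above.

```python
def segment_bounds(total_ms, segment_ms):
    """Return (start, end) millisecond pairs that cover [0, total_ms).

    Every segment is segment_ms long except possibly the last one.
    """
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


# Possible use with pydub (not executed here):
#   audio = AudioSegment.from_mp3("./static/Michael Eckhardt.mp3")
#   for i, (start, end) in enumerate(segment_bounds(len(audio), fifteen_minutes), start=1):
#       audio[start:end].export(f"./static/Michael_Eckhardt_{i}.mp3", format="mp3")
```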

<div class="alert" style="background-color:#151E35; color:#A450E6">
    🎯 <b>TODO</b>
  <p>OK, now transcribe both parts and concatenate the transcriptions to get the full episode.</p>
</div>

In [None]:
### TODO: load both parts of the episode
audio_first_part = ...
audio_second_part = ...

### TODO: do the transcription
first_part_episode_michael = ...
second_part_episode_michael = ...

# close the files again so that we free up system resources
audio_first_part.close()
audio_second_part.close()

In [None]:
# combine the two parts and see how many tokens we have this time
episode_michael = first_part_episode_michael["text"] + second_part_episode_michael["text"]

print("Number of tokens:", llm.get_num_tokens(episode_michael))

### 4. Split the transcribed podcast episodes into chunks

For the TODOs from here on, refer back to the text_summarization.ipynb notebook.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

<div class="alert" style="background-color:#151E35; color:#A450E6">
    🎯 <b>TODO</b>
  <p>Split all three episodes into chunks of 8000 characters (~ 2000 tokens ~ 1500 words) with an overlap of 500 characters.</p>
  <p>Go back to the summarization notebook if you need to.</p>
</div>
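
To see what chunk size and overlap mean, here is a simplified plain-Python sketch that cuts a string at fixed character offsets. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at separators such as paragraphs and sentences), and the function `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, overlap):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share `overlap` characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# tiny demonstration with chunk_size=4 and overlap=1
print(chunk_text("abcdefghij", 4, 1))  # → ['abcd', 'defg', 'ghij']
```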


In [None]:
### TODO: initialize the text splitter
text_splitter = ...

In [None]:
### TODO: split each episode string into chunks with the splitter
chunks_alexa = text_splitter.create_documents([...])
chunks_tom = ...
chunks_michael = ...

In [None]:
print(f"Alexa's episode has {len(chunks_alexa)} chunks.")
print(f"Tom's episode has {len(chunks_tom)} chunks.")
print(f"Michael's episode has {len(chunks_michael)} chunks.")


### 5. Summarize the podcast episodes

In [None]:
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# define our llm for summarization
chat = ChatOpenAI(openai_api_key=openai_api_key)

<div class="alert" style="background-color:#151E35; color:#A450E6">
    🎯 <b>TODO</b>
  <p>Now summarize all three chunked episodes with the map_reduce technique, using the pre-defined chain as in the summarization hands-on.</p>
</div>
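
As a reminder of the idea behind map_reduce: each chunk is summarized on its own (map), and the partial summaries are then combined into one final summary (reduce). The sketch below shows only this control flow in plain Python. The two stand-in functions replace the LLM calls and are hypothetical, and this is not the LangChain chain itself.

```python
def map_reduce_summarize(chunks, summarize_chunk, combine):
    """Summarize every chunk (map), then combine the partial summaries (reduce)."""
    partial_summaries = [summarize_chunk(chunk) for chunk in chunks]
    return combine(partial_summaries)


def first_sentence(text):
    """Stand-in for an LLM call: keep only the first sentence of a chunk."""
    return text.split(".")[0].strip() + "."


def join_summaries(summaries):
    """Stand-in for the reduce step: join the partial summaries."""
    return " ".join(summaries)


demo_chunks = ["A one. A two.", "B one. B two."]
print(map_reduce_summarize(demo_chunks, first_sentence, join_summaries))
# → A one. B one.
```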

In [None]:
### TODO: initialize the summarization chain as in the summarization hands-on
summarizer = ...

In [None]:
### TODO: summarize each episode; remember that the chunks come from .create_documents(), which needs the text in a list as input: [text]
summary_alexa = ...
summary_tom = ...
summary_michael = ...

In [None]:
# Remember to get the output_text. It's sometimes helpful to strip the text of leading and trailing whitespace.
print(summary_alexa["output_text"].strip())
print(summary_tom["output_text"].strip())
print(summary_michael["output_text"].strip())

### 6. Make a cool podcast profile

<div class="alert" style="background-color:#151E35; color:#A450E6">
    🎯 <b>TODO</b>
  <p>Design some smart prompts that extract more information (featured speakers, topic, tags, mood, etc.) from the summary or from the first chunk (the beginning of the podcast), then use this data to build a cool podcast profile.</p>
</div>
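
One possible starting point is a prompt that asks for every field on its own line, so that the answer is easy to turn into a dictionary. The sketch below is only an illustration: the field list comes from the TODO above, while the prompt wording and the helper names `build_profile_prompt` and `parse_profile` are assumptions. A real model may not follow the requested format exactly.

```python
PROFILE_FIELDS = ["featured speakers", "topic", "tags", "mood"]  # from the TODO above


def build_profile_prompt(summary, fields=PROFILE_FIELDS):
    """Build a prompt that asks for one 'item: value' line per profile field."""
    field_lines = "\n".join(f"- {field}" for field in fields)
    return (
        "You are given the summary of a podcast episode.\n"
        "Extract the following items and answer with one line per item "
        "in the form 'item: value'.\n"
        f"{field_lines}\n\n"
        f"Summary:\n{summary}"
    )


def parse_profile(answer):
    """Turn 'item: value' lines into a dict; lines without a colon are skipped."""
    profile = {}
    for line in answer.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            profile[key.strip().lower()] = value.strip()
    return profile


# Possible use in the notebook (not executed here):
#   answer = chat([HumanMessage(content=build_profile_prompt(summary_text))]).content
#   print(parse_profile(answer))
```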


In [None]:
### TODO: make interesting prompts for the podcast episode profile

