# Podcast transcriber and summarizer

This notebook is split into two parts.
1. **Podcast transcription** - The first part of the notebook converts the speech of a podcast to text and outputs the result.
2. **Summarizer** - The second part of the notebook condenses the long-form podcast transcript into a summary that captures the important details of the podcast.
---

### Import relevant packages

Two packages are required for this notebook to run properly: the OpenAI package and the pydub package. Each needs some extra setup to work properly, so details are provided below:

* **OpenAI** - The openai package provides access to the API for the Whisper and ChatGPT models that power the transcriber and the summarizer. Two steps are needed to get it working:
    - **Installing the openai package** - Run `pip install openai` from the terminal or command line. The code below uses `openai.Audio.transcribe` and `openai.ChatCompletion.create`, which were removed in version 1.0 of the package, so an earlier release is needed (for example, `pip install "openai<1.0"`).
    - **Generating an API key** - If you have not generated an API key yet, or have lost track of yours, you can manage your keys at https://platform.openai.com/account/api-keys. The instructions for generating a key are straightforward.
* **Pydub** - Pydub (https://github.com/jiaaro/pydub) is a package designed for audio manipulation in Python. To quote the GitHub page, "Pydub lets you do stuff to audio in a way that isn't stupid". Pydub takes a little more work to install, so instructions are listed at this link (https://github.com/jiaaro/pydub#installation) and below as well.
    - **Installing the pydub package** - Run `pip install pydub` from the terminal or command line.
    - **Installing dependencies (ffmpeg)** - On its own, pydub can only read wav files, so an extra program, ffmpeg, must be installed to handle mp3 files. This link (https://phoenixnap.com/kb/ffmpeg-windows) walks through the download and setup on Windows. Make sure to add FFmpeg to the PATH environment variable on your local machine, as described in the link.
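
Before loading any audio, it can help to confirm that the ffmpeg binaries are actually visible on the PATH. The snippet below is a minimal sketch using only the standard library. The helper name `find_binaries` is made up for this example and is not part of pydub.

```python
import shutil


def find_binaries(names=("ffmpeg", "ffprobe")):
    """Map each binary name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report where each binary was found, if at all
for name, path in find_binaries().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either binary is reported as not found, the explicit paths set in the import cell need to point at the install location instead.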

In [253]:
#Import relevant packages

import openai 
API_KEY = ''    #use instructions above to get this key


from pydub import AudioSegment

#Specify the paths to the ffmpeg binaries that pydub uses to read mp3 files. Use the correct paths for your machine
AudioSegment.converter = "C:\\ffmpeg\\bin\\ffmpeg.exe"
AudioSegment.ffmpeg = "C:\\ffmpeg\\bin\\ffmpeg.exe"
AudioSegment.ffprobe = "C:\\ffmpeg\\bin\\ffprobe.exe"
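
Rather than pasting the key into the notebook, one option is to read it from an environment variable so it is not saved with the file. This is a minimal sketch, assuming the key has been stored in a variable named `OPENAI_API_KEY`. Both that variable name and the helper `load_api_key` are choices made for this example.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in env_var, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook")
    return key
```

With this in place, `API_KEY = load_api_key()` could replace the empty string in the cell above.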

### 1. Podcast transcription

The following cell uses the Pydub library to:
1. Create an AudioSegment object from the raw mp3
2. Chunk the AudioSegment object into 10 minute pieces so that each file stays under the upload size limit of the Whisper API (25 MB at the time of writing)
3. Write the chunks to new mp3 files
4. Use the Whisper API to transcribe the chunks
5. Combine transcriptions into one long text
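
The chunking in step 2 comes down to computing millisecond offsets. The sketch below shows that arithmetic in isolation, without pydub. The function `chunk_bounds` and the 25-minute duration are hypothetical and used only for illustration.

```python
def chunk_bounds(total_ms, chunk_ms):
    """Return (start, end) millisecond offsets that cover total_ms in chunk_ms pieces."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


# A hypothetical 25-minute recording split into 10-minute chunks
bounds = chunk_bounds(25 * 60 * 1000, 10 * 60 * 1000)
print(bounds)  # [(0, 600000), (600000, 1200000), (1200000, 1500000)]
```

Each pair can be used directly as a slice on an AudioSegment, since pydub indexes audio in milliseconds. The last chunk is simply shorter than the others.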

In [65]:
# 1. Load the podcast audio

podcast_file_path = "NPR_podcast.mp3"                    #use your file path here
podcast_audio = AudioSegment.from_mp3(podcast_file_path)

In [10]:
#2. Chunk audio into segments

segment_duration = 10*60*1000                                # 10 minute chunks on a millisecond basis
num_segments = -(-len(podcast_audio) // segment_duration)    # Ceiling division, so an exact multiple does not add an empty chunk

#3. Loop for extracting chunks from the AudioSegment object and re-writing each chunk as an mp3 file
for i in range(num_segments):
    segment = podcast_audio[i*segment_duration:(i+1)*segment_duration]
    segment_file_path = podcast_file_path[:-4] + ' segment ' + str(i) + '.mp3'
    segment.export(segment_file_path, format='mp3')

In [66]:
# 4. Use the Whisper API to transcribe the audio segments

transcripts = []
for i in range(num_segments):
    audio_file_path = podcast_file_path[:-4] + ' segment ' + str(i) + '.mp3'
    # Open the segment in a context manager so the file handle is closed after the upload
    with open(audio_file_path, 'rb') as audio_file:
        # This is the Whisper API. Any extra arguments can be added here
        audio_response = openai.Audio.transcribe(
            api_key=API_KEY,
            model='whisper-1',
            file=audio_file,
        )
    transcripts.append(audio_response['text'])

In [259]:
#5. Combine into one long text and print the full transcript
full_transcript = " ".join(transcripts)
num_words = len(full_transcript.split(" "))
print(f"The number of words in this podcast is {num_words}")
print(full_transcript)

The number of words in this podcast is 8113


### 2. Summarization

The following cells use the ChatGPT API to summarize the transcript generated in the last section. The difficulty is that the transcript is often longer than the model's context window. The method used below sets a "window size", which is how much text is summarized at once. The window then "slides" over the text to summarize it chunk by chunk. If the slide size equals the window size, there is no overlap. If the slide size is less than the window size, the summarized chunks overlap in content. Window and slide size are both measured in words.

The summarizer creates a bullet point list for each window and then combines them into one final list.
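
The effect of the window and slide sizes is easiest to see on a tiny input. The sketch below is a standalone illustration of the sliding window, not the exact loop used in the next cell: the helper `sliding_windows` is hypothetical, and it stops as soon as a window reaches the end of the text.

```python
def sliding_windows(text, window_size, slide_size):
    """Split text into word windows of window_size, advancing slide_size words each step."""
    if window_size <= 0 or slide_size <= 0:
        raise ValueError("window_size and slide_size must be positive")
    words = text.split()
    windows = []
    start = 0
    while start < len(words):
        windows.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reaches the end of the text
        start += slide_size
    return windows


sample = "a b c d e f g"
print(sliding_windows(sample, window_size=3, slide_size=3))  # ['a b c', 'd e f', 'g']
print(sliding_windows(sample, window_size=4, slide_size=2))  # ['a b c d', 'c d e f', 'e f g']
```

In the second call the slide is smaller than the window, so neighboring windows share two words. That overlap gives each summary some context from the previous chunk, at the cost of more tokens overall.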

In [232]:
#Summarizing each segment over sliding window

window_size = 1000   # 1000 word summarization window
slide_size = 1000   # 1000 word slide
num_windows = -(-len(full_transcript.split(" ")) // slide_size)   # Ceiling division, so an exact multiple does not add an empty window

summaries = []

for i in range(num_windows):
    segment_text = " ".join(full_transcript.split(" ")[i*slide_size:i*slide_size+window_size])
    initial_prompt = 'Please give me a detailed synopsis of the following text. Bullet point format must be used. Do not summarize over important details. Retain important names, places, and dates. Leave them as they were in the text. It is okay to be verbose or wordy in your synopsis to capture as much context as possible. I should be able to read the synopsis and have nearly the same understanding as if someone read the entire text. Please only respond to me with the bullet point synopsis. I don\'t want any other commentary from you. \n' + segment_text
    second_prompt = 'Evaluate the last synopsis. Does it capture all the important details from the original text? Please re-write the synopsis so that all important details are captured. Keep the structure in bullet point format. No need to explain rationale to me. I only want the rewritten synopsis as a response.'
    
    summarize_response = openai.ChatCompletion.create(
        api_key=API_KEY,
        model='gpt-3.5-turbo',
        messages=[
            {'role': 'user',
            'content': initial_prompt}
        ]
    )

    # The section below allows the model to "reflect" on its summary. Uncomment it if you'd like this functionality

    # chat_history = [
    #     {'role': 'user',
    #     'content': initial_prompt},
    #     {'role': 'assistant',
    #     'content': summarize_response.choices[0].message.content},
    #     {'role': 'user',
    #     'content': second_prompt}
    # ]

    # summarize_response = openai.ChatCompletion.create(
    #     api_key=API_KEY,
    #     model='gpt-3.5-turbo',
    #     messages=chat_history
    # )
    
    print(f"For segment {i+1}, {summarize_response.usage.prompt_tokens} prompt tokens were used and {summarize_response.usage.completion_tokens} tokens were used in the response for a total token count of {summarize_response.usage.total_tokens}.")
    summaries.append(summarize_response.choices[0].message.content)

For segment 1, 1357 prompt tokens were used, 422 tokens were used in the response for a total token count of 1779.
For segment 2, 1348 prompt tokens were used, 352 tokens were used in the response for a total token count of 1700.
For segment 3, 1366 prompt tokens were used, 192 tokens were used in the response for a total token count of 1558.
For segment 4, 1386 prompt tokens were used, 162 tokens were used in the response for a total token count of 1548.
For segment 5, 1359 prompt tokens were used, 222 tokens were used in the response for a total token count of 1581.
For segment 6, 1299 prompt tokens were used, 224 tokens were used in the response for a total token count of 1523.
For segment 7, 1284 prompt tokens were used, 271 tokens were used in the response for a total token count of 1555.
For segment 8, 1328 prompt tokens were used, 233 tokens were used in the response for a total token count of 1561.
For segment 9, 267 prompt tokens were used, 117 tokens were used in the response

In [234]:
# Look at the total summaries across chunks
total_response = "\n".join(summaries)
print(total_response)

1770
- NPR's new series, Taking Cover, is introduced by a White Lies co-host, Andy Grace, and co-hosted by Graham Smith, a producer for the first season of White Lies.
- The series uncovers a cover-up of a friendly fire incident in Iraq in the spring of 2004, which NPR correspondent Tom Bowman approached Smith to help dig into.
- Camp Pendleton in Southern California, the west coast home of the United States Marine Corps is described.
- Hornow Ridge, a sharp hill covered with scrub trees and bushes that overlooks the Pacific Ocean on the edge of the camp is described.
- Hornow Ridge has become a place of pilgrimage where Marines climb with memorials to honor their dead fallen. Scott Radetzky, a retired chaplain, helped get Hilltop Memorial started in the spring of 2003 to commemorate fallen marine in Iraq.
- The hilltop memorial consisted of carrying telephone poles to the top of the ridge and bolting them together as a cross.
- The idea was that the pain and suffering of carrying the 

In [250]:
# Combine the summaries into a hierarchical, organized list.
combine_response = openai.ChatCompletion.create(
    api_key=API_KEY,
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user',
        'content': 'Please combine all the bullet points in the following list into an organized, hierarchical list that makes sense. Retain names, places, and dates. The bullets are aggregated from summaries over multiple sections of a full article and are organized in the order that the article was written. The organization of your list should take into context all of the bullet points in the list I provide. \n\n' + total_response}
    ]
)

In [252]:
# Print the combined summary
print(combine_response.choices[0].message.content)

I. Introduction 
- NPR's new podcast series, Taking Cover, is co-hosted by Andy Grace and Graham Smith
- The series will uncover a cover-up of a friendly fire incident in Iraq in 2004
 
II. Setting the Stage
- Camp Pendleton in Southern California, the west coast home of the United States Marine Corps is described
- Hornow Ridge is described as a place of pilgrimage where Marines climb with memorials to honor their fallen
 
III. Investigating the Incident
- The story is about mistakes, faulty assumptions, miscalculations, and lies that were covered up to hide the truth of what happened to the Marines
- The series includes interviews with Marines wounded that day and families of those killed, and explores the frustration and sadness of those still left without answers
- The journalist teams up with their colleague, Graham, to investigate the incident
- The Marines initially prepared for peaceful interaction with Fallujah locals
- The Marines were led by General James Mattis, who later s