## Summarize audio with natural language processing and automatic speech recognition

### Import necessary libraries

In [1]:
!pip install -U openai-whisper

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
import os
import re
import csv
import torch
from typing import List
import pandas as pd
import whisper
from transformers import AutoTokenizer, pipeline

### Set options

In [3]:
# The size of the ASR model to use
ASR_MODEL_SIZE = "small.en"

# The maximum length of each bullet point (in tokens)
SUMMARY_LENGTH = 128

# Set device to GPU if available, otherwise use CPU
DEVICE = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

# Play around with other models at https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
NLP_ARCH = 'facebook/bart-large-cnn'
NLP_TOKENIZER = AutoTokenizer.from_pretrained(NLP_ARCH)
NLP_MAXLEN = NLP_TOKENIZER.model_max_length - 5

### Download the video and extract audio


In [5]:
# !curl "https://pytorch-workshops.s3.amazonaws.com/videos/Allen+Newell%2C+%EF%BC%82Desires+and+Diversions%EF%BC%82+%5B405060633%5D.mp4" -o newell_lecture.mp4
# !curl "https://pytorch-workshops.s3.amazonaws.com/videos/Herb+Simon%2C+%EF%BC%82Intelligence-+Artificial+and+Natural%EF%BC%82+Part+1+%5B395461289%5D.mp4" -o simon_lecture.mp4
# !for name in *.mp4; do ffmpeg -i "$name" -vn -acodec libmp3lame -ab 128k "${name%.*}/audio.mp3"; done

#### or just download the pre-extracted audio :)

In [6]:
# create the destination directory first
!mkdir newell_lecture
# download to it
!curl "https://pytorch-workshops.s3.amazonaws.com/videos/audio/Allen+Newell%2C+%EF%BC%82Desires+and+Diversions%EF%BC%82+%5B405060633%5D.mp3" -o newell_lecture/audio.mp3
# !mkdir simon_lecture
# !curl "https://pytorch-workshops.s3.amazonaws.com/videos/audio/Herb+Simon%2C+%EF%BC%82Intelligence-+Artificial+and+Natural%EF%BC%82+Part+1+%5B395461289%5D.mp3" -o simon_lecture/audio.mp3

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 89.2M  100 89.2M    0     0  22.3M      0  0:00:03  0:00:03 --:--:-- 22.3M


### Choose a lecture

In [7]:
audio_file = "newell_lecture/audio.mp3"

### Load the Whisper ASR Model

In [8]:
asr_model = whisper.load_model(ASR_MODEL_SIZE).to(DEVICE)

### Transcribe speech in the audio

In [9]:
# EXERCISE:
# Write a function to 
# - transcribe an audio file using an ASR model 
# - save the resulting transcription as a CSV file
# - return the file path of the resulting transcript.
#
# You may use the helper function `save_transcript_to_csv` to save Whisper's output as a csv file


def save_transcript_to_csv(asr_result, file_path):
    """
    Save transcription to a CSV file
    
    Parameters:
        asr_result: The transcription data to save.
        file_path (str): The file path of the CSV file.
    """
    field_names = ['start', 'end', 'text', 'summary']
    with open(file_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=field_names)
        writer.writeheader()
        for entry in asr_result:
            writer.writerow({k:entry[k] for k in field_names if k in entry.keys()})


def transcribe_audio(audio_path: str, asr_model):
    """
    Transcribe an audio file using the provided ASR model.
    
    Parameters:
        audio_path (str): The file path of the audio file.
        asr_model: The ASR model to use for transcription.
    
    Returns:
        str: The file path of the transcript.
    """
    dir_name = os.path.dirname(audio_path)
    transcript_path = f"{dir_name}/transcript.csv"

    # Save the audio transcript as a csv file at the transcript filepath
    if not os.path.exists(transcript_path):        
        with torch.no_grad():
            asr_result = asr_model.transcribe(audio_path)['segments']
        save_transcript_to_csv(asr_result, transcript_path)
        
    return transcript_path    

In [10]:
# this takes a while... 
transcript_path = transcribe_audio(audio_file, asr_model)

### Preview the transcript

In [11]:
transcript_df = pd.read_csv(transcript_path, header=0)
transcript_df.head()

Unnamed: 0,start,end,text,summary
0,0.0,23.0,So I could just hear people talking coming up...,
1,23.0,31.0,"When you get to be two to the sixth years, yo...",
2,31.0,33.0,so you've got to talk about the past.,
3,33.0,37.0,"But that's not true, it turns out.",
4,37.0,41.0,"And I could have talked to you about SOAR, al...",


### Generate summary of transcription

We have the raw transcript in `transcript.csv`. The csv file has 4 columns: start, end, text, summary. So far the first three columns have been populated by the ASR model. Let's populate the summaries now.

One approach might be to generate a summary for each row. But as you can see, each row corresponds to around 7 seconds of the lecture, which also includes silence. So some rows have barely any words at all - not a good candidate for summarization!

NLP models have a maximum input length they can accept; in case of our model it is 1024. We'll iterate over the `text` column and chunk it up into segments containing 1019 tokens or less. (we use a buffer of 5 tokens to prevent inadvertent overflows).


In [5]:
def get_tokens(tokenizer, input_text):
    return tokenizer(input_text, add_special_tokens=False)['input_ids']


def generate_timestamped_segments(asr_df: pd.DataFrame):
    """
    Tokenize transcribed text, chunking into segments of length <= NLP_MAXLEN-5
    while preserving correct timestamps from ASR transcription.
    
    Parameters:
        asr_result (List[dict]): The transcription segments. Each dict should have a 'text' and 
            'start' and 'end' keys for the transcription text and start and end times, respectively.
    
    Returns:
        List[dict]: A list of dictionaries containing tokenized sentences with 'text', 'tokens', 'start',
            and 'end' keys for the sentence text, tokenized form, and start and end times, respectively.
    """
    segments = []
    curr_segment = {
        "start": None,
        "end": None,
        "text": "",
        "tokens": []
    }

    for _, row in asr_df.iterrows():
        text = row['text']
        tokens = get_tokens(NLP_TOKENIZER, text)
        # If total tokens exceed maximum length, add current segment to list and flush
        if len(tokens) + len(curr_segment['tokens']) >= NLP_MAXLEN:
            segments.append(curr_segment)
            curr_segment = {
                "start": None,
                "end": None,
                "text": "",
                "tokens": []
            }
        else:
            curr_segment['start'] = curr_segment['start'] or row['start']
            curr_segment['end'] = row['end']
            curr_segment['text'] += text 
            curr_segment['tokens'] += tokens

    segments.append(curr_segment)
    return pd.DataFrame(segments)

In [13]:
segments_df = generate_timestamped_segments(transcript_df)

Explore the generated dataframe. Compared to `transcript_df`, what is the time duration that each row captures? How many words on average in each segment?

In [14]:
# EXERCISE: what is the time duration that each row captures? 
print((segments_df['end'] - segments_df['start']).mean(), "seconds")

# EXERCISE: How many words on average in each segment?
print(segments_df["text"].str.len().mean())

# EXERCISE: How many tokenson average in each segment?
print(segments_df["tokens"].str.len().mean())

302.57894736842104 seconds
4334.315789473684
961.8421052631579


Verify that no text segment contains more than NLP_MAXLEN tokens

In [15]:
assert (segments_df['tokens'].str.len() > NLP_MAXLEN).sum() == 0, f"At least one text segment has more than {NLP_MAXLEN} tokens"

The next step is to pass these text segments to an NLP Summarizer model. For each text segment the model will generate summaries that are not more than `SUMMARY_LENGTH = 128` tokens.

The `transformers` library offers a convenient `pipeline` class that wraps up complex code for summarization (and more tasks) into a single API call. [API Reference](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/pipelines#transformers.SummarizationPipeline)

We defined `NLP_ARCH = facebook/bart-large-cnn` above; this is Meta's BART language model that has been finetuned for summarization on the CNN/Daily Mail dataset.

In [16]:
DEVICE

device(type='cuda', index=0)

In [17]:
bart_abstract_1 = "BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token."
bart_abstract_2 = " BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance."
bart_abstracts = [bart_abstract_1, bart_abstract_2]

summarizer = pipeline("summarization", model=NLP_ARCH, device=torch.device('cuda:0'))
abstracts_summary = summarizer(bart_abstracts, max_length=64, min_length=20)

print(abstracts_summary[0])
print(abstracts_summary[1])

print("original length: ", len(bart_abstract_1) + len(bart_abstract_2))
print("summarized length: ", len(abstracts_summary[0]['summary_text']) + len(abstracts_summary[1]['summary_text']))


{'summary_text': 'BART is a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by corrupting text with an arbitrary noising function. It uses a standard Tranformer-based neural machine translation architecture.'}
{'summary_text': 'BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD. BART achieves new state-of-the-art results on a range of abstractive dialogue, question answering,'}
original length:  1361
summarized length:  526


In [18]:
# EXERCISE: 
# Write a function that 
# - takes in the dataframe generated above, 
# - generates summaries of 128 tokens or less for each segment, and 
# - adds them in a new column of the dataframe. Return this dataframe

def generate_timestamped_summaries(segments_df: pd.DataFrame, summary_lengths: int = 128):
    """
    Generate summaries of each timestamped segments
    
    Parameters:
        segments_df (DataFrame): The dataframe containing timestamps and text segments
        summary_lengths (int): The maximum length of each generated summary
    
    Returns:
        pd.DataFrame: A dataframe with timestamps, transcriptions and summaries
    """
       
    # Extract the sentences from the timestamped transcript
    sentences = segments_df['text'].tolist()
    
    # Initialize the summarization pipeline
    summarizer = pipeline("summarization", model=NLP_ARCH, device=DEVICE)
    
    # Generate summaries for the sentences
    summaries = summarizer(sentences, max_length=summary_lengths, min_length=20, do_sample=False)
    summaries = [x['summary_text'] for x in summaries]
    
    # Add the summaries to the dataframe
    segments_df['summary'] = summaries
    # for a, b in zip(timestamped_sentences, summaries):
    #     a['summary'] = b['summary_text']
    
    # Save the timestamped summaries to a file
    # summary_file = os.path.dirname(transcript_file)+"/summary.csv"
    # save_transcript_to_csv(timestamped_sentences, summary_file)
    # print("[NLP] Video summary saved at ", summary_file)

    return segments_df

summary_df = generate_timestamped_summaries(segments_df, SUMMARY_LENGTH)

Your max_length is set to 128, but you input_length is only 59. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=29)


### View the summary

In [19]:
segments_df.head()

Unnamed: 0,start,end,text,tokens,summary
0,23.0,318.0,So I could just hear people talking coming up...,"[407, 38, 115, 95, 1798, 82, 1686, 567, 62, 11...","Soar is built out of a big production system, ..."
1,322.0,622.0,and it's very hard to prove theorems that are...,"[8, 24, 18, 182, 543, 7, 3364, 5, 1688, 4339, ...","In complexity theory, someone proves a really ..."
2,626.0,919.0,That is the AI question. And although I have ...,"[280, 16, 5, 4687, 864, 4, 178, 1712, 38, 33, ...",At age 17 he wanted to be a forest ranger. At ...
3,922.0,1226.0,Who would ever want to be an optical engineer...,"[3394, 74, 655, 236, 7, 28, 41, 17547, 8083, 1...","As an undergraduate at Stanford, I worked on x..."
4,1229.0,1532.0,"And I read this chapter, and it was so awful ...","[178, 38, 1166, 42, 7285, 6, 8, 24, 21, 98, 11...","In 1970, when Xerox came into existence in 197..."


### Format and save the dataframe

Looks good! Let's make the timestamps more readable so we can scroll in the video if we need to. Here's a `format_time` helper function.

Save the dataframe as a CSV file. This will be useful when we need to cross-check something in the video

In [20]:
def format_time(t):
    """
    Convert a time in seconds to a string in HH:MM:SS format.
    
    Parameters:
        t (str): The time in seconds.
    
    Returns:
        str: The time in HH:MM:SS format.
    """
    t = round(float(t))
    hh = t // 3600
    t %= 3600
    mm = t // 60
    ss = t % 60
    return f"{hh:02d}:{mm:02d}:{ss:02d}"


segments_df['start'] = segments_df['start'].apply(format_time)
segments_df['end'] = segments_df['end'].apply(format_time)

output_dir = os.path.dirname(audio_file)
segments_df.to_csv(output_dir + '/timestamped_summaries.csv', index_label=False)

Reading the summary row-by-row is tedious. Concatenate all the individial summaries into a single passage.

In [12]:
# EXERCISE: Concatenate the segment-summaries into a single paragraph

passage = " ".join(segments_df['summary'].tolist())
print(passage)

Soar is built out of a big production system, and thus there is behind SOAR a huge set of rules, 10,000 of them by the time you take the 2,000, original and 8,000. There's a big read-a-net there, which is a device for trying to execute those rules efficiently. So here we have the time per decision cycle in seconds, spin it on a log scale, and in fact, there's essentially no indication of what we call the average growth effect. In complexity theory, someone proves a really new interesting theorem which opens up a new area. You live your life moving from one theoretical area to another as it opens up. For Werner Reichert, life is a sequence of five years, phenomenally five years project, each one picked by looking at the state of science. At age 17 he wanted to be a forest ranger. At 19 he was out at Bikini at the atomic bomb test site. At 22 he was an optical engineer. At 25 he was a computer scientist. At 30 he became a computer engineer. As an undergraduate at Stanford, I worked on x-

We're busy people, we need a TL;DR

In [8]:
# EXERCISE: Generate a synopsis of the entire video not exceeding 64 tokens 

tokens = get_tokens(NLP_TOKENIZER, passage)
summarizer = pipeline("summarization", model=NLP_ARCH, device=DEVICE)

if len(tokens) > NLP_MAXLEN:
  tokens_list = [tokens[i:i+NLP_MAXLEN] for i in range(0, len(tokens), NLP_MAXLEN)]
  sentence_list = [NLP_TOKENIZER.decode(toks) for toks in tokens_list]
  output = summarizer(sentence_list, max_length=64, min_length=20, do_sample=False)
else:
  output = summarizer(passage, max_length=64, min_length=20, do_sample=False)

synopsis = ' '.join([x['summary_text'] for x in output])
print(synopsis)

Werner Reichert: Life is a sequence of five years, phenomenally five years project, each one picked by looking at the state of science. At age 17 he wanted to be a forest ranger. At 19 he was out at Bikini at the atomic bomb test site. At 22 I consider that I have done a hell of a lot of computer science. My goal is therefore not computer science, just like my goal is not AI. That doesn't keep me from being in the computer science department, providing textbooks on computers.


Write the TL;DR and paragraph to a file

In [13]:
# EXERCISE: Write a file called video_summary.txt that contains the above generated synopsis and passage

with open(output_dir + '/video_summary.txt', "w") as f:
  f.write("TL;DR\n")
  f.write(synopsis)
  f.write("\n\nPASSAGE:\n")
  f.write(passage)

