Teams exports meeting transcripts in two formats: docx and vtt.

- **docx** files include profile pics & names, time (from start, mm:ss), utterances and meta information (X joined, Y left, recording started, etc.)

- **vtt** (subtitle format) files include time (from start, HH:mm:ss.SSS) and utterances.

For now, I work with docx files only, because we need the speakers' names, which the vtt files do not include.
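
The two time formats can be brought to a common unit if timestamps from both file types ever need to be compared. A minimal sketch, assuming the timestamps are plain colon-separated strings as described above; `timestamp_to_seconds` is a hypothetical helper and is not used by the rest of this notebook:

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert 'mm:ss' or 'HH:mm:ss.SSS' into seconds from the start."""
    seconds = 0.0
    # Each colon-separated field is worth 60 times the field after it.
    for part in timestamp.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


print(timestamp_to_seconds("01:05"))         # docx style -> 65.0
print(timestamp_to_seconds("00:01:05.250"))  # vtt style  -> 65.25
```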

## Preparation: installing packages & creating functions

Uncomment this cell to wrap long output lines in Google Colab.

In [1]:
# from IPython.display import HTML, display

# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))
# get_ipython().events.register('pre_run_cell', set_css)

In [2]:
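# `import docx` is provided by the python-docx package; openai is pinned below 1.0
# because the code below calls `openai.ChatCompletion`, which 1.0 removed.
! pip install -q "openai<1" python-docx simplify_docx tiktoken gdown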

DEPRECATION: simplify-docx 0.1.2 has a non-standard dependency specifier six>=1.12.0<2. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of simplify-docx or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063

In [3]:
import docx
import re
import openai
import tiktoken
import gdown
import json
from tqdm.notebook import tqdm
from simplify_docx import simplify

In [4]:
def get_text(json_part):
    str_to_return = json_part["VALUE"][-1]["VALUE"]
    return str_to_return

In [5]:
def structure_data_old(meeting_transcipt):
    meeting_transcipt_dict = {"datetime": '', "people_present": [], "utterances": []}
    for n, item in enumerate(meeting_transcipt):
        if n == 1:
            meeting_transcipt_dict["datetime"] = item
        elif "joined the meeting" in item:
            person = item.replace(" joined the meeting", "").strip()
            meeting_transcipt_dict["people_present"].append(person)
        elif "   " in item:
            item = [x for x in re.split(r'   |\r', item) if x]
            person = item[0]
            time = item[1]
            utts = ' '.join(item[2:])
            one_utt = {"time_start": time, "person": person, "sentences": utts}
            meeting_transcipt_dict["utterances"].append(one_utt)
    return meeting_transcipt_dict

In [6]:
def structure_data(meeting_transcipt):
    meeting_transcipt_dict = {"datetime": '', "people_present": [], "utterances": []}
    people = []
    for n, item in enumerate(meeting_transcipt):
        if "   " in item or "\r" in item:
            item = [x for x in re.split(r'   |\r', item) if x]
            time = item[0]
            person = item[1]
            utts = ' '.join(item[2:])
            one_utt = {"time_start": time, "person": person, "sentences": utts}
            meeting_transcipt_dict["utterances"].append(one_utt)
            if person not in people:
                people.append(person)
    meeting_transcipt_dict["people_present"] = people
    return meeting_transcipt_dict

In [7]:
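def concat_utterances(transcript):
    # Merge consecutive utterances by the same speaker into one entry.
    if not transcript:
        return []
    speaker = transcript[0]["person"]
    concat_utt = transcript[0]["sentences"]
    transcript_concat = []
    for utt in transcript[1:]:
        if speaker == utt["person"]:
            concat_utt += f' {utt["sentences"]}'
        else:
            transcript_concat.append({'person': speaker, 'sentences': concat_utt})
            concat_utt = utt["sentences"]
            speaker = utt["person"]
    # Flush the last speaker's turn, which the loop itself never appends.
    transcript_concat.append({'person': speaker, 'sentences': concat_utt})
    return transcript_concat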

In [8]:
def make_transcript(dict_data):
    if "utterances" in dict_data:
        utterances = dict_data["utterances"]
    else:
        utterances = dict_data
    transcript_concat = concat_utterances(utterances)
    transcript_list = [f'{utt["person"]}: {utt["sentences"]}' for utt in transcript_concat]
    return transcript_list

In [9]:
def check_token_number(text):
    enc = tiktoken.encoding_for_model("gpt-4")
    return len(enc.encode(text))

In [10]:
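def decide_where_to_break(transcript_list, limit=3000):
    # Return the indices at which a new chunk starts (exclusive end of the previous one).
    len_tokens = 0
    break_points = []
    for n, utt in enumerate(transcript_list):
        utt_tokens = check_token_number(utt)
        len_tokens += utt_tokens
        if len_tokens > limit and n > 0:
            # Utterance n opens the next chunk, so its tokens are counted there.
            break_points.append(n)
            len_tokens = utt_tokens
    return break_points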

In [11]:
def split_transcript_into_chunks(transcript, break_points):
    transcript_chunks = []
    start_point = 0
    for break_point in break_points:
        chunk = transcript[start_point:break_point]
        transcript_chunks.append(chunk)
        start_point = break_point
    transcript_chunks.append(transcript[start_point:])
    return transcript_chunks

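NB: gpt-4 has a larger context window than gpt-3.5-turbo: a maximum of **8,192** tokens compared to **4,096**. However, gpt-3.5-turbo returns outputs with lower latency and costs much less per token. The chunk limits used below (6,500 tokens for gpt-4 and 3,000 otherwise) stay under these maximums, which leaves room for the prompt text and the response.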
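
How a token budget turns a transcript into chunks can be illustrated without calling the tokenizer. This is a minimal sketch under stated assumptions: whitespace-separated words stand in for tokens (the notebook itself counts with `tiktoken`), and `count_tokens_approx` and `chunk_by_budget` are hypothetical names rather than the notebook's own functions.

```python
def count_tokens_approx(text):
    # Assumption: one whitespace-separated word ~ one token (a rough stand-in for tiktoken).
    return len(text.split())


def chunk_by_budget(utterances, limit):
    """Group consecutive utterances so that each chunk stays within `limit` tokens."""
    chunks, current, used = [], [], 0
    for utt in utterances:
        cost = count_tokens_approx(utt)
        # Close the current chunk when adding this utterance would exceed the budget.
        if current and used + cost > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(utt)
        used += cost
    if current:
        chunks.append(current)
    return chunks


utterances = ["A: one two three", "B: four five", "A: six seven eight nine", "B: ten"]
chunks = chunk_by_budget(utterances, limit=8)
print(len(chunks))  # 2
```

A single utterance longer than the budget still ends up in a chunk of its own here, so a real pipeline would need to split such an utterance separately.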

In [12]:
default_system_message = "You are an office worker helper. You summarize transcripts, create key points, and write lists of tasks."
default_prompt = """Meeting transcript:
{chunk_text}
Provide the summary of the meeting based on the meeting transcript."""


def send_prompt_to_openai(prompt=default_prompt, system_message=default_system_message, openai_model="gpt-4"):
    # `openai.ChatCompletion` is the interface of the pre-1.0 `openai` SDK.
    response = openai.ChatCompletion.create(
        model=openai_model,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
    )
    response_text = response["choices"][-1]["message"]["content"]
    return response_text

In [13]:
def process_meeting_chunks_update(transcript_chunks, prompt_first, prompt_others):
    all_responses = []
    for n, chunk in enumerate(tqdm(transcript_chunks)):
        chunk_text = '\n'.join(chunk)
        if n == 0:
            prompt = prompt_first.replace('{chunk_text}', chunk_text)
        else:
            prompt = prompt_others.replace('{response}', response).replace('{chunk_text}', chunk_text)
        response = send_prompt_to_openai(prompt=prompt)
        print(f'tokens in response: {check_token_number(response)}')
        all_responses.append(response)
    return all_responses


def process_meeting_chunks_concat(transcript_chunks, prompt_first, prompt_final, openai_model='gpt-4'):
    all_responses = []
    for n, chunk in enumerate(tqdm(transcript_chunks)):
        chunk_text = '\n'.join(chunk)
        prompt = prompt_first.replace('{chunk_text}', chunk_text)
        response = send_prompt_to_openai(prompt=prompt, openai_model=openai_model)
        print(f'tokens in response: {check_token_number(response)}')
        all_responses.append(response)
    print('Processing responses to get the final answer. This should take around 45 seconds.')
    gpt_responses = '\n'.join(all_responses)
    prompt_final = prompt_final.replace("{gpt_responses}", gpt_responses)
    final_response = send_prompt_to_openai(prompt=prompt_final, openai_model='gpt-4')
    all_responses.append(final_response)
    return all_responses

In [14]:
def process_meeting_teams(filepath, prompt_first, prompt_final, openai_model='gpt-4'):
    my_doc = docx.Document(filepath)
    my_doc_as_json = simplify(my_doc)
    list_data = [get_text(json_part) for json_part in my_doc_as_json['VALUE'][0]['VALUE']]
    dict_data = structure_data(list_data)
    transcript = make_transcript(dict_data)
    if openai_model == 'gpt-4':
        token_limit = 6500
    else:
        token_limit = 3000
    break_points = decide_where_to_break(transcript, limit=token_limit)
    transcript_chunks = split_transcript_into_chunks(transcript, break_points)
    result = process_meeting_chunks_concat(transcript_chunks, prompt_first, prompt_final, openai_model=openai_model)
    return result


def process_meeting_teams_old(filepath, prompt_first, prompt_final, openai_model='gpt-4'):
    my_doc = docx.Document(filepath)
    my_doc_as_json = simplify(my_doc)
    list_data = [get_text(json_part) for json_part in my_doc_as_json['VALUE'][0]['VALUE']]
    dict_data = structure_data_old(list_data)
    transcript = make_transcript(dict_data)
    if openai_model == 'gpt-4':
        token_limit = 6500
    else:
        token_limit = 3000
    break_points = decide_where_to_break(transcript, limit=token_limit)
    transcript_chunks = split_transcript_into_chunks(transcript, break_points)
    result = process_meeting_chunks_concat(transcript_chunks, prompt_first, prompt_final, openai_model=openai_model)
    return result


def process_meeting_ami_corpus(filepath, prompt_first, prompt_final, openai_model='gpt-4'):
    with open(filepath, 'r') as file:
        data = json.load(file)
    meeting_chunks_dict = {}
    for meeting_name, meeting_content in data.items():
        transcript = make_transcript(meeting_content)
        if openai_model == 'gpt-4':
            token_limit = 6500
        else:
            token_limit = 3000
        break_points = decide_where_to_break(transcript, limit=token_limit)
        transcript_chunks = split_transcript_into_chunks(transcript, break_points)
        meeting_chunks_dict[meeting_name] = transcript_chunks
    meeting_results = {}
    for meeting_name, transcript_chunks in meeting_chunks_dict.items():
        result = process_meeting_chunks_concat(transcript_chunks, prompt_first, prompt_final, openai_model=openai_model)
        meeting_results[meeting_name] = result
    return meeting_results

## Prompts
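The templates in this section mark insertion points with `{chunk_text}`, `{response}` and `{gpt_responses}`, which the processing functions fill with `str.replace`. A minimal sketch of that substitution, using a shortened template and a made-up two-line chunk (the speakers and their lines are hypothetical):

```python
prompt_template = """Meeting transcript:
{chunk_text}
Provide a full summary of the meeting based on the meeting transcript."""

# Hypothetical chunk: a list of "speaker: text" lines, the shape make_transcript returns.
chunk = [
    "Alice: The parser now handles both export formats.",
    "Bob: Great, I will review it tomorrow.",
]
chunk_text = "\n".join(chunk)

# Only the named placeholder is substituted; the rest of the template is left as is.
prompt = prompt_template.replace("{chunk_text}", chunk_text)
print(prompt)
```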

In [20]:
prompt_first_summary = """Meeting transcript:
{chunk_text}
Provide a full summary of the meeting based on the meeting transcript.
Be concise but mention all important details. Do not include tasks separately."""

prompt_others_summary = """New information:
{chunk_text}
Previous summary:
{response}
Extend the previous summary based on New information.
Provide the updated summary as a whole, preserving all information from Previous summary."""

prompt_final_summary = """Here are parts of a summary of a work meeting:
{gpt_responses}
Combine them to provide a complete and coherent summary.
It must include all the information from the parts above.
You should abridge the text and leave out minor details when possible."""

prompt_first_future_tasks = """Meeting transcript:
{chunk_text}
Write the lists of future to-do tasks set for each speaker.
Only mention new tasks that were set during this call and will be started later.
Do not mention tasks that are in progress.
Give one list of future tasks for each speaker without extra information."""

prompt_others_future_tasks = """Previous lists of future tasks:
{response}
Amend the lists of future tasks based on this part of the transcript:
{chunk_text}
Only mention tasks that were set during this call.
Provide the lists of future tasks as a whole, preserving all information from Previous lists of future tasks.
Give one list of future tasks for each speaker without extra information."""

prompt_final_future_tasks = """Here is a list of future to-do tasks set for each speaker:
{gpt_responses}
Format the list to use correct numbering and get rid of duplicate items."""

prompt_first_completed_tasks = """Meeting transcript:
{chunk_text}
Write the lists of tasks that each speaker completed before the meeting. Do not include things done during the meeting.
The tasks have to be clearly reported as finished by the speaker. Do not mention future plans.
Only give one list of completed tasks for each speaker without any extra information."""

prompt_others_completed_tasks = """Previous lists of completed tasks:
{response}
Amend the lists of completed tasks based on this part of the transcript:
{chunk_text}
Provide the lists of completed tasks as a whole, preserving all information from Previous lists of completed tasks.
Only give one list of completed tasks for each speaker without any extra information."""

prompt_final_completed_tasks = """Here is a list of completed tasks for each speaker:
{gpt_responses}
Format the list to use correct numbering and get rid of duplicate items."""

prompt_first_decisions = """Meeting transcript:
{chunk_text}
Write the list of most important decisions made during the meeting.
Keep it as short as possible.
Only include decisions about the project, product or the working process without any extra information."""

prompt_others_decisions = """Previous important decisions:
{response}
Amend the list of most important decisions based on this part of the transcript:
{chunk_text}
Provide the list of most important decisions made during the meeting as a whole.
Keep it as short as possible. You must start with all information from Previous important decisions.
Only give the list of most important decisions without any extra information."""

prompt_final_decisions = """Here is a list of key decisions made during a work meeting:
{gpt_responses}
Format the list to use correct numbering. Delete unimportant items that include general statements and not decisions.
If several items can be merged into one, merge and abridge. Make the list as short as possible."""

## Getting summaries & tasks

NB: I use gpt-4 by default, which is quite expensive. To switch to gpt-3.5-turbo, pass `openai_model='gpt-3.5-turbo'` to `process_meeting_teams`. Note that the last step (concatenate & abridge) always uses `gpt-4`, as `gpt-3.5-turbo` performs poorly there.

In [16]:
url = 'https://drive.google.com/uc?id=1-3W1egLeIeDwT7SxYXXa2g48NwRBbTOc'
filepath = './data/daily_sync_eng.docx'
gdown.download(url, filepath, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-3W1egLeIeDwT7SxYXXa2g48NwRBbTOc
To: /Users/veronicasmilga/Desktop/Sentius/dream/prototypes/data/daily_sync_eng.docx
100%|██████████| 27.3k/27.3k [00:04<00:00, 5.68kB/s]


'./data/daily_sync_eng.docx'

In [17]:
openai.api_key = "YOUR OPENAI KEY HERE"

In [21]:
# gpt-3.5-turbo summary (gpt-4 for concatenation and abridging)
summary = process_meeting_teams(filepath, prompt_first_summary, prompt_final_summary, openai_model='gpt-3.5-turbo')
print(summary[-1])

  0%|          | 0/4 [00:00<?, ?it/s]

tokens in response: 112
tokens in response: 202
tokens in response: 102
tokens in response: 215
Processing responses to get the final answer. This should take around 45 seconds.
At the series of meetings, the team discussed various technical and administrative issues alongside project updates. Technical issues discussed included audio quality, camera settings, issues with transcript formats, the usage of GPT-3/4/5 for better results, and the challenge with IP redirection. Solutions proposed for these challenges included optimizing task tracking, improving system authentication, and fine-tuning AI models, deemed more cost-effective.

The need for better task status tracking was brought up by Ksenia, while Nika fixed the difficulties with transcript formats. Daniel suggested providing an API key and considered the utility of AI models - GPT-4 for training data and fine-tuning GPT-3/5 to cut costs. Artem worked on setting up the front-end container on the AWS machine and resolved package 

In [None]:
# gpt-3.5-turbo summary (gpt-4 for concatenation and abridging)
summary = process_meeting_teams(filepath, prompt_first_summary, prompt_final_summary, openai_model='gpt-3.5-turbo')
print(summary[-1])

  0%|          | 0/4 [00:00<?, ?it/s]

tokens in response: 166
tokens in response: 202
tokens in response: 389
tokens in response: 199
Processing responses to get the final answer. This should take around 45 seconds.
During the meeting, a variety of work-related topics were discussed by the participants. The talks revolved around task progression and optimising current operations. Ksenia outlined her work on enhancing the functionality of the system by working on task completion checks and conditions. She was also working on adding notes for clarification and fixing prompts. A pressing question during the meeting was raised by Marina about integrating browsing capabilities with Dream Builder, and Diliara offered her inputs explaining about an API available for parsing and performing actions on websites using JS snippets.

The meeting further ventured into the complexities of processing transcripts in different formats, which has been a challenge for Nika. She also highlighted problems concerning chat GPT focusing primarily 

In [None]:
# gpt-4 summary
summary = process_meeting_teams(filepath, prompt_first_summary, prompt_final_summary)
print(summary[-1])

  0%|          | 0/2 [00:00<?, ?it/s]

tokens in response: 408
tokens in response: 375
Processing responses to get the final answer. This should take around 45 seconds.
Meeting Summary:

The meeting opened with Daniel tackling various technical issues. Ksenia gave updates on her task of adding clarification notes in response to user interaction, and voiced issues with ChatGPT. She suggested better means for debugging and browser agent integration.

Financial concerns about high usage of GPT-4 were raised by Nika, who proposed using GPT-4 for channel note generation and GPT-3 for debugging as a cost-effective measure. Deployment of a corporate API key as a solution was suggested by Daniel. Nika additionally flagged plans of improving task status tracking.

Irina shared updates on her tasks and a technical glitch she's been experiencing. Insights from her education classes about product management were also discussed. Mike then suggested use of GPT-3.5 for debugging and GPT-4 for sample generation, with longer-term plans of l

In [None]:
current_tasks = process_meeting_teams(filepath, prompt_first_future_tasks, prompt_final_future_tasks)
print(current_tasks[-1])

  0%|          | 0/2 [00:00<?, ?it/s]

tokens in response: 159
tokens in response: 220
Processing responses to get the final answer. This should take around 45 seconds.
Here is the list, renumbered and with duplicates removed: 

**Diliara Zharikova:**
1. Correct the tasks in GitHub according to the meeting discussion.
2. Review the skill selector and response selector elements of the interface.
3. Meet with the research team to discuss future tasks.

**Nika Smilga:**
1. Implement better task status tracking for the transcription model.
2. Add a prompt for each participant for task tracking and task detection.

**Irina Nikitenko:**
1. Summarize the customer development (custdev) interviews.
2. Input summarized custdev interviews into a table.

**Daniel Kornev:**
1. Follow up on the customer development interviews.
2. Provide access to Fedor for a submachine.
3. Confirm John Morris' needs with Simon Muzio, the Director Program management at Meta.
4. Call with a representative from an early-stage venture fund in Seattle.

**Ks

In [None]:
completed_tasks = process_meeting_teams(filepath, prompt_first_completed_tasks, prompt_final_completed_tasks)
print(completed_tasks[-1])

  0%|          | 0/2 [00:00<?, ?it/s]

tokens in response: 166
tokens in response: 140
Processing responses to get the final answer. This should take around 45 seconds.
1. Diliara Zharikova: 
   - Modified tasks on Asana.
   - Discussed task prioritization and allocation with the team.
   - Conducted customer development interviews.

2. Ksenia Petukhova: 
   - Worked on adding notes for clarification to enhance user interactions.
   - Started working on updating task completion status by integrating the dream with the browser agent.

3. Nika Smilga: 
   - Updated code to process different transcript formats from Teams.
   - Corrected issues related to missing information from the beginning of meeting transcripts.

4. Irina Nikitenko: 
   - Shared learning materials about product management roles.
   - Read and summarized customer development interviews. 

5. Daniel Kornev: 
   - Conducted a customer development interview with John Moore, Head of Project Management.
   - Worked on project planning.

6. Maxim Talimanchuk:
  -

In [None]:
decisions = process_meeting_teams(filepath, prompt_first_decisions, prompt_final_decisions, openai_model='gpt-4')
print(decisions[-1])

  0%|          | 0/2 [00:00<?, ?it/s]

tokens in response: 214
tokens in response: 205
Processing responses to get the final answer. This should take around 45 seconds.
1. The team will initially utilize Chat GPT for debugging and GPT-4 for generating samples, with future plans to fine-tune smaller models for specific tasks.
2. Task status tracking will be refined to distinguish completed tasks from future ones, with assignments based on planning inputs from Nika and Diliara.
3. To manage the cost of using API keys, a corporate credit card will be used in the short term, with a long-term cost optimization plan acknowledged.
4. Responsibilities of product managers will be understood by reviewing the document shared by Irina.
5. Recognizing the importance of customer interviews, Daniel outlined different strategies for talking to department heads versus individual contributors. Irina will sum up these interviews.
6. The issue of authorization in the new front end was identified as a top priority to merge front and back end pa