<a href="https://colab.research.google.com/github/barbaroja2000/llm/blob/main/Langchain_Meeting_Transcript_Analyser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain Meeting Transcript Analyser

This Colab notebook is a test harness for experimenting with the prompt engineering needed to extract the following information from a .vtt transcript file:

*   Participants
*   Meeting topic (metadata or parsed)
*   Meeting date, time, location (metadata or parsed)
*   Meeting actions & deadlines
*   Meeting key points
*   Decisions made
*   Questions raised (and possibly unanswered)
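
Some of these fields, such as the participant list, can often be read straight from the transcript before any prompting. The following is a minimal sketch, assuming cues mark the speaker either with a WebVTT voice tag (`<v Name>`) or with a `Name: ` prefix. The synthetic transcript's actual layout is not shown here, so both patterns, the `list_speakers` helper, and the sample cues are illustrative assumptions.

```python
import re
from typing import List

# Assumed speaker conventions (not confirmed for the transcript used here):
#   <v Jane Doe>Hello everyone      (WebVTT voice tag)
#   Jane Doe: Hello everyone        (plain "Name: text" prefix)
VOICE_TAG = re.compile(r"<v(?:\.[^\s>]+)*\s+([^>]+)>")
NAME_PREFIX = re.compile(r"^([A-Za-z][\w .'-]{0,40}?):\s")


def list_speakers(vtt_text: str) -> List[str]:
    """Return distinct speaker names in order of first appearance."""
    speakers: List[str] = []
    for line in vtt_text.splitlines():
        line = line.strip()
        # Skip the header, blank lines, cue timings and numeric cue ids.
        if not line or line.startswith("WEBVTT") or "-->" in line or line.isdigit():
            continue
        match = VOICE_TAG.search(line) or NAME_PREFIX.match(line)
        if match:
            name = match.group(1).strip()
            if name not in speakers:
                speakers.append(name)
    return speakers


# Made-up sample cues, for illustration only.
sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
<v GC IT Director>Thanks for joining.

00:00:04.500 --> 00:00:08.000
AC Head of Architecture: Clients keep asking about generative AI.

00:00:08.500 --> 00:00:10.000
<v GC IT Director>Agreed.
"""

print(list_speakers(sample))
```

A parsed list like this could serve as a cross-check on the model's answer rather than a replacement for it.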

Notes:

* Embedding model: "sentence-transformers/all-mpnet-base-v2"
* Vector store: FAISS (can be swapped out for Pinecone)
* LLM: meta-llama/Llama-2-70b-chat-hf, limited to 512 max new tokens
* Document chunk size: 1000 characters with a 100-character overlap
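
To make the chunking note concrete, the sketch below cuts a string into fixed 1000-character windows whose starts are 900 characters apart, so neighbouring chunks share 100 characters. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, so its chunks are usually shorter and rarely align with these exact offsets. The helper name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Digits make it easy to see that consecutive chunks really overlap.
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

For a 2500-character input this yields three chunks of 1000, 1000, and 700 characters.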


In [1]:
#@title Load Keys
#@markdown Utility to load keys from the file system; replace with environment variables if not using Google Drive

import os

# Alternative to the Drive-based loading below:
#os.environ.get("OPENAI_API_KEY")
#os.environ.get("HUGGINGFACEHUB_API_TOKEN")

!python -m pip install python-dotenv
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
import dotenv
dotenv.load_dotenv('/content/drive/MyDrive/keys/keys.env')

Mounted at /content/drive/


True
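
If the keys are not kept on Drive, the same values can be read from the environment instead. A minimal sketch follows; `require_env` is a hypothetical helper and `DEMO_API_KEY` is a stand-in name so the example runs anywhere. In this notebook the variable of interest would be `HUGGINGFACEHUB_API_TOKEN`, which is read further down.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to keys.env")
    return value


# Stand-in variable so the sketch is runnable without real credentials.
os.environ["DEMO_API_KEY"] = "placeholder"
print(require_env("DEMO_API_KEY"))
```

Failing early here gives a clearer message than the `KeyError` or authentication error that would otherwise surface later.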

In [2]:
# turn on wandb logging for langchain
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

In [3]:
# @title  Synthetic .vtt meeting transcript
synthetic_transcript_uri="https://gist.githubusercontent.com/barbaroja2000/277fd35e17ae6bc8610c29591f39c3a9/raw/5ecd4dc010e98c54f2d1537835a6acff4317443a/synthetic-transcript"

In [4]:
import requests

def fetch_text_file(url, save_path):
    """
    Fetch a text file from a URL and save it locally.

    Parameters:
    - url (str): The URL of the text file.
    - save_path (str): Local path where the file should be saved.
    """

    response = requests.get(url)

    # Ensure the request was successful
    response.raise_for_status()

    # Write the content to a local file
    with open(save_path, 'w', encoding=response.encoding) as file:
        file.write(response.text)


save_path = 'synthetic_transcript.vtt'
fetch_text_file(synthetic_transcript_uri, save_path)

In [12]:
!pip install -qU langchain faiss-cpu huggingface_hub sentence_transformers wandb > /dev/null

In [6]:
import glob
import os
from typing import Dict

from langchain import HuggingFaceHub, PromptTemplate
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

In [7]:
import os

HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

In [8]:
# @title  SummarizeNQA Class

class SummarizeNQA:
    def __init__(self, key: str, dir: str) -> None:
        if not key:
            raise ValueError("API key must be provided.")
        if not dir  or not os.path.isdir(dir):
            raise ValueError("Directory must be provided.")

        self.key = key
        self.dir = dir
        self.db = None
        self.docs = None

    def load(self, chunk_size: int = 1000, chunk_overlap: int = 100) -> None:
        # Only .vtt transcripts are indexed; fail early if the directory has none.
        # os.path.join also handles a directory given without a trailing slash.
        if not glob.glob(os.path.join(self.dir, "*.vtt")):
            raise ValueError("Directory must contain at least one .vtt file.")

        loader = DirectoryLoader(self.dir, glob="*.vtt", loader_cls=TextLoader)
        documents = loader.load()

        # Split the transcripts into overlapping chunks.
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        self.docs = text_splitter.split_documents(documents)

        # Embed the chunks and build an in-memory FAISS index for similarity search.
        # TODO: try other embedding models here.
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
        self.db = FAISS.from_documents(self.docs, embeddings)

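    def summarize(self, max_tokens=1000, chain_type='map_reduce'):
      # NOTE: max_tokens is accepted but not used below; generation length is
      # capped by the max_new_tokens=512 setting passed to HuggingFaceHub.
      if not self.db:
         raise ValueError("Call load() before summarize().")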

      map_prompt = """
                Write a  summary of the following:
                "{text}"
                 SUMMARY:
                """
      map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

      combine_prompt = """
      Write a  summary of the following text delimited by triple backquotes.
      Return your response in bullet points which covers the key points of the text.
      ```{text}```
      BULLET POINT  SUMMARY:
      """
      combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

      repo_id = "meta-llama/Llama-2-70b-chat-hf"

      llm = HuggingFaceHub(
          huggingfacehub_api_token=self.key,
          repo_id=repo_id, model_kwargs={"max_new_tokens":512}
      )

      summary_chain = load_summarize_chain(llm=llm,
              chain_type=chain_type,
              map_prompt=map_prompt_template,
              combine_prompt=combine_prompt_template,
              verbose=False, return_intermediate_steps=False
          )

      return  summary_chain.run(self.docs)

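    def qa(
        self,
        query: str,
        temperature: float = 0,  # NOTE: currently unused; not passed to the model
        count: int = 4,  # number of chunks retrieved from the index
        chain_type: str = "stuff",  # or "map_reduce", "refine"
        return_only_outputs: bool = True,
        return_intermediate_steps: bool = False,
    ) -> Dict:
        if self.db is None:
            raise ValueError("Call load() before qa().")

        # Retrieve the `count` chunks most similar to the query.
        docs = self.db.similarity_search(query, k=count)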

        repo_id = "meta-llama/Llama-2-70b-chat-hf"

        llm = HuggingFaceHub(
              huggingfacehub_api_token=self.key,
              repo_id=repo_id, model_kwargs={"max_new_tokens":512}
          )

        if chain_type == "stuff":
            chain = load_qa_with_sources_chain(llm, chain_type=chain_type)
        else:
            chain =  load_qa_with_sources_chain(llm, chain_type=chain_type, return_intermediate_steps=return_intermediate_steps)
        return chain(
            {"input_documents": docs, "question": query},
            return_only_outputs=return_only_outputs,
        )

In [9]:
# @title  Set Up
summarize_and_qa = SummarizeNQA(os.environ.get("HUGGINGFACEHUB_API_TOKEN"),"./")

In [10]:
# @title Load the texts into documents and index
summarize_and_qa.load()

In [13]:
# @title Summarize
# @markdown ```Depending on the length of the text, you may have to reduce the max_tokens parameter```
import re
summary = summarize_and_qa.summarize(max_tokens=1000)
summary = re.sub(r'\n{3,}', '\n\n', summary)  # the model output contains long runs of newlines; collapse them

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Streaming LangChain activity to W&B at https://wandb.ai/bandulu/uncategorized/runs/wa5mx1yz
[34m[1mwandb[0m: `WandbTracer` is currently in beta.
[34m[1mwandb[0m: Please report any issues to https://github.com/wandb/wandb/issues with the tag `langchain`.
Token indices sequence length is longer than the specified maximum sequence length for this model (3941 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3941 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2382 > 1024). Running this sequence through the model will result in indexing errors


In [14]:
print(summary)


      * IT director calls a meeting to discuss a new AI strategy for clients.
      * Head of architecture mentions an increase in client interest in AI, particularly generative AI.
      * Principal architect highlights the potential of generative AI in content creation, design, and simulation.
      * Managing director emphasizes the importance of developing universal AI products for clients.
      * Company discusses AI capabilities and how to make them stand out in the market.
      * Enterprise architect highlights predictive analytics and automation in finance and healthcare.
      * Head of marketing stresses the importance of differentiation.
      * IT director suggests partnering with industry leaders to gain an edge.
      * Principal architect emphasizes the need for a robust technical infrastructure to support resource-intensive AI operations.
      * AC Head of Architecture suggests dedicated GPU clusters and collaborations with cloud providers.
      * BD Managing Direc
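
The bullets printed above keep their leading indentation, presumably inherited from the indented prompt template. As an optional clean-up step, a small helper could strip that indentation along with the extra newlines. This is only a sketch: `tidy_summary` is a hypothetical name and the sample string is made up.

```python
import re


def tidy_summary(raw: str) -> str:
    """Strip per-line indentation and collapse runs of blank lines."""
    lines = [line.strip() for line in raw.splitlines()]
    text = "\n".join(lines).strip()
    # Three or more consecutive newlines become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text)


raw = "\n\n      * First point.\n      * Second point.\n\n\n\n      * Third point.\n"
print(tidy_summary(raw))
```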

In [15]:
# @title Meeting Participants
response = summarize_and_qa.qa("list the meeting participants",chain_type="map_reduce")
print(response["output_text"])

Token indices sequence length is longer than the specified maximum sequence length for this model (1785 > 1024). Running this sequence through the model will result in indexing errors


The meeting participants are:

1. AC Head of Architecture
2. BD Managing Director
3. AJ Principal Architect
4. MM Enterprise Architect
5. GC IT Director
6. RM Head of Marketing

SOURCES:

1. synthetic_transcript.vtt
2. synthetic_transcript.vtt
3. synthetic_transcript.vtt
4. synthetic_transcript.vtt
5. synthetic_transcript.vtt
6. synthetic_transcript.vtt
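
The `output_text` returned by the sources chain mixes the answer with a trailing `SOURCES:` block, and the same file can be listed many times. A minimal sketch of separating the two and de-duplicating the sources follows. It assumes the model keeps the `SOURCES:` marker seen in the outputs here, which is not guaranteed, and `split_sources` is a hypothetical helper.

```python
import re
from typing import List, Tuple


def split_sources(output_text: str) -> Tuple[str, List[str]]:
    """Separate the answer from the trailing SOURCES block and de-duplicate sources."""
    answer, _, sources_block = output_text.partition("SOURCES:")
    sources: List[str] = []
    for line in sources_block.splitlines():
        # Drop list markers such as "1." or "*" in front of each source.
        source = re.sub(r"^\s*(?:\d+\.|[*-])\s*", "", line).strip()
        if source and source not in sources:
            sources.append(source)
    return answer.strip(), sources


# Shortened copy of the kind of output shown above.
sample_output = """The meeting participants are:

1. AC Head of Architecture
2. BD Managing Director

SOURCES:

1. synthetic_transcript.vtt
2. synthetic_transcript.vtt
"""

answer, sources = split_sources(sample_output)
print(answer)
print(sources)
```

If the marker is missing, the whole text is returned as the answer and the source list is empty.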


In [16]:
# @title Actions and Deadlines
response = summarize_and_qa.qa("describe the meeting follow-on actions and any deadlines",chain_type="map_reduce")
print(response["output_text"])

Token indices sequence length is longer than the specified maximum sequence length for this model (2350 > 1024). Running this sequence through the model will result in indexing errors


The meeting follow-on actions and deadlines are as follows:

1. Coordinate with sales team to arrange a workshop for clients (no deadline specified).
2. Draft a proposal for the team structure and responsibilities (no deadline specified).
3. Evaluate infrastructure needs for scalable AI and prepare a report (deadline: three weeks from the date of the meeting).
4. Collaborate with universities for training and academic insights (no deadline specified).
5. Recruit fresh talent through university collaborations (no deadline specified).
6. Innovate solutions through fresh perspectives (no deadline specified).
7. Set aside a budget for internal R&D (no deadline specified).

SOURCES:

* synthetic_transcript.vtt (lines 10-13, 18-24)
* synthetic_transcript.vtt (lines 20-24)


In [17]:
# @title Decisions Made
response = summarize_and_qa.qa("List the decisions made in the meeting",chain_type="map_reduce")
print(response["output_text"])

Token indices sequence length is longer than the specified maximum sequence length for this model (2094 > 1024). Running this sequence through the model will result in indexing errors


The decisions made in the meeting are:

1. Create a dedicated AI support team.
2. Showcase case studies and real-world applications of AI solutions.
3. Coordinate with the sales team and arrange a workshop for clients.
4. Gather feedback from clients to understand their specific needs in AI solutions.
5. Implement a regular training schedule for the team to stay up-to-date with the latest AI advancements.
6. Evaluate the company's infrastructure needs for scalable AI.
7. Scout for potential partnerships in the tech space.
8. Establish a dedicated ethics committee for AI.
9. Ensure ethical soundness of products through regular reviews.
10. Use "Ethically Designed AI Solutions" as a selling point.

SOURCES:

1. synthetic_transcript.vtt
2. synthetic_transcript.vtt
3. synthetic_transcript.vtt
4. synthetic_transcript.vtt
5. synthetic_transcript.vtt
6. synthetic_transcript.vtt
7. synthetic_transcript.vtt
8. synthetic_transcript.vtt
9. synthetic_transcript.vtt
10. synthetic_transcript.vtt


In [18]:
# @title Questions raised
response = summarize_and_qa.qa("list all the questions raised in the meeting, and answers provided.",chain_type="map_reduce")
print(response["output_text"])

Token indices sequence length is longer than the specified maximum sequence length for this model (2019 > 1024). Running this sequence through the model will result in indexing errors




Questions raised in the meeting:

1. How to pique interest and drive sales of AI solutions?
2. How to understand what clients are specifically looking for in AI solutions?
3. How to ensure the team is up-to-date with the latest AI advancements?
4. Proposal for the team structure and responsibilities of the dedicated AI support team.
5. What are the infrastructure needs for scalable AI?
6. Who will lead the team to evaluate infrastructure needs?
7. How long will the evaluation take?
8. What is the concern regarding AI development?
9. What is the suggestion to address security concerns?
10. How do we plan to support these AI solutions?
11. How can we ensure our products are ethically sound?
12. What are the ethical implications of generative AI?

Answers provided:

1. By demonstrating AI capabilities firsthand through a workshop for clients, showcasing case studies and real-world applications, emphasizing customization and client-specific benefits.
2. By gathering feedback from clients