In [None]:
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

In [4]:
document = [
        "At Microsoft, we have been on a quest to advance AI beyond existing techniques, by taking a more holistic, "
        "human-centric approach to learning and understanding. As Chief Technology Officer of Azure AI Cognitive "
        "Services, I have been working with a team of amazing scientists and engineers to turn this quest into a "
        "reality. In my role, I enjoy a unique perspective in viewing the relationship among three attributes of "
        "human cognition: monolingual text (X), audio or visual sensory signals, (Y) and multilingual (Z). At the "
        "intersection of all three, there's magic-what we call XYZ-code as illustrated in Figure 1-a joint "
        "representation to create more powerful AI that can speak, hear, see, and understand humans better. "
        "We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, "
        "spanning modalities and languages. The goal is to have pretrained models that can jointly learn "
        "representations to support a broad range of downstream AI tasks, much in the way humans do today. "
        "Over the past five years, we have achieved human performance on benchmarks in conversational speech "
        "recognition, machine translation, conversational question answering, machine reading comprehension, "
        "and image captioning. These five breakthroughs provided us with strong signals toward our more ambitious "
        "aspiration to produce a leap in AI capabilities, achieving multisensory and multilingual learning that "
        "is closer in line with how humans learn and understand. I believe the joint XYZ-code is a foundational "
        "component of this aspiration, if grounded with external knowledge sources in the downstream AI tasks."
    ]

In [15]:
endpoint = os.getenv('AZURE_ENDPOINT')
key = os.getenv("AZURE_KEY")
text_analytics_client = TextAnalyticsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
)

#poller = text_analytics_client.begin_abstract_summary(document)
abstract_summary_results = text_analytics_client.begin_abstract_summary(document).result()
extract_summary_results = text_analytics_client.begin_extract_summary(document).result()
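
`os.getenv` returns `None` when a variable is unset, so a missing `AZURE_ENDPOINT` or `AZURE_KEY` only surfaces later, during client setup, without naming the variable that is missing. A minimal sketch of a guard, assuming a hypothetical helper called `require_env` (it is not part of the Azure SDK):

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Hypothetical usage with the variable names from the cell above:
# endpoint = require_env("AZURE_ENDPOINT")
# key = require_env("AZURE_KEY")
```

Failing at this point keeps the error close to its cause instead of inside the first service call.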

In [16]:
for result in abstract_summary_results:
    if result.kind == "AbstractiveSummarization":
        print("Summaries abstracted:")
        for summary in result.summaries:
            print(f"{summary.text}\n")
    elif result.is_error:
        print("...Is an error with code '{}' and message '{}'".format(
            result.error.code, result.error.message
        ))

Summaries abstracted:
The Chief Technology Officer of Azure AI Cognitive Services discusses Microsoft's commitment to advancing AI by integrating monolingual text, audio/visual sensory signals, and multilingual data, termed the XYZ-code, to create AI that better understands humans across different domains. This innovative approach aims to enable cross-domain transfer learning, spanning modalities and languages, drawing inspiration from human cognition. The team's efforts over the past five years have led to breakthroughs in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning, achieving human-level performance on benchmarks. These advancements are seen as stepping stones toward a significant leap in AI capabilities, aspiring to develop multisensory and multilingual learning systems that mirror human learning processes. The ultimate goal is to ground the XYZ-code with external knowledge sources to 

In [17]:
for result in extract_summary_results:
    if result.kind == "ExtractiveSummarization":
        print("Summary extracted: \n{}".format(
            " ".join([sentence.text for sentence in result.sentences]))
        )
    elif result.is_error:
        print("...Is an error with code '{}' and message '{}'".format(
            result.error.code, result.error.message
        ))

Summary extracted: 
At the intersection of all three, there's magic-what we call XYZ-code as illustrated in Figure 1-a joint representation to create more powerful AI that can speak, hear, see, and understand humans better. We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, spanning modalities and languages. The goal is to have pretrained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today.


In [20]:
result.sentences

[SummarySentence(text=At the intersection of all three, there's magic-what we call XYZ-code as illustrated in Figure 1-a joint representation to create more powerful AI that can speak, hear, see, and understand humans better., rank_score=1.0, offset=517, length=203),
 SummarySentence(text=We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, spanning modalities and languages., rank_score=0.86, offset=721, length=134),
 SummarySentence(text=The goal is to have pretrained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today., rank_score=0.96, offset=856, length=158)]
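
Each extracted sentence carries a `rank_score` and an `offset`, so the same result can be presented either by importance or in original document order. A small sketch of both orderings, using a stand-in dataclass named `Sentence` (hypothetical, it only mirrors the three fields printed above and is not the SDK class) and the scores and offsets from this output:

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    # Stand-in for the SDK's SummarySentence; only the fields used here
    text: str
    rank_score: float
    offset: int

# Scores and offsets copied from the output above; texts shortened to labels
sentences = [
    Sentence("sentence at offset 517", 1.0, 517),
    Sentence("sentence at offset 721", 0.86, 721),
    Sentence("sentence at offset 856", 0.96, 856),
]

# Most important sentence first
by_rank = sorted(sentences, key=lambda s: s.rank_score, reverse=True)
# Reading order, as the sentences appear in the source document
by_position = sorted(sentences, key=lambda s: s.offset)

print([s.offset for s in by_rank])
print([s.offset for s in by_position])
```

Sorting by offset keeps the summary readable as running text; sorting by rank is more useful when only the top one or two sentences are kept.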

# Now make a function

In [80]:
text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key),
    )

def azure_abstractive_summarization(text_input: str) -> str:
    if not isinstance(text_input, list):
        text_input = [text_input]
    abstract_summary_results = text_analytics_client.begin_abstract_summary(text_input).result()
    summaries_text = ""
    for result in abstract_summary_results:
        if result.kind == "AbstractiveSummarization":
            summaries_text += "\n".join([summary.text for summary in result.summaries]) + " "
        elif result.is_error:
            print("...Is an error with code '{}' and message '{}'".format(
                result.error.code, result.error.message
            ))
            return ""
    return summaries_text

def azure_extractive_summarization(text_input: str) -> str:
    if not isinstance(text_input, list):
        text_input = [text_input]
    extract_summary_results = text_analytics_client.begin_extract_summary(text_input).result()
    summaries_text = ""
    for result in extract_summary_results:
        if not result.is_error:
            summaries_text += "\n".join([sentence.text for sentence in result.sentences]) + " "
        else:
            print("...Is an error with code '{}' and message '{}'".format(
                result.error.code, result.error.message
            ))
            return ""
    return summaries_text

In [66]:
text_abstract = azure_abstractive_summarization(document)
text_extract = azure_extractive_summarization(document)

In [67]:
print(text_abstract)
print(text_extract)

The Chief Technology Officer of Azure AI Cognitive Services discusses Microsoft's commitment to advancing AI by integrating monolingual text, audio/visual sensory signals, and multilingual data, termed the XYZ-code, to create AI that better understands humans across different domains. This innovative approach aims to enable cross-domain transfer learning, spanning modalities and languages, drawing inspiration from human cognition. The team's efforts over the past five years have led to breakthroughs in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning, achieving human-level performance on benchmarks. These advancements are seen as stepping stones toward a significant leap in AI capabilities, aspiring to develop multisensory and multilingual learning systems that mirror human learning processes. The ultimate goal is to ground the XYZ-code with external knowledge sources to enhance downstream AI 

In [87]:
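def count_words(text: str) -> tuple:
    """Print and return the word count and character count of `text`."""
    words = text.split()
    word_count = len(words)
    character_count = len(text)

    print(f"Word count: {word_count}, Character count: {character_count}")
    return word_count, character_count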

document_word_count = count_words(document[0])
abstract_word_count = count_words(text_abstract)
extract_word_count = count_words(text_extract)

Word count: 247, Character count: 1627
Word count: 150, Character count: 1132
Word count: 78, Character count: 498
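
The three counts above are easier to compare as ratios. A minimal sketch, assuming a hypothetical helper `compression_ratio` and using the word counts printed in this output (247, 150, and 78):

```python
def compression_ratio(original_words: int, summary_words: int) -> float:
    """Fraction of the original word count retained by the summary."""
    if original_words <= 0:
        raise ValueError("original_words must be positive")
    return summary_words / original_words

# Word counts printed above: 247 (document), 150 (abstractive), 78 (extractive)
print(f"abstractive: {compression_ratio(247, 150):.1%}")
print(f"extractive:  {compression_ratio(247, 78):.1%}")
```

On this sample the abstractive summary keeps roughly 61% of the words and the extractive one roughly 32%.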


## Using your own text

In [69]:
with open('./transcription.txt', 'r') as file:
    transcription = file.read()

res_abstract = azure_abstractive_summarization(transcription)

with open('./res_abstract.txt', 'w') as file:
    file.write(res_abstract)

count_words(transcription)
count_words(res_abstract)

HttpResponseError: (InvalidParameterValue) Job task: 'AbstractiveSummarization-0' failed with validation errors: ['Invalid Document in request.']
Code: InvalidParameterValue
Message: Job task: 'AbstractiveSummarization-0' failed with validation errors: ['Invalid Document in request.']
Exception Details:	{'code': 'InvalidRequest', 'message': "Job task: 'AbstractiveSummarization-0' failed with validation error: Request Payload sent is too large to be processed. Limit request size to: 125000"}

## Check document length first

In [89]:
with open('./transcription.txt', 'r') as file:
    transcription = file.read()

count_words(transcription)

def split_document(document: str, max_length: int = 125000) -> list:
    return [document[i:i + max_length] for i in range(0, len(document), max_length)]

doc = split_document(transcription)
print(len(doc))

# Report the size of each chunk
for chunk in doc:
    count_words(chunk)

Word count: 74840, Character count: 428953
4
Word count: 21725, Character count: 125000
Word count: 21839, Character count: 125000
Word count: 21834, Character count: 125000
Word count: 9443, Character count: 53953
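
The first three chunks above are exactly 125000 characters long, which means the fixed-width slice can cut through the middle of a word or sentence. A sketch of a splitter that prefers to break at whitespace, assuming (as this notebook does) that the limit is counted in characters; `split_on_whitespace` is a hypothetical name:

```python
def split_on_whitespace(text: str, max_length: int = 125000) -> list:
    """Split `text` into chunks of at most `max_length` characters,
    preferring to break at whitespace so words stay intact."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Look for the last space or newline inside the current window
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand chunk
        # Falls back to a hard cut when the window contains no whitespace
        chunks.append(text[start:end])
        start = end
    return chunks

print(split_on_whitespace("aaa bbb ccc", max_length=5))
```

Joining the chunks reproduces the original text, so nothing is lost; only the cut positions move.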


In [82]:
azure_abstractive_summarization(doc[0])
azure_abstractive_summarization(doc[1])
azure_abstractive_summarization(doc[2])
azure_abstractive_summarization(doc[3])

"The document appears to be a casual conversation among individuals discussing the development and investment opportunities in soften beach areas of Bali, focusing on kayaking and water-based activities. Participants highlight the continuous growth and the allure of investing in this sector, noting the island's charm and the potential for lucrative returns based on proximity to the water. They touch upon the local kayaking scene, which is vibrant and possibly undervalued or underestimated in terms of investment potential. The conversation also implies a sense of local pride and a call to recognize the ongoing development and investment opportunities that Bali offers, despite any initial uncertainties or lack of formal investment analysis.\nStringSummary:\nThe conversation among individuals revolves around the investment and development opportunities in Bali's beach areas, with a particular focus on kayaking as a growing sector. They discuss the island's appeal and the potential for inv

## Refine summarization 

In [91]:
# OK
text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key),
    )

# OK
def azure_abstractive_summarization(text_input: str) -> str:
    if not isinstance(text_input, list):
        text_input = [text_input]
    abstract_summary_results = text_analytics_client.begin_abstract_summary(text_input).result()
    summaries_text = ""
    for result in abstract_summary_results:
        if result.kind == "AbstractiveSummarization":
            summaries_text += "\n".join([summary.text for summary in result.summaries]) + " "
        elif result.is_error:
            print("...Is an error with code '{}' and message '{}'".format(
                result.error.code, result.error.message
            ))
            return ""
    return summaries_text

# OK
def azure_extractive_summarization(text_input: str) -> str:
    if not isinstance(text_input, list):
        text_input = [text_input]
    extract_summary_results = text_analytics_client.begin_extract_summary(text_input).result()
    summaries_text = ""
    for result in extract_summary_results:
        if not result.is_error:
            summaries_text += "\n".join([sentence.text for sentence in result.sentences]) + " "
        else:
            print("...Is an error with code '{}' and message '{}'".format(
                result.error.code, result.error.message
            ))
            return ""
    return summaries_text

# OK
def split_document(document: str, max_length: int = 125000) -> list:
    return [document[i:i + max_length] for i in range(0, len(document), max_length)]

# Refine Summarization
def refine_summarization(document: str) -> str:
    # Split the document if it exceeds the maximum length
    documents = split_document(document)
    # Summarize each chunk
    summaries = [azure_abstractive_summarization(doc) for doc in documents]
    # Combine the per-chunk summaries into one text
    combined = " ".join(summaries)
    # Summarize the combined summaries to get the final result
    return azure_abstractive_summarization(combined)


In [92]:
with open('./transcription.txt', 'r') as file:
    transcription = file.read()
    
res_abstract = refine_summarization(transcription)

with open('./res_abstract.txt', 'w') as file:
    file.write(res_abstract)
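
`refine_summarization` assumes the combined chunk summaries fit into a single request. A sketch of a variant that repeats the split-and-summarize step until the text fits in one chunk. The summarizer is passed in as a `summarize` callable (a hypothetical parameter) so the loop can be tried offline with a stub; `iterative_refine`, `first_words`, and `max_rounds` are likewise illustrative names:

```python
from typing import Callable

def split_document(document: str, max_length: int = 125000) -> list:
    return [document[i:i + max_length] for i in range(0, len(document), max_length)]

def iterative_refine(document: str,
                     summarize: Callable[[str], str],
                     max_length: int = 125000,
                     max_rounds: int = 5) -> str:
    """Summarize chunk by chunk, repeating until the text fits in one chunk."""
    text = document
    for _ in range(max_rounds):
        chunks = split_document(text, max_length)
        if len(chunks) <= 1:
            # Fits in a single request: one final summarization pass
            return summarize(text)
        # Summarize every chunk, then loop again on the joined summaries
        text = " ".join(summarize(chunk) for chunk in chunks)
    raise RuntimeError("Text did not shrink below max_length within max_rounds")

# Offline stub standing in for azure_abstractive_summarization
def first_words(text: str, n: int = 3) -> str:
    return " ".join(text.split()[:n])

print(iterative_refine("one two three four five six seven eight", first_words, max_length=20))
```

In the notebook, `azure_abstractive_summarization` could be passed as `summarize`; the `max_rounds` guard stops the loop if the summaries never shrink enough.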

## Refine summarization (again)

In [100]:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
from concurrent.futures import ThreadPoolExecutor
from grapheme import length

text_analytics_client = TextAnalyticsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
)

# Keep your original split function
def split_document(document: str, max_length: int = 125000) -> list:
    return [document[i:i + max_length] for i in range(0, len(document), max_length)]

def process_chunk(chunk: str, operation: str):
    """Process a chunk with specified summarization type"""
    if operation == "abstractive":
        poller = text_analytics_client.begin_abstract_summary([chunk])
    elif operation == "extractive":
        poller = text_analytics_client.begin_extract_summary([chunk])
    else:
        # Fail fast instead of hitting an unbound `poller` below
        raise ValueError(f"Unknown operation: {operation!r}")
    return poller.result()

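async def parallel_summarization(chunks: list, operation: str) -> list:
    """Process chunks in parallel without blocking the event loop"""
    # `asyncio` is imported further down in this cell, before this function is called
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as executor:
        tasks = [loop.run_in_executor(executor, process_chunk, chunk, operation) for chunk in chunks]
        return await asyncio.gather(*tasks)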

def handle_results(results: list, operation: str) -> str:
    """Extract text from results based on operation type"""
    summary = []
    for result in results:
        for doc in result:
            if doc.is_error:
                print(f"Error: {doc.error.code} - {doc.error.message}")
                continue
                
            if operation == "abstractive":
                if doc.kind == "AbstractiveSummarization":
                    summary.extend([s.text for s in doc.summaries])
            elif operation == "extractive":
                summary.extend([s.text for s in doc.sentences])
    
    return " ".join(summary)

async def hybrid_summarization(document: str) -> str:
    """Combined abstractive and extractive summarization"""
    # First split using your original function
    initial_chunks = split_document(document)
    
    # Run the abstractive pass, then the extractive pass; chunks within each pass run in parallel
    abstract_results = await parallel_summarization(initial_chunks, "abstractive")
    extract_results = await parallel_summarization(initial_chunks, "extractive")
    
    # Combine results from both methods
    combined = handle_results(abstract_results, "abstractive") + " " + \
               handle_results(extract_results, "extractive")
    
    # Check final length and refine if needed
    if length(combined) > 5120:
        refined_chunks = split_document(combined)
        final_results = await parallel_summarization(refined_chunks, "abstractive")
        return handle_results(final_results, "abstractive")
    
    return combined

import asyncio

with open('./transcription.txt', 'r') as file:
    transcription = file.read()
result = asyncio.run(hybrid_summarization(transcription))
print(result)

RuntimeError: asyncio.run() cannot be called from a running event loop
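
The error above occurs because the Jupyter kernel already runs an event loop, and `asyncio.run()` refuses to start a second one in the same thread. In a notebook cell, a top-level `await hybrid_summarization(transcription)` is usually the simplest fix. For code that has to work both in a notebook and in a plain script, a sketch of a helper (hypothetical name `run_coroutine`) that falls back to a worker thread when a loop is already running:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_coroutine(coro):
    """Run `coro` to completion whether or not an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread (plain script): asyncio.run is fine
        return asyncio.run(coro)
    # A loop is already running (e.g. a Jupyter kernel): use a separate thread
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(asyncio.run, coro).result()

async def demo() -> str:
    # Placeholder coroutine standing in for hybrid_summarization(...)
    await asyncio.sleep(0)
    return "done"

print(run_coroutine(demo()))
```

The thread-based path blocks the calling cell until the coroutine finishes, which matches how the rest of this notebook uses the blocking `.result()` calls.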

In [99]:
documents = split_document(transcription)
documents[0]

'Bagi banyak kasih ya malah. Jadi nomor satu masih opsi anda tetangga pak. Sorry aku boleh izin rekam ya soalnya buat buat nonton takutnya kita ada lupa aja sih sebenarnya. Saya tunggu satu masih lsy. Nomor duanya setiap tiba tiba. Kemarin sih asia timur banyak Kazakhstan sekali promonya india. Oh iya oh iya kemarin india jadi india nih lagi ramai banget di Bali. Iya sih kamu kemarin nilai biasa kita memang sama nih om ayamnya juga kan idealnya ini dia ayamnya juga kita ayamnya india ini india lagi. Amin banget datang. Kerja juga enggak nulisnya. Kita lihat data tourism nih. 3 baru china cimbo. Tempatnya. Hores selatan sama juki di situ. Jadi memang kalau bulenya ini lagi agak ini bu ya. Lagi agak menurun nampil balik lagi nanti marketnya. Sinar Mas helmy apakah memang mau main lisold atau pre old atau 2 duanya? Atau mau 2 duanya karena kalau dia. Lice hoax pasti target marketnya kan? Tanyakan beli mulai orang tua walaupun bule sekarang sudah banyak beli free hold pakai PMA. Melalui PM

In [83]:
def split_document(document: str, max_length: int = 125000) -> list:
    # Split the document into chunks of max_length characters
    return [document[i:i + max_length] for i in range(0, len(document), max_length)]

def azure_abstractive_summarization(document: str) -> str:
    text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key),
    )

    # Split the document if it exceeds the maximum length
    documents = split_document(document)

    summaries_text = ""
    for doc in documents:
        abstract_summary_results = text_analytics_client.begin_abstract_summary([doc]).result()
        for result in abstract_summary_results:
            if result.kind == "AbstractiveSummarization":
                summaries_text += "\n".join([summary.text for summary in result.summaries]) + "\n"
            elif result.is_error:
                print("...Is an error with code '{}' and message '{}'".format(
                    result.error.code, result.error.message
                ))
                return ""
    
    return summaries_text

def azure_extractive_summarization(document: str) -> str:
    text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key),
    )

    # Split the document if it exceeds the maximum length
    documents = split_document(document)

    summaries_text = ""
    for doc in documents:
        extract_summary_results = text_analytics_client.begin_extract_summary([doc]).result()
        for result in extract_summary_results:
            if not result.is_error:
                summaries_text += "\n".join([sentence.text for sentence in result.sentences]) + "\n"
            else:
                print("...Is an error with code '{}' and message '{}'".format(
                    result.error.code, result.error.message
                ))
                return ""
    
    return summaries_text

# Example usage:
with open('./transcription.txt', 'r') as file:
    transcription = file.read()

abstract_summary = azure_abstractive_summarization(transcription)
extractive_summary = azure_extractive_summarization(transcription)

print(f"Abstract summary:\n{abstract_summary}")
print(f"Extractive summary:\n{extractive_summary}")

Abstract summary:
The document discusses the burgeoning digital nomad scene in Bali, highlighting the potential for growth in this sector due to its target market. The author expresses anticipation for policy changes that could further facilitate this lifestyle, while also noting current promising developments in Kazakhstan and India as destinations. There is a sense of opportunity in the local tourism data, with a focus on the challenge of differentiating and marketing to the ideal customer base. The conversation touches on pricing strategies, suggesting that goods like salmon and accommodations like Freeport could become more accessible. The overall tone suggests optimism for the future of digital nomadism in these regions, with a call to adapt to changing policies and market demands. The discussion implies a broader trend of global mobility and the potential for local economies to capitalize on this growing niche.
The document appears to be a chaotic and informal discussion or trans

Word count: 1281


1281