# Analyzing Audio Transcriptions with LDA and Large Language Models


In [1]:
!pip install scipy --upgrade
!pip install gensim --upgrade

Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.16.0
    Uninstalling scipy-1.16.0:
      Successfully uninstalled scipy-1.16.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tsfresh 0.21.0 requires scipy>=1.14.0; python_version >= "3.10", but you have scipy 1.13.1 which is incompatible.
Successfully installed scipy-1.13.1


## Download and Analyze Audio Files from Gutenberg

In [2]:
import requests
import os
import librosa

# List of audio files to download
audio_files = [
    "01",
    "04",
    "05",
    "06",
    "14",
    "22",
    "24",
    "25",
    "26",
    "27"
]

# Base URL of the Gutenberg site
base_url = "https://www.gutenberg.org/files/21144/mp3/21144-"
# Path where audio files will be saved
download_path = "audios_fabulas_esopo"

# Create directory if it doesn't exist
if not os.path.exists(download_path):
    os.makedirs(download_path)

# Download each file if not already present
for file in audio_files:
    # Skip download if file already exists
    if os.path.exists(os.path.join(download_path, f"fabula_{file}.mp3")):
        print(f"The file fabula_{file}.mp3 already exists, it will not be downloaded again.")
        continue
    url = f"{base_url}{file}.mp3"
    # A timeout keeps the loop from hanging indefinitely on a stalled connection
    respuesta = requests.get(url, timeout=30)
    if respuesta.status_code == 200:
        file_path = os.path.join(download_path, f"fabula_{file}.mp3")
        with open(file_path, 'wb') as f:
            f.write(respuesta.content)
        print(f"Downloaded: {file_path}")
    else:
        print(f"Could not download the file: {file}")
        print(f"Status: {respuesta.status_code} - {respuesta.reason}")

# Display audio duration using librosa
downloaded_files = os.listdir(download_path)

for file in downloaded_files:
    if file.endswith(".mp3"):
        file_path = os.path.join(download_path, file)
        y, sr = librosa.load(file_path, sr=None)
        duracion = librosa.get_duration(y=y, sr=sr)
        print(f"File: {file}, Duration: {duracion:.2f} seconds")



The file fabula_01.mp3 already exists, it will not be downloaded again.
The file fabula_04.mp3 already exists, it will not be downloaded again.
The file fabula_05.mp3 already exists, it will not be downloaded again.
The file fabula_06.mp3 already exists, it will not be downloaded again.
The file fabula_14.mp3 already exists, it will not be downloaded again.
The file fabula_22.mp3 already exists, it will not be downloaded again.
The file fabula_24.mp3 already exists, it will not be downloaded again.
The file fabula_25.mp3 already exists, it will not be downloaded again.
The file fabula_26.mp3 already exists, it will not be downloaded again.
The file fabula_27.mp3 already exists, it will not be downloaded again.
File: fabula_05.mp3, Duration: 58.44 seconds
File: fabula_04.mp3, Duration: 67.08 seconds
File: fabula_22.mp3, Duration: 57.21 seconds
File: fabula_06.mp3, Duration: 71.76 seconds
File: fabula_27.mp3, Duration: 60.26 seconds
File: fabula_26.mp3, Duration: 63.66 seconds
File: fabu

## Transcribe Audio Files Using Whisper API

In [3]:
from openai import OpenAI
from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=OPENAI_API_KEY)

def transcribe_audio(file):
    with open(file, 'rb') as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language="es"
        )
    return response

audio_files_paths = [os.path.join(download_path, file) for file in downloaded_files if file.endswith('.mp3')]

transcriptions = {}

for file in audio_files_paths:
    transcriptions[file] = transcribe_audio(file)

# Print transcriptions
for file_id in audio_files:
    print(f"Transcription of file {file_id}:")
    print(transcriptions[os.path.join(download_path, f'fabula_{file_id}.mp3')])


# Map each fable ID to the plain text of its transcription
transcriptions_text = {f"fable_{archivo}": transcriptions[os.path.join(download_path, f'fabula_{archivo}.mp3')].text for archivo in audio_files}
# Print every entry of the dictionary
for key, value in transcriptions_text.items():
    print(f"{key}: {value}\n")


Transcription of file 01:
Transcription(text='Las fábulas de Sopo Grabado para LibriVox.org por Paulino www.paulino.info Fábula número 61 El lobo y el cordero en el templo Dándose cuenta de que era perseguido por un lobo, un pequeño corderito decidió refugiarse en un templo cercano. Lo llamó lobo y le dijo que si el sacrificador lo encontraba allí adentro, lo inmolaría a su dios. Mejor así, replicó el cordero, prefiero ser víctima para un dios a tener que perecer en tus colmillos. Si sin remedio vamos a ser sacrificados, más nos vale que sea con el mayor honor. Fin de la fábula Esta es una grabación del dominio público.', logprobs=None, usage=UsageDuration(duration=None, type='duration', seconds=73))
Transcription of file 04:
Transcription(text='Las fábulas de Esopo, grabado para LibriVox.org por Roberto Antonio Muñoz, fábula número 64, El Lobo y la Cruz. A un lobo que comía un hueso, se le atragantó el hueso en la garganta y corría por todas partes en busca de auxilio. Encontró en su 

## Clean and Export Transcriptions

In [4]:
import nltk
from nltk.corpus import stopwords
import json
import re

nltk.download('stopwords')

# Spanish stopword list
stopwords_es = set(stopwords.words('spanish'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
def clean_text(text):
    """
    Cleans the transcription text by removing standard introductory
    and ending phrases, as well as the fable number prefix.

    Args:
        text (str): Raw transcription text.

    Returns:
        str: Cleaned transcription text.
    """
    # The dot in "LibriVox.org" is escaped so it matches a literal period only
    start_pattern = r"(Las fábulas de E?[sS]opo[\.\,]? [Gg]rabado para Libr[ie][vV]ox\.org)"
    end_pattern = r"(Fin de la fábula|Fin de fábula)\.? Esta (?:es una )?grabación (es )?d(?:el|e) dominio público\. *(Subtítulos realizados por la comunidad de Amara\.org)?"

    # Clean the text using regular expressions
    cleaned_text = re.sub(start_pattern, "", text)
    cleaned_text = re.sub(end_pattern, "", cleaned_text)

    fable_number = r"^.* [fF]ábula número \d{2}[\.,]?\s"
    cleaned_text = re.sub(fable_number, "", cleaned_text)

    return cleaned_text

cleaned_transcriptions = {
    key: clean_text(value) for key, value in transcriptions_text.items()
}

# Save cleaned transcriptions to a JSON file
with open('cleaned_transcriptions.json', 'w', encoding='utf-8') as f:
    json.dump(cleaned_transcriptions, f, ensure_ascii=False, indent=4)

# Print cleaned transcriptions
for key, value in cleaned_transcriptions.items():
    print(f"{key}: {value}\n")





fable_01: El lobo y el cordero en el templo Dándose cuenta de que era perseguido por un lobo, un pequeño corderito decidió refugiarse en un templo cercano. Lo llamó lobo y le dijo que si el sacrificador lo encontraba allí adentro, lo inmolaría a su dios. Mejor así, replicó el cordero, prefiero ser víctima para un dios a tener que perecer en tus colmillos. Si sin remedio vamos a ser sacrificados, más nos vale que sea con el mayor honor. 

fable_04: El Lobo y la Cruz. A un lobo que comía un hueso, se le atragantó el hueso en la garganta y corría por todas partes en busca de auxilio. Encontró en su correra a una grulla y le pidió que le salvara de aquella situación y que enseguida le pagaría por ello. Aceptó la grulla e introdujo su cabeza en la boca del lobo, sacando de la garganta el hueso atravesado. Pidió entonces la cancelación de la paga convenida. Oye, amiga, dijo el lobo, ¿no crees que es suficiente paga con haber sacado tu cabeza sana y salva de mi boca? Nunca hagas favores a mal

## Full Text Cleaning Function

In [6]:
def clean_full_text(text):
    """
    Performs full cleaning of transcription text by removing leading/trailing spaces,
    newlines, punctuation, and stopwords. Also converts to lowercase.

    Args:
        text (str): The raw transcription text.

    Returns:
        str: Cleaned and normalized text.
    """
    # Remove leading/trailing whitespace
    text = text.strip()

    # Remove unnecessary newlines
    text = re.sub(r'\n+', ' ', text)

    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text)

    # Remove everything except letters and spaces
    text = re.sub(r"[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ\s]", "", text)

    # Convert to lowercase (optional, depending on downstream analysis)
    text = text.lower()

    # Remove stopwords
    tokens = text.split()
    filtered_tokens = [t for t in tokens if t not in stopwords_es]
    text = ' '.join(filtered_tokens)

    return text


fully_cleaned_transcriptions = {
    key: clean_full_text(value) for key, value in cleaned_transcriptions.items()
}

# Save the fully cleaned transcriptions to a JSON file
with open('fully_cleaned_transcriptions.json', 'w', encoding='utf-8') as f:
    json.dump(fully_cleaned_transcriptions, f, ensure_ascii=False, indent=4)

fully_cleaned_transcriptions


{'fable_01': 'lobo cordero templo dándose cuenta perseguido lobo pequeño corderito decidió refugiarse templo cercano llamó lobo dijo si sacrificador encontraba allí adentro inmolaría dios mejor así replicó cordero prefiero ser víctima dios tener perecer colmillos si remedio vamos ser sacrificados vale mayor honor',
 'fable_04': 'lobo cruz lobo comía hueso atragantó hueso garganta corría todas partes busca auxilio encontró correra grulla pidió salvara aquella situación enseguida pagaría ello aceptó grulla introdujo cabeza boca lobo sacando garganta hueso atravesado pidió entonces cancelación paga convenida oye amiga dijo lobo crees suficiente paga haber sacado cabeza sana salva boca nunca hagas favores malvados traficantes corruptos pues mucha paga si dejan sano salvo',
 'fable_05': 'lobo caballo pasaba lobo sembrado cebada comida gusto dejó siguió camino encontró rato caballo llevó campo comentándole gran cantidad cebada hallado vez comérsela mejor dejado agradaba oír ruido dientes mas

## Extract Keywords from Each Fable Using LDA

In [7]:
# Extract keywords from each fable using the LDA algorithm.
# Assume each fable has a single topic and extract the top 20 keywords for it.

import nltk
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from nltk.tokenize import word_tokenize

# Download required tokenizer
nltk.download('punkt')

# Tokenize each cleaned transcription
def tokenize_text(text):
    return word_tokenize(text)

# Tokenize all transcriptions
tokenized_transcriptions = {
    key: tokenize_text(value) for key, value in fully_cleaned_transcriptions.items()
}

# Dictionary to store keywords per fable
keywords_by_fable = {}

for name, tokens in tokenized_transcriptions.items():
    # Create dictionary and corpus for a single fable
    dct = corpora.Dictionary([tokens])
    corpus = [dct.doc2bow(tokens)]

    # Apply LDA assuming a single topic
    lda = gensim.models.ldamodel.LdaModel(
        corpus=corpus,
        id2word=dct,
        num_topics=1,
        passes=10,
        alpha=0.6,
        eta=0.2
    )

    # Extract top 20 keywords
    topic = lda.show_topics(num_topics=1, num_words=20, formatted=False)
    keywords = [word for word, _ in topic[0][1]]

    # Store results
    keywords_by_fable[name] = keywords

    print(f"Fable: {name}")
    print(keywords)


Fable: fable_01
['lobo', 'templo', 'cordero', 'si', 'ser', 'dios', 'perecer', 'perseguido', 'prefiero', 'refugiarse', 'remedio', 'replicó', 'sacrificados', 'adentro', 'tener', 'vale', 'vamos', 'víctima', 'sacrificador', 'decidió']
Fable: fable_04
['lobo', 'hueso', 'paga', 'pidió', 'garganta', 'boca', 'cabeza', 'grulla', 'todas', 'suficiente', 'pues', 'sacando', 'oye', 'pagaría', 'si', 'nunca', 'mucha', 'malvados', 'partes', 'sano']
Fable: fable_05
['cebada', 'caballo', 'lobo', 'preferido', 'siguió', 'sino', 'sembrado', 'si', 'pasaba', 'hallado', 'parezca', 'ruido', 'oídos', 'mejor', 'masticarla', 'malvado', 'lobos', 'rato', 'oír', 'estómago']
Fable: fable_06
['lobo', 'ley', 'asno', 'partes', 'tener', 'vez', 'repártelo', 'si', 'legislar', 'orejas', 'ordenando', 'manera', 'sé', 'magnífica', 'lobos', 'llévalo', 'llegas', 'leyes', 'moviendo', 'poder']
Fable: fable_14
['lobo', 'cabrito', 'sino', 'ocasión', 'burlándose', 'comenzó', 'ampliamente', 'pasar', 'poderosos', 'protegido', 'replicó',

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
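
With a single topic fitted on a single document, the keyword ranking above appears to track raw term frequency closely (the most repeated words come first). A plain frequency count is therefore a useful baseline to compare against the LDA output. The sketch below is only an illustration: `top_terms_by_frequency` and the sample tokens are hypothetical and are not part of the notebook.

```python
from collections import Counter

def top_terms_by_frequency(tokens, n=20):
    """Return the n most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]

# Hypothetical sample in the style of the cleaned fable text
sample_tokens = "lobo cordero templo lobo dios cordero lobo".split()

# 'lobo' appears three times and 'cordero' twice, so they rank first
print(top_terms_by_frequency(sample_tokens, n=2))
```

If the baseline and the LDA keywords largely agree for a fable, the single-topic model is probably adding little beyond counting.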


## Generate Fable Summaries and Subtopics with GPT


In [8]:
# Generate summaries and subtopics from extracted keywords using GPT

def generate_summary_and_subtopics(keywords, model="gpt-3.5-turbo"):
    """
    Uses an OpenAI language model to generate a short summary and
    three subtopics based on a list of Spanish keywords from a fable.

    Args:
        keywords (list of str): List of keywords from the fable.
        model (str): OpenAI model name.

    Returns:
        str: Summary and subtopics in structured text format.
    """

    # Prompt is in Spanish because the input data (fables) is in Spanish.
    # English translation:
    # "I have the following list of keywords extracted from a fable.
    # Write a short sentence (max 25 words) summarizing the central message of a fable based on these words.
    # Do not begin with 'In a fable' or similar phrases.
    # Generate three possible subtopics (1–5 words each) that could be developed from these keywords.
    # Each subtopic should be clear and different.
    # Format:
    # 1. Summary of the fable: <summary>
    # 2. Subtopic 1: <subtopic>
    # 3. Subtopic 2: <subtopic>
    # 4. Subtopic 3: <subtopic>"

    prompt = f"""
Tengo la siguiente lista de palabras clave extraídas de una fábula:
{', '.join(keywords)}.

Redacta una oración breve (máximo 25 palabras) que resuma el mensaje central de una fábula basada en estas palabras. No comiences con "En una fábula" ni frases similares.

Genera tres posibles subtemas (de 1 a 5 palabras) que podrían desarrollarse a partir de esas palabras clave. Cada subtema debe ser claro y diferente.

Responde tal cual en el formato:
1. Resumen de la fábula: <Resumen>
2. Subtema 1: <Subtema>
3. Subtema 2: <Subtema>
4. Subtema 3: <Subtema>
"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content


# Dictionary to store GPT-generated summaries and subtopics per fable
fable_summaries = {}

for name, keywords in keywords_by_fable.items():
    print(name)
    result = generate_summary_and_subtopics(keywords)
    fable_summaries[name] = result
    print(result)
    print("-" * 100)


fable_01
1. Resumen de la fábula: Un lobo decide no perseguir a un cordero indefenso y prefiero refugiarse en un templo en lugar de ser sacrificador.

2. Subtema 1: Decisiones morales
3. Subtema 2: La importancia de la empatía
4. Subtema 3: El valor de la bondad
----------------------------------------------------------------------------------------------------
fable_04
1. Resumen de la fábula: El lobo pidió a la grulla que sacara un hueso atascado en su garganta, prometiendo una recompensa.
2. Subtema 1: La importancia de ser honesto en las promesas.
3. Subtema 2: La astucia de los personajes en las fábulas.
4. Subtema 3: La lección de confiar en las habilidades de los demás.
----------------------------------------------------------------------------------------------------
fable_05
1. Resumen de la fábula: Un caballo preferido por un granjero es seguido por lobos malvados, pero aprende a no confiar en ellos.
2. Subtema 1: Confianza en extraños
3. Subtema 2: Valorar la amistad verdad

## Project Reflections and Lessons Learned


This project provided valuable experience in converting audio to text, enabling automated processing and analysis of spoken content. One of the main challenges was designing an effective prompt for the LLM, as it required precise and detailed instructions to generate structured and relevant output. We learned that prompt quality plays a critical role in the usefulness of language model responses.
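
One benefit of asking the model for a fixed format is that the reply can be parsed into structured data. The following is a minimal sketch, assuming the model follows the `Resumen de la fábula` / `Subtema N` layout requested in the prompt; `parse_summary_response` is a hypothetical helper, not part of the notebook, and the example string is copied from the fable_04 output.

```python
import re

def parse_summary_response(response_text):
    """Parse the 'Resumen / Subtema' reply format into a dict."""
    summary_match = re.search(r"Resumen de la fábula:\s*(.+)", response_text)
    subtopics = re.findall(r"Subtema \d+:\s*(.+)", response_text)
    return {
        # None signals that the model did not follow the requested format
        "summary": summary_match.group(1).strip() if summary_match else None,
        "subtopics": [s.strip() for s in subtopics],
    }

example = """1. Resumen de la fábula: El lobo pidió a la grulla que sacara un hueso atascado en su garganta, prometiendo una recompensa.
2. Subtema 1: La importancia de ser honesto en las promesas.
3. Subtema 2: La astucia de los personajes en las fábulas.
4. Subtema 3: La lección de confiar en las habilidades de los demás."""

parsed = parse_summary_response(example)
print(parsed["summary"])
print(parsed["subtopics"])
```

A missing summary or a subtopic count other than three is a cheap signal that the reply drifted from the format and may need a retry.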

Additionally, we discovered that audio file formats significantly affect tool compatibility and performance. Some libraries (e.g., SpeechRecognition) work best with specific formats like WAV, and format conversions can be resource-intensive.
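
When a tool does require WAV input, one common route is to call the `ffmpeg` command-line program. The sketch below is an illustration under stated assumptions: it presumes `ffmpeg` is installed and on the PATH, the 16 kHz mono defaults are an arbitrary choice rather than a requirement from this project, and both helper functions are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

def build_ffmpeg_command(src, dst, sample_rate=16000, channels=1):
    """Build the ffmpeg argument list for an audio conversion."""
    return [
        "ffmpeg", "-y",            # -y: overwrite the output file if it exists
        "-i", str(src),            # input file
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", str(channels),      # number of output channels
        str(dst),
    ]

def convert_to_wav(src, sample_rate=16000, channels=1):
    """Convert src to a WAV file next to it; requires ffmpeg on the PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on the PATH")
    dst = Path(src).with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate, channels), check=True)
    return dst

# Only the command is printed here, so nothing is converted when this runs
command = build_ffmpeg_command(
    "audios_fabulas_esopo/fabula_01.mp3",
    "audios_fabulas_esopo/fabula_01.wav",
)
print(command)
```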

Text cleaning proved essential to improve the relevance of topic modeling and summarization. However, we also noted that excessive cleaning can lead to loss of important context, especially in short texts.
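
One rough way to see how much a cleaning step discards is to measure the share of tokens that survive it. The sketch below is illustrative only: `SAMPLE_STOPWORDS` is a small hand-picked set standing in for the NLTK Spanish list, and `retention_ratio` is a hypothetical helper.

```python
import re

# Hand-picked stand-in for the full NLTK Spanish stopword list (assumption)
SAMPLE_STOPWORDS = {"a", "de", "el", "en", "la", "que", "un", "y", "por", "se", "lo", "le"}

def retention_ratio(text, stopwords):
    """Return the fraction of word tokens kept after stopword removal."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    if not tokens:
        return 0.0
    kept = [t for t in tokens if t not in stopwords]
    return len(kept) / len(tokens)

sentence = "A un lobo que comía un hueso, se le atragantó el hueso en la garganta"
ratio = retention_ratio(sentence, SAMPLE_STOPWORDS)
print(f"{ratio:.2f}")
```

For this sentence only 6 of 15 tokens remain, which shows how quickly a one-minute fable can shrink to a handful of content words.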

Overall, this was a strong exercise in applying natural language processing techniques to non-textual sources and leveraging large language models for summarization and topic extraction.