In [10]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [122]:
import requests
from bs4 import BeautifulSoup
import re

from openai import OpenAI
import os
from pathlib import Path
from tqdm import tqdm
from pydub import AudioSegment

In [123]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [14]:
client = OpenAI(
    api_key=OPENAI_API_KEY
)

## prompts

In [20]:
system_prompt = f"""You are a large language model that is an expert at taking scientific and mathematical research papers, typeset in LaTeX, and transcribing them in spoken English, for the purpose of generating audio content. Please transcribe the following paper, typeset in LaTeX, into a format that will sound coherent when read by a text-to-speech program."""

Please take the following LaTeX code and transcribe all content into a format optimized for text-to-speech. All text content should be preserved or transcribed into a easily readable format.

In [130]:
system_prompt = r"""I want to generate a podcast from LaTeX code. Please take the following LaTeX code and transcribe all content into a format optimized for text-to-speech. Do not make any effort to summarize or compress content–all original words by the author must be preserved. All text content should be preserved or transcribed into a easily readable format. Equations and math should be transcribed such that they are human readable in text. For example, $a^2$ should be transcribed as 'a squared'. Furthermore, all commands should also be transcribed to readable text. For example, commands such as \section and \title should be read as 'section' and 'title' respectively, and \cite or \citet should transcribe the citation as an in-text citation. Figures must be omitted in their entirety."""

In [154]:
system_prompt = r"""I want to generate a podcast from LaTeX code. Please take the following LaTeX code and transcribe all content into a format optimized for text-to-speech. Do not make any effort to summarize or compress content–all original words by the author must be preserved. All text content should be preserved or transcribed into a easily readable format. Equations and math should be transcribed such that they are human readable in text. For example, $a^2$ should be transcribed as 'a squared'. Furthermore, all commands should also be transcribed to readable text. For example, commands such as \section and \title should be read as 'section' and 'title' respectively, and \cite or \citet should transcribe the citation as an in-text citation. Figures, tables, and comments must be omitted in their entirety."""

In [155]:
print(system_prompt)

I want to generate a podcast from LaTeX code. Please take the following LaTeX code and transcribe all content into a format optimized for text-to-speech. Do not make any effort to summarize or compress content–all original words by the author must be preserved. All text content should be preserved or transcribed into a easily readable format. Equations and math should be transcribed such that they are human readable in text. For example, $a^2$ should be transcribed as 'a squared'. Furthermore, all commands should also be transcribed to readable text. For example, commands such as \section and \title should be read as 'section' and 'title' respectively, and \cite or \citet should transcribe the citation as an in-text citation. Figures, tables, and comments must be omitted in their entirety.


#### chonk

In [43]:
chonk = input("paste raw LaTeX here:")

#### endchonk

In [55]:
minichonk = r"""\maketitle

\begin{abstract}
Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple --- a classifier is trained to predict some linguistic property from a model's representations --- and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This article critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances. 
\end{abstract}

\section{Introduction}

\looseness=-1
The opaqueness of deep neural network models of natural language processing (NLP) has spurred a line of research into interpreting and analyzing them. 
Analysis methods may aim to answer questions about a model's structure or its decisions. For instance, one might  ask which parts of a neural neural model are responsible for certain linguistic properties, or which parts of the input led the model to make a certain decision. 
A common methodology to answer questions about the structure of models is to associate internal representations with external properties, by training a classifier on said representations that predicts a given property. This framework, known as \mbox{\textbf{probing classifiers}}, has emerged as a prominent analysis strategy in many studies of NLP models.\footnote{For an overviews of analysis methods in NLP, see the survey by \citet{belinkov-glass-2019-analysis}, as well as the tutorials by \citet{belinkov-etal-2020-interpretability} and \citet{wallace-etal-2020-interpreting}. For an overview of explanation methods in particular, see the survey by \citet{danilevsky-etal-2020-survey}.}  
"""

In [133]:
minichonk = r"""\section{The Probing Classifiers Framework} \label{sec:framework} 



On the surface, the probing classifiers idea seems straightforward. We take a model that was trained on some task, such as a language model. We generate representations using the model, and train another classifier that takes the representations and predicts some property. If the classifier performs well, we say that the model has learned information relevant for the property. 
%
However, upon closer inspection, it turns out that much more is involved here. To see this, we now define this framework a bit more formally. 

Let us denote by $f : x \mapsto \hat{y}$ a model that maps input $x$ to output $\hat{y}$. We call this model the original model. It is trained on some annotated dataset $\mathcal{D}_O = \{x^{(i)}, y^{(i)}\}$, which we refer to as the original dataset. Its performance is evaluated by some measure, denoted $\textsc{Perf}(f, \mathcal{D}_O)$.
The function $f$ is typically a deep neural network that generates intermediate representations of $x$, for example $f_l(x)$ may denote the representation of $x$ at layer $l$ of $f$.\footnote{We use $f_l(x)$ to refer more generally to any intermediate output of $f$ when applied to $x$, so the framework includes analyses of other model components, such as attention weights \cite{clark-etal-2019-bert}.} 
A probing classifier $g : f_l(x) \mapsto \hat{z}$ maps intermediate representations to some property $\hat{z}$, which is typically some linguistic feature of interest. 
As a concrete example, $f$ might be a sentiment analysis model, mapping a text $x$ to a sentiment label  $y$, while $g$ might be a classifier mapping intermediate representations $f_l(x)$ to part-of-speech tags $z$.
The classifier $g$ is trained and evaluated on some annotated dataset $\mathcal{D}_P = \{x^{(i)}, z^{(i)}\}$, and some performance measure $\textsc{Perf}(g, f, \mathcal{D}_O, \mathcal{D}_P)$ (e.g., accuracy) is reported. Note that the performance measure depends on the probing classifier $g$ and the probing dataset $\mathcal{D}_P$, as well as on the original model $f$ and the original dataset $\mathcal{D_O}$.  


From an information theoretic perspective, training the probing classifier $g$ can be seen as estimating the mutual information between the intermediate representations $f_l(x)$ and the property $z$ (\citealt[p. 42]{belinkov:2018:phdthesis}; \citealt{pimentel-etal-2020-information}; \citealt{zhu-rudzicz-2020-information}), which we write $\mathrm{I}(\mathbf{z} ; \mathbf{h})$, where  $\mathbf{z}$ is a random variable ranging over properties $z$ and $\mathbf{h}$ is a random variable ranging over representations $f_l(x)$.  

The above careful definition of the probing classifiers framework reveals that it is comprised of multiple concepts and components, depicted in \Cref{fig:probing-components-basic}.  The choice of each such component, and the interactions between them, lead to non-trivial questions regarding the design and implementation of any probing classifier experiment. Before we turn to these considerations   in \Cref{sec:shortcomings-advances}, we briefly review some history and promises of probing classifiers in the next section. 


\begin{figure}[h]
    \centering
    % \begin{framed}
    \begin{subfigure}[b]{\textwidth}
    \centering
        \begin{tabular}{l @{\hskip 1em} l } 
        % \centering
        \toprule
         $x \mapsto y$ & Original task \\
         $\mathcal{D}_O = \{x^{(i)}, y^{(i)}\} $ & Original dataset \\
         $f : x \mapsto y $ & Original model \\
         $\textsc{Perf}(f, \mathcal{D}_O)$ & Performance on the original task \\
         $f_l(x)$ & Representations of $x$ from $f$\\
         $f_l(x) \mapsto z$ & Probing task \\
         $\mathcal{D}_P = \{x^{(i)}, z^{(i)}\} $ & Probing dataset \\
         $g : f_l(x) \mapsto z$ & Probing classifier \\
         $\textsc{Perf}(g, f, \mathcal{D}_O, \mathcal{D}_P) $ & Probing performance  \\ 
         \bottomrule
        \end{tabular}
         \caption{Basic Components.}
         \label{fig:probing-components-basic}
     \end{subfigure}
     \begin{subfigure}[b]{\textwidth}
     \centering
        \begin{tabular}{l @{\hskip 1em} l } 
        \toprule          
         $\bar{f} : x \mapsto y$ & Skyline model or upper bound \\ 
         $\underline{f} : x \mapsto y$ & Baseline model \\          
         $x \mapsto y_{Rand}$ & Control task \cite{hewitt-liang-2019-designing} \\ 
         $c : f_l(x) \mapsto c(f_l(x)) $ & Control function \cite{pimentel-etal-2020-information} \\ 
         $\mathcal{D}_{P,Rand}$ & Control task dataset \cite{hewitt-liang-2019-designing} \\ 
         $\mathcal{D}_{O,z}$ & Control dataset \cite{ravichander:2021:eacl} \\          
         $\textsc{Sel}(g, f, \mathcal{D}_O, \mathcal{D}_P, \mathcal{D}_{P,Rand})$ & Probing selectivity \cite{hewitt-liang-2019-designing} \\
         $ \mathcal{G}(\mathbf{z}, \mathbf{h}, c) $ & Information gain w.r.t control function \cite{pimentel-etal-2020-information} \\ 
         $\textsc{MDL}(g, f, \mathcal{D}_O, \mathcal{D}_P)$ & Probe minimum description length \cite{voita-titov-2020-information} \\ 
         $\tilde{f}_l(x)$ & Representations of $x$ from $f$, after an intervention \\ 
         \bottomrule 
        \end{tabular}
        \caption{Additional Components.}
        \label{fig:probing-components-extended}
        \vspace{-3pt}
    \end{subfigure}
    \caption{Components comprising the probing classifiers framework.}
    \label{fig:probing-components}
    \vspace{-19pt}
\end{figure}


"""

In [134]:
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": minichonk}
    ],
    model="gpt-3.5-turbo-0125",
    max_tokens=4096,
    n=1
)

In [135]:

print(chat_completion.choices[0].message.content)

The Probing Classifiers Framework

On the surface, the probing classifiers idea seems straightforward. We take a model that was trained on some task, such as a language model. We generate representations using the model, and train another classifier that takes the representations and predicts some property. If the classifier performs well, we say that the model has learned information relevant for the property.

However, upon closer inspection, it turns out that much more is involved here. To see this, we now define this framework a bit more formally.

Let us denote by f: x maps to y-hat a model that maps input x to output y-hat. We call this model the original model. It is trained on some annotated dataset D_O = {x^(i), y^(i)}, which we refer to as the original dataset. Its performance is evaluated by some measure, denoted Perf(f, D_O). 
The function f is typically a deep neural network that generates intermediate representations of x, for example f_l(x) may denote the representation 

In [52]:
len(chat_completion.choices[0].message.content)

3420

In [54]:
len(chat_completion.choices[0].message.content.split(' '))

474

In [64]:
text_for_speech = r"""Taking an information-theoretic perspective on probing, Pimentel et al. (2020) proposed to use control functions instead of control tasks in order to compare probes. Their control function is any function applied to the representation, c : f_l(x) maps to c(f_l(x)), and they compare the information gain, which is the difference in mutual information between the property z and the representation before and after applying the control function: G(z, h, c) = I(z ; h) - I(z ; c(h)).
While Pimentel et al. (2020) posit that their control function are a better criterion than the control tasks of Hewitt and Liang (2019), subsequent work showed that the two criteria are almost equivalent, both theoretically and empirically. """

In [68]:
text_for_speech

'Taking an information-theoretic perspective on probing, Pimentel et al. (2020) proposed to use control functions instead of control tasks in order to compare probes. Their control function is any function applied to the representation, c : f_l(x) maps to c(f_l(x)), and they compare the information gain, which is the difference in mutual information between the property z and the representation before and after applying the control function: G(z, h, c) = I(z ; h) - I(z ; c(h)).\nWhile Pimentel et al. (2020) posit that their control function are a better criterion than the control tasks of Hewitt and Liang (2019), subsequent work showed that the two criteria are almost equivalent, both theoretically and empirically. '

In [67]:
speech_file_path = "speech.mp3"
response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input=text_for_speech
)

response.to_file(speech_file_path)

  response.stream_to_file(speech_file_path)


## putting it together

In [114]:
def split_latex_by_section(latex_content):
    # Pattern to match section and subsection commands
    pattern = r'(?=(\\section\{.*?\}|\\subsection\{.*?\}))'
    
    # Split the content by the pattern and filter out any empty strings
    parts = [part for part in re.split(pattern, latex_content) if part.strip()]
    
    return parts

In [115]:
def split_latex_by_section(latex_content):
    # Pattern to match section and subsection commands
    pattern = r'(\\section\{.*?\}|\\subsection\{.*?\})'
    
    # Split the content by the pattern, keeping the delimiters
    parts = re.split(pattern, latex_content)
    
    # Combine each command with its following content
    combined_parts = []
    for i in range(1, len(parts) - 1, 2):
        combined_parts.append(parts[i] + parts[i + 1])
    
    # Add the last part if it doesn't end with a command
    if len(parts) % 2 == 1:
        combined_parts.append(parts[-1])
    
    return combined_parts

In [116]:
sections = split_latex_by_section(chonk)

In [117]:
text = ""
for section in tqdm(sections): 
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": section}
        ],
        model="gpt-3.5-turbo-0125",
        max_tokens=4096,
        n=1
    )
    snippet = chat_completion.choices[0].message.content
    text += f"""{snippet}
    
    """

100%|█████████████████████████████████████████████| 11/11 [01:46<00:00,  9.64s/it]


In [146]:

def generate_audio(snippet, index):
    speech_file_path = f"snippet_{index}.mp3"
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=snippet
    )
    response.stream_to_file(speech_file_path)
    return speech_file_path

def generate_snippet(section):
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": section}
        ],
        model="gpt-3.5-turbo-0125",
        max_tokens=4096,
        n=1
    )
    snippet = chat_completion.choices[0].message.content
    return snippet

# Create a list to store the paths of the individual audio files
audio_file_paths = []

text = ""

for i, section in enumerate(tqdm(sections)):
    # Generate the chat completion
    snippet = generate_snippet(section)
    
    # Check if the snippet is too long
    if len(snippet) > 4096:
        # Split the snippet at an arbitrary newline
        split_index = snippet.find('\n', len(snippet) // 4)
        snippet_part1 = snippet[:split_index]
        snippet_part2 = snippet[split_index + 1:]
        
        # rerun cleaning for both parts
        snippet_part1 = generate_snippet(snippet_part1)
        snippet_part2 = generate_snippet(snippet_part2)
        
        # Generate audio for both parts
        audio_file_paths.append(generate_audio(snippet_part1, f"{i}_1"))
        audio_file_paths.append(generate_audio(snippet_part2, f"{i}_2"))
        
        text += f"{snippet_part1} \n\n"
        text += f"{snippet_part2} \n\n"
    else:
        # Generate audio for the snippet
        audio_file_paths.append(generate_audio(snippet, i))
        text += f"{snippet} \n\n"

# Concatenate all the audio files
combined_audio = AudioSegment.empty()
for path in audio_file_paths:
    audio = AudioSegment.from_mp3(path)
    combined_audio += audio

# Export the combined audio to a single MP3 file
combined_audio.export("belinkov.mp3", format="mp3")

  response.stream_to_file(speech_file_path)
100%|█████████████████████████████████████████| 11/11 [07:11<00:00, 39.26s/it]


<_io.BufferedRandom name='belinkov.mp3'>

In [147]:
print(text)

Title: Introduction

The opaqueness of deep neural network models of natural language processing (NLP) has spurred a line of research into interpreting and analyzing them. Analysis methods may aim to answer questions about a model's structure or its decisions. For instance, one might ask which parts of a neural model are responsible for certain linguistic properties, or which parts of the input led the model to make a certain decision. A common methodology to answer questions about the structure of models is to associate internal representations with external properties, by training a classifier on said representations that predicts a given property. This framework, known as probing classifiers, has emerged as a prominent analysis strategy in many studies of NLP models.1

Despite its apparent success, the probing classifiers paradigm is not without limitations. Critiques have been made about comparative baselines, metrics, the choice of classifier, and the correlational nature of the m

In [144]:
split_index = snippet.find('\n', len(snippet) // 4)
snippet_part1 = snippet[:split_index]
snippet_part2 = snippet[split_index + 1:]

In [145]:
snippet_part1

'Section: Correlation versus causation\n\nA main limitation of the probing classifier paradigm is the disconnect between the probing classifier $g$ and the original model $f$. They are trained in two different steps, where $f$ is trained once and only used to generate feature representations $f_l(x)$, which are fed into $g$. Once we have $f_l(x)$, we get a probing performance from $g$, which tells us something about the information in $f_l(x)$. However, in the process, we have forgotten about the original task assigned to $f$, which was to predict $y$. This raises an important question, which early work has largely taken for granted (Section: Promises): Does model $f$ use the information discovered by probe $g$? In other words, the probing framework may indicate correlations between representations $f_l(x)$ and linguistic property $z$, but it does not tell us whether this property is involved in predictions of $f. Indeed, several studies pointed out this limitation (Belinkov & Glass, 2

In [151]:
!git add .

In [152]:
!git commit -m "add more streamlined workflow"

[main 523f0da] add more streamlined workflow
 2 files changed, 338 insertions(+), 35 deletions(-)
 rename prompt_engineering.ipynb => scratch.ipynb (99%)
 create mode 100644 tex_to_audio_generation.ipynb


In [153]:
!git push

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 58.28 KiB | 9.71 MiB/s, done.
Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
remote: This repository moved. Please use the new location:[K
remote:   https://github.com/ericcccsliu/tex_to_speech.git[K
To https://github.com/ericcccsliu/paper_to_speech.git
   bd12d0b..523f0da  main -> main
