# An example of using OpenAI API to translate lecture script from one language to another

My first supervisor, [Konstantin Postnov](http://xray.sai.msu.ru/~moulin/) and another SAI faculty member [Anatoli Zasov](https://www.sai.msu.ru/dept/zasovotd/staff.html#zasov) used to read a nice introductory course on astronomy to physicists at the faculty of physics at MSU.
I never attended the course myself, but used it quite often as inspiration when teaching to students without strong astronomical background at Tübingen. Unfortunately, lecture script is in russian, so I could not just point students to this resource. Here I'd like to exploit OpenAI API to translate [the course](http://www.astronet.ru/db/msg/1170612/node1.html) to english, and at the same time convert it to markdown so it can be hosted online or converted to pdf or other formats. Note that using vision API is probably a big overkill and not really efficient from financial perspective as there are simpler and faster local solutions such as [pix2tex](https://github.com/lukas-blecher/LaTeX-OCR). Better still, latex source used to generate the online version is likely available upon request from the authors. However, goal of this experiment is to illustrate usage of openai APIs, i.e. vision and function calling, hence the choice. Let us start by installing requirements:

In [35]:
#!pip install requests beautifulsoup4 lxml openai pypandoc
#!mkdir imgs # to store images locally

Now let's take a look at links from the main page:

In [141]:
import requests, openai, json
from bs4 import BeautifulSoup
from dotenv import load_dotenv,find_dotenv # to read OpenAI key from .env file
from openai import OpenAI
load_dotenv(find_dotenv())
client = OpenAI()

# Fetch TOC
toc_url = 'http://www.astronet.ru/db/msg/1170612/node1.html'
response = requests.get(toc_url)
soup = BeautifulSoup(response.content, "html.parser")
for link in soup.find_all('a'):
        str(link.get('href')).find('node')>0 and str(link.get('href')).find('msg')>0 and print(link.getText(), f"http://www.astronet.ru{link.get('href')}")

Оглавление http://www.astronet.ru/db/msg/1170612/node1.html
1. Введение http://www.astronet.ru/db/msg/1170612/node2.html
Оглавление http://www.astronet.ru/db/msg/1170612/node1.html
1. Введение. Пространственно-временные масштабы в астрофизике http://www.astronet.ru/db/msg/1170612/node2.html
1.1 Угловое и фотометрическое расстояния http://www.astronet.ru/db/msg/1170612/node3.html
1.2 Времена http://www.astronet.ru/db/msg/1170612/node4.html
1.3 Массы http://www.astronet.ru/db/msg/1170612/node5.html
1.4 Солнечные единицы http://www.astronet.ru/db/msg/1170612/node6.html
1.5 Планковские единицы http://www.astronet.ru/db/msg/1170612/node7.html
1.6 Безразмерные числа http://www.astronet.ru/db/msg/1170612/node8.html
2. Излучение. Основы теории переноса излучения http://www.astronet.ru/db/msg/1170612/node9.html
2.1 Уравнение переноса излучения излучения http://www.astronet.ru/db/msg/1170612/node10.html
2.1.1 Основные определения http://www.astronet.ru/db/msg/1170612/node10.html#SECTION003110000

The link naming scheme is a bit inconsistent, but we can deduce sections/subsections from the TOC text, overthise, the structure is clear, i.e. we can just loop over links and parse page by page. Translation itself is fairly easy and can be done with many tools, however, we'd like also to convert equations and variables to TeX format rather than have them as images, so parsing shall include also this step (figures shall stay, however, figures). Both the translation and recognition of equations can be done OpenAI visual API:

Before wrapping to a function, it makes sense to test conversion on a representative example first:

In [19]:
eq_url = "https://images.astronet.ru/pubd/2002/05/14/0001176797/img222.gif"
response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Convert image showing an equation typeset using Latex back to Latex code. Only return code itself without explanation or any other characters. Do not use [] to mark begin and en of equation, i.e. only output commands which will appear within equation itself"},
        {
          "type": "image_url",
          "image_url": {
            "url": eq_url,
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

In [78]:
from IPython.display import Markdown, display, HTML

display(Markdown(f"Original image: ![]({eq_url}) and rendering: $${response.choices[0].message.content}$$"))

Original image: ![](https://images.astronet.ru/pubd/2002/05/14/0001176797/img222.gif) and rendering: $$I_\nu(\tau_\nu) = I_\nu(0)e^{-\tau_\nu} + S_\nu(1-e^{-\tau_\nu}) = S_\nu + e^{-\tau_\nu}(I_\nu(0) - S_\nu)$$

Note that besides equation images, there are also some true images, i.e. figures. These we would like to download and convert to a more modern format, i.e. png to be able to insert into markdown. The difference between the two cases is obviously image size, i.e. all equations are less than 64 pixels in height:

In [254]:
from PIL import Image
import requests
import uuid

def parse_image(url, max_height=140, img_path="./imgs"):
    if url.find('pubd')<0:
        return '' # make sure that irrelevant images are stripped out
    im = Image.open(requests.get(url, stream=True).raw)
    if min(im.width,im.height)<max_height:
        response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
            {"type": "text", "text": "Convert image showing an equation typeset using Latex back to Latex code. Only return code itself without explanation or any other characters. Never use [] to mark begin and end of any equation, i.e. only output commands which will appear within equation itself"},
            {"type": "image_url","image_url": {"url": url,},},],}],
  max_tokens=300,)
        tex_code = response.choices[0].message.content
        if len(tex_code)<30:
            return f"${tex_code}$"
        else:
            return f"$${tex_code}$$"
    else:
        out_name = f"{img_path}/{uuid.uuid4().hex}_{url.split('/')[-1].replace('.gif','.png')}"
        im.save(out_name,'PNG')
        return f"![]({out_name})"

Now we can call this function as a GPT tool whenever image is encountered to get correct parsing. But first let's test with couple examples.

In [255]:
display(Markdown(parse_image(eq_url)))

$$I_\nu(\tau_\nu) = I_\nu(0)e^{-\tau_\nu} + S_\nu(1 - e^{-\tau_\nu}) = S_\nu + e^{-\tau_\nu}(I_\nu(0) - S_\nu)$$

In [256]:
display(Markdown(parse_image("https://images.astronet.ru/pubd/2002/05/14/0001176797/img235.gif")))

![](./imgs/9c39d3412606435191c71dd6301d9aec_img235.png)

Seems to work as intended, now we can wrap it up as a tool taking URL of the section as input:

In [261]:
def parse_url(page_url):
    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    # Extract main content which is between 2nd and 2nd to last hr tags
    all_hr_tags = soup.find_all('hr')
    first_hr = all_hr_tags[0]
    last_hr = all_hr_tags[-1]
    # Initialize a variable to collect the content
    content_between_hr = []
    # Start with the next sibling of the first <hr> tag
    element = first_hr.find_next_sibling()
    # Loop through siblings until we reach the last <hr>
    while element and element != last_hr:
        content_between_hr.append(str(element))  # Append the HTML of each element
        element = element.find_next_sibling()  # Move to the next sibling
    # Append the last <hr> tag itself if required
    content_between_hr.append(str(last_hr))
    # Join the collected HTML into a single string
    content_between_hr = ''.join(content_between_hr)
    tools = [
    {
        "type": "function",
        "function": {
            "name": "parse_image",
            "strict": True,
            "description": "Returns markdown code to include image at correct adress or alternative text in the output. Call this whenever you encounter image URL in the input and replace entire img tag with the output of the function.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "original image URL",
                    },
                },
                "required": ["url"],
                "additionalProperties": False,
            },
        }
    }]
    messages = [
    {"role": "system", "content": """You are highly qualified AI system which takes HTML input 
            containing scientific text in russian and converts it to markdown output in english.
            Strictly follow the following rules:
            1) Use language style appropriate for lecture on astrophysics for undergraduate students 
            2) stick as close as possible to the original text. 
            3) Use your domain knowledge to improve quality of translation. 
            4) Strip header and footer containing navigation links and return only body.
            5) Ensure that caption to figures is formatted as caption of figures, not as a section.
            6) Ensure that all equations are enclosed between $ or $$ so that they are rendered correctly in markdown
            7) Use the supplied tools to replace images in the input with already correctly formatted markdown code."""},
    {"role": "user", "content": content_between_hr}
    ]
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools
    )
    # run the tools to parse images
    tool_calls = response.choices[0].message.tool_calls
    if tool_calls:
        tool_responses = []
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            # Handle the parse_image tool call
            if function_name == 'parse_image':
                image_url = arguments['url']
                # Call your parse_image function with the image URL
                markdown_image = parse_image(image_url)
                # Add the result to the tool_responses
                tool_responses.append({
                    "role": "function",
                    "name": "parse_image",
                    "id": tool_call.id, 
                    "content": markdown_image
                })
        for tool_response in tool_responses:
            messages.append(tool_response)

        final_response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return final_response.choices[0].message.content
    else:
        return response.choices[0].message.content

In [252]:
test_out = parse_url('http://www.astronet.ru/db/msg/1170612/10lec/node2.html')

In [253]:
display(Markdown(test_out))


## 10. Cosmology

In this section, we will consider homogeneous and isotropic cosmological models of the Universe, first examined by A.A. Friedmann in 1922, which bear his name.

### 10.1 Friedmann Cosmology

The modern cosmological models are based on the **cosmological principle**, which posits that there should be no privileged observers in the Universe, similar to the principle of constancy of the speed of light or the principle of equivalence, both fundamental to general relativity. Sometimes this principle is referred to as the "Copernican principle," as it marks the first abandonment of the geocentric model of the cosmos. This principle implies that the global characteristics of the Universe are the same for any observer situated at any point on the hyper-surface of constant time.

Currently, this principle is confirmed with remarkable precision by astronomical observations of the homogeneity of matter distribution in the Universe over large scales (on the order of \(\sim 100 \, \text{Mpc}\)) and isotropy (absence of a preferred direction at the level of temperature fluctuations of the cosmic microwave background). This observation allows us to select a very narrow class of homogeneous isotropic spaces (the so-called Friedmann-Robertson-Walker models) from the vast range of conceivable mathematical models describing the Universe as a whole. For more details, refer to the exceptional monograph by S. Weinberg, *Gravitation and Cosmology*, Moscow: Mir, 1975, Chapter 13 and onward.

### 10.1.1 The Cosmological Principle

A brief outline of recent modern cosmological history can be traced through key observational and theoretical discoveries:

- **1910-1922**: V. Slipher, redshifts in the spectra of galaxies.
  ![](https://images.astronet.ru/pubd/2002/05/14/0001176797/10lec/img10.gif)
  
- **1916**: A. Einstein, General Theory of Relativity.

- **1922-24**: A. Friedmann, non-stationary solutions to Einstein's equations (Friedmann cosmological models).

- **1929**: E. Hubble, Hubble's Law for receding galaxies. The recession velocity of a galaxy is determined from the redshift, interpreting it through the Doppler effect.
  ![](https://images.astronet.ru/pubd/2002/05/14/0001176797/10lec/hubble_diag.gif)
  
- **1933**: F. Zwicky, dark matter in galaxy clusters.

- **1949**: Alfer, Bethe, Gamow - the "hot Universe" hypothesis ("Big Bang") and prediction of the isotropic cosmic background radiation with a thermal spectrum at a temperature of approximately \(T \sim 5 \, \text{K}\).

- **1965**: A. Penzius, R. Wilson - discovery of isotropic cosmic microwave background radiation with a temperature around \(3 \, \text{K}\).
  
  ![](https://images.astronet.ru/pubd/2002/05/14/0001176797/10lec/cmb_sp.gif)

- **1979-80**: A. Guss, A.A. Starobinsky, A.D. Linde, D.A. Kirzhnits - the "inflationary" Universe hypothesis.

- **1992-1993**: In space experiments "Relikt" (Russia) and "COBE" (USA), fluctuations in the relic radiation were detected on the order of \(10^{-5}\) in scales of about \(10\) degrees.

- **1998**: Hubble diagrams (dependence of apparent magnitude at peak brightness vs. redshift) for Type Ia Supernovae (thermonuclear explosions of white dwarfs near the Chandrasekhar limit) indicate that cosmic expansion is occurring with acceleration at great distances. This necessitates the introduction of a positive cosmological constant (Einstein, 1917) or a more complex form of matter (the so-called "dark energy" or "quintessence") with an equation of state that significantly contributes to the current energy density of the Universe and effectively generates anti-gravity on large scales.

- **2000**: Measurement of the angular spectrum of fluctuations in relic microwave radiation in BOOMERanG and MAXIMA experiments. The discovery of the first Doppler peak in the angular spectrum of fluctuations on scales of about \(1\) degree, predicted by A.D. Sakharov in 1967, provides evidence for flat (Euclidean) spatial geometry of the observed Universe with an accuracy of about \(10\%\) up to redshifts (the era of recombination).

### 10.1.2 "A Brief Course" on the History of 20th Century Cosmology

1. **Homogeneity:** Current understanding and observational data confirm the homogeneity of the Universe on large scales.
2. **Isotropy:** The isotropy of the cosmic microwave background points to the uniformity of initial conditions throughout the cosmos.

### 10.1.3 Hubble's Law

We will first consider the simplest homogeneous and isotropic cosmological models without a cosmological constant. Due to homogeneity, we can take a limited spherical region in space and observe its evolution. The external regions are irrelevant because the gravitational field created by matter outside the sphere (under strict spherical symmetry) is identically zero (Tolman, 1934, proven within the framework of general relativity). 

#### Note
In Newtonian gravity, the force is described by the equation \(F = G \frac{m_{1} m_{2}}{r^2}\), and within a hollow sphere, the gravitational force is also zero, consistent with Newtonian theory in a sufficiently weak gravitational field.

As derived from astronomical observations of galaxy spectra, the recession velocity of these galaxies from the observer is directly proportional to their distance:
\[
v = H r  \tag{10.1}
\]
Where \(v\) is the recession velocity, \(r\) is the distance, and \(H\) is Hubble's constant.

### Conclusion

Despite significant progress in modern cosmology, many important questions remain unanswered:
1. The issue of non-baryonic dark matter (the observable matter in the Universe is at most a few percent of the total gravitating mass).
2. The cosmological constant problem (why is the enormous vacuum energy not observed?) and the related "quintessence" problem.
3. The early Universe (quantum birth, arrow of time, cosmology on a 3-dimensional brane in higher-dimensional space, etc.).

Further exploration of these topics can significantly deepen our understanding of cosmic phenomena and the evolution of our Universe.

Now we can run loop the function over all pages:

In [269]:
full_markdown = ""

# this is needed to avoid double processing of links with ancors
urls = []
for link in soup.find_all('a'):
        if str(link.get('href')).find('node')>0 and str(link.get('href')).find('msg')>0:
            print(link.getText(), f"http://www.astronet.ru{link.get('href')}")
            url = f"http://www.astronet.ru{link.get('href').split('#')[0]}"
            if len(urls)==0:
                urls.append(url)
            elif url!=urls[-1]:
                urls.append(url)

for url in urls:
    print(f"Working on {url}")
    full_markdown += parse_url(url)

Оглавление http://www.astronet.ru/db/msg/1170612/node1.html
1. Введение http://www.astronet.ru/db/msg/1170612/node2.html
Оглавление http://www.astronet.ru/db/msg/1170612/node1.html
1. Введение. Пространственно-временные масштабы в астрофизике http://www.astronet.ru/db/msg/1170612/node2.html
1.1 Угловое и фотометрическое расстояния http://www.astronet.ru/db/msg/1170612/node3.html
1.2 Времена http://www.astronet.ru/db/msg/1170612/node4.html
1.3 Массы http://www.astronet.ru/db/msg/1170612/node5.html
1.4 Солнечные единицы http://www.astronet.ru/db/msg/1170612/node6.html
1.5 Планковские единицы http://www.astronet.ru/db/msg/1170612/node7.html
1.6 Безразмерные числа http://www.astronet.ru/db/msg/1170612/node8.html
2. Излучение. Основы теории переноса излучения http://www.astronet.ru/db/msg/1170612/node9.html
2.1 Уравнение переноса излучения излучения http://www.astronet.ru/db/msg/1170612/node10.html
2.1.1 Основные определения http://www.astronet.ru/db/msg/1170612/node10.html#SECTION003110000

As already mentioned, using ChatGPT is not really efficient, and full run takes about over an hour and a half and costs 3$ (gpt-4o-mini, mostly vision). Local LLM or some translation API plus pix2tex would be probably significantly faster (the bottleneck is use of vision API to digitize every equation and even variable). Anyway, now we can convert markdown to pdf for more portability, for instance, using pandoc (pandoc needs to be installed, not only pypandoc bindings):

In [290]:
import pypandoc

intro_text = """
# Lectures on General Astrophysics for Physicists
&nbsp;

&nbsp;

This lecture course serves as an introduction to modern observational and theoretical astrophysics. 
It is designed with the assumption that the reader has knowledge of general physics courses and 
some sections of theoretical physics. However, the first half of the course is quite accessible 
to junior students of natural sciences and well-prepared high school seniors.

This course was delivered from 1998 to 2001 for third-year students of the Physics Department 
at Moscow State University. The version presented here is from 2001. 
You can find versions from previous years and additional materials on the author website 
or on the website of the Physics Department at Moscow State University.

The english version is auto-translated from online russian version available
at [astronet](http://www.astronet.ru/db/msg/1170612/index.html) with the help of ChatGPT using
the following [script](https://github.com/doroshv/llm-examples/blob/main/translate_gpt.ipynb)


"""

clean_markdown = intro_text + full_markdown
# LLM left some mess with brackets enclosing equations
clean_markdown = clean_markdown.replace('\\(','$').replace('\\)','$').replace('\\[','$').replace('\\]','$').replace('\\$','$')
open('translated.md','w').write(clean_markdown)

371349

The resulting markdown is not always clean, i.e. there may be some errors in equations, etc which need to be fixed manually or in another LLM loop, especially for latex processing. In my experience these are not many, and once fixed one can produce a nice looking pdf via something like:

#!pandoc translated.md -o translated.pdf --pdf-engine=xelatex --verbose