# Translate a book written in LaTeX from Slovenian into English

With the author's permission, we will demonstrate how to translate the book [Euclidean Plane Geometry](https://sites.google.com/site/projektivna/), written by Milan Mitrović, from Slovenian into English without modifying any of the LaTeX commands.

To achieve this, we will first split the book into chunks, each roughly a page long, then translate each chunk into English, and finally stitch them back together.

In [4]:
!pip install "openai<1"  # the code below uses the legacy openai.Completion interface



In [5]:
import os

import openai

# read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

## 1. Read in the data

In [6]:
!pip install transformers



In [3]:

from transformers import GPT2Tokenizer

# OpenAI GPT-2 tokenizer is the same as GPT-3 tokenizer
# we use it to count the number of tokens in the text
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

with open("data/geometry_slovenian.tex", "r", encoding="utf-8") as f:
    text = f.read()



In [7]:
print(len(text))


1485565


In [7]:
print(text[:1000])

\documentclass[11pt]{book}

%PAPIR - US TRADE
\paperwidth 15.24cm
\paperheight 22.86cm

%TEKST
\textwidth 11.9cm \textheight 19.4cm
\oddsidemargin=-0.5cm
\evensidemargin=-1.2cm
\topmargin=-15mm

\headheight=13.86pt

%\usepackage[slovene]{babel}
\usepackage[english]{babel}

%\usepackage[cp1250]{inputenc}
\usepackage[utf8]{inputenc}


\usepackage[T1]{fontenc}
\usepackage{amsmath}
\usepackage{color}
\usepackage{amsfonts}
\usepackage{makeidx}
\usepackage{calc}
\usepackage{gclc}
%\usepackage[dvips]{hyperref}
\usepackage{amssymb}
\usepackage[dvips]{graphicx}
\usepackage{fancyhdr}

%za slike
\usepackage{caption}
\DeclareCaptionFormat{empty}

\def\contentsname{Vsebina}

\makeindex

\newcommand{\ch}{\mathop {\mathrm{ch}}}
\newcommand{\sh}{\mathop {\mathrm{sh}}}
\newcommand{\tgh}{\mathop {\mathrm{th}}}
\newcommand{\tg}{\mathop {\mathrm{tg}}}
\newcommand{\ctg}{\mathop {\mathrm{ctg}}}
\newcommand{\arctg}{\mathop {\mathrm{arctg}}}
\newcommand{\arctgh}{\mathop {\mathrm{arcth}}}

\def\indexname{Indek

### 1.1 Count the tokens in each chunk

In [8]:
print(len(text))
chunks = text.split('\n\n')
print(len(chunks))
ntokens = []
for chunk in chunks:
    ntokens.append(len(tokenizer.encode(chunk)))
max(ntokens)
print(len(ntokens))

1485565
5877


Token indices sequence length is longer than the specified maximum sequence length for this model (1327 > 1024). Running this sequence through the model will result in indexing errors


5877


In [9]:
print(ntokens[:10])

[10, 24, 50, 8, 25, 25, 110, 22, 12, 3]


In [11]:
print(ntokens[0], chunks[0])
print(ntokens[1], chunks[1])
print(ntokens[2], chunks[2])
print(ntokens[3], chunks[3])

10 \documentclass[11pt]{book}
24 %PAPIR - US TRADE
\paperwidth 15.24cm
\paperheight 22.86cm
50 %TEKST
\textwidth 11.9cm \textheight 19.4cm
\oddsidemargin=-0.5cm
\evensidemargin=-1.2cm
\topmargin=-15mm
8 \headheight=13.86pt


A double newline turns out to be a good separator here, because it does not break the flow of the text. No individual chunk is larger than 1500 tokens, and the model we will use, text-davinci-002, has a limit of 4096 tokens, so we don't need to break the chunks down further.
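The 4096-token limit covers the prompt and the completion together, so the budget is worth checking once. The sketch below is a rough estimate only: `fits_context` is a hypothetical helper, the `overhead` of 100 tokens for the instruction and sample commands is an assumed allowance rather than a measured value, and `max_completion=1500` mirrors the `max_tokens` setting used in `translate_chunk` further down.

```python
def fits_context(chunk_tokens, max_completion=1500, overhead=100, context_limit=4096):
    """Return True if the prompt (chunk + overhead) plus the completion fits the window.

    `overhead` is an assumed allowance for the instruction and the sample
    commands; it is an illustrative value, not a measured one.
    """
    prompt_tokens = chunk_tokens + overhead
    return prompt_tokens + max_completion <= context_limit


# The largest chunk size mentioned in the text: 1500 + 100 + 1500 = 3100
print(fits_context(1500))
# A hypothetical oversized chunk: 3000 + 100 + 1500 = 4600
print(fits_context(3000))
```

Under these assumptions a 1500-token chunk leaves room to spare, which is consistent with not splitting the chunks any further.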

We will group the shorter chunks into batches of around 1000 tokens, which makes the text more coherent and reduces the number of breaks within it.

In [10]:
ns = [ "a", "b", "c", "d" ]
ls = [ 1, 2, 3, 4, 5 ]
for n, l in zip(ns, ls):
    print(n,l)

for i in range(len(ns)):
    print(ns[i], ls[i])


a 1
b 2
c 3
d 4


In [11]:
def group_chunks(chunks, ntokens, max_len=1000):
    """
    Group very short chunks to form chunks that are approximately a page long.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0

    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        cur_tokens += ntoken + 2  # +2 for the newlines between chunks

        # cur_tokens already includes this chunk, so ntoken is counted twice here:
        # a batch is closed somewhat before it reaches max_len
        if ntoken + cur_tokens > max_len:
            batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken
        else:
            cur_batch += "\n\n" + chunk
    batches.append(cur_batch)
    return batches

chunks = group_chunks(chunks, ntokens)
len(chunks)

1135
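To see the grouping on something small, here is a sketch of a stricter variant that tests the budget before adding a chunk and does not prepend a separator to the first chunk of a batch. The name `group_chunks_strict`, the `sep_tokens` parameter, and the toy token counts are illustrative assumptions, not part of the run above; the allowance of 2 tokens per separator mirrors `group_chunks`.

```python
def group_chunks_strict(chunks, ntokens, max_len=1000, sep_tokens=2):
    """Greedily pack chunks into batches whose estimated size stays within max_len."""
    batches = []
    cur_parts = []   # chunks collected for the current batch
    cur_tokens = 0   # estimated tokens in the current batch

    for chunk, ntoken in zip(chunks, ntokens):
        # a separator is only needed if the batch already holds a chunk
        extra = ntoken + (sep_tokens if cur_parts else 0)
        if cur_parts and cur_tokens + extra > max_len:
            # adding this chunk would exceed the budget: close the batch first
            batches.append("\n\n".join(cur_parts))
            cur_parts = [chunk]
            cur_tokens = ntoken
        else:
            cur_parts.append(chunk)
            cur_tokens += extra

    if cur_parts:
        batches.append("\n\n".join(cur_parts))
    return batches


toy_chunks = ["a", "b", "c", "d"]
toy_tokens = [4, 3, 5, 2]
print(group_chunks_strict(toy_chunks, toy_tokens, max_len=10))
```

A chunk that is larger than `max_len` on its own still ends up in a batch by itself, since it cannot be split at this stage.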

Notice that adding a sample untranslated and translated first command, where only the content of the chapter name needs to be translated, helps to get more consistent results.

The format of the prompt sent to the model consists of:
1. A high-level instruction to translate only the text, but not the commands, into the desired language
2. A sample untranslated command, where only the content of the chapter name needs to be translated
3. The chunk of text to be translated
4. The translated sample command from 2, which shows the model the beginning of the translation process

The expected output is the translated chunk of text.
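The four parts can be assembled as in the sketch below. `build_prompt` is a hypothetical helper introduced only to make the layout visible; it sends nothing to the API, and the one-line chunk is a shortened fragment used for illustration.

```python
def build_prompt(chunk, dest_language, sample_source, sample_target):
    """Assemble the four prompt parts in order (hypothetical helper, no API call)."""
    instruction = (
        f"Translate only the text from the following LaTeX document into {dest_language}. "
        "Leave all LaTeX commands unchanged"
    )
    return (
        f"{instruction}\n\n"            # 1. high-level instruction
        f'"""\n{sample_source}\n'       # 2. untranslated sample command
        f'{chunk}"""\n\n'               # 3. chunk to translate
        f"{sample_target}\n"            # 4. translated sample command
    )


# raw strings keep the LaTeX backslashes literal
sample_source = r"\poglavje{Osnove Geometrije} \label{osn9Geom}"
sample_target = r"\poglavje{The basics of Geometry} \label{osn9Geom}"
demo_chunk = r"Naj bo $ABCD$ trapez."

prompt = build_prompt(demo_chunk, "English", sample_source, sample_target)
print(prompt)
```

Ending the prompt with the translated sample command means the model continues from a point where the translation has already begun.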

In [12]:
print(len(chunks))

1135


In [13]:
print(chunks[0])




\documentclass[11pt]{book}

%PAPIR - US TRADE
\paperwidth 15.24cm
\paperheight 22.86cm

%TEKST
\textwidth 11.9cm \textheight 19.4cm
\oddsidemargin=-0.5cm
\evensidemargin=-1.2cm
\topmargin=-15mm

\headheight=13.86pt

%\usepackage[slovene]{babel}
\usepackage[english]{babel}

%\usepackage[cp1250]{inputenc}
\usepackage[utf8]{inputenc}


\usepackage[T1]{fontenc}
\usepackage{amsmath}
\usepackage{color}
\usepackage{amsfonts}
\usepackage{makeidx}
\usepackage{calc}
\usepackage{gclc}
%\usepackage[dvips]{hyperref}
\usepackage{amssymb}
\usepackage[dvips]{graphicx}
\usepackage{fancyhdr}

%za slike
\usepackage{caption}
\DeclareCaptionFormat{empty}

\def\contentsname{Vsebina}

\makeindex

\newcommand{\ch}{\mathop {\mathrm{ch}}}
\newcommand{\sh}{\mathop {\mathrm{sh}}}
\newcommand{\tgh}{\mathop {\mathrm{th}}}
\newcommand{\tg}{\mathop {\mathrm{tg}}}
\newcommand{\ctg}{\mathop {\mathrm{ctg}}}
\newcommand{\arctg}{\mathop {\mathrm{arctg}}}
\newcommand{\arctgh}{\mathop {\mathrm{arcth}}}

\def\indexname{Ind

In [35]:
print(chunks[1])


\newcommand{\baksiom}{\color{viol3}\begin{aksiom}}
\newcommand{\eaksiom}{\end{aksiom}\normalcolor}

\newcommand{\bzgled}{\color{green1}\begin{zgled}}
\newcommand{\ezgled}{\end{zgled}\normalcolor}

\newcommand{\bnaloga}{\color{red}\begin{naloga}}
\newcommand{\enaloga}{\end{naloga}\normalcolor}

\newcommand{\btrditev}{\color{blue}\begin{trditev}}
\newcommand{\etrditev}{\end{trditev}\normalcolor}


\newcommand{\del}[1]{\chapter{#1}}
\newcommand{\poglavje}[1]{\section{#1}}
%\newcommand{\naloge}[1]{\color{red}\section*{#1}\normalcolor}
\newcommand{\naloge}[1]{\section{#1}}
\newcommand{\ppoglavje}[1]{\subsection{#1}}

\setlength\arraycolsep{2pt}

\author{Milan Mitrovi\'c}
\title{\textsl{\Huge{\textbf{Euclidean Plane Geometry}}}}

\date{}

%_________________________________________________________________________________________

\begin{document}
\pagestyle{fancy}
\lhead[\thepage]{\textsl{\nouppercase{\rightmark}}}
\rhead[\textsl{\nouppercase{\leftmark}}]{\thepage}
\cfoot[]{}


 \vspace*{-12m

In [37]:
print(chunks[2])


The first two chapters deal with the history and axiomatic design of geometry.
The consequences of the axioms of incidence,  congruence and parallelism are discussed in detail, while in the other two groups (axioms of order and continuity) the consequences are mostly not proven.
Chapters three and four deal with the relation of the congruence of figures, the use of the triangle congruence theorems, and a circle.
In the fifth chapter, a vectors are defined. Thales's theorem of proportion is proven.
Chapter six deals with isometries and their use. Their classification has been performed.
Chapters 7 and 8 deal with similarity transformations, figure similarity relation, and area of figures. The ninth chapter presents the inversion.
At the end of each chapter (except the introductory one) are exercises. Solutions and instructions can be found in the last, tenth chapter.


The book contains 341 theorems, 247 examples and 418 solved problems (28 of them from the IMO). In this sense, the book

In [16]:
def translate_chunk(chunk, engine='text-davinci-002',
                    dest_language='English',
                    # raw strings keep the LaTeX backslashes literal
                    sample_translation=(r"\poglavje{Osnove Geometrije} \label{osn9Geom}", r"\poglavje{The basics of Geometry} \label{osn9Geom}")
                    ):
    prompt = f'''Translate only the text from the following LaTeX document into {dest_language}. Leave all LaTeX commands unchanged
    
"""
{sample_translation[0]}
{chunk}"""

{sample_translation[1]}
'''
    response = openai.Completion.create(
        prompt=prompt,
        engine=engine,
        temperature=0,
        top_p=1,
        max_tokens=1500,
    )
    result = response['choices'][0]['text'].strip()
    result = result.replace('"""', '')  # remove the triple quotes, as we used them to surround the text
    return result

print(translate_chunk(chunks[800], engine='text-davinci-002', dest_language='English'))

\textbf{\textit{Proof.}} Let $ABCD$ be a trapezoid with bases $AB$ and $CD$ of lengths $|AB|=a$ and $|CD|=c$ and a height of length $v$ (Figure \ref{sl.plo.8.2.4.pic}).
With $S$ we mark the center of the leg $BC$ and $\mathcal{S}_S:\hspace*{1mm}A,D\mapsto A', D'$. Because $S$ is the common center of the lines $AA'$ and $DD'$, by Theorem \ref{paralelogram} $AD'A'D$ is a parallelogram. Because $\mathcal{S}_S(B)=C$ and $\mathcal{B}(A,B,D')$, also $\mathcal{B}(A',C,D)$ is. The isometry $\mathcal{S}$ maps the trapezoid $ABCD$ to the similar trapezoid $A'CBD'$, therefore $|BD'|=|CD|=c$ or $|AD'|=a+c$, and by Theorem \ref{ploscGlavniIzrek} \textit{3)} also $p_{ABCD}=p_{A'CBD'}$. So the parallelogram $AD'A'D$ with base $AD'$ of length $|AD'|=a+c$ and a height equal to the height of the trapezoid $ABCD$ of length $v$, is divided into two similar trapezoids $ABCD$ and $A'CBD'$ with equal areas.

From this and Theorems \ref{ploscGlavniIzrek} \textit{4)}, \ref{ploscDaljice} and \ref{ploscParal} it

In [15]:
print(chunks[800])


 \textbf{\textit{Proof.}} Naj bo $ABCD$ trapez z osnovnicama $AB$ in $CD$ dolžin $|AB|=a$ in $|CD|=c$ ter višino dolžine $v$
 (Figure \ref{sl.plo.8.2.4.pic}).
 S $S$ označimo središče kraka $BC$ in $\mathcal{S}_S:\hspace*{1mm}A,D\mapsto A', D'$. Ker je $S$ skupno središče daljic $AA'$ in $DD'$, je po izreku \ref{paralelogram} $AD'A'D$ paralelogram. Ker je še $\mathcal{S}_S(B)=C$ in $\mathcal{B}(A,B,D')$, je tudi $\mathcal{B}(A',C,D)$. Izometrija $\mathcal{S}$ preslika trapez $ABCD$ v skladni trapez $A'CBD'$, zato je $|BD'|=|CD|=c$ oz. $|AD'|=a+c$, in po izreku \ref{ploscGlavniIzrek} \textit{3)} še $p_{ABCD}=p_{A'CBD'}$. Torej je paralelogram $AD'A'D$ z osnovnico $AD'$ dolžine $|AD'|=a+c$ in višino, ki je enaka višini trapeza $ABCD$ dolžine $v$, razdeljen na dva skladna trapeza $ABCD$ in $A'CBD'$ z enakima ploščinama.

Iz tega in izrekov \ref{ploscGlavniIzrek} \textit{4)}, \ref{ploscDaljice} in \ref{ploscParal} sledi:
 \begin{eqnarray*}
 2\cdot p_{ABCD}=p_{ABCD}+p_{A'CBD'}=p_{AD'A'D}=\

We can see here that, for this chunk, only the text is translated while the LaTeX commands are left intact.
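A quick automated check can back up this observation across more chunks. The sketch below compares the multiset of LaTeX control words (a backslash followed by letters) in a source chunk and in its translation. `latex_commands` and `commands_preserved` are hypothetical helper names, the two strings are shortened from the chunk shown above, and the regular expression is a simplification that ignores control symbols such as `\'`.

```python
import re
from collections import Counter

# a backslash followed by letters, optionally starred (e.g. \hspace*)
COMMAND_RE = re.compile(r"\\[A-Za-z]+\*?")


def latex_commands(text):
    """Count every LaTeX control word that occurs in the text."""
    return Counter(COMMAND_RE.findall(text))


def commands_preserved(source, translation):
    """True if both texts contain the same control words the same number of times."""
    return latex_commands(source) == latex_commands(translation)


source = r"\textbf{\textit{Proof.}} Naj bo $ABCD$ trapez (Figure \ref{sl.plo.8.2.4.pic})."
translated = r"\textbf{\textit{Proof.}} Let $ABCD$ be a trapezoid (Figure \ref{sl.plo.8.2.4.pic})."
broken = r"\textbf{Proof.} Let $ABCD$ be a trapezoid (Figure \ref{sl.plo.8.2.4.pic})."

print(commands_preserved(source, translated))  # same commands on both sides
print(commands_preserved(source, broken))      # \textit was dropped
```

The check says nothing about the quality of the translated text; it only flags chunks where a command was dropped, added, or renamed.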

Let's now translate all the chunks in the book - this will take 2-3 hours, as we're processing requests sequentially.
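If the sequential run is too slow, note that the requests are independent of each other and could in principle be issued concurrently. The sketch below is one possible way to do that with a thread pool while keeping the chunk order. `fake_translate` is a stand-in for `translate_chunk` so that the example runs without API access, the toy strings are made up, and `max_workers=4` is an arbitrary assumption that would have to respect the API's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_translate(chunk):
    """Stand-in for translate_chunk: upper-cases the text instead of calling the API."""
    return chunk.upper()


def translate_all(chunks, translate_fn, max_workers=4):
    """Apply translate_fn to every chunk concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which request finishes first
        return list(pool.map(translate_fn, chunks))


demo_chunks = ["prvi del", "drugi del", "tretji del"]
print("\n\n".join(translate_all(demo_chunks, fake_translate)))
```

The cell below keeps the simpler sequential loop.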

In [39]:
dest_language = "English"

translated_chunks = []
for i, chunk in enumerate(chunks):
    print(f"{i+1} / {len(chunks)}")
    # translate each chunk
    translated_chunks.append(translate_chunk(chunk, engine='text-davinci-002', dest_language=dest_language))

# join the chunks together
result = '\n\n'.join(translated_chunks)

# save the final result (UTF-8, so that non-ASCII characters are written correctly)
with open(f"data/geometry_{dest_language}.tex", "w", encoding="utf-8") as f:
    f.write(result)

