## Extracting text from `pdf`s and other files

The [`textract` Python package](https://textract.readthedocs.io/en/latest/python_package.html#python-package) is, apparently, a good way of getting text out of `.pdf` files. Let us play around with it here:

In [1]:
import textract

Here, we will experiment with a simple file, with no images and almost entirely text. You can check out the file, and if you're a fan of algebra and rings, try the questions(!)

In [2]:
rings_file = textract.process("ram_questions.pdf", method='pdfminer', encoding='utf8')
print(rings_file)

b'Question 1 (2018 A1).\n\n\xe2\x80\xa2 De\xef\xac\x81ne the terms ring, integral domain and \xef\xac\x81eld.\n\n\xe2\x80\xa2 Is it true that every integral domain is a \xef\xac\x81eld? Justify your answer.\n\nQuestion 2 (2018 A2). Suppose that R is a ring and that R[X] is a polynomial ring in one\nindeterminate over R. Suppose that g \xe2\x88\x88 R[X] is a monic element and that f is any element of\nR[X]. Explain the process of long division of f by g. Illustrate your answer with the case where\nR = Z, g = X 2 \xe2\x88\x92 5 and f = 2X 3 + 1.\n\nQuestion 3 (2018 A3).\n\n\xe2\x80\xa2 If R, S are rings, de\xef\xac\x81ne the notion of a homomorphism from R to\n\nS.\n\n\xe2\x80\xa2 De\xef\xac\x81ne the notion of the kernel of a homomorphism.\n\xe2\x80\xa2 De\xef\xac\x81ne the notion of an ideal of a ring R.\n\xe2\x80\xa2 Is the set of all odd integers an ideal in the ring Z? Justify your answer.\n\nQuestion 4 (2016 A1).\n\n\xe2\x80\xa2 Show that if\n\nis an ascending chain of ideals, then
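(As an aside: the `b'...'` prefix on that printout means we are looking at a raw `bytes` object rather than a string. The snippet below is a minimal sketch of decoding such bytes as UTF-8; `sample` is just a short fragment copied from the printout, not a fresh call to `textract`.)

```python
# A short fragment of the bytes printed above, copied by hand for illustration.
sample = b"\xe2\x80\xa2 De\xef\xac\x81ne the terms ring, integral domain and \xef\xac\x81eld."

# Decoding turns the escape sequences back into characters:
# the bullet point and the single-character "fi" ligature used in the PDF.
decoded = sample.decode("utf-8")
print(decoded)
```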

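Of course, it is always worth checking how up to date a post is: as printed, the `textract` output is unreadable (and nothing like what the output should look like!), because `textract.process` returns `bytes` and `print` shows the escape sequences rather than the text. Back to searching: PyMuPDF has been suggested, so let's see if that is any better: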

In [3]:
import fitz
doc_mr = fitz.open("ram_questions.pdf")
for page in doc_mr:
    print(page.get_text())

Question 1 (2018 A1).
• Deﬁne the terms ring, integral domain and ﬁeld.
• Is it true that every integral domain is a ﬁeld? Justify your answer.
Question 2 (2018 A2). Suppose that R is a ring and that R[X] is a polynomial ring in one
indeterminate over R. Suppose that g ∈ R[X] is a monic element and that f is any element of
R[X]. Explain the process of long division of f by g. Illustrate your answer with the case where
R = Z, g = X2 − 5 and f = 2X3 + 1.
Question 3 (2018 A3).
• If R, S are rings, deﬁne the notion of a homomorphism from R to
S.
• Deﬁne the notion of the kernel of a homomorphism.
• Deﬁne the notion of an ideal of a ring R.
• Is the set of all odd integers an ideal in the ring Z? Justify your answer.
Question 4 (2016 A1).
• Show that if
I1 ⊆ I2 ⊆ I3 ⊆ · · ·
is an ascending chain of ideals, then �
n≥1 In is also an ideal of A.
• Consider the ideals (2) and (3) in the ring Z. Is (2) ∪ (3) also an ideal in Z?
Question 5 (2018 A4). Construct a homomorphism φ : Z[X] → C whose ke

Yes, that's much better, and what I was looking for. We can now use this to grab the text from a `.pdf` and use it for whatever we need. The [documentation is here](https://pymupdf.readthedocs.io/en/latest/index.html); in particular, note the [`Page.get_text()` method](https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_text).

In [4]:
import requests

In [5]:
doc_mr[0].get_text()  # note that page indices are zero-based, unlike the "natural" page numbering

'Question 1 (2018 A1).\n• Deﬁne the terms ring, integral domain and ﬁeld.\n• Is it true that every integral domain is a ﬁeld? Justify your answer.\nQuestion 2 (2018 A2). Suppose that R is a ring and that R[X] is a polynomial ring in one\nindeterminate over R. Suppose that g ∈ R[X] is a monic element and that f is any element of\nR[X]. Explain the process of long division of f by g. Illustrate your answer with the case where\nR = Z, g = X2 − 5 and f = 2X3 + 1.\nQuestion 3 (2018 A3).\n• If R, S are rings, deﬁne the notion of a homomorphism from R to\nS.\n• Deﬁne the notion of the kernel of a homomorphism.\n• Deﬁne the notion of an ideal of a ring R.\n• Is the set of all odd integers an ideal in the ring Z? Justify your answer.\nQuestion 4 (2016 A1).\n• Show that if\nI1 ⊆ I2 ⊆ I3 ⊆ · · ·\nis an ascending chain of ideals, then �\nn≥1 In is also an ideal of A.\n• Consider the ideals (2) and (3) in the ring Z. Is (2) ∪ (3) also an ideal in Z?\nQuestion 5 (2018 A4). Construct a homomorphism φ

In [6]:
print(doc_mr[0].get_text())

Question 1 (2018 A1).
• Deﬁne the terms ring, integral domain and ﬁeld.
• Is it true that every integral domain is a ﬁeld? Justify your answer.
Question 2 (2018 A2). Suppose that R is a ring and that R[X] is a polynomial ring in one
indeterminate over R. Suppose that g ∈ R[X] is a monic element and that f is any element of
R[X]. Explain the process of long division of f by g. Illustrate your answer with the case where
R = Z, g = X2 − 5 and f = 2X3 + 1.
Question 3 (2018 A3).
• If R, S are rings, deﬁne the notion of a homomorphism from R to
S.
• Deﬁne the notion of the kernel of a homomorphism.
• Deﬁne the notion of an ideal of a ring R.
• Is the set of all odd integers an ideal in the ring Z? Justify your answer.
Question 4 (2016 A1).
• Show that if
I1 ⊆ I2 ⊆ I3 ⊆ · · ·
is an ascending chain of ideals, then �
n≥1 In is also an ideal of A.
• Consider the ideals (2) and (3) in the ring Z. Is (2) ∪ (3) also an ideal in Z?
Question 5 (2018 A4). Construct a homomorphism φ : Z[X] → C whose ke

In [7]:
type(doc_mr[0].get_text())

str

Now, I'm trying to work with the [iSpeech API](https://www.ispeech.org/), but the email verification hasn't come through yet. Ah, well...

The [Python `gtts` package](https://pypi.org/project/gTTS/) allows you to convert text to speech without signing up for anything: it is a wrapper around Google Translate's text-to-speech service, so it needs an internet connection but no API key. That is quite convenient, as I have not found any of the online versions satisfactory for one reason or another (and I would really rather not spend cash on an API that I do not intend to use long term).

In [8]:
from gtts import gTTS
for page in doc_mr:
    output = gTTS(text = page.get_text(), lang = 'en', slow = False)
    output.save(f"ram_questions_page{page.number + 1}.mp3") # again recall the pages start counting from zero, so add 1 to get correct numbering 

Great, but note that here we are saving the pages one by one. Is there a way to create one single file?

In [11]:
# join the text of every page into one string, separated by spaces
output_string = " ".join(page.get_text() for page in doc_mr)
output_long = gTTS(text = output_string, lang = 'en', slow = False)
output_long.save("rings_long_vers.mp3")

Note, of course, that saving the `.mp3`s (which involves network requests) and processing the `.pdf`s both take a bit of time.

The `gTTS` output hasn't been perfect: it didn't seem to recognise the words "field" or "define" (lol). This is most likely because the extracted text contains the single-character ligature "ﬁ" (visible in the output above as "Deﬁne" and "ﬁeld") rather than the separate letters "f" and "i", which comes down to how the source `.tex` file was compiled into a `.pdf`. However, it has recognised the mathematical symbols and spoken them correctly. Of course, some of the rendering is quite challenging to speak out (mathematics in general is like that, unfortunately).
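One possible way around the "ﬁ" ligature in the extracted text ("Deﬁne", "ﬁeld") is to normalise the string before handing it to `gTTS`. The sketch below uses Unicode NFKC normalisation from the standard library; the helper name `expand_ligatures` is made up for this example, and I have not re-run the audio with it, so treat it as an idea rather than a confirmed fix.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters (such as the "fi" ligature) with plain equivalents."""
    # NFKC maps U+FB01 (the single-character "fi" ligature) to the two letters "f" + "i".
    # It also rewrites other compatibility characters, so check the result on real text.
    return unicodedata.normalize("NFKC", text)


# A line in the same shape as the extracted text, with the ligature written as an escape.
sample = "De\ufb01ne the terms ring, integral domain and \ufb01eld."
cleaned = expand_ligatures(sample)
print(cleaned)
```

In the notebook this would be applied to `page.get_text()` (or to `output_string`) before the call to `gTTS`.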

---

While I'm not planning to actually use the Google API, as it requires a phone number (which can be used for more than just verification), I'll see how I would use it if I had access to it (you know, without needing to expose myself any more than necessary!)

In [12]:
saved_api_key = "YOUR_API_KEY"  # placeholder: keep real keys out of notebooks you share
gctts_url = f"https://texttospeech.googleapis.com/v1/text:synthesize?key={saved_api_key}"
gctts_paramets = {
  "audioConfig": {
    "audioEncoding": "MP3"
  },
  "voice": {
    "languageCode": "en"
  },
  "input": {
    "text": output_string[:5000]
  }
} # there is a 5000 character limit to the api
#gctts_headers = {"key" : saved_api_key}
resp = requests.post(url = gctts_url, json = gctts_paramets) #, headers = gctts_headers)
#print(resp.json()) # don't do this, you'll regret it
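Slicing with `output_string[:5000]` simply drops everything past the limit. A rough alternative, assuming the 5000-character limit noted in the comment above, is to split the text into chunks on whitespace and send one request per chunk. The helper `split_for_tts` below is hypothetical (it is not part of any Google client), and if the limit is actually counted in bytes a smaller `limit` would be needed for non-ASCII text.

```python
import textwrap


def split_for_tts(text: str, limit: int = 5000) -> list[str]:
    """Split text into whitespace-delimited chunks of at most `limit` characters."""
    # textwrap.wrap breaks on whitespace and never produces a chunk longer than `width`.
    # Newlines are turned into spaces, which should be harmless for speech.
    return textwrap.wrap(text, width=limit)


# Stand-in for output_string, so the sketch runs without the PDF.
demo_text = "ring ideal field " * 1000
chunks = split_for_tts(demo_text)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk would then go into its own `"input": {"text": chunk}` request, with the resulting audio pieces saved or joined afterwards.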

Now, we need to run a terminal command from here, which we can do with the `os` module.

In [16]:
import os

In [14]:
# the "audioContent" field of the response holds the audio as base64 text
with open("synthesize-text.txt", "w") as file_ttf:
    file_ttf.write(resp.json()["audioContent"])

In [15]:
os.system("base64 synthesize-text.txt --decode > synthesize-text-audio.mp3")

0

...and voilà! This works as we want it to: the output is as listenable as the others!
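The trip through a text file and the `base64` command could probably be skipped by decoding inside Python with the standard `base64` module. This is a minimal sketch rather than what was run above: `decode_audio_content` is a made-up helper, and the payload is a stand-in so that the snippet runs without an API response.

```python
import base64


def decode_audio_content(audio_content: str) -> bytes:
    """Decode the base64 text from the response's "audioContent" field into raw bytes."""
    return base64.b64decode(audio_content)


# Stand-in for resp.json()["audioContent"]; a real response would hold MP3 data.
fake_audio_content = base64.b64encode(b"not really an mp3").decode("ascii")
audio_bytes = decode_audio_content(fake_audio_content)
print(audio_bytes)
```

With a real response, `audio_bytes` would be written to a file opened in binary mode (`"wb"`), for example `synthesize-text-audio.mp3`.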

It looks like bedtime for me then, goodnight...