## Extracting text from `pdf`s and other files

The [`textract` Python package](https://textract.readthedocs.io/en/latest/python_package.html#python-package) is, apparently, a good way of getting text out of `.pdf` files. Let us play around with it here:

In [3]:
saved_api_key = "YOUR GOOGLE TTS KEY"

In [2]:
import textract

ModuleNotFoundError: No module named 'textract'

Here, we will experiment with a simple file, with no images and almost entirely text. You can check out the file, and if you're a fan of algebra and rings, try the questions(!)

In [None]:
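rings_file = textract.process("ram_questions.pdf", method='pdfminer', encoding='utf8')
# textract.process returns bytes, so print() shows the raw b'...' representation with escape sequences
print(rings_file)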

Of course, it is always worth checking how up to date a post is, as the above is unreadable (and nothing like what the output should look like!). Back to searching: PyMuPDF has been suggested, so let's see if that is any better:

In [None]:
import fitz  # PyMuPDF is imported under the name `fitz`
doc_mr = fitz.open("ram_questions.pdf")
for page in doc_mr:
    print(page.get_text())

Yes, that's much better, and what I was looking for. We can now use this to grab the text from a `.pdf` and use it for whatever we need. The [documentation is here](https://pymupdf.readthedocs.io/en/latest/index.html); in particular, note the [`Page.get_text()` method](https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_text).

In [None]:
import requests

In [None]:
doc_mr[0].get_text()  # note that pages are zero-indexed, unlike the "natural" page numbering

In [None]:
print(doc_mr[0].get_text())

In [None]:
type(doc_mr[0].get_text())

Now, I'm trying to work with the [iSpeech API](https://www.ispeech.org/), but the email verification hasn't come through yet. Ah, well...

The [Python `gtts` package](https://pypi.org/project/gTTS/) allows you to convert text to speech from your own machine without an API key (it calls Google Translate's text-to-speech service behind the scenes, so it still needs an internet connection). This is quite convenient, as I don't find any of the online versions satisfactory for one reason or another (and I would really rather not spend cash on an API that I don't intend to use long term).

In [None]:
from gtts import gTTS
for page in doc_mr:
    output = gTTS(text=page.get_text(), lang='en', slow=False)
    output.save(f"ram_questions_page{page.number + 1}.mp3")  # pages count from zero, so add 1 to get the natural numbering

Great, but note that here we are saving the pages one by one. Is there a way to create one single file?

In [None]:
# join the text of every page into one string, separated by spaces
output_string = " ".join(page.get_text() for page in doc_mr)
output_long = gTTS(text=output_string, lang='en', slow=False)
output_long.save("rings_long_vers.mp3")

Note, of course, that the saving of the `.mp3`s and the processing of the `.pdf`s are intensive tasks which take a bit of time.

The `gTTS` package hasn't been perfect, and didn't seem to recognise the term "field" or "define" (lol), which may be down to how the source `.tex` file was compiled into a `.pdf`. However, it has recognised mathematical symbols and has spoken them correctly. Of course, some of the rendering is quite challenging to speak out (mathematics in general is like that, unfortunately).
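One plausible explanation (a guess, not something checked against this particular file) is that both "field" and "define" contain "fi", which LaTeX usually typesets as a single ligature glyph. If the extracted text carries that glyph as the Unicode character U+FB01 rather than the two letters, the speech engine may stumble over it. A minimal sketch of folding such characters back into plain letters before passing the text to `gTTS`; the helper name `normalise_for_speech` is made up for illustration:

```python
import unicodedata


def normalise_for_speech(text: str) -> str:
    """Fold compatibility characters (such as the one-glyph 'fi' ligature) into plain letters."""
    # NFKC replaces compatibility characters with their ordinary equivalents
    return unicodedata.normalize("NFKC", text)


# \ufb01 is the single-character "fi" ligature
sample = "A \ufb01eld is de\ufb01ned as a commutative division ring."
print(normalise_for_speech(sample))
```

Be aware that NFKC also folds some mathematical characters (double-struck letters, for example) into plain ones, so it is worth listening to the result before relying on it.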

---

While I'm not planning to actually use the Google API, as it requires a phone number (which can be used for more than just verification), I'll see how I would use it if I had access to it (you know, without needing to expose myself any more than necessary!)

In [None]:
gctts_url = f"https://texttospeech.googleapis.com/v1/text:synthesize?key={saved_api_key}"
# request body: audio format, voice and the text to synthesise
gctts_params = {
  "audioConfig": {
    "audioEncoding": "MP3"
  },
  "voice": {
    "languageCode": "en"
  },
  "input": {
    "text": output_string[:5000]  # there is a 5000 character limit to the api
  }
}
#gctts_headers = {"key" : saved_api_key}
resp = requests.post(url=gctts_url, json=gctts_params) #, headers = gctts_headers)
#print(resp.json()) # don't do this, you'll regret it (the response is one huge base64 string)
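Slicing with `[:5000]` simply throws away everything after the limit. A rough sketch of how the full text could instead be split into pieces that each fit in one request, breaking after sentences where possible. The function name `chunk_text` is hypothetical, and measuring the limit in UTF-8 bytes rather than characters is a conservative assumption on my part:

```python
import re


def chunk_text(text: str, max_bytes: int = 5000) -> list[str]:
    """Split text into pieces of at most max_bytes UTF-8 bytes, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate  # the sentence still fits in the current chunk
            continue
        if current:
            chunks.append(current)  # close the chunk that is already full
        current = ""
        # rebuild the sentence word by word, starting new chunks as needed
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate.encode("utf-8")) <= max_bytes:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = word  # a single over-long word is kept whole
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_bytes=9))
```

Each chunk could then be sent in its own request, with the returned audio pieces saved one after another.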

Now, we need to run a terminal command from here, which we can do with the `os` module.

In [None]:
import os

In [None]:
# write the base64-encoded audio to a text file; `with` closes the file automatically
with open("synthesize-text.txt", "w") as file_ttf:
    file_ttf.write(resp.json()["audioContent"])

In [None]:
os.system("base64 synthesize-text.txt --decode > synthesize-text-audio.mp3")

...and voila! This works as we want it to: the output is as listenable as the others!
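The intermediate text file and the shell call could also be skipped by decoding the base64 string directly in Python with the standard `base64` module. A small sketch; the helper name `write_audio` is made up here, and it assumes the response has the `audioContent` field used above:

```python
import base64


def write_audio(audio_content_b64: str, out_path: str) -> int:
    """Decode a base64 string and write the raw bytes to out_path; returns the byte count."""
    audio_bytes = base64.b64decode(audio_content_b64)
    with open(out_path, "wb") as audio_file:  # binary mode, since this is audio data
        return audio_file.write(audio_bytes)


# Usage with the response from above:
# write_audio(resp.json()["audioContent"], "synthesize-text-audio.mp3")
```

This avoids depending on a `base64` command being available in the terminal, which is handy if the notebook is run on a different operating system.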

It looks like bedtime for me then, goodnight...