# Typical forms for our own data

## Text files

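We've read text files many times.

So far we've always assumed the default encoding would work. That default can depend on the platform, so a file written on another machine may not decode correctly.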

In [None]:
# Read the whole file into one string, using the default encoding
with open("corpora/genesis.txt") as text_f:
    txt = text_f.read()
txt[:500]

In [None]:
list("\n".encode())
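
To see why the encoding matters, here is a small sketch that writes a file as UTF-8 and reads it back two ways. The file name `sample.txt` and the sample string are made up for illustration. The point is only that passing `encoding=` explicitly means you don't depend on whatever the default happens to be.

```python
import os
import tempfile

# Made-up sample text: "café" (with a non-ASCII "é") followed by a newline
sample = "caf\u00e9\n"

# Write the sample to a temporary file as UTF-8
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading with the same encoding gives back the original string
with open(path, encoding="utf-8") as f:
    as_utf8 = f.read()

# Reading with a different encoding raises no error here, but garbles the "é"
with open(path, encoding="latin-1") as f:
    as_latin1 = f.read()

print(as_utf8 == sample)
print(as_latin1 == sample)
print(list(sample.encode("utf-8")))
```

Note that the wrong encoding does not always raise an error. Sometimes you just get the wrong characters, which is harder to notice.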

## CSV files

The `csv` module is part of the standard Python distribution.

CSV stands for "comma-separated values".

In [None]:
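import csv

# newline='' lets the csv module handle line endings itself
with open('corpora/tekno_flat.csv', newline='') as csvfile:
    tekno_reader = csv.DictReader(csvfile)
    # Each row is a dict keyed by the column names in the header row
    data_list = list(tekno_reader)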

In [None]:
data_list[0]
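
One thing to watch for: `csv.DictReader` hands back every value as a string, including the numbers. Here is a minimal sketch using made-up CSV text in place of a real file (the column names `word` and `count` are invented for the example).

```python
import csv
import io

# Made-up CSV text standing in for a file on disk
csv_text = "word,count\nalpha,3\nbeta,5\n"

# DictReader uses the first row as the keys for every following row
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Every value comes back as a string, so convert numbers explicitly
total = sum(int(row["count"]) for row in rows)

print(rows[0])
print(total)
```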

## Excel spreadsheets

You can convert these to `csv` files, but then every sheet ends up in a separate file, and you may lose information such as formulas and formatting.

An alternative is the `openpyxl` library, which lets you work directly with spreadsheets and is pretty intuitive.

https://openpyxl.readthedocs.io/en/stable/

In [None]:
import openpyxl

In [None]:
from openpyxl import load_workbook

In [None]:
wb = load_workbook('corpora/tekno_fractions_nona.xlsx')

In [None]:
print(wb.sheetnames)

In [None]:
ws = wb["teknoclip7_1"]

In [None]:
ws["A1"]

In [None]:
ws.cell(row=1, column=1)

In [None]:
ws["A2"].value

In [None]:
for c in ws["C"]:
    print(c.value)
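
The cells above address the sheet in two ways: `ws["A1"]` uses a letter-and-number coordinate, while `ws.cell(row=1, column=1)` uses numbers for both. The sketch below shows how the two naming schemes relate. `cell_to_row_col` is a hypothetical helper written for this explanation, not part of openpyxl. The library has its own utilities for this (for example `column_index_from_string` in `openpyxl.utils`), so you shouldn't need to write this yourself.

```python
import re

def cell_to_row_col(coordinate):
    """Convert a coordinate such as "C2" to a 1-based (row, column) pair.

    Hypothetical helper for illustration; not part of openpyxl.
    """
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", coordinate)
    if match is None:
        raise ValueError(f"not a cell coordinate: {coordinate!r}")
    letters, digits = match.groups()
    # Column letters work like a base-26 number with digits A=1 ... Z=26
    column = 0
    for letter in letters.upper():
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return int(digits), column

print(cell_to_row_col("A1"))
print(cell_to_row_col("C2"))
print(cell_to_row_col("AA10"))
```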

## PDF files

PDF files are generally pretty unpleasant to deal with.

`pdfminer` is a widely used third-party library (it is not part of the Python standard library) that is geared toward extracting the text from PDF files.
It's powerful but not very intuitive.

https://euske.github.io/pdfminer/

The current official version of pdfminer doesn't work with Python 3.
But this "fork" of it does:

https://github.com/pdfminer/pdfminer.six

Here's a random page that's slightly helpful:

https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

### Command line usage

There are two ways to use pdfminer. One is from the command line in a terminal window.
This is easier, but it won't always get you what you want.

``` 
pdf2txt.py -o output.txt -c "ascii" corpora/mtms2008-09-122a.pdf
```

As an aside: you can run command-line commands from within Python using `subprocess`.

In [None]:
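import subprocess
import os

def extract_text(fpath, outdir):
    """Run pdf2txt.py on fpath and write the text to outdir/<name>.txt."""
    # splitext keeps any extra dots in the file name, unlike split(".")[0]
    newname = os.path.splitext(os.path.basename(fpath))[0] + ".txt"
    os.makedirs(outdir, exist_ok=True)  # make sure the output directory exists
    subprocess.call(["pdf2txt.py", "-o", os.path.join(outdir, newname), "-c", "ascii", fpath])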

In [None]:
fpath = "corpora/mtms2008-09-122a.pdf"
extract_text(fpath, "pdf_extracts")
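
The `extract_text` function above uses `subprocess.call`, which returns the exit code and otherwise stays quiet if the command fails. A possible alternative is `subprocess.run` with `check=True`, which raises an exception on failure. The sketch below is only that: `run_command` is a hypothetical helper, and the command it runs is the Python interpreter itself, used as a stand-in so the example works without `pdf2txt.py` installed.

```python
import subprocess
import sys

def run_command(args):
    """Run a command, raise if it fails, and return what it printed.

    Hypothetical helper for illustration.
    """
    result = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout

# Stand-in command: ask the current Python interpreter to print a message
output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(output.strip())
```

With `check=True`, a failed command stops the notebook cell with an error instead of silently leaving you without an output file.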

### Python API

Sometimes there will be no choice but to use the full Python API.

First open the file in binary mode.

In [None]:
fp = open('corpora/mtms2008-09-122a.pdf', 'rb')

Then you'll run a long list of commands that, hopefully, you won't need to understand fully.

In [None]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# Parse the open file into a document object
parser = PDFParser(fp)
document = PDFDocument(parser)

# The aggregator device collects the layout objects found on each page
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Process the pages one at a time, saving one layout per page
layouts = []
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layouts.append(device.get_result())

Each `layout` corresponds to one page: `device.get_result()` returns the layout of the page that was just processed.

In [None]:
len(layouts)

In [None]:
layout = layouts[0]
objs = list(layout)  # a layout is iterable, so there's no need for the private _objs

In [None]:
objs[0]

In [None]:
objs[1]

In [None]:
objs[1].get_text()

In [None]:
from pdfminer.layout import LTText
txt_list = []
for obj in objs:
    if isinstance(obj, LTText):
        new_text = obj.get_text()
        txt_list.append(new_text)

In [None]:
txt_list
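
Each entry in `txt_list` is the text of one layout object, usually with line breaks and stray whitespace left in. Here is a minimal sketch of stitching such chunks into one string per page. The `chunks` list is made up to stand in for real pdfminer output, and the clean-up rule (collapse all whitespace) is just one possible choice.

```python
# Made-up chunks standing in for the strings collected in txt_list
chunks = ["A  Title\n", "First line of a paragraph\nthat wraps.\n", "   \n"]

cleaned = []
for chunk in chunks:
    # Collapse runs of whitespace, including line breaks, into single spaces
    text = " ".join(chunk.split())
    if text:  # skip chunks that were only whitespace
        cleaned.append(text)

# One line per layout object
page_text = "\n".join(cleaned)
print(page_text)
```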