# Reading typical file types

## Text files

We've done this many times.

In [None]:
# Use a context manager so the file is closed once it has been read
with open("corpora/genesis.txt") as text_f:
    txt = text_f.read()
txt[:500]

In [None]:
# A newline is a single byte with value 10 in ASCII and UTF-8
list("\n".encode())
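For larger files, iterating over the file object reads one line at a time instead of loading everything with `read()`. A minimal sketch follows; the file `sample.txt` and its contents are made up and written to a temporary directory only so the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created only to keep the example self-contained
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird\n")

# Iterating over a file object yields one line at a time,
# each still ending in its "\n" character
line_lengths = []
with open(sample_path, encoding="utf-8") as f:
    for line in f:
        line_lengths.append(len(line.rstrip("\n")))

print(line_lengths)
```

Each line keeps its trailing newline, which is why the sketch strips it before measuring the length.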

## CSV files

The `csv` module is part of the Python standard library.

CSV stands for "comma-separated values".

In [None]:
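import csv
# newline='' is what the csv docs recommend when opening CSV files
with open('corpora/tekno_flat.csv', newline='') as csvfile:
    tekno_reader = csv.DictReader(csvfile)
    # Each row is a dict keyed by the column names in the header line
    data_list = list(tekno_reader)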

In [None]:
data_list[0]
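To see what `DictReader` produces without depending on the file above, here is a minimal sketch that parses a small in-memory CSV. The column names and values are made up for illustration and are not taken from `tekno_flat.csv`.

```python
import csv
import io

# Made-up CSV text; the first line supplies the dictionary keys
sample = io.StringIO("name,score\nada,3\ngrace,5\n")

rows = list(csv.DictReader(sample))

# Every value is read as a string, so convert numbers explicitly
total = sum(int(row["score"]) for row in rows)
print(rows[0]["name"], total)
```

Note that `csv` does no type conversion: numbers come back as strings until you convert them yourself.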

In [None]:
!pip install openpyxl

## Excel spreadsheets

You can convert these to `csv` files, but then every sheet ends up in a separate file, and you may lose some information, such as formatting and formulas.

An alternative is the `openpyxl` library, which lets you work directly with spreadsheets and is pretty intuitive.

https://openpyxl.readthedocs.io/en/stable/

In [None]:
import openpyxl

In [None]:
from openpyxl import load_workbook

In [None]:
wb = load_workbook('corpora/tekno_fractions_nona.xlsx')

In [None]:
print(wb.sheetnames)

In [None]:
ws = wb["teknoclip7_1"]

In [None]:
ws["A1"]

In [None]:
ws.cell(row=1, column=1)

In [None]:
ws["A2"].value

In [None]:
for c in ws["C"]:
    print(c.value)
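To read a sheet row by row rather than cell by cell, `iter_rows` can return plain values. A minimal sketch, using a workbook built in memory with made-up contents so that it doesn't depend on the spreadsheet above:

```python
from openpyxl import Workbook

# Build a small workbook in memory; the sheet name and values are made up
wb_demo = Workbook()
ws_demo = wb_demo.active
ws_demo.title = "demo"
ws_demo.append(["clip", "fraction"])  # header row
ws_demo.append(["a", 0.25])
ws_demo.append(["b", 0.75])

# values_only=True yields tuples of cell values instead of Cell objects;
# min_row=2 skips the header row
data_rows = list(ws_demo.iter_rows(min_row=2, values_only=True))
print(data_rows)
```

The same call should work on a worksheet obtained from `load_workbook`, assuming the first row of that sheet is a header.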

## PDF files

PDF files are generally pretty unpleasant to deal with.

`pdfminer` is a widely used third-party library geared toward extracting text from PDF files.
It's powerful but not very intuitive.

https://euske.github.io/pdfminer/

The current official version of pdfminer doesn't work with Python 3.
But this "fork" of it does:

https://github.com/pdfminer/pdfminer.six

Here's a random page that's slightly helpful:

https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

In [None]:
!pip install pdfminer.six

### Command line usage

There are two ways to use pdfminer. One is from the command line in a terminal window.
This is easier, but it won't always get you what you want.

``` 
pdf2txt.py -o output.txt -c "ascii" corpora/mtms2008-09-122a.pdf
```

As an aside: you can execute command-line commands from within Python using `subprocess`.

In [None]:
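import subprocess
import os
def extract_text(fpath, outdir):
    # Swap the extension for .txt (splitext copes with extra dots in the name)
    newname = os.path.splitext(os.path.basename(fpath))[0] + ".txt"
    os.makedirs(outdir, exist_ok=True)  # create the output directory if needed
    subprocess.call(["pdf2txt.py", "-o", os.path.join(outdir, newname), "-c", "ascii", fpath])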

In [None]:
fpath = "corpora/mtms2008-09-122a.pdf"
extract_text(fpath, "pdf_extracts")
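`subprocess.call` returns the command's exit code but doesn't raise an error if the command fails. A minimal sketch of `subprocess.run`, which can capture the output and raise on failure; it runs the Python interpreter itself as a stand-in command, so it works without `pdf2txt.py` installed.

```python
import subprocess
import sys

# sys.executable is the running Python interpreter, used here as a stand-in
# for an external tool such as pdf2txt.py
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(result.returncode, result.stdout.strip())
```

With `check=True`, a failed conversion would stop the notebook cell with an exception rather than silently leaving an empty or missing output file.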