# Table of Contents
* [Exploring science textbooks for parsing and annotations](#Exploring-science-textbooks-for-parsing-and-annotations)
	* [Intro](#Intro)
	* [basic parameters](#basic-parameters)
	* [Grouping textbooks by publisher](#Grouping-texbooks-by-publisher)
		* [groupings](#groupings)
	* [Testing pdf miner on single page](#Testing-pdf-miner-on-single-page)
		* [drawing bounding boxes over image](#drawing-bounding-boxes-over-image)
	* [Drawing sample pages for book categories](#Drawing-sample-pages-for-book-categories)
		* [a generic page from each category-](#a-generic-page-from-each-category-)
		* [a generic page from a single category](#a-generic-page-from-a-single-category)
	* [Draft Schema](#Draft-Schema)
	* [Proposal](#Proposal)
* [END](#END)
	* &nbsp;
		* [experiments with acrobat text recognition- abandoned for now](#experiments-with-acrobat-text-recognition--abandoned-for-now)


In [1]:
%%capture
import numpy as np
import pandas as pd
import scipy.stats as st
import itertools
import math
from collections import Counter, defaultdict
%load_ext autoreload
%autoreload 2

In [2]:
from wand.image import Image as WImage
from IPython.display import display
import PIL.Image as Image
import cv2

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.converter import TextConverter

In [3]:
from pdf_processing import draw_pdf_with_boxes
from pdf_processing import make_page_layouts

# Exploring science textbooks for parsing and annotations

## Intro

This notebook documents my exploratory data analysis (EDA) of the K-12 science textbook dataset. In the 

## basic parameters

In [4]:
ls pdfs/ | wc -l 

In [5]:
book_list = !ls pdfs/
book_list

There are 28 books in total, with three distinct series.

In [38]:
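total_page_count = 0
for textbook in book_list:
    test_book_path = './pdfs/' + textbook
    # PDFs are binary files, so open them in 'rb' mode for the parser.
    with open(test_book_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        # create_pages yields one PDFPage per page; count them.
        total_page_count += sum(1 for _ in PDFPage.create_pages(document))
total_page_count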

There are 4686 total pages.

In [14]:
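extractable = 0
for textbook in book_list:
    test_book_path = './pdfs/' + textbook
    # PDFs are binary files, so open them in 'rb' mode for the parser.
    with open(test_book_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        # is_extractable is a bool, so the sum counts extractable documents.
        extractable += document.is_extractable
extractable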

all of the documents are extractable!

In [23]:
!find ./pdfs -type f -print0 | xargs -0 -n 1 pdffonts

fonts and encodings are consistent

## Grouping texbooks by publisher

In [6]:
book_breakdowns = defaultdict(list)

In [7]:
spectrum_science =  !ls pdfs/ | grep 'Spectrum Science'
book_breakdowns['spectrum_sci'] = spectrum_science
print('Spectrum Science: ', len(spectrum_science), ' total')
# print('\n'.join(spectrum_science))

In [8]:
daily_science =  !ls pdfs/ | grep 'Daily Sc' 
book_breakdowns['daily_sci'] = daily_science
print('Daily Science: ', len(daily_science), ' total')
# print('\n'.join(daily_science))

In [9]:
read_understand =  !ls pdfs/ | grep 'Read and Understand Science' 
book_breakdowns['read_und_sci'] = read_understand
print('Read and Understand Science: ', len(read_understand), ' total')
# print('\n'.join(read_understand))

In [10]:
workbooks =  !ls pdfs/ | grep -i  'workbook' 
book_breakdowns['workbooks'] = workbooks
print('workbooks: ', len(workbooks), ' total')
# print('\n'.join(workbooks))

In [11]:
for book in book_list:
    # Anything not already claimed by a group goes into 'misc'.
    if not any(book in series for series in book_breakdowns.values()):
        book_breakdowns['misc'].append(book)

In [12]:
sum([len(books) for books in book_breakdowns.values()])

all books accounted for in groupings
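
The same grouping could also be done without shelling out to `grep`. This is a minimal sketch, not the code used above: `SERIES_PATTERNS` and `group_books` are hypothetical names, and the sample filenames are made up for illustration. It mirrors the cells above in that a file can land in more than one group, and only files matching nothing go to `misc`.

```python
from collections import defaultdict

# Hypothetical mapping of group name -> (substring, case_sensitive),
# mirroring the grep patterns used in the cells above.
SERIES_PATTERNS = {
    'spectrum_sci': ('Spectrum Science', True),
    'daily_sci': ('Daily Sc', True),
    'read_und_sci': ('Read and Understand Science', True),
    'workbooks': ('workbook', False),  # grep -i in the original cell
}


def group_books(filenames, patterns=SERIES_PATTERNS):
    """Assign each filename to every group whose pattern it contains.

    Files matching no pattern are collected under 'misc'.
    """
    groups = defaultdict(list)
    for name in filenames:
        matched = False
        for group, (needle, case_sensitive) in patterns.items():
            haystack = name if case_sensitive else name.lower()
            target = needle if case_sensitive else needle.lower()
            if target in haystack:
                groups[group].append(name)
                matched = True
        if not matched:
            groups['misc'].append(name)
    return groups


# Made-up filenames, for illustration only.
sample = [
    'Spectrum Science Grade 3.pdf',
    'Daily Science Grade 1.pdf',
    'Physics Workbook.pdf',
    'Earth and Space.pdf',
]
grouped = group_books(sample)
```

Keeping the patterns in one dictionary makes it easy to add a series later without adding another cell.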

### groupings

In [13]:
for group, books in book_breakdowns.items():
    print(group)
    print('\n'.join(books + [' ']))

## Testing pdf miner on single page

This section has been superseded by the module I wrote for parsing.

In [16]:
# pages = []
# laparams = LAParams()
# device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# interpreter = PDFPageInterpreter(rsrcmgr, device)
# for page in PDFPage.create_pages(document):
#     interpreter.process_page(page)
#     # receive the LTPage object for the page.
#     pages.append(device.get_result())

### drawing bounding boxes over image

In [18]:
# page_file ='test_page.pdf'
# page_layout = make_page_layout(page_file)
# page_png_stream, y_height = make_png_stream(page_file)
# page_img = make_open_cv_img(page_png_stream)

# for box in page_layout._objs:
#     lr, ul = get_bbox_tuple(box, y_height)
#     cv2.rectangle(page_img, ul, lr, color=random_color(), thickness=2)
#     try:
#         print(box.get_text())
#     except AttributeError:
#         pass
# display(Image.fromarray(page_img, 'RGB'))
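
For reference, the main wrinkle when drawing pdfminer boxes over a rendered page is the coordinate system: pdfminer reports a `bbox` as `(x0, y0, x1, y1)` with the origin at the bottom-left of the page, while image libraries such as OpenCV put the origin at the top-left. Below is a minimal sketch of that flip. `pdf_bbox_to_image_corners` and its `scale` parameter are hypothetical; this is not the `get_bbox_tuple` helper referenced in the commented-out cell.

```python
def pdf_bbox_to_image_corners(bbox, page_height, scale=1.0):
    """Convert a PDF-space bbox to (upper_left, lower_right) pixel corners.

    bbox is (x0, y0, x1, y1) in PDF points, origin at the bottom-left.
    page_height is the page height in the same units.
    scale is pixels per point (assumption: 1.0 corresponds to an image
    rendered at 72 dpi, since a PDF point is 1/72 inch).
    """
    x0, y0, x1, y1 = bbox
    # The top edge of the box (y1) becomes the smaller image y value.
    upper_left = (int(round(x0 * scale)),
                  int(round((page_height - y1) * scale)))
    lower_right = (int(round(x1 * scale)),
                   int(round((page_height - y0) * scale)))
    return upper_left, lower_right


# A box sitting 100 points above the bottom of a 792-point-tall page.
ul, lr = pdf_bbox_to_image_corners((72, 100, 300, 150), page_height=792)
```

The two corners can then be passed to `cv2.rectangle` in the same way as in the loop above.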

In [19]:
#  laparams = LAParams()
#     page_layouts = []
#     with open(pdf_file, 'r') as fp:
#         parser = PDFParser(fp)
#         document = PDFDocument(parser)
#         rsrcmgr = PDFResourceManager()
#         device = PDFPageAggregator(rsrcmgr, laparams=laparams)
#         interpreter = PDFPageInterpreter(rsrcmgr, device)
#         for page in PDFPage.create_pages(document):
#             interpreter.process_page(page)
#             layout = device.get_result()
#             page_layouts.append(layout)
#     return page_layouts

## Drawing sample pages for book categories

### a generic page from each category-

In [20]:
for group, books in book_breakdowns.items():
    print(group)
    print('\n'.join(books[2:3] + [' ']))
    book_file ='./pdfs/' + books[2]
    draw_pdf_with_boxes(book_file, [51,51])

Decent splitting params for pdfminer's layout analysis:

* `line_overlap=0.5`
* `char_margin=2.0`
* `line_margin=0.5`
* `word_margin=0.2`
* `boxes_flow=0.5`
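
To keep these values in one place, they could be held in a dictionary and compared against the library defaults. This is only a sketch: `SPLITTING_PARAMS` and `ASSUMED_DEFAULTS` are hypothetical names, and the default values are my understanding of pdfminer's `LAParams` defaults, which should be checked against the installed version.

```python
# Values from the list above.
SPLITTING_PARAMS = {
    'line_overlap': 0.5,
    'char_margin': 2.0,
    'line_margin': 0.5,
    'word_margin': 0.2,
    'boxes_flow': 0.5,
}

# Assumed pdfminer LAParams defaults; verify against the installed version.
ASSUMED_DEFAULTS = {
    'line_overlap': 0.5,
    'char_margin': 2.0,
    'line_margin': 0.5,
    'word_margin': 0.1,
    'boxes_flow': 0.5,
}

# Keep only the parameters that differ from the assumed defaults.
changed = {
    name: value
    for name, value in SPLITTING_PARAMS.items()
    if ASSUMED_DEFAULTS[name] != value
}
print(changed)

# With pdfminer available, the dictionary can be unpacked directly:
# laparams = LAParams(**SPLITTING_PARAMS)
```

If the assumed defaults are right, the only value that actually departs from them is `word_margin`.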

### a generic page from a single category

In [21]:
for idx in range(len(book_breakdowns['daily_sci'])):
    book_file ='./pdfs/' + book_breakdowns['daily_sci'][idx]
    draw_pdf_with_boxes(book_file, [149,149])

## Draft Schema

header/topic

    -The page or section topic. This should be straightforward for turkers to recognize.

definition

    -The last page in the example above (How are living things different from nonliving things?) has a good example: the vocab sidebar.

discussion

    - I envision this being the paragraph-level discussion under a specific topic. This is by far the label with the most room for interpretation, and I'm not sure what the right level of granularity is. The paragraph is a nice, easily identifiable block, but I could also see collapsing all of the text under a topic (not falling under another label) into a single discussion blob.

question 

    - Text from a multiple choice, short answer, or fill-in-the-blank question. These three categories cover most questions, but there are some less frequent types, e.g. word find, word scramble, drawing connections. 

diagram

    - Diagram + figure labels and captions. 

other
    
    - Anything not covered by the above. Most of this should be structural or navigation related, e.g. page numbers, week/day headings.
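
One way to pin the draft labels down in code is an enumeration, so that annotation results can be validated against a fixed set. This is a minimal sketch; `BoxLabel` and `parse_label` are hypothetical names, and the string values simply reuse the label names above.

```python
from enum import Enum


class BoxLabel(Enum):
    """Draft annotation labels; values follow the schema above."""
    HEADER_TOPIC = 'header/topic'
    DEFINITION = 'definition'
    DISCUSSION = 'discussion'
    QUESTION = 'question'
    DIAGRAM = 'diagram'
    OTHER = 'other'


def parse_label(raw):
    """Normalize a raw label string to a BoxLabel.

    Raises ValueError for anything outside the schema, so typos are
    caught instead of being silently filed under 'other'.
    """
    return BoxLabel(raw.strip().lower())


label = parse_label(' Question ')
```

Raising on unknown strings is a design choice: `other` is meant for boxes that fit no label, not for malformed answers.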

## Proposal

The results above demonstrate that the textbooks are fairly consistent in their organization and page layouts. The three series of books from the same publisher should be a good place to start, as they're very consistent and cover over half of the textbooks collected.

Initially, I envision two rounds of Mechanical Turk annotations.
<br><br/>

    0. Pre-process the pdf-extraction in any way we can to simplify the turk tasks and reduce noise. This could include removing footers, table of contents (assuming we don't want those), and could even include programmatically joining some boxes.   

    1. Present a single page- label every box identified in the pdf extraction.


    2. Present boxes of a single category on a page- select boxes that should be grouped together or split. I'm not sure how difficult this would be to convey. The idea would be to join text separated or joined erroneously. In the examples above, this happens at paragraph indentations and many multi-line or fill-in-the-blank questions. 
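
As a rough illustration of the pre-processing in step 0, the sketch below drops boxes that sit in a footer strip and joins two boxes into one. Everything here is an assumption for illustration: `is_footer`, `merge_bboxes`, the 40-point threshold, and the sample coordinates are made up, and boxes are taken to be `(x0, y0, x1, y1)` tuples in PDF points with a bottom-left origin.

```python
def is_footer(bbox, footer_height=40.0):
    """True if the box lies entirely within the bottom strip of the page.

    footer_height is a made-up threshold that would need tuning per series.
    """
    _, _, _, y1 = bbox
    return y1 <= footer_height


def merge_bboxes(a, b):
    """Return the smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


# Made-up boxes for a single page.
boxes = [
    (72, 20, 540, 32),    # page-number footer
    (72, 600, 540, 640),  # first part of a paragraph
    (90, 560, 540, 598),  # continuation split off by the extractor
]
body = [b for b in boxes if not is_footer(b)]
joined = merge_bboxes(body[0], body[1])
```

Deciding which boxes to join is the hard part and is left to the second turk task; the merge itself is just a min/max over the corners.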
    
If this sounds reasonable, I'll start prototyping the interface for the first turk task and generate some example results so you know what to expect. I envision the output being JSON along these lines:

In [32]:
{
    "category_label": {
        "box_id": {
            "id": "D1",
            "text":"the contents of the text box",
            "source": ["textbook", "page_n"],
            "bounding_box": [
                [
                    400,
                    100
                ],
                [
                    500,
                    100
                ]
            ]
        }
    }
}
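
A record of this shape could be assembled along the following lines. This is a sketch under assumptions: `make_box_record` and `add_box` are hypothetical helpers, the page number and coordinates are made up, and I am assuming the two bounding-box points are opposite corners of the box.

```python
import json


def make_box_record(box_id, text, textbook, page_n, bounding_box):
    """Build one box entry shaped like the draft JSON above."""
    return {
        'id': box_id,
        'text': text,
        'source': [textbook, page_n],
        # Store corners as lists so the record round-trips through JSON.
        'bounding_box': [list(corner) for corner in bounding_box],
    }


def add_box(annotations, category_label, record):
    """File a record under its category, keyed by box id."""
    annotations.setdefault(category_label, {})[record['id']] = record
    return annotations


annotations = {}
record = make_box_record('D1', 'the contents of the text box',
                         'textbook', 51, [(400, 100), (500, 200)])
add_box(annotations, 'definition', record)
print(json.dumps(annotations, indent=4))
```

Keying boxes by id under each category keeps lookups cheap, at the cost of repeating the id inside the record.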

# END

### experiments with parser settings and acrobat text recognition- abandoned for now

In [22]:
# line_overlap=0.5,
#                  char_margin=2.0,
#                  line_margin=0.5,
#                  word_margin=0.1,
#                  boxes_flow=0.5,

# book_file = 'exact_1.pdf'
# draw_pdf_with_boxes(book_file, 0)

# book_file = 'single_page_test.pdf'
# draw_pdf_with_boxes(book_file, 0, 
#                 line_overlap=0.5,
#                  char_margin=2.,
#                  line_margin=0.9,
#                  word_margin=0.1, # no effect
#                  boxes_flow=1.0 )# no effect

# # tpl1 = make_page_layouts('single_page_test.pdf', 0)
# tpl1 = make_page_layouts(book_file, 0, 
#              line_overlap=0.5,
#                  char_margin=2.5,
#                  line_margin=0.9,
#                  word_margin=0.1, # no effect
#                  boxes_flow=0.5 )# no effect

# tpl1_boxes = tpl1[0]._objs
# tpl1_boxes