# Table of Contents
* [Setting paths](#Setting-paths)
* [**Pipeline for textbook dataset creation**](#**Pipeline-for-textbook-dataset-creation**)
	* [**Extracting page images**](#**Extracting-page-images**)
		* [pdfminer](#pdfminer)
	* [**OCR**  (set schema here)](#**OCR**--%28set-schema-here%29)
		* [First pass, Eric's service- MS Oxford and Google cloud vision](#First-pass,-Eric's-service--MS-Oxford-and-Google-cloud-vision)
		* [Add vertical dim to annotations](#Add-vertical-dim-to-annotations)
	* [possible second pass to resolve OCR issues](#possible-second-pass-to-resolve-OCR-issues)
	* [**First box merging and annotation template (set schema here)**](#**First-box-merging-and-annotation-template-%28set-schema-here%29**)
		* [Overmerge to make labeling easier/ more reliable](#Overmerge-to-make-labeling-easier/-more-reliable)
	* [**First round of MTurk annotation**](#**First-round-of-MTurk-annotation**)
		* [This round is for highest-level annotation](#This-round-is-for-highest-level-annotation)
		* [Sanity checks, first round cleaning](#Sanity-checks,-first-round-cleaning)
	* [**Generate question annotations**](#**Generate-question-annotations**)
		* [**Unmerge boxes** to separate question components](#**Unmerge-boxes**-to-separate-question-components)
		* [make separate question annotation boxes based on first round labels](#make-separate-question-annotation-boxes-based-on-first-round-labels)
		* [weakly remerge question boxes](#weakly-remerge-question-boxes)
	* [**Second round of MTurk annotation**](#**Second-round-of-MTurk-annotation**)
	* [**Hierarchy extraction**](#**Hierarchy-extraction**)
		* [Testing](#Testing)


In [1]:
%%capture
from __future__ import division
import numpy as np
import pandas as pd
import scipy.stats as st
import itertools
import math
%load_ext autoreload
%autoreload 2

In [3]:
import pickle

In [4]:
# Pickle files are binary, so open in 'rb' mode.
with open('breakdowns.pkl', 'rb') as f:
    book_breakdowns = pickle.load(f)

with open('pdfs/page_ranges.csv') as f:
    ranges = f.readlines()

# Setting paths

In [144]:
turk_data_base_path = './ai2-vision-turk-data/textbook-annotation-test/'
full_scale_img_dir  = 'page-images/'
scaled_down_img_dir = 'smaller-page-images/'

raw_ocr_res =  'raw-ocr-results/'
raw_annotations_w_vert_scale = 'raw-ocr-ws/'

merged_anno_dest_dir = 'first-merge/'

# overmerged_dir = 'corrected_raw_hit_results/'
overmerged_dir = merged_anno_dest_dir
unmerged_dir = 'raw-ocr-results/'
lessmerged_dir = 'raw-unmerged/'
remerged_dir = 'remerged-first-round/'
first_round_annotation_results_dir = 'labeled-annotations/'
first_round_question_annotation_results_dir = 'labeled-questions/'


# question_annotation_dir = 'annotations-w-questions/'
question_annotation_dir = 'question-specific-annotations/'

rectified_ocr_dir = 'apply-new-ocr'

new_ocr_trans_dir = 'transferred-labels-first-round/'
unmerged_ocr_trans_dir = 'transferred-labels-first-round-unmerged/'

new_q_ocr_dir ='new-question-only-temp/'
transferred_q_labels_dir = 'transferred-question-labels/'

# **Pipeline for textbook dataset creation**

## **Extracting page images**

### pdfminer

In [5]:
import pdfextraction.ocr_pipeline as ocrp

In [6]:
range_lookup = {line.split(' ')[0]:[int(num) for num in line.strip().split(' ')[1:]] for line in ranges}
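
The comprehension above implies that each line of `pdfs/page_ranges.csv` holds a book name followed by space-separated integers. A minimal sketch of the same parsing on made-up input (the book names and page numbers here are hypothetical, and the reading of the integers as first and last page is an assumption):

```python
# Hypothetical lines in the assumed "<book> <first_page> <last_page>" layout.
sample_ranges = [
    'Example_Book_A 5 120\n',
    'Example_Book_B 1 64\n',
]

sample_lookup = {}
for line in sample_ranges:
    fields = line.split()  # splits on any whitespace and drops the newline
    sample_lookup[fields[0]] = [int(num) for num in fields[1:]]

print(sample_lookup['Example_Book_A'])  # [5, 120]
```

Parsing of this kind only works while book names contain no spaces, which the underscore-joined names in this notebook appear to guarantee.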

This extracts and writes page images for a group of textbooks.

In [None]:
for book in book_breakdowns['daily_sci']:
    ocrp.process_book(book, range_lookup[book], line_overlap=0.5,
                                            word_margin=0.1, char_margin=2.0, line_margin=0.5, boxes_flow=0.5)

## **OCR**  (set schema here)

In [26]:
import pdfextraction.amt_boto_modules as amt_util  # needed when running cells top to bottom

book_breakdowns, page_ranges = amt_util.load_book_info()

In [7]:
import pdfextraction.ocr_pipeline as ocrp

### First pass, Eric's service- MS Oxford and Google cloud vision

In [8]:
# for book in book_breakdowns['spectrum_sci']:
#     ocrp.perform_ocr(book, 'annotations', range_lookup[book])

# for book in book_breakdowns['daily_sci']:
#     ocrp.perform_ocr(book, 'annotations', range_lookup[book])

# for book in book_breakdowns['read_und_sci']:
#     ocrp.perform_ocr(book, 'annotations', range_lookup[book])

### Add vertical dim to annotations

In [86]:
ocrp.add_anno_img_dim(turk_data_base_path + scaled_down_img_dir, raw_ocr_res, turk_data_base_path + raw_annotations_w_vert_scale)

### possible second pass to resolve OCR issues

## **First box merging and annotation template (set schema here)**

In [31]:
import pdfextraction.merge as merge_tool
import pdfextraction.amt_boto_modules as amt_util 

### Overmerge to make labeling easier/ more reliable

In [66]:
merge_params = {
    'near_x': 3.0,
    'near_y': 1.0,
    'overlap_x': 0.30,
    'overlap_y': 0.6,
    'start_x': 1.0,
    'short_length': 4.0,
    'char_size_ratio': 1.0,
    'starting_near_near_y': 1.5,
    'near_overlap_x': 0.35,
    'overlap_fract': 0.1
}

books_to_merge = book_breakdowns['daily_sci'] # + book_breakdowns['spectrum_sci'] + book_breakdowns['read_und_sci'] + book_breakdowns['workbooks']
# books_to_merge = book_breakdowns['spectrum_sci']

In [87]:
for textbook in books_to_merge:
    merge_tool.merge_single_book(textbook, page_ranges[textbook], turk_data_base_path + merged_anno_dest_dir, turk_data_base_path + raw_annotations_w_vert_scale, merge_params)
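
The internals of `merge_single_book` are not shown in this notebook. As a rough illustration of what an overlap threshold such as `overlap_x` could mean, the sketch below merges two boxes when they share enough horizontal extent. The `(x_min, y_min, x_max, y_max)` box layout, both function names, and the use of the narrower box as the denominator are assumptions, not the `pdfextraction.merge` implementation.

```python
def horizontal_overlap_fraction(box_a, box_b):
    """Share of the narrower box's width that both boxes cover horizontally.

    Boxes are (x_min, y_min, x_max, y_max) tuples; this layout is assumed.
    """
    shared = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    narrower = min(box_a[2] - box_a[0], box_b[2] - box_b[0])
    if shared <= 0 or narrower <= 0:
        return 0.0
    return shared / float(narrower)


def merge_boxes(box_a, box_b):
    """Smallest box that encloses both inputs."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))


# Two stacked text lines that start at the same left edge.
line_1 = (10, 100, 210, 120)
line_2 = (10, 124, 150, 144)

merged = None
if horizontal_overlap_fraction(line_1, line_2) >= 0.30:  # mirrors 'overlap_x'
    merged = merge_boxes(line_1, line_2)
print(merged)  # (10, 100, 210, 144)
```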

## **First round of MTurk annotation**

**This round is for highest-level annotation**

High-level labels include **discussion, definition, question, figure label, header/topic, and answer key**.

The actual work of submitting and processing results can be found in **mturk_textbook_annotation_task.ipynb**.

### Sanity checks, first round cleaning

## **Generate question annotations**

In [44]:
import pdfextraction.unmerge as unmerge_tool
import pdfextraction.question_annotation as question_anno_util

### **Unmerge boxes** to separate question components

In [135]:
# books_to_unmerge = book_breakdowns['daily_sci'] + book_breakdowns['spectrum_sci'] + book_breakdowns['read_und_sci'] + book_breakdowns['workbooks']
# books_to_unmerge = book_breakdowns['spectrum_sci']
books_to_unmerge = book_breakdowns['daily_sci']

for textbook in books_to_unmerge:
#     unmerge_tool.unmerge_single_textbook(textbook, page_ranges[textbook], turk_data_base_path, overmerged_dir, unmerged_dir, lessmerged_dir)
    unmerge_tool.unmerge_single_textbook(textbook, page_ranges[textbook], turk_data_base_path, new_ocr_trans_dir, raw_annotations_w_vert_scale, unmerged_ocr_trans_dir)

[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_3_(Daily_Practice_Books)_Evan_Moore_153.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_4_Evan_Moor_97.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_4_Evan_Moor_109.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_5_Evan_Moor_66.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_5_Evan_Moor_75.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_5_Evan_Moor_152.json'
[Errno 2] No such file or
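
The errors above name per-page files of the form `<book>_<page>.json`. A small pre-flight check can list the missing pages before a long run. This is a sketch: the helper name is hypothetical, and it assumes the page numbers are available as an iterable, which may not match the structure of `page_ranges[textbook]`.

```python
import os
import shutil
import tempfile


def missing_annotation_files(book, pages, anno_dir):
    """Return the expected per-page JSON paths that do not exist in anno_dir."""
    expected = [os.path.join(anno_dir, '{}_{}.json'.format(book, page))
                for page in pages]
    return [path for path in expected if not os.path.isfile(path)]


# Demo on a throwaway directory with a hypothetical book name.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, 'Example_Book_1.json'), 'w').close()

missing = missing_annotation_files('Example_Book', [1, 2], demo_dir)
print([os.path.basename(path) for path in missing])  # ['Example_Book_2.json']

shutil.rmtree(demo_dir)
```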

### weakly remerge question boxes

This should be performed with only the horizontal and overlap merge passes, by setting **merge_pass = 2**.

In [56]:
merge_pass = 2
remerge_params = {
    'near_x': 3.0,
    'near_y': 1.0,
    'overlap_x': 0.30,
    'overlap_y': 0.6,
    'start_x': 1.0,
    'short_length': 4.0,
    'char_size_ratio': 1.0,
    'starting_near_near_y': 1.5,
    'near_overlap_x': 0.35,
    'overlap_fract': 0.1
}

# books_to_merge = book_breakdowns['daily_sci'] + book_breakdowns['spectrum_sci'] + book_breakdowns['read_und_sci'] + book_breakdowns['workbooks']
books_to_merge = book_breakdowns['spectrum_sci']

In [57]:
for textbook in books_to_merge:
    merge_tool.merge_single_book(textbook, page_ranges[textbook], turk_data_base_path + remerged_dir, turk_data_base_path + lessmerged_dir, remerge_params, merge_pass)

### Make **question-specific annotations** based on first round labels

In [151]:
# books_to_amend = book_breakdowns['daily_sci'] + book_breakdowns['spectrum_sci'] + book_breakdowns['read_und_sci'] + book_breakdowns['workbooks']
books_to_amend = book_breakdowns['daily_sci']
for textbook in books_to_amend:
    question_anno_util.amend_single_book(textbook, page_ranges[textbook], turk_data_base_path + new_q_ocr_dir, turk_data_base_path + new_ocr_trans_dir)

[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_3_(Daily_Practice_Books)_Evan_Moore_153.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_4_Evan_Moor_97.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_4_Evan_Moor_109.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_5_Evan_Moor_66.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_5_Evan_Moor_75.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/transferred-labels-first-round/Daily_Science_Grade_5_Evan_Moor_152.json'
[Errno 2] No such file or

## Incorporate **new OCR** results

This step is needed only because of issues uncovered with the first OCR round. The OCR was improved, and the labels generated for the previous results are transferred to the new OCR detections. This won't be needed when starting ab initio.
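
One plausible way to carry labels from old detections to new ones is to match boxes by overlap. The sketch below uses intersection over union (IoU) with a threshold. It is an illustration under assumptions, not the `pdfextraction.transmit_anno` implementation: the box layout, the function names, and the matching rule are all hypothetical, and the notebook does not say what the `0.5` passed to `transmit_anno_single_textbook` controls.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)


def transfer_labels(old_boxes, new_boxes, threshold=0.5):
    """Give each new box the label of its best-overlapping old box, if any.

    old_boxes: list of (box, label) pairs; new_boxes: list of boxes.
    Returns a list of (box, label_or_None) pairs.
    """
    transferred = []
    for new_box in new_boxes:
        best_label, best_score = None, threshold
        for old_box, label in old_boxes:
            score = iou(new_box, old_box)
            if score >= best_score:
                best_label, best_score = label, score
        transferred.append((new_box, best_label))
    return transferred


old = [((0, 0, 100, 20), 'question'), ((0, 40, 100, 60), 'discussion')]
new = [(2, 1, 98, 21), (0, 200, 50, 220)]
result = transfer_labels(old, new)
print([label for _, label in result])  # ['question', None]
```

New boxes that match nothing keep a `None` label, so they can be reviewed separately instead of silently inheriting a wrong category.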

In [101]:
import pdfextraction.transmit_anno as ocr_trans

First, transfer the broad category labels.

In [150]:
books_to_transfer_anno = book_breakdowns['daily_sci']
# books_to_transfer_anno = book_breakdowns['spectrum_sci']
for textbook in books_to_transfer_anno:
    ocr_trans.transmit_anno_single_textbook(textbook, page_ranges[textbook], 0.5, turk_data_base_path, 
                                            first_round_annotation_results_dir, raw_annotations_w_vert_scale, new_ocr_trans_dir )

[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-annotations/Daily_Science_Grade_3_(Daily_Practice_Books)_Evan_Moore_153.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-annotations/Daily_Science_Grade_4_Evan_Moor_97.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-annotations/Daily_Science_Grade_4_Evan_Moor_109.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-annotations/Daily_Science_Grade_5_Evan_Moor_66.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-annotations/Daily_Science_Grade_5_Evan_Moor_75.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-annotations/Daily_Science_Grade_5_Evan_Moor_152.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/label

Then repeat the transfer for the question-specific labels.

In [152]:
books_to_transfer_anno = book_breakdowns['daily_sci']
# books_to_transfer_anno = book_breakdowns['spectrum_sci']
for textbook in books_to_transfer_anno:
    ocr_trans.transmit_anno_single_textbook(textbook, page_ranges[textbook], 0.5, turk_data_base_path, 
                                            first_round_question_annotation_results_dir, new_q_ocr_dir, transferred_q_labels_dir, True)

[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_Moor_8.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_Moor_9.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_Moor_15.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_Moor_21.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_Moor_27.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_Moor_37.json'
[Errno 2] No such file or directory: './ai2-vision-turk-data/textbook-annotation-test/labeled-questions/Daily_Science_Grade_1_Evan_M

KeyError: 'question'

## **Second round of MTurk annotation**

The purpose of this round is to get finer-grained labels for the questions in the dataset.

Questions can be labeled as **Multiple Choice, True/False, Fill-in-the-Blank, or Short Answer**

The actual work of submitting and processing results can be found in **question_annotation_turk.ipynb**

## **Hierarchy extraction**

Programmatically extract the hierarchy and piece labels from questions.

### Testing 

# Send to viz tool

In [79]:
pages_to_review = ['Daily_Science_Grade_2_Evan_Moor_61.jpeg']

In [99]:
sampling_rate = 1
sample_size = int(len(pages_to_review) * sampling_rate)
sampled_pages_to_review = list(np.random.choice(pages_to_review, size=sample_size, replace=False))
print('sampling ' + str(sample_size) + ' pages out of ' + str(len(pages_to_review)))
to_review = ['start_seq'] + sampled_pages_to_review

sampling 1 pages out of 1


In [100]:
# Only one anno_dir should be active; the alternatives are kept commented out.
# anno_dir = 'simpler-test-questions/'
# anno_dir = 'first-merge/'
# anno_dir = 'transferred-labels-first-round/'
# anno_dir = 'labeled-annotations/'
anno_dir = 'labeled-questions/'
# anno_dir = 'raw-ocr-ws/'

amt_util.review_results(to_review, anno_dir)
print('posting to review tool, navigate to http://localhost:8080/ to see the sampled consensus results')

posting to review tool, navigate to http://localhost:8080/ to see the sampled consensus results
